Distributed Operating Systems

ANDREW S. TANENBAUM and ROBBERT VAN RENESSE
Department of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Distributed operating systems have many aspects in common with centralized ones, but they also differ in certain ways. This paper is intended as an introduction to distributed operating systems, and especially to current university research about them. After a discussion of what constitutes a distributed operating system and how it is distinguished from a computer network, various key design issues are discussed. Then several examples of current research projects are examined in some detail, namely, the Cambridge Distributed Computing System, Amoeba, V, and Eden.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems-network operating systems; D.4.3 [Operating Systems]: File Systems Management-distributed file systems; D.4.5 [Operating Systems]: Reliability-fault tolerance; D.4.6 [Operating Systems]: Security and Protection-access controls; D.4.7 [Operating Systems]: Organization and Design-distributed systems

General Terms: Algorithms, Design, Experimentation, Reliability, Security

Additional Key Words and Phrases: File server

CONTENTS

INTRODUCTION
 Goals and Problems
 System Models
1. NETWORK OPERATING SYSTEMS
 1.1 File System
 1.2 Protection
 1.3 Execution Location
 1.4 An Example: The Sun Network File System
2. DESIGN ISSUES
 2.1 Communication Primitives
 2.2 Naming and Protection
 2.3 Resource Management
 2.4 Fault Tolerance
 2.5 Services
3. EXAMPLES OF DISTRIBUTED OPERATING SYSTEMS
 3.1 The Cambridge Distributed Computing System
 3.2 Amoeba
 3.3 The V Kernel
 3.4 The Eden Project
 3.5 Comparison of the Cambridge, Amoeba, V, and Eden Systems
4. SUMMARY
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

Everyone agrees that distributed systems are going to be very important in the future. Unfortunately, not everyone agrees on what they mean by the term "distributed system." In this paper we present a viewpoint widely held within academia about what is and is not a distributed system, we discuss numerous interesting design issues concerning them, and finally we conclude with a fairly close look at some experimental distributed systems that are the subject of ongoing research at universities.

To begin with, we use the term "distributed system" to mean a distributed operating system as opposed to a database system or some distributed applications system, such as a banking system. An operating system is a program that controls the resources of a computer and provides its users with an interface or virtual machine that is more convenient to use than the bare machine. Examples of well-known centralized (i.e., not distributed) operating systems are CP/M,¹ MS-DOS,² and UNIX.³

A distributed operating system is one that looks to its users like an ordinary centralized operating system but runs on multiple, independent central processing units (CPUs). The key concept here is transparency. In other words, the use of multiple processors should be invisible (transparent) to the user. Another way of expressing the same idea is to say that the user views the system as a "virtual uniprocessor," not as a collection of distinct machines. This is easier said than done. Many multimachine systems that do not fulfill this requirement have been built.

¹ CP/M is a trademark of Digital Research, Inc.
² MS-DOS is a trademark of Microsoft.
³ UNIX is a trademark of AT&T Bell Laboratories.


For example, the ARPANET contains a substantial number of computers, but by this definition it is not a distributed system. Neither is a local network consisting of personal computers with minicomputers and explicit commands to log in here or copy a file from there. In both cases we have a computer network but not a distributed operating system. Thus it is the software, not the hardware, that determines whether a system is distributed or not. As a rule of thumb, if you can tell which computer you are using, you are not using a distributed system. The users of a true distributed system should not know (or care) on which machine (or machines) their programs are running, where their files are stored, and so on. It should be clear by now that very few distributed systems are currently used in a production environment. However, several promising research projects are in progress.

To make the contrast with distributed operating systems stronger, let us briefly look at another kind of system, which we call a "network operating system." A typical configuration for a network operating system would be a collection of personal computers along with a common printer server and file server for archival storage, all tied together by a local network. Generally speaking, such a system will have most of the following characteristics that distinguish it from a distributed system:

• Each computer has its own private operating system, instead of running part of a global, systemwide operating system.
• Each user normally works on his or her own machine; using a different machine invariably requires some kind of "remote login," instead of having the operating system dynamically allocate processes to CPUs.
• Users are typically aware of where each of their files are kept and must move files between machines with explicit "file transfer" commands, instead of having file placement managed by the operating system.
• The system has little or no fault tolerance; if 1 percent of the personal computers crash, 1 percent of the users are out of business, instead of everyone simply being able to continue normal work, albeit with 1 percent worse performance.

Goals and Problems

The driving force behind the current interest in distributed systems is the enormous rate of technological change in microprocessor technology. Microprocessors have become very powerful and cheap, compared with mainframes and minicomputers, so it has become attractive to think about designing large systems composed of many small processors. These distributed systems clearly have a price/performance advantage over more traditional systems. Another advantage often cited is the relative simplicity of the software (each processor has a dedicated function), although this advantage is more often listed by people who have never tried to write a distributed operating system than by those who have.

Incremental growth is another plus; if you need 10 percent more computing power, you just add 10 percent more processors. System architecture is crucial to this type of system growth, however, since it is hard to give each user of a personal computer another 10 percent of a personal computer. Reliability and availability can also be a big advantage; a few parts of the system can be down without disturbing people using the other parts. On the minus side, unless one is very careful, it is easy for the communication protocol overhead to become a major source of inefficiency. More than one system has been built that requires the full computing power of its machines just to run the protocols, leaving nothing over to do the work. The occasional lack of simplicity cited above is a real problem, although in all fairness, this problem comes from inflated goals: With a centralized system no one expects the computer to function almost normally when half the memory is sick. With a distributed system, a high degree of fault tolerance is often, at least, an implicit goal.

A more fundamental problem in distributed systems is the lack of global state information. It is generally a bad idea to even try to collect complete information about any aspect of the system in one table. Lack of up-to-date information makes many things much harder. It is hard to schedule the processors optimally if you are not sure how many are up at the moment. Many people, however, think that these obstacles can be overcome in time, so there is great interest in doing research on the subject.

System Models

Various models have been suggested for building a distributed system. Most of them fall into one of three broad categories, which we call the "minicomputer" model, the "workstation" model, and the "processor pool" model. In the minicomputer model, the system consists of a few (perhaps even a dozen) minicomputers (e.g., VAXs), each with multiple users. Each user is logged onto one specific machine, with remote access to the other machines. This model is a simple outgrowth of the central time-sharing machine.

In the workstation model, each user has a personal workstation, usually equipped with a powerful processor, memory, a bit-mapped display, and sometimes a disk. Nearly all the work is done on the workstations. Such a system begins to look distributed when it supports a single, global file system, so that data can be accessed without regard to their location.

The processor pool model is the next evolutionary step after the workstation model. In a time-sharing system, whether with one or more processors, the ratio of CPUs to logged-in users is normally much less than 1; with the workstation model it is approximately 1; with the processor pool model it is much greater than 1. As CPUs get cheaper and cheaper, this model will become more and more widespread. The idea here is that whenever a user needs computing power, one or more CPUs are temporarily allocated to that user; when the job is completed, the CPUs go back into the pool to await the next request. As an example, when ten procedures (each on a separate file) must be recompiled, ten processors could be allocated to run in parallel for a few seconds and then be returned to the pool of available processors. At least one experimental system described below (Amoeba) attempts to combine two of these models, providing each user with a workstation in addition to the processor pool for general use. No doubt other variations will be tried in the future.

1. NETWORK OPERATING SYSTEMS

Before starting our discussion of distributed operating systems, it is worth first taking a brief look at some of the ideas involved in network operating systems, since they can be regarded as primitive forerunners. Although attempts to connect computers together have been around for decades, networking really came into the limelight with the ARPANET in the early 1970s.


The original design did not provide for much in the way of a network operating system. Instead, the emphasis was on using the network as a glorified telephone line to allow remote login and file transfer. Later, several attempts were made to create network operating systems, but they never were widely used [Millstein 1977].

In more recent years, several research organizations have connected collections of minicomputers running the UNIX operating system [Ritchie and Thompson 1974] into a network operating system, usually via a local network [Birman and Rowe 1982; Brownbridge et al. 1982; Chesson 1975; Hwang et al. 1982; Luderer et al. 1981; Wambecq 1983]. Wupit [1983] gives a good survey of these systems, which we shall draw upon for the remainder of this section.

As we said earlier, the key issue that distinguishes a network operating system from a distributed one is how aware the users are of the fact that multiple machines are being used. This visibility occurs in three primary areas: the file system, protection, and program execution. Of course, it is possible to have systems that are highly transparent in one area and not at all in the other, which leads to a hybrid form.

1.1 File System

When connecting two or more distinct systems together, the first issue that must be faced is how to merge the file systems. Three approaches have been tried. The first approach is not to merge them at all. Going this route means that a program on machine A cannot access files on machine B by making system calls. Instead, the user must run a special file transfer program that copies the needed remote files to the local machine, where they can then be accessed normally. Sometimes remote printing and mail is also handled this way. One of the best-known examples of networks that primarily support file transfer and mail via special programs, and not system call access to remote files, is the UNIX "uucp" program, and its network, USENET.

The next step upward in the direction of a distributed file system is to have adjoining file systems. In this approach, programs on one machine can open files on another machine by providing a path name telling where the file is located. For example, one could say

open("/machine1/pathname", READ);
open("machine1!pathname", READ);
open("/../machine1/pathname", READ);

The latter naming scheme is used in the Newcastle Connection [Brownbridge et al. 1982] and Netix [Wambecq 1983] and is derived from the creation of a virtual "superdirectory" above the root directories of all the connected machines. Thus "/.." means start at the local root directory and go upward one level (to the superdirectory), and then down to the root directory of "machine1." In Figure 1, the root directories of three machines, A, B, and C, are shown, with a superdirectory above them. To access file x from machine C, one could say

open("/../C/x", READ-ONLY)

In the Newcastle system, the naming tree is actually more general, since "machine1" may really be any directory, so one can attach a machine as a leaf anywhere in the hierarchy, not just at the top.

[Figure 1. A (virtual) superdirectory above the root directory provides access to remote files.]

The third approach is the way it is done in distributed operating systems, namely, to have a single global file system visible from all machines. When this method is used, there is one "bin" directory for binary programs, one password file, and so on. When a program wants to read the password file it does something like

open("/etc/passwd", READ-ONLY)

without reference to where the file is. It is up to the operating system to locate the file and arrange for transport of data as they are needed. LOCUS is an example of a system using this approach [Popek et al. 1981; Walker et al. 1983; Weinstein et al. 1985].

The convenience of having a single global name space is obvious. In addition, this approach means that the operating system is free to move files around among machines to keep all the disks equally full and busy, and that the system can maintain replicated copies of files if it so chooses.

When the user or program must specify the machine name, the system cannot decide on its own to move a file to a new machine because that would change the (user visible) name used to access the file. Thus in a network operating system, control over file placement must be done manually by the users, whereas in a distributed operating system it can be done automatically by the system itself.

1.2 Protection

Closely related to the transparency of the file system is the issue of protection. UNIX and many other operating systems assign a unique internal identifier to each user. Each file in the file system has a little table associated with it (called an i-node in UNIX) telling who the owner is, where the disk blocks are located, etc. If two previously independent machines are now connected, it may turn out that some internal User IDentifier (UID), for example, number 12, has been assigned to a different user on each machine. Consequently, when user 12 tries to access a remote file, the remote file system cannot see whether the access is permitted since two different users have the same UID.

One solution to this problem is to require all remote users wanting to access files on machine X to first log onto X using a user name that is local to X. When used this way, the network is just being used as a fancy switch to allow users at any terminal to log onto any computer, just as a telephone company switching center allows any subscriber to call any other subscriber. This solution is usually inconvenient for people and impractical for programs, so something better is needed.

The next step up is to allow any user to access files on any machine without having to log in, but to have the remote user appear to have the UID corresponding to "GUEST" or "DEMO" or some other publicly known login name. Generally such names have little authority and can only access files that have been designated as readable or writable by all users.

A better approach is to have the operating system provide a mapping between UIDs, so that when a user with UID 12 on his or her home machine accesses a remote machine on which his or her UID is 15, the remote machine treats all accesses as though they were done by user 15. This approach implies that sufficient tables are provided to map each user from his or her home (machine, UID) pair to the appropriate UID for any other machine (and that messages cannot be tampered with).

In a true distributed system there should be a unique UID for every user, and that UID should be valid on all machines without any mapping. In this way no protection problems arise on remote accesses to files; as far as protection goes, a remote access can be treated like a local access with the same UID. The protection issue makes the difference between a network operating system and a distributed one clear: In one case there are various machines, each with its own user-to-UID mapping, and in the other there is a single, systemwide mapping that is valid everywhere.
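
To make the mapping concrete, the fragment below sketches the kind of table lookup an operating system might perform. It is an illustration only: the table layout and the map_uid routine are invented and are not taken from any system described in this paper.

#include <stdio.h>

/* Illustrative only: mapping a (home machine, UID) pair onto the UID to use
 * on a remote machine.  The table layout and map_uid are invented. */
struct uidmap {
    int home_machine, home_uid;      /* who the user is at home            */
    int remote_machine, remote_uid;  /* who to act as on the other machine */
};

static struct uidmap table[] = {
    { 1, 12, 2, 15 },   /* user 12 on machine 1 is user 15 on machine 2 */
    { 1, 12, 3, 31 },
};

int map_uid(int home_machine, int home_uid, int remote_machine)
{
    int i, n = sizeof(table) / sizeof(table[0]);

    for (i = 0; i < n; i++)
        if (table[i].home_machine == home_machine &&
            table[i].home_uid == home_uid &&
            table[i].remote_machine == remote_machine)
            return table[i].remote_uid;
    return -1;          /* no entry: treat the caller as an untrusted guest */
}

int main(void)
{
    printf("UID 12 from machine 1 acts as UID %d on machine 2\n", map_uid(1, 12, 2));
    return 0;
}

A real system would also have to protect such a table from tampering and decide what to do when no entry is present; here the caller is simply treated as an unauthorized guest.
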
1.3 Execution Location

Program execution is the third area in which machine boundaries are visible in network operating systems. When a user or a running program wants to create a new process, where is the process created? At least four schemes have been used thus far.

The first of these is that the user simply says "CREATE PROCESS" in one way or another, and specifies nothing about where. Depending on the implementation, this can be the best or the worst way to do it. In the most distributed case, the system chooses a CPU by looking at the load, location of files to be used, etc. In the least distributed case, the system always runs the process on one specific machine (usually the machine on which the user is logged in).

The second approach to process location is to allow users to run jobs on any machine by first logging in there. In this model, processes on different machines cannot communicate or exchange data, but a simple manual load balancing is possible.

The third approach is a special command that the user types at a terminal to cause a program to be executed on a specific machine. A typical command might be

remote vax4 who

to run the who program on machine vax4. In this arrangement, the environment of the new process is the remote machine. In other words, if that process tries to read or write files from its current working directory, it will discover that its working directory is on the remote machine, and that files that were in the parent process's directory are no longer present. Similarly, files written in the working directory will appear on the remote machine, not the local one.

The fourth approach is to provide the "CREATE PROCESS" system call with a parameter specifying where to run the new process, possibly with a new system call for specifying the default site. As with the previous method, the environment will generally be the remote machine. In many cases, signals and other forms of interprocess communication between processes do not work properly among processes on different machines.
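
As a rough sketch of the fourth approach, a location-aware process-creation call might look like the fragment below. The create_process call, the ANY_SITE constant, and the site numbers are hypothetical; no system discussed here uses exactly this interface.

#include <stdio.h>

#define ANY_SITE 0   /* let the system choose a CPU (the most distributed case) */

/* Stub standing in for a hypothetical location-aware process-creation call.
 * A real implementation would pick a CPU by load, location of files, etc.,
 * and start the program there. */
int create_process(const char *program, int site)
{
    if (site == ANY_SITE)
        printf("start \"%s\" on a system-chosen machine\n", program);
    else
        printf("start \"%s\" on machine %d\n", program, site);
    return 1;                        /* pretend process identifier */
}

int main(void)
{
    create_process("who", ANY_SITE); /* system decides where it runs          */
    create_process("who", 4);        /* run on machine 4; the new process's
                                        working directory and files are then
                                        on that remote machine                */
    return 0;
}
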
A final point about the difference between network and distributed operating systems is how they are implemented. A common way to realize a network operating system is to put a layer of software on top of the native operating systems of the individual machines (e.g., Mamrak et al. [1982]). For example, one could write a special library package that would intercept all the system calls and decide whether each one was local or remote [Brownbridge et al. 1982]. Although most system calls can be handled this way without modifying the kernel, invariably there are a few things, such as interprocess signals, interrupt characters (e.g., BREAK) from the keyboard, etc., that are hard to get right. In a true distributed operating system one would normally write the kernel from scratch.

1.4 An Example: The Sun Network File System

To provide a contrast with the true distributed systems described later in this paper, in this section we look briefly at a network operating system that runs on the Sun Microsystems' workstations. These workstations are intended for use as personal computers. Each one has a 68000 series CPU, local memory, and a large bit-mapped display. Workstations can be configured with or without local disk, as desired. All the workstations run a version of 4.2BSD UNIX specially modified for networking.

This arrangement is a classic example of a network operating system: Each computer runs a traditional operating system, UNIX, and each has its own user(s), but with extra features added to make networking more convenient. During its evolution the Sun system has gone through three distinct versions, which we now describe.

In the first version each of the workstations was completely independent from all the others, except that a program rcp was provided to copy files from one workstation to another. By typing a command such as

rcp M1:/usr/jim/file.c M2:/usr/ast/f.c

it was possible to transfer whole files from one machine to another.

In the second version, Network Disk (ND), a network disk server was provided to support diskless workstations. Disk space on the disk server's machine was divided into disjoint partitions, with each partition acting as the virtual disk for some (diskless) workstation.

Whenever a diskless workstation needed to read a file, the request was processed locally until it got down to the level of the device driver, at which point the block needed was retrieved by sending a message to the remote disk server. In effect, the network was merely being used to simulate a disk controller. With this network disk system, sharing of disk partitions was not possible.

The third version, the Network File System (NFS), allows remote directories to be mounted in the local file tree on any workstation. By mounting, say, a remote directory "doc" on the empty local directory "/usr/doc," all subsequent references to "/usr/doc" are automatically routed to the remote system. Sharing is allowed in NFS, so several users can read files on a remote machine at the same time.

To prevent users from reading other people's private files, a directory can only be mounted remotely if it is explicitly exported by the workstation it is located on. A directory is exported by entering a line for it in a file "/etc/exports." To improve performance of remote access, both the client machine and server machine do block caching. Remote services can be located using a Yellow Pages server that maps service names onto their network locations.

The NFS is implemented by splitting the operating system up into three layers. The top layer handles directories, and maps each path name onto a generalized i-node called a vnode consisting of a (machine, i-node) pair, making each vnode globally unique.

Vnode numbers are presented to the middle layer, the virtual file system (VFS). This layer checks to see if a requested vnode is local or not. If it is local, it calls the local disk driver or, in the case of an ND partition, sends a message to the remote disk server. If it is remote, the VFS calls the bottom layer with a request to process it remotely.

The bottom layer accepts requests for accesses to remote vnodes and sends them over the network to the bottom layer on the serving machine. From there they propagate upward through the VFS layer to the top layer, where they are reinjected into the VFS layer. The VFS layer sees a request for a local vnode and processes it normally, without realizing that the top layer is actually working on behalf of a remote kernel. The reply retraces the same path in the other direction.
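
The role of the middle (VFS) layer can be sketched in a few lines. The code below is only a schematic rendering of the description above; the types, names, and stub bodies are invented and do not reproduce Sun's implementation.

#include <stdio.h>
#include <string.h>

/* Schematic sketch of the layered dispatch described above; all names are
 * invented for illustration. */
struct vnode { int machine; int inode; };   /* (machine, i-node): globally unique */

#define LOCAL_MACHINE 1

/* Bottom layer stub: in a real system this ships the request to the serving
 * machine, where it is reinjected into that machine's VFS layer. */
static int remote_read(struct vnode *vn, long off, char *buf, int n)
{
    printf("send read of i-node %d to machine %d\n", vn->inode, vn->machine);
    memset(buf, 0, n);
    return n;
}

/* Local disk driver stub (or, for an ND partition, a message to the disk server). */
static int local_read(struct vnode *vn, long off, char *buf, int n)
{
    printf("read i-node %d from the local disk\n", vn->inode);
    memset(buf, 0, n);
    return n;
}

/* Middle layer (VFS): decide whether the vnode is local or remote. */
int vfs_read(struct vnode *vn, long off, char *buf, int n)
{
    return vn->machine == LOCAL_MACHINE ? local_read(vn, off, buf, n)
                                        : remote_read(vn, off, buf, n);
}

int main(void)
{
    char buf[512];
    struct vnode here  = { LOCAL_MACHINE, 42 };
    struct vnode there = { 3, 17 };
    vfs_read(&here, 0L, buf, sizeof buf);
    vfs_read(&there, 0L, buf, sizeof buf);
    return 0;
}
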
The protocol between workstations has been carefully designed to be robust in the face of network and server crashes. Each request completely identifies the file (by its vnode), the position in the file, and the byte count. Between requests, the server does not maintain any state information about which files are open or where the current file position is. Thus, if a server crashes and is rebooted, there is no state information that will be lost.

The ND and NFS facilities are quite different and can both be used on the same workstation without conflict. ND works at a low level and just handles remote block I/O without regard to the structure of the information on the disk. NFS works at a much higher level and effectively takes requests appearing at the top of the operating system on the client machine and gets them over to the top of the operating system on the server machine, where they are processed in the same way as local requests.

2. DESIGN ISSUES

Now we turn from traditional computer systems with some networking facilities added on to systems designed with the intention of being distributed. In this section we look at five issues that distributed systems' designers are faced with:

• communication primitives,
• naming and protection,
• resource management,
• fault tolerance,
• services to provide.

Although no list could possibly be exhaustive at this early stage of development, these topics should provide a reasonable impression of the areas in which current research is proceeding.

2.1 Communication Primitives

The computers forming a distributed system normally do not share primary memory, and so communication via shared memory techniques such as semaphores and monitors is generally not applicable.

Instead, message passing in one form or another is used. One widely discussed framework for message-passing systems is the ISO OSI reference model, which has seven layers, each performing a well-defined function [Zimmermann 1980]. The seven layers are the physical layer, data-link layer, network layer, transport layer, session layer, presentation layer, and application layer. By using this model it is possible to connect computers with widely different operating systems, character codes, and ways of viewing the world.

Unfortunately, the overhead created by all these layers is substantial. In a distributed system consisting primarily of huge mainframes from different manufacturers, connected by slow leased lines (say, 56 kilobytes per second), the overhead might be tolerable. Plenty of computing capacity would be available for running complex protocols, and the narrow bandwidth means that close coupling between the systems would be impossible anyway. On the other hand, in a distributed system consisting of identical microcomputers connected by a 10-megabyte-per-second or faster local network, the price of the ISO model is generally too high. Nearly all the experimental distributed systems discussed in the literature thus far have opted for a different, much simpler model, so we do not mention the ISO model further in this paper.

2.1.1 Message Passing

The model that is favored by researchers in this area is the client-server model, in which a client process wanting some service (e.g., reading some data from a file) sends a message to the server and then waits for a reply message, as shown in Figure 2. In the most naked form the system just provides two primitives: SEND and RECEIVE. The SEND primitive specifies the destination and provides a message; the RECEIVE primitive tells from whom a message is desired (including "anyone") and provides a buffer where the incoming message is to be stored. No initial setup is required, and no connection is established, hence no tear down is required.

[Figure 2. Client-server model of communication: the client sends a request message to the server, and the server sends back a reply message.]
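
In outline, a client and a server built on such bare primitives look like the fragment below. This is a sketch only: the send and receive routines shown are stubs standing in for hypothetical kernel primitives, and the addresses and message layout are invented.

#include <stdio.h>
#include <string.h>

/* Sketch of the bare SEND/RECEIVE style of Figure 2. */
#define FILE_SERVER 5                      /* address of the server process */
#define ANYONE      (-1)                   /* accept a message from anyone  */

struct message { int sender; int opcode; int arg; char data[64]; };

/* Stubs standing in for the two kernel primitives described in the text. */
static void send(int destination, struct message *m)
{
    printf("send to %d: opcode %d, arg %d\n", destination, m->opcode, m->arg);
}
static void receive(int from, struct message *m)
{
    printf("receive (from %d): block until a message arrives\n", from);
    m->sender = FILE_SERVER;
    strcpy(m->data, "reply");
}

/* Client: send a request, then wait for the matching reply. */
static void client(void)
{
    struct message req = { 0, 1 /* READ */, 7 /* block number */, "" }, rep;

    send(FILE_SERVER, &req);
    receive(FILE_SERVER, &rep);
    printf("client got: %s\n", rep.data);
}

int main(void) { client(); return 0; }

A server would sit in the mirror-image loop, doing a RECEIVE from "anyone," carrying out the request, and SENDing the reply back to the requester.
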
Precisely what semantics these primitives ought to have has been a subject of much controversy among researchers. Two of the fundamental decisions that must be made are unreliable versus reliable and nonblocking versus blocking primitives. At one extreme, SEND can put a message out onto the network and wish it good luck. No guarantee of delivery is provided, and no automatic retransmission is attempted by the system if the message is lost. At the other extreme, SEND can handle lost messages, retransmissions, and acknowledgments internally, so that when SEND terminates, the program is sure that the message has been received and acknowledged.

Blocking versus Nonblocking Primitives. The other choice is between nonblocking and blocking primitives. With nonblocking primitives, SEND returns control to the user program as soon as the message has been queued for subsequent transmission (or a copy made). If no copy is made, any changes the program makes to the data before or (heaven forbid) while they are being sent are made at the program's peril. When the message has been transmitted (or copied to a safe place for subsequent transmission), the program is interrupted to inform it that the buffer may be reused. The corresponding RECEIVE primitive signals a willingness to receive a message and provides a buffer for it to be put into. When a message has arrived, the program is informed by interrupt, or it can poll for status continuously or go to sleep until the interrupt arrives. The advantage of these nonblocking primitives is that they provide the maximum flexibility: Programs can compute and perform message I/O in parallel in any way they want.

Nonblocking primitives also have a disadvantage: They make programming tricky and difficult. Irreproducible, timing-dependent programs are painful to write and awful to debug. Consequently, many people advocate sacrificing some flexibility and efficiency by using blocking primitives. A blocking SEND does not return control to the user until the message has been sent (unreliable blocking primitive) or until the message has been sent and an acknowledgment received (reliable blocking primitive). Either way, the program may immediately modify the buffer without danger. A blocking RECEIVE does not return control until a message has been placed in the buffer. Reliable and unreliable RECEIVEs differ in that the former automatically acknowledges receipt of a message, whereas the latter does not. It is not reasonable to combine a reliable SEND with an unreliable RECEIVE, or vice versa; so the system designers must make a choice and provide one set or the other. Blocking and nonblocking primitives do not conflict, so there is no harm done if the sender uses one and the receiver the other.

Buffered versus Unbuffered Primitives. Another design decision that must be made is whether or not to buffer messages. The simplest strategy is not to buffer. When a sender has a message for a receiver that has not (yet) executed a RECEIVE primitive, the sender is blocked until a RECEIVE has been done, at which time the message is copied from sender to receiver. This strategy is sometimes referred to as a rendezvous.

A slight variation on this theme is to copy the message to an internal buffer on the sender's machine, thus providing for a nonblocking version of the same scheme. As long as the sender does not do any more SENDs before the RECEIVE occurs, no problem occurs.

A more general solution is to have a buffering mechanism, usually in the operating system kernel, which allows senders to have multiple SENDs outstanding, even without any interest on the part of the receiver. Although buffered message passing can be implemented in many ways, a typical approach is to provide users with a system call CREATEBUF, which creates a kernel buffer, sometimes called a mailbox, of a user-specified size. To communicate, a sender can now send messages to the receiver's mailbox, where they will be buffered until requested by the receiver. Buffering is not only more complex (creating, destroying, and generally managing the mailboxes), but also raises issues of protection, the need for special high-priority interrupt messages, what to do with mailboxes owned by processes that have been killed or died of natural causes, and more.

A more structured form of communication is achieved by distinguishing requests from replies. With this approach, one typically has three primitives: SEND-GET, GET-REQUEST, and SEND-REPLY. SEND-GET is used by clients to send requests and get replies. It combines a SEND to a server with a RECEIVE to get the server's reply. GET-REQUEST is done by servers to acquire messages containing work for them to do. When a server has carried the work out, it sends a reply with SEND-REPLY. By thus restricting the message traffic and using reliable, blocking primitives, one can create some order in the chaos.
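
A server built on these primitives is typically just a loop, and a client a single combined call, as the sketch below suggests. Only the primitive names come from the discussion above; their signatures, the message format, and the stub bodies are assumptions made for illustration.

#include <stdio.h>
#include <string.h>

struct message { int opcode; int arg; char data[64]; };

/* Stubs; a real kernel would provide reliable, blocking message transport. */
static void send_get(int server, struct message *req, struct message *rep)
{   /* a SEND to the server combined with a RECEIVE for its reply */
    printf("request %d sent to server %d; blocking for reply\n", req->opcode, server);
    strcpy(rep->data, "reply");
}
static void get_request(struct message *req) { req->opcode = 1; req->arg = 7; }
static void send_reply(struct message *rep)  { printf("reply sent: %s\n", rep->data); }

/* A server spends its life acquiring work and replying to it. */
static void serve_one_request(void)
{
    struct message req, rep;
    get_request(&req);                          /* wait for work            */
    sprintf(rep.data, "block %d", req.arg);     /* carry the work out       */
    send_reply(&rep);                           /* hand the result back     */
}

int main(void)
{
    struct message req = { 1, 7, "" }, rep;
    serve_one_request();                        /* normally a separate process */
    send_get(5, &req, &rep);
    printf("client got: %s\n", rep.data);
    return 0;
}
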
2.1.2 Remote Procedure Call (RPC)

The next step forward in message-passing systems is the realization that the model of "client sends request and blocks until server sends reply" looks very similar to a traditional procedure call from the client to the server. This model has become known in the literature as "remote procedure call" and has been widely discussed [Birrell and Nelson 1984; Nelson 1981; Spector 1982]. The idea is to make the semantics of intermachine communication as similar as possible to normal procedure calls because the latter is familiar and well understood, and has proved its worth over the years as a tool for dealing with abstraction. It can be viewed as a refinement of the reliable, blocking SEND-GET, GET-REQUEST, SEND-REPLY primitives, with a more user-friendly syntax.


The remote procedure call can be organized as follows. The client (calling program) makes a normal procedure call, say, p(x, y) on its machine, with the intention of invoking the remote procedure p on some other machine. A dummy or stub procedure p must be included in the caller's address space, or at least be dynamically linked to it upon call. This procedure, which may be automatically generated by the compiler, collects the parameters and packs them into a message in a standard format. It then sends the message to the remote machine (using SEND-GET) and blocks, waiting for an answer (see Figure 3).

[Figure 3. Remote procedure call between a client machine and a server machine.]

At the remote machine, another stub procedure should be waiting for a message using GET-REQUEST. When a message comes in, the parameters are unpacked by an input-handling procedure, which then makes the local call p(x, y). The remote procedure p is thus called locally, and so its normal assumptions about where to find parameters, the state of the stack, etc., are identical to the case of a purely local call. The only procedures that know that the call is remote are the stubs, which build and send the message on the client side and disassemble and make the call on the server side. The result of the procedure call follows an analogous path in the reverse direction.
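
The division of labor between the caller, the client stub, and the server stub can be illustrated with a small example. The sketch below is not generated stub code from any real RPC system; the message layout and helper names are invented, and the network transfer is simulated by a direct call.

#include <stdio.h>
#include <string.h>

/* Minimal sketch of the stubs for a remote procedure p(x, y). */
struct rpc_msg { char proc[16]; int args[2]; int result; };

/* The real procedure p, present only on the server machine. */
static int p(int x, int y) { return x + y; }

/* Server stub: unpack the parameters, make the local call, pack the result. */
static void server_stub(struct rpc_msg *m)
{
    if (strcmp(m->proc, "p") == 0)
        m->result = p(m->args[0], m->args[1]);
}

/* Stand-in for SEND-GET: in reality the message crosses the network and the
 * reply comes back; here the server stub is simply invoked directly. */
static void send_get(struct rpc_msg *m) { server_stub(m); }

/* Client stub: it packs the parameters, blocks for the reply, and unpacks
 * the result. */
static int p_client_stub(int x, int y)
{
    struct rpc_msg m;
    strcpy(m.proc, "p");
    m.args[0] = x;
    m.args[1] = y;
    send_get(&m);                 /* send request, wait for the answer */
    return m.result;
}

int main(void)
{
    printf("p(3, 4) called remotely yields %d\n", p_client_stub(3, 4));
    return 0;
}

In a real system the client stub would simply be named p, so the calling program could not tell that the call was remote.
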
Remote Procedure Call Design Issues. Although at first glance the remote procedure call model seems clean and simple, under the surface there are several problems. One problem concerns parameter (and result) passing. In most programming languages, parameters can be passed by value or by reference. Passing value parameters over the network is easy; the stub just copies them into the message and off they go. Passing reference parameters (pointers) over the network is not so easy. One needs a unique, systemwide pointer for each object so that it can be remotely accessed. For large objects, such as files, some kind of capability mechanism [Dennis and Van Horn 1966; Levy 1984; Pashtan 1982] could be set up, using capabilities as pointers. For small objects, such as integers and Booleans, the amount of overhead and mechanism needed to create a capability and send it in a protected way is so large that this solution is highly undesirable.

Still another problem that must be dealt with is how to represent parameters and results in messages. This representation is greatly complicated when different types of machines are involved in a communication. A floating-point number produced on one machine is unlikely to have the same value on a different machine, and even a negative integer will create problems between the 1's complement and 2's complement machines.

Converting to and from a standard format on every message sent and received is an obvious possibility, but it is expensive and wasteful, especially when the sender and receiver do, in fact, use the same internal format. If the sender uses its internal format (along with an indication of which format it is) and lets the receiver do the conversion, every machine must be prepared to convert from every other format. When a new machine type is introduced, much existing software must be upgraded. Any way it is done, with remote procedure call (RPC) or with plain messages, it is an unpleasant business.

Some of the unpleasantness can be hidden from the user if the remote procedure call mechanism is embedded in a programming language with strong typing, so that the receiver at least knows how many parameters to expect and what types they have. In this respect, a weakly typed language such as C, in which procedures with a variable number of parameters are common, is more complicated to deal with.

Still another problem with RPC is the issue of client-server binding. Consider, for example, a system with multiple file servers. If a client creates a file on one of the file servers, it is usually desirable that subsequent writes to that file go to the file server where the file was created.

With mailboxes, arranging for this is straightforward. The client simply addresses the WRITE messages to the same mailbox that the CREATE message was sent to. Since each file server has its own mailbox, there is no ambiguity.

When RPC is used, the situation is more complicated, since all the client does is put a procedure call such as

write(FileDescriptor, BufferAddress, ByteCount);

in his program. RPC intentionally hides all the details of locating servers from the client, but sometimes, as in this example, the details are important.

In some applications, broadcasting and multicasting (sending to a set of destinations, rather than just one) is useful. For example, when trying to locate a certain person, process, or service, sometimes the only approach is to broadcast an inquiry message and wait for the replies to come back. RPC does not lend itself well to sending messages to sets of processes and getting answers back from some or all of them. The semantics are completely different.

Despite all these disadvantages, RPC remains an interesting form of communication, and much current research is being addressed toward improving it and solving the various problems discussed above.

2.1.3 Error Handling

Error handling in distributed systems is radically different from that of centralized systems. In a centralized system, a system crash means that the client, server, and communication channel are all completely destroyed, and no attempt is made to revive them. In a distributed system, matters are more complex. If a client has initiated a remote procedure call with a server that has crashed, the client may just be left hanging forever unless a time-out is built in. However, such a time-out introduces race conditions in the form of clients that time out too quickly, thinking that the server is down, when in fact, it is merely very slow.

Client crashes can also cause trouble for servers. Consider, for example, the case of processes A and B communicating via the UNIX pipe model A | B with A the server and B the client. B asks A for data and gets a reply, but unless that reply is acknowledged somehow, A does not know when it can safely discard data that it may not be able to reproduce. If B crashes, how long should A hold onto the data? (Hint: If the answer is less than infinity, problems will be introduced whenever B is slow in sending an acknowledgment.)

Closely related to this is the problem of what happens if a client cannot tell whether or not a server has crashed. Simply waiting until the server is rebooted and trying again sometimes works and sometimes does not. This is a case in which it works: Client asks to read block 7 of some file. This is a case in which it does not work: Client says transfer a million dollars from one bank account to another. In the former case, it does not matter whether or not the server carried out the request before crashing; carrying it out a second time does no harm. In the latter case, one would definitely prefer the call to be carried out exactly once, no more and no less. Calls that may be repeated without harm (like the first example) are said to be idempotent. Unfortunately, it is not always possible to arrange for all calls to have this property. Any call that causes action to occur in the outside world, such as transferring money, printing lines, or opening a valve in an automated chocolate factory just long enough to fill exactly one vat, is likely to cause trouble if performed twice.
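
A common way to keep a non-idempotent call from being carried out twice is to tag each request with a sequence number and have the server remember, per client, the last request it executed. The sketch below illustrates the idea; the data structures are invented, and the table itself is lost if the server crashes, which is exactly the harder problem taken up next.

#include <stdio.h>

#define MAX_CLIENTS 16

static int last_done[MAX_CLIENTS];      /* highest sequence number executed */
static int last_result[MAX_CLIENTS];    /* its saved result, for resending  */

static int transfer_money(int amount) { return amount; }  /* the dangerous action */

int handle_request(int client, int seq, int amount)
{
    if (seq <= last_done[client])        /* retransmission of an old request:   */
        return last_result[client];      /* resend the saved reply, do not redo */

    last_result[client] = transfer_money(amount);
    last_done[client] = seq;
    return last_result[client];
}

int main(void)
{
    printf("first attempt:  %d\n", handle_request(3, 1, 100));
    printf("retransmission: %d\n", handle_request(3, 1, 100));  /* not done twice */
    return 0;
}
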
Spector [1982] and Nelson [1981] have looked at the problem of trying to make sure that remote procedure calls are executed exactly once, and they have developed taxonomies for classifying the semantics of different systems. These vary from systems that offer no guarantee at all (zero or more executions), to those that guarantee at most one execution (zero or one), to those that guarantee at least one execution (one or more).

Getting it right (exactly one) is probably impossible, because even if the remote execution can be reduced to one instruction (e.g., setting a bit in a device register that opens the chocolate valve), one can never be sure after a crash if the system went down a microsecond before or a microsecond after the one critical instruction. Sometimes one can make a guess based on observing external events (e.g., looking to see whether the factory floor is covered with a sticky, brown material), but in general there is no way of knowing. Note that the problem of creating stable storage [Lampson 1981] is fundamentally different, since remote procedure calls to the stable storage server in that model never cause events external to the computers.

2.1.4 Implementation Issues

Constructing a system in principle is always easier than constructing it in practice. Building a 16-node distributed system that has a total computing power about equal to a single-node system is surprisingly easy. This observation leads to tension between the goals of making it work fast in the normal case and making the semantics reasonable when something goes wrong. Some experimental systems have put the emphasis on one goal and some on the other, but more research is needed before we have systems that are both fast and graceful in the face of crashes.

Some things have been learned from past work, however. Foremost among these is that making message passing efficient is very important. To this end, systems should be designed to minimize copying of data [Cheriton 1984a]. For example, a remote procedure call system that first copies each message from the user to the stub, from the stub to the kernel, and finally from the kernel to the network interface board requires three copies on the sending side, and probably three more on the receiving side, for a total of six. If the call is to a remote file server to write a 1K block of data to disk, at a copy time of 1 microsecond per byte, 6 milliseconds are needed just for copying, which puts an upper limit of 167 calls per second, or a throughput of 167 kilobytes per second. When other sources of overhead are considered (e.g., the reply message, the time waiting for access to the network, transmission time), achieving even 80 kilobytes per second will be difficult, if not impossible, no matter how high the network bandwidth or disk speed. Thus it is desirable to avoid copying, but this is not always simple to achieve since without copies, (part of) a needed message may be swapped or paged out when it is needed.

Another point worth making is that there is always a substantial fixed overhead with preparing, sending, and receiving a message, even a short message, such as a request to read from a remote file server. The kernel must be invoked, the state of the current process must be saved, the destination must be located, various tables must be updated, permission to access the network must be obtained (e.g., wait for the network to become free or wait for the token), and quite a bit of bookkeeping must be done.

This fixed overhead argues for making messages as long as possible, to reduce the number of messages. Unfortunately, many current local networks limit physical packets to 1K or 2K; 4K or 8K would be much better. Of course, if the packets become too long, a highly interactive user may occasionally be queued behind ten maximum-length packets, degrading response time; so the optimum size depends on the work load.

Virtual Circuits versus Datagrams

There is much controversy over whether remote procedure call ought to be built on top of a flow-controlled, error-controlled, virtual circuit mechanism or directly on top of the unreliable, connectionless (datagram) service. Saltzer et al. [1984] have pointed out that since high reliability can only be achieved by end-to-end acknowledgments at the highest level of protocol, the lower levels need not be 100 percent reliable. The overhead incurred in providing a clean virtual circuit upon which to build remote procedure calls (or any other message-passing system) is therefore wasted. This line of thinking argues for building the message system directly on the raw datagram interface.

The other side of the coin is that it would be nice for a distributed system to be able to encompass heterogeneous computers in different countries with different post, telephone, and telegraph (PTT) networks and possibly different national alphabets, and that this environment requires complex multilayered protocol structures. It is our observation that both arguments are valid, but, depending on whether one is trying to forge a collection of small computers into a virtual uniprocessor or merely access remote data transparently, one or the other will dominate.

Even if one opts for building RPC on top of the raw datagram service provided by a local network, there are still a number of protocols open to the implementer. The simplest one is to have every request and reply separately acknowledged. The message sequence for a remote procedure call is then: REQUEST, ACK, REPLY, ACK, as shown in Figure 4a. The ACKs are managed by the kernel without user knowledge.

[Figure 4. Remote procedure call (a) with individual acknowledgments per message, (b) with the reply as the request acknowledgment, (c) with no explicit acknowledgments.]

The number of messages can be reduced from four to three by allowing the REPLY to serve as the ACK for the REQUEST, as shown in Figure 4b. However, a problem arises when the REPLY can be delayed for a long time. For example, when a login process makes an RPC to a terminal server requesting characters, it may be hours or days before someone steps up to a terminal and begins typing. In this event, an additional message has to be introduced to allow the sending kernel to inquire whether the message has arrived or not.

A further step in the same direction is to eliminate the other ACK as well, and let the arrival of the next REQUEST imply an acknowledgment of the previous REPLY (see Figure 4c). Again, some mechanism is needed to deal with the case that no new REQUEST is forthcoming quickly.
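
The second variant can be stated a little more operationally: the client kernel retransmits the request until the reply (which doubles as the acknowledgment) arrives, and falls back on an explicit "did my request arrive?" probe when the reply is slow. The fragment below is a sketch of that logic with invented names, limits, and stub routines, not a transcription of any real protocol.

#include <stdio.h>

/* Sketch of the client-side logic for the protocol of Figure 4b. */
enum outcome { REPLY_ARRIVED, TIMED_OUT };

static void transmit_request(int seq) { printf("REQUEST %d sent\n", seq); }

static enum outcome wait_for_reply(int seq)
{
    printf("REPLY %d received; it also acknowledges the request\n", seq);
    return REPLY_ARRIVED;                 /* stub: pretend the reply was prompt */
}

static int server_has_request(int seq)   /* the "did my request arrive?" probe */
{
    printf("probe for request %d\n", seq);
    return 1;                             /* stub answer: yes                   */
}

int do_rpc(int seq)
{
    int tries = 0;

    transmit_request(seq);
    for (;;) {
        if (wait_for_reply(seq) == REPLY_ARRIVED)
            return 0;                     /* done: the reply acted as the ACK   */
        if (server_has_request(seq))
            continue;                     /* request is there; the reply may
                                             legitimately take hours, so wait   */
        if (++tries == 3)
            return -1;                    /* give up: server presumed down      */
        transmit_request(seq);            /* request was lost: send it again    */
    }
}

int main(void) { return do_rpc(1) ? 1 : 0; }
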
One of the great difficulties in implementing efficient communication is that it is more of a black art than a science. Even straightforward implementations can have unexpected consequences, as the following example from Sventek et al. [1983] shows. Consider a ring containing a circulating token. To transmit, a machine captures and removes the token, puts a message on the network, and then replaces the token, thus allowing the next machine "downstream" the opportunity to capture it. In theory, such a network is "fair" in that each user has equal access to the network and no one user can monopolize it to the detriment of others. In practice, suppose that two users each want to read a long file from a file server. User A sends a request message to the server, and then replaces the token on the network for B to acquire.

After A's message arrives at the server, it takes a short time for the server to handle the incoming message interrupt and reenable the receiving hardware. Until the receiver is reenabled, the server is deaf. Within a microsecond or two of the time A puts the token back on the network, B sees and grabs it, and begins transmitting a request to the (unbeknown to B) deaf file server. Even if the server reenables halfway through B's message, the message will be rejected owing to missing header, bad frame format, and checksum error. According to the ring protocol, after sending one message, B must now replace the token, which A captures for a successful transmission. Once again B transmits during the server's deaf period, and so on. Conclusion: B gets no service at all until A is finished.


If A happens to be scanning through the Manhattan telephone book, B may be in for a long wait. This specific problem can be solved by inserting random delays in places to break the synchrony, but our point is that totally unexpected problems like this make it necessary to build and observe real systems to gain insight into the problems. Abstract formulations and simulations are not enough.

2.2 Naming and Protection

All operating systems support objects such as files, directories, segments, mailboxes, processes, services, servers, nodes, and I/O devices. When a process wants to access one of these objects, it must present some kind of name to the operating system to specify which object it wants to access. In some instances these names are ASCII strings designed for human use; in others they are binary numbers used only internally. In all cases they have to be managed and protected from misuse.

2.2.1 Naming as Mapping

Naming can best be seen as a problem of mapping between two domains. For example, the directory system in UNIX provides a mapping between ASCII path names and i-node numbers. When an OPEN system call is made, the kernel converts the name of the file to be opened into its i-node number. Internal to the kernel, files are nearly always referred to by i-node number, not ASCII string. Just about all operating systems have something similar. In a distributed system a separate name server is sometimes used to map user-chosen names (ASCII strings) onto objects in an analogous way.

Another example of naming is the mapping of virtual addresses onto physical addresses in a virtual memory system. The paging hardware takes a virtual address as input and yields a physical address as output for use by the real memory.

In some cases naming implies only a single level of mapping, but in other cases it can imply multiple levels. For example, to use some service, a process might first have to map the service name onto the name of a server process that is prepared to offer the service. As a second step, the server would then be mapped onto the number of the CPU on which that process is running. The mapping need not always be unique, for example, if there are multiple processes prepared to offer the same service.
2.2.2 Name Servers

In centralized systems, the problem of naming can be effectively handled in a straightforward way. The system maintains a table or database providing the necessary name-to-object mappings. The most straightforward generalization of this approach to distributed systems is the single name server model. In this model, a server accepts names in one domain and maps them onto names in another domain. For example, to locate services in some distributed systems, one sends the service name in ASCII to the name server, and it replies with the node number where that service can be found, or with the process name of the server process, or perhaps with the name of a mailbox to which requests for service can be sent. The name server's database is built up by registering services, processes, etc., that want to be publicly known. File directories can be regarded as a special case of name service.

Although this model is often acceptable in a small distributed system located at a single site, in a large system it is undesirable to have a single centralized component (the name server) whose demise can bring the whole system to a grinding halt. In addition, if it becomes overloaded, performance will degrade. Furthermore, in a geographically distributed system that may have nodes in different cities or even countries, having a single name server will be inefficient owing to the long delays in accessing it.

The next approach is to partition the system into domains, each with its own name server. If the system is composed of multiple local networks connected by gateways and bridges, it seems natural to have one name server per local network. One way to organize such a system is to have a global naming tree, with files and other objects having names of the form: /country/city/network/pathname. When such a name is presented to any name server, it can immediately route the request to some name server in the designated country, which then sends it to a name server in the designated city, and so on until it reaches the name server in the network where the object is located, where the mapping can be done. Telephone numbers use such a hierarchy, composed of country code, area code, exchange code (first three digits of telephone number in North America), and subscriber line number.

[Figure 5. Distributing the lookup of a/b/c over three name servers: name server 1 looks up a/b/c, name server 2 looks up b/c, and name server 3 looks up c.]

Having multiple name servers does not necessarily require having a single, global naming hierarchy. Another way to organize the name servers is to have each one effectively maintain a table of, for example, (ASCII string, pointer) pairs, where the pointer is really a kind of capability for any object or domain in the system. When a name, say a/b/c, is looked up by the local name server, it may well yield a pointer to another domain (name server), to which the rest of the name, b/c, is sent for further processing (see Figure 5). This facility can be used to provide links (in the UNIX sense) to files or objects whose precise whereabouts is managed by a remote name server. Thus if a file foobar is located in another local network, n, with name server n.s, one can make an entry in the local name server's table for the pair (x, n.s) and then access x/foobar as though it were a local object. Any appropriately authorized user or process knowing the name x/foobar could make its own synonym s and then perform accesses using s/x/foobar. Each name server parsing a name that involves multiple name servers just strips off the first component and passes the rest of the name to the name server found by looking up the first component locally.

A more extreme way of distributing the name server is to have each machine manage its own names. To look up a name, one broadcasts it on the network. At each machine, the incoming request is passed to the local name server, which replies only if it finds a match. Although broadcasting is easiest over a local network such as a ring net or CSMA net (e.g., Ethernet), it is also possible over store-and-forward packet switching networks such as the ARPANET [Dalal 1977].

Although the normal use of a name server is to map an ASCII string onto a binary number used internally to the system, such as a process identifier or machine number, once in a while the inverse mapping is also useful. For example, if a machine crashes, upon rebooting it could present its (hard-wired) node number to the name server to ask what it was doing before the crash, that is, ask for the ASCII string corresponding to the service that it is supposed to be offering so that it can figure out what program to reboot.
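
The component-stripping lookup of Figure 5 is easy to express: each server looks up the first component in its own table, and if the entry points at another name server, the remainder of the name is handed to that server. The sketch below uses invented data structures and a direct recursive call where real name servers would exchange messages.

#include <stdio.h>
#include <string.h>

struct entry { const char *component; int next_server; int object; };
                                  /* next_server >= 0: points at another name server */
                                  /* next_server <  0: the object is bound locally   */
struct server { struct entry entries[4]; };

static struct server servers[3] = {
    { { { "a", 1, -1 }, { "x", -1, 10 } } },    /* name server 0 */
    { { { "b", 2, -1 }, { "y", -1, 20 } } },    /* name server 1 */
    { { { "c", -1, 30 } } },                    /* name server 2 */
};

/* Returns the object bound to `name`, or -1 if it cannot be resolved. */
int lookup(int server, const char *name)
{
    char first[32];
    const char *rest = strchr(name, '/');
    size_t len = rest ? (size_t)(rest - name) : strlen(name);
    int i;

    memcpy(first, name, len);           /* strip off the first component */
    first[len] = '\0';

    for (i = 0; i < 4; i++) {
        struct entry *e = &servers[server].entries[i];
        if (e->component && strcmp(e->component, first) == 0) {
            if (e->next_server < 0)
                return e->object;                     /* resolved locally          */
            return rest ? lookup(e->next_server, rest + 1) : -1;  /* pass the rest */
        }
    }
    return -1;
}

int main(void)
{
    printf("a/b/c resolves to object %d\n", lookup(0, "a/b/c"));
    return 0;
}
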

Computing Surveys, Vol. 17, No. 4, December 1985 434 l A. S. Tanenbaum and R. van Renesse systems always have tables that give com- This hierarchy can be extended ad infini- plete and up-to-date status information tum, with the number of levels needed about all the resources being managed; dis- growing logarithmically with the number of tributed systems do not. For example, the workers. Since each processor need only process manager in a traditional centralized maintain communication with one superior operating system normally uses a “process and k subordinates, the information stream table” with one entry per potential process. is manageable. When a new process has to be started, it is An obvious question is, “What happens simple enough to scan the whole table to when a department head, or worse yet, a see whether a slot is free. A distributed big cheese, stops functioning (crashes)?” operating system, on the other hand, has a One answer is to promote one of the direct much harder job of finding out whether a subordinates of the faulty manager to fill processor is free, especially if the system in for the boss. The choice of which one designers have rejected the idea of having can either be made by the subordinates any central tables at all, for reasons of themselves, by the deceased’s peers, or in a reliability. Furthermore, even if there is a more autocratic system, by the sick man- central table, recent events on outlying ager’s boss. processors may have made some table en- To avoid having a single (vulnerable) tries obsolete without the table manager manager at the top of the tree, one can knowing it. truncate the tree at the top and have a The problem of managing resources committee as the ultimate authority. When without having accurate global state infor- a member of the ruling committee malfunc- mation is very difficult. Relatively little tions, the remaining members promote work has been done in this area. In the someone one level down as a replacement. following sections we look at some work Although this scheme is not completely that has been done, including distributed distributed, it is feasible and works well in process management and scheduling. practice. In particular, the system is self- repairing, and can survive occasional crashes of both workers and managers 2.3.1 Processor Allocation without any long-term effects. One of the key resources to be managed in In MICROS, the processors are mono- a distributed system is the set of available programmed, so if a job requiring S pro- processors. One approach that has been cesses suddenly appears, the system must proposed for keeping tabs on a collection of allocate S processors for it. Jobs can be processors is to organize them in a logical created at any level of the hierarchy. The hierarchy independent of the physical strategy used is for each manager to keep structure of the network, as in MICROS track of approximately how many workers [Wittie and van Tilborg 19801. This ap- below it are available (possibly several proach organizes the machines like people levels below it). If it thinks that a sufficient in corporate, military, academic, and number are available, it reserves some other real-world hierarchies. Some of the number R of them, where R 2 S, because machines are workers and others are the estimate of available workers may not managers. be exact and some machines may be down. 
For each group of k workers, one manager If the manager receiving the request machine (the “department head”) is as- thinks that it has too few processors avail- signed the task of keeping track of who is able, it passes the request upward in the busy and who is idle. If the system is large, tree to its boss. If the boss cannot handle there will be an unwieldy number of de- it either, the request continues propagating partment heads; so some machines will upward until it reaches a level that has function as “deans,” riding herd on k de- enough available workers at its disposal. At partment heads. If there are many deans, that point, the manager splits the request they too can be organized hierarchically, into parts and parcels them out among the with a “big cheese” keeping tabs on k deans. managers below it, which then do the same
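The request flow can be sketched in a few lines of Python. This is only a schematic model of the hierarchy described here (the class, the idle-worker estimate, and the slack factor are all invented for illustration), not the actual MICROS algorithm:

    class Manager:
        def __init__(self, boss=None, idle_estimate=0, slack=1.2):
            self.boss = boss                      # next level up in the hierarchy
            self.idle_estimate = idle_estimate    # approximate number of idle workers below
            self.slack = slack                    # over-reservation factor for stale estimates

        def allocate(self, s):
            """Try to reserve R >= s processors for a job needing s processes."""
            r = int(s * self.slack) + 1           # R >= S, since some estimates may be wrong
            if self.idle_estimate >= r:
                self.idle_estimate -= r           # a real manager would split the request
                return r                          # among its subordinates here
            if self.boss is None:
                raise RuntimeError("no level of the tree has enough idle workers")
            return self.boss.allocate(s)          # too few below: pass the request upward

    dean = Manager(idle_estimate=40)
    head = Manager(boss=dean, idle_estimate=3)
    print(head.allocate(10))                      # the department head forwards the request up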

thing until the wave of scheduling requests hits bottom. At the bottom level, the processors are marked as "busy," and the actual number of processors allocated is reported back up the tree.

To make this strategy work well, R must be large enough so that the probability is high that enough workers will be found to handle the whole job. Otherwise, the request will have to move up one level in the tree and start all over, wasting considerable time and computing power. On the other hand, if R is too large, too many processors will be allocated, wasting computing capacity until word gets back to the top and they can be released.

The whole situation is greatly complicated by the fact that requests for processors can be generated randomly anywhere in the system, so at any instant, multiple requests are likely to be in various stages of the allocation algorithm, potentially giving rise to out-of-date estimates of available workers, race conditions, deadlocks, and more. In Van Tilborg and Wittie [1981] a mathematical analysis of the problem is given and various other aspects not described here are covered in detail.

2.3.2 Scheduling

The hierarchical model provides a general model for resource control but does not provide any specific guidance on how to do scheduling. If each process uses an entire processor (i.e., no multiprogramming), and each process is independent of all the others, any process can be assigned to any processor at random. However, if it is common that several processes are working together and must communicate frequently with each other, as in UNIX pipelines or in cascaded (nested) remote procedure calls, then it is desirable to make sure that the whole group runs at once. In this section we address that issue.

Let us assume that each processor can handle up to N processes. If there are plenty of machines and N is reasonably large, the problem is not finding a free machine (i.e., a free slot in some process table), but something more subtle. The basic difficulty can be illustrated by an example in which processes A and B run on one machine and processes C and D run on another. Each machine is time shared in, say, 100-millisecond time slices, with A and C running in the even slices, and B and D running in the odd ones, as shown in Figure 6a. Suppose that A sends many messages or makes many remote procedure calls to D. During time slice 0, A starts up and immediately calls D, which unfortunately is not running because it is now C's turn. After 100 milliseconds, process switching takes place, and D gets A's message, carries out the work, and quickly replies. Because B is now running, it will be another 100 milliseconds before A gets the reply and can proceed. The net result is one message exchange every 200 milliseconds. What is needed is a way to ensure that processes that communicate frequently run simultaneously.

Figure 6. (a) Two jobs running out of phase with each other. (b) Scheduling matrix for eight machines, each with six time slots. The X's indicate allocated slots.

Although it is difficult to determine dynamically the interprocess communication patterns, in many cases a group of related processes will be started off together. For example, it is usually a good bet that the filters in a UNIX pipeline will communicate with each other more than they will with other, previously started processes. Let us assume that processes are created in groups, and that intragroup communication is much more prevalent than intergroup communication. Let us further assume that a sufficiently large number of machines are available to handle the largest group, and that each machine is

Computing Surveys, Vol. 17, No. 4, December 1985 436 l A. S. Tanenbaum and R. van Renesse multiprogrammed with N process slots (N- cesses are assigned to the empty slots; way multiprogramming). otherwise the window is slid to the right Ousterhout [ 19821 has proposed several and the algorithm repeated. Scheduling is algorithms based on the concept of co- done by starting the window at the left edge scheduling, which takes interprocess and moving rightward by about one win- communication patterns into account while dow’s worth per time slice, taking care not scheduling to ensure that all members of a to split groups over windows. Ousterhout’s group run at the same time. The first al- paper discusses these and other methods in gorithm uses a conceptual matrix in which more detail and gives some performance each column is the process table for one results. machine, as shown in Figure 6b. Thus, col- umn 4 consists of all the processes that run 2.3.3 Load Balancing on machine 4. Row 3 is the collection of all processes that are in slot 3 of some ma- The goal of Ousterhout’s work is to place chine, starting with the process in slot 3 of processes that work together on different machine 0, then the process in slot 3 of processors, so that they can all run in par- machine 1, and so on. The gist of his idea allel. Other researchers have tried to do is to have each processor use a round-robin precisely the opposite, namely, to find sub- scheduling algorithm with all processors sets of all the processes in the system that first running the process in slot 0 for a fixed are working together, so that closely related period, then all processors running the groups of processes can be placed on the process in slot 1 for a fixed period, etc. A same machine to reduce interprocess com- broadcast message could be used to tell each munication costs [Chow and Abraham processor when to do process switching, to 1982; Chu et al. 1980; Gylys and Edwards keep the time slices synchronized. 1976; Lo 1984; Stone 1977,1978; Stone and By putting all the members of a process Bokhari 19781. Yet other researchers have group in the same slot number, but on been concerned primarily with load balanc- different machines, one has the advantage ing, to prevent a situation in which some of N-fold parallelism, with a guarantee that processors are overloaded while others are all the processes will be run at the same empty [Barak and Shiloh 1985; Efe 1982; time, to maximize communication through- Krueger and Finkel 1983; Stankovic and put. Thus in Figure 6b, four processes that Sidhu 19841. Of course, the goals of maxi- must communicate should be put into slot mizing throughput, minimizing response 3, on machines 1, 2, 3, and 4 for optimum time, and keeping the load uniform are to performance. This scheduling technique some extent in conflict, so many of the can be combined with the hierarchical researchers try to evaluate different com- model of process management used in promises and trade-offs. MICROS by having each department head Each of these different approaches to maintain the matrix for its workers, assign- scheduling makes different assumptions ing processes to slots in the matrix and about what is known and what is most broadcasting time signals. important. The people trying to cluster Ousterhout also described several varia- processes to minimize communication tions to this basic method to improve per- costs, for example, assume that any process formance. 
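The basic placement step can be illustrated with a short sketch (the data layout and function are invented here; this is not Ousterhout's code): the matrix has one column per machine and one row per time slot, and a whole process group is dropped into a single row so that all its members run in the same slice on different machines.

    def coschedule(matrix, group):
        """Place all processes of `group` into one row (time slot), one per machine.

        `matrix` is a list of rows; each row has one slot per machine (None = free).
        Returns the row index used, or None if no row has enough free slots."""
        for slot, row in enumerate(matrix):
            free = [m for m, p in enumerate(row) if p is None]
            if len(free) >= len(group):
                for machine, proc in zip(free, group):
                    row[machine] = proc
                return slot
        return None

    # Six time slots and eight machines, as in Figure 6b.
    matrix = [[None] * 8 for _ in range(6)]
    print(coschedule(matrix, ["A1", "A2", "A3", "A4"]))   # -> 0: all four run together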
One of these breaks the matrix can run on any machine, that the comput- into rows and concatenates the rows to ing needs of each process are known in form one long row. With k machines, any advance, and that the interprocess com- k consecutive slots belong to different ma- munication traffic between each pair of chines. To allocate a new process group to processes is also known in advance. The slots, one lays a window k slots wide over people doing load balancing typically make the long row such that the leftmost slot is the realistic assumption that nothing about empty but the slot just outside the left edge the future behavior of a process is known. of the window is full. If sufficient empty The minimizers are generally theorists, slots are present in the window, the pro- whereas the load balancers tend to be


Figure 7. Two ways of statically allocating processes (nodes in the graph) to machines. Arcs show which pairs of processes communicate.
people making real systems who care less about optimality than about devising algorithms that can actually be used. Let us now briefly look at each of these approaches.

Graph-Theoretic Models. If the system consists of a fixed number of processes, each with known CPU and memory requirements, and a known matrix giving the average amount of traffic between each pair of processes, scheduling can be attacked as a graph-theoretic problem. The system can be represented as a graph, with each process a node and each pair of communicating processes connected by an arc labeled with the data rate between them.

The problem of allocating all the processes to k processors then reduces to the problem of partitioning the graph into k disjoint subgraphs, such that each subgraph meets certain constraints (e.g., total CPU and memory requirements below some limit). Arcs that are entirely within one subgraph represent internal communication within a single processor (=fast), whereas arcs that cut across subgraph boundaries represent communication between two processors (=slow). The idea is to find a partitioning of the graph that meets the constraints and minimizes the network traffic, or some variation of this idea. Figure 7a depicts a graph of interacting processes with one possible partitioning of the processes between two machines. Figure 7b shows a better partitioning, with less intermachine traffic, assuming that all the arcs are equally weighted. Many papers have been written on this subject, for example, Chow and Abraham [1982], Chow and Kohler [1979], Stone [1977, 1978], Stone and Bokhari [1978], and Lo [1984]. The results are somewhat academic, since in real systems virtually none of the assumptions (fixed number of processes with static requirements, known traffic matrix, error-free processors and communication) are ever met.

Heuristic Load Balancing. When the goal of the scheduling algorithm is dynamic, heuristic load balancing, rather than finding related clusters, a different approach is taken. Here the idea is for each processor to estimate its own load continually, for processors to exchange load information, and for process creation and migration to utilize this information.

Various methods of load estimation are possible. One way is just to measure the number of runnable processes on each CPU periodically and take the average of the last n measurements as the load. Another way [Bryant and Finkel 1981] is to estimate the residual running times of all the processes and define the load on a processor as the number of CPU seconds that all its processes will need to finish. The residual time can be estimated most simply by assuming it is equal to the CPU time already consumed. Bryant and Finkel also discuss other estimation techniques in which both the number of processes and length of remaining time are important. When round-robin scheduling is used, it is better to be competing against one process that needs 100 seconds than against 100 processes that each need 1 second.

Once each processor has computed its load, a way is needed for each processor to find out how everyone else is doing. One way is for each processor to just broadcast its load periodically. After receiving a broadcast from a lightly loaded machine, a processor should shed some of its load by giving it to the lightly loaded processor. This algorithm has several problems. First, it requires a broadcast facility, which may not be available. Second, it consumes
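The quantity being minimized is easy to state in code. The sketch below (invented names, equal treatment of all machines) simply scores an assignment such as the ones in Figure 7 by the total weight of the arcs that cross machine boundaries:

    def intermachine_traffic(arcs, assignment):
        """Sum the weights of all arcs whose endpoints are on different machines.

        `arcs` is a list of (process_a, process_b, weight) triples and `assignment`
        maps each process to a machine; a lower score means less network traffic."""
        return sum(w for a, b, w in arcs if assignment[a] != assignment[b])

    arcs = [("p1", "p2", 3), ("p2", "p3", 2), ("p3", "p4", 1)]
    cut_a = {"p1": 1, "p2": 1, "p3": 2, "p4": 2}      # cuts the p2-p3 arc
    cut_b = {"p1": 1, "p2": 1, "p3": 1, "p4": 2}      # cuts only the lighter p3-p4 arc
    print(intermachine_traffic(arcs, cut_a), intermachine_traffic(arcs, cut_b))   # 2 1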

Computing Surveys, Vol. 17, No. 4, December 1985 438 l A. S. Tanenbaum and R. van Renesse considerable bandwidth for all the “here is is not what it expected because that pro- my load” messages. Third, there is a great cessor has bid for and won other work in danger that many processors will try to the meantime. shed load to the same (previously) lightly loaded processor at once. 2.3.4 Distributed Deadlock Detection A different strategy [Barak and Shiloh Some theoretical work has been done in the 1985; Smith 19791 is for each processor area of detection of deadlocks in distributed periodically to pick another processor (pos- systems. How applicable this work may be sibly a neighbor, possibly at random) and in practice remains to be seen. Two kinds exchange load information with it. After of potential deadlocks are resource dead- the exchange, the more heavily loaded pro- locks and communication deadlocks. Re- cessor can send processes to the other one source deadlocks are traditional deadlocks, until they are equally loaded. In this model, in which all of some set of processes are if 100 processes are suddenly created in an blocked waiting for resources held by other otherwise empty system, after one ex- blocked processes. For example, if A holds change we will have two machines with 50 X and B holds Y, and A wants Y and B processes, and after two exchanges most wants X, a deadlock will result. probably four machines with 25 processes. In principle, this problem is the same in Processes diffuse around the network like centralized and distributed systems, but it a cloud of gas. is harder to detect in the latter because Actually migrating running processes is there are no centralized tables giving the trivial in theory, but close to impossible in status of all resources. The problem has practice. The hard part is not moving the mostly been studied in the context of code, data, and registers, but moving the database systems [Gligor and Shattuck environment, such as the current position 1980; Isloor and Marsland 1978; Menasce within all the open files, the current values and Muntz 1979; Obermarck 19821. of any running timers, pointers or file de- The other kind of deadlock that can oc- scriptors for communicating with tape cur in a distributed system is a communi- drives or other I/O devices, etc. All of these cation deadlock. Suppose A is waiting for a problems relate to moving variables and message from B and B is waiting for C and data structures related to the process that C is waiting for A. Then we have a deadlock. are scattered about inside the operating Chandy et al. [1983] present an algorithm system. What is feasible in practice is to for detecting (but not preventing) commu- use the load information to create new nication deadlocks. Very crudely summa- processes on lightly loaded machines, in- rized, they assume that each process that stead of trying to move running processes. is blocked waiting for a message knows If one has adopted the idea of creating which process or processes might send the new processes only on lightly loaded ma- message. When a process logically blocks, chines, another approach, called bidding, is they assume that it does not really block possible [Farber and Larson 1972; Stan- but instead sends a query message to each kovic and Sidhu 19841. When a process of the processes that might send it a real wants some work done, it broadcasts a re- (data) message. 
If one of these processes is quest for bids, telling what it needs (e.g., a blocked, it sends query messages to the 68000 CPU, 512K memory, floating point, processes it is waiting for. If certain mes- and a tape drive). sages eventually come back to the original Other processors can then bid for the process, it can conclude that a deadlock work, telling what their workload is, how exists. In effect, the algorithm is looking much memory they have available, etc. The for a knot in a directed graph. process making the request then chooses the most suitable machine and creates the 2.4 Fault Tolerance process there. If multiple request-for-bid messages are outstanding at the same time, Proponents of distributed systems often a processor accepting a bid may discover claim that such systems can be more relia- that the workload on the bidding machine ble than centralized systems. Actually,
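The pairwise exchange scheme mentioned above lends itself to a very small simulation. The following sketch is a crude model only (loads are plain numbers rather than real processes, and the pairing is random):

    import random

    def exchange_round(loads):
        """One round of pairwise load exchange: every processor picks a partner at
        random and the pair evens out its load, so work diffuses through the system."""
        for i in range(len(loads)):
            j = random.randrange(len(loads))
            if i == j:
                continue
            total = loads[i] + loads[j]
            loads[i], loads[j] = total // 2, total - total // 2

    loads = [100, 0, 0, 0]          # 100 processes suddenly created on one machine
    exchange_round(loads)
    exchange_round(loads)
    print(loads)                    # after a few rounds the load is roughly even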

there are at least two issues involved here: reliability and availability. Reliability has to do with the system not corrupting or losing one's data. Availability has to do with the system being up when it is needed. A system could be highly reliable in the sense that it never loses data, but at the same time be down most of the time and hence hardly usable. However, many people use the term "reliability" to cover availability as well, and we will not make the distinction either in the rest of the paper.

Distributed systems are potentially more reliable than a centralized system because if a system only has one instance of some critical component, such as a CPU, disk, or network interface, and that component fails, the system will go down. When there are multiple instances, the system may be able to continue in spite of occasional failures. In addition to hardware failures, one can also consider software failures. These are of two types: The software failed to meet the formal specification (implementation error), or the specification does not correctly model what the customer wanted (specification error). All work on program verification is aimed at the former, but the latter is also an issue. Distributed systems allow both hardware and software errors to be dealt with, albeit in somewhat different ways.

An important distinction should be made between systems that are fault tolerant and those that are fault intolerant. A fault-tolerant system is one that can continue functioning (perhaps in a degraded form) even if something goes wrong. A fault-intolerant system collapses as soon as any error occurs. Biological systems are highly fault tolerant; if you cut your finger, you probably will not die. If a memory failure garbles 1/10 of 1 percent of the program code or stack of a running program, the program will almost certainly crash instantly upon encountering the error.

It is sometimes useful to distinguish between expected faults and unexpected faults. When the ARPANET was designed, people expected to lose packets from time to time. This particular error was expected and precautions were taken to deal with it. On the other hand, no one expected a memory error in one of the packet-switching machines to cause that machine to tell the world that it had a delay time of zero to every machine in the network, which resulted in all network traffic being rerouted to the broken machine.

One of the key advantages of distributed systems is that there are enough resources to achieve fault tolerance, at least with respect to expected errors. The system can be made to tolerate both hardware and software errors, although it should be emphasized that in both cases it is the software, not the hardware, that cleans up the mess when an error occurs. In the past few years, two approaches to making distributed systems fault tolerant have emerged. They differ radically in orientation, goals, and attitude toward the theologically sensitive issue of the perfectibility of mankind (programmers in particular). One approach is based on redundancy and the other is based on the notion of an atomic transaction. Both are described briefly below.

2.4.1 Redundancy Techniques

All the redundancy techniques that have emerged take advantage of the existence of multiple processors by duplicating critical processes on two or more machines. A particularly simple, but effective, technique is to provide every process with a backup process on a different processor. All processes communicate by message passing. Whenever anyone sends a message to a process, it also sends the same message to the backup process, as shown in Figure 8. The system ensures that neither the primary nor the backup can continue running until it has been verified that both have correctly received the message.

Figure 8. Each process has its own backup process.

Thus, if one process crashes because of any hardware fault, the other one can continue. Furthermore, the remaining process
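The message-duplication rule is simple enough to state directly. The sketch below is schematic (delivery and acknowledgment are modeled with an ordinary method call; nothing here is taken from Borg et al.'s system):

    class BackedUpProcess:
        def __init__(self, name):
            self.name, self.inbox = name, []

        def deliver(self, message):
            self.inbox.append(message)     # in a real system: queue it for the process
            return True                    # acknowledgment of correct receipt

    def send(primary, backup, message):
        """Send `message` to a process and to its backup, and do not let the sender
        proceed until both have confirmed receipt (cf. Figure 8)."""
        if not (primary.deliver(message) and backup.deliver(message)):
            raise RuntimeError("primary or backup did not confirm receipt")

    p, b = BackedUpProcess("server"), BackedUpProcess("server-backup")
    send(p, b, "request 1")                # both copies now hold the same message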

can then clone itself, making a new backup to maintain the fault tolerance in the future. Borg et al. [1983] have described a system using these principles.

One disadvantage of duplicating every process is the extra processors required, but another, more subtle problem is that, if processes exchange messages at a high rate, a considerable amount of CPU time may go into keeping the processes synchronized at each message exchange. Powell and Presotto [1983] have described a redundant system that puts almost no additional load on the processes being backed up. In their system all messages sent on the network are recorded by a special "recorder" process (see Figure 9). From time to time, each process checkpoints itself onto a remote disk.

Figure 9. A recorder process copies and stores all network traffic without affecting the sender and receiver.

If a process crashes, recovery is done by sending the most recent checkpoint to an idle processor and telling it to start running. The recorder process then spoon-feeds it all the messages that the original process received between the checkpoint and the crash. Messages sent by the newly restarted process are discarded. Once the new process has worked its way up to the point of crash, it begins sending and receiving messages normally, without help from the recording process.

The beauty of this scheme is that the only additional work that a process must do to become immortal is to checkpoint itself from time to time. In theory, even the checkpoints can be dispensed with, if the recorder process has enough disk space to store all the messages sent by all the currently running processes. If no checkpoints are made, when a process crashes, the recorder will have to replay the process's whole history.

When a process successfully terminates, the recorder no longer has to worry about having to rerun it; so all the messages that it received can be safely discarded. For servers and other processes that never terminate, this idea must be varied to avoid repeating individual transactions that have successfully completed.

One drawback of this scheme is that it relies on reliable reception of all messages all the time. In practice, local networks are very reliable, but they are not perfect. If occasional messages can be lost, the whole scheme becomes much less attractive.

Still, one has to be very careful about reliability, especially when the problem is caused by faulty software. Suppose that a processor crashes because of a software bug. Both the schemes discussed above [Borg et al. 1983; Powell and Presotto 1983] deal with crashes by allocating a spare processor and restarting the crashed program, possibly from a checkpoint. Of course the new processor will crash too, leading to the allocation of yet another processor and another crash. Manual intervention will eventually be required to figure out what is going on. If the hardware designers could provide a bit somewhere that tells whether a crash was due to hardware or software, it would be very helpful.

Both of the above techniques apply only to tolerance of hardware errors. It is also possible, however, to use redundancy in distributed systems to make systems tolerant of software errors. One approach is to structure each program as a collection of modules, each one with a well-defined function and a precisely specified interface to the other modules. Instead of writing a module only once, N programmers are asked to program it, yielding N functionally identical modules.

During execution, the program runs on N machines in parallel. After each module

Computing Surveys, Vol. 17, No. 4, December 1985 Distributed Operating Systems 441 are available, the answers can also be property of not interleaving two jobs is checked against the specifications to guard called serializability. The goal of people against the possibility of accepting an an- working on the atomic transaction ap- swer that is clearly wrong. proach to fault tolerance has been to regain A variation of this idea can be used to the advantages of the old tape system, improve system performance. Instead of without giving up the convenience of always waiting for all the processes to fin- databases on disk that can be modified in ish, as soon as k of them agree on an place, and to be able to do everything in a answer, those that have not yet finished distributed way. are told to drop what they are doing, accept Lampson [1981] has described a way of the value found by the k processes, and achieving atomic transactions by building continue with the next module. Some work up a hierarchy of abstractions. We sum- in this area is discussed by Avizienis and marize his model below. Real disks can Chen [ 19771, Avizienis and Kelly [ 19841, crash during READ and WRITE opera- and Anderson and Lee [1981]. tions in unpredictable ways. Furthermore, even if a disk block is correctly written, 2.4.2 Atomic Transactions there is a small (but nonzero) probability of it subsequently being corrupted by a When multiple users on several machines newly developed bad spot on the disk sur- are concurrently updating a distributed face. The model assumes that spontaneous database and one or more machines crash, block corruptions are sufficiently infre- the potential for chaos is truly impressive. quent that the probability of two such In a certain sense, the current situation is events happening within some predeter- a step backward from the technology of the mined time T is negligible. To deal with 1950s when the normal way of updating a real disks, the system software must be able database was to have one magnetic tape, to tell whether or not a block is valid, for called the “master file,” and one or more example, by using a checksum. tapes with updates (e.g., daily sales reports The first layer of abstraction on top of from all of a company’s stores). The master the real disk is the “careful disk,” in which tape and updates were brought to the com- every CAREFUL-WRITE is read back im- puter center, which then mounted the mas- mediately to verify that it is correct. If the ter tape and one update tape, and ran the CAREFUL-WRITE persistently fails, the update program to produce a new master system marks the block as “bad” and then tape. This new tape was then used as the intentionally crashes. Since CAREFUL- “master” for use with the next update tape. WRITES are verified, CAREFUL-READS This scheme had the very real advantage will always be good, unless a block has gone that if the update program crashed, one bad after being written and verified. could always fall back on the previous mas- The next layer of abstraction is stable ter tape and the update tapes. In other storage. A stable storage block consists of words, an update run could be viewed as an ordered pair of careful blocks, which are either running correctly to completion (and typically corresponding careful blocks on producing a new master tape) or having no different drives, to minimize the chance of effect at all (crash part way through, new both being damaged by a hardware failure. tape discarded). 
Furthermore, update jobs The stable storage algorithm guarantees from different sources always ran in some that at least one of the blocks is always (undefined) sequential order. It never hap- valid. The STABLE-WRITE primitive pened that two users would concurrently first does a CAREFUL-WRITE on one read a field in a record (e.g., 6), each add 1 block of the pair, and then the other. If the to the value, and each store a 7 in that field, first one fails, a crash is forced, as men- instead of the first one storing a 7 and the tioned above, and the second one is left second storing an 8. untouched. The property of run-to-completion or do- After every crash, and at least once every nothing is called an atomic update. The time period T, a special cleanup process is
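The layering just described can be mimicked in a few lines. In the sketch below the two "drives" are dictionaries and a checksum stands in for the validity test; it follows the careful/stable distinction of Lampson [1981] only in outline:

    import zlib

    disk_a, disk_b = {}, {}                        # two drives: block number -> (data, checksum)

    def careful_write(disk, block_no, data):
        """CAREFUL-WRITE: write the block, then read it back to verify it."""
        disk[block_no] = (data, zlib.crc32(data))
        stored, crc = disk[block_no]               # immediate read-back
        if stored != data or crc != zlib.crc32(stored):
            raise IOError("careful write failed")  # a real system marks the block bad and crashes

    def stable_write(block_no, data):
        """STABLE-WRITE: update one careful block of the pair, then the other,
        so that at least one copy of the pair is valid at every instant."""
        careful_write(disk_a, block_no, data)
        careful_write(disk_b, block_no, data)

    stable_write(7, b"new value of the record")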

run to examine each stable block. If both blocks are "good" and identical, nothing has to be done. If one is "good" and one is "bad" (failure during a CAREFUL-WRITE), the "bad" one is replaced by the "good" one. If both are "good" but different (crash between two CAREFUL-WRITES), the second is replaced by a copy of the first. This algorithm allows individual disk blocks to be updated atomically and survive infrequent crashes.

Stable storage can be used to create "stable processors" [Lampson 1981]. To make itself crashproof, a CPU must checkpoint itself on stable storage periodically. If it subsequently crashes, it can always restart itself from the last checkpoint. Stable storage can also be used to create stable monitors in order to ensure that two concurrent processes never enter the same critical region at the same time, even if they are running on different machines.

Given a way to implement crashproof processors (stable processors) and crashproof disks (stable storage), it is possible to implement multicomputer atomic transactions. Before updating any part of the data in place, a stable processor first writes an intentions list to stable storage, providing the new value for each datum to be changed. Then it sets a commit flag to indicate that the intentions list is complete. The commit flag is set by atomically updating a special block on stable storage. Finally it begins making all the changes called for in the intentions list. Crashes during this phase have no serious consequences because the intentions list is stored in stable storage. Furthermore, the actual making of the changes is idempotent, so repeated crashes and restarts during this phase are not harmful.

Atomic actions have been implemented in a number of systems (see, e.g., Fridrich and Older [1981, 1984], Mitchell and Dion [1982], Brown et al. [1985], Popek et al. [1981], and Reed and Svobodova [1981]).

2.5 Services

In a distributed system, it is natural for user-level server processes to provide functions that have been traditionally provided by the operating system. This approach leads to a smaller (hence more reliable) kernel and makes it easier to provide, modify, and test new services. In the following sections, we look at some of these services, but first we look at how services and servers can be structured.

2.5.1 Server Structure

The simplest way to implement a service is to have one server that has a single, sequential thread of control. The main loop of the server looks something like this:

    while true do
      begin
        GetRequest;
        CarryOutRequest;
        SendReply
      end

This approach is simple and easy to understand, but has the disadvantage that if the server must block while carrying out the request (e.g., in order to read a block from a remote disk), no other requests from other users can be started, even if they could have been satisfied immediately. An obvious example is a file server that maintains a large disk block cache, but occasionally must read from a remote disk. In the time interval in which the server is blocked waiting for the remote disk to reply, it might have been able to service the next ten requests, if they were all for blocks that happened to be in the cache. Instead, the time spent waiting for the remote disk is completely wasted.

To eliminate this wasted time and improve the throughput of the server, the server can maintain a table to keep track of the status of multiple partially completed requests. Whenever a request requires the server to send a message to some other machine and wait for the result, the server stores the status of the partially completed request in the table and goes back to the top of the main loop to get the next message. If the next message happens to be the reply from the other machine, that is fine and it is processed, but if it is a new request for service from a different client, that can
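A sketch of such a request table follows. The message format and the two callbacks are assumptions made for the example; only the parking-and-resuming structure matters:

    CACHE = {}                                     # disk block cache: block number -> data

    def serve(messages, start_remote_read, send_reply):
        """Handle a stream of messages without ever blocking on the remote disk:
        cache misses are parked in a table keyed by the pending read and finished
        when the corresponding reply arrives."""
        pending = {}
        for msg in messages:                       # in a real server: an endless receive loop
            if msg["kind"] == "request":
                if msg["block"] in CACHE:
                    send_reply(msg["client"], CACHE[msg["block"]])
                else:
                    read_id = start_remote_read(msg["block"])   # does not block
                    pending[read_id] = msg         # park it; go get the next message
            else:                                  # a reply from the remote disk
                parked = pending.pop(msg["read_id"])
                CACHE[parked["block"]] = msg["data"]
                send_reply(parked["client"], msg["data"])

    serve([{"kind": "request", "client": "c1", "block": 7},
           {"kind": "reply", "read_id": 1, "data": b"block 7 contents"}],
          start_remote_read=lambda block: 1,
          send_reply=lambda client, data: print(client, data))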


also be started, and possibly completed before the reply for the first request comes in. In this way, the server is never idle if there is any work to be done.

Although this organization makes better use of the server's CPU, it makes the software much more complicated. Instead of doing nicely nested remote procedure calls to other machines whose services it needs, the server is back to using separate SEND and RECEIVE primitives, which are less structured.

One way of achieving both good performance and clean structure is to program the server as a collection of miniprocesses, which we call a cluster of tasks. Tasks share the same code and global data, but each task has its own stack for local variables and registers and, most important, its own program counter. In other words, each task has its own thread of control. Multiprogramming of the tasks can be done either by the operating system kernel or by a run-time library within each process.

There are two ways of organizing the tasks. The first way is to assign one task the job of "dispatcher," as shown in Figure 10. The dispatcher is the only task that accepts new requests for work. After inspecting an incoming request, it determines if the work can be done without blocking (e.g., if a block to be read is present in the cache). If it can, the dispatcher just carries out the work and sends the reply. If the work requires blocking, the dispatcher passes the work to some other task in the cluster, which can start work on it. When that task blocks, task switching occurs, and the dispatcher or some other previously blocked task can now run. Thus waiting for a remote procedure call to finish only blocks one task, not the whole server.

Figure 10. The dispatcher task waits for requests and passes them on to the worker tasks. (A message arrives at the dispatcher, which passes the request on to a worker; the tasks share common data.)

The other way of organizing the server is to have each task capable of accepting new requests for work. When a message arrives, the kernel gives it at random to one of the tasks listening to the address or port to which the message was addressed. That task carries the work out by itself, and no dispatcher is needed.

Both of these schemes require some method of locking the shared data to prevent races. This locking can be achieved explicitly by some kind of LOCK and UNLOCK primitives, or implicitly by having the scheduler not stop any task while it is running. For example, task switching only occurs when a task blocks. With ordinary user programs, such a strategy is undesirable, but with a server whose behavior is well understood, it is not unreasonable.

2.5.2 File Service

There is little doubt that the most important service in any distributed system is the file service. Many file services and file servers have been designed and implemented, so a certain amount of experience is available (e.g., Birrell and Needham [1980], Dellar [1982], Dion [1980], Fridrich and Older [1981], Fridrich and Older [1984], Mitchell and Dion [1982], Mullender and Tanenbaum [1985], Reed and Svobodova [1981], Satyanarayanan et al. [1985], Schroeder et

Computing Surveys, Vol. 17, No. 4, December 1985 444 l A. S. Tanenbaum and R. van Renesse al. [1985], Sturgis et al. [1980], Svobodova The client need not be concerned with how [1981], and Swinehart et al. [1979]). A or where the data in the file are stored. survey about file servers can be found in The directory service provides a mecha- Svobodova [ 19841. nism for naming and protecting tiles, so File services can be roughly classified they can be accessed conveniently and into two kinds, “traditional” and “robust.” safely. The directory service typically pro- Traditional file service is offered by nearly vides objects called directories that map all centralized operating systems (e.g., the ASCII names onto the internal identifica- UNIX file system). Files can be opened, tion used by the file service. read, and rewritten in place. In particular, a program can open a file, seek to the Design Issues. One important issue in a middle of the file, and update blocks of data distributed system is how closely the three within the file. The file server implements components of a traditional file service are these updates by simply overwriting the integrated. At one extreme, the system can relevant disk blocks. Concurrency control, have distinct disk, file, and directory ser- if there is any, usually involves locking vices that run on different machines and entire tiles before updating them. only interact via the official interprocess Robust file service, on the other hand, is communication mechanism. This approach aimed at those applications that require is the most flexible, because anyone need- extremely high reliability and whose users ing a different kind of file service (e.g., a B- are prepared to pay a significant penalty in tree file) can use the standard disk server. performance to achieve it. These file ser- It is also potentially the least efficient, vices generally offer atomic updates and since it generates considerable interserver similar features lacking in the traditional traffic. file service. At the other extreme, there are systems In the following paragraphs, we discuss in which all three functions are handled by some of the issues relating to traditional a single program, typically running on a file service (and file servers) and then look machine to which a disk is attached. With at those issues that specifically relate to this model, any application that needs a robust file service and servers. Since robust slightly different file naming scheme is file service normally includes traditional forced to start all over making its own file service as a subset, the issues covered private disk server. The gain, however, is in the first part also apply. increased run-time efficiency, because the Conceptually, there are three compo- disk, file, and directory services do not have nents that a traditional file service nor- to communicate over the network. mally has: Another important design issue in dis- tributed systems is garbage collection. If l disk service, the directory and file services are inte- l flat file service, grated, it is a straightforward matter to l directory service. ensure that, whenever a tile is created, it is The disk service is concerned with reading entered into a directory. If the directory and writing raw disk blocks without regard system forms a rooted tree, it is always to how they are organized. A typical com- possible to reach every file from the root mand to the disk service is to allocate and directory. 
However, if the file directory ser- write a disk block, and return a capability vice and file service are distinct, it may be or address (suitably protected) so that the possible to create files and directories that block can be read later. are not reachable from the root directory. The flat file service is concerned with In some systems this may be acceptable, providing its clients with an abstraction but in others unconnected files may be consisting of files, each of which is a linear regarded as garbage to be collected by the sequence of records, possibly l-byte records system. (as in UNIX) or client-defined records. The Another approach to the garbage collec- operations are reading and writing records, tion problem is to forget about rooted trees starting at some particular place in the file. altogether and permit the system to remove

Computing Surveys, Vol. 17, No. 4, December 1985 Distributed Operating Systems l 445 any file that has not been accessed for, say, model each request is completely self-con- five years. This approach is intended to tained (file name, file position, etc.), so a deal with the situation of a client creating newly reincarnated server will have no a temporary file and then crashing before trouble carrying it out. recording its existence anywhere. When the The price paid for this robustness, how- client is rebooted, it creates a new tempo- ever, is a slightly longer message,since each rary file, and the existence of the old one is file request must contain the full file name lost forever unless some kind of time-out and position. Furthermore, the virtual- mechanism is used. circuit model is sometimes less complex in There are a variety of other issues that environments in which the network can the designers of a distributed file system reorder messages,that is, deliver the second must address; for example, will the file ser- messagebefore the first. Local networks do vice be virtual-circuit oriented or stateless? not have this defect, but some wide-area In the virtual-circuit approach, the client networks and internetworks do. must do an OPEN on a file before reading it, at which time the file server fetches some Protection. Another important issue information about the file (in UNIX terms, faced by all file servers is access control- the i-node) into memory, and the client is who is allowed to read and write which file. given some kind of a connection identifier. In centralized systems, the same problem This identifier is used in subsequent exists and is solved by using either an ac- READS and WRITES. In the stateless ap- cess control list or capabilities. With access proach each READ request identifies the control lists, each file is associated with a file and file position in full, so the server list of users who may access it. The UNIX need not keep the i-node in memory (al- RWX bits are a simple form of access con- though most servers will maintain a cache trol list that divides all users into three for efficiency reasons). categories: owner, group, and others. With Both virtual-circuit and stateless file capabilities, a user must present a special servers can be used with the IS0 OS1 and “ticket” on each file access proving that he RPC models. When virtual circuits are used or she has access permission. Capabilities for communication, having the file server are normally maintained in the kernel to maintain open files is natural, However, prevent forgery. each request message can also be self-con- With a distributed system using remote tained so that the file server need not hold file servers, both of these approaches have the file open throughout the communica- problems. With access control lists the file tion session. server has to verify that the user in fact is Similarly, RPC fits well with a stateless who he or she claims to be. With capabili- file server, but it can also be used with a ties, how do you prevent users from making file server that maintains open files. In the them up? latter case the client does an RPC to the One way to make access control lists file server to OPEN the file and get back a viable is to insist that the client first set up tile identifier of some kind. Subsequent an authenticated virtual circuit with the RPCs can do READ and WRITE opera- file server. The authentication may involve tions using this file identifier. a trusted third party as in Birrell et al. 
The difference between these two be- [1982, 19841. When remote procedure calls comes clear when one considers the effects are used, setting up an authenticated ses- of a server crash on active clients. If a sion in advance is less attractive. The virtual-circuit server crashes and is then problem of authentication using RPC is quickly rebooted, it will almost always lose discussed by Birrell [ 19851. its internal tables. When the next request With capabilities, the protection nor- comes in to read the current block from file mally results from the fact that the kernel identifier 28, it will have no way of knowing can be trusted. With personal computers what to do. The client will receive an error on a network, how can the file server trust message, which will generally lead to the the kernel? After all, a user can easily boot client process aborting. In the stateless up a nonstandard kernel on his or her
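The contrast is easy to see in the shape of a single request. A stateless READ, sketched below with invented field names, carries everything the server needs, so a freshly rebooted server can honor it with no per-client state:

    def make_stateless_read(file_name, position, count):
        """Build a self-contained READ request: full file name, position, and count."""
        return {"op": "READ", "file": file_name, "offset": position, "count": count}

    print(make_stateless_read("/usr/ast/mailbox", position=4096, count=512))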

Computing Surveys, Vol. 17, No. 4, December 1985 446 . A. S. Tanenbaum and R. van Renesse machine. A possible solution is to encrypt Furthermore, even if the server did know, the capabilities, as discussed by Mullender having the server initiate contact with its and Tanenbaum [1984, 1985, 19861 and clients is certainly an unpleasant reversal Tanenbaum et al. [1986]. of the normal client-server relationship, in which clients make remote procedure calls Performance. Performance is one of the on servers, but not vice versa. More re- key problems in using remote file servers search is needed in this area before we have (especially from diskless workstations). a satisfactory solution. Some results are Reading a block from a local disk requires presented by Schroeder et al. [1985]. a disk access and a small amount of CPU processing. Reading from a remote server Reliability. Reliability is another key has the additional overhead of getting the design issue. The simplest approach is to data across the network. This overhead has design the system carefully, use good qual- two components: the actual time to move ity disks, and make occasional tape back- the bits over the wire (including contention ups. If a disk ever gets completely wiped resolution time, if any) and the CPU time out because of hardware failure, all the the file server must spend running the pro- work done since the last tape backup is lost. tocol software. Although this mode of operation may seem Cheriton and Zwaenepoel [ 19831describe scary at first, nearly all centralized com- measurements of network overhead in puter systems work this way, and with a the context of the V system. With an 8- mean time between failure of 20,000 or megahertz 68000 processor and a lo-me- more hours for disks these days, it works gabyte-per-second Ethernet, they observe pretty well in practice. that reading a 512-byte block from the local For those applications that demand a machine takes 1.3 milliseconds and from a higher level of reliability, some distributed remote machine 5.7 milliseconds, assuming systems have a more robust file service, as that the block is in memory and no disk mentioned at the beginning of this section. access is needed. They also observe that The simplest approach is mirrored disks: loading a 64K program from a remote file Every WRITE request is carried out in server takes 255 milliseconds versus 60 mil- parallel on two disk drives. At every instant liseconds locally, when transfers are in 16K the two drives are identical, and either one units. A tentative conclusion is that access can take over instantly for the other in the to a remote file server is four times as event of failure. expensive as to a local one. (It is also worth A refinement of this approach is to have noting that the V designers have gone to the file server offer stable storage and great lengths to achieve good performance; atomic transactions, as discussed earlier. many other file servers are much slower Systems offering this facility are described than V’s.) by Brown et al. [1985], Dion [1980], One way to improve the performance of Mitchell and Dion [1982], Needham and a distributed file system is to have both Herbert [ 19821, Reed and Svobodova clients and servers maintain caches of disk [ 19811,Sturgis et al. [ 19801,and Svobodova blocks and possibly whole files. However, [1981]. A detailed comparison of a number maintaining distributed caches has a num- of file servers offering sophisticated con- ber of serious problems. 
The worst of these currency control and atomic update facili- is, “What happens when someone modifies ties is given by Svobodova [1984]. We just the ‘master copy’ on the disk?” Does the touch on a few of the basic concepts here. file server tell all the machines maintaining At least four different kinds of files can caches to purge the modified block or be supported by a file server. Ordinary files file from their caches by sending them consist of a sequence of disk blocks that “unsolicited messages” as in XDFS [Sturgis may be updated in place and that may be et al. 1980]? How does the server even know destroyed by disk or server crashes. who has a cache? Introducing a complex Recoverable files have the property that centralized administration to keep track is groups of WRITE commands can be brack- probably not the way to go. eted by BEGIN TRANSACTION and

Recoverable files have the property that groups of WRITE commands can be bracketed by BEGIN TRANSACTION and END TRANSACTION, and that a crash or abort midway leaves the file in its original state. Robust files are written on stable storage and contain sufficient redundancy to survive disk crashes (generally two disks are used). Finally, multiversion files consist of a sequence of versions, each of which is immutable. Changes are made to a file by creating a new version. Different file servers support various combinations of these.

All robust file servers need some mechanism for handling concurrent updates to a file or group of files. Many of them allow users to lock a file, page, or record to prevent conflicting writes. Locking introduces the problem of deadlocks, which can be dealt with by using two-phase locking [Eswaran et al. 1976] or timestamps [Reed 1983].

When the file system consists of multiple servers working in parallel, it becomes possible to enhance reliability by replicating some or all files over multiple servers. Reading also becomes easier because the workload can now be split over two servers, but writing is much harder because multiple copies must be updated simultaneously, or this effect simulated somehow.

One approach is to distribute the data but keep some of the control information (semi-)centralized. In LOCUS [Popek et al. 1981; Walker et al. 1983], for example, files can be replicated at many sites, but when a file is opened, the file server at one site examines the OPEN request, the number and status of the file's copies, and the state of the network. It then chooses one site to carry out the OPEN and the subsequent READs and WRITEs. The other sites are brought up to date later.

2.5.3 Print Service

Compared with file service, on which a great deal of time and energy has been expended by a large number of people, the other services seem rather meager. Still, it is worth saying at least a little bit about a few of the more interesting ones.

Nearly all distributed systems have some kind of print service to which clients can send files, file names, or capabilities for files with instructions to print them on one of the available printers, possibly with some text justification or other formatting beforehand. In some cases the whole file is sent to the print server in advance, and the server must buffer it. In other cases only the file name or capability is sent, and the print server reads the file block by block as needed. The latter strategy eliminates the need for buffering (read: a disk) on the server side but can cause problems if the file is modified after the print command is given but prior to the actual printing. Users generally prefer "call-by-value" rather than "call-by-reference" semantics for printers.

One way to achieve the "call-by-value" semantics is to have a printer spooler server. To print a file, the client process sends the file to the spooler. When the file has been copied to the spooler's directory, an acknowledgment is sent back to the client.

The actual print server is then implemented as a print client. Whenever the print client has nothing to print, it requests another file or block of a file from the print spooler, prints it, and then requests the next one. In this way the print spooler is a server to both the client and the printing device.

Printer service is discussed by Janson et al. [1983] and Needham and Herbert [1982].

2.5.4 Process Service

Every distributed operating system needs some mechanism for creating new processes. At the lowest level, deep inside the system kernel, there must be a way of creating a new process from scratch. One way is to have a FORK call, as UNIX does, but other approaches are also possible. For example, in Amoeba, it is possible to ask the kernel to allocate chunks of memory of given sizes. The caller can then read and write these chunks, loading them with the text, data, and stack segments for a new process. Finally, the caller can give the filled-in segments back to the kernel and ask for a new process built up from these pieces. This scheme allows processes to be created remotely or locally, as desired.
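The segment-based process creation just described can be sketched as follows. The function names (allocate_segment, write_segment, make_process) and types are invented stand-ins for whatever calls an Amoeba-like kernel would actually export; only the overall shape, allocate the chunks, fill them in, then hand them back to get a process, comes from the text above.

```c
/* Sketch of building a process from filled-in segments, as described above.
 * All functions are invented stand-ins, not a real kernel interface.
 * Error checking is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct segment { size_t size; unsigned char *mem; };

static struct segment allocate_segment(size_t size) {
    struct segment s = { size, calloc(size, 1) };   /* kernel hands back a chunk */
    return s;
}
static void write_segment(struct segment *s, const void *data, size_t n) {
    memcpy(s->mem, data, n);                        /* caller fills the chunk in */
}
static int make_process(struct segment *text, struct segment *data,
                        struct segment *stack) {
    (void)text; (void)data; (void)stack;
    return 42;              /* handle (capability) for the newly built process */
}

int main(void) {
    struct segment text  = allocate_segment(4096);
    struct segment data  = allocate_segment(4096);
    struct segment stack = allocate_segment(4096);
    write_segment(&text, "\x4e\x71", 2);            /* e.g., loaded program code */
    int child = make_process(&text, &data, &stack); /* assemble the pieces */
    printf("new process handle: %d\n", child);
    return 0;
}
```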


At a higher level it is frequently useful to have a process server that one can ask whether there is a Pascal, TROFF, or some other service, in the system. If there is, the request is forwarded to the relevant server. If not, it is the job of the process server to build a process somewhere and give it the request. After, say, a very large-scale integration (VLSI) design rule checking server has been created and has done its work, it may or may not be a good idea to keep it in the machine where it was created, depending on how much work (e.g., network traffic) is required to load it, and how often it is called. The process server could easily manage a server cache on a least recently used basis, so that servers for common applications are usually preloaded and ready to go. As special-purpose VLSI processors become available for compilers and other applications, the process server should be given the job of managing them in a way that is transparent to the system's users.

2.5.5 Terminal Service

How the terminals are tied to the system obviously depends to a large extent on the system architecture. If the system consists of a small number of minicomputers, each with a well-defined and stable user population, then each terminal can be hard wired to the computer that its user normally logs on to. If, however, the system consists entirely of a pool of processors that are dynamically allocated as needed, it is better to connect all the terminals to one or more terminal servers that serve as concentrators.

The terminal servers can also provide such features as local echoing, intraline editing, and window management, if desired. Furthermore, the terminal server can also hide the idiosyncrasies of the various terminals in use by mapping them all onto a standard virtual terminal. In this way the rest of the software deals only with the virtual terminal characteristics and the terminal server takes care of the mappings to and from all the real terminals. The terminal server can also be used to support multiple windows per terminal, with each window acting as a virtual terminal.

2.5.6 Mail Service

Electronic mail is a popular application of computers these days. Practically every university computer science department in the Western world is on at least one international network for sending and receiving electronic mail. When a site consists of only one computer, keeping track of the mail is easy. When a site has dozens of computers spread over multiple local networks, however, users often want to be able to read their mail on any machine they happen to be logged on to. This desire gives rise to the need for a machine-independent mail service, rather like a print service that can be accessed systemwide. Almes et al. [1985] discuss how mail is handled in the Eden system.

2.5.7 Time Service

There are two ways to organize a time service. In the simplest way, clients can just ask the service what time it is. In the other way, the time service can broadcast the correct time periodically, to keep all the clocks on the other machines in sync. The time server can be equipped with a radio receiver tuned to WWV or some other transmitter that provides the exact time down to the microsecond.

Even with these two mechanisms, it is impossible to have all processes exactly synchronized. Consider what happens when a process requests the time of day from the time server. The request message comes in to the server, and a reply is sent back immediately. That reply must propagate back to the requesting process, cause an interrupt on its machine, have the kernel started up, and finally have the time recorded somewhere. Each of these steps introduces an unknown, variable delay. On an Ethernet, for example, the amount of time required for the time server to put the reply message onto the network is nondeterministic and depends on the number of machines contending for access at that instant.
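One consequence of these variable delays is that a client can at best estimate the server's clock, for instance by assuming the reply took about half of the measured round-trip time. The sketch below illustrates that standard estimate; it is not taken from any of the systems discussed here, and get_server_time() is a stand-in for the actual request/reply exchange.

```c
/* Estimating the time server's clock despite unknown delays: measure the
 * round-trip time and assume the reply took roughly half of it. */
#include <stdio.h>
#include <time.h>

static double now(void) {                 /* local clock, in seconds */
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double get_server_time(void) {     /* stand-in for the RPC to the time server */
    return now() + 0.020;                 /* pretend the server is 20 ms ahead */
}

int main(void) {
    double t0 = now();
    double server = get_server_time();    /* request goes out, reply comes back */
    double t1 = now();
    double rtt = t1 - t0;
    /* Estimate: the server's time as of t1 is what it sent plus half the RTT. */
    double estimate = server + rtt / 2.0;
    printf("estimated clock offset: %.6f s (rtt %.6f s)\n", estimate - t1, rtt);
    return 0;
}
```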

If a large distributed system has only one time server, messages to and from it may have to travel a long distance and pass over store-and-forward gateways with variable queuing delays. If there are multiple time servers, they may get out of synchronization because their crystals run at slightly different rates. Einstein's special theory of relativity also puts constraints on synchronizing remote clocks.

The result of all these problems is that having a single global time is impossible. Distributed algorithms that depend on being able to find a unique global ordering of widely separated events may not work as expected. A number of researchers have tried to find solutions to the various problems caused by the lack of global time (see, e.g., Jefferson [1985], Lamport [1978, 1984], Marzullo and Owicki [1985], Reed [1983], and Reif and Spirakis [1984]).

2.5.8 Boot Service

The boot service has two functions: bringing up the system from scratch when the power is turned on and helping important services survive crashes. In both cases, it is helpful if the boot server has a hardware mechanism for forcing a recalcitrant machine to jump to a program in its own read-only memory (ROM) in order to reset it. The ROM program could simply sit in a loop waiting for a message from the boot service. The message would then be loaded into that machine's memory and executed as a program.

The second function alluded to above is the "immortality service." An important service could register with the boot service, which would then poll it periodically to see if it were still functioning. If not, the boot service could initiate measures to patch things up, for example, forcibly reboot it or allocate another processor to take over its work. To provide high reliability, the boot service should itself consist of multiple processors, each of which keeps checking that the others are still working properly.

2.5.9 Gateway Service

If the distributed system in question needs to communicate with other systems at remote sites, it may need a gateway server to convert messages and protocols from internal format to those demanded by the wide-area network carrier.

3. EXAMPLES OF DISTRIBUTED OPERATING SYSTEMS

Having disposed with the principles, it is now time to look at some actual distributed systems that have been constructed as research projects in universities around the world. Although many such projects are in various stages of development, space limitations prevent us from describing all of them in detail. Instead of saying a few words about each system, we have chosen to look at four systems that we consider representative. Our selection criteria were as follows. First, we only chose systems that were designed from scratch as distributed systems (systems that gradually evolved by connecting together existing centralized systems or are multiprocessor versions of UNIX were excluded). Second, we only chose systems that have actually been implemented; paper designs did not count. Third, we only chose systems about which a reasonable amount of information was available.

Even with these criteria, there were many more systems that could have been discussed. As an aid to the reader interested in pursuing this subject further, we provide here some references to other relevant work: Accent [Fitzgerald and Rashid 1985; Rashid and Robertson 1981], Argus [Liskov 1982, 1984; Liskov and Scheifler 1982; Oki et al. 1985], Chorus [Zimmermann et al. 1981], CRYSTAL [Dewitt et al. 1984], DEMOS [Powell and Miller 1983], Distributed UNIX [Luderer et al. 1981], HXDP [Jensen 1978], LOCUS [Popek et al. 1981; Walker et al. 1983; Weinstein et al. 1985], Meglos [Gaglianello and Katseff 1985], MICROS [Curtis and Wittie 1984; Mohan and Wittie 1985; Wittie and Curtis 1985; Wittie and van Tilborg 1980], RIG [Ball et al. 1976], Roscoe/Arachne [Finkel et al. 1979; Solomon and Finkel 1978, 1979], and the work at Xerox Palo Alto Research Center [Birrell 1985; Birrell and Nelson 1984; Birrell et al. 1984; Boggs et al. 1980; Brown et al. 1985; Swinehart et al. 1979].
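Before turning to the example systems, here is a sketch of the "immortality service" polling loop described in Section 2.5.8 above. The poll_server() and restart_server() calls are hypothetical; a real boot server would also have to agree on polling intervals and time-outs with each registered service.

```c
/* Sketch of the boot service's polling loop: poll each registered server and
 * restart any that fail to answer.  Names and behavior are illustrative. */
#include <stdio.h>
#include <unistd.h>

#define NSERVERS 3

static int  poll_server(int id)    { return id != 1; }  /* pretend server 1 is dead */
static void restart_server(int id) { printf("rebooting or reallocating server %d\n", id); }

int main(void)
{
    for (int round = 0; round < 2; round++) {   /* a real boot server loops forever */
        for (int id = 0; id < NSERVERS; id++)
            if (!poll_server(id))               /* no proper reply within the timeout? */
                restart_server(id);
        sleep(1);                               /* agreed-upon polling interval */
    }
    return 0;
}
```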

The systems we examine here are the Cambridge Distributed Computing System, Amoeba, V, and Eden. The discussion of each system follows the list of topics treated above, namely, communication primitives, naming and protection, resource management, fault tolerance, and services.

3.1 The Cambridge Distributed Computing System

The Computing Laboratory at the University of Cambridge has been doing research in networks and distributed systems since the mid-1970s, first with the Cambridge ring and later with the Cambridge Distributed Computing System [Needham and Herbert 1982]. The Cambridge ring is not a token-passing ring, but rather contains several minipacket slots circulating around the ring. To send a packet, a machine waits until an empty slot passes by, then inserts a minipacket containing the source, destination, some flag bits, and 2 bytes of data. Although the 2-byte minipackets themselves are occasionally useful (e.g., for acknowledgments), several block-oriented protocols have been developed for reliably exchanging 2K packets by accumulating 1024 minipackets. The nominal ring bandwidth is 10 megabytes per second, but since each minipacket has 2 bytes of data and 3 bytes of overhead, the effective bandwidth is 4 megabytes per second.

The Cambridge ring project was very successful, with copies of the ring currently in operation at many universities and companies in the United Kingdom and elsewhere. The availability of the ring led to research on distributed computing systems, initially using nine Computer Automation LS14 minicomputers and later using about a dozen Motorola 68000s, under the direction of Roger Needham.

The Cambridge system is primarily composed of two components: the processor bank and the servers. When a user logs in, he or she normally requests one machine from the processor bank, uses it as a personal computer for the entire work session, and returns it when logging out. Processors are not normally dynamically allocated for short periods of time. The servers are dedicated machines that provide various useful services, including file service, name service, boot service, etc. The number and location of these servers is relatively static.

3.1.1 Communication Primitives

Owing to the evolution from network to distributed system described earlier, the communication primitives are usually described as network protocols rather than language primitives. The choice of the primitives was closely tuned to the capabilities of the ring in order to optimize performance. Nearly all communication is built up from sending packets consisting of a 2-byte header, a 2-byte process identifier, up to 2048 data bytes, and a 2-byte checksum. On top of this basic packet protocol are a simple remote procedure call protocol and a byte stream protocol.

The basic packet protocol, which is a pure datagram system, is used by the single-shot protocol to build up something similar to a remote procedure call. It consists of having the client send a packet to the server containing the request, and then having the server send a reply. Some machines are multiprogrammed, so that the second minipacket is used to route the incoming packet to the correct process. The request packet itself contains a function code and the parameters, if any. The reply packet contains a status code and the result, if any. Clients do not acknowledge receipt of the result.

Some applications, such as terminal handling and file transfer, work better with a flow-controlled, virtual-circuit protocol. The byte stream protocol is used for these applications. This protocol is a full-duplex, connection-oriented protocol, with full flow control and error control.

3.1.2 Naming and Protection

Services can be located in the Cambridge system by using the name server. To look up a name, the client sends an ASCII string to the name server, which then looks it up in its tables and returns the machine number where the service is located, the port used to address it, and the protocol it expects.
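The name-server exchange just described can be pictured with a small sketch. The reply layout and the name_lookup() helper are illustrative assumptions, not the actual Cambridge wire format; the three returned values (machine, port, protocol) are the ones named in the text.

```c
/* Sketch of a Cambridge-style name lookup: send an ASCII service name,
 * get back the machine number, port, and expected protocol. */
#include <stdio.h>
#include <string.h>

struct ns_reply {
    int machine;     /* machine number where the service runs */
    int port;        /* port used to address it               */
    int protocol;    /* protocol the service expects          */
};

static int name_lookup(const char *service, struct ns_reply *r) {
    /* Stand-in for the single-shot request to the name server's fixed address. */
    if (strcmp(service, "fileserver") == 0) {
        r->machine = 14; r->port = 3; r->protocol = 1;   /* made-up values */
        return 0;
    }
    return -1;                                           /* unknown name */
}

int main(void) {
    struct ns_reply r;
    if (name_lookup("fileserver", &r) == 0)
        printf("fileserver: machine %d, port %d, protocol %d\n",
               r.machine, r.port, r.protocol);
    return 0;
}
```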

The name server stores service names as unstructured ASCII strings, which are simply matched against incoming requests character by character; that is, it does not manage hierarchical names. The name server itself has a fixed address that never changes, so this address may be embedded into programs.

Although the service database is relatively static, from time to time names must be added or deleted to the name server's database. Commands are provided for this purpose, but for protection reasons these commands may only be executed by the system administrator.

Finding the location of a service is only half the work. To use most services, a process must identify itself in an unforgeable way, so that the service can check to see whether that user is authorized. This identification is handled by the Active Name Server, which maintains a table of currently logged-in users. Each table entry has four fields: the user's login name, his or her session key (a big random number), the user's class (e.g., faculty, student), and a control key, as shown in Figure 11.

Figure 11. The Active name table, with one row per logged-in user and columns for Login, Session, Class, and Control.

To use a service, a user supplies the service with his login name, session key (obtained at login time), and class. The service can then ask the Active Name Server if such an entry exists. Since session keys are sparse, it is highly unlikely that a student will be able to guess the current session key for the computer center director, and thus be able to obtain services reserved for the director. The control key must be presented to change an entry, thus providing a mechanism to restrict changing the Active Name Server's table to a few people.

3.1.3 Resource Management

The main resource managed by the system is the processor bank, handled by a service called the resource manager. Usually a user requests a processor to be allocated at login time, and then loads it with a single-user operating system. The processor then becomes the user's personal computer for the rest of the login session.

The resource manager accepts requests to allocate a processor. In these requests the user specifies a CPU type (e.g., 68000), a list of attributes (e.g., memory size), and a program to be run. The resource manager then selects the most suitable CPU currently available for allocation. Various defaults are available, so, for example, a user can specify wanting to run TRIPOS (a straightforward single-user operating system), and the resource manager will select an appropriate CPU type if none has been specified.

The downloading of programs into processor bank machines is controlled by a server called the ancilla, although some of the machines have intelligent ring interfaces that actually do most of the work. The ancilla also helps simulate the machine's console and front panel, so that users have the same control over a processor bank machine as they would over real personal computers on their desks.

3.1.4 Fault Tolerance

The approach taken to fault tolerance in the Cambridge system is to make it easy to bring servers back up after a crash. When a ring interface detects a special minipacket whose source is the name server, it reboots the processor by forcing it to jump to a program in ROM. This program then sends a request to the boot server, which in turn goes to the name server asking for reverse name lookup. The name server then searches its tables to find the service that is running on the machine from which the reverse lookup request came. As soon as the reply comes in, the server knows what it is supposed to be doing and can request the resource manager and ancilla to download the appropriate program.
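A minimal sketch of the Active Name Server check described above, under the assumption that an entry is exactly the four fields listed in the text. The table contents echo the (largely illegible) rows of Figure 11 and should be read as made-up examples.

```c
/* Toy Active Name Server: a service verifies a (login, session key, class)
 * triple against the table of logged-in users.  Values are illustrative. */
#include <stdio.h>
#include <string.h>

struct active_entry {
    const char *login;     /* user's login name           */
    long        session;   /* session key (large, sparse) */
    const char *class;     /* e.g., "STUDENT", "FACULTY"  */
    long        control;   /* key needed to change entry  */
};

static struct active_entry table[] = {
    { "MARVIN",  91432, "STUDENT", 31513 },
    { "BARBARA", 61300, "STUDENT", 27130 },
    { "AMY",     42106, "FACULTY", 31616 },
};

static int authorize(const char *login, long session, const char *class) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (!strcmp(table[i].login, login) && table[i].session == session
            && !strcmp(table[i].class, class))
            return 1;               /* entry exists: the user is who they claim */
    return 0;
}

int main(void) {
    printf("%d\n", authorize("AMY", 42106, "FACULTY"));   /* 1: accepted  */
    printf("%d\n", authorize("AMY", 12345, "DIRECTOR"));  /* 0: wrong key */
    return 0;
}
```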


When machines are physically reset or powered up, the same procedure is carried out automatically.

Another area in which some effort has been put to make the system fault tolerant is the file system, which supports atomic updates on special files. This facility is described in the next section.

3.1.5 Services

We have already described several key servers, including the name server, resource manager, ancilla, and active name server. Other small servers include the time server, print server, login server, terminal server, and error server, which records system errors for maintenance purposes. The file server is examined here.

The file system started out with the idea of a single universal file server that provided basic storage service but a very primitive naming and protection system, coupled with single-user TRIPOS operating systems in the processor bank machines, in which the naming and directory management would be done. The CAP computer (a large research machine within the Cambridge Computing Laboratory that does not have any disks of its own) also uses the file server. After some experience with this model, it was decided to create a new server, known as the filing machine, as a front end to the file system to improve the performance (mostly by providing the filing machine with a large cache, something that the small user machines could not afford). The CAP machine, which has adequate memory, continues to use the file server directly. The position of the filing machine is shown in Figure 12.

Figure 12. The filing machine is positioned between the users and the file server. It maintains a block cache and handles ASCII names.

The universal file server supports one basic file type, with two minor variations. The basic file type is an unstructured file consisting of a sequence of 16-bit words, numbered from 0 to some maximum. Operations are provided for reading or writing arbitrary numbers of words, starting anywhere in the file. Each file is uniquely identified by a 64-bit PUID (Permanent User IDentifier) consisting of a 32-bit disk address and a 32-bit random number.

The first variation is the special file, which has the property that writes to it are atomic, that is, they will either succeed completely or not be done at all. They will never be partly completed, even in the face of server crashes.

The second variation is a file called an index, which is a special file consisting of a sequence of slots, each holding one PUID. When a file is created, the process creating it must specify an index and slot in that index into which the new file's PUID is stored. Since indexes are also files and as such have PUIDs themselves, an index may contain pointers (PUIDs) to other indices, allowing arbitrary directory trees and graphs to be built. One index is distinguished as being the root index, which has the property that the file server's internal garbage collector will never remove a file reachable from the root index.

In the initial implementation, the full code of the TRIPOS operating system was loaded into each pool processor. All of the directory management and handling of ASCII names was done on the processor bank machines. Unfortunately, this scheme had several problems. First, TRIPOS was rather large and filled up so much memory that little room was left for buffers, meaning that almost every read or write request actually caused a disk access (the universal file server has hardly any buffers).


Second, looking up a name in the directory hierarchy required all the intermediate directories between the starting point and the file to be physically transported from the file server to a machine doing the search.

To get around these problems, a filing machine with a large cache was inserted in front of the file server. This improvement allowed programs to request files by name instead of PUID, with the name lookup occurring in the filing machine now. Owing to the large cache, most of the relevant directories are likely to be already present in the filing machine, thus eliminating much network traffic. Furthermore, it allowed the TRIPOS code in the user machines to be considerably stripped, since the directory management was no longer needed. It also allowed the file server to read and write in large blocks; this was previously possible, but rarely done because of lack of buffer space on the user side. The resulting improvements were substantial.

3.1.6 Implementation

As should be clear by now, the whole Cambridge system is a highly pragmatic design, which from its inception [Wilkes and Needham 1980] was designed to be actually used by a substantial user community. About 90 machines are connected by three rings now, and the system is fairly stable. A related research project was the connection of a number of Cambridge rings via a satellite [Adams et al. 1982]. Future research may include interconnection of multiple Cambridge rings using very-high-speed (2-megabit-per-second) lines.

3.2 Amoeba

Amoeba is a research project on distributed operating systems being carried out at the Vrije Universiteit in Amsterdam under the direction of Andrew Tanenbaum. Its goal is to investigate capability-based, object-oriented systems and to build a working prototype system to use and evaluate. It currently runs on a collection of 24 Motorola 68010 computers connected by a 10-megabytes-per-second local network.

The Amoeba architecture consists of four principal components, as shown in Figure 13. First are the workstations, one per user, on which users can carry out editing and other tasks that require fast interactive response. Second are the pool processors, a group of CPUs that can be dynamically allocated as needed, used, and then returned to the pool. For example, the "make" command might need to do six compilations; so six processors could be taken out of the pool for the time necessary to do the compilation and then returned. Alternatively, with a five-pass compiler, 5 x 6 = 30 processors could be allocated for the six compilations, gaining even more speedup. Third are the specialized servers, such as directory, file, and block servers, database servers, bank servers, boot servers, and various other servers with specialized functions. Fourth are the gateways, which are used to link Amoeba systems at different sites (and, eventually, different countries) into a single, uniform system.

Figure 13. The Amoeba architecture: workstations, a processor pool, specialized servers (file, data base, etc.), and gateways.

All the Amoeba machines run the same kernel, which primarily provides message-passing services and little else. The basic idea behind the kernel was to keep it small, not only to enhance its reliability, but also to allow as much as possible of the operating system to run as user processes, providing for flexibility and experimentation.

Some of the research issues addressed by the project are how to put as much of the operating system as possible into user processes, how to use the processor pool, how to integrate the workstations and processor pool, and how to connect multiple Amoeba sites into a single coherent system using wide-area networks. All of these issues use objects and capabilities in a uniform way.


3.2.1 Communication Primitives

The conceptual model for Amoeba communication is the abstract data type or object model, in which clients perform operations on objects in a location-independent manner. To implement this model, Amoeba uses a minimal remote procedure call model for communication between clients and servers. The basic client primitive is to send a message of up to about 32 kilobytes to a server and then block waiting for the result. Servers use GET-REQUEST and PUT-REPLY to get new work and send back the results, respectively. These primitives are not embedded in a language environment with automatic stub generation. They are implemented as small library routines that are used to invoke the kernel directly from C programs.

All the primitives are reliable in the sense that detection and retransmission of lost messages, acknowledgment processing, and message-to-packet and packet-to-message management are all done transparently by the kernel. Messages are unbuffered. If a message arrives and no one is expecting it, the message is simply discarded. The sending kernel then times out and tries again. Users can specify how long the kernel should retransmit before giving up and reporting failure. The idea behind this strategy is that server processes are generally cloned N-fold, so normally there will be a server waiting. Since a message is discarded only if the system is badly overloaded, having the client time out and try again later is not a bad idea.

Although the basic message primitives are blocking, special provision is made for handling emergency messages. For example, if a database server is currently blocked waiting for a file server to get some data for it, and a user at a terminal hits the BREAK key (indicating that he or she wants to kill off the whole request), some way is needed to gracefully abort all the processes working on behalf of that request. In the Amoeba system the terminal server generates and sends a special EXCEPTION message, which causes an interrupt at the receiving process. This message forces the receiver to stop working on the request and send an immediate reply with a status code of REQUEST ABORTED. If the receiver was also blocked waiting for a server, the exception is recursively propagated all the way down the line, forcing each server in turn to finish immediately. In this manner, all the nested processes terminate normally (with error status), so that little violence is done to the nesting structure. In effect, an EXCEPTION message does not terminate execution. Instead, it just says "Force normal termination immediately, even if you are not done yet, and return an error status."

3.2.2 Naming and Protection

All naming and protection issues in Amoeba are dealt with by a single, uniform mechanism: sparse capabilities [Tanenbaum et al. 1986]. The system supports objects such as directories, files, disk blocks, processes, bank accounts, and devices, but not small objects such as integers. Each object is owned by some service and managed by the corresponding server processes.

When an object is created, the process requesting its creation is given a capability for it. Using this capability, a process can carry out operations on the object, such as reading or writing the blocks of a file, or starting or stopping a process. The number and types of operations applicable to an object are determined by the service that created the object; a bit map in the capability tells which of those the holder of the capability is permitted to use. Thus the whole of Amoeba is based on the conceptual model of abstract data types managed by services, as mentioned above. Users view the Amoeba environment as a collection of objects, named by capabilities, on which they can perform operations. This is in contrast to systems in which the user view is a collection of processes connected by virtual circuits.

Each object has a globally unique name contained in its capabilities. Capabilities are managed entirely by user processes; they are protected cryptographically, not by any kernel-maintained tables or mechanisms.

A capability has four fields, as shown in Figure 14:

- The service port: a sparse address corresponding to the service that owns the object, such as a file or directory service.
- The object number: an internal identifier that the service uses to tell which of its objects this is (comparable to the i-number in UNIX).
- The rights field: a bit map telling which operations on the object are permitted.
- The check field: a large random number used to authenticate the capability.

Figure 14. An Amoeba capability, with fields for the service port, object number, rights, and check field.

When a server is asked to create an object, it picks an available slot in its internal tables (e.g., a free i-node, in UNIX terminology), puts the information about the new object there, and picks a new random number to be used exclusively to protect this new object. Each server is free to use any protection scheme that it wants to, but the normal one is for it to build a capability containing its port, the object number, the rights (initially all present), and a known constant. The two latter fields are then thoroughly mixed by encrypting them with the random number as key, which is then stored in the internal table.

Later, when a process performs an operation on the object, a message containing the object's capability is sent to the server. The server uses the (plaintext) object number to find the relevant internal table entry and extract the random number, which is then used to decrypt the rights and check fields. If the decryption yields the correct known constant, the rights field is believed and the server can easily check whether the requested operation is permitted. More details about protection of capabilities can be found in Mullender and Tanenbaum [1984, 1986] and Tanenbaum et al. [1986].

Capabilities can be stored in directories managed by the directory service. A directory is effectively a set of (ASCII string, capability) pairs. The most common directory operation is for a user to present the directory server with a capability for a directory (itself an object) and an ASCII string and ask for the capability that corresponds to that string in the given directory. Other operations are entering and deleting (ASCII string, capability) pairs. This naming scheme is flexible in that a directory may contain capabilities for an arbitrary mixture of object types and locations, but it is also uniform in that every object is controlled by a capability. A directory entry may, of course, be for another directory, and so it is simple to build up a hierarchical (e.g., UNIX-like) directory tree, or even more general naming graphs. Furthermore, a directory may also contain a capability for a directory managed by a different directory service. As long as all the directory services have the same interfaces with the user, one can distribute objects over directory services in an arbitrary way.

3.2.3 Resource Management

Resource management in Amoeba is performed in a distributed way, again using capabilities. Each Amoeba machine (pool processor, workstation, etc.) runs a resource manager process that controls that machine. This process actually runs inside the kernel for efficiency reasons, but it uses the normal abstract data type interface with its clients. The key operations it supports are CREATE SEGMENT, WRITE SEGMENT, READ SEGMENT, and MAKE PROCESS. To create a new process, a process would normally execute CREATE SEGMENT three times for the child process's text, data, and stack segments, getting back one capability for each segment. Then it would fill each one in with that segment's initial data and finally perform MAKE PROCESS with these capabilities as parameters, getting back a capability for the new process.

Using the above primitives, it is easy to build a set of processes that share text and/or data segments. This facility is useful for constructing servers that consist internally of multiple miniprocesses (tasks) that share text and data. Each of these processes has its own stack and, most important, its own program counter, so that when one of them blocks on a remote procedure call, the others are not affected.
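The capability validation described earlier in this subsection can be sketched as follows. XOR with the per-object random number stands in for the real encryption, and the field widths are illustrative (the widths printed in Figure 14 are not legible in this copy); only the structure, encrypt the rights plus a known constant and then verify the constant on presentation, follows the text.

```c
/* Toy version of Amoeba-style capability checking.  XOR is a stand-in for
 * the real cipher; field sizes and the known constant are made up. */
#include <stdio.h>
#include <stdint.h>

#define KNOWN_CONST 0x5a5a5a5au

struct capability { uint64_t port; uint32_t object; uint64_t protected; };

static uint64_t table_random[16];        /* per-object secret, kept by the server */

static struct capability make_cap(uint32_t obj, uint8_t rights, uint64_t rnd) {
    table_random[obj] = rnd;
    struct capability c = { 0x1234, obj, 0 };
    c.protected = (((uint64_t)rights << 32) | KNOWN_CONST) ^ rnd;   /* "encrypt" */
    return c;
}

static int check_cap(const struct capability *c, uint8_t *rights) {
    uint64_t plain = c->protected ^ table_random[c->object];        /* "decrypt" */
    if ((uint32_t)plain != KNOWN_CONST) return 0;                   /* forged */
    *rights = (uint8_t)(plain >> 32);
    return 1;
}

int main(void) {
    struct capability c = make_cap(3, 0xff, 0x0123456789abcdefULL);
    uint8_t rights = 0;
    printf("valid=%d rights=%02x\n", check_cap(&c, &rights), rights);
    c.protected ^= 1;                           /* tamper with the capability */
    printf("valid=%d\n", check_cap(&c, &rights));
    return 0;
}
```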

For example, the file server might consist of 10 processes sharing a disk cache, all of which start out by doing a GET-REQUEST. When a message comes in, the kernel sees that ten processes are all listening to the port specified in the message; so it picks one process at random and gives it the message. This process then performs the requested operation, possibly blocking on remote procedure calls (e.g., calling the disk) while doing so, but leaving the other server processes free to accept and handle new requests.

At a higher level the processor pool is managed by a process server that keeps track of which processors are free and which are not. If an installation wants to multiprogram the processor pool machines, then the process server manages each process table slot on a pool processor as a virtual processor. One of the interesting research issues here is the interplay between the workstations and the processor pool; that is: When should a process be started up on the workstation and when should it be offloaded to a pool processor? Research has not yet yielded any definitive answers here, although it seems intuitively clear that highly interactive processes, such as screen editors, should be local to the workstation, and batchlike jobs, such as big compilations (e.g., UNIX "make"), should be run elsewhere.

Accounting. Amoeba provides a general mechanism for resource management and accounting in the form of the bank server, which manages "bank account" objects. Bank accounts hold virtual money, possibly in multiple currencies. The principal operation on bank account objects is transferring virtual money between accounts. For example, to pay for file storage, a file server might insist on payment in advance of X dollars per megabyte of storage, and a phototypesetter server might want a payment in advance of Y yen per page. The system management can decide whether or not dollars and zlotys are convertible, depending on whether or not it wants users to have separate quotas on disk space and typesetter pages, or just give each user a single budget to use as he or she sees fit.

The bank server provides a basic mechanism on top of which many interesting policies can be implemented. For example: If some resource is in short supply, are servers allowed to raise the price as a rationing mechanism? Do you get your money back when you release disk space? That is: Is the model one of clients and servers buying and selling blocks, or is it like renting something? If it is like renting, there will be a flow of money from users to the various servers, and so users need incomes to keep them going, rather than simply initial fixed budgets. When new users are added, virtual money has to be created for them. Does this lead to inflation? The possibilities here are legion.

3.2.4 Fault Tolerance

The basic idea behind fault tolerance in Amoeba is that machine crashes are infrequent, and that most users are not willing to pay a penalty in performance in order to make all crashes 100 percent transparent. Instead, Amoeba provides a boot service, with which servers can register. The boot service polls each registered server at agreed upon intervals. If the server does not reply properly within a specified time, the boot service declares the server to be broken and requests the process server to start up a new copy of the server on one of the pool processors.

To understand how this strategy affects clients, it is important to realize that Amoeba does not have any notion of a virtual circuit or a session. Each remote procedure call is completely self-contained and does not depend on any previous setup; that is, it does not depend on any volatile information stored in servers' memories. If a server crashes before sending a reply, the kernel on the client side will time out and try again. When the new server comes up, the client's kernel will discover this and send the request there, without the client even knowing that anything has happened.
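The client-side behavior described above amounts to a retry loop in the kernel. The toy model below compresses that idea into user-level code with a fake send_request(); in the real system the retransmission is done transparently by the client's kernel, not by the application.

```c
/* Toy model of at-least-once retry: keep retransmitting the request until
 * some (possibly restarted) server answers.  Purely illustrative. */
#include <stdio.h>

static int attempts = 0;

static int send_request(const char *req, char *reply, int len) {  /* stand-in RPC */
    if (++attempts < 3) return -1;        /* pretend the server was down twice */
    snprintf(reply, len, "done: %s", req);
    return 0;
}

int main(void) {
    char reply[64];
    while (send_request("write block 9", reply, sizeof reply) < 0)
        ;   /* timed out: retry; a newly started server instance may answer */
    printf("%s after %d attempts\n", reply, attempts);
    return 0;
}
```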

Of course, this approach does not always work, for example, if the request is not idempotent (the chocolate factory!) or if a sick disk head has just mechanically scraped all the bits from some disk surface, but it works much of the time and has zero overhead under normal conditions.

3.2.5 Services

Amoeba has several kinds of block, file, and directory services. The simplest one is a server running on top of the Amoeba kernel that provides a file service functionally equivalent to the UNIX system call interface, to allow most UNIX programs to run on Amoeba with only the need to relink them with a special library.

A more interesting server, however, is FUSS (Free University Storage System), which views each file as a sequence of versions. A process can acquire a capability for a private copy of a new version, modify it, and then commit it in a single indivisible atomic action. Providing atomic commits at the file level (rather than only as a facility in some database systems) simplifies the construction of various servers, such as the bank server, that have to be highly robust. FUSS also supports multiple, simultaneous access using optimistic concurrency control. It is described in greater detail by Mullender and Tanenbaum [1985].

Other key services are the directory service, bank service, and boot service, all of which have already been discussed.

3.2.6 Implementation

The Amoeba kernel has been ported to five different CPUs: 68010, NS32016, 8088, VAX, and PDP-11. All the servers described above, except the boot server, have been written and tested, along with a number of others. Measurements have shown that a remote procedure call from user space on one 68010 to user space on a different 68010 takes just over 8 milliseconds (plus the time to actually carry out the service requested). The data rate between user processes on different machines has been clocked at over 250,000 bytes per second, which is about 20 percent of the raw network bandwidth, an exceptionally high value.

A library has been written to allow UNIX programs to run on Amoeba. A substantial number of utilities, including compilers, editors, and shells, are operational. A server has also been implemented on UNIX to allow Amoeba programs to put capabilities for UNIX files into their directories and use them without having to know that the files are actually located on a VAX running UNIX.

In addition to the UNIX emulation work, various applications have been implemented using pure Amoeba, including parallel traveling salesman and parallel alpha-beta search [Bal et al. 1985]. Current research includes connecting Amoeba systems at five locations in three countries using wide-area networks.

3.3 The V Kernel

The V kernel is a research project on distributed systems at Stanford University under the direction of David Cheriton [Cheriton 1984a; Cheriton and Mann 1984; Cheriton and Zwaenepoel 1984a, 1984b]. It was motivated by the increasing availability of powerful microcomputer-based workstations, which can be seen as an alternative to traditional time-shared minicomputers. The V kernel is an outgrowth of the experience acquired with earlier systems, Thoth [Cheriton 1982; Cheriton et al. 1979] and Verex.

The V kernel can be thought of as a software back plane, analogous to the Multibus or S-100 bus back planes. The function of a back plane is to provide an infrastructure for components (for hardware, boards; for software, processes) to communicate, and nothing else. Consequently, most of the facilities found in traditional operating systems, such as a file system, resource management, and protection, are provided in V by servers outside the kernel. In this respect V and Amoeba are conceptually very similar.

Another point on which V and Amoeba agree is the free-market model of services. Services such as the file system are, in principle, just ordinary user processes. Any user who is dissatisfied with the standard file system [Stonebraker 1981; Tanenbaum and Mullender 1982] is free to write his or her own.



This view is in contrast to the "centrally planned economy" model of most time-sharing systems, which present the file system on a "like it or lump it" basis.

The V system consists of a collection of workstations (currently SUNs), each running an identical copy of the V kernel. The kernel consists of three components: the interprocess communication handler, the kernel server (for providing basic services, such as memory management), and the device server (for providing uniform access to I/O devices). Some of the workstations support an interactive user, whereas others function as file servers, print servers, and other kinds of servers, as shown in Figure 15. Unlike Amoeba, V does not have a processor pool.

Figure 15. A typical V configuration: workstations, file servers, and a print server connected by a network.

3.3.1 Communication Primitives

The V communication primitives have been designed in accordance with the back-plane model mentioned above. They provide basic, but fast communication. To access a server, a client does SEND(message, pid), which transmits the fixed-length (32-byte) "message" to the server, and then blocks until the server has sent back a reply, which overwrites "message." The second parameter, "pid," is a 32-bit integer that uniquely identifies the destination process. A message may contain a kind of pseudopointer to one of the client's memory segments. This pseudopointer can be used to permit the server to read from or write to the client's memory. Such reads and writes are handled by kernel primitives COPYFROM and COPYTO. As an optimization, when a client does a SEND containing one of these pseudopointers with READ permission, the first 1K of the segment is piggybacked onto the message, on the assumption that the server will probably want to read it eventually. In this way, messages longer than 32 bytes can be achieved.

Servers use the RECEIVE and REPLY calls. The RECEIVE call can provide a segment buffer in addition to the regular message buffer, so that if (part of) a segment has been piggybacked onto the message, it will have a place to go. The REPLY call can also provide a segment buffer for the case in which the client provides a pseudopointer that the server can use to return results exceeding 32 bytes.

To make this communication system easier to use, calls to servers can be embedded in stubs so that the caller just sees an ordinary procedure call. Stub generation is not automated, however.

3.3.2 Naming and Protection

V has three levels of naming. At the bottom level, each process has a unique 32-bit pid, which is the address used to send messages to it. At the next level, services (i.e., processes that carry out requests for clients) can have symbolic (ASCII string) names in addition to their pids. A service can register a symbolic name with its kernel so that clients can use the symbolic name instead of the pid. When a client wants to access a service by its name, the client's kernel broadcasts a query to all the other kernels, to see where the server is. The (ServerName, pid) pair is then put in a cache for future use.

The top level of naming makes it possible to assign symbolic names to objects, such as files. Symbolic names are always interpreted in some "context," analogous to looking up a file name in some directory in other systems.
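A minimal sketch of the SEND primitive's shape, assuming a fixed 32-byte message and a 32-bit pid as described above. The body of Send() here is a stand-in; in V the kernel transmits the message and blocks the caller until the reply overwrites the same buffer.

```c
/* Toy model of a V-style SEND: fixed-length message, reply overwrites it. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t vmsg[32];                 /* fixed-length V message */

static void Send(vmsg msg, uint32_t pid) {
    /* Stand-in for the kernel: deliver, block, then overwrite with the reply. */
    printf("to pid %u: %s\n", (unsigned)pid, (char *)msg);
    memset(msg, 0, sizeof(vmsg));
    strcpy((char *)msg, "reply");
}

int main(void) {
    vmsg m;
    strcpy((char *)m, "open foo");
    Send(m, 0x00010042u);                 /* 32-bit pid of, say, the file server */
    printf("client sees: %s\n", (char *)m);
    return 0;
}
```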

A context is a set of records, each including the symbolic name, server's pid, context number, and object identifier. Each server manages its own contexts; there is no centralized "name server." A symbolic name is looked up in a context by searching all the records in that context for one whose name matches the given name. When a match is found, the context number and object identifier can be sent to the appropriate server to have some operation carried out.

Names may be hierarchical, as in a/b/c. When a is looked up in some context, the result will probably be a new context, possibly managed by a new server on a different machine. In that case the remaining string, b/c, is passed on to that new server for further lookup, and so on.

It is also possible to prefix a symbolic name with an explicit context, as in [HomeDirectory]a/b/c, in which case the name is looked up in the context specified, rather than in the current context (analogous to the current working directory in other systems). A question that quickly arises is, "Who keeps track of the various context names, such as 'HomeDirectory' above?" The answer is that each workstation in the system has a Context Prefix Server, whose function is to map context names onto server names, so that the appropriate server can be found to interpret the name itself.

3.3.3 Resource Management

Each processor in V has a dedicated function, either as a user workstation or a file, print, or other dedicated server; so no form of dynamic processor allocation is provided. The key resources to be managed are processes, memory, and the I/O devices. Process and memory management is provided by the kernel server. I/O management is provided by the device server. Both of these are part of the kernel present on each machine, and are accessed via the standard message mechanism described above. They are special only in that they run in kernel mode and can get at the raw hardware.

Processes are organized into groups called teams. A team of processes share a common address space, and therefore must all run on the same processor. Application programs can make use of concurrency by running as a team of processes, each of which does part of the work. If one process in a team is blocked waiting for a reply to a message, the other ones are free to run. The kernel server is prepared to carry out operations such as creating new processes and teams, destroying processes and teams, reading and writing processes' states, and mapping processes onto memory.

All I/O in V is done using a uniform interface called the V I/O protocol. The protocol allows processes to read and write specific blocks on the device. This block orientation was chosen to provide idempotency. Terminal drivers must store the last block read and filter out duplicate requests in order to maintain the idempotency property. Implementation of byte streams is up to the users. The I/O protocol has proved general enough to handle disks, printers, terminals, and even a mouse.

3.3.4 Fault Tolerance

Since it was designed primarily for use in an interactive environment, V provides little in the way of fault tolerance. If something goes wrong, the user just does it again. V, however, does address exception handling. Whenever a process causes an exceptional condition to occur, such as stack overflow or referencing nonexistent memory, the kernel detecting the error sends a specially formatted message to the exception server, which is outside the kernel. The exception server can then invoke a debugger to take over. This scheme does not require a process to make any advance preparation for being debugged and, in principle, can allow the process to continue execution afterward.

3.3.5 Services

Since most of the V workstations do not have a disk, the central file server plays a key role in the system. The file server is not part of the operating system. Instead, it is just an ordinary user program running on top of the V kernel. Internally it is structured as a team of processes.

The main process handles directory operations, including opening files; subsidiary processes perform the actual read and write commands, so that when one of them blocks waiting for a disk block, the others can continue operation. The members of the file server team share a common buffer cache, used to keep heavily used blocks in main memory.

The file system is a traditional hierarchical system, similar to that of Thoth [Cheriton 1982]. Each file has a file descriptor, similar to an i-node in UNIX, except that the file descriptors are gathered into an ordinary file, which can grow as needed.

Extensive measurements have been made of the performance of the file server. As an indication, it takes 7.8 milliseconds to read a 1K block from the file server when the block is in the cache. This time includes the communication and network overhead. When the block must be fetched from the disk, the time is increased to 35.5 milliseconds. Given that the access time of the small Winchester disks used on personal computers is rarely better than 40 milliseconds, it is clear that the V implementation of diskless workstations with a fast (18-millisecond) central file server is definitely competitive.

Other V servers include the print server, gateway server, and time server. Other servers are in the process of being developed.

3.3.6 Implementation

The V kernel has been up and running at Stanford University since September 1982. It runs on SUN Microsystems 68000-based workstations, connected by 3-megabit-per-second and 10-megabit-per-second Ethernets. The kernel is used as a base for a variety of projects at Stanford, including the research project on distributed operating systems. A great deal of attention has been paid to tuning the system to make it fast.

3.4 The Eden Project

The goal of the Eden system [Almes et al. 1985; Black 1983, 1985; Jessop et al. 1982; Lazowska et al. 1981], which is being developed at the University of Washington in Seattle under the direction of Guy Almes, Andrew Black, Ed Lazowska, and Jerre Noe, is to investigate logically integrated but physically distributed operating systems. The idea is to construct a system based on the principle of one user and one workstation (no processor pool), but with a high degree of systemwide integration.

Eden is object oriented, with all objects accessed by capabilities, which are protected by the Eden kernel. Eden objects, in contrast to, say, Amoeba objects, contain not only passive data, but also one or more processes that carry out the operations defined for the object. Objects are general: Applications programmers can determine what operations their objects will provide. Objects are also mobile, but at any instant each object (and all the processes it contains) resides on a single workstation.

Much more than most research projects of this kind, Eden was designed top down. In fact, the underlying hardware and language was radically changed twice during the project, without causing too much redesigning. This would have been much more difficult in a bottom-up, hardware-driven approach.

3.4.1 Communication Primitives

Communication in Eden uses "invocation," a form of remote procedure call. Programs are normally written in EPL, the Eden Programming Language, which is based on Concurrent Euclid. (The EPL translator is actually a preprocessor for Concurrent Euclid.) To perform an operation on an object, say, Lookup, on a directory object, the EPL programmer just calls Lookup, specifying a capability for the directory to be searched, the string to be searched for, and some other parameters.

The EPL compiler translates the call to Lookup to a call to a stub routine linked together with the calling procedure. This stub routine assembles the parameters and packs them in a standard form called ESCII (Eden Standard Code for Information Interchange), and then calls a lower level routine to transmit the function code and packed parameters to the destination machine.
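What the EPL-generated stub does can be approximated by the marshaling sketch below. The encoding is invented for illustration; the survey names ESCII but does not specify its layout, so treat pack_u32/pack_str and the field order as assumptions.

```c
/* Sketch of stub-style marshaling: pack a function code and parameters into
 * a flat, machine-independent buffer before handing it to a lower layer. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static size_t pack_u32(uint8_t *buf, size_t off, uint32_t v) {
    for (int i = 0; i < 4; i++) buf[off + i] = (uint8_t)(v >> (8 * i));
    return off + 4;
}
static size_t pack_str(uint8_t *buf, size_t off, const char *s) {
    uint32_t n = (uint32_t)strlen(s);
    off = pack_u32(buf, off, n);          /* length-prefixed string */
    memcpy(buf + off, s, n);
    return off + n;
}

int main(void) {
    uint8_t msg[256];
    size_t len = 0;
    len = pack_u32(msg, len, 7);          /* function code, e.g., Lookup       */
    len = pack_u32(msg, len, 0xCAFE);     /* capability for the directory      */
    len = pack_str(msg, len, "mailbox");  /* string argument                   */
    printf("packed %zu bytes for transmission\n", len);
    return 0;
}
```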


When the message arrives at the destination machine, a stub routine there unpacks the ESCII message and makes a local call on Lookup using the normal EPL calling sequence. The reply proceeds analogously in the opposite direction. The stub routines on both sides are automatically generated by the EPL compiler.

The implementation of invocation is slightly complicated by the fact that an object may contain multiple processes. When one process blocks waiting for a reply, the others must not be affected. This problem is handled by splitting the invocation into two layers. The upper layer builds the message, including the capability for the object to be invoked and the ESCII parameters, passes it to the lower layer, and blocks the calling process until the reply arrives. The lower layer then makes a nonblocking call to the kernel to actually send the message. If other processes are active within the object, they can now be run; if none are active, the object waits until a message arrives.

On the receiving side, a process within the invoked object will normally have previously executed a call announcing its willingness to perform some operation (e.g., Lookup in the above example), thereby blocking itself. When the Lookup message comes in, it is accepted by a special dispatcher process that checks to see which process, if any, is blocked waiting to perform the operation requested by the message. If a willing process is found, it runs and sends a reply, unblocking the caller. If no such process can be found, the message is queued until one becomes available.

3.4.2 Naming and Protection

Naming and protection in Eden are accomplished using the capability system. Data are encapsulated within objects, and are only accessible by invoking one of the operations defined by the object. To invoke an object, a process must have a valid capability. Thus there is a uniform naming and protection scheme throughout Eden.

Capabilities may be stored in any object. Directories provide a convenient mechanism for grouping capabilities together. Each directory entry contains the ASCII string by which the capability is accessed and the capability itself. Clients can only access the contents of a directory by invoking the directory object with one of the valid operations, which include add entry, delete entry, lookup string, and rename capability. Capabilities are protected from forgery by the kernel, but users keep copies of capabilities for their own use; the kernel verifies them when they are used.

The basic protection scheme protects objects, using capabilities. Since all processes are embedded in objects, a process can be protected by restricting the distribution of capabilities to its object. The only way to obtain service from an object is by invoking the object with the proper capability, parameters, etc., all of which are checked by the kernel and EPL run-time system before the call is made.

3.4.3 Resource Management

Because no version of Eden runs on bare machines, most of the issues associated with low-level resource management have not yet been dealt with. Nevertheless, some resource management issues have been addressed. For example, when an object is created, the issue arises of where to put it. At present, it is just put on the same workstation as the object that created it unless an explicit request has been given to put it somewhere else.

Another issue that has received considerable attention is how to achieve concurrency within an object. From the beginning of the project it was considered desirable to allow multiple processes to be simultaneously active within an object. These processes all share a common address space, although each one has its own stack for local variables, procedure call/return information, etc. Having multiple active processes within an object, coupled with the basic Eden semantics of remote invocations that block the caller but not the whole object, makes the implementation somewhat complicated. It is necessary to allow one process to block waiting for a reply without blocking the object as a whole. Monitors are used for synchronization.

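The directory operations listed in Section 3.4.2 (add entry, delete entry, lookup string, rename capability) suggest a table of (ASCII string, capability) pairs that is consulted only after the caller's own capability has been verified. The C sketch below illustrates that idea under stated assumptions: the capability layout and the kernel_verify rule are placeholders rather than the real Eden formats, and only two of the four operations are shown.

    /* Sketch of a kernel-checked directory object: each entry pairs an
     * ASCII string with a capability, and entries are reached only
     * through the object's operations after the caller's capability has
     * been verified.  Layouts and the verification rule are invented. */
    #include <stdio.h>
    #include <string.h>

    struct capability { unsigned object_id; unsigned rights; };
    struct dir_entry  { char name[32]; struct capability cap; int in_use; };

    static struct dir_entry directory[64];

    /* Stand-in for the kernel's check that a presented capability is genuine. */
    static int kernel_verify(const struct capability *c) {
        return c->object_id != 0;             /* placeholder rule */
    }

    int add_entry(const struct capability *caller, const char *name,
                  struct capability target) {
        if (!kernel_verify(caller)) return -1;
        for (int i = 0; i < 64; i++)
            if (!directory[i].in_use) {
                directory[i].in_use = 1;
                strncpy(directory[i].name, name, sizeof directory[i].name - 1);
                directory[i].cap = target;
                return 0;
            }
        return -1;                            /* directory full */
    }

    int lookup_string(const struct capability *caller, const char *name,
                      struct capability *out) {
        if (!kernel_verify(caller)) return -1;
        for (int i = 0; i < 64; i++)
            if (directory[i].in_use && strcmp(directory[i].name, name) == 0) {
                *out = directory[i].cap;
                return 0;
            }
        return -1;                            /* no such entry */
    }

    int main(void) {
        struct capability me = { 42, 3 }, file = { 7, 1 }, found;
        add_entry(&me, "mailbox", file);
        if (lookup_string(&me, "mailbox", &found) == 0)
            printf("mailbox -> object %u\n", found.object_id);
        return 0;
    }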
The experiences of Eden, Amoeba, and V all seem to indicate that having cheap, "lightweight" processes that share a common address space is often useful [Black 1985].

Management of dynamic storage for objects has also been a subject of some work. Each object has a heap for its own internal use, for which the EPL compiler generates explicit allocate and deallocate commands. However, a different storage management scheme is used for objects themselves. When a kernel creates an object, it allocates storage for the object from its own heap and gives the object its own address space. It also manages the user capabilities for the object in such a way that it is possible systematically to find all capabilities by scanning the kernel's data structures.

3.4.4 Fault Tolerance

The Eden kernel does not support atomic actions directly, although some services provide them to their clients. Invocations can fail with status CANNOT LOCATE OBJECT when the machine on which the invoked object resides crashes. On the other hand, Eden goes to a considerable length to make sure that objects are not totally destroyed by crashes. The technique used to accomplish this goal is to have objects checkpoint themselves periodically. Once an object has written a copy of its state to disk, a subsequent crash merely has the effect of resetting the object to the state that it had at the most recent checkpoint. Checkpoints themselves are atomic, and this property can be used to build up more complex atomic actions.

By judicious timing of its checkpoints, an object can achieve a high degree of reliability. For example, within the user mail system, a mailbox object will checkpoint itself just after any letter is received or removed. Upon receipt of a letter, a mailbox can wait for confirmation of the checkpoint before sending an acknowledgment back to the sender, to ensure that letters are never lost because of crashes. One drawback of the whole checkpoint mechanism is that it is expensive: Any change to an object's state, no matter how small, requires writing the entire object to the disk. The Eden designers acknowledge this as a problem.

Another feature of Eden that supports fault tolerance is the ability of the file system, when asked, to store an object as multiple copies on different machines (see below).

3.4.5 Services

The Eden file system maintains arbitrary objects. One particular object type, the BYTESTORE, implements linear files, as in UNIX. It is possible to set the "current position" anywhere in the file and then read sequentially from that point. Unlike V and Amoeba, Eden does not have special machines dedicated as servers. Instead, each workstation can support file objects, either for the benefit of the local user or remote ones.

The model used for file service in Eden is quite different from the usual model of a file server, which manages some set of files and accepts requests from clients to perform operations on them. In Eden, each file (i.e., BYTESTORE object) contains within it the processes needed to handle operations on it. Thus the file contains the server rather than the server containing the file as in most other systems.

Of course, actually having a process running for each file in existence would be unbearably expensive, so an optimization is used in the implementation. When a file is not open, its processes are dormant and consume no resources (other than the disk space for its checkpoint). Mailboxes, directories, and all other Eden objects work the same way. When an object is not busy with an invocation, the processes inside it are put to sleep by checkpointing the whole object to the disk.

When a file is opened, a copy of the code for its internal processes is found, and the processes started up. Although all files on a given workstation share the same code, when the first file is opened on a workstation, the code may have to be fetched from another workstation.
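The checkpoint-before-acknowledge discipline described for the mail system in Section 3.4.4 can be sketched as follows. This is a minimal illustration under invented assumptions (the state layout, the checkpoint file name, and the acknowledgment routine); a real implementation would also force the checkpoint blocks out to the storage medium before acknowledging.

    /* Sketch: the mailbox writes its entire state to disk after every
     * change and acknowledges a letter only after the checkpoint call
     * has succeeded.  Names and layout are invented for the example. */
    #include <stdio.h>
    #include <string.h>

    struct mailbox { int nletters; char letters[8][128]; };

    static int checkpoint(const struct mailbox *mb) {
        FILE *f = fopen("mailbox.ckpt", "wb");
        if (f == NULL) return -1;
        size_t ok = fwrite(mb, sizeof *mb, 1, f);  /* whole object, even for
                                                      a one-letter change  */
        fclose(f);                 /* a real system would also flush the
                                      blocks to the medium here */
        return ok == 1 ? 0 : -1;
    }

    static void send_ack(void) { printf("acknowledgment sent to sender\n"); }

    void receive_letter(struct mailbox *mb, const char *text) {
        if (mb->nletters >= 8) return;             /* mailbox full */
        strncpy(mb->letters[mb->nletters], text, 127);
        mb->nletters++;
        if (checkpoint(mb) == 0)   /* acknowledge only once the new state */
            send_ack();            /* is safely on disk */
    }

    int main(void) {
        struct mailbox mb = { 0, { { 0 } } };
        receive_letter(&mb, "meeting at three?");
        return 0;
    }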

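To illustrate the point that in Eden the file contains the server, the sketch below gives a toy request loop body for a BYTESTORE-like object supporting the set-position and sequential-read operations mentioned above. The request format and operation codes are invented; nothing here is the actual Eden interface.

    /* Sketch of "the file contains the server": an opened file object
     * carries out requests on its own data, one at a time.  Request
     * format and operation codes are invented for the illustration. */
    #include <stdio.h>
    #include <string.h>

    enum op { SET_POSITION, READ_BYTES };
    struct request { enum op op; long position; size_t count; };

    struct bytestore {
        char   data[256];
        size_t length;
        long   current;            /* the "current position" */
    };

    /* One iteration of the per-file server: carry out a single request. */
    void serve(struct bytestore *bs, const struct request *rq,
               char *reply, size_t replymax) {
        switch (rq->op) {
        case SET_POSITION:
            if (rq->position >= 0 && (size_t)rq->position <= bs->length)
                bs->current = rq->position;
            reply[0] = '\0';
            break;
        case READ_BYTES: {
            size_t n = rq->count;
            if (n > replymax - 1) n = replymax - 1;
            if ((size_t)bs->current + n > bs->length)
                n = bs->length - (size_t)bs->current;
            memcpy(reply, bs->data + bs->current, n);
            reply[n] = '\0';
            bs->current += (long)n;     /* later reads continue from here */
            break;
        }
        }
    }

    int main(void) {
        struct bytestore bs = { "an Eden linear file", 19, 0 };
        struct request seek = { SET_POSITION, 3, 0 };
        struct request read = { READ_BYTES, 0, 4 };
        char reply[32];
        serve(&bs, &seek, reply, sizeof reply);
        serve(&bs, &read, reply, sizeof reply);
        printf("read: \"%s\"\n", reply);           /* prints "Eden" */
        return 0;
    }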
The approach has advantages and disadvantages compared with the traditional one-file-server-for-all-files way of doing things. There are two main advantages. First, the complicated, multithreaded file server code is eliminated: There is no file server. The processes within a BYTESTORE object are dedicated to a single file. Second, files can be migrated freely about all the nodes in the system, so that, for example, a file might be created locally and then moved to a remote node where it will later be used.

The chief disadvantage is performance. All the processes needed for the open files consume resources, and fetching the code for the first file to be opened on a workstation is slow.

The Eden file system supports nested transactions [Pu and Noe 1985]. When an atomic update on a set of files (or other objects) is to be carried out, the manager for that transaction first makes sure that all the new versions are safely stored on disk, then it checkpoints itself, and finally it updates the directory.

The transaction facility can be used to support replicated files [Pu et al. 1986]. In the simplest case, a directory object maps an ASCII name onto the capability for that object. However, the system also has "repdirs," objects that map ASCII names onto sets of capabilities, for example, all the copies of a replicated file. Updating a replicated file is handled by a transaction manager, which uses a two-phase commit algorithm to update all the copies simultaneously. If one of the copies is not available for updating (e.g., its machine is down or the network is partitioned), a new copy of the file is generated, and the capability for the unreachable copy discarded. Sooner or later, the garbage collector will notice that the old copy is no longer in use and remove it.

We touched briefly on the mail server above. The mail system defines message, mailbox, and address list objects, with operations to deliver mail, read mail, reply to mail, and so on.

The appointment calendar system is another example of an Eden application. It is used to schedule meetings and runs in two phases. When someone proposes a meeting, a transaction is first done to mark the proposed time as "tentatively occupied" on all the participants' calendars. When a participant notices the proposed date, he or she can then approve or reject it. If all participants approve the meeting, it is "committed" by another transaction; if someone rejects the proposed appointment, the other participants are notified.

3.4.6 Implementation

Eden has had a somewhat tortuous implementation history. The initial version was designed to be written in Ada(4) on the Intel 432, a highly complex multiprocessor, fault-tolerant microprocessor chip ensemble. To make a long story short, neither the Ada compiler nor the 432 lived up to the project's expectations. To gather information for further design, a "throwaway" implementation was made on top of VMS on a VAX.

(4) Ada is a trademark of the U.S. Department of Defense.

The VAX/VMS version, called Newark (because that was thought to be far from Eden), was written in Pascal and was not distributed (i.e., it ran on a single VAX). It supported multiple processes per object (VMS kernel processes) but did not have automatic stub generation. Furthermore, the whole implementation was rather cumbersome, so it was then decided to design a programming language that would provide automatic stub generation, better type checking, and a more convenient way of dealing with concurrency.

This reevaluation led to EPL and a new implementation on top of UNIX instead of VMS. Subsequently, Eden was ported to 68000-based workstations (Suns), also on top of UNIX, rather than on the bare hardware (and in contrast to the Cambridge system, V, and Amoeba, all of which run on bare 68000s). The decision to put UNIX on the bottom, instead of the top (as was done with Amoeba), made system development easier and assisted users in migrating from UNIX to Eden. The price that has been paid is poor performance and a fair amount of effort spent trying to convince UNIX to do things against its will.
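The repdir-based update of a replicated file described in Section 3.4.5 amounts to a two-phase commit over the set of copies, with an unreachable copy dropped so that a replacement can be regenerated later. The C sketch below shows only that control flow; the structures and the prepare and commit routines are invented and do not correspond to the Eden transaction manager's interface.

    /* Sketch of a two-phase update of a replicated file: phase one asks
     * every reachable copy to put the new version safely on disk, phase
     * two commits; an unreachable copy is discarded for regeneration. */
    #include <stdio.h>

    #define COPIES 3

    struct copy { int machine; int reachable; int prepared; };

    /* Phase 1: ask one copy to write the new version safely to disk. */
    static int prepare(struct copy *c, const char *newdata) {
        if (!c->reachable) return 0;           /* e.g., machine down */
        printf("machine %d: new version \"%s\" on disk\n",
               c->machine, newdata);
        c->prepared = 1;
        return 1;
    }

    /* Phase 2: tell a prepared copy to switch over to the new version. */
    static void commit(struct copy *c) {
        if (c->prepared) printf("machine %d: committed\n", c->machine);
    }

    void update_replicated_file(struct copy copies[COPIES], const char *newdata) {
        int ok = 0;
        for (int i = 0; i < COPIES; i++)
            ok += prepare(&copies[i], newdata);
        for (int i = 0; i < COPIES; i++) {
            if (copies[i].reachable)
                commit(&copies[i]);
            else
                printf("machine %d unreachable: capability discarded, "
                       "replacement copy to be generated\n", copies[i].machine);
        }
        printf("%d of %d copies updated\n", ok, COPIES);
    }

    int main(void) {
        struct copy copies[COPIES] = { { 1, 1, 0 }, { 2, 1, 0 }, { 3, 0, 0 } };
        update_replicated_file(copies, "version 2");
        return 0;
    }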


3.5 Comparison of the Cambridge, Amoeba, V, and Eden Systems

Our four example systems have many aspects in common, but also differ in some significant ways. In this section we summarize and compare the four systems with respect to the main design issues that we have been discussing.

3.5.1 Communication Primitives

All four systems use an RPC-like mechanism (as opposed to an ISO OSI communication-oriented mechanism).

The Cambridge mechanism is the simplest, using the single-shot protocol with a 2K request packet and a 2K reply packet for most client-server communication. A byte stream protocol is also available.

Amoeba uses a similar REQUEST-REPLY mechanism, but allows messages up to 32 kilobytes (with the kernel handling message fragmentation and reassembly), as well as acknowledgments and timeouts, thus providing user programs with a more reliable and simpler interface.

V also uses a REQUEST-REPLY mechanism, but messages longer than an Ethernet packet are dealt with by having the sender include a sort of "capability" for a message segment in the REQUEST packet. Using this "capability," the receiver can fetch the rest of the message, as needed. For efficiency, the first 1K is piggybacked onto the REQUEST itself.

Eden comes closest to a true RPC mechanism, including having a language and compiler with automatic stub generation and a minilanguage for parameter passing.

None of the four examples attempts to guarantee that remote calls will be executed exactly once.

3.5.2 Naming and Protection

All four systems use different schemes for naming and protection. In the Cambridge system a single name server process maps symbolic service names onto (node, process identifier) pairs so that the client will know where to send the request. Protection is done by the active name table, which keeps track of the authorization status of each logged in user.

Amoeba has a single mechanism for all naming and protection: sparse capabilities. Each capability contains bits specifying which operations on the object are allowed and which are not. The rights are protected cryptographically, so that user programs can manipulate them directly; they are not stored in the kernel. ASCII-string-to-capability mapping and capability storage are handled by directory servers for convenience.

Eden also uses capabilities, but these are not protected by sparseness or encryption, and so they must be protected by the kernel. A consequence of this decision is that all the kernels must be trustworthy. The Amoeba cryptographic protection scheme is less restrictive on this point.

V has naming at three levels: Processes have pids, kernels have ASCII-to-pid mappings, and servers use a context mechanism to relate symbolic names to a given context.

3.5.3 Resource Management

Resource management is also handled quite differently on all four systems. In the Cambridge system the main resource is the processor bank. A resource manager is provided to allocate machines to users. Generally, this allocation is fairly static: upon login a user is allocated one machine for the duration of the login session, and this is the only machine the user uses during the session. The user may load any operating system that he or she chooses in this machine.

Amoeba also has a pool of processors, but these are allocated dynamically. A user running "make" might be allocated ten processors to compile ten files; afterward, all the processors would go back into the pool. Amoeba also provides a way for processes to create segments on any machine (assuming that the proper capability can be shown) and for these segments to be forged into processes. Amoeba is unique among the four systems in that it has a bank server that can allow servers to charge for services and to limit resource usage by accounting for it.
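The kernel-level fragmentation and reassembly mentioned for Amoeba in Section 3.5.1 can be pictured roughly as follows. The packet size, the header fields, and the in-memory "wire" are invented for the illustration; a real kernel would add the acknowledgments, timeouts, and retransmission that the text mentions.

    /* Sketch of splitting a large request into packet-sized fragments
     * and reassembling them on the receiving side.  No real network is
     * involved; the "wire" is just a function call. */
    #include <stdio.h>
    #include <string.h>

    #define PACKET_DATA 2048           /* one packet's worth of user data */

    struct packet { unsigned seq, total; size_t len; char data[PACKET_DATA]; };

    static char   reassembly[32 * 1024];
    static size_t received;

    /* Receiving kernel: put each fragment back in place. */
    static void deliver(const struct packet *p) {
        memcpy(reassembly + p->seq * PACKET_DATA, p->data, p->len);
        received += p->len;
        if (p->seq + 1 == p->total)
            printf("reassembled %zu bytes from %u fragments\n",
                   received, p->total);
    }

    /* Sending kernel: split a large request into packets. */
    void send_request(const char *msg, size_t len) {
        unsigned total = (unsigned)((len + PACKET_DATA - 1) / PACKET_DATA);
        for (unsigned seq = 0; seq < total; seq++) {
            struct packet p;
            p.seq = seq;
            p.total = total;
            p.len = (seq + 1 < total) ? PACKET_DATA : len - seq * PACKET_DATA;
            memcpy(p.data, msg + seq * PACKET_DATA, p.len);
            deliver(&p);               /* stands in for one packet on the wire */
        }
    }

    int main(void) {
        static char big[5000];
        memset(big, 'x', sizeof big);
        send_request(big, sizeof big); /* 5000 bytes become 3 fragments */
        return 0;
    }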

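One way rights bits can be protected cryptographically, in the spirit of the sparse capabilities described in Section 3.5.2, is to derive a check field from the rights and a per-object secret known only to the issuing server, so that a client who alters the rights cannot produce a matching check field. The sketch below uses a toy mixing function as a stand-in for a proper one-way function; it shows the idea, not Amoeba's actual algorithm or capability format.

    /* Sketch: a capability carries (object, rights, check); the server
     * can verify the rights with nothing but its own per-object secret.
     * The mix() function is a toy stand-in for a one-way function. */
    #include <stdio.h>

    struct capability { unsigned object; unsigned rights; unsigned check; };

    static unsigned mix(unsigned secret, unsigned rights) {
        unsigned x = secret ^ (rights * 2654435761u);
        x ^= x >> 13;  x *= 2246822519u;  x ^= x >> 16;
        return x;
    }

    /* The server issues a capability for an object with the given rights. */
    struct capability issue(unsigned object, unsigned rights, unsigned secret) {
        struct capability c = { object, rights, mix(secret, rights) };
        return c;
    }

    /* The server checks a presented capability against its own secret. */
    int rights_valid(const struct capability *c, unsigned secret) {
        return c->check == mix(secret, c->rights);
    }

    int main(void) {
        unsigned secret = 0x9e3779b9u;          /* known only to the server */
        struct capability c = issue(7, 0x1 /* read */, secret);
        printf("as issued: %s\n", rights_valid(&c, secret) ? "valid" : "forged");
        c.rights |= 0x2;                        /* try to add write rights */
        printf("tampered : %s\n", rights_valid(&c, secret) ? "valid" : "forged");
        return 0;
    }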
In V, each processor is dedicated as either a workstation or a server, so processors are not resources to be dynamically allocated. Each V kernel manages its own local resources; there is no systemwide resource management.

Eden has been built on top of existing operating systems, and therefore most of the issues of resource management are handled by the underlying operating system. The main issue remaining for Eden is allocating and deallocating storage for objects.

3.5.4 Fault Tolerance

None of the four systems goes to great lengths to make itself fault tolerant; for example, none supports atomic actions as a basic primitive. All four (with the possible exception of Eden) were designed with the intention of actually being used, so that the inherent trade-off between performance and fault tolerance tended to get resolved in favor of performance.

In the Cambridge system the only concession to fault tolerance is a feature in the ring interface to allow a machine to be remotely reset by sending a special packet to the interface. There is also a small server that helps get the servers started up.

Amoeba provides some fault tolerance through its boot server, with which processes can register. The boot server polls the registered processes periodically and, finding one that fails to respond, requests a new processor and downloads the failed program to it. This strategy does not retrieve the processes that were killed when a machine has gone down, but it does automatically ensure that no key service is ever down for more than, say, 30 seconds.

V does not address the problem of fault tolerance at all.

Of the four systems, Eden makes the most effort to provide a higher degree of reliability than provided by the bare hardware. The main tool used is checkpointing complete objects from time to time. If a processor crashes, each of its objects can be restored to the state it had at the time of the last checkpoint. Unfortunately, only entire objects can be checkpointed, making checkpointing a slow operation and thus discouraging its frequent use.

3.5.5 Services

The file systems used by Cambridge, Amoeba, V, and Eden are all quite different. The Cambridge system has two servers, the universal file server and the filing machine, which was added later to improve the performance by providing a large buffer cache. The universal file server supports a primitive flat file, with no directory structure, which is provided by the filing machine or the user machines. The universal file server has regular and special files, of which the latter can be updated atomically.

Amoeba has several file systems. One of them is compatible with UNIX, to allow UNIX applications to run on Amoeba. Another one, FUSS, supports multiversion, multiserver, tree-structured, immutable files with atomic commit. Directory servers map ASCII names to capabilities, thus allowing an arbitrary graph of files and directories to be constructed.

V has a traditional file server similar to UNIX. It is based on the earlier Thoth system.

Eden has no file server at all in the usual sense. Instead, each file object has embedded in it a process that acts like a private file server for that one file. Like Amoeba, Eden has separate directory servers that map ASCII strings onto capabilities and provide the ability to map one string onto several files, thus providing for file replication. All four systems have a heterogeneous variety of other services (e.g., print, mail, bank).

4. SUMMARY

Distributed operating systems are still in an early phase of development, with many unanswered questions and relatively little agreement among workers in the field about how things should be done. Many experimental systems use the client-server model with some form of remote procedure call as the communication base, but there are also systems built on the connection model.

Relatively little has been done on distributed naming, protection, and resource management, other than building straightforward name servers and process servers. Fault tolerance is an up-and-coming area, with work progressing in redundancy techniques and atomic actions. Finally, a considerable amount of work has gone into the construction of file servers, print servers, and various other servers, but here too there is much work to be done. The only conclusion that we draw is that distributed operating systems will be an interesting and fruitful area of research for a number of years to come.

ACKNOWLEDGMENTS

We would like to thank Andrew Black, Dick Grune, Sape Mullender, and Jennifer Steiner for their critical reading of the manuscript.

REFERENCES

ADAMS, C. J., ADAMS, G. C., WATERS, A. G., LESLIE, I., AND KIRK, P. 1982. Protocol architecture of the UNIVERSE project. In Proceedings of the 6th International Conference on Computer Communication (London, Sept. 7-10). International Conference for Computer Communication, pp. 379-383.
ALMES, G. T., BLACK, A. P., LAZOWSKA, E. D., AND NOE, J. D. 1985. The Eden system: A technical review. IEEE Trans. Softw. Eng. SE-11 (Jan.), 43-59.
ANDERSON, T., AND LEE, P. A. 1981. Fault Tolerance, Principles and Practice. Prentice-Hall International, London.
AVIZIENIS, A., AND CHEN, L. 1977. On the implementation of N-version programming for software fault-tolerance during execution. In Proceedings of the International Computer Software and Applications Conference. IEEE, New York, pp. 149-155.
AVIZIENIS, A., AND KELLY, J. 1984. Fault tolerance by design diversity. Computer 17 (Aug.), 66-80.
BAL, H. E., VAN RENESSE, R., AND TANENBAUM, A. S. 1985. A distributed, parallel, fault tolerant computing system. Rep. IR-106, Dept. of Mathematics and Computer Science, Vrije Univ., The Netherlands, Oct.
BALL, J. E., FELDMAN, J., LOW, R., RASHID, R., AND ROVNER, P. 1976. RIG, Rochester's intelligent gateway: System overview. IEEE Trans. Softw. Eng. SE-2 (Dec.), 321-329.
BARAK, A., AND SHILOH, A. 1985. A distributed load-balancing policy for a multicomputer. Softw. Pract. Exper. 15 (Sept.), 901-913.
BIRMAN, K. P., AND ROWE, L. A. 1982. A local network based on the UNIX operating system. IEEE Trans. Softw. Eng. SE-8 (Mar.), 137-146.
BIRRELL, A. D. 1985. Secure communication using remote procedure calls. ACM Trans. Comput. Syst. 3, 1 (Feb.), 1-14.
BIRRELL, A. D., AND NEEDHAM, R. M. 1980. A universal file server. IEEE Trans. Softw. Eng. SE-6 (Sept.), 450-453.
BIRRELL, A. D., AND NELSON, B. J. 1984. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb.), 39-59.
BIRRELL, A. D., LEVIN, R., NEEDHAM, R. M., AND SCHROEDER, M. 1982. Grapevine: An exercise in distributed computing. Commun. ACM 25, 4 (Apr.), 260-274.
BIRRELL, A. D., LEVIN, R., NEEDHAM, R. M., AND SCHROEDER, M. 1984. Experience with Grapevine: The growth of a distributed system. ACM Trans. Comput. Syst. 2, 1 (Feb.), 3-23.
BLACK, A. P. 1983. An asymmetric stream communications system. Oper. Syst. Rev. (ACM) 17, 5, 4-10.
BLACK, A. P. 1985. Supporting distributed applications: Experience with Eden. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 181-193.
BOGGS, D. R., SCHOCH, J. F., TAFT, E. A., AND METCALFE, R. M. 1980. Pup: An internetwork architecture. IEEE Trans. Commun. COM-28 (Apr.), 612-624.
BORG, A., BAUMBACH, J., AND GLAZER, S. 1983. A message system supporting fault tolerance. Oper. Syst. Rev. (ACM) 17, 5, 90-99.
BROWN, M. R., KOLLING, K. N., AND TAFT, E. A. 1985. The Alpine file system. ACM Trans. Comput. Syst. 3, 4 (Nov.), 261-293.
BROWNBRIDGE, D. R., MARSHALL, L. F., AND RANDELL, B. 1982. The Newcastle connection, or UNIXes of the world unite! Softw. Pract. Exper. 12 (Dec.), 1147-1162.
BRYANT, R. M., AND FINKEL, R. A. 1981. A stable distributed scheduling algorithm. In Proceedings of the 2nd International Conference on Distributed Computer Systems (Apr.). IEEE, New York, pp. 314-323.
CHANDY, K. M., MISRA, J., AND HAAS, L. M. 1983. Distributed deadlock detection. ACM Trans. Comput. Syst. 1, 2 (May), 145-156.
CHERITON, D. R. 1982. The Thoth System: Multi-Process Structuring and Portability. American Elsevier, New York.
CHERITON, D. R. 1984a. An experiment using registers for fast message-based interprocess communication. Oper. Syst. Rev. 18 (Oct.), 12-20.
CHERITON, D. R. 1984b. The V kernel: A software base for distributed systems. IEEE Softw. 1 (Apr.), 19-42.
CHERITON, D. R., AND MANN, T. P. 1984. Uniform access to distributed name interpretation in the V system. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 290-297.
CHERITON, D. R., AND ZWAENEPOEL, W. 1983. The distributed V kernel and its performance for diskless workstations. In Proceedings of the 9th Symposium on Operating System Principles. ACM, New York, pp. 128-140.
CHERITON, D. R., AND ZWAENEPOEL, W. 1984. One-to-many interprocess communication in the V-system. In SIGCOMM '84 Tutorials and Symposium on Communications Architectures and Protocols (Montreal, Quebec, June 6-8). ACM, New York.
CHERITON, D. R., MALCOLM, M. A., MELEN, L. S., AND SAGER, G. R. 1979. Thoth, a portable real-time operating system. Commun. ACM 22, 2 (Feb.), 105-115.
CHESSON, G. 1975. The network UNIX system. In Proceedings of the 5th Symposium on Operating Systems Principles (Austin, Tex., Nov. 19-21). ACM, New York, pp. 60-66.
CHOW, T. C. K., AND ABRAHAM, J. A. 1982. Load balancing in distributed systems. IEEE Trans. Softw. Eng. SE-8 (July), 401-412.
CHOW, Y. C., AND KOHLER, W. H. 1979. Models for dynamic load balancing in heterogeneous multiple processor systems. IEEE Trans. Comput. C-28 (May), 354-361.
CHU, W. W., HOLLOWAY, L. J., MIN-TSUNG, L., AND EFE, K. 1980. Task allocation in distributed data processing. Computer 13 (Nov.), 57-69.
CURTIS, R. S., AND WITTIE, L. D. 1984. Global naming in distributed systems. IEEE Softw. 1, 76-80.
DALAL, Y. K. 1977. Broadcast protocols in packet switched computer networks. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.
DELLAR, C. 1982. A file server for a network of low-cost personal microcomputers. Softw. Pract. Exper. 12 (Nov.), 1051-1068.
DENNIS, J. B., AND VAN HORN, E. C. 1966. Programming semantics for multiprogrammed computations. Commun. ACM 9, 3 (Mar.), 143-154.
DEWITT, D. J., FINKEL, R. A., AND SOLOMON, M. 1984. The CRYSTAL multicomputer: Design and implementation experience. Tech. Rep. TR-553, Computer Science Dept., Univ. of Wisconsin, Madison, Wis.
DION, J. 1980. The Cambridge file server. Oper. Syst. Rev. (ACM) 14 (Oct.), 41-49.
EFE, K. 1982. Heuristic models of task assignment scheduling in distributed systems. Computer 15 (June), 50-56.
ESWARAN, K. P., GRAY, J. N., LORIE, J. N., AND TRAIGER, I. L. 1976. The notions of consistency and predicate locks in a database system. Commun. ACM 19, 11 (Nov.), 624-633.
FARBER, D. J., AND LARSON, K. C. 1972. The system architecture of the distributed computer system: The communications system. In Proceedings of the Symposium on Computer Networks (Brooklyn, Apr.). Polytechnic Inst. of Brooklyn, Brooklyn, N.Y.
FINKEL, R. A., SOLOMON, M. H., AND TISCHLER, R. 1979. The Roscoe resource manager. In COMPCON 79 Digest of Papers (Feb.). IEEE, New York, pp. 88-91.
FITZGERALD, R., AND RASHID, R. 1985. The integration of virtual memory management and interprocess communication in Accent. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 13-14.
FRIDRICH, M., AND OLDER, W. 1981. The Felix file server. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 37-44.
FRIDRICH, M., AND OLDER, W. 1984. HELIX: The architecture of a distributed file system. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 422-431.
GAGLIANELLO, R. D., AND KATSEFF, H. P. 1985. Meglos: An operating system for a multiprocessor environment. In Proceedings of the 5th International Conference on Distributed Computing Systems (May). IEEE, New York, pp. 35-42.
GLIGOR, V. D., AND SHATTUCK, S. H. 1980. Deadlock detection in distributed systems. IEEE Trans. Softw. Eng. SE-6 (Sept.), 435-440.
GYLYS, V. B., AND EDWARDS, J. A. 1976. Optimal partitioning of workload for distributed systems. In Proceedings of COMPCON (Sept.). IEEE, New York, pp. 353-357.
HWANG, K., CROFT, W. J., GOBLE, G. H., WAH, B. W., BRIGGS, F. A., SIMMONS, W. R., AND COATES, C. L. 1982. A UNIX-based local computer network. Computer 15 (Apr.), 55-66.
ISLOOR, S. S., AND MARSLAND, T. A. 1978. An effective on-line deadlock detection technique for distributed database management systems. In Proceedings of the International Computer and Software Application Conference. IEEE, New York, pp. 283-288.
JANSON, P., SVOBODOVA, L., AND MAEHLE, E. 1983. Filing and printing services in a local area network. In Proceedings of the 8th Data Communications Symposium (Cape Cod, Mass., Oct. 3-6). IEEE, New York, pp. 211-219.
JEFFERSON, D. R. 1985. Virtual time. ACM Trans. Program. Lang. Syst. 7 (July), 404-425.
JENSEN, E. D. 1978. The Honeywell experimental distributed processor: An overview of its objective, philosophy and architectural facilities. Computer 11 (Jan.), 28-38.
JESSOP, W. H., JACOBSON, D. M., NOE, J. D., BAER, J.-L., AND PU, C. 1982. The Eden transaction-based file system. In Proceedings of the 2nd Symposium on Reliability in Distributed Software and Database Systems (July). IEEE, New York, pp. 163-169.
KRUEGER, P., AND FINKEL, R. A. 1983. An adaptive load balancing algorithm for a multicomputer. Unpublished manuscript, Computer Science Dept., Univ. of Wisconsin.
LAMPORT, L. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.
LAMPORT, L. 1984. Using time instead of timeout for fault-tolerant distributed systems. ACM Trans. Program. Lang. Syst. 6 (Apr.), 254-280.
LAMPSON, B. W. 1981. Atomic transactions. In Distributed Systems: Architecture and Implementation, B. W. Lampson, Ed. Springer-Verlag, Berlin and New York, pp. 246-265.
LAZOWSKA, E. D., LEVY, H. M., ALMES, G. T., FISCHER, M. J., FOWLER, R. J., AND VESTAL, S. C. 1981. The architecture of the Eden system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 148-159.
LEVY, H. M. 1984. Capability-Based Computer Systems. Digital Press, Maynard, Mass.
LISKOV, B. 1982. On linguistic support for distributed programs. IEEE Trans. Softw. Eng. SE-8 (May), 203-210.
LISKOV, B. 1984. Overview of the Argus language and system. Programming Methodology Group Memo 40, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass., Feb.
LISKOV, B., AND SCHEIFLER, R. 1982. Guardians and actions: Linguistic support for robust, distributed programs. ACM Trans. Program. Lang. Syst. 5, 3 (July 1983), 381-404; also ACM, pp. 7-19, Jan. 1982.
LO, V. M. 1984. Heuristic algorithms for task assignment in distributed systems. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 30-39.
LUDERER, G. W. R., CHE, H., HAGGERTY, J. P., KIRSLIS, P. A., AND MARSHALL, W. T. 1981. A distributed UNIX system based on a virtual circuit switch. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 160-168.
MAMRAK, S. A., MAURATH, P., GOMEZ, J., JANARDAN, S., AND NICHOLAS, C. 1982. Guest layering distributed processing support on local operating systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, pp. 854-859.
MARZULLO, K., AND OWICKI, S. 1985. Maintaining the time in a distributed system. Oper. Syst. Rev. 19 (July), 44-54.
MENASCE, D., AND MUNTZ, R. 1979. Locking and deadlock detection in distributed databases. IEEE Trans. Softw. Eng. SE-5 (May), 195-202.
MILLSTEIN, R. E. 1977. The national software works. In Proceedings of the ACM Annual Conference (Seattle, Wash., Oct. 16-19). ACM, New York, pp. 44-52.
MITCHELL, J. G., AND DION, J. 1982. A comparison of two network-based file servers. Commun. ACM 25, 4 (Apr.), 233-245.
MOHAN, C. K., AND WITTIE, L. D. 1985. Local reconfiguration of management trees in large networks. In Proceedings of the 5th International Conference on Distributed Computing Systems (May). IEEE, New York, pp. 386-393.
MULLENDER, S. J., AND TANENBAUM, A. S. 1984. Protection and resource control in distributed operating systems. Comput. Networks 8 (Nov.), 421-432.
MULLENDER, S. J., AND TANENBAUM, A. S. 1985. A distributed file service based on optimistic concurrency control. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 51-62.
MULLENDER, S. J., AND TANENBAUM, A. S. 1986. The design of a capability-based distributed operating system. Computer J. (in press).
NEEDHAM, R. M., AND HERBERT, A. J. 1982. The Cambridge Distributed Computing System. Addison-Wesley, Reading, Mass.
NELSON, B. J. 1981. Remote procedure call. Tech. Rep. CSL-81-9, Xerox Palo Alto Research Center, Palo Alto, Calif.
OBERMARCK, R. 1982. Distributed deadlock detection algorithm. ACM Trans. Database Syst. 7 (June), 187-208.
OKI, B. M., LISKOV, B. H., AND SCHEIFLER, R. W. 1985. Reliable object storage to support atomic actions. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 147-159.
OUSTERHOUT, J. K. 1982. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, pp. 22-30.
PASHTAN, A. 1982. Object oriented operating systems: An emerging design methodology. In Proceedings of the ACM National Conference (Dallas, Tex., Oct. 25-27). ACM, New York, pp. 126-131.
POPEK, G., WALKER, B., CHOW, J., EDWARDS, D., KLINE, C., RUDISIN, G., AND THIEL, G. 1981. LOCUS: A network transparent, high reliability distributed system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 160-168.
POWELL, M. L., AND MILLER, B. P. 1983. Process migration in DEMOS/MP. Oper. Syst. Rev. (ACM) 17, 5, 110-119.
POWELL, M. L., AND PRESOTTO, D. L. 1983. Publishing: A reliable broadcast communication mechanism. Oper. Syst. Rev. (ACM) 17, 5, 100-109.
PU, C., AND NOE, J. D. 1985. Nested transactions for general objects. Rep. TR-85-12-03, Computer Science Dept., Univ. of Washington, Seattle, Wash.
PU, C., NOE, J. D., AND PROUDFOOT, A. 1986. Regeneration of replicated objects: A technique and its Eden implementation. In Proceedings of the 2nd International Conference on Data Engineering (Los Angeles, Calif., Feb. 4-6). IEEE, New York, pp. 175-187.
RASHID, R. F., AND ROBERTSON, G. G. 1981. Accent: A communication oriented network operating system kernel. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 64-75.
REED, D. P. 1983. Implementing atomic actions on decentralized data. ACM Trans. Comput. Syst. 1, 1 (Feb.), 3-23.
REED, D. P., AND SVOBODOVA, L. 1981. SWALLOW: A distributed data storage system for a local network. In Local Networks for Computer Communications, A. West and P. Janson, Eds. North-Holland Publ., Amsterdam, pp. 355-373.
REIF, J. H., AND SPIRAKIS, P. G. 1984. Real-time synchronization of interprocess communications. ACM Trans. Program. Lang. Syst. 6, 2 (Apr.), 215-238.
RITCHIE, D. M., AND THOMPSON, K. 1974. The UNIX time-sharing system. Commun. ACM 17, 7 (July), 365-375.
SALTZER, J. H., REED, D. P., AND CLARK, D. D. 1984. End-to-end arguments in system design. ACM Trans. Comput. Syst. 2, 4 (Nov.), 277-288.
SATYANARAYANAN, M., HOWARD, J., NICHOLS, D., SIDEBOTHAM, R., SPECTOR, A., AND WEST, M. 1985. The ITC distributed file system: Principles and design. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 35-50.
SCHROEDER, M., GIFFORD, D., AND NEEDHAM, R. 1985. A caching file system for a programmer's workstation. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 25-34.
SMITH, R. 1979. The contract net protocol: High-level communication and control in a distributed problem solver. In Proceedings of the 1st International Conference on Distributed Computing Systems. IEEE, New York, pp. 185-192.
SOLOMON, M. H., AND FINKEL, R. A. 1978. ROSCOE: A multimicrocomputer operating system. In Proceedings of the 2nd Rocky Mountain Symposium on Microcomputers (Aug.), pp. 201-210.
SOLOMON, M. H., AND FINKEL, R. A. 1979. The Roscoe distributed operating system. In Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 10-12). ACM, New York, pp. 108-114.
SPECTOR, A. Z. 1982. Performing remote operations efficiently on a local computer network. Commun. ACM 25, 4 (Apr.), 246-260.
STANKOVIC, J. A., AND SIDHU, I. S. 1984. An adaptive bidding algorithm for processes, clusters, and distributed groups. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 49-59.
STONE, H. S. 1977. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Trans. Softw. Eng. SE-3 (Jan.), 88-93.
STONE, H. S. 1978. Critical load factors in distributed computer systems. IEEE Trans. Softw. Eng. SE-4 (May), 254-258.
STONE, H. S., AND BOKHARI, S. H. 1978. Control of distributed processes. Computer 11 (July), 97-106.
STONEBRAKER, M. 1981. Operating system support for database management. Commun. ACM 24, 7 (July), 412-418.
STURGIS, H. E., MITCHELL, J. G., AND ISRAEL, J. 1980. Issues in the design and use of a distributed file system. Oper. Syst. Rev. 14 (July), 55-69.
SVENTEK, J., GREIMAN, W., O'DELL, M., AND JANSEN, A. 1983. Token ring local networks: A comparison of experimental and theoretical performance. Lawrence Berkeley Lab. Rep. 16254.
SVOBODOVA, L. 1981. A reliable object-oriented data repository for a distributed computer system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 47-58.
SVOBODOVA, L. 1984. File servers for network-based distributed systems. ACM Comput. Surv. 16, 4 (Dec.), 353-398.
SWINEHART, D., MCDANIEL, G., AND BOGGS, D. 1979. WFS: A simple shared file system for a distributed environment. In Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 10-12). ACM, New York, pp. 9-17.
TANENBAUM, A. S., AND MULLENDER, S. J. 1982. Operating system requirements for distributed data base systems. In Distributed Data Bases, H.-J. Schneider, Ed. North-Holland Publ., Amsterdam, pp. 105-114.
TANENBAUM, A. S., MULLENDER, S. J., AND VAN RENESSE, R. 1986. Using sparse capabilities in a distributed operating system. In Proceedings of the 6th International Conference on Distributed Computer Systems. IEEE, New York, pp. 558-563.
VAN TILBORG, A. M., AND WITTIE, L. D. 1981. Wave scheduling: Distributed allocation of task forces in network computers. In Proceedings of the 2nd International Conference on Distributed Computing Systems. IEEE, New York, pp. 337-347.
WALKER, B., POPEK, G., ENGLISH, R., KLINE, C., AND THIEL, G. 1983. The LOCUS distributed operating system. Oper. Syst. Rev. (ACM) 17, 5, 49-70.
WAMBECQ, A. 1983. NETIX: A network-using operating system, based on UNIX software. In Proceedings of the NFWO-ENRS Contact Group (Leuven, Belgium, Mar.).
WEINSTEIN, M. J., PAGE, T. W., JR., LIVESEY, B. K., AND POPEK, G. J. 1985. Transactions and synchronization in a distributed operating system. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 115-125.
WILKES, M. V., AND NEEDHAM, R. M. 1980. The Cambridge model distributed system. Oper. Syst. Rev. 14 (Jan.), 21-29.
WITTIE, L., AND CURTIS, R. 1985. Time management for debugging distributed systems. In Proceedings of the 5th International Conference on Distributed Computing Systems (May). IEEE, New York, pp. 549-551.
WITTIE, L. D., AND VAN TILBORG, A. M. 1980. MICROS, a distributed operating system for MICRONET, a reconfigurable network computer. IEEE Trans. Comput. C-29 (Dec.), 1133-1144.
WUPIT, A. 1983. Comparison of UNIX network systems. ACM, New York, pp. 99-108.
ZIMMERMANN, H. 1980. OSI reference model: The ISO model of architecture for open systems interconnection. IEEE Trans. Commun. COM-28 (Apr.), 425-432.
ZIMMERMANN, H., BANINO, J.-S., CARISTAN, A., GUILLEMONT, M., AND MORISSET, G. 1981. Basic concepts for the support of distributed systems: The Chorus approach. In Proceedings of the 2nd International Conference on Distributed Computing Systems. IEEE, New York, pp. 60-66.

Received July 1985; final revision accepted April 1986.
