Partha Dasgupta, Arizona State University
Richard J. LeBlanc, Jr., Mustaque Ahamad, and Umakishore Ramachandran, Georgia Institute of Technology

A distributed operating system is a control program running on a set of computers that are interconnected by a network. This control program unifies the different computers into a single integrated compute and storage resource. Depending on the facilities it provides, a distributed operating system is classified as general purpose, real time, or embedded.

The need for distributed operating systems stems from rapid changes in the hardware environment in many organizations. Hardware prices have fallen rapidly in the last decade, resulting in the proliferation of workstations, personal computers, data and compute servers, and networks. This proliferation has underlined the need for efficient and transparent management of these physically distributed resources (see sidebar titled "Distributed operating systems").

This article presents a paradigm for structuring distributed operating systems, the potential and implications this paradigm has for users, and research directions for the future.

Clouds is a general-purpose operating system for distributed environments. It is based on an object-thread model adapted from object-oriented programming.

Two paradigms

Operating system structures for a distributed environment follow one of two paradigms: message-based or object-based. Message-based operating systems place a message-passing kernel on each node and use explicit messages to support interprocess communication. The kernel supports both local communication (between processes on the same node) and nonlocal or remote communication, which is sometimes implemented through a separate network-manager process. In a traditional system such as Unix, access to system services is requested via protected procedure calls, whereas in a message-based operating system, the requests are made via message passing. Message-based operating systems are attractive for structuring operating systems because the policy, which is encoded in the server processes, is separate from the mechanism implemented in the kernel.

Object-based distributed operating systems encapsulate services and resources into entities called objects. Objects are similar to instances of abstract data types. They are written as individual modules composed of the specific operations that define the module interfaces. Clients request access to system services by invoking the appropriate system object. The invocation mechanism is similar to a protected procedure call. Objects encapsulate functionality much as server processes do in message-based systems (see sidebar titled "What can objects do?").

Among the well-known message-based systems are the V system,1 developed at Stanford University, and the Amoeba system,2 developed at Vrije University in Amsterdam, Netherlands. These systems provide computation and data services via servers that run on machines linked by a network.

Most object-based systems are built on top of an existing operating system, typically Unix. Examples of such systems include Argus,3 Cronus,4 and Eden.5 These systems support objects that respond to invocations sent via the message-passing mechanisms of Unix.

Mach6 is an operating system with distinctive characteristics. A Unix-compatible system built to be machine-independent, it runs on a large variety of uniprocessors and multiprocessors. It has a small kernel that handles the virtual memory and process scheduling, and it builds other services on top of the kernel. Mach implements mechanisms that provide distribution, especially through a facility called memory objects, for sharing memory between separate tasks executing on possibly different machines.

Distributed operating systems

Networked computing environments are now commonplace. Because powerful desktop systems have become affordable, most computing environments are now composed of combinations of workstations and file servers. Such distributed environments, however, are not easy to use or administer.

Using a collection of computers connected by a local area network often poses problems of resource sharing and environment integration that are not present in centralized systems. To keep user productivity high, the distribution must be transparent and the environment must appear centralized. The short-term solution adopted by current commercial software extends conventional operating systems to allow transparent file access and sharing. For example, Sun added the Network File System to provide distributed file access capabilities in Unix. Sun-NFS has become the industry-standard distributed file system for Unix systems. The small systems world, dominated by IBM PC-compatible and Apple computers, has software packages that perform multitasking and network-transparent file access - for example, Windows 3.0, Novell Netware, Appletalk, and PC-NFS.

A better long-term solution is the design of an operating system that considers the distributed nature of the hardware architecture at all levels. A distributed operating system is such a system. It makes a collection of distributed computers look and feel like one centralized system, yet keeps intact the advantages of distribution. Message-based and object-based systems are two paradigms for structuring such operating systems.

Clouds approach

Clouds is a distributed operating system that integrates a set of nodes into a conceptually centralized system. The system is composed of compute servers, data servers, and user workstations. A compute server is a machine that is available for use as a computational engine. A data server is a machine that functions as a repository for long-lived (that is, persistent) data. A user workstation is a machine that provides a programming environment for developing applications and an interface with the compute and data servers for executing those applications on the servers. Note that when a disk is associated with a compute server, it can also act as a data server for other compute servers.

Clouds is a native operating system that runs on top of a native kernel called Ra (after the Egyptian sun god). It currently runs on Sun-3/50 and Sun-3/60 computers and cooperates with Sun Sparcstations (running Unix) that provide user interfaces.

Clouds is a general-purpose operating system. That is, it is intended to support all types of languages and applications, distributed or not.

What can objects do?

Conceptually, an object is an encapsulation of data and a set of operations on the data. The operations are performed by invoking the object and can range from simple data-manipulation routines to complex algorithms, from shared library accesses to elaborate system services.

Objects can also provide specialized services. For example, an object can represent a sensing device, and an invocation can gather data from the device without knowing about the mechanisms involved in accessing it - or its location. Similarly, an object that handles terminal or file I/O contains defined read and write operations.

Objects can be active. An active object has one or more processes associated with it that communicate with the external world and handle housekeeping chores internal to the object. For example, a process can monitor an object's environment and can inform some other entity (another object) when an event occurs. This feature is particularly useful in objects that manage sensor-monitoring devices.

Objects are a simple concept with a major impact. They can be used for almost every need - from general-purpose programming to specialized applications - yet provide a simple procedural interface to the rest of the system.

All applications can view the system as a monolith, but distributed applications may choose to view the system as composed of several separate compute and data servers. Each compute facility in Clouds has access to all system resources.

The system structure is based on an object-thread model. The object-thread model is an adaptation of the popular object-oriented programming model, which structures a software system as a set of objects. Each object is an instance of an abstract data type consisting of data and operations on the data. The operations are called methods. The object type is defined by a class. A class can have zero or more instances, but an instance is derived from exactly one class. Objects respond to messages. Sending a message to an object causes the object to execute a method, which in turn accesses or updates data stored in the object and may trigger sending messages to other objects. Upon completion, the method sends a reply to the sender of the message.

Clouds has a similar structure, implemented at the operating system level. Clouds objects are large-grained encapsulations of code and data that are contained in an entire virtual address space. An object is an instance of a class, and a class is a compiled program module. Clouds objects respond to invocations. An invocation results from a thread of execution entering the object to execute an operation (or method) in the object.

Clouds provides objects to support an abstraction of storage and threads to implement computations. This decouples computation and storage, thus maintaining their orthogonality. In addition, the object-thread model unifies the treatment of I/O, interprocess communication, information sharing, and long-term storage. This model has been further augmented to support atomicity and reliable execution of computations.

Multics was the starting point for many ideas found in operating systems today, and Clouds is no exception. These ideas include sharable memory segments and single-level stores using mapped files. Hydra first implemented the use of objects as a system-structuring concept. Hydra ran on a multiprocessor and provided named objects for operating-system services.

Clouds paradigm

This section elaborates on the object-thread paradigm of Clouds, illustrating the paradigm with examples of its use.

Objects. A Clouds object is a persistent (or nonvolatile) virtual address space. Unlike virtual address spaces in conventional operating systems, the contents of a Clouds object are long-lived. That is, a Clouds object exists forever and survives system crashes and shutdowns unless explicitly deleted - like a file. As the following description of objects shows, Clouds objects are somewhat "heavyweight"; that is, they are best suited for storage and execution of large-grained data and programs because of the overhead associated with invocation and storage of objects.

Unlike objects in some object-based operating systems, a Clouds object does not contain a process (or thread). Thus, Clouds objects are passive. Since the contents of a virtual address space are not accessible from outside the address space, the memory (data) in an object is accessible only by the code in the object.

A Clouds object contains user-defined code, persistent data, a volatile heap for temporary memory allocation, and a persistent heap for allocating memory that becomes a part of the object's persistent data structures (see Figure 1). Recall that the data in the object can be manipulated only from within the object. Data can pass into the object when an entry point is invoked (input parameters). Data can pass out of the object when this invocation terminates (result parameters).

Figure 1. A Clouds object.
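The pieces just listed can be pictured with a small sketch. The C++ fragment below is only an illustration of the layout this section describes, not Clouds source; all names are invented for the example.

    #include <vector>

    // Hypothetical picture of what a Clouds object's address space holds.
    // Everything here is persistent except the volatile heap, which is
    // scratch space for a single activation of the object.
    struct CloudsObjectImage {
        std::vector<unsigned char> code;            // user-defined entry-point code
        std::vector<unsigned char> persistent_data; // survives crashes and shutdowns
        std::vector<unsigned char> persistent_heap; // dynamic allocations that persist
        std::vector<unsigned char> volatile_heap;   // temporary memory, discarded after use
    };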

Each Clouds object has a global system-level name called a sysname, which is a bit string that is unique over the entire distributed system. Therefore, the sysname-based naming scheme in Clouds creates a uniform, flat system name space for objects. Users can define high-level names for objects - a naming service translates them to sysnames. Objects are physically stored in data servers but are accessible from all compute servers in the system, thus providing location transparency to the users.

Threads. The only form of user activity in the Clouds system is the user thread. A thread is a logical path of execution that traverses objects and executes the code in them. Thus, unlike a process in a conventional operating system, a Clouds thread is not bound to a single address space. A thread is created by an interactive user or under program control. When a thread executes an entry point in an object, it accesses or updates the persistent data stored in it. In addition, the code in the object may invoke operations in other objects. In such an event, the thread temporarily leaves the calling object, enters the called object, and commences execution there. The thread returns to the calling object after the execution in the called object completes and returns results. These arguments and results are strictly data; they cannot be addresses. This restriction is mandatory because addresses in one object are meaningless in the context of another object. In addition, object invocations can be nested as well as recursive. After the thread completes execution of the operation it was created to execute, it terminates.

The nature of Clouds objects prohibits a thread from accessing any data outside the current address space (object) in which it is executing. Control transfer between address spaces occurs through object invocation, and data transfer between address spaces occurs through parameter passing.

Several threads can simultaneously enter an object and execute concurrently. Multiple threads executing in the same object share the contents of the object's address space. Figure 2 shows thread executions in the Clouds object spaces. The programmer uses system-supported primitives such as locks or semaphores to handle concurrency control within the object.

Figure 2. Distributed object memory.

Interaction between objects and threads. The structure created by a system, composed of objects and threads, has several interesting properties. Inter-object interfaces are procedural. Object invocations are equivalent to procedure calls on long-lived modules that do not share global data. The invocations work across machine boundaries.

The storage mechanism used in Clouds differs from those found in conventional operating systems. Conventionally, files are used to store persistent data. Memory is associated with processes and is volatile (that is, the contents of memory associated with a process are lost when the process terminates). Objects in Clouds unify the concepts of persistent storage and memory to create a persistent address space. This unification makes programming simpler. Persistent objects provide a structured single-level store that is cosmetically similar to mapped files in Multics and SunOS.

Some systems use message passing for communicating shared data and coordinating computations. Clouds shares data by placing the data in an object. Computations that need access to shared data invoke the object where the data exists. Clouds does not support messages and files at the operating system level, although it does allow objects to simulate them if necessary (see sidebar titled "No files? No messages?").

In a message-based system, the user must determine the desired level of concurrency at the time of writing an application and program it as a certain number of server processes. The object-thread model of Clouds eliminates this requirement. An object can be written from the viewpoint of the functionality it is meant to provide, rather than the actual level of concurrency it may have to support. At execution time, the level of concurrency is specified by creating concurrent threads to execute in the objects that compose the user-level application. The application objects, however, must be written to support concurrent executions, using synchronization primitives such as semaphores and locks.
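To make the control and data flow of the object-thread interaction concrete, the following self-contained C++ sketch mimics the pattern on a single machine: each "object" keeps its state private, entry points accept and return only data values, and a logical thread crosses from one object into another through a nested invocation. This is purely a local illustration of the semantics described above, with invented names; Clouds itself realizes the same pattern across address spaces and machines, where the target of an invocation is named by a sysname rather than a language-level reference.

    #include <cstdio>

    // A toy "object": private persistent state, entry points that take and
    // return plain data values (never addresses), as in a Clouds invocation.
    class Counter {
        int count = 0;                 // persistent data, visible only inside
    public:
        int increment(int by) {        // entry point: data in, data out
            count += by;
            return count;
        }
    };

    class Report {
    public:
        // Entry point that performs a nested invocation on another object.
        // The calling "thread" temporarily executes inside Counter and
        // returns here with a result value. (In Clouds the counter would be
        // identified by its sysname, not a C++ reference.)
        int bump_and_report(Counter& counter, int by) {
            int total = counter.increment(by);   // nested invocation
            std::printf("running total: %d\n", total);
            return total;
        }
    };

    int main() {
        Counter c;
        Report r;
        r.bump_and_report(c, 5);       // one logical thread traverses both
        r.bump_and_report(c, 10);      // objects; only values are passed
    }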

No files? No messages?

The persistent objects supported in an operating system like Clouds provide a structured permanent storage mechanism that can be used for a variety of purposes, including the simulation of files and messages. An object can store data in any form, and invocations can be used to

• manipulate or process the stored data,
• ship data in and out of the object in forms not necessarily the same as those used for storage, and
• allow controlled concurrent access to shared data, without regard to data location.

There is no need for files in a persistent programming environment. Conventional systems use files as byte-sequential storage of long-lived data. When persistent shared memory is available, there is no need to convert data into byte-sequential form, store it in files, and later retrieve and reconvert it. The data can be kept in memory in a form controlled by the programs (for example, lists or trees), even when the data is not in use.

In fact, objects that store byte-sequential data can simulate files, and they can have read and write invocations defined to access this data. Such an object will look like a file, even though the operating system does not explicitly support files. The same is true for messages. The functional equivalence of messages and shared memory is well known. If desired, a buffer object with defined send and receive invocations can serve as a port structure between two (or more) communicating processes.

We feel that files, messages, and disk I/O are artifacts of hardware structure. Given an object implementation, these features are neither necessary nor attractive. New programming paradigms based on object-oriented styles use persistent memory effectively. They do not use files and messages.
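As a concrete illustration of the sidebar's point, the sketch below shows a byte-sequential "file-like" object with read and write entry points over storage kept inside the object. It is a hypothetical example in ordinary C++, not the Clouds or DC++ interface; the class and method names are invented.

    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    // A toy object that simulates a file: byte-sequential data kept inside
    // the object, exposed only through read and write entry points.
    class FileLikeObject {
        std::vector<unsigned char> bytes;   // would be per-object persistent memory
    public:
        // Write 'len' bytes at 'offset', growing the stored data if needed.
        void write(std::size_t offset, const void* src, std::size_t len) {
            if (offset + len > bytes.size()) bytes.resize(offset + len);
            std::memcpy(bytes.data() + offset, src, len);
        }
        // Read up to 'len' bytes at 'offset'; returns the number actually read.
        std::size_t read(std::size_t offset, void* dst, std::size_t len) const {
            if (offset >= bytes.size()) return 0;
            std::size_t n = std::min(len, bytes.size() - offset);
            std::memcpy(dst, bytes.data() + offset, n);
            return n;
        }
    };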

To summarize:

• The Clouds system is composed of named address spaces called objects.
• Objects provide data storage, data manipulation, data sharing, concurrency control, and synchronization.
• Control flow is achieved by threads invoking objects.
• Data flow is achieved by parameter passing.

Programming in the Clouds model. For the programmer, Clouds has two kinds of objects: classes and instances. A class is a template used to generate instances. An instance is an object invocable by user threads. Thus, to write application programs for Clouds, a programmer writes one or more Clouds classes that define the application code and data. The programmer can then create the requisite number of instances of these classes. The application is then executed by creating a thread to execute the top-level invocation that runs the application.

The following is a simple example of programming in the Clouds system. The object Rectangle consists of the x and y dimensions of a rectangle. The object has two entry points, one for setting the size of the rectangle and the other for computing the area. The object definition is shown in Figure 3a.

Once the class is compiled, any number of instances can be created either from the command line or via another object. Suppose the Rectangle class is instantiated into an object called Rect01. Now Rect01.size can be used to set the size and Rect01.area can be called to return the area of the rectangle. A command in the Clouds command line interpreter can call the entry point in the object. Entry points can also be invoked in the program, allowing one object to call another.

Objects have user names, which are assigned by the programmer when objects are created (compiled or instantiated). A name server then translates the user name to a sysname. (Recall that a sysname is a unique name for an object, which is needed for invoking the object.) The code fragment in Figure 3b details the steps in accessing the Clouds object Rect01 and invoking operations on it.

    clouds_class rectangle;
        int x, y;               // persistent data for rect.
        entry rectangle;        // constructor
        entry size (int x, y);  // set size of rect.
        entry int area ();      // return area of rect.
    end_class

    (a)

    rectangle_ref rect;           // "rect" refers to an object
                                  // of type rectangle.
    rect.bind ("Rect01");         // call to name server,
                                  // binds sysname to Rect01
    rect.size (5, 10);            // invocation of Rect01
    printf ("%d\n", rect.area()); // will print 50

    (b)

Figure 3. Example code for Clouds.

Clouds provides a variety of mechanisms to programmers. These include registering user-defined names of objects with the name server, looking up names using the name server, invoking objects both synchronously and asynchronously, and synchronizing threads. I/O to the user console is handled by read and write routines (and by printf and scanf library calls). These routines read and write ASCII strings to and from the controlling user terminal, irrespective of the actual location of the object or the thread.

User objects and their entry points are typed by the language definition. The compiler performs static type checking on the object and entry point types at compile time. It performs no runtime type checking.

Clouds objects are coarse-grained, unlike the fine-grained entities found in such object-oriented programming languages as Smalltalk. Since an object invocation in Clouds is at least an order of magnitude more expensive than a simple procedure call, it is appropriate to use a Clouds object as a module that may contain several fine-grained entities. These fine-grained objects are completely contained within the Clouds object and are not visible to the operating system.

Currently, we support two languages in the Clouds operating system. DC++ is an extension of C++ that systems programmers use. Distributed Eiffel is an extension of Eiffel that targets application developers. The design of both DC++ and Distributed Eiffel supports persistent fine-grained and large-grained objects, invocations, thread creation, synchronization, and user-level object naming.
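The granularity point can be illustrated in plain C++ (actual DC++ or Distributed Eiffel syntax is not reproduced here): one coarse-grained, operating-system-visible module holds many fine-grained language-level objects that never escape it. The names below are invented for the illustration.

    #include <map>
    #include <string>

    // Fine-grained entity: exists only inside the enclosing module and is
    // never visible to the operating system.
    struct Account {
        long balance_cents = 0;
    };

    // Coarse-grained module standing in for a Clouds object: the only unit
    // the system sees and invokes; one invocation may touch many Accounts.
    class Bank {
        std::map<std::string, Account> accounts;   // fine-grained objects inside
    public:
        void deposit(const std::string& id, long cents) {
            accounts[id].balance_cents += cents;   // cheap, local operation
        }
        long balance(const std::string& id) {
            return accounts[id].balance_cents;
        }
    };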

Clouds environment

The Clouds system integrates a set of homogeneous machines into one seamless environment that behaves as one large computer. The system is configured as three logical categories of machines, each supporting a different logical function. These are compute servers, data servers, and user workstations.

The system's core consists of a set of homogeneous machines of the compute server category. Compute servers do not have any secondary storage. These machines provide an execution service for threads. Data servers provide secondary storage. They store Clouds objects and supply the objects' code and data to the compute servers. The data servers also support the distributed synchronization functions. The third machine category, the user workstation, provides user access to Clouds compute servers. Compute servers, in turn, know how and when to access data servers.

The logical machine categories do not require a one-to-one scheme for mapping to physical machines. Although a diskless machine can function only as a compute server, a machine with a disk can simultaneously act as a compute and data server. This enhances computing performance, since data accessed via a local disk is faster than data accessed over a network. However, in our prototype system, shown in Figure 4, we use a one-to-one mapping to simplify the system's implementation and configuration.

Figure 4. Clouds system architecture.

A suite of programs that run on top of Unix on Sun workstations provides the user interface to Clouds. These programs include Distributed Eiffel and DC++ compilers, a Clouds user shell (under X Windows), a user I/O manager, and various utilities. The user interface with these programs is through the familiar Unix utilities (including Unix editors).

User environment. A user writes Clouds programs in DC++ or Distributed Eiffel and compiles them on the Unix workstation. The compiler loads the generated classes on a Clouds data server. Now these classes are available to all Clouds compute servers. Any node (or user on a Unix machine) can create instances of these classes and generate invocations to the objects thus created. Note that once created, the objects become part of the persistent object memory and can be invoked until they are explicitly deleted.

A user invokes a Clouds object by specifying the object, the entry point, and the arguments to the Clouds shell. The shell sends an invocation request to a compute server, and the invocation proceeds under Clouds using a Clouds thread. The user communicates with the thread via a terminal window in the X Window System. All output generated by the thread (regardless of where it is executing) appears on the user terminal window, and input to the thread is provided by typing in the window.

System environment. As we mentioned earlier, the hardware environment consists of compute servers and data servers, with some nodes providing both functions. Starting a user-level computation on Clouds involves first selecting a compute server to execute the thread. This is a scheduling decision and may depend on such factors as scheduling policies, the load at each compute server, and the availability of resources needed for the computation. Once this decision has been made, the second task is to bring the object in which the thread executes into the selected compute server.
This requires a remote-paging facility. Coupled with this requirement is the fact that all objects are potentially shared in the Clouds model; therefore, the entity that provides the remote-paging facility must also maintain the consistency of shared pages. This is satisfied in Clouds by a mechanism called distributed shared memory. DSM supports the notion of shared memory on a nonshared-memory (distributed) architecture (see sidebar titled "Distributed shared memory"). The data servers execute a coherence protocol that preserves single-copy semantics for all the objects.7 With DSM, concurrent invocation of the same object by threads at different compute servers is possible. Such a scenario would result in multiple copies of the same object existing at more than one compute server, with DSM providing the consistency maintenance.

Suppose a thread is created on compute server A to invoke object O1. The compute server retrieves a header for the object from the appropriate data server, sets up the object space, and starts the thread in that space. As the thread executes in the object space, the code and data of the object accessed by the thread are demand-paged from the data servers (possibly over the network).

If the thread executing in O1 generates an invocation to object O2, the system may choose to execute the invocation on either A itself or on a different compute server, B. In the former case, if the required pages of object O2 are at other nodes, they have to be brought to node A using DSM. Once the object has been brought into A, the invocation proceeds the same way as when O2 resides at A. On the other hand, the system may choose to execute the invocation on compute server B. In this case, the thread sends an invocation request to B, which invokes the object O2 and returns the results to the thread at A. This scenario is similar to the remote procedure call found in other systems such as the V system,1 but it is more general because B does not have to be the node where O2 currently resides.

The compute and data server scheme makes all objects accessible to all compute servers. The DSM coherence protocol ensures that the data in an object is seen consistently by concurrent threads even if they are executing on different compute servers. The distributed synchronization support provided by data servers allows threads to synchronize their actions regardless of where they execute.

Clouds implementation

The Clouds implementation uses a minimalist approach toward operating system development (like the V system, Amoeba, and Mach 3.0). With this approach, each level of the implementation consists of only those functions that cannot be implemented at a higher level without a significant performance penalty. Traditional systems such as Unix provide most operating-system services in one big monolithic kernel. Unlike such systems, we differentiate between the operating-system kernel and the operating system itself. This approach makes the system modular, easy to understand, more portable, and convenient to enhance. High-level features can be implemented as user-level libraries, objects, or services that use the low-level mechanisms in the operating system. Further, it provides a clean separation of policy from mechanism; that is, the policies are implemented at a high level using the lower level mechanisms.

The current Clouds implementation has three levels. At the lowest level is the minimal kernel, Ra, which provides the mechanisms for managing basic resources, namely, processor and memory. The next level up is a set of system objects, which are trusted software modules providing essential system services. Finally, other noncritical services such as naming and spooling are implemented as user objects to complete the functionality of Clouds.

The Ra kernel. This native minimal kernel supports virtual memory management and low-level scheduling. Ra implements four abstractions (a hypothetical sketch of their interfaces follows the list):

• Segments. A segment is a variable-length sequence of uninterpreted bytes that exists either on the disk or in physical memory. Segments have system-wide unique names and, once created, segments persist until they are explicitly destroyed.

• Virtual spaces. A virtual space is the abstraction of an addressing domain and is a monotonically increasing range of virtual addresses with possible holes in the range. A segment can be mapped to a contiguous range of addresses in a virtual space.

• IsiBas. An IsiBa is the abstraction of system activity and can be thought of as a lightweight process. (Its name comes from ancient Egyptian: Isi means light, and Ba means soul.) An IsiBa is simply a kernel resource that is associated with a stack to realize a schedulable entity. There are several types of stacks in the system (for example, kernel, interrupt, and user), and an IsiBa can use an instance of any stack type. A Clouds process is an IsiBa in conjunction with a user stack and a Ra virtual space. One or more Clouds processes are used to build a Clouds thread. IsiBas can also be used for a variety of purposes inside system objects, including interrupt services, event notification, and watchdogs.

• Partitions. A partition is an entity that provides nonvolatile data storage for segments. A Clouds compute server has access to one or more partitions, but a segment belongs to exactly one partition. To access a segment, the partition containing the segment has to be contacted. The partition communicates with the data server where the segment is stored to page the segment in and out when necessary. Note that Ra only defines the interface to the partitions. The partitions themselves are implemented as system objects.
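The C++ declarations below are a hypothetical rendering of these four abstractions for illustration only; they are not Ra's actual kernel classes, and every type and member name is an assumption made for the sketch.

    #include <cstddef>
    #include <cstdint>

    using SysName = std::uint64_t;   // stand-in for a system-wide unique name

    // Segment: a named, persistent, variable-length sequence of bytes.
    struct Segment {
        SysName     name = 0;
        std::size_t length = 0;
    };

    // Virtual space: an addressing domain into which segments are mapped.
    struct VirtualSpace {
        // Map 'seg' at a contiguous range of addresses starting at 'base'.
        bool map(const Segment& seg, std::uintptr_t base) {
            (void)seg; (void)base; return true;   // stub for the sketch
        }
    };

    // IsiBa: the schedulable entity, a kernel resource tied to a stack.
    struct IsiBa {
        enum class StackKind { Kernel, Interrupt, User };
        StackKind stack = StackKind::User;
    };

    // Partition: nonvolatile storage for segments, backed by a data server.
    struct Partition {
        bool page_in(Segment& seg, std::size_t page_no)        { (void)seg; (void)page_no; return true; }
        bool page_out(const Segment& seg, std::size_t page_no) { (void)seg; (void)page_no; return true; }
    };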

Figure 5 shows the relationship between virtual spaces, segments, and partitions.

Figure 5. Virtual spaces, segments, and partitions.

The implementation of Ra is separated into machine-dependent and machine-independent parts. All Ra components use the class mechanisms of C++, a scheme that enhances Ra's object structure. Ra consists of 6,000 lines of machine-dependent C++ code, 6,000 lines of machine-independent C++ code, and 1,000 lines of Sun (68020) assembly code. It currently runs on the Sun-3 class machines. More details about the implementation are in Dasgupta et al.8

System objects. Ra can be thought of as the conceptual motherboard. Operating-system services are provided on top of Ra by system objects. System objects are independently compiled modules of code that have access to certain Ra-defined operations. Ra exports these operations as kernel classes that are inherited by the system objects. Conceptually, the system objects are similar to Clouds objects that live in their own virtual space. However, for the sake of efficiency, system objects live in the kernel space, are linked to the Ra kernel at system configuration time, and are not directly invocable from the user level. System objects are implicitly invoked through a system-call interface available to user-level objects.

Some system objects implement low-level functions inside the operating system; these functions include the buffer manager, uniform I/O interface, and Ethernet driver. Other system objects implement high-level functions that are invoked indirectly as a result of a system call. These objects include the thread manager, object manager, and user I/O manager. The following paragraphs describe some of the important system objects.

Thread manager. As mentioned earlier, a thread can span machine boundaries and is implemented as a collection of Clouds processes. There is some information associated with a thread, such as the objects it may have visited, the user workstation from which it was created, and the windows on the user workstation with which it has to communicate when I/O requests are made during the computation. The thread manager is responsible for the creation, termination, naming, and bookkeeping necessary to implement threads.

User object manager. User-level objects are implemented through a system object called the object manager. The object manager creates and deletes objects and provides the object-invocation facility. An object is stored in a Ra virtual space. The invocation of an object by a thread is handled mainly by the object manager in conjunction with the thread manager. Briefly, when a thread invokes an object, the stack of the invoking thread is mapped into the same virtual address space as the object, and the thread is allowed to commence execution at the entry point of the object. When the execution of the operation terminates, the object manager unmaps the thread stack from the object and remaps it in the object where the thread was previously executing. If there was no previous object, then the object manager informs the thread manager, and the thread is terminated.

User I/O manager. This system object provides support for Clouds computations to read from and write to user terminals. A user terminal is a window on a Unix workstation. When a thread executes a write system call, the I/O manager routes the written data to the appropriate controlling terminal. Reads are handled similarly. The user I/O manager is a combination of a Ra system object and a server on each Unix workstation.

DSM clients and servers. DSM clients and servers are partitions that interact with the data servers to provide one-copy semantics for all object code and data used by the Clouds nodes. When node A needs a page of data, the DSM client partition requests it from the data server. If the page is currently in use in exclusive mode at node B, the data server forwards the request to the DSM server at node B, which supplies the page to A. The DSM server allows maintenance of both exclusive and shared locks on segments and provides other synchronization support.

Networking and RaTP. Two system objects handle networking: the Ethernet driver and the network protocol. All Clouds communication uses a transport layer protocol called the Ra transport protocol. RaTP is similar to the communication protocol VMTP9 used in the V system, and provides efficient, reliable, connectionless message transactions. A message transaction is a send/reply pair used for client-server type communications. RaTP has been implemented both on Ra (as a system object) and on Unix, allowing Clouds-to-Unix communication.

Status and current performance. The Clouds implementation and features described thus far are in use. The kernel performance is good. Context-switch time is 0.15 milliseconds. The time to service a page fault when the page is resident on the same node is 2.3 ms for a zero-filled, 8-kilobyte page; it is 1.5 ms for a nonzero-filled page.

Networking is one of Clouds' most heavily used subsystems, especially since our current implementation uses diskless compute servers. All objects are demand-paged to the servers over the network when used. The RaTP protocol handles the reliable data transfer between all machines. The Ethernet round-trip time is 1.59 ms; this involves sending and receiving a short message (72 bytes) between two compute servers. The RaTP reliable round-trip time is 3.56 ms. To transfer an 8-kilobyte page reliably from one machine to another costs 12.3 ms, compared to 70 ms using the Unix file-transfer protocol and 50 ms using Sun-NFS.

Object invocation costs vary widely depending upon whether the object is currently in memory or has to be fetched from a data server. The maximum cost for a null invocation is 103 ms, while the minimum cost is 8 ms. Note that due to locality, the average cost is much closer to the minimum than the maximum.

Clouds and distributed systems research

The Clouds project includes continuing systems research on several topics.

Using persistent objects. Persistent, shared, single-level storage is the central theme of the Clouds model. Therefore, the thrust of several related research projects was to effectively support and exploit persistent memory in a distributed setting. Another area of research is in harnessing the distributed resources to speed up the execution of specific applications compared to a single-processor implementation. We summarize some of these projects here.

Distributed programming. Using the DSM feature of Clouds, centralized algorithms can run as distributed computations with the expectation of achieving speedup. For example, sorting algorithms can use multiple threads to perform a sort, with each thread being executed at a different compute server, even though the data itself is contained in one object. The threads work on the data in parallel, and those parts of the data that are in use at a node migrate to that node automatically. We have shown that even though the data resides in a single object, the computation can be run in a distributed fashion without incurring a high overhead. These experiments are helping us understand the trade-off between computation and communication and the granularity of computations that warrant distribution.
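The sorting experiment has roughly the following shape. The sketch below reproduces it locally, with ordinary C++ threads standing in for Clouds threads on different compute servers and a shared vector standing in for the single object whose pages DSM would migrate; the structure, not the distribution, is the point.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Sort the two halves of one shared data set with two concurrent threads,
    // then merge. In Clouds, each thread could run on a different compute
    // server while DSM migrates the pages of the single object they share.
    void parallel_sort(std::vector<int>& data) {
        auto mid = data.begin() + data.size() / 2;
        std::thread left ([&] { std::sort(data.begin(), mid); });
        std::thread right([&] { std::sort(mid, data.end()); });
        left.join();
        right.join();
        std::inplace_merge(data.begin(), mid, data.end());
    }

    int main() {
        std::vector<int> data = {9, 3, 7, 1, 8, 2, 6, 4, 5, 0};
        parallel_sort(data);   // data is now sorted: 0..9
    }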
The Ethernet computation either completes at all round-trip time is 1.59 ms; this involves Types of persistent memory. Persis- nodes or has no effect on system state. sending and receiving a short message tent memory needs a structured way of Thus, if failures occur, the effects of all (72 bytes) between two compute serv- specifying attributes, such as longevity partially completed computations are ers. The RaTP reliable round-trip time and accessibility, for the language- level undone. is 3.56 ms. To transfer an 8-kilobyte objects contained in Clouds objects. To Consistency by itself does not prom- page reliably from one machine to an- this end we provide several types of ise progress, because a failure undoes other costs 12.3 ms, compared to 70 ms memory in objects. The sharable, per- the partially completed work. To en- using Unix file-transfer protocol and 50 sistent memory is called per-object mem- sure forward progress, objects and com- ms using Sun-NFS. ory. We also provide per-invocation putation must be replicated at nodes Object invocation costs vary widely memory that is not shared, yet is global with independent failure modes. The depending upon whether the object is to the routines in the object and lasts for following aspects of the Clouds system currently in memory or has to be fetched the length of each invocation. Similarly, address the consistency and progress from a data server. The maximum cost per-thread memory is global to the rou- requirements. for a null invocation is 103 ms, while the tines in the object but specific to a par- minimum cost is 8 ms. Note that due to ticular thread, and lasts until the thread Atomicity. The Clouds consistency- locality, the average cost is much closer terminates. Such a variety of memory preservation mechanisms present one to the minimum than the maximum. structures provides powerful program- uniform object-thread abstraction that ming support in the Clouds system.1° lets programmers specify a wide range of atomicity semantics. This scheme, Clouds and distributed Lisp programming environment. If the called Invocation-Based Consistency address space containing a Lisp envi- Control, automatically locks and recov- systems research ronment can be made persistent, sever- ers persistent data. Locking and recov- al advantages accrue, including not hav- ery are performed at the segment level, The Clouds project includes continu- ing to save/load the environment on not at the object level. (An object can ing systems research on several topics. startup and shutdown. Further, invok- contain multiple data segments. The ing entry points in remote Lisp inter- layout and number of segments are un- Using persistent objects. Persistent, preters allows interenvironment opera- der the control of the user programmer. shared, single-level storage is the cen- tions that are useful in building The segments may contain intersegment tral theme of the Clouds model. There- knowledge bases. Other features that pointers, and objects support dynamic

Because segments are user-defined, the user can control the granularity of locking. Custom recovery and synchronization are still possible but are unnecessary in many cases.

Instead of mandating customization of synchronization and recovery for applications that do not need strict atomicity, the new scheme supports a variety of consistency-preserving mechanisms. The threads that execute are of two kinds, namely, s-threads (or standard threads) and cp-threads (or consistency-preserving threads). The s-threads are not provided with any system-level locking or recovery. The cp-threads, on the other hand, are supported by well-defined locking and recovery features. When a cp-thread executes, all segments it reads are read-locked, and the segments it updates are write-locked. The system automatically handles locking at runtime. The updated segments are written by a two-phase commit mechanism when the cp-thread completes. Because s-threads do not automatically acquire locks, nor are they blocked by any system-acquired locks, they can freely interleave with other s-threads and cp-threads.

There are two varieties of cp-threads, namely, the gcp-thread and the lcp-thread. The gcp-thread semantics provide global (heavyweight) consistency, and the lcp-thread semantics provide local (lightweight) consistency. All threads are s-threads when created. Each operation has a static label that declares its consistency needs. The labels are S (for standard), LCP (for local consistency preserving), and GCP (for global consistency preserving). Various combinations of different consistency labels in the same object (or in the same thread) lead to many interesting (as well as dangerous) execution-time possibilities, especially when s-threads update data being read or updated by gcp or lcp threads. (For a discussion of the semantics, behavior, and implementation of this scheme, see Chen and Dasgupta.11)

Fault tolerance. Transaction-processing systems guarantee data consistency if computations do not complete (due to failures). However, they do not guarantee computational success. The Clouds approach to fault-tolerant or resilient computations is called parallel execution threads. PET tries to provide uninterrupted processing in the face of preexisting (static) failures, as well as system and software failures that occur while a resilient computation is in progress (dynamic failures).12

To obtain these properties, the basic requirements of the system are

• replication of objects, for tolerating static and dynamic failures;
• replication of computation, for tolerating dynamic failures; and
• an atomic commit mechanism to ensure correctness.

The PET system works by first replicating all critical objects at different nodes in the system. The degree of replication depends on the degree of resilience required. Initiating a resilient computation creates separate replicated threads (gcp-threads) on a number of nodes. The number of nodes is another parameter provided by the user and reflects the degree of resilience required. The separate threads (or PETs) run independently as if there were no replication. A thread invokes a replicated object by choosing certain copies of the object, as shown in Figure 6. The replica selection algorithm tries to ensure that separate threads execute at different nodes to minimize the number of threads affected by a failure. After one or more threads complete successfully by executing at operational nodes, one thread is chosen to be the terminating thread. All updates made by this thread are propagated to a quorum of replicas, if available. If there is a failure in committing this thread, another completed thread is chosen. If the commit process succeeds, all the remaining threads are aborted. This method allows a trade-off between the amount of resources used (that is, the number of parallel threads started for each computation) and the desired degree of resilience (that is, the number of failures the computation can tolerate while the computation is in progress).

Figure 6. Parallel execution threads.
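The PET termination step can be summarized in a few lines of control flow. The C++ sketch below is a schematic of the scheme as described (wait for completed replica threads, try to commit one thread's updates to a quorum, fall back to another completed thread on failure, then abort the rest); the structure and helper names are hypothetical and stand in for the real replication and commit machinery.

    #include <vector>

    struct PetResult { int thread_id; bool completed; };

    // Stubs standing in for the real mechanisms (quorum commit, abort);
    // they exist only to make the control flow below self-contained.
    bool commit_to_quorum(const PetResult&) { return true; }
    void abort_thread(const PetResult&)     {}

    // Schematic PET termination: pick a completed replica thread, try to
    // commit its updates to a quorum of object replicas, retry with another
    // completed thread if the commit fails, then abort the remaining threads.
    bool terminate_pet(std::vector<PetResult>& threads) {
        for (PetResult& candidate : threads) {
            if (!candidate.completed) continue;
            if (commit_to_quorum(candidate)) {
                for (PetResult& other : threads)
                    if (other.thread_id != candidate.thread_id) abort_thread(other);
                return true;    // computation succeeded despite failures
            }
        }
        return false;           // no completed thread could commit
    }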
The goal of Clouds was to build a general-purpose, distributed computing environment suitable for a wide variety of users in the computer science community. We have developed a native operating system and an application-development environment that is being used for a variety of distributed applications.

Providing a conduit between Clouds and Unix saved us considerable effort. We did not have to port program development and environment tools (such as editors and window systems) to a new operating system, and we can develop applications that harness the new system's data and computation distribution capabilities in the familiar Unix environment. The Clouds system has been a fruitful exercise in providing an experimental platform for determining the worthiness of the object-thread paradigm.

References

1. D.R. Cheriton, "The V Distributed System," Comm. ACM, Vol. 31, No. 3, Mar. 1988, pp. 314-333.

2. S.J. Mullender et al., "Amoeba: A Distributed Operating System for the 1990s," Computer, Vol. 23, No. 5, May 1990, pp. 44-53.

3. B. Liskov, "Distributed Programming in Argus," Comm. ACM, Vol. 31, No. 3, Mar. 1988, pp. 300-313.

4. R.E. Schantz, R.M. Thomas, and G. Bono, "The Architecture of the Cronus Distributed Operating System," Proc. Sixth Int'l Conf. on Distributed Computing Systems, CS Press, Los Alamitos, Calif., Order No. 697, 1986, pp. 250-259.

5. G.T. Almes et al., "The Eden System: A Technical Review," IEEE Trans. Software Eng., Vol. SE-11, No. 1, Jan. 1985, pp. 43-58.

6. M. Accetta et al., "Mach: A New Kernel Foundation for Unix Development," Proc. Summer Usenix Conf., Usenix, 1986, pp. 93-112.

7. K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM Trans. Computer Systems, Vol. 7, No. 4, Nov. 1989, pp. 321-359.

8. P. Dasgupta et al., "The Design and Implementation of the Clouds Distributed Operating System," Usenix Computing Systems J., Vol. 3, No. 1, Winter 1990, pp. 11-46.

9. D.R. Cheriton, "VMTP: A Transport Protocol for the Next Generation of Communication Systems," Proc. SIGComm, 1986, pp. 406-415.

10. P. Dasgupta and R.C. Chen, "Memory Semantics in Persistent Object Systems," in Implementation of Persistent Object Systems, Stan Zdonick, ed., Morgan Kaufmann Publishers, San Mateo, Calif., 1990.

11. R. Chen and P. Dasgupta, "Linking Consistency with Object/Thread Semantics: An Approach to Robust Computations," Proc. Ninth Int'l Conf. Distributed Computing Systems, IEEE CS Press, Los Alamitos, Calif., Order No. 1953, 1989, pp. 121-128.

12. M. Ahamad, P. Dasgupta, and R.J. LeBlanc, "Fault-Tolerant Atomic Computations in an Object-Based Distributed System," Distributed Computing, Vol. 4, No. 2, May 1990, pp. 69-80.

Acknowledgments

This research was partially supported by the National Aeronautics and Space Administration under contract NAG-1-430 and by the National Science Foundation grants DCS-8316590 and CCR-8619886 (CER III program).

The authors acknowledge Martin McKendry and Jim Allchin for starting the project and designing the earliest version of Clouds; David Pitts, Gene Spafford, and Tom Wilkes for the design and implementation of the kernel and programming support for Clouds version 1; Jose Bernabeu, Yousef Khalidi, and Phil Hutto for their efforts in making the version 1 kernel usable and for the design and implementation of Ra; and Sathis Menon, R. Ananthanarayanan, Ray Chen, and Chris Wilkenloh for significant contributions to the implementation of Clouds version 2, as well as for managing the software development effort. Ray Chen and Mark Pearson designed and implemented IBCC and Clide, respectively. We also thank M. Chelliah, Vibby Gottemukkala, L. Gunaseelan, Ranjit John, Ajay Mohindra, and Gautam Shah for their participation in and contributions to the project.

Partha Dasgupta is an associate professor at Arizona State University. He was technical director of the Clouds project at Georgia Institute of Technology, as well as coprincipal investigator of the institute's NSF-Coordinated Experimental Research award. His research interests include distributed operating systems, persistent object systems, operating system construction techniques, distributed algorithms, fault tolerance, and distributed programming support. Dasgupta received his PhD in computer science from the State University of New York at Stony Brook. He is a member of the IEEE Computer Society and ACM.

Richard J. LeBlanc, Jr. is a professor in the School of Information and Computer Science at the Georgia Institute of Technology. As director of the Clouds project, he is studying language concepts and software engineering methodology for a highly reliable, object-based distributed system. His research interests include programming language design and implementation, programming environments, and software engineering. LeBlanc received his BS in physics from Louisiana State University in 1972 and his MS and PhD in computer sciences from the University of Wisconsin-Madison in 1974 and 1977. He is a member of the ACM, the IEEE Computer Society, and Sigma Xi.

Mustaque Ahamad is an associate professor in the School of Information and Computer Science at the Georgia Institute of Technology, Atlanta. His research interests include distributed operating systems, distributed algorithms, fault-tolerant systems, and performance evaluation. Ahamad received his BE with honors in electrical engineering from the Birla Institute of Technology and Science, Pilani, India. He obtained an MS and a PhD in computer science from the State University of New York at Stony Brook in 1983 and 1985.

Umakishore Ramachandran is an associate professor in the College of Computing at the Georgia Institute of Technology. He is currently involved in several research projects, including hardware/software trade-offs in the design of distributed operating systems. He participated in the design of several distributed systems, including Charlotte at the University of Wisconsin-Madison, Jasmin at Bellcore, Quicksilver at IBM Almaden Research Center, and Clouds. His primary research interests are in computer architecture and distributed operating systems. Ramachandran received his PhD in computer science from the University of Wisconsin-Madison in 1986. In 1990, he received an NSF Presidential Young Investigator Award. He is a member of the ACM and the IEEE Computer Society.

Readers may contact Partha Dasgupta at the Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287; e-mail [email protected].
