Naming, State Management, and User-Level Extensions in the Sprite Distributed File System

Copyright © 1990

Brent Ballinger Welch

CHAPTER 1

Introduction

This dissertation concerns network computing environments. Advances in network and microprocessor technology have caused a shift from stand-alone timesharing systems to networks of powerful personal computers. Operating systems designed for stand-alone timesharing hosts do not adapt easily to a distributed environment. Resources like disk storage, printers, and tape drives are not concentrated at a single point. Instead, they are scattered around the network under the control of different hosts. New mechanisms are needed to handle this sort of distribution so that users and application programs need not worry about the distributed nature of the underlying system. This dissertation explores the approach of centering an environment around a shared network file system. The file system is chosen as a starting point because it is a heavily used service in stand-alone systems, and the read/write paradigm of the file system is a familiar one that can be applied to many system resources. The file system described in this dissertation provides a distributed name space for system resources, and it provides remote access facilities so all resources are available throughout the network. Resources accessible via the file system include disk storage, other types of peripheral devices, and user-implemented service applications. The resulting system is one where resources are named and accessed via the shared file system, and the underlying distribution of the system among a collection of hosts is not important to users. The environment is one in which any host is equally usable as any other, much like a timesharing system where all the terminals provide equivalent access to the central host.

The file system makes a basic distinction between operations on the file system name space and operations on open I/O streams. The name space is implemented by a collection of file servers, while I/O streams may be connected to objects located on any host. Thus, two different servers may be involved in access to any given object, one for naming the object and one for doing I/O on the object. This creates a flexible system where devices and services can be located on any host, yet the file servers' directory structures still make up the global name space. At the same time, a file server handles both naming and I/O operations for regular files in order to optimize this important common case. The distinction between naming and I/O has been made in other systems by introducing a separate network name service. However, the name services in other systems have been too heavyweight to use for every file. Name services have been used to name hosts, users, and services instead. In contrast, the work here extends the file system name space to provide access to other services. This achieves the same flexibility provided by a separate name service, but it reuses the directory lookup mechanisms already present on the file servers and it optimizes the important case of regular file access.

Another novel feature of the Sprite file system concerns distributed state management. The file system is implemented by a distributed set of computers, and its internal state is distributed among the operating system kernels at the different sites. Much of the state is used to optimize file system accesses by using a main-memory caching system. The servers keep state about how their clients are caching files, and they use this state to guarantee a consistent view of the file system data [Nelson88b]. It is important to maintain this state efficiently so the performance gains of caching are not squandered, yet it is also important to maintain the state robustly so the system behaves well during server failures and network partitions. The dissertation describes how server state is maintained efficiently during normal operations and how server state is updated when processes migrate between hosts. This dissertation also describes a recovery protocol for ``stateful'' servers that relies on redundant state on the clients. The redundant state is maintained cheaply as main-memory data structures (there is no disk logging), and servers can recover their state after a crash with the help of their clients.

The next section of this introductory chapter gives a little background on the Sprite operating system project. Then the issues addressed by the file system are considered in more detail, along with the more specific contributions I make regarding these issues. The chapter ends with an outline of the other chapters and a summary of the thesis.
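The separation between naming and I/O can be pictured as two internal operation tables, one implemented by file servers and one implemented by whichever host controls the object. The sketch below is illustrative only; the structure and field names (NameOps, IoOps, and their members) are hypothetical stand-ins for the interfaces described in Chapter 3, not the actual Sprite kernel declarations.

    /*
     * Hypothetical sketch of the two internal interfaces implied by the
     * naming/I-O split.  A file server supplies the naming table; the I/O
     * server for an object (possibly a different host) supplies the I/O table.
     */
    struct Attributes;   /* attribute record, sketched in Chapter 2 */

    typedef struct NameOps {
        int (*open)(const char *path, int useFlags,
                    int *ioServerID, int *objectID);   /* resolve a name to an object */
        int (*remove)(const char *path);               /* delete a name */
        int (*getAttributes)(const char *path, struct Attributes *attrPtr);
    } NameOps;

    typedef struct IoOps {
        int (*read)(int objectID, void *buf, int len, int offset);
        int (*write)(int objectID, const void *buf, int len, int offset);
        int (*ioControl)(int objectID, int command, void *inBuf, void *outBuf);
        int (*close)(int objectID);
    } IoOps;

Regular files are the common case in which one server implements both tables, so a single request to the file server both resolves the name and sets up the I/O connection.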

1.1. The Sprite Operating System Project

The work presented here was done as part of the Sprite operating system project [Ousterhout88]. The project began in 1984 with the goal of designing and developing an operating system for a network of personal workstations. We felt that new hardware features changed our computing environment enough to warrant exploration of new operating system techniques. The new hardware features include multiprocessors, large main memories, and networks. Of these, my work mainly concerns the network and the problems associated with integrating a network of workstations into a single system.

Sprite began as part of the SPUR multiprocessor project [Hill86]. The SPUR machine is an example of the new technology we felt demanded a new operating system. It is a multiprocessor with a large physical memory, and it is networked together with many other powerful personal workstations. While the SPUR machine was being built by fellow graduate students, our group began development of Sprite on Sun workstations. Today Sprite runs on Sun3 and Sun4 workstations, DECstations, and, of course, multiprocessor SPURs.

Our approach to designing and building Sprite was to begin from scratch. Instead of modifying an existing operating system such as UNIX,1 we decided to start fresh so as to be unfettered by existing design decisions. We were pragmatic, however, and realized that we did not have the man-power to rewrite all the applications needed to build a useful system. We decided to implement the operating system kernel from scratch, but provide enough compatibility with UNIX to be able to easily port UNIX applications.

We wanted to provide a system that was like a stand-alone timesharing system in terms of ease of use and ease of information sharing, yet provided the additional power of a network of high-performance personal workstations. Our target was a system that supported a moderate number of people, perhaps an academic department or a research laboratory. Furthermore, our system had to efficiently support diskless workstations, which we favor because they reduce cost, heat, noise, and administrative overhead. Our goal of building a real system has been met. Sprite is in daily use by a growing number of people, and it is used to support continued development of Sprite itself.

There are two features of Sprite that dovetail with its shared file system to provide a high-performance system: a high-performance file caching system, and process migration. The caching system keeps frequently accessed data close to the processes that use them, while the migration system lets processes move to idle workstations. (These features are described briefly below.) Furthermore, the stand-alone, ``timesharing'' semantics of the file system are maintained so that the additional performance provided by these features does not impact the ease of use of the system.

Sprite uses main-memory caches of file data to optimize file access. The caches improve performance by eliminating disk and network accesses; data is written and read to and from the cache, if possible. Furthermore, a delayed-write policy is used so that data can age in the cache before being written back. In a distributed system this caching strategy creates a consistency problem. Data may be cached at many sites, and the most recent version of a file may not be in the local cache. The algorithm used to maintain cache consistency and the performance benefits of the caching system have been described in Nelson's thesis [Nelson88b] and [Nelson88a]. The caching system is relevant to this thesis because it creates a state management problem. The state that supports caching has to be managed efficiently so as not to degrade the performance gained by caching, yet it has to be managed robustly in the face of server crashes.

Sprite provides the ability to migrate actively executing processes between hosts of identical CPU architecture. This feature is used to exploit the idle processors that are commonly found in a network of personal workstations; migration is used to off-load jobs onto idle hosts. The system will be described in Douglis's thesis [Douglis90], while this dissertation describes the effect that process migration has on the file system. The shared file system makes process migration easier because data files, programs, and devices are accessible at a migrating process's new site. However, the state that supports caching and remote I/O streams has to be updated during migration, and this is a tricky problem in distributed state management. Thus, the combination of data caching and process migration poses new problems in state management that I address in this dissertation.

1 UNIX is a registered trademark of AT&T.

1.2. Thesis Overview

This dissertation makes original contributions in the areas of distributed naming, remote device access, user-level extensions, process migration, and failure recovery. A software architecture that supports these different features in a well-structured way is also a contribution of this work. Additionally, I present a follow-on study of Sprite's caching system that describes our experiences with it as Sprite has grown into a system that supports real users. These topics are discussed in the following sub-sections.

1.2.1. The Shared File System

The Sprite distributed file system provides a basic framework through which many different kinds of resources (files, devices, and services) are accessed. This is in contrast with other distributed systems where a separate name service is used to locate file servers, device servers, and other service-like applications [Wilkes80] [Birrell82] [Terry85]. In Sprite, the file servers play the role of the name servers in the system. They provide names for devices and services that can be located on any host in the network. The advantage of this approach is that the naming, synchronization, and communication infrastructure required to support remote file access can be reused to provide access to remote devices and remote services. For example, the Sprite file system includes completely general remote waiting and synchronization primitives; the select call can be used to wait on any combination of files, devices, and services that are located throughout the network. At the same time, Sprite optimizes access to regular files because only a single server (the file server) is involved; the overhead of invoking a separate name service is eliminated.

The internal file system architecture makes a fundamental distinction between naming operations and I/O operations. Naming operations are performed in a uniform way for all file system objects by the file servers, while I/O operations are object-specific and may be implemented by any host. Thus, there are three roles that a host can play in the architecture: the file server that does naming operations, the I/O server that implements object-specific operations, and the client that is using the object. The architecture supports this by defining two main internal interfaces, one for naming operations and one for I/O operations. Different implementations of these interfaces cleanly support the different cases that the system has to address. A diskless workstation, for example, names local devices via a remote file server, but it implements I/O operations on the device itself. While other file system architectures define generic internal interfaces [Kleiman86] [Rodriguez86], they are oriented towards file access and have limitations that preclude the general remote device and remote service access provided by Sprite.

I developed a distributed name resolution protocol that unifies the directory structures of a collection of file servers into a uniformly shared, global name space. Clients keep a prefix table that maps file name prefixes to their servers. The prefix tables are caches that are updated with a broadcast protocol; new entries are added as a client accesses new areas of the file system, and out-of-date entries are refreshed automatically if the system configuration changes. This naming system is more dynamic than the static configurations used in most UNIX-based systems, and therefore it is easier to manage. The shared name space and its adaptive nature make it easy to add servers, move files between servers, and add new disk storage. The naming system is optimized towards the local area network environment. It uses a light-weight caching- and broadcast-based system instead of the heavier-weight replicated databases used for larger-scale name services [Birrell82] [Popek85] [Kazar89].
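A client's prefix table can be thought of as a small cache mapping pathname prefixes to servers, consulted with a longest-match rule before a name is sent anywhere. The following sketch shows that lookup step only; the table contents, the server names, and the PrefixEntry layout are illustrative assumptions, and the real tables are filled in by the broadcast protocol described in Chapter 4.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative prefix table entry: a pathname prefix and the server for it. */
    typedef struct PrefixEntry {
        const char *prefix;   /* e.g. "/", "/sprite/src" */
        const char *server;   /* file server that implements this subtree */
    } PrefixEntry;

    /* Example table contents; a real table is a cache filled by broadcast. */
    static PrefixEntry table[] = {
        { "/",           "serverA" },
        { "/sprite/src", "serverB" },
        { "/user",       "serverC" },
    };

    /* Return the server whose prefix is the longest match for the pathname. */
    static const char *LocateServer(const char *path) {
        const char *best = NULL;
        size_t bestLen = 0;
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
            size_t len = strlen(table[i].prefix);
            if (strncmp(path, table[i].prefix, len) == 0 && len > bestLen) {
                best = table[i].server;
                bestLen = len;
            }
        }
        return best;   /* NULL would trigger a broadcast to find the server */
    }

    int main(void) {
        printf("%s\n", LocateServer("/sprite/src/kernel/fs.c"));  /* serverB */
        return 0;
    }

On a miss or after a configuration change, the client broadcasts the unmatched prefix and installs whatever server answers, which is what keeps the table a cache rather than static configuration state.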

1.2.2. Distributed State Management

The Sprite file system is a tightly-coupled distributed system of servers and clients. The servers are ``stateful''; they keep state about how their resources are being used by other hosts. The motivation for the state is to preserve the semantics of a stand-alone system where the file system is implemented in a single operating system kernel. For example, the Sprite caching system provides high-performance access to remote data, yet it is transparent to application programs because the file servers can guarantee a consistent view of the file system by keeping state about how remote clients are caching data. This dissertation makes two contributions regarding the management of this state. The first concerns updating the state during process migration, and the second concerns maintaining the state across server failures.

Process migration causes two problems that have to be solved. The first problem is to simply update the file system's state to reflect the migration of a process. This problem is complicated by the concurrency present in the system. Different processes can share I/O streams and perform different operations on them concurrently. I present a deadlock-free algorithm to update the file system's distributed state during migration. Process migration also causes a problem with shared I/O streams. Processes that share an I/O stream also share the current access position of the stream, which is maintained by the kernel. When migration causes processes on different hosts to share a stream, it is no longer possible to maintain the current access position in a single shared kernel data structure. I describe a system based on shadow stream descriptors kept on the file servers that is simpler than the token-passing schemes used in other systems.

The other contribution I make regarding distributed state management concerns maintaining the state during server failures and network partitions. I developed a new state recovery protocol based on the principle of keeping redundant state on the clients and servers. The state is kept in main-memory data structures so it is cheap to maintain during normal operation. This is in contrast to other recovery systems that are based on logging state to stable storage. In Sprite, the redundant state allows the servers to rebuild their state with the help of their clients. This approach also makes the system robust to diabolical failure modes that arise during network partitions. The recovery protocol is idempotent so it can be invoked by clients any time they suspect that the system's state has become inconsistent due to a communication failure. Common cases like server reboots are handled invisibly to applications, and diabolical failure modes are detected and conflicting recovery actions are prevented. The recovery system also gives users the option of aborting jobs that access dead servers instead of waiting until the server is rebooted.

Finally, I present a follow-on study that examines the behavior of the caching system on our live Sprite network. The caching system is the main cause of the state management problem, so its performance benefits ought to be significant in order to justify the additional complexity of the system. Data presented in Chapter 8 indicates that about 50% of file data is never written back to the Sprite file servers because it is overwritten or deleted before the 30-second aging period expires. Read hit ratios on the client caches are from 30% to 85%, averaging 75% across all clients. However, paging traffic on clients with small memories can dominate traffic from file cache misses. The variable-sized caches occupy over 50% of the file server's main memory, and from 15% to 35% of the clients' main memory. However, the read hit ratios on the servers are reduced because the clients' caches are so effective. The server has to have an order of magnitude more memory than the clients before its cache becomes as effective as the clients'.
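The flavor of the client-assisted recovery protocol can be sketched as follows: each client keeps, in main memory, enough per-stream information to re-open its streams, and replays that information to a rebooted server, which rebuilds its tables idempotently. Everything below (the ClientHandle fields, the ReopenHandle call, and the duplicate check) is a simplified illustration of the idea, not the actual Sprite re-open RPC described in Chapter 6.

    /*
     * Simplified sketch of idempotent state recovery.  The client resends a
     * compact description of each handle it holds; the server installs it,
     * and installing the same handle twice has no additional effect.
     */
    typedef struct ClientHandle {
        int fileID;        /* which object the stream refers to */
        int useFlags;      /* read, write, execute */
        int cachingDirty;  /* does the client hold dirty cached blocks? */
    } ClientHandle;

    typedef struct ServerEntry {
        int clientID;
        ClientHandle handle;
        int inUse;
    } ServerEntry;

    #define MAX_ENTRIES 1024
    static ServerEntry serverTable[MAX_ENTRIES];

    /* Re-install a client's handle after a server reboot (idempotent). */
    int ReopenHandle(int clientID, const ClientHandle *h) {
        int freeSlot = -1;
        for (int i = 0; i < MAX_ENTRIES; i++) {
            if (serverTable[i].inUse &&
                serverTable[i].clientID == clientID &&
                serverTable[i].handle.fileID == h->fileID) {
                return 0;                      /* already recovered: no-op */
            }
            if (!serverTable[i].inUse && freeSlot < 0) {
                freeSlot = i;
            }
        }
        if (freeSlot < 0) {
            return -1;                         /* table full */
        }
        serverTable[freeSlot].clientID = clientID;
        serverTable[freeSlot].handle = *h;
        serverTable[freeSlot].inUse = 1;
        return 0;
    }

Because the operation is a no-op when the entry already exists, a client can safely initiate recovery whenever it merely suspects that communication with the server has failed.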

1.2.3. Pseudo-Devices and Pseudo-File-Systems

The third main area that is addressed in this dissertation concerns the extensibility of the system. Sprite provides mechanisms that allow system services to be implemented by non-privileged, or ``user-level'', server processes. While performance-critical services are implemented within the Sprite kernel, user-level services provide a way to easily add functionality to Sprite. The user-level services are implemented as application programs using the standard development and debugging environment. The motivation for user-level implementation is to avoid uncontrolled growth of the operating system kernel as new features are added to the system over time. Keeping the kernel small means it is more likely to remain stable and reliable. While other systems such as Mach [Accetta86] and the V system [Cheriton84] support user-level services, the approach taken in Sprite is novel because the user-level services are integrated into the distributed file system. The user-level services in Sprite benefit from kernel-resident features such as the distributed name space and remote access facilities, which are present to support remote file and device access.

The server processes appear in the file system as pseudo-devices [Welch88]. Pseudo-devices have names in the file system, but I/O operations on a pseudo-device are forwarded by the kernel up to a user-level server process. Access to remote pseudo-device servers is handled by the kernel mechanisms that support remote file and device access. The pseudo-device mechanism replaces the socket operations provided in UNIX, plus it moves the protocol processing out of the kernel. Pseudo-devices are used to implement an Internet Protocol server and an X window server for Sprite. A user-level server process can also implement a pseudo-file-system by handling naming as well as I/O operations. Sprite uses a pseudo-file-system to provide access to NFS2 file servers. The NFS directory structures are transparently integrated into the distributed file system via the pseudo-file-system mechanism. More interesting applications for pseudo-file-systems are possible. A version control system can be implemented as a pseudo-file-system that automatically manages different versions of files. Or, an archive service can be implemented as a pseudo-file-system that presents a directory structure that reflects the time a file was archived. Pseudo-devices and pseudo-file-systems retain the standard file system interface, yet they allow extensions to the system to be implemented outside the kernel.

An important lesson to learn from user-level services is that there is a significant performance penalty in comparison with kernel-based services. Performance studies of the Sprite mechanisms indicate that the penalty can range from 25% to 100% or more. The reduced performance is due to unavoidable costs associated with switching between user processes as opposed to merely trapping into the kernel. Consequently, there is a fundamental trade-off between having a higher-performance service (and a bigger kernel) and having a service that is easier to develop and debug. While other systems take one approach or the other (kernel-based services or user-level services), in Sprite both options are available. The heavily used services, such as regular file access, are kernel-resident in Sprite, while user-level services are used for less performance-critical services such as remote login, mail transfer, and newly developed services. Moving a pseudo-device implementation into the kernel is not automated, but it is quite feasible because it shares the file system's main internal interface.

2 NFS is a trademark of Sun Microsystems.
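The structure of a user-level pseudo-device server is essentially a request loop: the kernel forwards each client operation as a request message on a stream, the server performs the operation, and the server writes back a reply. The request and reply formats, their field names, and the ServeRequest helper below are hypothetical illustrations of that request-response pattern, not the actual Sprite pseudo-device protocol, which is defined in Chapter 7.

    /*
     * Illustrative skeleton of a pseudo-device server loop.  The kernel is
     * assumed to deliver one Request per client operation on requestStream
     * and to route the matching Reply back to the blocked client.
     */
    #include <unistd.h>

    typedef enum { REQ_READ, REQ_WRITE, REQ_IOCTL } RequestOp;

    typedef struct Request {
        RequestOp op;          /* which file system operation was invoked */
        int       streamID;    /* which client stream the operation is for */
        int       byteCount;   /* size of the data that follows, if any */
    } Request;

    typedef struct Reply {
        int status;            /* result code returned to the application */
        int byteCount;         /* size of returned data, if any */
    } Reply;

    /* Placeholder; a real server implements its service here. */
    static Reply ServeRequest(const Request *req, char *data) {
        (void)req; (void)data;
        Reply r = { 0, 0 };
        return r;
    }

    void ServerLoop(int requestStream) {
        Request req;
        char data[8192];

        /* Each iteration handles one forwarded operation from the kernel. */
        while (read(requestStream, &req, sizeof(req)) == sizeof(req)) {
            if (req.byteCount > 0) {
                read(requestStream, data, req.byteCount);   /* operation's data */
            }
            Reply reply = ServeRequest(&req, data);
            write(requestStream, &reply, sizeof(reply));
            if (reply.byteCount > 0) {
                write(requestStream, data, reply.byteCount); /* returned data */
            }
        }
    }

Because the requests and replies travel over ordinary kernel streams, the same loop works whether the client is local or on another host.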

1.3. Chapter Map

The remaining chapters are organized as follows. Chapter 2 provides background information and describes related work in file systems and distributed systems. Chapter 3 describes the basic architecture of the file system, focusing on the distinction between naming and I/O and on the remote synchronization primitives. The remaining chapters consider individual problems in more detail.

Chapter 4 describes the distributed naming mechanism, which is based on prefix tables and remote links [Welch86b]. The key properties of the system are that it supports diskless workstations, it merges the directory hierarchies of Sprite file servers and pseudo-file-system servers into a seamless hierarchy, and it dynamically adapts to changes in the file system configuration.

Chapter 5 describes the impact that process migration has on the file system. The state maintained by the file servers to support caching has to be updated during migration. This is complicated by concurrency that results from shared I/O streams. The semantics of shared I/O streams also have to be preserved when the processes sharing the stream are on different hosts. This is done via shadow stream descriptors that are kept on the I/O server.

Chapter 6 describes the way the system's state is organized to be fault-tolerant. Some of the servers' state is duplicated on the clients in order to facilitate crash recovery via an idempotent re-open protocol. This chapter also describes the low-level crash detection system and shows that it adds little overhead to the system.

Chapter 7 describes the pseudo-device and pseudo-file-system mechanisms. A ``request-response'' protocol is used to forward operations from the kernel up to the user-level server. The protocol provides read-ahead buffering and asynchronous writing to reduce the number of interactions with the server. Even without buffering the pseudo-device connection is about as cheap as pipes, and much faster than a UNIX TCP connection in the remote case.

Chapter 8 reviews the caching system and provides measurements of its performance on our network. The performance is of interest in itself, and it also motivates the distributed state management problem described in the previous chapters. The final chapter provides some conclusions and a review of the thesis.

There are four appendices. Appendix A is a short retrospective of the Sprite project. Appendix B presents some additional measurements that are too detailed to be included in the other chapters. Appendix C describes the Sprite RPC protocol. Appendix D describes the RPC interface to Sprite.

1.4. Conclusion

There are three main themes to the thesis. The first theme is transparency with regard to the network. A distributed system is only a hindrance unless the operating system takes measures to manage the distribution on behalf of its users and applications. This idea of network transparency has been promoted in other systems, most notably LOCUS [Popek85]. Like LOCUS, Sprite file system operations apply uniformly whether they are implemented locally or by a remote server. LOCUS is oriented towards a system of a few timesharing hosts. It focuses on file replication and provides limited remote device access. Sprite is oriented towards a system of diskless workstations and their servers. Sprite focuses on file caching for high performance, and it provides very general remote access that naturally extends to devices and pseudo-devices.

The second theme concerns maintaining the state of the distributed system in the face of change. The system has to keep its internal state, which is distributed among the operating system kernels on all the hosts, consistent in the face of host failures and during the migration of processes from one host to another. Previous systems have relied on stable disk storage, which is a potential performance bottleneck, especially as CPU power continues to accelerate past disk performance. Other systems propose replicated servers. TANDEM [Bartlett81] is a good example of a real system that uses replication. Replication of servers is expensive, however, both in performance and equipment. The cost of replication is only justifiable for certain transaction processing systems. Instead, Sprite uses stable storage for a few pieces of critical state, and then it duplicates performance-critical state in the main memory of different hosts. The result is a high-performance distributed file system with good failure semantics.

The final theme of the dissertation concerns extensibility. It is not practical or desirable to implement all system services inside the operating system kernel. Sprite includes mechanisms that allow user-level processes to implement file-like objects (i.e., pseudo-devices) and directory hierarchies (i.e., pseudo-file-systems). These provide a convenient framework for implementing a variety of services outside the kernel. However, unlike many modern operating systems, Sprite does not move all high-level services out of the kernel. The most heavily used services, in particular file and device access, are kept in the kernel to optimize performance. Furthermore, because the internal I/O and naming interfaces are exported to the user-level servers, their functionality can be moved into the kernel (or out of it) depending on performance requirements.

The common thread to all of these themes is the file system interface. All objects, whether they are local or remote, system or user-implemented, are named and accessed in the same way. The system provides basic functionality common to all (e.g., crash recovery, process migration, blocking I/O). Thus, the system is composed of several well-integrated features that combine to provide a high-performance, robust, distributed computing system.

CHAPTER 2

Background and Related Work

2.1. Introduction

This chapter presents background information on file systems, distributed systems, and the Sprite operating system. While the dissertation concerns issues that stem from the network environment, it is important to have a basic understanding of stand-alone file systems. This chapter describes file system features such as a hierarchical directory system, device-independent I/O, and data caching. Basic concepts of distributed systems are also presented, including communication protocols, name services, and the client-server model. Sprite has goals similar to those of other distributed systems such as LOCUS [Popek85] and the V-system [Cheriton84], and the related work section describes the approaches taken by these and other systems. Finally, some overall ideas about the scale of the distributed system, location transparency, and the issue of stateful servers are discussed.

2.2. Terminology

Here are definitions for some terms I use throughout the dissertation:

Host
    A host is a computer on the network.

Kernel
    The kernel is the memory-resident code of the operating system that executes in a privileged state. A distributed operating system like Sprite is composed of the kernels on each host.

Application
    An application is a program that a user runs in order to perform some task. Examples include text editors, compilers, and simulators.

Process
    A process is an executing program. It can be a user-level process that is part of an application. An application can be composed of several processes. There are also kernel processes in Sprite; the kernel has several processes used to service network requests and do background processing.

System Call
    A system call is a special procedure call that a user-level process makes to invoke operating system functions. During a system call the process continues execution inside the kernel in a privileged state. There is a well-defined set of system calls used to open files, read the clock, create processes, etc.3

Object
    Object is a general term for things accessed via the file system interface. It can refer to a regular file, directory, device, or pseudo-device.

File
    A file is the basic object maintained by the file system. A file has some attributes like its type and ownership. A file may have disk storage associated with it.

Device
    A device is an attached peripheral such as a tape drive, printer, display, or network interface.

Device File
    A device file is a special file that represents a device. The device file has no associated data storage, but its attributes define the type and location of the device. The name of the device file is used to reference the device.

Pseudo-Device
    A pseudo-device is an object implemented by a user-level server process. Operations on a pseudo-device are transparently forwarded to the server process by the kernel. Pseudo-devices are represented by special files in the file system so they have names and attributes.

Directory
    A directory is a special file used by a file server to record the correspondence between object names and disk files. The files referenced by a directory may be ordinary files, other directories, or special files that represent devices or pseudo-devices.

Client
    In general, a client is an entity that is receiving service from some other entity. Most often I use this term to refer to the operating system kernel when it needs service from a kernel on another host. Occasionally, especially in Chapter 7 on user-level services, ``client'' will refer to an application program that is requesting service from another application.

Client Kernel
    A Sprite kernel that is requesting service from a remote kernel, i.e., to open a file for an application or to write out a dirty page from the virtual memory system.

Server
    A server is an entity that provides some function to its clients. Again, this term most often refers to the operating system kernel on a server machine. In Chapter 7, ``server'' is used to refer to a user-level service application.

File Server
    A file server is a host with disks that store a part of the distributed file system. In addition to providing access to files, the Sprite file servers implement the directory structure used for naming.

I/O Server
    An I/O server is the kernel that controls access to an object. In general the I/O server for an object and the file server that names an object may be different.

I/O Stream
    A stream is an I/O connection between a process and an object in the file system. The term ``stream'' refers to data transfer between the object and the process. A process may also manipulate an object via an I/O stream in other ways such as truncating the length of a file or rewinding a tape device.

3 The Sprite system call interface is very similar to UNIX 4.3BSD. For historical reasons it is not binary compatible, but a simple compatibility library is linked with standard UNIX programs so that they can run on Sprite.

2.3. The UNIX File System

The UNIX file system is important because it has several features that make it easy to use and share information [Ritchie74], and it is widely used today. By adopting the UNIX file system interface Sprite benefits from several good design features and a large base of existing applications. While Sprite retains the UNIX interface, the implementation of the Sprite kernel was done completely from scratch, and its internals differ markedly from other UNIX implementations. This section reviews the major features of the UNIX file system that relate to the rest of the dissertation.

2.3.1. A Hierarchical Naming System

The UNIX file system is organized as a hierarchical tree of directories and files. The tree begins at a distinguished directory called the root. A hierarchical naming system is created by allowing directory entries to name both files and directories. This recursive definition of the name space originated in Multics [Feirtag71] and is used in UNIX and most modern operating systems.

Objects have a pathname that is a sequence of components separated by slash characters (e.g., ``/a/b/c''). The root directory is named ``/''. Name resolution is the process of examining the directory structure in order to follow a pathname. It is an iterative procedure where the current directory is scanned for the next component of the pathname. If the component is found and there are more components left to process, then the current directory is advanced and the lookup continues. The lookup terminates successfully when the last component is found. For example, the pathname ``/a/b/c'' defines a path through the hierarchy that starts at the root directory, ``/'', and proceeds to directory ``a'', and on to directory ``b'', and ends at object ``c''. ``/a/b/c'' can be the pathname of a file, directory, device, pseudo-device, or a symbolic link to another pathname. (Symbolic links are explained below.)

As a convenience to users, UNIX provides a current working directory and relative pathnames. Relative pathnames implicitly begin at the current working directory, while absolute pathnames begin at the root directory. Relative pathnames are distinguished because they do not begin with ``/'', the name of the root directory. For example, if the current directory is ``/a/b'', then the object ``/a/b/c'' can be referred to by the relative pathname ``c''. The current working directory can be changed by a process in order to make it easier to name commonly used files. Relative names are shorter, which means they are both easier to type by the user and more efficient to process by the operating system.

The combination of relative pathnames and the specification of the parent directory means that a pathname can begin at any point in the hierarchy and travel to any other point. A directory's parent may be referred to with the name ``..''. For example, if the current working directory is ``/a/b/c'', then ``..'' refers to ``/a/b''. Sibling directories are easily referenced with relative names (e.g., ``../sibling''). Thus, relative pathnames can be used within a set of related directories so that the absolute location of the directories isn't important, only their relative location.

The hierarchical structure of the name space can become arbitrarily connected by using symbolic links. A symbolic link is a special file that is a reference by name to another point in the hierarchy. There are no restrictions on the target of a symbolic link, so the plain hierarchy can be converted to a more general directed graph. Symbolic links are implemented as files that contain pathnames. Traversing a link is implemented by replacing the name of the link with its contents and continuing the normal lookup algorithm. For example, if ``/a/d'' is a symbolic link to ``b/c'' then the pathname ``/a/d/f'' gets expanded to ``/a/b/c/f''. Symbolic links can also cause jumps back to the root. If ``/x/y/z'' is a symbolic link to ``/a/b'', then the pathname ``/x/y/z/c'' gets expanded to ``/a/b/c''. These links are shown in Figure 2-1.

[Figure 2-1 is a diagram of a small directory tree rooted at ``/'', with dotted lines showing the two symbolic links described in the caption.]

Figure 2-1. Symbolic links in a hierarchical name space. The hierarchical structure of the name space can become arbitrarily connected by using symbolic links, which are depicted as dotted lines in this figure. ``/x/y/z'' is a link to ``/a/b''. ``/a/d'' is a link to ``b/c''.
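The iterative lookup and the link-expansion rule described above can be captured in a few lines of code. The sketch below walks a pathname one component at a time and splices in link contents exactly as the text describes; the Directory representation and the helper name FindComponent are hypothetical, and real implementations also bound the number of link expansions to catch cycles.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical in-memory directory tree used only for illustration. */
    typedef struct Entry {
        const char *name;
        const struct Node *child;
    } Entry;

    typedef struct Node {
        const char *linkTarget;   /* non-NULL if this node is a symbolic link */
        const Entry *entries;     /* NULL-terminated list if this is a directory */
    } Node;

    /* Scan a directory node for one pathname component. */
    static const Node *FindComponent(const Node *dir, const char *component) {
        if (dir->entries == NULL) return NULL;
        for (const Entry *e = dir->entries; e->name != NULL; e++) {
            if (strcmp(e->name, component) == 0) return e->child;
        }
        return NULL;
    }

    /*
     * Resolve "path" starting from the root, expanding symbolic links by
     * splicing their contents into the remaining pathname, as in the text.
     */
    const Node *Resolve(const Node *root, const char *path) {
        char work[1024];
        strncpy(work, path, sizeof(work) - 1);
        work[sizeof(work) - 1] = '\0';

        const Node *current = root;
        char *rest = work;
        int expansions = 0;

        while (*rest != '\0') {
            while (*rest == '/') rest++;            /* skip slashes */
            if (*rest == '\0') break;
            char *end = strchr(rest, '/');
            char component[256];
            size_t len = end ? (size_t)(end - rest) : strlen(rest);
            if (len >= sizeof(component)) return NULL;
            memcpy(component, rest, len);
            component[len] = '\0';
            rest = end ? end : rest + len;

            const Node *next = FindComponent(current, component);
            if (next == NULL) return NULL;          /* component not found */

            if (next->linkTarget != NULL) {
                if (++expansions > 8) return NULL;  /* guard against link cycles */
                char expanded[1024];
                /* Replace the link name with its contents, keep the remainder. */
                snprintf(expanded, sizeof(expanded), "%s%s", next->linkTarget, rest);
                strcpy(work, expanded);
                rest = work;
                if (next->linkTarget[0] == '/') {
                    current = root;                 /* absolute link: back to root */
                }
                /* A relative link continues from the current directory. */
                continue;
            }
            current = next;
        }
        return current;
    }

With the links of Figure 2-1 encoded in such a tree, resolving ``/a/d/f'' visits ``a'', splices in ``b/c'', and ends at the same node as ``/a/b/c/f''.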

2.3.2. Objects and their Attributes

The file system is used to access different types of objects. To distinguish among the different object types, the file system keeps a set of attributes for each one. One attribute is its type (e.g., file, directory, symbolic link, or device). There are also ownership and permission attributes that are used for access control. All objects have these attributes and are subject to the same access controls. Applications can query the attributes of an object, and the owner of an object can change some attributes (like the access permissions). Other attributes, like the size and modify time, are updated by the system as the object is accessed.

It is sometimes important to distinguish between the name of an object and the object itself. Special files are used as place-holders for non-file objects; their entry in a directory gives the object a name. The special files also serve as convenient repositories for attributes of these other kinds of objects. A device file, for example, records the type and unit number of the device, and, in Sprite, it records the host to which the device is attached. These details are ordinarily hidden by the operating system, but in Chapter 3, which describes the internal architecture of the Sprite file system, it is important to understand the difference between the representation of an object's name and the object itself.
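The attribute set described above can be summarized as a single record kept for every object. The struct below is an illustrative composite of the attributes named in this section (type, ownership, permissions, size, modification time, and the device-specific fields); the field names are assumptions for the sketch and do not correspond to the actual Sprite descriptor layout.

    #include <time.h>

    /* Illustrative attribute record for a file system object. */
    typedef enum {
        OBJ_FILE, OBJ_DIRECTORY, OBJ_SYMLINK, OBJ_DEVICE, OBJ_PSEUDO_DEVICE
    } ObjectType;

    typedef struct Attributes {
        ObjectType type;        /* what kind of object this is */
        int        ownerID;     /* user that owns the object */
        int        groupID;     /* group used for access control */
        int        permissions; /* read/write/execute bits */
        long       size;        /* bytes of data, for regular files */
        time_t     modifyTime;  /* updated by the system as the object is written */
        /* Fields below are meaningful only for device files. */
        int        deviceType;  /* kind of device, e.g. tape or terminal */
        int        deviceUnit;  /* which unit of that kind */
        int        deviceHost;  /* in Sprite, the host the device is attached to */
    } Attributes;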

2.3.3. I/O Streams

An I/O stream represents a connection between a process and an object in the file system.4 A stream is either created by naming a file system object with the open operation, or, as described below, a stream may be inherited. Each I/O stream has a current access position that is maintained by the kernel on behalf of the processes using the stream. Each read and write operation updates the stream's current access position so that successive I/O operations apply to successive ranges of the object. This supports the common method of file system accesses in which the whole object is read or written, and it eliminates the need for an explicit location (or ``offset'') parameter to the read and write system calls.

4 Sprite implements the UNIX concept of an I/O stream, which includes stream inheritance and stream sharing. The standard UNIX terminology is ``open file descriptor'', however, instead of ``I/O stream''.

2.3.3.1. Standard I/O Operations

The following UNIX operating system calls apply uniformly to I/O streams.

open
    This operation follows a pathname and sets up an I/O stream to the underlying object for use in the remaining operations. Access permissions are checked on the directories leading to the object and on the object itself.

read
    This operation transfers a number of bytes from the object into a buffer in the application's address space. The operating system maintains a current access position for a stream that indicates what byte is the next one to read. Reading N bytes advances the access position by N so that successive reads on a stream advance sequentially through the object's data.

write
    This operation transfers a number of bytes to the object from a buffer in the application's address space. The access position is advanced in the same way that the read operation advances it.

select
    This operation blocks the process until any one of several streams is ready for I/O, or until a timeout period has expired. Select is useful for servers that have several streams to various clients, or to window system clients that have a different stream for each window they use.

ioctl
    This operation provides a general hook to get at object-specific functionality. The parameters to ioctl include a command specifier, an input buffer, and an output buffer.5 Example operations include rewinding tape drives and setting terminal erase characters. Sprite also uses ioctl to implement a number of miscellaneous UNIX system calls such as ftrunc, which truncates a file, and fsync, which forces a cached file to disk.

fstat
    This returns the attributes of the object to which the stream is connected. This includes the object's type (file, device, pipe), and type-specific attributes like the size of a file.

close
    This operation removes a reference to an I/O stream, and after the last reference is gone the stream is destroyed. Additional references arise during process creation because streams are shared between parent and child.

5 ioctl in Sprite is a super-set of UNIX ioctl. UNIX ioctl only uses a single buffer and its size is implicit. Sprite ioctl has two buffers, one for input and one for output, and their sizes are explicit. This extension makes it easier to pass the ioctl operation around in a general way (e.g., to remote hosts or to pseudo-device servers).
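Because these calls apply uniformly to files, devices, and pseudo-devices, a program that copies data neither knows nor cares what kind of object it has opened. The fragment below is ordinary UNIX-style C using the calls just listed; it is illustrative, and the two pathnames it takes are whatever the caller supplies.

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Copy from one object to another using only the uniform stream
     * operations; either pathname could name a file, a device, or a
     * pseudo-device, and the code is the same.
     */
    int CopyObject(const char *srcPath, const char *dstPath) {
        char buf[4096];
        int src = open(srcPath, O_RDONLY);
        int dst = open(dstPath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) {
            if (src >= 0) close(src);
            if (dst >= 0) close(dst);
            return -1;
        }

        ssize_t n;
        while ((n = read(src, buf, sizeof(buf))) > 0) {
            /* Each read and write advances the stream's access position. */
            if (write(dst, buf, (size_t)n) != n) break;
        }
        close(src);
        close(dst);
        return (n == 0) ? 0 : -1;
    }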

2.3.3.2. Stream Inheritance

Each UNIX process begins execution with access to three standard I/O streams: standard input, standard output, and standard error output. Instead of having the operating system kernel set up standard streams for a new process, UNIX uses a more general mechanism of stream inheritance. When a process is created it inherits its I/O streams from its parent process. The parent is free to set up any configuration of I/O streams for the child. UNIX command interpreters, which are the parent processes of user commands, ensure that the standard streams are set up.

Inheritance is useful with processes that are independent of what their input and output I/O streams are associated with. Command interpreters provide the capability to redirect the standard streams to files and devices so that users can specify on the command line where input and output should go, and the program does not have to worry about it. For example, program output might go to a file, to a peripheral device like a printer, or to another process which transports the data across the country via a specialized internetwork protocol. This notion of object-independent I/O originated in Multics (as ``device-independent I/O'' [Feirtag71]) and was adopted by UNIX.

A subtle, but important, side effect of stream inheritance is the fact that an inherited stream is actually shared between the parent and the child process. Specifically, the stream's current access position is shared; I/O operations by any process sharing the stream advance the stream's current access position. This property is most often used with command files (``scripts'' in UNIX). Each command is a child process of a common parent. The parent process shares the input and output streams with each child in succession, and each child process advances the stream offsets as it does I/O. Output of one process will not overwrite output of a previous process, and input processed by one process will not be re-read by the next process. (Read-ahead buffering by library I/O routines, however, can cause one process to steal input from the next one.) Preserving these semantics of shared I/O streams can be tricky in a distributed environment, and the way this is done in Sprite is described in Chapter 5.
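The shared access position can be demonstrated directly: after a fork, parent and child hold the same underlying stream, so their writes land one after another rather than overwriting each other. This small UNIX-style C program is illustrative; the output filename is a made-up example.

    #include <fcntl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /*
     * Parent and child share one inherited stream, including its access
     * position, so the two writes appear back to back in the file.
     */
    int main(void) {
        int fd = open("example.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return 1;

        if (fork() == 0) {                       /* child inherits fd */
            write(fd, "child line\n", 11);
            _exit(0);
        }
        wait(NULL);                              /* let the child go first */
        write(fd, "parent line\n", 12);          /* appended, not overwritten */
        close(fd);
        return 0;
    }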

2.3.3.3. Pipes

A pipe is a one-way communication channel that connects the writing stream of one process to the reading stream of another process. Ordinary UNIX pipes have no name in the file system.6 Instead, two I/O streams, one for reading from the pipe and one for writing to it, are created with the pipe system call. Parent processes can use stream inheritance to pass these streams to child processes. UNIX command interpreters make it easy to compose a set of processes into a pipeline, with the output of the first hooked to the input of the second, and so on. There are a number of standard filter programs in UNIX that can be composed into useful pipelines. They include programs that sort, edit, select lines based on pattern matching, compress and decompress.

6 UNIX System V has named pipes, which do have names in the file system.
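Pipes combine with stream inheritance in exactly the way the previous two sections describe: the command interpreter creates the pipe, then each child inherits one end as its standard input or output. The fragment below builds a tiny two-stage pipeline by hand; it is ordinary UNIX-style C, shown only to make the mechanism concrete.

    #include <unistd.h>

    /*
     * Equivalent of the shell pipeline "ls | sort": the first child's
     * standard output and the second child's standard input are both
     * inherited copies of the same pipe.
     */
    int main(void) {
        int fds[2];
        if (pipe(fds) < 0) return 1;

        if (fork() == 0) {                 /* first stage: writes into the pipe */
            dup2(fds[1], STDOUT_FILENO);
            close(fds[0]);
            close(fds[1]);
            execlp("ls", "ls", (char *)NULL);
            _exit(1);
        }
        if (fork() == 0) {                 /* second stage: reads from the pipe */
            dup2(fds[0], STDIN_FILENO);
            close(fds[0]);
            close(fds[1]);
            execlp("sort", "sort", (char *)NULL);
            _exit(1);
        }
        close(fds[0]);                     /* parent keeps neither end open */
        close(fds[1]);
        return 0;
    }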

2.3.4. The Buffer Cache

In order to reduce the latency associated with disk accesses the file system maintains a block-oriented cache of recently accessed file data. The cache is checked before each disk read, and if the block is in the cache then a disk access is avoided. When data is written it is placed in the cache and written back in the background so applications don't have to wait for the disk operation. Data is allowed to age for 30 seconds before being written to disk. Studies by Ousterhout [Ousterhout85] and Thompson [Thompson87] show that many temporary files get created, read, and deleted within this time period so their data never gets written to disk.
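The delayed-write policy amounts to stamping each block when it is dirtied and only writing back blocks whose stamp is older than the aging period. The sketch below shows that policy in isolation; the CacheBlock structure, the WriteBlockToDisk hook, and the named aging constant are illustrative assumptions rather than the real buffer cache code.

    #include <time.h>

    #define AGE_SECONDS 30   /* delayed-write aging period described above */
    #define CACHE_BLOCKS 512

    /* Illustrative cache block: identity, dirty flag, and when it was dirtied. */
    typedef struct CacheBlock {
        int    fileID;
        int    blockNumber;
        int    dirty;
        time_t dirtiedAt;
    } CacheBlock;

    static CacheBlock cache[CACHE_BLOCKS];

    /* Service-specific hook: actually push one block to the disk or file server. */
    extern void WriteBlockToDisk(const CacheBlock *block);

    /*
     * Periodic write-back pass: only blocks that have aged past the limit are
     * written, so short-lived temporary files usually never reach the disk.
     */
    void WriteBackAgedBlocks(void) {
        time_t now = time(NULL);
        for (int i = 0; i < CACHE_BLOCKS; i++) {
            if (cache[i].dirty && now - cache[i].dirtiedAt >= AGE_SECONDS) {
                WriteBlockToDisk(&cache[i]);
                cache[i].dirty = 0;
            }
        }
    }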

2.3.5. UNIX File System Summary

To summarize, the important features of the UNIX file system include its hierarchical name space, object-independent I/O with its open-close-read-write interface, and the main-memory buffer cache to optimize performance. Sprite retains these features in its distributed environment, although it often employs different mechanisms and techniques than a stand-alone UNIX system.

2.4. Distributed System Background

The two fundamental problems introduced by a distributed system are host-to-host communication, which is addressed by network communication protocols, and resource location, which is addressed by name servers that map resource names to network locations and other resource attributes. The client-server model is commonly used to structure a distributed system. In this case servers are identified by the name service, and a standard communication protocol is used between clients and servers. These topics are discussed in more detail below.

2.4.1. Network Communication

Communication protocols provide a standard way to communicate between hosts connected by a network. Protocols address such issues as routing a message through the network and reliable transmission of messages over an unreliable network. A range of protocols exist that vary in their reliability and efficiency. Cheap but unreliable datagram protocols [IP81] are used to build up more reliable (and more expensive) protocols such as virtual circuit and request-response protocols. A virtual circuit provides a reliable data channel between two hosts that is useful for bulk data transfer and long-lived connections [TCP81]. Virtual circuit protocols are usually optimized towards increasing throughput, maximizing the rate at which large amounts of data can be transferred. The request-response protocols [Birrell84] are oriented towards service applications where clients make requests of servers and the servers return some response. These protocols are usually optimized towards reducing latency, minimizing the time a client has to wait for a response from the server.

2.4.2. The Client-Server Model and RPC

A standard way to structure a distributed system is to have servers that control resources and clients that need to access these resources. A standard communication protocol enables a client to access any server in the system. The least structured and most flexible communication protocol is a datagram protocol in which a message is sent from Client A to Server B (or vice versa). In practice, however, it turns out that a more stylized request-response protocol is used between clients and servers. The client issues a request message to the server and blocks itself awaiting a response message from the server. Upon receipt of a request message the server performs some function and returns the results to the client in a response message. The reason that request-response communication is preferred is that it resembles what happens in a programming language during a procedure call. The caller suspends itself as it invokes a procedure, and it continues after the procedure completes. By introducing stub procedures that handle packaging of parameters into messages and the use of the request-response protocol, the client can use a regular procedure call interface to access the server. This technique is called Remote Procedure Call (RPC) [Birrell84].

There are two main benefits from the use of RPC. The first is that by retaining the procedural interface it is efficient to access local resources. In the local case, the client executes the service procedure directly and there is no need to switch to a server process. In contrast, an explicit message-based interface between the client and the server adds extra overhead from messages and process switching even in the local case. The second benefit is that RPC encourages a clean system design because the effects of message passing are confined to the stub procedures and the request-response protocol. The client and the server are basically unaware of the complexities associated with the message protocol. The internal Sprite file system architecture, which is described in detail in the next chapter, relies on these aspects of RPC. Each Sprite kernel communicates with other kernels via a kernel-to-kernel RPC protocol when it needs access to remote resources [Welch86a].
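A client stub makes the packaging step concrete: the caller sees an ordinary procedure, and the stub marshals the arguments into a request, waits for the response, and unmarshals the result. The message layout, the SendRequestAndWait transport hook, and the RemoteRead procedure below are hypothetical illustrations of the stub idea, not the Sprite kernel-to-kernel RPC interface.

    #include <string.h>

    /* Hypothetical wire format for one remote operation. */
    typedef struct ReadRequest {
        int fileID;
        int offset;
        int length;
    } ReadRequest;

    typedef struct ReadReply {
        int  status;
        int  bytesReturned;
        char data[8192];
    } ReadReply;

    /* Transport hook: send the request to the server and block for its reply. */
    extern int SendRequestAndWait(int serverID, const ReadRequest *req, ReadReply *reply);

    /*
     * Client stub: callers use an ordinary procedure call; the messaging is
     * confined entirely to this routine.
     */
    int RemoteRead(int serverID, int fileID, int offset, void *buf, int length) {
        ReadRequest req = { fileID, offset, length };
        ReadReply reply;

        int status = SendRequestAndWait(serverID, &req, &reply);
        if (status != 0 || reply.status != 0) {
            return -1;                                /* transport or server error */
        }
        if (reply.bytesReturned > length) {
            return -1;                                /* malformed reply */
        }
        memcpy(buf, reply.data, (size_t)reply.bytesReturned);
        return reply.bytesReturned;
    }

The corresponding server stub performs the inverse unpacking and calls the local service procedure, which is why the local case can simply call that procedure directly with no messages at all.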

2.4.3. Resource Location

Most distributed systems have a name service as one of their fundamental components [Terry85] [Schroeder84] [Oppen83] [Needham79]. The role of the name service is to provide a registry for all services in a distributed system. The process controlling a resource (e.g., a file system or printing device) registers itself with the name service, and other processes can query the name service in order to determine the correct process with which to communicate. Thus, using a resource in a distributed system is divided into two steps: 1) an initial binding step in which the resource is located and 2) the subsequent use of the resource by exchanging messages with the server that controls the resource.

One disadvantage of using a global name service is that it adds overhead; the name server must be contacted first to locate a particular service before the service can be invoked directly. Because of this, name servers have been used to locate things like hosts, people, and other services, but name servers are not typically used on the fine grain needed for a file system, i.e., for each file in the system. This results in a two-level naming system (i.e., service-name:object-name) and perhaps also a duplication of effort between each service, which has to interpret its object names, and the name service, which has to interpret service names.

In Sprite, the file system name space is used as the name service [Welch86b]. The binding between client and server is done as part of the open system call, and the read, write, and ioctl calls are used to access the service (e.g., files, devices, and pseudo-devices). This will be explained in detail in Chapter 3. This approach optimizes the common case of file access because there is only a single server involved; the overhead of accessing a separate name service is eliminated.

2.5. Related Work

This section describes related work in distributed systems, including early remote file access systems, message-passing systems, and distributed file systems.

2.5.1. Early Distributed Systems

An early approach to providing a distributed system was to glue together existing stand-alone systems with a simple name service. This approach was taken in systems such as UNIX United [Lobelle85] [Brownbridge82] and the Cocanet system [Rowe82] by modifying file name syntax to explicitly identify the server (e.g., ``hostA:/a/b/c''). The runtime library was modified to check for file names with the special syntax, and a special library routine was invoked to forward the operation to a remote host. This approach is clean and simple if the operating system interface is also simple because there may be only a few points in the library that need modification. The disadvantage of modifying name syntax is that the distribution of the system is evident in the name space. Full names are ``server:name,'' so it is not possible to move files among hosts without changing their full names.

2.5.2. Message-Based Systems

Several research operating systems have been developed that focus on communication protocols as a basic system component. In these message-based systems, all processes interact by exchanging messages. The operating system provides the capability to reliably send messages to processes executing on any host in the network [Powell77] [Bartlett81] [Fitzgerald85]. This represents a distinct change in system model from the traditional operating systems that provide each process with an isolated virtual machine upon which to execute [Goldberg74]. Instead, a process in a message-based system interacts asynchronously with other processes and the operating system itself by sending and receiving messages.

2.5.2.1. The V System

The V system [Cheriton84] has focused on a simple, high-performance message passing facility as one of its fundamental building blocks. The V kernel only implements address spaces, processes, and the interprocess communication protocol. All higher-level system services are implemented outside the kernel in separate processes. The argument to support this approach is that because the operating system kernel is simpler it can be more efficient, and because most system services are outside the kernel it is easier to experiment with new implementations of these services. An important common denominator among the V servers is a naming protocol so that all system services can be named in a similar way [Cheriton89]. The naming protocol distinguishes among different servers with name prefixes such as ``print'' and ``fileserverA''. A complete name is a prefix and a name (e.g., ``fileserverA]a/b/c''). The `]' character distinguishes the prefix, and a global name service is consulted to locate the server process corresponding to this prefix. The remainder of the name is evaluated by the server itself. The V system also has a Uniform I/O protocol [Cheriton87] for those services that provide access to files and peripheral devices. This protocol defines standard message formats that are used to read and write data from and to the service.

2.5.2.2. Mach

Mach [Accetta86] is another message-passing system that is designed to support a set of user-level server processes. The focus of Mach, however, has been on the use of virtual memory techniques (i.e., memory mapping) to optimize interprocess communication. Processes can share memory regions directly, or they can exchange messages as is done in the V system. Virtual memory mapping techniques are used to optimize message passing between processes on the same host. This approach is oriented towards multiprocessors with a globally shared memory, and it does not work as well in a distributed system. Mach invokes a user-level server process to implement its network communication protocols, and the user-level server adds extra overhead [Clark85]. In contrast, Sprite implements its RPC protocol within the kernel so the remote case is still relatively efficient.

2.5.2.3. Amoeba

The Amoeba distributed system [Renesse89] is a message-based system that is distinguished by its use of capabilities for access control and message addresses. A capability is an encrypted token that represents some service or process [Fabry74], and the operating system provides the ability to send a message to the process that issued the capability. The capability also encodes access permissions so that the service can control use of its resources. A name server provides a directory service that maps names to capabilities.

2.5.3. Remote File Services

A frequently used service provided by the operating system is the file system. The file system provides long-term storage for program images and data files, so virtually all other services and applications depend on it. Because of this, most distributed systems include some form of remote file access capability. Sprite takes this approach to the extreme by making its distributed file system the foundation for the rest of the system. A number of other distributed file systems are reviewed below.

2.5.3.1. WFS

WFS was a simple remote file service that provided page-level access to files stored on a file server [Swinehart79]. Files were created and deleted using a unique identifier (UID), and read and write operations applied to whole pages of a file. All other file-related operations (e.g., high-level naming, locking, etc.) were provided by other services. An important characteristic of WFS is that it was a ``stateless'' service; all long term state of the WFS file service was kept on disk so that the server could crash without losing state information. A stateless design simplified the implementation of WFS. Each write request was blocked until the data was safely on the disk, and the server could then ``forget'' about the write. WFS is an example of a very low-level service; it simply exports a disk to other hosts on the network.

2.5.3.2. IFS

The ``Interim'' File Service (IFS) was developed as a follow-on to WFS. The IFS servers were designed to be shared repositories for files in the CEDAR environment [Swinehart86]. Each CEDAR workstation, however, was a stand-alone system with its own private file system. The private file system was used as a cache of files kept on the IFS servers, and shared files had to be copied to and from the IFS servers as needed. Tools were developed to make this style of sharing less tedious and error prone [Schroeder85], but users still had to be conscious of where files were stored and whether or not they were safely backed up to an IFS server.

2.5.3.3. NFS

NFS is a remote file service developed for the UNIX environment [Sandberg85]. Like WFS, NFS is based on the stateless server model. Consequently, an NFS server doesn't know about open I/O streams. There is no open or close operation in the NFS protocol; an NFS server just responds to read and write requests that identify the file and the user that is making the request. The reason that the stateless server model was adopted by NFS was to simplify error recovery. A stateless server can be restarted at any time without breaking its clients. Clients retry operations indefinitely, even if it takes the server minutes or hours to respond. The stateless nature of the server means it can begin servicing read and write requests immediately after it restarts. There is no special recovery protocol required to bring an NFS server back on-line. However, a drawback of the stateless nature of NFS is demonstrated by its caching system. Clients keep a main-memory cache of file data to eliminate some network accesses. Because the NFS server is stateless, however, the caching scheme cannot provide consistency while cached files are being updated; it is possible for a client to get an inconsistent view of the file system. The stateless server doesn't remember what clients are caching files, so it is the clients' responsibility to keep their caches up-to-date. They achieve this by periodically checking with their server, which leaves windows of time during which file modifications may not be visible at a client. Thus, the stateless nature of NFS prevents it from maintaining the semantics of a stand-alone UNIX file system when it uses caching to improve performance.
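The polling scheme that clients of a stateless server must rely on can be made concrete with a small sketch. The code below is not NFS code; the constant, structure, and RPC names (REVALIDATE_SECS, CachedFile, GetAttributesFromServer) are hypothetical. It only illustrates how a client that trusts its cache for a fixed interval can serve stale data for up to that interval after another client writes the file.

/*
 * Hypothetical sketch of poll-based cache validation in the style that
 * stateless-server clients use.  The client trusts a cached file for up to
 * REVALIDATE_SECS; after that it fetches the file's attributes and compares
 * the modify time.  Any write by another client that lands inside the trust
 * window goes unnoticed until the next poll.
 */
#include <stdbool.h>
#include <time.h>

#define REVALIDATE_SECS 3          /* assumed attribute-cache timeout */

typedef struct CachedFile {
    long   fileID;
    time_t cachedMtime;            /* modify time when the data was cached */
    time_t lastChecked;            /* last time we polled the server */
} CachedFile;

/* Assumed RPC that returns the server's current modify time for a file. */
extern time_t GetAttributesFromServer(long fileID);

bool
CacheEntryIsUsable(CachedFile *entry)
{
    time_t now = time(NULL);

    if (now - entry->lastChecked < REVALIDATE_SECS) {
        return true;               /* trust the cache; possible stale window */
    }
    entry->lastChecked = now;
    if (GetAttributesFromServer(entry->fileID) == entry->cachedMtime) {
        return true;               /* unchanged on the server */
    }
    return false;                  /* caller should invalidate and re-fetch */
}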

2.5.3.4. RFS

RFS is another UNIX-based remote file service [Rifkin86]. It differs from NFS in that the servers do keep state about how their files are being used by clients. Server state is used to support a more sophisticated caching scheme that guarantees consistency. The servers make callbacks to clients that indicate that a file is being modified. This eliminates the case where a client fetches stale data from its cache. One problem with RFS, however, is that there is no mechanism to recover the server's state after it crashes. This means that I/O streams to files on a server that crashes are forcibly closed [Atlas86]. In contrast, a stateless NFS server can crash and reboot and I/O operations are only delayed, not aborted.

2.5.3.5. AFS

The Andrew File System (AFS) was developed to support a large scale system, up to several thousand nodes [Satyanarayanan85]. It uses a caching system that keeps files on local disks, which is similar to IFS, instead of a system that caches files in main memory, which is done in NFS and RFS. Unlike IFS, however, AFS automatically manages the file cache so it is transparent to the user. Also, consistency is guaranteed, except during concurrent updates by different clients, by the use of callbacks to the clients. AFS puts a time limit on each callback in order to reduce the state maintained by the server. A server only promises to make a callback within the time limit, and it can discard information about the callback after the time limit expires. (A callback with a time limit is sometimes referred to as a lease [Gray89]; the server leases a file to a client for a given period of time.) One drawback of the caching system used in AFS is that it assumes local disks; diskless clients are not supported.

2.5.3.6. LOCUS

LOCUS [Popek85] is a UNIX-based distributed system that is similar to Sprite in many respects. Perhaps the most important goal these systems share is to provide a transparent distributed file system, one in which the location of files is not evident or important to users. However, LOCUS was designed to allow sharing among a small number of timesharing hosts, while Sprite was designed to support a larger number of diskless clients and file servers. Different assumptions are made in the design of the two systems. The most significant assumption is that LOCUS depends on file replication [Walker83a], while Sprite uses file caching. In LOCUS, files are replicated in the file systems of different hosts to increase their availability and improve performance. Certain critical files are replicated in every host's file system so that hosts can remain autonomous. Replication requires a more complex write protocol so that all available copies

of a file get updated. The primary advantage of replication over caching is that it

improves the availability of the system. Local copies of a file can be used even if a network partition has isolated a host from the rest of the system. However, autonomous operation gives rise to possible conflicts between updates to different copies of a file [Parker83]. Replication requires a recovery protocol so that after a host crash or a network partition the system can ensure that all file system replicas are consistent. The LOCUS recovery protocol requires communication among every host-pair for an overall cost proportional to the square of the number of hosts. Thus, replication is attractive because it can improve availability, but it comes at the cost of a more complex write protocol and an expensive recovery protocol. In contrast, the caching approach taken in Sprite is much lighter-weight. Files are cached in main-memory instead of duplicated on disks so they are efficient to access. Write operations are simpler because they do not have to be propagated to all cached copies of a file. Instead, out-of-date copies are invalidated the next time they are used. As described in [Nelson88a], this invalidation requires at most one remote operation. Recovery is also simpler in Sprite. Its cost is only proportional to the number of clients caching a file as opposed to the number-of-hosts-squared cost incurred in LOCUS. The caching approach taken in Sprite simplifies things, and it provides high performance.

2.5.3.7. Other Systems

There are a number of other remote file systems reported in the literature. Apollo DOMAIN is an early distributed system that is based on the memory-mapped file model [Leach83] [Leach82]. There are many UNIX-based systems similar to NFS or RFS [Hughes86] [Atlas86]. Other systems focus on file replication [Tomlinson85] [Ellis83], or transaction support [Cabrera87] [Wilkes80] [Mitchell82] [Oki85], which are topics not considered in this dissertation.

2.6. The Sprite File System

There are three important aspects of the Sprite file system: the scale of the system, location-transparency, and distributed state. The effects of these features on the design of Sprite are discussed below.

2.6.1. Target Environment

The scale of the distributed system is important. There are roughly three classes of distributed systems: 1) small collections of autonomous hosts that wish to cooperate to a limited extent, 2) larger collections of hosts, usually personal workstations, that are more interdependent, and 3) very large collections of hosts that can span organizational boundaries. Sprite is targeted for the middle range, a medium sized network (up to 500 hosts) of high-performance diskless workstations and a set of supporting file server machines. A small example is shown in Figure 2-2. This moderate scale allows for certain simplifying assumptions. For example, server hosts can be located using broadcast. Geographical distribution is not an issue, nor is per-host autonomy. Instead, Sprite unifies a moderate size network into a single computing environment.

[Figure 2-2 appears here: a diagram of a small Sprite network containing diskless workstations, a file server with the shared file system and a local disk, a compute server, and a printer.]

Figure 2-2. A typical Sprite network. The file system is shared uniformly by all hosts, and it has names for devices and user-level services as well as files. Most Sprite hosts are diskless workstations.

Sprite was developed on Sun-2 and Sun-3 workstations, which have processors rated at around 1-2 MIPS (millions of instructions per second). They have from 4 to 16 Mbytes of main memory. The file servers have one or more disks, each with a capacity of 70-1000 Mbytes. Today Sprite runs on the latest DEC and Sun workstations, which have 10-20 MIPS processors and 12-128 Mbytes of main memory. Our system currently has four file servers and about 40 clients, and it is gradually expanding.

2.6.2. Transparency

The Sprite file system is location-transparent: the same interface applies to file system objects regardless of their location in the network. Transparency begins with a naming system that does not reflect an object's location. The approach taken in Sprite is to extend the file system name space so it provides location-transparent names for devices and user-level server processes (i.e., pseudo-devices) as well as names for files and directories. This approach is in contrast to the approach of adding a separate name service that is used to locate file servers, device servers, and other user-implemented services. The benefit of extending the file system name space is that users and applications are presented with the same interface found in a stand-alone system. All the details about where objects are kept are hidden by the operating system. Access to objects must be location-transparent, too. Operations on any type of object must apply even if the object is on another host. Thus, all objects are available throughout the system. In contrast, most other distributed systems provide remote file access, but very few systems provide access to remote devices. Remote device access is important because devices like printers and tape drives can be shared by all clients. Displays can be accessed remotely to provide direct user-to-user communication. Disks can be accessed remotely for diagnostic and administrative reasons. Remote device access is more difficult than remote file access, however, because many devices have indefinite service times. A file access, in contrast, completes or fails in a relatively short period of time, on the order of 10s of milliseconds. Remote operations have to be structured so that critical resources are not tied up for an indefinite period of time while waiting on a slow device. In RFS, for example, remote device operations that have to wait are aborted if server resources would be exhausted [Rifkin86]. In Sprite, remote operations are structured differently so that long-term waiting doesn't use up critical server resources.

2.6.3. Stateful vs. Stateless

The Sprite file system maintains internal state information that is itself distributed among the hosts in the network. There are two approaches to maintaining this state. In the so-called ``stateless server'' approach, each server host is basically independent of other hosts, and it can service requests with no state about previous requests. Any long-term state maintained by the service is kept in non-volatile memory so that the server can crash at any time and not corrupt itself. One of the first examples of a stateless service was WFS [Swinehart79]. The NFS [Sandberg85] protocol also uses the stateless server model. The main advantage of the stateless server approach is that failure recovery is simplified. The server merely reboots and performs some consistency checks on its non-volatile data structures. Clients do not need to take any special recovery action. They can retry their service requests until the server responds. However, there can be a performance cost in the stateless approach because the server must reflect all important state changes to its disk. A more optimistic approach is to allow the server to build up volatile state about how its resources are currently being used. The extra state can be used to optimize access to the server's resources. Consider an I/O stream from a process on a client to an object on a server. If the server keeps state about an I/O stream then permission checking can be done when the I/O stream is created and need not be repeated each time the I/O stream is used. A stateless server, in contrast, must recheck permissions each time a file is read or written. A stateless server also has difficulty supporting the UNIX unlink operation. Unlink removes a directory entry for a file, but the file is not deleted from disk until I/O streams that reference the file are closed. A server cannot support this semantic of unlink if it does not keep state about I/O streams. A more significant performance-related example concerns the main memory buffer cache of file data. A client of a stateful server can count on help from its server to keep its cache consistent with activity by other clients [Nelson88a] [Back87] [Howard88]. The server can inform Client A that its cached version of a particular file is no longer valid because Client B wrote a new version. A stateless server, in contrast, cannot help its clients in this fashion because it doesn't keep state about what clients are using its resources. Instead, clients of a stateless server must poll their server to determine if a cached file is up-to-date, and this reduces the efficiency of the client's cache. Furthermore, there can be periods of time between polling in which a client gets stale data from its cache [Sandberg85]. The main problem with the stateful approach, however, concerns failure recovery. There must be some way to recover the server's state if it fails, otherwise a server crash can cause clients to fail as well. As described in Chapter 6, Sprite relies on redundancy in the system to implement a state recovery protocol in which servers can clean up after failed clients and clients can help a failed server recover its state after it restarts.
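The callback idea described above can be pictured with a short C sketch. This is illustrative only, not Sprite's implementation; the structure names and the NotifyClientInvalidate stub are hypothetical. The essential point is that because a stateful server remembers which clients cache a file, it can push invalidations instead of making clients poll.

/*
 * Illustrative sketch of a stateful server keeping client caches consistent.
 * The server remembers which clients cache each file; when one client opens
 * the file for writing, the server notifies the others so they discard their
 * cached copies.  A stateless server has no such list, so its clients must
 * poll instead.
 */
typedef struct CachingClient {
    int                   clientID;
    struct CachingClient *next;
} CachingClient;

typedef struct FileState {
    long           fileID;
    CachingClient *cachers;        /* clients known to cache this file */
} FileState;

/* Assumed RPC stub that tells a client to drop its cached copy. */
extern void NotifyClientInvalidate(int clientID, long fileID);

void
OpenForWriting(FileState *file, int writerID)
{
    CachingClient *c;

    for (c = file->cachers; c != NULL; c = c->next) {
        if (c->clientID != writerID) {
            NotifyClientInvalidate(c->clientID, file->fileID);
        }
    }
    /* ... record writerID in the list and proceed with the open ... */
}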

2.7. Conclusion

The basic problem considered in this dissertation is how to organize a collection of computers connected by a network into a unified system. The approach taken in Sprite is to use the file system as the focal point of the system. This approach is motivated by timesharing systems where users shared all resources (including the CPU), and cooperative work was easy (albeit slow). By extending the file system to provide shared access to resources located throughout the network, the best features of the timesharing environment are regained along with the performance benefits of personal workstations. The UNIX file system, in particular, is chosen as a starting point because of its current popularity and because it embodies many good design decisions, including a hierarchical name space and the notion of device-independent I/O operations. The latter half of this chapter considered related work and the various issues raised by distributed systems. The fundamental issues are naming and communication, which Sprite addresses by extending the file system name space to name other objects (e.g., devices and pseudo-devices), and by using a kernel-to-kernel RPC protocol for efficient remote access. A few key ideas were also discussed. The client-server model provides a way to structure the system. With a standard communication protocol (i.e., RPC) it is possible for a client to access any server in the network. Location transparency is important so that the distributed system has the look and feel of a unified system; details of the distributed environment can be managed by the operating system. The stateful nature of the system is also important. Server state is required to support all the semantics of a stand-alone UNIX file system and to support a high-performance caching system, but server state makes failure recovery more difficult.

CHAPTER 3

Distributed File System Architecture

3.1. Introduction

This chapter describes the internal software architecture of Sprite's distributed file system. The architecture supports the file system's goal of providing access to files, peripheral devices, and pseudo-devices that are located all around the network. These objects are all accessed via the same file system interface, regardless of their location in the network. The underlying distribution of the objects among a collection of file servers and client workstations is not important to users. A ``single system image'' is provided where any host is as usable as any other, much like a timesharing system where all the terminals provide equivalent access to the central host. The architecture makes a fundamental distinction between naming and I/O operations. This distinction allows file servers to play the role of ``name server'' for file system objects that are not kept on the file servers. These objects include devices and pseudo-devices (i.e., user-level server processes). The architecture supports a three-party situation among the file server that holds the name for an object, the I/O server that implements I/O operations on the object, and the client that uses the object. Thus, the Sprite architecture provides the flexibility found in systems that introduce a network name service [Wilkes80] [Terry85] [Renesse89], but it does this by reusing the mechanisms needed for remote file access. By taking this approach, the architecture optimizes the common case of file access because there is only a single server involved, the file server. The addition of devices and services to the file system creates problems that are not present in other distributed file systems. The main problem is that devices and arbitrary user-implemented services can have indefinite service times, while a file access has a bounded service time. It is important to structure the system so that critical operating system resources cannot be exhausted by too many indefinite service calls. For example, each Sprite kernel keeps a pool of kernel processes that are used to service remote requests. If the server processes were allowed to block indefinitely on a device, then all the server processes could be used up and prevent other requests from being serviced. Thus, this chapter describes a synchronization technique for remote blocking I/O operations that solves this problem. The technique is also applicable to the select operation, which is used to wait on many different files, devices, and services simultaneously. A distinctive feature of the Sprite file system is that it supports stateful servers. Sprite servers maintain state about how their resources are being used by clients. The state is useful for a number of reasons: 1) state helps guard against attempts to misuse resources, 2) state is needed for file locking and exclusive access modes, 3) state is used to implement remote blocking I/O operations, and 4) state helps the servers optimize access to their resources. In short, server state is required to fully implement all the semantics found in a stand-alone file system. The most significant example of this is the caching system used in Sprite. All Sprite hosts, clients and servers, keep main-memory caches of recently used file data to optimize file accesses [Nelson88a]. The file servers maintain state information describing how their files are cached. The servers guarantee that clients always have a consistent view of file data, even during concurrent updates by different clients.
They rely on their state information to achieve this consistency. This chapter describes the way the system maintains its state efficiently during normal use, while Chapter 6 describes the way the system repairs its state after failures. In contrast, the ``stateless'' approach taken by systems such as WFS [Swinehart79] and NFS [Sandberg85] limits the semantics of the system. A stateless server cannot make guarantees to its clients about the consistency of the data they cache. Instead, it is the client's responsibility to poll the server. Also, the stateless approach can lead to garbage collection problems; often there is no way to know if names still exist for objects, or if objects still exist for names [Terry88]. In a stand-alone UNIX file system, for example, it is possible to remove the name for a file while there are still open I/O streams accessing the file. The file is not deleted from disk, however, until the I/O streams are closed. This requires shared state between the naming and I/O parts of the system that is found in few distributed systems. Thus, it is the combination of stateful servers and the use of the file system name space as the name service that distinguishes Sprite from other work in distributed systems. A notable exclusion from this file system architecture is any description of the disk sub-system, file formats, disk allocation policies, etc. These are considered ``object-specific'' details that only pertain to file access. They are hidden below the internal I/O interface. The current implementation of the Sprite disk sub-system is loosely based on the fast-file-system work of McKusick [McKusick84]. Current work with Sprite includes research into log-structured file systems to achieve even higher disk bandwidth [Ousterhout89a]. However, the details of the disk sub-system are not important in the discussion of this architecture, which mainly addresses issues raised by the network. The rest of this chapter is organized as follows. Section 3.2 motivates the division of the architecture into general-purpose and object-specific modules. Section 3.3 describes the data structures that represent an I/O stream. Section 3.4 describes the internal naming interface, and Section 3.5 describes the internal I/O interface. Section 3.6 presents measurements of the system. Section 3.7 compares the Sprite file system architecture with UNIX-based architectures, and Section 3.8 concludes the chapter.

3.2. General Description

The file system is used to access many different types of objects. The goal of the architecture is to gracefully handle the many different cases that need to be implemented. It is important to carefully organize the architecture so that the combination of different types of objects and different features does not lead to an explosion of special case code in the implementation. The code is organized into general-purpose modules and object-specific modules. The general-purpose modules are independent of the type of the object being accessed, so their functionality can be applied to all types of objects. The object-specific modules hide the details about different cases that the architecture has to handle. The architecture defines an interface between the general-purpose routines and lower-level, object-specific routines. There are several object-specific implementations of the interface (i.e., one for local files, one for remote devices, etc.). The general-purpose routines invoke the proper object-specific routines through the interface, and the actual case they invoke is hidden. There are no object-specific dependencies in the general-purpose routines. This approach has become known as an ``object-oriented'' approach, although the implementation of Sprite does not rely on an object-oriented language or other special support. Modularity has been achieved by careful design of the main internal interfaces and supporting data structures. The indirection through the interface to different object-specific implementations is achieved with a simple array of procedure variables that is indexed by a type. The type is explicit in the data structures that represent an I/O stream as described below in Section 3.3.
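The array-of-procedure-variables technique can be illustrated with a short C sketch. The type, structure, and routine names here are hypothetical rather than the actual Sprite declarations; the point is only how a general-purpose routine branches to an object-specific implementation without testing the type in line.

/*
 * Minimal sketch of dispatch through an array of procedure variables
 * indexed by descriptor type.
 */
typedef enum {
    TYPE_LOCAL_FILE, TYPE_REMOTE_FILE,
    TYPE_LOCAL_DEVICE, TYPE_REMOTE_DEVICE,
    NUM_TYPES
} DescriptorType;

typedef struct IoOps {
    int (*read)(void *descriptor, char *buffer, int numBytes);
    int (*write)(void *descriptor, char *buffer, int numBytes);
    int (*close)(void *descriptor);
} IoOps;

/* One object-specific implementation per type, registered in one table. */
extern IoOps fileLocalOps, fileRemoteOps, deviceLocalOps, deviceRemoteOps;

static IoOps *ioDispatchTable[NUM_TYPES] = {
    &fileLocalOps, &fileRemoteOps, &deviceLocalOps, &deviceRemoteOps,
};

/*
 * A general-purpose routine never tests the type explicitly; it just
 * branches through the table.
 */
int
Fs_ReadInternal(DescriptorType type, void *descriptor, char *buf, int len)
{
    return ioDispatchTable[type]->read(descriptor, buf, len);
}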

3.2.1. Naming vs. I/O

The system call interface to the file system reflects a basic split between naming and I/O operations. Figure 3-1 lists operations on pathnames. Figure 3-2 lists the operations made on open I/O streams. (For completeness, the remaining FS system calls are given in Figure 3-3.) The distinction between naming and I/O is important in the Sprite file system because naming operations are implemented by the file servers, while any host can be the I/O server for an object (i.e., a device). The distinction between naming and I/O is supported by having two main internal interfaces in the architecture, one for operations on pathnames and one for operations on open I/O streams. The naming interface hides the distribution of the name space among local and remote file servers, and it is largely independent of the type of object being named. The I/O interface hides the different kinds of objects accessed via the file system, and it is independent of how the name of an object is implemented. These interfaces are distinct so that different servers can implement the naming and I/O operations for the same object. Note, however, that the file servers are also the I/O servers for their files, and this allows optimization of file access in comparison with a system that uses an external name service. In Sprite, there is no need for a third party name service when accessing files. The overall structure of the architecture is shown in Figure 3-4. The different cases underneath the naming interface handle local and remote pathnames. The different cases underneath the I/O interface handle files, pipes, devices, and remote access (all remote devices are treated identically on the client side). It is possible to name a remote device with a local pathname, or a local device with a remote pathname; all combinations are possible because the I/O and naming interfaces are independent. Note that the term ``object-specific'' is used with both the naming and the I/O interfaces. With the naming interface, however, object-specific means ``local pathname'' as

System Calls on Pathnames

Fs_AttachDisk(pathname, device, flags)        (mount, umount) Mount or dismount a disk on a file server.
Fs_ChangeDir(pathname)                        (chdir) Change a process's working directory.
Fs_CheckAccess(pathname)                      (access) Verify access permissions on a pathname.
Fs_GetAttributes(pathname, attributes)        (stat) Return the attributes of an object.
Fs_SetAttributes(pathname, attributes)        (chmod, chown, utimes) Change the attributes of an object.
Fs_MakeDevice(pathname, devAttrs)             (mknod) Create a device file.
Fs_MakeDir(pathname)                          (mkdir) Make a directory.
Fs_Remove(pathname)                           (unlink) Remove a directory entry.
Fs_RemoveDir(pathname)                        (rmdir) Remove a directory.
Fs_Rename(pathname, newname)                  (rename) Change the name of an object from pathname to newname.
Fs_HardLink(pathname, linkname)               (link) Create another directory entry (linkname) for an existing object (pathname).
Fs_SymLink(pathname, linkname)                (symlink) Create a symbolic link (linkname) to another name (pathname).
Fs_ReadLink(pathname)                         (readlink) Read the value of a symbolic link.
Fs_Open(pathname, flags, mode, streamIDPtr)   (open) Create an I/O stream given a pathname.

Figure 3-1. Sprite system calls that operate on pathnames. The equivalent UNIX system calls are given in parentheses.

System Calls on I/O Streams

Fs_GetNewID(streamID, newID)                  (dup, dup2) Create a new stream ID for an existing I/O stream.
Fs_GetAttributesID(streamID, attributes)      (fstat) Return the attributes of an object.
Fs_SetAttributesID(streamID, attributes)      (fchmod, fchown) Change the attributes of an object.
Fs_Read(streamID, buffer, numBytes)           (read) Read data from an I/O stream.
Fs_Write(streamID, buffer, numBytes)          (write) Write data to an I/O stream.
Fs_IOControl(streamID, cmd, inSize, inBuf, outSize, outBuf)   (ioctl, trunc, ftrunc, flock, fsync, lseek) Do an object-specific function.
Fs_Select(numStreams, readMask, writeMask, exceptMask, timeout)   (select) Wait for any of several I/O streams to become ready for I/O, or until a timeout period has expired. The streams are specified by bitmasks where each bit corresponds to a stream ID.
Fs_Close(streamID)                            (close) Close an I/O stream.

Figure 3-2. Sprite system calls that operate on I/O streams. The equivalent UNIX system calls are given in parentheses.

Miscellaneous FS System Calls

Fs_Command(cmd, buffer, option)               General hook used for testing, debugging, and setting kernel parameters.
Fs_SetDefPerm(permissions)                    (umask) Set the default permissions for newly created files.
Fs_CreatePipe(readStreamID, writeStreamID)    (pipe) Create a pipe and return two streamIDs corresponding to the reading and writing ends of the pipe.

Figure 3-3. Miscellaneous Sprite system calls that pertain to the file system. The equivalent UNIX system calls are given in parentheses.

[Figure 3-4 appears here: a block diagram. General-purpose modules handle system calls and incoming RPC requests and invoke case-specific modules through two internal interfaces: the Naming Interface (with Local and Remote implementations) and the I/O Interface (with File, Device, Pipe, and Remote implementations). The case-specific modules sit above the network RPC system, the disks, and the devices.]

Figure 3-4. The modular organization of the file system architecture. There are two primary internal interfaces, one associated with operations on pathnames, and another associated with operations on I/O streams. General-purpose modules invoke object-specific modules through these interfaces, thus hiding the details associated with each case. Network requests are also dispatched through the internal interfaces in order to completely hide the use of RPC.

opposed to ``remote pathname,'' and the type of the object being named is not important. With the I/O interface, object-specific means ``local file,'' ``remote device,'' etc.

3.2.2. Remote Procedure Call

One can view the architecture's internal interfaces as an interface between client and server. The top-level routines above the interfaces are the ``client'' parts of the system, and the low-level routines are the ``servers'' for various kinds of objects supported by the system. With this viewpoint, the Remote Procedure Call (RPC) model provides a natural way for top-level client routines to invoke remote service procedures (i.e., for remote file or device access). RPC fits into the naming and I/O interfaces as follows. In the remote case, the top-level, general-purpose naming and I/O routines invoke RPC stub procedures through the object-specific naming and I/O interfaces. These stubs package up their parameters into messages and use a network RPC protocol to invoke the corresponding object-specific procedure on the remote server. At the server, complementary RPC stub procedures call through the internal naming and I/O interfaces to invoke the correct service procedure. Note that the internal interface is used on both the client and the server to hide the details associated with network RPC. The service procedures are invoked in the same way whether they are called through the interface directly, or whether RPC has been used to forward the operation from a remote client. Similarly, the top-level routines are the same

whether they invoke the service procedure directly through the interface, or whether RPC is used to forward the operation to the remote server. Sprite uses a custom RPC protocol for kernel-to-kernel network communication [Welch86a]. The network protocol handles the problems of lost network messages by setting up timeouts and resending messages, if needed. Explicit acknowledgment packets are eliminated, if possible, by using the Birrell-Nelson [Birrell84] technique of implicit acknowledgments: a reply automatically acknowledges the corresponding request, and a new request serves as the acknowledgment for the previous reply. In most cases only two network messages are required, so there is low latency in the protocol. To increase throughput, large data transfers are supported by using multiple network packets. Finally, the Sprite RPC protocol guarantees that the remote procedure will be executed at-most-once; if a communication failure occurs then the caller cannot tell if the callee failed before or after execution of the remote procedure. Measurements of the RPC performance are presented in Section 3.6.1.
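The implicit-acknowledgment idea can be sketched as a tiny per-channel state machine. This is a simplified illustration in the Birrell-Nelson style, not the actual Sprite RPC code; the field and function names are made up.

/*
 * Simplified sketch of implicit acknowledgments.  Each channel carries one
 * outstanding RPC at a time: the reply acknowledges the request, and the
 * next request acknowledges the previous reply, so no separate
 * acknowledgment packets are needed in the common case.
 */
typedef struct RpcChannel {
    unsigned int sequence;         /* ID of the current (or last) RPC */
    int          replyOutstanding; /* server: reply not yet acknowledged */
} RpcChannel;

/* Client side: a new request reuses the channel, bumping the sequence. */
unsigned int
ClientNextRequest(RpcChannel *chan)
{
    return ++chan->sequence;       /* implicitly acks the previous reply */
}

/* Server side: decide what an incoming request acknowledges. */
void
ServerHandleRequest(RpcChannel *chan, unsigned int requestSeq)
{
    if (chan->replyOutstanding && requestSeq > chan->sequence) {
        /* The client moved on, so the previous reply was received. */
        chan->replyOutstanding = 0;
    }
    chan->sequence = requestSeq;
    /* ... execute the call and send the reply ... */
    chan->replyOutstanding = 1;
}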

3.3. Internal Data Structures

The detailed description of the architecture begins in this section with the data structures that represent file system objects and the I/O streams between processes and objects. These data structures represent the distributed state of the system. With this information as a base, the internal naming and I/O interfaces are described in more detail in Sections 3.4 and 3.5.

3.3.1. Object Descriptors

The central data structure in the architecture is an object descriptor. Object descriptors are typed data structures that encapsulate the state of a particular object and are used during operations on the object. The contents of an object descriptor are initialized and manipulated only by the object-specific module that matches the type of the object descriptor. Bugs can be patched and features can be added for one type of object without breaking the implementation of other types of objects. Each object descriptor begins with a descriptor ID that is composed of four fields. The first field is a type that is used to dispatch through the internal I/O interface to different object-specific procedures. The types are given in Table 3-1. They correspond to the different cases that the I/O interface has to accommodate. The second field in a descriptor ID identifies the I/O server for the object. The explicit server ID facilitates the use of RPC, and it also ensures that different I/O servers do not define the same descriptor IDs. The third and fourth fields are for use by the I/O server to distinguish the object

descriptor from others of the same type. (Within the Sprite code the term ``I/O handle'' is used instead of ``object descriptor''. Object descriptor type definitions have names like ``Fsio_FileIOHandle'', ``Fsio_DeviceIOHandle'', and ``Fsrmt_IOHandle''.)

Object Descriptor Types

LOCAL_FILE       A file on a local disk.
REMOTE_FILE      A file on a remote disk.
LOCAL_DEVICE     A local device.
REMOTE_DEVICE    A device on a remote host.
LOCAL_PDEV       A locally executing user-level server process.
REMOTE_PDEV      A remote user-level server process.
LOCAL_PIPE       A one-way byte stream between processes.
REMOTE_PIPE      A pipe that is remote because of process migration.

Table 3-1. Object descriptor types used by the operating system. The corresponding LOCAL and REMOTE types are used on the I/O server and remote clients, respectively.
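A rough C rendering of the descriptor ID described above may help. It is a sketch only; the field names (serverID, major, minor) are illustrative, not the actual Sprite declarations.

/*
 * Sketch of a descriptor ID with the four fields described in the text:
 * a type (one of the cases in Table 3-1), the I/O server's host ID, and
 * two fields the I/O server uses to tell its descriptors apart.
 */
typedef enum {
    LOCAL_FILE, REMOTE_FILE,
    LOCAL_DEVICE, REMOTE_DEVICE,
    LOCAL_PDEV, REMOTE_PDEV,
    LOCAL_PIPE, REMOTE_PIPE
} ObjectType;

typedef struct DescriptorID {
    ObjectType type;       /* selects the object-specific I/O routines */
    int        serverID;   /* host ID of the I/O server */
    int        major;      /* server-chosen, e.g. disk or device type */
    int        minor;      /* server-chosen, e.g. file or unit number */
} DescriptorID;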

The I/O server chooses its object descriptor IDs so that they are always the same for a particular object. Instead of being random UIDs, they are based on some property of the object. Device servers, for example, embed the device type and device unit in a device's object descriptor ID. File servers put a disk number and file number into a file's object descriptor ID. The deterministic choice of an object descriptor ID is important for two reasons. First, the recovery protocol described in Chapter 6 depends on this so that clients can help servers rebuild their internal state, which includes the contents of object descriptors. Second, it is possible for a device to have more than one name (this is described below in Section 3.4.1.2). By choosing the device's object descriptor ID based on the device type and unit number, the system ensures there is only one object descriptor for the device, even if the device has multiple names. An important property of the object descriptors used in Sprite is that there are complementary LOCAL and REMOTE object descriptors for use on the I/O server and client, respectively. This distinction supports a clean separation between object-specific procedures used by remote clients and those used on the I/O server. The server's object-specific procedures are only concerned with accessing the underlying object; they are not littered with special cases against a remote object. The client's object-specific procedures are concerned with efficiently forwarding the operation to the remote I/O server. Note also that if the client and the I/O server are the same host, then only the LOCAL object descriptor is used, and the local service procedures are accessed directly through the I/O interface. The association between the object descriptors on the client and the I/O server is defined by the IDs of their corresponding object descriptors. The object descriptor IDs differ only in their type and the remaining three fields are the same. This correspondence means that clients pass descriptor IDs (not the whole descriptor) to the server, and the server can locate its corresponding descriptor by converting the client's type to a server type (e.g., REMOTE_FILE to LOCAL_FILE). The conversion is done by general purpose code on the server so that RPC stub procedures can be shared. The stubs take care of locating the server's descriptor and then they branch through the internal I/O interface to object-specific service procedures.
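Continuing the previous sketch (and reusing its hypothetical DescriptorID and ObjectType definitions), the two points above can be shown directly: descriptor IDs are built deterministically from properties of the object, and a server finds its own descriptor by mapping the client's REMOTE type to the matching LOCAL type while keeping the other three fields.

/* Same object, same ID, every time: the ID is derived from the object. */
DescriptorID
DeviceDescriptorID(int ioServerID, int devType, int devUnit)
{
    DescriptorID id = { LOCAL_DEVICE, ioServerID, devType, devUnit };
    return id;
}

DescriptorID
FileDescriptorID(int fileServerID, int diskNumber, int fileNumber)
{
    DescriptorID id = { LOCAL_FILE, fileServerID, diskNumber, fileNumber };
    return id;
}

/*
 * Shared by the server-side RPC stubs: REMOTE_X from the client maps to
 * LOCAL_X on the server; the remaining fields are identical.
 */
DescriptorID
RemoteToLocal(DescriptorID clientID)
{
    DescriptorID local = clientID;
    switch (clientID.type) {
    case REMOTE_FILE:   local.type = LOCAL_FILE;   break;
    case REMOTE_DEVICE: local.type = LOCAL_DEVICE; break;
    case REMOTE_PDEV:   local.type = LOCAL_PDEV;   break;
    case REMOTE_PIPE:   local.type = LOCAL_PIPE;   break;
    default: break;                 /* already a LOCAL type */
    }
    return local;
}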

Code sharing among RPC stubs on the client is also possible because they are invoked through the internal I/O interface. The REMOTE_DEVICE, REMOTE_PDEV, and REMOTE_PIPE implementations all share the same RPC stub procedures. These stubs are ultimately used by the REMOTE_FILE implementation, too, to read and write data to and from the cache. To summarize, an object descriptor encapsulates the information needed to access an object. The type of the object descriptor indicates the particular case that applies to the object, and this type is used to branch to object-speci®c implementations of the I/O interface. The descriptors are tailored towards the different needs of the client and the server. The I/O server keeps LOCAL object descriptors and its object-speci®c implemen- tations access the underlying object directly. The remote clients keep REMOTE descrip- tors and use RPC to forward operations to the I/O server, although object-speci®c optimi- zations for the remote case are supported (e.g., use of the data cache during remote ®le accesses). Object descriptor IDs are passed between clients and servers, and the RPC stubs rely on a well-de®ned mapping between LOCAL and REMOTE descriptor types in order to ®nd their corresponding descriptor.

3.3.2. Stream Descriptors

An I/O stream between a process and an object is represented by the data structures shown in Figure 3-5. For each process, the kernel keeps a stream table that contains pointers to stream descriptors. Each stream descriptor references an object descriptor. A stream descriptor is created by the Fs_Open system call, and a pointer to this descriptor is put into the stream table. The index of the pointer in the stream table is returned to the process as its handle on the I/O stream. The stream descriptor provides a level of indirection between the process and the object descriptor that is useful for three reasons. First, different processes may be using an object in different ways (e.g., one for reading and one for writing). The intended (and authorized) use of a stream is recorded in the stream descriptor and used to guard against unauthorized use of the object. Second, processes may share an I/O stream as the result of stream inheritance. A stream descriptor is created during the Fs_Open system call, and this stream descriptor becomes shared when the stream is inherited by another process. A reference count in the stream descriptor reflects this sharing. Third, sequential access to an object is supported by maintaining the current access position in the stream descriptor. Read and write operations advance the access position so that consecutive operations on an I/O stream operate on sequential ranges of an object.
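A stream descriptor can be sketched as a small C structure holding the three pieces of per-stream state just described. The names here are illustrative, not the exact Sprite declarations, and the read routine is a sketch under that assumption.

#define STREAM_READ  0x1
#define STREAM_WRITE 0x2

typedef struct StreamDescriptor {
    int   refCount;         /* processes sharing this stream */
    int   useFlags;         /* STREAM_READ and/or STREAM_WRITE */
    long  accessPosition;   /* byte offset for sequential access */
    void *objectDescriptor; /* the object this stream is connected to */
} StreamDescriptor;

/* Reads advance the access position so consecutive reads are sequential. */
int
StreamRead(StreamDescriptor *stream, char *buffer, int numBytes,
           int (*objectRead)(void *obj, long offset, char *buf, int len))
{
    int bytesRead;

    if ((stream->useFlags & STREAM_READ) == 0) {
        return -1;                          /* stream not open for reading */
    }
    bytesRead = objectRead(stream->objectDescriptor,
                           stream->accessPosition, buffer, numBytes);
    if (bytesRead > 0) {
        stream->accessPosition += bytesRead;
    }
    return bytesRead;
}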

The Sprite data structures for an I/O stream are similar to those in UNIX, except for two extensions related to remote clients. (The terminology is different, however: instead of ``stream table'' UNIX uses ``open file table,'' instead of ``stream descriptor'' UNIX uses ``open file descriptor,'' and instead of ``object descriptor'' UNIX uses ``inode.'' The Sprite architecture uses ``stream'' instead of ``open file'' to reflect that I/O streams can be connected to many different kinds of objects.)

[Figure 3-5 appears here: a diagram. Processes #1 and #2 on Client A share stream descriptor STREAM <359> (reference count 2, access position 862, read only), which references a REMOTE_DEVICE <4> object descriptor whose I/O server is host C. Process #3 on I/O server C has its own stream descriptor STREAM <846> (reference count 1, access position 72, read write) referencing the LOCAL_DEVICE <4> object descriptor, which records 2 readers and 1 writer overall and keeps a client list (Client A: 1 reader, 0 writers; Client C: 1 reader, 1 writer).]

Figure 3-5. The data structures that represent an I/O stream from a process to an object. There are two levels of sharing possible. First, there may be many distinct streams to the same object. Secondly, a stream may be shared by more than one process. In this example there are three processes accessing the same device. Two of the processes are sharing one stream, and a third is on the I/O server. The I/O server keeps a client list with the object descriptor that records how different clients are using the object. Note that the client list counts the number of stream descriptors on each client, which can be different than the number of processes using the streams due to stream sharing.

First, Sprite has corresponding LOCAL and

REMOTE object descriptors as described above. Secondly, each LOCAL object descriptor has a client list that the server uses to keep state about how its object is being used by remote clients. Each client list entry counts the readable, writable, and executable I/O streams that originate from each client, as well as information about ®le locks held by the client. The per-client organization of this information is required for three reasons. First, the Sprite ®le caching system has to know what clients are reading and writing ®les to guarantee that clients see a consistent view of ®le data [Nelson88a]. Second, servers disallow modi®cation of ®les being used for an executable program image. Third, if a client crashes then the I/O server can clean up after it by simulating close operations on I/O streams from that client and releasing any ®le locks that it held.
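The per-client usage counts kept in a LOCAL object descriptor's client list can be sketched as follows. The field names are illustrative only; the point is that the server records enough per-client information to drive cache consistency, to refuse writes to executing images, and to clean up after a crashed client.

typedef struct ClientListEntry {
    int  clientID;               /* host ID of the remote client */
    int  readStreams;            /* streams from that client opened to read */
    int  writeStreams;           /* ... opened to write */
    int  execStreams;            /* ... opened to execute a program image */
    int  holdsExclusiveLock;     /* file-lock state is also kept per client */
    struct ClientListEntry *next;
} ClientListEntry;

/* Called on the I/O server for each open forwarded from a client. */
void
RecordClientOpen(ClientListEntry *entry, int forReading, int forWriting)
{
    if (forReading) {
        entry->readStreams++;
    }
    if (forWriting) {
        entry->writeStreams++;
        /* A write open of a file with execStreams > 0 would be refused. */
    }
}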

3.3.3. Maintaining State About I/O Streams

This section describes the stream accounting information that the system keeps so that the I/O server has an accurate view of the I/O streams to its objects. The accounting information includes a reference count on stream descriptors, and usage counts in the client list that count up the I/O streams used for reading, writing, and execution. (File lock information is also kept in the client list; it includes an exclusive lock bit and a count of shared lock holders.) These are maintained according to the following invariants, which refer to Figure 3-5 for examples.

The reference count on a stream descriptor re¯ects use of the stream by local processes. Stream #359 is shared by two processes on Client A, for example, so it has a reference count of 2.

Each client list entry counts the readable, writable, and executable I/O streams that originate from that client. If a stream is used for more than one purpose (e.g., reading and writing), then it is counted once for each use. The LOCAL_DEVICE object descriptor in Figure 3-5 has two client list entries corresponding to the streams on client A and on itself.

There is a set of summary usage counts in a LOCAL object descriptor that represents the total number of streams to the object, both local streams and remote streams. This can be computed from the client list, but it is maintained separately in order to simplify con¯ict checking and garbage collection of unused descriptors. An important aspect of these data structures is that sharing of streams is not visible at the object descriptor level; if a stream is shared among processes, then it still only counts as one stream at the object level. Each process that is sharing the stream has the same access permissions (e.g., reading and/or writing) so it does not matter how many different processes share the stream. It is important to hide stream sharing from the object level because it eliminates the need to communicate with the I/O server when new processes share a stream. Otherwise, if the I/O server had to know about stream sharing, then every

process creation would suffer the cost of notifying the I/O server about new sharers of a stream.

The basic stream operations that affect the stream accounting state are described below.

Fs_Open creates a stream descriptor with reference count equal to 1. The object descriptor referenced by the stream has to be initialized or have its usage counts incremented to reflect the new stream. If the stream is to a REMOTE object descriptor then the corresponding LOCAL object descriptor, and its client list, has to be updated.

Proc_Fork and Fs_GetNewID increment the reference count on a stream descriptor. Proc_Fork creates a child process that shares all its parent's streams. Fs_GetNewID adds an additional reference from a process's stream table and returns a new handle (ID) for the stream. In either case, there is no new stream descriptor so there is no need to change the stream usage accounting on the object descriptor.

Fs_Close decrements the reference count on the stream. If it is the last reference then the stream descriptor will be destroyed. In this case the stream usage information in the object descriptor has to be decremented.

These operations imply that the I/O server sees every Fs_Open operation so that it can keep track of all the I/O streams. Also, for each Fs_Open the I/O server will see one Fs_Close. This is the fundamental difference between a stateful and a stateless system. A stateless protocol such as NFS has no open and close operations, while a stateful protocol includes open and close so the server can maintain state about connections (I/O streams) to its resources.
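The bookkeeping rules above can be summarized in a short, self-contained sketch (not Sprite's code; the structure and helper names are hypothetical). Only stream creation and the last close touch the object descriptor, and hence possibly the remote I/O server; fork and dup are purely local.

typedef struct Stream {
    int   refCount;
    int   useFlags;
    void *object;                  /* the object descriptor */
} Stream;

/* Assumed helpers that adjust the usage counts (and, for a REMOTE
 * descriptor, update the I/O server's client list). */
extern void ObjectUsageAdd(void *object, int useFlags);
extern void ObjectUsageRemove(void *object, int useFlags);
extern void StreamFree(Stream *stream);

Stream *
StreamOpen(Stream *stream, void *object, int useFlags)   /* Fs_Open */
{
    stream->refCount = 1;
    stream->useFlags = useFlags;
    stream->object = object;
    ObjectUsageAdd(object, useFlags);     /* I/O server sees every open */
    return stream;
}

void
StreamShare(Stream *stream)          /* Proc_Fork or Fs_GetNewID (dup) */
{
    stream->refCount++;              /* no change at the object level */
}

void
StreamClose(Stream *stream)          /* Fs_Close */
{
    if (--stream->refCount == 0) {
        ObjectUsageRemove(stream->object, stream->useFlags);
        StreamFree(stream);
    }
}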

3.4. The Naming Interface

The internal naming interface is described in this section. The operations are similar to those found in other remote file systems. The novel aspects of the naming interface mainly concern the way the name space is distributed among servers, and this is the topic of Chapter 4. In this chapter it is sufficient to understand that a server exports a directory in the global file system hierarchy to the rest of the network, and this causes all lookups below that point in the hierarchy to be directed to the server. Chapter 4 describes the distributed name resolution mechanism in detail. The naming interface is described here because it has some effect on the I/O interface described below. There are two implementations of the naming interface: one for remote pathnames and one for local pathnames. A client branches to the correct implementation after mapping the pathname argument of a system call into a server token and a relative pathname. The server token is a descriptor ID for a directory exported by the server. The token identifies the server and its type to the client, and the client uses the type to dispatch through the internal naming interface. The server begins resolving the relative pathname at the directory identified by the server token.

The operations that compose the internal naming interface are described below.

NAME_OPEN(token, path, useflags, userIDs, type, permissions, results)
This operation is the first half of creating an I/O stream. The useflags parameter indicates the manner in which the I/O stream will be used, whether the object should be created if it does not exist, and other details that are not important here. (A complete definition of flag bits can be found in Table D-5 in Appendix D.) The type parameter is used to constrain the open to match a particular type (e.g., directory or pseudo-device), but a default value indicates that any type is acceptable. (Types are defined in Table D-4 in Appendix D.) The userIDs authenticate the user to the file server. The permissions define the access control on newly created objects. NAME_OPEN returns results, which contains a descriptor ID and any other information needed to complete the setup of the I/O stream. These are passed through the IO_OPEN entry point in the I/O interface as described in Section 3.5.

GET_ATTRIBUTES(token, path, userIDs, attributes)
This returns the attributes of an object. The userIDs argument is used for authentication, and the object's attributes are returned in the attributes parameter. This is distinct from NAME_OPEN because NAME_OPEN includes some processing associated with creating an I/O stream as explained below in Section 3.4.1.

SET_ATTRIBUTES(token, path, userIDs, attributes, flags)
This changes the attributes of an object. Different attributes (access permissions, ownership, etc.) are changed in different situations, and the flags parameter contains bits that indicate which fields of the attributes parameter should be applied to the object's attributes.

MAKE_DEVICE(token, path, userIDs, devattrs)
Create a special file that represents a device. The devattrs parameter specifies the device type, unit number, and the I/O server. Section 3.4.1.2 describes a special ``localhost'' value that can be used for a device's I/O server. This maps the device file to the instance of the device on the client.

MAKE_DIR(token, path, userIDs, permissions)
Create a directory with the given permissions.

REMOVE(token, path, userIDs)
Remove an entry from a directory. This applies to all objects except directories. An important side effect of removing a directory entry for regular files is that files are removed from disk after all names that reference them have been removed (see HARD_LINK).

REMOVE_DIR(token, path, userIDs)
Remove a directory. It must be empty for this to succeed.

HARD_LINK(token1, path1, token2, path2, userIDs)
Create an additional name (path2) for an existing object (path1). This creates a new directory entry that references the existing object.

RENAME(token1, path1, token2, path2, userIDs)
Change path1 to path2. This is implemented on the file servers by first making a hard link (path2) and then removing the original name (path1). RENAME is explicit in the naming interface so that the servers can implement this sequence atomically.

SYM_LINK(token1, path1, token2, path2, userIDs)
Create a symbolic link (path1) to another pathname (path2).

Note that there is implicit context in the system call interface that is made explicit in the internal naming interface. The implicit context includes user authentication information, for example, and it also includes the token that identifies the server. It is important to make all information explicit so the naming operations can be passed around to remote servers.
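As an illustration only, the two implementations of this interface can be pictured as a C structure of procedure variables selected by the type of the server token. The names and signatures are hypothetical and the argument lists are abbreviated relative to the operations listed above.

typedef struct NameOps {
    int (*nameOpen)(void *token, const char *path, int useFlags,
                    void *results);
    int (*getAttributes)(void *token, const char *path, void *attrs);
    int (*remove)(void *token, const char *path);
    int (*rename)(void *token1, const char *path1,
                  void *token2, const char *path2);
    /* ... the remaining operations follow the same pattern ... */
} NameOps;

extern NameOps localNameOps;    /* resolves pathnames on this host */
extern NameOps remoteNameOps;   /* forwards the operation via RPC */

/* A client picks the implementation based on the server token's type. */
static NameOps *
NameOpsForToken(int tokenIsRemote)
{
    return tokenIsRemote ? &remoteNameOps : &localNameOps;
}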

3.4.1. Structure of the Open Operation

The Fs_Open operation, which creates an I/O stream, is divided into two parts, NAME_OPEN and IO_OPEN, to permit different servers for the naming and I/O interfaces. This is shown in Figure 3-6. However, to optimize the important case of opening regular files, in which case there is only a single server, the NAME_OPEN procedure is allowed to do some object-specific setup for the I/O stream that gets created by the IO_OPEN procedure. (IO_OPEN is the initialization procedure in the object-specific I/O interface, and it is described in more detail in Section 3.5.) This distinguishes NAME_OPEN from GET_ATTRIBUTES, which just returns attributes of the object and

has no side-effects. Enough processing is done during NAME_OPEN on a regular file so

[Figure 3-6 appears here: a diagram of the flow of control during Fs_Open. (1) The client executes Fs_Open(pathname, arguments) and sends the request to the file server; (2) the file server resolves the name and (3) runs the name open procedure, returning a descriptorID and objectData to the client; (4, 5) the client then invokes io_open(descriptorID, objectData) on the I/O server, which returns the descriptor used to complete the I/O stream.]

Figure 3-6. The flow of control during an Fs_Open operation. The arrows indicate invocation of the name resolution and the I/O open phases on the file server and the I/O server, respectively. In the most general case these operations are invoked using network RPC.

the client does not have to contact the file server a second time during the IO_OPEN phase. This is an example of how using the file system name space as the name service provides an advantage over an external name service. There is special-case processing that can be done during NAME_OPEN for other types of objects as well. This is performed by a name open procedure that is invoked on the file server after it has resolved a pathname to a particular file and obtained its LOCAL_FILE descriptor. There are currently three implementations of the name open procedure: one for files, one for devices, and one for pseudo-devices. The proper procedure is invoked based on a file-type attribute kept in the LOCAL_FILE object descriptor. The name open procedure is passed the LOCAL_FILE object descriptor and an indication of how the client will use the object. The name open procedure examines the attributes of the file to determine the proper object descriptor ID for the client and any other information needed during the IO_OPEN operation. The descriptor ID indicates the I/O server and the specific case that applies to the client (e.g., LOCAL_DEVICE, REMOTE_DEVICE, REMOTE_FILE, etc.). Thus, the name open procedure maps a file into a descriptor ID for the underlying object, and it extracts any object-specific attributes needed to create an I/O stream to the object. This is an example of the tight coupling between the naming and I/O halves of the file system.
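The two-phase open can be sketched from the client's point of view. The function and type names below are hypothetical, not the Sprite entry points; NAME_OPEN is performed by the file server and IO_OPEN by the I/O server named in the returned descriptor ID. For a regular file both phases involve the same server, so the second phase needs no additional RPC.

typedef struct OpenResults {
    int  descriptorType;       /* e.g. REMOTE_DEVICE, REMOTE_FILE, ... */
    int  ioServerID;
    char objectData[64];       /* extra setup data from the name open */
} OpenResults;

extern int NameOpen(const char *path, int useFlags, OpenResults *results);
extern int IoOpen(const OpenResults *results, int *streamIDPtr);

int
Fs_OpenSketch(const char *pathname, int useFlags, int *streamIDPtr)
{
    OpenResults results;
    int status;

    status = NameOpen(pathname, useFlags, &results);   /* file server */
    if (status != 0) {
        return status;
    }
    return IoOpen(&results, streamIDPtr);               /* I/O server */
}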

3.4.1.1. Opening Files

The main job of the FileNameOpen procedure is to invoke the cache consistency mechanism. The cache consistency routines compare the intended use of the file with existing uses of the file that are recorded in the LOCAL_FILE descriptor client list. Cache control messages are issued as needed, and FileNameOpen completes after other clients complete their consistency actions (e.g., writing back dirty data to the server). FileNameOpen updates the client list to reflect the new client so there is no need for the client to contact the file server (again) during the IO_OPEN operation. The client only has to create (or update) its own REMOTE_FILE descriptor.
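A minimal sketch of that flow, with hypothetical names for the consistency and client-list helpers (the real Sprite routines are not reproduced in the text):

    typedef int ReturnStatus;
    #define SUCCESS 0

    typedef struct LocalFileDescriptor LocalFileDescriptor;

    extern ReturnStatus CacheConsistency(LocalFileDescriptor *filePtr,
                                         int clientID, int useFlags);
    extern void         ClientListAdd(LocalFileDescriptor *filePtr,
                                      int clientID, int useFlags);
    extern int          RemoteFileID(LocalFileDescriptor *filePtr);

    ReturnStatus
    FileNameOpen(LocalFileDescriptor *filePtr, int clientID, int useFlags,
                 int *descriptorIDPtr)
    {
        /* Compare the intended use with existing uses recorded in the client
         * list; this may send cache control messages and wait for other
         * clients to write back dirty data. */
        ReturnStatus status = CacheConsistency(filePtr, clientID, useFlags);
        if (status != SUCCESS) {
            return status;
        }
        /* Record the new client now so the IO_OPEN phase needs no second RPC
         * to the file server; the client just creates or updates its own
         * REMOTE_FILE descriptor. */
        ClientListAdd(filePtr, clientID, useFlags);
        *descriptorIDPtr = RemoteFileID(filePtr);
        return SUCCESS;
    }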

3.4.1.2. Opening Devices

The main job of the DeviceNameOpen procedure is to determine the case (local or remote) that applies to the client. Each device file has an attribute that identifies the I/O server for the device, and the relative location of the client and the I/O server determines whether the client's descriptor ID should be of type REMOTE_DEVICE or LOCAL_DEVICE. Note that these descriptors are distinct from the LOCAL_FILE object descriptor associated with the device file that names the device. These different descriptors are shown in Figure 3-7. There has to be a distinct descriptor for a device and its name because devices can be located on any host, not just the file server that keeps their name (the device file). It is also possible for more than one device file to name the same underlying device, and the use of different object descriptors for the name and the device means that there is still only a single LOCAL_DEVICE descriptor for a device.

[Figure 3-7: three hosts and their descriptors. The file server (Host 14) holds the LOCAL_FILE descriptor for the device file, whose attributes record I/O server 33, type console, unit 0. The client (Host 7) holds a REMOTE_DEVICE descriptor and the I/O server (Host 33) holds the LOCAL_DEVICE descriptor.]

Figure 3-7. The object descriptors involved with remote device access. The file server resolves a name to an object descriptor for a LOCAL_FILE. In this case the file is a placeholder for a device, and the file attributes indicate the I/O server and the type of the device. The client sets up a REMOTE_DEVICE object descriptor, and the I/O server sets up a LOCAL_DEVICE object descriptor.

There are two potential problems with device files in a distributed system. The first is that programs developed for a stand-alone system expect the device files in the ``/dev'' directory to be for local devices, but the ``/dev'' directory is shared by all hosts. The second problem is that if every device on every host requires a device file then there can be a large number of device files that are tedious to manage. These two problems are addressed in Sprite by the addition of special ``localhost'' device files that correspond to the local instance of a device. These files can be shared by all hosts, which reduces the number of device files needed in a large system. The ``localhost'' device files are trivially implemented by using a distinguished value of the I/O server ID to indicate ``localhost''. This special value has the effect of mapping the device file to the client's instance of the device. If the opening process has migrated, then its original host is used as the I/O server so that process migration remains transparent. Thus, there are two kinds of device files in Sprite: one has a specific host for the I/O server attribute, and the other has a special ``localhost'' I/O server attribute.
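The two decisions just described, expanding the ``localhost'' I/O server ID and then choosing between LOCAL_DEVICE and REMOTE_DEVICE, might be sketched as follows. The constant and function names are assumptions made for illustration only.

    /*
     * Sketch of the DeviceNameOpen decision.  IO_SERVER_LOCALHOST and the
     * descriptor-type names are hypothetical.
     */
    #define IO_SERVER_LOCALHOST  (-1)   /* distinguished "localhost" value */

    typedef enum { LOCAL_DEVICE, REMOTE_DEVICE } DescriptorType;

    DescriptorType
    ChooseDeviceCase(int ioServerAttr, int clientHostID, int *ioServerPtr)
    {
        int ioServer = ioServerAttr;

        if (ioServer == IO_SERVER_LOCALHOST) {
            /* A "localhost" device file maps to the client's own instance of
             * the device.  For a migrated process, clientHostID is its
             * original host, so migration stays transparent. */
            ioServer = clientHostID;
        }
        *ioServerPtr = ioServer;
        return (ioServer == clientHostID) ? LOCAL_DEVICE : REMOTE_DEVICE;
    }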

3.5. The I/O Interface

The interface to the I/O operations is based on object descriptors. The type in the object descriptor ID is used to branch to the correct instance of the I/O operation, and the object descriptor is passed through the I/O interface. The I/O interface is similar to those in other systems, except for the way blocking I/O is done and the way that object attributes are handled. These issues are discussed after the interface is described.

IO_OPEN(descriptorID, objectData, useflags, descriptor)
The IO_OPEN procedure completes the setup of an object descriptor. The descriptorID and objectData parameters are the results of a NAME_OPEN operation. The useflags indicate how the object will be used, i.e. for reading and/or writing. The result of IO_OPEN is descriptor, which has been initialized and is ready for I/O. In the case of a remote client, IO_OPEN may involve an RPC to the I/O server so it can initialize (or update) its object descriptor. As mentioned above, this RPC is optimized away for the REMOTE_FILE case.

CLOSE(descriptor, useflags)
End use of an object. This is called when an I/O stream is being destroyed. There may be many I/O streams active to the same object descriptor, and CLOSE just represents the destruction of a single stream. The useflags parameter indicates the manner in which the I/O stream was being used.

CLIENT_VERIFY(clientDescriptorID, clientID, serverDescriptor)
The CLIENT_VERIFY routine is used on each RPC, and it has two main purposes. The first is to convert from a client's object descriptor ID to the server's corresponding object descriptor. Conversion is done by mapping the type in the client's descriptor ID to the corresponding server-side type (e.g., from REMOTE_FILE to LOCAL_FILE) and then looking up the server's descriptor based on the new object descriptor ID. The second function of this procedure is to verify that the client is known to the server. Verification is done by checking the client list on the server's descriptor.

READ(descriptor, processID, offset, buffer, count, signal)
Read data from an object. The input parameters indicate a byte offset, byte count, a buffer for the data, and the client process's ID. The count parameter is modified to indicate how much data, if any, was read. READ returns a status code as its function value. If EWOULDBLOCK is returned then it indicates that no more data is available from the object at this time (some data may have been returned). In this case processID has to be saved by the I/O server for use in a notification message after the object has data ready. (This is discussed in more detail below.) End-of-file conditions are indicated by a SUCCESS status and zero bytes returned. Some objects cause software signals to be generated in exceptional conditions. This can be achieved by returning GEN_ABORTED_BY_SIGNAL and a signal value.

WRITE(descriptor, processID, offset, buffer, count, signal)
Write data to an object. The input and result parameters are the same as for READ. This writes zero or more bytes, up to the amount specified by count. Upon return, count is modified to indicate how much data was accepted by the underlying object. If a short count is returned then more WRITE operations are made until all the user's data is transferred, unless EWOULDBLOCK is returned. EWOULDBLOCK is handled as described for the READ operation. GEN_ABORTED_BY_SIGNAL and a signal value can also be returned.

IOCTL(descriptor, cmd, processID, format, inBuf, inSize, outBuf, outSize, signal)
Perform a special operation (cmd) on an object. A generic set of IOCTL commands is defined by the system, and the object implementation is free to define any other IOCTL commands it needs. The processID is available so the I/O server can implement blocking operations by returning EWOULDBLOCK and saving the processID for later notification. A signal can also be returned in order to give the I/O server complete flexibility. The size and contents of the input and output buffers are dependent on cmd. The format parameter indicates the byte-ordering and alignment format of the client host. The IOCTL procedure has to fix up the input and output buffers if the I/O server has a different format than the client. This issue is discussed in more detail in Section 3.5.2.

SELECT(descriptor, events, processID)
Poll an object for readability, writability, and exceptional conditions. The events parameter is an input and output parameter. On entry it indicates which of the three conditions are of interest, and on exit it indicates which of the conditions currently apply to the object. If the object is not ready for one of the requested conditions then processID is saved for notification when the object changes to a ready state.

IO_GET_ATTR(descriptor, attributes)
Get the attributes of an object. The attributes parameter is both an input and a result parameter. On input it contains the attributes as recorded by the file server. IO_GET_ATTR can update any attributes (i.e. the access and modify times) for which it has more current values.

IO_SET_ATTR(descriptor, attributes)
Set the attributes of an object. As with IO_GET_ATTR, the file server has already been contacted to change the attributes on long-term storage. This routine updates whatever attributes the I/O server maintains while an object is in use, typically the access and modify times.
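The type-based branching described at the start of this section suggests a table of operation vectors indexed by descriptor type. The following is a hedged sketch of that structure; the type and field names are invented for illustration and are not the Sprite declarations.

    /*
     * Sketch of an I/O operation switch: one vector of procedures per object
     * descriptor type.  All names are hypothetical.
     */
    typedef int ReturnStatus;
    typedef struct ObjectDescriptor ObjectDescriptor;

    typedef struct IoOps {
        ReturnStatus (*ioOpen)(int descriptorID, void *objectData, int useFlags,
                               ObjectDescriptor **descPtr);
        ReturnStatus (*close)(ObjectDescriptor *desc, int useFlags);
        ReturnStatus (*read)(ObjectDescriptor *desc, int processID, int offset,
                             char *buffer, int *countPtr, int *signalPtr);
        ReturnStatus (*write)(ObjectDescriptor *desc, int processID, int offset,
                              char *buffer, int *countPtr, int *signalPtr);
        ReturnStatus (*ioctl)(ObjectDescriptor *desc, int cmd, int processID,
                              int format, char *inBuf, int inSize,
                              char *outBuf, int *outSizePtr, int *signalPtr);
        ReturnStatus (*select)(ObjectDescriptor *desc, int *eventsPtr,
                               int processID);
    } IoOps;

    /* One entry per descriptor type (e.g. LOCAL_FILE, REMOTE_FILE,
     * LOCAL_DEVICE, REMOTE_DEVICE, ...), indexed by the type field of the
     * object descriptor ID. */
    extern IoOps ioOpTable[];

    ReturnStatus
    Fs_ObjectRead(int descType, ObjectDescriptor *desc, int processID,
                  int offset, char *buffer, int *countPtr, int *signalPtr)
    {
        return (*ioOpTable[descType].read)(desc, processID, offset,
                                           buffer, countPtr, signalPtr);
    }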

3.5.1. Blocking I/O

Remote device access introduces a waiting problem that is not present with remote files. Files have bounded access times so it is feasible to block the service procedure on the file server until the I/O operation completes. Devices, pipes, and pseudo-devices, in contrast, have indefinite access times. Blocking the service procedure indefinitely at the I/O server would tie up RPC server processes, and these might be a limited resource. Interrupting the I/O operation would be complicated because the remote RPC server would have to be interrupted. The Fs_Select operation adds a further complication. The streams being selected may be connected to objects located all around the network. In this case it would not be appropriate to wait at the I/O server for any particular I/O stream because some other I/O stream may become ready first.

These problems can be avoided by returning control to the client if an I/O operation (or SELECT) would block. A special return code, EWOULDBLOCK, is returned to indicate this situation. This return code is interpreted by general-purpose routines which block the process in this case. A callback from the I/O server is required when its object is finally ready for I/O. In terms of the architecture, this approach means that the object-specific I/O routines do not block indefinitely. The only action taken by the object-specific module during a blocking I/O is to record the client process's ID so it can be notified later. The process is then blocked in the top-level Fs_Read, Fs_Write, or Fs_Select procedure. These top-level procedures retry their calls through the I/O interface after a notification message arrives from the I/O server.

The implementation of this blocking approach has to guard against a race condition between a process's decision to wait and notification messages from the I/O servers. The message traffic during a remote wait is shown in Figure 3-8. The notification message could arrive before the reply with the EWOULDBLOCK error code. If the early notification were ignored then the client might block forever. The race can occur during Fs_Select if an I/O server notifies the client while it is still polling other I/O servers. The race can occur during Fs_Read and Fs_Write if network messages are lost or reordered.

[Figure 3-8: message sequence between the client and the I/O server. The client's read issues a Read RPC to the read_service procedure; the reply returns EWOULDBLOCK; the client blocks for I/O or select; when the device becomes ready the I/O server sends a Notify RPC, which can race with the EWOULDBLOCK reply; the client wakes up, retries the Read RPC, completes the read, and continues.]

Figure 3-8. Message traffic during a blocking I/O. Control is returned to the client if an I/O operation would block. There is a potential race condition between the EWOULDBLOCK reply message and the notification message from the I/O server.

A simple algorithm is used to detect early notifications. Each process's state includes a notify bit that is cleared before the process begins its decision to block. The notify bit is set when a notification arrives. When a process finally decides it ought to wait, the kernel's waiting primitives check for the notify bit. If it is set it indicates that a notification has arrived early, and the process is not blocked. The process clears its notify bit and retries its operation. The retry loop is shown in Figure 3-9. A form of this loop is used in the top-level procedures Fs_Read, Fs_Write, Fs_IOControl, and Fs_Select.

    done = FALSE;
    while (done == FALSE) {
        /* begin critical section */
        ClearNotifyBit();
        /* end critical section */

        status = BlockingOperation();

        /* begin critical section */
        if ((status == EWOULDBLOCK) && NotifyBitIsClear()) {
            WaitForNotification();
        } else {
            done = TRUE;
        }
        /* end critical section */
    }

Figure 3-9. The basic structure of a wait loop. Manipulation of the notify bit and blocking the process is done inside a critical section in order to synchronize with notification messages.

3.5.2. Data Format

Sprite supports a heterogeneous network, one where clients and servers can be of different machine architectures. The main problem this causes concerns data representation, or byte-ordering. There are, for example, two main ways to arrange a four-byte integer in memory. There are also alignment policies that affect the layout of more complex data structures. When clients and servers with different data formats communicate, some measures need to be taken to resolve these differences. One approach is to always convert every network message into a canonical format. This conversion is done in SunOS with its XDR (eXternal Data Representation) protocol. However, this approach adds overhead when the client and server have the same data format.

Sprite limits the format of network messages to make data format conversion easy, and it only does conversion when needed. This approach can be taken because network communication is done by the kernel using a small set of RPCs. RPC messages are divided into a header and two data parts. The first data part can contain an array of integers, while the second data part contains an uninterpreted block of data. The first part is sufficient for the simple data structures passed among kernels, and the second part is used for character strings (i.e. file names) and raw file data. The Sprite RPC protocol automatically reformats the first data part of network messages if the client and server have different formats. This case is detected by examining a header field with a well-known value. If the value is in the wrong format, then the packet is reformatted. The only case where this approach does not work is with the input and output buffers used in IOCTL. A device or service is free to define new IOCTL commands and their associated inputs and outputs. The RPC system does not know the format of the input and output buffers. This data is put into the unformatted part of network messages and the data format conversion is left up to the IOCTL procedure.
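The header check described above, detecting a byte-swapped message by looking at a field with a well-known value, could look something like this. The magic constant, structure layout, and function names are assumptions for illustration; only the general technique comes from the text.

    /*
     * Sketch of detecting and fixing a byte-swapped RPC message.
     */
    #include <stdint.h>
    #include <stddef.h>

    #define RPC_MAGIC 0x53505249u     /* hypothetical well-known header value */

    static uint32_t
    ByteSwap32(uint32_t x)
    {
        return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8) |
               ((x & 0x00ff0000u) >> 8)  | ((x & 0xff000000u) >> 24);
    }

    /*
     * Reformat the header and the first data part (an array of 32-bit
     * integers) if the sender's byte order differs from ours.  The second
     * data part is left uninterpreted, as in the Sprite RPC protocol.
     */
    void
    MaybeReformat(uint32_t *header, size_t headerWords,
                  uint32_t *firstPart, size_t firstPartWords)
    {
        size_t i;

        if (header[0] == RPC_MAGIC) {
            return;                   /* same format; no conversion cost */
        }
        if (ByteSwap32(header[0]) != RPC_MAGIC) {
            return;                   /* not a valid message; caller handles */
        }
        for (i = 0; i < headerWords; i++) {
            header[i] = ByteSwap32(header[i]);
        }
        for (i = 0; i < firstPartWords; i++) {
            firstPart[i] = ByteSwap32(firstPart[i]);
        }
    }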

3.5.3. Object Attributes

While most attributes are kept by the file servers, the I/O server for a device updates the access and modify times while a device is in use. Instead of having the I/O server constantly push these attributes back to the file server, the client contacts both the file server and the I/O server during Fs_GetAttributes and Fs_SetAttributes. The client invokes the file server via the GET_ATTRIBUTES or SET_ATTRIBUTES operation to manipulate the attributes kept there. These routines also invoke the object-specific name open procedure (i.e. FileNameOpen or DeviceNameOpen) to compute a descriptor ID which is returned to the client. The client uses the type of the descriptor ID to invoke an IO_GET_ATTR or IO_SET_ATTR procedure to complete the operation with the I/O server.

Attributes can be manipulated via an I/O stream to the object as well as by an object's name. Both cases are structured the same way: first the file server is contacted to get most of the attributes, and then the I/O server is contacted to get the current access and modify times. When an I/O stream is closed the I/O server pushes its versions of the access and modify times back to the file server for long-term storage. In order to contact the file server, the LOCAL_FILE descriptor ID corresponding to the name must be known to the client and the I/O server. This name ID is one of the results of the NAME_OPEN procedure. The client saves the name ID in its stream descriptor and the I/O server saves it in its object descriptor. (In the current implementation a device I/O server does not really save the name ID nor does it push the access and modify times back to the file server, but it should. This omission means that access and modify times of devices are only correct while they are open.)
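A minimal sketch of the two-step client-side Fs_GetAttributes path, assuming hypothetical names for the file-server and I/O-server calls:

    typedef int ReturnStatus;
    #define SUCCESS 0

    typedef struct Attributes Attributes;

    extern ReturnStatus NameGetAttributes(const char *path, Attributes *attrPtr,
                                          int *descriptorIDPtr);
    extern ReturnStatus IoGetAttributes(int descriptorID, Attributes *attrPtr);

    ReturnStatus
    Fs_GetAttributes(const char *path, Attributes *attrPtr)
    {
        int descriptorID;
        ReturnStatus status;

        /* Step 1: the file server returns the attributes it stores, plus a
         * descriptor ID whose type identifies the I/O server. */
        status = NameGetAttributes(path, attrPtr, &descriptorID);
        if (status != SUCCESS) {
            return status;
        }
        /* Step 2: the I/O server patches in the current access and modify
         * times for an object that is in use. */
        return IoGetAttributes(descriptorID, attrPtr);
    }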

Remote files have a slightly different problem regarding attributes. Caching by remote clients means that the file server may not have the most up-to-date access and modify times. However, instead of having one client contact all the other clients that cache the file, this is done by the file server because it can consult the file's client list to see which clients should be contacted. Currently the file server avoids fetching the access time of a file that is being executed because these files are heavily shared and it is not possible to determine exactly when a program image is referenced; the access time of an actively executed program is reported as ``now''. In the future we may avoid fetching access times from other clients altogether because this is a rarely used attribute. The modify time, on the other hand, is much more important, and delayed write caching on the clients means that the server may not have the most recent modify dates. However, in this case there is only one client that can be caching dirty data (this is a property of the Sprite cache consistency scheme [Nelson88a]), so the server only has to contact one other client.

An alternative implementation for remote devices would be to have clients maintain modify and access times instead of the I/O server. This approach is similar to what currently happens with regular files. If this alternative were implemented then clients would only contact the file server during attribute operations. The file server would make callbacks to the other clients using the object to get the current access and modify times. This approach adds extra state to the file servers. In the current implementation, for example, the file servers do not record which clients are using a remote device. The device's I/O server records this information instead.

3.6. Measurements

This section of the chapter presents measurements of the RPC performance and the RPC traffic in the system, and it gives a size breakdown of the various modules in the file system implementation.

3.6.1. RPC Performance

Figure 3-10 shows the performance of the Sprite RPC protocol when sending data blocks of various sizes from one kernel to another. The base cost of an RPC is higher on a Sun-3 because it has a slower CPU (i.e., 2 MIPS as opposed to 12-13 MIPS). The latency of a null RPC is about 1 msec on the DECstation and SparcStations, and about 2.45 msec on the Sun-3/75 workstation. This is about the same as the time it takes to switch back and forth between two user processes on the same host, so remote kernel operations are not too expensive. The incremental cost of sending more data in an RPC is not much different between machine types, which indicates that the time spent in the network interface hardware and on the network itself becomes dominant. The incremental cost for the DECstations is higher because the network interface does not have direct-memory-access (DMA). It takes additional CPU cycles to copy data in and out of the network interface. Large bandwidths are achieved by sending large data blocks. Multiple packets are used to ``blast'' the data block to the receiver, which returns a single packet in response [Zwaenepoel85]. Multiple packets are reflected by jumps in the graph when an additional Ethernet packet has been used to transfer the data block.

[Figure 3-10: two graphs. Graph (a) plots elapsed time in msec (0-24) against Kbytes per RPC (0-16) for the sun3, sun4, and ds3100 lines; graph (b) plots bandwidth in Kbytes/sec (0-900) against Kbytes per RPC for the same three machines.]

Figure 3-10. Performance of the Sprite kernel-to-kernel RPC network protocol on a 10 Mbit Ethernet. Graph (a) gives the time to transfer a data block from one kernel to another. Graph (b) shows the corresponding data bandwidths. The jumps in the graph reflect the use of additional network packets to transfer the data. The ``sun3'' line is for Sun-3/75 workstations, which are rated at 2 MIPS and have an Intel 82586 Ethernet controller. The ``sun4'' line is for SparcStation-1 workstations, which are rated at 12 MIPS and have an AMD Am7990 Lance Ethernet controller. The ``ds3100'' line is for DECstation 3100 workstations, which are rated at 13 MIPS and also have the Lance Ethernet controller.

The maximum capacity of the 10 Mbit Ethernet is about 1200 Kbytes/sec. With 16 Kbyte blocks a Sun-3/75 attains 60% of this maximum, and the SparcStations attain 75%. Ordinarily Sprite transfers 4 Kbyte blocks because that is the file system's block size. The latency and bandwidth for a 4 Kbyte block are 7.2 msec and 570 Kbytes/sec (4.5 Mbits/sec) on a Sun-3/75, and they are 5.37 msec and 745 Kbytes/sec (5.9 Mbits/sec) on a SparcStation.

3.6.2. RPC Traffic

The measurements presented in this sub-section are based on statistics taken from the system as it is used for day-to-day work by a variety of users. The raw data is in the form of counters that are maintained in the kernel and periodically sampled. File servers were sampled hourly, while the clients were sampled 6 times each day. The study period began July 8, 1989 and ended December 22, 1989.

The RPC traffic rate on all four file servers combined is shown in Figure 3-11. (Similar graphs for the individual servers are given in Appendix B.) Each line in the graph is cumulative, so the difference between two adjacent lines gives the rate for the RPC that labels the upper line.

[Figure 3-11: cumulative RPC rate (RPCs/second, 0-70) versus hour of day (0-24) for the combined file servers, with lines labeled, from top to bottom: other (total), stat, close, open, read, write, ioctl, select, and echo.]

Figure 3-11. RPC traffic throughout the day for the combination of all four file servers. The servers were measured from July 8 through December 22, 1989. The lines for each RPC are cumulative, so the rate for each RPC is given by the difference between two lines. The top line represents the total RPC traffic.

The line labeled ``other'' also represents the total RPC traffic of the servers. The traffic represents sustained load as opposed to peak values because the statistics were sampled every hour and then averaged over months of activity. The traffic in the early morning hours gives a measure of background activity. The echo RPCs are generated by the host monitor module described in Chapter 6. The rest of the background stems from system daemons. Currently this load is rather high. Steps will need to be taken to reduce this load if we want to scale the system from its current size of 40-50 hosts up to our target of 500 hosts. For example, once a minute a daemon on each host updates a host status database, which is used by our process migration facility. Each update requires a number of file system operations in order to lock a record, read the old value, update the value, and unlock the record. Ioctl is used for locking. The nightly dumps are visible as the peak in stat (i.e., GET_ATTRIBUTES) RPCs. There are more open than close RPCs because of failed lookups and artifacts of the name resolution protocol, which will be described and measured in Chapter 4.

3.6.3. Other Components of the Architecture

This chapter has focused on the naming and I/O interfaces between the top-level file system routines and the object-specific modules, and the way that remote access is supported. There are also major modules in the architecture associated with the block cache, cache consistency, the disk sub-system, local name lookup, distributed name lookup, and pseudo-file-systems and pseudo-devices. The sizes of these modules are given in Table 3-2. Of these modules, the block cache, disk sub-system and local name lookup are relatively straight-forward implementations of concepts present in stand-alone UNIX. The block cache defines its own back-end interface so it can be used for both local and remote files, and perhaps it will be extended for other uses in the future (e.g., pseudo-file-systems). Most of the other file system components are unique to Sprite. The cache consistency module is described in detail in Nelson's thesis [Nelson88b]. Chapter 4 will describe the distributed naming module, and Chapter 7 will describe pseudo-file-systems and pseudo-devices. The ``utility'' module listed in the table includes routines that manage the table of object descriptors, and it includes the recovery protocol for the descriptors that is described in Chapter 6.

    Size Comparison of FS Modules

    Module                                     Procedures         Lines
    Top level FS interface                      82  13.8%    6301  14.3%
    Object-specific (files, pipes, devices)     95  16.0%    5879  13.3%
    Remote access                               80  13.4%    5678  12.9%
    Pseudo-file-systems and pseudo-devices      68  11.4%    5559  12.6%
    Data block cache                            62  10.4%    4672  10.6%
    Disk management                             53   8.9%    4557  10.3%
    Local name lookup                           27   4.5%    3577   8.1%
    Utility routines                            64  10.8%    3222   7.3%
    Cache consistency                           35   5.9%    2528   5.7%
    Distributed naming (prefix tables)          29   4.9%    2078   4.7%
    Total                                      595           44051

Table 3-2. The number of procedures and lines of code in the modules that make up the file system implementation. About half of the lines are comments, and these are included in the line counts.

3.7. Comparison with Other Architectures

The Sprite file system architecture is similar to three UNIX-based architectures: the SunOS vnode architecture [Kleiman86]; the Ultrix gnode architecture [Rodriguez86]; and the AT&T file system switch (FSS) architecture [Rifkin86]. These are all generalizations of the original UNIX file system architecture. In each system, an internal interface was added below which file-system-specific dependencies are hidden. Different implementations of the internal interface allow for different file systems to be integrated together into one system as viewed by the user. There are two main differences between Sprite and these architectures. The first concerns remote device access, and the second concerns the way the distributed name space is implemented. Remote device access is discussed here, and the next chapter describes the Sprite naming mechanism and compares it with the UNIX-based approaches used by these other architectures.

The other architectures are oriented towards regular file access, and they make some assumptions that preclude the general remote device access provided in Sprite. The major assumption is that there is one server that assumes the role of the file server and (by assumption) the Sprite I/O server. Under this assumption it may be possible to access a remote device on the file server, but not on a different host (i.e., another workstation). Accessing a remote device on a file server is possible with the RFS file system protocol, which has been implemented in the FSS and vnode architectures [Chartock87]. However, there has to be a distinct ``/dev'' directory for each server so that the name for a device and the device itself are implemented by the same server.

The vnode and gnode architectures provide a way to access a local device that is named by a remote file server. This kind of access is needed to support diskless workstations. In this case, the equivalent of the Sprite IO_OPEN operation can ``clone'' a new descriptor from the descriptor returned from the file server, with the new descriptor being used for local device access. However, this still does not necessarily support the general three-party situation supported by the Sprite architecture, where a third host is the I/O server for the device.

Perhaps the stickiest issue related to remote device access concerns blocking I/O. Sprite provides a general blocking I/O mechanism that even applies to SELECT on devices on different hosts. In contrast, the other architectures allow the object-specific procedures to block indefinitely. To avoid using up all the server processes in RFS, for example, the last available server process is prevented from blocking by aborting an I/O request that would block. Also, SELECT is not supported in AT&T UNIX (which uses RFS), so the issue of having to wait on local and remote devices simultaneously is avoided. NFS, which is the remote file protocol used with the vnode and gnode architectures, does not support remote device access, so these architectures have also avoided this issue.

LOCUS is another UNIX-based distributed system. However, LOCUS does not define an internal interface that can be used to hide special-case processing. Instead, remote access was coded specially for each file system operation. One of the points in Walker's thesis [Walker83b], for example, was how different remote operations in LOCUS were implemented with specialized message protocols. While SELECT is supported in LOCUS, there is no description of how they address the races associated with remote devices. Also, remote device access in LOCUS is coded within each device driver, while it is implemented by general-purpose code in Sprite. To be fair, the main focus of LOCUS was file replication and location transparency, not a software architecture. The LOCUS implementers modified an existing UNIX implementation as a fast route towards experimentation with file replication, and they did that before any of the UNIX-based architectures described above were introduced.

3.8. Summary and Conclusions

The architecture presented here is organized around two main internal interfaces, one for naming operations and one for I/O. These interfaces hide the special-case processing needed to support remote servers and different kinds of objects (i.e. files, devices, pipes, pseudo-devices). The use of distinct interfaces for naming and I/O enables the file servers to implement a name space that is used for non-file objects like devices. A file server can implement the naming interface for devices, while other hosts can implement the I/O operations for devices. The architecture supports our environment of diskless workstations; these hosts have devices like displays, printers, and even tape drives, but they do not have to implement any part of the name space so they can run diskless.

Remote access is implemented in a general way in the architecture, and this is hidden by the internal interfaces. Clients invoke RPC stub procedures through the interface, and on the server complementary RPC stubs call through the interface to invoke the service procedures. This approach means that the stubs can be shared by different object implementations (i.e., pipes and devices), and the service procedures are not littered with special-case checks against the remote case.

The architecture supports state on the servers about I/O streams from remote clients. This state is used to support the caching system, it is used to implement exclusive access to devices, and it is used to prevent modification of files that are being used as program images. The state is maintained during Fs_Open and Fs_Close operations; there is no need to inform the server when I/O streams become shared due to stream inheritance. The way this state is recovered after server crashes is the topic of Chapter 6.

Remote blocking I/O operations are also supported by the architecture. A callback approach is used where I/O servers call back to a client that has been blocked. This frees up server processes while the client waits, and it supports Fs_Select operations on objects that are scattered around the network. The implementation of the callback has a potential race condition between a process's decision to wait and a notification message from a remote server. This race is guarded against by remembering early notifications and surrounding blocking operations with a retry loop.

With this architecture as a framework, the remaining chapters consider some additional problems in more detail. These include name resolution in a system with many file servers (Chapter 4), the effects of process migration on the architecture (Chapter 5), and the failure recovery system for stateful servers (Chapter 6). Chapter 7 describes how user-level server processes can implement the internal naming and I/O interfaces in order to extend the basic functionality provided by the distributed file system with arbitrary, user-supplied services.

CHAPTER 4

Distributed Name Resolution:

Prefix Tables

4.1. Introduction

This chapter considers the problem of name resolution in a distributed system composed of many servers, both Sprite file servers and pseudo-file-system servers. The goal of the naming mechanism described here is to hide the underlying distribution of the system. The result is that users and applications are presented with a single hierarchical name space that does not reflect its partitioning among servers, nor does the name space change when viewed from different hosts. All Sprite hosts share the same location transparent name space. Names do not reflect the location of an object, and the name of an object is the same throughout the network.

The hierarchical structure of the file system name space provides a natural way to partition it among servers. Each server implements a sub-tree of the hierarchy, and the sub-trees are combined into a global hierarchy. Sprite uses a system based on prefix tables and remote links to partition the name space [Welch86b]. Prefix tables on the clients provide a hint as to the server for the sub-tree identified by the prefix. Remote links are special files that indicate where other sub-trees are attached. Both the prefix tables and remote links are transparent to applications, so a single, seamless hierarchy is presented. The Sprite prefix mechanism has performance and administrative advantages over variants of the UNIX mount mechanism used in other systems, and it also differs substantially from the prefix mechanism used in the V-system.

The rest of this chapter is organized as follows. Section 4.2 describes the name resolution mechanism in stand-alone UNIX. Section 4.3 describes how this has been extended for use in a network by other systems, and points out drawbacks of this approach. Section 4.4 describes Sprite's prefix table mechanism. Section 4.5 presents measurements of the effectiveness of the system. Section 4.6 compares the Sprite prefix table mechanism with other prefix-based name resolution systems. Section 4.7 concludes the chapter.

4.2. Name Resolution in Stand-alone UNIX

The stand-alone UNIX directory searching algorithm is described here to provide a basis for the following discussion of distributed name resolution. There are three key points to the UNIX lookup algorithm.

(1) Component-at-a-time lookup. A pathname is processed one component at a time in an iterative manner. The current directory in the pathname is scanned for the next component. The current directory is initialized to the root directory in the case of absolute pathnames (e.g., ``/a/b/c''), or to the process's working directory in the case of relative pathnames (e.g., ``a/b/c''). The lookup terminates when the final component is found in its directory, or access to a component of the path is denied.

(2) Directory format. Each directory contains a list of entries, with each entry consisting of a component and an identifier for a disk-resident descriptor. The identifier is a disk-relative index into the set of descriptors on the disk. When a component is found during lookup, its descriptor is fetched from disk so that its type and access permissions can be checked. This structure means that directories only reference objects on their own disk, so each disk has a self-contained hierarchical directory structure. Descriptors have a type such as file, directory, symbolic link, or device. Descriptors store other attributes of their object, including access permissions and the location of the object's data on disk.

(3) Mount tables. Hierarchical directory structures on different disks are combined into a single hierarchy with a mount mechanism. A main-memory mount table defines what directory structure contains the overall root, and it contains zero or more additional entries that define a connection between a leaf directory in one hierarchy and the root directory of another. The leaf directories named in the mount table are called mount points. When the lookup procedure encounters a mount point, the current directory is shifted to the root of the mounted directory structure before searching for the next component. The converse has to be done when traveling up the hierarchy due to parent directory (``..'') components. In this case the mount table is used to switch the current directory from the root of one hierarchy back to the directory it is mounted on before the current directory is advanced to the parent. For example, if ``/a/b'' is a mount point, then the lookup of ``/a/b/..'' will cross the mount point twice, once when processing component ``b,'' and a second time when processing the parent directory.
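The loop described in points (1) and (3) could be sketched roughly as follows. The directory-scanning and mount-table helpers are hypothetical names, not the actual UNIX routines, and the ``..'' crossing back over a mount point is omitted for brevity.

    /*
     * Sketch of component-at-a-time lookup with mount-point crossing.
     */
    #include <string.h>

    typedef struct Descriptor Descriptor;

    extern Descriptor *ScanDirectory(Descriptor *dir, const char *component);
    extern int         IsMountPoint(Descriptor *dir);
    extern Descriptor *MountedRoot(Descriptor *mountPoint);

    Descriptor *
    Lookup(Descriptor *root, Descriptor *workingDir, const char *pathname)
    {
        /* Absolute names start at the root, relative names at the working
         * directory. */
        Descriptor *current = (pathname[0] == '/') ? root : workingDir;
        char path[1024];
        char *component;

        strncpy(path, pathname, sizeof(path) - 1);
        path[sizeof(path) - 1] = '\0';

        for (component = strtok(path, "/"); component != NULL;
             component = strtok(NULL, "/")) {
            current = ScanDirectory(current, component);
            if (current == NULL) {
                return NULL;          /* not found, or access denied */
            }
            if (IsMountPoint(current)) {
                /* Shift to the root of the mounted directory structure
                 * before looking for the next component. */
                current = MountedRoot(current);
            }
        }
        return current;
    }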

4.3. Remote Mounts

While the UNIX lookup mechanism was designed for a stand-alone system, it has been extended for use in a distributed file system. The mount table is extended to identify the server for a directory structure, and an additional mechanism is added for resolving pathname components in a remote directory structure. There are three different approaches to resolving remote components, which are described below.

The lowest-level approach, which is used in LOCUS [Popek85], is to modify the routines that read directories to fetch them from the remote server instead of a local disk. Directories are cached in the main-memory file data cache, just as they are in UNIX, to avoid repeated remote operations [Sheltzer86]. LOCUS includes a cache consistency mechanism so that cached directories reflect additions and deletions made by other hosts.

The interface between client and server is low level. Server directories are directly manipulated by remote clients. The advantage of this interface is that the bulk of the directory lookup code from stand-alone UNIX remains unchanged. (LOCUS was implemented as a direct modification of UNIX.) The disadvantage is that all hosts must agree on the directory format, and modifying a directory incurs the cost of the cache consistency mechanism.

A slightly higher-level approach, which is used in NFS [Sandberg85], is to send a component and a directory identifier to the remote server, which returns an identifier for the component if it finds the component within that directory. This interface between client and server is higher-level than the corresponding interface in LOCUS because the directory format is hidden from the clients. Servers scan their own directories, and they do their own directory modifications. The interface makes it easier to use NFS between different kinds of systems. A disadvantage of this approach, however, is that there can be many remote operations to resolve a pathname, one for each component. A cache of recent lookup results is kept by NFS clients so they can avoid some server operations. However, there is no cache consistency scheme in NFS so clients must refresh their lookup caches every few seconds, and it is still possible for them to use an out-of-date cache entry.

The highest-level approach, which is used in RFS [Rifkin86], is to send the remaining pathname to the server when a remote mount point is reached. The RFS server processes as many components as it can and returns an identifier if it completes the lookup, or the remaining pathname if a ``..'' component caused the lookup to cross back over the mount point. The advantage of this approach is that it reduces the number of remote server operations. Only in diabolical cases where pathnames cross many mount points will there be many remote operations, and in these cases NFS would have at least as many remote operations (ignoring the NFS lookup cache). Another advantage is that there is no need for a directory consistency mechanism because the server does all its own modifications of the directory structure.

There is one overall disadvantage of the remote mount approach, however. Each host is free to define its own mount points so there is no guarantee that each host sees the same name space. This is a legacy of stand-alone UNIX where the mount table is initialized from a file on a local disk. In remote mount schemes, each host has its own mount table file and therefore each host defines its own name space. For example, host A could mount a directory structure under ``/x'', while host B could mount the same directory structure under ``/r/s''. This means that objects in the directory structure can have different names (e.g., ``/x/a/b/c'' vs. ``/r/s/a/b/c'') depending on which host is being used. In a network of workstations it is much better if each host sees the exact same name space so it does not matter which workstation is used. Achieving consistency of the name space with a remote mount scheme requires coordinated changes on all hosts in the network. Each host must update its mount table file and reinitialize its kernel data structures (perhaps by rebooting). A common approach to this problem is to implement some sort of replicated database that defines the global mount configuration. A replicated database is used in LOCUS, the Andrew File System (AFS) [Satyanarayanan85], and others. However, replication is a heavy-weight approach that requires coordinated changes on many hosts. The Sprite prefix table mechanism, which is described next, takes a lighter-weight approach to name space consistency that is based on caching and lazy update.

4.4. Prefix Tables

Sprite uses a prefix table mechanism to distribute the file system hierarchy over multiple servers. It is similar in function to the UNIX mount mechanism. It combines the directory structures on different disks, called domains, into a global hierarchy. With a mount mechanism, however, each host defines its own organization of different domains into an overall hierarchy, and lookup is done one component at a time while checking for mount points. With the Sprite prefix table mechanism, each domain is identified by a server-defined prefix that is the pathname to the root of the domain. Pathname lookup is divided into a prefix match, which is done by clients, and the resolution of a relative pathname, which is done by servers. Each host keeps a table of valid prefixes, and these tables are kept up-to-date using a broadcast-based protocol.

4.4.1. Using Prefix Tables

During file open and other operations that require name resolution, a client compares the pathname to its prefix table and selects the longest matching prefix. The longest prefix has to be chosen to allow for nesting of domains, as illustrated in Figure 4-1. For example, the prefixes ``/'', ``/users'', and ``/users/sprite'' all match on the pathname ``/users/sprite/brent''. The longest matching prefix, ``/users/sprite'', determines the correct domain. The client strips off the prefix and sends the remaining pathname to the server indicated by the matching prefix table entry. The server resolves as much of the pathname as possible. If it resolves the whole pathname the server performs the requested operation (e.g., NAME_OPEN, REMOVE, etc.).

A server may not be able to process the whole pathname, however, because a pathname can take an arbitrary path through the global hierarchy and cross through many different server domains. If a pathname leaves a server's domain, the server returns the remaining pathname to the client. Returning a pathname is called pathname redirection, and the three cases for pathname redirection are described below in Section 4.4.3. Redirection results in an iterative lookup algorithm where the client applies the remaining pathname to the prefix table on each iteration in order to determine the next server to use. Control is returned to the client each iteration so that servers do not have to forward operations to other servers.

Note that a prefix match can cause the upper levels of the hierarchy to be bypassed during lookup. For example, our network has the server configuration given in Table 4-2. A lookup in domain ``/sprite/src/kernel'' is sent directly to server Allspice, and the three domains above it (``/'', ``/sprite'', and ``/sprite/src'') are bypassed. Bypassing directories reduces the load on the other servers. With remote mounts, in contrast, every component in an absolute pathname is processed, so the upper levels of the directory hierarchy are heavily used.
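A minimal sketch of the client-side longest-prefix match, using hypothetical names for the prefix table entry (the actual Sprite data structures are not shown in the text):

    /*
     * Sketch of longest-prefix matching over a client's prefix table.
     */
    #include <string.h>

    typedef struct PrefixEntry {
        const char *prefix;     /* e.g. "/", "/users", "/users/sprite" */
        int         serverID;
        int         token;      /* descriptor ID for the domain root */
    } PrefixEntry;

    /* Return the entry with the longest prefix that leads the pathname. */
    PrefixEntry *
    PrefixMatch(PrefixEntry *table, int numEntries, const char *name)
    {
        PrefixEntry *best = NULL;
        size_t bestLen = 0;
        int i;

        for (i = 0; i < numEntries; i++) {
            size_t len = strlen(table[i].prefix);
            if (strncmp(name, table[i].prefix, len) == 0 &&
                (name[len] == '/' || name[len] == '\0' || len == 1) &&
                len > bestLen) {
                best = &table[i];
                bestLen = len;
            }
        }
        return best;
    }

The client would then strip the matched prefix from the pathname and send the remainder, along with the entry's token, to the server named in the entry; pathname redirection simply re-runs the match on the returned pathname.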

[Figure 4-1: a file system hierarchy rooted at ``/'' (with sub-directories cmds, etc, dev, and users; users contains staff, spur, and sprite; sprite contains dave, ouster, and brent) partitioned into the four domains ``/'', ``/cmds'', ``/users'', and ``/users/sprite''. The accompanying prefix table is:]

    Prefix Table
    Prefix              ServerID    Token
    ``/''               A           <2>
    ``/cmds''           B           <753>
    ``/users''          C           <84>
    ``/users/sprite''   D           <17>

Figure 4-1. This figure shows the file system hierarchy and a prefix table that partitions the hierarchy into four domains. The distribution is transparent to applications. The server's ID and a token that identifies the domain are kept in the prefix table.


    Sprite Domains in Berkeley
    Server      Prefix                   Kbytes
    Mint        ``/''                      7820
    Oregano     ``/a''                   264096
    Oregano     ``/b''                   243936
    Oregano     ``/c''                   300696
    Allspice    ``/mic''                 625680
    Assault     ``/dist''                278520
    Allspice    ``/swap1''               275808
    Allspice    ``/user1''               305184
    Assault     ``/user2''               293832
    Mint        ``/sprite''              366160
    Allspice    ``/sprite/src''          627520
    Allspice    ``/sprite/src/kernel''   627520

Table 4-2. The Sprite domains in the Berkeley network. There are four file servers and 12 domains, which are shown here in order of increasing prefix length. Not shown are some smaller domains used for testing, and the pseudo-file-system domains that correspond to NFS file servers.

Bypassing the upper levels of the hierarchy with a prefix match also causes the access controls on the upper levels to be bypassed. All processes implicitly have search permission on the path to each domain. This lack of access control is not a problem in our environment, and it would require a per-process prefix table to close this loop-hole. Per-process access controls could be checked as a process's prefix table is initialized. However, a per-process prefix table would be less efficient than a kernel-resident prefix table. First, extra storage would be required because the prefix tables would not be shared. Second, extra processing would be required to initialize the prefix table for each new process. As described below, initialization involves a broadcast protocol so it is more efficient to share a prefix table among all processes and amortize the initialization cost over a larger number of naming operations.

The interface to the server is based on a relative pathname and a token. The token comes from the matching prefix table entry, and the relative pathname is obtained by stripping off the prefix. If the client already has a relative pathname, the token associated with the working directory is used. The tokens are descriptor IDs as defined in Chapter 3. They include a type, a server ID, and some server-defined information that identifies a directory. The token is used by the client to identify the server and its type. The type is used to branch through the internal naming interface to routines that access the server. The token is used by the server to determine at which directory to begin resolving the relative pathname. Prefix table initialization is described below.

4.4.2. Maintaining Prefix Tables

Clients initialize their prefix tables using a broadcast protocol. To create a new prefix table entry, a client broadcasts a prefix, and the server for the prefix, if any, responds with a token that corresponds to the prefix. The token is the descriptor ID for the root directory of the server's domain. The client saves the token in the prefix table entry and uses it as described above.

Prefix tables are maintained as a cache. Each entry is used only as a hint, which facilitates changing the server configuration. If the server for a prefix changes, out-of-date prefix table entries are updated by clients the next time they use them. In this case, a client gets an error (``invalid token'') when it uses the out-of-date prefix table entry, and it rebroadcasts to determine the new server. This lazy update approach eliminates the need for a complex distributed protocol to update the clients' prefix tables. Also, servers do not have to keep track of other servers, so there is no special inter-server protocol either. This technique applies to future lookup operations, but it does not help if the server for an open I/O stream changes. Chapter 6 describes how the file system's recovery protocol could be used to handle this problem.

A client learns about prefixes as a side-effect of pathname redirection. A client initializes its prefix table with ``/'', the prefix for the root of the hierarchy, and it broadcasts to locate the root server during system initialization. The root prefix matches on all pathnames so initially the client will direct all lookups to the root server. During pathname lookup, the root server will encounter a special file, called a remote link, at the point where another domain begins (in UNIX terminology, this is called a mount point). A remote link is similar to a UNIX symbolic link except that it contains a prefix and indicates that pathnames that cross the link are in another domain. The server combines the remaining pathname and the value in the remote link to create a new absolute pathname. It returns this expanded pathname to the client along with an indication of how much of the pathname is a valid prefix. The client adds the prefix to its table, broadcasts to locate the next server, and reiterates its lookup procedure. In this way, a client gradually adds entries to its prefix table as it accesses different parts of the file system hierarchy.

For example, suppose there is a domain identified by the prefix ``/users'', as shown in Figure 4-2, but the client only has ``/'' in its prefix table. The first time a client uses the pathname ``/users/sprite/brent'' it matches on ``/'', and ``users/sprite/brent'' is sent to the root server. The root server looks for the component ``users'' in its root directory and finds that it is a remote link with value ``/users''. The server expands the remote link to produce ``/users/sprite/brent''. This pathname is returned to the client along with an indication that ``/users'' is a valid prefix. The client adds ``/users'' to its prefix table, broadcasts to locate its server and get the token, and reiterates its lookup algorithm. On the next iteration ``/users/sprite/brent'' matches on ``/users'', and ``sprite/brent'' will be sent to the server for this other domain. Thus, the lookup iterates back and forth between the client and various servers as the pathname moves through different domains; there is no server-to-server forwarding of lookups.

[Figure 4-2: the hierarchy of Figure 4-1 split into two domains. The ``/'' domain contains cmds (cc, ls, edit), etc (passwd), dev, and a remote link for users; the ``/users'' domain contains staff, spur, and sprite, with user directories pattrsn, ouster, and brent below them.]

Figure 4-2. An example of a remote link. This file hierarchy is divided into two domains, ``/'' and ``/users''. The root directory contains a remote link that identifies the ``/users'' domain.
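The hint-based maintenance described in Section 4.4.2 amounts to a small retry loop around each use of a prefix table entry. The following sketch uses invented names (LookupViaPrefix, BroadcastForPrefix, FS_STALE_TOKEN) to illustrate the idea; it is not the Sprite code.

    /*
     * Sketch of using a prefix table entry as a hint: on an "invalid token"
     * error, rebroadcast for the prefix, refresh the cached entry, and retry.
     */
    typedef int ReturnStatus;
    #define SUCCESS        0
    #define FS_STALE_TOKEN 1   /* server no longer recognizes the token */

    typedef struct PrefixEntry {
        const char *prefix;
        int         serverID;
        int         token;
    } PrefixEntry;

    extern ReturnStatus BroadcastForPrefix(const char *prefix,
                                           int *serverIDPtr, int *tokenPtr);
    extern ReturnStatus LookupViaPrefix(PrefixEntry *entry,
                                        const char *relativeName);

    ReturnStatus
    LookupWithHint(PrefixEntry *entry, const char *relativeName)
    {
        ReturnStatus status = LookupViaPrefix(entry, relativeName);

        if (status == FS_STALE_TOKEN) {
            /* The entry was only a hint; find the new server lazily. */
            status = BroadcastForPrefix(entry->prefix,
                                        &entry->serverID, &entry->token);
            if (status == SUCCESS) {
                status = LookupViaPrefix(entry, relativeName);
            }
        }
        return status;
    }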

4.4.3. Domain Crossings

In the most general case a pathname can cross through many different domains before it terminates. There are three cases the system has to handle: 1) a pathname can descend into a domain from above, 2) a pathname can ascend out of a domain from below, or 3) a pathname can jump back to the root via a symbolic link to an absolute pathname.

Descending into domains is detected when remote links are encountered, as described above. This case occurs if a client has not discovered the corresponding prefix, or if the lookup of a relative pathname begins in one domain and descends into another. In the latter case the server may return a prefix that the client already knows about, but this causes no problems.

Pathnames ascend out of a domain when they include ``..'', the parent directory. Each time the server processes a ``..'' component it checks to make sure it is not at the root of its domain. The way the server detects this situation is described below. If the server is at the root of its domain and the next component is ``..'', it returns the remaining name to the client. The client generates an absolute pathname using the domain's prefix and the returned name. For example, if the server for domain ``/a/b'' returns ``../x/y'', the client combines these into ``/a/b/../x/y'', which further reduces to ``/a/x/y''. This pathname will no longer match on the prefix ``/a/b'', so the client will choose a different server for the next iteration.11

11 It would also be possible for the server to generate the new absolute pathname. That the client does this is a legacy of an earlier implementation where the server could not always tell what prefix a client had used.

A symbolic link to an absolute pathname causes a jump to an arbitrary point in the hierarchy. The server expands the link and returns a new absolute pathname to the client. This case is similar to the redirection that occurs when a remote link is encountered, except that no prefix information is returned to the client.
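As an illustration of the ``..'' case, the following C sketch combines a domain prefix with the name returned by the server and reduces the ``..'' components lexically, turning ``/a/b'' plus ``../x/y'' into ``/a/x/y''. It is only a sketch under simplifying assumptions: fixed-size buffers, no error handling, and no awareness of symbolic links or domain boundaries.

    #include <stdio.h>
    #include <string.h>

    /* Combine a prefix and a returned relative name, then resolve "."
     * and ".." components lexically: "/a/b" + "../x/y" -> "/a/x/y". */
    static void Combine(const char *prefix, const char *rest, char *out, size_t outLen)
    {
        char tmp[256];
        char *parts[64];
        int nparts = 0;

        snprintf(tmp, sizeof(tmp), "%s/%s", prefix, rest);

        for (char *p = strtok(tmp, "/"); p != NULL; p = strtok(NULL, "/")) {
            if (strcmp(p, ".") == 0) {
                continue;                       /* ignore "." */
            } else if (strcmp(p, "..") == 0) {
                if (nparts > 0)
                    nparts--;                   /* ascend one level */
            } else if (nparts < 64) {
                parts[nparts++] = p;
            }
        }

        /* Rebuild an absolute pathname from the surviving components. */
        snprintf(out, outLen, "/");
        for (int i = 0; i < nparts; i++) {
            strncat(out, parts[i], outLen - strlen(out) - 1);
            if (i < nparts - 1)
                strncat(out, "/", outLen - strlen(out) - 1);
        }
    }

    int main(void)
    {
        char result[256];
        Combine("/a/b", "../x/y", result, sizeof(result));
        printf("%s\n", result);                 /* prints /a/x/y */
        return 0;
    }

The reduced pathname no longer matches the old prefix, so the next iteration of the lookup naturally selects a different server.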

4.4.4. Optimizing Cross-Domain Symbolic Links

The measurements presented below will show that the bulk of pathname redirections in our environment stem from symbolic links between domains. The cross-domain links are artifacts of disk space limitations. Often several small directory structures are grouped together into one domain (i.e., one disk), and symbolic links are used to give the directory structures, or sub-domains, proper names. For example, in our network we have sub-domains corresponding to ``/emacs'' and ``/X11R3'', which store the source code and binary files for these software systems. We cannot afford to dedicate a whole domain to these directory structures, so we put them together into one domain (e.g., ``/a''). There are symbolic links from ``/emacs'' to ``/a/emacs'' and from ``/X11R3'' to ``/a/X11R3''. The unfortunate consequence of these symbolic links is that every lookup under ``/emacs'' or ``/X11R3'' will cause a pathname redirection. The lookup goes to the root server, which encounters the symbolic link and expands it (e.g., to ``/a/emacs''). The root server redirects the pathname back to the client, and the client issues the lookup to the correct server. Another significant example from our network is that ``/tmp'' was a cross-domain symbolic link to ``/c/tmp''. Consequently, all lookups of temporary files suffered a redirection.

It is possible to eliminate redirection in these cases by defining a new domain that corresponds to the target of the symbolic link. In our example, the server for ``/c'' can also export a sub-domain under the prefix ``/tmp''. The token associated with this prefix corresponds to the ``/c/tmp'' directory. The symbolic link from ``/tmp'' to ``/c/tmp'' is replaced with a remote link having the value ``/tmp''. The remote link causes clients to add ``/tmp'' to their prefix tables, and when they broadcast this prefix, the server returns the token corresponding to ``/c/tmp''. Future lookups will not require a broadcast or a redirection, and the server will resolve pathnames by starting at the ``/c/tmp'' sub-directory. In effect, this approach caches the result of expanding the symbolic link in the client's prefix table.

It would be possible to cache the expansion of all absolute symbolic links like this. The server would expand the symbolic link and return it to the client as it does now, and the client would broadcast to get a token for the target of the symbolic link. The client would add the link's pathname (e.g., ``/tmp'') to its prefix table along with the token for the target of the link (i.e., ``/c/tmp''). The main problem with this trick is that it bypasses the access controls on the path to the target of the link. A naive user could create a symbolic link into his directory structure and open it up to all other users. Because of this access control problem, domains are defined by system administrators by setting up the appropriate remote links and configuring the servers to respond to broadcasts for particular prefixes.

Exporting a subdirectory (``/c/tmp'') as a prefix (``/tmp'') makes it slightly more complicated for the server to detect when a ``..'' component is leaving a domain. It is not sufficient for the server to check against the primary root of a domain (``/c''); the root of the domain depends on what prefix the client used. In our example, when the user is in the ``/c/tmp'' directory, the parent directory is either ``/c'' or ``/''.
The client has to identify two directories, the root directory for the domain and the starting directory for the pathname within the domain. Both directories are identified with tokens. If the client begins with an absolute pathname, the two tokens are the same. If the client begins with a relative pathname, one token corresponds to the working directory, and the root token corresponds to the prefix used when establishing the working directory itself. For example, when a process sets its working directory to ``/c/tmp'', the prefix that this name matched on (e.g., ``/c'') is remembered along with the token returned for ``/c/tmp''. On a subsequent lookup of a relative pathname the client sends the token for working directory ``/c/tmp'' to indicate where to start the lookup, and the token for the prefix ``/c'' to indicate the root.

4.4.5. Different Types of Servers

The location of the server (e.g., local or remote) is indicated by the type of the token in the prefix table. This type is used to invoke the correct service procedure in each iteration of the algorithm. The different cases are hidden from the top-level procedures that use the prefix table by the naming interface defined in Chapter 3. In the local case (i.e., on the file servers), the service procedures use a directory scanning algorithm much like the UNIX algorithm described in Section 4.2. The only modification is additional support for remote links and pathname redirection, and there is no use of a mount table. In the remote case a client uses RPC to invoke these service procedures on the file server. In a third case, described in Chapter 7, the naming operation is forwarded to a user-level server process.
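The following C fragment sketches the idea of selecting a service procedure from the token type; the TokenType enumeration and the three Lookup stubs are invented for illustration and are not Sprite's naming interface.

    #include <stdio.h>

    /* Invented token types for the three cases described above. */
    enum TokenType { TOKEN_LOCAL, TOKEN_REMOTE, TOKEN_PSEUDO };

    struct Token {
        enum TokenType type;
        int            serverId;    /* host or server-process identifier */
        int            descriptor;  /* starting directory within the domain */
    };

    /* Stub service procedures, one per token type. */
    static int LookupLocal(const char *name)  { printf("scan local directories: %s\n", name); return 0; }
    static int LookupRemote(const char *name) { printf("RPC to remote file server: %s\n", name); return 0; }
    static int LookupPseudo(const char *name) { printf("forward to user-level server: %s\n", name); return 0; }

    /* The top-level lookup code calls through this switch and never
     * needs to know where the domain's server actually is. */
    static int Lookup(const struct Token *t, const char *relativeName)
    {
        switch (t->type) {
        case TOKEN_LOCAL:  return LookupLocal(relativeName);
        case TOKEN_REMOTE: return LookupRemote(relativeName);
        case TOKEN_PSEUDO: return LookupPseudo(relativeName);
        }
        return -1;
    }

    int main(void)
    {
        struct Token t = { TOKEN_REMOTE, 7, 42 };
        return Lookup(&t, "sprite/brent");
    }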

4.4.6. The Sprite Lookup Algorithm

The features of the naming mechanism are summarized by the following lookup algorithm. The inputs to the algorithm are a pathname and per-process state information that indicates the working directory and user authentication information used for access control. The results of the algorithm depend on the particular naming operation. Operations like REMOVE and MKDIR change the directory structure and are implemented as the last part of resolving the name. In these cases the result is simply an error code. Other operations like NAME_OPEN and GET_ATTRIBUTES return information about the named object. These operations have been described in detail in Chapter 3. The resolution algorithm proceeds as follows.

(1) If a pathname is absolute, the longest matching prefix determines the server. The prefix is stripped off and the token associated with the prefix is obtained.

(2) If the pathname is relative, it is unaltered and the token that identifies the working directory is obtained.

(3) The server is given the relative name and the token that identifies the starting directory of the relative name. The server follows the pathname through the directory structure on its disks. If the pathname terminates in the server's domain, it performs the requested naming operation (e.g., NAME_OPEN, REMOVE, MKDIR, etc.).

(4) If the pathname leaves the server's domain, the server returns a new pathname and possibly a prefix to the client in a pathname redirection.

(5) The client updates its prefix table, if needed. If the prefix is new to the client, it broadcasts to locate the prefix's server and obtain the token for the domain. The client restarts the lookup at step 1 with a new absolute pathname. The new pathname will match on a different prefix so the lookup will eventually complete. (Symbolic links can create circularities in the directory structure. The client guards against these by limiting the number of redirections that can occur in a single lookup.)
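The client side of this loop can be sketched in C as follows. The prefix table, the toy ServerLookup routine, and the redirection limit of 20 are all assumptions made for the example; only the shape of the iteration (match, send, handle redirection, retry) follows the algorithm above, and the handling of relative names (step 2) is omitted.

    #include <stdio.h>
    #include <string.h>

    struct Reply {
        int  done;             /* 1 if the naming operation completed at the server */
        char newPath[128];     /* expanded pathname returned on redirection */
        char newPrefix[64];    /* valid prefix within newPath, or "" if none */
    };

    static char prefixes[8][64] = { "/" };   /* client's prefix table (tokens omitted) */
    static int  nPrefixes = 1;

    /* Step 1: longest matching prefix. */
    static const char *MatchPrefix(const char *path)
    {
        const char *best = prefixes[0];
        for (int i = 0; i < nPrefixes; i++)
            if (strncmp(path, prefixes[i], strlen(prefixes[i])) == 0 &&
                strlen(prefixes[i]) >= strlen(best))
                best = prefixes[i];
        return best;
    }

    /* Toy server: the root server redirects names under "users" by
     * returning the expanded pathname and the prefix "/users". */
    static struct Reply ServerLookup(const char *prefix, const char *rest)
    {
        struct Reply r = { 0, "", "" };
        if (strcmp(prefix, "/") == 0 && strncmp(rest, "users/", 6) == 0) {
            snprintf(r.newPath, sizeof(r.newPath), "/%s", rest);
            snprintf(r.newPrefix, sizeof(r.newPrefix), "/users");
        } else {
            r.done = 1;
            printf("server for \"%s\" handled \"%s\"\n", prefix, rest);
        }
        return r;
    }

    int main(void)
    {
        char path[128] = "/users/sprite/brent";
        for (int i = 0; i < 20; i++) {                          /* redirection limit */
            const char *prefix = MatchPrefix(path);             /* step 1 */
            const char *rest = path + strlen(prefix);
            while (*rest == '/')
                rest++;                                         /* strip the prefix */
            struct Reply r = ServerLookup(prefix, rest);        /* step 3 */
            if (r.done)
                return 0;
            if (r.newPrefix[0] != '\0' && nPrefixes < 8) {      /* step 5: learn a prefix; a   */
                strncpy(prefixes[nPrefixes], r.newPrefix, 63);  /* real client would broadcast */
                nPrefixes++;                                    /* here to obtain its token    */
            }
            strcpy(path, r.newPath);                            /* step 4: retry expanded name */
        }
        return 1;   /* too many redirections: possible symbolic-link cycle */
    }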

4.4.7. Operations on Two Pathnames

A more complicated iteration over the prefix table may be required with operations that involve two pathnames (i.e., HARD_LINK and RENAME), but in the best case these can be done with a single server operation. The idea is that both pathnames (and their starting directory tokens) are sent to the same server, and in the best case the server can complete the operation. However, either or both of the pathnames might leave the server's domain and cause pathname redirection. In this case the client must reiterate over its prefix table as described below.

The HARD_LINK and RENAME operations require their two pathnames to lie in the same domain. HARD_LINK creates a new directory entry that references an existing object. The identifier in a directory is domain-relative, so it is impossible to create a ``hard'' link to another domain. (Symbolic links reference an object by name so there is no restriction on the target of a symbolic link.) RENAME is used to change the name of an object without copying it. RENAME makes a link to the object under the new name, and then deletes the original directory entry for the object. Thus, both of these operations fail if the two pathnames are not in the same domain.

The algorithm used for HARD_LINK and RENAME proceeds as follows. It has to handle redirections involving both pathnames, plus it has to detect when the two pathnames terminate in different domains.

(1) The prefix table is used to determine the server for the two pathnames. Both pathnames, and their associated prefix tokens, are sent to the server of the first pathname. In the best case both pathnames are entirely within this server's domain and the operation completes.

(2) If the first pathname leaves the server's domain, the server returns a new name to the client and does no processing of the second pathname. The client then restarts at step 1, updating its prefix table if needed.

(3) If the first pathname terminates in the server's domain, but the second pathname leaves the server's domain (or it does not even begin in the server's domain), the server returns a ``cross domain'' error.

(4) At this point the client has to verify the error. It is still possible that the second pathname can end up back in the same domain as the first pathname. This condition is tested by looking up the parent directory of the second pathname using a regular GET_ATTRIBUTES operation. The pathname of the parent is computed simply by trimming off the last component of the second pathname. The parent is checked in order to avoid ambiguous errors that can arise when checking the second pathname directly. With HARD_LINK, for example, the second pathname typically does not exist, so a lookup would return ``file-not-found.'' This may or may not mean the operation can complete, whereas a ``file-not-found'' on the parent directory definitely means the operation cannot succeed.

(5) If the parent directory of the second pathname is in the same domain as the first pathname then the client restarts at step 1. By this time the iterations between various servers have expanded both pathnames fully and established enough prefix table entries to direct both pathnames to the same server. Otherwise, if the second pathname is in a different domain, the operation cannot complete and the ``cross domain'' error is returned to the user.

An alternative approach to these two operations would be to build them up from low-level operations. However, this approach requires more than one server operation, even in the best case. It also requires that the server keep some state between low-level operations. This makes it more difficult for the server to implement HARD_LINK and RENAME atomically; it has to handle error cases where the client gets part way through the operation and then crashes.
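A C sketch of the client's side of this algorithm follows. The status codes, the RenameRpc and GetAttributesDomain stubs, and the retry limit are hypothetical; the point is the shape of steps 1-5, in particular the verification of a ``cross domain'' error by checking the parent of the second pathname.

    #include <stdio.h>
    #include <string.h>

    enum { OK = 0, REDIRECT = 1, CROSS_DOMAIN = 2 };   /* hypothetical status codes */

    static int RenameRpc(const char *src, const char *dst);          /* stub */
    static int GetAttributesDomain(const char *path, int *domain);   /* stub */
    static int DomainOf(const char *path);                           /* stub */

    static int Rename(const char *src, const char *dst)
    {
        for (int attempt = 0; attempt < 20; attempt++) {
            int status = RenameRpc(src, dst);     /* step 1: both names to src's server */
            if (status == REDIRECT)
                continue;                         /* step 2: prefix table updated, retry */
            if (status != CROSS_DOMAIN)
                return status;                    /* completed (or some other error) */

            /* Steps 4-5: verify the error by examining dst's parent directory. */
            char parent[256];
            strncpy(parent, dst, sizeof(parent) - 1);
            parent[sizeof(parent) - 1] = '\0';
            char *slash = strrchr(parent, '/');
            if (slash != NULL)
                *slash = '\0';                    /* trim the last component */

            int parentDomain;
            if (GetAttributesDomain(parent, &parentDomain) != OK)
                return CROSS_DOMAIN;              /* parent not found: cannot succeed */
            if (parentDomain != DomainOf(src))
                return CROSS_DOMAIN;              /* genuinely spans two domains */
            /* Same domain after all: the prefix tables are now loaded, so retry. */
        }
        return CROSS_DOMAIN;
    }

    /* Stubs so the sketch runs: the first attempt reports a cross-domain
     * error, the verification finds the same domain, and the retry works. */
    static int calls = 0;
    static int RenameRpc(const char *src, const char *dst)
    {
        (void)src; (void)dst;
        return (calls++ == 0) ? CROSS_DOMAIN : OK;
    }
    static int GetAttributesDomain(const char *path, int *domain) { (void)path; *domain = 7; return OK; }
    static int DomainOf(const char *path) { (void)path; return 7; }

    int main(void)
    {
        printf("rename status: %d\n", Rename("/a/old/name", "/a/new/name"));
        return 0;
    }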

4.4.8. Pros and Cons of the Prefix Table System

The prefix table system described here has four main advantages.

(1) Simple Clients. The client side of the lookup algorithm is simple, and diskless clients are easily supported. A client only has to be able to match prefixes, reiterate a lookup after pathname redirection, and broadcast to locate servers. The details of the directory format and symbolic link expansion are hidden from clients.

(2) Name Space Consistency. Servers export a domain under the same prefix to all clients so all clients have the same view of the file system name space. It does not matter what workstation a user uses. File servers share the name space so system administration chores can be done on any host. Servers do not have to worry about consistency of directories when they are updated because directories are private to the server.

(3) Dynamic Reconfiguration. It is easy to change the configuration of the system because clients' prefix tables are refreshed automatically. Adding a new domain requires the addition of a remote link and updating the server's set of exported domains. No system-wide reconfiguration procedure is needed.

(4) Efficient Lookups. Most lookups are performed with a single server operation. The prefix match bypasses the upper levels of the hierarchy, reducing traffic to the root server.

There are three main disadvantages to this system:

(1) No Name Caching. Clients contact a server for every lookup operation. There are cases where clients open the same files many times in a short period of time, and caching lookup results would help reduce server traffic. However, a lookup cache adds complexity to both the client (which now has to use the cache) and the server (which has to ensure that client caches remain consistent). The data cache consistency algorithm also depends on seeing every open and close, and it would need to be modified to account for name lookup caching. While we considered adding a name cache, we avoided it because of the additional complexity it would add to the system.

(2) Broadcast. The use of broadcast limits a Sprite file system to a local area network capable of broadcast.12 Mechanisms used in widely distributed name services could be employed to address this limitation [Oppen83][Lampson86]. However, there are more issues than the mechanisms of server location that have to be addressed for large scale distribution of the file system. In particular, system administration will require cooperation among the various groups using the system. Security and access will be more important issues as the user community gets larger. Basically, there is a whole step up in complexity associated with very large scale distribution that is not addressed in this dissertation.

(3) Availability. Failure of a server can prevent a client from getting new prefixes so some name lookups may not work until the server is restarted. (This recovery is handled automatically, as described in Chapter 6.) Availability of the system could be improved by replicating important domains, such as the root domain or the domain containing system programs. However, replication requires some mechanism to update the replicated domains, and some interserver negotiation to decide who will service what clients. Replication would add complexity to the implementation, and possible overhead to ordinary operations, so it has been avoided. However, as Sprite is scaled to larger networks in the future, the issue of replication may have to be addressed.

12 Currently, Sprite does support sharing of domains between hosts not reachable by broadcast, which is necessary because a few Sprite machines are in a remote laboratory. However, our current solution is rather crude and it ought to be replaced. We have added the capability to predefine the server for a prefix. In this case, the server is contacted directly the first time the prefix entry is used, and broadcast is not used. This feature was the quickest and simplest way to support distant machines, but it suffers from the maintenance problems of a remote mount system.

4.5. Measurements

A number of statistics are maintained by the hosts in our Sprite network, which shed insight on the effectiveness of the prefix table system. These statistics were sampled every hour (day and night) on the file servers and 6 times during the day on the clients (every three hours from 8:00 AM to 11:00 PM). The samples were taken from July 8 through December 22, 1989. The configuration of the file servers during this time is given in Table 4-3. The most heavily used server averages over 1 million RPC requests each day, so the data presented here are averages over many millions of operations. More detailed tables of the results presented in this chapter are presented in Appendix B, Section 3.

                          File Server Configuration

    Name      Model            Memory       Disk Capacity   File Types
    Mint      Sun-3/180        16 Mbytes    300 Mbytes      Root and commands.
    Oregano   Sun-3/140        16 Mbytes    900 Mbytes      /tmp and source code.
    Allspice  Sun-4/280        128 Mbytes   2400 Mbytes     User files and source code.
    Assault   DECStation 3100  24 Mbytes    600 Mbytes      User files.

Table 4-3. The configuration of file servers during the study period. Mint has the root directory and most commands and system-related files. Allspice has most of the source code for the system, user files, and swap directories. Oregano has /tmp and some sources. Assault has user files.

4.5.1. Server Lookup Statistics

The average length of the pathnames processed by servers is between 2 and 3 components per pathname. Looking up a whole pathname with one RPC saves as much as 2 to 3 times the RPC traffic over a lookup system that does lookup one component at a time, depending on the effectiveness of a component name cache. Also, these average lengths do not count the components that matched on the client's prefix table, so there is even more savings over a component-at-a-time lookup.

The lookup traffic for the file servers is presented in Figure 4-3. The rates show a peak during the nightly dumps, and a hump during the daytime hours. A per-month and per-server breakdown of these statistics is given in Appendix B.

Almost 20% of all pathnames are not found. This percentage does not count the number of times a file was newly created. These failed lookups are an artifact of the directory search paths used by our compilers and command interpreters. A search path is a list of directories in which to look for commonly used files like command programs and compiler include files. Search paths let users specify files and commands with relative names (e.g., ``ls'' vs. ``/sprite/cmds/ls''). However, search paths also cause many failed lookups.

There is a significant rate of pathname redirections. About 13.5% of all lookups are redirected due to absolute symbolic links. Only 0.04% of all lookups are redirected due to remote links, and only 0.15% of all lookups are redirected due to leaving a domain via the parent of its root. Thus, the remote links are effective in loading the clients' prefix tables, and it is rare that a domain is entered or left via a relative name. The redirections due to absolute symbolic links are mainly due to a number of auxiliary command directories (e.g., ``/emacs'') that are accessed via symbolic links. We should be able to reduce the number of redirections caused by these symbolic links by exporting their target directory under a prefix, as described above in Section 4.4.4.

[Figure 4-3 plots lookup rates in operations per second against hour of day.]

Figure 4-3. Lookup rates for the combination of all four file servers. The servers were measured from July 8 through December 22, 1989. Rates are shown for the total lookups, failed lookups (file-not-found), symbolic links traversed, pathname redirections, and open-for-create.

4.5.2. Client Lookup Statistics

Clients recorded the number of absolute and relative pathnames that passed through the naming interface, and the number of lookup requests that were redirected. These results are given in Figure 4-4. A per-client breakdown of lookup rates is given in Appendix B. Overall, 71% of the pathnames looked up by clients were absolute. This is due to command execution and traffic to system files like sources and libraries. If we assume that most pathnames are outside the root domain (the small size of the root domain precludes keeping much there), then the root directory is usually bypassed by the prefix match or the use of a relative name. Also, given the relatively short pathnames processed by the server, the elimination of even a single component by the prefix match significantly cuts the processing done by the servers.

[Figure 4-4 plots lookup rates in lookups per second against hour of day.]

Figure 4-4. Lookup rates for 35 clients combined. The lookups are divided into those on absolute and relative pathnames. The rate of pathname redirections is also shown. These rates were computed from July 8 through Dec 22, 1989. Clients were sampled every three hours from 8:00 AM to 11:00 PM.

4.6. Related Work

This section contrasts the Sprite prefix table mechanism with other distributed name lookup systems.

4.6.1. Remote Mounts

The differences between Sprite prefix tables and remote mount schemes should be apparent from the above descriptions of the two mechanisms. The primary advantage of Sprite prefix tables over remote mount schemes is that they ensure a consistent name space because the server defines the prefixes. In contrast, in a remote mount scheme the client decides where a remote file system is mounted into its own name space. An extra mechanism is required to ensure consistency of mount tables, such as the network configuration protocol used in LOCUS [Popek85].

4.6.2. Prefix Based Systems

The V-system has a general naming protocol that uses prefix tables [Cheriton89]. Prefixes in V are used to identify different types of servers (e.g., print servers, file servers, device servers). A system-wide name service is used to register server processes under prefixes. Prefixes are explicit in V-system names (e.g., ``service]rest-of-name'', where `]' is a reserved character), so the mechanism cannot be used to transparently distribute a file system hierarchy. Instead, a client's prefix table is used as a cache of results obtained from the name service.

The Andrew [Satyanarayanan85] and Echo [Hisgen89] file systems use prefix tables to distribute a file system hierarchy as in Sprite. However, these systems use a replicated name service to store the global configuration of the system so they can distribute their file systems over a wide area network. Ubik [Kazar89], which is used in Andrew, is a special purpose database that only allows one update at a time. This assumes that the file server configuration changes infrequently. The name service used by Echo [Lampson86] allows concurrent updates, but it provides eventual consistency of the name space. Updates are propagated as soon as possible, and the system guarantees that all name servers will eventually have the same information. This approach assumes that users will tolerate brief inconsistencies in the view of distant parts of the system.

The Sprite prefix table mechanism differs from V in that it provides transparent distribution, and it differs from Andrew and Echo because it is optimized towards the local area network environment. The Sprite naming protocol is unique in its use of remote links to trigger pathname redirection and define domain prefixes; it does not require the complexity of a replicated database.

4.7. Summary and Conclusion

This chapter has described a name resolution system based on prefix tables and remote links. There are a number of attractive features of the system.

It supports simple clients (i.e., diskless workstations).

It ensures a consistent view of the name space on all hosts.

It is easy to manage because prefix tables are automatically updated when the configuration of the system changes.

It reduces the load on the servers for the upper levels of the hierarchy because the prefix match bypasses them.

The measurements from our system indicate that remote links are effective in loading the prefix tables. In only 0.04% of the server lookups were remote links encountered. Similarly, leaving a domain via its root is also rare; it happened in 0.15% of the lookups. However, 17.03% of the pathnames traversed a symbolic link, and 13.52% of the pathnames were redirected because of a cross-domain link. The large number of redirections from symbolic links suggests that we can further optimize our domain structure by exporting important subdirectories like ``/emacs'' and ``/X11R3'' as sub-domains instead of accessing them via symbolic links.

To conclude, the prefix table mechanism is a simple mechanism that is used to transparently distribute the file system's hierarchical name space across the network. The mechanism is really a simple name service that is closely integrated with the regular name lookup mechanisms used in traditional, stand-alone file systems. The prefix table is implemented as a thin layer on top of the routines that manage a directory structure on the local disk. However, the clean split into the client and server parts of the system makes it possible to merge the directory structures maintained on different servers into a single hierarchy shared by all the clients.

CHAPTER 5

Process Migration and the File System

5.1. Introduction

Process migration is the movement of actively executing processes between hosts [Douglis87]. We use migration daily in our system to move compute- and file-intensive jobs from our personal workstations to other workstations that are idle. A migrated process can continue to use I/O streams to file system objects and open new I/O streams using the same names it would have used before it migrated. There is no need to copy files between hosts or to restrict access to devices because of the remote access facilities of the file system and the shared file system name space. Thus, the combination of process migration and the shared file system allows us to exploit the processing power available in our network.

When a process migrates, it is necessary to migrate its open I/O streams, which is a complex procedure. The precise state that the file servers keep to support caching and crash recovery has to be maintained during migration. Updating this state requires coordinated actions at the I/O server and the two clients involved in migration. In addition, the procedure has to be serialized properly with state changes due to other operations. The locking done during migration has to mesh correctly with the locking done by other operations, otherwise deadlock can result. This chapter describes the deadlock-free algorithm by which I/O streams are transferred to new hosts.

Migration can cause streams to be shared by processes on different hosts, and additional state is required to support the semantics of the shared stream in this case. The current access position of the stream is shared by the processes sharing a stream. In the absence of migration, the shared offset is supported by keeping the access position in the stream descriptor and sharing the stream descriptor among processes. Migration can cause processes on different hosts to need access to the shared stream access position. This chapter describes a mechanism based on shadow stream descriptors that is used to maintain the shared access position. A client's stream descriptor is bypassed and the server's shadow stream descriptor is used when the stream is shared by processes on different hosts.

The remainder of this chapter is organized as follows. Section 5.2 explains the problem of stream sharing in more detail and describes the use of shadow stream descriptors to solve it. Section 5.3 describes a deadlock-free algorithm to update the file system's distributed state during migration. Section 5.4 discusses the cost of migration to the file system, and Section 5.5 concludes the chapter.

5.2. The Shared Stream Problem

If two processes share a stream, I/O operations by either process update the stream's current access position. Normally the access position is kept in the stream descriptor, and all processes sharing the stream reference this same descriptor. Now, suppose one process migrates to another host. One reference to the stream descriptor migrates to a new host, while other references remain on the original host. Some mechanism must be added to the system to support the shared stream access position among processes on different hosts.

One approach, which is used in LOCUS, is to circulate a stream token among the different hosts sharing the stream [Popek85].13 The token represents ownership of the stream, and it has to be present on a client before an application can use the stream. If the token is not present then the I/O server is contacted, and it retrieves the token from the current owner. In this case, there is an additional overhead of 2 RPCs every time the stream changes ownership. Also, if a client is writing to the stream, any data it has written has to be flushed back to the I/O server when it gives up the token. LOCUS uses a token circulation scheme for cache consistency, as well. Associated with each file is one write token and one or more read tokens, and the LOCUS I/O server is in charge of fairly circulating the stream token, read tokens, and write token among clients.

13 The term ``offset'' token is used in [Popek85], not ``stream'' token. Offset is another term for the current access position of the stream.

The other approach, which is used in Sprite, is to keep the stream access position at the I/O server in a shadow stream descriptor when the stream is shared. In this case, I/O operations go through to the server because it keeps the current stream access position. This approach requires an RPC on each I/O, but there is no need for callbacks from the I/O server to other clients. The shadow stream solution parallels the technique that Sprite uses for consistent data caching. The problem in data caching is to keep the cached data consistent with other copies cached elsewhere. To simplify data cache consistency, the local data cache is bypassed if a file is being write-shared by processes on different hosts. Similarly, the local stream descriptor can be thought of as a cache for the stream's access position. When a stream is shared by processes on different hosts, the local stream descriptor has to be bypassed to get the most up-to-date stream access position. These two cases are treated identically by the implementation, so the effects of stream sharing on the data caching mechanism are very small in terms of additional code complexity.

Thus, Sprite bypasses the local cache during sharing, while LOCUS circulates a token to represent ownership, and it always accesses the local cache. The amount of message traffic required with the Sprite scheme will be less if the stream is heavily shared, even if the stream is read-only. In Sprite, each I/O would require one RPC. In LOCUS, each I/O would require 2 RPCs to retrieve the token (one from the client to the server and one from the server to the other client that holds the token), and perhaps an additional RPC to fetch the data if it is not in the cache. This comparison assumes the worst case for LOCUS where processes on different hosts interleave their use of the I/O stream. On the other hand, if the stream is really only used by one of the processes that shares it, the LOCUS scheme may have less message traffic. Once that process obtains ownership of the stream, it can continue to do I/O without RPCs, assuming that all the data for the stream is in the local cache. Finally, if the I/O stream is to a device and not a file, then Sprite's scheme will always require less message traffic. The I/O operations always have to go through to the device's I/O server, so there is no benefit from caching the stream access position on the clients.

In our network virtually all the cross-machine stream sharing occurs within a single application, pmake. Pmake uses migration to dispatch compilations and other jobs to idle hosts. Pmake generates a small command script, forks a shell process that will read the script, and migrates the shell to an idle host. The I/O stream to the command script is shared by pmake and the shell. The script is generated before the shell is migrated, so it is written to the cache of pmake's host, and then written back to the file server during the migration. The way this happens is explained in more detail below. The read operations by the shell, however, have to go through to the file server because of the shared stream.

Measurements from our network indicate that very little I/O is done to streams shared by processes on different hosts. In a one-month study period this I/O traffic accounted for only 0.01% of the bytes read by clients. This low traffic suggests that a simple mechanism to share the stream offset may be the best, since the mechanism is unlikely to affect overall system performance. The Sprite scheme is simpler because the I/O server does not have to make callbacks to retrieve tokens, nor does it have to worry about circulating the token fairly among the processes that demand the token.

5.2.1. Shadow Stream Implementation

In the following discussion, the term ``stream reference'' is distinguished from ``stream descriptor.'' It is the stream references that migrate among hosts, not the stream descriptor, and the system must support stream references on different hosts that reference the same stream descriptor. The addition of shadow streams to the kernel data structures is shown in Figure 5-1. Note that the shadow stream always exists on the I/O server, whether or not migration has occurred. This is a simplification in the implementation, and its effects are discussed below. The shadow stream has a client list that is used to detect when the stream is in use on different client hosts. The following invariants, which must be maintained during migration and error recovery, apply to the shadow stream descriptor.

The stream access position is valid in the client's stream descriptor if the stream is only in use at that client. Otherwise, the stream access position is valid in the server's shadow stream descriptor.

The shadow stream has a client list with an entry for each host with a process that is using the stream. If a stream has two references and one reference migrates to another host, the client list on the shadow stream descriptor must be updated to reflect this.

The reference count on a shadow stream descriptor does not reflect remote processes that are using the stream. Only local processes, if any, are reflected in the shadow stream's reference count. This invariant ensures that remote FORK and DUP operations do not have to be propagated to the I/O server. Only during migration (and OPEN and CLOSE) is there communication with the I/O server.

[Figure 5-1 diagrams the stream descriptors on two client hosts, the shadow stream descriptors and object descriptor on the I/O server, and their client lists.]

Figure 5-1. Streams and shadow streams. This is similar to Figure 3-6, except that stream #359 is now shared by processes on different hosts, and there is an additional shadow stream descriptor for this stream on the I/O server. The shadow stream descriptor has its own client list that is used to detect when the stream becomes shared by processes on different hosts.

Currently a shadow stream descriptor is created on the I/O server for every stream, not just those that are shared during migration. The main reason the shadow descriptor is always created is to simplify the RPC interface to the server. All I/O requests identify both the stream and the underlying object, and the server verifies both the existence of its shadow stream and the corresponding object descriptor. However, it increases the memory usage of the servers to always have shadow stream descriptors. In the future, the implementation may be optimized to create shadow stream descriptors on demand, but the following description assumes they are always created.
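The descriptions above suggest a shadow stream descriptor shaped roughly like the following C sketch; the field and function names are invented for illustration, not taken from the Sprite sources. The SharedAcrossHosts test captures the first invariant: the client's own copy of the access position is trusted only while a single client host is using the stream.

    #include <stdio.h>

    struct ClientEntry {
        int                 hostId;       /* a client host using the stream */
        struct ClientEntry *next;
    };

    struct ShadowStream {
        int streamId;                 /* same ID as the clients' stream descriptors */
        int refCount;                 /* local references only; remote FORK/DUP not counted */
        int accessPosition;           /* authoritative only while the stream is shared
                                       * by processes on more than one client host */
        struct ClientEntry *clients;  /* one entry per client host using the stream */
        void *objectDesc;             /* the underlying object (file, device, ...) */
    };

    /* The local stream descriptor acts as a cache of the access position;
     * it must be bypassed when more than one client host uses the stream. */
    static int SharedAcrossHosts(const struct ShadowStream *s)
    {
        return s->clients != NULL && s->clients->next != NULL;
    }

    int main(void)
    {
        struct ClientEntry clientB = { 2, NULL };
        struct ClientEntry clientA = { 1, &clientB };
        struct ShadowStream s = { 359, 0, 3862, &clientA, NULL };
        printf("use the server's offset? %s\n", SharedAcrossHosts(&s) ? "yes" : "no");
        return 0;
    }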

5.3. Migrating an I/O Stream

This section describes the procedure used to migrate a stream reference to a new client. The main problem is that the I/O server treats references to the same stream on different clients as different streams. Regardless of what clients share a stream's access position, the I/O server has to account for stream references on different clients to properly support cache consistency. For example, a file that is being both read and written is still cacheable on a client as long as all the readers and writers are on the same client. However, as soon as a single stream reference migrates to a different host the file is no longer cacheable. In effect, migration can cause a stream to split into two streams as viewed by the I/O server. Such a split is shown in Figure 5-2. Migration can also cause two streams to merge back into one, in the case where all the stream references end up back on the same client.

Figure 5-2 is a simple example of a stream with 2 references that demonstrates the problem. Before migration the stream descriptor has a reference count of 2, and there is 1 usage count on the object descriptor from the stream. If both references are closed, the first CLOSE will simply decrement the reference count in the stream descriptor and have no effect on the object descriptor. The second CLOSE will take away the last stream reference, at which point the usage counts on the object descriptor have to be decremented. However, if one of the stream references migrates to a new host, then each host will have a stream descriptor with a reference count of one, and closing either stream will result in a CLOSE being made on the object descriptor. Thus, what used to be viewed as a single stream at the level of the object descriptor is now two distinct streams. In order to support this the client list on the object descriptor has to have an additional entry added for the new client, and the summary usage counts on the object descriptor have to be incremented to account for the additional CLOSE that will be seen.

[Figure 5-2 diagrams the stream and object descriptors on clients A and B and on I/O server C, before and after process P2 migrates from A to B.]

Figure 5-2. A stream may be shared by processes on different hosts after migration. Before the migration the I/O server only knows about the stream on one client, and the fact that the stream is shared by two processes is hidden from it. After migration the stream is in use on different clients and the I/O server has to be informed because it is as if a new stream were being opened at the second client. The I/O server adds a new entry to its client list and increments its summary usage counts to reflect the ``new'' stream.

5.3.1. Coordinating Migrations at the I/O Server

The basic approach to updating the state of the I/O data structures during migration is as follows. When a process migrates away from the source client no changes are made to the source client's I/O data structures, nor is the I/O server contacted. Instead, the destination client notifies the I/O server of the migration after the process arrives there. The I/O server then makes a callback to the source client. At this point coordinated state changes occur. During the callback the I/O server holds its shadow stream descriptor locked in order to serialize with migrations and closes of different stream references. The result of the callback indicates if any stream references remain at the source client. The I/O server updates the client list in the shadow stream descriptor to reflect the new stream on the destination client, and possibly the departure of the stream from the source client. After the shadow stream client list is updated the I/O server can determine if the stream is in use on different clients. If it is, then it must update the client list on the underlying object descriptor and take any object-specific actions that are required. In particular, the cache consistency algorithm is invoked at this time in the case of regular files.

One of the important points of the above algorithm is that the state of the system is updated after a process arrives at its destination, and no state changes are made as it is leaving the source client. Unrelated state changes can occur between the time a stream reference leaves the source client and arrives at the destination. Other stream references could be closed at the source, or they could migrate onto the source client, or I/O operations could update the stream access position. The I/O server has to coordinate these various cases, and coordination is most easily done after a reference has arrived at the destination.

The stream access position is also shifted among hosts during migration. When the first stream reference migrates away from the source client, the server fetches the current access position as part of the callback. The access position remains valid at the server until all the stream references have migrated to the same destination client. The server passes the access position to the client in the return parameters of the destination client's notification. This callback structure is illustrated in Figure 5-3.

[Figure 5-3 shows the source client, destination client, and I/O server, with the steps: 1. migrate, 2. notify, 3. call-back, 4. return offset, 5. pass offset.]

Figure 5-3. Migrating the stream access position. The destination client notifies the I/O server that a stream reference has migrated to it. The I/O server makes a callback to the source client to retrieve the stream access position and tell the source to release a stream reference. If the only stream references are at the destination client, then the I/O server passes the offset to the destination client.
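The server's side of this exchange might look roughly like the following C sketch. The data structure, the CallbackToSource stub, and the locking flag are simplifications invented for the example; the real implementation locks a shadow stream descriptor, performs an RPC callback, and also updates the object descriptor and cache-consistency state.

    #include <stdio.h>

    struct Shadow {
        int locked;          /* stands in for the shadow stream's lock */
        int offset;          /* access position, valid while the stream is shared */
        int clientCount;     /* hosts on the shadow stream's client list */
    };

    /* Callback to the source client: reports whether any references remain
     * there and returns the current access position. */
    static int CallbackToSource(int srcClient, int *offset)
    {
        (void)srcClient;
        *offset = 417;       /* pretend this was the source client's offset */
        return 0;            /* no references left at the source */
    }

    /* Invoked when the destination client notifies the I/O server that a
     * stream reference has arrived after migration. */
    static void MigrateNotify(struct Shadow *s, int srcClient, int dstClient, int *offsetForDst)
    {
        (void)dstClient;
        s->locked = 1;                     /* serialize with other migrations and closes */

        int refsRemain = CallbackToSource(srcClient, &s->offset);
        if (!refsRemain)
            s->clientCount--;              /* the stream left the source client */
        s->clientCount++;                  /* the "new" stream at the destination */

        if (s->clientCount == 1)
            *offsetForDst = s->offset;     /* all references on one client again:
                                            * the offset can live in its descriptor */
        else
            *offsetForDst = -1;            /* still shared: the server keeps the offset */

        /* A real server would also update the object descriptor's client list
         * and invoke the cache consistency algorithm for regular files here. */
        s->locked = 0;
    }

    int main(void)
    {
        struct Shadow s = { 0, 0, 1 };     /* one client currently uses the stream */
        int offset;
        MigrateNotify(&s, 1, 2, &offset);
        printf("offset passed to the destination: %d\n", offset);
        return 0;
    }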

5.3.2. Synchronization During Migration

Proper synchronization during migration is critical. The file system is implemented using locks on individual data structures to allow for concurrency on a multiprocessor. There is a conflict between holding many locks in order to ensure that state changes on the clients and the I/O servers are consistent, and preventing deadlock with other operations that can be proceeding concurrently. A potential source of deadlock is the interaction of a CLOSE by the source client with the callback by the I/O server during a MIGRATE. During CLOSE the client locks its stream descriptor, and if it is the last stream reference the client also locks the underlying object descriptor, then the I/O server locks its shadow stream descriptor, and finally the I/O server locks its object descriptor. These locks are held so the object descriptors can be updated to reflect the departure of the stream from the source client. During the callback the I/O server locks its shadow stream descriptor and then the client attempts to lock its stream descriptor and its object descriptor. The shadow stream is locked during the callback to serialize this migration with other migrations and closes of other stream references. There is apparent circularity in lock dependencies here, but the following two scenarios show that deadlock cannot occur.

The first case concerns the CLOSE and MIGRATE of references to the same stream. Deadlock cannot occur because the CLOSE only goes through to the I/O server when the last reference is being removed from the stream descriptor. If there were two references at the source client, for example, and one was migrating at about the same time as the other was being closed, then either of the following two situations occurs. If the CLOSE occurs before the callback from the server, then it will see a reference count of 2 at the source client, and it will not contact the I/O server. When the CLOSE completes, the callback can lock the client's stream descriptor, discover that the last reference has migrated away, and destroy the stream descriptor. If the callback occurs first, it will decrement the reference count at the source client and unlock the stream there. The CLOSE will then go through to the I/O server and be blocked on the locked stream descriptor there. However, the callback has been made, so when the MIGRATE completes, the CLOSE operation will proceed.

The second case concerns the CLOSE and MIGRATE of different streams to the same object. Deadlock cannot occur because the I/O server leaves its object descriptor unlocked during the MIGRATE callback. The CLOSE of a different stream will lock the client's object descriptor and then attempt to lock the I/O server's object descriptor. If the I/O server locked its object descriptor and then attempted to lock the client's object descriptor during the MIGRATE callback, then deadlock could result. Thus, it is crucial that the I/O server's object descriptor is not locked during the MIGRATE callback so that a CLOSE of an unrelated stream will not hang the callback.

5.4. The Cost of Process Migration

There are a few important costs to the file system that stem from process migration. Additional complexity is required to maintain the shadow stream descriptors and the stream accounting during migration. Additional storage space is required on the I/O servers to store the shadow stream descriptors. Finally, there is some overhead on server operations because the client must be validated as a proper client of the shadow stream.

5.4.1. Additional Complexity

The additional complexity due to migration is difficult to measure accurately. One measure is in the additional code required to support it. Migration accounts for about 6% of the code (2617 lines) and 7% of the procedures (41) in the file system implementation. About half of the migration code is object-specific, and these procedures often re-use existing object-specific routines. The bulk of the complexity is in general routines that handle the synchronization problems, the shared access position, etc.

Size is not necessarily an accurate measure of complexity, however. Migration was a difficult piece of the file system to implement correctly because it interacted with many other areas of the file system. The bookkeeping that supports caching had to be properly maintained, and the locking of data structures to serialize operations had to be done correctly during migration, too. The initial implementations contained deadlocks, which stemmed from a conservative approach to locking data structures. In a highly concurrent environment, such as that on the file server, it is tempting to hold locks on multiple data structures to be sure that state changes are properly coordinated. Furthermore, because migration changes state on different machines, locks are held on multiple machines simultaneously. There were a number of attempts at the migration algorithm that seemingly worked, only to deadlock days later after many thousands of successful migrations.

5.4.2. Storage Overhead

The shadow stream descriptors can occupy a significant amount of memory on the file servers for two reasons. First, file servers always create shadow stream descriptors, whether or not migration will occur, in order to simplify the implementation. However, the file servers can have several thousand shadow stream descriptors during periods of heavy activity. A Sprite client workstation with an average window system setup has around 200 streams. About 70 of these streams are purely local and are associated with the window system and TCP/IP. However, there are around 130 streams that have shadow descriptors on a file server. The number of streams can increase considerably during large jobs, and our file servers have peaked at nearly 8000 shadow stream descriptors.

The second reason the stream descriptors occupy a lot of space is because they are rather large, about 100 bytes each. The size of the stream descriptor is impacted by two implementation issues, both of which could be optimized in the case of shadow stream descriptors. First, stream descriptors are implemented as a standard object descriptor so they can have a descriptor ID and be manipulated with the standard utility routines to add them to a hash table, fetch them, lock and unlock them. The header information for a standard descriptor is 40 bytes, including a 16 byte descriptor ID. Secondly, there is some naming information associated with a stream that is only needed in certain circumstances, and it is not needed on a shadow stream descriptor. This is another 40 bytes or so. All that is really needed on a shadow stream descriptor is an ID, a reference count, some flags, a client list header (head and tail pointers), and a pointer to the object descriptor. The ID could be implemented as a pair of identifiers to ensure uniqueness, so the shadow stream descriptor could be trimmed down to 28 bytes assuming 4-byte integers and pointers.

The storage associated with the shadow stream descriptors is a pragmatic issue, but one that has to be addressed as the system scales up to support a larger network. In our current network of around 30 clients, peak values of 8000 streams indicate that a system with 500 clients could have well over 100,000 streams requiring 10 Megabytes of memory for the shadow stream descriptors alone. At this scale there is more motivation to trim down the storage associated with stream descriptors. The best optimization would be to delay creation of the shadow stream until migration occurs.
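A C rendering of that trimmed-down descriptor, with invented field names, shows where the 28 bytes come from (seven 4-byte fields on the 32-bit machines of the time):

    #include <stdio.h>

    /* A trimmed shadow stream descriptor: a two-part ID, a reference
     * count, flags, a client-list header, and an object pointer.  The
     * field names are invented; only the field sizes matter here. */
    struct TrimmedShadowStream {
        int   idHost;        /* 4 bytes: one half of the unique ID       */
        int   idSerial;      /* 4 bytes: the other half of the unique ID */
        int   refCount;      /* 4 bytes                                  */
        int   flags;         /* 4 bytes                                  */
        void *clientHead;    /* 4 bytes with 32-bit pointers             */
        void *clientTail;    /* 4 bytes                                  */
        void *objectDesc;    /* 4 bytes                                  */
    };

    int main(void)
    {
        /* Prints 28 on a 32-bit machine; 64-bit pointers make it larger. */
        printf("%zu bytes\n", sizeof(struct TrimmedShadowStream));
        return 0;
    }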

5.4.3. Processing Overhead

The shadow stream descriptor also adds overhead when servicing clients' RPC requests. Clients identify both the stream and the underlying object in RPC requests, and the server is careful to verify that the client is on the client list of both the shadow stream descriptor and the object descriptor. This checking is done to ensure that the client's stream data structures match the server's because a diabolical failure could cause the client and server to become out-of-sync. The server rejects the client's request if the server's state does not agree with the client's, which forces the client to take the recovery actions described in Chapter 6 to resolve any inconsistencies. The server also has to check if the offset in its shadow stream descriptor should override the offset provided by the client. These actions add a small amount of overhead on each remote I/O operation, whether or not migration is being used.
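The per-request check can be sketched as follows; the descriptor layout and the error convention (returning a nonzero ``stale handle'' status that triggers the client's recovery) are assumptions made for the example.

    #include <stdio.h>

    #define MAX_CLIENTS 8

    struct Descriptor {
        int clients[MAX_CLIENTS];   /* client list: hosts known to use this descriptor */
        int nClients;
    };

    static int OnClientList(const struct Descriptor *d, int clientId)
    {
        for (int i = 0; i < d->nClients; i++)
            if (d->clients[i] == clientId)
                return 1;
        return 0;
    }

    /* Returns 0 if the I/O may proceed, or nonzero to reject the request,
     * which forces the client into the recovery actions of Chapter 6. */
    static int ValidateRequest(const struct Descriptor *shadowStream,
                               const struct Descriptor *object, int clientId)
    {
        if (!OnClientList(shadowStream, clientId))
            return -1;
        if (!OnClientList(object, clientId))
            return -1;
        return 0;
    }

    int main(void)
    {
        struct Descriptor stream = { { 1, 2 }, 2 };
        struct Descriptor object = { { 1 }, 1 };
        printf("client 2: %s\n", ValidateRequest(&stream, &object, 2) ? "rejected" : "ok");
        return 0;
    }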

5.5. Conclusions

This chapter described the use of shadow stream descriptors to solve the problem of preserving the semantics of shared I/O streams during process migration. The shadow stream mechanism parallels the solution to the shared file caching problem. If a stream is shared among processes on different hosts, the stream descriptors on those hosts are bypassed and the server's shadow stream descriptor is used. Similarly, if a file is write-shared among processes on different hosts then their caches are bypassed and the server's cache is used. This technique of bypassing the local cache during sharing is a simplification; it eliminates the need for circulating ownership of the current access position (or file data) among clients.

This chapter also presented a deadlock-free algorithm for updating the file system's distributed data structures during migration. The procedure is complicated by concurrent operations such as closes and other migrations that also affect the file system's distributed state. The I/O server coordinates updates to the data structures on both clients involved in migration, and it locks its own data structures carefully to avoid deadlock while still providing the needed synchronization with concurrent operations.

CHAPTER 6

Recovering Distributed File System State

6.1. Introduction This chapter considers the effects that host failures have on the distributed ®le sys- tem. There may be many clients and servers in the system, and it is not uncommon for machines to be rebooted for routine reasons, or for outright failures to occur. This chapter describes a new recovery system that relies on redundant state in the network and cooperation between clients and servers. Clients can continue some operations when servers are down, and they recover automatically after servers restart. For example, users can continue to use interactive applications such as editors while a server is down, perhaps for a routine reboot. Most of the processes already running on their workstation, including the window system and the editor, can continue execution. A background compilation, however, might be blocked because it needs to access a ®le on the server. It will be continued automatically after the system recovers. This chapter describes the mechanisms built into the ®le system that provide this kind of fault tolerance. The recovery problem is complicated by the ``stateful'' nature of the Sprite ®le servers. The servers keep state about I/O streams to their ®les and about how clients are caching ®le data [Nelson88a]. A server's state has to be maintainted across crashes and reboots. Otherwise, open I/O streams to the server would be destroyed by the server's loss of state. The RFS [Back87] and the MASSCOMP remote ®le systems [Atlas86] handle server failures in this way. This behavior is not acceptable in our environment of diskless workstations. The crash of an important server could wipe out most of the active I/O streams in the system and essentially require a restart of all workstations. A standard approach to preserving the server's state across crashes is to log it to stable storage. Logging is used in MFS, Burrows' stateful ®le system[Burrows88]. However, logging slows down the OPEN and CLOSE operations because of the addi- tional disk I/O's required. Instead, this Chapter describes a different approach to recovery that relies on redundant state on the clients instead of logging: Server state is organized on a per-client basis, and each client keeps a duplicate version of the server's state about the client. This state can be kept in main-memory on the server and the client in order to avoid the costs associated with disk accesses. The server can clean up the state associated with a client if the client crashes, and the client can help the server rebuild its state if the server crashes. The state that is protected in this way in the Sprite ®le system are the I/O stream data structures, ®le lock information, and dirty cache blocks. The client's object 83

The dirty cache blocks are retained on the client until the server has written them safely to disk. (Actually, a Sprite file server implements a number of writing policies that trade off performance for reliability. This is discussed below in Section 6.4.) Keeping the extra state on the client adds little processing overhead during normal operations and requires no additional disk operations.

There are two additional problems related to error recovery that are considered in this chapter. The first is detecting the failure of other hosts, which is done by monitoring the RPC communication protocol. The system recognizes two failure events, crashes and reboots. Client crashes trigger clean-up actions by servers, while server reboots trigger recovery actions by clients. Measurements presented below indicate that there is very little overhead from failure detection.

The second problem concerns operation retry after server recovery. Operations that access a failed server are blocked and then retried automatically after server recovery. However, users are given the option of aborting operations instead of waiting for server recovery. The recovery wait is structured in the same way that blocking I/O is structured, at a high level, outside object-specific code and with no resources locked. The recovery wait can be aborted easily with this structure. This approach provides flexibility in handling errors so that users are not necessarily stuck when they attempt to access a failed server.

The remainder of this chapter is organized as follows. Section 6.2 contrasts the stateless and stateful server models. Section 6.3 describes the state recovery protocol. Section 6.4 discusses the interactions of delayed-write caching and error recovery. Section 6.5 discusses operation retry after recovery completes. Section 6.6 describes the crash detection system that triggers state recovery. Section 6.7 describes our experiences with the recovery system. Section 6.8 concludes the chapter.

6.2. Stateful vs. Stateless

The difficulty of recovering state after failures has led to the notion of a ``stateless'' server, one that can service a request with no state (or history) of previous requests.14 What this means in practice is that the server logs all essential state to disk so the state will not be lost in crashes. The advantage of statelessness is that failure recovery is merely a matter of restarting the server, and clients do not have to take any special recovery actions. WFS [Swinehart79] and NFS [Sandberg85] are examples of stateless file servers. In these systems, a client can simply retry an operation until it gets a response from the server. The operation retry is handled inside the communication protocol, and a server failure is manifested only as an especially long service call. Eventually, the server reboots itself and, by its stateless nature, can service any client request (e.g., READ or WRITE). In contrast, if a stateful Sprite file server crashes, it cannot service a READ or a WRITE until it has reestablished some state about the client, the state that was established during an OPEN. An additional recovery protocol is required to return the Sprite server's state to the point before it crashed.

14 Neither ``stateful'' nor ``stateless'' is in my dictionary. Even if they were defined with their assumed meanings, with state and without state, they would not be completely accurate; all servers keep state of some sort. However, ``stateful'' implies that some essential state may be lost in a server crash.

While a stateless server makes recovery easier, it suffers a performance hit because it has to write all important state changes to disk. The performance disadvantage is demonstrated by comparing the NFS and Sprite file systems. Both systems use main-memory caches to optimize file I/O, but the stateless model of NFS limits the effectiveness of its caches. NFS uses write-through caching, while Sprite uses delayed-write (or write-back) caching. When an NFS file is closed, the client must wait for all its dirty blocks to be written to the server's disk. The server can forget about the WRITE and remain stateless. With Sprite's delayed-write approach, data is allowed to age in the cache without being immediately written through to the server. Results from Chapter 8 indicate that about half the data is never written out to the servers. Instead, it is deleted or overwritten before it ages out of the cache. Head-to-head comparisons with NFS in [Nelson88a] show that Sprite is 30%-40% faster than NFS, mainly because of these different writing policies. New results from faster machines (10 MIPS as opposed to 2 MIPS) indicate that Sprite may be as much as 50-100% faster in the remote case on the new fast workstations available today [Ousterhout89b]. Network and disk speeds are not increasing as fast as CPU speeds. Sprite's caching mechanism, which eliminates network and disk accesses, tends to scale performance with CPU speed, while the NFS protocol is limited by disk and, to a lesser degree, network speeds.

Hardware such as RAM disks with battery-backed memory can be used to improve the performance of a stateless file server. The state can be kept in the non-volatile memory, which is treated as a very fast disk. Addition of special hardware, however, is an expensive solution, and it does not address the issue of maintaining perfect consistency of the client caches, which is the goal of the Sprite file servers' state.
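The practical difference between the two writing policies shows up in the close path. The sketch below contrasts a write-through style close with a delayed-write style close; the function names are illustrative placeholders, not the NFS or Sprite implementations.

    /* Placeholders for the underlying client cache operations. */
    void WriteBlockToServer(int fileID, int blockNum);
    int  DirtyBlocks(int fileID, int *blockNums, int max);

    /* Write-through (stateless) style: the close blocks until every dirty
     * block has been pushed to the server, so the server need not remember
     * anything about the client afterwards. */
    void
    WriteThroughClose(int fileID)
    {
        int blocks[1024];
        int n = DirtyBlocks(fileID, blocks, 1024);
        for (int i = 0; i < n; i++) {
            WriteBlockToServer(fileID, blocks[i]);
        }
    }

    /* Delayed-write (stateful) style: the close returns immediately and
     * dirty blocks age in the client cache; many are deleted or
     * overwritten before the delay expires and are never written back. */
    void
    DelayedWriteClose(int fileID)
    {
        (void) fileID;   /* nothing to do here; a background write-back
                          * process handles blocks that age in the cache */
    }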

6.3. State Recovery Protocol

The following sub-sections describe how the system's state is organized to allow recovery and the steps taken during the recovery protocol. Recovery actions are triggered by events from the kernel's failure detection system, which is described in detail in Section 6.6. The failure detection system generates two events, crash events and reboot events. Crash events trigger clean-up actions by servers, while reboot events trigger recovery actions by clients.

6.3.1. Recovery Based on Redundant State

Sprite's approach to recovering state is to duplicate the server's state about a client on that client, and then rebuild the server's state from the copies on its clients. To support this approach, the client and the server maintain parallel state information: the servers keep per-client stream information, and the clients keep state about their own streams. There is very little added to the I/O stream data structures described in Chapter 3 to support this approach. The server's state is already organized into client lists to support the cache consistency algorithm. All that is required is for the client to duplicate the accounting information kept in the server's client list entry. What would otherwise be a simple reference count on the client's REMOTE object descriptor has to be a full-fledged set of usage counters, just like the ones the server keeps. In the discussion below the usage is simplified to be a reader count and a writer count, but the system also records the number of streams from active code segments, and information about user-level file locks. The addition to the client's REMOTE descriptors can be stated as another invariant on the file system state. An example is given in Figure 6-1.

The usage counts in a client's REMOTE object descriptor summarize the number of streams that reference the object descriptor. The usage count does not reflect stream sharing among processes. For example, even though stream #359 is shared by two processes, it only counts as one stream in terms of object use. The client's usage counts are used to recover the server's client list. The stream access position is also vulnerable to a server crash while the stream is shared by processes on different hosts. In this case, the stream access position is cached in the server's shadow stream descriptor, and the access position in the client's stream descriptor is not used. However, the clients have to keep backup copies of the stream access position if the shadow stream descriptor is to be fully recoverable. The server has to return the current stream access position to the client after each I/O operation so that if the server crashes, the access position is not lost. A server-generated time stamp also has to be returned so the server can determine which access position is most up-to-date upon recovery.15
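To make the parallel bookkeeping concrete, the sketch below shows how a client-side REMOTE object descriptor might mirror the accounting kept in the server's client list entry. The structure and field names are assumptions chosen for illustration, not the actual Sprite kernel declarations; they simply capture the reader/writer counts, lock information, and cached access position described above.

    /*
     * Hypothetical client-side summary of the state the I/O server keeps
     * about this client.  A minimal sketch; the real Sprite descriptors
     * contain additional fields.
     */
    typedef struct ClientRecovState {
        int objectID;          /* identifies the remote object (e.g. <4,0>) */
        int numReaders;        /* streams open for reading from this client */
        int numWriters;        /* streams open for writing from this client */
        int numExecutors;      /* streams from active code segments */
        int lockCount;         /* user-level file locks held by this client */
        long accessPosition;   /* backup copy of a shared stream's offset */
        long positionTime;     /* server-generated time stamp for the offset */
    } ClientRecovState;

    /* After each I/O the client records the offset returned by the server,
     * keeping only the most recent value as judged by the server's clock. */
    static void
    UpdateAccessPosition(ClientRecovState *statePtr, long offset, long serverTime)
    {
        if (serverTime >= statePtr->positionTime) {
            statePtr->accessPosition = offset;
            statePtr->positionTime = serverTime;
        }
    }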

6.3.2. Server State Recovery

There are two parts to maintaining the servers' state during crashes. The first concerns cleaning it up after a client crashes, and the second concerns repairing a server's state after it reboots or a network partition ends. Cleaning up after a dead client is relatively straight-forward; in response to a crash event the server examines its client lists, closes any open I/O streams, and releases any file locks that were held by the crashed client. This cleanup is implemented by calling an object-specific CLIENT_KILL procedure on each object descriptor so the cleanup actions can be tailored to the specific needs of files, devices, and pseudo-devices. The CLIENT_KILL procedure is an addition to the I/O interface described in Chapter 3.

15 This aspect of recovery, however, is not currently implemented. If the server crashes when a stream is shared across machines, then it will lose the shared access position. Most shared streams are for a process's current working directory, which does not depend on the access position. However, this bug could cause errors and it ought to be fixed.

[Figure 6-1. REMOTE object descriptors on clients are augmented to keep usage counts that parallel the usage counts kept in the client list of the server's LOCAL object descriptor. The figure shows processes on clients A and B sharing streams to REMOTE_DEVICE <4,0>, whose LOCAL_DEVICE descriptor and client list reside on I/O server C.]

Recovery of a server's state is done with an idempotent recovery protocol. An idempotent operation can be invoked many times and have the same effect as invoking it one time. In this case, idempotency means that the recovery protocol always tries to reconcile the state on a client with the existing server state. Idempotency is important because network failures can (and will!) lead to unexpected inconsistencies between the client's and the server's state. In particular, the operations in progress at the moment of a network failure may or may not complete on the server, so the client cannot be sure of the server's state after the network partition ends. An idempotent recovery protocol means that the clients can be conservative and invoke the recovery protocol whenever they think the server's state might not be consistent with their own. Clients initiate recovery in response to a reboot event or when they get a ``stale descriptor'' error from a server. The ``stale descriptor'' error indicates that the server does not have the client registered on the client list of its object descriptor. This situation can arise if the server has invoked the CLIENT_KILL procedure in response to a crash event generated during a network partition. (As described below in Section 6.6, the failure detection system cannot distinguish between host crashes and network partitions.)
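The per-client cleanup described above can be sketched as follows: when the host monitor reports a client crash, the server walks its object descriptors and invokes an object-specific kill routine on any descriptor whose client list mentions the dead client. The structure and routine names are assumptions for illustration; the actual Sprite kernel code is organized differently.

    /* Hypothetical per-client entry in an object descriptor's client list. */
    typedef struct ClientEntry {
        int clientID;
        struct ClientEntry *next;
    } ClientEntry;

    static ClientEntry *
    FindClientEntry(ClientEntry *list, int clientID)
    {
        for (ClientEntry *e = list; e != NULL; e = e->next) {
            if (e->clientID == clientID) {
                return e;
            }
        }
        return NULL;
    }

    /* Hypothetical object descriptor with an object-specific cleanup hook
     * standing in for the CLIENT_KILL procedure. */
    typedef struct ObjectDesc {
        ClientEntry *clientList;
        void (*clientKill)(struct ObjectDesc *descPtr, int clientID);
        struct ObjectDesc *next;
    } ObjectDesc;

    /*
     * Called from the crash callback registered with the host monitor.
     * Streams are closed and locks released by delegating to the
     * object-specific routine, so files, devices, and pseudo-devices can
     * each clean up in their own way.
     */
    static void
    CleanupAfterClient(ObjectDesc *descList, int crashedClientID)
    {
        for (ObjectDesc *descPtr = descList; descPtr != NULL;
                descPtr = descPtr->next) {
            if (FindClientEntry(descPtr->clientList, crashedClientID) != NULL) {
                descPtr->clientKill(descPtr, crashedClientID);
            }
        }
    }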

6.3.2.1. The Initial Approach

In the current implementation, clients perform recovery by invoking an object-specific REOPEN procedure on each of the client's object descriptors that were opened from the server that rebooted. REOPEN is an addition to the I/O interface described in Chapter 3. The object-specific implementation of the REOPEN procedure is similar for all cases (i.e., files, devices, pseudo-devices); the summary use counts in the client's REMOTE object descriptor are sent to the server via RPC, along with any object-specific information required by the I/O server. At the I/O server a complementary REOPEN procedure is invoked to determine if the client's use of the object is ok. The server's REOPEN procedure compares the client's usage counts with the ones it has in its client list. The server may have to close streams, open streams, or do nothing at all in order to bring its object descriptor up-to-date with the client's. Opening and closing streams involve object-specific actions. The actions associated with cached files are described below, in Section 6.3.2.3.

The server's state also includes the shadow stream descriptors that were introduced in Chapter 5 to keep a shared stream's current access position. Currently the client makes a pass through its stream descriptors after it has reopened all of the underlying object descriptors. During this phase the server verifies that it has a corresponding shadow stream descriptor.16 Errors during this phase cause the client's stream to be closed and the server's client list to be adjusted accordingly. Errors can arise if the server accidentally reuses a stream ID for a different shadow stream. While this case is rare, the current implementation is conservative and checks for this conflict anyway.

Note that this two-phase approach (object descriptor recovery, then stream descriptor recovery) requires the client to block OPENs and CLOSEs during the recovery protocol. This ensures that the two phases of its recovery protocol give the server the same view of the client's state. As described below, this restriction could be eliminated.

The main problem with this recovery protocol is that it places more load on the servers during the recovery protocol than necessary. There are two contributions to the extra load. First, clients can have a large number of unused object descriptors because of the data cache. The descriptors contain state about the cache blocks of a file, and the descriptors are cached along with the data. These descriptors are reopened as part of the recovery protocol, even though it would be possible to revalidate them the next time they are opened. Second, there is some redundancy in this recovery protocol. First, the client reopens the object descriptors, which contain counts of I/O streams that reference these descriptors. Second, the client reopens its stream descriptors to ensure that the server has corresponding shadow stream descriptors. Thus, the server hears about a stream twice, once when the object descriptor is reopened and once when the shadow stream descriptor is reopened. The result of these inefficiencies is that our servers often see tens of thousands of REOPEN requests in a short period of time after they reboot.

16 It is also during this phase that the server should recover the shared access position in the shadow stream.
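A rough sketch of the first phase of this recovery pass is shown below. The RPC stub, descriptor fields, and status codes are hypothetical names chosen for illustration; the point is only the shape of the exchange: the client ships its summary use counts, and the server reconciles them against its client list.

    /* Hypothetical reply codes for the reopen RPC. */
    typedef enum { REOPEN_OK, REOPEN_STALE, REOPEN_CONFLICT } ReopenStatus;

    /* Hypothetical client-side descriptor for an object opened from a server. */
    typedef struct RemoteDesc {
        int serverID;
        int objectID;
        int numReaders;
        int numWriters;
        struct RemoteDesc *next;
    } RemoteDesc;

    /* Placeholders for the object-specific REOPEN stub and for marking a
     * descriptor whose reopen was refused by the server. */
    ReopenStatus SendReopenRpc(int serverID, int objectID, int readers, int writers);
    void MarkDescriptorInvalid(RemoteDesc *descPtr);

    /*
     * Reopen every object descriptor that was opened from the server that
     * rebooted, shipping the client's summary use counts so the server can
     * rebuild or reconcile its client list entry.
     */
    void
    ReopenObjectDescriptors(RemoteDesc *descList, int rebootedServerID)
    {
        for (RemoteDesc *descPtr = descList; descPtr != NULL;
                descPtr = descPtr->next) {
            if (descPtr->serverID != rebootedServerID) {
                continue;                /* belongs to a different server */
            }
            if (SendReopenRpc(descPtr->serverID, descPtr->objectID,
                    descPtr->numReaders, descPtr->numWriters) != REOPEN_OK) {
                MarkDescriptorInvalid(descPtr);  /* streams on it will fail */
            }
        }
    }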

6.3.2.2. Improvements

The existence of shadow streams suggests an alternative structure for the recovery protocol, although this alternative has not been implemented. The client could perform one REOPEN for each of its stream descriptors first and eliminate the need for a REOPEN of each of its object descriptors. The stream REOPEN operation would identify the object descriptor that it references and indicate the usage of the stream. Idempotency would be achieved by checking for the existence of a shadow stream descriptor on the server. If one exists then the server already has state about the stream and it would ignore the REOPEN. Otherwise the server would open a new stream and update its client list accordingly. This approach eliminates both problems described above; cached but unused object descriptors would not be reopened because no stream would reference them, and servers would only hear about each stream once. Additionally, this approach eliminates the need to block out OPEN and CLOSE operations during the recovery protocol. The main drawback of this approach is that it depends on the existence of the shadow stream descriptors. As noted in Chapter 5, there can be a large number of shadow stream descriptors if there is one for each client stream instead of just one for each client stream that is shared due to migration. However, Sprite already relies on large main memories for its data cache [Nelson88a], so it is not unreasonable to assume there will be enough memory for the server's shadow stream descriptors.
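The idempotency check at the heart of this proposed stream-based reopen might look like the following sketch. The shadow-stream record and helper routines are assumptions made for illustration; they are not the Sprite kernel's actual data structures.

    /* Hypothetical shadow stream record kept by the I/O server. */
    typedef struct ShadowStream {
        int streamID;
        int clientID;
        int usageFlags;              /* read/write usage reported by client */
        long accessPosition;
        struct ShadowStream *next;
    } ShadowStream;

    /* Placeholder lookup/creation routines for the server's stream table. */
    ShadowStream *LookupShadowStream(int streamID);
    ShadowStream *CreateShadowStream(int streamID, int clientID, int usage);
    void AddClientListEntry(int objectID, int clientID, int usage);

    /*
     * Server side of a (hypothetical) stream-based REOPEN.  If the shadow
     * stream already exists the request is ignored, which makes the
     * protocol idempotent; otherwise the stream is recreated and the
     * object's client list is updated.
     */
    void
    ServerStreamReopen(int streamID, int clientID, int objectID, int usage)
    {
        if (LookupShadowStream(streamID) != NULL) {
            return;                      /* already known: idempotent */
        }
        (void) CreateShadowStream(streamID, clientID, usage);
        AddClientListEntry(objectID, clientID, usage);
    }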

6.3.2.3. Reopening Cached Files

When reopening regular files the client's RemoteFileReopen procedure passes to the file server the summary use counts of the file (i.e., how many readers and writers), lock information, the version number of the client's cached copy, and a flag that indicates whether or not the client has dirty data cached. At the file server, the LocalFileReopen procedure is invoked in response to the client's RPC. This procedure invokes the regular cache consistency protocol to ensure that the client sees the most up-to-date version of the file, and it updates the file's client list to agree with the client's reader and writer counts, unless a conflict is detected.

The cache consistency protocol detects conflicts that can arise after a network partition. If a client caches a file for writing and a network partition occurs, dirty data can be trapped in the client's cache. During the partition the server may get a crash event from the host monitor and close the I/O streams from the client. This action allows another client to open the file for writing and generate a conflicting version of the file. A difference between the recovering client's version number for the file and the server's version number indicates that the conflict has occurred. In this case, the original client's reopen of the remote file object descriptor fails. Another cause for conflict is that a file may get corrupted by a bad server crash. In this case, the server's disk scavenging program increments the file's version number, and this will prevent any clients from recovering I/O streams to that file.

The disk scavenger may also find files that are no longer referenced by any directory, although this does not cause a conflict or an error. This situation can arise because of the semantics of the UNIX REMOVE operation. It is possible to remove a directory entry for a file that has an open I/O stream to it, and the file is not deleted until the I/O stream is closed. If the server crashes or reboots with files in this state, then the disk scavenger does not delete them (it puts them into a ``lost+found'' directory) and it is still possible for clients to reopen I/O streams to these files. The most common way this case arises is when new versions of programs are installed. The previous version of the program image is removed by name, but it may linger on disk until all instances of the program have terminated.
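The conflict test described above reduces to a version-number comparison. The sketch below shows that check in isolation; the names are assumptions, and the real LocalFileReopen does considerably more (locking, cache consistency callbacks, client-list updates).

    typedef enum { FILE_REOPEN_OK, FILE_VERSION_CONFLICT } FileReopenStatus;

    /*
     * Minimal sketch of the version check made when a client reopens a
     * cached file after a server reboot or network partition.
     */
    FileReopenStatus
    CheckCachedVersion(int clientVersion, int serverVersion)
    {
        if (clientVersion != serverVersion) {
            /*
             * Either another client generated a newer version during a
             * partition, or the disk scavenger bumped the version after a
             * bad crash.  The reopen of the stream is refused.
             */
            return FILE_VERSION_CONFLICT;
        }
        return FILE_REOPEN_OK;
    }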

6.4. Delayed Writes and Failure Recovery

While the recovery system maintains the server's caching state across failures, the use of write-back caching does introduce some windows of vulnerability in Sprite. This section reviews these problems and describes the steps taken to mitigate them.

First and foremost, using delayed writes raises the possibility that a client failure can cause some recent data to be lost. The length of the delay trades off reliability for better performance. The longer the delay, the greater the chance that a write-back will be avoided because dirty data will be overwritten or deleted. Measurements in Chapter 8 indicate that as much as 50% of the data generated by Sprite clients does not get written back when using a 30-second delay policy. However, a longer delay increases the chance that data will be lost in a system failure. To mitigate this danger, Sprite provides a system call that explicitly forces a range of file blocks to disk and returns only after the data has been safely written out.

This call is used by our editors and source code control programs. The Sprite kernel also attempts to write back the cache upon a system error. This write-back cannot be counted on completely, however, because a bug in a critical part of the cache module could foil it. In the worst case (e.g., a power failure on the client), a user loses 30 seconds of his work. This is the same guarantee made by a stand-alone UNIX system, and we view this as an acceptable trade-off against the higher performance we obtain during normal use.

A server failure does not necessarily cause loss of data, although the risk is dependent on the writing policy used by the client and the server. To prevent loss of data, the client must retain dirty cache blocks until they are safely on the server's disk. If the server crashes, the data is either on disk or still in the client's cache. With this write-back policy, the server must write each block to disk as it arrives from the client. However, this writing policy can require as many as 3 disk I/Os for each block: one for the data, one for an index block, and one for the file descriptor.

In practice we use a more efficient writing policy that has a small window of vulnerability [Nelson88b]. Servers wait until all the blocks for a file arrive from a client before writing them to disk.17 This amortizes the writes of the index blocks and the file descriptor over the cost of writing the whole file. Unfortunately, in the current implementation the client is unaware of the server's writing policy. A client assumes that when a WRITE RPC completes it is safe to discard the cache block. Thus, it is possible to lose data that has been written from the client cache to the server cache but has not yet made it to the server's disk.

The best solution to this problem would be to change the cache write-back mechanism from a block-oriented to a file-oriented system. A client could retain a whole file, or a large portion of a big file, until the server has it safely on its disk. This write-back policy would allow amortization of the disk I/Os for index blocks and descriptors, while still allowing a client to age blocks in its cache before writing them out. A file-oriented write-back policy would also work well with the log-structured file system currently being developed for Sprite [Ousterhout89a].

Another improvement to the current implementation would be the addition of atomic file replacement on the server. With atomic replacement, a new version of the file would be committed only after all the blocks for it have arrived from a client and been written safely to disk. A single disk I/O could be used to commit the version, typically by writing a new version of the file descriptor. LOCUS implements atomic replacement in this way. Atomic replacement means that if the server crashes with some of a file's blocks still dirty in its cache, the file will revert to the previous version after the server recovers. Atomic replacement also means that if a client crashes after writing back only part of a file, then the server can retreat to the previous version.

One problem with atomic write schemes concerns very large files: extra disk space is needed until the new file version is committed. Another problem is that atomic replacement is not appropriate for database-like files that are updated by changing individual records within the file.

The final problem introduced by delayed writes stems from network partitions. As described above in Section 6.3.2.3, a partition can result in clients generating conflicting versions of a file. A conflict is detected by the server's cache consistency algorithm when the original client tries to recover the server's state. The original client's version number will be out-of-date, so its recovery attempt for the file will be denied by the server. Currently there is no way for the client to save its conflicting version.18 This conflict is rare, however. Data presented in Chapter 8 shows that most of the write-shared files are not cached, so partitions are not a problem, and those files that are cached are not shared, so the client can recover successfully.

17 Our servers use the write-through-on-last-dirty-block policy described in [Nelson88b]. The servers can also be configured to use write-through for best reliability or full delay for best performance.
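As a concrete illustration of the safer write-back policy described above (retaining dirty blocks until the server confirms they are on disk), consider the sketch below. This is not the current Sprite implementation, which discards a block as soon as the WRITE RPC completes; the structure and function names are assumptions.

    #include <stdbool.h>

    /* Hypothetical client cache block states for a safer write-back policy. */
    typedef enum {
        BLOCK_CLEAN,                /* contents match the server's disk      */
        BLOCK_DIRTY,                /* modified locally, not yet written back */
        BLOCK_WRITTEN_UNCOMMITTED   /* sent to the server, not yet on disk   */
    } BlockState;

    typedef struct CacheBlock {
        int fileID;
        int blockNum;
        BlockState state;
    } CacheBlock;

    /* Called after the WRITE RPC completes: the server has the data in its
     * cache, but the block cannot be reclaimed until the server commits it. */
    void
    WriteRpcCompleted(CacheBlock *blockPtr)
    {
        blockPtr->state = BLOCK_WRITTEN_UNCOMMITTED;
    }

    /* Called when the server reports (e.g., in a later RPC reply) that the
     * block has reached the disk.  Only then may the block be reclaimed. */
    void
    ServerCommittedBlock(CacheBlock *blockPtr)
    {
        blockPtr->state = BLOCK_CLEAN;
    }

    /* A block may be evicted from the client cache only when it is clean. */
    bool
    BlockMayBeEvicted(const CacheBlock *blockPtr)
    {
        return blockPtr->state == BLOCK_CLEAN;
    }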

6.5. Operation Retry

In order to make error recovery transparent to applications, operations that fail due to a server failure should be retried automatically after the state recovery protocol completes. However, there are some cases in which it is better to abort an operation than to wait for recovery. Typically these situations arise because users do not want to wait for recovery; they would rather abort what they are doing and try to get some work done with a different server. There are also status-related operations for which an immediate error reply is more appropriate than waiting for the server to come back on-line. Thus, operation retry should be automatic by default, yet it should also be possible to abort operations that are waiting for recovery.

One approach to operation retry is to build it into the communication protocol. While the Sprite RPC system will resend a packet several times if needed, it eventually gives up and returns a timeout error. The NFS file system, in contrast, uses an RPC protocol with an indefinite retry period [RPC85]. The SunOS RPC protocol used with NFS simply retransmits a request until the server responds, even if the client must wait minutes or hours for the server to come back on-line.19 This approach relies on the stateless nature of the NFS file servers. The NFS server can handle any RPC request after it boots because it does not keep any open file state. However, retrying an operation doggedly until it succeeds may not be what the user desires. Instead, the user may prefer to abort the operation and continue work with another available server. Unfortunately, the SunOS RPC protocol is uninterruptible, so a user can get stuck attempting an operation with a crashed NFS server.

18 The CODA file system [Satyanarayanan89] adds a special directory structure called a covolume as a place to store conflicting versions.

19 It is possible to use finite timeout periods with the SunOS RPC protocol. However, the failure semantics of NFS are not well defined, so in practice RPC connections to NFS servers do not use a timeout period.

In Sprite, operation retry is handled at a higher level. After an RPC timeout, control is returned to high-level procedures that are above the object-specific naming and I/O interfaces. At this level a process either waits for the completion of the recovery protocol and retries, or it aborts the operation. Operations are structured this way for two reasons. First, the semantics of the operation determine whether or not it should be retried at all. It may be more appropriate to communicate a failure status to the user than to force the user to wait for the server to recover. The operations that are not retried are described below. Second, this approach means that no object-specific resources are held locked while waiting for the recovery protocol. Thus, it is possible for a user to abort an operation that is waiting for recovery, just as a blocked I/O operation can be aborted. This high-level approach to operation retry gives the system (and users) more flexibility in the way failures are handled. There are three responses to the failure of a remote operation: retry it after server recovery, invoke special error handling, or reflect the error up to the application. These cases and the operations they apply to are discussed below.
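The overall shape of this high-level retry loop is sketched below. The routine and status names are placeholders, not the actual Sprite kernel interfaces; the point is that the wait for recovery happens above the object-specific layer, with no resources locked, so the user can interrupt it.

    typedef enum { OP_OK, OP_RPC_TIMEOUT, OP_ERROR, OP_ABORTED } OpStatus;

    /* Placeholders for the object-specific operation, the recovery wait,
     * and the check for a user interrupt (e.g., control-C). */
    OpStatus DoRemoteOperation(void *args);
    OpStatus WaitForServerRecovery(int serverID);   /* interruptible wait */
    int UserRequestedAbort(void);

    /*
     * High-level retry loop: the operation is attempted, and on an RPC
     * timeout the process waits (with nothing locked) for the recovery
     * protocol to complete before retrying.  The user may abort instead.
     */
    OpStatus
    RetryableOperation(int serverID, void *args)
    {
        for (;;) {
            OpStatus status = DoRemoteOperation(args);
            if (status != OP_RPC_TIMEOUT) {
                return status;               /* success or a real error */
            }
            if (UserRequestedAbort()) {
                return OP_ABORTED;
            }
            if (WaitForServerRecovery(serverID) == OP_ABORTED) {
                return OP_ABORTED;           /* user interrupted the wait */
            }
        }
    }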

6.5.1. Automatic Retry

The main category of operations consists of those that are automatically retried after recovery. They include all the naming operations (e.g., NAME_OPEN, REMOVE, REMOVE_DIR, MAKE_DIR, MAKE_DEVICE, HARD_LINK, RENAME, SYM_LINK, GET_ATTR and SET_ATTR) and most of the I/O operations (e.g., READ, WRITE, IOCTL). The retry loop for the naming operations is built into the iterative lookup procedure described in Chapter 4, and the retry loop for I/O operations is built into the retry loop associated with blocking I/O operations. In these cases, the additional complexity introduced by failure recovery was minimal.

The user is informed when an RPC timeout occurs, and has the option of aborting the current operation via a keyboard interrupt (control-C). Kernel error messages are routed to the system error log that is typically displayed in a window on the user's workstation. A user can then abort the operation that caused the failure, or the user can wait for the operation to be retried after recovery.

6.5.2. Special Handling

The second category of operations consists of those that are not retried directly, but instead cause some special error handling to occur. This action is appropriate for remote operations that are one part of a larger operation. Important examples include the recovery protocol itself, notifications from I/O servers, cache consistency callbacks, and the SELECT RPC that polls the status of a remote object. These cases are discussed below.

The simplest approach to handling an error during the recovery protocol is to abort it and retry the recovery protocol later. This approach assumes that recovery has failed because the server has crashed again. It is also possible that the heavy load from the recovery protocol has overloaded the server so much that it cannot service RPCs. This case has occurred because of deficiencies in the RPC implementation, as described below in Section 6.7. In either case, the idempotent nature of the recovery protocol means it is always okay for a client to abort the protocol and restart it later.

The WAKEUP RPC is not retried. This RPC is used by I/O servers to notify a process that is blocked on an I/O operation. If a WAKEUP fails it means either that the client has crashed, in which case the notification is unnecessary, or that there has been a network partition. In the latter case, the server can rely on the fact that the client will invoke the recovery protocol when the partition ends and then retry any I/O operations. The I/O operations are retried after recovery no matter what caused them to block in the first place, so there is no danger from lost wakeups.

The cache consistency callbacks from the file server to a client are not retried, so that a failed client does not hold up operations by different clients. An OPEN, for example, may trigger callbacks to other clients, and the failure of one client to respond should not prevent the OPEN from completing. Instead, the server invokes the CLIENT_KILL procedure to forcibly close I/O streams from the unresponsive client.

The SELECT operation is not retried because a process is generally polling several file system objects. In this case, waiting for recovery on one object would prevent the polling of other objects. Unfortunately, it is difficult to communicate a useful error message back to an application via the UNIX select interface. It would be best to return an error status from select and leave the bit set in the select mask that corresponds to the I/O stream that incurred the timeout. However, UNIX applications do not understand this convention; they usually assume an error from select is a disaster and they abort. In the current implementation the error is masked and select indicates that the offending stream is ready for I/O. This causes the application to read or write the stream and then block awaiting recovery. This approach is not perfect, however, and we may change select to mask the error altogether so that applications just ignore the stream.

6.5.3. No Retry

The third category of operations consists of those that are not retried and that cause an error to be returned to the user if the server is unavailable. This approach is appropriate for status-related commands such as the DOMAIN_INFO RPC that returns the amount of free space on a disk. Similarly, the attribute operations on open I/O streams are not retried. An application can query the attributes of its I/O streams to detect failures. There are other cases in which failed operations are not retried, although the absence of retry in these cases can arguably be considered a bug in the implementation. These are discussed below.

The IO_OPEN RPC is not retried. A remote device open will fail instead of being retried automatically. This particular case would be relatively straight-forward to fix if it became a problem.

The CLOSE RPC is not retried. A process can close an I/O stream after the server fails and not get hung up during the close. A good example is that changing a process's working directory is implemented by opening an I/O stream to the new directory and then closing the I/O stream to the previous directory. If the CLOSE operation blocked waiting for recovery, a process would not be able to leave the domain of a failed server. Note that this approach can introduce inconsistencies during a network partition (the server will still think the stream is open), although the recovery protocol will repair the damage.

The RPCs used during the migration of I/O streams are not retried. Failure of a host during migration can destroy an I/O stream. However, the process manager also aborts a process if a host fails during migration. Ideally the migration implementation could back out if the destination client fails, but this has not been implemented.

These examples highlight the fact that error recovery is not always clean and simple. It is tempting to simplify error handling by taking one of two approaches. One approach is to abort the current operation and reflect the error up to the application. This approach is not always appreciated by users. The second approach is to retry operations indefinitely in the low-level communication protocol. However, this approach is not flexible enough to handle more complex situations. In Sprite, the low-level communication mechanism will time out, but higher-level system software does its best to recover from the failure.

6.6. Crash and Reboot Detection

The recovery system depends on a low-level host monitor module that monitors network communication in order to detect host failures. There are two main issues involved with failure detection. The first is that failure detection is an uncertain task; it is not possible to distinguish between a host failure and a network failure. Both of these cases are manifested as an inability to communicate with the other host. The second issue is the cost of the failure detection system. Failure detection adds overhead because network traffic must be monitored, and additional network traffic is generated in order to verify that other hosts are still alive. The failure detection system is described below, and then measures of its cost are presented.

The host monitor reports two events: crashes and reboots. Crashes and reboots are distinguished because servers need to clean up after their clients crash, while clients need to initiate recovery actions after their servers reboot. The interface to the host monitor module is via callback procedures. Other kernel modules register procedures to be called in the event of a crash or a reboot (different procedures are registered for crashes and reboots).

The host monitor detects crashes by monitoring the RPC system for communication failures. The RPC system will resend a request message several times before giving up and reporting a timeout error. An RPC timeout indicates that the host is down or the network is partitioned, and this failure triggers the crash callbacks so that other kernel modules can clean up. The fact that network partitions and host failures are treated the same means that a server could close I/O streams associated with a partitioned client. After the partition, however, the client will invoke the recovery protocol in an attempt to reopen these streams. However, as described in Section 6.3.2.3, there is a possibility of conflict in recovery after a partition.

One way to detect reboots is to periodically poll (or ``ping'') another host at a high enough frequency that it cannot crash and reboot without causing a communication failure. However, the cost of this solution could be high. It requires clients to generate network traffic to the server even when they are otherwise idle. Sprite uses a different technique that relies on a generation number in the RPC protocol packet header. The generation number is set to the time a host boots up. The time is obtained from a battery-powered clock, or from a special RPC that broadcasts a request for the time. The host monitor examines every arriving packet to see if the peer host's generation number has changed, indicating a reboot. With this technique, it is only necessary to ping another host in order to bound the time it takes to detect a crash or reboot. The rate of pinging, and the hosts which are pinged, can be tuned to reduce the system load. For example, file servers do not care if a client crashes unless that client holds some resource. The normal callbacks associated with caching and file locking are sufficient to detect when a client has crashed while holding a resource. A timeout on a callback triggers the crash callbacks that clean up after the client. Thus, there is no need for servers to periodically check up on their clients. Clients, however, ping their servers in order to bound the time it takes to detect a reboot. This periodic check ensures that clients initiate the recovery protocol in a timely fashion.
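The generation-number test is simple enough to sketch directly. The per-host record and function names below are assumptions for illustration; the real host monitor keeps more state and synchronizes with a monitor lock.

    #include <time.h>

    /* Hypothetical per-host record kept by the host monitor. */
    typedef struct HostState {
        int hostID;
        long bootGeneration;    /* boot time carried in the RPC header */
        time_t lastMessage;     /* time of the most recent packet seen */
    } HostState;

    /* Placeholder: invoke the reboot callbacks registered by other modules. */
    void InvokeRebootCallbacks(int hostID);

    /*
     * Called for every arriving RPC packet.  A change in the peer's boot
     * generation number means the host rebooted since we last heard from
     * it, which triggers the reboot callbacks (and hence client recovery).
     */
    void
    PacketArrived(HostState *hostPtr, long packetGeneration)
    {
        hostPtr->lastMessage = time(NULL);
        if (packetGeneration != hostPtr->bootGeneration) {
            hostPtr->bootGeneration = packetGeneration;
            InvokeRebootCallbacks(hostPtr->hostID);
        }
    }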
The host monitor pings hosts for which reboot callbacks are registered, so other kernel modules need not do their own pinging. Reboot callbacks can be unregistered so that the host monitor does not have to continue pinging another host forever. As an example, the process manager registers a reboot callback on a remote host that is hosting a migrated process, and it cancels the callback after the process terminates. The host monitor also optimizes its pinging by suppressing a communication attempt if there has been recent message traffic from the other host.

The network-wide cost of pinging is proportional to R * C * S, where R is the rate of pinging, C is the number of clients, and S is the number of servers. This cost assumes the worst case where all clients are interested in all servers, and it ignores the effect of ping suppression due to recent message traffic. In contrast, a naive implementation in which all hosts actively monitor all other hosts would have a cost proportional to R * N * N, where N is the total number of hosts. C * S will be much smaller than N * N if S is smaller than C, which is true in a large network of diskless workstations.

Periodic checking, noting RPC timeouts, and noting changes in the RPC boot generation number are sufficient to differentiate two states for a host, ``alive'' and ``dead.'' A ``booting'' state is also needed in order to smooth recovery actions. Servers may go through a significant amount of setup and consistency checking before they are available for service. For example, our servers take about 5 minutes per disk to verify its directories and file maps. During this time a server may be talking to the outside world, but it may not be ready for client recovery actions yet. Without the intermediate booting state, servers appear to flip between being alive and dead as they come up and generate network traffic, but refuse to service requests. The RPC protocol supports the booting state by marking out-going requests with a ``non-active'' flag, and returning a ``non-active'' error code to any request. The host monitor module sees this error return and does not report a reboot until a host is fully alive.
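A sketch of the client-side ping schedule, including the suppression of pings when there has been recent traffic, is shown below. The interval constants and helper names are assumptions; the measured system described in the next section pings every 30 seconds and suppresses a ping if a message arrived within the last 10 seconds.

    #include <time.h>

    #define PING_INTERVAL   30      /* seconds between pings to a server   */
    #define RECENT_TRAFFIC  10      /* skip the ping if we heard from the
                                       host this recently (seconds)        */

    typedef struct PingTarget {
        int hostID;
        time_t lastMessage;         /* updated whenever a packet arrives   */
        time_t lastPing;
    } PingTarget;

    /* Placeholders: a null ECHO RPC used only to probe the host, and the
     * crash callbacks invoked when the probe times out. */
    int SendEchoRpc(int hostID);    /* returns 0 on success, -1 on timeout */
    void InvokeCrashCallbacks(int hostID);

    /* Invoked periodically (roughly once per PING_INTERVAL) for each host
     * that has a reboot callback registered against it. */
    void
    MaybePing(PingTarget *targetPtr)
    {
        time_t now = time(NULL);

        if (now - targetPtr->lastMessage < RECENT_TRAFFIC) {
            return;                 /* recent traffic: the host is alive   */
        }
        targetPtr->lastPing = now;
        if (SendEchoRpc(targetPtr->hostID) != 0) {
            InvokeCrashCallbacks(targetPtr->hostID);
        }
    }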

6.6.1. Measured Costs of Crash Detection

Monitoring RPC traffic adds some overhead to the RPC protocol. All incoming messages must be checked for a change in boot generation number. All timeouts also have to be noted. On Sun-3/75 workstations, this increases the average cost of a Sprite kernel-to-kernel echo RPC from 2.19 msec to 2.45 msec. This is a rather large overhead on a null call, about 10%, although it only adds about 3.6% to the cost of sending a 4 Kbyte block via RPC (7.1 msec vs. 7.36 msec). This overhead is the time required to acquire a monitor lock, hash to find the host's state, read the clock to record the time at which the current message arrived, and verify that the host has not changed state. The overhead is incurred twice per RPC, once when the server receives the request message and once when the client receives the reply message. This cost could probably be reduced by careful coding.

The background cost due to pinging other hosts is quite small. The cost was determined by measuring a network of Sun-3s with 2 servers and 15 clients during a period of no other system activity. CPU utilization was measured by counting trips through the kernel's idle loop and comparing this against an idle calibration done during the boot sequence. The clients ping their servers every 30 seconds with a null ECHO RPC. Each client host was found to consume a very small fraction of the server's CPU, about 0.0036%. No applications were executing on the clients, so the load was only due to the ping traffic from the clients. This load correlates well with the estimated load of a single RPC done every 30 seconds. If half the cost of an RPC, 2.45 msec, is charged to the server's CPU, then one RPC every 30 seconds consumes 0.001225 CPU seconds every 30 seconds, which is about 0.004% utilization. In reality, less than half of the elapsed time for an ECHO RPC is consumed by the server's CPU. There is some time spent on the network itself and in the network interface hardware. Working back from the measured CPU load, 0.0036% per ECHO, we can estimate that about 1 msec of the 2.45 msec is spent in the server's CPU. This time includes interrupt handling and a process switch to a kernel RPC server process.

Another way to measure the effect of pinging on the servers is to compare the number of echo RPCs serviced with the number of other RPCs. Figure 6-2 shows the number of RPCs serviced by three of the file servers over a 6-month period. The echo traffic is basically constant throughout the day because it is a function of the number of clients a server has. Overall, about 6% of the servers' RPC traffic is due to pings, but this traffic only accounts for 2% to 4% of the day-time traffic. Note that Mint has substantially less echo traffic than Allspice and Oregano, even though each of these servers is shared by all clients. The lesser echo traffic on Mint reflects an optimization that suppresses a ping if there has been other recent message traffic between the hosts. Clients ping every 30 seconds, and a ping is suppressed if there has been another message within 10 seconds. Suppression is effective on Mint because once a minute each client updates a host status database that is stored on Mint. However, the 10-second value for ``recent'' is too conservative for the suppression to be effective on the other servers. A value for ``recent'' that is greater than or equal to the ping interval would suppress many more pings while the server is up, yet it would still cause the clients to ping at the same rate (i.e., every 30 seconds) once the server crashed.
Thus, clients would still detect the server reboot relatively quickly, and the echo traffic could be reduced even further.

[Figure 6-2. Echo traffic on three file servers (Mint, Allspice, and Oregano) from July through December 22, 1989, plotted as RPCs per second by hour of day. The upper line for each server is its total RPC traffic; the three lines labeled ``Echo'' are the echo RPC traffic for each server.]

6.7. Experiences

The recovery mechanisms described here are real, and they are put to the test rather frequently in our development environment. Table 6-1 lists the number of reboots on a per-month basis for the file servers and three of the clients. The total number of reboots is shown, and then the number of different days on which reboots occurred is given in parentheses. These reboots occurred for routine kernel upgrades or because the system crashed. (Unfortunately these two cases are not distinguished in the reboot logs.) The clients are included to highlight the difference between a Sprite developer (Sage) and a normal Sprite user (Basil). Rebooting is common with developers as new kernels are tested. (Ideally a developer has two machines so he does not have to reboot his workstation, but this is not always the case.) Basil's user, on the other hand, only reboots his workstation after a crash or if we insist that its kernel be upgraded to fix some bug. Mace was used for kernel testing in June and July, and for regular work the rest of the time.

                                Reboots per Month

    Month   Mint      Oregano   Allspice  Assault   Basil     Sage      Mace
    Jan     25 (12)   21 (15)   -         -         12 (6)    56 (17)   17 (6)
    Feb     20 (12)   20 (12)   -         -         12 (11)   77 (17)   16 (14)
    Mar     35 (17)   38 (18)   -         -         8 (7)     43 (18)   17 (11)
    Apr     25 (12)   25 (14)   -         -         3 (3)     53 (10)   7 (5)
    May     45 (19)   25 (14)   -         -         10 (9)    22 (13)   20 (12)
    Jun     24 (12)   20 (13)   -         -         12 (7)    49 (10)   53 (15)
    Jul     9 (7)     14 (10)   7 (3)     0 (0)     1 (1)     80 (13)   61 (13)
    Aug     29 (12)   18 (14)   24 (12)   24 (5)    5 (5)     29 (18)   7 (5)
    Sep     18 (12)   8 (7)     15 (11)   30 (9)    3 (3)     27 (7)    7 (5)
    Oct     21 (14)   8 (6)     22 (13)   13 (10)   2 (2)     15 (8)    5 (4)
    Nov     31 (16)   11 (9)    15 (10)   12 (8)    4 (4)     11 (6)    7 (3)
    Dec     7 (4)     14 (9)    4 (3)     2 (2)     1 (1)     3 (3)     3 (2)

Table 6-1. Reboots per month for the file servers and three clients. The first number for each host is the total number of reboots, and the number in parentheses is the number of different days on which a reboot occurred. Sage is used by a Sprite developer and reboots frequently because of kernel testing. Basil's user, on the other hand, reboots as infrequently as possible.

The system's stability has been improving throughout the year, and there have been long periods without server failures. Table 6-2 gives the mean time between reboots (MTBF), the standard deviation, and the maximum uptime observed, for the file servers and three clients. This was computed from July 1989 through January 1990. While the MTBF of our servers is still only a few days, it is improving steadily as we continue to find and fix bugs. The long maximum uptimes suggest that the servers are basically stable, until we introduce new bugs as part of our research!

              Mean Time Between Reboots (Days)

    Host        Average    Std Dev    Max
    Mint        1.63       2.63       21.69
    Oregano     2.56       3.29       20.96
    Allspice    2.09       3.12       15.93
    Assault     1.92       4.12       22.90
    Basil       11.59      9.95       43.42
    Sage        1.27       4.03       41.40
    Mace        2.16       3.91       16.20

Table 6-2. The average time between reboots, its standard deviation, and the maximum observed uptimes. Values are in days. This was computed for the interval from July 1989 through January 1990.

Ironically, the state recovery protocol places such a high load on the servers that it has exposed several bugs that lurked undetected for years and only showed up after the system grew large enough. Races in data structure management routines were exposed, and bugs in the RPC protocol were also flushed out. Perhaps the most glaring problem is a case of starvation in the RPC channel allocation code. It is possible for a subset of the clients to occupy all of a server's RPC channels while they perform a series of RPCs to reopen their descriptors. During this time other clients experience timeout errors that cause them to invoke their recovery protocol again. This effect has been dubbed a ``recovery storm''. Clients often go through recovery 2 or 3 times after a server boots, and in one case recovery failed altogether because no client could complete its recovery protocol. This failure was due to a bug in the host monitor module. Ordinarily it ensures that only one process is doing the file system recovery at a time. A bug allowed multiple processes on the same client to do recovery simultaneously, and this increased the load on the server enough to cause a total failure.

We are currently investigating further refinements to the recovery protocol to reduce the load on the server. Mary Baker is studying the RPC system during recovery storms and will be tuning the RPC system in order to avoid this problem. The main defect in the RPC protocol is the lack of a negative acknowledgment for the case when the server has no RPC channels left for a new request. (Appendix C has a complete description of the RPC protocol.) Clients give up and report a timeout instead of backing off and retrying later. We can also refine which descriptors are reopened during recovery, as noted in Section 6.3.2.2. Finally, instead of a single RPC per REOPEN, the clients could batch their requests.

6.8. Conclusion

This chapter has presented a recovery system for stateful servers based on the simple principle of keeping redundant state on the clients and servers. The servers keep state in main memory instead of logging it to disk, and they rely on their clients to help them rebuild their state. This is done via an idempotent state recovery protocol that is initiated by the clients whenever they detect that the server's state might be inconsistent with their own. The server attempts to reconcile its state with that of the client, but the server may detect conflicts and force the client to change its state. In the case of the Sprite file system, a conflict might prevent an I/O stream from being successfully reopened, but these cases are rare in practice.

Overall the experience with the recovery system has been positive. Sprite has been running nearly continuously in our network for over two years, although there have been many individual host failures and a number of network partitions. Depending on what a user is doing, he may or may not notice a server reboot. Only if the server is needed for some resource will a user be delayed until after the recovery protocol. In many cases, too, the user has the option of aborting an operation that is waiting for recovery, so that he can continue with other work.

CHAPTER 7

Integrating User-level Services

7.1. Introduction

A common approach in modern operating systems is to move system services outside the operating system kernel. Debugging is easier because the service is implemented as an ordinary application and the standard debugging tools apply to it. The kernel remains smaller and more reliable. In addition, it is easier to experiment with new types of services at user level than by modifying the kernel. These advantages of user-level implementation of system services have been promoted before by designers of message-based kernels [Cheriton84] [Accetta86]. However, the main drawback of user-level services is the performance penalty that comes from message passing and context switching overhead. There is a conflict between having a high-performance, kernel-resident service and an easy-to-manage user-level service. In Sprite, this conflict is addressed by a hybrid approach that puts performance-critical services (e.g., the file system) inside the kernel and that provides a way to cleanly extend the system with additional user-level services.

In Sprite, user-level services appear as part of the file system; this permits them to take advantage of the distributed name space and remote access provided by the file system architecture. A pseudo-device is a user-level server process that implements the file system's I/O interface. A pseudo-device server registers itself as a special file, and the kernel forwards all I/O operations on the file to the user-level server process. The mechanisms that support remote device access also support remote access to pseudo-devices. A pseudo-file-system is a user-level server process that implements the naming interface (and usually the I/O interface, too). A pseudo-file-system server registers itself as a prefix in the name space. The prefix table mechanism integrates the pseudo-file-system into the distributed name space, and the kernel forwards all naming operations on that domain to the user-level server process.

The advantage of using the file system interface to access user-level services is that it is a natural extension of the device-independent I/O already present in UNIX. The write and read operations are equivalent to the send and receive operations of a message-based system. The ioctl operation is equivalent to a synchronous send-receive pair of operations, and it provides a simple RPC transport protocol between client applications and the user-level server process. The server is free to define any ioctl commands it wants to support, so a variety of services can be implemented. Furthermore, select also applies to the user-level services, so it is possible to wait for devices and services simultaneously. In addition to using these familiar I/O operations, the Sprite file system also provides a distributed name space and transparent remote access. By integrating services into the file system they are automatically accessible throughout the network. There is no need to invent a new name space for new services.

Communication between the kernel and the user-level service is via a request-response protocol that is much like RPC. The kernel passes a request to the user-level server process and waits for a response. The protocol uses a buffering system that allows batched reads and writes in order to reduce the number of context switches required for kernel-to-user communication.
Message buffers for the protocol are in the server process's address space to give the server control over the amount of buffer space and to make it possible to release idle buffer pages. Benchmarks of the request-response protocol indicate that communication with a user-level server is about as fast as using the UNIX pipe mechanism locally. It is much faster in the remote case than the UNIX TCP protocol because it uses the Sprite network RPC protocol. However, the overhead of the protocol is high enough to slow I/O-intensive tasks by 25% to 50% in comparison to a kernel-resident service.

In Sprite, there are currently four main applications for user-level servers: an X11 window system server, a terminal emulation package, a TCP/IP protocol server, and an NFS file system server. The implementation is fast enough to provide good interactive response in the window system, even for things like mouse tracking. However, the performance of our TCP/IP protocols is not as good as the kernel-resident UNIX implementation of these protocols, as expected. The performance penalty is similar to that observed by Clark with his upcall mechanism [Clark85], which is used to implement network protocol layers outside the kernel in user code. The performance of NFS access from Sprite is also affected by the user-level implementation. The penalty for compilation benchmarks is about 30% in comparison to a native NFS implementation. The conclusion regarding performance is that it is acceptable for many applications, but the penalty is significant enough that heavily-used services, especially regular file access, are worth implementing in the kernel.

This chapter is organized as follows. Section 7.2 describes the major applications for pseudo-devices and pseudo-file-systems that exist in Sprite. Section 7.3 reviews the file system architecture and describes how pseudo-file-systems and pseudo-devices are implemented. Section 7.4 describes the request-response protocol between the kernel and the server process. Section 7.5 evaluates the performance of the implementation, from the raw performance of the request-response protocol up to a high-level benchmark executed on the NFS pseudo-file-system. Section 7.6 describes related work done in other systems. Section 7.7 concludes this chapter.

7.2. User-level Services in Sprite

The X window server is implemented as a pseudo-device. The X server controls the display and multiplexes the mouse and keyboard among clients, as shown in Figure 7-1. The clients use write to issue commands to the X server, and read to get mouse and keyboard input. A buffering system, which is described in detail in Section 7.4.2, provides an asynchronous interface between the window server and its clients to reduce context switching overhead.


Figure 7-1. The pseudo-device ``/hosts/sage/X0'' is used by clients of the window system to access the X window server on workstation ``sage.'' The server, in turn, has access to the display and keyboard.

There is a pseudo-device for each display in the network, and access to remote displays is not a special case because the file system provides network transparency.

The TCP/IP protocols are used by Sprite applications to interface to non-Sprite systems for mail transfer, remote logins, and remote file transfer. These applications do not necessarily demand high performance, and so we implement the TCP/IP protocols at user-level as a pseudo-device. Client processes read and write a pseudo-device to use TCP, and the user-level server process implements the full TCP protocol by reading and writing packets over the raw network. The internet server defines ioctl operations that correspond to the 4.3BSD socket calls such as bind, listen, connect, and accept. The ioctl input buffer is used to pass arguments from the client to the server, where the socket procedure is executed. The ioctl output buffer is then used to return results back to the client. Each socket library procedure in the client is simply a stub that copies arguments and results into and out of buffers and invokes the ioctl; a sketch of this stub pattern appears below. In this case, the pseudo-device mechanism provides the transport mechanism for an RPC-like facility between clients and the internet server.

Terminal emulators are implemented as pseudo-devices. A terminal emulator provides terminal-like line-editing functions for non-terminal devices like windows or TCP network connections. Client processes make read and write requests on the pseudo-device as if it were a terminal. The server implements the client's requests by manipulating a window on the screen or a TCP connection to a remote host. The server provides the full suite of 4.3BSD ioctl calls and line-editing functions such as backspace and word erase. In this case, the pseudo-device mechanism provides a generalization of the 4.3BSD pseudo-tty facility.

A pseudo-file-system server is used to provide access to NFS file systems from Sprite. The pseudo-file-system server translates Sprite file system operations into the NFS protocol and uses the UDP datagram protocol to forward the operations to NFS file servers. The NFS pseudo-file-system server is very simple. There is no caching of either file data or file attributes, and the server process is single-threaded. In this case, our goal was to provide NFS access to our users with minimum implementation effort, not necessarily maximum performance. Most files are serviced by higher-performance Sprite servers, and NFS is used to access the files left on NFS servers.
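A minimal sketch of the stub pattern used by the internet server's socket library can make this concrete. The command number, structure layout, and field names below are hypothetical and are not taken from the Sprite sources; only the overall pattern (marshal the arguments, invoke ioctl on the pseudo-device, unmarshal the result) follows the description above. Note also that the Sprite ioctl interface provides separate input and output buffers; this sketch folds them into a single in/out structure for brevity, in the style of a UNIX ioctl.

    /* Hypothetical sketch of a client-side socket stub that uses an ioctl
     * on the TCP pseudo-device as an RPC transport to the user-level
     * internet server.  The command code and structure layout are
     * assumptions made for illustration. */
    #include <sys/ioctl.h>
    #include <string.h>

    #define IOC_TCP_CONNECT  0x7401    /* hypothetical server-defined command */

    struct connect_call {              /* hypothetical in/out argument block */
        unsigned long  addr;           /* in:  remote IP address */
        unsigned short port;           /* in:  remote TCP port */
        int            status;         /* out: 0 on success, error code otherwise */
    };

    int
    tcp_connect_stub(int pdevFd, unsigned long addr, unsigned short port)
    {
        struct connect_call call;

        memset(&call, 0, sizeof(call));
        call.addr = addr;
        call.port = port;
        /* The kernel forwards this ioctl to the user-level internet server,
         * which executes the real connect logic and fills in call.status. */
        if (ioctl(pdevFd, IOC_TCP_CONNECT, &call) < 0) {
            return -1;                 /* transport-level failure */
        }
        return call.status;
    }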


Figure 7-2. Two user-level servers are used to access a remote NFS file server. The first is the NFS pseudo-file-system server. In turn, it uses the UDP protocol server to exchange UDP packets with the NFS file server. The figure also depicts requests to the NFS pseudo-file-system server arriving over the network from remote Sprite clients using the Sprite network RPC protocol. The arrows indicate the direction of information flow during a request.

Figure 7-2 illustrates the communication structure for NFS access under Sprite. Note that the UDP network protocol, which is used for communication between the pseudo-file-system server and the NFS server, is not implemented in the Sprite kernel. Instead it is implemented by the internet protocol pseudo-device server. This approach adds additional overhead to NFS accesses, but it illustrates how user-level services may be layered transparently. The interface between NFS and UDP would not have to change if the UDP protocol implementation were moved into the kernel. Figure 7-2 also shows an application accessing the NFS pseudo-file-system from a Sprite host other than the one executing the pseudo-file-system server. In this case the kernel's network RPC protocol is used to forward the operation to the pseudo-file-system server's host.

7.3. Architectural Support for User-level Services

The operating system architecture is organized in a modular way to support access to different kinds of objects, including pseudo-file-systems and pseudo-devices. The architecture includes two major internal interfaces, one for the naming operations listed in Table 7-1, and one for the I/O operations listed in Table 7-2. Allowing a user-level server process to implement these interfaces requires the addition of a communication mechanism between the kernel and the server process. A request-response style protocol is used so that the interactions between the kernel and the server are similar to those between two Sprite kernels. Stub procedures that use this protocol are invoked through the internal interfaces, just as stub procedures that use the network RPC protocol are invoked in the remote case. Thus, the addition of user-level servers adds a third, orthogonal case to the basic system structure, in addition to the local and remote cases discussed in Chapter 3.


    Pseudo-File-System Operations

    Open             Open an object for further I/O operations.
    GetAttr          Get the attributes of an object.
    SetAttr          Set the attributes of an object.
    MakeDevice       Create a special device object.
    MakeDirectory    Create a directory.
    Remove           Remove an object.
    RemoveDirectory  Remove a directory.
    Rename           Change the name of an object.
    HardLink         Create another name for an existing object.
    SymbolicLink     Create a symbolic link or a remote link.
    DomainInfo       Return information about the domain.

Table 7-1. Naming operations that are implemented by pseudo-file-system servers, and the DomainInfo operation that returns information about the whole pseudo-file-system.

    Pseudo-Device Operations

    Read        Transfer data from an object.
    Write       Transfer data to an object.
    WriteAsync  Write without waiting for completion.
    Ioctl       Invoke a server-defined function.
    GetAttr     Get attributes of an object.
    SetAttr     Set attributes of an object.
    Close       Close an I/O connection to an object.

Table 7-2. I/O operations on a pseudo-device or an object in a pseudo-file-system. I/O operations on the object are forwarded to the server using the request-response protocol.

A user-level server process implements the I/O interface for a pseudo-device, while a Sprite file server implements the naming interface for the pseudo-device. A special file is used to represent the pseudo-device in the name space, just as a special file is used to represent a device. Thus, all naming operations on the pseudo-device are handled by a Sprite file server. For the I/O interface to pseudo-devices, the kernel uses a request-response protocol to forward the I/O operations to the user-level server, and the server is free to implement the I/O operations however it wants. Remote access to a pseudo-device is handled just like remote device access. The network RPC protocol is used to forward the operation to the host running the pseudo-device server (the I/O server for the pseudo-device). At the I/O server the operation is forwarded to the user-level server process with the request-response protocol.

With a pseudo-file-system, a user-level server process implements the naming interface, and the prefix table mechanism automatically integrates the pseudo-file-system into the distributed file system name space. The pseudo-file-system server registers itself as a domain in the prefix table of its host, just as the kernel registers domains for any local disks. The prefix for the pseudo-file-system is exported to the network in the same way that prefixes corresponding to local disks are exported. To remote clients, the pseudo-file-system is accessed with the same remote naming module used to access remote domains implemented by Sprite file servers. Thus, there are three cases implemented below the internal naming interface: the server for a prefix can be 1) the local kernel, 2) a remote kernel, or 3) a user-level process.

A pseudo-file-system server has the option of implementing the I/O interface as well as the naming interface. The NFS pseudo-file-system server, for example, implements both interfaces because all file system operations have to be forwarded to the NFS file server. However, because the I/O interface is independent of the naming interface, the pseudo-file-system server can arrange for the kernel to implement the I/O interface to objects in a pseudo-file-system. In this case the object would be a file, device, or pipe, but the pseudo-file-system would provide its own naming interface to the object. The implementation of these mechanisms is described below.
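The three cases below the internal naming interface can be pictured as a table of operation vectors selected by the type of the prefix table entry. This is only an illustrative sketch: the type names, signatures, and dispatch function are hypothetical and do not come from the Sprite sources.

    /* Illustrative sketch of dispatching a naming operation to one of the
     * three implementations below the internal naming interface: the local
     * kernel, a remote Sprite kernel (via network RPC), or a user-level
     * pseudo-file-system server (via the request-response protocol).
     * All type and function names are hypothetical. */

    typedef enum { DOMAIN_LOCAL, DOMAIN_REMOTE, DOMAIN_PSEUDO_FS } DomainType;

    typedef struct NameOps {
        int (*open)(const char *path, int useFlags, void *resultPtr);
        int (*remove)(const char *path);
        int (*getAttr)(const char *path, void *attrPtr);
        /* ... the remaining operations from Table 7-1 ... */
    } NameOps;

    /* One operation vector per case; the bodies would do a local directory
     * lookup, issue a network RPC, or forward the request over the server's
     * naming stream, respectively. */
    static NameOps localNameOps, remoteNameOps, pseudoFsNameOps;

    static NameOps *
    NameOpsForDomain(DomainType type)
    {
        switch (type) {
        case DOMAIN_LOCAL:      return &localNameOps;
        case DOMAIN_REMOTE:     return &remoteNameOps;
        case DOMAIN_PSEUDO_FS:  return &pseudoFsNameOps;
        }
        return &localNameOps;
    }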

7.4. Implementation

The server for a pseudo-device or pseudo-file-system is much like the server in any RPC system: it waits for a request, does a computation, and returns an answer. In this case it is the Sprite kernel that is making requests on behalf of a client process, and the server is a user-level application process. The kernel takes care of bundling up the client's parameters, communicating with the server, and unpackaging the server's answer so that the mechanism is transparent to the client. The following sub-sections describe the implementation in more detail, including the I/O streams used by the server, the request-response protocol between the kernel and the server, and a buffering system used to improve performance.

7.4.1. The Server's Interface

A pseudo-device server process has one control stream used to wait for new clients, and one request stream for each open by a client process. The control stream is created when the pseudo-device server opens a pseudo-device file. The server distinguishes its open from future opens by clients by specifying the O_PDEV_MASTER flag. The request streams are created by the kernel each time a client opens the pseudo-device. The kernel enqueues a message on the control stream that identifies the new request stream, and the server reads the control stream for these messages. All further operations between the client and the server take place using the request stream.

The internal representations of these streams use three types of object descriptors that correspond to the control stream (PDEV_CONTROL), the server's end of a request stream (PDEV_SERVER), and the client's end of a request stream (PDEV_CLIENT). These descriptors are shown in Figure 7-3.


Figure 7-3. I/O streams between clients and a pseudo-device server. There is one pair of kernel object descriptors for each request-response channel between a client and the server. This channel is created each time a client makes an open on the pseudo-device. The PDEV_CLIENT descriptor is a stub that references the corresponding PDEV_SERVER descriptor; it is needed so that client I/O operations invoke different object-specific kernel procedures than the server. There is one PDEV_CONTROL stream for each pseudo-device, and this is used to notify the server about new PDEV_SERVER streams.

There is one PDEV_CONTROL descriptor per pseudo-device. It indicates if a server process is active and it has a queue for the notification messages about new request streams. The PDEV_SERVER descriptor records the state of a connection between the client and the server, and it is used during the request-response protocol described below. The PDEV_CLIENT descriptor is linked to the corresponding PDEV_SERVER descriptor. The client's descriptor is distinct because the type of the object descriptor is used to dispatch through the internal I/O interface, and the client and server have different sets of object-specific procedures. For example, the server reads its end of the request stream to learn about new requests, while a client reads its end of the request stream to get input data.

The I/O streams used by a pseudo-file-system server are similar to those of a pseudo-device server, except that the control stream is replaced by a naming stream. The naming stream is a request-response stream that the kernel uses to forward naming operations to the server. The naming stream is created when the pseudo-file-system server opens the remote link that corresponds to the domain prefix. A special flag, O_PFS_MASTER, distinguishes this open as being by the server process. The client end of the naming stream is attached to the prefix table entry for the pseudo-file-system. This causes pseudo-file-system routines to be invoked through the internal naming interface, and these routines use the naming stream to forward the naming operations to the server.

The pseudo-file-system server has two options when a client makes an OPEN request. If it wishes to implement the I/O operations on the named object, it can ask the kernel to create a new request-response stream. This stream is identical to a pseudo-device request-response stream. Alternatively, the pseudo-file-system server can open an I/O stream to a file, device, or pipe, and ask the kernel to pass this stream off to the client. The system has to be able to pass a stream to a remote client. However, this case is essentially the same problem that process migration causes: an I/O stream must be moved between hosts. The migration mechanism described in Chapter 5 is reused in order to implement stream passing. (While passing off open I/O streams to clients is supported by the kernel, we have not yet developed any major pseudo-file-system applications that use it.)
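The stream handling described above is sketched below for a pseudo-device server. O_PDEV_MASTER and IOC_PDEV_SET_BUFS are the names used in this chapter, but their numeric values, the control-message layout, and the buffer-declaration structure are assumptions made only so the sketch is self-contained.

    /* Sketch of a pseudo-device server's stream handling.  The flag and
     * ioctl names are those used in the text; the values and message
     * layouts are hypothetical. */
    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define O_PDEV_MASTER      0x10000   /* hypothetical value */
    #define IOC_PDEV_SET_BUFS  0x7001    /* hypothetical value */
    #define MAX_STREAMS        64
    #define REQ_BUF_SIZE       2048

    struct ctrl_msg {                    /* hypothetical control message */
        int newStreamFd;                 /* identifies the new request stream */
    };

    struct buf_decl {                    /* hypothetical IOC_PDEV_SET_BUFS argument */
        char *requestBuf;
        int   requestBufSize;
        char *readBuf;                   /* NULL if no read-ahead buffer */
        int   readBufSize;
    };

    int
    main(void)
    {
        static char requestBufs[MAX_STREAMS][REQ_BUF_SIZE];
        int streams[MAX_STREAMS];
        int numStreams = 0;
        int ctrlFd = open("/hosts/sage/X0", O_RDONLY | O_PDEV_MASTER);

        for (;;) {
            fd_set readFds;
            int i, maxFd = ctrlFd;

            FD_ZERO(&readFds);
            FD_SET(ctrlFd, &readFds);
            for (i = 0; i < numStreams; i++) {
                FD_SET(streams[i], &readFds);
                if (streams[i] > maxFd) maxFd = streams[i];
            }
            select(maxFd + 1, &readFds, NULL, NULL, NULL);

            if (FD_ISSET(ctrlFd, &readFds)) {
                /* A client opened the pseudo-device: the kernel queued a
                 * message naming the new request stream. */
                struct ctrl_msg msg;
                struct buf_decl decl;
                read(ctrlFd, &msg, sizeof(msg));
                decl.requestBuf = requestBufs[numStreams];
                decl.requestBufSize = REQ_BUF_SIZE;
                decl.readBuf = NULL;
                decl.readBufSize = 0;
                ioctl(msg.newStreamFd, IOC_PDEV_SET_BUFS, &decl);
                streams[numStreams++] = msg.newStreamFd;
            }
            for (i = 0; i < numStreams; i++) {
                if (FD_ISSET(streams[i], &readFds)) {
                    /* Drain the request buffer for this stream; see the
                     * per-stream handling sketch in Section 7.4.2. */
                }
            }
        }
    }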

7.4.2. Request-Response

The Sprite kernel communicates with the server using a request-response protocol. The synchronous version of the protocol is described first, and then extensions to allow asynchronous communication are described. For each kernel call made by a client, the kernel issues a request message to the server and blocks the client process waiting for a reply message. The request message includes the operation and its associated parameters, which may include a block of data. The server replies to requests by making an ioctl call on the request stream. The ioctl specifies the return code for the client's system call, a signal to return (useful for terminal emulators), and the size and location of any return data for the call (e.g., data being read). The kernel copies the reply data directly from the server's address space to the client's.
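A sketch of the reply half of the protocol follows. The reply carries the three elements listed above (a return code, a signal, and the size and location of any return data); the structure layout and command value below are assumptions, while IOC_PDEV_REPLY is the operation named in Table 7-3.

    /* Sketch of a server answering one request.  The kernel copies the
     * reply data directly from the server's address space into the
     * client's.  Field names and the command value are hypothetical. */
    #include <sys/ioctl.h>

    #define IOC_PDEV_REPLY  0x7004       /* hypothetical value */

    struct pdev_reply {                  /* hypothetical reply descriptor */
        int   status;                    /* return code for the client's call */
        int   signal;                    /* signal to deliver, or 0 */
        int   replySize;                 /* bytes of reply data */
        char *replyData;                 /* address of reply data in the server */
    };

    static int
    reply_to_client(int streamFd, int status, char *data, int size)
    {
        struct pdev_reply reply;

        reply.status = status;
        reply.signal = 0;
        reply.replySize = size;
        reply.replyData = data;
        return ioctl(streamFd, IOC_PDEV_REPLY, &reply);
    }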

7.4.2.1. Buffering in the Server's Address Space

The kernel passes request messages to the server using a request buffer, which is in the server process's address space. An associated pair of pointers, firstByte and lastByte, are kept in the kernel and indicate the valid region of the buffer. There is one request buffer for each request stream (and naming stream) that the server has. With this buffering scheme the server does not read the request messages directly from its request stream. Instead, the kernel puts request messages into the request buffer and updates lastByte to reflect the addition of the messages. The server then reads a short message from the request stream that contains the current values of firstByte and lastByte. The read returns only when there are new messages in the request buffer, and a server typically uses select to wait on all of its request streams and control (or naming) stream simultaneously. After processing the messages found between firstByte and lastByte the server updates firstByte with an ioctl.

The motivation for this buffering system is to move the buffer space out of the kernel and to eliminate an extra copy. The kernel copies a request message directly from the client process's address space into the server's request buffer, and it copies a server's reply message directly back into the client's address space. In contrast, kernel-resident buffers like those used for UNIX pipes require data to be copied twice in each direction, once from the first process (i.e., the client) into the kernel buffer and once from the buffer to the second process (i.e., the server). The server process is free to allocate a buffer of any size, and the buffers do not have to be tied down in kernel space. We shifted to this buffering scheme after our initial experiences with a prototype implementation that used fixed-size, kernel-resident buffers. We found that there were often many buffers associated with idle request streams that were needlessly occupying kernel space. The Sprite kernel is not demand paged, so these idle buffers occupied real memory. User-space buffering eliminates this problem, and it gives the server the flexibility of choosing its own buffer size.

As a convenience to servers, the kernel does not wrap request messages around the end of the request buffer. If there is insufficient space at the end of the buffer for a new request, then the kernel blocks the requesting process until the server has processed all the requests in the buffer. Once the buffer is empty the kernel places the new request at the beginning of the buffer, so that it will not be split into two pieces. This sequence is shown in Figure 7-4. No single request may be larger than the server's buffer: oversize write requests are split into multiple requests, and oversize ioctl requests are rejected. If a write is split into several requests, the request stream is locked to preserve the atomicity of the original write. Reads are not affected by the size of the request buffer because read data is transferred directly from the server's address space to the client's buffer.
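The no-wrap rule can be made concrete with a small sketch of the placement decision: if a request does not fit between lastByte and the end of the buffer, the client blocks until the server has drained the buffer, and the next request then goes back to the beginning. The pointer names mirror the text; the code itself is only illustrative and is not taken from the kernel.

    /* Illustrative sketch of the kernel's placement rule for the request
     * buffer: requests are never wrapped around the end of the buffer. */

    struct req_buffer {
        int firstByte;      /* first byte of unprocessed requests */
        int lastByte;       /* byte just past the newest request */
        int size;           /* total size of the server's buffer */
    };

    /* Returns the offset at which a new request of 'len' bytes should be
     * placed, or -1 if the client must block until the server empties the
     * buffer (the caller would then retry against an empty buffer). */
    static int
    place_request(struct req_buffer *buf, int len)
    {
        if (buf->size - buf->lastByte >= len) {
            int offset = buf->lastByte;
            buf->lastByte += len;
            return offset;              /* fits at the end: append */
        }
        if (buf->firstByte == buf->lastByte) {
            buf->firstByte = 0;         /* buffer is empty: restart at the top */
            buf->lastByte = len;
            return 0;
        }
        return -1;                      /* wait for the server to catch up */
    }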

7.4.2.2. Buffering for an Asynchronous Interface

The buffering mechanism supports asynchronous writes (``write-behind'') at the server's option. If the server specifies that write-behind is to be permitted for the stream, then the kernel will allow the client to proceed as soon as it has placed a WRITE request in the server's buffer. In enabling write-behind, the server guarantees that it is prepared to accept all data written to the stream; the kernel always returns a successful result to clients. The advantage of write-behind is that it allows the client to make several WRITE requests without the need for a context switch into and out of the server for each one. On multiprocessors, write-behind permits concurrent execution between the client and server. All other requests are synchronous, so the first non-write request forces the client to block and gives the server an opportunity to catch up.

The X window server, for example, can use write-behind because the X interface is already stream-oriented in order to support batching of requests. Window system clients use write to issue requests to the server, and the request-response protocol allows these to be buffered up, or batched, before the server has to respond. The terminal emulation package also uses write-behind to reduce context switching. This is useful when the terminal stream is line buffered, which is the default for most interactive programs. Without write-behind, line buffering would require a context switch to the terminal server on each line of output. With write-behind many lines of output can be buffered before a context switch is required. Write-behind is especially important if the terminal emulator is a client of the window system. Without write buffering, each line of terminal output would require four context switches: from the client to the terminal server, from the terminal server to the display server, and then two more to return to the client. Write-behind, on the other hand, allows many lines of output to be buffered before requiring these context switches, so the display of large amounts of data is efficient.


Figure 7-4. This example shows the way the firstByte and lastByte pointers into the request buffer are used. Initially there are three outstanding requests in the buffer. The subsequent pictures show the addition of a new request, an empty buffer (the server has processed the requests), and finally the addition of a new request back at the beginning of the buffer.

Read performance can be optimized by using a read buffer, one per request stream. The server fills the read buffer, which is in the server's address space, and the kernel copies data out of it without having to switch out to the server process. If the server process declares a read buffer (it is optional), then it does not see explicit READ requests on its request stream. Instead, synchronization is done with firstByte and lastByte pointers as with the request buffer. The server process updates the read buffer's lastByte after it adds data, and the kernel moves firstByte to reflect client reads. The read buffer is used by the X display server, for example. As mouse and keyboard events are generated, the server buffers them in the read buffers that correspond to windows. Later on, when the client processes get around to reading their input, the kernel copies the data directly out of the read buffer without paying for a context switch to the server.
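A sketch of the read-ahead path, under the assumption of a hypothetical pointer-update structure: the server appends input data to the read buffer it declared earlier and then advances the read buffer's lastByte with IOC_PDEV_SET_BUF_PTRS, so later client reads can be satisfied by the kernel without a context switch to the server.

    /* Sketch of a server adding data to its read-ahead buffer, as the X
     * display server does for mouse and keyboard events.  The pointer-
     * update structure and command value are hypothetical;
     * IOC_PDEV_SET_BUF_PTRS is the ioctl named in Table 7-3. */
    #include <sys/ioctl.h>
    #include <string.h>

    #define IOC_PDEV_SET_BUF_PTRS  0x7003   /* hypothetical value */
    #define READ_BUF_SIZE          4096

    struct buf_ptrs {                       /* hypothetical pointer-update block */
        int requestFirstByte;               /* how far the server has consumed requests */
        int readLastByte;                   /* how much read-ahead data is now valid */
    };

    static char readBuf[READ_BUF_SIZE];     /* declared earlier with IOC_PDEV_SET_BUFS */
    static int  readLastByte;
    static int  requestFirstByte;

    static void
    queue_input_event(int streamFd, const char *event, int len)
    {
        struct buf_ptrs ptrs;

        if (readLastByte + len > READ_BUF_SIZE)
            return;                         /* a real server would handle a full buffer */
        memcpy(readBuf + readLastByte, event, len);
        readLastByte += len;

        ptrs.requestFirstByte = requestFirstByte;
        ptrs.readLastByte = readLastByte;
        ioctl(streamFd, IOC_PDEV_SET_BUF_PTRS, &ptrs);
    }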

7.4.2.3. Buffer System Summary

To summarize the buffering scheme, the server has a request buffer associated with each request stream, and optionally a read-ahead buffer for each stream. These buffers are allocated by the server in its own address space, and an ioctl call is used to tell the kernel the size and location of each buffer. The kernel puts request messages directly into the request buffer on behalf of the client. The server's read call on the request stream returns the current values of firstByte and lastByte for both buffers. The server updates the pointers (i.e., the request buffer's firstByte and the read buffer's lastByte) by making an ioctl on the request stream. The ioctl calls available to the server are summarized in Table 7-3.
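Putting the pieces together, a server's per-stream handling might look like the following sketch. The short message returned by read and the request header layout are assumptions; the flow itself (read the pointers, walk the request buffer, reply, advance firstByte) is the one described above.

    /* Sketch of a server draining one request stream.  The message read
     * from the stream and the request header layout are hypothetical. */
    #include <unistd.h>

    struct stream_ptrs {        /* hypothetical message read from the request stream */
        int requestFirstByte;
        int requestLastByte;
        int readFirstByte;
        int readLastByte;
    };

    struct request_hdr {        /* hypothetical fixed-size request header */
        int operation;          /* READ, WRITE, IOCTL, ... */
        int messageSize;        /* header plus in-line data */
    };

    static void
    handle_requests(int streamFd, char *requestBuf)
    {
        struct stream_ptrs ptrs;
        int offset;

        /* Returns only when the kernel has added at least one request. */
        read(streamFd, &ptrs, sizeof(ptrs));

        for (offset = ptrs.requestFirstByte; offset < ptrs.requestLastByte; ) {
            struct request_hdr *hdr = (struct request_hdr *)(requestBuf + offset);
            /* ... service hdr->operation and reply with IOC_PDEV_REPLY ... */
            offset += hdr->messageSize;
        }
        /* Finally, advance the request buffer's firstByte with
         * IOC_PDEV_SET_BUF_PTRS so the kernel may reuse the space. */
    }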

7.4.3. Waiting for I/O

Normal I/O streams include a mechanism for blocking processes if the stream is not ready for input (because no data is present) or output (because the output buffer is full). To be fully general, pseudo-devices must also include a blocking mechanism, and the server must be able to specify whether or not the pseudo-device is ``ready'' for client I/O operations. One possibility would be for the kernel to make a request of the server whenever it needs to know whether a connection is ready, such as during read, write, and select calls.

    Server I/O Control Operations

    IOC_PDEV_SET_BUFS      Declare request and read-ahead buffers
    IOC_PDEV_WRITE_BEHIND  Enable write-behind on the pseudo-device
    IOC_PDEV_SET_BUF_PTRS  Set request firstByte and read-ahead lastByte
    IOC_PDEV_REPLY         Give return code and the address of the results
    IOC_PDEV_READY         Indicate the pseudo-device is ready for I/O

Table 7-3. The server uses these ioctl calls to complete its half of the request-response protocol. The first two operations are invoked to set up the request buffer, and the remaining three are used when handling requests.

Pseudo-devices were originally implemented this way, with the kernel querying the server. Unfortunately, it resulted in an enormous number of context switches into and out of server processes. The worst case was a client process issuing a select call on several pseudo-devices; most of the time most pseudo-devices were not ready, so the servers were invoked needlessly. This problem was fixed by having the kernel maintain three bits of state information for each request-response connection, corresponding to the readable, writable, and exception masks for the select call. The pseudo-device server updates these bits each time it replies to a request, and it can also change them with the IOC_PDEV_READY ioctl. This mechanism allows the kernel to find out whether a pseudo-device is ready without contacting the server, and it resulted in a significant performance improvement for select. In addition, the server can return the EWOULDBLOCK return code from a READ or WRITE request; the kernel will take care of blocking the process (unless it has requested non-blocking I/O) and will reawaken the process and retry its request when the pseudo-device becomes ready again. Thus the pseudo-device server determines whether or not the device is ready, but the kernel handles the logistics of blocking and unblocking processes. With respect to this blocking protocol, pseudo-devices behave exactly like the objects implemented by the kernel.
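The division of labor in the blocking protocol can be sketched as follows: the server answers a READ request with EWOULDBLOCK when it has no data and later marks the connection readable with IOC_PDEV_READY, while the kernel handles putting client processes to sleep and retrying their requests. The bit encoding, command values, and reply layout below are assumptions.

    /* Sketch of the blocking protocol from the server's point of view.
     * EWOULDBLOCK, IOC_PDEV_REPLY, and IOC_PDEV_READY come from the text;
     * everything else is hypothetical. */
    #include <sys/ioctl.h>
    #include <errno.h>

    #define IOC_PDEV_REPLY  0x7004          /* hypothetical value */
    #define IOC_PDEV_READY  0x7005          /* hypothetical value */
    #define PDEV_READABLE   0x1             /* hypothetical select-state bits */
    #define PDEV_WRITABLE   0x2

    /* A READ request arrives but no data is buffered: reply EWOULDBLOCK and
     * let the kernel block the client (unless it asked for non-blocking I/O). */
    static int
    handle_read_with_no_data(int streamFd)
    {
        struct { int status, sig, size; char *data; } reply =
            { EWOULDBLOCK, 0, 0, 0 };
        return ioctl(streamFd, IOC_PDEV_REPLY, &reply);
    }

    /* Later, when input shows up, mark the connection readable so the kernel
     * can wake blocked clients and answer select() without contacting us. */
    static void
    mark_readable(int streamFd)
    {
        int readyBits = PDEV_READABLE | PDEV_WRITABLE;
        ioctl(streamFd, IOC_PDEV_READY, &readyBits);
    }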

7.4.4. Future Work

There are two additional aspects of the Sprite file system architecture that have not yet been extended for use with user-level server processes: data caching and automatic recovery. Currently the kernel's data cache is only used with files, but it could be extended to cache data for a user-level server. The cache defines its own back-end interface to read and write cache blocks, and the existing read and write procedures for pseudo-device connections can be reused for this purpose. Additional server ioctl commands need to be defined so the server could invalidate blocks and force other blocks to be written out. The additional complexity to the kernel would not be that great because the cache already provides cache-control functions and a generic back-end to support its use for both local and remote files.

The recovery system described in Chapter 6 can also be extended to support user-level servers. Currently there is no recovery, so if a server process crashes then the I/O streams it serviced are closed. The recovery system is based on keeping redundant state, so to utilize the existing kernel mechanisms the user-level server has to register state with the kernel for each I/O stream it services. In the case of remote clients, the per-stream state should be propagated back to the remote client kernel for safe-keeping. The net result is that per-stream state is duplicated in the kernel (or kernels) involved with the pseudo-device. This will allow recovery either from a crashed server process or from the crash of the host running the server process. When the server process is restarted, it would first see a series of REOPEN requests that include the per-client state previously registered by the last server process. While this will not be feasible for all user-level servers, it would be quite straightforward for the NFS pseudo-file-system. The per-stream state for the NFS pseudo-file-system server is a 32-byte NFS handle that identifies the file to the NFS file server, and this state does not change once the stream is opened.

7.5. Performance Review

This section presents some performance measurements of the implementation. First the request-response protocol is measured, including the effect of using write-behind. Then the performance of actual pseudo-devices (for the UDP protocol) and pseudo-file-systems (for NFS access) is examined.

7.5.1. Request-Response Performance

There are a number of contributions to the cost of the request-response protocol: system call overhead, context switching, copy costs, network communication, synchronization, and other software overhead. The measurements below compare the request-response protocol with UNIX TCP sockets, UNIX UDP sockets, and pipes under both Sprite and UNIX. The hardware used in the tests is a Sun-3/75 with 8 megabytes of main memory, and the UNIX version is SunOS 3.2.

Each of the benchmarks uses some communication mechanism to switch back and forth between the client and the server process. Each communication exchange requires four kernel calls and two context switches. With pseudo-devices, the client makes one system call and is blocked waiting for the server. The pseudo-device server makes three system calls to handle each request: one to read the request stream, one to reply, and one to update the firstByte pointer into the request buffer. (These last two calls could be combined, but currently they are not.) With the other mechanisms the client makes two system calls: one to make a request of the server and another to wait for a response. The server makes two system calls as well: one to respond to the client, and one more to wait for the next request. The flow of control is shown in Figure 7-5.


Figure 7-5. This shows the flow of control between two processes that exchange messages. Initially the server is waiting for a message from the client. The client sends the message and then blocks waiting for a reply. The client executes again after the server waits for the next request. Each benchmark has a similar structure, although different primitives are used for the Send and Receive operations shown here.

Table 7-4 presents the elapsed time for a round trip between processes for each mechanism when sending little or no data. The measurements were made by timing the cost of several thousand round trips and averaging the results. The measured time includes time spent in the user-level processes. The costs of communication via pipes (on UNIX and Sprite) and pseudo-devices (Sprite only) are roughly the same in the local case, which suggests that overhead from context switching and the scheduler, which applies to all cases, is the dominant cost. In the second half of the table, the client and server processes are on different hosts so there are network costs. The network cost of using Sprite RPC and UNIX-to-UNIX UDP is about the same at these small transfer sizes. However, the Sprite RPC protocol provides reliable message exchanges while UDP does not. TCP, which does provide a reliable channel, is much more expensive than using remote pseudo-devices or UDP.
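The benchmark structure shown in Figure 7-5 amounts to timing a large number of null request-response exchanges; a sketch of the client side appears below, assuming a hypothetical null server-defined ioctl command. Only the timing skeleton is meaningful; the command value is an assumption.

    /* Sketch of the client side of the latency benchmark: time N null
     * exchanges over a pseudo-device and report the average round trip. */
    #include <sys/ioctl.h>
    #include <sys/time.h>

    #define IOC_NULL    0x7400          /* hypothetical null command */
    #define ITERATIONS  10000

    static double
    time_round_trips(int pdevFd)
    {
        struct timeval start, end;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; i++) {
            ioctl(pdevFd, IOC_NULL, NULL);   /* request plus blocked wait for reply */
        }
        gettimeofday(&end, NULL);

        return ((end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec)) / ITERATIONS;   /* microseconds */
    }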

The difference between exchanging zero bytes and one byte using pseudo-devices highlights the memory mapping overhead incurred in the pseudo-device implementation. When the kernel puts a request into the server's buffer it is running on behalf of the client process. On the Sun hardware only one user process's address space is visible at a time, so it is necessary to map the server's buffer into the kernel's address space before copying into it. Similarly, when the server returns reply data the client's buffer must be mapped in. The mapping is done twice each iteration because data is sent in both directions, and obvious optimizations (i.e., caching the mappings) have not been implemented.

    Process Communication Latency (microseconds)

    Benchmark       Bytes   Sprite   UNIX
    PipeExchange       1     1910    2180
    Pseudo-Device      0     2050      -
    Pseudo-Device      1     2440      -
    UDP socket         1       -     1940
    TCP socket       100       -     5180
    Remote Pdev        0     4260      -
    Remote Pdev        1     5000      -
    Remote UDP         1       -     4870
    Remote TCP       100       -     7980

Table 7-4. The results of various benchmarks running on a Sun-3/75 workstation under Sprite and/or UNIX. Each benchmark involves two communicating processes: PipeExchange passes one byte between processes using pipes, Pseudo-Device does a null ioctl call on a pseudo-device, UDP exchanges 1 byte using a UNIX UDP datagram socket, and TCP exchanges 100 bytes using a UNIX TCP stream socket. The TCP test moves larger chunks of data because the TCP implementations delay small messages in an attempt to batch up several small messages into one network packet.

Figure 7-6 shows the performance of the various mechanisms as the amount of data varies. Data is transferred in both directions in the tests, and the slope of each line gives the per-byte handling cost. The graphs for UDP and TCP are non-linear due to the mbuf buffering scheme used in UNIX; messages are composed of chains of buffers, either 112 bytes or 1024 bytes. Messages that are not multiples of 1024 bytes suffer a performance hit because they can involve long buffer chains.

Figure 7-6. Elapsed time per exchange vs. bytes transferred, local and remote, with different communication mechanisms: pseudo-devices, Sprite pipes (local only), and UNIX UDP and TCP sockets. The bytes were transferred in both directions. The shape of the UDP and TCP lines is due to the buffering scheme used in UNIX. The buffers are optimized for packets that are multiples of 1 Kbyte, and chains of 112-byte buffers are used with packets of intermediate size.

The Sprite mechanisms have nearly constant per-byte costs. The unrolled byte copy routine used by the kernel takes about 200 microseconds per kbyte. Data is copied four times using pipes because there is an intermediate kernel buffer; the measured cost is about 800 microseconds per kbyte. Data is copied twice using pseudo-devices, and we expect a per-kbyte cost of 400 microseconds. Close examination of Figure 7-6 reveals that the pseudo-device cost is close to 400 microseconds per kbyte with transfers between 1 kbyte and 2 kbytes, while it is slightly larger than this at smaller buffer sizes. The overall difference is slight, however, so this anomaly may be due to experimental error.

In the remote case, the pseudo-device implementation uses one kernel-to-kernel RPC to forward the client's operation to the server's host. The RPC adds about 2.4 msec to the base cost when no data is transferred, and about 4.3 msec when 1 kbyte is transferred in both directions. There is a jump in the remote pseudo-device line in Figure 7-6 between 1280 and 1536 bytes, where an additional Ethernet packet is needed to send the data.

The effects of write-behind buffering can be seen by comparing the costs of writing a pseudo-device with and without write-behind. The results in Table 7-5 show a 60% reduction in elapsed time for small writes. This speed-up is due to fewer context switches between the processes, and because the server makes one less system call per iteration, since it does not return an explicit reply. The table also gives the optimal number of context switches possible, and the actual number of context switches taken during 1000 iterations. The optimal number of switches is a function of the size of each request and the size of the request buffer (2048 bytes in this case). Preemptive scheduling causes extra context switches.

The server has a 2048-byte request buffer and there is a 40-byte header on requests, so, for example, 28 write messages each with 32 bytes of data will fit into the request buffer, but only one write message with 1024 bytes of data will fit. A scheduling anomaly also shows up at 1024 bytes; the kernel's synchronization primitives cause the client process to be scheduled too soon, so there are twice as many context switches as expected.

    Pseudo-Device Write vs. Write-Behind (bytes vs. microseconds and context switches)

    Size   Write   Write-Behind   Ctx Swtch   Opt Swtch
      32    2330        910            40          36
      64    2370        940            63          53
     128    2400       1000           100          84
     256    2450       1120           178         167
     512    2590       1420           382         334
    1024    3030       2660          2000        1000

Table 7-5. The elapsed time in microseconds for a write call with and without write-behind, and the number of context switches taken during 1000 iterations of the write-behind run vs. the optimal number of switches. The write-behind times reflect the smaller number of context switches made possible by write-behind. The optimal number of switches is not obtained because the scheduler preempts the client before it completely fills the request buffer.
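The ``Opt Swtch'' column of Table 7-5 follows directly from the buffer and header sizes quoted above; the short calculation below reproduces it and is just arithmetic on those numbers, not code from the system.

    /* Reproduces the "Opt Swtch" column of Table 7-5: with a 2048-byte
     * request buffer and a 40-byte header, floor(2048/(40+size)) write
     * requests fit per buffer, so 1000 iterations need at least
     * ceil(1000/perBuffer) buffer drainings, which matches the column. */
    #include <stdio.h>

    int
    main(void)
    {
        int sizes[] = {32, 64, 128, 256, 512, 1024};
        int i;

        for (i = 0; i < 6; i++) {
            int perBuffer = 2048 / (40 + sizes[i]);
            int optimal = (1000 + perBuffer - 1) / perBuffer;
            printf("%4d bytes: %2d per buffer, %4d switches\n",
                   sizes[i], perBuffer, optimal);
        }
        return 0;
    }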

7.5.2. UDP Performance

Figure 7-7. Timing of the UDP protocol. The receiver is always a UNIX process to model the use of UDP to communicate with the UNIX NFS server. Each Sprite-to-UNIX packet exchange requires two request-response transactions with the Sprite UDP server. The cost of accessing the UDP service via the pseudo-device request-response protocol is given by the line labeled ``local pdev''. The small slope of this line indicates that copy costs are not that significant but process scheduling and context switching have a large impact on performance. Note that message sizes are multiples of 1024 bytes, which are the best cases for the UNIX UDP implementation; the non-linear effects from odd buffer sizes, which were shown in Figure 7-6, are not present here.

The performance penalty of moving a service out of the kernel is demonstrated by the cost of the UDP datagram protocol, which is implemented in Sprite using a pseudo-device. The cost to send data via a UDP packet and receive a one-byte acknowledgment packet is plotted in Figure 7-7. The UNIX-to-UNIX time is the cost of a kernel-based UDP protocol. The Sprite-to-UNIX and UNIX-to-Sprite times are slower because of the user-level pseudo-device server on the Sprite side. The cost of sending data to the pseudo-device server is plotted as the line labeled ``local pdev''. The difference between the Sprite-to-UNIX and UNIX-to-UNIX UDP times comes from paying this cost twice, once for each packet exchanged. There is also an additional penalty when receiving a large UDP datagram in Sprite. UNIX can do IP fragment reassembly at interrupt time, while in Sprite a full context switch to the server is required for each fragment of the message. Output of large messages is not slowed in Sprite because a context switch is not needed between output of each packet fragment.

7.5.3. NFS Performance

The performance of the NFS pseudo-file-system was measured with micro-benchmarks that measure individual file system operations, and with a macro-benchmark that measures the system-level cost of pseudo-file-system access. The cost of raw I/O operations through a pseudo-file-system is obviously going to be higher than the cost of I/O operations implemented by the kernel. Our NFS access involves two user-level servers for communication: one for the UDP protocol and one for the NFS protocol. However, when whole applications are run the effect of pseudo-file-system access is less pronounced.

The tests were run on Sun-3 workstations that run at 16 MHz and have 8 to 16 Mbytes of main memory. The network is a 10 Mbit Ethernet. The file servers are equipped with 400-Mbyte Fujitsu Eagle drives and Xylogics 450 controllers. The version of the Sun operating system is SunOS 3.2 on the NFS clients, and SunOS 3.4 on the NFS file servers. The four cases tested are:

Sprite: A Sprite application process accessing a Sprite file server. File access is optimized using Sprite's distributed caching system [Nelson88a].

UNIX-NFS: A UNIX application process accessing an NFS file server. ``/tmp'' is located on a virtual network disk (ND) that has better writing performance than NFS.

Sprite-NFS: A Sprite application accessing an NFS file server via a pseudo-file-system whose server process is on the same host as the application. A Sprite file server is used for executable files and for ``/tmp''.

Sprite-rmt-NFS: A Sprite application accessing NFS from a different host than the pseudo-file-system server's host.

The raw I/O performance for Sprite files, NFS files, and NFS files accessed from Sprite is given in Table 7-6. In all cases the file is in the file server's main-memory cache.


    Read-Write Performance

    Read 1-Meg   UNIX-NFS         320 K/s    25.0 msec/8K
    Read 1-Meg   Sprite           280 K/s    14.3 msec/4K
    Read 1-Meg   Sprite-NFS       135 K/s    59.3 msec/8K
    Read 1-Meg   Sprite-rmt-NFS    75 K/s   106.7 msec/8K
    Write 1-Meg  UNIX-NFS          60 K/s   133.3 msec/8K
    Write 1-Meg  Sprite           320 K/s    12.5 msec/4K
    Write 1-Meg  Sprite-NFS        40 K/s   200.0 msec/8K
    Write 1-Meg  Sprite-rmt-NFS    31 K/s   258.0 msec/8K

Table 7-6. I/O performance when reading and writing a remote file. The file is in the server's main-memory cache when reading. Sprite uses a 4-Kbyte block size for network transfers while NFS uses an 8-Kbyte block size. The write bandwidth is lower when accessing the NFS server because it writes its data through to disk while the Sprite file server implements delayed writes.

Ordinarily Sprite caches native Sprite files in the client's main memory, too. For the read benchmark I flushed the client cache before the test; for the write benchmark I disabled the client cache. The native Sprite read bandwidth is lower than the NFS read bandwidth because Sprite uses a smaller block size, 4 Kbytes versus 8 Kbytes. The native Sprite write bandwidth is an order of magnitude greater than the NFS write bandwidth because NFS file servers write their data through to disk before responding, while Sprite servers respond as soon as the data is in their cache.

The system-level performance of the NFS pseudo-file-system was measured using the Andrew file system benchmark. This benchmark was developed at CMU by M. Satyanarayanan [Howard88]. It includes several file-system-intensive phases that copy files, examine the files a number of times, and compile the files into an executable program.

    Andrew Benchmark Performance

    Sprite              522 secs   0.69
    UNIX-NFS            760 secs   1.0
    Sprite-NFS         1008 secs   1.33
    Sprite-remote-NFS  1074 secs   1.41

Table 7-7. The performance of the Andrew benchmark on different kinds of file systems. The elapsed time in seconds and the relative slowdown compared to the native NFS case are given.

The results of running this benchmark are given in Table 7-7. (The version of the benchmark used here has been modified to eliminate machine dependencies, so the results are not directly comparable with those reported in [Howard88] and [Nelson88a]; the compilation phase, in particular, is more CPU-intensive in the version used here.) The NFS performance is 33-41% slower than using a native kernel implementation of NFS. However, the table really highlights the difference between the high-performance kernel implementation of the Sprite file system, the mediocre performance of native NFS, and the further cost of accessing NFS via a user-level server process.

7.6. Related Work

There are two ways to characterize the pseudo-file-system and pseudo-device mechanisms, and thus two ways to compare them against existing work. First, they are a means of extending the distributed file system with new functionality without modifying the kernel. The new services benefit from features of the file system like the name space, remote access, and blocking I/O, but they are free to implement non-file-like functions via the ioctl call. Second, these mechanisms provide a form of interprocess communication (IPC). They are oriented towards the client-server model because one server has connections to one or more clients. Of course, these two characteristics complement each other. The file system is organized around a client-server model, and the pseudo-device and pseudo-file-system mechanisms allow the server to be implemented at user-level instead of inside the kernel.

There are a number of systems that implement system services outside the kernel. In message passing systems, the kernel just implements processes, address spaces, and a message-passing or RPC protocol for IPC. Services like the file system are not implemented in the kernel in these systems. The V-system [Cheriton84], Mach [Accetta86], and Amoeba [Renesse89] are a few examples, and there are others [Bershad89] [Fitzgerald85] [Scott89]. Sprite, in contrast, takes a hybrid approach where the file system is implemented inside the kernel, but other services can be implemented at user-level. This approach gives the file system a performance advantage; there are no costs associated with message passing and context switching during regular file access. Furthermore, user-level services in Sprite are well integrated into the system so they benefit from existing kernel features like the distributed name space, and the integration is transparent so an application cannot tell whether it is accessing a kernel-resident or a user-level service.

Another approach to adding system services is to implement them in a run-time library. As with the message-passing systems, this approach has the attraction that it is not necessary to modify the kernel to add new functions. Often these systems support dynamic linking and/or shared libraries so that applications do not have to incorporate every library procedure into their address space. Some early distributed file systems were implemented in the run-time library, such as UNIX United [Brownbridge82].

The Apollo DOMAIN file system implements an extensible I/O stream facility in its shared run-time library [Rees86].

The primary advantage that Sprite has over the library approach is that network communication is handled efficiently by the Sprite kernel, as opposed to being implemented in the library. In the description of the DOMAIN facility, for example, there is no mention of network access to a library-level stream manager, which suggests that it would have to be coded explicitly. In Sprite, network access to a user-level server reuses the kernel's network communication facilities.

With respect to integrating user-level servers into the file system, there are two systems that achieve functionality similar to the Sprite pseudo-device and pseudo-file-system mechanisms: the UNIX Version Eight stream facility [Ritchie84] and Bershad's watchdog mechanism [Bershad88]. The UNIX stream facility provides byte stream connections between processes, but it can be extended to allow emulation of an I/O device by a user-level process [Presotto85]. One extension converts ioctl calls on the byte stream into special messages that appear in the byte stream. Another extension lets the server ``mount'' a stream on a directory. In this case naming operations that encounter this directory are converted into messages on the byte stream. The server process interprets the remaining pathname, and it returns an open I/O stream in response to an open request.

There are two main differences between the streams facility and the Sprite pseudo-device and pseudo-file-system mechanisms. First, with the streams facility the server has a single byte-stream connection that it must multiplex among clients. In Sprite, there are distinct request streams for each client, and the request-response nature of the communication is explicit. The Sprite user-level server is simpler because it does not have to juggle several service requests over a single byte stream. The second difference is that the Sprite file system provides a distributed name space and transparent remote access. A single instance of a pseudo-device or pseudo-file-system server can be shared by all hosts in the Sprite network. In contrast, each streams server is private to its UNIX host, although it is possible to have server-to-server communication via network protocols.

The watchdog facility proposed by Bershad and Pinkerton [Bershad88] provides a different way to extend the UNIX file system. A ``watchdog'' process can attach itself to a file or directory and take over some, or all, of the operations on the file. The watchdog process is an unprivileged user process, but the interface is implemented in the kernel so the watchdog's existence is transparent. Watchdogs may either wait around for guarded files to be opened, or they may be created dynamically at open time by a master watchdog process. There are two advantages of the Sprite user-level server mechanisms over watchdogs. First, the Sprite read-write interface can be asynchronous to reduce context switching costs. Second, network access to the user-level server is provided by the Sprite kernel, while watchdogs were implemented in a stand-alone UNIX system. One advantage of the watchdog facility is that the server can choose which operations to implement and defer the rest to the kernel.
The granularity is coarser in Sprite, with the user-level server implementing either the I/O interface or the naming interface, or both.

Thus, the main advantages that the Sprite pseudo-device and pseudo-file-system mechanisms have over other systems are the features provided by the Sprite distributed file system architecture. These include the distributed name space, transparent remote access, and blocking I/O. Also, the existing kernel mechanisms for data caching and automatic error recovery can be extended for use with user-level services, although I have not had the time to implement these extensions.

7.7. Conclusion

Pseudo-file-systems and pseudo-devices are an extension of the architecture already present in Sprite to support its distributed file system. Pseudo-file-systems are treated as another domain type that is automatically integrated into the name space by the prefix table mechanism. Remote access is handled by the kernel with the same mechanisms used to access remote Sprite servers. Pseudo-devices are named and protected just like other files, while pseudo-file-systems define their own naming and protection systems. By making a service appear as part of the file system the existing open-close-read-write interface is retained. Byte stream communication is via read and write, and read-ahead and write-behind can be used for asynchronous communication. Ioctl is available for operations specific to the service, and it can be used as the transport mechanism for a user-level RPC system. The standard interface means that the implementation of a service could be moved into the kernel for better performance without having to change any clients.

Our performance measurements show a distinct penalty for user-level implementation. We knew in advance this would be true, but we have found the performance of our user-level services to be acceptable. However, an important lesson from this work is that user-level servers are not the best solution for all problems. Much of the recent research in operating systems has focused on means of pushing services, including the file system, out of the kernel. However, because the file system is so heavily used, it makes sense to optimize its performance with a kernel-based implementation. On the other hand, it is also important to have a means of moving functionality out of the kernel so that the kernel does not get bloated with extra features. The pseudo-file-system and pseudo-device mechanisms provide a convenient framework for user-level services, and they benefit from mechanisms already present in the Sprite kernel to support the distributed file system.

CHAPTER 8

Caching System Measurements

8.1. Introduction

This chapter motivates the need for a stateful system with a review of the Sprite caching system [Nelson88b] and some measurements of its behavior on our network. In Sprite, both clients and servers cache file data in their main memories to optimize I/O operations. The main-memory caches improve the performance of reading data because network and disk accesses are eliminated if recently-used data is found in the cache. Write performance is also improved because data is allowed to age in the cache before being written back to the server. Data is written to the server or the disk in the background so applications do not have to wait for the relatively slow disks. Some write operations are eliminated because data is deleted or overwritten before it is written back. With this caching system the performance of file I/O can scale with CPU speeds; it is not always limited by disk or network performance.

This chapter provides a follow-on study to Nelson's thesis [Nelson88b]. Nelson studied the effects of different writing policies on the Sprite caching system and the interactions of the caching system and the virtual memory system. He used a set of benchmarks to evaluate the system. The results presented here are based on statistics taken from the system as it is used for day-to-day work by a variety of users.

                 Summary of Caching Measurements

    Files requiring consistency callbacks        1% of opens
    Files uncachable because of sharing          8% of opens
    Dirty files re-read by a client             13% of opens
    Files concurrently read shared              36% of opens

    Average client cache sizes             17%-35% of memory
    Average server cache sizes             25%-61% of memory
    Average client read miss ratios             36% of bytes
    Average client write traffic ratios         53% of bytes

Table 8-1. Summary of results. The first half of the table contains cache consistency-related figures. The second half contains cache effectiveness figures.

The raw data is in the form of about 450 different counters that are maintained in the kernel and periodically sampled. File servers were sampled hourly, and clients were sampled 6 times each day. From this data it is possible to compute I/O rates, cache hit ratios, the number of RPCs serviced, and other statistics. The main results presented in this chapter are summarized in Table 8-1. There are too many statistics to fit them all in this chapter. Additional measurements are presented in Appendix B, and the raw data for the period from July through December, 1989 is available to other researchers. The study period is interesting because the system more than doubled in size during that time with the addition of new high-performance workstations. Table 8-2 lists the characteristics of the hosts that made up the Sprite network during the study period.

The remainder of this chapter is organized as follows. Section 8.2 reviews the Sprite caching system and the algorithm used to maintain consistency of client caches. Section 8.3 presents results on the cache consistency overhead. Section 8.4 presents measurements of the effectiveness of the caching system during normal system activity. Section 8.5 shows how variable-sized caches dynamically adapt to clients and servers of different memory sizes. Section 8.6 concludes the chapter.

8.2. The Sprite Caching System

This section reviews the Sprite caching system originally described in [Nelson88a]. The important properties of Sprite's caching system are: 1) diskless clients of the file system use their main memories to cache data, 2) clients use a delayed-writing policy so that temporary data does not have to be written to the server, and 3) the servers guarantee that clients always get data that is consistent with activity by other clients, regardless of how files are being shared throughout the network. Servers also cache data in their main memory and use delayed writes, and the implementation of the client and server caches is basically the same.

8.2.1. The Sprite Cache Consistency Scheme

The Sprite file servers must solve a cache consistency problem: the same data may be cached at many locations, and it is important that these versions remain consistent with each other. In order to provide a consistent view of file data, the file servers keep state about how their files are being cached by clients, and they issue cache control messages to clients so that clients always get the most up-to-date file system data. This is a centralized approach to cache consistency, with the burden of maintaining consistency placed on the file servers. Nelson describes the two cases that servers have to address, concurrent write sharing and sequential write sharing [Nelson88b]. These cases are illustrated in Figure 8-1.

Sequential write sharing occurs when Client A reads or writes a file that was previously written by Client B. In this case, any cached data for the file in Client A's cache from an earlier version is invalid. A version number is kept for each file to detect this situation. Each time a file is opened for writing the file server increments the version number.


                     Characteristics of Sprite Hosts

    Name         Model       MIPS   Mbytes   User          Birthday
    Mint         Sun 3/180      2       16   FileServer    Jun 3 '87
    Oregano      Sun 3/140      2       16   FileServer    Jun 10 '88
    Allspice     Sun 4/280      9      128   FileServer    Jul 17 '89
    Assault      DS3100        13       24   FileServer    Sep 13 '89

    Sloth        Sun 3/75       2        8   Spriter       Sep 17 '87
    Mace         Sun 3/75       2        8   Faculty       Oct 11 '87
    Murder       Sun 3/60       3       16   Spriter       Oct 20 '87
    Paprika      Sun 3/75       2       12   Spriter       Oct 20 '87
    Sage         Sun 3/75       2       12   Spriter       Oct 20 '87
    Thyme        Sun 3/75       2       16   Spriter       Oct 21 '87
    Nutmeg       Sun 3/75       2        8   Spriter       Dec 13 '87
    Fenugreek    Sun 3/75       2       12   Spriter       Jun 15 '88
    Basil        Sun 3/75       2        8   GradStudent   Oct 5 '88
    Mustard      Sun 3/75       2        8   GradStudent   May 11 '89
    Sassafras    Sun 3/75       2        8   GradStudent   Jun 9 '89

    Lust         SPUR MP3       6       32   Testing       Apr 23 '89
    Anise        Sun 4/260      9       32   Testing       Apr 10 '89
    Jaywalk      Sun 4c      12.5       12   Testing       Aug 3 '89
    Covet        Sun 4c      12.5       24   GradStudent   Sep 29 '89
    Burble       Sun 4c      12.5       12   GradStudent   Oct 4 '89

    Kvetching    DS3100        13       24   Spriter       Jul 14 '89
    Cardamom     DS3100        13       24   Faculty       Jul 26 '89
    Pride        DS3100        13       24   Spriter       Jul 22 '89
    Hijack       DS3100        13       24   Spriter       Aug 1 '89
    Piracy       DS3100        13       24   Shared        Aug 3 '89
    Pepper       DS3100        13       24   Shared        Aug 8 '89
    Parsley      DS3100        13       24   Faculty       Aug 20 '89
    Violence     DS3100        13       24   Secretary     Aug 22 '89
    Piquante     DS3100        13       24   Testing       Aug 23 '89
    Forgery      DS3100        13       24   GradStudent   Aug 24 '89
    Subversion   DS3100        13       24   GradStudent   Aug 24 '89
    Apathy       DS3100        13       24   GradStudent   Aug 26 '89
    Gluttony     DS3100        13       24   GradStudent   Sep 13 '89
    Clove        DS3100        13       24   GradStudent   Oct 11 '89
    Garlic       DS3100        13       24   GradStudent   Oct 16 '89

Table 8-2. Characteristics of the hosts in the Sprite network. They are sorted by machine type and their ``Birthday,'' the first recorded day they ran Sprite. (None of the Sun2s used in the initial development of Sprite are listed here.) The MIPS rating for each type of host is an approximation based on the Dhrystone benchmark; it is given as a rough comparison. ``Mbytes'' is the size of main memory of each host. User classifications are also given: Spriters are members of the Sprite development team; GradStudents are students working on other projects; Testing are hosts used primarily for testing and debugging.

[Figure 8-1 shows two timelines. Sequential sharing: A reads, B writes, then A reads again; B writes back its valid data and A flushes its stale data. Concurrent sharing: A writes while B reads; A writes back and disables its cache, and B flushes and disables its cache.]

Figure 8-1. Concurrent and sequential write sharing of files introduces a cache consistency problem. In the case of sequential sharing the server issues a write-back command to the client caching the valid copy of the file, and the clients keep a version number to detect stale data. During concurrent sharing, the caching of the file is disabled on the clients and all I/O operations go through to the server's cache.

When a client opens a file it compares the version number returned from the file server with the version number of its cached file. If the client is opening for reading, a version number mismatch means it should discard (flush) its cached version. If the client is opening for writing there are two cases. One case is where the client has the previous version in its cache, and it is about to generate a new version. In this case the version number returned from the server will be one greater than the client's, and the client does not need to flush its cache. In the second case (server version number > client version + 1), however, the client does not have the most recent version of the file, so it flushes its old version. This flush ensures that partial updates of a cache block do not result in out-of-date data in the rest of the block. Finally, the delayed writing policy of the Sprite caching system means that the file server may not have the current version in its cache. In this case the file server issues a write-back command to the last writer (Client B) of the file before responding to the open request by Client A. The write-back command is suppressed if the opening client is also the last writer.

Concurrent write sharing occurs when a file is open for writing on Client A during the same period of time the file is open for reading or writing on Client B. A simple approach to this situation is taken in Sprite. Caching of the shared file on the clients is disabled, and I/O operations go through to the cache on the file server. By disabling the remote caches, there is no danger of stale data being read on Client B while Client A is generating the valid version of the file. Studies in [Thompson87] and [Ousterhout85] indicate that concurrent write sharing is rare, so this simple approach should not degrade performance. Although these previous studies were based on trace data taken from timesharing systems, measurements presented below from our network show similar behavior in Sprite.

Not all distributed file systems provide the same level of consistency as Sprite. In Sprite, the caches on different hosts do not interfere with concurrent or sequential file sharing. Other systems often loosen their consistency guarantees concerning concurrent sharing. A typical approach is to write back data when a file is closed so that sequential sharers see valid data, but not to make any guarantees when a file is concurrently shared. This approach is used in NFS [NFS85] and AFS [Howard88]. A related problem with these systems (NFS and AFS) is that there is no provision for delayed writes on the clients. In these systems, processes are blocked at close-time while data is written back to the server. However, measurements presented below indicate that as much as 50% of the data generated by a client is deleted or overwritten shortly after being created. A delayed-write policy means that this short-lived data never even gets written to the server, saving network bandwidth and server CPU cycles.
Other file systems do provide consistent sharing during concurrent access, but they differ from Sprite in the way they support consistency. Apollo DOMAIN [Levine85] requires explicit locking calls, and it flushes data upon unlock to ensure consistency. Files are memory-mapped in DOMAIN, so the locks and flushes are done by the virtual memory system. LOCUS [Popek85] uses a token-passing scheme to pass ``ownership'' of a file among sharers. Burrows' thesis describes a system which uses tokens at the byte level [Burrows88]. In a token-passing system a client must have the token for a file in order to access it, and the server is in charge of circulating the token among clients so that each receives a fair percentage of ownership time.

The relative merits of the Sprite cache consistency scheme and the token-passing schemes depend on the nature of concurrent write sharing. If files are open concurrently but actually used in a sequential manner, Sprite's scheme would be worse. In this case, all Sprite client writes would go over the network, while with a token-based scheme a client could do several writes to its cache before giving up the token and writing back dirty data. If the files really are written concurrently, however, then Sprite's scheme will require less message traffic. In this case a token-passing system requires two extra RPCs on each I/O, one from the client to the file server to request the token, and one from the file server to another client to retrieve the token. This assumes the worst case where different clients make successive I/O operations. (With any consistency scheme there is still a need for synchronization so that concurrent I/O operations do not interfere. Neither Sprite, MFS, nor LOCUS piggy-backs lock acquisition with token passing, so synchronization is not considered in this comparison.) Finally, if concurrent write sharing is rare, then performance is not as important as simplicity. Sprite's scheme is simpler for two reasons. First, all conflict checking is done at open-time, while in a token-passing system conflict checking is done at every read and write. Second, there is no need to circulate a token among Sprite clients.
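A minimal sketch of the client's open-time version check described above is shown below. The structure and routine names are illustrative stand-ins, not the actual Sprite kernel code, and the server-side write-back command is not shown.

    /*
     * Client-side decision at open time, based on the version number
     * returned by the file server.  CacheEntry and FlushCache are
     * illustrative stand-ins for the real kernel data structures.
     */
    typedef struct {
        int version;          /* version of the data cached on this client */
    } CacheEntry;

    static void FlushCache(CacheEntry *entry)
    {
        /* Discard the cached blocks for this file (stub). */
        (void)entry;
    }

    void OpenTimeCheck(CacheEntry *entry, int serverVersion, int openForWriting)
    {
        if (!openForWriting) {
            /* Reader: any mismatch means the cached copy is stale. */
            if (serverVersion != entry->version) {
                FlushCache(entry);
            }
        } else if (serverVersion > entry->version + 1) {
            /*
             * Some other client wrote the file since it was cached here;
             * flush so that partial updates of a cache block cannot leave
             * out-of-date data in the rest of the block.
             */
            FlushCache(entry);
        }
        /* serverVersion == entry->version + 1 means this client held the
         * previous version and is about to create the next one, so its
         * cached blocks remain usable. */
        entry->version = serverVersion;
    }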

8.3. Sharing and Consistency Overhead Measurements

The amount of file sharing and the consistency-related traffic was measured by the file servers, and the results are given in Table 8-3. This data was collected by monitoring all open operations over a 20-day interval. The table lists values for each file server, and the combination of all servers together that represents the total file system traffic. The various cases in the table are explained below.

Non-File
This value indicates the number of directories, symbolic links, and swap files that were opened. These files are not cached on the clients. Swap files are not cached so that VM pages really leave the machine upon page-out. Directories and links are not cached in order to simplify the implementation. These could be cached, but the servers would have to issue cache control messages when directories were updated or links were changed.

Can't Cache
This value indicates the total percentage of files opened that were not cachable on the clients. The difference between this column and the ``Non-File'' column is the percentage of files that were concurrently write shared. Mint, the root server, experiences a lot of concurrent write sharing, 14%. Nearly all of the concurrent write sharing occurs on a shared load average database used to select hosts for process migration. A small amount is due to accidental sharing of append-only log files. Overall, 8% of the opens are of concurrently write-shared files.

Read Sharing
This value counts the number of files that were open for reading by more than one process at a time, either on the same or different clients. This case is relatively frequent; it happens in about 36% of the cases. Note that this value does not count files that lingered read-only in a client's cache while not used, which is not measured because the servers do not need to track this to keep caches consistent.

                  File Sharing and Cache Consistency Actions

                  Num       Non-   Can't   Read      Last      Server Action
    Server        Opens     File   Cache   Sharing   Writer    Write-back   Invalidate
    Mint          4536840   10%    24%     42%       15%       0.202%       0.045%
    Oregano       1171180   34%    35%     37%        8%       0.073%       1.575%
    Allspice      2863430   36%    36%     26%       14%       0.267%       0.010%
    Assault        491080   49%    49%     29%        2%       0.308%       0.049%
    Combined      9062527   23%    31%     36%       13%       0.212%       0.232%

Table 8-3. File sharing by clients and the associated server consistency actions over a 20-day period, October 29 through November 19, 1989. The results are given for each server, and then for the combination of all the servers together, which represents the total file system traffic.

Last Writer
This value counts the files that were generated in a client's cache and then re-read or re-written by the same client before the 30-second delayed write period expired. Each of the servers except Assault sees a significant amount of this case, about 13% overall. This percentage is quite close to the percentage of files open for writing (15%; see Appendix B, Table B-3), which suggests that most data is re-read or re-written shortly after it is generated. Mint, for example, has log files that can be repeatedly updated by the same client. Oregano serves ``/tmp'', and compiler and editor temporaries account for the reuse of dirty files. Allspice has the system source directories, and compiler output often gets re-read by the linker. Assault is too lightly loaded to experience much of this behavior.

Server Action
This value indicates how often the servers had to issue cache control messages. ``Write-back'' indicates how many times the last writer of a file was told to write its version back to the file server. ``Invalidate'' indicates how many clients had to stop caching a file they were actively using because it became shared after it was opened. (The number of cache invalidations because of sequential write-sharing was not measured. As described above, this is done by the clients by checking version numbers, so it is not measurable on the servers.) Write-backs happen in less than 1% of the cases, which indicates that sequential write-sharing (within the delay period) between clients is rather rare. Invalidations are also rare, except on Oregano as described below.

The invalidations on Oregano are due to a temporary file used by pmake, our parallel compilation tool that uses process migration. pmake generates a temporary file containing the commands to be executed on the remote host. Initially, this file is cached on the host running pmake. During migration it is open by both the parent (pmake) and the child (a shell that will execute the commands on the remote host). These processes share a read-write I/O stream that the parent used to write the file and the child will use to read it. When the child migrates to the remote host the file server detects this as a case of concurrent write sharing and issues a write-back and invalidate command to the host running pmake. If the parent closed the file before the migration this would appear as sequential write sharing and contribute to the ``Write-back'' column instead.

Trace data from UNIX timesharing systems indicates that concurrent write sharing is very rare, and so is sequential write sharing between different users [Thompson87]. The UNIX files that are write-shared are database-like files such as the user-login database and system log files. Thompson found that these UNIX files only accounted for 1% or 2% of the open traffic. In Sprite, concurrent write sharing occurs on the same kind of files (databases and logs), but it accounts for 8% of the opens. Virtually all of this sharing is due to a database of host load averages that is used to select idle hosts as targets of migration. The database is updated once each minute by each host, so it is heavily shared. The process that updates a host's entry keeps the database file open all the time, thus forcing it to be uncachable on the clients. The heavy open traffic to this database occurs because virtually every compilation accesses the database in order to find an idle host on which to execute.

Note that the measurements in this section are in terms of files opened, not bytes transferred. Thus they give an overall measure of file sharing, but not necessarily a measure of how effective caching will be. Measurements presented below indicate how much I/O traffic there is to uncachable files and what the cache hit ratios are.

8.4. Measured Effectiveness of Sprite File Caches

This section presents results on the I/O traffic of the clients and servers, and it shows how effective the caches are during normal system use. The results of the study back up our original hypothesis that performance is dependent on the amount of memory on the clients, the more the better [Nelson88a]. It is not sufficient to upgrade the CPU performance without also increasing memory size. Extra memory reduces the load on the server by reducing paging traffic and increasing the effectiveness of the file cache. If memory is too small on a client then its performance is dominated by paging traffic, and the benefits of a faster CPU are lost. The other result is that because the client caches are so effective, it takes a relatively large cache on the file servers to yield any further benefit, such as reducing disk traffic.

8.4.1. Client I/O Traffic

This section presents the overall I/O traffic of the clients. Two metrics are given, I/O traffic from applications to the cache, and network traffic to the file servers. The network traffic includes all system effects, including paging and traffic to uncachable files. It represents the total load on the server from the client. (More detailed breakdowns of the traffic are given in the next sub-section.) Results for individual client workstations are given. The I/O traffic of a particular client is a function of its use. In our network the Sun3s place the most load on the servers, even though they are the slowest. This is because the Sun3s are the principal machines used for Sprite development, which includes large compilation jobs and debugging sessions. The other hosts are used more by the newer Sprite users, who apparently place less load on the system.

Table 8-4 gives the cache size and I/O rates for the Sun3 clients. The average, standard deviation, and maximum observed values are given for the megabytes of cache, bytes/second transferred to and from the cache, and bytes/second transferred to and from the server. The data was obtained by recording the total amount of data transferred by a client every three hours. The rates were computed over 3-hour intervals. The standard deviation gives a variance over different times of day. The maximum value is the largest rate observed in all the 3-hour intervals. The very high maximum rates (e.g. 100 Kbytes/sec on Sassafras) result from large simulation jobs that page constantly for long periods of time.
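The reduction from sampled counters to the rates reported in Tables 8-4 and 8-5 amounts to simple differencing and averaging. The sketch below shows the kind of computation involved; the data layout is an assumption made for this example and is not the actual post-processing code used for these measurements.

    /*
     * Compute average, standard deviation, and maximum of per-interval
     * transfer rates from periodically sampled byte counters.  The input
     * is a series of cumulative counter values taken every `period' seconds.
     */
    #include <math.h>

    typedef struct { double avg, dev, max; } RateStats;

    RateStats ComputeRates(const double counter[], int numSamples, double period)
    {
        RateStats s = {0.0, 0.0, 0.0};
        int i, n = numSamples - 1;
        double sum = 0.0, sumSq = 0.0, variance;

        for (i = 1; i < numSamples; i++) {
            double rate = (counter[i] - counter[i-1]) / period;  /* bytes/sec */
            sum += rate;
            sumSq += rate * rate;
            if (rate > s.max)
                s.max = rate;
        }
        if (n > 0) {
            s.avg = sum / n;
            variance = sumSq / n - s.avg * s.avg;
            s.dev = (variance > 0.0) ? sqrt(variance) : 0.0;
        }
        return s;
    }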


                    Sun3 Client I/O Traffic (Bytes/Second)

                  Cache Size (Mbytes)      Cache Traffic           Network Traffic
    Host          Meg    dev    max       bytes/s   dev    max     bytes/s   dev    max

    8 Meg
    Basil         0.98   0.80   3.88        323   1027    8496       382   1255   15476
      write                                  149    381    3916       195    556    5545
    Mace          1.33   0.91   4.88        742   1773   22222       581   1595   15643
      write                                  275    512    4343       277    596    4441
    Mustard       1.75   1.02   4.41        841   1894   11250       551   2063   29487
      write                                  323    729    6364       333   1433   24254
    Nutmeg        1.45   1.00   4.59        427   1770   20121       389   3943   57170
      write                                  147    416    4198       165    695    8869
    Sassafras     1.45   1.09   4.48        785   1601   14408      1271   6785  104469
      write                                  214    526    5842       656   4658   72701
    Sloth         1.12   0.94   4.31        929   2519   25140      1096   4126   75486
      write                                  360   1315   17582       575   2006   32038
    average       1.35   0.96   4.42        660   1764   16940       689   3294   49622
      write                                  243    647    7041       352   1657   24641

    12 Meg
    Fenugreek     3.84   1.81   8.23        904   2248   15198       628   2256   32142
      write                                  311    751    7141       322    922    8786
    Paprika       2.46   1.57   7.01        740   1806   13673       833   2866   32487
      write                                  262    570    4560       295    746    6476
    Sage          3.26   1.89   8.38        804   2873   66661       698   2722   56987
      write                                  302   1962   79765       355   1340   42117
    average       3.19   1.76   7.88        819   2309   31844       715   2615   40539
      write                                  293   1094   30489       326   1003   19126

    16 Meg
    Murder        5.79   2.96  12.59       6528  10736   73579      5863  10580   72507
      write                                  572   1574   18699       564   1527   15544
    Thyme         5.23   2.77  11.84       1128   2979   29651       744   2064   30169
      write                                  351   1533   29700       383   1567   29857
    average       5.51   2.86  12.22       3833   6857   51615      3309   6322   51338
      write                                  462   1554   24200       473   1547   22700

Table 8-4. Cache sizes and I/O traffic on the Sun3 clients, which are listed according to their memory size. The averages for each memory size are also given. Read traffic is on the first row for each host; write traffic is on the second row. These numbers were obtained by recording the bytes transferred by each client every three hours and computing the rate over that interval. The standard deviations give the variability of the rates over different 3-hour intervals, including intervals of relative inactivity. The ``Cache Traffic'' is the rate that bytes are transferred to and from the cache. The ``Network Traffic'' is the rate that bytes are transferred to and from the server over the network. Network traffic includes paging traffic, accesses to uncachable files, and traffic from cache misses.

The clients average about 1100 bytes/sec in combined read-write traffic to their cache, and about the same amount of traffic over the network. While the cache eliminates some network accesses, paging traffic adds additional network accesses. (A more detailed breakdown of the remote I/O traffic is given in the next sub-section.) The clients with larger memories have higher data rates, although this difference is most likely because they are used by the more intensive Sprite developers. Also, Murder has the tape drive used for nightly dumps, so its I/O traffic is skewed by its nightly scans of the file system.

In comparison, the study of UNIX timesharing hosts by Ousterhout et al. [Ousterhout85] found per-user I/O rates of 300-600 bytes/sec when averaged over 10-minute intervals, and rates of 1400 to 1800 bytes/sec when averaged over 10-second intervals. These are rates for active users only, and they do not include paging traffic. The average rates obtained for Sprite clients, about 1100 bytes/sec, include periods of inactivity. The I/O rates of an active Sprite client are better estimated by the peak rates given in Table 8-4. The peak I/O rates to the client caches range from 12000 bytes/sec to over 60000 bytes/sec, averaged over three hours. These large peak I/O demands result from large jobs that page heavily for long periods of time.

Table 8-5 gives the I/O traffic on the DECstation clients. Note that the I/O rates to the cache of the DECstations are not much different than on the Sun3s, even though the DECstation CPU is six times faster. The highest average rates on the Sun3 clients (i.e., 1479 bytes/sec on Thyme) are higher than the highest average rates on the DECstation clients (i.e., 1232 bytes/sec on Gluttony). This similarity suggests that the nature of the user affects the long-term I/O rates to the cache, and the speed of the workstation just increases the burstiness. (The 3-hour sampling granularity of these statistics precludes measurement of short-term rates.) The main difference between the DECstation and the Sun3 clients is that there is less network traffic generated by the DECstations. Each DECstation has 24 Mbytes of physical memory, while the Sun3s only have 8, 12, or 16 Mbytes. The extra memory on the DECstations reduces paging activity and it allows for larger file caches.

8.4.2. Remote I/O Traffic

The I/O traffic presented above includes remote traffic that is due to uncachable data, such as directories and the shared migration database, and it includes traffic due to page faults. Improved statistics were taken over an 11-day period to determine the contributions to the network traffic. Results for read traffic are given in Table 8-6. Cache miss ratios range from 17% to 65%, and the overall read miss rate is about 36%. (This average does not include Murder, the client that does nightly tape backups.) Perhaps the most significant result, however, is that cache misses only account for 29% of the remote data being read. The rest is due to page faults (38%) and uncachable data (33%). The page faults are mainly to read-only code files (23%) as opposed to the swap files used for dirty pages (15%). Thus the caches are effective in reducing network traffic, but there is still a considerable load from paging traffic and uncachable data.


                 DECstation 3100 Client I/O Traffic (Bytes/Second)

                  Cache Size (Mbytes)      Cache Traffic           Network Traffic
    Host          Meg    dev    max       bytes/s   dev    max     bytes/s   dev    max

    Apathy        7.62   1.32   7.94        723   1810   13986       242    833    6272
      write                                  323   1047    8755       210    452    3293
    Cardamom      6.61   2.72  16.43        411    942    5387       136    379    3555
      write                                  166    389    3974        98    229    1670
    Clove         7.27   1.73   7.94        452    759    3000       179    544    3669
      write                                  196    257    1267       113    165     811
    Forgery       6.38   2.43  11.65        384    935    8942       174    460    4011
      write                                  159    504    5841       121    446    5437
    Gluttony      5.13   2.02   7.94        861   3456   30498       951   3485   32875
      write                                  371    667    5241       205    332    2623
    Hijack        6.06   2.29  10.25        466   1711   18747       644   2731   41967
      write                                  233    534    4600       247    431    5981
    Kvetching     4.76   2.00   7.94        886   1524   11638       895   2113   22260
      write                                  247    584    5654       231    508    3505
    Parsley       6.88   2.13   8.94        471    971    7506       211    712    7767
      write                                  213    532    5240       121    229    1667
    Pepper        6.99   2.54  13.93        671   2242   18808       285   1640   16792
      write                                  339   1630   16812       201   1232   16478
    Pride         4.57   3.05  13.45        505   2252   19031       435   5618   42357
      write                                  210    612    5624       223   1835   25106
    Subversion    6.99   2.02   8.94        304    850    5856       105    531    6437
      write                                  138    339    2678        70    166    1330
    Violence      6.48   2.95   8.94        325   1649   18396       132   3732   51252
      write                                  144    321    2461       100    260    2212
    average       6.31   2.27  10.36        512   1592   13483       356   1898   19935
      write                                  214    618    5679       159    524    5843

Table 8-5. Cache sizes and read traffic on the DECstation clients, all of which have 24 Mbytes of main memory. Read traffic is on the first row for each host; write traffic is on the second row. These numbers were obtained by recording the bytes transferred by each client every three hours and computing the rate over that interval. The standard deviations give the variability of the rates over different 3-hour intervals, including intervals of relative inactivity. The ``Cache Traffic'' is the rate that bytes are transferred to and from the cache. The ``Network Traffic'' is the rate that bytes are transferred to and from the server over the network. Network traffic includes paging traffic, accesses to uncachable files, and traffic from cache misses.


                              Remote Read Traffic

    Host         Cache     Read      Miss     Remote     Miss     Code     Data    Uncache
    Kvetching     3.21    533.23   16.84%     879.94   10.20%   12.96%   24.07%   52.61%
    Fenugreek     5.89    394.46   18.63%     118.70   61.92%   27.68%    4.73%    5.51%
    Parsley      10.58    170.08   22.10%      67.60   55.59%   30.42%    3.39%    9.99%
    Piracy        9.89    176.19   25.45%      71.75   62.48%   30.80%    1.32%    5.33%
    Paprika       2.79    225.23   27.22%     578.91   10.59%   12.81%    1.79%   74.72%
    Mustard       3.86    352.80   30.96%     245.64   44.47%   30.30%   19.64%    5.31%
    Thyme         5.21    368.75   41.14%     642.34   23.62%   12.41%    3.44%   60.30%
    Mace          1.13    314.15   41.93%     338.31   38.94%   41.27%   16.42%    3.05%
    Burble        0.91     18.39   44.55%      41.63   19.68%   62.17%    7.47%   10.38%
    Sassafras     1.95    511.76   48.61%     463.85   53.63%   24.85%   17.11%    4.12%
    Sage          2.13    432.27   49.24%     550.76   38.64%   30.05%   28.48%    2.60%
    Gluttony      6.47    126.59   49.27%     281.46   22.16%   15.81%    1.38%   60.58%
    Sloth         1.09    219.48   64.67%     415.00   34.20%   38.61%   23.39%    3.36%
    combined*        -   3843.36   35.74%    4695.88   29.25%   25.64%   11.96%   32.91%
    Murder        4.84   1330.80   89.37%    1338.38   88.86%    5.89%    1.12%    4.02%

Table 8-6. Remote read traffic over an 11-day period. ``Cache'' is the average megabytes of cache. ``Read'' is megabytes read from the cache, and ``Miss'' is the cache miss traffic. ``Remote'' is the megabytes read from the file server, and ``Miss'' is the contribution from cache misses to this amount. ``Code'' is remote bytes read due to VM page faults on read-only code pages. ``Data'' is remote bytes read due to VM page faults on data pages. ``Uncache'' is uncachable bytes read, mainly directories and the migration database. * Murder is eliminated from the combined rates to eliminate effects from its nightly backups.

The large read traffic to uncachable data stems from a single application program that a few users run in order to continuously display the number of hosts available for process migration. The program scans the complete load average database every 15 seconds, and the clients that run this program stand out clearly in Table 8-6. With most clients, uncachable data accounts for just a few percent of the network read traffic. However, running this application can cause uncachable data to account for as much as 60% of the network read traffic! It is important to emphasize that this application is not needed for normal system operation. If the clients that run this program are factored out, then the remote traffic due to uncachable data is only 4% of the server read traffic, with 44% from cache misses and 52% from page faults. This is a more reasonable view of the network traffic.

Table 8-7 gives the contributions from the cache (51%), paging (37%), and uncachable files (11%) to the network write traffic. (The remaining 0.63% stems from writes to remote devices and remote windows.) This table also gives a true picture of the cache write traffic ratio, which is the ratio of bytes written to the server to the bytes written to the cache. The client traffic ratios range from 27% to 72%, averaging about 53%.


                            Remote Write Traffic

    Host         Cache    Write     Ratio     Remote     Cache     Swap    Uncache
    burble        0.91    23.06    27.02%      23.29    26.76%   17.29%    55.95%
    parsley      10.58   120.05    39.77%      66.87    71.40%    6.45%    22.03%
    paprika       2.79   109.37    39.78%      80.89    53.79%   29.58%    16.52%
    piracy        9.89   106.00    43.72%      57.09    81.18%    2.95%    15.55%
    mustard       3.86   181.29    43.91%     131.11    60.72%   24.74%    14.17%
    sassafras     1.95   207.82    46.52%     235.98    40.97%   50.14%     7.72%
    fenugreek     5.89   134.35    53.63%     103.22    69.80%   14.89%    15.15%
    kvetching     3.21   103.92    57.37%     185.06    32.22%   61.24%     6.31%
    mace          1.13   179.20    57.69%     172.36    59.98%   31.72%     8.23%
    sage          2.13   171.91    57.76%     244.25    40.65%   51.78%     5.79%
    thyme         5.21   162.49    62.29%     161.58    62.64%   26.36%     8.83%
    sloth         1.09   114.13    66.05%     201.03    37.50%   54.04%     8.46%
    gluttony      6.47    94.84    72.13%      99.69    68.63%    5.91%    25.37%
    combined         -  1708.44    52.65%    1762.42    51.04%   36.98%    11.29%

Table 8-7. Remote write traffic over an 11-day period. ``Cache'' gives the average cache size, in Mbytes. ``Write'' gives the Mbytes written to the cache during the period. ``Ratio'' gives the percentage of the ``Write'' column that was written out of the cache back to the server. ``Remote'' gives the amount of data written to the server, in Mbytes. This is broken down into the contribution from cache write-backs (``Cache''), paging traffic (``Swap''), and uncachable files (``Uncache'').

These measurements indicate that the 30-second aging period for dirty cache data is effective in trapping short-lived data in the cache; about half the data is never written back. However, paging can be significant for some of the clients. Sassafras, for example, is a compute server used for long-running simulations (and processing of the data presented here!), and its paging traffic accounts for about half of its network traffic. Write traffic to uncachable files is also significant, and the bulk of this comes from periodic updates of the migration database by each client.
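The aging policy itself is simple. The sketch below shows the general idea of the periodic write-back scan, assuming a dirty-block list with per-block modification times; the block structure and the write-back routine are illustrative stand-ins, not the actual Sprite kernel code.

    /*
     * Periodic scan that writes back dirty cache blocks once they have aged
     * for at least WRITEBACK_AGE seconds.  Data that is deleted or
     * overwritten before it ages out never generates server traffic.
     * CacheBlock and WriteBlockToServer are illustrative stand-ins.
     */
    #include <time.h>

    #define WRITEBACK_AGE 30        /* seconds a dirty block may age in the cache */

    typedef struct CacheBlock {
        struct CacheBlock *next;
        int    dirty;
        time_t dirtiedAt;           /* time the block was last modified */
    } CacheBlock;

    static void WriteBlockToServer(CacheBlock *block)
    {
        /* Send the block's data to the file server (stub). */
        (void)block;
    }

    void AgeDirtyBlocks(CacheBlock *dirtyList)
    {
        time_t now = time((time_t *)0);
        CacheBlock *block;

        for (block = dirtyList; block != 0; block = block->next) {
            if (block->dirty && now - block->dirtiedAt >= WRITEBACK_AGE) {
                WriteBlockToServer(block);
                block->dirty = 0;
            }
        }
    }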

8.4.3. Server I/O Traffic

This section presents the I/O traffic from the standpoint of the file servers. In the case of a file server it is interesting to compare the traffic to its cache to the traffic to its disks. Two metrics are given, the ``File Traffic'' and the ``MetaData Traffic.'' Metadata is data on the disk that describes a file and where it lives on disk. This includes the descriptor that stores the file's attributes, and the index blocks used for the file map. The ``File Traffic'' represents I/O to file data blocks as opposed to the metadata information. The combination of file traffic and metadata traffic gives the total disk traffic for the file server.

Figure 8-2 has the server I/O traffic for a combination of all servers averaged over a 6-month study period. The graphs indicate that the server caches are effective for reads but less effective for writes.

[Figure 8-2 consists of two graphs of server I/O traffic, in bytes/second versus hour of day (0-24), one for read traffic and one for write traffic. Each graph plots four curves: cacheBytes, fileBytes, metaDataBytes, and remoteBytes.]

Figure 8-2. Server I/O traffic averaged from July 8 to December 22, 1989. The left-hand graph has read traffic and the right-hand graph has write traffic. The total disk traffic is the sum of ``fileBytes'' and ``metaDataBytes''.

Client caches trap out most of the short-lived data so most data written to a server ends up being written to disk. The write graph highlights the large amount of write traffic to metadata. File descriptors have to be updated with access and modify times, so just reading a file ultimately causes its descriptor to be written to disk. This effect has been noted by Hagmann [Hagmann87], who converted the CEDAR file system to log metadata changes, which reduced metadata traffic considerably.

Table 8-8 summarizes the results of a 20-day study period, from October 29 through November 19, 1989. This study was made after a bug was fixed that prevented continuously updated files from being written through to disk. With this bug present, Mint's traffic ratio was about 50% for file data, while the other servers had write traffic ratios of 70% to 80% or more. Mint's low traffic ratio prompted a search for a bug in the cache write-back code, and this 20-day study was made after it was fixed. Additional per-server measurements of I/O traffic are given in Appendix B, along with the detailed results of the 20-day study period.

The cache on Mint, the root server, is effective in eliminating reads (40% misses in the 20-day study), but not so good at eliminating writes (94% traffic ratio in the 20-day study). Its read hits occur on frequently-used program images and the load average database. About half the write traffic to Mint is to the load average database, which is updated by each client once per minute.


                    Server Traffic Ratios

    Host        Read   ReadAll   Write   WriteAll
    Mint         40%     43%      94%      542%
    Allspice     52%     57%      74%      107%
    Oregano      72%     83%      86%      225%
    Assault      82%     92%      55%      121%
    Combined     54%     59%      77%      179%

Table 8-8. Summary of the results of a 20-day study of server I/O traffic. The ``Read'' and ``Write'' columns give the percentage of cache data that was transferred to and from disk. A low percentage means that the caches were effective in eliminating disk traffic. The ``ReadAll'' and ``WriteAll'' columns include traffic to metadata so they give the total disk traffic ratio. A more detailed breakdown of these results is given in Appendix B.

The metadata write traffic on Mint is quite high due to these frequently modified files. Furthermore, the 128-byte descriptors are written 32 at a time in 4K blocks so there is extra traffic from unmodified descriptors. Allspice and Oregano are directly comparable because they store the same type of files (many files were shifted from Oregano to Allspice during the 20-day study period). Allspice's cache is about 10 times the size of Oregano's and it is clearly more effective. This is to be expected because the server's cache is a second-level cache, with the clients' caches being the first level. The servers' caches have to be much larger than the clients' caches because the locality of references to their cache is not as good.

It is also interesting to see how the caches skew the disk traffic towards writes. During the 20-day study period, the traffic to the server caches was about 22% writes. The traffic to the server disks was 26% metadata writes and 20% data writes. If the metadata traffic is discounted as an artifact, then the data writes accounted for 40% of the disk traffic. The skew towards writes at the disk level should continue as the server caches get larger and more effective at trapping reads.

The server caches also have to be large enough in relation to the amount of disk space the server has. Measurements by Burrows [Burrows88], for example, indicate that 75% of the file system data is not accessed in over a week, and only 10% of it is modified in that time. Measurements of Sprite indicate that over 90% of the file system data has not been read in over a day, and over 50% has not been read in over a month. These results hint that the active ``working set'' of a file system is probably much smaller than the amount of disk storage, and that a server cache that is a few percent of its disk space could cache a reasonable file system working set. In our system Mint has one 300-Mbyte disk and a 9-Mbyte cache, which is a ratio of cache to disk of 3%. Allspice has 2.4 Gbytes of disk storage and an 80-Mbyte cache, which is a ratio of about 3.3%. These two servers have rather effective caches. Oregano has four 300-Mbyte disks and about the same cache size as Mint, which is a ratio of 0.75%, and its cache is much less effective. Assault has 600 Mbytes of disk, and about 8 Mbytes of cache, for a ratio of 1.3%. It is a lightly used server, however, and half its disk space is rarely used at all. If the second disk is ignored then it also has a cache/disk ratio of about 3%, and its cache is also effective. Of course, it is dangerous to generalize from these few data points, but it seems clear that a server cache size that is a few percent of its disk space is better than a cache of 1% or less of the disk space.

8.5. Variable-Sized Caches

An important feature of Sprite caches is that they vary in size in order to make use of all available memory. Nelson [Nelson88b] explored ways of trading memory between the file system and the virtual memory system, which needs memory to run user programs. The basic approach he developed was to compare LRU times (estimated ages) between the oldest page in the FS cache and the oldest VM page and to pick the older of the two for replacement. Nelson found that it was better to bias in favor of the VM system in order to reduce the page fault rate and provide a good interactive environment. The bias is achieved by adding an offset to the LRU time of the VM system so that its pages appear to be referenced more recently than they really were. We have chosen a bias against the file system of 20 minutes. Any VM page referenced within the last 20 minutes will never be replaced by a FS cache page. This policy is applied uniformly on all hosts, and it adapts naturally to both clients and servers. Servers use most of their memory for a file cache, while clients use most of their memory to run user programs.
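A minimal sketch of this comparison is shown below, assuming the LRU time of the oldest page in each subsystem is available as a timestamp in seconds. The function and constant names are illustrative, not the actual Sprite kernel interfaces.

    /*
     * Decide whether the next page to be replaced comes from the file
     * system cache or from the virtual memory system.  The VM system's
     * oldest page is made to look 20 minutes younger than it really is,
     * so a VM page referenced within the last 20 minutes is never
     * displaced by file cache growth.
     */
    #define FS_BIAS_SECONDS (20 * 60)   /* bias against the file system */

    enum Victim { REPLACE_FS_PAGE, REPLACE_VM_PAGE };

    enum Victim ChooseVictim(long oldestFsPageTime, long oldestVmPageTime)
    {
        /* Smaller LRU time == referenced longer ago == older page. */
        long biasedVmTime = oldestVmPageTime + FS_BIAS_SECONDS;

        return (oldestFsPageTime <= biasedVmTime) ? REPLACE_FS_PAGE
                                                  : REPLACE_VM_PAGE;
    }

Because the same comparison is made on every host, a server whose user programs are idle naturally lets its file cache grow, while a client running large programs keeps most of its memory in the VM system.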

8.5.1. Average Cache Sizes

Table 8-9 gives the average and maximum cache sizes as measured over the study period. The file servers are listed individually. The clients are grouped according to the amount of physical memory and processor type of the host, and the results are averaged. The adaptive nature of the cache sizes is evident when comparing clients and servers with the same memory size; the file servers devote more of their memory to the file cache. This difference is not achieved via any special cases in the implementation, but merely by the uniform application of the 20-minute bias against the file system described above. The cache occupies a larger percentage of main memory as the memory size increases, indicating that the extra memory is being utilized more by the file cache than by the VM system. This trend is most evident on the Sun3 clients, of which there is a good-sized population, and on which the Sprite implementation is the most mature. Doubling the physical memory on a Sun3 client more than doubles the average cache size on the client; it increases from 17% to 34% of the physical memory.

The variability of the client caches is indicated by the standard deviation and the maximum observed values. The variability tends to increase as the memory gets larger, indicating that the cache is trading more memory with the VM system. The DECstations have lower variability because their cache was limited to about 8.7 Mbytes during most of the study period. (This limitation is a software limit that has been recently lifted.)


                        Cache Size (Megabytes)

    Host        Mem      Average         Std Dev         Maximum
    Allspice*   128    67.27   52%     22.14   17%     78.13   61%
    Assault      24     7.49   31%      4.55   19%     16.50   69%
    Mint         16     9.03   56%      1.23    8%     11.80   74%
    Oregano      16     8.59   54%      1.77   11%     12.06   75%

    Sun3          8     1.35   17%      0.96   12%      4.42   55%
    Sun3         12     3.19   27%      1.76   15%      7.88   66%
    Sun3         16     5.51   34%      2.86   18%     12.22   76%

    Sun4         12     2.09   17%      1.73   14%      6.87   57%
    Sun4         24     6.00   25%      3.70   15%     13.43   56%

    DS3100       24     6.31   26%      2.27    9%     10.36   43%

£ DS3100 24 6.31 26% 2.27 9% 10.36 43% ¡ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¡ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¡ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¡ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¡ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ Table 8-9. Cache sizes as a function of main memory size and processor type, averaged over the 4-month study period. The average, standard deviation, and maximum values of the observed cache sizes are given. The sizes are megabytes and percentage of main memory size. The ®le servers are listed individually. The rest of the clients are aver- aged together based on CPU type and memory size. * Allspice's cache was limited to at most 78.13 Mbytes.

The results from the servers show that Mint and Oregano can only devote a little over half their memory to their file caches. The reason they cannot use more is the increase in their kernel size as they accumulate more data structures that support their caches. (This effect is measured in more detail below.) In the case of Allspice, however, the limitation on its cache size is due to an artifact of the memory mapping hardware on the Sun4. The file cache uses hardware page map entries, and if it gets too large it can cause extreme contention for the few remaining map entries. Its cache is fixed arbitrarily at about 80 Mbytes because of this. Even with this limitation, the cache on Allspice is 10 times the size of the caches on Mint and Oregano, and the measurements presented in Section 8.4.3 indicate that a server needs a cache this large for the cache to be really effective.

8.5.2. Other Storage Overhead

One of the negative lessons we learned is that the file servers can use up a considerable amount of their memory in data structures that support the file system. These data structures include the stream and object descriptors described in Chapter 3, buffer space for RPC server processes, hash table buckets, and other supporting structures. The total kernel size and the number of stream and object descriptors were measured, but a complete accounting of a file server's memory use was not attempted.

A complete breakdown of the object descriptors kept by both clients and servers is given in Table 8-10. The table gives the average number of object descriptors of various types, as well as the average cache size. The top row of the table gives the size of each type of descriptor.


              Average Object Descriptor Usage, August-September '89

    Host         Stream   File  RmtFile  Pdev  Ctrl  Dev  Rmt    Cache Bytes   State Bytes
    (size)        (100)  (268)    (216) (288) (140) (136) (72)

    Mint           2722    927      130    17   250     6    1       9498624        589500
    Oregano         956   1337       44    60    21     9    0       8806400        484864

    Basil           181      0      255    27    11     5    9       1232896         83824
    Fenugreek       217      0      520    31    12     5    5       3575808        145668
    Mace            177      0      506    25     9     5    2       1421312        136280
    Murder          210     72     1879    33    12     6    2       6873088        458304
    Mustard         231      0      302    34    14     4    4       1679360        100916
    Nutmeg          157      0      225    25     8     4    0       1298432         73164
    Paprika         199      0      390    30    11     5    3       2486272        115216
    Sage            213      0      625    32    12     4    5       3260416        168100
    Sassafras        93      0      239    15     5     4    0       2187264         66488
    Sloth           207      0      421    32    12     5    8       1015808        123788
    Thyme           190      0      447    31    11     5    1       6209536        126772
    AveSun3         189      7      528    29    11     5    4       2840018        145320
    AveSun3*        187      0      393    28    11     5    4       2436711        114022

    Cardamom        143      0      359    32     8     4    2       5992448        102868
    Forgery         237      0      571    53    15     4    6       6332416        165376
    Hijack          174      0      409    38    10     5    1       6017024        118840
    Kvetching       308      0      641    58    14     4    3       4898816        188680
    Parsley         119      0      440    26     7     4    3       6828032        116168
    Pepper           69      0      406    12     5     4    1       6483968         99368
    Piquante         94      0      400    19     7     4    1       5238784        102868
    Piracy           72      0      176    14     4     4    0       1691648         50352
    Pride           141      0      753    31     8     4    1       5431296        187412
    Subversion      129      0      326    30     9     4    1       6172672         93832
    Violence        122      0      284    27     7     4    2       5578752         82988
    AveDS3100       146      0      433    31     9     4    2       5515078        118977

    AvgClient       167      3      481    30    10     4    3       4177548        132149
    AvgClient*      165      0      414    30     9     4    3       4049189        116618

Table 8-10. The average number of object descriptors during August and September 1989. The first row indicates the size of each descriptor in bytes. The last column gives the size of all of a host's descriptors together. For comparison, the average cache size of the host is also given. The other columns give a raw count of the number of object descriptors of different types. ``Stream'' counts stream descriptors, including the shadow stream descriptors on the servers. ``File'' is the file descriptors kept by the file servers, and ``RmtFile'' is the remote file descriptors kept by the clients. ``Ctrl'' is a per-pseudo-device descriptor, and there is one on both the file server and the I/O server. ``Pdev'' is a per-pseudo-device-connection descriptor, and these are only on the I/O server. ``Dev'' is device descriptors. ``Rmt'' is remote descriptors, either remote devices or remote pseudo-devices. * Murder is factored out of the starred (*) averages in order to remove the skew introduced by its nightly scans of the file system when doing tape backup.

The last column gives the average state size, which is obtained by multiplying each descriptor size by the average number of descriptors of that type and summing over all descriptor types. The average cache size is given for comparison. On average, the amount of memory used for object descriptors is not that great. The file servers average around 0.5 Mbytes of descriptors, and the clients average over 0.1 Mbytes of descriptors. A client workstation averages about 180 open I/O streams at any given time. Of these, about 70 are related to pseudo-devices: about 10 control streams to server processes, and about 30 request-response streams (``Pdev'') between clients and servers. (There are two streams for each ``Pdev'' descriptor, one for the client and one for the server.) About 90 streams are to files, and there are a handful of streams to devices and remote devices or pseudo-devices. The large number of streams to files is due in part to the I/O streams to program images (executable files) and paging files.

The kernel size and cache size for the root file server over a four-month period are given in Table 8-11. The kernel size includes all code and data in the kernel, but it does not include the cache blocks. Mint has 16 Mbytes of main memory, and on average its kernel occupies about 4 Mbytes and its file cache about 9 Mbytes. For comparison, the kernel code and data of a diskless client occupies about 1.1 Mbytes. Note that the maximum observed cache size plus the maximum observed kernel size is greater than the 16 Mbytes available on Mint. The server's memory usage increases over time as it accumulates more state information, and this reduces the memory available to the cache. Table 8-11 also gives the number of stream and file descriptors maintained by the root server, and the amount of memory they occupy.

          Mint's Kernel Size vs. Cache Size and Object Descriptors

    Month     Streams          Files            Cache Meg         Kernel Meg
              Avg     Max      Avg     Max      Avg     Max       Avg     Max
    JUL       0.20    0.46     0.26    0.59     9.22    11.80     3.67    4.77
              2116    4860     878     1982
    AUG       0.20    0.49     0.26    0.58     8.99    11.57     3.78    5.20
              2066    5149     867     1962
    SEP       0.27    0.76     0.28    0.60     9.07    11.36     4.07    5.62
              2797    7981     925     2006
    OCT       0.23    0.68     0.26    0.59     8.98    11.16     3.96    5.48
              2421    7101     870     1973

Table 8-11. The top row for each month gives the size in megabytes of various components of the kernel. ``Streams'' is stream descriptors. ``Files'' is file descriptors. ``Cache'' is the size of the file cache, and ``Kernel'' is the size of the kernel's code and data. ``Kernel'' includes the ``Streams'' and ``Files'' space, but not the ``Cache'' space. The bottom row for each month gives the average and maximum number of object descriptors. A stream descriptor is 100 bytes (!) and a file descriptor is 312 bytes. Note: the maximums for the different fields did not necessarily occur at the same time.

The growth of the network is also evident in the increase in the number of stream descriptors from July to October.

A comparison of the kernel sizes given in Table 8-11 to the amount of memory used for stream and file descriptors indicates that there is still a considerable amount of memory unaccounted for. There are two main contributions to the extra memory used by the file server's kernel. First, there is a ``high-water'' effect in the kernel's memory allocator. We use a binned allocator that keeps blocks of memory (``bins'') reserved for objects of a given size. The allocator adds chunks of real memory to its bins, and then it allocates from its bins. It does not attempt to coalesce memory or return chunks of memory to the rest of the system. Thus, the real memory behind each bin will remain at a high-water mark associated with the peak demand for objects of that size.

A second significant use of memory comes from the RPC server processes of the file servers. Packet handling is optimized by keeping pre-allocated buffer space for each server process, about 20 kbytes. Also, each kernel process has a private stack of 16 kbytes. A Sprite kernel starts with 2 RPC server processes, and more are created upon demand. A file server can have up to 50 RPC server processes, which can account for 1.8 Mbytes in stacks and buffer space. This memory use could be reduced by introducing a shared buffer pool, or by setting the limit below 50 server processes. This limit is somewhat arbitrary because the server processes are multiplexed among clients.
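To make the high-water effect of the binned allocator described above concrete, here is a minimal sketch of such an allocator. The names and sizes are illustrative, malloc() stands in for taking real memory from the rest of the system, and this is not the Sprite kernel's actual allocator.

    #include <stdlib.h>
    #include <stddef.h>

    #define NUM_BINS   16
    #define CHUNK_SIZE 4096                /* real memory added to a bin at a time */

    typedef struct FreeBlock {
        struct FreeBlock *next;
    } FreeBlock;

    static FreeBlock *binFreeList[NUM_BINS];   /* one free list per object size class */

    static size_t BinSize(int bin) { return (size_t)(bin + 1) * 32; }

    /*
     * Allocate an object from the bin for its size class.  When a bin is
     * empty, a whole chunk of real memory is carved into blocks and added
     * to that bin's free list.  Freed blocks go back to the bin, but real
     * memory is never coalesced or returned to the rest of the system, so
     * each bin stays at the high-water mark of its peak demand.
     */
    void *BinAlloc(int bin)
    {
        if (binFreeList[bin] == NULL) {
            size_t size = BinSize(bin);
            char *chunk = malloc(CHUNK_SIZE);
            if (chunk == NULL) {
                return NULL;
            }
            for (size_t off = 0; off + size <= CHUNK_SIZE; off += size) {
                FreeBlock *b = (FreeBlock *)(chunk + off);
                b->next = binFreeList[bin];
                binFreeList[bin] = b;
            }
        }
        FreeBlock *b = binFreeList[bin];
        binFreeList[bin] = b->next;
        return b;
    }

    void BinFree(int bin, void *obj)
    {
        FreeBlock *b = obj;
        b->next = binFreeList[bin];
        binFreeList[bin] = b;
    }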

8.6. Conclusion

This chapter has reviewed the Sprite caching system and reported on its performance when supporting day-to-day work in our user community. The performance results indicate that the caches on diskless clients are effective in satisfying read requests (miss traffic is about 36%) and in eliminating write-backs to the servers (the write traffic ratio is about 53%). It appears that adding memory to a Sprite client reduces paging activity and increases the effectiveness of the file caches, as expected. However, because the client caches are so effective, the caches on the servers need to be rather large in order to reduce their disk traffic. Comparing server cache size to disk storage, it appears that a server cache that is a few percent of its disk capacity can be effective in cutting disk traffic.

The straightforward approach to cache consistency, which is to disable client caches during concurrent write sharing, has worked out fine in our environment. There is very little consistency-related traffic between the servers and clients. In less than 1% of the opens did the servers have to issue cache control messages, so the approach does not add much overhead. Where there is concurrent write sharing the files are cached in the main memory of the server, and accesses to the shared data are not too expensive. Traffic to uncachable files (mainly the shared migration database) was significant. It accounted for 8% of the opens, 4% of the data read from the servers, and 11% of the data written.

A few interesting negative results stand out. First, our file servers generate a lot of traffic to their file descriptors, often just to update the access time attribute that is stored there. This traffic is an artifact of our disk sub-system, which has never been tuned. Current research in Sprite focuses on log-structured file systems, which should address this problem. Second, paging activity on a client with a small memory can dominate its regular I/O traffic. This traffic does not reflect a problem with our caching system, however, and it highlights the benefit of putting large memories on workstations. Finally, the one negative result that is a function of the caching system concerns the amount of state maintained by the servers. File servers can keep a considerable amount of data structures to support the file system, enough to limit the size of their file system caches. This limit does not have a large performance impact as yet, mainly because the client caches are so effective. However, it implies that servers need a lot of memory, not just for caching data but also to maintain the data structures that support their clients' caches.

CHAPTER 9

Conclusion

9.1. Introduction

This chapter reviews the important contributions of this dissertation. The general approach of this dissertation is to use a shared network file system as the base for a distributed system. The central role of the file system requires a number of novel approaches to the problems of naming, remote access, and state management. These issues are discussed below.

9.2. The Shared File System

The Sprite distributed file system provides access to devices and services, as well as files. The file system's mechanisms for naming, remote access, and state management are applied to all of these resources. The file system acts as a name service, in contrast to other distributed systems that introduce a separate network name service. There are two primary advantages of this approach. First, users and applications are presented with the same system model that they have in a stand-alone system. Resources are accessed via the standard open-close, read-write file system interface regardless of their type and network location. The complexity of the system model is not increased by the underlying distribution of the system. The second advantage of this approach is that the operating system mechanisms provided to support remote file access are reused when accessing devices and services. Implementation of these mechanisms inside the operating system kernel allows for efficient sharing of the mechanisms. Furthermore, by reusing the file system mechanisms instead of adding a separate name service, file access is optimized because only a single server is involved.

Remote device access and remote service access create problems that are not present with remote file access. First, devices and arbitrary user-implemented services have indefinite service times, while file accesses have bounded service times. Unlike other distributed file systems, the Sprite file system has general remote waiting and synchronization facilities that account for indefinite service times. Long-term waiting is done without using up critical resources on the servers. The select operation can be used to wait on any collection of files, devices, and services, regardless of their location in the distributed system. Another problem created by the addition of devices and services is that these resources are not co-located with their names; the I/O server for a device or service can be different from the file server that implements its name. The internal software architecture cleanly distinguishes its naming and I/O interfaces so that different servers can implement these interfaces.

The distributed name space is implemented with a light-weight naming protocol that is unique to Sprite. Each host's view of the name space is defined by a prefix table that is maintained as a cache. Remote links in the name space trigger the addition of new prefixes to a host's prefix table, and the server for a prefix is located using a broadcast protocol. There are two primary advantages of Sprite's distributed naming mechanism. First, it reuses the directory lookup mechanisms of the file servers so there is no duplication of effort between a name service and the file service. Second, the system is decentralized; no host has to have a complete picture of the system configuration. The system adapts to changes in the configuration automatically, so it is easy to manage the system as it expands. In contrast to other distributed naming systems, Sprite's naming system is optimized for the local area network environment by eliminating the globally replicated configuration information used in larger scale naming systems.

9.3. Distributed State Management

While Sprite presents a simple system model to users and applications, internally it must solve some difficult problems in distributed state management. Unlike many systems that use ``stateless'' servers for simplicity, Sprite servers keep track of how their resources are being used so they can provide high-performance services and retain the semantics of a stand-alone system. There are three interrelated features of Sprite that contribute to the complexity of its state management: data caching in the main memory of diskless clients, migration of actively executing processes among hosts in the network, and automatic recovery from the failure of hosts.

Sprite's caching system motivates the need to keep accurate state about the use of resources. The caching system provides high-performance access to file data, yet accurate state about the caching system is required so the servers can maintain the consistency of cached data. While the caching system was originally described elsewhere [Nelson88a] [Nelson88b], this dissertation describes how the state that supports the caching is maintained efficiently during normal operations, during the migration of processes between hosts, and across server failures. Additionally, this dissertation presents a follow-on performance study of the caching system as it is used to support real users.

The servers protect their state by trusting their clients to keep redundant copies of the state. The servers, in turn, maintain their own state on a per-client basis so they can clean up after clients fail. This state includes a record of open I/O streams, as well as caching information. If a server fails, it can rebuild its state with the help of its clients through a state recovery protocol. An important property of the state recovery protocol is that it is idempotent; it can be invoked by the clients at any time and it will always attempt to reconcile the server's state with that of the clients. Idempotency is important because network partitions can lead to unexpected inconsistencies between the state on the client and the server.

Maintaining a server's state is complicated by process migration and the concurrency present in the system. I/O streams can be shared by processes that can migrate to different hosts, and distributed state changes have to be made to account for this. Processes can operate on a stream concurrently, so the distributed state changes have to be carefully synchronized. This dissertation describes a callback scheme that coordinates the distributed state changes with a deadlock-free algorithm. Migration also creates a problem with shared I/O streams because processes on different hosts can end up sharing an I/O stream and its current access position. The system uses shadow stream descriptors on the I/O server to keep the stream offset when a stream is shared among processes on different hosts.

9.4. Extensibility

Sprite takes a novel approach to system extensibility. User-level services are transparently integrated into the distributed file system; they appear as device-like objects (pseudo-devices) or as whole sub-trees of the file system (pseudo-file-systems). Pseudo-devices are used to implement network protocols, terminal emulators, and the display server for the window system. Pseudo-file-systems are used to provide access to foreign kinds of file systems from within Sprite. This approach is higher-level than that of a message-passing system that provides just a simple name service and a low-level communication protocol. Additional features such as remote synchronization and state management are provided by the distributed file system and are available to the user-level services. This approach is in contrast with the commonly found approach of moving major system services like the file system out of the operating system kernel. Instead, the file system implementation is kernel-resident for efficiency and so that its mechanisms can be shared by user-level services.

9.5. Future Work

This dissertation has addressed a broad range of issues that concern a shared distributed file system for a local area network environment. There are a few areas in which further work is possible. First, the maximum size of the system is still unknown. We would like to support a system of up to 500 clients and their associated servers, but we have only had experience with a system one tenth this size. A full-sized system will stress the limits of our fully-shared file system. In particular, the broadcast-based naming protocol will have to be augmented to deal with a local-area internetwork in which broadcast may not reach all hosts. Efficiently managing the system's state as the system grows may also pose difficulties. Already our file servers devote a considerable amount of memory to supporting data structures as well as their file caches.

Regrettably, the extension of the caching and recovery systems to include pseudo-devices and pseudo-file-systems has not been implemented yet. (It's just a small matter of programming!) Using the cache for our NFS pseudo-file-system, for example, would improve access times considerably. The recovery system could also be useful to a variety of user-level services. With these extensions in place, a user-level service would be a full partner with the kernel-resident services.

9.6. Conclusion

Several key ideas are presented in this dissertation. First, the fully-shared file system takes you most of the way towards a well-integrated distributed system. Cooperative work is easy, workstations are just ``terminals'' that provide access to the computing environment, and the administrative effort is comparable to that of a large timesharing system instead of many independent systems. Second, high performance and good reliability are possible using techniques that exploit large memories (for caching) and redundancy in the distributed system (for recoverability). With this approach, the best way to improve the performance of your workstation is to add more memory, as opposed to adding a local disk. Larger memories reduce paging by the VM system and increase the effectiveness of the file cache. Local disks add heat, noise, and administrative hassles, and economies of scale argue for keeping large disks on shared file servers. Finally, while extensibility of the system is important, it is also important to keep system extensions well integrated into the computing environment. The pseudo-device and pseudo-file-system mechanisms achieve this goal, and they extend features built into the kernel, in particular the distributed name space and remote access, to user-level services.

APPENDIX C

Sprite RPC Protocol Summary

1. Protocol Overview

This is a brief description of the Sprite Remote Procedure Call (RPC) network protocol. The protocol is used by the Sprite operating system kernel for communication with other Sprite kernels. It is modeled after the Birrell-Nelson protocol, which uses implicit acknowledgments so that in the normal case only two packets are exchanged for each RPC. Fragmentation is also implemented to improve the performance of large data transfers, and a partial retransmission scheme is used if fragments are lost. The description given here includes a detailed look at the format of the RPC packet header and its relation to other protocols. A more detailed description can be found in [Welch86a].

The Sprite RPC protocol is used for a request-response style of communication between the operating systems on two Sprite hosts. One host, called the ``client'', issues a request to the other, called the ``server'', which responds to the request by taking some action and returning a response. The technique of implicitly acknowledging requests and responses [Birrell84] is used so that normally only two messages are exchanged for each RPC. The server's response message implicitly acknowledges the receipt of the client's request message, and the client's next request message implicitly acknowledges the server's previous reply.

There is a ``soft binding'' between clients and servers in order to implement the implicit acknowledgment scheme. Each Sprite host has a number of RPC client channels (or ports), and each Sprite host has a set of RPC server processes identified by a port number. Initially clients do not specify a server port, and the server assigns a free server process to a new request. The port number of the server is returned to the client, and the client directs subsequent requests from that channel to the same server process. Thus there is a sequence of RPCs between a given client channel and a given server process. The soft binding between the client channel and the server process is broken by the server when it does not receive a new request from the client after a threshold period of time. To close the connection it sends a ``close-acknowledge'' message to the client to make sure that the last reply was successfully received.

Request and reply messages are allowed to be larger than the maximum transmission size of the underlying transport protocol (either IP or Ethernet). Large messages are divided into fragments, and the same implicit acknowledgment scheme is used. Upon receipt of all the fragments of a request, for example, the server takes an action and returns a single reply message. The individual fragments are not individually acknowledged.

The RPC protocol uses an unreliable datagram protocol to deliver messages, either raw Ethernet packets or the IP datagram protocol. In the case of packet loss the client, or requesting host, takes the active role in ensuring that the RPC completes. It sets up a timeout and will retransmit its request if it does not get a response before the timeout period expires. If its request is fragmented, only the last fragment is retransmitted. The server responds to retransmissions with explicit acknowledgment messages. If the request is fragmented, the server's acknowledgment message indicates which fragments have been received. If a reply message is fragmented and some fragments are lost, the client retransmits its request and the packet header indicates which fragments of the reply have been received.
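For illustration, here is a minimal sketch of how the sender of a fragmented message might act on such a partial acknowledgment. The names and the SendFragment routine are hypothetical, not the actual Sprite implementation.

    #include <stdint.h>

    #define RPC_MAX_FRAGS 16            /* current limit in the implementation */

    /* Hypothetical per-message state kept by the sender of a fragmented message. */
    typedef struct FragMessage {
        int   numFrags;                 /* total number of fragments in the message */
        char *fragBuf[RPC_MAX_FRAGS];   /* stand-ins for the individual fragment buffers */
        int   fragLen[RPC_MAX_FRAGS];
    } FragMessage;

    /* Assumed to be provided by the network output routine. */
    extern void SendFragment(const FragMessage *msg, int index);

    /*
     * Retransmit only the fragments the peer has not seen.  ackMask is the
     * fragMask field from a partial acknowledgment: bit i is set if fragment
     * i has already been received.
     */
    void ResendMissingFragments(const FragMessage *msg, uint32_t ackMask)
    {
        for (int i = 0; i < msg->numFrags; i++) {
            if ((ackMask & (1u << i)) == 0) {
                SendFragment(msg, i);
            }
        }
    }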
After these ``partial acknowledgments'' the sender of a fragmented message will retransmit only the lost fragments.

Note that the protocol does not rely on fragmentation by the underlying transport protocol. There are two reasons for this. The first is that by providing its own fragmentation the RPC protocol can provide efficient transfer of large blocks of data without depending on a specific transport protocol. Second, datagram protocols cannot implement partial retransmission in the case of lost fragments; they discard the whole packet if any fragments are lost. This is undesirable when communicating between hosts of different speeds, where overrun at the slow host is not uncommon. The RPC protocol header includes a desired inter-fragment delay that hosts use to throttle down their transmit rate when sending fragmented messages to slower hosts.

The protocol header has some support for crash detection and graceful handling of server reboots. The header includes a bootID that is reset to the network time each time a host boots up. This allows other hosts to detect reboots by noting a change in the bootID. There is also a flag bit in the header that hosts set while they are booting. This indicates to other hosts that the booting host is active but not yet ready to service requests. This is useful for servers that take a number of minutes to check the consistency of their disks before providing service.

The data transferred in the request and response messages is divided into two parts. The first part is called the ``parameter'' area, and it is used to transmit integers and other simple data structures. The second part is called the ``data'' area and is usually a larger block of uninterpreted data. This format is oriented towards procedures that have a few small parameters and one large block of data. It also allows us to implement simplified byte swapping to handle communication between hosts with different architectures. A field in the RPC packet header identifies the sender's byte ordering, and the receiver is responsible for byte-swapping the RPC packet header and the parameter area. These two parts of the packet are restricted to contain only integers so they can be byte-swapped without knowing their exact contents.

2. RPC Packet Format

A packet sent by the RPC protocol is divided into four parts: the datagram header, the RPC header, the RPC parameter area, and the RPC data area. This whole packet may be further encapsulated by the link-level transport, i.e., Ethernet. Table C-1 defines the format of the RPC protocol header.


                           RPC Packet Format

    Bytes    Field         Description
    0-3      version       Version Number / Byte Ordering
    4-7      flags         Type and flag bits
    8-11     clientID      Client Sprite ID
    12-15    serverID      Server Sprite ID
    16-19    channel       Client channel (port) number
    20-23    serverHint    Server index (port)
    24-27    bootID        Sender's boot timestamp
    28-31    ID            RPC sequence number
    32-35    delay         Inter-fragment delay
    36-39    numFrags      Number of fragments in this message
    40-43    fragMask      Bitmask identifying fragment
    44-47    command       RPC procedure identifier
    48-51    paramSize     Number of bytes in parameter area
    52-55    paramOffset   Offset of this fragment's parameters
    56-59    dataSize      Number of bytes in data area
    60-63    dataOffset    Offset of this fragment's data

Table C-1. There are 16 4-byte integer fields in an RPC packet header. This is followed by two variable-length regions called the ``parameter'' and ``data'' areas. The whole RPC packet is further encapsulated in a lower-level transport format, either raw Ethernet or IP.
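Transcribed into C, the header layout of Table C-1 looks roughly like the following. This is a sketch derived from the table; the type and field names are descriptive and are not necessarily those used in rpcPacket.h.

    #include <stdint.h>

    /*
     * Sprite RPC packet header: 16 four-byte fields, 64 bytes total.
     * All fields are 32-bit integers so the whole header can be byte
     * swapped as an array of integers.  The variable-length parameter
     * and data areas follow immediately after the header.
     */
    typedef struct RpcHeader {
        uint32_t version;      /* version number / byte-ordering magic     */
        uint32_t flags;        /* type and flag bits (Table C-2)           */
        uint32_t clientID;     /* client Sprite host ID                    */
        uint32_t serverID;     /* server Sprite host ID                    */
        uint32_t channel;      /* client channel (port) number             */
        uint32_t serverHint;   /* server process index (port)              */
        uint32_t bootID;       /* sender's boot timestamp                  */
        uint32_t ID;           /* RPC sequence number                      */
        uint32_t delay;        /* desired inter-fragment delay             */
        uint32_t numFrags;     /* number of fragments in this message      */
        uint32_t fragMask;     /* bitmask identifying the fragment(s)      */
        uint32_t command;      /* RPC procedure identifier (or error code) */
        uint32_t paramSize;    /* bytes in the parameter area              */
        uint32_t paramOffset;  /* offset of this fragment's parameters     */
        uint32_t dataSize;     /* bytes in the data area                   */
        uint32_t dataOffset;   /* offset of this fragment's data           */
    } RpcHeader;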

                          RPC Type and Flag Bits

    Bit              Value     Description
    RPC_TYPE         0xff00    Mask for type bits
    RPC_REQUEST      0x0100    Request message
    RPC_ACK          0x0200    Acknowledgment message
    RPC_REPLY        0x0400    Reply message
    RPC_ECHO         0x0800    Special low-level echo message
    RPC_FLAG         0x00ff    Mask for flag bits
    RPC_PLSACK       0x0001    Please acknowledge REQUEST
    RPC_LASTFRAG     0x0002    Last fragment in REQUEST/REPLY, or partial ACK
    RPC_CLOSE        0x0004    Close connection between client and server (ACK)
    RPC_ERROR        0x0008    REPLY contains error code in command field
    RPC_SERVER       0x0010    Message to server (REQUEST or ACK)
    RPC_LAST_REC     0x0020    (Internal) Marks end of circular trace
    RPC_NOT_ACTIVE   0x0040    Sender is not fully active (REQUEST/ACK/REPLY)

Table C-2. The flags field in the RPC packet header contains these flag and type bits. The types are mutually exclusive, while the flags can be found in packets of the indicated types.

Note that all fields in the RPC header are 4-byte integers, to facilitate byte swapping between hosts of different architectures. All packets (requests, replies, and acknowledgments) sent by the RPC protocol have this same header. The parameter and/or data area may be of zero length. The flags field contains both packet type information (request, reply, ack) and some flags (error reply, please acknowledge, close connection, to server, host not active). The values for the flags are defined in Table C-2. The explanation of each flag indicates the type of packet in which it occurs. The remaining packet fields are described below.

version: This field defines the RPC version number, and it is also used to detect byte-ordering differences between the client and the server. The currently defined version numbers are RPC_NATIVE_VERSION (0x0f0e0003) and RPC_SWAPPED_VERSION (0x03000e0f). If a receiver gets a packet with the byte-swapped version number it has to byte-swap the RPC header and the parameter block. This can be done by treating each field in the header as an integer, and by treating the parameter block as an array of integers. The data block remains uninterpreted by the low levels of the RPC system.

flags: The flag bits are defined above in Table C-2.

clientID: Sprite host ID of the client host. Sprite host IDs are distinct from the low-level host IDs used by the datagram protocol (i.e., Ethernet or IP).

serverID: Sprite host ID of the server host.

channel: The client channel number. Each Sprite host has a small number of channels (8) which provide for concurrent communication.

serverHint: A hint as to the server process port.

bootID: Boot-time timestamp, set to the network time upon booting.

ID: RPC sequence number. The same on all packets concerning a single RPC.

delay: Desired inter-fragment delay, in microseconds. This is used when sending fragmented messages to slower hosts to introduce a time delay between transmission of packet fragments.

numFrags: Number of fragments in the message. 0 (zero) indicates no fragmenting. The maximum number of fragments supported by the implementation is currently 16, but this field allows up to 32 fragments.

fragMask: Bitmask indicating which fragment of a REQUEST or REPLY this is. If there is no fragmenting then this field has to be 0 (zero). In ACK packets this mask indicates which fragments have been received.

command: Command (procedure) identifier for a REQUEST. For REPLY messages with the ERROR bit set this field is overloaded with an error code.

paramSize: The number of bytes in the first data area (the parameter area).

paramOffset: The offset of this piece of the parameters within the whole parameter block.

dataSize: The number of bytes in the second data area (the data area).

dataOffset: The offset of this piece of the data within the whole data block.
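The byte-swapping rule for the version field can be illustrated with a short sketch. It builds on the RpcHeader structure sketched earlier, and the routine names are illustrative rather than the kernel's own.

    #include <stdint.h>
    #include <stddef.h>

    #define RPC_NATIVE_VERSION  0x0f0e0003
    #define RPC_SWAPPED_VERSION 0x03000e0f

    static uint32_t ByteSwap32(uint32_t x)
    {
        return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8) |
               ((x & 0x00ff0000u) >> 8)  | ((x & 0xff000000u) >> 24);
    }

    /*
     * Fix the byte order of an incoming packet if it came from a host with
     * the opposite byte order.  Both the 16-word header and the parameter
     * area are treated as arrays of 32-bit integers; the data area is left
     * untouched.  Returns 0 on success, -1 if the version is unrecognized.
     */
    int MaybeSwapPacket(RpcHeader *hdr, uint32_t *params, size_t paramBytes)
    {
        if (hdr->version == RPC_NATIVE_VERSION) {
            return 0;                    /* same byte order: nothing to do */
        }
        if (hdr->version != RPC_SWAPPED_VERSION) {
            return -1;                   /* not a recognizable Sprite RPC packet */
        }
        uint32_t *word = (uint32_t *)hdr;
        for (size_t i = 0; i < sizeof(*hdr) / sizeof(uint32_t); i++) {
            word[i] = ByteSwap32(word[i]);
        }
        for (size_t i = 0; i < paramBytes / sizeof(uint32_t); i++) {
            params[i] = ByteSwap32(params[i]);
        }
        return 0;
    }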

3. Protocol Layering

The RPC protocol is designed to use any datagram protocol for transport between hosts. The format of the datagram header is not important to the RPC protocol itself, although the routing module in the Sprite kernel will examine the datagram header to learn the correspondence between Sprite host IDs and the addresses used by the transport protocol. When used with the Internet Datagram Protocol (IP), the RPC header, parameters, and data would follow the IP header immediately. We have been assigned Internet protocol number 90 for implementations layered on top of the IP protocol.

    Bytes 0-31    IP header (with no options)
    Bytes 32-95   RPC header
    Bytes 96+     RPC parameters and data, if any

4. Source Code References

The RPC protocol implementation can be found in the Sprite kernel sources. The directory is ``/sprite/src/kernel/rpc''. The definitions of the packet formats given here come from the file ``rpcPacket.h''. Statistics that are collected by the on-line RPC system are defined in ``rpcCltStat.h'' and ``rpcSrvStat.h''. These statistics are available via the rawstat and rpcstat Sprite commands.

APPENDIX D

The Sprite RPC Interface

1. Introduction

This is a brief summary of the network interface to Sprite, given as a description of each RPC used in the implementation. Most of the calls have to do with the file system, although there are a few related to process migration, signals, and remote waiting. The parameters of each RPC are specified, and a few words about the use of the RPC are given. This is only a terse reference guide; the descriptions do not necessarily include justification or explanation.

It is important to understand that this RPC interface is between two operating system kernels. The terms ``client'' and ``server'' refer to instances of the kernel acting in these roles. The term ``application process'' will be used when necessary to refer to the process triggering the use of an RPC.

1.1. The RPC Protocol

It is useful to review the appendix on the RPC protocol itself. The main thing to understand is that the information included in each RPC request and reply is broken into three parts: a standard RPC header, a parameter area, and a data area. The actual sizes of the parameter and data areas are extracted from the packet header and reported to the stub procedures. The ID of the client is also reported to the server-side stub procedures. The stubs have to deal with the distinct parameter and data areas. The parameter area is restricted to contain only integers, while the data area can contain arbitrary data. The low levels of the RPC protocol handle communication between hosts of different byte order by automatically byte-swapping the RPC header and the parameter area when a packet is received from a host with a different byte order. This means that both the client and server stubs can treat the parameter area as a C structure (of integers), and not worry about alignment problems. If the data area is used for anything other than file data or strings, then the stub, or even a higher-level procedure, has to do its own byte swapping. Lastly, there is a return code from each RPC, with SUCCESS (zero) meaning successful completion. Sprite return codes are defined in the file ``/sprite/lib/include/status.h''.

The complete set of RPCs used in the Sprite implementation is given in Table D-1. The number associated with each RPC is the procedure number used in the RPC packet header. The RPCs are described in more detail below.


                      Sprite Remote Procedure Calls

    Procedure            #   Description
    ECHO_1               1   Echo.  Performed by the server's interrupt handler (unused).
    ECHO_2               2   Echo.  Performed by an Rpc_Server process.
    SEND                 3   Send.  Like Echo, but data is only transferred to the server.
    RECEIVE              4   Receive.  Data is only transferred back to the client.
    GETTIME              5   Broadcast RPC to get the current time.
    FS_PREFIX            6   Broadcast RPC to find a prefix server.
    FS_OPEN              7   Open a file system object by name.
    FS_READ              8   Read data from a file system object.
    FS_WRITE             9   Write data to a file system object.
    FS_CLOSE            10   Close an I/O stream to a file system object.
    FS_UNLINK           11   Remove the name of an object.
    FS_RENAME           12   Change the name of an object.
    FS_MKDIR            13   Create a directory.
    FS_RMDIR            14   Remove a directory.
    FS_MKDEV            15   Make a special device file.
    FS_LINK             16   Make a directory reference to an existing object.
    FS_SYM_LINK         17   Make a symbolic link to an existing object.
    FS_GET_ATTR         18   Get the attributes of the object behind an I/O stream.
    FS_SET_ATTR         19   Set the attributes of the object behind an I/O stream.
    FS_GET_ATTR_PATH    20   Get the attributes of a named object.
    FS_SET_ATTR_PATH    21   Set the attributes of a named object.
    FS_GET_IO_ATTR      22   Get the attributes kept by the I/O server.
    FS_SET_IO_ATTR      23   Set the attributes kept by the I/O server.
    FS_DEV_OPEN         24   Complete the open of a remote device or pseudo-device.
    FS_SELECT           25   Query the status of a device or pseudo-device.
    FS_IO_CONTROL       26   Perform an object-specific operation.
    FS_CONSIST          27   Request that a cache consistency action be performed.
    FS_CONSIST_REPLY    28   Acknowledgment that a consistency action completed.
    FS_COPY_BLOCK       29   Copy a block of a swap file.
    FS_MIGRATE          30   Tell the I/O server that an I/O stream has migrated.
    FS_RELEASE          31   Tell the source of a migration to release an I/O stream.
    FS_REOPEN           32   Recover the state about an I/O stream.
    FS_RECOVERY         33   Signal that recovery actions have completed.
    FS_DOMAIN_INFO      34   Return information about a file system domain.
    PROC_MIG_COMMAND    35   Used to transfer process state during migration.
    PROC_REMOTE_CALL    36   Used to forward a system call to the home node.
    PROC_REMOTE_WAIT    37   Used to synchronize the exit of a migrated process.
    PROC_GETPCB         38   Return the process table entry for a migrated process.
    REMOTE_WAKEUP       39   Wake up a remote process.
    SIG_SEND            40   Issue a signal to a remote process.

Table D-1. The Remote Procedure Calls that are used in the Sprite implementation.

1.2. Source Code References

The RPC parameters described below are given as C typedefs. These have been extracted from the C source code of the Sprite implementation. It may help to consult the code in order to fully understand the use of a particular RPC. Most of the RPC stubs are organized so that the client and server stubs for a given RPC are together, along with the definition of their parameters. However, these definitions also include structures that are common to the rest of the implementation and may be defined elsewhere. Table D-2 lists the ``.c'' files that contain most of the RPC stubs and the ``.h'' files that contain most of the relevant typedefs. Finally, note that these structures are passed around within the kernel as well as between kernels using RPC, and there are some structure fields that are not valid during an RPC.

2. ECHO_2

This echoes data off another host. The operation is handled by an Rpc_Server process, so it exercises the full execution path involved in a regular RPC. (The unsupported ECHO_1 was used to echo off the interrupt handler.) ECHO_2 is used for benchmarking the RPC system, and it is also used by the recovery module to verify that another host is up. Equal amounts of data are transferred in both directions. The data is uninterpreted, and it is put in the data area of the RPC packet.

                        Important Source Files

    fs/fs.h                        Basic FS definitions.
    fs/fsNameOps.h                 Definitions for the naming interface.
    fsrmt/fsNameOpsInt.h           More definitions for the naming interface.
    fsrmt/fsrmtInt.h               Definitions for the I/O interface.
    fsrmt/fsSpriteDomain.c         RPC stubs for the naming interface.
    fsrmt/fsSpriteIO.c             RPC stubs for the I/O interface.
    fsrmt/fsRmtAttributes.c        RPC stubs for attribute handling.
    fsrmt/fsRmtDevice.             RPC stubs for remote devices.
    fsrmt/fsRmtMigrate.c           RPC stubs for migration.
    fsio/fsStream.c                RPC stub for the migration callback.
    fsconsist/fsCacheConsist.c     RPC stubs for cache consistency.
    fsutil/fsRecovery.c            RPC stubs for state recovery.

Table D-2. Important source files concerning the RPC interface to Sprite. The file names are relative to the ``/sprite/src/kernel'' directory. This list isn't guaranteed to be complete; there are likely to be other definition files needed to fully resolve the parameter definitions of an RPC. (All required definitions are given in this appendix.)

3. SEND

This transmits data to another host. It is used only for benchmarking the RPC system. Data is transferred one way only, from the client to the server.

4. RECEIVE

This receives data from another host. It is used only for benchmarking the RPC system.

5. GETTIME

This is a broadcast RPC used to get the time of day. It is used by hosts during bootstrap in order to set their clocks. The clock value is also used to set their rpcBootID, which is included in the header of all subsequent RPC packets. This is the first RPC done by a Sprite host, and the bootID in its RPC header is zero. The parameter and data areas of the RPC request are empty. The parameter area of the reply contains the following structure:

    typedef struct RpcTimeReturn {
        Time time;
        int  timeZoneMinutes;
        int  timeZoneDST;
    } RpcTimeReturn;

The Time data type is defined as:

    typedef struct Time {
        int seconds;
        int microseconds;
    } Time;

The time is the number of seconds since Jan 1, 1970 in universal (GMT) time. The timeZoneMinutes is the offset of the local timezone from GMT. The timeZoneDST is a flag indicating whether daylight savings time is allowed in the timezone (not whether it is in effect at the current date).

6. FS_PREFIX

This is a broadcast RPC used to locate the server for a prefix. The request parameter area is empty. The request data area contains the null-terminated prefix, whose length can be determined from the RPC packet header's dataLength field. The reply data area is empty. The reply parameter area contains the following structure:

    typedef struct FsPrefixReplyParam {
        FsrmtUnionData openData;
        Fs_FileID      fileID;
    } FsPrefixReplyParam;

The fileID identifies the root directory of the domain identified by the prefix. The client should cache the mapping from the prefix to the fileID. It specifies this fileID as the prefixID in lookup RPCs (FS_OPEN, etc.). The openData is used by Sprite clients to set up internal data structures associated with an open I/O stream. The FsrmtUnionData typedef is described under FS_OPEN. The Fs_FileID is defined as follows:

    typedef struct Fs_FileID {
        int type;
        int serverID;
        int major;
        int minor;
    } Fs_FileID;

The serverID is the Sprite hostID of the I/O server for the object. The major and minor fields identify the object to the I/O server. The values for the type field are defined in Table D-3. With the FS_PREFIX RPC the returned type is either FSIO_RMT_FILE_STREAM for regular Sprite file systems, or FSIO_PFS_NAMING_STREAM for pseudo-file-systems.

7. FS_OPEN

This is used to open a file system object in preparation for further I/O operations. The parameters to an open RPC include a pathname, which may have several components, and a file ID that indicates where the pathname starts. It also includes information about how the object will be used, and identification of the user and the client host that is doing the open. The reply to an open is one of two things. Ordinarily, data is returned that is used to set up data structures for an I/O stream to the object. Depending on the type of object opened, which is specified in the reply, the reply contains different data used for this purpose. However, it is also possible that the pathname leaves the server's domain, in which case a new pathname is returned to the client, and perhaps also a new prefix. These details are described below.

                       Sprite Internal Stream Types

     #   Name                        Description
     0   FSIO_STREAM                 For top-level stream descriptors.
     1   FSIO_LCL_FILE_STREAM        Local file.
     2   FSIO_RMT_FILE_STREAM        Remote file.
     3   FSIO_LCL_DEVICE_STREAM      Local device.
     4   FSIO_RMT_DEVICE_STREAM      Remote device.
     5   FSIO_LCL_PIPE_STREAM        Anonymous pipe.
     6   FSIO_RMT_PIPE_STREAM        Remote anonymous pipe.
     7   FSIO_CONTROL_STREAM         Pseudo-device control stream.
     8   FSIO_SERVER_STREAM          Pseudo-device server stream.
     9   FSIO_LCL_PSEUDO_STREAM      Attached to a server stream.
    10   FSIO_RMT_PSEUDO_STREAM      Indirectly attached to a server stream.
    11   FSIO_PFS_CONTROL_STREAM     Records the pseudo-file-system server.
    12   FSIO_PFS_NAMING_STREAM      Pseudo-file-system naming stream.
    13   FSIO_LCL_PFS_STREAM         Pseudo-file-system I/O stream.
    14   FSIO_RMT_PFS_STREAM         Remote pseudo-file-system I/O stream.
    15   FSIO_RMT_CONTROL_STREAM     Fake type for GET_ATTR of a pseudo-device.
    16   FSIO_PASSING_STREAM         Fake type when passing open streams.

Table D-3. The internal stream types used in the Sprite implementation. Most types have corresponding local (LCL) and remote (RMT) types. The CONTROL and SERVER streams, however, are always local because they are between the kernel and the pseudo-device or pseudo-file-system server process. The PFS_NAMING stream is always remote; it is returned in response to prefix broadcasts in the case of a pseudo-file-system. Some additional types are defined for use with the kernel's internal I/O interface, and these are included here for completeness.

7.1. FS_OPEN Request Format

The request data area contains a null-terminated pathname. The request parameter area contains the following structure:

    typedef struct Fs_OpenArgs {
        Fs_FileID  prefixID;
        Fs_FileID  rootID;
        int        useFlags;
        int        permissions;
        int        type;
        int        clientID;
        int        migClientID;
        Fs_UserIDs id;
    } Fs_OpenArgs;

The prefixID specifies where the pathname begins. There are two cases for this. If the client initially has an absolute pathname, then it can match this against its prefix cache and use the fileID associated with the longest matching prefix. The prefix should be stripped off before sending the pathname to the server. If the client initially has a relative pathname, then the prefixID is the fileID associated with the current working directory. This is obtained by a previous FS_OPEN RPC on the current directory. The rootID is a prefix ID, and it is used to trap out pathnames that ascend out of the root directory of a domain. It is either the same as the prefixID, or it is the ID of the prefix that identifies the domain of the current directory. Note that this supports the implementation of chroot(), and it also allows servers to export prefixes that don't correspond to the ``natural'' root directory of a domain.

The permissions field contains the permission bits to set on newly created files. These are defined in Table D-4. The type field constrains the type of object that can be opened. If the type is FS_FILE, then any type can be opened. This is the way the FS_OPEN RPC is used by the open() system call. However, FS_OPEN is also used in the implementation of readlink(), symlink(), and mknod(). For these calls specific types are indicated so that the type of the named file must match. (Warning: note the bug report concerning FS_REMOTE_LINK and FS_SYMBOLIC_LINK in the last section of this appendix.) The types are defined in Table D-5, and they correspond to types in the file descriptors kept on disk.

The clientID is the hostID of the process doing the open. The migClientID is the home node of a migrated process, which may be different from the clientID if a process has migrated. This is used when opening devices so that a migrated process can open devices on its home node. This only applies to device files with the FS_LOCALHOST_ID serverID attribute. Device files with a specific host ID (>0) for their serverID attribute always specify the device on that host. (See FS_MAKE_DEV below.)


        Permission Bits

    FS_OWNER_READ    00400
    FS_OWNER_WRITE   00200
    FS_OWNER_EXEC    00100
    FS_GROUP_READ    00040
    FS_GROUP_WRITE   00020
    FS_GROUP_EXEC    00010
    FS_WORLD_READ    00004
    FS_WORLD_WRITE   00002
    FS_WORLD_EXEC    00001
    FS_SET_UID       04000
    FS_SET_GID       02000

Table D-4. Permission bits for the Fs_OpenArgs structure. These are octal values.

        File Descriptor Types

    FS_FILE                   0
    FS_DIRECTORY              1
    FS_SYMBOLIC_LINK          2
    FS_REMOTE_LINK            3
    FS_DEVICE                 4
    (not used)                5
    FS_LOCAL_PIPE             6
    FS_NAMED_PIPE             7
    FS_PSEUDO_DEV             8
    (not used)                9
    (reserved for testing)   10

Table D-5. Types for disk-resident file descriptors. These values are also used for the type field in the Fs_OpenArgs structure.

The id field contains the user and group IDs of the process doing the open. The Fs_UserIDs typedef is defined as follows:

    #define FS_NUM_GROUPS 8

    typedef struct Fs_UserIDs {
        int user;
        int numGroupIDs;
        int group[FS_NUM_GROUPS];
    } Fs_UserIDs;

The useFlags indicate how the object is going to be used. Valid useFlags bits are defined in Table D-6. They are divided into two sets: those passed into the Fs_Open system call from user programs, and those set by the Sprite kernel for its own use. (For historical reasons the UNIX open() call is mapped to Fs_Open, and some of the flag bits differ from the UNIX O_* flags, mainly the READ and WRITE bits.)


Usage Flags

    Value       Name             Description
    0xfff       FS_USER_FLAGS    A mask used to prevent user programs from
                                 setting kernel bits.
    0x001       FS_READ          Open the object for reading.
    0x002       FS_WRITE         Open the object for writing.
    0x004       FS_EXECUTE       Open the object for execution, if it is a
                                 regular file, or for changing the current
                                 directory if it is a directory.
    0x008       FS_APPEND        Open the object for append mode writing.
    0x020       FS_PDEV_MASTER   Open a pseudo-device as the server process.
    0x080       FS_PFS_MASTER    Open a remote link as the server for a
                                 pseudo-file-system.
    0x200       FS_CREATE        Create a directory entry for the object if
                                 it isn't there already.
    0x400       FS_TRUNC         Truncate the object after opening.
    0x800       FS_EXCLUSIVE     If specified with FS_CREATE, then the open
                                 will fail if the object already exists.
    0xfffff000  FS_KERNEL_FLAGS  Bits are set in this field by the client
                                 kernel.
    0x00001000  FS_FOLLOW        Follow symbolic links when traversing the
                                 pathname.
    0x00004000  FS_SWAP          The file is being used as a VM backing file.
    0x00010000  FS_OWNERSHIP     Set when changing ownership or permission
                                 bits. The server should verify that the
                                 opening process owns the file.
    0x00020000  FS_DELETE        Set when deleting a file. (FS_UNLINK RPC,
                                 not FS_OPEN.)
    0x00040000  FS_LINK          Set when creating a hard link. (FS_LINK RPC,
                                 not FS_OPEN.)
    0x00080000  FS_RENAME        Set during rename. (FS_RENAME RPC, not
                                 FS_OPEN.)

Table D-6. Values for the useFlags field of Fs_OpenArgs and Fs_LookupArgs.

7.2. FS_OPEN Reply Format

The data area of an FS_OPEN reply may contain a returned pathname, in the case where the input pathname leaves the server's domain. This is indicated by the FS_LOOKUP_REDIRECT return code. If the returned name begins with a '/' character then it has been expanded by the server and can be matched against the client's prefix table. If the prefixLength parameter (defined below) is non-zero then the initial part of the expanded pathname is a domain prefix that should be added to the client's prefix table. The client should use the FS_PREFIX RPC to locate the domain's server. If the returned pathname begins with ``../'', then the client has to combine the returned pathname with the prefix it used for the server's domain as follows. (This means a client implementation has to remember the domain prefix associated with a current working directory.) If the prefix is ``/a/b'' and the returned pathname is ``../x/y'', then these are combined into ``/a/b/../x/y'', which further reduces to ``/a/x/y''. Note this no longer matches on the ``/a/b'' prefix.

The parameter area of the FS_OPEN reply contains data used to initialize the I/O stream data structures. The format of the reply parameters is defined as follows:

typedef struct FsrmtOpenResultsParam {
    int prefixLength;
    Fs_OpenResults openResults;
    FsrmtUnionData openData;
} FsrmtOpenResultsParam;

typedef struct Fs_OpenResults {
    Fs_FileID ioFileID;
    Fs_FileID streamID;
    Fs_FileID nameID;
    int dataSize;
    ClientData streamData;
} Fs_OpenResults;

typedef union FsrmtUnionData {
    Fsio_FileState fileState;
    Fsio_DeviceState devState;
    Fspdev_State pdevState;
} FsrmtUnionData;

The Fs_OpenResults contain three object identifiers: one for the object that was opened, one for the I/O stream to that object (the streamID is type FSIO_STREAM), and one for the file that names the object (the nameID is type FSIO_RMT_FILE_STREAM). For devices and pseudo-devices the file that represents the name is different from the object itself, and the nameID is used when getting the attributes of an object. The I/O stream also has an identifier because I/O streams are passed between machines during process migration. The rest of the data is type-specific and is described below.

typedef struct Fsio_FileState {
    int cacheable;
    int version;
    int openTimeStamp;
    Fscache_Attributes attr;
    int newUseFlags;
} Fsio_FileState;

typedef struct Fscache_Attributes {
    int firstByte;
    int lastByte;
    int accessTime;
    int modifyTime;
    int createTime;
    int userType;
    int permissions;
    int uid;
    int gid;
} Fscache_Attributes;

The Fsio_FileState is returned when a regular file or directory is opened. In this case the type in the ioFileID is FSIO_RMT_FILE_STREAM. The cacheable flag indicates whether the client can cache the file. The version number is used to detect stale data in the client's cache. The openTimeStamp should be kept and used later when handling FS_CONSIST RPCs. (The openTimeStamp is redundant with respect to the version number, and it may be eliminated in the future.) The Fscache_Attributes are a subset of the file attributes stored at the file server. The lastByte, modifyTime and accessTime are updated by the client if it caches the file. These attributes are pushed back to the server at close time. (firstByte is unused; it is a vestige of a named pipe implementation.) The permission bits and ownership IDs are used with setuid and setgid programs. The newUseFlags are a modified version of the useFlags passed in the FS_OPEN request. The client stores these in its I/O stream descriptor and does consistency checking on subsequent I/O operations by the user-level application.

typedef struct Fsio_DeviceState {
    int accessTime;
    int modifyTime;
    Fs_FileID streamID;
} Fsio_DeviceState;

The Fsio_DeviceState is returned from the file server when a device is opened. The type in the ioFileID is either FSIO_LCL_DEVICE_STREAM or FSIO_RMT_DEVICE_STREAM. In the latter case, the client will pass the Fsio_DeviceState to the I/O server in a FS_DEV_OPEN RPC. The accessTime and modifyTime are maintained at the I/O server. (Currently they are never pushed back to the file server. This is a bug.) The streamID is the same as in the Fs_OpenResults, but it is included in Fsio_DeviceState so it can be passed to the I/O server.

typedef struct Fspdev_State {
    Fs_FileID ctrlFileID;
    Proc_PID procID;
    int uid;
    Fs_FileID streamID;
} Fspdev_State;

The Fspdev_State is returned when a pseudo-device is opened. The type in the ioFileID is FSIO_CONTROL_STREAM when the server process opens the pseudo-device; in this case the Fspdev_State structure is not returned. When any other process opens the pseudo-device then the type in the ioFileID is either FSIO_LCL_PSEUDO_STREAM or FSIO_RMT_PSEUDO_STREAM. In the latter case the client passes the Fspdev_State to the I/O server with a FS_DEV_OPEN RPC. The ctrlFileID identifies the control stream of the server process. The procID and uid should be filled in by the client before the FS_DEV_OPEN RPC. The streamID is the same as that in the Fs_OpenResults, but it is included in Fspdev_State so it can be passed to the I/O server. The I/O server sets up a shadow stream descriptor that matches the client's. When a process opens a file in a pseudo-file-system then the type in the ioFileID is FSIO_RMT_PFS_STREAM. (If the pseudo-file-system server is on the same host as the process doing the open, then the FS_OPEN RPC is not used, so the FSIO_LCL_PFS_STREAM is not seen here.)
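
The pathname-combination rule for FS_LOOKUP_REDIRECT replies that begin with ``../'' (described in the previous section) can be made concrete with a small sketch. This is not the Sprite client code; the function name and fixed-size buffers are assumptions for illustration.

#include <stdio.h>
#include <string.h>

/*
 * Combine a domain prefix (e.g. "/a/b") with a redirected pathname that
 * begins with "../" (e.g. "../x/y"), reducing each leading "../" by
 * trimming one component off the prefix: "/a/b" + "../x/y" -> "/a/x/y".
 * Returns 0 on success, -1 if the result does not fit or the prefix is
 * exhausted.
 */
static int
CombineRedirect(const char *prefix, const char *returned,
                char *result, size_t maxLen)
{
    char buf[1024];

    if (strlen(prefix) >= sizeof(buf)) {
        return -1;
    }
    strcpy(buf, prefix);
    while (strncmp(returned, "../", 3) == 0) {
        char *slash = strrchr(buf, '/');
        if (slash == NULL) {
            return -1;              /* ascended above the root */
        }
        *slash = '\0';              /* drop the last prefix component */
        returned += 3;
    }
    if (strlen(buf) + 1 + strlen(returned) + 1 > maxLen) {
        return -1;
    }
    sprintf(result, "%s/%s", buf, returned);
    return 0;
}

int
main(void)
{
    char newPath[1024];

    if (CombineRedirect("/a/b", "../x/y", newPath, sizeof(newPath)) == 0) {
        printf("%s\n", newPath);    /* prints "/a/x/y" */
    }
    return 0;
}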

8. FS_READ

This call is used to read data from a file system object. The data area of the request message is empty. The parameter area of the request contains the following structure.

typedef struct FsrmtIOParam {
    Fs_FileID fileID;
    Fs_FileID streamID;
    Sync_RemoteWaiter waiter;
    Fs_IOParam io;
} FsrmtIOParam;

typedef struct {
    List_Links links;
    int hostID;
    Proc_PID pid;
    int waitToken;
} Sync_RemoteWaiter;

typedef struct Fs_IOParam {
    Address buffer;
    int length;
    int offset;
    int flags;
    Proc_PID procID;
    Proc_PID familyID;
    int uid;
    int reserved;
} Fs_IOParam;

The fileID and the streamID together specify the I/O stream and the object it references. These have been returned from a previous FS_OPEN RPC. The Sync_RemoteWaiter structure contains information about the process in case the read operation would block, in which case it is used in a subsequent REMOTE_WAKEUP RPC. (The links field of this structure is not valid during the RPC.) The length in the Fs_IOParam indicates how much data is to be read, and the offset indicates the byte offset at which to start the read. (The buffer field is not valid during the RPC.) The flags are the flags from the I/O stream descriptor, which are derived from the useFlags in FS_OPEN. The procID, familyID, and uid are used for ownership checking by certain devices. The reserved field is currently unused, but may eventually be used for a file's version number.

The reply message for a read contains the data in the data area, and a Fs_IOReply in the parameter area.

typedef struct Fs_IOReply {
    int length;
    int flags;
    int signal;
    int code;
} Fs_IOReply;

The length indicates the number of bytes returned. The flags indicate the select state of the object. They are an or'd combination of the FS_READ, FS_WRITE, and FS_EXECUTE bits defined above. The signal and code are used to return a signal from certain kinds of devices. If non-zero, the signal field will result in that signal being sent to the application process, along with the code that modifies the signal. If the return code of the FS_READ RPC is FS_WOULD_BLOCK, then length may be greater than or equal to zero. This indicates that less than the requested amount of data was returned, and that the I/O server has saved the Sync_RemoteWaiter information. A REMOTE_WAKEUP RPC will be generated by the I/O server when the object becomes readable.
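
The way a client might interpret an FS_READ reply is sketched below. This is illustrative only: the Status enum stands in for the real Sprite ReturnStatus values and the function name is invented, but the logic follows the rules above (deliver any signal, accept whatever data was returned, and wait for REMOTE_WAKEUP only on FS_WOULD_BLOCK).

/* Simplified stand-ins for the Sprite types; the real values differ. */
typedef enum { ST_SUCCESS, ST_WOULD_BLOCK } Status;

typedef struct Fs_IOReply {
    int length;     /* bytes returned in the data area */
    int flags;      /* select state: FS_READ | FS_WRITE | FS_EXECUTE */
    int signal;     /* signal to deliver to the application, or 0 */
    int code;       /* modifier for the signal */
} Fs_IOReply;

typedef enum { READ_DONE, READ_WAIT_FOR_WAKEUP } ReadAction;

/*
 * Decide what to do after an FS_READ RPC.  *bytesRead is advanced by the
 * amount returned; on FS_WOULD_BLOCK the caller should sleep until a
 * REMOTE_WAKEUP RPC arrives and then retry the read.
 */
static ReadAction
HandleReadReply(Status status, const Fs_IOReply *reply, int *bytesRead)
{
    if (reply->signal != 0) {
        /* A real client would post reply->signal (with reply->code)
         * to the application process here. */
    }
    *bytesRead += reply->length;
    if (status == ST_WOULD_BLOCK) {
        /* Partial (possibly zero-length) transfer; the I/O server has
         * saved the Sync_RemoteWaiter information and will send a
         * REMOTE_WAKEUP when the object becomes readable. */
        return READ_WAIT_FOR_WAKEUP;
    }
    return READ_DONE;
}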

9. FS_WRITE

This is similar to the FS_READ RPC, except that the request data area contains the data to be transferred. The request parameters are the same as for FS_READ. The reply parameters are also the same. If the I/O server doesn't accept all the data transferred to it then the client will block the application process if FS_WOULD_BLOCK is returned. Otherwise a short write and a SUCCESS error code will prompt an immediate retry of the rest of the data. There are some additional flags in the Fs_IOParam structure that pertain to writes. These are described below.

0x00100000 FS_CLIENT_CACHE_WRITE
    When a client writes back data from its cache this flag is set. In this case the file server does not reset the modify time, because the client maintains the modify time while caching the file.

0x02000000 FS_LAST_DIRTY_BLOCK
    This is set when a client is writing back its last block for a particular file. At the receipt of the last block a server in write-back-ASAP mode will schedule the file to be written to disk, although the RPC will return before the file actually gets to disk.

0x10000000 FS_WB_ON_LDB
    This tells the server to write through the file if it is the last dirty block. This will appear with FS_LAST_DIRTY_BLOCK. At the receipt of a block marked with this flag the server will block until the file is written through, and the RPC will return after the file is on disk.

10. FS_CLOSE

When the last process using an I/O stream closes its reference, an FS_CLOSE RPC is made to the I/O server. The request data area is empty. The request parameter area is described below. The reply message has no data or parameters, only a return code.

typedef struct FsRemoteCloseParams {
    Fs_FileID fileID;
    Fs_FileID streamID;
    Proc_PID procID;
    int flags;
    FsCloseData closeData;
    int closeDataSize;
} FsRemoteCloseParams;

typedef union FsCloseData {
    Fscache_Attributes attrs;
} FsCloseData;

The fileID is the ioFileID from the FS_OPEN RPC. The streamID is also from the FS_OPEN RPC. The procID identifies the process doing the close. The flags are the flags from the stream descriptor, and they may also include the FS_LAST_DIRTY_BLOCK and FS_WB_ON_LDB flags if a file is in write-back-on-close mode. The closeData is a type-specific union used to propagate attributes back to the file server at close time. Currently this is only implemented for regular files, in which case the closeData is the Fscache_Attributes already described. (It should also be implemented for devices, but the file server is not contacted at close time for devices.) The closeDataSize is the number of valid bytes in the closeData union.

11. FS_UNLINK

This RPC is used to remove a directory entry for an object. When the last directory entry that references an object is removed, the underlying object is deleted. However, if the object is still referenced by an open I/O stream then the deletion is postponed until the I/O stream is closed. The request data area contains a null terminated pathname. The request parameter area contains an Fs_LookupArgs record, which is a subset of the Fs_OpenArgs record previously defined.

typedef struct Fs_LookupArgs {
    Fs_FileID prefixID;
    Fs_FileID rootID;
    int useFlags;
    Fs_UserIDs id;
    int clientID;
    int migClientID;
} Fs_LookupArgs;

The reply to an FS_UNLINK is ordinarily only a return code. However, as with all operations on pathnames, the server may return a new pathname if the input pathname leaves its domain. In this case the return code is FS_LOOKUP_REDIRECT, the return data area contains the new pathname (see NB below), and the return parameter area contains an integer indicating the length of the prefix embedded in the returned pathname, or zero. NB: For implementation reasons, the return data area also contains room for the prefix length (4 bytes) in front of the returned pathname. See FS_OPEN for a fuller explanation of FS_LOOKUP_REDIRECT.

12. FS_RENAME

This is an operation on two pathnames, and it is used to change the name of an object in the file system. The request parameters include the Fs_LookupArgs previously defined, plus another fileID for the prefix of the second pathname. The data area contains two pathnames, and currently this is defined as two maximum length character arrays.

typedef struct Fs_2PathParams {
    Fs_LookupArgs lookup;
    Fs_FileID prefixID2;
} Fs_2PathParams;

#define FS_MAX_PATH_NAME_LENGTH 1024

typedef struct Fs_2PathData {
    char path1[FS_MAX_PATH_NAME_LENGTH];
    char path2[FS_MAX_PATH_NAME_LENGTH];
} Fs_2PathData;

There can also be a pathname redirection during a FS_RENAME, in which case the return code is FS_LOOKUP_REDIRECT and the return data area has the returned pathname. Again, for implementation reasons two fields in the parameter area (Fs_2PathReply) are repeated in the data area before the returned pathname (Fs_2PathRedirectInfo). The name1ErrorP flag that is returned indicates whether the error code applies to the first pathname (if name1ErrorP is non-zero) or to the second pathname. (Please excuse the fact that prefixLength and name1ErrorP occur in opposite orders. The Fs_2PathRedirectInfo is passed around the kernel internally, and these two fields are copied into the parameter area so they get byteswapped correctly.)

typedef struct Fs_2PathReply {
    int prefixLength;
    Boolean name1ErrorP;
} Fs_2PathReply;

typedef struct Fs_2PathRedirectInfo {
    int name1ErrorP;
    int prefixLength;
    char fileName[FS_MAX_PATH_NAME_LENGTH];
} Fs_2PathRedirectInfo;

Note that redirection makes lookups involving two pathnames slightly more complicated than an operation on a single pathname. A Sprite client uses its prefix cache to get the prefix fileID for both pathnames, and they may specify different servers. The client always directs the FS_RENAME (or FS_LINK) to the server for the first pathname. If the second pathname is also in the same domain then the FS_RENAME (or FS_LINK) can complete with a single RPC. If the first pathname leaves the server's domain, the server returns FS_WOULD_BLOCK and sets name1ErrorP to a non-zero value. If the second pathname begins in the server's domain but subsequently leaves it, the server returns FS_WOULD_BLOCK and sets name1ErrorP to zero. If the second pathname doesn't begin in the server's domain, it returns FS_CROSS_DOMAIN_OPERATION. At this point the client does a FS_GET_ATTR on the parent of the second pathname to make sure that further redirections do not lead the second pathname back to the server's domain. The name of the parent is easily computed by trimming off the last component of the second pathname. If the FS_GET_ATTR gets a redirect the client reiterates; otherwise the FS_RENAME (or FS_LINK) fails.
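
A sketch of the client-side decision logic just described follows. The enums and function name are invented for illustration, and the Status values merely stand in for the Sprite return codes named in the text; this is not the actual kernel routine.

/* Illustrative stand-ins for the status codes mentioned above. */
typedef enum {
    ST_SUCCESS,
    ST_WOULD_BLOCK,             /* redirection; see name1ErrorP */
    ST_CROSS_DOMAIN_OPERATION,  /* path 2 not in path 1's domain */
    ST_OTHER_ERROR
} Status;

typedef enum {
    TWOPATH_DONE,          /* operation completed (or failed for good) */
    TWOPATH_REDIRECT_1,    /* re-resolve the first pathname and retry */
    TWOPATH_REDIRECT_2,    /* re-resolve the second pathname and retry */
    TWOPATH_CHECK_PARENT   /* get attributes of path 2's parent, then decide */
} NextStep;

/*
 * Decide what a client should do after an FS_RENAME or FS_LINK RPC:
 * name1ErrorP says which pathname the redirection applies to, and a
 * cross-domain status sends the client off to check the second pathname's
 * parent before giving up.
 */
static NextStep
HandleTwoPathReply(Status status, int name1ErrorP)
{
    switch (status) {
    case ST_WOULD_BLOCK:
        return name1ErrorP ? TWOPATH_REDIRECT_1 : TWOPATH_REDIRECT_2;
    case ST_CROSS_DOMAIN_OPERATION:
        return TWOPATH_CHECK_PARENT;
    default:
        return TWOPATH_DONE;
    }
}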

13. FS_MKDIR

This is used to create a directory. The request data is a null terminated pathname. The request parameters are the Fs_OpenArgs described above for FS_OPEN, with the type equal to FS_DIRECTORY. If the return code is FS_LOOKUP_REDIRECT, then the reply data is a new pathname (again preceded by 4 bytes of junk), and the reply parameter area contains the length of the embedded prefix.

14. FS_RMDIR

This is used to remove a directory. The request data is a null terminated pathname. The request parameters are the Fs_LookupArgs described above for FS_UNLINK. If the return code is FS_LOOKUP_REDIRECT, then the reply data is a new pathname (again preceded by 4 bytes of junk), and the reply parameter area contains the length of the embedded prefix.

15. FS_MKDEV

This is used to create a device file. The request data is a null terminated pathname. The request parameters, Fs_MakeDeviceArgs, are defined below. If the return code is FS_LOOKUP_REDIRECT, then the reply data is a new pathname (again preceded by 4 bytes of junk), and the reply parameter area contains the length of the embedded prefix.

typedef struct Fs_MakeDeviceArgs {
    Fs_OpenArgs open;
    Fs_Device device;
} Fs_MakeDeviceArgs;

typedef struct Fs_Device {
    int serverID;
    int type;
    int unit;
    ClientData data;
} Fs_Device;

The Fs_MakeDeviceArgs are a slight super-set of the Fs_OpenArgs. They additionally contain a Fs_Device structure that defines the server, type, and unit of the peripheral device. (The data field is not used in the RPC.) The serverID is a Sprite host ID. If the special value FS_LOCALHOST_ID is used, then this device file always specifies the device attached to the local host. Otherwise, it specifies a device at a particular host.

#define FS_LOCALHOST_ID -1

Some of the device types are defined in Table D-7, although this list is not guaranteed to be complete. These types are defined in the kernel device header files. The unit number is device specific, and no attempt is made to specify them here. (For example, the ethernet device encodes the protocol number in the unit. The SCSI devices encode the controller number, target ID, and LUN.)

Device Types

    DEV_TERM            0
    DEV_SYSLOG          1
    DEV_SCSI_WORM       2
    DEV_PLACEHOLDER_2   3
    DEV_SCSI_DISK       4
    DEV_SCSI_TAPE       5
    DEV_MEMORY          6
    DEV_XYLOGICS        7
    DEV_NET             8
    DEV_SCSI_HBA        9
    DEV_RAID            10
    DEV_DEBUG           11
    DEV_MOUSE           12

Table D-7. Definitions for device types.

16. FS_LINK

This creates another directory entry to an existing object. The two objects are restricted to be within the same file system domain. The request and reply messages have the same format as FS_RENAME. This is an operation on two pathnames, and the comments regarding pathname redirection from FS_RENAME apply.

17. FS_SYM_LINK

This is defined but not yet supported. Instead, symbolic links and remote links are created by creating a file using FS_OPEN and type FS_SYMBOLIC_LINK or FS_REMOTE_LINK, and then writing the value of the link with FS_WRITE. This should change because it interacts poorly with systems that have a different format for their remote links. (For example, for no good reason Sprite includes a null character in its implementation of symbolic links, while UNIX does not. NFS accesses to Sprite are confused by this, and it is possible to create bad symbolic links on an NFS server via the Sprite NFS pseudo-file-system.)

18. FS_GET_ATTR

This is used to get the attributes of the object behind an open I/O stream. The parameter area of the request contains a Fs_FileID, which has been defined above. This is the same as the ioFileID returned from an FS_OPEN. The request and reply data areas are empty. The reply parameter area contains a Fs_Attributes structure defined below.

typedef struct Fs_Attributes {
    int serverID;
    int domain;
    int fileNumber;
    int type;
    int size;
    int numLinks;
    unsigned int permissions;
    int uid;
    int gid;
    int devServerID;
    int devType;
    int devUnit;
    Time createTime;
    Time accessTime;
    Time descModifyTime;
    Time dataModifyTime;
    int blocks;
    int blockSize;
    int version;
    int userType;
    int pad[4];
} Fs_Attributes;

19. FS_SET_ATTR

This is used to set the attributes of an object behind an I/O stream. The request parameters contain the FsRemoteSetAttrParams structure, which is defined below. The request data area, reply data area, and reply parameter area are all empty.

typedef struct FsRemoteSetAttrParams {
    Fs_FileID fileID;
    Fs_UserIDs ids;
    Fs_Attributes attrs;
    int flags;
} FsRemoteSetAttrParams;

The fileID identifies the object, and was returned as the ioFileID from the FS_OPEN RPC. The Fs_UserIDs structure is needed to verify that the attributes can be changed, and this has been defined above with FS_OPEN. The flags field indicates which attributes are to be set. These flags are described in Table D-8.

Set Attribute Flags

    Value  Name               Description
    0x1F   FS_SET_ALL_ATTRS   Set all the attributes possible. See the
                              following definitions.
    0x01   FS_SET_TIMES       Set the access and modify times. Used to
                              implement UNIX utimes().
    0x02   FS_SET_MODE        Set the permissions. Used to implement UNIX
                              chmod().
    0x04   FS_SET_OWNER       Set the owner and group. If either the gid or
                              the uid field in the Fs_Attributes structure
                              is -1, then it is not changed.
    0x08   FS_SET_FILE_TYPE   Set the user-defined file type. The
                              user-defined types currently in use are
                              defined in Table D-9.
    0x10   FS_SET_DEVICE      Set the device attributes. Used to implement
                              Fs_MakeDevice (UNIX mknod()).

Table D-8. Values for the flags field in FsRemoteSetAttrParams.

User-defined File Types

    FS_USER_TYPE_UNDEFINED   0   Not used.
    FS_USER_TYPE_TMP         1   Temporary file.
    FS_USER_TYPE_SWAP        2   VM swap file.
    FS_USER_TYPE_OBJECT      3   Program object file.
    FS_USER_TYPE_BINARY      4   Complete program image.
    FS_USER_TYPE_OTHER       5   Everything else.

Table D-9. Values for the user-defined file type attribute.

20. FS_GET_ATTR_PATH

This is used to get the attributes of a file system object from the file server. This is an operation on a pathname, so the request data area contains a null terminated pathname. The request parameters are the Fs_OpenArgs previously described. The reply parameters are a Fs_GetAttrResultsParam structure, which is defined below. If the return code is FS_LOOKUP_REDIRECT, then the reply data area contains a returned pathname. NB: in this case there are no 4 bytes of padding in the data area! (Sorry for this ugly inconsistency.)

typedef union Fs_GetAttrResultsParam {
    int prefixLength;
    struct AttrResults {
        Fs_FileID fileID;
        Fs_Attributes attrs;
    } attrResults;
} Fs_GetAttrResultsParam;

The prefixLength is returned with the FS_LOOKUP_REDIRECT status. Otherwise, the fileID is the same as the ioFileID returned from an FS_OPEN, and this is used to do a FS_GET_IO_ATTR RPC with the I/O server. The Fs_Attributes structure has been defined above.

21. FS_SET_ATTR_PATH

This is used to set the attributes of a named file system object. The request data area contains a null terminated pathname. The request parameters are defined below. The reply data area contains a new pathname if the return code is FS_LOOKUP_REDIRECT. (No padding bytes.) The reply parameter area contains a Fs_GetAttrResultsParam structure, which has been defined above. If FS_LOOKUP_REDIRECT is returned then only the prefixLength is defined in the Fs_GetAttrResultsParam.

typedef struct Fs_SetAttrArgs {
    Fs_OpenArgs openArgs;
    Fs_Attributes attr;
    int flags;
} Fs_SetAttrArgs;

Each of these fields has been described above. The flags define which attributes to set. See FS_SET_ATTR for a description.

22. FS_GET_IO_ATTR

This is used to get the attributes that are maintained by the I/O server, typically the access and modify times. The general approach to getting attributes is to first contact the file server (with either FS_GET_ATTR or FS_GET_ATTR_PATH) to get the initial version of the attributes, and then use this RPC to get the most up-to-date access and modify times. This is only applicable to remote devices and remote pseudo-devices. The request parameter area contains a Fs_GetAttrResultsParam that has been initialized by a FS_GET_ATTR or FS_GET_ATTR_PATH RPC. The reply parameter area contains a Fs_Attributes structure. The request and reply data areas are empty.

23. FS_SET_IO_ATTR

This is used to update the attributes that are maintained by the I/O server, typically the access and modify times. Setting attributes is structured the same as getting them, so the file server is contacted first (with FS_SET_ATTR or FS_SET_ATTR_PATH). The complete attributes are returned from these calls, and then used as the request parameters in this RPC. The request message contains a FsRemoteSetAttrParams structure, which is defined below. The request data area, the reply parameter area, and the reply data area are empty.

typedef struct FsRemoteSetAttrParams {
    Fs_FileID fileID;
    Fs_UserIDs ids;
    Fs_Attributes attrs;
    int flags;
} FsRemoteSetAttrParams;

The fileID identifies the object, and is the same as the ioFileID returned from an FS_OPEN RPC. The Fs_UserIDs and Fs_Attributes have been described above. The flags indicate what attributes to set, and these are indicated above as well.

24. FS_DEV_OPEN

This is used to complete an Fs_Open (UNIX open()) of a remote device or remote pseudo-device. This is done after an FS_OPEN RPC has returned an ioFileID with a type of FSIO_RMT_DEVICE_STREAM or FSIO_RMT_PDEV_STREAM. The request parameter area contains a FsDeviceRemoteOpenParam structure, which is defined below. The reply parameter area contains a new ioFileID so the I/O server can modify this if it wants to. The request and reply data areas are empty.

typedef struct FsDeviceRemoteOpenParam {
    Fs_FileID fileID;
    int useFlags;
    int dataSize;
    FsrmtUnionData openData;
} FsDeviceRemoteOpenParam;

The fileID is the ioFileID returned from FS_OPEN. The useFlags are the newUseFlags returned from FS_OPEN. The dataSize indicates the number of valid bytes in openData. The FsrmtUnionData has been described previously.

25. FS_SELECT

This RPC is used to poll a remote I/O server to determine if an object is ready for I/O. The input parameters include a fileID that identifies the object, three fields (read, write, and except) that correspond to the select state of the object, and process information used for remote waiting. If a field is non-zero it means an application is querying that state. The return parameters contain copies of these fields, and the I/O server should clear a field to zero if that state is not applicable. In this case, the I/O server should save the Sync_RemoteWaiter information so it can notify the waiting process when the object's state changes.

typedef struct FsRemoteSelectParams {
    Fs_FileID fileID;
    int read;
    int write;
    int except;
    Sync_RemoteWaiter waiter;
} FsRemoteSelectParams;

typedef struct FsRemoteSelectResults {
    int read;
    int write;
    int except;
} FsRemoteSelectResults;

NB: The read, write, and except fields are defined to be zero or non-zero, and the fields in the result should be exact copies of the request fields, or they should be reset to zero. (If non-zero they happen to be the bit that corresponds to the bit in the select mask. It is only appropriate to copy the bit or clear it. Don't blindly set the reply fields to 1.)
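
The copy-or-clear rule above can be expressed as a tiny sketch. The SelectFields structure is a simplified stand-in for the request and result structures, and the "ready" predicates are placeholders for the object-specific checks; this is not the Sprite handler itself.

/* Simplified stand-in for the read/write/except fields of the request
 * and result structures above. */
typedef struct SelectFields {
    int read;
    int write;
    int except;
} SelectFields;

/*
 * Fill in the select reply: each reply field must be either an exact copy
 * of the corresponding request field (the object is ready in that sense)
 * or zero.  The request fields carry the caller's select-mask bits, so
 * blindly writing 1 would corrupt the mask on the client.
 */
static void
FillSelectReply(const SelectFields *request, SelectFields *reply,
                int isReadable, int isWritable, int hasException)
{
    reply->read   = (request->read   && isReadable)   ? request->read   : 0;
    reply->write  = (request->write  && isWritable)   ? request->write  : 0;
    reply->except = (request->except && hasException) ? request->except : 0;
}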

26. FS_IO_CONTROL

This is used to do an object-specific operation. A number of generic I/O controls are defined below, and the implementations of different objects are free to define more I/O controls. The request parameter area contains a FsrmtIOCParam structure, which is defined below. The reply parameter area contains a Fs_IOReply structure that defines the amount of data returned, and a signal to generate, if any. The request and reply data areas contain data blocks that are uninterpreted by generic kernel code. In particular, they cannot be byteswapped except by the implementation of the I/O control handler. The parameters include a byteOrder field so the handler can detect a mismatch. In this case it should byteswap the request data block so it can properly interpret it, and also byteswap the reply data block so the application process on the remote client can properly interpret it.

typedef struct FsrmtIOCParam {
    Fs_FileID fileID;
    Fs_FileID streamID;
    Proc_PID procID;
    Proc_PID familyID;
    int command;
    int inBufSize;
    int outBufSize;
    Fmt_Format format;
    int uid;
} FsrmtIOCParam;

The fileID and streamID have been returned by FS_OPEN. The procID and familyID identify the process and its process group. The sizes indicate how much data is being sent to the I/O server and the maximum amount that can be accepted in return. The format defines a byte ordering, and it is used in conjunction with the Format library package to do byteswapping. The uid specifies the user ID of the application process. The command is the I/O control operation, and the generic ones are defined below.

1 IOC_REPOSITION
    Reposition the current access position of the I/O stream. The input data block contains the following structure to define the new access position.

    #define IOC_BASE_ZERO 0
    #define IOC_BASE_CURRENT 1
    #define IOC_BASE_EOF 2

    typedef struct Ioc_RepositionArgs {
        int base;
        int offset;
    } Ioc_RepositionArgs;

2 IOC_GET_FLAGS
    Return the flag bits associated with an I/O stream. The reply data block contains an integer with the flag bits. The following bits are defined for all objects, although other objects may define more flags.

    #define IOC_GENERIC_FLAGS 0xFF
    #define IOC_APPEND 0x01
    #define IOC_NON_BLOCKING 0x02
    #define IOC_ASYNCHRONOUS 0x04
    #define IOC_CLOSE_ON_EXEC 0x08

3 IOC_SET_FLAGS
    Set the flags on an I/O stream. The request data block contains an integer which is to be the new version of the flag word. It completely replaces the old flag word.

4 IOC_SET_BITS
    Set individual bits in the flags word of an I/O stream. The request data block contains an integer with the desired bits set.

5 IOC_CLEAR_BITS
    Clear individual bits in the flags word of an I/O stream. The request data block contains an integer with the desired bits set.

6 IOC_TRUNCATE
    Truncate an object to a specified length. The input data block contains an integer which is the desired length.

7 IOC_LOCK
    Place an advisory lock on an object. Used to implement UNIX flock(). The request data block contains the following structure. If the lock cannot be obtained then FS_WOULD_BLOCK should be returned and the process information should be saved at the I/O server for a subsequent REMOTE_WAKEUP RPC back to the client.

    typedef struct Ioc_LockArgs {
        int flags;
        int hostID;
        Proc_PID pid;
        int token;
    } Ioc_LockArgs;

    #define IOC_LOCK_SHARED 0x1
    #define IOC_LOCK_EXCLUSIVE 0x2
    #define IOC_LOCK_NO_BLOCK 0x8

    The flag bits are defined above, and specify whether the lock should be exclusive, in which case it is blocked by either an existing exclusive lock or by a shared lock, or whether it is a shared lock, in which case it can co-exist with other shared locks but not with an exclusive lock. If IOC_LOCK_NO_BLOCK is set then the application process will not be blocked by the client in the case of FS_WOULD_BLOCK, so the I/O server doesn't have to remember the process information.

8 IOC_UNLOCK
    Remove an advisory lock on an object. The complement of IOC_LOCK.

9 IOC_NUM_READABLE
    Returns the number of bytes available on an I/O stream. The input data block contains the current offset of the stream. The return data block contains the number of bytes available on the stream.

10 IOC_GET_OWNER
    Returns the owner of an I/O stream, which is either a process or a process group. The return data block contains the following structure, along with definitions for the procOrFamily field.

    #define IOC_OWNER_FAMILY 0x1
    #define IOC_OWNER_PROC 0x2

    typedef struct Ioc_Owner {
        Proc_PID id;
        int procOrFamily;
    } Ioc_Owner;

11 IOC_SET_OWNER
    This defines the owner of an I/O stream. The request data block contains the Ioc_Owner structure.

12 IOC_MAP
    Obsolete, superseded by IOC_MMAP_INFO.

13 IOC_PREFIX
    This returns the prefix under which a stream was opened. This is used to implement the getwd() library call. The reply data block contains the prefix, which is null terminated.

14 IOC_WRITE_BACK
    This is used to write back a range of bytes of a file to the server's disk. The cache will align the write-back on block boundaries that include the specified range of bytes. The request data area contains the following structure.

    typedef struct Ioc_WriteBackArgs {
        int firstByte;
        int lastByte;
        Boolean shouldBlock;
    } Ioc_WriteBackArgs;

15 IOC_MMAP_INFO
    Tell the I/O server that a client is mapping a stream into memory. The request data area contains the following structure. The isMapped field is 0 if the client is unmapping the stream, and 1 if it is mapping the stream.

    typedef struct Ioc_MmapInfoArgs {
        int isMapped;
        int clientID;
    } Ioc_MmapInfoArgs;

((1<<16)-1) IOC_GENERIC_LIMIT
    The Sprite kernel reserves the numbers below this for generic I/O control commands. Other device drivers and pseudo-device servers define their own I/O controls. Look at the README file ``/sprite/src/include/dev/README'' for details.
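
The IOC_LOCK compatibility rules (item 7 above) amount to the usual shared/exclusive matrix; a minimal sketch follows. The LockState structure and function names are invented for illustration and are not taken from the Sprite kernel.

#define IOC_LOCK_SHARED    0x1
#define IOC_LOCK_EXCLUSIVE 0x2
#define IOC_LOCK_NO_BLOCK  0x8

/* Simplified per-object lock state assumed for this sketch. */
typedef struct LockState {
    int numShared;      /* number of shared lock holders */
    int exclusive;      /* non-zero if an exclusive lock is held */
} LockState;

/*
 * Return 1 if the requested lock (flags from Ioc_LockArgs) can be granted
 * immediately: an exclusive request is blocked by any existing lock, and a
 * shared request is blocked only by an exclusive lock.
 */
static int
LockGrantable(const LockState *state, int flags)
{
    if (flags & IOC_LOCK_EXCLUSIVE) {
        return state->exclusive == 0 && state->numShared == 0;
    }
    return state->exclusive == 0;       /* shared request */
}

/*
 * When the lock is not grantable the I/O server returns FS_WOULD_BLOCK.
 * It only needs to remember the Sync_RemoteWaiter information (for a later
 * REMOTE_WAKEUP) when IOC_LOCK_NO_BLOCK is clear; with IOC_LOCK_NO_BLOCK
 * set the client will not block the application process anyway.
 */
static int
MustRememberWaiter(int flags)
{
    return (flags & IOC_LOCK_NO_BLOCK) == 0;
}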

27. FS_CONSIST

This is issued by a file server as a side effect of an FS_OPEN RPC. It is a command to a client (not the one doing the FS_OPEN) to control its cache so that future accesses see consistent data. The request parameter area contains the following structure, and the request data area, reply data area, and reply parameter area are empty. The client should respond immediately to this RPC and perform the cache consistency actions in the background. The client issues a FS_CONSIST_REPLY RPC to the server when it has completed the requested actions. This is a crude way of doing parallel RPCs to many clients. The server sets up a short timeout (about 1 minute) for the client to complete its actions, and it will let an open complete anyway if this timeout expires and a rogue client has not responded to a consistency request.

typedef struct ConsistMsg {
    Fs_FileID fileID;
    int flags;
    int openTimeStamp;
    int version;
} ConsistMsg;

The fileID identifies the file, and it is the same as the ioFileID returned from the FS_OPEN RPC. The flags are explained below, and they indicate what action the client should take. The openTimeStamp is the time stamp that the server thinks corresponds to the last openTimeStamp it returned to the client. The version should match the last version returned to the client in the Fsio_FileState. (It will eventually replace the openTimeStamp altogether.) The point of the openTimeStamp is that if two clients open the same file at the same time, then the reply to one client's FS_OPEN may lose a race with a FS_CONSIST RPC generated by the second client's open. If a client receives a FS_CONSIST RPC with an openTimeStamp ``in the future'' it drops the consistency request and returns FAILURE (1). This forces the file server to retry the FS_CONSIST call, giving the reply to the FS_OPEN a chance to arrive at the client. The consistency actions are defined below.

0x01 FSCONSIST_WRITE_BACK_BLOCKS
    The client should write back any dirty blocks that are lingering in its cache.

0x02 FSCONSIST_INVALIDATE_BLOCKS
    The client should stop caching the file because it is now concurrently write shared by different hosts. All future I/O operations on this file should bypass the client cache and go through to the file server.

0x04 FSCONSIST_DELETE_FILE
    This is issued as a side effect of a FS_REMOVE RPC if the client has dirty blocks for the file. This is done even if it is the same client as the one currently making the FS_REMOVE RPC.

0x08 FSCONSIST_CANT_CACHE_NAMED_PIPE
    This is reserved for if we ever re-implement named pipes.

0x10 FSCONSIST_WRITE_BACK_ATTRS
    The client should write back its notion of the access and modify times of the file that it is caching. This is generated as a side effect of FS_GET_ATTR and FS_GET_ATTR_PATH RPCs by other clients. This is only done if the client is actively using the file, and it is suppressed if the client only has the file open for execution. The attributes are returned with the FS_CONSIST_REPLY RPC.
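
The openTimeStamp race check described above can be sketched as follows. The structure and status names are simplified stand-ins for the Sprite types; the essential point is the comparison against the client's recorded openTimeStamp.

/* Simplified stand-ins for the relevant Sprite types and status codes. */
typedef struct CachedFileInfo {
    int openTimeStamp;      /* from the last Fsio_FileState for this file */
} CachedFileInfo;

typedef enum { ST_SUCCESS = 0, ST_FAILURE = 1 } Status;

/*
 * Handle the timestamp portion of an incoming FS_CONSIST request.  If the
 * message carries an openTimeStamp newer than the one the client has seen,
 * the reply to the client's own FS_OPEN is still in flight; returning
 * FAILURE makes the file server retry the FS_CONSIST later.
 */
static Status
CheckConsistTimeStamp(const CachedFileInfo *info, int msgOpenTimeStamp)
{
    if (msgOpenTimeStamp > info->openTimeStamp) {
        return ST_FAILURE;      /* "in the future": drop, let server retry */
    }
    return ST_SUCCESS;          /* safe to perform the consistency actions */
}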

28. FS_CONSIST_REPLY

This is issued by the client when it has completed the consistency actions requested by the server. The request parameter area contains the following structure.

typedef struct ConsistReply {
    Fs_FileID fileID;
    Fscache_Attributes cachedAttr;
    ReturnStatus status;
} ConsistReply;

The fileID indicates the file that was acted on. The client always returns its notion of the attributes of the file because it updates these while caching the file. The status indicates whether it could comply with the request. If the server's disk is so full that a write-back could not be made then this status is FS_DISK_FULL. The data is not lost, but it lingers in the client's cache until the write-back can succeed. However, this means that an open() can fail with a disk full error!

29. FS_COPY_BLOCK

This is used during fork() to copy a swap file on the file server. This prevents the client from reading a swap file over the network just to copy it and write it back. The request parameter area contains the following structure, and the request data area, reply parameter area, and reply data area are empty.

typedef struct FsrmtBlockCopyParam {
    Fs_FileID srcFileID;
    Fs_FileID destFileID;
    int blockNum;
} FsrmtBlockCopyParam;

The srcFileID and destFileID have been returned from FS_OPEN when the swap files were opened. The blockNum specifies the FS_BLOCK_SIZE (4096 bytes) block to copy. This is a logical block number because the client has no notion of where the swap files live on disk.

30. FS_MIGRATE

This is used during process migration to inform the I/O server that an I/O stream has migrated to a new client. This is invoked from the destination client as part of creating the process. The request parameter area contains the following structure.

typedef struct FsMigInfo {
    Fs_FileID streamID;
    Fs_FileID ioFileID;
    Fs_FileID nameID;
    Fs_FileID rootID;
    int srcClientID;
    int offset;
    int flags;
} FsMigInfo;

This information is packaged up on the source client when the process migrates away. The streamID, ioFileID, and nameID are those that have been returned from a previous FS_OPEN. The rootID was specified in the FS_OPEN request that created the stream. The srcClientID is the client where the process left. The offset is the stream offset at the time the process left. (This is a bug; the offset should be returned from the source client as part of the FS_RELEASE RPC.) The flags are the flags from the stream. The reply parameter area of the FS_MIGRATE RPC contains the following structure. The request and reply data areas are empty.

typedef struct FsrmtMigParam {
    int dataSize;
    FsrmtUnionData data;
    FsrmtMigrateReply migReply;
} FsrmtMigParam;

typedef struct FsrmtMigrateReply {
    int flags;
    int offset;
} FsrmtMigrateReply;

#define FS_RMT_SHARED 0x04000000

The I/O server returns the same FsrmtUnionData as it does in an FS_OPEN RPC, and the client uses this to set up its I/O stream data structures. The I/O server also tells the client the new stream offset to use, and it gives the client a new version of the stream flags. If the flags include the FS_RMT_SHARED bit then processes on different hosts are sharing the stream. In this case the offsets in the clients' stream descriptors are not valid, and I/O operations on the object have to go through to the I/O server. The I/O server keeps a shadow stream descriptor that contains the valid stream offset in this case.

31. FS_RELEASE

This is used by the I/O server during process migration to tell the source of a migrated process that it can release an I/O stream that had been associated with the process. Recall that fork() and dup() create extra references to a stream descriptor, so this call is used to release that reference. This cannot be done safely at the time the process leaves, so it is done as a side effect of the FS_MIGRATE RPC issued from the destination client. At this time the current offset in the source client's stream descriptor is also returned to the I/O server, in case it needs to be cached there while the stream is shared by processes on different hosts. The request parameter area contains the ID of the stream that migrated, and the reply contains an inUse flag and the current offset. The inUse flag should be set if there are still processes on the source client that reference the stream descriptor.

typedef struct {
    Fs_FileID streamID;
} FsStreamReleaseParam;

typedef struct {
    Boolean inUse;
    int offset;
} FsStreamReleaseReply;

32. FS_REOPEN

This is used during the state recovery protocol to inform the I/O server about I/O streams in use by a client. The request and reply parameter areas vary depending on the object being reopened. The following structures are possible, although note that every request structure contains a Fs_FileID as its first element. Also, the client should map its FSIO_RMT stream types to the corresponding FSIO_LCL stream types before making the RPC.

typedef struct FsRmtDeviceReopenParams {
    Fs_FileID fileID;
    Fsutil_UseCounts use;
} FsRmtDeviceReopenParams;

typedef struct Fsutil_UseCounts {
    int ref;
    int write;
    int exec;
} Fsutil_UseCounts;

The reopen parameters identify the object and specify how many streams the client has to it. Note that the ref field in Fsutil_UseCounts is not the number of reading streams but the total number of streams. This is a mistake and will be fixed eventually; it makes it impossible for a reader to reopen ``/dev/syslog'', which is a single reader/multiple writer device. The request data area, the reply parameter area, and the reply data area are empty in the case of device reopening.

typedef struct Fsio_PipeReopenParams {
    Fs_FileID fileID;
    Fsutil_UseCounts use;
} Fsio_PipeReopenParams;

A pipe may become remote due to process migration, therefore it may have to be reopened if the client loses touch with the server. If the server has crashed then the reopen will fail, but if there has only been a network partition the reopen may succeed. The request data area, the reply parameter area, and the reply data area are empty in the case of pipe reopening.

typedef struct Fsio_FileReopenParams {
    Fs_FileID fileID;
    Fs_FileID prefixFileID;
    Fsutil_UseCounts use;
    Boolean flags;
    int version;
} Fsio_FileReopenParams;

#define FSIO_HAVE_BLOCKS 0x1
#define FS_SWAP 0x4000

The reopen parameters for a file specify the file and the prefix of the domain of the file. This is needed to validate that the server still has the disk mounted. The Fsutil_UseCounts are as described above. The flags include FSIO_HAVE_BLOCKS, FS_SWAP, and FS_MAP to indicate whether the client has dirty data blocks, is using the file for VM backing store, or is mapping the file into its memory. The version number is the version that the client has cached. The reply parameters when reopening a file are the Fsio_FileState described above. The client should verify that the version number it has is correct, just as it does during an FS_OPEN.

typedef struct FspdevControlReopenParams {
    Fs_FileID fileID;
    int serverID;
    int seed;
} FspdevControlReopenParams;

The file server that stores a pseudo-device file also keeps some state as to whether there is currently a server process for the pseudo-device so it can prevent conflicts. If the file server crashes this information has to be restored, and this is done by reopening the pseudo-device control handle. The fileID in the reopen parameters is that returned from FS_OPEN when the server process opened the pseudo-device with the FS_PDEV_MASTER flag. The serverID is -1 if the server process has gone away since the file server crashed. The seed is used by the file server to generate unique fileIDs for the connections to the pseudo-device, and this needs to be restored, too. Under normal operation the file server increments its seed every time a new open is done on the pseudo-device, and it puts the seed into the low-order 12 bits of the minor field in the ioFileID returned from FS_OPEN. The I/O server knows about this, and it extracts the seed from the ioFileID so it can restore it during recovery.

typedef struct StreamReopenParams {
    Fs_FileID streamID;
    Fs_FileID ioFileID;
    int useFlags;
    int offset;
} StreamReopenParams;

After all the other kinds of I/O handles have been reopened at a server, the client reopens its stream descriptors that reference the I/O handles. The reopen specifies the streamID and the ioFileID, so the I/O server can verify that its shadow stream descriptor connects to the same I/O handle that the client thinks it should. The offset is used to recover the offset in the server's shadow stream descriptor. (This isn't implemented. If the file server crashes while a stream is shared by processes on different hosts, then the shared offset is lost. This needs to be fixed, perhaps by adding an offset to the Fs_IOReply structure so the client can cache the offset.)
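
The seed encoding mentioned above (the low-order 12 bits of the ioFileID's minor field) is simple bit manipulation; a small sketch follows. The Fs_FileID layout here is simplified, the function and macro names are invented, and how the upper bits of the minor field are formed is not specified by the protocol, so the second function is purely illustrative.

/* Simplified Fs_FileID for this sketch; the real structure differs. */
typedef struct Fs_FileID {
    int type;
    int serverID;
    int major;
    int minor;
} Fs_FileID;

#define PDEV_SEED_BITS 12
#define PDEV_SEED_MASK ((1 << PDEV_SEED_BITS) - 1)      /* 0xFFF */

/* Extract the per-connection seed embedded in the minor field. */
static int
PdevSeedFromFileID(const Fs_FileID *ioFileID)
{
    return ioFileID->minor & PDEV_SEED_MASK;
}

/* Illustrative construction of a minor field from some upper bits plus the
 * incrementing seed (assumed layout, not taken from the Sprite sources). */
static int
PdevMakeMinor(int upperBits, int seed)
{
    return (upperBits << PDEV_SEED_BITS) | (seed & PDEV_SEED_MASK);
}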

33. FS_RECOVERY

This is used after a client has completed its state recovery. It is needed because the server drops regular FS_OPEN RPCs while a client is doing FS_REOPEN RPCs. Specifically, after a server gets a FS_REOPEN from a client it drops FS_OPEN RPCs from that client until it receives an FS_RECOVERY RPC. The FS_RECOVERY RPC can also be used by a client to signal that it is beginning the recovery protocol, but this is not necessary. The parameter area of the request message contains a single integer that contains the flag CLT_RECOV_IN_PROGRESS if the client is initiating recovery, and that has no flag (zero value) when the client completes recovery.

#define CLT_RECOV_IN_PROGRESS 0x1
#define CLT_RECOV_COMPLETE 0x0

34. FS_DOMAIN_INFO

This is used to get information about a file system domain, including the amount of disk space available. The request parameter area contains the fileID associated with the prefix of the domain. This is returned from the FS_PREFIX RPC. The reply parameter area contains a FsDomainInfoResults structure. The request and reply data areas are empty.

typedef struct FsDomainInfoResults {
    Fs_DomainInfo domain;
    Fs_FileID fileID;
} FsDomainInfoResults;

typedef struct {
    int maxKbytes;
    int freeKbytes;
    int maxFileDesc;
    int freeFileDesc;
    int blockSize;
    int optSize;
} Fs_DomainInfo;

The fileID that is returned is the user-visible fileID that an application program would see if it did a GET_ATTR_PATH on the prefix. With a pseudo-file-system this is different from the internal fileID associated with the prefix, which identifies a request-response connection between the kernel and the server process. The Fs_DomainInfo indicates the maximum size of the file system, the number of free kilobytes, the maximum number of file descriptors, the number of free descriptors, the native block size of the file system, and the optimal transfer size of the file system.

35. PROC_MIG_COMMAND

This is used to transfer process state between Sprite hosts. The request parameter area contains a process ID and a command identifier. The request data area contains command-specific data. The reply parameter area contains a return status, and optionally some command-specific data. The migration commands are described below, along with their command-specific data.

typedef struct {
    Proc_PID remotePid;
    int command;
} ProcMigCmd;

0 PROC_MIGRATE_CMD_INIT
    This is used to request permission to migrate to another host. The remotePid field of the ProcMigCmd is NIL if the process is leaving its home node. During eviction, when a process is migrating back home, the remotePid field is the home node process ID. The request data area contains a ProcMigInitiateCmd structure, and the reply parameter area contains the process ID for the process on the remote host.

    typedef struct {
        int version;
        Proc_PID processID;
        int userID;
        int clientID;
    } ProcMigInitiateCmd;

    The version is a process migration implementation version number to ensure that the two hosts are compatible. The processID is the ID of the process that wishes to migrate. The userID is that of the owner of the process. The clientID is the Sprite hostID of the host issuing the request.

1 PROC_MIGRATE_CMD_ENTIRE
    This transfers the process control block. The request data area contains an encapsulated control block. The exact format of the encapsulated control block is machine specific and will not be described here. The reply data area is empty.

2 PROC_MIGRATE_CMD_UPDATE
    This is used to update the state of a migrated process. The request data area contains an UpdateEncapState structure, which contains the few fields of a Sprite process control block that a process can modify.

    typedef struct {
        int familyID;
        int userID;
        int effectiveUserID;
        int billingRate;
    } UpdateEncapState;

3 PROC_MIGRATE_CMD_CALLBACK
    Not used.

4 PROC_MIGRATE_CMD_DESTROY
    This is called to kill a migrated process. The request and reply data areas are empty.

5 PROC_MIGRATE_CMD_RESUME
    This is called to continue execution of a suspended migrated process. The request and reply data areas are empty.

6 PROC_MIGRATE_CMD_SUSPEND
    This is called to suspend execution of a migrated process. The request and reply data areas are empty.

36. PROC_REMOTE_CALL

This is used to forward a system call from a migrated process back to its home node. Most system calls are not forwarded, only a few that depend on state maintained at the home node. The format of the request and reply are implementation and system call specific, and are not described here.

37. PROC_REMOTE_WAIT

This is used when a migrated process waits for child processes. Communication with the home node is required because synchronization with process creation, process exit, and waiting is done there. The parameter area contains a ProcRemoteWaitCmd structure, and the request data area contains an array of process IDs on which to wait. The reply parameter area is empty and the reply data area contains a ProcChildInfo structure. (Byte ordering isn't an issue in the data area because process migration only works between hosts of the same machine architecture.)

typedef struct {
    Proc_PID pid;
    int numPids;
    Boolean flags;
    int token;
} ProcRemoteWaitCmd;

typedef struct {
    Proc_PID processID;
    int termReason;
    int termStatus;
    int termCode;
    int numQuantumEnds;
    int numWaitEvents;
    Timer_Ticks kernelCpuUsage;
    Timer_Ticks userCpuUsage;
    Timer_Ticks childKernelCpuUsage;
    Timer_Ticks childUserCpuUsage;
} ProcChildInfo;

38. PROC_GETPCB

This is used to return the process control block of a migrated process for implementation of the ps (process status) application program. The request parameter area contains an integer with value GET_PCB (0x1) or GET_SEG_INFO (0x2). With GET_PCB the request data area contains the processID (also an integer). The reply parameter area contains a Proc_PCBInfo structure, and the reply data area contains the argument string of the process. The Proc_PCBInfo is described in ``/sprite/lib/include/proc.h''. With GET_SEG_INFO the request data area contains a virtual memory segment number, and the reply parameter area contains a Vm_SegmentInfo structure, which is described in ``/sprite/lib/include/vm.h''.

39. REMOTE_WAKEUP

This is used to notify a remote process that some event has occurred. The process has presumably registered itself via some blocking call such as FS_READ or FS_WRITE, whose parameters include a Sync_RemoteWaiter structure. The request parameter area of REMOTE_WAKEUP contains a Sync_RemoteWaiter structure, and the request data area, reply parameter area, and reply data area are empty. Note that this wakeup message can race with the process's decision to wait at the other host. To foil the race condition a process must be marked as being in the process of deciding to wait. In the Sprite implementation, this is done by clearing a notify bit kept in the process's control block. When a REMOTE_WAKEUP RPC is received by a Sprite host, the notify bit in the process control block is set. Before actually blocking a process (in response to a FS_WOULD_BLOCK return code) the Sprite kernel checks that the notify bit has not been set asynchronously via this RPC. If the bit has been set, then the process is not blocked and it retries its operation immediately. If this technique were not used then notifications might get lost and hang the process.
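
The notify-bit technique can be sketched as follows. This is a simplified, single-host illustration with invented names and no real RPC, scheduler, or locking (a real kernel would synchronize access to the bit), but it shows the order of operations that prevents a lost wakeup.

/* Simplified process state for this sketch; the real PCB is much larger. */
typedef struct Process {
    volatile int notifyBit;     /* set by an incoming REMOTE_WAKEUP */
} Process;

/* Called before issuing a blocking RPC such as FS_READ: clear the bit so a
 * wakeup that arrives while the RPC is outstanding is not missed. */
static void
MarkDecidingToWait(Process *p)
{
    p->notifyBit = 0;
}

/* Handler for an incoming REMOTE_WAKEUP RPC for this process. */
static void
RemoteWakeupReceived(Process *p)
{
    p->notifyBit = 1;
}

/* Called when the RPC returned FS_WOULD_BLOCK: only block if no wakeup has
 * arrived in the meantime; otherwise retry the operation immediately. */
static int
ShouldBlock(const Process *p)
{
    return p->notifyBit == 0;
}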

40. SIG_SEND

This is used to issue a signal to a remote process. The request parameter area contains a SigParms structure. The request data area, reply parameter area, and reply data area are empty.

typedef struct {
    int sigNum;
    int code;
    Proc_PID id;
    Boolean familyID;
    int effUid;
} SigParms;

The sigNum is a Sprite signal, and the code field is used to modify this. The id is a process identifier if the familyID field is zero; otherwise id is a process group identifier. The effUid is the effective user ID of the signaling process, and this is used to verify permissions.

41. Bugs and Omissions

This specification is based on the Sprite implementation as of Fall 1989. There are a couple of known bugs in it, and it is a bit crufty. However, there is a lot of inertia behind the network interface because changing it requires coordinated changes on all Sprite hosts. Future changes to the interface will ideally be backward compatible with this interface by introducing new RPCs that fix certain bugs, while retaining the originals for compatibility with hosts running older versions of Sprite. The known bugs in the interface are summarized below.

41.1. FS_REMOTE_LINK vs. FS_SYMBOLIC_LINK

The Fs_ReadLink (or UNIX readlink()) system call is implemented as an FS_OPEN followed by an FS_READ. The type field in the Fs_OpenArgs is specified as FS_REMOTE_LINK in the current implementation, but the server should also allow regular symbolic links to be opened.

41.2. Device Attributes

Currently, while the I/O server maintains the access and modify times for a device while it is open, this information is not pushed back to the file server when the device is closed.

41.3. Pathname Redirection

There is some cruft in the way pathnames are returned from the server. In some RPCs there is an extra 4 bytes in the data area that precedes the pathname, but in the attributes RPCs the padding is gone. This is a hold-over from pre-byteswapping days when the 4 bytes in the data area contained the prefix length. Similarly, with FS_RENAME and FS_LINK, there are 8 bytes of junk before the returned pathname.

41.4. FS_SERVER_WRITE_THRU

This flag is currently private to the client side of the implementation. It could be passed through to the server to force a write-through to disk. Currently, however, the client and server writing policies are completely independent. Ordinarily clients use a 30 second delay, and servers use write-back-ASAP. This means that a file ages in the client's cache for 30 seconds, and then gets scheduled for a disk write-back after the last block arrives from the client. Note that a client can use fsync(), in which case the blocks are forced through to the server's disk, and fsync() doesn't return until after that has happened.

References

Accetta86. M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, ``Mach: A New Kernel Foundation for UNIX Development'', Proc. of Summer USENIX, July 1986.
Atlas86. A. Atlas and P. Flinn, ``Error Recovery in a Stateful Remote Filesystem'', USENIX Conference Proceedings, Summer 1986, 355-365.
Back87. J. J. Back, M. W. Luppi, A. S. Melamed and K. Yueh, ``A Remote-File Cache for RFS'', USENIX Conference Proceedings, Summer 1987, 273-279.
Bartlett81. J. Bartlett, ``A NonStop (tm) Kernel'', Proc. of the 8th Symp. on Operating System Prin., Dec. 1981, 22-29.
Bershad88. B. N. Bershad and C. B. Pinkerton, ``Watchdogs - Extending the UNIX File System'', USENIX Association 1988 Winter Conference Proceedings, Feb. 1988, 267-275.
Bershad89. B. N. Bershad, T. E. Anderson, E. D. Lazowska and H. M. Levy, ``Lightweight Remote Procedure Call'', Proc. 12th Symp. on Operating System Prin., Operating Systems Review 23, 5 (December 1989), 102-113.
Birrell82. A. D. Birrell, ``Grapevine: An Exercise in Distributed Computing'', Comm. of the ACM 25, 4 (Apr. 1982), 260-274.
Birrell84. A. D. Birrell and B. J. Nelson, ``Implementing Remote Procedure Calls'', ACM Transactions on Computer Systems 2, 1 (Feb. 1984), 39-59.
Brownbridge82. D. R. Brownbridge, L. F. Marshall and B. Randell, ``The Newcastle Connection or UNIXes of the World Unite!'', Software Practice and Experience 12 (1982), 1147-1162.
Burrows88. M. Burrows, ``Efficient Data Sharing'', PhD Thesis, Dec. 1988. University of Cambridge, Computer Laboratory.
Cabrera87. L. F. Cabrera and J. Wyllie, ``QuickSilver Distributed File Services: An Architecture for Horizontal Growth'', Research Report 5578 (56697) 4/1/87, IBM Almaden Research Center, Apr. 1987.
Chartock87. H. Chartock, ``RFS in SunOS'', USENIX Conference Proceedings, Summer 1987, 281-290.
Cheriton84. D. R. Cheriton, ``The V Kernel: A software base for distributed systems'', IEEE Software 1, 2 (Apr. 1984), 19-42.
Cheriton87. D. R. Cheriton, ``UIO: A uniform I/O interface for distributed systems'', ACM Trans. on Computer Systems 5, 1 (Feb. 1987), 12-46.
Cheriton89. D. R. Cheriton and T. P. Mann, ``Decentralizing a Global Naming Service for Improved Performance and Fault Tolerance'', Trans. Computer Systems 7, 2 (May 1989), 147-183.

Clark85. D. Clark, ``The Structuring of Systems Using Upcalls'', Proc. 10th Symp. on Operating System Prin., Operating Systems Review 19, 5 (December 1985), 171-180.
Douglis87. F. Douglis, ``Process Migration in Sprite'', Technical Report UCB/Computer Science Dpt. 87/343, University of California, Berkeley, Feb. 1987.
Douglis90. F. Douglis, ``Transparent Process Migration for Personal Workstations'', PhD Thesis, in preparation, 1990. University of California, Berkeley.
Ellis83. C. S. Ellis and R. A. Floyd, ``The Roe File System'', Proc. of the 3rd Symp. on Reliability in Distributed Software and Database Systems, 1983, 175-181.
Fabry74. R. Fabry, ``Capability-Based Addressing'', Comm. of the ACM 17, 7 (July 1974), 403-412.
Feirtag71. R. J. Feirtag and E. I. Organick, ``The Multics Input/Output System'', Proc. of the 3rd Symp. on Operating System Prin., 1971, 35-41.
Fitzgerald85. R. Fitzgerald and R. Rashid, ``The Integration of Virtual Memory Management and Interprocess Communication in Accent'', Proc. 10th Symp. on Operating System Prin., Operating Systems Review 19, 5 (December 1985), 13-14.
Goldberg74. R. Goldberg, ``Survey of Virtual Machine Research'', IEEE Computer, June 1974, 34-45.
Gray89. C. G. Gray and D. R. Cheriton, ``Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency'', Proc. 12th Symp. on Operating System Prin., Operating Systems Review 23, 5 (December 1989), 202-210.
Hagmann87. R. Hagmann, ``Reimplementing the Cedar File System Using Logging and Group Commit'', Proc. of the 11th Symp. on Operating System Prin., Nov. 1987, 155-162.
Hill86. M. D. Hill, S. J. Eggers, J. R. Larus, G. S. Taylor, G. Adams, B. K. Bose, G. A. Gibson, P. M. Hansen, J. Keller, S. I. Kong, C. G. Lee, D. Lee, J. M. Pendleton, S. A. Ritchie, D. A. Wood, B. G. Zorn, P. N. Hilfinger, D. Hodges, R. H. Katz, J. Ousterhout and D. A. Patterson, ``Design Decisions in SPUR'', IEEE Computer 19, 11 (November 1986).
Hisgen89. A. Hisgen, A. Birrell, T. Mann, M. Schroeder and G. Swart, ``Availability and Consistency Tradeoffs in the Echo Distributed File System'', Proc. Second Workshop on Workstation Operating Systems, Sep. 1989, 49-54.

Howard88. J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham and M. J. West, ``Scale and Performance in a Distributed File System'', Trans. Computer Systems 6, 1 (Feb. 1988), 51-81.
Hughes86. R. P. Hughes, ``The Transparent Remote File System'', USENIX Conference Proceedings, Summer 1986, 306-317.
Kazar89. M. Kazar, ``Ubik: Replicated Servers Made Easy'', Proc. Second Workshop on Workstation Operating Systems, Sep. 1989, 60-67.
Kleiman86. S. Kleiman, ``Vnodes: An Architecture for Multiple File System Types in Sun UNIX'', USENIX Conference Proceedings, June 1986, 238-247.
Lampson86. B. Lampson, ``Designing a Global Name Service'', Proc. ACM SIGACT News-SIGOPS 5th Symposium on the Principles of Distributed Computing, Aug. 1986, 1-10.
Leach82. P. J. Leach, B. L. Stumpf, J. A. Hamilton and P. H. Levine, ``UIDs as Internal Names in a Distributed File System'', Proc. of ACM SIGACT News-SIGOPS Symposium on Principles of Distributed Computing, Aug. 1982, 34-41.
Leach83. P. J. Leach, P. H. Levine, B. P. Douros, J. A. Hamilton, D. L. Nelson and B. L. Stumpf, ``The Architecture of an Integrated Local Network'', IEEE Journal on Selected Areas in Communications SAC-1, 5 (Nov. 1983), 842-857.
Levine85. P. H. Levine, P. J. Leach and N. W. Mishkin, ``Distributed File Systems for Intimate Sharing'', Proc. 1st Int'l Conf. on Computer Workstations, 1985, 282-285.
Lobelle85. M. C. Lobelle, ``Integration of Diskless Workstations in UNIX United'', Software Practice and Experience 15, 10 (Oct. 1985), 997-1010.
McKusick84. M. K. McKusick, ``A Fast File System for UNIX'', ACM Transactions on Computer Systems 2, 3 (Aug. 1984), 181-197.
Mitchell82. J. G. Mitchell and J. Dion, ``A Comparison of Two Network-Based File Servers'', Comm. of the ACM 25, 4 (Apr. 1982), 233-245.
Needham79. R. Needham, ``System Aspects of the Cambridge Ring'', Proc. of the 7th Symp. on Operating System Prin., 1979, 82-85.
Nelson88a. M. Nelson, B. Welch and J. Ousterhout, ``Caching in the Sprite Network File System'', Trans. Computer Systems 6, 1 (Feb. 1988), 134-154.
Nelson88b. M. N. Nelson, ``Physical Memory Management in a Network Operating System'', PhD Thesis, Nov. 1988. University of California, Berkeley.
Oki85. B. Oki, B. Liskov and R. Scheifler, ``Reliable Object Storage to Support Atomic Actions'', Proc. 10th Symp. on Operating System Prin., Operating Systems Review 19, 5 (December 1985), 147-159.

Oppen83. D. C. Oppen and Y. K. Dalal, ``The Clearinghouse: A Decentralized Agent for Locating Named Objects in a Distributed Environment'', ACM Trans. on Office Information Systems 1, 3 (July 1983), 230-253.
Ousterhout85. J. Ousterhout, H. D. Costa, D. Harrison, J. Kunze, M. Kupfer and J. Thompson, ``A Trace-Driven Analysis of the UNIX 4.2 BSD File System'', Proc. 10th Symp. on Operating System Prin., Operating Systems Review 19, 5 (December 1985), 15-24.
Ousterhout88. J. Ousterhout, A. Cherenson, F. Douglis, M. Nelson and B. Welch, ``The Sprite Network Operating System'', IEEE Computer 21, 2 (Feb. 1988), 23-36.
Ousterhout89a. J. Ousterhout and F. Douglis, ``Beating the I/O Bottleneck: A Case for Log-Structured File Systems'', Operating Systems Review 23, 1 (Jan. 1989), 11-28.
Ousterhout89b. J. Ousterhout, ``Why Aren't Operating Systems Getting Faster as Fast as Hardware?'', WRL Technical Note TN-11, Oct. 1989.
Parker83. D. Parker, G. Popek, G. Rudisin, A. Stoughton, B. Walker, E. Walton, J. Chow, D. Edwards, S. Kiser and C. Kline, ``Detection of Mutual Inconsistency in Distributed Systems'', IEEE Transactions on Software Engineering, May 1983, 240-247.
Popek85. G. J. Popek and B. J. Walker, The LOCUS Distributed System Architecture, MIT Press, Boston, 1985.
TCP81. J. Postel, ``Transmission Control Protocol'', RFC 793, Sep. 1981.
IP81. J. Postel, ``Internet Protocol'', RFC 791, Sep. 1981.
Powell77. M. L. Powell, ``The DEMOS File System'', Proc. of the 6th Symp. on Operating System Prin., Nov. 1977, 33-42.
Presotto85. D. L. Presotto and D. M. Ritchie, ``Interprocess Communication in the Eighth Edition Unix System'', USENIX Association 1985 Summer Conference Proceedings, June 1985, 309-316.
Rees86. J. Rees, P. H. Levine, N. Mishkin and P. J. Leach, ``An Extensible I/O System'', USENIX Association 1986 Summer Conference Proceedings, June 1986, 114-125.
Renesse89. R. Renesse, J. M. Staveren and A. Tanenbaum, ``The Performance of the Amoeba Distributed Operating System'', Software Practice and Experience 19, 3 (Mar. 1989), 223-234.
Rifkin86. A. P. Rifkin, M. P. Forbes, R. L. Hamilton, M. Sabrio, S. Shah and K. Yueh, ``RFS Architectural Overview'', USENIX Association 1986 Summer Conference Proceedings, 1986.
Ritchie74. D. M. Ritchie and K. Thompson, ``The UNIX Time-Sharing System'', Communications of the ACM 17, 7 (July 1974), 365-375.

Ritchie84. D. Ritchie, ``A Stream Input-Output System'', The Bell System Technical Journal 63, 8 Part 2 (Oct. 1984), 1897-1910.
Rodriguez86. R. Rodriguez, M. Koehler and R. Hyde, ``The Generic File System'', USENIX Conference Proceedings, June 1986, 260-269.
Rowe82. L. A. Rowe and K. P. Birman, ``A Local Network Based on the UNIX Operating System'', IEEE Trans. on Software Engineering SE-8, 2 (Mar. 1982), 137-146.
Sandberg85. R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh and B. Lyon, ``Design and Implementation of the Sun Network Filesystem'', USENIX Conference Proceedings, June 1985, 119-130.
Satyanarayanan85. M. Satyanarayanan, J. Howard, D. Nichols, R. Sidebotham, A. Spector and M. West, ``The ITC Distributed File System: Principles and Design'', Proc. 10th Symp. on Operating System Prin., Operating Systems Review 19, 5 (December 1985), 35-50.
Satyanarayanan89. M. Satyanarayanan, ``Coda: A Highly Available File System'', Proc. Second Workshop on Workstation Operating Systems, Sep. 1989, 114-117.
Schroeder84. M. D. Schroeder, A. D. Birrell and R. M. Needham, ``Experience with Grapevine: The Growth of a Distributed System'', ACM Trans. on Computer Systems 2, 1 (Feb. 1984), 3-23.
Schroeder85. M. Schroeder, D. Gifford and R. Needham, ``A Caching File System For a Programmer's Workstation'', Proc. 10th Symp. on Operating System Prin., Operating Systems Review 19, 5 (December 1985), 25-34.
Scott89. M. Scott, T. LeBlanc and B. Marsh, ``A Multi-User, Multi-Language Open Operating System'', Proc. Second Workshop on Workstation Operating Systems, Sep. 1989, 125-129.
Sheltzer86. A. B. Sheltzer, R. Lindell and G. J. Popek, ``Name Service Locality and Cache Design in a Distributed Operating System'', Proceedings of the 6th ICDCS, May 1986, 515-522.
NFS85. Sun's Network File System, Sun Microsystems, 1985.
RPC85. RPC Protocol Specification, Sun Microsystems, 1985.
Swinehart79. D. Swinehart, G. McDaniel and D. Boggs, ``WFS: A Simple Shared File System for a Distributed Environment'', Proc. of the 7th Symp. on Operating System Prin., 1979, 9-17.
Swinehart86. D. Swinehart, P. Zellweger, R. Beach and R. Hagmann, ``A Structural View of the Cedar Programming Environment'', ACM Trans. Prog. Lang. and Systems 8, 4 (Oct. 1986), 419-490.
Terry85. D. B. Terry, ``Distributed Name Servers: Naming and Caching in Large Distributed Computing Environments'', PhD Thesis, Mar. 1985. University of California, Berkeley.

Terry88. D. Terry and D. Swinehart, ``Managing Stored Voice in the Etherphone System'', Trans. Computer Systems 6, 1 (Feb. 1988), 3-27.
Thompson87. J. Thompson, ``Efficient Analysis of Caching Systems'', PhD Thesis, 1987. University of California, Berkeley.
Tomlinson85. G. M. Tomlinson, D. Keeffe, I. C. Wand and A. J. Wellings, ``The PULSE Distributed File System'', Software-Practice and Experience 15, 11 (Nov. 1985), 1087-1101.
Walker83a. B. Walker, ``The LOCUS Distributed Operating System'', Proceedings of the 9th Symp. on Operating System Prin., Operating Systems Review 17, 5 (Nov. 1983), 49-70.
Walker83b. B. Walker, ``Issues of Network Transparency and File Replication in the Distributed File System Component of LOCUS'', PhD Thesis, 1983. University of California, Los Angeles.
Welch86a. B. B. Welch, ``The Sprite Remote Procedure Call System'', Technical Report UCB/Computer Science Dpt. 86/302, University of California, Berkeley, June 1986.
Welch86b. B. B. Welch and J. K. Ousterhout, ``Prefix Tables: A Simple Mechanism for Locating Files in a Distributed Filesystem'', Proc. of the 6th ICDCS, May 1986, 184-189.
Welch88. B. B. Welch and J. K. Ousterhout, ``Pseudo-Devices: User-Level Extensions to the Sprite File System'', Proc. of the 1988 Summer USENIX Conf., June 1988, 184-189.
Wilkes80. M. Wilkes and R. Needham, ``The Cambridge Model Distributed System'', Operating Systems Review 14, 1 (Jan. 1980).
Zwaenepoel85. W. Zwaenepoel, ``Protocols for Large Data Transfers Over Local Networks'', Proc. of the 9th Data Communication Symposium, Sep. 1985, 22-32.


Table of Contents

CHAPTER 1: Introduction ...... 1
1.1 The Sprite Operating System Project ...... 2
1.2 Thesis Overview ...... 3
1.2.1 The Shared File System ...... 4
1.2.2 Distributed State Management ...... 5
1.2.3 Pseudo-Devices and Pseudo-File-Systems ...... 6
1.3 Chapter Map ...... 7
1.4 Conclusion ...... 8

CHAPTER 2: Background and Related Work ...... 10
2.1 Introduction ...... 10
2.2 Terminology ...... 10
2.3 The UNIX File System ...... 12
2.3.1 A Hierarchical Naming System ...... 12
2.3.2 Objects and their Attributes ...... 14
2.3.3 I/O Streams ...... 14
2.3.3.1 Standard I/O Operations ...... 14
2.3.3.2 Stream Inheritance ...... 15
2.3.3.3 Pipes ...... 16
2.3.4 The Buffer Cache ...... 16
2.3.5 UNIX File System Summary ...... 17
2.4 Distributed System Background ...... 17
2.4.1 Network Communication ...... 17
2.4.2 The Client-Server Model and RPC ...... 17
2.4.3 Resource Location ...... 18
2.5 Related Work ...... 19
2.5.1 Early Distributed Systems ...... 19
2.5.2 Message-Based Systems ...... 19
2.5.2.1 The V System ...... 19
2.5.2.2 Mach ...... 20
2.5.2.3 Amoeba ...... 20
2.5.3 Remote File Services ...... 20
2.5.3.1 WFS ...... 20
2.5.3.2 IFS ...... 21
2.5.3.3 NFS ...... 21
2.5.3.4 RFS ...... 22
2.5.3.5 AFS ...... 22

2.5.3.6 LOCUS ...... 22
2.5.3.7 Other Systems ...... 23
2.6 The Sprite File System ...... 23
2.6.1 Target Environment ...... 23
2.6.2 Transparency ...... 24
2.6.3 Stateful vs. Stateless ...... 25
2.7 Conclusion ...... 26

CHAPTER 3: Distributed File System Architecture ...... 27
3.1 Introduction ...... 27
3.2 General Description ...... 28
3.2.1 Naming vs. I/O ...... 29
3.2.2 Remote Procedure Call ...... 32
3.3 Internal Data Structures ...... 33
3.3.1 Object Descriptors ...... 33
3.3.2 Stream Descriptors ...... 35
3.3.3 Maintaining State About I/O Streams ...... 37
3.4 The Naming Interface ...... 38
3.4.1 Structure of the Open Operation ...... 40
3.4.1.1 Opening Files ...... 41
3.4.1.2 Opening Devices ...... 41
3.5 The I/O Interface ...... 42
3.5.1 Blocking I/O ...... 44
3.5.2 Data Format ...... 46
3.5.3 Object Attributes ...... 47
3.6 Measurements ...... 48
3.6.1 RPC Performance ...... 48
3.6.2 RPC Traffic ...... 49
3.6.3 Other Components of the Architecture ...... 51
3.7 Comparison with Other Architectures ...... 52
3.8 Summary and Conclusions ...... 53

CHAPTER 4: Distributed Name Resolution: Prefix Tables ...... 54
4.1 Introduction ...... 54
4.2 Name Resolution in Stand-alone UNIX ...... 54
4.3 Remote Mounts ...... 55
4.4 Prefix Tables ...... 57
4.4.1 Using Prefix Tables ...... 57
4.4.2 Maintaining Prefix Tables ...... 60

4.4.3 Domain Crossings ...... 61
4.4.4 Optimizing Cross-Domain Symbolic Links ...... 62
4.4.5 Different Types of Servers ...... 63
4.4.6 The Sprite Lookup Algorithm ...... 63
4.4.7 Operations on Two Pathnames ...... 64
4.4.8 Pros and Cons of the Prefix Table System ...... 65
4.5 Measurements ...... 66
4.5.1 Server Lookup Statistics ...... 67
4.5.2 Client Lookup Statistics ...... 68
4.6 Related Work ...... 70
4.6.1 Remote Mounts ...... 70
4.6.2 Prefix Based Systems ...... 70
4.7 Summary and Conclusion ...... 70

CHAPTER 5: Process Migration and the File System ...... 72
5.1 Introduction ...... 72
5.2 The Shared Stream Problem ...... 73
5.2.1 Shadow Stream Implementation ...... 74
5.3 Migrating an I/O Stream ...... 76
5.3.1 Coordinating Migrations at the I/O Server ...... 77
5.3.2 Synchronization During Migration ...... 79
5.4 The Cost of Process Migration ...... 79
5.4.1 Additional Complexity ...... 80
5.4.2 Storage Overhead ...... 80
5.4.3 Processing Overhead ...... 81
5.5 Conclusions ...... 81

CHAPTER 6: Recovering Distributed File System State ...... 82
6.1 Introduction ...... 82
6.2 Stateful vs. Stateless ...... 83
6.3 State Recovery Protocol ...... 84
6.3.1 Recovery Based on Redundant State ...... 84
6.3.2 Server State Recovery ...... 85
6.3.2.1 The Initial Approach ...... 87
6.3.2.2 Improvements ...... 88
6.3.2.3 Reopening Cached Files ...... 89
6.4 Delayed Writes and Failure Recovery ...... 89
6.5 Operation Retry ...... 91
6.5.1 Automatic Retry ...... 92

6.5.2 Special Handling ...... 92
6.5.3 No Retry ...... 93
6.6 Crash and Reboot Detection ...... 94
6.6.1 Measured Costs of Crash Detection ...... 96
6.7 Experiences ...... 97
6.8 Conclusion ...... 99

CHAPTER 7: Integrating User-level Services ...... 100
7.1 Introduction ...... 100
7.2 User-level Services in Sprite ...... 101
7.3 Architectural Support for User-level Services ...... 103
7.4 Implementation ...... 105
7.4.1 The Server's Interface ...... 105
7.4.2 Request-Response ...... 107
7.4.2.1 Buffering in the Server's Address Space ...... 108
7.4.2.2 Buffering for an Asynchronous Interface ...... 108
7.4.2.3 Buffer System Summary ...... 110
7.4.3 Waiting for I/O ...... 110
7.4.4 Future Work ...... 111
7.5 Performance Review ...... 112
7.5.1 Request-Response Performance ...... 112
7.5.2 UDP Performance ...... 116
7.5.3 NFS Performance ...... 117
7.6 Related Work ...... 119
7.7 Conclusion ...... 121

CHAPTER 8: Caching System Measurements ...... 122
8.1 Introduction ...... 122
8.2 The Sprite Caching System ...... 123
8.2.1 The Sprite Cache Consistency Scheme ...... 123
8.3 Sharing and Consistency Overhead Measurements ...... 127
8.4 Measured Effectiveness of Sprite File Caches ...... 129
8.4.1 Client I/O Traffic ...... 129
8.4.2 Remote I/O Traffic ...... 131
8.4.3 Server I/O Traffic ...... 134
8.5 Variable-Sized Caches ...... 137
8.5.1 Average Cache Sizes ...... 137
8.5.2 Other Storage Overhead ...... 138
8.6 Conclusion ...... 141

CHAPTER 9: Conclusion ...... 143
9.1 Introduction ...... 143
9.2 The Shared File System ...... 143
9.3 Distributed State Management ...... 144
9.4 Extensibility ...... 145
9.5 Future Work ...... 145
9.6 Conclusion ...... 146

APPENDIX C: Sprite RPC Protocol Summary ...... 147
1 Protocol Overview ...... 147
2 RPC Packet Format ...... 148
4 Source Code References ...... 151

APPENDIX D: The Sprite RPC Interface ...... 152
1 Introduction ...... 152
1.1 The RPC Protocol ...... 152
1.2 Source Code References ...... 154
2 ECHO_2 ...... 154
3 SEND ...... 155
4 RECEIVE ...... 155
5 GETTIME ...... 155
6 FS_PREFIX ...... 155
7 FS_OPEN ...... 156
7.1 FS_OPEN Request Format ...... 157
7.2 FS_OPEN Reply Format ...... 160
8 FS_READ ...... 162
9 FS_WRITE ...... 163
10 FS_CLOSE ...... 164
11 FS_UNLINK ...... 164
12 FS_RENAME ...... 165
13 FS_MKDIR ...... 166
14 FS_RMDIR ...... 166
15 FS_MKDEV ...... 166
16 FS_LINK ...... 168
17 FS_SYM_LINK ...... 168
18 FS_GET_ATTR ...... 168
19 FS_SET_ATTR ...... 169
20 FS_GET_ATTR_PATH ...... 169
21 FS_SET_ATTR_PATH ...... 170

22 FS_GET_IO_ATTR ...... 171
23 FS_SET_IO_ATTR ...... 171
24 FS_DEV_OPEN ...... 171
25 FS_SELECT ...... 172
26 FS_IO_CONTROL ...... 172
27 FS_CONSIST ...... 175
28 FS_CONSIST_REPLY ...... 176
29 FS_COPY_BLOCK ...... 177
30 FS_MIGRATE ...... 177
31 FS_RELEASE ...... 178
32 FS_REOPEN ...... 178
33 FS_RECOVERY ...... 180
34 FS_DOMAIN_INFO ...... 181
35 PROC_MIG_COMMAND ...... 181
36 PROC_REMOTE_CALL ...... 182
37 PROC_REMOTE_WAIT ...... 183
38 PROC_GETPCB ...... 183
39 REMOTE_WAKEUP ...... 183
40 SIG_SEND ...... 184
41 Bugs and Omissions ...... 184
41.1 FS_REMOTE_LINK vs. FS_SYMBOLIC_LINK ...... 184
41.2 Device Attributes ...... 184
41.3 Pathname Redirection ...... 185
41.4 FS_SERVER_WRITE_THRU ...... 185

References ...... 186

List of Figures

2-1. Symbolic links in a hierarchical name space ...... 13
2-2. A Typical Sprite Network ...... 24
3-1. System Calls on Pathnames ...... 30
3-2. System Calls on I/O Streams ...... 31
3-3. Miscellaneous FS System Calls ...... 31
3-4. Module Decomposition ...... 32
3-5. I/O Stream Data Structures ...... 36
3-6. Flow of control during an Fs_Open operation ...... 40
3-7. Remote Device Object Descriptors ...... 42
3-8. Waiting at the Client ...... 45
3-9. Typical Wait Loop ...... 46
3-10. Sprite Network RPC Performance ...... 49
3-11. Hourly RPC Traffic ...... 50
4-1. A Distributed File System ...... 58
4-2. Remote Links ...... 61
4-3. Server Lookup Traffic ...... 68
4-4. Client Lookup Traffic ...... 69
5-1. Streams and Shadow Streams ...... 75
5-2. Stream Sharing after Migration ...... 77
5-3. Migrating the Stream Access Position ...... 78
6-1. Recoverable Stream Data Structures ...... 86
6-2. Server Echo Traffic ...... 97
7-1. X-windows Pseudo-Device ...... 102
7-2. NFS Pseudo-file-system structure ...... 103
7-3. Pseudo-Device Stream Structure ...... 106
7-4. Pseudo-Device Buffering System ...... 109
7-5. Pseudo-Device Benchmark Structure ...... 112
7-6. Pseudo-Device Performance Graph ...... 114
7-7. UDP performance ...... 116
8-1. Concurrent and Sequential Write Sharing ...... 125
8-2. Server I/O Traffic ...... 135

List of Tables

3-1. Object Descriptor Types ...... 34
3-2. Size Comparison of FS Modules ...... 51
4-1. Sample Prefix Table ...... 58
4-2. Sprite Domains in Berkeley ...... 59
4-3. File Server Configuration ...... 67
6-1. Reboots per Month ...... 98
6-2. Mean Time Between Reboots, July - January ...... 98
7-1. Naming Operations ...... 104
7-2. I/O Operations ...... 104
7-3. Pseudo-Device Server I/O Control Calls ...... 110
7-4. Pseudo-Device Benchmark Results ...... 113
7-5. Pseudo-Device Write-Behind Performance ...... 115
7-6. I/O Performance ...... 118
7-7. Andrew Performance ...... 118
8-1. Summary of Caching Measurements ...... 122
8-2. Characteristics of Sprite Hosts ...... 124
8-3. File Sharing and Cache Consistency Actions ...... 127
8-4. Sun3 Client I/O Traffic ...... 130
8-5. DECstation 3100 Client I/O Traffic ...... 132
8-6. Remote Read Traffic ...... 133
8-7. Remote Write Traffic ...... 134
8-8. Server Traffic Ratios During a 20-day Study ...... 136
8-9. Cache Sizes ...... 138
8-10. Average Object Descriptor Usage ...... 139
8-11. Mint's Kernel Size vs. Cache Size and Object Descriptors ...... 140
C-1. RPC Packet Format ...... 149
C-2. RPC Type and Flag Bits ...... 149
D-1. Sprite Remote Procedure Calls ...... 153
D-2. Source Code References ...... 154
D-3. Sprite Internal Stream Types ...... 156
D-4. Permission Bits ...... 158
D-5. File Descriptor Types ...... 158
D-6. Usage Flags ...... 159
D-7. Device Types ...... 167
D-8. Set Attribute Flags ...... 169

D-9. User-defined File Types ...... 170