Data Movement in the Grasshopper Operating System

Technical Report Number 501

December, 1995

Rex di Bona

ISBN 0 86758 996 5

Basser Department of Computer Science, University of Sydney, NSW 2006


1 Introduction

Computer environments have changed dramatically in recent years, from the large centralised mainframe to networked collections of workstation-class machines. The power of these workstations is increasing, but the common paradigms used to manage data on them are still fundamentally the same as those employed by the earliest machines.

Persistent systems adopt a radically different approach. Unlike conventional systems which clearly distinguish between short-term computational storage and long-term file storage, persistent systems abstract over the longevity of data and provide a single abstraction of storage.

This thesis examines the issues involved with the movement of data in persistent systems. We are concerned with the differences in the requirements placed on the data movement layer by persistent and non-persistent operating systems. These differences are considered in relation to the Grasshopper operating system.

This thesis investigates the movement of data in the Grasshopper Operating System, and describes approaches to allow us to provide the functionality required by the operating system. Grasshopper offers the user a seamlessly networked environment, offering persistence and data security unparalleled by conventional workstation environments. To support this environment the paradigms employed in the movement of data are radically different from those employed by conventional systems. There is no concept of files. Processes and address spaces become orthogonal. The network is used to connect machines, and is not an entity in itself.

We shall show that the adoption of a persistent environment allows us a fundamental advantage over conventional systems in the movement of data through the storage hierarchy: from disk to memory, and from memory on one node to memory on another node. We shall investigate the hurdles that arise through the use of persistence, and show how each hurdle may be overcome.

We answer the following questions: Do the methods used to move data on non-persistent systems translate to persistent systems, or are other methods more applicable? Are there features of persistent systems that, when exploited, make the task of data movement easier? Are there requirements of persistent systems that, when satisfied, complicate the task of data movement?

This thesis examines the two fundamental areas of data movement on modern computer systems: the movement of data on a single node, and the movement of data between nodes. On a single node, data moves between permanent storage, usually implemented as disk drives, and volatile storage, usually implemented as the RAM of the machine. Between multiple nodes, data moves as a series of packets over some form of network.

The fundamental differences between persistent and non-persistent systems are outlined. How these affect the choices available for the implementation of data movement is investigated. This thesis presents solutions to the problems of data movement in persistent systems that capitalise on the advantages of persistence, and also satisfy the additional requirements of persistent systems.

The major contribution of this thesis is the design of three new protocols for data movement: one for data movement between main memory and backing store, and two protocols that together provide efficient, reliable and causal movement of data between networked nodes. The first protocol is implemented as a stackable module protocol which allows manipulation of pages between disk storage and main memory. The second is an efficient and reliable peer to peer network protocol that, combined with the third, a routing protocol, allows causal message delivery.

In Grasshopper, and most persistent systems, disks are considered as a repository for pages of data. These pages are identical in length to memory pages, and the only movement of data from disks is to memory and vice-versa. This constraint on the dimensions of data block sizes allows a more uniform and efficient protocol to be implemented.

Persistent systems place a strict requirement on network traffic: that it arrives in causal order. This requirement is not found in most non-persistent systems, and causes considerable complications to the routing protocol. Attempts to solve this problem usually involved using a conventional network structure and adding additional information to each message to enable causality violations to be detected and corrected. This thesis presents a novel design for a network that manipulates the network topology instead of the messages. Messages may only travel along certain constrained paths, which removes the requirement for additional causality information.

The thesis is organised as follows. Chapter 2 outlines how conventional systems perform computations, both from a process' and user's perspective. We contrast this in Chapter 3 with how persistent systems are organised, using the Napier persistent language and environment as an example. Chapter 4 introduces Grasshopper, concentrating on the basic abstractions provided by the operating system, and how they relate to those provided by both conventional and current persistent environments.

Chapter 5 details how movement of data is performed on a single node. It concentrates on disk to memory movement of data, and the structures provided to manipulate the passage of data between disk and memory. Chapter 6 is concerned with the movement of data between two nodes in a Grasshopper network. It describes how data is represented in a network independent manner. Chapter 7 describes the changes to the model to allow data to be located once it has moved across the network. Chapter 8 is concerned with the causality considerations necessary to ensure that computation occurs in a sensible order on a distributed system. Finally, Chapter 9 discusses the implementation in Grasshopper of the algorithms presented, and evaluates their effectiveness.


2 The Computational Model of Conventional Systems

This chapter examines the computational model supported by conventional (non-persistent) systems. Our main focus is the models used for data movement, concentrating on disk storage and network movement of data. Since Unix [56] is the most popular workstation operating system, it is used as an example throughout this chapter. Where other systems have substantially different models these differences will be highlighted.

The Unix system supports different models at different levels. The model made available to users is different to that presented to processes in the system, most notably in the area of networking. We shall examine both of these views, as a comparison provides an important insight into the viability of persistent systems in the design space of computer systems.

2.1 The Process View of Unix

A process in a Unix system exists in a transient virtual address space. The only permanent storage paradigm in Unix is the file system. Processes usually communicate through either the file system or a network system. The Unix signal system can be used as a primitive communication mechanism, and, as we shall see later, newer versions of Unix allow memory to be shared between processes.

In the Unix model the memory, files, and network each have separate namespaces. Each is specified using different interfaces and different programming abstractions. Memory is accessed using standard processor operations. Files are accessed using the file name as a key and file-based read and write system calls. The network is accessed using a network address and port number pair and network-based read and write system calls.

Memory for a process in a Unix system is a purely ephemeral structure. A process is created with a new address space into which is placed the text and data on which the process is to operate. In addition, a stack is created within the address space and is used to maintain procedure call linkage information. A process functions by performing memory reads and writes, and performing computations on the data. Memory in a Unix process is tagged with permission attributes; a process can read or write various portions of the address space depending upon the attributes for that area of memory. Usually the text of a process is tagged as read only, and the data and stack areas are tagged as read/write. If a process wants to interact with permanent storage the process has to perform file reads and writes. Entities represented in the file system have types, including files, directories, and devices. Different system calls are used to access each type of entity in a file system.

The disks on a Unix system are used to hold the file system. The namespace of the file system is different to that of memory. In memory, the namespace is the addressing range of the processor, but in the file system the namespace becomes a hierarchical collection of textual names. The file system is a kernel mediated object that associates portions of the disk with textual names. All accesses to files are through a kernel based interface. The namespace that a disk presents to a process is a collection of these files. If a different view of the disk is desired, the kernel of the operating system has to be changed.

Networking was added to Unix in the Berkeley [32] versions of the operating system. In the Berkeley system the network was viewed as a collection of byte streams by a process. Each of these streams could be connected to other machines, and data passed along these streams. The namespace via which a process accessed the network was different from that of both memory and files. The network namespace was the union of the namespaces of the various protocols the kernel implements. If an additional protocol was required, it had to be added to the kernel.

The network that was added to the Unix system by the Berkeley code is now commonly called the Internet, and the protocols that are used on the Internet are commonly referred to as the IP protocols. The namespace for the IP domain is the one most commonly used [51]. In this namespace the network is a collection of machines and ports. Each machine has a fixed length address, and a number of ports. A process can request to be attached to one of these ports on the local machine, and then request that the port be attached to a port on a foreign machine. The network namespace is controlled by the kernel; a process cannot itself initiate a connection, but requests the kernel to perform the connection on its behalf.
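The flavour of this kernel-mediated namespace can be seen in the following C sketch, which uses the Berkeley socket interface. The address and port shown are illustrative only; the point is that a network endpoint is named by an (IP address, port) pair, and that the kernel performs the connection on the process' behalf.

    /* Illustrative sketch: naming a network endpoint through the kernel. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct sockaddr_in peer;
        char buf[64];

        int s = socket(AF_INET, SOCK_STREAM, 0);        /* a byte-stream endpoint */
        if (s < 0) { perror("socket"); return 1; }

        memset(&peer, 0, sizeof peer);
        peer.sin_family      = AF_INET;
        peer.sin_port        = htons(7);                    /* illustrative port    */
        peer.sin_addr.s_addr = inet_addr("192.0.2.10");     /* illustrative address */

        /* The kernel, not the process, establishes the connection. */
        if (connect(s, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect");
            return 1;
        }

        write(s, "ping", 4);            /* network-based write system call */
        read(s, buf, sizeof buf);       /* network-based read system call  */
        close(s);
        return 0;
    }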

In summary, as Figure 2.1 illustrates, a process' namespace is constructed from three different, mutually disjoint namespaces. A process exists in isolation, but can manipulate data in all three namespaces.

Figure 2.1. Process' Namespace in Unix (disjoint file system, network, and memory namespaces)

2.2 The User's View of Unix

A user at a Unix system is presented with a different view from that of a process. A user does not deal with memory at all, but deals almost exclusively with the file and network paradigms. To the user there is a collection of processes that exist on the machine. These processes interact with the file system and the network to perform computation.

The user perceives each process executing in a private and independent address space. The user does not expect them to interact except through the mechanisms provided, i.e. the file system and the network. Newer versions of Unix have additional mechanisms which will be discussed later.

Instead of having processes communicate through shared files, Unix introduced the pipe facility. Pipes are a mechanism which allows data to be passed from one process to another. They simulate writing and reading operations on a file. One process writes data into a pipe which the other process will later read. A pipe provides a unidirectional, buffered, method of data communication between two processes.
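A minimal C sketch of the pipe facility follows; the message is illustrative. The parent writes into one end of the pipe and the child reads from the other, giving the unidirectional, buffered channel described above.

    /* Illustrative sketch of a pipe between a parent and child process. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        char buf[32];

        if (pipe(fd) < 0) { perror("pipe"); return 1; }

        if (fork() == 0) {                             /* child: the reading end */
            close(fd[1]);
            ssize_t n = read(fd[0], buf, sizeof buf - 1);
            if (n > 0) { buf[n] = '\0'; printf("child read: %s\n", buf); }
            _exit(0);
        }

        close(fd[0]);                                  /* parent: the writing end  */
        write(fd[1], "hello", 5);
        close(fd[1]);                                  /* end of stream for reader */
        wait(NULL);
        return 0;
    }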

The user may interact with processes, but work performed is usually assessed in terms of modifications to the permanent data structures: the files on a file system. Processes are run to perform specific operations on the file system. A user expects a process to examine or modify either the data or meta-data of files during its execution. Indeed, this is the only way a process can have an apparent effect on the permanent state of the system.

The most significant difference between a process' view and a user's view of Unix occurs in the network structure. A process, as described above, views the namespace as Internet addresses and ports. A user views the namespace as programs providing services on particular machines, e.g. remote access, mail delivery, time daemons, news, etc. Each service has a numeric name that describes its port in the IP address space. Each machine in an IP network has a textual name. These names are grouped into domains, which in turn are grouped hierarchically. There is a mapping between names and IP addresses, although, as we shall see, this has been replaced with a more distributed system. This mapping is maintained entirely outside the kernel.

In summary, a user views the system as consisting of many processes. Processes are independent entities, although they may communicate through the file system or the network to achieve some cooperative goal. The user sees a collection of machines all joined by a network.

This model, of a loosely coupled collection of machines, is the primary model of communication between conventional computer systems. The addressing of remote data items is currently either cumbersome, not permitted, or supported using different semantics to those used to access local data items.

2.3 Memory Namespaces

Each process in a Unix system executes in a separate, distinct, memory namespace. This namespace is constructed when the process is created, and only lasts the lifetime of the process. The namespace is tailored for the process during the creation phase, and is not modified during the process' lifetime.

The memory namespace accessible to a process is only a sub-section of the total addressable by the process, as some sections of the virtual memory are made inaccessible by the operating system. In some Unix implementations, the first page is made unavailable to trap erroneous programs that dereference a zero pointer, a common error in the C language. The virtual addresses occupied by the kernel are not usually addressable by the process, providing protection to kernel data structures. Some machines ensure kernel protection in software, e.g. the Sun 3 series of machines. Modern processors, such as the MIPS Rx000 series and the DEC Alpha, provide a hardware protected kernel area.

A process' memory namespace is composed of three distinct sections. These sections, shown in Figure 2.2, are: the process text, the predefined data, and a stack. The text usually is marked as read-only, while the data and stack have read-write attributes. The data section and the stack section can grow in size. The stack grows automatically on procedure call, while the data section only grows by explicit request.

Figure 2.2. Memory Namespace Composition (process virtual address space: kernel, stack, data, and text regions, with text starting at address 0)
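The explicit growth of the data section can be illustrated with a small C sketch. The sbrk call shown is the traditional Unix request to move the end of the data section (the interface used internally by allocators such as malloc); the 4096 byte increment is arbitrary.

    /* Illustrative sketch: growing the data section by explicit request. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        void *old_break = sbrk(0);          /* current end of the data section */

        if (sbrk(4096) == (void *)-1) {     /* explicit request to grow it     */
            perror("sbrk");
            return 1;
        }

        void *new_break = sbrk(0);
        printf("data section grew from %p to %p\n", old_break, new_break);
        return 0;
    }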

2.4 File System Namespaces

Files in the Unix system are stored in a disk structure called a file system. A file system holds both the data and meta data for each file on the disk. The meta data includes the layout of the file on the disk, along with access permissions and time stamps for the file. This meta data is stored in a structure that is associated with each file. This structure is called an inode in Unix systems. Inodes are themselves referenced from the directory structures on the disk, where a name is associated with the meta data, and as such, with the file.

To read or write the data stored in a file a process must first request access from the kernel. This is achieved using the open system call in Unix. The process must present as arguments the name of the file to which access is desired, and the type of access: read, write, or both. If granted permission, a process can issue read and write requests which copy portions of the file data to the memory space of the process.
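A small C sketch of this kernel-mediated access follows; the path /etc/motd is purely illustrative. The file name is presented to the kernel by open, and read then copies portions of the file data into the memory space of the process.

    /* Illustrative sketch: file access through the kernel interface. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[256];
        ssize_t n;

        int fd = open("/etc/motd", O_RDONLY);     /* request access from the kernel */
        if (fd < 0) { perror("open"); return 1; }

        while ((n = read(fd, buf, sizeof buf)) > 0)   /* copy file data into the   */
            write(STDOUT_FILENO, buf, n);             /* process' memory and print */

        close(fd);
        return 0;
    }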

The file system namespace also allows other operations to be performed on files. For example, files can be created and deleted, operations that are meaningless in the memory namespace.

2.5 Network Namespaces

The network namespace is the area in which there is most variance between different operating systems. The Unix system allows multiple independent networks to connect machines together. These networks may provide different facilities, although, at a minimum, each network allows messages to be sent from one machine to another machine.

For a network to deliver a message, the network must be able to identify the destination machine. This identification must take place at several levels: on the physical links, in the transport software, and in the application software.

There are many different structures that can connect machines together physically [39]. A physical link may be a single point to point connection, such as a serial line, it may be a ring structure such as a token ring or FDDI network, it may be a packet switched ATM or frame relay network, or it may be a common access network such as an ethernet.

There may be a multitude of different protocols operating on a physical link, ranging from connectionless datagram protocols such as UDP [50], to end-to-end virtual circuits, to protocol stacks such as the OSI suite [57] or TCP [52]. Each of these software protocols requires the end points of the connection to be named.

The following sections examine three representative network protocols: the Internet Protocol (IP) suite [51], the MHSnet protocol [29], and the UUCP protocol [26]. We discuss the features they have in common, and those that distinguish them.


2.5.1 Internet Naming

The Internet uses a 32 bit structured address. Each packet sent includes the address of both the source and destination nodes. The Internet uses a forwarding technique to pass the messages between intermediate machines. Intermediate nodes are responsible for determining the next node to which the packet is to be passed. If an intermediate node cannot determine the next node, or if the next node cannot be immediately reached due to network problems, the packet is declared undeliverable, and may be either dropped or returned to the sender along with an error status.

A node in the Internet is identified by the 32 bit address mentioned above. All routing is done in terms of this number. Numbers are statically allocated, usually in large collections. These collections, however, can be physically widely scattered. Consecutive numbers may be separated by many hops.

It has been found that the identification of nodes on the Internet by the IP number is a very cumbersome procedure. There is no immediate relationship between the function and the IP number of a particular machine. A collection of machines on one network will have similar numbers, even though their uses may differ markedly. To solve this problem an additional, parallel naming scheme, the Domain Name Server (DNS) [43], has been developed. This scheme breaks the Internet into a collection of domains, and then breaks each domain into a series of smaller domains, and so on, until the number of machines in a domain is small enough to be considered manageable. Such a manageable domain contains a server, the name server, which will accept name requests and convert them into Internet addresses. A user wanting to obtain the IP number for a given name first consults the local name server. If the local server cannot resolve the name it will automatically query a remote server, which may in turn query other name servers until the query is resolved.
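The two naming schemes can be seen side by side in a short C sketch using the standard resolver library; the host name below is illustrative. The textual, domain-based name is resolved to the flat 32 bit IP number on which the network actually routes.

    /* Illustrative sketch: resolving a DNS name to an IP number. */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct hostent *h = gethostbyname("www.example.com");   /* textual name */
        if (h == NULL) {
            fprintf(stderr, "name could not be resolved\n");
            return 1;
        }

        struct in_addr a;
        memcpy(&a, h->h_addr_list[0], sizeof a);    /* the 32 bit routed address */
        printf("%s -> %s\n", h->h_name, inet_ntoa(a));
        return 0;
    }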

The Domain Name Server provides the mapping from a human understandable name to the corresponding IP number, but the machines still route in terms of the IP number. The Internet, therefore, has two distinct naming schemes: a user-friendly, text-based name recursively composed from domains, and a global flat numeric naming scheme.

2.5.2 MHSnet Naming

MHSnet is a network for delivering messages over unreliable, intermittent connections. It is used to send messages between different nodes in a hierarchical network. MHSnet is a store and forward network; if a link is severed the nodes at either end of the link will buffer messages until the link is restored. In each message there is a header that contains the source and destination, although a message may be broken into multiple packets which are sent separately over the connection, and can arrive out of order.

An MHSnet network is composed of nodes and links. Each node is connected to other nodes in a loosely hierarchical network by point to point links. Messages move from the source node to the destination node by being passed from node to node through the network. Each intermediate node receives the entire message, calculates the best link to forward the message on, and then queues the message for sending on the new link.

MHSnet only routes messages in an internal format. At the source and destination nodes are handlers. These are complementary programs that inject messages into, and accept messages from, the network. An example of a handler is the pair that deal with electronic mail. One will accept mail for delivery at a remote site, the other accepts mail from the network for delivery at the local site. The injecting handler will accept messages from the local mail routing program, and inject these messages into the network for delivery at the destination site. The intermediate sites on the network are not concerned with the contents of the messages; they just move the message to the destination node. On the destination node the message is delivered from the network to the handler, which will unpackage the mail item, and deliver it to the recipient. Since the intermediate nodes do not have to handle the internal structure of the messages, any form of data may be contained within a message. Intermediate nodes handle through traffic for which they may have no applicable handlers.

The MHSnet naming structure is based on domains. A machine is present in a domain. The domains are recursively composed into larger domains, with the name of a machine being the local name and the list of domain names in which it resides. The name of a machine is locally unique, although the local portion of a name may be re-used in several domains at once.


To route a message, the MHSnet protocol [30] will consult an internal table describing the network topology, and decide the best route to the destination. The size of this table grows as the number of nodes increases. To help constrain the size of the routing tables, machines vary in their visibility. A leaf node might only hold the entries for other local nodes and a more central node that has a larger visibility. The backbone sites hold large routing tables, but again, these do not need to hold all nodes on the network; they only hold those nodes that have high visibility, and it is these nodes that will in turn route to the leaf nodes. Using this technique the size of the routing tables is reduced to become manageable.

Routing information is sent in the form of multicasts between nodes in the MHSnet network. Thus, as nodes become attached to the network they become addressable by existing nodes without manual intervention.

The MHSnet naming structure is similar to the Internet Domain Name Server naming, with two important differences. Firstly, in the Internet each network interface on a machine requires a different IP number; a machine with three network connections therefore requires three IP numbers, one for each interface. Each IP number is then allocated one or more DNS names, and it is permissible, and common, for a machine to have multiple DNS names. In the MHSnet naming structure a node requires only one name, irrespective of the number of network connections it has.

Secondly, the DNS system requires the use of fully qualified names. A name must be specified with all the domains it belongs to listed². The MHSnet structure allows partially qualified names. Only those domains that differ from the current node must be listed. The software will route given a partially qualified name, allowing traffic within a sub-network to be addressed using a sub-network address.

² If no domains are listed, the software automatically appends the machine's default domain to the request.

2.5.3 UUCP Naming

The UUCP network was designed to facilitate the movement of files between Unix machines. It is a store and forward network utilising point-to-point serial links. There is no broadcast facility in a UUCP network.

The name of a UUCP node is an arbitrary string chosen by the system administrator. To pass a message between two sites the message must include the full path to the destination, listing explicitly the intermediate sites. The originator of a message must determine the sequence of sites through which that message must be passed. Intermediate sites just forward the message onto the next site.
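Such source routes are conventionally written as a '!'-separated path naming every site in order. The C sketch below, with an illustrative path, shows the addressing style: each site strips its own hop and forwards the remainder. It illustrates the naming convention only, not the actual UUCP software.

    /* Illustrative sketch of UUCP-style explicit source routing. */
    #include <stdio.h>
    #include <string.h>

    /* Split "next!rest" into the next hop and the remaining path. */
    static char *next_hop(char *path, char **rest)
    {
        char *bang = strchr(path, '!');
        if (bang == NULL) {            /* no further hops: local delivery */
            *rest = NULL;
            return path;
        }
        *bang = '\0';
        *rest = bang + 1;
        return path;
    }

    int main(void)
    {
        char path[] = "sitea!siteb!sitec!user";     /* full explicit route */
        char *rest = path;

        while (rest != NULL)
            printf("forward to: %s\n", next_hop(rest, &rest));
        return 0;
    }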

2.6 Merging of the Namespaces

As described above, a process executing in a Unix system may manipulate three namespaces: the memory, file system, and network namespaces. Recent developments in Unix systems have attempted to blur the distinctions between these namespaces. The memory mapping system call, mmap, allows a file to be accessed through the memory namespace. Accesses to the shaded area of a process' memory space in Figure 2.3 will be satisfied by reads from the mapped file. This call is restricted to ordinary files; directories and device files cannot be accessed through this mechanism. Also, the semantics of a file accessed using mmap are different to those available using read and write operations.

Using normal read and write operations, a file is extended by writing to addresses past the end of the file. The memory mapped operations on a file do not allow the file to be extended. Any attempt to write past the end of the file will return an error.

Figure 2.3. Memory Mapped Files (a file mapped into a region of the process address space)
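A short C sketch of memory-mapped file access follows; the path is illustrative. Once the file is mapped, its contents are read with ordinary memory operations rather than read and write system calls, although, as noted above, the mapping cannot be used to extend the file.

    /* Illustrative sketch: accessing a file through the memory namespace. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        int fd = open("/etc/motd", O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        for (off_t i = 0; i < st.st_size; i++)   /* ordinary memory reads, not */
            putchar(p[i]);                       /* read system calls          */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }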


A common use for memory mapping of files is to allow the text section of a program or shared library to be mapped into the address space of a process. This use of mapping removes the requirement to copy the text into swap space, or to reserve swap space.

A second development has allowed the privacy of address spaces to be broken by the implementation of shared memory [2]. This facility allows shared sections of memory to be accessed by multiple processes using memory semantics. The permissions on these memory segments can differ between processes, and the kernel does not define the layout or meaning of the data within them. Notice that a pointer stored in a shared segment may refer to different objects in different processes, as the pointer is interpreted within each process' own address space.
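The following C sketch uses the System V shared memory calls to make the point about pointers concrete: the same segment may be attached at different virtual addresses in different processes, so offsets rather than raw pointers should be stored inside it. The segment size and message are illustrative.

    /* Illustrative sketch: a System V shared memory segment used by two processes. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>

    int main(void)
    {
        int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        if (fork() == 0) {                        /* child process              */
            char *p = shmat(id, NULL, 0);         /* may attach at a different  */
            strcpy(p, "written by the child");    /* virtual address            */
            shmdt(p);
            _exit(0);
        }

        wait(NULL);
        char *p = shmat(id, NULL, 0);             /* parent attaches the segment */
        printf("parent reads: %s\n", p);
        shmdt(p);
        shmctl(id, IPC_RMID, NULL);               /* remove the segment          */
        return 0;
    }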

The network has also become blurred with the other namespaces. File systems from networked machines can be accessed on the local machine using network file systems; NFS [40], RFS [55], and Sprite [48] are examples. In these systems the files from a remote machine become visible in the local namespace.

These systems allow the local file namespace to be augmented with portions of the file namespace of other machines. The combined namespace is visible to all processes executing on the machine. If each local file namespace is made visible on all other machines in a network, it is possible to create a namespace that is identical on all machines.

The semantics presented by these distributed file systems are not identical to those of the original Unix file system, mainly because of file caching and because devices are not visible over the network. The reader is referred to [61] for an excellent survey of distributed file systems.

The namespace in which a process exists is now blurred to some extent. Figure 2.4 indicates the boundaries that are blurred.

Figure 2.4. Process' Namespace in Unix (boundaries blurred by memory-mapped files, distributed file systems, and shared memory)

Unix variants are also capable of weakening the isolation of a process within its address namespace. Packages are available that allow multiple threads to execute in one address space [65]. These threads are not identical to processes, and are implemented using one of two programming methods. On machines which have non-blocking system calls it is possible to use a library that switches between the threads on system calls. Newer versions of Unix that support System V threads have kernel support for scheduling multiple threads.

Scheduling of the threads by the kernel allows true concurrency to occur, which is not possible using user level scheduling. In each system the address space is effectively shared between multiple processes, requiring synchronisation between the threads.
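A brief C sketch using the POSIX thread interface (one of several thread packages, shown here purely for illustration) demonstrates why such synchronisation is required: because both threads update the same data section, the shared counter must be protected by a lock.

    /* Illustrative sketch: two threads sharing one address space. */
    #include <stdio.h>
    #include <pthread.h>

    static long counter = 0;                       /* shared data          */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);             /* serialise the update */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("counter = %ld\n", counter);        /* 200000 with the lock held */
        return 0;
    }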

2.7 Other Non-Persistent Systems

Two other systems are worthy of mention: Plan 9 [53], from Bell Labs, and Amoeba, from the University of Amsterdam [66]. Both of these systems have attempted to remove the separate network namespace. This has been achieved in Plan 9 by allowing a user to construct a namespace that will be visible from any node, and in Amoeba by giving all objects network independent names.

Plan 9, a new operating system from Bell Labs, is similar in structure to Unix, but has the network included in the file system namespace. In Plan 9 there are only two namespaces, the memory of a process, and the file system. One unusual feature of Plan 9 is the ability to compose the portions of the file system namespace through the bind call. A user will build a namespace that has all the objects desired, and work entirely within that namespace.

A user starts with an empty ®le system namespace, and constructs the desired namespace. Usually the standard system provided namespaces are included, but any other namespaces may be included on a process by process basis. Each process' namespace is passed across the network when the process moves from machine to machine. This gives each process an identical view of the total distributed system at all sites.

The Amoeba system is different from other non-persistent systems in that the networking code is fully integrated into the system. In Amoeba, all data storage devices are identified by a capability [67]. The capability is the key which allows access to a facility, such as a file system. On presentation of a capability to the operating system, the operating system will send a message to the (possibly remote) function to which that capability refers. The function will then perform some computation on behalf of the user, and return the result.

In the Amoeba system the capabilities include the network address of the function. This removes the requirement for the user to track the location of the function provider. A program may use a capability irrespective of the location of both the requesting process and the location of the function provider process.

Amoeba can use this network transparency to allow specialised pieces of hardware to exist in the network, and to automatically migrate requests to those pieces of hardware. Usually processes run in a generic processor pool, but there may exist specialised server nodes that provide a particular service, such as a specialised database. Requests on these services are transparently passed to the server machines.

2.8 Flaws in the Merged Namespaces

The schemes adopted to blur the distinctions between namespaces have significant problems. All of these mechanisms were ad-hoc solutions that only addressed one portion of the problem. The namespaces are not smoothly integrated. Examples include:

• it is possible to determine if a file is local or remote,

• some files can not be accessed using memory semantics,

• the protection facilities provided for memory are not identical to those for files,

• the file system allows, and has defined semantics for, reads and writes beyond the end of a file. Files which are mapped into memory do not allow access to areas beyond the range of the file, even if these addresses are nominally mapped.

The network implementations of the file systems do not support the same semantics as single node file systems. It is possible, using network interactions, to achieve a result that conflicts with what would be achieved on a single node; worse still, it is possible to construct a situation in which a file appears to revert to an earlier version. This can occur when a process moves from one machine to another that holds the referenced file in a cache whose entry has not expired: access requests are then satisfied with the original data rather than the new, updated data.

This problem is the causal dependency problem. It can be stated as follows: a change must not be propagated until all earlier changes have been propagated. If a node on a network sees a change to some state, then anything on which that state depends must already have been visible at that node.
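One widely used form of the 'additional information per message' approach mentioned in Chapter 1 is the vector clock. The C sketch below shows the standard delivery test: a message is held back until it is the next message expected from its sender and everything it causally depends on has already been delivered. It is given only to make the causal dependency rule concrete; it is not the mechanism adopted later in this thesis, which constrains the network topology instead.

    /* Illustrative sketch: causal delivery test using vector clocks. */
    #include <stdbool.h>

    #define NODES 4   /* illustrative number of nodes */

    /* msg_vc is the vector clock carried by the message, sender is the
       originating node, and local_vc counts the messages from each node
       that have already been delivered here. */
    bool causally_deliverable(const int msg_vc[NODES], int sender,
                              const int local_vc[NODES])
    {
        for (int k = 0; k < NODES; k++) {
            if (k == sender) {
                if (msg_vc[k] != local_vc[k] + 1)   /* next message from sender? */
                    return false;
            } else {
                if (msg_vc[k] > local_vc[k])        /* all its causes delivered? */
                    return false;
            }
        }
        return true;   /* safe to deliver; otherwise hold the message back */
    }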

Currently, non-persistent systems are working towards a fully integrated approach. A fully integrated system would remove the distinctions between the namespaces, giving one uniform namespace in which a process may operate. In this integrated system the only distinction between memory and files would be that one persists after the execution of a program. A file would be accessed identically irrespective of whether it is local or remote.

This goal has not yet been achieved. Not all files can be viewed as memory. The semantics of those files that can be viewed as memory are different to normal memory. Remote files have different semantics to local files.


2.9 Summary

Conventional workstation operating systems are designed with a file system as the permanent storage mechanism. The execution of a process mutates a temporary memory space, which was traditionally separate from the file system, but more recently can encompass the file system in a limited manner.

Processes that execute in conventional operating systems can manipulate a variety of namespaces: memory, files, and network. These namespaces were initially distinct, but they are being merged, albeit with limited success.

Systems which support orthogonal persistence attempt to meet the goal of a uniform namespace. A persistent system removes all distinctions between memory, disks, and the network. All data is accessed in a uniform manner, irrespective of its location or lifespan. In the next chapter we will examine the models used by persistent systems, and will demonstrate how they achieve a seamless integration of the memory, file and network namespaces.


3 The Computational Model of Persistent Systems

The requirements of a system which supports orthogonal persistence can be summarised as follows:

• data structures must be treated in a uniform manner, irrespective of complexity and lifespan,

• data must be accessed in a uniform manner independent of its location in transient or stable storage,

• data must be resilient in the sense that after a crash or shutdown the data must be self-consistent, and

• data must be protected from unwanted access or modification.

A system supporting orthogonal persistence therefore allows data to be stored in its original form. For example, a graph structure need not be flattened in order to persist beyond the life of its creator. This is in direct contrast to the storage dichotomy exhibited by conventional systems.

Location independence removes the requirement that the programmer keep track of the location of data. The system transparently moves data between long- and short-term storage and, as a result, both transient and persistent data are accessed in a uniform manner.

Recovery after a failure is an essential component of any long term data storage system. Upon restart it must always be possible to re-construct a self-consistent data space. In conventional file based systems, tools such as fsck [26] are used to repair the long lived data: the file system. In a persistent system the problem of failure recovery is exacerbated as there may be arbitrary cross references between objects, and the loss of a single object may be catastrophic.

A persistent system must provide some mechanism to protect data against unauthorised access. This is usually provided by either a programming language type system [45], encapsulation of the data objects [35], capabilities to protect data areas [15], or a combination of these techniques.

The principles of orthogonal persistence may be extended to include network transparency. This principle can be stated as:

• Any object may be manipulated in the same manner regardless of the location of the object in the network.

The introduction of this principle raises new problems regarding naming and consistency. Firstly, objects must be able to be named from anywhere on the network in a consistent and location independent manner. Secondly, updates to objects must appear to occur in a causally consistent order.

3.1 Programming Persistent Systems

The use of a persistent system provides several benefits to the application programmer and to end users. The advantages include:

• Reduced complexity in the code. In a traditional system there are three components with which the programmer must interact:

• the programming language, which maintains only temporary data structures,

• the long term storage facility, which maintains the long term data structures, and

• the real world.


Figure 3.1. Mappings in Traditional and Persistent Systems (traditional systems: programming language, data base, and real world; persistent systems: persistent system and real world)

The interaction of these three components gives rise to multiple mappings. As shown in Figure 3.1, the real world is mapped onto both the temporary and the long term data structures; the long term and temporary data structures also need to have a relationship between each other. In a persistent system these three mappings are replaced by a single mapping between the real world and the data structures used by the program. Since all data structures are possibly persistent there does not need to be a distinction between long and short term structures.

• Reduced code size. By removing the additional mappings in a program, the code that maintained these mappings is no longer needed. This reduces the amount of code required, and the time for this code to execute.

• If the persistent system is based on a type safe language then the additional benefit of protection through the type system is available. Values may only be manipulated in a manner appropriate for their type. This allows the protection mechanism to be implemented by the type checking mechanism of the language.

• In a persistent system arbitrary structures can be preserved between invocations of a program. These structures do not have to be decomposed into simpler structures to be stored in the long term storage. This means that references between items do not have to be broken, preserving the referential integrity of the data structures.

These advantages combine to produce a new paradigm of programming. Instead of having a language which only has temporary variables, the programmer can use complex, long-lived data structures that model the structure of the data being manipulated.

3.2 Providing Persistence

Persistence can be provided in two fashions:

• As an application programming environment that is layered above a separate, possibly non-persistent, operating system, or

• by the native operating system.

An example of a language supporting the persistence paradigm is Napier88 [44]. Current native operating systems that support persistence include: Monads [58], Clouds [12], Opal [6], and Mungi [21]. These systems differ in the method they use to provide persistence, and in the addressing environment provided by the system.

We shall examine the Napier88 language, as it is highly indicative of the class of persistent languages; then we shall briefly examine the different native persistent operating systems.

3.2.1 Napier88

The Napier88 language and environment, hereafter called Napier, lives in a self-contained object store which contains all the elements needed by the programmer. The store contains a complete program development environment, including an editor, compiler, and all programs and data. Napier is strongly typed and the type of each object therefore determines the operations which can be performed on it. Napier supports persistence by reachability. In the Napier store there is a root of persistence; this is a distinguished location that is always persistent. From this location there is a graph structure of the various objects. Any object reachable from the root of persistence is also persistent.
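The principle of persistence by reachability can be sketched, independently of Napier's actual implementation, as a simple graph traversal. The C code below uses a hypothetical object layout: everything reachable from the root of persistence is marked, and anything left unmarked is garbage.

    /* Illustrative sketch: marking every object reachable from the root. */
    #include <stddef.h>
    #include <stdbool.h>

    struct object {
        bool persistent;           /* set during the traversal          */
        size_t nrefs;              /* number of outgoing references     */
        struct object **refs;      /* the referenced objects            */
    };

    void mark_reachable(struct object *obj)
    {
        if (obj == NULL || obj->persistent)
            return;                            /* nothing here, or already seen */
        obj->persistent = true;
        for (size_t i = 0; i < obj->nrefs; i++)
            mark_reachable(obj->refs[i]);
    }

    /* Usage: mark_reachable(root_of_persistence); any object whose
       persistent flag is still false may be reclaimed as garbage. */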

Programs in the Napier store are objects that are of the procedure type. Procedures are first-class objects in the language, in that they can be passed as parameters, assigned, and compared. The compiler takes a textual description of a program and returns a new object of type procedure that represents the program.

In order to enable the store to be structured, Napier provides an abstraction called an environment. An environment is an object that holds (name, object-pointer) tuples. Objects are placed into an environment with the Napier in ... let construct. An environment can contain references to other environments. The root of persistence in Napier, PS, is a single environment which initially contains references to the standard tools supplied with a Napier system.

Figure 3.2 shows a simple Napier program. The program creates a procedure that on each call creates a new callable procedure. The procedure returned will itself return sequential numbers. The program creates the procedure and stores it in the persistent store as proc_inv in the newly created environment proc_env. Figure 3.3 shows a second program which calls the created procedure twice, generating two new procedures: proc1 and proc2. These are then called with the results being printed using the standard procedures writeInt and writeString. The environment structure of the Napier persistent store after the programs' execution is shown in Figure 3.4.

!
! Program to return a procedure that increments an integer on
! each invocation.
!
let proc_inv = proc(->proc(->int))
begin
   let num := 0
   proc(->int)
   begin
      num := num + 1
      num
   end
end

let ps = PS()
use ps with environment : proc(->env) in
begin
   in ps let proc_env = environment()
   use ps with proc_env : env in
   begin
      in proc_env let proc_inv := proc_inv
   end
end

Figure 3.2. Napier Program to Create a Procedure.

!
! Program to use the above created procedure
!
let writeNum = proc(num : int)
begin
   let ps = PS()
   use ps with IO : env in
   use IO with writeInt : proc(int); writeString : proc(string) in
   begin
      writeInt(num)
      writeString("'n")
   end
end

let ps = PS()
use ps with proc_env : env in
use proc_env with proc_inv : proc(->proc(->int)) in
begin
   let proc1 := proc_inv()
   let proc2 := proc_inv()
   writeNum(proc1())
   writeNum(proc1())
   writeNum(proc2())
   writeNum(proc1())
   writeNum(proc2())
end

Figure 3.3. Simple Napier Program using above Procedure.

The output from the above program is the sequence: 1 2 1 3 2.

Figure 3.4. Napier Environment Structure after Execution of Program.

After execution of the above programs the persistent store only contains proc_inv, not proc1, proc2, or writeNum. These were only transient procedures, and, after execution moved from their declared scope, they were no longer able to be reached from the root of persistence, and, as such, became garbage. The Napier system provides an automatic garbage collector to remove such objects.

The current implementation of Napier runs as an application above the Unix system. It utilises facilities such as mmap to implement the store.


3.2.1.1 Napier88 Implementations

Napier88 has been used as the language for two implementations of distributed systems. Napier, implemented at the University of St Andrews [4], is a single node system. Casper, developed at the University of Adelaide [68], is a distributed system.

Napier runs as a user level process on a Sun SPARC machine. The virtual store is implemented as a file in the normal file system. Each user has an independent copy of the virtual store, so no communication is possible. Napier is utilised by executing commands on the Sun that manipulate the virtual store file.

Casper is built directly on the Mach kernel. It provides persistence as a fundamental part of the system. Mach provides features that ease implementation of the persistent system. The external pager was used by Casper to implement the stable store, and the Mach IPC was used to provide transparent distribution between machines.

3.2.2 The Monads System

The Monads system was developed to investigate large software systems. The operating system was designed to run above purpose-built hardware which supports a large global virtual memory and capability-based protection. The system is based around persistent coarse-grained objects, called modules, which have purely procedural interfaces. Access to modules is protected by module capabilities. A module capability identifies a particular module and the interfaces of that module which may be called. The hardware protects capabilities from arbitrary modification.

Modules are composed of segments, which may be of arbitrary size. Segment boundaries are orthogonal to page boundaries and segments may overlap pages and may consist of several pages. Segments are addressed using segment capabilities, which contain an address, length and access information (e.g. read only). Segments may contain both data and segment capabilities, allowing arbitrary graph structures to be constructed. The hardware provides a set of addressing (capability) registers into which segment capabilities may be loaded. The hardware also enforces certain rules to guarantee that information-hiding is not broken.

The Monads system has been extended to support transparent distribution. This extension is based on uniform distributed shared memory. All segments on all nodes exist within a single global address space. When a segment is accessed on a node the corresponding page is fetched either from a local disk or from another node via the network. This is fully transparent to the application. A single writer, multiple reader scheme is enforced to ensure that all nodes see a consistent version of segments.

In order to make the global address space manageable, the Monads system logically partitions the global address space. The structure of an address in Monads is shown in Figure 3.5. The node number refers to the node which is responsible for maintenance of the page, the volume number identifies the physical device or disk for the page, and the volume address space index effectively identifies a module within that disk. Facilities are provided to allow modules to move from their original location [22]. Details of the scheme are beyond the scope of this thesis.

Monads does not rely on a particular programming language or type system. The hardware supports general protection mechanisms and a variety of programming languages are supported.

Figure 3.5. Monads Address Composition: node number | volume number | volume address space index | page number | offset

3.2.3 Single Address Space Operating Systems

With the advent of 64 bit architectures there is an increasing trend to revive the MULTICS single address space model [47]. In this model all processes co-exist in the one address space. Portions of the address space are accessed by each process depending on the permissions of that process. These systems, commonly called Single Address Space Operating Systems (SASOS), remove the separate memory namespaces for each process, and propose a single network wide namespace. A single memory namespace is used by processes and this namespace incorporates all nodes on the network. The node that controls a particular address can be calculated from the address itself.

The Opal project is being developed at the University of Washington. It uses a single 64 bit address space to hold all system information. Each node in the network is responsible for a portion of the address namespace. There is a central server which allocates sections of the namespace on demand, but there are no predefined divisions in the namespace. Portions of the namespace are allocated only when requested.

This use of on demand allocation removes the requirement to have a dedicated portion of the namespace allocated to each node as is necessary in Monads. However, it requires the interrogation of the central server, or a broadcast message, to determine which node controls a particular page.

Access control is provided through protection domains. A protection domain is a group of addresses that are related. A process that is currently in a protection domain can, theoretically, address all of memory, but only accesses to addresses in the current protection domain will succeed. Figure 3.6 shows two protection domains in the one address space. A process in domain 1 can only access those areas that are in domain 1 (the shaded regions), while the process in domain 2 can only access those areas in domain 2. There is a section, marked a, that is accessible from both protection domains.

Figure 3.6. Opal Protection Domains

Processes and protection domains are orthogonal in Opal. A process can move between protection domains, and multiple processes may execute simultaneously in one protection domain.
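The access rule can be sketched in C as a simple membership test; the structures below are hypothetical and illustrate only the idea, not Opal's implementation. A protection domain is treated as a set of address ranges within the single address space, and an access succeeds only if the address lies inside one of the ranges of the process' current domain.

    /* Illustrative sketch: checking an access against the current domain. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct range { uint64_t base, limit; };   /* the half-open range [base, limit) */

    struct domain {
        size_t nranges;
        const struct range *ranges;
    };

    bool access_permitted(const struct domain *current, uint64_t addr)
    {
        for (size_t i = 0; i < current->nranges; i++)
            if (addr >= current->ranges[i].base && addr < current->ranges[i].limit)
                return true;                  /* address is within the domain */
        return false;                         /* otherwise the access faults  */
    }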

Capabilities are used to control access to protection domains. A process can only access a protection domain if it has a capability for that domain.

The Mungi system is similar to Opal, but differs in the implementation of its capability system, and the distribution of addresses. Instead of a central server issuing address ranges Mungi uses the traditional division of the address space to allocate portions of the address space to nodes. One unusual feature of Mungi is that a region of the address space, once created, may migrate from node to node. The creating node does not track this movement, and as such cannot be used to determine the current location of a page. Mungi uses broadcast messages to locate pages if the creating node does not currently hold the pages.

Both the Opal and Mungi system designers suggest that regions of the address space can persist, but neither has described the implementation of persistence in any detail.

3.3 Problems With Persistent Systems

Persistent systems are not without their drawbacks. These include:

• In a single environment, such as Napier, there is no concept of user, or other permission sensitive information.

• The single memory model is difficult to distribute efficiently, as it assumes only one processor, and sequential execution.

• Persistent systems usually have a very large address space, which is difficult to garbage collect efficiently.

• Unwanted data may still be reachable from the root of persistence. This will not be declared garbage while it continues to be reachable. It is difficult to control this leakage of memory.

In Napier anything reachable from the root of persistence can be reached by any user. The only protection available is to construct barrier programs that prompt for passwords before allowing a user past them deeper into the graph. However, once breached it is impossible to close the security hole created. Since a pointer to the sensitive data can be anywhere, including behind another barrier program, it is difficult to discover all links to data. Conventional systems give each user a unique identifier, which is used by the permission system to provide access control.

A single Napier store is composed of a directed graph structure of pointers. From the root of persistence, pointers reach outwards to all persistent objects. Given an arbitrary object it is difficult to determine whether that object is persistent, or which other objects reference the given object. This difficulty in deciding whether an object is persistent or available to be freed is the basis of the garbage collection problem. If there are multiple users, each generating garbage, the task of determining garbage is more difficult. Finally, as the store grows in size the garbage collection problem becomes more acute.

Multiple users exacerbate the security problem. It must be possible to distinguish between these users at some level. The common solution to this problem is to introduce the concept of capabilities. A user is defined by the capabilities that the user holds; providing capabilities to objects enables a user to access those objects, and removing capabilities denies the access.

In a persistent system any transient data may become persistent through a subtly incorrect program. In a traditional file based system it is difficult for such a program to pollute the long term storage system, and fairly easy to remedy the problem. In a persistent system the task of tracking down this problem is more difficult.

A final difficulty is that of scale. When the system spans multiple nodes, the tasks involved with keeping the global state consistent become more difficult. For example, having a garbage collector scan multiple machines is more difficult in terms of ease of access to data than on a single node. Similarly, checking for concurrent updates to data and ensuring that updates are performed in the correct order is also more difficult on multiple nodes.

3.4 Recovery in a Persistent System

A persistent system requires stricter control on updates than a non-persistent system. The possibility of incorrect updates corrupting the store is greater than on a file based system. A persistent system also requires more sophisticated failure recovery than file based systems because of the potentially complex interrelationships between data items.

In persistent systems it is important that data be preserved across failures. Failures include:

• Hardware Failures: Disk crashes, machine crashes, power outages.

• Software Failures: Operating system bugs, incorrect programs.

These failures can be classed into three categories: disaster, corrupting, and non-critical. A disaster failure is one that is impossible to recover from without external aid. The destruction of physical hardware such as the CPU or disks would be classified as a disaster failure. The strategy for recovering from disaster failures is to replace the destroyed hardware and recover the system to a previous consistent state by using backups. It is possible to have standby hardware to allow automated recovery for disaster failures. For example, RAID techniques can provide on-line backups for single, or limited multiple disk failures.

A corrupting failure is one in which data is corrupted or lost. For example, an unexpected power failure to a machine would result in the loss of the information in transient storage. To recover from these types of failures the machine would be shut down, if not already down, and restarted. The startup code would reconstruct a previous consistent state and computation would progress from that state.

A non-critical failure is one that can be recovered from without loss of information. An example of a non-critical failure is a broken network cable. The computation would halt until the cable is repaired; no data would be lost, and no recovery step is necessary by the operating system.

A persistent system is expected to be resilient. A resilient system provides stability of data over failures. The system must recover after a crash to a consistent state. The two main paradigms to achieve this are checkpointing and logging.

Checkpointing involves making a copy of the world state at some stage and providing a facility to enable reversion to the latest of these copies. Logging involves making a note of every change that occurs to the world, and on restart replaying these changes. It is possible to have a hybrid scheme that checkpoints, and then logs between checkpoints. This scheme only requires the log to be replayed from the last checkpoint.

Resilient crash recovery is not performed in traditional file based systems, where after a crash the granularity of the recovered data is a file. It is possible that after a crash only some changes to multiple files are evident; the other changes are lost, leading to inconsistent data. This is not a significant problem in file based systems, as the interrelation between files is slight. The data in most files are self contained. The closest systems to persistent systems are data bases, where multiple files may have to be kept consistent across system crashes.

In the following section we examine recovery in both single node and distributed persistent systems.

3.4.1 Recovery on Single Nodes

There are two common schemes for performing recovery in a single node, namely checkpointing and logging. Checkpointing requires the state of the system to be totally captured in a single snapshot, whilst logging allows the recording of the incremental changes to the system.

3.4.1.1 Checkpointing

Checkpoints are usually stored on disk as a collection of pages. There are different methods of arranging the pages on the disk to enable a checkpoint to be recovered. The main goal of these methods is to always be able to discover the set of pages that comprise a checkpoint. If a page is to be modified, the algorithm can either save a copy of the checkpointed page (before look), or it can assign a new disk location for the modified page (after look).

The last checkpoint of the system, and any subsequently modified pages, must be kept in permanent storage, commonly as pages on one or more disks. Keeping additional checkpoints provides greater security in the case of partial disk failures. The keeping of copies of the pages, either before or after looks, depends on the ability to construct the set of pages that comprise the checkpoint. This is called shadow paging, and was suggested by Lorie [37] in relation to data bases. In shadow paging there is an active page, and one or more shadows of the active page from previous checkpoints. If a page has not been modified since the last checkpoint, the active and the shadow page are identical. On a checkpoint the active pages become the shadow pages. The pages from any previous shadows that are no longer required are released into the free pool, and may be used for shadowing subsequent pages.
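
As an illustration, the following C fragment sketches the per-page bookkeeping a page-level shadowing scheme might keep; the structure and helper names are assumptions made for this sketch, not the mechanism of any particular store.

    /* Hypothetical per-page record for a page-level shadowing scheme. */
    struct shadow_entry {
        unsigned long active_block;  /* disk block holding the working copy      */
        unsigned long shadow_block;  /* disk block holding the checkpointed copy */
        int           dirty;         /* modified since the last checkpoint?      */
    };

    /* Stand-in allocator: hands out unused disk blocks from a free pool. */
    static unsigned long next_free = 1000;
    static unsigned long allocate_free_block(void) { return next_free++; }
    static void release_to_free_pool(unsigned long block) { (void)block; }

    /* Before the first modification after a checkpoint, move the working
     * copy to a fresh block so the shadow copy remains intact.           */
    void prepare_for_update(struct shadow_entry *e)
    {
        if (!e->dirty) {
            e->active_block = allocate_free_block();
            e->dirty = 1;
        }
    }

    /* At a checkpoint the active page becomes the new shadow and the old
     * shadow block is returned to the free pool.                          */
    void take_checkpoint(struct shadow_entry *e)
    {
        if (e->dirty) {
            release_to_free_pool(e->shadow_block);
            e->shadow_block = e->active_block;
            e->dirty = 0;
        }
    }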

Several different implementations of stable shadow page based systems have been described in the literature [4, 59]. These all shadow the memory system at the page level. It is possible, in an object based system, to provide finer grained shadowing. Since it is objects that are requested and stored, instead of pages, individual objects can be shadowed. This idea has been used in the POMS [1] and REKURSIV [20] systems to provide stability.

In all existing stable stores there is a general requirement for atomic update. These systems operate by having a root location from which the set of pages that comprises the checkpoint can be calculated. When one of these root locations is to be overwritten, the write must be performed atomically. If the location was partially updated and the system crashed then it would be possible for the root location to appear valid, but actually contain invalid data.

The algorithm commonly known as Challis' algorithm [5] may be used to perform this update. It operates by having a time-stamp at the beginning and end of the root data block. The root block is organised so that a write to disk performs the write of one of the timestamps as the last action. If the root block is one disk page, and the disk performs a sequential write then this condition is satisfied.

The algorithm then works as follows: the pages in the new checkpoint are written to new disk blocks, but the old root block is not updated. The disk buffers are flushed, to ensure the data is on stable media. The new root block is constructed, and a time stamp placed at the beginning and end of the block. The block is written to disk, and the buffers are flushed. If a crash occurs during a partial write then the time stamps at the front and end of the block will differ, and the block can be identified as invalid.

In a system that implements shadow paging using Challis' algorithm there are two, or more, root blocks. Each references a separate consistent state. On a restart after a crash each of the root blocks for the checkpoints on the disk is examined. The most recent correct root block is selected by picking the block with the most recent timestamp. This block is used as the root of the current image of the stable storage of the system, and computation then continues.

For this algorithm to work, the timestamps have to be monotonically increasing. It is assumed that timestamps will never wrap around, although if wrap around was to occur it would be possible to correct for this situation. This could be achieved by creating a new checkpoint using a new, low, timestamp, ensuring that the new checkpoint is on stable storage, and then removing the root blocks for all other checkpoints.
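
A minimal sketch, in C, of the root-block handling just described; the block size and field names are assumptions, and the disk I/O itself is omitted.

    #define ROOT_PAYLOAD 496   /* assumed: 512-byte root block minus two timestamps */

    /* Hypothetical root block: identical timestamps at the front and back.
     * A sequential disk write makes the back timestamp the last data
     * written, so a partial write leaves the two stamps different.        */
    struct root_block {
        unsigned long stamp_front;
        unsigned char payload[ROOT_PAYLOAD];   /* checkpoint metadata */
        unsigned long stamp_back;
    };

    int root_block_valid(const struct root_block *rb)
    {
        return rb->stamp_front == rb->stamp_back;
    }

    /* On restart, choose the valid root block with the most recent
     * (largest) timestamp; timestamps are assumed monotonically increasing. */
    const struct root_block *select_root(const struct root_block *a,
                                         const struct root_block *b)
    {
        int va = root_block_valid(a), vb = root_block_valid(b);
        if (va && vb) return (a->stamp_front > b->stamp_front) ? a : b;
        if (va) return a;
        if (vb) return b;
        return 0;   /* no consistent checkpoint was found */
    }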

Shadowing relies on having the previous checkpoint available. When any modification to data is performed, the data is saved to a new location instead of over the previous checkpoint. In this manner the previous checkpoint is available as a shadow of the current volatile state.

3.4.1.2 Logging

Logging employs a checkpoint and a log of changes that have been made since the checkpoint. Any operation that affects permanent storage is logged. Logically a logging system maintains both the old and the new values for any data item. This can be optimised into having the original data and the changes that have occurred. When a new checkpoint is created a new copy of all the data is taken, and the log is cleared. After the checkpoint the log grows again until the next checkpoint, when the cycle repeats.
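
A hedged sketch of the idea in C: each log record captures the new value of one block, and recovery replays the records over the last checkpoint. The record layout and the write_block callback are illustrative assumptions.

    /* Hypothetical log record: which block changed and its new contents. */
    struct log_record {
        unsigned long block_no;
        unsigned char new_data[512];   /* assumed block size */
    };

    /* Replay the log, in order, on top of the recovered checkpoint.
     * write_block() stands in for whatever routine installs a block
     * image into the store being rebuilt.                              */
    void replay_log(const struct log_record *log, int nrecords,
                    void (*write_block)(unsigned long block_no,
                                        const unsigned char *data))
    {
        int i;
        for (i = 0; i < nrecords; i++)
            write_block(log[i].block_no, log[i].new_data);
    }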

These techniques vary in the method used to recover from a crash. Crash recovery for a shadowing system involves discovering the last checkpoint, which is then used as the consistent state. Any remnants of the volatile copy on stable storage are removed.

In systems where replay is possible, resilience can be achieved using a combination of checkpointing and logging. A checkpoint is taken and a log of each operation is kept. On recovery the system reverts to the checkpoint, and the log is replayed, resulting in a system that has a much finer granularity of recovery.

Logging is traditionally performed in transaction based systems, where transactions are independent, sequential, operations. The log is a natural extension to the methods required to operate the transactions. Most database management systems employ logging for recoverability.

3.4.2 Recovery on Networked Nodes

Distributed systems are more complex than single node systems when it comes to ensuring stability. All nodes in the system have to be involved in maintaining a stable state, and in recovering from any crashes.

When a single node system crashes, the recovered state of the node is only dependent on the stable information held by that node. In a distributed system, the state of other nodes is similar to stable information, inasmuch as the recovering node must consider the state of other nodes. The recovery process may also affect these other nodes; on recovery, the failed node must revert to an earlier stable state. This reversion may require the other nodes to also revert to an earlier state. The reversion of these nodes may affect other nodes, and so on. This cascading reversion must be limited in a practical distributed system.

Checkpointing across multiple nodes is also an issue of concern with distributed persistent systems. When a single node checkpoints, all other nodes on which the checkpointing node is causally dependent must also checkpoint. Similarly, all nodes on which these checkpointing nodes are dependent must also checkpoint, until all nodes on which the initial node is transitively dependent have checkpointed. This is the eager method of checkpointing. A node stops to checkpoint, and all other causally dependent nodes also stop to checkpoint. This may often be a large number of nodes.

It is desirable to allow either a smaller set of nodes to checkpoint, or to break the dependency link for checkpoints. This can be done by performing more work on crash restart. Algorithms such as those described by Strom and Yemini [64] or Johnson and Zwaenepoel [23] can be used to reconstruct a consistent set of checkpoints. These algorithms are useful after a crash to determine to which set of checkpoints the systems should roll back. The algorithms can also be used to determine when a checkpoint is no longer required. These algorithms depend on the initial interactions between the nodes in the network having proceeded in a causal order, but allow the nodes in the network to checkpoint independently.

3.5 Models of Distribution

Distributed persistent systems require data to be available on multiple nodes in a network simultaneously. There are two fundamental mechanisms for allowing data to move across a network: remote procedure calls (RPC), which move data by value, and distributed shared memory (DSM), which moves data by reference.

Both of these methods of data sharing are examined in Chapter 6, where the mechanisms used by Grasshopper to allow distribution of data are examined.

3.6 Summary

Persistent systems present a different model of computation to non-persistent systems. They do not require the explicit movement of data to and from secondary storage. Main memory is used as a cache for the single level storage system.

The operating system on a persistent system has to perform functions that differ from those of conventional systems. Instead of a file system, a persistent system uses persistent objects as its long term storage paradigm.

4 The Grasshopper Model

Grasshopper is an operating system explicitly designed to support orthogonal persistence, and is an attempt to explore some of the persistent system design space. The fundamental goal of the Grasshopper project was to design and develop an operating system to support orthogonal persistence across a large network.

In order to support experimentation, Grasshopper removes policy decisions from the kernel, providing a set of mechanisms on which different policies can be implemented. In particular, Grasshopper does not directly enforce any particular model of resilience, but instead provides mechanisms to allow resilience to be supported at the user level.

4.1 Overview of Grasshopper

Grasshopper is designed to operate on conventional workstations connected by a network. Each workstation, or network node, contains:

• one or more processors

• one or more network connections

• local random access memory

• zero or more disks

• possibly a local screen with keyboard and mouse.

The Grasshopper operating system is based on a distributed kernel. Each node has a local kernel which cooperates with other local kernels to provide a single logical system. The complete system provides the basic abstractions of containers, loci, invocation, mapping, capabilities, and managers. The kernel also provides mechanisms to support consistency and resilience of the data in the system. The mutable objects in the system, containers and loci, are collectively called entities in Grasshopper nomenclature.

4.2 Basic Abstractions in Grasshopper

The following are brief descriptions of each of the basic abstractions supported by the Grasshopper kernel; these are described in more detail in [13].

4.2.1 Containers

Containers are the abstraction over data storage and access in Grasshopper. Containers are passive entities that replace both the address space and ®le system abstractions of conventional operating systems. In Grasshopper, containers are mutated by loci, an orthogonal concept. A locus may move between containers, and a container may have many loci executing in it simultaneously.

The container is a persistent entity, unlike the transient address space of conventional processes. A container is recoverable after a machine crash. Movement of data between containers is performed by two abstractions, invocation and mapping.

4.2.2 Loci

In Grasshopper, loci are the abstraction over execution. In its simplest form, a locus is simply the contents of the registers of the machine on which it is executing. Like containers, loci are maintained by the Grasshopper kernel and are inherently persistent. Making loci persistent is a departure from conventional operating system designs and frees the programmer from much complexity.

At all times every locus has a distinguished container, its host container, which de®nes the referencing environment of that locus. A container comprising program code, mutable data, and a locus forms a basic running program. The host container of a locus may be changed using the invoke operation which causes a locus to move from one container to another.

4.2.3 Invocation

A locus that has the correct permissions may invoke another container. This will cause the execution to recommence at a de®ned invocation point in the new container. Optionally, the current execution point is saved so the locus may return from the invocation to the current container. The new container is now the host container for the locus. The contents of the previous host container are unavailable until the locus returns.

Each container can only have one invocation point. If a container offers many services, these can be selected between by using one of the parameters to the invoke call. The invoke call is similar to transfer of control in object-thread-based systems [12, 31] as the locus is free to move between address spaces.

Figure 4.1 shows an example of these abstractions. There are two loci, L1 and L2, and three containers C1, C2, and C3. The invocation points of the containers are marked with squares. Locus L1 is created in C1 and invokes C2, and thence invokes C3, before returning from the invocation to C2. Concurrently, locus L2 is created in C1 and invokes C2. Both loci can simultaneously exist in the one container, and both are free to move between containers.

Figure 4.1. Containers, Loci, and Invocation

4.2.4 Mapping

To allow data to be shared between containers, Grasshopper provides a mapping facility. Mapping makes the contents of a region of a container visible in a congruent region of a second container. There are two forms of mapping; container mapping provides a map that is visible to all users of the mapped container, whilst a locus private mapping creates a map that is only visible to one locus. The final view of a container is composed of the original data overlaid with the data from each mapped container.

4.2.4.1 Container Mapping Mapping allows code and data to be shared amongst many containers instead of each container having its own copy. The mapping of containers can be recursively composed to build arbitrary data views. The only restriction is that it is not permissible to form circular mappings. This ensures that one container is always ultimately responsible for the data. Figure 4.2 shows an example of mapping. The container C1 encompasses a region of C2 which in turn encompasses a region of C3 and the whole of C4.

Figure 4.2. A Container Mapping Hierarchy

In Figure 4.2 the data at address C in container C3 is available at address B in container C2 and also available at address A in container C1. Any access to the data stored at address A in C1 will be satis®ed by the data stored at address C in C3.

The container that is overlaid is called the destination container, and the container from which the data is obtained is the source container. It is possible that a container is both destination and source for different mappings. C1 is a destination container in Figure 4.2, C3 and C4 are source containers, and C2 is both a destination and source container.
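
The following C sketch shows how such a mapping chain might be resolved; the structures are hypothetical and simplify the real mapping graph (for instance, locus private maps and permission checks are ignored).

    struct container;   /* forward declaration */

    /* Hypothetical record of one mapping: a region of a source container
     * overlaid on a region of a destination container.                    */
    struct mapping {
        struct container *source;
        unsigned long     dest_start;   /* start of the overlaid region   */
        unsigned long     length;
        unsigned long     src_start;    /* congruent region in the source */
    };

    struct container {
        struct mapping *maps;    /* mappings overlaid on this container */
        int             nmaps;
    };

    /* Walk from the destination container towards the source until no
     * mapping covers the address; circular mappings are forbidden, so
     * the walk terminates at the container responsible for the data.    */
    struct container *resolve(struct container *c, unsigned long *addr)
    {
        int i, followed = 1;
        while (followed) {
            followed = 0;
            for (i = 0; i < c->nmaps; i++) {
                struct mapping *m = &c->maps[i];
                if (*addr >= m->dest_start && *addr < m->dest_start + m->length) {
                    *addr = m->src_start + (*addr - m->dest_start);
                    c = m->source;
                    followed = 1;
                    break;
                }
            }
        }
        return c;
    }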

4.2.4.2 Views

The recursive nature of mappings allows different views of a container to be created. Using the protection afforded by the mapping hierarchy, multiple views of a single, possibly composed, container can be created. Figure 4.3 shows two views onto the one container.

Instead of providing access to C1, access can be provided to either C5 or C6 which present restricted subsets of the information available in the container.

Figure 4.3. Multiple Views onto One Container

4.2.4.3 Locus Private Mapping

Grasshopper supports locus private mappings. These are mappings that are visible to only one locus. In all other respects they are identical to container mappings. Once a locus private map is created, that map exists until it is explicitly deleted, or the locus is deleted. The map will be honoured whether the locus has the destination container as its host container, or is viewing the destination container through another map.

Locus private maps are used to provide individual customisation of a particular container. Each locus may have a private area of the address space for a stack area, or for local data. Figure 4.4 shows two loci with locus private mappings.

Figure 4.4. Locus Private Mappings

Both locus1 and locus2 have container C1 as their host container. However, at address S, locus1 accesses the contents of container C5, while locus2 accesses container C6.

A locus private mapping is a persistent mapping. It will continue to exist whenever the locus accesses the container, whether through a mapping into another container or as a host container. The mapping persists across multiple invocations, so the locus can return from an invocation, and have the same private map available on subsequent invocations.

The private map also allows more sophisticated programming constructs. It is used for data views, similar to single address space operating systems, and is the fundamental protection mechanism for the implementation of systems such as UNIX above Grasshopper [33].

4.2.5 Capabilities

In file based systems, access verification is provided by a combination of a user identifier and file permissions [26] or access lists [11]. In persistent systems, control is provided by type safe languages [45] or by data encapsulation [35]. As there is no identifier for each user, and a single pervasive type system is not available due to support for multiple languages, these traditional methods of protection and verification are insufficient for the requirements of Grasshopper. The capability abstraction is provided instead.

Capabilities are the sole means by which an entity is accessed and protected in Grasshopper. Whenever an action requires kernel intervention, a capability must be presented both to specify the entity being manipulated and to verify that the manipulation is permitted.

Capabilities must be protected from modification and forgery. There are three well-known techniques for achieving this protection:

• tagging: in which extra bits are provided by the hardware to indicate memory regions representing capabilities and to restrict access to these areas,

• passwords: in which a key, embedded in a sparse address space, is stored with the entity, and a matching key must be presented to gain access to that entity, and

• segregation: in which capabilities are stored in a protected area of memory associated with the entity, and an index into this area is provided.

Since Grasshopper is designed to run on conventional hardware, which does not support tagging of capabilities, the only choice is to use one of the remaining methods. Grasshopper uses the segregated policy, with capabilities being stored in kernel maintained data structures called capability lists. The advantage of using segregated capabilities instead of password capabilities is that it is possible to perform garbage collection on entities. The kernel keeps track of the number of capabilities that reference an entity, and when this number drops to zero the entity can be deleted. Password capabilities require another heuristic to decide when an entity is declared garbage; this is usually a timeout scheme.

In Grasshopper a capability refers to a single entity and contains a set of permissions which describe the access which is permitted using that capability. A capability list may be associated with each locus and container. A capability is specified by a tuple identifying a capability list, associated with either the current locus or the host container, and the index of the capability in the list.

All kernel calls that create capabilities take as a parameter the index and the capability list into which the newly created capability will be placed. Higher level software outside of the kernel may keep track of the capabilities and provide symbolic names.

4.2.5.1 Entity Names

Each capability refers to a single entity in Grasshopper. The entity is identified by embedding within the capability a kernel interpreted data field that uniquely references one entity. The names embedded within capabilities to identify entities are not visible at the user level. Users always access entities indirectly via capability list indices [14] known as caprefs.

All entities in Grasshopper are manipulated using capabilities. All calls to the kernel require caprefs be presented to identify which capability is being utilised. Figure 4.5 shows the relationship between caprefs, capability lists, capabilities, and entities. Containers and Loci each have capability lists. Container C1 has associated with it a capability list. A locus having C1 as its host container can reference these capabilities using caprefs. There is a one to one mapping between a capref and a capability. Each capability refers to an entity, and the operations performed using that capref affect the referenced entity. It is quite common for entities to hold capabilities that reference themselves.

Figure 4.5. Naming Conventions in Grasshopper
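
The relationship in Figure 4.5 can be sketched in C as below; the structure layouts and the lookup routine are assumptions made for illustration, not the kernel's actual data structures.

    struct entity;    /* a container or a locus, details omitted */

    struct capability {
        struct entity *target;        /* the single entity it refers to    */
        unsigned long  permissions;   /* operations this capability allows */
    };

    struct cap_list {
        struct capability *caps;
        int                ncaps;
    };

    /* A capref names a capability indirectly: a capability list (belonging
     * to the current locus or its host container) plus an index into it.  */
    struct capref {
        struct cap_list *list;
        int              index;
    };

    /* Resolve a capref presented on a kernel call and check that the
     * requested operation is permitted.                                   */
    struct entity *capref_lookup(struct capref r, unsigned long required)
    {
        struct capability *c;
        if (r.index < 0 || r.index >= r.list->ncaps)
            return 0;                           /* invalid capref    */
        c = &r.list->caps[r.index];
        if ((c->permissions & required) != required)
            return 0;                           /* permission denied */
        return c->target;
    }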

4.2.6 Managers

In Grasshopper the backing storage for the data held in containers is not controlled by the kernel. Instead, user level programs called managers control the movement of data between backing store and main memory. Each container has a manager associated with it which, amongst other things, handles the page fault requests for the containers it manages.

When a page fault occurs, the kernel traverses the mapping graph of the faulting locus to determine which container holds the page. The kernel then invokes the relevant manager, requesting the page. The manager returns the page to the kernel which sets up the hardware translation unit to allow the data to be accessed.
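
A sketch, in C, of the flow just described; the page size, structure layouts and call signatures are assumptions, and the mapping-graph traversal is taken as already done.

    struct page { unsigned char data[4096]; };   /* assumed page size */

    /* Hypothetical manager interface: supply one page of the managed
     * container from backing store.                                     */
    struct pager {
        int (*get_page)(struct pager *self, unsigned long page_no,
                        struct page *out);
    };

    struct managed_container {
        struct pager *manager;
    };

    /* Kernel-side outline: once the mapping graph walk has identified the
     * container holding the faulting page, ask its manager for the data
     * and then install the page in the hardware translation unit.         */
    int handle_page_fault(struct managed_container *owner,
                          unsigned long page_no, struct page *frame)
    {
        if (owner->manager->get_page(owner->manager, page_no, frame) != 0)
            return -1;     /* the manager could not supply the page */
        /* ... set up the translation unit so the locus can proceed ... */
        return 0;
    }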

Managers are also responsible for the resilience of the data in a container. They are required to implement a stability algorithm on the data, and, in cooperation with the kernel and other managers, ensure that consistent data is stored on stable storage [34].

A manager is allowed to manipulate the data in a page during its transit between memory and disk. This allows managers to perform encryption, compaction, or swizzling [70] on the data if required.

4.2.6.1 Self Managing Managers

Managers themselves are implemented as user level containers. They have no special system privileges. This means that a manager requires a manager itself. This recursive chain requires a fixed point. Two different fixed points are possible. Either the kernel provides a manager which does not itself require a manager, or a manager is able to manage itself.

For bootstrapping purposes a simple kernel manager is provided, although the usual manager fixed point is a self managing manager. Each container may have a single stable page that is controlled by the kernel. Self managing managers use this page to bootstrap their data from the disk. An example of a self managing manager is provided in Section 5.6.

Figure 4.6 shows a container with the single stable page. The manager's shared code is provided by the kernel's read only swapper. The mutable data for the bootstrapping is stored in the stable page provided by the kernel. The page contains enough data to allow the manager to obtain the remaining local data.

Figure 4.6. Self Managing Manager

4.2.7 Resiliency and Consistency

As we have described earlier, a persistent system must provide a view of data that is both resilient and consistent. In Grasshopper, this function is performed by the managers of each container.

Managers are responsible for maintaining the data in the containers in their care. However, they cannot maintain the relationships between containers; instead the kernel is responsible for coordinating between managers to ensure there is an overall consistent state on system restart, as described below.

Each manager is responsible for maintaining its data in a resilient fashion. However, the system does not dictate how this is achieved; a manager may ensure resilience in the manner that best suits the data stored in the container. Possible techniques include shadow paging [37], transaction logging [19], or object shadowing [1]. Each technique involves moving atomically from stable state to stable state.

As described above, the kernel in cooperation with managers ensures that data is consistent. This requires that the causality relationships between data be tracked. On a single Grasshopper node, causality is ensured by the kernel mediating the checkpoint operation. When a container is to be checkpointed the kernel, not the manager, is initially requested to make the checkpoint. The kernel then requests the manager to perform a checkpoint on the container, and may request other managers of causally dependent containers to also make checkpoints.

The kernel keeps an internal data structure, the causality association, which details the causality relationships between containers. When a consistent state is required to be made, for example, if the machine is shutting down, the kernel traverses this association structure and requests the managers of causally dependent containers to stabilise [69]. Appropriate algorithms for maintaining these relationships have been documented in the literature [23, 24, 64].
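
A simplified sketch of such a traversal in C; the real algorithms (see [23, 24, 64]) must also cope with distribution and concurrent activity, which this fragment ignores, and the structure names are invented for illustration.

    #define MAX_DEPS 16

    /* Hypothetical node in the kernel's causality association. */
    struct assoc {
        struct assoc *deps[MAX_DEPS];   /* containers this one depends on */
        int           ndeps;
        int           visited;
        void        (*stabilise)(struct assoc *self);  /* done by the manager */
    };

    /* Ask every container reachable through the causal dependency
     * relation to stabilise exactly once (depth first).              */
    void checkpoint_dependents(struct assoc *c)
    {
        int i;
        if (c->visited)
            return;
        c->visited = 1;
        for (i = 0; i < c->ndeps; i++)
            checkpoint_dependents(c->deps[i]);
        c->stabilise(c);
    }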

4.2.8 Programming Model

The abstractions provided by Grasshopper can be used to implement various programming models. A user views a Grasshopper system as a seamless collection of machines, with loci performing tasks by interacting with containers.

The programming model in Grasshopper provides an abstraction over distribution. To a user level entity all entities are visible. To ensure that locality cannot be determined, all operations on an entity must appear to occur in causal order. This causal ordering of operations must be performed by the distributed kernel.

4.2.9 Kernel Structure in Grasshopper

Each node in a Grasshopper network runs a separate kernel which cooperates with the kernels on other machines to provide the appearance of a single distributed kernel. Each kernel is responsible for maintaining the state necessary to allow entities to persist. To implement the persistent abstractions outlined above, the kernels must also contain persistent data [34]. The kernel data that must be persistent includes the capability lists, the mapping graphs, and the loci call chains. This data is accessed via a dedicated region of the kernel virtual address space which persists across reboots.

The kernel internal data structures are stored in persistent objects called arenas. Each arena holds the state information for a single entity, for example a container or a locus. For each entity the kernel holds the data that describes that entity. A locus is described by the identity of its host container, the call chain, the current register set, and the capabilities that it holds.

A container is described by the mappings in which it participates, the loci that have it as their host container, and its manager. Each kernel which has the container resident holds some of this information in the Local Container Descriptor. The remainder is stored on the container's home node. A container also has transient data maintained by the kernel. This data includes the pages that are used by the container in main memory. The information about which pages are allocated and their address is held in a transient data structure called a Page Set.

4.3 Summary

Grasshopper is an example of a persistent system. It provides the four requirements to support orthogonal persistence: uniform treatment of data structures, uniform access to transient and persistent data, resiliency of long term data, and control over access to data through capabilities.

Grasshopper provides loci as the agents of change, and provides the container abstraction for storage. Its novel features include managers and integrated causality support.

5 Data Movement on a Single Node

5.1 Introduction

On a single node Grasshopper system, data moves between the stable store and main memory. The stable store for a Grasshopper system is usually located on one or more physically attached disk drives, although it is also possible to utilise a network-based disk.

This Chapter outlines the interface used between user level entities, in particular managers, and the kernel in Grasshopper to achieve this movement. The interface paradigm developed is generally applicable to conventional as well as persistent operating systems. This Chapter will firstly introduce the paradigm, and then detail its use in Grasshopper.

The traditional approach to disk device drivers is to have the driver linked into the kernel. If a change in functionality is required the driver has to be modified and the kernel relinked. User code is restricted to stub routines that call the kernel drivers. This results in two problems:

• the kernel code becomes overly large, and

• the time spent in the kernel increases, reducing performance.

Recent systems have attempted to address these issues. The approach taken has been to implement a minimal, core driver in the kernel, and have the remaining functionality in user space. The driver is then broken into two layers, the kernel layer and the user layer. This results in a kernel layer which is fast and simple, but the user level layer still retains most of the complexities of the previous code. The layering only solves one problem, that of time spent in the kernel. The monolithic piece of code in user space is still difficult to construct and potentially slow to execute.

In our model we remove the need for a monolithic piece of code in user space. The design presented breaks this piece of code into multiple modules, with each module performing one simple function. We connect these modules together, on a per application basis, to gain the functionality desired. An application developer can use modules from a predefined set provided, or may produce a custom module if the required functionality is not already available. The developer selects the modules required, and arranges them in the order that is desired. This selection and arrangement of modules provides a simple, clean, and fast path between the application and the disk.

Since an application only uses those modules required, the code that must be executed for each disk access is reduced. In a monolithic driver a series of tests and loops is required to perform the operations requested, with the result that only a small fraction of the kernel driver is executed for any given request. In the modular system these tests and loops are not necessary; only the code required is placed into the modules that are called. In keeping with this paradigm there are multiple kernel drivers, one for each type of hardware controller unit. The kernel drivers only support those operations that are required for the controllers.

In summary, in the model the disk driver is spread between multiple user modules in user space and a set of core drivers in kernel space. Each kernel driver provides access to one hardware device. Each user module performs one simple function. A user may set up a graph of modules in user space between the application and the kernel to provide the functionality required. This approach reduces both kernel and user level overheads and complexities when compared to other systems.

5.2 Background

5.2.1 Layering versus Modularity

Traditionally, device drivers are integral parts of the kernel, and device functionality is therefore dictated by the included drivers. As a result, a kernel rebuild is necessary whenever changes are made to any portion of, for example, a disk driver.

The layered approach, as utilised in [17, 25, 41, 42, 60], provides a fixed hierarchy of modules that are used to perform the task of controlling disk drives. For example, in Mach [17] there are chip dependent and chip independent layers. This approach improves the kernel response to interrupts. In SunOS the SCSI disk drivers have been replaced with the Sun Common SCSI Architecture (SCSA) [41]. The SCSA has three layers, the controller layer, the drive type layer, and the file system layer. Choices [25] uses the internal type system to build a hierarchy of disk types. The functionality of a disk is defined totally by the position of its code in the type system.

Each of these systems is characterised by having multiple layers with different interfaces between the layers. While it is possible to mix and match to a certain extent, the only choice is between objects in the one layer. What we shall describe is a system where the objects have a common interface, and may be placed in the best order for each application.

The idea of composing processing structures out of simple building blocks is not new. Using UNIX pipes, for example, it is possible to construct arbitrarily complex programs from a collection of tools by joining them in various ways. All programs have the same interface (that of a character stream). A program does not know which programs, if any, are upstream or downstream of itself. Data flows from one program into another anonymously. Similarly, the x-kernel allows networking protocols to be described in a graph of protocol processors [49]. This anonymity is a desirable feature as it enforces clean, generic interfaces.

5.2.2 Adding Functionality

The placement of complexity in user space is advantageous in that it is easier to write and debug a small kernel driver, and then construct the user level modules, than it is to write all portions of a single kernel driver at once. As an example we consider the advent of disk jukeboxes. These are devices that store multiple disk platters in a rack, and place one platter in a disk drive to be read or written. Such devices are becoming more common as the need for large accessible storage systems becomes more prevalent. Jukebox systems are different from traditional disk drives. They require additional, special commands, namely position disk and load/unload disk, to be sent before the normal read/write commands.

A jukebox cannot be attached to a system with an older style disk interface, as the requirements of the jukebox do not align with the functionality of the traditional kernel based disk drivers. In the module system all that would be required is to construct a jukebox control module, and place this module as the first module above the kernel interface; a jukebox could then be accessed as though it was a single, extremely large, disk. The upper level modules do not require any changes. The jukebox module would also have to access a second kernel driver that allowed commands to be sent to the Media-changer in the jukebox. This new kernel driver is almost identical to a standard SCSI disk driver [62].

5.2.3 Advantages of a Modular Approach

The modular approach has three major advantages. The first is that less time is spent in the kernel, which has been shown to improve the overall kernel response time [10, 17]. The second advantage is that it is possible to customise the interface between an executing application and the underlying disks to improve performance. Only those modules required need be attached. For example, in experiments on databases, where exact placement of data on disk is important, the database engine can be placed directly above the kernel driver, allowing the highest possible bandwidth and exact placement of data, while a file server that does not need exact placement, but does require vast storage space, could use a spanning module on top of many drives to access multiple drives in an independent fashion.

The third, and perhaps the most important advantage for an experimental system, is that having a more modular approach allows for easier experimentation. An experimental module can be constructed and tested without disturbing or modifying the existing modules. This allows additional modules to be developed and tested while the system is running, and does not require a reboot to make these modules available for general use.

5.3 Examples of Modular Designs

The model we shall describe is both powerful and flexible. The power of the model is in the ability to compose complex processing structures out of simple building blocks. In this section we present three examples that demonstrate this power and flexibility.

5.3.1 Partitioning a Multi-disk System

Consider a system that has multiple disks which may be of different types. These disks are to be presented to the application as a single partition. For conventional systems, the kernel drivers must know how to achieve a spanning partition. In the typical case, in which each partition is wholly contained on a single physical disk, this ability is a hindrance. The resultant kernel is larger than necessary because of the superfluous code, and driver execution is less efficient due to unnecessary tests.

Suppose we wish to allocate three disks of differing sizes evenly between four applications. Figure 5.1 shows a solution using the scheme detailed. A two module system would be used. The three disks each have a kernel interface, Ki1, Ki2, and Ki3. We use a spanning module (Sm) to provide the abstraction of a single disk. The capacity of this disk is the combined size of the three underlying disks. Above this module we would place the partitioning module (Pm). The module would be configured to provide four partitions, each partition being accessed by an application.

Figure 5.1. Partitioning Spanned Drives

The spanning of the physical devices is transparent to the applications. Significantly, the spanning is also transparent to the kernel device drivers. The functionality of the kernel drivers is identical irrespective of whether the driver is used for spanning or single disk partitions.
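
The construction of this stack might look like the following C fragment. The constructor names (kernel_disk, spanning_create, partitioning_create, partitioning_get) are invented for this sketch; they are not the actual Grasshopper interface.

    struct disk;   /* opaque handle exporting the core interface */

    /* Assumed constructors, declared only to show the shape of the stack. */
    extern struct disk *kernel_disk(int unit);
    extern struct disk *spanning_create(struct disk **parents, int nparents);
    extern struct disk *partitioning_create(struct disk *parent, int nparts);
    extern struct disk *partitioning_get(struct disk *pm, int part);

    /* Build the stack of Figure 5.1: Ki1-Ki3 combined by a spanning
     * module, then split into four partitions, one per application.   */
    void build_figure_5_1(struct disk *app_disks[4])
    {
        struct disk *parents[3], *span, *pm;
        int i;

        for (i = 0; i < 3; i++)
            parents[i] = kernel_disk(i);          /* Ki1, Ki2, Ki3              */
        span = spanning_create(parents, 3);       /* Sm: one large logical disk */
        pm   = partitioning_create(span, 4);      /* Pm: four equal partitions  */
        for (i = 0; i < 4; i++)
            app_disks[i] = partitioning_get(pm, i);
    }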

5.3.2 Security and Allocation on a Single Disk

One of the advantages of the modular system is the ability to adopt different policies on differing parts of a collection of disks. For example, suppose we need to store a sensitive data object on a portion of a disk, and have the remainder of the disk available for general allocation. If we were attempting to do this with a monolithic kernel disk driver, the kernel driver would have to have separate policies for the different sections of the disk. In one section the data would be encrypted for security reasons, and in the other section the disk space has to be shared amongst several applications. This requirement of having different policies would complicate the kernel driver.

The kernel driver would be required to differentiate the requests for each portion of the disk. It must implement the different policies desired for each portion. This would require the kernel driver to be designed with this functionality as a goal. In general a kernel based driver would not be able to anticipate all the different policies requested by users.

To support this situation using modular blocks we use three modules. As shown in Figure 5.2, attached to the kernel driver (Ki) is the partitioning module (Pm). Connected to the partitioning module is an encryption module (Em) for the secure data, and an allocating module (Am) to perform general disk allocation. This ensures that the data for the sensitive application is encrypted when stored on the physical disk, and that the non-sensitive data is stored in raw form. Applications store and retrieve data without knowledge of any transformation of the stored data.

Figure 5.2. Allowing Different Policies

5.3.3 Remote Disks

On some occasions it is desirable to have the data for one node stored on another node, for example dataless nodes. A dataless node has only the operating system stored on local disks and all user data is stored on remote nodes. In this instance we would use a remote disk. A remote disk appears to be locally attached, when in actuality the physical device is attached to a remote node. It is not necessary for the remote node to be executing the same operating system provided the two nodes conform to a storage protocol.

Figure 5.3. Remote Disks

This type of remote storage is very useful during experimentation as it allows a reliable system to be used as the disk host for the system running the experiment. This allows disks to be prefabricated, and examined irrespective of the state of the experiment.

Figure 5.3 shows a remote disk configuration. The disk attached to node N2 is available on node N1. The remote disk module on node N1 takes disk operations and converts them into network requests to node N2. Node N2 then performs the operations on the local disk.
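
One plausible shape for the messages exchanged between the remote module on N1 and the node owning the physical disk is sketched below in C; the field layout and operation codes are assumptions, not the protocol actually used.

    /* Hypothetical request sent by the remote module for each core
     * operation on its logical disk.                                  */
    enum remote_op { R_READ = 1, R_WRITE = 2, R_FLUSH = 3 };

    struct remote_request {
        unsigned int  op;         /* one of enum remote_op           */
        unsigned long block_no;   /* disk address being accessed     */
        unsigned int  length;     /* bytes of block data that follow */
        /* for R_WRITE, 'length' bytes of data follow the header */
    };

    /* Hypothetical reply returned by the node owning the physical disk. */
    struct remote_reply {
        int           status;     /* 0 on success, error code otherwise */
        unsigned int  length;     /* bytes of data that follow (reads)  */
    };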

In all the above examples the interface that is finally presented to the applications executing on the local node is identical. Any combination of modules can be placed between the physical disk and the application. The number of combinations that are possible is limitless, although some combinations are less than useful: a spanning module combining all portions of a partitioning module is basically a two module delay line.

5.4 Model Assumptions

There are several assumptions that the model uses to improve the efficacy of operations. The first is that it is possible to efficiently move data between modules. The second is that a module is granted no special privileges; it is normal user code.

In a system where the structure is constructed from a collection of independent modules it is important to allow data to move between these modules as rapidly as possible. Disk drive systems, by their nature, are used for high speed traf®c. Copying data between modules would incur an unacceptable performance overhead. For this reason, the modular approach will only work on systems where data can be referenced by easily passed parameters. Since users can write their own modules it is important to be able to pass data between the kernel and a user module, and between individual user modules.

Disk drivers, by their nature, and for efficiency reasons, access raw data. Since we are advocating that users be able to write disk handling modules there must be some mechanism that restricts a malicious piece of user code from accessing unauthorised data. Furthermore, modules are not run in kernel mode. They may only access the resources that are available to any other user program. A module is treated no differently from any other piece of user code.

5.5 The Disk Module Model

In this section we provide an abstract view of the modules which comprise a disk subsystem. Operations supported by the proposed modules are described. Some examples of modules provided by a typical operating system are presented.

5.5.1 The Disk Modules

Each disk module is a self contained entity providing a core set of interface routines. This set comprises entry points, each of which performs a single operation. Disk modules communicate with each other using this core interface. Apart from the core interface, some disk modules provide additional operations that allow access to functionality particular to the module.

Each disk module accepts calls on its interface, provides some functionality, and then calls the interface routine of some other module in the disk module stack. The lowest module in the stack usually calls the disk interface provided by the kernel.

Kernel disk functions are accessed using the same core interface as that provided by disk modules. The kernel implements the connection between the disk module stack and the physical disk controller. To achieve this the kernel must provide a driver for each kind of disk controller attached to the machine.

5.5.2 Disk Module Operations

The core interface operations supported by each module may be grouped into three categories according to their function. These categories are:

• read/write - to move data to and from the disk.

• control - to manipulate the disk configuration.

• status - to obtain information about the disk configuration.

Additionally a module may implement extra operations which are accessed through its extensions to the core interface. For example an encrypting module would provide an interface routine to set the encryption key.

The interface speci®cation does not detail whether the operations are synchronous, or asynchronous in nature. For ease of construction of the system we have deemed that operations on a module will be synchronous in our implementation. It should be noted that a synchronous interface does not preclude equivalent asynchronous operations. Asynchronous operations can still be performed by using a threads based system where a thread blocks, waiting on the operation to complete, while the main thread continues to execute [8, 9].

The full list of mandatory operations is:

• read_block
• write_block
• flush
• initialise
• reset
• format
• eject
• retrieve_layout
• release_block
• pre_allocate_block
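
One way to realise this core interface is as a table of operations exported by every module and kernel driver, sketched below in C; the argument types are assumptions made for illustration.

    struct disk_block;    /* one block of data, layout unspecified */
    struct disk_layout;   /* geometry record, see retrieve_layout  */

    /* Hypothetical core interface: every disk module and kernel driver
     * exports one of these tables, so modules can be stacked freely.    */
    struct disk_ops {
        int (*read_block)(void *self, unsigned long blkno,
                          struct disk_block *block);
        int (*write_block)(void *self, unsigned long blkno,
                           const struct disk_block *block);
        int (*flush)(void *self);
        int (*initialise)(void *self);
        int (*reset)(void *self);
        int (*format)(void *self);
        int (*eject)(void *self);
        int (*retrieve_layout)(void *self, struct disk_layout *layout);
        int (*release_block)(void *self, unsigned long blkno);
        int (*pre_allocate_block)(void *self, unsigned long blkno);
    };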

The read and write operations are used to move data to and from a disk drive. The operations take two arguments: the block of data, and the disk address that is being accessed. The read operation will only return when the data has been retrieved. The write operation completes when the block of data has been queued for writing by the kernel. The user level modules operate synchronously for writes. The operations may fail for several reasons, e.g. if the disk is unavailable, if the disk has been removed (for removable media), or if the data address is invalid.

Although logically all operations are synchronous, the write operation can return before the write has occurred, but may only return when it is guaranteed that the write can be scheduled and, barring hardware failure, will succeed. If synchronous writes are required there is a flush operation provided. The flush operation will not return until all outstanding writes have been completed.

The flush, initialise, reset, format, and eject operations are used to control the disk. As mentioned, flush commits writes to the physical disk. The format operation causes the specified disk to be formatted. After a format all previous information is lost. The eject operation causes a removable medium to be ejected from the drive. The reset operation is provided for failure handling. If an operation fails the module may attempt to reset the drive, thereby bringing it back (hopefully) to a known state, and the failing operation may be retried before declaring the operation a failure. Initialise is used to inform the module that a change has occurred in the configuration of the hardware, and to ask the module to re-check its knowledge of the hardware configuration. This operation is useful if a drive is replaced, brought on-line or taken off-line, as it allows the system to rebuild its description of the attached devices without a restart.

The pre_allocate_block and release_block operations are used to allow trapping of invalid disk block references. It is felt that having a mechanism whereby first access to blocks may be trapped is potentially very useful. As an example, trapping these references is useful for magneto-optical disks, which allow sector based formatting.

We can use the pre_allocate_block information to perform sector based formatting. Other drivers, such as the allocating module, use the information from pre_allocate_block and release_block to perform block allocation and release.

The result of a read or write operation on a block that has not been preallocated, or that has been released, is defined on a module-by-module basis. Higher level modules return an error, but some modules implicitly perform a pre_allocate_block on all blocks, and ignore the release_block operation. For read accesses to blocks which have not been previously written, these modules return a data block with undefined contents. Write operations are always successful.

The retrieve_layout operation is used to allow optimisation or configuration based on the physical disk layout. This operation returns the physical layout of the disk, including the size of the disk (in blocks), the number of heads, sectors per track, and the number of cylinders. If any of the disk parameters cannot be supplied then a zero is returned for that parameter. This is required as some modern disks do not have any externally useful information apart from the size of the disk, due to hidden internal layout optimisations. Composite modules, such as the spanning module, cannot return any useful information apart from the size of the composite disk, as the geometries for the individual disks may vary greatly.
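
A sketch of the record such an operation might return, with unknown parameters reported as zero; the field names are assumptions.

    /* Hypothetical result of retrieve_layout.  Only the size is always
     * meaningful; composite or geometry-hiding disks return zero for
     * the remaining fields.                                             */
    struct disk_layout {
        unsigned long size_in_blocks;
        unsigned long heads;               /* 0 if unknown */
        unsigned long sectors_per_track;   /* 0 if unknown */
        unsigned long cylinders;           /* 0 if unknown */
    };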

The operations outlined above comprise the core functionality of each module. Every module must perform these operations. Some modules may perform the operation by passing the operation to a lower level module. Other modules may modify the parameters before passing the operations to a lower level module.

Modules may have other operations in addition to the core functionality. These operations are usually used during the construction of a module stack, and are specific to each class of module. Examples of additional operations include get_new_partition for the partitioning module, add_drive for the spanning module, and set_encryption_key for the encrypting module.

5.5.3 Module Connections

Each module may be connected to one, or more, other modules. Figure 5.4 shows an example of the connections possible. The kernel provides a single core interface for each attached disk. For each attachment there is a provider, called the parent, and a user, called the child. The kernel is the ultimate parent to all modules and applications in the system. Each created module may have one, or more, parents. Each module in turn provides one, or more, disks that may be attached to by either applications or other modules.

Figure 5.4. Example Module Connections

5.5.4 Standard Modules

As an example of the types of user level modules that can be constructed, the Grasshopper system comes with the following modules as standard:

• spanning
• striping
• replicating
• partitioning
• allocating
• remote
• encrypting

Each of the modules uses one or more parent disks to store information, and presents one or more logical disks to applications or other modules. The parent disks may be the logical disks of other modules, or physical disks controlled by the kernel. The logical disks presented are either directly used by applications, or are the parent disks for other disk modules.

The spanning and striping modules are used to make multiple parent disks appear as a single logical disk. The difference between the modules is in the algorithm used to determine the layout of data on the parent disks. The spanning module presents a logical disk that is the concatenation of the parent disks. It allows parent disks of different geometries to be combined into a single logical disk. The striping module requires all the parent disks be identical in configuration and places consecutive blocks of the logical disk on separate parent disks.
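
The address arithmetic of a striping module can be sketched as follows; this assumes block-sized stripes over identical parents, which may differ from the actual module.

    /* Map a logical block onto (parent disk, block on that parent):
     * consecutive logical blocks land on consecutive parent disks.   */
    void stripe_locate(unsigned long logical_block, unsigned int nparents,
                       unsigned int *parent, unsigned long *parent_block)
    {
        *parent       = (unsigned int)(logical_block % nparents);
        *parent_block = logical_block / nparents;
    }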

The replicating module is similar to the striping module, but implements a strategy of having multiple parent disks that hold identical data. If a disk fails the remaining disks can be used to continue service requests and to rebuild the data on the failed disk.

The partitioning module uses a single parent disk and presents multiple logical disks to its clients. The number of logical disks and their size is constrained only by the parent disk. Each logical disk has a geometry fixed at create time. Once set, a logical disk cannot be changed in size, but may be deallocated.

The allocating module interfaces to a single parent disk and presents multiple logical disks, each of variable size. Each of the logical disks has no defined geometry. All geometry parameters are returned as zero on retrieve_layout operations to indicate that no useful geometry information exists for any of the logical disks.

The allocating module operates by allocating blocks for each logical disk from a general pool of blocks. The general pool of blocks is stored on the parent disk. The allocating module allows sparse logical disks, so the higher level modules can use disk block indices up to the maximum implementation value for a disk index. To facilitate the bookkeeping of disk blocks the allocating module responds to the pre_allocate_block and release_block operations by allocating the block from the general pool or returning the block to the general pool. Allocation of a block may fail if the parent disk is full, but once a block is allocated, writes to that block are guaranteed to succeed.

The allocating module may allocate the blocks for operations using the algorithm most appropriate. The interface speci®cation does not de®ne anything about the internal organisation of the data, only that the data is retrievable. The module may use pointers, or bitmaps [36] depending on the constructor of the module.
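
A bitmap-based version of this bookkeeping might look like the sketch below; the pool size is arbitrary, and the mapping from sparse logical block indices to parent blocks is simplified to a flat table.

    #define POOL_BLOCKS 8192                 /* assumed size of the parent disk */

    static unsigned char free_map[POOL_BLOCKS / 8];        /* 1 bit per parent block            */
    static unsigned long logical_to_parent[POOL_BLOCKS];   /* parent block + 1; 0 = unallocated */

    static int  bit_test(unsigned long b)  { return free_map[b / 8] & (1u << (b % 8)); }
    static void bit_set(unsigned long b)   { free_map[b / 8] |= (unsigned char)(1u << (b % 8)); }
    static void bit_clear(unsigned long b) { free_map[b / 8] &= (unsigned char)~(1u << (b % 8)); }

    /* pre_allocate_block: bind a logical block to a free parent block, or
     * fail if the parent disk is full.  Once this succeeds, writes to the
     * logical block are guaranteed to succeed.                            */
    int pre_allocate_block(unsigned long logical)
    {
        unsigned long b;
        if (logical >= POOL_BLOCKS || logical_to_parent[logical] != 0)
            return -1;                      /* out of range or already allocated */
        for (b = 0; b < POOL_BLOCKS; b++) {
            if (!bit_test(b)) {
                bit_set(b);
                logical_to_parent[logical] = b + 1;
                return 0;
            }
        }
        return -1;                          /* the general pool is exhausted */
    }

    /* release_block: return the underlying parent block to the general pool. */
    void release_block(unsigned long logical)
    {
        if (logical < POOL_BLOCKS && logical_to_parent[logical] != 0) {
            bit_clear(logical_to_parent[logical] - 1);
            logical_to_parent[logical] = 0;
        }
    }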

The remote module presents to applications a logical disk that is actually stored on a remote host. This allows a disk image stored on a reliable host (e.g. UNIX) to be available for testing drivers at all levels. The disk contents may be examined on the reliable host to determine if the drivers are working correctly. The use of a network-based disk was particularly useful during the development phase of Grasshopper, as it allowed the state of the disk to be examined independently of the state of the development machine.

The encrypting module is used to perform simple encryption using a one bit rotor style encrypter. It encrypts writes from its logical disk to its parent disk, and decrypts reads from the parent disk to the logical disk. It is used for encryption of data being placed onto removable media where the physical security of the media is questionable.
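
The exact rotor algorithm is not given here, so the following fragment is only a plausible sketch of the style of transformation such a module could apply to a page buffer as it passes to and from the parent disk. The rotating-key XOR shown is an assumption, not the Grasshopper implementation, and it offers no serious cryptographic protection.

#include <stddef.h>
#include <stdint.h>

/* Illustrative rotor-style transform: XOR each byte with a key byte that is
 * rotated by one bit per position.  Applying the same function twice with the
 * same key restores the original data, so the same routine serves both the
 * write (encrypt) and read (decrypt) paths.  This scheme is an assumption made
 * for the sketch, not the algorithm used by Grasshopper. */
void rotor_transform(uint8_t *page, size_t len, uint8_t key)
{
    uint8_t rotor = key;
    for (size_t i = 0; i < len; i++) {
        page[i] ^= rotor;
        rotor = (uint8_t)((rotor << 1) | (rotor >> 7));   /* rotate the key left by one bit */
    }
}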

5.6 Grasshopper Implementation of the Modular Drivers

Grasshopper uses disk modules to provide all security and partitioning of disk resources in the system. Each disk is represented by a single kernel capability, and is accessed using the core driver interface.

If a single disk is to be used by multiple managers it is important that the security of the data be ensured. A manager may not read the data written by another manager without the correct capability. Sharing a disk between managers therefore requires a module that divides the disk space, such as an allocating or partitioning module. These modules ensure that the data in separate portions of the disk remains inviolate.

The movement of data between disk and memory in a single node uses, at the base level, a kernel supplied driver. Any disk may have to be shared amongst many different managers. Each manager has to have a separate, secure, portion of the disk. This separation is enforced by some additional protection level between the manager and the physical disk. Traditionally this level is implemented in hardware or the system kernel.

In Grasshopper a manager is designed to use a single disk drive as backing store, and to use the entire disk if necessary. This reduces the complexity of the manager considerably. The manager does not have to consider other users of a disk. A manager may use any desired layout of data on the disk. Managers requiring high performance may use the kernel disk interface to a single disk, giving, for example, a database manager the highest performance possible.

Using a disk module stack it is possible to implement any sharing scheme desired without requiring any changes to the managers. The same manager code may be used in two managers, with one using a raw disk and the other using a disk module to share a disk. It is possible to debug managers by using a pseudo disk module that reports all disk traffic. This allows the programmer to observe the activity of the manager, and the data being stored.

5.6.1 Stability of Disk Modules

A disk module is implemented as a self-managing container, and is responsible for all stability considerations for its own data. It allocates enough disk space out of its available pool to implement the stability algorithm it uses.

On a stabilise request a disk module would create a stable copy of any permanent data using its stability algorithm. The disk module does not have to interact with any other container to perform its stabilise.

5.6.2 Construction of Disk Modules

Disk modules are constructed by using a standard disk module base with specific personality added into the module. The core functionality for a disk module requires the module to be initialised with at least one lower level module.

Figure 5.5 shows a typical disk module. The module has a single container that has the core code and the personality code mapped into it. It also has a stable page and local data. The stable page is managed by the kernel, and the local data by the module itself.

A disk module is constructed as follows:

1. A container is created, with itself as manager, and the core disk module routines mapped into the container.

2. To give the new module space to store its instance-specific volatile data, some permanent storage is required. This is provided by supplying the capability for a module's interface. The provided module becomes one of the parents of the created module.

3. The personality portion of the module is mapped into the container, and the initialise routine called. This sets up the data area for the module, initialises the backing store the disk module will use, and initialises the module's kernel stabilised page.

4. Any additional disks, and other information (such as encryption key, partition table etc) are added to the module.

5. The capability for the constructed module is now ready to be used by other modules or applications for access to the disk.

The core disk module provides the routines for management of the internal memory of the module, and the invoke entry point. The last page in the module core is obscured by the first page of the personality portion of the module. The obscured portion provides default routines that allow the core module to operate before the personality is provided.

The stable page of a module holds the addresses of the root blocks of the module local data. The stable page also holds the index of the capability for the first disk, and a stack used by the low level code.


Figure 5.5. Disk Module (a single container holding the stable page, core code, local data, and personality code)

Additional information may be required for an individual disk module. This additional information includes capabilities for any additional parent disks, and driver specific information such as the number and size of partitions, any encryption keys, etc.

5.6.3 Grasshopper Disk Module Interfaces

Some of the design decisions of Grasshopper were constrained due to the nature of the system. For a persistent system there must be some form of stabilisation to provide resilience. Since we do not wish to dictate policy in Grasshopper the stabilisation policy has to be external to the kernel. This in turn dictates that each driver module has to be responsible for its own stabilisation, and as such that each module must be able to recover from a crash. This is achieved by having each driver module be a self-stable entity, so that it is able to recover and restore itself to its last stable state after a crash.

Also, the lack of a file system is an advantage in the Grasshopper design, as the lack of names (capabilities being anonymous objects) means that it is easy to hide the identity of lower layers; each layer only knows that the entity referenced by the capability is a driver module which will respond to a fixed set of commands, and so only depends on these commands. Constructor programs may utilise each module's additional interfaces to set parameters, etc.

Since only containers' pages are stored on disk, and containers are only accessed on a page by page basis, we know that all disk traffic is for page sized chunks of data.

Knowing this we can tailor our disk driver module's interface accordingly. To avoid copying of the data between modules we only have to pass the physical page address for reads and writes between modules.

For security of physical pages we use the capability system. Each physical page is protected by a capability, and the address of a physical page is passed between modules using a capability. This information is contained in the Page Set abstraction provided by the Grasshopper kernel. This allows the kernel to verify access to physical pages when the kernel driver is finally invoked. Attempts to access pages without the correct capability will be rejected, allowing physical pages to be secure.

5.6.4 Function Definitions in Grasshopper

We use the following types in the description of the operations. A Capref is a local data structure that references a capability. A Block_offset is a value that holds the address of a disk block. It is a 64 bit value in the current implementation. A Page_index is the offset in a Page Set. A page in physical memory is uniquely defined by a Capref to a Page Set and a Page_index into the Page Set. Status is the returned value from all Grasshopper operations. It defines success and failure, and indicates the type of any failure.

Operations:

Status read_block(Capref drive, Block_offset block, Capref dest_pageset, Page_index index)
Status write_block(Capref drive, Block_offset block, Capref source_pageset, Page_index index)
Status flush(Capref drive)
Status initialise(Capref drive)
Status reset(Capref drive)
Status format(Capref drive)
Status eject(Capref drive)
Status retrieve_layout(Capref drive, Capref dest_pageset, Page_index index)
Status release_block(Capref drive, Block_offset block)
Status pre_allocate_block(Capref drive, Block_offset block)

The read and write operations access the block on drive drive and offset block, and move the data to/from the physical page in the Page Set dest/source_pageset at offset index.
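
To illustrate how a client might drive this interface, the fragment below copies one disk block through a physical page using the read_block and write_block operations defined above. The concrete renderings of Capref, Block_offset, Page_index and Status, and the STATUS_OK constant, are assumptions made only so the sketch is self contained; the real definitions belong to the Grasshopper kernel.

#include <stdint.h>

/* Hypothetical concrete renderings of the Grasshopper types, for illustration
 * only; the real definitions live in the Grasshopper kernel headers. */
typedef struct capref { uint64_t slot; } Capref;
typedef uint64_t Block_offset;
typedef uint64_t Page_index;
typedef int      Status;
#define STATUS_OK 0

/* Core driver operations, as defined above (declarations only). */
extern Status read_block (Capref drive, Block_offset block,
                          Capref dest_pageset, Page_index index);
extern Status write_block(Capref drive, Block_offset block,
                          Capref source_pageset, Page_index index);

/* Copy one disk block through a physical page: read it from the drive into
 * page `index' of the page set, then write it back to a second location. */
Status copy_block(Capref drive, Capref pageset, Page_index index,
                  Block_offset from, Block_offset to)
{
    Status s = read_block(drive, from, pageset, index);
    if (s != STATUS_OK)
        return s;                 /* e.g. bad capability or unreadable block */
    return write_block(drive, to, pageset, index);
}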


The flush, initialise, reset, format, and eject operations operate on the drive specified. If the drive specified refers to multiple physical drives, perhaps through use of a spanning or striping driver, then the operation refers to all the physical drives.

The release_block and pre_allocate_block functions will release/allocate the block specified on the drive specified.

The retrieve_layout operation is unusual in that it is the only core operation that returns data that is not a page in length. The data is stored in the specified page, but may not fill the entire page.

5.7 Worked Example of Setting up and Using a Driver Stack

We shall work through the procedures used to create a module stack, and detail the events that occur when a manager performs disk operations. Figure 5.6 shows the desired module layout. A new disk, D1, has been added to the system. The disk is to be divided into some number of partitions, one of which is to be used by an allocating module, which is in turn to be used by a container manager.

Figure 5.6. Example Disk Stack (a manager using a logical disk from an allocating module, which in turn uses a partition from a partitioning module layered on disk D1)

5.7.1 Creating the Stack

When D1 has been added to a Grasshopper system the kernel is informed of its existence through a device_mount call. This creates a kernel driver for the disk, and returns a capability for that driver.

Once the capability is available it may be passed to a manager, or to a module. If a manager was to have exclusive access to a disk it would be given the capability directly. More usually, the disk is partitioned to allow multiple managers to store data on the drive. This is performed by layering additional modules onto the kernel interface. In the example we will layer two modules.

The partitioning module is created, and the capability for the kernel interface is provided to the module. The module is also provided with the desired partition information. To calculate the partition information, the geometry of the underlying drive must be known. This is obtained by performing a retrieve_layout invoke on the parent interface, in this case the kernel interface.

A capability for a partition provided by the partitioning module is used as the parent for the allocating module. The capability is supplied during the creation of the allocating module. The allocating module stores its private information in the first blocks of the partition supplied by the partitioning module. The allocating module may be called a variable number of times to create disk interfaces that can be used by other modules. In the example the allocating module provides a capability for a pseudo disk which will be used by the manager as the disk storage.

Finally, the manager is created and is passed the capability returned from the allocating driver. The manager uses the capability to perform all disk accesses. When the manager performs a disk access the allocating module is passed the request. The allocating module converts the disk address to a disk address in the partition returned from the partitioning module, and calls the partitioning module. The partitioning module also performs a mapping on the disk address, and calculates the real disk address. This is passed to the kernel which performs the operation on the actual disk block.
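
The translation a request undergoes on its way down this stack can be pictured as the composition of two simple mappings, as in the hedged sketch below. The state structures, offsets and helper names are invented for the illustration; in the running system each module forwards the translated request to its parent through the capability interface rather than calling a local function, and the allocating module looks blocks up in its allocation structures rather than applying a fixed offset.

#include <stdint.h>

/* Illustrative state for the two modules in the example stack.  In the real
 * system this information lives inside each module's container. */
typedef struct {
    uint64_t partition_start;    /* first physical block of the chosen partition */
} partitioning_state;

typedef struct {
    uint64_t reserved_blocks;    /* blocks at the start of the partition holding
                                    the allocating module's private information  */
} allocating_state;

/* Allocating module: logical block -> block within its parent partition. */
static uint64_t allocating_map(const allocating_state *a, uint64_t logical)
{
    return a->reserved_blocks + logical;
}

/* Partitioning module: partition block -> physical block on disk D1. */
static uint64_t partitioning_map(const partitioning_state *p, uint64_t block)
{
    return p->partition_start + block;
}

/* The manager's disk address, as eventually seen by the kernel driver for D1. */
uint64_t translate(const allocating_state *a, const partitioning_state *p,
                   uint64_t manager_block)
{
    return partitioning_map(p, allocating_map(a, manager_block));
}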

5.8 Summary

In this chapter we have described a new, modular approach to the construction of disk drivers. Instead of having one monolithic driver within the kernel there is a layered collection of small, fast, disk modules, each performing a simple function. These modules may be connected into a graph structure that allows any conceivable organisation of disk layouts to be simply and efficiently managed.


This model has been implemented in the Grasshopper Operating System. The Grasshopper system allows easy construction of the individual modules through the use of the mapping facility. The capability system allows efficient control over access to pages, and enables the details of lower levels to be hidden.


6 Data Movement in a Multiple Node System

In the previous Chapter we were concerned with data movement on a single node. Now we concentrate on data movement on a network of cooperating machines. On a single node, data movement was between stable or backing storage and mutable storage. On a distributed system there is now an additional layer in the data movement hierarchy. Data can be supplied from either stable storage or from another node across the network.

6.1 Remote Data Sharing

In a system with more than one node it is desirable for data to be shared between nodes. Data can be shared either by value, or by reference. Systems that allow Remote Procedure Calls (RPC) perform distribution of data by value. Sharing of data by reference is possible using Distributed Shared Memory (DSM).

A system supporting DSM provides location independence of data. When an operation is performed on some data item the data item is transparently transferred across the network to the local memory of the machine, and the operation performed. This automatic migration allows a program to manipulate any data in a location independent manner and ensures the computing environment appears identical on every machine.

A system supporting RPC provides the ability to call routines on remote nodes transparently. Data is transferred across the network by the underlying call mechanism. To both the calling and called routines it appears as though a local procedure call has occurred. This environment allows a program to be constructed with its individual modules operating on the most advantageous node.

The Grasshopper system is designed to allow distribution using both RPC and DSM techniques. The Grasshopper kernel provides support for DSM by allowing the movement of pages between nodes using managers, and support for RPC through remote invocation.

6.1.1 Distributed Shared Memory in Grasshopper

In some distributed systems, for example Sprite [48] and Monads [22], the support for distribution is maintained entirely within the kernel. Any request for data is passed to the local kernel which may communicate with remote kernels to provide that data. In contrast, the Network File System [40] allows a server to exist outside the kernel on a host. This allows user level servers to provide data to remote nodes, but requests for data still pass through the client's local kernel.

In all of these systems, the communication between the local process and the remote server is mediated by the local kernel. The kernel issues the request to the remote server for the data, and then presents the data to the process. In Grasshopper the kernel mediates communication between the different nodes, but does not actively perform the distribution.

Distribution in Grasshopper is performed by user level managers in cooperation with the kernel. The Grasshopper kernel provides low level support for DSM by providing a mechanism to move pages between machines. The decisions as to which pages are moved, and where they are moved to and from, are controlled by user level code.

Each container in Grasshopper has a home node. A manager for the container is required to exist on the home node, and is ultimately responsible for the container. We call the manager on the home node the supervisor manager. If there are multiple instances of the container on different nodes, additional subsidiary managers can be created by the supervisor manager. Each instance of the container in the network will have a manager responsible for that instance. By default the supervisor manager is responsible, but it may elect to have a subsidiary manager handle actions for that instance.

When a page fault occurs on a node, the local kernel checks to see if a subsidiary manager has been associated with the local instance of the faulting container. If no subsidiary manager has been associated, the local kernel passes the fault to the kernel on the container's home node, and the fault is passed to the supervisor manager. The supervisor manager is informed which container faulted, which page was missing, and the node on which the fault occurred. The supervisor manager is responsible for either supplying the data to the kernel on the faulting node, or for identifying a manager that will be associated with the faulting container instance as its subsidiary manager. The identified manager will henceforth be asked to supply data for the container instance.

If the supervisor manager elects to provide the data, it obtains the faulted page, and requests the kernel to insert the page into the physical memory on the remote node and set up the remote page tables. The manager then informs the local kernel that the fault has been serviced. The local kernel in turn informs the remote kernel that the faulting locus may continue execution.

If the container is to be accessed frequently, the supervisor manager may elect to create a subsidiary manager on the faulting node, and inform the remote kernel that the new manager will handle page faults for the container. The faulting kernel will retry the fault using the subsidiary manager. The subsidiary manager will interact with any other subsidiary managers, and with the supervisor manager, to obtain pages for the container.
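
The decision made by the supervisor manager on each forwarded fault can be summarised by the sketch below. The kernel operations it calls (supply_page, register_subsidiary and so on) are hypothetical names standing in for the corresponding Grasshopper mechanisms, so this shows only the control flow, not real manager code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handles and kernel operations; the names stand in for the
 * corresponding Grasshopper mechanisms and are not the real interface. */
typedef uint64_t node_id;
typedef uint64_t page_no;
typedef struct container container;

extern void *fetch_page(container *c, page_no page);              /* obtain the page data   */
extern void  supply_page(node_id node, container *c, page_no page, void *data);
extern void  register_subsidiary(node_id node, container *c);     /* delegate future faults */
extern void  fault_serviced(node_id node, container *c, page_no page);
extern bool  frequently_accessed(container *c, node_id node);     /* manager's own policy   */

/* Called when the home-node kernel forwards a remote page fault. */
void supervisor_fault(container *c, page_no page, node_id faulting_node)
{
    if (frequently_accessed(c, faulting_node)) {
        /* Create or identify a subsidiary manager on the faulting node; that
         * node's kernel will retry the fault against the new manager. */
        register_subsidiary(faulting_node, c);
    } else {
        /* Service the fault directly: insert the page into the remote node's
         * physical memory and let the faulting locus continue. */
        void *data = fetch_page(c, page);
        supply_page(faulting_node, c, page, data);
        fault_serviced(faulting_node, c, page);
    }
}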

Each node on which the container is accessed may have a subsidiary manager. The choice of whether a subsidiary manager exists on a node or not is at the discretion of the supervisor manager.

The protocol used between the supervisor and subsidiary managers is not defined by the Grasshopper system. The only requirement is that the managers provide a consistent and resilient view of the container.

6.1.2 Remote Invocation in Grasshopper

Invocation between containers is a fundamental abstraction in the Grasshopper system. In a distributed Grasshopper system, invocation is permitted between containers on different nodes, allowing a locus to move from node to node within the network. When an invocation of a remote container occurs, either the locus can move to the home node of the container, or the container can move to the current node of the locus, or both may move to a third node. The decision as to which should move is beyond the judgement of the kernel, as it requires an understanding of the future access patterns of the locus.

When a locus invokes another container it is allowed to pass state information. The state information that is passed with an invoke call is the formal invoke parameters, either in the machine registers, or in the invoke data block, and the set of capabilities that the locus holds.

The parameters that are passed with an invoke can contain arbitrary data, and be arbitrary in size. This allows a second, independent method of moving data between two nodes on a Grasshopper system. This method is used to move data between containers on the same node, or between containers on different nodes. This method is not useful for moving data between instances of the same container, as this must already be provided by the underlying manager semantics.

Remote invocation is used as the fundamental method of moving capabilities between containers and between nodes on the network. A locus carries its capabilities with it when it invokes a new container. The code in that container may save a copy of some, or all, of those capabilities, and may insert new capabilities into the locus before the locus returns. Since, for security reasons, capabilities have a separate name space, the movement of capabilities with loci is the only method for moving capabilities. The movement of capabilities is not possible through the manager maintained DSM methods available with Grasshopper.

This movement of capabilities with loci enables servers to perform their functions. The requesting locus must provide any required capabilities either implicitly, through the invoke protection mechanism, i.e. show that it has sufficient permissions by being able to perform the invoke itself, or explicitly, by carrying a capability into the server. The server will modify its state and may, optionally, return one or more additional capabilities by inserting them into the capability list of the locus.

6.2 Naming

Every entity accessible in a system must be uniquely identifiable. In Grasshopper the capability is the user level abstraction used to identify entities. The kernel is responsible for taking a capability and evaluating to which entity the capability refers. This translation process is transparent to users. In a single node Grasshopper system, the translation is avoided by using direct pointers to persistent kernel level data structures as the entity names. However, this simple approach is insufficient in a distributed environment.

6.2.1 Naming on a Single Node

The data within a single node system is totally self contained. All kernel addresses will refer to the same area of memory. This allows a single node system to name entities directly, using the address of the corresponding kernel structure. This provides signi®cant advantages over other naming schemes, as there is no lookup or indirection required.

Since the kernel data area for each kernel is a persistent area of memory, with the persistence handled directly by the kernel, the use of the virtual address as a name is guaranteed to be correct over reboots. The reuse of kernel memory is only allowed when all references to a memory area have been removed. This is achieved in the common case through the use of reference counts.

Internally, the Grasshopper kernel uses this direct naming scheme on each node. Kernel data structures are named using pointers to the actual structure. However, this does not allow distribution as Grasshopper does not, for scalability reasons, provide a distributed kernel address space. Each kernel has a unique, separate kernel address space.

This separate kernel name space results in the situation where the name of an entity on one node is meaningless on any other node. The naming mechanism must be extended to allow an entity to be named in a global manner if Grasshopper is to support a distributed system.

6.2.2 Name Generation in a Distributed System

There are two possible approaches to naming across machines: either the single node name is extended to be valid on all nodes, or a name has some structure that firstly identifies a node, and then identifies some feature of the node. The first is a form of global kernel shared memory, and the second is a composite name space.

6.2.2.1 Global Name Spaces

Global kernel shared memory would require that all kernels reside in the one notional address space. Each kernel would be able to access all of this kernel virtual address space, and an address would reference the same object on any node. This would allow the addresses that are usable on a single system to be available on any system in the network.

Sharing the kernel address space amongst all nodes utilises the SASOS paradigm discussed in Chapter 3. In this paradigm the one address space is visible on all machines, with portions of the address space being controlled by various machines. Two main methods of name generation for SASOS systems exist:

• A central server issuing names on demand. The server may issue either a single name or a range of names in response to each request.

• A partitioned name space with each node being assigned a fixed portion of the name space at creation time.

The central server naming scheme is used in the Opal system [6, 7], with the central server issuing portions of a global name space to nodes on demand. This scheme has an inherent scalability problem. As more and more nodes are added to a network, the contention for the server will increase the latency of responses. If the server attempts to reduce this latency by allocating larger and larger portions of the address space, the scheme soon degenerates into a partitioning of the address space, without the advantages that partitioning provides.

The Monads [22] and Mungi [21] systems use a partitioned name space, with the partitioning being visible at the user level. In these systems, some number of bits in the virtual address of each object are used as a node identifier. Both these systems allow movement of objects through the use of hints on the local node. To locate an object in Monads, the node address is extracted from the address of the object, and a message is sent to the node. The local node may have a hint table that redirects certain addresses to other nodes if the object has moved. If the object cannot be located, the Monads system resorts to broadcast to resolve the request.
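
As an illustration of such a partitioned address, the fragment below splits a 64 bit virtual address into a node identifier and a local offset. The 16/48 bit split is an arbitrary choice for the example and is not the layout used by either Monads or Mungi.

#include <stdint.h>

#define NODE_BITS  16u                      /* illustrative split, not Monads' or Mungi's */
#define LOCAL_BITS (64u - NODE_BITS)

/* Extract the node identifier encoded in the top bits of a global address. */
static inline uint64_t address_node(uint64_t addr)
{
    return addr >> LOCAL_BITS;
}

/* Extract the offset local to that node. */
static inline uint64_t address_local(uint64_t addr)
{
    return addr & ((UINT64_C(1) << LOCAL_BITS) - 1);
}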

In Mungi, address resolution is performed by first looking at the hint table. If there is no entry in the hint table for the required address, the hint table is examined for nearby addresses. The kernel uses the heuristic that the node that holds the desired address may also hold nearby addresses. Any node that holds a nearby address is requested for the desired address. If no nodes with nearby addresses are listed in the hint table, or none have the requested address, a global broadcast is used to resolve the address.

The algorithm used in Mungi does not reference the node specified by the node bits in the address. The node that is identified in the address bits is only responsible for creating and deleting storage in that range; it does not keep track of the current location of objects it has created.

The use of a fixed number of bits in these systems limits the expandability of the system. The number of nodes is fixed because of the limited size of the address of a node. These systems also have a flat name space for the nodes in the distributed system, in that a node is only identified by a fixed length bit string.

6.2.2.2 Composite Name Spaces

In a composite name space a name is composed of two, or more, parts. One part specifies a first level name space, and the other parts are used to discriminate within the specified name space.

A composite name space uses two or more individual names to construct a global name. The individual names can be unique in themselves, or only unique in combination. Node partitioning is a form of composite naming, with the address being composed of a node address and a local address on that node, although most systems using this form of naming do not describe it as such.

Grasshopper implements a composite name system; each entity name comprises a unique node name, and a local kernel virtual memory address. The unique node name is the network address of the home node of the entity, and the virtual memory address is the kernel address of the kernel data structure of the entity.

Recall that the naming mechanism in Grasshopper is the capability. Recall also that the capability has, as one of its fields, the name of the entity addressed. It is this name stored in the capability that has to provide a globally unique mechanism for determining the entity addressed. Also, since the capability is used often, and is the only method of addressing an entity, the mapping between the name field and the entity must be efficient.

A Grasshopper node, when presented with a capability, must be able to identify and address the entity referenced quickly. Since capabilities move between nodes as loci move, the name incorporated in the capability must be location independent. We shall see that the naming scheme used in Grasshopper allows this efficiency, location independence, and the global uniqueness we require.

6.3 Naming Nodes

Nodes in a network need a unique descriptor to enable them to be addressed. For example, every device on an ethernet has a unique 48 bit address. These addresses are difficult for people to use and are generally not very useful, especially when a node is connected to more than one network.

For this reason there are usually additional layers of naming placed on top of the hardware naming. Each node is given a name in some administration domain. The given name must be unique amongst the siblings of the node. When a node tries to communicate with a node in another domain it must be able to name the remote node.

The additional layering of names requires the total name space to be partitioned. The partitioning may be done on either a fixed length or variable length name. Fixed length names are commonly used when the name is to be encoded in some other structure, as it makes the length of structures constant. The node name encoded into both the Monads and Mungi systems is fixed length. The Internet uses fixed length names which are contained in a 32 bit structure [51].

The fundamental name of a host in the Internet is a 32 bit Internet Protocol (IP) address comprising a tuple containing a network number and a local address. The network number is used to route a message to the correct network. The local address is used to find the appropriate node on the network.

The Internet uses a flat form of addressing. In the entire Internet there are only first level networks, each composed of a number of hosts. This has the problem that it is hard to route between networks, as the address of a network contains no information concerning the route to that network. Instead each node maintains a routing table which lists the route to all other networks, or a default route if the destination network is not listed.

There are fundamental limitations with flat, or fixed length, addressing, and as a result the Internet is currently suffering from two significant problems: exhaustion of network addresses, and an explosion in the tables used to route messages [18]. These problems arise from the fixed length of the IP address, and the lack of structure in the names.

As an Internet address is divided into the network and host portions there is a much smaller, and limited, number of network numbers available. When the Internet was designed these limits appeared virtually inexhaustible, but exhaustion of the network name space appears to be a significant reality.

As the number of network numbers allocated increases, and since there is no structure to the positioning of these networks, each node has to hold the routes to more and more networks. In particular, nodes along main backbones have to hold routes to most of the networks allocated. This increase in the number of routed networks requires larger and larger routing tables, with a corresponding slowdown in searching times.

Variable length names are commonly used when some structure, or other information, is to be encoded into the name. The MHSnet naming scheme [30] uses variable length names, and partially encodes the route to the destination into the name.

This naming structure is popularly called Domain Naming. Domain naming allows a network to be recursively composed of smaller networks. Each network is considered as a tree structure, with each branch of the tree being itself a tree of networks. The leaves of the tree are the actual nodes in the network. The name of a node is the concatenation of the node's name with the name for each domain of which the node is a member.

In domain naming, each sub-tree is considered a separate domain, and has a name assigned to it. The entire network may also have an associated name. MHSnet does not associate a name with the entire network, whilst Grasshopper gives the entire network a name.

6.4 Grasshopper Node Names

Figure 6.1 shows a simple Grasshopper network. In a Grasshopper network the logical organisation is a strict tree structure. A parent node is responsible for routing all messages between its children and the rest of the world.

Figure 6.1. Domain Naming in Grasshopper (the root node N0 named GHnet, with sub-networks net1 rooted at N1 and net2 rooted at N2, and leaf nodes N3 to N9)

It is important to distinguish between the logical network and the physical network since, in Grasshopper, the physical implementation of a network is independent of the logical layout. A typical physical implementation of the logical network in Figure 6.1 is shown in Figure 6.2, in which the sub-networks are implemented using a broadcast network, and direct connections are set up from the parent nodes of the two sub-networks to the root node. Of course, performance may be affected by poor choices: having some logical neighbours at a distance physically would reduce the efficiency of the network.

On this tree structure Grasshopper implements a domain style of naming. In a Grasshopper network a node is given a local name, unique amongst its siblings. A sub-tree is also given a unique name amongst the sub-tree's siblings. This naming is recursively composed into larger tree structures, until the entire network is named. Each node in a Grasshopper network has a full domain name, generated from the concatenation of the node's local name with the name for each of the sub-trees in which it belongs.

Figure 6.2. The Physical Layout of a Simple Network (N1 and N2 each serve a broadcast sub-network of leaf nodes, with direct connections from N1 and N2 to the root node N0)

Using this scheme, the entire Grasshopper network shown in Figure 6.2 is called GHnet, and the two subnets are net1 and net2. We adopt the naming convention of the domain style of names, where names are composed by concatenating a name with its domain, separated by a point. The name of the root of the entire network is GHnet. The roots of each of the two sub-trees are called net1.GHnet and net2.GHnet. The node labelled N3 is called N3.net1.GHnet while the node labelled N8 is called N8.net2.GHnet.
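
A name of this form is simply the dot separated concatenation of the labels on the path from the node to the root, as the small illustrative fragment below shows; the buffer handling and the function name are invented for the example.

#include <stdio.h>
#include <string.h>

/* Compose a full domain name from a node's local name and the names of the
 * enclosing sub-trees, innermost first, e.g.
 *   {"N8", "net2", "GHnet"}  ->  "N8.net2.GHnet"
 * The buffer handling here is illustrative, not Grasshopper code; outlen is
 * assumed to be at least 1. */
void compose_name(char *out, size_t outlen, const char *const *labels, size_t count)
{
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (i > 0)
            strncat(out, ".", outlen - strlen(out) - 1);
        strncat(out, labels[i], outlen - strlen(out) - 1);
    }
}

int main(void)
{
    const char *labels[] = { "N8", "net2", "GHnet" };
    char name[64];
    compose_name(name, sizeof name, labels, 3);
    printf("%s\n", name);      /* prints N8.net2.GHnet */
    return 0;
}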

6.5 Types of Network Structures

Grasshopper supports any network structure that allows messages to pass from one node to another node. The two main types of network structures are broadcast networks, such as ethernet and token ring, and point to point networks. Each of these networks has different restrictions and advantages [39].

Broadcast networks allow a single message to be sent to multiple nodes at once. On an ethernet this is a fundamental action of the network. A node will receive all packets, and the network hardware is responsible for filtering the messages that are destined for this node.

A token ring is similar, in that broadcast messages are possible, but differs in implementation. A message travels around a ring until it reaches either the sending node, or the destination node, where it is removed from the ring. A broadcast message will be received by each node around the ring, but only the sender will actually remove the message. Normally a message sent from one node to another is removed by the receiving node, and another message can be sent in the gap thus created.

Both broadcast systems allow easy one-to-one and one-to-many transmission of messages. In contrast, a network of point-to-point links does not allow broadcast as a fundamental operation. Since there are only two nodes on each segment, any message destined for multiple nodes must be passed through multiple nodes.

6.6 Routing to an Entity in Grasshopper

Each Grasshopper kernel has a unique kernel address space, since, as discussed earlier, the provision of a distributed kernel address space would seriously limit scalability. Consequently, some other naming mechanism is required by the kernel to identify the entities supported by Grasshopper. These names are invisible to the user as the user level naming requirements of Grasshopper entities are satis®ed by the capability system.

The kernel associates the global name of an entity with each capability. The global name is composed of the network name of the creating node, and the virtual address of the kernel data structure that represents the entity.

To pass a message to an entity, or more commonly, to the kernel that is maintaining the entity, the local kernel inspects the global name of the entity. The name is decomposed into the kernel virtual address, and the network node address. The local kernel can then issue a message with the network destination as the network portion of the entity's global name.
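
The global name can therefore be pictured as a pair, as in the sketch below. The structure layout and field names are assumptions made for the illustration; the point is only that routing uses the node part, while the kernel address part is interpreted solely by the home node.

#include <stdint.h>

/* Illustrative representation of a global entity name: the network (domain)
 * name of the home node plus the kernel virtual address of the entity's data
 * structure on that node.  The field names are assumptions for this sketch. */
typedef struct {
    char     node[64];        /* e.g. "N8.net2.GHnet"                         */
    uint64_t kernel_address;  /* address of the kernel structure on that node */
} global_name;

/* Routing a kernel-to-kernel message for an entity simply means addressing
 * the message to the node part of the name; the kernel_address field is only
 * interpreted once the message arrives at that node. */
const char *message_destination(const global_name *name)
{
    return name->node;
}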

A message in the Grasshopper network has the route to the destination encoded into the destination's address. For example, in Figure 6.1 the destination of a message from node N3 to node N8 would be N8.net2.GHnet. Figure 6.3 outlines the basic steps in the Grasshopper routing algorithm. This outline will be later expanded to detail the full routing algorithm. Here we are only concerned with sending a message from one node to another node that is explicitly identi®ed.

At each node a message can be: destined for the node, passed to the parent, passed to a child, or declared invalid. For example, in the case of the message from N3 to N8, the message is first passed to N3's parent, net1.GHnet. As net1.GHnet is not the destination, not a parent of the destination and not the root of the network, net1.GHnet in turn passes the message to its parent, the root of the network, GHnet. GHnet determines it is a valid parent of the destination, and so searches each of its children for the next hop to the destination. GHnet therefore passes the message to net2.GHnet which in turn passes it to N8.net2.GHnet as the final destination.

if destination = current_node then
    deliver locally
else if ancestor(current_node, destination) then
    for each child of current_node do
        if child = destination or ancestor(child, destination) then
            pass to child
            break
    if no child was selected then
        illegal address
else if root_of_network then
    illegal address
else
    pass to parent

Figure 6.3. Network Routing Algorithm Outline
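
The outline in Figure 6.3 translates directly into code. The fragment below is one such transcription in C; the node representation (parent pointers and child arrays) and the ancestor test are assumptions made only so that the algorithm can be expressed compactly, and are not the kernel's actual data structures.

#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    struct node  *parent;        /* NULL at the root of the network */
    struct node **children;
    size_t        child_count;
} node;

/* True if a is a proper ancestor of b in the logical tree. */
static bool ancestor(const node *a, const node *b)
{
    for (const node *n = b->parent; n != NULL; n = n->parent)
        if (n == a)
            return true;
    return false;
}

typedef enum { DELIVER_LOCALLY, PASS_TO, ILLEGAL_ADDRESS } route_action;

/* One routing step at `current', for a message addressed to `destination'.
 * On PASS_TO, *next is the node the message should be forwarded to. */
route_action route_step(node *current, node *destination, node **next)
{
    if (destination == current)
        return DELIVER_LOCALLY;
    if (ancestor(current, destination)) {
        for (size_t i = 0; i < current->child_count; i++) {
            node *child = current->children[i];
            if (child == destination || ancestor(child, destination)) {
                *next = child;
                return PASS_TO;
            }
        }
        return ILLEGAL_ADDRESS;            /* no child leads to the destination */
    }
    if (current->parent == NULL)           /* root of the network */
        return ILLEGAL_ADDRESS;
    *next = current->parent;
    return PASS_TO;
}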

There is a special node name: the empty name. This is used to identify a capability that refers to the local node, irrespective of the actual physical node. This name is used for capabilities that refer to operations that can be performed by any kernel, such as kernel-based operations on loci and containers, including map or unmap requests.

This naming scheme breaks down if an entity is not present on its creating node. If an entity is created on one node and then moves to another node whilst retaining its original name, the route encoded in its name still leads to the creating node, not to the entity's new location. We shall see later how the algorithm is adapted to allow entity movement and distribution.

6.7 Collapsing a Grasshopper Network

A Grasshopper network has many intermediate nodes. Only the leaf nodes actually perform computation. The intermediate nodes only pass messages around the network. Instead of having a dedicated machine for each of these nodes it is possible to collapse portions of the network. A node is selected from the leaf nodes in a sub-tree, and is set up so that it also performs routing for the entire sub-tree. This can be applied recursively until all nodes in the network are leaf nodes that also perform additional protocol processing.

Collapsing the network in this manner reduces the number of nodes that have to be maintained, and speeds up movement of messages up and down the network. The links between nodes that are handled by the same machine are implemented as internal queues, and no actual data movement is performed.

6.8 Summary

Data movement in a distributed system can be achieved using one of two fundamental paradigms; distributed shared memory, or remote procedure calls. Grasshopper supports both of these paradigms, distributed shared memory using managers to perform distribution, and remote procedure call through the invocation mechanism.

To allow distribution, entities must have globally unique names. These names must be resolvable from any node on the network. Grasshopper uses a domain style of naming to achieve this requirement. Every entity in Grasshopper is identified by capabilities. Each capability contains the full network identity of the referenced entity.

Grasshopper uses a tree structure for domain names. Each node in the network has a full network name as a leaf node in the network. Any entities that belong to that node have, as part of their global name, the name of the node.

To perform an operation on an entity, the system sends a kernel to kernel message. This message is routed using the algorithm in Figure 6.3. In brief, the routing algorithm sends the message towards the root of the network tree until it reaches the least common ancestor of the sending and destination nodes. From that node it is sent towards the leaf node that holds the entity.

To improve performance, and to reduce the complexity of the network, Grasshopper allows a single physical node to represent multiple logical nodes. The links between the nodes are represented as internal queues within the one node.


7 Distributed Causality

An important issue in the design of a networked system is the maintenance of causality. Causality is the ordering of events to ensure that they are observed to occur in one of the many possible consistent orderings which could have occurred. This was first described by Lamport [27] who defined a relation, happened before, denoted by "→". The happened before relation is defined such that two events, say a and b, satisfy a → b if

1. both a and b are caused by the same process and a occurs before b, or

2. a is the sending of a message, and b is the receipt of the same message by another process, or

3. there is an event c such that a → c and c → b.

We say two events are concurrent if neither a → b nor b → a holds.

Causal relationships are not maintained by most current distributed file systems, which allow out of order propagation of changes between files. For example, it is possible to change a networked file on one node, move to another node and observe that the file is still unchanged. The change occurs an indeterminate time later. This violation of causality is not a major problem in distributed file systems, as each file is a self contained entity, and it is unusual that multiple nodes are accessing a file simultaneously.

With persistent systems, causality is more important, as there is only one data space, namely the memory of the system. Since there can be arbitrary pointers in a persistent system it is important that the changes made are propagated in causal order. If a pointer is changed on a remote node before the data it refers to is updated, a program on the remote node may function incorrectly. This is especially important with garbage collectors, which depend on the integrity of the pointers to identify garbage correctly.

7.1 Causal Delivery of Messages

Messages in systems which preserve causality have to be delivered in causal order. Messages that are concurrent, i.e. that are not causally dependent, can be delivered in any order. If there are multiple paths in a network, two messages may take different paths between two nodes. Therefore, each message must include additional information to ensure its delivery in causal order. To date this has been achieved by two methods: appending a log of previous messages to each message, or attaching a timestamp to each message.

Not only must direct messages be delivered in causal order, i.e. if message a is sent from node n1 to n2 and then message b follows a from n1 to n2 then a must be delivered before b, but the transitive closure of messages must be delivered in causal order. An example, shown in Figure 7.1, of a transitive violation is if node n1 sends message a to node n3 and then message b to node n2. Node n2 then sends message c, which depends on the contents of b, to n3. If message a is not delivered to node n2 before c then a violation of causality occurs. Simple sequence numbering of messages will not ensure transitive causality.

Figure 7.1. Transitive Causality Violation (node n1 sends message a to n3 and then message b to n2; n2 then sends message c to n3, which may arrive before a)

The Isis system [3] supports causal message delivery in a broadcast network. The Isis system originally required each message to include a log of previous messages sent. This method does not scale due to the overheads of the additional information required for each message.

It is also possible to preserve node-to-node ordering by sending a timestamp vector with each message [16]. However, for transitive causality a matrix must be sent with each message [54]. The matrix allows the receiver to determine if any outstanding messages exist, and from which node the outstanding messages have been sent, enabling the receiver to request the outstanding messages.

The size of the matrix transmitted is proportional to the number of nodes participating in the interaction. It is possible to reduce the matrix to a single vector [38, 46]. However, since the vector cannot fully characterise the dependencies in the network, resynchronising messages must be transmitted before a node can be certain that a message can be delivered.

7.2 Causality in a Tree Structure

Transitive causality violations occur when messages can overtake each other. This can occur when there are multiple paths between two nodes, or when messages can overtake each other on a link or within a node. Instead of appending the additional causal vector or matrix it is possible to ensure causality by restricting the paths that a message can use. Two messages are constrained to follow the same path from the source to the destination, reducing the causality problem to one of ensuring that a sequence of messages stays in the same relative order. To ensure that a sequence of messages maintains relative order, each message is tagged with a sequence number. If a message is dropped by the network, the recipient will detect that a message is missing, and can delay the remaining messages until the missing message can be obtained from the sender.

This ability to request a retransmission requires the sender to keep a log of outgoing messages. To control the growth of this log, each message header contains the current received sequence number. This is the latest message for which all previous messages have been received. When a message is received this number can be examined, and any messages that have been received by the destination can be deleted from the log. If there is a large imbalance in the traffic flow, periodic messages can be sent to flush the log.
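
The per-link state this scheme implies can be sketched as follows. The header layout, the fixed size log and the field names are illustrative assumptions, not the Grasshopper wire format; sequence number wraparound is also ignored to keep the sketch short.

#include <stdint.h>

/* Per-message header on a point-to-point link (illustrative layout). */
typedef struct {
    uint32_t sequence;        /* sender's sequence number for this message     */
    uint32_t acked;           /* latest sequence for which the sender has seen
                                 every earlier message from its peer           */
} link_header;

#define LOG_SIZE 256          /* illustrative bound on unacknowledged messages */

/* Sender-side state for one link. */
typedef struct {
    uint32_t next_sequence;            /* sequence to stamp on the next message      */
    uint32_t oldest_unacked;           /* messages below this have been dropped      */
    void    *log[LOG_SIZE];            /* retained copies, indexed by seq % LOG_SIZE */
} link_state;

/* On receiving a header from the peer, discard log entries the peer has
 * acknowledged; they can never be asked for again. */
void prune_log(link_state *s, const link_header *h)
{
    while (s->oldest_unacked <= h->acked && s->oldest_unacked < s->next_sequence) {
        s->log[s->oldest_unacked % LOG_SIZE] = 0;   /* release the retained copy */
        s->oldest_unacked++;
    }
}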

The restriction of messages to the use of a single link reduces the causality information that is included with each message to a single sequence number. To implement this method of causal control all links must be point to point. Grasshopper adopts its tree structure for the logical organisation of the network in response to this requirement. Each processing node is logically connected to only one other node, a routing node. The processing nodes are the leaf nodes in the network, and the routing nodes are the internal nodes.

Grasshopper guarantees the causal ordering of messages, and removes the need for causality vectors and matrices, through the use of the tree structured logical network. Below we prove that each link preserves causality.

7.3 Proof of Causal Delivery

In this section we will apply formal methods to prove the correctness, in terms of maintenance of causality, of the algorithm described in the previous chapter∗. The basis of these formal methods for distributed computing is to describe the global state of the whole algorithm, and also to indicate how this state changes as actions occur. Assertional methods [63] then provide properties that are true in every state that occurs during every execution. These properties are called invariants, and are generally proved by induction along the execution sequence.

For any distributed algorithm, the global state consists of a collection of local states (one per node) together with a collection of link states (one per directional communication channel). For simplicity we encapsulate all the packet buffering, error detection, retransmission, etc., and group this with the true communication channel to provide an abstraction: a point-to-point, reliable, error-free, FIFO, communication mechanism². Once this abstraction is accepted, the state of a channel is just a queue of packets that are currently in transit. We denote this queue by the symbol S_pq, where p and q are the nodes at the source and terminus, respectively, of the channel. In the algorithm, as we model it, all packets have the format (e, f, m), where m denotes the contents of a message, e is the entity from which the message originated, and f denotes the entity to which the message should be directed.

∗ This section was co-authored by Dr. Alan Fekete, associate supervisor.
² Most formal methods are compositional, so that one can separately prove that this abstraction is provided, and that the higher layers of software work when given this abstraction.

Since all the packet buffering has been absorbed by the reliable channel abstraction, the only state at each node is the table that maps each entity name into a location in the tree, and in the simple algorithm of the previous chapter that information is constant and the same everywhere; we represent it by a function loc, so for any entity e, loc(e) is a node in the tree.

The initial state of the algorithm has all channels empty. The state changes through two sorts of events: the presentation to the system of a message, and the arrival at a node of a packet containing a message. For each of these we must show how the state after the event is determined.

When a new message is presented, with contents m and destination f, by an entity e, the system executes the following

let p denote loc(e); p is the node where this action occurs
if loc(f) == p then
    deliver m to the local entity f
else
    append (e, f, m) to the channel S_pr, where r = parent(p)

The event corresponding to the receipt of the packet (e, f, m) at node p can only occur provided (e, f, m) = head(S_qp) for some q. In that case the system executes the following

if loc(f) == p then
    deliver m to the local entity f
else if loc(f) is a descendant of some child p′ of p then
    append (e, f, m) to the channel S_pp′
else
    append (e, f, m) to the channel S_pr, where r = parent(p)

Having shown how the state changes, we can now provide the key invariants that lead to a proof that the algorithm respects causality. To understand the first invariant, we note that the tree structure of the network topology implies that the unique simple path between leaf nodes p and q consists of two pieces: a sequence of channels each going from one node to its parent, followed by a sequence of channels each going from a node to the unique child such that q is a descendant of that child. The junction between the two pieces occurs at the least common ancestor of p and q, which is the nearest ancestor of p which is also an ancestor of q.

Invariant 1: If m is a message that has³ been sent from e to f, but has not been delivered to f, then there exists a channel rs that lies on the unique simple path from loc(e) to loc(f), with (e, f, m) ∈ S_rs.

Proof: When m is first sent, it is either delivered immediately (in the case where the destination entity is at the same node as the sending entity), or else it is placed on the queue from loc(e) to its parent. As noted above, this channel lies on the path from loc(e) to loc(f). This establishes the truth of the Invariant at first.

In any subsequent step involving sending a message other than m, no change is made to the presence of packet (e, f, m) on channels. Thus in these steps the truth of the Invariant is unchanged.

In any step involving the arrival of a message (e, f, m) at a node p, we note that the Invariant is true in the state before the step, at which time (e, f, m) was on a channel leading to p. Thus p must be on the simple unique path from loc(e) to loc(f); this means that either p is an ancestor of loc(e) which is not an ancestor of loc(f), or else p is a direct ancestor of loc(f), or else p is loc(f) itself. Examining the way the global state changes during the step, we see that in the state after the step, either (e, f, m) ∈ S_pq where q = parent(p), or else (e, f, m) ∈ S_pq where q is a child of p and q is an ancestor of loc(f), or else m has been delivered to entity f. In every case the truth of the Invariant holds in the state after the step.

Finally we must consider any step involving the arrival of a different packet. As the state change does not alter the location of any packet except that arriving, the truth of the Invariant is not affected.

It is not needed for our proof, but as a matter of interest we mention the converse to this result: if (e, f, m) ∈ S_rs then m has been sent by e to f and not received, and also rs lies on the unique simple path from loc(e) to loc(f).

³ In following the formal method precisely, we should not refer to past events such as the sending of a message. To be pedantic we should introduce history variables which capture in the state the set of messages sent and received at each node. In the interest of clarity we have omitted this trivial transformation.


For the next invariant, we need to define precisely when one message is in front of another. Suppose (e, f, m) ∈ S_rs. Because the network forms a tree, removal of the connection between r and s would disconnect it. We denote by Front(r,s) the set of nodes connected to s by paths that do not go through r. We then say that m′ is in front of m if either m′ has been delivered to an entity located on a node in Front(r,s), or if (e′, f′, m′) ∈ S_pq where p ∈ Front(r,s), or if (e′, f′, m′) is ahead of (e, f, m) within the queue S_rs.

Invariant 2: If (e, f, m) ∈ S_rs, and m′ is in front of m, then m is not a causal predecessor² of m′.

Proof: When m is first sent, there is no message for which m is a causal predecessor, so the Invariant holds trivially. After a step when m′ is sent, we observe that the node where the step occurs is in Front(r,s), and so by the truth of the Invariant in the state before the step, we conclude that any message previously delivered to the sending entity cannot have m as causal predecessor. Also we see by Invariant 1 that e is not located at a node in Front(r,s). Thus there is no way to construct a chain of causality from m to m′, as such a chain would have to involve a message delivered to the sender of m′. This shows that the Invariant is true after the step. A step where a message other than m or m′ is sent does not alter the position of any packet containing m or m′, and so does not affect the truth of the Invariant.

Now we consider steps in which a packet is received. If (e, f, m) is received in a step, since afterwards (e, f, m) ∈ S_rs, then from the transition information the step must have been receipt of (e, f, m) at r, from the channel xr for some x. By Invariant 1 we see that x ≠ s. Thus Front(r,s) ⊆ Front(x,r). Therefore in the state before the step, (e, f, m) ∈ S_xr and m′ is in front of m (as the transition does not alter the position of m′ in the system). By the truth of the Invariant before the step, we see that m is not a causal predecessor of m′, as required for its truth after the step.

² Note that the Invariant does not claim that m is a causal successor of m′. The two messages may be causally concurrent.

If (e′, f′, m′) is received in a step, since afterwards m′ is in front of m, then from the transition, the step must have occurred at node x which is in Front(r,s). By the transition, we see that m ∈ S_rs also in the state before the step. In the state before the step, (e′, f′, m′) was at the head of S_yx for some y. If y ∈ Front(r,s) then m′ was in front of m in the state before the step, so the truth of the Invariant before the step shows that m is not a causal predecessor of m′. On the other hand, if y ∉ Front(r,s), then we must have y = r and x = s (recall that in a tree only one path connects any two nodes). In this case (e′, f′, m′) is also at the head of a queue in which (e, f, m) is present, so again m′ was in front of m in the state before the step, so m is not a causal predecessor of m′. Thus the Invariant holds after the step.

Finally, a step in which a packet is received which is neither (e, f, m) nor (e′, f′, m′) causes no change in the locations of either of these packets. So also in this step, the truth of the Invariant before the step implies its truth afterwards.

With these two invariants, we can easily show that message causality is respected in our algorithm. Suppose, for the sake of contradiction, that a message m′ is delivered at an entity f, and m is another message destined for f that has not yet been delivered, but is a causal predecessor of m′. By Invariant 1, (e, f, m) is an element of a queue S_rs on the path from loc(e) to loc(f). Thus loc(f) ∈ Front(r, s), and since m′ is delivered at loc(f) we have that m′ is in front of m. By Invariant 2, m is not a causal predecessor of m′, contradicting our assumption.

7.4 Network Causality Semantics

On a network of persistent systems the semantics of a single node have to be preserved when allowing network accesses to data. This role is analogous to that of Distributed File Systems in file-based systems. We need to preserve the atomic read and write semantics of memory, and causal consistency throughout the network. On a single node this is trivial to achieve, as normal memory gives us the desired semantics.

On a distributed system, causality requires updates to the data to occur in a causally consistent order. Any movement of data must ensure that inconsistencies are not created.


7.5 Causality in Grasshopper

The Grasshopper system preserves causality in data movement. Movement of data may occur in a number of ways. These include:

• A locus moving between containers.

• A page transferred between nodes.

• A map or unmap request.

• A container or locus deletion.

• The movement of an entity between nodes.

The relationships between these must be preserved both within a node and between nodes in a distributed Grasshopper system.

A locus invoking a container will contain causally dependent information in both the capabilities held by that locus, and state embodied within the parameters passed by the invoke call. This information must be causally consistent with the information held by, and reachable from, the invoked container.

If a page is transferred between two nodes, the information on that page may be dependent on other information, e.g. the state of loci, and information in pages from other containers that were accessed through mappings.

A map or unmap request can result in a causal dependency occurring to data in another container. As shown in Figure 7.2 container C1 has a second container, C2, mapped into it. A third container, C3, contains some state information that embodies this mapping. A locus, L1, removes the mapping of C2 into C1, and then changes the state information in C3. A second locus, L2, may read the changed state information, and act accordingly. If the second locus can directly reference the area of container C1 that was mapped, then any future references must be to container C1 not C2. A similar situation can occur during a map request.

Figure 7.2. Causal Dependency for Map/Unmap

A container deletion has a similar effect to that of the unmap described above. Deleting a locus can change the behaviour of system calls that reference that locus, and, as such, has causal considerations.

7.5.1 Container Consistency

Related to causality is consistency of distributed entities. Causality is required to allow consistency, but in itself does not guarantee a distributed entity will remain consistent. Grasshopper allows a container to exist on multiple nodes simultaneously. The causality protocol does not dictate how these container instances are maintained in a consistent state.

To achieve consistency between multiple simultaneous users of a data location, some form of protocol must be implemented to ensure updates and accesses occur in the correct order. In a uniprocessor system, this access control is performed implicitly. On a multiprocessor or distributed system the access control must be an explicit part of the system.

Messages between nodes on a network must be delivered in causal order to maintain the overall causal relationships in the system. Messages may be delivered out of causal order when there are multiple paths between two nodes. Figure 7.1 shows how two messages may become out of causal order by passing through intermediate nodes. Causal delivery can be implemented by the underlying transport mechanism, or through a causality layer added to the transport mechanism.

7.5.1.1 Manager Provided Consistency

The Grasshopper kernel provides low level support for distributed shared memory by providing a mechanism to move pages between machines. The decisions as to which pages are moved, and where they are moved to and from, are controlled entirely by managers. Figure 7.3 shows two instances of a container on the network. Each has a manager on the local node which handles any faults that occur.

Figure 7.3. Distributed Container Manager

Figure 7.4 shows the two container instances with the manager on only one node. Any page faults from either instance are sent to the manager, which resolves the faults.

Figure 7.4. Combined Container Manager

When a page fault occurs on a node, the kernel invokes the manager of that container. The manager is responsible for the provision of the required data; this may require cooperation with the other managers of the faulting container that exist elsewhere in the system.

The Grasshopper memory manager model allows relaxed consistency protocols between the nodes to be implemented. This may require interaction between the managers and the application. Since the managers are user level entities, and can be specialised for each application, this interaction is possible. The use of relaxed memory models allows greater parallelism than is possible using the sequential consistency model.

For each distributed container there must be one or more managers that are responsible for the data the container holds. Since the kernel-manager interface for the maintenance of container data is provided via invocation, managers need not reside on the same node as the container instance being managed. At one extreme, a single manager may manage distributed shared memory for all instances of the container in the system. Conversely, a manager for the container may be provided on every node. Any hybridisation of the two schemes may also be exploited.
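As a rough illustration of the fault-handling path just described, the following C sketch shows how a user-level manager might resolve a fault by cooperating with a peer manager and handing a page back to the kernel. The type and function names (gh_fault_t, manager_handle_fault, kernel_supply_page, peer_fetch_page) and the assumed page size are invented for this sketch and do not correspond to the actual Grasshopper interfaces.

/* Hypothetical sketch of a user-level container manager resolving a
 * page fault.  All names are illustrative, not the real Grasshopper API. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t container_id_t;
typedef uint64_t node_id_t;

typedef struct {
    container_id_t container;   /* container in which the fault occurred  */
    uint64_t       page;        /* faulting page number within container  */
    node_id_t      fault_node;  /* node on which the fault was raised     */
    int            write;       /* non-zero for a write fault             */
} gh_fault_t;

/* Supplied by the kernel: install a page of data for the faulting locus. */
extern int kernel_supply_page(const gh_fault_t *f, const void *data, size_t len);

/* Supplied by cooperating managers: fetch the current copy of a page. */
extern int peer_fetch_page(node_id_t peer, container_id_t c, uint64_t page,
                           void *buf, size_t len);

/* Invoked by the kernel when a fault occurs on a container this manager owns. */
int manager_handle_fault(const gh_fault_t *f)
{
    static char page_buf[8192];         /* assume an 8 KB page for the sketch */
    node_id_t owner = /* consult the manager's own metadata */ 0;

    if (peer_fetch_page(owner, f->container, f->page,
                        page_buf, sizeof page_buf) != 0)
        return -1;                      /* cooperation with peer manager failed */

    /* Hand the data back to the kernel, which maps it and resumes the locus. */
    return kernel_supply_page(f, page_buf, sizeof page_buf);
}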

7.6 Network Failures

Any network may suffer failure. The types of failure range from the loss of a single packet on a link to the permanent removal of one or more nodes from the network. The system must be able to cope with failure in some defined manner. Grasshopper can automatically recover from some types of failures. Other failures require manual intervention. Intervention is required when information that is unavailable to the system must be used to determine the correct course of action.

7.6.1 Packet Loss

On each link, Grasshopper keeps information concerning which messages have been received. Each message also includes sufficient information to enable a node to determine if a message has been lost. If messages are lost the receiver requests the sender to retransmit the missing messages, and delays acceptance of the remaining messages until the missing messages are retransmitted.


On links with very little traffic it is possible that the missing message may not be detected for a considerable time. This is not seen as a significant problem. Higher level protocols can be used, if necessary, to perform timeouts.

7.6.2 Partitioning

A network without suitable redundant links can be partitioned into two, or more, independent networks. The correct action upon partitioning depends upon the duration of the partition. If the partition is temporary the best action may be to wait until the partition is repaired, and then continue processing. If the partition is permanent any outstanding requests should be aborted. If the partition will be in effect for a significant period of time, operator intervention is required to determine which, if any, messages should be terminated, and which can wait until the repair occurs.

To the system, a partitioned link is indistinguishable from a link with a very high delay. Timeouts may be inappropriate as it is undecidable whether the original message was received. The problem of whether the message was received is commonly known as the Two Generals Problem, and is illustrated in Figure 7.5. No matter how many acknowledgement cycles are exchanged, it can never be decided that the last acknowledgement, or the original message if no acknowledgements are received, was received. The message may have been received just before the link was lost, or the message may have been in transit and so lost when the link was lost.

Figure 7.5. Two Generals Problem

7.6.3 Node Destruction

The permanent removal of a node from the network is identical to the permanent severance of the link to that node. The actions available in this instance are identical to those available for partitioning. The system will continue to attempt to contact the failed node until informed by external intervention.

7.7 Timeouts of Messages

The Grasshopper network guarantees causal delivery of messages. The low level algorithms must not fail to deliver due to a timeout. If a timeout on a link occurs the message is retransmitted. As we shall see, this retransmission allows external intervention, but, by default, failure of a link results in messages being delayed until the link is repaired.

Higher levels of software are responsible for the implementation of timeouts. The original messages will continue to wait until the situation that caused them to pause is rectified, at which time they will be delivered. The programmer who implements a timeout of this nature is expecting the possibility of a timeout, and understands its implications.


Some activities do not handle a timeout gracefully. The transfer of a page in a DSM system, for example, has no natural timeout semantics. It was decided that requiring all callers to handle timeouts on all messages was unacceptable.

7.8 Operator Intervention

To recover from otherwise unrecoverable failures Grasshopper allows external intervention. Each node has an internal table that contains the network name of the parent (if this node is not the root of the network), and of each child. This table is examined by the routing algorithm to determine how to route a message. The destination address of a message is checked against each entry in the table to determine if it is legal. The normal routing algorithm will fail a message with an illegal address if the lookup in the routing table fails.

This table of node names can be manually manipulated. If a partition is permanent the entry for the link affected can be deleted. This will cause any messages queued for transmission along that link to fail. Any new messages that would have been transmitted along the defunct link will also fail with illegal address.

7.9 Summary

Causality is an important consideration in all distributed systems. Most non-persistent systems ignore this consideration and allow causal inconsistencies to occur. This does not cause serious problems due to the coarse nature of the permanent entities in these systems, and the sparsity of links between these entities. Persistent systems, however, have fine-grained entities, and the links between entities may be both numerous and complex.

Correct delivery of messages in systems that respect causality requires the inclusion of additional causality information in each message. On general networks that allow the overtaking of messages this information is a square matrix, the order of which is the number of nodes in the network. This assumes that nodes do not move around the network.

Grasshopper cannot use this method of causality tracking, as the entities in Grasshopper can move. The amount of information required if entities are allowed to move is a square matrix whose order is the number of entities in the system, a considerably larger number.
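As a purely illustrative sense of scale (the counts below are examples, not figures taken from Grasshopper), the per-message overhead of matrix-style causality tracking grows quadratically in the number of tracked endpoints:

\[
  N \text{ nodes} \;\Rightarrow\; N^2 \text{ counters per message, e.g. } N = 100 \Rightarrow 10^4,
\]
\[
  E \text{ entities} \;\Rightarrow\; E^2 \text{ counters per message, e.g. } E = 10^5 \Rightarrow 10^{10}.
\]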

Grasshopper implements causality control by not allowing messages to pass each other, and guaranteeing this through the use of a tree structured network. The processing nodes are connected to only one point in the network; all others are routing nodes.

Failure in a system must be handled. Grasshopper responds to any failure by halting message flow until the failure is rectified. When the failure cannot be rectified the system allows operator intervention to delete the offending link.


8 Moving Entities

One of the basic tenets of Grasshopper is that entities can move. For example, a locus may move from one container to another container, either on the same node, or on a remote node. The node that an entity is currently on is called the current node. The node that an entity was created on is called the home node. For a container the current node is any node on which pages of the container are present. The current node for a locus is the node where the most recent register set is resident.

Grasshopper supports distributed entities. A distributed entity is an entity that exists at multiple nodes simultaneously. A container, the most common instance of a distributed entity, is considered distributed when multiple nodes simultaneously hold pages from the container.

The distribution of a container is demand driven. A container will become resident on a remote node when a page fault occurs on one of the container's pages. The page fault occurs when a locus on the remote node accesses the container, either through an invoke call, or through a mapped portion of the container.

A locus moves through invoke calls. On invoke a locus can:

• Stay on the same node, with a new host container.

• Move to a node on which the new host container is currently resident.

• Move to a node on which the new host container is not currently resident.

The decision as to where the locus moves is controlled by the invoke call.

The node on which a locus and its host container should reside is determined by various factors:

• whether the container is self-contained,

• whether the locus will access the container frequently,

• the load on the locus' current node, the load on the container's current node, and the load on any other reachable nodes.

Not only may entities move, but nodes or sub-networks can move. A Grasshopper node, or sub-network, may be detached from the network, and reattached at a different point in the network. This could happen if a portable machine is moved from one institution to another. For example, a visiting professor returns home with a portable machine. Continued functionality must be ensured in the new locale, albeit with slower response.

In these cases, messages must be delivered to the current node, which in the case of movement of loci and containers is different to the home node. The case of movement of nodes and sub-networks requires all messages for a node, or portion of the network, be redirected to another portion of the network.

8.1 Information to be Moved

For an entity to exist there must be some information about that entity stored in a kernel. At a minimum the home node kernel will store the entity's state information. If an entity has moved then both the home node and the current node must hold some state information.

When an entity moves between two nodes the required state information must be passed between the two nodes. The state information that is passed is a subset of the state information held at a node. The protocol to move an entity moves the kernel state information along with the entity when that entity moves.

8.1.1 Locus Movement

When a locus invokes a container that has, as its current node, a node other than the locus' current node, the locus may be moved to the container's current node. The message that is sent from the locus' current node to the new node must include the state information required by the locus on the new node. This state information includes the register set of the locus² and the locus' capability list.
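A minimal sketch of the state such a message might carry is given below in C; the type names, field widths, and the assumed register-file size are illustrative only and do not describe the Grasshopper wire format.

/* Illustrative only: the state a kernel might ship when a locus moves. */
#include <stdint.h>

#define GH_NUM_REGS 64              /* assumed register file size          */

typedef struct {
    uint64_t object;                /* entity the capability refers to     */
    uint32_t rights;                /* permitted operations                */
} gh_capability_t;

typedef struct {
    uint64_t locus_id;              /* name of the moving locus            */
    uint64_t home_node;             /* node the locus was created on       */
    uint64_t regs[GH_NUM_REGS];     /* most recent register set            */
    uint32_t ncaps;                 /* number of capabilities that follow  */
    gh_capability_t caps[];         /* the locus' capability list          */
} gh_locus_move_msg_t;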


8.1.2 Container Movement

A container may exist on many nodes simultaneously. The Grasshopper system allows containers to appear omnipresent on the network. This global visibility is provided by the manager of the container, not by the kernel. On each node there may, or may not, be a subsidiary manager responsible for the instance of the container on that node.

If the pages of a container exist on a node, the kernel on that node must hold the state information about that container. This information includes the Local Container Descriptor (LCD), the Page Sets, and the identity of the manager responsible for the LCD and page sets.
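The per-node kernel state for a container instance might be summarised along the following lines; this C sketch is illustrative only, and the field names are assumptions rather than the actual kernel structures.

/* Illustrative only: per-node kernel state for a container instance. */
#include <stdint.h>

typedef struct {
    uint64_t container_id;       /* name of the container                    */
    uint64_t manager_id;         /* entity responsible for the LCD/page sets */
    void    *lcd;                /* Local Container Descriptor               */
    void    *page_sets;          /* which pages are resident on this node    */
} gh_container_state_t;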

8.1.3 Sub-network Movement

Any state information contained in a sub-network is relevant only to that sub-network. A sub-network ranges from a single node to any branch of the network tree. A sub-network may be moved from one point on the tree to another point on the tree without having any of the state information on that sub-network become invalid.

The only state information that changes when a sub-network is moved is the name of the node that is now concerned with routing messages to the moved sub-network. The change occurs at two nodes, the node to which the sub-network was attached, and the node to which the sub-network is now attached.

² Grasshopper currently only allows fully homogeneous machines to participate in a network. How the register set may map between heterogeneous architectures is as yet an unresolved issue. The movement and location protocols are independent of the architectures of the nodes, and so would function correctly in a heterogeneous environment.

8.2 Messages or Movement?

The paradigms provided to the user are that of the locus actually moving across the network, and that of the page of data moving across the network. Operations on entities in Grasshopper also appear to move across the network. These are implemented at the fundamental level by messages moving across the network, from machine to machine and kernel to kernel. The user cannot generate these messages directly; they are only generated by the kernel on each node.

If we want to move an entity, or make an entity appear distributed, the kernels cooperate by sending messages to each other that create the appearance of movement or distribution. There are several types of messages that the kernels generate: control messages, which control the behaviour of messages on a single link, routing messages, which are used to control the flow of messages between kernels, and data messages, which actually move high level data between kernels. When a locus moves to a new kernel, routing messages are sent to establish a pathway to the new kernel for the entity, and then a data message is sent with the locus information.

The routing information persists while the entity is not on its home node, in case other kernels send messages about the entity. For example, a locus destroy kernel call is converted into a locus destroy message. This message is sent on the network to the kernel which currently holds the locus. If the locus has moved the routing information set up previously will be used to direct this message to the appropriate kernel.

8.3 Causality Considerations

To preserve causality within the Grasshopper Network, it is important that the movement of an entity does not cause messages to be delivered out of order. Since causality is preserved by the strict ordering of messages across the network links, this ordering has to be preserved after the move occurs.

The fundamental problem with moving entities is that two causally dependent messages may be delivered out of causal order if the entity passes one or both of the messages during movement. Figure 8.1 shows two messages a and b, with b being somehow causally dependent on a, being sent to the entity e1. Entity e1 was created on node N3, and is currently residing on that node. After the transmission of these messages, but before delivery, entity e1 moves from node N3 to node N9.

Figure 8.1. Causality Violation

The two messages are still in transit. If the intermediate nodes, N0, N1, and N2 observe the movement of e1 and redirect messages for e1 then message b may be delivered before message a.

This can be solved by the naive algorithm of having all redirection for an entity occur on the entity's home node. In the above example, an attempt to deliver both messages, a and b, on node N3 would result in each message being re-directed to the entity's current node. If the entity moves from node N9 to a third node all messages would have to be subsequently redirected by node N9 to the new node. Finally, if the entity returns to a node it had visited previously all outstanding messages would have to be delivered before new messages could be delivered. For example, if entity e1 moves, as in the above example, from node N3 to node N9 and then returns to node N3 it is possible that only message a is redirected before entity e1 returns. It is then possible that message b is delivered before message a is further redirected by node N9.

Grasshopper solves the problem of causal order delivery of messages to moving or distributed entities by ensuring a total ordering on messages addressed to the entity. When an entity moves, messages for the entity are first sent to the node that is the least common ancestor of all instances of the entity. This node is the root of the minimal sub-tree that includes all instances of the entity, and is called the Entity Sub-tree Root (ESR).

All routing nodes that are on paths between the ESR and the instances of the entity are used in the routing algorithm. These nodes are called Entity Sub-tree Nodes (ESN). The routing algorithm uses these nodes to redirect messages if necessary.

Figure 8.2 shows a simple sub-tree. The entity has three instances present, on nodes N3, N6, and N9. The home node is N6, and the ESR is N0. Each path between an instance and the ESR has an ESN present.

Figure 8.2. Routing Protocol Node Names
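To make the definition of the ESR concrete, the following C sketch computes it as the least common ancestor of all nodes holding an instance of the entity, assuming each node carries a parent pointer up the network tree. The representation and function names are assumptions for illustration, not the Grasshopper implementation.

/* Illustrative sketch: computing the Entity Sub-tree Root as the least
 * common ancestor of all instance nodes, given parent pointers up the tree. */
#include <stddef.h>

typedef struct node {
    struct node *parent;            /* NULL at the root of the network tree */
} node_t;

static int depth(const node_t *n)
{
    int d = 0;
    while (n->parent) { n = n->parent; d++; }
    return d;
}

static node_t *lca2(node_t *a, node_t *b)
{
    int da = depth(a), db = depth(b);
    while (da > db) { a = a->parent; da--; }   /* level the deeper node     */
    while (db > da) { b = b->parent; db--; }
    while (a != b)  { a = a->parent; b = b->parent; }
    return a;                                  /* first common ancestor     */
}

/* The ESR of an entity is the LCA of every node holding an instance. */
node_t *entity_subtree_root(node_t *instances[], size_t n)
{
    node_t *esr = instances[0];
    for (size_t i = 1; i < n; i++)
        esr = lca2(esr, instances[i]);
    return esr;
}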

8.4 Name Alias Table

Every node in a Grasshopper network has a kernel routing structure called a Name Alias Table (NAT). This table is used by the routing algorithm to redirect messages to entities that have moved, or are distributed. For an entity that has moved an entry will be in the NAT at the ESR, each ESN, and the node that holds each instance of the entity. The NAT is used to record the new location(s) of an entity. The ESR node uses the NAT to indicate which entities should have their messages held. The hold function of the NAT is used to ensure that messages destined for moving entities are not allowed to become out of order. When an entity is moving, the ESR places a hold on any messages destined for the moving entity. Holding these messages at the ESR guarantees that they cannot be delivered out of order.

The structure of each NAT entry is shown in Figure 8.3. The entity name is the full name of the entity. This field is used as the index key into the NAT. The flags are used by the routing protocol to indicate entities that are moving. There are four flags: a busy flag, which indicates that a change to the NAT configuration for this entity is in progress, and three hold flags. The different hold flags are used by the ESR, the moving node, and, if necessary, a new ESR. The hold flags will cause messages to be queued under different conditions. The home field holds the node name of the node that the entity was created on. The parent field holds the name of the parent node, which is empty in the entry on the ESR. The current node list holds the address of each child node that has a NAT entry in its table.

Entity Name | Home Node | Parent Node | Flags | Current (Sub-)Node List

Figure 8.3. Name Alias Table
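One plausible C rendering of a NAT entry with the fields described above is sketched below; the field widths, flag encodings, and the fixed-size current node list are assumptions for illustration only.

/* Illustrative layout of a Name Alias Table entry; field names assumed. */
#include <stdint.h>

#define NAT_BUSY       0x1   /* a change to this entry's configuration is in progress */
#define NAT_HOLD_ESR   0x2   /* hold messages at the ESR                              */
#define NAT_HOLD_MOVE  0x4   /* hold messages at the moving node                      */
#define NAT_HOLD_NEW   0x8   /* hold messages at a new ESR                            */

typedef struct nat_entry {
    uint64_t  entity;        /* full entity name; index key into the NAT     */
    uint64_t  home_node;     /* node the entity was created on               */
    uint64_t  parent_node;   /* parent in the entity sub-tree; 0 at the ESR  */
    uint32_t  flags;         /* busy flag plus the three hold flags          */
    uint32_t  ncurrent;      /* number of child nodes listed below           */
    uint64_t  current[8];    /* child (sub-)nodes holding NAT entries        */
} nat_entry_t;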

For each entity that has moved, the NAT records the entity's name and home node. For every entity that has moved, there exists an entry in the NAT on at least three nodes: the node that the entity was created on, the node that the entity currently resides on, and the ESR of the creating and all current nodes. For entities that are only on their home node, all three nodes would be identical, so no explicit NAT entry is required. Figure 8.4 shows our network in which the entity δ has moved from node N3 to node N5.

Figure 8.4. Single Entity Movement

When an entity moves from a node to another node, the NAT is updated at all affected nodes. Initially an entity is on its home node, and so the home node, current node, and the ESR are the same. When an entity moves to a new node, the NATs are updated to reflect the change. If the entity returns to the home node the NAT entries for that entity are deleted at all nodes. Figure 8.5 shows the updated NATs if the entity δ moves from N5 to node N9. The entry in the NAT at N5 is deleted and entries are created in the NATs at N0, N2, and N9.

Figure 8.5. Updated NAT After Movement

8.5 Using the Name Alias Table

The NAT is utilised by the algorithm used to route messages, and by the algorithm to move an entity. The routing protocol outlined previously assumed entities had a fixed location. Since entities can move, the routing algorithm is changed. The NAT is central to allowing entities to move, and as we shall see, to being distributed.

8.5.1 New Routing Protocol

To account for entries in the NAT, the routing algorithm given previously has to be enhanced. A check has to be performed to determine if an entry for the destination entity already exists in the NAT. If no entry exists the simple routing algorithm outlined previously can be used to route the message. When the hold flag is set, messages destined for the entity may be held at the node until the hold flag is cleared, at which point they are released in FIFO order.

If an entry exists, the entity is either being moved, in which case the message is delayed at the ESR while the move occurs, or the entity has already moved, in which case the destination node of the message has to be changed to re¯ect the new destination.


After any modification to the destination node occurs, the simple routing algorithm is used to calculate the node to which the message should be forwarded. Figure 8.6 shows the modified routing algorithm.

if NAT_entry(entity) then
    if NAT_hold(entity) then
        hold entity until released
        restart algorithm
    else
        if message from parent, or NAT_parent(entity) is empty then
            destination_nodes = NAT_current(entity)
            if more than one node then
                promote to multicast
        else
            pass to parent
            break

if destination_nodes = current_node then
    deliver locally
else if ancestor(current_node, destination_nodes) then
    for each child of current_node do
        if child = destination_nodes or ancestor(child, destination_nodes) then
            pass to child
    if no child was selected then
        illegal address
else if root_of_network then
    illegal address
else
    pass to parent

Figure 8.6. NAT based Network Routing Algorithm

8.5.2 Protocol to Move an Entity

All communication on the network is performed in terms of kernel to kernel messages. These messages are delivered between the kernels along the network links. If required, these messages can be stopped by holding them at a node. The protocol works by ensuring that messages for an entity always move towards the entity in strict FIFO order. When an entity moves, messages for that entity have to be delayed so that the FIFO ordering between them is not disrupted. After the entity has finished moving the messages for that entity are released by the kernels in such a manner that causal ordering is preserved.

In brief, we have to pass messages around the network to ensure that any message destined for the entity will be delivered to it in causal order. This involves having messages pass between the current node, the ESR, the new node, the new ESR, and the home node, in the correct order. We also delay propagation of messages destined for the entity at the home node, the ESN, and the current node while the move occurs.

8.6 Looping Messages

The routing algorithm is designed to ensure that messages cannot loop. The nodes that require a NAT entry are those nodes that have to either redirect a message, or divide a message. The ESR and each ESN requires a NAT entry. Furthermore each node on the path linking the home node and the ESR also requires a NAT entry. If these nodes did not have a NAT entry they would, by default, direct a message towards the home node, and not towards the ESR. Any node not on the path between the home node and the ESR would, by default, route a message towards the ESR. This is shown in Figure 8.7 where the default movement of messages for node N6 is shown.

Figure 8.7. Routing Protocol Node Names

If an entity that was created on node N6 moves, and the ESR is on node N0 the default movement of messages for the entity would be correct for all nodes, except N6 and N2. The NAT entries on these nodes ensures that all messages move in the correct direction.

8.7 Distributed Entities – Multicast Messages

Some entities, such as containers, can exist on multiple nodes simultaneously. Messages for these entities should be delivered at all instances of the entity. This delivery of a message to multiple nodes simultaneously is called multicast. If causality is preserved it is called causal multicast.


Causality does not require that the messages to two entities are delivered in the same relative order, only that any two messages are delivered in causal order. If two messages, a and b, are causally independent, some instances may receive a before b, and the remaining instances may receive b before a. If all instances receive all messages in the same order, i.e. they all receive a before b, or all receive b before a, the messages are said to be in total order. Total ordering is stronger than causality; a system that guarantees total ordering also guarantees causal delivery. Because Grasshopper routes all messages for an entity through the one ESR, which places an ordering on the messages, Grasshopper provides total ordering for messages, and hence causal delivery.

To perform the multicast, a message for an entity is first sent to the ESR. The ESR redirects the message to each node or ESN that is listed in the current node list. Each ESN in turn redirects the message to each entry listed in its current node list. The message is duplicated along each branch of the tree that holds one or more instances of the entity.

All multicast messages for an entity are first sent to the ESR. Since each message redirected from the ESR stays in relative order we are assured that multicast messages are delivered to all instances of the entity in total order. Also, any message sent to an entity becomes a multicast message if the entity has more than one instance on the network. The promotion from a point to point message to a multicast message is performed automatically at the ESR by the routing algorithm.
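The fan-out performed at the ESR (and at each ESN) can be pictured with the following C sketch; the helper names and types are assumptions, not the Grasshopper code. Because each copy is queued in order on its outgoing link, every instance observes the total order imposed at the ESR.

/* Illustrative sketch of multicast fan-out at the ESR or an ESN. */
#include <stdint.h>

typedef struct message message_t;            /* opaque routed message      */

typedef struct {
    uint32_t ncurrent;                       /* children with NAT entries  */
    uint64_t current[8];                     /* their node names           */
} nat_children_t;

extern const nat_children_t *nat_children(uint64_t entity);
extern void queue_for_link(uint64_t next_hop, const message_t *m);

/* A point-to-point message addressed to a distributed entity is promoted
 * to a multicast here: one copy is queued, in order, per child branch. */
void fan_out(uint64_t entity, const message_t *m)
{
    const nat_children_t *c = nat_children(entity);
    for (uint32_t i = 0; i < c->ncurrent; i++)
        queue_for_link(c->current[i], m);
}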

8.7.1 Protocol to Move or Distribute Portions of a Distributed Entity

The algorithm given below, and shown diagrammatically in Figure 8.8, is used to move an entity on the Grasshopper network. It allows an entity to move from one node to another, an entity to split into two instances, and an instance of a distributed entity to be deleted.

We shall describe the steps, and the messages that occur during the routing re-arrangement phase. In general, the ESR of a distributed entity is considered to be the entity itself. All messages to the entity are first passed to the ESR, which provides a total ordering on messages to the entity. From the ESR each message is sent to each instance of the entity.

The bullet points are steps that are carried out at the nodes, and the numeric points are the messages that are passed across the network. Worked examples of the protocol are presented in Appendix A for reference.

• The algorithm starts when an entity requests the kernel to perform one of several functions. The requests are to move, copy, or delete an instance of an entity.

If a NAT entry for the entity already exists, i.e. it has been moved or copied before, the NAT entry has a flag set to hold incoming messages for the entity. Outgoing messages from the entity may still proceed.

If the NAT entry does not exist, i.e. this entity is not currently distributed or moved, then a NAT entry is created and the entry filled as appropriate.

1 The current node sends a message to the ESR. This message contains the action that is to occur, and the node that is requesting the action.

• When the message arrives the ESR will delay the request if it is processing another request. The ESR sets a hold flag, and all messages being sent to the entity in question will be queued, in order, at the ESR.

• If the operation is a move or a delete the current node will no longer have an instance of the entity. If the current node is not the home node, i.e. it is off another branch of the network, then any unnecessary ESN will be deleted. These are removed by sending a message from the current node to the ESR, deleting all the nodes that have no children.

If the current node is the home node then this operation does not occur, as the ESN must exist on the path between the home node and the ESR for the routing algorithm to perform correctly.

2 The ESR sends the message to the current node which causes the current node to delete the NAT entry for the entity. This message then moves towards the ESR until either a node with more than one child is found, or the ESR is found.

3 When the recursing message arrives at a node that has other children, or at the ESR, the recursion stops, and a message is sent to the ESR to enable it to continue processing.


• When control returns to the ESR any NAT entries that had to be deleted are now deleted. It is possible that some of these may need to be re-created, but this will occur later in the algorithm.

4 The ESR sends a message to the current node which allows the rest of the algorithm to continue. This message signifies that the operation is now able to proceed.

• The operation is performed on the entity:

Move: The entity and associated kernel data, including any queued messages, are transferred to the new node, and the local instance is deleted.

Copy: A copy of the entity and associated kernel data, including any queued messages, is transferred to the new node. The local instance is released to continue processing.

Delete: The instance and associated kernel data are deleted, and a message is sent to the ESR, informing it that the deletion has occurred.

5 Both the move and copy operations send an entity to the new node. The new node creates a NAT entry for the entity, and then sends a message to the ESR.

6 When the ESR is informed that the move, copy, or deletion has occurred it has to perform two actions. The first is to create any new NAT entries that are required, the second is to continue deleting any entries that are no longer required.

If a new instance of the entity is created, the path between the current ESR and the new entity has to have NAT entries placed at the nodes. This is performed as a two step procedure. First, if the new ESR is further towards the root of the tree than the old ESR, the ESR is moved crabwise, i.e. a single step at a time, to the new ESR; then any unnecessary NAT entries are deleted. These unnecessary NAT entries occur if the new ESR is closer to the leaves than the old ESR. In this case the ESR moves crabwise until it reaches the new ESR node.

7 The first step in the crabwise move outwards is to create a NAT entry at the parent of the current ESR.

8 The parent creates a NAT entry, and responds to the current ESR. This ensures that there are no causally dependent messages between the old and new ESRs.

9 Finally the old ESR sends any queued messages to the parent, which are inserted into the queue before any messages held at the new ESR.

10 When the ESR has reached the new root node, or if it was already at the root node, the chain to the new current node is created. This moves from node to node from the ESR to the new entity, creating a NAT entry at each node. This creation occurs in parallel with the rest of the algorithm.

• When there are no instances of the entity apart from that on the home node, the movement of the ESR outlined below will eventually reach the home node. The home node is the node that has no children, and no parent. If the ESR moves to this node the ESR deletes itself, resulting in no references to the entity in any NAT table. This is the state that existed before any movement by the entity occurred.

• After any additional NAT entries have been created the ESR checks to see if it can be deleted. If it has no parent, and only one child, it is unnecessary, and so can move towards the home node.

11 This collapse of the ESR towards the home node also occurs in a crabwise fashion. A message is firstly sent to the child node, setting a hold on messages, and clearing the parent field.

12 The child node tells the ESR that this has occurred.

13 The ESR sends any queued messages to the child, which will insert them before any held messages at the child. The ESR then deletes itself, and the child becomes the new ESR. This process continues until either a node is reached that has two or more children, in which case it is the new ESR, or the ESR reaches the home node, in which case there are no other instances of the entity left, and as stated above, the ESR deletes itself.

14 Finally, if there is a new instance of the entity, the ESR sends a message to the instance allowing it to continue processing. Also, the ESR releases the hold on messages, allowing any queued messages to continue.

Figure 8.8. Pseudo code for Network Protocol (per-node pseudocode panels: ESR and ESNs; Current Node; New Current Node; Home Node processing)

8.8 Causal Correctness of the ESR Movement

The routing protocol functions by ensuring the ESR is always the root of the entity sub-tree. If the sub-tree grows or shrinks the ESR moves accordingly. The ESR is always located on the path that connects the home node to the root of the tree. The ESR moves along this path to ensure all other entity instances are covered.

The ESR either moves towards the global tree root, or towards the home node. If an instance of the entity is being created on a node not covered by the ESR the ESR moves towards the root. If an instance of the entity is deleted, and the ESR only has one ESN child remaining, the ESN is promoted to ESR, and the old ESR is deleted.

The routing algorithm treats the ESR as the entity for routing purposes. During the movement of the ESR care has to be taken to ensure that messages continue to be delivered in causal order.

8.8.1 Causality for Outward Movement of the ESR

When a new instance of the entity is created the ESR moves towards the root of the global network tree. The movement is achieved using a three message protocol, shown in Figure 8.9. Node Na sends a message, m1, to its parent, Nb. Nb creates a NAT entry, and sets a hold for the entity. Nb sends a reply message, m2, in an atomic action with the NAT creation. When Na receives the reply message it atomically sends any queued messages, m3, and releases its hold on the NAT entry. When this message is received Nb inserts any passed messages before its saved messages, and the ESR movement is completed. If the new node is not the eventual ESR, i.e. Nb does not cover all instances, the process repeats.
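The three-message handshake can be summarised by the following C sketch of the handlers at Na and Nb; all identifiers are assumptions for illustration, and the comments mark the actions the text requires to be atomic.

/* Illustrative sketch of the three-message outward ESR move of Figure 8.9. */

typedef enum { M1_TAKE_OVER, M2_ACK, M3_QUEUED } esr_msg_t;

extern void send_to_parent(esr_msg_t m);   /* link towards the root        */
extern void send_to_child(esr_msg_t m);    /* link towards the old ESR     */
extern void nat_create_with_hold(void);    /* create NAT entry, set hold   */
extern void nat_release_hold(void);        /* release any queued messages  */
extern void forward_held_to_parent(void);  /* ship queued messages as m3   */
extern void insert_before_held(void);      /* splice m3 messages in front  */

void na_start_move(void)                   /* at Na: old ESR initiates     */
{
    send_to_parent(M1_TAKE_OVER);          /* m1 */
}

void nb_on_m1(void)                        /* at Nb: parent takes over     */
{
    nat_create_with_hold();                /* atomic with the reply below  */
    send_to_child(M2_ACK);                 /* m2 */
}

void na_on_m2(void)                        /* at Na: hand over held msgs   */
{
    forward_held_to_parent();              /* m3; atomic with the release  */
    nat_release_hold();
}

void nb_on_m3(void)                        /* at Nb: old-ESR msgs go first */
{
    insert_before_held();
    nat_release_hold();
}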


Figure 8.9. Outward Movement of the ESR

Figure 8.10 shows the causality implications for this movement. The time when node Na sends m1 to Nb is marked as P1. When node Nb receives m1 it sets a hold on all messages for the entity, and responds with message m2. When Na receives m2 it sends the saved messages as m3 and releases its lock. The dashed lines show the possible movement of other messages for the entity. The period marked S1 occurs before the movement starts. All messages for the entity are being received in causal order. Likewise for the period marked S4 which occurs after the movement has completed. The algorithm requires that messages arriving in the period marked S2 cannot be causally dependent on those that arrive in the period marked S3; i.e. that a message that arrives during the period S2 is not generated from a message that passed during the period S3.

This can be shown to be true by contradiction: all the messages that are processed by Nb and arrive at Na during period S2 can only be dependent on messages that were processed by Nb before P2. Any message that transited Nb after P2 would arrive at Na after the end of S2. For a message to arrive at Na during S2 and to have passed Nb after P2, the message must have travelled back in time, which is a contradiction.

Figure 8.10. Time Diagram for ESR Outward Movement

8.8.2 Causality for Inward Movement of the ESR

When an instance of the entity is deleted the remaining instances may be contained by a smaller sub-tree. The ESR will move crabwise towards the home node until it is the root of this reduced sub-tree. The movement is achieved using a three message protocol, shown in Figure 8.11. Node Nb sends a message, m1, to its only child, Na. Na sets a hold for the entity in its NAT entry. Na sends a reply message, m2, in an atomic action with the NAT update. When Nb receives the reply message it atomically sends any queued messages, m3, and deletes its NAT entry. When this message is received Na inserts any passed messages before its saved messages, and the ESR movement is completed. If the new node is not the eventual ESR, i.e. Na is not the root of the minimal sub-tree, the process repeats.

Figure 8.11. Inward Movement of the ESR


Figure 8.12 shows the causality implications for this movement. We can show that the algorithm is correct in a similar manner to that used above. For a causality failure to occur a message has to be received in section S2 that is dependent on a message in S3.

For a message to arrive at Nb during S2 any causal precursor message must have passed Na before m0. If a precursor passed Na before m0 and a message that it is dependent on arrives at Na during S3, the precursor message must have passed the message it is dependent on. This contradicts the routing protocol, which guarantees that messages cannot pass one another.

Figure 8.12. Time Diagram for ESR Inward Movement

8.9 Broadcast Messages

Although broadcast messages are not utilised in Grasshopper, the tree structure of the Grasshopper Network is suited to the efficient delivery of such messages. A message can be broadcast to either the entire network, or a sub-network. To perform a broadcast a message is sent to the root of the sub-network, which sends it to all of its children. Each child in turn sends the message to all of its children. This operation recurses until the message reaches all of the leaf nodes.

Grasshopper does not use broadcast in the location or movement of entities. This avoidance of broadcast messages is due to the non-scalable nature of broadcast messages. As the number of nodes using broadcast messages increases, the number of broadcast messages reaching a particular node also increases, until the entire message flow consists of nothing but broadcast messages.

8.10 Summary

To allow movement of entities in Grasshopper we modify the routing algorithm, and use a per-node data structure to track movement of entities through the network. Each node has a Name Alias Table (NAT) which holds information about each entity that has moved. The routing algorithm only requires that this information be kept at certain nodes. For any moved entity the information must be kept at the entity's home node, the entity's current node, and any nodes that form the spanning path between the home node and any current node. The nodes in this spanning path are called Entity Sub-tree Nodes (ESN), and the root of the spanning tree is the Entity Sub-tree Root (ESR).

The algorithm used to move entities in Grasshopper places a hold flag in an entity's NAT entries. This flag is used by the routing algorithm to delay messages that may otherwise arrive out of causal order.

The NAT allows an entity to exist simultaneously at multiple nodes, and automatically promotes a message from a point to point message to a multicast message if the addressed entity is distributed. The routing algorithm allows, as a side effect, the use of broadcast messages to either the full network, or to a sub-network, although broadcast messages are not used in the normal protocols due to their non-scalability.


9 Implementation, Evaluation and Future Work

The Grasshopper Operating System is currently being implemented on the DEC Alpha platform. One of the requirements of Grasshopper was that the system would operate on current conventional hardware. DEC kindly provided two systems for the project.

The network protocol was tested using a simulator, due to the lack of test machines. The simulation runs as a user process under OSF/1. Each process simulates a single node in the network, and runs the full routing algorithm for that node.

9.1 Implementation

9.1.1 Link Protocol

The Grasshopper network, as described in Chapter 6, is implemented as a set of links between nodes in a tree structure. Along each link, a point to point protocol is implemented. The protocol must guarantee the reliable, in order, delivery of messages. The links are assumed to be unreliable, allowing packets to be delivered out of order or to be lost. The protocol used by Grasshopper is a standard ACK-less protocol with sequence numbers [28]. A message to be transmitted is broken into packets, each of which is given a sequence number and transmitted along the link. The sender does not discard the sent packets immediately. A copy of each packet is retained by the sender until it is assured the destination has received the packet. Each packet includes a reverse sequence number to facilitate the deletion of the saved packets.

When a packet is received, it is checked, and if the checksums are wrong it is discarded. The packet's sequence number is examined next. If the packet is correct and is the next packet expected, it is passed to the deliverer, and any saved subsequent packets are also passed to the deliverer. If the packet is a duplicate, it is discarded. If the expected packet is missing, the received packet is saved and a message is sent along the link requesting the missing packets.

The deliverer is responsible for reconstructing fragmented messages. It is passed packets in sequence and in turn passes out reconstructed messages. When a packet is passed to the deliverer, its sequence number is recorded. This saved sequence number is included in the header of each outgoing packet. The receiver can use this number to delete the saved packets. An example of the protocol is given in Figure 9.1.

Figure 9.1. Link Protocol

Step 1: A large message is passed to the link layer of node A for transmission to node B. The message is too large to be sent as one packet, and as such is broken into four packets, each of which is given a sequence number and queued for transmission.

Step 2: As each packet is transmitted, it is saved into a message log in case it is needed later. In this example, packet 3 is lost on the network.

Step 3: The packets are received by node B. It accepts packets 1 and 2 and passes them to the deliverer. When packet 4 is received, the node determines that packet 3 is missing.

Step 4: Node B constructs a message that is sent to node A, requesting packet 3. This message includes, along with other information, the sequence number of the last packet passed to the deliverer, the reverse sequence number. This message is sent out of the normal sequence of data in a control packet. These control packets do not have a sequence number, and always move to the head of the transmitting queue.

Step 5: When node A receives the packet, it removes packets 1 and 2 from the log and discards them, as it knows they have been passed to the deliverer by node B. The message in the packet received causes node A to requeue packet 3 for transmission.

Step 6: When node B receives packet 3, it passes this packet and packet 4 to the deliverer, which combines them with packets 1 and 2 to form the original message. The reconstructed message is passed to the network router for either retransmission, or delivery if this is the destination.

The request for retransmission may be lost; this is handled by a timeout. After a suitable time the receiver will re-request the missing packets. Each request for retransmission includes all of the missing packets, so if additional out of sequence packets are received, the retransmission request will include every missing packet.

The removal of ACKs allows the protocol to transfer data at full bandwidth, with a retransmission only occurring for those packets lost or corrupted. The ACKs are in effect provided by the additional sequence number on the return packets. ACK-less protocols efficiently use bandwidth on full duplex channels, such as fibre optic cables, when there is sufficient traffic.

Each message includes the sequence number of the last successfully delivered message on the reverse channel. On reception of a packet, any messages up to and including the identified message may be removed from the log. Idle channels can generate occasional time messages to control the growth of the log.
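A minimal sketch of the receive path for this ACK-less protocol is given below in C, assuming a fixed-size reorder window; the names and the window mechanism are illustrative and not the Grasshopper implementation.

/* Illustrative receive path for the ACK-less link protocol. */
#include <stdint.h>
#include <stdbool.h>

#define WINDOW 64

typedef struct {
    uint32_t seq;        /* forward sequence number                        */
    uint32_t rev_seq;    /* last packet we delivered on the other channel  */
    /* ... payload ... */
} packet_t;

static packet_t saved[WINDOW];          /* out-of-order packets held back  */
static bool     present[WINDOW];
static uint32_t next_expected = 1;

extern void pass_to_deliverer(const packet_t *p);
extern void request_retransmit(uint32_t first_missing, uint32_t last_seen);
extern void trim_send_log(uint32_t acked_up_to);

void link_receive(const packet_t *p)
{
    trim_send_log(p->rev_seq);                   /* implicit ACK for our side */

    if (p->seq < next_expected)                  /* duplicate: drop           */
        return;

    if (p->seq != next_expected) {               /* gap: save and request     */
        saved[p->seq % WINDOW] = *p;
        present[p->seq % WINDOW] = true;
        request_retransmit(next_expected, p->seq);
        return;
    }

    pass_to_deliverer(p);                        /* in order: deliver it, and */
    next_expected++;
    while (present[next_expected % WINDOW]) {    /* any saved successors too  */
        present[next_expected % WINDOW] = false;
        pass_to_deliverer(&saved[next_expected % WINDOW]);
        next_expected++;
    }
}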

9.1.2 Packet Structure

Each message sent on a Grasshopper network consists of a header with the following fields:

• the destination entity,

• the destination node (initially the entity's name's node part),

• the high level protocol and protocol type identifiers,

• the header checksum,

• the length of the body data, and

• the body checksum.

When a message is sent across the network the message is sent in one or more packets.


Each link has associated with it a maximum packet size. This is the largest packet that may be transmitted on the link. The maximum packet size is determined by the underlying physical hardware. On an ethernet this value is 1500 bytes. On a serial connection it would be set according to the level of noise on the line.

Each packet consists of a header and a portion of the message. The header contains the sequence number, the return sequence number, a multi-packet identifier, and a control identifier. The control identifier is used to identify link control packets, which are sent by the link layer and are not included in the normal stream of packets.

When a message is split into multiple packets for transmission, each packet has its position in the message stored in the multi-packet identifier. As the deliverer is passed each packet it checks to see if the packet is the first in a multi-packet message. A zero identifier denotes a self contained packet; the message is held in the body of the packet. A non-zero identifier indicates that the packet is the first in a series, and also identifies the number of packets following in the message. The deliverer assembles the message and then delivers it when the last packet, identified by a zero identifier, is presented. For each packet in a multi-packet message the identifier holds the number of packets still to come; in effect it counts down from the number of packets to zero, easily identifying the correct order of the message fragments.
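The following C structures give one possible rendering of the message and packet headers just described; the field names and widths are assumptions for illustration, not the actual on-wire format.

/* Illustrative C layouts for the message and packet headers. */
#include <stdint.h>

typedef struct {
    uint64_t dest_entity;      /* destination entity                         */
    uint64_t dest_node;        /* initially the node part of the entity name */
    uint16_t protocol;         /* high level protocol identifier             */
    uint16_t type;             /* protocol type identifier                   */
    uint16_t header_cksum;     /* checksum over this header                  */
    uint16_t body_cksum;       /* checksum over the body                     */
    uint32_t body_length;      /* length of the body data                    */
} gh_msg_header_t;

typedef struct {
    uint32_t seq;              /* forward sequence number                    */
    uint32_t rev_seq;          /* return (reverse) sequence number           */
    uint16_t follow;           /* multi-packet identifier: packets to come;  */
                               /* zero marks the last (or only) packet       */
    uint16_t ctl;              /* non-zero identifies a link control packet  */
} gh_pkt_header_t;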

Figures 9.2 and 9.3 show the message and packet structures. If a message is small enough it is placed into one packet, otherwise it is split between multiple packets.

[Diagram omitted: a single packet carrying a small message. The packet header holds Rev. Seq., Follow Seq. = 0, Body Length, Ctl. Flag and Cksum; the packet body holds the message header (Dest Entity, Dest Node, Protocol, Type, Cksum, Data Length, Cksum) followed by the message data.]

Figure 9.2. Message and Packet Structure - Small Messages

[Diagram omitted: a large message split across several packets. The first packet header carries Follow Seq. = N and later packets count down to Follow Seq. = 0; each packet header also holds Rev. Seq., Body Length, Ctl. Flag and Cksum, while the message header and data are spread across the packet bodies.]

Figure 9.3. Message and Packet Structure - Large Messages

9.1.3 Routing Protocol

The routing protocol is implemented as a distributed finite state machine, with the state changes occurring due to routing messages being passed between the network nodes. The state of the machine can be determined by examining all nodes, and all queued messages. No one node is the global arbiter or sequencer. The state at each node is maintained by the NAT, which has associated with each entry a number of private data slots.

The finite state machine at each node is implemented as a call back from the message router. As router messages are delivered, the state at each node changes, and other messages are generated. The private data slots are used to save state between messages.

After messages have been reconstructed by the deliverer at a node, they are passed to the router. The router determines the protocol to which the messages belong. Control messages have already been removed by the link controller. Messages that control the routing protocol are exempt from being redirected by the routing protocol. All other messages are examined, and if an entry exists in the NAT, the destination field is changed. After being routed, a message is destined for either the local node, or one or more links. Local messages are queued for delivery while all other messages are queued for transmission on remote links.
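The routing step might be expressed roughly as follows; nat_lookup and nat_current_node are assumed helper functions standing in for the NAT access code, not the real kernel interface.

    #include <stdint.h>

    struct nat_entry;                                      /* opaque NAT entry */
    struct nat_entry *nat_lookup(uint64_t entity);         /* assumed helper   */
    uint32_t nat_current_node(const struct nat_entry *e);  /* assumed helper   */

    enum gh_route { ROUTE_LOCAL, ROUTE_LINKS };

    /* Rewrite the destination of a reconstructed message, unless it belongs
     * to the routing protocol itself, then decide where it is queued. */
    static enum gh_route route_message(uint64_t dest_entity, uint32_t *dest_node,
                                       uint32_t local_node, int is_routing_msg)
    {
        if (!is_routing_msg) {
            struct nat_entry *e = nat_lookup(dest_entity);
            if (e != NULL)
                *dest_node = nat_current_node(e);   /* redirect to current node */
        }
        return (*dest_node == local_node) ? ROUTE_LOCAL : ROUTE_LINKS;
    }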

9.1.4 Network Protocols

There are currently four different types of messages that can be sent over the Grasshopper Network. Figure 9.4 lists these types.

Protocol   Use                        Routed or Not
Control    link control               Non-routed
Routing    routing control messages   Non-routed
Kernel     kernel to kernel           Routed
Test       test control messages      Routed

Figure 9.4. Protocols Implemented on the Grasshopper Network

Neither Control messages nor Routing messages are routed by the routing algorithm. The control messages only exist on one link, and do not move beyond the end to end control code. The Routing messages control the routing algorithm itself, and so must be exempt. The Kernel messages move from one kernel to another kernel, and are addressed to an entity, not a kernel. The Test messages are also directed towards entities, but were implemented for use in the evaluation portion of this chapter.

The different protocols are identified by the protocol field of each message. The router examines this field to determine what to do with each message. On local delivery, the field is used to determine the code to execute to handle the message.

The use of the protocol field allows additional protocols to be easily handled, and to be Routed or Non-routed depending on their functions.
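For illustration, the protocol field of Figure 9.4 could be encoded and dispatched as below; the enumerators and handler names are hypothetical, not taken from the kernel.

    struct gh_message;                           /* reconstructed message       */
    void handle_link_control(struct gh_message *);
    void handle_routing(struct gh_message *);
    void handle_kernel(struct gh_message *);
    void handle_test(struct gh_message *);

    enum gh_protocol { GH_CONTROL, GH_ROUTING, GH_KERNEL, GH_TEST };

    /* Local delivery: the protocol field selects the handler to execute. */
    static void deliver_local(enum gh_protocol p, struct gh_message *m)
    {
        switch (p) {
        case GH_CONTROL: handle_link_control(m); break;  /* never routed         */
        case GH_ROUTING: handle_routing(m);      break;  /* never routed         */
        case GH_KERNEL:  handle_kernel(m);       break;  /* routed, to an entity */
        case GH_TEST:    handle_test(m);         break;  /* routed, evaluation   */
        }
    }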

9.2 Evaluation

9.2.1 Link Protocol Testing

Each link protocol was implemented as a pair of user processes on a DEC Alpha running OSF/1 V3.0. They communicated using UDP packets via the Unix system calls sendto and recvfrom. This approximated the direct network interface in the Grasshopper kernel.
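The harness is roughly of the following shape; the addresses, port numbers and error handling are placeholders for illustration, not the values used in the experiments.

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Open a UDP socket standing in for one Grasshopper link. */
    static int open_link(const char *peer_ip, int port, struct sockaddr_in *peer)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in me;

        memset(&me, 0, sizeof me);
        me.sin_family = AF_INET;
        me.sin_port = htons(port);
        me.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(s, (struct sockaddr *)&me, sizeof me);

        memset(peer, 0, sizeof *peer);
        peer->sin_family = AF_INET;
        peer->sin_port = htons(port);
        peer->sin_addr.s_addr = inet_addr(peer_ip);
        return s;
    }

    /* Packets are then exchanged with sendto(s, buf, len, 0, ...) and
     * recvfrom(s, buf, len, 0, ...); timing these transfers for different
     * message and packet sizes gives the curves in Figure 9.5. */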

It is felt that the performance of these programs represents the lower bound on performance of the protocols. The protocol throughput was tested for different sized messages, and for different sized packets. The test machines were DEC Alpha 3000/500 machines connected by a fairly quiescent ethernet. Figure 9.5 shows the throughput for the different messages and packet sizes.

[Plot omitted: throughput in bytes per second against packet size in bytes (roughly 500 to 1500), with one curve each for 64K, 256K, 512K and 1024K byte messages and the theoretical maximum marked.]

Figure 9.5. Throughput on a Single Ethernet


The tests were each conducted ten times, and the average used as the value. It is notable, and perhaps an example of excellent DEC design, that during all the tests no packets were lost. The two machines allowed almost full ethernet performance and handled the load. Two Sun SS1000s were also tested; the programs moved between platforms without source changes. The Sun tests, also run on a quiescent network, were useful because those machines both lost and reordered packets. The Sun tests showed that the protocol can handle a badly behaved network, and that the packet request code functions correctly.

The fall off of performance for smaller sized packets is due to the exhaustion of system resources and memory copying overheads.

9.2.2 Routing Protocol Testing

The Routing Protocol has several major stages: the construction of a NAT chain, the destruction of a NAT chain, movement of the entity, and movement of the ESR.

The following tests were performed, and the state of the network examined after each test:

1. Movement of an entity from its home node.

2. Further movement of a moved entity.

3. Movement of an entity back to its home node.

4. Creation of an entity from its home node.

5. Further creation of an entity from its home node.

6. Further creation of an entity from its non-home node.

7. Deletion of a moved entity.

8. Deletion of a created entity.

9. Simultaneous requests for different entities.

10. Tests 1 - 8 repeated while a probe message was also in progress. The probe message is a routed message, and so is manipulated by the routing protocol.

11. Probe messages to singular, moved and distributed entities.

After each test, the network had the correct information in the NAT at each affected node. Debugging information showed events occurred in the correct order.

The tests indicated that the Routing protocol operates correctly. These, coupled with the high throughput of the network, give great confidence in the usefulness of the Grasshopper Network.

9.2.3 Message Traffic Comparison

A static analysis of the cost of allowing entities to move shows that the number of extra messages required is very small. This occurs as a result of the redirection occurring at the ESR. Only those nodes that share a common segment of path between the home node and the ESR require additional message hops. Figure 9.6 shows the results of this analysis on the simple 9 node network used throughout this thesis. The first table shows the number of message hops required to send a message to an entity. The remaining tables give the offset from this number for when the entity moves. Each element shows the number of extra hops required to access an entity that was created at the specified creation node. Two tables are provided, one for entities created on node N3, and another for entities created on node N6. The tables for the other nodes can be calculated by simple exchange of node names.


Entity at Creating Node

Creating            Accessing Node
Node        N3   N4   N5   N6   N7   N8   N9
N3           0    2    2    4    4    4    4
N4           2    0    2    4    4    4    4
N5           2    2    0    4    4    4    4
N6           4    4    4    0    2    2    2
N7           4    4    4    2    0    2    2
N8           4    4    4    2    2    0    2
N9           4    4    4    2    2    2    0

Entity Created at Node N3

Current             Accessing Node
Node        N3   N4   N5   N6   N7   N8   N9
N3           0    0    0    0    0    0    0
N4           2   -2    0    0    0    0    0
N5           2    0   -2    0    0    0    0
N6           4    2    2   -4   -2   -2   -2
N7           4    2    2   -2   -4   -2   -2
N8           4    2    2   -2   -2   -4   -2
N9           4    2    2   -2   -2   -2   -4

Entity Created at Node N6

Current             Accessing Node
Node        N3   N4   N5   N6   N7   N8   N9
N3          -4   -2   -2    4    2    2    2
N4          -2   -4   -2    4    2    2    2
N5          -2   -2   -4    4    2    2    2
N6           0    0    0    0    0    0    0
N7           0    0    0    2   -2    0    0
N8           0    0    0    2    0   -2    0
N9           0    0    0    2    0    0   -2

Figure 9.6. Extra Message Hops Due to Entity Movement

In general, the maximum number of hops required to send a message to an entity is the sum of the depth of the branches of the minimal sub-tree containing both the requesting and requested nodes. Since the example network is very balanced the tables are very regular.
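A small sketch makes the hop count concrete for the example network; the parent table encodes the nine node tree used throughout the thesis (N0 at the root, N1 and N2 below it, N3-N5 under N1 and N6-N9 under N2) and is an assumption of this illustration rather than part of the implementation.

    /* parent[n] is the parent of node Nn; -1 marks the root N0. */
    static const int parent[10] = { -1, 0, 0, 1, 1, 1, 2, 2, 2, 2 };

    static int depth(int n)
    {
        int d = 0;
        while (parent[n] >= 0) { n = parent[n]; d++; }
        return d;
    }

    /* Hops between two nodes: walk the deeper node upwards until they meet. */
    static int hops(int a, int b)
    {
        int h = 0;
        while (a != b) {
            if (depth(a) >= depth(b)) a = parent[a];
            else                      b = parent[b];
            h++;
        }
        return h;   /* e.g. hops(3, 6) == 4, as in the first table of Figure 9.6 */
    }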

9.3 Further Work

Any project is an ongoing operation: questions are raised and avenues left unexplored. This work has raised the following interesting questions, which time did not permit investigating fully. They are suitable for future work.

9.3.1 Partial Naming

The naming scheme in Grasshopper might be adaptable to support partial naming. Since there is a global network name, any portion of a name can, conceivably, be used to route a message within the network specified by the topmost domain given in the partial name. Partial naming allows a name in the same sub-tree to be specified in a partial fashion: node N8 can specify node N7 as N7.net2 instead of N7.net2.GHnet. This allows a complete network to be incorporated into another network without all existing node names immediately becoming invalid.

This work would have to investigate where the changeover of partial to full names would occur. A starting point would be that every message passing towards the root of the global tree from the root of a sub-tree would have its names examined, and any local ones promoted to global names. This appears to be an expensive operation, but how expensive?
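A sketch of such a promotion step is given below, purely as an illustration of the idea; the suffix handling and names are assumptions, and the real cost question raised above is about where and how often this is done.

    #include <stdio.h>
    #include <string.h>

    /* Append the global suffix to a partial name that does not already carry
     * it, e.g. promote_name("N7.net2", ".GHnet", buf, sizeof buf)
     * -> "N7.net2.GHnet". */
    static const char *promote_name(const char *name, const char *global_suffix,
                                    char *out, size_t outlen)
    {
        size_t nl = strlen(name), sl = strlen(global_suffix);

        if (nl >= sl && strcmp(name + nl - sl, global_suffix) == 0)
            return name;                      /* already a full, global name   */
        snprintf(out, outlen, "%s%s", name, global_suffix);
        return out;                           /* promoted at the sub-tree root */
    }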

9.3.2 Relaxing the Tree Structure

The tree structure of the Grasshopper network ensures that causality is preserved across the network. Whilst this allows causal delivery of messages it is not without associated problems: redundant links and nodes are not allowed, and broadcast networks cannot be fully utilised. Instead of being treated as a fully connected network, a broadcast network is used only to connect the children to their parent.

In a tree structure the loss of a node or link is catastrophic. The network is partitioned at that point. Redundant links and nodes are used to reduce this risk. However, the addition of a redundant link or node breaks the tree structure.

A broadcast network, such as ethernet, can be modelled as a collection of redundant links between all the attached nodes. With a strict tree structure two nodes on a single ethernet may have to send messages through a third node to communicate.

It should be possible to allow the use of redundant links, alongside point to point links, and still preserve causality. There are two avenues that should be explored. First, can we reintroduce the tagging of messages with causality information? Each message would carry a tag, a causality matrix, containing the causality information for the sub-network the message is traversing. The tag is replaced when the message moves from one sub-network into another. Instead of tagging every message, we only tag messages that pass between nodes that have redundant links.

By allowing the tagging protocol to be used on local area networks, but still retaining the tree structure for outside communications, a Grasshopper network might limit the size of the causality matrix appended to each message to the number of nodes on the local network.
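One textbook style delivery check for such a matrix tag is sketched below; the interpretation of the matrix (sent[i][j] counts messages from node i to node j known to the sender) and the fixed LOCAL_NODES bound are assumptions of this sketch, not part of the thesis design.

    #define LOCAL_NODES 8

    struct causal_tag { int sent[LOCAL_NODES][LOCAL_NODES]; };

    /* delivered[i][j]: messages from i to j already delivered at this node.
     * A message from `sender` may be delivered at node `me` only when it is
     * the next one expected from the sender and every message the sender had
     * already seen addressed to `me` has been delivered here too. */
    static int causally_ready(const struct causal_tag *tag, int sender, int me,
                              int delivered[LOCAL_NODES][LOCAL_NODES])
    {
        if (delivered[sender][me] + 1 != tag->sent[sender][me])
            return 0;                   /* a message from the sender is missing */
        for (int k = 0; k < LOCAL_NODES; k++)
            if (k != sender && delivered[k][me] < tag->sent[k][me])
                return 0;               /* an earlier message from k is missing */
        return 1;                       /* delivery cannot violate causality    */
    }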

The second approach that might allow the direct use of broadcast networks is to update the NAT at each node in the broadcast network. This may allow the routing algorithm to ensure that messages do not pass each other within the broadcast network. This would require total ordering on NAT updates in the local network.

The performance advantages of using broadcast networks and redundant links are significant. The problem is that the routing algorithm would have to be modified to allow this functionality. The routing algorithm should be adaptable, probably by keeping the NAT at each node in the broadcast network identical. These approaches should be examined and tested.

9.3.3 Message Reconstruction for Routing

It is recognised that reconstructing a message that is going to be forwarded on another link is potentially wasteful. Efficiency gains can be realised by allowing messages to remain fragmented while being routed, and only reconstructing them at their destination.

Allowing messages to remain fragmented through routing nodes complicates the link protocol. If messages remain fragmented, there is potentially an unbounded number of messages arriving simultaneously along one link. Figure 9.7 shows an example of two messages for the same destination arriving simultaneously. Two messages are sent to node N4: message M1 is sent by node N5 and message M2 by node N3. The routing node, N1, will interlace the packets of these messages, and the packets will be passed on to node N4 interspersed. Node N4 would be responsible for assigning the packets to the correct messages.

[Diagram omitted: the nine node tree (N0 at the root, N1 and N2 below it, N3-N9 as leaves) with message M1 from node N5 and message M2 from node N3 both travelling towards node N4 through node N1.]

Figure 9.7. Fragmented Message Recombination

Since we are currently guaranteed that only one message is being sent on a link, only the first packet of a message carries identification. For example, the penultimate packets of two different messages would be indistinguishable: each has a fragment identifier of one. Additional information would be required to distinguish the packets of each message.
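One remedy, sketched below, is to extend the per-packet header with a message identifier so that interleaved fragments can be assigned to the correct reassembly buffer; this is an illustration of the option being discussed, not a design adopted by the thesis.

    #include <stdint.h>

    struct gh_frag_header {
        uint32_t seq;          /* link sequence number                        */
        uint32_t return_seq;   /* piggybacked acknowledgement                 */
        uint32_t message_id;   /* added: distinguishes interleaved messages   */
        uint16_t multi_id;     /* count-down fragment position, as before     */
        uint16_t control;      /* link control flag                           */
    };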

The tradeoffs involved in reconstructing messages at the intermediate nodes are the additional processing and memory at intermediate nodes against the increased packet size and more complex processing at the final destinations. The best choice is dependent on the number and length of multi-packet messages. This can only be determined by examining a live system.

9.4 Summary

The Grasshopper Network is implemented as a multi-layer protocol. The link layer provides ``in order'' reliable communication of arbitrarily sized messages between adjacent nodes. The Routing layer moves messages between kernels, allowing for the redirection of messages when entities move, or replication of messages when entities are distributed.

In conjunction with these layers, additional protocols can be used. The Kernel protocol builds on the Routing protocol, and a Testing protocol was constructed to allow analysis of the network.

The network has been implemented as a simulation using UDP packets instead of ethernet packets. Tests were performed using DEC Alpha machines and Sun SS1000 machines. Using the Grasshopper protocol between two Alphas achieved 84% of the theoretical bandwidth utilisation. The Sun tests effectively stressed the retry and fault recovery of the network.

The Routing layer was tested on the simulator by moving, replicating and deleting entities. Each of the operations provided by the Routing layer was tested, both individually and in combination, with the Routing layer correctly updating the NAT on each node affected.


10 Conclusion

Grasshopper is an experimental system designed to examine issues relating to persistent systems. This thesis investigated the problems that had to be solved with relation to the movement of data in Grasshopper. Two areas of data movement were studied: data movement on a single node, between memory and disks, and data movement between peer nodes across a network.

Persistent systems require solutions different to those available for non-persistent systems, as both the constraints and the opportunities are different. The different model of traffic between the disks and memory provided an opportunity to design an efficient disk system. The traffic between disks and memory consisted wholly of fixed sized blocks, each block aligned on memory page boundaries. No file system is implemented by the kernel and there are no low level user drivers. The disks do not hold kernel-protected meta data.

This thesis describes a new model for data movement between main memory and backing store. The model is implemented as a stackable module protocol which allows flexible control over the transfer of pages between disk storage and main memory. The modules are small, fast and very efficient. They allow the easy design and construction of customised processing structures that match the requirements of an application.

We presented a distribution and routing protocol radically different from other systems. Distribution in Grasshopper may be fully transparent using DSM, or may be visible through the use of RPC. Shared files are replaced by distributed containers and RPC is provided via remote invocation.

Both invocation and distributed containers vary from their non-distributed counterparts in the need for causality. In conventional systems RPC and NFS are implemented without causality considerations. When a request arrives it is processed, irrespective of missing or delayed requests.

This laissez-faire attitude towards causality is accepted by the community because of the gross nature of the objects being manipulated, and the relative lack of cross object links. Also there are very few distributed computations occurring. Despite the advances in distributed systems, most computation is still sequential on a single node.

Persistent systems have fine-grained objects, and a plethora of cross-object links. Distributed persistent systems offer the paradigm shift to allow easy programming of complex distributed tasks. These departures from conventional systems result in different demands placed on the network. The network has to guarantee causal delivery of messages to avoid every user level program implementing causality.

The routing and distribution protocols presented in this thesis combine to form a seamless integration of entity movement and causal delivery. An entity can move between nodes in the network whilst messages for that entity, and other entities, are guaranteed to be delivered in causal order. These novel protocols remove the requirement for a matrix of additional data to be sent with each message, replacing the matrix with a single sequence number. This is achieved by using a strict tree structured network layout.

We believe that the results presented in this thesis demonstrate that persistent systems can be used to provide an efficient base for both general purpose and algorithm-specific computing. The causality requirement of Grasshopper does not hinder good performance, and the benefits conferred by causality, in allowing more robust software design, outweigh the performance degradation.



APPENDIX


A Protocol Examples for Moving Entities

This appendix presents four different case studies. The movement of an entity is examined under the following conditions:

• The simple movement of an entity from one node to another node.

• The replication of an entity to another node, involving the creation of a new ESR.

• The deletion of a replicated entity, involving the deletion of the current ESR.

• The return home of an entity that had moved.

A.1 Movement of a Single Entity from the Home Node

In the first worked example there is an entity on its home node. It is about to move off that node onto another node. Because the entity is currently on its home node there are no NAT entries for the entity in the system. This is the usual case; most entities will exist only on their home node.

The entity is about to move to a new node. This will require the creation of NAT entries, and the holding of messages for the entity. Since the entity is on its home node, the home node, the current node, and the current ESR are all the same.
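For these worked examples it may help to picture a NAT entry roughly as follows; the field names are illustrative only, chosen to match the flags mentioned in the text rather than taken from the kernel sources.

    #include <stdint.h>

    struct gh_message;                 /* held message, queued for the entity  */

    struct nat_entry {
        uint64_t entity;               /* entity this entry redirects          */
        uint32_t current_node;         /* next hop towards the entity          */
        unsigned hold     : 1;         /* hold messages while a move proceeds  */
        unsigned busy     : 1;         /* set at the ESR during the protocol   */
        unsigned instance : 1;         /* an instance of the entity is here    */
        struct gh_message *held;       /* queue of held messages               */
    };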

Figure A.1 shows the protocol steps and messages. The home node is N3, and the entity is about to move to node N5. The steps are shown below, and are related to the algorithm given in Section 8.7.1.

[Diagram omitted: four panels of the nine node tree, one per phase, showing message 1 and trigger message 4 in Phase 1, messages 5 and 6 in Phase 2, messages 7, 8, 9 and 10 in Phase 3, and message 14 in Phase 4.]

Figure A.1. Movement of an Entity between Nodes

The protocol steps are as follows.

1. Phase 1 involves preparing the network for the move by priming the ESR, and removing unnecessary ESN nodes.

• Since the entity has never moved there is no NAT entry. The NAT entry is created, and the hold flag is set in the entry.

• A message (1) is sent to the ESR, and the busy and hold flags are set at the ESR.

• Since the current node is the home node no action is taken to remove extraneous ESN nodes.

• A trigger message (4) is queued for the protocol state machine.

2. Phase 2 involves the movement of the entity across the network.

• The protocol state machine moves the entity to the new node (5), where a NAT entry is created and populated for the entity. The instance flag is cleared on the home node, as no instance of the entity remains there.


If this were a create rather than a move, the instance flag would remain, and a copy of the entity would have been sent.

• When the entity is safely on the new node a completion message is returned (6) to the ESR. The ESR calculates that it is no longer at the root of the sub-tree, and as such must move outwards.

3. Phase 3 involves the outward movement of the ESR.

• The ESR is moved using the three message protocol. Message (7) is used to initiate the move, message (8) is sent back in response, and message (9) moves the ESR data to the new ESR.

• Since the ESR has reached the new outermost node the building of the chain of ESN nodes to the new node is initiated. This sends message (10) to the current node, where the chain terminates.

4. Phase 4 involves the inward movement of the ESR.

• Since the ESR is already at the root of the smallest sub-tree that contains both instances of the entity, no action is taken to move the ESR.

• A message (14) is finally sent to the new current node, which releases the entity, and computation continues.

A.2 Replication of an Entity, Creating a New ESR

In this example there is a distributed entity. It has instances on nodes N3 and N5. The instance on N5 is about to create a new instance on node N6. The ESR is currently on node N1, but will have to move to node N0, with the current ESR being relegated to an ESN.

Figure A.2 shows the protocol steps and messages.

[Diagram omitted: four panels of the nine node tree, one per phase, showing messages 1 and 4 in Phase 1, messages 5 and 6 in Phase 2, messages 7, 8, 9 and 10 in Phase 3, and messages 10 and 14 in Phase 4.]

Figure A.2. Replication of an Entity between Nodes

1. Phase 1 involves preparing the network for the move by priming the ESR, and removing unnecessary ESN nodes.

• The hold flag is set in the NAT entry on the current node.

• A message (1) is sent to the ESR, and the busy and hold flags are set at the ESR.

• Since the action is to copy an entity there will be no extraneous ESN nodes.

• A trigger message (4) is queued for the protocol state machine.


2. Phase 2 involves the movement of the entity across the network.

• The protocol state machine copies the entity to the new node (5), where a NAT entry is created and populated for the entity.

• When the entity is safely on the new node a completion message is returned (6) to the ESR. The ESR calculates that it is no longer at the root of the sub-tree, and as such must move outwards.

3. Phase 3 involves the outward movement of the ESR.

• The ESR is moved using the three message protocol. Message (7) is used to initiate the move, message (8) is sent back in response, and message (9) moves the ESR data to the new ESR.

• Since the ESR has reached the new outermost node the building of the chain of ESN nodes to the new node is initiated. This sends message (10) to node N2, where a NAT entry is created, and then on to the new current node, where the chain terminates.

4. Phase 4 involves the inward movement of the ESR.

• Since the ESR is already at the root of the smallest sub-tree that contains both instances of the entity, no action is taken to move the ESR.

• A message (14) is finally sent to the new current node, which releases the entity, and computation continues. This message is guaranteed to arrive after message 10, as messages cannot overtake one another, and message 10 is forwarded to any child node in an atomic action.

A.3 Deletion of an Entity, Deleting the Current ESR

In this example there is a distributed entity. It has instances on nodes N5 and N6. The instance on N6 is about to be deleted. There is an ESN on N1, and the ESR is on N0. The ESR will be deleted, and the ESN will become the new ESR.

Figure A.3 shows the protocol steps and messages.

[Diagram omitted: four panels of the nine node tree, one per phase, showing the priming and deletion chain messages 1, 2, 3 and 4 in Phases 1 and 2 together with the completion message 6, and the three message ESR move (11, 12 and 13) in Phase 4.]

Figure A.3. Deletion of an Entity, Deleting the Current ESR

1. Phase 1 involves preparing the network for the move by priming the ESR, and removing unnecessary ESN nodes.

• The hold flag is set in the NAT entry on the current node.

• A message (1) is sent to the ESR, and the busy and hold flags are set at the ESR.

• Since the action is to delete an entity, extraneous ESN nodes will be deleted. The ESR sends a message (2) to the current node, which passes this message to its parent. The parent deletes N6 from its current node list, and passes the message on to its parent, N0. N0 deletes N2 from its current node list, and since it is the ESR the deletion chain finishes here.


• The chain finishes by sending a message (3) to the ESR, which allows the ESR to know the deletion chain has finished. If node N2 had a second child, for instance if an instance existed on N9, then message (3) would have been generated by N2 instead, but would still have been passed to N0.

• A trigger message (4) is queued for the protocol state machine.

2. Phase 2 involves the movement of the entity across the network.

• Since the function is a deletion, there is no movement of entity data across the network. Instead N6 deletes the entity, and removes the NAT entry and any saved messages. These messages can be deleted, as the instance is being deleted. Any other instances will have received their copies of the messages.

• When the entity is deleted a completion message is returned (6) to the ESR. The ESR calculates that it is no longer at the root of the sub-tree, and as such must move inwards.

3. Phase 3 involves the outward movement of the ESR.

• Since the ESR is already past the root of the new sub-tree no outward movement is performed.

4. Phase 4 involves the inward movement of the ESR.

• Since the ESR only has one child it is no longer at the root of the minimal spanning sub-tree. The ESR has to move towards the home node, until it is again the root of the sub-tree. The ESR moves from N0 to N1 using the three message protocol. When the ESR reaches N1 it is now the root and as such stops the inward movement.

• Since the action was a deletion there is no entity to wake, so no continue message is sent.

A.4 Return of an Entity to the Home Node

In this example there is a moved entity. Its instance is on node N5, and its home node is N3. The instance on N5 is about to move back to N3, which will remove all NAT entries for the entity from the network. There is currently an ESR on N1 which will be deleted.

Figure A.4 shows the protocol steps and messages.

[Diagram omitted: four panels of the nine node tree, one per phase, showing the priming and deletion chain messages 1, 2, 3 and 4 in Phase 1, the completion message 6 in Phase 2, and the three message ESR move (11, 12 and 13) in Phase 4.]

Figure A.4. Return of an Entity to the Home Node

1. Phase 1 involves preparing the network for the move by priming the ESR, and removing unnecessary ESN nodes.

• The hold flag is set in the NAT entry on the current node.

• A message (1) is sent to the ESR, and the busy and hold flags are set at the ESR.

• Since the action is to delete an entity, extraneous ESN nodes will be deleted. The ESR sends a message (2) to the current node, which passes this message to its parent (the ESR). The ESR deletes N5 from its current node list, and since N1 is the ESR the deletion chain finishes here.


• The chain finishes by sending a message (3) to the ESR, which allows the ESR to know the deletion chain has finished.

• A trigger message (4) is passed to N5.

2. Phase 2 involves the movement of the entity across the network.

• Since the function is a deletion, there is no movement of entity data across the network. Instead N5 deletes the entity, removes the NAT entry and any saved messages. These messages can be deleted, as the instance is being deleted. Any other instances will have received their copies of the messages.

• When the entity is deleted a completion message is returned (6) to the ESR. The ESR calculates that it is no longer at the root of the sub-tree, and as such must move inwards.

3. Phase 3 involves the outward movement of the ESR.

• Since the ESR is already past the root of the new sub-tree no outward movement is performed.

4. Phase 4 involves the inward movement of the ESR.

• Since the ESR only has one child it is no longer at the root of the minimal spanning sub-tree. The ESR has to move towards the home node, until it is again the root of the sub-tree. The ESR moves from N1 to N3 using the three message protocol. When the ESR reaches N3 it is now on the home node.

• Since the ESR reached the home node there is only one instance on the network, that at the home node. The protocol ensures the entity on the home node is running, and then, if there are no requests for the entity, deletes the NAT entry on the home node. After this deletion there will be no NAT entries for the entity anywhere in the network.

This is the original state of the NAT tables for this entity.

Data Movement in the Grasshopper Operating System

by

Rex Monty di Bona

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Basser Department of Computer Science, University of Sydney

March, 1995

Copyright © 1995 by Rex M. di Bona

ABSTRACT

Operating systems are fundamentally responsible for the movement of data. Persistent systems allow, and require, paradigms for the movement of data different from those of conventional non-persistent operating systems. The work described in this thesis is based on the Grasshopper Operating System, an experimental, distributed, persistent operating system being developed at the University of Sydney. Grasshopper uses different methods to perform data movement depending on whether the movement is intra-nodal or inter-nodal. Within a node data movement is between stable storage and main memory; between nodes data movement is achieved by the movement of data packets across a network. This thesis describes a novel and flexible modular system for providing user level control of data movement between disks and memory. The methods provided allow the use of insecure modules, possibly written by untrusted third parties, to perform data movement on secure data. The modular system does not rely on the copying of data, but instead uses the memory management primitives of Grasshopper to pass references to main memory between modules. Between network nodes, Grasshopper allows the use of both remote procedure calls and distributed shared memory to move data. These two high level abstractions are implemented as kernel to kernel messages. Grasshopper allows its fundamental entities, loci and containers, to migrate across the network, and to distribute themselves on multiple nodes. This thesis presents new network protocols which provide causally consistent kernel to kernel message delivery to ensure that entities see a consistent view of the global system. Furthermore, messages addressed to entities are delivered in a causally correct order, irrespective of any movement of the entity.

ACKNOWLEDGEMENTS

Thank you Karen, without whom this would never have been completed.

Many people helped in the production of this thesis. I wish to thank my supervisor, John Rosenberg, who provided the guidance and encouragement that enabled me to develop the ideas presented in this thesis. I also wish to thank the other members of the Grasshopper Group; Al Dearle, Matty Farrow, Anders Lindström, Frans Henskens, Dave Hulse, Stephen Norris, and Francis Vaughan, and also Alan Fekete, who provided invaluable assistance with the theoretical aspects of this work. The times we were yelling at each other were fun times indeed. The Grasshopper Project has been supported by both the ARC, with grant A49130439, and Digital Equipment Corporation, with an equipment grant under the Alpha Innovators Program. I wish to thank my long suffering parents, who have wondered if this was ever going to finish, and whether I could go and ``get a real job''. Mum, Dad, look, it's done! To Ray and Cecilia, it's time to work on the kitchen! What are you doing this weekend? And finally, to my friends and colleagues at Basser, phew!

