Making Applications Mobile Under Linux

Cédric Le Goater, Daniel Lezcano, Clément Calmels
IBM France
{clg, dlezcano, clement.calmels}@fr.ibm.com

Dave Hansen, Serge E. Hallyn
IBM Linux Technology Center
{haveblue, serue}@us.ibm.com

Hubertus Franke
IBM T.J. Watson Research Center
[email protected]

Abstract

Application mobility has been an operating system research topic for many years. Many approaches have been tried and solutions are found across the industry. However, performance remains the main issue and all the efforts are now focused on performant solutions. In this paper, we discuss a prototype which minimizes the overhead at runtime and the amount of application state. We examine constraints and requirements to enhance performance. Finally, we discuss features and enhancements in the Linux kernel needed to implement migration of applications.

1 Introduction and Motivation

Applications increasingly run for longer periods of time and build more context over time as well. Recovering that context can be time consuming, depending on the application, and usually requires that the application be re-run from the beginning to reconstruct it. A few applications now provide the ability to checkpoint their data or context to a file, enabling the application to be restarted later in the case of a failure, a system upgrade, or a need to redeploy hardware resources. This ability to checkpoint context is most common in what is referred to as the High Performance Computing (HPC) environment, which is often composed of large numbers of computers working on a distributed, long running computation. The applications often run for days or weeks at a time, some even as long as a year.

Even outside the HPC arena, there are many applications which have long start up times: long periods of processing configuration files, pre-computing information, and so on. Historically, emacs was built with a script which included undump—the ability to checkpoint the full state of emacs into a binary which could then be started much more quickly. Some enterprise class applications have thirty minute start up times, and those applications continue to build complex context as they run.

Increasingly, we as users tend to expect that our applications will perform quickly, start quickly, restart quickly on a failure, and be always available. However, we also expect to be able to upgrade our operating systems, apply security fixes, add components, memory, sometimes even processing power, without losing all of the context that our applications have acquired.

This paper discusses a generic mechanism for saving the state of an application at any point, with the ability to later restart that application exactly where it left off. This ability to save state and restart an application is typically referred to as checkpoint/restart, abbreviated throughout as CPR. This paper focuses on the key areas for allowing applications to be virtualized, simplifying the ability to checkpoint and later restart an application. Further, the technologies covered here would allow applications to potentially be restarted on a different operating system image than the one from which they were checkpointed. This provides the ability to move an application (or even a set of applications) dynamically from one machine or virtual operating system image to another.

Once the fundamental mechanisms are in place, this technology can be used for more advanced capabilities such as checkpointing a cluster-wide application—in other words, synchronizing and stopping a coordinated, distributed application, and restarting it. Cluster-wide CPR would allow a site administrator to install a security update or perform scheduled maintenance on the entire cluster without impacting the running application.

Also, CPR would enable applications to be moved from host to host depending on system load. For instance, an overloaded machine could have its workload rebalanced by moving an application set from one machine to another that is otherwise underutilized. Or, several systems which are underloaded could have their applications consolidated onto a single machine. CPR plus migration will henceforth be referred to as CPRM.

Most of the capabilities we've highlighted here are best enabled via application virtualization. Application virtualization is a means of abstracting, or virtualizing, the software resources of the system. These include such things as process ids, IPC ids, network connections, memory mappings, etc. It is also a means to contain and isolate the resources required by the application, to enable its mobility. Compared to the virtual machine approach, application virtualization minimizes the application state to be transferred and also allows for a higher degree of resource sharing between applications. On the other hand, it has limited fault containment when compared to the virtual machine approach.

We built a prototype, called MCR, by modifying the Linux kernel and creating such a layer of containment. There are various other projects with similar goals, for instance VServer [8], OpenVZ [7], and the dated Linux implementation of BSD Jails [4]. In this paper, we will describe our experiences from implementing MCR and examine the many commonalities of these projects.

2 Related Work in CPR

CPR is theoretically simple: stop execution of the task and store the state of all memory, registers, and other resources. To restart, reload the executable image, load the state saved during the checkpoint, and restart execution at the location indicated by the instruction pointer register. In practice, complications arise due to issues like inter-process sharing, security implications, and the ways that the kernel transparently manages resources. This section groups some of the existing solutions and reviews their shortcomings.

Virtual machines control the entire system state, making CPR easy to implement. The state of all memory and resources can simply be stored into a file, and recreated by the machine emulator or the operating system itself. Indeed, the two most commonly mentioned VMs, VMware [9] and Xen [10], both enable live migration of their guest operating systems. The drawback of CPRM of an entire virtual machine is the increased overhead of dealing with all the resources defining the VM. This can make the approach unsuitable for load balancing applications, since a requirement to add the overhead of a full VM and associated daemons to each migrateable application can have tremendous performance implications. This issue is further explored in Section 6.

A lighter weight CPRM approach can be achieved by isolating applications, which is predicated on the safe and proper isolation and migration of their underlying resources. In general, we look at these isolated and migrateable units as containers around the relevant processes and resources. We distinguish conceptually between system containers, such as VServer [8] or OpenVZ [7], and application containers, such as Zap [12] and our own prototype MCR. Since containers share a single OS instance, many resources provided by the OS must be specially isolated. These issues are discussed in detail in Section 5. Common to both container approaches is the requirement to be able to CPR an isolated set of individual resources.

For many applications, CPR can be achieved completely from user space. An example implementation is ckpt [5]. Ckpt teaches applications to checkpoint themselves in response to a signal, by either preloading a library or injecting code after application startup. The new code, when triggered, writes out a new executable file. This executable reloads the application and resets its state before continuing execution where it left off. Since this method is implemented with no help from the kernel, there is state which cannot easily be stored, such as pending signals, or recreated, such as a process' original process id. This method is also potentially very inefficient, as described in Section 5.2. Such user space approaches also fall short by requiring applications to be rewritten and by exhibiting poor resource sharing.

CPR becomes much easier given some help from the kernel. Kernel-based CPR solutions include Zap [12], CRAK [2], and our MCR prototype. We will analyze MCR in detail in Section 4, followed by the requirements for a consolidated application virtualization and migration kernel approach.

3 Concepts and principles

A user application is a set of resources—tasks, files, memory, IPC objects, etc.—that are aggregated to provide some features. The general concept behind CPR is to freeze the application and save all its state (in both kernel and user space) so that it can be resumed later, possibly on another host. Doing this transparently, without any modification to the application code, is quite an easy task as long as you maintain a single strong requirement: consistency.

3.1 Consistency

The kernel ensures consistency for each resource's internal state and also provides system identifiers to user space to manipulate these resources. These identifiers are part of the application state, and as such are critical to the application's correct behavior. For example, a process waiting for a child will use the known process id of that child. If you were to resume a checkpointed application, you would have to recreate the child's pid to ensure that the parent waits on the correct process. Unfortunately, Linux does not provide such control over how pids, or many other system identifiers, are associated with resources.
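As a minimal illustration of this consistency requirement (our own example, not code from any of the prototypes discussed later), consider a parent that caches the pid returned by fork():

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		sleep(60);		/* the child does some work */
		_exit(0);
	}

	/*
	 * The parent holds the child's pid in plain user-space
	 * memory.  If a checkpoint were taken here and the pair
	 * restarted with a different pid for the child, this
	 * waitpid() would target the wrong (or a nonexistent)
	 * process.  Either the child must be recreated with the
	 * very same pid, or pids must be virtualized.
	 */
	if (waitpid(child, NULL, 0) < 0)
		perror("waitpid");
	return 0;
}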

An interesting approach to this problem is virtualization: a resource can be associated with a supplementary virtual system identifier for user space. The kernel can maintain associations between virtual and system identifiers and offer interfaces to control the way virtual identifiers are assigned. This makes it possible to change the underlying resource and its system identifier without changing the virtual identifier known by the application. A direct side effect is that such virtualized applications can be confused by virtual identifier collisions if they are not separated from one another. These conflicts can be avoided if the virtualization is implemented with resource containment features. For example, /proc should only export virtual pid entries for processes in the same virtualized container as the reading process.

3.2 Process subsystem

Processes are the essential resources. They offer many features and are involved in many relationships, such as parent, thread group, and process group. The pid is involved in many system calls and regular UNIX features such as session management and shell job control. Correct virtualization must address the whole picture to preserve existing semantics. For example, if we want to run multiple applications in different containers from the same login session, we will also want to keep the same system session identifier for the ancestor of the container, so as to still benefit from the regular session cleanup mechanism. The consequence is that we need a system pid to be virtualized multiple times in different containers. This means that any kernel code dealing with pids that are copied to/from user space must be patched to provide containment and to choose the correct virtualization space according to the implied context.

3.3 Filesystem and Devices

A filesystem may be divided into two parts: on one hand, global entries that are visible from all containers, like /usr; on the other hand, local entries that may be specific to a container, like /tmp and of course /proc. /proc virtualization for numerical process entries is quite straightforward: simply generate a filename out of the virtual pid. Tagging the resultant dentries with a container id makes it possible to support conflicting names in the directory name cache and to filter out unrelated entries during readdir().

The same tagging mechanism can be applied using file attributes, for instance. Some user level administration commands can be used by the system administrator to keep track of files created by the containers. Device files can be tagged to be visible and usable by dedicated containers. And of course, the mknod system call should fail in containers or be restricted to a minimal usage.

3.4 Network

If the application is network oriented, the migration is more complex because resources are seen from outside the container, such as IP addresses, communication ports, and all the underlying data related to the TCP protocol. Migration needs to take all these resources into account, including in-flight messages.

The IP address assigned to a source host should be recreated on the target host during the migration. This mobile IP is the foundation of the migration, but adds a constraint on the network: we will need to stay on the same network.

The migration of an application will be possible only if the communication channels are clearly isolated. The connections and the data associated with each application should be identified. To ensure such a containment, we need to isolate the network interfaces.

We will need to freeze the TCP layer before the checkpoint, to make sure the state of the peers is consistent with the snapshot. To do this, we will block network traffic for both incoming and outgoing packets. The application will be migrated immediately after the checkpoint and all the network resources related to the container will be cleaned up.

4 Design Overview

We have designed and implemented MCR, a lightweight application oriented container which supports mobility. It is discussed here because it is one of a few implementations which are relatively complete. It provides an excellent view of what issues and complexities arise. However, from our prototype work, we have concluded that certain functionality implemented in user space in MCR is best supported by the kernel itself.

An important idea behind the design of MCR is that it is kernel-friendly and does not do everything in kernel space. A balance needed to be struck between striving towards a minimal kernel impact, to facilitate proper forward ports, and ensuring that functional correctness and acceptable performance are achieved. This means using available kernel features and mechanisms where possible, and not violating important principles which ensure that user space applications work properly.

With that principle in mind, CPR from user space makes your life much easier. It also enables some nifty and useful extensions like distributed CPR and system management.

4.1 Architecture

This section provides an overview of the MCR architecture (Figure 1). It relies on a set of user level utilities which control the container: creation, checkpoint, restart, etc. New features in the kernel and a kernel module are required today to enable the container and the CPR features.

[Figure 1: MCR architecture. Processes 1..n and the mcr command line tool, with the mcrp plugin, run in user space inside container 1; they reach the mcrk kernel module through the /dev/mcr syscall interface and the MCR kernel APIs in kernel space.]

The CPR of the container is not handled by one component. It is distributed across the three components of MCR depending on the locality of the resource to checkpoint. The user level utility mcr (Section 4.1.2) invokes and orchestrates the overall checkpoint of a container. It is also in charge of checkpointing the resources which are global to a container, like SYSV shared memory for instance. The user level plugin mcrp (Section 4.1.4) checkpoints the resources at the process level, such as memory, and at the thread level, such as signals and cpu state. Both rely on the kernel module mcrk (Section 4.1.3) to access kernel internals.

4.1.1 Kernel enhancements and API

Despite a user space-oriented approach, the kernel still requires modifications in order to support CPR. But, surprisingly, it may not be as much as one might expect. There are three high level needs.

The biggest challenge is to build a container for the application. The aim here is neither security nor resource containment, but making sure that the snapshot taken at checkpoint time is consistent. To that end we need a container much like the VServer [8] context. This will isolate and identify all kernel resources and objects used by an application. This kernel feature is not only a key requirement for application mobility, but also for other frameworks in the security and resource management domains. The container will also virtualize system identifiers to make sure that the resources used by the application do not overlap with other containers. This includes resources such as process IDs, thread IDs, SysV IPC IDs, UTS names, and IP addresses, among others.

The second need regards freezing a container. Today, we use the SIGSTOP signal to freeze all running tasks (a minimal user-space sketch of the signalling appears after Section 4.1.3). This gives valid results, but for upstream kernel development we would prefer to use a container version of the refrigerator() service from swsusp, which uses fake signals to freeze nearly all kernel tasks before dumping the memory.

Finally, MCR exposes the internals of different Linux subsystems and provides new services to get and set their state. These interfaces are by necessity very intrusive, and expose internal state. For an upstream migration solution, using a /proc or /sys interface which exposes more selective data would be more appropriate.

4.1.2 User level utilities

Applications are made mobile simply by being started under mcr-execute. The container is created before the exec() of the command starting the application. It is maintained around the application until the last process dies. mcr-checkpoint and mcr-restart invoke and orchestrate the overall checkpoint or restart of a container. These commands also perform the get and set of the resources which are global to a container, as opposed to those local to a single process within the container. For instance, this is where the SYSV IPC and file descriptors are handled. They rely on the kernel module (Section 4.1.3) to manage the container and access kernel internals.

4.1.3 Kernel module

The kernel module is the container manager in terms of resource usage and resource virtualization. It maintains a real time definition of the container view around the application, which ensures that a checkpoint will be consistent at any time.

It is in charge of the global synchronization of the container during the CPR sequence. It freezes all tasks running in the container, and maps a user level plugin into each process (see Section 4.1.4). It provides the synchronization barriers which unroll the full sequence of the checkpoint before letting each process resume its execution.

It also acts as a proxy to capture the states which cannot be captured directly from user space. Internal states handled by this module include, for example, the process' memory page mapping, socket buffers, clone flags, and AIO states.
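Section 4.1.1 mentioned that MCR currently freezes a container by signalling its tasks. The following is a minimal user-space sketch of that idea only; the pids array is hypothetical input from the container manager, and the real freeze logic lives in the kernel module.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Freeze (or thaw) every task of a container by signal.  This is a
 * sketch: enumeration of the container's tasks is assumed to be
 * done elsewhere, and a refrigerator-style freezer would replace
 * this in an upstream solution.
 */
static int container_freeze(const pid_t *pids, int count, int freeze)
{
	int i, sig = freeze ? SIGSTOP : SIGCONT;

	for (i = 0; i < count; i++) {
		if (kill(pids[i], sig) < 0) {
			perror("kill");
			return -1;
		}
	}
	return 0;
}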

4.1.4 User level plugin

When a checkpoint or a restart of a container is invoked, the kernel module maps a plugin into each process of the container. This plugin is run in the process context and is removed after completion of the checkpoint. It serves two purposes. The first is synchronization, which it orchestrates with the help of the kernel module. Secondly, it performs the get and set of states which can be handled from user space using standard syscalls. Such states include sigactions, memory mappings, and rlimits.

When all threads of a process enter the plugin, a master thread, not necessarily the main thread, is elected to handle the checkpoint of the resources at the process level. The other threads only checkpoint the resources at the thread level, like cpu state.

4.2 Linux subsystems CPR

The following subsections describe the checkpoint and the restart of the essential resources, without doing a deep dive into all the issues which need to be addressed. Section 5 will delve deeper into selected issues for the interested reader.

4.2.1 CPU state

Checkpointing the cpu state is indirectly done by the kernel because the checkpoint is signal oriented: the cpu state is saved by the kernel on the top of the stack before the signal handler is called. This stack is then saved with the rest of the memory. At restart, the kernel will restore the cpu state in sigreturn() when it jumps out of the signal handler.

4.2.2 Memory

The memory mapping is checkpointed from the process context, by parsing /proc/self/maps. However, some vm_area flags (e.g. MAP_GROWSDOWN) are not exposed through the /proc filesystem; these are read from the kernel using the kernel module. The same method is used to retrieve the list of the mapped pages for each vm_area, which significantly reduces the size of the snapshot.

Special pages related to POSIX shared memory and POSIX semaphores are detected and skipped. They are handled by the checkpoint of resources global to a container.
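To give an idea of the user-space half of this step, the sketch below (our illustration, not MCR code) parses /proc/self/maps to recover each area's range, permissions, offset, and backing file. The missing vm_area flags and the per-page lists mentioned above would still have to come from the kernel module.

#include <stdio.h>

/*
 * Walk /proc/self/maps and print each mapping's attributes.  A
 * checkpointer would record them so the areas can be recreated
 * with mmap() at restart time.
 */
static int walk_maps(void)
{
	char line[512], perms[8], path[256];
	unsigned long start, end, offset;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		path[0] = '\0';
		/* format: start-end perms offset dev inode [path] */
		sscanf(line, "%lx-%lx %7s %lx %*s %*s %255s",
		       &start, &end, perms, &offset, path);
		printf("area %08lx-%08lx %s offset=%lx file=%s\n",
		       start, end, perms, offset, path);
	}
	fclose(f);
	return 0;
}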

4.2.3 Signals

Signal handlers are registered using sigaction() and called when the process gets a signal. They are checkpointed and restarted using the same service in the process context. Signals can be sent to the process, or directly to an individual thread using the tkill() syscall. In the former case, the signal goes into a shared sigpending queue, where any thread can be selected to handle it. In the latter, the signal goes to a thread-private sigpending queue. To guarantee correct signal ordering, these queues must be checkpointed and restored separately using a dedicated kernel service.

4.2.4 Process hierarchy

The relationships between processes must be preserved across CPR sequences. Group and session leaders are detected and taken into account. At checkpoint, each process and thread stores in the snapshot its execution command using /proc/self/exe, and its pid and ppid using getpid() and getppid(). Threads also need to save their tid, ptid, and their stack frame. At restart time, the processes are recreated by execve() and immediately killed with the checkpoint signal. Each process then jumps into the user level plugin (see Section 4.1.4) and spawns its children. Each process also respawns its threads using clone(). The process tree is recreated recursively. On restart, attention must be paid to correctly setting the pid, pgid, tid, and tgid for each newly created process and thread.

4.2.5 Interprocess communication

The contents and attributes of SYSV IPCs, and more recently POSIX IPCs, are checkpointed as resources global to a container, excepting semundos.

Most of the IPC resource checkpoint is done at the user level using standard system calls: for example, mq_receive() to drain all messages from a queue, and mq_send() to put them back into the queue, as sketched below. The two main drawbacks to such an approach are that access times to resources are altered and that the process must have read and write access to them. Some functionalities like mq_notify() are a bit trickier. In these cases, the kernel sends notification cookies using an AF_NETLINK socket, which also needs to be checkpointed.
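The drain-and-refill idea looks roughly as follows for a POSIX message queue. This is our sketch rather than MCR's code: serialization to the snapshot and error handling are omitted, the fixed buffer is assumed to be at least mq_msgsize bytes, and the program links with -lrt.

#include <mqueue.h>
#include <sys/types.h>

#define MAX_MSGS 64

struct saved_msg {
	char buf[8192];		/* assumed >= the queue's mq_msgsize */
	ssize_t len;
	unsigned int prio;
};

/* Checkpoint: drain every queued message into the snapshot image. */
static int checkpoint_mqueue(mqd_t q, struct saved_msg *msgs)
{
	struct mq_attr attr;
	int i, n;

	mq_getattr(q, &attr);
	n = attr.mq_curmsgs;
	for (i = 0; i < n && i < MAX_MSGS; i++)
		msgs[i].len = mq_receive(q, msgs[i].buf,
					 sizeof(msgs[i].buf),
					 &msgs[i].prio);
	return i;
}

/* Restart: put the messages back, preserving their priorities. */
static void restart_mqueue(mqd_t q, struct saved_msg *msgs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		mq_send(q, msgs[i].buf, msgs[i].len, msgs[i].prio);
}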

4.2.6 Threads

Every thread in a process shares the same memory, but has its own register set. The threads can dump themselves in user context by asking the kernel for their properties. At restart time, the main thread can read the other threads' data back from the snapshot and respawn each with its original tid.

The thread local storage contains the thread specific information: data set by pthread_setspecific() and the pthread_self() pointer. On some architectures it is stored in a general purpose register; in that case it is already covered by the signal handler frame. But on some other architectures, like Intel, it is stored in a separate segment, and this segment mapping must be saved using a dedicated call to the kernel.

4.2.7 Open files

File descriptors reference open files, which can be of any type, including regular files, pipes, sockets, FIFOs, and POSIX message queues.

CPR of file descriptors is not done entirely in the process context because they can be shared. Processes get their fd list by walking /proc/self/fd. They send this list, using ancillary messages, to a helper daemon running in the container during the checkpoint. Using the address of the struct file as a unique identifier, the daemon checkpoints the file descriptor only once per container, since two file descriptors pointing to the same opened file will have the same struct file.

File descriptors 0, 1, and 2 are considered special. We may not want to checkpoint or restart them if we do the restart on another login console, for example. These file descriptors are tagged and bypassed at checkpoint time.
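Sending a descriptor to the helper daemon uses a standard SCM_RIGHTS ancillary message. Here is a minimal sketch of the sending side (ours, not MCR's code), where sock is assumed to be a connected AF_UNIX socket to the daemon:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass one open file descriptor to the checkpoint helper daemon. */
static int send_fd(int sock, int fd)
{
	struct msghdr msg;
	struct cmsghdr *cmsg;
	char cbuf[CMSG_SPACE(sizeof(int))];
	char dummy = '*';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}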

4.2.8 Asynchronous I/O

Asynchronous I/Os are difficult to handle by nature because they cannot easily be frozen. The solution we found is to let the process reach a quiescence point, where all AIOs have completed, before checkpointing the memory. The other issue to cover is the ring buffer of completed events, which is mapped in user space and filled by the kernel. This memory area needs to be mapped at the same address when the process is restarted. This requires a small patch to control the address used for the mapping.

5 Zooming in

This section will examine in more detail three major aspects of application mobility. The first topic covers a key requirement in process migration: the ability to restart a process keeping the same pid. Next we will discuss issues and solutions related to VM migration, which has the biggest impact on performance. Finally, we will address network isolation and migration of live network communications.

5.1 Process Virtualization

A pid is a handle to a particular task or task group. Inside the kernel, a pid is dynamically assigned to a task at fork time. The relationship is recorded in the pid hash table (pid → task) and remains in place until the task exits. For system calls that return a pid (e.g. getpid()), the pid is typically extracted straight out of the task structure. For system calls that utilize a user-provided pid, the task associated with that pid is determined from the pid hash table. In addition, various checks need to be performed that guarantee the isolation between users and system tasks (in particular during the sys_kill() call).

Because pids might be cached at the user level, processes should be restarted with their original pids. However, it is difficult if not impossible to ensure that the same pid will always be available upon restart of a checkpointed application, as another process could already have been started with this pid. Hence, pids need to be virtualized. Virtualization in this context can be and is interpreted in various manners. Ultimately the requirement, that an application consistently sees the same pid associated with a task (process/thread) across CPR, must be satisfied.

There are essentially three issues that need to be dealt with in any solution:

1. container init process visibility,

2. where in the kernel the virtualization interception will take place,

3. how the virtualization is maintained.

Issue 1 is responsible for the non-trivial complexities of the various prototypes. It stems from the necessity to "rewrite" the pid relationships between the top process of a container (cinit for short) and its parent. cinit essentially lives in both contexts, the creating container and the created container. The creating container requires a pid in its context for cinit to be able to deploy regular wait() semantics. At the same time, cinit must refer to its parent as the perceived system init process (vpid = 1).

5.1.1 Isolation

Various solutions have been proposed and implemented. Zap [12] intercepts and wraps all pid related system calls and virtualizes the pids in the interception layer, through a pid ↔ vpid lookup table associated with the caller's container, either before and/or after calling the original syscall implementation with the real pids. The benefit of this approach is that the kernel does not need to be modified. However, the ability to overwrite the syscall table is not a direction Linux embraces.

MCR, presented in greater detail in Section 4, pushes the interception further down into the various syscalls itself, but also utilizes a pid ↔ vpid lookup function. In general, the calling task provides the context for the lookup. The isolation between containers is implemented in the lookup function. Tasks that are created inside a container are looked up through this function. For global tasks, pid == vpid holds. In both implementations the vpid is not explicitly stored with the task, but is determined through the pid ↔ vpid lookup each and every time. On restart, tasks can be recreated through the fork(); exec() sequence and only the lookup table needs to record the different pid. The cinit parent problem mentioned earlier is solved by mapping cinit twice: in the created context as vpid=1, and in the creating container context with the assigned vpid. The lookup function is straightforward; essentially we need to ensure that we identify any cinit process and return the vpid/task associated with it relative to the provided container context.

The OpenVZ implementation [7] provides an interesting, yet worthwhile optimization that only requires a lookup for tasks that have been restarted. OpenVZ relies on the fact that tasks do have a unique pid when they are in their original incarnation (not yet checkpointed and restarted). The lookup function, which is called at the same code locations as in the MCR implementation, hence only has to maintain the isolation property. In the case of a restarted task the uniqueness can no longer be guaranteed, so the pid must be virtualized.

A different approach is taken by the namespace proposal [6]. Here, the container principle is driven further down into the kernel. The pidhash now is defined as a ({pid, container} → task) function. The namespace approach naturally elevates the container to a first class kernel object. Hence minor changes were required to the pid allocation, which now maintains a pidmap for each and every namespace. The benefit of this approach is that the code modifications clearly highlight the conditions where container boundaries need to be crossed, whereas in the earlier virtualization approach these crossings came implicitly through the results of the lookup function. On the other hand, the namespace approach needs special provisioning for the cinit problem. To maintain the ability to wait() on the cinit process from cinit's parent (child_reaper), a task->wid is defined, which reflects the pid in the parent's context and on which the parent needs to wait. There is no clear recommendation between the namespace and virtualization approaches that we want to give in this paper; both the OpenVZ and the namespace proposals are very promising.

5.1.2 CPR on Process Virtualization

The CPR of the process virtualization is straightforward in both cases. In the Zap, MCR, and OpenVZ case, the lookup table is recreated upon restart and populated with the vpid and real pid translations, thus requiring the ability to select a specific vpid for a restarted process. Which real pid is chosen is irrelevant and is hence left to the pidmap management of the kernel. In the namespace approach, since the pid selection is pushed into the kernel, a function is required such that a task can be forked at a specific pid within a container's pidmap. Ultimately, both approaches are very similar to each other. Common to all three approaches is the fact that virtual pids are all relative to their respective containers and that they are translated into system-wide unique pids. The guts of the pidhash have not changed.
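To make the lookup-table idea concrete, here is an illustrative sketch of a per-container pid ↔ vpid translation with the cinit special case. All names are ours; none of Zap, MCR, or OpenVZ uses this exact code.

#include <sys/types.h>

struct vpid_map {
	pid_t vpid;		/* pid the application was given   */
	pid_t pid;		/* system-wide pid after a restart */
};

struct container {
	int nr_maps;
	struct vpid_map maps[1024];	/* maps[0] is cinit */
};

/* Translate a virtual pid to the real one, relative to a container. */
static pid_t vpid_to_pid(struct container *c, pid_t vpid)
{
	int i;

	/* cinit is seen as the system init process (vpid == 1). */
	if (vpid == 1 && c->nr_maps > 0)
		return c->maps[0].pid;
	for (i = 0; i < c->nr_maps; i++)
		if (c->maps[i].vpid == vpid)
			return c->maps[i].pid;
	return vpid;	/* original incarnation: pid == vpid */
}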

5.2 CPR on the Linux VM

At first glance, checkpointing a process's memory state looks like an easy task. The mechanism described in Section 2 can be implemented with a simple ptrace. This approach is completely in user space, so why is it not used in the MCR prototype, nor in any commercial CPR system? Or in other words, why do we need to push certain functionalities further down into the kernel?

5.2.1 Anonymous Memory

One of the simplest kinds of memory to checkpoint is anonymous memory. It is never used outside the process in which it is allocated. However, even this kind of memory would have serious issues with a ptrace approach.

When memory is mapped, the kernel does not fill it in at that time, but waits until it is used to populate it. Any user space program doing a checkpoint could potentially have to iterate over multiple gigabytes of sparse, entirely empty memory areas. While such an approach could consolidate such empty memory after the fact, simply iterating over it could be an incredibly significant resource drain. The kernel has intimate knowledge of which memory areas actually contain memory, and can avoid such resource drains.
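User space can approximate part of this knowledge with mincore(), which reports which pages of a mapping are resident. The sketch below (our example) would let a checkpointer skip never-populated pages of an anonymous area, although, unlike the in-kernel walk described later, it says nothing about writes.

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the resident pages of an area; only those need dumping. */
static long count_resident(void *start, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	long nr_pages = (len + page - 1) / page;
	unsigned char *vec = malloc(nr_pages);
	long i, resident = 0;

	if (!vec || mincore(start, len, vec) < 0) {
		free(vec);
		return -1;
	}
	for (i = 0; i < nr_pages; i++)
		if (vec[i] & 1)		/* low bit: page is resident */
			resident++;
	free(vec);
	return resident;
}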

5.2.2 Shared Memory

The key to successful CPR is getting a consistent snapshot. If two interconnected processes are checkpointed at different times, they may become confused when restarted. Successful memory checkpointing requires a consistent quiescence of all tasks sharing data. This includes all shared memory areas and files.

5.2.3 Copy on Write

When a process forks, both the forker and the new child have exactly the same view of memory. The kernel gives both processes a read-only view into the same memory. Although not explicit, these memory areas are shared as long as neither process writes to the area.

The above proposed ptrace mechanism would be a very poor choice for any processes which have these copy-on-write areas. The areas have no practical bounds on their sizes, and are indistinguishable from normal, writable areas from the user's (and thus ptrace's) perspective. Any mechanism utilizing ptrace could potentially be forced to write out many, many copies of redundant data. This could be avoided with checksums, but that makes user space reconstruct information which the kernel already explicitly knows.

In addition, user space has no way of explicitly recreating these copy-on-write shared areas during a resume operation. The only mechanism is fork, which is an awfully blunt instrument by which to recreate an entire system full of processes sharing memory in this manner. The only alternative is restoring all processes and breaking any sharing that was occurring before the checkpoint. Breaking down any sharing is highly undesirable because it has the potential to greatly increase memory utilization.

5.2.4 Anonymous Shared

Anonymous shared memory is memory which is shared, but has no backing in a file. In Linux there is no true anonymous shared memory: the memory area is simply backed by a pseudo file on a ram-based filesystem. So, there is no disk backing, but there certainly is a file backing.

It can only be created by an mmap() call which uses the MAP_SHARED and MAP_ANONYMOUS flags. Such a mapping is unique to a single process and not truly shared; that is, until a fork(). No other running process may attach to such memory, because there is no handle by which to find or address it; neither does it have persistence. The pseudo-file is actually deleted, which creates a unique problem for the CPR system.

Since the "anonymous" file is mapped by some process, the entire addressable contents of the file can be recovered through the aforementioned ptrace mechanism. Upon resume, the "anonymous" areas can be written to a real file in the same ram-based filesystem. After all processes sharing the areas have recreated their references to the "anonymous" area, the file can be deleted, preserving the anonymous semantics. As long as the process performing the checkpoint has ptrace-like capabilities for all processes sharing the memory area, this should not be difficult to implement.

5.2.5 File-backed Shared

Shared memory backed by files is perhaps the simplest memory to checkpoint. As long as all dirty data has been written back, requiring that filesystem consistency be kept between the checkpoint and the restart is all that is needed. This can be done completely from user space. One issue is with deleted files; however, these can be treated in the same way as the "anonymous" shared memory mentioned above.

5.2.6 File-backed Private

When an application wants a copy of a file to be mapped into memory, but does not want any changes reflected back on the disk, it will map the file MAP_PRIVATE.

These areas have the same issues as anonymous memory. Just like anonymous memory, separately checkpointing a page is only necessary after a write. When simply read, these areas exactly mirror the contents on the disk and do not need to be treated differently from normal file-backed shared memory. However, once a write occurs, these areas' treatment resembles that of anonymous memory. The contents of each area must be read and preserved. As with anonymous memory, user space has no detailed knowledge of specific pages having been written. It must simply assume that the entire area has changed and must be checkpointed. This assumption can, of course, be overridden by actually comparing the contents of memory with the contents of the disk, choosing not to explicitly write out any data which has not actually changed.

5.2.7 Using the Kernel

From the ptrace discussions above, it should be apparent that the various kinds of memory mappings in Linux can be checkpointed from user space while preserving many of their important pre-checkpoint attributes. However, it should also be apparent that user space lacks the detailed knowledge to do these operations efficiently.

The kernel knows exactly which pages have been allocated and populated. Our MCR prototype uses this information to efficiently create memory snapshots. It walks the pagetables of each memory area, and marks for checkpoint only those pages which actually have contents and have been touched by the process being checkpointed. For instance, it records the fact that "page 14" in a memory area has contents. This solves the issues with sparsely populated anonymous and private file-backed memory areas, because it accurately records the process's actual use of the memory.

However, it misses two key points. The first is that the simple presence of a page's mapping in the page tables does not indicate whether its contents exactly mirror those on the disk. This is an issue for efficiently checkpointing the file-backed private areas, because a page may be mapped but be either a page which has only been read, or one to which a write has occurred. To properly distinguish between the two, the PageMappedToDisk() flag must be checked. The second point concerns remapped file mappings, described next.

5.2.8 File-backed Remapped

Assume that a file containing two pages worth of data is mmap()ed, from the beginning of the file through the end. One would assume that the first page of that mapping contains the first page of data from the disk. By default, this is the behavior. But Linux contains a feature which invalidates this assumption: remap_file_pages(). That system call allows a user to remap a memory area's contents such that the nth page of a mapping does not correspond to the nth page on the disk. The only place in which the information about the mapping is stored is the pagetables. In addition, the presence of one of these areas is not openly available to user space.

Our user space ptrace mechanism could likely detect these situations by double-checking that each page in a file-backed memory area is truly backed by the contents on the disk, but that would be an enormous undertaking. In addition, it would not be a complete solution, because two different pages in the file could contain the same data. User space would have absolutely no way to uniquely identify the position of a page in a file simply given that page's contents. This means that the MCR implementation is incomplete, at least in regards to any memory area to which remap_file_pages() has been applied.

5.2.9 Implementation Proposal

Any effective and efficient checkpoint mechanism must implement, at the least:

1. Detection and preservation of sharing of file-backed and other shared memory areas, for both efficiency and correctness.

2. Efficient handling of sparse files and untouched anonymous areas.

3. Lower-level visibility than simply the file and its contents, for remap_file_pages() compatibility (such as effective page table contents).

There is one mechanism in the kernel today which deals with all these things: the swap code. It does not attempt to swap out areas which are file backed, or sparse areas which have not been populated. It also correctly handles the nonlinear memory areas from remap_file_pages().

We propose that the checkpointing of a process's memory could largely be done with a synthetic swap file used only by that container. This swap file, along with the contents of the pagetables of the checkpointed processes, could completely reconstruct the contents of a process' memory. The process of checkpointing a container could become very similar to the operation which swsusp performs on an entire system.

The swap code also has a feature which makes it very attractive to CPR: the swap cache. The swap cache allows a page to be both mapped into memory and currently written out to swap space. The caveat is that, if there is a write to the page, the on-disk copy must be thrown away.

Memory which is very rarely written to, such as the file-backed private memory used in the jump tables of dynamically linked libraries, has the most to gain from the swap cache. Users of this memory can run unimpeded, even during a checkpoint operation, as long as they do not perform writes.

Just as has been done with other live cross-system migration systems [11], the process of moving the data across can be iterative: copy data in several passes until, despite the efforts to swap them out, the working set size of the applications ceases to decrease.

The application-level approach has the potential to be at least marginally faster than whole-system migration because it is only concerned with application data. Xen must deal with the kernel's working set in addition to the application's. This must increase the amount of data which must be migrated, and thus must increase the potential downtime during a migration.

5.3 Migrating Sockets As the processes need to reach a quiescent point in order to stop their activities, the network In Section 3.4, the needs for a network migra- must reach this same point in order to retrieve tion were roughly defined. This section focuses network resource states at a fixed moment. 2006 Linux Symposium, Volume One • 361

5.3.3 Socket CPR

Now that we have successfully blocked the traffic and frozen the TCP state, the CPR can actually be performed.

Retrieving information on UDP sockets is straightforward. The protocol control block is simple and the queues can be dropped, because UDP communication is not reliable. Retrieving a TCP socket is more complex. The TCP sockets are classified into two groups: SS_UNCONNECTED and SS_CONNECTED. The former have little information to retrieve, because the PCB (Protocol Control Block) is not used and the send/receive queues are empty. The latter have more information to be checkpointed, such as information related to the socket, the PCB, and the send/receive queues. Minisocks and orphaned sockets also fall into the connected category.

A socket can be retrieved from /proc because the file descriptors related to the current process are listed there. getpeername(), getsockname(), and getsockopt() can be used directly with the file descriptors, as sketched below. However, some information is not accessible from user space, particularly the list of the minisocks and the orphaned sockets, because no fd is associated with them. The PCB is also inaccessible because it is an internal kernel structure. MCR adds several accessors to the kernel internals to retrieve this missing information.

The PCB is not completely checkpointed and restored, because there is a set of fields which need to be modified by the kernel itself; for example, the round trip time values.
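The user-space-visible part of a TCP socket checkpoint reduces to a few standard calls, as in this sketch of ours (the PCB and queue contents still require the kernel module's accessors):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

struct socket_state {
	struct sockaddr_in local;	/* bound address and port */
	struct sockaddr_in peer;	/* connected peer          */
	int nodelay;			/* one of many options     */
};

static int checkpoint_socket(int fd, struct socket_state *st)
{
	socklen_t len;

	len = sizeof(st->local);
	if (getsockname(fd, (struct sockaddr *)&st->local, &len) < 0)
		return -1;
	len = sizeof(st->peer);
	if (getpeername(fd, (struct sockaddr *)&st->peer, &len) < 0)
		return -1;
	len = sizeof(st->nodelay);
	return getsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
			  &st->nodelay, &len);
}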

5.3.4 Socket Cleanup

When the container is migrated, the local network resources remaining on the source host should be cleaned up in order to avoid duplicate resources on the network. This is done using the container identifier. The IP addresses cannot be used to find connections related to a container because, if several interconnected containers are running on the same machine, there is no way to find the connection owner.

5.3.5 CPR Dynamics

The fundamentals of the migration have been described in the previous sections. Figure 2 illustrates how they are used to move network resources from one machine to another.

[Figure 2: Migrating network resources. On each host, a virtual IP administrator manages an aliased interface (eth0:0, 10.0.10.10) on top of eth0; netfilter blocks the container's traffic while the UDP and TCP resources are migrated from the source server to the target server, which the clients reach over the network.]

1. Creation. A network administration component is launched; it creates an aliased interface and assigns it to a container.

2. Running. The IP address associated with the aliased interface is the only one seen by the applications inside the container.

3. Checkpoint.
   (a) All the traffic related to the aliased interface assigned to the container is blocked.
   (b) The network resources are retrieved for each kind of socket and saved: addresses, ports, multicast groups, socket options, PCB, in-flight data, and listening points.

4. Restart.
   (a) The traffic is blocked.
   (b) The network resources are set from the snapshot file back into the system.
   (c) An ARP (Address Resolution Protocol) response is sent to the network in order to ensure a correct global mac ↔ ip address association.
   (d) The traffic is unblocked.

5. Destruction. The traffic is blocked, the aliased interface is destroyed, and the sockets related to the container are removed from the system.

6 What is the cost?

This section presents an overview of the cost of virtualization in different frameworks. We have focused on VServer, OpenVZ, and our own prototype MCR, which are all lightweight containers. We have also included Xen, when possible, as a point of reference in the field of full machine virtualization.

The first set of tests assesses the virtualization overhead on a single container for each of the above mentioned solutions. The second measures the scalability of each solution by measuring the impact of idle containers on one active container. The last set provides performance measures of the MCR CPR functionality with a real world application.

6.1 Virtualization overhead

At the time of this writing, no single kernel version was supported by each of VServer, OpenVZ, and MCR. Furthermore, patches against newer kernel versions come out faster than we can collect results, and clearly by the time of publication the patches and base kernel used in testing will be outdated anyway. Hence, for each virtualization implementation we present results normalized against results obtained from the same version vanilla kernel.

6.1.1 Virtualization overhead inside a container

The following tests were made on a quad PIII 700MHz machine running Debian Sarge, using the following versions:

• VServer version vs2.0.2rc9 on a 2.6.15.4 kernel with util-vserver version 0.30.210

• MCR version 2.5.1 on a 2.6.15 kernel

We used dbench to measure filesystem load, LMbench for microbenchmarks, and a kernel build test for a generic macro benchmark. Each test was executed inside a container, with only one container created.

[Figure 3: Various tests inside a container. Overhead of the patched kernel and of running within a container, for MCR and VServer: dbench bandwidth, kernel build time, LMbench memory load latency, libc bcopy (aligned), mmap read (open2close bandwidth), and socket bandwidth using a localhost AF_UNIX socket.]

The results, shown in Figure 3, demonstrate that the overhead is hardly measurable. OpenVZ, being a full virtualized server, was not taken into account.

6.1.2 Virtualization overhead within a virtual server

The next set of tests was run in a full virtual server rather than a simple container. For these tests, the nodes used were 64bit dual Xeon 2.8GHz (4 threads/2 real processors). Nodes were equipped with a 25P3495a IBM disk (SATA disk drive) and a Tigon3 gigabit ethernet adapter. The host nodes were running RHEL AS 4 update 1 and all guest servers were running Debian Sarge. The versions used were:

• VServer version 2.1.0 on a 2.6.14.4 kernel with util-vserver version 0.30.209

• OpenVZ version 022stab064 on a 2.6.8 kernel with vzctl utilities version 2.7.0-26

• Xen version 3.0.1 on a 2.6.12.6 kernel

We ran tbench and a 2.6.15.6 kernel build test in three environments: on the system running the vanilla kernel, on the host system running the patched kernel, and inside a virtual server (or guest system). The kernel build test was done with a warmed up cache. The results are shown in Figures 4 and 5.

[Figure 4: tbench bandwidth overhead (host and guest) relative to a vanilla kernel, for VServer (2.6.14.4) and Xen (2.6.12.6).]

[Figure 5: System time of a kernel build (host and guest) relative to a vanilla kernel, for VServer (2.6.14.4) and Xen (2.6.12.6).]

The OpenVZ virtual server did not survive all the tests: tbench looped forever and the kernel build test failed with a virtual memory allocation error. As expected, lightweight containers outperform a virtual machine. Considering the level of containment provided by Xen and the configuration of the domain, using a file-backed virtual block device, Xen also behaved quite well. It would surely have better results with an LVM-backed VBD.

300 Tbench bandwidth

Using the same tests, we have studied how 280 performance is impacted when the number of 260 containers increases. To do so, we have con- 240 1 2 5 10 20 50 100 200 500 1000 tinously added idle containers to the system number of running containers and recorded the application performance of the reference test in the presence of an increas- Figure 7: tbench performance with an increas- ing number of idle containers. This gives some ing number of containers insight in the resource consumption of the var- ious virtualization techniques and its impact on application performance. This set of tests com- The results are shown in the figures 6, 7, and 8. pared: Lightweight containers are not really impacted by the number of idle containers. Xen over- head is still very reasonable but the number of • VServer version 2.1.0 on a 2.6.14.4 kernel simultaneous domains we were able to run sta- with util-vserver version 0.30.209 bly was quite low. This issue is a bug in current • OpenVZ version 022stab064 on a 2.6.8 Xen, and is expected to be solved. OpenVZ kernel with vzctl utilities version 2.7.0- performance was poor again and did not sur- 26 vive the tbench test not the kernel build. The tests should definitely be rerun with a newer • Xen version 3.0.1 on a 2.6.12.6 kernel version. 2006 Linux Symposium, Volume One • 365

Performance Scalability Checkpoint/Restart locally 160 20 MCR Checkpoint time VServer chcontext Restart time VServer virtual server Checkpoint + Restart 140 OpenVZ Xen Vanilla 2.6.15 15 120

100 10 Time in s

80 Kernel build system time 5 60

40 0 1 2 5 10 20 50 100 200 500 1000 0 10 20 30 40 50 60 70 80 90 number of running containers CPU Load in percentage

Figure 8: Kernel build time with an increasing Figure 9: Oracle CPR time under load on local number of containers disk

Checkpoint time regarding statefile size 6.3 Migration performance 10 Checkpoint time 9.5 To illustrate the cost of a migration, we have 9 set up a simple test case with Oracle (http:// 8.5 www.oracle.com/index.html) and Dots, a 8 database benchmark coming from the Linux 7.5 Checkpoint time in s Test Project (http://ltp.sourceforge. 7 net/). The nodes used in our test were 6.5 dual Xeon 2.4GHz HT (4 cpus/2 real proces- 6 120 130 140 150 160 170 180 190 200 210 220 230 sors). Nodes were equipped with a ST380011A Statefile size in M (ATA disk drive) and a Tigon3 gigabit ethernet adapter. Theses nodes were running a RHEL Figure 10: Time of checkpoint according to the AS 4 update 1 with a patched 2.6.9-11.EL ker- snapshot size nel. We used Oracle version 9.2.0.1.0 running under MCR 2.5.1 on one node and Dots version 1.1.1 running on another node (nodes are linked checkpoint time. It will also improve service by a gigabit switch). We measured the dura- downtime when the application is migated from tion of checkpoint, the duration of restart and a node to another. the size of the resulting snapshot with different Dots cpu workloads: no load, 25%, 50%, 75%, At the time we wrote the paper, we were not and 100% cpu load. The results are shown in able to run the same test with Xen, their migra- Figures 9 and 10. tion framework not yet being available.

The duration of the checkpoint is not im- pacted by the load but is directly correlated to 7 Conclusion the size of the snapshot (real memory size). Using a swap file dedicated to a container and incremental checkpoints in that swap file We have presented in this paper a motivation (Section 5.2) should improve dramatically the for application mobility as an alternative to 366 • Making Applications Mobile Under Linux the heavier virtual machine approach. We dis- References cussed our prototype of application mobility using the simple CPR approach. This proto- [1] William R. Dieter and James E. Lumpp, type helped us to identify issues and work on Jr. User-level Checkpointing for solutions that would bring useful features to the LinuxThreads Programs, Department of Linux kernel. These features are isolation of Electrical and Computer Engineering, resources through containers, virtualization of University of Kentucky resources and CPR of the kernel subsystem to provide mobility. We also went through the var- [2] Hua Zhong and Jason Nieh, CRAK: ious other alternatives projects in that are cur- Linux Checkpoint/Restart As a Kernel rently persued within community and exempli- Module, Department of Computer fied the many communalities and currents in Science, Columbia University, Technical this domain. We believe the time has come to Report CUCS-014-01, November, 2001. consolidate these efforts and drive the neces- sary requirements into the kernel. These are the [3] Duell, J., Hargrove, P., and Roman., E. necessary steps that will lead us to live migra- The Design and Implementation of tion of applications as a native kernel feature on Berkeley Lab’s Linux Checkpoint/Restart, top of containers. Berkeley Lab Technical Report (publication LBNL-54941).

[4] Poul-Henning Kamp and Robert N. M. 8 Acknowledgments Watson, R. N. M. Jails: Confining the omnipotent root, in Proc. 2nd Intl. SANE Conference (May, 2000). First all, we would like to thank all the for- mer Meiosys team who developed the ini- [5] Victor Zandy, Ckpt—A process tial MCR prototype, Frédéric Barrat, Laurent checkpoint library, http://www.cs. Dufour, Gregory Kurz, Laurent Meyer, and wisc.edu/~zandy/ckpt/. François Richard. Without their work and ex- pertise, there wouldn’t be much to talk about. [6] Eric Biederman, Code to implement A very special thanks to Byoung-jip Kim from multiple instances of various linux the Department of Computer Science at KAIST namespaces, git://git.kernel. and Jong Hyuk Choi from the IBM T.J. Wat- org/pub/scm/linux/kernel/git/ son Research Center for their contribution to ebiederm/linux-2.6-ns.git/. the s390 port, tests, benchmarks and this pa- [7] SWSoft, OpenVZ: Server Virtualization per. And, most of all we need to thank Gerrit Open Source Project, Huizenga for his patience and his encourage- http://openvz.org, 2005. ments. Merci, DonkeySchön! [8] Jacques Gélinas, Virtual private servers and security contexts, http://www. 9 Download solucorp.qc.ca/miscprj/s_ context.hc?prjstate=1&nodoc=0, 2004. Patches, documentations and benchmark re- sults will be available at http://lxc.sf. [9] VMware Inc, VMware, net. http://www.vmware.com/, 2005. 2006 Linux Symposium, Volume One • 367

[10] Boris Dragovic, Keir Fraser, Steve Hand, Tim Harris, Alex Ho, Ian Pratt, Andrew Warfield, Paul Barham, and Rolf Neugebauer, Xen and the art of virtualization, in Proceedings of the ACM Symposium on Operating Systems Principles, October 2003.

[11] Christopher Clark, Keir Fraser, Steven Hand, Jakob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield, Live Migration of Virtual Machines, in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI '05), May, 2005, Boston, MA.

[12] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of zap: A system for migrating computing environments, in Proceedings of the 5th Usenix Symposium on Operating Systems Design and Implementation, pp. 361–376, December, 2002.

[13] Daniel Price and Andrew Tucker. “Solaris Zones: Operating System Support for Consolidating Commercial Workloads.” From Proceedings of the 18th Large Installation Systems Administration Conference (USENIX LISA ’04).

[14] Werner Almesberger, Tcp Connection Passing, http://tcpcp.sourceforge.net

Copyright © 2006 IBM.

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others. References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided "AS IS," with no express or implied warranties. Use the information in this document at your own risk.
