User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines

A dissertation presented by

Kapil Arya

to the Faculty of the Graduate School of the College of Computer and Information Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Northeastern University Boston, Massachusetts

August 2014

Copyright © August 2014 by Kapil Arya

NORTHEASTERN UNIVERSITY GRADUATE SCHOOL OF COMPUTER SCIENCE Ph.D. THESIS APPROVAL FORM

THESIS TITLE: User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines
AUTHOR: Kapil Arya

Ph.D. Thesis approved to complete all degree requirements for the Ph.D. degree in Computer Science


Abstract

Checkpoint-Restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from the disk. In addition to its traditional use in fault tolerance, recovering from a system failure, it has numerous other uses, such as application debugging and save/restore of the workspace of an interactive problem-solving environment. Transparent checkpointing operates without modifying the underlying application program, but it implicitly relies on a "Closed World Assumption": the world (including file system, network, etc.) will look the same upon restart as it did at the time of checkpoint. This assumption is not valid for more complex programs. Until now, checkpoint-restart packages have adopted ad hoc solutions for each case where the environment changes upon restart.

This dissertation presents user-space process virtualization to decouple application processes from external subsystems. A thin virtualization layer is introduced between the application and each external subsystem. It provides the application with a consistent view of the external world and allows checkpoint-restart to succeed. The ever growing number of external subsystems makes it harder to deploy and maintain virtualization layers in a monolithic checkpoint-restart system. To address this, an adaptive plugin-based approach is used to implement the virtualization layers, allowing the checkpoint-restart system to grow organically.

The principle of decoupling the external subsystem through process virtualization is also applied in the context of virtual machines, providing a solution to the long-standing double-paging problem. Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor; it leads to long delays for the guest as the contents are read back into machine memory only to be written out again. Performance rapidly drops as a result of the significant lengthening of the time to complete the guest I/O request.

Acknowledgments

No dissertation is accomplished without the support of many people, and I can only begin to thank all those who have helped me in completing it.

I am indebted to my advisor, Gene Cooperman, for his patience, encouragement, support, and guidance over the years. It is because of Gene that I decided to go for a Ph.D. while I was a Master's student at Northeastern. Gene taught me how to do research and to distinguish the ideas that only I would find interesting from the ideas that are important. I could not have asked for a better teacher, and without him, this document would not exist.

I am thankful to Panagiotis (Pete) Manolios, Alan Mislove, and William Robertson for serving on my committee and for providing their insightful input and constructive criticism. I resoundingly thank Peter Desnoyers for always being available to discuss ideas and for providing constructive feedback on several occasions. I also want to thank the International Student and Scholar Institute (ISSI) team and Bryan Lackaye for helping with administrative matters during my stay at Northeastern.

I was fortunate to be mentored by Alex Garthwaite during the summer internships at VMware. His guidance and encouragement have always been there and never seem to fade away. Alex agreed to be the external member of my committee, and I am thankful for his feedback and thoughtful comments that have not only improved the quality of this dissertation, but also provided ideas for future directions. His dictum that a good dissertation is a completed one became my mantra during the last two years.

I also want to thank Yury Baskakov for all the help that I received while working on the Tesseract project. He never got tired of my random speculations and was always there to provide further insights and also to cover my blind spots. A special thanks goes to Jerri-Ann Meyer and Joyce Spencer for their continued support of the project. Finally, I want to thank Ron Mann for his continued advice and guidance that has helped me become a better engineer.

I am grateful to Alok Singh Gehlot for his friendship, all the advice he provided me over the years, and for his constant reminder that it's not done until it's done. He was always available for me, and without his guidance, I would not have been at Northeastern for my Master's and later, Ph.D.

I want to thank Rohan Garg and Jaideep Ramachandran for going through the thesis drafts, sitting through my practice talks, and providing valuable feedback. Over the years, I have had the support of a lot of friends, and I want to thank Jaijun Cao, Harsh Raju Chamarthi, Tyler Denniston, Anand Gehlot, Gregory Kerr, Samaneh Kazemi Nafchi, Artem Polyakov, Sumit Purohit, Praveen Singh Solanki, Ana-Maria Visan, Vishal Vyas, and any others I regrettably failed to name. I am enormously thankful to Surbhi for her enduring friendship and companionship through all these years.

Finally, I owe much to my family. I want to express my deepest gratitude to my grandparents, Smt. Mohini Devi and Sh. Omdutt Ji, my parents, Smt. Jamana Devi and Sh. Nem Singh Ji, my aunt and uncle, Smt. Sangeeta Devi and Sh. Hari Singh Ji, my uncles Sh. Kamlesh Ji and Sh. Dilip Ji, and my siblings and cousins, Kavita, Lalita, Shilpa, and Anil, for their never-ending love, dedication, and support. I am forever indebted to them.

To my grandfather Shri Omdutt Ji Solanki

And my school teacher Shri Devi Singh Ji Kachhwaha

Contents

Contents

List of Figures

List of Tables

1 Overview
  1.1 Closed-World Assumption
  1.2 Double-Paging Anomaly
  1.3 Process Virtualization
  1.4 Thesis Statement
  1.5 Contributions
    1.5.1 Process Virtualization through Plugins
    1.5.2 Application-Specific Plugins
    1.5.3 Third-Party Plugins
    1.5.4 Solving the Double-Paging Problem
  1.6 Organization

2 Concepts Related to Checkpoint-Restart and Virtualization
  2.1 Checkpoint-Restart
    2.1.1 Kernel-Level Transparent Checkpoint-Restart
    2.1.2 User-Level Transparent Checkpoint-Restart
    2.1.3 Fault Tolerance
  2.2 System Call Interpositioning
  2.3 Virtualization
    2.3.1 Language-Specific Virtual Machines
    2.3.2 Process Virtualization
    2.3.3 Lightweight O/S-based Virtual Machines
    2.3.4 Virtual Machines
  2.4 DMTCP Version 1
    2.4.1 Library Call Wrappers
    2.4.2 DMTCP Coordinator
    2.4.3 Checkpoint Thread
    2.4.4 Checkpoint
    2.4.5 Restart
    2.4.6 Checkpoint Consistency for Distributed Processes

3 Adaptive Plugins as a Mechanism for Virtualization
  3.1 The Ever Changing Execution Environment
    3.1.1 PID: Virtualizing Kernel Resource Identifiers
    3.1.2 SSH Connection: Virtualizing a Protocol
    3.1.3 InfiniBand: Virtualizing a Device Driver
    3.1.4 OpenGL: A Record/Replay Approach to Virtualizing a Device Driver
    3.1.5 POSIX Timers: Adapting to Application Requirements
  3.2 Virtualizing the Execution Environment
    3.2.1 Virtualize Access to External Resources
    3.2.2 Capture/Restore the State of External Resources
  3.3 Adaptive Plugins as a Synthesis of System-Level and Application-Level Checkpointing

4 The Design of Plugins
  4.1 Plugin Architecture
    4.1.1 Virtualization through Function Wrappers
    4.1.2 Event Notifications
    4.1.3 Publish/Subscribe Service
  4.2 Design Recipe for Virtualization through Plugins
  4.3 Plugin Dependencies
    4.3.1 Dependency Resolution
    4.3.2 External Resources Virtualized by Other Plugins
    4.3.3 Multiple Plugins Wrapping the Same Function
  4.4 Extending to Multiple Processes
    4.4.1 Unique Resource-id for Shared Resources
    4.4.2 Checkpointing Shared Resources
    4.4.3 Restoring Shared Resources
  4.5 Three Base Plugins
    4.5.1 Coordinator Interface Plugin
    4.5.2 Thread Plugin
    4.5.3 Memory Plugins
  4.6 Implementation Challenges
    4.6.1 Wrapper Functions
    4.6.2 New Process/Program Creation
    4.6.3 Checkpoint Deadlock on a Runtime Library Resource
    4.6.4 Blocking Library Functions and Checkpoint Starvation

5 Expressivity of Plugins
  5.1 File Descriptor Related Plugins
  5.2 Pid, System V IPC, and Timer Plugins
  5.3 Application-Specific Plugins
  5.4 SSH Connection
  5.5 Batch-Queue Plugin for Resource Managers
  5.6 Plugin
  5.7 Deterministic Record-Replay
  5.8 Checkpointing Networks of Virtual Machines
  5.9 3-D Graphic: Support for Programmable GPUs in OpenGL 2.0 and Higher
  5.10 Transparent Checkpointing of InfiniBand
  5.11 IB2TCP: Migrating from InfiniBand to TCP Sockets

6 Tesseract: Reconciling Guest I/O and Hypervisor Swapping in a VM
  6.1 Redundant I/O
  6.2 Motivation: The Double-Paging Anomaly
  6.3 Design
    6.3.1 Extending The Hosted Platform To Be Like ESX
    6.3.2 Reconciling Redundant I/Os
    6.3.3 Tesseract's Virtual Disk and Swap Subsystems
  6.4 Implementation
    6.4.1 Explicit Management of Hypervisor Swapping
    6.4.2 Tracking Memory Pages and Disk Blocks
    6.4.3 I/O Paths
    6.4.4 Managing Block Indirection Metadata
  6.5 Guest Disk Fragmentation
    6.5.1 BSST Defragmentation
    6.5.2 Guest VMDK Defragmentation
  6.6 Evaluation
    6.6.1 Inducing Double-Paging Activity
    6.6.2 Application Performance
    6.6.3 Double-Paging and Guest Write I/O Requests
    6.6.4 Fragmentation in Guest Read I/O Requests
    6.6.5 Evaluating Defragmentation Schemes
    6.6.6 Using SSD For Storing BSST VMDK
    6.6.7 Overheads
  6.7 Related Work
    6.7.1 Hypervisor Swapping and Double Paging
    6.7.2 Associations Between Memory and Disk State
    6.7.3 I/O and Memory Deduplication
  6.8 Observations

7 Impact for the Future
  7.1 Compiled Code In Scripting Languages: Fast-Slow Paradigm
  7.2 Support for Hadoop-style Big Data
  7.3 Cybersecurity
  7.4 Algorithmic debugging
  7.5 Reversible Debugging
  7.6 Android-Based Mobile Computing
  7.7 Cloud Computing

8 Conclusion

A Plugin Tutorial
  A.1 Introduction
  A.2 Anatomy of a plugin
  A.3 Writing Plugins
    A.3.1 Invoking a plugin
    A.3.2 The plugin mechanisms
  A.4 Application-Initiated Checkpoints
  A.5 Plugin Manual
    A.5.1 Plugin events
    A.5.2 Publish/Subscribe
    A.5.3 Wrapper functions
    A.5.4 Miscellaneous utility functions

Bibliography

List of Figures

1.1 Application surface of a running process

2.1 Architecture of DMTCP

3.1 Virtualization of Process Id
3.2 Two processes communicating over SSH
3.3 Virtualizing an SSH connection

4.2 Event notifications for write-ckpt and restart events
4.4 Nested wrappers
4.5 Plugin dependency for distributed processes

5.1 Restoring an SSH connection

6.1 Some cases of redundant I/O in a ...
6.2 An example of double-paging
6.3 Double-paging with Tesseract
6.4 Write I/O and hypervisor swapping
6.5 Examples of reference count with Tesseract and with defragmentation
6.6 VMware Workstation I/O Stack
6.7 Modified scatter-gather list to avoid double-paging
6.8 Splitting scatter-gather list during read
6.9 Defragmenting the BSST
6.10 Defragmenting the guest VMDK
6.11 Trends for scores and pauses in SPECjbb runs with varying guest memory pressure and 10% host overcommitment
6.12 Maximum single pauses observed in SPECjbb instantaneous scoring with varying guest memory pressure and 10% host memory overcommitment
6.13 Scores and total pause times for SPECjbb runs with varying host overcommitment and 60 MB memhog
6.14 Comparing maximum single pauses for SPECjbb under various defragmentation schemes with varying host memory overcommitment and 60 MB memhog
6.15 Scores and pauses in SPECjbb runs under various defragmentation schemes with 10% host overcommitment
6.16 Score and pauses in SPECjbb under various defragmentation schemes with varying host overcommitment and 60 MB memhog
6.17 Comparing maximum single pauses for SPECjbb under various defragmentation schemes with 10% host memory overcommitment
6.18 Tesseract performances with BSST placed on an SSD disk

List of Tables

2.1 Comparison of various checkpointing systems

5.1 Comparison of process virtualization based checkpoint-restart with prior art
5.2 Statistics for various plugins

6.1 Holes in write I/O requests for varying host overcommitment and 60 MB memhog inside the guest
6.2 Holes in read I/O requests for Tesseract without defragmentation for varying levels of host overcommitment and 60 MB memhog inside the guest
6.3 Total I/Os with BSST and guest defragmentation
6.4 Average read and write prepare/completion times in microseconds for baseline and Tesseract with and without defragmentation

CHAPTER 1

Overview

Checkpoint-restart is a powerful mechanism to save the state of one or more running processes to disk and later restore it. In addition to the traditional use case of fault tolerance in long-running jobs, other use cases of checkpoint-restart include process migration, debugging, and save/restore of workspace. At a high level, checkpointing a process can be viewed as writing all of process memory, including shared libraries, text, and data, to a checkpoint image. Accordingly, restarting involves recreating the process memory by reading the checkpoint image from the disk. This works for simple programs, but for complex programs, one also needs to save and restore information about threads, open files, etc. In more sophisticated applications, it involves saving the network state (in-flight data, etc.), and information about the external environment such as the terminal, the standard input/output/error, and so on.

Current checkpointing techniques fall into two categories: application-level and system-level. Application-level checkpointing requires modifications to the target program to insert checkpoint-restart code. The developer identifies the relevant state and data to be checkpointed and implements the mechanism for checkpointing and restoring them. While it is flexible and allows the programmer to optimize and have greater control over the checkpointing process, there is a high cost paid by the developer for implementing and maintaining it. Further, the timing and frequency of checkpoints may not be specified in a flexible manner and could be limited to certain "safe" points in the program. System-level (or transparent) checkpointing, on the other hand, works without modifying the target application program. However, a simple implementation is less flexible in that it requires the same environment on restart (the case of homogeneous computer hosts).

1.1 Closed-World Assumption

Traditionally, checkpoint-restart packages have made a closed-world assumption:

The execution environment (file system, network, etc.) does not change between checkpoint and restart. Thus to save and restore the state of the processes of a computation, it suffices to save the state of the CPU registers, the process’s virtual memory, and kernel state.

While the closed-world assumption holds for simple programs, it is not valid for more complex programs (such as distributed processes), and can cause checkpoint-restart to fail in remarkable ways. For example, a process with open files will fail to restart if the underlying filesystem mount point has changed, or if the host has a new IP address while the process remembers the old one. At a more basic level, the restarted process will have a new process id (pid) provided by the kernel. Thus, any attempt by the target application to re-use a previously cached old pid will result in a failure.

One way to overcome the closed-world assumption is application-level checkpointing: modifying the application program to account for the changing environment. As mentioned earlier, this approach is costly and hard to maintain.

For these reasons, the existing systems have been used mostly for applications that obey the closed-world assumption, such as isolated batch jobs running solely on traditional multi-core computer nodes within a cluster. The closed-world assumption is enforced by imposing several restrictions on the features that an application can use, or by creating special-purpose workarounds to handle exceptions to the closed-world assumption.

For example, Condor [110] restricts applications from using multi-process jobs, interprocess communication, multi-threading, timers, file locks, etc. [109]. BLCR [52] is implemented through a Linux kernel module, which restores the original pid when it is still unused and fails if it is unavailable. CRIU [111] places all target processes in a Linux container (a lightweight virtual machine), which has private namespaces for kernel objects, but is isolated from other processes within the same host.

The closed-world assumption breaks down as users ask to checkpoint more general types of software that communicate with the external world. Examples include communication with system daemons (e.g., NSCD, LDAP authentication servers), 3-D graphics libraries (e.g., OpenGL), connections with database servers, networks of virtual machines, hybrid computations using CPU accelerators (e.g., GPU and Xeon Phi), Hadoop-style computations, a broader variety of network models (TCP sockets, InfiniBand, the SCIF network for the Intel Xeon Phi), competing implementations of InfiniBand libraries (QLogic/PSM versus InfiniBand OpenIB verbs), and so on.

These complex applications have created a dilemma. A system for pure transparent checkpointing has no knowledge of the application's external world, and an application-level checkpointing system would require the writer of the target application to insert code that adapts to the modified external environment after restart. This conflict is the core problem being solved.

1.2 Double-Paging Anomaly

Hypervisors often overcommit memory to achieve higher VM consolidation on the physical host. When overcommitting host physical memory, guest memory is paged in and out from a hypervisor-level swap file to reclaim host memory. Further, guests running in the virtual machines manage their own physical address space and may overcommit memory as needed.

Double-paging is an often-cited problem in multi-level scheduling of memory between virtual machines (VMs) and the hypervisor. This problem occurs when both a virtualized guest and the hypervisor overcommit their respective physical address spaces. When the guest pages out memory previously swapped out by the hypervisor, it initiates an expensive sequence of steps causing the contents to be read in from the hypervisor-level swap file only to be written out again, significantly lengthening the time to complete the guest I/O request. As a result, performance rapidly drops.

1.3 Process Virtualization

Often, application processes violate the closed-world assumption. When restarting from a checkpoint image, the recreated objects derived from external systems/services may not be the same as their pre-checkpoint version. This is due to the changing execution environment across a checkpoint-restart boundary. In order to successfully restart an application process, we need to virtualize these objects in such a way that the application view of the objects does not change across checkpoint and restart.

Definition: The application surface of a running application is a set of code and associated data that includes all application-specific objects (code+data) and excludes all opaque objects derived from any outside systems/services. (An opaque object is an object for which the application knows nothing about the internal structure. The opaque object is only accessible through an identifying handle.)

Figure 1.1: Application surface of a running process. The virtual names lie inside the application surface, whereas the real names lie outside the surface. (The figure depicts an application process whose application surface contains the virtual names, a translation layer, and the real names of an external resource.)

Definition: User-space process virtualization finds a surface that is at least as large as the application surface, such that any virtualized view of an object lies inside this surface and any real view lies outside this surface (see Figure 1.1). On restart, the opaque objects are recreated to provide semantically equivalent functionality to their pre-checkpoint version. Process virtualization then links these opaque objects with their virtualized view inside the application surface (through the identifying handles).

There can be more than one possible application surface. Typically one chooses an application surface close to a well-known API for the sake of stability and maintainability. A wrapper around any call to the API will update both the virtual and the real view in a consistent manner; a minimal sketch of such a wrapper is shown at the end of this section.

Remarks:

1. In virtualizing a pid, we will see that libc will retain the real pid known to the kernel. Thus libc is outside the application surface. But the application knows only the virtual pid that resides inside the application surface.

2. In the case of a shadow device driver, the user-space memory of the application may contain both some opaque objects (e.g., InfiniBand queues) and their virtualized views. In this case the application surface excludes parts of the user-space memory of the application process.

3. Because daemons and the kernel are opaque to the application, they always lie outside the application surface.

4. An application may create an auxiliary child process (or even distributed processes in the case of MPI). In this case, the application surface includes these auxiliary processes.

The goal of user-space process virtualization is to break the tight coupling between the application process and an external subsystem not under the control of the application process. In effect, each API is designed to provide a stable interface to a single system service under the lifetime of a process. This thesis will demonstrate the ability to find an application surface and a corresponding API, for which a software translation layer can be built, enabling the application process to continue to receive the corresponding system service from an alternative external subsystem. This decouples the application process from the external subsystem.
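To make the translation layer concrete, the following is a minimal sketch, under the assumptions noted in the comments, of how a library-call wrapper can keep the virtual pid inside the application surface while libc and the kernel continue to see the real pid. The fixed-size translation table and its lookup helpers are hypothetical placeholders; they are not part of DMTCP or of any standard API.

    /* Hypothetical sketch of a pid-translation wrapper built on dlsym(RTLD_NEXT).
     * The toy table and its lookup helpers stand in for the mapping maintained
     * by the virtualization layer; they are not a real API. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <signal.h>
    #include <sys/types.h>

    #define TABLE_SIZE 1024
    static pid_t virt_of[TABLE_SIZE];   /* real pid    -> virtual pid (toy table) */
    static pid_t real_of[TABLE_SIZE];   /* virtual pid -> real pid    (toy table) */

    static pid_t real_to_virt(pid_t real) {
        pid_t v = virt_of[real % TABLE_SIZE];
        return v ? v : real;            /* unknown pids pass through unchanged */
    }
    static pid_t virt_to_real(pid_t virt) {
        pid_t r = real_of[virt % TABLE_SIZE];
        return r ? r : virt;
    }

    /* The application only ever observes virtual pids ... */
    pid_t getpid(void) {
        pid_t (*next_getpid)(void) =
            (pid_t (*)(void)) dlsym(RTLD_NEXT, "getpid");
        return real_to_virt(next_getpid());
    }

    /* ... and a virtual pid handed back by the application is translated to the
     * current real pid before it reaches the kernel, so a pid cached before
     * checkpoint remains usable after restart. */
    int kill(pid_t virt_pid, int sig) {
        int (*next_kill)(pid_t, int) =
            (int (*)(pid_t, int)) dlsym(RTLD_NEXT, "kill");
        return next_kill(virt_to_real(virt_pid), sig);
    }

Compiled into a shared library and preloaded into the application, such wrappers sit on the boundary of the application surface: on restart, only the translation table needs to be updated to point at the newly assigned real pids.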

1.4 Thesis Statement

User-space process virtualization can be used to decouple application processes from external subsystems to allow checkpoint-restart without enforcing a strict "closed-world assumption". The method of decoupling subsystems applies beyond checkpointing, as seen in a solution to the long-standing double-paging problem.

1.5 Contributions

This dissertation shows that a checkpointing system can "adapt" to the external environment, one subsystem at a time, by using the user-space process virtualization technique. To that end, this work introduces a plugin architecture based on adaptive plugins to virtualize these external subsystems. A plugin is responsible for virtualizing and checkpointing exactly one external subsystem to allow the application to adapt to the modified external subsystem. The plugin architecture allows us to do selective (or partial) virtualization of the underlying resources for efficiency purposes. Plugins can be loaded/unloaded to suit application requirements. Further, it allows the checkpointing system to be extended organically, in a non-monolithic manner.

1.5.1 Process Virtualization through Plugins

To demonstrate the strength of the plugin architecture for user-space process virtualization, this work presents principled techniques for the following problems, which have resisted successful checkpoint-restart solutions for at least a decade (these plugins are original with this dissertation):

• The PID plugin (§5.2) virtualizes the process and thread identifiers assigned by the kernel.

• The System V IPC plugin (§5.2) virtualizes the shared memory, semaphore, and message queue identifiers assigned by the kernel.

• The Timer plugin (§5.2) virtualizes POSIX timers as well as clock identifiers assigned by the kernel.

• The SSH plugin (§5.4) virtualizes the underlying SSH connection between two processes to allow recreation on restart.

• The IB2TCP plugin (§5.11) virtualizes the InfiniBand device driver to allow a computation to be checkpointed on the InfiniBand hardware and restarted on the TCP hardware.

Notice that the Zap [86] system virtualized the kernel resource identifiers such as pids and System V IPC ids in kernel space. However, the work of this dissertation virtualizes entirely in user space without any application or kernel modifications or kernel modules. Further, this work extends the notion of user-space virtualization to processes/services outside the kernel such as SSH connections, network daemons, and device drivers. This is achieved either through interposing library calls or by creating shadow agents/processes for the external resources.

1.5.2 Application-Specific Plugins

Next, we show that plugins can be used for application-specific adaptations, providing the benefits of application-level checkpointing without having to modify the base application. The following application-specific plugins (§5.3) are original with this dissertation:

• Malloc plugin virtualizes access to the underlying memory allocation library (e.g., libc malloc, tcmalloc, etc.).

• DL plugin is used to ensure atomicity for dlopen/dlsym functions with respect to checkpoint-restart.

• CkptFile plugin provides heuristics for checkpointing open files. It also helps the file plugin to locate files on restart.

• Uniq-Ckpt plugin is used to control the checkpoint file names, locations, etc.

1.5.3 Third-Party Plugins

Finally, the success of the plugin architecture can also be seen in third party plugins. We show that third parties can write orthogonal customized plugins to fit their needs. The following demonstrates original work due to plugins created by third party contributors (this dissertation is not claiming these results):

• Ptrace plugin [127] virtualizes the ptrace system call to allow checkpointing of an entire gdb session for reversible debugging.

• Record-replay plugin [126] provides a lightweight deterministic replay mechanism by recording library calls for reversible debugging.

• KVM plugin [44] is used for checkpointing the KVM/Qemu virtual machine.

• Tun plugin [44] is used to checkpoint the Tun/Tap network interface when checkpointing a network of virtual machines.

• RM plugin [93] is used for checkpointing in a batch-queue environment and can handle multiple batch-queue systems.

• InfiniBand plugin [27] provides the first non-MPI-specific transparent checkpoint-restart of the InfiniBand network.

• OpenGL plugin [62] uses a record-prune-replay technique for checkpointing 3D graphics (OpenGL 2.0 and beyond).

1.5.4 Solving the Double-Paging Problem

The process virtualization principles are also applied in the context of virtual machines. The double-paging problem is directly and transparently addressed by applying the decoupling principle [11]. The guest and hypervisor I/O operations are tracked to detect redundancy and are modified to create indirections to existing disk blocks containing the page contents. The indirection is created by introducing a thin virtualization layer to virtualize access to the guest disk blocks. Further, the virtualization is done completely in user space.

1.6 Organization

The remainder of this dissertation is organized as follows.

A literature review is presented in Chapter 2, and various checkpoint-restart mechanisms are discussed. The review also includes various virtualization schemes in the context of checkpointing. (Literature for the double-paging problem is reviewed in Chapter 6.)

Chapter 3 provides several examples to motivate the need for virtualizing the execution environment. This chapter then uses this motivation to outline two basic requirements for virtualizing the execution environment. It is argued there that an adaptive plugin based approach is well suited for process virtualization.

Chapter 4 describes the design of adaptive plugins and presents the plugin architecture. The proposed plugin architecture is shown to meet the virtualization requirements laid out in Chapter 3. This is followed by a design recipe for developing new plugins. Dependencies among multiple plugins are also discussed, and an approach to dependency resolution is provided. Finally, some implementation challenges involved in designing plugins are presented.

Chapter 5 provides some case studies involving various plugins. Included there are seven plugins that provide novel checkpointing solutions for their corresponding subsystems. Some application-specific plugins are also demonstrated, along with several plugins that provide virtualization of kernel resource identifiers in user space.

Chapter 6 then turns to the double-paging problem. As with the core issue in checkpoint-restart, here also one is presented with distinct subsystems that must be combined in a unified virtualization scheme. The core problem is described and motivated, and a design and implementation of a solution is presented. We also discuss some of the side effects of the proposed solution and finally present an evaluation.

Chapter 7 provides some new directions and applications of checkpoint-restart to non-traditional use cases that can be pursued based on this dissertation, with a conclusion presented in Chapter 8. Finally, a plugin tutorial is presented in Appendix A, thus providing a concrete view of the plugin API.

CHAPTER 2

Concepts Related to Checkpoint-Restart and Virtualization

This dissertation intersects with four broad areas. The first is that of checkpoint-restart at the process level. The second concerns system/library call interpositioning for modifying process behavior. The third concerns process-level virtualization. The fourth concerns the double-paging problem in the context of virtual machines. The literature for the first three areas is reviewed here, whereas the related work for the double-paging problem is discussed in Chapter 6. Since this work builds on the DMTCP software package, a brief overview of the legacy DMTCP software (DMTCP version 1) is also provided.

2.1 Checkpoint-Restart

Checkpoint-restart has a long history, with several mechanisms proposed over the years [90, 97, 98, 35]. It is often used for process migration, for load balancing, for fault tolerance, and so on [34]. The work of Milojičić et al. [81] provides a review of the field of process migration. Egwutuoha et al. [35] provides a survey of various checkpoint/restart implementations in high performance computing. The website checkpointing.org also lists several checkpoint-restart systems. There are three primary approaches to checkpointing: virtual machine snapshotting, application-level checkpointing, and transparent checkpointing.

tations in high performance computing. The website checkpointing.org also lists several checkpoint-restart systems. There are three primary ap- proaches to checkpointing: virtual machine snapshotting, application-level checkpointing, and transparent checkpointing.

Virtual machine snapshotting

Virtual machine (VM) snapshotting is a form of checkpointing for virtual machines and is often used for virtual machine migration. A complex application is treated as a black box, and its application surface is expanded to include the entire guest physical memory, state, devices, etc. Checkpointing an application involves saving everything inside the application surface (i.e., the entire virtual machine). While this technique is general and has been discussed quite extensively [80], it is also slower and produces larger checkpoint images, because the checkpoint module is unable to exclude unnecessary parts of guest physical memory from the application surface. Hence, it is not commonly used for mechanisms of checkpoint-restart.

Application-level checkpointing

Application-level checkpointing is the simplest form of checkpointing. The developer of the application inserts checkpointing code directly inside the application to save the process state, such as data structures, to a file on disk that is later used to resume the computation. This is application-specific and requires extensive knowledge of the application. The knowledge of the application internals provides complete flexibility, but places a larger burden on the end user. There are several techniques [129] and frameworks that provide tools to assist in application-level checkpointing. Examples include pickling for Python [120] and Boost serialization [108] for C++. A somewhat lighter mode of application-level checkpointing is the save/restore workspace feature for interactive sessions. Notably, Bronevetsky et al. have applied this to shared memory parallelism in the context of OpenMP [24, 25] and distributed parallelism in the context of MPI [100, 23], where they provide tools to lighten the end-user burden for writing checkpointing code.

The rest of this section focuses on several varieties of transparent checkpointing, in which the end-user does not need to make any changes to the target application.

Transparent checkpointing

This is sometimes called system-level or system-initiated checkpointing. It is the ability to checkpoint an application without making any changes to the application source or binary. The history of transparent checkpointing extends back at least to 1990 [73]. While there are many systems that perform single-process checkpointing [91, 33, 89, 92, 73, 74, 29, 1, 3, 76], we will focus on systems that support multiple processes and/or distributed processes. Transparent system-level checkpointing can be further broken down into kernel-level and user-level checkpointing. The two techniques are further discussed in Sections 2.1.1 and 2.1.2, respectively.

2.1.1 Kernel-Level Transparent Checkpoint-Restart

In kernel-level checkpointing, the operating system is modified to support checkpointing for applications. This approach leads to checkpoints being more tightly coupled to kernel versions. While there have been several such kernel-level packages, the need to support multiple kernel versions makes them more difficult to maintain. It also makes future ports to other operating systems more difficult.

The Zap system and its derivatives

As an extension of CRAK (Checkpoint and Restart as a Kernel Module) [139], Zap [86, 67] implements checkpoint-restart using a kernel module. Zap can be considered a precursor to Linux Containers (LXC) [117], as it also provides a virtualized view of the kernel resources. Zap uses a pod (process domain) abstraction that provides a group of processes with a consistent virtualized view. The pod abstraction virtualizes kernel resource identifiers to present a pod-specific view. This isolates the process from the external world and provides a conflict-free environment when migrating processes to other nodes. The downside of this implementation is the inability of processes inside a pod to communicate with processes outside the pod. It intercepts all system calls operating on the virtualized kernel resource identifiers, translating their arguments and return values as needed. System call interception is also required for all processes in the system and poses runtime overhead for processes outside the pods.

Zap was later extended to support distributed network applications by Laadan et al. [68] to create ZapC and by Janakiraman et al. [59] to create CRUZ. The key enhancement was the support for virtualization of the network layer to decouple the processes from the node they are running on. This allowed these systems to checkpoint-restart distributed computations over a cluster. For ZapC, this was achieved by inserting hooks into the network stack using netfilter. The source and destination addresses were translated between virtual and real addresses for both incoming and outgoing network packets.

The work of this dissertation is based entirely in user space and does not require any kernel modification or kernel modules. As explained by Laadan [66], the kernel-module-based approach incurs a burden both on users, because it is cumbersome to install, and on developers, because maintaining it on top of quickly changing upstream kernels is a Sisyphean task and development quickly falls behind. Further, user-space virtualization poses no runtime overhead for processes that are not part of the computation being checkpointed. Finally, this work can be used to virtualize agents, processes, and services outside the kernel. Examples include SSH connections, network daemons, and device drivers.

Berkeley Lab Checkpoint Restart (BLCR)

BLCR [52] is another widely used checkpointing system that is implemented as a kernel module. It is used primarily in high performance computing. BLCR is often used along with MPI libraries to checkpoint a distributed computation. BLCR does not have any support for virtualization and may fail if a kernel resource identifier (such as a pid) is not available at the time of restart. It also relies on MPI daemons to handle changed network addresses, mount points, etc. However, if the application has cached a directory name from before checkpoint and tries to open it after restart, it may fail.

Another notable kernel based system was Chpox by Sudakov et al. [105]. Initially, Chpox was implemented as a kernel module for Linux 2.4, whereas a later version for Linux 2.6 required base kernel modifications as well.

Pure kernel-level approaches

A more recent attempt by Laadan et al. [68] also implemented a single-host in-kernel solution. It consisted of some user-space utilities and a series of patches to the Linux 2.6 kernel to add checkpoint support in the mainline kernel itself. This was proposed for inclusion in the Linux kernel, but ultimately not accepted due to its invasive approach that touched/modified a large number of kernel subsystems [8].

2.1.2 User-Level Transparent Checkpoint-Restart

User level checkpointing works without any changes to the operating system kernel. The use of published APIs (e.g., POSIX and the Linux proc filesystem) to communicate with the kernel and to perform checkpoint-restart makes it highly stable.

Checkpointing library

The ground-breaking work of Plank et al. [92] on Libckpt uses a library to do the checkpointing, and the application program is linked to this user-level library. Similar techniques are used by Condor [76]. These techniques are not completely transparent to the user, as the application code is modified, recompiled, and relinked with the dynamic library. However, the amount of code change is often fairly small (e.g., for Libckpt, the application programmer needs to rename main() to ckpt_target()). The main disadvantage of using such systems is the restrictions imposed on the operating system features (such as interprocess communication) that the application program can use [109]. Further, these systems do not support process trees or distributed computations.

Distributed checkpointing with MPI

Although application-level checkpointing for distributed programs dates back at least to 1997 [17], most practical systems were built around MPI-based distributed computations for supporting high performance computing. They use hooks or callback functions for specific MPI implementations [31, 54, 137, 138, 104, 21, 133, 49, 52, 99]. (MPI, Message Passing Interface, is a standard for message-based distributed high performance computation.) Most MPI implementors chose to build a custom checkpoint-restart service. This came about when InfiniBand became the preferred network for high performance computing, and there was still no package for transparent checkpointing over InfiniBand. Examples of checkpoint-restart services can be found in Open MPI [54, 55], LAM/MPI [99] (now incorporated into MVAPICH2 [77, 41]), MPICH-V [22], and MVAPICH2 [41], as well as a fault-tolerant "backplane", CIFTS [51]. Each checkpoint-restart service would disconnect from the network prior to checkpoint, and re-connect after restart. Hence, while the network was disconnected, the MPI checkpoint-restart service was then able to delegate single-host checkpointing to the BLCR [52] kernel module. This created an extra layer of complication, but it was unavoidable at that time, due to the lack of support for transparent checkpointing over InfiniBand. On restart, the network connections are restored and the checkpointer is called upon to restore the user processes. Since it is working at the MPI level, the ability to adapt to the environment outside of MPI is limited, and generally proves difficult to maintain.

Bronevetsky et al. produced a novel application-level checkpointing design for the special case of MPI [23]. In this approach, a pre-compiler instruments the application MPI code with additional information needed for checkpointing, thus coming close to the ideal of transparent checkpointing. The application programmer then adds code indicating valid points in the program for a potential checkpoint. The use of a pre-compiler relieved much of the burden of adding application-specific code to support checkpointing.

Cryopid

Cryopid [18] and Cryopid2 [85] use the ptrace system call to attach to a running process and create a core dump of the application process that is later used to restart the computation. The checkpointable features supported are quite limited as compared to other checkpointing packages, and adding a new feature is often harder.

Checkpoint Restart In Userspace (CRIU)

CRIU [111] is a more recent checkpointing package based on Linux containers (LXC) [117]. The support is restricted to process trees and containers. The Linux kernel API was extended with new kernel features to support the user-space tool. Like Cryopid, it also uses the ptrace system call to inject checkpointing code inside the user processes. The checkpointing code executes in the context of a process to gather all the relevant information using the extended kernel API. Due to security issues, the checkpointing capability is only available for users with the CAP_SYS_ADMIN capability. (The CAP_SYS_ADMIN capability is a successor to the Linux setuid-root feature that is used to grant admin privileges to select applications/processes.)

Distributed MultiThreaded Checkpointing (DMTCP)

DMTCP version 1 [7] is implemented using user-space shared libraries. The original DMTCP supported TCP sockets, but was limited in that it did not support distributed computations communicating over SSH or InfiniBand. Further, even in the single-host case, it did not support virtualization of such kernel resources as pids, System V IPC, POSIX and System V shared memory, and POSIX timers. Section 2.4 provides a brief background on the architecture and the working of DMTCP version 1. This work represents a rewrite of the original DMTCP [7], in order to introduce user-space process virtualization for checkpointing the external environment. This enables us to checkpoint a wide variety of applications. The virtualization layer is implemented completely in user space with minimal overhead. Process virtualization goes beyond virtualizing the kernel resource identifiers and can be used to virtualize even higher-level constructs and abstractions such as the SSH protocol, as discussed in Chapter 3. Table 2.1 summarizes the difference between this work and the prominent transparent checkpointing packages.

Table 2.1: Comparison of various checkpointing systems. The systems compared are BLCR, Zap, CRIU, Cryopid2, DMTCP (v1), and Extensible CKPT, along five dimensions: support for multi-host computations, virtualization of kernel resources, virtualization of other resources, application-specific tuning, and third-party plugins. The other resource virtualization refers to the ability to virtualize protocols, device drivers, etc.

2.1.3 Fault Tolerance

Fault tolerance [70, 58] is a broader concept not discussed here. It enables a system to continue operating properly in the event of a failure of one of its components. Several strategies can be deployed to make a system fault tolerant, such as redundancy, partial re-execution, atomic transactions, instrumentation of data, and so on.

2.2 System Call Interpositioning

The concept of wrappers, as implemented in DMTCP, has a long and independent history under the more general heading of interposition. Interposition techniques have been used for a wide variety of purposes [123, 136, 65]. See especially [123] for a survey of a wide variety of interposition techniques. The work of Garfinkel [42] discusses practical problems associated with system call interpositioning. The packages PIN [88] and DynInst [124] are two examples of software packages that provide interposition techniques at the level of binary instrumentation.

2.3 Virtualization

Virtualization is the process of allowing unmodified source code or an unmodified binary to transparently run under varied external environments (different CPU, different network, different graphics server (e.g., X11-server), etc.). Most of the original checkpointing packages [73, 74, 26, 31, 71] ignored these issues and concentrated on homogeneous checkpointing. Virtualization techniques have been developed since the 1960s. Since then, systems have implemented different flavors of virtualization. In this section, we discuss the four types of virtualization techniques in common use today that are closest in spirit to this work.

2.3.1 Language-Specific Virtual Machines

A language-specific virtual machine, sometimes also known as an application virtual machine, a runtime environment, or a process virtual machine, allows an application to execute on any platform without having to write any platform-specific code. This is achieved by creating a platform-independent programming environment that abstracts the details of the underlying hardware or operating system. This abstraction is provided at the level of a high-level programming language. Notable examples include the Java Virtual Machine (JVM) [75], the .NET framework [122], and Android virtual machines [20, 36]. Language-specific virtual machines are often implemented using an interpreter, with an option of using just-in-time compilation for performance close to that of a compiled language [32].

2.3.2 Process Virtualization

Process virtualization allows a process to be migrated or restarted in a new external environment, while preserving the process's view of the external world. For example, a kernel may assign to a restarted process a different pid than the original pid at the time of checkpoint. The earliest checkpointing packages had assumed that the targeted user process would not save the value of the pid of a peer process, but rather would re-discover that pid on each use. As software complexity grew, this assumption became unreliable. More recent packages either modified the Linux kernel (e.g., BLCR [52]), or ran inside a Linux Container, a lightweight virtual machine (e.g., CRIU [111]).

Process virtualization (as exemplified by this work) has been considered intensively in the context of checkpointing only recently. Nevertheless, it has important forerunners in process hijacking [136] and in the checkpointing packages [76, 135] used in Condor’s Standard Universe. Similarly, there are connections of process virtualization with dynamic instrumentation (e.g., Paradyn/DynInst [124], PIN [88]).

2.3.3 Lightweight O/S-based Virtual Machines

O/S virtualization allows several isolated execution environments to run within a single operating system kernel. This technique exhibits better performance and density compared to virtual machines. On the downside, it cannot host a guest operating system different from the host operating system, or a different guest kernel (different Linux distributions are fine). Some examples include FreeBSD Jail [61], Solaris Zones [96], Linux Containers (LXC) [117], Linux-VServer [116], OpenVZ [118], and Virtuozzo [119].

Linux Containers are a kernel-level tool for providing a type of virtualization in the form of namespaces for process spaces and network spaces. This provides an alternative approach for such tasks as that of pid virtualization. The CRIU [111] checkpointing system uses LXC namespaces to virtualize kernel resource identifiers within the container. The namespaces avoid the problem of name conflicts for kernel resource identifiers during process migration.

Although process-level virtualization and Library OS [6, 95, 107] both operate in user space without special privileges, the goal of Library OS is quite different. A Library OS modifies or extends the system services provided by the operating system kernel. For example, Drawbridge [95] presents a Windows 7 personality, so as to run Windows 7 applications under newer versions of Windows. Similarly, the original exokernel operating system [37] provided additional operating system services beyond those of a small underlying operating system kernel, and this was argued to often be more efficient than a larger kernel directly providing those services.

2.3.4 Virtual Machines

Hardware virtualization uses an abstract computing platform. Thus, it hides the hardware platform (the host software). On top of the host software, a virtual machine (guest software) is running. The guest software executes as if it were running directly on the physical hardware, with a few restrictions, such as the network access, display, keyboard, and disk storage. Examples of virtual machines include VMware, Qemu/KVM [114], Xen [15], VirtualBox [130], and Lguest [115]. The virtual machines often run a set of tools inside the guest operating system to inspect and control its behavior. Further, in some cases the guest operating system is modified to provide additional support/features, and the technique is referred to as paravirtualization. Some notable examples of paravirtualization are Xen [15] and Microsoft Hyper-V [125].

One could also include binary instrumentation techniques such as PIN [88] and DynInst [124] in a discussion of virtualization, but this tends not to be used much with checkpointing.

The work of this thesis introduces process virtualization for abstractions beyond the traditional kernel resource identifiers in order to virtualize numerous external subsystems such as SSH connections, the InfiniBand network, KVM and Tun/Tap interfaces, SLURM and Torque batch queues, and GPU drivers. The modular approach to virtualizing these external subsystems allows the checkpointing system to grow organically (see Chapter 4). By virtualizing these external environments, this work enabled some projects to be the "first" to support checkpointing.

2.4 DMTCP Version 1

DMTCP (Distributed MultiThreaded CheckPointing) is free, open source software (http://dmtcp.sourceforge.net, LGPL license) and traces its roots to early 2005 [30]. The DMTCP approach has always insisted on not making modifications to the kernel, and not requiring any root (administrative) privileges. While this was sometimes more difficult than an approach with full privileges inside the kernel, it integrates better with complex cyber infrastructures. DMTCP's lack of administrative privilege provides a level of security assurance. As a side effect of working completely in user space, DMTCP relies only on the published APIs (e.g., POSIX and the Linux proc filesystem) to perform checkpoint-restart. Thanks to the highly stable kernel API, the same DMTCP software can be used on Linux kernels ranging from the latest bleeding-edge release to Linux 2.6.5 (released in April 2004). In this section, we provide only a brief overview of the checkpoint-restart mechanisms of DMTCP. More details can be found in Ansel et al. [7]. Using DMTCP with an application is as simple as:

    dmtcp_launch ./myapp arg1 ...

    # From a second terminal window:
    dmtcp_command --checkpoint
    dmtcp_restart ckpt_myapp_*.dmtcp

This checkpoint image contains a complete standalone image of the application with all the relevant information required to restart it later. It can be replicated and migrated as needed. DMTCP also creates a restart script to help automate the restart of a distributed computation.

Figure 2.1: Architecture of DMTCP. The figure shows a DMTCP coordinator exchanging checkpoint messages with the checkpoint thread in each user process; each user process also contains user threads (quiesced via SIGUSR2), and the two user processes communicate over a socket connection.

As seen in Figure 2.1, a computation running under DMTCP consists of a centralized coordinator process and several user processes. The user processes may be local or distributed. User processes may communicate with each other using sockets, shared memory, pseudo-terminals, etc. Further, each user process has a checkpoint thread which communicates with the coordinator. The checkpoint thread is created by the DMTCP library, dmtcphijack.so, which is loaded into each of the application processes at startup (before calling the application's main() function) by using the LD_PRELOAD feature of the loader. The DMTCP library installs a signal handler for the checkpoint signal that is later used to quiesce user threads. The checkpoint thread is responsible for creating checkpoint images as and when requested by the coordinator.

2.4.1 Library Call Wrappers

The DMTCP library adds wrappers around a small number of libc functions. For efficiency reasons, it avoids wrapping any frequently invoked system calls such as read and write. The wrappers are used to gather information about the current process, to track all forked child processes as well as remote processes created via SSH, and to automatically put them under checkpoint control. The local child processes inherit the LD_PRELOAD environment variable, whereas for the remote child processes, the command line is modified to launch them under DMTCP control. In the case of sockets, DMTCP needs to know whether the sockets are TCP/IP sockets (and whether they are listener or non-listener sockets), UNIX domain sockets, or pseudo-terminals. Again, it uses wrappers around socket, connect, accept, open, close, etc., to do that.
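As an illustration of the kind of bookkeeping these wrappers perform, the sketch below wraps listen() to note which file descriptors are listener sockets before forwarding the call to libc. Here record_listener() is a hypothetical stand-in for the checkpointer's internal connection table, not an actual DMTCP function.

    /* Sketch of a bookkeeping wrapper: record which descriptors are listener
     * sockets, then forward the call to the real libc implementation.        */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static void record_listener(int fd) {        /* placeholder for the table */
        fprintf(stderr, "wrapper: fd %d is a listener socket\n", fd);
    }

    int listen(int sockfd, int backlog) {
        int (*next_listen)(int, int) =
            (int (*)(int, int)) dlsym(RTLD_NEXT, "listen");
        int ret = next_listen(sockfd, backlog);
        if (ret == 0)
            record_listener(sockfd);   /* state needed to recreate the socket on restart */
        return ret;
    }

Because such a wrapper is loaded via LD_PRELOAD, it is resolved before the libc symbol, so the library sees the call, records the needed state, and only then delegates to the real implementation.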

2.4.2 DMTCP Coordinator

DMTCP uses a stateless centralized process, the DMTCP coordinator, to synchronize the separate phases at the time of checkpoint and restart. The checkpoint thread in each process communicates with the DMTCP coordinator through a socket connection. The checkpoint procedure can be initiated by the coordinator on an explicit request from the user through its interactive interface, through the dmtcp_command utility, or on expiration of a predefined checkpoint interval. It should be noted that the coordinator is a single point of failure, since the entire computation relies on it.

2.4.3 Checkpoint Thread

The checkpoint thread waits for a checkpoint request from the coordinator. On receiving a checkpoint request, the checkpoint thread quiesces the user threads (by sending a checkpoint signal) and takes the process through the phases of creating a checkpoint image. Similarly, during restart, it takes the process through the restart phases and finally un-quiesces the user threads. The checkpoint thread is dormant during the normal execution of the process and is only active during the checkpoint/restart procedures.

2.4.4 Checkpoint

On receiving the checkpoint request from the coordinator, the checkpoint thread sends the checkpoint signal to all the user threads in the process. This quiesces the user threads by forcing them to block inside a signal handler previously installed by DMTCP. Prior to creating the checkpoint image, the checkpoint thread copies into user-space memory any kernel state that is required to restart the process, such as the state associated with network sockets, files, and pseudo-terminals. The checkpoint image is then created by writing all of user-space memory to a checkpoint image file; each process has its own checkpoint image. The user threads are then allowed to resume executing application code. Note that user-space memory includes all of the run-time libraries (libc, libpthread, etc.), which are also saved in the checkpoint image. DMTCP doesn't directly handle asynchronous DMA operations that may be pending or ongoing at the time of checkpoint. This could result in an inconsistent checkpoint state, as the "quiesce" property has been violated.
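The following sketch (a simplification, not DMTCP's actual checkpoint writer) shows the general idea of enumerating the process's memory regions from the Linux proc filesystem so that each readable region can be written to a checkpoint image file.

/* A minimal sketch of enumerating user-space memory regions via the Linux
 * proc filesystem.  A real checkpoint writer would copy each readable
 * region's bytes into the checkpoint image. */
#include <stdio.h>

int list_memory_regions(void) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) return -1;

    char line[1024];
    while (fgets(line, sizeof(line), maps) != NULL) {
        unsigned long start, end;
        char perms[8];
        /* Each line begins with "start-end perms ..." */
        if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3 &&
            perms[0] == 'r') {
            /* A real implementation would write [start, end) to the image. */
            printf("region %lx-%lx (%s)\n", start, end, perms);
        }
    }
    fclose(maps);
    return 0;
}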

2.4.5 Restart

As the first step of the restart phase, DMTCP groups all restart images from a single node under a single dmtcp_restart process. The dmtcp_restart process recreates all file descriptors. It then uses a discovery service to discover the new addresses for processes migrated to new hosts and restores network connections. It then forks a child process for each checkpoint image. These individual processes then restore their memory areas. Next, the user threads are recreated using the original thread stacks. All user threads restore their pre-checkpoint context using the longjmp library call and are forced to wait in the signal handler. The checkpoint thread then restores the kernel state that was saved during the checkpoint phase. Finally, the checkpoint thread un-quiesces the user threads and the user threads resume executing application code.

2.4.6 Checkpoint Consistency for Distributed Processes

In the case of distributed processes, one needs to determine a consistent global state of the asynchronous system at the time of checkpoint. The notion of the global state of the system was formalized by Chandy and Lamport [28]. The central idea is to use marker (snapshot) messages. A process that wants to initiate a checkpoint records its local state and sends a marker message on each of its outgoing channels. All other processes save their local state on receiving the first marker message on some incoming channel. For every other channel, any messages received before the marker message were obviously sent before the snapshot "cut off". Hence they are included in the local snapshot.

Chandy and Lamport were primarily concerned with "uncoordinated snapshots" (no centralized coordinator). DMTCP employs a strategy of "coordinated snapshots" using a global barrier. This makes the implementation of Chandy-Lamport consistency particularly easy, since messages can be sent only prior to the global barrier. Processes are "quiesced" (frozen) at the barrier. Next, the checkpoint thread of each process receives all pending data in the network, after which a globally consistent snapshot is taken. The details of the DMTCP implementation follow.

To initiate a checkpoint, the coordinator broadcasts a quiesce message to each process in the computation. On receiving the message, the checkpoint manager thread in each process quiesces the user threads, sends an acknowledgement to the coordinator, and waits for the drain message. After receiving acknowledgements from all processes, the coordinator lifts the global barrier and broadcasts the drain message. On receiving the drain message, the checkpoint manager thread sends a special cookie (marker message) through the "send" end of each socket. Next, it reads data from the "receive" end of each socket until the special cookie is received. Since user threads in all the processes have already been quiesced, there can be no more in-flight data. The received in-flight data has now been copied into user-space memory, and will be included in the checkpoint image.

On restart, once the socket connections have been restored, the checkpoint manager thread sends the saved in-flight data (previously read from the "receive" end of the socket) back to its peer processes. The peer processes then refill the network buffers by pushing the data back into the network through the "send" end of each restored socket connection. The checkpoint manager thread then sends a message to the coordinator to indicate the end of the refill phase and waits for the resume message. Once the coordinator has received messages indicating the end of the refill phase from all involved processes, it lifts the global barrier and broadcasts the resume message. On receiving the resume message, the checkpoint manager un-quiesces the user threads and they resume executing user code.

Chapter 3: Adaptive Plugins as a Mechanism for Virtualization

This chapter introduces several important examples of the need to integrate checkpointing with an external subsystem: pid virtualization, SSH virtualization, virtualization of the InfiniBand network, virtualization of OpenGL, and virtualization of POSIX timers. The concept of process virtualization is introduced through concrete examples. Virtualization of InfiniBand [27] and OpenGL [62] were extensive projects requiring much domain knowledge. The specific results represent long-standing open problems and are not part of this dissertation. We use those examples to motivate the need for process virtualization, and to argue for the expressivity of process virtualization in Chapter 5.

3.1 The Ever Changing Execution Environment

In the next subsections, five examples of strategies for process virtualization are described, in order to make clear the rich design space available for process virtualization. In each of these cases, the nature of the virtualization requirement is unique. The five examples are:


1. virtualization of kernel resource identifiers, using the example of process id (pid) (Section 3.1.1);

2. virtualization of protocols, using the SSH protocol as its example (Section 3.1.2);

3. a shadow device driver approach for transparent checkpointing over InfiniBand (Section 3.1.3);

4. a record-replay approach, using transparent checkpointing of OpenGL 3D-graphics as an example (Section 3.1.4); and

5. adapting to application requirements for more control over checkpointing (Section 3.1.5).

3.1.1 PID: Virtualizing Kernel Resource Identifiers

Pid is one of the simplest examples of the kernel resource identifiers that need virtualization. The operating system kernel is unlikely to assign the same pid on restart as existed at the time of checkpoint. Even if the kernel were to allow a mechanism to request a particular pid, the requested pid might be in use (assigned to a different process). If the target application has saved the pre-checkpoint pid and tries to use it after restart, it could have undesired effects. For example, if the process uses the saved pid to send a signal after restart, in the best case the operation will fail because the saved pid is invalid. In the worst case, the saved pid might correspond to some other process and the signal will be sent to that other process.

To avoid these situations, we must provide a mechanism such that processes can continue to use the saved pid after restart without any undesired side effects. This can be done by providing the application process with a virtual pid that never changes for the duration of the process's lifetime. When communicating with the kernel, the corresponding real pid that the kernel knows about is looked up in the translation table and passed on to the kernel. Figure 3.1 shows a simple schematic of a translation layer between the user processes and the operating system kernel, along with a pid translation table to convert between virtual and real pids. At each restart, the translation table is refreshed to update the real pids.

Figure 3.1: Virtualization of kernel resource identifiers (example shown for process id). The application sees virtual pids (e.g., 4000, 4001); the translation table maps them to the real pids (e.g., 2652, 3120) before calls such as kill() reach the kernel.

3.1.2 SSH Connection: Virtualizing a Protocol

Pid virtualization is a classic example of virtualizing low-level kernel resource identifiers using a translation layer. However, the same solution doesn't suffice for higher level abstractions, such as an SSH connection.

Figure 3.2: SSH connection (ssh Node2 app2). The user process, app1, forks a child SSH client process (ssh) to call the SSH server (sshd) on the remote node to create a remote peer process, app2.

Recall that the ssh command operates by connecting across the network to a remote SSH daemon, sshd, as shown in Figure 3.2. Since the SSH daemon is privileged, it is not possible for the unprivileged user-space checkpointing system to start a new SSH daemon during restart. The issue becomes even more complicated when the client and server processes are restarted at entirely different network addresses on different hosts. For virtualizing an SSH connection, it doesn't suffice to virtualize just the network address. Instead, one must virtualize the entire SSH client-server connection. In essence, the SSH daemon represents a privileged process running a certain protocol. Regardless of whether the protocol is an explicit standard or a de facto standard internal to the subsystem, process virtualization must virtualize that protocol. Checkpointing and restarting the privileged SSH daemon is not an option.

Figure 3.3: Virtualizing an SSH connection (ssh Node2 app2). The call to launch an SSH client process is intercepted to launch virtual SSH client (virt_ssh) and server (virt_sshd) processes. virt_ssh and virt_sshd are unprivileged processes.

Process virtualization provides a principled and robust algorithm for transparently checkpointing an SSH connection. As shown in Figure 3.3, the SSH connection is virtualized by creating virt_ssh and virt_sshd helper processes that shadow the SSH client and server processes, respectively. The virt_ssh and virt_sshd processes are owned by the user and are placed under checkpoint control. The ssh and sshd processes are not checkpointed.

On restart, the user processes are restored along with the virt_ssh and virt_sshd processes (without the underlying SSH connection) on new hosts. The virt_ssh process then recreates a new SSH connection (see Section 5.4).

3.1.3 InfiniBand: Virtualizing a Device Driver

Both ssh for a traditional TCP network and the new InfiniBand network are intimately connected with high performance implementations of MPI (Message Passing Interface). An implementation usually retains ssh and TCP in addition to InfiniBand support, since typical MPI implementations bootstrap their operation through ssh in order to create additional MPI processes (MPI ranks), and to exchange InfiniBand addresses among peers.

InfiniBand virtualization has been a particular challenge, both due to its complexity [134, 63, 16] and due to the fact that much of the state is hidden either within a proprietary device driver or within the hardware itself. The solution here is to use a shadow device driver approach [106]. The InfiniBand plugin (§5.10) maintains a replica of the device driver and hardware state by intercepting and recording the InfiniBand library calls. On restart, this replica is used to recreate and restore the state of the InfiniBand connection.

3.1.4 OpenGL: A Record/Replay Approach to Virtualizing a Device Driver

Scientific visualization is yet another example that requires a different kind of virtualization solution. Some graphics computations are extremely GPU-intensive. Further, most scientific visualizations today use OpenGL for 3D graphics. If a scientist walks away from a visualization and needs to restart it the next day, there will be wasted time to reproduce it. Further, switching between multiple scientific visualizations becomes extremely inefficient. Hence, checkpoint-restart is a critical technology. However, it is difficult to checkpoint, because much of the graphics state is encapsulated in a vendor-proprietary hardware GPU chip. The OpenGL plugin (§5.9) achieves checkpoint-restart of 3D graphics by using a process virtualization strategy of record (record all OpenGL calls), prune (prune any calls not needed to reproduce the most recent graphics state), and replay (replay the calls during restart in order to place the GPU into a semantically equivalent state to the state that existed prior to checkpoint).

3.1.5 POSIX Timers: Adapting to Application Requirements

A POSIX timer is an external resource maintained within the kernel and has an associated kernel resource identifier known as a timer id. As with pid virtualization, the timer id needs to be virtualized as well and can use the same strategy. Consider a process that is checkpointed while a timer is still armed, i.e., the timeout specified with the timer has not yet expired. On restart, what is the desired behavior? Should the timer expire immediately or should it expire after exhausting the remaining timeout period? There is no single correct answer, as the desired result is application dependent. For an application that is waiting for a response from a web server, it is desirable to expire the timer on restart. However, for an application process that is monitoring a peer process for potential deadlocks, the timer should continue for the remaining time period.
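The sketch below illustrates how a timer plugin might expose that policy choice at restart (the policy flag and the two helper functions are hypothetical; only timer_gettime and timer_settime are standard POSIX calls). The remaining time is recorded at checkpoint, and at restart the timer is either expired immediately or re-armed with the saved remainder.

/* Illustrative sketch of the two restart policies for an armed POSIX timer.
 * 'policy' and 'saved' are hypothetical plugin state; timer_gettime() and
 * timer_settime() are the standard POSIX calls. */
#include <time.h>

enum restart_policy { EXPIRE_IMMEDIATELY, CONTINUE_REMAINING };

static struct itimerspec saved;     /* remaining time recorded at checkpoint */

void timer_on_checkpoint(timer_t timerid) {
    timer_gettime(timerid, &saved); /* how much of the timeout is left */
}

void timer_on_restart(timer_t timerid, enum restart_policy policy) {
    struct itimerspec when = saved;
    if (policy == EXPIRE_IMMEDIATELY) {
        when.it_value.tv_sec = 0;
        when.it_value.tv_nsec = 1;  /* fire as soon as possible */
    }
    /* Otherwise re-arm with the remaining timeout saved at checkpoint. */
    timer_settime(timerid, 0, &when, NULL);
}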

3.2 Virtualizing the Execution Environment

As seen in the previous section, it is imperative to virtualize external resources in order to fully support checkpoint-restart for arbitrary applications. In order to be successful, virtualization should be done transparently to the application. This assumes that the application interacts with the external resource through a fixed set of APIs. Two basic requirements for virtualizing an external resource for checkpointing are:

1. Virtualize external subsystems.

2. Capture/restore the state of external resources.

Next, we talk about each of these requirements, elaborate on their importance, and discuss what additional features are required for a complete virtualization solution.

3.2.1 Virtualize Access to External Resources

Since external resources may change between checkpoint and restart, we need to virtualize them. This can be achieved through a translation layer between the application process and the resource. Virtualizing a resource may be as simple as translating between virtual and real identifiers, as in pid virtualization (Section 3.1.1), or it may involve more sophisticated mechanisms like shadow device drivers (Section 3.1.3). Depending upon the external resource, the translation may be active throughout the computation (e.g., for pids) or only during the restart procedure (for SSH).

Further, the translation layer should ensure that access to a resource is atomic with respect to checkpoint-restart, i.e., a checkpoint shouldn't be allowed while the process is in the middle of manipulating or accessing the resource. Not doing this may result in an inconsistent state at restart. Consider pid virtualization, where a thread tries to send a signal to another thread using the virtual tid (thread id). The pid virtualization layer translates the virtual tid to the real tid and sends the signal using the real tid. Further consider that the process is checkpointed after the translation from virtual to real, but before the signal is actually sent. On restart, the process will resume and will try to send the signal with the old real tid, which of course is no longer valid.

Share the virtualized view with peers

Virtualizing access to external resources gets complicated in a distributed environment. Processes communicate with their peers. This demands a consistent virtualization layer across all involved parties. The need becomes more evident after restart, when the translation table is updated to reflect the current view of the external resource. These updates must be shared with all the peer processes to allow them to update their own translation tables. For example, in the case of network address virtualization, each process must inform its peers of its new network address on restart to allow them to restore socket connections.

3.2.2 Capture/Restore the State of External Resources

When restarting a process from a previous checkpoint, we need to restore the process's view of the external resource. We need to identify the relevant information that would be required to restore or recreate the external resource during restart. This information should be gathered at the time of checkpoint and saved as part of the checkpoint image. It can then be read from the checkpoint image on restart.

Quiesce the external resource

During checkpoint, the external resources should be quiesced to ensure a consistent state. For example, an asynchronous disk read operation must be allowed to finish before writing the process memory to the checkpoint image, to avoid data corruption due to ongoing memory updates (DMA).

Consistency of the computation state

As discussed above, a virtualization scheme should be transparent to the user application. Thus, the application's view of the external resource should be consistent before and after checkpoint. Similarly, the application process should not observe any change in its own state before and after checkpoint. This involves preserving the state of the running process (e.g., threads, memory layout, and file descriptors) between checkpoint and restart. Note that it is acceptable to alter the process state and/or the state of the external resource while performing checkpoint-restart. However, such changes should be reverted and the pre-checkpoint view of the application should be restored before the application process is allowed to resume executing application code.

3.3 Adaptive Plugins as a Synthesis of System-Level and Application-Level Checkpointing

So far we have discussed the motivation for virtualizing the execution environment along with the basic requirements for achieving it. In this section we discuss possible design choices.

There are two basic approaches for achieving the goals discussed in Section 3.2. One is to use application-specific checkpointing by having the application developer write extra code to support checkpointing. However, as discussed in Section 2.1, this is not an ideal solution, as it requires knowledge of the internals of the application and puts a burden on the developer. The second approach is to use an existing monolithic checkpointing system such as DMTCP version 1 and insert the virtualization code into it, along with a large number of heuristics to satisfy a variety of application needs (e.g., heuristics for POSIX timers as discussed in Section 3.1.5). However, there is no universal set of heuristics that can be used with all applications, as each application requires specific heuristics to cater to its needs.

In this work, we present adaptive plugins as an ideal compromise between these two extreme approaches to meet the virtualization requirements. An adaptive plugin is responsible for virtualizing a single external resource. By basing plugins on top of a transparent checkpointing package such as DMTCP, the simplicity of transparent checkpointing is maintained. With plugins, no target application code is ever modified, yet they enable application-specific fine tuning for checkpoint-restart. We have already seen examples in previous sections where the external resource needs to be virtualized. The POSIX timer plugin is an example of an application-specific heuristic plugin. A memory cutout plugin that reduces the memory footprint of the process in order to reduce checkpoint image size would be yet another example of an application-specific plugin.

Chapter 4: The Design of Plugins

In the previous chapter, we discussed several use cases that require virtualization of external resources in order to support checkpoint-restart. External resources may include, but are not limited to, kernel resource identifiers, protocols, and hardware device drivers. We further listed the two basic requirements for virtualizing an external resource and discussed how a design based on adaptive plugins is well suited for such tasks.

Section 4.1 introduces a basic framework of a plugin architecture that provides the same set of services for virtualizing external resources that were introduced informally in Chapter 3. A plugin is an implementation of the process virtualization abstraction. In process virtualization, an external subsystem is virtualized by a plugin. All software layers above the layer of that plugin see a modified subsystem. Section 4.2 then uses these requirements to provide a design recipe for virtualization through plugins. Section 4.3 takes into account the issue of dependencies among multiple plugins within the same application process. Section 4.4 extends that design recipe to multiple processes, including distributed processes on multiple hosts. Section 4.5 describes three special-purpose plugins that are required for checkpointing all processes. This chapter concludes with Section 4.6, containing some implementation challenges.

Figure 4.1: Plugin Architecture. (The target application sits above internal and third-party plugin libraries, such as the coordinator interface, thread, and memory plugins; each plugin provides library wrappers to virtualize a resource and capture/restore its state. The plugins and the plugin engine sit between the application and the runtime libraries (libc, etc.), which in turn sit above the operating system kernel.)

4.1 Plugin Architecture

An application consists of program and data. It interacts with the execution environment through various libraries. For example, the libc runtime library provides access to the kernel resources, a device driver library may provide access to the underlying device hardware, and so on. Thus one can imagine virtualizing the execution environment by intercepting the relevant library calls. This allows us to inspect and modify the behavior of the underlying subsystem as seen by the application.

Figure 4.1 shows a high-level view of the plugin architecture. It has two main components: (1) plugins, and (2) the plugin engine. Plugins and the plugin engine are implemented as separate dynamic libraries. They are loaded into the application using the LD_PRELOAD feature of the Linux loader.

Plugin

A plugin is a checkpoint subsystem that virtualizes a single external resource or subsystem with the help of function wrappers (§4.1.1). It saves and restores the state of the external subsystem. Examples of external subsystems are: process ids, network sockets, InfiniBand, etc. Application processes are considered as if they are independent, and inter-process communication through pids, sockets, etc. is handled through plugins. Further, a plugin is transparent to the target application and can be enabled/disabled for the application as needed. Finally, third parties can write orthogonal customized plugins to fit their needs.

Plugin Engine

The plugin engine provides event notification services (§4.1.2) to assist plugins in capturing/restoring the state of their specific external resources. It further interacts with a coordinator interface plugin to provide publish/subscribe services (§4.1.3) that enable plugins to interact with each other and share the translation tables for resource virtualization.

4.1.1 Virtualization through Function Wrappers

Since the underlying resources provided by the operating system may change between checkpoint and restart, there is a need to virtualize them. The plugin virtualizes the external resources by putting wrappers around interesting library calls, which interpose when the target application makes such a call. In the case of pids, the virtualization can be done using a simple table translating between virtual and real pids, as shown in Listing 4.1. The arguments passed to the library call are modified to replace the virtual pid with the real pid. Similarly, the return value can also be modified as required. The virtual pid column of this table is saved as part of the checkpoint image, and at restart time the real pid column is populated as processes/threads are recreated.

int kill(pid_t pid, int sig) {
  disable_checkpoint();
  pid_t real_pid = virt_to_real(pid);
  int ret = REAL_kill(real_pid, sig);
  enable_checkpoint();
  return ret;
}

Listing 4.1: A simple wrapper for kill.

As seen in the above listing, a function wrapper is implemented by defining a function of the same name as the call it is going to wrap. The "real" function here refers to the function with the same signature in a later plugin or in a runtime library. It is possible for multiple plugins to create wrappers around a single library function. The order of execution of the wrappers is determined by a plugin hierarchy corresponding to the order in which the plugins are invoked (Section 4.3).

Capture/Restore state of external resource

Wrappers are also used to "spy" on the parameters used by an application to create a system resource, in order to assist in creating a semantically equivalent copy on restart. At the time of checkpoint, a plugin saves the current state of its underlying resources into the process memory. The state can be obtained from a number of places, such as the process environment and the operating system kernel. In some cases, the function wrappers can also be used to gather information about the external resources. For example, in the "socket" wrapper (Listing 4.2), the socket plugin will save the associated domain and protocol information along with the socket identifier.

int socket(int domain, int type, int protocol) {
  disable_checkpoint();
  int ret = REAL_socket(domain, type, protocol);
  if (ret != -1) {
    register_new_socket(ret, domain, type, protocol);
  }
  enable_checkpoint();
  return ret;
}

Listing 4.2: Wrapper for socket() to record socket state.

Atomic transactions

Plugins may have to perform atomic operations that must not be interrupted by a checkpoint. For example, the translation and the call to the real function should be done atomically with respect to checkpoint-restart. Otherwise, there is a possibility of checkpointing after the translation but before the real function is called. In that case, on restart, the translated value is no longer valid and can impact the correctness of the program. The plugin engine provides disable_checkpoint and enable_checkpoint services for enclosing the critical section, as seen in Listing 4.1.

The disable_checkpoint and enable_checkpoint services are implemented using a modified write-biased reader-writer lock. The modification allows a recursive reader lock even if the writer is queued and waiting for the lock. The checkpoint thread must acquire the writer lock before it can quiesce the user threads. On the other hand, the user threads acquire and release the reader lock as part of a call to disable_checkpoint and enable_checkpoint, respectively. If a checkpoint request arrives while a user thread is in the middle of a critical section, the checkpoint thread will wait until the user thread comes out of the critical section and releases the reader lock. A user thread is not allowed to acquire a reader lock if the checkpoint thread is already waiting for the writer lock, to prevent checkpoint starvation.
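A plain POSIX reader-writer lock gives the flavor of this mechanism, as in the sketch below (a simplification: it omits the recursive-reader modification described above, which the real implementation needs).

/* Simplified sketch of disable_checkpoint/enable_checkpoint using a POSIX
 * reader-writer lock.  Only the basic read/write protocol is shown. */
#include <pthread.h>

static pthread_rwlock_t ckpt_lock = PTHREAD_RWLOCK_INITIALIZER;

void disable_checkpoint(void) { pthread_rwlock_rdlock(&ckpt_lock); }
void enable_checkpoint(void)  { pthread_rwlock_unlock(&ckpt_lock); }

/* Called by the checkpoint thread before quiescing user threads; it blocks
 * until no user thread is inside a critical section. */
void acquire_checkpoint_lock(void) { pthread_rwlock_wrlock(&ckpt_lock); }
void release_checkpoint_lock(void) { pthread_rwlock_unlock(&ckpt_lock); }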

Atomicity is especially important for wrappers that create or destroy a resource instance. For example, when creating a socket, if the checkpoint is taken right after the socket is created but before the socket plugin has a chance to register it, the socket may not be recreated at restart because no record of it exists. Thus one must atomically create and record socket state, as shown in Listing 4.2.

Wrappers can be considered the most basic of all virtualization tools. A flexible, robust implementation of wrapper functions turns out to be surprisingly subtle and is discussed in more detail in Section 4.6.1.

4.1.2 Event Notifications

Event notifications are used to inform other plugins (within the same process) of interesting events. Any plugin can generate notifications. The plugin engine then delivers these notifications to all loaded plugins in a sequential fashion. The order of delivery of notifications depends on the plugin hierarchy, as discussed in Section 4.3. Plugins must declare an event hook in order to receive event notifications. A plugin may decide to ignore any or all notifications.

Figure 4.2 shows the "write-ckpt" and "restart" events generated by the coordinator interface plugin, which are then delivered to all other plugins by the plugin engine.

Figure 4.2: Event notifications for write-ckpt and restart events. The numbers in parentheses indicate the order in which messages are sent. Notice that the restart event notification is delivered in the opposite order of the write-ckpt event (write-ckpt travels from the socket plugin down to the memory plugin; restart travels from the memory plugin up to the socket plugin).

Some of the interesting notifications are:

• Initialize: generated during the process initialization phase (even before main() is called). The plugins can initialize data structures, etc. A plugin may choose to register an exit-handler using atexit(), which will be called when the process is terminating.

• Write-Ckpt: each plugin saves the state of its external resources into the process's memory. The memory plugin(s) then create the checkpoint image.

• Resume: generated during the checkpoint cycle, after the checkpoint image has been written and before the user threads resume normal execution.

• Restart: generated during the restart phase.

• AtFork: generated during a fork; works similarly to the libc function pthread_atfork.

dmtcp_event_hook(is_pre_process, type, data) {
  if (is_pre_process) {
    switch (type) {
      case Initialize: myInit(); break;
      case Write_Ckpt: myWriteCkpt(); break;
      ...
    }
  }
  if (!is_pre_process) {
    switch (type) {
      case Resume: myResume(); break;
      case Restart: myRestart(); break;
      ...
    }
  }
}

Listing 4.3: An event hook inside a plugin.

The Resume and Restart notifications are sent to plugins in the opposite order from the Write-Checkpoint notification (see Listing 4.3 and Figure 4.2b). This is to ensure that any dependencies of a plugin are restored before the plugin itself is restored. For example, the memory plugin (responsible for writing out or reading back the checkpoint image) is always the lowest layer (see Figure 4.1). This is so that other plugins may save data in the process's memory during checkpoint, and find it again at the same address during restart.

Figure 4.3: Publish/Subscribe example for sockets. (Each node's socket plugin publishes its current local address to the coordinator and subscribes to the current address of its remote peer.)

4.1.3 Publish/Subscribe Service

In a distributed environment, a publish/subscribe service is needed so that a given type of plugin may communicate with its peers in different processes. Typically, on restart, once the process resources have been recreated, the plugins publish their virtual ids along with the corresponding real ids using the publish/subscribe service. Next they subscribe to updates from other processes and update their translation tables accordingly. This was seen for the pid virtualization plugin (Section 3.1.1). Similarly, when a parallel computation is restarted on a new cluster, the socket plugin must exchange socket addresses among peers.

At the heart of the publish/subscribe service is a key-value database whose key corresponds to the virtual name and whose value corresponds to the real name of the underlying resource. The database is populated when plugins publish the key-value pairs. Once the plugin has published all of the relevant key-value pairs, it may then subscribe by sending queries to the database. The plugins are notified as soon as a match for the queried key is available. Typically, the key-value database is used only at restart time, as it doesn't need to be preserved across checkpoint-restart.

Figure 4.3 shows an example of the socket plugins exchanging their current network addresses with their peers. During the Write-Checkpoint phase, the socket peers agree on using a unique key (see Section 4.4.1) to identify the connection. While restarting, this unique key is used to publish the current network address.

It is possible to have multiple publish/subscribe APIs that differ according to scope. It is left to the plugins to choose the scope best suited to their needs. Two trivial scopes are node-private and cluster-wide. A node-private publish/subscribe API is sufficient for plugins dealing with resources limited to a single node, such as pseudo-terminals, shared memory, and message queues. Plugins dealing with resources that may span multiple nodes, such as sockets and InfiniBand, should use the cluster-wide publish/subscribe API.

The node-private publish/subscribe service may be implemented using shared memory, while the cluster-wide publish/subscribe service must be provided by some centralized resource, such as the DMTCP coordinator.
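The interface a plugin sees can be as small as a pair of calls, as in the hypothetical sketch below (publish_key_value and subscribe_key_value are illustrative names, not the actual DMTCP API), shown here for the pid plugin on restart.

/* Hypothetical publish/subscribe interface, shown for the pid plugin. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Stub implementations so the sketch is self-contained; a real system would
 * route these through the coordinator. */
int publish_key_value(const char *key, const char *value) {
    printf("publish %s -> %s\n", key, value);
    return 0;
}
int subscribe_key_value(const char *key, char *value, size_t size) {
    (void)key; (void)value; (void)size;
    return -1;  /* no match in this stub */
}

void pid_plugin_on_restart(pid_t virt_pid, pid_t peer_virt_pid) {
    char key[64], val[64];

    /* Publish our own mapping: virtual pid -> new real pid. */
    snprintf(key, sizeof(key), "pid:%d", virt_pid);
    snprintf(val, sizeof(val), "%d", getpid());
    publish_key_value(key, val);

    /* Look up a peer's new real pid by its virtual pid. */
    snprintf(key, sizeof(key), "pid:%d", peer_virt_pid);
    if (subscribe_key_value(key, val, sizeof(val)) == 0) {
        /* update the local translation table with the peer's new real pid */
    }
}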

4.2 Design Recipe for Virtualization through Plugins

So far we have seen the plugin architecture and the services provided by it. We have also seen how these services suffice to meet the virtualization requirements. We use this information to create a typical recipe for writing a new plugin to virtualize an "external resource". One is usually given a name or id (identifier) that provides a link to the external resource. The id may be for an InfiniBand queue pair, for a graphics window, for a database connection, for a connection from a guest virtual machine to its host/hypervisor, and so on.

In all of these cases, the recipe is:

1. Intercept communication to the external resource (usually by interposing on library calls), and translate between any real ids from the external resource and virtual ids that are passed to the application software. A plugin maintains this translation table of virtual/real ids.

2. Quiesce the external resource (or wait until the external resource has itself reached a quiescent state).

3. Interrogate the state of the external resource sufficiently to be able to reconstruct a semantically equivalent resource at restart time.

4. Checkpoint the application. The checkpoint will include state information about the external resource, as well as a translation table of virtual/real ids.

5. At restart time, the state information for the external resource is used to create a semantically equivalent copy of the external resource. The translation table is then updated to maintain the same virtual ids, while replacing the real ids of the original external resource with the real ids of the newly created copy of the external resource.

It is not always efficient to quiesce and save the state of an external resource. The many disks used by Hadoop are a good example of this. The data in an external database server is another example. It is not practical to drain and save all of the external data in secondary storage. There are two potential approaches.

The first approach is to delay the checkpoint during a critical phase. In the case of Hadoop, one would delay the checkpoint until the Hadoop computation has executed a reduce operation, in order to not overly burden the resources of the Hadoop back end. A similar approach can be taken for NVIDIA GPUs. In many cases, there are also strategies for plugins to transparently detect this critical phase and delay the checkpoint until that time.

The second approach is to allow for a partial closed-world assumption in which some state (data/contents) is assumed to be compatible across checkpoint and restart. In the case of the external database server, the external data already lies in fault-tolerant storage and is compatible across checkpoint and restart. Thus the solution is to maintain a virtual id that identifies the external storage of the server. That virtual id is used at restart time to restore the connection to the database server.

4.3 Plugin Dependencies

Some plugins may have dependencies on other plugins. For example, the File plugin depends on the Pid plugin to restore file descriptors pointing to "/proc/PID/maps" and so on. Each plugin provides the list of dependencies which must be satisfied to successfully load the given plugin. The dependency declaration also affects the level of parallelism that can be achieved when performing phases such as Checkpoint, Resume, and Restart. Subject to the dependencies among plugins, this design provides end users with the possibility of selective virtualization. Selectively including only some plugins is advantageous for three reasons: (i) performance (some end-user plugins might have high overhead); (ii) software maintenance (other plugins can be removed while debugging a particular plugin); and (iii) platform-specific plugins.

4.3.1 Dependency Resolution

Similar in spirit to modern software package formats such as RPM and deb, a plugin provides a list of features/services that it provides, depends on, or conflicts with. For example, the socket plugin may provide services for "TCP", "UDS" (Unix Domain Sockets), and "Netlink" socket types, and may depend on the "File" plugin (to restore filesystem-based Unix domain sockets).

The dmtcp_launch program, which is used to launch an application under checkpoint control, compiles a list of all available plugins by looking at various environment variables, such as LD_LIBRARY_PATH. A user-defined list of plugins can also be specified to be loaded into the application. The dmtcp_launch program examines this plugin list and creates a partial order of dependencies among the plugins. The list of available plugins is searched to fulfill any missing dependencies for the user-defined plugins. If a match is found, the plugins are loaded automatically. Otherwise an error is reported. If two or more plugins provide the same feature/service, a conflict is recorded and the user is provided with the conflicting plugins.

void dmtcp_plugin_dependencies(const char ***provides,
                               const char ***requires,
                               const char ***conflicts) {
  static const char *_provides[]  = { "TCP", "UDS", "Netlink", NULL };
  static const char *_requires[]  = { "File", NULL };
  static const char *_conflicts[] = { NULL };
  *provides  = _provides;
  *requires  = _requires;
  *conflicts = _conflicts;
}

Listing 4.4: Dependencies declared by a plugin. The dmtcp_launch utility uses these fields to generate a partial order among the given plugins and to report any missing dependencies or any conflicts.

Listing 4.4 provides an example of dependency information as exported by the socket plugin. Since the plugins are implemented as shared libraries, the dmtcp_launch program can perform dlopen/dlsym to find and call the dmtcp_plugin_dependencies function to learn about the dependencies.

Further, this approach assumes a common naming scheme to resolve matches/dependencies across plugins. This could be automated by scanning symbols in the object files, for example, for both definitions and uses. If a symbol is defined in more than one plugin, it can be listed as a potential source of conflict to help the plugin writer in debugging plugins.

Parallel event handling

In Section 4.1.2, we discussed how the plugin engine assumed serial delivery of event notifications due to plugin dependencies expressed in a linear order (Figure 4.2). However, for non-linear plugin dependencies, a dependency graph can be created to relax the order of notification delivery. The event notifications can be processed by multiple plugins in parallel as long as there is no dependency between them. This is useful in modern multi-core systems to allow idle CPU cores to process the event notifications for the plugins. It is also useful for plugins that need to perform asynchronous operations during event handling. In such cases, rather than blocking on a single plugin, the event notification can be carried out in parallel in other plugins.

4.3.2 External Resources Virtualized by Other Plugins

Plugins may use resources that are virtualized by an earlier plugin. For example, plugins are allowed to create threads, open sockets, use files, etc. However, if a resource is created or used in a way that bypasses the wrappers created by the earlier plugin, the resource may not be virtualized or saved and restored by that plugin. In such situations, only the plugin using the resource can save and restore its state. This is done to avoid circular dependencies. If the save-restore/virtualization is absolutely required, the plugin should be broken into two or more smaller plugins and the newer plugin should be moved higher in the plugin hierarchy.

4.3.3 Multiple Plugins Wrapping the Same Function

Multiple plugins are allowed to place wrappers around the same library call. For example, the open("/proc/PID/maps", ...) call is wrapped by the file plugin as well as the pid plugin. The file plugin needs to be able to save/restore the file descriptor, whereas the pid plugin has to convert the virtual PID to a real one. Figure 4.4 shows nested wrappers provided by the pid plugin and the file plugin.

Figure 4.4: Nested wrappers: the open function is wrapped both by the File plugin and by the Pid plugin. (The application's call to open passes through the File plugin's wrapper and then the Pid plugin's wrapper before reaching libc; close is wrapped only by the File plugin, and getpid only by the Pid plugin.)

Once a plugin has performed all the required pre-processing actions, it calls the function wrapper in the next plugin library. This is done by using the RTLD_NEXT feature of the dlsym function call. RTLD_NEXT finds the next occurrence of the given function in the library search order after the current library. For example, in the case of the open wrapper in the File plugin from Figure 4.4, dlsym(RTLD_NEXT, "open") would return the address of the open function defined in the Pid plugin. However, dlsym(RTLD_NEXT, "close") would return the address of the close function defined in libc, as the close wrapper is not defined in the Pid plugin. Since the wrappers execute both before and after the library call, a plugin that was loaded earlier can place a wrapper around the wrapper created by a later plugin. Thus the pre-processing takes place in the order of the plugin load sequence, whereas the post-processing takes place in the reverse order.
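The REAL_* calls used in the listings above can be resolved in exactly this way. The sketch below is an illustration only (DMTCP's actual macros differ); it lazily looks up the next definition of close in the search order so a wrapper can forward the call.

/* Illustrative helper for forwarding a wrapped call to the next definition
 * of the function in the library search order (a later plugin or libc). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

typedef int (*close_fn_t)(int fd);

static int REAL_close(int fd) {
    static close_fn_t next_close = NULL;
    if (next_close == NULL) {
        next_close = (close_fn_t) dlsym(RTLD_NEXT, "close");
    }
    return next_close(fd);
}

/* A File-plugin wrapper can now pre-process, forward, and post-process. */
int close(int fd) {
    /* ... pre-processing (e.g., unregister the file descriptor) ... */
    int ret = REAL_close(fd);
    /* ... post-processing ... */
    return ret;
}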

4.4 Extending to Multiple Processes

Until this point, plugins have been described in the context of a single process. For distributed computations, the interaction among distributed processes is critical to making the plugin model practical. As we have seen, the plugins virtualize the resources for several reasons. However, in the case of multiple processes, several processes may be using a common resource. For example, several processes may share a file descriptor open to the same file. A mapped memory region may be shared. A socket may be shared among multiple processes. Several processes may have duplicate pointers to the same underlying resource. These duplicate pointers may be created explicitly (e.g., the dup() system call creates a duplicate file descriptor), or implicitly (by creating a child process; the child process automatically gets a copy of all the file descriptors, shared memory, etc.).

How does one ensure correctness if multiple processes are using the same resource and hence virtualizing it independently of each other? Should all processes save/restore the common resource or only one of them?

The correct answer is that only a single process should be allowed to save/restore the state of the underlying resource. This is required for two reasons: (i) for some resources, part of the state to be checkpointed can be read only once (this is the case with data in kernel buffers or network data); and (ii) if multiple processes recreate the resource during restart, it may no longer be shared. In some situations, it is impossible for the resource (e.g., a socket) to be recreated by multiple processes, while in other cases, recreating the resource multiple times is permitted but results in incorrect behavior (e.g., the same file can be opened by multiple processes, resulting in a loss of sharing semantics).

Single process

It is possible to have duplicate pointers within a single process. Thus the plugins must ensure that only one copy is checkpointed and the duplication is restored during restart. This requires the ability of the plugins to identify duplicate resources during the checkpoint phase. For some resources, the operating system kernel (or the execution environment) assigns a unique id at the time of creation. Examples include sockets, pids, System V shared memory objects, semaphores, etc. When these resources are duplicated, the duplicates may be detected easily by querying the kernel for the resource id.

Multiple processes

The two key issues in dealing with multiple processes are: (i) checkpoint-restart of shared resources; and (ii) finding the current location of peer processes. We employ the publish/subscribe service to assist in dealing with these issues. While it allows a central coordinator to mediate among multiple processes, it also implicitly produces a barrier. Hence, it is important to use that facility sparingly for the sake of efficiency.

4.4.1 Unique Resource-id for Shared Resources

Duplicate detection for the remaining resources must be done by keeping track of when the duplicates are created, whether explicitly or implicitly. This is done by assigning a unique resource-id to each resource when it is created. Resource duplication is tracked by putting wrappers around the corresponding library calls (such as dup or fork). Once detected, the duplicates are assigned the same resource-id as the original resource. A globally unique resource-id can be created in several ways. One possible solution is to use a mixture of the hostname, the virtual/real pid of the process creating the resource, a creation timestamp, etc.
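One way to realize such an id is sketched below (a simplified illustration; the field layout is an assumption, not the scheme DMTCP actually uses), combining the hostname, the creating process's pid, a timestamp, and a per-process counter.

/* Simplified sketch of generating a globally unique resource-id from the
 * hostname, creating pid, creation time, and a per-process counter. */
#include <time.h>
#include <unistd.h>

struct resource_id {
    char  hostname[64];
    pid_t creator_pid;       /* virtual pid of the creating process */
    long  creation_time_ns;
    int   per_process_seq;   /* distinguishes resources created at the same time */
};

struct resource_id new_resource_id(pid_t creator_virt_pid) {
    static int seq = 0;
    struct resource_id id;
    struct timespec ts;

    gethostname(id.hostname, sizeof(id.hostname));
    clock_gettime(CLOCK_REALTIME, &ts);
    id.creator_pid = creator_virt_pid;
    id.creation_time_ns = ts.tv_sec * 1000000000L + ts.tv_nsec;
    id.per_process_seq = seq++;
    return id;
}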

4.4.2 Checkpointing Shared Resources

Since only one process should be allowed to save the state of the shared resources, and the original resource creator might not be present, we must select a checkpoint-leader process for each resource. The checkpoint-leader is responsible for saving and restoring the state of the underlying resource.

Checkpoint-leader election — consensus across processes

The processes sharing the underlying resource may elect a checkpoint-leader using several mechanisms. The basic idea is to reach consensus across the participating processes. Ansel et al. [7] used the fcntl system call to set ownership of the file descriptors. Each process tries to set itself as the owner of the given file descriptor. The centralized coordinator process is used to create a global barrier to signal the end of the election after each process has had a chance to make the system call. The last process to perform the system call is considered the checkpoint-leader. An example is shown in Listing 4.5.

checkpoint_file(int fd) {
  // Participate in checkpoint-leader election;
  // publish ourself as the owner of the resource
  fcntl(fd, F_SETOWN, getpid());
  // Now wait for the election to be over
  wait_for_global_barrier(LEADER_ELECTION);
  // If we are the owner, we are ckpt-leader
  if (fcntl(fd, F_GETOWN) == getpid()) {
    // capture the state of the file descriptor
    capture_state(fd);
  }
}

Listing 4.5: An example of leader election using the fcntl system call.

While this approach works for shared file descriptors, it doesn't work for other resources, such as files. There can be multiple distinct file descriptors opened on the same file. In this case, each distinct file descriptor gets its own checkpoint-leader. This results in checkpointing multiple copies of the file. The publish/subscribe service can be used to provide a better solution. Each process publishes itself as the checkpoint-leader using the unique resource-id of the resource. The last process to publish is elected the checkpoint-leader. Since files can have multiple unique file descriptors (and hence multiple unique resource-ids) associated with them, we can publish using the absolute file path or the inode number for leader election.

Global barriers

As mentioned above, a global barrier allows plugins in different processes to synchronize during checkpoint and restart. A simple implementation of the global barrier requires a centralized coordinator that keeps a count of all processes that have reached the barrier. Once all processes reach the barrier, it lifts the barrier and allows them to proceed, as shown in Listing 4.6.

void wait_for_global_barrier(BarrierId id) {
  MessageType msg, rmsg;
  msg.type = GLOBAL_BARRIER;
  msg.barrierId = id;
  // Tell the coordinator that we have reached the barrier
  send_msg_to_coordinator(msg);
  // Wait until all other peers reach the barrier
  recv_msg_from_coordinator(&rmsg);
  assert(rmsg.type == GLOBAL_BARRIER_LIFTED);
  // barrier has been lifted
}

Listing 4.6: Global barrier.

Global barriers are costly, as each process has to communicate with the centralized coordinator process. If each plugin implements several global barriers, the performance impact can be significant in terms of checkpoint and restart times. The total number of global barriers can be reduced significantly by using process-level anonymous global barriers that can be implemented in the coordinator interface plugin, as shown in Listing 4.7.

void implement_global_barriers() {
  // Create an anonymous global barrier
  wait_for_global_barrier(BARRIER_ANON_1);
  // generate event notification indicating
  // lifting of anonymous barrier 1
  generate_event(ANON_GLOBAL_BARRIER_1);

  wait_for_global_barrier(BARRIER_ANON_2);
  generate_event(ANON_GLOBAL_BARRIER_2);

  wait_for_global_barrier(BARRIER_ANON_3);
  generate_event(ANON_GLOBAL_BARRIER_3);
  ...
}

Listing 4.7: Anonymous global barriers.

Consider the example of leader election. On receiving the event notification for the ANON_GLOBAL_BARRIER_1 event, each plugin will participate in leader election for its resources by publishing itself as the checkpoint-leader. On receiving the event notification for ANON_GLOBAL_BARRIER_2, each plugin can check to see if it is the checkpoint-leader by subscribing to the checkpoint-leader information for the unique resource-id.

Figure 4.5: Plugin dependency for distributed processes. (WriteCheckpoint notifications flow from the socket plugin down through the file, fork/exec, pid, coordinator interface, and thread plugins to the memory plugin(s); Resume/Restart notifications flow in the opposite direction.)

4.4.3 Restoring Shared Resources

Note that memory regions are restored before plugins can restore the state of their corresponding resources. In the case of shared resources, the checkpoint-leader recreates the underlying resources and then shares them with other processes using the publish/subscribe service. The checkpoint-leader publishes while the remaining processes subscribe to the resource-id.

Remark: Resources involving file descriptors can be shared by passing them over Unix domain sockets.

Note that sharing of resources forces a certain dependency among plugins, summarized in Figure 4.5. The required dependency can be observed by noting the required actions of a plugin at the time of restart. The pid plugin is responsible for virtualizing the pids, which is required by the fork/exec plugin to restore the process trees. Once the process trees have been created, the file, socket, System V shared memory, etc. plugins may recreate/restore the resources and share them with other processes.

4.5 Three Base Plugins

In this section we discuss three special-purpose plugins: the coordinator interface plugin, the thread plugin, and the memory plugins.

4.5.1 Coordinator Interface Plugin

A centralized coordinator process is used to synchronize checkpoint-restart between multiple processes on the same or different hosts. A coordinator interface plugin communicates with the coordinator process and generates events related to checkpointing when requested by the coordinator. It creates a checkpoint-manager thread, which listens to the coordinator process for a checkpoint message while the user threads are executing application code. On receiving a coordinator message, the checkpoint-manager thread generates the checkpoint, resume, or restart event, which is then delivered to all other plugins.

The coordinator interface plugin and the coordinator process can best be thought of as a single programming unit. It is this programming unit that implements global barriers at the time of checkpoint or restart. The special case of a single standalone target process can be supported by a minimal coordinator interface plugin, which directly generates the three basic event notifications: checkpoint, resume, and restart. In this case, one does not need any external coordinator process. At the other extreme, a coordinator interface plugin can be written to support a set of redundant coordinators. This alternative eliminates the possibility of a single point of failure.

4.5.2 Thread Plugin

The thread plugin is responsible for saving and restoring the state of all user threads during checkpointing. The plugin engine invokes the checkpoint-manager thread through the write-ckpt event hook. The checkpoint manager then sends a POSIX signal to all user threads. This forces the user threads into a checkpoint-specific signal handler (which was defined earlier within the thread plugin). The handler causes each user thread to save its context (register values, etc.) into the process memory and to then wait on a lock. When the checkpoint completes, the thread plugin releases all user threads from their locks, and user execution resumes.

On restart, the memory plugin restores user-space memory from a checkpoint image, and control is then passed to a restart event hook of the thread plugin. Only the primary thread of the restarted process exists at this time. That thread recreates the other threads, restores their context, and releases the user threads from the locks that were entered prior to checkpoint. (The state of a lock depends only on user-space memory.)
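A minimal sketch of such a handler is shown below (illustrative only; it assumes SIGUSR2 as the checkpoint signal, per Figure 2.1, and uses sigsetjmp to stand in for the full register and TLS state that a real thread plugin must save).

/* Illustrative checkpoint-signal handler for a thread plugin.  A real
 * implementation saves far more state (signal masks, TLS, FP registers)
 * and coordinates restart differently. */
#include <signal.h>
#include <semaphore.h>
#include <setjmp.h>
#include <string.h>

static __thread sigjmp_buf saved_context;   /* per-thread saved context */
static sem_t threads_quiesced;   /* posted by each thread once it is parked */
static sem_t resume_permitted;   /* posted by the checkpoint thread on resume */

static void ckpt_signal_handler(int sig) {
    (void)sig;
    if (sigsetjmp(saved_context, 1) == 0) {
        sem_post(&threads_quiesced);   /* report: this thread is now quiesced */
        sem_wait(&resume_permitted);   /* park until checkpoint/restart is done */
    }
    /* On restart, a recreated thread siglongjmp()s back here and returns to
     * the point where the signal interrupted it. */
}

static void install_ckpt_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = ckpt_signal_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);   /* SIGUSR2 as the checkpoint signal */
}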

4.5.3 Memory Plugins

[Figure: a stack of memory plugins sitting between the other plugin libraries and the runtime libraries/plugin engine: prepare list of memory areas, zero-page detection, compression, encryption, write to network socket.]

Figure 4.6: Various memory plugins stacked together

Memory plugins are responsible for writing the contents of a process's memory into the checkpoint image.

The checkpoint image is read during the restart process to recreate the process memory. Memory plugins are the last in the plugin loading sequence, as every other plugin necessarily depends on the memory resource. Figure 4.6 shows an example of a sequence of memory plugins that perform zero-page optimizations followed by compression and encryption before writing the checkpoint data to a network socket. A process on the other end of the socket may then save the data onto persistent storage.

At restart time, a special application, dmtcp_restart, is needed to bootstrap the restart procedure and to load the restoration code corresponding to all the memory plugins involved. Control is then passed to the memory plugins, which restore the rest of the process memory. After restoring memory, the rest of the plugins recreate/restore their corresponding resources. User threads are then recreated and the process resumes executing application code.

Here we list some characteristics of the memory plugins:

1. Since writing the checkpoint image is the last step in the checkpoint process, the memory plugins must appear last in the plugin sequence.

2. If it is possible for memory plugins to alter the memory maps of the current process, the first memory plugin must create a list of memory areas to be written to the checkpoint image. The memory plugins can then map new memory areas for checkpoint purposes only, and these areas will not be checkpointed.

3. The memory plugins pass information to the next memory plugin using a pipe mechanism, i.e., each plugin may process the incoming data and send the processed (and potentially modified) data to the next plugin. Data piping can be implemented by creating hooks for writing and reading memory (see the sketch after this list).

4. The plugins agree on some notion of end-of-data to finish writing the checkpoint image.

5. The last memory plugin writes the data to persistent storage (a file) or to a pipe/socket. There can be a different process on the other end of the pipe/socket, which then saves the data to a persistent device or restarts the process on the fly. The last memory plugin here means the final or lowest memory plugin (e.g., the "write to network socket" plugin in Figure 4.6).

6. The last memory plugin is also responsible for reading from the checkpoint image.

7. During restart, memory plugins are responsible for restoring other runtime libraries; thus, these plugin libraries must be self-contained.

Remark: Note that the state managed by the memory plugins will not be compressed or encrypted in our running example of memory plugins. This is necessary to solve the problem of bootstrapping on restart. If the bootstrapping code were also encrypted, it would be impossible to bootstrap.
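As a concrete illustration of the piping in item 3 above, the sketch below shows a compression stage that transforms each chunk it receives and hands the result to the next (lower) memory plugin in the stack. The hook names (compression_plugin_write, next_plugin_write) are illustrative assumptions about such an interface, and zlib is used only as an example transformation; this is not the DMTCP memory-plugin API.

/* One stage of a memory-plugin pipeline: compress a chunk, then pipe it on. */
#include <sys/types.h>
#include <zlib.h>

/* The next plugin in the stack, e.g. the "write to network socket" plugin. */
extern ssize_t next_plugin_write(const void *buf, size_t len);

ssize_t compression_plugin_write(const void *buf, size_t len) {
  unsigned char out[64 * 1024];
  uLongf out_len = sizeof(out);

  if (compressBound(len) > sizeof(out))   /* chunk too large for this sketch */
    return -1;
  if (compress(out, &out_len, (const Bytef *)buf, (uLong)len) != Z_OK)
    return -1;

  /* A real plugin would also frame the chunk (e.g., prefix its length) so
   * that the matching read hook can undo the transformation on restart. */
  return next_plugin_write(out, out_len);
}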

4.6 Implementation Challenges

In this section, we describe some of the implementation challenges that we faced in implementing the plugin-based virtualization in DMTCP version 2.

4.6.1 Wrapper Functions

We discuss three different implementation techniques that were tried in succession, before settling on a fourth choice: a hybrid of the second and third options:

1. dlopen/dlsym: This is a naive approach, well-known in the literature. It allows the plugin to define a system call of the same name.

The body of this wrapper uses dlopen/dlsym to open the run-time library (e.g., libc, libpthread, etc.), and then calls the system call in the run-time library. However, this fails when creating a wrapper for the GNU implementation of calloc. The GNU implementations of dlopen and dlsym would call calloc, thus creating a circular dependency. Wrapping occurrences of dlopen/dlsym from a user's application creates a similar circular dependency. However, a still more severe criticism is that if the wrapper function directly calls the run-time library, then nested wrappers become impossible. In our implementation, multiple plugins frequently wish to wrap the same system call.

2. offsets within a run-time library: This was implemented in order to avoid the use of dlopen/dlsym. A base address is chosen within the run-time library. (It may be the start address of the library or an unusual system call unlikely to be needed by wrappers.) For all system calls to be wrapped, the offset from that system call to the base address is calculated before launching the end-user application. The end-user application is then launched and the base address is recalculated. Next, the base address is used along with the offsets to determine the addresses of the functions in the run-time library. At this point, the functions in the run-time library can be called using the corresponding addresses. This solves the issues caused by circular dependencies (e.g., dlopen, dlsym, calloc). However, nested wrappers still cannot be implemented. (A sketch of this technique appears after this list.)

3. dlsym/RTLD_NEXT: The POSIX option RTLD_NEXT for dlsym is designed in part to implement wrapper functions. This option causes dlsym to search the sequence of currently open libraries for the next matching symbol beyond the current library. This fixes the problem of implementing nested wrappers, but it does not solve the problem of circular dependencies.
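Before turning to the ultimate solution, here is a minimal sketch of option 2 above (offsets within a run-time library). It assumes sbrk as the unusual base symbol and an environment variable as the channel for the recorded offset, and it glosses over PLT and symbol-interposition subtleties; the helper names are illustrative rather than DMTCP's.

/* Option 2 sketch: record an offset before launch, rebuild the address later. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Step 1, in the launcher (no wrappers are loaded yet, so dlsym is safe):
 * record the offset of calloc from the chosen base symbol. */
void record_calloc_offset(void) {
  char *base = (char *)dlsym(RTLD_DEFAULT, "sbrk");
  char *fn   = (char *)dlsym(RTLD_DEFAULT, "calloc");
  char buf[32];
  snprintf(buf, sizeof(buf), "%ld", (long)(fn - base));
  setenv("LIBC_CALLOC_OFFSET", buf, 1);
}

/* Step 2, inside the wrapper library in the target application: recompute
 * calloc's address from the freshly resolved base symbol plus the recorded
 * offset, and call it directly -- no dlopen/dlsym involved. */
void *calloc_via_offset(size_t nmemb, size_t size) {
  long offset = atol(getenv("LIBC_CALLOC_OFFSET"));
  void *(*fn)(size_t, size_t) =
      (void *(*)(size_t, size_t))((char *)&sbrk + offset);
  return fn(nmemb, size);
}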

The ultimate solution requires an additional observation: The run-time library sometimes internally calls a system call (as with dlopen/dlsym calling calloc). It is a mistake for the plugin to execute the wrapper function around this internal call. Yet, when dlsym internally calls calloc, the ELF loader will call the first definition of calloc that it finds. The first library to be loaded was libdmtcp.so, as part of the design of DMTCP. So, the calloc wrapper in libdmtcp.so is called.

A standard wrapper for calloc within libdmtcp.so would then call dlsym to determine the address of calloc within libc.so. But this would create the circularity. Instead, the wrapper detects that this is a circular call originating from the run-time library (libc.so). Upon detecting this, the calloc wrapper reverts to the second method above (offsets within a run-time library) in order to directly call the implementation of calloc within libc. Thus the circularity is broken.
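The sketch below shows one way the hybrid might look, assuming libdmtcp.so is loaded first (e.g., via LD_PRELOAD) and that calloc_via_offset is the option-2 fallback from the earlier sketch. It uses a per-thread guard around the wrapper's own dlsym call to detect the circularity, which differs in detail from DMTCP's detection of calls originating in libc.so but illustrates the same idea.

/* Hybrid calloc wrapper: RTLD_NEXT for nesting, offset fallback for the
 * dlsym -> calloc -> dlsym circularity. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

extern void *calloc_via_offset(size_t nmemb, size_t size);  /* option-2 fallback */

static __thread int in_dlsym = 0;            /* per-thread recursion guard */

void *calloc(size_t nmemb, size_t size) {
  static void *(*next_calloc)(size_t, size_t) = NULL;

  if (in_dlsym)                              /* circular call from dlsym itself */
    return calloc_via_offset(nmemb, size);

  if (next_calloc == NULL) {
    in_dlsym = 1;
    next_calloc = (void *(*)(size_t, size_t))dlsym(RTLD_NEXT, "calloc");
    in_dlsym = 0;
  }

  /* ...plugin-specific bookkeeping could go here... */
  return next_calloc(nmemb, size);
}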

4.6.2 New Process/Program Creation

When a process forks to create a new child process, the thread that called fork() is the only thread in the new process. This poses certain challenges for plugins, especially when dealing with locks. If, at the time of fork(), some other thread is holding a lock, the threads in the new process may deadlock on this lock. The solution is to install atfork() handlers in all plugins that use locks or similar artifacts; whenever a child process is created, it re-initializes the locks before doing anything else. An alternative is to use the AtFork event generated by the fork/exec plugin. Glibc and Firefox are two real-world examples that install atfork handlers to re-initialize the locks for their respective malloc arenas.
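A minimal sketch of the atfork-handler approach for a plugin that guards its state with a single mutex follows; the registration call, pthread_atfork, is the standard POSIX mechanism.

/* Re-initialize a plugin's lock in the child so it never inherits a lock
 * held by a thread that does not exist after fork(). */
#include <pthread.h>

static pthread_mutex_t plugin_lock = PTHREAD_MUTEX_INITIALIZER;

static void before_fork(void)       { pthread_mutex_lock(&plugin_lock); }
static void parent_after_fork(void) { pthread_mutex_unlock(&plugin_lock); }
static void child_after_fork(void) {
  /* Re-initialize rather than unlock: the child starts with one thread,
   * so any other holder of the lock simply no longer exists. */
  pthread_mutex_init(&plugin_lock, NULL);
}

void plugin_register_atfork(void) {
  pthread_atfork(before_fork, parent_after_fork, child_after_fork);
}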

New programs created by calling execve() have a different set of problems. Since the new program gets a completely new address space, all information that was gathered by the plugin prior to exec is lost.

Plugins that need to preserve information across exec need a lifeboat where they can put the information for later use. A typical example of a lifeboat would be a temporary file created on disk. The plugins serialize the previously captured information to the lifeboat. Since the plugins are independent of each other, there can be multiple lifeboats per process. Remark: As an optimization, it is possible to provide a single lifeboat that can be used by all the plugins.
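The sketch below shows what a lifeboat for a pid-style translation table might look like: the state is serialized to a temporary file before exec and the file's path is recorded in an environment variable (the name LIFEBOAT_PID_PLUGIN is an illustrative choice). The exec wrapper is assumed to keep this variable in the environment it hands to execve.

/* Per-plugin lifeboat: serialize before exec, deserialize after exec. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct pid_mapping { int virt; int real; };

void save_lifeboat(const struct pid_mapping *map, int n) {
  char path[] = "/tmp/lifeboat.XXXXXX";
  int fd = mkstemp(path);
  if (fd < 0) return;
  FILE *f = fdopen(fd, "w");
  for (int i = 0; i < n; i++)
    fprintf(f, "%d %d\n", map[i].virt, map[i].real);
  fclose(f);
  setenv("LIFEBOAT_PID_PLUGIN", path, 1);   /* must survive into the new program */
}

int load_lifeboat(struct pid_mapping *map, int max) {
  const char *path = getenv("LIFEBOAT_PID_PLUGIN");
  if (path == NULL) return 0;
  FILE *f = fopen(path, "r");
  if (f == NULL) return 0;
  int n = 0;
  while (n < max && fscanf(f, "%d %d", &map[n].virt, &map[n].real) == 2)
    n++;
  fclose(f);
  unlink(path);                             /* the lifeboat is single-use */
  return n;
}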

4.6.3 Checkpoint Deadlock on a Runtime Library Resource

Atomic wrapper operations are also desired when dealing with resources that use locks for atomicity. Suppose a user thread is quiesced while holding the resource lock. Later on, if the resource is needed to complete the checkpoint, it can cause a deadlock within the process. For example, in one of the most frequent scenarios, a user thread is quiesced while performing malloc/free inside glibc. The checkpoint thread is then blocked when it calls any of these functions during the checkpoint process. There are two possible solutions: (i) modify the checkpointing logic to never call these functions, and (ii) create wrappers around these functions which call disable_checkpoint and enable_checkpoint around the call to the real library functions, as shown in Listing 4.8.

malloc(size) {
    disable_checkpoint()
    ret_val = real_malloc(size)
    enable_checkpoint()
    return ret_val
}

Listing 4.8: Malloc wrapper to avoid deadlock during checkpointing

4.6.4 Blocking Library Functions and Checkpoint Starvation

There are certain wrappers around blocking library functions that need to virtualize the underlying system resource. As discussed in Section 4.1.1, the call to the library function and the translation between real and virtual names should be atomic with respect to checkpointing. However, if a function call is blocking, the checkpoint may never succeed. Examples of such functions are waitpid, pthread_join, etc.

pid_t waitpid(pid, ...) {
    while (true) {
        disable_checkpoint()
        real_pid = virtual_to_real(pid)
        // The WNOHANG flag tells waitpid to return
        // immediately if the operation would block.
        ret_val = real_waitpid(real_pid, ..., WNOHANG | ...)
        virt_pid = real_to_virtual(ret_val)
        enable_checkpoint()
        if (ret_val > 0)  // Success
            return virt_pid
        // If error other than timeout, the function failed.
        if (ret_val == -1 && errno != ETIMEDOUT)
            break
        // Would block (or timed out): yield the CPU to avoid spinning
        // and retry.  A checkpoint may take place here.
        yield()
    }
    return -1
}

Listing 4.9: Wrapper for waitpid with non-blocking calls to the real waitpid function

In these situations, one can modify the wrapper as seen in Listing 4.9 to call the non-blocking version of the function in a loop until it succeeds or returns an error other than timeout. The timed version of a function waits for the given time period before returning, instead of blocking indefinitely. In some situations, the blocking call may not provide a non-blocking version. In those cases, a potential solution is to use a signalling mechanism to force the call to return with an error. At this point, the checkpoint can take place. However, the wrapper must then be re-executed from the beginning to avoid any stale state.
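One way such a signalling mechanism could be structured is sketched below: a dedicated signal (SIGUSR2 here, an assumption) gets a no-op handler installed without SA_RESTART, so that when the checkpoint thread signals a thread blocked in a wrapper, the blocking call returns -1 with errno set to EINTR, the checkpoint can proceed, and the wrapper then re-executes from the beginning. The disable_checkpoint/enable_checkpoint primitives are the same ones used in Listing 4.8.

/* Interrupting a blocking call that has no non-blocking variant. */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

extern void disable_checkpoint(void);
extern void enable_checkpoint(void);

static void interrupt_handler(int sig) { (void)sig; /* no-op */ }

void install_interrupt_signal(void) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = interrupt_handler;   /* no SA_RESTART: the syscall returns EINTR */
  sigemptyset(&sa.sa_mask);
  sigaction(SIGUSR2, &sa, NULL);
}

/* Example wrapper around a blocking call. */
ssize_t read_wrapper(int fd, void *buf, size_t count) {
  for (;;) {
    disable_checkpoint();
    ssize_t n = read(fd, buf, count);  /* may be interrupted by SIGUSR2 */
    enable_checkpoint();
    if (n >= 0 || errno != EINTR)
      return n;                        /* success, or a real error */
    /* EINTR: a checkpoint may take place here; then retry from the start. */
  }
}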

CHAPTER 5

Expressivity of Plugins

This chapter presents a large variety of examples of adaptive plugins, to demonstrate the expressivity of the plugin framework. They fall into several categories, each of which represents a unique type of contribution, in generalizing the traditional functionality of checkpoint-restart. Some of the plugins represent long-standing challenges. Not only do these plugins provide additional functionality for checkpoint-restart, but they do so with far fewer lines of code than the previously available, less functional approaches. These include transparent checkpointing of: InfiniBand networks by Cao et al. [27]; hardware-accelerated 3-D graphics (OpenGL 2.0 and beyond) by Kazemi Nafchi et al. [62]; a network of virtual machines by Garg et al. [44]; and GDB sessions by Visan et al. [127]. Each of these efforts was led by a different author. Thus they represent trials of the new plugin feature by independent users. The full details of each plugin can be found in the publications and technical reports of those authors. While I believe any of these could have been done by adding support in any of the existing checkpointing packages, the amount of effort (both in terms of person-hours and lines of code) would have been enormous. Instead, by using adaptive plugins to implement a process virtualization approach, the job was made much easier. In all cases, the plugin writers did not need to learn the details of DMTCP internals, allowing them to focus only on the plugin.

Plugin        Lines of code   Novelty                                                      Prior art             Lines of code
SSH session   1,021           The only solution                                            —                     —
GDB session   938             The only solution                                            —                     —
Batch-Queue   1,715           The only solution                                            —                     —
KVM/Tun       1,100           Full snapshots of network of VMs                             Single VM snapshots   ??
OpenGL        4,500           Supports programmable GPUs (OpenGL 2.0 and beyond)           VMGL [69]             78,000
InfiniBand    2,500           Native InfiniBand checkpoint for both MPI and non-MPI jobs   MPI-specific [55]     17,000
IB2TCP        1,000           InfiniBand-to-TCP migration for both MPI and non-MPI jobs    MPI-specific [55]     ??

Table 5.1: Process-virtualization-based checkpoint-restart is both more general and typically an order of magnitude smaller in implementation size

The expressivity is measured along two dimensions (see Table 5.1). The first dimension is a measurement of lines of code for the plugins. Since each example was a "first" for that functionality, we compare with the lines of code for a previously published implementation with lesser functionality, where possible. In the second dimension, we compare functionality with the application identified as having the most previous functionality in the corresponding domain. Thus a two-fold argument is presented. First, the process virtualization approach permits implementations with much larger functionality than had previously been practical with moderate resources. Second, the process virtualization approach results in an implementation with many fewer lines of code than would have been practical by other approaches. (Of course, the fewer lines of code in the plugin is made possible by using the base support for plugins in DMTCP version 2.) Note that some of the plugins discussed in this chapter were not created as part of this thesis. Instead, they were created by different authors using the plugin API. Further details of each plugin can be found in the publications and technical reports of those authors.

Statistics for various plugins

Table 5.2 provides several statistics, including the source lines of code, the number of library call wrappers, and the various services used by the plugins. The lines of code were obtained by using SLOCCount [132]. Section 5.1 provides a brief overview of the plugins related to file descriptor handling. Section 5.2 provides an overview of the working of the plugin handling the System V IPC mechanism. A few application-specific plugins are discussed in Section 5.3. The remaining sections provide various case studies where new functionality was implemented, whereas previously in other checkpoint-restart packages, the added functionality was implemented only through independent, auxiliary applications.

5.1 File Descriptor Related Plugins

Since file descriptors may be used for file objects, socket connections, or event notifications, the corresponding plugins share some code for handling generic file descriptors. This results in a cleaner design and a smaller code footprint. The shared code provides services for generating unique file descriptor ids, detecting/managing duplicate file descriptors, leader election, and re-sharing of file descriptors on restart. Note that DMTCP version 1 provided support for checkpointing TCP and Unix domain sockets for checkpointing distributed applications. It also provided limited support for handling files and pseudo-terminals.

Plugin           Language   Lines of Code   Wrappers   Services used
Internal Plugins
  File           C/C++      2,276∗          48         a, b, c, d, e
  Socket         C/C++      1,356∗          17         a, b, c, d
  Event          C/C++      909∗            12         a, b, c, d, e
  Pid            C/C++      1,644           47         c, d, e
  SysVIPC        C/C++      1,154           14         a, b, c, d, e
  Timer          C/C++      419             14         a, c, d, e
  SSH            C/C++      1,021           3          a, b, c, d, e
Contrib Plugins
  Batch-Queue    C/C++      1,715           13         e†
  Ptrace         C/C++      938             7          a, b, c
  Record-replay  C/C++      8,071           164        a, b, c, e
  KVM            C          749             2          a, b, c, e
  Tun            C          351             3          a, b, c, e
  OpenGL         C/C++      4,500           119        a, b, c, e, f
  InfiniBand     C          2,788           34         a, b, c, d, e
  IB2TCP         C/C++      804             31         c, d, e
Application-Specific Plugins
  Malloc         C/C++      116             10         f
  Dlopen         C/C++      28              3          f
  Modify-env     C          134             0          c, e
  CkptFile       C/C++      37              0          a, c
  Uniq-Ckpt      C/C++      39              0          a, c

∗: Uses an additional 899 lines of shared common code.
†: Uses specialized utilities to detect restart.
Plugin services: (a) write-checkpoint hook; (b) resume hook; (c) restart hook; (d) publish/subscribe; (e) virtualization; (f) protect critical sections of code.

Table 5.2: Statistics for various plugins.

For this work, the plugins were created by rewriting the existing solution from DMTCP version 1. This greatly enhanced the available features and provided an easier way for the user to fine-tune checkpointing. This section provides a brief overview of the three plugins.

File plugin

The File plugin is responsible for handling file descriptors pointing to regular files and directories. For implementation purposes, it also handles pseudo-terminals (ptys) and FIFO (first in, first out) objects, since they have similar semantics to file objects. Apart from restoring the relevant file descriptors, the File plugin also needs to translate the file paths if the computation is restarted on a system with different mount points or by a different user. There are several ways to provide file path translation. A simple mechanism involves recording the relative file paths on checkpoint and using the relative path information on restart to find the file. Another approach may involve wildcard substitution, where a certain component of the file path is transparently replaced with a different one. For example, if a mount point has changed from /mnt/foo to /bar, the plugin would replace /mnt/foo/baz with /bar/baz. The File plugin also deploys some heuristics to determine if it also needs to save and restore the associated file data. In some cases, the file data must always be checkpointed. Examples include unlinked files (Linux allows a file to be unlinked while a process still has a valid file descriptor) and temporary files created by programs like vim and emacs. For a simpler design, the heuristics part of the File plugin is now implemented as a separate plugin (CkptFile). This way the user can tweak this relatively simple newer plugin according to their wishes. Similarly, the file path translation mechanism can also be moved into its own plugin. Naturally, the original File plugin then depends on these two plugins for their services.
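As an illustration of the wildcard substitution described above, the following sketch rewrites a saved path using a single old-prefix/new-prefix rule; a real File plugin would consult a table of such rules taken from its configuration.

/* Replace a leading path component, e.g. /mnt/foo/baz -> /bar/baz. */
#include <stdio.h>
#include <string.h>

const char *translate_path(const char *saved,
                           const char *old_prefix, const char *new_prefix,
                           char *out, size_t out_len) {
  size_t n = strlen(old_prefix);
  if (strncmp(saved, old_prefix, n) != 0)
    return saved;                              /* rule does not apply */
  snprintf(out, out_len, "%s%s", new_prefix, saved + n);
  return out;
}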

Socket plugin

The Socket plugin is responsible for checkpointing and restoring TCP/IP sockets, Unix domain sockets, and netlink sockets. Potentially, this plugin could be split into three different plugins, but for implementation purposes it is kept as a single unit. Further, since Unix domain sockets may be backed by a file on the disk, it also depends on the File plugin for file path translation. The Socket plugin assigns a unique id to each end of a socket connection. In our implementation, the unique id comprises the unique id of the process that originally created the socket file descriptor and a per-process monotonically increasing counter. At the time of checkpoint, the processes on each end of a socket connection perform a handshake to exchange the unique socket id. On restart, this unique socket id is used to find the current location of the peer process using the publish/subscribe service.

Event plugin

The Event plugin is responsible for checkpointing and restoring the file descriptors used for event notifications. Apart from supporting the older poll system call (used for monitoring file descriptors), this plugin provides support for the epoll (similar to poll), eventfd (used for an event wait/notify mechanism from user space), signalfd (used for accepting signals targeted at the caller), and inotify (used for monitoring file system events) system calls. Inotify is the most difficult to checkpoint and restart. The desired behavior on restart is not well-defined and may be application dependent. For example, inotify can be used to get a notification if a file has been renamed. Suppose that the file is renamed after checkpoint. On restart, the file will be present with a new name and thus won't be renamed. In this case, it is not clear whether an event notification should be generated or not. The plugin can be modified to allow the user to specify the default behavior for use with the application.

5.2 Pid, System V IPC, and Timer Plugins

We have already discussed the Pid plugin as an example of virtualizing the kernel resource identifiers in Section 3.1.1.

The System V IPC (SysVIPC) plugin supports checkpointing of System V shared memory, semaphores, and message queues. The operating system kernel generates an identifier for each System V IPC object. The identifier may change on restart and thus we need to virtualize it. The SysVIPC plugin virtualizes these identifiers in a similar manner to the Pid plugin. A virtual id is generated for each System V IPC object and a translation between virtual and real ids is maintained. In addition to virtualizing the resource ids, the SysVIPC plugin also needs to checkpoint the associated state of the System V IPC object. For example, the memory contents of a shared memory region need to be checkpointed, the semaphore value needs to be restored, and the message queue needs to be drained on checkpoint and refilled on restart. Since these objects are potentially shared between multiple processes, the plugin performs leader election using the publish/subscribe mechanism.

Lastly, we discussed the virtualization of clock and timer ids in Section 3.1.5. As described there, in addition to virtualizing the resource ids, application-specific fine-tuning is required to control the behavior of timers on restart.

5.3 Application-Specific Plugins

The CkptFile plugin is used to provide heuristics for saving the contents of open files during checkpoint. The plugin can be used to read wildcard patterns from a configuration file for dynamically updating the heuristics. The File plugin consults the CkptFile plugin for each open file. The CkptFile plugin responds with whether or not to checkpoint the data of the given file.

The Environ plugin provides heuristics for restoring/updating the process environment variables after a restart. This is useful for processes that use environment variables to find the addresses, etc., of system services and daemons. The Environ plugin reads patterns from a configuration file to selectively update the restarting process's environment. The Uniq-Ckpt plugin is responsible for keeping a rolling set of checkpoint images as configured by the user. It can automatically delete or rename older checkpoint images to save disk space. The Malloc plugin puts wrappers around malloc, free, etc., to avoid deadlock inside the malloc library, as explained in Section 4.6.3. The plugin can further be used to switch to a different malloc implementation for debugging. The Dlopen plugin provides wrappers for the dlopen, dlsym, and dlclose library calls. The dlopen wrapper is used to ensure atomicity with respect to checkpointing, so that the process doesn't get checkpointed while the library is still being initialized. The dlsym wrapper is used to create wrappers for functions that are present in the library being loaded. The dlsym wrapper can return the address of the wrapper function (defined in the plugin) instead of the library function. The wrapper function then may call the real function in the newly loaded library.

5.4 SSH Connection

This section reviews the issues involved with checkpointing an SSH session, as discussed in Section 3.1.2, followed by a description of the solution based on our virtualization scheme. Previous support for distributed checkpointing covered the common uses of ssh, where it is used to launch remote jobs but not used for active communication. In some HPC environments (e.g., Open MPI), this is the default behavior. Remote processes are launched over SSH, and later establish a simple TCP socket for efficient communication. This work provides support for active communication over SSH.

Recall that SSH allows two processes to securely communicate over an insecure network. A user process uses an SSH client process to connect to a remote SSH server (daemon) process. On creating a secure connection, the SSH server process (sshd) launches the child process (app2), as shown in Figure 3.2. The process app1 appears to read and write locally through a pipe to app2. The SSH daemon is a privileged process running a certain protocol. In the process virtualization approach, the plugin must virtualize that protocol. Further, checkpointing and restarting the privileged SSH daemon by an unprivileged user is not possible, since the user cannot recreate the privileged SSH daemon (sshd) on restart.

Launching remote process under checkpoint control

Recall that a process on Node1 launches a remote process on Node2 by running the SSH client program as ssh Node2 app2. The earlier DMTCP used a strategy of detecting an exec call to ssh Node2 app2 and replacing it by ssh Node2 dmtcp_launch app2. Ad hoc code was used that allowed ssh to create a remote process under checkpoint control, but it was assumed that the application would then close the SSH connection. The solution for supporting long-lived SSH connections is shown in Figure 3.3. In essence, following a process virtualization approach, the SSH plugin defines a wrapper function around the exec family of system calls. It then replaces a call by exec to ssh Node2 app2 with a call to:

ssh Node2 dmtcp_launch virt_sshd app2

For technical reasons, the plugin actually creates two auxiliary processes, virt_ssh and virt_sshd. (The code for these processes is part of the SSH plugin, which arranges for them to run as separate processes.) These processes also allow us to recreate the SSH connection on restart — even in the less common situations where the app1 process has exited, leaving a child of app1 to continue to employ the SSH connection from Node1.
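The following sketch shows the flavor of that exec interposition for the execvp variant: if the program being launched is ssh, the argument vector is rebuilt to insert dmtcp_launch and virt_sshd after the remote host. The real-function lookup uses the RTLD_NEXT technique from Section 4.6.1; the command detection here is deliberately simplistic (it ignores full paths and ssh options), and this is not the actual DMTCP source.

/* Rewrite "ssh <host> <cmd...>" into "ssh <host> dmtcp_launch virt_sshd <cmd...>". */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>

int execvp(const char *file, char *const argv[]) {
  int (*real_execvp)(const char *, char *const[]) =
      (int (*)(const char *, char *const[]))dlsym(RTLD_NEXT, "execvp");

  if (strcmp(file, "ssh") == 0 && argv[0] != NULL && argv[1] != NULL) {
    int argc = 0;
    while (argv[argc] != NULL) argc++;

    char *new_argv[argc + 3];
    new_argv[0] = argv[0];                    /* "ssh"                */
    new_argv[1] = argv[1];                    /* remote host          */
    new_argv[2] = (char *)"dmtcp_launch";     /* checkpoint control   */
    new_argv[3] = (char *)"virt_sshd";        /* stdio relay on Node2 */
    for (int i = 2; i <= argc; i++)           /* also copies the NULL */
      new_argv[i + 2] = argv[i];
    return real_execvp(file, new_argv);
  }
  return real_execvp(file, argv);
}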

Checkpoint

At the time of checkpoint, only the processes app1, app2, virt_ssh, and virt_sshd are checkpointed. The ssh and sshd processes are not under checkpoint control and are not checkpointed. Further, virt_ssh and virt_sshd can directly "drain" any in-flight network data that has not yet reached its destination at the time of checkpoint. Thus, they act as buffers to hold network data prior to resume or restart. During resume, the drained data is written directly to the corresponding pipes between the user processes and the DMTCP helper processes.


Figure 5.1: Restoring an SSH connection. The virt_ssh process launched sshd_helper on Node2 that relays stdio between ssh and virt_sshd.

Restart

Figure 5.1 illustrates how the four checkpointed processes are restored during restart. The four processes on Node1 and Node2 are restarted via:

ssh Node1 dmtcp_restart
ssh Node2 dmtcp_restart

Note that in the general case, Node1 and Node2 may both have been remote nodes. Next, an SSH connection must be created between the two processes, virt_ssh and virt_sshd. To accomplish this, virt_ssh will use publish/subscribe to discover the address of the virt_sshd process. Next, virt_ssh will fork a child process, which "execs" into the following program:

ssh Node2 sshd_helper

Finally, the sshd_helper process will relay the data of its stdio pipes from the SSH server process through stdio pipes to the virt_sshd process. The sshd_helper process exits when the virt_sshd process exits. The sshd_helper process is never part of any subsequent checkpoint.

5.5 Batch-Queue Plugin for Resource Managers

One of the long-standing functionality requirements for batch-queue managers at various HPC centers is the ability to suspend a low-priority job to allow execution of a high-priority job as soon as it arrives. While there have been MPI-specific solutions to support this use case (see Section 2.1.2), they have not been integrated into the batch-queue systems for lack of complete functionality. The batch-queue plugin by Polyakov [93] solves this problem by providing a native checkpoint-restart facility that can be embedded in the batch queue itself. The goal of the batch-queue plugin is to recreate the original parallel computation in a transparent manner. This mechanism is invisible both to any resource manager and to the MPI libraries themselves. During restart, the batch-queue plugin must adapt to a new execution environment created by the resource manager at that time. The plugin must detect the newly available nodes during restart, and arrange for launching the restarted user processes onto appropriate nodes.

Issues specific to a resource manager may arise during this process, such as the creation by the resource manager of a new read-only nodefile that is inconsistent with the pre-checkpoint version (see below). Recall that modern resource management (RM) systems allocate resources for jobs, which are then launched in the background in a non-interactive mode. Although the RM systems don't intervene much in a program's execution (except for PMI; see an example below), they do modify part of its execution environment. For example, some of them redirect a program's standard input, output, and error to special files, and later move those files to the user's working directory once the program is finished or killed. They also provide services for remote launch of programs, such as tm_spawn for TORQUE PBS, lsb_launch() for Load Sharing Facility (LSF), and even standalone commands such as srun for SLURM. The batch-queue plugin can handle the new execution environment during restart. It detects the available nodes, and launches the restarting processes onto the nodes as required. The new program may not have permission to overwrite some environment files (e.g., the nodefile) and may need to update these file descriptors to point to the copy of the files saved during checkpoint. We next discuss some of the virtualization strategies provided by the batch-queue plugin.

Support for batch system remote launch mechanism

To fully support parallel programs in modern RM systems, the remote child processes should be automatically placed under checkpoint control. For all supported batch systems, this plugin uses the same technique to provide this service: it patches the command line passed to the remote launch mechanism by adding a prefix, dmtcp_launch <options>. For example, in the case of TORQUE PBS, a wrapper for tm_spawn updates the passed arguments to insert the dmtcp_launch command.

Communication between Batch Systems and the Application

A common issue for any resource manager is the binding of stdin/out/err to files. Those files must be saved in the checkpoint image, for the sake of consistency and transparency. At restart time, the plugin must discover the bindings of stdin/out/err to the new files created by the resource manager. Any saved content from prior to checkpoint must be written into those files.

Batch systems usually communicate with applications using special environment variables. Some batch systems use auxiliary files in addition to the environment variables. For example, TORQUE saves a list of its allocated nodes into a read-only nodefile, which can be cached by the application. But at restart time, a new read-only nodefile will be generated, different from the one cached by the application. To address this situation, the batch-queue plugin creates a temporary file containing the original nodefile contents and modifies the file descriptor of the restarted application to point to this alternate nodefile.

Communication between MPI Application and External PMI Interface

Most modern MPI implementations use or support the Process Management Interface (PMI) [14]. The PMI model comprises three entities: the MPI library, the PMI library, and the process manager. Currently there are several implementations of process manager entities, including the standalone Hydra package, and the PMI server of the SLURM resource manager.

While the multi-host-capable Socket plugin transparently supports the Hydra implementation, additional plugin support is needed to integrate the SLURM PMI implementation. SLURM requires an MPI process to communicate with the SLURM job step daemon, which is not under checkpoint control. In this case, the batch-queue plugin finalizes the PMI session before checkpointing and recreates it afterward.

Specialized peer-discovery and remote launch service

The processes may be restarted on different nodes. The number of slots (number of processes per node) may be different for the new nodes. The batch-queue plugin employs a node discovery tool to find the new nodes and to map old resources to the newly allocated node set. For TORQUE RM, the plugin analyzes the new nodefile, and for SLURM it parses the SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE environment variables. After this step, the resource allocation is available in an RM-independent format. Next, the old resources are mapped onto the new ones. Once the resources have been mapped, the application is launched using the appropriate RM system mechanism. The mapping algorithm should consider the slots when matching resources between the old and new sets. It should be noted that the processes that were launched on the head node of a cluster usually have a special environment (special stdin/out/err connections and access to the nodefile) and may need special treatment.

5.6 Ptrace Plugin

The ptrace system call is used by a superior process (e.g., gdb, strace, etc.) to attach to an inferior process (e.g., a.out) in order to trace it. The ptrace system call uses CPU hardware support, making it harder to checkpoint. The inferior process can't perform a checkpoint until it is detached or allowed to run freely during the checkpoint phase. A ptrace plugin is used to solve these problems [127]. The ptrace plugin in the superior process detaches the inferior process before checkpointing and re-attaches right after restart. The ptrace plugin in the inferior process has an added responsibility. It is often the case that the inferior threads are quiesced while they are in possession of a system resource, or while executing a critical section in the code. This can result in a deadlock. To fix this, the ptrace plugin forces the user threads to release resources before entering a quiescent state. This is done by using Pre/Post-Quiesce event notifications. Pre-Quiesce is generated by the user thread just before entering the quiesced state. While processing this hook, each thread ensures that it is not holding any system resources, locks, etc. that can result in a deadlock. The Post-Quiesce phase forces the inferior thread to wait until the superior can attach to it after restart.

5.7 Deterministic Record-Replay

The record-replay plugin is needed by any reversible debugger that uses checkpoint, restart, and re-execute. FReD (Fast Reversible Debugger) [112] can add reversibility to any debugger by using a checkpoint, restart, and re-execute strategy. FReD uses DMTCP for checkpointing. Deterministic record-replay for FReD was achieved by creating a record-replay plugin to be used with DMTCP. This plugin is generally placed before any other plugin in the plugin hierarchy, to allow it to "hijack" library calls. Due to its complexity, the record-replay plugin is the largest plugin in terms of lines of code (see Table 5.2). There are several potential sources of nondeterminism in program execution, and record-replay must address all of them: thread interleaving, external events (I/O, etc.), and memory allocation. While correct replay of external events is required for all kinds of programs, memory accuracy is often not an issue for higher-level languages like Python and Perl, which do not expose the underlying heap to the user's program. FReD handles all these aspects by wrapping various system calls. Relevant events are captured by interposing on library calls using dlopen/dlsym to create function wrappers for interesting library functions. The wrappers record events into the log on the first execution and then return the appropriate values (or block threads as required) on replay. We start recording when directed by FReD (often after the first checkpoint).

The system records the events related to thread interleaving, external events, and memory allocation into a log. On replay, it ensures that the events are replayed in the same order as they were recorded. The plugin guarantees deterministic replay — even when executing on multiple cores — so long as the program is free of data races.

Thread interleaving

FReD uses wrappers around library calls such as pthread_mutex_lock and pthread_mutex_unlock to enforce the correct thread interleaving during replay. Apart from the usual pthread_xxx functions, some other functions that can enforce a certain interleaving are blocking functions like read. For example, a thread can signal another thread by writing into the write-end of a pipe when the other thread is doing a blocking read on the read-end of the pipe.

Replay of external events

Applications typically interact with the outside world as part of their execution. They also interact with the debugger and the user, as part of the debugging process. Composite debugging requires separating these streams. For debuggers that trace a program in a separate process, the I/O by the process being debugged is recorded and replayed, whereas the I/O by the debugger process is ignored.

For interpreted languages, the situation becomes trickier, as the record-replay plugin cannot differentiate between the debugger I/O and the application I/O. FReD handles this situation heuristically. It designates the standard input/output/error file descriptors as pass-through devices. Activity on the pass-through devices is ignored by the record-replay component.

Memory accuracy

One important feature of FReD is memory accuracy: the addresses of objects on the heap do not change between the original execution and replay. This is important because it means that developers can use address literals in expression watchpoints (assuming they are supported by the underlying debugger). With true replay of the application program, one would expect the memory layout to match the record phase, but the DMTCP libraries have to perform different actions during a normal run and on restart. This results in some memory allocations/deallocations originating from the DMTCP libraries that can alter the memory layout. Another cause for the change in memory layout is the memory allocated by the operating system kernel when the process doesn't specify a fixed address. An example is the mmap system call without any address hint. In this case, the kernel is free to choose any address for the memory region. Memory accuracy is accomplished by logging the arguments, as well as the return values, of mmap, munmap, etc. on record. On replay, the real functions or system calls are re-executed in the exact same order. However, the record-replay plugin provides a hint to the kernel to obtain the same memory address as was received at record time. FReD handles any conflicts caused by memory allocation/deallocation originating from DMTCP itself by forcing use of a separate allocation arena for DMTCP requests.
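A sketch of that mmap interposition is shown below. The logging helpers (in_replay, log_mmap, next_logged_mmap) are illustrative assumptions rather than FReD's actual interfaces, and a real implementation must verify that the hinted address was in fact granted (or fall back to an explicit remapping), since a plain hint is not binding.

/* Record the address the kernel chose; replay it as a hint later. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/mman.h>
#include <sys/types.h>

extern int   in_replay;                          /* set by the replay engine */
extern void  log_mmap(void *addr, size_t len);   /* append to the log        */
extern void *next_logged_mmap(void);             /* next recorded address    */

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset) {
  void *(*real_mmap)(void *, size_t, int, int, int, off_t) =
      (void *(*)(void *, size_t, int, int, int, off_t))dlsym(RTLD_NEXT, "mmap");

  if (in_replay && addr == NULL)
    addr = next_logged_mmap();      /* ask for the same address as on record */

  void *result = real_mmap(addr, length, prot, flags, fd, offset);

  if (!in_replay)
    log_mmap(result, length);       /* remember where the kernel put it */
  return result;
}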

5.8 Checkpointing Networks of Virtual Machines

Garg et al. [43] used DMTCP and plugins to provide a generic checkpoint-restart mechanism for three cases of virtual machines: user-space (standalone) QEMU [121], KVM/QEMU [114], and Lguest [115].

In all three cases, the hypervisor (VMM — virtual machine monitor) was based on Linux as the host operating system. These examples cover three distinct virtualization scenarios: entirely user-space virtualization (QEMU), full virtualization using a Linux kernel driver (KVM/QEMU), and paravirtualization using a Linux kernel driver [115].

The user-space QEMU virtual machine did not require any specific plugin. The KVM/QEMU and Lguest virtual machines required a new plugin consisting of approximately 200 lines of code. In addition, the kernel driver from Lguest required an additional 40 lines of new code to support checkpoint-restart capability. The authors estimated the implementation time at approximately five to ten person-days. This is in contrast with the number of lines of code required for libvirt.

Garg et al. [44] further implemented the first system to checkpoint a network of virtual machines by virtualizing the tun/tap interface using a plugin. The tun plugin consisted of approximately 350 lines of code.

5.9 3-D Graphics: Support for Programmable GPUs in OpenGL 2.0 and Higher

Kazemi Nafchi et al. [62] describe a mechanism for transparently checkpointing hardware-accelerated 3-D graphics. The approach is based on DMTCP with a plugin to record, prune, and replay OpenGL library calls. The calls not relevant to the last graphics frame prior to checkpointing are discarded. The remaining OpenGL calls are replayed on restart. The plugin uses approximately 4,500 lines of code.

Previously, Lagar-Cavilla et al. [69] presented VMGL for VMM-independent checkpoint-restart of OpenGL applications. VMGL used a shadow device driver for OpenGL, which shadows most OpenGL calls to model the OpenGL state, and restores it when restarting from a checkpoint. The code to maintain the OpenGL state was approximately 78,000 lines of code. Further, the new plugin has added functionality. Lagar-Cavilla et al. supported only OpenGL 1.5 (fixed pipeline functionality). The approach of the new plugin was demonstrated to apply to programmable GPUs (OpenGL 2.0 and beyond).

5.10 Transparent Checkpointing of InfiniBand

The InfiniBand plugin by Cao et al. [27] is the first to support checkpoint-restart of a native InfiniBand network. Previous checkpoint-restart systems [55] were MPI-specific. This plugin provides support for checkpointing UPC, an example of a PGAS language, which runs more efficiently when it runs natively over the InfiniBand fabric (instead of on top of an MPI layer). For applications such as these, there is no alternative solution. Compared to approximately 3,000 lines of code for the InfiniBand plugin, the checkpoint-restart functionality in Open MPI uses approximately 17,000 lines of code (without counting the InfiniBand-specific code). This is in addition to the single-process checkpointer, BLCR, that is used by Open MPI.

5.11 IB2TCP: Migrating from InfiniBand to TCP Sockets

Some traditional checkpoint-restart services, such as that for Open MPI [55], offer the ability to checkpoint over one network and restart on a second network. This is especially useful for interactive debugging. A set of checkpoint images from an InfiniBand-based production cluster can be copied to an Ethernet/TCP-based debug cluster. Thus, if a bug is encountered after running for hours on the production cluster, the most recent checkpoints can be used to restart on the debug cluster under a symbolic debugger, such as GDB.

The IB2TCP plugin enables checkpointing over InfiniBand and restarting over Ethernet in a similar fashion. An important contribution of the IB2TCP plugin [27] is that, unlike the BLCR kernel-based approach, the DMTCP/IB2TCP approach supports using an Ethernet-based cluster that uses a different Linux kernel, something that occurs frequently in practice. Further, the IB2TCP plugin can be used with the InfiniBand plugin or without the InfiniBand plugin (but then with limited support for checkpointing).

CHAPTER 6

Tesseract: Reconciling Guest I/O and Hypervisor Swapping in a VM

The previous chapters were concerned with adaptive plugins, a virtualization mechanism that decoupled the application process from the execution environment to facilitate transparent checkpoint-restart. In this chapter, I will present a virtualization mechanism that decouples the guest virtual disk from the guest operating system to prevent redundant I/O operations between the guest and the hypervisor. Guests running in virtual machines read and write state between their memory and virtualized disks. Hypervisors such as VMware ESXi [57] likewise may page guest memory to and from a hypervisor-level swap file to reclaim memory. To distinguish these two cases, we refer to the activity within the guest OS as paging and that within the hypervisor as swapping. In overcommitted situations, these two sets of operations can result in a two-level scheduling anomaly known as "double paging". Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor; it leads to long delays for the guest as the contents are read back into machine memory only to be written out again (see Sections 6.1 and 6.2). While the double-paging anomaly is well known [46, 48, 47, 128, 82], its impact on real workloads is not established.


Our approach addresses the double-paging problem directly, in a manner transparent to the guest (see Section 6.3). First, the virtual machine is extended to track associations between guest memory and either blocks in guest virtual disks or blocks in the hypervisor swap file. Second, the virtual disks are extended to support a mechanism to redirect virtual block requests to blocks in other virtual disks or the hypervisor swap file. Third, the hypervisor swap file is extended to track references to its blocks. Using these components to restructure guest I/O requests, we eliminate the main effects of double-paging by replacing the original guest operations with indirections between the guest and swap stores. An important benefit of this approach is that where hypervisors typically attempt to avoid swapping pages likely to be paged out by the guest, the two levels may now cooperate in selecting pages, since the work is complementary.

We have prototyped our approach on the VMware Workstation [56] platform, enhanced to explicitly swap memory in and out. While the current implementation focuses on deduplicating guest I/Os for contents stored in the hypervisor swap file, it is general enough to also deduplicate redundant contents between guest I/Os themselves or between the hypervisor swap file and guest disks (see Section 6.4).

In Section 6.5, we also show the impact of an unexpected side effect of our solution: a loss of locality caused by indirections to the hypervisor swap file, which can substantially slow down subsequent guest I/Os. Finally, we describe techniques to detect this loss of locality and to recover it. These techniques isolate the expensive costs of the double-paging effect and make them asynchronous with respect to the guest.

In Section 6.6, we present results using a synthetic benchmark that show, for the first time, the cost of the double-paging problem. Finally, in Section 6.7, we discuss related work.

[Figure 6.1 panels: (a) host swap out followed by guest disk read; (b) host swap out followed by guest overwriting the entire page; (c) host swap out of an unmodified guest page; (d) host swap out followed by guest disk write (double-paging).]

Figure 6.1: Some cases of redundant I/O in a virtual machine.

6.1 Redundant I/O

Figure 6.1 shows some examples of redundant I/O resulting from bad interaction between hypervisor swapping and guest I/O. In Figure 6.1a, the hypervisor swap out is followed by the guest overwriting the entire page by doing a disk read. From the hypervisor's point of view, the guest has accessed the page, and so it unnecessarily swaps in the guest page. Similarly, in Figure 6.1b, the host swap out is followed by the guest zeroing out the entire page. Here again, the hypervisor swap-in is wasteful. In Figure 6.1c, the guest reads a page from the disk into its physical memory. The page is "clean", i.e., the contents have not been modified by the guest. However, when under memory pressure, the hypervisor tries to swap out this page as well. Ideally, the hypervisor could have discarded the page contents and later restored them from the guest disk if needed.

Finally, in Figure 6.1d, the guest tries to page out a page that has already been swapped out by the host. This is the case of double-paging. The first two cases (Figures 6.1a and 6.1b) have already been addressed in some commercial products, such as the VMware ESX hypervisor. Further, concurrent work by Amit et al. [5] implements solutions for the first three cases (using mmap structures as the remapping mechanism or boundary in Linux) but ignores the fourth. Tesseract addresses the first two cases (Figures 6.1a and 6.1b) along with the double-paging case (Figure 6.1d). In addition, it can serve as a basis for the third case (Figure 6.1c) and a fifth case: a guest write followed by another guest write.

6.2 Motivation: The Double-Paging Anomaly

Tesseract has four objectives. First, to extend VMware's hosted platforms, Workstation and Fusion, to explicitly manage how the hypervisor pages out memory so that its swap subsystem can employ many of the optimizations used by the ESX platform. Second, to prototype the mechanisms needed to identify redundant I/Os originating from the guest and virtual machine monitor (VMM) and eliminate them. Third, to use this prototype to justify restructuring the underlying virtual disks of VMs to support this optimization. Finally, to simplify the hypervisor's memory scheduler so that it need not avoid paging out memory that the guest may decide to page. To address these, the project initially focused on the double-paging anomaly. One of the tasks of the hypervisor is to allocate and map host (or machine) memory to the VMs it is managing. Likewise, one of the tasks of the guest operating system in a VM is to manage the guest physical address space, allocating and mapping it to the processes running in the guest. In both cases, either the set of machine memory pages or the set of guest physical pages may be oversubscribed.

In overcommitted situations, the appropriate memory scheduler must repurpose some memory pages. For example, the hypervisor may reclaim memory from a VM by swapping out guest pages to the hypervisor-level swap file. Having preserved the contents of those pages, the underlying machine memory may be used for a new purpose. The guest OS may reclaim memory within a VM, too, to allow a guest physical page to be used by a new virtual mapping.

As hypervisor-level memory reclamation is transparent to the guest OS, the latter may choose to page out to a virtualized disk pages that were already swapped by the hypervisor. In such cases, the hypervisor must synchronously allocate machine pages to hold the contents and read the already-swapped contents back into that memory so they can be saved, in turn, to the guest OS's swap device. This multi-level scheduling conflict is called double-paging.

Figure 6.2 illustrates the double-paging problem. Suppose the hypervisor decides to reclaim a machine page (MPN) that is backing a guest physical page (PPN). In step 1, the mapping between the PPN and MPN is invalidated and, in step 2, the contents of the MPN are saved to the hypervisor's swap file. Suppose the guest OS later decides to reallocate the PPN for a new guest virtual mapping. It, in turn, in step 3a invalidates the guest-level mappings to that PPN and initiates an I/O to preserve its contents in a guest virtual disk (or guest VMDK). In handling the guest I/O request, the hypervisor must ensure that the contents to be written are available in memory. So, in step 4, the hypervisor faults the contents into a newly allocated page (MPN2) and, in step 5, establishes a mapping from the PPN to MPN2. This sequence puts extra pressure on the hypervisor memory system and may further cause additional hypervisor-level swapping as a result of allocating MPN2. In step 6, the guest OS completes the I/O by writing the contents of MPN2 to the guest VMDK. Finally, the guest OS is able to zero the contents of the new MPN so that the

[Figure 6.2 steps: (1), (2): swap out; (3a, 3b): guest block write request; (4): memory allocation and swap in; (5): establish PPN-to-MPN mapping; (6): write block to guest disk; (7): zero the new MPN for reuse.]

Figure 6.2: An example of double-paging.

PPN that now maps to it can be used for a new virtual mapping in step 7.

A hypervisor has no control over when a virtualized guest may page memory out to disk, and may even employ reclamation techniques like ballooning [128] in addition to hypervisor-level swapping. Ballooning is a technique that co-opts the guest into choosing pages to release back to the platform. It employs a guest driver or agent to allocate, and often pin, pages in the guest's physical address space. Ballooning is not a reliable solution in overcommitted situations, since it requires guest execution to choose pages and release memory, and the guest is unaware of which pages are backed by MPNs. Hypervisors that do not also page risk running out of memory. While preferring ballooning, VMware uses hypervisor swapping to guarantee progress. Because levels of overcommitment vary over time, hypervisor swapping may interleave with the guest, under pressure from ballooning, also paging. This can lead to double paging. The double-paging problem also impacts hypervisor design. Citing the potential effects of double-paging, some [82] have advocated avoiding the use of hypervisor-level swapping completely. Others have attempted to mitigate the likelihood through techniques such as employing random page selection for hypervisor-level swapping [128] or employing some form of paging-aware paravirtualized interface [48, 47]. For example, VMware's scheduler uses heuristics to find "warm" pages to avoid paging out what the guest may also choose to page out. These heuristics have extended effects, for example, on the ability to provide large (2MB) mappings to the guest. Our goals are to address the double-paging problem in a manner that is transparent to the guest running in the VM, to identify and elide the unnecessary intermediate steps such as steps 4, 5, and 6 in Figure 6.2, and to simplify hypervisor scheduling policies. Although we do not demonstrate that double-paging is a problem in real workloads, we do show how its effects can be mitigated.

6.3 Design

We now describe our prototype’s design. First, we describe how we extended the hosted platform to behave more like VMware’s server platform, ESX. Next, we outline how we identify and eliminate redundant I/Os. Finally, we describe the design of the hypervisor swap subsystem and the extensions to the virtual disks to support indirections.

6.3.1 Extending The Hosted Platform To Be Like ESX

VMware supports two kinds of hypervisors: the hosted platform, in which the hypervisor cooperatively runs on top of an unmodified host operating system such as Windows or Linux, and ESX, where the hypervisor runs as the platform kernel, the vmkernel.

Two key differences between these two platforms are how memory is allocated and mapped to a VM, and where the network and storage stacks execute.

In the existing hosted platform, each VM's device support is managed in the vmx, a user-level process running on the host operating system. Privileged services are mediated by the vmmon device driver loaded into the host kernel, and control is passed between the vmx and the VMM and its guest via vmmon. An advantage of the hosted approach is that the virtualization of I/O devices is handled by libraries in the vmx and these benefit from the device support of the underlying host OS. Guest memory is mmapped into the address space of the vmx. Memory pages are exposed to the VMM and guest by using the vmmon device driver to pin the pages in the host kernel and return the MPNs to the VMM. By backing the mmapped region for guest memory with a file, hypervisor swapping is a simple matter of invalidating all mappings for the pages to be released in the VMM, marking, if necessary, those pages as dirty in the vmx's address space, and unpinning the pages on the host.

In ESX, network and storage virtual devices are managed in the vmkernel. Likewise, the hypervisor manages per-VM pools of memory for backing guest memory. To page memory out to the VM's swap file, the VMM and vmkernel simply invalidate any guest mappings and schedule the pages' contents to be written out. Because ESX explicitly manages the swap state for a VM including its swap file, it is able to employ a number of optimizations unavailable on the current hosted platform. These optimizations include the capturing of writes to entire pages of memory [4], and the cancellation of swap-ins for swapped-out guest PPNs that are targets for disk read requests.

The first optimization is triggered when the guest accesses an unmapped or write-protected page and faults into the VMM. At this point, the guest's instruction stream is analyzed. If the page is shared [128] and the effect of the write does not change the content of the page, page-sharing is not broken. Instead, the guest's program counter is advanced past the write and it is allowed to continue execution. If the guest's write is overwriting an entire page, one or both of two actions are taken. If the written pattern is a known value, such as repeated 0x00, the guest may be mapped a shared page. This technique is used, for example, on Windows guests because Windows zeroes physical pages as they are placed on the freelist. Linux, which zeroes on allocation of a physical page, is simply mapped a writeable zeroed MPN. Separately, any pending swap-in for that PPN is cancelled. Since the most common case is the mapping of a shared zeroed-page to the guest, this optimization is referred to as the PShareZero optimization.

The second optimization is triggered by interposition on guest disk read requests. If a read request will overwrite whole PPNs, any pending swap-ins associated with those PPNs are deferred during write-preparation, the pages are pinned for the I/O, and the swap-ins are cancelled on successful I/O completion.

We have extended Tesseract so that its guest-memory and swap mechanisms behave more like those of ESX. Instead of mmapping a pagefile to provide memory for the guest, Tesseract's vmx process mmaps an anonymously-backed region of its address space, uses madvise to mark the range as not needed, and explicitly pins pages as they are accessed by either the vmx or by the VMM. Paging by the hypervisor becomes an explicit operation, reading from or writing to an explicit swap file. In this way, we are able to also employ the above optimizations on the hosted platform. We consider these as part of our baseline implementation.
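To make this mechanism concrete, the following is a minimal, self-contained C sketch (not the actual vmx code, whose interfaces to vmmon are not shown in this dissertation) of reserving anonymously-backed guest memory and marking it as not needed with madvise; the size constant is illustrative only.

    /* Minimal sketch: back guest memory with an anonymous mapping and mark it
     * as not needed, so that faulting pages in (and pinning them via vmmon,
     * not shown) becomes an explicit, hypervisor-driven operation. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define GUEST_MEM_SIZE (700UL * 1024 * 1024)   /* illustrative: a 700 MB guest */

    int main(void) {
        void *guest_mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (guest_mem == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        /* Advise the host OS that the range is not needed yet; pages will be
         * faulted in only when the vmx or the VMM actually accesses them. */
        if (madvise(guest_mem, GUEST_MEM_SIZE, MADV_DONTNEED) != 0) {
            perror("madvise");
            return EXIT_FAILURE;
        }
        printf("guest memory reserved at %p\n", guest_mem);
        munmap(guest_mem, GUEST_MEM_SIZE);
        return 0;
    }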

6.3.2 Reconciling Redundant I/Os

Tesseract addresses the double-paging problem transparently to the guest, allowing our solution to be applied to unmodified guests. To achieve this goal, we employ two forms of interposition. The first tracks writes to PPNs by the guest and is extended to include a mechanism to track valid relationships between guest memory pages and disk blocks that contain the same state. The second exploits the fact that the hypervisor interposes on guest I/O requests in order to transform the requests' scatter-gather lists. In addition, we modify the structure of the guest VMDKs and the hypervisor swap file, extending the former to support indirections from the VMDKs into the hypervisor swap disk. Finally, when the guest reallocates the PPN and zeroes its contents, we apply the PShareZero optimization in step 7 in Figure 6.2.

In order to track which pages have writable mappings in the guest, MPNs are initially mapped into the guest read-only. When written by the guest, the resulting page-fault allows the hypervisor to track that the guest page has been modified. We extend this same tracking mechanism to also track when guest writes invalidate associations between guest pages in memory and blocks on disk. The task is simpler when the hypervisor, itself, modifies guest memory since it can remove any associations for the modified guest pages. Likewise, virtual device operations into guest pages can create associations between the source blocks and pages. In addition, the device operations may remove prior associations when the underlying disk blocks are written. This approach, employed for example to speed the live migration of VMs from one host to another [87], can efficiently track which guest pages in memory have corresponding valid copies of their contents on disks.

The second form of interposition occurs in the handling of virtualized guest I/O operations. The basic I/O path can be broken down into three stages. The basic data structure describing an I/O request is the scatter-gather list, a structure that maps one or more possibly discontiguous memory extents to a contiguous range of disk sectors. In the preparation stage, the guest's scatter-gather list is examined and a new request is constructed that will be sent to the underlying physical device. It is here that the unmodified hypervisor handles the faulting in of swapped out pages as shown in steps 4 and 5 of Figure 6.2. Once the new request has been constructed, it is issued asynchronously and some time later there is an I/O completion event.
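As a point of reference, here is a minimal hypothetical C sketch of the scatter-gather structure just described; the type and field names are illustrative and are not VMware's definitions.

    #include <stddef.h>
    #include <stdint.h>

    /* One memory extent: a (possibly partial) guest page contributing to the I/O. */
    typedef struct SGElem {
        uint64_t ppn;      /* guest physical page backing this extent */
        uint32_t offset;   /* starting byte offset within the page    */
        uint32_t length;   /* extent length in bytes                  */
    } SGElem;

    /* A scatter-gather list maps one or more possibly discontiguous memory
     * extents onto a single contiguous range of disk sectors. */
    typedef struct SGList {
        uint64_t first_sector;   /* start of the contiguous disk range */
        size_t   num_elems;      /* number of memory extents           */
        SGElem   elems[];        /* flexible array member (C99)        */
    } SGList;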

To support the elimination of I/Os to and from virtual disks and the hypervisor block-swap store (or BSST), each guest VMDK has been extended to maintain a mapping structure allowing its virtual block identifiers to refer to blocks in other VMDKs. Likewise, the hypervisor BSST has been extended with per-block reference counts to track whether blocks in the swap file are accessible from other VMDKs or from guest memory.

The tracking of associations and interposition on guest I/Os allows four kinds of I/O elisions:

• swap - guest-I/O: a guest I/O follows the hypervisor swapping out a page's contents (Figures 6.1a and 6.1d)

• swap - swap: a page is repeatedly swapped out to the BSST with no intervening modification

• guest-I/O - swap: the case in which the hypervisor can take advantage of prior guest reads or writes to avoid writing redundant contents to the BSST (Figure 6.1c)

• guest-I/O - guest-I/O: the case in which guest I/Os can avoid redundant operations based on prior guest operations where the results are known to reside in memory (for reads) or in a guest VMDK (for writes)

For simplicity, Tesseract focuses on the first two cases since these capture the case of double-paging. Because Tesseract does not introspect on the guest, it cannot distinguish guest I/Os related to memory paging from other kinds of guest I/O. But the technique is general enough to support a wider set of optimizations such as disk deduplication for content streamed through a guest. It also complements techniques that eliminate redundant read I/Os across VMs [82].

[Figure: guest view (PPN, guest physical memory, guest disk) and hypervisor view (MPN, host memory, BSST) connected through the block indirection layer.]

Figure 6.3: Double-paging with Tesseract.

6.3.3 Tesseract’s Virtual Disk and Swap Subsystems

Figure 6.3 shows our approach embodied in Tesseract. The hypervisor swaps guest memory to a block-swap store (BSST) VMDK, which manages a map from guest PPNs to blocks in the BSST, a per-block reference-counting mechanism to track indirections from guest virtual disks, and a pool of 4KB disk blocks. When the guest OS writes out a memory page that happens to be swapped out by the hypervisor, the disk subsystem detects this condition while preparing to issue the write request. Rather than bringing the contents of the swapped-out page back into memory, the hypervisor updates the appropriate reference counts in the BSST, issues the I/O, updates metadata in the guest VMDK, and adds a reference to the corresponding disk block in the BSST. Figure 6.4 shows timelines for the scenario in which the guest OS pages out an already swapped page, with and without Tesseract. With Tesseract we are able to eliminate the overheads of a new page allocation and a disk read. To achieve this, Tesseract modifies the I/O preparation and I/O completion steps.

[Figure: timelines comparing (a) the baseline (without Tesseract), where a guest write after a hypervisor swap-out requires memory allocation and a synchronous swap-in before the write I/O, with (b) Tesseract, where only a metadata update and the PShareZero optimization are needed.]

Figure 6.4: Write I/O and hypervisor swapping.

For write requests, the memory pages in the scatter-gather list are checked for valid associations to blocks in the BSST. If these are found, the target VMDK's mapping structure is updated so that those pages' corresponding virtual disk blocks reference the appropriate blocks in the BSST, and the reference counts of these referenced blocks in the BSST are incremented. For read requests, the guest I/O request may be split into multiple I/O requests depending on where the source disk blocks reside.

Consider the state of a guest VMDK and the BSST as shown in Figure 6.5a. Here, a guest write operation wrote five disk blocks, of which two were previously swapped to the BSST. In this example, block 2 still contains the swapped contents of some PPN and has a reference count reflecting this fact and the guest write. Hence, its state has "swapped" as true and a reference count of 2. Similarly, block 4 only has a nonzero reference count because the PPN whose swapped contents originally created the disk block has since been accessed and its contents paged back in. Hence, its state has "swapped" as false and a reference count of 1. To read these blocks from the guest VMDK now requires three read operations: one against the guest VMDK and two against the BSST. The results of these read operations must then be coalesced in the read completion path.

One can view the primary cost of double-paging in an unmodified hypervisor as impacting the write-preparation time for guest I/Os. Likewise, one can view the primary cost of these cases in Tesseract as impacting the read-completion time. To mitigate these effects, we consider two forms of defragmentation. Both strategies make two assumptions:

[Figure: three panels showing a guest VMDK with five blocks and the corresponding BSST block states ("swapped" flag and reference count): (a) with Tesseract, (b) with Tesseract and BSST defragmentation, (c) with Tesseract and guest VMDK defragmentation.]

Figure 6.5: Examples of reference counts with Tesseract and with defragmentation.

• the original guest write I/O request (represented in blue) captures the guest’s notion of expected locality, and

• the guest is unlikely to immediately read the same disk blocks back into memory.

Based on these assumptions, we extended Tesseract to asynchronously reorganize the referenced state in the BSST. In Figure 6.5b, we copy the referenced blocks into a contiguous sequence in the BSST and update the guest VMDK indirections to refer to the new sequence. This approach reduces the number of split read operations. In Figure 6.5c, we copy the referenced blocks back to the locations in the original guest VMDK where the guest expects them. With this approach, the typical read operation need not be split. In effect, Tesseract asynchronously performs the expensive work that occurred in steps 4, 5, and 6 of Figure 6.2, eliminating its cost to the guest.

6.4 Implementation

Our prototype extends VMware Workstation as described in Section 6.3.1. Here, we provide more detail.

6.4.1 Explicit Management of Hypervisor Swapping

VMware Workstation relies on the host OS to handle much of the work associated with swapping guest memory. A pagefile is mapped into the vmx's address space and calls to the vmmon driver are used to lock MPNs backing this memory as needed by the guest. When memory is released through hypervisor swapping, the pages are dirtied, if necessary, in the vmx's address space and unlocked by vmmon. Should the host OS need to reclaim the backing memory, it does so as if the vmx were any other process: it writes out the state to the backing pagefiles and repurposes the MPN.

For Tesseract, we modified Workstation to support explicit swapping of guest memory. First, we eliminated the pagefile and replaced it with a special VMDK, the block swap store (BSST), into which swapped-out contents are written. The BSST maintains a partial mapping from PPNs to disk blocks tracking the contents of currently swapped-out PPNs. In addition, the BSST maintains a table of reference counts on the blocks in the BSST referenced by other guest VMDKs.

Second, we split the process for selecting pages for swapping from the process for actually writing out contents to the BSST and unlocking the backing memory. This split is motivated by the fact that, having eliminated duplicate I/Os between hypervisor swapping and guest paging, the system benefits when both levels of scheduling choose the same set of pages. The selected swap candidates are placed in a victim cache to "cool down". Only the coldest pages are eventually written out to disk. This victim cache is maintained as a percentage of the memory locked by the guest; for our study, 10%. Should the guest access a page in the pool, it is removed from the pool without being unlocked.
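The victim-cache policy can be sketched as follows; this is an illustrative data structure under assumed names, not the code used in the prototype.

    #include <stdint.h>
    #include <stdlib.h>

    /* One swap candidate parked in the victim cache to "cool down". */
    typedef struct Victim {
        uint64_t       ppn;     /* guest physical page number */
        struct Victim *next;
    } Victim;

    typedef struct VictimCache {
        Victim *head;        /* oldest (coldest) candidates at the head */
        size_t  count;
        size_t  capacity;    /* e.g., 10% of the guest's locked pages   */
    } VictimCache;

    /* Guest touched a candidate: drop it from the pool; the page stays locked. */
    static void victim_touch(VictimCache *vc, uint64_t ppn) {
        for (Victim **pp = &vc->head; *pp; pp = &(*pp)->next) {
            if ((*pp)->ppn == ppn) {
                Victim *hit = *pp;
                *pp = hit->next;
                free(hit);
                vc->count--;
                return;
            }
        }
    }

    /* Only when the cache is full is the coldest page actually written out. */
    static uint64_t victim_evict_coldest(VictimCache *vc) {
        if (vc->count < vc->capacity || vc->head == NULL)
            return UINT64_MAX;              /* nothing cold enough yet */
        Victim *coldest = vc->head;
        uint64_t ppn = coldest->ppn;
        vc->head = coldest->next;
        free(coldest);
        vc->count--;
        return ppn;                         /* caller swaps this PPN to the BSST */
    }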

When the guest pages out memory, it does so to repurpose a given guest physical page for a new linear mapping. Since this new use will access that guest physical page, one may be concerned that this access will force the page to be swapped in from the BSST first. However, because the guest will either zero the contents of that page or read into it from disk and because the VMM can detect that the whole page will be overwritten before it is visible to the guest, the vmx is able to cancel the swap-in and complete the page locking operation.

6.4.2 Tracking Memory Pages and Disk Blocks

There are two steps to maintaining a mapping between disk blocks and pages in memory. The first is recognizing the pages read and written in guest and hypervisor I/O operations. By examining the scatter-gather lists of each I/O, one can identify when the contents in memory and on disk match. While we plan to maintain this mapping for all associations between guest disks and guest memory, we currently only track the associations between blocks in the BSST and main memory.

The second step is to track when these associations are broken. For guest memory, this event happens when the guest modifies a page of memory. The VMM tracks when this happens by trapping the fact that a writable mapping is required, and this information is communicated to the vmx. For device accesses, on the other hand, this event is tracked either through explicit checks in the module which provides devices access to guest memory, or by examining page lists for I/O operations that read contents into memory pages.

6.4.3 I/O Paths

When the guest OS is running inside a virtual machine, guest I/O requests are intercepted by the VMM, which is responsible for storage adaptor virtualization, and then passed to the hypervisor, where further I/O virtualization occurs. Figure 6.6 identifies the primary modules in VMware Workstation's I/O stack. The guest operating system generates scatter-gather lists for I/O (1). Tesseract inspects the scatter-gather lists of incoming guest I/O requests in the SCSI Disk Device layer, where a request to the guest VMDK may be updated (2). Any extra I/O requests to the BSST may be issued (3) as shown in Table 6.2. The asynchronous I/O manager sends the I/O requests to the host file system (4). On completion, the asynchronous I/O manager generates completion events (5). Waiting for the completion of all the I/O requests needed to service the original guest I/O request is isolated to the SCSI Disk Device layer as well (6). When running with defragmentation enabled (see Section 6.5), Tesseract allocates a pool of worker threads for handling defragmentation requests.

Guest Write I/Os

Guest I/O requests have PPNs in their scatter-gather lists. The vmx rewrites the scatter-gather list, replacing guest PPNs with virtual pages from its address space before passing it further to the physical device.

[Figure: the VMware Workstation I/O stack: the guest operating system and VMM on top; within the vmx, the SCSI Disk Device layer with Tesseract's block indirection layer, the asynchronous I/O manager, and the host file layer. Steps: (1) scatter-gather list received from the guest; (2) Tesseract updates the scatter-gather list (writes: swapped pages are removed; reads: guest VMDK indirections are looked up); (3) I/O requests dispatched (writes: a single request with holes; reads: one request to the guest VMDK and one or more requests to the BSST); (4) asynchronous I/O scheduled and performed; (5) completion events generated for each dispatched I/O; (6) guest notified of completion (writes: guest-to-BSST indirections created; reads: wait for all requests and merge results).]

Figure 6.6: VMware Workstation I/O Stack

Normally, for write I/O requests, if a page was previously swapped, so that the PPN does not have a backing MPN, the hypervisor allocates a new MPN and brings the page's contents in from disk.

With Tesseract, we check whether the PPNs are already swapped out to BSST blocks by querying the PPN-to-BSST-block mapping. We then use the virtual address of a special dummy page in the scatter-gather list for each page that resides in the BSST. On completion of the I/O, metadata associated with the guest VMDK is updated to reflect the fact that the contents of guest disk blocks for BSST-resident pages are in the BSST. This sequence allows the guest to page out memory without inducing double-paging.

[Figure: (a) a scatter-gather list prepared by the guest OS for a disk write, covering pages 1-8; (b) the modified scatter-gather list, in which the pages swapped out to the BSST are replaced by a dummy page.]

Figure 6.7: The pages swapped out to the BSST are replaced with a dummy page to avoid double-paging. Indirections are created for the corresponding guest disk blocks.

Figure 6.7 illustrates how write I/O requests to the guest VMDK are handled by Tesseract. Tesseract recognizes that the contents for pages 2, 4, 6 and 7 in the scatter-gather list provided by the guest OS reside in the BSST (Figure 6.7a). When a new scatter-gather list to be passed to the physical device is formed, a dummy page is used for each BSST-resident page (Figure 6.7b).
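The following is a minimal sketch, under assumed names, of the write-preparation rewrite just described: pages found in the PPN-to-BSST map are replaced by a single dummy page in the outgoing scatter-gather list, and the affected entries are remembered so that indirections can be created at completion. The lookup function and types are hypothetical, not the prototype's actual interfaces.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical lookup into the PPN -> BSST-block map (see Section 6.4.4). */
    extern bool bsst_lookup(uint64_t ppn, uint64_t *bsst_block);

    typedef struct SGElem { uint64_t ppn; uint32_t offset, length; } SGElem;

    /* Rewrite a guest write request in place: every page whose contents already
     * live in the BSST is replaced by 'dummy_ppn', and its index is recorded so
     * that a guest-VMDK -> BSST indirection can be created on I/O completion.
     * Returns the number of elided pages. */
    static size_t prepare_guest_write(SGElem *sg, size_t n, uint64_t dummy_ppn,
                                      size_t *elided_idx, uint64_t *elided_bsst_block) {
        size_t elided = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t block;
            if (bsst_lookup(sg[i].ppn, &block)) {
                elided_idx[elided] = i;            /* which entry gets an indirection  */
                elided_bsst_block[elided] = block; /* where its contents actually live */
                sg[i].ppn = dummy_ppn;             /* do not swap the page back in     */
                elided++;
            }
        }
        return elided;
    }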

Guest Read I/Os and Guest Disk Fragmentation

Recognizing that data may reside in both the guest VMDK and the BSST is a double-edged sword. On the guest write path, it allows us to dismiss pages that are already present in the BSST and thus avoid swapping them in just to be written out to the guest VMDK. However, when it comes to guest reads, the otherwise single I/O request might have to be split into multiple I/Os. This happens when some of the data needed by the I/O is located in the BSST. Since data that has to be read from the BSST may not be contiguous on disk, the number of extra I/O requests to the BSST may be as high as the number of data pages in the original I/O request that reside in the BSST. We refer to a collection of pages in the original I/O request for which a separate I/O request to the BSST must be issued as a hole. Read I/O requests to the guest VMDK which have holes are called fragmented. We modify a fragmented request so that all pages that should be filled in with data from the BSST are replaced with a dummy page, which serves as a placeholder and receives whatever data is read from the guest VMDK for those blocks. So, in the end, for each fragmented read request we issue one modified I/O request to the guest VMDK and N requests to the BSST, where N is the number of holes. After all the issued I/Os are completed, we signal the completion of the originally issued guest read I/O request.

[Figure: a guest read of pages 1-8 in which pages 2, 4, 6, and 7 reside in the BSST; one read is issued to the guest VMDK with a dummy page standing in for the holes, and separate reads are issued to the BSST.]

Figure 6.8: An original guest read request is split into multiple read requests due to holes in the guest VMDK.

In Figure 6.8, the guest read I/O request finds the disk blocks for pages 2, 4, 6 and 7 located on the BSST, where they occupy non-contiguous space. Tesseract issues one read request to the guest VMDK to get data for pages 1, 3, 5 and 8. In the scatter-gather list sent to the physical device, a dummy page is used as a read target for pages 2, 4, 6 and 7. Together with that one read I/O request to the guest VMDK, four read I/O requests are issued to the BSST. Each of those four requests reads data from one of the four disk blocks in the BSST.
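A sketch, again with hypothetical helper names, of how a fragmented read might be split: one request to the guest VMDK with dummy pages standing in for the holes, plus one request to the BSST per hole, with completion signaled only after all of them finish.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers: indirection lookup and asynchronous dispatch. */
    extern bool indirection_lookup(uint64_t guest_block, uint64_t *bsst_block);
    extern void issue_read_guest_vmdk(uint64_t first_block, const uint64_t *ppns, size_t n);
    extern void issue_read_bsst(uint64_t bsst_block, uint64_t ppn);

    /* Split a guest read of n consecutive guest-VMDK blocks into one modified
     * request to the guest VMDK plus one BSST read per hole. Returns the number
     * of outstanding sub-requests the completion path must wait for. */
    static size_t dispatch_guest_read(uint64_t first_block, uint64_t *ppns, size_t n,
                                      uint64_t dummy_ppn) {
        size_t outstanding = 1;                         /* the guest-VMDK read itself   */
        for (size_t i = 0; i < n; i++) {
            uint64_t bsst_block;
            if (indirection_lookup(first_block + i, &bsst_block)) {
                issue_read_bsst(bsst_block, ppns[i]);   /* real data comes from the BSST */
                ppns[i] = dummy_ppn;                    /* guest-VMDK read lands in a placeholder */
                outstanding++;
            }
        }
        issue_read_guest_vmdk(first_block, ppns, n);
        return outstanding;     /* completion is reported only after all finish */
    }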

Optimization of Repeated Swaps

In addition to addressing the double-paging anomaly by tracking guest I/Os whose contents exist in the BSST, we also implemented an optimization for back-to-back swap-out requests for a memory page whose contents remain clean. If a page's contents are written out to the BSST, and later swapped back in, we continue to track the old block in the BSST as a form of victim cache. If the same page is chosen to be swapped out again and there has been no intervening modification of the contents of the page in memory, we simply adjust the reference count (see Section 6.4.4) for the block copy that is already in the BSST.

6.4.4 Managing Block Indirection Metadata

Tesseract keeps in-memory metadata for tracking PPN-to-BSST block mappings and for recording block indirections between guest and BSST VMDKs. The PPN-to-BSST block mapping is stored as key-value pairs in a hash table. Indirections between guest and BSST VMDKs are tracked in a similar manner.

Tesseract also keeps reference counts for the BSST blocks. When a new PPN-to-BSST mapping is created, the reference count for the corresponding BSST block is set to 1. The reference count is incremented in the write prepare stage for PPNs found to have PPN-to-BSST block mappings. This ensures that such BSST blocks are not repurposed while the guest write is still in progress. Later, on the write completion path, the guest-VMDK-to-BSST indirection is created. The reference count of a BSST block is decremented during a hypervisor swap-in operation. It is also decremented when the guest VMDK block is overwritten by new contents and the previous guest block indirection is invalidated. Blocks with zero reference counts are considered free and reclaimable.
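To make these transitions concrete, here is a small hypothetical sketch of per-block BSST metadata and the points at which the reference count moves, following the life cycle described above; it is an interpretation of that description, not the prototype's data structures.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BsstBlock {
        uint32_t refcnt;    /* 0 means free and reclaimable               */
        bool     swapped;   /* still holds the swapped contents of a PPN  */
    } BsstBlock;

    /* A new PPN -> BSST mapping is created when a page is swapped out. */
    static void on_swap_out(BsstBlock *b)            { b->refcnt = 1; b->swapped = true; }

    /* Write-prepare found a PPN resident in this block: hold it for the guest I/O
     * and for the indirection created on write completion. */
    static void on_write_prepare(BsstBlock *b)       { b->refcnt++; }

    /* The page was swapped back in; any guest-VMDK indirection remains valid. */
    static void on_swap_in(BsstBlock *b)             { b->refcnt--; b->swapped = false; }

    /* The guest overwrote the VMDK block, invalidating its indirection. */
    static void on_indirection_invalid(BsstBlock *b) { b->refcnt--; }

    static bool is_reclaimable(const BsstBlock *b)   { return b->refcnt == 0; }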

Metadata Consistency

While updating metadata in memory is faster than updating it on disk, it poses consistency issues. What if the system crashes before the metadata is synced back to persistent storage? To reduce the likelihood of such problems, Tesseract periodically synchronizes the metadata to disk on the same schedule used by the VMDK management library for virtual disk state. However, because reference counts in the BSST and block indirections in VMDKs are written at different stages of an I/O request, crashes must be detected and an fsck-like repair process run.

Entanglement of guest VMDKs and BSST

Once indirections are created between a guest VMDK and the BSST, it becomes impossible to move just the guest VMDK. To disentangle the guest VMDK, we must copy back from the BSST each block for which the guest VMDK holds an indirection. This can be done either online or offline. More details about the online process are in Section 6.5.2.

6.5 Guest Disk Fragmentation

As mentioned in Section 6.4.3, when running with Tesseract, guest read I/O requests might be fragmented in the sense that some of the data the guest is asking for in a single request may reside in both the BSST and the guest VMDK. The fragmentation level depends on the nature of the workload, the guest OS, and swap activity at the guest and the hypervisor level. Our experiments with SPECjbb2005 [103] showed that even for a moderate level of memory pressure, as much as 48% of all read I/O requests had at least one hole.

By solving the double-paging problem, Tesseract significantly reduced the write-prepare time of guest I/O requests, since synchronous swap-in requests no longer cause delays. However, a non-trivial overhead was added to read-completion. Indeed, instead of waiting for a single read I/O request to the guest VMDK, the hypervisor may now have to wait for several extra read I/O requests to the BSST to complete before reporting the completion to the guest.

To address these overheads, Tesseract was extended with a defragmentation mechanism that improves read I/O access locality and thus reduces read-completion time. We investigated two approaches to implementing defragmentation: BSST defragmentation and guest VMDK defragmentation. While defragmentation is intended to help reduce read-completion time, it has its own cost. Defragmentation requests are asynchronous and reduce the time to complete affected guest I/Os, but, at the same time, they contribute to a higher disk load and in extreme cases may have an impact on read-prepare times. The defragmentation activity can be throttled on detecting performance bottlenecks due to higher disk load. ESX, for example, provides a mechanism, SIOC, that measures latencies to detect overload and enforce proportional-share fairness [50]. The defragmentation mechanism could participate in this protocol.

6.5.1 BSST Defragmentation

BSST defragmentation uses guest write I/O requests as a hint of which BSST blocks might be accessed together in a single read I/O request in the future. Given that information, we group the identified blocks together in the BSST.

Figure 6.9 shows the scatter-gather list of a write I/O request that goes to the guest VMDK. In that request, the contents of pages 2, 4, 6 and 7 are already present in the BSST. As soon as these blocks are identified, a worker thread picks up a reallocation job that allocates a new set of contiguous blocks in the BSST and copies the contents of the BSST blocks for pages 2, 4, 6 and 7 into that new set of blocks. This copying allows those blocks to be read later as a single I/O request issued by the guest and reflects the guest's own expectation of the locality of these blocks.
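A sketch of such an asynchronous reallocation job, with hypothetical helper names: the worker copies the BSST blocks named in a recent guest write into a freshly allocated contiguous run and repoints the guest-VMDK indirections at it.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers for BSST allocation, copying, and indirection updates. */
    extern uint64_t bsst_alloc_contiguous(size_t nblocks);
    extern void     bsst_copy_block(uint64_t from_block, uint64_t to_block);
    extern void     indirection_update(uint64_t guest_block, uint64_t new_bsst_block);

    /* Defragmentation job: guest_blocks[i] currently points at old_bsst[i].
     * Copy those blocks into one contiguous run so a later guest read that
     * preserves the write's locality needs only one extra BSST read. */
    static void bsst_defrag_job(const uint64_t *guest_blocks, const uint64_t *old_bsst,
                                size_t n) {
        uint64_t run = bsst_alloc_contiguous(n);
        for (size_t i = 0; i < n; i++) {
            bsst_copy_block(old_bsst[i], run + i);        /* asynchronous copy within the BSST   */
            indirection_update(guest_blocks[i], run + i); /* guest VMDK now references the run   */
        }
    }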

BSST defragmentation is not perfect. If multiple guest VMDK writes create indirections to the same BSST blocks, multiple copies of those blocks may be made in the BSST.

[Figure: the BSST blocks referenced by a guest write (those for pages 2, 4, 6, and 7) are copied into a contiguous run in the BSST.]

Figure 6.9: Defragmenting the BSST.

Further, since blocks are still present in both the guest VMDK and the BSST, extra I/O requests to the BSST cannot be entirely eliminated. In addition, BSST defragmentation tries to predict read-access locality from write-access locality, and the boundaries of read requests will not generally match the boundaries of the write requests. So each read I/O request that, without defragmentation, would have required reads from both the guest VMDK and the BSST will still be split into one request that goes to the guest VMDK and one or more requests to the BSST. All this contributes to longer read-completion times, as shown in Table 6.4.

However, it is relatively easy to implement BSST defragmentation without worrying too much about data races with the I/O going to the guest VMDK. It can significantly reduce the number of extra I/Os that have to be issued to the BSST to service a guest I/O request, as shown in Table 6.3.

If a guest read I/O request preserves the locality observed at the time of the guest writes, we need more than one read I/O request from the BSST only when it hits more than one group of blocks created during BSST defragmentation. Although this is entirely workload-dependent, one can expect read requests to typically be smaller than write requests, and so, in many cases, the number of extra I/O requests to the BSST is reduced to one (the read fits into one defragmented area) or two (it crosses the boundary between two defragmented areas).

[Figure: the blocks held in the BSST (those for pages 2, 4, 6, and 7) are copied back to their home locations in the guest VMDK.]

Figure 6.10: Defragmenting the guest VMDK.

6.5.2 Guest VMDK Defragmentation

Like BSST defragmentation, guest VMDK defragmentation uses the scatter-gather lists of write I/O requests to identify BSST blocks that must be copied. But unlike BSST defragmentation, these blocks are copied to the guest VMDK. The goal is to restore the guest VMDK to the state it would have had without Tesseract. Tesseract with guest VMDK defragmentation replaces swap-in operations with asynchronous copying from the BSST to the guest VMDK. For example, in Figure 6.10, blocks 2, 4, 6 and 7 are copied to the relevant locations on the guest VMDK by a worker thread.

We enqueue a defragmentation request as soon as the scatter-gather list of the guest write I/O request is processed and the blocks to be asynchronously fetched to the guest VMDK are identified. The defragmentation requests are organized as a priority queue. If a guest read I/O request needs to read data from a block that has not yet been copied from the BSST, the priority of the defragmentation request that refers to the block is raised to the highest level, and the guest read I/O request is blocked until copying of all the missing blocks finishes.

While Tesseract with guest defragmentation can have an edge over Tesseract without defragmentation, it is not always a win. With guest defragmentation, before a guest read I/O request has a chance to be issued to the guest VMDK, it may become blocked waiting for a defragmentation request to complete. This may end up being slower than issuing requests to the BSST and the guest VMDK in parallel.
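A small sketch of this bookkeeping, with hypothetical names: a guest read that hits a block whose copy-back is still pending raises the corresponding request's priority and must block until the copy completes.

    #include <stdbool.h>
    #include <stdint.h>

    /* One pending copy of a block from the BSST back to its home location in the
     * guest VMDK, queued when the corresponding guest write was processed. */
    typedef struct DefragReq {
        uint64_t guest_block;   /* destination block in the guest VMDK */
        uint64_t bsst_block;    /* source block in the BSST            */
        int      priority;      /* smaller value == served sooner      */
        bool     completed;
    } DefragReq;

    /* A guest read needs a block whose copy-back has not finished: raise the
     * request to the highest priority and report that the read must block. */
    static bool must_block_read(DefragReq *req) {
        if (req->completed)
            return false;       /* data is already back in the guest VMDK */
        req->priority = 0;      /* 0 == highest priority in this sketch   */
        return true;            /* read waits until the copy completes    */
    }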

Disentanglement of Guest and BSST VMDKs.

Guest defragmentation has the added benefit of removing the entanglement between the guest and BSST VMDKs. Once there are no block indirections between the guest VMDK and the BSST, the guest VMDK can be moved easily. This also allows us to disable Tesseract's double-paging optimization on the fly.

6.6 Evaluation

We ran our experiments on an AMD Opteron 6168 (Magny-Cours) with 12 1.9 GHz cores, 1.5 GB of memory and a 1 TB 7200rpm Seagate SATA drive, a 1 TB 7200rpm Western Digital SATA drive, and a 128 GB Samsung SSD drive. We used OpenSUSE 11.4 as the host OS and a 6 VCPU 700 MB VM running Ubuntu 11.04. We used Jenkins [113] to monitor and manage execution of the test cases. To ensure the same test conditions for all test runs, we created a fresh copy of the guest virtual disk from backup before each run. For the evaluation we ran SPECjbb2005 [103], which was modified to emit instantaneous scores every second. It was run with 6 warehouses for 120 seconds. The heap size was set to 450 MB. The SPECjbb benchmark creates several warehouses and processes transactions for each of them. We induced hypervisor-level swapping by setting a maximum limit on the pages the VM can lock. The BSST VMDK was preallocated. The swap-out victim cache size was chosen to be 10% of the VM's memory size. All experiments except the one with the SSD represent results from five trial runs. The SSD experiment represents results from three trial runs.

6.6.1 Inducing Double-Paging Activity

To control hypervisor swapping, we set a hypervisor-imposed limit on the machine memory available for the VM. Guest paging was induced by running the SPECjbb benchmark with a working set larger than the available guest memory.

To induce double-paging, the guest must page out the pages that were already swapped by the hypervisor. Since the hypervisor would choose only the cold pages from the guest memory, we employed a custom memhog that would lock some pages in the guest memory for a predetermined amount of time inside the guest. While the pages were locked by this memhog, a different memhog would repeatedly touch the rest of the available guest pages, making them "hot". At this point the pages locked by the first memhog are considered "cold" and are swapped out by the hypervisor.

Next, memhog unlocks all its memory and the SPECjbb benchmark is started inside the guest. Once the warehouses have been created by SPECjbb, the memory pressure increases inside the guest. The guest is forced to find and page out "cold" pages. The pages unlocked by memhog are good candidates as they have not been touched in the recent past.
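As a rough illustration of this setup (not the exact tool used in the experiments), a memhog of this kind can be built around mlock: it locks a fixed amount of guest memory for a set time, so that the locked pages go cold while a second memhog keeps the rest of guest memory hot.

    /* Minimal memhog sketch: lock <size> MB of memory for <seconds> seconds,
     * then unlock and exit. Pages locked here are untouched while locked, so
     * the guest and hypervisor eventually treat them as cold.
     * Hypothetical usage: ./memhog <size-in-MB> <seconds> */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <size-in-MB> <seconds>\n", argv[0]);
            return EXIT_FAILURE;
        }
        size_t size = (size_t)atol(argv[1]) * 1024 * 1024;
        unsigned seconds = (unsigned)atoi(argv[2]);

        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        memset(buf, 0xA5, size);          /* fault the pages in once                 */
        if (mlock(buf, size) != 0) {      /* pin them so the guest cannot page them  */
            perror("mlock");
            return EXIT_FAILURE;
        }

        sleep(seconds);                   /* pages go "cold" while locked            */

        munlock(buf, size);               /* release; they are now good paging victims */
        munmap(buf, size);
        return 0;
    }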

We used memhog and memory locking in our setup to make the experiments more repeatable. In the real world, the conditions we were simulating could be observed, for example, when an application goes through an execution phase shift, or when an application that caches a lot of data in memory without actively using it is descheduled and another memory-intensive application is woken up by the guest.

As a baseline we ran with Tesseract disabled. This effectively disabled analysis and rewriting of guest I/O commands so that all pages affected by an I/O command that happened to be swapped out by the hypervisor had to be swapped back in before the command could be issued to disk.

6.6.2 Application Performance

While it is hard to control and measure the direct impact of individual double-paging events, we use the pauses or gaps observed in the logged instantaneous scores of each SPECjbb run to characterize the application behavior. Depending upon the amount of double-paging activity, the pauses can be as long as 60 seconds in a 120-second run and negatively affect the final score. Often the pauses are associated with garbage-collection activity.

[Figure: six scatter plots of SPECjbb score versus total SPECjbb blockage time (seconds) for baseline and Tesseract, for memhog sizes of 0, 30, 60, 90, 120, and 150 MB.]

Figure 6.11: Trends for scores and pauses in SPECjbb runs with varying guest memory pressure and 10% host overcommitment.

Varying Levels of Guest Memory Pressure

Figure 6.11 shows scores and pause times for different sizes of memhog inside the guest with 10% host overcommitment. When the guest is trying to page out pages which are swapped by the hypervisor, the latter is swapping them back in and is forced to swap out some other pages. This cascade effect is responsible for the increased pause periods for the baseline. With Tesseract, however, the pause periods grow at a lower rate. This growth can be explained by longer wait times due to increased disk activity. Although the scores are about the same for higher guest memory pressure, the total pauses for Tesseract are less than those for the baseline.

[Figure: bar chart of maximum pause/blockage time (seconds) for Tesseract and the baseline across memhog sizes of 0, 30, 60, 90, 120, 150, 180, and 240 MB.]

Figure 6.12: Maximum single pauses observed in SPECjbb instantaneous scoring with varying guest memory pressure and 10% host memory overcommitment.

Figure 6.12 shows the effect of increased memory pressure on the length of the biggest application pause. The bars represent the range of maximum pauses for individual sets of runs. There are five runs in each set. Notice that Tesseract clearly outperforms the baseline. The highest maximum pause time with Tesseract is 7 seconds, compared to 30 seconds for the baseline. This shows that the application is more responsive with Tesseract.

[Figure: four scatter plots of SPECjbb score versus total SPECjbb blockage time (seconds) for baseline and Tesseract at 0%, 5%, 15%, and 20% host overcommitment.]

Figure 6.13: Scores and total pause times for SPECjbb runs with varying host overcommitment and 60 MB memhog.

Varying Levels of Host Memory Pressure

To study the effect of increasing memory pressure by the hypervisor, we ran the application with various levels of host overcommitment with 60 MB memhog inside the guest.

Figure 6.13 shows the effect of increasing host memory pressure on the application scores and total pause times. For lower host pressure (0% and 5%), the score and pause times for the baseline and Tesseract are about the same. However, for higher memory pressure there is a significant difference in the performance. For example, in the 20% case, the baseline observes total pauses in the range of 80–110 seconds. Tesseract, on the other hand, observes total pauses in a much lower range of 30–60 seconds.

[Figure: bar chart of maximum pause/blockage time (seconds) for no-defrag, guest-defrag, bsst-defrag, and baseline at 0%, 5%, 15%, and 20% host memory overcommitment.]

Figure 6.14: Comparing maximum single pauses for SPECjbb under various defragmentation schemes with varying host memory overcommitment and 60 MB memhog.

Figure 6.14 focuses on the maximum pauses seen by the application as host memory pressure grows. While the maximum pauses are insignificant at lower memory pressure, at higher pressure Tesseract clearly outperforms the baseline.

6.6.3 Double-Paging and Guest Write I/O Requests

Table 6.1 shows why double-paging affects guest write I/O performance. As expected, if the host is not experiencing memory pressure, none of the 1,030 guest write I/O requests refer to pages swapped by the hypervisor. As memory pressure builds up, more and more guest write I/O requests require one or more pages to be swapped in before a write can be issued to the physical disk. All of this contributes to a longer write-prepare time for such requests. Consider a setup in which host memory is 20% overcommitted. Of 1,366 guest write I/O requests, 981 had at least one page that had to be swapped in.

Host (%)   Guest I/Os issued   I/Os with holes   I/Os w/ 1-20 holes   I/Os w/ 21-50 holes   I/Os w/ >50 holes   Double-paging cases
   0             1,030                0                   0                    0                    0                     0
   5               981              537                 343                  106                   88                11,254
  10             1,042              661                 358                  132                  171                19,381
  15             1,292              766                 377                  237                  152                22,584
  20             1,366              981                 524                  177                  280                32,547

Table 6.1: Holes in write I/O requests for varying host overcommitment and 60 MB memhog inside the guest.

Then, 524 guest write I/O requests needed between 1 and 20 swap-in requests completed by the hypervisor in order to proceed, 177 needed between 21 and 50 swap-in requests completed, and, finally, 280 guest write I/O requests needed more than 50 swap-in requests.

6.6.4 Fragmentation in Guest Read I/O Requests

Table 6.2 quantifies the number of extra read I/O requests that have to be issued to the BSST if defragmentation is not used.

Host (%)   Guest I/Os issued   I/Os w/ holes   Total holes   Total I/Os issued   Score
   0             5,152                0               0              5,152        7,010
   5             5,230              708           1,675              6,197        6,801
  10             5,206            2,161           5,820              8,865        6,271
  15             4,517            2,084           6,990              9,423        6,048
  20             5,698            2,739          11,854             14,813        2,841

Table 6.2: Holes in read I/O requests for Tesseract without defragmentation for varying levels of host overcommitment and 60 MB memhog inside the guest.

Without host memory pressure there is no hypervisor-level swapping and all 5,152 guest read I/O requests can be satisfied without going to the BSST. At higher levels of memory pressure, the hypervisor starts swapping pages to disk. Tesseract detects pages in guest write I/O requests that are already in the BSST to avoid swap-in requests for such pages. The amount of work saved by Tesseract on the write I/O path is quantified in the final column of Table 6.1.

[Figure: four scatter plots of SPECjbb score versus total SPECjbb blockage time (seconds) for baseline, no-defrag, bsst-defrag, and guest-defrag, for memhog sizes of 60, 120, 180, and 240 MB.]

Figure 6.15: Scores and pauses in SPECjbb runs under various defragmentation schemes with 10% host overcommitment.

When host memory is 20% overcommitted, we can see that, out of 5,698 guest read I/O requests, 2,739 require extra read I/Os to be issued to read data from the BSST. The total number of such extra I/Os to the BSST was 11,854, which made the total number of read I/O requests issued to both the guest VMDK and the BSST equal to 14,813.

6.6.5 Evaluating Defragmentation Schemes

Figures 6.15 and 6.16 show the impact of using BSST and guest VMDK defragmentation on SPECjbb throughput, while Figures 6.14 and 6.17 give insight into SPECjbb responsiveness. Guest defragmentation performs better than the baseline in all situations and is as good as or better than BSST defragmentation.

[Figure: four scatter plots of SPECjbb score versus total SPECjbb blockage time (seconds) for baseline, no-defrag, bsst-defrag, and guest-defrag at 0%, 5%, 15%, and 20% host overcommitment.]

Figure 6.16: Scores and pauses in SPECjbb under various defragmentation schemes with varying host overcommitment and 60 MB memhog.

With low levels of host memory overcommitment, Tesseract with guest VMDK defragmentation secures better SPECjbb scores than Tesseract without defragmentation and performs on par in responsiveness metrics.

With increasing host memory overcommitment, Tesseract without defragmentation starts outperforming Tesseract with either of the defragmentation schemes in both application throughput and responsiveness, as the total and maximum pause times grow more slowly for the no-defragmentation case. This is due to the fact that at higher levels of hypervisor-level swapping, guest read I/O becomes more and more fragmented and pending defragmentation requests become a bottleneck, leading to longer read-completion times.

[Figure: bar chart of maximum pause/blockage time (seconds) for no-defrag, guest-defrag, bsst-defrag, and baseline across memhog sizes of 60, 120, 180, and 240 MB.]

Figure 6.17: Comparing maximum single pauses for SPECjbb under various defragmentation schemes with 10% host memory overcommitment.

Defrag strategy   Reads w/o holes   Reads w/ holes   Total holes   BSST reads issued   Total reads issued   Defrag reads issued   Defrag writes issued
No-Defrag               3,025            1,203           2,456            2,456                6,684                    0                     0
BSST                    2,946            1,235           2,889            1,235                5,416               12,674                   616
Guest                   3,909                0               0                0                3,909               11,538                11,538

Table 6.3: Total I/Os with BSST and guest defragmentation.

Table 6.3 shows the I/O overheads of the two defragmentation schemes compared to Tesseract without them. For this table, 3 runs with similar scores and similar numbers of guest read I/O requests were selected. With BSST VMDK defragmentation enabled, Tesseract was able to reduce the number of synchronous I/O requests to the BSST VMDK from 2,889 (2.23 reads per I/O with holes on average) to 1,235 (1 read per I/O with holes). To do BSST VMDK defragmentation, 12,674 asynchronous reads from the BSST VMDK and 616 asynchronous writes to the BSST VMDK had to be issued. This number of writes equals the number of guest write I/O requests with holes. Guest VMDK defragmentation eliminated holes in guest read I/O requests entirely, so there were no guest-related reads from the BSST VMDK. To achieve this, 11,538 asynchronous reads from the BSST VMDK and the same number of asynchronous writes to the guest VMDK were issued.

[Figure: four scatter plots of SPECjbb score versus total SPECjbb blockage time (seconds) for baseline and Tesseract at 15%, 20%, 25%, and 30% host overcommitment, with the BSST on an SSD.]

Figure 6.18: Tesseract performance with the BSST placed on an SSD disk.

6.6.6 Using SSD For Storing BSST VMDK

SSDs have dramatically better performance than magnetic disks in terms of lower latencies for random reads. However, their relatively higher cost keeps them out of the mainstream server market; they are used in smaller units for boosting performance. One potential application for SSDs in servers is as a hypervisor swap device, allowing for higher memory overcommitment as the cost of swapping is reduced.

In our experiment, we placed the BSST VMDK on a SATA SSD. Figure 6.18 shows the performance of the baseline and Tesseract. At lower memory pressure, there is no difference in performance, but as the memory pressure increases, at both the guest and hypervisor levels, Tesseract starts to show benefits over the baseline.

I/O path            Baseline   No-defrag   BSST defrag   Guest defrag
Read prepare               0          37            30            109
Read completion            0         232           247             55
Write prepare         24,262         220           256            265
Write completion           0          49            91            101

Table 6.4: Average read and write prepare/completion times in microseconds for baseline and Tesseract with and without defragmentation. Host overcom- mitment was 10%; memhog size was 60 MB.

6.6.7 Overheads

I/O Path Overhead

Table 6.4 presents Tesseract's overheads on the I/O paths. The average overhead per I/O is on the order of microseconds. Read-prepare time for guest defragmentation is higher than the others due to contention on the guest VMDK during defragmentation. At the same time, the read-completion time for the guest defragmentation case is much lower than for the other two cases, as there are no extra reads going to the BSST. On the write I/O path, the defragmentation schemes have larger overhead. This is due to the background defragmentation of the disks, which is kicked off as soon as the write I/O is scheduled.

Memory Overhead

Per Section 6.4.4, Tesseract maintains in-memory metadata for three purposes: tracking (a) associations between PPNs and BSST blocks; (b) reference counts for BSST blocks; and (c) indirections between the guest VMDK and the BSST VMDK. We use 64 bits to store a (4 KB) block number. To track associations between PPNs and BSST blocks, we re-use the MPN field in the page frames maintained by the hypervisor, so there is no extra memory overhead here. In the general case, where associations between PPNs and blocks in the guest VMDK have to be tracked, we will need a separate memory structure with a maximum overhead of 0.2% of the VM's memory size. Each BSST block's reference count requires 4 bytes per disk block. To optimize the lookup for free/available BSST blocks, a bitmap is also maintained with one bit for each block. The guest-VMDK-to-BSST-VMDK indirection metadata requires 24 bytes for each guest VMDK block for which there is a valid indirection to the BSST. A bitmap similar to that for the BSST is maintained for guest VMDK blocks to determine whether an indirection to the BSST exists for a given guest VMDK block.
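As a rough check on the 0.2% figure (an illustrative calculation, not a new measurement): storing one 64-bit (8-byte) block number per 4 KB guest page gives

    8 bytes / 4096 bytes per page ≈ 0.195% ≈ 0.2% of the VM's memory size

which, for the 700 MB VM used in the evaluation, would amount to roughly 1.4 MB of metadata.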

6.7 Related Work

This work intersects three areas. The first is that of uncooperative hypervisor swapping and the double-paging problem. The second concerns the tracking of associations between guest memory and disk state. The third concerns memory and I/O deduplication.

6.7.1 Hypervisor Swapping and Double Paging

Concurrent work by Amit et al. [5] systematically explores the behavior of uncooperative hypervisor swapping and implements an improved swap subsystem for KVM called VSwapper. The main components of their implementation are the Swap Mapper and the False Reader Preventer. The paper identifies five primary causes for performance degradation, studies each, and offers solutions to address them. The first, "silent swap writes", corresponds to our notion of the guest-I/O - swap optimization, which we do not yet support because we do not support reference-counting on blocks in guest VMDKs. The second and third, "stale swap reads" and "false swap reads", and their solutions are similar to the existing ESX optimizations that cancel swap-ins for memory pages that are either overwritten by disk I/O or by the guest. For "silent swap writes" and "stale swap reads", the Swap Mapper uses the same techniques Tesseract does to track valid associations between pages in guest memory and blocks on disk. Their solution to "false swap reads", the False Reader Preventer, is more general, however, because it supports the accumulation of successive guest writes in a temporary buffer to identify whether a page is entirely overwritten before the next read. The last two, "decayed swap sequentiality" and "false page anonymity", are not issues we consider. In their investigation, they did not observe double-paging to have much impact on performance. This is likely due to the fact that they followed guidelines from VMware and provisioned guests with enough VRAM that guest paging was uncommon, and most of the experiments were run with a persistent level of overcommitment. Tesseract allows for optimizing operations involving a guest I/O followed by another guest I/O with either the same pages or the same disk blocks. This is not possible with VSwapper. Also, VSwapper does not allow for defragmentation or disk deduplication.

The double-paging problem was first identified in the context of virtual machines running on VM/370 [46, 101]. Goldberg and Hassinger [46] discuss the impact of increased paging when the virtual machine's address space exceeds that with which it is backed. Seawright and MacKinnon [101] mention the use of handshaking between the VMM and operating system to address the issue but do not offer details.

The Cellular Disco project at Stanford describes the problem of paging in the guest and swapping in the hypervisor [48, 47]. They address this double-paging or redundant paging problem by introducing a virtual paging device in the guest. The paging device allows the hypervisor to track the paging activity of the guest and reconcile it with its own. Like our approach, the guest paging device identifies already swapped-out blocks and creates indirections to these blocks that are already persistent on disk. There is no mention of the fact that these indirections destroy expected locality and may impact subsequent guest read I/Os.

Subsequent papers on scheduling memory for virtual machines also refer in passing to the general problem. Waldspurger [128], for example, mentions the impact of double-paging and advocates random selection of pages by the hypervisor as a simple way to minimize overlap with page selection by the guest. Other projects, such as the Satori project [82], use double-paging to advocate against any mechanism to swap guest pages from the hypervisor.

Our approach differs from these efforts in several ways. First, we have a system in which we can—for the first time—measure the extent to which double-paging occurs. Second, we have an approach that directly addresses the problem of double-paging in a manner transparent to the guest. Finally, our techniques change the relationship between the two levels of scheduling: by reconciling and eliding redundant I/Os, Tesseract encourages the two schedulers to choose the same pages to be paged out.

6.7.2 Associations Between Memory and Disk State

Tracking the associations between guest memory and guest disks has been used to improve memory management and working-set estimation for vir- tual machines. The Geiger project [60], for example, uses paravirtualization and intimate knowledge of the guest disks to implement a secondary cache for guest buffer-cache pages. Lu et al. [78] implement a similar form of victim cache for the Xen hypervisor.

Park et al. [87] describe a set of techniques to speed the live migration of VMs. One of these techniques is to track associations between pages in memory and blocks on disks whose contents are shared between the source and destination machines. In cases where the contents are known to be resident on disk, the block information is sent to the destination in place of the memory contents. In the paper, the authors describe techniques for maintaining this mapping both through paravirtualization and through the use of read-only mappings for fully virtualized guests.

6.7.3 I/O and Memory Deduplication

The Satori project [82] also tracks the association between disk blocks and pages in memory. It extends the Xen hypervisor to exploit these associations, allowing it to elide repeated I/Os that read the same blocks from disk across VMs by immediately sharing these pages of memory across those guests.

Originally inspired by the Cellular Disco and Geiger projects, Tesseract shares much in common with these approaches. Like many of them, it tracks valid associations between memory pages and disk blocks that contain identical content. Like Park et al., it employs techniques that are fully transparent to the guest, allowing it to be applied in a wider set of contexts. Unlike the Satori project, which focused on eliminating redundant read operations across VMs, Tesseract uses that mapping information to deduplicate I/Os from a specific guest and its hypervisor. As such, our approach complements and extends these others.

6.8 Observations

Our experience in this project has led us to question the existing interface for issuing I/O requests with scatter-gather lists. Given that the underlying physical organization of the disk blocks can differ significantly from the virtual disk structure, it makes little sense for a scatter-gather list to require that the target blocks on disk be contiguous. A more flexible structure may allow I/Os to be expressed more succinctly and to be more effective at communicating expected relationships or locality among those disk blocks. Further, one can think of generalizing I/O scatter-gather lists, and especially virtual disks, to be simply indirection tables into a large sea of blocks. This provides a natural application surface for block indirection.
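As a rough sketch of what such a structure might look like, consider the following C declarations (the type names sg_segment, sg_request_contiguous, sg_pair, and sg_request_general are invented here for illustration and are not part of any existing I/O interface): the conventional form ties scattered memory segments to a single contiguous run of disk blocks, whereas the generalized form lets each segment name its own target block, so that a request is literally an indirection table.

#include <stdint.h>
#include <stddef.h>

/* Conventional scatter-gather request: memory may be scattered, but the
 * target disk blocks must form one contiguous range starting at start_block. */
struct sg_segment { void *addr; size_t len; };
struct sg_request_contiguous {
    uint64_t start_block;
    struct sg_segment *segs;
    size_t nsegs;
};

/* Generalized request: each memory segment names its own disk block, so one
 * I/O can address an arbitrary (possibly indirected) set of blocks, and the
 * request itself acts as an indirection table into a large sea of blocks. */
struct sg_pair { uint64_t block; void *addr; size_t len; };
struct sg_request_general {
    struct sg_pair *pairs;
    size_t npairs;
};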

CHAPTER 7

Impact for the Future

In this chapter, we discuss some of the future directions that can be pursued based on this dissertation.

7.1 Compiled Code In Scripting Languages: Fast-Slow Paradigm

For many scripting languages (Python, R, Matlab, etc.), the interpreted language was developed first, and researchers developed an efficient compiler after the fact. As a result, we often have fast compiled functions that run inside the interpreted language. The compiled code makes assumptions in order to generate efficient code. Unusual user applications may violate these assumptions, causing the compiled code to silently return an incorrect answer. So, a user must choose between reliable, interpreted (slow) code and unreliable, compiled (fast) code.

Checkpointing provides an interesting third alternative. One splits the computation into segments. For concreteness, we will give an example with ten segments, and we will assume that ten additional "checking" hosts (or ten additional CPU cores) are available to run in parallel. Initially, the compiled code is run. At the beginning of each of the ten segments, one takes a checkpoint and copies it to a different "checking" computer. That computer runs the next segment in interpreted mode. At the end of that segment, the data from the corresponding checkpoint of the compiled segment is compared with the data at the end of the interpreted segment for correctness.

At the end, either the ten "checking" hosts (or ten "checking" CPU cores) report that the computation is correct, or else they report that the computation must switch to interpreted mode for correctness at the beginning of a particular segment (after which, one can return to compiled operation as described above).
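A minimal sketch of the control flow is shown below. All of the names (NUM_SEGMENTS, checkpoint_and_copy_to_checker, run_segment_compiled, run_segment_interpreted, states_match) are hypothetical placeholders; a real implementation would take the checkpoints with DMTCP and ship them to the checking hosts, which would run the interpreted segments in parallel.

#include <stdio.h>
#include <stdbool.h>

#define NUM_SEGMENTS 10

/* Placeholder stubs; a real system would checkpoint with DMTCP, copy the
 * image to a checking host, and compare the end-of-segment states. */
static void checkpoint_and_copy_to_checker(int seg) { (void)seg; }
static void run_segment_compiled(int seg)          { (void)seg; }
static void run_segment_interpreted(int seg)       { (void)seg; }
static bool states_match(int seg)                  { (void)seg; return true; }

int main(void)
{
  /* Fast path: run every segment compiled, checkpointing at each boundary. */
  for (int seg = 0; seg < NUM_SEGMENTS; seg++) {
    checkpoint_and_copy_to_checker(seg);
    run_segment_compiled(seg);
  }
  /* Checking path (conceptually in parallel on the checking hosts): rerun each
   * segment interpreted from its checkpoint and compare the resulting states. */
  for (int seg = 0; seg < NUM_SEGMENTS; seg++) {
    run_segment_interpreted(seg);
    if (!states_match(seg)) {
      printf("Segment %d disagrees: switch to interpreted mode from here.\n", seg);
      return 1;
    }
  }
  printf("All %d segments verified.\n", NUM_SEGMENTS);
  return 0;
}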

Wester et al. [131] implemented a speculation mechanism in the operating system. It provided coordination across all applications and kernel state, while the speculation policy was left up to the applications. A similar scheme was employed using DMTCP by Ghoshal et al. [45] in an application to MPI, and by Arya and Cooperman [9] to support the Python scripting language.

7.2 Support for Hadoop-style Big Data

Hadoop [39] and Spark [40] support a map-reduce paradigm in which the size of intermediate data may increase during a "map" phase and may decrease during a "reduce" phase. Thus, the best place to checkpoint is at the end of a "reduce" phase. With the right hooks added to Hadoop (or Spark), Hadoop could be instructed by a plugin to move back-end data to longer-term storage. On restart, the plugin would use those hooks to move the longer-term storage back to active storage, and the front end would reconnect.

7.3 Cybersecurity

Section 5.8 described the ability to checkpoint a network of virtual machines using plugins [44]. This can be combined with DMTCP plugins to monitor and modify the operation of a guest virtual machine. In particular, if malware uses any external services (from gettimeofday to calling back to a controller on the Internet), this can be intercepted by a suitable DMTCP plugin, and even replayed, in order to more closely examine the malware. See Visan et al. [127] and Arya et al. [10] for examples of using record-replay through DMTCP plugins. (While some malware tries to detect if it is running inside a virtual machine, malware will often continue to run in this situation. Otherwise, virtual machines would provide a good defense against malware.)
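As a hedged sketch of such interception using the wrapper mechanism of Appendix A, the following plugin fragment logs every outbound connect attempt (it could equally block or redirect it). The header name dmtcp.h and the NEXT_FNC macro follow the appendix examples; the logging detail itself is invented for illustration.

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include "dmtcp.h"   /* provides NEXT_FNC, as in the appendix examples */

/* Wrapper that observes (and could block or redirect) every outbound
 * connection the monitored process attempts, e.g. a callback to a controller. */
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen)
{
  if (addr != NULL && addr->sa_family == AF_INET) {
    const struct sockaddr_in *in = (const struct sockaddr_in *)addr;
    char ip[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &in->sin_addr, ip, sizeof(ip));
    fprintf(stderr, "plugin: connect to %s:%d\n", ip, ntohs(in->sin_port));
  }
  return NEXT_FNC(connect)(sockfd, addr, addrlen);
}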

7.4 Algorithmic Debugging

Algorithmic debugging [102, 13, 94, 83, 84, 79] is a well-developed technique that was especially explored in the 1990s. Roughly, the idea is that an algorithmic debugger keeps a trace of the computation, and shows the user the input and output of various subprocedures. Through a series of questions and answers (similar to the game of 20 questions), the software determines which low-level subprocedure caused the bug. This tended to be used in functional languages and declarative languages such as Prolog, because of the ease of capturing the input and output of a subprocedure.

The use of checkpoints allows one to apply this same technique to mainstream languages including C/C++, Python, and others. Instead of encapsulating a small input and output, a traditional debugger (e.g., GDB, Python pdb) would be used to allow the programmer to fully explore the global state at the beginning and end of the subprocedure. In case of a failed step, checkpoint-restart would allow us to restart from the last valid step instead of rerunning the program from the beginning.

7.5 Reversible Debugging

Reversible debugging or time-travelling debuggers have a long history [19, 38, 64, 72]. Checkpointing provides an obvious approach in this area. Some parts of this approach have already been developed within the context of DMTCP (decomposing debugging histories for replay [127] and reverse expression watchpoints [10]).

7.6 Android-Based Mobile Computing

Huang and Cheng have already demonstrated the use of DMTCP to checkpoint processes under Android [53]. This provides the potential for truly pervasive mobile apps, which can checkpoint themselves and migrate themselves to other platforms. This can provide greater software sustainability (software engineering) by saving the entire mobile app, instead of the current practice of saving the state of an app and re-loading the state whenever the app is re-launched.

7.7 Cloud Computing

Cloud computing provides on-demand self-service and rapid elasticity of resources for applications. These characteristics are similar to those of the old-style mainframes from the 1960s through the 1980s. However, to make the analogy complete, we need a scheduler for the Cloud. This scheduler must support parallel applications in addition to single-process applications. A scheduler for the Cloud can use DMTCP to suspend or migrate jobs. The capabilities of DMTCP contributing to this goal include providing checkpoint support for: virtual machines [44], Intel Xeon Phi [12, 2], InfiniBand [27], MPI, and 3D graphics (for visualization) [62].

CHAPTER 8

Conclusion

Virtualization in the context of a single system is well understood, but it is more difficult in the context of multiple systems. This dissertation presented solutions to two long-standing problems related to virtualization. A number of future directions were presented to apply the results of this dissertation, both in the context of checkpoint-restart and in that of virtual machines.

Closed-World Assumption

This dissertation presented a framework for transparent checkpointing of application processes that do not obey the closed world assumption. A process virtualization approach was presented to decouple the application processes from the external subsystems. This was achieved by introducing a thin virtualization layer between the application and the external subsystem that provided the application with a consistent view of the external subsystem across checkpoint and restart. An adaptive plugin based architecture was presented to allow the checkpointing system to grow organically with each new external subsystem. The third-party plugins, developed to provide seven novel checkpointing solutions, demonstrated the success of the plugin-based process virtualization approach.


Double-Paging Problem

This work presented Tesseract, a system that directly and transparently (without any modifications to the guest operating system) addressed the double-paging problem. It reconciled and eliminated redundant I/O activity between the guest's virtual disks and the hypervisor swap subsystem by tracking associations between the contents of the pages in guest memory and those on disk.

Finding an Application Surface

In the first body of work, the application surface was always chosen close to the application process. The concept of an application surface close to a stable API served as a guide in discovering a virtualization strategy in situations where no previous virtualization strategy existed. The pid plugin is an example of a minimal application surface at the POSIX API layer, whereas the SSH plugin provided an application surface at the level of the SSH protocol.

In the second body of work, there were several possibilities for choosing an application surface, including the guest operating system, paravirtualized guest devices, virtual devices in the hypervisor, the virtual disk interface, or the host kernel. We chose the application surface at the virtual disk device interface as it provides a clear separation between the hypervisor and the virtual disks. This application surface included the entire guest virtual machine: operating system, devices, etc. However, being at the virtual disk device layer allowed us to provide block indirection without requiring any knowledge of the guest internals (virtual address space, file system, etc.) and without requiring any modifications to the host operating system.

APPENDIX A

Plugin Tutorial

A.1 Introduction

Plugins enable one to modify the behavior of DMTCP. Two of the most common uses of plugins are:

1. to execute an additional action at the time of checkpoint, resume, or restart.

2. to add a wrapper function around a call to a library function (including wrappers around system calls).

Plugins are used for a variety of purposes. The DMTCP_ROOT/contrib directory contains packages that users and developers have contributed to be optionally loaded into DMTCP. Plugin code is expressive, while requiring only a modest number of lines of code. The plugins in the contrib directory vary in size from 400 lines to 3000 lines of code.

Beginning with DMTCP version 2.0, much of DMTCP itself is also now a plugin. In this new design, the core DMTCP code is responsible primarily for copying all of user-space memory to a checkpoint image file. The remaining functions of DMTCP are handled by plugins, found in DMTCP_ROOT/plugin. Each plugin abstracts the essentials of a different subsystem of the operating system and modifies its behavior to accommodate checkpoint and restart. Some of the subsystems for which plugins have been written are: virtualization of process and thread ids; files (open, close, dup, fopen, fclose, mmap, pty); events (eventfd, epoll, poll, inotify, signalfd); System V IPC constructs (shmget, semget, msgget); TCP/IP sockets (socket, connect, bind, listen, accept); and timers (timer_create, clock_gettime). (The indicated system calls are examples only and not all-inclusive.)

A.2 Anatomy of a plugin

A plugin modifies the behavior of either DMTCP or a target application, through three primary mechanisms, plus virtualization of ids.

Wrapper functions: One declares a wrapper function with the same name as an existing library function (including system calls in the run-time library). The wrapper function can execute some prolog code, pass control to the “real” function, and then execute some epilog code. Several plugins can wrap the same function in a nested manner. One can also omit passing control to the “real” function, in order to shadow that function with an alternate behavior.

Events: It is frequently useful to execute additional code at the time of checkpoint, or resume, or restart. Plugins provide hook functions to be called during these three events and numerous other important events in the life of a process.

Coordinated checkpoint of distributed processes: DMTCP transparently checkpoints distributed computations across many nodes. At the time of checkpoint or restart, it may be necessary to coordinate information among the distributed processes. For example, at restart time, an internal plugin of DMTCP allows the newly re-created processes to “talk” to their peers to discover the new network addresses of their peers. This is important since a distributed computation may be restarted on a different cluster than its original one.

Virtualization of ids: Ids (process id, timer id, System V IPC id, etc.) are assigned by the kernel, by a peer process, and by remote processes. Upon restart, the external agent may wish to assign a different id than the one assigned prior to checkpoint. Techniques for virtualization of ids are described in Appendix A.3.2.

A.3 Writing Plugins

A.3.1 Invoking a plugin

Plugins are just dynamic run-time libraries (.so files):

gcc -shared -fPIC -IDMTCP_ROOT/include -o PLUGIN1.so PLUGIN1.c

They are invoked at the beginning of a DMTCP computation as command-line options:

dmtcp_launch --with-plugin PLUGIN1.so:PLUGIN2.so myapp

Note that one can invoke multiple plugins as a colon-separated list. One should either specify a full path for each plugin (each .so library), or else define LD_LIBRARY_PATH to include your own plugin directory.

A.3.2 The plugin mechanisms

The mechanisms of plugins are most easily described through examples. This tutorial will rely on the examples in DMTCP_ROOT/test/plugin. To get a feeling for the plugins, one can "cd" into each of the subdirectories and execute "make check".

Plugin events

For context, please scan the code of plugin/example/example.c. Executing "make check" will demonstrate the intended behavior. Plugin events are handled by including the function dmtcp_event_hook. When a DMTCP plugin event occurs, DMTCP will call the function dmtcp_event_hook for each plugin. This function is required only if the plugin will handle plugin events. See Section A.5 for further details.

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  switch (event) {
  case DMTCP_EVENT_WRITE_CKPT:
    printf("\n*** The plugin is being called before checkpointing. ***\n");
    break;
  case DMTCP_EVENT_RESUME:
    printf("*** Resume: the plugin has now been checkpointed. ***\n");
    break;
  case DMTCP_EVENT_RESTART:
    printf("*** The plugin is now being restarted. ***\n");
    break;
  ...
  default:
    break;
  }
  DMTCP_NEXT_EVENT_HOOK(event, data);
}


Plugin wrapper functions

In its simplest form, a wrapper function can be written as follows:

unsigned int sleep(unsigned int seconds)
{
  /* print_time() is a helper defined elsewhere in the example file. */
  struct timeval oldtv, tv;
  gettimeofday(&oldtv, NULL);
  printf("sleep1: "); print_time(); printf(" ... ");
  unsigned int result = NEXT_FNC(sleep)(seconds);
  gettimeofday(&tv, NULL);
  printf("Time elapsed: %f\n",
         (1e6 * (tv.tv_sec - oldtv.tv_sec)
          + 1.0 * (tv.tv_usec - oldtv.tv_usec)) / 1e6);
  print_time(); printf("\n");
  return result;
}

In the above example, we could also shadow the standard sleep function with our own implementation, if we omit the call to NEXT_FNC. To see a related example, try:

cd DMTCP_ROOT/test/plugin/sleep1; make check
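As a minimal illustration of shadowing (this is not one of the shipped examples), a plugin could make every call to sleep return immediately by never invoking the real function:

/* Shadow version of sleep: the real sleep() is never called. */
unsigned int sleep(unsigned int seconds)
{
  (void)seconds;   /* ignore the requested interval */
  return 0;        /* 0 means the full interval "elapsed" */
}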

Wrapper functions from distinct plugins can be nested. For a nesting of plugin sleep2 around sleep1, do:

cd DMTCP_ROOT/test/plugin
make; cd sleep2; make check

If one adds a wrapper around a function from a library other than libc.so (e.g., libglx.so), it is best to dynamically link to that additional library:

gcc ... -o PLUGIN1.so PLUGIN1.c -lglx

Plugin coordination among multiple or distributed processes

It is often the case that an external agent will assign a particular initial id to your process, but later assign a different id on restart. Each process must re-discover its peers at restart time, without knowing the pre-checkpoint ids.

DMTCP provides a “Publish/Subscribe” feature to enable communication among peer processes. Two plugin events allow user plugins to discover peers and pass information among peers. The two events are DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA and DMTCP_EVENT_SEND_QUERIES. DMTCP guarantees to provide a global barrier between the two events. An example of how to use the Publish/Subscribe feature is contained in DMTCP_ROOT/test/plugin/example-db. The explanation below is best understood in conjunction with reading that example.

A plugin processing DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA should invoke:

int dmtcp_send_key_val_pair_to_coordinator(const void *key, size_t key_len,
                                           const void *val, size_t val_len);

A plugin processing DMTCP_EVENT_SEND_QUERIES should invoke:

int dmtcp_send_query_to_coordinator(const void *key, size_t key_len,
                                    void *val, size_t *val_len);
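The sketch below shows how a plugin might handle these two events, loosely following the pattern of the example-db plugin. The choice of key and value (the pre-checkpoint pid mapped to the current pid) and the variable names are illustrative assumptions, not part of the DMTCP API; the two coordinator functions are used with the signatures given above.

#include <unistd.h>
#include <stdint.h>
#include "dmtcp.h"

static int32_t my_orig_pid;    /* assumed to be recorded before checkpoint */
static int32_t peer_orig_pid;  /* pre-checkpoint pid of a peer we must find again */

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  /* Let the internal DMTCP plugins run first (recommended for these events). */
  DMTCP_NEXT_EVENT_HOOK(event, data);

  switch (event) {
  case DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA: {
    /* Publish: "my pre-checkpoint pid now corresponds to this new pid". */
    int32_t new_pid = (int32_t)getpid();
    dmtcp_send_key_val_pair_to_coordinator(&my_orig_pid, sizeof(my_orig_pid),
                                           &new_pid, sizeof(new_pid));
    break;
  }
  case DMTCP_EVENT_SEND_QUERIES: {
    /* Subscribe: after the global barrier, look up the peer's new pid. */
    int32_t peer_new_pid = 0;
    size_t len = sizeof(peer_new_pid);
    dmtcp_send_query_to_coordinator(&peer_orig_pid, sizeof(peer_orig_pid),
                                    &peer_new_pid, &len);
    break;
  }
  default:
    break;
  }
}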

Using plugins to virtualize ids and other names

Often an id or name will change between checkpoint and restart. For example, on restart, the real pid of a process will change from its pid prior to checkpoint. Some DMTCP internal plugins maintain a translation table in order to translate between a virtualized id passed to the user code and a real id maintained inside the kernel. The utility to maintain this translation table can also be used within third-party plugins. For an example of adding virtualization to a plugin, see the plugin in plugin/ipc/timer.

In some less common cases, it can happen that a virtualized id is passed to a library function by the target application. Yet, that same library function may be passed a real id by a second function from within the same library. In these cases, it is the responsibility of the plugin implementor to choose a scheme that allows the first library function to distinguish whether its argument is a virtual id (passed from the target application) or a real id (passed from within the same library).
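A minimal sketch of such a translation table, as a third-party plugin might maintain one for pids, is shown below (the fixed-size array and the helper names add_mapping and virt_to_real are invented for illustration; DMTCP's internal virtualization utility is more general):

#include <sys/types.h>
#include <stddef.h>

/* Virtual-to-real pid translation table. The virtual pid is what the
 * application saw before checkpoint; the real pid is whatever the kernel
 * assigns after restart. */
struct pid_mapping { pid_t virt; pid_t real; };

#define MAX_MAPPINGS 1024
static struct pid_mapping table[MAX_MAPPINGS];
static size_t num_mappings;

static void add_mapping(pid_t virt, pid_t real)
{
  if (num_mappings < MAX_MAPPINGS)
    table[num_mappings++] = (struct pid_mapping){ virt, real };
}

/* Translate a virtual pid to the current real pid; wrappers around calls
 * such as kill() or waitpid() would apply this before entering the kernel. */
static pid_t virt_to_real(pid_t virt)
{
  for (size_t i = 0; i < num_mappings; i++)
    if (table[i].virt == virt)
      return table[i].real;
  return virt;   /* unknown ids pass through unchanged */
}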

A.4 Application-Initiated Checkpoints

Application-initiated checkpoints are even simpler than full-featured plugins. In the simplest form, the following code can be executed both with dmtcp_launch and without:

#include <stdio.h>
#include "dmtcp.h"

int main()
{
  if (dmtcpCheckpoint() == DMTCP_NOT_PRESENT) {
    printf("dmtcpCheckpoint: DMTCP not present. No checkpoint is taken.\n");
  }
  return 0;
}

For this program to be aware of DMTCP, it must be compiled with -fPIC and -ldl:

gcc -fPIC -IDMTCP_ROOT/include -o myapp myapp.c -ldl

The most useful functions are:

int dmtcpIsEnabled() — returns 1 when running with DMTCP; 0 otherwise.

int dmtcpCheckpoint() — returns DMTCP_AFTER_CHECKPOINT, DMTCP_AFTER_RESTART, or DMTCP_NOT_PRESENT.

int dmtcpDelayCheckpointsLock() — DMTCP will block any checkpoint requests.

int dmtcpDelayCheckpointsUnlock() — DMTCP will execute any blocked checkpoint requests, and will permit new checkpoint requests.

The last two functions follow the common pattern of returning 0 on success and DMTCP_NOT_PRESENT if DMTCP is not present.
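For example, an application might use the last two functions to bracket a critical section that should not be split by a checkpoint (the function update_shared_state is a placeholder for application code):

#include <stdio.h>
#include "dmtcp.h"

void update_shared_state(void)   /* placeholder for application work */
{
  /* ... */
}

int main()
{
  if (dmtcpIsEnabled())
    dmtcpDelayCheckpointsLock();     /* no checkpoint can begin past this point */

  update_shared_state();             /* work that must not be split by a checkpoint */

  if (dmtcpIsEnabled())
    dmtcpDelayCheckpointsUnlock();   /* pending checkpoint requests may now proceed */
  return 0;
}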

A.5 Plugin Manual

A.5.1 Plugin events

dmtcp_event_hook

In order to handle DMTCP plugin events, a plugin must define an entry point, dmtcp_event_hook.

NAME
dmtcp_event_hook - Handle plugin events for this plugin

SYNOPSIS
#include "dmtcp/plugin.h"

void dmtcp_event_hook(DmtcpEvent_t event,

DmtcpEventData_t *data)

DESCRIPTION

When a plugin event occurs, DMTCP will look for the symbol dmtcp_event_hook in each plugin library. If the symbol is found, that function will be called for the given plugin library. DMTCP guarantees only to invoke the first such plugin library found in library search order. Occurrences of dmtcp_event_hook in later plugin libraries will be called only if each previous function had invoked DMTCP_NEXT_EVENT_HOOK. The argument <event> will be bound to the event being declared by DMTCP. The argument <data> is required only for certain events. See the following section, "Plugin Events", for a list of all events.

SEE ALSO
DMTCP_NEXT_EVENT_HOOK


DMTCP_NEXT_EVENT_HOOK

A typical definition of dmtcp_event_hook will invoke the hook in the next plugin via DMTCP_NEXT_EVENT_HOOK.

NAME
DMTCP_NEXT_EVENT_HOOK - Call dmtcp_event_hook in the next plugin library

SYNOPSIS
#include "dmtcp/plugin.h"

void DMTCP_NEXT_EVENT_HOOK(event, data)

DESCRIPTION

This function must be invoked from within the plugin library function dmtcp_event_hook. The arguments <event> and <data> should normally be the same arguments passed to dmtcp_event_hook.

DMTCP_NEXT_EVENT_HOOK may be called zero times or once. If invoked zero times, no further plugin libraries will be called to handle events. The behavior is undefined if DMTCP_NEXT_EVENT_HOOK is invoked more than once. The typical usage of this function is to create a wrapper around the handling of the same event by later plugins.

SEE ALSO
dmtcp_event_hook


Event Names

The rest of this section defines plugin events. The complete list of plugin events is always contained in DMTCP_ROOT/include/plugin.h. DMTCP guarantees to call the dmtcp_event_hook function of the plugin when the specified event occurs.

Plugin events that pass significant data through the data parameter are marked with an asterisk (∗). Most plugin events do not pass data through the data parameter. Note that the events REGISTER_NAME_SERVICE_DATA, SEND_QUERIES, RESTART, RESUME, and REFILL should all be processed after the call to DMTCP_NEXT_EVENT_HOOK() in order to guarantee that the internal DMTCP plugins have first restored full functionality.

Checkpoint-Restart

DMTCP_EVENT_WRITE_CKPT — Invoked at final barrier before writing checkpoint

DMTCP_EVENT_RESTART — Invoked at first barrier during restart of new process

DMTCP_EVENT_RESUME — Invoked at first barrier during resume fol- lowing checkpoint

Coordination of Multiple or Distributed Processes during Restart (see Appendix A.5.2)

DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA∗ restart/resume

DMTCP_EVENT_SEND_QUERIES∗ restart/resume

WARNING: EXPERTS ONLY FOR REMAINING EVENTS

Init/Fork/Exec/Exit

DMTCP_EVENT_INIT — Invoked before main (in both the original program and any new program called via exec)

DMTCP_EVENT_EXIT — Invoked on call to exit/_exit/_Exit, or on return from main

DMTCP_EVENT_PRE_EXEC — Invoked prior to call to exec

DMTCP_EVENT_POST_EXEC — Invoked before DMTCP_EVENT_INIT in new program

DMTCP_EVENT_ATFORK_PREPARE — Invoked before fork (see POSIX pthread_atfork)

DMTCP_EVENT_ATFORK_PARENT — Invoked after fork by parent (see POSIX pthread_atfork)

DMTCP_EVENT_ATFORK_CHILD — Invoked after fork by child (see POSIX pthread_atfork)

Barriers (finer-grained control during checkpoint-restart)

DMTCP_EVENT_WAIT_FOR_SUSPEND_MSG — Invoked at barrier during coordinated checkpoint

DMTCP_EVENT_SUSPENDED — Invoked at barrier during coordinated checkpoint

DMTCP_EVENT_LEADER_ELECTION — Invoked at barrier during coordinated checkpoint

DMTCP_EVENT_DRAIN — Invoked at barrier during coordinated checkpoint

DMTCP_EVENT_REFILL — Invoked at first barrier during resume/restart of new process

Threads

DMTCP_EVENT_THREADS_SUSPEND — Invoked within checkpoint thread when all user threads have been suspended

DMTCP_EVENT_THREADS_RESUME — Invoked within checkpoint thread before any user threads are resumed.

For debugging, consider calling the following code for this event: static int x = 1; while(x);

DMTCP_EVENT_PRE_SUSPEND_USER_THREAD — Each user thread invokes this prior to being suspended for a checkpoint

DMTCP_EVENT_RESUME_USER_THREAD — Each user thread invokes this immediately after a resume or restart (isRestart() available to plugin)

DMTCP_EVENT_THREAD_START — Invoked before start function given by clone

DMTCP_EVENT_THREAD_CREATED — Invoked within parent thread when clone call returns (like parent for fork)

DMTCP_EVENT_PTHREAD_START — Invoked before start function given by pthread_create

DMTCP_EVENT_PTHREAD_EXIT — Invoked before call to pthread_exit

DMTCP_EVENT_PTHREAD_RETURN — Invoked in child thread when thread start function of pthread_create returns

A.5.2 Publish/Subscribe

Appendix A.3.2 provides an explanation of the Publish/Subscribe feature for coordination among peer processes at resume or restart time. An example of how to use the Publish/Subscribe feature is contained in DMTCP_ROOT/test/plugin/example-db. The primary events and functions used in this feature are:

DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA
int dmtcp_send_key_val_pair_to_coordinator(const void *key, size_t key_len,
                                           const void *val, size_t val_len)

DMTCP_EVENT_SEND_QUERIES
int dmtcp_send_query_to_coordinator(const void *key, size_t key_len,
                                    void *val, size_t *val_len)

A.5.3 Wrapper functions

For a description of including wrapper functions in a plugin, see Appendix A.3.2.

A.5.4 Miscellaneous utility functions

Numerous DMTCP utility functions are provided that can be called from within dmtcp_event_hook(). The utility functions are still under active development, and may change in small ways. Some of the more commonly used utility functions follow. Functions that return “char *” will not allocate memory, but instead will return a pointer to a canonical string, which should not be changed.

void dmtcp_get_local_ip_addr(struct in_addr *in);
const char* dmtcp_get_tmpdir();          /* given by --tmpdir, or DMTCP_TMPDIR, or TMPDIR */
const char* dmtcp_get_ckpt_dir();        /* given by --ckptdir, or DMTCP_CHECKPOINT_DIR,
                                            or curr dir at ckpt time */
const char* dmtcp_get_ckpt_files_subdir();
int dmtcp_get_ckpt_signal();             /* given by --mtcp-checkpoint-signal */
const char* dmtcp_get_uniquepid_str();
const char* dmtcp_get_computation_id_str();
uint64_t dmtcp_get_coordinator_timestamp();
uint32_t dmtcp_get_generation();         /* number of ckpt/restart sequences encountered */
const char* dmtcp_get_executable_path();
int dmtcp_get_restart_env(char *name, char *value, int maxvaluelen);
    /* For 'name' in environment, copy its value into 'value' param, but with
     * at most length 'maxvaluelen'.
     * Return 0 for success, and return code for various errors.
     * See contrib/modify-env for an example of its use. */
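For instance, a plugin's restart handling might re-read an environment variable whose value is expected to differ on the new cluster, using dmtcp_get_restart_env from the list above (the variable name MY_SERVER_HOST and the helper refresh_server_host are hypothetical):

#include <stdio.h>
#include "dmtcp.h"

/* Intended to be called from dmtcp_event_hook on DMTCP_EVENT_RESTART
 * (after DMTCP_NEXT_EVENT_HOOK, as recommended earlier). */
static void refresh_server_host(void)
{
  char name[] = "MY_SERVER_HOST";
  char value[256];
  if (dmtcp_get_restart_env(name, value, (int)sizeof(value)) == 0) {
    printf("Restarted: reconnecting to %s\n", value);
  } else {
    printf("MY_SERVER_HOST not found in the restart environment.\n");
  }
}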


Bibliography

[1] Hazim Abdel-Shafi, Evan Speight, and John K. Bennett. Efficient user- level thread migration and checkpointing on windows NT clusters. In Proceedings of the 3rd Conference on USENIX Windows NT Sympo- sium - Volume 3, WINSYM’99, page 1–1, Berkeley, CA, USA, 1999. USENIX Association. URL http://dl.acm.org/citation.cfm? id=1268427.1268428. (Cited on page 15.)

[2] David Abdurachmanov, Kapil Arya, Josh Bendavid, Tommaso Boc- cali, Gene Cooperman, Andrea Dotti, Peter Elmer, Giulio Eu- lisse, Francesco Giacomini, Christopher D. Jones, Matteo Man- zali, and Shahzad Muzaffar. Explorations of the viability of ARM and xeon phi for physics processing. Journal of Physics: Confer- ence Series, 513(5):052008, June 2014. ISSN 1742-6596. doi: 10.1088/1742-6596/513/5/052008. URL http://iopscience. iop.org/1742-6596/513/5/052008. (Cited on page 136.)

[3] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Mor- eira. Adaptive incremental checkpointing for massively parallel sys- tems. In Proceedings of the 18th Annual International Conference on Supercomputing, ICS ’04, page 277–286, New York, NY, USA, 2004. ACM. ISBN 1-58113-839-3. doi: 10.1145/1006209.1006248. URL http://doi.acm.org/10.1145/1006209.1006248. (Cited on page 15.)


[4] Ole Agesen. System and method for maintaining memory page shar- ing in a virtual environment, February 2013. U.S. Classification 711/147, 711/152, 711/E12.102, 717/148; International Classifica- tion G06F12/08, G06F9/455, G06F7/04; Cooperative Classification G06F12/08, G06F9/544, G06F9/45537. (Cited on page 98.)

[5] Nadav Amit, Dan Tsafrir, and Assaf Schuster. VSwapper: a mem- ory swapper for virtualized environments. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, page 349–366, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2305-5. doi: 10. 1145/2541940.2541969. URL http://doi.acm.org/10.1145/ 2541940.2541969. (Cited on pages 94 and 128.)

[6] Glenn Ammons, Jonathan Appavoo, Maria Butrico, Dilma Da Silva, David Grove, Kiyokuni Kawachiya, Orran Krieger, Bryan Rosenburg, Eric Van Hensbergen, and Robert W. Wisniewski. Libra: A library operating system for a jvm in a virtualized execution environment. In Proceedings of the 3rd International Conference on Virtual Execution Environments, VEE ’07, page 44–54, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-630-1. doi: 10.1145/1254810.1254817. URL http://doi.acm.org/10.1145/1254810.1254817. (Cited on page 24.)

[7] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: transparent checkpointing for cluster computations and the desktop. In IEEE In- ternational Symposium on Parallel Distributed Processing, 2009. IPDPS 2009, pages 1–12, May 2009. doi: 10.1109/IPDPS.2009.5161063. (Cited on pages 20, 25, and 58.)

[8] Linux Kernel Mailing List (LKML) Archives. [LKML] checkpoint- restart: naked patch serialization, March 2014. URL http://lkml. BIBLIOGRAPHY 157

iu.edu/hypermail/linux/kernel/1011.0/00770.html. (Cited on page 17.)

[9] Kapil Arya and Gene Cooperman. DMTCP: bringing checkpoint- restart to python. In Proceedings of the 12th Python in Science Con- ference, pages 2–7, 2013. URL http://conference.scipy.org/ proceedings/scipy2013/arya.html. (Cited on page 134.)

[10] Kapil Arya, Tyler Denniston, Ana-Maria Visan, and Gene Cooper- man. Semi-automated debugging via binary search through a pro- cess lifetime. In Proceedings of the Seventh Workshop on Program- ming Languages and Operating Systems, PLOS ’13, page 9:1–9:7, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2460-1. doi: 10. 1145/2525528.2525533. URL http://doi.acm.org/10.1145/ 2525528.2525533. (Cited on pages 135 and 136.)

[11] Kapil Arya, Yury Baskakov, and Alex Garthwaite. Tesseract: Rec- onciling guest I/O and hypervisor swapping in a VM. In Pro- ceedings of the 10th ACM SIGPLAN/SIGOPS International Confer- ence on Virtual Execution Environments, VEE ’14, page 15–28, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2764-0. doi: 10. 1145/2576195.2576198. URL http://doi.acm.org/10.1145/ 2576195.2576198. (Cited on page9.)

[12] Kapil Arya, Gene Cooperman, Andrea Dotti, and Peter Elmer. Use of checkpoint-restart for complex HEP software on tradi- tional architectures and intel MIC. Journal of Physics: Confer- ence Series, 523(1):012015, June 2014. ISSN 1742-6596. doi: 10.1088/1742-6596/523/1/012015. URL http://iopscience. iop.org/1742-6596/523/1/012015. (Cited on page 136.)

[13] Evyatar Av-Ron. Top-Down Diagnosis of Prolog Programs. PhD thesis, Weizmann Institute, 1984. (Cited on page 135.)

[14] Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Jayesh Krishna, Ewing Lusk, and Rajeev Thakur. PMI: a scalable parallel process-management interface for extreme-scale systems. In Proceed- ings of the 17th European MPI Users’ Group Meeting Conference on Recent Advances in the Message Passing Interface, EuroMPI’10, page 31–41, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-15645- 2, 978-3-642-15645-8. URL http://dl.acm.org/citation. cfm?id=1894122.1894127. (Cited on page 83.)

[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Har- ris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, page 164–177, New York, NY, USA, 2003. ACM. ISBN 1-58113-757-5. doi: 10.1145/945445.945462. URL http://doi.acm.org/10.1145/ 945445.945462. (Cited on page 24.)

[16] Tarick Bedeir. Building an RDMA-Capable application with IB verbs. Technical report, http://www.hpcadvisorycouncil.com/, Au- gust 2010. http://www.hpcadvisorycouncil.com/pdf/building-an- rdma-capable- application-with-ib-verbs.pdf. (Cited on page 35.)

[17] Adam Beguelin, Erik Seligman, and Peter Stephan. Applica- tion level fault tolerance in heterogeneous networks of work- stations. Journal of Parallel and Distributed Computing, 43(2): 147–155, June 1997. ISSN 0743-7315. doi: 10.1006/jpdc. 1997.1338. URL http://www.sciencedirect.com/science/ article/pii/S0743731597913381. (Cited on page 18.)

[18] Bernard Blackham. Cryopid, 2012. URL http://cryopid.berlios.de/index.html. (Cited on page 19.)

[19] Bob Boothe. Efficient algorithms for bidirectional debugging. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, page 299–310, New York, NY, USA, 2000. ACM. ISBN 1-58113-199-2. doi: 10.1145/ 349299.349339. URL http://doi.acm.org/10.1145/349299. 349339. (Cited on page 136.)

[20] Dan Bornstein. Dalvik VM internals. In Google I/O Developer Confer- ence, volume 23, page 17–30, 2008. (Cited on page 22.)

[21] George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, and Anton Se- likhov. MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In Proceedings of the 2002 ACM/IEEE Conference on Super- computing, SC ’02, page 1–18, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press. URL http://dl.acm.org/citation. cfm?id=762761.762815. (Cited on page 18.)

[22] Aurélien Bouteiller, Thomas Herault, Géraud Krawezik, Pierre Lemarinier, and Franck Cappello. MPICH-V project: A multipro- tocol automatic fault-tolerant MPI. International Journal of High Performance Computing Applications, 20(3):319–333, 2006. doi: 10.1177/1094342006067469. URL http://hpc.sagepub.com/ content/20/3/319.abstract. (Cited on page 19.)

[23] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. Automated application-level checkpointing of MPI pro- grams. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’03, page 84–94, New York, NY, USA, 2003. ACM. ISBN 1-58113-588-2. doi: 160 BIBLIOGRAPHY

10.1145/781498.781513. URL http://doi.acm.org/10.1145/ 781498.781513. (Cited on pages 15 and 19.)

[24] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, and Martin Schulz. Application-level checkpointing for shared mem- ory programs. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI, page 235–247, New York, NY, USA, 2004. ACM. ISBN 1-58113-804-0. doi: 10.1145/1024393.1024421. URL http://doi.acm.org/10.1145/1024393.1024421. (Cited on page 15.)

[25] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Radu Rugina, and Sally A. McKee. Compiler-enhanced incremental checkpointing for OpenMP applications. In Proc. of IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS), pages 1–12, May 2009. doi: 10.1109/IPDPS.2009.5160999. (Cited on page 15.)

[26] Guohong Cao and M. Singhal. On coordinated checkpointing in dis- tributed systems. IEEE Transactions on Parallel and Distributed Sys- tems, 9(12):1213–1225, December 1998. ISSN 1045-9219. doi: 10.1109/71.737697. (Cited on page 22.)

[27] Jiajun Cao, Gregory Kerr, Kapil Arya, and Gene Cooperman. Trans- parent checkpoint-restart over InfiniBand. In ACM 23rd Int. Symp. on High Performance Parallel and Distributed Computing (HPDC), 2014. (to appear). (Cited on pages9, 31, 71, 89, 90, and 136.)

[28] K. Mani Chandy and Leslie Lamport. Distributed snapshots: De- termining global states of distributed systems. ACM Trans. Com- put. Syst., 3(1):63–75, February 1985. ISSN 0734-2071. doi: 10.1145/214451.214456. URL http://doi.acm.org/10.1145/ 214451.214456. (Cited on page 29.) BIBLIOGRAPHY 161

[29] P. Emerald Chung, Woei-Jyh Lee, Yennun Huang, Deron Liang, and Chung-Yih Wang. Winckp: A transparent checkpointing and rollback recovery tool for windows NT applications. In Proc. of 29th Annual International Symposium on Fault-Tolerant Computing, page 220–223, 1999. doi: 10.1109/FTCS.1999.781053. (Cited on page 15.)

[30] Gene Cooperman, Jason Ansel, and Xiaoqin Ma. Adaptive check- pointing for master-worker style parallelism (extended abstract). In Proc. of 2005 IEEE Computer Society International Conference on Clus- ter Computing. IEEE Press, 2005. conference proceedings on CD. (Cited on page 25.)

[31] Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In Proceedings of the 2006 ACM/IEEE Conference on Supercom- puting, SC ’06, New York, NY, USA, 2006. ACM. ISBN 0-7695-2700-0. doi: 10.1145/1188455.1188587. URL http://doi.acm.org/10. 1145/1188455.1188587. (Cited on pages 18 and 22.)

[32] Timothy Cramer, Richard Friedman, Terrence Miller, David Seberger, Robert Wilson, and Mario Wolczko. Compiling java just in time. IEEE Micro, 17(3):36–43, May 1997. ISSN 0272-1732. doi: 10.1109/ 40.591653. URL http://dx.doi.org/10.1109/40.591653. (Cited on page 22.)

[33] William R. Dieter and James E. Lumpp,Jr. User-level checkpointing for LinuxThreads programs. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, page 81–92, Berkeley, CA, USA, 2001. USENIX Association. ISBN 1-880446-10-3. URL http: //dl.acm.org/citation.cfm?id=647054.715766. (Cited on page 15.) 162 BIBLIOGRAPHY

[34] Fred Douglis and John Ousterhout. Transparent process migration: Design alternatives and the sprite implementation. Software: Practice and Experience, 21(8):757–785, August 1991. ISSN 1097-024X. doi: 10.1002/spe.4380210802. URL http://onlinelibrary. wiley.com/doi/10.1002/spe.4380210802/abstract. (Cited on page 13.)

[35] Ifeanyi P. Egwutuoha, David Levy, Bran Selic, and Shiping Chen. A survey of fault tolerance mechanisms and check- point/restart implementations for high performance computing systems. The Journal of Supercomputing, 65(3):1302–1326, September 2013. ISSN 0920-8542, 1573-0484. doi: 10. 1007/s11227-013-0884-0. URL http://link.springer.com/ article/10.1007/s11227-013-0884-0. (Cited on page 13.)

[36] David Ehringer. The dalvik virtual machine architecture. Technical report, 2010. (Cited on page 22.)

[37] Dawson R. Engler, M. Frans Kaashoek, and J.ames O’Toole,Jr. Ex- okernel: An operating system architecture for application-level re- source management. In Proceedings of the Fifteenth ACM Sympo- sium on Operating Systems Principles, SOSP ’95, page 251–266, New York, NY, USA, 1995. ACM. ISBN 0-89791-715-4. doi: 10.1145/ 224056.224076. URL http://doi.acm.org/10.1145/224056. 224076. (Cited on page 24.)

[38] Stuart I. Feldman and Channing B. Brown. IGOR: a system for program debugging via reversible execution. In Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, PADD ’88, page 112–123, New York, NY, USA, 1988. ACM. ISBN 0-89791-296-9. doi: 10.1145/68210.69226. URL http: //doi.acm.org/10.1145/68210.69226. (Cited on page 136.) BIBLIOGRAPHY 163

[39] Apache Software Foundation. Apache hadoop, March 2014. URL http://hadoop.apache.org/. (Cited on page 134.)

[40] Apache Software Foundation. Apache spark — lightning-fast clus- ter computing, March 2014. URL http://spark.incubator. apache.org/. (Cited on page 134.)

[41] Qi Gao, Weikuan Yu, Wei Huang, and D.K. Panda. Application- transparent Checkpoint/Restart for MPI programs over InfiniBand. In International Conference on Parallel Processing, 2006. ICPP 2006, pages 471–478, August 2006. doi: 10.1109/ICPP.2006.26. (Cited on page 19.)

[42] Tal Garfinkel. Traps and pitfalls: Practical problems in system call interposition based security tools. In In Proc. Network and Dis- tributed Systems Security Symposium, page 163–176, 2003. (Cited on page 21.)

[43] Rohan Garg, Komal Sodha, and Gene Cooperman. A generic checkpoint-restart mechanism for virtual machines. Technical report, arXiv tech. report, arXiv:1212.1787, December 2012. URL http: //arxiv.org/abs/1212.1787. Published: arXiv:1212.1787 [cs.OS], http://arxiv.org/abs/1212.1787. (Cited on page 87.)

[44] Rohan Garg, Komal Sodha, Zhengping Jin, and Gene Cooperman. Checkpoint-restart for a network of virtual machines. In Proc. of 2013 IEEE Computer Society International Conference on Cluster Computing, pages 1–8. IEEE Press, 2013. doi: 10.1109/CLUSTER.2013.6702626. (Cited on pages9, 71, 88, 135, and 136.)

[45] Devarshi Ghoshal, Sreesudhan R. Ramkumar, and Arun Chauhan. Distributed speculative parallelization using checkpoint restart. In Proceedings of the International Conference on Computational Science, 164 BIBLIOGRAPHY

ICCS 2011, volume 4 of Proceedings of the International Conference on Computational Science, ICCS 2011, pages 422–431, 2011. doi: 10. 1016/j.procs.2011.04.044. URL http://www.sciencedirect. com/science/article/pii/S1877050911001025. (Cited on page 134.)

[46] Robert P. Goldberg and Robert Hassinger. The double paging anomaly. In Proceedings of the May 6-10, 1974, National Com- puter Conference and Exposition, AFIPS ’74, page 195–199, New York, NY, USA, 1974. ACM. doi: 10.1145/1500175.1500215. URL http://doi.acm.org/10.1145/1500175.1500215. (Cited on pages 91 and 129.)

[47] Kinshuk Govil. Virtual clusters: resource management on large shared- memory multiprocessors. PhD thesis, Stanford University, Palo Alto, CA, USA, 2001. AAI3000034. (Cited on pages 91, 97, and 129.)

[48] Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, and Mendel Rosen- blum. Cellular disco: Resource management using virtual clusters on shared-memory multiprocessors. In Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, SOSP ’99, page 154–169, New York, NY, USA, 1999. ACM. ISBN 1-58113-140-2. doi: 10.1145/319151.319162. URL http://doi.acm.org/10.1145/ 319151.319162. (Cited on pages 91, 97, and 129.)

[49] Richard L. Graham, Sung-Eun Choi, David J. Daniel, Nehal N. De- sai, Ronald G. Minnich, Craig E. Rasmussen, L. Dean Risinger, and Mitchel W. Sukalski. A network-failure-tolerant message-passing sys- tem for terascale clusters. In Proceedings of the 16th International Con- ference on Supercomputing, ICS ’02, page 77–83, New York, NY, USA, 2002. ACM. ISBN 1-58113-483-5. doi: 10.1145/514191.514205. BIBLIOGRAPHY 165

URL http://doi.acm.org/10.1145/514191.514205. (Cited on page 18.)

[50] Ajay Gulati, Irfan Ahmad, and Carl A. Waldspurger. PARDA: pro- portional allocation of resources for distributed storage access. In Proccedings of the 7th Conference on File and Storage Technologies, FAST ’09, page 85–98, Berkeley, CA, USA, 2009. USENIX Associa- tion. URL http://dl.acm.org/citation.cfm?id=1525908. 1525915. (Cited on page 113.)

[51] Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvard- han Kharche, Niraj Tolia, Vanish Talwar, and Parthasarathy Ran- ganathan. GViM: GPU-accelerated virtual machines. In Pro- ceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, HPCVirt ’09, page 17–24, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-465-2. doi: 10. 1145/1519138.1519141. URL http://doi.acm.org/10.1145/ 1519138.1519141. (Cited on page 19.)

[52] Paul H. Hargrove and Jason C. Duell. Berkeley lab checkpoint/restart (BLCR) for linux clusters. Journal of Physics: Conference Series, 46(1): 494, September 2006. ISSN 1742-6596. doi: 10.1088/1742-6596/ 46/1/067. URL http://iopscience.iop.org/1742-6596/ 46/1/067. (Cited on pages3, 17, 18, 19, and 23.)

[53] Jim Huang and Kito Cheng. Implement checkpointing for android (slides). In Embedded Linux Conference Europe (ELCE2012). 0xlab, November 2012. URL http://www.slideshare.net/jserv/ implement-checkpointing-for-android-elce2012. (Cited on page 136.)

[54] J. Hursey, J.M. Squyres, T.I. Mattox, and A. Lumsdaine. The design and implementation of Checkpoint/Restart process fault tolerance for 166 BIBLIOGRAPHY

open MPI. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8, March 2007. doi: 10. 1109/IPDPS.2007.370605. (Cited on pages 18 and 19.)

[55] Joshua Hursey, Timothy I. Mattox, and Andrew Lumsdaine. Intercon- nect agnostic Checkpoint/Restart in open MPI. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, HPDC ’09, page 49–58, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-587-1. doi: 10.1145/1551609.1551619. URL http://doi.acm.org/10.1145/1551609.1551619. (Cited on pages 19, 72, and 89.)

[56] VMware Inc. VMware workstation, March 2014. URL http://www. vmware.com/products/workstation. (Cited on page 92.)

[57] VMware Inc. VMware vSphere hypervisor, March 2014. URL http: //www.vmware.com/products/esxi-and-esx/overview. (Cited on page 91.)

[58] Pankaj Jalote. Fault Tolerance in Distributed Systems. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994. ISBN 0-13-301367-7. (Cited on page 21.)

[59] G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti, and Yoshio Turner. Cruz: Application-transparent distributed checkpoint- restart on standard operating systems. In International Conference on Dependable Systems and Networks, 2005. DSN 2005. Proceedings, pages 260–269, June 2005. doi: 10.1109/DSN.2005.33. (Cited on page 16.)

[60] Stephen T. Jones, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci- Dusseau. Geiger: Monitoring the buffer cache in a virtual ma- chine environment. In Proceedings of the 12th International Confer- BIBLIOGRAPHY 167

ence on Architectural Support for Programming Languages and Oper- ating Systems, ASPLOS XII, page 14–24, New York, NY, USA, 2006. ACM. ISBN 1-59593-451-0. doi: 10.1145/1168857.1168861. URL http://doi.acm.org/10.1145/1168857.1168861. (Cited on page 130.)

[61] Poul-henning Kamp and Robert N. M. Watson. Jails: Confining the omnipotent root. In In Proc. 2nd Intl. SANE Conference, 2000. (Cited on page 23.)

[62] Samaneh Kazemi Nafchi, Rohan Garg, and Gene Cooperman. Trans- parent checkpoint-restart for hardware-accelerated 3D graphics. Technical report, arXiv tech. report, arXiv:1312.6650, 2013. URL http://arxiv.org/abs/1312.6650v2. (Cited on pages9, 31, 71, 88, and 136.)

[63] Gregory Kerr, Alex Brick, Gene Cooperman, and Sergey Bra- tus. Checkpoint-restart: Proprietary hardware and the ‘Spiderweb API’. Technical report, Recon 2011, July 2011. talk: abstract at http://recon.cx/2011/schedule/events/112.en.html; video at https://archive.org/details/Recon_2011_Checkpoint_Restart. (Cited on page 35.)

[64] Samuel T. King, George W. Dunlap, and Peter M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceed- ings of the Annual Conference on USENIX Annual Technical Confer- ence, ATEC ’05, page 1–1, Berkeley, CA, USA, 2005. USENIX Associa- tion. URL http://dl.acm.org/citation.cfm?id=1247360. 1247361. (Cited on page 136.)

[65] Naveen Kumar and Ramesh Peri. Transparent debugging of dy- namically instrumented programs. SIGARCH Comput. Archit. News, 33(5):57–62, December 2005. ISSN 0163-5964. doi: 10. 168 BIBLIOGRAPHY

1145/1127577.1127589. URL http://doi.acm.org/10.1145/ 1127577.1127589. (Cited on page 21.)

[66] Oren Laadan. A Personal Virtual Computer Recorder. PhD the- sis, Columbia University, 2011. URL http://academiccommons. columbia.edu/catalog/ac:131552. (Cited on page 16.)

[67] Oren Laadan and Jason Nieh. Transparent checkpoint-restart of mul- tiple processes on commodity operating systems. In 2007 USENIX An- nual Technical Conference on Proceedings of the USENIX Annual Tech- nical Conference, ATC’07, page 25:1–25:14, Berkeley, CA, USA, 2007. USENIX Association. ISBN 999-8888-77-6. URL http://dl.acm. org/citation.cfm?id=1364385.1364410. (Cited on page 16.)

[68] Oren Laadan, Nicolas Viennot, and Jason Nieh. Transparent, lightweight application execution replay on commodity multiproces- sor operating systems. In Proceedings of the ACM SIGMETRICS Interna- tional Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’10, page 155–166, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0038-4. doi: 10.1145/1811039.1811057. URL http://doi.acm.org/10.1145/1811039.1811057. (Cited on pages 16 and 17.)

[69] H. Andres Lagar-Cavilla, Niraj Tolia, M. Satyanarayanan, and Eyal de Lara. VMM-independent graphics acceleration. In Proceed- ings of the 3rd International Conference on Virtual Execution Envi- ronments, VEE ’07, page 33–43, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-630-1. doi: 10.1145/1254810.1254816. URL http://doi.acm.org/10.1145/1254810.1254816. (Cited on pages 72 and 88.)

[70] Peter Alan Lee and Thomas Anderson. Fault tolerance. In Fault Tol- erance, number 3 in Dependable Computing and Fault-Tolerant Sys- BIBLIOGRAPHY 169

tems, pages 51–77. Springer Vienna, January 1990. ISBN 978-3- 7091-8992-4, 978-3-7091-8990-0. URL http://link.springer. com/chapter/10.1007/978-3-7091-8990-0_3. (Cited on page 21.)

[71] Pierre Lemarinier, Aurélien Bouteiller, Thomas Herault, Géraud Krawezik, and Franck Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In Pro- ceedings of the 2004 IEEE International Conference on Cluster Comput- ing, CLUSTER ’04, page 115–124, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7803-8694-9. URL http://dl.acm. org/citation.cfm?id=1111682.1111713. (Cited on page 22.)

[72] E. Christopher Lewis, Prashant Dhamdhere, and Eric Xiaojian Chen. Virtual machine-based replay debugging, October 2008. Google Tech Talks: http://www.youtube.com/watch?v=RvMlihjqlhY; further in- formation at http://www.replaydebugging.com. (Cited on page 136.)

[73] Kai Li, Jeffrey F. Naughton, and James S. Plank. Real-time, con- current checkpoint for parallel programs. In Proceedings of the Sec- ond ACM SIGPLAN Symposium on Principles &Amp; Practice of Par- allel Programming, PPOPP ’90, page 79–88, New York, NY, USA, 1990. ACM. ISBN 0-89791-350-7. doi: 10.1145/99163.99173. URL http://doi.acm.org/10.1145/99163.99173. (Cited on pages 15 and 22.)

[74] Kai Li, Jeffrey F. Naughton, and James S. Plank. Low-latency, con- current checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874–879, August 1994. ISSN 1045-9219. doi: 10.1109/71.298215. (Cited on pages 15 and 22.)

[75] Tim Lindholm and Frank Yellin. Java Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1999. ISBN 0201432943. (Cited on page 22.)

[76] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and migration of UNIX processes in the condor dis- tributed processing system. Technical report 1346, University of Wis- consin, Madison, Wisconsin, April 1997. (Cited on pages 15, 18, and 23.)

[77] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K. Panda. High perfor- mance RDMA-Based MPI implementation over InfiniBand. Interna- tional Journal of Parallel Programming, 32(3):167–198, June 2004. ISSN 0885-7458, 1573-7640. doi: 10.1023/B:IJPP.0000029272. 69895.c1. URL http://link.springer.com/article/10. 1023/B:IJPP.0000029272.69895.c1. (Cited on page 19.)

[78] Pin Lu and Kai Shen. Virtual machine memory access tracing with hy- pervisor exclusive cache. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, ATC’07, page 3:1–3:15, Berkeley, CA, USA, 2007. USENIX Association. ISBN 999-8888-77-6. URL http://dl.acm.org/citation.cfm?id= 1364385.1364388. (Cited on page 130.)

[79] Machi Maeji and Tadashi Kanamori. Top-down zooming diagnosis of logic programs. Technical report, Kyoto University, 1988. (Cited on page 135.)

[80] Violeta Medina and Juan Manuel García. A survey of migration mech- anisms of virtual machines. ACM Comput. Surv., 46(3):30:1–30:33, January 2014. ISSN 0360-0300. doi: 10.1145/2492705. URL http://doi.acm.org/10.1145/2492705. (Cited on page 14.)

[81] Dejan S. Milojiˇci´c, Fred Douglis, Yves Paindaveine, Richard Wheeler, and Songnian Zhou. Process migration. ACM Computing Surveys, BIBLIOGRAPHY 171

32(3):241–299, September 2000. ISSN 0360-0300. doi: 10.1145/ 367701.367728. URL http://doi.acm.org/10.1145/367701. 367728. (Cited on page 13.)

[82] Grzegorz Miłós, Derek G. Murray, Steven Hand, and Michael A. Fet- terman. Satori: Enlightened page sharing. In Proceedings of the 2009 Conference on USENIX Annual Technical Conference, USENIX’09, page 1–1, Berkeley, CA, USA, 2009. USENIX Association. URL http: //dl.acm.org/citation.cfm?id=1855807.1855808. (Cited on pages 91, 97, 101, 130, and 131.)

[83] Henrik Nilsson. Declarative debugging for lazy functional languages. Citeseer, 1998. (Cited on page 135.)

[84] Henrik Nilsson and Peter Fritzson. Algorithmic debugging for lazy functional languages. In Maurice Bruynooghe and Martin Wirsing, editors, Proceedings of the 4th International Symposium on Programming Language Implementation and Logic Programming, PLILP ’92, pages 385–399, London, UK, UK, 1992. Springer Berlin Heidelberg. ISBN 3-540-55844-6. URL http://dl.acm.org/ citation.cfm?id=646448.692462. (Cited on page 135.)

[85] Mark O’Neill. Cryopid2, December 2013. URL http:// sourceforge.net/projects/cryopid2. (Cited on page 19.)

[86] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of zap: A system for migrating comput- ing environments. In Proceedings of the 5th Symposium on Operating Systems Design and implementation, OSDI ’02, page 361–376, New York, NY, USA, 2002. ACM. ISBN 978-1-4503-0111-4. doi: 10. 1145/1060289.1060323. URL http://doi.acm.org/10.1145/ 1060289.1060323. (Cited on pages8 and 16.) 172 BIBLIOGRAPHY

[87] Eunbyung Park, Bernhard Egger, and Jaejin Lee. Fast and space- efficient virtual machine checkpointing. In Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Envi- ronments, VEE ’11, page 75–86, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0687-4. doi: 10.1145/1952682.1952694. URL http://doi.acm.org/10.1145/1952682.1952694. (Cited on pages 100 and 130.)

[88] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and Anand Karunanidhi. Pinpointing representative portions of large intel® itanium® programs with dynamic instrumenta- tion. In Proceedings of the 37th Annual IEEE/ACM International Sym- posium on Microarchitecture, MICRO 37, page 81–92, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2126-6. doi: 10.1109/MICRO.2004.28. URL http://dx.doi.org/10.1109/ MICRO.2004.28. (Cited on pages 21, 23, and 24.)

[89] Eduardo Pinheiro. EPCKPT — a checkpoint utility for the linux ker- nel, 2002. URL http://www.research.rutgers.edu/edpin/ epckpt/. (Cited on page 15.)

[90] James Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical report, University of Tennessee, Knoxville, TN, USA, 1997. (Cited on page 13.)

[91] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under unix. In Proceedings of the USENIX 1995 Technical Conference Proceedings, TCON’95, page 18–18, Berke- ley, CA, USA, 1995. USENIX Association. URL http://dl.acm. org/citation.cfm?id=1267411.1267429. (Cited on page 15.) BIBLIOGRAPHY 173

[92] James S. Plank, Jian Xu, and Robert H. B. Netzer. Compressed dif- ferences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, August 1995. (Cited on pages 15 and 18.)

[93] Artem Y. Polyakov. Batch-queue plugin for DMTCP, March 2014. URL https://sourceforge.net/p/dmtcp/code/ HEAD/tree/trunk/plugin/batch-queue. (Cited on pages9 and 81.)

[94] Bernard James Pope. A declarative debugger for Haskell. PhD thesis, University of Melbourne, Department of Computer Science and Soft- ware Engineering„ Victoria, Australia, 2007. (Cited on page 135.)

[95] Donald E. Porter, Silas Boyd-Wickizer, Jon Howell, Reuben Olinsky, and Galen C. Hunt. Rethinking the library OS from the top down. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, page 291–304, New York, NY, USA, 2011. ACM. ISBN 978-1- 4503-0266-1. doi: 10.1145/1950365.1950399. URL http://doi. acm.org/10.1145/1950365.1950399. (Cited on page 24.)

[96] Daniel Price, Andrew Tucker, and Sun Microsystems. Solaris zones: Operating system support for consolidating commercial workloads. In 18th Large Installation System Administration Conference, pages 241–254, 2004. (Cited on page 23.)

[97] Eric Roman. A survey of Checkpoint/Restart implementations. Technical report, Lawrence Berkeley National Laboratory, 2002. (Cited on page 13.)

[98] Jose Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, and Song Jiang. Current practice and a direction forward in checkpoint/restart implementations for fault tolerance. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 8 pp., April 2005. doi: 10.1109/IPDPS.2005.157. (Cited on page 13.)

[99] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine, Jason Duell, Paul Hargrove, and Eric Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479–493, November 2005. ISSN 1094-3420, 1741-2846. doi: 10.1177/1094342005056139. URL http://hpc.sagepub.com/content/19/4/479. (Cited on pages 18 and 19.)

[100] Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, and Paul Stodghill. Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC ’04, page 38, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2153-3. doi: 10.1109/SC.2004.29. URL http://dx.doi.org/10.1109/SC.2004.29. (Cited on page 15.)

[101] Love H. Seawright and Richard A. MacKinnon. VM/370: a study of multiplicity and usefulness. IBM Syst. J., 18(1):4–17, March 1979. ISSN 0018-8670. doi: 10.1147/sj.181.0004. URL http://dx.doi.org/10.1147/sj.181.0004. (Cited on page 129.)

[102] Josep Silva. A comparative study of algorithmic debugging strategies. In Germán Puebla, editor, Logic-Based Program Synthesis and Transformation, number 4407 in Lecture Notes in Computer Science, pages 143–159. Springer Berlin Heidelberg, January 2007. ISBN 978-3-540-71409-5, 978-3-540-71410-1. URL http://link.springer.com/chapter/10.1007/978-3-540-71410-1_11. (Cited on page 135.)

[103] Standard Performance Evaluation Corporation (SPEC). SPECjbb2005, March 2014. URL http://www.spec.org/jbb2005. (Cited on pages 112 and 116.)

[104] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium, IPPS ’96, pages 526–531, April 1996. doi: 10.1109/IPPS.1996.508106. (Cited on page 18.)

[105] O.O. Sudakov, I.S. Meshcheriakov, and Y.V. Boyko. CHPOX: Transparent checkpointing system for Linux clusters. In 4th IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS 2007), pages 159–164, September 2007. doi: 10.1109/IDAACS.2007.4488396. (Cited on page 17.)

[106] Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy. Recovering device drivers. ACM Trans. Comput. Syst., 24(4):333–360, November 2006. ISSN 0734-2071. doi: 10.1145/1189256.1189257. URL http://doi.acm.org/10.1145/1189256.1189257. (Cited on page 35.)

[107] Hajime Tazaki, Frédéric Urbani, Emilio Mancini, Mathieu Lacage, Daniel Camara, Thierry Turletti, and Walid Dabbous. Direct code execution: Revisiting library OS architecture for reproducible network experiments. In The 9th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), Santa Barbara, United States, December 2013. URL http://hal.inria.fr/hal-00880870. (Cited on page 24.)

[108] Boost Team. Boost serialization, March 2014. URL www.boost.org/libs/serialization. (Cited on page 14.)

[109] Condor Team. Condor standard universe, 2013. URL http://research.cs.wisc.edu/htcondor/manual/v7.9/2_4Road_map_Running.html. (Cited on pages 3 and 18.)

[110] Condor Team. The Condor project homepage, March 2014. URL http://www.cs.wisc.edu/condor/. (Cited on page 3.)

[111] CRIU Team. CRIU, December 2013. URL http://criu.org/. (Cited on pages 3, 20, and 23.)

[112] FReD Team. FReD software, 2011. URL https://github.com/ fred-dbg/fred. (Cited on page 85.)

[113] Jenkins Team. Jenkins, March 2014. URL http://jenkins-ci.org. (Cited on page 116.)

[114] KVM Team. KVM/QEmu, March 2014. URL http://wiki.qemu.org/KVM. (Cited on pages 24 and 87.)

[115] Lguest Team. Lguest: The simple x86 hypervisor, March 2014. URL http://lguest.ozlabs.org. (Cited on pages 24, 87, and 88.)

[116] Linux-VServer Team. Linux-VServer, 2003. URL http://linux-vserver.org. (Cited on page 23.)

[117] LXC Team. LXC Linux containers, December 2013. URL https://linuxcontainers.org/. (Cited on pages 16, 20, and 23.)

[118] OpenVZ Team. OpenVZ, 2006. URL http://openvz.org. (Cited on page 23.)

[119] Parallels Virtuozzo Containers Team. Parallels Virtuozzo Containers, 2014. URL http://www.parallels.com/products/pvc/. (Cited on page 23.)

[120] Python Team. Pickle: Python object serialization, March 2014. URL https://docs.python.org/2/library/pickle.html. (Cited on page 14.)

[121] QEmu Team. QEmu, 1998. URL http://qemu.org. (Cited on page 87.)

[122] Thuan L. Thai and Hoang Lam. .NET Framework Essentials. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2001. ISBN 0596001657. (Cited on page 22.)

[123] Douglas Thain and Miron Livny. Multiple bypass: Interposition agents for distributed computing. Cluster Computing, 4(1):39–47, March 2001. ISSN 1386-7857. doi: 10.1023/A:1011412209850. URL http://dx.doi.org/10.1023/A:1011412209850. (Cited on page 21.)

[124] Mustafa M. Tikir and Jeffrey K. Hollingsworth. Hardware monitors for dynamic page migration. Journal of Parallel and Distributed Computing, 68(9):1186–1200, September 2008. ISSN 0743-7315. doi: 10.1016/j.jpdc.2008.05.006. URL http://www.sciencedirect.com/science/article/pii/S0743731508001020. (Cited on pages 21, 23, and 24.)

[125] Anthony Velte and Toby Velte. Microsoft Virtualization with Hyper-V. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 2010. ISBN 0071614036, 9780071614030. (Cited on page 24.)

[126] Ana-Maria Vişan. Temporal Meta-Programming: Treating Time as a Spatial Dimension. PhD thesis, Northeastern University, 2012. (Cited on page 9.)

[127] Ana-Maria Visan, Kapil Arya, Gene Cooperman, and Tyler Denniston. URDB: A universal reversible debugger based on decomposing debugging histories. In Proc. of 6th Workshop on Programming Languages and Operating Systems (PLOS) (part of Proc. of 23rd ACM Symp. on Operating System Principles (SOSP)), 2011. Electronic proceedings at http://sigops.org/sosp/sosp11/workshops/plos/08-visan.pdf; software for latest version, FReD (Fast Reversible Debugger), at https://github.com/fred-dbg/fred. (Cited on pages 9, 71, 84, 135, and 136.)

[128] Carl A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, OSDI ’02, pages 181–194, New York, NY, USA, 2002. ACM. ISBN 978-1-4503-0111-4. doi: 10.1145/1060289.1060307. URL http://doi.acm.org/10.1145/1060289.1060307. (Cited on pages 91, 96, 97, 98, and 129.)

[129] John Paul Walters and Vipin Chaudhary. Application-level checkpointing techniques for parallel programs. In Sanjay K. Madria, Kajal T. Claypool, Rajgopal Kannan, Prem Uppuluri, and Manoj Madhava Gore, editors, Distributed Computing and Internet Technology, number 4317 in Lecture Notes in Computer Science, pages 221–234. Springer Berlin Heidelberg, January 2006. ISBN 978-3-540-68379-7, 978-3-540-68380-3. URL http://link.springer.com/chapter/10.1007/11951957_21. (Cited on page 14.)

[130] Jon Watson. VirtualBox: bits and bytes masquerading as machines. Linux J., 2008(166), February 2008. ISSN 1075-3583. URL http://dl.acm.org/citation.cfm?id=1344209.1344210. (Cited on page 24.)

[131] Benjamin Wester, Peter M. Chen, and Jason Flinn. Operating system support for application-specific speculation. In Proceedings of the Sixth Conference on Computer Systems, EuroSys ’11, pages 229–242, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0634-8. doi: 10.1145/1966445.1966467. URL http://doi.acm.org/10.1145/1966445.1966467. (Cited on page 134.)

[132] David A. Wheeler. SLOCCount: source lines of code counter, March 2014. URL http://www.dwheeler.com/sloccount. (Cited on page 73.)

[133] Namyoon Woo, Soonho Choi, Hyungsoo Jung, Jungwhan Moon, Heon Y. Yeom, Taesoon Park, and Hyungwoo Park. MPICH-GF: Providing fault tolerance on grid environments. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), poster and research demo session, Tokyo, Japan, May 2003. (Cited on page 18.)

[134] Bob Woodruff, Sean Hefty, Roland Dreier, and Hal Rosenstock. Introduction to the InfiniBand core software. In Proceedings of the Linux Symposium (Volume Two), pages 271–282, Ottawa, Canada, July 2005. (Cited on page 35.)

[135] Victor C. Zandy. ckpt — a process checkpoint library, 2005. URL http://cs.wisc.edu/~zandy/ckpt/. (Cited on page 23.)

[136] Victor C. Zandy, Barton P. Miller, and Miron Livny. Process hijacking. In Proceedings of the Eighth International Symposium on High Performance Distributed Computing, pages 177–184, 1999. doi: 10.1109/HPDC.1999.805296. (Cited on pages 21 and 23.)

[137] Youhui Zhang, Dongsheng Wong, and Weimin Zheng. User-level checkpoint and recovery for LAM/MPI. SIGOPS Oper. Syst. Rev., 39(3):72–81, July 2005. ISSN 0163-5980. doi: 10.1145/1075395.1075402. URL http://doi.acm.org/10.1145/1075395.1075402. (Cited on page 18.)

[138] Gengbin Zheng, Lixia Shi, and L.V. Kale. FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing, pages 93–103, September 2004. doi: 10.1109/CLUSTR.2004.1392606. (Cited on page 18.)

[139] Hua Zhong and Jason Nieh. CRAK: Linux checkpoint/restart as a kernel module. Technical Report CUCS-014-01, Dept. of Computer Science, Columbia University, November 2001. (Cited on page 16.)