User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines
A dissertation presented by
Kapil Arya
to the Faculty of the Graduate School of the College of Computer and Information Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Northeastern University Boston, Massachusetts
August 2014
Copyright © August 2014 by Kapil Arya
NORTHEASTERN UNIVERSITY GRADUATE SCHOOL OF COMPUTER SCIENCE Ph.D. THESIS APPROVAL FORM
THESIS TITLE: User-Space Process Virtualization in the Context of Checkpoint-Restart and Virtual Machines AUTHOR: Kapil Arya
Ph.D. Thesis approved to complete all degree requirements for the Ph.D. degree in Computer Science
Abstract
Checkpoint-Restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from the disk. In addition to its traditional use in fault tolerance (recovering from a system failure), it has numerous other uses, such as application debugging and save/restore of the workspace of an interactive problem-solving environment. Transparent checkpointing operates without modifying the underlying application program, but it implicitly relies on a "Closed World Assumption": the world (including file system, network, etc.) will look the same upon restart as it did at the time of checkpoint. This is not valid for more complex programs. Until now, checkpoint-restart packages have adopted ad hoc solutions for each case where the environment changes upon restart.

This dissertation presents user-space process virtualization to decouple application processes from external subsystems. A thin virtualization layer is introduced between the application and each external subsystem. It provides the application with a consistent view of the external world and allows checkpoint-restart to succeed. The ever-growing number of external subsystems makes it harder to deploy and maintain virtualization layers in a monolithic checkpoint-restart system. To address this, an adaptive, plugin-based approach is used to implement the virtualization layers, which allows the checkpoint-restart system to grow organically.

The principle of decoupling the external subsystem through process virtualization is also applied in the context of virtual machines, to provide a solution to the long-standing double-paging problem. Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor; it leads to long delays for the guest as the contents are read back into machine memory only to be written out again.
The performance rapidly drops as a result of significant lengthening of the time to complete the guest I/O request.
Acknowledgments
No dissertation is accomplished without the support of many people, and I can only begin to thank all those who have helped me in completing it.

I am indebted to my advisor, Gene Cooperman, for his patience, encouragement, support, and guidance over the years. It is because of Gene that I decided to pursue a Ph.D. while I was a Master's student at Northeastern. Gene taught me how to do research and how to distinguish the ideas that only I would find interesting from the ideas that are important. I could not have asked for a better teacher, and without him, this document would not exist.

I am thankful to Panagiotis (Pete) Manolios, Alan Mislove, and William Robertson for serving on my committee and for providing their insightful input and constructive criticism. I resoundingly thank Peter Desnoyers for always being available to discuss ideas and for providing constructive feedback on several occasions. I also want to thank the International Student and Scholar Institute (ISSI) team and Bryan Lackaye for helping with administrative matters during my stay at Northeastern.

I was fortunate to be mentored by Alex Garthwaite during my summer internships at VMware. His guidance and encouragement are always there and never seem to fade away. Alex agreed to be the external member of my committee, and I am thankful for his feedback and thoughtful comments, which have not only improved the quality of this dissertation but also provided ideas for future directions. His dictum that a good dissertation is a completed one became my mantra during the last two years.

I also want to thank Yury Baskakov for all the help that I received while working on the Tesseract project. He never got tired of my random speculations and was always there to provide further insights and to cover my blind spots. A special thanks goes to Jerri-Ann Meyer and Joyce Spencer for their continued support of the project.
Finally, I want to thank Ron Mann for his continued advice and guidance, which have helped me become a better engineer.

I am grateful to Alok Singh Gehlot for his friendship, all the advice he provided me over the years, and his constant reminder that it's not done until it's done. He was always available for me, and without his guidance, I would not have been at Northeastern for my Master's and, later, my Ph.D.

I want to thank Rohan Garg and Jaideep Ramachandran for going through the thesis drafts, sitting through my practice talks, and providing valuable feedback. Over the years, I have had the support of a lot of friends, and I want to thank Jaijun Cao, Harsh Raju Chamarthi, Tyler Denniston, Anand Gehlot, Gregory Kerr, Samaneh Kazemi Nafchi, Artem Polyakov, Sumit Purohit, Praveen Singh Solanki, Ana-Maria Visan, Vishal Vyas, and any others I regrettably failed to name. I am enormously thankful to Surbhi for her enduring friendship and companionship through all these years.

Finally, I owe much to my family. I want to express my deepest gratitude to my grandparents, Smt. Mohini Devi and Sh. Omdutt Ji, my parents, Smt. Jamana Devi and Sh. Nem Singh Ji, my aunt and uncle, Smt. Sangeeta Devi and Sh. Hari Singh Ji, my uncles, Sh. Kamlesh Ji and Sh. Dilip Ji, and my siblings and cousins, Kavita, Lalita, Shilpa, and Anil, for their never-ending love, dedication, and support. I am forever indebted to them.

To my grandfather Shri Omdutt Ji Solanki
And my school teacher Shri Devi Singh Ji Kachhwaha
Contents
List of Figures
List of Tables
1 Overview ...... 1
  1.1 Closed-World Assumption ...... 2
  1.2 Double-Paging Anomaly ...... 4
  1.3 Process Virtualization ...... 4
  1.4 Thesis Statement ...... 6
  1.5 Contributions ...... 7
    1.5.1 Process Virtualization through Plugins ...... 7
    1.5.2 Application-Specific Plugins ...... 8
    1.5.3 Third-Party Plugins ...... 9
    1.5.4 Solving the Double-Paging Problem ...... 9
  1.6 Organization ...... 10
2 Concepts Related to Checkpoint-Restart and Virtualization ...... 13
  2.1 Checkpoint-Restart ...... 13
    2.1.1 Kernel-Level Transparent Checkpoint-Restart ...... 15
    2.1.2 User-Level Transparent Checkpoint-Restart ...... 18
    2.1.3 Fault Tolerance ...... 21
  2.2 System Call Interpositioning ...... 21
  2.3 Virtualization ...... 22
    2.3.1 Language-Specific Virtual Machines ...... 22
    2.3.2 Process Virtualization ...... 22
    2.3.3 Lightweight O/S-based Virtual Machines ...... 23
    2.3.4 Virtual Machines ...... 24
  2.4 DMTCP Version 1 ...... 25
    2.4.1 Library Call Wrappers ...... 27
    2.4.2 DMTCP Coordinator ...... 27
    2.4.3 Checkpoint Thread ...... 27
    2.4.4 Checkpoint ...... 28
    2.4.5 Restart ...... 28
    2.4.6 Checkpoint Consistency for Distributed Processes ...... 29
3 Adaptive Plugins as a Mechanism for Virtualization ...... 31
  3.1 The Ever Changing Execution Environment ...... 31
    3.1.1 PID: Virtualizing Kernel Resource Identifiers ...... 32
    3.1.2 SSH Connection: Virtualizing a Protocol ...... 33
    3.1.3 InfiniBand: Virtualizing a Device Driver ...... 35
    3.1.4 OpenGL: A Record/Replay Approach to Virtualizing a Device Driver ...... 36
    3.1.5 POSIX Timers: Adapting to Application Requirements ...... 36
  3.2 Virtualizing the Execution Environment ...... 37
    3.2.1 Virtualize Access to External Resources ...... 37
    3.2.2 Capture/Restore the State of External Resources ...... 38
  3.3 Adaptive Plugins as a Synthesis of System-Level and Application-Level Checkpointing ...... 39
4 The Design of Plugins ...... 41
  4.1 Plugin Architecture ...... 42
    4.1.1 Virtualization through Function Wrappers ...... 43
    4.1.2 Event Notifications ...... 46
    4.1.3 Publish/Subscribe Service ...... 49
  4.2 Design Recipe for Virtualization through Plugins ...... 50
  4.3 Plugin Dependencies ...... 52
    4.3.1 Dependency Resolution ...... 52
    4.3.2 External Resources Virtualized by Other Plugins ...... 54
    4.3.3 Multiple Plugins Wrapping the Same Function ...... 55
  4.4 Extending to Multiple Processes ...... 56
    4.4.1 Unique Resource-id for Shared Resources ...... 57
    4.4.2 Checkpointing Shared Resources ...... 58
    4.4.3 Restoring Shared Resources ...... 61
  4.5 Three Base Plugins ...... 62
    4.5.1 Coordinator Interface Plugin ...... 62
    4.5.2 Thread Plugin ...... 62
    4.5.3 Memory Plugins ...... 63
  4.6 Implementation Challenges ...... 65
    4.6.1 Wrapper Functions ...... 65
    4.6.2 New Process/Program Creation ...... 67
    4.6.3 Checkpoint Deadlock on a Runtime Library Resource ...... 68
    4.6.4 Blocking Library Functions and Checkpoint Starvation ...... 69
5 Expressivity of Plugins ...... 71
  5.1 File Descriptor Related Plugins ...... 73
  5.2 Pid, System V IPC, and Timer Plugins ...... 77
  5.3 Application-Specific Plugins ...... 77
  5.4 SSH Connection ...... 78
  5.5 Batch-Queue Plugin for Resource Managers ...... 81
  5.6 Ptrace Plugin ...... 84
  5.7 Deterministic Record-Replay ...... 85
  5.8 Checkpointing Networks of Virtual Machines ...... 87
  5.9 3-D Graphics: Support for Programmable GPUs in OpenGL 2.0 and Higher ...... 88
  5.10 Transparent Checkpointing of InfiniBand ...... 89
  5.11 IB2TCP: Migrating from InfiniBand to TCP Sockets ...... 89
6 Tesseract: Reconciling Guest I/O and Hypervisor Swapping in a VM ...... 91
  6.1 Redundant I/O ...... 93
  6.2 Motivation: The Double-Paging Anomaly ...... 94
  6.3 Design ...... 97
    6.3.1 Extending The Hosted Platform To Be Like ESX ...... 97
    6.3.2 Reconciling Redundant I/Os ...... 99
    6.3.3 Tesseract's Virtual Disk and Swap Subsystems ...... 102
  6.4 Implementation ...... 105
    6.4.1 Explicit Management of Hypervisor Swapping ...... 105
    6.4.2 Tracking Memory Pages and Disk Blocks ...... 106
    6.4.3 I/O Paths ...... 107
    6.4.4 Managing Block Indirection Metadata ...... 111
  6.5 Guest Disk Fragmentation ...... 112
    6.5.1 BSST Defragmentation ...... 113
    6.5.2 Guest VMDK Defragmentation ...... 115
  6.6 Evaluation ...... 116
    6.6.1 Inducing Double-Paging Activity ...... 116
    6.6.2 Application Performance ...... 117
    6.6.3 Double-Paging and Guest Write I/O Requests ...... 121
    6.6.4 Fragmentation in Guest Read I/O Requests ...... 122
    6.6.5 Evaluating Defragmentation Schemes ...... 123
    6.6.6 Using SSD For Storing BSST VMDK ...... 126
    6.6.7 Overheads ...... 127
  6.7 Related Work ...... 128
    6.7.1 Hypervisor Swapping and Double Paging ...... 128
    6.7.2 Associations Between Memory and Disk State ...... 130
    6.7.3 I/O and Memory Deduplication ...... 131
  6.8 Observations ...... 131
7 Impact for the Future ...... 133
  7.1 Compiled Code in Scripting Languages: Fast-Slow Paradigm ...... 133
  7.2 Support for Hadoop-style Big Data ...... 134
  7.3 Cybersecurity ...... 135
  7.4 Algorithmic Debugging ...... 135
  7.5 Reversible Debugging ...... 136
  7.6 Android-Based Mobile Computing ...... 136
  7.7 Cloud Computing ...... 136
8 Conclusion 137
A Plugin Tutorial ...... 139
  A.1 Introduction ...... 139
  A.2 Anatomy of a Plugin ...... 140
  A.3 Writing Plugins ...... 141
    A.3.1 Invoking a Plugin ...... 141
    A.3.2 The Plugin Mechanisms ...... 141
  A.4 Application-Initiated Checkpoints ...... 145
  A.5 Plugin Manual ...... 146
    A.5.1 Plugin Events ...... 146
    A.5.2 Publish/Subscribe ...... 151
    A.5.3 Wrapper Functions ...... 152
    A.5.4 Miscellaneous Utility Functions ...... 152
Bibliography 155
List of Figures
1.1 Application surface of a running process ...... 5
2.1 Architecture of DMTCP ...... 26
3.1 Virtualization of Process Id ...... 33
3.2 Two processes communicating over SSH ...... 33
3.3 Virtualizing an SSH connection ...... 34
4.2 Event notifications for write-ckpt and restart events ...... 47
4.4 Nested wrappers ...... 55
4.5 Plugin dependency for distributed processes ...... 61
5.1 Restoring an SSH connection ...... 80
6.1 Some cases of redundant I/O in a virtual machine ...... 93
6.2 An example of double-paging ...... 96
6.3 Double-paging with Tesseract ...... 102
6.4 Write I/O and hypervisor swapping ...... 103
6.5 Examples of reference count with Tesseract and with defragmentation ...... 104
6.6 VMware Workstation I/O Stack ...... 108
6.7 Modified scatter-gather list to avoid double-paging ...... 109
6.8 Splitting scatter-gather list during read ...... 110
6.9 Defragmenting the BSST ...... 114
6.10 Defragmenting the guest VMDK ...... 115
6.11 Trends for scores and pauses in SPECjbb runs with varying guest memory pressure and 10% host overcommitment ...... 118
6.12 Maximum single pauses observed in SPECjbb instantaneous scoring with varying guest memory pressure and 10% host memory overcommitment ...... 119
6.13 Scores and total pause times for SPECjbb runs with varying host overcommitment and 60 MB memhog ...... 120
6.14 Comparing maximum single pauses for SPECjbb under various defragmentation schemes with varying host memory overcommitment and 60 MB memhog ...... 121
6.15 Scores and pauses in SPECjbb runs under various defragmentation schemes with 10% host overcommitment ...... 123
6.16 Score and pauses in SPECjbb under various defragmentation schemes with varying host overcommitment and 60 MB memhog ...... 124
6.17 Comparing maximum single pauses for SPECjbb under various defragmentation schemes with 10% host memory overcommitment ...... 125
6.18 Tesseract performance with BSST placed on an SSD disk ...... 126

List of Tables
2.1 Comparison of various checkpointing systems...... 21
5.1 Comparison of process virtualization based checkpoint-restart with prior art ...... 72
5.2 Statistics for various plugins ...... 74
6.1 Holes in write I/O requests for varying host overcommitment and 60 MB memhog inside the guest ...... 122
6.2 Holes in read I/O requests for Tesseract without defragmentation for varying levels of host overcommitment and 60 MB memhog inside the guest ...... 122
6.3 Total I/Os with BSST and guest defragmentation ...... 125
6.4 Average read and write prepare/completion times in microseconds for baseline and Tesseract with and without defragmentation ...... 127
CHAPTER 1
Overview
Checkpoint-restart is a powerful mechanism to save the state of one or more running processes to disk and later restore it. In addition to the traditional use case of fault tolerance in long-running jobs, other use cases of checkpoint-restart include process migration, debugging, and save/restore of a workspace.

At a high level, checkpointing a process can be viewed as writing all of process memory, including shared libraries, text, and data, to a checkpoint image. Accordingly, restarting involves recreating the process memory by reading the checkpoint image from the disk. This works for simple programs, but for complex programs, one also needs to save and restore information about threads, open files, etc. In more sophisticated applications, it involves saving the network state (in-flight data, etc.) and information about the external environment, such as the terminal, the standard input/output/error, and so on.

Current checkpointing techniques fall into two categories: application-level and system-level. Application-level checkpointing requires modifications to the target program to insert checkpoint-restart code. The developer identifies the relevant state and data to be checkpointed and implements the mechanism for checkpointing and restoring them. While this approach is flexible and allows the programmer to optimize and have greater control over the checkpointing process, the developer pays a high cost for implementing and maintaining it. Further, the timing and frequency of checkpoints may not be specifiable in a flexible manner and could be limited to certain "safe" points in the program. System-level (or transparent) checkpointing, on the other hand, works without modifying the target application program. However, a simple implementation is less flexible in that it requires the same environment on restart (the case of homogeneous computer hosts).
1.1 Closed-World Assumption
Traditionally, checkpoint-restart packages have made a closed-world assumption:

The execution environment (file system, network, etc.) does not change between checkpoint and restart. Thus, to save and restore the state of the processes of a computation, it suffices to save the state of the CPU registers, the process's virtual memory, and kernel state.
While the closed-world assumption holds for simple programs, it is not valid for more complex programs (such as distributed processes), and can cause checkpoint-restart to fail in remarkable ways. For example, a process with open files will fail to restart if the underlying filesystem mount point has changed, or if the host has a new IP address while the process remembers the old one. At a more basic level, the restarted process will have a new process id (pid) provided by the kernel. Thus, any attempt by the target application to re-use a previously cached old pid will result in a failure.

One way to overcome the closed-world assumption is application-level checkpointing: modifying the application program to account for the changing environment. As mentioned earlier, this approach is costly and hard to maintain.
For these reasons, the existing systems have been used mostly for applications that obey the closed-world assumption, such as isolated batch jobs running solely on traditional multi-core compute nodes within a cluster. The closed-world assumption is enforced by placing several restrictions on the features that an application can use or by creating special-purpose workarounds to handle exceptions to the closed-world assumption.
For example, Condor [110] restricts applications from using multi-process jobs, interprocess communication, multi-threading, timers, file locks, etc. [109]. BLCR [52] is implemented through a Linux kernel module; it restores the original pid when it is still unused, and fails if it is unavailable. CRIU [111] places all target processes in a Linux container (a lightweight virtual machine), which has private namespaces for kernel objects but is isolated from other processes within the same host.
The closed-world assumption breaks down as users ask to checkpoint more general types of software that communicate with the external world. Examples include communication with system daemons (e.g., NSCD, LDAP authentication servers), 3-D graphics libraries (e.g., OpenGL), connections with database servers, networks of virtual machines, hybrid computations using CPU accelerators (e.g., GPU and Xeon Phi), Hadoop-style computations, a broader variety of network models (TCP sockets, InfiniBand, the SCIF network for the Intel Xeon Phi), competing implementations of InfiniBand libraries (QLogic/PSM versus InfiniBand OpenIB verbs), and so on.
These complex applications have created a dilemma. A system for pure transparent checkpointing has no knowledge of the application's external world, and an application-level checkpointing system would require the writer of the target application to insert code that adapts to the modified external environment after restart. This conflict is the core problem being solved.

1.2 Double-Paging Anomaly
Hypervisors often overcommit memory to achieve higher VM consolidation on the physical host. When overcommitting host physical memory, guest memory is paged in and out from a hypervisor-level swap file to reclaim host memory. Further, guests running in the virtual machines manage their own physical address space and may overcommit memory as needed.
Double-paging is an often-cited problem in multi-level scheduling of memory between virtual machines (VMs) and the hypervisor. This problem occurs when both a virtualized guest and the hypervisor overcommit their respective physical address spaces. When the guest pages out memory previously swapped out by the hypervisor, it initiates an expensive sequence of steps causing the contents to be read in from the hypervisor-level swap file only to be written out again, significantly lengthening the time to complete the guest I/O request. As a result, performance rapidly drops.
1.3 Process Virtualization
Often, application processes violate the closed-world assumption. When restarting from a checkpoint image, the recreated objects derived from external systems/services may not be the same as their pre-checkpoint versions. This is due to the changing execution environment across a checkpoint-restart boundary. In order to successfully restart an application process, we need to virtualize these objects in such a way that the application's view of the objects does not change across checkpoint and restart.
Definition: The application surface of a running application is a set of code and associated data that includes all application-specific objects (code+data) and excludes all opaque objects derived from any outside systems/services. (An opaque object is an object about whose internal structure the application knows nothing. The opaque object is only accessible through an identifying handle.)

Figure 1.1: Application surface of a running process. The virtual names lie inside the application surface, whereas the real names lie outside the surface.

Definition: User-space process virtualization finds a surface that is at least as large as the application surface, such that any virtualized view of an object lies inside this surface and any real view lies outside this surface (see Figure 1.1). On restart, the opaque objects are recreated to provide semantically equivalent functionality to their pre-checkpoint versions. Process virtualization then links these opaque objects with their virtualized views inside the application surface (through the identifying handles).

There can be more than one possible application surface. Typically, one chooses an application surface close to a well-known API for the sake of stability and maintainability. A wrapper around any call to the API will update both the virtual and the real view in a consistent manner.

Remarks:
1. In virtualizing a pid, we will see that libc will retain the real pid known to the kernel. Thus, libc is outside the application surface. But the application knows only the virtual pid, which resides inside the application surface.
2. In the case of a shadow device driver, the user-space memory of the application may contain both some opaque objects (e.g., InfiniBand queues) and their virtualized views. In this case the application surface excludes parts of the user-space memory of the application process.
3. Because daemons and the kernel are opaque to the application, they always lie outside the application surface.
4. An application may create an auxiliary child process (or even dis- tributed processes in the case of MPI). In this case, the application surface includes these auxiliary processes.
The goal of user-space process virtualization is to break the tight coupling between the application process and an external subsystem not under the control of the application process. In effect, each API is designed to provide a stable interface to a single system service over the lifetime of a process. This thesis will demonstrate the ability to find an application surface, and a corresponding API, for which a software translation layer can be built, enabling the application process to continue to receive the corresponding system service from an alternative external subsystem. This decouples the application process from the external subsystem.
1.4 Thesis Statement
User-space process virtualization can be used to decouple application processes from external subsystems to allow checkpoint-restart without enforcing a strict "closed-world assumption". The method of decoupling subsystems applies beyond checkpointing, as seen in a solution to the long-standing double-paging problem.

1.5 Contributions
This dissertation shows that a checkpointing system can "adapt" to the external environment, one subsystem at a time, by using the user-space process virtualization technique. To that end, this work introduces a plugin architecture based on adaptive plugins to virtualize these external subsystems. A plugin is responsible for virtualizing and checkpointing exactly one external subsystem, allowing the application to adapt to the modified external subsystem. The plugin architecture allows us to do selective (or partial) virtualization of the underlying resources for efficiency. Plugins can be loaded or unloaded to suit application requirements. Further, the architecture allows the checkpointing system to be extended organically, in a non-monolithic manner.
1.5.1 Process Virtualization through Plugins
To demonstrate the strength of the plugin architecture for user-space process virtualization, this work presents principled techniques for the following problems, which have resisted successful checkpoint-restart solutions for at least a decade (these plugins are original with this dissertation):
• The PID plugin (§5.2) virtualizes the process and thread identifiers assigned by the kernel.
• The System V IPC plugin (§5.2) virtualizes the shared memory, semaphore, and message queue identifiers assigned by the kernel.
• The Timer plugin (§5.2) virtualizes POSIX timers as well as clock identifiers assigned by the kernel.
• The SSH plugin (§5.4) virtualizes the underlying SSH connection between two processes to allow recreation on restart.
• The IB2TCP plugin (§5.11) virtualizes the InfiniBand device driver to allow a computation to be checkpointed on the InfiniBand hardware and restarted on the TCP hardware.
Notice that the Zap [86] system virtualized kernel resource identifiers such as pids and System V IPC ids in kernel space. However, the work of this dissertation virtualizes entirely in user space, without any application or kernel modifications or kernel modules. Further, this work extends the notion of user-space virtualization to processes/services outside the kernel, such as SSH connections, network daemons, and device drivers. This is achieved either through interposing on library calls or by creating shadow agents/processes for the external resources.
1.5.2 Application-Specific Plugins
Next, we show that plugins can be used for application-specific adaptations, providing the benefits of application-level checkpointing without having to modify the base application. The following application-specific plugins (§5.3) are original with this dissertation:
• The Malloc plugin virtualizes access to the underlying memory allocation library (e.g., libc malloc, tcmalloc, etc.).

• The DL plugin ensures atomicity for dlopen/dlsym functions with respect to checkpoint-restart.

• The CkptFile plugin provides heuristics for checkpointing open files. It also helps the file plugin to locate files on restart.

• The Uniq-Ckpt plugin is used to control the checkpoint file names, locations, etc.

1.5.3 Third-Party Plugins
Finally, the success of the plugin architecture can also be seen in third-party plugins. We show that third parties can write orthogonal, customized plugins to fit their needs. The following demonstrates original work due to plugins created by third-party contributors (this dissertation is not claiming these results):
• The Ptrace plugin [127] virtualizes the ptrace system call to allow checkpointing of an entire gdb session for reversible debugging.

• The Record-replay plugin [126] provides a lightweight deterministic replay mechanism by recording library calls for reversible debugging.

• The KVM plugin [44] is used for checkpointing the KVM/Qemu virtual machine.

• The Tun plugin [44] checkpoints the Tun/Tap network interface, enabling checkpointing of a network of virtual machines.

• The RM plugin [93] is used for checkpointing in a batch-queue environment and can handle multiple batch-queue systems.

• The InfiniBand plugin [27] provides the first non-MPI-specific transparent checkpoint-restart of the InfiniBand network.

• The OpenGL plugin [62] uses a record-prune-replay technique for checkpointing 3-D graphics (OpenGL 2.0 and beyond).
1.5.4 Solving the Double-Paging Problem
The process virtualization principles are also applied in the context of virtual machines. The double-paging problem is directly and transparently addressed by applying the decoupling principle [11]. The guest and hypervisor I/O operations are tracked to detect redundancy and are modified to create indirections to existing disk blocks containing the page contents. The indirection is created by introducing a thin virtualization layer that virtualizes access to the guest disk blocks. Further, the virtualization is done completely in user space.
1.6 Organization
The remainder of this dissertation is organized as follows.

A literature review is presented in Chapter 2, where various checkpoint-restart mechanisms are discussed. The review also includes various virtualization schemes in the context of checkpointing. (Literature for the double-paging problem is reviewed in Chapter 6.)

Chapter 3 provides several examples to motivate the need for virtualizing the execution environment. The chapter then uses this motivation to outline two basic requirements for virtualizing the execution environment. It is argued there that an adaptive, plugin-based approach is well suited for process virtualization.

Chapter 4 describes the design of adaptive plugins and presents the plugin architecture. The proposed plugin architecture is shown to meet the virtualization requirements laid out in Chapter 3. This is followed by a design recipe for developing new plugins. Dependencies among multiple plugins are also discussed, and an approach to dependency resolution is provided. Finally, some implementation challenges involved in designing plugins are presented.

Chapter 5 provides case studies involving various plugins. Included there are seven plugins that provide novel checkpointing solutions for their corresponding subsystems. Some application-specific plugins are also demonstrated, along with several plugins that provide virtualization of kernel resource identifiers in user space.

Chapter 6 then turns to the double-paging problem. As with the core issue in checkpoint-restart, here also one is presented with distinct subsystems that must be combined in a unified virtualization scheme. The core problem is described and motivated, and a design and implementation of a solution are presented. We also discuss some of the side effects of the proposed solution and finally present an evaluation.
Chapter 7 provides some new directions and applications of checkpoint-restart to non-traditional use cases that can be pursued based on this dissertation, with a conclusion presented in Chapter 8. Finally, a plugin tutorial is presented in Appendix A, providing a concrete view of the plugin API.
CHAPTER 2
Concepts Related to Checkpoint-Restart and Virtualization
This dissertation intersects with four broad areas. The first is checkpoint-restart at the process level. The second concerns system/library call interpositioning for modifying process behavior. The third concerns process-level virtualization. The fourth concerns the double-paging problem in the context of virtual machines. The literature for the first three areas is reviewed here, whereas the related work for the double-paging problem is discussed in Chapter 6. Since this work builds on the DMTCP software package, a brief overview of the legacy DMTCP software (DMTCP version 1) is also provided.
2.1 Checkpoint-Restart
Checkpoint-restart has a long history, with several mechanisms proposed over the years [90, 97, 98, 35]. It is often used for process migration, for load balancing, for fault tolerance, and so on [34]. The work of Milojičić et al. [81] provides a review of the field of process migration. Egwutuoha et al. [35] provides a survey of various checkpoint/restart implementations in high performance computing. The website checkpointing.org also lists several checkpoint-restart systems. There are three primary approaches to checkpointing: virtual machine snapshotting, application-level checkpointing, and transparent checkpointing.
Virtual machine snapshotting
Virtual machine (VM) snapshotting is a form of checkpointing for virtual machines and is often used for virtual machine migration. A complex application is treated as a black box, and its application surface is expanded to include the entire guest physical memory, operating system state, devices, etc. Checkpointing an application involves saving everything inside the application surface (i.e., the entire virtual machine). While this technique is general and has been discussed quite extensively [80], it is also slower and produces larger checkpoint images because the checkpoint module is unable to exclude unnecessary parts of guest physical memory from the application surface. Hence, it is not commonly used as a mechanism for checkpoint-restart.
Application-level checkpointing
Application-level checkpointing is the simplest form of checkpointing. The developer of the application inserts checkpointing code directly inside the application to save the process state, such as data structures, to a file on disk that is later used to resume the computation. This is application-specific and requires extensive knowledge of the application. The knowledge of the application internals provides complete flexibility, but places a larger burden on the end user. There are several techniques [129] and frameworks that provide tools to assist in application-level checkpointing. Examples include pickling for Python [120] and Boost serialization [108] for C++. A somewhat lighter mode of application-level checkpointing is the save/restore workspace feature for interactive sessions. Notably, Bronevetsky et al. have applied this to shared memory parallelism in the context of OpenMP [24, 25] and distributed parallelism in the context of MPI [100, 23], where they provide tools to lighten the end-user burden of writing checkpointing code.
The rest of this section focuses on several varieties of transparent checkpointing, in which the end user does not need to make any changes to the target application.
Transparent checkpointing
This is sometimes called system-level or system-initiated checkpointing. It is the ability to checkpoint an application without making any changes to the application source or binary. The history of transparent checkpointing extends back at least to 1990 [73]. While there are many systems that perform single-process checkpointing [91, 33, 89, 92, 73, 74, 29, 1, 3, 76], we will focus on systems that support multiple processes and/or distributed processes. Transparent system-level checkpointing techniques can be further broken down into kernel-level and user-level checkpointing. The two techniques are discussed in Sections 2.1.1 and 2.1.2, respectively.
2.1.1 Kernel-Level Transparent Checkpoint-Restart
In kernel-level checkpointing, the operating system is modified to support checkpointing for applications. This approach leads to checkpoints being more tightly coupled to kernel versions. While there have been several such kernel-level packages, the need to support multiple kernel versions makes them difficult to maintain. It also makes future ports to other operating systems more difficult.
The Zap system and its derivatives
As an extension of CRAK (Checkpoint and Restart as a Kernel Module) [139], Zap [86, 67] implements checkpoint-restart using a kernel module. Zap can be considered a precursor to Linux Containers (LXC) [117], as it also provides a virtualized view of the kernel resources. Zap uses a pod (process domain) abstraction that provides a group of processes with a consistent virtualized view. The pod abstraction virtualizes kernel resource identifiers to present a pod-specific view. This isolates the processes from the external world and provides a conflict-free environment when migrating processes to other nodes. The downside of this implementation is the inability of processes inside a pod to communicate with processes outside the pod. Zap intercepts all system calls operating on the virtualized kernel resource identifiers, translating their arguments and return values as needed. System call interception is required for all processes in the system, and thus poses a runtime overhead even for processes outside the pods.
Zap was later extended to support distributed network applications by Laadan et al. [68] to create ZapC and by Janakiraman et al. [59] to create CRUZ. The key enhancement was support for virtualization of the network layer to decouple the processes from the node they are running on. This allowed these systems to checkpoint-restart distributed computations over a cluster. For ZapC, network virtualization was achieved by inserting hooks into the network stack using netfilter. The source and destination addresses were translated between virtual and real addresses for both incoming and outgoing network packets.
The work of this dissertation is based entirely in user space and doesn’t require any kernel modification or kernel modules. As explained by Laadan [66], the kernel-module-based approach incurs a burden both on users, because it is cumbersome to install, and on developers, because maintaining it on top of quickly changing upstream kernels is a Sisyphean task and development quickly falls behind. Further, user-space virtualization poses no runtime overhead for processes that are not part of the computation being checkpointed. Finally, this work can be used to virtualize agents/processes/services outside the kernel. Examples include SSH connections, network daemons, and device drivers.
Berkeley Lab Checkpoint Restart (BLCR)
BLCR [52] is another widely used checkpointing system that is implemented as a kernel module. It is used primarily in high performance computing. BLCR is often used along with MPI libraries to checkpoint a distributed computation. BLCR does not have any support for virtualization and may fail if a kernel resource identifier (such as a pid) is not available at the time of restart. It also relies on MPI daemons to handle changed network addresses, mount points, etc. However, if the application has cached a directory name from before checkpoint and tries to open it after restart, it may fail.
Another notable kernel-based system was Chpox by Sudakov et al. [105]. Initially, Chpox was implemented as a kernel module for Linux 2.4, whereas a later version for Linux 2.6 required base kernel modifications as well.
Pure kernel-level approaches
A more recent attempt by Laadan et al. [68] also implemented a single-host in-kernel solution. It consisted of some user-space utilities and a series of patches to the Linux 2.6 kernel to add checkpoint support in the mainline kernel itself. This was proposed for inclusion in the Linux kernel, but ultimately not accepted due to its invasive approach that touched/modified a large number of kernel subsystems [8].

2.1.2 User-Level Transparent Checkpoint-Restart
User-level checkpointing works without any changes to the operating system kernel. The use of published APIs (e.g., POSIX and the Linux proc filesystem) to communicate with the kernel and to perform checkpoint-restart makes it highly stable.
Checkpointing library
The ground-breaking work of Plank et al. [92] on Libckpt uses a library to do the checkpointing, and the application program is linked against this user-level library. Similar techniques are used by Condor [76]. These techniques are not completely transparent to the user, as the application code is modified, recompiled, and relinked with the dynamic library. However, the amount of code change is often fairly small (e.g., for Libckpt, the application programmer needs only to rename main() to ckpt_target()). The main disadvantage of such systems is the restriction imposed on the operating system features, such as interprocess communication, that the application program can use [109]. Further, these systems do not support process trees or distributed computations.
Distributed checkpointing with MPI
Although application-level checkpointing for distributed programs dates back at least to 1997 [17], most practical systems were built around MPI-based distributed computations for supporting high performance computing. They use hooks or callback functions for specific MPI implementations [31, 54, 137, 138, 104, 21, 133, 49, 52, 99]. (MPI, the Message Passing Interface, is a standard for message-based distributed high performance computation.) Most MPI implementors chose to build a custom checkpoint-restart service. This came about when InfiniBand became the preferred network for high performance computing, and there was still no package for transparent checkpointing over InfiniBand. Examples of checkpoint-restart services can be found in Open MPI [54, 55], LAM/MPI [99] (now incorporated into MVAPICH2 [77, 41]), MPICH-V [22], and MVAPICH2 [41], as well as a fault-tolerant “backplane”, CIFTS [51]. Each checkpoint-restart service would disconnect from the network prior to checkpoint and re-connect after restart. Hence, while the network was disconnected, the MPI checkpoint-restart service was able to delegate single-host checkpointing to the BLCR [52] kernel module. This created an extra layer of complication, but it was unavoidable at that time, due to the lack of support for transparent checkpointing over InfiniBand. On restart, the network connections are restored and the checkpointer is called upon to restore the user processes. Since such a service works at the MPI level, its ability to adapt to the environment outside of MPI is limited, and it generally proves difficult to maintain.
Bronevetsky et al. produced a novel application-level checkpointing design for the special case of MPI [23]. In this approach, a pre-compiler instruments the application MPI code with additional information needed for checkpointing, thus coming close to the ideal of transparent checkpointing. The application programmer then adds code indicating valid points in the program for a potential checkpoint. The use of a pre-compiler relieved much of the burden of adding application-specific code to support checkpointing.
Cryopid
Cryopid [18] and Cryopid2 [85] use the ptrace system call to attach to a running process and create a core dump of the application process that is later used to restart the computation. The checkpointable features supported are quite limited compared to other checkpointing packages, and adding a new feature is often harder.
Checkpoint Restart In Userspace (CRIU)
CRIU [111] is a more recent checkpointing package based on Linux Containers (LXC) [117]. Its support is restricted to process trees and containers. The Linux kernel API was extended with new kernel features to support the user-space tool. Like Cryopid, it also uses the ptrace system call to inject checkpointing code inside the user processes. The checkpointing code executes in the context of a process to gather all the relevant information using the extended kernel API. Due to security issues, the checkpointing capability is only available to users with the CAP_SYS_ADMIN capability. (The CAP_SYS_ADMIN capability is a successor to the Linux setuid-root feature that is used to grant admin privileges to select applications/processes.)
Distributed MultiThreaded Checkpointing (DMTCP)
DMTCP version 1 [7] is implemented using user-space shared libraries. The original DMTCP supported TCP sockets, but was limited in that it did not support distributed computations communicating over ssh or InfiniBand. Further, even in the single-host case, it did not support virtualization of such kernel resources as pids, System V IPC, POSIX and System V shared memory, and POSIX timers. Section 2.4 provides a brief background on the architecture and the working of DMTCP version 1.

This work represents a rewrite of the original DMTCP [7], in order to introduce user-space process virtualization for checkpointing the external environment. This enables us to checkpoint a wide variety of applications. The virtualization layer is implemented completely in user space with minimal overhead. Process virtualization goes beyond virtualizing the kernel resource identifiers and can be used to virtualize even higher-level constructs and abstractions such as the SSH protocol, as discussed in Chapter 3. Table 2.1 summarizes the differences between this work and the prominent transparent checkpointing packages.
Table 2.1: Comparison of various checkpointing systems (BLCR, Zap, CRIU, Cryopid2, DMTCP version 1, and the extensible checkpointing system of this work) along five dimensions: support for multi-host computations, virtualization of kernel resources, virtualization of other resources, application-specific tuning, and third-party plugins. The “other resource” virtualization refers to the ability to virtualize protocols, device drivers, etc.
2.1.3 Fault Tolerance
Fault tolerance [70, 58] is a broader concept not discussed here. It enables a system to continue operating properly in the event of a failure of one of its components. Several strategies can be employed to make a system fault tolerant, such as redundancy, partial re-execution, atomic transactions, instrumentation of data, and so on.
2.2 System Call Interpositioning
The concept of wrappers, as implemented in DMTCP, has a long and independent history under the more general heading of interposition. Interposition techniques have been used for a wide variety of purposes [123, 136, 65]. See especially [123] for a survey of a wide variety of interposition techniques. The work of Garfinkel [42] discusses practical problems associated with system call interpositioning. The packages PIN [88] and DynInst [124] are two examples of software packages that provide interposition techniques at the level of binary instrumentation.

2.3 Virtualization
Virtualization is the process of allowing unmodified source code or an unmodified binary to transparently run under varied external environments (different CPU, different network, different graphics server (e.g., X11 server), etc.). Most of the original checkpointing packages [73, 74, 26, 31, 71] ignored these issues and concentrated on homogeneous checkpointing. Virtualization techniques have been developed since the 1960s. Since then, systems have implemented different flavors of virtualization. In this section, we discuss the four types of virtualization techniques in common use today that are closest in spirit to this work.
2.3.1 Language-Specific Virtual Machines
A language-specific virtual machine, sometimes also known as an application virtual machine, a runtime environment, or a process virtual machine, allows an application to execute on any platform without having to write any platform-specific code. This is achieved by creating a platform-independent programming environment that abstracts the details of the underlying hardware or operating system. This abstraction is provided at the level of a high-level programming language. Notable examples include the Java Virtual Machine (JVM) [75], the .NET framework [122], and Android virtual machines (Dalvik) [20, 36]. Language-specific virtual machines are often implemented using an interpreter, with an option of using just-in-time compilation for performance close to that of a compiled language [32].
2.3.2 Process Virtualization
Process virtualization allows a process to be migrated or restarted in a new external environment, while preserving the process’s view of the external world. For example, a kernel may assign to a restarted process a different pid than the original pid at the time of checkpoint. The earliest checkpointing packages had assumed that the targeted user process would not save the value of the pid of a peer process, but rather would re-discover that pid on each use. As software complexity grew, this assumption became unreliable. More recent packages either modified the Linux kernel (e.g., BLCR [52]) or ran inside a Linux Container, a lightweight virtual machine (e.g., CRIU [111]).
Process virtualization (as exemplified by this work) has been considered intensively in the context of checkpointing only recently. Nevertheless, it has important forerunners in process hijacking [136] and in the checkpointing packages [76, 135] used in Condor’s Standard Universe. Similarly, there are connections of process virtualization with dynamic instrumentation (e.g., Paradyn/DynInst [124], PIN [88]).
2.3.3 Lightweight O/S-based Virtual Machines
O/S virtualization allows several isolated execution environments to run within a single operating system kernel. This technique exhibits better performance and density compared to virtual machines. On the downside, it cannot host a guest operating system different from the host operating system, or a different guest kernel (different Linux distributions are fine). Some examples include FreeBSD Jail [61], Solaris Zones [96], Linux Containers (LXC) [117], Linux-VServer [116], OpenVZ [118], and Virtuozzo [119].
Linux Containers are a kernel-level tool for providing a type of virtualization in the form of namespaces for process spaces and network spaces. This provides an alternative approach for such tasks as that of pid virtualization. The CRIU [111] checkpointing system uses LXC namespaces to virtualize kernel resource identifiers within the container. The namespaces avoid the problem of name conflicts for kernel resource identifiers during process migration.
Although process-level virtualization and a Library OS [6, 95, 107] both operate in user space without special privileges, the goal of a Library OS is quite different. A Library OS modifies or extends the system services provided by the operating system kernel. For example, Drawbridge [95] presents a Windows 7 personality, so as to run Windows 7 applications under newer versions of Windows. Similarly, the original exokernel operating system [37] provided additional operating system services beyond those of a small underlying operating system kernel, and this was argued to often be more efficient than a larger kernel directly providing those services.
2.3.4 Virtual Machines
Hardware virtualization uses an abstract computing platform. Thus, it hides the hardware platform (the host software). On top of the host software, a virtual machine (guest software) is running. The guest software executes as if it were running directly on the physical hardware, with a few restrictions, such as access to the network, display, keyboard, and disk storage. Examples of virtual machines include VMware, Qemu/KVM [114], Xen [15], VirtualBox [130], and Lguest [115]. The virtual machines often run a set of tools inside the guest operating system to inspect and control its behavior. Further, in some cases the guest operating system is modified to provide additional support/features; this technique is referred to as paravirtualization. Some notable examples of paravirtualization are Xen [15] and Microsoft Hyper-V [125].
One could also include binary instrumentation techniques such as PIN [88] and DynInst [124] in a discussion of virtualization, but these tend not to be used much with checkpointing.
The work of this thesis introduces process virtualization for abstractions beyond the traditional kernel resource identifiers, in order to virtualize numerous external subsystems such as SSH connections, the InfiniBand network, KVM and Tun/Tap interfaces, SLURM and Torque batch queues, and GPU drivers. The modular approach to virtualizing these external subsystems allows the checkpointing system to grow organically (see Chapter 4). By virtualizing these external environments, this work enabled some projects to be the “first” to support checkpointing.
2.4 DMTCP Version 1
DMTCP (Distributed MultiThreaded CheckPointing) is free, open source software (http://dmtcp.sourceforge.net, LGPL license) and traces its roots to early 2005 [30]. The DMTCP approach has always insisted on not making modifications to the kernel, and not requiring any root (administrative) privileges. While this was sometimes more difficult than an approach with full privileges inside the kernel, it integrates better with complex cyber infrastructures. DMTCP’s lack of administrative privilege also provides a level of security assurance. As a side effect of working completely in user space, DMTCP relies only on the published APIs (e.g., POSIX and the Linux proc filesystem) to perform checkpoint-restart. Thanks to the highly stable kernel API, the same DMTCP software can be used on Linux kernels ranging from the latest bleeding-edge release back to Linux 2.6.5 (released in April 2004). In this section, we provide only a brief overview of the checkpoint-restart mechanisms of DMTCP. More details can be found in Ansel et al. [7]. Using DMTCP with an application is as simple as:
    dmtcp_launch ./myapp arg1 ...

    # From a second terminal window:
    dmtcp_command --checkpoint
    dmtcp_restart ckpt_myapp_*.dmtcp
This checkpoint image contains a complete standalone image of the application with all the relevant information required to restart it later. It can be replicated and migrated as needed. DMTCP also creates a restart script to help automate restart of a distributed computation.
[Figure: a DMTCP coordinator exchanges checkpoint messages over socket connections with the checkpoint thread in each user process; within each user process, the checkpoint thread quiesces the user threads via the SIGUSR2 signal.]

Figure 2.1: Architecture of DMTCP
As seen in Figure 2.1, a computation running under DMTCP consists of a centralized coordinator process and several user processes. The user processes may be local or distributed. User processes may communicate with each other using sockets, shared memory, pseudo-terminals, etc. Further, each user process has a checkpoint thread which communicates with the coordinator. The checkpoint thread is created by the DMTCP library dmtcphijack.so, which is loaded into each of the application processes at startup (before calling the application’s main() function) by using the LD_PRELOAD feature of the loader. The DMTCP library installs a signal handler for the checkpoint signal that is later used to quiesce user threads. The checkpoint thread is responsible for creating checkpoint images as and when requested by the coordinator.

2.4.1 Library Call Wrappers
The DMTCP library adds wrappers around a small number of libc functions. For efficiency reasons, it avoids wrapping any frequently invoked system calls such as read and write. The wrappers are used to gather information about the current process and to track all forked child processes as well as remote processes created via SSH, and to automatically put them under checkpoint control. The local child processes inherit the LD_PRELOAD environment variable, whereas for the remote child processes, the command line is modified to launch them under DMTCP control. In the case of sockets, DMTCP needs to know whether the sockets are TCP/IP sockets (and whether they are listener or non-listener sockets), UNIX domain sockets, or pseudo-terminals. Again, it uses wrappers around socket, connect, accept, open, close, etc., to do that.
2.4.2 DMTCP Coordinator
DMTCP uses a stateless centralized process, the DMTCP coordinator, to synchronize the separate phases at the time of checkpoint and restart. The checkpoint threads communicate with the DMTCP coordinator through a socket connection. A checkpoint can be initiated by the coordinator on an explicit request from the user through its interactive interface, through the dmtcp_command utility, or on expiration of a predefined checkpoint interval. It should be noted that the coordinator is a single point of failure, since the entire computation relies on it.
2.4.3 Checkpoint Thread
The checkpoint thread waits for a checkpoint request from the coordinator. On receiving a checkpoint request, the checkpoint thread quiesces the user threads (by sending a checkpoint signal) and takes the process through the phases of creating a checkpoint image. Similarly, during restart, it takes the process through the restart phases and finally un-quiesces the user threads. The checkpoint thread is dormant during the normal execution of the process and is only active during the checkpoint/restart procedures.
2.4.4 Checkpoint
On receiving the checkpoint request from the coordinator, the checkpoint thread sends the checkpoint signal to all the user threads in the process. This quiesces the user threads by forcing them to block inside a signal handler previously installed by DMTCP. The checkpoint image is created by writing all of user-space memory to a checkpoint image file. Each process has its own checkpoint image. Prior to creating the checkpoint image, the checkpoint thread also copies into user-space memory any kernel state that is required to restart the process, such as the state associated with network sockets, files, and pseudo-terminals.

At the time of checkpoint, all of user-space memory is written to a checkpoint image file. The user threads are then allowed to resume executing application code. Note that user-space memory includes all of the run-time libraries (libc, libpthread, etc.), which are also saved in the checkpoint image. DMTCP doesn’t directly handle asynchronous DMA operations that may be pending or ongoing at the time of checkpoint. This could result in an inconsistent checkpoint state, as the “quiesce” property has been violated.
2.4.5 Restart
As the first step of the restart phase, DMTCP groups all restart images from a single node under a single dmtcp_restart process. The dmtcp_restart process recreates all file descriptors. It then uses a discovery service to discover the new addresses for processes migrated to new hosts and restores network connections. It then forks a child process for each checkpoint image. These individual processes then restore their memory areas. Next, the user threads are recreated using the original thread stacks. All user threads restore their pre-checkpoint context using longjmp and are forced to wait in the signal handler. The checkpoint thread then restores the kernel state that was saved during the checkpoint phase. Finally, the checkpoint thread un-quiesces the user threads, and the user threads resume executing application code.
2.4.6 Checkpoint Consistency for Distributed Processes
In the case of distributed processes, one needs to determine a consistent global state of the asynchronous system at the time of checkpoint. The notion of the global state of the system was formalized by Chandy and Lamport [28]. The central idea is to use marker (snapshot) messages. A process that wants to initiate a checkpoint records its local state and sends a marker message on each of its outgoing channels. All other processes save their local state on receiving the first marker message on some incoming channel. For every other channel, any messages received before the marker message were obviously sent before the snapshot “cut off”. Hence they are included in the local snapshot.
Chandy and Lamport were primarily concerned with “uncoordinated snapshots” (no centralized coordinator). DMTCP employs a strategy of “coordinated snapshots” using a global barrier. This makes the implementation of Chandy-Lamport consistency particularly easy, since messages can be sent only prior to the global barrier. Processes are “quiesced” (frozen) at the barrier. Next, the checkpoint thread of each process receives all pending data in the network, after which a globally consistent snapshot is taken. The details of the DMTCP implementation follow.
To initiate a checkpoint, the coordinator broadcasts a quiesce message to each process in the computation. On receiving the message, the checkpoint manager thread in each process quiesces the user threads, sends an acknowledgement to the coordinator, and waits for the drain message. After receiving acknowledgements from all processes, the coordinator lifts the global barrier and broadcasts the drain message. On receiving the drain message, the checkpoint manager thread sends a special cookie (marker message) through the “send” end of each socket. Next, it reads data from the “receive” end of each socket until the special cookie is received. Since user threads in all the processes have already been quiesced, there can be no more in-flight data. The received in-flight data has now been copied into user-space memory, and will be included in the checkpoint image.

On restart, once the socket connections have been restored, the checkpoint manager thread sends the saved in-flight data (previously read from the “receive” end of the socket) back to its peer processes. The peer processes then refill the network buffers by pushing the data back into the network through the “send” end of each restored socket connection. The checkpoint manager thread then sends a message to the coordinator to indicate the end of the refill phase and waits for the resume message. Once the coordinator has received messages indicating the end of the refill phase from all involved processes, it lifts the global barrier and broadcasts the resume message. On receiving the resume message, the checkpoint manager un-quiesces the user threads, and they resume executing user code.

CHAPTER 3
Adaptive Plugins as a Mechanism for Virtualization
This chapter introduces several important examples of the need to integrate checkpointing with an external subsystem: pid virtualization, SSH virtualization, virtualization of the InfiniBand network, virtualization of OpenGL, and virtualization of POSIX timers. The concept of process virtualization is introduced through concrete examples. Virtualization of InfiniBand [27] and OpenGL [62] were extensive projects requiring much domain knowledge. The specific results represent long-standing open problems and are not part of this dissertation. We use those examples to motivate the need for process virtualization, and we use them to argue for the expressivity of process virtualization in Chapter 5.
3.1 The Ever Changing Execution Environment
In the next subsections, five examples of strategies for process virtualization are described, in order to make clear the rich design space available for process virtualization. In each of these cases, the nature of its virtualization requirement is unique. The five examples are:
1. virtualization of kernel resource identifiers, using the example of process id (pid) (Section 3.1.1);
2. virtualization of protocols, using the SSH protocol as its example (Section 3.1.2);
3. a shadow device driver approach for transparent checkpointing over InfiniBand (Section 3.1.3);
4. a record-replay approach, using transparent checkpointing of OpenGL 3D-graphics as an example (Section 3.1.4); and
5. adapting to application requirements for more control over checkpointing (Section 3.1.5).
3.1.1 PID: Virtualizing Kernel Resource Identifiers
Pid is one of the simplest examples of a kernel resource identifier that needs virtualization. The operating system kernel is unlikely to assign the same pid on restart as existed at the time of checkpoint. Even if the kernel were to provide a mechanism to request a particular pid, the requested pid might already be in use (assigned to a different process). If the target application has saved the pre-checkpoint pid and tries to use it after restart, it could have undesired effects. For example, if the process uses the saved pid to send a signal after restart, in the best case, the operation will fail because the saved pid is invalid. In the worst case, the saved pid might correspond to some other process, and the signal will be sent to that other process.

To avoid these situations, we must provide a mechanism such that processes can continue to use the saved pid after restart without any undesired side effects. This can be done by providing the application process with a virtual pid that never changes for the duration of the process lifetime. When communicating with the kernel, the corresponding real pid that the kernel knows about is looked up in the translation table and passed on to the kernel. Figure 3.1 shows a simple schematic of a translation layer between the user processes and the operating system kernel, along with a pid translation table to convert between virtual and real pids. At each restart, the translation table is refreshed to update the real pids.

Figure 3.1: Virtualization of kernel resource identifiers (example shown for process id). [Schematic: user processes issue getpid()/kill() with virtual pids 4000 and 4001; the translation table maps virtual pids 4000 and 4001 to real pids 2652 and 3120 before the calls reach the kernel.]
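To make the mechanism concrete, the translation table can be sketched as a small C structure. The sketch below is illustrative only: the names pid_table, virt_to_real, and pid_table_refresh are invented for this example, and a fixed-size array stands in for whatever data structure a real implementation would use. The virtual/real pairs mirror those of Figure 3.1.

```c
#include <assert.h>
#include <sys/types.h>

/* Illustrative pid translation table. The virtual pid stays fixed for
 * the life of the process; only the real column changes on restart. */
#define MAX_PIDS 64

struct pid_entry { pid_t virt; pid_t real; };
static struct pid_entry pid_table[MAX_PIDS];
static int pid_count = 0;

static void pid_table_add(pid_t virt, pid_t real) {
    pid_table[pid_count].virt = virt;
    pid_table[pid_count].real = real;
    pid_count++;
}

/* Translate an application-visible (virtual) pid to the real pid
 * currently known to the kernel; fall back to the identity mapping. */
static pid_t virt_to_real(pid_t virt) {
    for (int i = 0; i < pid_count; i++)
        if (pid_table[i].virt == virt) return pid_table[i].real;
    return virt;
}

/* At restart, refresh only the real column as threads/processes are
 * recreated; the application continues to see the old virtual pid. */
static void pid_table_refresh(pid_t virt, pid_t new_real) {
    for (int i = 0; i < pid_count; i++)
        if (pid_table[i].virt == virt) pid_table[i].real = new_real;
}
```

A wrapper around kill(), for instance, would call virt_to_real() on its pid argument before invoking the real library function (cf. Listing 4.1).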
3.1.2 SSH Connection: Virtualizing a Protocol
Pid virtualization is a classic example of virtualizing low-level kernel resource identifiers using a translation layer. However, the same solution does not suffice for higher-level abstractions, such as an SSH connection.
Figure 3.2: SSH connection: ssh Node2 app2. The user process, app1, forks a child SSH client process (ssh) to call the SSH server (sshd) on the remote node to create a remote peer process, app2. [Schematic: on Node1, app1 communicates over stdio with the SSH client (ssh); the client connects over a socket to the SSH server (sshd) on Node2, which communicates over stdio with app2.]
Recall that the ssh command operates by connecting across the network to a remote SSH daemon, sshd, as shown in Figure 3.2. Since the SSH daemon is privileged, it is not possible for the unprivileged user-space checkpointing system to start a new SSH daemon during restart. The issue becomes even more complicated when the client and server processes are restarted at entirely different network addresses on different hosts. For virtualizing an SSH connection, it does not suffice to virtualize just the network address. Instead, one must virtualize the entire SSH client-server connection. In essence, the SSH daemon represents a privileged process running a certain protocol. Regardless of whether the protocol is an explicit standard or a de facto standard internal to the subsystem, process virtualization must virtualize that protocol. Checkpointing and restarting the privileged SSH daemon is not an option.
Figure 3.3: Virtualizing an SSH connection: ssh Node2 app2. The call to launch an SSH client process is intercepted to launch virtual ssh client (virt_ssh) and server (virt_sshd) processes. virt_ssh and virt_sshd are unprivileged processes. [Schematic: on each node, the virt_ssh/virt_sshd helper sits between the application (app1/app2) and the real ssh/sshd process, communicating over stdio, while ssh and sshd remain connected by a socket.]
Process virtualization provides a principled and robust algorithm for transparently checkpointing an SSH connection. As shown in Figure 3.3, the SSH connection is virtualized by creating virt_ssh and virt_sshd helper processes that shadow the SSH client and server processes, respectively. The virt_ssh and virt_sshd processes are owned by the user and are placed under checkpoint control. The ssh and sshd processes are not checkpointed.
On restart, the user processes are restored along with the virt_ssh and virt_sshd processes (without the underlying SSH connection) on new hosts. The virt_ssh process then recreates a new SSH connection (see Section 5.4).
3.1.3 InfiniBand: Virtualizing a Device Driver
Both ssh for a traditional TCP network and the newer InfiniBand network are intimately connected with high-performance implementations of MPI (Message Passing Interface). An MPI implementation usually retains ssh and TCP support in addition to InfiniBand support, since typical MPI implementations bootstrap their operation through ssh in order to create additional MPI processes (MPI ranks) and to exchange InfiniBand addresses among peers.
InfiniBand virtualization has been a particular challenge both due to its complexity [134, 63, 16] and due to the fact that much of the state is hidden either within a proprietary device driver or within the hardware itself. The solution here is to use a shadow device driver approach [106]. The InfiniBand plugin (§5.10) maintains a replica of the device driver and hardware state by intercepting and recording the InfiniBand library calls. On restart, this replica is used to recreate and restore the state of the InfiniBand connection.

3.1.4 OpenGL: A Record/Replay Approach to Virtualizing a Device Driver
Scientific visualization is yet another example that requires a different kind of virtualization solution. Some graphics computations are extremely GPU-intensive. Further, most scientific visualizations today use OpenGL for 3D graphics. If a scientist walks away from a visualization and needs to restart it the next day, time is wasted reproducing it. Further, switching between multiple scientific visualizations becomes extremely inefficient. Hence, checkpoint-restart is a critical technology. However, it is difficult to checkpoint, because much of the graphics state is encapsulated in a vendor-proprietary hardware GPU chip. The OpenGL plugin (§5.9) achieves checkpoint-restart of 3D graphics by using a process virtualization strategy of record (record all OpenGL calls), prune (prune any calls not needed to reproduce the most recent graphics state), and replay (replay the calls during restart in order to place the GPU into a state semantically equivalent to the state that existed prior to checkpoint).
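The record-prune-replay strategy can be illustrated with a toy model in which graphics state is reduced to numbered state slots. The sketch below is not the OpenGL plugin's implementation: real pruning must reason about OpenGL semantics (textures, display lists, vertex buffers), whereas this toy merely keeps the last setting of each slot and drops draw calls whose effect is assumed not to be needed to reconstruct the current state.

```c
#include <assert.h>

/* Toy record/prune log for a state-machine API (names are invented). */
enum call_kind { SET_STATE, DRAW };
struct gl_call { enum call_kind kind; int state_slot; int value; };

#define LOG_MAX 256
static struct gl_call log_buf[LOG_MAX];
static int log_len = 0;

/* Record phase: every intercepted call is appended to the log. */
static void record(enum call_kind k, int slot, int value) {
    log_buf[log_len++] = (struct gl_call){k, slot, value};
}

/* Prune phase: only the last SET_STATE per slot is needed to
 * reproduce the most recent state; earlier settings are superseded.
 * Replay would then re-issue the surviving calls at restart. */
static int prune(struct gl_call *out) {
    int n = 0;
    for (int i = 0; i < log_len; i++) {
        if (log_buf[i].kind != SET_STATE) continue;
        int superseded = 0;
        for (int j = i + 1; j < log_len; j++)
            if (log_buf[j].kind == SET_STATE &&
                log_buf[j].state_slot == log_buf[i].state_slot) {
                superseded = 1;
                break;
            }
        if (!superseded) out[n++] = log_buf[i];
    }
    return n;
}
```

In the toy, recording four calls of which two set the same slot prunes down to one surviving setting per slot, which is exactly the list that a replay phase would re-issue.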
3.1.5 POSIX Timers: Adapting to Application Requirements
A POSIX timer is an external resource maintained within the kernel and has an associated kernel resource identifier known as a timer id. As with pid virtualization, the timer id needs to be virtualized as well and can use the same strategy. Consider a process that is checkpointed while a timer is still armed, i.e., the timeout specified with the timer has not yet expired. On restart, what is the desired behavior? Should the timer expire immediately, or should it expire after exhausting the remaining timeout period? There is no single correct answer, as the desired result is application dependent. For an application that is waiting for a response from a web server, it is desirable to expire the timer on restart. However, for an application process that is monitoring a peer process for potential deadlocks, the timer should continue for the remaining time period.
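A plugin can expose this choice as a per-timer policy. The sketch below is hypothetical (the struct, enum, and function names are invented for illustration); it shows only the decision made at restart time, after the remaining timeout has been captured at checkpoint.

```c
#include <assert.h>

/* Illustrative per-timer restart policy for a timer plugin. */
enum timer_policy { EXPIRE_ON_RESTART, RESUME_REMAINING };

struct virt_timer {
    int virt_id;               /* virtual timer id seen by the app   */
    long remaining_ns;         /* remainder captured at checkpoint   */
    enum timer_policy policy;  /* chosen per application requirement */
};

/* Returns the timeout with which to arm the recreated kernel timer:
 * 0 fires it immediately; otherwise the captured remainder is used. */
static long restart_timeout_ns(const struct virt_timer *t) {
    return t->policy == EXPIRE_ON_RESTART ? 0 : t->remaining_ns;
}
```

A web-client timer would be created with EXPIRE_ON_RESTART, while a deadlock-monitoring timer would use RESUME_REMAINING.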
3.2 Virtualizing the Execution Environment
As seen in the previous section, it is imperative to virtualize external resources in order to fully support checkpoint-restart for arbitrary applications. To be successful, virtualization should be done transparently to the application. This assumes that the application interacts with the external resource through a fixed set of APIs. The two basic requirements for virtualizing an external resource for checkpointing are:
1. Virtualize external subsystems.
2. Capture/restore the state of external resources.
Next, we discuss each of these requirements, elaborate on their importance, and describe what additional features are required for a complete virtualization solution.
3.2.1 Virtualize Access to External Resources
Since external resources may change between checkpoint and restart, we need to virtualize them. This can be achieved through a translation layer between the application process and the resource. Virtualizing a resource may be as simple as translating between virtual and real identifiers, as in pid virtualization (Section 3.1.1), or it may involve more sophisticated mechanisms such as shadow device drivers (Section 3.1.3). Depending upon the external resource, the translation may be active throughout the computation (e.g., for pids) or only during the restart procedure (for SSH). Further, the translation layer should ensure that access to a resource is atomic with respect to checkpoint-restart, i.e., a checkpoint shouldn't be
allowed while the process is in the middle of manipulating/accessing the resource. Not doing this may result in an inconsistent state at restart. Consider pid virtualization, where a thread tries to send a signal to another thread using the virtual tid (thread id). The pid virtualization layer translates the virtual tid to the real tid and sends the signal using the real tid. Further, consider that the process is checkpointed after the translation from virtual to real, but before the signal is actually sent. On restart, the process will resume and will try to send the signal with the old real tid, which of course is no longer valid.
Share the virtualized view with peers
Virtualizing access to external resources gets complicated in a distributed environment. Processes communicate with their peers, which demands a consistent virtualization layer across all involved parties. This becomes most evident after restart, when the translation table is updated to reflect the current view of the external resource. These updates must be shared with all the peer processes to allow them to update their own translation tables. For example, in the case of network address virtualization, each process must inform its peers of its new network address on restart to allow them to restore socket connections.
3.2.2 Capture/Restore the State of External Resources
When restarting a process from a previous checkpoint, we need to restore the process's view of the external resource. We need to identify the relevant information that would be required to restore/recreate the external resource during restart. This information should be gathered at the time of checkpoint and saved as part of the checkpoint image. It can then be read from the checkpoint image on restart.
Quiesce the external resource
During checkpoint, the external resources should be quiesced to ensure a consistent state. For example, an asynchronous disk read operation must be allowed to finish before writing the process memory to the checkpoint image, to avoid data corruption due to ongoing memory updates (DMA).
Consistency of the computation state
As discussed above, a virtualization scheme should be transparent to the user application. Thus, the application's view of the external resource should be consistent before and after checkpoint. Similarly, the application process should not observe any change in its own state before and after checkpoint. This involves preserving the state of the running process (e.g., threads, memory layout, and file descriptors) between checkpoint and restart. Note that it is acceptable to alter the process state and/or the state of the external resource while performing checkpoint-restart. However, such changes should be reverted, and the pre-checkpoint view of the application should be restored, before the application process is allowed to resume executing application code.
3.3 Adaptive Plugins as a Synthesis of System-Level and Application-Level Checkpointing
So far we have discussed the motivation for virtualizing the execution environment along with the basic requirements for achieving it. In this section we discuss possible design choices.

There are two basic approaches for achieving the goals discussed in Section 3.2. One is to use application-specific checkpointing by having the application developer write extra code to support checkpointing. However, as discussed in Section 2.1, this is not an ideal solution, as it requires knowledge of the internals of the applications and puts a burden on the developer. The second approach is to use an existing monolithic checkpointing system such as DMTCP version 1 and insert the virtualization code into it, along with a large number of heuristics to satisfy a variety of application needs (e.g., heuristics for POSIX timers as discussed in Section 3.1.5). However, there is no universal set of heuristics that can be used with all applications, as each application requires specific heuristics to cater to its needs.

In this work, we present adaptive plugins as an ideal compromise between these two extreme approaches to meet the virtualization requirements. An adaptive plugin is responsible for virtualizing a single external resource. By basing plugins on top of a transparent checkpointing package such as DMTCP, the simplicity of transparent checkpointing is maintained. With plugins, no target application code is ever modified, yet they enable application-specific fine tuning of checkpoint-restart. We have already seen examples where the external resource needs to be virtualized in previous sections. The POSIX timer plugin is an example of an application-specific heuristic plugin. A memory cutout plugin that reduces the memory footprint of the process in order to reduce checkpoint image size would be yet another example of an application-specific plugin.

CHAPTER 4
The Design of Plugins
In the previous chapter, we discussed several use cases that require virtualization of external resources in order to support checkpoint-restart. External resources may include, but are not limited to, kernel resource identifiers, protocols, and hardware device drivers. We further listed the two basic requirements for virtualizing an external resource and discussed how a design based on adaptive plugins is well suited to such tasks.

Section 4.1 introduces a basic framework of a plugin architecture that provides the same set of services for virtualizing external resources that were introduced informally in Chapter 3. A plugin is an implementation of the process virtualization abstraction. In process virtualization, an external subsystem is virtualized by a plugin. All software layers above the layer of that plugin see a modified subsystem.

Section 4.2 then uses these requirements to provide a design recipe for virtualization through plugins. Section 4.3 then takes into account the issue of dependencies among multiple plugins within the same application process. Section 4.4 extends that design recipe to multiple processes, including distributed processes on multiple hosts. Section 4.5 describes three special-purpose plugins that are required for checkpointing all processes. This chapter concludes with Section 4.6, containing some implementation challenges.
Figure 4.1: Plugin Architecture. [Schematic: the target application (program+data) sits above library wrappers that virtualize resources and capture/restore state; internal and third-party plugin libraries (coordinator interface plugin, thread plugin, memory plugin, base plugin) sit above the plugin engine and the runtime libraries (libc, etc.), all running on top of the operating system kernel.]
4.1 Plugin Architecture
An application consists of program and data. It interacts with the execution environment through various libraries. For example, the libc runtime library provides access to the kernel resources, a device driver library may provide access to the underlying device hardware, and so on. Thus one can imagine virtualizing the execution environment by intercepting the relevant library calls. This allows us to inspect and modify the behavior of the underlying subsystem as seen by the application.
Figure 4.1 shows a high-level view of the plugin architecture. It has two main components: (1) plugins, and (2) the plugin engine. Plugins and the plugin engine are implemented as separate dynamic libraries. They are loaded into the application using the LD_PRELOAD feature of the Linux loader.
Plugin
A plugin is a checkpoint subsystem that virtualizes a single external resource or subsystem with the help of function wrappers (§4.1.1). It saves/restores the state of the external subsystem. Examples of external subsystems are: process id, network sockets, InfiniBand, etc. Application processes are treated as if they were independent, and inter-process communication through pids, sockets, etc. is handled through plugins. Further, a plugin is transparent to the target application and can be enabled/disabled for the application as needed. Finally, third parties can write orthogonal customized plugins to fit their needs.
Plugin Engine
The plugin engine provides event notification services (§4.1.2) to assist plugins in capturing/restoring the state of their specific external resources. It further interacts with a coordinator interface plugin to provide publish/subscribe services (§4.1.3) that enable plugins to interact with each other and share the translation tables for resource virtualization.
4.1.1 Virtualization through Function Wrappers
Since the underlying resources provided by the operating system may change between checkpoint and restart, there is a need to virtualize them. The plugin virtualizes the external resources by putting wrappers around interesting library calls, which interpose when the target application makes such a call. In the case of pids, the virtualization can be done using a simple table translating between virtual and real pids, as shown in Listing 4.1. The arguments passed to the library call are modified to replace the virtual pid with the real pid. Similarly, the return value can also be modified as required. The virtual pid column of this table is saved as part of the checkpoint image, and at restart time the real pid column is repopulated as processes/threads are recreated.
int kill(pid_t pid, int sig)
{
    disable_checkpoint();
    pid_t real_pid = virt_to_real(pid);
    int ret = REAL_kill(real_pid, sig);
    enable_checkpoint();
    return ret;
}
Listing 4.1: A simple wrapper for kill()
As seen in the above listing, a function wrapper is implemented by defining a function with the same name as the call it is going to wrap. The real function here refers to the function with the same signature in a later plugin or in a runtime library. It is possible for multiple plugins to create wrappers around a single library function. The order of execution of wrappers is determined by a plugin hierarchy corresponding to the order in which the plugins are invoked (Section 4.3).
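One plausible way to reach the "real" function from a wrapper of the same name, under the LD_PRELOAD scheme described above, is to resolve the next definition of the symbol in the library search order with dlsym(RTLD_NEXT, ...). This is a common interposition idiom rather than necessarily DMTCP's exact mechanism; with several plugins each defining kill(), RTLD_NEXT naturally chains the wrappers in load order before reaching libc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Lazily resolve the next definition of kill() after this object
 * (another plugin's wrapper, or ultimately libc's kill). */
typedef int (*kill_fn_t)(pid_t, int);

static int REAL_kill(pid_t pid, int sig) {
    static kill_fn_t next = NULL;
    if (next == NULL)
        next = (kill_fn_t)dlsym(RTLD_NEXT, "kill");
    return next(pid, sig);
}
```

Called with signal 0, REAL_kill performs only an existence/permission check, which makes the chain easy to exercise without side effects.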
Capture/Restore state of external resource
Wrappers are also used to "spy" on the parameters used by an application to create a system resource, in order to assist in creating a semantically equivalent copy on restart. At the time of checkpoint, a plugin saves the current state of its underlying resources into the process memory. The state can be obtained from a number of places, such as the process environment and the operating system kernel. In some cases, the function wrappers can also be used to gather information about the external resources. For example, in the "socket" wrapper (Listing 4.2), the socket plugin will save the associated domain and protocol information along with the socket identifier.

int socket(int domain, int type, int protocol)
{
    disable_checkpoint();
    int ret = REAL_socket(domain, type, protocol);
    if (ret != -1) {
        register_new_socket(ret, domain, type, protocol);
    }
    enable_checkpoint();
    return ret;
}
Listing 4.2: Wrapper for socket() to record socket state
Atomic transactions
Plugins may have to perform atomic operations that must not be interrupted by a checkpoint. For example, the translation and the call to the real function should be done atomically with respect to checkpoint-restart. Otherwise, there is a possibility of checkpointing after the translation but before the real function is called. In that case, on restart, the translated value is no longer valid, which can affect the correctness of the program. The plugin engine provides disable_checkpoint and enable_checkpoint services for enclosing the critical section, as seen in Listing 4.1.

The disable_checkpoint and enable_checkpoint services are implemented using a modified write-biased reader-writer lock. The modification allows a recursive reader lock even if the writer is queued and waiting for the lock. The checkpoint thread must acquire the writer lock before it can quiesce the
user threads. On the other hand, the user threads acquire and release the reader lock as part of calls to disable_checkpoint and enable_checkpoint, respectively. If a checkpoint request arrives while a user thread is in the middle of a critical section, the checkpoint thread will wait until the user thread comes out of the critical section and releases the reader lock. A user thread is not allowed to acquire a fresh reader lock if the checkpoint thread is already waiting for the writer lock; this prevents checkpoint starvation.
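The checkpoint lock can be sketched with a mutex and condition variable. This is an illustrative simplification, not DMTCP's implementation: it shows the write bias (a new reader blocks while a writer is waiting), but omits the recursive-reader refinement mentioned above, which the real lock needs so that a thread already holding the reader lock can re-enter a critical section even while the checkpoint thread waits.

```c
#include <assert.h>
#include <pthread.h>

/* Write-biased reader-writer lock sketch. User threads are readers;
 * the checkpoint thread is the (single) writer. */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int readers = 0, writer_waiting = 0, writer_active = 0;

void disable_checkpoint(void) {            /* reader lock */
    pthread_mutex_lock(&m);
    while (writer_waiting || writer_active) /* write bias: new readers
                                               yield to a queued writer */
        pthread_cond_wait(&cv, &m);
    readers++;
    pthread_mutex_unlock(&m);
}

void enable_checkpoint(void) {             /* reader unlock */
    pthread_mutex_lock(&m);
    if (--readers == 0)
        pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&m);
}

void checkpoint_lock(void) {               /* writer lock */
    pthread_mutex_lock(&m);
    writer_waiting = 1;
    while (readers > 0 || writer_active)
        pthread_cond_wait(&cv, &m);
    writer_waiting = 0;
    writer_active = 1;
    pthread_mutex_unlock(&m);
}

void checkpoint_unlock(void) {             /* writer unlock */
    pthread_mutex_lock(&m);
    writer_active = 0;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&m);
}
```

The checkpoint thread calls checkpoint_lock() before quiescing the user threads; a wrapper's critical section is bracketed by disable_checkpoint()/enable_checkpoint() as in Listing 4.1.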
Atomicity is especially important for wrappers that create or destroy a resource instance. For example, when creating a network socket, if the checkpoint is taken right after the socket is created but before the socket plugin has a chance to register it, the socket may not be recreated at restart, as no record of the socket exists. Thus one must atomically create the socket and record its state, as shown in Listing 4.2.
Wrappers can be considered the most basic of all virtualization tools. A flexible, robust implementation of wrapper functions turns out to be surprisingly subtle and is discussed in more detail in Section 4.6.1.
4.1.2 Event Notifications
Event notifications are used to inform other plugins (within the same process) of interesting events. Any plugin can generate notifications. The plugin engine then delivers these notifications to all loaded plugins sequentially. The order of delivery of notifications depends on the plugin hierarchy, as discussed in Section 4.3. Plugins must declare an event hook in order to receive event notifications. A plugin may decide to ignore any or all notifications.
Figure 4.2 shows the "write-ckpt" and "restart" events generated by the coordinator interface plugin, which are then delivered to all other plugins by the plugin engine.
Figure 4.2: Event notifications for write-ckpt and restart events. The numbers in parentheses indicate the order in which messages are sent. Notice that the restart event notification is delivered in the opposite order of the write-ckpt event. [Schematic: (a) the write-ckpt event, generated at the plugin engine (1), is delivered to the socket plugin (2), fork/exec plugin (3), pid plugin (4), coordinator interface plugin (5), and memory plugin (6); (b) the restart event is delivered in the reverse order, memory plugin (2) first and socket plugin (6) last.]
Some of the interesting notifications are:
• Initialize: generated during the process initialization phase (even before main() is called). The plugins can initialize data structures, etc. A plugin may choose to register an exit-handler using atexit() which will be called when the process is terminating.
• Write-Ckpt: each plugin saves the state of its external resources into the process's memory. The memory plugin(s) then create the checkpoint image.
• Resume: generated during the checkpoint cycle, after the checkpoint image has been written, as the process resumes normal execution.
• Restart: generated during restart phase.
• AtFork: generated during a fork; works similarly to the libc function pthread_atfork.
dmtcp_event_hook(is_pre_process, type, data)
{
    if (is_pre_process) {
        switch (type) {
        case Initialize: myInit(); break;
        case Write_Ckpt: myWriteCkpt(); break;
        ...
        }
    }
    if (!is_pre_process) {
        switch (type) {
        case Resume: myResume(); break;
        case Restart: myRestart(); break;
        ...
        }
    }
}
Listing 4.3: An event hook inside a plugin
The Resume and Restart notifications are sent to plugins in the opposite order from the Write-Checkpoint notification (see Listing 4.3 and Figure 4.2b). This ensures that any dependencies of a plugin are restored before the plugin itself is restored. For example, the memory plugin (responsible for writing out or reading back the checkpoint image) is always the lowest layer (see Figure 4.1). This is so that other plugins may save data in the process's memory during checkpoint, and find it again at the same address during restart.
Figure 4.3: Publish/Subscribe example for sockets. [Schematic: on each of Node 1 and Node 2, the socket plugin publishes its current local address and subscribes to its peer's current remote address through the coordinator interface plugin and plugin engine, with the coordinator relaying the addresses between the nodes.]
4.1.3 Publish/Subscribe Service
In a distributed environment, a publish/subscribe service is needed so that a given type of plugin may communicate with its peers in different processes. Typically, on restart, once the process resources have been recreated, the plugins publish their virtual ids along with the corresponding real ids using the publish/subscribe service. Next they subscribe for updates from other processes and update their translation tables accordingly. This was seen for the pid virtualization plugin (Section 3.1.1). Similarly, when a parallel computation is restarted on a new cluster, the socket plugin must exchange socket addresses among peers.
At the heart of the publish/subscribe service is a key-value database whose key corresponds to the virtual name and whose value corresponds to the real name of the underlying resource. The database is populated when plugins publish the key-value pairs. Once a plugin has published all of the relevant key-value pairs, it may subscribe by sending queries to the database. The plugins are notified as soon as a match for the queried key is available. Typically, the key-value database is used only at restart time, as it does not need to be preserved across checkpoint-restart.
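A toy version of the key-value database clarifies the interface. The names below are invented, a linear scan over a fixed-size table stands in for a real database, and the blocking "notify on match" behavior is reduced to a query that returns NULL until the key has been published.

```c
#include <assert.h>
#include <string.h>

/* Toy key-value store: keys are the connection's unique ids agreed
 * on at checkpoint time; values are the current real resource names,
 * e.g. a "host:port" string published after restart. */
#define KV_MAX 128
static struct { char key[64]; char val[64]; } kv[KV_MAX];
static int kv_len = 0;

/* Publish (or re-publish) the real name for a virtual key. */
void kv_publish(const char *key, const char *val) {
    for (int i = 0; i < kv_len; i++)
        if (strcmp(kv[i].key, key) == 0) {   /* update in place */
            strncpy(kv[i].val, val, sizeof kv[i].val - 1);
            return;
        }
    strncpy(kv[kv_len].key, key, sizeof kv[kv_len].key - 1);
    strncpy(kv[kv_len].val, val, sizeof kv[kv_len].val - 1);
    kv_len++;
}

/* Subscribe: NULL until a match for the queried key is available. */
const char *kv_query(const char *key) {
    for (int i = 0; i < kv_len; i++)
        if (strcmp(kv[i].key, key) == 0) return kv[i].val;
    return NULL;
}
```

In the socket case of Figure 4.3, each peer would publish its new address under the connection's unique key and query the same key to learn the peer's address.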
Figure 4.3 shows an example of the socket plugins exchanging their current network addresses with their peers. During the Write-Checkpoint phase, the socket peers agree on using a unique key (see Section 4.4.1) to identify the connection. While restarting, this unique key is used to publish the current network address.
It is possible to have multiple publish/subscribe APIs that differ according to scope. It is left to the plugins to choose the scope best suited to their needs. Two trivial scopes are node-private and cluster-wide. A node-private publish/subscribe API is sufficient for plugins dealing with resources limited to a single node, such as pseudo-terminals, shared memory, and message queues. Plugins dealing with resources that may span multiple nodes, such as sockets and InfiniBand, should use the cluster-wide publish/subscribe API.
The node-private publish/subscribe service may be implemented using shared-memory while the cluster-wide publish/subscribe service must be provided by some centralized resource such as the DMTCP coordinator.
4.2 Design Recipe for Virtualization through Plugins
So far we have seen the plugin architecture and the services provided by it. We have also seen how these services suffice to meet the virtualization requirements. We use this information to create a typical recipe for writing a new plugin to virtualize an "external resource". One is usually given a name or id (identifier) to provide a link to the external resource. The id may be for an InfiniBand queue pair, for a graphics window, for a database connection, for a connection from a guest virtual machine to its host/hypervisor, and so on.
In all of these cases, the recipe is:
1. Intercept communication to the external resource (usually by interposing on library calls), and translate between any real ids from the external resource and virtual ids that are passed to the application software. A plugin maintains this translation table of virtual/real ids.
2. Quiesce the external resource (or wait until the external resource has itself reached a quiescent state);
3. Interrogate the state of the external resource sufficiently to be able to reconstruct a semantically equivalent resource at restart time.
4. Checkpoint the application. The checkpoint will include state information about the external resource, as well as a translation table of virtual/real ids.
5. At restart time, the state information for the external resource is used to create a semantically equivalent copy of the external resource. The translation table is then updated to maintain the same virtual ids, while replacing the real ids of the original external resource with the real ids of the newly created copy of the external resource.
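The recipe can be summarized as a plugin skeleton. All names below are hypothetical scaffolding for illustration (not DMTCP's API); step 1, the wrappers and virtual/real table, is elided, and a trace string records the order in which the hooks fire.

```c
#include <assert.h>
#include <string.h>

/* Skeleton mapping recipe steps 2-5 onto per-plugin hooks. */
struct resource_plugin {
    void (*quiesce)(void);       /* step 2: quiesce the resource     */
    void (*capture_state)(void); /* step 3: interrogate its state    */
    void (*recreate)(void);      /* step 5: rebuild + update table   */
};

static char trace[8];
static int tlen = 0;
static void my_quiesce(void)  { trace[tlen++] = 'Q'; }
static void my_capture(void)  { trace[tlen++] = 'C'; }
static void my_recreate(void) { trace[tlen++] = 'R'; }

/* Step 4 happens between these two drivers: after quiesce and
 * capture, the memory plugin writes the checkpoint image, which
 * includes the captured state and the translation table. */
static void run_checkpoint(struct resource_plugin *p) {
    p->quiesce();
    p->capture_state();
}
static void run_restart(struct resource_plugin *p) {
    p->recreate();
}
```

Running a checkpoint followed by a restart fires the hooks in recipe order: quiesce, capture, then recreate.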
It is not always efficient to quiesce and save the state of an external resource. The many disks used by Hadoop are a good example of this. The data in an external database server is another example. It is not practical to drain and save all of the external data in secondary storage.

There are two potential approaches. The first approach is to delay the checkpoint during a critical phase. In the case of Hadoop, one would delay the checkpoint until the Hadoop computation has executed a reduce operation, in order not to overly burden the resources of the Hadoop back end. A similar approach can be taken for NVIDIA GPUs. In many cases, there are also strategies for plugins to transparently detect this critical phase and delay the checkpoint until that time.
The second approach is to allow for a partial closed-world assumption, in which some state (data/contents) is assumed to be compatible across checkpoint and restart. In the case of the external database server, the external data already lies in fault-tolerant storage and is compatible across checkpoint and restart. Thus the solution is to maintain a virtual id that identifies the external storage of the server. That virtual id is used at restart time to restore the connection to the database server.
4.3 Plugin Dependencies
Some plugins may have dependencies on other plugins. For example, the File plugin depends on the Pid plugin to restore file descriptors pointing to "/proc/PID/maps" and so on. Each plugin provides the list of dependencies which must be satisfied to successfully load the given plugin. The dependency declaration also affects the level of parallelism that can be achieved when performing phases such as Checkpoint, Resume, and Restart.

Subject to the dependencies among plugins, this design provides end users with the possibility of selective virtualization. Selectively including only some plugins is advantageous for three reasons: (i) performance (some end-user plugins might have high overhead); (ii) software maintenance (other plugins can be removed while debugging a particular plugin); and (iii) platform-specific plugins.
4.3.1 Dependency Resolution
Similar in spirit to modern software package formats such as RPM and deb, a plugin provides a list of features/services that it provides, depends on, or conflicts with. For example, the socket plugin may provide services for the “TCP”, “UDS” (Unix Domain Sockets), and “Netlink” socket types and depends on the “File” plugin (to restore file-system-based Unix domain sockets).
The dmtcp_launch program, which is used to launch an application under checkpoint control, compiles a list of all available plugins by looking at various environment variables, such as LD_LIBRARY_PATH. A user-defined list of plugins can also be specified to be loaded into the application. The dmtcp_launch program examines this plugin list and creates a partial order of dependencies among the plugins. The list of available plugins is searched to fulfill any missing dependencies for the user-defined plugins. If a match is found, plugins are loaded automatically. Otherwise an error is reported. If two or more plugins provide the same feature/service, a conflict is recorded and the user is presented with the conflicting plugins.

void dmtcp_plugin_dependencies(const char ***provides,
                               const char ***requires,
                               const char ***conflicts)
{
  static const char *_provides[]  = {"TCP", "UDS", "Netlink", NULL};
  static const char *_requires[]  = {"File", NULL};
  static const char *_conflicts[] = {NULL};
  *provides  = _provides;
  *requires  = _requires;
  *conflicts = _conflicts;
}
Listing 4.4: Dependencies declared by a plugin. The dmtcp_launch utility uses these fields to generate a partial order among the given plugins and to report any missing dependencies or any conflicts.
Listing 4.4 provides an example of dependency information as exported by the socket plugin. Since the plugins are implemented as shared libraries, the dmtcp_launch program can perform dlopen/dlsym to find and call the dmtcp_plugin_dependencies function to learn about the dependencies.
Further, this approach assumes a common naming scheme to resolve matches/dependencies across plugins. This could be automated by scanning symbols in the object files, for example, for both definitions and uses. If a symbol is defined in more than one plugin, it can be listed as a potential source of conflict to help the plugin writer in debugging plugins.
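As a concrete illustration, the partial-order computation described above is essentially a topological sort over the provides/requires declarations, with a conflict detected when two plugins export the same feature. The following is a simplified sketch, not the dmtcp_launch implementation; all type and function names here are illustrative.

```cpp
#include <functional>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

struct Plugin {
  std::string name;
  std::vector<std::string> provides;
  std::vector<std::string> requires_;
};

// Return the plugin names in an order where every provider of a
// required feature precedes the plugins that require it.
std::vector<std::string> resolveOrder(const std::vector<Plugin> &plugins) {
  std::map<std::string, std::string> provider;  // feature -> plugin name
  for (const Plugin &p : plugins)
    for (const std::string &f : p.provides) {
      if (provider.count(f))  // two plugins export the same feature
        throw std::runtime_error("conflict over feature: " + f);
      provider[f] = p.name;
    }

  std::map<std::string, const Plugin *> byName;
  for (const Plugin &p : plugins) byName[p.name] = &p;

  std::vector<std::string> order;
  std::set<std::string> done, visiting;
  std::function<void(const std::string &)> visit =
      [&](const std::string &name) {
        if (done.count(name)) return;
        if (!visiting.insert(name).second)
          throw std::runtime_error("circular dependency at: " + name);
        for (const std::string &f : byName.at(name)->requires_) {
          auto it = provider.find(f);
          if (it == provider.end())
            throw std::runtime_error("unmet dependency: " + f);
          visit(it->second);  // load the provider first
        }
        visiting.erase(name);
        done.insert(name);
        order.push_back(name);
      };
  for (const Plugin &p : plugins) visit(p.name);
  return order;
}
```

With the socket plugin of Listing 4.4 as input, this sketch would order the "File" plugin before the socket plugin, and would report an error for a missing provider or for two plugins claiming the same feature.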
Parallel event handling
In Section 4.1.2, we discussed how the plugin engine assumed serial delivery of event notifications due to plugin dependencies expressed in a linear order (Figure 4.2). However, for non-linear plugin dependencies, a dependency graph can be created to relax the order of notification delivery. The event notifications can be processed by multiple plugins in parallel as long as there is no dependency between them. This is useful in modern multi-core systems to allow idle CPU cores to process the event notifications for the plugins. It is also useful for plugins that need to perform asynchronous operations during event handling. In such cases, rather than blocking on a single plugin, the event notification can be carried out in parallel in other plugins.
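A minimal sketch of such wave-style parallel delivery follows (hypothetical names; not the actual plugin-engine code): plugins whose dependencies have already handled the event form a wave, and each wave is notified on parallel threads.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Deliver one event to all plugins.  'deps[p]' lists the plugins that
// must handle the event before plugin p does.  Plugins with no mutual
// dependency are notified concurrently, one wave at a time.
void deliverEvent(const std::vector<std::string> &plugins,
                  const std::map<std::string, std::set<std::string>> &deps,
                  void (*handler)(const std::string &plugin)) {
  std::set<std::string> notified;
  std::set<std::string> remaining(plugins.begin(), plugins.end());
  while (!remaining.empty()) {
    // Collect every plugin whose dependencies are all satisfied.
    std::vector<std::string> wave;
    for (const std::string &p : remaining) {
      bool ready = true;
      auto it = deps.find(p);
      if (it != deps.end())
        for (const std::string &d : it->second)
          if (!notified.count(d)) { ready = false; break; }
      if (ready) wave.push_back(p);
    }
    if (wave.empty())
      throw std::runtime_error("circular plugin dependency");
    // Notify the whole wave in parallel; join before the next wave.
    std::vector<std::thread> workers;
    for (const std::string &p : wave) workers.emplace_back(handler, p);
    for (std::thread &t : workers) t.join();
    for (const std::string &p : wave) {
      notified.insert(p);
      remaining.erase(p);
    }
  }
}
```

A linear dependency chain degenerates to one plugin per wave (the serial delivery of Section 4.1.2), while independent plugins share a wave and run concurrently.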
4.3.2 External Resources Virtualized by Other Plugins
Plugins may use resources that are virtualized by an earlier plugin. For example, plugins are allowed to create threads, open sockets, use files, etc. However, if the resource is created or used in a way that bypasses the wrappers installed by the earlier plugin, the resource will not be virtualized or saved and restored by that earlier plugin. In such situations, only the plugin using the resource can save and restore its state. This restriction avoids circular dependencies. If save-restore or virtualization by the earlier plugin is absolutely required, the plugin should be broken into two or more smaller plugins, and the newer plugin should be moved higher in the plugin hierarchy.

4.3.3 Multiple Plugins Wrapping the Same Function
Multiple plugins are allowed to place wrappers around the same library call. For example, the open("/proc/PID/maps", ...) function is wrapped by the File plugin as well as the Pid plugin. The File plugin needs to be able to save/restore the file descriptor, whereas the Pid plugin has to convert the virtual PID to a real one. Figure 4.4 shows the nested wrappers provided by the Pid plugin and the File plugin.
[Figure 4.4 here: nested wrappers across the Target Application, the File plugin, the Pid plugin, and libc. The application's open() call enters the File plugin's wrapper, which calls the Pid plugin's wrapper, which calls libc's open (and ultimately sys_open). close() passes through the File plugin directly to libc, and getpid() passes through the Pid plugin to libc, since only one plugin wraps each of those calls.]

Figure 4.4: Nested wrappers: the open function is wrapped both by the File plugin and by the Pid plugin.
Once a plugin has performed all the required pre-processing actions, it calls the function wrapper in the next plugin library. This is done by using the RTLD_NEXT feature of the dlsym function call. The RTLD_NEXT service will find the next occurrence of the given function in the library search order after the current library. For example, in the case of the open wrapper in the File plugin from Figure 4.4, dlsym(RTLD_NEXT, “open”) would return the address of the open function defined in the Pid plugin. However, dlsym(RTLD_NEXT, “close”) would return the address of the close function defined in libc, since a close wrapper is not defined in the Pid plugin. Since the wrappers execute both before and after the library call, a plugin that was loaded earlier can place a wrapper around the wrapper created by a later plugin. Thus the pre-processing takes place in the order of the plugin load sequence, whereas the post-processing takes place in the reverse order.

4.4 Extending to Multiple Processes
Until this point, plugins have been described in the context of a single process. For distributed computations, the interaction among distributed processes is critical to making the plugin model practical. As we have seen, the plugins virtualize resources for several reasons. However, in the case of multiple processes, several processes may be using a common resource. For example, several processes may share a file descriptor open to the same file. A mapped memory region may be shared. A socket may be shared among multiple processes. Several processes may have duplicate pointers to the same underlying resource. These duplicate pointers may be created explicitly (e.g., the dup() system call creates a duplicate file descriptor) or implicitly (by creating a child process; the child process automatically gets a copy of all the file descriptors, shared memory, etc.).
How does one ensure correctness if multiple processes are using the same resource and hence virtualizing it independently of each other? Should all processes save/restore the common resource or only one of them?
The correct answer is that only a single process should be allowed to save/restore the state of the underlying resource. This is required for two reasons: (i) for some resources, part of the state to be checkpointed can be read only once (this is the case with data in kernel buffers or network data); and (ii) if multiple processes recreate the resource during restart, it may no longer be shared. In some situations, it is impossible for multiple processes to recreate the resource (e.g., sockets), while in other cases, recreating the resource multiple times is permitted but results in incorrect behavior (e.g., the same file can be opened by multiple processes, resulting in a loss of sharing semantics).
Single process
It is possible to have duplicate pointers within a single process. Thus the plugins must ensure that only one copy is checkpointed and that the duplication is restored during restart. This requires the ability of the plugins to identify duplicate resources during the checkpoint phase. For some resources, the operating system kernel (or the execution environment) assigns a unique id at the time of creation. Examples include sockets, pids, System V shared memory objects, semaphores, etc. When these resources are duplicated, the duplicates may be detected easily by querying the kernel for the resource id.
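For file descriptors, for instance, the kernel-assigned identity can be queried with fstat: two descriptors that report the same (device, inode) pair refer to the same underlying file. This is a sketch of the idea only; DMTCP's actual duplicate detection also distinguishes shared open-file descriptions, such as those created by dup or fork.

```c
#include <stdbool.h>
#include <sys/stat.h>

// True if fd1 and fd2 refer to the same underlying filesystem object,
// judged by the kernel's (device, inode) pair.
bool same_underlying_file(int fd1, int fd2) {
  struct stat s1, s2;
  if (fstat(fd1, &s1) != 0 || fstat(fd2, &s2) != 0)
    return false;
  return s1.st_dev == s2.st_dev && s1.st_ino == s2.st_ino;
}
```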
Multiple processes
The two key issues in dealing with multiple processes are: (i) checkpoint-restart of shared resources; and (ii) finding the current location of peer processes. We employ the publish/subscribe service to assist us in dealing with these issues. While it allows a central coordinator to mediate among multiple processes, it also implicitly produces a barrier. Hence, it is important to use that facility sparingly for the sake of efficiency.
4.4.1 Unique Resource-id for Shared Resources
Duplicate detection for the remaining resources must be done by keeping track of when the duplicates are created — explicitly or implicitly. This is done by assigning a unique resource-id to each resource when it is created. Resource duplication is tracked by putting wrappers around the corresponding library calls (such as dup or fork). Once detected, the duplicates are assigned the same resource-id as the original resource. A globally unique resource-id can be created in several ways. One possible solution is to use a mixture of the hostname, the virtual/real pid of the process creating the resource, a creation timestamp, etc.

4.4.2 Checkpointing Shared Resources
Since only one process should be allowed to save the state of the shared resources and the original resource creator might not be present, we must select a checkpoint-leader process for each resource. The checkpoint-leader is responsible for saving and restoring the state of the underlying resource.
Checkpoint-leader election — consensus across processes
The processes sharing the underlying resource may elect a checkpoint-leader using several mechanisms. The basic idea is to reach consensus across the participating processes. Ansel et al. [7] used the fcntl system call to set ownership of the file descriptors. Each process tries to set itself as the owner of the given file descriptor. The centralized coordinator process was used to create a global barrier to signal the end of the election after each process had a chance to make the system call. The last process to perform the system call is considered the checkpoint-leader. An example is shown in Listing 4.5.
void checkpoint_file(int fd)
{
  // Participate in the checkpoint-leader election;
  // publish ourself as the owner of the resource.
  fcntl(fd, F_SETOWN, getpid());
  // Now wait for the election to be over.
  wait_for_global_barrier(LEADER_ELECTION);
  // If we are the owner, we are the checkpoint-leader.
  if (fcntl(fd, F_GETOWN) == getpid()) {
    // Capture the state of the file descriptor.
    capture_state(fd);
  }
}
Listing 4.5: An example of leader election using the fcntl system call.
While this approach works for shared file descriptors, it doesn’t work for other resources, such as files. There can be multiple unique file descriptors that are opened on the same file. In this case, each unique file descriptor gets a checkpoint leader. This results in checkpointing of multiple copies of the file. The publish/subscribe service can be used to provide a better solution. Each process publishes itself as the checkpoint-leader using the unique resource-id of the resource. The last process to publish is elected the checkpoint-leader. Since files can have multiple unique file descriptors (and hence multiple unique resource-ids) associated with them, we can publish using the absolute file path or the inode number for leader election.
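The last-writer-wins election over the publish/subscribe key-value service can be sketched with an in-process stand-in for the coordinator's database. The class and method names below are hypothetical; the real service is remote and synchronized by the global barrier.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for the coordinator's key-value service.
// Each process "publishes" itself under the resource's unique id;
// the last publisher before the barrier becomes the checkpoint-leader.
class PublishSubscribeDb {
  std::map<std::string, int> kv_;  // resource-id -> pid of last publisher
  std::mutex m_;

public:
  void publish(const std::string &resourceId, int pid) {
    std::lock_guard<std::mutex> lk(m_);
    kv_[resourceId] = pid;  // last write wins
  }

  // Called after the global barrier: am I the elected leader?
  bool isLeader(const std::string &resourceId, int pid) {
    std::lock_guard<std::mutex> lk(m_);
    auto it = kv_.find(resourceId);
    return it != kv_.end() && it->second == pid;
  }
};
```

Keying the election on an inode-derived resource-id (rather than a per-descriptor id) is what lets multiple distinct descriptors to the same file elect a single leader.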
Global barriers
As mentioned above, a global barrier allows plugins in different processes to synchronize during checkpoint and restart. A simple implementation of the global barrier requires a centralized coordinator that keeps a count of all processes that have reached the barrier. Once all processes reach the barrier, it lifts the barrier and allows them to proceed, as shown in Listing 4.6.

void wait_for_global_barrier(BarrierId id)
{
  MessageType msg, rmsg;
  msg.type = GLOBAL_BARRIER;
  msg.barrierId = id;
  // Tell the coordinator that we have reached the barrier.
  send_msg_to_coordinator(msg);
  // Wait until all other peers reach the barrier.
  recv_msg_from_coordinator(&rmsg);
  assert(rmsg.type == GLOBAL_BARRIER_LIFTED);
  // The barrier has been lifted.
}
Listing 4.6: Global barrier.
Global barriers are costly, as each process has to communicate with the centralized coordinator process. If each plugin implements several global barriers, the performance impact can be significant in terms of checkpoint and restart times. The total number of global barriers can be reduced significantly by using process-level anonymous global barriers that can be implemented in the coordinator interface plugin, as shown in Listing 4.7.
void implement_global_barriers()
{
  // Create an anonymous global barrier.
  wait_for_global_barrier(BARRIER_ANON_1);
  // Generate an event notification indicating
  // the lifting of anonymous barrier 1.
  generate_event(ANON_GLOBAL_BARRIER_1);

  wait_for_global_barrier(BARRIER_ANON_2);
  generate_event(ANON_GLOBAL_BARRIER_2);

  wait_for_global_barrier(BARRIER_ANON_3);
  generate_event(ANON_GLOBAL_BARRIER_3);
  ...
}
Listing 4.7: Process-level anonymous global barriers.
Consider the example of leader election. On receiving the event notification for the ANON_GLOBAL_BARRIER_1 event, each plugin will participate in leader election for its resources by publishing itself as the checkpoint-leader. On receiving the event notification for ANON_GLOBAL_BARRIER_2, each plugin can check to see if it is the checkpoint-leader by subscribing to the checkpoint-leader information for the unique resource-id.
[Figure 4.5 here: dependency graph for distributed processes, relating the Socket, File, Fork/Exec, Pid, Thread, and Memory plugins to the Coordinator Interface plugin for the Write-Checkpoint and Resume/Restart phases.]

Figure 4.5: Plugin dependency for distributed processes
4.4.3 Restoring Shared Resources
Note that memory regions are restored before plugins can restore the state of their corresponding resources. In the case of shared resources, the checkpoint-leader recreates the underlying resources and then shares them with other processes using the publish/subscribe service. The checkpoint-leader publishes while the remaining processes subscribe to the resource-id.
Remark: Resources involving file descriptors can be shared by passing them over Unix domain sockets.
Note that the sharing of resources forces a certain dependency among plugins, which is summarized in Figure 4.5. The required dependency can be observed by noting the required actions of a plugin at the time of restart. The Pid plugin is responsible for virtualizing the pids, which is required by the fork/exec plugin to restore the process trees. Once the process trees have been created, the file, socket, System V shared memory, etc. plugins may recreate/restore the resources and share them with other processes.

4.5 Three Base Plugins
In this section we discuss three special-purpose plugins: the coordinator interface plugin, the thread plugin, and the memory plugins.
4.5.1 Coordinator Interface Plugin
A centralized coordinator process is used to synchronize checkpoint-restart between multiple processes on the same or different hosts. A coordinator interface plugin communicates with the coordinator process and generates events related to checkpointing when requested by the coordinator. It creates a checkpoint-manager thread, which listens to the coordinator process for a checkpoint message while the user threads are executing application code. On receiving a coordinator message, the checkpoint-manager thread generates the checkpoint, resume, or restart event, which is then delivered to all other plugins. The coordinator interface plugin and the coordinator process can best be thought of as a single programming unit. It is this programming unit that implements global barriers at the time of checkpoint or restart. The special case of a single standalone target process can be supported by a minimal coordinator interface plugin, which directly generates the three basic event notifications: checkpoint, resume, and restart. In this case, one does not need any external coordinator process. At the other extreme, a coordinator interface plugin can be written to support a set of redundant coordinators. This alternative eliminates the possibility of a single point of failure.
4.5.2 Thread Plugin
The thread plugin is responsible for saving and restoring the state of all user threads during checkpointing. The plugin engine invokes the checkpoint-manager thread through the write-ckpt event hook. The checkpoint manager then sends a POSIX signal to all user threads. This forces the user threads into a checkpoint-specific signal handler (which was defined earlier within the thread plugin). The handler causes each user thread to save its context (register values, etc.) into the process memory and to then wait on a lock. When the checkpoint completes, the thread plugin releases all user threads from their locks, and user execution resumes. On restart, the memory plugin restores user-space memory from a checkpoint image, and control is then passed to a restart event hook of the thread plugin. Only the primary thread of the restarted process exists at this time. That thread recreates the other threads, restores their context, and releases the user threads from the locks that were entered prior to checkpoint. (The state of a lock depends only on user-space memory.)
4.5.3 Memory Plugins
[Figure 4.6 here: a stack of memory plugins layered between the other plugin libraries and the runtime libraries/plugin engine: prepare list of memory areas, zero-page detection, compression, encryption, and write to network socket.]

Figure 4.6: Various memory plugins stacked together
Memory plugins are responsible for writing the contents of a process's memory into the checkpoint image. The checkpoint image is read during the restart process to recreate the process memory. Memory plugins are the last in the plugin loading sequence, as every other plugin necessarily depends on the memory resource. Figure 4.6 shows an example sequence of memory plugins that perform zero-page optimization followed by compression and encryption before writing the checkpoint data to a network socket. A process on the other end of the socket may then save the data onto persistent storage.
At restart time, a special application, dmtcp_restart, is needed to bootstrap the restart procedure and to load the restoration code corresponding to all the memory plugins involved. Control is then passed to the memory plugins, which restore the rest of the process memory. After restoring memory, the rest of the plugins recreate/restore their corresponding resources. User threads are then recreated, and the process resumes executing application code.
Here we list some characteristics of the memory plugins:
1. Since writing the checkpoint image is the last step in the checkpoint process, the memory plugins must appear last in the plugin sequence.
2. If it is possible for memory plugins to alter the memory maps of the current process, the first memory plugin must create a list of the memory areas to be written to the checkpoint image. The memory plugins can then map new memory areas for checkpoint purposes only, and these areas will not be checkpointed.
3. The memory plugins pass information to the next memory plugin using a pipe mechanism, i.e., each plugin may process the incoming data and send the processed (and potentially modified) data to the next plugin. Data piping can be implemented by creating hooks for writing and reading memory.
4. The plugins agree on some notion of end-of-data to finish writing the checkpoint image.
5. The last memory plugin writes the data to persistent storage (a file) or to a pipe/socket. There can be a different process on the other end of the pipe/socket, which then saves the data to a persistent device or restarts the process on the fly. The last memory plugin here means the final or lowest memory plugin (e.g., the “write to network socket” plugin in Figure 4.6).
6. The last memory plugin is also responsible for reading from the checkpoint image.
7. During restart, memory plugins are responsible for restoring other runtime libraries; thus these plugin libraries must be self-contained.
Remark: Note that the state managed by the memory plugins will not be compressed or encrypted in our running example of memory plugins. This is necessary to solve the problem of bootstrapping on restart. If the boot- strapping code were also encrypted, it would be impossible to bootstrap.
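For instance, the zero-page stage of Figure 4.6 hinges on a simple per-page test: a page containing only zero bytes need not be written, since restart can rely on demand-zero mappings. A sketch follows, with names and page-size handling simplified.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

// True if the page contains only zero bytes.  Comparing the page
// against itself shifted by one byte checks page[i] == page[i-1] for
// all i; together with page[0] == 0, every byte must be zero.
static bool is_zero_page(const unsigned char *page) {
  return page[0] == 0 && memcmp(page, page + 1, PAGE_SIZE - 1) == 0;
}

// Classify each page of a region; return the number of pages whose
// contents actually need to be written to the checkpoint image.
size_t count_nonzero_pages(const unsigned char *region, size_t len) {
  size_t needed = 0;
  for (size_t off = 0; off + PAGE_SIZE <= len; off += PAGE_SIZE)
    if (!is_zero_page(region + off))
      needed++;
  return needed;
}
```

A downstream plugin would then receive only the non-zero pages (plus a list of the zero pages' addresses) through the pipe mechanism described above.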
4.6 Implementation Challenges
In this section we describe some of the implementation challenges that we faced in implementing plugin-based virtualization in DMTCP version 2.
4.6.1 Wrapper Functions
We discuss three different implementation techniques that were tried in succession before settling on a fourth choice: a hybrid of the second and third options:
1. dlopen/dlsym: This is a naive approach, well known in the literature. It allows the plugin to define a system call of the same name, whose body uses dlopen/dlsym to open the run-time library (e.g., libc, libpthread, etc.) and then call the system call in the run-time library. However, this fails when creating a wrapper for the GNU implementation of calloc. The GNU implementations of dlopen and dlsym would call calloc, thus creating a circular dependency. Wrapping an occurrence of dlopen/dlsym from a user's application creates a similar circular dependency. However, a still more severe criticism is that if the wrapper function directly calls the run-time library, then nested wrappers become impossible. In our implementation, multiple plugins frequently wish to wrap the same system call.
2. offsets within a run-time library: This was implemented in order to avoid the use of dlopen/dlsym. A base address is chosen within the run-time library. (It may be the start address of the library or an unusual system call unlikely to be needed by wrappers.) For all system calls to be wrapped, the offset from that system call to the base address is calculated before launching the end-user application. The end-user application is then launched, and the base address is recalculated. Next, the base address is used along with the offsets to determine the addresses of the functions in the run-time library. At this point, the functions in the run-time library can be called using the corresponding addresses. This solves the issues caused by circular dependencies (e.g., dlopen, dlsym, calloc). However, nested wrappers still cannot be implemented.
3. dlsym/RTLD_NEXT: The POSIX option RTLD_NEXT for dlsym is designed in part to implement wrapper functions. This option causes dlsym to search the sequence of currently open libraries for the next matching symbol beyond the current library. This fixes the problem of implementing nested wrappers, but it does not solve the problem of circular dependencies.
The ultimate solution requires an additional observation: The run-time library sometimes internally calls a system call (as with dlopen/dlsym calling calloc). It is a mistake for the plugin to execute the wrapper function around this internal call. Yet, when dlsym internally calls calloc, the ELF loader will call the first definition of calloc that it finds. The first library to be loaded was libdmtcp.so, as part of the design of DMTCP. So, the calloc wrapper in libdmtcp.so is called.
A standard wrapper for calloc within libdmtcp.so would then call dlsym to determine the address of calloc within libc.so. But this would create the circularity. Instead, the wrapper detects that this is a circular call originating from the run-time library (libc.so). Upon detecting this, the calloc wrapper reverts to the second method above (offsets within a run-time library) in order to directly call the implementation of calloc within libc. Thus the circularity is broken.
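The offset technique that the wrapper falls back to can be sketched as follows: the offset of calloc from a chosen base symbol is computed ahead of time and rebased at run time, so the call bypasses dlsym entirely. All names here are illustrative, and the offset computation shown in the test below simply derives the offset from two libc symbols in the same process, which is simpler than DMTCP's pre-launch calculation.

```c
#include <stddef.h>

// Offset of calloc from the chosen base symbol, computed pre-launch.
static ptrdiff_t calloc_offset;
// Address of the base symbol in the running process.
static char *libc_base;

// Call libc's calloc directly via base + offset, without dlsym.
void *direct_calloc(size_t nmemb, size_t size) {
  void *(*fn)(size_t, size_t) =
      (void *(*)(size_t, size_t))(void *)(libc_base + calloc_offset);
  return fn(nmemb, size);
}
```

Because the offset between two symbols in the same library is invariant under relocation, rebasing it against the base symbol's run-time address always lands on the real calloc.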
4.6.2 New Process/Program Creation
When a process forks to create a new child process, the thread that called fork() is the only thread in the new process. This poses certain challenges for plugins, especially when dealing with locks. If, at the time of fork(), some other thread is holding a lock, the threads in the new process may deadlock on this lock. The solution is to install atfork() handlers in all plugins that use locks or similar artifacts, so that whenever a child process is created, it re-initializes the locks before doing anything else. An alternative is to use the AtFork event generated by the fork/exec plugin. Glibc and Firefox are two real-world examples that install atfork handlers to re-initialize the locks for their respective malloc arenas.
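A minimal sketch of the atfork approach follows (illustrative names; a real plugin would register this during its initialization):

```c
#include <pthread.h>

// A lock protecting some plugin-internal table.
static pthread_mutex_t plugin_lock = PTHREAD_MUTEX_INITIALIZER;

// Child-side atfork handler: the forking thread is the only thread in
// the child, so it is safe to discard the old lock state outright.
static void atfork_child(void) {
  pthread_mutex_t fresh = PTHREAD_MUTEX_INITIALIZER;
  plugin_lock = fresh;
}

static void plugin_init(void) {
  // prepare/parent handlers omitted; only the child needs to act here.
  pthread_atfork(NULL, NULL, atfork_child);
}
```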
New programs created by calling execve() have a different set of problems. Since the new program gets a completely new address space, all information that was gathered by the plugin prior to exec is lost. Plugins that need to preserve information across exec need a lifeboat where they can put the information for later use. A typical example of a lifeboat is a temporary file created on disk. The plugins serialize the previously captured information to the lifeboat. Since the plugins are independent of each other, there can be multiple lifeboats per process. Remark: As an optimization, it is possible to provide a single lifeboat that can be used by all the plugins.
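A lifeboat can be sketched as a pair of save/restore helpers (hypothetical names): before exec, the plugin serializes its tables to a file and records the path in an environment variable, which survives execve(); after exec, the freshly loaded plugin reads the file back and deletes it.

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical env-var name used to hand the lifeboat path across exec.
#define LIFEBOAT_ENV "HYPOTHETICAL_PID_PLUGIN_LIFEBOAT"

void lifeboat_save(const char *path, const char *serialized_state) {
  FILE *f = fopen(path, "w");
  if (!f) return;
  fputs(serialized_state, f);
  fclose(f);
  setenv(LIFEBOAT_ENV, path, 1);  // survives execve()
}

// Returns the number of bytes read into buf, or 0 if no lifeboat exists.
size_t lifeboat_restore(char *buf, size_t bufsize) {
  const char *path = getenv(LIFEBOAT_ENV);
  if (!path) return 0;
  FILE *f = fopen(path, "r");
  if (!f) return 0;
  size_t n = fread(buf, 1, bufsize - 1, f);
  buf[n] = '\0';
  fclose(f);
  remove(path);  // the lifeboat is single-use
  unsetenv(LIFEBOAT_ENV);
  return n;
}
```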
4.6.3 Checkpoint Deadlock on a Runtime Library Resource
Atomic wrapper operations are also desired when dealing with resources that use locks for atomicity. Suppose a user thread is quiesced while holding the resource lock. Later on, if the resource is needed to complete the checkpoint, this can cause a deadlock within the process. For example, in one of the most frequent scenarios, a user thread is quiesced while performing malloc/free inside glibc. The checkpoint thread then blocks when it calls either of these functions during the checkpoint process. There are two possible solutions: (i) modify the checkpointing logic to never call these functions; and (ii) create wrappers around these functions that call disable_checkpoint and enable_checkpoint around the call to the real library functions, as shown in Listing 4.8.
void *malloc(size_t size)
{
  disable_checkpoint();
  void *ret_val = real_malloc(size);
  enable_checkpoint();
  return ret_val;
}
Listing 4.8: Malloc wrapper to avoid deadlock during checkpointing.

4.6.4 Blocking Library Functions and Checkpoint Starvation
There are certain wrappers around blocking library functions that need to virtualize the underlying system resource. As discussed in Section 4.1.1, the call to the library function and the translation between real and virtual names should be atomic with respect to checkpointing. However, if a function call is blocking, the checkpoint may never succeed. Examples of such functions are waitpid, pthread_join, etc.

pid_t waitpid(pid_t pid, int *status, int options)
{
  struct timespec short_interval = {0, 1000000};  // 1 ms
  while (1) {
    disable_checkpoint();
    // Call the real waitpid in non-blocking mode.
    pid_t ret = real_waitpid(pid, status, options | WNOHANG);
    enable_checkpoint();
    // Return on success or on any error; a return value of 0 means
    // that no child has changed state yet.
    if (ret != 0 || (options & WNOHANG)) {
      return ret;
    }
    // Sleep briefly; a checkpoint may take place here.
    nanosleep(&short_interval, NULL);
  }
}

Listing 4.9: Wrapper for waitpid with non-blocking calls to the real waitpid function.
In these situations, one can modify the wrapper, as seen in Listing 4.9, to call the non-blocking version of the function in a loop until it succeeds or returns an error other than timeout. The timed version waits for the given time period before returning, instead of blocking indefinitely. In some situations, the blocking call may not provide a non-blocking version. In those cases, a potential solution is to use a signalling mechanism to force the call to return with an error. At this point, the checkpoint can take place. However, the wrapper must then be re-executed from the beginning to avoid any stale state.

CHAPTER 5
Expressivity of Plugins
This chapter presents a large variety of examples of adaptive plugins, to demonstrate the expressivity of the plugin framework. They fall into several categories, each of which represents a unique type of contribution in generalizing the traditional functionality of checkpoint-restart. Some of the plugins address long-standing challenges. Not only do these plugins provide additional functionality for checkpoint-restart, but they do so with far fewer lines of code than the previously available, less functional approaches. These include transparent checkpointing of: InfiniBand networks by Cao et al. [27]; hardware-accelerated 3-D graphics (OpenGL 2.0 and beyond) by Kazemi Nafchi et al. [62]; a network of virtual machines by Garg et al. [44]; and GDB sessions by Visan et al. [127]. Each of these efforts was led by a different author. Thus they represent trials of the new plugin feature by independent users. The full details of each plugin can be found in the publications and technical reports of those authors. While I believe any of these could have been done by adding support in any of the existing checkpointing packages, the amount of effort (both in terms of person-hours and lines of code) would have been enormous. Instead, by using adaptive plugins to implement a process virtualization approach, the job was made much easier. In all cases, the plugin writers
didn’t need to learn the details of DMTCP internals, allowing them to focus only on the plugin.
Plugin        Lines of code   Novelty                            Prior art           Lines of code
SSH session   1,021           The only solution                  —                   —
GDB session   938             The only solution                  —                   —
Batch-Queue   1,715           The only solution                  —                   —
KVM/Tun       1,100           Full snapshots of a network        Single-VM           ??
                              of VMs                             snapshots
OpenGL        4,500           Supports programmable GPUs         VMGL [69]           78,000
                              (OpenGL 2.0 and beyond)
InfiniBand    2,500           Native InfiniBand checkpoint       MPI-specific [55]   17,000
                              for both MPI and non-MPI jobs
IB2TCP        1,000           InfiniBand-to-TCP migration        MPI-specific [55]   ??
                              for both MPI and non-MPI jobs
Table 5.1: Process-virtualization-based checkpoint-restart is both more general and typically an order of magnitude smaller in implementation size
The expressivity is measured along two dimensions (see Table 5.1). The first dimension is a measurement of lines of code for the plugins. Since each example was a “first” for that functionality, we compare with the lines of code for a previously published implementation with lesser functionality, where possible. In the second dimension, we compare functionality with that of the application identified as having the most previous functionality in the corresponding domain. Thus a two-fold argument is presented. First, the process virtualization approach permits implementations with much larger functionality than had previously been practical with moderate resources. Second, the process virtualization approach results in an implementation with many fewer lines of code than would have been practical by other approaches. (Of course, the fewer lines of code in the plugin are made possible by using the base support for plugins in DMTCP version 2.) Note that some of the plugins discussed in this chapter were not created as part of this thesis. Instead, they were created by different authors using the plugin API. Further details of each plugin can be found in the publications and technical reports of those authors.
Statistics for various plugins
Table 5.2 provides several statistics, including the source lines of code, the number of library call wrappers, and the various services used by the plugins. The lines of code were obtained using SLOCCount [132]. Section 5.1 provides a brief overview of the plugins related to file descriptor handling. Section 5.2 provides an overview of the working of the plugin handling the System V IPC mechanism. A few application-specific plugins are discussed in Section 5.3. The remaining sections provide various case studies where new functionality was implemented, whereas previously in other checkpoint-restart packages, the added functionality was implemented only through independent, auxiliary applications.
5.1 File Descriptor Related Plugins
Since file descriptors may be used for file objects, socket connections, or event notifications, the corresponding plugins share some code for handling generic file descriptors. This results in a cleaner design and a smaller code footprint. The shared code provides services for generating unique file descriptor ids, detecting and managing duplicate file descriptors, leader election, and re-sharing of file descriptors on restart.
Note that DMTCP version 1 provided support for checkpointing TCP and Unix domain sockets, and hence for checkpointing distributed applications. It also provided limited support for handling files and pseudo-terminals. For this work,
Plugin          Language   Lines of Code   Wrappers   Services used

Internal Plugins
File            C/C++      2,276∗          48         a,b,c,d,e
Socket          C/C++      1,356∗          17         a,b,c,d
Event           C/C++      909∗            12         a,b,c,d,e
Pid             C/C++      1,644           47         c,d,e
SysVIPC         C/C++      1,154           14         a,b,c,d,e
Timer           C/C++      419             14         a,c,d,e
SSH             C/C++      1,021           3          a,b,c,d,e

Contrib Plugins
Batch-Queue     C/C++      1,715           13         e†
Ptrace          C/C++      938             7          a,b,c
Record-replay   C/C++      8,071           164        a,b,c,e
KVM             C          749             2          a,b,c,e
Tun             C          351             3          a,b,c,e
OpenGL          C/C++      4,500           119        a,b,c,e,f
InfiniBand      C          2,788           34         a,b,c,d,e
IB2TCP          C/C++      804             31         c,d,e

Application-Specific Plugins
Malloc          C/C++      116             10         f
Dlopen          C/C++      28              3          f
Modify-env      C          134             0          c,e
CkptFile        C/C++      37              0          a,c
Uniq-Ckpt      C/C++      39              0          a,c

∗: Uses an additional 899 lines of shared common code.
†: Uses specialized utilities to detect restart.
Plugin services: (a) write-checkpoint hook; (b) resume hook; (c) restart hook; (d) publish/subscribe; (e) virtualization; (f) protect critical sections of code.
Table 5.2: Statistics for various plugins.

the plugins were created by rewriting the existing solution from DMTCP version 1. This greatly enhanced the available features and provided an easier way for the user to fine-tune checkpointing. This section provides a brief overview of the three plugins.
File plugin
The File plugin is responsible for handling file descriptors that point to regular files and directories. For implementation purposes, it also handles pseudo-terminals (ptys) and FIFO (first in, first out) objects, since they have semantics similar to file objects. Apart from restoring the relevant file descriptors, the File plugin also needs to translate file paths if the computation is restarted on a system with different mount points or by a different user.
There are several ways to provide file path translation. A simple mechanism involves recording relative file paths at checkpoint time and using that relative path information on restart to find the file. Another approach involves wildcard substitution, where a certain component of the file path is transparently replaced with a different one. For example, if a mount point has changed from /mnt/foo to /bar, the plugin would replace /mnt/foo/baz with /bar/baz.
The File plugin also deploys some heuristics to determine whether it needs to save and restore the associated file data. In some cases, the file data must always be checkpointed. Examples include unlinked files (Linux allows a file to be unlinked while a process still has a valid file descriptor) and temporary files created by programs like vim and emacs.
For a simpler design, the heuristics part of the File plugin is now implemented as a separate plugin (Ckpt-File). This way the user can tweak this relatively simple newer plugin according to their wishes. Similarly, the file path translation mechanism could also be moved into its own plugin. Naturally, the original File plugin would then depend on these two plugins for their services.
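The wildcard-style substitution described above can be sketched as follows. This is a minimal illustration; the helper translate_path and its fixed prefix arguments are hypothetical, not the File plugin's actual API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of wildcard-style path translation: if `path`
 * begins with `old_prefix` (a mount point recorded at checkpoint time),
 * rewrite it to start with `new_prefix`.  Returns 1 on success. */
static int translate_path(const char *path, const char *old_prefix,
                          const char *new_prefix, char *out, size_t outlen) {
    size_t n = strlen(old_prefix);
    if (strncmp(path, old_prefix, n) != 0)
        return 0;                           /* prefix does not match */
    if (snprintf(out, outlen, "%s%s", new_prefix, path + n) >= (int)outlen)
        return 0;                           /* result would be truncated */
    return 1;
}
```

For the example in the text, translate_path("/mnt/foo/baz", "/mnt/foo", "/bar", ...) yields "/bar/baz".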
Socket plugin
The Socket plugin is responsible for checkpointing and restoring TCP/IP sockets, Unix domain sockets, and netlink sockets. Potentially, this plugin could be split into three different plugins, but for implementation purposes it is kept as a single unit. Further, since Unix domain sockets may be backed by a file on disk, it also depends on the File plugin for file path translation. The Socket plugin assigns a unique id to each end of a socket connection. In our implementation, the unique id comprises the unique id of the process that originally created the socket file descriptor and a per-process monotonically increasing counter. At the time of checkpoint, the processes on each end of a socket connection perform a handshake to exchange the unique socket id. On restart, this unique socket id is used to find the current location of the peer process using the publish/subscribe service.
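The unique socket id can be sketched as a simple pair; the field and function names below are illustrative, not the plugin's actual ones.

```c
#include <stdint.h>

/* Hypothetical sketch of the unique socket id described above: the id
 * combines an identifier for the creating process with a per-process
 * monotonically increasing counter. */
typedef struct {
    uint64_t process_uid;   /* unique id of the creating process */
    uint64_t counter;       /* per-process monotonic counter */
} sock_id_t;

static uint64_t next_counter = 0;

static sock_id_t new_sock_id(uint64_t process_uid) {
    sock_id_t id = { process_uid, ++next_counter };
    return id;
}
```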
Event plugin
The Event plugin is responsible for checkpointing and restoring the file descriptors used for event notifications. Apart from supporting the older poll system call (used for monitoring file descriptors), this plugin provides support for the epoll (similar to poll), eventfd (used for an event wait/notify mechanism from user space), signalfd (used for accepting signals targeted at the caller), and inotify (used for monitoring file system events) system calls.
Inotify is the most difficult to checkpoint and restart. The desired behavior on restart is not well-defined and may be application dependent. For example, inotify can be used to get a notification when a file has been renamed. Suppose that the file is renamed after checkpoint. On restart, the file will be present with its new name and thus won’t be renamed again. In this case, it is not clear whether an event notification should be generated. The plugin can be modified to allow the user to specify the default behavior for use with the application.

5.2 Pid, System V IPC, and Timer Plugins
We have already discussed the Pid plugin as an example of virtualizing the kernel resource identifiers in Section 3.1.1.
The System V IPC (SysVIPC) plugin supports checkpointing of System V shared memory, semaphores, and message queues. The operating system kernel generates an identifier for each System V IPC object. The identifier may change on restart, and thus we need to virtualize it. The SysVIPC plugin virtualizes these identifiers in a manner similar to the Pid plugin: a virtual id is generated for each System V IPC object, and a translation table is kept for translating between virtual and real ids. In addition to virtualizing the resource ids, the SysVIPC plugin also needs to checkpoint the associated state of each System V IPC object. For example, the memory contents of a shared memory region need to be checkpointed, a semaphore’s value needs to be restored, and a message queue needs to be drained on checkpoint and refilled on restart. Since these objects are potentially shared between multiple processes, the plugin performs leader election using the publish/subscribe mechanism.
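A minimal sketch of the virtual-to-real id translation follows. The table layout and function names are hypothetical; the real plugin also wraps calls such as shmget and semget to keep the table current across restarts.

```c
/* Minimal sketch of a virtual-to-real id table, assuming a small fixed
 * table.  virtualize() is called when an object is created, real_of()
 * inside library-call wrappers, and update_real() after restart, when
 * the kernel has assigned a new real id. */
#define MAX_IDS 64
static int virt_ids[MAX_IDS], real_ids[MAX_IDS];
static int num_ids = 0;

static int virtualize(int real_id) {
    int virt = 1000 + num_ids;          /* illustrative naming scheme */
    virt_ids[num_ids] = virt;
    real_ids[num_ids] = real_id;
    num_ids++;
    return virt;
}

static int real_of(int virt_id) {
    for (int i = 0; i < num_ids; i++)
        if (virt_ids[i] == virt_id) return real_ids[i];
    return -1;                          /* unknown virtual id */
}

static void update_real(int virt_id, int new_real) {
    for (int i = 0; i < num_ids; i++)
        if (virt_ids[i] == virt_id) real_ids[i] = new_real;
}
```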
Lastly, we discussed the virtualization of clock and timer ids in Section 3.1.5. As described there, in addition to virtualizing the resource ids, application-specific fine-tuning is required to control the behavior of timers on restart.
5.3 Application-Specific Plugins
The CkptFile plugin is used to provide heuristics for saving the contents of open files during checkpoint. The plugin can be configured to read wildcard patterns from a configuration file for dynamically updating the heuristics. The File plugin consults the CkptFile plugin for each open file, and the CkptFile plugin responds with whether or not to checkpoint the data of the given file.
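The pattern-based heuristic might look like the following sketch, assuming the configuration file has already been parsed into an array of glob patterns. POSIX fnmatch is used here for illustration; the plugin's actual matching code may differ.

```c
#include <fnmatch.h>

/* Illustrative patterns, standing in for a parsed configuration file:
 * save the data of files under /tmp and of editor swap files. */
static const char *save_patterns[] = { "/tmp/*", "*.swp", 0 };

/* Returns nonzero if the file's data should be saved in the checkpoint
 * image (i.e., the path matches some configured pattern). */
static int should_ckpt_file_data(const char *path) {
    for (int i = 0; save_patterns[i]; i++)
        if (fnmatch(save_patterns[i], path, 0) == 0)
            return 1;
    return 0;
}
```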
The Environ plugin provides heuristics for restoring or updating the process environment variables after a restart. This is useful for processes that use environment variables to find the addresses of system services, daemons, and so on. The Environ plugin reads patterns from a configuration file to selectively update the restarting process’s environment.
The Uniq-Ckpt plugin is responsible for keeping a rolling set of checkpoint images, as configured by the user. It can automatically delete or rename older checkpoint images to save disk space.
The Malloc plugin puts wrappers around malloc, free, etc. to avoid deadlock inside the malloc library, as explained in Section 4.6.3. The plugin can further be used to switch to a different malloc implementation for debugging.
The Dlopen plugin provides wrappers for the dlopen, dlsym, and dlclose library calls. The dlopen wrapper is used to ensure atomicity with respect to checkpointing, so that the process doesn’t get checkpointed while a library is still being initialized. The dlsym wrapper is used to create wrappers for functions that are present in the library being loaded. The dlsym wrapper can return the address of a wrapper function (defined in the plugin) instead of the library function. The wrapper function then may call the real function in the newly loaded library.
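The dlsym-wrapper idea can be sketched as a name-to-wrapper lookup table. This is a simplified illustration; a real plugin would fall through to the genuine dlsym (obtained, for example, via dlvsym) when no wrapper is registered, and that fallback is elided here.

```c
#include <stddef.h>
#include <string.h>

static int wrapped_foo(void) { return 42; }   /* stands in for a real wrapper */

/* Table mapping function names in the newly loaded library to wrapper
 * functions defined in the plugin. */
struct wrapper { const char *name; void *fn; };
static struct wrapper wrappers[] = { { "foo", (void *)wrapped_foo }, { 0, 0 } };

/* Sketch of the dlsym wrapper's core decision: return the plugin's
 * wrapper if one exists; a real implementation would otherwise call the
 * genuine dlsym(handle, symbol). */
static void *plugin_dlsym(const char *symbol) {
    for (int i = 0; wrappers[i].name; i++)
        if (strcmp(wrappers[i].name, symbol) == 0)
            return wrappers[i].fn;
    return NULL;
}
```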
5.4 SSH Connection
The issues involved in checkpointing an SSH session, as discussed in Section 3.1.2, are reviewed here, followed by a description of the solution based on our virtualization scheme. Previous support for distributed checkpointing covered the common uses of ssh, where it is used to launch remote jobs but not for active communication. In some HPC environments (e.g., Open MPI), this is the default behavior: remote processes are launched over SSH, and later establish a simple TCP socket for efficient communication. This work provides support for active communication over SSH.
Recall that SSH allows two processes to communicate securely over an insecure network. A user process uses an SSH client process to connect to a remote SSH server (daemon) process. On creating a secure connection, the SSH server process (sshd) launches the child process (app2), as shown in Figure 3.2. The process app1 appears to read and write locally through a pipe to app2.
The SSH daemon is a privileged process running a certain protocol. In the process virtualization approach, the plugin must virtualize that protocol. Further, checkpointing and restarting the privileged SSH daemon by an unprivileged user is not possible, since the user cannot recreate the privileged SSH daemon (sshd) on restart.
Launching remote process under checkpoint control
Recall that a process on Node1 launches a remote process on Node2 by running the SSH client program as ssh Node2 app2. The earlier DMTCP used a strategy of detecting an exec call to ssh Node2 app2 and replacing it with a call to ssh Node2 dmtcp_launch app2. Ad hoc code allowed ssh to create a remote process under checkpoint control, but it was assumed that the application would then close the SSH connection.
The solution for supporting long-lived SSH connections is shown in Figure 3.3. In essence, following a process virtualization approach, the SSH plugin defines a wrapper function around the exec family of system calls. It then replaces a call by exec to ssh Node2 app2 with a call to:
ssh Node2 dmtcp_launch virt_sshd app2
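The argv rewriting performed inside such an exec wrapper can be sketched as follows. The helper rewrite_ssh_argv is hypothetical, and memory management and error handling are simplified for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the argv rewriting done inside an exec-family wrapper:
 * "ssh Host cmd..." becomes "ssh Host dmtcp_launch virt_sshd cmd...". */
static char **rewrite_ssh_argv(char *const argv[]) {
    int argc = 0;
    while (argv[argc]) argc++;
    if (argc < 3 || strcmp(argv[0], "ssh") != 0)
        return (char **)argv;                /* not an ssh launch: unchanged */
    char **nargv = malloc((argc + 3) * sizeof(char *));
    nargv[0] = argv[0];                      /* "ssh" */
    nargv[1] = argv[1];                      /* remote host */
    nargv[2] = "dmtcp_launch";
    nargv[3] = "virt_sshd";
    for (int i = 2; i <= argc; i++)          /* remaining args + NULL */
        nargv[i + 2] = argv[i];
    return nargv;
}
```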
For technical reasons, the plugin actually creates two auxiliary processes, virt_ssh and virt_sshd. (The code for these processes is part of the SSH plugin, which arranges for them to run as separate processes.) These processes also allow us to recreate the SSH connection on restart, even in the less common situations where the app1 process has exited, leaving a child of app1 to continue to employ the SSH connection from Node1.
Checkpoint
At the time of checkpoint, only the processes app1, app2, virt_ssh, and virt_sshd are checkpointed. The ssh and sshd processes are not under checkpoint control and are not checkpointed. Further, virt_ssh and virt_sshd can directly “drain” any in-flight network data that has not yet reached its destination at the time of checkpoint. Thus, they act as buffers that hold network data prior to resume or restart. During resume, the drained data is written directly to the corresponding pipes between the user processes and the DMTCP helper processes.
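The drain step can be sketched as follows. The helper drain_fd is hypothetical; the idea is simply to pull everything currently buffered in a pipe into memory, so that it can be written back into the corresponding pipe on resume or restart.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Read everything currently buffered in `fd` (made nonblocking for the
 * duration) into `buf`; returns the number of bytes drained.  On resume,
 * the buffer would be written back into the corresponding pipe. */
static ssize_t drain_fd(int fd, char *buf, size_t buflen) {
    int flags = fcntl(fd, F_GETFL);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    ssize_t total = 0, n;
    while ((size_t)total < buflen &&
           (n = read(fd, buf + total, buflen - total)) > 0)
        total += n;
    fcntl(fd, F_SETFL, flags);               /* restore original flags */
    return total;
}
```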
Figure 5.1: Restoring an SSH connection. The virt_ssh process launches sshd_helper on Node2, which relays stdio between ssh and virt_sshd.
Restart
Figure 5.1 illustrates how the four checkpointed processes are restored during restart. The four processes on Node1 and Node2 are restarted via:
ssh Node1 dmtcp_restart
Note that in the general case, Node1 and Node2 may both be remote nodes. Next, an SSH connection must be created between the two processes virt_ssh and virt_sshd. To accomplish this, virt_ssh uses publish/subscribe to discover the address of the virt_sshd process. Next, virt_ssh forks a child process, which “execs” into the following program:
ssh Node2 sshd_helper
Finally, the sshd_helper process relays the data of its stdio pipes from the SSH server process through stdio pipes to the virt_sshd process. The sshd_helper process exits when the virt_sshd process exits. The sshd_helper process is never part of any subsequent checkpoint.
5.5 Batch-Queue Plugin for Resource Managers
One of the long-standing functionality requirements for batch-queue managers at various HPC centers is the ability to suspend a low-priority job to allow execution of a high-priority job as soon as it arrives. While there have been MPI-specific solutions supporting this use case (see Section 2.1.2), they have not been integrated into batch-queue systems for lack of complete functionality. The batch-queue plugin by Polyakov [93] solves this problem by providing a native checkpoint-restart facility that can be embedded in the batch queue itself.
The goal of the batch-queue plugin is to recreate the original parallel computation in a transparent manner. This mechanism is invisible both to any resource manager and to the MPI libraries themselves. During restart, the batch-queue plugin must adapt to the new execution environment created by the resource manager at that time. The plugin must detect the newly available nodes during restart and arrange for launching the restarted user processes onto appropriate nodes. Issues specific to a resource manager may
arise during this process, such as the creation by the resource manager of a new read-only nodefile that is inconsistent with the pre-checkpoint version (see below).
Recall that modern resource management (RM) systems allocate resources for jobs, which are then launched in the background in a non-interactive mode. Although RM systems don’t intervene much in a program’s execution (except for PMI; see the example below), they do modify part of its execution environment. For example, some of them redirect a program’s standard input, output, and error to special files, and later move those files to the user’s working directory once the program has finished or been killed. They also provide services for the remote launch of programs, such as tm_spawn() for TORQUE PBS, lsb_launch() for Load Sharing Facility (LSF), and even standalone commands such as srun for SLURM.
The batch-queue plugin can handle the new execution environment during restart. It detects the available nodes and launches the restarting processes onto the nodes as required. The new program may not have permission to overwrite some environment files (e.g., the nodefile) and may need to update the corresponding file descriptors to point to the copies of the files saved during checkpoint. We next discuss some of the virtualization strategies provided by the batch-queue plugin.
Support for batch system remote launch mechanism
To fully support parallel programs in modern RM systems, the remote child processes should be automatically placed under checkpoint control. For all supported batch systems this plugin uses the same technique to provide this service: it patches the command line passed to the remote launch mechanism by adding a prefix, dmtcp_launch <options>. For example, in the case of TORQUE PBS, a wrapper for tm_spawn updates the passed arguments to insert the dmtcp_launch command.
Communication between Batch Systems and the Application
A common issue for any resource manager is the binding of stdin/stdout/stderr to files. Those files must be saved in the checkpoint image for the sake of consistency and transparency. At restart time, the plugin must discover the bindings of stdin/stdout/stderr to the new files created by the resource manager. Any content saved from before the checkpoint must be written into those files.
Batch systems usually communicate with applications using special environment variables. Some batch systems use auxiliary files in addition to the environment variables. For example, TORQUE saves a list of its allocated nodes into a read-only nodefile, which can be cached by the application. But at restart time, a new read-only nodefile will be generated, different from the one cached by the application. To address this situation, the batch-queue plugin creates a temporary file containing the original nodefile contents and modifies the file descriptor of the restarted application to point to this alternate nodefile.
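The fix-up can be sketched as follows. The helper repoint_fd is hypothetical, and error handling is largely elided; the essential trick is dup2, which atomically makes the application's existing descriptor refer to the private copy of the saved nodefile.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write the saved pre-checkpoint contents into a fresh temporary file
 * and splice it onto the file descriptor the application still holds. */
static int repoint_fd(int oldfd, const char *saved, size_t len) {
    char tmpl[] = "/tmp/nodefileXXXXXX";
    int tmpfd = mkstemp(tmpl);
    if (tmpfd < 0) return -1;
    if (write(tmpfd, saved, len) != (ssize_t)len) { close(tmpfd); return -1; }
    lseek(tmpfd, 0, SEEK_SET);
    dup2(tmpfd, oldfd);        /* oldfd now refers to the private copy */
    close(tmpfd);
    unlink(tmpl);              /* the copy lives on anonymously via oldfd */
    return 0;
}
```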
Communication between MPI Application and External PMI Interface
Most modern MPI implementations use or support the Process Management Interface (PMI) [14]. The PMI model comprises three entities: the MPI library, the PMI library, and the process manager. Currently there are several implementations of the process manager entity, including the standalone Hydra package and the PMI server of the SLURM resource manager.
While the multi-host-capable Socket plugin transparently supports the Hydra implementation, additional plugin support is needed to integrate the SLURM PMI implementation. SLURM requires an MPI process to communicate with the SLURM job step daemon, which is not under checkpoint control. In this case, the batch-queue plugin finalizes the PMI session before checkpointing and recreates it afterward.
Specialized peer-discovery and remote launch service
The processes may be restarted on different nodes, and the number of slots (processes per node) may differ on the new nodes. The batch-queue plugin employs a node discovery tool to find the new nodes and to map the old resources onto the newly allocated node set. For the TORQUE RM, the plugin analyzes the new nodefile; for SLURM, it parses the SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE environment variables. After this step, the resource allocation is available in an RM-independent format. Next, the old resources are mapped onto the new ones; the mapping algorithm should consider the slots when matching resources between the old and new sets. Once the resources have been mapped, the application is launched using the appropriate RM system mechanism. It should be noted that processes that were launched on the head node of a cluster usually have a special environment (special stdin/stdout/stderr connections and access to the nodefile) and may need special treatment.
5.6 Ptrace Plugin
The ptrace system call is used by a superior process (e.g., gdb, strace, etc.) to attach to an inferior process (e.g., a.out) in order to trace it. The ptrace system call uses CPU hardware support, making it harder to checkpoint: the inferior process can’t perform a checkpoint until it is detached or allowed to run freely during the checkpoint phase. A ptrace plugin is used to solve these problems [127]. The ptrace plugin in the superior process detaches the inferior process before checkpointing and re-attaches it right after restart.
The ptrace plugin in the inferior process has an added responsibility. It is often the case that the inferior threads are quiesced while they are in possession of a system resource, or while executing a critical section of code. This can result in a deadlock. To fix this, the ptrace plugin forces the user threads to release resources before entering a quiescent state. This is done by using Pre/Post-Quiesce event notifications. Pre-Quiesce is generated by a user thread just before entering the quiescent state. While processing this hook, each thread ensures that it is not holding any system resources, locks, etc. that could result in a deadlock. The Post-Quiesce phase forces the inferior thread to wait until the superior can attach to it after restart.
5.7 Deterministic Record-Replay
The record-replay plugin is needed by any reversible debugger that uses a checkpoint, restart, and re-execute strategy. FReD (Fast Reversible Debugger) [112] can add reversibility to any debugger by using this strategy, and it uses DMTCP for checkpointing. Deterministic record-replay for FReD was achieved by creating a record-replay plugin to be used with DMTCP. This plugin is generally placed before any other plugin in the plugin hierarchy, to allow it to “hijack” library calls. Due to its complexity, the record-replay plugin is the largest plugin in terms of lines of code (see Table 5.2).
There are several potential sources of nondeterminism in program execution, and record-replay must address all of them: thread interleaving, external events (I/O, etc.), and memory allocation. While correct replay of external events is required for all kinds of programs, memory accuracy is often not an issue for higher-level languages like Python and Perl, which do not expose the underlying heap to the user’s program.
FReD handles all these aspects by wrapping various system calls. Relevant events are captured by interposing on library calls, using dlopen/dlsym to create function wrappers for the library functions of interest. The wrappers record events into a log on the first execution and then return the appropriate values (or block threads as required) on replay.
We start recording when directed by FReD (often after the first checkpoint). The system records the events related to thread interleaving, external events, and memory allocation into a log. On replay, it ensures that the events are replayed in the same order as they were recorded. The plugin guarantees deterministic replay, even when executing on multiple cores, so long as the program is free of data races.
Thread interleaving
FReD uses wrappers around library calls such as pthread_mutex_lock and pthread_mutex_unlock to enforce the correct thread interleaving during replay. Apart from the usual pthread_xxx functions, other functions that can enforce a certain interleaving are blocking functions like read. For example, a thread can signal another thread by writing into the write end of a pipe while the other thread is doing a blocking read on the read end of that pipe.
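The record-mode side of such a wrapper can be sketched as follows. A single global log stands in for FReD's actual logging, which also records the acquiring thread and synchronizes access to the log; replay mode would consult the log to force the same acquisition order.

```c
#include <pthread.h>

/* Minimal sketch of a record-mode wrapper: each successful lock
 * acquisition is appended to a global log. */
#define LOG_MAX 1024
static pthread_mutex_t *lock_log[LOG_MAX];
static int log_len = 0;

static int logging_mutex_lock(pthread_mutex_t *m) {
    int ret = pthread_mutex_lock(m);      /* call the real function */
    if (ret == 0 && log_len < LOG_MAX)
        lock_log[log_len++] = m;          /* record acquisition order */
    return ret;
}
```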
Replay of external events
Applications typically interact with the outside world as part of their execution. They also interact with the debugger and the user as part of the debugging process. Composite debugging requires separating these streams. For debuggers that trace a program in a separate process, the I/O by the process being debugged is recorded and replayed, whereas the I/O by the debugger process is ignored.
For interpreted languages, the situation becomes trickier, as the record-replay plugin cannot differentiate between the debugger I/O and the application I/O. FReD handles this situation heuristically: it designates the standard input/output/error file descriptors as pass-through devices. Activity on the pass-through devices is ignored by the record-replay component.
Memory accuracy
One important feature of FReD is memory accuracy: the addresses of objects on the heap do not change between the original execution and replay. This is important because it means that developers can use address literals in expression watchpoints (assuming they are supported by the underlying debugger).
With true replay of the application program, one would expect the memory layout to match the record phase, but the DMTCP libraries have to perform different actions during a normal run and on restart. This results in some memory allocations and deallocations originating from the DMTCP libraries that can alter the memory layout. Another cause of changes in the memory layout is the memory allocated by the operating system kernel when the process doesn’t specify a fixed address. An example is the mmap system call without any address hint; in this case, the kernel is free to choose any address for the memory region.
Memory accuracy is accomplished by logging the arguments, as well as the return values, of mmap, munmap, etc. during recording. On replay, the real functions or system calls are re-executed in the exact same order. However, the record-replay plugin provides a hint to the kernel to obtain the same memory address as was received at record time. FReD handles any conflicts caused by memory allocation and deallocation originating from DMTCP itself by forcing the use of a separate allocation arena for DMTCP requests.
5.8 Checkpointing Networks of Virtual Machines
Garg et al. [43] used DMTCP and plugins to provide a generic checkpoint-restart mechanism for three cases of virtual machines: user-space (standalone) QEMU [121], KVM/QEMU [114], and Lguest [115]. In all three
cases, the hypervisor (VMM, or virtual machine monitor) was based on Linux as the host operating system. These examples cover three distinct virtualization scenarios: entirely user-space virtualization (QEMU), full virtualization using a Linux kernel driver (KVM/QEMU), and paravirtualization using a Linux kernel driver [115].
The user-space QEMU virtual machine did not require any specific plugin. The KVM/QEMU and Lguest virtual machines required a new plugin consisting of approximately 200 lines of code. In addition, the kernel driver for Lguest required an additional 40 lines of new code to support the checkpoint-restart capability. The authors estimated the implementation time at approximately five to ten person-days. This is in contrast with the number of lines of code required for libvirt.
Garg et al. [44] further implemented the first system to checkpoint a network of virtual machines by virtualizing the tun/tap interface using a plugin. The tun plugin consisted of approximately 350 lines of code.
5.9 3-D Graphics: Support for Programmable GPUs in OpenGL 2.0 and Higher
Kazemi Nafchi et al. [62] describe a mechanism for transparently checkpointing hardware-accelerated 3D graphics. The approach is based on DMTCP with a plugin that performs record-prune-replay of OpenGL library calls. The calls not relevant to the last graphics frame prior to checkpointing are discarded; the remaining OpenGL calls are replayed on restart. The plugin uses approximately 4,500 lines of code.
Previously, Lagar-Cavilla et al. [69] presented VMGL for VMM-independent checkpoint-restart. VMGL used a shadow device driver for OpenGL, which shadows most OpenGL calls to model the OpenGL state and restores that state when restarting from a checkpoint. The code to maintain the OpenGL state was approximately 78,000 lines of code.
Further, the new plugin has added functionality. Lagar-Cavilla et al. supported only OpenGL 1.5 (fixed-pipeline functionality). The approach of the new plugin was demonstrated to apply to programmable GPUs (OpenGL 2.0 and beyond).
5.10 Transparent Checkpointing of InfiniBand
The InfiniBand plugin by Cao et al. [27] is the first to support checkpoint-restart of the native InfiniBand network. Previous checkpoint-restart systems [55] were MPI-specific. This plugin provides support for checkpointing UPC, an example of a PGAS language, which runs more efficiently when it runs natively over the InfiniBand fabric (instead of on top of an MPI layer). For applications such as these, there is no alternative solution.
Compared to approximately 3,000 lines of code for the InfiniBand plugin, the checkpoint-restart functionality in Open MPI uses approximately 17,000 lines of code (not counting the InfiniBand-specific code). This is in addition to the single-process checkpointer, BLCR, that is used by Open MPI.
5.11 IB2TCP: Migrating from InfiniBand to TCP Sockets
Some traditional checkpoint-restart services, such as that of Open MPI [55], offer the ability to checkpoint over one network and restart on a second network. This is especially useful for interactive debugging. A set of checkpoint images from an InfiniBand-based production cluster can be copied to an Ethernet/TCP-based debug cluster. Thus, if a bug is encountered after running for hours on the production cluster, the most recent checkpoints can be used to restart on the debug cluster under a symbolic debugger, such as GDB.
The IB2TCP plugin enables checkpointing over InfiniBand and restarting over Ethernet in a similar fashion. An important contribution of the IB2TCP plugin [27] is that, unlike the BLCR kernel-based approach, the DMTCP/IB2TCP approach supports using an Ethernet-based cluster that runs a different Linux kernel, something that occurs frequently in practice. Further, the IB2TCP plugin can be used either with the InfiniBand plugin or without it (though with limited support for checkpointing in the latter case).
Tesseract: Reconciling Guest I/O and Hypervisor Swapping in a VM
The previous chapters were concerned with adaptive plugins, a virtualization mechanism that decoupled the application process from its execution environment to facilitate transparent checkpoint-restart. In this chapter, I will present a virtualization mechanism that decouples the guest virtual disk from the guest operating system to prevent redundant I/O operations between the guest and the hypervisor.
Guests running in virtual machines read and write state between their memory and virtualized disks. Hypervisors such as VMware ESXi [57] likewise may page guest memory to and from a hypervisor-level swap file to reclaim memory. To distinguish these two cases, we refer to the activity within the guest OS as paging and that within the hypervisor as swapping. In overcommitted situations, these two sets of operations can result in a two-level scheduling anomaly known as “double paging”. Double paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor; it leads to long delays for the guest as the contents are read back into machine memory only to be written out again (see Sections 6.1 and 6.2). While the double-paging anomaly is well known [46, 48, 47, 128, 82], its impact on real workloads has not been established.
Our approach addresses the double-paging problem directly, in a manner transparent to the guest (see Section 6.3). First, the virtual machine is extended to track associations between guest memory and blocks either in guest virtual disks or in the hypervisor swap file. Second, the virtual disks are extended to support a mechanism that redirects virtual block requests to blocks in other virtual disks or in the hypervisor swap file. Third, the hypervisor swap file is extended to track references to its blocks. Using these components to restructure guest I/O requests, we eliminate the main effects of double paging by replacing the original guest operations with indirections between the guest and swap stores. An important benefit of this approach is that, where hypervisors typically attempt to avoid swapping pages likely to be paged out by the guest, the two levels may now cooperate in selecting pages, since the work is complementary.
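The three extensions above can be pictured with a small data-structure sketch. All names here are hypothetical illustrations; the real bookkeeping lives inside the hypervisor. The key ingredients are an association from a guest page to a backing block, and a reference count per swap-file block so a block can be reclaimed once no indirection points at it.

```c
/* Illustrative sketch of the bookkeeping described in the text. */
#define NBLOCKS 16
static int swap_refcount[NBLOCKS];       /* references to swap-file blocks */

/* Association between a guest page and its backing block: either a block
 * in a guest virtual disk or a block in the hypervisor swap file. */
typedef struct {
    int in_swap;    /* 1 if backed by the hypervisor swap file */
    int block;      /* block number in the disk or swap file */
} page_assoc_t;

static void add_ref(int block)  { swap_refcount[block]++; }

/* Drop one reference; returns 1 when the block becomes reclaimable. */
static int drop_ref(int block)  { return --swap_refcount[block] == 0; }
```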
We have prototyped our approach on the VMware Workstation [56] platform, enhanced to explicitly swap memory in and out. While the current implementation focuses on deduplicating guest I/Os for contents stored in the hypervisor swap file, it is general enough to also deduplicate redundant contents between guest I/Os themselves, or between the hypervisor swap file and guest disks (see Section 6.4).
In Section 6.5, we also show the impact of an unexpected side-effect of our solution: loss of locality caused by indirections to the hypervisor swap file, which can substantially slow down subsequent guest I/Os. Finally, we describe techniques to detect this loss of locality and to recover from it. These techniques isolate the expensive costs of the double-paging effect and make them asynchronous with respect to the guest.
In Section 6.6, we present results using a synthetic benchmark that show, for the first time, the cost of the double-paging problem. Finally, in Section 6.7, we discuss related work.
[Figure: four panels showing redundant I/O between guest physical memory, the guest virtual disk, and the host paging device: (a) host swap out followed by guest disk read; (b) host swap out followed by guest overwriting the entire page; (c) host swap out of an unmodified guest page; (d) host swap out followed by guest disk write (double-paging).]

Figure 6.1: Some cases of redundant I/O in a virtual machine.
6.1 Redundant I/O
Figure 6.1 shows some examples of redundant I/O resulting from bad interaction between hypervisor swapping and guest I/O. In Figure 6.1a, the hypervisor swap out is followed by the guest overwriting the entire page with a disk read. From the hypervisor's point of view, the guest has accessed the page, and so it unnecessarily swaps the guest page back in. Similarly, in Figure 6.1b, the host swap out is followed by the guest zeroing out the entire page. Here again, the hypervisor swap in is wasteful. In Figure 6.1c, the guest reads a page from the disk into its physical memory. The page is “clean”, i.e., its contents have not been modified by the guest. However, when under memory pressure, the hypervisor tries to swap out this page as well. Ideally, the hypervisor could have discarded the page contents and later restored them from the guest disk if needed. Finally, in Figure 6.1d, the guest tries to page out a page that has already been swapped out by the host. This is the case of double-paging. The first two cases (Figures 6.1a and 6.1b) have already been addressed in some commercial products such as the VMware ESX hypervisor. Further, concurrent work of Amit et al. [5] implements solutions for the first three cases (using mmap structures as the remapping mechanism or boundary in Linux) but ignores the fourth. Tesseract addresses the first two cases (Figures 6.1a and 6.1b) along with the double-paging case (Figure 6.1d). In addition, it can serve as a basis for the third case (Figure 6.1c) and a fifth case: a guest write followed by another guest write.
6.2 Motivation: The Double-Paging Anomaly
Tesseract has four objectives. First, to extend VMware's hosted platforms, Workstation and Fusion, to explicitly manage how the hypervisor pages out memory so that its swap subsystem can employ many of the optimizations used by the ESX platform. Second, to prototype the mechanisms needed to identify redundant I/Os originating from the guest and virtual machine monitor (VMM) and eliminate them. Third, to use this prototype to justify restructuring the underlying virtual disks of VMs to support this optimization. Finally, to simplify the hypervisor's memory scheduler so that it need not avoid paging out memory that the guest may decide to page. To address these, the project initially focused on the double-paging anomaly.

One of the tasks of the hypervisor is to allocate and map host (or machine) memory to the VMs it is managing. Likewise, one of the tasks of the guest operating system in a VM is to manage the guest physical address space, allocating and mapping it to the processes running in the guest. In both cases, either the set of machine memory pages or the set of guest physical pages may be oversubscribed.
In overcommitted situations, the appropriate memory scheduler must repurpose some memory pages. For example, the hypervisor may reclaim memory from a VM by swapping out guest pages to the hypervisor-level swap file. Having preserved the contents of those pages, the underlying machine memory may be used for a new purpose. The guest OS may likewise reclaim memory within a VM to allow a guest physical page to be used by a new virtual mapping.
As hypervisor-level memory reclamation is transparent to the guest OS, the latter may choose to page out to a virtualized disk pages that were already swapped by the hypervisor. In such cases, the hypervisor must synchronously allocate machine pages to hold the contents and read the already swapped contents back into that memory so they can be saved, in turn, to the guest OS's swap device. This multi-level scheduling conflict is called double-paging.
Figure 6.2 illustrates the double-paging problem. Suppose the hypervisor decides to reclaim a machine page (MPN) that is backing a guest physical page (PPN). In step 1, the mapping between the PPN and MPN is invalidated and, in step 2, the contents of the MPN are saved to the hypervisor's swap file. Suppose the guest OS later decides to reallocate the PPN for a new guest virtual mapping. It, in turn, in step 3a invalidates the guest-level mappings to that PPN and initiates an I/O to preserve its contents in a guest virtual disk (or guest VMDK). In handling the guest I/O request, the hypervisor must ensure that the contents to be written are available in memory. So, in step 4, the hypervisor faults the contents into a newly allocated page (MPN2) and, in step 5, establishes a mapping from the PPN to MPN2. This sequence puts extra pressure on the hypervisor memory system and may cause additional hypervisor-level swapping as a result of allocating MPN2. In step 6, the guest OS completes the I/O by writing the contents of MPN2 to the guest VMDK. Finally, the guest OS is able to zero the contents of the new MPN so that the PPN that now maps to it can be used for a new virtual mapping in step 7.

[Figure: guest physical memory, guest disk, and host paging device, with steps (1), (2): swap out; (3a, 3b): guest block write request; (4): memory allocation and swap in; (5): establish PPN to MPN mapping; (6): write block to guest disk; (7): zero the new MPN for reuse.]

Figure 6.2: An example of double-paging.
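The baseline sequence above can be sketched as a small simulation. This is an illustrative model only, not VMware code; all names (`Hypervisor`, `swap_out`, `guest_page_out`) are hypothetical. It counts the redundant allocation and swap-in (steps 4 and 5) that occur when the guest pages out a PPN the hypervisor has already swapped.

```python
# Hypothetical sketch of the baseline double-paging sequence in Figure 6.2.
# Counts the extra work the hypervisor performs when the guest pages out a
# PPN that was already swapped out at the hypervisor level.

class Hypervisor:
    def __init__(self):
        self.ppn_to_mpn = {}   # guest-physical page -> machine page contents
        self.swap_file = {}    # PPN -> swapped-out contents
        self.swap_reads = 0
        self.allocations = 0

    def swap_out(self, ppn):
        # Steps 1-2: invalidate the PPN->MPN mapping, save contents to swap.
        contents = self.ppn_to_mpn.pop(ppn)
        self.swap_file[ppn] = contents

    def guest_page_out(self, ppn, guest_disk):
        # Steps 3-6: the guest writes the PPN to its virtual disk. If the
        # hypervisor already swapped it out, it must allocate a new MPN and
        # read the contents back in, just to write them out again.
        if ppn not in self.ppn_to_mpn:
            self.allocations += 1                            # step 4: new MPN
            self.swap_reads += 1                             # step 4: swap-in
            self.ppn_to_mpn[ppn] = self.swap_file.pop(ppn)   # step 5
        guest_disk[ppn] = self.ppn_to_mpn[ppn]               # step 6

hv = Hypervisor()
hv.ppn_to_mpn[7] = "contents-of-ppn-7"
hv.swap_out(7)                 # hypervisor reclaims the backing MPN
disk = {}
hv.guest_page_out(7, disk)     # guest later pages the same PPN out
assert disk[7] == "contents-of-ppn-7"
assert hv.swap_reads == 1 and hv.allocations == 1   # the redundant work
```

Tesseract's goal, described in the following sections, is to elide exactly this branch of `guest_page_out` by redirecting the guest's disk block to the copy already in the swap store.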
A hypervisor has no control over when a virtualized guest may page memory out to disk, and may even employ reclamation techniques like ballooning [128] in addition to hypervisor-level swapping. Ballooning is a technique that co-opts the guest into choosing pages to release back to the platform. It employs a guest driver or agent to allocate, and often pin, pages in the guest's physical address-space. Ballooning is not a reliable solution in overcommitted situations since it requires guest execution to choose pages and release memory, and the guest is unaware of which pages are backed by MPNs. Hypervisors that do not also page risk running out of memory. While preferring ballooning, VMware uses hypervisor swapping to guarantee progress. Because levels of overcommitment vary over time, hypervisor swapping may interleave with the guest, under pressure from ballooning, also paging. This can lead to double paging.

The double-paging problem also impacts hypervisor design. Citing the potential effects of double-paging, some [82] have advocated avoiding the use of hypervisor-level swapping completely. Others have attempted to mitigate the likelihood through techniques such as employing random page selection for hypervisor-level swapping [128] or employing some form of paging-aware paravirtualized interface [48, 47]. For example, VMware's scheduler uses heuristics to find “warm” pages to avoid paging out what the guest may also choose to page out. These heuristics have extended effects, for example, on the ability to provide large (2MB) mappings to the guest. Our goals are to address the double-paging problem in a manner that is transparent to the guest running in the VM, to identify and elide the unnecessary intermediate steps such as steps 4, 5, and 6 in Figure 6.2, and to simplify hypervisor scheduling policies. Although we do not demonstrate that double-paging is a problem in real workloads, we do show how its effects can be mitigated.
6.3 Design
We now describe our prototype’s design. First, we describe how we extended the hosted platform to behave more like VMware’s server platform, ESX. Next, we outline how we identify and eliminate redundant I/Os. Finally, we describe the design of the hypervisor swap subsystem and the extensions to the virtual disks to support indirections.
6.3.1 Extending The Hosted Platform To Be Like ESX
VMware supports two kinds of hypervisors: the hosted platform, in which the hypervisor cooperatively runs on top of an unmodified host operating system such as Windows or Linux, and ESX, where the hypervisor runs as the platform kernel, the vmkernel. Two key differences between these two
platforms are how memory is allocated and mapped to a VM, and where the network and storage stacks execute.
In the existing hosted platform, each VM's device support is managed in the vmx, a user-level process running on the host operating system. Privileged services are mediated by the vmmon device driver loaded into the host kernel, and control is passed between the vmx and the VMM and its guest via vmmon. An advantage of the hosted approach is that the virtualization of I/O devices is handled by libraries in the vmx, and these benefit from the device support of the underlying host OS. Guest memory is mmapped into the address space of the vmx. Memory pages are exposed to the VMM and guest by using the vmmon device driver to pin the pages in the host kernel and return the MPNs to the VMM. By backing the mmapped region for guest memory with a file, hypervisor swapping is a simple matter of invalidating all mappings for the pages to be released in the VMM, marking, if necessary, those pages as dirty in the vmx's address space, and unpinning the pages on the host.
In ESX, network and storage virtual devices are managed in the vmkernel. Likewise, the hypervisor manages per-VM pools of memory for backing guest memory. To page memory out to the VM's swap file, the VMM and vmkernel simply invalidate any guest mappings and schedule the pages' contents to be written out. Because ESX explicitly manages the swap state for a VM, including its swap file, it is able to employ a number of optimizations unavailable on the current hosted platform. These optimizations include the capturing of writes to entire pages of memory [4], and the cancellation of swap-ins for swapped-out guest PPNs that are targets for disk read requests.
The first optimization is triggered when the guest accesses an unmapped or write-protected page and faults into the VMM. At this point, the guest's instruction stream is analyzed. If the page is shared [128] and the effect of the write does not change the content of the page, page-sharing is not broken. Instead, the guest's program counter is advanced past the write and it is allowed to continue execution. If the guest's write is overwriting an entire page, one or both of two actions are taken. If the written pattern is a known value, such as repeated 0x00, the guest may be mapped a shared page. This technique is used, for example, on Windows guests because Windows zeroes physical pages as they are placed on the freelist. Linux, which zeroes on allocation of a physical page, is simply mapped a writeable zeroed MPN. Separately, any pending swap-in for that PPN is cancelled. Since the most common case is the mapping of a shared zeroed-page to the guest, this optimization is referred to as the PShareZero optimization.
The second optimization is triggered by interposition on guest disk read requests. If a read request will overwrite whole PPNs, any pending swap-ins associated with those PPNs are deferred during write-preparation, the pages are pinned for the I/O, and the swap-ins are cancelled on successful I/O completion.
We have extended Tesseract so that its guest-memory and swap mechanisms behave more like those of ESX. Instead of mmapping a pagefile to provide memory for the guest, Tesseract's vmx process mmaps an anonymously-backed region of its address space, uses madvise to mark the range as NOT-NEEDED, and explicitly pins pages as they are accessed by either the vmx or by the VMM. Paging by the hypervisor becomes an explicit operation, reading from or writing to an explicit swap file. In this way, we are able to also employ the above optimizations on the hosted platform. We consider these as part of our baseline implementation.
6.3.2 Reconciling Redundant I/Os
Tesseract addresses the double-paging problem transparently to the guest, allowing our solution to be applied to unmodified guests. To achieve this goal, we employ two forms of interposition. The first tracks writes to PPNs by the guest and is extended to include a mechanism to track valid relationships
between guest memory pages and disk blocks that contain the same state. The second exploits the fact that the hypervisor interposes on guest I/O requests in order to transform the requests' scatter-gather lists. In addition, we modify the structure of the guest VMDKs and the hypervisor swap file, extending the former to support indirections from the VMDKs into the hypervisor swap disk. Finally, when the guest reallocates the PPN and zeroes its contents, we apply the PShareZero optimization in step 7 in Figure 6.2.
In order to track which pages have writable mappings in the guest, MPNs are initially mapped into the guest read-only. When written by the guest, the resulting page-fault allows the hypervisor to track that the guest page has been modified. We extend this same tracking mechanism to also track when guest writes invalidate associations between guest pages in memory and blocks on disk. The task is simpler when the hypervisor, itself, modifies guest memory since it can remove any associations for the modified guest pages. Likewise, virtual device operations into guest pages can create associations between the source blocks and pages. In addition, the device operations may remove prior associations when the underlying disk blocks are written. This approach, employed, for example, to speed the live migration of VMs from one host to another [87], can efficiently track which guest pages in memory have corresponding valid copies of their contents on disks.
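The association tracking above can be sketched as follows. This is an illustrative model under stated assumptions, not VMware code: the class and method names are hypothetical, and the write-protection fault is simulated by an explicit callback.

```python
# Illustrative sketch of tracking associations between guest pages and
# disk blocks, and invalidating them when the guest writes to a page.
# A device read into a page creates an association; the page is then
# mapped read-only so the first guest write faults and breaks it.

class AssociationTracker:
    def __init__(self):
        self.page_to_block = {}   # PPN -> (disk, block) with identical contents
        self.writable = set()     # PPNs currently mapped writable

    def on_device_read(self, ppn, disk, block):
        # A virtual-device read into a page creates an association.
        self.page_to_block[ppn] = (disk, block)
        self.writable.discard(ppn)   # map read-only to catch the next write

    def on_guest_write_fault(self, ppn):
        # First write after the page was mapped read-only: the on-disk copy
        # no longer matches memory, so drop the association and allow writes.
        self.page_to_block.pop(ppn, None)
        self.writable.add(ppn)

t = AssociationTracker()
t.on_device_read(ppn=3, disk="vmdk0", block=120)
assert t.page_to_block[3] == ("vmdk0", 120)   # memory and disk now match
t.on_guest_write_fault(3)
assert 3 not in t.page_to_block               # guest modified the page
```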
The second form of interposition occurs in the handling of virtualized guest I/O operations. The basic I/O path can be broken down into three stages. The basic data structure describing an I/O request is the scatter-gather list, a structure that maps one or more possibly discontiguous memory extents to a contiguous range of disk sectors. In the preparation stage, the guest's scatter-gather list is examined and a new request is constructed that will be sent to the underlying physical device. It is here that the unmodified hypervisor handles the faulting in of swapped out pages as shown in steps 4 and 5 of Figure 6.2. Once the new request has been constructed, it is issued asynchronously and some time later there is an I/O completion event.
To support the elimination of I/Os to and from virtual disks and the hypervisor block-swap store (or BSST), each guest VMDK has been extended to maintain a mapping structure allowing its virtual block identifiers to refer to blocks in other VMDKs. Likewise, the hypervisor BSST has been extended with per-block reference counts to track whether blocks in the swap file are accessible from other VMDKs or from guest memory.
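A minimal sketch of these two extended structures, a guest VMDK whose virtual blocks may redirect into the BSST, and a BSST with per-block reference counts, might look as follows. The class and field names are hypothetical; the on-disk formats in the prototype are not described at this level of detail.

```python
# Illustrative sketch of the extended mapping structures: a guest VMDK
# whose virtual blocks may redirect into the BSST, and a BSST carrying
# per-block reference counts from VMDKs and from swapped-out guest memory.

class BSST:
    def __init__(self):
        self.blocks = {}     # block id -> contents
        self.refcnt = {}     # block id -> count of references to the block

class GuestVMDK:
    def __init__(self, bsst):
        self.local = {}      # virtual block -> contents stored in this VMDK
        self.redirect = {}   # virtual block -> BSST block id
        self.bsst = bsst

    def write_indirection(self, vblock, bsst_block):
        # Instead of writing contents, record a redirection and take a ref.
        old = self.redirect.pop(vblock, None)
        if old is not None:
            self.bsst.refcnt[old] -= 1   # previous indirection invalidated
        self.redirect[vblock] = bsst_block
        self.bsst.refcnt[bsst_block] += 1

    def read(self, vblock):
        if vblock in self.redirect:
            return self.bsst.blocks[self.redirect[vblock]]
        return self.local[vblock]

bsst = BSST()
bsst.blocks[42] = "swapped-contents"
bsst.refcnt[42] = 1              # referenced by the swapped-out PPN
disk = GuestVMDK(bsst)
disk.write_indirection(vblock=5, bsst_block=42)
assert disk.read(5) == "swapped-contents"
assert bsst.refcnt[42] == 2      # swap entry + guest VMDK indirection
```

The final reference count of 2 mirrors the state of block 2 in Figure 6.5a: one reference from the still-swapped PPN and one from the guest VMDK indirection.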
The tracking of associations and interposition on guest I/Os allows four kinds of I/O elisions:

• swap - guest-I/O: a guest I/O follows the hypervisor swapping out a page's contents (Figures 6.1a and 6.1d)

• swap - swap: a page is repeatedly swapped out to the BSST with no intervening modification

• guest-I/O - swap: the case in which the hypervisor can take advantage of prior guest reads or writes to avoid writing redundant contents to the BSST (Figure 6.1c)

• guest-I/O - guest-I/O: the case in which guest I/Os can avoid redundant operations based on prior guest operations whose results are known to reside in memory (for reads) or in a guest VMDK (for writes)
For simplicity, Tesseract focuses on the first two cases since these capture the case of double-paging. Because Tesseract does not introspect on the guest, it cannot distinguish guest I/Os related to memory paging from other kinds of guest I/O. But the technique is general enough to support a wider set of optimizations such as disk deduplication for content streamed through a guest. It also complements techniques that eliminate redundant read I/Os across VMs [82].
[Figure: guest physical memory backed by host memory, with the guest disk and the BSST connected through a block indirection layer; the guest view sees the PPN and guest disk, while the hypervisor view sees the MPN, the BSST, and the indirections.]

Figure 6.3: Double-paging with Tesseract.
6.3.3 Tesseract’s Virtual Disk and Swap Subsystems
Figure 6.3 shows our approach embodied in Tesseract. The hypervisor swaps guest memory to a block-swap store (BSST) VMDK, which manages a map from guest PPNs to blocks in the BSST, a per-block reference-counting mechanism to track indirections from guest virtual disks, and a pool of 4KB disk blocks. When the guest OS writes out a memory page that happens to be swapped out by the hypervisor, the disk subsystem detects this condition while preparing to issue the write request. Rather than bringing the contents of the swapped-out page back into memory, the hypervisor updates the appropriate reference counts in the BSST, issues the I/O, and updates metadata in the guest VMDK, adding a reference to the corresponding disk block in the BSST. Figure 6.4 shows timelines for the scenario in which the guest OS pages out an already swapped page, with and without Tesseract. With Tesseract we are able to eliminate the overheads of a new page allocation and a disk read.

[Figure: timelines comparing (a) the baseline, in which a guest page-out of a swapped page triggers memory allocation and a synchronous swap-in before the write I/O and PTE update, with (b) Tesseract, in which the swap-out is followed only by a metadata update and PShareZero before the PTE update.]

Figure 6.4: Write I/O and hypervisor swapping.

To achieve this, Tesseract modifies the I/O preparation and I/O completion steps. For write requests, the memory pages in the scatter-gather list are checked for valid associations to blocks in the BSST. If these are found, the target VMDK's mapping structure is updated for those pages' corresponding virtual disk blocks to reference the appropriate blocks in the BSST, and the reference counts of these referenced blocks in the BSST are incremented. For read requests, the guest I/O request may be split into multiple I/O requests depending on where the source disk blocks reside.
Consider the state of a guest VMDK and the BSST as shown in Figure 6.5a. Here, a guest write operation wrote five disk blocks, of which two were previously swapped to the BSST. In this example, block 2 still contains the swapped contents of some PPN and has a reference count reflecting this fact and the guest write. Hence, its state has “swapped” as true and a reference count of 2. Similarly, block 4 only has a nonzero reference count because the PPN whose swapped contents originally created the disk block has since been accessed and its contents paged back in. Hence, its state has “swapped” as false and a reference count of 1. To read these blocks from the guest VMDK now requires three read operations: one against the guest VMDK and two against the BSST. The results of these read operations must then be coalesced in the read completion path.
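The read-splitting arithmetic for this example can be sketched as a small planning function. This is an illustrative model; the function name and structures are hypothetical, and it assumes, as in the example, that redirected BSST blocks are non-contiguous so each costs a separate BSST read.

```python
# Sketch of read splitting: given which virtual blocks redirect into the
# BSST, count the read requests needed to service a contiguous guest read.
# Assumes redirected BSST blocks are non-contiguous, so each one costs a
# separate BSST read.

def plan_read(vblocks, redirect):
    """Return (guest_vmdk_reads, bsst_reads) for a contiguous guest read."""
    local = [b for b in vblocks if b not in redirect]
    bsst = [redirect[b] for b in vblocks if b in redirect]
    guest_reads = 1 if local else 0      # one request with dummy-page holes
    return guest_reads, len(bsst)

# Figure 6.5a: five blocks written, of which blocks 2 and 4 redirect into
# the BSST (the BSST block numbers here are made up for illustration).
redirect = {2: 17, 4: 98}
guest_reads, bsst_reads = plan_read([1, 2, 3, 4, 5], redirect)
assert (guest_reads, bsst_reads) == (1, 2)   # three reads total
```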
One can view the primary cost of double-paging in an unmodified hypervisor as impacting the write-preparation time for guest I/Os. Likewise, one can view the primary cost of these cases in Tesseract as impacting the read-completion time. To mitigate these effects, we consider two forms of defragmentation. Both strategies make two assumptions:

• the original guest write I/O request (represented in blue) captures the guest's notion of expected locality, and

• the guest is unlikely to immediately read the same disk blocks back into memory.

[Figure: guest VMDK and block-swap store (BSST) block states: (a) with Tesseract, blocks 2 and 4 redirect into the BSST (block 2: swapped true, refcnt 2; block 4: swapped false, refcnt 1); (b) with BSST defragmentation, the referenced blocks are copied into a contiguous sequence in the BSST; (c) with guest VMDK defragmentation, the referenced blocks are copied back into the guest VMDK and the BSST references are dropped.]

Figure 6.5: Examples of reference counts with Tesseract and with defragmentation.
Based on these assumptions, we extended Tesseract to asynchronously reorganize the referenced state in the BSST. In Figure 6.5b, we copy the referenced blocks into a contiguous sequence in the BSST and update the guest VMDK indirections to refer to the new sequence. This approach reduces the number of split read operations. In Figure 6.5c, we copy the referenced blocks back to the locations in the original guest VMDK where the guest expects them. With this approach, the typical read operation need not be split. In effect, Tesseract asynchronously performs the expensive work that occurred in steps 4, 5, and 6 of Figure 6.2, eliminating its cost to the guest.
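The guest-VMDK form of defragmentation can be sketched as a copy-back pass over the indirection table. This is an illustrative model only; structure and function names are hypothetical, and the real pass runs asynchronously in worker threads.

```python
# Sketch of guest-VMDK defragmentation (the Figure 6.5c style): copy each
# redirected block's contents from the BSST back to its home location in
# the guest VMDK and drop the indirection, so later reads need no split.

def defragment_guest_vmdk(local, redirect, bsst_blocks, bsst_refcnt):
    for vblock, bsst_block in list(redirect.items()):
        local[vblock] = bsst_blocks[bsst_block]   # copy contents home
        del redirect[vblock]                      # remove the indirection
        bsst_refcnt[bsst_block] -= 1              # release the reference

local = {1: "a", 3: "c", 5: "e"}
redirect = {2: 17, 4: 98}                 # made-up BSST block numbers
bsst_blocks = {17: "b", 98: "d"}
bsst_refcnt = {17: 2, 98: 1}              # block 17 still backs a swapped PPN
defragment_guest_vmdk(local, redirect, bsst_blocks, bsst_refcnt)
assert redirect == {}
assert local == {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
assert bsst_refcnt == {17: 1, 98: 0}      # block 98 is now reclaimable
```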
6.4 Implementation
Our prototype extends VMware Workstation as described in Section 6.3.1. Here, we provide more detail.
6.4.1 Explicit Management of Hypervisor Swapping
VMware Workstation relies on the host OS to handle much of the work associated with swapping guest memory. A pagefile is mapped into the vmx's address space and calls to the vmmon driver are used to lock MPNs backing this memory as needed by the guest. When memory is released through hypervisor swapping, the pages are dirtied, if necessary, in the vmx's address space and unlocked by vmmon. Should the host OS need to reclaim the backing memory, it does so as if the vmx were any other process: it writes out the state to the backing pagefiles and repurposes the MPN. For Tesseract, we modified Workstation to support explicit swapping of guest memory. First, we eliminated the pagefile and replaced it with a special VMDK, the block swap store (BSST), into which swapped-out contents are written. The BSST maintains a partial mapping from PPNs to disk blocks tracking the contents of currently swapped-out PPNs. In addition, the BSST maintains a table of reference counts on its blocks that are referenced by other guest VMDKs.
Second, we split the process for selecting pages for swapping from the process for actually writing out contents to the BSST and unlocking the backing memory. This split is motivated by the fact that, having eliminated duplicate I/Os between hypervisor swapping and guest paging, the system benefits when both levels of scheduling choose the same set of pages. The selected swap candidates are placed in a victim cache to “cool down”. Only the coldest pages are eventually written out to disk. This victim cache is maintained as a percentage of the guest's locked memory (10% for our study). Should the guest access a page in the pool, it is removed from the pool without being unlocked.
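The two-stage swap-out can be sketched with a FIFO cool-down pool. This is an illustrative model, not the prototype's code: the class name, the FIFO eviction order, and the fixed capacity are all simplifying assumptions (the prototype sizes the pool at 10% of locked memory and its page-coldness policy is not specified here).

```python
# Sketch of the two-stage swap-out: candidates first enter a victim cache
# to "cool down"; only the coldest (oldest) pages are written to the BSST,
# and a guest access rescues a page from the pool without any disk I/O.

from collections import OrderedDict

class VictimCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()    # PPN -> contents, oldest first

    def add_candidate(self, ppn, contents, write_to_bsst):
        self.pages[ppn] = contents
        while len(self.pages) > self.capacity:
            cold_ppn, cold = self.pages.popitem(last=False)
            write_to_bsst(cold_ppn, cold)   # only the coldest hit disk

    def guest_access(self, ppn):
        # A touched page leaves the pool; no swap-in is needed.
        return self.pages.pop(ppn, None)

bsst = {}
vc = VictimCache(capacity=2)
vc.add_candidate(1, "p1", bsst.__setitem__)
vc.add_candidate(2, "p2", bsst.__setitem__)
assert vc.guest_access(1) == "p1"       # rescued from the pool, no I/O
vc.add_candidate(3, "p3", bsst.__setitem__)
vc.add_candidate(4, "p4", bsst.__setitem__)
assert 2 in bsst and 1 not in bsst      # only the coldest page was written
```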
When the guest pages out memory, it does so to repurpose a given guest physical page for a new linear mapping. Since this new use will access that guest physical page, one may be concerned that this access will force the page to be swapped in from the BSST first. However, because the guest will either zero the contents of that page or read into it from disk and because the VMM can detect that the whole page will be overwritten before it is visible to the guest, the vmx is able to cancel the swap-in and complete the page locking operation.
6.4.2 Tracking Memory Pages and Disk Blocks
There are two steps to maintaining a mapping between disk blocks and pages in memory. The first is recognizing the pages read and written in guest and hypervisor I/O operations. By examining the scatter-gather lists of each I/O, one can identify when the contents in memory and on disk match. While we plan to maintain this mapping for all associations between guest disks and guest memory, we currently only track the associations between blocks in the BSST and main memory.
The second step is to track when these associations are broken. For guest memory, this event happens when the guest modifies a page of memory. The VMM tracks when this happens by trapping the fact that a writable mapping is required, and this information is communicated to the vmx. For device accesses, on the other hand, this event is tracked either through explicit checks in the module which provides devices the access to guest memory, or by examining page-lists for I/O operations that read contents into memory pages.
6.4.3 I/O Paths
When the guest OS is running inside a virtual machine, guest I/O requests are intercepted by the VMM, which is responsible for storage adaptor virtualization, and then passed to the hypervisor, where further I/O virtualization occurs. Figure 6.6 identifies the primary modules in VMware Workstation's I/O stack. The guest operating system generates scatter-gather lists for I/O (1). Tesseract inspects the scatter-gather lists of incoming guest I/O requests in the SCSI Disk Device layer, where a request to the guest VMDK may be updated (2). Any extra I/O requests to the BSST may be issued (3), as shown in Table 6.2. The asynchronous I/O manager issues the I/O requests to the host file system (4). On completion, the asynchronous I/O manager generates completion events (5). Waiting for the completion of all the I/O requests needed to service the original guest I/O request is isolated to the SCSI Disk Device layer as well (6). When running with defragmentation enabled (see Section 6.5), Tesseract allocates a pool of worker threads for handling defragmentation requests.
Guest Write I/Os
Guest I/O requests have PPNs in their scatter-gather lists. The vmx rewrites the scatter-gather list, replacing guest PPNs with virtual pages from its address space before passing it on to the physical device. Normally, for write I/O requests, if a page was previously swapped, so that the PPN does not have a backing MPN, the hypervisor allocates a new MPN and brings the page's contents in from disk.

[Figure: VMware Workstation I/O stack: the guest OS (1) sends requests through the VMM to the SCSI Disk Device layer and block indirection layer (2, 6), which dispatch I/Os (3) through the asynchronous I/O manager to the host file layer (4) and handle completion events (5). For writes, Tesseract removes swapped pages from the scatter-gather list, dispatches a single request with holes, and creates guest-to-BSST indirections on completion; for reads, guest VMDK indirections are looked up, one request goes to the guest VMDK and one or more to the BSST, and the results are merged before notifying the guest.]

Figure 6.6: VMware Workstation I/O Stack.
With Tesseract, we check whether the PPNs are already swapped out to BSST blocks by querying the PPN-to-BSST-block mapping. We then use the virtual address of a special dummy page in the scatter-gather list for each page that resides in the BSST. On completion of the I/O, metadata associated with the guest VMDK is updated to reflect the fact that the contents of guest disk blocks for BSST-resident pages are in the BSST. This sequence allows the guest to page out memory without inducing double-paging.

[Figure: (a) a scatter-gather list of eight pages prepared by the guest OS for a disk write; (b) the modified scatter-gather list in which the pages swapped out to the BSST are replaced with a dummy page.]

Figure 6.7: The pages swapped out to the BSST are replaced with a dummy page to avoid double-paging. Indirections are created for the corresponding guest disk blocks.
Figure 6.7 illustrates how write I/O requests to the guest VMDK are handled by Tesseract. Tesseract recognizes that contents for pages 2, 4, 6 and 7 in the scatter-gather list provided by the guest OS reside in the BSST (Figure 6.7a). When a new scatter-gather list to be passed to the physical device is formed, a dummy page is used for each BSST resident (Figure 6.7b).
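The write-path rewrite can be sketched as follows. This is an illustrative model of the transformation in Figure 6.7, not the prototype's code; the function name and the tuple-based result are assumptions.

```python
# Sketch of the write-path rewrite in Figure 6.7: pages whose contents are
# already in the BSST are replaced by one shared dummy page in the
# scatter-gather list, and the indirections to record on I/O completion
# are collected alongside.

DUMMY = "dummy-page"

def prepare_guest_write(sg_pages, ppn_to_bsst):
    """Rewrite the SG list; return it plus the indirections to record."""
    new_sg, indirections = [], []
    for i, ppn in enumerate(sg_pages):
        if ppn in ppn_to_bsst:
            new_sg.append(DUMMY)                      # contents stay on disk
            indirections.append((i, ppn_to_bsst[ppn]))
        else:
            new_sg.append(ppn)                        # written from memory
    return new_sg, indirections

# Figure 6.7: pages 2, 4, 6 and 7 are swapped out to the BSST
# (the BSST block numbers are made up for illustration).
ppn_to_bsst = {2: 10, 4: 11, 6: 12, 7: 13}
sg, ind = prepare_guest_write([1, 2, 3, 4, 5, 6, 7, 8], ppn_to_bsst)
assert sg == [1, DUMMY, 3, DUMMY, 5, DUMMY, DUMMY, 8]
assert ind == [(1, 10), (3, 11), (5, 12), (6, 13)]
```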
Guest Read I/Os and Guest Disk Fragmentation
Recognizing that data may reside in both the guest VMDK and the BSST is a double-edged sword. On the guest write path it allows us to dismiss pages that are already present in the BSST and thus avoid swapping them in just to be written out to the guest VMDK. However, when it comes to guest reads, the otherwise single I/O request might have to be split into multiple I/Os. This happens when some of the data needed by the I/O is located in the BSST. Since data that has to be read from the BSST may not be contiguous on disk, the number of extra I/O requests to the BSST may be as high as the number of data pages in the original I/O request that reside in the BSST. We
refer to a collection of pages in the original I/O request for which a separate I/O request to the BSST must be issued as a hole. Read I/O requests to the guest VMDK which have holes are called fragmented. We modify a fragmented request so that all pages that should be filled in with data from the BSST are replaced with a dummy page, which serves as a placeholder and receives throwaway data read from the guest VMDK. So, in the end, for each fragmented read request we issue one modified I/O request to the guest VMDK and N requests to the BSST, where N is the number of holes. After all the issued I/Os are completed, we signal the completion of the originally issued guest read I/O request.

[Figure: the original eight-page guest read request is split into one request to the guest VMDK for pages 1, 3, 5 and 8 (with a dummy page in the holes) and four requests to the BSST for pages 2, 4, 6 and 7.]

Figure 6.8: Original guest read request split into multiple read requests due to holes in the guest VMDK.
In Figure 6.8, the guest read I/O request finds disk blocks for pages 2, 4, 6 and 7 located in the BSST, where they occupy non-contiguous space. Tesseract issues one read request to the guest VMDK to get data for pages 1, 3, 5 and 8. In the scatter-gather list sent to the physical device, a dummy page is used as a read target for pages 2, 4, 6 and 7. Together with that one read I/O request to the guest VMDK, four read I/O requests are issued to the BSST. Each of those four requests reads data from one of the four disk blocks in the BSST.
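The hole computation can be sketched as a grouping pass over the scatter-gather list. This is an illustrative model under one assumption the text implies: consecutive pages whose BSST blocks happen to be contiguous on disk could share a single BSST read, while non-contiguous blocks (as in Figure 6.8) each cost their own request. The function name and block numbers are hypothetical.

```python
# Sketch of hole detection for fragmented guest reads: group BSST-resident
# pages into runs whose BSST blocks are contiguous on disk; each run is
# one extra read request to the BSST.

def bsst_read_requests(sg_pages, ppn_to_bsst):
    """Group BSST-resident pages into contiguous-block read requests."""
    requests, run = [], []
    prev_block = None
    for ppn in sg_pages:
        block = ppn_to_bsst.get(ppn)
        if block is not None and prev_block is not None and block == prev_block + 1:
            run.append(block)                  # extends the current run
        else:
            if run:
                requests.append(run)
            run = [block] if block is not None else []
        prev_block = block
    if run:
        requests.append(run)
    return requests

# Figure 6.8: pages 2, 4, 6 and 7 live in the BSST at blocks that are
# non-contiguous (block numbers made up), so four extra reads are needed.
reqs = bsst_read_requests([1, 2, 3, 4, 5, 6, 7, 8],
                          {2: 40, 4: 90, 6: 15, 7: 63})
assert reqs == [[40], [90], [15], [63]]        # four separate BSST reads

# Had pages 6 and 7 landed on adjacent BSST blocks, one read would do:
assert bsst_read_requests([6, 7], {6: 15, 7: 16}) == [[15, 16]]
```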
Optimization of Repeated Swaps
In addition to addressing the double-paging anomaly by tracking guest I/Os whose contents exist in the BSST, we also implemented an optimization for back-to-back swap-out requests for a memory page whose contents remain clean. If a page's contents are written out to the BSST, and later swapped back in, we continue to track the old block in the BSST as a form of victim cache. If the same page is chosen to be swapped out again and there has been no intervening modification of the contents of the page in memory, we simply adjust the reference count (see Section 6.4.4) for the block copy that is already in the BSST.
6.4.4 Managing Block Indirection Metadata
Tesseract keeps in-memory metadata for tracking PPN-to-BSST block mappings and for recording block indirections between guest and BSST VMDKs. The PPN-to-BSST block mapping is stored as key-value pairs in a hash table. Indirections between guest and BSST VMDKs are tracked in a similar manner. Tesseract also keeps reference counts for the BSST blocks. When a new PPN-to-BSST mapping is created, the reference count for the corresponding BSST block is set to 1. The reference count is incremented in the write prepare stage for PPNs found to have PPN-to-BSST block mappings. This ensures that such BSST blocks are not repurposed while the guest write is still in progress. Later, on the write completion path, the guest-VMDK-to-BSST indirection is created. The reference count of a BSST block is decremented during a hypervisor swap-in operation. It is also decremented when the guest VMDK block is overwritten by new contents and the previous guest block indirection is invalidated. Blocks with zero reference counts are considered free and reclaimable.
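The reference-count lifecycle above can be traced with a small sketch. The table and function names are hypothetical; they merely mirror the stages described in the text.

```python
# Illustrative metadata tables, modeled on the description above.
ppn_to_bsst = {}      # PPN -> BSST block (hash table)
guest_to_bsst = {}    # guest VMDK block -> BSST block (indirections)
refcount = {}         # BSST block -> reference count

def on_swap_out(ppn, bsst_blk):
    ppn_to_bsst[ppn] = bsst_blk
    refcount[bsst_blk] = 1                 # new mapping starts at 1

def on_write_prepare(ppn):
    blk = ppn_to_bsst.get(ppn)
    if blk is not None:
        refcount[blk] += 1                 # pin the block for the write's duration
    return blk

def on_write_complete(guest_blk, bsst_blk):
    old = guest_to_bsst.get(guest_blk)
    if old is not None:
        refcount[old] -= 1                 # previous indirection invalidated
    guest_to_bsst[guest_blk] = bsst_blk    # record the new indirection

def on_swap_in(ppn):
    refcount[ppn_to_bsst.pop(ppn)] -= 1

on_swap_out(5, 100)        # PPN 5 swapped to BSST block 100 (refcount 1)
blk = on_write_prepare(5)  # guest writes PPN 5 out: refcount pinned to 2
on_write_complete(9, blk)  # guest block 9 now points at BSST block 100
on_swap_in(5)              # page back in memory: refcount drops to 1
```

After this sequence only the guest-VMDK indirection keeps block 100 alive; invalidating it would drop the count to zero and make the block reclaimable.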
Metadata Consistency
While updating metadata in memory is faster than updating it on the disk, it poses consistency issues. What if the system crashes before the metadata is synced back to persistent storage? To reduce the likelihood of such problems, Tesseract periodically synchronizes the metadata to disk on the same
schedule used by the VMDK management library for virtual disk state. However, because reference counts in the BSST and block indirections in VMDKs are written at different stages of an I/O request, crashes must be detected and an fsck-like repair process run.
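One plausible shape for such a repair pass, assuming hypothetical on-disk tables mapping guest blocks and PPNs to BSST blocks, is to treat the surviving indirection records as ground truth and rebuild the reference counts from them; this is only a sketch, not the actual repair procedure.

```python
# Hypothetical fsck-like pass: recompute BSST refcounts from the indirection
# records that survived the crash. A block referenced by no record is free.
def repair(guest_to_bsst, ppn_to_bsst):
    refcount = {}
    for blk in list(guest_to_bsst.values()) + list(ppn_to_bsst.values()):
        refcount[blk] = refcount.get(blk, 0) + 1
    return refcount

# Two guest blocks share BSST block 100; one PPN maps to block 101.
rc = repair({1: 100, 2: 100}, {7: 101})
```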
Entanglement of guest VMDKs and BSST
Once indirections are created between the guest and BSST VMDKs, it becomes impossible to move the guest VMDK by itself. To disentangle the guest VMDK, we must copy back into it every BSST block for which an indirection exists. This can be done either online or offline. More details about the online process are in Section 6.5.2.
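The offline case reduces to a simple copy-back loop, sketched below with disks modeled as hypothetical block-to-contents maps.

```python
# Hedged sketch of offline disentanglement: copy every BSST block that a
# guest-VMDK indirection points at back into the guest VMDK, then drop the
# indirection so the guest disk is self-contained again.
def disentangle(guest_vmdk, bsst, guest_to_bsst):
    for guest_blk, bsst_blk in list(guest_to_bsst.items()):
        guest_vmdk[guest_blk] = bsst[bsst_blk]   # copy contents back
        del guest_to_bsst[guest_blk]             # indirection removed
    return guest_vmdk

# Guest blocks 2 and 6 were indirected to BSST blocks 40 and 41.
disk = disentangle({}, {40: b"A", 41: b"B"}, {2: 40, 6: 41})
```

After the loop the guest VMDK holds all of its own data and can be moved independently of the BSST.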
6.5 Guest Disk Fragmentation
As mentioned in Section 6.4.3, when running with Tesseract, guest read I/O requests might be fragmented in the sense that some of the data the guest is asking for in a single request may reside in both the BSST and the guest VMDK. The fragmentation level depends on the nature of the workload, the guest OS, and swap activity at the guest and the hypervisor level. Our experiments with SPECjbb2005 [103] showed that even for a moderate level of memory pressure as much as 48% of all read I/O requests had at least one hole. By solving the double-paging problem, Tesseract significantly reduced the write-prepare time of guest I/O requests, since synchronous swap-in requests no longer cause delays. However, a non-trivial overhead was added to read-completion. Indeed, instead of waiting for a single read I/O request to the guest VMDK, the hypervisor may now have to wait for several extra read I/O requests to the BSST to complete before reporting the completion to the guest.
To address these overheads, Tesseract was extended with a defragmentation mechanism that improves read I/O access locality and thus reduces read-completion time. We investigated two approaches to implementing defragmentation: BSST defragmentation and guest VMDK defragmentation. While defragmentation is intended to help reduce read-completion time, it has its own cost. Defragmentation requests are asynchronous and reduce the time to complete affected guest I/Os, but, at the same time, they contribute to a higher disk load and in extreme cases may have an impact on read-prepare times. The defragmentation activity can be throttled on detecting performance bottlenecks due to higher disk load. ESX, for example, provides a mechanism, SIOC, that measures latencies to detect overload and enforce proportional-share fairness [50]. The defragmentation mechanism could participate in this protocol.
6.5.1 BSST Defragmentation
BSST defragmentation uses guest write I/O requests as a hint of which BSST blocks might be accessed together in a single read I/O request in the future. Given that information, we group the identified blocks together in the BSST.
Figure 6.9 shows a scatter-gather list of a write I/O request that goes to the guest VMDK. In that request, the contents of pages 2, 4, 6 and 7 are already present in the BSST. As soon as these blocks are identified, a worker thread picks up a reallocation job that allocates a new set of contiguous blocks in the BSST and copies the contents of the BSST blocks for pages 2, 4, 6 and 7 into that new set of blocks. This copying allows those blocks to be read back later as a single I/O request, since the guest's write grouped them together and thereby signaled its own expectation of the locality of these blocks.
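The reallocation job can be sketched as follows; the allocator and data structures are hypothetical simplifications, with disks modeled as block-to-contents maps.

```python
# Illustrative BSST defragmentation keyed off a guest write: the BSST blocks
# backing the written guest blocks are copied into one contiguous run and the
# indirections are repointed at the new locations.
def defragment(bsst, guest_to_bsst, write_blocks, next_free):
    hits = [b for b in write_blocks if b in guest_to_bsst]
    for i, guest_blk in enumerate(hits):
        old = guest_to_bsst[guest_blk]
        new = next_free + i                  # contiguous destination block
        bsst[new] = bsst[old]                # worker thread copies contents
        guest_to_bsst[guest_blk] = new       # repoint the indirection
    return next_free, next_free + len(hits)

# Pages 2, 4, 6, 7 of the write are backed by scattered BSST blocks.
bsst = {40: "p2", 90: "p4", 15: "p6", 63: "p7"}
ind = {2: 40, 4: 90, 6: 15, 7: 63}
start, end = defragment(bsst, ind, [1, 2, 3, 4, 5, 6, 7, 8], 200)
# The four blocks now occupy the contiguous BSST range [start, end)
```

A later read of pages 2, 4, 6 and 7 can then be satisfied by one sequential BSST request instead of four scattered ones.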
BSST defragmentation is not perfect. If multiple guest VMDK writes create indirections to the same BSST blocks, multiple copies of those blocks