Checkpoint-Restart for a Network of Virtual Machines

Rohan Garg, Komal Sodha, Zhengping Jin, Gene Cooperman∗
Northeastern University, Boston, MA, USA
{rohgarg,komal,jinzp,gene}@ccs.neu.edu

Abstract—The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.

I. INTRODUCTION

An approach for providing fault tolerance to complex distributed applications is demonstrated. It is based on checkpointing a network of virtual machines. Such a network can be started locally, and later checkpointed for re-deployment (restart from checkpoint images) in the Cloud. This is especially important to support fault tolerance and load balancing in the Cloud.

The approach also provides flexibility. It employs DMTCP, an unprivileged, purely user-space checkpointing package. Potential examples of flexible application-specific policies are: incremental checkpointing; declaration of cutouts (regions of memory that do not require checkpointing); application-specific memory compression during checkpoint (for example, conversion of double to float); and so on. End users can write application-specific DMTCP plugins to support flexible checkpointing.

Further, the maintainability of a proposed architecture is important. Here, we measure maintainability by the number of lines of new code required, beyond the base code of a checkpoint-restart package or the base code of the virtual machine itself. The proposed architecture relies on just 600 lines of new code: 400 lines of code for a KVM-specific plugin used to checkpoint the virtual machine, and 200 lines of code for a TUN/TAP plugin.

The two DMTCP plugins above are external libraries loaded into an unmodified DMTCP. Source code can be found in the contrib directory of the DMTCP repository. (See Section II for further details of plugins.)

The approach described here saves the state of an arbitrary guest operating system, which runs within a virtual machine under a host operating system. The primary virtual machine described in this work is KVM/QEMU [1]. However, to demonstrate the generality of the approach, a plugin was also developed for Lguest [2]. That plugin required about 100 lines of code, as well as about 40 lines of modifications to the Lguest kernel driver to extend its API. The methodology was also applied to pure user-space QEMU [3]. Surprisingly, DMTCP was able to checkpoint user-space QEMU "out-of-the-box" (without the use of additional plugins).

Experiments in Section IV-C demonstrate compatibility with DMTCP's performance optimizations: forked checkpointing and mmap-based fast restart. Forked checkpointing enables a virtual machine snapshot in 0.4 seconds when running with the Btrfs filesystem, while mmap-based fast restart allows resuming from the snapshot in 0.3 seconds. In addition, Section IV-D shows the run-time overhead to be too small to measure when running the nbench2 [4] benchmark program.

Snapshots (including the filesystem): In VM terminology, a snapshot saves not only the state of the virtual machine, but also the filesystem used by that virtual machine. The Btrfs filesystem [5] can be used to implement copy-on-write incremental snapshots. Thus, during checkpoint of a virtual machine, one can also create either a full snapshot or an incremental snapshot of the guest filesystem.

∗This work was partially supported by the National Science Foundation under Grant OCI-0960978.

On computers where the host operating system does not provide the Btrfs filesystem, it is still possible to employ Btrfs. An "inner" KVM/QEMU virtual machine can be run nested inside an "outer" KVM/QEMU virtual machine, which in turn runs under the host operating system. The outer VM provides Btrfs, and DMTCP runs inside the outer VM, checkpointing the inner VM.

In the rest of this paper, Section II provides background on DMTCP plugins. Section III describes a generic mechanism for checkpoint-restart of single virtual machines. Section IV provides experimental running times over a variety of scenarios, Section V describes related work, and Section VI provides the conclusion.

II. DMTCP, KVM, AND TUN/TAP: EXTENDING CHECKPOINT-RESTART TO VMS

DMTCP (Distributed MultiThreaded CheckPointing) [6] is used to checkpoint and restart a network of virtual machines. DMTCP provides a facility for third-party plugins, as well as using them in its own internal architecture. The work described here is based on svn revision 1967 of DMTCP [7].

DMTCP implements transparent user-space checkpoint-restart. It does this by saving to a checkpoint image all of user-space memory, along with pertinent process state (thread information, open file descriptors, associated terminal device, stdin/stdout/stderr, sockets, shared memory regions, etc.). Internal DMTCP plugins employ specific algorithms to checkpoint the state of open files, network sockets, shared memory regions, and other special cases.

This work uses the plugin mechanism to extend DMTCP in two directions: support for KVM, and support for the virtual-network kernel devices TUN and TAP. TUN/TAP is used for networking of multiple KVM-based virtual machines. First, DMTCP is extended to support checkpointing of a single KVM/QEMU virtual machine. Second, DMTCP is extended to support checkpointing of the TUN/TAP network, including any network data "in flight".

In order to checkpoint KVM/QEMU, it is launched under the control of DMTCP. A typical example of launch, checkpoint, and restart is as follows:

  % dmtcp_checkpoint --with-plugin \
        dmtcp_kvm_plugin.so \
        dmtcp_tun_plugin.so qemu ...
  % dmtcp_command --checkpoint
  % dmtcp_restart qemu_*.dmtcp

Section II-A discusses handling of the KVM/QEMU virtual machine, while Section II-B discusses network handling and the use of TUN/TAP.

A. Checkpointing the KVM/QEMU Virtual Machine

QEMU uses KVM to run user-space code natively on hardware that supports virtualization. It uses KVM's API to initialize and control the guest virtual machine. This API is based on the ioctl system call. For the rest of this discussion, the term QEMU is used both to refer to the QEMU virtual machine monitor (VMM), and the virtual machine itself (including the guest operating system).

DMTCP plugins offer two primary mechanisms to extend checkpoint-restart: a run-time mechanism (wrapper functions around library calls made by the application); and customization of checkpoint/restart to save and restore the state of external objects. In this case, QEMU is the target application being checkpointed, and the KVM kernel module is the external object whose state must be virtualized.

The run-time portion of the KVM plugin is primarily concerned with a function wrapper around the ioctl system call. This wrapper function captures system calls by QEMU to KVM. This is used to make a local copy of the parameters that QEMU used to initialize the new virtual machine. At the time of restart, those same parameters are used to reset the KVM parameters to correspond.
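To make the run-time mechanism concrete, the following is a minimal sketch of such an ioctl wrapper, assuming the dmtcp.h plugin header and the NEXT_FNC convention of recent DMTCP releases (NEXT_FNC invokes the next wrapper, or the underlying system function). KVM_SET_USER_MEMORY_REGION is part of the real KVM API; the fixed-size table and the choice to record only memory-region setup are illustrative simplifications, not the actual 400-line plugin.

  #include <linux/kvm.h>   /* struct kvm_userspace_memory_region, KVM_* requests */
  #include <stdarg.h>
  #include <string.h>
  #include "dmtcp.h"       /* assumed DMTCP plugin header; provides NEXT_FNC */

  /* Plugin-local log of the guest-memory parameters that QEMU passes to KVM,
     to be replayed at restart time to rebuild an equivalent virtual machine. */
  static struct kvm_userspace_memory_region mem_regions[64];
  static int num_mem_regions = 0;

  int ioctl(int fd, unsigned long request, ...) {
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    /* Record the guest memory layout that QEMU establishes through KVM. */
    if (request == KVM_SET_USER_MEMORY_REGION && num_mem_regions < 64) {
      memcpy(&mem_regions[num_mem_regions++], arg, sizeof(mem_regions[0]));
    }

    return NEXT_FNC(ioctl)(fd, request, arg);  /* forward to the real ioctl */
  }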

The remainder of the KVM plugin is concerned with saving state at checkpoint time, and restoring state at restart time. The KVM saved state includes the state of the virtual CPU (registers, etc.) and the state of the interrupt controllers. The KVM API provides explicit system calls that the plugin uses to save and restore this state, as illustrated in the sketch below.

Another example of KVM/QEMU state is the virtual memory tables. These tables are contained within the user-space memory of the QEMU process itself (here viewing QEMU as a process in the host operating system). At the time of restart, the original mapping between the guest physical pages and the host physical pages has been lost. However, the DMTCP plugin does not need to create a new mapping, because a page fault causes the kernel to re-establish the mapping.

Figure 1 illustrates the generic architecture of a guest virtual machine. At the time of checkpoint, the DMTCP plugin discovers the parameters of the KVM hypervisor in supporting the current state of the QEMU virtual machine. DMTCP then writes to a checkpoint image the memory of the QEMU virtual machine, which consists of the user-space memory of the process of the host operating system that is running QEMU. Figure 2 presents the launching of a fresh virtual machine at restart time, which is then modified to correspond to the pre-checkpoint QEMU.

[Figure 1 (diagram omitted): Generic VM Architecture. The sketch shows the VM components of interest for checkpoint-restart. In user-space memory: the guest VM's user-space component, its tables (shared with kernel space), async I/O threads, and vCPU threads. In kernel-space memory: the kernel module's VM shell, with tables (shared with user space), a hardware description (peripherals, IRQ, etc.), and vCPUs (vCPU0 ... vCPUn) for the virtual cores. The VM shell refers to the uninitialized data structures in the kernel driver that describe the virtual machine. A VM launcher initializes those data structures. A generic checkpoint-restart mechanism restores those data structures appropriately.]

[Figure 2 (diagram omitted): Re-Starting a Virtual Machine from a Checkpoint Image. The layout matches Figure 1, but the VM shell begins with an empty hardware description. The DMTCP plugin re-creates the original hardware description from the checkpoint image. In addition, the user-space memory of the guest VM is restored by DMTCP at the original addresses.]

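In simplified form, the explicit save/restore calls mentioned above reduce to KVM requests such as the following. The requests shown (KVM_GET_REGS, KVM_SET_REGS, KVM_GET_SREGS, KVM_SET_SREGS, KVM_GET_IRQCHIP, KVM_SET_IRQCHIP) belong to the real KVM API; the single-vCPU structure, the fixed irqchip choice, and the omitted error handling make this a sketch rather than the actual plugin code.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Saved at checkpoint time; restored into a freshly created VM at restart.
     (Simplified: one vCPU and one interrupt controller.) */
  static struct kvm_regs saved_regs;
  static struct kvm_sregs saved_sregs;
  static struct kvm_irqchip saved_irqchip;

  void save_kvm_state(int vcpu_fd, int vm_fd) {
    ioctl(vcpu_fd, KVM_GET_REGS, &saved_regs);       /* general registers */
    ioctl(vcpu_fd, KVM_GET_SREGS, &saved_sregs);     /* segment/control registers */
    saved_irqchip.chip_id = KVM_IRQCHIP_PIC_MASTER;  /* interrupt controller */
    ioctl(vm_fd, KVM_GET_IRQCHIP, &saved_irqchip);
  }

  void restore_kvm_state(int vcpu_fd, int vm_fd) {
    ioctl(vcpu_fd, KVM_SET_REGS, &saved_regs);
    ioctl(vcpu_fd, KVM_SET_SREGS, &saved_sregs);
    ioctl(vm_fd, KVM_SET_IRQCHIP, &saved_irqchip);
  }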
At the time of restart, the DMTCP plugin requests KVM to create a fresh virtual machine (not specific to QEMU). Then, DMTCP replaces this fresh virtual machine (which exists as the user-space memory of a process in the host operating system) by the original user-space memory from the checkpoint image. Finally, the DMTCP plugin makes calls to the KVM kernel module to reset the KVM parameters so as to correspond to those of the pre-checkpoint QEMU virtual machine.

B. Checkpointing the TUN/TAP Network

A TUN/TAP plugin extends DMTCP similarly to the KVM plugin. Wrapper functions are implemented for ioctl to detect how the network was set up.

For background, we briefly review how DMTCP provides checkpointing over a TCP/IP network. At the time of checkpoint, DMTCP "drains the network": (a) by stopping the user threads of all processes in the computation; (b) by receiving from each socket until all network data "in flight" has been collected; and (c) by then writing a checkpoint image. A "cookie" (a unique set of data) is sent through each network connection so that the receiver can determine when no further data is in flight.

The TUN/TAP plugin employs a similar strategy, except that TUN/TAP does not provide an analog of a socket connection. It operates at a lower level, in which network packets generated by the guest operating system are injected directly into the physical network. Only the guest operating system is aware of the socket connections being used by the applications within it.

Two alternative approaches to draining the network are: (a) to send a broadcast packet that plays the role of the DMTCP cookie; and (b) to wait for a specified time sufficient for all network packets to arrive. Mechanism (b) is used currently. For added reliability, at the end of writing the checkpoint image, the network is checked to see whether any late packets have arrived. If a late packet is detected, the user can be warned, or a second DMTCP checkpoint can be automatically initiated.
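Returning to the wrapper functions at the start of this subsection, the sketch below shows how such a wrapper can recognize the TUN/TAP setup. TUNSETIFF and struct ifreq are the real Linux TUN/TAP interface; the recording logic is an illustrative stand-in for the 200-line plugin, which must also re-create and re-configure the device at restart time. (The same dmtcp.h/NEXT_FNC assumptions as before apply.)

  #include <net/if.h>        /* struct ifreq */
  #include <linux/if_tun.h>  /* TUNSETIFF, IFF_TUN, IFF_TAP, IFF_NO_PI */
  #include <stdarg.h>
  #include <string.h>
  #include "dmtcp.h"         /* assumed DMTCP plugin header; provides NEXT_FNC */

  /* Hypothetical record of the TUN/TAP device configuration, so that the
     device can be re-opened and configured identically at restart time. */
  static struct ifreq saved_ifr;
  static int tap_fd = -1;

  int ioctl(int fd, unsigned long request, ...) {
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    int ret = NEXT_FNC(ioctl)(fd, request, arg);

    /* TUNSETIFF attaches this fd to a TUN/TAP device; remember its name
       and flags (e.g., IFF_TAP | IFF_NO_PI) for replay at restart time. */
    if (request == TUNSETIFF && ret == 0) {
      memcpy(&saved_ifr, arg, sizeof(saved_ifr));
      tap_fd = fd;
    }
    return ret;
  }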

III. GENERIC MECHANISM FOR CHECKPOINTING A SINGLE VIRTUAL MACHINE

The techniques employed by the KVM plugin from Section II-A extend to other virtual machines. In particular, a DMTCP plugin was written for the Lguest virtual machine. In this case, Lguest provides a control mechanism by overloading the read and write system calls. Plugin wrapper functions were written for these calls. The Lguest kernel module also had to be modified with about 40 lines of code, in order to extend the Lguest API for read/write. This enables the Lguest plugin to discover and restore the virtual machine state. The plugin itself comprised 100 lines of code.

In the case of user-space QEMU (no KVM kernel module), the task of checkpointing is even simpler. The existing DMTCP package was found to correctly checkpoint and restart QEMU without any additional plugins. See Tables VII, VIII and X for timings across Lguest, KVM/QEMU, and pure QEMU.
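The Lguest wrapper functions mentioned above can be sketched as follows. LHREQ_INITIALIZE (from linux/lguest_launcher.h) is the real command word that the Lguest launcher writes to /dev/lguest, while the lguest_fd bookkeeping, the saved-argument buffer, and the exact record format are hypothetical simplifications.

  #include <linux/lguest_launcher.h>  /* LHREQ_INITIALIZE, ... */
  #include <string.h>
  #include <unistd.h>
  #include "dmtcp.h"                  /* assumed DMTCP plugin header */

  static int lguest_fd = -1;          /* set by an open() wrapper (not shown) */
  static unsigned long init_args[4];  /* hypothetical copy of the guest layout */

  ssize_t write(int fd, const void *buf, size_t count) {
    /* The Lguest launcher controls the guest by writing command words to
       /dev/lguest.  Record the LHREQ_INITIALIZE arguments so that the guest
       can be re-initialized identically at restart time. */
    if (fd == lguest_fd && count >= sizeof(unsigned long) &&
        *(const unsigned long *)buf == LHREQ_INITIALIZE) {
      memcpy(init_args, buf,
             count < sizeof(init_args) ? count : sizeof(init_args));
    }
    return NEXT_FNC(write)(fd, buf, count);
  }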
IV. EXPERIMENTAL RESULTS

The experimental results are split into four subsections concerning: a network of virtual machines; the use of Btrfs for filesystem snapshots; DMTCP optimizations; and performance on a commodity computer. Scalability is tested for two different architectures: distributed computing across a cluster of 12 nodes; and shared-memory computing employing 16 CPU cores.

Configuration (cluster of 12 nodes): Each of the 12 computers is a 12-core Intel Xeon (1.6 GHz) server with 24 GB of RAM. The host operating system was a 64-bit version of CentOS-6.3 with Linux kernel 2.6.32. KVM/QEMU was chosen as the VMM. The guests were set up to run the Ubuntu-12.04 Server version. DMTCP svn revision 1967 was used for these experiments.

Configuration (single node with 16 cores): These experiments were run on a 16-core AMD Opteron (1 GHz) server with 128 GB of RAM. The host operating system was a 64-bit version of Ubuntu-13.04 with Linux kernel 3.8. KVM/QEMU was chosen as the VMM. The guests were set up to run the Ubuntu-12.04 Server version. DMTCP svn revision 1967 was used for these experiments.

A. Scalability of Checkpointing of Virtual Machines

Tables I, II, and III show that restart time increases slowly with the number of VMs, while checkpoint time is close to constant. Further, Tables I and III show that two DMTCP options (further analyzed in Section IV-C) can enable checkpoint and restart in a fraction of a second. First, in forked checkpointing, a child process is forked in order to checkpoint while the parent continues running. Second, in mmap-based fast restart, mmap is used to map into RAM the memory saved within the checkpoint image. Hence, the process restarts faster, while the remaining memory is paged into RAM on demand.

1) Scalability for a Distributed Network of VMs: Table I shows checkpoint and restart timings of HPCC [8].

  Number of   None (sec)       F/C (sec)        F/R (sec)        F/C + F/R (sec)
  Nodes       Ckpt   Restart   Ckpt   Restart   Ckpt   Restart   Ckpt   Restart
   1           9.45  2.83      0.29   3.10      3.78   0.38      0.31   0.34
   2          10.11  3.17      0.34   3.22      3.56   0.36      0.33   0.38
   4          10.63  3.45      0.36   3.73      3.85   0.42      0.38   0.50
   8          11.38  4.59      0.38   4.23      4.17   0.51      0.41   0.52
  12          11.53  5.01      0.42   4.90      4.18   0.59      0.48   0.55

Table I: Checkpoint-restart of the HPCC [8] benchmark on a Gigabit Ethernet cluster, as influenced by DMTCP's optional optimizations: forked checkpoint (F/C) and fast restart (F/R). DMTCP's default gzip compression of checkpoint images is incompatible with DMTCP F/R, and so is not used in those cases. (Memory allocated in each case is 1024 MB.)

2) Scalability for a Network of Virtual Machines in Multi-Core Shared Memory: Table II shows the efficiency for a network of virtual machines under shared memory. Coverage over three types of parallel middleware is demonstrated: MPI (HPCC [8]), TCP/IP sockets (IPython [9]), and PVM (the SNOW parallel computing framework for the R statistical programming language [10]).

  Number    HPCC                  IPython               Parallel R
  of VMs    Ckpt (s)  Restart (s) Ckpt (s)  Restart (s) Ckpt (s)  Restart (s)
  1          9.84     3.31         9.63     3.46        10.02     3.68
  2         10.08     3.75        10.44     4.10        10.54     4.17
  3         10.18     3.86        10.67     4.06        11.13     4.16

Table II: Checkpoint-restart times for virtual machines on a single multi-core computer. (The allocated memory in each case is 1024 MB.)

Table III shows that the two DMTCP optimizations, forked checkpoint and fast restart, greatly enhance checkpoint and restart times. See Section IV-C for descriptions of those optimizations.

  DMTCP           HPCC (sec)      IPython (sec)   Parallel R (sec)
  Optimizations   Ckpt   Restart  Ckpt   Restart  Ckpt   Restart
  None            10.18  3.86     10.67  4.06     11.13  4.16
  F/C              0.37  3.17      0.41  3.92      0.38  3.91
  F/R              3.25  0.36      3.48  0.34      4.01  0.27
  F/C + F/R        0.38  0.35      0.43  0.34      0.41  0.37

Table III: Checkpoint-restart of three VMs on a 16-core computer, while running different applications. The DMTCP optimizations are forked checkpoint (F/C) and fast restart (F/R). DMTCP's default gzip compression of checkpoint images is incompatible with DMTCP F/R, and so is not used in those cases. (Memory allocated in each case is 1024 MB.)

B. Btrfs: Incremental Snapshots of Virtual Machines

A virtual machine snapshot mechanism includes the ability to save the current state of the VM filesystem. This is implemented through the Btrfs copy-on-write filesystem for incremental snapshots of the guest virtual filesystem. Even though the host machines in our experimental facilities did not provide a Btrfs filesystem, we were able to support a Btrfs filesystem through nesting of one KVM/QEMU virtual machine inside another. The outer virtual machine provides a Btrfs virtual filesystem for the inner one. DMTCP runs as a process inside the outer virtual machine, and is used to checkpoint the inner virtual machine. Networking of the VMs is supported through TUN/TAP, as before. Table IV demonstrates the scalability for a distributed computation across four nodes of the cluster.

                 1 node (sec)    2 nodes (sec)   4 nodes (sec)
  Optimizations  Ckpt   Restart  Ckpt   Restart  Ckpt   Restart
  with Btrfs      2.36   1.20     2.45   1.65     3.68   2.35
  without Btrfs  33.28  35.67    34.46  37.20    39.73  39.47

Table IV: Snapshotting up to four distributed VMs running HPCC [8] under KVM/QEMU. The Btrfs filesystem is used to snapshot the filesystem using nested VMs. (Memory allocated in each case is 384 MB. The size of the guest filesystem is 2 GB.)

                 Checkpoint (s)  Restart (s)
  with Btrfs      1.52            0.7
  without Btrfs  10.23           12.48

Table V: The configuration is the same as for Table IV, except that three VMs run on a single 16-core computer.

Tables IV and V show a performance penalty for restarting without Btrfs (using nested VMs), as compared to Table II (non-nested). DMTCP resides in the outer VM. Since the virtualization of I/O devices is never handled by KVM, the outer KVM then transfers control back to the outer QEMU. The outer QEMU resides in user-space memory. The continual switching between kernel and user space accounts for the inefficiency.

Tables IV and V also show the advantage of using the copy-on-write feature of Btrfs to store the guest VM's filesystem. At checkpoint time, a small additional DMTCP plugin rapidly copies the state of the entire filesystem (which appears as a single file on the outer guest's filesystem), using the --reflink option of the GNU coreutils cp command. At restart time, the state of the guest filesystem is similarly copied back. DMTCP's facilities for forked checkpointing and mmap-based fast restart were employed.
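The reflink copy performed at snapshot time is equivalent to cp --reflink, and can be expressed directly with the Btrfs clone ioctl, as in the sketch below. BTRFS_IOC_CLONE is the real ioctl behind reflink copies on Btrfs; snapshot_disk_image is a hypothetical helper with abbreviated error handling.

  #include <fcntl.h>
  #include <linux/btrfs.h>   /* BTRFS_IOC_CLONE */
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  /* Copy-on-write "copy" of the guest's disk image on a Btrfs filesystem:
     only metadata is written, so the snapshot requires no bulk data I/O. */
  int snapshot_disk_image(const char *src, const char *dst) {
    int src_fd = open(src, O_RDONLY);
    int dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src_fd < 0 || dst_fd < 0) { perror("open"); return -1; }

    /* Share src's extents with dst instead of copying the data blocks. */
    int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd);
    if (ret < 0) perror("BTRFS_IOC_CLONE");

    close(src_fd);
    close(dst_fd);
    return ret;
  }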

C. Optimizing: Forked Checkpointing and Fast Restart

DMTCP supports two further performance optimizations: forked checkpointing and mmap-based fast restart. Table VI demonstrates the much-improved performance when using both of these optimizations. All experiments are run on the 16-core computer with just a single VM.

  Allocated Memory   KVM/QEMU (F/C + F/R)
  (MB)               Checkpoint (s)  Restart (s)  Image Size
   128               0.20            0.10         184 MB
   256               0.19            0.09         310 MB
   512               0.21            0.10         568 MB
   768               0.22            0.10         822 MB
  1024               0.21            0.10         1.1 GB

Table VI: Forked checkpoint (F/C) and fast restart (F/R) times for an idle VM under KVM/QEMU.

1) Forked Checkpointing: Times for the forked-checkpointing optimization are given for an idle virtual machine in Table VII. This uses the "--enable-forked-checkpointing" configure option of DMTCP. At checkpoint time, after "draining the network", a child process is forked. The child writes out the checkpoint image in parallel with the parent process continuing its execution. As expected, the parent completes its portion of the checkpoint largely independently of the size of the checkpoint image or allocated memory. Forked checkpointing typically requires 0.2 seconds.

The times for checkpoint and restart for KVM/QEMU are larger than the times for user-space QEMU. This is because the plugin for KVM/QEMU makes extra system calls at checkpoint and restart time. The times can be reduced by modifying the kernel driver to implement a new system call that coalesces all of the operations of the previous system calls.

2) Fast Restart: Times for the fast-restart optimization are given for an idle virtual machine in Table VIII. This uses the "--enable-fast-restart" configure option of DMTCP. This option uses mmap to map the checkpoint image from disk directly into virtual memory, instead of copying data from disk to virtual memory. In this case, memory is demand-paged from the checkpoint image on an as-needed basis.
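The essence of the two optimizations can be sketched as follows. This is a simplified illustration of the underlying ideas, not DMTCP's implementation; write_checkpoint_image is a hypothetical helper.

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Hypothetical helper: serializes the process's memory to ckpt_path. */
  extern void write_checkpoint_image(const char *ckpt_path);

  /* Forked checkpointing: the child inherits a copy-on-write snapshot of
     the parent's memory and writes it out, while the parent immediately
     resumes running the virtual machine. */
  void forked_checkpoint(const char *ckpt_path) {
    if (fork() == 0) {                    /* child: sees the frozen state */
      write_checkpoint_image(ckpt_path);
      _exit(0);
    }
    /* parent: returns to the application without waiting for the write */
  }

  /* mmap-based fast restart: map the saved memory back to its original
     address instead of copying it, so that pages are demand-paged from
     the image file on first access. */
  void *fast_restart_map(const char *ckpt_path, void *orig_addr,
                         size_t len, off_t offset_in_image) {
    int fd = open(ckpt_path, O_RDONLY);
    return mmap(orig_addr, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_FIXED, fd, offset_in_image);
  }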

  Allocated     Lguest (F/C)                      KVM/QEMU (F/C)                    QEMU (user-space, F/C)
  Memory (MB)   Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size
   128          0.16      1.18          30 MB     0.18      1.28          44 MB     0.16      1.70          59 MB
   256          0.17      1.43          32 MB     0.20      2.38          90 MB     0.17      2.99         111 MB
   512          0.18      2.52          35 MB     0.23      3.06         122 MB     0.17      4.44         171 MB
   768          0.17      2.45          36 MB     0.21      3.11         122 MB     0.18      4.97         191 MB
  1024          0.18      2.82          37 MB     0.24      2.96         116 MB     0.19      5.63         213 MB

Table VII: Forked checkpointing (F/C) optimization for idle virtual machines.

  Allocated     Lguest (F/R)                      KVM/QEMU (F/R)                    QEMU (user-space, F/R)
  Memory (MB)   Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size
   128          0.52      0.10         139 MB     0.69      0.10         182 MB     0.59      0.10         230 MB
   256          0.83      0.10         267 MB     1.10      0.09         311 MB     1.33      0.10         408 MB
   512          1.49      0.10         523 MB     1.84      0.10         566 MB     2.44      0.10         761 MB
   768          2.50      0.10         779 MB     2.52      0.09         823 MB     3.54      0.10         1.1 GB
  1024          3.02      0.10         1.1 GB     3.12      0.10         1.1 GB     4.48      0.10         1.5 GB

Table VIII: Fast restart (F/R) optimization for idle virtual machines.

In addition to faster restart times, one observes faster checkpoint times. This is because fast restart disables the default gzip compression; the execution time of gzip normally dominates.

Note that on restart from a checkpoint image, the shadow page tables inside the kernel must be re-created, after which the pages will be faulted back into RAM. This impact is not captured in Tables VIII and VI, because most page faults occur after restart is complete.

D. Performance on a Commodity Host Computer

Configuration: The experiments of this section employed a MacBook laptop with an Intel Core i7 (2.3 GHz), a 256 GB SSD, and 8 GB of RAM. The host operating system was a 32-bit version of Ubuntu-12.10 with Linux kernel 3.5.7. The host was running natively in its own partition on the MacBook. The guest was set up to run the Ubuntu-8.04 Desktop version. DMTCP svn revision 1967 was used for these experiments. Snapshots based on Btrfs (see Section IV-B) were used for all experiments.

Run-Time Overhead of DMTCP: The numbers in Table IX demonstrate the small overhead of executing with DMTCP. DMTCP incurs this overhead due to its use of lightweight wrapper functions around certain system calls. We used the nbench2 benchmark program [4] for these tests. The nbench2 benchmark program is a collection of applications that stress the CPU and the memory. Indexes for memory-intensive, integer-intensive, and floating-point-intensive computations are reported. Each index in Table IX is an nbench2 measure of performance, normalized to a value of one for the AMD K6/233. Higher numbers are better.

                KVM/QEMU                       QEMU (user-space)
                Memory   Int.    Float-point   Memory   Int.    Float-point
                Index    Index   Index         Index    Index   Index
  with DMTCP    31.48    25.54   47.81         2.52     3.47    0.29
  w/o DMTCP     31.38    25.52   48.38         2.44     3.34    0.27

Table IX: Nbench2 benchmark program on virtual machines. (Memory allocated in each case is 1024 MB. Higher index numbers represent higher performance.)

Table IX shows that DMTCP has little impact on performance for a VM running CPU-intensive or memory-intensive loads. As expected, the performance of KVM/QEMU is much higher than that of user-space QEMU, regardless of whether DMTCP is used.

Influence of Memory Footprint: Table X analyzes the influence of the VM memory footprint on checkpoint-restart in the default mode of DMTCP (gzip compression) for an idle virtual machine. For larger sizes (guest VMs with 512 MB to 1024 MB), the checkpoint times grow proportionally to the size of the allocated memory. Below these sizes, other factors dominate. Restart times do not change appreciably at the higher memory sizes.

V. RELATED WORK

Virtual machines support snapshots, a form of checkpointing built into the virtual machine. Examples include Xen [11] and QEMU [3]. Xen has offered checkpointing (snapshots) at least since 2006 [12]. QEMU supports a "savevm" command to create a snapshot, both with and without KVM. Live checkpointing for KVM has been implemented using an additional checkpoint thread [13]. The CEVM system [14] uses a combination of KVM/QEMU's live migration and snapshotting facilities to provide a standalone high-availability system. Similarly to CEVM, both Remus [15] and VM-µCheckpoint [16] offer high-frequency checkpointing of guest VMs on Xen.

  Allocated   Free        Lguest                       KVM/QEMU                     QEMU (user-space)
  Mem. (MB)   Mem. (MB)   Ckpt (s)  Restart (s)  Image Ckpt (s)  Restart (s)  Image Ckpt (s)  Restart (s)  Image
   128          2.5       2.29      1.26         30 MB  3.95     1.31         44 MB  4.34     1.69          59 MB
   256          4.2       3.17      1.38         33 MB  6.42     2.35         89 MB  7.71     3.02         109 MB
   512        184         5.39      2.42         35 MB  9.89     3.28        129 MB 11.87     4.43         170 MB
   768        441         6.82      3.01         38 MB  9.21     3.31        130 MB 14.04     5.05         194 MB
  1024        700         8.34      2.99         37 MB 10.03     3.13        122 MB 16.50     5.47         208 MB

Table X: Checkpoint-restart times for idle virtual machines. The checkpoint times include the times for compressing the memory image and writing the contents to the disk.

They employ Xen's live migration and dirty-page tracking facilities for incremental state snapshots. An earlier technical report provides additional details on the use of a DMTCP plugin in checkpointing a single virtual machine [17].

The Emulab system has demonstrated checkpointing of distributed systems through the use of virtual machines [18]. They did so using Xen and a guest virtual machine that ran a modified Linux kernel. The modified Linux kernel logs packets and replays them on restart. In addition, Emulab uses "delay nodes" (additional virtual machines) sitting between the user's virtual machines, in order to throttle network bandwidth to an acceptable level. In contrast, the current approach does not incur the run-time overhead of delay nodes, and supports any guest operating system, not just a customized Linux kernel. Finally, Emulab operates over the Xen hypervisor, while the current approach employs hosted virtual machines.

Checkpointing of distributed computations is primarily handled by one of two mechanisms today: checkpoint-restart services for MPI; and transparent checkpointing of arbitrary distributed computations. MPI implementations of checkpoint-restart typically operate by first stopping all MPI messages [19], [20], [21]. When it can be detected that there are no MPI messages in transit, a single-host checkpointing package is then employed. Often that single-host package is the kernel-based BLCR [22] package. Open MPI supports the option of using either MTCP (the single-process component of DMTCP) or BLCR. In addition to BLCR, two other commonly used packages for single-host checkpointing are CryoPid2 [23] and OpenVZ [24] (based on CRIU [25]).

DMTCP [6] was the first transparent user-space checkpoint-restart package for distributed computations, and remains the most widely used example of this. Further, unlike the MPI approach, DMTCP permits network messages to be in transit when the checkpoint occurs.

For the support of snapshots, one requires a copy-on-write filesystem. A common current choice is QCOW2 [26], which supports the creation of incremental snapshots. Another recent choice is BlobSeer [27], as used in [28, Section 3.3]. That choice has the advantage of exposing the raw checkpoint image file to the host operating system or hypervisor. The work described here uses Btrfs [5]. Like BlobSeer, Btrfs exposes the raw checkpoint image to the host, making it compatible with the use of DMTCP from outside both the VM and the VM kernel driver.

VI. CONCLUSION

A mechanism for checkpointing a network of virtual machines has been presented. It uses the plugin architecture of the DMTCP checkpoint-restart package, operating on top of the KVM/QEMU virtual machine. The implementation requires a 400-line KVM-specific plugin, as well as a 200-line plugin to adapt Linux's TUN/TAP so as to allow DMTCP to "drain the network" prior to checkpoint. The plugin mechanism has the potential to be easily adapted to other virtual machines. The integration of the Btrfs copy-on-write filesystem with nested copies of KVM/QEMU was used for fast, incremental snapshots of a network of virtual machines.

ACKNOWLEDGMENT

The authors acknowledge Kapil Arya for helpful comments on this paper, and also for advice on creating the DMTCP plugins. They also thank Larry Owen, Anthony Skjellum, and the University of Alabama at Birmingham (under NSF grant CNS-1337747) for providing a cluster with KVM and TUN/TAP.

REFERENCES

[1] KVM team, "KVM — QEMU," http://wiki.qemu.org/KVM; see also http://www.linux-kvm.org/page/Main_Page. Accessed Nov. 18, 2012.

[2] R. Russell, "Lguest: The simple x86 hypervisor," http://lguest.ozlabs.org/. Accessed Nov. 18, 2012.

[3] QEMU team, "QEMU," http://wiki.qemu.org/Main_Page. Accessed Nov. 18, 2012.

[4] U. F. Mayer, "Linux/Unix nbench," http://www.tux.org/~mayer/linux/bmark.html. Retrieved Dec. 4, 2012.

[5] O. Rodeh, J. Bacik, and C. Mason, "BTRFS: The Linux B-tree filesystem," IBM Research Report RJ10501 (ALM1207-004), Tech. Rep., July 2012. http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

[6] J. Ansel, K. Arya, and G. Cooperman, "DMTCP: Transparent checkpointing for cluster computations and the desktop," in 23rd IEEE Int. Symp. on Parallel and Distributed Processing (IPDPS-09), 2009, pp. 1–12.

[7] DMTCP team, "DMTCP: Distributed multithreaded checkpointing," http://dmtcp.sourceforge.net. Accessed Nov. 18, 2012.

[8] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, and D. Takahashi, "The HPC Challenge (HPCC) benchmark suite," in Proc. of the 2006 ACM/IEEE Conf. on Supercomputing (SC-06). New York, NY, USA: ACM, 2006. http://doi.acm.org/10.1145/1188455.1188677

[9] F. Pérez and B. E. Granger, "IPython: A system for interactive scientific computing," Comput. Sci. Eng., vol. 9, no. 3, pp. 21–29, May 2007. http://ipython.org

[10] L. Tierney, A. J. Rossini, and N. Li, "Snow: A parallel computing framework for the R system," Int. J. Parallel Program., vol. 37, no. 1, pp. 78–90, Feb. 2009. http://dx.doi.org/10.1007/s10766-008-0077-2

[11] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proc. of 19th ACM Symposium on Operating Systems Principles (SOSP-03). New York, NY, USA: ACM, 2003, pp. 164–177.

[12] G. Vallée, T. Naughton, H. Ong, and S. L. Scott, "Checkpoint/restart of virtual machines based on Xen," in HAPCW'06: High Availability and Performance Computing Workshop. Santa Fe, New Mexico, USA: held in conjunction with LACSI 2006, Oct. 2006.

[13] V. Siripoonya and K. Chanchio, "Thread-based live checkpointing of virtual machines," in 10th IEEE Int. Symp. on Network Computing and Applications, 2011.

[14] K. Chanchio, C. Leangsuksun, H. Ong, V. Ratanasamoot, and A. Shafi, "An efficient virtual machine checkpointing mechanism for hypervisor-based HPC systems," in High Availability and Performance Computing Workshop (HAPCW), 2008.

[15] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High availability via asynchronous virtual machine replication," in Proc. of the 5th USENIX Symp. on Networked Systems Design and Implementation (NSDI-08). Berkeley, CA, USA: USENIX Association, 2008, pp. 161–174. http://dl.acm.org/citation.cfm?id=1387589.1387601

[16] L. Wang, Z. Kalbarczyk, R. Iyer, and A. Iyengar, "Checkpointing virtual machines against transient errors," in 2010 IEEE 16th International On-Line Testing Symposium (IOLTS), 2010, pp. 97–102.

[17] R. Garg, K. Sodha, and G. Cooperman, "A generic checkpoint-restart mechanism for virtual machines," Tech. Rep., 2012. http://arxiv.org/abs/1212.1787v1

[18] A. Burtsev, P. Radhakrishnan, M. Hibler, and J. Lepreau, "Transparent checkpoints of closed distributed systems in Emulab," in Proc. of 4th ACM European Conf. on Computer Systems. ACM, 2009, pp. 173–186.

[19] J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in Proc. of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) / 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems. IEEE Computer Society, March 2007.

[20] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in ACM/IEEE 2002 Conference on Supercomputing. IEEE Press, 2002.

[21] S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479–493, 2005.

[22] P. Hargrove and J. Duell, "Berkeley Lab checkpoint/restart (BLCR) for Linux clusters," J. of Physics: Conference Series, vol. 46, pp. 494–499, Sep. 2006.

[23] M. O'Neill, "CryoPid2," http://sourceforge.net/projects/cryopid2.

[24] OpenVZ team, "OpenVZ," http://wiki.openvz.org/.

[25] CRIU team, "CRIU," http://criu.org/.

[26] M. McLoughlin, "The QCOW2 image format," http://people.gnome.org/~markmc/qcow-image-format.html, 2008.

[27] B. Nicolae, G. Antoniu, L. Bougé, D. Moise, and A. Carpen-Amarie, "BlobSeer: Next generation data management for large scale infrastructures," Journal of Parallel and Distributed Computing, vol. 71, no. 2, pp. 168–184, Feb. 2011. http://hal.inria.fr/inria-00511414

[28] B. Nicolae and F. Cappello, "BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots," in Proc. of 2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC-11). ACM, 2011, pp. 1–12.
