Checkpoint-Restart for a Network of Virtual Machines

Rohan Garg, Komal Sodha, Zhengping Jin, Gene Cooperman∗
Northeastern University, Boston, MA, USA
{rohgarg,komal,jinzp,gene}@ccs.neu.edu

Abstract—The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.

I. INTRODUCTION

An approach for providing fault tolerance to complex distributed applications is demonstrated. It is based on checkpointing a network of virtual machines. Such a network can be started locally, and later checkpointed for re-deployment (restart from checkpoint images) in the Cloud. This is especially important to support fault tolerance and load balancing in the Cloud.

The approach also provides flexibility. It employs DMTCP, an unprivileged, purely user-space checkpointing package. Potential examples of flexible application-specific policies are: incremental checkpointing; declaration of cutouts (regions of memory that do not require checkpointing); application-specific memory compression during checkpoint (for example, conversion of double to float); and so on. End users can write application-specific DMTCP plugins to support flexible checkpointing.

Further, the maintainability of a proposed architecture is important. Here, we measure maintainability by the number of lines of new code required, beyond the base code of a checkpoint-restart package or the base code of the virtual machine itself. The proposed architecture relies on just 600 lines of new code: 400 lines of code for a KVM-specific plugin used to checkpoint the virtual machine, and 200 lines of code for a TUN/TAP plugin.

The two DMTCP plugins above are external libraries loaded into an unmodified DMTCP. Source code can be found in the contrib directory of the DMTCP repository. (See Section II for further details of plugins.)

The approach described here saves the state of an arbitrary guest operating system, which runs within a virtual machine under a host operating system. The primary virtual machine described in this work is KVM/QEMU [1]. However, to demonstrate the generality of the approach, a plugin was also developed for Lguest [2]. That plugin required about 100 lines of code, as well as about 40 lines of modifications to the Lguest kernel driver to extend its API. The methodology was also applied to pure user-space QEMU [3]. Surprisingly, DMTCP was able to checkpoint user-space QEMU "out-of-the-box" (without the use of additional plugins).

Experiments in Section IV-C demonstrate compatibility with DMTCP's performance optimizations: forked checkpointing and mmap-based fast restart. Forked checkpointing enables a virtual machine snapshot in 0.4 seconds when running with the Btrfs filesystem, while mmap-based fast restart allows resuming from the snapshot in 0.3 seconds. In addition, Section IV-D shows the run-time overhead to be too small to measure when running the nbench2 [4] benchmark program.

Snapshots (including the filesystem): In VM terminology, a snapshot saves not only the state of the virtual machine, but also the filesystem used by that virtual machine. The Btrfs filesystem [5] can be used to implement copy-on-write incremental snapshots. Thus, during checkpoint of a virtual machine, one can also create either a full snapshot or an incremental snapshot of the guest filesystem.

∗This work was partially supported by the National Science Foundation under Grant OCI-0960978.

On computers where the host operating system does not provide the Btrfs filesystem, it is still possible to employ Btrfs. An "inner" KVM/QEMU virtual machine can be run nested inside an "outer" KVM/QEMU virtual machine, which in turn runs under the host operating system. The outer VM provides Btrfs, and DMTCP runs inside the outer VM, checkpointing the inner VM.

In the rest of this paper, Section II provides background on DMTCP plugins. Section III describes a generic mechanism for checkpoint-restart of single virtual machines. Section IV provides experimental running times over a variety of scenarios, Section V describes related work, and Section VI provides the conclusion.

II. DMTCP, KVM, AND TUN/TAP: EXTENDING CHECKPOINT-RESTART TO VMS

DMTCP (Distributed MultiThreaded CheckPointing) [6] is used to checkpoint and restart a network of virtual machines. DMTCP provides a facility for third-party plugins, as well as using them in its own internal architecture. The work described here is based on svn revision 1967 of DMTCP [7].

DMTCP implements transparent user-space checkpoint-restart. It does this by saving to a checkpoint image all of user-space memory, along with pertinent process state (thread information, open file descriptors, associated terminal device, stdin/stdout/stderr, sockets, shared memory regions, etc.). Internal DMTCP plugins employ specific algorithms to checkpoint the state of open files, network sockets, shared memory regions, and other special cases.

This work uses the plugin mechanism to extend DMTCP in two directions: support for KVM, and support for the virtual-network kernel devices TUN and TAP. TUN/TAP is used for networking of multiple KVM-based virtual machines. First, DMTCP is extended to support checkpointing of a single KVM/QEMU virtual machine. Second, DMTCP is extended to support checkpointing of the TUN/TAP network, including any network data "in flight".

In order to checkpoint KVM/QEMU, it is launched under the control of DMTCP. A typical example of launch, checkpoint, and restart is as follows:

  % dmtcp_checkpoint --with-plugin \
        dmtcp_kvm_plugin.so \
        dmtcp_tun_plugin.so qemu ...
  % dmtcp_command --checkpoint
  % dmtcp_restart qemu_*.dmtcp

Section II-A discusses handling of the KVM/QEMU virtual machine, while Section II-B discusses network handling and the use of TUN/TAP.

A. Checkpointing the KVM/QEMU Virtual Machine

QEMU uses KVM to run user-space code natively on hardware that supports virtualization. It uses KVM's API to initialize and control the guest virtual machine. This API is based on the ioctl system call. For the rest of this discussion, the term QEMU is used both to refer to the QEMU virtual machine monitor (VMM), and the virtual machine itself (including the guest operating system).

DMTCP plugins offer two primary mechanisms to extend checkpoint-restart: a run-time mechanism (wrapper functions around library calls made by the application); and customization of checkpoint/restart to save and restore the state of external objects. In this case, QEMU is the target application being checkpointed, and the KVM kernel module is the external object whose state must be virtualized.

The run-time portion of the KVM plugin is primarily concerned with a function wrapper around the ioctl system call. This wrapper function captures system calls by QEMU to KVM. This is used to make a local copy of the parameters that QEMU used to initialize the new virtual machine. At the time of restart, those same parameters are used to reset the KVM parameters to correspond.
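To make the run-time mechanism concrete, the following is a minimal sketch of such an ioctl wrapper, assuming the dmtcp.h plugin header and the NEXT_FNC convention of recent DMTCP releases (NEXT_FNC invokes the next wrapper, or the underlying system function). KVM_SET_USER_MEMORY_REGION is part of the real KVM API; the fixed-size table and the choice to record only memory-region setup are illustrative simplifications, not the actual 400-line plugin.

  #include <linux/kvm.h>   /* struct kvm_userspace_memory_region, KVM_* requests */
  #include <stdarg.h>
  #include <string.h>
  #include "dmtcp.h"       /* assumed DMTCP plugin header; provides NEXT_FNC */

  /* Plugin-local log of the guest-memory parameters that QEMU passes to KVM,
     to be replayed at restart time to rebuild an equivalent virtual machine. */
  static struct kvm_userspace_memory_region mem_regions[64];
  static int num_mem_regions = 0;

  int ioctl(int fd, unsigned long request, ...) {
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    /* Record the guest memory layout that QEMU establishes through KVM. */
    if (request == KVM_SET_USER_MEMORY_REGION && num_mem_regions < 64) {
      memcpy(&mem_regions[num_mem_regions++], arg, sizeof(mem_regions[0]));
    }

    return NEXT_FNC(ioctl)(fd, request, arg);  /* forward to the real ioctl */
  }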

The remainder of the KVM plugin is concerned with saving state at checkpoint time, and restoring state at restart time. The KVM saved state includes the state of the virtual CPU (registers, etc.) and the state of the interrupt controllers. The KVM API provides explicit system calls that the plugin uses to save and restore this state, as illustrated in the sketch below.

Another example of KVM/QEMU state is the virtual memory tables. These tables are contained within the user-space memory of the QEMU process itself (here viewing QEMU as a process in the host operating system). At the time of restart, the original mapping between the guest physical pages and the host physical pages has been lost. However, the DMTCP plugin does not need to create a new mapping, because a page fault causes the kernel to re-establish the mapping.

Figure 1 illustrates the generic architecture of a guest virtual machine. At the time of checkpoint, the DMTCP plugin discovers the parameters of the KVM hypervisor in supporting the current state of the QEMU virtual machine. DMTCP then writes to a checkpoint image the memory of the QEMU virtual machine, which consists of the user-space memory of the process of the host operating system that is running QEMU. Figure 2 presents the launching of a fresh virtual machine at restart time, which is then modified to correspond to the pre-checkpoint QEMU.

[Figure 1 (diagram omitted): Generic VM Architecture. The sketch shows the VM components of interest for checkpoint-restart. In user-space memory: the guest VM's user-space component, its tables (shared with kernel space), async I/O threads, and vCPU threads. In kernel-space memory: the kernel module's VM shell, with tables (shared with user space), a hardware description (peripherals, IRQ, etc.), and vCPUs (vCPU0 ... vCPUn) for the virtual cores. The VM shell refers to the uninitialized data structures in the kernel driver that describe the virtual machine. A VM launcher initializes those data structures. A generic checkpoint-restart mechanism restores those data structures appropriately.]

[Figure 2 (diagram omitted): Re-Starting a Virtual Machine from a Checkpoint Image. The layout matches Figure 1, but the VM shell begins with an empty hardware description. The DMTCP plugin re-creates the original hardware description from the checkpoint image. In addition, the user-space memory of the guest VM is restored by DMTCP at the original addresses.]

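In simplified form, the explicit save/restore calls mentioned above reduce to KVM requests such as the following. The requests shown (KVM_GET_REGS, KVM_SET_REGS, KVM_GET_SREGS, KVM_SET_SREGS, KVM_GET_IRQCHIP, KVM_SET_IRQCHIP) belong to the real KVM API; the single-vCPU structure, the fixed irqchip choice, and the omitted error handling make this a sketch rather than the actual plugin code.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Saved at checkpoint time; restored into a freshly created VM at restart.
     (Simplified: one vCPU and one interrupt controller.) */
  static struct kvm_regs saved_regs;
  static struct kvm_sregs saved_sregs;
  static struct kvm_irqchip saved_irqchip;

  void save_kvm_state(int vcpu_fd, int vm_fd) {
    ioctl(vcpu_fd, KVM_GET_REGS, &saved_regs);       /* general registers */
    ioctl(vcpu_fd, KVM_GET_SREGS, &saved_sregs);     /* segment/control registers */
    saved_irqchip.chip_id = KVM_IRQCHIP_PIC_MASTER;  /* interrupt controller */
    ioctl(vm_fd, KVM_GET_IRQCHIP, &saved_irqchip);
  }

  void restore_kvm_state(int vcpu_fd, int vm_fd) {
    ioctl(vcpu_fd, KVM_SET_REGS, &saved_regs);
    ioctl(vcpu_fd, KVM_SET_SREGS, &saved_sregs);
    ioctl(vm_fd, KVM_SET_IRQCHIP, &saved_irqchip);
  }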
At the time of restart, the DMTCP plugin requests KVM to create a fresh virtual machine (not specific to QEMU). Then, DMTCP replaces this fresh virtual machine (which exists as the user-space memory of a process in the host operating system) by the original user-space memory from the checkpoint image. Finally, the DMTCP plugin makes calls to the KVM kernel module to reset the KVM parameters so as to correspond to those of the pre-checkpoint QEMU virtual machine.

B. Checkpointing the TUN/TAP Network

A TUN/TAP plugin extends DMTCP similarly to the KVM plugin. Wrapper functions are implemented for ioctl to detect how the network was set up.

For background, we briefly review how DMTCP provides checkpointing over a TCP/IP network. At the time of checkpoint, DMTCP "drains the network": (a) by stopping the user threads of all processes in the computation; (b) by receiving from each socket until all network data "in flight" has been collected; and (c) by then writing a checkpoint image. A "cookie" (a unique set of data) is sent through each network connection so that the receiver can determine when no further data is in flight.

The TUN/TAP plugin employs a similar strategy, except that TUN/TAP does not provide an analog of a socket connection. It operates at a lower level, in which network packets generated by the guest operating system are injected directly into the physical network. Only the guest operating system is aware of the socket connections being used by the applications within it.

Two alternative approaches to draining the network are: (a) to send a broadcast packet that plays the role of the DMTCP cookie; and (b) to wait for a specified time sufficient for all network packets to arrive. Mechanism (b) is used currently. For added reliability, at the end of writing the checkpoint image, the network is checked to see whether any late packets have arrived. If a late packet is detected, the user can be warned, or a second DMTCP checkpoint can be automatically initiated.
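Returning to the wrapper functions at the start of this subsection, the sketch below shows how such a wrapper can recognize the TUN/TAP setup. TUNSETIFF and struct ifreq are the real Linux TUN/TAP interface; the recording logic is an illustrative stand-in for the 200-line plugin, which must also re-create and re-configure the device at restart time. (The same dmtcp.h/NEXT_FNC assumptions as before apply.)

  #include <net/if.h>        /* struct ifreq */
  #include <linux/if_tun.h>  /* TUNSETIFF, IFF_TUN, IFF_TAP, IFF_NO_PI */
  #include <stdarg.h>
  #include <string.h>
  #include "dmtcp.h"         /* assumed DMTCP plugin header; provides NEXT_FNC */

  /* Hypothetical record of the TUN/TAP device configuration, so that the
     device can be re-opened and configured identically at restart time. */
  static struct ifreq saved_ifr;
  static int tap_fd = -1;

  int ioctl(int fd, unsigned long request, ...) {
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    int ret = NEXT_FNC(ioctl)(fd, request, arg);

    /* TUNSETIFF attaches this fd to a TUN/TAP device; remember its name
       and flags (e.g., IFF_TAP | IFF_NO_PI) for replay at restart time. */
    if (request == TUNSETIFF && ret == 0) {
      memcpy(&saved_ifr, arg, sizeof(saved_ifr));
      tap_fd = fd;
    }
    return ret;
  }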

III. GENERIC MECHANISM FOR CHECKPOINTING A SINGLE VIRTUAL MACHINE

The techniques employed by the KVM plugin from Section II-A extend to other virtual machines. In particular, a DMTCP plugin was written for the Lguest virtual machine. In this case, Lguest provides a control mechanism by overloading the read and write system calls. Plugin wrapper functions were written for these calls. The Lguest kernel module also had to be modified with about 40 lines of code, in order to extend the Lguest API for read/write. This enables the Lguest plugin to discover and restore the virtual machine state. The plugin itself comprised 100 lines of code.

In the case of user-space QEMU (no KVM kernel module), the task of checkpointing is even simpler. The existing DMTCP package was found to correctly checkpoint and restart QEMU without any additional plugins. See Tables VII, VIII and X for timings across Lguest, KVM/QEMU, and pure QEMU.
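The Lguest wrapper functions mentioned above can be sketched as follows. LHREQ_INITIALIZE (from linux/lguest_launcher.h) is the real command word that the Lguest launcher writes to /dev/lguest, while the lguest_fd bookkeeping, the saved-argument buffer, and the exact record format are hypothetical simplifications.

  #include <linux/lguest_launcher.h>  /* LHREQ_INITIALIZE, ... */
  #include <string.h>
  #include <unistd.h>
  #include "dmtcp.h"                  /* assumed DMTCP plugin header */

  static int lguest_fd = -1;          /* set by an open() wrapper (not shown) */
  static unsigned long init_args[4];  /* hypothetical copy of the guest layout */

  ssize_t write(int fd, const void *buf, size_t count) {
    /* The Lguest launcher controls the guest by writing command words to
       /dev/lguest.  Record the LHREQ_INITIALIZE arguments so that the guest
       can be re-initialized identically at restart time. */
    if (fd == lguest_fd && count >= sizeof(unsigned long) &&
        *(const unsigned long *)buf == LHREQ_INITIALIZE) {
      memcpy(init_args, buf,
             count < sizeof(init_args) ? count : sizeof(init_args));
    }
    return NEXT_FNC(write)(fd, buf, count);
  }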
IV. EXPERIMENTAL RESULTS

The experimental results are split into four subsections concerning: a network of virtual machines; the use of Btrfs for filesystem snapshots; DMTCP optimizations; and performance on a commodity computer. Scalability is tested for two different architectures: distributed computing across a cluster of 12 nodes; and shared-memory computing employing 16 CPU cores.

Configuration (cluster of 12 nodes): Each of the 12 computers is a 12-core Intel Xeon (1.6 GHz) server with 24 GB of RAM. The host operating system was a 64-bit version of CentOS-6.3 with Linux kernel 2.6.32. KVM/QEMU was chosen as the VMM. The guests were set up to run the Ubuntu-12.04 Server version. DMTCP svn revision 1967 was used for these experiments.

Configuration (single node with 16 cores): These experiments were run on a 16-core AMD Opteron (1 GHz) server with 128 GB of RAM. The host operating system was a 64-bit version of Ubuntu-13.04 with Linux kernel 3.8. KVM/QEMU was chosen as the VMM. The guests were set up to run the Ubuntu-12.04 Server version. DMTCP svn revision 1967 was used for these experiments.

A. Scalability of Checkpointing of Virtual Machines

Tables I, II, and III show that restart time increases slowly with the number of VMs, while checkpoint time is close to constant. Further, Tables I and III show that two DMTCP options (further analyzed in Section IV-C) can enable checkpoint and restart in a fraction of a second. First, in forked checkpointing, a child process is forked in order to checkpoint while the parent continues running. Second, in mmap-based fast restart, mmap is used to map into RAM the memory saved within the checkpoint image. Hence, the process restarts faster, while the remaining memory is paged into RAM on demand.

1) Scalability for a Distributed Network of VMs: Table I shows checkpoint and restart timings of HPCC [8].

  Number of   None (sec)       F/C (sec)        F/R (sec)        F/C + F/R (sec)
  Nodes       Ckpt   Restart   Ckpt   Restart   Ckpt   Restart   Ckpt   Restart
   1           9.45  2.83      0.29   3.10      3.78   0.38      0.31   0.34
   2          10.11  3.17      0.34   3.22      3.56   0.36      0.33   0.38
   4          10.63  3.45      0.36   3.73      3.85   0.42      0.38   0.50
   8          11.38  4.59      0.38   4.23      4.17   0.51      0.41   0.52
  12          11.53  5.01      0.42   4.90      4.18   0.59      0.48   0.55

Table I: Checkpoint-restart of the HPCC [8] benchmark on a Gigabit Ethernet cluster, as influenced by DMTCP's optional optimizations: forked checkpoint (F/C) and fast restart (F/R). DMTCP's default gzip compression of checkpoint images is incompatible with DMTCP F/R, and so is not used in those cases. (Memory allocated in each case is 1024 MB.)

2) Scalability for a Network of Virtual Machines in Multi-Core Shared Memory: Table II shows the efficiency for a network of virtual machines under shared memory. Coverage over three types of parallel middleware is demonstrated: MPI (HPCC [8]), TCP/IP sockets (IPython [9]), and PVM (the SNOW parallel computing framework for the R statistical programming language [10]).

  Number    HPCC                  IPython               Parallel R
  of VMs    Ckpt (s)  Restart (s) Ckpt (s)  Restart (s) Ckpt (s)  Restart (s)
  1          9.84     3.31         9.63     3.46        10.02     3.68
  2         10.08     3.75        10.44     4.10        10.54     4.17
  3         10.18     3.86        10.67     4.06        11.13     4.16

Table II: Checkpoint-restart times for virtual machines on a single multi-core computer. (The allocated memory in each case is 1024 MB.)

Table III shows that the two DMTCP optimizations, forked checkpoint and fast restart, greatly enhance checkpoint and restart times. See Section IV-C for descriptions of those optimizations.

  DMTCP           HPCC (sec)      IPython (sec)   Parallel R (sec)
  Optimizations   Ckpt   Restart  Ckpt   Restart  Ckpt   Restart
  None            10.18  3.86     10.67  4.06     11.13  4.16
  F/C              0.37  3.17      0.41  3.92      0.38  3.91
  F/R              3.25  0.36      3.48  0.34      4.01  0.27
  F/C + F/R        0.38  0.35      0.43  0.34      0.41  0.37

Table III: Checkpoint-restart of three VMs on a 16-core computer, while running different applications. The DMTCP optimizations are forked checkpoint (F/C) and fast restart (F/R). DMTCP's default gzip compression of checkpoint images is incompatible with DMTCP F/R, and so is not used in those cases. (Memory allocated in each case is 1024 MB.)

B. Btrfs: Incremental Snapshots of Virtual Machines

A virtual machine snapshot mechanism includes the ability to save the current state of the VM filesystem. This is implemented through the Btrfs copy-on-write filesystem for incremental snapshots of the guest virtual filesystem. Even though the host machines in our experimental facilities did not provide a Btrfs filesystem, we were able to support a Btrfs filesystem through nesting of one KVM/QEMU virtual machine inside another. The outer virtual machine provides a Btrfs virtual filesystem for the inner one. DMTCP runs as a process inside the outer virtual machine, and is used to checkpoint the inner virtual machine. Networking of the VMs is supported through TUN/TAP, as before. Table IV demonstrates the scalability for a distributed computation across four nodes of the cluster.

                 1 node (sec)    2 nodes (sec)   4 nodes (sec)
  Optimizations  Ckpt   Restart  Ckpt   Restart  Ckpt   Restart
  with Btrfs      2.36   1.20     2.45   1.65     3.68   2.35
  without Btrfs  33.28  35.67    34.46  37.20    39.73  39.47

Table IV: Snapshotting up to four distributed VMs running HPCC [8] under KVM/QEMU. The Btrfs filesystem is used to snapshot the filesystem using nested VMs. (Memory allocated in each case is 384 MB. The size of the guest filesystem is 2 GB.)

                 Checkpoint (s)  Restart (s)
  with Btrfs      1.52            0.7
  without Btrfs  10.23           12.48

Table V: The configuration is the same as for Table IV, except that three VMs run on a single 16-core computer.

Tables IV and V show a performance penalty for restarting without Btrfs (using nested VMs), as compared to Table II (non-nested). DMTCP resides in the outer VM. Since the virtualization of I/O devices is never handled by KVM, the outer KVM then transfers control back to the outer QEMU. The outer QEMU resides in user-space memory. The continual switching between kernel and user space accounts for the inefficiency.

Tables IV and V also show the advantage of using the copy-on-write feature of Btrfs to store the guest VM's filesystem. At checkpoint time, a small additional DMTCP plugin rapidly copies the state of the entire filesystem (which appears as a single file on the outer guest's filesystem), using the --reflink option of the GNU coreutils cp command. At restart time, the state of the guest filesystem is similarly copied back. DMTCP's facilities for forked checkpointing and mmap-based fast restart were employed.
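The reflink copy performed at snapshot time is equivalent to cp --reflink, and can be expressed directly with the Btrfs clone ioctl, as in the sketch below. BTRFS_IOC_CLONE is the real ioctl behind reflink copies on Btrfs; snapshot_disk_image is a hypothetical helper with abbreviated error handling.

  #include <fcntl.h>
  #include <linux/btrfs.h>   /* BTRFS_IOC_CLONE */
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  /* Copy-on-write "copy" of the guest's disk image on a Btrfs filesystem:
     only metadata is written, so the snapshot requires no bulk data I/O. */
  int snapshot_disk_image(const char *src, const char *dst) {
    int src_fd = open(src, O_RDONLY);
    int dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src_fd < 0 || dst_fd < 0) { perror("open"); return -1; }

    /* Share src's extents with dst instead of copying the data blocks. */
    int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd);
    if (ret < 0) perror("BTRFS_IOC_CLONE");

    close(src_fd);
    close(dst_fd);
    return ret;
  }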

C. Optimizing: Forked Checkpointing and Fast Restart

DMTCP supports two further performance optimizations: forked checkpointing and mmap-based fast restart. Table VI demonstrates the much-improved performance when using both of these optimizations. All experiments are run on the 16-core computer with just a single VM.

  Allocated Memory   KVM/QEMU (F/C + F/R)
  (MB)               Checkpoint (s)  Restart (s)  Image Size
   128               0.20            0.10         184 MB
   256               0.19            0.09         310 MB
   512               0.21            0.10         568 MB
   768               0.22            0.10         822 MB
  1024               0.21            0.10         1.1 GB

Table VI: Forked checkpoint (F/C) and fast restart (F/R) times for an idle VM under KVM/QEMU.

1) Forked Checkpointing: Times for the forked-checkpointing optimization are given for an idle virtual machine in Table VII. This uses the "--enable-forked-checkpointing" configure option of DMTCP. At checkpoint time, after "draining the network", a child process is forked. The child writes out the checkpoint image in parallel with the parent process continuing its execution. As expected, the parent completes its portion of the checkpoint largely independently of the size of the checkpoint image or allocated memory. Forked checkpointing typically requires 0.2 seconds.

The times for checkpoint and restart for KVM/QEMU are larger than the times for user-space QEMU. This is because the plugin for KVM/QEMU makes extra system calls at checkpoint and restart time. The times can be reduced by modifying the kernel driver to implement a new system call that coalesces all of the operations of the previous system calls.

2) Fast Restart: Times for the fast-restart optimization are given for an idle virtual machine in Table VIII. This uses the "--enable-fast-restart" configure option of DMTCP. This option uses mmap to map the checkpoint image from disk directly into virtual memory, instead of copying data from disk to virtual memory. In this case, memory is demand-paged from the checkpoint image on an as-needed basis.
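The essence of the two optimizations can be sketched as follows. This is a simplified illustration of the underlying ideas, not DMTCP's implementation; write_checkpoint_image is a hypothetical helper.

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Hypothetical helper: serializes the process's memory to ckpt_path. */
  extern void write_checkpoint_image(const char *ckpt_path);

  /* Forked checkpointing: the child inherits a copy-on-write snapshot of
     the parent's memory and writes it out, while the parent immediately
     resumes running the virtual machine. */
  void forked_checkpoint(const char *ckpt_path) {
    if (fork() == 0) {                    /* child: sees the frozen state */
      write_checkpoint_image(ckpt_path);
      _exit(0);
    }
    /* parent: returns to the application without waiting for the write */
  }

  /* mmap-based fast restart: map the saved memory back to its original
     address instead of copying it, so that pages are demand-paged from
     the image file on first access. */
  void *fast_restart_map(const char *ckpt_path, void *orig_addr,
                         size_t len, off_t offset_in_image) {
    int fd = open(ckpt_path, O_RDONLY);
    return mmap(orig_addr, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_FIXED, fd, offset_in_image);
  }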

  Allocated     Lguest (F/C)                      KVM/QEMU (F/C)                    QEMU (user-space, F/C)
  Memory (MB)   Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size
   128          0.16      1.18          30 MB     0.18      1.28          44 MB     0.16      1.70          59 MB
   256          0.17      1.43          32 MB     0.20      2.38          90 MB     0.17      2.99         111 MB
   512          0.18      2.52          35 MB     0.23      3.06         122 MB     0.17      4.44         171 MB
   768          0.17      2.45          36 MB     0.21      3.11         122 MB     0.18      4.97         191 MB
  1024          0.18      2.82          37 MB     0.24      2.96         116 MB     0.19      5.63         213 MB

Table VII: Forked checkpointing (F/C) optimization for idle virtual machines.

  Allocated     Lguest (F/R)                      KVM/QEMU (F/R)                    QEMU (user-space, F/R)
  Memory (MB)   Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size Ckpt (s)  Restart (s)  Image Size
   128          0.52      0.10         139 MB     0.69      0.10         182 MB     0.59      0.10         230 MB
   256          0.83      0.10         267 MB     1.10      0.09         311 MB     1.33      0.10         408 MB
   512          1.49      0.10         523 MB     1.84      0.10         566 MB     2.44      0.10         761 MB
   768          2.50      0.10         779 MB     2.52      0.09         823 MB     3.54      0.10         1.1 GB
  1024          3.02      0.10         1.1 GB     3.12      0.10         1.1 GB     4.48      0.10         1.5 GB

Table VIII: Fast restart (F/R) optimization for idle virtual machines.

In addition to faster restart times, one observes faster checkpoint times. This is because fast restart disables the default gzip compression; the execution time of gzip normally dominates.

Note that on restart from a checkpoint image, the shadow page tables inside the kernel must be re-created, after which the pages will be faulted back into RAM. This impact is not captured in Tables VIII and VI, because most page faults occur after restart is complete.

D. Performance on a Commodity Host Computer

Configuration: The experiments of this section employed a MacBook laptop with an Intel Core i7 (2.3 GHz), a 256 GB SSD, and 8 GB of RAM. The host operating system was a 32-bit version of Ubuntu-12.10 with Linux kernel 3.5.7. The host was running natively in its own partition on the MacBook. The guest was set up to run the Ubuntu-8.04 Desktop version. DMTCP svn revision 1967 was used for these experiments. Snapshots based on Btrfs (see Section IV-B) were used for all experiments.

Run-Time Overhead of DMTCP: The numbers in Table IX demonstrate the small overhead of executing with DMTCP. DMTCP incurs this overhead due to its use of lightweight wrapper functions around certain system calls. We used the nbench2 benchmark program [4] for these tests. The nbench2 benchmark program is a collection of applications that stress the CPU and the memory. Indexes for memory-intensive, integer-intensive, and floating-point-intensive computations are reported. Each index in Table IX is an nbench2 measure of performance, normalized to a value of one for the AMD K6/233. Higher numbers are better.

                KVM/QEMU                       QEMU (user-space)
                Memory   Int.    Float-point   Memory   Int.    Float-point
                Index    Index   Index         Index    Index   Index
  with DMTCP    31.48    25.54   47.81         2.52     3.47    0.29
  w/o DMTCP     31.38    25.52   48.38         2.44     3.34    0.27

Table IX: Nbench2 benchmark program on virtual machines. (Memory allocated in each case is 1024 MB. Higher index numbers represent higher performance.)

Table IX shows that DMTCP has little impact on performance for a VM running CPU-intensive or memory-intensive loads. As expected, the performance of KVM/QEMU is much higher than that of user-space QEMU, regardless of whether DMTCP is used.

Influence of Memory Footprint: Table X analyzes the influence of the VM memory footprint on checkpoint-restart in the default mode of DMTCP (gzip compression) for an idle virtual machine. For larger sizes (guest VMs with 512 MB to 1024 MB), the checkpoint times grow proportionally to the size of the allocated memory. Below these sizes, other factors dominate. Restart times do not change appreciably at the higher memory sizes.

V. RELATED WORK

Virtual machines support snapshots, a form of checkpointing built into the virtual machine. Examples include Xen [11] and QEMU [3]. Xen has offered checkpointing (snapshots) at least since 2006 [12]. QEMU supports a "savevm" command to create a snapshot, both with and without KVM. Live checkpointing for KVM has been implemented using an additional checkpoint thread [13]. The CEVM system [14] uses a combination of KVM/QEMU's live migration and snapshotting facilities to provide a standalone high-availability system. Similarly to CEVM, both Remus [15] and VM-µCheckpoint [16] offer high-frequency checkpointing of guest VMs on Xen.

  Allocated   Free        Lguest                       KVM/QEMU                     QEMU (user-space)
  Mem. (MB)   Mem. (MB)   Ckpt (s)  Restart (s)  Image Ckpt (s)  Restart (s)  Image Ckpt (s)  Restart (s)  Image
   128          2.5       2.29      1.26         30 MB  3.95     1.31         44 MB  4.34     1.69          59 MB
   256          4.2       3.17      1.38         33 MB  6.42     2.35         89 MB  7.71     3.02         109 MB
   512        184         5.39      2.42         35 MB  9.89     3.28        129 MB 11.87     4.43         170 MB
   768        441         6.82      3.01         38 MB  9.21     3.31        130 MB 14.04     5.05         194 MB
  1024        700         8.34      2.99         37 MB 10.03     3.13        122 MB 16.50     5.47         208 MB

Table X: Checkpoint-restart times for idle virtual machines. The checkpoint times include the times for compressing the memory image and writing the contents to the disk.

They employ Xen's live migration and dirty-page tracking facilities for incremental state snapshots. An earlier technical report provides additional details on the use of a DMTCP plugin in checkpointing a single virtual machine [17].

The Emulab system has demonstrated checkpointing of distributed systems through the use of virtual machines [18]. They did so using Xen and a guest virtual machine that ran a modified Linux kernel. The modified Linux kernel logs packets and replays them on restart. In addition, Emulab uses "delay nodes" (additional virtual machines) sitting between the user's virtual machines, in order to throttle network bandwidth to an acceptable level. In contrast, the current approach does not incur the run-time overhead of delay nodes, and supports any guest operating system, not just a customized Linux kernel. Finally, Emulab operates over the Xen hypervisor, while the current approach employs hosted virtual machines.

Checkpointing of distributed computations is primarily handled by one of two mechanisms today: checkpoint-restart services for MPI; and transparent checkpointing of arbitrary distributed computations. MPI implementations of checkpoint-restart typically operate by first stopping all MPI messages [19], [20], [21]. When it can be detected that there are no MPI messages in transit, a single-host checkpointing package is then employed. Often that single-host package is the kernel-based BLCR [22] package. Open MPI supports the option of using either MTCP (the single-process component of DMTCP) or BLCR. In addition to BLCR, two other commonly used packages for single-host checkpointing are CryoPid2 [23] and OpenVZ [24] (based on CRIU [25]).

DMTCP [6] was the first transparent user-space checkpoint-restart package for distributed computations, and remains the most widely used example of this. Further, unlike the MPI approach, DMTCP permits network messages to be in transit when the checkpoint occurs.

For the support of snapshots, one requires a copy-on-write filesystem. A common current choice is QCOW2 [26], which supports the creation of incremental snapshots. Another recent choice is BlobSeer [27], as used in [28, Section 3.3]. That choice has the advantage of exposing the raw checkpoint image file to the host operating system or hypervisor. The work described here uses Btrfs [5]. Like BlobSeer, Btrfs exposes the raw checkpoint image to the host, making it compatible with the use of DMTCP from outside both the VM and the VM kernel driver.

VI. CONCLUSION

A mechanism for checkpointing a network of virtual machines has been presented. It uses the plugin architecture of the DMTCP checkpoint-restart package, operating on top of the KVM/QEMU virtual machine. The implementation requires a 400-line KVM-specific plugin, as well as a 200-line plugin to adapt Linux's TUN/TAP so as to allow DMTCP to "drain the network" prior to checkpoint. The plugin mechanism has the potential to be easily adapted to other virtual machines. The integration of the Btrfs copy-on-write filesystem with nested copies of KVM/QEMU was used for fast, incremental snapshots of a network of virtual machines.

ACKNOWLEDGMENT

The authors acknowledge Kapil Arya for helpful comments on this paper, and also for advice on creating the DMTCP plugins. They also thank Larry Owen, Anthony Skjellum, and the University of Alabama at Birmingham (under NSF grant CNS-1337747) for providing a cluster with KVM and TUN/TAP.

REFERENCES

[1] KVM team, "KVM — QEMU," http://wiki.qemu.org/KVM; see also http://www.linux-kvm.org/page/Main_Page. Accessed Nov. 18, 2012.

[2] R. Russell, "Lguest: The simple x86 hypervisor," http://lguest.ozlabs.org/. Accessed Nov. 18, 2012.

[3] QEMU team, "QEMU," http://wiki.qemu.org/Main_Page. Accessed Nov. 18, 2012.

[4] U. F. Mayer, "Linux/Unix nbench," http://www.tux.org/~mayer/linux/bmark.html. Retrieved Dec. 4, 2012.

[5] O. Rodeh, J. Bacik, and C. Mason, "BTRFS: The Linux B-tree filesystem," IBM Research Report RJ10501 (ALM1207-004), Tech. Rep., July 2012. http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

[6] J. Ansel, K. Arya, and G. Cooperman, "DMTCP: Transparent checkpointing for cluster computations and the desktop," in 23rd IEEE Int. Symp. on Parallel and Distributed Processing (IPDPS-09), 2009, pp. 1–12.

[7] DMTCP team, "DMTCP: Distributed multithreaded checkpointing," http://dmtcp.sourceforge.net. Accessed Nov. 18, 2012.

[8] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, and D. Takahashi, "The HPC Challenge (HPCC) benchmark suite," in Proc. of the 2006 ACM/IEEE Conf. on Supercomputing (SC-06). New York, NY, USA: ACM, 2006. http://doi.acm.org/10.1145/1188455.1188677

[9] F. Pérez and B. E. Granger, "IPython: A system for interactive scientific computing," Comput. Sci. Eng., vol. 9, no. 3, pp. 21–29, May 2007. http://ipython.org

[10] L. Tierney, A. J. Rossini, and N. Li, "Snow: A parallel computing framework for the R system," Int. J. Parallel Program., vol. 37, no. 1, pp. 78–90, Feb. 2009. http://dx.doi.org/10.1007/s10766-008-0077-2

[11] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proc. of 19th ACM Symposium on Operating Systems Principles (SOSP-03). New York, NY, USA: ACM, 2003, pp. 164–177.

[12] G. Vallée, T. Naughton, H. Ong, and S. L. Scott, "Checkpoint/restart of virtual machines based on Xen," in HAPCW'06: High Availability and Performance Computing Workshop. Santa Fe, New Mexico, USA: held in conjunction with LACSI 2006, Oct. 2006.

[13] V. Siripoonya and K. Chanchio, "Thread-based live checkpointing of virtual machines," in 10th IEEE Int. Symp. on Network Computing and Applications, 2011.

[14] K. Chanchio, C. Leangsuksun, H. Ong, V. Ratanasamoot, and A. Shafi, "An efficient virtual machine checkpointing mechanism for hypervisor-based HPC systems," in High Availability and Performance Computing Workshop (HAPCW), 2008.

[15] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High availability via asynchronous virtual machine replication," in Proc. of the 5th USENIX Symp. on Networked Systems Design and Implementation (NSDI-08). Berkeley, CA, USA: USENIX Association, 2008, pp. 161–174. http://dl.acm.org/citation.cfm?id=1387589.1387601

[16] L. Wang, Z. Kalbarczyk, R. Iyer, and A. Iyengar, "Checkpointing virtual machines against transient errors," in 2010 IEEE 16th International On-Line Testing Symposium (IOLTS), 2010, pp. 97–102.

[17] R. Garg, K. Sodha, and G. Cooperman, "A generic checkpoint-restart mechanism for virtual machines," Tech. Rep., 2012. http://arxiv.org/abs/1212.1787v1

[18] A. Burtsev, P. Radhakrishnan, M. Hibler, and J. Lepreau, "Transparent checkpoints of closed distributed systems in Emulab," in Proc. of 4th ACM European Conf. on Computer Systems. ACM, 2009, pp. 173–186.

[19] J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in Proc. of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) / 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems. IEEE Computer Society, March 2007.

[20] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in ACM/IEEE 2002 Conference on Supercomputing. IEEE Press, 2002.

[21] S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479–493, 2005.

[22] P. Hargrove and J. Duell, "Berkeley Lab checkpoint/restart (BLCR) for Linux clusters," J. of Physics: Conference Series, vol. 46, pp. 494–499, Sep. 2006.

[23] M. O'Neill, "CryoPid2," http://sourceforge.net/projects/cryopid2.

[24] OpenVZ team, "OpenVZ," http://wiki.openvz.org/.

[25] CRIU team, "CRIU," http://criu.org/.

[26] M. McLoughlin, "The QCOW2 image format," http://people.gnome.org/~markmc/qcow-image-format.html, 2008.

[27] B. Nicolae, G. Antoniu, L. Bougé, D. Moise, and A. Carpen-Amarie, "BlobSeer: Next generation data management for large scale infrastructures," Journal of Parallel and Distributed Computing, vol. 71, no. 2, pp. 168–184, Feb. 2011. http://hal.inria.fr/inria-00511414

[28] B. Nicolae and F. Cappello, "BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots," in Proc. of 2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC-11). ACM, 2011, pp. 1–12.
