Post-Copy Live Migration of Virtual Machines

Michael R. Hines, Umesh Deshpande, and Kartik Gopalan
Computer Science, Binghamton University (SUNY)
{mhines,udeshpa1,kartik}@cs.binghamton.edu

ABSTRACT

We present the design, implementation, and evaluation of post-copy based live migration for virtual machines (VMs) across a Gigabit LAN. Post-copy migration defers the transfer of a VM's memory contents until after its processor state has been sent to the target host. This deferral is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations followed by a final transfer of the processor state. The post-copy strategy can provide a "win-win" by reducing total migration time while maintaining the liveness of the VM during migration. We compare post-copy extensively against the traditional pre-copy approach on the Xen hypervisor. Using a range of VM workloads we show that post-copy improves several metrics including pages transferred, total migration time, and network overhead. We facilitate the use of post-copy with adaptive prepaging techniques to minimize the number of page faults across the network. We propose different prepaging strategies and quantitatively compare their effectiveness in reducing network-bound page faults. Finally, we eliminate the transfer of free memory pages in both pre-copy and post-copy through a dynamic self-ballooning (DSB) mechanism. DSB periodically reclaims free pages from a VM and significantly speeds up migration with negligible performance impact on the VM workload.

Categories and Subject Descriptors: D.4 [Operating Systems]

General Terms: Experimentation, Performance

Keywords: Virtual Machines, Operating Systems, Process Migration, Post-Copy, Xen

1. INTRODUCTION

This paper addresses the problem of optimizing the live migration of system virtual machines (VMs). Live migration is a key selling point for state-of-the-art virtualization technologies. It allows administrators to consolidate system load, perform maintenance, and flexibly reallocate cluster-wide resources on the fly. We focus on VM migration within a cluster environment where physical nodes are interconnected via a high-speed LAN and also employ a network-accessible storage system. State-of-the-art live migration techniques [19, 3] use the pre-copy approach, which works as follows. The bulk of the VM's memory state is migrated to a target node even as the VM continues to execute at a source node. If a transmitted page is dirtied, it is re-sent to the target in the next round. This iterative copying of dirtied pages continues until either a small, writable working set (WWS) has been identified, or a preset number of iterations is reached, whichever comes first. This constitutes the end of the memory transfer phase and the beginning of service downtime. The VM is then suspended and its processor state plus any remaining dirty pages are sent to the target node, where the VM is restarted.

Pre-copy's overriding goal is to keep downtime small by minimizing the amount of VM state that needs to be transferred during downtime. Pre-copy will cap the number of copying iterations to a preset limit, since the WWS is not guaranteed to converge across successive iterations. On the other hand, if the iterations are terminated too early, then the larger WWS will significantly increase service downtime. Pre-copy minimizes two metrics particularly well – VM downtime and application degradation – when the VM is executing a largely read-intensive workload. However, even moderately write-intensive workloads can reduce pre-copy's effectiveness during migration, because pages that are repeatedly dirtied may have to be transmitted multiple times.

In this paper, we propose and evaluate the post-copy strategy for live VM migration, previously studied only in the context of process migration. At a high level, post-copy migration defers the memory transfer phase until after the VM's CPU state has already been transferred to the target and resumed there. Post-copy first transmits all processor state to the target, starts the VM at the target, and then actively pushes the VM's memory pages from source to target. Concurrently, any memory pages that are faulted on by the VM at the target, and not yet pushed, are demand-paged over the network from the source. Post-copy thus ensures that each memory page is transferred at most once, avoiding the duplicate transmission overhead of pre-copy.

The effectiveness of post-copy depends on the ability to minimize the number of network-bound page faults (or network faults), by pushing pages from the source before they are faulted upon by the VM at the target. To reduce network faults, we supplement the active push component of post-copy with adaptive prepaging. Prepaging is a term borrowed from earlier literature [22, 32] on optimizing memory-constrained disk-based paging systems. It traditionally refers to a more proactive form of pre-fetching from storage devices (such as hard disks) in which the memory subsystem tries to hide the latency of high-locality page faults by intelligently sequencing the pre-fetched pages. Modern virtual memory subsystems do not typically employ prepaging, due to increasing DRAM capacities. Although post-copy does not deal with disk-based paging, the prepaging algorithms themselves can still play a helpful role in reducing the number of network faults in post-copy. Prepaging adapts the sequence of actively pushed pages by using network faults as hints to predict the VM's page access locality at the target, actively pushing the pages in the neighborhood of a network fault before they are accessed by the VM. We propose and compare a number of prepaging strategies for post-copy, which we call bubbling, that reduce the number of network faults to varying degrees.

Additionally, we identified a deficiency in both pre-copy and post-copy migration due to which free pages in the VM are also transmitted during migration, increasing the total migration time. To avoid transmitting the free pages, we develop a "dynamic self-ballooning" (DSB) mechanism. Ballooning is an existing technique that allows a guest kernel to reduce its memory footprint by releasing its free memory pages back to the hypervisor. DSB automates the ballooning mechanism so that it can trigger periodically (say, every 5 seconds) without degrading application performance. Our DSB implementation reacts directly to kernel memory allocation requests without the need for guest kernel modifications. It neither requires external introspection by a co-located VM nor excessive communication with the hypervisor. We show that DSB significantly reduces total migration time by eliminating the transfer of free memory pages in both pre-copy and post-copy.

The original pre-copy algorithm has advantages of its own. It can be implemented in a relatively self-contained external migration daemon that isolates most of the copying complexity to a single process at each node. Further, pre-copy also provides a clean way to abort the migration should the target node ever crash during migration, because the VM is still running at the source. (Source node failure is fatal to both migration schemes.) Although our current post-copy implementation cannot recover from failure of the target node during migration, we discuss approaches in Section 3.4 by which post-copy could potentially provide the same level of reliability as pre-copy.

We designed and implemented a prototype of post-copy live VM migration in the Xen VM environment. Through extensive evaluations, we demonstrate situations in which post-copy can significantly improve performance in terms of total migration time and pages transferred. We note that post-copy and pre-copy complement each other in the toolbox of VM migration techniques available to a cluster administrator. Depending upon the VM workload type and the performance goals of migration, an administrator has the flexibility to choose either technique. For VMs with read-intensive workloads, pre-copy would be the better approach, whereas for large-memory or write-intensive workloads, post-copy would be better suited. Our main contribution is in demonstrating that a post-copy based approach is practical for live VM migration and in evaluating its merits and drawbacks against the pre-copy approach.

(1) A shorter version of this paper appeared in the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), March 2009 [11]. The additional contributions of this paper are new prepaging strategies including dual-direction and multi-pivot bubbling (Section 3.2), proactive LRU ordering of pages (Section 4.3), and their evaluation (Section 5.4).

2. RELATED WORK

Process Migration: The post-copy technique has been variously studied in the process migration literature: first implemented as "Freeze Free" using a file server [26], then evaluated via simulations [25], and later via an actual implementation [20]. There was also a recent implementation of post-copy process migration under openMosix [12]. In contrast, our contribution is to develop a viable post-copy technique for the live migration of virtual machines. Process migration techniques in general have been extensively researched, and an excellent survey can be found in [17]. Several distributed computing projects incorporate process migration [31, 24, 18, 30, 13, 6]. However, these systems have not gained widespread acceptance, primarily because of portability and residual dependency limitations. In contrast, VM migration operates on whole operating systems and is naturally free of these problems.

Prepaging: Prepaging is a technique for hiding the latency of page faults (and, in general, of I/O accesses on the critical execution path) by predicting the future working set [5] and loading the required pages before they are accessed. Prepaging is also known as adaptive prefetching or adaptive remote paging. It has been studied extensively [22, 33, 34, 32] in the context of disk-based storage systems, since disk I/O accesses in the critical application execution path can be highly expensive. Traditional prepaging algorithms use reactive and history-based approaches to predict and prefetch the working set of the application. Our system employs prepaging, not in the context of disk prefetching, but for the limited duration of live VM migration, to avoid the latency of network page faults from target to source. Our implementation employs a reactive approach that uses any network faults as hints about the VM's working set, with additional optimizations described in Section 3.1.

Live VM Migration: Pre-copy is the predominant approach for live VM migration. Existing systems include hypervisor-based approaches from VMware [19], Xen [3], and KVM [14], OS-level approaches that do not use hypervisors from OpenVZ [21], as well as wide-area migration [2]. Self-migration of operating systems (which has much in common with process migration) was implemented in [9], building upon prior work [8] atop the L4 Linux microkernel. All of the above systems currently use pre-copy based migration and can potentially benefit from the approach in this paper. The closest work to our technique is SnowFlock [15]. This work sets up impromptu clusters to support highly parallel computation tasks across VMs by cloning the source VM on the fly. This is optimized by actively pushing cloned memory via multicast from the source VM. They do not target VM migration in particular, nor do they present a comprehensive comparison against (or optimize upon) the original pre-copy approach.

Non-Live VM Migration: There are several non-live approaches to VM migration. Schmidt [29] proposed using capsules, which are groups of related processes along with their IPC/network state, as migration units. Similarly, Zap [23] uses process groups (pods) along with their kernel state as migration units. The Denali project [37, 36] addressed migration of checkpointed VMs. Work in [27] addressed user mobility and system administration by encapsulating the computing environment as capsules to be transferred between distinct hosts. Internet Suspend/Resume [28] focuses on saving/restoring computing state on anonymous hardware. In all of the above systems, VM execution is suspended and applications do not make progress.

Dynamic Self-Ballooning (DSB): Ballooning refers to artificially requesting memory within a guest kernel and releasing that memory back to the hypervisor. Ballooning is used widely for the purpose of VM memory resizing by both VMware [35] and Xen [1], and relates to self-paging in Nemesis [7]. However, it is not clear how current ballooning mechanisms interact, if at all, with live VM migration techniques. For instance, while Xen is capable of simple one-time ballooning during migration and at system boot time, there is no explicit use of dynamic ballooning to reduce the memory footprint before live migration. Additionally, self-ballooning has recently been committed into the Xen source tree [16] to enable a guest kernel to dynamically return free memory to the hypervisor without explicit human intervention. VMware ESX Server [35] includes dynamic ballooning and an idle memory tax, but the focus is not on reducing the VM footprint before migration. Our DSB mechanism is similar in spirit to the above dynamic ballooning approaches. However, to the best of our knowledge, DSB has not been exploited systematically to date for improving the performance of live migration. Our work uses DSB to improve the migration performance of both the pre-copy and post-copy approaches with minimal runtime overhead.

3. DESIGN

In this section we present the design of post-copy live VM migration. The performance of any live VM migration strategy can be gauged by the following metrics.

1. Preparation Time: This is the time between initiating migration and transferring the VM's processor state to the target node, during which the VM continues to execute and dirty its memory. For pre-copy, this time includes the entire iterative memory copying phase, whereas it is negligible for post-copy.

2. Downtime: This is the time during which the migrating VM's execution is stopped. At the minimum this includes the transfer of processor state. For pre-copy, this transfer also includes any remaining dirty pages. For post-copy, this includes other minimal execution state, if any, needed by the VM to start at the target.

3. Resume Time: This is the time between resuming the VM's execution at the target and the end of migration altogether, at which point all dependencies on the source must be eliminated. For pre-copy, one needs only to re-schedule the target VM and destroy the source copy. On the other hand, the majority of our post-copy approach operates in this period.

4. Pages Transferred: This is the total count of memory pages transferred, including duplicates, across all of the above time periods. Pre-copy transfers most of its pages during preparation time, whereas post-copy transfers most during resume time.

5. Total Migration Time: This is the sum of all the above times from start to finish. Total time is important because it affects the release of resources on both participating nodes as well as within the VMs on both nodes. Until the completion of migration, we cannot free the source VM's memory.

6. Application Degradation: This is the extent to which migration slows down the applications running in the VM. Pre-copy must track dirtied pages by trapping write accesses to each page, which significantly slows down write-intensive workloads. Similarly, post-copy needs to service network faults generated at the target, which also slows down VM workloads.

[Figure 1 (timeline diagram): (a) Pre-Copy Timeline – Preparation (live) spanning pre-copy rounds 1, 2, 3, ... N (dirty memory), followed by Downtime and Resume Time. (b) Post-Copy Timeline – Preparation (live), Downtime (non-pageable memory), and Resume Time (live) spanning post-copy prepaging.]
Figure 1: Timeline for Pre-copy vs. Post-copy.

3.1 Post-Copy and its Variants

In the basic approach, post-copy first suspends the migrating VM at the source node, copies minimal processor state to the target node, resumes the VM, and begins fetching memory pages over the network from the source. The manner in which pages are fetched gives rise to different variants of post-copy, each of which provides incremental improvements. We employ a combination of four techniques to fetch memory pages from the source: demand paging, active push, prepaging, and dynamic self-ballooning (DSB). Demand paging ensures that each page is sent over the network only once, unlike in pre-copy where repeatedly dirtied pages could be resent multiple times. Similarly, active push ensures that residual dependencies are removed from the source host as quickly as possible, compared to the non-deterministic copying iterations in pre-copy. Prepaging uses hints from the VM's page access patterns to reduce both the number of major network faults and the duration of the resume phase. DSB reduces the number of free pages transferred during migration, improving the performance of both pre-copy and post-copy. Figure 1 provides a high-level contrast of how the different stages of pre-copy and post-copy relate to each other. Table 1 contrasts the different migration techniques, each of which is described below in detail.

Post-Copy via Demand Paging: The demand paging variant of post-copy is the simplest and slowest option. Once the VM resumes at the target, its memory accesses result in page faults that can be serviced by requesting the referenced page over the network from the source node. However, servicing each fault will significantly slow down the VM due to the network's round-trip latency. Consequently, even though each page is transferred only once, this approach considerably lengthens the resume time and leaves long-term residual dependencies in the form of unfetched pages, possibly for an indeterminate duration. Thus, post-copy performance for this variant by itself would be unacceptable from the viewpoint of total migration time and application degradation.
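To make the contrast between the two transfer disciplines concrete, the following toy Python model (our own illustrative sketch, not the paper's implementation; the page count, dirtying rate, and round limit are made-up parameters) counts total pages sent by iterative pre-copy versus post-copy's at-most-once transfer:

```python
import random

def precopy_pages_sent(n_pages, dirty_rate, rounds):
    """Toy model of iterative pre-copy: every round re-sends pages dirtied
    since they were last copied; a final stop-and-copy sends the rest."""
    sent = 0
    dirty = set(range(n_pages))          # round 1 sends everything
    rng = random.Random(42)              # seeded for repeatability
    for _ in range(rounds):
        sent += len(dirty)
        # while copying, the VM dirties a random subset of pages again
        dirty = {p for p in range(n_pages) if rng.random() < dirty_rate}
    sent += len(dirty)                   # remaining dirty pages go during downtime
    return sent

def postcopy_pages_sent(n_pages):
    """Post-copy sends each page at most once (demand-paged or actively pushed)."""
    return n_pages
```

Under any non-trivial write rate, the pre-copy total exceeds the VM's page count, while the post-copy total cannot; this is the duplicate-transmission effect the metrics above are designed to expose.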
 1. let N := total # of pages in VM
 2. let page[N] := set of all VM pages
 3. let bitmap[N] := all zeroes
 4. let pivot := 0; bubble := 0        // place pivot at the start

 5. ActivePush (Guest VM)
 6.   while bubble < max(pivot, N - pivot) do
 7.     let left := max(0, pivot - bubble)
 8.     let right := min(MAX_PAGE_NUM - 1, pivot + bubble)
 9.     if bitmap[left] == 0 then
10.       set bitmap[left] := 1
11.       queue page[left] for transmission
12.     if bitmap[right] == 0 then
13.       set bitmap[right] := 1
14.       queue page[right] for transmission
15.     bubble++

16. PageFault (Guest-page X)
17.   if bitmap[X] == 0 then
18.     set bitmap[X] := 1
19.     transmit page[X] immediately
20.   set pivot := X                    // shift prepaging pivot
21.   set bubble := 1                   // new prepaging window

Figure 2: Pseudo-code for the Bubbling algorithm with a single pivot. Synchronization and locking code are omitted for clarity.

Method                 Preparation          Downtime               Resume
1. Pre-copy Only       Iterative mem txfer  CPU + dirty mem txfer  Reschedule VM
2. Demand-Paging       Prep time (if any)   CPU + minimal state    Net. page faults only
3. Pushing + Demand    Prep time (if any)   CPU + minimal state    Active push + page faults
4. Prepaging + Demand  Prep time (if any)   CPU + minimal state    Bubbling + page faults
5. Hybrid (all)        Single copy round    CPU + minimal state    Bubbling + dirty faults

Table 1: Design choices for live VM migration, in the order of their incremental improvements. Method 4 combines methods 2 and 3 with the use of prepaging. Method 5 combines all of 1 through 4, with pre-copy only performing a single copy round.

Post-Copy via Active Pushing: One way to reduce the duration of residual dependencies on the source node is to proactively "push" the VM's pages from the source to the target even as the VM continues executing at the target. Any major faults incurred by the VM can be serviced concurrently over the network via demand paging. Active push avoids transferring pages that have already been faulted in by the target VM.
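The single-pivot bubbling pseudo-code of Figure 2 translates almost line-for-line into an executable sketch. The following toy Python model is our own illustration (the class and method names are invented, and real transmission, locking, and interrupt context are elided):

```python
class SinglePivotBubble:
    """Toy model of Figure 2: active push expands a bubble around a pivot;
    a network fault transmits the page at once and recenters the bubble."""

    def __init__(self, n_pages):
        self.n = n_pages
        self.sent = [False] * n_pages    # bitmap[N]
        self.order = []                  # pages transmitted, in order
        self.pivot = 0                   # place pivot at the start
        self.bubble = 0

    def _send(self, p):
        if not self.sent[p]:             # each page goes over the wire once
            self.sent[p] = True
            self.order.append(p)

    def push_step(self):
        """One iteration of the ActivePush loop (lines 6-15)."""
        if self.bubble >= max(self.pivot, self.n - self.pivot):
            return False                 # bubble has covered the device
        self._send(max(0, self.pivot - self.bubble))           # backward edge
        self._send(min(self.n - 1, self.pivot + self.bubble))  # forward edge
        self.bubble += 1
        return True

    def page_fault(self, x):
        """Network fault on page x (lines 16-21)."""
        self._send(x)                    # transmit immediately
        self.pivot = x                   # shift prepaging pivot
        self.bubble = 1                  # new prepaging window

bub = SinglePivotBubble(8)
for _ in range(3):
    bub.push_step()                      # pushes pages 0, 1, 2 around pivot 0
bub.page_fault(5)                        # fault recenters the bubble at page 5
while bub.push_step():
    pass                                 # drain the remaining pages
```

In this run the transmission order adapts to the fault: pages near page 5 stream out before the rest, which is exactly the locality-exploiting behavior the text describes.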
Thus, each page is transferred only once, either by demand paging or by an active push.

Post-Copy via Prepaging: The goal of post-copy via prepaging is to anticipate the occurrence of major faults in advance and adapt the page pushing sequence to better reflect the VM's memory access pattern. While it is impossible to predict the VM's exact faulting behavior, our approach works by using the faulting addresses as hints to estimate the spatial locality of the VM's memory access pattern. The prepaging component then shifts the transmission window of the pages to be pushed such that the current page fault location falls within the window. This increases the probability that pushed pages would be the ones accessed by the VM in the near future, reducing the number of major faults. Various prepaging strategies are described in Section 3.2. The effectiveness of prepaging is measured by the percentage of the VM's page faults at the target that require an explicit page request to be sent over the network to the source node – also called network page faults. The smaller the percentage of network page faults, the better the prepaging algorithm.

Hybrid Live Migration: The hybrid approach was first described in [20] for process migration. It works by doing a single pre-copy round in the preparation phase of the migration. During this time, the VM continues running at the source while all its memory pages are copied to the target host. After just one iteration, the VM is suspended and its processor state and dirty non-pageable pages are copied to the target. Subsequently, the VM is resumed at the target and the post-copy mechanism described above kicks in, pushing the remaining dirty pages from the source. As with pre-copy, this scheme can perform well for read-intensive workloads. Yet it also provides deterministic total migration time for write-intensive workloads, as with post-copy. This hybrid approach is currently being implemented and is not covered within the scope of this paper. The rest of this paper describes the design and implementation of post-copy via prepaging.

3.2 Prepaging Strategy

Prepaging refers to actively pushing the VM's pages from the source to the target. The goal is to make pages available at the target before they are faulted on by the running VM. The challenge in designing an effective prepaging strategy is to accurately predict the pages that might be accessed by the VM in the near future, and to push those pages before the VM faults upon them. Below we describe different design options for prepaging strategies.

(A) Bubbling with a Single Pivot: Figure 2 lists the pseudo-code for the two components of bubbling with a single pivot – active push (lines 5–15), which executes in a kernel thread, and page fault servicing (lines 16–21), which executes in the interrupt context whenever a page fault occurs. Figure 3(a) illustrates this algorithm graphically. The VM's pages at the source are kept in an in-memory pseudo-paging device, which is similar to a traditional swap device except that it resides completely in memory (see Section 4 for details). The active push component starts from a pivot page in the pseudo-paging device and transmits symmetrically located pages around that pivot in each iteration. We refer to this algorithm as "bubbling", since it is akin to a bubble that grows around the pivot as its center. Even if one edge of the bubble reaches the boundary of the pseudo-paging device (0 or MAX), the other edge continues expanding in the opposite direction. To start with, the pivot is initialized to the first page in the in-memory pseudo-paging device, which means that initially the bubble expands only in the forward direction. Subsequently, whenever a network page fault occurs, the fault servicing component shifts the pivot to the location of the new fault and starts a new bubble around this new location. In this manner, the location of the pivot adapts to new network faults in order to exploit the spatial locality of reference. Pages that have already been transmitted (as recorded in a bitmap) are skipped over by the edge of the bubble. Network faults that arrive at the source for a page that is in flight (or has just been pushed) to the target are ignored to avoid duplicate page transmissions.

[Figure 3 (diagram): (a) a single bubble between 0 and MAX, with backward and forward edges expanding around one pivot; (b) multiple bubbles expanding around a pivot array P1, P2, P3, with edges stopped where adjacent bubbles meet.]
Figure 3: Prepaging strategies: (a) Bubbling with single pivot and (b) Bubbling with multiple pivots. Each pivot represents the location of a network fault on the in-memory pseudo-paging device. Pages around the pivot are actively pushed to the target.

(B) Bubbling with Multiple Pivots: Consider the situation where a VM has multiple processes executing concurrently. Here, a newly migrated VM would fault on pages at multiple locations in the pseudo-paging device. Consequently, a single pivot would be insufficient to capture the locality of reference across multiple processes in the VM. To address this situation, we extend the bubbling algorithm described above to operate on multiple pivots. Figure 3(b) illustrates this algorithm graphically. The algorithm is similar to the one outlined in Figure 2, except that the active push component pushes pages from multiple "bubbles" concurrently. (We omit the pseudo-code for space constraints, since it is a straightforward extension of the single-pivot case.) Each bubble expands around an independent pivot. Whenever a new network fault occurs, the faulting location is recorded as one more pivot and a new bubble is started around that location. To save on unnecessary page transmissions, if the edge of a bubble comes across a page that is already transmitted, that edge stops progressing in the corresponding direction. For example, the edges between bubbles around pivots P2 and P3 stop progressing when they meet, although the opposite edges continue making progress. In practice, it is sufficient to limit the number of concurrent bubbles to those around the k most recent pivots. When a new network fault arrives, we replace the oldest pivot in a pivot array with the new network fault location. For the workloads tested in our experiments in Section 5, we found that around k = 7 pivots provided the best performance.

(C) Direction of Bubble Expansion: We also wanted to examine whether the pattern in which the source node pushes the pages located around the pivot made a significant difference in performance. In other words, is it better to expand the bubble around a pivot in both directions, only the forward direction, or only the backward direction? To examine this we included an option of turning off bubble expansion in either the forward or the backward direction. Our results, detailed in Section 5.4, indicate that forward bubble expansion is essential, dual (bi-directional) bubble expansion performs slightly better in most cases, and backward-only bubble expansion is counter-productive. When expanding bubbles with multiple pivots in only a single direction (forward-only or backward-only), there is a possibility that the entire active push component could stall before transmitting all pages in the pseudo-paging device. This happens when all active bubble edges encounter already-sent pages at their edges and stop progressing. (A simple thought exercise can show that stalling of active push is not a problem for dual-direction multi-pivot bubbling.) While there are multiple ways to solve this problem, we chose the simple approach of designating the initial pivot (at the first page in the pseudo-paging device) as a sticky pivot. Unlike other pivots, this sticky pivot is never replaced by another pivot. Further, the bubble around the sticky pivot does not stall when it encounters an already transmitted page; rather, it skips such a page and keeps progressing, ensuring that the active push component never stalls.

3.3 Dynamic Self-Ballooning

Before migration begins, a VM will have an arbitrarily large number of free, unallocated pages. Transferring these pages would be a waste of network and CPU resources, and would increase the total migration time regardless of which migration algorithm we use. Further, if a free page is allocated by the VM during a post-copy migration and subsequently causes a major fault (due to a copy-on-write by the virtual memory subsystem), fetching that free page over the network, only to have it overwritten immediately, will result in unnecessary execution delay for the VM at the target. Thus, it is highly desirable to avoid transmitting free pages. Ballooning [35] is a minimally intrusive technique for resizing the memory allocation of a VM (called a reservation). Typical ballooning implementations involve a balloon driver in the guest kernel. The balloon driver can either reclaim pages considered least valuable by the OS and return them to the hypervisor (inflating the balloon), or request pages from the hypervisor and return them to the guest kernel (deflating the balloon). As of this writing, Xen-based ballooning is primarily used during the initialization of a new VM. If the hypervisor cannot reserve enough memory for a new VM, it steals unused memory from other VMs by inflating their balloons to accommodate the new VM. The system administrator can re-enlarge those diminished reservations at a later time should more memory become available. We extend Xen's ballooning mechanism to avoid transmitting free pages during both pre-copy and post-copy migration. The VM performs ballooning continuously over its execution lifetime – a technique we call Dynamic Self-Ballooning (DSB). DSB reduces the number of free pages without significantly impacting the normal execution of the VM, so that the VM can be migrated quickly with a minimal memory footprint. Our DSB design responds dynamically to VM memory pressure by inflating the balloon under low pressure and deflating it under increased pressure. For DSB to be both effective and minimally intrusive, we must choose an appropriate interval between consecutive invocations of ballooning such that DSB's activity does not interfere with the execution of the VM's applications. Secondly, DSB must ensure that the balloon can shrink when one or more applications become memory-intensive. We describe the specific implementation details of DSB in Section 4.2.

3.4 Reliability

Either the source or the destination node can fail during migration. In both pre-copy and post-copy, failure of the source node implies permanent loss of the VM itself. Failure of the destination node has different implications in the two cases. For pre-copy, failure of the destination node does not matter, because the source node still holds an entire up-to-date copy of the VM's memory and processor state, and the VM can be revived if necessary from this copy. However, when post-copy is used, the destination node has the more up-to-date copy of the virtual machine and the copy at the source happens to be stale, except for pages not yet modified at the destination. Thus, failure of the destination node during post-copy migration constitutes a critical failure of the VM. Although our current post-copy implementation does not address this drawback, we plan to address this problem by developing a mechanism to incrementally checkpoint the VM state from the destination node back at the source node. Our approach is as follows: while the active push of pages is in progress, we also propagate incremental changes to the memory and execution state of the VM at the destination back to the source node. We do not need to propagate the changes from the destination on a continuous basis, but only at discrete points, such as when interacting with a remote client over the network, or when committing an I/O operation to storage. This mechanism can provide a consistent backup image at the source node that one can fall back upon in case the destination node fails in the middle of post-copy migration. The performance of this mechanism would depend upon the additional overhead imposed by reverse network traffic from the target to the source and the frequency of incremental checkpointing. Recently, in a different context [4], similar mechanisms have been successfully used to provide high availability.

4. IMPLEMENTATION DETAILS

We implemented post-copy along with all of the optimizations described in Section 3 on Xen 3.2.1 and paravirtualized Linux 2.6.18.8. We first discuss the different ways of trapping page faults at the target within the Xen/Linux architecture and their trade-offs. Then we discuss our implementation of dynamic self-ballooning (DSB).

4.1 Page Fault Detection

There are three ways by which the demand-paging component of post-copy can trap page faults at the target VM.

(1) Shadow Paging: Shadow paging refers to a set of read-only page tables for each VM, maintained by the hypervisor, that map the VM's pseudo-physical pages to the physical page frames. Shadow paging can be used to trap access to non-existent pages at the target VM. For post-copy, each major fault at the target can be intercepted via these traps and redirected to the source.

(2) Page Tracking: The idea here is to mark all of the resident pages in the VM at the target as not present in their page-table entries (PTEs) during downtime. This has the effect of forcing a page fault exception when the VM accesses a page. After some third party services the fault, the PTE can be fixed up to reflect accurate mappings. PTEs carry a few unused bits that could potentially support this, but it requires significant changes to the guest kernel.

(3) Pseudo-paging: The idea here is to swap out all pageable memory in the VM to an in-memory pseudo-paging device within the guest kernel. This is done with minimal overhead and without any disk I/O. Since the source copy of each page now resides in the pseudo-paging device, a modified block driver is inserted to service faults on those pages by retrieving them over the network.

In the end, we chose to implement the pseudo-paging option because it was the quickest to implement. In fact, we attempted page tracking first, but switched to pseudo-paging due to implementation issues. Shadow paging can provide an ideal middle ground, being faster than pseudo-paging but slower (and cleaner) than page tracking. We intend to switch to shadow paging soon. Our prototype doesn't change much except to make a hook available for post-copy to use. Recently, SnowFlock [15] used this method in the context of parallel cloud computing clusters using Xen.

The pseudo-paging approach is illustrated in Figure 4. Page-fault detection and servicing is implemented through the use of two loadable kernel modules, one inside the migrating VM and one inside Domain 0 at the source node. These modules leverage our prior work on a system called MemX [10], which provides transparent remote memory access for both Xen VMs and native Linux systems at the kernel level. As soon as migration is initiated, the memory pages of the migrating VM at the source are swapped out to a pseudo-paging device exposed by the MemX module in the VM. This "swap" is performed without copies, using a lightweight MFN exchange mechanism (described below), after which the pages are mapped into Domain 0 at the source. CPU state and non-pageable memory are then transferred to the target node during downtime. Note that the pseudo-paging approach implies that a small amount of non-pageable memory, typically small in-kernel caches and pinned pages, must be transferred during downtime. This increases the downtime in our current post-copy implementation. The non-pageable memory overhead can be significantly reduced via the hybrid migration approach discussed earlier.

MFN Exchanges: Swapping the VM's pages to a pseudo-paging device can be accomplished in two ways: either by transferring ownership of the pages to a co-located VM (like Xen's Domain 0), or by remapping the pseudo-physical addresses of the pages within the VM itself with zero copying overhead. We chose the latter because of its lower overhead (fewer calls into the hypervisor). We accomplish the remapping by executing an MFN exchange (machine frame number exchange) within the VM. The VM's memory reservation is first doubled and all the running processes in the system are suspended. The guest kernel is then instructed to swap out each pageable frame (through the use of existing software suspend code in the Linux kernel). Each time a frame is paged out, we re-write both the VM's PFN (pseudo-physical frame number) to MFN mapping (called a physmap) and the frame's kernel-level PTE, such that we simulate an exchange of the frame's MFN with that of a free page frame. These exchanges are all batched before invoking the hypervisor. Once the exchanges are complete, we memory-map the exchanged MFNs into Domain 0. The data structure needed to bootstrap this memory mapping is created within the VM on demand, as a tree that is almost identical to a page table, whose root is sent to Domain 0 through the XenStore. Once the VM resumes at the target, demand paging begins for missing pages.
The MemX “client” module in the of the VM is suspended at the beginning of post-copy mi- VM at the target activates again to service major faults and gration, the memory reservation for the VM’s source copy perform prepaging by coordinating with the MemX “server” can be made to appear as a pseudo-paging device. During module in Domain 0 at the source. The two modules com- resume time, the VM then retrieves its “swapped”-out pages municate via a customized and lightweight remote memory through its normal page fault servicing mechanism. In order SOURCE VM TARGET VM
access protocol (RMAP), which operates directly above the network device.

Figure 4: Pseudo-Paging: Pages are swapped out to a pseudo-paging device within the source VM's memory by exchanging MFN identifiers. Domain 0 at the source maps the swapped pages to its memory with the help of the hypervisor. Prepaging then takes over after downtime.

4.2 Dynamic Self-Ballooning Implementation

The implementation of DSB has three components. (1) Inflate the balloon: A kernel-level DSB thread in the VM first allocates as much free memory as possible and hands those pages over to the hypervisor. (2) Detect memory pressure: Memory pressure indicates that some entity needs to access a page frame right away. In response, the DSB process must partially deflate the balloon, depending on the extent of the memory pressure. (3) Deflate the balloon: Deflation is the reverse of step 1. The DSB process re-populates its memory reservation with free pages from the hypervisor and then releases the list of free pages back to the guest kernel's free pool.

Detecting Memory Pressure: In order to detect memory pressure in a completely transparent manner, we employ a pair of existing mechanisms in the Linux kernel. First, we depend on the kernel's existing ability to perform physical memory overcommitment. Memory overcommitment within an individual OS allows the virtual memory subsystem to provide the illusion of infinite physical memory (as opposed to just infinite virtual memory, depending on the number of bits in the CPU architecture). Linux offers multiple operating modes of physical memory overcommitment, and these modes can be changed at runtime. By default, Linux disables this feature, which precludes application memory allocations in advance by returning an error. If overcommitment is enabled, however, the kernel will return success. Without this, we cannot detect memory pressure transparently.

Second, coupled with overcommitment, the kernel already provides a transparent mechanism to detect memory pressure: the kernel's filesystem API. Using a function called "set_shrinker()", one of the function pointers provided as a parameter acts as a callback to some memory-hungry portion of the kernel. This indicates to the virtual memory system that this function can be used to request the deallocation of a requisite amount of memory that it may have pinned, typically things like inode and directory entry caches. These callbacks are indirectly driven by the virtual memory system as a result of copy-on-write faults, satisfied on behalf of an application that has allocated a large amount of memory and is accessing it for the first time. DSB does not register a new filesystem, but rather registers a similar callback function for the same purpose (it is not necessary to register a new filesystem in order to register this callback). This works remarkably well and provides very precise feedback to the DSB process about memory pressure. Alternatively, one could manually scan /proc statistics to determine this information, but we found the filesystem API to be a more direct reflection of the decisions that the virtual memory system is actually making. Each callback contains a numeric value of exactly how many pages the DSB should release, which typically defaults to 128 pages at a time. When the callback returns, part of the return value indicates to the virtual memory system how much "pressure" is still available to be relieved. Filesystems typically return the sum totals of their caches, whereas the DSB process returns the size of the balloon itself.

Also, the DSB must periodically reclaim free pages that may have been released over time. The DSB process performs this sort of "garbage collection" by periodically waking up and re-inflating the balloon as much as possible. Currently, we inflate to 95% of all available free memory; inflating to 100% would trigger the out-of-memory killer, hence the 5% buffer. If memory pressure is detected during this time, the thread preempts any attempt at balloon inflation and maintains the size of the balloon for a backoff period of about 10 intervals. As we show later, an empirical analysis has shown that a ballooning interval of about 5 seconds is effective.

Lines of Code: Most of the post-copy implementation, about 7000 lines, resides within pluggable kernel modules. 4000 of those lines are part of the MemX system that is invoked during resume time; 3000 lines contribute to the prepaging component, the pushing component, and the DSB component combined. A 200-line patch is applied to the migration daemon to support ballooning, and a 300-line patch is applied to the guest kernel to initiate pseudo-paging. In the end, the system remains transparent to applications and approaches about 7500 lines. Neither the original pre-copy implementation nor the hypervisor is modified in any way.

4.3 Proactive LRU Ordering to Improve Reference Locality

During normal operation, the guest kernel maintains the age of each allocated page frame in its page cache. Linux, for example, maintains two linked lists in which pages are kept in Least Recently Used (LRU) order: one for active pages and one for inactive pages. A kernel daemon periodically ages and transfers pages between the two lists. The inactive list is subsequently used by the paging system to reclaim pages and write them to the swap device. As a result, the order in which pages are written to the swap

device reflects the historical locality of access by processes in the VM. Ideally, the active push component of post-copy could simply use this ordering of pages in the pseudo-paging device to predict the page access pattern in the migrated VM and push pages just in time to avoid network faults. However, Linux does not actively maintain the LRU ordering in these lists until a swap device is enabled. Since our pseudo-paging device is enabled just before migration, post-copy would not automatically see pages in the swap device in LRU order. To address this problem, we implemented a kernel thread which periodically scans and reorders the active and inactive lists in LRU order, without modifying the core kernel itself. In each scan, the thread examines the referenced bit of each page. Pages with their referenced bit set are moved to the most-recently-used end of the list and their referenced bit is reset. This mechanism supplements the kernel's existing aging support without requiring that a real paging device be turned on. Section 5.4 shows that such proactive LRU ordering plays a positive role in reducing network faults.

5. EVALUATION

In this section, we present a detailed evaluation of our post-copy implementation and compare it against Xen's pre-copy migration. Our test environment consists of 2.8 GHz multi-core Intel machines connected via a Gigabit Ethernet switch and having between 4 and 16 GB of memory. Both the VM in each experiment and Domain 0 are configured to use two virtual CPUs. Guest VM sizes range from 128 MB to 1024 MB; unless otherwise specified, the default VM size is 512 MB. In addition to the performance metrics mentioned in Section 3, we evaluate post-copy against an additional metric. Recall that post-copy is effective only when a large majority of the pages reach the target before they are faulted upon by the VM at the target, in which case they become minor page faults rather than major network page faults. Thus the fraction of major faults compared to minor page faults is another indication of the effectiveness of post-copy.

5.1 Stress Testing

We start with a stress test of both migration schemes, using a simple, highly sequential, memory-intensive C program. This program accepts a parameter to change the working set of memory accesses and a second parameter to control whether it performs memory reads or writes during the test. The experiment is performed in a 2048 MB VM with the working set ranging from 8 MB to 512 MB; the rest is simply free memory. We perform the experiment with the following test configurations:

1. Stop-and-copy Migration: This is a non-live migration which provides a baseline for comparing total migration time and number of pages transferred.
2. Read-intensive Pre-Copy with DSB: This configuration provides the best-case workload for pre-copy. The performance is expected to be roughly similar to pure stop-and-copy migration.
3. Write-intensive Pre-Copy with DSB: This configuration provides the worst-case workload for pre-copy.
4. Read-intensive Pre-Copy without DSB:
5. Write-intensive Pre-Copy without DSB: These two configurations test the default implementation of pre-copy in Xen.
6. Read-intensive Post-Copy with and without DSB:
7. Write-intensive Post-Copy with and without DSB: These four configurations stress our prepaging algorithm. Reads and writes are expected to perform almost identically. DSB is expected to minimize the transmission of free pages.

Figure 5: Comparison of total migration times.

Total Migration Time: Figure 5 shows the variation of total migration time with increasing working set size. Notice that both post-copy plots with DSB are at the bottom, surpassed only by read-intensive pre-copy with DSB. The read- and write-intensive tests of post-copy perform very similarly; thus our post-copy algorithm's performance is agnostic to the read- or write-intensive nature of the application workload. Furthermore, without DSB activated, total migration times are high for all migration schemes due to the unnecessary transmission of free pages.

Downtime: Figure 6 compares downtime as the working set size increases. As expected, read-intensive pre-copy gives the lowest downtime, whereas the downtime for write-intensive pre-copy increases as the size of the writable working set grows. For post-copy, recall that our choice of pseudo-paging for page fault detection (Section 4) increases the downtime, since all non-pageable memory pages are transmitted during downtime. With DSB, post-copy achieves a downtime that ranges from 600 milliseconds to just over one second. Without DSB, however, our post-copy implementation experiences a large downtime of around 20 seconds, because all free pages in the guest kernel are treated as non-pageable pages and transferred during downtime. Hence the use of DSB is essential to maintain a low downtime with our current implementation of post-copy. This limitation can be overcome by the use of shadow paging to track page faults.
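The behavior of the stress workload described in Section 5.1 is easy to reproduce. The following is a minimal Python sketch of what the C program does (the names and structure here are ours, not the authors' code): it sequentially touches every page of a configurable working set, in either read or write mode.

```python
PAGE = 4096  # x86 page size in bytes

def stress(working_set_mb, write, rounds=1):
    """Sequentially touch every page of a working set.
    write=True dirties each page; write=False only reads it.
    Returns the total number of page touches performed."""
    buf = bytearray(working_set_mb * 1024 * 1024)
    touches, sink = 0, 0
    for _ in range(rounds):
        for off in range(0, len(buf), PAGE):
            if write:
                buf[off] = (buf[off] + 1) & 0xFF  # dirty the page
            else:
                sink ^= buf[off]                  # read-only touch
            touches += 1
    return touches

# e.g., an 8 MB write-intensive working set:
# stress(8, write=True, rounds=100)
```

Under pre-copy, the write mode forces the working set to be re-sent on every iteration, while under post-copy each page crosses the network at most once; this asymmetry is exactly what the stress test isolates.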
Figure 6: Downtime comparison. (Log scale y-axis).

Pages Transferred and Page Faults: Figure 7 and Table 2 illustrate the utility of our prepaging algorithm in post-copy across increasingly large working set sizes. Figure 7 plots the total number of pages transferred. As expected, post-copy transfers far fewer pages than write-intensive pre-copy, and performs on par with read-intensive pre-copy and stop-and-copy. Without DSB, the number of pages transferred increases significantly for both pre-copy and post-copy. Table 2 compares the fraction of network and minor faults in post-copy. We see that prepaging reduces the fraction of network faults to 2-4%, compared to 9-15% for push-only. To be fair, the stress test is highly sequential in nature, and consequently prepaging predicts its behavior almost perfectly; the real applications in the next section do worse than this ideal case.

Working     Prepaging        Pushing
Set Size    Net    Minor     Net    Minor
8 MB        2%     98%       15%    85%
16 MB       4%     96%       13%    87%
32 MB       4%     96%       13%    87%
64 MB       3%     97%       10%    90%
128 MB      3%     97%        9%    91%
256 MB      3%     98%       10%    90%

Table 2: Percent of minor and network faults for pushing vs. prepaging.

Figure 7: Comparison of the number of pages transferred during a single migration.

5.2 Degradation, Bandwidth, and Ballooning

Next, we quantify the side effects of migration on a couple of sample applications. We want to answer the following questions: What kinds of slowdowns do VM workloads experience during pre-copy versus post-copy migration? What is the impact on the network bandwidth received by applications? And finally, what kind of balloon inflation interval should we choose to minimize the impact of DSB on running applications? For application degradation and the DSB interval, we use Linux kernel compilation. For bandwidth testing we use the NetPerf TCP benchmark.

Degradation Time: Figure 8 depicts a repeat of an interesting experiment from [19]. We initiate a kernel compile inside the VM and then migrate the VM repeatedly between two hosts, scripting the migrations to pause for 5 seconds each time. Although there is no exact way to quantify degradation time (due to factors such as context switching), this experiment provides an approximate measure. As far as memory is concerned, we observe that kernel compilation tends not to exhibit too many memory writes (once gcc forks and compiles, the OS page cache will only be used once more at the end, to link the kernel object files together). As a result, the experiment represents the best case for the original pre-copy approach, since there is not much repeated dirtying of pages. It is also a good worst-case test for our implementation of dynamic self-ballooning, due to the repeated fork-and-exit behavior of the kernel compile as each object file is created over time. (Interestingly enough, this experiment also gave us a headache, because it exposed the bugs in our code!)

Figure 8: Degradation time with kernel compile.

We were surprised to see how many additional seconds were added to the kernel compilation in Figure 8 just by executing back-to-back invocations of
pre-copy migration. Nevertheless, we observe that post-copy closely matches pre-copy in the amount of degradation. This is in line with the competitive performance of post-copy against read-intensive pre-copy in Figures 5 and 7. We suspect that a shadow-paging-based implementation of post-copy would perform much better, due to the significantly reduced downtime it would provide. Figure 9 shows the same experiment using NetPerf. A sustained, high-bandwidth stream of network traffic causes slightly more page dirtying than the compilation does. The setup involves placing the NetPerf sender inside the VM and the receiver on an external node on the same switch. Consequently, regardless of VM size, post-copy actually performs slightly better and reduces the degradation time experienced by NetPerf.

Figure 9: NetPerf run with back-to-back migrations.

Effect on Bandwidth: In the original paper [3], the Xen project proposed a solution called "adaptive rate limiting" to control the bandwidth overhead due to migration. However, this feature is not enabled in the currently released version of Xen; in fact, it is compiled out, without any runtime options or pre-processor directives. This could be because rate limiting increases the total migration time, or because it is difficult, if not impossible, to predict beforehand the bandwidth requirements of any single VM on the basis of which to guide adaptive rate limiting. We do not activate rate limiting for our post-copy implementation either, so as to normalize the comparison of the two techniques.

Figure 10: Post-copy impact on NetPerf bandwidth.

Figure 11: Pre-copy impact on NetPerf bandwidth.

With that in mind, Figures 10 and 11 show a visual representation of the reduction in bandwidth experienced by a high-throughput NetPerf session.
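Producing plots of this kind is mechanically simple: cumulative byte counts are sampled rapidly (for example, from the receiver's interface statistics) and differenced into per-interval throughput. A minimal sketch of that calculation (our own illustration, not the paper's tooling):

```python
def bandwidth_series(readings):
    """Turn cumulative (seconds, total_bytes) readings sampled during a
    migration into per-interval throughput samples in Mbit/s."""
    series = []
    for (t0, b0), (t1, b1) in zip(readings, readings[1:]):
        series.append((t1, (b1 - b0) * 8.0 / (t1 - t0) / 1e6))
    return series

# 125 MB received in one second is full Gigabit Ethernet line rate:
# bandwidth_series([(0, 0), (1.0, 125_000_000)]) -> [(1.0, 1000.0)]
```

A migration then shows up as a dip in this series, whose depth and duration reflect how much traffic the migration itself injects onto the link.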
We conduct this experiment by invoking VM migration in the middle of a NetPerf session and measuring bandwidth values rapidly throughout. The impact of migration can be seen in both figures as a sudden reduction in the observed bandwidth during migration. This reduction is more sustained, and greater, for pre-copy than for post-copy, due to the fact that the total number of pages transferred in pre-copy is much higher. This is exactly the bottom line that we were targeting for improvement.

Dynamic Ballooning Interval: Figure 12 shows how we chose the DSB interval, the interval at which the DSB process wakes up to reclaim available free memory. With the kernel compile as a test application, we execute the DSB process at intervals from 10 ms to 10 s. At each interval, we script the kernel compile to run multiple times and output the average completion time.

Figure 12: Application degradation is inversely proportional to the ballooning interval.

The difference in that number from the
Figure 13: Total pages transferred.

Figure 14: Network faults versus minor faults.

base case is the degradation time added to the application by the DSB process due to its CPU usage. As expected, the ballooning interval is inversely proportional to the application degradation: the more often you balloon, the more it affects the VM workload. The graph shows that an interval between 4 and 10 seconds is sufficient to frequently reclaim free pages without impacting application performance.

5.3 Application Scenarios

We re-visit the aforementioned performance metrics across

four applications: (a) SPECWeb 2005: This is our largest application, a well-known webserver benchmark involving at least two physical hosts. We place the system under test within the VM, while six separate client nodes bombard the VM with connections. (b) BitTorrent Client: Although this is not a typical server application, we chose it because it is a simple representative of a multi-peer distributed application. It is easy to initiate and does not immediately saturate a Gigabit Ethernet pipe; instead, it fills up the network pipe gradually, is slightly CPU-intensive, and involves a somewhat more complex mix of page dirtying and disk I/O than just a kernel compile. (c) Linux Kernel Compile: We consider this application again for consistency. (d) NetPerf: Once more, as in the previous experiments, the NetPerf sender is placed inside the VM. Using these applications, we evaluate the same four primary metrics that we covered in Section 5.1: downtime, total migration time, pages transferred, and page faults. Each figure for these applications represents one of the four metrics and contains results for a constant 512 MB virtual machine, in the form of a bar graph for both migration schemes across each application. Each data point is the average of 20 samples, and, just as before, the VM is configured to have two virtual CPUs. All these experiments have DSB enabled.

Figure 15: Total migration time.

Pages Transferred and Page Faults: The experiments in Figures 13 and 14 illustrate these results. For all of the applications except SPECWeb, post-copy reduces the total pages transferred by more than half. The most significant result we have seen so far is in Figure 14, where post-copy's prepaging algorithm is able to avoid 79% and 83% of the network page faults (which become minor faults) for the largest applications (SPECWeb, BitTorrent). For the smaller applications (Kernel, NetPerf), we still manage to save 41% and 43% of network page faults.

Figure 16: Downtime comparison: Post-copy downtime can improve with better page-fault detection.

Total Time and Downtime: Figure 15 shows that post-
copy reduces the total migration time for all applications when compared to pre-copy, in some cases by more than 50%. However, the downtime in Figure 16 is currently much higher for post-copy than for pre-copy. As we explained earlier, the relatively high downtime is due to our expedient choice of pseudo-paging for page-fault detection, which we plan to reduce through the use of shadow paging. Nevertheless, this tradeoff between total migration time and downtime may be acceptable where network overhead needs to be kept low and the entire migration needs to be completed quickly.

5.4 Comparison of Prepaging Strategies

This section compares the effectiveness of different prepaging strategies. The VM workload is a Quicksort application that sorts a randomly populated array of user-defined size. We vary the number of processes running Quicksort from 1 to 128, such that 512 MB of memory is collectively used among all processes. We migrate the VM in the middle of its workload execution and measure the number of network faults during migration; a smaller network fault count indicates better prepaging performance. We compare a number of prepaging combinations by varying the following factors: (a) whether or not some form of bubbling is used; (b) whether the bubbling occurs in the forward direction only or in both directions; (c) whether single or multiple pivots are used; and (d) whether the page cache is maintained in LRU order.

Figure 17: Comparison of prepaging strategies using multi-process Quicksort workloads.

Figure 17 shows the results. Each vertical bar represents an average over 20 experimental runs. The first observation is that bubbling, in any form, performs better than push-only prepaging. Secondly, sorting the page cache in LRU order performs better than the non-LRU cases, by improving the locality of reference of neighboring pages in the pseudo-paging device. Thirdly, dual-directional bubbling improves performance over forward-only bubbling in most cases and never performs significantly worse, which indicates that it is always preferable to use dual-directional bubbling. (The performance of reverse-only bubbling was found to be much worse than even push-only prepaging, hence its results are omitted.) Finally, dual multi-pivot bubbling is found to consistently improve performance over single-pivot bubbling, since it exploits locality of reference at multiple locations in the pseudo-paging device.

6. CONCLUSIONS

This paper presented the design and implementation of a post-copy technique for live migration of virtual machines. Post-copy is a combination of four key components: demand paging, active pushing, prepaging, and dynamic self-ballooning. We implemented and evaluated post-copy on a Xen- and Linux-based platform. Our evaluations show that post-copy significantly reduces the total migration time and the number of pages transferred compared to pre-copy. Further, the bubbling algorithm for prepaging is able to significantly reduce the number of network faults incurred during post-copy migration. Finally, dynamic self-ballooning improves the performance of both pre-copy and post-copy by eliminating the transmission of free pages during migration. In future work, we plan to investigate an alternative to pseudo-paging, namely shadow-paging-based page-fault detection. We are also investigating techniques to handle target node failure during post-copy migration, so that post-copy can provide at least the same level of reliability as pre-copy. Finally, we are implementing a hybrid pre/post-copy approach, where a single round of pre-copy precedes the CPU state transfer, followed by a post-copy of the remaining dirty pages from the source.

7. ACKNOWLEDGMENTS

We would like to thank AT&T Labs Research at Florham Park, New Jersey, for providing the fellowship funding to Michael Hines for this research. This work is also supported in part by the National Science Foundation through grants CNS-0845832, CNS-0509131, and CCF-0541096.

8. REFERENCES

[1] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. Xen and the art of virtualization. In Proc. of ACM SOSP (Oct. 2003).
[2] Bradford, R., Kotsovinos, E., Feldmann, A., and Schiöberg, H. Live wide-area migration of virtual machines including local persistent state. In Proc. of the International Conference on Virtual Execution Environments (2007), pp. 169-179.
[3] Clark, C., Fraser, K., Hand, S., Hansen, J., Jul, E., Limpach, C., Pratt, I., and Warfield, A. Live migration of virtual machines. In Networked Systems Design and Implementation (2005).
[4] Cully, B., Lefebvre, G., and Meyer, D. Remus: High availability via asynchronous virtual machine replication. In NSDI '08: Networked Systems Design and Implementation (2008).
[5] Denning, P. J. The working set model for program behavior. Communications of the ACM 11, 5 (1968), 323-333.
[6] Douglis, F. Transparent process migration in the Sprite operating system. Tech. rep., University of California at Berkeley, Berkeley, CA, USA, 1990.
[7] Hand, S. M. Self-paging in the Nemesis operating system. In OSDI '99, New Orleans, Louisiana, USA (1999), pp. 73-86.
[8] Hansen, J., and Henriksen, A. Nomadic operating systems. Master's thesis, Dept. of Computer Science, University of Copenhagen, Denmark, 2002.
[9] Hansen, J., and Jul, E. Self-migration of operating systems. In Proc. of ACM SIGOPS European Workshop, Leuven, Belgium (2004).
[10] Hines, M., and Gopalan, K. MemX: Supporting large memory applications in Xen virtual machines. In Second International Workshop on Virtualization Technology in Distributed Computing (VTDC07), Reno, Nevada (2007).
[11] Hines, M., and Gopalan, K. Post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning. In Proc. of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), Washington, DC (March 2009).
[12] Ho, R. S., Wang, C.-L., and Lau, F. C. Lightweight process migration and memory prefetching in OpenMosix. In Proc. of IPDPS (2008).
[13] Kerrighed. http://www.kerrighed.org.
[14] Kivity, A., Kamay, Y., and Laor, D. KVM: The Linux virtual machine monitor. In Proc. of Ottawa Linux Symposium (2007).
[15] Lagar-Cavilla, H. A., Whitney, J., Scannell, A., Rumble, S., Brudno, M., de Lara, E., and Satyanarayanan, M. Impromptu clusters for near-interactive cloud-based services. Tech. rep. CSRG-578, University of Toronto, June 2008.
[16] Magenheimer, D. Add self-ballooning to balloon driver. Discussion on Xen development mailing list and personal communication, April 2008.
[17] Milojicic, D., Douglis, F., Paindaveine, Y., Wheeler, R., and Zhou, S. Process migration survey. ACM Computing Surveys 32, 3 (Sep. 2000), 241-299.
[18] MOSIX. http://www.mosix.org.
[19] Nelson, M., Lim, B.-H., and Hutchins, G. Fast transparent migration for virtual machines. In USENIX Annual Technical Conference, Anaheim, CA (2005).
[20] Noack, M. Comparative evaluation of process migration algorithms. Master's thesis, Dresden University of Technology, Operating Systems Group, 2003.
[21] OpenVZ. Container-based virtualization for Linux. http://www.openvz.com/.
[22] Oppenheimer, G., and Weizer, N. Resource management for a medium scale time-sharing operating system. Commun. ACM 11, 5 (1968), 313-322.
[23] Osman, S., Subhraveti, D., Su, G., and Nieh, J. The design and implementation of Zap: A system for migrating computing environments. In Proc. of OSDI (2002), pp. 361-376.
[24] Plank, J., Beck, M., Kingsley, G., and Li, K. Libckpt: Transparent checkpointing under UNIX. In Proc. of USENIX Annual Technical Conference, New Orleans, Louisiana (1998).
[25] Richmond, M., and Hitchens, M. A new process migration algorithm. SIGOPS Oper. Syst. Rev. 31, 1 (1997), 31-42.
[26] Roush, E. T. Fast dynamic process migration. In Intl. Conference on Distributed Computing Systems (ICDCS) (1996), p. 637.
[27] Sapuntzakis, C., Chandra, R., and Pfaff, B. Optimizing the migration of virtual computers. In Proc. of OSDI (2002).
[28] Satyanarayanan, M., and Gilbert, B. Pervasive personal computing in an internet suspend/resume system. IEEE Internet Computing 11, 2 (2007), 16-25.
[29] Schmidt, B. K. Supporting Ubiquitous Computing with Stateless Consoles and Computation Caches. PhD thesis, Computer Science Dept., Stanford University, 2000.
[30] Stellner, G. CoCheck: Checkpointing and process migration for MPI. In IPPS '96, Washington, DC, USA (1996), pp. 526-531.
[31] Thain, D., Tannenbaum, T., and Livny, M. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience 17 (2005), 323-356.
[32] Trivedi, K. An analysis of prepaging. Journal of Computing 22 (1979), 191-210.
[33] Trivedi, K. On the paging performance of array algorithms. IEEE Transactions on Computers C-26, 10 (Oct. 1977), 938-947.
[34] Trivedi, K. Prepaging and applications to array algorithms. IEEE Transactions on Computers C-25, 9 (Sept. 1976), 915-921.
[35] Waldspurger, C. Memory resource management in VMware ESX Server. In Operating Systems Design and Implementation (OSDI '02), Boston, MA (Dec. 2002).
[36] Whitaker, A., Cox, R., and Shaw, M. Constructing services with interposable virtual hardware. In NSDI (2004).
[37] Whitaker, A., Shaw, M., and Gribble, S. Scale and performance in the Denali isolation kernel. In OSDI 2002, New York, NY, USA (2002), pp. 195-209.