TECHNIQUES FOR COLLECTIVE PHYSICAL MEMORY
UBIQUITY WITHIN NETWORKED CLUSTERS
OF VIRTUAL MACHINES
BY
MICHAEL R. HINES
B.S., Johns Hopkins University, 2003
M.S., Florida State University, 2005
DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2009

© Copyright by Michael R. Hines 2009
All Rights Reserved

Accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2009
July 31st, 2009
Dr. Kartik Gopalan, Department of Computer Science, Binghamton University
Prof. Kanad Ghose, Department of Computer Science, Binghamton University
Dr. Kenneth Chiu, Department of Computer Science, Binghamton University
Dr. Kobus van der Merwe, AT&T Labs Research, Florham Park, NJ.
ABSTRACT
This dissertation addresses the use of distributed memory to improve the performance of state-of-the-art virtual machines (VMs) in clusters with gigabit interconnects. Even with ever-increasing DRAM capacities, we observe a continued need to support applications that exhibit mostly memory-intensive execution patterns, such as databases, webservers, and scientific and grid applications. In this dissertation, we make four primary contributions. First, we survey the history of the solutions available for basic, transparent distributed memory support. Then, we document a bottom-up implementation and evaluation of a basic prototype whose goal is to move deeper into the kernel than previous application-level solutions. We choose a clean, transparent device interface capable of minimizing network latency and copying overheads. Second, we explore how recent work with VMs has brought back into question the memory management logic of the operating system. VM technology provides ease and transparency for imposing order on OS memory management (using techniques like full virtualization and para-virtualization). As such, we evaluate distributed memory in this context by optimizing our previous prototype at different places in the Xen virtualization architecture. Third, we leverage this work to explore alternative strategies for live VM migration. A key component that determines the success of migration techniques has been exactly how memory is transmitted, and when. More specifically, this involves fine-grained page-fault management either before a VM's CPU state is migrated (the current default) or afterwards. Thus, we design and evaluate the Post-Copy live VM migration scheme and compare it to the existing (Pre-Copy) migration scheme, realizing significant improvements. Finally, we promote the ubiquity of individual page frames as a cluster resource by integrating the use of distributed memory into the hypervisor (or virtual machine monitor). We design and implement CIVIC: a system that allows unmodified VMs to oversubscribe their DRAM beyond a given host's physical memory. We then complement this by implementing and evaluating network paging in the hypervisor for locally resident VMs. We evaluate the performance impact of CIVIC on various application workloads and show how CIVIC enables many possible VM extensions, such as better VM consolidation, multi-host caching, and closer coordination with VM migration.
ACKNOWLEDGEMENTS
First, I would like to thank a few organizations responsible for providing invaluable sources of funding, which allowed me to work through graduate school. The AT&T Labs Research Fellowship Program, in cooperation with Kobus van der Merwe in New Jersey, provided support for a full three years. The Clark fellowship program at SUNY Binghamton also provided a full year of funding. The departments of Computer Science at both Florida State and Binghamton made teaching assistantships available for a year. These deeds often go unsaid; without them I would not have been able to complete this degree. I would also like to thank the National Science Foundation and the Computing Innovation Fellows Project (cifellows.org). Through them, I will be continuing on an assistantship as a post-doctoral fellow for the next year.
My advisor deserves his own paragraph. Not many graduate students can say what I can: I have one of the greatest advisors on the planet. Six years ago, he took a chance on me and stood patiently through the entire process: through the transfers, the applications, the bad papers, the good papers, the leaps of faith, the happy accomplishments, and the sad ones. Not only is he a fantastic researcher, but he is also a strong teacher. I am very proud to be his student and I know many other students will be as well.
DEDICATION
To my father: for his unconditional support, and love. And for all our tribulations.
To my mother: for her strength, wisdom, and love. And for all of our struggles.
To my brother: for his continuous perseverance and happiness.
To my extended family: I stand on your shoulders.
BIOGRAPHICAL SKETCH
Michael R. Hines was born and raised in Dallas, Texas in 1983 and grew up playing classical piano. He began college in a program called the Texas Academy of Math and Science at the University of North Texas. Two years later he transferred to Johns Hopkins University in Baltimore, Maryland, and received his Bachelor of Science degree in Computer Science in 2003. Subsequently, he entered Florida State University to complete an Information Security certification in 2004 and a Master's degree in Computer Science in 2005. Immediately after that, he transferred to SUNY Binghamton University in New York state, where he finished his PhD in Computer Science in 2009.
Michael will begin post-doctoral research at Columbia University in late 2009. He is a recipient of multiple awards, including the Jackie Robinson Undergraduate Scholarship
(2yrs), the AT&T Labs Foundation Fellowship (3yrs), the Clark D. Gifford Fellowship (1yr) from Binghamton University, and the CIFellows CRA/NSF Award (1yr) for Post-Doctoral
Research. He is a member of the academic honor societies Alpha Lambda Delta and Phi Eta Sigma, and of the Computer Science honor society Upsilon Pi Epsilon. His hobbies include billiards, skateboarding, and yo-yos.
Contents
List of Figures xiii
List of Tables xviii
1 Introduction and Outline 1
1.1 Distributed Memory Virtualization in Networked Clusters ...... 3
1.2 Virtual Machine Based Use for Distributed Memory ...... 3
1.3 Improvement of Live Migration for Virtual Machines ...... 4
1.4 VM Memory Over-subscription with Network Paging ...... 5
2 Area Survey 7
2.1 Distributed Memory Systems ...... 7
2.1.1 Basic Distributed Memory (Anemone) ...... 8
2.1.2 Software Distributed Shared Memory ...... 9
2.2 Virtual Machine Technology and Distributed Memory ...... 10
2.2.1 Microkernels ...... 10
2.2.2 Modern Hypervisors ...... 11
2.3 VM Migration Techniques ...... 13
2.3.1 Process Migration ...... 13
2.3.2 Pre-Paging ...... 14
2.3.3 Live Migration ...... 15
2.3.4 Non-Live Migration ...... 15
2.3.5 Self Ballooning ...... 16
2.4 Over-subscription of Virtual Machines ...... 16
3 Anemone: Distributed Memory Access 18
3.1 Introduction ...... 18
3.2 Design & Implementation ...... 20
3.2.1 Client and Server Modules ...... 23
3.2.2 Remote Memory Access Protocol (RMAP) ...... 24
3.2.3 Distributed Resource Discovery ...... 27
3.2.4 Soft-State Refresh ...... 27
3.2.5 Server Load Balancing ...... 28
3.2.6 Fault-tolerance ...... 28
3.3 Evaluation ...... 29
3.3.1 Paging Latency ...... 30
3.3.2 Application Speedup ...... 32
3.3.3 Tuning the Client RMAP Protocol ...... 36
3.3.4 Control Message Overhead ...... 37
3.4 Summary ...... 38
4 MemX: Virtual Machine Uses of Distributed Memory 39
4.1 Introduction ...... 39
4.2 Split Driver Background ...... 41
4.3 Design and Implementation ...... 43
4.3.1 MemX-Linux: MemX in Non-virtualized Linux ...... 44
4.3.2 MemX-DomU (Option 1): MemX Client Module in DomU ...... 46
4.3.3 MemX-DD (Option 2): MemX Client Module in Driver Domain . . . . 48
4.3.4 MemX-Dom0 (Option 3) ...... 51
4.3.5 Alternative Options ...... 51
4.3.6 Network Access Contention ...... 52
4.4 Evaluation ...... 53
4.4.1 Latency and Bandwidth Microbenchmarks ...... 54
4.4.2 Application Speedups ...... 61
4.4.3 Multiple Client VMs ...... 63
4.4.4 Live VM Migration ...... 65
4.5 Summary ...... 65
5 Post-Copy: Live Virtual Machine Migration 67
5.1 Introduction ...... 68
5.2 Design ...... 70
5.2.1 Pre-Copy ...... 71
5.2.2 Design of Post-Copy Live VM Migration ...... 73
5.2.3 Prepaging Strategy ...... 76
5.2.4 Dynamic Self-Ballooning ...... 78
5.2.5 Reliability ...... 80
5.2.6 Summary ...... 81
5.3 Post-Copy Implementation ...... 82
5.3.1 Page-Fault Detection ...... 83
5.3.2 MFN exchanging ...... 85
5.3.3 Xen Daemon Modifications ...... 86
5.3.4 VM-to-VM kernel-to-kernel Memory-Mapping ...... 88
5.3.5 Dynamic Self Ballooning Implementation ...... 89
5.3.6 Proactive LRU Ordering to Improve Reference Locality ...... 92
5.4 Evaluation ...... 93
5.4.1 Stress Testing ...... 94
5.4.2 Degradation, Bandwidth, and Ballooning ...... 98
5.4.3 Application Scenarios ...... 104
5.4.4 Comparison of Prepaging Strategies ...... 107
5.5 Summary ...... 109
6 CIVIC: Transparent Over-subscription of VM Memory 110
6.1 Introduction ...... 111
6.2 Design ...... 113
6.2.1 Hypervisor Memory Management ...... 113
6.2.2 Shadow Paging Review ...... 115
6.2.3 Step 1: CIVIC Memory Allocation, Caching Design ...... 116
6.2.4 Step 2: Paging Communication and The Assistant ...... 118
6.2.5 Future Work: Page Migration, Sharing and Compression ...... 124
6.3 Implementation ...... 127
6.3.1 Address Space Expansion, BIOS Tables ...... 127
6.3.2 Communication Paths ...... 129
6.3.3 Cache Eviction and Prefetching ...... 130
6.3.4 Page-Fault Interception, Shadows, Reverse Mapping ...... 134
6.4 Evaluation ...... 138
6.4.1 Micro-Benchmarks ...... 138
6.4.2 Applications ...... 142
6.5 Summary ...... 147
7 Improvements and Closing Arguments 148
7.1 MemX Improvements ...... 148
7.1.1 Non-Volatile MemX Memory Descriptors ...... 148
7.1.2 MemX Internal Caching ...... 149
7.1.3 Server-to-Server Proactive Page Migration ...... 150
7.1.4 Increased MemX bandwidth w/ Multiple NICs ...... 150
7.2 Migration Flexibility ...... 151
7.2.1 Hybrid Migration ...... 151
7.2.2 Improved Migration of VMs Through CIVIC ...... 151
7.3 CIVIC Improvements and Ideas ...... 152
7.3.1 How high can you go?: Extreme Consolidation ...... 152
7.3.2 Improved Eviction and Shadow Optimizations ...... 152
7.4 Conclusions ...... 153
A CIVIC Screenshots 154
A.1 Small-HVM Over-subscription ...... 154
A.2 Large-HVM Oversubscription ...... 156
B The Xen Live-migration process 158
B.1 Xen Daemon ...... 158
B.2 Understanding Frame Numbering ...... 161
B.3 Memory-related Data Structures ...... 163
B.4 Page-table Management ...... 165
B.5 Actually Performing the Migration ...... 166
Bibliography 169
List of Figures
3.1 Placement of distributed memory within the classical memory hierarchy. . . 21
3.2 The components of a client...... 22
3.3 The components of a server...... 23
3.4 A view of a typical anemone packet header. The RMAP protocol transmits
these directly to the network card from the BDI device driver...... 26
3.5 Random read disk latency CDF ...... 30
3.6 Sequential read disk latency CDF ...... 31
3.7 Random write disk latency CDF ...... 31
3.8 Sequential write disk latency CDF ...... 32
3.9 Execution times of POV-ray for increasing problem sizes...... 33
3.10 Execution times of STL Quicksort for increasing problem sizes...... 34
3.11 Execution times of multiple concurrent processes executing POV-ray. . . . . 35
3.12 Execution times of multiple concurrent processes executing STL Quicksort. 35
3.13 Effects of varying the transmission window using Quicksort...... 36
4.1 Split Device Driver Architecture in Xen...... 42
4.2 MemX-Linux: Baseline operation of MemX in a non-virtualized Linux envi-
ronment. The client can communicate with multiple memory servers across
the network to satisfy the memory requirements of large memory applications. 44
4.3 MemX-DomU: Inserting the MemX client module within DomU’s Linux ker-
nel. The server executes in non-virtualized Linux...... 47
4.4 MemX-DD: Executing a common MemX client module within the driver domain, allowing multiple DomUs to share a single client module. The server module continues to execute in non-virtualized Linux. ...... 49
4.5 I/O bandwidth, for different MemX-configurations, using custom benchmark
that issues asynchronous, non-blocking 4-KB I/O requests. “DIO” refers to
opening the file descriptor with direct I/O turned on, to compare against by-
passing the Linux page cache...... 55
4.6 Comparison of sequential and random read latency distributions for MemX-
DD and disk. Reads traverse the filesystem buffer cache. Most random read
latencies are an order of magnitude smaller with MemX-DD than with disk.
All sequential reads benefit from filesystem prefetching...... 58
4.7 Comparison of sequential and random write latency distributions for MemX-
DD and disk. Writes goes through the filesystem buffer cache. Conse-
quently, all four latencies are similar due to write buffering...... 58
4.8 Effect of filesystem buffering on random read latency distributions for MemX-
DD and disk. About 10% of random read requests (issued without the direct
I/O flag) are serviced at the filesystem buffer cache, as indicated by the first
knee below 10µs for both MemX-DD and disk...... 59
4.9 Quicksort execution times in various MemX combinations and disk. While
clearly surpassing disk performance, MemX-DD trails regular Linux only
slightly using a 512 MB Xen Guest...... 60
4.10 Quicksort execution times for multiple concurrent guest VMs using MemX-
DD and iSCSI configurations...... 62
4.11 Our multiple client setup: Five identical 4 GB dual-core machines, where
one houses 20 Xen Guests and the others serve as either MemXservers or
iSCSI servers...... 63
5.1 Pseudo-code for the pre-paging algorithm employed by post-copy migration.
Synchronization and locking code omitted for clarity of presentation. . . . . 74
5.2 Prepaging strategies: (a) Bubbling with single pivot and (b) Bubbling with
multiple pivots. Each pivot represents the location of a network fault on
the in-memory pseudo-paging device. Pages around the pivot are actively
pushed to target...... 76
5.3 Pseudo-Swapping (item 3): As pages are swapped out within the source
guest itself, their MFN identifiers are exchanged and Domain 0 memory
maps those frames with the help of the hypervisor. The rest of post-copy
then takes over after downtime...... 82
5.4 The intersection of downtime within the two migration schemes. Currently,
our downtime consists of sending non-pageable memory (which can be
eliminated by employing the use of shadow-paging). Pre-copy downtime
consists of sending the last round of pages...... 85
5.5 Comparison of total migration times between post-copy and pre-copy. . . . 95
5.6 Comparison of downtimes between pre-copy and post-copy...... 96
5.7 Comparison of the number of pages transferred during a single migration. . 97
5.8 Kernel compile with back-to-back migrations using 5 seconds pauses. . . . 98
5.9 NetPerf run with back-to-back migrations using 5 seconds pauses...... 100
5.10 Impact of post-copy NetPerf bandwidth...... 101
5.11 Impact of pre-copy NetPerf bandwidth...... 102
5.12 The application degradation is inversely proportional to the ballooning interval. 103
5.13 Total pages transferred for both migration schemes...... 105
5.14 Page-fault comparisons: Pre-paging lowers the network page faults to 17%
and 21%, even for the heaviest applications...... 106
5.15 Total migration time for both migration schemes...... 106
5.16 Downtime for post-copy vs. pre-copy. Post-copy downtime can improve with
better page-fault detection...... 107
5.17 Comparison of prepaging strategies using multi-process Quicksort workloads. 108
6.1 Original Xen-based physical memory design for multiple, concurrently-running
virtual machines...... 116
6.2 Physical memory caching design of a CIVIC-enabled Hypervisor for multiple,
concurrently-running virtual machines...... 117
6.3 Illustration of a full PPAS cache. All page accesses in the PPAS space must
be brought into the cache before the HVM can use the page. If the cache is
full, an old page is evicted from the FIFO maintained by the cache...... 119
6.4 Internal CIVIC architecture: An Assistant VM holds two kernel modules re-
sponsible for mapping and paging HVM memory. One module directly (on-
demand) memory-maps portions of PPAS #2, whereas MemX does I/O. A
modified, CIVIC-enabled hypervisor intercepts page-faults to shadow page
tables in the RAS and delivers them to the Assistant VM. If the HVM cache
is full, the Assistant also receives victim pages...... 121
6.5 High-level CIVIC architecture: unmodified CIVIC-enabled HVM guests have
both local reservations (caches) while small or large amounts of their reser-
vations actually expand out to nearby hosts...... 123
6.6 Future CIVIC architecture: a large number of nodes would collectively pro-
vide global and local caches. The path of a page would potentially exhibit
multiple evictions from Guest A to local to global. Furthermore, a global
cache can be made to evict pages to other global caches...... 125
6.7 Pseudo-code for the prefetching algorithm employed by CIVIC. On every
page-fault, this routine is called to adjust the window based on the spatial
location of the current PFN address in the PPAS...... 132
6.8 Page Dirtying Rate for different types of Virtual Machines, including HVM
Guests, Para-virtual Guests, and with different types of shadow paging. This
includes the overhead of creating new page tables from scratch...... 140
6.9 Bus-speed Page Dirtying Rate in gigabits-per-second. This is line-speed
hardware memory speed once page-tables have already been created and
shows throughput at an order of magnitude higher than the previous graph. 141
6.10 Completion times for quicksort on a CIVIC-enabled virtual machine and a
regular virtual machine...... 143
6.11 Completion times for Sparse Matrix Multiplication with a resident memory
footprint of 512 MB while varying the cache sizes...... 144
6.12 Requests Per Second for the RUBiS Auction Benchmark with a resident
memory footprint of 490 MB while varying the cache sizes...... 145
A.1 A live run of an HVM guest on top of CIVIC with a very small PPAS cache
size of 64 MB. The HVM has 2 GB. (Turn the page sideways) ...... 155
A.2 A live run of an HVM guest on top of CIVIC with a very large PPAS cache
size of 2GB. The HVM believes that it has 64 GB. (Turn the page sideways) 157
List of Tables
3.1 Average application execution times and speedups for local memory, Dis-
tributed Anemone, and Disk. N/A indicates insufficient local memory. . . . . 32
4.1 I/O latency for each MemX-Combination in Microseconds...... 54
4.2 Execution time comparisons for various large memory application workloads. 62
5.1 Migration algorithm design choices in order of their incremental improvements.
Method #4 combines #2 and #3 with the use of pre-paging. Method #5 actually
combines all of #1 through #4, by which pre-copy is only used in a single, primer
iterative round...... 74
5.2 Percent of minor and network faults for flushing vs. pre-paging. Pre-paging
greatly reduces the fraction of network faults...... 97
6.1 Latency of a page-fault through a CIVIC-enabled hypervisor to and from network
memory at different stages...... 139
6.2 Number of Shadow-Pagefaults to and from network memory with CIVIC prefetching
disabled and enabled. Each application has a memory footprint of 512 MB and a
PPAS cache of 256 MB...... 146
Chapter 1
Introduction and Outline
Both the design and use of main memory have changed dramatically over the last half-century. Because of fast-moving advances in hardware and software, the OS designer's choices have also multiplied, especially as the performance gaps between the levels of the memory hierarchy grow larger. In this dissertation, we observe that the need to support large-memory, non-parallel applications still persists; their memory access patterns are mostly singular and disjoint from each other. This continues to include many common applications like databases and webservers as well as scientific and grid applications. We describe a bottom-up attempt over the last few years to investigate solutions for these kinds of large-memory applications (LMAs) that can be applied across high-speed networked clusters of machines. The representative set of applications we benchmark in this dissertation includes:
• Large Sorting
• Graphical Ray-tracing
• Database Workloads
• Support Webserver
• E-commerce Webserver
• Kernel Compilation
• Parallel Benchmarks
• Torrent Clients
• Network Throughput
• Network Simulation
We refer to these applications as "large-memory applications". They tend to be somewhat CPU intensive. Across application boundaries (between individual running processes), they are either not necessarily parallelizable or not designed to be without explicit threading. Their computational behavior is such that when they do need to access portions of their large memory pools, they need them fast. These accesses are also usually performed in a relatively "cache-oblivious" manner, such that the working set in memory will eventually converge to a size that fits within memory (before the application moves on to a new working set).
For these kinds of applications, this work has investigated low-level memory management options across a number of different projects, and this chapter presents a high-level outline of them. The focus of this work is the virtualization of physical memory to support these applications. We organize this chapter's high-level outline around three overarching goals:
1. Maximum Application Transparency: We want to improve the performance of these large memory applications with zero changes to the application. The last project of this dissertation extends this all the way to complete operating system transparency as well.

2. Clustered Memory Pool: We want to provide a potentially unlimited pool of cluster-wide memory to these applications with the help of distributed, low-latency communication.

3. Ubiquitous Resource Management: We ultimately want page-granular support for the arbitrary, transparent relocation of any single page frame in a cluster of machines.
This dissertation employs a combination of virtual machine technology, operating system modifications, and network protocol design to accomplish those three high-level goals for the aforementioned types of applications. The bottom-up process taken to explore the virtualization of physical memory in this dissertation is organized as follows: first we build a distributed memory virtualization system, followed by its evaluation in a virtual machine environment. Next, we develop alternative strategies for VM migration by leveraging distributed memory virtualization. Finally, we integrate these techniques to develop a system for VM memory oversubscription. We begin with a discussion of the basic distributed memory system.
1.1 Distributed Memory Virtualization in Networked Clusters
Chapter 3 begins by investigating the options available to large memory applications with basic, transparent distributed memory support in clusters of gigabit Ethernet-connected machines. Distributed memory itself is a very old idea, but our efforts at reinvestigating it have revealed various unsolved performance issues as well as new applications. Additionally, implementing a distributed memory solution was a springboard for tackling low-level memory management issues in virtual machines. Our prototype was an effort to move further from the application than previous work (deep into the kernel) by choosing a clean, familiar interface (the block device) such that the needs of the application are still respected without any changes. It consists of a fully distributed, non-shared, Linux-based, all kernel-space distributed memory solution, including a custom networking protocol and a full performance evaluation. The solution exports an interface to any process that wants to map it, and hides the complexity of shipping those frames over gigabit Ethernet to other connected machines. It is not, however, a software distributed shared memory solution: it does not provide cache-coherent resolution protocols for simultaneous write access by parallel clients. That was not the focus of this work.
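To make the flavor of such a kernel-level remote paging exchange concrete, the sketch below models a minimal page-granular read/write protocol in C. The message layout and all names (`rmap_msg`, `rmap_server_handle`, the in-memory "server" array) are hypothetical illustrations for this chapter, not the actual Anemone/RMAP wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical message types; the real RMAP header is not reproduced here. */
enum rmap_op { RMAP_READ, RMAP_WRITE };

struct rmap_msg {
    uint8_t  op;        /* RMAP_READ or RMAP_WRITE */
    uint32_t page_off;  /* page-granular offset into the remote region */
    uint8_t  data[PAGE_SIZE];
};

/* Toy in-memory "server": an array of page frames indexed by offset. */
#define SERVER_PAGES 64
static uint8_t server_store[SERVER_PAGES][PAGE_SIZE];

/* Server side of the exchange: apply a write, or fill in a read reply. */
static void rmap_server_handle(struct rmap_msg *m)
{
    assert(m->page_off < SERVER_PAGES);
    if (m->op == RMAP_WRITE)
        memcpy(server_store[m->page_off], m->data, PAGE_SIZE);
    else
        memcpy(m->data, server_store[m->page_off], PAGE_SIZE);
}
```

A client module would marshal such a message straight from the block driver to the NIC; here, a write followed by a read of the same page offset simply round-trips the data through the toy server.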
1.2 Virtual Machine Based Use for Distributed Memory
In Chapter 4, we investigate how distributed memory virtualization could benefit state-of-the-art virtual machine technology. We describe the design and implementation of a system that evaluates how distributed memory can enhance this transparency. We did this by placing (and improving upon) the aforementioned distributed memory solution at different places within the virtual machine architecture and benchmarking applications within those VMs.
At the end of 2005, a handful of virtual machine projects had already matured into both proprietary and open-source versions. We began looking into how our distributed memory implementation could apply to virtual machine technology. Recent work with VMs in the last decade is interesting in that it has once again brought into question where exactly the memory management logic for LMAs should be placed, now that there is an extra level of indirection (called the virtual machine monitor, or "hypervisor") placed below the OS (a well-known technique). Both hardware and software advances have created many ways to impose order on the handling of OS memory management while still maintaining a high degree of transparency to applications, through techniques such as full virtualization and para-virtualization.
1.3 Improvement of Live Migration for Virtual Machines
It soon became clear that virtual machine technology has succeeded tremendously at demonstrating the utility of transparent, live OS migration. In fact, it is likely that the increasing pervasiveness of VMs would never have happened without it. It is well known that many process migration prototypes, while very well built, were unable to become widespread due to fundamental limitations relating to transparency, process portability, and residual dependencies on the host OS. But changing the unit of migration to the OS itself has taken that problem out of the picture completely, even among different hypervisor vendors. The ability to run the VM transparently has shifted the base unit of computational containment from the process to the OS without changing the semantics of the application. A key component that determines the success of migration has been exactly how the virtualization architecture migrates the VM's memory, which is what initially led us to this particular problem.
In Chapter 5, we applied some of the techniques developed for virtual machine based distributed memory use to develop alternative strategies for live migration of VMs. We design, implement, and evaluate a new migration scheme and compare it to the existing migration schemes present in today's virtualization technology. We were able to realize significant gains in migration performance with a new live-migration system for write-intensive VM workloads, as well as point out some fundamental ways in which the management of VM memory could be improved over the pre-copy approach.
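The intuition behind the gain for write-intensive workloads can be illustrated with a back-of-the-envelope model in C: pre-copy retransmits pages dirtied between iterative rounds, while a post-copy style scheme sends each page at most once. The function names, and the simplifying assumption of a fixed per-round dirtying rate, are ours for illustration; this is not the measured behavior of either scheme.

```c
#include <assert.h>

/* Back-of-the-envelope model (not measured data): a VM with n_pages of
 * memory dirties a fixed number of pages per pre-copy round. Pre-copy
 * resends the dirtied pages each round until the dirty set falls below
 * a stop threshold or a round limit is hit, then stops and copies. */
static unsigned long precopy_pages_sent(unsigned long n_pages,
                                        unsigned long dirtied_per_round,
                                        unsigned long stop_threshold,
                                        unsigned max_rounds)
{
    unsigned long sent = n_pages;      /* round 1: transfer everything */
    unsigned long dirty = dirtied_per_round;
    unsigned round;

    for (round = 1; round < max_rounds && dirty > stop_threshold; round++) {
        sent += dirty;                 /* resend pages dirtied last round */
        dirty = dirtied_per_round;     /* the workload keeps writing */
    }
    return sent + dirty;               /* final stop-and-copy round */
}

/* Post-copy: every page crosses the network exactly once. */
static unsigned long postcopy_pages_sent(unsigned long n_pages)
{
    return n_pages;
}
```

For a 1000-page VM dirtying 200 pages per round (capped at 5 rounds), the model sends 2000 pages under pre-copy but only 1000 under post-copy; for a read-mostly VM (zero dirtying), the two schemes transfer the same amount.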
1.4 VM Memory Over-subscription with Network Paging
Our experience with the previous projects exposed the need for more fine-grained policies underneath the OS, particularly when VMs are consolidated from multiple physical hosts onto a single host and compete with each other for memory resources. In situations like this, determining better runtime placement and allocation of individual page frames among the VMs is important. This is where the idea of the ubiquity of individual frames of memory comes from: not only does virtualization remove the constraints on a page frame's location in memory, it releases a page frame from even being on the same physical machine, even when the VM is still considered to have local ownership of the frame. We believe that, given the dynamics of a virtualized environment, the OS should consider its physical memory as a ubiquitous "resource" without worrying about its physical location. This does not mean that it should not be aware of the contiguity of the physical memory space (with respect to kernel subsystems that handle memory allocation and fragmentation). Rather, it means that the source of that contiguous resource should be more flexible. Along the same lines, the interfaces that export this resource should maintain fast, efficient memory access and do so without duplicating implementation effort or functionality.
With that, Chapter 6 presents the last contribution of this dissertation: a complete implementation and evaluation of a system that allows an unmodified VM to use more DRAM than is physically provided by the host machine. Our system is able to do this without any changes to the virtual machine. This is done through a combination of means. First, we alter the hypervisor under the VM and give the VM a view of a physical memory allocation that is larger than what is available at the host on which it is running. We then hook into the shadow paging mechanism, a feature provided by all modern hypervisors, to intercept page-table modifications performed by the VM. Finally, we supplement this by implementing a network paging system at the hypervisor level to allow for victim page selection when non-resident pages are accessed. This system is implemented while preserving the traditional concepts of paging and segmentation employed by an OS and by taking a page (pardon the pun) from microkernels by continuing to keep the hypervisor as small as possible.
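As one illustration of victim page selection, the sketch below implements a small FIFO cache of resident page frame numbers in C: inserting a page into a full cache evicts (and returns) the oldest resident page, which would then be paged out over the network. The structure and names are hypothetical; the cache actually used by CIVIC is described in Chapter 6.

```c
#include <assert.h>

#define CACHE_SLOTS 4   /* tiny for illustration; a real cache is far larger */

/* Hypothetical FIFO of resident page frame numbers (pfns). */
struct pfn_cache {
    long slots[CACHE_SLOTS];  /* resident pfns, -1 = empty */
    int  head;                /* oldest resident entry (next victim) */
    int  count;
};

static void cache_init(struct pfn_cache *c)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->slots[i] = -1;
    c->head = 0;
    c->count = 0;
}

/* Make pfn resident; if the cache is full, evict and return the oldest
 * resident pfn (the victim to be paged out over the network), else -1. */
static long cache_insert(struct pfn_cache *c, long pfn)
{
    long victim = -1;

    if (c->count == CACHE_SLOTS) {
        victim = c->slots[c->head];           /* oldest entry goes out */
        c->slots[c->head] = pfn;              /* newcomer takes its slot */
        c->head = (c->head + 1) % CACHE_SLOTS;
    } else {
        c->slots[(c->head + c->count) % CACHE_SLOTS] = pfn;
        c->count++;
    }
    return victim;
}
```

With four slots, inserting pages 1 through 4 evicts nothing; inserting a fifth page evicts page 1, a sixth evicts page 2, and so on in arrival order.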
Our implementation also maintains the same transparency to the OS and its applications that all of our previous work has guaranteed. This system gives the system administrator and application programmer wide latitude: the option to arbitrarily cache, share, or move individual page frames for improved consolidation of multiple co-located VMs among physical hosts.

Chapter 2

Area Survey
Aside from the focus of this work discussed in Chapter 1, there is a great deal of related work. This chapter will present a literature survey of supporting work up to this point.
We will go through the three major steps taken in the prior work discussed in the introduction and explain how other literature is similar to and differs from our work, covering the Anemone system, the MemX system, the Post-Copy migration system, and our final system, CIVIC.
2.1 Distributed Memory Systems
Our distributed memory system, Anemone [50, 51], was the first system that provided unmodified large memory applications (LMAs) with completely transparent access to cluster-wide memory over commodity gigabit Ethernet LANs. One goal of our work was to make a concerted effort to bring all components of the implementation into the Linux kernel and to optimize for the network conditions in the LAN that were specific to network memory traffic: particularly the repeated flow control of 4-kilobyte page frames. As such, it can briefly be treated as distributed paging, distributed memory-mapping, or a remote in-memory filesystem, while the logic and design decisions are hidden behind a block device driver.
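Because the interface is a block device, sector-granular requests must be translated onto 4 KB remote page frames before they can be shipped to a server. The helper below sketches that translation in C under the usual 512-byte-sector assumption; the names (`page_extent`, `sectors_to_pages`) are illustrative, not the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512
#define PAGE_SIZE   4096

/* Hypothetical helper: which 4 KB remote pages does a sector-granular
 * block request touch? (Not the actual Anemone driver code.) */
struct page_extent {
    uint64_t first_page;   /* first remote page frame touched */
    uint64_t page_count;   /* number of 4 KB pages spanned */
};

static struct page_extent sectors_to_pages(uint64_t start_sector,
                                           uint64_t n_sectors)
{
    struct page_extent e;
    uint64_t start = start_sector * SECTOR_SIZE;
    uint64_t end   = (start_sector + n_sectors) * SECTOR_SIZE;

    e.first_page = start / PAGE_SIZE;
    e.page_count = (end + PAGE_SIZE - 1) / PAGE_SIZE - e.first_page;
    return e;
}
```

Eight 512-byte sectors line up with exactly one page, while a request straddling a page boundary (say, two sectors starting at sector 7) touches two remote pages and forces the driver to operate on each.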
2.1.1 Basic Distributed Memory (Anemone)
The two most prominent systems designed to support distributed memory in the 1990s (both now dormant) were the NOW project [15] at Berkeley and the Global Memory System [37] at Washington. We decided to re-tackle this problem for a few reasons: (a) neither project was available for use, (b) network and CPU speeds had increased by an order of magnitude since, and (c) both projects required extensive operating system support. The Global Memory System was designed to provide network-wide memory management support for paging, memory-mapped files, and file caching. This system was built closely into the end-host operating system and operated on a 155 Mbps DEC Alpha ATM network. The NOW project [15] did many things on top of the Digital Unix operating system. In the end, their solution included an OS-supported “cooperative caching” system, a type of distributed filesystem with the added responsibility of caching disk blocks (which could be memory mapped) in the memory of participating nodes. We will describe cooperative caching systems later, but suffice it to say that these were very large implementations that could be functionally reduced to performing distributed memory in an indirect manner. Our goal was to re-tackle just the distributed memory components of these systems, without any OS modifications, as low as possible within a device driver, in the hope that the project would be an enabling mechanism for more complicated projects in later years, which is exactly what happened. To explore these problems, we needed a working prototype that solved them in the Linux operating system, took into account the design principles of current kernel development, and functioned well over gigabit Ethernet networks. For all of those reasons, Chapter 3 describes the new system as we designed it.
Although the previously mentioned projects were the most popular, they were far from the only projects of the 1990s. The earliest non-shared efforts [40, 21, 57] at using distributed memory aimed to improve memory management, recovery, concurrency control, and read/write performance for in-memory database and transaction processing systems. The first two distributed paging systems were presented in [28] and [38]. These projects also took the stance of incorporating extensive OS changes on both the client and the memory servers on other nodes. The Samson project (of which my advisor was a member) [90] was a dedicated memory server with a highly modified OS over a Myrinet interconnect that actively attempts to predict client page requirements. The Dodo project [59, 9] was another late-1990s attempt to provide a more end-to-end solution to the distributed-memory problem. They built a user-level, library-based interface that a programmer can use to coordinate all data transfers to and from a distributed memory cache. This obviously required legacy applications to be aware of a specific API in that library; for the Anemone project, this was a deal-breaker. The work that is probably closest to our prototype was done by [68] and followed up in [39], implemented within the DEC OSF/1 operating system in 1996. Like us, they use a transparent device driver to perform paging. Again, our primary differences are as in the NOW case: a slow network, an out-of-date operating system, and no available code from which we could build a broader research project. They do, however, have a recovery system built into their work, capable of surviving single-node failures.
2.1.2 Software Distributed Shared Memory
For shared memory systems, typically called “Software Distributed Shared Memory” (DSM), a group of nodes participates in one of a host of different consistency protocols, not unlike the hardware requirements of cache-coherent Non-Uniform Memory Access (NUMA) shared memory machines. There are many of these systems. By its nature, the purpose of cache-coherent systems is to provide a competing paradigm to parallel execution systems that depend on the Message Passing Interface (MPI). In general, DSM and MPI are competitors, each attempting to provide the means for parallel speedup across multiple physical host machines at different levels of the computing hierarchy. MPI attempts to provide the speedup through explicit data movement across each node through a series of calls, whereas a properly implemented DSM attempts to make this data movement inherent. This is typically done either at the language level or (like MPI) at the library level, in such a way that the DSM system handles shared writes (with proper ordering) so that the concurrently running programs on different nodes need only focus on locking the critical sections that access shared data structures. As we mentioned, Anemone is not a DSM, nor are we trying to do research on parallel execution. Nevertheless, some of the more popular DSM projects in the 1990s included [35] and [14], which allow a set of independent nodes to behave as a large shared memory multi-processor, often requiring customized programming to share common data across nodes.
2.2 Virtual Machine Technology and Distributed Memory
Whole-operating-system VM technology, in which multiple independent, and possibly different, operating systems run simultaneously, has been re-invented in the last decade.
The modern virtual machine monitor, or hypervisor, is inspired by three different kinds of OS virtualization: (a) library operating systems, (b) microkernels (versus monolithic kernels), and (c) the commodity OS virtualization work of the early 1970s. We will briefly survey some of these ideas and how they have influenced choices in our work, resulting in a project called “MemX” [49]. When that work was completed, MemX was the first system in a VM environment that provided unmodified LMAs with completely transparent and virtualized access to cluster-wide distributed memory over commodity gigabit Ethernet LANs. We begin our survey of virtual machine technology with microkernels and then discuss modern hypervisors.
2.2.1 Microkernels
Microkernels were attempts by the operating systems community in the 1980s and '90s to shrink the size of the core OS base and move more of the subsystems of a traditional “macro” OS into user-land processes or servers. This decreased the privileges of those subsystems, isolating the system from faults in foreign device drivers, and required fast communication mechanisms for the subsystems to talk to each other. Other motivations for microkernels included the ability to provide UNIX-compatible environments without the need to constantly port drivers to new systems and without the need to port new systems to new CPU architectures. As long as the microkernel and the supporting communication framework are kept constant as a standard, one gains a great deal of interoperability, a source of headache that continues to exist today. The advantages provided by microkernels and virtual machines are almost identical, and without going into too much of a philosophical debate, virtual machine designers add more hypervisor-aware code to current operating systems every year. One could almost consider modern hypervisors to be microkernels [45]. Probably the only reason that microkernels did not become more widespread was that industry support for these research prototypes never fully gained traction, whereas virtual machine technology has managed to do so. Nevertheless, the exploration of microkernels had a great deal of success beginning in the 1980s, including successful projects like Mach [8], Chorus [7], Amoeba [72], and L4 [64].
Notable work was also performed on “library” operating systems. These are based on the idea of having a root system “fork” off a smaller operating system in much the same way library code is stored and loaded on demand. Such systems do not fall cleanly into the definition of a microkernel, but they are closer to microkernels than to virtual machines because they also depend on fast communication primitives and their focus is not to provide full virtualization of multiple CPU architectures. Such systems included the Exokernel [36] and Nemesis [62].
2.2.2 Modern Hypervisors
The first hypervisors (“hypervisor” being the current term for the longer “virtual machine monitor”) have been around since the late 1960s [10] and were developed all the way through the late '70s (primarily by industry) until academic research began to focus on microkernels, which dominated research until the mid-1990s. These early hypervisors were generally paired directly with specific hardware and meant to support multiple identical copies of the same operating system. After the microkernel movement slowed down, probably the first “revival” of hypervisor technology started with Disco [23]. The context of this work was cache-coherent NUMA machines, motivated by IBM's work [10]. Their focus was similar, to support multiple commodity operating systems, but their aim was to do it with as few changes as possible. A popular open-source attempt called “User Mode Linux” [2] also sprang up for a short while, but operated completely in userland. (We actually used this for a while to test our early distributed memory prototypes, but the developer base did not continue to grow.) At the turn of the century, two more hypervisors arrived: Denali [6] (which was later modified to be a microkernel) and the familiar VMware system.
Modern hypervisors currently fall into three categories: (a) full virtualization, (b) para-virtualization, and (c) pre-virtualization. Para-virtualization indicates that the OS has been modified to be aware that it is virtualized and to provide direct support to the underlying hypervisor, improving the speed of virtualized memory accesses and device emulation. Full virtualization indicates that the guest operating system (the OS being virtualized by a hypervisor) has not been modified to support virtualization. Full virtualization can be supported in two ways: with or without hardware support. Both AMD [13] and Intel [3] provide hardware support for virtualization by enabling the processor to trap directly into the hypervisor when the guest attempts to execute a privileged instruction that must be emulated. Full virtualization systems like KVM [4] depend completely on hardware support. Projects like Xen [20] support both para-virtualized and fully-virtualized operating systems, with and without support from hardware. The second way to perform full virtualization is to use binary translation, as is the case with VMware. One critique of this approach is that the translation must be performed at run time, incurring execution overheads of up to 20%. Similarly, pre-virtualization [63] is a related attempt to perform these translations offline in a layered manner or with a custom compiler, but existing prototypes have not gained much traction in the community.
Finally, para-virtualization takes the opposite approach: virtualizing by modifying the operating system itself. This technique met with a great deal of success in the Xen project [20], which is the hypervisor platform used in this work. Recently, the Linux and Windows communities have been updating these macro-kernels with hypervisor-aware hooks to mitigate the overhead of forward-porting. Such changes will also benefit many of the aforementioned full-virtualization technologies. Other para-virtualization techniques include operating-system-level virtualization, similar to [2], in which the OS itself and all processes are isolated into individual containers without the use of a true hypervisor [1].
2.3 VM Migration Techniques
Chapter 5 targets the performance of the live migration of virtual machines. The technique we use, accompanied by a handful of new optimizations, is called “post-copy”. Live migration is a mandatory feature of modern hypervisors; it facilitates server consolidation, system maintenance, and lower power consumption. Post-copy refers to deferring the “copy” phase of live migration until after the virtual machine's CPU state has been migrated. Pre-copy refers to the opposite ordering and is currently the dominant way to migrate a process or virtual machine. A survey of the different units and types of migration is presented here.
2.3.1 Process Migration
The post-copy algorithm (which has gone by different names) has appeared in the context of process migration in four previous incarnations: it was first implemented as “Freeze Free” using a file server [84] in 1996, simulated in 1997 [83] (where the term post-copy was first coined), and later followed by an actual Linux implementation in 2003 [74], which originated the “hybrid” assisted post-copy scheme that we will summarize later. In 2008, a version under the openMosix kernel was again presented for process migration in [85]. Our contributions instead address new challenges at the virtual machine level that are not seen at the process level, and we benchmark an array of applications against the different metrics of full virtual machine migration, which these approaches do not do. The closest work to post-copy is a report called SnowFlock [44]. They use a similar technique in the context of parallel computing by introducing “impromptu clusters”, cloning a VM to multiple destination nodes and collecting results from the new clones. They do not compare their scheme to (or optimize upon) the original pre-copy system. Their page-fault avoidance heuristics also differ: they para-virtualize Xen guests to avoid transmitting free pages, whereas we use ballooning because it is less invasive and transparent to kernel operations. Process migration schemes, well surveyed in [71], have not become widely pervasive, though several projects exist, including Condor [30], Mosix [19], libckpt [80], CoCheck [91], Kerrighed [58], and Sprite [34].
The migration of entire operating systems is inherently free of residual dependencies while still providing a live and clean unit of migration. Techniques also exist to migrate applications [71] or entire VMs [17, 27, 73] to nodes that have more free resources (memory, CPU) or better data access locality. Both Xen [27] and VMware [73] support migration of VMs from one physical machine to another, for example, to move a memory-hungry enterprise application from a low-memory node to a memory-rich node. However, large memory applications within each VM are still constrained to execute within the memory limits of a single physical machine at any time. In fact, we have shown that MemX can be used in conjunction with VM migration in Xen, combining the benefits of both live VM migration and distributed memory access. MOSIX [19] is a management system that uses process migration to allow sharing of computational resources among a collection of nodes, as if in a single multiprocessor machine. However, each process is still restricted to using memory resources within a single machine.
2.3.2 Pre-Paging
The post-copy algorithm does its best (as pre-copy does) to identify the collective working set of the virtual machine's processes; the working-set concept for individual processes was first identified in 1968 [32]. Pre-copy does this with shadow paging: the use of an additional read-only page-table level that tracks the dirtying of pages. Post-copy does this by receiving page faults. We mitigate the effect of faults on applications through the use of pre-paging, a technique that also goes by different titles. In virtual-memory and application-level solutions, it is called pre-paging. At the I/O level, or the actual paging-device level, it can also be referred to as “adaptive prefetching”. For process migration and distributed memory systems it can also be referred to as “adaptive distributed paging” (whereas ordinary distributed paging suffers from the residual dependency problem, and may or may not involve pre-fetching). In either case, we use the term pre-paging to refer to a migration system that adaptively “flushes” out all of the distributed pages while simultaneously trying to hide the latency of page faults, as pre-fetching does. We do not use disks or intermediate nodes. Traditionally, pre-paging algorithms involve both reactive and history-based approaches to anticipate as closely as possible what the working set of the application may be. Pre-paging goes back as far as 1968 [76] and has experienced a brief resurgence this decade; a survey can be found in [94]. In our case, we implement a reactive approach with a few optimizations at the virtual machine level, described later. History-based approaches may benefit future work, but we do not implement them here.
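A minimal reactive pre-paging heuristic can be sketched as follows. This is an illustrative Python model, not the actual hypervisor code; the function name, the symmetric-window shape, and the `radius` parameter are assumptions chosen for clarity.

```python
def prepage_window(faulted_page, total_pages, radius=4):
    """Return the page numbers to pre-fetch around a page fault.

    A reactive heuristic: on a fault at `faulted_page`, request a
    symmetric window of `radius` pages on each side, clipped to the
    valid page range. A real system would also skip pages that are
    already resident and adapt the radius to observed access patterns.
    """
    lo = max(0, faulted_page - radius)
    hi = min(total_pages - 1, faulted_page + radius)
    return list(range(lo, hi + 1))

# A fault on page 100 of a 4096-page VM pulls in pages 96 through 104,
# hiding the latency of the faults that would likely follow.
print(prepage_window(100, 4096))
```

History-based variants would replace the fixed window with predictions derived from previously observed fault sequences.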
2.3.3 Live Migration
System-level virtual machine migration has been revived by several projects, including architecture-independent approaches with VMware migration [73] and Xen migration [27], architecture-dependent projects using VT-x or VT-d chips with the KVM project in Linux [4], operating-system-level approaches that do not use hypervisors (similar to capsules/pods) with the OpenVZ system [1], and even wide-area-network approaches [22], all of which can potentially benefit from the post-copy method of VM migration presented here. Furthermore, the self-migration of operating systems has much in common with the migration of single processes [48]. The same group built this project on top of their “Nomadic Operating Systems” [47] project, as well as their first prototype implementation on top of the L4 Linux microkernel using “NomadBIOS”. All of these systems currently use pre-copy-based migration schemes.
2.3.4 Non-Live Migration
There are several non-live approaches to migration, in which the dependent applications must be completely suspended during the entire migration. The term capsule was introduced by Schmidt in [87]. In this work, capsules were implemented by grouping together processes in Linux or Solaris operating systems and migrating all of their state as a group, as opposed to the full operating system. Along the same lines, Zap [78] uses units of migration called process domains (pods), which are essentially process groups along with their process-to-kernel interfaces, such as file handles and sockets. Migration is done by suspending the pod and copying it to the target. Connections to active services are not maintained during transit. The Denali project [6, 5] dealt with migrating checkpointed VMware virtual machines across a network, incurring longer migration downtime. Chen and Noble suggested using hardware-level virtual machines for user mobility [41]. The Capsules/COW project [24] addresses user mobility and system administration by encapsulating the state of computing environments as objects that can be transferred between distinct physical hosts, citing the example of transferring an OS instance to a home computer while the user drives home from work. The OS instance is not active during the transfer. The “Internet Suspend/Resume” project [66] focuses on the capability to save and restore computing state on anonymous hardware. The execution of the virtual machine is suspended during transit. In contrast to these systems, our aim is to transfer live, active OS instances over fast networks without stopping them.
2.3.5 Self Ballooning
Ballooning is the act of changing the operating system's view of the amount of physical memory at runtime. Ballooning has already been used a few times in virtual machine technology, but none of these uses has been made continuous in production as of yet, nor has the interaction of ballooning with different VM migration systems been investigated, which is the purpose of this work. Prior ballooning work includes VMware's 2002 publication [96], which was inspired by “self-paging” in the Nemesis operating system [46]. It is not clear, however, how their ballooning mechanisms interact with different forms of VM migration, which is what we are trying to investigate. Xen is also capable of simple one-time ballooning during migration and at system boot time. Additionally, an effort is being made by a group within Oracle Corp. to commit a general version of self-ballooning into the Xen upstream development tree [67]. Such contributions will help standardize the use of ballooning.
2.4 Over-subscription of Virtual Machines
The most notable attempts to oversubscribe virtual machine memory were presented in [96] for VMware and [33] for Xen. These projects work very well, but the amount of VM memory is constrained to what is available on the physical host. Additionally, a couple of DSM-level attempts to present a Single-System Image (SSI) for unmodified VMs exist in [12] and [69]. Building an SSI was not the focus of this dissertation; rather, our focus was to allow local virtual machines to gain cluster memory access, because we want to increase VM consolidation and migration performance rather than spread processing out into the cluster. Thus, the processor resources available to such VMs in our work reside on one host only. Ballooning, as described in the previous section, also allows VM memory to be oversubscribed, but it requires direct operating system participation, does not allow access to non-resident memory, and requires a one-to-one static memory allocation throughout the virtual machine's lifetime. To date, the CIVIC system described in Chapter 6 is the first attempt to apply distributed memory to unmodified virtual machines running applications with large memory requirements in a low-latency environment, through the use of network paging and shadow memory interception within the Xen hypervisor.

Chapter 3

Anemone: Distributed Memory Access
In this chapter, we describe in detail our initial distributed memory work, the Anemone project. The performance of large memory applications degrades rapidly once the system hits the physical memory limit and begins paging or thrashing. We present the design, implementation, and evaluation of Distributed Anemone (Adaptive Network Memory Engine), a lightweight distributed system that pools together the collective memory resources of multiple Linux machines across a gigabit Ethernet LAN. Anemone treats distributed memory as another level in the memory hierarchy between very fast local memory and very slow local disks. Anemone enables applications to access potentially “unlimited” network memory without any application or operating system modifications (when Anemone is used as a swap device). Our kernel-level prototype features fully distributed resource management, low-latency paging, resource discovery, load balancing, soft-state refresh, and support for “jumbo” Ethernet frames. Anemone achieves low average page-fault latencies of 160µs and application speedups of up to 4 times for single processes and up to 14 times for multiple concurrent processes, when compared against disk-based paging.
3.1 Introduction
Performance of large-memory applications (LMAs) can suffer from large disk access latencies when the system hits the physical memory limit and starts paging to local disk.
At the same time, affordable, low-latency gigabit Ethernet is becoming commonplace, with support for jumbo frames (packets larger than 1500 bytes). Consequently, instead of paging to a slow local disk, one could page over gigabit Ethernet to the unused memory of distributed machines and use the disk only when distributed memory is exhausted.
Thus, distributed memory can be viewed as another level in the traditional memory hierarchy, filling the widening performance gap between low-latency RAM and high-latency disk. In fact, distributed memory paging latencies of about 160µs or less can be easily achieved, whereas disk read latencies range anywhere between 6 and 13ms. A natural goal is to enable unmodified LMAs to transparently utilize the collective distributed memory of nodes across a gigabit Ethernet LAN. Several prior efforts [28, 38, 37, 59, 68, 39, 70, 90] have addressed this problem by relying upon expensive interconnect hardware (ATM or Myrinet switches), slow bandwidth-limited LANs (10/100 Mbps), or heavyweight software Distributed Shared Memory (DSM) systems [35, 14] that require intricate consistency/coherence techniques and, often, customized application programming interfaces. Additionally, extensive changes were often required to the LMAs, the OS kernel, or both.
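The gap between the two latency figures above is worth making explicit. A quick back-of-the-envelope calculation (using the cited figures of 160µs for a network page fetch and 6-13ms for a disk read):

```python
# Per-fault latency comparison using the figures cited in the text.
NETWORK_US = 160                     # network paging latency (microseconds)
DISK_MS_LOW, DISK_MS_HIGH = 6, 13    # disk read latency range (milliseconds)

speedup_low = (DISK_MS_LOW * 1000) / NETWORK_US    # 37.5x
speedup_high = (DISK_MS_HIGH * 1000) / NETWORK_US  # 81.25x

print(f"network paging is {speedup_low:.1f}x to {speedup_high:.1f}x "
      f"faster per fault than disk paging")
```

So each individual page fault served over the network completes roughly 37x to 81x faster than one served from disk, which is what motivates treating distributed memory as its own hierarchy level.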
Our earlier work [50] addressed the above problem through an initial prototype, called the Adaptive Network Memory Engine (Anemone) – the first attempt at demonstrating the feasibility of transparent distributed memory access for LMAs over commodity gigabit
Ethernet LAN. This was done without requiring any OS changes or recompilation, and relied upon a central node to map and exchange pages between nodes in the cluster. Here we describe the implementation and evaluation of a fully distributed Anemone architecture.
Like the centralized version, distributed Anemone uses lightweight, pluggable Linux kernel modules and does not require any OS changes. Additionally, it achieves the following significant improvements over a centralized system.
1. Full distribution: Memory resource management is distributed across the whole
cluster. There is no single control node.
2. Low latency: The round-trip time from one machine to the other is reduced by over
a factor of 3 when compared to disk access – to around 160µs.
3. Load balancing: Clients make intelligent decisions to direct distributed memory traffic across all available memory servers, taking into account their memory usage and paging load.
4. Dynamic Discovery and Release: A distributed resource discovery mechanism enables clients to discover newly available servers and track memory usage across the cluster. The protocol also has a mechanism for releasing servers and re-distributing their memory so that individual servers can be taken down for maintenance.
5. Large packet support: The distributed version can choose whether or not “jumbo” frames should be used based on the network hardware present, allowing operation in networks with any MTU size. Our protocol is custom-built without the use of TCP. As far as the application is concerned, network transmission does not exist, so the end-to-end design of our protocol is built to satisfy the efficiency needs of code in the kernel.
We evaluated our prototype using unmodified LMAs such as ray-tracing, network simulations, in-memory sorting, and k-nearest-neighbor search. Results show that the system is able to reduce average page-fault latencies from 8.3ms to 160µs. Single-process applications (including those that internally contain threads) speed up by up to a factor of 4, and multiple concurrent processes by up to a factor of 14, when compared against disk-based paging.
3.2 Design & Implementation
Distributed Anemone has two major software components: the client module on low-memory machines and the server module on machines with unused memory. The client module appears to the client system simply as a block device that can be configured in multiple ways.
• Storage: the “device” can be treated like storage. One can place any filesystem on
top of it and mount it like a regular filesystem.
[Figure: pyramid of memory levels, fastest to slowest: registers, cache, main memory, remote memory, disk, tape.]
Figure 3.1: Placement of distributed memory within the classical memory hierarchy.
• Memory Mapping: one can memory map the anemone device directly, creating the
view of a linear array of addresses within the application itself. This is a standard
practice by many applications, most popularly for the dynamic loading of libraries,
but can be made explicit through standard system calls.
• Paging Device: The system can be used for distributed memory paging directly by
the operating system. This is the mode we use to evaluate the system later on.
Whenever an LMA needs more virtual memory, the pager (swap daemon) in the client swaps pages out from the client to other server machines. As far as the pager is concerned, the client module is just a block device, not unlike a hard disk partition. Internally, however, the client module maps swapped-out pages to distributed memory servers. At a high level, our goal was to develop a prototype that realizes the view presented in Figure 3.1, where distributed memory represents a new level of the memory access hierarchy.
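The client-side indirection just described can be modeled in a few lines. This is a toy Python sketch, not the kernel module: the class name, method names, and dictionary representation are invented for illustration.

```python
class AnemoneClientMap:
    """Toy model of the client's page-to-server mapping.

    The pager sees an ordinary block device; internally, each
    swapped-out 4 KB page offset is remembered against the server
    that currently stores it.
    """
    PAGE = 4096

    def __init__(self):
        self.page_to_server = {}              # page number -> server id

    def page_out(self, offset, server):
        assert offset % self.PAGE == 0        # writes stay 4 KB-aligned
        self.page_to_server[offset // self.PAGE] = server

    def page_in(self, offset):
        # Which server holds this page? None means fall back to disk.
        return self.page_to_server.get(offset // self.PAGE)
```

A page-out records where the page went; a later page-in for the same block offset consults the map to locate the server to fetch from.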
[Figure: client-module components: Large-Memory App. (LMA), pager, RAM, block device interface, write-back cache, RMAP mapping/protocol intelligence, NIC.]
Figure 3.2: The components of a client.

The servers themselves are also regular machines, but have unused memory to contribute, and can in fact switch between the roles of client and server at different times, depending on their memory requirements. Client machines discover available servers by using a simple distributed resource discovery mechanism. Servers provide regular feedback about their load to clients, both as a part of the resource discovery process and as a part of the regular paging process (piggybacked on acknowledgments). Clients use this information to schedule page-out requests by choosing the least-loaded server node to send a new page. Also, both the clients and servers use a soft-state refresh protocol to maintain the liveness of pages stored at the servers. The earlier Anemone prototype [50] differed in that the page-to-server mapping logic was maintained at a central Memory Engine instead of at individual client nodes. Although simpler to implement, this centralized architecture incurred two extra round-trip times on every request, besides forcing all traffic through the central Memory Engine, which can become a single point of failure and a significant bottleneck.
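The client's server-selection policy reduces to picking the least-loaded server from the advertised load information. A minimal sketch, assuming loads are tracked as a mapping from server id to a load metric (the names and the load representation are illustrative):

```python
def choose_server(server_loads):
    """Pick the least-loaded memory server for the next page-out.

    `server_loads` maps server id -> advertised load (e.g. fraction of
    contributed memory in use), refreshed via discovery messages and
    load feedback piggybacked on acknowledgments.
    """
    if not server_loads:
        return None            # no servers available: page to disk instead
    return min(server_loads, key=server_loads.get)

print(choose_server({"s1": 0.9, "s2": 0.4, "s3": 0.7}))  # -> s2
```

Because the load figures ride on ordinary acknowledgments, the client's view stays current without dedicated control traffic.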
[Figure: server-module components: RAM, RMAP mapping/protocol intelligence, NIC.]
Figure 3.3: The components of a server.
3.2.1 Client and Server Modules
Figure 3.2 illustrates the client module that handles paging operations. It has four major components:
1. The Block Device Interface (BDI),
2. a basic LRU-based write-back cache,
3. mapping logic for server location of swapped-out pages, and
4. a Remote Memory Access Protocol (RMAP) layer.
The pager issues read and write requests to the BDI in 4KB data blocks. The device driver that exports the BDI is instructed to keep page write requests aligned on 4 KB boundaries.
(The usual sector size of a block devices is 512KB). The BDI, in turn, performs read and write operations to our write-back cache (for which pages do not get transmitted until evic- tion). When the cache is full, a page is evicted to a server using RMAP.Figure 3.3 illustrates the two major components of the server module: (1) a hash table that stores client pages CHAPTER 3. ANEMONE: DISTRIBUTED MEMORY ACCESS 24 along with the client’s identity (layer-2 MAC address) and (2) the RMAP layer. The server module can store/retrieve pages for any client machine. Once the server reaches capacity, it responds to the requesting client with a negative acknowledgment. It is then the client’s responsibility to select another server, if available, or to page to disk if necessary. Page-to- server mappings are kept in a standard chained hashtable. Linked-lists contained within each bucket hold 64-byte entries that are managed using the Linux slab allocator (which performs fine-grained management of small, equal-sized memory objects). Standard disk block devices interact with the kernel through a request queue mechanism, which per- mits the kernel to group spatially consecutive block I/Os (BIO) together into one “request” and schedule them using an elevator algorithm for seek-time minimization. Unlike disks,
Anemone is essentially a random access device with a fixed read/write latency. Thus, the
BDI does not need to group sequential BIOs. It can bypass request queues, perform out-of-order transmissions, and asynchronously handle unacknowledged, outstanding RMAP messages.
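The server's page store and the client's page-to-server mappings are both chained hash tables with small entries. A minimal userspace sketch of such a table, keyed by (client MAC address, page offset), is shown below; all names are illustrative, and the real module allocates its 64-byte entries from the Linux slab allocator rather than malloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NBUCKETS  1024

/* One 4 KB page stored on behalf of a client, chained within a bucket. */
struct page_entry {
    uint8_t  mac[6];            /* client identity: its layer-2 MAC address */
    uint64_t offset;            /* page offset on the client's block device */
    void    *page;              /* the stored 4 KB of data                  */
    struct page_entry *next;
};

static struct page_entry *buckets[NBUCKETS];

static unsigned bucket_of(const uint8_t mac[6], uint64_t off)
{
    unsigned h = 5381;          /* djb2-style hash over (MAC, offset) */
    for (int i = 0; i < 6; i++)
        h = h * 33 + mac[i];
    return (h * 33 + (unsigned)(off ^ (off >> 32))) % NBUCKETS;
}

int store_page(const uint8_t mac[6], uint64_t off, const void *data)
{
    struct page_entry *e = malloc(sizeof *e);
    if (!e || !(e->page = malloc(PAGE_SIZE)))
        return -1;              /* out of memory: client would get a NACK */
    memcpy(e->mac, mac, 6);
    e->offset = off;
    memcpy(e->page, data, PAGE_SIZE);
    e->next = buckets[bucket_of(mac, off)];
    buckets[bucket_of(mac, off)] = e;
    return 0;
}

const void *lookup_page(const uint8_t mac[6], uint64_t off)
{
    for (struct page_entry *e = buckets[bucket_of(mac, off)]; e; e = e->next)
        if (e->offset == off && memcmp(e->mac, mac, 6) == 0)
            return e->page;
    return NULL;                /* not stored on this server */
}
```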
3.2.2 Remote Memory Access Protocol (RMAP)
RMAP is a tailor-made, low-overhead communication protocol for distributed memory access within the same subnet. It implements the following features: (1) reliable packet delivery, (2) flow control, and (3) fragmentation and reassembly. While one could technically communicate over TCP, UDP, or even the IP protocol layer, this choice comes burdened with unwanted protocol processing. Instead, RMAP takes an integrated, faster approach by communicating directly with the network device driver, sending frames and handling reliability in a manner that suits the needs of the Anemone system. Every
RMAP message is acknowledged except for soft-state and dynamic discovery messages.
Timers trigger retransmissions when necessary (which is extremely rare) to guarantee reliable delivery. A paging request cannot be allowed to be lost: the application that depends on that page would fail altogether. RMAP also implements flow control to ensure that it does not overwhelm either the receiver or the intermediate network cards and switches.
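The acknowledgment and retransmission bookkeeping can be illustrated with a toy, tick-driven model; the window size, timeout constant, and function names below are invented for illustration and are not taken from the actual RMAP implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OUTSTANDING 64
#define RETX_TIMEOUT    3       /* ticks before an unacked message is resent */

struct pending {
    unsigned seq;               /* sequence number of the RMAP message   */
    int      age;               /* ticks since the last (re)transmission */
    bool     in_use;
};

static struct pending outstanding[MAX_OUTSTANDING];

/* Record a newly transmitted message so its delivery can be tracked. */
void rmap_send(unsigned seq)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (!outstanding[i].in_use) {
            outstanding[i] = (struct pending){ seq, 0, true };
            return;
        }
}

/* An acknowledgment confirms delivery; stop tracking the message. */
void rmap_ack(unsigned seq)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (outstanding[i].in_use && outstanding[i].seq == seq)
            outstanding[i].in_use = false;
}

/* Advance the timer one tick; "retransmit" anything that timed out.
 * Returns the number of retransmissions triggered (rare in practice). */
int rmap_tick(void)
{
    int retx = 0;
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (outstanding[i].in_use && ++outstanding[i].age >= RETX_TIMEOUT) {
            outstanding[i].age = 0;
            retx++;
        }
    return retx;
}
```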
The performance of any distributed system is heavily influenced by the networking requirements imposed, including both the design of the network and the application's requirements. To minimize latency and protocol-related processing overhead, we made a conscious choice to eliminate the use of TCP/IP and write a simpler, lightweight protocol. The subset of networking functions our system needs in the kernel is significantly smaller than the full set provided by the combination of TCP and IP in a cluster of machines. Four of the most prominent features that we do not include are:
• Port Abstraction: Our system has no use for the concept of ports, application-level
socket buffers, byte-streams, or in-order delivery. Since our system operates at the
block-I/O level, these mostly application-driven requirements disappear.
• IP Addresses: The system does not operate across routed IP subnets, nor do we plan to support this due to the performance overheads involved. Routed operation would detract from the distributed nature of the system and create unwanted link-congestion bottlenecks with flows from other networks; it is not the kind of problem we are trying to attack. As a result, the way one node addresses and communicates with another is simplified. We also found a custom protocol much easier to maintain in the kernel, because clients and servers can address each other over the network directly, without juggling IP addresses and socket error handling.
• Fragmentation: With the right use of the Linux networking API, this turned out to be a far simpler problem to solve: today's Linux provides a good enough design abstraction to deploy a non-IP-based, zero-copy fragmentation solution. Furthermore, our protocol can auto-detect the MTU of the system's NIC and automatically send larger packets (so-called "jumbo" frames) if the card supports them, since we have no need for multi-network ICMP MTU discovery (assuming that all hops in the network support the same MTU size).
• Segmentation Offload: The performance of 10-gigabit and higher-speed networks depends heavily on TCP segmentation and checksum offloading. It is becoming quite commonplace to find gigabit cards with offloading engines that the kernel can exploit, and recent 2.6 kernels have integrated the zero-copy use of segmentation into their TCP/IP APIs. We have observed that, under a highly active system, the network can easily exhibit full-speed workloads. Since we use RMAP, this potentially frees up segmentation offloading for application-level networking traffic that might be running concurrently within the same guest VM.

Figure 3.4 depicts what a typical Anemone packet header looks like.

Figure 3.4: The RMAP header format: type, status, sequence number, session ID, and fragmentation flags, followed by a union of advertisement fields (load status, load capacity) or page request fields (offset, size), page data (if any), and the encapsulating Ethernet header. The RMAP layer transmits these packets directly to the network card from the BDI device driver.
The last design consideration in RMAP is that while the standard memory page size is 4 KB (although it is not uncommon for an operating system to employ 4 MB super-pages for better use of the translation lookaside buffer), the maximum transmission unit (MTU) in traditional Ethernet networks is limited to 1500 bytes. RMAP therefore implements dynamic fragmentation and reassembly for paging traffic. Additionally, RMAP has the flexibility to use Jumbo frames: packets larger than 1500 bytes (typically between 8 KB and 16 KB). Jumbo frames enable RMAP to transmit a complete 4 KB page to a server in a single packet, without fragmentation. Our testbed includes an 8-port switch that supports Jumbo frames (9 KB packet size), and we observe a 6% speedup in RMAP throughput by using them. However, in this Chapter, we conduct all experiments with 1500-byte MTUs, with fragmentation and reassembly performed by RMAP.
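The arithmetic behind this fragmentation can be sketched as follows; the 32-byte per-fragment header size is an assumption for illustration, not the actual RMAP header length.

```c
#include <assert.h>

#define PAGE_SIZE 4096
#define HDR_SIZE  32            /* assumed per-fragment RMAP header size */

/* Number of fragments needed to carry one 4 KB page at a given MTU. */
int fragments_needed(int mtu)
{
    int payload = mtu - HDR_SIZE;
    return (PAGE_SIZE + payload - 1) / payload;   /* ceiling division */
}

/* Fill lens[] with each fragment's payload length; returns the count. */
int fragment_page(int mtu, int lens[], int max_frags)
{
    int payload = mtu - HDR_SIZE, n = 0, left = PAGE_SIZE;
    while (left > 0 && n < max_frags) {
        lens[n++] = left < payload ? left : payload;
        left -= payload;
    }
    return n;
}
```

Under this assumed header size, a standard 1500-byte MTU splits a 4 KB page into three fragments, while a 9 KB jumbo frame carries the whole page in a single packet, matching the behavior described above.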
3.2.3 Distributed Resource Discovery
As servers constantly join or leave the network, Anemone can (a) seamlessly absorb increases and decreases in cluster-wide memory capacity, insulating LMAs from resource fluctuations, and (b) allow any server to reclaim part or all of its contributed memory. This objective is achieved through the distributed resource discovery described below and the soft-state refresh described in Section 3.2.4. Clients can discover newly available distributed memory in the cluster, and servers can announce their memory availability. Each server periodically broadcasts a Resource Announcement (RA) message (one message every 10 seconds in our prototype) to advertise its identity and the amount of memory it is willing to contribute. Besides RAs, servers also piggyback their memory availability information on their page-in/page-out replies to individual clients. This distributed mechanism permits any new server in the network to dynamically announce its presence and allows existing servers to announce up-to-date memory availability to clients.
3.2.4 Soft-State Refresh
Distributed Anemone also includes soft-state refresh mechanisms (keep-alives) to permit clients to track the liveness of servers and vice versa. Firstly, the RA message serves the additional purpose of informing clients that the server is alive and accepting paging requests. In the absence of any paging activity, if a client does not receive a server's RA for three consecutive periods, it assumes that the server is offline and deletes the server's entries from its hash tables. If the client had pages stored on a server that went offline, it needs to recover the corresponding pages from a copy stored either on the local disk or in another server's memory. Soft-state also permits servers to track the liveness of clients whose pages they store. Each client periodically transmits a Session Refresh message to each server that hosts its pages (one message every 10 seconds in our prototype), carrying a client-specific session ID. The client module generates a different, unique ID each time the client restarts. If a server does not receive refresh messages with matching session IDs from a client for three consecutive periods, it concludes that the client has failed or rebooted and frees any pages stored on that client's behalf.
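Resource announcements and soft-state refresh follow the same pattern on both sides: any message resets a miss counter, and three silent periods mark the peer dead. The sketch below models that logic; only the three-period threshold comes from the prototype, while the structure and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MISS_LIMIT 3            /* consecutive silent periods before eviction */

struct peer {
    uint64_t avail_pages;       /* capacity from the last RA or piggybacked reply */
    uint32_t session_id;        /* regenerated each time a client restarts        */
    int      missed;
    bool     alive;
};

/* Handle an RA or Session Refresh message. Returns true if the peer
 * restarted (its session ID changed), so stale pages should be freed. */
bool on_refresh(struct peer *p, uint64_t avail, uint32_t session)
{
    bool restarted = p->alive && session != p->session_id;
    p->session_id  = session;
    p->avail_pages = avail;
    p->missed      = 0;
    p->alive       = true;
    return restarted;
}

/* Called once per announcement period for peers that stayed silent. */
void on_period_elapsed(struct peer *p)
{
    if (p->alive && ++p->missed >= MISS_LIMIT)
        p->alive = false;       /* evict its entries; recover pages if needed */
}
```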
3.2.5 Server Load Balancing
Memory servers themselves are commodity nodes in the network that have their own processing and memory requirements. Hence, another design goal of Anemone is to avoid overloading any particular server node as far as possible by transparently distributing the paging load evenly. In the earlier centralized architecture, this function was performed by the Memory Engine, which kept track of server utilization levels. Distributed Anemone implements additional coordination among servers and clients to exchange accurate load information. Section 3.2.3 described the mechanism used to perform resource discovery. Clients use the server load information gathered from resource discovery to decide which server should receive new page-out requests. This decision is based upon one of two criteria: (1) the number of pages stored at each active server, or (2) the number of paging requests serviced by each active server. While (1) attempts to balance memory usage at each server, (2) attempts to balance the request-processing overhead.
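Either criterion reduces to picking the least-loaded active server, as in this hypothetical sketch (the names and tie-breaking rule are ours, not the prototype's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct srv {
    uint64_t pages;             /* pages currently stored at this server */
    uint64_t requests;          /* paging requests it has serviced       */
    bool     active;
};

enum policy { BY_PAGES, BY_REQUESTS };

/* Return the index of the least-loaded active server under the chosen
 * criterion, or -1 if none is active (fall back to the local disk). */
int pick_server(const struct srv *s, int n, enum policy p)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!s[i].active)
            continue;
        uint64_t load = (p == BY_PAGES) ? s[i].pages : s[i].requests;
        if (best < 0 ||
            load < ((p == BY_PAGES) ? s[best].pages : s[best].requests))
            best = i;
    }
    return best;
}
```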
3.2.6 Fault-tolerance
The ultimate consequence of failure in swapping to distributed memory is no worse than failure in swapping to local disk. However, the probability of failure is greater in a LAN environment because of the multiple components involved in the process, such as network cards, connectors, and switches. Although RMAP provides reliable packet delivery at the protocol level, as described in Section 3.2.2, our future work plans to build two alternatives for tolerating server failures: (1) maintain a local disk-based copy of every memory page swapped out over the network, which provides the same level of reliability as disk-based paging but risks performance interference from local disk activity; or (2) keep redundant copies of each page on multiple distributed servers, which avoids disk activity and reduces recovery time but consumes bandwidth, reduces the global memory pool, and is susceptible to network failures. In an ideal implementation, the memory servers would participate in a protocol similar to RAID-5 [26].
3.3 Evaluation
The Anemone testbed consists of one 64-bit, low-memory 2.0 GHz AMD client machine with 256 MB of main memory and nine distributed-memory servers: four 512 MB machines, three 1 GB machines, one 2 GB machine, and one 3 GB machine, totaling almost 9 GB of distributed memory. The 512 MB servers have Intel processors ranging from 800 MHz to 1.7 GHz; the other five machines are 2.7 GHz and above Intel Xeons, with a mix of PCI and PCI Express motherboards. For disk-based tests, we used a Western Digital WD800JD 80 GB SATA disk with a 7200 RPM spindle speed, 8 MB of cache, and an 8.9 ms average seek time (consistent with our measured results). This disk has a 10 GB swap partition reserved on it to match the equivalent amount of distributed memory available in the cluster, which we use exclusively when
comparing our system against the disk. Each machine is equipped with an Intel PRO/1000
gigabit Ethernet card connected to one of two 8-port gigabit switches, one from Netgear
and one from SMC. The performance results presented below can be summarized as
follows. Distributed Anemone reduces read latencies to an average of 160 µs, compared to
8.3 ms average for disk and 500 µs average for centralized Anemone. For writes, both disk and Anemone deliver similar latencies due to write caching. In our experiments, Anemone delivers a factor of 1.5 to 4 speedup for single-process LMAs, and up to a factor of 14 speedup for multiple concurrent LMAs. The system operates successfully with both multiple clients and multiple servers; we also run experiments in which multiple client machines access the memory system simultaneously. These results are equally successful as in the single-process cases.

Figure 3.5: Random read latency CDF (500,000 random reads to a 6 GB space; Anemone vs. local disk, latency in microseconds, logscale).
3.3.1 Paging Latency
To begin the experiments, we first characterize the microbenchmark behavior we observe for different types of I/O, for both read and write streams. The next four graphs present these results for both our memory system and the disk. Figures 3.5, 3.6, 3.7, and 3.8 show the distribution of observed read and write latencies for sequential and random access patterns with both Anemone and disk. Though real-world applications rarely generate purely sequential or completely random memory access patterns, these graphs provide a useful measure for understanding the underlying factors that impact application execution times. Most random read requests to disk experience a latency between 5 and 10 milliseconds, whereas most requests in Anemone experience only around 160 µs. Most sequential read requests to disk are serviced by the on-board disk cache within 3 to 5 µs, because sequential read accesses fit well with the motion of the disk head. In contrast, Anemone delivers a range of latency values, most below 100 µs. This
is because network communication latency dominates in Anemone even for sequential requests, though it is masked to some extent by the prefetching performed by the pager and the file system within the Linux kernel. The write latency distributions for both disk and Anemone are comparable, with most latencies close to 9 µs, because writes typically return after writing to the local Linux buffer cache (which is a unified page cache as of Linux 2.6).

Figure 3.6: Sequential read latency CDF (500,000 sequential reads to a 6 GB space; Anemone vs. local disk).

Figure 3.7: Random write latency CDF (500,000 random writes to a 6 GB space; Anemone vs. local disk).

Figure 3.8: Sequential write latency CDF (500,000 sequential writes to a 6 GB space; Anemone vs. local disk).

Table 3.1: Average application execution times (in seconds) and speedups for local memory, Distributed Anemone, and disk. N/A indicates insufficient local memory.

Application   Size (GB)   Local Mem   Anemone   Disk    Speedup (Disk/Anemone)
POV-Ray       3.4         145         1996      8018    4.02
Quicksort     5           N/A         4913      11793   2.40
NS2           1           102         846       3962    4.08
KNN           1.5         62          721       2667    3.7
3.3.2 Application Speedup
Single-Process LMAs: Table 3.1 summarizes the performance improvements seen by unmodified single-process LMAs using the Anemone system. This setup is similar to the previous microbenchmarks: a single LMA process on a single client node uses the memory system, with all nine available servers at its disposal. The first application is a ray-tracing program called POV-Ray [81]. The memory consumption of POV-Ray was varied by rendering different scenes with an increasing number of colored spheres. Figure 3.9 shows the completion times of these increasingly large renderings, up to 3.4 GB of memory, versus the disk using an equal amount of local swap space.

Figure 3.9: Execution times of POV-Ray for increasing problem sizes (render time in seconds vs. scene memory in MB; local memory, Anemone, and local disk).

The figure clearly shows that Anemone delivers increasing application speedups with increasing memory usage, improving the execution time of a single-process POV-Ray rendering by a factor of 4 at 3.4 GB of memory usage. The second application is a large in-memory Quicksort program that uses a C++ STL-based implementation [89], with a complexity of O(N log N) comparisons. We sorted randomly populated large in-memory arrays of integers. Figure 3.10 shows that Anemone delivers a factor of 2.4 speedup for a single-process Quicksort using 5 GB of memory. The third application is the popular NS2 network simulator [75]. We simulated a delay partitioning algorithm [42] on a 6-hop wide-area network path using voice-over-IP traffic traces. Factors contributing to memory usage in NS2 include the number of nodes being simulated, the amount of traffic sent between nodes, and the choice of protocols at different layers. Table 3.1 shows that, with NS2 requiring
Figure 3.10: Execution times of STL Quicksort for increasing problem sizes (sort time in seconds vs. sort size in MB; local memory, Anemone, and local disk).
1 GB of memory, Anemone speeds up the simulation by a factor of 4 compared to disk-based paging. The fourth application is the k-nearest neighbor (KNN) search algorithm on large 3D datasets, using code from [29]. This algorithm is useful in applications such as medical imaging, molecular biology, CAD/CAM, and multimedia databases. Table 3.1 shows that, when executing the KNN search algorithm over a dataset of 2 million points consuming 1.5 GB of memory, Anemone speeds up the search by a factor of 3.7 over disk-based paging.
Multiple Concurrent LMAs: In this section, we test the performance of Anemone under varying levels of concurrent application execution. Multiple concurrently executing LMAs tend to stress the system by competing for computation, memory, and I/O resources and by disrupting any sequentiality in paging activity, including competition for buffer space on the network switch itself, particularly at gigabit speeds. Figures 3.11 and 3.12 show the execution time comparison of Anemone and disk as the number of POV-Ray and Quicksort processes increases. The execution time measures the interval between the start of execution and the completion of the last process in the set. We keep each process at around 100 MB of memory. The figures show that execution times using disk-based swap increase steeply with the number of processes. Paging activity loses out sequentiality to
Figure 3.11: Execution times of multiple concurrent processes executing POV-Ray (render time in seconds vs. number of concurrent processes; Anemone vs. local disk).
Figure 3.12: Execution times of multiple concurrent processes executing STL Quicksort (sort time in seconds vs. number of concurrent processes; Anemone vs. local disk).
the memory system performance with an increasing number of processes, making disk seek and rotational overheads dominant. Anemone, on the other hand, reacts very well: execution time increases only slowly, because network latencies are mostly constant regardless of sequentiality. With 12–18 concurrent LMAs, Anemone achieves speedups of a factor of 14 for POV-Ray and a factor of 6.0 for Quicksort.

Figure 3.13: Effects of varying the maximum transmission window size on a 1 GB Quicksort (bandwidth achieved in Mbit/s, number of retransmissions, and completion time in seconds; logscale).
3.3.3 Tuning the Client RMAP Protocol
One of the important knobs in RMAP's flow-control mechanism is the client's transmission window size. Using a 1 GB Quicksort, Figure 3.13 shows the effect of changing this window size on three characteristics of Anemone's performance: (1) the number of retransmissions; (2) paging bandwidth, expressed as "goodput", i.e., the bandwidth obtained after excluding retransmitted bytes and header bytes; and (3) completion time. Recall that our implementation of the RMAP protocol uses a static window size, configured once before runtime. This means that the flow control is not fully dynamic in the sense you would expect from a TCP-style protocol. As a result, the window size is chosen empirically to be large enough to maintain network throughput but small enough to fit within the capabilities of the NIC's ring buffers. A complete implementation of RMAP would provide a dynamic flow-control window, but we leave that to future work.
To demonstrate this, Figure 3.13 shows that as the window size increases, the number of retransmissions increases, because the number of packets that can potentially be delivered back-to-back also increases. For larger window sizes, the paging bandwidth also increases and then saturates, because the transmission link remains busy more often, delivering higher goodput in spite of an initial increase in retransmissions. However, if driven too high, the window size causes the paging bandwidth to decline considerably due to an increasing number of packet drops and retransmissions. Application completion times depend upon the paging bandwidth: initially, an increase in window size increases the paging bandwidth and lowers completion times; driven too high, it causes more packet drops, more retransmissions, lower paging bandwidth, and higher completion times.
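The static window itself amounts to a simple in-flight counter: sends stall once the configured limit is reached, and each acknowledgment frees a slot. A minimal sketch follows (illustrative names; the prototype's actual accounting may differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Static transmission window: at most `window` unacked packets in flight. */
struct tx_window {
    int window;                 /* configured once, before runtime */
    int in_flight;
};

bool can_send(const struct tx_window *w)
{
    return w->in_flight < w->window;
}

/* Returns false when the window is full: the caller must wait for an ACK. */
bool tx(struct tx_window *w)
{
    if (!can_send(w))
        return false;
    w->in_flight++;
    return true;
}

void on_ack(struct tx_window *w)
{
    if (w->in_flight > 0)
        w->in_flight--;
}
```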
3.3.4 Control Message Overhead
To quantify the control-traffic overhead of RMAP, we measured the percentage of control bytes generated by RMAP relative to the data bytes transferred while executing a 1 GB POV-Ray application. Control traffic refers to page headers, acknowledgments, resource announcement messages, and soft-state refresh messages. We first varied the number of servers from 1 to 6, with a single client executing the POV-Ray application. Next, we varied the number of clients from 1 to 4 (each executing one instance of POV-Ray), with 3 memory servers. The control-traffic overhead was consistently measured at 1.74% – a very small fraction of the total paging traffic.
3.4 Summary
In this Chapter, we presented Distributed Anemone – a system that enables unmodified large-memory applications to transparently utilize the unused memory of nodes across a gigabit Ethernet LAN. Unlike its centralized predecessor, Distributed Anemone features fully distributed memory resource management, low-latency distributed memory paging, distributed resource discovery, load balancing, soft-state refresh to track the liveness of nodes, and the flexibility to use Jumbo Ethernet frames. We presented the architectural design and implementation details of a fully operational Anemone prototype. Evaluations using multiple real-world applications, including ray tracing, large in-memory sorting, network simulation, and nearest-neighbor search, show that Anemone speeds up single-process applications by up to a factor of 4 and multiple concurrent processes by up to a factor of 14, compared to disk-based paging. Average page-fault latencies are reduced from 8.3 ms with disk-based paging to 160 µs with Anemone.

Chapter 4
MemX: Virtual Machine Uses of Distributed Memory
In this Chapter, we present our experiences in developing a fully transparent distributed system, called MemX, within the Xen VM environment, which coordinates the use of cluster-wide memory resources to support large memory workloads.
4.1 Introduction
In modern cluster-based platforms, VMs can enable functional and performance isolation across applications and services. VMs also provide greater resource allocation flexibility, improve utilization efficiency, enable seamless load balancing through VM migration, and lower the operational cost of the cluster. Consequently, VM environments are increasingly being considered for executing grid and enterprise applications over commodity high-speed clusters. However, such applications tend to have memory workloads that can stress the limited resources within a single VM by demanding more memory than the slice available to the VM. Clustered bastion hosts (mail, network-attached storage), data mining applications, scientific workloads, virtual private servers, and backend support for websites are common examples of resource-intensive workloads. I/O bottlenecks in these applications can quickly form due to frequent access to large disk-resident datasets, paging activity, flash crowds, or competing VMs on the same node. Even though virtual machines with demanding workloads are here to stay as integral parts of modern clusters, significant improvements are needed in the ability of memory-constrained VMs to handle these workloads.
I/O activity due to memory pressure can prove particularly expensive in a virtualized environment, where all I/O operations need to traverse an extra layer of indirection.
Over-provisioning of memory resources (and, in general, any hardware resource) within a physical machine may not always be a viable solution, as it can lead to poor resource utilization efficiency besides increasing operational costs. Although domain-specific out-of-core computation techniques [56, 65] and migration strategies [71, 17, 27] can also improve application performance to a certain extent, they do not overcome the fundamental limitation that an application is restricted to the memory resources within a single physical machine, particularly for the aforementioned applications that are generally not parallelized.
In this Chapter, we present the design, implementation, and evaluation of the MemX system for VMs, which bridges the I/O performance gap in a virtualized environment by exploiting low-latency access to the memory of other nodes across a gigabit cluster. MemX is fully transparent to user applications: developers do not need any specialized APIs, libraries, recompilation, or relinking, nor does the application's dataset need any special pre-processing, such as data partitioning across nodes. We compare and contrast the three modes in which MemX can operate with Xen VMs [20]:
1. MemX-DomU: the system runs within individual guest virtual machines. The 'U' in "DomU" refers to the fact that guest domains are "unprivileged" relative to Dom0.

2. MemX-DD: the system runs within a common driver domain (DD); in this case, Dom0 functions as the DD, and the system is shared by the multiple guest OSes that co-reside with the DD. We use "DD" to indicate that, although the client module runs in the same place as in the MemX-Dom0 case (within Domain Zero itself), it is actually used by applications located within guest VMs (DomUs) rather than by applications within the driver domain itself.

3. MemX-Dom0: the distributed memory system runs within the privileged management domain, called "Dom0" in Xen terms. This represents the base virtualization overhead without the presence of other guest virtual machines.
The proposed techniques can also work with VM technologies other than Xen; we focus on Xen mainly due to its open-source availability and para-virtualization support. In the performance section, we also compare all three options to the baseline case in which a regular, non-virtualized Linux system is used, as described in Chapter 3.
4.2 Split Driver Background
As we stated in Chapter 2, Xen is an open-source virtualization technology that provides secure resource isolation. Xen provides close to native machine performance through the use of para-virtualization [97] – a technique by which the guest OS is co-opted into reducing the virtualization overheads via modifications to its hardware-dependent components. The modifications enable the guest OS to execute over virtualized hardware and devices rather than over bare metal. In this section, we review the background of the Xen I/O subsystem as it relates to the design of MemX.
Xen exports I/O devices to each guest OS (DomU) as virtualized views of "class" devices as opposed to real physical devices. For example, Xen exports a block device or a network device, rather than a specific hardware make and model. The actual drivers that interact with the native hardware devices execute within Dom0 – the privileged domain that can directly access all hardware in the system. Dom0 acts as the management VM that coordinates device access and privileges among all of the other guest domains. In the rest of the Chapter, we use the terms driver domain and Dom0 interchangeably.
Physical devices (and their device drivers) can be multiplexed among multiple concurrently executing guest OSes. To enable this multiplexing, the privileged driver domain and the unprivileged guest domains (DomU) communicate by means of a split device-driver architecture, shown in Figure 4.1. The driver domain hosts the backend of the split driver for a device class, and the DomU hosts the frontend. Backends and frontends interact using high-level device abstractions instead of low-level hardware-specific mechanisms. For example, a DomU only cares that it is using a block device; it does not worry about the specific type of driver controlling that block device.

Figure 4.1: Split device driver architecture in Xen: a frontend driver in the guest OS communicates with a backend driver (and the native driver) in the driver domain via event channels and the active grant table, through the Xen hypervisor's safe hardware interface to the physical device, using hypercalls and callbacks.
Frontends and backends communicate with each other via the grant table: an in-memory communication mechanism that enables efficient bulk data transfers across domain boundaries. The grant table enables one domain to give another domain access to its pages in system memory; the access mechanism can include read, write, or mutual exchange of pages. The primary use of the grant table in device I/O is to provide a fast and secure mechanism for unprivileged DomU domains to receive indirect access to hardware devices. It enables the driver domain to set up a DMA-based data transfer directly to or from the system memory of a DomU, rather than performing the DMA to or from the driver domain's memory and additionally copying the data between the DomU and the driver domain. In other words, the grant table enables zero-copy data transfers across domain boundaries.
The grant table can be used either to share or to transfer pages between the DomU and the driver domain, depending upon whether the I/O operation is synchronous or asynchronous. For example, because block devices perform synchronous data transfers, the driver domain knows at the time of I/O initiation which DomU requested the block I/O. In this case, the frontend of the block driver in the DomU notifies the Xen hypervisor (via the gnttab_grant_foreign_access hypercall) that a memory page can be shared with the driver domain. A hypercall is the hypervisor's equivalent of a system call in an operating system. The DomU then passes a grant table reference ID via the event channel to the driver domain, which sets up a direct DMA to/from the memory page of the DomU. Once the DMA is complete, the DomU removes the grant reference (via the gnttab_end_foreign_access call). On the other hand, network devices receive data asynchronously: the driver domain does not know the target DomU for an incoming packet until the entire packet has been received and its header examined. In this situation, the driver domain DMAs the packet into its own page and notifies the Xen hypervisor (via the gnttab_grant_foreign_transfer call) that the page can be transferred to the target DomU. The driver domain then transfers the received page to the target DomU and receives a free page in return. In summary, Xen's I/O subsystem for shared physical devices uses a split driver architecture that involves an additional level of indirection through the driver domain and the Xen hypervisor, with efficient optimizations to avoid data copying during bulk data transfers.
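The share-and-DMA sequence for synchronous block I/O can be modeled conceptually in userspace. The sketch below only mimics the shape of the flow (grant, resolve, revoke); the function names loosely echo the hypercall wrappers, but this is not the Xen grant table API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_GRANTS 16

struct grant { void *page; bool valid; };
static struct grant grant_table[MAX_GRANTS];

/* DomU side: grant the driver domain access to a page; the returned
 * reference ID would be passed over the event channel. */
int grant_foreign_access(void *page)
{
    for (int i = 0; i < MAX_GRANTS; i++)
        if (!grant_table[i].valid) {
            grant_table[i] = (struct grant){ page, true };
            return i;
        }
    return -1;
}

/* Driver-domain side: resolve the reference and write directly into the
 * DomU's page -- the zero-copy step (a real DMA in the actual system). */
bool dma_into_grant(int ref, const void *data, size_t len)
{
    if (ref < 0 || ref >= MAX_GRANTS || !grant_table[ref].valid)
        return false;
    memcpy(grant_table[ref].page, data, len);
    return true;
}

/* DomU side: revoke access once the I/O completes. */
void end_foreign_access(int ref)
{
    if (ref >= 0 && ref < MAX_GRANTS)
        grant_table[ref].valid = false;
}
```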
4.3 Design and Implementation
The core functionality of the MemX system partially builds upon our previous work and is
encapsulated within kernel modules that do not require modifications to either the Linux
kernel or the Xen hypervisor. However, the interaction of the core modules with the rest of
the virtualized subsystem presents several alternatives. In this section, we briefly discuss
the different design alternatives for the MemX system, justify the decisions we make, and
present the implementation details.
Figure 4.2: MemX-Linux: Baseline operation of MemX in a non-virtualized Linux environment. The client can communicate with multiple memory servers across the network to satisfy the memory requirements of large memory applications.
4.3.1 MemX-Linux: MemX in Non-virtualized Linux
Figure 4.2 shows the operation of MemX in a non-virtualized (vanilla) Linux environment.
An earlier variant of MemX-Linux was published in [51]; MemX-Linux includes several additional features listed later in this section. For completeness, we summarize the architecture of MemX-Linux here and use it as a baseline for comparison with the virtualized versions of MemX – the primary focus of this work.
The two main components of MemX-Linux are the client module on the low-memory machines and the server module on the machines with unused memory. The two communicate with each other using a remote memory access protocol (RMAP), described in detail in Chapter 3, Section 3.2.2. Both client and server components execute as self-contained Linux kernel modules. Aside from optimizations, this code operates much as described in Chapter 3. Nevertheless, there are a number of important changes to that work, which we briefly summarize here.
Client and Server Modules: The client module provides a virtualized block device interface to the large dataset applications executing on the client machine. This block device can either be: a) configured as a low-latency primary swap device, b) treated as a low-latency volatile store for large data sets accessed via the standard file-system interface, or c) memory-mapped into the address space of an executing large memory application. Internally, the client module maps the single linear I/O space of the block device to the unused memory of multiple distributed servers, using a memory-efficient radix-tree based mapping. The old system used a hashtable-based implementation, but we found that it consumed a large amount of memory for the table data structure itself (buckets and entries), particularly as we acquired newer machines with substantially more memory than our old ones. A radix tree is a modified trie in which the tree is traversed using the digits of a key, one digit at a time. This structure works well for keys such as addresses and file offsets – key types that are used throughout the distributed memory system. The client module discovers and communicates with distributed server modules using a custom-designed, reliable protocol. Servers broadcast periodic resource announcement messages, which the client modules use to discover the available memory servers. Servers also include feedback about their memory availability and load during both resource announcements and regular page transfers with clients. When a server reaches capacity, it declines to serve any new write requests from clients, which then try to select another server, if available, or otherwise write the page to disk. Binding these modules together is the Remote Memory Access Protocol (RMAP), which we describe here in much more detail than in the previous chapter. The server module is also designed to allow a server node to be taken down while live: our RMAP implementation can disperse, re-map, and load-balance an individual server's pages to any other servers in the cluster that are capable of absorbing those pages, allowing the server to shut down without killing any of its clients' applications. Getting this custom protocol to work properly in a virtualized environment exposed a great number of kernel bugs that were not originally present in the Anemone prototype, making the system much more robust.
Additional Virtualization Features: MemX also includes a couple of additional features that are not the specific focus of this work, and which were not present in the original Anemone system either. The first is the ability to support named distributed memory data spaces that can be shared by multiple clients. This provides a read-only DSM system in which data stored on server nodes remains persistent through the life of the server, even when client nodes disconnect from the system altogether. When a client re-connects, all servers that have past records for that client re-forward the necessary mapping information, allowing the client to re-construct its radix-tree mappings and begin re-accessing the same persistent data. The system does not allow multiple concurrent writers, however, as this was not the focus of our work. Two other features turned out to be very important to the memory system as a whole for virtualization-specific reasons. First, because of the split-driver architecture described earlier, the device driver needs to support multiple major and minor block numbers. The driver is then responsible for mapping one device per local virtual machine on a physical host, allowing completely seamless, transparent access by multiple VM clients on the same host. This was part of the motivation behind switching from a hashtable to a radix tree: the mapping data stored to look up page locations would be relatively large for so many virtual machines, on the order of tens of megabytes. It turned out that the worst-case lookup time for the tree was comparable to that of the hashtable and did not detract from the efficient performance of the system. Second, we had to optimize the fragmentation implementation designed in the Anemone system: it had to support zero-copy transmission and receipt of page fragments, which would not have worked properly in the original system.
4.3.2 MemX-DomU (Option 1): MemX Client Module in DomU
In order to support large dataset applications within a VM environment, the simplest design option is to place the MemX client module within the kernel of each guest OS (DomU), whereas the distributed server modules continue to execute within a non-virtualized Linux kernel on machines connected to the physical network. This option is illustrated in Figure 4.3.
Figure 4.3: MemX-DomU: Inserting the MemX client module within DomU's Linux kernel. The server executes in non-virtualized Linux.
The client module exposes the block device interface to large memory applications within the
DomU as in the baseline, but communicates with the distributed server using the virtualized network interface (VNIC) exported by the network driver in the driver domain. The VNIC in
Xen is also organized as a split device driver in which the frontend (residing in the guest
OS) and the backend (residing in the driver domain) talk to each other using well-defined grant table and event channel mechanisms. Two event channels are used between the backend and frontend of the VNIC – one for packet transmissions and one for packet receptions. To perform zero-copy data transfers across the domain boundaries, the VNIC performs a page exchange with the backend for every packet received or transmitted using the grant table. All backend interfaces in the driver domain can communicate with the physical NIC as well as with each other via a virtual network bridge. Each VNIC is assigned its own MAC address whereas the driver domain's own internal VNIC in Dom0 uses the physical NIC's MAC address. The physical NIC itself is placed in promiscuous mode by the driver domain to enable the reception of any packet addressed to any of the local virtual machines. The virtual bridge demultiplexes incoming packets directed towards the target
VNIC’s backend driver.
Compared to the baseline non-virtualized MemX-Linux deployment, MemX-DomU has the additional overhead of requiring every network packet to traverse domain boundaries in addition to being multiplexed or demultiplexed at the virtual network bridge.
Additionally, the client module needs to be separately inserted within each DomU that might potentially execute large memory applications. Also note that each I/O request is typically 4KBytes in size, whereas our network hardware uses a 1500-byte MTU (maximum transmission unit), unless the underlying network supports jumbo frames. Thus the client module needs to fragment each 4KByte write request into (and reassemble a complete read reply from) at least 3 network packets. In MemX-DomU, each fragment needs to traverse the domain boundary to reach the backend. Due to current memory allocation policies in Xen, buffering for each fragment ends up consuming an entire 4KByte page worth of memory, which results in three times the actual memory needed within the machine. In the non-virtualized case, each of those fragments would come from the same physical page because of the internal Linux slab allocator; virtualization, however, requires those fragments to be separated out. Newer Xen versions may offer solutions to this type of problem, but we leave exploring them to future work. We contrast this performance overhead in greater detail with MemX-DD (option 2) below.
4.3.3 MemX-DD (Option 2): MemX Client Module in Driver Domain
A second design option is to place the MemX client module within the driver domain
(Dom0) and allow multiple DomUs to share this common client module via their virtualized block device (VBD) interfaces. This option is shown in Figure 4.4. The guest OS executing within the DomU VM does not require any MemX specific modifications. The
MemX client module executing within the driver domain exposes a block device interface, as before. Any DomU whose applications require distributed memory resources configures a split VBD. The frontend of the VBD resides in DomU and the backend in the driver domain. The frontend and backend of each VBD communicate using event channels and the grant table, as in the earlier case of VNICs. (This splitting of interfaces is completely automated by the Xen system itself.)
Figure 4.4: MemX-DD: Executing a common MemX client module within the driver domain, allowing multiple DomUs to share a single client module. The server module continues to execute in non-virtualized Linux.
The MemX client module provides a separate VBD lettered slice (/dev/memx{a,b,c}, etc.) for each backend that corresponds to a distinct DomU. On the network side, the MemX client module attaches itself to the driver domain's VNIC, which in turn talks to the physical NIC via the virtual network bridge. For performance reasons, we assume here that the VNIC and the disk driver are co-located – meaning both drivers reside within the same privileged driver domain (Dom0). Thus the driver domain's VNIC does not need to be organized as another split driver. Rather, it is a single software construct that can attach directly to the virtual bridge. During execution within a DomU, read/write requests to distributed memory are generated in the form of synchronous I/O requests to the corresponding virtual block device frontend. These requests are sent to the MemX client module via the event channel and the grant table. The client module packages each I/O request into network packets and transmits them asynchronously to the distributed memory servers using RMAP.
Note that, although the network packets still need to traverse the virtual network bridge, they no longer need to traverse a split VNIC architecture, unlike in MemX-DomU. One consequence is that, while the client module still needs to fragment a 4KByte I/O request into 3 network packets to fit the MTU requirements, each fragment no longer needs to occupy an entire 4KByte buffer. As a result, only one 4KByte I/O request needs to cross the domain boundary through the split block device driver, as opposed to three 4KB packet buffers in Section 4.3.2. Finally, since the guest OSes within DomUs do not require any MemX-specific software components, the DomUs can potentially run any para-virtualized OS, not just XenoLinux.
However, compared to the non-virtualized baseline case, MemX-DD still has the additional overhead of using the split VBD and the virtual network bridge, though still with highly acceptable performance. Also note that, unlike MemX-DomU, MemX-DD does not currently support seamless migration of live Xen VMs using distributed memory. This is because part of the internal state of the guest OS (in the form of page-to-server mappings) resides in the driver domain of MemX-DD and is not automatically transferred by the migration mechanism in Xen. We plan to enhance Xen's migration mechanism to transfer this internal state information in a host-independent manner to the target machine's MemX-DD module. Furthermore, our current implementation does not support per-DomU reservation of distributed memory, which can potentially violate isolation guarantees. This reservation feature is currently being added to our prototype.
4.3.4 MemX-Dom0 (Option 3)
As mentioned in the introduction, we also evaluate a third scenario: the distributed memory client runs within Dom0 (the same as the driver domain), and the applications execute directly within this domain rather than inside a guest domain. This represents the base virtualization overhead without the presence of other guest virtual machines.
4.3.5 Alternative Options
Guest Physical Address Space Expansion: Another alternative for supporting large memory applications with distributed memory is to enable this support indirectly, via a larger pseudo-physical memory address space than is normally available within the physical machine. This option would require fundamental modifications to the memory management in both the Xen hypervisor and the guest OS. In particular, at boot time, the guest OS would believe that it has a large "physical" memory – the so-called pseudo-physical memory space. It then becomes the Xen hypervisor's task to map each DomU's large pseudo-physical space partly into guest-local memory, partly into distributed memory, and the rest to secondary storage. This is analogous to the large conventional virtual address space available to each process, which is managed transparently by traditional operating systems. The functionality provided by this option is essentially equivalent to that provided by MemX-DomU and MemX-DD. However, this option requires the Xen hypervisor to take up a prominent role in the memory address translation process, something that the original design of Xen strives to minimize. Exploring this option is the focus of Chapter 6.
MemX Server Module in DomU: Technically speaking, we could also execute the MemX server module within a guest OS, coupled with Options 1 or 2 above. This could enable one to initiate a VM solely for the purpose of providing distributed memory to other low-memory client VMs, either across the cluster or even within the same physical machine. In practice, however, this option does not appear to provide any significant functional benefits, whereas the overheads of executing the server module within DomU are considerable. It is also unnecessary because our system already supports the re-distribution of server memory to nearby servers, allowing a server to shut down if needed, which obviates the need to run the module within a virtual machine. Consequently, we do not pursue this option further.
4.3.6 Network Access Contention
Handling network contention within the physical machine itself was the biggest (solvable) difficulty in our decision to implement RMAP without TCP/IP. Three major factors contribute to network contention in our system:
• Inter-VM Congestion: MemX generates traffic at the block-I/O level. In a virtual ma-
chine environment, each guest VM on a given node assumes that it has full control of
the NIC, when in reality that NIC is generally shared among multiple VMs. We elaborate
on this simple but important problem of inter-VM congestion in Section 4.4.3
while evaluating multiple VM performance.
• Flow Control: Currently, RMAP uses a static send window per MemX node. In a
subnet with fairly constant round trip times, this serves us well, although a reactive
approach where the receiver informs the client of the size of its receive window could
be easily deployed. We have not observed a need for this feature as of yet.
• Switch/Server Congestion: MemX servers in the network can potentially be the
destination for dozens of client pages. Two or more clients generating traffic towards
a particular server can quickly overwhelm both the switch port and the server itself.
As a partial solution to this problem, MemX clients perform load-balancing across
MemX servers by dynamically selecting the least loaded server for page write opera-
tions. Empirically, we've observed that congestion happens only when the number of
clients significantly outweighs the number of servers. If MemX were scaled to hun-
dreds of switched nodes, a cross-bar or fat-tree design in addition to more advanced
switch-bound congestion control would be mandatory, but our 8-node cluster hasn’t
warranted this as of yet. We plan to handle this if our testbed scales to more nodes.
4.4 Evaluation
In this section we evaluate the performance of the different variants of MemX. Our goal is to answer the following questions:
• How do the different variants of MemX compare in terms of I/O latency and band-
width?
• What are the overheads incurred by MemX due to virtualization in Xen?
• What type of speedups can be achieved by real large memory applications using
MemX when compared to virtualized disk?
• How well does MemX perform in the presence of multiple concurrent VMs?
Our testbed consists of eight machines. Each machine has 4 GB of memory, an SMP 64-bit dual-core 2.8 GHz processor, and one gigabit Broadcom Ethernet NIC. Our Xen version is
3.0.4 and XenoLinux version 2.6.16.33. Backend MemX-servers run Vanilla Linux 2.6.20.
Collectively, this provides us with over 24GB of effectively usable cluster-wide memory after accounting for roughly 1GB of local memory usage per node. We limit the local memory of client machines to a maximum of 512 MB under all test cases. In addition to the three MemX configurations described earlier, namely MemX-Linux, MemX-DomU, and
MemX-DD, we also include a fourth configuration – MemX-Dom0 – for the sole purpose of performance evaluation. This additional configuration corresponds to the MemX client module executing within Dom0 itself, but not as part of the backend for a VBD. Rather, the client module in MemX-Dom0 serves large memory applications executing within Dom0, and helps to measure the basic virtualization overhead due to Xen. Furthermore, whenever we mention the "disk" baseline, we are referring to virtualized disk within Dom0. When
MemX-DD or MemX-DomU is compared to virtualized disk in any experiment, it means
Configuration      Kernel RTT
MemX-Linux         85 µs
MemX-Dom0          95 µs
MemX-DD            95 µs
MemX-DomU          115 µs
Virtualized disk   8.3 ms

Table 4.1: Kernel-level I/O round-trip latency for each MemX configuration.

that we exported the virtualized disk as a frontend VBD to the dependent guest VM, just as we exported the block device from MemX itself to applications.
4.4.1 Latency and Bandwidth Microbenchmarks
Figure 4.5 and Table 4.1 characterize the different MemX combinations in terms of these two metrics. Table 4.1 shows the average round trip time (RTT) for a single 4KB read request transmitted from a client module and replied to by a server node. The RTT is measured in microseconds, using the on-chip time stamp counter (TSC) register at the kernel level in the client module, immediately before transmission to the NIC and after reception of the ACK from the NIC. Thus the measured RTT values include only MemX-related time components and exclude the variable time required to deliver the page to user level, put the process back on the ready queue, and perform a context switch. In other words, this is the latency that the VFS (virtual filesystem) or the system pager would experience when sending I/O to and from MemX. MemX-Linux, as a base case, provides an RTT of 85µs. Following close behind are MemX-Dom0, MemX-DD, and MemX-DomU, in that order. The virtualized disk base case performs as expected at an average of 8.3ms. These RTT numbers show that accessing the memory of a remote machine over the network is about two orders of magnitude faster than accessing local virtualized disk. Also, the Xen VMM introduces a negligible overhead of 10µs in MemX-Dom0 and MemX-DD over MemX-Linux. Similarly, the split network driver architecture, which needs to transfer 3 packet fragments for each 4KB block across the domain boundaries, introduces an overhead of another 20µs in MemX-DomU over MemX-Dom0 and MemX-DD.
Figure 4.5 shows throughput measurements using a custom benchmark [52] that issues long streams of random/sequential, asynchronous 4KB requests. We ensure that the range of requests is at least twice the size of the local memory of a client node (about 1+ GByte). These tests give us insight during development into where bottlenecks might exist. The throughput for all of the tests is generally at its maximum, minus the effect of CPU overhead. A small loss of 50 Mbits/second naturally occurs for MemX-DomU, which is to be expected. The only case that suffers is random reads, which all hover around 300 Mbits/second. There is a very specific reason for this that is a direct artifact of the way VFS in the Linux kernel handles asynchronous I/O (AIO) [60].
Figure 4.5: I/O bandwidth for different MemX configurations, using a custom benchmark that issues asynchronous, non-blocking 4KB I/O requests. "DIO" refers to opening the file descriptor with direct I/O turned on, to compare against bypassing the Linux page cache.
Asynchronous I/O and Scheduling in Linux. Block devices, by nature, handle all I/O asynchronously (AIO) unless otherwise instructed by the Virtual Filesystem (VFS). In Linux, the AIO call path is the fundamental operation on the device (through the page cache) in terms of which other types of I/O are realized. As of 2007, the AIO hierarchy in Linux uses a separate thread that runs in the same process context as the user application that submitted the I/O (for those file descriptors that are asynchronous). This way, the application can continue doing other work and check for the results later. The core problem involves the kernel thread that handles the AIO system calls itself: it in fact executes synchronously after the request handoff has been made. Linux (and perhaps other kernels) can accept a submission of multiple (sparse) AIO reads/writes in a single system call. After the system call returns, the thread then synchronously issues those I/Os to the device driver one by one (blocking and removing itself from the run queue). For devices with variable latencies (i.e., disks), this long-standing VFS design makes sense, as I/O should block while the device is kept busy by the dynamically generated parallel I/O from the read-ahead (prefetching) policies of the Linux page cache. But for random-access style devices, this blocking serves no purpose. Additionally, the Linux I/O scheduler makes similar assumptions for devices that have request queues (per-driver queues that re-order I/Os for better fairness and latency guarantees).
What this means for MemX, in both virtualized and non-virtualized environments, is that outbound randomly-spaced read block I/O bandwidth (not networking bandwidth) is cut by two thirds, to about one third of its normal speed. This creates a chain reaction for these kinds of randomly-spaced reads: rather than getting a read performance of a full gigabit per second over the network, the application only experiences about three hundred megabits per second. This does not significantly affect the speedups we experience in the next section, but it does explain some of the microbenchmarks performed at the beginning. To solve the problem in the future, we propose a "re-plumbing" of the VFS and I/O scheduling subsystems to dynamically detect the underlying latency characteristics of the device (specifically the unchanging behavior of constant vs. variable latency) in order to allow those subsystems to take alternate code paths that are capable of fully exploiting the deliverable performance of the underlying device. The actual blocking call is lock_page() within the function do_generic_mapping_read(), as per the Linux AIO call stack. On the bright side, as of 2007, there was a patch [60] in progress (contact information for the developers can be found in linux-2.6.xx/MAINTAINERS). The patch could be modified to handle the more specific case that MemX needs, rather than being a generic solution for all users of the page cache. We also noticed that, if the user is a userland C program (versus, say, a filesystem thread running within the kernel), then setting O_DIRECT on the file descriptor will cause the system call to bypass the page cache and go direct-to-BIO. Maximum throughput will then be realized. We also observed that, out of the 4 I/O schedulers available in Linux, none of them have any effect whatsoever on device drivers that do not use a request queue, which is the case for our client module implementation, due to it exhibiting random-access style latencies when pages are accessed through network memory. Demonstrating this problem involved: (1) instrumenting the Linux AIO stack to print TSC-based microsecond estimates, (2) logging the MemX outbound queue size, (3) recording the time between new requests being handed to the device driver, (4) forming a preliminary hypothesis from the observation that the dependent process was spending too much time idly waiting (inside mwait_idle()), and (5) finally receiving confirmation of the hypothesis from the mainline kernel developers.
Figures 4.6 through 4.8 compare the distributions of the total RTT measured from a user-level application that performs either sequential or random I/O on either MemX or the virtual disk, both with and without the O_DIRECT flag enabled. Note that these RTT values are measured from user-level synchronous read/write system calls, which adds a few tens of microseconds to the kernel-level RTTs in Table 4.1. Figure 4.6 compares the read latency distribution for MemX-DD against disk-based I/O in both random and sequential reads via the filesystem buffer cache.
Figure 4.6: Comparison of sequential and random read latency distributions for MemX-DD and disk. Reads traverse the filesystem buffer cache. Most random read latencies are an order of magnitude smaller with MemX-DD than with disk. All sequential reads benefit from filesystem prefetching.
Figure 4.7: Comparison of sequential and random write latency distributions for MemX-DD and disk. Writes go through the filesystem buffer cache. Consequently, all four latencies are similar due to write buffering.
Figure 4.8: Effect of filesystem buffering on random read latency distributions for MemX-DD and disk. About 10% of random read requests (issued without the direct I/O flag) are serviced at the filesystem buffer cache, as indicated by the first knee below 10µs for both MemX-DD and disk.
Random read latencies are an order of magnitude smaller with MemX-DD (around 160µs) than with disk (around 9ms). Sequential read latency distributions are similar for MemX-DD and disk, primarily due to filesystem prefetching.
Figure 4.7 shows the RTT distribution for buffered write requests. Again, MemX-DD and disk show similar distributions, mostly less than 10µs, due to write buffering. Figure 4.8 demonstrates the effect of passing the O_DIRECT flag to the open() system call, which bypasses the filesystem buffer cache. The random read latency distributions without the flag display a distinct knee below 10µs, indicating that roughly 10% of the random read requests are serviced at the filesystem buffer cache and that prefetching benefits MemX as well as disk.
We observed a similar trend for sequential read distributions, with and without the flag, where the first knee indicated that about 90% of sequential reads were serviced at the
filesystem buffer cache.
[Figure 4.9 plot: Sort Time (seconds) vs. Sort Size (GB) for MemX-DomU, MemX-DD, MemX-Linux, Local Memory, and Local Disk.]
Figure 4.9: Quicksort execution times in various MemX combinations and disk. While clearly surpassing disk performance, MemX-DD trails regular Linux only slightly using a 512 MB Xen Guest.
4.4.2 Application Speedups
We now evaluate the execution times of a few large memory applications using our testbed.
Again, we include both MemX-Linux and virtual disk as base cases to illustrate the over- head imposed by Xen virtualization and the gain over the virtualized disk respectively.
Figure 4.9 shows the performance of sorting increasingly large arrays of integers, using an in-house C implementation of the classic static-partitioning quicksort algorithm.
We stopped using the STL version because of its inability to provide detailed runtime information about the progress of the sort. We record the execution times of the sort for each of the three aforementioned cases. We also include an "extreme" base case plot for local memory using one of the vanilla-Linux 4 GB nodes, where the sort executes purely in-memory. As the figure shows, we omitted the disk case beyond 2 GB problem sizes because of the unreasonably long time it takes to complete, potentially days. The sorts using MemX-DD, MemX-DomU, and MemX-Linux, however,
finished within 90 minutes, and the distinction between the different virtualization levels is very small. Table 4.2 lists execution times for some much larger problem sizes with both quicksort and a second large memory application – the same ray-tracing scene used in Chapter 3 [81]. Each row in the table describes an increasingly large problem size, as high as 13 GB. Again, both MemX cases behave similarly, while the disk lags behind. These performance numbers show that MemX provides a highly attractive option for executing large memory workloads in both virtualized and non-virtualized environments.
Furthermore, given the unquantified volume of randomized reads generated by the system's pager (which correlates with the recursive nature of the sort algorithm), the non-asynchronous I/O problem described in the previous section also applies here. If a fix were applied, the observed speed-ups in the figure could potentially double or triple. For now, the throughput observed from the system pager remains around 300 to 400 Mbits/sec.
Application        Client Mem   MemX-Linux   MemX-DD   Disk
5 GB Quicksort     512 MB       65 min       93 min    several hours
6 GB Ray-tracer    512 MB       48 min       61 min    several hours
13 GB Ray-tracer   1 GB         93 min       145 min   several hours

Table 4.2: Execution time comparisons for various large memory application workloads.
Figure 4.10: Quicksort execution times for multiple concurrent guest VMs using MemX-DD and iSCSI configurations.
Figure 4.11: Our multiple client setup: five identical 4 GB dual-core machines, where one houses 20 Xen guests and the others serve as either MemX servers or iSCSI servers.
4.4.3 Multiple Client VMs
In this section, we evaluate the overhead of executing multiple client VMs using the MemX-DD combination. In a real data center, an iSCSI or FibreChannel network would be set up to provide backend storage for guest virtual machines. To duplicate this base case in our cluster, we use five of our dual-core 4 GB memory machines to compare MemX-DD against a 4-disk parallel iSCSI setup, illustrated in Figure 4.11. For the iSCSI target software, we used the open-source project IET [93], and for the initiator software we used open-iscsi.org within Dom0, which acts as a driver domain for all the Xen guests. Our setup uses one of the five machines to execute up to twenty concurrently running 100 MB Xen guests. Within each guest, we run a 400 MB quicksort. We vary the number of concurrent guest VMs from 1 to 20, and in each guest we run quicksort to completion. We perform the same experiment for both MemX-DD and iSCSI. Figure 4.10 shows the results of this experiment. At its highest point (about 10 GB of collective memory and 20 concurrent virtual machines), the execution time with MemX-DD is about 5 times smaller than with the iSCSI setup. Recall that we are using four remote iSCSI disks; one can observe a stair-step behavior in the iSCSI curve where the level of parallelism wraps around at 4, 8, 12, and 16 virtual machines. Even with concurrent disks and competing virtual machine CPU activity, MemX-DD provides clear benefits in providing low-latency I/O among multiple concurrent Xen virtual machines.
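The stair-step pattern can be reproduced with a toy makespan model: assume each guest's sort is entirely disk-bound and guests are striped round-robin across the four iSCSI disks, so the finish time is set by the most heavily loaded disk. This is our simplification for illustration, not the behavior of the actual experiment:

```python
import math

def iscsi_makespan(num_vms, num_disks=4, per_vm_time=1.0):
    # The busiest disk serves ceil(num_vms / num_disks) guests serially,
    # so completion time steps up at every multiple of num_disks --
    # matching the wrap-around at 4, 8, 12, and 16 VMs in Figure 4.10.
    return math.ceil(num_vms / num_disks) * per_vm_time
```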
Inter-VM Congestion: In Section 4.3.6, we described the phenomenon of inter-VM congestion, which arises from the absence of explicit congestion control across multiple guests within a Xen node. Here we discuss how inter-VM congestion is handled in different
MemX configurations.
1. MemX-Dom0 and MemX-Linux: Inter-VM congestion does not arise trivially in the
base cases of MemX-Dom0 and MemX-Linux because the only users of the client
module are local application processes. These processes, controlled by a static send
window, use semaphores and wait queues to put competing processes on the OS’s
blocked list when the client’s send window is full. So, there is no competition among
multiple virtual machines - only between competing processes.
2. MemX-DD: Inter-VM congestion in MemX-DD is handled indirectly by Xen itself.
Xen schedules block I/O backend requests in a strictly round-robin fashion. Since
MemX is the destination of requests from the backend, Xen will “stop” the delivery
of requests to MemX when there is a full queue (of some fixed size). This stop is
performed by placing the dependent guest VMs in a blocked state in the same way
that multi-programmed processes are blocked when waiting for I/O.
3. MemX-DomU: For MemX-DomU, recall that inter-VM congestion arises from multiple network front-end drivers rather than competing block front-ends. Xen handles this type of contention by using credit-based scheduling, where each front-end is allocated a bandwidth share of the form x bytes every y microseconds. VMs that use up their credit are blocked.
This leaves us to handle only the network contention at the switch and server level, which we plan to address as our testbed scales to more nodes.
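The credit-based scheme in case 3 behaves like a token bucket: each front-end is granted x bytes of credit every y microseconds and blocks once its credit is exhausted. The class below is an illustrative model of that policy, not Xen's implementation:

```python
class CreditLimiter:
    """Toy model of a per-front-end credit allocation of the form
    `credit_bytes` every `period_us` microseconds."""

    def __init__(self, credit_bytes, period_us):
        self.credit_bytes = credit_bytes
        self.period_us = period_us
        self.credit = credit_bytes      # start with a full allocation
        self.last_refill_us = 0

    def try_send(self, now_us, nbytes):
        # Refill (capped) credit at each elapsed period boundary.
        periods = (now_us - self.last_refill_us) // self.period_us
        if periods > 0:
            self.credit = min(self.credit_bytes,
                              self.credit + periods * self.credit_bytes)
            self.last_refill_us += periods * self.period_us
        if nbytes <= self.credit:
            self.credit -= nbytes
            return True                 # packet accepted
        return False                    # VM blocked until the next refill
```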
4.4.4 Live VM Migration
The MemX-DomU configuration has a significant benefit when it comes to migrating live Xen VMs [27] at runtime, even though it has lower throughput and higher I/O latency than MemX-DD. Specifically, a VM using MemX-DomU for fast I/O
to distributed memory can be seamlessly migrated from one physical machine to another,
without disrupting the execution of any large dataset applications within the VM. There
are two specific reasons for this benefit. First, since MemX-DomU is designed as a self-contained pluggable module within the guest OS, any page-to-server mapping information is migrated along with the kernel state of the guest OS without leaving any residual dependencies behind in the original machine. The second reason is that RMAP, the protocol used for communicating read-write requests to distributed memory, is designed to be reliable. As
the VM carries with itself its link-layer MAC address identification during the migration process, any in-flight packets dropped during migration are safely retransmitted to the VM's
new location, thereby enabling any large memory application to continue execution without
disruption. What makes the MemX-DomU case interesting is that administrators of virtual
hosting centers can exploit live-migration features by seamlessly transferring guest VMs to
other physical machines at will to better utilize resources. Our work in Chapter 5 focuses exclusively on the optimization of virtual machine migration and elaborates on this in more detail.
4.5 Summary
State-of-the-art in virtual machine technology does not adequately address the needs of
large memory workloads that are increasingly common in modern data centers and virtual
hosting platforms. Such application workloads quickly become throttled by the disk I/O bottleneck in a virtualized environment where the I/O subsystem includes an additional level
of indirection. In this Chapter, we presented the design, implementation, and evaluation
of the MemX system in the Xen environment that enables memory and I/O-constrained
VMs to transparently utilize the collective pool of memory within a cluster for low-latency
I/O operations. Large dataset applications using MemX do not require any specialized
APIs, libraries, or any other modifications. MemX can operate as a kernel module within non-virtualized Linux (MemX-Linux), an individual VM (MemX-DomU), or a driver domain
(MemX-DD). The latter option permits multiple VMs within a single physical machine to multiplex their memory requirements over a common distributed memory pool. Performance evaluations using our MemX prototype show that I/O latencies are reduced by an order of magnitude and that large memory applications speed up significantly when compared against virtualized disk. As an extra benefit, live Xen VMs executing large memory applications over MemX-DomU can be migrated without disrupting applications. Our future work includes the capability to provide per-VM reservations over the cluster-wide memory, developing mechanisms to control inter-VM congestion, and enabling seamless migration of VMs in the driver domain mode of operation.

Chapter 5

Post-Copy: Live Virtual Machine Migration
In this Chapter, we present the design, implementation, and evaluation of the post-copy based approach for the live migration of virtual machines (VMs) across a gigabit LAN. Live migration is a standard feature of modern hypervisors. It facilitates server consolidation, system maintenance, and lower power consumption. Post-copy [53] refers to the deferral of the memory "copy" phase of live migration until after the VM's CPU state has been migrated to the target node. This is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations, followed by the transfer of CPU execution state. The post-copy strategy provides a "win-win" by approaching the baseline total migration time achieved with the stop-and-copy approach, while maintaining the liveness and low downtime benefits of the pre-copy approach. We facilitate the use of post-copy with a specific instance of adaptive pre-paging (also known as adaptive distributed paging). Pre-paging eliminates all duplicate page transmissions and quickly removes any residual dependencies for the migrating VM from the source node. Our pre-paging algorithm is able to reduce the number of page faults across the network to 17% of the VM's working set. Finally, we enhance both the original pre-copy and post-copy schemes with a dynamic, periodic self-ballooning (DSB) strategy, which prevents the migration daemon from transmitting unnecessary free pages in the guest OS. DSB significantly speeds up both migration schemes with negligible performance degradation to the processes running within the VM. We implement the post-copy approach in the Xen VM environment and show that it significantly reduces the total migration time and network overheads across a range of VM workloads when compared against the traditional pre-copy approach.
5.1 Introduction
This Chapter addresses the problem of optimizing the live migration of system virtual machines (VMs). Live migration is a key selling point for state-of-the-art virtualization technologies. It allows administrators to consolidate system load, perform maintenance, and
flexibly reallocate cluster-wide resources on-the-fly. We focus on VM migration within a cluster environment where physical nodes are interconnected via a high-speed LAN and also employ a network-accessible storage system (such as a SAN or NAS). State-of-the-art live migration techniques [73, 27] use the pre-copy approach, where the bulk of the
VM’s memory state is migrated even as the VM continues to execute at the source node.
Once the “working set” has been identified through a number of iterative copy rounds, the
VM is suspended and its CPU execution state plus remaining dirty pages are transferred to the target host. The overriding goal of the pre-copy approach is to keep the service downtime to a bare minimum by minimizing the amount of VM state that needs to be transferred during the downtime.
We seek to demonstrate the benefits of another strategy for live VM migration, called post-copy, which was previously applied only in the context of process migration in the late 1990s, and to address the issues involved in applying it at the operating system level as well. We believe that modern hypervisors provide the means to employ alternative approaches without much additional complexity. At a high level, post-copy refers to the deferral of the memory "copy" phase of live migration until the virtual machine's CPU state has already been migrated to the target node. This enables the migration daemon to try different methods by which to perform the memory copy. Post-copy works by transferring a minimal amount of CPU execution state to the target node, starting the VM at the target, and then proceeding to actively push memory pages from the source to the target. This active push component, also known as pre-paging, distinguishes the post-copy approach from both pre-copy and the demand-paging approach, in which the source node would passively wait for the memory pages to be faulted in by the target node across the network.
Pre-paging is a broad term used in earlier literature [76, 94] in the context of optimizing memory-constrained disk-based paging systems; it refers to a more proactive form of page prefetching from disk. By intelligently sequencing the set of actively pre-fetched memory pages, the memory subsystem (or even a cache) can hide the latency of high-locality page faults or cache misses from live applications, while continuing to retrieve the rest of the address space out-of-band until the entire address space is complete. Modern memory subsystems do not typically employ pre-paging anymore due to the increasingly large DRAM capacities in commodity systems. However, pre-paging can play a significant role in the context of live VM migration, which involves the transfer of an entire physical address space across the network.
We design and implement a post-copy based technique for live VM migration in the
Xen VM environment. Through extensive evaluations, we demonstrate how post-copy can improve live migration performance across each of the following metrics: pages transferred, total migration time, downtime, application degradation, network bandwidth, and identification of the working set. The traditional pre-copy approach does particularly well in minimizing two metrics – application downtime and degradation – when the VM is executing a largely read-intensive workload. These two metrics are important in preserving system uptime as well as the interactive user experience. However, all the above metrics can be impacted adversely when pre-copy is confronted with even moderately write-intensive VM workloads during migration. Post-copy not only maintains VM liveness and application performance during migration, but also improves upon the other performance metrics listed above.
The two key ideas behind an effective post-copy strategy are: (a) transmitting each page across the network no more than once, in other words, avoiding the potentially non-converging iterative copying rounds in pre-copy, and (b) an adaptive pre-paging strategy
that hides the latency of fetching most pages across the network by actively pushing pages from the source before the page is faulted in at the target node, and adapting the sequence of pushed pages using any network page-faults as hints. We show that our post-copy implementation is capable of minimizing network-bound page faults to 17% of the working set.
Additionally, we identified deficiencies in both the pre-copy and post-copy schemes with regard to the transfer of free pages in the guest VM during migration. We improved both migration schemes to avoid transmitting free pages through the use of a Dynamic Self-Ballooning (DSB) technique, in which the guest actively balloons down its memory footprint without human intervention. DSB significantly speeds up the total migration time, normalizes both approaches, and can balloon at intervals as small as 5 seconds without adversely affecting live applications.
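A minimal sketch of the policy decision behind DSB: the guest periodically computes how many pages it should retain and balloons out the rest, so the migration daemon never sees free pages. The reserve fraction here is a hypothetical parameter we introduce for illustration, not the dissertation's tuned value:

```python
def balloon_target_pages(total_pages, free_pages, reserve_frac=0.10):
    # Keep a small reserve of free pages for allocation bursts;
    # everything above the reserve is returned to the hypervisor.
    reserve = int(total_pages * reserve_frac)
    reclaimable = max(0, free_pages - reserve)
    return total_pages - reclaimable    # pages the guest keeps
```

A controller would re-evaluate this target periodically (at intervals as small as 5 seconds in our experiments) and inflate or deflate the balloon toward it.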
Both the Xen and VMware hypervisors have demonstrated that migration itself is an essential tool. The original pre-copy algorithm does have other advantages: it employs a relatively self-contained implementation that allows the migration daemon to isolate most of the copying complexity to a single process at each node. Additionally, pre-copy provides a clean method of aborting the migration should the target node ever crash during migration, because the VM is still running at the source and not the target host (whether or not this benefit is made obvious in current virtualization technologies). Although our current post-copy implementation does not handle target node failure, we discuss in Section 5.2.5 a straightforward approach by which post-copy can provide the same level of reliability as pre-copy. Our contribution is to demonstrate a complete way in which, with a little more help from the migration system, one can preserve the liveness and downtime benefits of pre-copy while also breaking from the non-deterministic convergence phase inherent in pre-copy, ensuring that each page of VM memory is transferred over the network at most once.
5.2 Design
We begin with a brief discussion of the performance goals of VM migration. Afterwards, we present our design of post-copy and how it improves upon those goals.
5.2.1 Pre-Copy
For a more in-depth performance summary of pre-copy migration, we refer the reader to
[27] and [73]. For completeness, pre-copy migration works as follows. Pre-copy is an eager strategy in which memory pages are actively pushed to the target machine while the migrating VM continues to run at the source machine. Pages dirtied at the source that have already been transferred to the target are re-sent through several iterations until the number of dirtied pages falls below a fixed threshold. (Note that this threshold is not dynamic; although one could imagine modern hypervisors designing a dynamic threshold, neither vendors nor the literature have attempted to do so.) Furthermore, in all known implementations, if the threshold is never reached, an empirical "cap" on the total number of iterations is chosen (currently set to 30) by the migration implementer.
Without this cap, it is possible that pre-copy may never converge at all. After the iterations complete, the VM is suspended and its state is transferred to the target machine, where it is restarted. This transfer of VM state is accompanied by a final flush of the remaining address space modified at the source. The VM is then resumed at the target and the source
VM copy is destroyed. Pre-copy migration involves the following seven performance goals:
1. Transparency: The pre-copy scheme can work transparently in both fully-virtualized
and para-virtualized environments. Any new migration scheme must maintain that
ability without requiring any application changes.
2. Preparation Time: Any required CPU or network activity within either the migrating
guest VM or the maintenance VM contributes to preparation time. This includes most
of the memory copying during pre-copy rounds. There is no guarantee that this time
ever converges to a stopping round. In fact, later we show that even with mildly active
VMs, these rounds never converge.
3. Down Time: This time represents how long the migrating VM is stopped, during
which no execution progress is made. Pre-copy uses this time for dirty memory
transfer. Minimizing downtime is pre-copy's primary goal.
4. Resume Time: Any remaining cleanup required by the maintenance VM at the target
host goes into this time period. Although pre-copy has nothing to do besides rescheduling the migrating VM, the majority of our post-copy design operates primarily
in this period. After this period is complete, regardless of which migration algorithm
is used, all dependencies on the source VM must be eliminated.
5. Pages Transferred: This performance goal consists of a total count of the number
of transferred memory pages across all of the above time periods. For pre-copy this
is dominated by preparation time.
6. Total Migration Time: For pre-copy, the total time required to complete the migration
is dominated by the preparation time. Total migration time is important because it
affects the release of resources on both sides within the individual host as well as
within the VMs on both hosts. Until completion of migration, the unused memory at
the source cannot yet be freed, and both maintenance VMs will continue to consume
network bandwidth and CPU cycles.
7. Application Degradation: This refers to the extent of slowdown experienced by
application workloads executing within the VM due to the migration event. The slowdown occurs primarily due to CPU time taken away from normal applications to carry out the migration. Additionally, the pre-copy approach needs to track dirtied pages across successive iterations by trapping write accesses to each page, which significantly slows down write-intensive workloads. In the case of post-copy, access to
memory pages not yet present at the target results in network page faults, potentially
slowing down the VM workloads.
One of this Chapter’s contributions is to reduce the number of pages transferred com-
pared to pre-copy: the wasteful transfer of pages that may never be used at the target
machine is likely to occur. If the threshold of the number of dirty pages chosen to termi-
nate the pre-copy phase is too small, then pre-copy may never converge and terminate. On
the other hand, if the number of pages transferred during final iteration is large, significant
downtime can result. Given that the number of pages transferred directly impacts all other CHAPTER 5. POST-COPY: LIVE VIRTUAL MACHINE MIGRATION 73 metrics, our post-copy method aims to reduce this metric.
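The convergence problem can be made concrete with a toy model of the iterative rounds. Assume a constant number of pages is dirtied during each round (a fixed writable working set, which is our simplifying assumption): whenever that number exceeds the termination threshold, only the iteration cap ends the migration, and every extra round re-sends pages:

```python
def precopy_cost(mem_pages, dirtied_per_round, threshold, cap=30):
    # Round 1 sends all of memory; each later round re-sends the pages
    # dirtied during the previous round, until the dirty set falls below
    # `threshold` or the empirical cap on iterations is reached.
    sent, rounds = mem_pages, 1
    while dirtied_per_round > threshold and rounds < cap:
        sent += dirtied_per_round
        rounds += 1
    return rounds, sent
```

For example, a 100,000-page VM dirtying 5,000 pages per round against a 1,000-page threshold runs to the cap of 30 rounds and resends 145,000 extra pages; post-copy's at-most-once property eliminates exactly this term.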
5.2.2 Design of Post-Copy Live VM Migration
Post-copy is a strategy in which the migrating virtual machine is first suspended at the source, a minimal execution state is copied over to the target, where the virtual machine is restarted, and then the memory pages that are referenced are faulted over the network from the source. VM execution experiences a delay during this period of faults, and that delay depends on the characteristics of the network connection and how fast the source machine can serve each request. As a result, this method incurs considerable resume time. Additionally, leaving any long-term residual dependencies on the source host is not acceptable. Thus, post-copy is not useful unless two additional goals are met:
1. Post-copy must effectively anticipate page-faults from the target and allow VM execution to move forward, while hiding the latency of page-faults.
2. Post-copy must flush the remaining clean pages from the source out-of-band while
the VM is simultaneously faulting, so that no residual dependency remains on the
source.
Note that both migration schemes must be normalized with respect to the unused/free pages within the guest VM. This must be done such that any improvement is realized only by the treatment of pages that actually contributed to the guest VM's working set. We
will discuss this solution momentarily. The post-copy algorithm can be designed in multiple ways, each of which provides an incremental improvement over the previous method across all the aforementioned performance goals. Table 5.1 illustrates how each of these ways slightly increases in complexity over the previous one during a certain phase of the migration, with the common goal of improving the bottom line. Method 1 heads the table as the current, pre-copy form of migration.
Method 2: Post-Copy via Demand Paging: The demand paging variant of post-copy is the simplest and slowest option. Once the VM resumes at the target, its memory accesses result in page faults that can be serviced by requesting the referenced page over
               Preparation                 Downtime             Resume
1 Pre-copy     Multiple iterative          Send dirty memory,   (negligible)
  Only         memory transfers            CPU state transfer
2 Demand       Pre-suspend time            CPU state            Page-faults
  Paging       (if any)                    transfer             only
3 Basic        Pre-suspend time            CPU state            Flushing +
  Post-copy    (if any)                    transfer             page-faults
4 Pre-paging   Pre-suspend time            CPU state            Bubbling +
  + Post-copy  (if any)                    transfer             page-faults
5 Hybrid       Single pre-copy             CPU state            Bubbling +
  Pre + Post   round                       transfer             page-faults

Table 5.1: Migration algorithm design choices in order of their incremental improvements. Method #4 combines #2 and #3 with the use of pre-paging. Method #5 combines all of #1 through #4, in which pre-copy is used only in a single, primer iterative round.
1.  let N := total # of guest VM pages
2.  let page[N] := set of all guest VM pages
3.  let bitmap[N] := all zeroes
4.  let pivot := 0; bubble := 0

5.  ActivePush (Guest VM)
6.      while bubble < max(pivot, N-pivot) do
7.          let left := max(0, pivot - bubble)
8.          let right := min(MAX_PAGE_NUM-1, pivot + bubble)
9.          if bitmap[left] == 0 then
10.             set bitmap[left] := 1
11.             queue page[left] for transmission
12.         if bitmap[right] == 0 then
13.             set bitmap[right] := 1
14.             queue page[right] for transmission
15.         bubble++

16. PageFault (Guest-page X)
17.     if bitmap[X] == 0 then
18.         set bitmap[X] := 1
19.         transmit page[X] immediately
20.     discard pending queue
21.     set pivot := X   // shift pre-paging pivot
22.     set bubble := 1  // new pre-paging window
Figure 5.1: Pseudo-code for the pre-paging algorithm employed by post-copy migration. Synchronization and locking code omitted for clarity of presentation.

the network from the source node. However, servicing each fault will significantly slow down the VM due to the network's round trip latency. Consequently, even though each page is transferred only once, this approach considerably lengthens the resume time and leaves long-term residual dependencies in the form of un-fetched pages, possibly for an indeterminate duration. Thus, post-copy performance for this variant by itself would be unacceptable from the viewpoint of total migration time and application degradation.
Method 3: Post-Copy via Active Pushing: One way to reduce the duration of residual dependencies on the source node is to proactively “push” the VM’s pages from the source to the target even as the VM continues executing at the target. Any major faults incurred by the VM can be serviced concurrently over the network via demand paging. Active push avoids transferring pages that have already been faulted in by the target VM. Thus, each page is transferred only once, either by demand paging or by an active push.
Method 4: Post-Copy via Prepaging: The goal of post-copy via prepaging is to antic- ipate the occurrence of major faults in advance and adapt the page pushing sequence to better reflect the VM’s memory access pattern. While it is impossible to predict the VM’s exact faulting behavior, our approach works by using the faulting addresses as hints to es- timate the spatial locality of the VM’s memory access pattern. The prepaging component then shifts the transmission window of the pages to be pushed such that the current page fault location falls within the window. This increases the probability that pushed pages would be the ones accessed by the VM in the near future, reducing the number of major faults. Various prepaging strategies are described in Section 5.2.3.
Method 5: Hybrid Live Migration: The hybrid approach was first described in [74] for process migration. It works by doing a single pre-copy round in the preparation phase of the migration. During this time, the VM continues running at the source while all its memory pages are copied to the target host. After just one iteration, the VM is suspended and its processor state and dirty non-pageable pages are copied to the target. Subsequently, the VM is resumed at the target and post-copy as described above kicks in, pushing in the remaining dirty pages from the source. As with pre-copy, this scheme can perform well for read-intensive workloads. Yet it also provides deterministic total migration time for write-intensive workloads, as with post-copy. This hybrid approach is currently being
Figure 5.2: Prepaging strategies: (a) bubbling with a single pivot and (b) bubbling with multiple pivots. Each pivot represents the location of a network fault on the in-memory pseudo-paging device. Pages around the pivot are actively pushed to the target.

implemented and is not covered within the scope of this chapter. The rest of this chapter describes the design and implementation of post-copy via prepaging.
5.2.3 Prepaging Strategy
Prepaging refers to actively pushing the VM’s pages from the source to the target. The goal is to make pages available at the target before they are faulted on by the running VM.
The effectiveness of prepaging is measured by the percentage of VM’s page faults at the
target that require an explicit page request to be sent over the network to the source node
– also called network page faults. The smaller the percentage of network page faults, the
better the prepaging algorithm. The challenge in designing an effective prepaging strategy
is to accurately predict the pages that might be accessed by the VM in the near future, and
to push those pages before the VM faults upon them. Below we describe different design
options for prepaging strategies.
(A) Bubbling with a Single Pivot:
Figure 5.1 lists the pseudo-code for the two components of bubbling with a single pivot –
active push (lines 5–15), which executes in a kernel thread, and page fault servicing (lines
16–22), which executes in the interrupt context whenever a page-fault occurs. Figure 5.2(a)
illustrates this algorithm graphically. The VM's pages at the source are kept in an in-memory pseudo-paging device, which is similar to a traditional swap device except that it resides completely in memory (see Section 5.3 for details). The active push component starts from a pivot page in the pseudo-paging device and transmits symmetrically located pages around that pivot in each iteration. We refer to this algorithm as "bubbling" since it is akin to a bubble that grows around the pivot as its center. Even if one edge of the bubble reaches the boundary of the pseudo-paging device (0 or MAX), the other edge continues expanding in the opposite direction. To start with, the pivot is initialized to the first page in the in-memory pseudo-paging device, which means that initially the bubble expands only in the forward direction. Subsequently, whenever a network page fault occurs, the fault servicing component shifts the pivot to the location of the new fault and starts a new bubble around this new location. In this manner, the location of the pivot adapts to new network faults in order to exploit the spatial locality of reference. Pages that have already been transmitted (as recorded in a bitmap) are skipped over by the edge of the bubble.
Network faults that arrive at the source for a page that is in flight (or has just been pushed) to the target are ignored to avoid duplicate page transmissions.
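To make the single-pivot algorithm concrete, here is a small Python simulation. All class and method names are our own illustration; the real active-push component runs in a kernel thread at the source, as in Figure 5.1.

```python
# Illustrative simulation of single-pivot "bubbling". A bitmap records
# already-pushed pages, and a network fault re-centres the bubble.

class SingleBubble:
    def __init__(self, max_pages):
        self.max = max_pages
        self.sent = [False] * max_pages   # bitmap of already-pushed pages
        self.pivot = 0                    # initial pivot: first page
        self.offset = 0                   # current bubble radius

    def push_step(self):
        """Push the next page(s) symmetric around the pivot.
        Returns the list of page numbers transmitted in this step."""
        pushed = []
        while self.offset <= self.max:    # until both edges leave the device
            lo, hi = self.pivot - self.offset, self.pivot + self.offset
            for p in {lo, hi}:
                if 0 <= p < self.max and not self.sent[p]:
                    self.sent[p] = True
                    pushed.append(p)
            self.offset += 1
            if pushed:
                return pushed
        return pushed                     # empty list: everything transmitted

    def network_fault(self, page):
        """A network fault re-centres the bubble on the faulting page."""
        if not self.sent[page]:
            self.sent[page] = True        # demand-serviced immediately
        self.pivot, self.offset = page, 1 # new bubble around the fault

b = SingleBubble(10)
assert b.push_step() == [0]       # bubble starts at the first page
b.network_fault(5)                # fault shifts the pivot to page 5
assert sorted(b.push_step()) == [4, 6]
```

Note how the bitmap lets the bubble's edges skip pages that an earlier bubble already transmitted, matching the behavior described above.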
(B) Bubbling with Multiple Pivots: Consider the situation where a VM has multiple processes executing concurrently. Here, a newly migrated VM would fault on pages at multiple locations in the pseudo-paging device. Consequently, a single pivot would be insufficient to capture the locality of reference across multiple processes in the VM. To address this situation, we extend the bubbling algorithm described above to operate on multiple pivots. Figure 5.2(b) illustrates this algorithm graphically. The algorithm is similar to the one outlined in Figure 5.1, except that the active push component pushes pages from multiple “bubbles” concurrently. (We omit the pseudo-code due to space constraints, since it is a straightforward extension of the single-pivot case.)
Each bubble expands around an independent pivot. Whenever a new network fault occurs, the faulting location is recorded as one more pivot and a new bubble is started around that location. To save on unnecessary page transmissions, if the edge of a bubble comes across a page that has already been transmitted, that edge stops progressing in the corresponding direction. For example, the edges between the bubbles around pivots P2 and P3 stop progressing when they meet, although the opposite edges continue making progress.
In practice, it is sufficient to limit the number of concurrent bubbles to those around the k most recent pivots. When a new network fault arrives, we replace the oldest pivot in a pivot array with the new fault location. For the workloads tested in our experiments in
Section 5.4, we found that around k = 7 pivots provided the best performance.
(C) Direction of Bubble Expansion: We also wanted to examine whether the pattern in which the source node pushes the pages located around the pivot makes a significant difference in performance. In other words, is it better to expand the bubble around a pivot in both directions, in only the forward direction, or in only the backward direction? To examine this, we included an option for turning off bubble expansion in either the forward or the backward direction. Our results, detailed in Section 5.4.4, indicate that forward bubble expansion is essential, that dual (bi-directional) bubble expansion performs slightly better in most cases, and that backward-only bubble expansion is counter-productive.
When expanding bubbles around multiple pivots in only a single direction (forward-only or backward-only), there is a possibility that the entire active push component could stall before transmitting all pages in the pseudo-paging device. This happens when all active bubble edges encounter already-sent pages and stop progressing. (A simple thought exercise shows that stalling of the active push is not a problem for dual-direction multi-pivot bubbling.) While there are multiple ways to solve this problem, we chose the simple approach of designating the initial pivot (at the first page of the pseudo-paging device) as a sticky pivot. Unlike other pivots, this sticky pivot is never replaced by another pivot.
Further, the bubble around the sticky pivot does not stall when it encounters an already-transmitted page; rather, it skips such a page and keeps progressing, ensuring that the active push component never stalls.
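The multi-pivot variant with a sticky pivot can be sketched the same way. Again the names are our own; the deque stands in for the k-entry pivot array, and the sticky pivot at the first page guarantees the push completes.

```python
# Toy simulation of multi-pivot bubbling with a sticky pivot.
from collections import deque

class MultiBubble:
    def __init__(self, max_pages, k=7):
        self.max = max_pages
        self.sent = [False] * max_pages
        self.bubbles = deque(maxlen=k)  # k most recent pivots; oldest evicted
        self.sticky_edge = 0            # sticky pivot fixed at the first page

    def network_fault(self, page):
        """Service a fault and start a new bubble at the faulting page."""
        self.sent[page] = True
        # Bubble state: [pivot, radius, left_live, right_live]
        self.bubbles.append([page, 1, True, True])

    def push_step(self):
        """One active-push iteration across all bubbles; returns pages sent."""
        pushed = []
        for b in self.bubbles:
            pivot, r = b[0], b[1]
            for side, edge in ((2, pivot - r), (3, pivot + r)):
                if b[side] and 0 <= edge < self.max:
                    if self.sent[edge]:
                        b[side] = False      # met an already-sent page: stall
                    else:
                        self.sent[edge] = True
                        pushed.append(edge)
            b[1] += 1
        # The sticky bubble skips already-sent pages instead of stalling,
        # so the active push never stalls before finishing.
        while self.sticky_edge < self.max and self.sent[self.sticky_edge]:
            self.sticky_edge += 1
        if self.sticky_edge < self.max:
            self.sent[self.sticky_edge] = True
            pushed.append(self.sticky_edge)
            self.sticky_edge += 1
        return pushed

mb = MultiBubble(32, k=2)
mb.network_fault(10)
mb.network_fault(20)
mb.network_fault(25)          # k=2: the pivot at page 10 is evicted
while mb.push_step():
    pass
assert all(mb.sent)           # the sticky pivot guarantees completion
```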
5.2.4 Dynamic Self-Ballooning
The Free Memory Problem. As we touched on earlier, there can be an arbitrarily large number of free pages within the guest VM before migration begins, or there may be few or none. Either way, it is wasteful to send free pages, regardless of which migration algorithm is used. If as many of these pages as possible are not eliminated from migration under the pre-copy algorithm, then one cannot properly compare it to post-copy, because there would be no way of distinguishing clean pages from free pages during each pre-copy iteration: if a clean page is freed, there is no way for the migration process to detect this. We observe that there are two ways to solve this problem.
The first way suits post-copy well and corresponds to Method 5 (the hybrid method) in Table 5.1, which combines pre-copy with post-copy. It works by performing a single pre-copy round during the preparation phase of the migration, which allows the guest VM to continue running at the source while its free pages and clean pages are copied to the target host. The post-copy process then kicks in immediately after downtime. There is no memory transfer during downtime, and post-copy operates just as we described. The second way to solve the free memory problem is through the use of ballooning.
The hybrid scheme was first used in the literature in [74]. But since we are dealing with whole-system VM migration, this presents a problem for a performance comparison against stand-alone pre-copy migration: the hybrid scheme does not eliminate the transmission of free pages. Without eliminating them, we cannot determine how effectively post-copy’s pre-paging improves VM execution time by hiding page-fault latency from the migrating guest VM, for two reasons. First, if a free page is transmitted (which is highly probable), it consumes bandwidth that might otherwise have been used both by pre-paging and by the iterative rounds used in pre-copy. Second, during pre-paging, if a free page is allocated by the guest VM and subsequently causes a page fault (as the result of a copy-on-write by the virtual memory system), this will cause additional delay on the VM at the target when there need not have been any. Therefore, we cannot do a performance analysis of post-copy without eliminating the transmission of those empty page frames.
Ballooning is the act of changing the view of physical memory (and pseudo-physical memory) such that the guest VM has a larger or smaller amount of allocatable memory than it had before. In current virtualization systems, ballooning is only used at guest VM boot time, when the VM is first created and initialized. If the maintenance VM cannot “reserve” enough memory for the new guest (henceforth referred to as a reservation), it steals some from the other VMs on the host by enlarging a kind of balloon in those VMs and giving the reclaimed memory to the new one. This is done by giving the existing VMs a smaller “target” reservation and waiting for them to release enough pages from their own reservations to satisfy that target. The system administrator can re-enlarge those diminished reservations at a later time should more memory become available, for example as the result of a VM shutting down, or even of migration itself. What we have implemented is a way for the migrating guest VM to perform this ballooning continuously by itself, called Dynamic
Self-Ballooning (DSB). The way to make this effective for migration is two-fold: First, we
must choose an appropriate interval between consecutive DSB attempts such that the
CPU time consumed by the DSB process does not interfere with the applications running
within the VM. Second, the DSB process must ensure that it can allow the balloon to
shrink. When one or more memory-intensive applications begin to run and perform copy-
on-writes within the guest VM, there must be a way for the DSB process to detect this and
respond to it by releasing free pages from the balloon so that the applications can use
them. We describe our mechanism for doing this in the following sections; through
performance experiments we chose an interval of about 5 seconds and verified that
application performance is not adversely affected. During pre-copy migration, DSB is
used continuously. On the other hand, post-copy only performs DSB once right before the
beginning of the downtime phase. After resume, it is disabled and the rest of post-copy
proceeds as described.
5.2.5 Reliability
As we touched on in the introduction, post-copy has a drawback with respect to the reli-
ability of the target node. Either the source or destination node can fail in the middle of
VM migration. In both pre-copy and post-copy migration, failure of the source node implies
permanent loss of the VM itself. Failure of the destination node has different implications in
the two cases. For pre-copy, failure of the destination node does not matter because the
source node still holds an entire up-to-date copy of the VM’s memory and CPU execution
state, and the VM can be revived from this copy if necessary. With post-copy, however, the VM begins execution at the target node as soon as a minimal CPU execution state has been transferred, which implies that the destination node holds the more up-to-date version of the VM state, while the copy at the source is stale except for pages not yet modified at the destination. Thus, failure of the destination node constitutes a critical failure of the VM during post-copy migration.
We plan to address this problem by developing mechanisms to incrementally checkpoint the VM state from the destination node back to the source node, an approach taken by the Xen-based system Remus [18]. Based on their results, we believe that the increased network overhead of doing this would be negligible, but a thorough evaluation would first be required. One approach is as follows: while the active push of pages is in progress from the source node to the destination, we also propagate incremental changes to the memory pages and execution state of the VM at the destination back to the source node. We do not need to propagate these changes on a continuous basis, but only at discrete points, such as when interacting with a remote client over the network or committing an I/O operation to storage. This mechanism can provide a consistent backup image at the source node that we can fall back on in case the destination node fails in the middle of post-copy migration, although at the expense of some increase in reverse network traffic.
Further, once the migration is over, the backup state at the source node can be discarded safely. The performance of this mechanism would depend upon the additional overhead imposed by reverse network traffic from the destination to the source. In a different context, similar incremental checkpointing mechanisms have been used to provide high availability in the Remus project [18].
5.2.6 Summary
We have described post-copy and addressed four problems that are important for the improved migration of system virtual machines. To reduce the total number of pages transferred, we combine the following approaches: demand paging, flushing, pre-paging through what we call “bubbling”, and dynamic self-ballooning (DSB), all working together at the same time.
Figure 5.3: Pseudo-Swapping (item 3): As pages are swapped out within the source guest itself, their MFN identifiers are exchanged and Domain 0 memory-maps those frames with the help of the hypervisor. The rest of post-copy then takes over after downtime.

Demand paging ensures that we eliminate the non-deterministic copying iterations involved in pre-copy. Flushing ensures that no residual dependencies are left on the source host.
Bubbling helps minimize the number of page faults as well as the length of time spent in the resume phase. Self-ballooning allows us to normalize the two migration schemes for comparison by eliminating the transmission of free pages. Note that we do not implement the hybrid scheme mentioned earlier, as it does not directly contribute to the comparison of the two schemes, although it would nonetheless significantly improve the treatment of clean pages during post-copy migration. We leave that to future work.
5.3 Post-Copy Implementation
We have implemented post-copy on top of Xen 3.2.1, along with all of the optimizations introduced in Section 5.2. We use the para-virtualized version of Linux 2.6.18.8 as our base. We begin by discussing the different ways of trapping page faults within the Xen/Linux architecture and their trade-offs. Then we discuss our implementation of dynamic self-ballooning.
5.3.1 Page-Fault Detection
The working set of the virtual machine can (and will) span multiple user applications and in-kernel data structures. We propose three different ways by which the demand-paging component of post-copy at the system-VM level can trap accesses to the WWS.
These include:
1. Shadow Paging: Shadow paging maintains an extra, read-only set of page tables underneath the VM and provides multiple benefits to virtual machines in modern hypervisors: it supports both fully virtualized and para-virtualized VMs, and it facilitates pre-copy migration by detecting page dirtying. In the post-copy case, each attempt to access a page at the target would be trapped by shadow paging. The migration daemon would then use this information to retrieve that page before the read or write can proceed.
2. Page Tracking: The idea here is to use the downtime phase to mark all of the resident pages in the VM as not present in their corresponding page-table entries (PTEs). This has the effect of forcing a real page-fault exception on the CPU. The hypervisor would then be responsible for propagating that fault to Domain 0 to be fixed up. The migration process would then bring in the page and restore the page-table entry to normal. x86 PTEs have two or three unused low-order bits that can be used to track this information for fixup.
3. Pseudo Swapping: This solution preserves the spirit of para-virtualization but remains transparent to applications. The idea is to take the set of all pageable application and page-cache memory within the guest VM and make it “suddenly appear” to have been swapped out, but without the actual cost of doing so, and without the use of any disks whatsoever. Although this sounds strange, recall that the source VM is not running during post-copy; only the target VM is running. So the memory reservation that the source VM occupies essentially acts like a limited swap device. At resume time, the guest VM itself can be para-virtualized to request those pages from this pseudo swap device.
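The PTE manipulation behind the Page Tracking option can be illustrated with a toy model. The Present bit position matches x86; the choice of bit 9 as the software tracking bit is our own assumption for illustration (x86 leaves such bits available to software).

```python
# Illustrative PTE bit manipulation for the page-tracking scheme.
# Bit 0 is the x86 Present bit; when it is clear, the CPU faults on
# access and the remaining bits are free for software bookkeeping.

PTE_PRESENT = 1 << 0
PTE_TRACKED = 1 << 9   # a software-available bit; our convention only

def mark_tracked(pte):
    """Downtime phase: clear Present, remember the page is being tracked."""
    return (pte & ~PTE_PRESENT) | PTE_TRACKED

def fixup(pte):
    """After the migration daemon fetches the page: restore Present."""
    assert pte & PTE_TRACKED, "fault on a page we never marked"
    return (pte | PTE_PRESENT) & ~PTE_TRACKED

pte = 0x12345000 | PTE_PRESENT
absent = mark_tracked(pte)
assert not (absent & PTE_PRESENT)   # any access now faults
assert fixup(absent) == pte         # frame address survives the round trip
```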
In the end, we chose Pseudo Swapping, illustrated in Figure 5.3, because it was the quickest to implement. We initially started with Page Tracking but stopped working on it. We believe that Page Tracking is the fastest, most efficient form
of demand paging at the system VM level. This is because faults are true CPU exceptions.
We began the implementation by trapping those exceptions directly within the hypervisor and then
propagating a new Virtual Interrupt to Domain 0. The major problem with this scheme is
that there exists no way in modern operating systems to detect when a physical page frame
is no longer in use by the operating system. Ideally one could imagine an architecturally
defined bitmap structure that is managed by the OS, not unlike the way a page table is
architecturally defined. This bitmap would allow the hardware to know which page frames
actually contain real data and which are free. Once page tracking was initiated, Domain
0 could use this bitmap in combination with the aforementioned page table modifications
to determine whether or not it was still necessary to fixup the PTE at the given time. Page
Tracking is not feasible without this feature. On the other hand, Shadow Paging provides a
clear middle ground: although it would be slower than Page Tracking (due to the extra level of PTE propagation), it is more transparent than Pseudo Swapping. For the most part, such
an implementation would remain relatively unchanged except for making a hook available
for trapping into Domain 0. Recently, a version of this type of demand paging for use in parallel cloud computing was demonstrated in a tech report [44], built on top of the Xen hypervisor.
Our page-fault detection is implemented through the use of two loadable kernel mod-
ules. One sits inside the migrating VM and one sits inside Domain 0. These modules
leverage our prior work called MemX [49], which provides distributed paging support for
both Xen VMs and Linux machines at the kernel level. Once the target is ready to begin
pre-paging in the post-copy algorithm, MemX is invoked to service page faults through
the use of pseudo swapping as described. Figure 5.4 illustrates a high-level overview of how pre-copy and post-copy relate to each other. Recall that in order to use Pseudo Swapping to implement demand paging, one can only apply it to the set of all pageable
memory in the system. Thus, the remaining memory (which is typically made up of small in-kernel caches or pinned pages) must be sent over to the target host during downtime.

Figure 5.4: The intersection of downtime within the two migration schemes. Currently, our downtime consists of sending non-pageable memory (which can be eliminated by employing shadow paging). Pre-copy downtime consists of sending the last round of pages.
This drawback of Pseudo Swapping puts a small lower bound on the achievable downtime of our implementation of post-copy, but it is not a fundamental limitation of the post-copy method of migration by any means. In future work, we plan to switch to shadow paging as the means of implementing the demand-paging component of post-copy, which will eliminate this drawback. Nonetheless, we report the worsened downtime values in our performance experiments; these downtimes typically range from 600 ms to a little over one second.
5.3.2 MFN Exchanging
Because we chose Pseudo Swapping for its quick implementation, we needed a way of making it appear that the set of all pageable memory in the guest VM had been swapped out without actually moving those pages anywhere. This can be accomplished in two ways: we can either transfer the pages out of the guest VM (and into the maintenance VM), or we can move each physical frame to a new location within the VM itself (with zero copying). We chose the latter because it does not place any extra dependencies on the maintenance VM. We accomplish this by performing what we call an “MFN exchange”. This works by first doubling the memory reservation of the VM, allocating free pages from the new memory, and briefly suspending all of the running processes in the system. We then instruct the kernel to swap out each pageable frame. Each time a used frame is paged out, we rewrite the hypervisor’s PFN-to-MFN mapping table (called the “physmap”) and exchange the two physical frames without actually copying them. We do the same for the kernel-level page-table entries of both physical frames. This is efficient because we batch the hypercalls necessary to perform these operations within the hypervisor. Once downtime has completed, we restart the applications and wait for page faults to the pseudo swap device to arrive.
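As a toy model of the exchange, a dictionary stands in for the physmap (the PFN-to-MFN table); the zero-copy swap then amounts to exchanging two translations, with the real code batching such updates into a single hypercall.

```python
# Toy model of the "MFN exchange": instead of copying a page into the
# pseudo-swap area, the PFN->MFN translations of the used frame and a
# free, newly allocated frame are swapped, so no page contents move.

def mfn_exchange(physmap, pfn_used, pfn_free):
    """Swap the machine frames behind two pseudo-physical frames."""
    physmap[pfn_used], physmap[pfn_free] = physmap[pfn_free], physmap[pfn_used]

# A guest with 4 pseudo-physical frames backed by machine frames 100..103.
physmap = {0: 100, 1: 101, 2: 102, 3: 103}
mfn_exchange(physmap, pfn_used=1, pfn_free=3)
assert physmap[1] == 103 and physmap[3] == 101   # contents never moved
```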
5.3.3 Xen Daemon Modifications
A handful of modifications to the Xen daemon were made to support page-fault detection. The Xen daemon has the responsibility of initializing the migration and the initial memory transfer process, including page tables and CPU state. For our system, the only memory transfer that the daemon is responsible for is the transfer of non-pageable memory; all other pages are deferred until later. Additionally, the set of pages that are eliminated through self-ballooning must also be ignored. By default, however, the Xen daemon has no way of knowing whether a particular memory page belongs to any of these three categories (pageable, non-pageable, or ballooned) because of the strict memory-reservation policy employed by Xen (as it should be). This presents a problem for post-copy: non-pageable memory is transferred in our system using the same code that runs when the daemon executes a pre-copy iteration in the original system. Thus, to support our system, we patch this code to check a new bitmap data structure that indicates whether or not a particular frame should actually be sent (rather than just treating all pages as dirty or not dirty, as in the original system). This bitmap is populated by the kernel module running inside the guest VM at the source (before downtime begins).
The next part is not so obvious upon first examination: the Xen daemon (the management process running inside the co-located Domain 0 on the same host) needs to be able to read this bitmap from user space. Thus, we perform a memory mapping from the kernel space of the guest VM to the user space of the Xen daemon.
Furthermore, in order to perform a successful memory mapping, the Xen daemon needs to know in advance the machine frame numbers (MFNs) of each page frame that physically backs that bitmap data structure. This is required by the nature of memory mapping: the Xen daemon’s page tables must be populated with machine frame numbers, not virtual ones. We discover these MFNs through the physical-to-machine (p2m) mapping table, which translates every PFN of a guest (from 0 to max) into a machine frame number (MFN) owned by the guest virtual machine. To complete the memory mapping, the daemon then needs to know only two pieces of information: the PFN of the first frame of the bitmap and the total number of frames. Thus, the guest VM needs only to transmit these two values to the Xen daemon before downtime begins.
We accomplish this by exporting the PFN of the first frame of the (virtually contiguous) bitmap into the XenStore. The XenStore is a messaging abstraction that allows Xen virtual machines to communicate small pieces of information to each other; it is organized into a directory structure for each co-located virtual machine on the host. Recall that we also have a kernel module running inside the management VM that acts as the retrieval entity for the whole post-copy process and is responsible for facilitating pseudo-paging. This module reads the first bitmap frame address from the XenStore and then communicates that information up to the Xen daemon running inside the same virtual machine. The daemon then memory-maps the bitmap by grabbing each MFN out of the p2m table, one by one, starting from this first frame number. Finally, once the bitmap and the physical frames are mapped, the daemon can determine which frames should be transmitted to the target host and which ones can be ignored by simply checking the bitmap.
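The daemon's side of this discovery can be sketched as follows, assuming (as the text implies) that the bitmap's frames occupy consecutive PFNs; the function name and toy p2m dictionary are our own.

```python
# Sketch of resolving the bitmap's machine frames: the guest publishes
# only the first PFN and the frame count (via the XenStore), and the
# daemon walks the guest's p2m table to recover each MFN to map.

def bitmap_mfns(p2m, first_pfn, nframes):
    """PFN -> MFN lookup for each consecutive bitmap frame."""
    return [p2m[first_pfn + i] for i in range(nframes)]

p2m = {pfn: 0x800 + pfn for pfn in range(16)}    # toy p2m table
assert bitmap_mfns(p2m, first_pfn=4, nframes=3) == [0x804, 0x805, 0x806]
```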
5.3.4 VM-to-VM kernel-to-kernel Memory-Mapping
During downtime, in addition to the Xen daemon modifications of the previous section, the module running within Domain 0 that is responsible for transmitting faulted pages to the target host also has the responsibility of memory-mapping the entirety of the guest VM’s memory footprint. This avoids copying memory. A problem arises, however, similar to the one presented in Section 5.3.3: in order to complete this memory mapping, we must again know the addresses of each page frame owned by the migrating guest VM. This is a much larger task, however, because we are not just exporting a bitmap to another virtual machine (where the total mapped data is one bit per page); instead, we are memory-mapping 8 bytes per page owned by the guest. Thus, for a common 512 MB guest virtual machine, we have a megabyte of data to transmit to the other virtual machine (512 MB constitutes 128K pages, so 64-bit page frame identifiers require a megabyte of memory to store all of the machine frame numbers).
The problem with this megabyte is that one cannot reliably allocate a contiguous megabyte of memory in kernel space: slab caches and kmalloc() are not meant for that, which leaves the alloc_pages() family of routines in Linux. These routines allocate memory in power-of-two orders of 4 KB pages, and the largest contiguous order allowed by Linux is 12 (and that only under ideal circumstances). Even a simple 1 MB allocation requires an order-8 allocation (256 contiguous pages), and larger VM memory sizes would approach orders 9 and 10. Under a heavily utilized system, it is highly unlikely that the Linux buddy allocator would satisfy such requests. This requires us to find another way to send this megabyte of data to the module inside Domain 0: through a second-level memory mapping.
This solution involves constructing a kind of “impromptu” page-table structure. This structure has the exact same three-level hierarchy as a regular page table, except that it is not architecturally defined; it still places the required MFN data at the leaves of the tree. We create this structure very quickly and pass the root of the table to the Domain 0 module through the XenStore, as was done in the previous section. During downtime, the receiving module maps each frame of the page-table structure itself in a recursive fashion, beginning at the root. Once that is complete, it maps all of the page-frame MFNs stored at the leaves of the table. These leaves collectively store the addresses of only those page frames that can potentially incur page faults. Thus, when a page fault actually occurs at the target host, the module need only consult this table and snap up the page for transmission without any copying whatsoever.
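The impromptu table can be modeled as nested lists with the leaves holding MFNs. A fan-out of 512 would mirror a 4 KB frame of 8-byte entries; we use 4 here purely to keep the toy example small, and both function names are our own.

```python
# Toy version of the "impromptu" 3-level table: leaf frames hold the
# MFNs of the guest's pageable pages; the Domain 0 module walks the
# tree top-down from a single root frame, mapping as it goes.

FANOUT = 4   # stand-in for 512 entries per 4 KB frame

def build_table(mfns):
    """Pack a flat MFN list into a 3-level nested structure."""
    def chunk(xs, n):
        return [xs[i:i + n] for i in range(0, len(xs), n)]
    return chunk(chunk(mfns, FANOUT), FANOUT)   # leaves, then mid level

def walk_table(root):
    """Recursively 'map' the table, yielding every leaf MFN in order."""
    for mid in root:
        for leaf in mid:
            yield from leaf

mfns = list(range(1000, 1020))
root = build_table(mfns)
assert list(walk_table(root)) == mfns   # every frame address recovered
```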
5.3.5 Dynamic Self Ballooning Implementation
The Xen hypervisor has a set of hypercalls that allow a guest to change its memory reservation on demand. The general idea of implementing DSB under Xen is three-fold. We discuss how each of these steps is implemented within our post-copy system, and also how we modified it for use with the original Xen implementation of pre-copy:
1. Inflate the balloon: For migration, this is accomplished by allocating as much free
memory as possible and handing those pages over to the “decrease reservation”
hypercall. This results in those pages being placed under the ownership of the hy-
pervisor.
2. Detect memory pressure: There are a few ways of doing this within Linux, which
we will describe shortly. Memory pressure indicates that either an application or the
kernel needs a page frame right now. In response, the DSB process must deflate
the balloon by the corresponding amount of memory pressure (but it need not destroy
the balloon completely).
3. Deflate the balloon: This is accomplished by performing the reverse of step 1: the DSB process first invokes the “increase reservation” hypercall, then releases the list of free pages that were previously allocated (and handed to the hypervisor for re-use), giving them back to the kernel’s free pool.
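The three steps above can be sketched as follows. The hypercall names in the comments mirror Xen's decrease/increase-reservation memory operations, while the class itself is purely our illustration of batching each step into a single call.

```python
# Minimal model of the three DSB steps with batched hypercalls.

class Balloon:
    def __init__(self, free_pages):
        self.free = list(free_pages)  # guest's free page pool
        self.held = []                # frames handed to the hypervisor
        self.hypercalls = 0           # each batch costs one hypercall

    def inflate(self):
        """Step 1: hand every free page to the hypervisor in one batch."""
        batch, self.free = self.free, []
        self.held.extend(batch)
        self.hypercalls += 1          # one batched decrease-reservation

    def deflate(self, npages):
        """Step 3: on memory pressure, reclaim npages in one batch."""
        batch = [self.held.pop() for _ in range(min(npages, len(self.held)))]
        self.free.extend(batch)
        self.hypercalls += 1          # one batched increase-reservation

bal = Balloon(range(128))
bal.inflate()
assert not bal.free and len(bal.held) == 128
bal.deflate(32)                       # step 2's pressure signal fired
assert len(bal.free) == 32 and bal.hypercalls == 2
```

With one hypercall per batch instead of one per page, inflating a 128-page balloon costs a single transition into the hypervisor rather than 128.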
In order to rapidly inflate and deflate the balloon, we first had to determine where to initiate these operations. One can either place the DSB process within Domain 0 and communicate the intent to modify the balloon to the migrating VM through Xen’s internal communication mechanism (the XenStore), or one can place the DSB process within the VM itself. Because ballooning requires internal knowledge of the kernel anyway, we chose the latter. The deciding factor, however, was the balloon driver that ships with the Xen source code. We found this driver to be slow: it does not batch the hypercalls required to perform ballooning, but instead executes them one by one. In our laboratory, we observed that a single hypercall can take as long as 2-3 microseconds. If DSB is to be performed rapidly, these hypercalls must be batched together into a single hypercall, a facility that Xen already provides.
Thus we placed the DSB process within the guest VM itself and updated the existing driver to perform this batching.
Memory Overcommitment. Memory over-commitment within an individual operating system is a method by which the virtual memory subsystem provides an application with the illusion that the machine has more physical memory than it actually does. There are multiple operating modes of over-commitment within the Linux kernel, and these modes can be enabled or disabled at runtime. By default, Linux disables this feature, which has the effect of refusing application-level memory allocations in advance: if an application submits an allocation request without sufficient physical memory available, Linux returns an
error. If over-commitment is enabled, however, the kernel effectively treats the pool of physical memory as infinite. One could spend an entire paper arguing that the over-commitment
feature should be enabled by default, but the Linux community has instead chosen to “err
on the side of caution” and defer such a decision to experienced system administrators.
Over-commitment is required for the transparent detection of memory pressure that we
have developed for our version of the DSB process, which we describe next.
Detecting Memory Pressure. Surprisingly enough, the Linux kernel already provides
a transparent way of doing this: through the filesystem interface. When a new filesystem
is registered with the kernel, one of the function pointers provided includes a callback to
request that the filesystem free any in-kernel data caches that the filesystem may have CHAPTER 5. POST-COPY: LIVE VIRTUAL MACHINE MIGRATION 91 pinned in memory. Such caches typically include things like inode and directory entry caches. These callbacks are driven by the virtual memory system and are invoked when applications ask for more memory. Indirectly, the virtual memory system makes this de- termination when it is time to perform a copy-on-write on behalf of an application that has allocated a large amount of memory but has only recently decided to write to it for the
first time. Consequently, the DSB process does not register a new filesystem, but we are still allowed to register a callback function by which the virtual memory system can use.
This works remarkably well and provides very precise feedback to the DSB process on exactly when a memory-intensive application has become active. The Linux function in question is set_shrinker(). Alternatively, one could periodically wake the
DSB process at an interval and scan /proc/meminfo and /proc/vmstat to determine this information by hand; we found the filesystem interface to be both more direct and more accurate. Each callback carries a numeric value specifying exactly how many pages the virtual memory system wants the DSB process to release at once. The size of this batch is typically 128 pages. Callbacks can arrive very frequently, back to back, on behalf of active user applications. Each time a callback occurs, the DSB process deflates the balloon by the requested amount and goes back to sleep.
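The callback-driven deflation described above can be sketched in userspace as follows. This is a hypothetical model, not the guest-kernel code itself: the names (dsb_init, dsb_shrink) are illustrative stand-ins for the real set_shrinker() callback, and only the batching arithmetic is shown.

```c
#include <stddef.h>

/* Hypothetical userspace sketch of the DSB shrinker-callback logic.
 * In the real implementation this is a set_shrinker() callback in the
 * guest kernel; names and structure here are illustrative only. */

#define SHRINK_BATCH 128          /* typical batch size requested per callback */

static size_t balloon_pages;      /* pages currently held by the balloon */

/* Initialize the simulated balloon. */
void dsb_init(size_t pages) { balloon_pages = pages; }

/* Memory-pressure callback: the virtual memory system asks us to free
 * 'nr_requested' pages. Deflate the balloon by up to that amount and
 * report how many pages were actually released. */
size_t dsb_shrink(size_t nr_requested)
{
    size_t release = nr_requested < balloon_pages ? nr_requested : balloon_pages;
    balloon_pages -= release;     /* return these pages to the guest kernel */
    return release;
}

size_t dsb_balloon_size(void) { return balloon_pages; }
```

In the real DSB process, the released pages go back onto the guest kernel's free lists rather than being merely counted.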
Completing the DSB process. Finally, the DSB process, armed with the ability to detect memory pressure, must also periodically reclaim free pages that running applications or the kernel itself may have released. We perform this sort of “garbage collection” within a kernel thread. (Note: this is not true garbage collection; that is not our intention.) The kernel thread wakes up at periodic intervals, attempts to re-inflate the balloon as much as possible, and goes back to sleep. If memory pressure is detected during this time, the thread preempts itself, ceases inflation, and goes back to sleep.
The only thing required to complete this is a 200-line patch to the Xen migration daemon running within Domain 0. Recall the operation of the DSB process with respect to pre-copy and post-copy: post-copy uses DSB only once (the kernel thread balloons a single time before downtime occurs and then goes back to sleep), whereas DSB runs continuously for pre-copy. The migration daemon strictly adheres to one policy: if a page frame has never been mapped before, it will not be migrated or transmitted. Note that this is not the same as detecting whether a page frame has been allocated and subsequently freed; the daemon only knows when a page has been allocated for the first time (by the assignment of a machine frame number to the corresponding pseudo-physical frame number). This information is stored in what Xen calls a “physmap”, which we discussed earlier in the mfn-exchanging section. A property of this physmap is that the total number of valid entries in it is monotonically increasing; it never decreases on the same host. This means that if the DSB process has inflated the balloon and the balloon contains a page frame that is mapped inside the physmap table, the migration daemon will transmit that frame regardless.
That defeats the purpose of the DSB process. So we modify the migration daemon by
exposing to it the list of ballooned pages. As a result, whenever the migration daemon is
ready to transmit a particular page, it first consults that list and skips transmission if it is in
the list. (This list is actually a bitmap). Our suggestion to the Xen community is to develop
a sort of watermarked “dynamic physmap garbage collection” such that the kernel would
be responsible for clearing the physmap when it is no longer using a page. This is almost
identical to the earlier suggestion in the Page Tracking scheme we devised, except that
such use of the physmap would not be architecturally defined - nor would it necessarily
be visible to the hardware. We believe that a garbage-collected physmap would allow
for both the seamless implementation of Dynamic Self-Ballooning as well as the ability to
implement Page Tracking without hardware support. But for now, we are using the cards
we have been dealt.
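A minimal sketch of the skip-if-ballooned check might look like the following. The function names and the bitmap layout are our own illustration, not the actual 200-line patch; only the decision logic is modeled.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the migration-daemon modification: the daemon
 * consults a bitmap of ballooned page frames and skips any frame the DSB
 * process currently holds, even if that frame appears in the physmap. */

#define MAX_PFNS 4096
static uint8_t ballooned[MAX_PFNS / 8];   /* one bit per guest page frame */

void mark_ballooned(size_t pfn) { ballooned[pfn / 8] |= (uint8_t)(1u << (pfn % 8)); }
int  is_ballooned(size_t pfn)   { return (ballooned[pfn / 8] >> (pfn % 8)) & 1; }

/* Returns 1 if the daemon should transmit this frame, 0 if it may skip it.
 * 'in_physmap' models whether the frame was ever mapped (a monotonic set). */
int should_transmit(size_t pfn, int in_physmap)
{
    if (!in_physmap)           /* never mapped: never transmitted */
        return 0;
    if (is_ballooned(pfn))     /* held by the balloon: contents are unneeded */
        return 0;
    return 1;
}
```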
5.3.6 Proactive LRU Ordering to Improve Reference Locality
During normal operation, the guest kernel maintains the age of each allocated page frame
in its page cache. Linux, for example, maintains two linked lists in which pages are maintained in Least Recently Used (LRU) order: one for active pages and one for inactive
pages. A kernel daemon periodically ages and transfers these pages between the two
lists. The inactive list is subsequently used by the paging system to reclaim pages and
write to the swap device. As a result, the order in which pages are written to the swap
device reflects the historical locality of access by processes in the VM. Ideally, the active push component of post-copy could simply use this ordering of pages in its pseudo-paging device to predict the page access pattern in the migrated VM and push pages just in time to avoid network faults. However, Linux does not actively maintain the LRU ordering in these lists until a swap device is enabled. Since a pseudo-paging device is enabled just before migration, post-copy would not automatically see pages in the swap device ordered in the
LRU order. To address this problem, we implemented a kernel thread which periodically scans and reorders the active and inactive lists in LRU order, without modifying the core kernel itself. In each scan, the thread examines the referenced bit of each page. Pages with their referenced bit set are moved to the most recently used end of the list and their referenced bit is reset. This mechanism supplements the kernel’s existing aging support without the requirement that a real paging device be turned on. Section 5.4.4 shows that such a proactive LRU ordering plays a positive role in reducing network faults.
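The reordering scan can be modeled as follows, with a simple array standing in for the kernel's linked lists. The names and data layout are illustrative; only the move-referenced-pages-to-the-MRU-end behavior is shown.

```c
#include <stddef.h>

/* Hypothetical userspace model of the proactive LRU-ordering thread: one
 * scan over a page list moves referenced pages to the MRU end and clears
 * their referenced bits, mimicking the kernel's aging without requiring
 * a real swap device. Index 0 is the LRU end; the highest index is MRU. */

struct page { int id; int referenced; };

/* One scan pass: stable-partition unreferenced pages before referenced
 * ones, resetting the referenced bit of each page that is moved up. */
void lru_scan(struct page *list, size_t n)
{
    struct page tmp[256];
    size_t out = 0, i;
    for (i = 0; i < n; i++)          /* keep unreferenced pages in order */
        if (!list[i].referenced)
            tmp[out++] = list[i];
    for (i = 0; i < n; i++)          /* move referenced pages to MRU end */
        if (list[i].referenced) {
            list[i].referenced = 0;  /* reset the bit, as the thread does */
            tmp[out++] = list[i];
        }
    for (i = 0; i < n; i++)
        list[i] = tmp[i];
}
```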
Lines of Code. The kernel-level implementation of Post-Copy, which leverages the
MemX system, is about 7000 lines of code within pluggable kernel modules. About 4000 of those lines belong to the part of the MemX system that is invoked during demand paging; the remaining 3000 lines implement the pre-paging, flushing, and ballooning components combined. (The DSB implementation also operates within the aforementioned kernel modules and runs inside the guest OS itself as a kernel thread; there is no Domain 0 interaction with the DSB process.) A 200-line patch is applied to the migration daemon to support ballooning, and a 300-line patch is applied to the guest kernel to initiate pseudo-swapping. All told, the system remains completely transparent to applications and comprises about 8000 lines. Neither the original pre-copy algorithm code nor the hypervisor itself is changed at all. (As discussed before, alternative page-fault detection methods would require additional hypervisor support.)
5.4 Evaluation
In this section, we present a detailed evaluation of our post-copy implementation and compare it against Xen's original pre-copy migration. Our test environment consists of two 2.8 GHz dual-core Intel machines connected via a Gigabit Ethernet switch. Each machine has 4 GB of memory. Both the guest VM in each experiment and the Domain
0 are configured to use two virtual CPUs. Guest VM sizes range from 128 MB to 1024
MB. Unless otherwise specified, the default guest VM size is 512 MB. In addition to the performance metrics mentioned in Section 5.2, we evaluate post-copy against an additional metric. Recall that post-copy is effective only when a large majority of the pages reach the target node before the VM faults on them at the target, in which case they become minor page faults rather than network-bound page faults. Thus the fraction of network page faults relative to minor page faults is another indication of the effectiveness of our post-copy approach. Secondly, we quantify the pages transferred by pre-copy by scraping those numbers from the Xen logs; for post-copy, we output this information to proc files.
That value is then added to the number of pages that contribute to “non-pageable memory” for a grand total.
5.4.1 Stress Testing
We start with a stress test of both migration schemes, using a simple, highly sequential, memory-intensive C program. The program accepts one parameter to set the working set of its memory accesses and a second parameter to control whether it performs memory reads or writes during the test. The experiment is performed in a 1024
MB VM with its working set ranging from 8 MB to 512 MB. The rest is simply free memory.
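A sketch of such a stress program's inner loop is shown below, assuming 4 KB pages. The exact program is not reproduced here; the function name and loop structure are our own illustration of the read/write-mode behavior just described.

```c
#include <stdlib.h>

/* Hypothetical sketch of the stress-test program: walk a working set of
 * the given size once, performing either reads or writes page by page.
 * Page size and loop structure are illustrative only. */

#define PAGE_SIZE 4096

/* Touch 'ws_bytes' of memory once; 'do_writes' selects write vs. read mode.
 * Returns a checksum so the read loop is not optimized away. */
unsigned long stress_pass(unsigned char *buf, size_t ws_bytes, int do_writes)
{
    unsigned long sum = 0;
    size_t off;
    for (off = 0; off < ws_bytes; off += PAGE_SIZE) {
        if (do_writes)
            buf[off] = (unsigned char)(off >> 12);  /* dirty the page */
        else
            sum += buf[off];                        /* read-only touch */
    }
    return sum;
}
```

Running the write mode repeatedly is what stresses pre-copy's dirty-page tracking, while the read mode leaves pages clean after the first iteration.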
We perform the experiment with seven different test configurations:
1. Stop-and-copy Migration: This is a non-live VM migration scenario which provides
a baseline to compare the total migration time and number of pages transferred by
post-copy.
2. Read-intensive Pre-Copy: This configuration provides the best-case workload for
pre-copy. The total migration time is expected to be roughly
similar to that of pure stop-and-copy migration.
3. Write-intensive Pre-Copy: This configuration provides the worst-case workload for
pre-copy and causes worsening of all performance metrics.
4. Read-intensive Post-Copy:
[Figure: Total migration time (secs, 0-80) vs. working set size (8 MB to 512 MB) for Write/Read Pre-Copy with and without DSB, Write/Read Post-Copy, and Stop-and-Copy with DSB.]
Figure 5.5: Comparison of total migration times between post-copy and pre-copy.
5. Write-intensive Post-Copy: These two configurations stress our pre-paging algorithm and flushing implementations and are expected to perform almost identically.
6. Read-intensive Pre-Copy without DSB:
7. Write-intensive Pre-Copy without DSB: These two configurations test the default
implementation of pre-copy in Xen, which does not use DSB. Unless we specify otherwise, the reader should assume that DSB is turned on for pre-copy. Post-copy always
uses DSB.
For each figure, the plots in the legend are in the same order as you see them from top to bottom in the figure.
Total Migration Time: Figure 5.5 shows the variation of total migration time with increasing working set size. Notice that both post-copy plots for total time are at the bottom, surpassed only by read-intensive pre-copy. Our first observation is that both the read and write intensive tests of post-copy perform very similarly. Thus our post-copy algorithm's performance is agnostic to the read or write-intensive nature of the application workload.
Future work might involve giving page-fault writes priority over reads. Furthermore, we observe that without DSB activated, as in the default Xen implementation, the total migration time for read-intensive pre-copy is very high due to unnecessary transmission of free guest pages over the network. This conclusion demonstrates itself in the three remaining plots as well.

[Figure: Downtime (millisec, 0-5000) vs. working set size (8 MB to 512 MB) for Write/Read Pre-Copy with and without DSB and Write/Read Post-Copy.]

Figure 5.6: Comparison of downtimes between pre-copy and post-copy.
Downtime: Figure 5.6 exhibits similar behavior for the downtime metric as the working set size increases. Recall that our choice of page-fault detection in Section 5.3 increases the base downtime in post-copy. Thus, the figure shows a roughly constant downtime that ranges from 600 milliseconds to over one second. As expected, the downtime for the write-intensive pre-copy test increases significantly as the size of the writable working set increases.
Pages Transferred and Page Faults: Figure 5.7 and Table 5.2 illustrate the utility of our pre-paging algorithm in post-copy across increasingly large working set sizes. Figure 5.7 plots the total number of pages transferred. As expected, post-copy transfers fewer pages than write-intensive pre-copy as well as pre-copy without DSB, the reduction being as much as 85%. It performs on par with read-intensive pre-copy with DSB and stop-and-copy, all of which transfer each page only once over the network. Table 5.2 compares the fraction of network and minor faults in post-copy. We see that pre-paging reduces the fraction of network faults to 2-4%, compared to the 9-15% incurred by flushing alone. To be fair, the stress test is highly sequential in nature; consequently, pre-paging predicts this behavior almost perfectly. We expect the real applications in the next section to do worse than this optimal case.

[Figure: Number of 4 KB pages transferred (0-1M) vs. working set size (8 MB to 512 MB) for the seven test configurations.]

Figure 5.7: Comparison of the number of pages transferred during a single migration.

Working Set Size | Pre-Paging (Net / Minor) | Flushing (Net / Minor)
8 MB             | 2% / 98%                 | 15% / 85%
16 MB            | 4% / 96%                 | 13% / 87%
32 MB            | 4% / 96%                 | 13% / 87%
64 MB            | 3% / 97%                 | 10% / 90%
128 MB           | 3% / 97%                 |  9% / 91%
256 MB           | 3% / 98%                 | 10% / 90%

Table 5.2: Percent of minor and network faults for flushing vs. pre-paging. Pre-paging greatly reduces the fraction of network faults.

[Figure: Completion time (secs, 0-300) for a kernel compile vs. guest memory size (128 MB to 1024 MB) under No Migration, Pre-Copy w/o DSB, Post-Copy, and Pre-Copy with DSB.]

Figure 5.8: Kernel compile with back-to-back migrations using 5-second pauses.
5.4.2 Degradation, Bandwidth, and Ballooning
Next, we quantify the side effects of migration on a couple of sample applications. We want to answer the following questions: What kind of slowdown do VM workloads experience during pre-copy versus post-copy migration? What is the impact on network bandwidth received by applications? And finally, what balloon-inflation interval should we choose to minimize the impact of DSB on running applications? For application degradation and
DSB interval, we use Linux kernel compilation. For bandwidth testing we use the NetPerf
TCP benchmark.
Degradation Time: Figure 5.8 depicts a repeat of an interesting experiment from [73].
We initiate a kernel compile inside the VM and then migrate the VM repeatedly between
two hosts. We script the migrations to pause for 5 seconds each time. Although there
is no exact way to quantify degradation time (due to scheduling and context switching),
this experiment provides an approximate measure. As far as memory is concerned, we
observe that kernel compilation tends not to exhibit too many memory writes. (Once gcc
forks and compiles, the OS page cache will only be used once more at the end to link
the kernel object files together.) As a result, this experiment is good for post-copy comparison
because it represents the best case for the original pre-copy approach when there is not much repeated dirtying of pages. This experiment is also a good worst-case tester for our implementation of Dynamic Self Ballooning due to the repeated fork-and-exit behavior of the kernel compile as each object file is created over time. (Interestingly enough, this experiment also gave us a headache, because it exposed the bugs in our code!) We were surprised to see how many additional seconds were added to the kernel compilation in
Figure 5.8 just by executing back-to-back invocations of pre-copy migration. Nevertheless, we observe that post-copy matches pre-copy's amount of degradation.
Although we would have preferred to see less degradation than pre-copy, we can at least rest assured that we’re not doing worse. This is in line with the competitive performance of post-copy with read-intensive pre-copy tests in Figures 5.5 and 5.7. We suspect that a shadow-paging based implementation of post-copy would perform much better due to the significantly reduced downtime it would provide.
Additionally, Figure 5.9 shows the same experiment using NetPerf. A sustained, high-bandwidth stream of network traffic causes slightly more page-dirtying than the compilation does. The setup involves placing the NetPerf sender inside the guest VM and the receiver on an external node on the same switch. Consequently, regardless of VM size, post-copy actually performs slightly better and reduces the degradation time experienced by
NetPerf. The figure also indicates an example of severe degradation without DSB due to transmission of free pages.
Effect on Bandwidth: In their paper [27], the Xen project proposed a solution called
“adaptive rate limiting” to control the bandwidth overhead due to migration.

[Figure: Completion time (sec, 0-250) for NetPerf vs. guest memory size (128 MB to 1024 MB) under No Migration, Pre-Copy w/o DSB, Post-Copy, and Pre-Copy with DSB.]

Figure 5.9: NetPerf run with back-to-back migrations using 5-second pauses.

However, this feature is not enabled in the currently released version of Xen; in fact, it is compiled out, with no runtime options or pre-processor directives to enable it. This is likely because it is difficult, if not impossible, to predict beforehand the bandwidth requirement of any single guest in order to guide the behavior of adaptive rate limiting. Hence, there is no explicit arbitration of network bandwidth between the migration daemon and a network-heavy application operating simultaneously. With that in mind, Figures 5.10 and 5.11 show the reduction in bandwidth experienced by a high-throughput NetPerf session. We conduct this experiment by measuring bandwidth values rapidly and invoking VM migration in between. The impact of migration can be seen in both
figures as a sudden reduction in the observed bandwidth during migration. This reduction is more sustained, and greater, for pre-copy than for post-copy because pre-copy transfers many more pages in total. This is exactly the bottom line that we were targeting for improvement. We believe the Xen developers' choice does make sense, however: the migration daemon cannot guess whether the guest is hosting, say, a webserver; if it is, the webserver will likely take whatever size pipe it can get its hands on, which would suggest that the migration daemon should just let TCP do what it normally does. On the other hand, the daemon might use up CPU cycles that would otherwise be granted to the guest itself. The point is that it is all guesswork without some kind of signal from the guest. In fact, inspecting the Xen migration daemon's code shows that the end of the pre-copy iteration process is guided by only two factors: a 30-iteration maximum combined with a minimum page-dirtying rate of 50 pages per pre-copy round. The daemon iterates until one of those conditions is met, which is why even mildly write-intensive applications never converge before hitting the iteration cap. Each experiment henceforth operates under this default mode of operation.

[Figure: Post-copy NetPerf bandwidth timeline, annotated: (1) normal operation, (2) DSB + pre-paging invocation, (3) CPU + non-paged memory transfer, (4) resume, (5) migration complete.]

Figure 5.10: Impact of post-copy on NetPerf bandwidth.

[Figure: Pre-copy NetPerf bandwidth timeline, annotated: (1) normal operation, (2) DSB invocation, (3) iterative memory copies, (4) CPU-state transfer, (5) migration complete.]

Figure 5.11: Impact of pre-copy on NetPerf bandwidth.

[Figure: Slowdown time (secs, 0-50) vs. balloon interval (0-800 jiffies) for a kernel compile (439 secs baseline) in 128 MB and 512 MB guests.]

Figure 5.12: The application degradation is inversely proportional to the ballooning interval.
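The two pre-copy termination conditions noted above (a 30-iteration cap combined with a 50-pages-per-round dirtying threshold) can be modeled as follows. This is a hypothetical reconstruction of the logic, not the daemon's actual code.

```c
/* Hypothetical model of the pre-copy termination logic observed in the
 * Xen migration daemon: iterate until either a 30-round cap is reached
 * or the guest dirties fewer than 50 pages in a round. The array
 * dirty_per_round[] stands in for the live dirty-page counts. */

#define MAX_ITERS      30
#define MIN_DIRTY_RATE 50

/* Returns the number of pre-copy rounds performed before stopping. */
int precopy_rounds(const int *dirty_per_round, int n)
{
    int i;
    for (i = 0; i < n && i < MAX_ITERS; i++)
        if (dirty_per_round[i] < MIN_DIRTY_RATE)
            return i + 1;   /* converged: dirtying rate fell below 50 */
    return i;               /* hit the iteration cap (or ran out of data) */
}
```

A workload that keeps dirtying pages faster than the threshold will always run the full 30 rounds, transferring many pages repeatedly, which is the behavior the bandwidth figures make visible.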
Dynamic Ballooning Interval: Figure 5.12 shows how we chose the DSB interval, the period at which the DSB process wakes up to reclaim available free memory. With the kernel compile as the test application, we execute the DSB process at intervals ranging from 10 ms to 10 s. At every interval, we script the kernel compile to run multiple times and output the average completion time. The difference between that number and the base case is the degradation time added to the application by the DSB process due to its CPU usage. As expected, the application degradation is inversely proportional to the ballooning interval: the more often you balloon, the more it affects the VM workload. The graph indicates that we should choose an interval between 4 and 10 seconds to balance frequent reclamation of free pages against significant impact on applications. Note that this graph represents only one type of mixed application. For more CPU-intensive workloads, it would be necessary to make the ballooning interval adaptive, so that it could increase for CPU-intensive applications or applications that perform rapid memory allocation.
5.4.3 Application Scenarios
The last part of our evaluation is to re-visit the aforementioned performance metrics across four real applications:
1. SPECWeb 2005: This is our largest application. It is a well-known webserver benchmark involving two or more physical hosts. We place the system under test within the guest VM, while six separate client nodes bombard the VM with connections.
2. BitTorrent Client: Although this is not a typical server application, we chose it
because it is a simple representative of a multi-peer distributed application. It is easy
to initiate and does not immediately saturate a Gigabit Ethernet pipe. Instead it fills
up the network pipe gradually, is slightly CPU intensive, and involves a somewhat
more complex mix of page-dirtying and disk I/O than just a kernel compile.
3. Linux Kernel Compile: We consider this again for consistency.
4. NetPerf: Once more, as in the previous experiments, the NetPerf sender is placed
inside the guest VM.
Using these applications, we evaluate the same four primary metrics that we covered in
Section 5.4.1: Downtime, Total Migration Time, Pages Transferred, and Page Faults. Each
figure for these applications represents one of the four metrics and contains results for a constant, 512 MB virtual machine in the form of a bar graph for both migration schemes across each application. Each data point is the average of 20 samples. And just as before, the guest VM is configured to have two virtual CPUs. All of these experiments have DSB activated.

[Figure: Number of 4 KB pages transferred (0-200000) for post-copy vs. pre-copy across Kernel Compile, NetPerf, SpecWeb2005, and BitTorrent.]

Figure 5.13: Total pages transferred for both migration schemes.
Pages Transferred and Page Faults. The experiments in Figures 5.13 and 5.14 illustrate these results. For all of the applications except SPECWeb, post-copy reduces the total pages transferred by more than half. The most significant result we have seen so far is in Figure 5.14, where post-copy's pre-paging algorithm is able to avoid 79% and 83% of the network page faults (which become minor faults) for the largest applications (SPECWeb, BitTorrent). For the smaller applications (Kernel Compile, NetPerf), we still manage to save 41% and 43% of the network page faults. There is a significant amount of additional prior work in the literature aimed at working-set identification, and we believe that these improvements could be even better if we employed both knowledge-based and history-based predictors in our pre-paging algorithm. But even with a purely reactive approach, post-copy is a strong competitor.
[Figure: Number of 4 KB page faults (log scale) split into minor and network faults per application; network faults account for 21% (SpecWeb2005), 17% (BitTorrent), 59% (Kernel Compile), and 57% (NetPerf).]

Figure 5.14: Page-fault comparisons: Pre-paging lowers the network page faults to 17% and 21%, even for the heaviest applications.

[Figure: Total migration time (secs, 0-12) for post-copy vs. pre-copy across Kernel Compile, NetPerf, SpecWeb2005, and BitTorrent.]

Figure 5.15: Total migration time for both migration schemes.

Total Time and Downtime. Figure 5.15 shows that post-copy reduces the total migration time for all applications, when compared to pre-copy, in some cases by more than 50%. However, the downtime in Figure 5.16 is currently much higher for post-copy than for pre-copy. As we explained earlier, the relatively high downtime is due to our expedient choice of pseudo-paging for page-fault detection, which we plan to reduce through the use of shadow paging. Nevertheless, this tradeoff between total migration time and downtime may be acceptable in situations where network overhead needs to be kept low and the entire migration needs to be completed quickly.

[Figure: Downtime (millisec, 0-2000) for post-copy vs. pre-copy across Kernel Compile, NetPerf, SpecWeb2005, and BitTorrent.]

Figure 5.16: Downtime for post-copy vs. pre-copy. Post-copy downtime can improve with better page-fault detection.
5.4.4 Comparison of Prepaging Strategies
This section compares the effectiveness of different prepaging strategies. The VM workload is a Quicksort application that sorts a randomly populated array of user-defined size.
We vary the number of processes running Quicksort from 1 to 128, such that 512 MB of memory is collectively used among all processes. We migrate the VM in the middle of its workload execution and measure the number of network faults during migration. A smaller network-fault count indicates better prepaging performance.

[Figure: Number of 4 KB page faults (0-6000) vs. (# processes, MB per process), from (1, 512) to (128, 4), for Push-Only, Push-Only-LRU, Forward, Dual, Forward-MultiPivot, Forward-LRU, Dual-LRU, Forward-MultiPivot-LRU, and Dual-MultiPivot-LRU.]

Figure 5.17: Comparison of prepaging strategies using multi-process Quicksort workloads.

We compare a number of prepaging combinations by varying the following factors:
1. whether or not some form of bubbling is used;
2. whether the bubbling occurs in forward-only or dual directions;
3. whether single or multiple pivots are used; and
4. whether the page-cache is maintained in LRU order.
Figure 5.17 shows the results. Each vertical bar represents an average over 20 experimental runs. The first observation is that bubbling, in any form, performs better than push-only prepaging. Secondly, sorting the page cache in LRU order outperforms the non-LRU cases by improving the locality of reference of neighboring pages in the pseudo-paging device. Thirdly, dual-directional bubbling improves performance over forward-only bubbling in most cases, and never performs significantly worse; this indicates that it is always preferable to use dual-directional bubbling. (The performance of reverse-only bubbling was found to be much worse than even push-only prepaging, hence its results are omitted.) Finally, dual multi-pivot bubbling consistently improves performance over single-pivot bubbling, since it exploits locality of reference at multiple locations in the pseudo-paging device.
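As a rough illustration, dual-directional bubbling around a single pivot can be sketched as follows. The function and its interface are hypothetical; it only shows the order in which pages around a fault site would be pushed.

```c
#include <stddef.h>

/* Hypothetical sketch of dual-directional "bubbling": starting from a
 * pivot (the page whose network fault was just serviced), the source
 * pushes pages outward in both directions, alternating sides, so that
 * pages near the fault site arrive first. */

/* Fill 'order' with up to 'max' page indices from [0, n) around 'pivot'.
 * Returns the number of indices emitted. */
size_t bubble_order(size_t pivot, size_t n, size_t *order, size_t max)
{
    size_t count = 0, step = 1;
    if (pivot < n && count < max)
        order[count++] = pivot;              /* the faulting page itself */
    while (count < max && (step <= pivot || pivot + step < n)) {
        if (pivot + step < n && count < max)
            order[count++] = pivot + step;   /* forward edge */
        if (step <= pivot && count < max)
            order[count++] = pivot - step;   /* backward edge */
        step++;
    }
    return count;
}
```

Multi-pivot bubbling would maintain several such expanding regions at once, one per recent fault site, instead of a single pivot.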
5.5 Summary
We have presented post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning. Post-copy is a combination of four pieces: demand paging, our pre-paging algorithm called “bubbling”, flushing, and dynamic self-ballooning. We have implemented and evaluated this system and shown that it achieves significant performance improvements over pre-copy based migration of system virtual machines by reducing the number of pages transferred between the source and target hosts. Our future work will explore alternative page-fault detection mechanisms as well as further applications of dynamic self-ballooning. A great deal of additional work remains. As we mentioned in Section 5.3, there are three different methods by which one can implement page-fault detection to support demand paging at the virtual machine level. We would like to set aside our expedient choice of pseudo-swapping in favor of a shadow-paging based method of detection and, if possible, investigate extensions to the Xen physmap (the array of mappings between pseudo-physical and real page frames), with the goal of implementing the more efficient use of real CPU exceptions, which we called “page tracking”.
Second, as stated in Section 5.2, we must take care to address the reliability issue for post-copy so that we may provide the same level of reliability that the original pre-copy scheme provides.

Chapter 6
CIVIC: Transparent Over-subscription of VM Memory
In this chapter, we describe the design, implementation and evaluation of Collective Indirect
Virtual Caching, or CIVIC for short. CIVIC provides significantly lower-level support for access to cluster-wide virtual memory than MemX. CIVIC is a memory oversubscription system for VMs, designed to integrate the techniques of the previous three systems described in this dissertation, by which the hypervisor can multiplex individual page frames of unmodified
Virtual Machines in a fine-grained manner.
Three primary uses of CIVIC are:
1. Higher Consolidation: to oversubscribe the limited memory of a single physical
host for the purpose of running higher numbers of consolidated Virtual Machines
with greater use of the hardware and without depending on para-virtualization or
ballooning.
2. Large-Memory Pool: to provide large-memory applications transparent access to
a cluster-wide, low-latency memory-pool without any additional binary or operating
system interfaces, and
3. Improved Migration: to reduce the amount of resident main memory when the time
comes to migrate individual Virtual Machines across the network. Due to time, this
feature has not been implemented, but CIVIC is designed for it.
The motivation for this work derives directly from the last few chapters: Now that we have a number of systems for both distributing and migrating individual page frames, the
final step is to fully utilize the power of virtualization technology to support VMs with more transparency and ubiquity: to use unmodified, commodity operating systems in such a way that they have access to a (potentially) unlimited memory resource backed by an entire cluster. The end goal of this chapter is to build a system underneath any commodity OS that gives a systems programmer arbitrary access to individual page frames located anywhere in the cluster, so that new techniques for VM memory management can be designed with ease and efficiency. The CIVIC system does just that: it transparently allows a virtual machine to oversubscribe (or overcommit) its physical memory space without any participation from the VM's operating system whatsoever. Any non-committed memory is then paged out; in our case, it is paged out to MemX.
6.1 Introduction
One of the great rules in system design, used frequently in many areas of computer science (and not always acknowledged explicitly), is that if a piece of data is likely to be used again in the future, you will probably succeed wildly by going out of your way to design your algorithm or data structure to cache or preserve that data. It is remarkable how often that rule shows itself. The transparency afforded to VMs by hypervisors provides good opportunities to exploit caching: virtual machines are almost entirely unaware that their low-level view of physical memory is being “toyed” with in significant ways. So, in order to achieve the kind of memory ubiquity that we described, we propose to combine the ability to do fine-grained caching underneath VMs with the ability to virtualize cluster-wide memory (covered in earlier chapters). With CIVIC, we allow the hosts in the cluster to cooperate with each other in order to transparently support VMs whose physical memory footprints can span multiple machines in the cluster. To re-iterate: CIVIC is not a Distributed Shared Memory (DSM) system. There are already two existing hypervisor-level DSM attempts: one by Virtual Iron in 2005 [12] and one at the Open Kernel Labs in 2009 [69].
The purpose of these systems is to build a single-system image (SSI). Building an SSI is not the focus of this dissertation. Rather, our goal is to allow unmodified virtual machines to gain access to cluster memory. Our focus is to enable greater VM consolidation and migration performance rather than to spread processing out into the cluster. Thus, VMs in our work use only local processors.
A simple view of CIVIC's role is that it does for VMs precisely what modern operating systems already do for processes in their virtual memory sub-systems: give a running process (nearly) unlimited access to virtual memory. The OS has a well-established method of multiplexing virtual to physical memory accesses - the page table. We leverage a similar mechanism to manipulate a VM's view of physical memory, namely the "pseudo"-physical address space, hereafter referred to as the PPAS. (The "real" address space seen by the processor is correspondingly referred to as the RAS.) The hypervisor undertakes the responsibility of mapping pages in the PPAS to pages in the RAS. Technically, one could of course use a disk-based swap device to page the unused portions of the PPAS in and out, but that would lead to a significant slowdown in VM performance, as we have explored extensively in this dissertation. Instead, we use MemX to expand a VM's PPAS to utilize the cluster-wide memory pool and minimize the performance impact on the VM that a disk would otherwise incur, without changing the operating system at all. The hypervisor plays the role of an intermediary by (1) providing the VM with the view of an expanded PPAS, (2) intercepting memory accesses by the VM to non-resident PPAS pages, and (3) efficiently redirecting these memory accesses for servicing by MemX, which executes in a separate virtual machine.
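A minimal sketch of this intermediary role, as a toy Python simulation with purely illustrative names (`memx_fetch` stands in for the real MemX interface; this is not hypervisor code): contiguous PFNs either map to a sparse machine frame, or are non-resident, in which case the access is redirected to MemX and the mapping installed.

```python
def translate(p2m, pfn, memx_fetch):
    """Return the machine frame backing `pfn`, faulting in from MemX.

    `p2m` models the hypervisor's PPAS-to-RAS mapping: contiguous
    pseudo-physical frame numbers on the left, sparse machine frames
    on the right. A missing entry means the page lives out on MemX.
    """
    mfn = p2m.get(pfn)
    if mfn is None:            # non-resident PPAS page: intercepted
        mfn = memx_fetch(pfn)  # redirected to MemX for servicing
        p2m[pfn] = mfn         # install so future accesses are local
    return mfn

# PFNs 0..2 are contiguous; the machine frames behind them are
# scattered, and PFN 3 is not resident on this host at all:
p2m = {0: 0x1A3, 1: 0x042, 2: 0x9F1}
```

A call such as `translate(p2m, 3, fetch)` models step (3) above: the hypervisor never involves the guest OS, it simply services the miss remotely.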
6.2 Design
The design of CIVIC depends heavily on the virtualization platform, which in our case is
Xen. Although we have covered the design of Xen frequently in previous chapters, none of those systems operated strictly at the hypervisor level. This requires a brief discussion of the hypervisor's memory management schemes, including memory allocation and shadow paging. After this discussion, we present the design choices for CIVIC within the hypervisor itself and its interactions with higher-level services, followed by implementation-specific details.
6.2.1 Hypervisor Memory Management
VM memory management is fairly straightforward, with an extra level of indirection through the PPAS. This address space sits in between the virtual address space and the real physical address space (RAS) seen by the processor. Since the processor is no longer singly owned by one operating system, this extra level allows multiple PPASes to be multiplexed on top of a single RAS. Additionally, from here on out, the frame numbers associated with the PPAS are (in Xen terminology) called 'P' frame numbers, or "PFN"s. Similarly, real frame numbers are called "machine" frame numbers, or MFNs. PFNs are contiguously numbered, whereas the MFNs allocated to a VM in the RAS are almost guaranteed to be sparse. In modern VM technology, there are three ways to manage the PPAS:
1. Para-virtualization: A para-virtual VM (or guest) is one that has been modified in such a way that the VM is aware of the hypervisor. It has been patched directly to inform the hypervisor explicitly when it intends to update any given page table in its ownership. In such a guest, the OS will map page frames using machine frame numbers (MFNs) and has no actual concept of the PPAS (except for memory allocation and VM migration, discussed in the last chapter). Thus, frame identifiers in a para-virtual guest's page table entries are the same ones seen by the processor. This has performance advantages because the guest OS can "batch" a number of page table updates into one hypercall (but only up to a limit, as we'll see in option #3). Para-virtual support has recently been made upstream and built into both Linux and Windows, which mitigates some of the problems with this approach relating to the transparency of maintaining compatibility with newly released operating system versions. Thus, para-virtualization is no longer a technological obstacle.
2. Shadow-paging: When modifying the guest is unacceptable (e.g., for older OS kernels), hypervisors no longer place real MFNs into guest OS page tables. Instead, "pseudo" PFNs are used, such that virtual page numbers map to PFNs in the guest's page tables. The hypervisor then traps write accesses to those tables (using CR3-register virtualization and by marking them read-only) while maintaining another set of "shadow" page tables underneath the virtual machine that map the virtual page numbers to MFNs. These shadow page tables are the ones exposed to the processor. Thus, memory virtualization and device emulation can be done for arbitrary, unmodified operating systems. When this kind of memory management is used, we refer to the guest OS as a hardware virtual machine, or "HVM", as opposed to a para-virtual guest. Shadow paging is elaborated in Section 6.2.2.
3. Hardware-assisted Paging: This approach improves on shadow-paging by moving the translation logic of shadow paging from the hypervisor into the processor. Essentially this is an MMU expansion - making the MMU do a little more of what it is already doing. With this support, it is no longer necessary to trap into the hypervisor as frequently, allowing page-fault exceptions to be delivered directly to the guest OS. Such guests are also called HVM guests, with the internal distinction of hardware-assisted paging.
As of this writing, CIVIC depends exclusively on the hypervisor's ability to perform shadow-paging for unmodified HVM guest operating systems. The most basic ability required by CIVIC is to create and intercept page-fault exceptions (ones that would not normally be seen by the guest OS itself) before they are propagated to the guest virtual machine. An unmodified HVM running on top of a CIVIC-enabled hypervisor that used hardware-assisted paging (instead of shadow-paging) would require additional logic to force the processor to trap into the hypervisor during such CPU exceptions when a page is owned by CIVIC (a non-resident page frame). So, as of this writing, CIVIC depends on shadow paging alone, without the assistance of hardware-assisted paging. Section 6.4 describes how the use of shadow paging affects the baseline performance of a virtual machine running on top of a CIVIC-enabled hypervisor.
6.2.2 Shadow Paging Review
Next, it is necessary to elaborate on the use of shadow-paging and some of the common Xen-specific data structures. All of the machines in our particular Xen cluster are 64-bit machines. Thus, this discussion assumes that our HVM guests are also 64-bit virtual machines, requiring a standard 4-level page table hierarchy.
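To make the 4-level hierarchy concrete, here is a toy Python simulation of a page-table walk (our own illustrative structures, not Xen's; bit 0 of each entry is modeled as the present bit, per the standard x86-64 format):

```python
PRESENT = 0x1  # lowest-order bit of an entry: set when resident

def walk(cr3, idx4, idx3, idx2, idx1):
    """Walk a 4-level hierarchy from the L4 root down to an L1 leaf.

    Each table is modeled as a dict of entries; a missing entry or a
    clear present bit at any level terminates the walk with a fault,
    which we signal by returning None.
    """
    table = cr3
    for idx in (idx4, idx3, idx2):
        entry = table.get(idx)
        if entry is None or not entry["flags"] & PRESENT:
            return None          # non-resident page table: fault
        table = entry["next"]    # descend one level
    leaf = table.get(idx1)
    if leaf is None or not leaf["flags"] & PRESENT:
        return None              # non-resident data page: fault
    return leaf["frame"]         # frame number of the data page

# A minimal hierarchy mapping one virtual page to frame 42:
l1 = {5: {"flags": PRESENT, "frame": 42}}
l2 = {0: {"flags": PRESENT, "next": l1}}
l3 = {0: {"flags": PRESENT, "next": l2}}
l4 = {7: {"flags": PRESENT, "next": l3}}
```

Here `l4` plays the role of the table pointed to by CR3; any other index path faults, just as a real MMU would raise a page-fault exception.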
When we say "L1" page tables, we mean the standard definition, where pointers to data pages are contained at the lowest level of the hierarchy (the leaves) and the root of the page table is at level L4. Every L4 table is pointed to by "Control Register #3", or CR3, sometimes called the page-table base pointer. As usual, for any given process running on the CPU, the value of CR3 points to the root L4 table of only a single process at a time - or to the kernel's page tables. A "resident" page table entry (PTE) at any level of the hierarchy is one whose lowest-order bit is set, indicating that the page beneath it (either data or a page table) is actually sitting in memory somewhere. During the shadow paging process, three things can happen:
1. Shadow-Walk: The MMU, with access to a virtualized CR3 base pointer, attempts to walk the shadow page-table hierarchy of a particular virtual machine. For every HVM page table, there is a corresponding shadow page table at each level of the hierarchy. If the MMU does not find a shadow PTE, a trap into the hypervisor occurs.
2. Guest-walk: The hypervisor then performs a manual walk of the real HVM tables, starting at what the HVM thinks is the true CR3 base pointer. If the hypervisor finds the appropriate PTE, then the whole page table is copied and control returns to the CPU for that virtual machine.
Figure 6.1: Original Xen-based physical memory design for multiple, concurrently-running virtual machines.
3. Guest-walk-miss: Otherwise, a missing PTE in the guest signifies a real CPU exception, and the fault is propagated to the HVM. At that point, it is the HVM's responsibility to service the fault and proceed as normal.
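The three cases above can be condensed into a toy simulation of the fault path (illustrative data structures, not Xen's actual shadow code; for brevity, each set of page tables is flattened into a per-VM dictionary keyed by virtual page number):

```python
def shadow_fault(shadow, guest, vpn):
    """Resolve a shadow-walk miss for virtual page number `vpn`.

    Case 1 (shadow-walk hit) never reaches the hypervisor; we only
    get here after a miss. Case 2 (guest-walk hit) copies the guest
    mapping into the shadow tables and resumes the VM. Case 3
    (guest-walk miss) propagates a real page fault to the HVM.
    """
    mfn = guest.get(vpn)
    if mfn is not None:
        shadow[vpn] = mfn            # sync shadow from guest tables
        return ("resume", mfn)
    return ("inject_fault", None)    # let the HVM's OS service it

# The HVM's own tables map page 0x10 to frame 0x9F1; the shadow
# starts cold, so the first access traps into the hypervisor:
guest_tables = {0x10: 0x9F1}
shadow_tables = {}
```

After one `shadow_fault` call for page `0x10`, the shadow holds the mapping and subsequent accesses proceed without any trap, mirroring the steady state of real shadow paging.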
Furthermore, during the shadow-paging process, Xen employs upwards of a dozen "shadow optimizations" on top of this basic design, used to speed up memory access latency when going through the shadows, with respect to Windows virtual machines, HVMs, and more. For the current version of CIVIC, these optimizations are disabled; doing so was necessary to get an initial version of CIVIC working. Future versions of CIVIC can be made to take advantage of these optimizations. Thus, the rest of this chapter and the next section discuss our implementation under the assumption that these optimizations are disabled. This assumption also constitutes our base case for benchmarking during our evaluation.
6.2.3 Step 1: CIVIC Memory Allocation, Caching Design
Figure 6.1 illustrates how memory is allocated to virtual machines in a typical virtualization architecture. Each VM gets a statically-allocated region of physical memory on the host (depending on ballooning). During normal operation, the size of the PPAS for each virtual machine does not change. Any number of VMs (depending on the amount of memory available) can be created by the administrator, side by side, and the OS of each
Figure 6.2: Physical memory caching design of a CIVIC-enabled hypervisor for multiple, concurrently-running virtual machines.
virtual machine will manage the PPAS given to it (since the PPAS is contiguous) without interruption. In this default design, if an operating system places a reference (PFN) to a page in one of its page tables that it expects to be physically resident in memory, then it will be there - no questions asked. All VM technology currently works this way (except for our previous VM migration work in the last chapter, where dynamic ballooning is used). In Figure 6.1, we have four virtual machines, three of which are HVMs and one that uses para-virtualization. Regardless, the PPAS of all four virtual machines is static: from the moment those VMs are booted up to the time they shut down, their PPAS is fixed.
CIVIC relaxes the assumption that a page actually exists when the VM asks for it. The first step in designing CIVIC involves taking the unmodified operating system of an arbitrary virtual machine and growing its PPAS by some amount. Afterwards, we add another level of indirection within the hypervisor that recognizes this expanded PPAS (by intercepting accesses through shadow-paging). Figure 6.2 illustrates how the memory allocation strategy changes in a CIVIC-enabled hypervisor: VMs #2 and #3 get a statically allocated cache in which only a subset of their total PPAS is actually resident. The rest is out on the network. Hits in the cache are served from the RAS, whereas accesses to the rest of the PPAS go to the network.
Take note of the difference between HVM #2 and HVM #3: the PPAS of an unmodified virtual machine need not be larger than the RAS of the physical host. This gives the administrator a choice: either grow the PPAS to be very large, or simply provide higher levels of consolidation by running more VMs on one host. However, the PPAS should be at least as large as the cache that CIVIC provides to it; it cannot be smaller, or that would preclude the need for CIVIC. Notice that Figure 6.2 also has two simultaneously running para-virtualized VMs: the current CIVIC implementation supports multiple PPAS strategies and does not require any VM to use CIVIC. You may choose to grow the PPAS of a virtual machine at boot time or leave it unchanged in its default mode of operation.
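The sizing rules just stated can be summarized in a small sketch (the function and its names are ours, purely illustrative): the cache must not exceed the PPAS, while the PPAS itself may exceed the host's RAS (memory expansion, as with HVM #2) or stay within it (consolidation, as with HVM #3).

```python
def civic_mode(ppas_pages, cache_pages, ras_pages):
    """Classify a CIVIC VM configuration, enforcing cache <= PPAS."""
    if cache_pages > ppas_pages:
        # A cache bigger than the address space it backs would
        # preclude the need for CIVIC in the first place.
        raise ValueError("cache must not exceed the PPAS")
    return "expand" if ppas_pages > ras_pages else "consolidate"
```

For example, a 16 GB PPAS cached on an 8 GB host classifies as "expand", while several 2 GB PPASes each cached in 512 MB on the same host classify as "consolidate".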
Figure 6.3 demonstrates the operation of an example CIVIC cache underneath an HVM. This HVM has three working sets (perhaps from three different processes, or three different data structures within one process). The figure represents the common case, where the cache is fully populated with accessed memory. In this example, two of the working sets are in the cache, and a page fault to frame #6 occurs in the {4, 5, 6} set. Since the {8, 9} set is older according to the FIFO, frame #9 is evicted to MemX. An old copy of page #9 may or may not already exist on MemX, but it will likely be there if the HVM has been running for a long time. The next section uses the same HVM to describe the hypervisor-level interactions between the cache and MemX.
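The walk-through above can be reproduced with a toy FIFO page cache sitting in front of a MemX-like store (a simulation of the idea only; CIVIC's real cache lives inside the hypervisor, and `memx` here is just a dictionary standing in for the remote pool):

```python
from collections import OrderedDict

class FifoCache:
    """FIFO page cache in front of a remote MemX-style store."""

    def __init__(self, capacity, memx):
        self.capacity = capacity
        self.cache = OrderedDict()  # frame -> page contents, FIFO order
        self.memx = memx            # evicted pages land here

    def access(self, frame):
        if frame in self.cache:
            return self.cache[frame]          # hit: served from the RAS
        if len(self.cache) >= self.capacity:  # full: evict the oldest
            victim, page = self.cache.popitem(last=False)
            self.memx[victim] = page          # page out to MemX
        # fault: fetch from MemX (or a fresh zero page) into the cache
        self.cache[frame] = self.memx.pop(frame, 0)
        return self.cache[frame]
```

With a capacity-2 cache holding frames {8, 9}, a fault on a new frame evicts the oldest resident frame to `memx`, mirroring the eviction in Figure 6.3 (note that a strict FIFO evicts whichever of the old set was inserted first).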
6.2.4 Step 2: Paging Communication and The Assistant
The devices and drivers in the modern virtualization stack that service popular devices for virtual machines are typically bundled into a VM commonly called "Domain 0", or "Dom0" for short. From here on out we will not refer to this VM except to acknowledge its presence. During runtime, this VM always exists; it typically hosts various drivers, has direct access to the corresponding devices, and acts as a relay for co-located virtual machines. There is a movement to break away from this unified, "monolithic" design, and CIVIC follows that philosophy [45]. Dom0 is not only a single point of failure during the development process, but also a performance bottleneck for the hypervisor's CPU scheduler, due to the fact that all I/O must go through Dom0 while the