TECHNIQUES FOR COLLECTIVE PHYSICAL MEMORY

UBIQUITY WITHIN NETWORKED CLUSTERS

OF VIRTUAL MACHINES

BY

MICHAEL R. HINES

B.S., Johns Hopkins University, 2003
M.S., Florida State University, 2005

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2009

© Copyright by Michael R. Hines 2009

All Rights Reserved

Accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2009

July 31st, 2009

Dr. Kartik Gopalan, Department of Computer Science, Binghamton University

Prof. Kanad Ghose, Department of Computer Science, Binghamton University

Dr. Kenneth Chiu, Department of Computer Science, Binghamton University

Dr. Kobus van der Merwe, AT&T Labs Research, Florham Park, NJ.

ABSTRACT

This dissertation addresses the use of distributed memory to improve the performance of state-of-the-art virtual machines (VMs) in clusters with gigabit interconnects. Even with ever-increasing DRAM capacities, we observe a continued need to support applications that exhibit mostly memory-intensive execution patterns, such as databases, webservers, and scientific and grid applications. In this dissertation, we make four primary contributions. First, we survey the history of the solutions available for basic, transparent distributed memory support. We then document a bottom-up implementation and evaluation of a basic prototype whose goal is to move deeper into the kernel than previous application-level solutions. We choose a clean, transparent device interface capable of minimizing network latency and copying overheads. Second, we explore how recent work with VMs has brought back into question the memory management logic of the OS. VM technology provides ease and transparency for imposing order on OS memory management (using techniques like full virtualization and para-virtualization). As such, we evaluate distributed memory in this context by trying to optimize our previous prototype at different places in the Xen virtualization architecture. Third, we leverage this work to explore alternative strategies for live VM migration. A key component that determines the success of migration techniques has been exactly how memory is transmitted, and when. More specifically, this involves fine-grained page-fault management either before a VM's CPU state is migrated (the current default) or afterwards. Thus, we design and evaluate the Post-Copy live VM migration scheme and compare it to the existing (Pre-Copy) migration scheme, realizing significant improvements. Finally, we promote the ubiquity of individual page frames as a cluster resource by integrating the use of distributed memory into the hypervisor (or virtual machine monitor). We design and implement CIVIC: a system that allows unmodified VMs to oversubscribe their DRAM size beyond a given host's physical memory. We then complement this by implementing and evaluating network paging in the hypervisor for locally resident VMs. We evaluate the performance impact of CIVIC on various application workloads and show how CIVIC enables many possible VM extensions, such as better VM consolidation, multi-host caching, and better coordination with VM migration.

ACKNOWLEDGEMENTS

First, I would like to thank a few organizations responsible for providing invaluable sources of funding that allowed me to work through graduate school. The AT&T Labs Research Fellowship Program, in cooperation with Kobus van der Merwe in New Jersey, provided support for a full three years. The Clark Fellowship program at SUNY Binghamton also provided a full year of funding. The Computer Science departments at both Florida State and Binghamton made teaching assistantships available for a year. These deeds often go unsaid; without them I would not have been able to complete this degree. I would also like to thank the National Science Foundation and the Computing Innovation Fellows Project (cifellows.org). Through them, I will be continuing on an assistantship as a post-doctoral fellow for the next year.

My advisor deserves his own paragraph. Not many graduate students can say what I can: I have one of the greatest advisors on the planet. Six years ago, he took a chance on me and stood patiently through the entire process: through the transfers, the applications, the bad papers, the good papers, the leaps of faith, the happy accomplishments, and the sad ones. Not only is he a fantastic researcher, but he is also a strong teacher. I am very proud to be his student and I know many other students will be as well.

DEDICATION

To my father: for his unconditional support, and love. And for all our tribulations.

To my mother: for her strength, wisdom, and love. And for all of our struggles.

To my brother: for his continuous perseverance and happiness.

To my extended family: I stand on your shoulders.

BIOGRAPHICAL SKETCH

Michael R. Hines was born and raised in Dallas, Texas in 1983 and grew up playing classical piano. He began college in a program called the Texas Academy of Math and Science at the University of North Texas. Two years later he transferred to Johns Hopkins University in Baltimore, Maryland, and received his Bachelor of Science degree in Computer Science in 2003. Subsequently, he entered Florida State University to complete an Information Security certification in 2004 and a Master's degree in Computer Science in 2005. Immediately after that, he transferred to SUNY Binghamton University in New York state to finish working on a PhD degree in Computer Science in 2009.

Michael will begin post-doctoral research at Columbia University in late 2009. He is a recipient of multiple awards, including the Jackie Robinson Undergraduate Scholarship (2 years), the AT&T Labs Foundation Fellowship (3 years), the Clark D. Gifford Fellowship (1 year) from Binghamton University, and the CIFellows CRA/NSF Award (1 year) for post-doctoral research. He is a member of the academic honor societies Alpha Lambda Delta and Phi Eta Sigma, and of the Computer Science honor society Upsilon Pi Epsilon. His hobbies include billiards, skateboarding, and yo-yos.

Contents

List of Figures xiii

List of Tables xviii

1 Introduction and Outline 1

1.1 Distributed Memory Virtualization in Networked Clusters ...... 3

1.2 Virtual Machine Based Use for Distributed Memory ...... 3

1.3 Improvement of Live Migration for Virtual Machines ...... 4

1.4 VM Memory Over-subscription with Network Paging ...... 5

2 Area Survey 7

2.1 Distributed Memory Systems ...... 7

2.1.1 Basic Distributed Memory (Anemone) ...... 8

2.1.2 Software Distributed Shared Memory ...... 9

2.2 Virtual Machine Technology and Distributed Memory ...... 10

2.2.1 Microkernels ...... 10

2.2.2 Modern Hypervisors ...... 11

2.3 VM Migration Techniques ...... 13

2.3.1 Process Migration ...... 13

2.3.2 Pre-Paging ...... 14

2.3.3 Live Migration ...... 15

2.3.4 Non-Live Migration ...... 15

2.3.5 Self Ballooning ...... 16

2.4 Over-subscription of Virtual Machines ...... 16

3 Anemone: Distributed Memory Access 18

3.1 Introduction ...... 18

3.2 Design & Implementation ...... 20

3.2.1 Client and Server Modules ...... 23

3.2.2 Remote Memory Access Protocol (RMAP) ...... 24

3.2.3 Distributed Resource Discovery ...... 27

3.2.4 Soft-State Refresh ...... 27

3.2.5 Server Load Balancing ...... 28

3.2.6 Fault-tolerance ...... 28

3.3 Evaluation ...... 29

3.3.1 Paging Latency ...... 30

3.3.2 Application Speedup ...... 32

3.3.3 Tuning the Client RMAP Protocol ...... 36

3.3.4 Control Message Overhead ...... 37

3.4 Summary ...... 38

4 MemX: Virtual Machine Uses of Distributed Memory 39

4.1 Introduction ...... 39

4.2 Split Driver Background ...... 41

4.3 Design and Implementation ...... 43

4.3.1 MemX-Linux: MemX in Non-virtualized Linux ...... 44

4.3.2 MemX-DomU (Option 1): MemX Client Module in DomU ...... 46

4.3.3 MemX-DD (Option 2): MemX Client Module in Driver Domain . . . . 48

4.3.4 MemX-Dom0 (Option 3) ...... 51

4.3.5 Alternative Options ...... 51

4.3.6 Network Access Contention ...... 52

4.4 Evaluation ...... 53

4.4.1 Latency and Bandwidth Microbenchmarks ...... 54

4.4.2 Application Speedups ...... 61

4.4.3 Multiple Client VMs ...... 63

4.4.4 Live VM Migration ...... 65

4.5 Summary ...... 65

5 Post-Copy: Live Virtual Machine Migration 67

5.1 Introduction ...... 68

5.2 Design ...... 70

5.2.1 Pre-Copy ...... 71

5.2.2 Design of Post-Copy Live VM Migration ...... 73

5.2.3 Prepaging Strategy ...... 76

5.2.4 Dynamic Self-Ballooning ...... 78

5.2.5 Reliability ...... 80

5.2.6 Summary ...... 81

5.3 Post-Copy Implementation ...... 82

5.3.1 Page-Fault Detection ...... 83

5.3.2 MFN exchanging ...... 85

5.3.3 Xen Daemon Modifications ...... 86

5.3.4 VM-to-VM kernel-to-kernel Memory-Mapping ...... 88

5.3.5 Dynamic Self Ballooning Implementation ...... 89

5.3.6 Proactive LRU Ordering to Improve Reference Locality ...... 92

5.4 Evaluation ...... 93

5.4.1 Stress Testing ...... 94

5.4.2 Degradation, Bandwidth, and Ballooning ...... 98

5.4.3 Application Scenarios ...... 104

5.4.4 Comparison of Prepaging Strategies ...... 107

5.5 Summary ...... 109

6 CIVIC: Transparent Over-subscription of VM Memory 110

6.1 Introduction ...... 111

6.2 Design ...... 113

6.2.1 Hypervisor Memory Management ...... 113

6.2.2 Shadow Paging Review ...... 115

6.2.3 Step 1: CIVIC Memory Allocation, Caching Design ...... 116

6.2.4 Step 2: Paging Communication and The Assistant ...... 118

6.2.5 Future Work: Page Migration, Sharing and Compression ...... 124

6.3 Implementation ...... 127

6.3.1 Address Space Expansion, BIOS Tables ...... 127

6.3.2 Communication Paths ...... 129

6.3.3 Eviction and Prefetching ...... 130

6.3.4 Page-Fault Interception, Shadows, Reverse Mapping ...... 134

6.4 Evaluation ...... 138

6.4.1 Micro-Benchmarks ...... 138

6.4.2 Applications ...... 142

6.5 Summary ...... 147

7 Improvements and Closing Arguments 148

7.1 MemX Improvements ...... 148

7.1.1 Non-Volatile MemX Memory Descriptors ...... 148

7.1.2 MemX Internal Caching ...... 149

7.1.3 Server-to-Server Proactive Page Migration ...... 150

7.1.4 Increased MemX bandwidth w/ Multiple NICs ...... 150

7.2 Migration Flexibility ...... 151

7.2.1 Hybrid Migration ...... 151

7.2.2 Improved Migration of VMs Through CIVIC ...... 151

7.3 CIVIC Improvements and Ideas ...... 152

7.3.1 How high can you go?: Extreme Consolidation ...... 152

7.3.2 Improved Eviction and Shadow Optimizations ...... 152

7.4 Conclusions ...... 153

A CIVIC Screenshots 154

A.1 Small-HVM Over-subscription ...... 154

A.2 Large-HVM Oversubscription ...... 156

B The Xen Live-migration process 158

B.1 Xen Daemon ...... 158

B.2 Understanding Frame Numbering ...... 161

B.3 Memory-related Data Structures ...... 163

B.4 Page-table Management ...... 165

B.5 Actually Performing the Migration ...... 166

Bibliography 169

List of Figures

3.1 Placement of distributed memory within the classical memory hierarchy. . . 21

3.2 The components of a client...... 22

3.3 The components of a server...... 23

3.4 A view of a typical Anemone packet header. The RMAP protocol transmits these directly to the network card from the BDI device driver. ...... 26

3.5 Random read disk latency CDF ...... 30

3.6 Sequential read disk latency CDF ...... 31

3.7 Random write disk latency CDF ...... 31

3.8 Sequential write disk latency CDF ...... 32

3.9 Execution times of POV-ray for increasing problem sizes...... 33

3.10 Execution times of STL Quicksort for increasing problem sizes...... 34

3.11 Execution times of multiple concurrent processes executing POV-ray. . . . . 35

3.12 Execution times of multiple concurrent processes executing STL Quicksort. 35

3.13 Effects of varying the transmission window using Quicksort...... 36

4.1 Split Device Driver Architecture in Xen...... 42

4.2 MemX-Linux: Baseline operation of MemX in a non-virtualized Linux environment. The client can communicate with multiple memory servers across the network to satisfy the memory requirements of large memory applications. ...... 44

4.3 MemX-DomU: Inserting the MemX client module within DomU's Linux kernel. The server executes in non-virtualized Linux. ...... 47

4.4 MemX-DD: Executing a common MemX client module within the driver domain, allowing multiple DomUs to share a single client module. The server module continues to execute in non-virtualized Linux. ...... 49

4.5 I/O bandwidth, for different MemX configurations, using a custom benchmark that issues asynchronous, non-blocking 4-KB I/O requests. "DIO" refers to opening the file descriptor with direct I/O turned on, to compare against bypassing the Linux buffer cache. ...... 55

4.6 Comparison of sequential and random read latency distributions for MemX-DD and disk. Reads traverse the filesystem buffer cache. Most random read latencies are an order of magnitude smaller with MemX-DD than with disk. All sequential reads benefit from filesystem prefetching. ...... 58

4.7 Comparison of sequential and random write latency distributions for MemX-DD and disk. Writes go through the filesystem buffer cache. Consequently, all four latencies are similar due to write buffering. ...... 58

4.8 Effect of filesystem buffering on random read latency distributions for MemX-DD and disk. About 10% of random read requests (issued without the direct I/O flag) are serviced at the filesystem buffer cache, as indicated by the first knee below 10µs for both MemX-DD and disk. ...... 59

4.9 Quicksort execution times in various MemX combinations and disk. While clearly surpassing disk performance, MemX-DD trails regular Linux only slightly using a 512 MB Xen guest. ...... 60

4.10 Quicksort execution times for multiple concurrent guest VMs using MemX-DD and iSCSI configurations. ...... 62

4.11 Our multiple-client setup: five identical 4 GB dual-core machines, where one houses 20 Xen guests and the others serve as either MemX servers or iSCSI servers. ...... 63

5.1 Pseudo-code for the pre-paging algorithm employed by post-copy migration. Synchronization and locking code omitted for clarity of presentation. ...... 74

5.2 Prepaging strategies: (a) bubbling with a single pivot and (b) bubbling with multiple pivots. Each pivot represents the location of a network fault on the in-memory pseudo-paging device. Pages around the pivot are actively pushed to the target. ...... 76

5.3 Pseudo-Swapping (item 3): As pages are swapped out within the source guest itself, their MFN identifiers are exchanged and Domain 0 memory-maps those frames with the help of the hypervisor. The rest of post-copy then takes over after downtime. ...... 82

5.4 The intersection of downtime within the two migration schemes. Currently, our downtime consists of sending non-pageable memory (which can be eliminated by employing shadow paging). Pre-copy downtime consists of sending the last round of pages. ...... 85

5.5 Comparison of total migration times between post-copy and pre-copy. ...... 95

5.6 Comparison of downtimes between pre-copy and post-copy. ...... 96

5.7 Comparison of the number of pages transferred during a single migration. ...... 97

5.8 Kernel compile with back-to-back migrations using 5-second pauses. ...... 98

5.9 NetPerf run with back-to-back migrations using 5-second pauses. ...... 100

5.10 Impact of post-copy on NetPerf bandwidth. ...... 101

5.11 Impact of pre-copy on NetPerf bandwidth. ...... 102

5.12 The application degradation is inversely proportional to the ballooning interval. ...... 103

5.13 Total pages transferred for both migration schemes. ...... 105

5.14 Page-fault comparisons: Pre-paging lowers the network page faults to 17% and 21%, even for the heaviest applications. ...... 106

5.15 Total migration time for both migration schemes. ...... 106

5.16 Downtime for post-copy vs. pre-copy. Post-copy downtime can improve with better page-fault detection. ...... 107

5.17 Comparison of prepaging strategies using multi-process Quicksort workloads. ...... 108

6.1 Original Xen-based physical memory design for multiple, concurrently-running virtual machines. ...... 116

6.2 Physical memory caching design of a CIVIC-enabled hypervisor for multiple, concurrently-running virtual machines. ...... 117

6.3 Illustration of a full PPAS cache. All page accesses in the PPAS space must be brought into the cache before the HVM can use the page. If the cache is full, an old page is evicted from the FIFO maintained by the cache. ...... 119

6.4 Internal CIVIC architecture: An Assistant VM holds two kernel modules responsible for mapping and paging HVM memory. One module directly (on-demand) memory-maps portions of PPAS #2, whereas MemX does I/O. A modified, CIVIC-enabled hypervisor intercepts page-faults to shadow page tables in the RAS and delivers them to the Assistant VM. If the HVM cache is full, the Assistant also receives victim pages. ...... 121

6.5 High-level CIVIC architecture: unmodified CIVIC-enabled HVM guests have local reservations (caches) while small or large amounts of their reservations actually expand out to nearby hosts. ...... 123

6.6 Future CIVIC architecture: a large number of nodes would collectively provide global and local caches. The path of a page would potentially exhibit multiple evictions from Guest A to local to global. Furthermore, a global cache can be made to evict pages to other global caches. ...... 125

6.7 Pseudo-code for the prefetching algorithm employed by CIVIC. On every page-fault, this routine is called to adjust the window based on the spatial location of the current PFN address in the PPAS. ...... 132

6.8 Page dirtying rate for different types of virtual machines, including HVM guests and para-virtual guests, and with different types of shadow paging. This includes the overhead of creating new page tables from scratch. ...... 140

6.9 Bus-speed page dirtying rate in gigabits per second. This is line-speed hardware memory speed once page tables have already been created, and shows throughput an order of magnitude higher than the previous graph. ...... 141

6.10 Completion times for quicksort on a CIVIC-enabled virtual machine and a regular virtual machine. ...... 143

6.11 Completion times for Sparse Matrix Multiplication with a resident memory footprint of 512 MB while varying the cache sizes. ...... 144

6.12 Requests per second for the RUBiS Auction Benchmark with a resident memory footprint of 490 MB while varying the cache sizes. ...... 145

A.1 A live run of an HVM guest on top of CIVIC with a very small PPAS cache size of 64 MB. The HVM has 2 GB. (Turn the page sideways.) ...... 155

A.2 A live run of an HVM guest on top of CIVIC with a very large PPAS cache size of 2 GB. The HVM believes that it has 64 GB. (Turn the page sideways.) ...... 157

List of Tables

3.1 Average application execution times and speedups for local memory, Distributed Anemone, and Disk. N/A indicates insufficient local memory. ...... 32

4.1 I/O latency for each MemX combination, in microseconds. ...... 54

4.2 Execution time comparisons for various large memory application workloads. ...... 62

5.1 Migration algorithm design choices in order of their incremental improvements. Method #4 combines #2 and #3 with the use of pre-paging. Method #5 actually combines all of #1 through #4, by which pre-copy is only used in a single, primer iterative round. ...... 74

5.2 Percent of minor and network faults for flushing vs. pre-paging. Pre-paging greatly reduces the fraction of network faults. ...... 97

6.1 Latency of a page-fault through a CIVIC-enabled hypervisor to and from network memory at different stages. ...... 139

6.2 Number of shadow page-faults to and from network memory with CIVIC prefetching disabled and enabled. Each application has a memory footprint of 512 MB and a PPAS cache of 256 MB. ...... 146

Chapter 1

Introduction and Outline

Both the design and the use of main memory have changed dramatically over the last half-century. Because of fast-moving advances in hardware and software, the OS designer's choices have also multiplied, especially as the performance gaps between the levels of the memory hierarchy grow larger. In this dissertation, we observe that the need to support large-memory, non-parallel applications still persists; their memory access patterns are mostly independent and disjoint from one another. These continue to include many common applications like databases and webservers as well as scientific and grid applications. We describe a bottom-up attempt over the last few years to investigate solutions for these kinds of large-memory applications (LMAs) that can be applied across high-speed networked clusters of machines. The representative set of applications we benchmark in this dissertation includes:

• Large Sorting
• Graphical Ray-tracing
• Database Workloads
• Support Webserver
• E-commerce Webserver
• Kernel Compilation
• Parallel Benchmarks
• Torrent Clients
• Network Throughput
• Network Simulation


We refer to these applications as "large-memory applications". They tend to be somewhat CPU intensive. Across application boundaries (between individual running processes), they are either not necessarily parallelizable or not designed to be without explicit threading. Their computational behavior is such that when they do need to access portions of their large memory pools, they need them fast. These accesses are also usually made in a relatively "cache-oblivious" manner, such that the working set in memory eventually converges to a size that fits within memory (before moving on to a new working set).

For these kinds of applications, this work has investigated low-level memory management options across a number of different projects, and this chapter presents a high-level outline of them. The focus of this work is the virtualization of physical memory to support these applications. We categorize this chapter's high-level outline into three overarching goals:

1. Maximum Application Transparency: We want to improve the performance of these large memory applications with zero changes to the application. The last project of this dissertation extends this all the way to complete operating system transparency as well.

2. Clustered Memory Pool: We want to provide a potentially unlimited pool of cluster-wide memory to these applications with the help of distributed, low-latency communication.

3. Ubiquitous Resource Management: We ultimately want page-granular support for the arbitrary, transparent relocation of any single page frame in a cluster of machines.

This dissertation employs a combination of virtual machine technology, operating system modifications, and network protocol design to accomplish those three high-level goals for the aforementioned types of applications. The bottom-up process taken to explore the virtualization of physical memory in this dissertation is organized as follows: first we build a distributed memory virtualization system, followed by its evaluation in a virtual machine environment. Next, we develop alternative strategies for VM migration by leveraging distributed memory virtualization. Finally, we integrate these techniques to develop a system for VM memory oversubscription. We begin with a discussion of the basic distributed memory system.

1.1 Distributed Memory Virtualization in Networked Clusters

Chapter 3 begins by investigating the options available to large memory applications given basic, transparent distributed memory support in clusters of gigabit Ethernet-connected machines. Distributed memory itself is a very old idea, but our previous efforts at re-investigating it have revealed various unsolved performance issues as well as new applications. Additionally, implementing a distributed memory solution was a springboard for tackling low-level memory management issues in virtual machines. Our prototype was an effort to move further away from the application than previous work (very low into the kernel) by choosing a clean, familiar interface (a device) such that the needs of the application are still respected without any changes. It consists of a fully distributed, non-shared, Linux-based, all kernel-space distributed memory solution, including a custom networking protocol and a full performance evaluation. The solution exports an interface to any process that wants to map it, and hides the complexity of shipping those frames over gigabit Ethernet to other connected machines. It is not, however, a software distributed shared memory solution: it does not provide cache-coherent resolution protocols for simultaneous write access by parallel clients. That was not the focus of this work.

1.2 Virtual Machine Based Use for Distributed Memory

In Chapter 4, we investigate how distributed memory virtualization could benefit state-of-the-art virtual machine technology. We describe the design and implementation of a system that evaluates how distributed memory can enhance this transparency. We did this by placing (and improving upon) the aforementioned distributed memory solution at different places within the virtual machine architecture and benchmarking applications within those VMs.

At the end of 2005, a handful of virtual machine projects had already matured into both proprietary and open-source versions. We began looking into how our distributed memory implementation could apply to virtual machine technology. Recent work with VMs in the last decade is interesting in that it has brought into question once again where exactly the memory management logistics of LMAs should be placed, now that there is an extra level of indirection (called the virtual machine monitor, or "hypervisor") placed below the OS (a well-known technique). Both hardware and software advances have created many ways to impose order on the handling of OS memory management while still maintaining a high degree of transparency to applications, through techniques such as full virtualization and para-virtualization.

1.3 Improvement of Live Migration for Virtual Machines

It soon became clear that virtual machine technology has succeeded tremendously at demonstrating the utility of transparent, live OS migration. In fact, it is likely that the increasing pervasiveness of VMs would never have happened without it. It is well known that many process migration prototypes, while very well built, were unable to become widespread due to fundamental limitations relating to transparency, process portability, and residual dependencies on the host OS. Changing the unit of migration to the OS itself has taken that problem out of the picture completely, even among different hypervisor vendors. The ability to run the VM transparently has shifted the base unit of computational containment from the process to the OS without changing the semantics of the application. A key component that determines the success of migration has been exactly how the virtualization architecture migrates the VM's memory, which is what initially led us to this particular problem.

In Chapter 5, we apply some of the techniques developed for virtual machine based distributed memory use to develop alternative strategies for live migration of VMs. We design, implement, and evaluate a new migration scheme and compare it to the existing migration schemes present in today's virtualization technology. We were able to realize significant gains in migration performance with a new live-migration system for write-intensive VM workloads, as well as point out some fundamental ways in which the management of VM memory could be improved over the pre-copy approach.

1.4 VM Memory Over-subscription with Network Paging

Our experience with the previous projects exposed the need for more fine-grained policies underneath the OS, particularly when VMs are consolidated from multiple physical hosts onto a single host and compete with each other for memory resources. In situations like this, determining better runtime placement and allocation of individual page frames among the VMs becomes important. This is where the idea of the ubiquity of individual frames of memory comes from: not only does virtualization remove the constraints on a page frame as to its location in memory, it also releases a page frame from even being on the same physical machine, even when the VM is still considered to have local ownership of the frame. We believe that, given the dynamics of a virtualized environment, the OS should consider its physical memory to be a ubiquitous "resource" without worrying about its physical location. This does not mean that it should not be aware of the contiguity of the physical memory space (with respect to kernel subsystems that handle memory allocation and fragmentation). Rather, it means that the source of that contiguous resource should be more flexible. Along the same lines, the interfaces that export this resource should maintain fast, efficient memory access and do so without duplicating implementation effort or functionality.

With that, Chapter 6 presents the last contribution of this dissertation: a complete implementation and evaluation of a system that allows an unmodified VM to use more DRAM than is physically provided by the host machine. Our system is able to do this without any changes to the virtual machine. This is done through a combination of means. First, we alter the hypervisor under the VM and give the VM a view of a physical memory allocation that is larger than what is available on the host on which it is running. We then hook into the shadow paging mechanism, a feature provided by all modern hypervisors to intercept page-table modifications performed by the VM. Finally, we supplement this by implementing a network paging system at the hypervisor level to allow for victim page selection when non-resident pages are accessed. This system is implemented while preserving the traditional concepts of paging and segmentation employed by an OS and by taking a page (pardon the pun) from microkernels by continuing to keep the hypervisor as small as possible.
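To make the flow concrete, the following simplified C sketch models only the per-fault decision just described: if the faulting guest page is resident in the local cache, the shadow entry is simply repaired; otherwise a victim is chosen from a FIFO, evicted over the network, and the missing page is fetched. The names (civic_handle_fault, cache_lookup, and so on) and the user-space simulation are illustrative assumptions, not the CIVIC hypervisor code; in the real system the equivalent decision is made inside the hypervisor's shadow page-fault handler, with network I/O carried out on its behalf.

/*
 * Hypothetical sketch of the cache/evict/fetch decision described above.
 * This is not the CIVIC hypervisor code; it only models the logic in
 * plain C so the flow is easy to follow.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS 4            /* tiny cache so eviction is easy to observe */

typedef uint64_t pfn_t;          /* guest-physical frame number (PPAS index)  */

static pfn_t fifo[CACHE_SLOTS];  /* FIFO of resident pages */
static int   head, count;

static bool cache_lookup(pfn_t pfn)
{
    for (int i = 0; i < count; i++)
        if (fifo[(head + i) % CACHE_SLOTS] == pfn)
            return true;
    return false;
}

/* Stand-ins for the real work: network paging and shadow-table repair. */
static void evict_to_network(pfn_t victim) { printf("evict pfn %llu\n", (unsigned long long)victim); }
static void fetch_from_network(pfn_t pfn)  { printf("fetch pfn %llu\n", (unsigned long long)pfn); }
static void map_into_shadow(pfn_t pfn)     { printf("map   pfn %llu\n", (unsigned long long)pfn); }

static void civic_handle_fault(pfn_t pfn)
{
    if (!cache_lookup(pfn)) {
        if (count == CACHE_SLOTS) {           /* cache full: pick the FIFO victim */
            evict_to_network(fifo[head]);
            head = (head + 1) % CACHE_SLOTS;
            count--;
        }
        fetch_from_network(pfn);              /* bring the missing page in */
        fifo[(head + count) % CACHE_SLOTS] = pfn;
        count++;
    }
    map_into_shadow(pfn);                     /* repair the shadow entry */
}

int main(void)
{
    pfn_t refs[] = { 1, 2, 3, 4, 5, 1, 6 };   /* a toy reference string */
    for (size_t i = 0; i < sizeof(refs) / sizeof(refs[0]); i++)
        civic_handle_fault(refs[i]);
    return 0;
}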

Our implementation also maintains the same transparency to the OS and its applications that all of our previous work has guaranteed. This system gives the system administrator and application programmer wide latitude: the option to arbitrarily cache, share, or move individual page frames for improved consolidation of multiple co-located VMs among physical hosts.

Chapter 2

Area Survey

Aside from the focus of this work discussed in Chapter 1, there is a great deal of related work. This chapter will present a literature survey of supporting work up to this point.

We will go through the three major steps discussed in the introduction and explain how other literature is similar to and differs from our work on the Anemone system, the MemX system, the Post-Copy migration system, and our final system, CIVIC.

2.1 Distributed Memory Systems

Our distributed memory system, Anemone [50, 51], was the first system that provided unmodified large memory applications (LMAs) with completely transparent access to cluster-wide memory over commodity gigabit Ethernet LANs. One goal of our work was to make a concerted effort to bring all components of the implementation into the Linux kernel and to optimize for network conditions in the LAN that are specific to network memory traffic: particularly the repeated flow control of 4-kilobyte page frames. As such, it can, briefly, be treated as distributed paging, distributed memory-mapping, or a remote in-memory filesystem, while the logic and design decisions are hidden behind a block device driver.


2.1.1 Basic Distributed Memory (Anemone)

The two most prominent distributed memory systems of the 1990s (both now dormant) were the NOW project [15] at Berkeley and the Global Memory System [37] at Washington. We decided to re-tackle this problem for a few reasons: (a) neither of these projects was available for use, (b) network and CPU speeds had since increased by an order of magnitude, and (c) both projects required extensive operating system support. The Global Memory System was designed to provide network-wide memory management support for paging, memory-mapped files, and file caching. It was built closely into the end-host operating system and operated over a 155 Mbps DEC Alpha ATM network. The NOW project [15] provided a broad range of services on top of the Digital Unix operating system. In the end, its solution included an OS-supported "cooperative caching" system, a type of distributed filesystem with the added responsibility of caching disk blocks (which could be memory mapped) in the memory of participating nodes. We describe cooperative caching systems later; suffice it to say that these were very large implementations that could be functionally reduced to performing distributed memory in an indirect manner. Our goal was to re-tackle just the distributed memory components of these systems, without any OS modifications, as low as possible within a device driver, in the hope that the project would become an enabling mechanism for more complicated projects in later years, which is exactly what happened. To explore these problems, we needed a working prototype in the Linux operating system that respected the design principles of current kernel development and was also capable of functioning well over gigabit Ethernet networks. For all of these reasons, Chapter 3 describes the new system as we have designed it.

Although the previously mentioned projects were the most popular, they were by no means the only projects of the 1990s. The earliest non-shared efforts [40, 21, 57] at using distributed memory aimed to improve memory management, recovery, concurrency control, and read/write performance for in-memory database and transaction processing systems. The first two distributed paging systems were presented in [28] and [38]. These projects also took the stance of incorporating extensive OS changes on both the client and the memory servers on other nodes. The Samson project [90] (of which my advisor was a member) was a dedicated memory server with a highly modified OS over a Myrinet interconnect that actively attempts to predict client page requirements. The Dodo project [59, 9] was another late-1990s attempt to provide a more end-to-end solution to the distributed memory problem. They built a user-level, library-based interface that a programmer can use to coordinate all data transfers to and from a distributed memory cache. This obviously required legacy applications to be aware of the specific API of that library. For the Anemone project, this was a deal-breaker. The work that is probably closest to our prototype was done by [68] and followed up in [39], implemented within the DEC OSF/1 operating system in 1996. They use a transparent device driver to perform paging, just as we do. Again, our primary differences are as in the NOW case: a slow network, an out-of-date operating system, and no available code from which we could build a broader research project. They do, however, have a recovery system built into their work, capable of tolerating single-node failures.

2.1.2 Software Distributed Shared Memory

For shared memory systems, typically called "Software Distributed Shared Memory" (DSM), a group of nodes participates in one of a host of different consistency protocols, not unlike the hardware requirements of cache-coherent Non-Uniform Memory Access (NUMA) shared memory machines. There are many of these systems. By its nature, the purpose of cache-coherent systems is to provide a competing paradigm for parallel execution systems that depend on Message Passing Interfaces (MPI). In general, DSM and MPI are competitors, each attempting to provide the means for parallel speedup across multiple physical host machines at different levels of the computing hierarchy. MPI attempts to provide the speedup through explicit data movement across each node through a series of calls, whereas a properly implemented DSM attempts to make this data movement inherent. This is typically done either at the language level or (like MPI) at the library level, in such a way that the DSM system handles shared writes (with proper ordering) so that the concurrently running programs on different nodes need only focus on locking critical sections that access shared data structures. As we mentioned, Anemone is not a DSM, nor are we trying to do research on parallel execution. Nevertheless, some of the more popular DSM projects of the 1990s included [35] and [14], which allow a set of independent nodes to behave as a large shared memory multi-processor, often requiring customized programming to share common data across nodes.

2.2 Virtual Machine Technology and Distributed Memory

Whole operating system VM technology, in which multiple independent, and possibly different, operating systems run simultaneously on the same machine, has been re-invented in the last decade. The modern virtual machine monitor, or hypervisor, is inspired by three different kinds of OS virtualization: (a) library operating systems, (b) microkernels (versus monolithic kernels), and (c) the commodity OS virtualization work of the early 1970s. We will briefly survey some of these ideas and how they have influenced choices in our work, resulting in a project called "MemX" [49]. When that work was completed, MemX was the first system in a VM environment that provided unmodified LMAs with completely transparent and virtualized access to cluster-wide distributed memory over commodity gigabit Ethernet LANs. We begin our survey of virtual machine technology with microkernels and then discuss modern hypervisors.

2.2.1 Microkernels

Microkernels were attempts by the operating systems community in the 1980s and 90s to shrink the core OS base and move more of the subsystems of a traditional "macro" (monolithic) OS into user-land processes or servers. This decreased the privileges of those subsystems, improving fault isolation from foreign device drivers, and required fast communication mechanisms for them to talk to each other. Other motivations for the use of microkernels included the ability to provide UNIX-compatible environments without the need to constantly port drivers to new systems and without the need to port new systems to new CPU architectures. As long as the microkernel and the supporting communication framework are kept constant as a standard, one gains a great deal of interoperability, a source of headaches that continues to exist today. The advantages provided by microkernels and virtual machines are almost identical, and, without going into too much of a philosophical debate, virtual machine designers add more hypervisor-aware code to current operating systems every year. One could almost consider modern hypervisors to be microkernels [45]. Probably the only reason that microkernels did not become more widespread was that industry support for these research prototypes never fully gained traction, whereas virtual machine technology has managed to do so. Nevertheless, the exploration of microkernels had a great deal of success beginning in the 1980s, including successful projects like Mach [8], Chorus [7], Amoeba [72], and L4 [64]. Notable work was also performed on "library" operating systems. These are based on the idea of having a root system "fork" off a smaller operating system in much the same way library code is stored and loaded on demand. These kinds of systems do not fall cleanly under the definition of a microkernel, but they are closer to microkernels than to virtual machines because they also depend on fast communication primitives and their focus is not to provide full virtualization of multiple CPU architectures. Such systems included the Exokernel [36] and Nemesis [62].

2.2.2 Modern Hypervisors

The first hypervisors (the current term for the longer "virtual machine monitor") have been around since the late 1960s [10] and were developed all the way through the late 1970s (primarily by industry), until academic research began to focus on microkernels, which dominated research until the mid-1990s. These early hypervisors were generally paired directly with specific hardware and were meant to support multiple identical copies of the same operating system. After the microkernel movement slowed down, probably the first "revival" of hypervisor technology started with Disco [23]. The context of this work was cache-coherent NUMA machines, motivated by IBM's work [10]. Its focus was similar: to support multiple commodity operating systems, but with as few changes as possible. A popular open-source attempt called "User Mode Linux" [2] also sprang up for a short while, but operated completely in userland. (We actually used it for a while to test our early distributed memory prototypes, but its developer base did not continue to grow.) At the turn of the century, two more hypervisors arrived: Denali [6] (which was later modified to be a microkernel) and the familiar VMware system.

Modern hypervisors currently fall into three categories: (a) full virtualization, (b) para-virtualization, and (c) pre-virtualization. Para-virtualization indicates that the OS has been modified to be aware that it is virtualized and to provide direct support to the underlying hypervisor to improve the speed of virtualizing memory accesses and device emulation. Full virtualization indicates that the guest operating system (the OS being virtualized by a hypervisor) has not been modified to support virtualization. Full virtualization can be supported in two ways: with or without hardware support. Both AMD [13] and Intel [3] provide hardware support for virtualization by enabling the processor to trap directly into the hypervisor when the guest attempts to execute a privileged instruction that must be emulated. Full virtualization systems like KVM [4] depend completely on hardware support. Projects like Xen [20] support both para-virtualized and fully-virtualized operating systems, with and without support from hardware. The second way to perform full virtualization is to use binary translation, as is the case with VMware. A critique of this approach is that the translation must be performed at run time, incurring execution overheads of up to 20%. Similarly, pre-virtualization [63] is a related attempt to perform these translations offline, in a layered manner or with a custom compiler, but existing prototypes have not gained much traction in the community.

Finally, para-virtualization takes the opposite approach: virtualization is performed by modifying the operating system itself. This technique met with a lot of success in the Xen project [20], which is the hypervisor platform used in this work. Recently, the Linux and Windows communities have been updating these macro-kernels with hypervisor-aware hooks to mitigate the overhead of forward-porting. Such changes will also benefit many of the aforementioned full-virtualization technologies. Other para-virtualization techniques include operating-system-level virtualization, similar to [2], in which the OS itself and all processes are isolated into individual containers without the use of a true hypervisor [1].

2.3 VM Migration Techniques

Chapter 5 targets the performance of live migration of virtual machines. The technique we use, accompanied by a handful of new optimizations, is called "Post-Copy". Live migration is a mandatory feature of modern hypervisors: it facilitates server consolidation, system maintenance, and lower power consumption. Post-copy refers to the deferral of the "copy" phase of live migration until after the virtual machine's CPU state has been migrated. Pre-copy refers to the opposite ordering and is currently the dominant way to migrate a process or virtual machine. A survey of the different units and types of migration follows.
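As a rough illustration of the two orderings (and not of any hypervisor's actual code path), the sketch below contrasts the phases: pre-copy iterates over memory first and transfers CPU state last, while post-copy transfers the CPU state first and then resolves memory afterwards. The helper functions are placeholder stubs for the operations named in the text.

/* Hypothetical sketch contrasting the two orderings described above.
 * The helpers are stubs standing in for the operations named in the
 * text; they are not hypervisor APIs. */
#include <stdbool.h>
#include <stdio.h>

static bool dirty_set_small(int round)      { return round >= 3; }  /* pretend the dirty set converges */
static void copy_dirty_pages(void)          { puts("copy dirty pages"); }
static void stop_and_copy_cpu(void)         { puts("pause VM, transfer CPU/device state"); }
static void resume_on_target(void)          { puts("resume VM on target"); }
static void serve_faults_and_prepage(void)  { puts("demand-fetch faulted pages, actively push the rest"); }

/* Pre-copy: memory first (iteratively), CPU state last. */
static void migrate_precopy(void)
{
    int round = 0;
    do {
        copy_dirty_pages();          /* whole memory in round 0, then re-dirtied pages */
    } while (!dirty_set_small(round++));
    stop_and_copy_cpu();             /* short downtime: final dirty pages + CPU state */
    resume_on_target();
}

/* Post-copy: CPU state first, memory afterwards. */
static void migrate_postcopy(void)
{
    stop_and_copy_cpu();             /* minimal state moves up front */
    resume_on_target();              /* VM runs at the target immediately */
    serve_faults_and_prepage();      /* each page crosses the network at most once */
}

int main(void)
{
    puts("-- pre-copy --");
    migrate_precopy();
    puts("-- post-copy --");
    migrate_postcopy();
    return 0;
}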

2.3.1 Process Migration

The post-copy algorithm (whose name has assumed different titles) has appeared in the context of process migration in four previous incarnations: it was first implemented as "Freeze Free" using a file server [84] in 1996, simulated in 1997 [83] (which is where the term post-copy was first coined), and later followed up by an actual Linux implementation in 2003 [74], the original creator of the "hybrid" assisted post-copy scheme, which we will summarize later. In 2008, a version under the openMosix kernel was again presented with respect to process migration in [85]. Our contributions instead address new challenges at the virtual machine level that are not seen at the process level and benchmark an array of applications affecting the different metrics of full virtual machine migration, which these approaches do not do. The closest work to post-copy is a report called SnowFlock [44]. They use a similar technique in the context of parallel computing by introducing "impromptu clusters", which clone a VM to multiple destination nodes and collect results from the new clones. They do not compare their scheme to (or optimize upon) the original pre-copy system. Their page-fault avoidance heuristics are also different in that they para-virtualize Xen guests to avoid transmitting free pages, whereas we use ballooning, as it is less invasive and transparent to kernel operations. Process migration schemes, well surveyed in [71], have not become widely pervasive, though several projects exist, including Condor [30], Mosix [19], libckpt [80], CoCheck [91], Kerrighed [58], and Sprite [34].

The migration of entire operating systems is inherently free of residual dependencies while still providing a live and clean unit of migration. Techniques also exist to migrate applications [71] or entire VMs [17, 27, 73] to nodes that have more free resources (memory, CPU) or better data access locality. Both Xen [27] and VMware [73] support migration of VMs from one physical machine to another, for example, to move a memory-hungry enterprise application from a low-memory node to a memory-rich node. However, large memory applications within each VM are still constrained to execute within the memory limits of a single physical machine at any time. In fact, we have shown that MemX can be used in conjunction with VM migration in Xen, combining the benefits of both live VM migration and distributed memory access. MOSIX [19] is a management system that uses process migration to allow sharing of computational resources among a collection of nodes, as if in a single multiprocessor machine. However, each process is still restricted to using memory resources within a single machine.

2.3.2 Pre-Paging

The post-copy algorithm does its best (as pre-copy does) to identify the collective working set of the virtual machine's processes, a concept first identified for individual processes in 1968 [32]. Pre-copy does this with shadow paging: the use of an additional read-only page table level that tracks the dirtying of pages. Post-copy does this through the reception of a page-fault. We mitigate the effect of faults on applications through the use of pre-paging, a technique that also goes by different titles. In virtual-memory and application-level solutions, it is called pre-paging. At the I/O level, or the actual paging-device level, it can also be referred to as "adaptive prefetching". For process migration and distributed memory systems it can also be referred to as "adaptive distributed paging" (whereas ordinary distributed paging suffers from the residual dependency problem, and may or may not involve the use of pre-fetching). In either case, we use the term pre-paging to refer to a migration system that adaptively "flushes" out all of the distributed pages while simultaneously trying to hide the latency of page-faults, as pre-fetching does. We do not use disks or intermediate nodes. Traditionally, the algorithms involved in pre-paging take both reactive and history-based approaches to anticipate, as best as possible, what the working set of the application may be. Pre-paging has experienced a brief resurgence this decade and goes back as far as 1968 [76]; a survey can be found in [94]. In our case, we implement a reactive approach with a few optimizations at the virtual machine level, described later. History-based approaches may benefit future work, but we do not implement them here.
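A minimal sketch of the reactive idea, expanding a pre-paging window symmetrically around the address of the most recent network fault (the pivot), is given below. The function names, the fixed bubble growth, and the user-space setting are illustrative assumptions rather than the actual implementation evaluated in Chapter 5.

/* Hypothetical sketch of reactive pre-paging around a pivot ("bubbling").
 * On each network fault we reset the pivot to the faulting page and then
 * push pages outward from it, skipping pages already sent. */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES    32
#define BUBBLE_STEP 2            /* pages pushed on each side per fault */

static bool sent[NR_PAGES];

static void push_page(int pfn)   /* stand-in for the actual page transfer */
{
    if (pfn >= 0 && pfn < NR_PAGES && !sent[pfn]) {
        sent[pfn] = true;
        printf("push pfn %d\n", pfn);
    }
}

/* Called after servicing a network fault at 'pivot': bubble outwards. */
static void prepage_bubble(int pivot, int *lo, int *hi)
{
    *lo = *hi = pivot;
    push_page(pivot);
    for (int step = 0; step < BUBBLE_STEP; step++) {
        push_page(--(*lo));      /* grow the bubble to the left  */
        push_page(++(*hi));      /* grow the bubble to the right */
    }
}

int main(void)
{
    int lo, hi;
    int faults[] = { 10, 11, 25 };            /* a toy sequence of network faults */
    for (int i = 0; i < 3; i++)
        prepage_bubble(faults[i], &lo, &hi);
    return 0;
}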

2.3.3 Live Migration

System-level virtual machine migration has been revived by several projects, including architecture-independent approaches with VMware migration [73] and Xen migration [27], architecture-dependent projects using VT-x or VT-d chips with the KVM project in Linux [4], operating-system-level approaches that do not use hypervisors (similar to capsules/pods) with the OpenVZ system [1], and even wide-area-network approaches [22], all of which can potentially benefit from the post-copy method of VM migration presented in this dissertation. Furthermore, the self-migration of operating systems has much in common with the migration of single processes [48]. The same group built this on top of their "Nomadic Operating Systems" project [47], as well as their first prototype implementation on top of the L4 Linux microkernel using "NomadBIOS". All of these systems currently use pre-copy based migration schemes.

2.3.4 Non-Live Migration

There are several non-live approaches to migration, in which the dependent applications must be completely suspended during the entire migration. The term capsule was introduced by Schmidt in [87]. In this work, capsules were implemented by grouping together processes in Linux or Solaris operating systems and migrating all of their state as a group, as opposed to the full operating system. Along the same lines, Zap [78] uses units of migration called process domains (pods), which are essentially process groups along with their process-to-kernel interfaces such as file handles and sockets. Migration is done by suspending the pod and copying it to the target. Connections to active services are not maintained during transit. The Denali project [6, 5] dealt with migrating checkpointed VMware virtual machines across a network, incurring longer migration downtime. Chen and Noble suggested using hardware-level virtual machines for user mobility [41]. The Capsules/COW project [24] addresses user mobility and system administration by encapsulating the state of computing environments as objects that can be transferred between distinct physical hosts, citing the example of transferring an OS instance to a home computer while the user drives home from work. The OS instance is not active during the transfer. The "Internet Suspend/Resume" project [66] focuses on the capability to save and restore computing state on anonymous hardware. The execution of the virtual machine is suspended during transit. In contrast to these systems, our aim is to transfer live, active OS instances over fast networks without stopping them.

2.3.5 Self Ballooning

Ballooning is the act of changing the amount of physical memory seen by the operating system at runtime. Ballooning has already been used several times in virtual machine technology, but no implementation has yet been made continuous in production, nor has the use of ballooning been investigated across different VM migration systems, which is the purpose of this work. Prior ballooning work includes VMware's 2002 publication [96], which was inspired by "self-paging" in the Nemesis operating system [46]. It is not clear, however, how their ballooning mechanisms interact with different forms of VM migration, which is what we are trying to investigate. Xen is also capable of simple one-time ballooning during migration and at system boot time. Additionally, an effort is being made by a group within Oracle Corp. to commit a general version of self-ballooning into the Xen upstream development tree [67]. Such contributions will help standardize the use of ballooning.
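For concreteness, the sketch below shows the kind of policy such a self-ballooning driver follows: periodically read the kernel's committed-memory estimate and drive the balloon target toward it plus a safety slack. The /proc/meminfo fields are standard Linux interfaces, but the target computation, the constants, and the program structure are illustrative assumptions, not the Xen or Oracle driver code.

/* Hypothetical sketch of a self-ballooning policy: shrink the guest's
 * memory target toward what the kernel says is actually committed,
 * plus a slack reserve. Reading /proc/meminfo is standard Linux; the
 * policy itself is only an illustration of the idea described above. */
#include <stdio.h>
#include <string.h>

/* Read a field such as "Committed_AS" from /proc/meminfo, in kilobytes. */
static long meminfo_kb(const char *field)
{
    char line[256];
    long value = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, field, strlen(field)) == 0) {
            sscanf(line + strlen(field) + 1, "%ld", &value);
            break;
        }
    }
    fclose(f);
    return value;
}

int main(void)
{
    const long slack_kb = 64 * 1024;          /* illustrative 64 MB reserve */
    long committed = meminfo_kb("Committed_AS");
    long total     = meminfo_kb("MemTotal");
    if (committed < 0 || total < 0)
        return 1;

    long target = committed + slack_kb;       /* desired guest size */
    if (target > total)
        target = total;                       /* never ask for more than is present now */

    /* A real driver would now hand 'target' to the balloon mechanism;
     * here we just report it. */
    printf("current %ld kB, balloon target %ld kB\n", total, target);
    return 0;
}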

2.4 Over-subscription of Virtual Machines

The most notable attempts to oversubscribe virtual machine memory were presented in [96] for VMware and [33] for Xen. These projects work very well, but the amount of VM memory is constrained to what is available on the physical host. Additionally, a couple of DSM-level attempts to present a Single-System Image (SSI) for unmodified VMs exist in [12] and [69]. Building an SSI was not the focus of this dissertation; rather, our focus was to allow local virtual machines to gain cluster memory access. This is because we want to increase VM consolidation and migration performance rather than spread processing out into the cluster.

Thus, the processor resources available to such VMs in our work are only available on one host. Ballooning, as described in the previous section, also allows VMs to oversubscribe virtual machine memory, but it requires direct operating system participation. Ballooning also does not allow access to non-resident memory; it requires a one-to-one static memory allocation throughout the virtual machine's lifetime. To date, the CIVIC system, described in Chapter 6, is the first attempt to apply distributed memory to unmodified virtual machines running applications with large memory requirements in a low-latency environment through the use of network paging and shadow memory interception within the Xen hypervisor.

Chapter 3

Anemone: Distributed Memory Access

In this chapter, we describe our initial distributed memory work, the Anemone project, in detail. The performance of large memory applications degrades rapidly once the system hits the physical memory limit and starts paging or thrashing. We present the design, implementation, and evaluation of Distributed Anemone (Adaptive Network Memory Engine), a lightweight and distributed system that pools together the collective memory resources of multiple Linux machines across a gigabit Ethernet LAN. Anemone treats distributed memory as another level in the memory hierarchy between very fast local memory and very slow local disks. Anemone enables applications to access potentially "unlimited" network memory without any application or operating system modifications (when Anemone is used as a swap device). Our kernel-level prototype features fully distributed resource management, low-latency paging, resource discovery, load balancing, soft-state refresh, and support for 'jumbo' Ethernet frames. Anemone achieves low page-fault latencies averaging 160µs, and application speedups of up to 4 times for single processes and up to 14 times for multiple concurrent processes when compared against disk-based paging.

3.1 Introduction

Performance of large-memory applications (LMAs) can suffer from large disk access latencies when the system hits the physical memory limit and starts paging to local disk. At the same time, affordable, low-latency gigabit Ethernet is becoming commonplace with support for jumbo frames (packets larger than 1500 bytes). Consequently, instead of paging to a slow local disk, one could page over gigabit Ethernet to the unused memory of distributed machines and use the disk only when distributed memory is exhausted.

Thus, distributed memory can be viewed as another level in the traditional memory hierarchy, filling the widening performance gap between low-latency RAM and high-latency disk. In fact, distributed memory paging latencies of about 160µs or less can be easily achieved, whereas disk read latencies range anywhere from 6 to 13ms. A natural goal is to enable unmodified LMAs to transparently utilize the collective distributed memory of nodes across a gigabit Ethernet LAN. Several prior efforts [28, 38, 37, 59, 68, 39, 70, 90] have addressed this problem by relying upon expensive interconnect hardware (ATM or Myrinet switches), slow, bandwidth-limited LANs (10Mbps/100Mbps), or heavyweight software Distributed Shared Memory (DSM) systems [35, 14] that require intricate consistency/coherence techniques and, often, customized application programming interfaces. Additionally, extensive changes were often required to the LMAs, the OS kernel, or both.

Our earlier work [50] addressed the above problem through an initial prototype, called the Adaptive Network Memory Engine (Anemone) – the first attempt at demonstrating the feasibility of transparent distributed memory access for LMAs over a commodity gigabit Ethernet LAN. This was done without requiring any OS changes or recompilation, and relied upon a central node to map and exchange pages between nodes in the cluster. Here we describe the implementation and evaluation of a fully distributed Anemone architecture. Like the centralized version, distributed Anemone uses lightweight, pluggable Linux kernel modules and does not require any OS changes. Additionally, it achieves the following significant improvements over the centralized system.

1. Full distribution: Memory resource management is distributed across the whole cluster. There is no single control node.

2. Low latency: The round-trip time from one machine to another is reduced to around 160µs – over a factor of 3 lower than the centralized Anemone prototype, and far below disk access latencies.

3. Load balancing: Clients make intelligent decisions to direct distributed memory traffic across all available memory servers, taking into account their memory usage and paging load.

4. Dynamic Discovery and Release: A distributed resource discovery mechanism enables clients to discover newly available servers and track memory usage across the cluster. The protocol also has a mechanism for releasing servers and re-distributing their memory so that individual servers can be taken down for maintenance.

5. Large packet support: The distributed version incorporates the flexibility to decide whether or not ’jumbo’ frames should be used, based on the network hardware in use, allowing operation in networks with any MTU size. Our protocol is custom-built without the use of TCP. As far as the application is concerned, network transmission does not exist, so the end-to-end design of our protocol is built to satisfy the efficiency needs of code in the kernel.

We evaluated our prototype using unmodified LMAs such as ray-tracing, network simulations, in-memory sorting, and k-nearest neighbor search. Results show that the system is able to reduce average page-fault latencies from 8.3ms to 160µs. Single-process applications (including those that internally contain threads) speed up by up to a factor of 4, and multiple concurrent processes by up to a factor of 14, when compared against disk-based paging.

3.2 Design & Implementation

Distributed Anemone has two major software components: the client module on low-memory machines and the server module on machines with unused memory. The client module appears to the client system simply as a block device that can be configured in multiple ways.

• Storage: the “device” can be treated like storage. One can place any filesystem on top of it and mount it like a regular filesystem.

Figure 3.1: Placement of distributed memory within the classical memory hierarchy (registers, cache, main memory, remote memory, disk, tape).

• Memory Mapping: one can memory-map the Anemone device directly, creating the view of a linear array of addresses within the application itself. This is a standard practice in many applications, most popularly for the dynamic loading of libraries, but it can be made explicit through standard system calls.

• Paging Device: the system can be used for distributed memory paging directly by the operating system. This is the mode we use to evaluate the system later on.

Whenever an LMA needs more virtual memory, the pager (swap daemon) in the client swaps out pages from the client to the server machines. As far as the pager is concerned, the client module is just a block device, not unlike a hard disk partition. Internally, however, the client module maps swapped-out pages to distributed memory servers. At a high level, our goal was to develop a prototype that realizes the view presented in Figure 3.1, where distributed memory represents a new level of the memory access hierarchy.

The servers themselves are also regular machines that happen to have unused memory to contribute, and they can in fact switch between the roles of client and server at different times, depending on their memory requirements. Client machines discover available servers by using a simple distributed resource discovery mechanism. Servers provide regular feedback about their load information to clients, both as part of the resource discovery process and as part of the regular paging process (piggybacked on acknowledgments). Clients use this information to schedule page-out requests by choosing the least loaded server node to which to send a new page. Both the clients and servers also use a soft-state refresh protocol to maintain the liveness of pages stored at the servers. The earlier Anemone prototype [50] differed in that the page-to-server mapping logic was maintained at a central Memory Engine instead of at individual client nodes. Although simpler to implement, this centralized architecture incurred two extra round-trip times on every request, besides forcing all traffic to go through the central Memory Engine, which can become a single point of failure and a significant bottleneck.

Figure 3.2: The components of a client (large-memory application and pager, block device interface, write-back cache, RMAP mapping and protocol logic, and the NIC).

Figure 3.3: The components of a server (the page store in RAM, RMAP mapping and protocol logic, and the NIC).

3.2.1 Client and Server Modules

Figure 3.2 illustrates the client module that handles paging operations. It has four major components:

1. The Block Device Interface (BDI),

2. a basic LRU-based write-back cache,

3. mapping logic for server location of swapped-out pages, and

4. a Remote Memory Access Protocol (RMAP) layer.

The pager issues read and write requests to the BDI in 4 KB data blocks. The device driver that exports the BDI is instructed to keep page write requests aligned on 4 KB boundaries. (The usual sector size of a block device is 512 bytes.) The BDI, in turn, performs read and write operations against our write-back cache, so pages are not transmitted until eviction. When the cache is full, a page is evicted to a server using RMAP. Figure 3.3 illustrates the two major components of the server module: (1) a hash table that stores client pages along with the client’s identity (layer-2 MAC address), and (2) the RMAP layer. The server module can store and retrieve pages for any client machine. Once the server reaches capacity, it responds to the requesting client with a negative acknowledgment. It is then the client’s responsibility to select another server, if available, or to page to disk if necessary. Page-to-server mappings are kept in a standard chained hashtable. Linked lists contained within each bucket hold 64-byte entries that are managed using the Linux slab allocator (which performs fine-grained management of small, equal-sized memory objects).
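As a rough illustration of how such small, fixed-size mapping entries can be managed, the sketch below creates a dedicated slab cache with the standard Linux kmem_cache interface. The structure layout and all anemone_-prefixed names are hypothetical, not the actual Anemone code.

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical per-page mapping entry; the field layout is illustrative only. */
struct anemone_map_entry {
    struct list_head bucket;   /* chaining within one hash bucket     */
    u64              offset;   /* page offset on the block device     */
    u8               mac[6];   /* layer-2 address of the peer         */
    u16              flags;
};

static struct kmem_cache *map_cache;

static int __init map_cache_init(void)
{
    /* One slab cache of small, equal-sized objects, as described above.
     * (The kmem_cache_create() constructor argument differs across 2.6
     * kernel versions; passing NULL sidesteps that here.) */
    map_cache = kmem_cache_create("anemone_map",
                                  sizeof(struct anemone_map_entry),
                                  0, 0, NULL);
    return map_cache ? 0 : -ENOMEM;
}

static struct anemone_map_entry *map_entry_alloc(void)
{
    return kmem_cache_alloc(map_cache, GFP_ATOMIC);
}

static void map_entry_free(struct anemone_map_entry *e)
{
    kmem_cache_free(map_cache, e);
}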

Standard disk block devices interact with the kernel through a request queue mechanism, which permits the kernel to group spatially consecutive block I/Os (BIOs) together into one “request” and schedule them using an elevator algorithm for seek-time minimization. Unlike disks, Anemone is essentially a random-access device with a fixed read/write latency. Thus, the BDI does not need to group sequential BIOs. It can bypass request queues, perform out-of-order transmissions, and asynchronously handle unacknowledged, outstanding RMAP messages.
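The following is a minimal sketch of what bypassing the request queue looks like for a bio-based block driver on a 2.6-era kernel. The anemone_ and rmap_ names are hypothetical, and the exact bio and bio_endio() signatures shifted across 2.6.x releases, so treat this only as the general shape of the approach.

#include <linux/blkdev.h>
#include <linux/bio.h>

/* Hypothetical hook into the RMAP layer: queue one 4 KB page for
 * asynchronous transmission and complete the bio when the ack returns. */
extern int rmap_submit_page(struct bio *bio, sector_t sector, int write);

/* Called directly for every bio; no request queue, no elevator. */
static int anemone_make_request(struct request_queue *q, struct bio *bio)
{
    int write = (bio_data_dir(bio) == WRITE);

    /* Anemone is a fixed-latency, random-access device, so each bio is
     * forwarded to the network layer as-is and completed out of order
     * when the corresponding RMAP reply (or local cache hit) arrives. */
    if (rmap_submit_page(bio, bio->bi_sector, write))
        bio_endio(bio, -EIO);  /* on some 2.6 kernels: bio_endio(bio, bytes, err) */

    return 0;
}

/* During device setup (error handling omitted):
 *   queue = blk_alloc_queue(GFP_KERNEL);
 *   blk_queue_make_request(queue, anemone_make_request);
 */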

3.2.2 Remote Memory Access Protocol (RMAP)

RMAP is a tailor-made, low-overhead communication protocol for distributed memory access within the same subnet. It implements the following features: (1) reliable packet delivery, (2) flow control, and (3) fragmentation and reassembly. While one could technically communicate over TCP, UDP, or even the IP protocol layer, this choice comes burdened with unwanted protocol processing. Instead, RMAP takes an integrated, faster approach, communicating directly with the network device driver, sending frames and handling reliability issues in a manner that suits the needs of the Anemone system. Every RMAP message is acknowledged except for soft-state and dynamic discovery messages. Timers trigger retransmissions when necessary (which is extremely rare) to guarantee reliable delivery: we cannot allow a paging request to be lost, or the application that depends on that page will fail altogether. RMAP also implements flow control to ensure that it does not overwhelm either the receiver or the intermediate network cards and switches.
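One plausible shape for this retransmission logic is a per-request kernel timer that re-sends an unacknowledged frame after a timeout and is cancelled when the matching acknowledgment arrives. The sketch below uses the 2.6-era timer API; the rmap_ structures and helpers are hypothetical.

#include <linux/timer.h>
#include <linux/jiffies.h>

#define RMAP_RETRANS_TIMEOUT  (HZ / 10)   /* illustrative: 100 ms */

/* Hypothetical per-request bookkeeping. */
struct rmap_request {
    struct timer_list retrans_timer;
    unsigned int      seq;        /* RMAP sequence number */
    unsigned int      retries;
};

extern void rmap_transmit(struct rmap_request *req);   /* (re)send the frame */

static void rmap_retrans_expired(unsigned long data)
{
    struct rmap_request *req = (struct rmap_request *)data;

    /* Losing a paging request is not an option, so keep retrying. */
    req->retries++;
    rmap_transmit(req);
    mod_timer(&req->retrans_timer, jiffies + RMAP_RETRANS_TIMEOUT);
}

static void rmap_arm_retransmit(struct rmap_request *req)
{
    setup_timer(&req->retrans_timer, rmap_retrans_expired,
                (unsigned long)req);
    mod_timer(&req->retrans_timer, jiffies + RMAP_RETRANS_TIMEOUT);
}

/* When the matching acknowledgment arrives:
 *   del_timer_sync(&req->retrans_timer);
 */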

The performance of any distributed system is heavily influenced by the networking requirements imposed on it, including both the design of the network and the application’s requirements. To minimize latency and protocol-related processing overhead, a conscious choice was made to eliminate the use of TCP/IP and write a simpler, lightweight protocol. The subset of networking functions needed by our system in the kernel was significantly smaller than the full set provided by the combination of TCP and IP in a cluster of machines. Four of the most prominent features that we do not include are:

• Port Abstraction: Our system has no use for the concept of ports, application-level socket buffers, byte streams, or in-order delivery. Since our system operates at the block-I/O level, these mostly application-driven requirements disappear.

• IP Addresses: The system does not operate across routed IP subnets, nor do we plan on supporting this feature due to its performance overheads. Routed paths detract from the distributed nature of the system and create unwanted link-congestion bottlenecks with flows from other networks, which is not the kind of problem we are trying to attack. As a result, the way one node addresses and communicates with another is simplified. We also found that a custom protocol was much easier to maintain in the kernel, because clients and servers can address each other over the network directly, without the need to juggle IP addresses and socket error handling.

• Fragmentation: With the right use of the Linux networking API, this turned out to be a far simpler problem to solve: today’s Linux provides a good enough design abstraction to deploy a non-IP-based, zero-copy fragmentation solution. Furthermore, our protocol can auto-detect the MTU of the system’s NIC and automatically send larger packets (so-called ‘jumbo’ frames) if the card supports them, especially because we have no need for multi-network ICMP MTU discovery (assuming that all hops in the network support the same MTU size).

• Segmentation Offload: The performance of 10-gigabit and higher-speed networks depends heavily on the use of TCP segmentation and checksum offloading. It is gradually becoming quite commonplace to find gigabit cards with offloading engines that the kernel can exploit, and recent 2.6 kernels have integrated the zero-copy use of segmentation into their TCP/IP APIs. We have observed that, under a highly active system, the network can easily exhibit full-speed workloads. Since we use RMAP, this potentially frees up segmentation offloading for application-level networking traffic that might be running concurrently within the same guest VM.

Figure 3.4 depicts what a typical Anemone packet header looks like.

Figure 3.4: A view of a typical Anemone packet header: the RMAP header (type, status, sequence, session ID, fragmentation flags, and a union of advertisement and page-request fields) is carried inside an Ethernet frame, followed by page data if any. The RMAP protocol transmits these directly to the network card from the BDI device driver.
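Reading the fields off Figure 3.4, the on-wire RMAP header can be pictured as a C structure along the following lines. The field widths, ordering, and packing are our own guesses for illustration, not the exact Anemone layout.

#include <linux/types.h>

/* Sketch of the RMAP header carried after the Ethernet header.
 * Field names follow Figure 3.4; sizes and packing are illustrative. */
struct rmap_header {
    u8   type;          /* page-out, page-in, ack, advertisement, ...  */
    u8   status;        /* e.g. OK / negative acknowledgment           */
    u32  seq;           /* packet sequence number                      */
    u32  session_id;    /* client session identifier (soft state)      */
    u16  frag_flags;    /* fragmentation flags / fragment index        */
    union {
        struct {                    /* resource advertisement          */
            u32 load_status;
            u32 load_capacity;
        } advert;
        struct {                    /* page request                    */
            u64 offset;             /* page offset on the block device */
            u32 size;
        } req;
    } u;
} __attribute__((packed));
/* Page data, if any, follows the header in the same Ethernet frame. */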

The last design consideration in RMAP is that, while the standard memory page size is 4KB (although it is not uncommon for an operating system to employ 4 MB super-pages for better use of the translation lookaside buffer), the maximum transmission unit (MTU) in traditional Ethernet networks is limited to 1500 bytes. RMAP therefore implements dynamic fragmentation/reassembly for paging traffic. Additionally, RMAP has the flexibility to use Jumbo frames, which are packets with sizes greater than 1500 bytes (typically between 8KB and 16KB). Jumbo frames enable RMAP to transmit complete 4KB pages to servers using a single packet, without fragmentation. Our testbed includes an 8-port switch that supports Jumbo frames (9KB packet size). We observe a 6% speedup in RMAP throughput when using Jumbo frames. However, in this Chapter, we conduct all experiments with 1500-byte MTU sizes, with fragmentation/reassembly performed by RMAP.
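The number of RMAP fragments per page follows directly from the NIC’s MTU, which the module can read from the net_device structure. A small sketch, with an illustrative header size (the real header length is defined by the layout in Figure 3.4):

#include <linux/netdevice.h>

#define RMAP_HDR_LEN   32      /* illustrative header size, see Figure 3.4 */
#define PAGE_PAYLOAD   4096

/* How many Ethernet frames does one 4 KB page need on this NIC?
 * With a 1500-byte MTU this yields 3 fragments; with 9 KB jumbo
 * frames a page fits in a single frame. */
static inline unsigned int rmap_frags_per_page(const struct net_device *dev)
{
    unsigned int payload = dev->mtu - RMAP_HDR_LEN;

    return (PAGE_PAYLOAD + payload - 1) / payload;   /* ceiling division */
}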

3.2.3 Distributed Resource Discovery

As servers constantly join or leave the network, Anemone can (a) seamlessly absorb the increase or decrease in cluster-wide memory capacity, insulating LMAs from resource fluctuations, and (b) allow any server to reclaim part or all of its contributed memory. This objective is achieved through the distributed resource discovery described below and the soft-state refresh described next in Section 3.2.4. Clients can discover newly available distributed memory in the cluster, and servers can announce their memory availability. Each server periodically broadcasts a Resource Announcement (RA) message (1 message every 10 seconds in our prototype) to advertise its identity and the amount of memory it is willing to contribute. Besides RAs, servers also piggyback their memory availability information on their page-in/page-out replies to individual clients. This distributed mechanism permits any new server in the network to dynamically announce its presence and allows existing servers to announce their up-to-date memory availability information to clients.
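On the server side, the periodic Resource Announcement can be pictured as a self-rescheduling work item. The sketch below assumes a hypothetical helper, rmap_broadcast_announcement(), that builds and broadcasts the RA frame; the workqueue calls are the standard delayed-work API of later 2.6 kernels.

#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define RA_PERIOD (10 * HZ)   /* one announcement every 10 seconds */

/* Hypothetical helper: build an RMAP advertisement frame carrying this
 * server's identity and the amount of memory it is willing to contribute,
 * and broadcast it on the local segment. */
extern void rmap_broadcast_announcement(void);

static struct delayed_work ra_work;

static void ra_work_fn(struct work_struct *work)
{
    rmap_broadcast_announcement();
    schedule_delayed_work(&ra_work, RA_PERIOD);   /* re-arm for the next period */
}

static void ra_start(void)
{
    INIT_DELAYED_WORK(&ra_work, ra_work_fn);
    schedule_delayed_work(&ra_work, RA_PERIOD);
}

/* On module unload: cancel_delayed_work_sync(&ra_work); */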

3.2.4 Soft-State Refresh

Distributed Anemone also includes soft-state refresh mechanisms (keep-alives) to permit clients to track the liveness of servers and vice versa. Firstly, the RA message serves the additional purpose of informing the client that the server is alive and accepting paging requests. In the absence of any paging activity, if a client does not receive a server’s RA for three consecutive periods, it assumes that the server is offline and deletes the server’s entries from its hashtables. If the client also had pages stored on the server that went offline, it needs to recover the corresponding pages from a copy stored either on the local disk or in another server’s memory. Soft state also permits servers to track the liveness of clients whose pages they store. Each client periodically transmits a Session Refresh message to each server that hosts its pages (1 message every 10 seconds in our prototype), which carries a client-specific session ID. The client module generates a different and unique ID each time the client restarts. If a server does not receive refresh messages with matching session IDs from a client for three consecutive periods, it concludes that the client has failed or rebooted and frees up any pages stored on that client’s behalf.
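On the client side, liveness can be tracked with a per-server counter that is cleared on every announcement and incremented once per silent period. The structures and field names below are hypothetical, and the page-recovery path is deliberately elided.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/types.h>

#define MAX_MISSED_RA 3   /* three silent periods => server presumed offline */

/* Hypothetical per-server record kept by the client module. */
struct anemone_server {
    struct list_head list;
    u8               mac[6];
    unsigned int     missed_ra;        /* periods without an announcement   */
    unsigned long    avail_pages;      /* advertised capacity               */
    unsigned long    pages_stored;     /* our pages held by this server     */
    unsigned long    requests_served;  /* paging requests it handled for us */
};

static LIST_HEAD(server_list);
static DEFINE_SPINLOCK(server_lock);

/* Called whenever an RA (or a piggybacked load report) arrives. */
static void server_seen(struct anemone_server *s, unsigned long avail)
{
    s->missed_ra   = 0;
    s->avail_pages = avail;
}

/* Called once per announcement period from a timer or work item. */
static void server_liveness_scan(void)
{
    struct anemone_server *s, *tmp;

    spin_lock(&server_lock);
    list_for_each_entry_safe(s, tmp, &server_list, list) {
        if (++s->missed_ra >= MAX_MISSED_RA) {
            /* Presume the server offline: drop its entries and recover any
             * pages it held from disk or from another server (elided). */
            list_del(&s->list);
            kfree(s);
        }
    }
    spin_unlock(&server_lock);
}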

3.2.5 Server Load Balancing

Memory servers themselves are commodity nodes in the network that have their own processing and memory requirements. Hence, another design goal of Anemone is to avoid overloading any particular server node as far as possible by transparently distributing the paging load evenly. In the earlier centralized architecture, this function was performed by the memory engine, which kept track of server utilization levels. Distributed Anemone implements additional coordination among servers and clients to exchange accurate load information. Section 3.2.3 described the mechanism used to perform resource discovery. Clients use the server load information gathered from resource discovery to decide which server should receive each new page-out request. This decision is based upon one of two criteria: (1) the number of pages stored at each active server, and (2) the number of paging requests serviced by each active server. While (1) attempts to balance the memory usage at each server, (2) attempts to balance the request-processing overhead.
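Given per-server records like the ones sketched in Section 3.2.4 above, selecting the destination for a new page-out is a simple minimum scan over whichever of the two criteria is configured. The pages_stored and requests_served counters are the hypothetical fields from that sketch, not names from the actual implementation.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Choose the target for a new page-out: either the server currently
 * storing the fewest of our pages (criterion 1) or the one that has
 * serviced the fewest requests (criterion 2). Reuses the server_list,
 * server_lock, and struct anemone_server sketched earlier. */
static struct anemone_server *anemone_pick_server(int by_requests)
{
    struct anemone_server *s, *best = NULL;
    unsigned long best_load = ~0UL;

    spin_lock(&server_lock);
    list_for_each_entry(s, &server_list, list) {
        unsigned long load = by_requests ? s->requests_served
                                         : s->pages_stored;

        if (s->avail_pages > 0 && load < best_load) {
            best_load = load;
            best = s;
        }
    }
    spin_unlock(&server_lock);

    return best;   /* NULL => no server has room, so fall back to disk */
}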

3.2.6 Fault-tolerance

The ultimate consequence of a failure in swapping to distributed memory is no worse than a failure in swapping to local disk. However, the probability of failure is greater in a LAN environment because of the multiple components involved in the process, such as network cards, connectors, and switches. Although RMAP provides reliable packet delivery at the protocol level, as described in Section 3.2.2, our future work plans to build two alternatives for tolerating server failures: (1) maintaining a local disk-based copy of every memory page swapped out over the network, which provides the same level of reliability as disk-based paging but risks performance interference from local disk activity; and (2) keeping redundant copies of each page on multiple distributed servers, which avoids disk activity and reduces recovery time but consumes bandwidth, reduces the global memory pool, and is susceptible to network failures. In an ideal implementation, the memory servers would participate in a protocol similar to RAID-5 [26].

3.3 Evaluation

The Anemone testbed consists of one 64-bit, low-memory AMD 2.0 GHz client machine containing 256 MB of main memory and nine distributed-memory servers. The DRAM on these servers is as follows: four 512 MB machines, three 1 GB machines, one 2 GB machine, and one 3 GB machine, totaling almost 9 gigabytes of distributed memory. The 512 MB servers have Intel processors ranging from 800 MHz to 1.7 GHz. The other five machines are all 2.7 GHz and above Intel Xeons, with mixed PCI and PCI Express motherboards.

For disk-based tests, we used a WD800JD 80 GB SATA disk with a 7200 RPM rotational speed, 8 MB of cache, and an 8.9ms average seek time (which is consistent with our results). This disk has a 10 GB swap partition reserved on it to match the equivalent amount of distributed memory available in the cluster, which we use exclusively when comparing our system against the disk. Each machine is equipped with an Intel PRO/1000 gigabit Ethernet card connected to one of two 8-port gigabit switches, one from Netgear and one from SMC. The performance results presented below can be summarized as follows. Distributed Anemone reduces read latencies to an average of 160µs, compared to an 8.3ms average for disk and a 500µs average for centralized Anemone. For writes, both disk and Anemone deliver similar latencies due to write caching. In our experiments, Anemone delivers a factor of 1.5 to 4 speedup for single-process LMAs, and up to a factor of 14 speedup for multiple concurrent LMAs. Our system can successfully operate with both multiple clients and multiple servers. We also ran experiments in which multiple client machines access the memory system simultaneously; these results are equally successful as the single-process cases.

Figure 3.5: Random read latency CDF for Anemone and local disk (500,000 random reads over a 6 GB space).

3.3.1 Paging Latency

To begin the experiments, we first characterize the microbenchmark behavior we observe for different types of I/O, for both read and write streams. The next four graphs present these results for both our memory system and the disk. Figures 3.5, 3.6, 3.7, and 3.8 show the distribution of observed read and write latencies for sequential and random access patterns with both Anemone and disk. Though real-world applications rarely generate purely sequential or completely random memory access patterns, these graphs provide a useful measure for understanding the underlying factors that impact application execution times. Most random read requests to disk experience a latency between 5 and 10 milliseconds. On the other hand, most requests in Anemone experience only around 160µs latency. Most sequential read requests to disk are serviced by the on-board disk cache within 3 to 5µs, because sequential read accesses fit well with the motion of the disk head. In contrast, Anemone delivers a range of latency values, most below 100µs. This is because network communication latency dominates in Anemone even for sequential requests, though it is masked to some extent by the prefetching performed by the pager and the file system within the Linux kernel. The write latency distributions for both disk and Anemone are comparable, with most latencies being close to 9µs, because writes typically return after writing to the local Linux buffer cache (which is now a unified page cache in Linux 2.6).

Figure 3.6: Sequential read latency CDF for Anemone and local disk (500,000 sequential reads over a 6 GB space).

Figure 3.7: Random write latency CDF for Anemone and local disk (500,000 random writes over a 6 GB space).

Figure 3.8: Sequential write latency CDF for Anemone and local disk (500,000 sequential writes over a 6 GB space).

Application   Size (GB)   Local Mem   Distr. Anemone   Disk    Speedup (Disk/Anemone)
POV-Ray       3.4         145         1996             8018    4.02
Quicksort     5           N/A         4913             11793   2.40
NS2           1           102         846              3962    4.08
KNN           1.5         62          7.1              2667    3.7

Table 3.1: Average application execution times (in seconds) and speedups for local memory, Distributed Anemone, and disk. N/A indicates insufficient local memory.

3.3.2 Application Speedup

Single-Process LMAs: Table 3.1 summarizes the performance improvements seen by unmodified single-process LMAs using the Anemone system. This setup, similar to the previous microbenchmarks, has a single LMA process on a single client node using the memory system, with all nine available servers at its disposal. The first application is a ray-tracing program called POV-Ray [81]. The memory consumption of POV-Ray was varied by rendering different scenes with an increasing number of colored spheres. Figure 3.9 shows the completion times of these increasingly large renderings, up to 3.4 GB of memory, versus the disk using an equal amount of local swap space. The figure clearly shows that Anemone delivers increasing application speedups with increasing memory usage and is able to improve the execution time of a single-process POV-Ray rendering by a factor of 4 at 3.4 GB of memory usage. The second application is a large in-memory Quicksort program that uses a C++ STL-based implementation [89], with a complexity of O(N log N) comparisons. We sorted randomly populated large in-memory arrays of integers. Figure 3.10 shows that Anemone delivers a factor of 2.4 speedup for a single-process Quicksort using 5 GB of memory. The third application is the popular NS2 network simulator [75]. We simulated a delay partitioning algorithm [42] on a 6-hop wide-area network path using voice-over-IP traffic traces. Factors contributing to memory usage in NS2 include the number of nodes being simulated, the amount of traffic sent between nodes, and the choice of protocols at different layers. Table 3.1 shows that, with NS2 requiring 1 GB of memory, Anemone speeds up the simulation by a factor of 4 compared to disk-based paging. The fourth application is the k-nearest neighbor (KNN) search algorithm on large 3D datasets, using code from [29]. This algorithm is useful in applications such as medical imaging, molecular biology, CAD/CAM, and multimedia databases. Table 3.1 shows that, when executing the KNN search algorithm over a dataset of 2 million points consuming 1.5 GB of memory, Anemone speeds up the search by a factor of 3.7 over disk-based paging.

Figure 3.9: Execution times of POV-Ray for increasing problem sizes (render time in seconds vs. amount of scene memory, for local memory, Anemone, and local disk).

Figure 3.10: Execution times of STL Quicksort for increasing problem sizes (sort time in seconds vs. sort size, for local memory, Anemone, and local disk).

Multiple Concurrent LMAs: In this section, we test the performance of Anemone under varying levels of concurrent application execution. Multiple concurrently executing LMAs tend to stress the system by competing for computation, memory, and I/O resources and by disrupting any sequentiality in paging activity, including competing for buffer space on the network switch itself, particularly at gigabit speeds. Figures 3.11 and 3.12 show the execution time comparison of Anemone and disk as the number of POV-Ray and Quicksort processes increases. The execution time measures the interval between the start of execution and the completion of the last process in the set. We keep each process at around 100 MB of memory. The figures show that execution times using disk-based swap increase steeply with the number of processes: paging activity loses sequentiality as the number of processes grows, making disk seek and rotational overheads dominant. On the other hand, Anemone reacts very well, as its execution time increases very slowly, due to the fact that network latencies are mostly constant regardless of sequentiality. With 12–18 concurrent LMAs, Anemone achieves speedups of a factor of 14 for POV-Ray and a factor of 6.0 for Quicksort.

Figure 3.11: Execution times of multiple concurrent processes executing POV-Ray (Anemone vs. local disk).

Figure 3.12: Execution times of multiple concurrent processes executing STL Quicksort (Anemone vs. local disk).

Figure 3.13: Effects of varying the transmission window using a 1 GB Quicksort (bandwidth achieved in Mbit/s, number of retransmissions, and completion time in seconds vs. maximum window size).

3.3.3 Tuning the Client RMAP Protocol

One of the important knobs in RMAP’s flow-control mechanism is the client’s transmission window size. Using a 1 GB Quicksort, Figure 3.13 shows the effect of changing this window size on three characteristics of Anemone’s performance: (1) the number of retransmissions, (2) paging bandwidth, represented in terms of “goodput”, i.e. the amount of bandwidth obtained after excluding retransmitted bytes and header bytes, and (3) completion time. Recall that in our implementation of the RMAP protocol we use a static window size, configured once before runtime. This means that the traditional sense of “flow control” that you would expect from a TCP-style protocol is not fully dynamic. As a result, our window size is chosen empirically to be large enough to maintain network throughput but small enough to fit within the capabilities of the NIC’s ring buffers. A complete implementation of RMAP would provide a dynamic flow-control window, but we leave that to future work.

To demonstrate this, Figure 3.13 shows that as the window size increases, the number of retransmissions increases, because the number of packets that can potentially be delivered back-to-back also increases. For larger window sizes, the paging bandwidth is also seen to increase and then saturate, because the transmission link remains busy more often, delivering higher goodput in spite of an initial increase in the number of retransmissions. However, if driven too high, the window size causes the paging bandwidth to decline considerably due to an increasing number of packet drops and retransmissions. The application completion times depend upon the paging bandwidth: initially, an increase in window size increases the paging bandwidth and lowers the completion times; similarly, if driven too high, the window size causes more packet drops, more retransmissions, lower paging bandwidth, and higher completion times.
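A minimal sketch of this static-window scheme: an atomic count of unacknowledged frames is compared against the configured maximum before each transmission. The names and the default value are illustrative; the real module must also park and later resume the block-I/O path when the window is full.

#include <linux/atomic.h>   /* <asm/atomic.h> on older 2.6 kernels */

/* Window configured once at module load time, as described above. */
static int rmap_max_window = 8;              /* illustrative default */
static atomic_t rmap_in_flight = ATOMIC_INIT(0);

/* Returns non-zero if a new frame may be transmitted right now. */
static int rmap_window_open(void)
{
    return atomic_read(&rmap_in_flight) < rmap_max_window;
}

static void rmap_frame_sent(void)      /* after handing the frame to the NIC */
{
    atomic_inc(&rmap_in_flight);
}

static void rmap_ack_received(void)    /* on every acknowledgment */
{
    atomic_dec(&rmap_in_flight);
    /* Here the real implementation would kick any queued requests. */
}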

3.3.4 Control Message Overhead

To measure the control traffic overhead due to RMAP, we measured the percentage of control bytes generated by RMAP compared to the amount of data bytes transferred while executing a 1 GB POV-Ray application. Control traffic refers to the page headers, acknowledgments, resource announcement messages, and soft-state refresh messages. We first varied the number of servers from 1 to 6, with a single client executing the POV-Ray application. Next, we varied the number of clients from 1 to 4 (each executing one instance of POV-Ray), with 3 memory servers. The percentage of control traffic overhead was consistently measured at 1.74% – a very small percentage of the total paging traffic.

3.4 Summary

In this Chapter, we presented Distributed Anemone – a system that enables unmodified large memory applications to transparently utilize the unused memory of nodes across a gigabit Ethernet LAN. Unlike its centralized predecessor, Distributed Anemone features fully distributed memory resource management, low-latency distributed memory paging, distributed resource discovery, load balancing, soft-state refresh to track the liveness of nodes, and the flexibility to use Jumbo Ethernet frames. We presented the architectural design and implementation details of a fully operational Anemone prototype. Evaluations using multiple real-world applications, including ray-tracing, large in-memory sorting, network simulations, and nearest neighbor search, show that Anemone speeds up single-process applications by up to a factor of 4 and multiple concurrent processes by up to a factor of 14, compared to disk-based paging. Average page-fault latencies are reduced from 8.3ms with disk-based paging to 160µs with Anemone.

Chapter 4

MemX: Virtual Machine Uses of Distributed Memory

In this Chapter, we present our experiences in developing a fully transparent distributed system, called MemX, within the Xen VM environment that coordinates the use of cluster-wide memory resources to support large memory workloads.

4.1 Introduction

In modern cluster-based platforms, VMs can enable functional and performance isolation across applications and services. VMs also provide greater resource allocation flexibility, improve utilization efficiency, enable seamless load balancing through VM migration, and lower the operational cost of the cluster. Consequently, VM environments are increasingly being considered for executing grid and enterprise applications over commodity high-speed clusters. However, such applications tend to have memory workloads that can stress the limited resources within a single VM by demanding more memory than the slice available to the VM. Clustered bastion hosts (mail, network-attached storage), data mining applications, scientific workloads, virtual private servers, and backend support for websites are common examples of resource-intensive workloads. I/O bottlenecks in these applications can quickly form due to frequent access to large disk-resident datasets, paging activity, flash crowds, or competing VMs on the same node. Even though virtual machines with demanding workloads are here to stay as integral parts of modern clusters, significant improvements are needed in the ability of memory-constrained VMs to handle these workloads.

I/O activity due to memory pressure can prove to be particularly expensive in a virtualized environment, where all I/O operations need to traverse an extra layer of indirection. Over-provisioning of memory resources (and, in general, any hardware resource) within a physical machine may not always be a viable solution, as it can lead to poor resource utilization efficiency, besides increasing operational costs. Although domain-specific out-of-core computation techniques [56, 65] and migration strategies [71, 17, 27] can also improve application performance up to a certain extent, they do not overcome the fundamental limitation that an application is restricted to using the memory resources within a single physical machine, particularly for the aforementioned applications that are not generally parallelized.

In this Chapter, we present the design, implementation, and evaluation of the MemX system for VMs, which bridges the I/O performance gap in a virtualized environment by exploiting low-latency access to the memory of other nodes across a gigabit cluster. MemX is fully transparent to user applications – developers do not need any specialized APIs, libraries, recompilation, or relinking for their applications, nor does the application’s dataset need any special pre-processing, such as data partitioning across nodes. We compare and contrast the three modes in which MemX can operate with Xen VMs [20]:

1. MemX-DomU: the system within individual guest virtual machines. The letter ‘U’ in “DomU” refers to the guest domain; specifically, it refers to the fact that these domains are “unprivileged” relative to Dom0.

2. MemX-DD: the system within a common driver domain (DD); in this case, Dom0 functions as the DD. However, this time the system is shared by multiple guest OSes that co-reside with the DD. We use “DD” to indicate that the client module runs in the same place as in the MemX-Dom0 case (within Domain Zero itself), except that the client module is actually used by applications located within guest VMs (DomU), rather than by applications within the driver domain (Dom0) itself.

3. MemX-Dom0: the distributed memory system within the privileged management domain, called “Dom0” in Xen terms. This represents the base virtualization overhead without the presence of other guest virtual machines.

The proposed techniques can also work with other VM technologies besides Xen. We focus on Xen mainly due to its open source availability and para-virtualization support. In the performance section, we also compare all three options to the baseline case where just a regular, non-virtualized Linux system is used as described in Chapter 3.

4.2 Split Driver Background

As we stated in Chapter 2, Xen is an open source virtualization technology that provides secure resource isolation. Xen provides close to native machine performance through the use of para-virtualization [97] – a technique by which the guest OS is co-opted into reducing the virtualization overheads via modifications to its hardware dependent components. The modifications enable the guest OS to execute over virtualized hardware and devices rather than over bare metal. In this section, we review the background of the Xen I/O subsystem as it relates to the design of MemX.

Xen exports I/O devices to each guest OS (DomU) as virtualized views of “class” devices, as opposed to real physical devices. For example, Xen exports a block device or a network device, rather than a specific hardware make and model. The actual drivers that interact with the native hardware devices execute within Dom0 – the privileged domain that can directly access all hardware in the system. Dom0 acts as the management VM that coordinates device access and privileges among all of the other guest domains. In the rest of the Chapter, we will use the terms driver domain and Dom0 interchangeably.

Physical devices (and their device drivers) can be multiplexed among multiple concurrently executing guest OSes. To enable this multiplexing, the privileged driver domain and the unprivileged guest domains (DomU) communicate by means of a split device-driver architecture, shown in Figure 4.1. The driver domain hosts the backend of the split driver for the device class and the DomU hosts the frontend. The backends and frontends interact using high-level device abstractions instead of low-level, hardware-specific mechanisms. For example, a DomU only cares that it is using a block device; it does not worry about the specific type of driver that is controlling that block device.

Figure 4.1: Split device driver architecture in Xen: a frontend driver in the guest OS and a backend driver in the driver domain communicate through event channels and grant tables provided by the Xen hypervisor, which also mediates safe access to the physical device via the native driver.

Frontends and backends communicate with each other via the grant table: an in-memory communication mechanism that enables efficient bulk data transfers across domain boundaries. The grant table enables one domain to allow another domain access to its pages in system memory. The access mechanism can include read, write, or mutual exchange of pages. The primary use of the grant table in device I/O is to provide a fast and secure mechanism for unprivileged DomU domains to receive indirect access to hardware devices. Grants enable the driver domain to set up a DMA-based data transfer directly to/from the system memory of a DomU, rather than performing the DMA to/from the driver domain’s memory with additional copying of the data between the DomU and the driver domain. In other words, the grant table enables zero-copy data transfers across domain boundaries.

The grant table can be used to either share or transfer pages between the DomU and the driver domain, depending upon whether the I/O operation is synchronous or asynchronous in nature. For example, because block devices perform synchronous data transfers, the driver domain knows at the time of I/O initiation which DomU issued the block I/O request. In this case, the frontend of the block driver in DomU notifies the Xen hypervisor (via the gnttab_grant_foreign_access hypercall) that a memory page can be shared with the driver domain. A hypercall is the hypervisor’s equivalent of a system call in the operating system. The DomU then passes a grant table reference ID via the event channel to the driver domain, which sets up a direct DMA to/from the memory page of the DomU. Once the DMA is complete, the DomU removes the grant reference (via the gnttab_end_foreign_access call). On the other hand, network devices receive data asynchronously. This means that the driver domain does not know the target DomU for an incoming packet until the entire packet has been received and its header examined. In this situation, the driver domain DMAs the packet into its own page and notifies the Xen hypervisor (via the gnttab_grant_foreign_transfer call) that the page can be transferred to the target DomU. The driver domain then transfers the received page to the target DomU and receives a free page in return from the DomU. In summary, Xen’s I/O subsystem for shared physical devices uses a split driver architecture that involves an additional level of indirection through the driver domain and the Xen hypervisor, with efficient optimizations to avoid data copying during bulk data transfers.
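A hedged sketch of the DomU side of the synchronous (block-device) grant path, using the Linux/Xen grant-table helpers named above. Error handling, the ring/event-channel notification, and the machine-frame-number lookup are elided, and the exact signatures can differ slightly between kernel versions.

#include <xen/grant_table.h>

/* DomU side of the synchronous block path: grant the driver domain
 * (backend_domid) access to the page backing one block I/O request,
 * send the returned grant reference over the ring / event channel,
 * and revoke the grant once the backend signals completion.
 * mfn is the page's machine frame number, obtained via the pv-ops
 * page translation helpers. */
static int share_page_with_backend(domid_t backend_domid,
                                   unsigned long mfn, int readonly)
{
    /* Returns a grant reference, or a negative value if none is free. */
    return gnttab_grant_foreign_access(backend_domid, mfn, readonly);
}

static void unshare_page(int gref, int readonly)
{
    /* The last argument may name a page to free once the grant ends;
     * passing 0 simply ends the foreign access. */
    gnttab_end_foreign_access(gref, readonly, 0UL);
}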

4.3 Design and Implementation

The core functionality of the MemX system partially builds upon our previous work and is encapsulated within kernel modules that do not require modifications to either the Linux kernel or the Xen hypervisor. However, the interaction of the core modules with the rest of the virtualized subsystem presents several alternatives. In this section, we briefly discuss the different design alternatives for the MemX system, justify the decisions we make, and present the implementation details.

Figure 4.2: MemX-Linux: Baseline operation of MemX in a non-virtualized Linux environment. The client can communicate with multiple memory servers across the network to satisfy the memory requirements of large memory applications.

4.3.1 MemX-Linux: MemX in Non-virtualized Linux

Figure 4.2 shows the operation of MemX in a non-virtualized (vanilla) Linux environment. An earlier variant of MemX-Linux was published in [51]; MemX-Linux includes several additional features listed later in this section. For completeness, we summarize the architecture of MemX-Linux here and use it as a baseline for comparison with the virtualized versions of MemX – the primary focus of this work.

The two main components of MemX-Linux are the client module on the low-memory machines and the server module on the machines with unused memory. The two communicate with each other using a remote memory access protocol (RMAP), described in detail in Chapter 3, Section 3.2.2. Both client and server components execute as isolated Linux kernel modules. Aside from optimizations, this code operates much the same way as described in Chapter 3. Nevertheless, there are a number of important changes to that work, and we present a brief summary of those components here.

Client and Server Modules: The client module provides a virtualized block device interface to the large dataset applications executing on the client machine. This block device can either be: a) configured as a low-latency primary swap device, b) treated as a low-latency volatile store for large data sets accessed via the standard file-system interface, or c) memory mapped into the address space of an executing large memory application. Internally, the client module maps the single linear I/O space of the block device to the unused memory of multiple distributed servers, using a memory-efficient, radix-tree-based mapping. The old system used a hashtable-based implementation, but we found it to use large amounts of memory for the table data structure (for buckets and entries), particularly as we acquired newer machines with substantially more memory than our old ones. A radix tree is a modified trie structure in which the tree is indexed by strings over an alphabet, one character at a time.

This structure works well for key types such as addresses and file offsets, which are used throughout the distributed memory system. As before, the memory system discovers and communicates with distributed server modules using a custom-designed, reliable protocol. Servers broadcast periodic resource announcement messages, which the client modules use to discover the available memory servers. Servers also include feedback about their memory availability and load during both resource announcements and regular page transfers with clients. When a server reaches capacity, it declines to serve any new write requests from clients, which then try to select another server, if available, or otherwise write the page to disk. Binding these modules together is the Remote Memory Access Protocol (RMAP); this protocol is described later in more detail than was provided in the previous Chapter. The server module is also designed to allow a server node to be taken down while live: our RMAP implementation can disperse, re-map, and load-balance an individual server’s pages to any other servers in the cluster that are capable of absorbing those pages, allowing the server to shut down without killing any of its clients’ applications. Getting this custom protocol to work properly in a virtualized environment exposed a great number of kernel bugs that were not originally present in the Anemone prototype, making the system much more robust.
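The page-offset-to-server mapping described above can be held in the kernel’s generic radix tree, keyed by the page index on the MemX block device. The memx_mapping record and the helper names below are hypothetical, not the actual MemX code.

#include <linux/radix-tree.h>
#include <linux/gfp.h>
#include <linux/types.h>

/* Hypothetical mapping record: which server holds a given page. */
struct memx_mapping {
    u8  server_mac[6];
    u32 flags;
};

/* One tree per MemX block device; GFP_ATOMIC because insertions can
 * happen from the block-I/O path. A production implementation would
 * preload the tree (radix_tree_preload()) before inserting. */
static RADIX_TREE(memx_map_tree, GFP_ATOMIC);

static int memx_map_insert(unsigned long page_index, struct memx_mapping *m)
{
    return radix_tree_insert(&memx_map_tree, page_index, m);
}

static struct memx_mapping *memx_map_lookup(unsigned long page_index)
{
    return radix_tree_lookup(&memx_map_tree, page_index);
}

static struct memx_mapping *memx_map_remove(unsigned long page_index)
{
    return radix_tree_delete(&memx_map_tree, page_index);
}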

Additional Virtualization Features: MemX also includes a couple of additional features that are not the specific focus of this work and that were not present in the original Anemone system either. The first is the ability to support named distributed memory data spaces that can be shared by multiple clients. This provides a read-only DSM system in which data stored on server nodes remains persistent through the life of the server, even when client nodes disconnect from the system altogether. When a client re-connects, all servers that have past records for that client re-forward the necessary mapping information, allowing the client to reconstruct its radix-tree mappings and begin re-accessing the same persistent data. The system does not allow multiple concurrent writers, however, as this was not the focus of our work. Two other features turned out to be very important to the memory system as a whole for virtualization-specific reasons. First, because of the split-driver design described above, the device driver needs to support multiple major and minor block numbers. The driver is then responsible for mapping one device per local virtual machine on a physical host, allowing completely seamless, transparent access by multiple VM clients on the same host. This was part of the motivation behind switching from a hashtable to a radix tree: the mapping data stored to look up page locations would be relatively large for so many virtual machines, on the order of tens of megabytes. It turned out that the worst-case lookup time for the tree was comparable to the hashtable and did not detract from the efficient performance of the system. Second, we had to optimize the fragmentation implementation designed in the Anemone system: it had to support zero-copy transmission and receipt of page fragments, which would not have worked properly in the original system.

4.3.2 MemX-DomU (Option 1): MemX Client Module in DomU

In order to support large dataset applications within a VM environment, the simplest design option is to place the MemX client module within the kernel of each guest OS (DomU), whereas the distributed server modules continue to execute within a non-virtualized Linux kernel on machines connected to the physical network. This option is illustrated in Figure 4.3. The client module exposes the block device interface to large memory applications within the DomU as in the baseline, but communicates with the distributed servers using the virtualized network interface (VNIC) exported by the network driver in the driver domain. The VNIC in Xen is also organized as a split device driver in which the frontend (residing in the guest OS) and the backend (residing in the driver domain) talk to each other using the well-defined grant table and event channel mechanisms. Two event channels are used between the backend and frontend of the VNIC – one for packet transmissions and one for packet receptions. To perform zero-copy data transfers across the domain boundaries, the VNIC performs a page exchange with the backend for every packet received or transmitted, using the grant table. All backend interfaces in the driver domain can communicate with the physical NIC, as well as with each other, via a virtual network bridge. Each VNIC is assigned its own MAC address, whereas the driver domain’s own internal VNIC in Dom0 uses the physical NIC’s MAC address. The physical NIC itself is placed in promiscuous mode by the driver domain to enable the reception of any packet addressed to any of the local virtual machines. The virtual bridge demultiplexes incoming packets directed towards the target VNIC’s backend driver.

Figure 4.3: MemX-DomU: Inserting the MemX client module within DomU’s Linux kernel. The server executes in non-virtualized Linux.

Compared to the baseline non-virtualized MemX-Linux deployment, MemX-DomU has the additional overhead of requiring every network packet to traverse domain boundaries, in addition to being multiplexed or demultiplexed at the virtual network bridge. Additionally, the client module needs to be separately inserted within each DomU that might potentially execute large memory applications. Also note that each I/O request is typically 4 KBytes in size, whereas our network hardware uses a 1500-byte MTU (maximum transmission unit), unless the underlying network supports Jumbo frames. Thus the client module needs to fragment each 4 KByte write request into (and reassemble a complete read reply from) at least 3 network packets. In MemX-DomU, each fragment needs to traverse the domain boundary to reach the backend. Due to current memory allocation policies in Xen, buffering for each fragment ends up consuming an entire 4 KByte page worth of memory, which results in three times the actual memory needed within the machine. In the non-virtualized case, each of those fragments would come from the same physical page because of the internal Linux slab allocator, but virtualization requires those fragments to be separated out. Newer Xen versions may offer solutions to this type of problem, but we leave it for now. We contrast this performance overhead in greater detail with MemX-DD (option 2) below.

4.3.3 MemX-DD (Option 2): MemX Client Module in Driver Domain

A second design option is to place the MemX client module within the driver domain (Dom0) and allow multiple DomUs to share this common client module via their virtualized block device (VBD) interfaces. This option is shown in Figure 4.4. The guest OS executing within the DomU VM does not require any MemX-specific modifications. The MemX client module executing within the driver domain exposes a block device interface, as before. Any DomU whose applications require distributed memory resources configures a split VBD. The frontend of the VBD resides in the DomU and the backend in the driver domain. The frontend and backend of each VBD communicate using event channels and the grant table, as in the earlier case of VNICs. (This splitting of interfaces is completely automated by the Xen system itself.) The MemX client module provides a separate lettered VBD slice (/dev/memx{a,b,c}, etc.) for each backend that corresponds to a distinct DomU. On the network side, the MemX client module attaches itself to the driver domain’s VNIC, which in turn talks to the physical NIC via the virtual network bridge. For performance reasons, here we assume that the VNIC and the disk are co-located – meaning both drivers are within the same privileged driver domain (Dom0). Thus the driver domain’s VNIC does not need to be organized as another split driver; rather, it is a single software construct that can attach directly to the virtual bridge. During execution within a DomU, read/write requests to distributed memory are generated in the form of synchronous I/O requests to the corresponding virtual block device frontend. These requests are sent to the MemX client module via the event channel and the grant table. The client module packages each I/O request into network packets and transmits them asynchronously to distributed memory servers using RMAP.

Figure 4.4: MemX-DD: Executing a common MemX client module within the driver domain, allowing multiple DomUs to share a single client module. The server module continues to execute in non-virtualized Linux.

Note that, although the network packets still need to traverse the virtual network bridge, they no longer need to traverse a split VNIC architecture, unlike in MemX-DomU. One consequence is that, while the client module still needs to fragment a 4 KByte I/O request into 3 network packets to fit the MTU requirements, each fragment no longer needs to occupy an entire 4 KByte buffer, unlike in MemX-DomU. As a result, only one 4 KByte I/O request needs to cross the domain boundary over the split block device driver, as opposed to three 4 KB packet buffers in Section 4.3.2. Finally, since the guest OSes within DomUs do not require any MemX-specific software components, the DomUs can potentially run any para-virtualized OS, not just XenoLinux.

However, compared to the non-virtualized baseline case, MemX-DD still has the additional overhead of using the split VBD and the virtual network bridge, though still with highly acceptable performance. Also note that, unlike MemX-DomU, MemX-DD does not currently support seamless migration of live Xen VMs using distributed memory. This is because part of the internal state of the guest OS (in the form of page-to-server mappings) resides in the driver domain of MemX-DD and is not automatically transferred by the migration mechanism in Xen. We plan to enhance Xen’s migration mechanism to transfer this internal state information in a host-independent manner to the target machine’s MemX-DD module. Furthermore, our current implementation does not support per-DomU reservation of distributed memory, which can potentially violate isolation guarantees. This reservation feature is currently being added to our prototype.

4.3.4 MemX-Dom0 (Option 3)

As we mentioned in the introduction, we will also present this scenario. Again, this is described as the distributed memory system within Dom0 (same as the driver domain), except that the applications are executed directly within this domain and not inside of a guest domain. This represents the base virtualization overhead without the presence of other guest virtual machines.

4.3.5 Alternative Options

Guest Physical Address Space Expansion: Another alternative to supporting large memory applications with direct distributed memory is to enable support for this indirectly via a larger pseudo-physical memory address space than is normally available within the physical machine. This option would require fundamental modifications to the memory management in both the Xen hypervisor as well as the guest OS. In particular, at boot time, the guest OS would believe that it has a large “physical” memory – or the so called pseudo- physical memory space. It then becomes the Xen hypervisor’s task to map each DomU’s large partly into guest-local memory, partly into distributed memory, and the rest to sec- ondary storage. This is analogous to the large conventional virtual address space available to each process that is managed transparently by traditional operating systems. The func- tionality provided by this option is essentially equivalent to that provided by MemX-DomU

and MemX-DD. However, this option requires the Xen hypervisor to take up a prominent

role in the memory address translation process, something that the original design of Xen strives

to minimize. Exploring this option is the focus of Chapter 6.
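As a rough illustration of what such a design would entail, the following is a hedged sketch (not an actual Xen or MemX data structure) of the per-DomU mapping the hypervisor would have to maintain, with each pseudo-physical frame resolving to local RAM, a remote memory server, or secondary storage.

    /* Hypothetical per-DomU mapping table for the address-space-expansion
     * option described above. All names are illustrative. */
    enum pfn_location { PFN_LOCAL, PFN_REMOTE, PFN_DISK };

    struct pfn_entry {
        enum pfn_location where;
        union {
            unsigned long mfn;                                        /* machine frame if local  */
            struct { unsigned server; unsigned long offset; } remote; /* distributed memory      */
            unsigned long block;                                      /* secondary-storage block */
        } loc;
    };

    /* On a guest page fault, the hypervisor would look up the faulting
     * pseudo-physical frame number and fetch or map it accordingly. */
    struct pfn_entry *resolve_pfn(struct pfn_entry *table, unsigned long pfn)
    {
        return &table[pfn];
    }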

MemX Server Module in DomU: Technically speaking, we can also execute the MemX server module within a guest OS, coupled with Options 1 or 2 above. This could enable one to initiate a VM solely for the purpose of providing distributed memory to other low-memory client VMs that are either across the cluster or even within the same physical machine. However, in practice, this option does not seem to provide any significant functional benefit, whereas the overheads of executing the server module within a DomU are considerable. It is also unnecessary because our system already supports the re-distribution of server memory to nearby servers, allowing a server to shut down if necessary; this obviates the need to run the module within a virtual machine. Consequently, we do not pursue this option further.

4.3.6 Network Access Contention

Handling network contention within the physical machine itself was the biggest (solvable) difficulty arising from our decision to implement RMAP without TCP/IP. Three major factors contribute to network contention in our system:

• Inter-VM Congestion: MemX generates traffic at the block-I/O level. In a virtual machine environment, each guest VM on a given node assumes that it has full control of the NIC, when in reality that NIC is generally shared among multiple VMs. We elaborate on this simple but important problem of inter-VM congestion in Section 4.4.3 while evaluating multiple VM performance.

• Flow Control: Currently, RMAP uses a static send window per MemX node. In a subnet with fairly constant round trip times, this serves us well, although a reactive approach where the receiver informs the client of the size of its receive window could be easily deployed. We have not observed a need for this feature as of yet.

• Switch/Server Congestion: MemX servers in the network can potentially be the destination for dozens of client pages. Two or more clients generating traffic towards a particular server can quickly overwhelm both the switch port and the server itself. As a partial solution to this problem, MemX clients perform load-balancing across MemX servers by dynamically selecting the least loaded server for page write operations (see the sketch after this list). Empirically, we have observed that congestion happens only when the number of clients significantly outweighs the number of servers. If MemX were scaled to hundreds of switched nodes, a cross-bar or fat-tree design in addition to more advanced switch-bound congestion control would be mandatory, but our 8-node cluster has not warranted this as of yet. We plan to handle this if our testbed scales to more nodes.
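The following is a minimal sketch of the kind of least-loaded server selection described in the last bullet. The structure fields and the idea of a per-server free-page count advertised by the servers are assumptions for illustration, not MemX's actual bookkeeping.

    /* Hypothetical client-side server selection for page writes. */
    struct memx_server {
        unsigned int  id;
        unsigned int  pending_writes;   /* outstanding, un-ACKed page writes        */
        unsigned long free_pages;       /* assumed to be advertised by the servers  */
    };

    /* Pick the server with the fewest outstanding writes that still has room. */
    struct memx_server *pick_least_loaded(struct memx_server *srv, int n)
    {
        struct memx_server *best = NULL;
        int i;

        for (i = 0; i < n; i++) {
            if (srv[i].free_pages == 0)
                continue;               /* server cannot absorb another page */
            if (!best || srv[i].pending_writes < best->pending_writes)
                best = &srv[i];
        }
        return best;                    /* NULL means all servers are full; caller must wait */
    }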

4.4 Evaluation

In this section we evaluate the performance of the different variants of MemX. Our goal is to answer the following questions:

• How do the different variants of MemX compare in terms of I/O latency and bandwidth?

• What are the overheads incurred by MemX due to virtualization in Xen?

• What type of speedups can be achieved by real large memory applications using MemX when compared to virtualized disk?

• How well does MemX perform in the presence of multiple concurrent VMs?

Our testbed consists of eight machines. Each machine has 4 GB of memory, an SMP 64-bit dual-core 2.8 GHz processor, and one gigabit Broadcom Ethernet NIC. Our Xen version is

3.0.4 and XenoLinux version 2.6.16.33. Backend MemX-servers run Vanilla Linux 2.6.20.

Collectively, this provides us with over 24GB of effectively usable cluster-wide memory after accounting for roughly 1GB of local memory usage per node. We limit the local memory of client machines to a maximum of 512 MB under all test cases. In addition to the three MemX configurations described earlier, namely MemX-Linux, MemX-DomU, and

MemX-DD, we also include a fourth configuration, MemX-Dom0, for the sole purpose of performance evaluation. This additional configuration corresponds to the MemX client module executing within Dom0 itself, but not as part of the backend for a VBD. Rather, the client module in MemX-Dom0 serves large memory applications executing within Dom0 and helps to measure the basic virtualization overhead due to Xen. Furthermore, whenever we mention the "disk" baseline, we are referring to the virtualized disk within Dom0. When

MemX-DD or MemX-DomU is compared to virtualized disk in any experiment, it means that we exported the virtualized disk as a frontend VBD to the dependent guest VM, just as we exported the block device from MemX itself to applications.

Configuration        Kernel RTT
MemX-Linux           85 usec
MemX-Dom0            95 usec
MemX-DD              95 usec
MemX-DomU            115 usec
Virtualized Disk     8.3 millisec

Table 4.1: Kernel-level I/O round-trip latency for each MemX configuration.

4.4.1 Latency and Bandwidth Microbenchmarks

Figure 4.5 and Table 4.1 characterize different MemX-combinations in terms of these two metrics. Table 4.1 shows the average round trip time (RTT) for a single 4KB read request transmitted from a client module and replied to by a server node. The RTT is measured in microseconds, using the on-chip time stamp counter (TSC) register at the kernel level in the client module immediately before transmission to the NIC and after reception of the

ACK from the NIC. Thus the measured RTT values include only MemX-related time components and exclude the variable time required to deliver the page to user-level, put that process back on the ready-queue, and perform a context switch. Moreover, this is the latency that the VFS (virtual filesystem) or the system pager would experience when sending

I/O to and from MemX. MemX-Linux, as a base case, provides an RTT of 85µs. Following close behind are MemX-Dom0, MemX-DD, and MemX-DomU, in that order. The virtualized disk base case performs as expected at an average of 8.3ms. These RTT numbers show that accessing the memory of a remote machine over the network is about two orders of magnitude faster than accessing the local virtualized disk. Also, the Xen VMM introduces a negligible overhead of 10µs in MemX-Dom0 and MemX-DD over MemX-Linux. Similarly, the split network driver architecture, which needs to transfer 3 packet fragments for each 4KB block across the domain boundaries, introduces an overhead of another 20µs in MemX-DomU over MemX-Dom0 and MemX-DD.
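As an illustration of how such TSC-based round-trip timings can be taken, here is a hedged sketch of the measurement pattern; the client module does this at the kernel level, and the cpu_khz calibration value and the two callbacks below are assumptions, not the actual MemX code.

    /* Sketch of TSC-based RTT measurement, assuming an x86 CPU and a
     * calibrated cpu_khz value (cycles per millisecond / 1000). */
    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Wrap one request/reply exchange and return the elapsed microseconds. */
    uint64_t measure_rtt_usec(void (*send_request)(void), void (*wait_for_ack)(void),
                              uint64_t cpu_khz)
    {
        uint64_t start = rdtsc();
        send_request();      /* hand the 4KB read request to the NIC */
        wait_for_ack();      /* block until the reply arrives from the server */
        return ((rdtsc() - start) * 1000) / cpu_khz;   /* cycles -> microseconds */
    }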

Figure 4.5 shows throughput measurements using a custom benchmark [52] that issues long streams of random/sequential, asynchronous, 4KB requests. We ensure that the range of requests is at least twice the size of the local memory of a

Figure 4.5: I/O bandwidth for different MemX configurations, using a custom benchmark that issues asynchronous, non-blocking 4-KB I/O requests. "DIO" refers to opening the file descriptor with direct I/O turned on, to measure the effect of bypassing the Linux page cache.

client node (over 1 GB). These tests give us insight during development into where bottlenecks might exist. The throughput for all of the tests is generally at its maximum, minus the effect of CPU overhead. A small loss of 50 Mbits/second naturally occurs for

MemX-DomU, which is to be expected. The only case that suffers is random reads, which all hover around 300 Mbits/second. There is a very specific reason for this that is a direct artifact of the way the VFS in the Linux kernel handles asynchronous I/O (AIO) [60].

Asynchronous I/O and Scheduling in Linux. Block devices, by nature, handle all I/O asynchronously (AIO) unless otherwise instructed by the Virtual Filesystem (VFS). In

Linux, the AIO call stack is the fundamental atomic operation to the device (through the page cache) by which other types of I/O are realized. As of 2007, the AIO hierarchy in

Linux uses a separate thread that is arranged to run in the same process context as the user application that submitted the I/O (for those file descriptors that are asynchronous).

This way, the application can continue doing other work and check for the results later.

The core problem described here involves the kernel thread that handles AIO system calls itself: it is in fact executed synchronously after the request handoff has been made. Linux

(and perhaps other kernels) is capable of accepting a submission of multiple (sparse) AIO reads/writes using a single system call. After the system call returns, the thread then synchronously issues those I/Os to the device driver one by one (blocking and removing itself from the run-queue). For devices with variable latencies (e.g., disks), this long-standing

VFS design makes sense: the I/O should block while the device is kept busy by the dynamically generated parallel I/O produced by the read-ahead (prefetching) policies of the Linux page cache. But for random-access style devices, this offers no benefit. Additionally, the

Linux I/O scheduler makes similar assumptions for devices that have request queues

(per-driver queues that re-order I/Os for better fairness and latency guarantees).

What this means for MemX, in both virtualized and non-virtualized environments, is that outbound randomly-spaced block read I/O bandwidth (not networking bandwidth) is cut by two thirds, to about one third of its normal speed. This creates a chain reaction for these kinds of randomly-spaced reads: rather than getting a read performance of a full gigabit per second over the network, the application only experiences about three hundred megabits per second. This does not significantly affect the speedups we report in the next section, but it does explain some of the microbenchmark results presented at the beginning. To solve the problem in the future, we propose a "re-plumbing" of the VFS and

I/O scheduling subsystems to dynamically detect the underlying latency characteristics of the device (specifically, whether its latency is constant or variable) in order to allow those subsystems to take alternate code-paths that can fully exploit the deliverable performance of the underlying device. The actual blocking call is lock_page(), invoked within do_generic_mapping_read() in the Linux AIO call stack. On the bright side, as of 2007, there was a patch [60] in progress (contact information for the developers can be found in linux-2.6.xx/MAINTAINERS). The patch could be modified to handle the more specific case that MemX needs, rather than be a generic solution for all users of the page cache. We also noticed that, if the user is a userland C program

(versus, say, a filesystem thread running within the kernel), then setting O_DIRECT on the

file descriptor will cause the system call to bypass the page cache and go directly to the block I/O (BIO) layer.

Maximum throughput will then be realized. We also observed that, out of the 4 I/O schedulers available in Linux, none of them have any effect whatsoever on device drivers that do not use a request queue, which is the case for our client module implementation, since it exhibits random-access style latencies when pages are accessed through network memory. Demonstrating this problem involved: (1) instrumenting the Linux AIO stack to print out TSC-based microsecond estimates, (2) logging the MemX outbound queue size, (3) recording the amount of time between successive requests being handed to the device driver, (4) forming a preliminary hypothesis from the observation that the dependent process was spending too much time idly waiting (inside mwait_idle()), and (5) finally receiving confirmation of the hypothesis from the mainline kernel developers.
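For concreteness, below is a small sketch of the direct-I/O path just described: opening a MemX slice with O_DIRECT so that reads bypass the page cache. The 4KB alignment requirement is the usual O_DIRECT constraint; error handling is kept minimal.

    /* Read one 4KB block from a block device with O_DIRECT (no page cache). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int read_page_direct(const char *dev, off_t offset, void **out)
    {
        void *buf;
        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, 4096, 4096)) {    /* O_DIRECT requires aligned buffers */
            close(fd);
            return -1;
        }
        ssize_t n = pread(fd, buf, 4096, offset);  /* one block, straight to the BIO layer */
        close(fd);
        if (n != 4096) {
            free(buf);
            return -1;
        }
        *out = buf;
        return 0;
    }

A call such as read_page_direct("/dev/memxa", 0, &page) would then fetch one block without touching the filesystem buffer cache.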

Figures 4.6 through 4.8 compare the distributions of the total RTT measured from a user level application that performs either sequential or random I/O on either MemX or the virtual disk, both with and without the O_DIRECT flag enabled. Note that these RTT values are measured from user-level synchronous read/write system calls, which adds a few tens of microseconds to the kernel-level RTTs in Table 4.1. Figure 4.6 compares the read latency distribution for MemX-DD against disk-based I/O for both random and sequential reads via the filesystem cache.

[Figure 4.6 plot: CDF of MemX-DD vs. disk read latencies (buffered); x-axis: latency in microseconds (log scale); y-axis: percent of requests; curves: MemX-DD-Rand, MemX-DD-Seq, Disk-Rand, Disk-Seq.]

Figure 4.6: Comparison of sequential and random read latency distributions for MemX-DD and disk. Reads traverse the filesystem buffer cache. Most random read latencies are an order of magnitude smaller with MemX-DD than with disk. All sequential reads benefit from filesystem prefetching.

[Figure 4.7 plot: CDF of MemX-DD vs. disk write latencies (buffered); x-axis: latency in microseconds (log scale); y-axis: percent of requests; curves: MemX-DD-Rand, MemX-DD-Seq, Disk-Rand, Disk-Seq.]

Figure 4.7: Comparison of sequential and random write latency distributions for MemX-DD and disk. Writes go through the filesystem buffer cache. Consequently, all four latencies are similar due to write buffering.

[Figure 4.8 plot: CDF of MemX-DD vs. disk random read latencies, buffered vs. direct I/O; x-axis: latency in microseconds (log scale); y-axis: percent of requests; curves: MemX-DD-Buffer, MemX-DD-Direct, Disk-Buffer, Disk-Direct.]

Figure 4.8: Effect of filesystem buffering on random read latency distributions for MemX-DD and disk. About 10% of random read requests (issued without the direct I/O flag) are serviced at the filesystem buffer cache, as indicated by the first knee below 10µs for both MemX-DD and disk.

Random read latencies are an order of magnitude smaller with MemX-DD (around 160µs) than with disk (around 9ms). Sequential read latency distributions are similar for MemX-DD and disk, primarily due to filesystem prefetching.

Figure 4.7 shows the RTT distribution for buffered write requests. Again, MemX-DD and disk show similar distributions, mostly less than 10µs, due to write buffering. Figure 4.8 demonstrates the effect of passing the O_DIRECT flag to the open() system call, which bypasses the filesystem buffer cache. The random read latency distributions without the flag display a distinct knee below 10µs, indicating that roughly 10% of the random read requests are serviced at the filesystem buffer cache and that prefetching benefits MemX as well as disk.

We observed a similar trend for sequential read distributions, with and without the flag, where the first knee indicated that about 90% of sequential reads were serviced at the

filesystem buffer cache.

[Figure 4.9 plot: Quicksort sort time (seconds) vs. sort size (GB) for MemX-DomU, MemX-DD, MemX-Linux, local memory, and local disk.]

Figure 4.9: Quicksort execution times in various MemX combinations and disk. While clearly surpassing disk performance, MemX-DD trails regular Linux only slightly using a 512 MB Xen guest.

4.4.2 Application Speedups

We now evaluate the execution times of a few large memory applications using our testbed.

Again, we include both MemX-Linux and virtual disk as base cases to illustrate the overhead imposed by Xen virtualization and the gain over the virtualized disk, respectively.

Figure 4.9 shows the performance of sorting increasingly large arrays of integers, using an in-house C implementation of the classic static-partitioning quicksort algorithm.

We stopped using the STL version because of its inability to provide more detailed runtime information about the progress of the sort. We record the execution times of the sort for each of the three cases mentioned above. We also include an "extreme" base case plot for local memory using one of the vanilla-Linux 4 GB nodes, where the sort executes purely in memory. As the figure shows, we did not run the disk case beyond 2 GB problem sizes due to the unreasonably large amount of time it takes to complete, potentially days. The sorts using MemX-DD, MemX-DomU, and MemX-Linux, however,

finished within 90 minutes, and the distinction between the different virtualization levels is very small. Table 4.2 lists execution times for some much larger problem sizes with both quicksort and a second large memory application, the same ray-tracing scene used in Chapter 3 [81]. Each row in the table describes an increasingly large problem size, as high as 13 GB. Again, both MemX cases behave similarly, while the disk lags behind. These performance numbers show that MemX provides a highly attractive option for executing large memory workloads in both virtualized and non-virtualized environments.

Furthermore, given the unquantified amount of randomized reads generated by the system's pager (which correlates with the recursive nature of the sort algorithm), the same synchronous-AIO problem described in the previous section also applies here. If a fix is applied, the observed speedups in the figure have the potential to double or triple. But for now, the throughput observed from the system pager remains around 300 to 400 Mbits/sec.

Application         Client Mem   MemX-Linux   MemX-DD   Disk
5 GB Quicksort      512 MB       65 min       93 min    several hours
6 GB Ray-tracer     512 MB       48 min       61 min    several hours
13 GB Ray-tracer    1 GB         93 min       145 min   several hours

Table 4.2: Execution time comparisons for various large memory application workloads.

[Figure 4.10 plot: MemX vs. parallel iSCSI with multiple guests; x-axis: number of VMs (1 to 20); y-axis: sort time (seconds); curves: MemX-DD and iSCSI-DD.]

Figure 4.10: Quicksort execution times for multiple concurrent guest VMs using MemX-DD and iSCSI configurations.

[Figure 4.11 diagram: Domains 1 through 20 hosted above Domain 0 and the Xen hypervisor on a 4 GB machine, connected through a GigE switch (via RMAP or iSCSI) to four machines, each acting as a 4 GB MemX server and an 80 GB iSCSI target.]

Figure 4.11: Our multiple client setup: five identical 4 GB dual-core machines, where one houses 20 Xen guests and the others serve as either MemX servers or iSCSI servers.

4.4.3 Multiple Client VMs

In this section, we evaluate the overhead of executing multiple client VMs using the MemX-DD combination. In a real data center, an iSCSI or FibreChannel network would be set up to provide backend storage for guest virtual machines. To duplicate this base case in our cluster, we use five of our dual-core 4GB memory machines to compare MemX-DD against a 4-disk parallel iSCSI setup, illustrated in Figure 4.11. For the iSCSI target software, we used the open source IET project [93], and we used the open-iscsi.org initiator software within Dom0, which acts as the driver domain for all the Xen guests. Our setup involves using one of the five machines to execute up to twenty concurrently running 100MB Xen guests. Within each guest, we run a 400MB quicksort. We vary the number of concurrent guest VMs from 1 to 20, and in each guest we run quicksort to completion. We perform the same experiment for both MemX-DD and iSCSI. Figure 4.10 shows the results of this experiment. At its highest point (about 10 GB of collective memory and 20 concurrent virtual machines), the execution time with MemX-DD is about 5 times smaller than with the iSCSI setup. Recall that we are using four remote iSCSI disks, and one can observe a stair-step behavior in the iSCSI curve where the level of parallelism wraps around at 4, 8, 12, and 16 virtual machines. Even with concurrent disks and competing virtual machine CPU activity, MemX-DD provides clear benefits in providing low-latency I/O among multiple concurrent Xen virtual machines.

Inter-VM Congestion: In Section 4.3.6, we described the phenomenon of inter-VM congestion, which arises due to the absence of explicit congestion control across multiple guests within a Xen node. Here we discuss how inter-VM congestion is handled in the different

MemX configurations.

1. MemX-Dom0 and MemX-Linux: Inter-VM congestion does not arise in the base cases of MemX-Dom0 and MemX-Linux because the only users of the client module are local application processes. These processes, controlled by a static send window, use semaphores and wait queues to put competing processes on the OS's blocked list when the client's send window is full. So there is no competition among multiple virtual machines, only between competing processes.

2. MemX-DD: Inter-VM congestion in MemX-DD is handled indirectly by Xen itself. Xen schedules block I/O backend requests in a strictly round-robin fashion. Since MemX is the destination of requests from the backend, Xen will "stop" the delivery of requests to MemX when there is a full queue (of some fixed size). This stop is performed by placing the dependent guest VMs in a blocked state, in the same way that multi-programmed processes are blocked when waiting for I/O.

3. MemX-DomU: For MemX-DomU, recall that inter-VM congestion arises from multiple network front-end drivers rather than competing block front-ends. Xen handles this type of contention by using credit-based scheduling, where each front-end is allocated a bandwidth share of the form x bytes every y microseconds. VMs that use up their credit are blocked (a rough sketch of this kind of credit-based limiting follows below).

This leaves us to handle only the network contention at the switch and server level, which we plan to address as our testbed scales to more nodes.
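As referenced in item 3 above, the following is a rough sketch of credit-based rate limiting in the spirit of Xen's network backend scheduling; the structure and function names are illustrative, not Xen's actual code.

    /* A frontend may send credit_bytes every credit_usec; when credit runs
     * out it is blocked until the next replenish point. */
    #include <stdint.h>
    #include <stdbool.h>

    struct vif_credit {
        uint64_t credit_bytes;     /* x bytes ...              */
        uint64_t credit_usec;      /* ... every y microseconds */
        uint64_t remaining;        /* credit left in this period            */
        uint64_t period_start;     /* timestamp of last replenish (usec)    */
    };

    bool vif_may_send(struct vif_credit *c, uint64_t now_usec, uint64_t pkt_bytes)
    {
        if (now_usec - c->period_start >= c->credit_usec) {
            c->remaining    = c->credit_bytes;   /* replenish at period boundary */
            c->period_start = now_usec;
        }
        if (pkt_bytes > c->remaining)
            return false;                        /* out of credit: block the VM */
        c->remaining -= pkt_bytes;
        return true;
    }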

4.4.4 Live VM Migration

The MemX-DomU configuration has a significant benefit when it comes to migrating live Xen VMs [27] at runtime, even though it has lower throughput and higher I/O latency than MemX-DD. Specifically, a VM using MemX-DomU for fast I/O

to distributed memory can be seamlessly migrated from one physical machine to another,

without disrupting the execution of any large dataset applications within the VM. There

are two specific reasons for this benefit. First, since MemX-DomU is designed as a self-contained pluggable module within the guest OS, any page-to-server mapping information is migrated along with the kernel state of the guest OS without leaving any residual dependencies behind in the original machine. The second reason is that RMAP, which is used for communicating read-write requests to distributed memory, is designed to be reliable. As

the VM carries its link layer MAC address with it during the migration process, any in-flight packets dropped during migration are safely retransmitted to the VM's

new location, thereby enabling any large memory application to continue execution without

disruption. What makes the MemX-DomU case interesting is that administrators of virtual

hosting centers can exploit live-migration features by seamlessly transferring guest VMs to

other physical machines at will to better utilize resources. Our work in Chapter 5

focuses exclusively on the optimization of virtual machine migration and will elaborate on

this in more detail.

4.5 Summary

The state of the art in virtual machine technology does not adequately address the needs of

large memory workloads that are increasingly common in modern data centers and virtual

hosting platforms. Such application workloads quickly become throttled by the disk I/O bottleneck in a virtualized environment, where the I/O subsystem includes an additional level

of indirection. In this Chapter, we presented the design, implementation, and evaluation

of the MemX system in the Xen environment that enables memory and I/O-constrained

VMs to transparently utilize the collective pool of memory within a cluster for low-latency

I/O operations. Large dataset applications using MemX do not require any specialized

APIs, libraries, or any other modifications. MemX can operate as a kernel module within non-virtualized Linux (MemX-Linux), an individual VM (MemX-DomU), or a driver domain

(MemX-DD). The latter option permits multiple VMs within a single physical machine to multiplex their memory requirements over a common distributed memory pool. Performance evaluations using our MemX prototype show that I/O latencies are reduced by an order of magnitude and that large memory applications speed up significantly when compared against virtualized disk. As an extra benefit, live Xen VMs executing large memory applications over MemX-DomU can be migrated without disrupting applications. Our future work includes the capability to provide per-VM reservations over the cluster-wide memory, developing mechanisms to control inter-VM congestion, and enabling seamless migration of VMs in the driver domain mode of operation.

Chapter 5

Post-Copy: Live Virtual Machine Migration

In this Chapter, we present the design, implementation, and evaluation of the post-copy based approach for the live migration of virtual machines (VMs) across a gigabit LAN. Live migration is a mandatory feature of modern hypervisors today. It facilitates server consolidation, system maintenance, and lower power consumption. Post-copy [53] refers to the deferral of the memory "copy" phase of live migration until after the VM's CPU state has been migrated to the target node. This is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations followed by the transfer of CPU execution state. The post-copy strategy provides a "win-win" by approaching the baseline total migration time achieved with the stop-and-copy approach, while maintaining the liveness and low downtime benefits of the pre-copy approach. We facilitate the use of post-copy with a specific instance of adaptive prepaging (also known as adaptive distributed paging). Pre-paging eliminates all duplicate page transmissions and quickly removes any residual dependencies for the migrating VM from the source node. Our pre-paging algorithm is able to reduce the number of page faults across the network to 17% of the VM's working set. Finally, we enhance both the original pre-copy and post-copy schemes with the use of a dynamic, periodic self-ballooning (DSB) strategy, which prevents the migration daemon from transmitting unnecessary free pages in the guest OS. DSB significantly

speeds up both migration schemes with negligible performance degradation to the processes running within the VM. We implement the post-copy approach in the Xen

VM environment and show that it significantly reduces the total migration time and network overheads across a range of VM workloads when compared against the traditional pre-copy approach.

5.1 Introduction

This Chapter addresses the problem of optimizing the live migration of system virtual machines (VMs). Live migration is a key selling point for state-of-the-art virtualization technologies. It allows administrators to consolidate system load, perform maintenance, and

flexibly reallocate cluster-wide resources on-the-fly. We focus on VM migration within a cluster environment where physical nodes are interconnected via a high-speed LAN and also employ a network-accessible storage system (such as a SAN or NAS). State-of-the-art live migration techniques [73, 27] use the pre-copy approach, where the bulk of the

VM’s memory state is migrated even as the VM continues to execute at the source node.

Once the “working set” has been identified through a number of iterative copy rounds, the

VM is suspended and its CPU execution state plus remaining dirty pages are transferred to the target host. The overriding goal of the pre-copy approach is to keep the service downtime to a bare minimum by minimizing the amount of VM state that needs to be transferred during the downtime.

We seek to demonstrate the benefits of another strategy for live VM migration, called post-copy, that was previously applied only in the context of process migration in the late

1990s, and to address the issues involved in applying it at the whole operating system level as well. We believe that modern hypervisors provide the means to employ alternative approaches without much additional complexity. At a high level, post-copy refers to the deferral of the memory "copy" phase of live migration until the virtual machine's CPU state has already been migrated to the target node. This enables the migration daemon to try different methods by which to perform the memory copy. Post-copy works by transferring a minimal amount of CPU execution state to the target node, starting the VM at the target, and then proceeding to actively push memory pages from the source to the target. This active push component, also known as pre-paging, distinguishes the post-copy approach from both pre-copy and the demand-paging approach, in which the source node would passively wait for the memory pages to be faulted in by the target node across the network.

Pre-paging is a broad term used in earlier literature [76, 94] in the context of optimizing memory-constrained disk-based paging systems, and refers to a more proactive form of page prefetching from disk. By intelligently sequencing the set of actively prefetched memory pages, the memory subsystem (or even a cache) can hide the latency of high-locality page faults or cache misses from live applications, while continuing to retrieve the rest of the address space out-of-band until the entire address space has been transferred. Modern memory subsystems do not typically employ pre-paging anymore due to the increasingly large DRAM capacities in commodity systems. However, pre-paging can play a significant role in the context of live VM migration, which involves the transfer of an entire physical address space across the network.

We design and implement a post-copy based technique for live VM migration in the

Xen VM environment. Through extensive evaluations, we demonstrate how post-copy can improve live migration performance across each of the following metrics: pages transferred, total migration time, downtime, application degradation, network bandwidth, and identification of the working set. The traditional pre-copy approach does particularly well in minimizing two metrics, application downtime and degradation, when the VM is executing a largely read-intensive workload. These two metrics are important in preserving system uptime as well as the interactive user experience. However, all the above metrics can be impacted adversely when pre-copy is confronted with even moderately write-intensive VM workloads during migration. Post-copy not only maintains VM liveness and application performance during migration, but also improves upon the other performance metrics listed above.

The two key ideas behind an effective post-copy strategy are: (a) transmitting each page across the network no more than once, in other words, avoiding the potentially non-converging iterative copying rounds in pre-copy, and (b) an adaptive pre-paging strategy

that hides the latency of fetching most pages across the network by actively pushing pages from the source before the page is faulted in at the target node, and by adapting the sequence of pushed pages using any network page-faults as hints. We show that our post-copy implementation is capable of minimizing network-bound page faults to 17% of the working set.

Additionally, we identified deficiencies in both the pre-copy and post-copy schemes with regard to the transfer of free pages in the guest VM during migration. We improved both migration schemes to avoid transmitting free pages through the use of a Dynamic Self-

Ballooning (DSB) technique in which the guest actively balloons down its memory footprint without human intervention. DSB significantly speeds up the total migration time, normalizes both approaches, and is capable of frequent ballooning, with intervals as small as 5 seconds, without adversely affecting live applications.

Both the Xen and VMware hypervisors have demonstrated that live migration itself is an essential tool. The original pre-copy algorithm does have other advantages: it employs a relatively self-contained implementation that allows the migration daemon to isolate most of the copying complexity to a single process at each node. Additionally, pre-copy provides a clean method of aborting the migration should the target node ever crash during migration, because the VM is still running at the source and not the target host (whether or not this benefit is made obvious in current virtualization technologies). Although our current post-copy implementation does not handle target node failure, we will discuss a straightforward approach in Section 5.2.5 by which post-copy can provide the same level of reliability as pre-copy. Our contribution is to demonstrate a complete way in which, with a little more help from the migration system, one can preserve the liveness and downtime benefits of pre-copy while also breaking from the non-deterministic convergence phase inherent in pre-copy, ensuring that each page of VM memory is transferred over the network at most once.

5.2 Design

We begin with a brief discussion of the performance goals of VM migration. Afterwards, we present our design of post-copy and how it improves upon those goals.

5.2.1 Pre-Copy

For a more in-depth performance summary of pre-copy migration, we refer the reader to

[27] and [73]. For completeness, pre-copy migration works as follows. Pre-copy is an eager strategy in which memory pages are actively pushed to the target machine while the migrating VM continues to run at the source machine. Pages dirtied at the source that have already been transferred to the target are re-sent through several iterations until the number of dirtied pages falls below a fixed threshold. (Note that this threshold is not dynamic; although one could imagine modern hypervisors designing a dynamic threshold, neither the vendors nor the literature have attempted to do so.) Furthermore, in all known implementations, if the threshold is never reached, an empirical "cap" on the total number of iterations is chosen (currently set to 30) by the migration implementer.

Without this cap, it is possible that pre-copy may never converge at all. After the iterations complete, the VM is then suspended and its state is transferred to the target machine, where it is restarted. This transfer of VM state is accompanied by a final flush of the remaining address space modified at the source host. The VM is then resumed at the target and the source VM copy is destroyed.
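The iterative structure just described can be summarized by the following sketch; the vm/host types, the helper functions, and the dirty_threshold parameter are placeholders standing in for the hypervisor's migration interfaces, and the cap of 30 iterations is the empirical value mentioned above.

    /* Simplified control flow of iterative pre-copy migration. */
    #include <stddef.h>

    struct vm;
    struct host;

    extern void   send_all_pages(struct vm *vm, struct host *target);
    extern size_t count_dirtied_pages(struct vm *vm);
    extern void   resend_dirtied_pages(struct vm *vm, struct host *target);
    extern void   suspend_vm(struct vm *vm);
    extern void   send_dirty_pages_and_cpu_state(struct vm *vm, struct host *target);
    extern void   resume_vm_at(struct host *target);
    extern void   destroy_vm(struct vm *vm);

    #define MAX_ITERATIONS 30   /* the empirical cap mentioned in the text */

    void precopy_migrate(struct vm *vm, struct host *target, size_t dirty_threshold)
    {
        send_all_pages(vm, target);              /* round 0: full copy while the VM runs */

        for (int i = 1; i < MAX_ITERATIONS; i++) {
            if (count_dirtied_pages(vm) <= dirty_threshold)
                break;                           /* few enough to send during downtime */
            resend_dirtied_pages(vm, target);    /* VM keeps running and dirtying pages */
        }

        suspend_vm(vm);                          /* downtime begins */
        send_dirty_pages_and_cpu_state(vm, target);
        resume_vm_at(target);                    /* downtime ends at the target */
        destroy_vm(vm);                          /* source copy discarded */
    }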

Pre-copy migration involves the following seven performance goals:

1. Transparency: The pre-copy scheme can work transparently in both fully-virtualized

and para-virtualized environments. Any new migration scheme must maintain that

ability without requiring any application changes.

2. Preparation Time: Any required CPU or network activity within either the migrating

guest VM or the maintenance VM contributes to preparation time. This includes most

of the memory copying during pre-copy rounds. There is no guarantee that this time

ever converges to a stopping round. In fact, later we show that even with mildly active

VMs, these rounds never converge.

3. Down Time: This time represents how long the migrating VM is stopped, during

which no execution progress is made. Pre-copy uses this time for dirty memory

transfer. Minimizing downtime is pre-copy's primary goal.

4. Resume Time: Any remaining cleanup required by the maintenance VM at the target

host goes into this time period. Although pre-copy has nothing to do besides re-scheduling the migrating VM, the majority of our post-copy design operates primarily

in this period. After this period is complete, regardless of which migration algorithm

is used, all dependencies on the source VM must be eliminated.

5. Pages Transferred: This performance goal consists of a total count of the number

of transferred memory pages across all of the above time periods. For pre-copy this

is dominated by preparation time.

6. Total Migration Time: For pre-copy, the total time required to complete the migration

is dominated by the preparation time. Total migration time is important because it

affects the release of resources on both sides within the individual host as well as

within the VMs on both hosts. Until completion of migration, the unused memory at

the source cannot yet be freed, and both maintenance VMs will continue to consume

network bandwidth and CPU cycles.

7. Application Degradation: This refers to the extent of slowdown experienced by

application workloads executing within the VM due to the migration event. The slowdown occurs primarily due to CPU time taken away from normal applications to carry

out the migration. Additionally, the pre-copy approach needs to track dirtied pages

across successive iterations by trapping write accesses to each page, which significantly slows down write-intensive workloads. In the case of post-copy, access to

memory pages not yet present at the target results in network page faults, potentially

slowing down the VM workloads.

One of this Chapter’s contributions is to reduce the number of pages transferred com-

pared to pre-copy: the wasteful transfer of pages that may never be used at the target

machine is likely to occur. If the threshold of the number of dirty pages chosen to termi-

nate the pre-copy phase is too small, then pre-copy may never converge and terminate. On

the other hand, if the number of pages transferred during final iteration is large, significant

downtime can result. Given that the number of pages transferred directly impacts all other CHAPTER 5. POST-COPY: LIVE VIRTUAL MACHINE MIGRATION 73 metrics, our post-copy method aims to reduce this metric.

5.2.2 Design of Post-Copy Live VM Migration

Post-copy is a strategy in which the migrating virtual machine is first suspended at the source, a minimal execution state is copied over to the target, where the virtual machine is restarted, and then the memory pages that are referenced are faulted over the network from the source. VM execution experiences a delay during this period of faults, and that delay depends on the characteristics of the network connection and how fast the source machine can serve each request. As a result, this method incurs considerable resume time. Additionally, leaving any long-term residual dependencies on the source host is not acceptable. Thus, post-copy is not useful unless two additional goals are met:

1. Post-copy must effectively anticipate page-faults from the target and allow VM execution to move forward, while hiding the latency of page-faults.

2. Post-copy must flush the remaining clean pages from the source out-of-band while

the VM is simultaneously faulting, so that no residual dependency remains on the

source.

Note that both migration schemes must be normalized with respect to the unused /

free pages within the guest VM. This must be done such that any improvement is realized

only by the treatment of pages that actually contributed to the guest VM’s working set. We

will discuss this solution momentarily. The post-copy algorithm can actually be designed in

multiple ways, each of which provides an incrementally better improvement on the previous

method across all the aforementioned performance goals. Table 5.1 illustrates how each of

these ways slightly increases in complexity from the previous one during a certain phase

of the migration, with the common goal of improving the bottom line. Method 1, the current pre-copy form of migration, heads the table.

Method 2: Post-Copy via Demand Paging: The demand paging variant of post-copy is the simplest and slowest option. Once the VM resumes at the target, its memory accesses result in page faults that can be serviced by requesting the referenced page

   Method                    Preparation                  Downtime              Resume
1  Pre-copy Only             Multiple iterative           Send dirty memory     CPU state transfer
                             memory transfers
2  Demand Paging             Pre-suspend time (if any)    CPU state transfer    Page-faults only
3  Basic Post-copy           Pre-suspend time (if any)    CPU state transfer    Flushing + page-faults
4  Pre-paging + Post-copy    Pre-suspend time (if any)    CPU state transfer    Bubbling + page-faults
5  Hybrid Pre + Post         Single pre-copy round        CPU state transfer    Bubbling + page-faults

Table 5.1: Migration algorithm design choices in order of their incremental improvements. Method #4 combines #2 and #3 with the use of pre-paging. Method #5 combines all of #1 through #4, in which pre-copy is used for only a single, priming iterative round.

 1. let N := total # of guest VM pages
 2. let page[N] := set of all guest VM pages
 3. let bitmap[N] := all zeroes
 4. let pivot := 0; bubble := 0

 5. ActivePush (Guest VM)
 6.     while bubble < max (pivot, N-pivot) do
 7.         let left := max(0, pivot - bubble)
 8.         let right := min(MAX_PAGE_NUM-1, pivot + bubble)
 9.         if bitmap[left] == 0 then
10.             set bitmap[left] := 1
11.             queue page[left] for transmission
12.         if bitmap[right] == 0 then
13.             set bitmap[right] := 1
14.             queue page[right] for transmission
15.         bubble++

16. PageFault (Guest-page X)
17.     if bitmap[X] == 0 then
18.         set bitmap[X] := 1
19.         transmit page[X] immediately
20.     discard pending queue
21.     set pivot := X   // shift pre-paging pivot
22.     set bubble := 1  // new pre-paging window

Figure 5.1: Pseudo-code for the pre-paging algorithm employed by post-copy migration. Synchronization and locking code omitted for clarity of presentation.

over the network from the source node. However, servicing each fault will significantly slow down the VM due to the network's round trip latency. Consequently, even though each page is transferred only once, this approach considerably lengthens the resume time and leaves long-term residual dependencies in the form of un-fetched pages, possibly for an indeterminate duration. Thus, post-copy performance for this variant by itself would be unacceptable from the viewpoint of total migration time and application degradation.

Method 3: Post-Copy via Active Pushing: One way to reduce the duration of residual dependencies on the source node is to proactively “push” the VM’s pages from the source to the target even as the VM continues executing at the target. Any major faults incurred by the VM can be serviced concurrently over the network via demand paging. Active push avoids transferring pages that have already been faulted in by the target VM. Thus, each page is transferred only once, either by demand paging or by an active push.

Method 4: Post-Copy via Prepaging: The goal of post-copy via prepaging is to anticipate the occurrence of major faults in advance and adapt the page pushing sequence to better reflect the VM's memory access pattern. While it is impossible to predict the VM's exact faulting behavior, our approach works by using the faulting addresses as hints to estimate the spatial locality of the VM's memory access pattern. The prepaging component then shifts the transmission window of the pages to be pushed such that the current page fault location falls within the window. This increases the probability that pushed pages would be the ones accessed by the VM in the near future, reducing the number of major faults. Various prepaging strategies are described in Section 5.2.3.

Method 5: Hybrid Live Migration: The hybrid approach was first described in [74] for process migration. It works by doing a single pre-copy round in the preparation phase of the migration. During this time, the VM continues running at the source while all its memory pages are copied to the target host. After just one iteration, the VM is suspended and its processor state and dirty non-pageable pages are copied to the target. Subsequently, the VM is resumed at the target and post-copy as described above kicks in, pushing in the remaining dirty pages from the source. As with pre-copy, this scheme can perform well for read-intensive workloads. Yet it also provides deterministic total migration time for write-intensive workloads, as with post-copy. This hybrid approach is currently being

[Figure 5.2 diagram: (a) bubbling with a single pivot, showing backward and forward bubble edges expanding from the pivot between 0 and MAX; (b) bubbling with multiple pivots P1, P2, P3 in a pivot array, where adjacent bubble edges stop when they meet.]

Figure 5.2: Prepaging strategies: (a) Bubbling with a single pivot and (b) bubbling with multiple pivots. Each pivot represents the location of a network fault on the in-memory pseudo-paging device. Pages around the pivot are actively pushed to the target.

implemented and not covered within the scope of this chapter. The rest of this chapter describes the design and implementation of post-copy via prepaging.

5.2.3 Prepaging Strategy

Prepaging refers to actively pushing the VM’s pages from the source to the target. The goal is to make pages available at the target before they are faulted on by the running VM.

The effectiveness of prepaging is measured by the percentage of VM’s page faults at the

target that require an explicit page request to be sent over the network to the source node

– also called network page faults. The smaller the percentage of network page faults, the

better the prepaging algorithm. The challenge in designing an effective prepaging strategy

is to accurately predict the pages that might be accessed by the VM in the near future, and

to push those pages before the VM faults upon them. Below we describe different design

options for prepaging strategies.

(A) Bubbling with a Single Pivot:

Figure 5.1 lists the pseudo-code for the two components of bubbling with a single pivot –

active push (lines 5–15), which executes in a kernel thread, and page fault servicing (lines

16–21), which executes in the interrupt context whenever a page-fault occurs. Figure 5.2(a)

illustrates this algorithm graphically. The VM's pages at the source are kept in an in-memory pseudo-paging device, which is similar to a traditional swap device except that it resides completely in memory (see Section 5.3 for details). The active push component starts from a pivot page in the pseudo-paging device and transmits symmetrically located pages around that pivot in each iteration. We refer to this algorithm as "bubbling" since it is akin to a bubble that grows around the pivot as its center. Even if one edge of the bubble reaches the boundary of the pseudo-paging device (0 or MAX), the other edge continues expanding in the opposite direction. To start with, the pivot is initialized to the first page in the in-memory pseudo-paging device, which means that initially the bubble expands only in the forward direction. Subsequently, whenever a network page fault occurs, the fault servicing component shifts the pivot to the location of the new fault and starts a new bubble around this new location. In this manner, the location of the pivot adapts to new network faults in order to exploit the spatial locality of reference. Pages that have already been transmitted (as recorded in a bitmap) are skipped over by the edge of the bubble.

Network faults that arrive at the source for a page that is in flight (or has just been pushed) to the target are ignored to avoid duplicate page transmissions.

(B) Bubbling with Multiple Pivots: Consider the situation where a VM has multiple processes executing concurrently. Here, a newly migrated VM would fault on pages at multiple locations in the pseudo-paging device. Consequently, a single pivot would be insufficient to capture the locality of reference across multiple processes in the VM. To address this situation, we extend the bubbling algorithm described above to operate on multiple pivots. Figure 5.2(b) illustrates this algorithm graphically. The algorithm is similar to the one outlined in Figure 5.1, except that the active push component pushes pages from multiple "bubbles" concurrently. (We omit the pseudo-code for space constraints, since it is a straightforward extension of the single-pivot case.)

Each bubble expands around an independent pivot. Whenever a new network fault occurs, the faulting location is recorded as one more pivot and a new bubble is started around that location. To save on unnecessary page transmissions, if the edge of a bubble comes across a page that is already transmitted, that edge stops progressing in the cor- responding direction. For example, the edges between bubbles around pivots P2 and P3 stop progressing when they meet, although the opposite edges continue making progress. CHAPTER 5. POST-COPY: LIVE VIRTUAL MACHINE MIGRATION 78

In practice, it is sufficient to limit the number of concurrent bubbles to those around the k most recent pivots. When a new network fault arrives, we replace the oldest pivot in a pivot array with the new network fault location. For the workloads tested in our experiments in

Section 5.4, we found that around k = 7 pivots provided the best performance.
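Since the pseudo-code for this extension is omitted above, the following is a hedged sketch of one possible realization that keeps the k = 7 most recent fault locations as pivots. The bitmap and queue_page() helper are assumed to be the same bookkeeping as in Figure 5.1, and the edge-stopping and sticky-pivot refinements described next are left out for brevity.

    /* Simplified multi-pivot bubbling: one bubble per recent fault pivot. */
    #define MAX_PIVOTS 7

    extern void queue_page(long pfn);   /* assumed transmit-queue helper */

    struct bubble {
        long pivot;      /* fault location this bubble grows around  */
        long radius;     /* current distance of the edges from pivot */
        int  active;
    };

    static struct bubble bubbles[MAX_PIVOTS];
    static int next_slot;   /* oldest pivot is replaced round-robin */

    /* Called on every network page fault at page number X. */
    void record_fault(long X)
    {
        bubbles[next_slot] = (struct bubble){ .pivot = X, .radius = 1, .active = 1 };
        next_slot = (next_slot + 1) % MAX_PIVOTS;
    }

    /* One pass of the active-push thread: grow each bubble by one page on
     * each side, queueing pages not yet transmitted (per the bitmap). */
    void push_one_round(unsigned char *bitmap, long npages)
    {
        for (int i = 0; i < MAX_PIVOTS; i++) {
            struct bubble *b = &bubbles[i];
            if (!b->active)
                continue;
            long left  = b->pivot - b->radius;
            long right = b->pivot + b->radius;
            if (left >= 0 && !bitmap[left])   { bitmap[left]  = 1; queue_page(left);  }
            if (right < npages && !bitmap[right]) { bitmap[right] = 1; queue_page(right); }
            b->radius++;
        }
    }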

(C) Direction of Bubble Expansion: We also wanted to examine whether the pattern in which the source node pushes the pages located around the pivot made a significant difference in performance. In other words, is it better to expand the bubble around a pivot in both directions, or only the forward direction, or only the backward direction? To examine this we included an option of turning off the bubble expansion in either the forward or the backward direction. Our results, detailed in Section 5.4.4, indicate that forward bubble expansion is essential, dual (bi-directional) bubble expansion performs slightly better in most cases, and backwards-only bubble expansion is counter-productive.

When expanding bubbles with multiple pivots in only a single direction (forward-only or backward-only), there is a possibility that the entire active push component could stall before transmitting all pages in the pseudo-paging device. This happens when all active bubble edges encounter already sent-pages at their edges and stop progressing. (A simple thought exercise can show that stalling of active push is not a problem for dual-direction multi-pivot bubbling.) While there are multiple ways to solve this problem, we chose a simple approach of designating the initial pivot (at the first page in pseudo-paging device) as a sticky pivot. Unlike other pivots, this sticky pivot is never replaced by another pivot.

Further, the bubble around the sticky pivot does not stall when it encounters an already transmitted page; rather, it skips such a page and keeps progressing, ensuring that the active push component never stalls.

5.2.4 Dynamic Self-Ballooning

The Free Memory Problem. As we touched on earlier, there can be an arbitrarily large number of free pages within the guest VM before migration begins, or there may be few or none. Either way, it is wasteful to send free pages, regardless of which migration algorithm is used. If we do not eliminate as many of these pages as possible from being migrated in the pre-copy algorithm, then we cannot properly compare it to post-copy. This is because there would be no way of distinguishing clean pages from free pages during each pre-copy iteration: if a clean page is freed, there is no way for the migration process to detect this. We observe that there are two ways to solve this problem.

For post-copy, this turns out to be quite easy: Method 5 in Table 5.1, the hybrid method, combines pre-copy with post-copy by doing a single pre-copy round in the preparation phase of the migration. This allows the guest VM to continue running at the source while its free pages and clean pages are copied to the target host. Subsequently, the post-copy process kicks in immediately after downtime; there is no memory transfer during downtime, and post-copy operates just as we described. The second way to solve the free memory problem is through the use of ballooning.

The hybrid scheme was first used in the literature in [74]. But since we are dealing with whole-system VM migration, this presents a problem for a performance comparison against stand-alone pre-copy migration: the hybrid scheme does not eliminate the transmission of free pages. Without eliminating them, we cannot determine the effectiveness of post-copy with respect to how well pre-paging keeps VM execution moving forward by hiding page-fault latency from the migrating guest VM. We cannot evaluate that effectiveness for two reasons. First, if a free page is transmitted (which is highly probable), it consumes bandwidth that might otherwise have been used both by pre-paging and by the iterative rounds used in pre-copy. Second, during pre-paging, if a free page is allocated by the guest VM and subsequently causes a page-fault (as the result of a copy-on-write by the virtual memory system), this will cause additional delay on the

VM at the target when there need not have been any. Therefore, we cannot do a performance analysis of post-copy without eliminating the transmission of those empty page frames.

Ballooning is the act of changing the view of physical memory (and pseudo-physical memory) such that the guest VM has a larger or smaller amount of allocatable memory than it had before. In current virtualization systems, this is typically used only at guest VM boot time, when the VM is first created and initialized. If the maintenance VM cannot "reserve" enough memory for the new guest (henceforth referred to as a reservation), it steals some from the other VMs on the host by enlarging a kind of balloon in those VMs and giving the reclaimed memory to the new one. This is done by giving the existing VMs a "target" reservation and waiting for them to release enough pages from their own reservations to satisfy that smaller target. The system administrator can re-enlarge those diminished reservations at a later time should more memory become available. This might happen as the result of either shutting down or even migration itself. What we have implemented is a way for the migrating guest VM to perform this ballooning continuously by itself, called Dynamic

Self-Ballooning (DSB). The way to make this effective for migration is two-fold: First, we

must choose an appropriate interval between consecutive DSB attempts such that the

CPU time consumed by the DSB process does not interfere with the applications running

within the VM. Second, the DSB process must ensure that it can allow the balloon to

shrink. When one or more memory-intensive applications begin to run and perform copy-

on-writes within the guest VM, there must be a way for the DSB process to detect this and

respond to it by releasing free pages from the balloon so that the applications can use

them. We’ve devised a way to do this in the next couple of sections and have chosen an

interval of about 5 seconds through some performance experiments and determined that

application performance is not adversely affected. During pre-copy migration only, DSB is

used continuously. On the other hand, post-copy only performs DSB once right before the

beginning of the downtime phase. After resume, it is disabled and the rest of post-copy

proceeds as described.
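A hedged sketch of the kind of self-ballooning loop this implies is shown below. balloon_set_target(), guest_total_pages(), and guest_free_pages() are placeholder names for the guest's balloon interface and memory statistics, not the functions used in our implementation; the 5-second interval follows the discussion above.

    /* Sketch of a dynamic self-ballooning kernel thread inside the guest. */
    #include <linux/kthread.h>
    #include <linux/delay.h>

    #define DSB_INTERVAL_MS    5000
    #define DSB_HEADROOM_PAGES 4096    /* keep ~16 MB of 4KB pages free for COW bursts */

    extern unsigned long guest_total_pages(void);
    extern unsigned long guest_free_pages(void);
    extern void balloon_set_target(unsigned long pages);

    static int dsb_thread(void *unused)
    {
        while (!kthread_should_stop()) {
            unsigned long in_use = guest_total_pages() - guest_free_pages();

            /* Give back everything except what is in use plus some headroom.
             * If applications start allocating, free pages shrink and the next
             * pass raises the target, letting the balloon deflate. */
            balloon_set_target(in_use + DSB_HEADROOM_PAGES);

            msleep(DSB_INTERVAL_MS);
        }
        return 0;
    }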

5.2.5 Reliability

As we touched on in the introduction, post-copy has a drawback with respect to the reliability of the target node. Either the source or destination node can fail in the middle of

VM migration. In both pre-copy and post-copy migration, failure of the source node implies permanent loss of the VM itself. Failure of the destination node has different implications in

the two cases. For pre-copy, failure of the destination node does not matter because the

source node still holds an entire up-to-date copy of the VM’s memory and CPU execution

state and the VM can be revived if necessary from this copy. However, with post-copy, the

VM begins execution at the target node as soon as minimal CPU execution state is transferred, which implies that the destination node has the more up-to-date version of the VM state while the copy at the source is stale, except for pages not yet modified at the destination. Thus, failure of the destination node constitutes a critical failure of the VM during post-copy migration.

We plan to address this problem by developing mechanisms to incrementally checkpoint the VM state from the destination node back at the source node, an approach taken by the Xen-based Remus system [18]. Based on their results, we believe that the increased network overhead of doing this is negligible, but a thorough evaluation would first be required. One approach is as follows: while the active push of pages is in progress from the source node to the destination, we also propagate incremental changes to memory pages and execution state in the VM at the destination back to the source node. We do not need to propagate the changes from the destination on a continuous basis, but only at discrete points such as when interacting with a remote client over the network, or committing an I/O operation to storage. This mechanism can provide a consistent backup image at the source node that we can fall back on in case the destination node fails in the middle of post-copy migration, although at the expense of some increase in reverse network traffic.

Further, once the migration is over, the backup state at the source node can be discarded safely. The performance of this mechanism would depend upon the additional overhead imposed by reverse network traffic from the destination to the source. In a different context, similar incremental checkpointing mechanisms have been used to provide high availability in the Remus project [18].

5.2.6 Summary

We have described Post-Copy and addressed four problems that are important for the improved migration of system virtual machines. To reduce the total number of pages transferred, we combine the following approaches: demand paging, flushing, pre-paging through what we call “bubbling”, and dynamic self-ballooning (DSB), all working together at the same time.

Figure 5.3: Pseudo-Swapping (item 3): As pages are swapped out within the source guest itself, their MFN identifiers are exchanged and Domain 0 memory maps those frames with the help of the hypervisor. The rest of post-copy then takes over after downtime.

Demand paging ensures that we eliminate the non-deterministic copying iterations involved in pre-copy. Flushing ensures that no residual dependencies are left on the source host.

Bubbling helps minimize the number of page faults as well as the length of time spent in the resume phase. Self-ballooning allows us to normalize the two migration schemes for comparison by eliminating the transmission of free pages. Note that we do not implement the Hybrid scheme that we mentioned earlier as it does not directly contribute to the comparison of the two schemes, but would nonetheless still significantly improve the treatment of clean pages during post-copy migration. We leave that to future work.

5.3 Post-Copy Implementation

We’ve implemented post-copy on top of Xen 3.2.1 along with all of the optimizations introduced in Section 5.2. We use the para-virtualized version of Linux 2.6.18.8 as our base.

We begin by discussing the different ways of trapping page-faults within the Xen / Linux architecture and their trade-offs. Then we will discuss our implementation of dynamic self-ballooning.

5.3.1 Page-Fault Detection

The working set of the Virtual Machine can (and will) span multiple user applications and in-kernel data structures. We propose three different ways by which the demand-paging component of post-copy at the system virtual machine level can trap accesses to the WWS.

These include:

1. Shadow Paging: Through the pre-existing, well designed use of an extra, read-only

set of page tables underneath the VM, shadow paging provides multiple benefits to

virtual machines in modern hypervisors. Support for shadow paging contributes to

the use of both fully-virtualized VMs and para-virtualized VMs as well as the facili-

tation of pre-copy migration by detecting page dirtying. In the post-copy case, each

attempt to access a not-yet-received page at the target would be trapped by shadow paging. The

migration daemon would then use this information to retrieve that page before the

read or write can proceed.

2. Page Tracking: The idea here is to use the downtime phase to mark all of the res-

ident pages in the VM as not present within the corresponding page-table-entries

(PTEs) for each page. This has the effect of forcing a real page-fault exception on

the CPU. The hypervisor would then be responsible for propagating that fault to Do-

main 0 to be fixed up. The migration process would then bring in the page and fixup

the page-table entry back to normal. x86 PTEs currently have 2 or 3 unused bits in

their lower order bits that can be used to track this information for fixup.

3. Pseudo Swapping: This solution preserves the spirit of para-virtualization, but re-

mains transparent to applications. The idea is to take the set of all pageable appli-

cation and page cache memory within the guest VM and make it “suddenly appear”

that it has been swapped out but without the actual cost of doing so - and without

the use of any disks whatsoever. Although this sounds strange, recall that the source

VM is not running during post-copy. Only the target VM is running. So the memory

reservation that the source VM is occupying is essentially acting like a limited swap

device. During resume time, the guest VM itself can be paravirtualized to request

those pages from a sort of pseudo swap device.

In the end, we chose to use Pseudo Swapping because it was the quickest to im-

plement, which is illustrated in Figure 5.3. Initially, we actually started with Page Tracking,

but stopped working on it. We believe that Page Tracking is the fastest, most efficient form

of demand paging at the system VM level. This is because faults are true CPU exceptions.

We started writing this by trapping those exceptions directly within the Hypervisor and then

propagating a new Virtual Interrupt to Domain 0. The major problem with this scheme is

that there exists no way in modern operating systems to detect when a physical page frame

is no longer in use by the operating system. Ideally one could imagine an architecturally

defined bitmap structure that is managed by the OS, not unlike the way a page table is

architecturally defined. This bitmap would allow the hardware to know which page frames

actually contain real bytes and which were free. Once page tracking was initiated, Domain

0 could use this bitmap in combination with the aforementioned page table modifications

to determine whether or not it was still necessary to fixup the PTE at the given time. Page

Tracking is not feasible without this feature. On the other hand, Shadow Paging provides a

clear middle ground. Although it would be slower than Page Tracking (due to the extra level

of PTE propagation) it is more transparent than Pseudo Swapping. For the most part, such

an implementation would remain relatively unchanged except for making a hook available

for trapping into Domain 0. Recently, a version of this type of demand-paging for use in

parallel cloud computing was demonstrated in a tech report [44] based on top of the Xen

hypervisor.
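To make the Page Tracking idea concrete, the following is a minimal sketch of the kind of PTE manipulation it would require. Since we did not complete this design, every identifier here (PTE_TRACKED, track_pte, fixup_pte) is hypothetical and purely illustrative; only the use of the software-available low-order bits of an x86 PTE reflects the description above.

    #include <stdint.h>

    #define PTE_PRESENT  0x1ULL
    #define PTE_TRACKED  0x200ULL  /* bit 9: one of the software-available PTE bits */

    /* Applied once per resident page during the downtime phase: force a real
     * CPU page-fault on the next access while remembering why. */
    static inline uint64_t track_pte(uint64_t pte)
    {
        return (pte & ~PTE_PRESENT) | PTE_TRACKED;
    }

    /* Applied from the fault path once the page contents have arrived. */
    static inline uint64_t fixup_pte(uint64_t pte)
    {
        return (pte & ~PTE_TRACKED) | PTE_PRESENT;
    }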

Our page-fault detection is implemented through the use of two loadable kernel mod-

ules. One sits inside the migrating VM and one sits inside Domain 0. These modules

leverage our prior work called MemX [49], which provides distributed paging support for

both Xen VMs and Linux machines at the kernel level. Once the target is ready to begin

pre-paging in the post-copy algorithm, MemX is invoked to service page faults through

the use of pseudo swapping as described. Figure 5.4 illustrates a high-level overlay of

how both pre-copy and post-copy relate to each other. Recall that in order to use Pseudo

Swapping to implement demand paging, one can only apply this to the set of all pageable memory in the system. Thus, the remaining memory (which is typically made up of small in-kernel caches or pinned pages) must be sent over to the target host during downtime.

Figure 5.4: The intersection of downtime within the two migration schemes. Currently, our downtime consists of sending non-pageable memory (which can be eliminated by employing the use of shadow paging). Pre-copy downtime consists of sending the last round of pages.

The drawback of Pseudo Swapping is that it puts a small lower bound on the achievable downtime experienced by our implementation of Post-Copy, but this is not a fundamental limitation of the post-copy method of migration by any means. In future work, we plan to switch to Shadow Paging as a means to implement the demand paging component of post-copy.

This will eliminate that drawback. Nonetheless, we report the resulting (higher) downtime values later in our performance experiments. These downtimes typically range from 600 ms to a little over one second.

5.3.2 MFN exchanging

Because of our expedient implementation choice, it was necessary to devise a way of making it appear that the set of all pageable memory in the guest VM had been swapped out without actually moving those pages anywhere. This can be accomplished in two ways: we can either transfer the pages out of the guest VM (and into the maintenance VM) or we can alter the location of the physical frame within the VM itself to a new location (with zero copying). We chose the latter because it does not place any extra dependencies on the maintenance VM. We accomplish this by performing what we call an “mfn exchange”. This works by first doubling the memory reservation of the VM, allocating free pages from the new memory, and briefly suspending all of the running processes in the system. We then instruct the kernel to swap out each pageable frame. Each time a used frame is paged, we re-write the hypervisor’s PFN to MFN mapping table (called a “physmap”) and exchange the two physical frames without actually copying them. We also do the same thing for the kernel-level page table entries of both physical frames. This is efficient because we batch the hypercalls necessary to perform these operations within the hypervisor. Once downtime has completed, we restart the applications and wait for page-faults to the pseudo swap device to arrive.
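The sketch below illustrates a single exchange step, assuming the Xen-Linux 2.6.18 paravirtual helpers pfn_to_mfn() and set_phys_to_machine(); the function name exchange_mfns() and the surrounding structure are ours, and the real code batches many such updates (together with the matching kernel page-table rewrites) into a single set of hypercalls.

    /* Sketch of one "mfn exchange": pfn_used backs a pageable frame,
     * pfn_spare comes from the doubled reservation.  Includes, error
     * handling and batching are omitted. */
    static void exchange_mfns(unsigned long pfn_used, unsigned long pfn_spare)
    {
        unsigned long mfn_used  = pfn_to_mfn(pfn_used);
        unsigned long mfn_spare = pfn_to_mfn(pfn_spare);

        /* Swap the two entries in the guest's physmap (P2M table). */
        set_phys_to_machine(pfn_used,  mfn_spare);
        set_phys_to_machine(pfn_spare, mfn_used);

        /* The kernel-level page table entries that map these two frames
         * must be rewritten as well; the real implementation batches these
         * updates into one multicall to the hypervisor. */
    }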

5.3.3 Xen Daemon Modifications

A handful of modifications to the Xen Daemon were made to support page-fault detection. The Xen daemon has the responsibility of initializing the migration and initial memory transfer process, including page tables and CPU state. For our system, the only memory transfer that the daemon is responsible for is the transfer of non-pageable memory. All other pages are ignored until later as usual. Additionally, the set of pages that are eliminated through self-ballooning must also be ignored. By default, however, the Xen Daemon has no way of knowing whether a particular memory page actually belongs to any of those 3 categories (pageable, non-pageable, or ballooned) because of the strict memory reservation policy employed by Xen (as it should be). This presents a problem for Post-Copy: the way non-pageable memory is transferred in our system is implemented by using the same code that runs when the daemon is ready to execute a Pre-Copy iteration in the original system. Thus, to support our system, we patch this code to check a new bitmap data structure that indicates whether or not a particular frame should be sent (rather than just treating all pages as dirty or not dirty as in the original system). This bitmap is populated by the kernel module running inside the guest VM at the source (before downtime begins).
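As a rough illustration of the patch, the loop below shows the shape of the check; skip_bitmap, nr_pfns, and send_page() are illustrative stand-ins rather than the actual symbols in the Xen daemon's save code.

    #include <limits.h>

    /* Stand-in for the daemon's existing transmit path. */
    extern void send_page(unsigned long pfn);

    /* A set bit means the guest marked the frame pageable or ballooned,
     * so it must not be transmitted during downtime. */
    static int frame_is_skipped(const unsigned long *bitmap, unsigned long pfn)
    {
        const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
        return (bitmap[pfn / bits] >> (pfn % bits)) & 1UL;
    }

    static void send_nonpageable_memory(const unsigned long *skip_bitmap,
                                        unsigned long nr_pfns)
    {
        unsigned long pfn;

        for (pfn = 0; pfn < nr_pfns; pfn++) {
            if (frame_is_skipped(skip_bitmap, pfn))
                continue;        /* pageable or ballooned: handled post-resume */
            send_page(pfn);
        }
    }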

The next part is not so obvious upon first examination: the Xen Daemon (the management process running inside the co-located Domain 0 on the same host) needs to be able to read this bitmap from user-space. Thus, we perform a memory-mapping from the kernel space of the guest VM to the user space of the Xen Daemon in the other virtual machine.

Furthermore, in order to perform a successful memory-mapping, the Xen Daemon needs to know in advance the physical frame numbers (the MFNs) of each page frame that physically backs that bitmap data structure. This is required because of the nature of performing a memory mapping: the Xen Daemon’s page tables must be populated with physical mfn numbers, not virtual ones. As a result, we discover these MFNs through an existing mapping: the physical-to-machine (p2m) table, which translates every PFN of a guest (from 0 to max) into a physical frame number (mfn) owned by the guest virtual machine. To complete the memory mapping, the Daemon then only needs to know two pieces of information: the location of the *first* virtual frame number and the total number of frames. Thus, the guest VM needs only to transmit these two values to the Xen Daemon before downtime begins.

We accomplish this by exporting the address (the PFN, specifically) of the (virtually contiguous) first frame of the bitmap inside of the p2m table into the “Xen Store”. The

Xen Store is a messaging abstraction that allows Xen virtual machines to communicate small pieces of information to each other and is organized into a directory structure for each co-located virtual machine on the same host. Recall that we also have a kernel module running inside the management VM acting as the retrieval entity for the whole post-copy process; it is responsible for facilitating pseudo-paging. This module reads the first bitmap frame address from the Xen Store and then communicates that information upwards to the Xen Daemon running inside the same virtual machine. The daemon then performs a memory mapping of this bitmap by grabbing each mfn out of the p2m table, one by one, based on this first frame number. Finally, once the bitmap and the physical frames are mapped, the Daemon can determine which frames should be transmitted to the target host and which ones can be ignored by simply checking the bitmap.
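The following is a minimal sketch of the guest-side export, assuming the 2.6.18-era xenbus_printf() interface; the store path "data/postcopy" and the key names are our own illustrative choices, not fixed parts of the system.

    #include <xen/xenbus.h>

    /* Advertise the bitmap to Domain 0: the PFN of its first frame and its
     * length in frames.  Domain 0 walks the p2m table from there. */
    static int export_bitmap_location(unsigned long first_pfn,
                                      unsigned long nr_frames)
    {
        int err;

        err = xenbus_printf(XBT_NIL, "data/postcopy", "bitmap-first-pfn",
                            "%lu", first_pfn);
        if (err)
            return err;

        return xenbus_printf(XBT_NIL, "data/postcopy", "bitmap-nr-frames",
                             "%lu", nr_frames);
    }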

5.3.4 VM-to-VM kernel-to-kernel Memory-Mapping

During downtime, in addition to the Xen Daemon modifications in the previous section, the third-party module running within Domain 0 that is responsible for transmitting faulted pages to the target host also has the responsibility of memory-mapping the entirety of the guest VM’s memory footprint. This is to avoid copying memory. A problem arises, however, which is similar to the one presented in Section 5.3.3: in order to complete this memory mapping, we must again know the addresses of each page frame owned by the migrating guest VM. This is a much larger task, however, because we are not just exporting a bitmap to another virtual machine (where the total mapped data is one bit per page). Instead, we are memory-mapping 8 bytes per page owned by the guest. Thus, for a common 512 MB guest virtual machine, this means we have a megabyte of data to transmit to the other virtual machine. (512 MB constitutes 128K pages, so 64-bit page frame identifiers would require a megabyte of memory to store all of the physical frame numbers).

The problem with this megabyte is that you cannot simply allocate a contiguous megabyte of memory in kernel space with any certainty. Slab caches and kmalloc are not meant for that. So, this leaves the alloc_pages() family of routines in Linux.

These routines allocate memory in power-of-two orders of 4 KB pages. The largest contiguous order allowed by Linux is 12 (and that is under ideal circumstances). Even a simple

1-MB allocation requires an “order-8” memory allocation. Larger VM memory sizes would thus approach 9 and 10. Under a heavily utilized system it is highly unlikely the Linux buddy system would return success on such requests. This requires us to find another way to send this 1-MB of data to the third party module inside Domain 0: through a second-level memory mapping.

This solution involves constructing a kind of “impromptu” page-table structure. This structure has the same 3-level hierarchy as a regular page table except that it is not architecturally defined; it still places the required mfn data at the leaves of the tree. We create this structure very quickly and pass the root of the table to the third-party module through the use of the Xen Store, as was done in the previous section. During downtime, the receiving module maps each frame of the page table structure itself in a recursive fashion beginning at the root. Once that is complete, it maps all of the page frame mfn numbers stored at the leaves of the table. These leaves collectively store the addresses of only those page frames that can potentially incur page faults. Thus, when a page fault actually occurs at the target host, the module only needs to consult this table and snap up that page to be ready for transmission without any copying whatsoever.
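A sketch of the layout is shown below. Each level occupies exactly one 4 KB frame so that the receiving module can map the tree frame-by-frame starting from the root; the structure names are illustrative, chosen only to mirror a 3-level page table in software.

    #include <stdint.h>

    #define ENTRIES_PER_FRAME (4096 / sizeof(uint64_t))   /* 512 entries */

    struct mfn_leaf   { uint64_t mfn[ENTRIES_PER_FRAME];      }; /* level 3 */
    struct mfn_middle { uint64_t leaf_mfn[ENTRIES_PER_FRAME]; }; /* level 2 */
    struct mfn_root   { uint64_t mid_mfn[ENTRIES_PER_FRAME];  }; /* level 1 */

    /* With 512 entries per level, a single root frame can describe up to
     * 512 * 512 * 512 pages -- far more than the 128K pages of a 512 MB
     * guest -- without ever requiring a contiguous allocation above 4 KB. */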

5.3.5 Dynamic Self Ballooning Implementation

The Xen Hypervisor has a set of hypercalls responsible for allowing a guest to change its memory reservation on-demand. The general idea of implementing DSB under Xen is three-fold (a minimal sketch of the underlying hypercalls follows the list). We will discuss how each of these steps is implemented within our version of the post-copy system and also how we modified it to be used in the original Xen implementation of pre-copy:

1. Inflate the balloon: For migration, this is accomplished by allocating as much free

memory as possible and handing those pages over to the “decrease reservation”

hypercall. This results in those pages being placed under the ownership of the hy-

pervisor.

2. Detect memory pressure: There are a few ways of doing this within Linux, which

we will describe shortly. Memory pressure indicates that either an application or the

kernel needs a page frame right now. In response, the DSB process must deflate

the balloon by the corresponding amount of memory pressure (but it need not destroy

the balloon completely).

3. Deflate the balloon: Again, this is accomplished by performing the reverse of step 1:

first the DSB process invokes the “increase reservation” hypercall. Then it proceeds

to release the list of free pages that were previously allocated (and handed to the

hypervisor for re-use) and actually give them back to the kernel’s free pool.
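A minimal sketch of the batched reservation hypercalls behind steps 1 and 3 is shown below. It is modeled on the stock Xen balloon driver from the Xen-Linux 2.6.18 tree; frame_list[], the helper name, and the omission of all error handling are our simplifications.

    #include <xen/interface/memory.h>
    #include <asm/hypervisor.h>

    /* frame_list[] holds the frame numbers being given up (inflate) or
     * reclaimed (deflate).  One hypercall covers the whole batch instead
     * of one call per page. */
    static long balloon_batch(unsigned long *frame_list, unsigned long nr,
                              int inflate)
    {
        struct xen_memory_reservation reservation = {
            .nr_extents   = nr,
            .extent_order = 0,
            .domid        = DOMID_SELF,
        };

        set_xen_guest_handle(reservation.extent_start, frame_list);

        return HYPERVISOR_memory_op(inflate ? XENMEM_decrease_reservation
                                            : XENMEM_increase_reservation,
                                    &reservation);
    }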

In order to rapidly inflate and deflate the balloon, we first had to determine where to initiate these operations. One can either place the DSB process within Domain 0 and communicate the intent to modify the balloon to the migrating VM through Xen’s internal communication mechanisms (the XenStore), or one can place the DSB process within the VM itself. Because the nature of performing ballooning requires internal knowledge of the kernel anyway, we decided to go with the latter. But the deciding factor on this placement actually dealt with the balloon driver that ships with the Xen source code. We found this driver to be a little slow. This driver does not batch the hypercalls required to perform ballooning, but instead executes hypercalls one-by-one. In our laboratory, we observed that a single hypercall can take as long as 2-3 microseconds. If we expect to perform DSB rapidly, these hypercalls must be batched together into a single hypercall - a feature which Xen already provides. It simply needed a little kick forward.

Thus we placed the DSB process within the guest VM itself and updated the existing driver to perform this batching.

Memory Overcommitment. Memory over-commitment within an individual operating system is a method by which the virtual memory subsystem can provide an illusion to an application that the physical memory in the machine is larger than what is true. However, there are multiple operating modes of over-commitment within the Linux kernel - and these modes can be enabled or disabled at runtime. By default, Linux disables this feature.

This has the effect of causing application-level memory allocations to be refused up front by returning a failure. So, by default (within Linux), if an application submits a memory allocation request without sufficient physical memory available, Linux will return an

error. However if you enable over-commitment, the kernel will truly view the set of physical

memory as infinite. One could spend an entire paper arguing that the over-commitment

feature should be enabled by default, but the Linux community has instead chosen to “err

on the side of caution” and defer such a decision to experienced system administrators.

Over-commitment is required for the transparent detection of memory pressure that we

have developed for our version of the DSB process, which we describe next.

Detecting Memory Pressure. Surprisingly enough, the Linux kernel already provides

a transparent way of doing this: through the filesystem interface. When a new filesystem

is registered with the kernel, one of the function pointers provided includes a callback to

request that the filesystem free any in-kernel data caches that the filesystem may have pinned in memory. Such caches typically include things like inode and directory entry caches. These callbacks are driven by the virtual memory system and are invoked when applications ask for more memory. Indirectly, the virtual memory system makes this determination when it is time to perform a copy-on-write on behalf of an application that has allocated a large amount of memory but has only recently decided to write to it for the first time. Consequently, the DSB process does not register a new filesystem, but we are still allowed to register a callback function that the virtual memory system can invoke. This worked remarkably well and does indeed provide very precise feedback to the DSB process on exactly when a memory-intensive application has become active. This Linux function is called set_shrinker(). Alternatively, one could periodically wake up the

DSB process at an interval and scan the /proc/meminfo and /proc/vmstat files to determine this information by hand. We found the filesystem interface to be more direct as well as accurate. Whenever we get a callback, the callback already contains a numeric value of ex- actly how many pages it wants the DSB process to release at once. The size of this batch is typically 128 pages. The callback can happen very frequently in a back-to-back manner on behalf of active user applications. Each time the callback occurs the DSB process will deflate the balloon as described by the requested amount and go back to sleep.
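A sketch of the hook is shown below, assuming the 2.6.18-era set_shrinker() interface referred to above; dsb_deflate() and dsb_balloon_size() are illustrative stand-ins for our module's internal routines.

    #include <linux/mm.h>
    #include <linux/module.h>
    #include <linux/errno.h>

    extern unsigned long dsb_balloon_size(void);       /* pages currently ballooned */
    extern unsigned long dsb_deflate(unsigned long nr); /* release up to nr pages   */

    /* Invoked by the VM system under memory pressure; nr_to_scan is the
     * number of pages it would like back (typically ~128 per callback). */
    static int dsb_shrink(int nr_to_scan, gfp_t gfp_mask)
    {
        if (nr_to_scan > 0)
            dsb_deflate(nr_to_scan);

        /* Report how much the balloon could still give back if asked again. */
        return (int)dsb_balloon_size();
    }

    static struct shrinker *dsb_shrinker;

    static int __init dsb_pressure_init(void)
    {
        dsb_shrinker = set_shrinker(DEFAULT_SEEKS, dsb_shrink);
        return dsb_shrinker ? 0 : -ENOMEM;
    }

    static void __exit dsb_pressure_exit(void)
    {
        remove_shrinker(dsb_shrinker);
    }

    module_init(dsb_pressure_init);
    module_exit(dsb_pressure_exit);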

Completing the DSB process. Finally, the DSB process, with the ability to detect memory pressure, must periodically reclaim free pages that may or may not have been released by running applications or the kernel itself. We perform this sort of “garbage collection” within a kernel thread. Note: this is not true garbage collection - that is not our intention. The kernel thread will wake up at periodic intervals and attempt to re-inflate the balloon as much as possible and then go back to sleep. If memory pressure is detected during this time, the thread will preempt itself and cease inflation and go back to sleep.

The only thing required to complete this is a 200-line patch to the Xen migration daemon running within Domain 0. Recall the operation of the DSB process with respect to pre-copy and post-copy. Post-copy uses DSB only once: the kernel thread will balloon a single time before downtime occurs and go back to sleep, whereas DSB runs continuously for pre-copy. The migration daemon has a policy to which it strictly adheres: if a page frame has never been mapped before, it will not be migrated or transmitted. Note: this is not the same as detecting whether or not a page frame has been allocated and subsequently freed - it only detects when a page has been allocated for the first time (by assigning a machine frame number to the corresponding pseudo-physical frame number). This information is stored in what Xen calls a “physmap”, which we discussed earlier in the mfn exchanging section. A property of this physmap is that the total number of valid entries in this map is monotonically increasing; it will never decrease on the same host. This means that if the DSB process has inflated the balloon and the balloon contains a page frame that is mapped inside the physmap table, then the migration daemon will transmit that frame regardless.

That defeats the purpose of the DSB process. So we modify the migration daemon by

exposing to it the list of ballooned pages. As a result, whenever the migration daemon is

ready to transmit a particular page, it first consults that list and skips transmission if it is in

the list. (This list is actually a bitmap). Our suggestion to the Xen community is to develop

a sort of watermarked “dynamic physmap garbage collection” such that the kernel would

be responsible for clearing the physmap when it is no longer using a page. This is almost

identical to the earlier suggestion in the Page Tracking scheme we devised, except that

such use of the physmap would not be architecturally defined - nor would it necessarily

be visible to the hardware. We believe that a garbage-collected physmap would allow

for both the seamless implementation of Dynamic Self-Ballooning as well as the ability to

implement Page Tracking without hardware support. But for now, we are using the cards

we have been dealt.

5.3.6 Proactive LRU Ordering to Improve Reference Locality

During normal operation, the guest kernel maintains the age of each allocated page frame

in its page cache. Linux, for example, maintains two linked lists in which pages are main-

tained in Least Recently Used (LRU) order: one for active pages and one for inactive

pages. A kernel daemon periodically ages and transfers these pages between the two

lists. The inactive list is subsequently used by the paging system to reclaim pages and

write to the swap device. As a result, the order in which pages are written to the swap

device reflects the historical locality of access by processes in the VM. Ideally, the active push component of post-copy could simply use this ordering of pages in its pseudo-paging device to predict the page access pattern in the migrated VM and push pages just in time to avoid network faults. However, Linux does not actively maintain the LRU ordering in these lists until a swap device is enabled. Since a pseudo-paging device is enabled just before migration, post-copy would not automatically see pages in the swap device ordered in the

LRU order. To address this problem, we implemented a kernel thread which periodically scans and reorders the active and inactive lists in LRU order, without modifying the core kernel itself. In each scan, the thread examines the referenced bit of each page. Pages with their referenced bit set are moved to the most recently used end of the list and their referenced bit is reset. This mechanism supplements the kernel’s existing aging support without the requirement that a real paging device be turned on. Section 5.4.4 shows that such a proactive LRU ordering plays a positive role in reducing network faults.
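The scan itself can be sketched as follows. This is only an illustration under the assumption of the 2.6.18 zone layout (zone->lru_lock plus per-zone active and inactive lists); how the module locates the zones and the one-second scan interval shown here are simplifications, not the exact parameters of our implementation.

    #include <linux/kthread.h>
    #include <linux/mm.h>
    #include <linux/mmzone.h>
    #include <linux/delay.h>

    static void reorder_list(struct zone *zone, struct list_head *list)
    {
        struct page *page, *next;

        spin_lock_irq(&zone->lru_lock);
        list_for_each_entry_safe(page, next, list, lru) {
            /* Recently referenced pages float to the MRU (head) end and
             * have their referenced bit cleared for the next scan. */
            if (TestClearPageReferenced(page))
                list_move(&page->lru, list);
        }
        spin_unlock_irq(&zone->lru_lock);
    }

    static int lru_reorder_thread(void *arg)
    {
        struct zone *zone = arg;

        while (!kthread_should_stop()) {
            reorder_list(zone, &zone->active_list);
            reorder_list(zone, &zone->inactive_list);
            msleep(1000);   /* illustrative scan interval */
        }
        return 0;
    }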

Lines of Code. The kernel-level implementation of Post-Copy, which leverages the

MemX system, is about 7000 lines of code within pluggable kernel modules. About 4000 of those lines are part of the MemX system that is invoked during demand-paging; the other 3000 lines contribute to the pre-paging component, the flushing component, and the ballooning component combined. (The DSB implementation also operates within the aforementioned kernel modules and runs inside the guest OS itself as a kernel thread. There is no dom0 interaction with the DSB process). A 200-line patch is applied to the migration daemon to support ballooning and a 300-line patch is applied to the guest kernel so that the initiation of pseudo swapping can begin. When all is said and done, the system remains completely transparent to applications and approaches about 8000 lines. Neither the original pre-copy algorithm code nor the hypervisor itself is changed at all. (As discussed before, alternative page-fault detection methods would require additional hypervisor support).

5.4 Evaluation

In this section, we present the detailed evaluation of our post-copy implementation and compare it against Xen’s original pre-copy migration. Our test environment consists of two 2.8 GHz dual-core Intel machines connected via a Gigabit Ethernet switch. Each machine has 4 GB of memory. Both the guest VM in each experiment and the Domain

0 are configured to use two virtual CPUs. Guest VM sizes range from 128 MB to 1024

MB. Unless otherwise specified, the default guest VM size is 512 MB. In addition to the performance metrics mentioned in Section 5.2, we evaluate post-copy against an additional metric. Recall that post-copy is effective only when a large majority of the pages reach the target node before they are faulted upon by the VM at the target, in which case they become minor page faults rather than network-bound page faults. Thus the fraction of network page faults compared to minor page faults is another indication of the effectiveness of our post-copy approach. Secondly, we quantify the pages transferred during pre-copy by scripting the extraction of those numbers from the Xen logs. For post-copy we output this information to procfiles.

That value is then added to the number of pages that contribute to “non-pageable memory” for a grand total.

5.4.1 Stress Testing

We start with a stress test of both migration schemes using a simple, highly sequential, memory-intensive C program. This program accepts a parameter to change the working set of memory accesses and a second parameter to control whether it performs memory reads or writes during the test. The experiment is performed in a 1024 MB VM with its working set ranging from 8 MB to 512 MB. The rest is simply free memory.
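For reference, the stress program is essentially of the following form (an illustrative reconstruction, not the exact source): it touches a working set of the requested size in a tight sequential loop, either reading or writing one byte per 4 KB page.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Usage (illustrative): ./stress <working-set-MB> <r|w> */
    int main(int argc, char **argv)
    {
        size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 8;
        int write_mode = (argc > 2 && argv[2][0] == 'w');
        size_t len = mb << 20;
        volatile char *buf = malloc(len);
        unsigned long sum = 0;
        size_t i;

        if (buf == NULL)
            return 1;
        memset((char *)buf, 0xAA, len);        /* fault in every page up front */

        for (;;) {                             /* run until the harness kills us */
            for (i = 0; i < len; i += 4096) {  /* one access per 4 KB page */
                if (write_mode)
                    buf[i] = (char)i;
                else
                    sum += buf[i];
            }
        }
        return (int)sum;                       /* not reached */
    }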

We perform the experiment with seven different test configurations:

1. Stop-and-copy Migration: This is a non-live VM migration scenario which provides

a baseline to compare the total migration time and number of pages transferred by

post-copy.

2. Read-intensive Pre-Copy: This configuration provides the best-case workload for

pre-copy. The performance on the total migration time metric is expected to be roughly

similar to pure stop-and-copy migration.

3. Write-intensive Pre-Copy: This configuration provides the worst-case workload for

pre-copy and causes worsening of all performance metrics.

4. Read-intensive Post-Copy:

Figure 5.5: Comparison of total migration times (in seconds) between post-copy and pre-copy as the working set size grows from 8 MB to 512 MB.

5. Write-intensive Post-Copy: These two configurations will stress our pre-paging al-

gorithm and flushing implementations and are expected to perform almost identically.

6. Read-intensive Pre-Copy without DSB:

7. Write-intensive Pre-Copy without DSB: These two configurations test the default

implementation of pre-copy in Xen that does not use DSB. Unless we specify other-

wise, the reader should assume that DSB is turned on for pre-copy. Post-copy always

uses DSB.

For each figure, the plots in the legend are in the same order as you see them from top to bottom in the figure.

Total Migration Time: Figure 5.5 shows the variation of total migration time with in- creasing working set size. Notice that both post-copy plots for total time are at the bottom, surpassed only by read-intensive pre-copy. Our first observation is that both the read and write intensive tests of post-copy perform very similarly. Thus our post-copy algorithm’s performance is agnostic to the read or write-intensive nature of the application workload.

Future work might involve giving more priority to page fault writes over reads. Furthermore, we observe that without DSB activated, as in the default Xen implementation, the total migration time for read-intensive pre-copy is very high due to unnecessary transmission of free guest pages over the network. This conclusion demonstrates itself in the three remaining plots as well.

Figure 5.6: Comparison of downtimes (in milliseconds) between pre-copy and post-copy as the working set size grows from 8 MB to 512 MB.

Downtime: Figure 5.6 exhibits similar behavior for the downtime metric as the working set size increases. Recall that our choice of page fault detection in Section 5.3 increases the base downtime in post-copy. Thus, the figure shows a roughly constant downtime that ranges from 600 milliseconds to over one second. As expected, the downtime for the write-intensive pre-copy test increases significantly as the size of the writable working set increases.

Pages Transferred and Page Faults: Figure 5.7 and Table 5.2 illustrate the utility of our pre-paging algorithm in post-copy across increasingly large working set sizes. Figure 5.7 plots the total number of pages transferred. As expected, post-copy transfers fewer pages than write-intensive pre-copy as well as pre-copy without DSB, the reduction being as much as 85%. It performs on par with read-intensive pre-copy with DSB and stop-and-copy, all of which transfer each page only once over the network. Table 5.2 compares the fraction of network and minor faults in post-copy. We see that pre-paging reduces the fraction of network faults to between 2% and 4%, compared to 9% to 15% when flushing alone is used. To be fair, the stress-test is highly sequential in nature and consequently, pre-paging predicts this behavior almost perfectly. We expect the real applications in the next section to do worse than this optimal case.

Figure 5.7: Comparison of the number of pages transferred during a single migration.

                    Pre-Paging          Flushing
    Working Set     Net     Minor       Net     Minor
    8 MB            2%      98%         15%     85%
    16 MB           4%      96%         13%     87%
    32 MB           4%      96%         13%     87%
    64 MB           3%      97%         10%     90%
    128 MB          3%      97%          9%     91%
    256 MB          3%      98%         10%     90%

Table 5.2: Percent of minor and network faults for flushing vs. pre-paging. Pre-paging greatly reduces the fraction of network faults.

Figure 5.8: Kernel compile with back-to-back migrations using 5 second pauses.

5.4.2 Degradation, Bandwidth, and Ballooning

Next, we quantify the side effects of migration on a couple of sample applications. We want to answer the following questions: What kind of slow-down do VM workloads experience during pre-copy versus post-copy migration? What is their impact on network bandwidth re- ceived by applications? And finally, what kind of balloon inflation interval should we choose to minimize the impact of DSB on running applications? For application degradation and

DSB interval, we use Linux kernel compilation. For bandwidth testing we use the NetPerf

TCP benchmark.

Degradation Time: Figure 5.8 depicts a repeat of an interesting experiment from [73].

We initiate a kernel compile inside the VM and then migrate the VM repeatedly between

two hosts. We script the migrations to pause for 5 seconds each time. Although there

is no exact way to quantify degradation time (due to scheduling and context switching),

this experiment provides an approximate measure. As far as memory is concerned, we

observe that kernel compilation tends not to exhibit too many memory writes. (Once gcc

forks and compiles, the OS page cache will only be used once more at the end to link

the kernel object files together). As a result, this experiment is good for post-copy comparison

because it represents the best case for the original pre-copy approach when there is not much repeated dirtying of pages. This experiment is also a good worst-case tester for our implementation of Dynamic Self Ballooning due to the repeated fork-and-exit behavior of the kernel compile as each object file is created over time. (Interestingly enough, this experiment also gave us a headache, because it exposed the bugs in our code!) We were surprised to see how many additional seconds were added to the kernel compilation in

Figure 5.8 just by executing back-to-back invocations of pre-copy migration. Nevertheless, we observe that post-copy tends to match pre-copy, exhibiting roughly the same amount of degradation.

Although we would have preferred to see less degradation than pre-copy, we can at least rest assured that we’re not doing worse. This is in line with the competitive performance of post-copy with read-intensive pre-copy tests in Figures 5.5 and 5.7. We suspect that a shadow-paging based implementation of post-copy would perform much better due to the significantly reduced downtime it would provide.

Additionally, Figure 5.9 shows the same experiment using NetPerf. A sustained, high-bandwidth stream of network traffic causes slightly more page-dirtying than the compilation does. The setup involves placing the NetPerf sender inside the guest VM and the receiver on an external node on the same switch. Consequently, regardless of VM size, post-copy actually performs slightly better and reduces the degradation time experienced by

NetPerf. The figure also indicates an example of severe degradation without DSB due to transmission of free pages.

Effect on Bandwidth: In their paper [27], the Xen project proposed a solution called “adaptive rate limiting” to control the bandwidth overhead due to migration. However, this feature is not enabled in the currently released version of Xen. In fact, it is compiled out without any runtime options or pre-processor directives. This is likely because it is difficult, if not impossible, to predict beforehand the bandwidth requirement of any single guest in order to guide the behavior of adaptive rate limiting. Hence, there is no explicit arbitration of network bandwidth contention between simultaneous operation of the migration daemon and a network-heavy application. With that in mind, Figures 5.10 and 5.11 show a visual representation of the reduction in bandwidth experienced by a high-throughput NetPerf session. We conduct this experiment by measuring bandwidth values rapidly and invoking VM migration in between. The impact of migration can be seen in both figures as a sudden reduction in the observed bandwidth during migration. This reduction is more sustained, and greater, for the pre-copy approach than for post-copy because the total number of pages transferred in pre-copy is much higher. This is exactly the bottom line that we were targeting for improvement.

Figure 5.9: NetPerf run with back-to-back migrations using 5 second pauses.

Figure 5.10: Impact of post-copy on NetPerf bandwidth (phases: 1. normal operation, 2. DSB + pre-paging invocation, 3. CPU and non-paged memory transfer, 4. resume, 5. migration complete).

Figure 5.11: Impact of pre-copy on NetPerf bandwidth (phases: 1. normal operation, 2. DSB invocation, 3. iterative memory copies, 4. CPU-state transfer, 5. migration complete).

Figure 5.12: The application degradation is inversely proportional to the ballooning interval (kernel compile, 439 sec baseline; slowdown in seconds vs. balloon interval in jiffies, for 128 MB and 512 MB guests).

Each experiment henceforth operates in this mode (with rate limiting compiled out). We believe this choice does make sense, however: the migration daemon really cannot guess whether the guest is hosting, say, a webserver, and it is likely the webserver will take whatever size pipe it can get its hands on, which would suggest that the migration daemon should just let TCP do what it normally does. On the other hand, the daemon might use up CPU cycles that might otherwise be granted to the guest itself. The point is that it is all guesswork without some kind of signal from the guest. In fact, while strolling through the Xen daemon’s code, we found that the end of the pre-copy iteration process is guided by only two factors: a 30-iteration maximum constant combined with a minimum page dirtying rate of 50 pages per pre-copy round. The daemon will keep iterating until either one of those conditions is met, which is why even mildly write-intensive applications never converge.
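Paraphrased in code, the stop condition amounts to the check below; the constants come from the description above, while the names are ours rather than the actual symbols in the Xen save code.

    #define MAX_PRECOPY_ITERATIONS  30   /* hard cap on pre-copy rounds */
    #define DIRTY_PAGE_THRESHOLD    50   /* stop when a round dirties fewer pages */

    static int precopy_should_stop(int iteration, unsigned long pages_dirtied)
    {
        return iteration >= MAX_PRECOPY_ITERATIONS ||
               pages_dirtied < DIRTY_PAGE_THRESHOLD;
    }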

Dynamic Ballooning Interval: Figure 5.12 shows how we chose the DSB interval, the period at which the DSB process wakes up to reclaim available free memory. With the kernel compile as the test application, we execute the DSB process at intervals ranging from 10 ms to 10 s. At every interval, we script the kernel compile to run multiple times and output the average completion time. The difference between that number and the base case is the degradation time added to the application by the DSB process due to its CPU usage. As expected, the application degradation is inversely proportional to the choice of ballooning interval: the more often you balloon, the more it affects the VM workload. The graph indicates that we should choose an interval between 4 and 10 seconds to balance between frequently reclaiming free pages and avoiding a significant impact on applications. Note that this graph only represents one type of mixed application. For more CPU-intensive workloads, it will be necessary to make the ballooning interval dynamic enough that it could increase for CPU-intensive applications or applications that perform rapid memory allocation.

5.4.3 Application Scenarios

The last part of our evaluation is to re-visit the aforementioned performance metrics across four real applications:

1. SPECWeb 2005: This is our largest application. It is a well-known webserver bench-

mark involving two or more physical hosts. We place the system under test

within the guest VM, while six separate client nodes bombard the VM with connec-

tions.

2. Bit Torrent Client: Although this is not a typical server application, we chose it

because it is a simple representative of a multi-peer distributed application. It is easy

to initiate and does not immediately saturate a Gigabit Ethernet pipe. Instead it fills

up the network pipe gradually, is slightly CPU intensive, and involves a somewhat

more complex mix of page-dirtying and disk I/O than just a kernel compile.

3. Linux Kernel Compile: We consider this again for consistency.

4. NetPerf: Once more, as in the previous experiments, the NetPerf sender is placed

inside the guest VM.

Using these applications, we evaluate the same four primary metrics that we covered in

Section 5.4.1: Downtime, Total Migration Time, Pages Transferred, and Page Faults. Each

figure for these applications represents one of the four metrics and contains results for a constant, 512 MB virtual machine in the form of a bar graph for both migration schemes across each application. Each data point is the average of 20 samples. And just as before, the guest VM is configured to have two virtual CPUs. All of these experiments have the DSB activated.

Figure 5.13: Total pages transferred for both migration schemes.

Pages Transferred and Page Faults. The experiments in Figures 5.13 and 5.14 illustrate these results. For all of the applications except SPECWeb, post-copy reduces the total pages transferred by more than half. The most significant result we’ve seen so far is in Figure 5.14, where post-copy’s pre-paging algorithm is able to avoid 79% and 83% of the network page faults (which become minor faults) for the largest applications (SPECWeb,

Bittorrent). For the smaller applications (Kernel, NetPerf), we still manage to save 41% and 43% of network page faults. There is a significant amount of additional prior work in the literature aimed at working-set identification, and we believe that these improvements can be even better if we employ both knowledge-based and history-based predictors in our pre-paging algorithm. But even with a reactive approach, post-copy appears to be a strong competitor.

Total Time and Downtime. Figure 5.15 shows that post-copy reduces the total migration time for all applications, when compared to pre-copy, in some cases by more than 50%. However, the downtime in Figure 5.16 is currently much higher for post-copy than for pre-copy. As we explained earlier, the relatively high downtime is due to our expedient choice of pseudo-paging for page fault detection, which we plan to reduce through the use of shadow paging. Nevertheless, this tradeoff between total migration time and downtime may be acceptable in situations where network overhead needs to be kept low and the entire migration needs to be completed quickly.

Figure 5.14: Page-fault comparisons: Pre-paging lowers the network page faults to 17% and 21%, even for the heaviest applications.

Figure 5.15: Total migration time for both migration schemes.

Figure 5.16: Downtime for post-copy vs. pre-copy. Post-copy downtime can improve with better page-fault detection.

5.4.4 Comparison of Prepaging Strategies

This section compares the effectiveness of different prepaging strategies. The VM workload is a Quicksort application that sorts a randomly populated array of user-defined size. We vary the number of processes running Quicksort from 1 to 128, such that 512 MB of memory is collectively used among all processes. We migrate the VM in the middle of its workload execution and measure the number of network faults during migration. A smaller network fault count indicates better prepaging performance.

Figure 5.17: Comparison of prepaging strategies using multi-process Quicksort workloads.

We compare a number of prepaging combinations by varying the following factors (a sketch of dual-direction bubbling appears after the list):

1. whether or not some form of bubbling is used;

2. whether the bubbling occurs in forward-only or dual directions;

3. whether single or multiple pivots are used; and

4. whether the page-cache is maintained in LRU order.
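The sketch below illustrates dual-direction bubbling around a single pivot, where the pivot is the position in the pseudo-paging device of the most recent network fault; the multi-pivot variant simply runs one such bubble per recent fault. The helper names and the per-invocation page budget are illustrative simplifications of our actual pre-paging component.

    extern int  page_already_sent(unsigned long idx);   /* illustrative */
    extern void push_page(unsigned long idx);           /* illustrative */

    /* Push not-yet-sent pages outward from the pivot in both directions,
     * stopping once the budget is exhausted or both ends are reached. */
    static void bubble(unsigned long pivot, unsigned long nr_pages,
                       unsigned long budget)
    {
        long lo = (long)pivot - 1;
        unsigned long hi = pivot + 1;

        while (budget > 0 && (lo >= 0 || hi < nr_pages)) {
            if (lo >= 0) {
                if (!page_already_sent((unsigned long)lo)) {
                    push_page((unsigned long)lo);
                    budget--;
                }
                lo--;
            }
            if (budget > 0 && hi < nr_pages) {
                if (!page_already_sent(hi)) {
                    push_page(hi);
                    budget--;
                }
                hi++;
            }
        }
    }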

Figure 5.17 shows the results. Each vertical bar represents an average over 20 experimental runs. Our first observation is that bubbling, in any form, performs better than push-only prepaging. Secondly, sorting the page-cache in LRU order performs better than the non-LRU cases by improving the locality of reference of neighboring pages in the pseudo-paging device. Thirdly, dual-directional bubbling improves performance over forward-only bubbling in most cases, and never performs significantly worse, which indicates that it is always preferable to use dual-directional bubbling. (The performance of reverse-only bubbling was found to be much worse than even push-only prepaging, hence its results are omitted). Finally, dual multi-pivot bubbling is found to consistently improve the performance over single-pivot bubbling since it exploits locality of reference at multiple locations in the pseudo-paging device.

5.5 Summary

We have presented post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning. Post-copy is a combination solution consisting of four pieces: demand paging, our pre-paging algorithm called “bubbling”, flushing, and the use of dynamic self-ballooning. We have implemented and evaluated this system and shown that it is able to effect significant performance improvements over the pre-copy based migration of system virtual machines by reducing the number of pages transferred between the source and target hosts. Our future work will explore the use of alternative page-fault detection mechanisms as well as further applications of dynamic self-ballooning. There is a great deal of additional work that remains to be done. As we mentioned in Section 5.3, there are three different methods by which one can implement page-fault detection to support demand paging at the Virtual Machine level. We would like to set aside our expedient choice of pseudo-swapping in favor of a shadow paging based method of detection, and if possible investigate extensions to the Xen physmap (the array of mappings between pseudo-physical and real page frames), with the goal of implementing the more efficient use of real CPU exceptions, which we called “page tracking”.

Second, as stated in Section 5.2, we must take care to address the reliability issue for post-copy so that we may provide the same level of reliability that the original pre-copy scheme provides.

Chapter 6

CIVIC: Transparent Over-subscription of VM Memory

In this chapter, we describe the design, implementation and evaluation of Collective Indirect

Virtual Caching, or CIVIC for short. CIVIC provides significantly lower-level support for access to virtual cluster-wide memory than MemX. CIVIC is a memory oversubscription system for VMs, designed to integrate the techniques from the previous three systems described in this dissertation, by which the hypervisor can multiplex individual page frames of unmodified

Virtual Machines in a fine-grained manner.

Three primary uses of CIVIC are:

1. Higher Consolidation: to oversubscribe the limited memory of a single physical

host for the purpose of running higher numbers of consolidated Virtual Machines

with greater use of the hardware and without depending on para-virtualization or

ballooning.

2. Large-Memory Pool: to provide large-memory applications transparent access to

a cluster-wide, low-latency memory-pool without any additional binary or operating

system interfaces, and

3. Improved Migration: to reduce the amount of resident main memory when the time

comes to migrate individual Virtual Machines across the network. Due to time constraints, this

feature has not been implemented, but CIVIC is designed for it.


The motivation for this work derives directly from the last few chapters: Now that we have a number of systems for both distributing and migrating individual page frames, the

final step is to fully utilize the power of virtualization technology to support VMs with more transparency and ubiquity - to use unmodified, commodity operating systems in such a way that they have access to a (potentially) unlimited memory resource with the participation of an entire cluster. The end goal of this chapter is to build a system underneath any commodity OS capable of giving a systems programmer arbitrary access to individual page frames located anywhere in the entire cluster in such a way that new techniques can be designed for VM memory management with ease and efficiency. The CIVIC system does just that: it transparently allows a Virtual Machine to oversubscribe (or overcommit) its physical memory space without any participation from the VM operating system whatsoever. Any non-committed memory is then paged out. In our case, it is paged out to MemX.

6.1 Introduction

One of the great rules in system design, applied frequently across computer science (though not always acknowledged explicitly), is that if a piece of data will likely be used again in the future, you will probably succeed wildly by going out of your way to design your algorithm or data structure such that it caches or preserves that data. It is remarkable how often that rule shows itself. The transparency afforded to VMs by hypervisors provides some good opportunities to exploit caching. Virtual Machines are almost entirely unaware that their low-level view of physical memory is being “toyed” with in significant ways. So, in order to achieve the kind of memory ubiquity that we described, we propose to combine the ability to do more fine-grained caching underneath VMs with the ability to virtualize cluster-wide memory (which was covered in earlier chapters). With CIVIC, we propose to allow the hosts in the cluster to cooperate with each other in order to transparently support VMs whose physical memory footprints can span multiple machines in the cluster. To re-iterate: CIVIC is not a Distributed

Shared Memory system (DSM). There are already two existing hypervisor-level DSM attempts: one by Virtual Iron in 2005 [12] and one at the Open Kernel Labs in 2009 [69].

The purpose of these systems is to build a single-system image (SSI). Building an SSI is not the focus of this dissertation. Rather, our goal is to allow unmodified virtual machines to gain access to cluster memory. Our focus is to enable greater VM consolidation and migration performance rather than to spread processing out into the cluster. Thus, VMs in our work use only local processors.

A simple view of CIVIC's role is that it does for VMs precisely what modern operating systems already do for processes in their virtual memory sub-systems: give a running process (nearly) unlimited access to virtual memory. The OS has a well-established mechanism for multiplexing virtual to physical memory accesses - the page table. We leverage a similar mechanism to manipulate a VM's view of physical memory, namely the "pseudo"-physical address space, hereafter referred to as the PPAS. (The "real" address space seen by the processor is correspondingly referred to as the RAS.) The hypervisor undertakes the responsibility of mapping pages in the PPAS to pages in the RAS. Technically, one could of course use a disk-based swap device to page the unused portions of the PPAS in and out, but that would lead to a significant slowdown in VM performance, as we have explored extensively in this dissertation. Instead, we use MemX to expand a VM's PPAS into the cluster-wide memory pool and minimize the performance impact that a disk would otherwise incur, without changing the operating system at all. The hypervisor plays the role of an intermediary by (1) providing the VM with the view of an expanded PPAS, (2) intercepting memory accesses by the VM to non-resident PPAS pages, and (3) efficiently redirecting these memory accesses for servicing by MemX, which executes in a separate virtual machine.

6.2 Design

The design of CIVIC depends heavily on the virtualization platform, which in our case is Xen. Although we have covered the design of Xen frequently in previous chapters, none of those systems operated strictly at the hypervisor level. This requires a brief discussion of the hypervisor's memory management schemes, including memory allocation and shadow paging. After this discussion, we present the design choices for CIVIC within the hypervisor itself and its interactions with higher-level services, followed by implementation-specific details.

6.2.1 Hypervisor Memory Management

VM memory management is fairly straightforward, with an extra level of indirection through the PPAS. This address space sits in between the virtual address space and the real physical address space (RAS) seen by the processor. Since the processor is no longer singly owned by one operating system, this extra level allows multiple PPASes to be multiplexed on top of a single RAS. Additionally, from here on out, the frame numbers associated with the PPAS (in Xen terminology) are called "pseudo-physical" frame numbers, or PFNs. Similarly, real frame numbers are called "machine" frame numbers, or MFNs. PFNs are contiguously numbered, whereas the MFNs allocated to a VM in the RAS are almost guaranteed to be sparse. In modern VM technology, there are three ways to manage the PPAS:

1. Para-virtualization: A para-virtual VM (or guest) is one that has been modified in such a way that the VM is aware of the hypervisor. It has been patched directly to inform the hypervisor explicitly when it intends to update any given page table in its ownership. In such a guest, the OS maps page frames using machine frame numbers (MFNs) and has no actual concept of the PPAS (except for memory allocation and VM migration, discussed in the last chapter). Thus, frame identifiers in a para-virtual guest's page table entries are the same ones seen by the processor. This has performance advantages because the guest OS can "batch" a number of page table updates into one hypercall (but only up to a limit, as we will see in option #3). Para-virtual support has recently been merged upstream and built into both Linux and Windows, which mitigates some of the problems with this approach that relate to maintaining compatibility with newly released operating system versions. Thus, para-virtualization is no longer a technological obstacle.

2. Shadow-paging: When modifying the guest is unacceptable (for example, for older OS kernels), hypervisors do not place real MFNs into guest OS page tables. Instead, "pseudo" PFNs are used, so that the guest's page tables map virtual page numbers to PFNs. The hypervisor then traps write accesses to those tables (using CR3 register virtualization and by marking them read-only) while maintaining another set of "shadow" page tables underneath the virtual machine that map the virtual page numbers to MFNs. These shadow page tables are the ones exposed to the processor. Thus, memory virtualization and device emulation can be done for arbitrary, unmodified operating systems. When this kind of memory management is used, we refer to the guest OS as a hardware virtual machine, or "HVM", as opposed to a para-virtual guest. An elaboration of shadow paging is given next, in Section 6.2.2.

3. Hardware-assisted Paging: This approach improves upon shadow-paging by moving the translation logic of shadow paging from the hypervisor into the processor. Essentially this is an MMU expansion - making the MMU do a little more of what it is already doing. With this support, it is no longer necessary to trap into the hypervisor as frequently, allowing page-fault exceptions to be delivered directly to the guest OS. Such guests are also called HVM guests, with the internal distinction of hardware-assisted paging.

As of this writing, CIVIC depends exclusively on the hypervisor's ability to perform shadow-paging for unmodified HVM guest operating systems. The most basic ability required by CIVIC is to both create and intercept page-fault exceptions - exceptions that would not normally be seen by the guest OS - before they are propagated to the guest virtual machine. An unmodified HVM running on top of a CIVIC-enabled hypervisor that used hardware-assisted paging (instead of shadow-paging) would require additional logic to force the processor to trap into the hypervisor during such CPU exceptions when a page is owned by CIVIC (a non-resident page frame). So, as of this writing, CIVIC depends on shadow paging alone, without the assistance of hardware-assisted paging. Section 6.4 describes how the use of shadow paging affects the baseline performance of a virtual machine running on top of a CIVIC-enabled hypervisor.

6.2.2 Shadow Paging Review

Next, it is necessary to elaborate on the use of shadow-paging and some of the common Xen-specific data structures. All of the machines in our particular Xen cluster are 64-bit machines. Thus, the assumption of this discussion is that our HVM guests are also 64-bit virtual machines, requiring a standard 4-level page table hierarchy.

When we say "L1" page tables, we mean the standard definition, where pointers to data pages are contained at the lowest level of the hierarchy (the leaves) and the root of the page table is at level L4. All L4 tables are pointed to by "Control Register #3", or CR3, sometimes called the page-table base pointer. As usual, for any given process running on the CPU, the value of CR3 will only point to the root L4 table of a single process at a time - or to the kernel's page tables. A "resident" page table entry (PTE) at any level of the page table hierarchy is one whose lowest-order (present) bit is set, indicating that the page beneath it (either data or a page table) is actually sitting in memory. During the shadow paging process, three things can happen:

1. Shadow-Walk: The MMU, with access to a virtualized CR3 base pointer, attempts to walk the shadow page table hierarchy of a particular virtual machine. For every HVM page table, there is a corresponding shadow page table at each level of the PT hierarchy. If the MMU does not find a shadow PTE, a trap into the hypervisor occurs.

2. Guest-walk: The hypervisor then performs a manual walk of the real HVM tables, starting at what the HVM thinks is the true CR3 base pointer. If the hypervisor finds the appropriate PTE, then the whole page table is copied and control returns to the CPU for that virtual machine.

Figure 6.1: Original Xen-based physical memory design for multiple, concurrently-running virtual machines. (Shown: a para-virtual VM, several HVMs, and Domain 0, each with its own Pseudo-Physical Address Space (PPAS) mapped onto the Real Address Space (RAS) by the hypervisor.)

3. Guest-walk-miss: Otherwise, a missing PTE in the guest signifies a real CPU exception, and the fault is propagated to the HVM. At that point, it is the HVM's responsibility to service the fault and proceed as normal.

Furthermore, during the shadow-paging process there are upwards of a dozen "shadow optimizations" employed by Xen on top of this basic design, used to reduce memory access latency when going through the shadows (with respect to Windows virtual machines, HVMs, and more). For the current version of CIVIC, these optimizations are disabled; doing so was necessary to get an initial version of CIVIC working. Future versions of CIVIC can be made to take advantage of these optimizations. Thus, the rest of this chapter discusses our implementation under the assumption that these optimizations are disabled. This assumption also constitutes our base case for the benchmarking in our evaluation.
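To make the three shadow-paging outcomes above concrete, the following is a minimal, self-contained C sketch of the dispatch logic, using toy one-level arrays in place of real 4-level page tables. Every name in it (guest_pt, shadow_pt, handle_shadow_fault) is hypothetical and chosen only for illustration; it does not reflect Xen's actual data structures or entry points.

#include <stdio.h>
#include <stdbool.h>

#define NUM_PAGES 16

/* Toy "page tables": pfn[i] is the mapping for virtual page i, and
 * present[i] says whether that mapping exists.  A real guest and its
 * shadow are 4-level hierarchies; one level is enough to show the idea. */
struct toy_pt {
    int  pfn[NUM_PAGES];
    bool present[NUM_PAGES];
};

static struct toy_pt guest_pt;   /* what the HVM thinks it has   */
static struct toy_pt shadow_pt;  /* what the MMU actually walks  */

/* Called when the MMU misses in the shadow tables (case 1). */
static void handle_shadow_fault(int vpn)
{
    if (guest_pt.present[vpn]) {
        /* Case 2 (guest-walk hit): mirror the guest's mapping into the
         * shadow and let the virtual machine resume.                   */
        shadow_pt.pfn[vpn]     = guest_pt.pfn[vpn];
        shadow_pt.present[vpn] = true;
        printf("vpn %d: synced shadow -> pfn %d\n", vpn, guest_pt.pfn[vpn]);
    } else {
        /* Case 3 (guest-walk miss): a genuine fault that must be
         * delivered to, and serviced by, the HVM's own kernel.         */
        printf("vpn %d: injecting page fault into guest\n", vpn);
    }
}

int main(void)
{
    guest_pt.pfn[3]     = 42;     /* the guest mapped vpn 3       */
    guest_pt.present[3] = true;

    handle_shadow_fault(3);       /* shadow miss, guest hit       */
    handle_shadow_fault(7);       /* shadow miss, guest miss      */
    return 0;
}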

6.2.3 Step 1: CIVIC Memory Allocation, Caching Design

Figure 6.1 illustrates how memory is allocated to virtual machines in a typical virtualization architecture. Each VM gets a statically-allocated region of physical memory on the host (subject to ballooning). During normal operation, the size of the PPAS for each virtual machine does not change. Any number of VMs (depending on the amount of memory available) can be created by the administrator side by side, and the OS of each virtual machine will manage the PPAS given to it (since the PPAS is contiguous) without interruption. In this default design, if an operating system places a reference (PFN) to a page in one of its page tables that it expects to be physically resident in memory, then it will be there - no questions asked. All VM technology currently works this way (except for our previous VM migration work in the last chapter, where dynamic ballooning is used). In Figure 6.1, we have four virtual machines, three of which are HVMs and one that uses para-virtualization. Regardless, the PPAS of all four virtual machines is static: from the moment those VMs are booted up to the time they shut down, their PPAS is fixed.

Figure 6.2: Physical memory caching design of a CIVIC-enabled Hypervisor for multiple, concurrently-running virtual machines.

CIVIC relaxes the assumption that a page actually exists when the VM asks for it. The first step in designing CIVIC involves taking an unmodified operating system in an arbitrary Virtual Machine and growing its PPAS by some amount. Afterwards, we add another level of indirection within the hypervisor that recognizes this expanded PPAS (by intercepting accesses through shadow-paging). Figure 6.2 illustrates how the hypervisor has been modified to change the memory allocation strategy in a CIVIC-enabled hypervisor. VMs #2 and #3 each get a statically allocated cache in which only a subset of their total PPAS is actually resident; the rest is out on the network. Cache hits are served from the RAS, whereas accesses to the remainder of the PPAS go to the network.

Take note of the difference between HVM #2 and HVM #3: the PPAS of an unmodified virtual machine need not be larger than the RAS of the physical host. This gives the administrator a choice: either grow the PPAS to be very large, or simply provide higher levels of consolidation by running more VMs on one host. However, the PPAS should be at least as large as the cache that CIVIC provides to each PPAS; it cannot be smaller, or that would obviate the need for CIVIC. Notice that Figure 6.2 also has two simultaneously running paravirtualized VMs. The current CIVIC implementation supports multiple PPAS strategies and does not require any VM to use CIVIC. You may choose to grow the PPAS of a virtual machine at boot time or leave it unchanged in its default mode of operation.

Figure 6.3 demonstrates the operation of an example CIVIC cache underneath an HVM. This HVM has 3 working sets (perhaps from three different processes or three different data structures within one process). The figure represents the common case, where the cache is fully populated with accessed memory. In this example, two of the working sets are in the cache, and a page fault to frame #6 occurs in the {4, 5, 6} set. Since the {8, 9} set is older according to the FIFO, frame #9 is evicted to MemX. An old copy of page #9 may or may not actually exist yet on MemX, but it will likely be there if the HVM has been running for a long time. The next section will use the same HVM and describe the hypervisor-level interactions between the cache and MemX.
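The FIFO behavior in Figure 6.3 can be sketched with a few lines of C. This is a toy model, not CIVIC's implementation: the cache is an array of PFNs, a fault on a non-resident PFN evicts the oldest entry when the cache is full (which, in CIVIC, would mean writing it out to MemX), and names such as fault_in are hypothetical.

#include <stdio.h>
#include <stdbool.h>

#define CACHE_SLOTS 4            /* a tiny cache, like the example above */
#define PPAS_PAGES  16

static int  fifo[CACHE_SLOTS];   /* PFNs currently resident, FIFO order  */
static int  fifo_len = 0;
static bool resident[PPAS_PAGES];

/* Bring 'pfn' into the cache; evict the oldest resident page first if
 * the cache is full.  Eviction is where CIVIC hands the victim to MemX. */
static void fault_in(int pfn)
{
    if (resident[pfn])
        return;                              /* cache hit: nothing to do */

    if (fifo_len == CACHE_SLOTS) {
        int victim = fifo[0];                /* oldest page */
        for (int i = 1; i < CACHE_SLOTS; i++)
            fifo[i - 1] = fifo[i];
        fifo_len--;
        resident[victim] = false;
        printf("evict pfn %d to network\n", victim);
    }

    fifo[fifo_len++] = pfn;                  /* newest page at the tail */
    resident[pfn] = true;
    printf("fetch pfn %d into cache\n", pfn);
}

int main(void)
{
    /* One possible access order consistent with Figure 6.3: the final
     * fault on frame 6 evicts frame 9, the oldest page in the FIFO.    */
    int accesses[] = { 9, 8, 4, 5, 6 };
    for (int i = 0; i < 5; i++)
        fault_in(accesses[i]);
    return 0;
}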

6.2.4 Step 2: Paging Communication and The Assistant

Devices and drivers in the modern virtualization stack that service popular devices for Virtual Machines are typically bundled into a VM commonly called "Domain 0", or "Dom0" for short. From here on out we will not refer to this VM except to acknowledge its presence. During runtime, this VM always exists; it typically hosts various drivers, has direct access to the corresponding devices, and acts as a relay for co-located virtual machines. There is a movement to break away from this unified, "monolithic" design, and CIVIC follows that philosophy [45]. Dom0 is not only a single point of failure during the development process, but also a performance bottleneck for the hypervisor's CPU scheduler, due to the fact that all I/O must go through Dom0 while the dependent VMs block. Thus, CIVIC introduces a second VM that assists exclusively in the paging process and nothing else. We refer to this domain as the "Assistant".

Figure 6.3: Illustration of a full PPAS cache. All page accesses in the PPAS space must be brought into the cache before the HVM can use the page. If the cache is full, an old page is evicted from the FIFO maintained by the cache.

Figure 6.4 illustrates the host-level internal design of CIVIC and the placement of the Assistant within the virtualization stack. Observe that Dom0 still exists, except that it is very thin. It still hosts device drivers for all the virtual machines running on the host, but the Assistant handles the most critically time-sensitive component of the modified virtualization stack: transferring pages into and out of the PPAS. Another motivation behind this design is that the machines in our cluster have two Network Interfaces, so we can dedicate one interface to the Assistant and one to Dom0. Dom0 still handles regular network traffic out of individual virtual machines. Thus, the Assistant can be scheduled independently by the hypervisor and requires far fewer context switches into Dom0.

The final piece in Figure 6.4 is the design of the page-delivery and page-fault communication paths. The sequence of steps taken by CIVIC at this level is as follows:

1. When the host first boots up, Dom0 will start up the Assistant.

2. If there are any non-CIVIC dependent VMs, they can be started simultaneously as

well.

3. Next, one or more HVMs are created. Almost immediately, they will begin filling up

their respective caches.

4. At some point a shadow-level page-fault exception occurs to a page that is not in the

cache during runtime. A page is allocated for the missing page in the HVM.

5. The CIVIC-enabled hypervisor induces the faulting HVM to block execution on the Virtual CPU that caused the exception by de-scheduling that Virtual CPU.

6. The hypervisor puts an entry into a piece of memory that is shared with the Assistant

and delivers a Virtual Interrupt to the Assistant. If the cache for the faulting HVM is

full, then a victim is chosen from the cache and one or more additional entries are

put into the shared memory.

Figure 6.4: Internal CIVIC architecture: An Assistant VM holds two kernel modules responsible for mapping and paging HVM memory. One module directly (on-demand) memory-maps portions of PPAS #2, whereas MemX does I/O. A modified, CIVIC-enabled hypervisor intercepts page-faults to shadow page tables in the RAS and delivers them to the Assistant VM. If the HVM cache is full, the Assistant also receives victim pages.

7. The mmap() module in the Assistant receives the interrupt through a kernel-level

interrupt handler and proceeds to memory-map the faulting pages and victim pages

on-demand.

8. The mmap() kernel module submits one or more I/O operations to the MemX client

kernel module, which then uses the RMAP protocol to read or write the corresponding

page frames to and from the network.

9. When the I/O is complete, the Assistant invokes a CIVIC-specific hypercall to notify

the hypervisor that the fault exception has been fixed up.

10. The hypervisor un-blocks the faulting virtual CPU and schedules it for execution and

the HVM continues until the cycle repeats itself.

All in all, there are four primary sources of latency in the path of an individual page frame: 1) the Virtual IRQ notification to the Assistant, 2) the time it takes for MemX to store/retrieve pages to and from the network, 3) the time it takes to fix up the exception and re-schedule the Virtual Machine after the Assistant notifies the hypervisor, and 4) the time it takes to evict pages out of the PPAS cache.

One additional thing to note regards the design of the mmap() module inside the Assistant: this is a kernel module responsible for directly mapping page frames located within the cache of a CIVIC-dependent HVM guest (in order to hand them over to MemX).

Recall that the page numbers located within the virtual machine’s PPAS are contiguous: all page frame numbers in the PPAS are sequentially chosen at startup and do not change

(unless ballooning is activated). At first glance, one might simply choose to memory-map the entire PPAS of the virtual machine during startup and get rid of this module altogether. This is not possible because, despite the fact that the PPAS is contiguous, the frames backing the PPAS are not all resident - only a subset of the PPAS is actually in the cache. As a result, at any given time during the execution of an HVM, pages are evicted from and populated into the HVM's cache. When pages are re-populated, new RAS-level frame numbers (MFNs) are chosen. There is absolutely no guarantee that the MFN for the corresponding PFN (in the PPAS) is the same as it was before the page was evicted; in fact, we can almost guarantee that it will not be the same frame number. Thus, if we were to memory-map the entire PPAS in the beginning, the majority of those mappings would be invalidated in the near future as each page in the PPAS is victimized. So, the mmap() module in the Assistant maps those pages in an on-demand fashion at fault time. We have optimized this module to batch as many pages as possible when this occurs (since these mappings require an additional hypercall to complete). There are two opportunities for batching memory mappings: first, any N faults on N virtual CPUs (across all HVMs) can be batched simultaneously; second, all pages that are pre-fetched into the cache (which we will discuss later) can also be batched. This allows us to perform this mapping very quickly, with little overhead during the paging process.
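The batching of on-demand mappings can be sketched as follows. This is a toy illustration under stated assumptions: map_batch and the single "hypercall" counter are hypothetical, and the real module maps foreign HVM frames through Xen-specific interfaces rather than a plain array.

#include <stdio.h>

#define MAX_BATCH 64

static int hypercalls_issued = 0;

/* Map a whole batch of PFNs with one (simulated) hypercall instead of
 * one hypercall per page.                                             */
static void map_batch(const unsigned long *pfns, int count)
{
    hypercalls_issued++;                 /* one hypercall for the batch */
    for (int i = 0; i < count; i++)
        printf("mapped pfn %lu on demand\n", pfns[i]);
}

int main(void)
{
    unsigned long batch[MAX_BATCH];
    int n = 0;

    batch[n++] = 100;                    /* the faulting page            */
    for (unsigned long p = 101; p <= 108; p++)
        batch[n++] = p;                  /* pages chosen by prefetching  */

    map_batch(batch, n);
    printf("%d pages mapped with %d hypercall(s)\n", n, hypercalls_issued);
    return 0;
}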

At the cluster level, we employ MemX, as described in Chapter 4. MemX is a kernel-to-kernel distributed memory system designed for low-latency memory access. The same source code for the MemX kernel module is loaded into the Assistant and automatically detects available memory servers in the cluster.

Once a CIVIC host is up and running, the administrator can choose any cluster design they like, such as the one illustrated in Figure 6.5. In this example, we have a cluster of virtual machines, where Hypervisors B and C host MemX servers. MemX is flexible enough that it can be loaded anywhere; the servers need not be virtualized at all, but we illustrate it this way for completeness.

Figure 6.5: High-level CIVIC architecture: unmodified CIVIC-enabled HVM guests have local reservations (caches) while small or large amounts of their reservations actually expand out to nearby hosts.

6.2.5 Future Work: Page Migration, Sharing and Compression

This design of CIVIC has many areas of possible improvement. In the next chapter, we provide a discussion of several interesting new systems that can be built on top of CIVIC. Here, we discuss some of the more obvious improvements to the base system implemented in this dissertation:

At some future time, one or more VMs will run a large-memory application and become active. At that point, the collective unused memory of various nodes will become a partial global cache for the pages from the active node. (Caching nodes may well become active themselves.) This next stage for CIVIC might involve migrating globally cached pages out over the network into global caches on other hosts - should the cache's host node need that space for its own local cache. There are a handful of algorithms to support this type of page migration, involving both greedy approaches [31] and approaches that use hints about access behavior [86]. These approaches are typically applied in the context of a file system, where they are called "cooperative caching", and some of these "eviction" techniques could be applied to virtual machine memory just as well. A survey of them can be found in [82]. This kind of caching involves allowing neighboring nodes to cache potentially stale page frames in local memory. The intelligence in such a system involves developing a coordinated algorithm that allows multiple nodes to decide which pages to keep in the global cache and which pages to evict. Example systems from the 1990s include ds-RAID-x [55], TickerTAIP [25], xFS [16], and Petal [61].

Page-granular cache migration. Consider the case of a single oversubscribed HVM Guest (A), as depicted in Figure 6.6, a portion of whose physical memory is backed by one remote MemX server on another host across the network. We propose the following multi-stage path for an individual page frame:

1. VM to local: A third party (such as the Assistant) decides to evict a used page out of

the VM’s cache into a local cache. Such a local cache does not exist in CIVIC right

now, but would be easily implemented by putting a central node-local cache within

the Assistant itself that held evicted cache pages from the PPAS caches of individual

HVMs.

Figure 6.6: Future CIVIC architecture: a large number of nodes would collectively provide global and local caches. The path of a page would potentially exhibit multiple evictions from Guest A to local to global. Furthermore, a global cache can be made to evict pages to other global caches.

2. Local to global: Potentially, the Assistant-cached page, based on some recency

information, can then be evicted into the global cache on another physical host simply

by pushing it out of the MemX client.

3. Global to global: Next, based on page-migration heuristics mentioned in related

work, the system may potentially move the page from one global cache to another

global cache, depending on how it is aged. This is already partially supported by

MemX, based on existing support we wrote that allows MemX servers to shut themselves down for maintenance by re-distributing their memory to nearby servers.

4. Global to backing store: Should the sum total of all of the local and global caches

max out the available physical memory in the cluster, a third-party disk (either centralized or distributed) should be available as a backing store.

5. Page Fault: Eventually at some time, a page fault will occur on a cached page,

at which point the page must be located and brought back in to the VM’s physical

address space. When this happens, the Assistant would be responsible for invoking

MemX to bring the page back in from one or more caches.

In fact, a page-fault can happen on a page that has been evicted at nearly any level of this caching hierarchy. Recent work that relates to the functionality of the local cache component of CIVIC was published in [79], where they perform “hypervisor caching”. In this work, they evaluate the efficiency and performance of using the hypervisor to present a local pool of reserve pages to the VM. One could think of CIVIC as an extension of this work into the cluster. This completes the basic operation of CIVIC.

Page Sharing and Compression. An ideal cluster-level use of CIVIC would be to employ two techniques for reducing data duplication throughout the cluster. The first technique is called content-based page sharing, initially proposed in the Disco system [23] and used in the context of VMware's ESX server [96] as well as Difference Engine [33] and the Satori project [43]. These systems allow multiple virtual addresses (or virtual page frame numbers in the context of VMs) to refer to the same physical location in memory. Reductions in memory utilization of up to 40% have been reported. This is not surprising in the context of VMs, because even an under-provisioned physical server with a handful of guest VMs will share many copies of binary executables, common libraries, and potentially similar parts of network filesystem data that gets cached on access. In the context of CIVIC, the opportunities for page sharing are increased even further due to our collective use of indirect caching across multiple hosts. Furthermore, some recent work in 2005 [95] provides a compelling case for the advantages of compressing physical page frames in an operating system. We propose to apply these techniques at the VM level in combination with CIVIC.

6.3 Implementation

Here we describe the low-level hurdles that were overcome to get CIVIC working and integrated with MemX. In the next chapter, we provide a better outline of the missing features that could be expanded into more widespread systems. All in all, the CIVIC code base comprises about 5000 lines in the Xen hypervisor, which effectively doubles the size of the entire MemX code base. It does not require any patches whatsoever to the Dom0 kernel or the HVM guest. There are about 400 lines of new code to support CIVIC in the Xen Daemon, and the rest of the code is entirely inside the hypervisor.

We've implemented the system within Xen 3.3.0. The Assistant runs XenLinux (para-virtualized) 2.6.18 as usual, but the HVM guests in the performance section are completely unmodified; they run out-of-the-box Fedora Core 10 installations. We've also run HVM guests running OpenSolaris. Microsoft Windows will actually start up, but due to some shadow-paging-related bugs in the hypervisor, it stops at the login screen. All of the HVM guests in this chapter use a single virtual CPU; due to time and manpower constraints, an SMP implementation of CIVIC is not yet complete. Currently, the system is fully implemented except for the more contemporary features described in the previous section's Future Work.

6.3.1 Address Space Expansion, BIOS Tables

In order to transparently provide an "oversubscribed" view of the PPAS to the Virtual Machine when it first boots up, we must "lie" about the actual amount of DRAM that the HVM thinks is available, by significantly expanding the size of the PPAS while keeping the size of the RAS cache small. With modern hypervisor technology there are actually multiple ways to lie to virtual machines, but none of them are completely transparent to HVM operation. Furthermore, none of them allow the size of the RAS to be different from the size of the PPAS.

One way to partially expand the PPAS is to use ballooning, which we discussed previously in Chapter 5. Ballooning allows you to increase and decrease the physical memory of an HVM, but this still requires that an equivalent amount of actual DRAM inside the RAS be statically mapped to that memory. Another way to expand the PPAS is to use memory hot-plugging. Some operating systems are capable of receiving ACPI upcalls when new DRAM is available; they can then add the new DRAM to the kernel and make it available for allocation by processes. A summary of how this could be supported in Xen can be found in [88]. Similarly, memory must be removed in a physically contiguous manner within the PPAS. Both of these solutions, however, require direct participation and modification of the virtual machine.

In order to avoid these difficulties, CIVIC instead oversubscribes the PPAS at boot time of the HVM. Normally, a physical machine determines the amount of available DRAM by reading a page of memory populated by the BIOS during boot up, called an ”e820” page.

This page contains a list of the various usable and reserved areas of physical memory that the Operating System must manage. After the BIOS populates this list, the OS reads the list and initializes all of its own data structures during start of day before other processes in the system begin to run.

For a virtual machine, there is no longer a physical BIOS - but a virtual one. Thus, the e820 page containing the list of usable memory ranges is virtualized. To do this, CIVIC constructs the e820 page with an artificial list of memory ranges that are available at the time the HVM is started, while taking into consideration how much memory is actually available on the host. Normally, the amount of memory listed in the e820 page is equal to the cache size that Xen allocates for the HVM (meaning that what would have been the RAS cache in our system is normally just a flat RAS that is equal in size to the PPAS). To oversubscribe the HVM, we patch the Xen Daemon to increase the size of the usable memory ranges in the e820 page based on an additional configuration parameter in the HVM's guest configuration file. This modification immediately takes effect for any kind of operating system, since reading the e820 map is a standard requirement for an OS to boot up. This is how CIVIC bootstraps the oversubscription process, and it works quite well. Furthermore, the semantics of memory seen by the hypervisor are preserved, including the initial pre-allocation of memory: the Xen Daemon instructs the hypervisor to allocate only as much memory as the configuration file specifies for the cache size. The HVM then proceeds to boot up and begins filling its cache as it faults on pages that are not yet present in the cache.
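The idea of inflating the advertised memory can be illustrated with a short, self-contained sketch. The entry layout below follows the standard e820 record format (base, length, type), but the builder function and its simple split around the legacy 640K-1M hole are assumptions made for illustration; this is not the actual Xen Daemon patch.

#include <stdio.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2

struct e820_entry {              /* standard e820 record layout */
    uint64_t base;
    uint64_t length;
    uint32_t type;
};

/* Build a toy e820 map that advertises 'advertised_mb' of RAM to the
 * HVM, regardless of how small its resident cache allocation is.  The
 * split around the legacy 640K-1M hole is a simplification.           */
static int build_e820(struct e820_entry *map, uint64_t advertised_mb)
{
    uint64_t top = advertised_mb << 20;
    int n = 0;

    map[n++] = (struct e820_entry){ 0x00000000, 0x0009F000, E820_RAM };
    map[n++] = (struct e820_entry){ 0x0009F000, 0x00061000, E820_RESERVED };
    map[n++] = (struct e820_entry){ 0x00100000, top - 0x00100000, E820_RAM };
    return n;
}

int main(void)
{
    struct e820_entry map[8];
    /* Hypothetical guest config: 512 MB resident cache, 4096 MB advertised. */
    int n = build_e820(map, 4096);

    for (int i = 0; i < n; i++)
        printf("e820: %#010llx - %#010llx  type %u\n",
               (unsigned long long)map[i].base,
               (unsigned long long)(map[i].base + map[i].length),
               map[i].type);
    return 0;
}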

6.3.2 Communication Paths

The hypervisor's primary interactions during the slow path for page-fault handling and cache evictions are with the Assistant VM. In general, there are two ways to communicate between the hypervisor and a para-virtualized VM such as the Assistant or Dom0:

• VM-to-hypervisor, Hypercalls: The hypercall API available to a VM is vast. Hypercalls can be invoked by any code running with kernel-level privileges in a guest virtual machine.

• Hypervisor-to-VM, Virtual IRQs: These kinds of IRQs are the virtualized equivalent of real ones, with the exception that a few "new" IRQs are imposed by the hypervisor for other reasons, such as alternate consoles, VM-to-VM communication, inter-processor messages, and more.

CIVIC uses a combination of both when talking to the Assistant. During the start-of-day, the Assistant is asked to set up a common piece of shared memory for each HVM guest.

This memory is shared only between the Assistant and the hypervisor and stores a fixed number of descriptors that represent which pages are to be evicted from or faulted into the cache. At the moment, this memory holds around 2048 descriptors, using 32 Kilobytes of memory. Through empirical experimentation, this was a sufficient number of descriptors required to maintain maximum throughput on a Gigabit Ethernet switched network.

CIVIC maintains this memory using a one-way, half-duplex producer consumer rela- tionship. The hypervisor is the sole producer of descriptors and the Assistant is the only consumer - it does not pro-actively bring in pages in our out of the HVM’s cache unless it is instructed to do so by reading a descriptor from shared memory. We later experimented with a circular ring model with the assumption that we would get added concurrency, by allowing the Assistant to asynchronously remove descriptors from shared memory while adding completed descriptors back onto the ring after MemX had completed the I/O. Due to the fact that paging is only half-duplex, this proved to be equivalent to the simpler im- plementation we used initially because of our use of prefetching, described in the next section. Although a complete proof would be required, we empirically observed no differ- CHAPTER 6. CIVIC: TRANSPARENT OVER-SUBSCRIPTION OF VM MEMORY 130 ence between the use of an asynchronous, circular request/response notification versus a synchronous one, so we stuck with the synchronous model. If CIVIC were to allow new pages into the HVM cache that it did not initiate (say, due to our previous future work design where other hosts could initiate global-to-global page cache transfers inde- pendently), then an asynchronous model would be mandatory. This is similar to the way the netfront/netback asynchronous rings work in Xen right now discussed in the last chap- ter, since the network has a natural full-duplex relationship where a receiver on the other end of a socket can independently send data at the same time the sender does. Thus, for

CIVIC, once the hypervisor has added descriptors to the ring, a Virtual IRQ is delivered to the Assistant. Once the Assistant is done, a hypercall is made to the hypervisor to signal completion of those descriptors. With more man-power, a fully asynchronous design can be implemented in the future, for example, if we were to go to a future 10-Gigabit ethernet network.
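As a rough consistency check on the numbers above, 32 Kilobytes divided by 2048 descriptors leaves 16 bytes per descriptor. The sketch below models the synchronous, half-duplex cycle with such a 16-byte descriptor; the field layout and the function names are assumptions made for illustration, not CIVIC's actual format.

#include <stdio.h>
#include <stdint.h>
#include <assert.h>

/* A 16-byte descriptor: 2048 of these fit in the 32 KB shared area. */
struct desc {
    uint64_t pfn;        /* which PPAS frame                           */
    uint32_t op;         /* 0 = fault-in, 1 = evict                    */
    uint32_t hvm_id;     /* which guest the frame belongs to           */
};

#define NUM_DESC 2048
static struct desc shared[NUM_DESC];    /* hypervisor <-> Assistant    */
static int produced = 0;                /* hypervisor is sole producer */

static void hypervisor_produce(uint64_t pfn, uint32_t op, uint32_t hvm)
{
    if (produced < NUM_DESC)
        shared[produced++] = (struct desc){ pfn, op, hvm };
}

/* The Assistant only acts when told to: it drains every pending
 * descriptor (handing each to MemX) and then signals completion with
 * a single hypercall - the synchronous, half-duplex model.            */
static void assistant_consume_and_complete(void)
{
    for (int i = 0; i < produced; i++)
        printf("memx: %s pfn %llu (hvm %u)\n",
               shared[i].op ? "write" : "read",
               (unsigned long long)shared[i].pfn, shared[i].hvm_id);
    produced = 0;                       /* the "completion hypercall"  */
}

int main(void)
{
    /* Sanity check of the 16-byte / 32 KB arithmetic (typical ABI).   */
    assert(sizeof(struct desc) == 16 && sizeof(shared) == 32 * 1024);

    hypervisor_produce(4096, 0, 2);     /* fault-in for HVM #2  */
    hypervisor_produce(777, 1, 2);      /* evict a victim page  */
    assistant_consume_and_complete();
    return 0;
}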

6.3.3 Cache Eviction and Prefetching

Central to maintaining HVM performance is CIVIC's ability to manage the PPAS caches scalably. Some papers mentioned earlier [79, 44] present approaches (although not always obvious ones) to doing recency detection: they use a para-virtual approach. The basic idea is to modify the kernel in a minimal fashion such that a third party is notified when a page's allocation status changes (from used to free or back again). This could be useful for us in that once a third party (say, the Assistant) chooses to evict a page from the VM, it can use the modified kernel interface to receive notifications of future deallocations of that page. This allows the cache to know when it is safe to clear that page once it is no longer used. There is, however, one remaining problem with the approach: pinning evicted pages for future page faults (or cache hits). Note that an evicted page is a used page; an evicted page must be pinned in such a manner that the cache is notified when the kernel needs the page back. [79] proposes the following: during eviction from the VM address space, the page is locked for I/O, meaning that from the kernel's perspective, someone else is using the page. After eviction from the PPAS cache, when the kernel needs the page back in the future it will probe the lock (which is generally a semaphore). Attempts to access that lock are delivered to the hypervisor and the page is released from I/O status. For CIVIC, however, our goal is to maintain maximum transparency, and since we now have a page-fault interception system in place, we no longer need para-virtual support from the operating system.

Recency Detection. When it is time to choose victim pages, however, it is not simply a matter of maintaining one or more Least-Recently-Used lists within the hypervisor and keeping them updated. In this dissertation we do not explore complex page-frame reclamation policies; there is a large body of work in the literature that does this already. The eviction policy used for CIVIC at the moment is a simple first-in, first-out (FIFO) queue. When the cache is full, pages are evicted from the front of the FIFO; when page-faults occur, pages are allocated onto the end of the FIFO. A proper characterization of the type of eviction scheme required to improve upon a FIFO in the context of multiple, concurrently-running virtual machines will be mandatory in a future incarnation of CIVIC.

Prefetching. During implementation, one of the bigger obstacles to a stable implementation was "capacity" cache misses. These types of cache misses occur when the HVM is just booting up for the first time or when the HVM has just forked off a large application that is (mostly) sequentially touching large amounts of memory at once. These two common usage scenarios required a basic prefetching implementation to be placed into CIVIC. Pre-fetching (a survey of which can be found in [77]) and pre-paging (from Chapter 5 on VM migration) are well-explored concepts in computer science. Our pre-fetching implementation is a simple stride-prefetching algorithm. We will not cover all 4 states of stride prefetching, but describe the pseudo-code of our algorithm for detecting page-fault behavior.

Pseudo-code for our CIVIC-level prefetching algorithm is shown in Figure 6.7. CIVIC's prefetching first maintains a "window size" of allowed page faults in powers of two, starting with an initial value of one. Each time the prefetching algorithm is invoked, we update the window based on the location of the last page-fault, to adapt to how large or small the current stride of pages actually is. During any given page-fault, we ask the following question:

let miss_count := 0
let window := 1

AdjustWindow(PFN):                               // PFN is in the PPAS
    let last_fault_pfn := current_fault_pfn
    let current_fault_pfn := PFN
    let last_miss_count := miss_count

    if PFN <= last_fault_pfn                     // precedes the window
        miss_count++
    else if PFN > (last_fault_pfn + window * 2)  // past a future window
        if PFN <= (last_fault_pfn + window * 4)
            miss_count := 0                      // "tolerated" hit
        else
            miss_count++
    else                                         // hit: double the window
        window := min(window * 2, max_shared_memory / 2)
        miss_count := 0

    if miss_count != 0 and last_miss_count != miss_count
        if miss_count > 1                        // halve the window
            window := max(window / 2, 1)
        return 1                                 // do not prefetch

    return 0                                     // go ahead and prefetch

PrefetchFault(PFN):
    let next_pfn := PFN + 1
    let space := max_shared_memory - available_shared_memory

    if AdjustWindow(PFN) == 1                    // adjust the window
        return

    while (space >= 2) and ((next_pfn - PFN) <= window)
        if next_pfn is resident
            next_pfn++                           // skip pages already cached
            continue
        if cache is full                         // need space to prefetch
            kick out a victim page
            space--
        fetch(next_pfn)                          // bring page into cache
        space--
        next_pfn++

Figure 6.7: Pseudo-code for the prefetching algorithm employed by CIVIC. On every page-fault, this routine is called to adjust the window based on the spatial location of the current PFN address in the PPAS.

If we were to double the size of the prefetching window, would the current fault location be spatially close to the last fault?

This question has several different cases:

1. If the current fault is spatially forward (determined by PPAS frame number) of the last

fault but still outside of the current PPAS window, BUT spatially within two future

window increases (i.e. window size times four), then we will accept the fault for now

but signal a cache miss. The window will stay the same in this case. We call this a "tolerated" hit.

2. If we get a hit or a tolerated hit, then we reset the number of sequential misses.

The number of sequential misses is simply a counter of how many page-faults have

landed outside of the PPAS window consecutively.

3. If the fault is outside of two proposed window increases, then we consider that a miss

and we cut the window in half. This contributes to a possible future sequential miss.

4. If the fault is spatially behind the last fault, then we consider that a non-sequential

miss, but we do not cut the window in half (yet).

5. If we miss once (non-sequentially), we still prefetch within the current window.

6. If we miss twice or more in a row (in both the forward direction and backwards di-

rection), then we consider that a sequential miss, and we halve the window and avoid

servicing the fault at all.

7. Never (at any time) do we reset the window completely.

The main purpose of this algorithm is to be able to capture very highly sequential page-faults or very highly random page faults and adapt accordingly:

• If the next fault is purely sequential, then it will appear to be in the proposed window

because the previous fault did not actually fill the window with pre-fetches since the

last fault itself was part of the window - which is a nice side-effect of the algorithm.

• If the next fault is purely random, it will also get caught because the miss count

will continue to increase and will only stop increasing when there is a hit within a

proposed window increase (times two) as if it were a ”tolerated” prefetch.

All in all, this algorithm "slides" the window up and down but never resets it altogether, effectively implementing a sort of rough hysteresis. Empirically, the observed prefetching window tends to hover around 256 pages. The maximum value of the prefetching window is capped at the size of the I/O communication memory that is shared with the Assistant VM, described in the previous section.

6.3.4 Page-Fault Interception, Shadows, Reverse Mapping

There are two primary data structures used by CIVIC for page-fault interception:

(1) The PPAS table (or "p2m" map in Xen terminology): This table is also a page-table-like structure which provides a direct translation between the PPAS and the RAS, whereas the shadow tables provide translations between process virtual address spaces and the RAS.

(2) The PPAS bitmap: This is a CIVIC-owned structure for determining if pages are resident in RAM or not. CIVIC maintains this as a contiguously allocated bitmap, with one bit for every 4-kilobyte page in the HVM's entire PPAS. The Assistant does not see this structure. At the moment, the PPAS table is being more heavily used for other purposes in the Xen hypervisor and cannot yet be overloaded to hold the PPAS bitmap, but we anticipate using the PPAS table exclusively in the future to store page residency.
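A minimal sketch of such a residency bitmap is shown below. The bit polarity is an assumption on our part (a set bit marks a page as paged out, which matches the fault and eviction sequences that follow, where faults clear the bit and evictions set it); the helper names are likewise hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per 4 KB PPAS page.  Assumed polarity: bit set => the page
 * has been paged out to MemX; bit clear => resident (or "special").   */
struct ppas_bitmap {
    uint8_t *bits;
    unsigned long npages;
};

static struct ppas_bitmap bitmap_alloc(unsigned long npages)
{
    struct ppas_bitmap b = { calloc((npages + 7) / 8, 1), npages };
    return b;
}

static void mark_paged_out(struct ppas_bitmap *b, unsigned long pfn)
{
    b->bits[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

static void mark_resident(struct ppas_bitmap *b, unsigned long pfn)
{
    b->bits[pfn / 8] &= (uint8_t)~(1u << (pfn % 8));
}

static bool is_resident(const struct ppas_bitmap *b, unsigned long pfn)
{
    return !(b->bits[pfn / 8] & (1u << (pfn % 8)));
}

int main(void)
{
    /* A 4 GB PPAS needs 4 GB / 4 KB = 1M bits = 128 KB of bitmap. */
    struct ppas_bitmap b = bitmap_alloc(1u << 20);

    mark_paged_out(&b, 12345);           /* eviction sets the bit    */
    printf("pfn 12345 resident? %d\n", is_resident(&b, 12345));
    mark_resident(&b, 12345);            /* fault handling clears it */
    printf("pfn 12345 resident? %d\n", is_resident(&b, 12345));
    free(b.bits);
    return 0;
}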

The tricky part in determining whether pages are resident in RAM is that there are a number of "special" memory ranges that Xen puts into the guest's memory space that are reserved and only used by Xen, including things like IOAPIC pages, the MMIO space used for device emulation through the QEMU emulation engine (whose location changes depending on whether the HVM has more than 4GB of memory), pages for VGA support, and a handful of pages used for Xen-specific purposes. These "special" pages represent memory that does not actually exist in the RAS of the HVM itself, but is still reserved in the PPAS of the HVM. Even though these pages do not belong to the HVM, they are still accessed by the HVM. Thus, CIVIC must be very careful not to evict these pages or allow page-faults to occur on them. To address this, all of the special pages are marked "present" (resident) in the PPAS bitmap at the HVM's start-of-day.

All other pages are assumed to be not present in this bitmap and will immediately result in page-faults at least once. When victim pages are chosen, CIVIC takes care not to evict these special pages that are not originally owned by the HVM.

The lower-level sequence of events during a page fault, using these three data structures (the shadow tables, the PPAS table, and the PPAS bitmap), is as follows:

1. The CPU generates a page-fault exception.

2. The hypervisor consults the shadow-paging API.

3. First, CIVIC analyzes the faulting PFN and checks the PPAS bitmap to see if we were

responsible for the exception due to having swapped out that page.

4. If the page is resident in the PPAS bitmap, then we’re done. Otherwise, we allocate

a free page and contact the Assistant for retrieval of the page.

5. When the Assistant is done, we clear the bit in the PPAS bitmap and insert the

missing page into the PPAS table used by Xen. This has the effect of increasing the

cache size as seen by Xen and as seen by dom0 for the purposes of determining

how much memory is available on the host when creating new Virtual Machines.

6. Afterwards, the regular Shadow-Paging process kicks in and Xen retrieves the appropriate shadow copy of the faulting page table entry owned by the HVM.

7. Execution continues.

The lower-level sequence of events during a page eviction is as follows:

1. When CIVIC determines that the cache is full (say, during an eviction or during

bootup), a reference to a victim page is removed from the PPAS table. This has the

effect of decreasing the cache size and freeing the actual page in the RAS pointed to

in the PPAS cache. Dom0 will also see at this point that the amount of free memory has increased.

2. The bit for the PFN of the victim page is set in the PPAS bitmap. This will signify to

CIVIC in the future that the page is no longer resident.

3. The Assistant is contacted and instructed to write the page to the network. Once the Assistant is done, the hypervisor reschedules the HVM for execution.

4. Execution continues.

Gotchas: Note that when CIVIC evicts a page, the HVM may *free* that page in its own PPAS without necessarily writing to it (which would normally cause a page-fault). What tends to happen subsequently is that the HVM kernel reallocates that page in the PPAS for some new process or kernel thread, which causes a page-fault in the near future at a completely different RAS page location. This happens for both data pages and page-table pages. Not only does it happen often, but for page-table pages it can happen at any level of the 4-level page table hierarchy. Thus, we have to make sure to detect this sort of "recursive" page fault and re-populate the PPAS table appropriately. Furthermore, there are also times when Dom0 accesses page frames of the HVM guest asynchronously

(when setting up DMA accesses). These kinds of accesses do not cause shadow-paging faults on the CPU that would normally be recognized by CIVIC. Instead, they are seen as para-virtualized accesses to non-resident CIVIC pages. When we see these kinds of accesses in the hypervisor, we simply allocate a free page for the missing page and return control. Our observation is that 100% of these accesses are fresh writes for I/O and that any old data that might have been paged out is actually old DMA memory that is being reused. A proper solution would be to further paravirtualize Dom0 to tell CIVIC what these pages are so that they do not get paged out in the first place, but our current approach maintains transparency even from Dom0 itself, since it does not require any further kernel modifications.

Reverse-Shadow Page Table Mappings. A surprisingly sneaky problem we observed during the implementation of CIVIC was one experienced by nearly all paging implementations in modern operating systems: the reverse mapping problem. In an operating system, all processes get their own set of page tables, but are allowed to memory-map shared pieces of memory and shared files for a multitude of reasons during regular operation.

When a page is being evicted, the OS must make sure that multiple references to the same page across all page-table entries in the OS are removed before the page can be safely freed. CIVIC has exactly the same problem, but at the shadow paging level.

Normally, a solution to this would require searching every page table entry in every shadow of the HVM. Linux solves this by maintaining reverse pointers to those page table entries on an object-level basis (to reduce memory usage of the reverse pointers).

Similarly, we implemented a (maximum) 2-PTE reverse mapping solution within the hypervisor: when a new shadow page table is created, the page structure for the referenced frame (maintained in a global list) is populated with the address of the parent shadow page-table entry. Up to two parent shadow page table reverse pointers are maintained in the global page structures owned by Xen (any more than that are ignored). During page eviction, if the reference count of the evicted page is less than or equal to 2, then CIVIC uses the reverse pointers to clear out the PTEs pointing to the victim page. If the reference count is greater than 2, then Xen goes forward with a brute-force search of all the remaining shadows not covered by the reverse pointers. Empirically, we observed a significant improvement in page eviction speed for highly sequential workloads where new pages are being written frequently; without this optimization, CIVIC eviction slowed almost to a crawl relative to the base case where the HVM did not use CIVIC at all. Note that this optimization does not improve upon regular HVM performance, but only normalizes CIVIC performance to be equal to the base case where all of the HVM's pages are always in RAM.
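The essence of the bounded reverse map can be captured in a few lines of C. The structure below is a simplified model, not Xen's real page bookkeeping: each frame tracks up to two parent shadow PTE locations plus a total reference count, and eviction falls back to a brute-force scan (stubbed out here) when the count exceeds two.

#include <stdio.h>

#define MAX_RMAP 2                      /* at most two back-pointers kept */

struct frame_info {
    unsigned long *rmap[MAX_RMAP];      /* parent shadow PTE locations    */
    int refs;                           /* total references to this frame */
};

static void rmap_add(struct frame_info *f, unsigned long *spte)
{
    if (f->refs < MAX_RMAP)
        f->rmap[f->refs] = spte;        /* remember the parent PTE        */
    f->refs++;                          /* beyond two, only count them    */
}

/* Stub for the slow path: scan every shadow table for stray PTEs.      */
static void brute_force_unmap(struct frame_info *f)
{
    printf("brute-force scan of all shadows (%d refs)\n", f->refs);
}

static void evict_frame(struct frame_info *f)
{
    if (f->refs <= MAX_RMAP) {
        for (int i = 0; i < f->refs; i++)
            *f->rmap[i] = 0;            /* clear the known shadow PTEs    */
        printf("fast eviction via %d reverse pointer(s)\n", f->refs);
    } else {
        brute_force_unmap(f);
    }
    f->refs = 0;
}

int main(void)
{
    unsigned long spte_a = 0xdead, spte_b = 0xbeef;
    struct frame_info f = { { 0 }, 0 };

    rmap_add(&f, &spte_a);
    rmap_add(&f, &spte_b);
    evict_frame(&f);                    /* fast path: both PTEs cleared   */
    return 0;
}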

6.4 Evaluation

An evaluation of CIVIC was done on our cluster of about 12 machines, each containing two AMD quad-core sockets with each core running at 1.7 GHz. For the curious, we have provided screenshots of a live CIVIC-enabled machine in Appendix A. Each of the machines in our cluster used to support CIVIC has between 12 and 16 GB of DRAM. The largest unmodified HVM guest we have run on a single host is upwards of 64 Gigabytes, spanning the entire 12-node cluster. This evaluation first provides a micro-benchmark breakdown of the communication paths in CIVIC, followed by a benchmark of 3 applications: the RUBiS Auction Benchmark [11], Matrix Multiplication (JASPA) [54], and Quicksort. RUBiS is a very popular Apache webserver benchmark that generates auction-like transactions in an eBay-like system; it is similar to the heavy benchmark, SpecWeb, used in Chapter 5. Quicksort and Matrix Multiplication are implemented as individual C programs; as in the previous chapter, one sorts a large array and the other multiplies a large, sparse matrix by itself. With these 3 applications we want to answer the following questions:

1. What is the latency of a page-fault in CIVIC?

2. During peak load, what kind of paging throughput can CIVIC deliver?

3. How much does an application slow down when using CIVIC?

4. What acceptable fraction of the PPAS can be cached without significant slowdown?

5. How well does prefetching mitigate the effects of caching?

6.4.1 Micro-Benchmarks

Page-Fault Latency. First, we begin with a few microbenchmarks while CIVIC is in operation for a Virtual Machine. Table 6.1 shows a breakdown of the latency of a page-fault in a CIVIC-enabled hypervisor (as illustrated in Figure 6.4). All of the latency measurements in the table were taken with the on-chip TSC register to provide microsecond-granular values. As expected, the largest component of the page-fault latency is the delivery to MemX within the Assistant VM; this latency is about 170 microseconds.

    Communication Path                   Latency
  1 Page is evicted on CPU fault          16 µs
  2 Virtual Interrupt to Assistant        10 µs
  3 Round-trip I/O to MemX               170 µs
  4 Hypercall completion to Xen           30 µs
  5 Rescheduling of the HVM                5 µs
    Total                                231 µs

Table 6.1: Latency of a page-fault through a CIVIC-enabled hypervisor to and from network memory at different stages.

In a non-virtualized Linux environment on our hardware, this round trip is typically around 130 microseconds, which is 30 microseconds faster than we achieved in Chapter 4 due to the use of newer hardware. On top of the base 130-microsecond non-virtualized RTT incurred by MemX, the para-virtualized Assistant VM adds an additional 40 microseconds due to the need to go through Domain 0. There is a clear motivation here to set up the pre-existing dual-NIC hardware available in our cluster to get rid of this additional latency.

However, there is also a noticeably large amount of latency in three other places: (1) the hypercall (downcall) used to signal that one or more completed page-fault acknowledgments have been placed into the shared memory area - just the time to invoke the hypercall for completion is around 30 microseconds; (2) the time it takes for the Assistant to be asynchronously scheduled and pick up the newly evicted page (or the page-fault notification) from shared memory is large; and (3) the first component - eviction time (the first row) - also takes a surprisingly long time. However, this is a significant improvement over what it used to take (around 1 millisecond) when the shadow paging code had to search through all the shadow tables before evicting a page out of the PPAS cache. Our reverse mapping implementation brought this down significantly. There is clearly some room for improvement here; we leave it to future work to bring these numbers down.

Page-Dirtying Throughput. Next, it is important to ensure that if an HVM's memory access behavior is heavy enough (whether sequential, random, or otherwise), CIVIC can at least sustain the maximum throughput of a gigabit ethernet NIC. To do this, we must establish the baseline VM memory access performance we expect from a non-CIVIC-enabled hypervisor. We establish this baseline by writing a C program that allocates a large amount of memory and passes over that memory in page-size increments. At each page, the program dirties one or more bytes of that page. We parameterize the program to dirty increasingly larger amounts of memory per page and graph the front-side-bus throughput that the HVM is able to sustain while dirtying memory.

Figure 6.8: Page Dirtying Rate (Mbps, versus bytes dirtied per page) for different types of Virtual Machines, including HVM Guests, Para-virtual Guests, and different types of shadow paging. This includes the overhead of creating new page tables from scratch.
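A minimal sketch of such a dirtying loop is given below. The buffer size, fill value, and exact throughput metric are simplifications, not the actual benchmark program used for these runs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SIZE 4096

/* Touch 'bytes_per_page' bytes in every page of a 'size'-byte buffer and
 * report an approximate dirtying rate for the bytes actually written. */
static void dirty_pass(size_t size, size_t bytes_per_page)
{
    char *buf = malloc(size);
    if (!buf)
        return;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t off = 0; off < size; off += PAGE_SIZE)
        memset(buf + off, 0xAB, bytes_per_page);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mbits = (double)(size / PAGE_SIZE) * bytes_per_page * 8.0 / 1e6;
    printf("%4zu bytes/page: %.1f Mbps\n", bytes_per_page, mbits / secs);
    free(buf);
}

int main(void)
{
    for (size_t b = 8; b <= PAGE_SIZE; b *= 2)
        dirty_pass(512UL << 20, b);    /* one 512 MB pass per parameter */
    return 0;
}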

Figure 6.8 presents the four scenarios that we are interested in for this program. The first plot is the fastest dirtying rate we can expect from an HVM guest without the use of CIVIC. (The only CIVIC plot is at the bottom.) The first plot employs both the shadow optimizations (level 3) discussed at the beginning of this chapter as well as processor support for nested page tables. Surprisingly, this configuration actually does better than a para-virtualized operating system, shown in the second plot. The baseline that we will compare against in the next section is shown in the third plot, at shadow level 1 (SH1).

This was the first shadow paging implementation released by the Xen community. We removed the later optimizations to show how CIVIC performs against just the basic shadow paging code, with none of those efficiencies and without processor support for walking page tables without trapping into the hypervisor. This plot's paging rate is just slightly above that provided by gigabit ethernet. And finally, the fourth plot is the same as the third plot except that we are now using a CIVIC-enabled hypervisor; thus, the paging rate is limited to that of the network card. An interesting note across all plots is that they converge to the same throughput when all of the bytes of each page are touched. So, if the HVM guest is heavily memory intensive, it does not matter which configuration you choose - and in that case CIVIC will also provide comparable performance.

Figure 6.9: Bus-speed Page Dirtying Rate in gigabits-per-second (versus bytes dirtied per page). This is line-speed hardware memory speed once page tables have already been created, and shows throughput an order of magnitude higher than the previous graph.

Figure 6.9 presents the same graph as the previous one, except that the page tables backing the memory being dirtied have already been created. In this case, the para-virtualized guest wins over the SH3/HAP virtual machine, and the SH1 and SH3 virtual machines perform about the same. Nevertheless, CIVIC throughput does not change, since we are still using a 1-gigabit NIC.

There is, however, a case here for investing in 10-gigabit ethernet hardware, perhaps in a production environment. If the Assistant could transmit at 10 GigE, then we would (theoretically) see CIVIC performance rise accordingly to match the other three plots, no matter how memory intensive the application within the HVM actually became. Given that one of the primary goals of this system is higher consolidation, a single 10 GigE NIC would not only be available for the Assistant, but would simultaneously be available to the potentially higher number of virtual machines sharing the same physical host.

6.4.2 Applications

Next we will show performance results for each of the aforementioned applications. The HVM that runs these applications is configured to be relatively small in comparison to the hardware we have available. To give some perspective, the HVM that we use is similar in configuration to the one labeled "HVM #3" depicted in Figure 6.2. The HVM in our experiments is a stock 64-bit Fedora 10 Linux distribution, with no changes, running on top of a CIVIC-enabled Xen hypervisor. It is configured with 4 GB of DRAM on a 12 GB host, but the cache size used underneath the HVM varies by experiment, ranging from 128 MB to 512 MB.

Evaluating how big this cache size should be is much different than the ad-hoc rules used when choosing the size of, say, your swap partition in a vanilla Linux distribution. Our HVM has internal swap disabled and does not do any internal paging. This means that all of the HVM's page frames are either in the PPAS cache or out on the network, but not both (since we do not have a second-level cache yet). We would expect that choosing the cache size would ideally be an automated decision: for example, on our 12 GB host, if we had twelve 1 GB caches (one HVM per cache) and we wanted to create a 13th HVM, then we would modify the Assistant to instruct the hypervisor to evict enough memory out of the other 12 caches so that a 13th cache could be created for the new virtual machine. With that said, we begin with our first application:

Quicksort: Fixed 512 MB PPAS cache, Variable Working Set. Since this is a C program, it is easily parameterized to sort larger and larger amounts of data. We choose a fixed 512 MB cache size and then instruct the sort to work on increasingly larger, randomly populated arrays. At each sort size, we take 5 runs and compute the average completion time of the sort. Figure 6.10 shows that quicksort slowdown remains relatively unchanged: as expected, the baseline CIVIC performance matches that of the case without CIVIC up to 512 MB sort sizes, and after that only slight application slowdown is observed - a few seconds' worth.

Figure 6.10: Completion times (seconds, versus array size) for quicksort on a CIVIC-enabled virtual machine and a regular virtual machine, with a pre-populated 512 MB local cache.
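The sort itself is straightforward; a sketch of the kind of parameterized benchmark used here (array size taken from the command line, a fixed random seed, and wall-clock timing) is shown below. This is an illustration, not the exact program used in our runs:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    /* Array size in bytes, e.g. 536870912 for a 512 MB sort. */
    size_t bytes = (argc > 1) ? strtoull(argv[1], NULL, 10) : (64UL << 20);
    size_t n = bytes / sizeof(long);
    long *a = malloc(n * sizeof(long));
    if (!a)
        return 1;

    srandom(42);
    for (size_t i = 0; i < n; i++)
        a[i] = random();                    /* randomly populate the array */

    time_t start = time(NULL);
    qsort(a, n, sizeof(long), cmp_long);
    printf("sorted %zu longs in %ld seconds\n", n, (long)(time(NULL) - start));
    free(a);
    return 0;
}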

Matrix Multiply: Variable PPAS cache, 512 MB Working Set. We downloaded the JASPA sparse matrix multiplication program [54] to compare application slowdown on a fixed working set. The matrix we use for JASPA has a density of 20%. While running this multiplication multiple times, we vary the cache size underneath the virtual machine between 128 MB and 512 MB. (The virtual machine in all of our experiments has an existing buffer cache of about 100 MB when the machine boots up, in addition to the application's own memory footprint.) The resulting graph has two plots: one shows the performance of CIVIC with prefetching turned on and one with it turned off. Figure 6.11 shows the time required to complete this matrix multiplication.

Figure 6.11: Completion times (seconds) for Sparse Matrix Multiplication with a resident memory footprint of 512 MB while varying the PPAS cache size, with and without prefetching.

The right-most bar in Figure 6.11 shows the baseline case on a non-CIVIC-enabled hypervisor, which completes in 37 seconds. This is the bar we would like to match. Out of the box, a 512 MB cache underneath the VM slows the application down to 39 seconds, and to 41 seconds when prefetching is disabled. This is a fairly acceptable slowdown (for now) of about 5%. When the cache size is reduced to as low as a quarter of the application's memory footprint (128 MB), the slowdown grows to 43 seconds, or about 10%; if prefetching is disabled, it becomes 47 seconds (18%).

Figure 6.12: Requests Per Second for the RUBiS Auction Benchmark with a resident memory footprint of 490 MB while varying the PPAS cache size, with and without prefetching.

RUBiS. Figure 6.12 for the RUBiS Auction Webserver benchmark is structured in a similar manner to Figure 6.11. The CIVIC configuration uses the PHP version of RUBiS. We configure the client side to use 8 client simulators (on one of our machines that has 8 cores); each of the 8 clients simulates 100 connections per core, all aimed at the same test HVM as before, which runs an Apache webserver. During peak load, this configuration of RUBiS imposes a memory footprint of about 490 MB on the Virtual Machine. For this application, webserver slowdown is represented by requests per second in Figure 6.12. Slowdown for this application is more significant, although the effects of prefetching are similar to those seen for the previous application. The baseline Requests Per Second (RPS) with CIVIC disabled is 16. Simply enabling a cache size equal to the memory footprint of the webserver brings this down to 12, a 25% reduction in performance. When the cache size is half of the memory footprint, the slowdown becomes almost 40%. There is clearly some room for optimization here in the proof-of-concept version of CIVIC.

                Baseline      w/ Prefetch            w/o Prefetch
Application     Demand        Demand       Net       Demand       Net
Quicksort         1643        455766      8289         5008    648227
Matrix          107280        214159      3399         3729    220858
RUBiS           123217        196613     15041       106301     45736

Table 6.2: Number of shadow page-faults to and from network memory with CIVIC prefetching disabled and enabled. Each application has a memory footprint of 512 MB and a PPAS cache of 256 MB.

Prefetching Behavior. Finally, one of the more interesting evaluations of CIVIC performance comes within the hypervisor when determining the effects of prefetching. It is important to understand what effect our basic prefetching implementation has on Virtual Machine performance. Table 6.2 illustrates how we calculate this effect for all three of the applications discussed above. The table provides two types of information: the increase in shadow page-faults incurred due to the use of CIVIC, as well as the decrease in network-bound major page-faults achieved by our prefetching implementation.

The second column, labeled "baseline", refers to the number of shadow page-fault exceptions taken on the processor when CIVIC is not used on the HVM guest for each application. These are typically "demand" faults. When the HVM guest is fresh - or, in our case, was just booted - the operating system has created relatively few page tables. When the application starts allocating and using memory, the hypervisor creates shadow page-tables for the HVM's page-tables only in an on-demand fashion: that is, only when the OS incurs a copy-on-write fault for that new memory does the hypervisor shadow the corresponding page tables.

When CIVIC is used, the number of demand page faults goes up significantly because the hypervisor is frequently tearing down and reinstalling shadow pages as it interacts with the PPAS cache. For each of the three applications, one can clearly see that the effects of prefetching are significant. There is a clear correlation between the order-of-magnitude increase in network-bound major faults when prefetching is disabled and the increase in local, minor shadow page-faults when prefetching is used.

It remains to be seen whether some of the prepaging strategies we used in Chapter 5 to perform faster VM migration can also be used in this context to further reduce the number of major, network-bound page faults incurred by the virtual machine, and thereby reduce the slowdown seen in the previous figures. We suspect this will work quite well.

6.5 Summary

We have presented the design, implementation, and evaluation of a complete system for the transparent oversubscription of unmodified Virtual Machines and their applications. We have also applied previously learned distributed memory techniques to aid in promoting the ubiquity of access to VM memory as a virtualized resource. Although the manipulation of Virtual Machine memory can be complex, we have shown that the advantages of doing so outweigh those complexities. CIVIC provides an expandable framework for a number of other ways to improve upon Virtual Machine technology and for more consolidated use of machines in a clustered environment. Future directions for this system include both VM migration improvements and performance improvements. Furthermore, CIVIC can benefit even more from a continued effort to maintain the performance guarantees of the underlying MemX system, as that is critical to minimizing application slowdown.

Chapter 7

Improvements and Closing Arguments

In this chapter, we discuss a number of projects into which this dissertation could be expanded, as well as some system design and implementation changes to the solutions we have presented.

7.1 MemX Improvements

In its current incarnation, MemX is well tested in its ability to provide low-latency access to DRAM on large numbers of computers, but it has deficiencies that limit the kind of ubiquity we would ultimately like to see in a distributed memory system built to support virtual machines.

7.1.1 Non-Volatile MemX Memory Descriptors

One drawback to MemX is that it is volatile: whenever you shut down a MemX client, its memory is lost. In networked systems, there is a common push and pull between statelessness and statefulness. The client side of a network protocol has two high-level choices: the more state you put into the client, the faster it can access its own data structures (for files, memory, or any other resource); however, the more stateless you design your clients, the more flexibility you gain. In the case of MemX, our clients are as full of state as they can possibly be - the protocol is entirely state-driven. This poses problems for Virtual Machine transparency, particularly when VMs are being migrated, since that state must either reside within the VM itself or be transmitted out-of-band using a separate framework.

We are currently designing a major new revision of MemX in which the protocol is still client-driven, but both clients and servers can become more stateless. The high-level design is that clients probe memory servers, via lookups, for the descriptors of memory pages that the servers already hold - allowing clients to shut themselves down or detach from the memory pool and restart at other locations. This has several advantages for VMs: VMs that use MemX as a filesystem can now be assured that their files do not disappear. VMs that use MemX for migration can avoid loading MemX into their own address spaces and leave the client module untouched within Dom0 of individual host systems. VMs that use MemX for CIVIC memory overcommitment can now make the Assistant VM stateless: one can snapshot a CIVIC virtual machine and shut it down, and days later restart that VM without the Assistant needing to know anything about the HVM guest except that its pages are out on the network somewhere.
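As a purely illustrative sketch of the kind of lookup message such a stateless protocol might carry - none of these structure or field names exist in the current MemX implementation - a client could name a page by a persistent (owner, page-index) descriptor rather than by per-connection state:

#include <stdint.h>

/* Hypothetical wire format for a stateless descriptor lookup. The client
 * identifies a page by a persistent (owner, page-index) pair and the server
 * answers with the page contents if it holds that descriptor. All names
 * here are illustrative, not part of the existing MemX protocol. */
struct memx_lookup_req {
    uint64_t owner_id;     /* persistent identity of the VM or client       */
    uint64_t page_index;   /* page number within that owner's address space */
    uint16_t op;           /* e.g. LOOKUP, READ, or WRITE                   */
} __attribute__((packed));

struct memx_lookup_resp {
    uint64_t owner_id;
    uint64_t page_index;
    uint16_t status;       /* FOUND, NOT_FOUND, or MOVED                    */
    uint8_t  data[4096];   /* page contents when status == FOUND            */
} __attribute__((packed));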

7.1.2 MemX Internal Caching

The next drawback to MemX is that it has no internal page caching system and no prioritization of servers. When MemX is used as a swap device, caching is irrelevant because of the cache already implemented by the operating system, and the same argument applies for filesystem use. But when MemX is used by CIVIC, or by a process through mmap() or other future virtualizable interfaces, the lack of an internal MemX cache is a serious performance bottleneck. This is particularly true if MemX gains non-volatile support and if CIVIC is used on top of MemX. A transparent deployment of MemX cannot always require the administrator to depend on operating-system support for caching of swapped pages, which means we are shifting the caching burden to MemX. Caches also serve the purpose of "write buffering" small, bursty write patterns that need not be sent over the network. Users of MemX that do not have their own caches currently lose the advantages of write buffering, since MemX has no internal data cache.

7.1.3 Server-to-Server Proactive Page Migration

One of the more interesting features that MemX does provide is server-to-server granular page migration. In our stable version, servers will automatically shut themselves down upon unforeseen events such as a network restart script or a poweroff command issued by an administrator; MemX knows how to block these events until the server module has completely transferred its memory to nearby servers.

Right now this feature is only used for management. But an entire project could be developed out of it, in the same way it was done for cooperative caching filesystems, by having a third-party daemon move data around the cluster as the load on individual machines goes up or down. The killer application of this feature would obviously be page sharing and compression, but it would also greatly add flexibility for client statelessness as well as for the migration of CIVIC virtual machines, as we'll discuss later.

7.1.4 Increased MemX bandwidth w/ Multiple NICs

A more subtle drawback to CIVIC right now is network throughput. Disks have the advantage of very high-throughput PCI buses connecting them. For the same reason that SANs (storage area networks) exist over fiber-channel, MemX is limited by the maximum speed of our ethernet. All of the machines in our virtualization cluster have dual network interfaces. A nice side-expansion of MemX would be to extend the protocol to send traffic over any available network interface while still maintaining transparency to the application. MemX has an advantage in doing this because it does not depend on IP and does not require routing except at the switch level, so extending the protocol over the same client session identifier would be a relatively straightforward incremental improvement for quite a large gain in throughput.

7.2 Migration Flexibility

7.2.1 Hybrid Migration

Currently in the works is an improvement to Post-Copy VM migration that merges Post-Copy and Pre-Copy together. The idea is that clean data pages that do not get changed during the post-copy process should be sent in advance, before post-copy actually takes effect, rather than relying on pre-paging to handle them. This also has the nice side-effect of solving the "non-pageable memory" problem, where a MemX-based Post-Copy implementation is incapable of paging out data pages in the system that cannot be swapped out. A hybrid solution solves this problem by doing a single pre-copy iteration before post-copy begins.

On the other hand, current experience with CIVIC shows that hooking into the Shadow-Paging API of the hypervisor is not such a bad idea. The current post-copy implementation requires a lot of knowledge about the Operating System, whereas a shadow-based solution could be deployed for any non-Linux operating system. It may well be worth the effort to do this for reasons of both performance and transparency. Until then, the current post-copy evaluation has shown that the solution itself is very viable.

7.2.2 Improved Migration of VMs Through CIVIC

More intriguing, however, is the possibility of combining CIVIC and Post-Copy together. For CIVIC to be usable, it will probably need to support migration. As demonstrated in related work [92], migration performance will become critical during automated or manual migration requests - not just from a liveness standpoint: the local cache belonging to each VM will need to be synchronized with the VM and perhaps migrated along with it.

However, now that CIVIC is in place, the design is clean enough that the normal Xen migration or Post-Copy migration systems can be executed for the HVM such that they only migrate cached pages, while the networked components of the PPAS remain on the servers without interference. Even more interesting would be the application of migration, MemX, and CIVIC in a hierarchically-switched LAN, where different latencies are imposed. If a MemX-based, page-granular migration system were available, MemX could be instructed to proactively perform global-to-global cache migrations, both for predicting cache hits and for super-fast VM migration.

7.3 CIVIC Improvements and Ideas

Due to CIVIC's high level of transparency, we propose a number of ways to further investigate the usefulness of VM memory oversubscription itself:

7.3.1 How high can you go?: Extreme Consolidation

Assuming adequate CPU resources are available, just how many VMs can you overcommit on a single host? Furthermore, as was explored in related cooperative caching work, can we use existing algorithms to do cluster-wide, content-based page sharing, as is already available in VMware and in projects like the "Difference Engine" [33]?
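The core of such content-based sharing - hashing page contents to find duplicate candidates and confirming with a byte-wise compare before collapsing the copies onto a single copy-on-write frame - can be sketched as follows. This illustrates the general technique only, not code from VMware, the Difference Engine, or CIVIC:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Illustrative 64-bit FNV-1a hash over a page's contents. */
static uint64_t page_hash(const uint8_t *page)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Candidate check: equal hashes plus a full byte-wise compare; only then
 * would a real implementation map both guests to one shared,
 * copy-on-write machine frame. */
static int pages_identical(const uint8_t *a, const uint8_t *b)
{
    return page_hash(a) == page_hash(b) && memcmp(a, b, PAGE_SIZE) == 0;
}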

7.3.2 Improved Eviction and Shadow Optimizations

CIVIC currently has a very basic mechanism for evicting pages of the virtual machine from the PPAS cache. Straightforward projects for student learning include applying existing operating-systems algorithms, such as CLOCK and multi-LRU, to evict pages that are no longer in the VM's writable working set.
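As a sketch of the textbook CLOCK (second-chance) policy just mentioned - not CIVIC's current eviction code - applied to an array of cached frames whose referenced bits are assumed to be fed from the accessed bits that the shadow-paging code already observes:

#include <stdbool.h>
#include <stddef.h>

/* Minimal CLOCK (second-chance) sweep over a cache of nframes entries.
 * The structure and function names here are illustrative only. */
struct cache_frame {
    unsigned long pfn;     /* which guest page this cache slot holds  */
    bool referenced;       /* set when the page was recently accessed */
};

static size_t clock_hand;  /* persists across calls */

/* Return the index of the next frame to evict from the PPAS cache. */
size_t clock_pick_victim(struct cache_frame *frames, size_t nframes)
{
    for (;;) {
        struct cache_frame *f = &frames[clock_hand];
        size_t idx = clock_hand;
        clock_hand = (clock_hand + 1) % nframes;

        if (!f->referenced)
            return idx;            /* no second chance left: evict this one */
        f->referenced = false;     /* clear the bit and give it another pass */
    }
}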

Furthermore, recall that CIVIC currently depends on removing the shadow optimizations present in Xen. A production implementation will require re-enabling these, and perhaps going further by taking advantage of Nested Page-Table support on modern CPUs. It should be possible to make CIVIC performance equal to the state-of-the-art baseline when those optimizations are combined with hardware-assisted paging.

7.4 Conclusions

Virtual Machines have come alive again, but they have a long way to go. This dissertation explores just one of the many resources available to the modern operating system. Hopefully, more system-level ideas can be built out of the transparent, ubiquitous management of Virtual Machine memory to advance the state-of-the-art in this technology. The three primary projects presented in this dissertation can be improved upon in very similar ways to provide a much more robust memory resource virtualization system for future virtual machines. The most basic components of ubiquitous memory virtualization are covered by this work across three areas: (1) MemX, the core distributed memory system; (2) Post-Copy, an improvement to basic VM migration at the whole-system level leveraging MemX; and finally (3) CIVIC, a page-granular system for memory overcommitment that is completely transparent to the operating system. With these three frameworks in place, there are a great many improvements to be made as well as new ideas that can be created.

Appendix A

CIVIC Screenshots

To give the reader a closer sense of what it is like to run a large, unmodified HVM guest on top of CIVIC, we have included two screenshots in a printable format. They provide a complete picture of what the administrator sees in a Linux environment with our implementation, for two different configurations of CIVIC.

A.1 Small-HVM Over-subscription

The first screenshot, in Figure A.1, shows a live session running an unmodified version of 64-bit Fedora 10 Linux on top of a CIVIC-enabled Xen hypervisor. The main window is a remote VNC session. Inside this session is the actual node running the HVM. The HVM is viewed by another VNC session exported by the QEMU binary emulator that helps Xen take care of hardware device emulation. The HVM is running in the upper-right corner and is allocated a very small cache size: 64 megabytes. In this session, we are running a 2 GB sort, which is in the beginning stages of filling up the array with random numbers. The Assistant console is in the bottom-right corner. It shows the output from MemX about which servers are available for network paging. Of the 16 nodes in the list, only the 3 largest ones are actually being used.


Figure A.1: A live run of an HVM guest on top of CIVIC with a very small PPAS cache size of 64 MB. The HVM has 2 GB. (Turn the page sideways)

A.2 Large-HVM Oversubscription

This screenshot is similar in structure to the first one, except that we are running a much larger guest virtual machine. For both screenshots, notice the Assistant output describing the structure of each HVM's cache and memory size. The size of the cache is labeled in the "Mem" column; the actual size of the HVM's PPAS is labeled in the "Civic" column. Figure A.2 shows a 64 GB HVM guest with a 2 GB cache size. We have also set up a large quicksort to run. In fact, the HVM guest running in this figure is the exact same guest operating system and filesystem image used in the previous screenshot; the only difference is in the Xen configuration file, where we change the size of the cache and memory. The BIOS modifications CIVIC makes in the Xen Daemon automatically ensure that the guest OS reconfigures its physical memory to recognize the new memory available for process allocation. This particular configuration of the HVM depends on five simultaneous memory servers, whereas the previous one used only three. There are absolutely no changes whatsoever made to the HVM guest filesystem image or kernel in either of these screenshots.

Figure A.2: A live run of an HVM guest on top of CIVIC with a very large PPAS cache size of 2 GB. The HVM believes that it has 64 GB. (Turn the page sideways)

Appendix B

The Xen Live-migration process

This appendix is a bit of useful reading we composed in our early investigations of virtual machines; it may significantly help the reader get started digging into the Xen hypervisor, which can seem daunting at first glance. Here, I clarify exactly what shadow page tables are used for in Xen. Although it is long and technical, I strongly recommend you finish the whole thing. An equally important reason for reading it is its documentary aspect: it explains a lot of the Xen terminology and in particular exposes some of the steps in both live migration and shadow paging.

This appendix uses references to the Online Cross Reference, typically found on the Xen website. These links are circa version 3.2 of Xen but are still highly relevant, so simply adapt the embedded links in your own browser as you go through the appendix. (The line numbers will definitely be a few lines off.)

B.1 Xen Daemon

OK, Let’s begin:

We start with live migration because, in order to migrate, you must understand the way page frames are referenced in a virtual machine architecture, which requires our entry point to be the Xen Daemon (i.e. xen-3.*/tools). Looking through the cross reference (http://lxr.xensource.com/) is interesting. Python is strange: no braces, and it kinda looks like a functional language. I have also noticed that you learn *a lot more* by reading their header files than their function definitions - most of their comments are in headers.

To begin a migration (also called a ”checkpointed save”), Xend calls a python function called ”save”: http://lxr.xensource.com/xen/source/tools/python/xen/xend/XendCheckpoint.py#054

It’s called a checkpoint because it’s possible that any round of pre-copy may potentially have satisfied the needs of the entire migration (depending on how active the guest is).

This function in turn forks off the main routine in the Xen-Control (xc) library in this entry point: http://lxr.xensource.com/xen/source/tools/xcutils/xc_save.c#054

Which of course calls the ENORMOUS xc_domain_save function.

Now for shadow control. Let’s start with that:

The first live shadow control command turns on shadow dirty logging.

To do this, a hypercall is prepared in user-space. The hypercall is initiated by sending an ioctl with the locked hypercall memory to the domain zero Linux kernel with this call: http://lxr.xensource.com/xen/source/tools/libxc/xc_linux.c#124

Next, the handler for xenLinux ioctls (that in turn fires the hypercall) is located here: http://lxr.xensource.com/xen/source/linux-2.6.16.33-xen/drivers/xen/privcmd/privcmd.c#042

This ioctl makes a hypercall whose entry point "do_domctl" into Xen is located here (we are now inside Xen itself): http://lxr.xensource.com/xen/source/xen/common/domctl.c#172

Then we make it to a switch statement that handles shadow-mode activation in a function called "arch_do_domctl": http://lxr.xensource.com/xen/source/xen/arch/x86/domctl.c#027

In the Hypervisor code, you'll notice lots of "XEN_GUEST_HANDLE" or TYPE_SAFE statements. Don't be too confused by these. They are simply defined as:

#define __DEFINE_GUEST_HANDLE(name, type) \

typedef struct { type *p; } __guest_handle_ ## name

In layman's terms, this means: "re-define the type A as __guest_handle_A, wrapped in a struct".

To use those types, they have:

#define GUEST_HANDLE(name) __guest_handle_ ## name

So, if they say:

GUEST_HANDLE(int) variablename;

That really means:

__guest_handle_int variablename;

They do this to force explicit type safety. For example, in the Xen shadow headers, they have a typedef macro that does the exact same thing:

#define TYPE_SAFE(_type,_name) \
    typedef struct { _type _name; } _name##_t;

This is the exact same macro, in a different location, except that the type-safe marker is now a suffix ("_t") rather than a prefix.
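For instance (an illustrative expansion, not a quotation from the Xen headers), applying the macro to machine frame numbers yields a distinct struct type that the compiler will refuse to mix with a plain unsigned long:

/* Illustrative expansion only - not quoted from the Xen source. */
#define TYPE_SAFE(_type, _name) \
    typedef struct { _type _name; } _name##_t;

TYPE_SAFE(unsigned long, mfn)
/* expands to: typedef struct { unsigned long mfn; } mfn_t; */

/* Passing a raw unsigned long where an mfn_t is expected is now a
 * compile-time error instead of a silent mix-up between address spaces. */
static unsigned long mfn_value(mfn_t m) { return m.mfn; }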

The biggest reason for this is distinguishing between guest machine addresses and hypervisor machine addresses (and other values). By wrapping the value in its own type and using it universally, you can debug errors more quickly and avoid coding mistakes that mix values between Xen and Xen domains.

Moving along....

B.2 Understanding Frame Numbering

The very first case in the switch statement provides the function handler that takes care of shadow control operations: http://lxr.xensource.com/xen/source/xen/arch/x86/mm/shadow/common.c#3188

Again, you should read xen/include/asm-x86/shadow.h, as it provides more comments about its facilities than xen/arch/x86/mm/shadow.c does. For example, they have the following very important comment in that file:

”Guest frame numbers (gfns) are the entries that the guest puts in its pagetables. For normal paravirtual guests, they are actual frame numbers, with the translation done by the guest. Machine frame numbers (mfns) are the entries that the hypervisor puts in the shadow page tables. ”

It’s important not to over-think the core of this description:

1. Page-tables require memory. It's easy to forget that. Both pages and page-table entries belong to machine frames (the real ones) and have real machine-frame numbers. So, when they say "gfn", they mean that guest frame numbers are machine frame numbers too, allocated by the guest to create entries in the guest's own page-table - they are not machine frames to be used for data storage.

2. With that being said, what is a shadow page table from Xen's perspective? In VMware, the x86 MMU walks shadow page tables, not the guest's. This happens in Xen too, but only during migration. Don't confuse this behavior with normal operation.

Now let's go back to the second half of the quote taken from shadow.h:

”Machine frame numbers (mfns) are the entries that the hypervisor puts in the shadow page tables”.

In other words, mfns are the addresses of real data pages used by the guest OS. That means a shadow page-table is a Xen-owned page table which is used exclusively by Xen to keep track of the real data pages (machine frames) that the guest has dirtied.

Furthermore, it's important to realize that (and exactly how) PV guests really do manage their own memory allocations: Xen simply provides an array of free pages in the entire system that the PV guest can choose from at will (and be validated) so long as it doesn't over-step its reservation. It's also important to realize that this table is not another level of indirection on the fast path. What does that mean? It means that this array is used for allocation only - it's not used in some "strange shadow-ish" way by the MMU to do virtual/physical translations. It has nothing to do with the actual MMU translation path, which is what I mean by the fast path. Once a machine frame has been allocated from this array, its real address goes directly into the guest's PTs and is ultimately walked by the MMU.

Now, this is why they use the type-checking I described above: an unsigned long pointer (in guest space or Xen space) to ANY machine frame in the system could be an actual guest machine frame number (either a PT entry or a data page), or it could be a regular mfn, i.e. a page that belongs to Xen or some other guest. AND all pages in the system are "addressable" by all guests - just not necessarily with permission (except for domain 0) - but still addressable via the ioremap-xen.c facilities. This forces a more rigorous type-checking of these pointers by re-wrapping them as described before.

So, next: what does all this mean for the term "pseudo-physical"?

It means two things:

1. For Xen: As far as Xen is concerned **there is no such thing** as pseudo-physical, because guests manage their own PT pages and their own data pages, so whatever addresses they put in their page-tables are in fact machine addresses, not pseudo-physical ones.

2. For Guests: kernels like Windows and Linux are incapable of allocating memory in an *explicitly* discontiguous manner. They assume (and will always assume) that they have free rein over the physical memory space. In order to preserve the statements above regarding the guest's self-management of its page-tables, Xen provides two arrays:

B.3 Memory-related Data Structures

From Xen’s point of view there are two primary memory data structures:

• A P2M table (pseudo-to-machine table): every domain gets one of these (of reservation size).

• An M2P table (machine-to-pseudo table): globally visible (after xen-ioremap).

Other code acronyms:

• "max_pfn": the largest non-sparse allocatable pseudo frame number.

• "max_mfn": the largest sparse allocatable machine frame number in the entire system.

These two are used all over the xc shadow code, and it's very easy to mistake one for the other - very poor naming choices by the Xen folks.

The second table is used by Xen (and domain zero), whereas the first is used by domains.

The second table is only updatable via hypercalls, used when domains attempt to change their reservations. Xen uses the M2P table to verify that the guest is not trying to grab frames already allocated to other domains during PTE updates or reservation changes.

The first table allows the Linux memory allocator (nothing to do with the x86 MMU whatsoever) to allocate memory using the same buddy algorithm that it normally uses, by choosing pages from the contiguous, linear P2M array instead of being modified to choose from multiple discontiguous extents of memory. The guest then makes a hypercall to Xen to verify that its choices were correct and that it did not exceed its reservation or attempt to allocate pages already belonging to another guest (using the M2P array).
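The invariant the two tables maintain can be illustrated with a small consistency check (the array names below are placeholders; the real tables live in guest memory and Xen memory respectively and are not plain C arrays):

#include <assert.h>

/* Placeholder declarations: p2m[] stands for the per-domain pseudo-to-
 * machine table and m2p[] for the global machine-to-pseudo table. */
extern unsigned long p2m[];   /* indexed by pseudo frame number (pfn)  */
extern unsigned long m2p[];   /* indexed by machine frame number (mfn) */

/* For every pseudo frame the domain owns, translating forward through
 * the P2M and backward through the M2P must return the same pfn. */
static void check_p2m_m2p(unsigned long max_pfn)
{
    for (unsigned long pfn = 0; pfn < max_pfn; pfn++) {
        unsigned long mfn = p2m[pfn];
        assert(m2p[mfn] == pfn);
    }
}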

If you want, you can think of them as "pseudo-contiguous physical machine frames", meaning there is an array of free machine pages provided by Xen that the guest itself (not Xen) chooses from; these are not guaranteed to be contiguous, but they are by no means truly "pseudo-physical". In fact, here is a direct quote from that same region of the shadow.h hypervisor code:

"(Pseudo-)physical addresses are the abstraction of physical memory the guest uses for allocation and so forth. For the purposes of this code, we can largely ignore them."

That is, the key difference between Xen's idea of "pseudo-physical" and the traditional picture is that these pseudo-physical numbers are only for allocation - not for translation into the TLB.

Obvious conclusions:

• the M2P tables contain duplicate pseudo numbers across domains

• but mfn’s in each domain’s P2M table are unique.

Further conclusions:

1. These tables are pre-constructed (machine frames are pre-chosen) during the start-of-day by Xen, before the domain is started up, based on its reservation.

2. These tables are for the most part read-only after construction for each domain, and only change when (1) checkpoints happen (save/migrate) or (2) a request is made to change the reservation of the guest domain.

3. There are multiple Xen-visible M2P tables, one for each domain, that only domain zero can use (as we'll see later in a deeper poke into the shadow code).

The next few subtle points regarding these two tables:

1. The P2M list is stored inside the guest. The frames that store this list must themselves be migrated and re-mapped after migration.

2. This, of course, must be done for the M2P list as well. This is what a lot of the initial code inside xc_domain_save is preparing for.

3. Remember: the xc library is user-land code, so it must re-map the tables from their sources before sending them over.

B.4 Page-table Management

So all of this begs the question: what’s the difference between migration-time shadow PTs and the ”dirty bitmap” that we’ve heard about?

Shadow page tables in Xen do get walked by the MMU - but they are only used, and only walked, during migration. As soon as migration is over, those PTs are destroyed and the destination node switches its CR3 back to the address of the guest's own real page-table base pointer as usual - no more shadow walking.

So, then: how does this work? Well, it would be highly inefficient on the source node for Xen to duplicate the guest's entire page table before switching to shadow mode. Instead, the shadow PTEs are populated on demand (copy-on-write): all page-faults trap to Xen, and Xen does a translation by pulling the corresponding virtual-to-machine frame mapping from the guest's own page-table, copying that mapping into the shadow page-table, and then returning so that the MMU can resume walking the page table to populate the TLB. Similarly, if the guest updates its page-table (either by making a new PTE or by updating one when bringing an existing virtually-addressed page back from backend storage), then that change is also propagated to the corresponding Xen shadow PTE.

Along the same lines, this begs the question: what happens if a fault traps to Xen's shadow PT and the corresponding lookup in the Guest's PT is non-resident, i.e. what if the page has been swapped out? (This assumes paging is even turned on at all, which is unlikely for our purposes.) In that case Xen must invoke the Guest's page-fault handler, wait for the guest to bring the page back in, then finally copy the resulting virtual-to-mfn PTE from the guest and return operation to the MMU. (This is shadow-mode only; all of this only happens during save/restore/migrate.)

This should now make it clear what the "dirty logging bitmap" is used for. At the end of each round (which the xen-control (xc) software controls), Domain-Zero needs a way to do a "fast lookup" of the dirtied pages (the working set). The shadow page-table (in Xen space) is too big for this - so they supplement the shadow PT with a shorter, smaller, easily traversable bitmap that is passed back to a kernel-space bitmap in domain zero.
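In other words, at the end of a round the sender only has to scan a bitmap with one bit per pseudo frame. A sketch of that scan (with illustrative names, not the actual xc code) looks like this:

#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Illustrative only: walk a dirty-logging bitmap (one bit per pseudo frame)
 * and hand each dirtied pfn to the sender for the next pre-copy round. */
extern void queue_page_for_send(unsigned long pfn);

static void collect_dirty_pages(const unsigned long *bitmap,
                                unsigned long max_pfn)
{
    for (unsigned long pfn = 0; pfn < max_pfn; pfn++) {
        unsigned long word = bitmap[pfn / BITS_PER_LONG];
        if (word & (1UL << (pfn % BITS_PER_LONG)))
            queue_page_for_send(pfn);
    }
}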

Other acronyms you’ll see in the hypervisor code:

• ”L1mfn”: machine frame number of a xen 1st-level PTE (xen’s own PTs).

• ”L2mfn”: machine frame number of a xen 2nd-level PTE.

• ”gl1mfn”: mfn of a guest 1st-level PTE

B.5 Actually Performing the Migration

Got all that? Ok, moving along. We'll start taking apart xc_domain_save:

Now, for the most part, we already know that most of the migration work is done in the xc library. So, it's likely we won't have to make very many Xen modifications.

The overall low-level sequence to prepare for pre-copy:

1. checkpoint the domain in upper-python code

2. open a socket to the destination

3. send over the *entire* xen-store domain configuration

4. pass the file descriptor & domain ID to the xc_library

5. Bring in a local copy of the migrating guest’s P2M table.

6. Bring in a local copy of Xen’s M2P table for the migrating guest.

7. Canonicalize *the references to frames that store the P2M table*

Will explain this in a second....

8. Enable shadow-page-table mode (switch CR3 to copy-on-write).

9. Allocate bitmaps (which pages to send and which to ignore)

10. Initialize the bitmaps to all ones (assume all get sent first)

11. Send over the P2M frames themselves

12. Pin (make read-only) the guest’s page tables. (kernel and user)

BIG LOOP START

11. Get a list of the *types* of all the send-able pages

12. start sending them:

a. If the page is a PT page, canonize it:

Replace all MFNs with PFN: i.e. switch over guests PTEs

to point to pseudo numbers (and later switch them back)

b. If page is data page, send it over

12. If working set not small enough or iterations left, then repeat 11.

BIG LOOP STOP

13. Else do final save and state

14. canonize and send segment descriptor tables (LDT / GDT)

15. canonize and send page-table CR3

16. send final CPU state

17. turn off shadow mode (destroy tables)

18. un-map all of the locally allocated free memory

19. done

Regarding steps #7 and #11, which are perhaps the hardest part to understand: these tables, as we've learned, do not change. In order to bring in those tables, we first map the *page frames* of those tables and then copy them. The P2M list *also has references to itself*: i.e. if a machine frame X somewhere in the system is allocated to the P2M table, then it has a P2M entry in the table at pseudo number X-prime. The first thing live migration does is send over this table itself. Before the table is sent, all of the P2M's references to itself must be removed, because the table itself will get a new location on the destination. Once it's received, those references will be re-written. This is done by making P2M entry X-prime == X-prime (i.e. changing the table value to be equal to the index of itself - the pseudo frame number).

This can be observed *before* the big loop in this tiny loop: http://lxr.xensource.com/xen/source/tools/libxc/xc_domain_save.c#713

The references to "i/fpp" take the address in 4K page blocks. fpp is defined as "the number of mappings in a page", so the loop advances by one 4K page of mappings at a time. It then takes the array location and re-writes THAT MFN (by looking it up in the mfn-to-pfn table that Xen provides) to be equal to the PFN (yes, itself).
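Put as a sketch (with illustrative names - this is not the actual xc_domain_save code), the canonicalization rewrites each machine frame that backs the P2M table into the pseudo frame number it maps to, using Xen's M2P table:

/* Illustrative sketch of canonicalizing the frames that hold the P2M table.
 * p2m_frame_list[] holds one MFN per page of the P2M table; each entry is
 * rewritten to the corresponding PFN so the destination can relocate the
 * table and later restore the references. Names are illustrative only. */
extern unsigned long mfn_to_pfn(unsigned long mfn);   /* lookup via the M2P table */

static void canonicalize_p2m_frames(unsigned long *p2m_frame_list,
                                    unsigned long max_pfn, unsigned long fpp)
{
    for (unsigned long i = 0; i < max_pfn; i += fpp)
        p2m_frame_list[i / fpp] = mfn_to_pfn(p2m_frame_list[i / fpp]);
}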

Flag definition: ENABLE_DIRTY_LOGGING is equivalent to *both* logging and turning on shadow mode (no translation - i.e. no CR3 change).

Flag definition: ENABLE_TRANSLATE is equivalent to 3 things: (1) enable shadow mode (mark guest pages read-only and dynamically populate as described), (2) enable dirty logging, and (3) translate by changing CR3.

Thanks for reading.

Bibliography

[1] OpenVZ. (Virtuozzo): http://www.openvz.com/.

[2] User Mode Linux, http://user-mode-linux.sourceforge.net/.

[3] Intel corp., intel virtualization technology specification for the ia-32 intel architecture.

2005.

[4] A. Kivity, Y. Kamay, et al. kvm: the Linux Virtual Machine Monitor. In Ottawa Linux Symposium, 2007.

[5] A. Whitaker, R.S. Cox, et al. Constructing services with interposable virtual hardware.

In NSDI 2004, pages 13–13, 2004.

[6] A. Whitaker, M. Shaw, and S.D. Gribble. Scale and performance in the Denali isolation kernel. Pages 195–209, New York, NY, USA, 2002.

[7] Vadim Abrossimov, Marc Rozier, and Michel Gien. Virtual memory management in

chorus. In Proceedings of the European Workshop on Process in Distributed Operat-

ing Systems and Distributed Systems Management, pages 45–59, 1990.


[8] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young.

Mach: A new kernel foundation for unix development. In USENIX ATC, 1986.

[9] A. Acharya and S. Setia. Availability and utility of idle memory in workstation clusters.

In Measurement and Modeling of Computer Systems, pages 35–46, 1999.

[10] R. Adair, R. Bayles, L. Comeau, and R. Creasy. A virtual machine system for the

360/40, technical report 320-2007, 1966.

[11] Vasudeva Akula. Workload characterization and business-oriented performance im-

provement techniques for online auction sites. PhD thesis, Fairfax, VA, USA, 2007.

Adviser-Menasce, Daniel.

[12] Alex Vasilevsky, David Lively, and Steve Ofsthun. Linux virtualization on Virtual Iron VFe. In Ottawa Linux Symposium, volume 2, pages 235–249, 2005.

[13] AMD. Amd64 virtualization codenamed “pacifica” technology: Secure virtual machine

architecture reference manual. 2005.

[14] C. Amza and A.L. Cox et. al. Treadmarks: Shared memory computing on networks of

workstations. IEEE Computer, 29(2):18–28, Feb. 1996.

[15] T. Anderson, D. Culler, and D. Patterson. A case for NOW (Networks of Workstations).

IEEE Micro, 15(1):54–64, 1995.

[16] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless

network file systems. In Proc. of the 15th Symp. on Operating System Principles,

pages 109–126, Copper Mountain, Colorado, Dec. 1995.

[17] A.A. Awadallah and M. Rosenblum. The vMatrix: Server Switching. In Proc. of Intl.

Workshop on Future Trends in Distributed Computing Systems, Suzhou, China, May

2004.

[18] B. Cully, G. Lefebvre, et al. Remus: High availability via asynchronous virtual machine replication. In NSDI '08: Networked Systems Design and Implementation, 2008.

[19] A. Barak and O. Laadan. The MOSIX multicomputer operating system for high per-

formance cluster computing. Future Generation Computer Systems, 13(4–5), Mar.

1998.

[20] P. Barham, B. Dragovic, K. Fraser, and S. Hand et.al. Xen and the art of virtualization.

In SOSP, Oct. 2003.

[21] P. Bohannon, R. Rastogi, A. Silberschatz, and S. Sudarshan. The architecture of the

Dali main memory storage manager. Bell Labs Technical Journal, 2(1):36–47, 1997.

[22] Robert Bradford, Evangelos Kotsovinos, Anja Feldmann, and Harald Schioberg.¨ Live

wide-area migration of virtual machines including local persistent state. In VEE ’07:

Proceedings of the 3rd international conference on Virtual execution environments,

pages 169–179, 2007.

[23] E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: Running commodity

operating systems on scalable multiprocessors. In Proc. of ACM SOSP 1997, vol.

31(5) of ACM Operating Systems Review, pages 143–156, Oct. 1997.

[24] C. Sapuntzakis, R. Chandra, et al. Optimizing the migration of virtual computers. In Proc. of OSDI, December 2002.

[25] Pei Cao, Swee Boon Lim, Shivakumar Venkataraman, and John Wilkes. The tickertaip

parallel raid architecture. SIGARCH Comput. Archit. News, 21(2):52–63, 1993.

[26] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Pat-

terson. Raid: high-performance, reliable secondary storage. ACM Comput. Surv.,

26(2):145–185, 1994.

[27] C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield.

Live migration of virtual machines. In Network System Design and Implementation,

2005.

[28] D. Comer and J. Griffoen. A new design for distributed systems: the remote memory

model. In Proc. of the USENIX 1991 Summer Technical Conference, pages 127–135,

1991.

[29] M. Connor and P. Kumar. Parallel construction of k-nearest neighbor graphs for point

clouds. In Eurographics Symposium on Point-Based Graphics, 2008.

[30] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurr. Comput.: Pract. Exper., 17:323–356, 2005.

[31] Michael D. Dahlin, Randolph Y. Wang, Thomas E. Anderson, and David A. Patter-

son. Cooperative caching: using remote client memory to improve file system per-

formance. In OSDI ’94: Proceedings of the 1st USENIX conference on Operating

Systems Design and Implementation, page 19, Berkeley, CA, USA, 1994. USENIX

Association.

[32] Peter J. Denning. The working set model for program behavior. Commun. ACM,

11(5):323–333, 1968.

[33] Diwaker Gupta, Sangmin Lee, Michael Vrable, et al. Difference Engine: Harnessing memory redundancy in virtual machines. In OSDI: Operating Systems Design and Implementation, 2008.

[34] Frederick Douglis. Transparent process migration in the sprite operating system.

Technical report, Berkeley, CA, USA, 1990.

[35] S. Dwarkadas, N. Hardavellas, L. Kontothanassis, R. Nikhil, and R. Stets. Cashmere-

VLM: Remote memory paging for software distributed shared memory. In Proc. of Intl.

Parallel Processing Symposium, San Juan, Puerto Rico, pages 153–159, April 1999.

[36] D. R. Engler, M. F. Kaashoek, and Jr. J. O’Toole. Exokernel: an operating system

architecture for application-level resource management. In SOSP ’95: Proceedings

of the fifteenth ACM symposium on Operating systems principles, pages 251–266,

New York, NY, USA, 1995. ACM.

[37] M. Feeley, W. Morgan, F. Pighin, A. Karlin, and H. Levy. Implementing global memory

management in a workstation cluster. Operating Systems Review, 15th ACM Sympo-

sium on Operating Systems Principles, 29(5):201–212, 1995.

[38] E. Felten and J. Zahorjan. Issues in the implementation of a remote paging system.

Tech. Report 91-03-09, Comp. Science Dept., University of Washington, 1991.

[39] M. Flouris and E.P. Markatos. The network RamDisk: Using remote memory on

heterogeneous NOWs. Cluster Computing, 2(4):281–293, 1999.

[40] H. Garcia-Molina, R. Lipton, and J. Valdes. A massive memory machine. IEEE Trans-

actions on Computers, C-33 (5):391–399, 1984.

[41] Tal Garfinkel and Mendel Rosenblum. When virtual is harder than real: security chal-

lenges in virtual machine based computing environments. In HOTOS 2005, pages

20–20, Berkeley, CA, USA.

[42] K. Gopalan and T. Chiueh. Delay budget partitioning to maximize network resource

usage efficiency. In Proc. IEEE INFOCOM’04, Hong Kong, China, March 2004.

[43] Grzegorz Milos, Derek G. Murray, and Michael A. Fetterman. Satori: Enlightened page sharing. In USENIX Annual Technical Conference, 2009.

[44] H. Andres Lagar-Cavilla et al. Impromptu clusters for near-interactive cloud-based

services. Technical report, 2008.

[45] Steven Hand, Andrew Warfield, Keir Fraser, Evangelos Kotsovinos, and Dan Magen-

heimer. Are virtual machine monitors microkernels done right? In HOTOS, pages

1–1. USENIX Association, 2005.

[46] Steven M. Hand. Self-paging in the nemesis operating system. In OSDI, pages 73–86,

1999.

[47] J. Hansen and A. Henriksen. Nomadic operating systems. In Master’s thesis, Dept.

of Computer Science, University of Copenhagen, Denmark, 2002.

[48] J.G. Hansen and E. Jul. Self-migration of operating systems. In Proc. of the 11th ACM SIGOPS, 2004.

[49] Michael Hines and Kartik Gopalan. MemX: Supporting large memory applications in

xen virtual machines. In Second International Workshop on Virtualization Technology

in Distributed Computing (VTDC07), Reno, Nevada, 2007.

[50] Michael Hines, Mark Lewandowski, Jian Wang, and Kartik Gopalan. Anemone: Trans-

parently harnessing cluster-wide memory. In Proc. of the International Symposium on

Performance Evaluation of Computer and Telecommunication Systems (SPECTS),

Calgary, Canada, Aug. 2006.

[51] Michael Hines, Jian Wang, and Kartik Gopalan. Distributed anemone: Transparent

low-latency access to remote memory. In Proc. of International Conference on High

Performance Computing, Bangalor, India, 2006.

[52] Michael R. Hines. http://www.cs.binghamton.edu/∼mhines/code/, simple use of the

linux aio system calls, Feb. 2007.

[53] Michael R. Hines and Kartik Gopalan. Post-copy based live virtual machine migration

using adaptive pre-paging and dynamic self-ballooning. In VEE, pages 51–60, 2009.

[54] Yifan Hu. JASPA sparse matrix multiplication benchmark, Advanced Research Computing Group, Science and Technology Facilities Council, UK. http://www.cse.scitech.ac.uk/arc/jaspa/.

[55] Kai Hwang, Hai Jin, and Roy S.C. Ho. Orthogonal striping and mirroring in distributed

raid for i/o-centric cluster computing. IEEE Trans. Parallel Distrib. Syst., 13(1):26–44,

2002.

[56] L. Ibarria, P. Lindstrom, J. Rossignac, and A. Szymczak. Out-of-core compression

and decompression of large N-dimensional scalar fields. In Proc. of Eurographics

2003, pages 343–348, September 2003.

[57] S. Ioannidis, E.P. Markatos, and J. Sevaslidou. On using network memory to improve

the performance of transaction-based systems. In In Proc. of Parallel and Distributed

Processing Techniques and Applications (PDPTA '98), 1998.

[58] Kerrighed. http://www.kerrighed.org.

[59] S. Koussih, A. Acharya, and S. Setia. Dodo: A user-level system for exploiting idle

memory in workstation clusters. In Proc. of the Eighth IEEE Intl. Symp. on High

Performance Distributed Computing (HPDC-8), 1999.

[60] Benjamin LaHaise and Alexander Viro. http://lwn.net/articles/216200/, problems with

the Linux asynchronous I/O subsystem, Jan. 2007.

[61] Edward K. Lee and Chandramohan A. Thekkath. Petal: distributed virtual disks. In

ASPLOS-VII, pages 84–92, 1996.

[62] I. M. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, ,

and E. Hyden. The design and implementation of an operating system to support

distributed multimedia applications. 1996.

[63] Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman, Peter Chubb, Ben Leslie, and

Gernot Heiser. Pre-virtualization: Slashing the cost of virtualization. Technical Report

2005-30, Fakultat¨ fur¨ Informatik, Universitat¨ Karlsruhe (TH), November 2005.

[64] Jochen Liedtke. Improving ipc by kernel design. In SOSP ’93: Proceedings of the

fourteenth ACM symposium on Operating systems principles, pages 175–188, New

York, NY, USA, 1993. ACM.

[65] P.Lindstrom. Out-of-core construction and visualization of multiresolution surfaces. In

Proc. of ACM SIGGRAPH 2003 Symposium on Interactive 3D Graphics, April 2003.

[66] M. Satyanarayanan, B. Gilbert, et al. Pervasive personal computing in an internet

suspend/resume system. IEEE Internet Computing, 11(2):16–25, 2007.

[67] Dan Magenheimer. Memory Overcommit... without the commitment:

http://wiki.xensource.com/xenwiki/

Open Topics For Discussion?action=AttachFile&do=get&target=Memory+Overcommit.pdf.

Oracle Corp., 2008.

[68] E.P. Markatos and G. Dramitinos. Implementation of a reliable remote memory pager.

In USENIX Annual Technical Conference, pages 177–190, 1996.

[69] Matthew Chapman and Gernot Heiser. vNUMA: A virtual shared-memory multiprocessor.

In USENIX Annual Technical Conference, 2009.

[70] I. McDonald. Remote paging in a single address space operating system supporting

quality of service. Tech. Report, Dept. of Comp. Science, Univ. of Glasgow, 1999.

[71] D. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration

survey. ACM Computing Surveys, 32(3):241–299, Sep. 2000.

[72] Sape J. Mullender, Guido van Rossum, Andrew S. Tanenbaum, Robbert van Re-

nesse, and Hans van Staveren. Amoeba: A distributed operating system for the

1990s. Computer, 23(5):44–53, 1990.

[73] Michael Nelson, Beng-Hong Lim, and Greg Hutchins. Fast transparent migration for

virtual machines. In Usenix 2005, pages 25–25.

[74] M. Noack. Comparative evaluation of process migration algorithms. Master’s thesis,

Dresden University of Technology - Operating Systems Group, 2003.

[75] NS2: Network Simulator. http://www.isi.edu/nsnam/ns/.

[76] G. Oppenheimer and N. Weizer. Resource management for a medium scale time-

sharing operating system. Commun. ACM, 11(5):313–322, 1968.

[77] Nir Oren. A survey of prefetching techniques. Technical report, University of Ab-

erdeen, Computing Science, 2000.

[78] S. Osman, D. Subhraveti, G. Su, and J. Nieh. The design and implementation of zap:

A system for migrating computing environments. In Proc. of OSDI, pages 361–376,

2002.

[79] Pin Lu and Kai Shen. Virtual machine memory access tracing with hypervisor exclusive cache. In 2007 USENIX Annual Technical Conference, 2007.

[80] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under

Unix. In Usenix Winter Technical Conference, pages 213–223, January 1995.

[81] POV-Ray. The persistence of vision raytracer, 2005.

[82] Mohammad Salimullah Raunak. A survey of cooperative caching. Technical re-

port, University of Massachusetts Amherst, Laboratory for Advanced System Software,

1999.

[83] Michael Richmond and Michael Hitchens. A new process migration algorithm.

SIGOPS Oper. Syst. Rev., 31(1):31–42, 1997.

[84] E. T. Roush. Fast dynamic process migration. In ICDCS 1996 Conference on Dis-

tributed Computing Systems (ICDCS ’96), page 637, Washington, DC, USA, 1996.

[85] Roy S.C. Ho, Cho-Li Wang, and Francis C.M. Lau. Lightweight process migration and memory prefetching in openMosix. In IPDPS 2008, 2008.

[86] Prasenjit Sarkar and John Hartman. Efficient cooperative caching using hints.

SIGOPS Oper. Syst. Rev., 30(SI):35–46, 1996.

[87] B. K. Schmidt. Supporting Ubiquitous Computing with Stateless Consoles and Com-

putation Caches. PhD thesis, Computer Science Dept., Stanford University, 2000.

[88] J.H. Schopp, K. Fraser, and M.J. Silbermann. Resizing memory with balloons and

hotplug. In Linux Symposium, pages 305–312, 2006.

[89] Silicon Graphics Inc. http://www.sgi.com/tech/stl/sort.html, Standard Template Library

Quicksort.

[90] E. Stark. SAMSON: A scalable active memory server on a network, Aug. 2003.

[91] Georg Stellner. Cocheck: Checkpointing and process migration for mpi. In IPPS

’1996, pages 526–531, Washington, DC, USA.

[92] T. Wood, P. Shenoy, and A. Venkataramani. Black-box and gray-box strategies for virtual machine migration. In Proc. of NSDI 2007, April 2007.

[93] The iSCSI Enterprise Target Project.

http://iscsitarget.sourceforge.net/.

[94] Prof. K. S. Trivedi. An analysis of prepaging. Technical report, Dept. of Computer

Science, Duke University, 1979.

[95] Irina Chihaia Tuduce and Thomas Gross. Adaptive main memory compression. In

ATEC ’05, pages 29–29. USENIX, 2005.

[96] C.A. Waldspurger. Memory resource management in vmware esx server. In Operating

System Design and Implementation (OSDI 02), Boston, MA, Dec 2002.

[97] A. Whitaker, M. Shaw, and S.D. Gribble. Denali: Lightweight Virtual Machines for

Distributed and Networked Applications. Tech Report 02-02-01, Univ of Washington,

2002.