PROVISIONING WIDE-AREA VIRTUAL ENVIRONMENTS THROUGH I/O INTERPOSITION: THE REDIRECT-ON-WRITE FILE SYSTEM AND CHARACTERIZATION OF I/O OVERHEADS IN A VIRTUALIZED PLATFORM

By VINEET CHADHA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

© 2008 Vineet Chadha

I dedicate this thesis to my parents.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Renato Figueiredo, for all the support he has provided me during the last six years. He has guided me around the maze of systems research and shown me the right way whenever I felt lost. It has been a privilege to work with Dr. Figueiredo, whose calm, humble, and polite demeanor is one I would like to carry forward in my career. Thanks to Dr. Jose Fortes, who gave me the opportunity to work at the Advanced Computing and Information Systems (ACIS) laboratory and who provided encouragement and support whenever things were down. I would like to thank Dr. Oscar Boykin for serving on my committee and for all those fruitful discussions on

research, healthy food, and running. His passion for achieving perfection in every endeavor of life often spurs me to do better. Thanks to Dr. Alan George and Dr. Joseph

Wilson for serving on my PhD program committee and for motivating me through their courses and research work. I would like to thank my mentor, Ramesh Illikkal, and my manager, Donald Newell, at Intel Corporation for the faith they have shown in me and for spurring me to work hard. It has been a privilege to work with Ramesh, who taught me the importance of teamwork, failure, and success. Thanks to Dr. Padmashree Apparao and Dr. Ravishankar Iyer for their guidance and encouragement in achieving my goals. Thanks are also due to Dr. Ivan Krsul and Dr. Suma Adabala for guiding me not only during the development and research of the In-VIGO project but also for often sharing thoughts on the PhD program and its expectations. Thanks are due to all the colleagues here at ACIS who made the lab a fun environment to work in. I would like to thank Andrea and Mauricio for providing excellent research facilities and resources. Thanks to my officemates Arijit and Girish for all the fruitful discussions. Thanks are due to Cathy for maintaining a cordial environment in the ACIS lab and for extending support to me as a good friend whenever the need arose.

Thanks also to the rest of the team members in the Many-core Architecture group at Intel Corporation. I am thankful to the National Science Foundation and Intel Corporation for providing me financial assistance during my PhD program. Thanks are due to the Department of Electrical Engineering and the Department of Computer Information Science and Engineering for maintaining all the paperwork related to my appointment, travel, and PhD milestones, which made my life easier in Gainesville. This dissertation would not have been possible without unwavering support from friends and family. I would like to thank my good friend Hemang for the support and encouragement he has provided me over the past six years. This dissertation is dedicated to my parents, whose sacrifices cannot be described in words. They have been a source of strength all these years. The good values and the importance of education they have instilled in me have been a guiding light in my career. Thanks are due to my brother, who often guided me the right way to achieve my goals in life. Finally, I thank the almighty for giving me the strength to work hard.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ...... 4

LIST OF TABLES ...... 9

LIST OF FIGURES ...... 10

ABSTRACT ...... 13

CHAPTER

1 INTRODUCTION ...... 16
  1.1 Introduction ...... 16
    1.1.1 Virtual Network File System I/O Redirection ...... 16
    1.1.2 Characterization of I/O Overheads in a Virtualized Environment ...... 18
      1.1.2.1 Simulation ...... 19
      1.1.2.2 I/O Mechanisms ...... 19
  1.2 Dissertation Contributions ...... 20
  1.3 Dissertation Relevance ...... 22
  1.4 Dissertation Overview ...... 23
  1.5 Dissertation Organization ...... 26

2 I/O VIRTUALIZATION: RELATED TERMS AND TECHNOLOGIES ...... 28
  2.1 Introduction ...... 28
  2.2 Virtualization Technologies ...... 29
  2.3 Virtual Machine Architectures ...... 32
    2.3.1 I/O Mechanisms in Virtual Machines ...... 32
    2.3.2 Virtual Machines and CMP Architectures ...... 34
  2.4 Grid Computing ...... 35
  2.5 File System Virtualization ...... 35
    2.5.1 Network File System ...... 36
    2.5.2 Grid Virtual File System ...... 38

3 REDIRECT-ON-WRITE DISTRIBUTED FILE SYSTEM ...... 41
  3.1 Introduction ...... 41
    3.1.1 File System Abstraction ...... 41
    3.1.2 Redirect-on-Write File System ...... 42
  3.2 Motivation and Background ...... 42
    3.2.1 Use-Case Scenario: File System Sessions for Grid Computing ...... 44
    3.2.2 Use-Case Scenario: NFS-Mounted Virtual Machine Images and O/S File Systems ...... 44
    3.2.3 Use-Case Scenario: Fault Tolerant Distributed Computing with Virtual Machines ...... 45
  3.3 ROW-FS Architecture ...... 47
    3.3.1 Hash Table ...... 48
    3.3.2 Bitmap ...... 49
  3.4 ROW-FS Implementation ...... 51
    3.4.1 MOUNT ...... 51
    3.4.2 LOOKUP ...... 53
    3.4.3 GETATTR/SETATTR ...... 54
    3.4.4 READ ...... 54
    3.4.5 WRITE ...... 55
    3.4.6 READDIR ...... 55
    3.4.7 REMOVE/RMDIR/RENAME ...... 58
    3.4.8 LINK/READLINK ...... 58
    3.4.9 SYMLINK ...... 60
    3.4.10 CREATE/MKDIR ...... 60
    3.4.11 STATFS ...... 60
  3.5 Experimental Results ...... 60
    3.5.1 Microbenchmark ...... 61
    3.5.2 Application Benchmark ...... 63
    3.5.3 Virtual Machine Instantiation ...... 66
    3.5.4 File System Comparison ...... 67
  3.6 Related Work ...... 68
  3.7 Conclusion ...... 68

4 PROVISIONING OF VIRTUAL ENVIRONMENTS FOR WIDE AREA DESKTOP GRIDS ...... 70
  4.1 Introduction ...... 70
  4.2 Data Provisioning Architecture ...... 71
  4.3 ROW-FS Consistency and Replication Approach ...... 78
    4.3.1 ROW-FS Consistency in Image Provisioning ...... 79
    4.3.2 ROW-FS Replication in Image Provisioning ...... 80
  4.4 Security Implications ...... 81
  4.5 Experiments and Results ...... 84
    4.5.1 Proxy VM Resource Consumption ...... 84
    4.5.2 RPC Call Profile ...... 86
    4.5.3 Data Transfer Size ...... 87
    4.5.4 Wide-area Experiment ...... 87
    4.5.5 Distributed Hash Table State Evaluation and Analysis ...... 88
  4.6 Related Work ...... 88
  4.7 Conclusion ...... 90

5 I/O WORKLOAD PERFORMANCE CHARACTERIZATION ...... 91
  5.1 Introduction ...... 92
  5.2 Motivation and Background ...... 93
    5.2.1 Full System Simulator ...... 94
    5.2.2 I/O Virtualization in Xen ...... 94
  5.3 Analysis Methodology ...... 96
    5.3.1 Full System Simulation: Xen VMM as Workload ...... 96
    5.3.2 Instruction Trace ...... 97
    5.3.3 Symbol Annotation ...... 98
    5.3.4 Performance Statistics ...... 98
    5.3.5 Environmental Setup for Virtualized Workload ...... 100
  5.4 Experiments and Simulation Results ...... 102
    5.4.1 Life Cycle of an I/O Packet ...... 102
      5.4.1.1 Unprivileged Domain ...... 103
      5.4.1.2 Grant Table Mechanism ...... 105
      5.4.1.3 Timer Interrupts ...... 105
      5.4.1.4 Privileged Domain ...... 106
    5.4.2 Cache and TLB Characteristics ...... 108
  5.5 Cache and TLB Scaling ...... 110
  5.6 Related Work ...... 114
  5.7 Conclusion ...... 115

6 HARDWARE SUPPORT FOR I/O WORKLOADS: AN ANALYSIS ...... 116
  6.1 Introduction ...... 116
  6.2 Translation Lookaside Buffer ...... 117
    6.2.1 Introduction ...... 117
    6.2.2 TLB Invalidation in Multiprocessors ...... 118
  6.3 Interprocessor Interrupts ...... 120
  6.4 Grant Table Mechanism: I/O Analysis ...... 121
  6.5 Experiments and Results ...... 123
    6.5.1 Grant Table Performance ...... 123
    6.5.2 Hypervisor Global Bit ...... 124
    6.5.3 TLB Coherence Evaluation ...... 125
  6.6 Related Work ...... 129
  6.7 Conclusion ...... 131

7 CONCLUSION AND FUTURE WORK ...... 132

  7.1 Conclusion ...... 132
  7.2 Future Work ...... 133

REFERENCES ...... 136

BIOGRAPHICAL SKETCH ...... 145

LIST OF TABLES

3-1 Summary of the NFS v2 protocol remote procedure calls ...... 53

3-2 LAN and WAN experiments for micro-benchmarks ...... 62

3-3 Andrew benchmark and AM-Utils execution times ...... 63

3-4 Linux kernel compilation execution times on a LAN and WAN...... 65

3-5 Wide area experimental results for diskless Linux boot and second boot ...... 66
3-6 Remote Xen boot/reboot experiment ...... 67
4-1 Grid appliance boot and reboot times over wide area network ...... 88

4-2 Mean and variance of DHT access time for five clients ...... 88
6-1 Grant table overhead summary ...... 124

6-2 TLB flush statistics with and without IPI flush optimization ...... 128
6-3 Instruction TLB miss statistics with and without IPI flush optimization ...... 129
6-4 Data TLB miss statistics with and without IPI flush optimization ...... 129

LIST OF FIGURES

1-1 Protocol redirection through user-level proxies ...... 17

1-2 Illustration of server consolidation ...... 18

2-1 Landscape of virtualized computer systems ...... 30

2-2 Systems partitioning characteristics ...... 31

2-3 I/O virtualization path for single O/S and virtual machines ...... 33
2-4 I/O partitioning in virtual machine and CMP architectures ...... 34
2-5 Grid virtual file system ...... 39

3-1 Indirection mechanism in the Linux virtual file system ...... 41
3-2 Middleware data management for shared VM images ...... 43

3-3 Check-pointing a VM container running an application with NFS-mounted file system ...... 46
3-4 Redirect-on-write file system architecture ...... 48
3-5 ROW-FS proxy deployment options ...... 49
3-6 Hash table and flag descriptions ...... 50
3-7 Remote procedure call processing in ROW-FS ...... 50
3-8 A snapshot view of file system session through Redirect-on-Write proxy ...... 52

3-9 Sequence of redirect-on-write file system calls ...... 56

3-10 Number of RPC calls received by NFS server in non-virtualized environment, and by ROW-FS shadow and main servers during Andrew benchmark execution ...... 64

4-1 O/S image management over wide area desktops ...... 72
4-2 The deployment of the ROW proxy to support PXE-based boot of a (diskless) non-persistent VM over a wide area network ...... 73
4-3 Algorithm to bootstrap a VM session ...... 77

4-4 Algorithm to publish a virtual machine Image ...... 77 4-5 Replication approach for ROW-FS ...... 81

4-6 Diskless client and publisher client security ...... 83

4-7 Proxy VM usage time series for CPU, disk and network ...... 85
4-8 RPC statistics for diskless boot ...... 87

4-9 Cumulative distribution of DHT query through 10 IPOP clients (in seconds) .. 89

5-1 Full system simulation environment with Xen execution ...... 95
5-2 Execution driven simulation and symbol annotated profiling methodology ...... 97

5-3 Symbol annotation ...... 99
5-4 Function-level performance statistics ...... 99
5-5 SoftSDV CPU controller execution mode: performance or functional ...... 101

5-6 Life of an I/O packet ...... 103
5-7 Unprivileged domain call graph ...... 104

5-8 TCP transmit and grant table invocation ...... 105

5-9 Timer interrupts to initiate context switch ...... 106
5-10 Life of a packet in privileged domain ...... 107

5-11 Impact of TLB flush and context switch ...... 109
5-12 Correlation between VM switching and TLB misses ...... 109
5-13 TLB misses after a VM context switch ...... 110
5-14 TLB misses after a grant destroy ...... 111
5-15 Impact of VM switch on cache misses ...... 111

5-16 L2 cache performance for transmit of I/O packets ...... 112
5-17 Data and instruction TLB performance for transmit of I/O packets ...... 112

5-18 L2 cache performance for receive of I/O packets ...... 113
5-19 Data and instruction TLB performance for receive of I/O packets ...... 114

6-1 The x86 page table for small pages ...... 119
6-2 Interprocessor interrupt mechanism in x86 architecture ...... 121

6-3 Simulation experimental setup ...... 124
6-4 Impact of tagging TLB with a global bit ...... 125

6-5 Page sharing in multicore environment ...... 126

6-6 Simics model to capture inter-processor interrupts ...... 127

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PROVISIONING WIDE-AREA VIRTUAL ENVIRONMENTS THROUGH I/O INTERPOSITION: THE REDIRECT-ON-WRITE FILE SYSTEM AND CHARACTERIZATION OF I/O OVERHEADS IN A VIRTUALIZED PLATFORM

By

Vineet Chadha

December 2008

Chair: Renato J. O. Figueiredo
Major: Computer Engineering

This dissertation presents mechanisms to provision and characterize I/O workloads for applications found in virtual data centers. It addresses two specific modes of workload execution in a virtual data center: (1) workload execution on heterogeneous compute resources across a wide-area environment, and (2) workload execution and characterization within a virtualized platform. A key challenge arising in wide-area, grid computing infrastructures is that of data management - how to provide data to applications, seamlessly, in environments spanning multiple domains. In these environments, it is often the case that data movement and sharing is mediated by middleware that schedules applications. This dissertation presents a novel approach that enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area - the Redirect-on-Write file system (ROW-FS). The ROW-FS approach enables multiple clients to operate on private, virtual versions of data mounted from a single shared dataset served via the Network File System (NFS), and it enables multiple VM instances to efficiently share a common set of virtual machine image files. The proposed approach offers savings in storage and bandwidth requirements compared to the conventional approaches of provisioning VMs by copying the entire

VM image to the client and by cloning the image on the server side. The thin-client

approach described in this dissertation uses ROW-FS to enable the use of unmodified NFS clients/servers and local buffering of file system modifications during an application's lifetime. An important application of ROW-FS is in enabling the instantiation of multiple non-persistent virtual machines across wide-area resources from read-only images stored in an image server (or distributed along multiple replicas). A common deployment scenario of ROW-FS is one in which the virtual machine hosting its private, redirected “shadow” file system server and the client virtual machine are consolidated into a single physical machine. While a virtual machine provides levels of execution isolation and service partitioning that are desirable in environments such as data centers, its associated overheads can be a major impediment to the wide deployment of virtualized environments. While the virtualization cost depends heavily on workloads, the overhead is much higher with I/O-intensive workloads compared to those which are compute-intensive. Unfortunately, the architectural reasons behind the I/O performance overheads are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute most of the I/O virtualization cost. While most of these evaluations have been done using measurements, this dissertation presents an execution-driven, simulation-based analysis methodology with symbol annotation as a means of evaluating the performance of virtualized workloads, and presents a simulation-based characterization of the performance of a representative network-intensive benchmark (iperf) in the Xen virtual machine environment.

The main contributions of this dissertation are: (1) the novel design and implementation of the ROW-FS file system; (2) an experimental evaluation of ROW-FS in an O/S image framework that enables virtual machine images to be published, discovered, and transferred on demand through a combination of ROW-FS and peer-to-peer techniques; (3) a novel implementation of an execution-driven simulation framework to evaluate network I/O performance using symbol annotation for environments that encompass both a virtual machine hypervisor and guest domains; and (4)

evaluation, through simulation, of the potential benefits of different micro-architectural TLB improvements on performance.

CHAPTER 1
INTRODUCTION

1.1 Introduction

Virtualization technologies are widely adopted for usage models such as high performance grid computing and server consolidation [1]. This dissertation investigates data provisioning and performance characterization of virtual I/O, in particular network file systems, in such virtualized environments. The goals of this dissertation are as follows:

First, devise and evaluate techniques which can seamlessly provide data to applications in wide-area environments spanning multiple domains through distributed file system protocol I/O redirection. Second, because the performance of this data provisioning solution is inherently limited by overheads associated with network I/O in a virtualized environment, this dissertation evaluates network I/O virtualization overheads in such environments with a simulation-based methodology that enables quantitative analysis of the impact of micro-architectural features on the performance of a contemporary split-I/O virtual machine hypervisor. Finally, this work explores hardware/software support to alleviate bottlenecks in I/O performance. The distributed file system redirection approach for data management can benefit, in particular, applications where large, mostly-read data sets need to be provisioned in a virtual data center.

1.1.1 Virtual Network File System I/O Redirection for Wide Area Applications

In environments such as desktop grids, scientific experiments over wide area networks are often conducted on low-bandwidth desktops for many hours [2]. To facilitate data movement in such environments, I have developed a novel redirect-on-write distributed file system (ROW-FS) which allows for application-transparent buffering and request re-routing of all file system modifications locally through user-level proxies. These proxies forward file system accesses that modify any file system objects to a “shadow” distributed

file system server, creating on-demand private copies of such objects in the shadow file server while routing accesses to unmodified data to a “main” server. A motivation for this

approach is the transparency offered in accessing large read-only virtual machine images with redirect-on-write semantics - a functionality lacking in traditional distributed file systems

such as the Network File System (NFS) [3]. Figure 1-1 shows an example of protocol

redirection of NFS through user-level proxies. Virtual machine environments are commonly used to improve CPU utilization in

computer systems [1]. Such VM-based environments are increasingly used in data centers for resource consolidation, and in high-performance and grid computing as a means to facilitate the deployment of user-customized execution environments [4][5][6][7]. Thus, the deployment of user-level proxies in VM-based execution environments is a common

scenario to facilitate data movement and provisioning [4].

Figure 1-1. Protocol Redirection: Client/Server protocol is modified to forward RPC calls to a shadow distributed file system server
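To make the redirection concrete, the following simplified C sketch outlines how a user-level proxy of this kind might route each incoming RPC call between the main and shadow servers. This is a minimal illustration only: the procedure names follow NFSv2, but the forward_to_shadow/forward_to_main helpers and the handle_is_modified bookkeeping are hypothetical placeholders rather than the actual ROW-FS implementation (which is described in Chapter 3).

/* Hypothetical sketch of read/write routing in a redirect-on-write proxy.
 * forward_to_shadow(), forward_to_main(), handle_is_modified() and
 * copy_object_to_shadow() are illustrative stubs; the real ROW-FS proxy
 * keeps this state in a hash table and bitmaps (Chapter 3). */
#include <stdbool.h>

enum nfs_proc { NFS_GETATTR, NFS_LOOKUP, NFS_READ, NFS_WRITE,
                NFS_CREATE, NFS_REMOVE, NFS_SETATTR };

struct rpc_call  { enum nfs_proc proc; unsigned char fhandle[32]; };
struct rpc_reply { int status; };

bool handle_is_modified(const unsigned char *fh);      /* already copied?  */
void copy_object_to_shadow(const unsigned char *fh);   /* first-write copy */
struct rpc_reply forward_to_shadow(struct rpc_call *c);
struct rpc_reply forward_to_main(struct rpc_call *c);

struct rpc_reply route_call(struct rpc_call *c)
{
    switch (c->proc) {
    case NFS_WRITE: case NFS_CREATE: case NFS_REMOVE: case NFS_SETATTR:
        /* Calls that modify file system objects always go to the shadow
         * server; the object is privately copied there on first write. */
        if (!handle_is_modified(c->fhandle))
            copy_object_to_shadow(c->fhandle);
        return forward_to_shadow(c);
    default:
        /* Reads of unmodified data are still served by the main server. */
        if (handle_is_modified(c->fhandle))
            return forward_to_shadow(c);
        return forward_to_main(c);
    }
}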

For example, consider the scenario of multiple servers in data centers which are consolidated into a single server through deployment of virtual machines, which represents

a common contemporary use case of virtualization [8]. Figure 1-2 shows a possible deployment of ROW proxies which forward calls to the main NFS server and the shadow server. The shadow server VM and the client VM are consolidated into a single physical machine. Such a scenario is common when the client VM is diskless or disk space on the client VM is constrained. While deploying ROW proxies in such cases provides much-needed functionality, the overhead associated with virtualized network I/O is often considered a bottleneck [9][10]. Specifically, the performance of ROW-FS in such environments, as well as that of several other approaches to virtual I/O which have been proposed in the context of VM environments [3], is dependent upon the performance of the underlying virtual I/O network layer.

Figure 1-2. Server Consolidation: Illustration of partitioning of a physical system with shared hardware resources. A specific case of ROW-FS deployment is shown. The shadow server is encapsulated in a separate VM. RPC calls are passed to the remote/shadow servers through a ROW-FS proxy deployed in the privileged VM.

1.1.2 Characterization of I/O Overheads in a Virtualized Environment

The design and performance of network I/O in VM-based environments is critical especially for enterprise-class applications used in the IT industry [11]. Also, system workloads are becoming more complex as applications are often compartmentalized and executed within virtual machines [1]. Thus, the key responsibilities of conventional O/Ses such as scheduling are delegated to a new layer - the Hypervisor or Virtual Machine

Monitor [12]. Hypervisors are widely accepted as an approach to address under-utilization of resources such as I/O and CPU in physical machines. To characterize the impact

of network I/O in these virtual machines, I employ a simulation-driven approach to evaluate network I/O performance with respect to different micro-architectural designs and configurations (e.g., cache and TLB). The next few sections answer the following questions:

1. Why has a simulation-based approach been chosen?

2. What are the options for network I/O mechanisms in virtual machine designs?

1.1.2.1 Simulation

Simulation-based approaches have been extensively used in computer architecture to design and analyze the performance of system architectures before they are implemented in silicon [13][14][15]. The use of simulation in this dissertation is motivated by the fact that current system evaluation methodologies for virtual machines are based on measurements of a deployed virtualized environment on a physical machine. Although such an approach gives good estimates of performance overheads for a given physical machine, it lacks flexibility in determining how performance scales with resources. In addition, it is difficult to replicate a measurement framework on different system architectures. It is important to move towards a full system simulation methodology because it is a flexible approach for studying different architectures.

1.1.2.2 I/O Mechanisms

For several applications which are data- and communication-intensive, the performance of virtual machine approaches depends on how efficiently I/O operations are virtualized.

Several options have been proposed and implemented in VMMs, such as VMware's hosted I/O [16]. The virtual machine I/O architectures which have been implemented in different hypervisors are: (1) the direct I/O model (Xen 1.0), where the hypervisor is responsible for running device drivers; (2) the split I/O model (Xen 2.0/3.0), where device drivers are exported to a privileged guest virtual machine; and (3) the pass-through I/O model, where guest virtual machines are given direct access to hardware devices. Chapter 2 further

describes these I/O mechanisms. This section focuses primarily on the split I/O approach. The split I/O approach is motivated by factors which include the ability to reuse device drivers of existing O/Ses and the ability to isolate device driver faults (which are responsible for a substantial fraction of system failures in practice) from the kernel. Such splitting of the system across services that communicate via some form of inter-process communication (IPC) is akin to the microkernel approach to O/S design [17]. Microkernels have not found wide acceptance, arguably because of the overheads associated with inter-process communication between different entities (such as file systems and device drivers). As processors become faster and new CMP architectures provide core-level parallelism, I/O approaches such as offloading selective virtual I/O functionality to a separate core are being explored [18]. I chose split I/O as the basis for the investigation in this dissertation for the following reasons:

• The split I/O model has been adopted by open-source hypervisors such as Xen 3.0. I/O workloads stress inter-process communication in split-I/O hypervisors; thus, characterizing I/O performance and providing hardware and software support for fast on-chip handling of IPC is an important research problem to address.

• Chip multi-processor (CMP) architectures have become prevalent [19]. With the advent of chip multiprocessor architectures, the microkernel approach is being

re-visited; the Xen VMM version 3.0 [20] has features which are also found in

microkernels [17]. Dedicating CPUs to improve split I/O performance can easily be accommodated by CMP architectures.

1.2 Dissertation Contributions

This dissertation addresses two specific modes of provisioning and characterizing partitioned I/O workloads in a virtual data center: (1) workload execution on heterogeneous compute resources across a grid, and (2) workload execution and characterization on a virtualized platform. The key contributions of this dissertation are the following:

• First, I developed a novel method to virtualize a distributed file system using a user-level proxy between client and server so as to enable redirect-on-write functionality. Redirect-on-write (ROW) enables multiple clients to share network

file systems which are exported read-only by automatically and transparently maintaining local modifications private to each client. I show that ROW-FS not only gives substantially better performance than NFS for applications such as VM instantiation but also is easily deployable with no kernel modifications. The ROW-FS approach is novel in the way it virtualizes NFS protocol calls to enable redirect-on-write functionality on top of existing clients/servers. It is also unique in the way that it can be coupled with other potential enhancements of proxies (e.g. cache proxy), thus providing a needed flexibility to seamlessly access the data. An

example of this flexible approach is read-only data buffering of O/S images through cache proxies and buffering of writes through ROW proxies.

• Second, to study the performance issues of one such ROW proxy deployment and other representative network I/O workloads, I demonstrate the feasibility and initial results of using a simulation environment that accounts for the hypervisor, guest O/S kernel, and application layers to evaluate the profile of cache and TLB misses in a representative I/O workload. I show that this simulation-based approach is not only flexible enough to study new architectural paradigms but also provides a detailed analysis of the virtualized software stack.

While previous studies have relied on measurements (e.g. using profiling tools) to assess the performance impact of I/O virtualization on existing workloads and systems, it is important to understand architectural-level implications to guide the design of future platforms and the tuning of system software for virtualized environments. To address this problem, I apply a simulation-based analysis methodology which extends a full system simulator with symbol annotation of the entire software stack in virtualized environments

21 - including the hypervisor, service and guest domains. This is the first study using full-system simulation to estimate overheads of I/O processing in a virtualized system.

1.3 Dissertation Relevance

The contributions of this dissertation are particularly relevant to the following scenarios:

• An important class of wide-area Grid applications consists of long-running simulations. In domains such as coastal ocean modeling and high-energy physics, execution times on the order of days are not uncommon, and mid-session faults are

highly undesirable. Related distributed computing systems such as Condor [2] have dealt with this problem via application check-pointing and restart. A limitation of this approach is that it supports only a very restricted set of applications - they must be re-linked to Condor libraries and cannot use many system calls (e.g. fork, exec, mmap). The approach of redirecting RPC calls to a shadow server, in contrast, supports unmodified applications. It uses client-side virtualization mechanisms that allow for transparent buffering of all file system modifications produced by distributed file system clients on local stable storage. Locally buffered file system

modifications can then be checkpointed together with the application state. By “local storage”, I mean storage which is either local to the client machine or within the client's local area network.

• Another application is virtual machine state sharing between different clients for VM-based job scheduling. For example, the Condor system also supports a VM-based approach to high-throughput computing, where a user can submit a virtual machine for execution by specifying a VM image containing the application and its execution environment. Condor executes the VM and sends the resulting VM image back

to the user. A read-only block-based shared file system enabled by ROW-FS can minimize the overhead associated with transferring the complete VM state for job

execution.

• O/S image management is another application relevant to the approach put forth in this dissertation. ROW-FS is deployable to share and overlay immutable template images for operating system kernels and file systems. Related O/S image

management frameworks are based on file system stacks that leverage the advantages of local and wide-area file systems [21][22]. These O/S image frameworks either require kernel-level support or rely on aggressive caching of full O/S images.

• Virtual Machine Monitors (VMMs) can benefit from the simulation-based study of network I/O overhead associated with inter-VM communication. Virtual machines in a consolidated server generally communicate through shared memory mechanisms.

The implementations of these shared memory mechanisms lack micro-architectural overhead analysis for shared resources in virtualized environments. Thus, a simulation-based

approach can help in evaluating such communication mechanisms and in optimally sharing hardware resources between virtual machines.

1.4 Dissertation Overview

This dissertation addresses micro-architectural analysis of I/O workloads in virtual environments and a redirection-based virtualization approach to segregate I/O traffic

(read/write). Both problems are important in the context of managing data in a virtual data center. Distributed computing paradigms are designed to leverage compute nodes to execute long-running jobs; further, the characterization of a virtualized I/O platform is important to maximize the capability of each compute node. The remainder of this dissertation addresses these two inter-related challenges by tackling the following problems:

Problem 1: Provisioning of virtual environments in a wide area network

A key challenge arising in wide-area, grid computing infrastructures is that of data management - how to provide data to applications, seamlessly, in environments spanning multiple domains. In these environments, it is often the case that data movement and sharing is mediated by middleware that is responsible for scheduling applications.

Empirical evidence for a distributed file system (DFS) shows that 65-80% of written files are deleted within 30 seconds [23]. Previous research has shown that data access patterns for a DFS are often dynamic and ephemeral [23]. Also, current solutions are either based on complete file transfer, which may incur access latency overhead (as compared to block transfers) [24], or are limited to single domains [25]; neither is acceptable in virtualized environments, where there is often a need to access large O/S images over a wide area.

Solution: I present a novel approach that enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area - the Redirect-on-Write File System (ROW-FS). I show that the ROW-FS approach provides substantial improvements compared to the traditional NFS protocol for benchmark applications such as Linux kernel compilation and virtual machine instantiation.

Problem 2: Client-server consistency and replication mechanisms for on-demand data transfer

During the time a ROW-FS file system session is mounted, all modifications are redirected to the shadow server. It is important to consider consistency in distributed file systems because data can potentially be shared by multiple clients. Two different consistency scenarios need to be considered. First, there are applications in which it is neither needed nor desirable for data in the shadow server to be reconciled with the main server; an example is the provisioning of system images for diskless clients or virtual machines. Second, for applications in which it is desirable to reconcile data with the server, the ROW-FS proxy holds state in its primary data structures that can be used to commit modifications back to the server.

Solution: I leverage APIs exported by lookup services (such as a distributed hash table) in distributed frameworks (e.g., IPOP [26]) to keep clients consistent with the latest updates

(e.g., kernel patches). I describe a thin-client approach of deploying NFS proxies over a virtualized peer-to-peer network. This approach allows sharing of read-only images among multiple clients and the redirection of write accesses to a loop-back local client buffer. The solution is generic and easily deployable, as user-level proxies redirect traffic between local and remote conventional network file system servers without requiring kernel or VMM modifications. The ROW-FS proxy operations rely on the opaque nature of NFS file handles. These file handles remain unchanged in a replicated version of a server virtual machine. Thus, in the case of failure, the ROW proxy can be configured to redirect RPC calls to a replicated server.
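As an illustration of the lookup-service mechanism mentioned above, the fragment below sketches how a publisher could advertise, and a client retrieve, the latest version identifier of a shared image under a well-known key. The dht_put/dht_get functions and the key naming are hypothetical stand-ins for the primitives exported by a framework such as IPOP, not its actual API.

/* Hypothetical DHT primitives used to advertise the newest image version.
 * dht_put()/dht_get() and the key layout are illustrative only. */
#include <string.h>

int dht_put(const char *key, const char *value);
int dht_get(const char *key, char *value, unsigned long len);

/* Publisher side: announce that a new image version is available. */
static void publish_latest_version(const char *version)
{
    dht_put("images/base-image/latest", version);
}

/* Client side: before (re)mounting, check whether a newer version exists. */
static int newer_version_available(const char *mounted_version)
{
    char latest[32];
    if (dht_get("images/base-image/latest", latest, sizeof latest) != 0)
        return 0;                  /* lookup failed: keep the current image */
    return strcmp(latest, mounted_version) != 0;
}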

Problem 3: Characterizing and improving the performance of I/O workloads

While the virtualization cost depends heavily on workloads, the overhead is much higher with I/O-intensive workloads compared to those which are compute-intensive

[27]. The architectural reasons behind the I/O performance overheads in virtualized environments are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute most of the I/O virtualization cost [27].

Solution: I have applied an execution-driven methodology to study the network I/O performance of Xen (as a case study) in a full system simulation environment, using detailed cache and TLB models to profile and characterize software and hardware hotspots. By applying symbol annotation to the instruction flow reported by the execution-driven simulator, I derive function-level call flow information. This methodology provides detailed information at the architectural level and allows designers to evaluate potential hardware enhancements to reduce virtualization overhead.

Problem 4: Hardware support for network I/O virtualization

I have used an abstract performance model for two reasons. First, timing-accurate performance models can take a long time to provide overhead statistics for a virtualized environment. Second, timing-accurate models for full system simulators are not widely

available. However, the abstract performance model does not capture all overheads accurately. In a multi-core environment, a potential reason for performance degradation is inter-processor interrupts between the CPUs.

Solution: To evaluate the potential sources of I/O overhead, I have instrumented and profiled the Xen hypervisor to collect CPU cycle statistics. Further, I have used a simulation framework to evaluate the potential benefit of additional hardware support for TLB coherence. The goal of this analysis is to evaluate the extent to which the conservative approach of TLB shootdowns can negatively impact performance by removing valid translations from a remote TLB.
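The conservatism at issue can be pictured with the short sketch below, which contrasts flushing the entire TLB of the CPU that receives a shootdown inter-processor interrupt with invalidating only the affected translation. The handler and helper names are hypothetical illustrations; they are not the Xen or Linux implementations evaluated in Chapter 6.

/* Illustration of conservative vs. targeted handling of a TLB shootdown
 * IPI on the receiving CPU. Helper names are hypothetical. */
#include <stdint.h>

void tlb_flush_all(void);                 /* drops every cached translation */
void tlb_invalidate_page(uintptr_t va);   /* drops a single translation     */

/* Conservative handler: any remote page-table change empties this TLB,
 * discarding many translations that were still valid. */
void shootdown_ipi_conservative(uintptr_t changed_va)
{
    (void)changed_va;
    tlb_flush_all();
}

/* Targeted handler: only the translation for the changed page is removed,
 * preserving the rest of the TLB contents. */
void shootdown_ipi_targeted(uintptr_t changed_va)
{
    tlb_invalidate_page(changed_va);
}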

1.5 Dissertation Organization

The rest of the dissertation is organized as follows. Chapter 2 provides a brief introduction to the state-of-the-art technologies addressed and used in this dissertation. This includes virtual machine architectures and technologies,

VM-based grid computing, and file system virtualization. Chapter 3 discusses the NFS protocol, the widely used distributed file system upon which the ROW-FS architecture described in this dissertation builds; I chose NFS because it is a widely deployed distributed file system. Chapter 3 also describes potential systems for the deployment of ROW-FS, including on-demand session-based computational approaches (e.g., In-VIGO [4], COD [28]), virtual machine image provisioning, and fault tolerance in distributed computing. Further, Chapter 3 discusses the design and implementation of ROW-FS. I evaluate ROW-FS with micro- and

application benchmarks to measure the overhead associated with individual RPC calls. Chapter 4 describes a novel O/S image management and provisioning

architecture through diskless setup of VMs using ROW-FS. The primary goal is to

automate the process of publishing, updating, and mounting images. I explain image versioning through primitives exported by a distributed hash table. Chapter 5 addresses the full-system-simulation-based approach to characterizing and improving the performance of I/O workloads in a virtualized environment. Chapter 5 discusses the challenges in building simulation workloads, analyzing the life of an I/O packet inside the virtualized environment as a motivating example. I conclude that two of the main reasons for low network I/O performance are cache and TLB misses during packet transfers to and from the guest VM. In Chapter 6, I evaluate the overhead of the grant table and investigate possible micro-architectural approaches to reduce TLB misses. To gather statistics from hardware counters (e.g., CPU cycle counts) inside the grant table code, I instrumented the

Xen hypervisor. I also evaluate the impact of sharing pages between two CPUs in a multiprocessor environment. I conclude and summarize the findings of this dissertation in Chapter 7.

CHAPTER 2
I/O VIRTUALIZATION: RELATED TERMS AND TECHNOLOGIES

2.1 Introduction

Throughout the history of computer systems, hardware and software have often been designed to evolve such that a new software model or mechanism remains compatible with legacy systems. It has also often been the case that hardware and software models and mechanisms are developed at different paces. For example, even though hardware systems have been able to provide reasonable CPU power for end-user and enterprise applications, software in many instances is still unable to fully harness the available CPU performance. In the current landscape of computer systems, research efforts are addressing challenges that arise at the hardware resource layer, as underlying components are reaching physical limits in miniaturization, as well as at the software and application layer, in order to effectively harness the increasing amount of available resources within a chip and across networks. These include multiprocessing within a chip, virtualization technologies within a single computer as well as across networked environments [16], and wide-area Grid and peer-to-peer (P2P) computing systems [29-32]. As systems evolve in the direction of large-scale, networked environments and as applications evolve to demand processing of vast amounts of information, the input/output (I/O) subsystem, which provides a computer with access to mass storage and to networks, becomes increasingly important. Fundamentally, computer systems consist of three subsystems: processors, memory, and I/O [1]. In the early days of desktop computing, I/O systems were primarily used as an extension of the memory system (as a hierarchy level backing up cache and RAM memories) and for persistent data storage. With the arrival of networks and the wide-area

Internet, the I/O subsystem can be interpreted as a generic term referring to access of data, over the network or in storage. In the context of networked I/O, a particular subsystem which has been successfully used in provisioning data over networks is a

distributed file system. Distributed file systems were initially targeted at specific domains - for example, the Network File System (NFS) for local area networks. However, as heterogeneity in computer systems and software increases, it becomes necessary to expand the domain

range of systems I/O, and research has been performed either to extend current file systems (e.g., NFSv4) or to develop new file system protocols for wide-area domains (e.g.,

Coda [33], AFS [24], Legion [34]). In the processor and memory subsystems, system architects increasingly apply ideas of virtualization, with approaches that build on techniques for logically partitioning physical systems developed in mainframes [12], which are now accessible for commodity systems based on the x86 architecture. While for several years the x86 architecture did

not satisfy conditions that make a CPU amenable to virtualization [12], hardware vendors

(Intel and AMD) have provided hardware support to simplify the design of Virtual Machine Monitors (VMMs) [35][36]. Even though virtualization approaches can be used to address the problem of CPU under-utilization through workload consolidation, the gap between I/O mechanisms and CPU efficiency has widened.

Figure 2-1 gives an overview of some of the different technologies being harnessed by state-of-the-art computing systems, and provides a landscape in which the approaches described in this dissertation are intended to be applied: wide-area systems (possibly organized in a peer-to-peer fashion), where nodes and networks are virtualized and where commodity CPUs contain multiple cores. The following subsections address these technologies in more detail.

2.2 Virtualization Technologies

Virtualization technologies are increasingly becoming part of mainstream computing systems. While the term is used in various contexts, a key enabling virtualization

technique in systems is the virtual machine, which is often used to refer to mechanisms allowing the multiplexing of a single physical resource by multiple O/Ses [37]. The process of providing transparent access to heterogeneous resources through a layer of

indirection is central to virtualization. For example, in Linux, the virtual file system provides transparent access to data across different file systems. Indirection can be achieved through interposition by an agent or proxy between two or more communicating entities. Proxies have been extensively used for user authentication, call forwarding, and secure gateways [38]. For example, virtual memory is a widely used mechanism for multiplexing physical RAM in traditional operating systems.

Figure 2-1. Landscape of virtualized computer systems: resources and platforms are heterogeneous with respect to hardware and system software environments; computing nodes have multiple independent processing units (cores) and are virtualizable; Grid computing middleware is used to harness the compute power of heterogeneous resources; a peer-to-peer organization of resources enables self-organizing ensembles of CPUs to support high-throughput computing applications and scientific experiments.

Virtualization is an example of horizontal partitioning of a computing system. Figure 2-2 provides a broader view of the partitioning of a system, which also includes vertical partitioning. Horizontal partitioning essentially adds an extra layer in the system stack which provides an abstraction for the application layer to access underlying resources, while vertical partitioning mechanisms can be used to divide the underlying resources, for instance to isolate sub-systems such that interference due to faults and performance cross-talk is minimized.

Figure 2-2. Systems partitioning characteristics: Access to CPUs in multi-core systems can be naturally partitioned across cores; however, there are hardware resources which are shared (e.g., L2 cache, memory, hard disk, and NICs). Quality of service (QoS) provisioning can be used to partition shared resources across virtual containers.

In general, virtual machine architectures exhibit three common characteristics: multiplexing, polymorphism, and manifolding [4]. The virtualization approach of multiplexing physical resources not only decouples the compute resources from hardware but also provides the flexibility of allowing compute resources to migrate seamlessly. Today, many virtual machine monitors are available for research and development (e.g., VMware

[37], Parallels [39], VirtualBox [40], KVM, lguest [41], Xen [20], UML [42], and

Qemu [43]).

2.3 Virtual Machine Architectures

System Virtual Machines provide the illusion of a complete computer with virtual

CPU, memory, and I/O devices including storage and network interfaces [1]. In order to be effectively virtualizable, it is desirable that the underlying processor instruction set architecture (ISA) follows the conditions set forth in [12]. For several years, the instruction set of what is currently the most popular microarchitecture [44] was not directly virtualizable because there is a set of sensitive instructions which do not cause the processor to trap when running in unprivileged mode [1].

System virtual machines have been designed to overcome this limitation in previous x86 generations, with two major approaches resulting in successful implementations. In the classical approach, a virtual machine monitor such as VMware relies on efficient binary translation mechanisms to emulate non-virtualizable instructions without requiring modifications to the guest O/S. In the paravirtualized approach (e.g., Xen), modifications to the architecture-dependent code of the guest O/S are required, both to avoid the occurrence of non-virtualizable instructions and to enable improved system performance. Intel and AMD have provided hardware support to extend the x86 architecture for virtualized environments [1], making the implementation of classic VMMs a much easier task because binary translation is not required; the KVM virtual machine is an example of a recent VMM which builds upon such hardware extensions.

2.3.1 I/O mechanisms in Virtual Machines

In a typical O/S such as Linux, applications communicate with a physical device through a device driver, as shown in Figure 2-3(a). In contrast, a conventional I/O virtualization path traverses a guest VM driver, a virtual I/O device, a physical device driver, and the physical device (e.g., a network interface card, NIC). The guest VM driver merely provides a mechanism to share access to a virtual I/O device (emulated device driver) through a shared memory mechanism. The emulated device is either

32 implemented by the hypervisor or it can be partitioned into a virtualized container (Figure 2-3(b)).

Figure 2-3. I/O virtualization path: (a) In a single O/S, the application invokes the physical device driver by means of system calls. (b) In a VM environment, each VM's guest driver communicates with a virtual I/O device through a shared memory mechanism. The virtual I/O device can reside either in the hypervisor or in a separate VM, as shown by dotted lines.
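A minimal sketch of such a shared-memory channel is given below: the guest-side (frontend) driver places request descriptors in a ring that is shared with the component owning the real device driver, which drains them and notifies completion. The structure and function names are simplified illustrations under that assumption; they do not reproduce the actual Xen I/O ring interface.

/* Simplified single-producer/single-consumer request ring shared between a
 * guest's frontend driver and the backend that owns the physical device
 * driver. Illustrative only; not the actual Xen ring layout. */
#include <stdint.h>

#define RING_SIZE 64                     /* power of two for index masking  */

struct io_request { uint64_t guest_frame; uint32_t len; uint32_t id; };

struct io_ring {
    volatile uint32_t req_prod;          /* advanced by the frontend (guest) */
    volatile uint32_t req_cons;          /* advanced by the backend          */
    struct io_request req[RING_SIZE];    /* lives in a shared memory page    */
};

/* Frontend: enqueue a request; the caller would then notify the backend
 * (e.g., through an event channel). Returns -1 if the ring is full. */
static int frontend_post(struct io_ring *r, struct io_request rq)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;
    r->req[r->req_prod & (RING_SIZE - 1)] = rq;
    __sync_synchronize();                /* make the request visible first   */
    r->req_prod++;
    return 0;
}

/* Backend: drain pending requests and hand them to the real device driver. */
static void backend_poll(struct io_ring *r,
                         void (*handle)(struct io_request *))
{
    while (r->req_cons != r->req_prod) {
        handle(&r->req[r->req_cons & (RING_SIZE - 1)]);
        r->req_cons++;
    }
}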

The following are typical alternatives which have been considered for I/O handling in virtual machine environments:

• Direct I/O: A simple approach for I/O emulation is to trap I/O requests from a guest VM in the hypervisor; the emulated I/O device drivers reside inside the hypervisor [37].

• Split I/O: Another approach is to partition the emulated I/O driver into a separate privileged container and channel guest I/O requests through it. This approach

of splitting I/O from the hypervisor is more fault tolerant than hypervisor-based emulated I/O devices, as O/S-related faults are often attributed to device drivers [45].

• Pass-through I/O: The ultimate goal of virtualization technologies is to match the

performance of native systems. In this regard, it is important to provide direct

access to hardware devices from the guest virtual machine. This is usually referred to

as pass-through I/O. While pass-through I/O provides good performance, care must be taken with respect to isolation: with direct hardware access, a rogue VM can issue spurious DMA accesses that overwrite arbitrary memory locations through hardware devices. Various vendors have taken steps to address this with specifications for an I/O memory management unit (IOMMU) [46][47]. An IOMMU allows direct mapping of hardware device addresses into guest physical memory and also protects VMs from spurious memory accesses.

2.3.2 Virtual Machines and CMP architectures

Figure 2-4. I/O Partitioning: Resource mapping in multi-core systems. VM1 is allocated two CPUs. VM3 is pinned to I/O devices: disk D1 and NICs N1 and N2.

CMP architectures allow concurrent execution of software threads and modules. Thus, to leverage CMP architectures, it is important to partition the system software resources, operating systems, and hardware resources so as to deterministically allocate

CPU resources to the applications [48]. The performance gain from such a partitioning of resources could be diminished if the bottleneck is hardware devices such as NICs. To address this, trends are towards either harnessing multiple devices such as network

34 interfaces [10] or developing new models of communication with hardware devices [48][49]. It is conceivable that virtualized CMP architectures of the future will run multiple VMs, with a mix of resources time-shared and/or dedicated to VM guests. For example, as

shown in Figure 2-4, virtual machines VM1, VM2 and VM3 are hosted on a multi-core system. VM3 is pinned to disk D1 and NICs N1 and N2, and has been allocated a single CPU. VM1 has been allocated two CPUs and has privileged status to access hardware resources. The figure also illustrates the virtualization paths for the split I/O (dotted path) and direct I/O approaches. In split I/O, since the virtualization path is divided into separate VM containers, allocating dedicated resources (e.g., CPUs and NICs)

to guest and privileged VMs can potentially improve I/O performance.

2.4 Grid Computing

As described in [29], Grid computing refers to “coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations”. Grid computing typically takes place on heterogeneous resources distributed across wide-area networks, relying on support from middleware to provide services such as authentication, scheduling, and data transfers. Virtualization in the context of Grid computing is motivated by the ability to abstract the heterogeneity of resources and provide a consistent environment for the execution of workloads. The In-VIGO [4] middleware is a representative example of a system which extensively employs virtualization in this scenario. A similar approach has been taken by several projects in Grid or utility computing, such as COD [28] and Virtual Workspaces [5]. A key challenge arising in wide-area Grid computing infrastructures is that of management of and access to data - not only I/O local to a node, but also how to provide data to applications, seamlessly, in environments spanning multiple domains.

2.5 File System Virtualization

Virtualization techniques can be applied to facilitate the transfer of data across grid

resources. Previous research in distributed file systems indicates that data transfer is a

bottleneck over wide area networks (WAN) [50]. In grid environments, data movement and

sharing is often mediated by middleware that schedules applications and workflows, and data management is achieved by means of explicit file transfers. Many solutions have been proposed to address scalable data access over wide area networks, with approaches ranging from protocols explicitly designed for WAN environments [3][24][33] to enhancements of widely-used local-area protocols (NFS v2/v3) that overlay additional functionality or modified consistency models over wide area networks [51][52][53][54]. However, the widespread deployment of new protocols is hindered by the fact that operating systems designers have mostly focused on local-area distributed file systems, which cover typical usage scenarios. As an example, open-source and proprietary implementations of the LAN-oriented versions of the NFS protocol (v2/v3) have been deployed (and hardened over time) in the majority of UNIX flavors and Windows, while open-source implementations of the wide-area protocol (v4), under development since the late 1990s, have not yet been widely deployed. The following sections explain the NFS protocol architecture and a related approach that consists of a virtualization layer built on existing Network File System (NFS) components - the Grid virtual file system (GVFS).

The Network File System (NFS) is a network protocol based on remote procedure calls (RPC). The NFS protocol follows a client/server model to provide access to shared files among multiple clients and specifies a collection of remote procedures that allow a client to transparently access data from a remote server. The NFS protocol divides a remote file into blocks of equal size and supports on-demand, block-based transfer of file contents. The primary goals of the NFS protocol are:

1. Machine and file system independence

2. Transparent access to remote files

3. Simple crash recovery mechanism

4. Low performance overhead in local area networks

An NFS client invokes protocol operations as RPC calls to the remote NFS server. RPC uses the external data representation (XDR) standard to encode data in network call and reply messages in an interoperable format; this satisfies the goal of machine independence in the NFS protocol. The NFS server and client convert RPC requests and replies into virtual file system (VFS) operations to access data from local file systems. The first widely used version of NFS (NFSv2) was released by

Sun Microsystems [3] and was later extended to address shortcomings such as small file size limits, large numbers of GETATTR calls, and performance overhead, leading to NFS version 3 (NFSv3). For example, in NFSv2 the invocation of a lookup call is always followed by the invocation of a getattr call; in NFSv3, the lookup call is optimized to return attributes in a single RPC operation. Similarly, NFSv3 introduces new procedure calls to support buffering of writes at the client and later committing them to the server. Neither NFSv2 nor NFSv3 scales well in cross-domain wide-area environments [3]. In addition, NFSv3 does not provide guaranteed consistency between clients. NFSv4 improves the consistency mechanism at the expense of a more complex server design, which is no longer stateless [3]. NFSv4 implements an open-close consistency mechanism: NFSv4 clients can cache data after a file is opened for access; if cached data is modified, the client must commit the data back during the file close operation; and clients can re-validate cached data through the file timestamp for future accesses. In addition, an NFSv4 server supports delegation and callback mechanisms to provide write permissions to a client; thus, an NFSv4 client can allow other clients to access data from the delegated file. NFS supports a hierarchical organization of files and directories. Each directory or file on an NFS server is uniquely addressed by a persistent file handle [3]. A file handle is a reference to a file or directory that is independent of the file name. For example, NFSv2 uses persistent file handles of 32 bytes. A comprehensive survey of NFS file handle structures can be found in [55].

An initial design goal of the NFS protocol was to deploy a stateless server, such that client state is not maintained at the server between NFS requests [3]. This stateless-server approach simplifies failure recovery. For example, a failure requiring a server restart can be dealt with by simply requiring clients to re-establish the connection. In addition, the NFS protocol supports idempotent operations: in the event of a server crash, the client only needs to wait for the server to boot and re-send the request. NFS primarily consists of two protocols. First, the mountd protocol is used to initially access the file handle of the root of an exported directory. Second, the nfsd protocol is used to invoke RPC procedure calls to perform file operations on the remote server.

An NFS client invokes the mount protocol through the mount utility. Mounting is a three-step process: first, the client contacts the mountd server to obtain the initial file handle for an exported file system; second, the mount protocol accesses the attributes of the directory mount point requested by the client; finally, the NFS client obtains the attributes of the exported file system. NFS supports different authentication mechanisms, such as Unix system authentication based on user and group identifiers (UID/GID), and authorization based on access control lists maintained by the server. The access control list provides a mapping of user and group identities between client and server; whenever an RPC call is received, the server validates the client credentials against this list.

2.5.2 Grid Virtual File System

An alternative approach, termed the Grid virtual file system (GVFS), decouples data management through a proxy middleware layer [56]. GVFS forms the basic framework for the transfer of data necessary for problem-solving environments such as In-VIGO. It relies on a virtualization layer built on existing Network File System (NFS) components, and is implemented at the level of Remote Procedure Calls (RPC) by means of middleware-controlled file system proxies. A virtual file system proxy intercepts RPC calls from an NFS client and forwards them to an NFS server, possibly modifying arguments and return values in the process. Through the use of proxies, the virtual file system supports multiplexing, manifolding and polymorphism by (1) sharing a single file account across multiple users, (2) allowing multiple per-user virtual file system sessions in a single server, and (3) mapping user and group identities to allow for cross-domain NFS authentication. Potential applications of data content virtualization include translation of file formats, language translation, summarization and reduction of data, or other data transformations invoked at data-access time [4].
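As a rough illustration of the cross-domain identity mapping such a proxy performs, the sketch below (hypothetical types and function names, not the GVFS source) rewrites the UID/GID carried in an intercepted RPC credential using a per-session table before the call is forwarded to the server:

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-session mapping entry: client identity -> server identity. */
struct id_map {
    uid_t client_uid, server_uid;
    gid_t client_gid, server_gid;
};

/* Minimal stand-in for the credential fields of an intercepted RPC call. */
struct rpc_cred {
    uid_t uid;
    gid_t gid;
};

/* Rewrite the credential in place if the client identity appears in the
 * session's access-control table; return 0 on success, -1 if unauthorized. */
int remap_credential(struct rpc_cred *cred, const struct id_map *map, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].client_uid == cred->uid && map[i].client_gid == cred->gid) {
            cred->uid = map[i].server_uid;
            cred->gid = map[i].server_gid;
            return 0;   /* forward the rewritten call to the NFS server */
        }
    }
    return -1;          /* not in the access control list: reject the call */
}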

Figure 2-5. Grid Virtual File System: NFS procedure calls are intercepted through user-level proxies. GVFS proxies are deployed in client and server machines. Users are authenticated through an access control list exported by the GVFS proxy and NFS server.

Figure 2-5 provides an overview of the grid virtual file system. As shown in the figure, middleware-controlled file system proxies are used to start a grid session for the client. The grid virtual file system supports performance-enhancing mechanisms such as disk caching [56]. GVFS proxies have been further extended with write-back support to provide on-demand virtual environments for grid computing [57]. This approach relies on buffering RPC requests and results in a disk cache and committing changes back to the server at the end of a user session. A related work uses a service-oriented approach to harness the GVFS proxies for optimizations such as caching or copy-on-write support; it is based on the Web Services Resource Framework (WSRF), which enables the provisioning of data to applications by controlling the configuration and creation of GVFS data access sessions [58].

CHAPTER 3
REDIRECT-ON-WRITE DISTRIBUTED FILE SYSTEM

3.1 Introduction

3.1.1 File System Abstraction

A file system is an abstraction commonly used to access data from memory and storage systems (e.g., disks). This abstraction is often implemented as a layer of indirection, a mechanism commonly used to address systems problems. For example, to provide transparent access to different file systems, the Linux operating system steers file system operations through a common framework called the virtual file system (VFS). Figure 3-1 shows indirection mechanisms across three levels: transparent access to a file system through the VFS framework, logical access to a disk volume, and indirect access to a file block through an i-node. This dissertation applies indirection mechanisms through a user-level proxy so as to provide transparent access to data from one or more servers. The redirect-on-write file system enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area.
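The following self-contained C sketch (illustrative only; it is not the Linux VFS interface) captures the essence of this indirection: callers invoke generic operations through a table of function pointers, and each concrete file system supplies its own implementations:

#include <stdio.h>

/* Generic operations table: callers remain unaware of the concrete file system. */
struct fs_ops {
    int (*open)(const char *path);
    int (*read)(int fd, void *buf, int len);
};

/* One concrete "file system" plugs its own functions into the table. */
static int myfs_open(const char *path) { printf("myfs: open %s\n", path); return 3; }
static int myfs_read(int fd, void *buf, int len)
{
    (void)buf;
    printf("myfs: read fd=%d len=%d\n", fd, len);
    return len;
}

static const struct fs_ops myfs = { myfs_open, myfs_read };

/* Generic layer: steers calls through the table, analogous to VFS dispatch. */
static int vfs_open(const struct fs_ops *ops, const char *path) { return ops->open(path); }
static int vfs_read(const struct fs_ops *ops, int fd, void *buf, int len)
{
    return ops->read(fd, buf, len);
}

int main(void)
{
    char buf[16];
    int fd = vfs_open(&myfs, "/usr/lib/libX");
    vfs_read(&myfs, fd, buf, sizeof buf);
    return 0;
}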

Figure 3-1. Indirection mechanisms in the Linux virtual file system (VFS). Various file systems are accessed through a common indirection layer; a file system's blocks may be logically spread across multiple hard disks, and file blocks are reached through direct or indirect block references in inodes.

3.1.2 Redirect-on-Write File System

ROW-FS is based on user-level redirect-on-write virtualization techniques that address two important needs. The first is the ability to access and cache file system data and metadata from remote servers on demand, on a per-block basis, while buffering file system modifications locally. This is key to supporting applications that rely on the availability of a file system or that operate on sparse data; a representative example is the instantiation of customized execution environment containers such as system virtual machines (VMs) or physical machines provisioned on demand. The second is the ability to checkpoint file system modifications to facilitate application recovery and restart in the event of a failure. Checkpointing and migration in wide-area computing systems are often achieved at the level of operating system processes by means of library interposition or system call interception, which limits the applicability of checkpointing to a restricted set of applications. In contrast, ROW-FS enables checkpoint and restart of modifications made by a client to a mounted distributed file system. The approach is unique in supporting this functionality on top of existing, widely available kernel distributed file system clients and servers that implement the NFS protocol. This chapter describes the organization of the ROW proxy and the techniques used to virtualize NFS remote procedure calls, and evaluates the performance of a user-level implementation of these techniques using a variety of micro-benchmarks and applications. Results show that ROW-FS-mounted file systems can achieve better performance than non-virtualized NFS in wide-area setups by steering data and metadata calls to a local-area shadow server, and that ROW-FS enables an unmodified application running in a VM container and operating on data within a ROW-FS file system to be successfully restarted from a checkpoint following a failure.

3.2 Motivation and Background

Three goals motivate the ROW-FS approach. First, with ROW-FS a primary server can be made read-only, safeguarding the integrity of data mounted from the primary server against unintentional user modification. Second, since heterogeneity and dynamism in distributed computing make failure recovery a difficult task, ROW-FS provides a consistent point-in-time view of a recently modified file system. Third, to facilitate deployment, ROW-FS leverages capabilities provided by the underlying file system (e.g., NFS) without requiring kernel-level modifications.

Figure 3-2. Middleware data management: grid users G1, G2 and G3 access the file disk.img from the server and customize it for personal use through ROW proxies. G1 modifies the second block B to B', G2 modifies block C to C', and G3 extends the file with an additional block D. (a) Modifications are stored locally at each shadow server. (b) Virtualized view.

ROW-FS complements capabilities provided by "classic" virtual machines (VMs [12][1]) to support flexible, fault-tolerant execution environments in distributed computing systems. Namely, ROW-FS enables mounted distributed file system data to be periodically checkpointed along with a VM's state during the execution of a long-running application. ROW-FS also enables the creation of non-persistent execution environments for non-virtualized machines. For instance, it allows multiple clients to access in read/write mode an NFS file system containing an O/S distribution exported in read-only mode by a single server. Local modifications are kept in per-client "shadow" file systems that are created and managed on demand. Figure 3-2 illustrates an example of a VM image shared among grid users G1, G2 and G3: O/S image modifications are buffered locally, whereas the server hosts read-only O/S images.

3.2.1 Use-Case Scenario: File System Sessions for Grid Computing

ROW-FS is well suited for systems where execution environments are created, allocated to host an application's execution, and then terminated after the application finishes. This approach is taken by several projects in Grid or utility computing (e.g., In-VIGO [4], COD [28] and Virtual Workspaces [5]). For example, data management in In-VIGO is provided by a virtualization layer known as the grid virtual file system, which allows dynamic creation and destruction of file system sessions on a per-user or per-application basis. Such sessions allow on-demand data transfers, and present to users and applications the API of a widely used distributed network file system across nodes of a computational grid. ROW-FS can export similar APIs to end users and network-intensive applications to transparently buffer writes in a local server.

3.2.2 Use-Case Scenario: NFS-Mounted Virtual Machine Images and O/S File Systems

One important application of ROW-FS is read-only access to shared VM disks or O/S distribution file systems to support rapid instantiation and configuration of nodes in a network. The ROW capabilities, in combination with aggressive client-side caching, allow many clients to efficiently mount a system disk or file system from a single image, even if mounted across wide-area networks. One particular use case is the on-demand provisioning of non-persistent VM environments. In this scenario, the goal is to have thin, generic boot-strapping VMs that can be pushed to computational servers without requiring the full transfer or storage of large VM images. Upon instantiation, a diskless VM boots through a pre-boot execution environment (PXE) using one of several available shared non-persistent root file system images, stored potentially across a wide-area network. This approach delivers capabilities that are not presently provided by VM monitors themselves. Without an NFS-mounted file system on the host, on-demand transfer of VM image files is not possible; the entire VM image would need to be brought to the client before a non-persistent VM could start. In shared Grid computing environments it is difficult to acquire privileges on the host to perform such file system mounts; in contrast, with ROW-FS, the NFS-mounted file system can be kept inside a guest, and no host configuration or privileges are required to deploy the boot-strapping VM and the diskless VM.

ROW-FS could also be deployed to create a management model as described in [59][60]. In these environments it is often the case that a base operating system layer is shared, read-only, among different clients. Modifications to the base OS by a client (e.g., a kernel patch) become feasible by deploying ROW-FS on top of the read-only images.

3.2.3 Use-Case Scenario: Fault Tolerant Distributed Computing with Virtual Machines

Existing VM monitors support checkpointing and resuming of VM state, a key capability upon which many fault-tolerance techniques can be built. However, checkpointing the VM state alone is not sufficient to cover the scenarios envisioned for a VM-based distributed computing environment. Consider a virtual machine-based client-server session using traditional NFS compared to ROW-FS. A long-running application may take hours to complete; if it operates on data mounted over a distributed file system, a failure in the client may require restarting the entire session, even if the VM had been checkpointed. In contrast, a ROW-FS session with regular checkpoints provides fault tolerance beyond VM checkpointing by allowing file-system-mounted data used by the application to be checkpointed along with the VM. Consider the example illustrated in Figure 3-3. In the figure, the client virtual machine "C" crashes at time tf. In traditional NFS (Figure 3-3, top), job execution has to restart from the beginning, because the server state "S" may no longer be consistent with the client state at the time of the last checkpoint. In the redirect-on-write setup (Figure 3-3, bottom), job execution can correctly restart at the last checkpoint tc.

Figure 3-3. Check-pointing a VM container running an application with NFS-mounted file systems. In traditional NFS (top), once a client rolls back to checkpointed state, it may be inconsistent with respect to the (non-checkpointed) server state. In ROW-FS (bottom), state modifications are buffered at the client side and are checkpointed along with the VM

An important class of Grid applications consists of long-running simulations, where execution times on the order of days are not uncommon and mid-session faults are highly undesirable. Systems such as Condor [2] have dealt with this problem via application checkpointing and restart. A limitation of this approach is that it only supports a restricted set of applications: they must be re-linked to specific libraries and cannot use many system calls (e.g., fork, exec, mmap). ROW-FS, in contrast, supports unmodified applications; it uses client-side virtualization that allows transparent buffering of all modifications produced by DFS clients on local storage. In this use case, the role of ROW-FS in supporting checkpoint and restart is to buffer file system modifications within a VM container. The actual process of checkpointing is external and complementary to ROW-FS; it can be achieved with support from VMM APIs (e.g., vmware-cmd and Xen's "xm") and distributed computing middleware. For instance, the Condor [2][61] middleware is being extended with the so-called VM universe to support checkpoint and restore of entire VMs rather than individual processes; ROW-FS sessions can conceivably be controlled by this middleware to buffer file system modifications until a VM session completes.

3.3 ROW-FS Architecture

The architecture of ROW-FS is illustrated in Figure 3-4. It consists of user-level DFS extensions that support selective redirection of distributed file system (DFS) calls to two servers: the main server and a shadow server. The architecture is novel in the manner it overlays the ROW capabilities upon unmodified clients and servers, without requiring changes to the underlying protocol. The approach relies on the opaque nature of NFS file handles to allow for virtual handles [3] that are always returned to the client but map to physical file handles at the main and ROW servers. A file handle hash table stores these mappings, as well as information about client modifications made to each file handle. Files whose contents are modified by the client have sparse "shadow" files created by the shadow server, and block-based modifications are inserted in place in the shadow file. A presence bitmap marks which blocks have been modified, at the granularity of NFS blocks (typically 8-32KB).

Figure 3-4. ROW-FS architecture. The redirect-on-write file system is implemented by means of a user-level proxy which virtualizes NFS by selectively steering calls to either a main server or a shadow server. MFH: Main File Handle, SFH: Shadow File Handle, F: Flags, HP: Hash table processor, BITMAP: bitmap processor.

Figure 3-5 shows possible deployments of proxies enabled with user-level disk caching and ROW capabilities. For example, a cache proxy configured to cache read-only data may precede the ROW proxy, effectively forming a read/write cache hierarchy. Such a cache-before-redirect proxy setup (Figure 3-5(a)) allows disk caching of both read-only contents of the main server and client modifications. Write-intensive applications can be supported with better performance using a redirect-before-cache proxy setup (Figure 3-5(b)). Furthermore, redirection mechanisms based on the ROW proxy can be configured with both shadow and main servers being remote (Figure 3-5(c)); such a setup could, for example, be used to support a ROW-mounted O/S image for a diskless workstation.

3.3.1 Hash Table

Figure 3-5. Proxy deployment options: (a) cache-before-redirect (CBR), (b) redirect-before-cache (RBC), (c) non-local shadow server.

The hash table processor (HP) is responsible for maintaining in-memory data structures, on a per-session basis, that map file handles between the client and the two servers. Two hash tables are employed. The shadow-indexed (SI) hash table keeps mappings between the shadow and main servers. This table is indexed by shadow file handle because the set of file system objects in the shadow server is a superset of the file system objects in the main server. The main-indexed (MI) table is needed to maintain state information about files in the main server. Figure 3-6 shows the structure of the hash table and flag information. The readdir flag (RD) indicates that an NFS readdir procedure call has been invoked for a directory in the main server. The generation count (GC) is a number inserted into the hash tuple for each file system object to create a unique disk-based bitmap. The remove (RM) and rename (RN) flags indicate deletion or renaming of a file.
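A plausible in-memory layout for the SI and MI hash table entries is sketched below in C; the field names follow the abbreviations of Figure 3-6, but the exact layout used by the dissertation's implementation may differ:

#include <stdint.h>

#define FHSIZE 32                 /* NFSv2 file handles are 32 opaque bytes */

/* Shadow-indexed (SI) entry: keyed by shadow file handle, maps it to the
 * corresponding main-server handle and per-object state. */
struct si_entry {
    uint8_t  sfh[FHSIZE];         /* shadow file handle (key) */
    uint8_t  mfh[FHSIZE];         /* main file handle; all-zero if shadow-only */
    uint32_t gc;                  /* generation count: makes the bitmap directory unique */
    unsigned rd : 1;              /* readdir already replicated this directory */
    unsigned re : 1;              /* read optimization: no blocks written yet */
};

/* Main-indexed (MI) entry: keyed by main file handle, records state about
 * main-server objects (e.g. logically removed or renamed files, link counts). */
struct mi_entry {
    uint8_t  mfh[FHSIZE];         /* main file handle (key) */
    unsigned rm : 1;              /* file logically removed */
    unsigned rn : 1;              /* file logically renamed */
    int      l1, l2, l3;          /* initial main links, new shadow links, current main links */
};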

3.3.2 Bitmap

The bitmap processor processes file handle and offset information and checks the presence bitmap data structure to determine whether read and write calls should be directed to the main or the shadow server. The bitmap is a disk-based hierarchical data structure that keeps information about individual blocks within a file. The parent directory in the bitmap data structure is a concatenation of the hashed value of a shadow file handle and the generation count, which results in a unique bitmap directory for each file system object. As in NFS, reads and writes are performed on a per-block basis in the ROW file system. To keep track of the current location of updated blocks, each file is represented by a two-level hierarchical data structure on disk: the first level is the name of the bitmap file that contains information about a block, and the second level is the location of the presence bit within that bitmap file.

Figure 3-6. Hash table and flag descriptions: SFH: Shadow File Handle, MFH: Main File Handle, RD: Readdir Flag, RE: Read Flag, GC: Generation Count, RM: Remove Flag, RN: Rename Flag, L1: Initial Main Link, L2: New Shadow Link, L3: Current Main Link, RL: Remove/Rename File List.
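A minimal sketch of how the two-level on-disk presence bitmap described in Section 3.3.2 can be addressed is given below; the path layout and the bits-per-file constant are assumptions for illustration rather than the exact on-disk format used by ROW-FS:

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_FILE 8192   /* assumed capacity of one bitmap file (one bit per NFS block) */

/* Build the path of the bitmap file holding the presence bit for 'block':
 * <root>/<hash(SFH)><GC>/<block / BITS_PER_FILE>, e.g. ".../777234/0". */
static void bitmap_path(char *out, size_t len, const char *root,
                        uint32_t sfh_hash, uint32_t gc, uint64_t block)
{
    snprintf(out, len, "%s/%u%u/%llu", root, sfh_hash, gc,
             (unsigned long long)(block / BITS_PER_FILE));
}

/* Return 1 if the given block has already been written to the shadow server. */
int block_in_shadow(const char *root, uint32_t sfh_hash, uint32_t gc, uint64_t block)
{
    char path[512];
    uint8_t byte = 0;
    uint64_t bit = block % BITS_PER_FILE;

    bitmap_path(path, sizeof path, root, sfh_hash, gc, block);
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;                                  /* no bitmap file: block not replicated */
    if (fseek(f, (long)(bit / 8), SEEK_SET) != 0 || fread(&byte, 1, 1, f) != 1)
        byte = 0;
    fclose(f);
    return (byte >> (bit % 8)) & 1;                /* presence bit for this block */
}

int main(void)
{
    /* Example: query block 5 of the object whose handle hashes to 777, generation count 234. */
    printf("block 5 in shadow? %d\n", block_in_shadow("/tmp/row-bitmaps", 777, 234, 5));
    return 0;
}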

Figure 3-7. Remote procedure call processing in ROW-FS. The procedure call is first forwarded to the shadow server and later to the main NFS server. SS: Shadow Server, MS: Main Server, SI: Shadow Indexed, MI: Main Indexed.

Figure 3-8 illustrates a snapshot view of a file system session through the redirect-on-write proxy. The ROW proxy used to intercept the mount protocol is abstracted away in the figure. The NFS client mounts a read-only directory (/usr/lib) from the server VM. The mounted file system directory is transparently replicated in the client VM to buffer local modifications. Files replicated at the shadow server are dummy files, each representing a sparse version of a read-only file in the server VM; only file blocks written during the file system session are replicated in the shadow server. A hash table entry is updated to track the status of each file. Figure 3-8 also illustrates a hex dump of a 32-byte NFSv2 hash table entry. The generation count, concatenated with the hashed value of the shadow file handle, is used to create one bitmap directory per file. As shown in Figure 3-8, the libX file handle hashes to "777", which is concatenated with the generation count "234" to produce a unique bitmap directory. The RD flag is "0" because the entry refers to a file rather than a directory. The RE flag is "1" to indicate that the bitmap needs to be accessed for a possible "libX" file block in the shadow server. For the "libX" file there is no main-indexed hash table entry, since no status information needs to be kept for the read-only file in the server VM. All newly written blocks are present in file "0" of the bitmap directory.

3.4 ROW-FS Implementation

This section describes how a ROW-FS proxy virtualizes NFS protocol calls, enabling ROW functionality while reusing existing NFS clients and servers. Each procedure call is handled in three phases: predicate, process and update. Figure 3-7 illustrates the call processing and Figure 3-6 describes the hash table entries referenced throughout this section. Table 3-1 briefly describes the NFSv2 RPC calls and points to the relevant sections for call modifications. A detailed description of all NFS protocol calls discussed below can be found in [3].
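The skeleton below (simplified, with hypothetical names; it is not the proxy's actual source) illustrates the predicate/process/update structure that each virtualized call follows:

enum target { TO_SHADOW, TO_MAIN, TO_BOTH };

struct rpc_call  { int proc; };            /* decoded NFS procedure + arguments (elided) */
struct rpc_reply { int status; };          /* results to marshal back to the client (elided) */

/* Phase 1 (predicate): consult SI/MI hash tables and the bitmap to decide
 * whether the call must go to the shadow server, the main server, or both. */
static enum target predicate(const struct rpc_call *c) { (void)c; return TO_SHADOW; }

/* Phase 2 (process): forward the possibly rewritten call and collect results. */
static int process(enum target t, const struct rpc_call *c, struct rpc_reply *r)
{ (void)t; (void)c; r->status = 0; return 0; }

/* Phase 3 (update): record new file handle mappings, flags and presence bits. */
static void update(const struct rpc_call *c, const struct rpc_reply *r) { (void)c; (void)r; }

int handle_call(const struct rpc_call *c, struct rpc_reply *r)
{
    enum target t = predicate(c);           /* phase 1 */
    int rc = process(t, c, r);              /* phase 2 */
    if (rc == 0)
        update(c, r);                       /* phase 3 */
    return rc;
}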

3.4.1 MOUNT

In network file system deployments, the mount system utility issues an RPC call using the mount protocol to obtain the initial file handle of a directory to be mounted from the server. In the second step, the mount utility invokes the NFS getattr procedure to get the attributes of the directory. Finally, the mount utility gets the attributes of the file system.

Figure 3-8. A snapshot view of a file system session through the redirect-on-write proxy. The hash table and bitmap status are shown for the file "libX", which is transparently replicated in a shadow server. Three blocks of "libX" are shown to have been recently accessed and written in the shadow server.

To maintain mount transparency, ROW-FS also has a proxy for the mount protocol. The mount procedure is modified to obtain the initial mount file handle of the shadow server. Specifically, the mount proxy forwards the mount call to both the shadow and main servers. When the mount utility is issued by a client, the shadow server is contacted first to save the file handle of the directory to be mounted; this file handle is later used to direct subsequent NFS RPC calls to the shadow server. The initial file handle mapping of the mounted directory is inserted into the SI hash table during invocation of the getattr procedure. Figure 3-9 (top, left) depicts the handling of the mount procedure.

Table 3-1. Summary of the NFSv2 protocol remote procedure calls. Each row summarizes the behavior of the RPC call and points to the section within this chapter where the mechanism to virtualize each call in ROW-FS is described.

NFS call   Behavior (Modification Section)
Null       Testing call (no modification)
Getattr    Retrieves the attributes from the NFS server (Section 3.4.3)
Setattr    Sets the attributes of a file or directory (Section 3.4.3)
Lookup     Returns the file handle for a file name or directory (Section 3.4.2)
Readlink   Reads a symbolic link (Section 3.4.8)
Read       Reads a block of a file (Section 3.4.4)
Write      Writes to a block of a file (Section 3.4.5)
Create     Creates a new file (Section 3.4.10)
Remove     Removes a file (Section 3.4.7)
Rename     Renames a file (Section 3.4.7)
Link       Creates a hard link to a file (Section 3.4.8)
Symlink    Creates a symbolic link to a file (Section 3.4.9)
Mkdir      Creates a new directory (Section 3.4.10)
Rmdir      Removes an existing directory (Section 3.4.7)
Readdir    Lists the contents of an existing directory (Section 3.4.6)
Statfs     Checks the status of the file system (Section 3.4.11)

3.4.2 LOOKUP

The lookup procedure returns a file handle (FH) reference to the file system object sought by the client. The indirection of lookup calls between the shadow and main servers works as follows. In the predicate phase, the proxy obtains the SI hash table mapping of the parent file handle. I choose to forward the lookup call to the shadow server first, since a client session most often involves repeated accesses to the same files and data. If the shadow lookup succeeds, the proxy returns the shadow server's result to the client; this implies that the file was either created locally or replicated from the main server, and in both cases the looked-up file handle mapping is already present in the SI hash table from a previous lookup or create call. If the shadow lookup fails, the file either does not exist or is present read-only in the main server, so the lookup call is forwarded to the main server. If the main lookup succeeds, the proxy checks the remove/rename (RM/RN) flag status in the MI hash table; if either flag is set, it returns a "no such file" error. Otherwise, the proxy issues an NFS create call for a dummy file in the shadow server, corresponding to the looked-up file. The file object may be a hard link, a symbolic link, or a regular file; if it is a hard link, the link flags are updated to indicate the current status of the linked file handle. Finally, the SI hash table is updated with the new file handle mapping for the looked-up file.
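In simplified C (the helper names are illustrative stand-ins for the proxy's RPC and hash-table routines), the lookup redirection reduces to the following decision sequence:

enum { OK = 0, ERR_NOENT = 2 };

/* Illustrative stubs standing in for the proxy's real RPC and table helpers. */
static int  shadow_lookup(const void *dir, const char *n, void *fh) { (void)dir; (void)n; (void)fh; return -1; }
static int  main_lookup(const void *dir, const char *n, void *fh)   { (void)dir; (void)n; (void)fh; return OK; }
static int  mi_removed_or_renamed(const void *fh)                   { (void)fh; return 0; }
static void create_dummy_at_shadow(const void *dir, const char *n)  { (void)dir; (void)n; }
static void si_insert_mapping(const void *main_fh)                  { (void)main_fh; }

/* Lookup redirection: shadow server first, then main server, then replication. */
int row_lookup(const void *parent_sfh, const char *name, void *result_fh)
{
    /* Locally created or already replicated objects resolve at the shadow server. */
    if (shadow_lookup(parent_sfh, name, result_fh) == OK)
        return OK;

    /* Fall back to the read-only main server. */
    if (main_lookup(parent_sfh, name, result_fh) != OK)
        return ERR_NOENT;

    /* Honor logical remove/rename state recorded in the MI hash table. */
    if (mi_removed_or_renamed(result_fh))
        return ERR_NOENT;

    /* Create a sparse dummy file at the shadow server and record the handle mapping. */
    create_dummy_at_shadow(parent_sfh, name);
    si_insert_mapping(result_fh);
    return OK;
}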

3.4.3 GETATTR/SETATTR

The getattr procedure contacts the NFS server for the attributes of a specified file or directory and returns them in an encapsulated data structure, fattr. In ROW-FS, the current, updated attributes are always found at the shadow server; hence, getattr calls are forwarded to the shadow server. Similarly, the ROW proxy forwards setattr calls to the shadow server, since the file object is either replicated or newly created there.

3.4.4 READ

The read call reads data from the file referred to by a given file handle at a given offset. A read call is always preceded by a lookup call, so the file handle is always valid when the read procedure is invoked. Note that the mounted file system block size depends on parameters specified during invocation of the mount utility (after proxy initialization), whereas the bitmap block size is specified at initialization of the ROW proxy; hence, the proxy may forward calls to both the shadow and main servers if the requested data is present partly in each. In addition, main-server read results are virtualized with attributes obtained from the shadow server for the corresponding file handle (i.e., a shadow getattr call is invoked before forwarding the read call to the main server). This is important in order to maintain consistency between a follow-up getattr call and the current read call (as in the traditional NFS protocol). The following cases are processed by the ROW-FS proxy (a sketch of the resulting decision logic follows the list):

1. A new file may have been created at the shadow server. In this case, all blocks are present in the shadow server and all read calls are directed to it.

2. If the file system object was not newly created at the shadow server, file blocks may reside in either the shadow or the main server. In that case, the proxy uses the presence bitmap to calculate the location of the current, valid file block and determine whether the read request should be satisfied by the main or by the shadow server.

3. An optimization (the RE flag) handles the case when a file handle mapping is present in the MI hash table but no bitmap has been created (i.e., no blocks of the file have been written); the call is forwarded directly to the main server. This optimization avoids the expense of checking the bitmap data structure on disk.
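The per-block decision sketched below is illustrative (helper names are stand-ins for the proxy's state checks); when a request spans two NFS blocks, the same decision is made for each block separately:

enum source { FROM_SHADOW, FROM_MAIN };

/* Illustrative stubs for the proxy's state-lookup helpers. */
static int shadow_only(const void *fh) { (void)fh; return 0; }   /* file created at the shadow? */
static int re_flag_set(const void *fh) { (void)fh; return 1; }   /* no blocks written yet? */
static int bitmap_has_block(const void *fh, unsigned long blk) { (void)fh; (void)blk; return 0; }

/* Decide which server serves the NFS block containing 'offset'. */
enum source read_target(const void *fh, unsigned long offset, unsigned block_size)
{
    unsigned long blk = offset / block_size;

    if (shadow_only(fh))                 /* case 1: file was created at the shadow server */
        return FROM_SHADOW;
    if (re_flag_set(fh))                 /* case 3: RE optimization, no bitmap on disk yet */
        return FROM_MAIN;
    return bitmap_has_block(fh, blk)     /* case 2: consult the presence bitmap */
           ? FROM_SHADOW : FROM_MAIN;
}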

3.4.5 WRITE

Any write call is preceded by a lookup call; hence, an invoked write call always carries a valid file handle (i.e., the SI hash table contains a mapping for it). The proxy first checks the status of the main server file handle in the SI hash table. If it is null, the write call is directed to the shadow server. If the main server file handle mapping is not null, the ROW proxy performs bitmap processing to check the status of the block for which the write call is invoked (i.e., it first checks the RE flag and then the state of the bitmap file). If the block is present in the shadow server, the call is simply forwarded there. If the block is not present in the shadow server, the proxy first reads the entire block from the main server, then issues a write call to the shadow server to replicate this read-only copy of the block, and finally forwards the client-issued write call to the shadow server to apply the changes to the file block. Note that, as in the case of reads, the write offset and count may cross the boundary of a file system block as maintained through the bitmap, essentially requiring two NFS blocks to be written. A block replicated in the shadow server is marked in the presence bitmap to indicate that a valid copy is present there. A sketch of this copy-on-write path is shown below.
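The following sketch uses illustrative helper names, block-granularity I/O, and handles only writes that fall within a single NFS block:

/* Illustrative stubs for the proxy's RPC helpers (block-granularity I/O). */
static int  si_has_main_handle(const void *fh)                       { (void)fh; return 1; }
static int  block_in_shadow(const void *fh, unsigned long blk)       { (void)fh; (void)blk; return 0; }
static int  read_block_from_main(const void *fh, unsigned long blk, void *buf)  { (void)fh; (void)blk; (void)buf; return 0; }
static int  write_block_to_shadow(const void *fh, unsigned long blk, const void *buf) { (void)fh; (void)blk; (void)buf; return 0; }
static int  forward_write_to_shadow(const void *fh, unsigned long off, const void *data, unsigned len) { (void)fh; (void)off; (void)data; (void)len; return 0; }
static void mark_block_present(const void *fh, unsigned long blk)    { (void)fh; (void)blk; }

/* Handle one client write that falls within a single NFS block. */
int row_write(const void *fh, unsigned long offset, const void *data,
              unsigned len, unsigned block_size)
{
    unsigned long blk = offset / block_size;
    char buf[32768];                       /* large enough for typical 8-32KB NFS blocks */

    /* Shadow-only files, and blocks already replicated, go straight to the shadow. */
    if (si_has_main_handle(fh) && !block_in_shadow(fh, blk)) {
        /* Copy-on-write: fetch the whole block from the main server first, so the
         * partial client write lands on a complete copy at the shadow server. */
        if (read_block_from_main(fh, blk, buf) != 0)
            return -1;
        if (write_block_to_shadow(fh, blk, buf) != 0)
            return -1;
    }
    if (forward_write_to_shadow(fh, offset, data, len) != 0)
        return -1;
    mark_block_present(fh, blk);           /* update the presence bitmap */
    return 0;
}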

3.4.6 READDIR

Figure 3-9. Sequence of redirect-on-write file system calls.

The readdir procedure returns a list of file objects in the parent directory, along with an identifier that locates a position cookie for the subsequent readdir request. For example, a simple directory listing utility (i.e., "ls -l") may invoke multiple readdir procedure calls because of the large number of file objects present in the directory. To synchronize multiple readdir calls, information is needed to keep track of the position at which the last readdir call returned a file system object; in traditional NFS, this is accomplished by means of a cookie generated on a per-file-object basis. In the context of ROW-FS, there are two possible scenarios for the readdir procedure: a first call or a subsequent call. For the sake of clarity, I refer to a readdir issued to the main server as m-readdir and to the client readdir directed to the shadow server as s-readdir. To virtualize the first s-readdir procedure, it is important to store cookie information in a temporary buffer. For the first s-readdir call, the ROW proxy issues as many m-readdir calls to the main server as needed, using the stored cookie to keep track of the multiple m-readdir calls. During each m-readdir call, the proxy reads the list of directory entries obtained from the main server and replicates each object by issuing NFS calls to the shadow server. Finally, when all objects are replicated in the shadow server, the RD flag is set. The readdir (RD) flag is an optimization that makes the handling of subsequent readdir calls efficient: readdir calls are always directed to the shadow server after the first one. The following are the detailed steps to implement the indirection of readdir calls between the shadow and main servers (a sketch of the replication loop follows the list):

1. The ROW proxy intercepts the s-readdir request from the client and checks the status of the parent file handle in the SI hash table. If it is the first call, a temporary buffer is initialized to store the temporary cookie for multiple m-readdir calls.

2. The proxy checks the status of the RD flag of the parent directory, which indicates whether readdir has previously been called. If the RD flag is set, the call is forwarded to the shadow server. If the RD flag is not set, the following are the possible relative structures of the directories in the shadow and main servers:

• Some newly created file system objects already exist in the shadow server, or some file system objects have been regenerated through other RPC procedure calls. In this case, the proxy checks the status of each file system object through its mapping in the SI hash table; if the file handle is present in the SI hash table, replication of that object in the shadow server is omitted.

• The directory in the shadow server corresponding to the main server directory is empty. In this case, all the file system objects in the main server are generated in the shadow server.

3. The proxy checks the type of each returned file system object. If it is a symbolic link, the readlink procedure is invoked to get the link data from the main server and a symlink call is issued to the shadow server to replicate the object.

4. If the file system object has multiple hard links (as indicated by the nlink attribute of the fattr data structure), the link procedure is called in the shadow server; link is the only procedure that can increment the nlink attribute of a file system object. Finally, all regenerated file system objects are updated in the SI/MI hash tables.
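The replication performed on the first s-readdir can be sketched as the following loop (illustrative names; the cookie handling and NFS marshaling are elided):

/* Illustrative stubs: one m-readdir exchange returns up to 'max' names and a
 * continuation cookie; '*done' is set when the directory listing is complete. */
struct dirent_name { char name[256]; };

static int m_readdir(const void *dir_mfh, unsigned long *cookie,
                     struct dirent_name *out, int max, int *done)
{ (void)dir_mfh; (void)cookie; (void)out; (void)max; *done = 1; return 0; }
static int  si_has_entry(const void *dir_sfh, const char *name) { (void)dir_sfh; (void)name; return 0; }
static void replicate_at_shadow(const void *dir_sfh, const char *name) { (void)dir_sfh; (void)name; }
static void set_rd_flag(const void *dir_sfh) { (void)dir_sfh; }

/* First s-readdir for a directory: walk the main-server listing and create
 * corresponding dummy objects at the shadow server, then set the RD flag. */
void replicate_directory(const void *dir_mfh, const void *dir_sfh)
{
    struct dirent_name names[64];
    unsigned long cookie = 0;
    int done = 0;

    while (!done) {
        int n = m_readdir(dir_mfh, &cookie, names, 64, &done);
        for (int i = 0; i < n; i++) {
            /* Skip objects already created or regenerated at the shadow server. */
            if (!si_has_entry(dir_sfh, names[i].name))
                replicate_at_shadow(dir_sfh, names[i].name);
        }
    }
    set_rd_flag(dir_sfh);   /* subsequent readdir calls go only to the shadow server */
}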

3.4.7 REMOVE/RMDIR/RENAME

The remove and rmdir procedure calls are invoked to remove a file system object (a file or directory) from the server. Since the main server is read-only, the semantics of remove are modified: a remove flag (RM) in the MI hash table indicates the removal of an object from the main server, and any procedure call whose file handle refers to a file is first checked against this flag. The remove/rmdir call also removes any regenerated version of the file in the shadow server and updates the bitmap structure accordingly. For instance, consider the creation of a new file with the same name in the shadow server: if the corresponding RM flag for the main server is not set, the create call fails; otherwise, the create call is forwarded to the shadow server. The rename procedure is invoked with "to" and "from" file handles corresponding to the directories in which the rename operation is to be performed, along with the present file name ("from" name) and the new name ("to" name) as parameters. If the "from" name is present only in the shadow server, the proxy checks the status of the "to" name in the main server. If the "to" name is present in the main server, the rename overwrites that file system object with the contents of the "from" file: the proxy regenerates the "to" file in the shadow server and forwards the rename call to the shadow server. Since a rename first implicitly removes the target file and then renames the current file, the hash table mapping of the main server's "to" file is updated with its remove flag set; any reference to that file in the main server then results in a "no such file" error.

3.4.8 LINK/READLINK

The link procedure poses a unique problem of maintaining multiple file mappings between the main and shadow servers. Regular file objects (with a single hard link) have a one-to-one mapping between file object and file handle. File objects with more than one hard link complicate this mapping, since a single file handle represents multiple files; such mappings complicate the ROW indirection mechanism and change the semantics of other procedure calls. Shadow-only file objects are easily handled by forwarding the link call to the shadow server. To support linked file objects in the main server, it is imperative to virtualize the nlink (number of links) attribute of the fattr data structure. This is achieved by bookkeeping three hard-link counters: L1 (initial main links), L2 (new shadow links), and L3 (current main links). Consider, for example, files "A", "B" and "C" that are hard linked (represented by file handle Fmain) in the main server. The first replication of file handle Fmain in the shadow server is performed through the invocation of a create call by the ROW proxy; as a result, the file handle mapping is inserted into the SI hash table (assume that "A" is replicated). Now the attributes of "A", as retrieved from the shadow server, indicate only a single hard link. To obtain the correct number of links, state is maintained as the new shadow link count L2 and the current main link count L3: a new link created in the shadow increases L2, and a link deleted from the main server decreases L3. L1 keeps the initial information about hard links in the main server. Subsequent replication of link information also changes the semantics of remove/rmdir calls. Generally, a remove call decrements the hard link count if it is greater than one; the remove call also updates the current main link count. It is important to note that the current main link count alone is not sufficient for correct remove semantics. Returning to the example above, consider the case when "C" is removed: during a lookup for "C" in the main server, the link counters provide only the number of remaining main-server links, not which of the three files was removed or renamed. Hence, a linked list (RL) is included to hold information about removed hard-linked files. This list is also used when handling a create call, which requires knowing which removed file corresponds to a given main server file handle Fmain.
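As a rough illustration, the virtualized link count visible to the client can be derived from the three counters; the composition rule shown here (remaining main-server links plus links added in the shadow) is an assumption based on the description above, not a formula stated explicitly in the text:

/* Per-handle link bookkeeping for hard-linked main-server files (names illustrative). */
struct link_state {
    int l1;   /* L1: initial number of hard links at the main server */
    int l2;   /* L2: links created at the shadow server during the session */
    int l3;   /* L3: links still present at the main server (decremented on remove) */
};

/* Assumed composition: links the client should see = surviving main-server links
 * plus links added in the shadow during the session. */
int virtual_nlink(const struct link_state *s)
{
    return s->l3 + s->l2;
}

/* Bookkeeping events, following the description in the text. */
void on_shadow_link_created(struct link_state *s) { s->l2++; }
void on_main_link_removed(struct link_state *s)   { if (s->l3 > 0) s->l3--; }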

3.4.9 SYMLINK

A symbolic link is a file that provides indirection to the path of another file object. The symlink procedure merely creates the symbolic link at the server. Even though the symlink procedure returns only the status of the call, it creates a new file handle corresponding to the file that points to the other file object. The proxy forwards the symlink call to the shadow server; if the call succeeds, the newly created symlink file handle is retrieved from the shadow server and the ROW-FS proxy updates the SI hash table with the file handle mapping. The ROW proxy forwards readlink calls to the shadow server, since a symbolic link has either been recreated in the shadow server or was newly created there.

3.4.10 CREATE/MKDIR

Regular file objects are created through the invocation of create and mkdir. Since the main server is read-only, all create calls are forwarded to the shadow server. The semantics of the create call depend on the remove (RM) and rename (RN) flags. For example, if the rename flag (RN) is set for a file in the main server, a file with that name can be newly created in the shadow server; similarly, any file name whose remove flag is set can be created in the shadow server. The returned file handle is inserted into the SI hash table with the corresponding main server entry set to null.

3.4.11 STATFS

The statfs procedure is invoked to get file system information related to the mounted file system. The ROW proxy virtualizes the statfs call by forwarding it to the shadow server. This virtualized behavior is semantically correct since all the returned references to file system objects are relative to the shadow server.

3.5 Experimental Results

The virtualized ROW file system described in the previous sections has been fully implemented in user-level proxies. Experiments have been conducted to measure the performance of this ROW-FS user-level implementation for micro-benchmarks and applications. The experiments were performed over emulated wide-area network links using the NISTnet network emulation package [62]. The NISTnet emulator is deployed as a virtual router in a VMware VM with 256MB of memory running Red Hat Linux 7.3. Redirection is performed to a shadow server running in a virtual machine in the client's local domain.

3.5.1 Microbenchmark

The goal of the micro-benchmarks is to measure the performance of basic file system operations. For ROW-FS, I stress important NFS procedure calls; specifically, I conducted benchmarks for lookup, remove/rmdir and readdir to evaluate the overheads of these operations. For the LAN, the measured TCP bandwidth (using iperf) is 40Mbit/s. For the WAN setup, the bandwidth is 5Mbit/s with round-trip latencies of 70ms. The benchmarks were run on a file system hierarchy of nearly 15700 file system objects (the total disk space consumed is approximately 190MB). In all micro-benchmark experiments, the main server is a Linux VM with 256MB of memory hosted on an Intel Pentium 4 1.7GHz workstation with 512MB of memory; the WAN router is hosted on the same machine as the main server. The client machine is an Intel Pentium 1.7GHz workstation with 512MB of memory running the cache and ROW proxies. The machines are interconnected by 100Mbit/s Ethernet.

Lookup/Stat: Lookup is often the most frequent operation issued by NFS clients. Since the initial request for a file handle invokes a lookup request, I measured both individual lookup latency and a recursive stat of the file system hierarchy. For a random set of individual files in the LAN setup, the average lookup time for the initial run of ROW-FS is 18ms and the second run executes in approximately 8ms; in comparison, NFSv3 executes a lookup call in approximately 11ms. The results summarized in Table 3-2 show that ROW-FS performance is superior to NFSv3 in a WAN scenario, while comparable in a LAN. In the WAN experiment, the second run of the recursive stat under ROW-FS shows nearly a five-fold improvement over NFSv3, because all the file objects are present in the shadow server during the second run.

Readdir: For newly created files and directories, the readdir micro-benchmark scans a directory completely to display the file system objects to the client. Results for readdir, along with lookup and recursive stat, are shown in Table 3-2. WAN performance for ROW-FS during the second run is comparable to LAN performance and much improved over NFSv3, because once a directory is replicated at the shadow server, subsequent calls are directed to the shadow server by means of the readdir status flag. The initial readdir overhead for ROW-FS (especially in the LAN setup) is due to the dummy file objects being created in the shadow server during execution.

Remove: To measure the latency of remove operations, I deleted a large number of files (more than 15000, with a total data size of 190MB). Because ROW-FS only maintains remove state rather than performing a complete removal of each file, its performance is substantially better than that of conventional NFSv3: it takes nearly 37 minutes in ROW-FS, compared to 63 minutes in NFSv3, to delete 190MB of data over a wide-area network. Note that each experiment is performed with cold caches, set up by re-mounting the file systems in each new session. If the file system is already replicated in the shadow server, it takes 18 minutes (WAN) to delete the complete hierarchy.

Table 3-2. LAN and WAN experiments for the lookup, readdir, recursive stat and remove micro-benchmarks. For ROW-FS, each benchmark is run for two iterations: the first warms up the shadow server; the second accesses modifications locally. NFSv3 is executed once, as performance for a second run is similar to the first. In both ROW-FS and NFSv3, NFS caching is disabled.

                           LAN (seconds)                      WAN (seconds)
Micro-benchmark   ROW-FS 1st   ROW-FS 2nd   NFSv3     ROW-FS 1st   ROW-FS 2nd   NFSv3
Lookup            0.018        0.008        0.011     0.089        0.018        0.108
Readdir           67           17           41        1127         17           1170
Recursive Stat    425          404          367       1434         367          1965
Remove            160          NA           230       2250         NA           3785

3.5.2 Application Benchmark

The primary goal of the application benchmark experiments is to evaluate the performance of the redirect-on-write file system in comparison to the traditional kernel network file system (NFSv3). Experiments are conducted for both local-area and wide-area networks. The client machine is a 1.7GHz Pentium IV workstation with 512MB of RAM running Red Hat Linux 7.3. The main and shadow servers are VMware-based virtual machines; each VM runs on VMware GSX 3.0 and is configured with one CPU and 256MB of RAM. They are hosted by a dual-processor Intel Xeon 2.40GHz server with 4GB of memory.

Andrew Benchmark: The Andrew benchmark is used to gauge the performance of the ROW file system in local- and wide-area networks. In addition, I collected statistics on the RPC calls going to the shadow and main servers. Table 3-3 summarizes the performance of the Andrew benchmark and Figure 3-10 provides statistics on the number of RPC calls. The important conclusion from Figure 3-10 is that ROW-FS, while increasing the total number of RPC calls processed during application execution, reduces the number of RPC calls that cross domains to less than half. Note that the increase in the number of getattr calls is due to the getattr procedure invoked to virtualize read calls to the main server: read calls are virtualized with shadow attributes (in the case when blocks are read from the main server) because the client is unaware of the shadow server, and file system attributes such as file system statistics and file inode numbers have to be consistent between a read and a post-read getattr call. Nonetheless, since all getattr calls go to the local-area shadow server, the overhead of the extra getattr calls is small compared to getattr calls over the WAN.

Table 3-3. Andrew benchmark and AM-Utils execution times in local- and wide-area networks.

Benchmark        ROW-FS (sec)   NFSv3 (sec)
Andrew (LAN)     13             10
Andrew (WAN)     78             308
AM-Utils (LAN)   833            703
AM-Utils (WAN)   986            2744

AM-Utils: The build of the Berkeley am-utils package [63] is also used as an additional benchmark. The automounter build first runs configuration tests to determine the features required for the build, generating a large number of lookup, read and write calls; the second step compiles the am-utils software package. Table 3-3 provides the experimental results for LAN and WAN. The resulting average ping time for the NIST-emulated WAN is 48.9ms in the ROW-FS experiment and 29.1ms in the NFSv3 experiment. Wide-area performance of ROW-FS for this benchmark is again better than NFSv3, even under larger average ping latencies.

Figure 3-10. Number of RPC calls received by NFS server in non-virtualized environment, and by ROW-FS shadow and main servers during Andrew benchmark execution

Linux Kernel Compilation: Compilation of the Linux kernel is used to benchmark the application-perceived performance of ROW-FS for a typical software development workload. This is a representative application with a mix of compute- and I/O-intensive phases which runs for minutes and generates thousands of NFS calls of various kinds: block reads and writes, file and directory creation, and metadata lookups and modifications. The kernel used is Debian 2.4.27, with compilation steps consisting of make "oldconfig", make "dep" and make "bzImage". Table 3-4 shows performance readings for both the LAN and WAN environments.

Table 3-4. Linux kernel compilation execution times on a LAN and WAN.

Setup   FS       Oldconfig time (s)   Dep time (s)   BzImage time (s)
LAN     NFSv3    49                   120            710
LAN     ROW-FS   55                   315            652
WAN     NFSv3    472                  2648           4200
WAN     ROW-FS   77                   1590           780

The performance of Linux kernel compilation with ROW-FS is comparable to NFSv3 in the LAN environment and shows substantial improvement over the emulated WAN; for the WAN, kernel compilation performance is nearly five times better with the ROW proxy than with NFSv3. The results shown in Table 3-4 do not account for the overhead of synchronizing the main server. Nonetheless, as shown in Figure 3-10, a majority of RPC calls do not require server updates (read, lookup, getattr); furthermore, many RPC calls (write, create, mkdir, rename) are aggregated in practice: often the same data is written again, and many temporary files are deleted and need not be committed.

Fault tolerance: Finally, I tested the check-pointing and recovery of a computational chemistry scientific application (Gaussian [64]). A VMware virtual machine running Gaussian is checkpointed (along with ROW-FS state in the VM’s memory and disk). It is then resumed, runs for a period of time, and a fault is injected. Some Gaussian experiments take more than one hour to finish and generate a large amount of temporary data (hundreds of MBytes). With ROW-FS, I observe that the application successfully resumes from a previous checkpoint. With NFSv3, inconsistencies between the client checkpoint and the server state caused the application to crash, preventing its successful completion.

3.5.3 Virtual Machine Instantiation

Diskless Linux: These experiments boot diskless Linux nodes over the emulated WAN. I chose to measure diskless boot over a wide-area network because VM booting is a frequent operation in dynamic system provisioning. The NISTnet delay is fixed at 20ms with a measured bandwidth of 7Mbit/s. In this experiment, VM1 is the diskless virtual machine, and VM2 is a boot proxy machine configured with two NIC cards for communication with the host-only and public networks. Both ROW and cache proxies are deployed in VM2, which proxies NFS requests to a remote file server. In addition, VM2 is configured to run DHCP and TFTP servers to provide an IP address and the initial kernel image to VM1. Table 3-5 summarizes the diskless boot times with different proxy cache configurations. The results show that caching attributes before redirection and caching data after redirection delivers the best performance, reducing wide-area boot time with "warm" caches by more than a factor of three.

Table 3-5. Wide-area experimental results for diskless Linux boot and second boot for (1) ROW proxy only, (2) ROW proxy + data cache, and (3) attribute cache + ROW proxy + data cache.

Configuration                                           Boot (sec)   2nd Boot (sec)
Client -> ROW -> Server                                 435          236
Client -> ROW -> Data Cache -> Server                   495          109
Client -> Attr. Cache -> ROW -> Data Cache -> Server    409          76

VM boot/second boot: This experiment involves running a Xen virtual machine (domU) with its root file system mounted over ROW-FS. The primary goal is to measure the overhead of the additional layer of proxy indirection. The experiments are conducted in two parts. In the first part, there is only a ROW proxy and no cache proxy; the Xen domU VM is booted and then rebooted to capture the behavior of ROW-FS in the presence of data locality. I benchmarked the time to boot a Xen VM because VM boot is a frequent operation: a container can be started for just the duration of an application run. Results for this experiment are summarized in Table 3-6. In the second part, I tested the setup with aggressive client-side caching (see the proxy configurations in Figure 3-5). Table 3-6 also presents the boot and second-boot latencies for this scenario. For delays smaller than 10ms, the ROW+CP setup adds overhead to the Xen boot in comparison with the ROW-only setup; however, for delays greater than 10ms, boot performance with the ROW+CP setup is better than with the ROW-only setup. Reboot execution time is almost constant with the ROW+CP proxy setup, and the results show much better Xen second-boot performance for the ROW+CP experimental setup.

Table 3-6. Remote Xen boot/reboot experiment with ROW proxy and ROW proxy + cache proxy.

                  ROW Proxy                       ROW Proxy + Cache Proxy
NISTnet Delay     Boot (sec)   2nd Boot (sec)     Boot (sec)   2nd Boot (sec)
1ms               121          38                 147          36
5ms               179          63                 188          36
10ms              248          88                 279          37
20ms              346          156                331          37
50ms              748          266                604          41

3.5.4 File System Comparison

A related copy-on-write approach is implemented in UnionFS [65]. The key advantages of ROW-FS over UnionFS are that the former is user-level and integrates with unmodified NFS clients and servers, while the latter is a kernel-level approach that requires kernel support, and that the former operates on individual file data blocks while the latter operates on whole files. This is important for deployments with unmodified clients and for applications that access sparse data, for example the provisioning of VM images. I attempted to compare the performance of UnionFS and ROW-FS for Xen virtual machine instantiation across a wide-area network, but instantiating a Xen 3.0 domU with an image stacked using the latest version of UnionFS available at the time of writing (UnionFS 1.4) failed. The UnionFS copy-on-write mechanism copies a file up completely to a new branch on write invocation, whereas ROW-FS replicates only the needed block; hence, ROW-FS has an added advantage over UnionFS for disk image instantiation, where a large copy-up is expensive. Advantages of UnionFS over ROW-FS include potentially better performance through kernel-level handling of file system events, and the possibility of stacking multiple overlay levels.

3.6 Related Work

The notion of network file system call indirection is not new; interposition of a proxy for routing remote procedure calls has previously been used to provide scalable network file system services [38]. Researchers have used an NFS shadowing technique to log users' behavior on old files in a versioning file system [66]. Emulation of an NFS-mounted directory hierarchy is often used as a means of caching and performance improvement [67]. Kosha provides a peer-to-peer enhancement of the network file system to utilize redundant storage space [51]. File virtualization has also been addressed through NFS-mounted file systems within private name spaces for groups of processes, with the motivation of migrating the process domain [68]. A striped network file system increases server throughput by striping files across multiple servers [69]; this approach is primarily used to access file blocks from multiple servers in parallel, improving performance over NFS. A copy-on-write file server is deployed to share immutable template images for operating system kernels and file systems in [21]. The proxy-based approach presented in this dissertation is unique in that it not only provides copy-on-write functionality but also allows inter-proxy composition. Checkpoint mechanisms have been integrated into language-specific byte-code virtual machines as a means of saving an application's state [70]. VMware and Xen 3.0 virtual machines provide mechanisms for taking checkpoints (snapshots) and reverting to them; these snapshots, however, do not capture changes made to a mounted distributed file system.

3.7 Conclusion

This chapter introduced a novel architecture that enables redirect-on-write functionality using virtualization techniques. It is designed to overlay existing NFS deployments, and can leverage virtual machine techniques to support client-side checkpointing of distributed file system modifications. For a benchmark application (Linux kernel compilation), the performance of ROW-FS across an emulated wide-area network is four times better than conventional NFS. For the provisioning of non-persistent virtual machine execution environments, the performance of Xen virtual machine boot-up over wide-area networks is comparable to local-area networks when the ROW-FS proxy is coupled with user-level NFS caching proxies, because the majority of calls are redirected to a machine in the local domain.

CHAPTER 4
PROVISIONING OF VIRTUAL ENVIRONMENTS FOR WIDE AREA DESKTOP GRIDS

4.1 Introduction

In this chapter I present a generic, user-level distributed file system virtualization framework which enables multiple VM instances to efficiently share a common set of virtual machine image files. The primary goal is to facilitate the deployment of voluntary Grids deployed as wide-area virtual networks of virtual machines [71]. The approach is a thin-client solution for desktop grid computing based on virtual machine appliances whose images are fetched on demand and on a per-block basis over wide-area networks. Specifically, I aim at reducing the download times associated with appliance images, and at providing a decentralized, scalable mechanism to publish and discover upgrades to appliance images. The approach combines several components and technologies: virtual machines, an overlay network, pre-boot execution environment services, and a redirect-on-write virtual file system. Virtual machines over virtual networks are deployed with a pre-boot execution server to facilitate remote network booting, and ROW-FS enables the use of unmodified NFS clients/servers and local buffering of file system modifications during the appliance's lifetime. Similarly to related efforts, the approach targets applications deployed on non-persistent virtual containers [4][28] through provisioning of virtual environments with role-specific disk images [60]. Thin computing paradigms offer advantages such as lower administration cost and simpler failure management. In early computing systems, thin-client computing was successful for two main reasons: low-cost commodity hardware was not available to end users, and centralized computing was preferred because it simplified system administration. As low-cost PCs and high-bandwidth local-area networks became widely available, thin-client computing lost ground. The advent of virtual machines has opened up new opportunities: virtual machines can be easily created, configured, managed and deployed. The virtualization approach of multiplexing physical resources not only decouples compute resources from hardware but also provides much-needed flexibility by making compute resources easy to move.

For data transfers over wide-area networks, and especially in voluntary-computing environments, available bandwidth is a bottleneck. Solutions proposed to address bandwidth limitations include caching, data compression, and quality of service (QoS) provisions for applications. In the environments this chapter focuses on, limited bandwidth hinders the voluntary deployment of appliances because they often have hundreds of megabytes of virtual disk state: a 600-MB VM image takes more than one hour to download over a 1Mbit/s link, which is a significant deterrent for end users. Virtual appliances are commonly used and packaged to speed up software development, distribution and management [72]; an illustrative example is an 800-MB Fedora 9 appliance with pre-configured graphical user interface packages [72]. Optimizing the size of an appliance is time-consuming and in many cases not possible without loss of functionality (e.g., by avoiding installation of certain packages). Nonetheless, it is often the case that at run time only a small fraction of the virtual disk is actually "touched" by an application. I exploit this behavior by building on on-demand data transfers that substantially reduce the download time and bandwidth requirements for the end user. The following sections explain the overall architecture and approach.

4.2 Data Provisioning Architecture

The overall architecture is depicted in Figure 4-1. The approach is based on diskless provisioning of virtual machine environments through a virtual machine proxy. The utility of the envisioned architecture can be observed from the viewpoint of both users and system administrators. Users not only have fast and transparent access to different O/S images but also automatic support for upgrading those images. For administrators, it provides a framework for the simple deployment and maintenance of new images.

As shown in Figure 4-1, an end user X1 downloads a small proxy appliance (VM2) from Download Server DS. The proxy appliance is configured to connect to a virtual

Figure 4-1. O/S image management over wide area desktops: User X1 downloads a small ROW-FS proxy-configured appliance (VM2) from download server DS. User X2 can potentially share the appliance image with User X1. The Image Server (IS) exports read-only images to clients over NFS. VM1 is a diskless client. The appliance bootstrap procedure is explained further in Figure 4-2.

network overlay connecting it to other users (e.g., using IPOP [26][30]). An example NFS proxy appliance of size 350 MB can be downloaded from the VMware virtual appliance marketplace [72]. The proxy appliance is also configured to run small TFTP and DHCP servers to serve the network bootstrap program and allocate an IP address to the client's working environment. The actual appliances which carry out computation can be configured with a desired execution environment and need not be downloaded in their entirety by end users; they are brought in on demand through the proxy appliance. Each node is an independent computer which has its own IP address on a private network.

Key to this architecture is the redirect-on-write file system (ROW-FS). As explained in Chapter 3, ROW-FS consists of user-level DFS extensions that support selective

redirection of DFS calls to two servers: the main server and a copy-on-write server. The architecture is novel in the manner in which it overlays the ROW capabilities upon unmodified clients and servers, without requiring changes to the underlying protocol.

Figure 4-2. The deployment of the ROW proxy to support PXE-based boot of a (diskless) non-persistent VM over a wide area network.

Figure 4-2 expands on Figure 4-1 to show the diskless provisioning of virtual machines. In Figure 4-2, VM1 is a diskless virtual machine, and VM2 is a boot proxy appliance configured with two NICs for communication with the host-only and public networks. VM2 is configured to execute the ROW file system (ROW-FS) and NFS cache proxies. In addition, VM2 is configured to run DHCP and TFTP servers to provide the diskless client (VM1) with an IP address and an initial kernel image. Classic virtual machine monitors such as VMware provide support for a PXE-enabled BIOS and NICs; PXE is a technology to boot diskless computers using their network interface cards. The server VM is configured to share a common directory with clients through ROW-FS. To illustrate the workings of the diskless setup, consider the following steps to boot a diskless VM with an appliance image served over a wide-area network (a configuration sketch follows the list):

1. The diskless VM (VM1) issues a DHCP request for an IP address.

2. The DHCP request is routed through a host-only switch to the gateway VM (VM2).

3. VM2 is configured with two NICs: host-only (private IP address) and public. VM2 receives the request at the host-only NIC (eth0).

4. The DHCP server allocates an IP address and sends a reply back to the diskless VM (VM1).

5. The diskless VM issues a TFTP request to obtain the network bootstrap program and the initial kernel image.

6. VM2 receives the TFTP request at the host-only eth0.

7. The kernel image is transferred to VM1 and loaded into RAM to kick-start the boot process.

8. The diskless VM issues a mount request to mount the read-only directory from the server (VM3) through the proxy VM (VM2).

9. VM2 is configured to redirect write calls to a local server. Read-only NFS calls are routed through the proxy VM2 to VM3; the connection between VM2 and VM3 is through the virtual overlay network.
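To make the proxy configuration concrete, the sketch below shows one way the DHCP/TFTP service and the user-level proxies on VM2 could be launched. It is a minimal illustration only: it assumes dnsmasq is used for the DHCP/PXE/TFTP role, and the row_fs_proxy and nfs_cache_proxy commands (and their flags) are hypothetical stand-ins for the proxy binaries, not the names used in the actual implementation.

```python
# Minimal sketch: start DHCP/TFTP (via dnsmasq) and the user-level proxies on VM2.
# Assumptions: dnsmasq is installed; "row_fs_proxy" and "nfs_cache_proxy" are
# hypothetical placeholders for the ROW-FS and NFS caching proxy binaries.
import subprocess

HOST_ONLY_IF = "eth0"          # NIC facing the diskless client (VM1)
IMAGE_SERVER = "10.10.1.3"     # VM3, reachable over the virtual overlay (assumed address)

def start_pxe_services():
    # Steps 1-7 of the boot sequence: answer DHCP and hand out the bootstrap
    # program over TFTP on the host-only interface.
    return subprocess.Popen([
        "dnsmasq", "--no-daemon",
        "--interface=" + HOST_ONLY_IF,
        "--dhcp-range=192.168.76.10,192.168.76.50,12h",
        "--dhcp-boot=pxelinux.0",
        "--enable-tftp",
        "--tftp-root=/srv/tftp",
    ])

def start_row_proxies():
    # Steps 8-9: read-only NFS calls go to the image server, writes to the
    # local shadow server.  Binary names and flags are illustrative only.
    row = subprocess.Popen(["row_fs_proxy", "--main-server", IMAGE_SERVER,
                            "--shadow-server", "127.0.0.1"])
    cache = subprocess.Popen(["nfs_cache_proxy", "--upstream", IMAGE_SERVER])
    return row, cache

if __name__ == "__main__":
    procs = [start_pxe_services(), *start_row_proxies()]
    for p in procs:
        p.wait()
```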

P2P networks are considered to be inherently self-configuring, scalable and robust to node or system failures. Each P2P node maintains a view of the network at regular intervals, which facilitates the seamless addition or removal of a node from the system. As nodes are added to the network pool, bandwidth and CPU processing are distributed and shared among users, so P2P systems scale well. Furthermore, P2P systems are configured to be tolerant of node failures. P2P overlay networks such as IPOP also facilitate firewall traversal without administrator intervention, which allows P2P nodes behind firewalls to join the network [30]. The process of publishing and sharing O/S images is well supported by these P2P properties. The primary goal of the architecture is to automate the process of publishing, discovering and mounting appliance images. Furthermore, it should be possible for images to be replicated (fully or partially) across multiple virtual servers throughout a virtual network for load-balancing and fault-tolerance. It is feasible to provide image versioning capability by maintaining the latest image state in a decentralized way using a Distributed Hash Table (DHT), which, in the case of the IPOP virtual network [30], is already responsible for providing DHCP addresses. DHTs provide two simple primitives: put(key, value) and get(key). In order to use the DHT to track appliance image versions, the key functionality needed can be broken down into diskless client and publisher client operations.

• Diskless client: a client should be able to query which version Vi is the most recent for an appliance A, and for the virtual IP addresses of one or more servers which have the image for A available for use. This information is used at boot time by the proxy VM to select an appropriate server to mount a read-only image over ROW-FS, and can also be used to redirect calls in case of server failures.

• Diskless client: a client should be able to periodically publish that it is using version Vi of appliance A, notifying image publishers that such a version is in use and should not be removed.

• Publisher client: a developer should be able to publish that a new version Vj of appliance A has been created, along with the identifiers (IP address and mount path) of one or more servers where the image is available.

• Publisher client: a developer should be able to query the total number of sharers of a given version Vi of appliance image A in order to make decisions about garbage-collecting versions which are no longer in use. For example, if there are no users of image versions i through j of an appliance A, and the current appliance version is k > j, a publisher may decide to remove versions i through j to make storage available for additional versions.

A first-order approach to this problem using a DHT is to support three tables. The first is indexed by the appliance identifier A, which is assumed to be unique, and holds as its value the latest version associated with A. The second table is indexed by a tuple (A, Vi) and stores as values the (IP, mount path) tuples pointing to available servers for this image. The third table, also indexed by (A, Vi), holds a list of values "1", one for each client currently using the image. Each value has a limited time to live (TTL) and must be refreshed by its client before the TTL expires; summing all the values associated with this key gives an estimated reference count of the number of sharers of an image. The goal is to store state information such as the reference count of current users of version Vi of appliance A. A sketch of this layering over the DHT put/get primitives follows.
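The sketch below shows one way the three tables could be layered on the DHT's put(key, value)/get(key) primitives. The dht client object, its TTL argument, and the key encoding are assumptions made for illustration; the real IPOP DHT interface may differ.

```python
# Sketch of appliance-version tracking over a DHT exposing put(key, value, ttl)
# and get(key).  The "dht" object is a hypothetical client; the IPOP DHT API
# and its TTL semantics may differ from what is shown here.

class ApplianceRegistry:
    def __init__(self, dht, ttl=300):
        self.dht = dht
        self.ttl = ttl          # seconds before an "in use" record expires

    # --- publisher client ---
    def publish(self, appliance, version, server_ip, mount_path):
        # Table 1: appliance -> latest version; Table 2: (appliance, version) -> servers
        self.dht.put(("latest", appliance), version, self.ttl)
        self.dht.put(("servers", appliance, version), (server_ip, mount_path), self.ttl)

    def sharer_count(self, appliance, version):
        # Table 3: one "1" per active client; summing gives a reference-count estimate
        return sum(self.dht.get(("in_use", appliance, version)) or [])

    def can_garbage_collect(self, appliance, version):
        # Safe to remove only when the version is stale and has no sharers
        return version != self.latest(appliance) and \
               self.sharer_count(appliance, version) == 0

    # --- diskless client ---
    def latest(self, appliance):
        return self.dht.get(("latest", appliance))

    def servers(self, appliance, version):
        return self.dht.get(("servers", appliance, version))

    def refresh_in_use(self, appliance, version):
        # Must be re-issued before the TTL expires, otherwise the entry ages out
        self.dht.put(("in_use", appliance, version), 1, self.ttl)
```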

To illustrate the role of the distributed hash table, consider the following sequence of steps to establish an experimental session between the client's desktop and the image server:

1. The diskless client downloads the boot proxy appliance machine from the download server (VM2 in Figure 4-1). The client bootstraps this generic appliance, which is configured to forward the client's requests to the Image Server (IS) and to connect to the network of appliances through the IPOP P2P network.

2. The diskless client queries the DHT for the version of appliance A. An illustrative example of an appliance name is a "Redhat" appliance.

3. The diskless client queries the DHT to obtain the image server IP address and mount path for appliance A.

4. The ROW-FS proxies are started using the image server IP address. The proxy startup sets up an access control list and a session directory to allow call forwarding to the image server and the local NFS server.

5. The diskless client virtual machine is bootstrapped and a mount session is established with the image server. Virtual machine APIs are leveraged to bootstrap the diskless client.

6. The diskless client performs its experiments during the established ROW-FS session between the diskless client and the image server.

7. The booted diskless client is halted.

8. The ROW-FS proxies are killed. The boot proxy appliance machine retains the client's session data and experimental run results.

Figure 4-3 and Figure 4-4 illustrate the algorithms for a diskless client to bootstrap an appliance and for a publisher client to publish the O/S image. Unused VM images can be removed from the system at regular intervals: when the number of clients accessing a VM image is zero and the image is not the latest version, its DHT entries simply expire after the TTL.

Figure 4-3. Algorithm to bootstrap a VM session

Figure 4-4. Algorithm to publish a virtual machine image

4.3 ROW-FS Consistency and Replication Approach

It is important to consider consistency in ROW-FS because data can potentially be shared by multiple clients. In the ROW-FS architecture, I make the assumptions that 1) ROW-FS file systems are ephemeral; they are dynamically created and terminated by middleware that oversees the scheduling of application workflows, and 2) data stored in the main server of a file system mounted as ROW-FS remains unmodified for the duration of such an ephemeral file system session. Previous work has described techniques for establishing such dynamic file system sessions and enforcing exclusive access to shared data with a service-oriented architecture [58]; it is also conceivable to integrate the logic to configure, create and tear down ROW-FS sessions with application workflow schedulers such as [73]. Further, failure transparency is an important property of distributed systems. During the time a ROW-FS file system session is mounted, all modifications are redirected to the shadow server. A question that arises is how file system modifications in the shadow server should be reconciled with data in the main server at the end of a session. Three scenarios can be considered for consistency support in the context of the redirect-on-write file system:

1. There are applications in which it is neither needed nor desirable for data in the shadow server to be reconciled with the main server; an example is the provisioning of system images for diskless clients or virtual machines, where local modifications made by individual VM instances or diskless machines are not persistent.

2. For applications in which it is desirable to reconcile data with the server, the ROW-FS proxy holds state in its primary data structures (the file handle hash table and the block bitmaps) that can be used to commit modifications back to the server. The approach is to remount the file system at the end of a ROW-FS session in read/write mode and signal the ROW-FS proxy to traverse its file handle hash tables and bitmaps to commit changes (moves, removes, renames, etc.) to directories and files, and to commit each data block modified at the shadow server back to the main server by crafting appropriate NFS calls.

3. One particular use case of ROW-FS is autonomic provisioning of O/S disk images shared between multiple clients. In this context, I leverage APIs exported by lookup services (such as a Distributed Hash Table) in distributed frameworks (such as IPOP [26]) to store clients' usage and sharing information. This approach is based on multiple clients converging to use the latest appliance image over the course of time.

4.3.1 ROW-FS Consistency in Image Provisioning

In this dissertation, the consistency approach adopted is client-centric eventual consistency for the autonomic provisioning of O/S disk images. Clients access and store the O/S image information and usage state through lookup services such as the distributed hash table. Under eventual consistency, clients will eventually converge to using the latest version of an O/S image over the course of time. This implies that the distribution of O/S images over the P2P network is tolerant to a high degree of inconsistency: no guarantees are given on concurrent access to the latest version of an appliance, and it is acceptable to propagate an update to an appliance in a lazy fashion. Eventually, all accesses to the DHT will return the latest appliance update. The O/S images published and used by clients are read-only. An update of a new image by a publisher does not need to propagate immediately to all clients (whether they use that image or not). An example case is the installation of a new package in the appliance: if this package is necessary for the client, the client may query the server for the latest appliance and reboot the new version. Thus, image publishing and removal is not a mutually exclusive process. Consider the cases in which a potential conflict between diskless and publisher clients is possible. By a conflict, I mean that a client's query does not return the latest appliance. At time t, appliance A and version Vi may be concurrently accessed or updated by multiple clients.

• Diskless client X queries the DHT for the latest appliance version at time t1. The DHT returns version APP.1 to client X at time t2 (this includes the DHT access time and the latency to return the version to the client). Publisher client Y also publishes version APP.2 at time t3, where t1 < t3 < t2. In this case, client X does not get the latest version of the appliance.

• Diskless client X queries the DHT for the latest appliance at time t1 and the DHT returns APP.2 at time t2. At time t3, where t1 < t3 < t2, the APP.2 state is removed before the appliance status of "in use" could be updated. In this case, the client's request for a session fails.

• A diskless client is running an appliance session at time t1 and another client publishes a new appliance version at time t2, where t1 < t2 < t3; t3 is the time when the VM appliance session ends.

While the above conflicts may cause inconsistency, it is still safe to continue the current appliance sessions. I can apply different variants of the eventual consistency model. Client X can see its own modifications for as long as its appliance session is active; thus, consistency is maintained within a client session (a session consistency model). Secondly, the publisher of an appliance always sees the latest version of the appliance it submitted itself, a variant often known as the read-your-writes consistency model [74].

4.3.2 ROW-FS Replication in Image Provisioning

The ROW-FS approach relies on the opaque nature of NFS file handles to allow for virtual handles that are always returned to the client, but map to physical file handles at the main and ROW servers. Virtual machines can easily be replicated or cloned through the virtual machine APIs exported by hypervisors (e.g., VMware [16]). Each directory or file in ROW-FS is uniquely addressed by a persistent file handle. It is therefore feasible to provide replica support for wide-area desktops in the ROW-FS proxy, as file handles in a replicated VM are consistent with those in the primary VM.

Figure 4-5. Replication approach for ROW-FS. File handles of the server replica and the read-only server are equivalent. A timeout of the connection to the read-only server can result in a switchover to the replica server.

The ROW-FS proxy can be configured to provide support for a read-only server replica. Figure 4-5 shows the feasibility of a virtual machine-based replication mechanism. The ROW-FS proxy is configured to forward calls to both the read-only server and the replica server. Each replica is a cloned virtual machine which exports the root file system of the grid appliance to the client. An example scenario is shown in the figure: if the read-only server r0 goes down for some reason, the ROW-FS proxy can be configured to switch over to a replica server (r1) after a no-response timeout, Tout. The shadow state of the ROW-FS proxy, which includes the file handle mapping between the shadow server (s0) and the read-only server (r0) and the bitmap data structure, remains valid with the new read-only server replica (r1). This is feasible because the shadow server state depends only on the file handles of the s0 and r0 servers, and servers r0 and r1 have identical file handles, which facilitates a seamless transition of NFS calls to r1.

4.4 Security Implications

A question of trust or validity can be raised about the user publishing the O/S image.

Consider an example: a client X publishes a Redhat 7.3 O/S image. This image can turn out to be spurious or a completely different version. Thus, it is important to consider the security implications of such claims. The following security properties need to be considered [74]:

1. Confidentiality: Publishers should be able to publish O/S images for their intended users.

2. Integrity: The integrity of a publisher's claim should be maintained; no other user should be able to modify the publisher's claim.

To address these security properties, I consider the following security mechanisms:

1. Encryption: A publisher's claim for an O/S image needs to be encrypted to avoid any interception or fabrication by a rogue user.

2. Authentication: The publisher of an image must be able to prove its identity. To establish the publisher's identity, a public key cryptography scheme can be applied. The public key of each user in the P2P network can be advertised, and any claim by a user is signed (encrypted) with its own private key.

While authentication and encryption help in sending data securely, it is also important to model trust between the P2P users. Trust models are a way to validate client X's claim to be "User X". Various trust models have been used to establish trust in distributed systems; for example, a "web of trust" approach is commonly used in email schemes to send private emails to end users [74]. That approach relies on each user maintaining a list of trusted public keys. I suggest a public key infrastructure as the trust model for the O/S management framework.

A public key infrastructure is a collection of certificate authorities and certificates assigned to users. Certificates are a common cryptographic technique used in e-commerce applications; a certificate binds a user's identity to a public key through a digital signature from a certificate authority, and thereby helps in maintaining identification, authorization and data confidentiality for the user. A digital signature is an application of asymmetric cryptography for securely exchanging messages between users. While asymmetric-key cryptography transfers data securely, the question of how trust is formed between end users persists. For the purpose of this dissertation, I assume that there is a trusted certificate

authority to administer each user's certificate. Figure 4-6 illustrates the security mechanism used to authenticate and encrypt the client's and publisher's data. Here, the assumption is that the public key of the certificate authority is built into the P2P overlay network. To validate each client's identity, the certificate authority encrypts the client's identification ($ID_C$) and public key ($K_C^+$) (i.e., the certificate) with its private key ($K_{CA}^-$) and distributes it over the P2P network.

Figure 4-6. Diskless client and publisher client security. C: Client, P: Publisher and CA: Certificate Authority.

The following equations illustrate the encryption of appliance (A) information by a publisher and its subsequent decryption by a client.

\begin{align*}
\text{Publisher encryption:}\quad & (A, V_i) \Rightarrow K_P^-(A, V_i) \\
& (A, V_i, IP, MountPath) \Rightarrow K_P^-(A, V_i, IP, MountPath) \\
\text{Client decryption:}\quad & K_P^+\big(K_P^-(A, V_i)\big) \Rightarrow (A, V_i) \\
& K_P^+\big(K_P^-(A, V_i, IP, MountPath)\big) \Rightarrow (A, V_i, IP, MountPath)
\end{align*}
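As an illustration of how a publisher's claim could be signed and verified in practice, the sketch below uses RSA-PSS signatures from the Python cryptography package; signing with the private key plays the role of the $K_P^-$ encryption above. The message format and key handling are simplified assumptions, and certificate distribution by the CA is not shown.

```python
# Illustrative sketch of publisher signing and client verification of appliance
# metadata.  Uses RSA-PSS from the "cryptography" package; key distribution and
# the certificate authority are assumed to exist and are not shown.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

def sign_claim(private_key, appliance, version, ip, mount_path):
    # Serialize the (A, Vi, IP, MountPath) tuple and sign it with the publisher's key
    message = f"{appliance}:{version}:{ip}:{mount_path}".encode()
    signature = private_key.sign(
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256())
    return message, signature

def verify_claim(public_key, message, signature):
    # A client checks the claim against the publisher's advertised public key
    try:
        public_key.verify(
            signature, message,
            padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                        salt_length=padding.PSS.MAX_LENGTH),
            hashes.SHA256())
        return True
    except InvalidSignature:
        return False

if __name__ == "__main__":
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    msg, sig = sign_claim(key, "Redhat", "APP.2", "10.10.1.3", "/export/appliance")
    print(verify_claim(key.public_key(), msg, sig))   # True
```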

4.5 Experiments and Results

This section evaluates the feasibility and performance of the ROW-FS approach through quantitative experiments. First, I evaluate the overhead associated with the use of a proxy VM for on-demand access to the appliance image, measuring CPU, network and disk utilization during the appliance boot, the execution of a CPU-intensive application, and reboot. Second, I profile at the proxy VM the occurrence of RPC calls throughout the execution of an appliance to correlate with VM overhead. Third, I measure the image size replicated in the shadow server after diskless boot and the total amount of

data fetched on-demand from the image server. Finally, I demonstrate a proof-of-concept implementation and measure the boot and reboot times for appliances deployed over the IPOP virtual network and in a realistic wide-area environment.

4.5.1 Proxy VM Resource Consumption

The experiments are conducted to characterize resource consumption (CPU, disk and

network) by the proxy virtual machine (VM2 in Figure 4-2). In this experiment, virtual machines VM1, VM2 and VM3 are deployed in a host-only network using VMware's ESX Server VM monitor. Note that this experiment only reflects VM resource statistics (not application execution time). The experimental setup is as follows: server VM3 is configured with 2 GB of RAM and a single virtual CPU, while VM2 and VM1 are configured with 1 GB of RAM and a single VCPU each. The VMs are hosted by a server with dual 3.2 GHz Xeon processors and 4 GB of memory. The size of the appliance image is 934 MB.

Figure 4-7 shows time-series plots with CPU, disk and network rates for three different intervals. These values are obtained in 20-second intervals leveraging VMware ESX’s internal monitoring capabilities. In the first interval, the VM is booted. In the second interval, the VM runs a CPU-intensive application (the computer architecture simulator Simplescalar) which models the target workload of a typical voluntary computing execution. In the third phase, the appliance is rebooted.

Figure 4-7. Proxy VM usage time series for CPU, disk and network. Results are sampled every 20 seconds and reflect data measured at the VM monitor. Three phases are shown (marked by vertical lines): appliance boot, execution of CPU-intensive application (simplescalar), and appliance reboot.

VM Boot: Initially, the shadow copy-on-write server is empty; no file system state is present. During VM boot, the ROW-FS proxy accesses boot-time files from the server VM (VM3) and re-generates the file system hierarchy in the shadow server (local server) as described in [75]. The figure shows high data write rates during VM boot. I observe a maximum of 12% CPU consumption and also a high data rate across the network to load the initial kernel image into the diskless client's (VM1) memory.

Application Execution: Since the application is CPU-intensive, the proxy VM exhibits little run-time overhead in this phase. This is because once the diskless VM (VM1) is booted, it loads the files necessary for the application execution into RAM (as shown by the initial network activity). I further observe that disk and network usage in VM2 is negligible during the execution of the simplescalar application, supporting my assumption of minimal overhead for the proxy-configured VM.

VM Reboot: During reboot, the client has replicated session state at the shadow server, and I see an average boot-time reduction (described below). I observe further spikes in network and CPU usage as some files are fetched and read from the server VM. In past results, I have shown that aggressive caching can further improve boot performance [75].

4.5.2 RPC Call Profile

Figure 4-8 provides statistics for the number of RPC calls during the boot-up of the diskless appliance VM1. The histogram is broken down by the different types of RPC calls corresponding to NFS protocol calls, from left to right: get and set file attributes, file handle lookup, read links, read block, write block, create file, rename file, make directory, make symbolic link, and read directory. The important conclusion taken from the data in Figure 4-8 is that ROW-FS, while increasing the total number of calls routed to the local shadow server, reduces the number of RPC calls that cross domains. Note that the increase in the number of get-attribute (getattr) calls is due to the invocation of the getattr procedure to virtualize read calls to the main server. Since all getattr calls go to the local-area shadow server, the overhead of the extra getattr calls is small compared to getattr calls over the WAN.

Figure 4-8. RPC statistics for diskless boot. Shadow server receives majority of RPC calls. The bars represent the number of RPC calls received by shadow and main servers.

4.5.3 Data Transfer Size

The RPC statistics confirm that the amount of data needed to boot the appliance and execute an application is far smaller than the entire appliance image. The total number of read calls is roughly 2000; at 8 KB per block, the total amount of data brought in from the server is approximately 16 MB, which is less than 2% of the image size at the server (934 MB). Also, I observe that for the VM boot, only 646 KB of data is created and redirected to the local shadow server. Because the VM proxy includes the ROW-FS redirection capabilities, the server VM3 is mounted read-only and NFS blocks can be aggressively cached on VM2's local disk.
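As a quick check of this arithmetic, using the figures reported above:

\[
2000 \times 8\,\mathrm{KB} \approx 16\,\mathrm{MB},
\qquad
\frac{16\,\mathrm{MB}}{934\,\mathrm{MB}} \approx 1.7\% < 2\%.
\]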

4.5.4 Wide-area Experiment

In the final experiment, I measured appliance boot and reboot times in an actual WAN deployment. In this experiment, VM3 is deployed in one domain, and VM1 and VM2 are deployed in a different domain. The VMs are connected by the IPOP [30] virtual network, and the server and client VMs are behind NATs. The proxy VM2 is equipped with both the ROW-FS proxy and an NFS cache proxy. Table 4-1 summarizes the results from this experiment. Notice that the second boot times are less than half the first boot times, becoming comparable to the LAN PXE/NFS boot time of approximately 2 minutes.

Table 4-1. Appliance boot/reboot times over WAN. ISP is a VM behind a residential ISP provider; UFL is a desktop machine at the University of Florida; VIMS is a server machine at the Virginia Institute of Marine Sciences.
  VM1/VM2, VM3    Boot (seconds)    2nd boot (seconds)    Ping latency
  ISP, UFL        291               116                   23 ms
  UFL, VIMS       351               162                   68 ms

4.5.5 Distributed Hash Table State Evaluation and Analysis

To measure the performance of the DHT, 10 IPOP clients are connected to PlanetLab nodes. The clients run on Intel Pentium 1.7 GHz desktops. An IPOP bootstrap of 118 P2P nodes is set up on PlanetLab. Figure 4-9 shows the cumulative distribution of 100 DHT accesses through the 10 IPOP clients, which are chosen randomly to query the DHT. The distribution shows that in most cases it takes less than 2 seconds to query the DHT and obtain the appliance status information. The average time to insert an appliance version, with the appliance name as the key, over 10 iterations is 1.4 seconds. Table 4-2 provides mean and variance statistics for five clients; these statistics show that client access times to the DHT vary across the P2P nodes deployed on PlanetLab. The access time often depends on the route taken to reach the DHT information.

Table 4-2. Mean and variance of DHT access time (seconds) for five clients
  Statistic    Client1    Client2    Client3    Client4    Client5
  Mean         0.567      0.648      2.699      0.6875     2.224
  Variance     0.00254    0.0389     4.731      0.08512    1.337

Figure 4-9. Cumulative distribution of DHT queries through 10 IPOP clients (in seconds)

4.6 Related Work

The Shark file system [76] provides mechanisms to make read-only data scalable over the wide area network through cooperative caching and NFS proxies; my approach complements it by enabling redirect-on-write capabilities, which is a requirement to support the target application environment of NFS-mounted diskless VM clients. SFS advocated the approach of a read-only file system for untrusted clients. There are also emerging commercial products which either provide thin-client solutions based on a pre-boot execution environment [77][78] or provide cache-based solutions as a viable thin-client approach for scalable computing [79].

A distributed computing approach based on stackable virtual machine sandboxes is advocated in [59]. A stackable-storage-based framework is also used to automate cluster management as a means to reduce administrative complexity and cost [60]. The approach advocated in [60] re-provisions the application environment (base OS, servers, libraries and applications) through role-specific (read or write) disk images. A framework to manage clusters of virtual machines is proposed in [80]. The Stork package management tool provides mechanisms to share files such as libraries and binaries between virtual machines [81]. A copy-on-write file server is deployed to share immutable template images for operating system kernels and file systems in [21]; this approach uses a combination of traditional NFS for read-only mounts and AFS for aggressive caching of shared images [21]. File systems with write-once semantics are commonly used to leverage commodity disk images for applications such as map-reduce [82].

4.7 Conclusion

This chapter explained the O/S image framework that automates virtual machine appliance updates without requiring administrator intervention. The approach enables on-demand transfer of appliance state with local buffering of modifications through ROW-FS; the ROW-FS capability supports read-write operations over the P2P overlay. Experiments show that the proxy-configured VM consumes client resources mainly during VM bootstrapping. RPC statistics for NFS calls show that, since the majority of calls are routed to the shadow server, a significant performance (boot-up time) improvement is observed for the second VM boot. The storage and access performance of the distributed hash table was measured, with 95% of accesses completing in less than 2 seconds; DHT access time depends on the client location and the routing path.

CHAPTER 5
I/O WORKLOAD PERFORMANCE CHARACTERIZATION

In the previous chapters, I explained that the redirect-on-write file system is easily deployable as user-level proxies with no kernel modifications required in the system O/S. As VM-based execution environments have become prevalent [1][71], a common deployment scenario of ROW-FS is one in which the virtual machine hosting the shadow server and the client virtual machine are consolidated onto a single physical machine. Such a scenario is common when the client VM is diskless or disk space on the client VM is constrained. While deploying ROW proxies in such cases provides much-needed functionality, the overhead associated with virtualized network I/O is often considered a bottleneck [9][10]. While the virtualization cost depends heavily on workloads, it has been demonstrated

that the overhead is much higher with I/O intensive workloads compared to those which are compute-intensive [10]. Unfortunately, the architectural reasons behind the I/O performance overheads are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB related overheads contribute to most of

I/O virtualization cost [10][83][84]. While most of these evaluations are done using measurements, in this chapter I discuss an execution-driven simulation based analysis methodology with symbol annotation as a means of evaluating the performance of virtualized workloads. This

methodology provides detailed information at the architectural level (with a focus on cache and TLB) and allows designers to evaluate potential hardware enhancements to reduce virtualization overhead. This methodology is applied to study the network I/O performance of Xen (as a case study) in a full system simulation environment, using detailed cache and TLB models to profile and characterize software and hardware hotspots. By applying symbol annotation to the instruction flow reported by the execution driven simulator I derive function level call flow information. I follow the anatomy of

I/O processing in a virtualized platform for network transmit and receive scenarios and demonstrate the impact of cache scaling and TLB size scaling on performance.

5.1 Introduction

It is important to understand architectural-level implications to guide the design of future platforms and the tuning of system software for virtualized environments. This chapter presents a simulation-based analysis methodology which extends a full system simulator with symbol annotation of the entire software stack in virtualized environments, including the hypervisor, service and guest domains. First, I describe methodologies and issues involved in analyzing a virtualized workload on an existing simulator, including symbol annotation to differentiate the various components in the software stack. Second, I demonstrate the feasibility of using this extended simulation environment to evaluate the profile of cache and TLB misses in a representative I/O workload. Results from this case study show that the use of symbol annotation coupled with full-system simulation makes it possible to correlate simulated results with important events across these different components of the stack. This is the first study using full-system simulation to estimate overheads and profile the anatomy of I/O processing in a virtualized system. Using full-system simulation, I profile the workload following the execution path of network packet handling inside the virtual environment. Furthermore, I perform architecture-level quantitative analyses using cache and TLB simulation models that are integrated with the execution-driven simulation and symbol annotation framework. The cache and TLB are modeled for performance evaluation since the costs associated with these resources are considered to be high. By profiling the execution and collecting architectural data, I show the causes for cache misses as well as TLB misses. I also show the impact of cache size and TLB size on I/O performance by scaling these resources.

In this chapter, I provide a detailed analysis of the current I/O VM architecture of a representative open-source VMM (Xen [20]), using the SoftSDV [85] execution-driven simulator extended with symbol annotation support and a network I/O workload (iperf).

Inter-VM communication and the service VM architecture are an integral part of the current I/O virtualization architecture. Also, recent studies have indicated that the I/O VM architecture becomes a performance bottleneck when it is desired to achieve high network

I/O throughput [10][83]. The rest of this chapter is organized as follows. The motivation behind the current work is described in Section 5.2. Section 5.3 describes the simulation methodology, tools and symbol annotations. Section 5.4 details the software and architectural anatomy of I/O processing by following the execution path through guest domain, hypervisor and the I/O

VM domain. Also, I provide initial results of resource scaling in Section 5.5. Section 5.6 describes related work.

5.2 Motivation and Background

The present work is motivated by the fact that current system evaluation methodologies for classic and para-virtualized VMs are based on measurements of a deployed virtualized environment on a physical machine. Although such an approach gives good estimates of performance overheads for a given physical machine, it lacks flexibility in determining the resource scaling performance. In addition, it is difficult to replicate a measurement

framework on different system architectures. We suggest that it is important to move towards a full system simulation methodology because it is a flexible approach in studying different architectures. Simulation-based approaches have been extensively used in computer architecture to design and analyze the performance of upcoming system architecture [13][14][15]. A simulation-based methodology for virtual environments is also important to guide the design and tuning of architectures for virtualized workloads, and to help software systems developers to identify and mitigate sources of overheads in their code.

A driving application for simulation-driven analysis is I/O workloads. It is important to minimize performance overheads of I/O virtualization in order to enable efficient

workload consolidation. For example, in a typical three tier data center environment, Web

servers providing the external interface are typically I/O-intensive; a low-performing front-end server could bring the overall data center performance down. It is also important to minimize performance overheads to enable emerging usage models of virtualization. New architecture features could also drive the virtualization evolution. For example, offloading the I/O services to an isolated, specialized I/O domain and communicating with it through messages is motivated by arguments similar to those that motivated micro-kernels [17]. Enabling a low-latency, high-bandwidth inter-domain communication mechanism between VM domains is one of the key architecture elements which could push this distributed services architecture evolution forward.

5.2.1 Full System Simulator

Full system simulators are often employed to evaluate design, development and testing on traditional hardware and software for upcoming architectures. There are several cycle-accurate simulators that support the x86 instruction set architecture [86][13]; I use the SoftSDV simulator [85] as a basis for the experiments. SoftSDV not only supports fast emulation with dynamic binary translation, but also allows proxy I/O devices to connect a simulation run with physical hardware devices. It also supports multiple sessions to be connected and synchronized through a virtual SoftSDV network. For cache and

TLB modeling, I integrated CASPER [87], a functional simulator which offers a rich set of performance metrics and protocols to determine cache hierarchy statistics.

5.2.2 I/O Virtualization in Xen

The design of I/O architecture in virtual systems is often driven by tradeoffs between fault tolerance and I/O performance. In this context, I/O architectures can be broadly divided into split I/O and direct I/O. Direct I/O is generally adopted in classical virtual machines like VMware to boost I/O performance where front end and backend drivers often communicate using system calls. The split I/O architecture, adopted by para-virtualized machines, isolates backend drivers in a separate VM to communicate with

front end drivers through inter-process communication (IPC), resulting in an approach similar to those found in micro-kernels. The Xen I/O architecture has evolved from hypervisor-contained device drivers

(direct I/O) to split I/O. The primary goal of the I/O service VM based Xen I/O architecture is to provide fault isolation from misbehaved device drivers. It also enables the use of unmodified device drivers. The Xen network I/O architecture is based on a communication mechanism to transfer information between guest and service VM (Figure 5-1, (A)). The guest domain’s front end driver communicates with backend drivers through IPC calls. The virtual and backend driver interfaces are connected by an I/O channel. This I/O channel implements a zero-copy page remapping mechanism for transferring packets between multiple domains. I describe the I/O VM architecture along with the life-of-packet analysis in Section 5.4.

Figure 5-1. Full system simulation environment with Xen execution includes (A) Xen Virtual Environment (B) SoftSDV Simulator (C) Physical Machine

5.3 Analysis Methodology

In this section, I present an overview of Xen as a case study of using the full system simulation analysis methodology. I also show how the flow of packets is identified inside a multi-layer software environment with multiple VMs and hypervisor along with

micro-architectural details of the processor events of interest. Figure 5-2 summarizes the profiling methodology and the tools used. The following sections describe the individual steps in detail; these include (1) virtualization workload, (2) full system simulation, (3) instruction trace, (4) performance simulation with detailed cache and TLB simulation, and

(5) symbol annotation.

5.3.1 Full System Simulation: Xen VMM as Workload

The first step in the methodology for getting a detailed understanding of the workload is to run a virtualized environment, unmodified, within a full system execution driven

simulator. In the analysis presented in this chapter, the Xen virtualized environment includes the Xen hypervisor, the service domain (Dom0) with its O/S kernel and applications, and a guest, ”user” domain (DomU) with its O/S kernel and applications

(Figure 5-1). In order to analyze a network-intensive I/O workload, the iperf benchmark application is executed in DomU. This environment allows us to tap into the instruction flow to study the execution flow and to plug in detailed performance models to characterize architectural overheads. The DomU guest uses a front end driver to communicate with a backend driver inside

Dom0, which controls the I/O devices. I synchronized two separate simulation sessions to create a virtual networked environment for I/O evaluation. The execution-driven simulation environment combines functional and performance models of the platform. For this study, I chose to abstract the processor performance model and focus on cache and TLB models to enable the coverage of a long period in the workload (approximately 1.2 billion instructions).

Figure 5-2. Execution-driven simulation and symbol-annotated profiling methodology. The full system simulator operates either in functional mode or performance mode. Instruction trace and hardware events are parsed and correlated with symbols to obtain an annotated instruction trace.

5.3.2 Instruction Trace

Functional simulation provides stateless execution of system instructions; no state is maintained for TLB and cache accesses. The SoftSDV functional simulator loads and executes the Xen hypervisor and guest images. When iperf executes and communicates with the I/O services in Dom0, the instructions issued by the hypervisor, DomU and Dom0 are decoded and executed by the functional model. This enables tracing the flow of execution at the instruction level for the entire workload execution, which serves as a starting point for the analysis. The instruction trace can then be parsed to identify important events such as context switches and function calls. For example, I mapped the next instruction after a CALL instruction to the symbols collected from the hypervisor, application and drivers to obtain the sequence of functions in execution.

5.3.3 Symbol Annotation

In Linux, symbols for the kernel (and, similarly, for applications and drivers) can be located in compile-time files (such as System.map for the kernel). Symbols for a running process can be collected from the proc kernel data structures. In order to gain more insight into the packet flow and software modules inside the virtualization software layers, symbol information is added to the execution flow. Symbols were collected from the Xen hypervisor, drivers, applications and guest operating system to complete the instruction trace annotation. The annotation process matches the simulated instruction pointer (EIP in x86) with such symbols, allowing the tagging of regions of the instruction trace (and associated statistics) with code executed by the different components of the virtualized environment. For example, this methodology is useful to identify the life of a network packet inside the Xen virtualized environment, which is described in Section 5.4. An example execution flow after symbol annotation is given in Figure 5-3. These decoded instructions from the functional model are then provided to the performance model which simulates the architectural resources and timing for the instructions executed.
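A minimal sketch of this annotation step is shown below: compile-time symbols (for example, from System.map or nm output) are loaded into a sorted address table, and each instruction pointer from the trace is mapped to the nearest preceding symbol. The file names and the trace line format are illustrative assumptions and not the exact formats produced by SoftSDV.

```python
# Sketch: map instruction pointers (EIPs) from a trace to the nearest symbol.
# Symbol files in "address type name" form (as produced by System.map or nm)
# are assumed; the trace line format is an illustrative placeholder.
import bisect

def load_symbols(path, tag):
    """Return (addresses, names) sorted by address; tag marks the component."""
    table = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            addr, _type, name = parts[0], parts[1], parts[2]
            table.append((int(addr, 16), f"{name} [{tag}]"))
    table.sort()
    return [a for a, _ in table], [n for _, n in table]

def annotate(eip, addrs, names):
    """Map an EIP to the symbol with the largest address <= eip."""
    i = bisect.bisect_right(addrs, eip) - 1
    return names[i] if i >= 0 else "unknown"

if __name__ == "__main__":
    addrs, names = load_symbols("System.map-2.6.13-xen", "kernel")
    # Example trace line: "CALL c02aeb50" (format assumed for illustration)
    for line in open("instr_trace.txt"):
        if line.startswith("CALL"):
            eip = int(line.split()[1], 16)
            print(hex(eip), annotate(eip, addrs, names))
```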

5.3.4 Performance Statistics

The instruction flow and associated performance statistics are collected from cache and TLB models to identify performance hotspots. Through a performance model, the detailed cache and TLB models can be leveraged to characterize the impact of cache and TLB size on I/O virtualization performance. A simulated platform also provides the capability of changing the underlying hardware architecture to evaluate architecture enhancements and their impact on workload performance. An example of the execution flow with performance details is given in Figure 5-4.
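As an illustration of the kind of model that can be plugged in at this stage, the sketch below implements a small set-associative cache with LRU replacement that counts hits and misses for a stream of addresses. It is a simplified stand-in for the CASPER models, with arbitrarily chosen parameters.

```python
# Simplified set-associative cache model with LRU replacement, in the spirit of
# the cache models used for performance simulation (parameters are arbitrary).
from collections import OrderedDict

class Cache:
    def __init__(self, size_bytes=32 * 1024, line_bytes=64, ways=8):
        self.line = line_bytes
        self.sets = size_bytes // (line_bytes * ways)
        self.ways = ways
        self.data = [OrderedDict() for _ in range(self.sets)]  # per-set LRU order
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line            # line-granular identifier
        s = self.data[tag % self.sets]     # select the set for this line
        if tag in s:
            s.move_to_end(tag)             # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)      # evict the least recently used line
            s[tag] = True

if __name__ == "__main__":
    l1 = Cache()
    for addr in range(0, 1 << 20, 64):     # simple streaming access pattern
        l1.access(addr)
    print("hits", l1.hits, "misses", l1.misses)
```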

Figure 5-3. Symbol annotation. Compile-time Xen symbols are collected from the hypervisor, drivers and applications and annotated. The figure shows an example where symbols are annotated with "kernel" and "hypervisor".

Figure 5-4. Function-level performance statistics. The figure illustrates how performance statistics are coupled with the instruction call graph for each function. Sample statistics for L1/L2 caches and instruction and data TLBs are shown.

5.3.5 Environmental Setup for Virtualized Workload

The setup and priming of a workload within a simulation environment can be time-consuming. To facilitate the setup for simulation of the virtualized environment, a raw virtual disk is created which is then ported to the simulator. I chose to apply physical-to-virtual disk conversion, as it is generally time-consuming to test and commit changes in a simulated medium; creating a disk image outside the simulator facilitates the setup and testing of the workload. Even though the iperf workload is executed for the experimental evaluation, the above methodology provides the flexibility to support any application. To convert a physical disk into a virtual disk, I modified the physical disk partition table to create a miniature replica of the physical disk using the Linux dd utility. To reduce the boot time of the installed O/S, a stripped-down version of the physical image is customized by removing unnecessary boot-time processes. For guest Xen images, a blank virtual disk is created and populated with minimal RPM installation packages, primarily to facilitate the iperf run and networking with Dom0.

The CASPER cache model exports APIs to print or collect the instruction traces during a simulation run. As shown in Figure 5-5, an instruction parser is used to parse different instruction events such as INT (interrupts, system calls), MOV CR3 (address space switch), and CALL (function call). These traces are dumped into a file with run-time virtual address information, as well as cache and TLB statistics. Instruction traces are parsed and mapped with symbol dumps to create the I/O call graph. SoftSDV system call (SSC) utilities facilitate the transfer of data between the host and the simulated guest. A performance simulation model is used to collect instruction traces along with the hardware events of the virtualized workload. The SSC utilities are important because I gathered run-time symbols of kernels and applications from the proc kernel data structures and transferred them to the host system (for example, /proc/kallsyms for kernel symbols). For iperf run-time symbols, the process ID was mapped to the corresponding entry in the proc directory. These run-time symbols, in addition to compile-time symbols from the kernel, hypervisor, drivers and iperf, provide mapping information between functions and virtual addresses.

Figure 5-5. SoftSDV CPU controller execution mode: performance or functional. In functional mode, SoftSDV simulator provides instruction trace. In performance mode, instruction trace is parsed to obtain hardware events such as cache and TLB misses. Compile time symbols from kernel, drivers and application along with run time symbols from proc file system are collected to obtain per-function event statistics.

Symbols are annotated to keep track of the source of a function call invocation. Note that there can be duplicate symbols when the collected symbols are merged into a single file. These duplicates are removed and the collected data is formatted in a useful way. In some cases, it is necessary to manually resolve ambiguities in virtual address spaces through a checkpoint at a virtual address during a re-run of a simulated SoftSDV session. Linux utilities such as nm and objdump are often used to collect symbols from compile-time symbol tables. In general, any application can be compiled to provide symbol table information. In C++ applications (such as iperf), function name mangling in the object code is used to provide distinct names for functions that share the same name; essentially, it adds encoded prefixes and suffixes to the function name. I used the demangle option of the nm utility to identify the correct functions for the iperf application. Xen kernel and hypervisor symbols are collected from /boot/System.map-2.6.13-xen and $INSTALL/xen/xen-syms. The instruction traces and symbol dumps are compared and visualized in a user-friendly format so as to obtain call graphs and statistical information

such as cache and TLB misses per function invocation.

5.4 Experiments and Simulation Results

The experiments are conducted in two parts. First, important events, such as occurrences of the CALL instruction, are collected from a simulation run to determine the flow of a virtual Ethernet packet. Second, the iperf application is executed to generate both transmit and receive workloads so as to perform cache and TLB scaling studies. Figure 5-5 shows the simulation framework implementation used to obtain call graph information and perform cache scaling studies. As illustrated in Figure 5-5, the CPU controller layer in SoftSDV integrates with a performance or functional model. The platform configuration for this study is a single processor with two levels of cache (32 KB first-level data and instruction caches, 2 MB L2 cache) and 64-entry instruction and data TLBs. The experimental setup involved multiple SoftSDV sessions connected over a virtual network. I chose to run the iperf application to study the life of an I/O packet, as it is a representative benchmark for measuring and studying network characteristics. The iperf client is executed to initiate packet transmissions from a Xen environment.

5.4.1 Life Cycle of an I/O packet

This section describes the execution flow of packet processing inside a Xen virtual

machine. Figure 5-6 shows an overview of different stages which characterize the life of a packet between VM domains. Typically, a network packet in the Xen environment goes through the following four stages in its life cycle after the application execution:

1. Unprivileged domain: packet build and memory allocation

2. Page transfer mechanism: a zero-copy mechanism to map pages in the virtual address space of the Dom0/DomU domains

3. Timer interrupts: context switch between hypervisor and domains

4. Privileged domain: forwarding the I/O packet down the wire and sending an acknowledgment back to the guest domain.

Figure 5-6. Life of an I/O packet (a) Application execution (b) Unprivileged domain, (c) Grant table mechanism - switch to hypervisor, (d) Timer interrupt, (e) Privileged domain

5.4.1.1 Unprivileged Domain

On the transmit side, an I/O packet originates from the iperf application. The execution flow traverses from the application into the DomU guest O/S kernel, where all the required TCP/IP processing is completed. The TCP/IP stack builds the payload in transmit socket buffers (skb) and hands them over to the front-end driver. Socket buffers (skb) represent network packets and facilitate the implementation of zero-copy networking between Xen virtual machines [27]. An interface in Xen to allocate a socket buffer in the networking layer (alloc skb from cache) is identified. The front-end driver uses the grant table mechanism provided by the hypervisor to transfer the buffer to Dom0. The functions and the associated instruction counts for the overall life of the packet in DomU include socket lock, copy data from user space to kernel space, allocate page from free list, and release socket lock (Figure 5-7). Note that the instruction count statistics are shown in chronological order with function entry points as markers. I removed some repeating functions to improve readability. As a part of the transmit processing, the DomU guest domain communicates with Dom0 using event channels.

[Figure 5-7 content: a chronological Dom-U call trace listing each function's entry EIP, module and cumulative instruction count, from do_sock_write and lock_sock through copy_from_user, __alloc_skb and the page-allocation path to tcp_write_xmit and release_sock.]

Figure 5-7. Dom-U call graph: Socket allocation (alloc skb from cache), user-kernel data copy (copy from user) and finally TCP transmit write (tcp write xmit).

5.4.1.2 Grant Table Mechanism

Once the message to notify Dom0 of a transmit request is sent through event channels, the transmit packets are picked up by Dom0 when the hypervisor schedules it to execute. The Xen VMM provides a generic mechanism to share memory pages between domains, referred to as the grant table mechanism: before sending an event to Dom0, the DomU guest domain sets access rights to the memory pages holding the actual packet contents through a grant table interface provided by the hypervisor. Figure 5-8 demonstrates the execution flow from DomU to the hypervisor through the grant table mechanism.

Figure 5-8. TCP transmit (tcp transmit skb) and Grant table invocation (gnttab claim grant reference)

5.4.1.3 Timer Interrupts

Timer interrupts initiate the switching into the hypervisor from the guest domains. Timer interrupts are often used by O/S task schedulers to re-schedule the priorities of running VMs or processes. This results in a context switch from the hypervisor to Dom0. At this point, Dom0 invokes evtchn do upcall to start processing the event. The functions invoked during the timer interrupt are shown in Figure 5-9.

Figure 5-9. Annotated call graph to show context switch between hypervisor and Dom-0 VM - Timer interrupts (write ptbase)

5.4.1.4 Privileged Domain

The backend driver receives the packets and bridges them to the real network interface card. For this, it needs to access the packet buffer from the guest domain: it uses the grant provided by the guest to map the page into its own domain and accesses it. Once transmit processing is complete, Dom0 sends an acknowledgment back to the DomU guest domain using event channel mechanisms. The execution flow in Dom0 is shown in Figure 5-10 (since the complete execution at this stage is long, snippets of the execution covering the basic flow and highlighting the important functions are shown). Note that the grant table mechanism is used to map guest pages into the Dom0 address domain on the backend receiving side. The packet is then sent to the bridge code, after which it is sent out on the wire. Once complete, the host mapping is destroyed and an event is sent on the event channel to the guest domain. It is interesting to note that the processor TLB is flushed while destroying the grant. This is done by writing the CR3 register (the x86 page table pointer) through the write cr3 function. I describe the impact of this TLB flush in Section 5.4.2.

Figure 5-10. Life of a packet in Dom-0: accessing the granted page (create_grant_host_mapping), Ethernet transmission (e100_tx_clean), destroying the grant mapping (destroy_grant_host_mapping) and event notification back to the hypervisor (evtchn_send).

Note that the flow described here is only an example. The execution flow may vary based on the state of the stack and the availability of buffers. External interrupts may also alter the execution flow considerably. An execution-driven simulation environment allows us to profile various execution flows and characterize the I/O architecture correctly. Similarly, the execution flow at the receiver side in a Xen execution environment can be gathered and studied.

5.4.2 Cache and TLB Characteristics

It is important to analyze the impact of hardware design decisions on the performance of VMMs. As mentioned earlier, the focus of the current work is on the performance characteristics related to cache and TLB resources in a virtualized environment. Figure 5-11 shows an execution snippet where TLB flushes and misses are plotted as a function of simulated instructions retired. The figure shows that there is a high correlation between TLB misses, context switches and TLB flush events. An execution run of a VM during a period with no context switches or TLB flushes results in negligible TLB misses; whenever TLB flushing events happen, there is a surge of TLB misses. This correlates well with the observations of TLB miss overhead in earlier studies. Figure 5-12 shows the increased number of TLB misses associated with the VM switches in a cumulative graph. I observe that there is a surge of TLB misses associated with each VM switch; execution segments without VM switches show flat areas with few TLB flushes. Figure 5-13 depicts a typical VM switch scenario. The execution moves from one VM to another through a context switch. The CR3 value is changed to point to the new VM context, which triggers the hardware to flush all the TLBs to avoid invalid translations. This comes at the cost of a TLB miss every time a new page is touched afterwards, for both code and data pages.

Another scenario is the explicit TLB flushes done by the Xen hypervisor as part of the data transfer between VMs. This is an artifact of the current I/O VM implementation, as explained in the previous section. In order to revoke a grant, a complete TLB flush is executed explicitly, which also creates TLB performance issues similar to a VM switch. Figure 5-14 demonstrates the code flow and the TLB impact. Figure 5-15 shows the impact of context switches on cache performance: the vertical lines mark VM switch events obtained through symbol annotation, and the plotted line shows the cumulative cache miss events. Note that the cache miss rate increases are also correlated with VM switch events.

Figure 5-11. Impact of TLB flush and context switch. The x-axis shows a slice of the total number of instructions retired during an execution run of the iperf application. The y-axis shows instruction and data TLB miss events, normalized to TLB flushes and context switches.

Figure 5-12. Correlation between VM switching and TLB misses. The x-axis shows a segment of the total number of instructions retired. The y-axis (left) represents VM switching, where “1” indicates a VM switch. The y-axis (right) shows cumulative TLB misses.

Figure 5-13. TLB misses after a VM context switch. Instruction and data TLB misses are plotted on the y-axis against a segment of instructions retired on the x-axis. The context switch between virtual machines causes a TLB flush which increases the number of TLB misses.

5.5 Cache and TLB Scaling

In this section, I evaluate the impact of cache and TLB sizes on I/O virtualization overhead. As described earlier, I used the functional model of SoftSDV to boot a Red Hat Enterprise Linux (RHEL 4) image and Xen-3.0.2 as a test bed. Two sessions of the SoftSDV simulation tool are executed, connected to each other through a virtual subnet configured for network communication. For each experiment, I executed a session of iperf [88]. TLB and cache statistics are measured for the transfer of approximately 25 million TCP/IP packets.

Figure 5-14. TLB misses after a grant destroy. The x-axis shows a segment of instructions retired and the y-axis represents data and instruction TLB misses.

Figure 5-15. Impact of VM switch on cache misses. The x-axis shows a segment of instructions retired. The y-axis (left) represents VM context switches through the vertical lines between 0 and 1. The context switch between virtual machines causes a TLB flush which increases the number of L2 cache misses (y-axis, right).

Figure 5-16. L2 cache performance when the L2 cache size is scaled from 2 MB to 32 MB. In the plot, the L2 cache miss ratio is normalized to the 2 MB L2 cache size. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

Figure 5-17. Data and instruction TLB performance when the TLB size is scaled between 64 and 1024 entries. In the plot, the TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

Figure 5-16 shows the effect of scaling the L2 cache. The performance model is configured to simulate a two-level cache: a 32 KB L1 (split data and instruction) and a 2 MB unified L2 cache. The primary goal is to understand the cache sensitivity of the I/O virtualization architecture in the context of network I/O. Note that increasing the L2 cache size up to 4 MB provided good performance scaling, after which the increase in performance was minimal; beyond 8 MB, the rate of reduction in miss rates is small. I attribute the reduced miss rates at 8 MB to the inclusion of the needed pages from the hypervisor, Dom0 and DomU.

Figure 5-18. L2 cache performance when the L2 cache size is scaled from 2 MB to 32 MB. The L2 cache miss ratio is normalized to the 2 MB L2 cache size. The data points are collected when the iperf server is running in a guest VM (receive of I/O packets).

Figure 5-17 shows the TLB performance scaling impact for data and instruction TLBs. As shown in the figure, as the data TLB size increases, the miss ratio decreases for sizes up to 128 entries; for larger sizes, the miss ratio is nearly constant. The ITLB miss rate decreases slightly, while the DTLB rate shows a sharper decrease from 64 to 128 entries. It can be inferred that a TLB size of 128 entries is sufficient to hold the address translations needed by this workload. Increasing the TLB size is not a very effective enhancement in this scenario. This is because, as observed in Figures 5-12 and 5-13, there are substantial numbers of TLB flushes during grant revocation and VM switches, which invalidate all TLB entries; a large TLB size does not help mitigate the compulsory TLB misses that follow a flush. Similarly, the cache and TLB scaling studies were performed on the receive side; results are given in Figures 5-18 and 5-19 respectively.

Figure 5-19. Data and instruction TLB performance when the TLB size is scaled between 64 and 1024 entries. In the plot, the TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf server is running in a guest VM (receive of I/O packets).

5.6 Related Work

The characterization of performance overhead is an important concern in the study of virtualized environments, and several studies have addressed this issue with a methodology based on the execution of application benchmarks on virtualized platforms [89][20]. Performance monitoring tools have been deployed to gauge application performance in virtualized environments [9][83][10]. Traditional network optimizations such as TCP/IP checksum offload and TCP segmentation offload are being used to improve the network performance of Xen-based virtual machines [10]. In addition, a faster I/O channel for transferring network packets between guest and driver domains has been studied [10].

These studies lack a micro-architectural overhead analysis of the virtualized environment.

TLB misses after context switches negatively impact I/O performance. In the past, TLB entries have been tagged with a global bit to prevent flushing of global pages such as shared libraries and kernel data structures. In current system architectures, context switch overhead can be reduced by tagging TLB entries with address-space identifiers (ASIDs). A tag based on a VM identifier (VMID) could be further used to improve I/O performance for virtual machines. Processor architectures with hardware virtualization support incorporate features such as virtual-processor identifiers (VPIDs) to tag translations in the TLB [90][36].

5.7 Conclusion

The focus of this chapter is to present a case study of a virtualized workload in a simulated environment to study micro-architectural features as a means of performance evaluation. I used an execution-driven simulation framework, along with a symbol annotation methodology, to analyze the overheads of an I/O-intensive workload running in a virtualized Xen environment. I also presented initial research results from TLB and cache scaling for the I/O workload. The execution-driven simulation framework presented provides the speed and flexibility needed to understand current architectural bottlenecks and to experiment with potential architectural changes in hardware and software.

CHAPTER 6 HARDWARE SUPPORT FOR I/O WORKLOADS: AN ANALYSIS

6.1 Introduction

In this chapter, I present an approach to analyze and evaluate TLB performance for multiprocessor systems using the full system simulation framework explained in Chapter 5. The approach is based on tracking TLB modifications to identify pages potentially shared between multiple processors.

It is interesting to consider the I/O performance of virtual environments on multi-core processors. One way to improve performance is to revisit how consistency is enforced when sharing and invalidating TLB entries. Generally, software-based coherence mechanisms pay an overhead of communication latency between processors (as compared to the time consumed to access the shared data). In comparison, the primary cost of a hardware coherence protocol is accessing the data; the coherence messaging overhead is smaller in hardware-supported approaches because communication occurs via bus-level transactions [91].

In multi-core processors, a virtual address translation stored in the page table entries of one processor needs to be propagated to the TLBs of all processors. Many architectures resort to flushing the entire contents of a remote TLB to enforce such coherence, in a process often called “TLB shootdown”. Consider the example of network I/O communication in a virtualized environment: the grant table mechanism adopted by the Xen VMM is based on modifying access protection bits of a page table entry shared between the guest and privileged domains. Therefore, network I/O communication between guest and privileged domains may result in TLB shootdowns. The problem with the shootdown approach is that it works at a coarse coherence granularity by invalidating all entries of a TLB. Because not all TLB entries must be invalidated to enforce consistency (only those affected by the protection changes), this coarse-grain approach can result in the eviction of translations that remain valid from the TLB, thus potentially increasing the number of TLB misses.

The rest of this chapter is organized as follows. I provide a brief overview of the translation lookaside buffer in Section 6.2. I give an overview of the interprocessor interrupt mechanism used by Linux on x86-based processors to implement TLB shootdowns in Section 6.3. Section 6.4 explains the page sharing mechanism in the Xen hypervisor. Section 6.5 provides details of experiments to measure I/O overhead, evaluate hardware support to tag hypervisor pages and evaluate the potential for selective flushing on interprocessor interrupts. Section 6.6 describes the related work.

6.2 Translation Lookaside Buffer

6.2.1 Introduction

The translation lookaside buffer (TLB) is an on-chip cache that expedites virtual-to-physical address translation. In the absence of a TLB, the page table data structure is used to locate the physical page corresponding to a virtual address; this process of translating a virtual into a physical address is expensive. Instead, processors rely on the TLB and locality of reference to achieve fast address translation. The process can be summarized as follows. An application generates a virtual memory address to access an instruction or data in memory. The CPU looks up the virtual address by indexing the TLB. If the TLB access is a hit, the page table entry cached in the TLB is used to access the physical page. In multi-core systems, typically each processor has its own TLB in order to achieve fast lookup times. This creates a challenge in managing multiple translations cached across multiple TLBs, and thus it is important to maintain TLB coherency. Unlike the processor data and instruction caches, TLB coherency is implemented by the operating system: for any update to a page table entry, the operating system issues a TLB invalidation operation.
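The lookup and flush behavior described above can be illustrated with a toy software model of a small, fully associative TLB; this is only a minimal sketch for exposition (the entry format, size and replacement policy are assumptions), not the simulator's actual TLB module.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define TLB_ENTRIES 64          /* assumed TLB size (fully associative) */
    #define PAGE_SHIFT  12          /* 4 KB pages */

    struct tlb_entry {
        uint32_t vpn;               /* virtual page number   */
        uint32_t pfn;               /* physical frame number */
        bool     valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit and fills *paddr with the translation. */
    static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & 0xFFF);
                return true;        /* hit: served from the TLB          */
            }
        }
        return false;               /* miss: page-table walk and refill  */
    }

    /* Invalidate everything, as a CR3 reload or a shootdown IPI would. */
    static void tlb_flush_all(void)
    {
        memset(tlb, 0, sizeof(tlb));
    }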

6.2.2 TLB Invalidation in Multiprocessors

Memory accesses associated with the execution of instructions, as they progress through the pipeline, go through the TLB, memory and possibly the page fault handler. When the CPU is unable to find a physical frame for a virtual page, it generates a processor exception called a page fault. O/Ses generally implement handlers to deal with page faults. The handler can look up the address mapping in the page table; if one exists, it is written back to the TLB so that the CPU can access physical memory through the TLB. If the mapping is not present in the page table, the handler establishes the mapping and updates the page table. In multiprocessors, the updating processor sends a TLB-update request to the other processors so as to synchronize the TLBs. This request is sent through an inter-processor interrupt (IPI). Such a TLB shootdown behaves like message synchronization, as the TLB update is valid only after all processors have invalidated their old TLB entries. Consider the following example to illustrate the process of TLB invalidation in multiprocessor systems. Assume the page table of process A is shared between CPU0 and CPU1. At the first compulsory TLB miss for page X, the TLB entry is updated in CPU0 when it tries to access the page. Subsequently, the page X mapping is accessed by CPU0 from the TLB instead of through a page table walk. Further, when CPU1 also accesses page X, an entry is updated in its TLB. Now CPU0 updates the page table entry for page X (for example, CPU0 modifies the permission bits). It is necessary that CPU0 not only invalidates its own TLB entry for page X but also sends an IPI to CPU1 so that CPU1 invalidates its TLB entry for page X. In general, page table updates (and hence the need for TLB synchronization) can be caused by several reasons:

• Change of page permissions (read, write, execute).

• Lack of memory has caused page X to be swapped out to disk (the entry is invalid).

• Page X no longer exists because the application removed it.

• There is no entry for page X in the page table.

The operations involved in flushing and refilling TLB entries can be expensive in terms of the number of cycles. In the x86 architecture, the hardware/software interface is defined at the page table level: the page table entry format and configuration are defined by the x86 instruction set architecture [91]. TLB entries rely on the information provided and updated through the page table. A TLB flush may subsequently result in a TLB miss, a page table walk, a possible page fault (if the page is not in memory) and a TLB refill. A hardware state machine walks the page table to refill the TLB entry.

Figure 6-1. The x86 page table for small pages: the paging mechanism in the x86 architecture is shown. The control register CR3 points to the page directory of the currently scheduled process. The virtual address from an application is divided to obtain a page directory entry (PDE) index, a page table entry (PTE) index and a page offset.

Figure 6-1 illustrates the translation of a virtual address into a physical address in the x86 architecture. The sequence of accesses to reach a physical page from a virtual (linear) address is as follows: (1) the virtual address is looked up in the TLB; (2) if no TLB translation is available, the virtual address is translated through the page table to retrieve the page contents; (3) if the virtual address is not present in the page table, a page fault is raised. The page table is a hierarchical structure used to index and retrieve the final location of the physical address corresponding to the virtual address. In addition, it provides entries to check the access privileges and access mode of the page; this prevents other processes from accessing or modifying the pages. The complete list of flag entries in the PDE and PTE can be obtained from [92]. In the multi-core case, either an entire page table may be shared between two CPUs or individual page entries may be shared between different CPUs.
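As a concrete illustration of the split shown in Figure 6-1, the short program below decomposes a 32-bit linear address into its page directory index, page table index and page offset for 4 KB pages (the example address is arbitrary).

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t vaddr = 0xBFFFF123u;               /* arbitrary example address          */

        uint32_t pde_index = (vaddr >> 22) & 0x3FF; /* bits 31:22 -> page directory entry */
        uint32_t pte_index = (vaddr >> 12) & 0x3FF; /* bits 21:12 -> page table entry     */
        uint32_t offset    =  vaddr        & 0xFFF; /* bits 11:0  -> offset within page   */

        printf("PDE index %u, PTE index %u, offset 0x%03x\n",
               pde_index, pte_index, offset);
        return 0;
    }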

6.3 Interprocessor Interrupts

The x86 architecture uses an inter-processor interrupt (IPI) mechanism to implement TLB shootdowns. This is accomplished with the help of a hardware facility called the APIC (Advanced Programmable Interrupt Controller), which can be programmed to send interrupt requests to other processors on the bus [92]. Sending such a request involves programming special registers in the APIC (the ICR, or interrupt command register) to select, among other things, the destination of an interrupt and its arguments. The format and entries of the ICR can be obtained from Intel's software developer specifications [92]. Interprocessor interrupts are initiated by a write to the APIC's ICR. In an SMP architecture, each processor has its own local APIC to invoke or act on an IPI. A local APIC unit indicates successful dispatch of an IPI by resetting the delivery status bit in the Interrupt Command Register (ICR).

An example of an IPI is flushing the TLB contents when a processor modifies an entry in a page table data structure that is shared with other processors. This allows synchronization of the address translation state across SMP processors. Another example of an IPI is rescheduling a new task on an SMP machine. Consider the example of scheduling the “idle” task: first all interrupts are enabled on the SMP processors and then the “hlt” instruction is issued to all of them. Whenever an interrupt is received from a system device such as the keyboard, a CPU is woken up; that CPU then sends an IPI to the other CPUs through a write to the APIC's ICR (e.g., in the Linux O/S the send_IPI_mask function performs the write to the ICR).

Figure 6-2 shows an example of the IPI invocation mechanism in a two-processor SMP system. Each CPU has its own local APIC unit. A CPU stores the interrupt vector and the identifier of the target processor in the ICR.

Figure 6-2. Interprocessor interrupt mechanism in the x86 architecture: an interrupt is initiated with a write to the interrupt command register (ICR). The ICR is a memory-mapped register of the APIC.

On a write to the ICR, a message is sent to the target processor via the system bus. Figure 6-2 shows the contents of the ICR used to identify the destination CPU (0x1000000) and the kind of IPI (e.g., the ICR contents for a TLB invalidation are 0x8fd). The default location of the APIC registers is 0xfee00000 in physical memory. The sequence of events to invoke an IPI for TLB invalidation is as follows (a minimal sketch of the initiating ICR write follows the list):

• CPU0 writes to the ICR of its local APIC unit with the target destination and the invalidation vector.

• CPU0 thereby sends an IPI to CPU1 to flush the local TLB contents.

• CPU1 flushes its TLB contents by writing to the ICR of its local APIC unit.

• CPU0 flushes its own TLB contents.
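The first step of this sequence, the ICR write that launches the IPI, can be sketched as below. The APIC base address and the ICR offsets follow the Intel manuals [92], and the two register values are the example contents quoted above; this is an illustration that would only be meaningful in privileged (ring 0) code with the local APIC mapped at its default address, not a drop-in implementation.

    #include <stdint.h>

    #define APIC_BASE   0xFEE00000u            /* default local APIC base    */
    #define APIC_ICR_LO (APIC_BASE + 0x300)    /* vector, delivery mode, ... */
    #define APIC_ICR_HI (APIC_BASE + 0x310)    /* destination APIC ID        */

    static inline void apic_write(uint32_t reg, uint32_t val)
    {
        *(volatile uint32_t *)(uintptr_t)reg = val;   /* MMIO register write */
    }

    /* Send the TLB-invalidation IPI from CPU0 to CPU1 using the example
     * ICR contents mentioned in the text (0x1000000 and 0x8fd). */
    static void send_invalidate_ipi(void)
    {
        apic_write(APIC_ICR_HI, 0x01000000u);  /* destination: CPU1's APIC   */
        apic_write(APIC_ICR_LO, 0x000008FDu);  /* delivery mode + TLB vector */
        /* CPU1's interrupt handler then performs the actual TLB flush.      */
    }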

6.4 Grant Table Mechanism: I/O Analysis

In the previous chapter I discussed that the Xen hypervisor provides a shared-memory mechanism for communication between guest and privileged domains. The shared pages are generally referred to and shared by a reference, called a grant. To keep track of grant allocation and de-allocation, Xen maintains the free entries in a list data structure and may allocate grants in groups to amortize the cost of allocation. The grant table mechanism supports two types of operations for page transfer between domains, namely mapping and transferring.

While transmit of an I/O packet is primarily done through mapping, receive of a packet is performed through copying or transfer of pages. In the mapping mechanism, a page from address space A is mapped into address space B, with a page reference present in both domains after the mapping operation. This mapping can generate a lot of TLB churn, as an update to a shared entry (for example, the ownership of a page) can result in a complete TLB flush. To allocate new pages and track the page mapping mechanism, the grant table keeps bookkeeping information about shared and active grant entries. The shared entries are shared between the Xen hypervisor and a guest domain; these entries are used to create new grant references for the transfer of network packets. If a grant reference is used by another domain, the grant table updates the status in the shared entry of that page. Similarly, another data structure, kept inside the hypervisor for each domain, is used to keep track of active grant entries. Finally, to map and unmap a physical page into a domain's address space, a table is maintained to index and update the status of the physical page. Consider the example of a DomB guest VM sending a packet to the privileged domain Dom0. The following is the sequence of steps executed by DomB and Dom0 to index and update grant information:

• DomB creates a grant reference (e.g., gref #N) by updating an entry in the shared grant entries. DomB places the grant reference on the virtual device channel between DomB and Dom0.

• Dom0 gets the grant reference through the virtual device channel between DomB and Dom0.

• Dom0 sends a request to the Xen hypervisor to pin the frame into its address space.

• The Xen hypervisor checks the active grant entries (maintained for Dom0 inside the hypervisor) for the grant reference (#N). If a mapping is not present, the hypervisor adds the grant entry to the active grant table with the status flag set to currently active.

• The Xen hypervisor sends an acknowledgment to Dom0 that the page frame has indeed been pinned into its address space. Pinning results in a page table update and eventually a complete TLB flush.

• Dom0 sends a DMA request to access the page.

• Similarly, an unmap operation will result in a complete TLB flush.

6.5 Experiments and Results

6.5.1 Grant Table Performance

The performance model used in Chapter 5 captures functional behavior but is not timing accurate. This choice is motivated by the fact that timing-accurate models are considerably slower than functional models. A conclusion drawn from the previous chapter was that the split I/O implementation results in additional TLB flushes and misses. It is also important to characterize the impact of this mechanism on the timing and performance of a virtualized system. This subsection addresses this issue through a profiling-based analysis of a modified version of Xen. To accurately evaluate the overhead in the life cycle of a network packet, in this chapter I instrumented the Xen source code to evaluate the percentage of cycles consumed by the grant mechanism. This experiment is performed on both Intel Core 2 Duo and Pentium-based machines. Table 6-1 provides the percentage of cycles consumed by grant table operations in the hypervisor, averaged over the period where the selected network application benchmark (iperf) executed. As inferred from Table 6-1, grant operations consume a significant amount of resources: approximately 20% of CPU cycles during a packet transfer for both the Core 2 Duo and Pentium III CPUs. While the grant map and unmap statistics are gathered during the transmit of iperf packets, the grant copy/transfer statistics show the overhead incurred during the receive of packets. The grant copy operation relies on a machine page frame being accessible in both domains. A related analysis has concluded that grant copy can be more efficient than the grant transfer operation [10], similar to the result shown in Table 6-1.

Table 6-1. Grant table overhead summary
Function call          %cycles (Core 2 Duo)   %cycles (Pentium III)
gnttab_map             7.13                   6.18
gnttab_unmap           4.28                   3.56
gnttab_transfer/copy   8.20 (copy)            10.39 (transfer)

The statistics are obtained by profiling the grant operations in the Xen hypervisor through the Xentrace tool [93] and by reading the processor cycle count from the time-stamp counter (RDTSC).
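The cycle counts themselves come from the RDTSC instruction. The following user-level fragment shows the usual way of reading the time-stamp counter around a region of interest; it is an analogue of the measurement for illustration, the actual instrumentation being inside the Xen grant-table code.

    #include <stdint.h>
    #include <stdio.h>

    /* Read the x86 time-stamp counter (EDX:EAX) with RDTSC. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t start = rdtsc();
        /* ... region of interest, e.g. a grant map/unmap pair ... */
        uint64_t end = rdtsc();

        printf("region took %llu cycles\n", (unsigned long long)(end - start));
        return 0;
    }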

6.5.2 Hypervisor Global Bit

Figure 6-3. Experimental setup: two Simics/SoftSDV sessions are synchronized using a virtual network. The iperf application deployed on a Linux O/S is executed in one session; in the other session, iperf is deployed in DomU in a Xen hypervisor environment. Packets are sent between DomU and the Linux O/S.

I studied the potential impact of a TLB optimization that makes global hypervisor pages persistent in the TLBs. In the absence of TLB tagging, all translations are invalidated on a TLB flush. The goal of this optimization is to tag the TLB with a single bit indicating that tagged translations are not to be flushed, which can be used in a virtualized environment to tag pages associated with hypervisor code and data. As shown in Figure 6-4, such an optimization indeed has the potential to substantially reduce DTLB misses (and, to a lesser extent, ITLB misses). It is more effective than increasing the TLB size because global-bit tagging allows a subset of the translations to remain cached during switches and grant revocations. The experimental setup is shown in Figure 6-3.
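In terms of the toy TLB model from Section 6.2.1, the optimization amounts to adding one bit per entry and skipping tagged entries on a flush. The sketch below is illustrative only; the bit name and flush routines are assumptions, not the simulator's implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        uint32_t vpn;
        uint32_t pfn;
        bool     valid;
        bool     global;    /* set for hypervisor code/data translations */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Conventional flush: every translation is invalidated. */
    static void tlb_flush_all(void)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            tlb[i].valid = false;
    }

    /* Flush honoring the global bit: hypervisor translations survive, so
     * only guest translations incur compulsory misses afterwards. */
    static void tlb_flush_non_global(void)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (!tlb[i].global)
                tlb[i].valid = false;
    }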

Figure 6-4. Impact of tagging the TLB with a global bit to prevent TLB flushes of hypervisor pages. In the plot, the TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

The importance of performance isolation and VM-level QoS is a growing research area, especially with the introduction of multi-core processors that share platform resources such as caches, TLBs and memory. My work is further extended to study the impact of quality of service on the TLB in [94].

6.5.3 TLB Coherence Evaluation

This analysis considers the potential benefit of additional hardware support to handle TLB coherence. The goal of this analysis is to evaluate the extent to which the conservative approach of TLB shootdowns can negatively impact performance by removing valid translations from a remote TLB.

Figure 6-5. Page sharing in a multicore environment: an example of a potential reason for the invocation of an interprocessor interrupt between two CPUs is shown.

To check the consistency between the page table and TLB contents at the invocation of an interprocessor interrupt, the sequence of steps, as shown in Figure 6-5, is as follows:

• The page tables for CPU0 and CPU1 share a page X between them. Page X is marked as read-only in the TLB entries of CPU0 and CPU1 at time t0.

• CPU0 modifies page X at time t1 with read-write permission. It updates the page table entry for page X in the currently loaded page table.

• Since page X is a shared page, CPU0 needs to inform CPU1 to invalidate the local TLB entry. At time t2, CPU0 informs CPU1 to flush the TLB entries through an interprocessor interrupt.

• The page table and TLB contents of CPU1 are not consistent at time t2, as there is at least one modification to a shared entry in the page table.

The normal behavior of the IPI is to flush the complete TLB contents on CPU1. I modified the TLB flush behavior to flush only the TLB entries that are inconsistent with the page table. To check this consistency, the TLB contents are looked up in the page table at the invocation of the IPI to detect potential changes.

The experiments and implementation are done using the Simics simulator. Simics is a full-system simulator that can run unmodified operating systems, device drivers and applications. This allows system simulation with actual workloads and applications. To understand and evaluate TLB entry sharing, I monitored the TLB IPI invocations during an execution run of the iperf benchmark.

Figure 6-6. Simics model to capture inter-processor interrupts: Simics exports an API to register different performance models. These performance models can capture important events such as IPIs through the Simics API (SIM_hap_add_callback). The Simics workload is abstracted to represent an O/S or a hypervisor.

The TLB model is generally initialized and loaded when Simics is booted. Simics provides an API to register a callback function to capture and act on IPI events in the TLB module. An example use of the IPI callback function is a counter that tallies the number of IPIs during the execution of a workload. Figure 6-6 illustrates that the Simics API (SIM_hap_add_callback) is used to register for an event related to a core system device (in this case the APIC). This callback function is used to modify the semantics of the TLB flush during the execution run of a benchmark. In this case, I parsed the page table of the CPU receiving the IPI and compared it with the TLB contents to capture the consistency behavior.
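The registration itself is a single call into the Simics hap API. The fragment below is a minimal sketch of such a module hook; the header path, hap name ("X86_IPI") and callback signature are assumptions for illustration based on the Simics 3.x module interface, not copied from the modified TLB module.

    #include <simics/api.h>      /* Simics module API (header name assumed) */

    static unsigned long ipi_count;   /* IPIs observed so far */

    /* Invoked by Simics whenever the hooked hap fires. */
    static void ipi_callback(lang_void *user_data, conf_object_t *apic_obj)
    {
        ipi_count++;
        /* Here the modified TLB module would walk the receiving CPU's page
         * table and invalidate only the entries found to be inconsistent. */
    }

    /* Standard module entry point: register the callback at load time. */
    void init_local(void)
    {
        SIM_hap_add_callback("X86_IPI",                 /* hypothetical hap */
                             (obj_hap_func_t)ipi_callback,
                             NULL);
    }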

Table 6-2 shows the number of IPIs for the transmit and receive of network I/O packets during an execution run of the iperf benchmark. The potential benefit of not flushing the TLB during an interprocessor interrupt can be negated if the O/S issues a normal flush because it schedules another process. For this experiment, I created a network of two Simics-simulated machines. For each experiment, the Simics sessions are warmed up for 20 million instructions; the statistics shown are collected over a run of 500 million instructions. Note that the number of flushes is higher in the receive scenario, when the guest virtual machine is receiving the packets. To evaluate consistency during the IPI, a page table parser is implemented in the Simics TLB module to look up the TLB entries in the page table.

Table 6-2. TLB flush statistics with and without IPI flush optimization
Transmit/Receive   IPI flush   Normal flush
Transmit           485         31670
Receive            602         57718

To further understand the impact of the IPI mechanism, I studied the impact of scaling the TLB size for the instruction and data TLBs. For these experiments, domain-1 is affinitized to CPU1 while domain-0 does not have any CPU affinity. Table 6-3 and Table 6-4 provide the TLB miss statistics for the instruction and data TLBs during an application run of iperf for 50 million instructions. While the data TLB does not show a significant improvement in performance for either CPU, the instruction TLB for CPU1 shows an improvement of 1.2 to 2.4%. In addition, the number of misses does not change significantly beyond a TLB size of 128 entries.

While the number of TLB misses is reduced when a complete flush is avoided during the invocation of an interprocessor interrupt, the potential performance improvement is negated by complete flushes performed locally (e.g., on scheduling of a new process).

Table 6-3. Instruction TLB miss statistics with and without IPI flush optimization for the receive iperf benchmark
IPI mode              64 entries   128 entries   256 entries   512 entries
IPI flush (CPU0)      112848       100687        100657        100687
IPI flush (CPU1)      347          4506          4506          4506
No IPI flush (CPU0)   112811       100604        100602        100602
No IPI flush (CPU1)   347          4455          4455          4455

Table 6-4. Data TLB miss statistics with and without IPI flush optimization for the receive iperf benchmark
IPI mode              64 entries   128 entries   256 entries   512 entries
IPI flush (CPU0)      389221       454949        450143        450118
IPI flush (CPU1)      55369        33636         33628         33628
No IPI flush (CPU0)   389186       454916        450051        450008
No IPI flush (CPU1)   55369        33580         33554         33554

6.6 Related Work

Accelerated routes, set up by the privileged domain for direct hardware access by guest domains, have been studied [95]. Performance evaluation of hardware resources shared among multi-core processors under virtual machine-based server consolidation workloads has recently been addressed [96]. A simulation-based approach has been used to maximize shared memory access between multiple VMs [96]; however, this study approximates the use of virtualization, since the model does not consider the execution of the hypervisor. All of these analyses lacked a complete understanding of the network stack. Specialized virtual machine containers (with minimal functionality, such as I/O only) have been used to study I/O scalability [97].

Solarflare [11] has an approach to bypass the hypervisor for network I/O. The approach is based on providing hardware support for a virtual NIC (vNIC) per virtual machine. The virtual NIC controller for I/O acceleration communicates with the network driver interface offered by the guest virtual machines; the network drivers inside the guest virtual machines communicate directly with the interface offered by the virtual NIC controller, bypassing the hypervisor.

There is an overhead for sending and receiving network packets. In the past, the network packet size was limited to around 1500 bytes due to high error rates and low communication speeds. The smaller the network packet size, the more time the CPU spends processing packets, and the time consumed in the software and hardware stacks is considerable; specifically, the number of interrupts needed to send a given amount of data grows as the packet size shrinks. Jumbo frames can reduce the number of interrupts and the time needed to process a large number of bytes by carrying them in a single packet.

The InfiniBand architecture is a specification developed by an industry consortium to provide high-performance I/O. It provides OS-bypass communication schemes, memory semantics (RDMA) and channel semantics (send/receive). A driving application of the InfiniBand architecture is time-critical or real-time applications. One solution to address low network I/O performance is to implement an InfiniBand driver for Xen; the primary goal is to bypass the privileged domain for network I/O. To obtain the full InfiniBand performance gain, the guest VM needs to be aware of the underlying hardware [98].

Many approaches have been considered in the past to improve TLB performance. When a processor needs to modify a TLB entry, it locks the page table to prevent other processors from modifying it, flushes its local TLB entries, queues TLB operations (such as TLB refills) for update, sends an IPI and spins until all other processors are done, and finally unlocks the page table. These steps are cycle-consuming. Many improvements have been suggested, including:

• Keep track of each processor's TLB state and selectively send IPIs.

• Defer shootdowns on upgrade changes [99]. An example of this is the upgrade of a page from read-only to read-write. The Linux O/S does support lazy TLB flushing.

• Use the INVLPG instruction to selectively flush TLB entries, so that a complete flush is unnecessary (a one-line sketch follows this list).
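For reference, invalidating a single translation with INVLPG (instead of reloading CR3, which flushes everything) looks like the following kernel-mode helper; this is a generic illustration, not code from Xen or Linux.

    /* Invalidate the TLB entry for one virtual address (ring 0 only). */
    static inline void invlpg(void *vaddr)
    {
        __asm__ __volatile__("invlpg (%0)" : : "r"(vaddr) : "memory");
    }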

6.7 Conclusion

In this chapter, I simulated a networking workload scenario in a full-system simulator to evaluate the impact of I/O packets on interprocessor communication between the CPUs. While the number of TLB misses is reduced when a complete flush is avoided during IPI invocation, the potential performance improvement is negated by complete flushes performed locally (e.g., on scheduling of a new process). It would be interesting to further evaluate selective flushing for interprocessor interrupts together with selective treatment of local flushes, as well as the benefit of selective invalidation coupled with TLB tagging.

CHAPTER 7 CONCLUSION AND FUTURE WORK

7.1 Conclusion

ROW-FS supports functionality that is not available in traditional network file systems while remaining compatible with existing NFS clients and servers. The two main conclusions from the experimental analyses of ROW-FS are that (1) redirect-on-write functionality allows transparent on-demand access to data on multiple servers and is easy to deploy as user-level proxies, and (2) its functionality is beneficial to applications and frameworks such as O/S image management, delivering good performance without requiring administrator intervention. The protocol redirection approach allows session modifications to be buffered locally and checkpointed frequently with APIs exported by VM monitors such as VMware Server. Benchmarks including Linux kernel compilation and Andrew show that the performance of ROW-FS over a WAN is often comparable with NFSv3 in a LAN. ROW-FS is important for accessing read-only services from remote servers while storing a user's private data locally. An application highlighted in this thesis shows how ROW-FS can be integrated with P2P services for virtual network routing (IP-over-P2P) and object storage/lookup (DHT) to form a flexible system for VM image provisioning. Such an integrated approach is particularly useful in applications such as high-throughput computing in systems like Condor, where entire virtual machines instead of processes are scheduled for computation. Experimental results show that VM bootup times are close to LAN bootup times, in particular when ROW-FS proxies are combined with disk-caching NFS proxies, while requiring transfers of only a small fraction of the total disk image. Again, this is important in applications such as high-throughput computing where the same base VM image can be used for large numbers of jobs.

An important source of overhead in ROW-FS applications in particular (and, in general, in network I/O workloads in split-I/O virtual machine environments) can be attributed to inter-process communication within a multi-core platform. This thesis has also explored a simulation-based methodology to characterize the performance of network I/O workloads. A simulation-based approach not only allows one to analyze current architectures but also supports exploring hardware and software trade-offs. This work concludes that the overhead associated with network I/O in a virtualized environment is primarily due to context switches and shared resources between VMs, and that approaches that reduce the amount or scope of TLB flushes can provide reductions in TLB misses that are more significant than those brought about by increasing TLB sizes.

7.2 Future Work

This dissertation motivates research on user-level redirection approaches and protocols. While ROW-FS is based on the NFSv2 protocol, the ROW-FS framework for O/S images can benefit from proxy extensions to the NFSv3 and NFSv4 protocols. Furthermore, ROW-FS implementations can be optimized to prefetch NFS blocks based on file or directory access patterns. ROW-FS could be deployed in scenarios where services such as storage or bandwidth are billed as utilities. An advantage of ROW-FS in this environment is that it has the potential to reduce the amount of storage required for VM images at the provider's site. An example of such an environment is the cloud computing model, where services are provided transparently to the end user. This motivates future work to better characterize the savings in storage and bandwidth for on-demand transfers versus full-image copying for a wide range of O/Ses and workloads. Another area of future work considers approaches to reconcile private ROW-FS shadow data with the main server. ROW-FS data could be reconciled with the image server through utilities that provide data synchronization (e.g., rsync).

The results of this dissertation show that full-system simulation-based studies are important for evaluating processor architecture and OS-hardware interaction. New I/O models are being adopted to improve network I/O performance in virtualized environments. The present work can be used to study in detail the micro-architecture of new systems and its implications for virtualized workloads. Some of the future changes in I/O models include:

• Constraints in the scalability of virtual machines indicate that future virtual machine environments will be further partitioned into independent sub-systems. In designs with separate I/O domains, emulated device drivers reside in the privileged domain (e.g., Dom0 in Xen). An I/O-intensive guest VM may therefore cause a considerable amount of execution time to be spent in Dom0, and Dom0 may in turn deprive other guest VMs of their share of the CPU. This problem is addressed using per-guest stub domains, each of which runs its own emulated devices.

• In the past, the interrupt model used by O/Ses such as Linux was generic, sending interprocessor interrupts to all processors (idle or otherwise). A recent trend is to control CPU performance and energy through hardware support for several different states (e.g., idle, operational). These states can be used by the O/S to control the behavior of TLB flushes. In this respect, future architectures are predicted to perform bookkeeping so as to prevent useless TLB flushes (conceivably through either O/S or hardware support mechanisms).

• Future-generation microarchitectures are predicted to incorporate new models of TLBs and caches. Examples of new technology trends include Extended Page Tables (EPT) and two-level TLBs, which are incorporated into Intel's new microarchitecture (Nehalem) [100] to reduce the cost of virtual-to-physical address translation. In this scenario, this dissertation motivates research on models that capture the behavior of similar new on-chip resources. These models can be augmented and evaluated further to guide system designers toward improved system performance.

Future Internet usage is predicted to include new usage models that are based on user connectivity, such as social networking, collaboration and on-line gaming. There are instances where it is desirable to encapsulate these applications in VM environments for isolation and encapsulation. These applications are also often network-intensive. Two of the key challenges and research opportunities in such models are (1) workload characterization to optimize future platforms and (2) scalable system and application execution on diverse clients. This dissertation motivates the research to address the above challenges.

REFERENCES

[1] J. E. Smith and R. Nair, Virtual Machines: versatile platforms for systems and processes, Morgan Kaufmann publishers, May 2005.

[2] M. Litzkow, M. Livny, and M. W. Mutka, “Condor: a hunter of idle workstations,” in Proceedings of 8th international conference on Distributed Computing Systems, Jun 1988, pp. 104–111.

[3] B. Callaghan, NFS illustrated, Addison-Wesley Longman Ltd., Essex, UK, 2000.

[4] S. Adabala, V. Chadha, P. Chawla, R. J. Figueiredo, J. A. B. Fortes, I. Krsul, A. Matsunga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, “From virtualized resources to virtual computing grids: The in-vigo system,” Future Generation Computing Systems,special issue on Complex Problem-Solving Environ- ments for Grid Computing, vol. 21, no. 6, Apr 2005. [5] K. Keahey, I. Foster, T. Freeman, X. Zhang, and D. Galron., “Virtual workspaces in the grid,” in Proceedings of the Euro-Par Conference, Lisbon, Portugal, Sep 2005.

[6] A. Sundaraj and P. A. Dinda, “Towards virtual networks for virtual machine grid computing,” in 3rd USENIX Virtual Machine Research and Technology Symp., May 2004. [7] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youse, and D. Zagorodnov, “The eucalyptus open-source cloud-computing system,” in Cloud Computing and Its Applications workshop (CCA’08), Chicago, IL, October 2008. [8] VMware, “Merrill lynch to standardize on vmware virtual machine software [Online],” World Wide Web electronic publication, Available: http://www.vmware. com/company/news/releases/merrill lynch.html 2008. [9] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, “Diagnosing performance overheads in the xen virtual machine environment,” in VEE ’05: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments, New York, NY, USA, 2005, pp. 13–23, ACM.

[10] A. Menon, A. L. Cox, and W. Zwaenepoel, “Optimizing network virtualization in xen,” in ATEC ’06: Proceedings of the annual conference on USENIX ’06 Annual Technical Conference, Berkeley, CA, USA, 2006, USENIX Association.

[11] D. Chisnall, The definitive guide to the xen hypervisor, Prentice Hall Press, Upper Saddle River, NJ, USA, 2007. [12] R. Goldberg, “Survey of virtual machine research,” IEEE Computer Magazine, vol. 7, no. 6, pp. 34–45, 1974.


[13] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hllberg, J. Hgberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation platform,” IEEE Computer, 2002. [14] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, “Complete computer system simulation: The simOS approach,” IEEE Parallel and Distributed Technol- ogy, vol. 3, pp. 34–43, 1995. [15] J. J. Yi and D. J. Lilja, “Simulation of computer architectures: Simulators, benchmarks, methodologies, and recommendations,” IEEE Transactions on Computers, vol. 55, no. 3, pp. 268–280, 2006.

[16] J. Sugerman, G. Venkitachalan, and B. H. Lim, “Virtualizing i/o devices on vmware workstation’s hosted virtual machine monitor,” in Proceedings of the USENIX Annual Technical Conference, Jun 2001.

[17] S. Hand, A. Warfield, K. Fraser, E. Kotsovinos, and D. Magenheimer, “Are virtual machine monitors microkernels done right?,” in HOTOS’05: Proceedings of the 10th conference on Hot Topics in Operating Systems, Berkeley, CA, USA, 2005, USENIX Association.

[18] S. Kumar, H. Raj, K. Schwan, and I. Ganev, “Re-architecting vmms for multicore systems: The sidecore approach,” in Workshop on the Interaction between Operating Systems and Computer Architecture, 2007. [19] K. Krewell, “Best servers of 2004: Multicore is norm,” Microprocessor Report, 2005. [20] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the art of virtualization,” in SOSP ’03: Proceedings of the nineteenth ACM symposium on Operating systems principles, New York, NY, USA, 2003, pp. 164–177, ACM.

[21] E. Kotsovinos, T. Moreton, I. Pratt, R. Ross, K. Fraser, S. Hand, and T. Harris, “Global-scale service deployment in the xenoserver platform,” in Proceedings of the First Workshop on Real, Large Distributed Systems (WORLDS ’04), Dec 2004.

[22] K. Suzaki, T. Yagi, K. Iijima, and N. A. Quynh, “Os circular: internet client for reference,” in LISA’07: Proceedings of the 21st conference on Large Installation System Administration Conference, Berkeley, CA, USA, 2007, pp. 105–116, USENIX Association.

[23] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout, “Measurements of a distributed file system,” in Proceedings of the 13th Symposium on Operating Systems Principles, 1991.

[24] J. H. Howard, M. L. Kazar, S. G. Menees, A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West, “Scale and performance in a distributed file system,” ACM Transactions on Computer Systems, vol. 6, Feb 1988.

[25] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, “Nfs version 3: Design and implementation,” in USENIX Summer, Boston, MA, Jun 1994. [26] A. Ganguly, D. Wolinsky, P. O. Boykin, and R. Figueiredo, “Decentralized dynamic host: Configuration in wide-area overlay networks of virtual workstations,” in Workshop on Large-Scale and Volatile Desktop Grids (PCGrid), Long Beach, CA, Mar 2007, pp. 1–8.

[27] P. Apparao, S. Makineni, and D. Newell, “Characterization of network processing overheads in xen,” in VTDC ’06: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, Washington DC, USA, 2006, p. 2, IEEE Computer Society. [28] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle, “Dynamic virtual clusters in a grid site manager,” in HPDC ’03: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, 2003, pp. 90–101, IEEE Computer Society.

[29] I. Foster, C. Kesselman, and S. Tuecke, “The anatomy of the grid: enabling scalable virtual organizations,” International Journal of Supercomputing Applications, vol. 15, no. 3, pp. 200–222, Apr 2001. [30] A. Ganguly, A. Agrawal, P. Boykin, and R. J. Figueiredo, “Ip over p2p: Enabling self-configuring virtual ip networks for grid computing,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rhode Island, Greece, Apr 2006. [31] S. M. Larson, C. D. Snow, M. R. Shirts, and V. S. Pande, “Folding@home and genome@home: Using distributed computing to tackle previously intractable problems in computational biology,” in Computational Genomics. 2002, Horizon Press.

[32] D. P. Anderson, J. Cobb, E. Korpella, M. Lebofsky, and D. Werthimer, “Seti@home: An experiment in public-resource computing,” Communications of the ACM, vol. 11, no. 45, pp. 56–61, 2002.

[33] J. J. Kistler and M. Satyanarayan, “Disconnected operation in coda file system,” ACM Transactions on Computer Systems, vol. 6, Feb 1992.

[34] B. S. White, A. S. Grimshaw, and A. Nguyen-Tuong, “Grid-based file access: The legion i/o model,” in High Performance distributed Computing, Pittsburgh, PA, Aug 2000, pp. 165–174. [35] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, and L. Smith, “Intel virtualization technology,” Computer, vol. 38, no. 5, 2005. 139

[36] T. Garfinkel, K. Adams, A. Warfield, and J. Franklin, “Compatibility is not transparency: Vmm detection myths and realities,” in HOTOS’07: Proceedings of the 11th USENIX workshop on Hot topics in operating systems, Berkeley, CA, USA, 2007, pp. 1–6, USENIX Association.

[37] M. Rosemblum and T. Garfinkel, “Virtual machine monitors: Current technology and future trends,” IEEE Computer, vol. 38, pp. 39–47, 2005. [38] D. C. Anderson, J. S.Chase, and A. M. Vahdat, “Interposed request for routing for scalable network storage,” in Symposium on OSDI, San Diego, CA, oct 2000.

[39] D. J. Blezard, “Multi-platform computer labs and classrooms: a magic bullet?,” in SIGUCCS ’07: Proceedings of the 35th annual ACM SIGUCCS conference on User services, New York, NY, USA, 2007, pp. 16–20, ACM. [40] J. Watson, “Virtualbox: bits and bytes masquerading as machines,” Linux J., vol. 2008, no. 166, pp. 1, 2008.

[41] S. Bhattiprolu, E. W. Biederman, S. Hallyn, and D. Lezcano, “Virtual servers and checkpoint/restart in mainstream linux,” SIGOPS Operating System Review, vol. 42, no. 5, pp. 104–113, 2008. [42] J. Dike, “A user-mode port of the linux kernel,” in Proc. of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000. [43] F. Bellard, “Qemu, a fast and portable dynamic translator,” in ATEC ’05: Proceedings of the annual conference on USENIX Annual Technical Conference, Berkeley, CA, USA, 2005, pp. 41–46, USENIX Association.

[44] “x86 architecture [Online],” World Wide Web electronic publication, Available: http://en.wikipedia.org/wiki/X86 2008. [45] A. S. Tanenbaum, J. N. Herder, and H. Bos, “Can we make operating systems reliable and secure?,” in Computer. may 2006, pp. 44–51, IEEE Computer Society. [46] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, “Accelerating two-dimensional page walks for virtualized systems,” in ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, New York, NY, USA, 2008, pp. 26–35, ACM.

[47] P. Willmann, S. Rixner, and A. L. Cox, “Protection strategies for direct access to virtualized i/o devices,” in ATC’08: USENIX 2008 Annual Technical Conference on Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 15–28, USENIX Association. [48] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt, “Qos policies and architecture for cache/memory in cmp platforms,” in ACM Sigmetrics, Jun 2007.

[49] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy, “Fairness issues in software virtual routers,” in PRESTO ’08: Proceedings of the ACM workshop on Programmable routers for extensible services of tomorrow, New York, NY, USA, 2008, pp. 33–38, ACM.

[50] A. Muthitacharoen, B. Chen, and D. Mazieres, “A low-bandwidth network file system,” in Symposium on Operating Systems Principles, 2001, pp. 174–187. [51] A. R. Butt, T. Johnson, Y. Zheng, and Y. C. Hu, “Kosha: A peer-to-peer enhancement for the network file system,” in Proceedings of IEEE/ACM SC2004, Nov 2004.

[52] V. Srinivasan and J. C. Mogul, “Spritely nfs: Experiements with cache-consistency protocols,” in Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Dec 1989, pp. 45–57.

[53] R. Macklem, “Not quite nfs, soft cache consistency for nfs,” in Proceedings of the Winter 1994 Usenix Conference, San Francisco, CA, Jan 1994. [54] D. Hildebrand and P. Honeyman, “Exporting storage systems in a scalable manner with pnfs,” in Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, Washington, DC, USA, 2005, pp. 18–27, IEEE Computer Society.

[55] A. Traeger, A. Rai, C. P. Wright, and E. Zadok, “Nfs file handle security,” Tech. Rep., Stony Brook University, 2004. [56] R. J. Figueiredo, P. Dinda, and J. A. B. Fortes, “A case for grid computing on virtual machines,” in Proc. of the 23rd IEEE Intl. Conference on Distributed Computing Systems (ICDCS), Providence, Rhode Island, May 2003.

[57] M. Zhao and R. J. Figueiredo, “Distributed file system support for virtual machines in grid computing,” in Proceedings of HPDC, Jun 2004. [58] M. Zhao, V. Chadha, and R. J. Figueiredo, “Supporting application-tailored grid file system sessions with wsrf-based services.,” in Proceedings of HPDC, Jul 2005. [59] D. Wolinsky, A. Agrawal, P. O. Boykin, J. Davis, A. Ganguly, V. Paramygin, P. Sheng, and R. Figueiredo, “On the design of virtual machine sandboxes for distributed computing in wide area overlays of virtual workstations,” in First Workshop on Virtualization Technologies in Distributed Computing (VTDC), Nov 2006.

[60] F. Oliveira, G. Guardiola, J. A. Patel, and E. V. Hensbergen, “Blutopia: Stackable storage for cluster management,” in Proceedings of the IEEE cluster computing, Sep 2007. [61] S. Santhanam, P. Elango, A. Arpaci-Dusseau, and M. Livny, “Deploying virtual machines as sandboxes for the grid,” in USENIX WORLDS, 2004. 141

[62] M. Carson and D. Santay, “Nist net: a linux-based network emulation tool,” SIGCOMM Computer Communication Review, vol. 33, no. 3, pp. 111–126, 2003.

[63] J. Spadavecchia and E. Zadok, “Enhancing nfs cross-administrative domain access,” in USENIX Annual Technical Conference FREENIX Track, 2002, pp. 181–194. [64] M. Baker, R. Buyya, and D. Laforenza, “Grids and grid technologies for wide-area distributed computing,” Software Practice & Experience, vol. 32, no. 15, pp. 1437–1466, 2002. [65] C. P. Wright, J. Dave, P. Gupta, H. Krishnan, D. P. Quigley, E. Zadok, and M. N. Zubair, “Versatility and unix semantics in namespace unification,” ACM Transac- tions on Storage (TOS), vol. 2, no. 1, pp. 1–s32, February 2006.

[66] D. Santry, M. Feeley, N. Hutchinson, A. Veitch, R. Carton, and J. Ofir., “Deciding when to forget in the elephant file system,” in 17th ACM SOSP Principles, 1999. [67] R. G. Minnich, “The autocacher: A file cache which operates at the nfs level,” in USENIX conference proceedings, 1993, pp. 77–83.

[68] S. Osman, D. Subhraveti, G. Su, and J. Nieh, “The design and implementation of zap: A system for migrating computing environments.,” in Symposium on OSDI, Boston, MA, Dec 2002. [69] J. H. Hartman and J. K. Ousterhout, “The zebra striped network file system,” in SOSP ’93: Proceedings of the fourteenth ACM symposium on Operating systems principles. Dec 1993, pp. 29–43, ACM. [70] A. Agbaria and R. Friedman, “Virtual machine based heterogeneous checkpointing,” software-Practice & Experience, vol. 32, pp. 1175–1192, 2002.

[71] A. Ganguly, A. Agrawal, P. O. Boykin, and R. J. O. Figueiredo, “Wow: Self-organizing wide area overlay networks of virtual workstations.,” in HPDC. June 2006, pp. 30–42, IEEE. [72] C. Sun, L. He, Q. Wang, and R. Willenborg, “Simplifying service deployment with virtual appliances,” in IEEE International Conference on Services Computing, vol. 2, pp. 265–272, 2008.

[73] R. Prodan and T. Fahringer, “Overhead analysis of scientific workflows in grid environments,” IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 3, pp. 378–393, 2008. [74] A. S. Tanenbaum and M. van Steen, Distributed Systems: Principles and Paradigms (1st Edition), Prentics-Hall Inc, 2002.

[75] V. Chadha and R. J. Figueiredo, “Row-fs: A user-level virtualized redirect-on-write distributed file system for wide area applications,” in International Conference on high Performance Computing(HiPC), Goa, India, Dec 2007.

[76] S. Annapureddy, M. J. Freedman, and D. Mazières, “Shark: Scaling file servers via cooperative caching,” in Proceedings of the 2nd USENIX/ACM Symposium on Networked Systems Design and Implementation, May 2005.

[77] D. Reimer, A. Thomas, G. Ammons, T. Mummert, B. Alpern, and V. Bala, “Opening black boxes: Using semantic information to combat virtual machine image sprawl,” in VEE ’08: Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2008, pp. 111–120, ACM.

[78] R. Chandra, N. Zeldovich, C. Sapuntzakis, and M. S. Lam, “The Collective: A cache-based system management architecture,” in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI), 2005, pp. 259–272.

[79] “2X ThinClientServer,” World Wide Web electronic publication, 2008, http://www.2x.com/downloads/thinclientserver/2XThinClientServer.pdf.

[80] M. McNett, D. Gupta, A. Vahdat, and G. M. Voelker, “Usher: An extensible framework for managing clusters of virtual machines,” in Proceedings of the 21st Large Installation System Administration Conference (LISA), Nov 2007.

[81] J. Cappos, S. Baker, J. Plichta, D. Nguyen, J. Hardies, M. Borgard, J. Johnston, and J. H. Hartman, “Stork: Package management for distributed VM environments,” in Proceedings of the 21st Large Installation System Administration Conference (LISA), Nov 2007.

[82] R. Grossman and Y. Gu, “Data mining using high performance data clouds: Experimental studies using Sector and Sphere,” in KDD ’08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2008, pp. 920–927, ACM.

[83] A. Menon, J. R. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel, “Diagnosing performance overheads in the Xen virtual machine environment,” in VEE ’05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, 2005, pp. 13–23, ACM.

[84] J. R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, “Bridging the gap between software and hardware techniques for I/O virtualization,” in ATC ’08: USENIX 2008 Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 29–42, USENIX Association.

[85] R. Uhlig, R. Fishtein, O. Gershon, I. Hirsh, and H. Wang, “SoftSDV: A presilicon software development environment for the IA-64 architecture,” Intel Technology Journal, 1999.

[86] M. Yourst, “PTLsim: A cycle accurate full system x86-64 microarchitectural simulator,” in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), April 2007, pp. 23–34.

[87] R. Iyer, “On modeling and analyzing cache hierarchies using CASPER,” in 11th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS), Oct 2003.

[88] W. Wu and M. Crawford, “Interactivity vs. fairness in networked Linux systems,” Computer Networks, vol. 51, no. 14, pp. 4050–4069, 2007.

[89] L. Cherkasova and R. Gardner, “Measuring CPU overhead for I/O processing in the Xen virtual machine monitor,” in ATEC ’05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, Berkeley, CA, USA, Apr 2005, USENIX Association.

[90] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, “Intel virtualization technology: Hardware support for efficient processor virtualization,” Intel Technology Journal, Aug 2006.

[91] B. Jacob, S. W. Ng, and D. T. Wang, Memory Systems: Cache, DRAM and Disk, Morgan Kaufmann Publishers, 2007.

[92] Intel Corporation, “Intel 64 and IA-32 architectures software developer’s manuals [Online],” World Wide Web electronic publication, 2008. Available: http://www.intel.com/products/processor/manuals/index.htm

[93] D. Gupta, R. Gardner, and L. Cherkasova, “XenMon: QoS monitoring and performance profiling tool [Online],” World Wide Web electronic publication, 2008. Available: http://hpl.hp.com/techreports/2005/HPL-2005-187.pdf

[94] O. Tickoo, H. Kannan, V. Chadha, R. Illikkal, R. Iyer, and D. Newell, “qTLB: Looking inside the lookaside buffer,” in International Conference on High Performance Computing (HiPC), Goa, India, Dec 2007.

[95] R. Santos, G. Janakiraman, and Y. Turner, “Xen network optimization [Online],” World Wide Web electronic publication, 2008. Available: http://www.getxen.org/files/summit 3/networkoptimizations.pdf

[96] M. R. Marty and M. D. Hill, “Virtual hierarchies to support server consolidation,” in ISCA ’07: Proceedings of the 34th Annual International Symposium on Computer Architecture, New York, NY, USA, 2007, pp. 46–56, ACM.

[97] J. Wiegert, G. Regnier, and J. Jackson, “Challenges for scalable networking in a virtualized server,” in Proceedings of the 16th International Conference on Computer Communications and Networks, Aug 2007.

[98] J. Liu, W. Huang, B. Abali, and D. K. Panda, “High performance VMM-bypass I/O in virtual machines,” in ATEC ’06: Proceedings of the Annual Conference on USENIX ’06 Annual Technical Conference, Berkeley, CA, USA, May 2006, USENIX Association.

[99] M.-S. Chang and K. Koh, “Lazy TLB consistency for large-scale multiprocessors,” in Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms/Architecture Synthesis, 1997.

[100] Intel Corporation, “First the tick, now the tock: Next generation Intel microarchitecture [Online],” World Wide Web electronic publication, 2008. Available: http://support.intel.com/technology/architecture-silicon/next-gen/whitepaper.pdf

BIOGRAPHICAL SKETCH

Vineet graduated with a B.E. in Electronics and Telecommunication from the University of Pune, India, and earned his M.S. in Computer Science from Mississippi State University.

He is pursuing his Ph.D. in Computer Information Science and Engineering at the University of Florida. His research interests include virtualization, operating systems, computer architecture, file systems, and distributed computing. Since Fall 2002, Vineet has been a research assistant at the Advanced Computing and Information Systems (ACIS) Laboratory, where his research has focused on the Grid Virtual File System (GVFS) and I/O virtualization. He has been involved in the development of middleware support for network file systems and of a simulation-based evaluation methodology for characterizing I/O overheads in virtualized environments. To complement his academic experience, Vineet has completed two summer internships at the Intel Systems Technology Lab. Upon graduation, Vineet plans to take up a full-time position at Intel Corporation.
