<<

A Case for High Performance Computing with Virtual Machines

Wei Huang† Jiuxing Liu‡ Bulent Abali‡ Dhabaleswar K. Panda† † Science and Engineering ‡ IBM T. J. Watson Research Center The Ohio State University 19 Skyline Drive Columbus, OH 43210 Hawthorne, NY 10532 {huanwei, panda}@cse.ohio-state.edu {jl, abali}@us.ibm.com

ABSTRACT in the 1960s [9], but are experiencing a resurgence in both (VM) technologies are experiencing a resur- industry and research communities. A VM environment pro- gence in both industry and research communities. VMs of- vides virtualized hardware interfaces to VMs through a Vir- fer many desirable features such as security, ease of man- tual Machine Monitor (VMM) (also called ). VM agement, OS customization, performance isolation, check- technologies allow running different guest VMs in a phys- pointing, and migration, which can be very beneficial to ical box, with each guest VM possibly running a different the performance and the manageability of high performance guest . They can also provide secure and computing (HPC) applications. However, very few HPC ap- portable environments to meet the demanding requirements plications are currently running in a virtualized environment of computing resources in modern computing systems. due to the performance overhead of . Further, Recently, network interconnects such as InfiniBand [16], using VMs for HPC also introduces additional challenges Myrinet [24] and Quadrics [31] are emerging, which provide such as management and distribution of OS images. very low latency (less than 5 µs) and very high bandwidth In this paper we present a case for HPC with virtual ma- (multiple Gbps). Due to those characteristics, they are be- chines by introducing a framework which addresses the per- coming strong players in the field of high performance com- formance and management overhead associated with VM- puting (HPC). As evidenced by the Top 500 based computing. Two key ideas in our design are: Virtual list [35], clusters, which are typically built from commodity Machine Monitor (VMM) bypass I/O and scalable VM im- PCs connected through high speed interconnects, have be- age management. VMM-bypass I/O achieves high commu- come the predominant architecture for HPC since the past nication performance for VMs by exploiting the OS-bypass decade. feature of modern high speed interconnects such as Infini- Although originally more focused on resource sharing, cur- Band. Scalable VM image management significantly reduces rent virtual machine technologies provide a wide range of the overhead of distributing and managing VMs in large benefits such as ease of management, system security, per- scale clusters. Our current implementation is based on the formance isolation, checkpoint/restart and live migration. VM environment and InfiniBand. However, many of Cluster-based HPC can take advantage of these desirable our ideas are readily applicable to other VM environments features of virtual machines, which is especially important and high speed interconnects. when ultra-scale clusters are posing additional challenges on We carry out detailed analysis on the performance and performance, , system management, and adminis- management overhead of our VM-based HPC framework. tration of these systems [29, 17]. Our evaluation shows that HPC applications can achieve In spite of these advantages, VM technologies have not almost the same performance as those running in a native, yet been widely adopted in the HPC area. This is due to non-virtualized environment. Therefore, our approach holds the following challenges: promise to bring the benefits of VMs to HPC applications • Virtualization overhead: To ensure system integrity, with very little degradation in performance. the virtual machine monitor (VMM) has to trap and privileged operations from the guest VMs. This 1. INTRODUCTION overhead is especially visible for I/O virtualization, where the VMM or a privileged host OS has to in- Virtual machine (VM) technologies were first introduced tervene every I/O operation. This added overhead is not favored by HPC applications where communica- tion performance may be critical. Moreover, memory consumption for VM-based environment is also a con- Permission to make digital or hard copies of all or part of this work for cern because a physical box is usually hosting several personal or classroom use is granted without fee provided that copies are guest virtual machines. not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to • Management Efficiency: Though it is possible to uti- republish, to post on servers or to redistribute to lists, requires prior specific lize VMs as static computing environments and run ap- permission and/or a fee. ICS'06 June 2830, Cairns, Queensland, Australia. plications in pre-configured systems, high performance Copyright 2006 ACM 1595932828/06/0006 ...$5.00. computing cannot fully benefit from VM technologies unless there exists a management framework which helps to map VMs to physical machines, dynamically With the deployment of large scale clusters for HPC appli- distribute VM OS images to physical machines, boot- cations, management and scalability issues on these clusters up and shutdown VMs with low overhead in cluster are becoming increasingly important. Virtual machines can environments. greatly benefit cluster computing in such systems, especially from the following aspects: In this paper, we take on these challenges and propose a VM-based framework for HPC which addresses various • Ease of management: A system administrator can performance and management issues associated with virtu- view a virtual machine based cluster as consisting of a alization. In this framework, we reduce the overhead of net- set of virtual machine templates and a pool of physi- work I/O virtualization through VMM-bypass I/O [18]. The cal resources [33]. To create the runtime environment concept of VMM-bypass extends the idea of OS-bypass [39, for certain applications, the administrator needs only 38], which takes the shortest path for time critical oper- pick the correct template and instantiate the virtual ations through user-level communication. An example of machines on physical nodes. VMs can be shutdown VMM-bypass I/O was presented in Xen-IB [18], a proto- and brought up much easier than real physical ma- type we developed to virtualize InfiniBand under Xen [6]. chines, which eases the task of system reconfiguration. Bypassing the VMM for time critical communication oper- VMs also provide clean solutions for live migration ations, Xen-IB provides virtualized InfiniBand devices for and checkpoint/restart, which are helpful to deal with Xen virtual machines with near-native performance. Our hardware problems like hardware upgrades and failure, framework also provides the flexibility of using customized which happen frequently in large-scale systems. kernels/OSes for individual HPC applications. It also al- lows building very small VM images which can be managed • Customized OS: Currently most clusters are utiliz- very efficiently. With detailed performance evaluations, we ing general purpose operating systems, such as , demonstrate that high performance computing jobs can run to meet a wide range of requirements from various user as efficiently in our Xen-based cluster as in a native, non- applications. Although researchers have been suggest- virtualized InfiniBand cluster. Although we focus on Infini- ing that light-weight OSes customized for each type Band and Xen, we believe that our framework can be read- of application can potentially gain performance bene- ily extended for other high-speed interconnects and other fits [10], this has not yet been widely adopted because VMMs. To the best of our knowledge, this is the first study of management difficulties. However, with VMs, it is to adopt VM technologies for HPC in modern cluster envi- possible to highly customize the OS and the run-time ronments equipped with high speed interconnects. environment for each application. For example, ker- In summary, the main contributions of our work are: nel configuration, system parameters, loaded modules, as well as memory and disk space can be • We propose a framework which allows high perfor- changed conveniently. mance computing applications to benefit from the de- sirable features of virtual machines. To demonstrate • System Security: Some applications, such as system the framework, we have developed a prototype system level profiling [27], may require additional kernel ser- using Xen virtual machines on an InfiniBand cluster. vices to run. In a traditional HPC system, this requires either the service to run for all applications, or users to • We describe how the disadvantages of virtual machines, be trusted with privileges for loading additional kernel such as virtualization overhead, memory consumption, modules, which may lead to compromised system in- management issues, etc., can be addressed using cur- tegrity. With VM environments, it is possible to allow rent technologies with our framework. normal users to execute such privileged operations on • We carry out detailed performance evaluations on the VMs. Because the resource integrity for the physical overhead of using virtual machines for high perfor- machine is controlled by the virtual machine monitor mance computing (HPC). This evaluation shows that instead of the guest operating system, faulty or mali- our virtualized InfiniBand cluster is able to deliver al- cious code in guest OS may in the worst case crash a most the same performance for HPC applications as virtual machine, which can be easily recovered. those in a non-virtualized InfiniBand cluster. 3. BACKGROUND The rest of the paper is organized as follows: To further In this section, we provide the background information justify our motivation, we start with more discussion on the benefits of VM for HPC in Section 2. In Section 3, we pro- for our work. Our implementation of VM-based comput- ing is based on the Xen VM environment and InfiniBand. vide the background knowledge for this work. We identify Therefore, we start with introducing Xen in Section 3.1 and the key challenges for VM-based computing in Section 4. the InfiniBand in Section 3.2. In Section 3.3, we introduce Then, we present our framework for VM-based high perfor- Xenoprof, a profiling toolkit for Xen we used in this paper. mance computing in Section 5 and carry out detailed perfor- mance analysis in Section 6. In Section 7, we discuss several 3.1 Overview of the Xen Virtual Machine issues in our current implementation and how they can be Monitor addressed in future. Last, we discuss the related work in Xen is a popular high performance VMM originally devel- Section 8 and conclude the paper in Section 9. oped at the University of Cambridge. It uses paravirtualiza- tion [41], which requires that the host operating systems be 2. BENEFITS OF VIRTUAL MACHINES FOR explicitly ported to the Xen architecture, but brings higher HPC APPLICATIONS performance. However, Xen does not require changes to the application binary interface (ABI), so existing user applica- Requests (WQR), or descriptors, and are submitted to the tions can run without any modification. QPs. Submitted WQRs are executed by the HCA and the completions are reported through a Completion Queue (CQ) VM0 VM1 VM2 (Domain0) (Guest Domain) (Guest Domain) as Completion Queue Entries (CQE). InfiniBand requires all Device Manager Unmodified Unmodified buffers involved in communication be registered before they and Control User User can be used in data transfers. Upon the success of registra- Software Software Software tion, a pair of local and remote keys are returned. They will Guest OS Guest OS Guest OS be used later for local and remote (RDMA) accesses. (XenoLinux) (XenoLinux) (XenoLinux) Back−end driver Initiating data transfer (posting WQRs) and completion of work requests notification (poll for completion) are time- native front−end driver front−end driver Device critical tasks that need to be performed by the application in Driver a OS-bypass manner. In the Mellanox [21] approach, which represents a typical implementation of InfiniBand specifi- Xen Hypervisor cation, these operations are done by ringing a doorbell. Safe HW IF Control IF Event Channel Virtual CPU Virtual MMU Doorbells are rung by writing to the registers that form the User Access Region (UAR). UAR is memory-mapped Hardware (SMP, MMU, Physical Memory, Ehternet, SCSI/IDE) directly from a physical address space that is provided by HCA. It allows access to HCA resources from privileged as Figure 1: The structure of the Xen hypervisor, host- well as unprivileged mode. Posting a work request includes ing three xenoLinux operating systems putting the descriptors (WQR) to a QP buffer and writing the doorbell to the UAR, which is completed without the Figure 1(courtesy [30]) illustrates the structure of a phys- involvement of the operating system. CQ buffers, where the ical machine running Xen. The Xen hypervisor (the VMM) CQEs are located, can also be directly accessed from the pro- is at the lowest level and has direct access to the hardware. cess virtual address space. These OS-bypass features make The hypervisor is running in the most privileged processor- it possible for InfiniBand to provide very low communication level. Above the hypervisor are the Xen domains (VMs). latency. Guest OSes running in guest domains are prevented from directly executing privileged processor instructions. A spe- User−level Application cial domain called Domain0 (or Dom0), which is created at boot time, is allowed to access the control interface pro- User −level Infiniband Service vided by the hypervisor and performs the tasks to create, User−level HCA Driver User−space terminate or migrate other guest domains (User Domain or Kernel DomU) through the control interfaces. Core Infiniband Modules In Xen, domains communicate with each other through OS−bypass shared pages and event channels, which provide an asyn- HCA Driver chronous notification mechanism between domains. A “send” InfiniBand HCA operation on one side of the event channel will cause an event to be received by the destination domain, which may in turn cause an interrupt. If a domain wants to send data to an- Figure 2: Architectural overview of OpenIB Gen2 other, the typical scheme is for a source domain to grant stack access to local memory pages to the destination domain and then send a notification event. Then, these shared pages are OpenIB Gen2 [26] is a popular IB stack provided by the used to transfer data. OpenIB community. Xen-IB presented in this paper is based Most existing device drivers assume they have complete on the OpenIB-Gen2 driver. Figure 2 illustrates the archi- control of the device, so there cannot be multiple instantia- tecture of the Gen2 stack. tions of such drivers in different domains for a single device. 3.3 Xenoprof: Profiling in Xen VM Environ- To ensure manageability and safe access, device virtualiza- ments tion in Xen follows a split device driver model [12]. Each device driver is expected to run in an Isolated Device Do- Xenoprof [2] is an open source system-wide statistical pro- main (or IDD, typically the same as Dom0), which also hosts filing toolkit developed by HP for the Xen VM environment. a backend driver, running as a daemon and serving the ac- The toolkit enables coordinated profiling of multiple VMs in cess requests from DomUs. The guest OS in DomU uses a a system to obtain the distribution of hardware events such frontend driver to communicate with the backend. The split as clock cycles, cache and TLB misses, etc. driver organization provides security: misbehaving code in Xenoprof is an extension to Oprofile [27], the statistical one DomU will not result in failure of other domains. profiling toolkit for Linux system. For example, Oprofile can collect the PC value whenever the clock cycle counter 3.2 InfiniBand reaches a specific count to generate the distribution of time InfiniBand [16] is an emerging interconnect offering high spent in various routines at Xen domain level. And through performance and features such as OS-bypass. InfiniBand multi-domain coordination, Xenoprof is able to figure out host channel adapters (HCAs) are the equivalent of network the distribution of time spent in various Xen do- interface cards (NICs) in traditional networks. A queue- mains and the Xen hypervisor (VMM). based model is presented to the customers. A Queue Pair In this paper, we use Xenoprof as an important profiling (QP) consists of a send work queue and a receive work queue. tool to identify the bottleneck of virtualization overhead for Communication instructions are described in Work Queue HPC application running in Xen environments. 4. CHALLENGES FOR VM-BASED COM- Dom0 Xen DomU CG 16.6% 10.7% 72.7% PUTING IS 18.1% 13.1% 68.8% In spite of the advantages of virtual machines, their usage EP 00.6% 00.3% 99.0% in cluster computing has hardly been adopted. Performance BT 06.1% 04.0% 89.9% overhead and management issues are the major bottlenecks SP 09.7% 06.5% 83.8% for VM-based computing. In this section, we identify the key challenges to reduce the virtualization overhead and enhance Table 1: Distribution of execution time for NAS the management efficiency for VM-based computing. node to utilize the dual CPUs, Dom0 starts competing for 4.1 Performance Overhead CPU resources with user domains when processing I/O re- quests, which leads to even larger performance degradation The performance overhead of virtualization includes three compared to the native case. main aspects: CPU virtualization, memory virtualization This example identifies I/O virtualization as the main and I/O virtualization. bottleneck for virtualization, which leads to the observa- Recently high performance VM environments such as Xen tion that I/O virtualization with near-native performance and VMware are able to achieve low cost CPU and mem- would allow us to achieve application performance in VMs ory virtualization [6, 40]. Most of the instructions can be that rivals native environments. executed natively in the guest VMs except a few privileged operations, which are trapped by the virtual machine mon- 4.2 Management Efficiency itor. This overhead is not a big issue because applications VM technologies decouple the computing environments in the HPC area seldom call these privileged operations. from the physical resources. As mentioned in Section 2, this I/O virtualization, however, poses a more difficult prob- different view of computing resources bring numerous ben- lem. Because I/O devices are usually shared among all VMs efits such as ease of management, ability to use customized in a physical machine, the virtual machine monitor (VMM) OSes, and flexible security rules. Although with VM en- has to verify that accesses to them are legal. Currently, this vironments, many of the administration tasks can be com- requires the VMM or a privileged domain to intervene on ev- pleted without restarting the physical systems, it is required ery I/O access from guest VMs [34, 40, 12]. The intervention that VM images be dynamically distributed to the physical leads to longer I/O latency and higher CPU overhead due computing nodes and VMs be instantiated each time the to context switches between the guest VMs and the VMM. administrator reconfigure the cluster. Since VM images are often housed on storage nodes which are separate from the computing nodes, this poses additional challenges on VM image management and distribution. Especially for large scale clusters, the benefits of VMs cannot be achieved unless VM images can be distributed and instantiated efficiently.

5. A FRAMEWORK FOR HIGH PERFOR- MANCE COMPUTING WITH VIRTUAL MACHINES In this section, we propose a framework for VM-based cluster computing for HPC applications. We start with a high-level view of the design, explaining the key tech- niques used to address the challenges of VM-based com- Figure 3: NAS Parallel Benchmarks puting. Then we introduce our framework and its major Figure 3 shows the performance of the NAS [25] Paral- components. Last, we present a prototype implementation lel Benchmark suite on 4 processes with MPICH [3] over of the framework – an InfiniBand cluster based on Xen vir- TCP/IP. The processes are on 4 Xen guest domains (Do- tual machines. mUs), which are hosted on 4 physical machines. We com- pare the relative execution time with the same test on 4 5.1 Design Overview nodes with native environment. For communication inten- As mentioned in the last section, reducing the I/O virtual- sive benchmarks such as IS and CG, applications running ization overhead and enhancing the management efficiency in virtualized environments perform 12% and 17% worse, are the key issues for VM-based computing. respectively. The EP benchmark, however, which is com- To reduce the I/O virtualization overhead we use a tech- putation intensive with only a few barrier synchronization, nique termed Virtual Machine Monitor bypass (VMM-bypass) maintains the native level of performance. I/O, which was proposed in our previous work [18]. VMM- Table 1 illustrates the distribution of execution time col- bypass I/O extends the idea of OS-bypass I/O in the context lected by Xenoprof. We find that for CG and IS, around of VM environments. The overall idea of VMM-bypass I/O 30% of the execution time is consumed by the isolated de- is illustrated in Figure 4. A guest module in the VM man- vice domain (Dom0 in our case) and the Xen VMM for pro- ages all the privileged accesses, such as creating the virtual cessing the network I/O operations. On the other hand, for access points (i.e. UAR for InfiniBand) and mapping them EP benchmark, 99% of the execution time is spent natively into the address space of user processes. Guest modules in the guest domain (DomU) due to a very low volume of in a guest VM cannot directly access the device hardware. communication. We notice that if we run two DomUs per Thus a backend module provides such accesses for guest mod- Device Driver VM VM quests to the management module. Besides the appli- Application Backend Module ...... cation binaries and the number of processes required, a typical job request also includes the runtime environ- Privilegded Module Guest Module OS ment requirements for the application, such as required OS type, kernel version, system shared libraries etc. VMM Users can execute privileged operations such as load- ing kernel modules in their jobs, which is a major ad- Device vantage over normal cluster environment where users only have access to a restricted set of non-privilege op- Privileged Access VMM−Bypass Access erations. Users can be allowed to submit customized VMs containing special purpose kernels/OSes or to up- Figure 4: VMM-Bypass I/O date the VM which they previously submitted. As dis- cussed in Section 2, hosting user-customized VMs, just like running user-space applications, will not compro- ules. This backend module can either reside in the device mise the system integrity. The detailed rules to submit domain (such as Xen) or in the virtual machine monitor such user-customized VM images can be dependent on for other VM environments like VMware ESX server. The various cluster administration policies and we will not communication between guest modules and backend module discuss them further in this paper. is achieved through the inter-VM communication schemes provided by the VM environment. After the initial setup • Physical Resources: These are the cluster comput- process, communication operations can be initiated directly ing nodes connected via high speed interconnects. Each from the user process. By removing the VMM from the of the physical nodes hosts a VMM and perhaps a priv- critical path of communication, VMM-bypass I/O is able to ileged host OS (such as Xen Dom0) to take charge of achieve near-native I/O performance in VM environments. the start-up and shut-down of the guest VMs. To avoid To achieve the second goal, improving the VM image man- unnecessary kernel noise and OS overhead, the host agement efficiency, our approach has three aspects: cus- OS/VMM may only provide limited services which are tomizing small kernels/OSes for HPC applications to reduce necessary to host VM management software and dif- the size of VM images that need to be distributed, devel- ferent backend daemons to enable physical device ac- oping fast and scalable distributing schemes for large-scale cesses from guest VMs. For HPC applications, a physi- clusters and VM image caching on computing nodes. Reduc- cal node will typically host no more VM instances than ing the VM image size is possible because the customized the number of CPUs/cores. This is different from the OS/kernel only needs to support a special type of HPC ap- requirements of other computing areas such as high- plications and thus needs very few services. We will discuss end data centers, where tens of guest VMs can be the detailed schemes in Section 5.3. hosted in a single physical box.

5.2 A Framework for VM-Based Computing • Management Module: The management module is the key part of our framework. It maintains the map- Physical Resources Computing Nodes ping between VMs and the physical resources. The Guest VMs management module receives user requests from the Front−End VMM front-end node and determines if there are enough idle VM Image Distribution/ VM instances that match the user requirements. If Submit Jobs/ User Generated Data VMs Instantiate VMs/ there are not enough VM instances available, the man- Execute Jobs agement module will query the VM image manger to Management Storage Nodes find the VM image matching the runtime requirements Module Control for the submitted user applications. This matching Query/ process can be similar to those found in previous re- User Submitted Update VMs VM Image search work on resource matching in Grid environ- Manager ments, such as Matchmaker in Condor [37]. The man- agement module will then schedule physical resources, Figure 5: Framework for Cluster with Virtual Ma- distribute the image, and instantiate the VMs on the chine Environment allocated computing nodes. A good scheduling method- ology will not only consider those merits that have Figure 5 illustrates the framework that we propose for been studied in current cluster environment [8], but a cluster running in virtual machine environments. The also take into account the cost of distributing and in- framework takes a similar approach as batch scheduling, stantiating VMs, where reuse of already instantiated which is a common paradigm for modern production clus- VMs is preferred. ter computing environments. We have a highly modularized • VM Image Manager: The VM image manager man- design so that we can focus on the overall framework and ages the pool of VM images, which are stored in the leave the detailed issues like resource matching, scheduling storage system. To accomplish this, VM image man- , etc., which may vary depending on different en- ager hosts a containing all the VM image vironments, to the implementations. information, such as OS/kernel type, special user li- The framework consists of the following five major parts: braries installed, etc. It provides the management • Front-end Nodes: End users submit batch job re- module with information so correct VM images are selected to match the user requirements. The VM support VM management software and the physical device images can be created by system administrator, who accesses. And customized OSes in DomUs are providing the understands the requirements for the most commonly “minimum” services needed for HPC applications. Thus the used high performance computing applications, or VM memory footprints can be reduced to a small value. Simply images can also be submitted by users if their applica- by removing the unnecessary kernel options and OS services tions have special requirements. for HPC applications, we are able to reduce the memory usage of the guest OS in DomU to around 23MB when the • Storage: The performance of the storage system as OS is idle, more than two-thirds of reduction compared with well as the cluster file system will be critical in the running normal Linux AS4 OS. We believe it can be further framework. VM images, which are stored in the stor- reduced if more careful tuning is carried out here. age nodes, may need to be transferred to the comput- ing nodes during runtime, utilizing the underlying file 5.3.2 Reducing VM Image Management Cost system. To reduce the management overhead, high As previously mentioned, we take three approaches to performance storage and file systems are desirable. minimize the VM image management overhead: making the virtual machine images as small as possible, developing scal- 5.3 An InfiniBand Cluster with Xen VM En- able schemes to distribute the VM images to different com- vironment puting nodes, and VM image caching. To further demonstrate our ideas of VM-based comput- Minimizing VM Images: In literature, there are lots ing, we present a case of an InfiniBand cluster with Xen of research efforts on customized OSes [14, 20, 22, 4], whose virtual machine environments. Our prototype implementa- purpose is to develop infrastructures for parallel resource tion does not address the problems of resource matching and management and to enhance the OS for the ability to sup- node scheduling, which we believe have been well studied in port systems with very large number of processors. In our literature. Instead, we focus on reducing the virtualization Xen-based computing environment, our VM images include overhead and the VM image management cost. customized OS environments as thin as just one contain- ing an MPI , a very small set of system utilities, and 5.3.1 Reducing Virtualization Overhead some runtime environmental requirements such as ssh dae- mon. Such customized VM image can be very small and can Xen Device Domain Xen Guest Domain be transferred through network and boot-up very efficiently. User−level IB Provider Fast and Scalable Image Distribution: HPC applica- Core IB Module tions will typically require multiple computing nodes. This requires VM images to be broadcasted to multiple physical Back−end Daemon Virtual HCA Provider nodes where the VMs will be instantiated. Even though a Core IB Module VMM−Bypass Direct Access small VM image can be transferred very efficiently between HCA Provider the storage nodes and the computing nodes, we need a scal- able VM image distribution scheme for large-scale clusters HCA Hardware where jobs can include many nodes. For this purpose, we build a VM distribution module to broadcast the images to Figure 6: Xen-IB Architecture: Virtualizing Infini- destination physical nodes through a binomial tree struc- Band with VMM-bypass I/O ture, which significantly speeds the VM image distribution. VM Image Caching: We also use VM image caching on computing nodes to accelerate the time to launch a user As mentioned in Section 5.1, I/O virtualization overhead request. The main idea is to cache the VM image on the lo- can be significantly reduced with VMM-bypass I/O tech- cal storage of the computing nodes, provided there is avail- niques. Xen-IB [18] is a VMM-bypass I/O prototype imple- able space. In this case, we do not need to transfer the VM mentation we developed to virtualize the InfiniBand device image to a computing node if the VM has not been updated for Xen. Figure 6 illustrates the general design of Xen-IB. since the last time it was instantiated on that computing Xen-IB involves the device domain for privileged InfiniBand node. We will continue to work on more advanced schemes operations such as initializing HCA, creating queue pair and to transfer only the updated portion of the VM image so the completion queues, etc. However, Xen-IB is able to expose transfer will only include the difference, likely much smaller the User Access Regions to guest VMs and allow direct DMA than the whole VM image. All these caching schemes will operations. Thus, Xen-IB executes time critical operations require additional coordination between the management such as posting work requests and polling for completion module and the VM image manager to maintain consistency natively without sacrificing the system integrity. With the between different versions of VM images. very low I/O virtualization overhead through Xen-IB, HPC applications have the potential to achieve near-native per- formance in VM-based computing environments. 6. EVALUATION Additional memory consumption is also a part of virtual- In this section, we evaluate the performance overhead and ization overhead. In the Xen virtual machine environment, the VM management efficiency of an InfiniBand cluster with the memory overhead mainly consists of the memory foot- Xen VM environments. We first provide MPI-level perfor- prints of three components: the Xen VMM, the privileged mance results at both the micro-benchmark and application domain, and the operating systems of the guest domains. levels. Then, we present different efficiency metrics for VM The memory footprints of the virtual machine monitor is image management, including image distribution time, ad- examined in detail in [6], which can be as small as 20KB ditional memory consumption for the guest VMs, and time of state per guest domain. The OS in Dom0 needs only needed for startup/shutdown of a VM. 6.1 Experimental Setup 900 Xen We conducted our performance evaluations on an eight- 800 Native node InfiniBand cluster. Each node in the cluster is equipped 700 with dual Xeon 3.0GHz CPUs, 2 GB of memory and 600 a Mellanox MT23108 PCI-X InfiniBand HCA. The systems 500 are connected with an InfiniScale InfiniBand switch. The 400 same cluster is used to obtain performance results for both 300 the VM-based environment and the native, non-virtualized 200 100 environment. In VM-based environments, we use Xen 3.0. Throughput (MillionBytes/s) 0 The Xen domain0 hosts RedHat AS4 with kernel 2.6.12 with 1 4 16 64 256 1k 4k 16k64K256k1M 256MB memory. All user domains are running with a single Message Size (Bytes) virtual CPU and 896 MB memory, which allows two Do- mUs per physical machine. Each guest OS in DomUs uses Figure 8: MPI bandwidth test the 2.6.12 kernel with all unnecessary services removed. The OS is derived from ttylinux [36], with minimum changes in order to host MPI applications as mentioned in Section 5.3. previous work [18] revealed that buffer registration in Xen In the native environment, we also use RedHat AS4 but with user domain is more costly, as illustrated in Figure 9. It is SMP mode. because registration is a privileged operation and needs to Our performance results are collected using MVAPICH, a be conducted in the device domain. popular MPI implementation over InfiniBand [23, 19]. 600 Xen 6.2 Micro-benchmark Evaluations Native 500 In this section, we compare MPI-level micro-benchmark performance between the Xen-based cluster and the native 400 InfiniBand cluster. Tests were conducted between two user 300 domains running on two different physical machines with Xen or between two different nodes in native environments. 200 The latency test repeated ping-pong communication for many times and the average half round-trip time is reported Registration Time (us) 100 as one-way latency. As shown in Figure 7, there is very lit- 0 tle difference between the Xen and the native environment, 0 200 400 600 800 1000 Number of pages with both achieving 4.9µs for 1 byte messages. This shows the benefit of VMM-bypass I/O, where all communication Figure 9: Memory registration time operations in Xen are completed by directly accessing the HCAs from the user domains.

30 4500 Xen Xen Native 4000 Native 25 3500 20 3000 2500 15 2000

Latency (us) 1500

Latency (us) 10 1000 5 500 0 0 1 2 4 8 16 32 64 128 256 5121k 2k 4k 8k 32k 64K 128k 256k 512k 1M 2M Message Size (Bytes) Message Size (Bytes)

Figure 7: MPI latency test Figure 10: MPI latency test without registration cache (large messages) In the bandwidth tests, a sender sent a number of mes- sages to a receiver using MPI non-blocking communication To show the worst-case scenario in the Xen-based environ- primitives. It then waited for an acknowledgment and re- ment, we repeated the above tests with registration cache ported bandwidth by dividing number of bytes transferred disabled. We observed no difference for short message la- by the elapsed time. We again see almost no difference be- tency and bandwidth because MVAPICH copies the short tween Xen and native environments. As shown in Figure 8, messages to pre-registered buffers and sends them out. For in both cases we achieved 880MB/sec. (Please note that all large messages (from 16KB), MVAPICH registers the user bandwidth numbers in this paper are reported in millions of buffer on the fly and send out the message using RDMA bytes per second.) without extra copy. Because of the extra registration cost, MVAPICH implements a registration cache, which allows we observe in Figure 10 that large message latencies in Xen zero-copy transfer for large messages without memory reg- environment is constantly around 200µs higher, which con- istration if the communication buffer is reused [19]. The forms to what we observed in Figure 9 since both the sender above tests benefited from this technique. However, our and the receiver side need to register the communication 900 Xen 800 Native 700 600 500 400 300 200 100 Throughput (MillionBytes/s) 0 1 4 16 64 256 1k 4k 16k64K256k1M Message Size (Bytes)

Figure 11: MPI bandwidth test without registration cache Figure 13: HPL on 2, 4, 8 and 16 processes

Dom0 Xen DomUs buffers. For bandwidth test, registration time can be over- IS 3.6 1.9 94.5 lapped with network communication. Thus in Figure 11 we SP 0.3 0.1 99.6 see only a maximum 11% difference between the Xen and BT 0.4 0.2 99.4 native environment for medium size messages (16KB), and EP 0.6 0.3 99.3 virtually no difference for large messages. Again, this is CG 0.6 0.3 99.0 the worst case scenario for Xen-based environments. Actual LU 0.6 0.3 99.0 HPC applications usually have good buffer reuse patterns FT 1.6 0.5 97.9 and in most cases we will not see this difference. MG 1.8 1.0 97.3

6.3 HPC Application Evaluations Table 2: Distribution of execution time for NAS In this section, we evaluate the Xen-based computing en- the same physical node can be added by taking advantage vironment with actual HPC applications. We instantiate of the page sharing mechanism provided by Xen. two DomUs on each physical machine and run one pro- We also report the Gflops achieved in HPL on 2, 4, 8 and cess per DomU for the Xen case. (Each DomU runs a uni- 16 processes in Figure 13. We observe very close perfor- processor kernel.) And for native environment, we run two mance here with native case outperforms at most 1%. processes per physical node. Table 2 shows the distribution of the execution time among We evaluate both NAS parallel benchmarks and High Per- Dom0, Xen hypervisor, and two DomUs (sum) for NAS formance Linpack (HPL). HPL is the parallel implementa- benchmarks. As we can see, due to VMM-bypass approach, tion of Linpack [28] and the performance measure for rank- most of the instructions are executed locally in user do- ing the computer systems of Top 500 supercomputer list. mains, which achieves highly efficient computing. For IS, the time taken by Dom0 and the Xen hypervisor is slightly higher. This is because IS has very short running time and Dom0/hypervisor are mainly involved during applica- tion startup/finalize, where all the connections need to be set up and communication buffers need to be pre-registered. 6.4 VM Image Management Evaluation In this section, we evaluate the VM image management overhead for the Xen-based environment. Table 3 shows the time to startup and shutdown a do- main in Xen environments as well as the extra memory con- sumed by the OSes (after a fresh startup) running in each user domain. As we can see, managing a guest domain with a customized OS/kernel (ttylinux) has much less overhead Figure 12: NAS Parallel Benchmarks (16 processes, compared with a guest domain with normal full featured OS class B) (AS4). Additionally, the extra memory consumption is also reduced to around 23.6 MB for ttylinux, which is about 1.2% The execution time for NAS has been normalized based of the total memory we have on each physical node. We also on the native environment in Figure 12. We observed that include the startup/shutdown time and memory consump- Xen-based environment performs comparably with the na- tion for RedHat AS4 on native servers as a reference. We can tive environment. For NAS applications CG and FT, where clearly see that starting a VM is much faster than starting native environment performs around 4% better, the gap is an physical server. due to the ability of MVAPICH to utilize We also measure the VM image distribution time. As we communication for processes on the same node in the na- are using customized OS for the guest VMs, we were able to tive environment. Our prototype implementation of VM- reduce the VM image size to around 30MB. We distribute based computing currently does not have this feature. How- the VM image to the physical nodes through a binomial ever, high speed communication between Xen domains on tree topology over IPoIB (IP over InfiniBand [15]). As we startup shutdown memory care of in Dom0, it is important to understand the memory ttylinux-domu 5.3s 5.0s 23.6MB requirements for Dom0 for hosting different applications in AS4-domu 24.1s 13.2s 77.1MB the user domains. We plan to carry more studies along this AS4-native 58.9s 18.4s 90.0MB direction in future.

Table 3: Startup, shutdown time and extra memory consumption 8. RELATED WORK Scheme 1 2 4 8 Currently, many VM designs exploit multiple VMs in cluster- Binomial tree 1.3s 2.8s 3.7s 5.0s like environments. For example, the Xenoserver project [32] builds a VM-based infrastructure for wide-area distributed NFS 4.1s 6.2s 12.1s 16.1s computing. The vMatrix [5] project proposed the idea of Table 4: VM image distribution time using a network of virtual machine monitors for dynamic content distribution. However, none of the above address can see in Table 4, the binomial tree distribution the problem of using VMs for high performance computing. can distribute the image to 8 physical nodes within 5 sec- In [11], the authors advocate an approach of using VMs for onds. While using normal NFS, the distribution process . However, the hardware environments for needs around 16 seconds. Since the distribution time with grid computing are usually loosely-coupled, wide-area sys- the binomial tree algorithm increases proportional to logN, tems. Therefore, they did not explicitly address the issues of where N is the number of destination nodes, we believe our I/O performance degradation due to virtualization. In our VM image distribution scales for very large size clusters. work, we focus on traditional high performance computing Note that the VM image distribution occurs only if VM environments, which are more tightly-coupled and have very image is not cached or has been updated, in practice, the stringent requirements for communication performance. distribution time can be even smaller due to caching. We first presented the idea of VMM-bypass in [18]. VMM- bypass extends the idea of OS-bypass to VM environments. 7. DISCUSSION OS-bypass is a feature found in user-level communication In this section we discuss several issues with our current protocols such as active messages [39] and U-Net [38]. It prototype of a virtualized InfiniBand cluster and how they was later adopted by InfiniBand. can be addressed in future. Virtual machine OS image distribution and start-up are important issues in any VM-based environment. However, 7.1 Checkpoint and Migration our use of VMs in cluster-based HPC systems introduces new challenges because these systems can be very large scale, Although our framework can take advantage of VM check- which requires a highly scalable approach. Complementary point and migration, the VMM-bypass approach that we to our work, scalable job launching in traditional HPC en- propose to reduce the I/O virtualization overhead poses vironments has been discussed in studies such as [42] and additional challenges. With VMM-bypass I/O, the VMM [7]. is not involved in I/O operations, and thus cannot easily High level - and language-based virtual ma- suspend and buffer those operations when checkpoint or chines have been studied and used for high performance migration starts. Further, high-speed interconnects where computing. Examples include PVM [13], HPVM [1], and VMM-bypass approaches are applicable usually store part Java. In our work, virtual machines refer to those which of the communication context on board. For example, In- provide abstractions identical or similar to existing hard- finiBand HCAs will store information such as created queue- ware. Therefore, our VM-based platform is implemented at pair structure, a translation table between virtual address a much lower level and provides more flexibility. For exam- and actual PCI address, etc. on the device. This makes ple, our approach can not only support MPI, but also all checkpointing or migration even more difficult since part the middleware- and language-based virtual machines men- of the VM status is not accessible from the VMM. To ad- tioned above. dress these issues, we plan to involve guest virtual machines, which have the knowledge of the communication status, in the checkpoint and migration process. 9. CONCLUSIONS AND FUTURE WORK In this paper, we proposed a framework for VM-based 7.2 Memory Consumption of the Privileged computing for HPC applications. We presented a case for Domain building an InfiniBand cluster with the Xen virtual machine The Xen virtual machine environment needs a privileged environment. We explained how we reduced I/O virtual- domain (Dom0) running on each machine to manage other ization overhead with VMM-bypass I/O and addressed the user domains. Xen must allocate memory for the privileged efficiency for management operations in such VM-based en- domain at the boot time of the physical machine. At our vironments. Our performance evaluation showed that HPC current stage of research, we are focusing on customizing applications can achieve almost the same performance as guest OS/kernels in user domains. Thus we just statically those running in a native, non-virtualized environment. We allocate 256 MB for Dom0, which is hosting a normal Red- also demonstrated quantitatively that other costs of virtual- Hat AS4 OS. Since this part of memory is used for Dom0, ization, such as extra memory consumption and VM image it cannot be used for actual computing applications. How- management, can be reduced significantly by optimization. ever, since Dom0 is hardly involved in HPC applications Our approach can potentially benefit HPC applications with especially with VMM-bypass I/O, this overhead can be sig- desirable features of virtual machine technologies, including nificantly reduced by careful tuning of the host OS. Mean- ease of management, OS customization, performance isola- while, since privileged operations will still need to be taken tion, check-pointing, and migration, with very little perfor- mance degradation. [16] InfiniBand Trade Association. InfiniBand Architecture In future, we will work on checkpoint/migration support Specification, Release 1.2. for Xen-IB, our VMM-bypass I/O implementation to virtu- [17] K. Koch. How does ASCI Actually Complete Multi-month alize InfiniBand in the Xen virtual environment. We plan 1000-processor Milestone Simulations? In Proceedings of the Conference on High Speed Computing, 2002. to investigate scheduling and resource management schemes [18] J. Liu, W. Huang, B. Abali, and D. K. Panda. High for the efficient management in VM-based clusters. We will Performance VMM-Bypass I/O in Virtual Machines. In also carry out more detailed performance evaluations of VM- Proceedings of USENIX ’06, Boston, MA, 2006. based computing on clusters with larger numbers of nodes. [19] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In Proceedings of ICS ’03, June 2003. Acknowledgments [20] B. Maccabe, P. G. Bridges, R. Brightwell, R. Riesen, and We would like to thank Matthew J. Koop from the Network Based T. Hudson. Highly Configurable Operating Systems for Computing Laboratory at the Ohio State University for the valu- Ultrascale Systems. In Proceedings of the First able discussions to improve this paper. We also thank the anony- International Workshop on Operating Systems, mous reviewers for their insightful comments. Programming Environments and Management Tools for This research is supported in part by the following grants and High-Performance Computing on Clusters, 2004. equipment donations to the Ohio State University: Department of [21] Mellanox Technologies. http://www.mellanox.com. Energy’s Grant #DE-FC02-01ER25506; National Science Foun- [22] MOLAR: Modular Linux and Adaptive Runtime Support dation grants #CNS-0403342 and #CCR-0509452; grants from for High-end Computing Operating and Runtime Systems. http://forge-fre.ornl.gov/molar/. Intel, Mellanox, Sun, Cisco, and Linux Networx; and equipment [23] MVAPICH Project Website. http://nowlab.cse.ohio- donations from Apple, AMD, Dell, IBM, Intel, Microway, Path- state.edu/projects/mpi-iba/index.html. scale, Silverstorm and Sun. [24] Myricom, Inc. Myrinet. http://www.myri.com. [25] NASA. NAS Parallel Benchmarks. 10. REFERENCES http://www.nas.nasa.gov/Software/NPB/. [1] A. Chien et al. Design and Evaluation of an HPVM-Based [26] Open InfiniBand Alliance. http://www.openib.org. Windows NT Supercomputer. The International Journal of [27] OProfile. http://oprofile.sourceforge.net. High Performance Computing Applications, 13(3):201–219, [28] A. Petitet, R. . Whaley, J. Dongarra, and A. Cleary. HPL Fall 1999. - A Portable Implementation of the High-Performance [2] A. Menon et al. Diagnosing Performance Overheads in the Linpack Benchmark for Distributed-Memory . Xen Virtual Machine Environment. In Proceedings of the http://www.netlib.org/benchmark/hpl/. First ACM/USENIX Conference on Virtual Execution [29] F. Petrini, D. J. Kerbyson, and S. Pakin. The Case of the Environments (VEE’05), June 2005. Missing Supercomputer Performance: Achieving Optimal [3] Argonne National Laboratory. Performance on the 8,192 Processors of ASCI Q. In http://www-unix.mcs.anl.gov/mpi/mpich/. Proceedings of SC ’03, Washington, DC, USA, 2003. [4] Argonne National Laboratory. Zeptoos: The small linux for [30] I. Pratt. Xen Virtualization. Linux World 2005 big computers. http://www-unix.mcs.anl.gov/zeptoos/. Virtualization BOF Presentation. [5] A. Awadallah and M. Rosenblum. The vMatrix: A network [31] Quadrics, Ltd. QsNet. http://www.quadrics.com. of virtual machine monitors for dynamic content [32] D. Reed, I. Pratt, P. Menage, S. Early, and N. Stratford. distribution. In Seventh International Workshop on Web Xenoservers: Accountable Execution of Untrusted Content Caching and Distribution, 2002. Programs. In Workshop on Hot Topics in Operating [6] B. Dragovic et al. Xen and the Art of Virtualization. In Systems, 1999. Proceedings of the ACM Symposium on Operating Systems [33] M. Rosenblum and T. Garfinkel. Virtual Machine Monitors: Principles, Oct. 2003. Current Technology and Future Trends. IEEE Computer, [7] R. Brightwell and L. A. Fisk. Scalable Parallel Application 38(5):39–47, 2005. Launch on Cplant. In Proceedings of SC ’01, 2001. [34] J. Sugerman, G. Venkitachalam, and B. H. Lim. [8] B. Chun and D. Culler. User-centric Performance Analysis Virtualizing I/O Devices on VMware ’s Hosted of Market-based Cluster Batch Schedulers. In Proceedings Virtual Machine Monitor. In Proceedings of USENIX, 2001. of CCGrid, 2002. [35] Top 500 Supercomputer Site. http://www.top500.com. [9] R. J. Creasy. The Origin of the VM/370 Time-sharing [36] ttylinux. http://www.minimalinux.org/. System. IBM Journal of Research and Development, [37] University of Wisconsin. Condor High Throughput 25(5):483–490, 1981. Computing. http://www.cs.wisc.edu/condor/. [10] FastOS: Forum to Address Scalable Technology for runtime [38] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A and Operating Systems. http://www.cs.unm.edu/ fastos/. User-level Network Interface for Parallel and Distributed [11] R. Figueiredo, P. Dinda, and J. Fortes. A Case for Grid Computing. In ACM Symposium on Operating Systems Computing on Virtual Machines. In Proceedings of Principles, 1995. International Conference on [39] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Systems (ICDCS), May 2003., 2003. Schauser. Active Messages: A Mechanism for Integrated [12] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. W. and Communication and Computation. In International M. Williamson. Safe Hardware Access with the Xen Virtual Symposium on , 1992. Machine Monitor. In Proceedings of OASIS ASPLOS [40] C. Waldspurger. Memory resource management in VMware Workshop, 2004. ESX server. In Proceedings of the Fifth Symposium on [13] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, Operating Systems Design and Implementation, 2002. and V. S. Sunderam. PVM: : A [41] A. Whitaker, M. Shaw, and S. D. Gribble. Denali: Users’ Guide and Tutorial for Networked Parallel Lightweight Virtual Machines for Distributed and Computing. MIT Press, Cambridge, MA, USA, 1994. Networked Applications. Technical report, University of [14] HPC-Colony Project: Services and Interfaces for Very Washington, February 2002. Large Linux Clusters. http://www.hpc-colony.org/. [42] W. Yu, J. Wu, and D. K. Panda. Fast and Scalable Startup [15] IETF IPoIB Workgroup. of MPI Programs In InfiniBand Clusters. In Proceedings of http://www.ietf.org/html.charters/ipoib-charter.html. HiPC’04, Banglore, Inida, December 2004.