Diagnosing Performance Overheads in the Xen Virtual Machine Environment
Aravind Menon* (EPFL, Lausanne), aravind.menon@epfl.ch
Jose Renato Santos (HP Labs, Palo Alto), [email protected]
Yoshio Turner (HP Labs, Palo Alto), [email protected]
G. (John) Janakiraman (HP Labs, Palo Alto), [email protected]
Willy Zwaenepoel (EPFL, Lausanne), willy.zwaenepoel@epfl.ch

* Work done during internship at HP Labs and then at EPFL.

ABSTRACT
Virtual Machine (VM) environments (e.g., VMware and Xen) are experiencing a resurgence of interest for diverse uses including server consolidation and shared hosting. An application's performance in a virtual machine environment can differ markedly from its performance in a non-virtualized environment because of interactions with the underlying virtual machine monitor and other virtual machines. However, few tools are currently available to help debug performance problems in virtual machine environments.

In this paper, we present Xenoprof, a system-wide statistical profiling toolkit implemented for the Xen virtual machine environment. The toolkit enables coordinated profiling of multiple VMs in a system to obtain the distribution of hardware events such as clock cycles and cache and TLB misses.

We use our toolkit to analyze performance overheads incurred by networking applications running in Xen VMs. We focus on networking applications since virtualizing network I/O devices is relatively expensive. Our experimental results quantify Xen's performance overheads for network I/O device virtualization in uni- and multi-processor systems. Our results identify the main sources of this overhead, which should be the focus of Xen optimization efforts. We also show how our profiling toolkit was used to uncover and resolve performance bugs that we encountered in our experiments and that caused unexpected application behavior.

Categories and Subject Descriptors
D.4.8 [Operating Systems]: Performance–Measurements; D.2.8 [Software Engineering]: Metrics–Performance measures

General Terms
Measurement, Performance

Keywords
Virtual machine monitors, performance analysis, statistical profiling

1. INTRODUCTION
Virtual Machine (VM) environments are experiencing a resurgence of interest for diverse uses including server consolidation and shared hosting. The emergence of commercial virtual machines for commodity platforms [7], of paravirtualizing open-source virtual machine monitors (VMMs) such as Xen [4], and of widespread support for virtualization in microprocessors [13] will boost this trend toward using virtual machines in production systems for server applications.

An application's performance in a virtual machine environment can differ markedly from its performance in a non-virtualized environment because of interactions with the underlying VMM and other virtual machines. The growing popularity of virtual machine environments motivates deeper investigation of the performance implications of virtualization. However, few tools are currently available to analyze performance problems in these environments.

In this paper, we present Xenoprof, a system-wide statistical profiling toolkit implemented for the Xen virtual machine environment. The Xenoprof toolkit supports system-wide coordinated profiling in a Xen environment to obtain the distribution of hardware events such as clock cycles, instruction executions, and TLB and cache misses.
Xenoprof allows profiling of concurrently executing domains (Xen uses the term "domain" to refer to a virtual machine) and of the Xen VMM itself, at the fine granularity of individual processes and routines executing either in a domain or in the Xen VMM. Xenoprof will facilitate a better understanding of the performance characteristics of Xen's mechanisms and thereby help advance efforts to optimize the implementation of Xen. Xenoprof is modeled on the OProfile [1] profiling tool available on Linux systems.

We report on the use of Xenoprof to analyze performance overheads incurred by networking applications running in Xen domains. We focus on networking applications because of the prevalence of important network-intensive applications, such as web servers, and because virtualizing network I/O devices is relatively expensive. Previously published papers evaluating Xen's performance have used networking benchmarks in systems with limited network bandwidth and high CPU capacity [4, 10]. Since this makes the network, rather than the CPU, the bottleneck resource, the overhead of network device virtualization does not manifest as reduced throughput in those studies. In contrast, our analysis examines cases where throughput degrades because CPU processing, rather than the network, is the bottleneck. Our experimental results quantify Xen's performance overheads for network I/O device virtualization in uni- and multi-processor systems. Our results identify the main sources of this overhead, which should be the focus of Xen optimization efforts. For example, Xen's current I/O virtualization implementation can degrade performance by preventing the use of network hardware offload capabilities such as TCP segmentation and checksum offload.

We additionally show a small case study that demonstrates the utility of our profiling toolkit for debugging real performance bugs we encountered during some of our experiments. The examples illustrate how unforeseen interactions between an application and the VMM can lead to strange performance anomalies.

The rest of the paper is organized as follows. We begin in Section 2 with an example case study motivating the need for performance analysis tools in virtual machine environments. In Section 3 we describe aspects of Xen and OProfile as background for our work. Section 4 describes the design of Xenoprof. Section 5 shows how the toolkit can be used for performance debugging of the motivating example of Section 2. In Section 6, we describe our analysis of the performance overheads for network I/O device virtualization. Section 7 describes related work, and we conclude with a discussion in Section 8.

2. MOTIVATING EXAMPLE
Early in our experimentation with Xen, we encountered a situation in which a simple networking application that we wrote exhibited puzzling behavior when running in a Xen domain. We present this example in part to illustrate how the behavior of an application running in a virtual machine environment can differ markedly, and in surprising ways, from its behavior in a non-virtualized system. We also use the example to demonstrate the utility of Xenoprof, which, as we describe in Section 5, helped us diagnose and resolve this problem.

The example application consists of two processes, a sender and a receiver, running on two Pentium III machines connected through a gigabit network switch. The two processes are connected by a single TCP connection, with the sender continuously sending a 64 KB file (cached in the sender's memory) to the receiver using the zero-copy sendfile() system call. The receiver simply issues receives for the data as fast as possible.

In our experiment, we run the sender on Linux (non-virtualized), and we run the receiver either on Linux or in a Xen domain. We measure the receiver throughput as we vary the size of the user-level buffer that the application passes to the socket receive system call. Note that we are not varying the receiver socket buffer size, which limits kernel memory usage.
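To make the setup concrete, the following C sketch shows one way such a sender/receiver pair could be written. It is an illustrative reconstruction, not the code used in the paper: the command-line interface, port handling, and error handling are our assumptions; only the essentials, a sendfile() loop on the sender and a recv() loop with a configurable application-level buffer on the receiver, mirror the description above.

/*
 * Minimal sketch of the sender/receiver micro-benchmark described above
 * (illustrative only, not the authors' code).  Usage:
 *
 *   receiver:  ./netbench recv <port> <app-buffer-bytes>
 *   sender:    ./netbench send <receiver-host> <port> <file-to-send>
 *
 * The sender loops over sendfile() on a cached file (e.g., 64 KB); the
 * receiver loops over recv() into a user-level buffer whose size is the
 * parameter varied on the x-axis of Figure 1.
 */
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

static void run_sender(const char *host, const char *port, const char *path)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    if (getaddrinfo(host, port, &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        exit(1);
    }
    int s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (s < 0 || connect(s, res->ai_addr, res->ai_addrlen) < 0) die("connect");
    freeaddrinfo(res);

    int fd = open(path, O_RDONLY);          /* file stays in the page cache */
    if (fd < 0) die("open");
    struct stat st;
    if (fstat(fd, &st) < 0) die("fstat");

    for (;;) {                               /* send the file over and over */
        off_t off = 0;
        while (off < st.st_size)
            if (sendfile(s, fd, &off, (size_t)(st.st_size - off)) < 0)
                die("sendfile");
    }
}

static void run_receiver(const char *port, size_t bufsz)
{
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    if (ls < 0) die("socket");
    int one = 1;
    setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons((uint16_t)atoi(port));
    if (bind(ls, (struct sockaddr *)&addr, sizeof(addr)) < 0) die("bind");
    if (listen(ls, 1) < 0) die("listen");

    int s = accept(ls, NULL, NULL);
    if (s < 0) die("accept");

    char *buf = malloc(bufsz);               /* application-level receive buffer */
    if (!buf) die("malloc");

    long long total = 0;
    for (;;) {                               /* receive as fast as possible */
        ssize_t n = recv(s, buf, bufsz, 0);
        if (n <= 0) break;
        total += n;                          /* divide by elapsed time for Mb/s */
    }
    fprintf(stderr, "received %lld bytes\n", total);
    free(buf);
    close(s);
    close(ls);
}

int main(int argc, char **argv)
{
    if (argc == 4 && strcmp(argv[1], "recv") == 0)
        run_receiver(argv[2], (size_t)atol(argv[3]));
    else if (argc == 5 && strcmp(argv[1], "send") == 0)
        run_sender(argv[2], argv[3], argv[4]);
    else
        fprintf(stderr, "usage: %s recv <port> <bufsz> | send <host> <port> <file>\n",
                argv[0]);
    return 0;
}

Sweeping the receiver's buffer-size argument and timing the receive loop would yield curves like those in Figure 1; the same receiver binary runs unmodified whether the receiving OS is native Linux or XenoLinux inside a domain.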
[Figure 1: XenoLinux performance anomaly: network receive throughput in XenoLinux shows erratic behavior with varying application buffer size. The plot shows receive throughput (Mb/s) versus application buffer size (KB) for Linux and XenoLinux receivers.]

When the receiver runs on Linux, throughput is insensitive to the application receive buffer size. In contrast, throughput is highly sensitive to the application buffer size when the receiver runs in the Xen environment. This is a prime example of a scenario where we would like to have a performance analysis tool that can pinpoint the source of unexpected behavior in Xen. Although system-wide profiling tools such as OProfile exist for single operating systems like Linux, no such tools are available for virtual machine environments. Xenoprof is intended to fill this gap.

3. BACKGROUND
In this section, we briefly describe the Xen virtual machine monitor and the OProfile tool for statistical profiling on Linux. Our description of their architecture is limited to the aspects that are relevant to the discussion of our statistical profiling tool for Xen environments and to the discussion of Xen's performance characteristics. We refer the reader to [4] and [1] for more comprehensive descriptions of Xen and OProfile, respectively.

3.1 Xen
Xen is an open-source x86 virtual machine monitor (VMM) that allows multiple operating system (OS) instances to run concurrently on a single physical machine. Xen uses paravirtualization [25], where the VMM exposes a virtual machine abstraction that is slightly different from the underlying hardware. In particular, Xen exposes a hypercall mechanism that virtual machines must use to perform privileged operations (e.g., installing a page table), an event notification mechanism to deliver virtual interrupts and other notifications to virtual machines, and a shared-memory based device channel for transferring I/O messages among virtual machines. While the OS must be ported to this virtual machine interface, the approach leads to better performance than approaches based on pure virtualization.

The latest version of the Xen architecture [10] introduces a new I/O model, where special privileged virtual machines called driver domains own specific hardware devices and run