vIC: Interrupt Coalescing for Virtual Machine Storage Device IO


Irfan Ahmad    Ajay Gulati    Ali Mashtizadeh
{irfan, agulati, ali}@vmware.com
VMware, Inc., Palo Alto, CA 94304

Abstract

Interrupt coalescing is a well known and proven technique for reducing CPU utilization when processing high IO rates in network and storage controllers. Virtualization introduces a layer of virtual hardware for the guest operating system, whose interrupt rate can be controlled by the hypervisor. Unfortunately, existing techniques based on high-resolution timers are not practical for virtual devices, due to their large overhead. In this paper, we present the design and implementation of a virtual interrupt coalescing (vIC) scheme for virtual SCSI hardware controllers in a hypervisor.

We use the number of commands in flight from the guest as well as the current IO rate to dynamically set the degree of interrupt coalescing. Compared to existing techniques in hardware, our work does not rely on high-resolution interrupt-delay timers and thus leads to a very efficient implementation in a hypervisor. Furthermore, our technique is generic and therefore applicable to all types of hardware storage IO controllers which, unlike networking, don't receive anonymous traffic. We also propose an optimization to reduce inter-processor interrupts (IPIs), resulting in better application performance during periods of high IO activity. Our implementation of virtual interrupt coalescing has been shipping with VMware ESX since 2009. We present our evaluation showing performance improvements in micro benchmarks of up to 18% and in TPC-C of up to 5%.

1 Introduction

The performance overhead of virtualization has decreased steadily in the last decade due to improved hardware support for hypervisors. This and other storage device optimizations have led to increasing deployments of IO intensive applications on virtualized hosts. Many important enterprise applications today exhibit high IO rates. For example, transaction processing loads can issue hundreds of very small IO operations in parallel, resulting in tens of thousands of IOs per second (IOPS). Such high IOPS are now within reach of even more IT organizations with faster storage controllers, wider adoption of solid-state disks (SSDs) as a front-end tier in storage arrays and increasing deployments of high performance consolidated storage devices using Storage Area Network (SAN) or Network-Attached Storage (NAS) protocols.

For high IO rates, the CPU overhead for handling all the interrupts can get very high and eventually lead to a lack of CPU resources for the application itself [7, 14]. CPU overhead is even more of a problem in virtualization scenarios where we are trying to consolidate as many virtual machines into one physical box as possible. Freeing up CPU resources from one virtual machine (VM) will improve the performance of other VMs on the same host. Traditionally, interrupt coalescing or moderation has been used in network and storage controller cards to limit the number of times that application execution is interrupted by the device to handle IO completions. Such coalescing techniques have to carefully balance an increase in IO latency with the improved execution efficiency due to fewer interrupts.

In hardware controllers, fine-grained timers are used in conjunction with interrupt coalescing to keep an upper bound on the latency of IO completion notifications. Such timers are inefficient to use in a hypervisor and one has to resort to other pieces of information to avoid long delays. This problem is challenging for several other reasons, including the desire to maintain a small code size, thus keeping the trusted computing base to a manageable size. We treat the virtual machine workload as unmodifiable and as an opaque black box. We also assume, based on earlier work, that guest workloads can change their behavior very quickly [6, 10].

In this paper, we target the problem of coalescing interrupts for virtual devices without assuming any support from hardware controllers and without using high resolution timers. Traditionally, there are two parameters that need to be balanced: maximum interrupt delivery latency (MIDL) and maximum coalesce count (MCC). The first parameter denotes the maximum time to wait before sending the interrupt and the second parameter denotes the number of accumulated completions before sending an interrupt to the operating system (OS). The OS is interrupted based on whichever parameter is hit first.
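To make the two parameters concrete, here is a minimal sketch of the conventional, timer-assisted scheme used by hardware controllers; it is illustrative only, and all identifiers and threshold values are hypothetical.

```c
/*
 * A minimal sketch of conventional, timer-assisted interrupt
 * coalescing in a hardware controller (all identifiers hypothetical;
 * MIDL/MCC values are illustrative).
 */
#include <stdbool.h>
#include <stdint.h>

#define MIDL_US 100  /* maximum interrupt delivery latency, in us */
#define MCC       4  /* maximum coalesce count                    */

struct hw_coalesce {
    uint32_t pending;        /* completions not yet signaled        */
    uint64_t first_pending;  /* arrival time of the oldest one (us) */
};

/* Called per IO completion; true means raise the interrupt now. */
static bool hw_on_completion(struct hw_coalesce *s, uint64_t now_us)
{
    if (s->pending++ == 0)
        s->first_pending = now_us;

    /* Interrupt on whichever threshold is hit first: accumulated
       count (MCC) or elapsed delay (MIDL). */
    if (s->pending >= MCC || now_us - s->first_pending >= MIDL_US) {
        s->pending = 0;
        return true;
    }
    return false;
}
```

Note that if no further completion arrives, only a fine-grained timer can still enforce the MIDL bound; it is precisely this timer that is too expensive for a hypervisor to provide, which motivates the scheme proposed next.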
We propose a novel scheme to control both MIDL and MCC implicitly by setting the delivery ratio of interrupts based on the current number of commands in flight (CIF) from the guest OS and the overall IO completion rate. The ratio, denoted as R, is simply the number of virtual interrupts sent to the guest divided by the number of actual IO completions received by the hypervisor on behalf of that guest. Note that 0 < R ≤ 1. Lower values of the delivery ratio R denote a higher degree of coalescing. We increase R when CIF is low and decrease the delivery rate R for higher values of CIF.

The key insight in the paper is that, unlike network IO, CIF can be used directly for storage controllers because each request has a corresponding command in flight prior to completion. Also, based on the characteristics of storage devices, it is important to maintain a certain number of commands in flight to efficiently utilize the underlying storage device [9, 11, 23]. The benefits of command queuing are well known and concurrent IOs are used in most storage arrays to maintain high utilization.
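The following sketch shows one way the delivery ratio could be realized per completion; the CIF thresholds and the 1/R bookkeeping are illustrative assumptions, not the exact policy shipped in ESX.

```c
/*
 * A sketch of the CIF-driven delivery ratio R (illustrative only:
 * the thresholds and the 1/R bookkeeping below are assumptions,
 * not the exact ESX policy).
 *   R = virtual interrupts delivered / IO completions received.
 */
#include <stdbool.h>
#include <stdint.h>

struct vic_state {
    uint32_t cif;      /* commands in flight; incremented on the
                          (omitted) IO issue path                    */
    uint32_t skipped;  /* completions coalesced since last interrupt */
};

/*
 * Map CIF to 1/R: few commands in flight => deliver eagerly (R = 1);
 * many in flight => coalesce aggressively (small R).
 */
static uint32_t inverse_ratio(uint32_t cif)
{
    if (cif <= 4)  return 1;  /* R = 1   */
    if (cif <= 16) return 2;  /* R = 1/2 */
    if (cif <= 64) return 4;  /* R = 1/4 */
    return 8;                 /* R = 1/8 */
}

/*
 * Called once per IO completion; returns true when the virtual
 * interrupt should be fired.  A completion is delayed only until the
 * next one arrives, and CIF > 0 guarantees more are coming, so the
 * wait stays bounded without any high-resolution timer.
 */
static bool vic_on_completion(struct vic_state *s)
{
    s->cif--;
    if (++s->skipped >= inverse_ratio(s->cif)) {
        s->skipped = 0;
        return true;
    }
    return false;
}
```

The intuition behind the thresholds: low CIF suggests a synchronous workload whose next IO depends on this completion, so interrupts are delivered eagerly; high CIF means the device stays busy regardless, so completions can be batched safely.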
Another challenge in coalescing interrupts for storage IO requests is that many important applications issue synchronous IOs. Delaying the completion of prior IOs can delay the issue of future ones, so one has to be very careful about minimizing the latency increase.

Another problem we address is specific to hypervisors, where the host storage stack has to receive and process an IO completion before routing it to the issuing VM. The hypervisor may need to send inter-processor interrupts (IPIs) from the CPU that received the hardware interrupt to the remote CPU where the VM is running, for notification purposes. We provide an optimization to reduce the number of IPIs issued using the timestamp of the last interrupt that was sent to the guest OS. This reduces the overall number of IPIs while bounding the latency of notifying the guest OS about an IO completion.

We have implemented our virtual interrupt coalescing (vIC) techniques in the VMware ESX hypervisor [21], although they can be applied to any hypervisor, including type 1 and type 2, as well as hardware storage controllers. Experimentation with a set of micro benchmarks shows that vIC techniques can improve both workload throughput and CPU overheads related to IO processing by up to 18%. We also evaluated vIC against the TPC-C workload and found improvements of up to 5%. The vIC implementation discussed here is being used by thousands of customers in the currently shipping ESX version.

The next section presents background on the VMware ESX Server architecture and overall system model along with a more precise problem definition. Section 3 presents the design of our virtual interrupt coalescing mechanism along with a discussion of some practical concerns. An extensive evaluation of our implementation is presented in Section 4, followed by some lessons learned from our deployment experience in the real world in Section 5. Section 6 presents an overview of related work, followed by conclusions and directions for future work in Sections 7 and 8 respectively.

2 System Model

Our system model consists of two components in the VMware ESX hypervisor: the VMkernel and the virtual machine monitor (VMM). The VMkernel is a hypervisor kernel, a thin layer of software controlling access to physical resources among virtual machines. The VMkernel provides isolation and resource allocation among virtual machines running on top of it. The VMM is responsible for correct and efficient virtualization of the x86 instruction set architecture as well as emulation of high performance virtual devices. It is also the conceptual equivalent of a "process" to the ESX VMkernel. The VMM intercepts all the privileged operations from a VM, including IO, and handles them in cooperation with the VMkernel.

Figure 1 shows the ESX VMkernel executing storage stack code on the CPU on the right and an example VM running on top of its virtual machine monitor (VMM) on the left processor. In the figure, when an interrupt is received from a storage adapter (1), appropriate code in the VMkernel is executed to handle the IO completion (2) all the way up to the vSCSI subsystem, which narrows the IO to a specific VM. Each VMM shares a common memory area with the ESX VMkernel, where the VMkernel posts IO completions in a queue (3), following which it may issue an inter-processor interrupt, or IPI (4), to notify the VMM. The VMM can pick up the completions on its next execution (5) and process them (6), resulting finally in the virtual interrupt being fired (7).

Figure 1: Virtual Interrupt Delivery Mechanism.

Without explicit interrupt coalescing, the VMM always asserts the level-triggered interrupt line for every IO. Level-triggered lines do some implicit coalescing already, but that only helps if two IOs are completed back-to-back in the very short time window before the guest interrupt service routine has had the chance to deassert the line.

Only the VMM can assert the virtual interrupt line and it is possible after step 3 that the VMM may not get a chance to execute for a while. To limit any latency implications of a VM not entering into the VMM, the
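The notification path of Figure 1 can be summarized in a short sketch combining steps 3 and 4 with the IPI-throttling heuristic described in Section 1; the shared-area layout, the helper names and the latency bound are all hypothetical.

```c
/*
 * A sketch of the VMkernel-side notification path (steps 3-4 in
 * Figure 1) with IPI throttling based on the timestamp of the last
 * virtual interrupt.  All names and values are hypothetical.
 */
#include <stdint.h>

#define IPI_BOUND_US 200  /* assumed notification-latency bound */

struct vmm_channel {
    /* Memory area shared between the VMkernel and one VM's VMM.
       The VMM stores a timestamp here at step 7, when it fires a
       virtual interrupt into the guest. */
    uint64_t last_virq_us;
    /* ... IO completion queue omitted ... */
};

/* Step 3: post the completion; step 4: send an IPI only if needed. */
static void vmk_post_completion(struct vmm_channel *ch, uint64_t now_us,
                                void (*send_ipi)(void))
{
    /* enqueue_completion(ch, ...);   step 3, omitted */

    /*
     * If the guest was interrupted recently, its interrupt service
     * routine is likely still draining the shared queue (steps 5-7)
     * and will pick this completion up as well; skip the IPI.
     * Otherwise send one so notification latency stays bounded.
     */
    if (now_us - ch->last_virq_us >= IPI_BOUND_US)
        send_ipi();
}
```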