vIC: Interrupt Coalescing for Virtual Machine Storage Device IO

Irfan Ahmad, Ajay Gulati, Ali Mashtizadeh
{irfan, agulati, ali}@vmware.com
VMware, Inc., Palo Alto, CA 94304

Abstract

Interrupt coalescing is a well known and proven technique for reducing CPU utilization when processing high IO rates in network and storage controllers. Virtualization introduces a layer of virtual hardware for the guest operating system, whose interrupt rate can be controlled by the hypervisor. Unfortunately, existing techniques based on high-resolution timers are not practical for virtual devices, due to their large overhead. In this paper, we present the design and implementation of a virtual interrupt coalescing (vIC) scheme for virtual SCSI hardware controllers in a hypervisor.

We use the number of commands in flight from the guest as well as the current IO rate to dynamically set the degree of interrupt coalescing. Compared to existing techniques in hardware, our work does not rely on high-resolution interrupt-delay timers and thus leads to a very efficient implementation in a hypervisor. Furthermore, our technique is generic and therefore applicable to all types of hardware storage IO controllers which, unlike networking, don't receive anonymous traffic. We also propose an optimization to reduce inter-processor interrupts (IPIs), resulting in better application performance during periods of high IO activity. Our implementation of virtual interrupt coalescing has been shipping with VMware ESX since 2009. We present our evaluation showing performance improvements in micro benchmarks of up to 18% and in TPC-C of up to 5%.

1 Introduction

The performance overhead of virtualization has decreased steadily in the last decade due to improved hardware support for hypervisors.
This and other storage device optimizations have led to increasing deployments of IO intensive applications on virtualized hosts. Many important enterprise applications today exhibit high IO rates. For example, transaction processing loads can issue hundreds of very small IO operations in parallel, resulting in tens of thousands of IOs per second (IOPS). Such high IOPS are now within reach of even more IT organizations with faster storage controllers, wider adoption of solid-state disks (SSDs) as a front-end tier in storage arrays, and increasing deployments of high performance consolidated storage devices using Storage Area Network (SAN) or Network-Attached Storage (NAS) protocols.

For high IO rates, the CPU overhead for handling all the interrupts can get very high and eventually lead to a lack of CPU resources for the application itself [7, 14]. CPU overhead is even more of a problem in virtualization scenarios where we are trying to consolidate as many virtual machines into one physical box as possible. Freeing up CPU resources from one virtual machine (VM) will improve the performance of other VMs on the same host. Traditionally, interrupt coalescing or moderation has been used in network and storage controller cards to limit the number of times that application execution is interrupted by the device to handle IO completions. Such coalescing techniques have to carefully balance an increase in IO latency against the improved execution efficiency due to fewer interrupts.

In hardware controllers, fine-grained timers are used in conjunction with interrupt coalescing to keep an upper bound on the latency of IO completion notifications. Such timers are inefficient to use in a hypervisor, and one has to resort to other pieces of information to avoid long delays. This problem is challenging for several other reasons, including the desire to maintain a small code size, thus keeping the trusted computing base to a manageable size. We treat the virtual machine workload as unmodifiable and as an opaque black box. We also assume, based on earlier work, that guest workloads can change their behavior very quickly [6, 10].

In this paper, we target the problem of coalescing interrupts for virtual devices without assuming any support from hardware controllers and without using high resolution timers. Traditionally, there are two parameters that need to be balanced: maximum interrupt delivery latency (MIDL) and maximum coalesce count (MCC). The first parameter denotes the maximum time to wait before sending the interrupt, and the second denotes the number of accumulated completions before sending an interrupt to the operating system (OS). The OS is interrupted based on whichever parameter is hit first.

We propose a novel scheme to control both MIDL and MCC implicitly by setting the delivery ratio of interrupts based on the current number of commands in flight (CIF) from the guest OS and the overall IO completion rate. The ratio, denoted as R, is simply the number of virtual interrupts sent to the guest divided by the number of actual IO completions received by the hypervisor on behalf of that guest. Note that 0 < R ≤ 1. A lower value of the delivery ratio R denotes a higher degree of coalescing. We increase R when CIF is low and decrease R for higher values of CIF.

The key insight in the paper is that, unlike network IO, CIF can be used directly for storage controllers because each request has a corresponding command in flight prior to completion. Also, based on the characteristics of storage devices, it is important to maintain a certain number of commands in flight to efficiently utilize the underlying storage device [9, 11, 23]. The benefits of command queuing are well known, and concurrent IOs are used in most storage arrays to maintain high utilization. Another challenge in coalescing interrupts for storage IO requests is that many important applications issue synchronous IOs. Delaying the completion of prior IOs can delay the issue of future ones, so one has to be very careful about minimizing the latency increase.

Another problem we address is specific to hypervisors, where the host storage stack has to receive and process an IO completion before routing it to the issuing VM. The hypervisor may need to send inter-processor interrupts (IPIs) from the CPU that received the hardware interrupt to the remote CPU where the VM is running for notification purposes. We provide an optimization to reduce the number of IPIs issued using the timestamp of the last interrupt that was sent to the guest OS. This reduces the overall number of IPIs while bounding the latency of notifying the guest OS about an IO completion.

We have implemented our virtual interrupt coalescing (vIC) techniques in the VMware ESX hypervisor [21], though they can be applied to any hypervisor, including type 1 and type 2, as well as to hardware storage controllers. Experimentation with a set of micro benchmarks shows that vIC techniques can improve both workload throughput and CPU overheads related to IO processing by up to 18%. We also evaluated vIC against the TPC-C workload and found improvements of up to 5%. The vIC implementation discussed here is being used by thousands of customers in the currently shipping ESX version.

The next section presents background on the VMware ESX Server architecture and overall system model, along with a more precise problem definition. Section 3 presents the design of our virtual interrupt coalescing mechanism along with a discussion of some practical concerns. An extensive evaluation of our implementation is presented in Section 4, followed by some lessons learned from our deployment experience in the real world in Section 5. Section 6 presents an overview of related work, followed by conclusions and directions for future work in Sections 7 and 8 respectively.
2 System Model

Our system model consists of two components in the VMware ESX hypervisor: the VMkernel and the virtual machine monitor (VMM). The VMkernel is a hypervisor kernel, a thin layer of software controlling access to physical resources among virtual machines. The VMkernel provides isolation and resource allocation among virtual machines running on top of it. The VMM is responsible for correct and efficient virtualization of the x86 instruction set architecture as well as emulation of high performance virtual devices. It is also the conceptual equivalent of a "process" to the ESX VMkernel. The VMM intercepts all the privileged operations from a VM, including IO, and handles them in cooperation with the VMkernel.

Figure 1 shows the ESX VMkernel executing storage stack code on the CPU on the right and an example VM running on top of its virtual machine monitor (VMM) on the left processor. In the figure, when an interrupt is received from a storage adapter (1), appropriate code in the VMkernel is executed to handle the IO completion (2) all the way up to the vSCSI subsystem, which narrows the IO to a specific VM. Each VMM shares a common memory area with the ESX VMkernel, where the VMkernel posts IO completions in a queue (3), following which it may issue an inter-processor interrupt, or IPI (4), to notify the VMM. The VMM can pick up the completions on its next execution (5) and process them (6), resulting finally in the virtual interrupt being fired (7).

Without explicit interrupt coalescing, the VMM always asserts the level-triggered interrupt line for every IO. Level-triggered lines do some implicit coalescing already, but that only helps if two IOs are completed back-to-back in the very short time window before the guest interrupt service routine has had the chance to deassert the line.

Only the VMM can assert the virtual interrupt line, and it is possible after step 3 that the VMM may not get a chance to execute for a while.

Figure 1: Virtual Interrupt Delivery Mechanism. When a disk IO completes, an interrupt is fired (1) from a physical adapter to a particular Physical CPU (PCPU) where the interrupt handler of the hypervisor delivers it to the appropriate device driver (2). Higher layers of the hypervisor storage stack process the completion until the IO is matched (vSCSI layer) to the particular Guest Operating System which issued the IO and its corresponding Virtual Machine Monitor (VMM). vSCSI then updates the shared completion queue for the VMM (3) and, if the guest or VMM is currently executing, issues an inter-processor interrupt (IPI) to the target PCPU where the VMM is known to be running (4). The IPI is only a latency optimization since the VMM would have inspected the shared queues the next time the guest exited to the VMM anyway.
The remote VMM’s IPI handler takes the signal and (5) inspects the completion queues of its virtual SCSI host bus adapters (HBAs), processes and virtualizes the completions (6) and fires a virtual interrupt to be handled by the guest (7).
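To make the handoff concrete, here is a minimal sketch of the kind of shared-area structure and VMkernel-side posting logic implied by steps (3) and (4). All names and types are hypothetical stand-ins chosen for illustration; the paper does not publish this code.

    /* Sketch only: hypothetical names, not the actual ESX source. */
    #include <stdint.h>

    typedef struct Completion { uint64_t ioId; int status; } Completion;
    typedef struct CompletionQueue {
        Completion slots[64];
        uint32_t head, tail;
    } CompletionQueue;

    /* One region per VMM, shared between the VMkernel and that VMM. */
    typedef struct SharedArea {
        CompletionQueue cq;        /* step 3: vSCSI posts completions here */
        uint64_t lastIntrTime;     /* written by the VMM; used in Sec. 3.3 */
    } SharedArea;

    /* VMkernel side, after vSCSI has matched a completion to a VM. */
    void VSCSI_PostCompletion(SharedArea *sa, Completion c, int vmmPcpu)
    {
        CQ_Enqueue(&sa->cq, c);        /* step 3: publish the completion   */
        if (VMM_IsRunningOn(vmmPcpu))
            IPI_Send(vmmPcpu);         /* step 4: latency hint only; the   */
                                       /* VMM drains the queue anyway on   */
                                       /* its next intercept or timer tick */
    }

Section 3.3 later inserts a timestamp check (step 3.5) in front of the IPI_Send() call.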

To limit any latency implications of a VM not entering into the VMM, the VMkernel will take one of two actions. It will schedule the VM if it is descheduled. Otherwise, if both the VM and the VMkernel are executing on separate cores at the same time, the VMkernel sends an IPI, as in step 4 in the figure. This IPI is purely an optimization to provide lower latency IO completions to the guest. Without the IPI, guests may execute user level code for an extended period without triggering any hypervisor intercept that would allow for virtual interrupt delivery. Correctness guarantees can still be met even if the IPI isn't issued, since the VMM will pick up the completion as a matter of course the next time that it gets invoked via a timer interrupt or a guest exiting into VMM mode due to a privileged operation.

Based on the design described above, there are two inefficiencies in the existing mechanism. First, the VMM will interrupt the guest for every completion that it sees posted by the VMkernel. We would like to coalesce these to reduce the guest CPU overhead during high IO rates. Second, IPIs are very costly and are used mainly as a latency optimization. It would be desirable to dramatically reduce them if one could track the rate at which completions are being picked up by the VMM. All this needs to be done without the help of fine grained timers because they are prohibitively expensive in a hypervisor. Thus the main challenges in coalescing virtual interrupts can be summarized as:

1. How to control the rate of interrupt delivery from a VMM to a guest without loss of throughput?

2. How and when to delay the IPIs without inducing high IO latencies?

In the next section, we present our virtual interrupt coalescing mechanisms to efficiently resolve both of these challenges.

3 vIC Design

In this section, we first present some background on existing coalescing mechanisms and explain why they cannot be used in our environment. Next, we present our approach at a high level, followed by the details of each component and a discussion of specific implementation issues.
3.1 Background

When implemented in physical hardware controllers, interrupt coalescing generally makes use of high resolution timers to cap the amount of extra latency that interrupt coalescing might introduce. Such timers allow the controllers to directly control MIDL (maximum interrupt delivery latency) and adapt MCC (maximum coalesce count) based on the current rate. For example, one can configure MCC based on a recent estimate of interrupt arrivals and put a hard cap on latency by using high resolution timers to control MIDL. Some devices are known to allow a configurable MIDL in increments of tens of microseconds.

Such high resolution timers are generally used in dedicated IO processors, where the firmware timer handler overhead can be well contained and the hardware resources can be provisioned at design time to meet the overhead constraints. However, in any general purpose operating system or hypervisor, it is generally not considered feasible to program high resolution timing as a matter of course. The associated CPU overhead is simply too high.

If we were to try to directly map that MCC/MIDL solution to virtual interrupts, we would be forced to drive the system timer interrupt at resolutions as high as 100 µs. Such a high interrupt rate would have a prohibitive performance impact on the overall system, both in terms of the sheer CPU cost of running the software interrupt handler ten thousand times a second, as well as the first- and second-order context switching overhead associated with each invocation. As a comparison, Microsoft Windows 7 typically sets up its timers to go off every 15.6 ms, or down to 1 ms in special cases, whereas VMware ESX configures timers in the range of 1 ms to 10 ms, or even longer when using one-shot timers. This is orders of magnitude lower resolution than what is used by typical storage hardware controller firmware.

3.2 Our approach

In our design, we define a parameter called the interrupt delivery ratio R, as the ratio of interrupts delivered to the guest to the actual number of interrupts received from the device for that guest. A lower delivery ratio implies a higher degree of coalescing. We dynamically set our interrupt delivery ratio, R, in a way that will provide coalescing benefits for CPU efficiency as well as tightly control any extra vIC-related latency. This is done using commands in flight (CIF) as the main parameter and the IO completion rate (measured as IOs per second, or IOPS) as a secondary control.

At a high level, if IOPS is high, we can coalesce more interrupts within the same time period, thereby improving CPU efficiency. However, we still want to avoid and limit the increase in latency for cases when the IOPS changes drastically or when the number of issued commands is very low. SSDs can typically do tens of thousands of IOPS even with CIF = 1, but delaying IOs in this case would hurt overall performance.

To control IO delay, we use CIF as a guiding parameter, which determines the overall impact that the coalescing can have on the workload. For example, coalescing 4 IO completions out of 32 outstanding might not be a problem since we are able to keep the storage device busy with the remaining 28, whereas even a slight delay caused by coalescing 2 IOs out of 4 outstanding could result in the resources of the storage device not getting fully utilized. Thus we want to vary the delivery ratio R in inverse proportion to the CIF value. Using both the CIF value and the estimated IOPS value, we are able to provide effective coalescing for a wide variety of workloads.

There are three main parameters used in our algorithm:

• iopsThreshold: IOPS value below which no interrupt coalescing is done.

• cifThreshold: CIF value below which no interrupt coalescing is done.

• epochPeriod: Time interval after which we re-evaluate the delivery ratio, in order to react to changes in the VM workload.

The algorithm operates in one of three modes:

(1) Low-IOPS (R = 1): We turn off vIC if the achieved throughput of a workload ever drops below the iopsThreshold. Recall that we do not have a high resolution timer. If we did, whenever it fired it would allow us to determine if we have held on to an IO completion for too long. A key insight for us is that, instead of a timer, we can actually rely on future IO-completion events to give our code a chance to control extra latency. For example, an IOPS value of 20,000 means that on average there will be a completion returned every 50 µs. Our default iopsThreshold is 2000, which implies a completion on average every 500 µs. Therefore, at worst, we can add that amount of latency. For higher IOPS, the extra latency only decreases. In order to do this, we keep an estimate of the current number of IOPS completed by the VM.
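The paper does not show the bookkeeping behind this estimate, but a timer-free variant can be maintained with a couple of counters that are touched only on IO completions. The sketch below uses hypothetical names and assumes a cheap clock source such as the TSC:

    /* Sketch: epoch-based IOPS estimate driven purely by IO completions.
     * Hypothetical names; illustrates the idea, not the shipped code.  */
    #include <stdint.h>

    typedef struct VicState {
        uint32_t countUp, skipUp;     /* encode R = countUp / skipUp       */
        uint32_t counter;             /* position within the current cycle */
        uint64_t epochStart;          /* when the current epoch began      */
        uint32_t epochCompletions;    /* completions seen in this epoch    */
        uint32_t currIOPS;            /* estimate from the previous epoch  */
    } VicState;

    #define TICKS_PER_SEC 1000000000ULL   /* example clock rate (assumed) */

    /* Called on every IO completion; no timer ever fires. */
    static void VicUpdateIops(VicState *s, uint64_t now, uint64_t epochPeriod)
    {
        s->epochCompletions++;
        if (now - s->epochStart > epochPeriod) {        /* 200 ms default */
            s->currIOPS = (uint32_t)(s->epochCompletions * TICKS_PER_SEC /
                                     (now - s->epochStart));
            s->epochCompletions = 0;
            s->epochStart = now;
            /* This is also where IntrCoalesceRecalc() (Algorithm 1) runs. */
        }
    }

Note that the division executes at most once per epochPeriod, consistent with the paper's rule of keeping divisions off the per-completion critical path (Section 3.4).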

(2) Low-CIF (R = 1): We turn off vIC whenever the number of outstanding IOs (CIF) drops below a configurable parameter, cifThreshold. Our interrupt coalescing algorithm tries to be very conservative so as to not increase the application IO latency for trickle IO workloads. Such workloads have very strong IO inter-dependencies and generally issue only a very small number of outstanding IOs. A canonical example of an affected workload is dd, which issues one IO at a time. For dd, if we had coalesced an interrupt, it would actually hang forever. In fact, waiting is completely useless for such cases and it only adds extra latency. When only a small number of IOs (cifThreshold) remain outstanding on an adapter, we stop coalescing. Otherwise, there may be a throughput reduction because we are delaying a large percentage of IOs.

(3) Variable R based on CIF: Setting the delivery ratio (R) dynamically is challenging since we have to balance the CPU efficiency gained by coalescing against the additional latency that may be added, especially since that may in turn lower the achieved throughput. We discuss our computation of R next.

3.2.1 Dynamic Adjustment of Delivery Ratio R

Which ratio is picked depends upon the number of commands in flight (CIF) and the configuration option cifThreshold. As CIF increases, we have more room to coalesce. For workloads with multiple outstanding IOs, the extra delay works well since they are able to amortize the cost of the interrupt being delivered to process more than one IO. For example, if the CIF value is 24, even if we coalesce 3 IOs at a time, the application will have 21 other IOs pending at the storage device to keep it busy.

In deciding the value of R, we have two main issues to resolve. First, we cannot choose an arbitrary fractional value of R and implement that, because of the lack of floating point calculations in the VMM code. Second, a simple ratio of the form 1/x based on a counter x would imply that the only delivery-ratio options available to the algorithm would be 100%, 50%, 25%, 12.5%, and so on. The jump from 100% down to 50% is actually too drastic. Instead, we found that to be able to handle a multitude of situations, we need to have delivery ratios anywhere from 100% down to 6.25%. In order to do this we chose to set two fields, countUp and skipUp, dynamically to express the delivery ratios. Intuitively, we deliver countUp out of every skipUp interrupts, i.e., R = countUp/skipUp. For example, to deliver 80% of the interrupts, countUp = 4 and skipUp = 5, whereas for 6.25%, countUp = 1 and skipUp = 16. Table 1 shows the full range of values as encoded in Algorithm 1. By allowing ratios between 100% and 50%, we can tightly control the throughput loss at smaller CIF.

The exact values of R were determined based on experimentation and to support an efficient implementation in a VMM. Algorithm 1 shows the exact values of the delivery ratio R as a function of CIF, cifThreshold and iopsThreshold. Next we will discuss the details of the interrupt delivery mechanism and some optimizations in implementing this computation.

Algorithm 1: Delivery Ratio Determination

    IntrCoalesceRecalc(int cif):
        cif          : current # of commands in flight (CIF)
        currIOPS     : current throughput in IOs per sec
        cifThreshold : configurable min CIF (default = 4)

        if currIOPS < iopsThreshold or cif < cifThreshold then
            countUp <- 1;  skipUp <- 1                       /* R = 1    */
        else if cif < 2 * cifThreshold then
            countUp <- 4;  skipUp <- 5                       /* R = 0.8  */
        else if cif < 3 * cifThreshold then
            countUp <- 3;  skipUp <- 4                       /* R = 0.75 */
        else if cif < 4 * cifThreshold then
            countUp <- 2;  skipUp <- 3                       /* R = 0.66 */
        else
            countUp <- 1;  skipUp <- cif / (2 * cifThreshold) /* R = 8/CIF */

Algorithm 2: VMM — IO Completion Handler

    cif          : current # of commands in flight (CIF)
    cifThreshold : configurable min CIF (default = 4)
    epochStart   : time at start of current epoch (global)
    epochPeriod  : duration of each epoch (global)

    diff <- currTime() - epochStart
    if diff > epochPeriod then
        IntrCoalesceRecalc(cif)
    if cif < cifThreshold then
        counter <- 1
        deliverIntr()
    else if counter < countUp then
        counter++
        deliverIntr()
    else if counter >= skipUp then
        counter <- 1
        deliverIntr()
    else
        counter++                                     /* don't deliver */
    if interrupt was delivered then
        SharedArea.timeStamp <- currTime()

    CIF        Interrupt delivery ratio R (as %)
    1-3        100%
    4-7        80%
    8-11       75%
    12-15      66%
    >= 16      8 / CIF   (e.g., CIF = 64 gives 12%)

Table 1: Default interrupt delivery ratio (R) as a function of CIF. cifThreshold is set to the default of 4.
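For concreteness, the two routines above translate roughly into the following C, reusing the VicState sketch from Section 3.2. This is a sketch under the stated defaults; cifThreshold, iopsThreshold, epochPeriod, deliverIntr() and the shared-area field are hypothetical stand-ins for VMM internals:

    extern uint32_t cifThreshold, iopsThreshold;
    extern uint64_t epochPeriod;
    void deliverIntr(void);

    static void IntrCoalesceRecalc(VicState *s, uint32_t cif)
    {
        if (s->currIOPS < iopsThreshold || cif < cifThreshold) {
            s->countUp = 1; s->skipUp = 1;              /* R = 1    */
        } else if (cif < 2 * cifThreshold) {
            s->countUp = 4; s->skipUp = 5;              /* R = 0.8  */
        } else if (cif < 3 * cifThreshold) {
            s->countUp = 3; s->skipUp = 4;              /* R = 0.75 */
        } else if (cif < 4 * cifThreshold) {
            s->countUp = 2; s->skipUp = 3;              /* R = 0.66 */
        } else {
            s->countUp = 1;                             /* R = 8/CIF at the  */
            s->skipUp  = cif / (2 * cifThreshold);      /* default threshold */
        }
    }

    /* Runs in the VMM on every IO completion (Algorithm 2). */
    static void VicOnCompletion(VicState *s, uint64_t *lastIntrTime,
                                uint32_t cif, uint64_t now)
    {
        int deliver;

        if (now - s->epochStart > epochPeriod)    /* once per epoch (200 ms);  */
            IntrCoalesceRecalc(s, cif);           /* epoch rollover as in the  */
                                                  /* earlier VicUpdateIops     */
        if (cif < cifThreshold) {
            s->counter = 1;               deliver = 1;  /* coalescing is off */
        } else if (s->counter < s->countUp) {
            s->counter++;                 deliver = 1;  /* deliver phase     */
        } else if (s->counter >= s->skipUp) {
            s->counter = 1;               deliver = 1;  /* cycle wraps       */
        } else {
            s->counter++;                 deliver = 0;  /* skip: coalesce    */
        }

        if (deliver) {
            deliverIntr();                /* assert the virtual intr line    */
            *lastIntrTime = now;          /* shared-area stamp, Section 3.3  */
        }
    }

Tracing VicOnCompletion() with countUp/skipUp = 3/4 reproduces the ⟨1, yes⟩, ⟨2, yes⟩, ⟨3, no⟩, ⟨4, yes⟩ sequence worked through in Section 3.2.2 below.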

3.2.2 Delivering Interrupts

On any given IO completion, the VMM needs to decide whether to post an interrupt to the guest or to coalesce it with a future one. This decision logic is captured in pseudo code in Algorithm 2. First, at every "epoch period", which defaults to 200 ms, we re-evaluate the vIC rate so we can react to changes in workloads. This is done in the function IntrCoalesceRecalc(), the pseudo code for which is found in Algorithm 1.

Next, we check to see if the new CIF is below the cifThreshold. If such a condition happens, we immediately deliver the interrupt. The VMM is designed as a very high performance software system where we worry about code size (in terms of both lines of code (LoC) and bytes of .text). Ultimately, we have to calculate for each IO completion whether or not to deliver a virtual interrupt given the ratio R = countUp/skipUp. Since this decision is on the critical path of IO completion, we have designed a simple but very condensed logic to do so with a minimum number of LoC, which needs careful explanation.

In Algorithm 2, counter is an abstract number that counts up from 1 till countUp - 1, delivering an interrupt each time. It then continues to count up till skipUp - 1 while skipping each time. Finally, once counter reaches skipUp, it is reset back to 1 along with an interrupt delivery. Let us look at two examples of a series of counter values as more IOs come in, along with whether the algorithm delivers an interrupt, as tuples of ⟨counter, deliver?⟩. For a countUp/skipUp ratio of 3/4, a series of IOs looks like: ⟨1, yes⟩, ⟨2, yes⟩, ⟨3, no⟩, ⟨4, yes⟩. Whereas for countUp/skipUp of 1/5: ⟨1, no⟩, ⟨2, no⟩, ⟨3, no⟩, ⟨4, no⟩, ⟨5, yes⟩.

Next we look at the optimization related to reducing the number of IPIs sent to a VM during high IO rates.

3.3 Reducing IPIs

So far, we have described the mechanism for virtual interrupt coalescing inside the VMM. As mentioned in Section 2 and illustrated in Figure 1, another component involved in IO delivery is the ESX VMkernel. Recall that IO completions from hardware controllers are handled by this component and sent to the VMM, an operation that can require an IPI in case the guest is currently running on the remote processor. Since IPIs are expensive, we would like to avoid them or at the very least minimize their occurrence. Note that the IPI is a mechanism to force the VMM to wrest execution control away from the guest to process a completion. As such it is purely a latency optimization, and correctness guarantees don't hinge on it since the VMM frequently gets control anyway and always checks for completions.

Figure 2 shows the additional data flow and computation in the system to accomplish our goal of reducing IPIs. The primary concern is that a guest OS might have scheduled a compute intensive task, which may result in the VMM not receiving an intercept. In the worst case, the VMM will wait until the next timer interrupt, which could be several milliseconds away, to receive a chance to execute and deliver virtual interrupts. So, our goal is to avoid delivering IPIs as much as possible while also bounding the extra latency increase.

Figure 2: Virtual Interrupt Delivery Steps. In addition to Figure 1, vIC adds a new shared area object tracking the last time that the VMM fired an interrupt. Before sending the IPI, vSCSI checks whether the time since the last VMM-induced virtual interrupt is less than a configurable threshold. If it is not, an IPI is still fired; otherwise, it is deferred. In the VMM, an interrupt coalescing scheme is introduced. Note that we did not introduce a high-resolution timer and instead rely on the subsequent IO completions themselves to drive the vIC logic and to regulate the vIC related delay.

We introduce, as part of the shared area between the VMM and the VMkernel where completion queues are managed, a new time-stamp of the last time the VMM posted an IO completion virtual interrupt to the guest (see the last line of Algorithm 2). We added a new step (3.5) in the VMkernel where, before firing an IPI, we check the current time against what the VMM has posted to the shared area. If the time difference is greater than a configurable threshold, we post the IPI. Otherwise, we give the VMM an opportunity to notice IO completions in due course on its own. Section 4.5 provides an experimental evaluation of the impact of IPI delay threshold values.
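The VMkernel half of this optimization is small. A sketch of the step (3.5) check follows, again with hypothetical names (ipiDelayThreshold is the configurable threshold; Section 4.5 discusses its 100 µs default):

    /* VMkernel, immediately before step 4 of Figure 2. */
    extern uint64_t ipiDelayThreshold;

    static void MaybeSendIpi(const SharedArea *sa, int vmmPcpu, uint64_t now)
    {
        /* If the VMM has fired a virtual interrupt recently, it is actively
         * draining its completion queue; skip the IPI and let it find this
         * completion on its own.  The added latency stays bounded: the VMM
         * regains control at the next timer interrupt or guest exit.      */
        if (now - sa->lastIntrTime > ipiDelayThreshold)
            IPI_Send(vmmPcpu);
    }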
3.4 Implementation Cost

We took great care to minimize the cost of vIC and to make our design and implementation as portable as possible. Part of that was to refrain from using any floating point code. In the critical path code (Algorithm 2), we even avoid integer divisions. This should allow our design to be directly implementable in other hypervisors on any CPU architecture, and even in the firmware or hardware of storage controllers. For reference, the increase in the 64-bit VMM .text section was only 400 bytes and the .data section grew by only 104 bytes. Our patch for the LSI Logic emulation in the VMM was less than 120 LoC. Similarly, the IPI coalescing logic in the VMkernel was implemented with just 50 LoC.

4 Experimental Evaluation

To evaluate our vIC approach, we have examined several micro-benchmark and macro-benchmark workloads and compared each workload with and without interrupt coalescing. In each case we have seen a reduction in CPU overhead, often associated with an increase in throughput (IOPS). For all of the experiments, unless otherwise indicated, the parameters are set as follows: cifThreshold = 4, iopsThreshold = 2000 and epochPeriod = 200 ms. All but the TPC-C experiments were run on an HP Proliant DL-380 machine with 4 dual-core AMD 2.4 GHz processors. The attached storage array was an EMC CLARiiON CX3-40 with very small, fully cached LUNs. The Fibre Channel HBA used was a dual-port QLogic 4Gb card.

In the next subsections, we first discuss the results for the Iometer micro benchmark in Section 4.1. Next, we cover the CPU utilization improvements of the Iometer benchmark and a detailed break-down of savings in Section 4.2. Section 4.3 presents our evaluation of vIC using two SQL IO simulators, namely SQLIOSim and GSBlaster. Finally, we present results for a complex TPC-C-like workload in Section 4.4. For each of the experiments, we have looked at the CPU savings along with the impact on throughput and latency of the workload.

4.1 Iometer Workload

We evaluated two Iometer [1] workloads running on a Windows 2003 VM on top of an internal build of VMware ESX Server. The first workload consists of 4KB sequential IO reads issued by one worker thread running on a fully cached Logical Unit (LUN). In other words, all IO requests are hitting the array's cache instead of requiring disk access. The second workload is identical except for a different block size of 8KB.
For both workloads we varied the number of outstanding IOs to see the improvement over the baseline. In Table 2, we show the full matrix of our test results for the 4KB workload. Furthermore, Table 3 summarizes the percentage improvements over the baseline, where coalescing was disabled. The column labeled Rˆ in Table 2 is the average ratio chosen by the algorithm based on varying CIF over the course of the experiment; as expected, our algorithm coalesces more aggressively as the number of outstanding IOs is increased. Looking closely at the 64 CIF case, we can see that the dynamic delivery ratio, R, was found to be 1/6 on average. This means that one interrupt was delivered for every six IOs. The guest operating system reported a drop from 113K interrupts per second to 34K. The result of this is that the CPU cycles per IO have also dropped from 78.4K to 64.0K, which is an efficiency gain of 18%.

In Tables 4 and 5 we show the same results as before, but now with the 8KB IO workload. For the 64 CIF case, the algorithm results in the same interrupt coalescing ratio of 1/6, with now a 7% efficiency improvement over the baseline. The interrupts per second in the guest have dropped from 30K to 11K.

In both Tables 2 and 4 we see a noticeable reduction in CPU cycles per IO whenever vIC has been enabled. We would also like to note that throughput never decreased and in many cases actually increased significantly.

    OIO   Rˆ     IOPS    CPU cyc/IO   Guest int/s | Baseline   Baseline     Baseline
                                                  | IOPS       CPU cyc/IO   guest int/s
    8     4/5    31.2K   82.6K        49K         | 30.5K      84.2K        47K
    16    2/3    38.9K   74.8K        58K         | 38.4K      77.0K        60K
    32    1/3    48.3K   68.0K        69K         | 46.4K      70.5K        74K
    64    1/6    53.1K   64.0K        34K         | 52.9K      78.4K        113K

Table 2: Iometer 4KB reads with one worker thread and a cached Logical Unit (LUN). Rˆ is the average delivery ratio set dynamically by the algorithm in this experiment. OIO is the number of outstanding IOs setting in Iometer. At runtime, CIF is often lower than the workload-configured OIO, as confirmed by Rˆ here being lower than the R(OIO) from Table 1.

    OIO   IOPS %diff   CPU cost %diff   Guest int/s %diff
    8     2.3%         -1.9%            4.3%
    16    1.3%         -2.8%            -3.3%
    32    4.1%         -3.5%            -6.8%
    64    0.4%         -18.4%           -66.4%

Table 3: Summary of improvements in key metrics with vIC. The experiment is the same as in Table 2.

    OIO   Rˆ     IOPS    CPU cyc/IO   Guest int/s | Baseline   Baseline     Baseline
                                                  | IOPS       CPU cyc/IO   guest int/s
    8     4/5    31.2K   83.6K        48K         | 29.9K      88.2K        49K
    16    2/3    39.3K   77.6K        61K         | 38.5K      81.3K        63K
    32    1/3    41.5K   76.0K        60K         | 41.1K      77.1K        69K
    64    1/6    41.5K   71.0K        11K         | 41.1K      75.7K        30K

Table 4: Iometer 8KB reads with one worker thread and a cached Logical Unit (LUN). Caption as in Table 2.

    OIO   IOPS %diff   CPU cost %diff   Guest int/s %diff
    8     4.3%         -5.2%            -2.0%
    16    2.1%         -4.5%            -3.2%
    32    1.0%         -1.5%            -13.0%
    64    1.0%         -6.2%            -63.3%

Table 5: Summary of improvements in key metrics with vIC. The experiment is the same as in Table 4.

4.2 Iometer CPU Usage Breakdown

For the 8KB sequential read Iometer workload with 64 outstanding IOs, we examined the breakdown between the VMM and guest OS CPU usage. Table 6 shows the monitor's abridged kstats. The VMK VCPU HALT statistic is the percent of time that the guest was idle. Notice that the guest idle time has increased, which implies that the guest OS spent less time processing IO for the same effective throughput. The guest kernel runtime is measured by the amount of time we spent in TC64 IDENT. Here we see a noticeable decrease in kernel mode execution time from 9.0% to 7.4%. The LSI Logic virtual SCSI adapter IO issuing time, measured by device Priv Lsilogic IO, has decreased from 5.0% to 4.3%.

The IO completion work done in the VMM is part of a generic message delivery handler function and is measured by DrainMonitorActions in the profile. The table shows a slight increase from 0.5% to 0.7% of CPU consumption due to the management of the interrupt coalescing ratio.

The net savings gained by enabling virtual interrupt coalescing can be measured by looking at the guest idle time, which increases by a significant 6.4% of a core. In a real workload which performs both IO and CPU-bound operations, this would result in an extra 6+% of available time for computation. We expect that some part of this gain also includes the reduction of the virtualization overhead as a result of vIC, mostly consisting of second order effects related to virtual device emulation.

4.3 SQLIOSim and GSBlaster

We also examined the results from SQLIOSim [13] and GSBlaster. Both of these macro-benchmark workloads are designed to mimic the IO behavior of Microsoft SQL Server.

SQLIOSim is designed to target an "ideal" IO latency to tune for. That means that if the benchmark sees a higher IO latency, it assumes that there are too many outstanding IOs and reduces that number. The reverse case is also true, allowing the benchmark to tune for this preset optimal latency value. The user chooses this value to maximize their throughput and minimize their latency. In SQLIOSim we used the default value of 100 ms.

GSBlaster is our own internal performance testing tool which behaves similarly to SQLIOSim. It was designed as a simpler alternative to SQLIOSim which we could understand and analyze in an easier manner. As opposed to SQLIOSim, when using GSBlaster we choose a fixed value for the number of outstanding IOs. It will then run the workload based on this configuration.

Table 7 shows the results of our optimization on both target macro-benchmark workloads. We can see that the IOPS increased as a result of vIC by more than 17%. As in previous benchmarks, we also found that the CPU cost per IO decreased (by 17.4% in the case of SQLIOSim and 16.6% in the case of GSBlaster).

4.4 TPC-C Workload

The results seen so far have been for micro and macro benchmarks with relatively simple workload drivers. Whereas such data gives us insight into the upside potential of vIC, we have to put that in the context of a large, complex application that performs computation as well as IO. We chose to evaluate our system on our internal TPC-C testbed¹. The number of users is always kept large enough to fully utilize the CPU resources, such that adding more users won't increase the performance. Our TPC-C run was on a 4-way Opteron E based machine. The system was backed by a 45-disk EMC CX-3-20 storage array.

¹ Non-comparable implementation; results not TPC-C™ compliant; deviations include: batch benchmark, undersized database.

              T      T diff   Users   IOPS    Intr/sec   Latency
    No vIC    43.3   --       80      10.2K   9.9K       7.7 ms
    cifT = 4  44.6   +3.0%    90      10.4K   6.4K       8.5 ms
    cifT = 2  45.5   +5.1%    90      10.5K   5.8K       9.2 ms

Table 8: TPC-C workload throughput (T) run with and without interrupt coalescing and with different cifThreshold (cifT) configuration parameter values.

    Diff vs. baseline           cifT = 4   cifT = 2
    IDT AfterInterrupt          -28%       -31%
    device Priv Lsilogic IO     -12%       -14%
    DrainMonitorActions         -19%       -22%
    Intr Deliver                -25%       -30%
    Intr IOAPIC                 -40%       -46%
    device SCSI CmdComplete     -13%       -16%
    Intr                        -42%       -49%

Table 9: VMM profile for TPC-C. Each row represents improvements in time spent in the related activity. The list is filtered down for space reasons to only the profile entries that changed significantly.
Table 8 shows results with and without interrupt coalescing with cifThreshold values of 2 and 4. When vIC is enabled, we were able to increase the number of users from 80 to 90 at the CPU saturation point. This immediately demonstrates that vIC freed up more CPU for real workload computation. Increasing the number of users also helps increase the achieved, user-visible benchmark metric of throughput or transactions per minute. Table 8 shows that the transaction rate increased by 3.0% and 5.1% for cifThreshold values of 4 and 2, respectively. In our experience, optimizations for TPC-C are very difficult and significant investment is made for each fractional percent of improvement. As such, we consider an increase of 3.0%–5.1% to be very significant.

We also logged data for the algorithm-selected vIC rate, R, every 200 ms during the run. Figure 3 shows the histogram of the distribution of R for a certain duration of the TPC-C run. It is interesting that R varies dramatically throughout the workload. The frequency distribution here is a function of the workload and not of vIC.

Furthermore, in Figure 4 we show the same data plotted against time. The interrupt coalescing rate varies significantly because the TPC-C workload has periods of IO bursts. In fact, there are numerous times where the algorithm selects to deliver less than 1 out of every 12 interrupts. This illustrates how the self-adaptability and responsiveness of our algorithm is necessary to satisfy complex real-world workloads.

As a result of interrupt coalescing, we see a decrease in virtual interrupts per second delivered to the guest from 9.9K to 6.4K and 5.8K. We also noticed an increase in the IOPS achieved by the storage array. This can be explained by the fact that there is increased parallelism (more users) in the input workload. Such a change in workload has been demonstrated in earlier work to increase throughput [11].

Any increase in parallelism is also accompanied by an expected increase in average latency [11]. This explains the bulk of the increase between the no-vIC and the vIC-enabled rows in Table 8. However, one would expect a certain increase in latency from any interrupt coalescing algorithm. In our case, we expect latency increases to be less than a few hundred microseconds. Using instantaneous IOPS, CIF and R, we can calculate the mean delay from vIC. For instance, at the median delivery rate R = 1/3, at the achieved mean IOPS of 10K for this experiment, the increase would be 200 µs.
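To spell that arithmetic out (our rendering; the paper states only the result): with R = countUp/skipUp, a held-back completion waits for at most skipUp − countUp subsequent completions, which at a given IOPS arrive on average every 1/IOPS seconds:

    \[
      \text{added delay} \;\approx\; \frac{\mathit{skipUp}-\mathit{countUp}}{\mathrm{IOPS}}
      \;=\; \frac{3-1}{10{,}000\ \mathrm{s^{-1}}} \;=\; 200\ \mu\mathrm{s}
      \qquad \bigl(R = \tfrac{1}{3}:\ \mathit{countUp}=1,\ \mathit{skipUp}=3\bigr).
    \]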


Figure 4: The algorithm-selected virtual interrupt coalescing rate, R, over time for TPC-C. The high dynamic range illustrates the burstiness in outstanding IOs of the workload and the resulting online adaptation by vIC.

In Table 9 we show a profile of the percentage reduction in CPU utilization of several key VMM components as a result of interrupt coalescing. IOs are processed as part of the VMM's action processing queue. The reduction in queue processing is between 19% and 22% for the CIF thresholds of 4 and 2 respectively. Fewer interrupts mean that the guest operating system performs fewer register operations on the virtual LSI Logic controller, shown by device Priv Lsilogic IO. The net reduction in device operations translated to a 12% and 14% reduction, respectively, in CPU usage relative to using the virtual device without vIC. We also measured an approximately 30% reduction in the monitor's interrupt delivery function Intr Deliver.

Figure 3: Histogram of dynamically selected virtual interrupt coalescing rates, R, during our TPC-C run. The x-axis is log-scale.

4.5 IPI interference: CPU-bound loads

Recall that we described an optimization of not posting IPIs in all cases, using a threshold delay, in order to lower the impact of IPIs on VM workloads (Section 3.3). In this section, we first motivate our optimization by showing that the impact on CPU-bound applications can be very high. We show nevertheless that sending at least some IPIs is essential as an IO latency and throughput optimization. We then provide data to illustrate the difficult trade-off between the two concerns.

We ran two workloads, one IO bound and the other CPU bound, on the same virtual machine running the Microsoft Windows 2003 operating system. The IO-bound workload is an Iometer, 1 worker benchmark doing 8 OIO, 8K reads from a fully cached small LUN. The CPU-bound workload is an SMP version of Hyper Pi, for which the run time in seconds to calculate 1M digits of π on two virtual processors is plotted as triangles (average of 6 runs). Hyper Pi is fully backlogged and we run it at the idle priority class, effectively giving Iometer higher priority while ensuring that the guest never halts due to idling.

Figure 5: Effect of different IPI send thresholds. This plot illustrates the trade-off in IO throughput and CPU efficiency of two co-running benchmarks, as we vary the key parameter: the IPI send threshold. The workloads are run on the same 2-vCPU Windows 2003 guest. The IO workload performance is shown as vertical bars for an Iometer, 1 worker, 8 OIO, 8K Read from a fully cached small LUN. Each IO data point is the average of 50 consecutive samples. The CPU-bound workload is an SMP version of Hyper Pi for which the run time in seconds to calculate 1M digits of pi on two virtual processors is plotted as triangles (average of 6 runs). For each workload, both the co-running and independent scores are plotted. No delay in sending an IPI results in the highest IO throughput, whereas waiting 500 microseconds to send an IPI results in the highest performance of the CPU-bound workload.

Figure 5 shows the effect of different IPI send thresholds on performance. This plot illustrates the trade-off in IO throughput and CPU efficiency of the two co-running benchmarks, as we vary the IPI send threshold in the VMkernel. For each workload, Figure 5 shows both the co-running and independent scores. The IO workload performance is shown as vertical bars in terms of IO throughput observed by Iometer. Each IO data point is the average of 50 consecutive samples. Performance of the Hyper Pi workload is shown in terms of time to completion of the 1 million digit computation of π.

First, notice that the throughput for the "IOPS no Hyper Pi" bars does not change much with the IPI send threshold. This is because the guest is largely idle and frequently entering the CPU halt state. Whenever the guest halts, the VMM gains execution control and has the chance to check for pending completions from the VMkernel. As such, sending IPIs more or less frequently has hardly any bearing on the latency of the VMM noticing and delivering an interrupt to the guest. Similarly, the blue triangle case of "Hyper Pi no IO" shows no sensitivity to our parameter. This is obvious: no IO activity means that IPIs are out of the picture anyway.

As soon as we run those two benchmarks together, they severely interfere with each other. This is due to the fact that Hyper Pi is keeping the CPU busy, implying that the guest never enters halt. Therefore, the VMM only has rare opportunities to check for pending IO completions (e.g., when a timer interrupt is delivered). Hyper Pi sees its best performance at high IPI-send thresholds (right end of the x-axis) since it gets to run uninterrupted for longer periods of time. However, IO throughput (see the blue bars) suffers quite a bit from the increased latency. Conversely, if no delay is used (left end of the x-axis), the impact on the performance of Hyper Pi is severe. As expected, IO performance does really well in such situations.
workload performance is shown as vertical bars in terms It should be noted that the two workloads, though run- of IO throughput observed by Iometer. Each IO data ning on the same virtual machine, are not tied to each point is the average of 50 consecutive samples. Perfor- other. In other words, there is no closed loop between mance of Hyper Pi workload is shown in terms of time them as is often the case with enterprise applications. to completion of 1 million digit computation for π. Still, the setup is illustrative in showing the impact of First, notice that the throughput for “IOPS no Hyper IPIs on IO and CPU bound workloads. Pi” bars do not change much with the IPI send threshold. These results indicate that setting a good IPI send This is because the guest is largely idle and frequently threshold is important but nontrivial and even workload entering the CPU halt state. Whenever the guest halts, dependent. We consider automatic setting of this param- the VMM gains execution control and has the chance to eter to be highly interesting future work. For ESX, we check for pending completions from the VMkernel. As have currently set the default to be 100 microseconds to such, sending IPIs more or less frequently has hardly any favor CPU bound parts of workloads based on experi- bearing on the latency of the VMM noticing and deliver- ments on closed loop workloads run against real disks. ing an interrupt to the guest. Similarly, the blue triangle case of “Hyper Pi no IO” shows no sensitivity to our pa- 5 Deployment Experience rameter. This is obvious: no IO activity means that IPIs are out of the picture anyway. Our implementation of vIC as described in this paper has As soon as we run those two benchmarks together, been the default for VMware’s LSI Logic virtual adapter they severely interfere with each other. This is due to in the ESX hypervisor since version 4.0, which was re- the fact that Hyper Pi is keeping the CPU busy imply- leased in the second quarter of 2009. As such, we have ing that the guest never enters halt. Therefore, the VMM significant field experience with deployments into the only has rare opportunities to check for pending IO com- tens of thousands of virtual machines. Since then, we have not received any performance bug reports on the Gian-Paolo’s patent [15] provides a method for dy- virtual interrupt coalescing algorithm. namic adjustment of maximum frame count and max- On the other hand, the pvscsi virtual adapter which imum wait time parameters for sending the interrupts first shipped in ESX 4.0 did not initially have some of from a communication interface to a host processor. The the optimizations that we developed for the LSI Logic packet count parameter is increased when the rate of ar- virtual interrupt coalescing. In particular, although it had rivals is high and decreased when the interrupt arrival variable interrupt coalescing rates depending on CIF, it rate gets low. The maximum wait time parameter en- was missing the iopsThreshold and was not capable of sures a bounded delay on the latency of the packet deliv- setting the coalescing rate, R, between 1/2 and 1. As ery. Hickerson and Mccombs’s patent [12] uses a single a result, several performance bugs were reported. We counter to keep track of the number of initiated tasks. triaged these bugs to be related to the missing optimiza- The counter is decremented on the task completion event tions in pvscsi which are now fixed in the subsequent and it is incremented when the task is initiated. 
A delay ESX 4.1 release. We feel that this experience further vali- timer is set using the counter value. An interrupt is gen- dates the completeness of our virtual interrupt coalescing erated either when the delay timer is fired or the counter approach as a successful, practical technique for signif- value is less than a certain threshold. In contrast to both icantly improving performance in one dimension (lower of these patented techniques, our mechanism adjusts the CPU cost of IO) without sacrificing others (throughput delivery rate itself based on CIF and does not rely on or latency). any delay timers. It should be noted, however, that our The key issue with any interrupt coalescing scheme is approach is complementary to interrupt coalescing opti- the potential for increases in IO response times which mizations done in the hardware controllers since they can we have studied above. At high IO rates, some appli- benefit in lowering the load on the hypervisor host, in our cation IO threads might be blocked a little longer due case the ESX VMkernel. to coalescing. In our case, this delay is strictly bound QLogic [4] and Emulex [5] have also implemented in- by the 1/iopsThreshold. Our solution is significantly terrupt coalescing in their storage HBAs but the details better than other coalescing techniques since it explic- of their implementation are not publicly available. The itly takes CIF into account. In our experience, vIC lets knowledge base article for a Qlogic driver [3] suggests compute threads of real applications run longer before the use of a delay parameter, ql2xintrdelaytimer which getting interrupted. Increased execution time can reduce is used as a wait time for firmware before generating overhead from sources such as having the application’s an interrupt. This is again dependent on high resolution working set evicted from the CPU caches, etc. Interrupt timers and delaying the interrupts by a certain amount. coalescing is all about the trade off between CPU effi- Online documentation suggests that the QLogic inter- ciency and IO latencies. Hence, we provide parameters rupt delay timer can be set in increments of 100 µs. In- to adjust that tradeoff if necessary, though the default set- terrupt coalescing can be disabled by another parameter tings have been tuned using a variety of workloads. called ql2xoperationmode. Interestingly, this driver pa- rameter allows two modes of interrupt coalescing distin- 6 Related Work guished by whether an interrupt is fired if CIF drops to 0. A similar document related to Emulex device driver for Interrupts have been in active use since early days of VMware [2] suggests the use of statically defined delay computers to handle input-output devices. Smother- and IO count thresholds, lpfc cr delay, lpfc cr count, for man [19] provides an interesting history of the evolution interrupt coalescing. of interrupts and their usage in various computer sys- Stodolsky, Chen and Bershad [20] describe an opti- tems starting from UNIVAC (1951). With increasing net- mistic scheme to reduce the cost of interrupt masking by work bandwidth and IO throughput for storage devices, deferring the processing (“continuation”) of any interrupt the rate of interrupts and thus CPU overhead to handle that arrives during a critical section to a later time. Many them has been increasing pretty much since the interrupt operating systems now use similar techniques to handle model was first developed. Although processor speeds the of deferred processing for interrupts. 
The and number of cores have been increasing to keep up paper also suggests that interrupts be masked at the time with these devices, the motivation to reduce overall CPU of deferral so that the critical section can continue with- overhead of interrupt handling has remained strong. In- out further interruptions. Level-triggered interrupts like terrupt coalescing has been very successfully deployed the ones described in our work are another way of ac- in various hardware controllers to mitigate the CPU over- complishing the same thing. Both of these techniques head. Many patents and papers have been published on from the Bershad paper are complementary to the idea performing interrupt coalescing for network and storage of coalescing which is more concerned with the delay of hardware controllers. interrupt delivery. Zec et al. [22] study the impact of generic inter- the need of high-resolution timers. We also designed rupt coalescing implementation in 4.4BSD on the steady a technique to reduce the number of inter-processor in- state TCP throughput. They modified the fxp driver in terrupts while keeping the latency bounded. Our pro- FreeBSD and controlled only the delay parameter Td, totype implementation in the VMware ESX hypervisor which specifies the time duration between the arrival of showed that we are able to improve application through- first packet and the time at which hardware interrupt is put (IOPS) by up to 19% and improve CPU efficiency sent to the OS. Similarly, Dong et al. recently studied [8] up to 17% (for the GSBlaster and SQLIOSim workloads the CPU impact of interrupts and proposed an adaptive respectively). When tested against our TPC-C work- interrupt coalescing scheme for a network controller. load, vIC improved the workload performance by 5.1% Mogul and Ramakrishnan [14] studied the problem of and demonstrated the ability of our algorithm to adapt receive livelock, where the system is busy processing in- quickly to changes in the workload. Our technique is terrupts all the time and other necessary tasks are starved equally applicable to hypervisors and hardware storage to the most part. To avoid this problem they suggested controllers; we hope that our work spurs further work in the hybrid mechanism of polling under high load and us- this area. ing regular interrupts for lighter loads. Polling can in- crease the latency for IO completions, thereby affecting the overall application behavior. They optimized their 8 Open Problems system by using various techniques to initiate polling and enable interrupts under specific conditions. They also proposed round robin polling to fairly allocate resources There are some open problems which deserve further among various sources. exploration by our fellow researchers and practitioners. Salah et al. [17] did an analysis of various interrupt Firmware implementations of vIC could lower the cost handling schemes such as polling, regular interrupts, in- of hardware controllers and provide tighter latency con- terrupt coalescing, and disabling and enabling of inter- trol than what is available today. Currently, our vIC rupts. Their study concludes that no single scheme is implementation hard-codes the best CIF-to-R mappings good under all traffic conditions. This further moti- based on extensive experimentation. Dynamic adapta- vates the need for an adaptive mechanism that can ad- tion of that mapping appears to be an interesting prob- just to the current interrupt arrival rate and other work- lem. 
In some architectures, PCI devices are directly passed through to VMs. Interrupt coalescing in this context is worthy of investigation.

At first blush, networking controllers do not appear to lend themselves to a CIF-based approach, since the protocol layering in the stack means that the lower layers (where interrupt posting decisions are made) do not know the semantics of higher layers. Still, we speculate that inference techniques might be applicable to do aggressive coalescing without loss of throughput in the context of high-bandwidth TCP connections, using window-size-based techniques.

Acknowledgements

We would like to thank Maxime Austruy, who worked on the pvscsi virtual adapter and co-invented some of the interrupt coalescing techniques discussed here. Many thanks to Davide Bergamasco, Jinpyo Kim, Vincent Lin and Reza Taheri for help with experimental validation of our work, and to our shepherd, Muli Ben-Yehuda, for detailed comments and guidance. Finally, this work would not have been possible without support, encouragement and feedback from Ole Agesen, Mateen Ahmad, Keerti Garg, Mallik Mahalingham, Tim Mann, Glen McCready and Carl Waldspurger.

References
[1] Iometer. http://www.iometer.org.

[2] Emulex Driver for VMware ESX, August 2007. www-dl.emulex.com/support/vmware/732/vmware.pdf.

[3] QLogic: Advanced Parameters for Driver Modules Under VMware, October 2009. http://kb.qlogic.com/KanisaPlatform/Publishing/548/1492_f.SAL_Public.html.

[4] QLogic: QLE8152 datasheet, 2009. http://www.starline.de/fileadmin/images/produkte/qlogic/QLogic_QLE8152.pdf.

[5] Emulex: OneCommand Manager, June 2010. http://www.emulex.com/artifacts/ad19cc4e-870a-42e9-a4b2-bcaa70e2afd6/elx_rc_all_onecommand_efficiency_qlogic.pdf.

[6] I. Ahmad. Easy and Efficient Disk I/O Workload Characterization in VMware ESX Server. In IEEE 10th International Symposium on Workload Characterization (IISWC 2007), pages 149–158, Sept. 2007.

[7] X. Chang, J. Muppala, Z. Han, and J. Liu. Analysis of interrupt coalescing schemes for receive-livelock problem in gigabit ethernet network hosts. In IEEE International Conference on Communications (ICC), pages 1835–1839, May 2008.

[8] Y. Dong, X. Yang, X. Li, J. Li, K. Tian, and H. Guan. High performance network virtualization with SR-IOV. In HPCA, pages 1–10, 2010.

[9] A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportionate Allocation of Resources for Distributed Storage Access. In Proc. Conference on File and Storage Technology (FAST '09), Feb. 2009.

[10] A. Gulati, C. Kumar, and I. Ahmad. Storage Workload Characterization and Consolidation in Virtualized Environments. In Workshop on Virtualization Performance: Analysis, Characterization, and Tools (VPACT), 2009.

[11] A. Gulati, C. Kumar, I. Ahmad, and K. Kumar. BASIL: Automated IO Load Balancing across Storage Devices. In Proc. Conference on File and Storage Technology (FAST '10), Feb. 2010.

[12] R. Hickerson and C. C. Mccombs. Method and apparatus for coalescing I/O interrupts that efficiently balances performance and latency. US Patent 6,065,089, May 2000.

[13] Microsoft. How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem, 2009. http://support.microsoft.com/kb/231619.

[14] J. C. Mogul and K. K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. ACM Trans. Comput. Syst., 15(3):217–252, 1997.

[15] G.-P. D. Musumeci. System and method for dynamically tuning interrupt coalescing parameters. US Patent 6,889,277, May 2005.

[16] K. Salah. To coalesce or not to coalesce. Intl. J. of Elec. and Comm., pages 215–225, 2007.

[17] K. Salah, K. El-Badawi, and F. Haidari. Performance analysis and comparison of interrupt-handling schemes in gigabit networks. Comput. Commun., 30(17):3425–3441, 2007.

[18] K. Salah and A. Qahtan. Implementation and experimental performance evaluation of a hybrid interrupt-handling scheme. Comput. Commun., 32(1):179–188, 2009.

[19] M. Smotherman. Interrupts, 2008. http://www.cs.clemson.edu/~mark/interrupts.html.

[20] D. Stodolsky, J. B. Chen, and B. N. Bershad. Fast interrupt priority management in operating system kernels. In USENIX Symposium on Microkernels and Other Kernel Architectures, pages 9–9, Berkeley, CA, USA, 1993. USENIX Association.

[21] VMware, Inc. Introduction to VMware Infrastructure, 2010. http://www.vmware.com/support/pubs/.

[22] M. Zec, M. Mikuc, and M. Zagar. Estimating the impact of interrupt coalescing delays on steady state TCP throughput. In Tenth SoftCOM 2002 Conference, 2002.

[23] J. Zhang, A. Sivasubramaniam, Q. Wang, A. Riska, and E. Riedel. Storage performance virtualization via throughput and latency control. In Proc. of MASCOTS, Sept. 2005.