Understanding the Impact of Cache Locations on Storage Performance and Energy Consumption of Virtualization Systems
Tao Lu§, Ping Huang§, Xubin He§, Ming Zhang‡
§ Virginia Commonwealth University, USA    ‡ EMC Corporation, USA

Abstract

As per-server CPU cores and DRAM capacity continuously increase, the application density of virtualization platforms rises. High application density imposes tremendous pressure on storage systems, and layers of caches are deployed to improve storage performance. Owing to its manageability and transparency advantages, hypervisor-side caching is widely employed. However, hypervisor-side caches sit below the VM disk filesystems, so the critical path of cache access involves the virtual I/O sub-path, which is expensive in operation latency and CPU cycles. The virtual I/O sub-path caps the throughput (IOPS) of the hypervisor-side cache and incurs additional energy consumption. It is viable to directly allocate spare cache resources, such as the DRAM of a host machine, to a VM to build a VM-side cache and thereby avoid the I/O virtualization overheads. In this work, we quantitatively compare the performance and energy efficiency of VM-side and hypervisor-side caches based on DRAM, SATA SSD, and PCIe SSD devices. Insights from this work can guide the design of high-performance, energy-efficient virtualization systems.

1 Introduction

Virtualization techniques are the backbone of production clouds. As CPU cores and memory capacities increase, the virtual machine density of production clouds grows, and disk storage becomes a salient performance bottleneck. DRAM- or SSD-based caches [1-4] have been deployed in virtualization storage stacks to accelerate the I/O of guest virtual machines. Currently, caches are commonly implemented in hypervisors for several reasons. First, hypervisor-side caches are transparent to VMs and can therefore be used unmodified across many guest operating systems. Second, a hypervisor has complete control over system resources and can thus make the most informed cache management decisions, such as cache capacity allocation and page replacement [5]. Third, compared with VM-side caches, hypervisor-side caches can be shared, resulting in increased resource utilization [2].

Despite these management flexibilities, hypervisor-side cache access involves the costly I/O virtualization layers, which cause intensive activity in userspace processes such as qemu and incur considerable interrupt deliveries [6]. A study of virtio [7] with network transactions shows that a busy virtualized web server may consume 40% more energy than its non-virtualized counterpart because it spends 5x more CPU cycles to deliver a packet [8]. Similarly, our tests on storage transactions show that I/O virtualization caps hypervisor-side cache performance and increases per-I/O energy consumption. In contrast, VM-side caches can avoid the I/O virtualization overheads. More specifically, for 4KB read requests at 5k IOPS, hypervisor-side DRAM caches consume about 3x the power and deliver about 30x the per-I/O latency of VM-side DRAM caches. For a server hosting four I/O-intensive VMs, hypervisor-side caching consumes 48 watts while VM-side caching consumes only 13 watts; the idle power of the server is 121 watts. In other words, compared with hypervisor-side caching, VM-side caching can save about 25% of the entire server's active power. Unfortunately, the benefits of VM-side caching have long been ignored. To raise awareness of these benefits, as a first step we conduct empirical comparisons between hypervisor-side (host-side) and VM-side (guest-side) caches. We believe answering the following questions is helpful for future cache designs:

• How much are the performance and energy efficiency penalties of hypervisor-side caching?
• What are the root causes of these penalties?
• Why do these penalties need to be mitigated?
• What are the potential approaches to reducing them?

In this paper, we present empirical studies of these questions on a KVM [9] virtualization platform using DRAM, a SATA SSD, and a PCIe SSD as cache devices. Our evaluation demonstrates that VM-side DRAM caches yield enticing performance and energy efficiency gains over their hypervisor-side counterparts. For a SATA SSD, whether it is attached to a VM to build a VM-side cache or used to build a hypervisor-side cache, virtio is involved in the device access path; as a result, VM-side SATA SSD caches provide no benefit over their hypervisor-side counterparts. However, PCI passthrough enables a VM to bypass the virtual I/O path and access a PCIe SSD directly in a dedicated way, achieving better performance and energy efficiency.
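For intuition, the energy figures above can be related through a simple quantity: if a caching configuration draws P_active watts above idle while sustaining R requests per second, each request costs roughly P_active / R joules. The relation below is illustrative only; the 48 W and 13 W values are the ones quoted above, while the 200,000 IOPS rate is a hypothetical number, not a measurement from this paper.

```latex
% Marginal energy per I/O (illustrative relation, not a formula from the paper)
E_{\mathrm{io}} \approx \frac{P_{\mathrm{active}}}{R}
% Hypothetical example: P_active = 48 W at R = 200{,}000 IOPS
%   gives E_io = 48 / 200000 = 0.24 mJ per request, while
%   13 W at the same (hypothetical) rate gives about 0.065 mJ per request.
```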
[Figure 1: System components affected by a VM block device operation. DRAM-based page caches are built-in modules of operating systems. Flash-based caches are optional but widely deployed in virtualization platforms to accelerate VM storage.]

2 Problem Analysis

In this section, we first describe storage access for VMs and explain which storage components are involved when a cache hit occurs at the VM side or at the hypervisor side. We then analyze where the virtual I/O overheads lie.

2.1 Storage Access for VMs

Block devices are commonly exposed to VMs via emulation. The host-side entity of a virtualized block device can be a file, an LVM logical volume, a device partition, or a whole device. Since device emulation incurs overheads, dedicated device allocation, also known as device passthrough, allows a device to be used exclusively by a VM without involving the I/O virtualization layers. However, not all devices can be allocated via passthrough: among block devices, currently only PCI-based devices such as PCIe SSDs can be assigned to VMs via PCI passthrough, while SATA devices cannot. In this paper we assume the persistent storage of VMs is backed by emulated block devices; when we discuss the implementations of SSD-based guest-side caches, we compare the performance and energy efficiency of SSDs connected to VMs via PCI passthrough and via virtio, respectively.

For efficient emulation of block devices, paravirtualization is the standard solution. Two of the most widely used paravirtualized device driver frameworks are virtio [7] and Xen paravirtualization [10]. The former is widely used in QEMU/KVM-based virtualization platforms; the latter comes from the Xen project. Since Xen PV and virtio are architecturally similar, we discuss the virtio-based storage stack in detail. As shown in Figure 1, assuming a guest OS issues a read request on a disk file, the activities of the guest OS and host OS components are as follows (a minimal guest-side probe that exercises this path is sketched after the list):

1. The read request activates a Virtual Filesystem (VFS) function, passing it a file descriptor and an offset.
2. If the request does not indicate direct I/O, the VFS function determines whether the required data are available in the page cache (1a). If 1a hits, the data are returned from the cache and the request is completed.
3. If the page cache 1a misses, the guest OS kernel must read the data from the block device. If the requested data reside in the flash cache, it is a flash cache hit at 1b; the data are fetched from the flash device and HDD access is avoided.
4. If the flash cache also misses, the request goes through the traditional generic block and I/O scheduler layers and is then served by the virtual I/O device driver, such as virtio-blk.
5. For a read request, the virtio-blk frontend driver composes a request entry and places it in the descriptor table of the virtqueue. The frontend driver then calls virtqueue kick, which causes a guest I/O exit and triggers a hardware register access called VIRTIO_PCI_QUEUE_NOTIFY.
6. vhost is the host-side virtio component that completes the virtual I/O request. Once vhost is notified by KVM of the guest kick, it fetches the virtio request from the queue and calls QEMU, which works as a regular userspace process, to complete the data transfer.
7. To fetch the data, QEMU issues I/O requests that again traverse the host OS storage stack. The host-side page cache and the optional flash cache are checked in turn. Once the data hit in a cache or have been fetched from the HDD, vhost updates the status bit of the virtio request and issues an irqfd interrupt to notify the guest that the request is completed.
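To make the guest-side path concrete, the sketch below shows the kind of minimal probe that drives steps 4-7 from inside a guest: it opens a block device with O_DIRECT so the guest page cache (1a) is bypassed and every 4 KB random read is pushed down through virtio-blk. This is not the authors' benchmark code; the device path, region size, and request count are placeholders.

```c
/* Minimal 4 KB random-read probe (sketch; placeholders, not the paper's tool). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLK   4096
#define SPAN  (1ULL << 30)      /* 1 GiB region to draw offsets from (placeholder) */
#define REQS  100000            /* number of random reads (placeholder) */

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void) {
    /* O_DIRECT bypasses the guest page cache, so each read traverses virtio-blk. */
    int fd = open("/dev/vdb", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLK, BLK)) return 1;      /* O_DIRECT needs alignment */

    srand(42);
    double start = now_us();
    for (long i = 0; i < REQS; i++) {
        off_t off = (off_t)((unsigned long long)rand() % (SPAN / BLK)) * BLK;
        if (pread(fd, buf, BLK, off) != BLK) { perror("pread"); return 1; }
    }
    double elapsed = now_us() - start;

    printf("avg latency: %.1f us, IOPS: %.0f\n",
           elapsed / REQS, REQS / (elapsed / 1e6));
    free(buf);
    close(fd);
    return 0;
}
```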
2.2 I/O Virtualization Overheads

In the virtual I/O path, two operations are expensive in CPU cycles or request latency. The first is virtual I/O emulation, which requires intensive interactions between the virtio frontend and backend; emulation causes frequent I/O interrupts and guest I/O exits, which are expensive in CPU cycles and increase I/O latency. The second is the relatively slow HDD-based storage access, which costs milliseconds and has long been the bottleneck of cloud applications.

Both VM-side and hypervisor-side cache hits avoid the slow HDD access.

[Figure 2: (a) Maximum randread throughput (IOPS) at various I/O sizes; (b) latency distributions (percentiles) of 4KB randread; (c) power consumption (watts) of 4KB randread at various throughput levels (IOPS); each panel compares the Native, VM_side, and Hypervisor_side configurations.]
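For reference, latency-distribution panels like Figure 2(b) are typically produced by collecting per-request latencies and reading off percentiles after sorting; a minimal sketch of that post-processing is below. The sample values are made up for illustration and are not data from this paper.

```c
/* Sketch of percentile post-processing for per-request latencies (illustrative). */
#include <stdio.h>
#include <stdlib.h>

static int cmp_dbl(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Return the p-th percentile (0 < p <= 100) of n sorted samples (nearest-rank style). */
static double percentile(const double *sorted, size_t n, double p) {
    size_t idx = (size_t)((p / 100.0) * n);
    if (idx >= n) idx = n - 1;
    return sorted[idx];
}

int main(void) {
    double lat_us[] = { 12.1, 9.8, 10.4, 250.0, 11.3, 9.9, 10.0, 13.7 }; /* fake data */
    size_t n = sizeof(lat_us) / sizeof(lat_us[0]);

    qsort(lat_us, n, sizeof(double), cmp_dbl);

    const double pcts[] = { 50, 95, 99 };
    for (size_t i = 0; i < sizeof(pcts) / sizeof(pcts[0]); i++)
        printf("p%.0f = %.1f us\n", pcts[i], percentile(lat_us, n, pcts[i]));
    return 0;
}
```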