vTurbo: Accelerating Virtual Machine I/O Processing Using Designated Turbo-Sliced Core

Cong Xu, Sahan Gamage, Hui Lu, Ramana Kompella, Dongyan Xu
Department of Computer Science, Purdue University

Abstract

In a virtual machine (VM) consolidation environment, it has been observed that CPU sharing among multiple VMs will lead to I/O processing latency because of the CPU access latency experienced by each VM. In this paper, we present vTurbo, a system that accelerates I/O processing for VMs by offloading I/O processing to a designated core. More specifically, the designated core, called turbo core, runs with a much smaller time slice (e.g., 0.1ms) than the cores shared by production VMs. Most of the I/O IRQs for the production VMs will be delegated to the turbo core for more timely processing, hence accelerating the I/O processing for the production VMs. Our experiments show that vTurbo significantly improves the VMs' network and disk I/O throughput, which consequently translates into application-level performance improvement.

1 Introduction

Cloud computing is arguably one of the most transformative trends in recent times. Many enterprises and businesses are increasingly migrating their applications to public cloud offerings such as Amazon EC2 [1] and Microsoft Azure [8]. By purchasing or leasing cloud servers with a pay-as-you-go charging model, enterprises benefit from significant cost savings in running their applications, both in terms of capital expenditure (e.g., reduced server costs) and operational expenditure (e.g., management staff). On the other hand, cloud providers generate revenue by achieving good performance for their "tenants" while maintaining reasonable cost of operation.

One of the key factors influencing the cost of cloud platforms is server consolidation: the ability to host multiple virtual machines (VMs) in the same physical server. If cloud providers can increase the level of server consolidation, i.e., pack more VMs into each physical machine, they can generate more revenue from their infrastructure investment and possibly pass cost savings on to their customers. Two main resources typically dictate the level of server consolidation: memory and CPU. Memory is strictly partitioned across VMs, although there are techniques (e.g., memory ballooning [29]) for dynamically adjusting the amount of memory available to each VM. CPU can also be strictly partitioned across VMs, with the trend of an ever-increasing number of cores per physical host. However, given that each core is quite powerful, another major scaling factor comes from allocating multiple VMs per core. While the ever-increasing core count in modern systems may suggest the possibility of a dedicated core per VM, this is not likely to happen any time soon, as evidenced in current cloud computing environments such as Amazon EC2, where a 3GHz CPU may be shared by three small instances.

In practice, CPU sharing among VMs can be quite complicated. Each VM is typically assigned one or more virtual CPUs (vCPUs), which are scheduled by the hypervisor onto physical CPUs (pCPUs) ensuring proportional fairness. The number of vCPUs is usually larger than the number of pCPUs, which means that, even if a vCPU is ready for execution, it may not find a free pCPU immediately and thus needs to wait for its turn, causing CPU access latency. If a VM is running I/O-intensive applications, this latency can have a significant negative impact on application performance. While several efforts [30, 12, 17, 28, 19, 25] have made this observation in the past, and have in fact provided solutions to improve VMs' I/O performance, the improvements are still moderate compared to the available I/O capacity because they do not explicitly focus on reducing the most important and common component of the I/O processing workflow, namely IRQ processing latency.

To explain this more clearly, let us look at I/O processing in modern OSes today. There are typically two basic stages involved. (1) Device interrupts are processed synchronously in an IRQ context in the kernel, and the data (e.g., network data, disk reads) is buffered in kernel buffers. (2) The application eventually copies the data from the kernel buffer to its user-level buffer in an asynchronous fashion, whenever it gets scheduled by the process scheduler. If the OS were running directly on a physical machine, or if there were a dedicated CPU for a given VM, the IRQ processing component would get scheduled almost instantaneously by preempting the currently running process. However, for a VM that shares a CPU with other VMs, IRQ processing may be significantly delayed because the VM may not be running when the I/O event (e.g., a network packet arrival) occurs.
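To make these two stages concrete, the following is a minimal user-space sketch (our illustration only, not kernel code and not vTurbo's implementation); the names kernel_buf, irq_handler, and app_read are hypothetical stand-ins for the kernel's receive buffering and the application's read path.

    /* Toy illustration of the two I/O stages described above (a user-space
     * analogy, not real kernel code). Stage 1: an "IRQ handler" copies newly
     * arrived data into a kernel-side buffer. Stage 2: the application later
     * copies that data into its own buffer when it gets to run. */
    #include <stdio.h>
    #include <string.h>

    #define KBUF_SIZE 4096

    static char   kernel_buf[KBUF_SIZE];  /* stands in for a kernel socket buffer */
    static size_t kbuf_len = 0;

    /* Stage 1: runs synchronously in IRQ context when a packet arrives. */
    static void irq_handler(const char *pkt, size_t len)
    {
        if (kbuf_len + len > KBUF_SIZE)
            return;                        /* buffer full: packet would be dropped */
        memcpy(kernel_buf + kbuf_len, pkt, len);
        kbuf_len += len;
    }

    /* Stage 2: runs later, in process context, whenever the app is scheduled. */
    static size_t app_read(char *dst, size_t max)
    {
        size_t n = kbuf_len < max ? kbuf_len : max;
        memcpy(dst, kernel_buf, n);
        kbuf_len -= n;
        memmove(kernel_buf, kernel_buf + n, kbuf_len);
        return n;
    }

    int main(void)
    {
        char user_buf[64];

        irq_handler("hello", 5);                         /* packet arrives (stage 1) */
        size_t n = app_read(user_buf, sizeof user_buf);  /* app copies it (stage 2)  */
        printf("application received %zu bytes\n", n);
        return 0;
    }

The point of contention in a consolidated setting is the first stage: if the VM's vCPU is not running when the device raises the interrupt, even this small copy into the kernel buffer is deferred until the vCPU is next scheduled.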
IRQ processing delay can affect both network and disk I/O performance significantly. For example, in the case of TCP, incoming packets are staged in the shared memory between the hypervisor (or privileged domain) and the guest OS, which delays ACK generation and can result in significant throughput degradation. For UDP flows, there is no such time-sensitive ACK generation that governs the throughput. However, since there is limited buffer space in the shared memory (ring buffer) between the guest OS and the hypervisor, it may fill up quickly, leading to packet loss. IRQ processing delay can also impact disk write performance. Applications often just write to memory buffers and return. The kernel threads handling disk I/O will flush the data in memory to the disk in the background. As soon as one block write is done, the IRQ handler will schedule the next write, and so on. If the IRQ processing is delayed, write throughput will be significantly reduced.

Unfortunately, none of the existing efforts explicitly tackles this problem. Instead, they propose indirect approaches that moderately shorten IRQ processing latency, hence achieving only modest improvement. Further, because of the specific design choices made in those approaches, the IRQ processing latency cannot be fundamentally eliminated (e.g., made negligible) by any of the designs, meaning that they cannot achieve close-to-optimal performance. For instance, the vSlicer approach [30] schedules I/O-intensive VMs more frequently using smaller micro time slices, which implicitly lowers the IRQ processing latency, but not significantly. It also does not work under all scenarios. For example, if two I/O latency-sensitive VMs and two non-latency-sensitive VMs share the same pCPU, the worst-case IRQ processing latency will be about 30ms, which is still non-trivial, even though it is better than without vSlicer (which would be 90ms). Similarly, another approach called vBalance [10] proposes routing the IRQ to the vCPU that is currently scheduled for the corresponding VM. This may work well for SMP VMs that have more than one vCPU, but it will not improve performance for single-vCPU VMs. Even in the SMP case, it improves the chances that at least one vCPU is scheduled; but fundamentally it does not eliminate IRQ processing latency because each vCPU is contending for the physical CPU independently.

To solve this problem more fundamentally, we aim to make the IRQ processing latency for a CPU-sharing VM almost similar to the scenario where the VM is given a […]

[…] The other regular kernel threads are scheduled on a regular core with regular slicing, just like what exists today. Since the IRQ handlers are executed by the turbo core, they are handled almost synchronously, with orders-of-magnitude smaller latency. For example, assuming 5 VMs and a 0.1ms quantum for the turbo core, an IRQ request is processed within 0.4ms, compared to 120ms (assuming a 30ms time slice for regular cores).
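As a rough worked check of the latency figures quoted above (our arithmetic, under the simplifying assumption of round-robin scheduling in which a just-missed IRQ waits for every other VM's full time slice; the symbols N and q are introduced here):

    \[
      T^{\max}_{\text{wait}} \approx (N - 1)\, q
    \]

where N is the number of VMs sharing the core and q is the per-VM time slice. With N = 5 and q = 30ms on a regular core, the worst case is 4 x 30ms = 120ms; with the 0.1ms turbo-core quantum it is 4 x 0.1ms = 0.4ms. The earlier vSlicer scenario (four VMs with 30ms slices and no vSlicer) gives 3 x 30ms = 90ms; the roughly 30ms worst case quoted with vSlicer corresponds, in this reading, to an IRQ arriving just as one non-latency-sensitive VM begins its full 30ms slice.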
The turbo core is accessible to all VMs in the system. If a VM runs only CPU-bound processes, it may choose not to use this core, since performance there is not likely to be good due to frequent context switches. Even if a VM chooses to schedule a CPU-bound process on the turbo core, it has virtually no impact on other VMs' turbo core access latency, thus providing good isolation between VMs.

We ensure fair sharing among VMs with differing requirements for regular and turbo cores because, otherwise, VMs would be motivated to push more processing to the turbo core. Thus, for example, if there are two VMs—VM1 requesting 100% of the regular core, and VM2 requesting 50% regular and 50% turbo—the regular core will be split 75-25% while VM2 obtains the full 50% of the turbo core, thus equalizing the total CPU usage for both VMs (a worked reading of this example is sketched at the end of this section). We also note that, while we refer to one turbo core in the system, our design seamlessly allows multiple turbo cores, driven by the I/O processing load of all VMs in the host. This makes our design extensible to higher-bandwidth networks (10Gbps and beyond) and higher disk I/O bandwidths that require significant IRQ processing beyond what a single core can provide.

To summarize, the main contributions of this paper are as follows:

(1) We propose a new class of high-frequency scheduling CPU core, named turbo core, and a new class of vCPU, called turbo vCPU. The turbo vCPU, pinned on turbo core(s), is used for timely processing of the I/O IRQs, thus accelerating I/O processing for VMs in the same physical host.

(2) We develop a simple but effective VM scheduling policy giving general CPU cores and the turbo core time slices that differ by orders of magnitude.
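As a worked reading of the fair-sharing example above (our illustration; the symbols n, t_i, r_i, and T are introduced here and the equalization rule is one consistent reading of the quoted numbers, not the paper's stated policy): suppose the scheduler equalizes each VM's total CPU share across the regular and turbo cores,

    \[
      T = \frac{1 + \sum_{i=1}^{n} t_i}{n}, \qquad r_i = T - t_i
    \]

where n is the number of VMs sharing one regular core, t_i is VM i's turbo-core usage (as a fraction of a core), T is the equalized total share per VM, and r_i is the resulting regular-core share. With n = 2, t_1 = 0, and t_2 = 0.5: T = (1 + 0.5)/2 = 0.75, so r_1 = 0.75 and r_2 = 0.25, i.e., the 75-25% split of the regular core, while VM2's total is 0.25 + 0.5 = 0.75, equal to VM1's 0.75.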