Paravirtual Remote I/O
Yossi Kuperman (Technion – Israel Institute of Technology & IBM Research—Haifa); Eyal Moscovici (Technion – Israel Institute of Technology & IBM Research—Haifa); Joel Nider (IBM Research—Haifa); Razya Ladelsky (IBM Research—Haifa); Abel Gordon (IBM Research—Haifa & Stratoscale); Dan Tsafrir (Technion – Israel Institute of Technology)

Abstract

The traditional “trap and emulate” I/O paravirtualization model conveniently allows for I/O interposition, yet it inherently incurs costly guest-host context switches. The newer “sidecore” model eliminates this overhead by dedicating host (side)cores to poll the relevant guest memory regions and react accordingly without context switching. But the dedication of sidecores on each host might be wasteful when I/O activity is low, or it might not provide enough computational power when I/O activity is high. We propose to alleviate this problem at rack scale by consolidating the dedicated sidecores spread across several hosts onto one server. The hypervisor is then effectively split into two parts: the local hypervisor that hosts the VMs, and the remote hypervisor that processes their paravirtual I/O. We call this model vRIO—paraVirtual Remote I/O. We find that by increasing the latency somewhat, it provides comparable throughput with fewer sidecores and superior throughput with the same number of sidecores as compared to the state of the art. vRIO additionally constitutes a new, cost-effective way to consolidate I/O devices (on the remote hypervisor) while supporting efficient programmable I/O interposition.

1. Introduction

Interposition. I/O virtualization decouples logical from physical I/O through an indirection layer. The host exposes a virtual I/O device to its guest virtual machines (VMs). It then intercepts (“traps”) VM requests directed at the virtual device, and it fulfills (“emulates”) them using the physical hardware. This trap-and-emulate approach [28,54] allows the host to interpose on the I/O activity of its guests. The benefits of I/O interposition are substantial [69]: improving utilization of the physical devices via time and space multiplexing; enabling live migration and suspend/resume using the indirection layer to decouple and recouple VMs from/to the physical device; enabling seamless switching between different I/O channels, which allows for device aggregation, load balancing, and failure masking; and supporting such features as monitoring, software-defined networking (SDN), file-based images, replication, deduplication, snapshots, and security related functionalities such as record-replay, encryption, firewalls, and intrusion detection.
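To make the trap-and-emulate flow concrete, the following is a minimal toy sketch in C of the indirection layer: the host intercepts every request aimed at the virtual device, interposes (here, simple byte metering), and only then fulfills the request on the physical hardware. All names (io_request, host_trap_handler, physical_write) are illustrative and correspond to no real hypervisor's API.

#include <stdint.h>
#include <stdio.h>

/* A guest I/O request as seen at the virtual-device boundary. */
struct io_request {
    uint64_t sector;
    uint32_t len;
};

static uint64_t bytes_metered;  /* interposition state (e.g., metering) */

/* Stand-in for driving the real, physical device. */
static void physical_write(const struct io_request *req)
{
    printf("physical device: %u bytes at sector %llu\n",
           req->len, (unsigned long long)req->sector);
}

/* The "emulate" step: runs in the host on every trapped guest I/O
 * request, letting the host interpose (here, account bytes) before
 * fulfilling the request on the physical hardware. */
static void host_trap_handler(const struct io_request *req)
{
    bytes_metered += req->len;
    physical_write(req);
}

int main(void)
{
    /* The "trap" step: a guest write to its virtual device lands here. */
    struct io_request req = { .sector = 42, .len = 512 };
    host_trap_handler(&req);
    printf("host metered %llu bytes via interposition\n",
           (unsigned long long)bytes_metered);
    return 0;
}

The point of the sketch is that every interposition feature listed above hangs off the single host_trap_handler choke point; it is also why every request costs a guest-host context switch.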
SRIOV. The virtual I/O indirection layer hinders performance, mainly due to the overhead generated by exits [1], namely, the guest-host context switches upon I/O operations, which are trapped and emulated. This overhead has motivated hardware vendors to develop the Single Root I/O Virtualization (SRIOV) technology [34,52], whereby a physical I/O device can self-virtualize, supporting multiple virtual instances of itself that can be individually assigned to VMs. SRIOV thus bypasses the host, which is largely removed from the critical data path. The result is significantly better performance for I/O intensive workloads [41,44,46,56,72,76]. There is a serious drawback, however, to using SRIOV, as it eliminates the indirection layer and thus provides only multiplexing out of the benefits of virtual I/O listed above. Notably, SRIOV negates live migration, memory overcommitment, metering, and other such features that rely on interposition and that are crucial for VM environments.

Paravirtualization. Consequently, real-world applications of virtualization today, including most enterprise data centers and cloud computing sites, seldom use SRIOV. Instead, they realize virtual I/O through paravirtualization [11,58,66,70], exemplified by VMware VMXNET3 [67], KVM virtio [58], and Xen PV [11]. Paravirtual I/O boosts the performance of the virtualization indirection layer while retaining all the aforementioned benefits of interposable virtual I/O. Paravirtualization achieves this goal by requiring guests to install a host-specific device driver that is purely software based. The latter is not modeled after any real device. Rather, it is optimized to minimize the overheads associated with guest-host communication and context switching.

Sidecores. While offering an improvement, paravirtualization still hampers performance due to the exits it induces [13,22,40,44,57,74]. Recent studies show that the sidecore approach is effective in combating this problem [4,13,31,39,43,45]. The host's cores are divided into (1) sidecores dedicated to processing virtual I/O, and (2) VMcores for running the VMs that generate the I/O. Each VM writes its I/O requests to memory shared with the host, as is usual for the paravirtual I/O model. Unusually, however, the VM does not trigger an exit. Instead, the host sidecore polls the relevant memory region and processes the request on arrival. The benefit of this approach is twofold: more cycles for the VMs (whose OSes are asynchronous in nature and do not block-wait for I/O) and less cache pollution, yielding a substantial performance improvement. For example, the Elvis system—which implements paravirtual block and net devices on sidecores—is up to 3x more performant than baseline paravirtualization and is oftentimes on par with SRIOV [31].
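The sidecore mechanism can be sketched as follows in C, assuming (in the spirit of virtio-style paravirtual I/O) one single-producer/single-consumer request ring per VM in shared memory; the layout and names below are illustrative, not Elvis's or virtio's actual structures. Note that the loop never blocks and never waits for an exit or interrupt, which is exactly why a sidecore consumes 100% of its cycles regardless of load.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 256
#define NUM_VMS   8

struct req { uint64_t addr; uint32_t len; };

/* One single-producer/single-consumer request ring per VM, living in
 * memory shared between the guest (producer) and the host (consumer). */
struct ring {
    _Atomic uint32_t head;           /* advanced by the guest          */
    uint32_t tail;                   /* private to the sidecore        */
    struct req slots[RING_SIZE];
};

static struct ring vm_rings[NUM_VMS];  /* stands in for mapped guest memory */

static void process_request(int vm, const struct req *r)
{
    /* stand-in for emulating the paravirtual I/O request */
    printf("vm %d: request addr=%llu len=%u\n",
           vm, (unsigned long long)r->addr, r->len);
}

/* The sidecore's event loop: no exits, no interrupts, no blocking.
 * A host thread pinned to the dedicated core runs this forever. */
static void sidecore_loop(void)
{
    for (;;) {
        for (int vm = 0; vm < NUM_VMS; vm++) {
            struct ring *rg = &vm_rings[vm];
            uint32_t head = atomic_load_explicit(&rg->head,
                                                 memory_order_acquire);
            while (rg->tail != head) {   /* drain newly posted requests */
                process_request(vm, &rg->slots[rg->tail % RING_SIZE]);
                rg->tail++;
            }
        }
    }
}

int main(void)
{
    /* post one request as a guest would, then run the sidecore */
    vm_rings[0].slots[0] = (struct req){ .addr = 4096, .len = 512 };
    atomic_store_explicit(&vm_rings[0].head, 1, memory_order_release);
    sidecore_loop();                     /* spins forever, by design */
}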
The sidecore virtual I/O model has an inherent drawback. A sidecore must continuously poll the relevant memory regions of its associated VMs. Therefore, 100% of the sidecore's cycles are consumed, even when the I/O load generated by these VMs is light. In this case, polling wastes valuable CPU cycles that could otherwise have been used to support other, more productive activity. Conversely, it is also possible that the load generated by the corresponding VMs exceeds the processing capabilities of their designated sidecores, and thus the sidecores might become a bottleneck in the system.

Sidecore Consolidation. The most notable benefit of virtualization is, arguably, resource consolidation, enabling multiple OSes to multiplex the physical hardware. Experience shows that multiple virtual servers can be adequately serviced by far fewer physical servers. We contend that the same reasoning holds for sidecores, and that applying it will alleviate the problem of having too few or too many per-server sidecores, making this newer I/O model more cost effective.

vRIO. We apply sidecore consolidation at rack scale: VMs run on VMhosts, while dedicated IOhosts run the consolidated sidecores. VMhosts do not process (and are unaware of) the paravirtual I/O of their VMs. Consolidating the sidecores in IOhosts allows us to achieve superior performance using the same number of CPUs, or comparable performance using fewer CPUs. In exchange for reducing the CPU number, vRIO requires additional networking infrastructure in the rack; we show how this tradeoff yields a cheaper system.

Device Consolidation. Supporting the cost-effectiveness of the net/CPU tradeoff is the fact that vRIO introduces a new type of device consolidation for VMs. Splitting the hypervisor into local and remote components makes the remote devices at the IOhost efficiently accessible to local VMs via the paravirtual I/O interface. Note that this feature is complementary to existing device consolidation approaches, such as storage area networks (SANs). The reason: if the VM is configured to use a SAN directly (not as a virtual device), then we lose interposition entirely. Otherwise, the SAN is exposed to the VM as a traditional paravirtual device, and so the associated overheads of traditional paravirtualization come into play yet again.

Hypervisor Independence. The vRIO model bypasses the local hypervisor, so the vRIO driver is independent of the local hypervisor, thus providing several additional advantages. It is possible, for example, to implement a layer-2 virtual LAN that seamlessly works for virtualized OSes across different processor architectures and hypervisor types. This feature is valuable, as there are nowadays fewer (guest) OSes in widespread use than hypervisors. vRIO also supports bare metal (non-virtual) OSes that install its driver; e.g., they too can be easily incorporated in the aforementioned LAN. Notably, system-wide services and policies can be implemented for all hypervisor types in one location. Moreover, the IOhost hardware can be specialized for I/O processing.

Minimizing Latency. To enjoy the benefits of vRIO, we must make the penalty of adding one hop to the I/O path tolerable. To this end, somewhat surprisingly, we utilize SRIOV. Whereas SRIOV negates interposition in existing I/O models, vRIO is able to use it in a compatible manner. Specifically, the vRIO paravirtual driver installed in the VM generates the same I/O as in traditional paravirtualization. But instead of handing the I/O to the local hypervisor via shared memory, the vRIO driver communicates it to the IOhost through an SRIOV channel, as sketched below. Additionally, vRIO entirely eliminates the overhead of interrupt processing from the virtualization layer by: (1) coupling the SRIOV channel with
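For illustration, the guest-side send path just described might look roughly as follows in C; the vrio_hdr frame layout and the vf_send() primitive are assumptions made for this sketch, not the paper's actual wire protocol.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical header prepended to each paravirtual request so the
 * IOhost can demultiplex it; the real vRIO wire format may differ. */
struct vrio_hdr {
    uint32_t vm_id;      /* which VM the request belongs to */
    uint16_t queue;      /* which virtual queue/device      */
    uint16_t opcode;     /* e.g., read/write                */
};

/* Assumed primitive: transmit a raw frame on the VM's SRIOV virtual
 * function (VF), bypassing the local hypervisor entirely. */
static void vf_send(const void *frame, size_t len)
{
    printf("VF tx: %zu-byte frame to IOhost\n", len);
}

/* Guest-side submit path: the driver produces the same paravirtual
 * request as in traditional paravirtualization, but encapsulates it
 * and ships it to the remote IOhost instead of placing it in memory
 * shared with the local hypervisor. */
static void vrio_submit(uint32_t vm_id, uint16_t queue, uint16_t opcode,
                        const void *req, size_t req_len)
{
    uint8_t frame[2048];
    struct vrio_hdr hdr = { vm_id, queue, opcode };

    if (sizeof(hdr) + req_len > sizeof(frame))
        return;                            /* oversized request: drop */
    memcpy(frame, &hdr, sizeof(hdr));
    memcpy(frame + sizeof(hdr), req, req_len);
    vf_send(frame, sizeof(hdr) + req_len);
}

int main(void)
{
    uint64_t fake_descriptor = 0xabcd;     /* stand-in paravirtual request */
    vrio_submit(7, 0, 1, &fake_descriptor, sizeof(fake_descriptor));
    return 0;
}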