Crane: Fast and Migratable GPU Passthrough for OpenCL Applications

James Gleeson, Daniel Kats, Charlie Mei, Eyal de Lara
University of Toronto, Toronto, Canada
{jgleeson,dbkats,cmei,delara}@cs.toronto.edu

ABSTRACT

General purpose GPU (GPGPU) computing in virtualized environments leverages PCI passthrough to achieve GPU performance comparable to bare-metal execution. However, GPU passthrough prevents service administrators from performing virtual machine migration between physical hosts. Crane is a new technique for virtualizing OpenCL-based GPGPU computing that achieves within 5.25% of passthrough GPU performance while supporting VM migration. Crane interposes a virtualization-aware OpenCL library that makes it possible to reclaim and subsequently reassign physical GPUs to a VM without terminating the guest or its applications. Crane also enables continued GPU operation while the VM is undergoing live migration by transparently switching between GPU passthrough operation and API remoting.

CCS Concepts

•Computing methodologies → Graphics processors; •Computer systems organization → Heterogeneous (hybrid) systems; •Software and its engineering → Virtual machines;

Keywords

virtualization, live migration, passthrough, GPU, GPGPU, OpenCL

1. INTRODUCTION

General purpose GPU (GPGPU) computing is being used to accelerate parallel workloads as varied as training deep neural networks for machine learning [2], encoding high-definition video for streaming [42], and molecular dynamics [20]. GPGPU programming is primarily done according to one of two frameworks: CUDA for NVIDIA's GPUs, and OpenCL, which is implemented on many different GPUs from NVIDIA and other vendors. Major cloud "infrastructure as a service" (IaaS) providers now have virtualization offerings with dedicated GPUs that customers can leverage in their guest virtual machines (VMs). For example, Amazon Web Services allows VMs to take advantage of NVIDIA's high-end GPUs in EC2 [6]. Microsoft and Google recently introduced similar offerings on their Azure and GCP platforms respectively [34, 21].

Virtualized GPGPU computing leverages modern IOMMU features to give the VM direct use of the GPU through DMA and interrupt remapping, a technique which is usually called PCI passthrough [17, 50]. By running the GPU driver in the context of the guest operating system, PCI passthrough bypasses the hypervisor and achieves performance that is comparable to bare-metal OS execution [47].

Unfortunately, GPU passthrough is incompatible with VM migration for two key reasons: (1) the hypervisor cannot safely interrupt guest GPU access from the guest GPU drivers and GPGPU applications, and (2) the hypervisor cannot extract and reinject GPU state. Without the ability of the hypervisor to interrupt the guest OS's access to the GPU, it is impossible for the hypervisor to reclaim the GPU from the guest OS without the guest OS crashing; we confirm this behaviour in the Xen hypervisor (Section 2.1). Without the ability of the hypervisor to extract GPU hardware state prior to migration, any data stored by guest GPGPU applications in GPU device memory are lost once the guest moves to a new machine.

We present Crane, a new approach for achieving passthrough GPU performance without sacrificing VM migration for virtualized OpenCL GPGPU applications. Crane allows both stop-and-copy and live migration. Crane does not require modification to applications, the hypervisor, and, importantly, guest OS drivers, which are often proprietary. Crane implements virtualization using only the portable OpenCL API, allowing Crane to achieve vendor independence: the ability to migrate any current and future GPU model that supports the OpenCL API without vendor support.
Crane operationally consists of three main components: a transparent interposition library, a guest migration daemon, and a host migration daemon. When the guest OS is not migrating, the OpenCL applications run at passthrough speeds with OpenCL calls sent directly to the GPU. While running, the interposition library interposes on OpenCL calls by tracking the state needed to recreate OpenCL objects. When an administrator intends to migrate, Crane signals this intent to the guest migration daemon, which coordinates the pre-migration stage. During pre-migration, the guest migration daemon coordinates both extracting GPU state to DRAM and safely interrupting guest GPU access. Crane implements a universal GPU device memory extraction mechanism built using portable OpenCL APIs for saving GPU state to DRAM. Guest GPU access is safely paused by releasing OpenCL objects, unloading OpenCL shared library handles, and removing guest GPU drivers. With the GPU unused, the host migration daemon safely reclaims the GPU and commences VM migration. Crane enables live migration [13] by temporarily switching from passthrough GPU execution to API remoting, which forwards OpenCL commands over RPC to a stationary proxy domain on the source host. The proxy domain is pre-provisioned with a dedicated passthrough GPU to minimize the time to transition to API forwarding. Once the guest migrates, Crane leverages support in modern OSes and hypervisors for hot-plugging PCI devices to re-attach a passthrough GPU [17, 50], and executes the post-migration stage. During post-migration, Crane coordinates resuming execution using the passthrough GPU by reinjecting GPU device memory using its universal mechanism, and recreating OpenCL objects from tracked API state.

This paper makes the following contributions:

• An approach to GPU virtualization that simultaneously achieves: Passthrough performance, within 5.25% of bare-metal passthrough GPU performance; Vendor independence, Crane uses vendor-standardized OpenCL APIs to implement virtualization, allowing it to work for current and future GPU models; and VM migration, Crane supports both stop-and-copy and live migration modes.

• A method for extraction/reinjection of guest GPU state and safe interruption of guest GPU access without terminating the guest or its OpenCL applications. The approach makes use of standard OpenCL APIs to maintain transparency, thereby avoiding application, driver, and hypervisor modifications.

• A method for transparently extracting and injecting OpenCL state, enabling OpenCL applications to continue to execute (albeit at lower performance) while the VM is undergoing live migration via API remoting.

The rest of the paper is structured as follows. Section 2 provides a quick overview of OpenCL and introduces GPU passthrough and how it prevents VM migration. Sections 3 and 4 describe the design and implementation of Crane. Section 5 discusses our solution for dynamically unloading closed-source GPU drivers. Section 6 presents our experimental evaluation. Section 7 discusses related work. Section 8 discusses future work such as extending Crane to CUDA. Finally, Section 9 concludes the paper.

2. BACKGROUND

We begin by explaining the key reasons why GPU passthrough is incompatible with VM migration. Next, we explain how OpenCL is used as a GPGPU programming language. This leads us to the design of Crane, which addresses the barriers to using VM migration with OpenCL applications that use GPU passthrough.

2.1 GPU passthrough

Passthrough is a technique that allows guest OS GPU drivers direct use of GPUs as if the guest OS was installed on the system as a bare-metal OS. Passthrough takes advantage of the input-output memory management unit (IOMMU) found on modern chipsets, which allows guests unmediated access to GPU device memory, thereby bypassing any hypervisor involvement. As a result, the guest VM attains GPU performance comparable to a bare-metal OS [54].

Unfortunately, using passthrough sacrifices the ability to migrate VMs. To prove that the hypervisor cannot safely interrupt GPU access, we conducted a small experiment with the Xen hypervisor. Modern hypervisors are able to dynamically detach and reattach PCI devices in passthrough without requiring a restart of the guest VM [49, 30]. Our experiment consists of running an OpenCL application in the guest OS, then issuing a detach of its GPU card while the application is running. After the card detaches, the entire guest OS crashes (for the experimental setup see Section 6.1). Hence, detaching the card in the hypervisor is equivalent to ripping the physical GPU card out of the PCI-express slot of a bare-metal OS installation.

Even if passthrough GPU access could be interrupted by the hypervisor, the hypervisor has no method of extracting state from the GPU and reinjecting it into the GPU on the new host. Proprietary GPU drivers provide no standard API for GPU state extraction/reinjection.

A simple approach is to terminate the OpenCL applications before migrating, then restart them on the new host; however, the sheer economic cost [5] in lost computational resources precludes such an approach in industry for batch processing jobs, which can take many hours to execute [35].

2.2 OpenCL programming

The OpenCL API describes a cross-platform method of writing and executing massively data-parallel programs known as kernels on a device such as a GPU. The general flow of an OpenCL application is as follows:

1. Device configuration: Query and select an available GPU to use throughout the program.

2. Resource allocation: Kernels are compiled for the target device, and OpenCL memory buffers are used to write GPU memory as input.

3. Program execution: Kernels are enqueued for execution on the GPU, and results are read from GPU device memory back to DRAM.

4. Cleanup: Command queues are flushed, and any allocated OpenCL resource handles are freed.

The standardized OpenCL API empowers developers with application portability across all GPU vendors and models, including popular proprietary vendors like NVIDIA and AMD. As a result, the OpenCL shared library provides a universal API over all GPU models, and even other types of data-parallel hardware (e.g. FPGAs [4]).
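To make this four-step flow concrete, the following is a minimal sketch of an OpenCL host program in C. The kernel, buffer size, and scale factor are illustrative only, and error handling is omitted for brevity:

```c
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *buf, float factor) {"
    "    size_t i = get_global_id(0);"
    "    buf[i] *= factor;"
    "}";

int main(void) {
    /* 1. Device configuration: pick the first GPU on the first platform. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 2. Resource allocation: compile the kernel and write input to GPU memory. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);
    float host[1024] = {0};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(host), NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);

    /* 3. Program execution: enqueue the kernel, then read results back to DRAM. */
    float factor = 2.0f;
    size_t global = 1024;
    clSetKernelArg(k, 0, sizeof(buf), &buf);
    clSetKernelArg(k, 1, sizeof(factor), &factor);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);

    /* 4. Cleanup: flush the queue and release all OpenCL handles. */
    clFinish(q);
    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    printf("first element: %f\n", host[0]);
    return 0;
}
```

Every step touches only opaque handles returned by the library, which is what makes transparent interposition on this API possible.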
3. DESIGN

Crane allows OpenCL applications to achieve GPU passthrough performance without sacrificing VM migration. Crane enables VM migration by tracking OpenCL API state as the application runs, and by providing a universal state extraction/reinjection API built using only portable OpenCL API calls. When the hypervisor signals its intention to migrate the VM, prior to detaching the card and migrating, Crane coordinates extracting GPU device memory into DRAM and unloading the guest GPU driver. After migration completes and a new GPU is attached, Crane uses its universal API to reinject GPU device memory and reconstruct OpenCL API state.

Figure 1: The Crane architecture consists of host/guest migration daemons and a transparent OpenCL interposition library; Crane makes no modification to OpenCL applications, guest OS drivers, or the hypervisor. The dotted line represents the user-kernel boundary.

Figure 2: Crane uses a passthrough GPU in tracked-passthrough mode, and transitions to API remoting during live migration. (Panels: 1) Tracked-passthrough: OpenCL state tracking; OpenCL uses the passthrough GPU. 2) Pre-migration: extract guest GPU state; reinject GPU state into the proxy GPU; detach the guest GPU. 3) API remoting: OpenCL uses RPCs to the proxy GPU; the guest live migrates to the new host. 4) Post-migration: attach the guest GPU; extract state from the proxy GPU; reinject state on the guest GPU.)

Crane's approach to VM migration has three important benefits:

1. Vendor independence. Crane enables migration of GPU guests for all GPU vendors and all GPU models that support the OpenCL API. Crane will support future GPU models as long as the OpenCL API remains the same.

2. Passthrough performance. OpenCL applications achieve passthrough GPU performance when a guest is not migrating.

3. No application, hypervisor, or OS driver modifications. Crane is a highly maintainable approach that does not require OpenCL application modification or recompilation.

The rest of this section describes how Crane handles stop-and-copy migration. We then discuss how this functionality can be extended to support live migration.

3.1 Stop-and-copy migration

The simplest form of migration is stop-and-copy migration [38]. In stop-and-copy migration, the guest VM is sent a suspend signal. The suspend signal triggers the guest OS to read in device state from attached PCI devices, save the state to DRAM, and switch the DRAM into refresh mode. After device state has been preserved in DRAM, devices can be safely detached from the VM, and DRAM can be copied as-is to the new host. When the guest is resumed, the devices on the new host are powered back on, and the device state is loaded back into the new devices.

Unfortunately, there is no standard programming interface for (1) suspending GPU device state to DRAM, (2) safely detaching a passthrough device from a running guest, and (3) resuming GPU device state. Instead, during suspend, the GPU remains powered on, with its device memory being preserved on the device itself. Since there is no kernel API to force GPU device state to be preserved in DRAM during suspend, GPU PCI passthrough is incompatible with guest migration.
3.2 Enabling stop-and-copy migration

Crane supports stop-and-copy migration from a source host to a destination host by solving several challenges.

Extract GPU memory state: Before migration can begin, GPU device memory must be preserved in DRAM. The kernel provides no standard API for extracting GPU device memory to DRAM, and proprietary drivers are a barrier to creating one. However, the OpenCL API is already standardized across all platforms. Crane leverages OpenCL to create a universal GPU state extraction/reinjection API that works across GPU vendors and models. Crane coordinates GPU state extraction across all guest OpenCL applications prior to migration.

Track additional OpenCL API state: Unlike GPU device memory, not all OpenCL state can be extracted using existing OpenCL API calls. Crane must track function arguments used to create OpenCL objects, while retaining GPU passthrough performance.

Source host reclaims the GPU from the guest: To ensure the guest and its OpenCL applications do not crash when the hypervisor reclaims the GPU, Crane must first coordinate unloading the OpenCL shared libraries and the guest GPU driver.

Perform migration: With the GPU state preserved in DRAM, existing DRAM-based hypervisor mechanisms for stop-and-copy migration can be used.

Destination host hot plugs the new GPU into the guest: After migrating to the new host, the new GPU can be "hot plugged" into the guest, and kernel GPU drivers reloaded.

Reinject GPU state: Crane coordinates recreating OpenCL API state and reinjecting GPU device memory into the new GPU.
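As a rough illustration of the extract/reinject mechanism, the following C sketch shows how GPU device memory can be preserved and restored using only the portable OpenCL buffer APIs. The tracked_buffer structure and function names are illustrative bookkeeping, not Crane's actual interfaces:

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Illustrative bookkeeping kept for every cl_mem the application creates
 * (the real tracking mechanism is described in Section 4.1.1). */
struct tracked_buffer {
    cl_mem       handle;   /* current device-side handle            */
    size_t       size;     /* size passed to clCreateBuffer         */
    cl_mem_flags flags;    /* flags passed to clCreateBuffer        */
    void        *shadow;   /* DRAM copy filled during pre-migration */
};

/* Pre-migration: pull every tracked buffer's contents into DRAM so that
 * the hypervisor's ordinary page-based migration mechanisms carry it. */
void extract_gpu_state(cl_command_queue q, struct tracked_buffer *bufs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        bufs[i].shadow = malloc(bufs[i].size);
        clEnqueueReadBuffer(q, bufs[i].handle, CL_TRUE /* blocking */,
                            0, bufs[i].size, bufs[i].shadow, 0, NULL, NULL);
    }
    clFinish(q);
}

/* Post-migration: recreate each buffer on the newly attached GPU and write
 * the preserved DRAM copy back into device memory. */
void reinject_gpu_state(cl_context ctx, cl_command_queue q,
                        struct tracked_buffer *bufs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        bufs[i].handle = clCreateBuffer(ctx, bufs[i].flags, bufs[i].size, NULL, NULL);
        clEnqueueWriteBuffer(q, bufs[i].handle, CL_TRUE,
                             0, bufs[i].size, bufs[i].shadow, 0, NULL, NULL);
        free(bufs[i].shadow);
        bufs[i].shadow = NULL;
    }
    clFinish(q);
}
```

Because only clEnqueueReadBuffer and clEnqueueWriteBuffer are used, the same mechanism works on any GPU whose vendor ships an OpenCL implementation.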
3.3 Live migration

The primary disadvantage of stop-and-copy migration is that while sending guest memory over the network, the entire guest OS and its applications are suspended; in our setup, a guest OS with 2 GB of DRAM is forced to wait 40 seconds before resuming. It is essential that OpenCL applications that have a user-facing component (e.g. a video encoder/decoder) continue to run with minimal stoppage time during live migration, albeit at reduced performance.

To overcome VM stoppage times during migration, modern hypervisors perform live migration, where the guest OS is stopped for only a brief time on the order of milliseconds [13]. Live migration starts with several pre-copy phases, where the guest OS continues to run uninterrupted while its pages are asynchronously copied to the destination host. Pages modified during the pre-copy phase are marked as dirty by the hypervisor. A much shorter stop-and-copy phase delivers only the remaining dirty pages.

Unfortunately, the shadow page table mechanism used for tracking dirtied guest OS pages will not work for guest OS pages dirtied by DMA operations made by the GPU, since GPU writes do not raise the CPU exceptions needed to track DMA writes. PCI express Address Translation Services (ATS) [1, 7, 26] extensions needed to perform hypervisor-mediated DMA page faults and unmediated dirty page tracking exist, but none of the high-performance GPU models investigated in this paper (Section 6.1) support this extension. Hence, we require an alternative mechanism for enabling live migration.

3.4 Enabling live migration

In order to support live migration, Crane must allow applications to make OpenCL API calls during the pre-copy phase. However, a GPU cannot be assigned to a guest OS during live migration, since the stop-and-copy phase will fail to capture GPU memory state. To support live migration, Crane allocates a permanent passthrough GPU to a separate "proxy domain" running on the source host. The application's OpenCL calls are forwarded to an RPC stub in a proxy process running in the proxy domain (Figure 1).

In the pre-migration phase, Crane transitions from tracked-passthrough to API remoting mode. Crane achieves this by extracting guest GPU state to DRAM, and sending a copy over to the proxy process (Figure 2.2). The program continues executing OpenCL calls over RPC, albeit with degraded performance. With the guest GPU detached, the guest continues to execute as it is migrated over to a new host.

In the post-migration phase (Figure 2.4), Crane transitions back from API remoting to tracked-passthrough. Crane achieves this by extracting modified OpenCL state from the proxy and restoring it on the new passthrough GPU.

3.5 Limitations

Crane currently only supports migrating VMs between hosts with identical GPUs. The significance of this limitation in cloud data centers is debatable given their homogeneous nature [33]. We discuss extending Crane to support migration between heterogeneous GPU models in our future work (Section 8).

4. IMPLEMENTATION

In this section, we go into more detail about the architecture of Crane, which is shown in Figure 1. The interposition library runs in the OpenCL application and takes care of managing OpenCL state and API remoting. The guest migration daemon (running in the guest OS) and the host migration daemon (running in the host) coordinate with each other to extract OpenCL state into DRAM prior to migration, and to ensure that the GPU card is removed safely after it is no longer in use by the guest OS.

4.1 Interposition library

The interposition library is a shim that wraps the OpenCL API, and implements the core functionality of Crane using only standard OpenCL APIs. Since the interposition library is a dynamically linked shared library, Crane avoids application modification and recompilation. The interposition library has 3 key roles: (1) tracking and extracting OpenCL state, (2) API remoting during live migration, and (3) reinjecting state into the new GPU after migration finishes.

4.1.1 Tracking and extracting OpenCL state

In order to ensure that all OpenCL and GPU device state can be recreated after migration to the new host, Crane must: (1) track all the arguments to OpenCL API calls that create new objects, and (2) read dirty GPU device memory from OpenCL memory buffer objects into DRAM before migration begins so it is not lost. To allow migration at any time, Crane must do this throughout the app's lifetime (i.e. before, after, and even during live migration).

For each OpenCL object type that a user can create (e.g. cl_kernel, cl_context), Crane maintains a shadow-object that records the original function arguments passed to the clCreate* function call. clSetKernelArg is wrapped to record its function arguments in the kernel shadow-object. Other OpenCL functions that manipulate OpenCL object state are simply propagated to the shadow-object. Since the OpenCL library state is encapsulated by APIs that modify OpenCL objects, there is no need to record other OpenCL function calls and replay them later.

To read dirty GPU memory buffers into DRAM, Crane uses the standard OpenCL API call clEnqueueReadBuffer. As an optimization, Crane tracks app writes to GPU device memory (i.e. clEnqueueWriteBuffer) and keeps a cached copy of them in DRAM; this reduces app stoppage time prior to migration by avoiding unnecessary GPU memory reads. While state-tracking can track simple buffer writes, it cannot track modifications to OpenCL buffers used as output arguments to an OpenCL kernel. Hence, any OpenCL buffer argument to a kernel that is writable is assumed by Crane to be "dirty", and must either be read from GPU device memory during pre-migration, or sent from the proxy to the OpenCL app during post-migration.
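The sketch below illustrates this shadow-object idea for cl_mem objects in C. The wrapper and table names are hypothetical; they are meant only to show how creation arguments, cached host copies, and conservative dirty marking fit together, not Crane's actual implementation:

```c
#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative shadow-object for cl_mem: the clCreateBuffer arguments needed
 * to replay the call after migration, plus dirty-tracking state. */
struct shadow_mem {
    cl_mem       handle;
    cl_mem_flags flags;
    size_t       size;
    void        *cached_copy;   /* DRAM copy of the last full host write */
    int          dirty;         /* device contents may differ from cache */
};

static struct shadow_mem table[256];
static size_t            n_tracked;

static struct shadow_mem *shadow_lookup(cl_mem h) {
    for (size_t i = 0; i < n_tracked; i++)
        if (table[i].handle == h) return &table[i];
    return NULL;
}

/* Interposed clCreateBuffer: create the real buffer, then record the
 * arguments so the object can be recreated on the destination GPU. */
cl_mem crane_clCreateBuffer(cl_context ctx, cl_mem_flags flags, size_t size,
                            void *host_ptr, cl_int *err) {
    cl_mem m = clCreateBuffer(ctx, flags, size, host_ptr, err);
    table[n_tracked++] = (struct shadow_mem){ m, flags, size, NULL, 1 };
    return m;
}

/* Interposed clEnqueueWriteBuffer: cache the host data in DRAM so that
 * pre-migration can often skip re-reading this buffer from the GPU. */
cl_int crane_clEnqueueWriteBuffer(cl_command_queue q, cl_mem buf, cl_bool block,
                                  size_t off, size_t size, const void *ptr) {
    struct shadow_mem *s = shadow_lookup(buf);
    if (s && off == 0 && size == s->size) {
        if (!s->cached_copy) s->cached_copy = malloc(size);
        memcpy(s->cached_copy, ptr, size);
        s->dirty = 0;
    }
    return clEnqueueWriteBuffer(q, buf, block, off, size, ptr, 0, NULL, NULL);
}

/* Interposed clSetKernelArg: a buffer handed to a kernel may be written by
 * it, so conservatively mark it dirty; a table lookup is needed because the
 * call itself does not say whether the argument is a cl_mem or plain data. */
cl_int crane_clSetKernelArg(cl_kernel k, cl_uint idx, size_t sz, const void *arg) {
    if (sz == sizeof(cl_mem)) {
        struct shadow_mem *s = shadow_lookup(*(const cl_mem *)arg);
        if (s) s->dirty = 1;
    }
    return clSetKernelArg(k, idx, sz, arg);
}
```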
4.1.2 OpenCL object introspection API

OpenCL provides functions for introspecting its object handles. In particular, given an OpenCL object, a user can query other OpenCL objects it depends on. The introspection API poses a challenge, since it can expose raw OpenCL object handles that Crane must wrap with its own opaque handle. Crane implements this behavior using a hash table mapping raw object handles to opaque handles. To ensure OpenCL applications that do not use introspection functions do not pay the overhead of hash table inserts, Crane inserts hash table entries lazily.

The astute reader may notice that state-tracking (Section 4.1.1) would be unnecessary if we could use the introspection API to query the function arguments used to create an OpenCL object. However, the OpenCL introspection functions are limited in usage, and cannot be used to obtain all function arguments, so state-tracking is required by Crane.

4.1.3 API remoting

The hypervisor prohibits a passthrough device from being attached to a guest during live migration. To ensure user-facing OpenCL apps continue running uninterrupted, the interposition library forwards OpenCL API calls over RPC to the stationary proxy domain residing on the same machine (Figure 1). To optimize the RPC implementation, Crane reduces RPC roundtrips by batching RPC calls. The only RPC calls that the interposition library cannot batch are OpenCL functions that introspect API state and require a return value, and functions that read/write GPU device memory using an OpenCL app data pointer (e.g. clEnqueueWriteBuffer). The interposition library is able to batch all other OpenCL API calls, including clCreate* calls that create new OpenCL objects. When Crane encounters a command that cannot be batched, it must flush any accumulated commands that precede it; Crane makes no attempt to infer whether non-batchable commands can be reordered earlier to further delay a flush.
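The following C sketch shows the shape of that batching decision. The remote_call structure and the rpc_* transport helpers are hypothetical placeholders, since the paper does not prescribe a wire format:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoded OpenCL call, ready to be shipped to the proxy. */
struct remote_call {
    int         opcode;    /* which OpenCL function  */
    const void *args;      /* marshalled arguments   */
    size_t      args_len;
};

/* Hypothetical transport helpers (not part of the paper's interfaces). */
void rpc_send_batch(const struct remote_call *calls, size_t n);
void rpc_call_sync(const struct remote_call *call, void *reply, size_t reply_len);

static struct remote_call batch[128];
static size_t             batched;

static void flush_batch(void) {
    if (batched > 0) {
        rpc_send_batch(batch, batched);   /* one round trip for many calls */
        batched = 0;
    }
}

/* Calls that need no immediate result (most clCreate* and enqueue operations)
 * are queued; calls that return data to the application (introspection,
 * blocking buffer reads/writes) force a flush followed by a synchronous RPC. */
void forward_call(const struct remote_call *c, bool needs_reply,
                  void *reply, size_t reply_len) {
    if (!needs_reply) {
        if (batched == 128) flush_batch();
        batch[batched++] = *c;
        return;
    }
    /* Non-batchable call: everything queued before it must be sent first,
     * in order, then the call itself is executed synchronously. */
    flush_batch();
    rpc_call_sync(c, reply, reply_len);
}
```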
4.1.4 Reinjecting state

Recreating OpenCL objects simply requires calling the corresponding clCreate* OpenCL API function with the original function arguments tracked during state-tracking (Section 4.1). To reinject GPU device memory, Crane uses clEnqueueWriteBuffer on each recreated memory buffer handle.

Some OpenCL object creation functions take only data arguments; hence, these objects do not depend on other OpenCL objects. However, some OpenCL object creation functions take other OpenCL objects as arguments; these objects depend on other OpenCL objects being recreated before them. Crane ensures OpenCL object interdependencies are respected by recreating OpenCL objects in order of their type. For example, cl_context objects are created first since they have no dependencies, followed by cl_program objects, which depend on recreated cl_context objects.

4.2 Proxy domain

The proxy domain is a pre-provisioned VM with a permanently assigned passthrough GPU. By having the proxy pre-provisioned, we ensure minimal app stoppage times during pre-migration (Section 3.4). To avoid DRAM and GPU resource waste, multiple guest OSes share the same proxy domain. This provides equivalent security guarantees to previous approaches to GPU virtualization [15] that redirect driver calls to a shared "Driver VM" [19].

The proxy domain can support multiple VM migrations simultaneously. However, there are two caveats. First, the proxy domain's ability to multiplex the GPU depends entirely on the vendor-specific GPU driver implementation. Second, if the aggregate GPU device memory usage of multiple VMs and their OpenCL programs exceeds the total capacity of the proxy GPU, then either the start of some VM migrations must be delayed, or all VMs can migrate simultaneously with some VMs migrating using stop-and-resume migration to avoid using proxy GPU device memory.

Before the proxy process can handle OpenCL RPC calls, it must first receive all OpenCL object state from the original OpenCL app. The communication medium for state transfer between the proxy and the guest is a TCP connection. Since the proxy process runs in a separate address space from the OpenCL app, the proxy process must translate the app's object handle addresses to its own object handles using a hash table.

4.3 Migration daemons

The host migration daemon and the guest migration daemon transparently intercept administrator migration commands and coordinate in phases referred to as pre-migration and post-migration.

In pre-migration, before existing page-based migration mechanisms can begin, GPU device memory must be extracted into DRAM to ensure it is preserved during migration and available on the new host.

In post-migration, the host migration daemon and the guest migration daemon coordinate to reinject GPU memory state preserved in DRAM into the new GPU, and to transition from forwarded OpenCL API calls back to guest-local OpenCL calls.

Both pre-migration and post-migration incur stoppage times where OpenCL API calls are blocked. Stoppage times before and after live migration are determined by the pre-migration critical path and the post-migration critical path respectively. To minimize interruption of service in user-facing workloads, minimizing stoppage times during these paths is essential.

Figure 3: The pre-migration critical path is the transition from tracked-passthrough to API remoting mode. OpenCL API calls are blocked during this phase.

4.3.1 Coordinating pre-migration

The guest migration daemon coordinates extracting GPU memory state to DRAM for each OpenCL app. The guest migration daemon sends all OpenCL apps a PREPARE_TO_MIGRATE message and waits for all their responses, after which the guest migration daemon is assured that all GPU memory state has been moved to DRAM, and existing DRAM-based migration mechanisms can safely begin.

The host migration daemon coordinates safe removal of the GPU. The host migration daemon waits to hear back from the guest migration daemon, and once the guest migration daemon has responded, the hypervisor is assured that the GPU driver in the guest OS has been unloaded, and the GPU can be removed without the guest OS crashing.

The pre-migration critical path is shown in Figure 3, and the steps are as follows:

• Flush queues and read GPU device memory: Wait for any outstanding OpenCL kernels to finish by issuing clFinish, then read GPU device memory for any dirtied OpenCL memory buffers (Section 4.1.1).

• Send OpenCL object state to proxy: For each OpenCL object type, the function arguments and GPU memory state needed to rebuild those objects must be sent to and rebuilt on the proxy.

• Release OpenCL objects: Before the OpenCL library unloads, OpenCL objects must be released to ensure that any memory allocated by the library is not leaked.

• Unload OpenCL library: The vendor OpenCL library is unloaded, which decrements the reference count of the GPU driver (described in Section 5).

After the pre-migration phase critical path has finished, the interposition library unblocks any OpenCL API calls and forwards them over RPC to the proxy. To minimize stoppage time, Crane uses two threads to overlap sending OpenCL object state to the proxy with releasing objects and unloading the OpenCL library on the guest. The remaining steps in Figure 3 are performed off the pre-migration phase critical path in the background and do not contribute to OpenCL app stoppage time:

• Unload GPU kernel modules: Unloading the GPU driver prior to the hypervisor reclaiming the GPU card ensures the guest OS does not crash (Section 2.1).

• Hypervisor reclaims GPU: The card is safely detached from the guest, allowing it to be assigned elsewhere.

4.3.2 Coordinating post-migration

Once live migration completes, the host migration daemon attaches a GPU card to the guest and sends PREPARE_TO_RESUME over a socket to the guest migration daemon. With the card now attached, the guest migration daemon loads the GPU drivers, then forwards the message onto each OpenCL app. Once the OpenCL app receives the message, it is assured that the GPU card is inserted and the GPU driver is ready to use. Hence, the OpenCL app reloads the OpenCL library and recreates its OpenCL objects using the state it preserved in DRAM during pre-migration. GPU device memory is reinjected by issuing clEnqueueWriteBuffer on each recreated memory buffer object. After OpenCL object recreation, OpenCL apps immediately switch back over to native OpenCL calls without any coordination.

Figure 4: The post-migration critical path is the transition from API remoting to tracked-passthrough mode. OpenCL API calls are blocked during this phase.

The post-migration phase critical path is shown in Figure 4, and the steps are as follows:

• Flush remaining commands to proxy: Since OpenCL commands are batched, any commands that are still queued must be flushed to the proxy and completed.

• Receive dirty buffers from proxy: The proxy flushes its command queues and reads GPU device memory, then sends dirty memory buffers of completed kernels back to the application.

• Reload OpenCL library: The OpenCL library is reloaded, and function pointer hooks are reset.

• Recreate OpenCL objects: OpenCL objects are rebuilt by executing the OpenCL clCreate* calls for the corresponding object type (Section 4.1.4).

5. OPENCL LIBRARY UNLOADING

Crane requires idempotent loading and unloading of the OpenCL shared library (libOpenCL.so). The naive approach is to use dlopen and dlclose.

However, reloading only libOpenCL.so proved unsuccessful for our NVIDIA-based setup, with the OpenCL library returning an error code immediately when attempting to use the OpenCL API. Upon closer inspection, we observed that calling dlclose fails to recursively unload shared library dependencies initially loaded by libOpenCL.so. From this, we suspected that an incomplete library reload resulted in the error code.

As a workaround, to ensure libOpenCL.so and its dependencies are all reloaded, we explicitly record the recursive shared library dependencies loaded by libOpenCL.so, so they can be unloaded manually. Crane creates a wrapper for dlopen that: (1) records the library handle of the loaded dependency, and (2) rewrites the Global Offset Table to call the wrapper instead of dlopen so that recursive library dependencies are recorded.
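A simplified C sketch of this dependency recording and explicit unload is shown below. The GOT rewriting that redirects libOpenCL.so's own dlopen calls to the wrapper is omitted, and the function names, table sizes, and the /dev/nvidia* prefix (which matches the NVIDIA-based setup used in the paper) are illustrative assumptions rather than Crane's actual code:

```c
#include <dlfcn.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Handles of every shared object loaded (directly or transitively) on behalf
 * of libOpenCL.so, recorded by the dlopen wrapper below. */
static void  *recorded[64];
static size_t n_recorded;

/* Wrapper substituted for dlopen (in the real system this substitution is
 * performed by rewriting the Global Offset Table of the loaded libraries). */
void *crane_dlopen(const char *name, int flags) {
    void *h = dlopen(name, flags);
    if (h && n_recorded < 64)
        recorded[n_recorded++] = h;
    return h;
}

/* Unload libOpenCL.so and every recorded dependency so the GPU driver's
 * reference count can drop to zero before the hypervisor reclaims the card. */
void crane_unload_opencl(void *libopencl_handle) {
    dlclose(libopencl_handle);
    while (n_recorded > 0)
        dlclose(recorded[--n_recorded]);

    /* Close GPU driver file descriptors that the unloaded libraries left
     * open, found by scanning /proc/self/fd for links into /dev/nvidia*. */
    DIR *d = opendir("/proc/self/fd");
    struct dirent *e;
    while (d && (e = readdir(d)) != NULL) {
        char link[64], target[PATH_MAX];
        snprintf(link, sizeof(link), "/proc/self/fd/%s", e->d_name);
        ssize_t len = readlink(link, target, sizeof(target) - 1);
        if (len > 0) {
            target[len] = '\0';
            if (strncmp(target, "/dev/nvidia", 11) == 0)
                close(atoi(e->d_name));
        }
    }
    if (d) closedir(d);
}
```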
Lastly, to ensure the GPU driver can be unloaded, we must close GPU device driver file descriptors left open by explicitly unloaded libraries. We achieve this on Linux by scanning for open GPU driver file descriptors in /proc/$PID/fd. After performing our explicit unload procedure, we were able to perform an idempotent reload of the OpenCL library without any subsequent errors.

6. EVALUATION

There are three metrics of performance when migrating OpenCL applications: (1) overhead from state-tracking during tracked-passthrough mode that deviates from a pristine GPU passthrough environment, (2) stoppage times during pre-migration and post-migration, and (3) live migration performance during API remoting for continued but reduced service.

Importantly, there are wildly different constraints on these three metrics for the two different types of OpenCL workloads.

For batch processing workloads, total job completion time is most important, and is impacted most by overhead from state-tracking when the VM is stationary. Small but continuous overheads for jobs that go on for hours [35] translate to hours in lost computing resources. However, sub-second stoppage during pre-migration/post-migration or reduced service during a 53 second live migration period is unimportant for jobs that could go on for hours.

For user-facing workloads, stoppage times are critical since they represent a delay in user experience. Continued service during live migration is essential, albeit at reduced performance that is within the real-time constraint of the user (e.g. frames encoded per second for a video encoder). Overhead from state-tracking is negligible and will not violate the user's real-time constraints.

We structure the evaluation in order of the three metrics, and emphasize the importance of each metric based on whether the workload is batch processing or user-facing. Our workloads are data-parallel microbenchmarks from the SHOC suite, a user-facing video encoder (x264), and two batch processing applications (FAHBench and DeepCL). We chose SHOC over other benchmarking suites [12] since it provides benchmark implementations for OpenCL, in addition to CUDA; its "Level One" programs target common data-parallel tasks found in existing GPGPU applications, including both batch processing and user-facing workloads.

6.2.2 User-facing OpenCL application

x264: a video encoder. x264 [46] is a free software library used in the popular open-source video player VLC. x264 encodes video files from a lossless format into a lossy compressed format suitable for streaming video playback to a user. x264 allows users to make use of GPU acceleration by using an OpenCL video encoder. We configured the x264 program to use the OpenCL video encoder and to output to MKV format. The input video file is a 4.7GB, 360p video in the uncompressed YUV4MPEG format.

6.2.3 Batch processing OpenCL applications

FAHBench: scientific computing. FAHBench [41] is a benchmarking toolkit for GPU-based molecular dynamics calculations. FAHBench performs the same calculations as those used in Folding@Home [10], a crowd-sourced distributed computing platform for predicting protein folding, which may assist in researching diseases involving protein misfolding such as Alzheimer's and Parkinson's. We ran FAHBench using its default configuration, without any parameters. By default, FAHBench runs a standard simulation of the DHFR protein in an explicit solvent.

DeepCL: training a neural network. Batch processing workloads such as training convolutional neural networks can take hours to complete [35]. It is essential that these workloads are not terminated prematurely, even in the presence of routine maintenance tasks that require migrating a VM so a machine can be powered down. For an image recognition task, the depth of the trained network can lead to better prediction accuracy [40]. However, with increased depth comes increased memory usage and training times [29]. As a result, a variety of machine learning frameworks have begun to appear that make use of GPUs to accelerate the training of large scale neural networks [2,
27, 14]. We consider DeepCL [25] which is one such frame- work written on top of OpenCL. DeepCL runs a benchmark 6.1 Platform referred to in the codebase as maddison-full. The neural Our benchmarking setup for Crane consisted of two iden- network contains 1 input layer, 12 convolutional layers, 1 tical physical hosts connected via a 1 Gbps Ethernet link. fully connected layer, and 1 softmax layer. Each host contained two Xeon E5-2609 2.4 GHz quad- core processors and 32 GB of RAM. Both hosts were out- 6.3 Overhead from state-tracking fitted with an NVIDIA GRID K1 card. For software, Xen Figure 5 shows that in the common case, if an adminis- 4.7 was used as the hypervisor, CentOS 6 with Linux kernel trator is running a stationary guest OS with Crane enabled, 3.16.6 for the dom0, and Ubuntu 16.04 LTS with Linux ker- they can expect to see performance comparable to a GPU nel 4.4.0 for the domU. The domU was configured with 2 GB passthrough environment. of memory and ran NVIDIA driver version 361 supporting For data-parallel microbenchmarks, the performance over- OpenCL 1.2. head of state-tracking is low with a mean of 5.25% relative To demonstrate Crane’s ability to support different GPU to passthrough performance. models, we also ran experiments using a separate NVIDIA An excellent result is that both batch processing jobs GTX480 card. However, we omit these results from our DeepCL and FAHBench are within measurement error of evaluation for clarity since they are identical. a GPU passthrough environment, with a mean of −0.1%. x264 does incur a noticeable state-tracking overhead of 6.2 OpenCL programs 12.5%. However, since x264 is a user-facing workload whose real-time constraint is frames encoded per second, this over- 6.2.1 Data-parallel microbenchmarks head is acceptable as can be seen in Figure 6 when compar- We obtained the “Level One” programs from the Scalable ing Passthrough and Crane rates of encoding when the ap- HeterOgeneous Computing (SHOC) benchmark suite [16], plication is not migrating. Upon further investigation, the which are designed to target common data-parallel tasks root cause of this overhead is that clSetKernelArg takes Passthrough Tracked−passthrough 99-th percentile of x264 and FAHBench are less than 10 ms. 5.0% The longest kernel execution time for DeepCL is 62 sec- Triad onds, so in the worst case Crane could incur a 62 second 5.9% Stencil2D pre-migration stoppage for DeepCL. Pre-migration stoppage Sort 1.1% time in DeepCL is bottlenecked by large state transfers from Scan 1.3% the guest to the proxy. 13.8% Reduction 6.4.2 Post-migration stoppage time 1.1% Data−parallel MD Figure 7b shows stoppage time during the post-migration microbenchmarks GEMM 8.0% phase when transitioning back to tracked-passthrough mode FFT 5.8% on the guest, which corresponds to the steps in Figure 4. During post-migration, the guest sends any remaining RPC 10 100 1 calls to the proxy (Thread 1, a), after which it receives any x264 12.5% dirty memory buffers from the proxy (Thread 1, b). Then, FAHBench −0.1% the guest reloads the OpenCL library using dlopen (Thread −0.1% 1, c) and recreates the OpenCL objects on the guest (Thread


applications 1, d). 10 100 1 Post-migration stoppage time in FAHBench and x264 is Total time (seconds) bottlenecked by time spent recreating OpenCL objects (Fig- ure 7b, (Thread 1, d)). Again, DeepCL post-migration Figure 5: Overhead from state-tracking during GPU pass- stoppage times are bottlenecked by large state transfer sizes through operation (Thread 1, b). 6.4.3 Total stoppage time both OpenCL object handles and non-OpenCL object han- Total stoppage time for pre-migration and post-migration dles as arguments, but does not distinguish type information corresponds to the height of the bar graphs in Fig- in the call. This forces the state-tracking layer to perform ures 7a and Figure 7b respectively. The total pre-migration hash table lookups to determine if the argument is a memory stoppage time corresponds to time spent waiting for the buffer. slowest of the two pre-migration threads to finish. The to- tal post-migration stoppage time corresponds to the sum of 6.4 Stoppage times all the post-migration steps, since they occur synchronously Stoppage times are incurred during pre-migration before without any overlap. live migration begins, and during post-migration after live In total, x264 incurs 0.1 seconds of stoppage time dur- migration ends. We first look at pre-migration and post- ing pre-migration. Similarly, x264 incurs 0.83 seconds of migration stoppage times and their individual bottlenecks stoppage time during post-migration. This is an excellent for each application, then we summarize the importance of result, since the user-facing workload will incur sub-second total stoppage times to the application. Unlike the other stoppage times when servicing encoded video frames. applications, stoppage times for DeepCL are bottlenecked by For the batch processing workloads, stoppage times are state transferred during pre-migration and post-migration. less important, since they add a negligible fraction to the We provide a breakdown of total state transferred during total job completion time. However, we include them to pre-migration and post-migration for all applications. provide the reader with insight into the variability of stop- page times between different OpenCL applications. 6.4.1 Pre-migration stoppage time In total, FAHBench incurs small stoppage times during Figure 7a shows stoppage time during the pre-migration pre-migration (0.073 seconds) and post-migration (0.52) sec- phase when transitioning to API remoting on the proxy, onds. These stoppage times are acceptable since they are a which corresponds to the steps in Figure 3. During pre- small fraction of total job time (342.94 seconds). migration, we first need to wait for outstanding kernels In total, DeepCL has the largest stoppage times, since to finish executing, after which we can extract GPUde- it sends a large amount of state during pre-migration and vice memory (Thread 1, a). Once this data is extracted to post-migration. However, the stoppage times during pre- DRAM, it can be sent to the proxy (Thread 1, b). While ob- migration (1.57 seconds) and post-migration (4.90 seconds) jects are sent to the proxy, OpenCL objects on the guest can are acceptable since they are a small fraction of total job be released concurrently in a separate thread (Thread 2, a), time (475.90 seconds). after which the OpenCL library can be unloaded (Thread 2, We were not able to measure stoppage times for data- b). 
parallel algorithm microbenchmarks since their runtime is Pre-migration stoppage time in FAHBench and x264 is less then the 53 second migration time itself. bottlenecked by unloading the OpenCL library (Figure 7a, (Thread 2, b)), and only moderately affected by waiting for 6.4.4 State transfer outstanding kernels to finish (Thread 1, a). We investigated OpenCL object state is sent from guest to proxy dur- the kernel execution times of the three applications, and ing pre-migration, and from proxy to guest during post- found that the median kernel execution times for x264 and migration. For DeepCL, state transfer time bottlenecks both FAHBench are less than 5 ms, and for DeepCL is 5 seconds. pre-migration and post-migration stoppage times. In all DeepCL has the longest kernel execution times with a 85-th three applications, state transfer is bottlenecked by GPU percentile kernel execution time of 10 seconds, whereas the device memory (cl_mem). Figure 8c shows the worst case maximum state transfer pause from PCI detach to be an artifact of the device that size that could be sent during either pre-migration or post- could change between models, hence it is not included in migration. This corresponds to the maximum GPU device Crane’s pre-migration critical path. At roughly 45 seconds memory allocated by the app over its lifetime, since the into the experiment, the post-copy phase of live migration amount of GPU device memory transferred is a function of pauses the VM for 2.5 seconds. The post-copy phase of live when the VM happens to be migrated by the administrator. migration is always present both with and without Crane, so Figure 8a shows OpenCL state sent during pre-migration it does not belong on Crane’s post-migration critical path. from the guest to the proxy for recreating OpenCL objects GPU PCI device reassignment automatically triggers reload- in the proxy process. State sent during pre-migration spans ing GPU drivers in the VM, and Linux provides no reliable all different OpenCL object types, as recorded during state- way of being notified of this completing. As a result, we tracking (Section 4.1.1). Data being sent includes func- artificially add 10 seconds of additional API remoting after tion arguments needed to recreate the OpenCL objects (e.g. the post-copy phase to avoid a race condition when switch- OpenCL source code strings for cl_program). The cl_mem ing from API remoting back to tracked-passthrough mode. memory buffer type includes GPU device memory. If the guest OS provides a reliable way to block and wait Figure 8b shows OpenCL state sent during post-migration for kernel drivers to load, this additional 10 seconds of API for any memory buffers dirtied by kernels that ran onthe remoting can be removed. proxy process. The proxy only needs to send back dirty memory buffers, since all other OpenCL object state isal- 7. RELATED WORK ready present on the guest since from performing state- Dowty & Sugerman [19] provide a taxonomy covering dif- tracking. ferent approaches to GPU virtualization. Front-end virtu- DeepCL stoppage times are bottlenecked by sending GPU alization runs the driver in the hypervisor and uses API device memory, with 256MB of data transferred during remoting or device emulation for handling guest VM GPU both pre-migration (Total, Figure 8a) and post-migration access. Back-end virtualization runs the driver in the guest (cl_mem, Figure 8b). 
The upper bound on these state trans- VM using passthrough for maximum performance but sac- fers is 1GB (cl_mem, Figure 8c). rifices migration. Hybrid virtualization dedicates a “Driver x264 sends 16 MB during both pre-migration and post- VM” and uses front-end techniques to provide access to the migration. The 16 MB sent during pre-migration is not the GPU. Crane fills a unique position in this taxonomy that bottleneck for x264; unloading the OpenCL library is. The uses back-end virtualization when the VM is stationary for 16 MB sent during post-migration is a small contribution to maximum performance, and falls back to hybrid virtualiza- stoppage and is bottlenecked by recreating OpenCL objects. tion during live migration to minimize pauses in VM execu- FAHBench only sends 1 MB during pre-migration re- tion. sulting in a small contribution to pre-migration stoppage Front-end GPU virtualization: Many methods focus solely time. For post-migration, FAHBench is the only workload on virtualizing the GPU at the API level, such as forward- that does not dirty any of its memory buffers during live ing API calls [23, 31, 44, 51, 52, 55]. Unlike Crane, they migration resulting in zero bytes transferred during post- do not take advantage of new GPU passthrough features migration; it allocates 16MB of GPU memory later on in its for performance. Like Crane, LoGV [22] acknowledges that execution after post-migration. GPU drivers cannot support migration and proposes sav- ing GPU state to DRAM before migration. However, since 6.5 Live migration performance LoGV modifies GPU drivers, it only works for 1 GPU model Since live migration only lasts 53 seconds, it has little (NVIDIA GTX480), whereas Crane benefits from vendor in- impact on long batch processing workloads. Hence, we limit dependence (Section 3) by implementing all state extraction our analysis to the x264 user-facing workload. and GPU virtualization using vendor standardized OpenCL Figure 6 shows the rate of encoding over the lifetime of APIs that work across all GPU models supporting OpenCL. an OpenCL application running with Crane. The admin- Back-end GPU virtualization: AMD and NVIDIA pro- istrator issues a live migration request in the hypervisor 8 vide GPUs tailored to virtualized environments that have seconds into the application running. After pre-migration a limited form of GPU sharing that configures the GPU to stoppage, the OpenCL application transitions to API remot- appear either as a single high-performance virtual GPU, or ing mode. Crane makes use of batching RPC to minimize multiple lower performance virtual GPUs assigned via pass- TCP roundtrips and achieves an average of 20.1 FPS (with- through [8, 36]. Unlike Crane, this does not enable guest out batching it is 4.5 FPS; for clarity this line is omitted). VM migration. GPUvm [43] virtualizes the GPU at the While 20.1 FPS is less than the ideal 60 FPS used in to- hypervisor level by using reverse engineered NVIDIA GPU day’s video streaming, poor performance during live migra- drivers [32]. However, GPUvm [43] is limited only to GPU tion is not an inherent limitation in Crane’s API remoting models that have been reverse engineered whereas Crane is approach and can be optimized by using an inter-VM com- vendor independent (Section 3). Intel provides an open- munication mechanism; we discuss two different approaches source driver (gVirt/Intel GVT) for providing GPU virtu- in Section 8.1. alization of its integrated graphics offerings [56, 45]. 
Since Besides the sub-second pre-migration and post-migration integrated graphics have a space, GVT is stoppage times highlighted in Figure 6, there are other brief able to use standard shadow page-table mechanisms [9, 3] to periods of reduced frame encoding rates. Following pre- trap-and-emulate page table writes needed to share the GPU migration stoppage, detaching the GPU PCI device from the between VMs, leading to performance degradation that au- hypervisor results in a 1.1 second pause in the VM. How- thors refer to as the “Massive Update Issue” [18]. Crane ever, Crane allows unblocked OpenCL calls over API remot- does not suffer from the “Massive Update Issue” by design, ing during this PCI detach-induced pause. We consider any since Crane tracks dirty GPU memory at the granularity Crane Passthrough

(Figure 6 plot area: y-axis, frames encoded per second; x-axis, time in seconds.)

Figure 6: Encoding video frames using the GPU. The administrator issues a live migration command 8 seconds into running the application. Live migration occurs between the boxes and lasts for 53 seconds. The first box shows pre-migration stoppage time; the second box shows post-migration stoppage time.

(Figure 7 legend. Pre-migration steps: (Thread 1, a) Flush queues and read GPU device memory; (Thread 1, b) Send OpenCL object state to proxy; (Thread 2, a) Release OpenCL objects; (Thread 2, b) Unload OpenCL library. Post-migration steps: (Thread 1, a) Flush remaining commands to proxy; (Thread 1, b) Receive dirty buffers from proxy; (Thread 1, c) Reload OpenCL library; (Thread 1, d) Recreate OpenCL objects. Y-axis: time in seconds, one panel per application: DeepCL, FAHBench, x264.)

(a) Pre-migration (b) Post-migration

Figure 7: Application stoppage time. Pre-migration stoppage makes use of multiple threads to reduce stoppage time, but post-migration stoppage cannot (Sections 4.3.1 and 4.3.2).

(Figure 8 panels, one per application: DeepCL, FAHBench, x264. Y-axis: data transferred or allocated, log scale from 1 B to 1 GiB. X-axis of panel (a): cl_command_queue, cl_context, cl_device_id, cl_kernel, cl_mem, cl_platform_id, cl_program, Total; panels (b) and (c) show cl_mem only.)

(a) Pre-migration: OpenCL API state sent from guest to proxy. (b) Post-migration: dirty GPU device memory sent from proxy to guest. (c) Maximum GPU device memory allocated.

Figure 8: OpenCL API state breakdown. State transfer is bottlenecked by GPU device memory (cl_mem); the graph on the right provides an upper bound on GPU device memory that could be transferred during either pre-migration or post-migration. of OpenCL memory buffers (Section 4.1.1). GVT does not 8.2 Extending Crane to CUDA yet support VM migration, whereas Crane is vendor inde- The techniques used in Crane needed to support migra- pendent and will work on any platform supporting OpenCL, tion with fast passthrough GPU performance can be ex- including Intel’s. tended to support the CUDA framework. Both the CUDA Address Translation Services: The ATS specification [1, 7, and the OpenCL API treat library-specific objects as opaque 26] includes a standard wire protocol for performing (1) dirty handles, and do not allow manipulation of the underlying page tracking through page table entry dirty bit flipping data except via the library APIs. An API’s use of opaque without host processor coordination, and (2) page faults de- handles ensures we can achieve API remoting and state ex- tected by the IOMMU and handled by the host processor. traction/reinjection without requiring application modifica- ATS has been shown to work for integrated graphics offer- tions, since we can interpose on the opaque handles when ings that share a unified graphics and system memory [24]. referencing recreated or remote API objects. vCUDA [39] However, neither of the high-performance GPU models used takes advantage of opaque handles in CUDA to achieve API in this paper (Section 6.1) support ATS. Crane supports all remoting without application modifications. GPU models both now and in the future, regardless of their support for ATS extensions. 8.3 Migration between heterogeneous GPU Virtualizing passthrough NIC devices: Numerous studies models have investigated enabling VM migration for passthrough In this paper, Crane built a universal mechanism for ex- NICs [57, 28, 37]. However, all these approaches virtualize tracting GPU state, and reinjecting GPU state once the VM at the driver-level, requiring modification of each driver that has migrated to a new host with a new GPU. Since, state is virtualized [37], thereby sacrificing vendor independence extraction/reinjection is not tied to a specific GPU driver, (Section 3). To avoid guest driver modifications, SRVM [53] and is performed entirely in user space in vendor indepen- modifies the hypervisor to track dirty DMA pages during dent portable OpenCL, the VM could instead reinject GPU migration using VM introspection of the guest driver. How- state into a new heterogeneous GPU. However, heteroge- ever, tracking dirty pages with introspection requires polling neous GPU migration will not work with Crane as is. GPUs guest memory using a dedicated CPU in order to keep up have different hardware limits such as GPU device memory with the incoming packet rate. Crane avoids needing to and levels of available parallelism. Migrating to a weaker track DMA buffers entirely by instead tracking OpenCL capacity GPU will cause the application to fail when it at- memory buffers and their dirty status (Section 4.1.1). tempts to use resources beyond what are available. In fu- ture work, we will explore both application transparent and 8. FUTURE WORK non-transparent mechanisms for dealing with GPU hetero- geneity. 8.1 Improving live migration performance During live migration of the x264 video encoder in Sec- 9. 
CONCLUSIONS tion 6.5, we experience poor OpenCL performance in API remoting mode since we rely on a socket as a communication We have described Crane, a solution to GPU virtualiza- medium between the proxy OS and the guest OS. While we tion that provides both stop-and-copy and live VM migra- optimized the performance slightly by batching RPC calls, tion while simultaneously achieving passthrough GPU per- we can also examine the benefits of using shared memory formance. Since Crane implements GPU virtualization us- as a communication medium between the proxy OS and the ing only vendor standardized OpenCL APIs, it achieves two guest OS instead, similar to other approaches that make use important benefits. First, Crane avoids application, hyper- of front-end virtualization techniques like API remoting [23, visor, and OS modifications. Second, Crane is vendor in- 39]. dependent, thereby enabling virtualization of current and Several studies have investigated optimized forms of future GPU models that support the OpenCL API without inter-VM communication that make use of shared memory. any vendor support. XenLoop is a paravirtualized approach that provides high performance by avoiding guest-to-host switches that cause 10. REFERENCES TLB and cache flushes by forgoing page table remappings used in other approaches (e.g. netfront-netback) [48]. [1] PCI express - address translation services revision 1.1. XenLoop works by intercepting packets from the source 2009. guest destined for guests on the same machine, and copies [2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, the packets into a preallocated shared memory FIFO buffer. J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, Fido [11] aims to further reduce inter-VM communication et al. TensorFlow: A system for large-scale machine costs by eliminating data copies, which it achieves by learning. In Proceedings of the 12th USENIX pre-mapping a read-only view of every other VM’s entire Symposium on Operating Systems Design and physical memory at fixed virtual offsets. Pre-mapping elim- Implementation (OSDI). Savannah, Georgia, USA, inates page-mapping and data copying costs, but sacrifices 2016. isolation guarantees of the data sender by exposing its [3] K. Adams and O. Agesen. A comparison of software entire physical memory to the receiver. If all the software and hardware techniques for x86 virtualization. ACM in proxy guest OS is trusted, mechanisms like Fido can be SIGOPS Operating Systems Review, 40(5):2–13, 2006. used in Crane; otherwise, it is safer to limit the scope of [4] Altera. Implementing FPGA design with the OpenCL potential guest data leaks to GPU communication by using standard. Whitepaper, 2013. XenLoop. [5] Amazon Web Services. EC2 Pricing. https://aws.amazon.com/ec2/pricing/on-demand/. [6] Amazon Web Services. Linux Accelerated Computing M. Houston, T. Stich, C. Klein, M. R. Shirts, and Instances. http://docs.aws.amazon.com/AWSEC2/ V. S. Pande. OpenMM 4: A reusable, extensible, latest/UserGuide/using_cluster_computing.html. hardware independent library for high performance [7] AMD. AMD I/O virtualization technology (IOMMU) molecular simulation. Journal of Chemical Theory and specification. 2011. Computation, 9(1):461–469, 2013. PMID: 23316124. [8] AMD Corporation. AMD FirePro S Series. [21] Google Cloud Platform. Graphics Processing Units https://www.amd.com/Documents/ (GPU) | Google Cloud Platform. FirePro-S-Series-Datasheet.pdf. https://cloud.google.com/gpu/. [9] P. Barham, B. Dragovic, K. Fraser, S. Hand, [22] M. 
8.3 Migration between heterogeneous GPU models
In this paper, we built a universal mechanism in Crane for extracting GPU state and reinjecting it once the VM has migrated to a new host with a new GPU. Since state extraction/reinjection is not tied to a specific GPU driver, and is performed entirely in user space using vendor-independent, portable OpenCL, the VM could instead reinject GPU state into a new, heterogeneous GPU. However, heterogeneous GPU migration will not work with Crane as is. GPUs have different hardware limits, such as GPU device memory and levels of available parallelism, and migrating to a GPU with weaker capacity will cause the application to fail when it attempts to use resources beyond those available. In future work, we will explore both application-transparent and non-transparent mechanisms for dealing with GPU heterogeneity.
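A first step toward handling this heterogeneity could be a compatibility check before reinjection. The sketch below assumes hypothetical bookkeeping of the source GPU's peak resource usage (struct extracted_state and target_gpu_can_hold are illustrative names, not part of Crane); it queries the destination device's limits with the standard clGetDeviceInfo call and refuses reinjection when the extracted state cannot fit.

    #include <CL/cl.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct extracted_state {           /* recorded while running on the source GPU */
        cl_ulong total_cl_mem_bytes;   /* sum of live cl_mem allocations            */
        cl_ulong largest_cl_mem_bytes; /* largest single allocation                 */
        size_t   max_work_group_size;  /* largest work-group size any kernel used   */
    };

    bool target_gpu_can_hold(cl_device_id dev, const struct extracted_state *st) {
        cl_ulong global_mem = 0, max_alloc = 0;
        size_t wg_size = 0;

        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wg_size), &wg_size, NULL);

        if (st->total_cl_mem_bytes > global_mem) {
            fprintf(stderr, "target GPU has too little device memory\n");
            return false;
        }
        if (st->largest_cl_mem_bytes > max_alloc) {
            fprintf(stderr, "target GPU cannot hold the largest buffer\n");
            return false;
        }
        if (st->max_work_group_size > wg_size) {
            fprintf(stderr, "target GPU supports smaller work-groups\n");
            return false;
        }
        return true;
    }

A non-transparent mechanism could surface such a check to the application, whereas an application-transparent mechanism would have to mask the difference in limits entirely.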
9. CONCLUSIONS
We have described Crane, a solution to GPU virtualization that provides both stop-and-copy and live VM migration while simultaneously achieving passthrough GPU performance. Since Crane implements GPU virtualization using only the vendor-standardized OpenCL API, it achieves two important benefits. First, Crane avoids application, hypervisor, and OS modifications. Second, Crane is vendor independent, thereby enabling virtualization of current and future GPU models that support the OpenCL API without any vendor support.

10. REFERENCES
[1] PCI express - address translation services revision 1.1. 2009.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, Georgia, USA, 2016.
[3] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. ACM SIGOPS Operating Systems Review, 40(5):2–13, 2006.
[4] Altera. Implementing FPGA design with the OpenCL standard. Whitepaper, 2013.
[5] Amazon Web Services. EC2 Pricing. https://aws.amazon.com/ec2/pricing/on-demand/.
[6] Amazon Web Services. Linux Accelerated Computing Instances. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cluster_computing.html.
[7] AMD. AMD I/O virtualization technology (IOMMU) specification. 2011.
[8] AMD Corporation. AMD FirePro S Series. https://www.amd.com/Documents/FirePro-S-Series-Datasheet.pdf.
[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In ACM SIGOPS Operating Systems Review, volume 37, pages 164–177. ACM, 2003.
[10] A. L. Beberg, D. L. Ensign, G. Jayachandran, S. Khaliq, and V. S. Pande. Folding@home: Lessons from eight years of volunteer distributed computing. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8. IEEE, 2009.
[11] A. Burtsev, K. Srinivasan, P. Radhakrishnan, K. Voruganti, and G. R. Goodson. Fido: Fast inter-virtual-machine communication for enterprise appliances. In USENIX Annual Technical Conference, San Diego, CA, 2009.
[12] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54. IEEE, 2009.
[13] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, NSDI'05, pages 273–286, Berkeley, CA, USA, 2005. USENIX Association.
[14] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[15] C. I. Dalton, D. Plaquin, W. Weidner, D. Kuhlmann, B. Balacheff, and R. Brown. Trusted virtual platforms: a key enabler for converged client devices. ACM SIGOPS Operating Systems Review, 43(1):36–43, 2009.
[16] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 63–74. ACM, 2010.
[17] Debian Wiki. KVM VGA Passthrough. https://wiki.debian.org/VGAPassthrough.
[18] Y. Dong, M. Xue, X. Zheng, J. Wang, Z. Qi, and H. Guan. Boosting GPU virtualization performance with hybrid shadow page tables. In USENIX Annual Technical Conference, pages 517–528, 2015.
[19] M. Dowty and J. Sugerman. GPU virtualization on VMware's hosted I/O architecture. ACM SIGOPS Operating Systems Review, 43(3):73–82, 2009.
[20] P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, D. Shukla, T. Tye, M. Houston, T. Stich, C. Klein, M. R. Shirts, and V. S. Pande. OpenMM 4: A reusable, extensible, hardware independent library for high performance molecular simulation. Journal of Chemical Theory and Computation, 9(1):461–469, 2013. PMID: 23316124.
[21] Google Cloud Platform. Graphics Processing Units (GPU) | Google Cloud Platform. https://cloud.google.com/gpu/.
[22] M. Gottschlag, M. Hillenbrand, J. Kehne, J. Stoess, and F. Bellosa. LoGV: Low-overhead GPGPU virtualization. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on, pages 1721–1726, Nov 2013.
[23] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan. GViM: GPU-accelerated virtual machines. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, HPCVirt '09, pages 17–24, New York, NY, USA, 2009. ACM.
[24] Y.-J. Huang, H.-H. Wu, Y.-C. Chung, and W.-C. Hsu. Building a KVM-based hypervisor for a heterogeneous system architecture compliant system. In Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 3–15. ACM, 2016.
[25] Hugh Perkins. DeepCL: deep convolutional networks in OpenCL. http://deepcl.hughperkins.com/.
[26] Intel. Intel virtualization technology for directed I/O, revision 2.4. 2016.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[28] A. Kadav and M. M. Swift. Live migration of direct-access devices. SIGOPS Oper. Syst. Rev., 43(3):95–104, July 2009.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[30] KVM. Hotadd PCI Devices. http://www.linux-kvm.org/page/Hotadd_pci_devices.
[31] H. A. Lagar-Cavilla, N. Tolia, M. Satyanarayanan, and E. de Lara. VMM-independent Graphics Acceleration. In Proceedings of the 3rd International Conference on Virtual Execution Environments, VEE '07, pages 33–43, New York, NY, USA, 2007. ACM.
[32] Linux open-source community. Nouveau Open-Source GPU Device Driver. http://nouveau.freedesktop.org/.
[33] Matt Kapko. How (and Why) Facebook Excels at Data Center Efficiency. http://www.cio.com/article/2854720/data-center/how-and-why-facebook-excels-at-data-center-efficiency.html.
[34] Microsoft Azure. N-Series GPU enabled Virtual Machines. https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series.
[35] Netflix Inc. Distributed Neural Networks with GPUs in the AWS Cloud. http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html.
[36] NVIDIA Corporation. Virtual GPU Technology - NVIDIA GRID. http://www.nvidia.ca/object/grid-technology.html.
[37] Z. Pan, Y. Dong, Y. Chen, L. Zhang, and Z. Zhang. CompSC: Live migration with pass-through devices. ACM SIGPLAN Notices, 47(7):109–120, 2012.
[38] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. SIGOPS Oper. Syst. Rev., 36(SI):377–390, Dec. 2002.
[39] L. Shi, H. Chen, J. Sun, and K. Li. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Transactions on Computers, 61(6):804–816, June 2012.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[41] Stanford University. FAHBench. https://fahbench.github.io.
[42] H. Su, M. Wen, N. Wu, J. Ren, and C. Zhang. Efficient parallel video processing techniques on GPU: From framework to implementation. The Scientific World Journal, 2014, 2014. Hindawi Publishing Corporation, 19 pages.
[43] Y. Suzuki, S. Kato, H. Yamada, and K. Kono. GPUvm: Why not virtualizing GPUs at the hypervisor? In USENIX Annual Technical Conference, pages 109–120, 2014.
[44] H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi. CheCL: Transparent checkpointing and process migration of OpenCL applications. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 864–876. IEEE, 2011.
[45] K. Tian, Y. Dong, and D. Cowperthwaite. A Full GPU Virtualization Solution with Mediated Pass-Through. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 121–132, Philadelphia, PA, 2014. USENIX Association.
[46] VideoLAN. x264, the best H.264/AVC encoder. http://www.videolan.org/developers/x264.html.
[47] J. P. Walters, A. J. Younge, D. I. Kang, K. T. Yao, M. Kang, S. P. Crago, and G. C. Fox. GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications. In 2014 IEEE 7th International Conference on Cloud Computing, pages 636–643, June 2014.
[48] J. Wang, K.-L. Wright, and K. Gopalan. XenLoop: a transparent high performance inter-VM network loopback. In Proceedings of the 17th International Symposium on High Performance Distributed Computing, pages 109–118. ACM, 2008.
[49] Xen Wiki. Xen 4.2: XL and PCI pass-through. https://wiki.xen.org/wiki/Xen_4.2:_xl_and_pci_pass-through.
[50] Xen Wiki. Xen VGA Passthrough. https://wiki.xen.org/wiki/Xen_PCI_Passthrough.
[51] S. Xiao, P. Balaji, J. Dinan, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong, and W.-c. Feng. Transparent Accelerator Migration in a Virtualized GPU Environment. In Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on, pages 124–131, May 2012.
[52] S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong, and W.-c. Feng. VOCL: An optimized environment for transparent virtualization of graphics processing units. In Innovative Parallel Computing (InPar), 2012, pages 1–12, May 2012.
[53] X. Xu and B. Davda. SRVM: Hypervisor support for live migration with passthrough SR-IOV network devices. In Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '16, pages 65–77, New York, NY, USA, 2016. ACM.
[54] C.-T. Yang, J.-C. Liu, H.-Y. Wang, and C.-H. Hsu. Implementation of GPU virtualization using PCI pass-through mechanism. J. Supercomput., 68(1):183–213, Apr. 2014.
[55] Y.-P. You, H.-J. Wu, Y.-N. Tsai, and Y.-T. Chao. VirtCL: A framework for OpenCL device abstraction and management. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 161–172, New York, NY, USA, 2015. ACM.
[56] YouTube Creator Blog. Look ahead: creator features coming to YouTube. https://01.org/igvt-g.
[57] E. Zhai, G. D. Cummings, and Y. Dong. Live migration with pass-through device for Linux VM. In Ottawa Linux Symposium, pages 261–268, 2008.