GPUvm: Why Not Virtualizing GPUs at the Hypervisor?

Yusuke Suzuki (Keio University), Shinpei Kato (Nagoya University), Hiroshi Yamada (Tokyo University of Agriculture and Technology), Kenji Kono (Keio University)

In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14), June 19–20, 2014, Philadelphia, PA.
https://www.usenix.org/conference/atc14/technical-sessions/presentation/suzuki

Abstract

Graphics processing units (GPUs) provide orders-of-magnitude speedup for compute-intensive data-parallel applications. However, enterprise and cloud computing domains, where resource isolation of multiple clients is required, have poor access to GPU technology. This is due to a lack of operating system (OS) support for virtualizing GPUs in a reliable manner. To make GPUs more mature system citizens, we present an open architecture of GPU virtualization with a particular emphasis on the hypervisor. We provide the design and implementation of full- and para-virtualization, including optimization techniques to reduce the overhead of GPU virtualization. Our detailed experiments using a relevant commodity GPU show that the optimized performance of GPU para-virtualization is yet two or three times slower than that of pass-through and native approaches, whereas full-virtualization exhibits a different scale of overhead due to increased memory-mapped I/O operations. We also demonstrate that coarse-grained fairness on GPU resources among multiple virtual machines can be achieved by GPU scheduling; finer-grained fairness needs further architectural support due to the nature of non-preemptive GPU workloads.

1 Introduction

Graphics processing units (GPUs) integrate thousands of compute cores on a chip. Following a significant leap in hardware performance, recent advances in programming languages and compilers have allowed compute applications, as well as graphics, to use GPUs. This paradigm shift is often referred to as general-purpose computing on GPUs, a.k.a. GPGPU. Examples of GPGPU applications include scientific simulations [26, 38], network systems [13, 17], file systems [39, 40], database management systems [14, 18, 22, 36], complex control systems [19, 33], and autonomous vehicles [15, 27].

While the significant performance benefits of GPUs have attracted a wide range of applications, practically deployed GPU-accelerated systems are still largely limited to non-commercial supercomputers. Most GPGPU applications are still being developed in the research phase. This is largely due to the fact that GPUs and their relevant system software are not tailored to support virtualization, preventing enterprise servers and cloud computing services from isolating resources on multiple clients. For example, Amazon Elastic Compute Cloud (EC2) [1] employs GPUs as computing resources, but each client is assigned an individual physical instance of GPUs.

Current approaches to GPU virtualization are classified into I/O pass-through [1], API remoting [8, 9, 11, 12, 24, 37, 41], or hybrid [7]. These approaches are also referred to as back-end, front-end, and para-virtualization, respectively [7]. I/O pass-through, which exposes GPU hardware to guest device drivers, can minimize the overhead of virtualization, but the owner of the GPU hardware is limited to a specific VM by hardware design.

API remoting is more oriented to multi-tasking and is relatively easy to implement, since it needs only to export API calls to the outside of guest VMs. Albeit simple, this approach lacks flexibility in the choice of languages and libraries. The entire software stack must be rewritten to incorporate an API remoting mechanism. Implementing API remoting could also result in enlarging the trusted computing base (TCB) due to the accommodation of additional libraries and drivers in the host.

Para-virtualization allows multiple VMs to access the GPU by providing an ideal device model through the hypervisor, but guest device drivers must be modified to support the device model. Across these three classes of approaches, it is difficult, if not impossible, to use vanilla device drivers for guest VMs while providing resource isolation on multiple VMs. This lack of reliable virtualization support keeps GPU technology out of the enterprise market. In this work, we explore GPU virtualization that allows multiple VMs to share the underlying GPUs without modification of existing device drivers.

Contribution: This paper presents GPUvm, an open architecture of GPU virtualization. We provide the design and implementation of GPUvm based on the Xen hypervisor [3], introducing virtual memory-mapped I/O (MMIO), GPU shadow channels, GPU shadow page tables, and virtual GPU schedulers. These pieces of GPU resource management are provided with both full- and para-virtualization approaches by exposing a native GPU device model to guest device drivers. We also develop several optimization techniques to reduce the overhead of GPU virtualization. To the best of our knowledge, this is the first piece of work that addresses fundamental problems of GPU virtualization. GPUvm is provided as complete open-source software (https://github.com/CS005/gxen).

Organization: The rest of this paper is organized as follows. Section 2 describes the system model behind this paper. Section 3 provides the design concept of GPUvm, and Section 4 presents its prototype implementation. Section 5 shows experimental results. Section 6 discusses related work. This paper concludes in Section 7.

2 Model

The system is composed of a multi-core CPU and an on-board GPU connected on the bus. A compute-intensive function offloaded from the CPU to the GPU is called a GPU kernel, which could produce a large number of compute threads running on a massive set of compute cores integrated in the GPU. The given workload may also launch multiple kernels within a single process.

Product lines of GPU vendors are closely tied with programming languages and architectures. For example, NVIDIA invented the Compute Unified Device Architecture (CUDA) as a GPU programming framework. CUDA was first introduced in the Tesla architecture [30], followed by the Fermi and the Kepler architectures [30, 31]. The prototype system of GPUvm presented in this paper assumes these NVIDIA technologies, yet the design concept of GPUvm is applicable to other architectures and programming languages.

Figure 1 illustrates the GPU resource management model, which is well aligned with, but is not limited to, these architectures. The detailed hardware mechanism is not identical among different vendors, though recent GPUs adopt the same high-level design presented in Figure 1.

Figure 1: The GPU resource management model.

MMIO: The current form of the GPU is an independent compute device. Therefore the CPU communicates with the GPU via MMIO. MMIO is the only interface through which the CPU can directly access the GPU, while hardware engines for direct memory access (DMA) are provided to transfer large amounts of data.

GPU Context: Just as on the CPU, a context must be created to run on the GPU. The context represents the state of GPU computing, a part of which is managed by the device driver, and owns a virtual address space in the GPU.

GPU Channel: Any operation on the GPU is driven by commands issued from the CPU. This command stream is submitted to a hardware unit called a GPU channel and is isolated from the other streams. A GPU channel is associated with exactly one GPU context, while each GPU context can have one or more GPU channels. Each GPU context has GPU channel descriptors for the associated hardware channels, each of which is created as a memory object in GPU memory. Each GPU channel descriptor stores the settings of the corresponding channel, which include a page table. The commands submitted to a GPU channel are executed in the associated GPU context. For each GPU channel, a dedicated command buffer is allocated in GPU memory visible to the CPU through MMIO.

GPU Page Table: Paging is supported by the GPU. Each GPU context is assigned a GPU page table, which isolates its virtual address space from the others. A GPU page table is set in a GPU channel descriptor. All the commands and programs submitted through the channel are executed in the corresponding GPU virtual address space. GPU page tables can translate a GPU virtual address not only to a GPU device physical address but also to a host physical address. This means that the GPU virtual address space is unified over the GPU memory and the host main memory. Leveraging GPU page tables, the commands executed in the GPU context can access the host physical memory with GPU virtual addresses.

PCIe BAR: The following information includes some details about the real system. The host computer is based on the x86 chipset and is connected to the GPU over PCI Express (PCIe). The base address registers (BARs) of PCIe, which work as windows of MMIO, are configured at boot time of the GPU. GPU control registers and GPU memory apertures are mapped on the BARs, allowing the device driver to configure the GPU and access GPU memory.

To operate DMA on the associated host memory, the GPU has a mechanism similar to an IOMMU, such as the graphics address remapping table (GART). However, since DMA issued from the GPU context goes through the GPU page table, the safety of DMA can be guaranteed without an IOMMU.
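To make this resource model concrete, here is a minimal C sketch of how a driver's view of a channel descriptor and the BAR-mapped registers might look. The structure layout, field names, and the aperture offset are illustrative assumptions for this model, not the actual Fermi/Kepler register map (see the Envytools project [23] for the documented details).

```c
#include <stdint.h>

/* Illustrative only: a GPU channel descriptor as a memory object in GPU
 * memory, naming the per-context page table and per-channel settings.
 * Real Fermi/Kepler layouts differ. */
struct gpu_channel_desc {
    uint64_t page_table_base;   /* GPU physical address of the page table    */
    uint64_t command_buffer;    /* GPU virtual address of the command buffer */
    uint32_t flags;             /* channel settings (engines, state, ...)    */
};

/* BARs expose control registers and memory apertures; the driver reaches
 * them with plain MMIO loads and stores. */
static inline uint32_t mmio_read32(volatile void *bar, uint64_t off)
{
    return *(volatile uint32_t *)((volatile uint8_t *)bar + off);
}

static inline void mmio_write32(volatile void *bar, uint64_t off, uint32_t v)
{
    *(volatile uint32_t *)((volatile uint8_t *)bar + off) = v;
}

/* Hypothetical aperture offset of channel `id`'s descriptor.  Every CPU
 * operation on the GPU ultimately reduces to accesses like these, which
 * is why intercepting MMIO suffices for virtualization (Section 3). */
#define CHAN_DESC_APERTURE_OFF(id) (0x700000ull + (uint64_t)(id) * 0x1000)
```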

Documentation: Currently, GPU vendors hide the details of GPU architectures for marketing reasons. Implementations of device drivers and runtime libraries are also protected as proprietary binary software, whereas the compiler source code has recently been released by NVIDIA to a limited extent. Some work uncovers the black-boxed interaction between the GPU and the driver [28]. Recently, the Linux kernel community has developed Nouveau [25], an open-source device driver for NVIDIA GPUs. Throughout its development, the details of NVIDIA architectures have been well documented in the Envytools project [23]. Interested readers are encouraged to visit their website.

Scope and Limitation: GPUvm is an open architecture of GPU virtualization with a solid design and implementation using Xen. Its implementation focuses on NVIDIA Fermi- and Kepler-based GPUs with CUDA. We also restrict our attention to Nouveau as the guest device driver. NVIDIA binary drivers should be usable with GPUvm, but they cannot be successfully loaded with current versions of Xen, even in pass-through mode, which has also been observed in prior work [11, 12].

3 Design

The challenge of GPUvm is to show that the GPU can be virtualized at the hypervisor level. The GPU is a unique and complicated device, and its resources (such as memory, channels, and GPU time) must be multiplexed like those of the host computing system. Although the architectural details of GPUs are not well known, GPUvm virtualizes GPUs by combining the well-established techniques of CPU, memory, and I/O virtualization of traditional hypervisors.

Figure 2 shows the high-level design of GPUvm and the relevant system stack. GPUvm exposes a native GPU device model to VMs where guest device drivers are loaded. VM operations upon this device model are redirected to the hypervisor so that VMs can never access the GPU directly. The GPU Access Aggregator arbitrates all accesses to the GPU to partition GPU resources across VMs. GPUvm adopts a client-server model for communications between the device model and the aggregator. While arbitrating accesses to the GPU, the aggregator modifies them to manage the status of the GPU. This mechanism allows multiple VMs to access a single GPU in an isolated way.

Figure 2: The design of GPUvm and system stack.

3.1 Approaches

GPUvm enables full- and para-virtualization of GPUs at the hypervisor. To isolate multiple VMs on the GPU hardware resources, memory areas, PCIe BARs, and GPU channels must be multiplexed among those VMs. In addition to this spatial multiplexing, the GPU also needs to be scheduled in a fair-share manner. The main components of GPUvm that address this problem are GPU shadow page tables, GPU shadow channels, and GPU fair-share schedulers.

To aggregate accesses to a GPU device model from a guest device driver, GPUvm intercepts MMIO by setting the corresponding ranges as inaccessible.

In order to ensure that one VM can never access the memory area of another VM, GPUvm creates a GPU shadow page table for every GPU channel descriptor, which is protected from guest OSes. All GPU memory accesses are handled through GPU shadow page tables; a virtual address for GPU memory is translated by the shadow page table, not by the one set up by the guest device driver. Since GPUvm validates the contents of the shadow page tables, GPU memory can be safely shared by multiple VMs. By making use of GPU shadow page tables, GPUvm also guarantees that DMA initiated by the GPU never accesses memory areas outside of those allocated to the VM.

To create a GPU context, the device driver must establish the corresponding GPU channel. However, the number of GPU channels is limited in hardware. To multiplex VMs on GPU channels, GPUvm creates shadow channels. GPUvm configures shadow channels, assigns virtual channels to each VM, and maintains the mapping between a virtual channel and a shadow channel. When a guest device driver accesses a virtual channel assigned by GPUvm, GPUvm intercepts and redirects the operations to the corresponding shadow channel.
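The interception path described above can be sketched as follows: guest MMIO ranges are mapped as inaccessible, each access traps into the GPU device model, and channel-related accesses are rewritten from the guest's virtual channel index to the shadow channel reserved for that VM before being forwarded to the GPU Access Aggregator. The types, offset layout, and forwarding callback are hypothetical; the sketch mirrors the flow in the text, not the GPUvm source.

```c
#include <stdint.h>
#include <stdbool.h>

#define CHANNELS_PER_VM 64            /* assumed static partition size */

/* Per-VM state kept by GPUvm (illustrative). */
struct vm {
    int chan_base;                    /* first physical channel owned by this VM */
};

/* A trapped MMIO access as delivered to the GPU device model. */
struct mmio_access {
    uint64_t offset;                  /* offset into the virtual BAR  */
    uint32_t value;                   /* value written or to be read  */
    bool     is_write;
};

/* Hypothetical layout: each channel's control window is 0x1000 bytes. */
static int virtual_channel_of(uint64_t off) { return (int)(off >> 12); }

/* Rewrite a guest access on a virtual channel to the shadow (physical)
 * channel of this VM and forward it; issue_to_gpu stands in for the IPC
 * to the GPU Access Aggregator. */
void handle_channel_mmio(const struct vm *vm, const struct mmio_access *acc,
                         void (*issue_to_gpu)(uint64_t, uint32_t, bool))
{
    int virt = virtual_channel_of(acc->offset);
    if (virt < 0 || virt >= CHANNELS_PER_VM)
        return;                                   /* outside this VM's window */

    uint64_t phys_off = ((uint64_t)(vm->chan_base + virt) << 12)
                        | (acc->offset & 0xfff);  /* virtual -> shadow index  */
    issue_to_gpu(phys_off, acc->value, acc->is_write);
}
```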

3.2 Resource Partitioning

GPUvm partitions physical memory space and MMIO space over PCIe BARs into multiple sections of continuous address space, each of which is assigned to an individual VM. Guest device drivers consider that physical memory space starts at 0, but the actual memory accesses are shifted into the corresponding section through the shadow page tables created by GPUvm. Similarly, PCIe BARs and GPU channels are partitioned into multiple sections of the same size for individual VMs.

The static partitioning is not a critical limitation of GPUvm; dynamic allocation is possible. When a shadow page table refers to a new page, GPUvm can allocate a page, assign it to a VM, and maintain the mappings between guest-physical GPU pages and host-physical GPU pages. For ease of implementation, the current GPUvm employs static partitioning. We plan to implement dynamic allocation in the future.

3.3 GPU Shadow Page Table

GPUvm creates GPU shadow page tables in a reserved area of GPU memory, which translate guest GPU virtual addresses to GPU device physical or host physical addresses. By design, the device driver needs to flush the TLB caches every time a page table entry is updated. GPUvm can intercept TLB flush requests because those requests are issued from the host CPU through MMIO. After the interception, GPUvm updates the corresponding GPU shadow page table entries.

GPU shadow page tables play an important role in protecting GPUvm itself, the shadow page tables, and GPU contexts from buggy or malicious VMs. GPUvm excludes the mappings to those sensitive memory pages from the shadow page tables. Since all memory accesses by the GPU go through the shadow page tables, no VM can access those sensitive memory areas.
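A minimal sketch of the shadowing step, assuming a flat single-level page table for brevity (real GPU page tables are multi-level and reside in GPU memory): when a TLB flush is intercepted, the guest table is scanned and reflected into the shadow table, shifting guest page frames into the VM's partition and dropping any mapping that would expose protected pages. The helper names are ours, not GPUvm's.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define PTE_PRESENT 0x1ull

/* Flat single-level tables for brevity. */
struct gpu_page_table {
    uint64_t *entries;
    size_t    nr_entries;
};

struct vm_partition {
    uint64_t gpu_mem_base;            /* start of this VM's memory section */
};

/* Pages GPUvm must hide: its reserved area, shadow tables, and GPU
 * channel descriptors (stub predicate). */
static bool is_protected(uint64_t gpfn) { (void)gpfn; return false; }

/* Guest "physical" frames start at 0 and are shifted into the VM's
 * partition (Section 3.2). */
static uint64_t guest_to_host_pfn(const struct vm_partition *vm, uint64_t gpfn)
{
    return (vm->gpu_mem_base >> 12) + gpfn;
}

/* Invoked when a TLB-flush request from the guest driver is intercepted
 * through MMIO.  GPU page faults cannot be used, so the whole guest table
 * is scanned and reflected into the shadow table. */
void reflect_on_tlb_flush(const struct vm_partition *vm,
                          const struct gpu_page_table *guest,
                          struct gpu_page_table *shadow)
{
    for (size_t i = 0; i < guest->nr_entries; i++) {
        uint64_t gpte = guest->entries[i];
        uint64_t gpfn = gpte >> 12;

        if (!(gpte & PTE_PRESENT) || is_protected(gpfn)) {
            shadow->entries[i] = 0;               /* drop unsafe mappings */
            continue;
        }
        shadow->entries[i] = (guest_to_host_pfn(vm, gpfn) << 12)
                             | (gpte & 0xfffull); /* keep low flag bits   */
    }
}
```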
There is a subtle problem regarding the pointer to a shadow page table. In the case of the GPU, a pointer to a shadow page table is stored in GPU memory as part of the GPU channel descriptor. With traditional shadow page tables, a pointer to a shadow page table is stored in a privileged register (e.g., CR3 in x86); the VMM intercepts accesses to the privileged register to protect the shadow page tables. To protect GPU channel descriptors, including the pointer to a shadow page table, from being accessed by the GPU, GPUvm excludes the memory areas for GPU channel descriptors from the shadow page tables. Accesses to the GPU channel descriptors from the host CPU all go through MMIO and thus can be easily detected by GPUvm. As a result, GPUvm protects GPU channel descriptors, including pointers to shadow page tables, from buggy VMs.

GPUvm guarantees the safety of DMA. If a buggy driver sets an erroneous physical address when initiating DMA, the memory regions assigned to other VMs or the hypervisor could be destroyed. To avoid this situation, GPUvm makes use of shadow page tables and the unified memory model of the GPU. As explained in Section 2, GPU page tables can map GPU virtual addresses to physical addresses in GPU memory and host memory. Unlike conventional devices, the GPU uses GPU virtual addresses to initiate DMA. If the mapped memory happens to be in host memory, DMA is initiated. Since the shadow page tables are controlled by GPUvm, memory access by DMA is confined to the memory region of the corresponding VM.

The current design of GPUs poses an implementation problem for shadow page tables. With traditional shadow page tables, page faults are extensively utilized to reduce the cost of constructing shadow page tables. However, GPUvm cannot handle page faults caused by the GPU [10]. This makes it impossible to update the shadow page table upon page fault handling, and it is also impossible to trace changes to page table entries. Therefore, GPUvm scans the entire page table upon a TLB flush.

3.4 GPU Shadow Channel

The number of GPU channels is limited in hardware, and they are numerically indexed. The device driver assumes that these indexes start from zero. Since the same index cannot be assigned to multiple channels, channel isolation must be supported to multiplex VMs.

GPUvm provides GPU shadow channels to isolate GPU accesses from VMs. Physical indexes of GPU channels are hidden from VMs, and virtual indexes are assigned to their virtual channels. The mapping between physical and virtual indexes is managed by GPUvm.

When a GPU channel is used, it must be activated. GPUvm tracks which channels each VM has currently activated. These activation requests are submitted through MMIO, and they can be intercepted by GPUvm. When GPUvm receives such a request, it activates the corresponding channel by using the physical index.

Each GPU channel has channel registers, through which the host CPU submits commands to the GPU. Channel registers are placed in GPU virtual address space, which is mapped to a memory aperture. GPUvm manages all physical channel registers and maintains the mapping between physical and virtual GPU channel registers. Since the virtual channel registers are mapped to the memory aperture, GPUvm can intercept accesses to them and redirect them to the physical channel registers. Since the guest GPU driver can dynamically change the location of the channel registers, GPUvm monitors such changes and updates the mapping if necessary.
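The channel bookkeeping described in this subsection can be pictured with the following sketch. The structures and handlers are hypothetical stand-ins for the interception logic: activation requests arrive through MMIO, are translated from virtual to physical channel indexes, and the guest-chosen location of the channel registers is tracked so later accesses can be redirected.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_VCHAN 64

/* Per-VM bookkeeping for shadow channels (names are illustrative). */
struct shadow_channel {
    int      phys_index;      /* hardware channel backing this virtual one  */
    uint64_t regs_gpu_va;     /* guest-chosen location of channel registers */
    bool     active;
};

struct vm_channels {
    struct shadow_channel chan[MAX_VCHAN];
};

/* Stand-in for poking the real hardware through the BAR. */
static void hw_activate_channel(int phys_index) { (void)phys_index; }

/* The guest activates a channel through MMIO; GPUvm intercepts the
 * request, translates the virtual index, and activates the physical
 * channel it maps to. */
void on_guest_activate(struct vm_channels *vc, int virt_index)
{
    if (virt_index < 0 || virt_index >= MAX_VCHAN)
        return;
    vc->chan[virt_index].active = true;
    hw_activate_channel(vc->chan[virt_index].phys_index);
}

/* The guest driver may relocate a channel's register window at run time;
 * GPUvm monitors such updates and refreshes its mapping so later aperture
 * accesses are redirected to the right physical registers. */
void on_guest_move_regs(struct vm_channels *vc, int virt_index, uint64_t new_va)
{
    if (virt_index < 0 || virt_index >= MAX_VCHAN)
        return;
    vc->chan[virt_index].regs_gpu_va = new_va;
}
```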

3.5 GPU Fair-Share Scheduler

So far we have discussed virtualization of memory resources and GPU channels for multiple VMs. We herein provide virtualization of GPU time; this is a scheduling problem. The GPU scheduler of GPUvm is based on the bandwidth-aware non-preemptive device (BAND) scheduling algorithm [21], which was developed for virtual GPU scheduling. The BAND scheduling algorithm is an extension of the CREDIT scheduling algorithm [3] in that (i) the prioritization policy uses reserved bandwidth and (ii) the scheduler intentionally inserts a certain amount of waiting time after the completion of GPU kernels, which leads to fairer utilization of GPU time among VMs. Since current GPUs are not preemptive, GPUvm waits for GPU kernel completion and assigns credits based on the GPU usage. More details can be found in [21].

The BAND scheduling algorithm assumes that the total utilization of virtual GPUs could reach 100%. This is a flaw because there must be some interval during which the CPU executes the GPU scheduler and the GPU remains idle, causing the utilization of the GPU to be less than 100%. This means that even though the total bandwidth is set to 100%, VMs' credit would be left unused if the GPU scheduler consumes some time in the corresponding period. The problem is that the amount of credit to be replenished and the period of replenishment are fixed. If the fixed amount of credit is always replenished, after a while all VMs could have a lot of credit left unused. As a result, the credit may not influence the scheduling decision at all. To overcome this problem, GPUvm accounts for the CPU time consumed by the GPU scheduler and considers it as GPU time.

Note that there is a critical problem in guaranteeing the fairness of GPU time. If a malicious or buggy VM starts an infinite computation on the GPU, it can monopolize GPU time. One possible solution to this problem is to abort the GPU computation if the GPU time exceeds a pre-defined limit of computation time. Another approach is to cut longer requests into smaller pieces, as shown in [4]. As a future direction, we are planning to incorporate disengaged scheduling [29] at the VMM level. Disengaged scheduling provides fair, safe and efficient OS-level management of GPU resources. We believe that GPUvm can incorporate disengaged scheduling without any technical issues beyond engineering effort.
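The following sketch conveys the flavor of this policy as described here: pick the VM with the most credit, dispatch one request, wait for the non-preemptive kernel to finish, charge the elapsed time (including the scheduler's own time) against the VM's credit, and wait briefly after completion. It is a simplification of the BAND algorithm in [21]; the structures and helpers are illustrative, not taken from Gdev or GPUvm.

```c
#include <stdint.h>

#define NR_VMS 4

struct vgpu {
    long credit_us;       /* remaining credit in microseconds           */
    int  has_request;     /* a command submission is queued for this VM */
};

/* Stand-ins for the real mechanisms: dispatching a queued request,
 * polling for kernel completion, and reading a clock. */
static void     dispatch_next_request(int vm) { (void)vm; }
static void     wait_for_kernel_completion(void) {}
static void     wait_briefly_for_new_requests(void) {}
static uint64_t now_us(void) { return 0; }

static int pick_vm_with_most_credit(const struct vgpu *v)
{
    int best = -1;
    for (int i = 0; i < NR_VMS; i++)
        if (v[i].has_request && (best < 0 || v[i].credit_us > v[best].credit_us))
            best = i;
    return best;
}

void band_schedule_once(struct vgpu *v)
{
    int vm = pick_vm_with_most_credit(v);
    if (vm < 0)
        return;

    uint64_t start = now_us();
    dispatch_next_request(vm);
    wait_for_kernel_completion();   /* the GPU is non-preemptive */

    /* Charge GPU time plus the scheduler's own time to this VM so that
     * idle scheduler intervals do not accumulate as stale credit. */
    v[vm].credit_us -= (long)(now_us() - start);

    /* BAND's extension over CREDIT: wait briefly after completion so
     * requests from lightly loaded VMs are not starved. */
    wait_briefly_for_new_requests();
}
```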
3.6 Optimization Techniques

Lazy Shadowing: In principle, GPUvm has to reflect the contents of the guest page tables to the shadow page tables every time it detects a TLB flush. As explained in Section 3.3, GPUvm has to scan the entire page table to find the modified entries in the guest page table because GPUvm cannot use page faults to detect modifications to the guest page tables. Since TLB flushes can happen frequently, the cost of scanning page tables introduces significant overhead. To reduce this overhead, GPUvm delays the reflection to the shadow page tables until an attempt is made to reference them. The shadow page tables are used when memory apertures are accessed or after a GPU kernel starts. Note that GPUvm can intercept memory aperture accesses and command submissions, so GPUvm scans the guest page table at this point in time and reflects it to the shadow page table. By delaying the reflection, GPUvm can reduce the number of page table scans.

BAR Remap: GPUvm intercepts data accesses through BARs to virtualize GPU channel descriptors. By intercepting all data accesses, it keeps the consistency between shadow GPU channel descriptors and guest GPU channel descriptors. However, this design incurs non-trivial overhead because the hypervisor is invoked every time the BAR is accessed. The BAR remap optimization reduces this overhead by limiting the handling of BAR accesses. In this optimization, GPUvm passes through BAR accesses other than those to GPU channel descriptors, because GPUvm does not have to virtualize the values read from or written to BAR areas except for GPU channel descriptors. Even if the BAR accesses are passed through, they must be isolated among multiple VMs. This isolation is achieved by making use of shadow page tables. The BAR accesses from the host CPU all go through GPU page tables; the offsets in the BAR areas are regarded as virtual addresses in GPU memory and translated to GPU physical addresses through the shadow page tables. By setting the shadow page tables appropriately, all accesses to the BAR areas are isolated among VMs.

Para-virtualization: Shadowing of GPU page tables is a major source of overhead in full-virtualization, because the entire page table needs to be scanned to detect changes to the guest GPU page tables. To reduce the cost of detecting the updates, we take a para-virtualization approach. In this approach, the guest GPU page tables are placed within memory areas under the control of GPUvm and cannot be directly updated by guest GPU drivers. To update the guest GPU page tables, the guest GPU driver issues hypercalls to GPUvm. GPUvm validates the correctness of the page table updates when the hypercall is issued. This approach is inspired by direct paging in Xen para-virtualization [3].

Multicall: Hypercalls are expensive because the context is switched to the hypervisor. To reduce the number of hypercalls, GPUvm provides a multicall interface that can batch several hypercalls into one. For example, instead of providing a hypercall that updates one page table entry at a time, GPUvm provides a hypercall that allows multiple page table entries to be updated by one hypercall. The multicall is borrowed from Xen.
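A guest-driver-side sketch of the batching idea, with hypothetical hypercall names (the real GPUvm/Xen interface differs): page-table updates are queued locally and applied with a single hypercall, so the number of traps into the hypervisor shrinks from one per entry to one per batch.

```c
#include <stdint.h>
#include <stddef.h>

/* One guest GPU page-table update; GPUvm validates it when applied. */
struct pte_update {
    uint64_t table;            /* which guest page table (GPU address) */
    uint32_t index;            /* entry index within that table        */
    uint64_t value;            /* new entry value                      */
};

/* Hypothetical hypercalls: one entry per trap vs. a whole batch per trap. */
extern long hypercall_gpu_pte_update(const struct pte_update *u);
extern long hypercall_gpu_pte_update_multi(const struct pte_update *u, size_t n);

#define BATCH 128

static struct pte_update batch[BATCH];
static size_t batched;

void flush_pte_updates(void)
{
    if (batched > 0) {
        /* One context switch to the hypervisor applies the whole batch. */
        hypercall_gpu_pte_update_multi(batch, batched);
        batched = 0;
    }
}

/* Guest-driver side: queue updates instead of trapping per entry. */
void queue_pte_update(uint64_t table, uint32_t index, uint64_t value)
{
    batch[batched].table = table;
    batch[batched].index = index;
    batch[batched].value = value;
    if (++batched == BATCH)
        flush_pte_updates();    /* flush eagerly when the buffer fills */
}
```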

4 Implementation

Our prototype of GPUvm uses Xen 4.2.0, where both the domain 0 and the domain U adopt the Linux kernel v3.6.5. We target the device model of NVIDIA GPUs based on the Fermi/Kepler architectures [30, 31]. While full-virtualization does not require any modification to guest system software, we make a small modification to the GPU device driver called Nouveau, which is provided as part of the mainline Linux kernel, to implement our GPU para-virtualization approach.

Figure 3 shows the overview of the implementation of the GPU Access Aggregator and its interactions with other components. The GPU device model is exposed in the domain U, which is created by QEMU-dm and behaves as a virtual GPU. Guest device drivers in the domain U consider it a normal GPU. It also exposes MMIO PCIe BARs, handling accesses to the BARs in Xen. All accesses to the GPU arbitrated by the GPU Access Aggregator are committed to the GPU through the sysfs interface.

Figure 3: The prototype implementation of GPUvm.

The GPU device model communicates with the GPU Access Aggregator in the domain 0, using POSIX inter-process communication (IPC). The GPU Access Aggregator is a user process in the domain 0, which receives requests from the GPU device model and issues the aggregated requests to the physical GPU.

The GPU Access Aggregator has virtual GPU control blocks and the GPU command scheduler, which represent the state of the virtual GPUs. The GPU device models update their own virtual GPU control blocks using IPC to manage the states of the corresponding virtual GPUs when privileged events such as control register changes are issued from the domain U.

Each virtual GPU control block maintains a queue to store command submission requests issued from the GPU device model. These command submission requests are scheduled to control GPU execution. This command scheduling mechanism is similar to TimeGraph [20]. However, the GPU command scheduler of GPUvm differs from TimeGraph in that it does not use GPU interrupts. It is very difficult, if not impossible, for the GPU Access Aggregator to insert an interrupt command into the original sequence of commands, because user contexts may also use some interrupt commands, and the GPU Access Aggregator cannot recognize them once they are fired. Therefore, our prototype implementation uses a thread-based scheduler polling on the request queue. Whenever command submission requests are stored in the queue, the scheduler dispatches them to the GPU. To calculate GPU time, our prototype polls a GPU control register value that is modified by the hardware just after GPU channels become active/inactive.

Another task of the GPU Access Aggregator is to manage GPU memory and maintain isolation of multiple VMs on the partitioned memory resources. For this purpose, GPUvm creates shadow page tables and channel descriptors in the reserved area of GPU memory.
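A rough sketch of the polling dispatcher described above, using POSIX threads; the request structure, queue, and hardware helpers are placeholders rather than the GPUvm implementation.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A command submission request forwarded from a GPU device model over IPC. */
struct gpu_request {
    int                 vm_id;
    const void         *commands;
    size_t              len;
    struct gpu_request *next;
};

static struct gpu_request *queue_head;
static pthread_mutex_t     queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-ins: submit commands to the hardware and poll the control
 * register that changes when channels become active/inactive. */
static void submit_to_gpu(const struct gpu_request *r) { (void)r; }
static bool gpu_idle(void) { return true; }

static struct gpu_request *pop_request(void)
{
    pthread_mutex_lock(&queue_lock);
    struct gpu_request *r = queue_head;
    if (r)
        queue_head = r->next;
    pthread_mutex_unlock(&queue_lock);
    return r;
}

/* No GPU interrupts are used: a dedicated thread polls the request queue,
 * dispatches submissions, and busy-waits on MMIO for completion.  The
 * measured interval feeds the fair-share scheduler's credit accounting. */
void *scheduler_thread(void *arg)
{
    (void)arg;
    for (;;) {
        struct gpu_request *r = pop_request();
        if (r == NULL)
            continue;              /* keep polling the queue */
        submit_to_gpu(r);
        while (!gpu_idle())
            ;                      /* poll the GPU control register */
    }
    return NULL;
}
```

The aggregator would start one such thread with pthread_create and feed the queue from its IPC endpoint.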
5 Experiments

To demonstrate the effectiveness of GPUvm, we conducted detailed experiments using a relevant commodity GPU. The objective of this section is to answer the following fundamental questions:

1. How much is the overhead of GPU virtualization incurred by GPUvm?
2. How does the number of GPU contexts affect performance?
3. Can multiple VMs meet fairness on GPU resources?

The experiments were conducted on a DELL PowerEdge T320 machine with eight Xeon E5-2470 2.3 GHz processors, 16 GB of memory, and two 500 GB SATA hard disks. We use the NVIDIA Quadro 6000 as the target GPU, which is based on the NVIDIA Fermi architecture. We ran our modified Xen 4.2.0, assigning 4 GB and 1 GB of memory to the domain 0 and the domain U, respectively. In the domain U, Nouveau was running as the GPU device driver and Gdev [21] was running as the CUDA runtime. The following eight schemes were evaluated: Native (non-virtualized Linux 3.6.5), PT (pass-through provided by Xen's pass-through feature), FV Naive (full-virtualization w/o any optimization techniques), FV BAR-Remap (full-virtualization w/ BAR Remapping), FV Lazy (full-virtualization w/ Lazy Shadowing), FV Optimized (full-virtualization w/ BAR Remapping and Lazy Shadowing), PV Naive (para-virtualization w/o multicall), and PV Multicall (para-virtualization w/ multicall).

5.1 Overhead

To identify the overhead of GPU virtualization incurred by GPUvm, we run the well-known GPU benchmark suite Rodinia [5] as well as our microbenchmarks, as listed in Table 1. We measure their execution time on the eight platforms.

Table 1: List of the GPU benchmarks.
Benchmark  Description
NOP        No GPU operation
LOOP       Long-loop compute without data
MADD       1024x1024 matrix addition
MMUL       1024x1024 matrix multiplication
FMADD      1024x1024 matrix floating-point addition
FMMUL      1024x1024 matrix floating-point multiplication
CPY        64 MB of HtoD and DtoH
PINCPY     CPY using pinned host I/O memory
BP         Back propagation (pattern recognition)
BFS        Breadth-first search (graph algorithm)
HS         Hotspot (physics simulation)
LUD        LU decomposition (linear algebra)
SRAD       Speckle reducing anisotropic diffusion (imaging)
SRAD2      SRAD with random pseudo-inputs (imaging)

5.1.1 Results

Figure 4 shows the execution time of the benchmarks on each platform. The x-axis lists the benchmark names, while the y-axis shows the execution time normalized to that of Native. It is clearly observed that the overhead of GPU full-virtualization is mostly unacceptable, but our optimization techniques contribute significantly to reducing this overhead. The execution times obtained in FV Naive are more than 100 times slower in nine benchmarks (nop, loop, madd, fmadd, fmmul, bp, bfs, hs, lud) than those obtained in Native. This overhead can be mitigated by using the BAR Remap and Lazy Shadowing optimization techniques. Since these optimization techniques are complementary to each other, putting them together achieves more performance gain: the execution time becomes 6 times shorter in madd (the best case) and 5 times shorter in mmul (the worst case).

Figure 4: Execution time of the GPU benchmarks on the eight platforms (y-axis: execution time relative to Native).

In some benchmarks, PT exhibits slightly faster performance than Native; in particular, the execution time of madd is 1.5 times shorter in PT than in Native. This is a mysterious behavior of the GPU.

From these experimental results, we also find that GPU para-virtualization is much faster than full-virtualization. The execution times obtained in PV Naive are 3 to 10 times slower than those obtained in Native, except for pincpy; we discuss the reason in the next section. This overhead can also be reduced by our multicall feature, which reduces the number of hypercalls: the increase in execution time in PV Multicall is at most 3 times.
cpu, pincpy, bp, hfs, srad, srad2) are similar in PV Mul- From these experimental results, we also find that ticall, PT, and Native. GPU para-virtualization is much faster than full virtual- ization. The execution times obtained in PV Naive are 3 - 5.2 Performance at Scale 10 times slower than those obtained in Native except for To discern the overhead GPUvm incurs in multiple GPU pincpy. We discuss this reason in the next section. This contexts, we generate GPU workloads and measure their overhead can also be reduced by our reduced hypercalls execution time in two scenarios; in the first scenario, one feature. The execution time increased in PV Multicall is VM executes multiple GPU tasks, in the other scenario, at most 3 times. multiple VM executes GPU tasks. We first launch 1, 2, 4, 8 GPU tasks in one VM with full-virtualized, para- 5.1.2 Breakdown virtualized, pass-throughed, GPU (FV(1VM), PV(1VM), The breakdown on the execution time of the GPU bench- and PT). These tasks are also run on native Linux (Na- marks is shown in Figure 5. We divide the total ex- tive). Next, we prepare 1, 2, 4, 8 VMs and execute one ecution time into five phases; init, htod, launch, dtoh, GPU task on each VM with a full- or para-virtualized and close. Init is time for setting up the GPU to exe- GPU (FV and PV) where all our optimizations are turned cute a GPU kernel. Htod is time for host-to-device data on. In each scenario, we run madd listed in Table 1. transfers. Launch is time for the calculation on GPUs. Specifically, we repeat the GPU kernel execution of Dtoh is device-to-host data transfer time. Close is time madd 10000 times, and measure its execution time. for destroying the GPU kernel. The figure indicates that The results are shown in Fig. 6. The x-axis is the the dominant factor of execution time in GPUvm is init number of launched GPU contexts and the y-axis rep- and close phases. This tendency is significant for four resents execution time. This figure reveals two points.


5.2 Performance at Scale

To discern the overhead GPUvm incurs with multiple GPU contexts, we generate GPU workloads and measure their execution time in two scenarios: in the first scenario, one VM executes multiple GPU tasks; in the other scenario, multiple VMs execute GPU tasks. We first launch 1, 2, 4, and 8 GPU tasks in one VM with a full-virtualized, para-virtualized, or pass-through GPU (FV (1VM), PV (1VM), and PT). These tasks are also run on native Linux (Native). Next, we prepare 1, 2, 4, and 8 VMs and execute one GPU task on each VM with a full- or para-virtualized GPU (FV and PV), where all our optimizations are turned on. In each scenario, we run madd listed in Table 1. Specifically, we repeat the GPU kernel execution of madd 10000 times and measure its execution time.

The results are shown in Figure 6. The x-axis is the number of launched GPU contexts and the y-axis represents the execution time. This figure reveals two points. One is that full-virtualized GPUvm incurs larger overhead as the number of GPU contexts grows. Time spent in the init and close phases is longer in FV and FV (1VM) since GPUvm performs exclusive accesses to some GPU resources including the dynamic window. The other is that para-virtualized GPUvm achieves performance similar to the pass-through GPU. The total execution time in PV and PV (1VM) is quite similar to that in PT even as GPU contexts grow. The kernel execution times in both FV (1VM) and FV are larger than in the other GPU configurations, while those in the three GPU configurations are longer with 4 and 8 GPU contexts. This comes from GPUvm's overhead of polling a GPU register to detect GPU kernel completion through MMIO.

Figure 6: Performance across multiple VMs (total time in seconds, split into kernel execution time and init & close time).

5.3 Performance Isolation

To demonstrate how GPUvm achieves performance isolation among VMs, we launch a GPU workload on 2, 4, and 8 VMs and measure each VM's GPU usage. For comparison, we use three schedulers: FIFO, CREDIT, and BAND. FIFO issues GPU requests in a first-in/first-out manner. CREDIT schedules GPU requests in a proportional fair-share manner; specifically, CREDIT reduces credits assigned to a VM in advance when its GPU requests are executed, and issues GPU requests of the VM whose credits are highest. BAND is our scheduler described in Section 3.5. We prepare two GPU tasks; one is madd, used in the previous experiment, and the other is an extended madd (long-madd), which performs 5 times more calculations than the regular madd. Each VM loops one of them, and we run each task on a half of the VMs. For example, madd runs on 2 VMs while long-madd runs on 2 VMs in the 4-VM case.

Figure 7 shows the results. The x-axis represents the elapsed time and the y-axis is each VM's GPU usage over 500 msec. The figure reveals that BAND is the only scheduler that achieves performance isolation in all cases. In FIFO, GPU usages of long-madd are higher in all cases since FIFO dispatches more GPU commands from long-madd than from madd. CREDIT fails to achieve fairness among VMs in the 2-VM case. Since the command submission request queue contains requests only from long-madd just after madd's commands have completed, CREDIT dispatches long-madd's requests. As a result, the VM running madd has to wait for the completion of long-madd's commands. BAND waits for request arrivals for a short period just after a GPU kernel completes, so it can handle the requests issued from the VMs whose GPU usage is lower.

Figure 7: VMs' GPU usages (over 500 ms) on the three GPU schedulers (panels: 2, 4, and 8 VMs under FIFO, CREDIT, and BAND).

CREDIT achieves fair-share GPU scheduling in the 4- and 8-VM cases. In these cases, CREDIT has more opportunities to dispatch the commands of VMs with fewer credits, for the following two reasons. First, GPUvm's queue has GPU command submission requests from two or more VMs just after a GPU kernel completes, differently from the 2-VM case. Second, GPUvm submits GPU operations from three or more VMs that complete shortly, so GPUvm has more scheduling points.

Note that BAND cannot achieve fairness among VMs in a fine-grained manner on current GPUs. Figure 8 shows VMs' GPU usages over 100 msec. Even with BAND, the GPU usages fluctuate over time. This is because the GPU is a non-preemptive device. To achieve finer-grained GPU fair-share scheduling, we need a novel mechanism inside the GPU which effectively switches GPU kernels.

Figure 8: VMs' GPU usages (over 100 ms) on the three GPU schedulers (panels: 2, 4, and 8 VMs under FIFO, CREDIT, and BAND).

6 Related Work

Table 5 briefly summarizes the characteristics of several GPU virtualization schemes and compares them to GPUvm. Some vendors have invented techniques for GPU virtualization. NVIDIA has announced NVIDIA VGX [32], which exploits the virtualization support of Kepler-generation GPUs. These are proprietary, so their details are closed. To the best of our knowledge, GPUvm is the first open architecture of GPU virtualization offered by the hypervisor. GPUvm carefully selects the resources to virtualize, GPU page tables and channels, to keep its applicability to various GPU architectures.

Table 5: A comparison of GPUvm to other GPU virtualization schemes (columns: Category; W/o Library mod.; W/o kernel mod.; Multiplexing; Scheduling).
vCUDA [37], rCUDA [8]:   API Remoting;            √; √; ×; ×
VMGL [24]:               API Remoting;            √; √; ×; ×
VGRIS (gVirtuS) [9, 41]: API Remoting;            √; √; ×; SLA-aware, Proportional-share
Pegasus (GViM) [11, 12]: API Remoting;            √; √; ×; SLA-aware, Throughput-based, Proportional fair-share
SVGA2 [7]:               Para-Virt.;              √; √; ×; ×
LoGV [10]:               Para-Virt.;              √; √; ×; ×
GPU Instances [1]:       Pass-through;            √; ×; ×; ×
GPUvm:                   Full-Virt. & Para-Virt.; √; √; √; Credit-based fair-share

VMware SVGA2 [7] para-virtualizes GPUs to mitigate the overhead of virtualizing GPU graphics features. SVGA2 handles graphics-related requests by using architecture-independent communication to efficiently perform 3D rendering and hide the GPU hardware. While this approach is specific to graphics acceleration, GPUvm coordinates interactions between GPUs and guest device drivers.

Gottschalk et al. propose low-overhead GPU virtualization, named LoGV, for GPGPU applications [10]. Their approach is categorized as para-virtualization, where device drivers in VMs send requests for resource allocation and for mapping memory into system RAM to the hypervisor. Similar to our work, this work exhibits para-virtualization mechanisms to minimize GPGPU virtualization overhead. Our work reveals, in a quantitative way, which virtualization techniques for GPUs are efficient.

API remoting, in which API calls are forwarded from the guest to the host that has the GPU, has been studied widely. GViM [11], vCUDA [37], and rCUDA [8] forward CUDA API calls. VMGL [24] achieves API remoting of OpenGL. gVirtuS [9] supports API remoting of CUDA, OpenCL, and a part of OpenGL. In these approaches, applications are inherently limited to the APIs the wrapper libraries offer. Keeping the wrapper libraries compatible with the original ones is not a trivial task, since new functionality is frequently integrated into GPU libraries including CUDA and OpenCL. Moreover, API remoting requires that the whole GPU software stack, including device drivers and runtimes, become part of the TCB.

Amazon EC2 [1] provides GPU instances. It makes use of pass-through technology to expose a GPU to an instance. Since a passed-through GPU is directly managed by the guest OS, we cannot multiplex the GPU on a physical host.

GPUvm is complementary to GPU command scheduling methods. VGRIS [41] enables us to schedule GPU commands with SLA-aware, proportional-share, or hybrid scheduling. Pegasus [12] coordinates GPU command queuing and CPU dispatching so that multiple VMs can effectively share CPU and GPU resources. Disengaged scheduling [29] applies fair queueing scheduling with a probabilistic extension to GPUs, and provides protection and fairness without compromising efficiency.

XenGT [16] is GPU virtualization for Intel on-chip GPUs with a design similar to GPUvm. Our work provides a detailed analysis of the performance bottlenecks of current GPU virtualization. It is useful for designing the architecture of GPU virtualization, and also useful for GPU vendors to design future GPU architectures that support virtualization.

Some work aims at the efficient management of GPUs at the operating system layer, such as GPU command scheduling [20], a kernel-level GPU runtime [21], OS abstractions for GPUs [34, 35], and a file system for GPUs [39]. These mechanisms can be incorporated into GPUvm.

7 Conclusion

In this work, we have tried to answer the question: why not virtualize GPUs at the hypervisor? This paper has presented GPUvm, an open architecture of GPU virtualization. GPUvm supports full-virtualization and para-virtualization with optimization techniques. Experimental results using our prototype showed that full-virtualization exhibits non-trivial overhead largely due to MMIO handling, and para-virtualization provides performance that is still two or three times slower than pass-through and native approaches. The results also reveal that para-virtualization is preferable in performance, though highly compute-intensive GPU applications may also benefit from full-virtualization if their execution times are much larger than the overhead of full-virtualization.

As future directions, it should be investigated whether the optimization techniques proposed in vIOMMU [2] can be applied to GPUvm. We hope our experience with GPUvm gives insight into designing support for device virtualization such as SR-IOV [6].

8 Acknowledgements

This work was supported in part by the Grant-in-Aid for Scientific Research B (25280043) and C (23500050).

References

[1] Amazon.com. Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/, 2014.
[2] Amit, N., Ben-Yehuda, M., Tsafrir, D., and Schuster, A. vIOMMU: Efficient IOMMU Emulation. In USENIX ATC (2011), pp. 73–86.
[3] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. Xen and the Art of Virtualization. In ACM SOSP (2003), pp. 164–177.
[4] Basaran, C., and Kang, K.-D. Supporting Preemptive Task Executions and Memory Copies in GPGPUs. In Proc. of Euromicro Conf. on Real-Time Systems (2012), IEEE, pp. 287–296.
[5] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proc. of IEEE Int'l Symp. on Workload Characterization (2009), pp. 44–54.
[6] Dong, Y., Yu, Z., and Rose, G. SR-IOV Networking in Xen: Architecture, Design and Implementation. In Proc. of Workshop on I/O Virtualization (2008).

[7] Dowty, M., and Sugerman, J. GPU Virtualization on VMware's Hosted I/O Architecture. ACM SIGOPS Operating Systems Review 43, 3 (2009), 73–82.

[8] Duato, J., Pena, A. J., Silla, F., Mayo, R., and Quintana-Orti, E. rCUDA: Reducing the Number of GPU-based Accelerators in High Performance Clusters. In Proc. of IEEE Int'l Conf. on High Performance Computing and Simulation (2010), pp. 224–231.
[9] Giunta, G., Montella, R., Agrillo, G., and Coviello, G. A GPGPU Transparent Virtualization Component for High Performance Computing Clouds. In Proc. of Int'l Euro-Par Conf. on Parallel Processing (2010), Springer, pp. 379–391.
[10] Gottschlag, M., Hillenbrand, M., Kehne, J., Stoess, J., and Bellosa, F. LoGV: Low-overhead GPGPU Virtualization. In Proc. of Int'l Workshop on Frontiers of Heterogeneous Computing (2013), IEEE, pp. 1721–1726.
[11] Gupta, V., Gavrilovska, A., Schwan, K., Kharche, H., Tolia, N., Talwar, V., and Ranganathan, P. GViM: GPU-Accelerated Virtual Machines. In Proc. of ACM Workshop on System-level Virtualization for High Performance Computing (2009), pp. 17–24.
[12] Gupta, V., Schwan, K., Tolia, N., Talwar, V., and Ranganathan, P. Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems. In USENIX ATC (2011), pp. 31–44.
[13] Han, S., Jang, K., Park, K., and Moon, S. PacketShader: A GPU-accelerated Software Router. ACM SIGCOMM Computer Communication Review 40, 4 (2010), 195–206.
[14] He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., and Sander, P. Relational Joins on Graphics Processors. In ACM SIGMOD (2008), pp. 511–524.
[15] Hirabayashi, M., Kato, S., Edahiro, M., Takeda, K., Kawano, T., and Mita, S. GPU Implementations of Object Detection using HOG Features and Deformable Models. In Proc. of IEEE Int'l Conf. on Cyber-Physical Systems, Networks, and Applications (2013), pp. 106–111.
[16] Intel. Xen - Graphics Virtualization (XenGT). https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-xengt, 2013.
[17] Jang, K., Han, S., Han, S., Moon, S., and Park, K. SSLShader: Cheap SSL Acceleration with Commodity Processors. In USENIX NSDI (2011), pp. 1–14.
[18] Kaldewey, T., Lohman, G. M., Müller, R., and Volk, P. B. GPU Join Processing Revisited. In Proc. of Int'l Workshop on Data Management on New Hardware (2012), pp. 55–62.
[19] Kato, S., Aumiller, J., and Brandt, S. Zero-copy I/O Processing for Low-latency GPU Computing. In Proc. of ACM/IEEE Int'l Conf. on Cyber-Physical Systems (2013), pp. 170–178.
[20] Kato, S., Lakshmanan, K., Rajkumar, R., and Ishikawa, Y. TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments. In USENIX ATC (2011), pp. 17–30.
[21] Kato, S., McThrow, M., Maltzahn, C., and Brandt, S. Gdev: First-Class GPU Resource Management in the Operating System. In USENIX ATC (2012), pp. 401–412.
[22] Kim, C., Chhugani, J., Satish, N., Sedlar, E., Nguyen, A. D., Kaldewey, T., Lee, V. W., Brandt, S. A., and Dubey, P. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In ACM SIGMOD (2010), pp. 339–350.
[23] Koscielnicki, M. Envytools. https://0x04.net/envytools.git, 2014.
[24] Lagar-Cavilla, H. A., Tolia, N., Satyanarayanan, M., and de Lara, E. VMM-Independent Graphics Acceleration. In ACM VEE (2007), pp. 33–43.
[25] Linux Open-Source Community. Nouveau Open-Source GPU Device Driver. http://nouveau.freedesktop.org/, 2014.
[26] Maruyama, N., Nomura, T., Sato, K., and Matsuoka, S. Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers. In Proc. of Int'l Conf. for High Performance Computing, Networking, Storage and Analysis (2011), pp. 1–12.
[27] McNaughton, M., Urmson, C., Dolan, J. M., and Lee, J.-W. Motion Planning for Autonomous Driving with a Conformal Spatiotemporal Lattice. In Proc. of IEEE Int'l Conf. on Robotics and Automation (2011), pp. 4889–4895.
[28] Menychtas, K., Shen, K., and Scott, M. L. Enabling OS Research by Inferring Interactions in the Black-Box GPU Stack. In USENIX ATC (2013), pp. 291–296.
[29] Menychtas, K., Shen, K., and Scott, M. L. Disengaged Scheduling for Fair, Protected Access to Fast Computational Accelerators. In ACM ASPLOS (2014), pp. 301–316.
[30] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/, 2009.
[31] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. http://www.nvidia.com/, 2012.
[32] NVIDIA. NVIDIA GRID VGX Software. http://www.nvidia.com/object/grid-vgx-software.html, 2014.
[33] Rath, N., Bialek, J., Byrne, P., DeBono, B., Levesque, J., Li, B., Mauel, M., Maurer, D., Navratil, G., and Shiraki, D. High-speed, Multi-input, Multi-output Control using GPU Processing in the HBT-EP Tokamak. Fusion Engineering and Design (2012), 1895–1899.
[34] Rossbach, C. J., Currey, J., Silberstein, M., Ray, B., and Witchel, E. PTask: Operating System Abstractions to Manage GPUs as Compute Devices. In ACM SOSP (2011), pp. 223–248.
[35] Rossbach, C. J., Yu, Y., Currey, J., Martin, J.-P., and Fetterly, D. Dandelion: A Compiler and Runtime for Heterogeneous Systems. In ACM SOSP (2013), pp. 49–68.
[36] Satish, N., Kim, C., Chhugani, J., Nguyen, A. D., Lee, V. W., Kim, D., and Dubey, P. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In ACM SIGMOD (2010), pp. 351–362.
[37] Shi, L., Chen, H., Sun, J., and Li, K. vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers 61, 6 (2012), 804–816.
[38] Shimokawabe, T., Aoki, T., Takaki, T., Endo, T., Yamanaka, A., Maruyama, N., Nukada, A., and Matsuoka, S. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. In Proc. of Int'l Conf. for High Performance Computing, Networking, Storage and Analysis (2011), pp. 1–11.
[39] Silberstein, M., Ford, B., Keidar, I., and Witchel, E. GPUfs: Integrating a File System with GPUs. In ACM ASPLOS (2013), pp. 485–498.
[40] Sun, W., Ricci, R., and Curry, M. L. GPUstore: Harnessing GPU Computing for Storage Systems in the OS Kernel. In Proc. of Int'l Systems and Storage Conf. (2012), pp. 9:1–9:12.
[41] Yu, M., Zhang, C., Qi, Z., Yao, J., Wang, Y., and Guan, H. VGRIS: Virtualized GPU Resource Isolation and Scheduling in Cloud Gaming. In ACM HPDC (2013), pp. 203–214.
