Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables

Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables Yaozu Dong and Mochi Xue, Shanghai Jiao Tong University and Intel Corporation; Xiao Zheng, Intel Corporation; Jiajun Wang, Shanghai Jiao Tong University and Intel Corporation; Zhengwei Qi and Haibing Guan, Shanghai Jiao Tong University https://www.usenix.org/conference/atc15/technical-session/presentation/dong This paper is included in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIC ATC ’15). July 8–10, 2015 • Santa Clara, CA, USA ISBN 978-1-931971-225 Open access to the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC ’15) is sponsored by USENIX. Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables Yaozu Dong1, 2, Mochi Xue1,2, Xiao Zheng2, Jiajun Wang1,2, Zhengwei Qi1, Haibing Guan1 {eddie.dong, xiao.zheng}@intel.com {xuemochi, jiajunwang, qizhenwei, hbguan}@sjtu.edu.cn 1Shanghai Jiao Tong University, 2Intel Corporation Abstract computing paradigm called GPU cloud (such as Ama- zon’s GPU cloud [2]). Hence, it is now vitally important The increasing adoption of Graphic Process Unit (GPU) to provide efficient GPU virtualization to provision elas- to computation-intensive workloads has stimulated a new tic GPU resources to multiple users. computing paradigm called GPU cloud (e.g., Amazon’s To address this challenge, two recent full GPU virtual- GPU Cloud), which necessitates the sharing of GPU re- ization techniques, gVirt [29] and GPUvm [28], are pro- sources to multiple tenants in a cloud. However, state-of- posed respectively. gVirt is the first open-source product- the-art GPU virtualization techniques such as gVirt still level full GPU virtualization approach based on Xen hy- suffer from non-trivial performance overhead for graph- pervisor [11] for Intel GPUs, while GPUvm provides ics memory-intensive workloads involving frequent page a Graphic Process Unit (GPU) virtualization approach table updates. on the NVIDIA card. This paper mainly focuses on To understand such overhead, this paper first presents gVirt due to its open-source availability. Specifically, GMedia, a media benchmark, and uses it to analyze the gVirt presents a vGPU instance to each VM to run na- causes of such overhead. Our analysis shows that fre- tive graphics driver, which achieves high performance quent updates to guest VM’s page tables causes excessive and good scalability for GPU-intensive workloads. updates to the shadow page table in the hypervisor, due to the need to guarantee the consistency between guest page While gVirt has made an important first step to provide table and shadow page table. To this end, this paper pro- full GPU virtualization, our measurement shows that it poses gHyvi1, an optimized GPU virtualization scheme still incurs non-trivial overhead for media transcoding based on gVirt, which uses adaptive hybrid page table workloads. Specifically, we build GMedia using Intel’s shadowing that combines strict and relaxed page table MSDK (Media Software Development Kit) to charac- schemes. By significantly reducing trap-and-emulation terize the performance of gVirt. Our analysis uncovers due to page table updates, gHyvi significantly improves that gVirt still suffers from non-trivial performance slow- gVirt’s performance for memory-intensive GPU work- down due to an issue called Massive Update Issue. This loads. Evaluation using GMedia shows that gHyvi can is caused by frequent updates on guest page tables, which achieve up to 13x performance improvement compared lead to excessive VM-exits to the hypervisor to synchro- to gVirt, and up to 85% native performance for multi- nize the shadow page table with the guest page table. thread media transcoding. To address the Massive Update Issue, this paper intro- duces gHyvi, which provides a hybrid page table shadowing scheme to provide optimized full GPU virtual- 1 Introduction ization based on Xen hypervisor for Intel GPUs. In- spired by the GPU programming model, we introduce a The emergence of HPC cloud [30] has shifted many new asynchronous mechanism, namely relaxed page ta- computation-intensive workloads such as machine learn- ble shadowing, which removes trap-and-emulation and ing [24], molecular dynamics simulations [31] and me- thus reduces the overhead of massive page table’s mod- dia transcoding to cloud environments. This necessi- ifications. To minimize the overhead of making guest tates the use of GPU to boost the performance of such and shadow page tables consistent, we combine the two computation-hungry applications, resulting in a new mechanisms into a adaptive hybrid page table shadow- 1The source code of gHyvi will be available at https://01.org/ ing scheme, which take advantage of both the traditional igvt-g. strict and the new relaxed page table shadowing. When 1 USENIX Association 2015 USENIX Annual Technical Conference 517 there are infrequent page table accesses, gHyvi works in 2 Background strict page table shadowing; once the gHyvi detects the guest VM is frequently updating the page table, it will switch to the relaxed page table shadowing. 2.1 GPU for Computing One critical issue of using the relaxed page table shadowing scheme is to reconstruct the shadow pages when shadow pages are inconsistent with guest pages. To better understand the tradeoff of different reconstruction Batch Buffer policies, we implement and evaluate four page table re- Head Render Engine Frame Buffer construction policies: full reconstruction, static partial CMDs Ring Access reconstruction, dynamic partial reconstruction and dy- Buffer Graphic Memory GPU Page Table namic segmented partial reconstruction. Our analysis Fetch shows that the last one usually has better performance Commands Tail than the others, which is thus used as the default policy Feed Access Commands for gHyvi. CPU System Memory We have implemented gHyvi based on gVirt, which Access comprises 600 LoCs. Experiments using GMedia on an Intel GPU card show that gHyvi can achieve up to Figure 1: GPU Programming Model 13x performance improvement compared to gVirt, and up to 85% native performance for multi-thread media transcoding. Our analysis shows that gHyvi wins due to the reduction of up to 69% VM-exits. GPU programming model: Figure 1 illustrates the In summary, this paper makes the following contribu- GPU programming model. The graphics driver produces tions: GPU commands into primary buffer and batch buffer, which is driven by the high level programming APIs like • A GPU-enabled benchmark for media transcoding OpenGL and DirectX. GPU consumes the commands performance (GMedia), by invoking functions from and fulfills the acceleration work accordingly. The pri- Intel MSDK to evaluate and collect the performance mary buffer is a ring structure (ring buffer), which is data on Intel’s GPU platforms. designed to deliver the primary commands. Due to the limited space in the ring buffer, the majority (up to 98%) • A relaxed page table shadowing mechanism as well of commands are in the batch buffer chained to the ring as a hybrid shadow page table scheme, which com- buffer. bines the strict page table shadowing with the relaxed page table shadowing. A register tuple, which includes a head register and a tail register, is implemented in the ring buffer. CPU • Four reconstruction policies: the full reconstruc- fills commands from tail to head, and GPU fetches com- tion policy, static partial reconstruction policy, dy- mands from head to tail, all within the ring buffer. The namic partial reconstruction policy, and the dy- driver notifies GPU the submission and completion of the namic segmented partial reconstruction policy for commands through the tail, while GPU updates the head. relaxed page table shadowing mechanism. Once the CPU completes the placement of commands in the ring buffer and batch buffer, it informs GPU to fetch • An evaluation showing that gHyvi achieves up to the commands. In general, GPU will not fetch the com- 85% native performance for multi-thread media mands placed by the CPU in the ring buffer until the CPU transcoding and a 13x speedup over gVirt. updates the tail register [29]. GPU Cloud: Due to the massive computing power, The rest of the paper is organized as follows: Sec- GPU has been expanded from the original graphic com- tion 2 describes some background information on gVirt puting to general purpose computing. The rising of GPU and GPU programming model. Section 3 presents our cloud, which extends today’s elastic resource manage- benchmark for media transcoding and discusses the Mas- ment capability from CPU to GPU, further enables effi- sive Update Issue in detail, followed by the design and cient hosting of GPU workload in cloud and datacenter implementation of gHyvi In section 4. Then, section 5 environments. The strong demand of hosting GPU ap- evaluates the gHyvi and section 6 discusses the related plications calls for GPU clouds that offer full GPU virtu- work. Finally, section 7 concludes with a brief discus- alization solutions with good performance, full features sion on future work. and sharing capability. 2 518 2015 USENIX Annual Technical Conference USENIX Association 2.2 GPU Benchmarks VM1 global page table VM2 global page table While there are many GPU benchmarks evaluating the ballooned ballooned performance of GPU cards, they mainly focus on graphics ability of cards [1, 8] either for OpenGL or DirectX commands. Though there are a few benchmarks for gen- System memory eral purpose computing (GPGPU) such as Rodinia [12] Guest and Parboil [27], they are not available for Intel’s GPU. Host shadow global page table Besides, existing benchmarks neglect the media process- ing workloads, which is a key to boost the performance of media applications in cloud. To this end, this paper presents GMedia, a media transcoding benchmark shown in Figure 4, based on In- tel’s MSDK (Media Software Development Kit). Intel’s MSDK grants media application developers access to Figure 2: Shared shadow global page table hardware acceleration through a unified API.

Load more