GPU Virtualization on VMware's Hosted I/O Architecture

Micah Dowty, Jeremy Sugerman
VMware, Inc.
3401 Hillview Ave, Palo Alto, CA 94304
micah@.com, [email protected]

Abstract

Modern graphics co-processors (GPUs) can produce high fidelity images several orders of magnitude faster than general purpose CPUs, and this performance expectation is rapidly becoming ubiquitous in personal computers. Despite this, GPU virtualization is a nascent field of research. This paper introduces a taxonomy of strategies for GPU virtualization and describes in detail the specific GPU virtualization architecture developed for VMware's hosted products (VMware Workstation and VMware Fusion).

We analyze the performance of our GPU virtualization with a combination of applications and microbenchmarks. We also compare against software rendering, the GPU virtualization in Parallels Desktop 3.0, and the native GPU. We find that taking advantage of hardware acceleration significantly closes the gap between pure emulation and native, but that different implementations and host graphics stacks show distinct variation. The microbenchmarks show that our architecture amplifies the overheads in the traditional graphics API bottlenecks: draw calls, downloading buffers, and batch sizes.

Our virtual GPU architecture runs modern graphics-intensive games and applications at interactive frame rates while preserving portability. The applications we tested achieve from 86% to 12% of native rates and 43 to 18 frames per second with VMware Fusion 2.0.

1 Introduction

Over the past decade, virtual machines (VMs) have become increasingly popular as a technology for multiplexing both desktop and server commodity computers. Over that time, several critical challenges in CPU virtualization were solved and there are now both software and hardware techniques for virtualizing CPUs with very low overheads [1]. I/O virtualization, however, is still very much an open problem and a wide variety of strategies are used. Graphics co-processors (GPUs) in particular present a challenging mixture of broad complexity, high performance, rapid change, and limited documentation.

Modern high-end GPUs have more transistors, draw more power, and offer at least an order of magnitude more computational performance than CPUs. At the same time, GPU acceleration has extended beyond entertainment (e.g., games and video) into the basic windowing systems of recent operating systems and is starting to be applied to non-graphical high-performance applications including protein folding, financial modeling, and medical image processing. The rise in applications that exploit, or even assume, GPU acceleration makes it increasingly important to expose the physical graphics hardware in virtualized environments. Additionally, virtual desktop infrastructure (VDI) initiatives have led many enterprises to try to simplify their desktop management by delivering VMs to their users. Graphics virtualization is extremely important to a user whose primary desktop runs inside a VM.

GPUs pose a unique challenge in the field of virtualization. Machine virtualization multiplexes physical hardware by presenting each VM with a virtual device and combining their respective operations in the hypervisor platform in a way that utilizes native hardware while preserving the illusion that each guest has a complete stand-alone device. Graphics processors are extremely complicated devices. In addition, unlike CPUs, chipsets, and popular storage and network controllers, GPU designers are highly secretive about the specifications for their hardware. Finally, GPU architectures change dramatically across generations and their generational cycle is short compared to CPUs and other devices. Thus, it is nearly intractable to provide a virtual device corresponding to a real modern GPU. Even starting with a complete implementation, updating it for each new GPU generation would be prohibitively laborious.

Thus, rather than modeling a complete modern GPU, our primary approach paravirtualizes: it delivers an idealized software-only GPU and our own custom graphics driver for interfacing with the guest operating system.

The main technical contributions of this paper are (1) a taxonomy of GPU virtualization strategies—both emulated and passthrough-based, (2) an overview of the virtual graphics stack in VMware's hosted architecture, and (3) an evaluation and comparison of VMware Fusion's 3D acceleration with other approaches.

We find that a hosted model [2] is a good fit for handling complicated, rapidly changing GPUs, while the largely asynchronous graphics programming model is still able to efficiently utilize the GPU.

The rest of this paper is organized as follows. Section 2 provides background and some terminology. Section 3 describes a taxonomy of strategies for exposing GPU acceleration to VMs. Section 4 describes the device emulation and rendering thread of the graphics virtualization in VMware products. Section 5 evaluates the 3D acceleration in VMware Fusion. Section 6 summarizes our findings and describes potential future work.

2 Background

While CPU virtualization has a rich research and commercial history, graphics virtualization is a relatively new area. VMware's virtual hardware has always included a display adapter, but it initially included only basic 2D support [3]. Experimental 3D support did not appear until VMware Workstation 5.0 (April 2005). Both Blink [4] and VMGL [5] used a user-level Chromium-like approach [6] to accelerate fixed function OpenGL in Linux and other UNIX-like guests. Parallels Desktop 3.0 [7] accelerates some OpenGL and Direct3D guest applications with a combination of Wine and proprietary code [8], but loses its interposition while those applications are running. Finally, at the most recent Intel Developer Forum, Parallels presented a demo that dedicates an entire native GPU to a single virtual machine using Intel's VT-d [9, 10].

The most immediate application for GPU virtualization is desktop virtualization. While server workloads still form the core use case for virtualization, desktop virtualization is now the strongest growth market [11]. Desktop users run a diverse array of applications, including entertainment, CAD, and visualization software. Windows Vista, Mac OS X, and recent Linux distributions all include GPU-accelerated windowing systems. Furthermore, an increasing number of ubiquitous applications are adopting GPU acceleration. Adobe Flash Player 10, the next version of a product which currently reaches 99.0% of viewers [12], will include GPU acceleration. There is a user expectation that virtualized applications will "just work", and this increasingly includes having access to their graphics card.

2.1 GPU Hardware

This section briefly introduces GPU hardware. It is not within the scope of this paper to provide a full discussion of GPU architecture and programming models.

Graphics hardware has experienced a continual evolution from mere CRT controllers to powerful programmable stream processors. Early graphics accelerators could draw rectangles or bitmaps. Later graphics accelerators could rasterize triangles and transform and light them in hardware. With current PC graphics hardware, formerly fixed-function transformation and shading has become generally programmable. Graphics applications use high-level Application Programming Interfaces (APIs) to configure the pipeline, and provide shader programs which perform application specific per-vertex and per-pixel processing on the GPU [13]. Future GPUs are expected to continue providing increased programmability. Intel recently announced its Larrabee [14] architecture, a potentially disruptive technology which follows this trend to its extreme.

With the recent exception of many AMD GPUs, for which open documentation is now available [15], GPU hardware is proprietary. NVIDIA's hardware documentation, for example, is a closely guarded trade secret. Nearly all graphics applications interact with the GPU via a standardized API such as Microsoft's DirectX or the vendor-independent OpenGL standard.

3 GPU Virtualization Taxonomy

This section explores the GPU virtualization approaches we have considered at VMware. We use four primary criteria for judging them: performance, fidelity, multiplexing, and interposition. The former two emphasize minimizing the cost of virtualization: users desire native performance and full access to the native hardware features. The latter two emphasize the added value of virtualization: virtualization is fundamentally about enabling many virtual instances of one physical entity and then hopefully using that abstraction to deliver secure isolation, resource management, virtual machine portability, and many other features enabled by insulating the guest from physical hardware dependencies.

We observe that different use cases weight the criteria differently—for example, a VDI deployment values high VM-to-GPU consolidation ratios (e.g., multiplexing) while a consumer running a VM to access a game or CAD application unavailable on his host values performance and likely fidelity. A tech support person maintaining a library of different configurations and an IT administrator running server VMs are both likely to value portability and secure isolation (interposition).

Since these criteria are often in opposition (e.g., performance at the expense of interposition), we describe several possible designs. Rather than give an exhaustive list, we describe points in the design space which highlight interesting trade-offs and capabilities. At a high level, we group them into two categories: front-end (application facing) and back-end (hardware facing).

3.1 Front-end Virtualization

Front-end virtualization introduces a virtualization boundary at a relatively high level in the stack, and runs the graphics driver in the host/hypervisor.

This approach does not rely on any GPU vendor- or model-specific details. Access to the GPU is entirely mediated through the vendor provided APIs and drivers on the host while the guest only interacts with software. Current GPUs allow applications many independent "contexts" so multiplexing is easy. Interposition is not a given—unabstracted details of the GPU's capabilities may be exposed to the virtual machine for fidelity's sake—but it is straightforward to achieve if desired. However, there is a performance risk if too much abstraction occurs in pursuit of interposition.

Front-end techniques exist on a continuum between two extremes: API remoting, in which graphics API calls are blindly forwarded from the guest to the external graphics stack via remote procedure call, and device emulation, in which a virtual GPU is emulated and the emulation synthesizes host graphics operations in response to actions by the guest device drivers. These extremes have serious disadvantages that can be overcome by intermediate solutions. Pure API remoting is simple to implement, but completely sacrifices interposition and involves wrapping and forwarding an extremely broad collection of entry points. Pure emulation of a modern GPU delivers excellent interposition and implements a narrower interface, but a highly complicated and under-documented one.

Our hosted GPU acceleration employs front-end virtualization and is described in Section 4. Parallels Desktop 3.0, Blink, and VMGL are other examples of front-end virtualization. Parallels appears to be closest to pure API remoting, as VM execution state cannot be saved to disk while OpenGL or Direct3D applications are running. VMGL uses Chromium to augment its remoting with OpenGL state tracking, and Blink implements something similar. This gives them suspend-to-disk functionality and reduces the amount of data which needs to be copied across the virtualization boundary.

3.2 Back-end Virtualization

Back-end techniques run the graphics driver stack inside the virtual machine, with the virtualization boundary between the stack and physical GPU hardware. These techniques have the potential for high performance and fidelity, but multiplexing and especially interposition can be serious challenges. Since a VM interacts directly with proprietary hardware resources, its execution state is bound to the specific GPU vendor and possibly the exact GPU model in use. However, exposure to the native GPU is excellent for fidelity: a guest can likely exploit the full range of hardware abilities.

The most obvious back-end virtualization technique is fixed pass-through: the permanent association of a virtual machine with full exclusive access to a physical GPU. Recent chipset features, such as Intel's VT-d, make fixed pass-through practical without requiring any special knowledge of a GPU's programming interfaces. However, fixed pass-through is not a general solution. It completely forgoes any multiplexing, and packing machines with one GPU per virtual machine (plus one for the host) is not feasible.

One extension of fixed pass-through is mediated pass-through. As mentioned, GPUs already support multiple independent contexts, and mediated pass-through proposes dedicating just a context, or set of contexts, to a virtual machine rather than an entire GPU. This allows multiplexing, but incurs two additional costs: the GPU hardware must implement contexts in a way that they can be mapped to different virtual machines with low overheads, and the host/hypervisor must have enough of a hardware driver to allocate and manage GPU contexts. Potentially third, if each context does not appear as a full (logical) device, the guest device drivers must be able to handle it.

Mediated pass-through is still missing any interposition features beyond (perhaps) basic isolation. A number of tactics using paravirtualization or standardization of a subset of hardware interfaces can potentially unlock these additional interposition features. Analogous techniques for networking hardware were presented at VMworld 2008 [16].

4 VMware's Virtual GPU

All of VMware's products include a virtual display adapter that supports VGA and basic high resolution 2D graphics modes. On VMware's hosted products, this adapter also provides accelerated GPU virtualization using a front-end virtualization strategy. To satisfy our design goals, we chose a flavor of front-end virtualization which provides good portability and performance, and which integrates well with existing operating system driver models. Our approach is most similar to the device emulation approach above, but it includes characteristics similar to those of API remoting. The in-guest driver and emulated device communicate asynchronously with VMware's Mouse-Keyboard-Screen (MKS) abstraction. The MKS runs as a separate thread and owns all of our access to the host GPU (and the windowing system in general).

4.1 SVGA Device Emulation

Our virtual GPU takes the form of an emulated PCI device, the VMware SVGA II card. No physical instances of this card exist, but our virtual implementation acts like a physical graphics card in most respects. The architecture of our PCI device is outlined by Figure 1. Inside the VM, it interfaces with a device driver we supply for common guest operating systems. Currently only the Windows XP driver has 3D acceleration support. Outside the VM, a user-level device emulation process is responsible for handling accesses to the PCI configuration and I/O space of the SVGA device.
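
To make the division of labor concrete, the following sketch shows how a user-level emulation process of this kind might dispatch guest accesses to the device's I/O space. It is a minimal illustration only; the register indices, names, and handler signatures are hypothetical and do not reflect the actual SVGA II interface.

```c
/* Illustrative sketch only: register indices and handler signatures are
 * hypothetical, not the real VMware SVGA II interface. */
#include <stdint.h>
#include <stdio.h>

enum { REG_ID, REG_ENABLE, REG_WIDTH, REG_HEIGHT, REG_IRQ_STATUS, REG_COUNT };

static uint32_t regs[REG_COUNT];

/* Called by the emulation process when the guest reads the device's I/O space. */
uint32_t svga_io_read(uint32_t index)
{
    return (index < REG_COUNT) ? regs[index] : 0;
}

/* Called when the guest writes the device's I/O space. */
void svga_io_write(uint32_t index, uint32_t value)
{
    switch (index) {
    case REG_ENABLE:        /* synchronous operations, e.g. mode switching */
    case REG_WIDTH:
    case REG_HEIGHT:
        regs[index] = value;
        break;
    case REG_IRQ_STATUS:    /* write-to-acknowledge interrupt sources */
        regs[REG_IRQ_STATUS] &= ~value;
        break;
    default:
        fprintf(stderr, "unhandled register %u\n", index);
    }
}
```
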

[Figure 1 appears here: a block diagram of the emulated PCI device, showing its I/O space registers (index/value, IRQ configuration and status, GMR configuration), the virtual VRAM and FIFO memory regions exposed through PCI BARs, guest memory regions in system memory, the DMA engine, interrupts, and the MKS/HostOps on the host side.]

Figure 1: VMware SVGA II device architecture

Our virtual graphics device provides three fundamental kinds of virtual hardware resources: registers, Guest Memory Regions (GMRs), and a FIFO command queue. Registers may be located in I/O space, for infrequent operations that must be emulated synchronously, or in the faster "FIFO Memory" region, which is backed by plain system memory on the host. I/O space registers are used for mode switching, GMR setup, IRQ acknowledgement, versioning, and for legacy purposes. FIFO registers include large data structures, such as the host's 3D rendering capabilities, and fast-moving values such as the mouse cursor location—this is effectively a shared memory region between the guest driver and the MKS.

GMRs are an abstraction for guest owned memory which the virtual GPU is allowed to read or write. GMRs can be defined by the guest's video driver using arbitrary discontiguous regions of guest system memory. Additionally, there always exists one default GMR: the device's "virtual VRAM." This VRAM is actually host system memory, up to 128 MB, mapped into PCI memory space via BAR1. The beginning of this region is reserved as a 2D framebuffer.
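
As a rough illustration of how a driver could describe such a discontiguous GMR, the structure below records runs of guest physical pages. The descriptor layout and field names are assumptions made for this sketch, not the documented format.

```c
/* Hypothetical GMR descriptor: a table of (first page, page count) runs
 * describing discontiguous guest memory that the virtual GPU may access. */
#include <stdint.h>
#include <stddef.h>

struct gmr_run {
    uint64_t first_pfn;   /* guest physical page frame number */
    uint32_t num_pages;   /* length of this contiguous run */
};

struct gmr_descriptor {
    uint32_t gmr_id;               /* GMR 0 is the default "virtual VRAM" */
    size_t   num_runs;
    struct gmr_run runs[16];
};

/* Total size in bytes covered by a descriptor (4 KB pages assumed). */
static size_t gmr_size(const struct gmr_descriptor *d)
{
    size_t pages = 0;
    for (size_t i = 0; i < d->num_runs; i++)
        pages += d->runs[i].num_pages;
    return pages * 4096;
}
```
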

[Figure 2 appears here: the virtual graphics stack, spanning guest and host. Guest side: the application, its GPU API, the VMware SVGA driver, guest memory, guest VRAM, and the SVGA FIFO/registers of the SVGA device. Host side: the MKS/HostOps dispatch, shader program translator, surface and state abstraction, DMA engine, 2D and 3D rendering/compositing paths, and the host GPU API/driver atop the physical GPU.]

Figure 2: The virtual graphics stack. The MKS/HostOps Dispatch and rendering occur asynchronously in their own thread.

In our virtual GPU, physical VRAM is not directly visible to the guest. This is important for portability, and it is one of the primary trade-offs made by our front-end virtualization model. To access physical VRAM surfaces like textures, vertex buffers, and render targets, the guest driver schedules an asynchronous DMA operation which transfers data between a surface and a GMR. In every surface transfer, this DMA mechanism adds at least one copy beyond the normal overhead that would be experienced in a non-virtualized environment or with back-end virtualization. Often only this single copy is necessary, because the MKS can provide the host's OpenGL or Direct3D implementation with direct pointers into mapped GMR memory. This virtual DMA model has the potential to far outperform a pure API remoting approach like VMGL or Chromium, not only because so few copies are necessary, but because the guest driver may cache lockable Direct3D buffers directly in GMR memory.
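
The following sketch conveys the shape of such an asynchronous surface transfer request from the guest driver's point of view. The command layout and names are illustrative assumptions, not the actual SVGA3D DMA command.

```c
/* Illustrative surface<->GMR DMA request, in the spirit of the mechanism
 * described above; the layout and field names are assumptions, not the
 * actual SVGA3D command format. */
#include <stdint.h>

enum dma_direction { DMA_GMR_TO_SURFACE, DMA_SURFACE_TO_GMR };

struct surface_dma_cmd {
    uint32_t surface_id;      /* texture, vertex buffer, or render target */
    uint32_t gmr_id;          /* which guest memory region */
    uint32_t gmr_offset;      /* byte offset within the GMR */
    uint32_t length;          /* bytes to transfer */
    enum dma_direction dir;
};

/* Build an upload request; the guest driver would append this to the FIFO
 * and continue running, later using a fence to learn of completion. */
struct surface_dma_cmd make_upload(uint32_t surf, uint32_t gmr,
                                   uint32_t offset, uint32_t length)
{
    struct surface_dma_cmd cmd = { surf, gmr, offset, length,
                                   DMA_GMR_TO_SURFACE };
    return cmd;
}
```
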

Like a physical graphics accelerator, the SVGA device processes commands asynchronously via a lockless FIFO queue. This queue, several megabytes in size, occupies the bulk of the FIFO Memory region referenced by BAR2. During unaccelerated 2D rendering, FIFO commands are used to mark changed regions in the framebuffer, informing the MKS to copy from the guest framebuffer to the physical display. During 3D rendering, the FIFO acts as a transport layer for our architecture-independent SVGA3D rendering protocol. FIFO commands also initiate all DMA operations, perform hardware accelerated blits, and control accelerated video and mouse cursor overlays.

We deliver host to guest notifications via a virtual interrupt. Our virtual GPU has multiple interrupt sources which may be programmed via FIFO registers. To measure the host's command execution progress, the guest may insert FIFO fence commands, each with a unique 32-bit ID. Upon executing a fence, the host stores its value in a FIFO register and optionally delivers an interrupt. This mechanism allows the guest to very efficiently check whether a specific command has completed yet, and to optionally wait for it by sleeping until a FIFO goal interrupt is received.
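
A guest driver might wrap this fence mechanism as sketched below. The helpers fifo_write_cmd, fifo_read_reg, and wait_for_fence_irq are hypothetical stand-ins for the driver's real FIFO accessors and interrupt wait.

```c
/* Hypothetical guest-side fence usage. The extern functions stand in for
 * the driver's real FIFO accessors and its FIFO-goal-interrupt wait. */
#include <stdint.h>
#include <stdbool.h>

extern void     fifo_write_cmd(uint32_t cmd, uint32_t arg); /* append to FIFO */
extern uint32_t fifo_read_reg(uint32_t reg);                /* read FIFO register */
extern void     wait_for_fence_irq(uint32_t id);            /* sleep until IRQ */

enum { CMD_FENCE = 1 };              /* illustrative command ID */
enum { FIFO_REG_LAST_FENCE = 0 };    /* illustrative register index */

static uint32_t next_fence = 1;

/* Insert a fence after the commands submitted so far and return its ID. */
uint32_t fence_insert(void)
{
    uint32_t id = next_fence++;
    fifo_write_cmd(CMD_FENCE, id);
    return id;
}

/* Cheap completion check: has the host executed this fence yet? */
bool fence_passed(uint32_t id)
{
    return (int32_t)(fifo_read_reg(FIFO_REG_LAST_FENCE) - id) >= 0;
}

/* Block until the fence completes, using the FIFO goal interrupt. */
void fence_sync(uint32_t id)
{
    if (!fence_passed(id))
        wait_for_fence_irq(id);
}
```
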

The SVGA3D protocol is a simplified and idealized adaptation of the Direct3D API. It has a minimal number of distinct commands. Drawing operations are expressed using a single flexible vertex/index array notation. All host VRAM resources, including 2D textures, 3D textures, cube environment maps, render targets, and vertex/index buffers, are represented using a homogeneous surface abstraction. Shaders are written in a variant of Direct3D's bytecode format, and most fixed-function render states are based on Direct3D render state.

This protocol acts as a common interchange format for GPU commands and state. The guest contains API implementations which produce SVGA3D commands rather than commands for a specific GPU. This provides an opportunity to actively trade capability for portability. The host can control which of the physical GPU's features are exposed to the guest. As a result, VMs using SVGA3D are widely portable between different physical GPUs. It is possible to suspend a live application to disk, move it to a different host with a different GPU or MKS backend, and resume it. Even if the destination GPU exposes fewer capabilities via SVGA3D, in some cases our architecture can use its layer of interposition as an opportunity to emulate missing features in software. This portability assurance is critical for preventing GPU virtualization from compromising the core value propositions of machine virtualization.
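
To give a feel for the flavor of the protocol, the sketch below shows an invented encoding of a draw command that names surfaces and a single vertex/index array declaration. It is illustrative only; the real SVGA3D command layout differs.

```c
/* Illustrative only: an SVGA3D-flavored draw command in the spirit of the
 * description above. The real protocol's layout and names differ. */
#include <stdint.h>

struct vertex_decl {          /* one vertex stream, bound to a surface */
    uint32_t surface_id;      /* the vertex buffer lives in a surface */
    uint32_t offset, stride;
    uint32_t usage;           /* position, normal, texture coordinate, ... */
};

struct draw_primitives_cmd {
    uint32_t            context_id;
    uint32_t            num_decls;
    struct vertex_decl  decls[8];
    uint32_t            index_surface_id;   /* 0 = non-indexed draw */
    uint32_t            primitive_type;     /* triangle list, strip, ... */
    uint32_t            primitive_count;
};
```
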
4.2 Rendering

This FIFO design is inherently asynchronous. All host-side rendering happens in the MKS thread, while the guest's virtual CPUs execute concurrently. As illustrated in Figure 2, access to the physical GPU is mediated first through the GPU vendor's driver running in the host OS, and secondly via the Host Operations (HostOps) backends in the MKS. The MKS has multiple HostOps backend implementations, including GDI and X11 backends to support basic 2D graphics on all Windows and Linux hosts, a VNC server for remote display, and 3D accelerated backends written for both Direct3D and OpenGL. In theory we need only an OpenGL backend to support Windows, Linux, and Mac OS hosts; however, we have found Direct3D drivers to be of generally better quality, so we use them when possible. Additional backends could be written to access GPU hardware directly.

The guest video driver writes commands into FIFO memory, and the MKS processes them continuously on a dedicated rendering thread. This design choice is critical for performance; however, it introduces several new challenges in synchronization. In part, this is a classic producer-consumer problem. The FIFO requires no host-guest synchronization as long as it is never empty nor full, but the host must sleep any time the FIFO is empty, and the guest must sleep when it is full. The guest may also need to sleep for other reasons. The guest video driver must implement some form of flow control, so that video latency is not unbounded if the guest submits FIFO commands faster than the host completes them. The driver may also need to wait for DMA completion, either to recycle DMA memory or to read back results from the GPU. To implement this synchronization efficiently, the FIFO requires both guest to host and host to guest notifications.

The MKS will normally poll the command FIFO at a fixed rate, between 25 and 100 Hz. This is effectively the virtual vertical refresh rate of the device during unaccelerated 2D rendering. During synchronization-intensive 3D rendering, we need a lower latency guest to host notification. The guest can write to the doorbell, a register in I/O space, to explicitly ask the host to poll the command FIFO immediately.
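
The guest-side submission path implied by this design might look like the sketch below, where fifo_has_room, fifo_append, sleep_until_fifo_progress, and write_doorbell_register are hypothetical wrappers around the FIFO state, the host-to-guest notification, and the doorbell register.

```c
/* Hypothetical guest-side submission path showing the flow control and
 * doorbell behavior described above; the extern helpers are stand-ins. */
#include <stdbool.h>
#include <stddef.h>

extern bool fifo_has_room(size_t bytes);      /* is the FIFO not full? */
extern void fifo_append(const void *cmd, size_t bytes);
extern void sleep_until_fifo_progress(void);  /* host-to-guest notification */
extern void write_doorbell_register(void);    /* I/O write: "poll FIFO now" */

void submit_command(const void *cmd, size_t bytes, bool urgent)
{
    /* Flow control: block rather than let video latency grow without bound. */
    while (!fifo_has_room(bytes))
        sleep_until_fifo_progress();

    fifo_append(cmd, bytes);

    /* The MKS normally polls at 25-100 Hz; ring the doorbell when lower
     * latency is needed, e.g. during synchronization-intensive 3D work. */
    if (urgent)
        write_doorbell_register();
}
```
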

5 Evaluation

We conducted two categories of tests: application benchmarks and microbenchmarks. All tests were conducted on the same physical machine: a 2nd generation Apple Mac Pro, with a total of eight 2.8 GHz Intel Xeon cores and an ATI Radeon HD2600 graphics card. All VMs used a single virtual CPU. With one exception, we found that all non-virtualized tests were unaffected by the number of CPU cores enabled.

5.1 Application Benchmarks

Application               Resolution    FPS
RTHDRIBL                  1280 × 1024   22
RTHDRIBL                  640 × 480     27.5
Half Life 2: Episode 2    1600 × 1200   22.2
Half Life 2: Episode 2    1024 × 768    32.2
Civilization 4            1600 × 1200   18
Max Payne 2               1600 × 1200   42

Table 1: Absolute frame rates with VMware Fusion 2.0. All applications run at interactive speeds (18–42 FPS).

The purpose of graphics acceleration hardware is to provide higher performance than would be possible using software alone. Therefore, in this section we measure both the performance impact of virtualized graphics relative to non-virtualized GPU hardware, and the amount of performance improvement relative to TransGaming's SwiftShader [17] software renderer, running in a VMware Fusion virtual machine.

In addition to VMware Fusion 2.0, which uses the architecture described above, we measured Parallels Desktop 3.0 where possible (three of our configurations do not run). Both versions are the most recent public releases at the time of writing. To demonstrate the effects that can be caused by API translation and by the host graphics stacks, we also ran our applications on VMware Workstation 6.5. These used our Direct3D rendering backend on the same hardware, but running Windows XP using Boot Camp.

[Figure 3 appears here: a bar chart comparing VMware Fusion 2.0, VMware Workstation 6.5, Parallels Desktop 3.0, and SwiftShader on RTHDRIBL (1280x1024 and 640x480), Half Life 2: Episode 2 (1600x1200 and 1024x768), Civilization 4 (1600x1200), 3DMark2001 (1024x768), and Max Payne 2 (1600x1200); the x-axis is % of host performance on a log scale from 1 to 100.]

Figure 3: Relative performance of software rendering (SwiftShader) and three hardware accelerated virtualization techniques. The log scale highlights the huge gap between software and hardware acceleration versus the gap between virtualized and native hardware.

It is quite challenging to measure the performance of graphics virtualization implementations accurately and fairly. The system under test has many variables, and they are often difficult or impossible to isolate. The virtualized operating system, host operating system, CPU virtualization overhead, GPU hardware, GPU drivers, and application under test may each have a profound effect on the overall system performance. Any one of these components may have opaque fast and slow paths—small differences in the application under test may cause wide gaps in performance, due to subtle and often hidden details of each component's implementation. For example, each physical GPU driver may have different undocumented criteria for transferring vertex data at maximum speed.

Additionally, the matrix of possible tests is limited by incompatible graphics APIs. Most applications and benchmarks are written for a single API, either OpenGL or Direct3D. Each available GPU virtualization implementation has a different level of API support. Parallels Desktop supports both OpenGL and Direct3D, VMware Fusion supports only Direct3D applications, and VMGL supports only OpenGL.

Figure 3 summarizes the application benchmark results. All three virtualization products performed substantially better than the fastest available software renderer, which obtained less than 3% of native performance in all tests. Applications which are mostly GPU limited, RTHDRIBL [18] and Half Life 2: Episode 2, ran at closer to native speeds. Max Payne exhibits low performance relative to native, but that reflects the low ratio of GPU load to API calls. As a result, virtualization overhead occupies a higher proportion of the whole execution time. In absolute terms, though, Max Payne has the highest frame rate of our applications.

Table 1 reports the actual frame rates exhibited with these applications under VMware Fusion. While our virtualized 3D acceleration still lags native performance, we make two observations: it still achieves interactive frame rates, and it closes the lion's share of the gulf between software rendering and native performance. For example, at 1600 × 1200, VMware Fusion renders Half-Life 2 at 22 frames per second, which is 23.35x faster than software rendering and only 2.4x slower than native.

5.2 Microbenchmarks

To better understand the nature of front-end virtualization's performance impact, we performed a suite of microbenchmarks based on triangle rendering speed under various conditions. For all microbenchmarks, we rendered unlit untextured triangles using Vertex Shader 1.1 and the fixed-function pixel pipeline. This minimizes our dependency on shader translation and GPU driver implementation.

Each test renders a fixed number of frames, each containing a variable number of draw calls with a variable length vertex buffer. To guard against jitter or drift caused by timer virtualization, all tests measured elapsed time via a TCP/IP server running on an idle physical machine. Parameters for each test were chosen to optimize frame duration, so as to minimize the effects of noise from time quantization, network latency, and vertical sync latency.
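
The structure of these tests can be summarized by the sketch below, in which draw_batch, present_frame, and remote_time_ms are stand-ins for the actual Direct3D draw call, the present, and the query to the external TCP/IP timing server.

```c
/* Sketch of the microbenchmark structure described above; the extern
 * functions stand in for the real rendering calls and timing server. */
#include <stdint.h>

extern void     draw_batch(int triangles);   /* one draw call */
extern void     present_frame(void);
extern uint64_t remote_time_ms(void);        /* queried over TCP/IP */

/* Render a fixed number of frames with the given number of draw calls per
 * frame and triangles per draw call; return the mean frame time in ms. */
double run_test(int frames, int draws_per_frame, int batch_size)
{
    uint64_t start = remote_time_ms();
    for (int f = 0; f < frames; f++) {
        for (int d = 0; d < draws_per_frame; d++)
            draw_batch(batch_size);
        present_frame();
    }
    uint64_t elapsed = remote_time_ms() - start;
    return (double)elapsed / frames;
}
```
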

The static vertex test, Figure 4(a), tests performance scalability when rendering vertex buffers which do not change contents. In Direct3D terms, this tests the managed buffer pool. Very little data must be exchanged between host and guest in this test, so an ideal front-end virtualization implementation would do quite well. VMware Workstation manages to get just over 80% of the host's performance in this test. Parallels Desktop and VMware Fusion get around 30%. In our experience, this is due to inefficiency in the Vertex Buffer Object support within Mac OS's OpenGL stack.

The dynamic vertex test, Figure 4(b), switches from the managed buffer pool back to the default Direct3D buffer pool, and uploads new vertex data prior to each of the 100 draws per frame. It tests the driver stack's ability to stream data to the GPU and manage the re-use of buffer memory.

The next test, Figure 4(c), is intended to test virtualization overhead while performing a GPU-intensive operation. While triangles in previous tests had zero pixel coverage, this test renders triangles covering half the viewport. Ideally, this test would show nearly identical results for any front-end virtualization implementation. The actual results are relatively close, but on VMware's platform there is a substantial amount of noise in the results. This appears to be due to the irregular completion of asynchronous commands when the physical GPU is under heavy load. Also worth noting is the fact that VMware Fusion, on average, performed better than the host machine. It is possible that this test is exercising a particular drawing state which is more optimized in ATI's Mac OS OpenGL driver than in their Windows Direct3D driver.

The final test, Figure 4(d), measures the overhead added to every separate draw. This was the only test where we saw variation in host performance based on the number of enabled CPU cores. This microbenchmark illustrates why the number of draw calls per frame is, in our experience, a relatively good predictor of overall application performance with front-end GPU virtualization.

6 Conclusion

In VMware's hosted architecture, we have implemented front-end GPU virtualization using a virtual device model with a high level rendering protocol. We have shown it to run modern graphics-intensive games and applications at interactive frame rates while preserving virtual machine interposition.

There is much future work in developing reliable benchmarks which specifically stress the performance weaknesses of a virtualization layer. Our tests show API overheads of about 2 to 120 times that of a native GPU. As a result, the performance of a virtualized GPU can be highly dependent on subtle implementation details of the application under test.

Back-end virtualization holds much promise for performance, breadth of GPU feature support, and ease of driver maintenance. While fixed pass-through is easy, none of the more advanced techniques have been demonstrated. This is a substantial opportunity for work by GPU and virtualization vendors.

Front-end virtualization currently shows a substantial degradation in performance and GPU feature set relative to native hardware. Nevertheless, it is already enabling virtualized applications to run interactively that could never have been virtualized before, and it is a foundation for virtualization of tomorrow's GPU-intensive software. Even as back-end virtualization gains popularity, front-end virtualization can fill an important role for VMs which must be portable among diverse GPUs.

7 Acknowledgements

Many people have contributed to the SVGA and 3D code over the years. We would specifically like to thank Tony Cannon and Ramesh Dharan for their work on the foundations of our display emulation. Aaditya Chandrasekhar pioneered our shader translation architecture and continues to advance our Direct3D virtualization. Shelley Gong, Alex Corscadden, Mark Sheldon, and Stephen Johnson all actively contribute to our 3D emulation.

References

[1] Keith Adams and Ole Agesen. A Comparison of Software and Hardware Techniques for x86 Virtualization. In Proceedings of ASPLOS '06, October 2006.

[2] Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the 2001 USENIX Annual Technical Conference, June 2001.

[3] VMware SVGA Device Interface and Programming Model. In X.org source repository, xf86-video-vmware driver README.

[4] Jacob Gorm Hansen. Blink: Advanced Display Multiplexing for Virtualized Applications. In Proceedings of NOSSDAV 2007, June 2007.

[5] H. Andrés Lagar-Cavilla et al. VMM-Independent Graphics Acceleration. In Proceedings of VEE '07.

[6] Greg Humphreys et al. Chromium: A Stream-Processing Framework for Interactive Rendering on Clusters. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 693–702 (2002).

[7] Parallels Desktop, http://www.parallels.com/en/desktop/

[8] Parallels on the Wine project wiki, http://wiki.winehq.org/Parallels

[Figure 4 appears here: four microbenchmark plots. (a) Static vertex rendering performance and (b) dynamic vertex rendering performance plot time in seconds against batch size in triangles (up to 100,000). (c) Filled triangle rendering performance plots time against batch size (up to 1,000 triangles). (d) Draw call overhead plots time against the number of draw calls (up to 50,000). Each plot compares Parallels Desktop 3.0, VMware Fusion 2.0, VMware Workstation 6.5, and the host; plot (d) shows the host with both 1 and 8 CPU cores enabled.]

Figure 4: Microbenchmark results

[9] IDF SF08: Parallels and Intel Virtualization for Directed I/O, http://www.youtube.com/watch?v=EiqMR5Wx_r4

[10] D. Abramson et al. Intel Virtualization Technology for Directed I/O. In Intel Technology Journal, http://www.intel.com/technology/itj/2006/v10i3/ (August 2006).

[11] Andi Mann. Virtualization and Management: Trends, Forecasts, and Recommendations, Enterprise Management Associates (2008).

[12] Flash Player Penetration, http://www.adobe.com/products/player_census/flashplayer/

[13] John Owens. GPU architecture overview. In SIGGRAPH '07: ACM SIGGRAPH 2007 courses, http://doi.acm.org/10.1145/1281500.1281643

[14] Larry Seiler et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. In ACM Transactions on Graphics, Vol. 27, No. 3, Article 18 (August 2008).

[15] AMD Developer Guides and Manuals, http://developer.amd.com/documentation/guides/Pages/default.aspx

[16] Howie Xu et al. TA2644: Networking I/O Virtualization, VMworld 2008.

[17] SwiftShader, http://www.transgaming.com/products/swiftshader/

[18] Real-Time High Dynamic Range Image-Based Lighting demo, http://www.daionet.gr.jp/~masa/rthdribl/

Published in the USENIX Workshop on I/O Virtualization 2008.