Implementation of GPU virtualization using PCI pass-through mechanism

Chao-Tung Yang, Jung-Chun Liu, Hsien-Yi Wang & Ching-Hsien Hsu

The Journal of Supercomputing: An International Journal of High-Performance Computer Design, Analysis, and Use

ISSN 0920-8542

J Supercomput DOI 10.1007/s11227-013-1034-4






© Springer Science+Business Media New York 2013

Abstract As a general-purpose scalable parallel programming model for coding highly parallel applications, CUDA from NVIDIA provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. It has proven to be rather effective at programming multithreaded many-core GPUs that scale transparently to hundreds of cores; as a result, scientists throughout industry and academia are using CUDA to dramatically expedite production and research codes. GPU-based clusters are likely to play an essential role in future cloud computing centers, because some computation-intensive applications may require GPUs as well as CPUs. In this paper, we adopted the PCI pass-through technology and set up virtual machines in a virtual environment; thus, we were able to use the NVIDIA graphics card and CUDA high performance computing in them as well. In this way, the virtual machine has not only the virtual CPU but also the real GPU for computing. The performance of the virtual machine is predicted to increase dramatically. This paper measures the performance difference between physical and virtual machines using CUDA, and investigates how varying the number of CPUs in virtual machines influences CUDA performance. Finally, we compare the CUDA performance of two open source virtualization hypervisor environments, with and without PCI pass-through. Through the experimental results, we are able to tell which environment is the most efficient for CUDA in a virtual environment.

C.-T. Yang (B) · J.-C. Liu · H.-Y. Wang Department of Computer Science, Tunghai University, Taichung 40704, Taiwan e-mail: [email protected] J.-C. Liu e-mail: [email protected]

C.-H. Hsu Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu, Taiwan e-mail: [email protected]


Keywords CUDA · GPU virtualization · Cloud computing · PCI pass-through

1 Introduction

1.1 Motivations

Graphics processing units (GPUs) are true many-core processors with hundreds of processing elements. The GPU is a specialized microprocessor that offloads and accelerates graphics rendering from the central microprocessor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structures make them more effective than general-purpose CPUs over a range of complex algorithms. Currently, a CPU has only 8 cores in a single chip, but a GPU has grown to 448 cores. From the number of cores, the GPU is well suited to executing programs amenable to massive parallel processing. Although the clock frequency of cores on the GPU is lower than that of the CPU, its powerful parallel processing ability compensates for the lower frequency. So far, the GPU has been used on supercomputers: on the TOP500 list of November 2010 [1], three of the top five supercomputers were built with NVIDIA GPUs [2], and Titan, the world's fastest supercomputer according to the TOP500 list released in November 2012, was also powered by NVIDIA GPUs [1]. In recent years, the virtualization environment on the Cloud [3] has become more popular than before. The balance between performance and cost is the most important factor. To live up to the potential of server resources, virtualization technology is the main solution for running many more virtual machines on a server so that its resources can be used far more effectively. However, virtual machines have their own performance limitations, so users are restrained from running heavy computation on them. Building a virtual environment in a Cloud computing system for users has become an important trend in the last few years. Proper use of the hardware resources and computing power of each virtual machine is the aim of Infrastructure as a Service (IaaS), which is one of the service architectures of Cloud computing. Nevertheless, virtual machines are limited when the virtualization environment does not support the Compute Unified Device Architecture (CUDA); one remedy is to let virtual machines use the physical General-Purpose computing on Graphics Processing Units (GPGPU) [4, 41–43] in the real machine to assist computing. Since the GPU is a real many-core processor, the computing power of virtual machines will be increased.

1.2 Goal and contribution

In this paper, we explored various hypervisor environments for virtualization, different virtualization types on the cloud system, and several types of hardware virtualization [40]. We focused on GPU virtualization, implemented a system with a virtualization environment, and used the PCI pass-through [5] technology that enables the virtual machines on the system to use the GPU accelerator to increase their computing power. We conducted experiments to compare performance between virtual machines with GPU virtualization and PCI pass-through and the native machine with a GPU.

Then we showed the GPU performance of the virtual machine versus the native machine, and also compared the system time of virtual machines with that of the native machine. At last, we analyzed two other GPU virtualization technologies; the experimental results displayed the performance advantage of using PCI pass-through over the other GPU virtualization technologies.

1.3 Organization of paper

The rest of this work is organized as follows. Section 2 provides background reviews of Cloud computing, virtualization technology, and CUDA [6]. Section 3 describes the system implementation, architecture, and specifications of the Tesla C1060 and Tesla C2050, and the end-user's interface. Section 4 presents the experimental environment, the methods used, the results of GPU virtualization, and the improved performance of the proposed approach. Finally, conclusions are made in Sect. 5.

2 Background review

2.1 Cloud computing

Cloud computing [3] is a computing approach based on the Internet, in which users can remotely use software services and data storage in remote servers. It is a new service architecture that brings a new choice of software and data storage services to users. To use the "Cloud," users no longer need to find out details of the infrastructure in advance, do not need to possess professional knowledge, and are without direct control of the real machines that provide the services. The National Institute of Standards and Technology (NIST) defined the following five basic features for Cloud computing in April 2009 [7]:
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
Cloud computing can be considered to include three levels of service: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) [3]. The architecture of cloud computing is shown in Fig. 1.
• Infrastructure as a Service (IaaS): users subscribe to the required level of computers, network equipment, and other resources from the service provider, and may request changes to the settings. The cost is calculated according to the use of CPU, memory, disk space, and network load.
• Platform as a Service (PaaS): service vendors rent out computers with the necessary hardware and software development environment; developer fees are calculated in accordance with the amount of resources used.
• Software as a Service (SaaS): software stored in the data center is provided to users as a network-accessible service. Charges are per period or pay-per-order.


Fig. 1 Architecture of cloud computing

Fig. 2 Diagram of virtualization

2.2 Virtualization

Virtualization technology [8] is a technology that creates a virtual version of something, such as a hardware platform, an operating system, a storage device, or network resources. The goal of virtualization is to centralize administrative tasks while improving scalability and overall hardware-resource utilization. By using virtualization, several operating systems can be run in parallel on a single powerful server without glitches. The diagram of virtualization is shown in Fig. 2. The case of a general operating system is shown in Fig. 3. To protect instructions, there are four levels of permissions. The user's applications are executed in Ring 3 of the CPU, and the operating system is executed in Ring 0 to control the CPU and hardware. The hardware directly executes requests of the operating system and instructions of user applications.


Fig. 3 The general operating system

Fig. 4 The virtualization operating system

Figure 4 shows the schematic of the virtualized operating system. User's applications are still implemented in Ring 3, and the virtual operating system (Guest OS) is implemented in Ring 1. The original operating system becomes a Virtual Machine Manager (VMM). The Guest OS is not executed directly by the CPU; instead, the VMM translates its instructions for the CPU and other hardware to execute.

2.2.1 Full-virtualization

Unlike the traditional way, in which the operating system kernel runs at the Ring 0 level, full-virtualization uses the hypervisor instead. The hypervisor manages all instructions sent to Ring 0 from the Guest OS. Full-virtualization is shown in Fig. 5; it uses the Binary Translation technology to translate all instructions sent to Ring 0 from the Guest OS and then forwards the requests to the hardware. The hypervisor virtualizes all hardware, and the Guest OS accesses the hardware just like a real machine. It has high independence, but the Binary Translation technology reduces the performance of virtual machines.


Fig. 5 Full-virtualization [9]

Fig. 6 Para-virtualization [10]


2.2.2 Para-virtualization

Para-virtualization, shown in Fig. 6, does not virtualize all hardware. A unique Host OS called Domain0, running in parallel with the other Guest OSs, uses the native operating system to manage the hardware drivers. The Guest OS accesses the real hardware by calling the driver in Domain0 through the hypervisor. A request sent by the Guest OS to the hypervisor is called a hypercall. To let the Guest OS send hypercalls instead of requests directly to the hardware, the Guest OS's kernel needs rewriting; thus some non-open-source operating systems cannot support this. Unlike full-virtualization with its Binary Translation technology, para-virtualization lets the Guest OS use hardware through Domain0. Although the performance of virtual machines is obviously enhanced, the hardware driver is bound to Domain0 and the kernel of the Guest OS needs rewriting; thus its independence is lower than that of full-virtualization (Fig. 6).

2.2.3 Xen

As shown in Fig. 7, there are two types of host virtualization software: the Host OS type and the Hypervisor type. The VM Layer of the Host OS type is deployed on top of the Host OS, such as Windows or Linux, and then the other operating system is installed on top of the VM Layer.


Fig. 7 Host and hypervisor types

Fig. 8 Domain0 and DomainU

The operating system on top of the VM Layer is called the Guest OS. Xen's hypervisor is installed directly on the host, and the other desired operating systems are deployed on top of it; in this way, it is easier to manage the CPU, memory, networks, storage, and other resources. The main purposes of Xen [11] using the hypervisor type and a Virtual Machine Monitor (VMM) are safer and more efficient control of the host CPU, memory, and other resources. There are two types of hypervisors used by Xen: para-virtualization and full-virtualization. The features of these two types of virtualization have been described in detail in Sects. 2.2.1 and 2.2.2. Xen uses a unit called a Domain to manage virtual machines. Its Domains are divided into two types, as shown in Fig. 8. One type, called Domain0, acting like the Host OS and holding the control AP of Xen, is used for management. The other type, called DomainU, is the field where the Guest OS is installed. To use physical resources, DomainU cannot directly call the hardware driver; it must act through Domain0. In industry, Xen has been used in SUSE Linux Enterprise Server (SLES) by Novell, in Red Hat Enterprise Linux (RHEL), and in other commercial Linux versions.


Fig. 9 Architecture of KVM

In addition, Oracle also introduced a virtualization product called Oracle VM, and Sun Microsystems released xVM Server, both based on Xen. In other words, Xen has been widely supported by system vendors in virtualization software.

2.2.4 KVM

The Kernel-based Virtual Machine (KVM) [12] is part of the virtualization architecture in the Linux kernel. The architecture of KVM is shown in Fig. 9. For now, KVM supports native virtualization, that is, hardware-assisted virtualization supported by the CPU. This virtualization technology is called VT-x on Intel CPUs and AMD-V on AMD CPUs. These two kinds of CPUs use different modules to support KVM, namely kvm-intel.ko and kvm-amd.ko in Linux. The Linux kernel has included KVM since version 2.6.20, and FreeBSD uses kernel modules to support KVM. KVM's architecture consists of two parts:
• Kernel Device Driver—used to manage and simulate virtual machine hardware.
• User Space Process—QEMU, a PC hardware emulator, which becomes kqemu after being modified by KVM.
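Whether a Linux host can provide this hardware-assisted virtualization for KVM can be checked from the shell; the commands below are only a minimal sketch using the standard /proc/cpuinfo flags and the module names mentioned above, and the exact module handling may differ by distribution.

    # count CPU flags indicating Intel VT-x (vmx) or AMD-V (svm) support
    egrep -c '(vmx|svm)' /proc/cpuinfo
    # check whether the KVM modules are already loaded
    lsmod | grep kvm
    # load the Intel module if needed (use kvm-amd on AMD hosts)
    modprobe kvm-intel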

2.3 CUDA

CUDA [6, 13–19] is the parallel computing architecture developed by NVIDIA. It was the first time a C compiler was included in a development environment for the GPU; hence, CUDA's programming model maintains a low learning curve for programmers familiar with standard programming languages such as C and FORTRAN (refer to Fig. 10). The architecture of CUDA is compatible with OpenCL [20–22] and the C compiler. The instructions are transformed into PTX code by drivers, whether they come from the CUDA C language or OpenCL, and then executed by the graphics cores. As shown in Fig. 11, the processing flow of CUDA consists of four steps. The first step is to copy data from the main memory of the CPU to the memory of the GPU. In the second, the CPU instructs the GPU to process the data. In the third, the GPU executes the task in parallel on each of its cores. In the last, the results are copied from the memory of the GPU back to the main memory of the CPU.
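To make this four-step flow concrete, the following minimal CUDA C sketch (our own illustrative example, not taken from the SDK; the array size and kernel name are arbitrary) copies two input vectors to the GPU, launches a kernel, and copies the result back:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Step 3: each GPU thread adds one pair of elements in parallel
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        // Step 1: copy input data from CPU (host) memory to GPU (device) memory
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Step 2: the CPU instructs the GPU to process the data (kernel launch)
        vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

        // Step 4: copy the result from GPU memory back to CPU main memory
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }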

2.4 Virtualization on GPU

In recent years, virtualization has become popular, and the demand for it has also increased. Common virtual machines are inadequate for such use, because the environment of the virtual machines is, after all, provided through virtualization.


Fig. 10 CUDA programming model from nVidia [2]

Fig. 11 Processing flow on CUDA from Wiki [16]

Figures 12 and 13 show the two common virtualization approaches to emulate devices and support I/O. Figure 12 shows virtualization with user-space device emulation. The Guest OS must use the emulated device created in the Host OS to communicate with the physical device. Rather than embedding the device emulation within the hypervisor, it is implemented in user space. QEMU [23], which provides not only device emulation but a hypervisor as well, supplies the device emulation and is used by a large number of independent hypervisors such as the Kernel-based Virtual Machine (KVM) and VirtualBox [24]. Figure 13 shows the other way to emulate devices. All devices and I/Os in the virtual machine are emulated by the hypervisor. This is a common method implemented within an operating system-based hypervisor. In this model, the hypervisor includes emulations of common devices that can be shared among various guest operating systems, including virtual disks, virtual network adapters, and other necessary platform elements.


Fig. 12 User space device emulation

Fig. 13 Hypervisor-based device emulation

Fig. 14 Pass-through within the hypervisor

Unlike the two kinds of device emulation described above, device pass-through provides isolation of devices to a given guest operating system, as shown in Fig. 14.


Fig. 15 Front-end and back-end

Assigning devices to specific guests is useful when those devices cannot be shared. As for performance, near-native performance can be achieved using device pass-through.

2.5 Green computing

Green computing [25, 26] means using resources effectively, for example through energy-efficient CPUs, servers, and peripherals, as well as reduced resource consumption. Green computing uses virtualization technology and power management to reach the goals of energy saving and carbon emission reduction. Virtualization is one of the most effective tools for more cost-effective, greener, energy-efficient computing, where each server is divided into multiple virtual machines that run different applications.

2.6 Related works

In recent years, the virtualization environment on the Cloud has become more and more popular. The balance between performance and cost is the most important factor that people focus on. For more effective use of resources on the server, virtualization technology is the solution: by running many virtual machines on a server, resources can be used more effectively. But the performance of virtual machines has limits, and users might be restrained from running heavy computation on virtual machines. To solve this problem, one way is to let the virtual machines use the physical GPGPU in the real machine to help with computing. The other way is using CUDA. There are several approaches for virtualization of the CUDA Runtime API for VMs, such as rCUDA [27–29], vCUDA [30], GViM [31], and gVirtuS [32]. These solutions feature a distributed middleware composed of two parts: the front-end and the back-end [33]. Figure 15 shows that the front-end middleware is installed in the virtual machine, and the back-end middleware, with direct access to the acceleration hardware, is run by the host OS executing the VMM. rCUDA uses the sockets API to let the client and server communicate with each other; through it, a client can use the GPU on a server.


Fig. 16 Architecture of rCUDA

Fig. 17 The vCUDA architecture

There is a production-ready framework to run CUDA applications from VMs, based on a recent CUDA API version. We can use this middleware to make a customized communications protocol [27]. The architecture is shown in Fig. 16. Unlike rCUDA, GViM and vCUDA do not come at the expense of losing VMM independence. The key idea in vCUDA is API call interception and redirection. With API interception and redirection, applications in VMs can access the graphics hardware device and achieve high performance for computing applications; it allows applications executing within virtual machines to leverage hardware acceleration. Shi et al. explained how to transparently access graphics hardware in VMs by API call interception and redirection [30]. Their evaluation showed that GPU acceleration for HPC applications in VMs is feasible and competitive with those running in a native, non-virtualized environment. The architecture is shown in Fig. 17. GViM is a system designed for virtualizing and managing the resources of a general purpose system accelerated by graphics processors.

GViM uses Xen-specific mechanisms for the communication between the front-end and back-end middleware. The GViM virtualization infrastructure for a GPGPU platform enables the sharing and consolidation of graphics processors. The experimental measurements of a Xen-based GViM implementation on a multicore platform with multiple attached NVIDIA graphics accelerators demonstrated small performance penalties for virtualized vs. nonvirtualized settings, coupled with substantial improvements concerning fairness in accelerator use by multiple VMs [31]. VMGL [34] provides OpenGL hardware 3D acceleration for virtual machines. OpenGL applications can run inside a virtual machine through VMGL. VMGL can be used on VMware guests, Xen HVM domains (depending on hardware virtualization extensions), and Xen paravirtual domains, using XVnc or the virtual frame buffer. VMGL is available for X11-based guest OSs: Linux, FreeBSD, and OpenSolaris. VMGL is GPU-independent and supports ATI, NVIDIA, and Intel GPUs. In Duato et al.'s work, a remote GPU was used for the virtual machine. Although this virtualization technique noticeably increases execution time when using a 1 Gbps Ethernet network, it performs almost as efficiently as a local GPU when higher performance interconnects are used. Therefore, the small overhead incurred by the remote use of GPUs is worth the savings attained by a cluster configuration with fewer GPUs than nodes [29]. Kawai et al. proposed DS-CUDA, a middleware to virtualize a GPU cluster as a distributed shared GPU system. It simplifies the development of codes that use multiple GPUs distributed over a network. Results with good scalability were shown in their paper, and the usefulness of the redundant calculation mechanism was confirmed.

3 System implementation

3.1 System architecture

To use the GPU accelerator on virtual machines, we propose to use PCI pass-through to implement a high performance system. In terms of performance, near-native performance can be achieved using device pass-through. This technology is ideal for networking applications, applications with high disk I/O, or applications that would like to use hardware accelerators and have not adopted virtualization because of contention and performance degradation through the hypervisor. Assigning devices to specific guests is also useful when those devices cannot be shared; for example, if a system included multiple video adapters, those adapters could be passed through to unique guest domains. VT-d pass-through is a technique to give a DomU exclusive access to a PCI function using the IOMMU [35] provided by VT-d. It is primarily targeted at HVM (fully virtualized) guests, because para-virtualized pass-through does not require VT-d. Importantly, the hardware must support this feature: in addition to the motherboard chipset and BIOS, the CPU must also support IOMMU I/O virtualization (VT-d). VT-d is disabled by default; to enable it, the "iommu" parameter is used (refer to Fig. 18). This paper used Xen and KVM as hypervisors, and implemented PCI pass-through to pass GPUs to virtual machines via the hypervisor as shown in Fig. 19. Figure 20 shows the user's architecture. Users can use the GPU accelerator via the Internet once the GPU virtualization environment has been set up.
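For reference, the sketch below outlines the host-side steps involved. It is only an illustrative configuration under our assumptions: the PCI address 0000:03:00.0 and the guest name vm1 are hypothetical placeholders, and the exact boot parameters and toolstack commands depend on the distribution and the Xen/KVM versions in use.

    # enable the IOMMU at boot
    #   Linux kernel command line (KVM host):    intel_iommu=on
    #   Xen hypervisor command line (Xen host):  iommu=1

    # Xen: mark the GPU as assignable, then attach it to a running guest
    xl pci-assignable-add 0000:03:00.0
    xl pci-attach vm1 0000:03:00.0

    # or statically, in the guest configuration file:
    #   pci = [ '0000:03:00.0' ]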


Fig. 18 IOMMU on

Fig. 19 System architecture

Fig. 20 User architecture


Fig. 21 Tesla T10

3.2 Tesla C1060 computing processor board

The NVIDIA Tesla C1060 [36] transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk-side that is much faster and more energy-efficient than a shared cluster in the data center. The details of the NVIDIA Tesla C1060 computing processor board's specification are shown below.
• One Tesla T10
• 240 CUDA cores
• 1.296 GHz core frequency
• 933 Gflops Single Precision
• 78 Gflops Double Precision
• 4 GB GDDR3 memory at 102 GB/s bandwidth
• 800 MHz memory frequency
A computer system with an available PCI Express ×16 slot is required for the Tesla C1060. To have the best system bandwidth between the host processor and the Tesla C1060, it is recommended (but not required) that the Tesla C1060 be installed in a PCI Express ×16 Gen2 slot. The Tesla C1060 is based on the massively parallel, many-core Tesla processor, which cooperates with the standard CUDA C programming [15] environment to simplify many-core programming. The architecture of Tesla T10 is shown in Fig. 21.

3.3 Tesla C2050 computing processor board

The NVIDIA Tesla C2050 [38] is based on the next-generation CUDA architecture codenamed "Fermi." The 20-series family of Tesla GPUs supports many "must have" features for technical and enterprise computing, including C++ support, ECC memory for uncompromised accuracy and scalability, and a 7X increase in double precision performance compared to Tesla 10-series GPUs.


Fig. 22 PCI pass-through is successful

Compared to the latest quad-core CPUs, Tesla C2050 computing processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption. Its specifications are shown below.
• One Tesla core
• 448 CUDA cores
• 1.15 GHz core frequency
• 1.03 Tflops Single Precision
• 515 Gflops Double Precision
• 3 GB GDDR5 memory at 144 GB/s bandwidth
• 1.5 GHz memory frequency

3.4 End user’s operating interface

When users create a virtual machine and pass the GPU through to the virtual machine successfully, they can see the result through an application called "virtual machine manager" in Linux. In Fig. 22, the GPU pass-through has been built successfully and the virtualized machine is running. Alternatively, users can also use PieTTY or VNC. Users need Internet and VNC connections; after setting the IP address and port, they can connect to the virtual machine. In the console, users can use the command "lspci" to check whether the PCI pass-through is working or not. The setup processes are shown in Fig. 23 for using PieTTY, Fig. 24 for using VNC, and Fig. 25 for using lspci.
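Besides lspci, the passed-through GPU can also be verified from inside the virtual machine with a short CUDA program. The following minimal sketch (our own example) simply lists the devices visible to the CUDA runtime, where a successfully passed-through Tesla card should appear:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        // ask the CUDA runtime how many GPUs are visible in this (virtual) machine
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // print the name, multiprocessor count, and core clock of each device
            printf("Device %d: %s, %d multiprocessors, %.2f GHz\n",
                   i, prop.name, prop.multiProcessorCount, prop.clockRate / 1.0e6);
        }
        return 0;
    }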

3.5 System environment

Above, we have described the design principles and implementation methods. Here, we present the experimental settings on two machines. The nodes' hardware and software specifications are listed in Table 1.


Fig. 23 Using Pietty

Fig. 24 Using VNC

Fig. 25 Using lspci

As listed in Table 1, we used two machines with the same hardware specification and two hypervisors: Xen and KVM. The purpose is to compare the performance between these two hypervisors using PCI pass-through with the same GPU. The NVS 295 [37] was used as the primary graphics card; the Tesla C1060 (or Tesla C2050) was used for computing and is passed through to the virtual machines. Table 2 lists the hardware/software specifications of the virtual machines. We created three virtual machines with the same specifications except for the number of CPUs and the kind of GPU. We wanted to find out whether the number of CPUs affects the performance of virtual machines with PCI pass-through.


Table 1 Hardware/software specification

CPU Memory Disk OS Hypervisor GPU

Node1 Xeon E5506 12GB 1TB CentOS 6.2 Xen Quadro NVS 295/Tesla C1060/Tesla C2050

Table 2 Hardware/software specification of virtual machine

CPU Memory Disk OS Hypervisor GPU Virtualization

VM1 1, 2, 4 1GB 12GB CentOS 6.2 Xen Quadro NVS 295 Full
VM2 1, 2, 4 1GB 12GB CentOS 6.2 Xen Tesla C1060 Full
VM3 1, 2, 4 1GB 12GB CentOS 6.2 Xen Tesla C2050 Full

So we used 1, 2, or 4 CPUs in the virtual machines for each kind of GPU to see the performance difference among them. We used full virtualization, because we found that PCI pass-through did not work with para-virtualization in our research. Table 3 shows the GPU software environment.

Table 3 GPU software environments
Driver 285.05.33
CUDA toolkit 4.1.28
CUDA SDK 4.1.28

4 Experimental methods and results

4.1 Experimental methods

We set up ten comparison benchmarks: alignedTypes, asyncAPI, BlackScholes, clock, convolutionSeparable, fastWalshTransform, matrixMul, bandwidthTest, matrixmul-sizeable, and VecAdd. The first seven benchmarks are part of the CUDA SDK [13]. From the benchmarks in the CUDA SDK suite, we selected seven representative SDK benchmarks of varying computation loads and data sizes with different CUDA features. These benchmarks were executed with their default settings. Another two benchmarks, matrixmul-sizeable and VecAdd, were selected since their problem sizes can be set to produce a high computation load. The execution time of each SDK benchmark was measured with the command "time" in CentOS [39]. Table 4 shows the size of data transfer for each benchmark. The first experiment is a GPU performance comparison between the native and virtual machine; we present the effect of using PCI pass-through to pass the GPU to virtual machines. The second experiment is a performance comparison between virtual machines with 1 CPU, 2 CPUs, and 4 CPUs, to see whether the number of CPUs in virtual machines affects GPU performance. The final experiment compares the GPU performance of the proposed implementation using PCI pass-through with two other GPU virtualization technologies, i.e., rCUDA and vCUDA. We will show that the implementation of GPU virtualization using PCI pass-through results in better GPU performance.
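For illustration, timing one of the SDK binaries with the time command looks like the following; the install path shown is only indicative of a default CUDA SDK 4.1 setup and may differ on other systems.

    # run an SDK benchmark under time; "real" is the wall-clock execution time,
    # while "user" and "sys" are the user-mode and kernel-mode CPU times reported
    cd ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
    time ./matrixMul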


Table 4 Data transfers of benchmarks
SDK name Data transfer

Aligned Types 413.26 MB
Async API 128.00 MB
Black Scholes 76.29 MB
Clock 2.50 KB
Convolution Separable 36.00 MB
Fast Walsh Transform 64.00 MB
Matrix Mul 79.00 KB

4.2 Experimental results

To maximize performance, the execution time of an application should be minimized; thus, we can define the performance of a computer as

performance = 1 / execution time    (1)

Thus, to compare two different computers A and B, we have

performance_A / performance_B = execution time_B / execution time_A    (2)

Likewise, to maximize performance, the bandwidth available to applications should be maximized; thus the performance of a computer is proportional to the value of the bandwidth. We first analyzed the GPU performance with the seven CUDA SDK benchmarks running in a VM using PCI pass-through, and compared the execution times of the benchmarks between the VM and a native machine that calls the regular CUDA Runtime library in a nonvirtualized environment. The results of these experiments are reported in the plots below. Figures 26, 27, and 28 show the execution times, user times, and system times for processing the seven SDK benchmarks on the native machine and on the virtual machine with one CPU on Xen using PCI pass-through. We can see that the measured execution times of these benchmarks on the virtual machine are less than those on the native machine. The execution time consists of system time and user time. The user time, or GPU computing time, is very close when processing the same applications on both native and virtual machines; however, the system time on the native machine is more than that on the virtual machines, resulting in a smaller execution time of the SDK benchmarks on virtual machines than on the native machine. Note that the range of the y-axis on the plots is adjusted according to the range of results. The system time looks significantly different in Figs. 26 and 27, since the range of the y-axis in both figures is much smaller than that in Fig. 28.


Fig. 26 Execution time between native and VM with C1060

Fig. 27 Execution time between native and VM with C2050

Fig. 28 Execution time between native and VM with NVS295

Figures 29, 30, and 31 show the execution times, user times, and system times of the seven benchmarks on virtual machines with one and two CPUs. From the figures, we can see that the number of CPUs does not visibly affect the user time, which means that the GPU computing time does not change significantly when the number of CPUs increases from one to two.


Fig. 29 Execution time between 1 Core and 2 Core VMs with C1060

Fig. 30 Execution time between 1 Core and 2 Core VMs with C2050

Fig. 31 Execution time between 1 Core and 2 Core VMs with NVS295

Since the computing task is executed by the same GPU, the number of CPUs does not affect the user time as long as no requests are queued up due to the demand for processing on the one-core CPU.


Fig. 32 Execution time between 2 Core and 4 Core VMs with C1060

Fig. 33 Execution time between 2 Core and 4 Core VMs with C2050

Fig. 34 Execution time between 2 Core and 4 Core VMs with NVS295

Figures 32, 33, and 34 show the execution times, user times, and system times of the seven benchmarks on Xen-based virtual machines with two CPUs and with four CPUs. In the figures, it is obvious again that the number of CPUs does not have a perceivable effect on the performance of the GPU. Since the computing task is executed by the same GPU, the number of CPUs does not affect the user time as long as no requests are queued up due to the demand for processing on the two-core CPU.


Fig. 35 User time with C1060

Fig. 36 User time with C2050

Fig. 37 User time with NVS295

From Figs. 26 to 34, we demonstrate that on the same machine the execution time and user time for processing each benchmark are very close; only the system time differs significantly, which will be further illustrated. For further illustration, in Figs. 35, 36, and 37 we plot the user time, i.e., the GPU computing time, for processing each SDK benchmark. Device pass-through provides isolation of devices to a given guest operating system, as shown in Fig. 14; whether on the native machine or on virtual machines with PCI pass-through, the performance of the GPU is close as long as the I/O bandwidth is similar and no demand is queued up in the CPUs.


Fig. 38 System time with C1060

Fig. 39 System time with C2050

Fig. 40 System time with NVS295

Differences in user time among the native and virtual machines are found to be very slight. Since the user times of the benchmark "clock" are all under 0.001 s, it is not easy to tell the difference in the figures. Figures 38, 39, and 40 show the system time for processing each SDK benchmark. When using the GPU accelerator to help computing, the inner communication of the system is also important. The system time of the native machine, which calls the regular CUDA Runtime library in a nonvirtualized environment, is obviously much longer than that of the virtual machines using PCI pass-through. The system time of the virtual machine with one CPU is shorter than the others, which means that if we run programs with heavy GPU computing, we can simply use one CPU for the virtual machines to save resources on the host server for other users.
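The bandwidth results that follow (Figs. 41–43) presumably correspond to the bandwidthTest benchmark listed in Sect. 4.1. Conceptually, a host-to-device (H2D) measurement can be sketched with CUDA events as below; the buffer size and iteration count are arbitrary choices of ours, and pinned memory, which bandwidthTest can also use, is omitted for brevity.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;   // 64 MB test buffer (contents irrelevant)
        const int iters = 20;
        float *h_buf = (float *)malloc(bytes);
        float *d_buf;
        cudaMalloc((void **)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // time repeated host-to-device copies with CUDA events
        cudaEventRecord(start, 0);
        for (int i = 0; i < iters; ++i)
            cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // report the average H2D bandwidth in MB/s, as plotted in the figures
        double mb = (double)bytes * iters / (1024.0 * 1024.0);
        printf("H2D bandwidth: %.1f MB/s\n", mb / (ms / 1000.0));

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_buf); free(h_buf);
        return 0;
    }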


Fig. 41 Bandwidth test with C1060

Fig. 42 Bandwidth test with C2050

Fig. 43 Bandwidth test with NVS295

In Figs. 41, 42, and 43, H2D means "Host to Device" and D2H means "Device to Host." There is another term, D2D, meaning "Device to Device," for which the measured bandwidth values are almost the same, so we skip those plots here. It is obvious that the bandwidths of the native machine are higher than the others. In the figures, the CPU numbers of the virtual machines do not have a significant effect on the bandwidth. PCI pass-through is the main factor that affects the bandwidth between the virtual machine and the GPU accelerator: on average, the bandwidth of the virtual machines is found to be about 400 MB/s lower than that of the native machine. In Figs. 44, 45, and 46, the application is VecAdd with varied problem sizes of 128, 256, 512, and 1024. The differences in execution time in these four environments, i.e., the native machine and the 1-, 2-, and 4-core virtual machines, are found to be very slight, and yet on Tesla C1060 and Quadro NVS 295 the execution times of the virtual machines are slightly shorter than those of the native machine.


Fig. 44 Execution time of VecAdd with C1060

Fig. 45 Execution time of VecAdd with C2050

Fig. 46 Execution time of VecAdd with NVS295

We think this is caused by the difference in system time between the real machine and the virtual machines, since the GPU performance results, i.e., the user times, among these four machine settings are very close. However, on Tesla C2050, the execution times of the virtual machines are slightly longer than those of the native machine. We think this is caused by the large difference in bandwidth between the native machine and the virtual machines on Tesla C2050, as shown in Fig. 42.


Fig. 47 Execution time of MatrixMul with C1060

Fig. 48 Execution time of MatrixMul with C2050

Fig. 49 Execution time of MatrixMul with NVS295

Figures 47, 48, and 49 show the execution time for processing MatrixMul with varied problem sizes of 256, 512, 1024, and 2048. In the figures, we can see results similar to those of the previous example. The execution times in these four environments are very close. The execution times of the virtual machines are also found to be slightly shorter than those of the native machine on Tesla C1060 and Quadro NVS 295, which is due to the difference in system time between the real machine and the virtual machines, since the GPU performance among these four machine settings is very close.


Fig. 50 Performance of PCI pass-through compared with rCUDA

Fig. 51 Performance of PCI pass-through compared with vCUDA

However, on Tesla C2050, the execution times of the native machine are slightly longer than those of the virtual machine, due to the higher bandwidth of the native machine on Tesla C2050 as shown in Fig. 42. The difference in execution time for problem size 256 is less than 0.1 s, so it is difficult to see the difference clearly in the figures. We also compared the performance of our implementation of GPU virtualization using PCI pass-through with rCUDA and vCUDA. Figures 50 and 51 show the comparison. The time in the figures is the difference in execution time with and without GPU virtualization; we calculated it by subtracting the time before GPU virtualization from the time after GPU virtualization. The execution times are taken from [27, 30]. From these two figures, we can see that using PCI pass-through does not add much time, since PCI pass-through provides isolation of devices to a given guest operating system so that the device, i.e., the GPU, can be used exclusively by that virtual machine. Compared with these two technologies, PCI pass-through is more efficient. Our work has shown that GPU performance is similar on the native and virtual machines; no matter how many CPUs are used in the virtual machines, the GPU provides the same performance through PCI pass-through. It is also seen that if one uses virtual machines, the system time is less than on the real machine, and the system time of the virtual machine with one CPU is less than that with four CPUs.

The inner communication in the virtual machines does not go through the real hardware but simply relies on the memory of the real machine. The data transfer time is shorter than with rCUDA because rCUDA is network based and, as seen, the speed of the network is the key factor for rCUDA; rCUDA, though, can let the virtual machine use not only a local GPU but also a remote GPU over the network. PCI pass-through is also more direct than vCUDA: vCUDA uses middleware as the connection point, which takes more time than PCI pass-through. Thus, using PCI pass-through to implement computing with GPU accelerators in virtual machines can save resources while delivering the same high performance as in real machines.

5 Conclusions and future work

5.1 Concluding remark

In this study, we found that the GPU performance is similar on the native machine and on virtual machines using PCI pass-through: no matter how many CPUs are used in the virtual machines, the GPU provides the same performance through PCI pass-through. When we use virtual machines with PCI pass-through to run applications, the system time is found to be less than on the native machine (1.6 s vs. 0.8 s, or 200 %, on NVIDIA Tesla C1060 and C2050; refer to Figs. 38 and 39). The system time of the virtual machine with one CPU is less than that of virtual machines with two or four CPUs, since the inner communication inside virtual machines does not go through real hardware but relies on the memory of the real machine. The data transfer time of our GPU virtualization implementation using PCI pass-through is shorter than that of rCUDA (by about 4.5 s for alignedTypes; refer to Fig. 50), because rCUDA is network based and the speed of the network is the key factor for rCUDA. Codes need to be rewritten to use rCUDA, but not for PCI pass-through; rCUDA, though, can let the virtual machine use not only a local GPU but also a remote GPU over the network. Our implementation of GPU virtualization using PCI pass-through is also more direct than vCUDA (the data transfer time is about 66.5 s shorter for alignedTypes with PCI pass-through; refer to Fig. 51), since vCUDA uses middleware as the connection point, which takes more time than PCI pass-through. Thus, using PCI pass-through to implement computing with GPU accelerators in virtual machines can save resources and achieve high performance similar to that of real machines.

5.2 Future work

In the future, we plan to test more GPU boards for PCI pass-through and implement GPU hot-plugging for virtual machines. One of the problems introduced with device pass-through arises when live migration is needed. Live migration is a great feature to support load balancing of VMs over a network of physical hosts, but it presents a performance bottleneck when PCI pass-through devices are used. In future work, we will use hot-plugging to address this issue, since hot-plugging allows PCI devices to join and leave a given kernel; thus GPU hot-plugging is very useful for the performance of virtual machines and of the whole system.


There is an open source monitoring system called OpenNebula, in which the interface of virtual machines can be controlled through webpages. Therefore, we may add OpenNebula to control virtual machines with GPUs using PCI pass-through.

Acknowledgement This work was supported in part by the National Science Council, Taiwan, ROC, under grant numbers NSC 102-2218-E-029-002, NSC 101-2218-E-029-004, and NSC 102-2622-E-029-005-CC3. This work was also supported in part by Tunghai University, Taiwan, ROC, under grant number GREEnS 04-2.

References

1. TOP 500 (2013) http://www.top500.org. Accessed 17 September 2013
2. nVidia (2013) http://www.nvidia.com. Accessed 17 September 2013
3. Cloud computing (2013) http://en.wikipedia.org/wiki/Cloud_computing. Accessed 17 September 2013
4. GPGPU (2013) http://en.wikipedia.org/wiki/GPGPU. Accessed 17 September 2013
5. PCI-pass-through (2013) http://www.ibm.com/developerworks/linux/library/l-pci-passthrough. Accessed 17 September 2013
6. CUDA (2013) http://www.nvidia.com.tw/object/cuda_home_new_tw.html. Accessed 17 September 2013
7. National Institute of Standards and Technology (2013) http://www.nist.gov/index.html. Accessed 17 September 2013
8. Virtualization (2013) http://en.wikipedia.org/wiki/Virtualization. Accessed 17 September 2013
9. Full virtualization (2013) http://en.wikipedia.org/wiki/Full_virtualization. Accessed 17 September 2013
10. Para virtualization (2013) http://en.wikipedia.org/wiki/Paravirtualization. Accessed 17 September 2013
11. Xen (2013) http://www.xen.org. Accessed 17 September 2013
12. KVM (2013) http://www.linux-kvm.org/page/Main_Page. Accessed 17 September 2013
13. NVIDIA CUDA SDK (2013) http://developer.nvidia.com/cuda-cc-sdk-code-samples. Accessed 17 September 2013
14. Download CUDA (2013) http://developer.nvidia.com/object/cuda.htm. Accessed 17 September 2013
15. NVIDIA CUDA programming guide (2013) http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#abstract. Accessed 17 September 2013
16. CUDA-wiki (2013) http://en.wikipedia.org/wiki/CUDA. Accessed 17 September 2013
17. Lionetti FV, McCulloch AD, Baden SB (2010) Source-to-source optimization of CUDA C for GPU accelerated cardiac cell modeling. In: Euro-Par 2010—parallel processing. Lecture notes in computer science, vol 6271, pp 38–49
18. Jung S (2009) Parallelized pairwise sequence alignment using CUDA on multiple GPUs. BMC Bioinform 10(Suppl 7):A3
19. Che S, Boyer M, Meng J, Tarjan D, Sheaffer JW, Skadron K (2008) A performance study of general-purpose applications on graphics processors using CUDA. J Parallel Distrib Comput 68(10):1370–1380
20. OpenCL (2013) http://www.khronos.org/opencl. Accessed 17 September 2013
21. OpenCL-wiki (2013) http://en.wikipedia.org/wiki/OpenCL. Accessed 17 September 2013
22. Harvey MJ, De Fabritiis G (2011) Swan: a tool for porting CUDA programs to OpenCL. Comput Phys Commun 182(4):1093–1099
23. QEMU (2013) http://wiki.qemu.org/Main_Page. Accessed 17 September 2013
24. VirtualBox (2013) https://www.virtualbox.org. Accessed 17 September 2013
25. Lo C-TD, Qian K (2010) Green computing methodology for next generation computing scientists. In: Proceedings of IEEE 34th annual computer software and applications conference, pp 250–251
26. Zhong B, Feng M, Lung C-H (2010) A green computing based architecture comparison and analysis. In: Proceedings of the 2010 IEEE/ACM int'l conference on green computing and communications & int'l conference on cyber, physical and social computing (GREENCOM-CPSCOM'10), pp 386–391


27. Duato J, Peña AJ, Silla F, Mayo R, Quintana-Ortí ES (2010) rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In: Proceedings of the 2010 international conference on high performance computing & simulation (HPCS 2010), June 2010, pp 224–231
28. Duato J, Pena AJ, Silla F, Fernandez JC, Mayo R, Quintana-Orti ES (2011) Enabling CUDA acceleration within virtual machines using rCUDA. In: Proceedings of 18th international conference on high performance computing 2010 (HiPC), pp 1–10
29. Duato J, Peña AJ, Silla F, Mayo R, Quintana-Orti ES (2011) Performance of CUDA virtualized remote GPUs in high performance clusters. In: Proceedings of international conference on parallel processing (ICPP), September 2011, pp 365–374
30. Shi L, Chen H, Sun J (2009) vCUDA: GPU accelerated high performance computing in virtual machines. In: Proceedings of IEEE international symposium on parallel and distributed processing (IPDPS'09), pp 1–11
31. Gupta V, Gavrilovska A, Schwan K, Kharche H, Tolia N, Talwar V, Ranganathan P (2009) GViM: GPU-accelerated virtual machines. In: 3rd workshop on system-level virtualization for high performance computing. ACM, NY, USA, pp 17–24
32. Giunta G, Montella R, Agrillo G, Coviello G (2010) A GPGPU transparent virtualization component for high performance computing clouds. In: Ambra PD, Guarracino M, Talia D (eds) Euro-Par 2010—parallel processing. Lecture notes in computer science, vol 6271. Springer, Berlin, pp 379–391
33. Front and back ends (2013) http://en.wikipedia.org/wiki/Front_and_back_ends. Accessed 17 September 2013
34. VMGL (2013) http://sysweb.cs.toronto.edu/vmgl. Accessed 17 September 2013
35. Amit N, Ben-Yehuda M, Yassour B-A (2012) IOMMU: strategies for mitigating the IOTLB bottleneck. In: Computer architecture. Lecture notes in computer science, vol 6161, pp 256–274
36. NVIDIA Tesla C1060 computing processor (2012) http://www.nvidia.com/object/product_tesla_c1060_us.html. Accessed 12 May 2012
37. NVIDIA Quadro NVS 295 (2012) http://www.nvidia.com.tw/object/product_quadro_nvs_295_tw.html. Accessed 12 May 2012
38. NVIDIA Tesla C2050 computing processor (2013) http://www.nvidia.com.tw/object/product_tesla_C2050_C2070_tw.html. Accessed 17 September 2013
39. CentOS (2013) http://www.centos.org. Accessed 17 September 2013
40. Lagar-Cavilla HA, Tolia N, Satyanarayanan M, de Lara E (2007) VMM-independent graphics acceleration. In: Proceedings of the 3rd international conference on virtual execution environments (VEE'07). ACM, New York, pp 33–43
41. Yang CT, Huang CL, Lin CF (2010) Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters. Comput Phys Commun 182(1):266–269
42. Yang CT, Huang CL, Lin CF, Chang TC (2010) Hybrid parallel programming on GPU clusters. In: Proceedings of international symposium on parallel and distributed processing with applications (ISPA), September 2010, pp 142–147
43. Yang CT, Chang TC, Wang HY, Chu WCC, Chang CH (2011) Performance comparison with OpenMP parallelization for multi-core systems. In: Proceedings 2011 IEEE 9th international symposium on parallel and distributed processing with applications (ISPA), pp 232–237