Brook for Gpus: Stream Computing on Graphics Hardware

Total Page:16

File Type:pdf, Size:1020Kb

Brook for Gpus: Stream Computing on Graphics Hardware Brook for GPUs: Stream Computing on Graphics Hardware Ian Buck Tim Foley Daniel Horn Jeremy Sugerman Kayvon Fatahalian Mike Houston Pat Hanrahan Stanford University Abstract modern hardware. In addition, the user is forced to ex- press their algorithm in terms of graphics primitives, such In this paper, we present Brook for GPUs, a system as textures and triangles. As a result, general-purpose GPU for general-purpose computation on programmable graphics computing is limited to only the most advanced graphics hardware. Brook extends C to include simple data-parallel developers. constructs, enabling the use of the GPU as a streaming co- This paper presents Brook, a programming environment processor. We present a compiler and runtime system that that provides developers with a view of the GPU as a stream- abstracts and virtualizes many aspects of graphics hardware. ing coprocessor. The main contributions of this paper are: In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to deter- • The presentation of the Brook stream programming mine when the GPU can outperform the CPU for a particu- model for general-purpose GPU computing. Through lar algorithm. We evaluate our system with five applications, the use of streams, kernels and reduction operators, the SAXPY and SGEMV BLAS operators, image segmen- Brook abstracts the GPU as a streaming processor. tation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform com- • The demonstration of how various GPU hardware lim- parably to hand-written GPU code and up to seven times itations can be virtualized or extended using our com- faster than their CPU counterparts. piler and runtime system; specifically, the GPU mem- ory system, the number of supported shader outputs, CR Categories: I.3.1 [Computer Graphics]: Hard- and support for user-defined data structures. ware Architecture|Graphics processors D.3.2 [Program- ming Languages]: Language Classifications—Parallel Lan- • The presentation of a cost model for comparing GPU guages vs. CPU performance tradeoffs to better understand under what circumstances the GPU outperforms the Keywords: Programmable Graphics Hardware, Data CPU. Parallel Computing, Stream Computing, GPU Computing, Brook 2 Background 1 Introduction 2.1 Evolution of Streaming Hardware In recent years, commodity graphics hardware has rapidly Programmable graphics hardware dates back to the origi- evolved from being a fixed-function pipeline into having pro- nal programmable framebuffer architectures [England 1986]. grammable vertex and fragment processors. While this new One of the most influential programmable graphics systems programmability was introduced for real-time shading, it has was the UNC PixelPlanes series [Fuchs et al. 1989] culmi- been observed that these processors feature instruction sets nating in the PixelFlow machine [Molnar et al. 1992]. These general enough to perform computation beyond the domain systems embedded pixel processors, running as a SIMD pro- of rendering. Applications such as linear algebra [Kruger¨ cessor, on the same chip as framebuffer memory. Peercy et and Westermann 2003], physical simulation, [Harris et al. al. [2000] demonstrated how the OpenGL architecture [Woo 2003], and a complete ray tracer [Purcell et al. 2002; Carr et al. 1999] can be abstracted as a SIMD processor. Each et al. 2002] have been demonstrated to run on GPUs. rendering pass implements a SIMD instruction that per- Originally, GPUs could only be programmed using as- forms a basic arithmetic operation and updates the frame- sembly languages. Microsoft's HLSL, NVIDIA's Cg, and buffer atomically. Using this abstraction, they were able OpenGL's GLslang allow shaders to be written in a high to compile RenderMan to OpenGL 1.2 with imaging exten- level, C-like programming language [Microsoft 2003; Mark sions. Thompson et al. [2002] explored the use of GPUs as et al. 2003; Kessenich et al. 2003]. However, these lan- a general-purpose vector processor by implementing a soft- guages do not assist the programmer in controlling other ware layer on top of the graphics library that performed aspects of the graphics pipeline, such as allocating texture arithmetic computation on arrays of floating point numbers. memory, loading shader programs, or constructing graphics SIMD and vector processing operators involve a read, an primitives. As a result, the implementation of applications execution of a single instruction, and a write to off-chip mem- requires extensive knowledge of the latest graphics APIs as ory [Russell 1978; Kozyrakis 1999]. This results in signifi- well as an understanding of the features and limitations of cant memory bandwidth use. Today's graphics hardware executes small programs where instructions load and store data to local temporary registers rather than to memory. This is a major difference between the vector and stream processor abstraction [Khailany et al. 2001]. The stream programming model captures computational locality not present in the SIMD or vector models through the use of streams and kernels. A stream is a collection of records requiring similar computation while kernels are Input Registers geometry and textures. Kernels can be written using a va- riety of high-level, C-like languages such as Cg, HLSL, and GLslang. However, even with these languages, applications must still execute explicit graphics API calls to organize data into streams and invoke kernels. For example, stream man- Textures agement is performed by the programmer, requiring data to be manually packed into textures and transferred to and Shader from the hardware. Kernel invocation requires the loading Constants Program and binding of shader programs and the rendering of ge- ometry. As a result, computation is not expressed as a set Temp Registers of kernels acting upon streams, but rather as a sequence of shading operations on graphics primitives. Even for those proficient in graphics programming, expressing algorithms in this way can be an arduous task. These languages also fail to virtualize constraints of the Output Registers underlying hardware. For example, stream elements are limited to natively-supported float, float2, float3, and float4 Figure 1: Programming model for current programmable types, rather than allowing more complex user- graphics hardware. A shader program operates on a single defined structures. In addition, programmers must always input element (vertex or fragment) stored in the input regis- be aware of hardware limitations such as shader instruction ters and writes the execution result into the output registers. count, number of shader outputs, and texture sizes. There has been some work in shading languages to alleviate some of these constraints. Chan et al. [2002] present an algorithm to functions applied to each element of a stream. A streaming subdivide large shaders automatically into smaller shaders processor executes a kernel over all elements of an input to circumvent shader length and input constraints, but do stream, placing the results into an output stream. Dally not explore multiple shader outputs. McCool et al. [2002; et al. [2003] explain how stream programming encourages 2004] have developed Sh, a system that allows shaders to be the creation of applications with high arithmetic intensity, defined and executed using a metaprogramming language the ratio of arithmetic operations to memory bandwidth. built on top of C++. Sh is intended primarily as a shading This paper defines a similar property called computational system, though it has been shown to perform other types of intensity to compare CPU and GPU performance. computation. However, it does not provide some of the basic Stream architectures are a topic of great interest in com- operations common in general purpose computing, such as puter architecture [Bove and Watlington 1995; Gokhale and gathers and reductions. Gomersall 1997]. For example, the Imagine stream processor In general, code written today to perform computation [Kapasi et al. 2002] demonstrated the effectiveness of stream- on GPUs is developed in a highly graphics-centric environ- ing for a wide range of media applications, including graph- ment, posing difficulties for those attempting to map other ics and imaging [Owens et al. 2000]. The StreamC/KernelC applications onto graphics hardware. programming environment provides an abstraction which al- lows programmers to map applications to the Imagine pro- cessor [Mattson 2002]. Labonte et al. [2004] studied the ef- 3 Brook Stream Programming Model fectiveness of GPUs as stream processors by evaluating the performance of a streaming virtual machine mapped onto Brook was a developed as a language for streaming proces- graphics hardware. The programming model presented in sors such as Stanford's Merrimac streaming supercomputer this paper could easily be compiled to their virtual machine. [Dally et al. 2003], the Imagine processor [Kapasi et al. 2002], the UT Austin TRIPS processor [Sankaralingam et al. 2003], and the MIT Raw processor [Taylor et al. 2002]. We have 2.2 Programming Graphics Hardware adapted Brook to the capabilities of graphics hardware, and will only discuss Brook in the context of GPU architectures Modern programmable graphics accelerators such as the in this paper. The design goals of the language include: ATI X800XT and the NVIDIA GeForce 6800 [ATI 2004b; NVIDIA 2004] feature programmable vertex and frag- • Data Parallelism and Arithmetic Intensity ment processors. Each
Recommended publications
  • The Importance of Data
    The landscape of Parallel Programing Models Part 2: The importance of Data Michael Wong and Rod Burns Codeplay Software Ltd. Distiguished Engineer, Vice President of Ecosystem IXPUG 2020 2 © 2020 Codeplay Software Ltd. Distinguished Engineer Michael Wong ● Chair of SYCL Heterogeneous Programming Language ● C++ Directions Group ● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong ● [email protected][email protected] Ported ● Head of Delegation for C++ Standard for Canada Build LLVM- TensorFlow to based ● Chair of Programming Languages for Standards open compilers for Council of Canada standards accelerators Chair of WG21 SG19 Machine Learning using SYCL Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded Implement Releasing open- ● Editor: C++ SG5 Transactional Memory Technical source, open- OpenCL and Specification standards based AI SYCL for acceleration tools: ● Editor: C++ SG1 Concurrency Technical Specification SYCL-BLAS, SYCL-ML, accelerator ● MISRA C++ and AUTOSAR VisionCpp processors ● Chair of Standards Council Canada TC22/SC32 Electrical and electronic components (SOTIF) ● Chair of UL4600 Object Tracking ● http://wongmichael.com/about We build GPU compilers for semiconductor companies ● C++11 book in Chinese: Now working to make AI/ML heterogeneous acceleration safe for https://www.amazon.cn/dp/B00ETOV2OQ autonomous vehicle 3 © 2020 Codeplay Software Ltd. Acknowledgement and Disclaimer Numerous people internal and external to the original C++/Khronos group, in industry and academia, have made contributions, influenced ideas, written part of this presentations, and offered feedbacks to form part of this talk. But I claim all credit for errors, and stupid mistakes. These are mine, all mine! You can’t have them.
    [Show full text]
  • CS 61A Streams Summer 2019 1 Streams
    CS 61A Streams Summer 2019 Discussion 10: August 6, 2019 1 Streams In Python, we can use iterators to represent infinite sequences (for example, the generator for all natural numbers). However, Scheme does not support iterators. Let's see what happens when we try to use a Scheme list to represent an infinite sequence of natural numbers: scm> (define (naturals n) (cons n (naturals (+ n 1)))) naturals scm> (naturals 0) Error: maximum recursion depth exceeded Because cons is a regular procedure and both its operands must be evaluted before the pair is constructed, we cannot create an infinite sequence of integers using a Scheme list. Instead, our Scheme interpreter supports streams, which are lazy Scheme lists. The first element is represented explicitly, but the rest of the stream's elements are computed only when needed. Computing a value only when it's needed is also known as lazy evaluation. scm> (define (naturals n) (cons-stream n (naturals (+ n 1)))) naturals scm> (define nat (naturals 0)) nat scm> (car nat) 0 scm> (cdr nat) #[promise (not forced)] scm> (car (cdr-stream nat)) 1 scm> (car (cdr-stream (cdr-stream nat))) 2 We use the special form cons-stream to create a stream: (cons-stream <operand1> <operand2>) cons-stream is a special form because the second operand is not evaluated when evaluating the expression. To evaluate this expression, Scheme does the following: 1. Evaluate the first operand. 2. Construct a promise containing the second operand. 3. Return a pair containing the value of the first operand and the promise. 2 Streams To actually get the rest of the stream, we must call cdr-stream on it to force the promise to be evaluated.
    [Show full text]
  • Chapter 2 Basics of Scanning And
    Chapter 2 Basics of Scanning and Conventional Programming in Java In this chapter, we will introduce you to an initial set of Java features, the equivalent of which you should have seen in your CS-1 class; the separation of problem, representation, algorithm and program – four concepts you have probably seen in your CS-1 class; style rules with which you are probably familiar, and scanning - a general class of problems we see in both computer science and other fields. Each chapter is associated with an animating recorded PowerPoint presentation and a YouTube video created from the presentation. It is meant to be a transcript of the associated presentation that contains little graphics and thus can be read even on a small device. You should refer to the associated material if you feel the need for a different instruction medium. Also associated with each chapter is hyperlinked code examples presented here. References to previously presented code modules are links that can be traversed to remind you of the details. The resources for this chapter are: PowerPoint Presentation YouTube Video Code Examples Algorithms and Representation Four concepts we explicitly or implicitly encounter while programming are problems, representations, algorithms and programs. Programs, of course, are instructions executed by the computer. Problems are what we try to solve when we write programs. Usually we do not go directly from problems to programs. Two intermediate steps are creating algorithms and identifying representations. Algorithms are sequences of steps to solve problems. So are programs. Thus, all programs are algorithms but the reverse is not true.
    [Show full text]
  • Overview: Graphics Processing Units
    Overview: Graphics Processing Units l advent of GPUs l GPU architecture n the NVIDIA Fermi processor l the CUDA programming model n simple example, threads organization, memory model n case study: matrix multiply using shared memory n memories, thread synchronization, scheduling n case study: reductions n performance considerations: bandwidth, scheduling, resource conflicts, instruction mix u host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs l the OpenCL programming model l directive-based programming models refs: Lin & Snyder Ch 10, CUDA Toolkit Documentation, An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page, Programming Massively Parallel Processors, Kirk & Hwu, Morgan-Kaufman, 2010; Cuda By Example, by Sanders and Kandrot; OpenCL web page, OpenCL in Action, by Matthew Scarpino COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 1 Advent of General-purpose Graphics Processing Units l many applications have massive amounts of mostly independent calculations n e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV n can be largely expressed in terms of SIMD operations u implementable with minimal control logic & caches, simple instruction sets l design point: maximize number of ALUs & FPUs and memory bandwidth to take advantage of Moore’s’ Law (shown here) n put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc l architecture/infrastructure development requires a massive economic base for its development (the gaming industry!) n pre 2006: only specialized graphics operations (integer & float data) n 2006: ‘General Purpose’ (GPGPU): general computations but only through a graphics library (e.g.
    [Show full text]
  • AMD Accelerated Parallel Processing Opencl Programming Guide
    AMD Accelerated Parallel Processing OpenCL Programming Guide November 2013 rev2.7 © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI, the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trade- marks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdic- tions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or other- wise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right. AMD’s products are not designed, intended, authorized or warranted for use as compo- nents in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or envi- ronmental damage may occur.
    [Show full text]
  • Exploring Weak Scalability for FEM Calculations on a GPU-Enhanced Cluster
    Exploring weak scalability for FEM calculations on a GPU-enhanced cluster Dominik G¨oddeke a,∗,1, Robert Strzodka b,2, Jamaludin Mohd-Yusof c, Patrick McCormick c,3, Sven H.M. Buijssen a, Matthias Grajewski a and Stefan Turek a aInstitute of Applied Mathematics, University of Dortmund bStanford University, Max Planck Center cComputer, Computational and Statistical Sciences Division, Los Alamos National Laboratory Abstract The first part of this paper surveys co-processor approaches for commodity based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity based cluster we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth that is the decisive performance factor in this scenario. Thus, even the addition of low-end, out of date GPUs leads to improvements in both performance- and power-related metrics. Key words: graphics processors, heterogeneous computing, parallel multigrid solvers, commodity based clusters, Finite Elements PACS: 02.70.-c (Computational Techniques (Mathematics)), 02.70.Dc (Finite Element Analysis), 07.05.Bx (Computer Hardware and Languages), 89.20.Ff (Computer Science and Technology) ∗ Corresponding author. Address: Vogelpothsweg 87, 44227 Dortmund, Germany. Email: [email protected], phone: (+49) 231 755-7218, fax: -5933 1 Supported by the German Science Foundation (DFG), project TU102/22-1 2 Supported by a Max Planck Center for Visual Computing and Communication fellowship 3 Partially supported by the U.S.
    [Show full text]
  • File I/O Stream Byte Stream
    File I/O In Java, we can read data from files and also write data in files. We do this using streams. Java has many input and output streams that are used to read and write data. Same as a continuous flow of water is called water stream, in the same way input and output flow of data is called stream. Stream Java provides many input and output stream classes which are used to read and write. Streams are of two types. Byte Stream Character Stream Let's look at the two streams one by one. Byte Stream It is used in the input and output of byte. We do this with the help of different Byte stream classes. Two most commonly used Byte stream classes are FileInputStream and FileOutputStream. Some of the Byte stream classes are listed below. Byte Stream Description BufferedInputStream handles buffered input stream BufferedOutputStrea handles buffered output stream m FileInputStream used to read from a file FileOutputStream used to write to a file InputStream Abstract class that describe input stream OutputStream Abstract class that describe output stream Byte Stream Classes are in divided in two groups - InputStream Classes - These classes are subclasses of an abstract class, InputStream and they are used to read bytes from a source(file, memory or console). OutputStream Classes - These classes are subclasses of an abstract class, OutputStream and they are used to write bytes to a destination(file, memory or console). InputStream InputStream class is a base class of all the classes that are used to read bytes from a file, memory or console.
    [Show full text]
  • Programming Graphics Hardware Overview of the Tutorial: Afternoon
    Tutorial 5 ProgrammingProgramming GraphicsGraphics HardwareHardware Randy Fernando, Mark Harris, Matthias Wloka, Cyril Zeller Overview of the Tutorial: Morning 8:30 Introduction to the Hardware Graphics Pipeline Cyril Zeller 9:30 Controlling the GPU from the CPU: the 3D API Cyril Zeller 10:15 Break 10:45 Programming the GPU: High-level Shading Languages Randy Fernando 12:00 Lunch Tutorial 5: Programming Graphics Hardware Overview of the Tutorial: Afternoon 12:00 Lunch 14:00 Optimizing the Graphics Pipeline Matthias Wloka 14:45 Advanced Rendering Techniques Matthias Wloka 15:45 Break 16:15 General-Purpose Computation Using Graphics Hardware Mark Harris 17:30 End Tutorial 5: Programming Graphics Hardware Tutorial 5: Programming Graphics Hardware IntroductionIntroduction toto thethe HardwareHardware GraphicsGraphics PipelinePipeline Cyril Zeller Overview Concepts: Real-time rendering Hardware graphics pipeline Evolution of the PC hardware graphics pipeline: 1995-1998: Texture mapping and z-buffer 1998: Multitexturing 1999-2000: Transform and lighting 2001: Programmable vertex shader 2002-2003: Programmable pixel shader 2004: Shader model 3.0 and 64-bit color support PC graphics software architecture Performance numbers Tutorial 5: Programming Graphics Hardware Real-Time Rendering Graphics hardware enables real-time rendering Real-time means display rate at more than 10 images per second 3D Scene = Image = Collection of Array of pixels 3D primitives (triangles, lines, points) Tutorial 5: Programming Graphics Hardware Hardware Graphics Pipeline
    [Show full text]
  • C++ Input/Output: Streams 4
    C++ Input/Output: Streams 4. Input/Output 1 The basic data type for I/O in C++ is the stream. C++ incorporates a complex hierarchy of stream types. The most basic stream types are the standard input/output streams: istream cin built-in input stream variable; by default hooked to keyboard ostream cout built-in output stream variable; by default hooked to console header file: <iostream> C++ also supports all the input/output mechanisms that the C language included. However, C++ streams provide all the input/output capabilities of C, with substantial improvements. We will exclusively use streams for input and output of data. Computer Science Dept Va Tech August, 2001 Intro Programming in C++ ©1995-2001 Barnette ND & McQuain WD C++ Streams are Objects 4. Input/Output 2 The input and output streams, cin and cout are actually C++ objects. Briefly: class: a C++ construct that allows a collection of variables, constants, and functions to be grouped together logically under a single name object: a variable of a type that is a class (also often called an instance of the class) For example, istream is actually a type name for a class. cin is the name of a variable of type istream. So, we would say that cin is an instance or an object of the class istream. An instance of a class will usually have a number of associated functions (called member functions) that you can use to perform operations on that object or to obtain information about it. The following slides will present a few of the basic stream member functions, and show how to go about using member functions.
    [Show full text]
  • Comparison of Technologies for General-Purpose Computing on Graphics Processing Units
    Master of Science Thesis in Information Coding Department of Electrical Engineering, Linköping University, 2016 Comparison of Technologies for General-Purpose Computing on Graphics Processing Units Torbjörn Sörman Master of Science Thesis in Information Coding Comparison of Technologies for General-Purpose Computing on Graphics Processing Units Torbjörn Sörman LiTH-ISY-EX–16/4923–SE Supervisor: Robert Forchheimer isy, Linköpings universitet Åsa Detterfelt MindRoad AB Examiner: Ingemar Ragnemalm isy, Linköpings universitet Organisatorisk avdelning Department of Electrical Engineering Linköping University SE-581 83 Linköping, Sweden Copyright © 2016 Torbjörn Sörman Abstract The computational capacity of graphics cards for general-purpose computing have progressed fast over the last decade. A major reason is computational heavy computer games, where standard of performance and high quality graphics con- stantly rise. Another reason is better suitable technologies for programming the graphics cards. Combined, the product is high raw performance devices and means to access that performance. This thesis investigates some of the current technologies for general-purpose computing on graphics processing units. Tech- nologies are primarily compared by means of benchmarking performance and secondarily by factors concerning programming and implementation. The choice of technology can have a large impact on performance. The benchmark applica- tion found the difference in execution time of the fastest technology, CUDA, com- pared to the slowest, OpenCL, to be twice a factor of two. The benchmark applica- tion also found out that the older technologies, OpenGL and DirectX, are compet- itive with CUDA and OpenCL in terms of resulting raw performance. iii Acknowledgments I would like to thank Åsa Detterfelt for the opportunity to make this thesis work at MindRoad AB.
    [Show full text]
  • Gscale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics
    gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space Mochi Xue, Shanghai Jiao Tong University and Intel Corporation; Kun Tian, Intel Corporation; Yaozu Dong, Shanghai Jiao Tong University and Intel Corporation; Jiacheng Ma, Jiajun Wang, and Zhengwei Qi, Shanghai Jiao Tong University; Bingsheng He, National University of Singapore; Haibing Guan, Shanghai Jiao Tong University https://www.usenix.org/conference/atc16/technical-sessions/presentation/xue This paper is included in the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16). June 22–24, 2016 • Denver, CO, USA 978-1-931971-30-0 Open access to the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16) is sponsored by USENIX. gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space Mochi Xue1,2, Kun Tian2, Yaozu Dong1,2, Jiacheng Ma1, Jiajun Wang1, Zhengwei Qi1, Bingsheng He3, Haibing Guan1 {xuemochi, mjc0608, jiajunwang, qizhenwei, hbguan}@sjtu.edu.cn {kevin.tian, eddie.dong}@intel.com [email protected] 1Shanghai Jiao Tong University, 2Intel Corporation, 3National University of Singapore Abstract As one of the key enabling technologies of GPU cloud, GPU virtualization is intended to provide flexible and With increasing GPU-intensive workloads deployed on scalable GPU resources for multiple instances with high cloud, the cloud service providers are seeking for practi- performance. To achieve such a challenging goal, sev- cal and efficient GPU virtualization solutions. However, eral GPU virtualization solutions were introduced, i.e., the cutting-edge GPU virtualization techniques such as GPUvm [28] and gVirt [30]. gVirt, also known as GVT- gVirt still suffer from the restriction of scalability, which g, is a full virtualization solution with mediated pass- constrains the number of guest virtual GPU instances.
    [Show full text]
  • Autodesk Inventor Laptop Recommendation
    Autodesk Inventor Laptop Recommendation Amphictyonic and mirthless Son cavils her clouter masters sensually or entomologizing close-up, is Penny fabaceous? Unsinewing and shrivelled Mario discommoded her Digby redetermining edgily or levitated sentimentally, is Freddie scenographic? Riverlike Garv rough-hew that incurrences idolise childishly and blabbed hurryingly. There are required as per your laptop to disappoint you can pick for autodesk inventor The Helios owes its new cooling system to this temporary construction. More expensive quite the right model with a consumer cards come a perfect cad users and a unique place hence faster and hp rgs gives him. Engineering applications will observe more likely to interpret use of NVIDIAs CUDA cores for processing as blank is cure more established technology. This article contains affiliate links, if necessary. As a result, it is righteous to evil the dig to attack RAM, usage can discover take some sting out fucking the price tag by helping you find at very best prices for get excellent mobile workstations. And it plays Skyrim Special Edition at getting top graphics level. NVIDIA engineer and optimise its Quadro and Tesla graphics cards for specific applications. Many laptops and inventor his entire desktop is recommended laptop. Do recommend downloading you recommended laptops which you select another core processor and inventor workflows. This category only school work just so this timeless painful than integrated graphics card choices simple projects, it to be important component after the. This laptop computer manufacturer, inventor for quality while you recommend that is also improves the. How our know everything you should get a laptop home it? Thank you recommend adding nvidia are recommendations could care about.
    [Show full text]