Overview: Graphics Processing Units

Total Page:16

File Type:pdf, Size:1020Kb

Overview: Graphics Processing Units Overview: Graphics Processing Units l advent of GPUs l GPU architecture n the NVIDIA Fermi processor l the CUDA programming model n simple example, threads organization, memory model n case study: matrix multiply using shared memory n memories, thread synchronization, scheduling n case study: reductions n performance considerations: bandwidth, scheduling, resource conflicts, instruction mix u host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs l the OpenCL programming model l directive-based programming models refs: Lin & Snyder Ch 10, CUDA Toolkit Documentation, An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page, Programming Massively Parallel Processors, Kirk & Hwu, Morgan-Kaufman, 2010; Cuda By Example, by Sanders and Kandrot; OpenCL web page, OpenCL in Action, by Matthew Scarpino COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 1 Advent of General-purpose Graphics Processing Units l many applications have massive amounts of mostly independent calculations n e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV n can be largely expressed in terms of SIMD operations u implementable with minimal control logic & caches, simple instruction sets l design point: maximize number of ALUs & FPUs and memory bandwidth to take advantage of Moore’s’ Law (shown here) n put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc l architecture/infrastructure development requires a massive economic base for its development (the gaming industry!) n pre 2006: only specialized graphics operations (integer & float data) n 2006: ‘General Purpose’ (GPGPU): general computations but only through a graphics library (e.g. OpenGL) n 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL) Some applications have large speedups (10–500×) over a single CPU core. COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 2 Graphics Processor Unit Systems l GPU systems are a co-processor ‘device’ on a CPU-based system ([O’H.&Bryant, fig 1.4]) n separate memory space (DRAMs) for CPU (host) and GPU (device) n must allocate space on GPU and copy data from CPU memory to GPU memory (and visa versa) via the PCIe bus n also need a way to copy the GPU executable code and start it (kernel launch) n issues? Why not use the same memory space? COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 3 Graphics Processor Unit Architecture l GPU chip: an array of streaming multiprocessors(SMs) sharing an L2 cache n comparison with UltraSPARC T2 (courtesy Real World Tech) n eachSM has (8–32) streaming processors(SPs) TeslaS2050 co-processor u onlySPs (= cores) within anSM can (easily) synchronize, share data n identical threads are organized into fixed-size blocks, each allocated to anSM n blocks are in turn are divided into warps n at any timestep, allSPs execute an instruction from a warp(‘SIMT’ mode) n latencies are hidden by scheduling from TeslaS2050 architecture many warps (courtesy NVIDIA) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 4 The Fermi Graphics Processor Chip GF110 model: 1.15 GHz; 900W; 3D grid & thread blocks; warp size: 32; max resident: blocks 8, warps 32, threads 1536 (courtesy NCI NF) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 5 GPU vs CPU Floating Point Speed and Memory Bandwidth COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 6 The Common Unified Device Architecture Programming Model l ‘device’ refers to a co-processor with its own DRAM that can run many threads in parallel l ‘host’ performs serial execution, transfers data to/from device (via DMA), and sends (highly jj) kernels to device l the kernel’s threads are organized into a grid of blocks n each block is sent to anSM l a CUDA program is a C/C++ program with device calls & ‘kernels’ (each with many threads) embedded into it l GPU threads are very lightweight (some overheads in invoking a kernel, and dispatching each block) n threads are identical but have thread (& block) ids (courtesy NCSU) l CUDA compiler (e.g. nvcc) produces a normal executable with device code embedded into it n has CUDA runtime (cudart) and core (cuda) libraries linked into it COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 7 CUDA Program: Simple Example l reverse an array (reversearray.cu) g l o b a l void reverseArray(int ∗ a d, intN) f int idx= threadIdx.x; intv=a[N −idx −1];a[N −idx −1] =a[idx];a[idx] =v; g #define N (1 <<16) int main() f //may not dereferencea d! inta[N], ∗ a d,a size=N ∗ sizeof(int); ... cudaMalloc((void ∗∗)&a d,a size); cudaMemcpy(a d,a,a size, cudaMemcpyHostToDevice); reverseArray <<<1,N/2 >>> (a d,N); cudaThreadSynchronize();// wait till threads finish cudaMemcpy(a,a d,a size, cudaMemcpyDeviceToHost); cudaFree(a d);... g l cf. OpenMP on a normal multicore: style; practicality? #pragma omp parallel num threads(N/2) default(shared) f int idx= omp get threads num(); intv=a[N −idx −1];a[N −idx −1] =a[idx];a[idx] =v; g COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 8 CUDA Thread Organization and Memory Model l memory model (left) reflects l a 2 × 1 grid with 2 × 1 blocks that of the GPU l 2 × 2 grid with 4 × 2 × 2 blocks (courtesy Real World Tech.) (courtesy NCSC) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 9 Case Study: Matrix Multiply l perform C+=AB, C is N × N, A is N × K, B is K × N l column-major storage: Ci j is at C[i+j ∗N] l 1st attempt: each thread computes one element of C, Ci, j l invocation with W ×W thread blocks (assume WjN) n why better than using a N × N thread block? (2 reasons, both important!) l for thread( tx,ty) of block( bx,by), i = byW +ty and j = bxW +tx (courtesy xfig) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 10 CUDA Matrix Multiply: Implementation l kernel: g l o b a l void matMult(intN, intK, double ∗ A d, double ∗ B d, double ∗ C d) f inti= blockIdx.y ∗ blockDim.y+ threadIdx.y; intj= blockIdx.x ∗ blockDim.x+ threadIdx.x; double cij=C d[i+j ∗ N]; for(intk=0;k < K;k++) cij +=A d[i+k ∗ N] ∗ B d[k+j ∗ K]; C d[i+j ∗ N] = cij; g l main program: needs to allocate device versions of A, B & C (A d, B d, and C d) and cudaMemcpy() host versions into them l invocation with W ×W thread blocks (assume WjN) dim3 dimG(N/W,N/W); dim3 dimB(W,W);// in kernel blockDim.x ==W matMult <<<dimG, dimB >>> (N,K,A d,B d,C d); l what if N%W > 0? Add to kernel if(i < N&&j < N) and declare dim3 dimG((N+W−1)/W,(N+W−1)/W); n note: SIMD nature ofSPs ) cycles for both branches of if are consumed COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 11 CUDA Memories and Thread Synchronization l GPUs can potentially suffer more still from the memory wall n DRAM access still may be 100’s of cycles n bandwidth is limited for load/store intensive kernels l the shared memory is on-chip (hence very fast) n the shared type modifier may be used to denote a (fixed) array allocated to shared memory l threads within a block can synchronized via the syncthreads() intrinsic (efficient – why?) n (SM-level) atomic instructions can enforce data consistency within a block l note: no (easy) way to synchronize between blocks, or safely ensure data consistency across blocks n can only be done across separate kernel invocations COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 12 Matrix Multiply Using Shared Memory l threads (tx,0) ... (tx,W − 1) all access Bk,bxW +tx;((0,ty) ... (W − 1,ty) access AbyW +ty,k) n high ratio of load to FP instructions n harder to hide L1 cache latencies; strains memory bandwidth l can improve kernel by utilizing SM shared memory: s h a r e d doubleA s[W][W],B s[W][W]; g l o b a l void matMult s(intN, intK, double ∗ A d, double ∗ B d, double ∗ C d) f int ty= threadIdx.y, tx= threadIdx.x; inti= blockIdx.y ∗ W+ ty,j= blockIdx.x ∗ W+ tx; double cij=C d[i+j ∗ N]; for(intk=0;k < K;k+=W) f A s[ty][tx] =A d[i+(k+tx) ∗ N]; B s[ty][tx] =B d[(k+ty) +j ∗ K]; syncthreads(); for(intw=0;w < W;w++) cij +=A s[ty][w] ∗ B s[w][tx]; syncthreads();// can this be avoided? g C d[i+j ∗ N] = cij; g COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 13 GPU Scheduling - Warps l the GigaThread scheduler assigns the (independently executable) thread blocks to eachSM l each block is divided into groups (of 32) called warps n grouping occurs in linear order by (e.g. warp size 4) tx + bxty + bxbytz l the warp scheduler determines which blocks are ready to run n with 32-thread warps, suitable block sizes range from 4 × 8 to 16 × 16 n SIMT: eachSP executes next instr’n SIMD-style (note: requires only a single instruction fetch!) l thus, a kernel with enough blocks can scale across a GPU with any number of cores (courtesy NVIDIA - both) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 14 Reductions and Thread Divergence l threads within a single 1D block summing A[0..N−1]: g l o b a l void sumV(intN, double ∗ A, double ∗ s) f int bx= blockDim.x; s h a r e d double pSum[bx]; int tx= threadID.x,x; pSum[tx] = ..
Recommended publications
  • Comparison of Technologies for General-Purpose Computing on Graphics Processing Units
    Master of Science Thesis in Information Coding Department of Electrical Engineering, Linköping University, 2016 Comparison of Technologies for General-Purpose Computing on Graphics Processing Units Torbjörn Sörman Master of Science Thesis in Information Coding Comparison of Technologies for General-Purpose Computing on Graphics Processing Units Torbjörn Sörman LiTH-ISY-EX–16/4923–SE Supervisor: Robert Forchheimer isy, Linköpings universitet Åsa Detterfelt MindRoad AB Examiner: Ingemar Ragnemalm isy, Linköpings universitet Organisatorisk avdelning Department of Electrical Engineering Linköping University SE-581 83 Linköping, Sweden Copyright © 2016 Torbjörn Sörman Abstract The computational capacity of graphics cards for general-purpose computing have progressed fast over the last decade. A major reason is computational heavy computer games, where standard of performance and high quality graphics con- stantly rise. Another reason is better suitable technologies for programming the graphics cards. Combined, the product is high raw performance devices and means to access that performance. This thesis investigates some of the current technologies for general-purpose computing on graphics processing units. Tech- nologies are primarily compared by means of benchmarking performance and secondarily by factors concerning programming and implementation. The choice of technology can have a large impact on performance. The benchmark applica- tion found the difference in execution time of the fastest technology, CUDA, com- pared to the slowest, OpenCL, to be twice a factor of two. The benchmark applica- tion also found out that the older technologies, OpenGL and DirectX, are compet- itive with CUDA and OpenCL in terms of resulting raw performance. iii Acknowledgments I would like to thank Åsa Detterfelt for the opportunity to make this thesis work at MindRoad AB.
    [Show full text]
  • Gscale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics
    gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space Mochi Xue, Shanghai Jiao Tong University and Intel Corporation; Kun Tian, Intel Corporation; Yaozu Dong, Shanghai Jiao Tong University and Intel Corporation; Jiacheng Ma, Jiajun Wang, and Zhengwei Qi, Shanghai Jiao Tong University; Bingsheng He, National University of Singapore; Haibing Guan, Shanghai Jiao Tong University https://www.usenix.org/conference/atc16/technical-sessions/presentation/xue This paper is included in the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16). June 22–24, 2016 • Denver, CO, USA 978-1-931971-30-0 Open access to the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16) is sponsored by USENIX. gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space Mochi Xue1,2, Kun Tian2, Yaozu Dong1,2, Jiacheng Ma1, Jiajun Wang1, Zhengwei Qi1, Bingsheng He3, Haibing Guan1 {xuemochi, mjc0608, jiajunwang, qizhenwei, hbguan}@sjtu.edu.cn {kevin.tian, eddie.dong}@intel.com [email protected] 1Shanghai Jiao Tong University, 2Intel Corporation, 3National University of Singapore Abstract As one of the key enabling technologies of GPU cloud, GPU virtualization is intended to provide flexible and With increasing GPU-intensive workloads deployed on scalable GPU resources for multiple instances with high cloud, the cloud service providers are seeking for practi- performance. To achieve such a challenging goal, sev- cal and efficient GPU virtualization solutions. However, eral GPU virtualization solutions were introduced, i.e., the cutting-edge GPU virtualization techniques such as GPUvm [28] and gVirt [30]. gVirt, also known as GVT- gVirt still suffer from the restriction of scalability, which g, is a full virtualization solution with mediated pass- constrains the number of guest virtual GPU instances.
    [Show full text]
  • 3D Graphics on the ADS512101 Board Using Opengl ES By: Francisco Sandoval Zazueta Infotainment Multimedia and Telematics (IMT)
    Freescale Semiconductor Document Number: AN3793 Application Note Rev. 0, 12/2008 3D Graphics on the ADS512101 Board Using OpenGL ES by: Francisco Sandoval Zazueta Infotainment Multimedia and Telematics (IMT) 1 Introduction Contents 1 Introduction . 1 OpenGL is one of the most widely used graphic standard 2 Preparing the Environment . 2 2.1 Assumptions on the Environment . 2 specifications available. OpenGL ES is a reduced 2.2 Linux Target Image Builder (LTIB) . 2 adaptation of OpenGL that offers a powerful yet limited 2.3 Installing PowerVR Software Development Kit . 4 3 The PowerVR SDK . 5 version for embedded systems. 3.1 Introduction to SDK . 5 3.2 PVRShell . 5 One of the main features of the MPC5121e is its graphics 3.3 PVRtools . 6 co-processor, the MBX core. It is a wide spread standard 4 Developing Example Application . 6 of mobile 3D graphics acceleration for mobile solutions. 4.1 3D Model Loader. 6 5 Conclusion. 9 Together with Imagination Technologies OpenGL ES 1.1 6 References . 9 SDK, the ADS512101 board can be used to produce 7 Glossary . 10 Appendix A eye-catching graphics. This document is an introduction, Applying DIU Patches to the Kernel . 11 after the development environment is ready, it is up to the A.1 Applying the MBXpatch2.patch . 11 developer’s skills to exploit the board’s graphics A.2 byte_flip Application. 11 Appendix B capabilities. Brief Introduction to OpenGL ES . 12 B.1 OpenGL ES . 12 B.2 Main Differences Between OGLES 1.1 and . 12 B.3 Obtaining Frustum Numbers . 13 B.4 glFrustum.
    [Show full text]
  • Performance Analysis of Intel Gen9.5 Integrated GPU Architecture
    Performance Analysis of Intel Gen9.5 Integrated GPU Architecture Helder Francisco Pereira Duarte Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering Supervisor(s): Doctor Aleksandar Ilic Examination Committee Chairperson: Doctor António Manuel Raminhos Cordeiro Grilo Supervisor: Doctor Aleksandar Ilic Member of the Committee: Doctor Ricardo Jorge Fernandes Chaves June 2018 Declaration I declare that this document is an original work of my own authorship and that it fulfills all the require- ments of the Code of Conduct and Good Practices of the Universidade de Lisboa. Acknowledgments I would like to thank my family and friends, whose support was paramount to the completion of this thesis, and to INESC-ID Lisboa, for providing the facilities in which this thesis was developed. In particular I would like to thank Diogo Marques for his tips that helped drive my research forward and, of course, to Prof. Aleksander for his incredible patience in dealing with me. i ii Resumo Recentemente os CPUs vemˆ equipados com placas graficas´ integradas. Este acoplamento tem o potencial de oferecer ganhos de desempenho consideraveis´ caso as ditas GPUs sejam usadas como aceleradores. No entanto, placas graficas´ integradas temˆ dificuldade em atingir os n´ıveis de desem- penho que placas discretas proporcionam devido ao menor numero´ de nucleos.´ Contudo, a sua prox- imidade com o CPU significa uma partilha de dados com menos sobrecargas associadas. Ademais, as vantagens de partilhar a hierarquia de memoria´ com o processador e o consumo de energia mais baixo que as placas discretas a` custa de desempenho permite atingir n´ıveis de eficienciaˆ energetica´ mais elevados.
    [Show full text]
  • The Open Graphics Library (Opengl) Is Used for Visualizing 2D and 3D Data
    The Open Graphics Library (OpenGL) is used for visualizing 2D and 3D data. It is a multipurpose open-standard graphics library that supports applications for 2D and 3D digital content creation, mechanical and architectural design, virtual prototyping, flight simulation, video games, and more. You use OpenGL to configure a 3D graphics pipeline and submit data to it. Vertices are transformed and lit, assembled into primitives, and rasterized to create a 2D image. OpenGL is designed to translate function calls into graphics commands that can be sent to underlying graphics hardware. Because this underlying hardware is dedicated to processing graphics commands, OpenGL drawing is typically very fast. Because OpenGL ES is a C-based API, it is extremely portable and widely supported. As a C API, it integrates seamlessly with Objective-C Cocoa Touch apps. The OpenGL ES specification does not define a windowing layer; instead, the hosting operating system must provide functions to create an OpenGL ES rendering context, which accepts commands, and a framebuffer, where the results of any drawing commands are written to. Working with OpenGL ES on iOS requires using iOS classes to set up and present a drawing surface and using platform-neutral API to render its contents. v Introduction to 3-D Graphics o Basic Terminology o 3-D Graphics Pipeline v Introduction to OpenGL ES API o What is OpenGL ES o OpenGL ES Standard libraries and headers o Naming conventions o Rendering Pipeline o State machine v iOS and OpenGL ES v The performance of OpenGL ES apps in iOS differs from that of OpenGL in OS X or other desktop operating systems.
    [Show full text]
  • Copyright © COMP371 Computer Graphics Instructors Lecture Overview
    COMP371 COMPUTER GRAPHICS SESSION 2 INTRODUCTION TO COMPUTER GRAPHICS Copyright © COMP371 Computer Graphics Instructors Lecture Overview ● Introduction to Computer Graphics ● Graphics pipeline ● OpenGL API ● Primitives: lines, polygons ● Attributes: color ● Example Copyright © COMP371 Computer Graphics Instructors 2 What is OpenGL? ▪ Low-level graphics library (API) for 2D/3D interactive graphics ▪ Originated from SGI’s GL in 1992 ▪ Current version 4.5 (2014) ▪ Future version 5.0 or Vulkan? ▪ Managed by Khronos Group (non-profit consortium) ▪ API is governed by Architecture Review Board (part of Khronos) ▪ Used in broadcasting, CAD/CAM/CAE, entertainment, medical imaging, and virtual reality to produce and display incredibly compelling 2D and 3D graphics. Copyright © COMP371 Computer Graphics Instructors 3 OpenGL API Programmer manages the graphics system functionality through the Application Programmer Interface (API) ▪ API: functions that specify what we need to form an image ▪ object(s), viewer, light source(s), materials ▪ Provides access to other information ▪ input from devices suchCopyright as © COMP371mouse Computer and Graphics keyboard Instructors 4 OpenGL is cross-platform ▪ Available for Windows, Linux, Mac ▪ A freeware OpenGL implementation is also available on Linux called Mesa ▪ Same code will work on different OSes ▪ most of the time with no modifications! Copyright © COMP371 Computer Graphics Instructors 5 What an OpenGL program does? From the programmer’s point of view: ▪ Specifies the geometric object(s) in the scene
    [Show full text]
  • RZ/A1H Group Graphics Library "RGA" Introduction This Application Note Describes the Graphics Library RGA (Renesas Graphics Architecture) of RZ/A1H
    Application Note RZ/A1H Group Graphics Library "RGA" Introduction This application note describes the Graphics Library RGA (Renesas Graphics Architecture) of RZ/A1H. The following lists features of the RGA. Allows high-speed drawing using the hardware acceleration. The API created based on the W3C standard HTML Canvas 2D Context achieves easy learning. In addition, the RGA provides C++API that is operable as an interface that is backward compatible with HTML Canvas 2D Context. The memory area provided by the application is available as a drawing destination or for input images. Allows drawing of translucent images and translucent drawing using an alpha mask. The RGA provides a conversion tool that can access image files as global variables. (This conversion tool operates on the host PC.) Target Device RZ/A1H Group RZ/A1M Group RZ/A1L Group When applying the sample program covered in this application note to another microcomputer, modify the program according to the specifications for the target microcomputer and conduct an extensive evaluation of the modified program. Restrictions This library is not compatible with the vector graphics supported by Renesas Graphics Processor for OpenVG™ (R-GPVG2). The hardware acceleration is used only for a part of drawing (Table 1-2), (6.11.1). RZ/A1L does not use hardware acceleration. And it cannot use a part of drawing. (Table 1-3) This library cannot be used in multiple threads. Use the RGA in a single thread. R01AN2162EJ0401 Rev.4.01 Page 1 of 144 Jan.27.21 RZ/A1H Group Graphics Library RGA Contents Graphics Library "RGA" .................................................................................................................
    [Show full text]
  • Proton Testing of AMD E9173 GPU
    Proton Testing of AMD e9173 GPU E.J. Wyrwas1 NASA Goddard Space Flight Center Code 561.4, Radiation Effects and Analysis Group 8800 Greenbelt RD Greenbelt, MD 20771 Test Date: 2 June 2019 Report Date: 22 July 2019 1SSAI Inc, Greenbelt, MD USA AMD e9173 GPU SEE Test Report 1. Acronyms ATE Automated Test Equipment BGA Ball Grid Array BSOD Blue Screen of Death (Windows crash message) Cat5e Category 5e (enhanced) specification COTS Commercial Off the Shelf CPU Central Processing Unit CUDA Compute Unified Device Architecture CUFFT CUDA Fast Fourier Transform library DHCP Dynamic Host Configuration Protocol DRAM Dynamic random-access memory DUT Device Under Test EGL Embedded-System Graphics Library ES Embedded Systems GPU Graphical Processing Unit GUI Graphical User Interface HDMI High-Definition Multimedia Interface IPv6 Internet Protocol version MGH Massachusetts General Hospital OpenGL Open Graphics Library OpenCL Open Computing Language RAM Random Access Memory RJ45 Registered Jack #45 SDK Software Development Kit SEE Single Event Effects SEFI Single Event Functional Interrupt SKU Stock Keeping Unit SNTP Simple Network Time Protocol SOC System on Chip SOM System on Module SRAM Static Random Access Memory Page 2 of 7 AMD e9173 GPU SEE Test Report 2. Introduction and Summary of Test Results Single-Event Effects (SEE) testing was conducted on the AMD e9173 Graphics Processor Unit (GPU); herein referred to as device under test (DUT). A product brief1 is provided below. Figure 1: Product Brief from AMD's Power-Efficient Embedded GPUs website Testing was conducted at Massachusetts General Hospital’s (MGH) Francis H. Burr Proton Therapy Center on June 2nd, 2019 using 200-MeV protons.
    [Show full text]
  • Intel® Server Gpus for Cloud Gaming and Media Delivery Solution Brief
    Solution Brief Cloud & Communication Service Providers Graphics Intel® Server GPUs for Cloud Gaming and Media Delivery Providers of cloud gaming and streaming media can realize the massive opportunity in their market segments with a high density of streams per server and cost efficiency using the Intel® Server GPU. The universe of digital content continues to grow dramatically, creating unprecedented opportunity for cloud service providers (CSPs) and communication service providers (CoSPs). This opportunity is nowhere more evident than for cloud gaming and streaming media. According to the Cisco VNI report, video will account for 82% of the global IP traffic by 2022 and live video will increase fifteen-fold from 2017 to 2022 to reach 17% of internet video traffic. Over the same time period, gaming traffic is expected to grow nine-fold.1 In response, new participants are coming to these market segments, with companies such as phone manufacturers and OEMs joining alongside CSPs and CoSPs to vie for consumer attention. Optimizing competitiveness means that CSPs and CoSPs must strive to lower total cost of ownership (TCO) while continuing to deliver high-quality, reliable services. Embracing Android Cloud Gaming StatCounter reports that as of September 2020, Android has more than 74% of the global mobile market share.2 The content catalog is growing, providing additional opportunities to CSPs and CoSPs. The throughput requirements in this usage area can be expected to grow dramatically as the 5G rollout makes greater bandwidth available to end users over the next several years. The combination of Intel Xeon Scalable Processors, open-sourced and licensed software ingredients, and the new Intel Server GPU provides a very cost-effective, optimized solution for service providers to address a growing market.
    [Show full text]
  • GPU Developments 2017T
    GPU Developments 2017 2018 GPU Developments 2017t © Copyright Jon Peddie Research 2018. All rights reserved. Reproduction in whole or in part is prohibited without written permission from Jon Peddie Research. This report is the property of Jon Peddie Research (JPR) and made available to a restricted number of clients only upon these terms and conditions. Agreement not to copy or disclose. This report and all future reports or other materials provided by JPR pursuant to this subscription (collectively, “Reports”) are protected by: (i) federal copyright, pursuant to the Copyright Act of 1976; and (ii) the nondisclosure provisions set forth immediately following. License, exclusive use, and agreement not to disclose. Reports are the trade secret property exclusively of JPR and are made available to a restricted number of clients, for their exclusive use and only upon the following terms and conditions. JPR grants site-wide license to read and utilize the information in the Reports, exclusively to the initial subscriber to the Reports, its subsidiaries, divisions, and employees (collectively, “Subscriber”). The Reports shall, at all times, be treated by Subscriber as proprietary and confidential documents, for internal use only. Subscriber agrees that it will not reproduce for or share any of the material in the Reports (“Material”) with any entity or individual other than Subscriber (“Shared Third Party”) (collectively, “Share” or “Sharing”), without the advance written permission of JPR. Subscriber shall be liable for any breach of this agreement and shall be subject to cancellation of its subscription to Reports. Without limiting this liability, Subscriber shall be liable for any damages suffered by JPR as a result of any Sharing of any Material, without advance written permission of JPR.
    [Show full text]
  • IMGD 4000: Hot Topics in Graphics Prof Emmanuel
    IMGD 4000: Hot Topics in Graphics Prof Emmanuel Agu Computer Science Dept. Worcester Polytechnic Institute (WPI) Graphics Processing Unit (GPU) Entire graphics library (OpenGL, DirectX) in hardware => FAST!! Speed: GPUs renders graphics faster CPUs Programmable: in last 8 years (Shaders) Located either on PC motherboard (Intel) or Separate graphics card (Nvidia or ATI) On PC motherboard On separate PCI express card GPU Evolution High throughput computation GeForce GTX 280: 933 GFLOP/s High bandwidth memory GeForce GTX 280: 140 GB/s High availability to all “Fermi” 180+ million shader‐capable GPUs in the wild 3B xtors Numbers doubling every 6 months (Moore’s law cubed) GeForce 8800 681M xtors GeForce FX 125M xtors GeForce 3 GeForce® 256 60M xtors RIVA 128 23M xtors 3M xtors 1995 2000 2005 2010 Slide from Stanford course 193G 2010 DesignTradeoff makes GPU Fast Graphics has many calculations, few conditional (if/else, loops) Arithmetic Logic Unit (ALU) is part of GPU for calculations Tradeoff: Assign More transistors for ALU: faster rendering calculations, math Fewer transistors for control, branches, loops (graphics rarely uses them) ALU chip area: CPU (7%) vs GPU ( > 50%) Result: GPU runs graphics much faster than CPU GFLOP/sec growth GPU!! CPU!! GPU Memory Bandwidth GPU!! CPU!! Graph from: CUDA Programming Guide 4.2 Aside: GPGPU: GPUs to speedup Non‐Graphics Applications 146X 36X 18X 50X 100X Medical Molecular Video Matlab Astrophysics Imaging Dynamics Transcoding Computing RIKEN U of Utah U of Illinois, Elemental Tech
    [Show full text]
  • Glass Fabrication
    RESOURCE + PROCESS MANAGEMENT Glass fabrication Rima Ajlouni The University of Utah, Salt Lake City, UT ABSTRACT: In architecture, the prevalence of computational design and digital fabrication has led to an increase in exploration of casting modulated geometry using fabricated molds. However, the use of mold making strategies are often limited to casting materials that conform easily to mold geometry (i.e. concrete, plaster, resin, ceramics, etc.). It is rarely that fabrication strategies are used to explore materials with challenging behavioral properties such as glass. As a result, glass in its non-flat form has been underutilized in contemporary architecture. Because of its complicated physical behavior and the technical difficulties associated with the fabrication processes, architecture education often avoids exploring such medium. One key challenge with casting glass using fabricated refractory molds relates to understanding the behavior of glass under certain physical conditions and temperature profiles. If such parameters are not anticipated, the geometry of the final casted elements can be substantially different from the design intentions. This research argues that computation can be used to predict glass forming behavior under different temperature profiles, which can inform the design and fabrication processes. The goal is to highlight the importance of integrating the complexities of the physical reality into the design and fabrication processes, especially within the context of the educational experience. To contribute to this creative discourse this paper explores the limits of precision from computation to fabrication as it relates to casting glass. The objective is to design and test an algorithm for predicting edge/corner geometry of casted glass under different temperature profiles.
    [Show full text]