Understanding GPGPU Vector Register File Usage
Mark Wyse*
[email protected]
AMD Research, Advanced Micro Devices, Inc.

Paul G. Allen School of Computer Science & Engineering, University of Washington

ABSTRACT

Graphics processing units (GPUs) have emerged as a favored compute accelerator for workstations, servers, and supercomputers. At their core, GPUs are massively multithreaded compute engines, capable of concurrently supporting over one hundred thousand active threads. Supporting this many threads requires storing context for every thread on-chip, which results in large vector register files that consume a significant amount of die area and power. It is therefore imperative that this vast number of registers be used effectively, efficiently, and to maximal benefit.

This work evaluates the usage of the vector register file in a modern GPGPU architecture. We confirm the results of prior studies, showing that vector registers are reused in small windows by few consumers and that vector registers are a key limiter of workgroup dispatch. We then evaluate the effectiveness of previously proposed techniques at reusing register values and hiding bank access conflict penalties. Lastly, we study the performance impact of introducing additional vector registers and show that additional parallelism is not always beneficial, somewhat counter-intuitive to the "more threads, better throughput" view of GPGPU acceleration.

1. INTRODUCTION

Contemporary graphics processing units (GPUs) are incredibly powerful data-parallel compute accelerators. Originally designed exclusively for graphics workloads, GPUs have evolved into programmable, general-purpose compute devices. GPUs are now used to solve some of the most computationally demanding problems, in areas ranging from molecular dynamics to machine intelligence. The rapid adoption of GPUs into general-purpose computing has given rise to a new term describing these devices and their use: General-Purpose GPU (GPGPU) computing. In this context, GPUs are no longer bound to their traditional domain of graphics but are commonly viewed as the workhorse for computationally intense applications.

As the use of GPUs has expanded, the architecture of GPGPU devices has evolved. GPGPUs are massively multithreaded devices, concurrently operating on tens to hundreds of thousands of threads. Unlike CPUs, which target low-latency computation, GPUs excel at high-throughput computation. Achieving high throughput requires supporting many threads, each requiring on-chip context. This context typically includes shared memory space, program counters, synchronization resources, and private storage registers. Maintaining context on-chip enables multithreading among the thousands of active threads, with single-cycle context switching between groups of threads. However, the required context consumes millions of bytes, orders of magnitude more than the context of the few threads present in a traditional CPU. The vector register file storage space alone is typically larger than the L1 data caches and consumes as much as 16 MB in a state-of-the-art, fully configured AMD Radeon™ RX "VEGA" GPU [8][9]. With a considerable amount of storage, die area, and energy being consumed by the vector register files, it is important to understand how GPGPU applications use this structure so that it may be optimized for performance and/or energy efficiency.

This paper examines modern GPGPU architectures, focusing on their use of vector general-purpose registers and the vector register subsystem architecture. Our study consists of three main parts. First, we replicate experiments from prior work revealing the vector register usage patterns for a set of compute applications. We confirm the results of prior work, despite modeling a GPGPU architecture based on products from a different device vendor. Second, we evaluate the effectiveness of operand buffering and register file caching as proposed in prior work. Our experiments show these structures to be highly effective at hiding bank access conflict penalties and enabling vector register value reuse. Third, we examine the potential parallelism and occupancy benefit of a GPGPU architecture providing (physically or logically) twice the number of vector general-purpose registers. We show that the benefit of higher wave-level parallelism and device occupancy is application dependent; for many developers this notion remains counter-intuitive.

The remainder of the paper is organized as follows. Section 2 provides background on GPGPU architecture and execution. Section 3 describes our analysis and simulation methodology. Sections 4, 5, 6, and 7 detail our experimental results. Section 8 covers related work, Section 9 provides thoughts on future research directions, and we conclude in Section 10.

* This work was completed while the author was a Post-Grad Scholar at AMD Research in Bellevue, WA.

AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

2. BACKGROUND

GPUs are massively multithreaded processing devices that support over one hundred thousand active threads. Supporting this many active threads requires an architecture that is modular and compartmentalized, as well as a programming model to express data-parallel computation. This section details the GPGPU programming model, describes the hardware execution model, and details the specific GPU architecture used in this study.

[Figure 1. Sample CU Architecture: a Compute Unit containing Instruction Fetch, N wavefront (WF) contexts, Dependency Logic, an Instruction Arbitration & Scheduler, and Execution Units (two SIMD VALU/SALU pairs with Vector and Scalar RFs, plus Scalar, Local, and Global Memory pipelines), connected to a Scalar Cache, LDS, Data Cache, and I-Cache.]

2.1 GPGPU Programming Model

GPGPUs use a data-parallel, streaming-computation programming model. In this model, a program, or kernel, is executed by a collection of work-items (threads). The programming model typically uses the single-instruction, multiple-thread (SIMT) execution model. Work-items within a kernel are subdivided into workgroups by the programmer, which are further subdivided into wavefronts by hardware. The work-items within a wavefront are logically executed in lock-step. All work-items within a workgroup may perform synchronization operations with one another.

The wavefront size is a hardware parameter that may change across architecture generations or between devices capable of executing the same Instruction Set Architecture (ISA) generation. Programmers should not rely on the wavefront size remaining constant across hardware generations and should not have dependencies on a specific wavefront size in their code.

2.2 GPGPU Hardware Execution Model

Modern GPU architectures execute kernels using a SIMD (Single Instruction, Multiple Data) hardware model. As mentioned above, a kernel is composed of many work-items that are collected into workgroups. The workgroup is the unit of dispatch to the Compute Unit (CU), which is the hardware unit responsible for executing workgroups. A CU must be able to support at least one full-sized workgroup, but may be able to execute additional workgroups concurrently if hardware resources allow. All work-items from the same workgroup are executed on the same CU. A GPU device contains at least one CU, but it may contain more to facilitate execution of many workgroups concurrently.

Within a CU, the SIMD unit is the hardware component responsible for executing wavefronts. Each wavefront within a workgroup is assigned to a single SIMD within the CU to which the workgroup is dispatched. The SIMD unit is responsible for executing all work-items in a wavefront in lock-step. Each SIMD has access to a scalar ALU (SALU), a branch and message unit, and memory pipelines.

AMD's GCN architecture [2] also includes scalar instructions that are executed on the scalar ALU. These scalar instructions are generated by the compiler, transparent to the programmer, and are intermixed with vector instructions in the instruction stream. Scalar instructions are used for control flow or for operations that produce a single result shared by all work-items in a wavefront.

2.3 Baseline GPGPU Architecture

In this section we detail the CU architecture employed in our study. Figure 1 depicts the architecture of the CU we model, which is capable of executing AMD's GCN3 ISA [3]. Without loss of generality, we elect to use AMD's terminology where applicable. The CU used in our study contains two SIMD Vector ALUs (VALUs), two Scalar ALUs (SALUs), Vector Register Files (VRFs), Scalar Register Files (SRFs), a Local Data Share (LDS), forty wavefront slots, and Local Memory (LM), Global Memory (GM), and Scalar Memory (ScM) pipelines; the CU is connected to scalar, data, and instruction caches. The following subsections detail the main blocks within the CU. Note that the Scalar Cache and I-Cache are shared between multiple CUs, while all other blocks are private per CU.

[Figure 2. Vector Register File Subsystem Architecture: 64 lanes (Lane 0 through Lane 63) served by four Vector RF banks (Bank 0 through Bank 3), feeding the SIMD VALU through an Operand Buffer and an Operand Register File Cache.]

2.3.3.2 Operand Buffer

The Operand Buffer (OB) [12][15] is responsible for reading the vector source operands of each VALU instruction. The primary purpose of the OB is to hide bank access conflict latency penalties. It is a FIFO queue, and instructions enter and leave the OB in order. However, the OB may read source operands for any instruction present in the FIFO in any cycle (i.e., out of order with respect to the execution order). In this study, an oldest-first-then-greedy policy is used to read source operands, but this may be changed in future implementations. The OB attempts to read the operands of the oldest instruction first, but will greedily read operands for younger instructions to avoid bank conflicts, or if there are banks with available read ports that contain operands for younger instructions.
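The workgroup-to-wavefront decomposition described in Section 2.1 can be sketched as a small illustration. The 64-wide wavefront and the flat work-item numbering below are assumptions chosen for concreteness (64 is the GCN wavefront size, but, as noted above, code should not depend on it):

```python
WAVEFRONT_SIZE = 64  # hardware-chosen; programs should not rely on this value

def split_into_wavefronts(workgroup_size):
    """Partition one workgroup's work-items into wavefronts.

    The programmer picks workgroup_size; hardware carves the workgroup
    into lock-step wavefronts, leaving trailing lanes of the last
    wavefront inactive when the size is not a multiple of 64.
    """
    wavefronts = []
    for base in range(0, workgroup_size, WAVEFRONT_SIZE):
        ids = list(range(base, min(base + WAVEFRONT_SIZE, workgroup_size)))
        wavefronts.append(ids)
    return wavefronts
```

For example, a 256-work-item workgroup yields four full wavefronts, while a 100-work-item workgroup yields one full wavefront and one partially active one.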
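The scalar/vector split of Section 2.2 can be illustrated with a toy lock-step execution model. The address computation here is a hypothetical example, invented only to show one compiler-generated scalar result being shared by every work-item in a wavefront:

```python
def wavefront_address_calc(lane_ids, base_ptr, row, row_stride):
    """Toy SIMT execution of one scalar/vector instruction pair.

    The compiler hoists the lane-invariant part (row * row_stride) into
    a scalar instruction: it executes once on the SALU and its result is
    shared by all work-items. The per-lane add then runs on the SIMD
    VALU, producing one result per work-item, in lock-step.
    """
    s0 = base_ptr + row * row_stride         # scalar ALU: one shared result
    return [s0 + lane for lane in lane_ids]  # vector ALU: one result per lane

# One wavefront of 64 work-items computing per-lane addresses.
addrs = wavefront_address_calc(range(64), base_ptr=0x1000, row=2, row_stride=256)
```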
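The abstract's claim that vector registers limit workgroup dispatch can be made concrete with a back-of-the-envelope occupancy model. This is an illustrative sketch, not the paper's simulator; the per-lane VGPR file depth of 256 and the ten wavefront slots per SIMD are assumptions loosely based on GCN3-class hardware:

```python
def max_waves_per_simd(vgprs_per_wave, vgpr_file_depth=256, wave_slots=10):
    """Upper bound on concurrently resident wavefronts on one SIMD.

    vgprs_per_wave: vector registers allocated per wavefront (per lane).
    vgpr_file_depth: VGPR entries per lane in the SIMD's register file.
    wave_slots: architectural wavefront contexts per SIMD.
    A wavefront cannot be dispatched unless its entire VGPR allocation
    fits, so occupancy drops in discrete steps as register use grows.
    """
    if vgprs_per_wave <= 0:
        return wave_slots
    return min(wave_slots, vgpr_file_depth // vgprs_per_wave)
```

Under these assumed parameters, a kernel using 25 VGPRs per wavefront still reaches full occupancy, but a single additional register (26) drops the SIMD from ten resident wavefronts to nine; doubling the register file (as studied later in the paper) would restore the lost wave-level parallelism.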
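The oldest-first-then-greedy read policy of Section 2.3.3.2 can be sketched as a one-cycle toy model. Everything here is an assumption for illustration rather than the paper's exact microarchitecture: four banks (as in Figure 2), one read port per bank, and a register-number-modulo-bank mapping:

```python
def schedule_reads(fifo, num_banks=4):
    """One cycle of operand collection for a banked vector register file.

    fifo: in-order list of OB entries, each a set of still-unread source
    register numbers. Each bank serves at most one read per cycle.
    Policy: serve the oldest entry's operands first, then greedily fill
    idle banks with younger entries' operands. Returns the banks used.
    """
    busy = set()
    for srcs in fifo:                  # oldest entry first
        for reg in sorted(srcs):
            bank = reg % num_banks     # assumed register-to-bank mapping
            if bank not in busy:       # one read port per bank per cycle
                busy.add(bank)
                srcs.discard(reg)
    while fifo and not fifo[0]:        # instructions leave the FIFO in order
        fifo.pop(0)
    return busy

# Registers 0 and 4 both map to bank 0: the older instruction wins the
# port, while the younger one's conflict-free reads still proceed.
fifo = [{0, 4}, {1, 2}]
first_cycle_banks = schedule_reads(fifo)
```

After the first cycle the younger instruction has collected both of its operands out of order, while the older one waits a cycle for its conflicting read, which is exactly the bank-conflict-hiding behavior the OB is meant to provide.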