Exploring Compression in the GPU Memory Hierarchy for Graphics and Compute


EXPLORING COMPRESSION IN THE GPU MEMORY HIERARCHY FOR GRAPHICS AND COMPUTE

A Dissertation Presented by Akshay Lahiry to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering

Northeastern University
Boston, Massachusetts
August 2018

Contents

1 Introduction
  1.1 Compressed Cache Architecture
  1.2 Compression Algorithms
2 Background
  2.1 GPU Memory Hierarchy
  2.2 Compression Algorithms
  2.3 Compressed Cache Architecture
    2.3.1 Smart Caches
  2.4 DRAM Efficiency
3 Framework and Metrics
4 Results
  4.1 Dual Dictionary Compressor
    4.1.1 Cache Architecture
    4.1.2 Dictionary Structure
    4.1.3 Dictionary Index Swap
    4.1.4 Dictionary Replacement
    4.1.5 DDC Performance
    4.1.6 Summary
  4.2 Compression Aware Victim Cache
    4.2.1 Design Challenges
    4.2.2 Results
    4.2.3 Summary
  4.3 Smart Cache Controller
    4.3.1 Smart Compression
    4.3.2 Smart Decompression
    4.3.3 Smart Prefetch
    4.3.4 Summary
5 Compression on Graphics Hardware
  5.1 Geometry Compression
  5.2 Texture Compression
  5.3 Depth and Color Compression
  5.4 Compute Workloads
  5.5 Key Observations
6 Conclusion
Bibliography

List of Figures

1.1 Total number of pixels for common display resolutions.
1.2 Total bytes per frame for common display resolutions.
1.3 Two frames of Halo: Combat Evolved. The left image shows a frame from 2001 and the right image shows the updated frame from 2011.
2.1 Block diagram showing the ratio of ALU to cache hardware.
2.2 Block diagram showing the cache hierarchy in a CPU and a GPU.
2.3 Compression example using Frequent Value Compression.
2.4 Compression example using CPack.
2.5 Fragmentation example for a fixed compaction scheme.
2.6 Fragmentation with fewer restrictions on compressed block placement.
2.7 Compaction with subblocks in the data array.
2.8 Compaction with decoupled subblocks and super tags.
2.9 Byte-masked writes with uncompressed data.
2.10 Byte-masked writes with compressed data.
2.11 Feature table for a hashed perceptron prediction.
2.12 Example DRAM address mapping.
2.13 Efficient DRAM access pattern.
2.14 Inefficient DRAM access pattern.
3.1 Block diagram of the simulation framework.
3.2 Sample simulation output for the IceStorm benchmark.
4.1 A cache block.
4.2 Block diagram of the LLC with our dual dictionary.
4.3 Output from the Dictionary Index Swap Unit.
4.4 Write bandwidth savings.
4.5 Read bandwidth savings.
4.6 Block diagram of the LLC with a victim cache.
4.7 Super Tag structure for the victim cache.
4.8 Byte-masked write with uncompressed data.
4.9 Byte-masked write in a compressed cache.
4.10 Burst efficiency with the EBU on and off.
4.11 Row buffer hit rate with the victim EBU on and off.
4.12 Conventional compressed cache.
4.13 Smart compressed cache.
4.14 Results for smart data compression. For each workload, we show the oracle performance as our baseline. We also plot the wasteful compressions identified by our model as a percentage of the oracle, as well as the false positives as a percentage of total predictions.
4.15 Results for smart data decompression. For each workload, we show the oracle performance as our baseline. We also plot the repeated decompressions eliminated by our smart decompressor as a percentage of the oracle.
4.16 Results for smart prefetching. For each workload we show: the efficiency with compression turned off (the baseline), the memory efficiency drop with compression turned on, and the efficiency improvements when using smart prefetching.
4.17 Area impact of feature tables.
5.1 A simplified Direct3D pipeline.
5.2 Texture map example by Elise Tarsa [66].
5.3 Bandwidth savings - AES benchmark.
5.4 Bandwidth savings - compute workloads.
5.5 Bus utilization - compute workloads.

List of Tables

2.1 CPack code table.
2.2 Frequent Pattern Encoding table as seen in [25].
2.3 Comparison of popular hardware compression algorithms.
3.1 Graphics workloads.
3.2 Compute workloads.
4.1 Updated pattern table.
4.2 Compressed word distribution.

Abstract of the Dissertation

EXPLORING COMPRESSION IN THE GPU MEMORY HIERARCHY FOR GRAPHICS AND COMPUTE

by Akshay Lahiry
Doctor of Philosophy in Computer Engineering
Northeastern University, August 2018
Dr. David Kaeli, Advisor

As game developers push the limits of graphics processors (GPUs) in their quest to achieve photorealism, modern games are becoming increasingly memory bound. At the same time, display vendors are pushing the boundaries of display technology with ultra-high resolution displays and high dynamic range (HDR). This means the GPU not only needs to render more pixels, but also needs to process significantly more data per pixel for high-quality rendering. The advent of mainstream virtual reality (VR) also raises the minimum frame rate for these displays, which puts considerable pressure on the GPU memory system. GPUs have also evolved to be used as accelerators in high-performance computing systems. Given their data-parallel throughput, many compute-intensive applications have benefited from GPU acceleration. For both of these workload classes, increasing the cache size helps alleviate some of the memory pressure by keeping frequently used data on-chip. However, die area on modern chips comes at a premium. Data compression is one approach to managing the data footprint problem.
Data compression in the last-level cache (LLC) can help achieve the performance of a much larger cache while utilizing significantly less die area. In this thesis, we address the challenges of using on-chip data compression and explore novel methods to arrive at performant solutions for both graphics and compute. We also highlight some unique compression requirements of graphics workloads and how they contrast with prior cache compression algorithms.

Chapter 1

Introduction

Over the past few years, the number of pixels in a video display has increased exponentially. The graphics display industry has quickly moved from resolutions as low as 480p to stunning 8K displays. Figure 1.1 shows some common display resolutions, ranging from half a million pixels for old displays in SVGA format to thirty-three million pixels for current state-of-the-art 8K displays.

Figure 1.1: Total number of pixels for common display resolutions.

Rendering a growing number of pixels significantly increases the amount of data associated with each frame. Figure 1.2 shows how the total bytes per frame scales, from a few kilobytes for low-resolution displays to thirty megabytes for a modern high-resolution display. With the introduction of high dynamic range (HDR) displays, the amount of data associated with each pixel has also increased: more bits are being used to represent each channel of the pixel [1, 2]. As these ultra-high resolution and high dynamic range displays become ubiquitous, the data footprint of modern games will increase rapidly.

Figure 1.2: Total bytes per frame for common display resolutions.

Figure 1.3 shows two frames from the game Halo: Combat Evolved. The frame on the left is from the original game released in 2001, and the frame on the right is the remastered version from the game's anniversary update in 2011. The images show a significant improvement in the level of detail per frame.

Figure 1.3: Two frames of Halo: Combat Evolved. The left image shows a frame from 2001 and the right image shows the updated frame from 2011.

Game developers push modern graphics processors (GPUs) to their performance limits in their quest for photorealism. As virtual reality (VR) headsets become more mainstream, the quest for realism goes even further, to enable the illusion of reality in the virtual world. High-end VR headsets have two high-resolution screens that independently render immersive content. This puts a lot of pressure on the GPU to render high-resolution images in real time; a single dropped frame due to memory latency can ruin the immersive experience.

The trends mentioned above have significantly increased the pressure on the memory hierarchy of GPUs. This exponential increase in the number of bytes per frame adds substantial pressure on the GPU's cache hierarchy and memory. Because accesses to memory are extremely slow, GPUs have been increasing the size of on-chip caches to avoid costly cache misses and improve performance. This results in increased power consumption and on-chip area, which is not ideal. Modern manufacturing nodes have significant yield issues with larger chips, which become increasingly expensive to manufacture. As area on the chip comes at a premium, over-provisioning caches has an adverse impact. Every millimeter of chip area is valuable, and graphics chip companies make every effort to improve performance per mm². Having more data in flight also increases bandwidth contention on a GPU, which can increase the cost of a cache miss. The on-chip area must be used more efficiently to handle the memory footprint issue.
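The per-frame scaling described above can be reproduced with a quick back-of-the-envelope sketch. This is an illustrative calculation, not the dissertation's methodology: the resolutions listed and the 4-bytes-per-pixel (8-bit RGBA) figure are assumptions, and HDR formats that use more bits per channel raise the cost further.

```python
# Back-of-the-envelope frame sizes for common display resolutions.
# Assumes 4 bytes/pixel (8-bit RGBA); HDR formats with wider channels
# (e.g. 16 bits per channel) would double this.
RESOLUTIONS = {
    "SVGA (800x600)":    (800, 600),
    "1080p (1920x1080)": (1920, 1080),
    "4K (3840x2160)":    (3840, 2160),
    "8K (7680x4320)":    (7680, 4320),
}

BYTES_PER_PIXEL = 4  # illustrative assumption, not a figure from the thesis

def frame_bytes(width, height, bpp=BYTES_PER_PIXEL):
    """Raw size of one frame's color buffer in bytes."""
    return width * height * bpp

for name, (w, h) in RESOLUTIONS.items():
    pixels = w * h
    mib = frame_bytes(w, h) / 2**20
    print(f"{name:>20}: {pixels / 1e6:5.1f} Mpixels, {mib:6.1f} MiB/frame")
```

A 4K frame at these assumptions is already about 32 MiB, consistent with the "thirty megabytes" figure above; at a 90 Hz VR-style refresh rate, that one color buffer alone implies roughly 3 GB/s of write traffic before any overdraw, textures, or intermediate render targets are counted.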