A Simulator for a Novel GPU to Support the Verifying and Profiling in Real World Applications

National Sun Yat-sen University
Department of Computer Science
Master's Thesis

Graduate student: Hsu-Kang Dow
Advisor: Dr. Steve W. Haga
October 2013

Dedicated to my parents

Abstract (translated from Chinese)

This thesis presents a verification and profiling simulator for a modern GPU, based on the open-source Attila simulator, that supports compiling and executing OpenGL ES 2.0 and GLSL ES programs. The completed simulator can verify the shading-language compiler and provide the statistics needed for profile-assisted compiler optimization. For research purposes, the simulator also collects performance statistics and exposes the values processed at each pipeline stage, which serve as reference values for hardware verification and debugging.

To let the Attila simulator support the GLSL shading language, this thesis provides a converter with two parts. The first part is an API-interface converter, used to link program data when GLSL code is compiled. The second part converts between Attila ISA and NSYSU ISA instructions and maps register I/O. Compared with the SystemC simulator, the new Attila simulator is 300 to 2000 times faster and avoids the precomputation otherwise required for system simulation, which makes it indispensable for simulating complex applications.

Keywords: graphics chip, Attila simulator, OpenGL ES, GLSL ES

Abstract

This thesis presents a simulator based on Attila, an open-source simulator of a modern GPU architecture that is capable of running games and benchmarks. We modified Attila to support OpenGL ES 2.0 and GLSL ES compilation and execution, which makes this work an important extension of the Attila simulator. The compiler is designed for the NSYSU GPU architecture, so the simulator allows us to verify the code produced by our GLSL ES shader compiler. The simulator was also created to enable future research: it can record statistical data while running real-world applications, and these data can then be used for profile-assisted compiler optimizations.

The simulator is accompanied by a converter from the NSYSU GPU to the Attila simulator. The converter consists of two parts: an API converter and an assembly converter. It solves the data-linking problems for attribute, uniform, and varying data that arise when adapting the Attila simulator to use the assembly generated by the NSYSU GLSL compiler. Compared to the NSYSU GPU's current SystemC simulator, the new Attila simulator is 300 to 2000 times faster.
It also avoids the need to precompute the simulator's input for system-level simulation. These benefits are necessary for simulating non-trivial applications.

Keywords: GPU, Attila, Simulator, OpenGL ES, GLSL ES.

Contents

1. Introduction
1.1 Profiling
1.2 Verifying
1.3 Real World Applications
1.4 Attila Simulator
1.5 Converter
2. Related Works
2.1 NSYSU GPU
2.2 NSYSU SystemC Simulator
2.3 Attila GPU
2.4 Attila Tracing
3. Methodology
3.1 An Overview of the Converter
3.2 Data Flow (Attribute Uniform Varying)
3.3 Attila OpenGL API Driver Modification
3.4 Converter for NSYSU to ATTILA Assembly
3.5 Load / Store instructions & Memory Design
3.6 Miscellaneous
4. Performance Comparison and Result
4.1 GLBenchmark
5. Reference

1. Introduction

This thesis presents a simulator for a novel GPU that supports the verification and profiling of real-world applications. The novel GPU [1] was developed by the Department of Computer Science at National Sun Yat-sen University (NSYSU). It targets embedded systems, where reducing power consumption is vital, so optimization plays an important role in the development of this GPU.
Applications and games on handheld devices are increasingly popular, and many of them use 3D graphics. To meet the needs of this future market, we introduce a power-aware embedded GPU. Programmable shaders are also important to modern game design, so the NSYSU GPU supports OpenGL ES 2.0 [2] and GLSL ES [3].

Along with the GPU, a GPU simulator is required to verify the implementation of both hardware and software. The current simulator is written in SystemC. It models the hardware behavior directly and is cycle-accurate, which makes it slow. Another disadvantage is that it lacks full-system simulation: it simulates only the hardware behavior, not the communication between OpenGL API calls (CPU) and shader programs (GPU). A simple benchmark such as red cubes involves only simple communication, which programmers can set up by hand, but this approach does not scale to larger applications. The current SystemC simulator is therefore too limited for real-world applications.

1.1 Profiling

Because of the nature of embedded systems, it is hard to increase performance by simply adding more hardware, as NVIDIA and AMD do for desktop GPUs. This is where optimization comes in. Before we optimize code, we need to know where the bottlenecks are and which resources can be reused. Profiling provides this information by tracing tagged control flow, collecting memory-related information, and identifying calculations that can be reused, such as the texture of a distant object, to improve performance and save power. The embedded environment differs from that of a PC, so optimization focuses on different aspects; for example, most handheld devices have a lower resolution than a desktop PC. What we have built is a tool that helps programmers automate optimizations that are usually done by hand.
For instance, a distant tree might cover only 3 pixels. We could spend hours rasterizing the full scene to obtain those 3 pixels, but a programmer can instead use a trick and simply paste a texture in place of the distant object. Profiling the scene while objects are rendered makes this possible. To obtain the data we need, we first need a simulator that can run applications and gather information at run time, so that we can use this information for profile-assisted compiler optimization. This was a requested feature for the simulator, and the new simulator provides it.

1.2 Verifying

The NSYSU GPU comes with a simulator written in SystemC whose functions map directly onto the hardware. It is designed for hardware verification and models the logic down to transistor detail, so simulating the entire process takes a long time. It is also cycle-accurate, which makes it slow when it is used to verify the code generated by the compiler. Another problem is that the current simulator lacks a connection to the API. Because simple benchmarks do not contain complex API call sequences, the API-to-GPU communication they exercise is insufficient for further analysis and verification. We therefore need a fast simulator that can quickly verify our optimized results and can process complex scenes to fully test our compiler-generated code.

1.3 Real World Applications

Another goal of this work is to expand our infrastructure to handle more complicated and more state-of-the-art shader programs, such as might be found in modern real-world applications. Prior to this thesis, the NSYSU GPU project was unable to support such applications. The pre-existing SystemC simulator is too buggy and too slow to simulate complex shader code. As for the OpenGL ES API function calls, real-world applications are so intricate that it becomes infeasible