NVIDIA Tesla: A Unified Graphics and Computing Architecture

Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym
NVIDIA

Hot Chips 19, IEEE Micro. 0272-1732/08/$20.00 © 2008 IEEE. Published by the IEEE Computer Society.

The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture [1].)

NVIDIA's Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA [2–4]) parallel programming model and development tools. The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.

In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification

The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the Microsoft DX7 API [5]. In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 [5] and OpenGL [6]. The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL [7, 8]. The GeForce FX added 32-bit floating-point pixel-fragment processors. The XBox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor [9].

Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio [10]. All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API [10]. They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture

The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream. The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs, and (if enabled) geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel-fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing

The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.

The work distribution units forward the input assembler's output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin
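
The peak input-assembler rates and the round-robin delivery of work mentioned above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a description of the actual hardware scheduler: the clock, per-clock rates, and unit counts are the figures quoted in the text for the GeForce 8800, while the function names `peak_rates` and `round_robin`, the integer "packets", and the choice of TPCs as the rotation targets are hypothetical.

```python
from itertools import cycle

# Figures quoted in the text for the GeForce 8800.
CORE_CLOCK_HZ = 600_000_000   # "typically 600 MHz"
PRIMITIVES_PER_CLOCK = 1      # input assembler peak rate
SCALAR_ATTRS_PER_CLOCK = 8    # input assembler peak rate
NUM_TPCS = 8                  # texture/processor clusters
NUM_SMS = 16                  # streaming multiprocessors
NUM_SPS = 128                 # streaming-processor cores


def peak_rates(clock_hz=CORE_CLOCK_HZ):
    """Return (primitives per second, scalar attributes per second) at peak."""
    return clock_hz * PRIMITIVES_PER_CLOCK, clock_hz * SCALAR_ATTRS_PER_CLOCK


def round_robin(packets, num_units=NUM_TPCS):
    """Assign packets to units in strict rotation (hypothetical model only)."""
    queues = [[] for _ in range(num_units)]
    for unit, packet in zip(cycle(range(num_units)), packets):
        queues[unit].append(packet)
    return queues


if __name__ == "__main__":
    prims, attrs = peak_rates()
    print(f"peak primitives per second: {prims:,}")
    print(f"peak scalar attributes per second: {attrs:,}")
    print(f"SMs per TPC: {NUM_SMS // NUM_TPCS}, SPs per SM: {NUM_SPS // NUM_SMS}")
    # Twenty hypothetical work packets, numbered 0..19.
    for tpc, queue in enumerate(round_robin(range(20))):
        print(f"TPC {tpc}: {queue}")
```

Strict rotation keeps the distribution logic trivial and spreads equal-cost packets evenly; it does not account for packets of differing cost, which the sketch makes no attempt to model.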