GPU Architecture
Michael Doggett Department of Computer Science Lund university GPUs from my time at ATI
R200 Xbox360 GPU
R630 R610 R770 Let’s start at the beginning ... Graphics Hardware before GPUs • 1970s - 1980s • Very Expensive • Real-time Rendering needs a lot of compute power • Military, large industrial, flight simulators • Evans and Sutherland founded 1968 • Custom built systems, multiple boards • Custom and off-the-shelf ASICs • Today’s GPU has similar structure, but all on one chip Pixel-Planes/Flow
• Research Systems • University of North Carolina • Pixel-Planes 5 [Fuchs89] • PixelFlow [Molnar92] • Programmable Shading [Lastra95] • Final version built and sold by HP ’97 Image courtesy [Lastra95] Silicon Graphics • Onyx • MIPS R4400 RISC processors • Reality Engine ’92[Akeley93] • Geometry Engine • 50MHz Intel i860XP, • Raster Memory, 1-4 • OpenGL, no shaders
Images courtesy [Akeley93] Silicon Graphics • Onyx 10000 • Infinite Reality [Montrym97]
Images courtesy [Montrym97] Neon: A (Big) (Fast) Single-Chip 3D Workstation Graphics Accelerator
• Early GPU design (1999) with publication! • 8 x 32bit memory controllers • Commodity SDRAMs • 2D tiled memory interleaving • 350 nm, 17.3 x 17.3 mm, 6.8 million transistors, 100MHz GPUs Graphics Processing Unit GPU • How to turn millions of triangles into pixels in 1/60 of a second? • Parallelism !! • Pipelining • Single Instruction Multiple Data (SIMD) • Multiple Instruction Multiple Data (MIMD) multiple cores • Memory Bandwidth reductions (covered previously) Data 0 Data 1 Data 2 Data 3
PU PU PU PU 4-wide SIMD 1 Instruction PU=processing unit GPU design parameters
• Competition • 2 strong competitors NVIDIA and AMD (can consider Intel a third) • Performance/Dollar • Moore’s law • Number of transistors on a chip doubles every two years • APIs (OpenGL and DirectX) hide details • Hide architectural changes • Provide backward compatibility (mostly) Before programmable GPUs • ATI • Founded 1985 • 2D - Mach series • 3D - Rage series ’95-’99 Rasterization • nVidia Textures • Founded 1993 Texturing • Riva, Riva TNT series ’95-’99 Z & Alpha • up to DirectX 6
Image FrameBuffer Hardware Transform, Clipping and Lighting
• NV10 ’99 Vertices • Nvidia GeForce 256 Transform • The first GPU R100 ’00 • Rasterization • ATI Radeon 7500 Transformation requires 32bit Textures • Texturing float 4x4 matrix multiplication • Texturing for 4 component (RGBA) 8bit pixels Z & Alpha • Low precision math • DirectX 7 Image FrameBuffer Programmable GPUs 1st Generation
Vertices VertexTransform shader Shaders • Shaders run programs that do what fixed Rasterization function hardware did Textures • GPU programs (shaders) PixelTexturing shader have simpler Shaders requirements than CPU Z & Alpha programs Image FrameBuffer Programmable GPUs 1st Generation • NV20 ’01 Nvidia GeForce 3 Vertices • VertexTransform shader [Lindholm01] Shaders
• R200 ’01 Rasterization ATI Radeon 8500 • Textures PixelTexturing shader • 2-wide Vertex shader Shaders • 4-wide SIMD Pixel shader Z & Alpha • Fixed point ~16bits Image FrameBuffer Programmable GPUs 1st Generation
Vertices VertexTransform shader • DirectX8 Pixel Shaders Shaders
• Multiple versions Rasterization • 1.1, 1.3, 1.4 Textures 13-22 instructions PixelTexturing shader • Shaders assembler language • Z & Alpha
Image FrameBuffer Programmable GPUs 2nd Generation • R300 ’02 ATI Radeon 9700 Vertices • VertexTransform shader • 256bit memory bus Shaders • 4-wide Vertex shader Rasterization 8-wide SIMD Pixel Textures • PixelTexturing shader shader Shaders • 24bit float Z & Alpha
Image FrameBuffer Programmable GPUs 2nd Generation • NV30 ’03 • Nvidia GeForce FX 5800 Vertices VertexTransform shader • 16 and 32 bit float Shaders • Cg (C for graphics) Rasterization NV40 ’04 GeForce 6800 • Textures [Montrym05] PixelTexturing shader Shaders • DirectX 9 Z & Alpha • High Level Shading Language (HLSL) Image FrameBuffer Unified Shaders Unified Shader
Vertex shader
Rasterization
Pixel shader
Z & Alpha
FrameBuffer Unified Shader
Grouper Rasterization
Unified shader
Z & Alpha
FrameBuffer Unified Shader
• A revolutionary step in Graphics Hardware • One hardware design that performs both Vertex and Pixel shaders • Vertex processing power
Vertices Vertices Pixels
Vertex Shader Unified Shader
Pixels
Pixel Shader
22 Unified Shader
• GPU based vertex and pixel load balancing • Better vertex and pixel resource usage • Union of features • E.g. Control flow, indexable constant, … • All 32bit floating point • DX9 Shader Model 3.0+ • Shader write at address instruction
Vertices Pixels 3 SIMDs Unified Shader
48 ALUs
Vec4 Scalar 23 XBox360 GPU architecture - ’05
GPU Primitive Setup Vertex Pipeline Index Stream Clipper Generator Pixel Pipeline Rasterizer Tessellator Display Pixels Hierarchical Z/S
Unified Shader
Texture/Vertex Fetch UNIFIED MEMORY Output Buffer Memory Export
SIMD is 16 vector math units wide DAUGHTER - Shared instr. decode and control DIE - Makes GPUs simpler than CPUs 24 Rendering performance Z & Alpha
• 2 Dies (1 GPU, 1 Logic Embedded DRAM) • GPU to Daughter Die high speed interface • 8 pixels/clk • 32BPP color, 4 samples/pixel Z - Lossless compression • 16 pixels/clk – Double Z • 4 samples/pixel Z - Lossless compression
GPU
32GB/s
DAUGHTER DIE
25 Rendering performance Z & Alpha
• Z and Alpha logic to EDRAM interface • 256GB/s • Color and Z - 32 samples • 32bit color, 24bit Z, 8bit stencil • Double Z - 64 samples • 24bit Z, 8bit stencil
DAUGHTER DIE
8pix/clk, 4x MSAA, Stencil and Z test, Alpha blending
256GB/s 10MB EDRAM
26 2013 Xbox360 GTAV Unified Shader
Grouper Rasterization
Unified shader
Z & Alpha
FrameBuffer Unified Shaders - PC • G80 ’06 (Tesla [Lindholm08]) • nVidia GeForce 8800 • 8 Texture Processor Cluster (TPC) • 2 Streaming Multiprocessor (MP) • 8 Streaming processors (SP), Multiply Add (MAD) • 128 SPs @ 1350 MHz • Level 2 cache placed close to the individual memories (DRAMs) • DirectX 10 • Shader Shared Memory • CUDA - new compute focused API NVIDIA G80
Images courtesy [Lindholm08] Unified Shaders - PC
• R600 ’07 • ATI Radeon 2900 • 4 SIMDs • 16 Stream Processing Units • 5 Scalar Units, MAD • 320 Scalars @ 743 MHz • DirectX 10 R600 Top Level Rasterizer •Red – Compute
•Yellow – Cache
•Unified shader Thread Scheduler
•Shader R/W
•Instr./Const. cache Unified 4 SIMDs Texture •Unified texture cache 16 Vector Units per SIMD Cache 5 ALUs per Vector Unit •Compression L2 256KB L1 32KB
Z + Alpha
32 August 2007 Radeon HD 2900 Fast Thread Switching enables GPU arithmetic intensity • 1000s of small simple threads with context on chip • GPUs hide large memory request latency by switching between threads time (on one processor) sleep 1 Read ALU ops
2 Read ALU ops
3 Read ALU ops
4 Read ALU ops
© mmxii mcd How many instructions in shaders?
ATI Ruby Demos 1-3 DX9 4 DX10
34 instructions Slide courtesy Norm Rubin, AMD R770 ’08 ATI Radeon 4870
! 10 SIMDs –SIMD = 16 x 5 ALU ! New memory architecture –L2 cache partitioned at MCs –Crossbar to L1s ! 1.2 TeraFLOPS
35 R770 Die
! 260mm2 ->small ! 956 MTransistors ! ~1 Billion Transistors
36 R770 Die usage
! Red – 10 SIMDs ! Orange –64 z/stencil –40 texture
37 DirectX 11 • Direct3D 11 • Tessellation • Hull Shader • Works on control points • Tessellator - Fixed Function • Generates new vertices • Domain Shader • Works on vertices • Geometry Shader (DX10) • Works on primitives (triangles) • Compute Shader
© mmxi mcd Multi-Graphics core nVidia GF100 (Fermi) • GeForce GTX 480 • Released March, 2010 • 4 x Triangle rate, 4 rasterization engines • 4 GPC x 3-4 SM x 32 scalars = 480 scalars @ 1401MHz • Memory Hierarchy • New L1-L2 cache for shader read/write • L1 cache (per SM) is 16 or 48 KB • 768KB L2 cache is coherent and shared with texture and other parts of the graphics pipeline
© mmxii mcd Comparing GPU shaders NVIDIA GeForce GTX 480 “SM” ATI Radeon HD 5870 “SIMD-engine”
Fetch/ Decode Fetch/ Fetch/ Decode Decode
Execution contexts (128 KB) “Shared” Execution memory “Shared” memory contexts (32 KB) (16+48 KB) (256 KB)
15 Streaming 20 SIMD-engines Multiprocessors
© mmxii mcd Images courtesy Kayvon Fatahalian Comparing CPUs and GPUs
Core Core Core Core CPU L2 L2 L2 L2 Shared L3 cache
cache cache cache cache GPU
cache cache cache cache GPUs in 2017-18
• Nvidia Volta architecture, 12nm • Nvidia Tesla V100 (GV100) 15 TeraFLOPS (32bit FP) • AMD Radeon Pro Vega • Consoles • Microsoft Xbox One X (AMD) • Sony Playstation Pro (AMD) • Nintendo Switch Nvidia Tegra • Mobile • ARM Mali Bifrost Architecture • Intel discreet GPUs coming
© mmxvi mcd Next 5-10 years
• Continued upgrade cycle for GPUs? • Importance of parallel programming languages • CUDA, OpenCL, … • GPU Compute killer app is here • AI/Deep Learning! • Nvidia Volta - Tensor Cores (FP16)
© mmxvi mcd Summary
• GPU shaders grew from low precision simple programs • GPU ALU is much simpler than CPUs • SIMD allows less hardware for many instructions • GPUs hide memory latency with 1000s of threads • GPUs are massively parallel devices and can be used for general compute (GPGPU) Next ...
• Project Description Due Thursday • Thursday • Graphics architecture and OpenCL • Next week - Tuesday - Guest Speaker • Johan Grönqvist, ARM References • Henry Fuchs et. al. "A Heterogeneous Multiprocessor Graphics System Using Processor- Enhanced Memories," Proceedings of SIGGRAPH '89 • Steven Molnar et. al. “PixelFlow: high-speed rendering using image composition”, Proceedings of SIGGRAPH 92 • Kurt Akeley, "RealityEngine Graphics". Proceedings of SIGGRAPH '93 • John Montrym et. al. “InfiniteReality: a real-time graphics system”, Proceedings of SIGGRAPH ’97 • Joel McCormack, “Neon: A (Big) (Fast) Single-Chip 3D Workstation Graphics Accelerator” WRL Research Report 98/1 (Revised 99) • Lindholm et. al., “A user-programmable vertex engine”, Proceedings of SIGGRAPH ’01 • Montrym et. al. “The GeForce 6800”, IEEE Micro, March ’05 • Lindholm et. al. “Nvidia tesla: A unified graphics and computing architecture”, IEEE Micro, March- April ’08 • Larry Seiler et. al. “Larrabee: A Many-Core x86 Architecture for Visual Computing”, Proceedings of SIGGRAPH’08 • Jeremy Sugerman et. al. “GRAMPS: A Programming Model for Graphics Pipelines”, ACM Transactions on Graphics January, ’09 • Timo Aila et. al. “Understanding the Efficiency of Ray Traversal on GPUs. “ High-Performance Graphics 2009.