Michael Doggett Department of Computer Science Lund University Overview

Graphics Architectures and OpenCL Michael Doggett Department of Computer Science Lund university Overview • Parallelism • Radeon 5870 • Tiled Graphics Architectures • Important when Memory and Bandwidth limited • Different to Tiled Rasterization! • Tessellation • OpenCL • Programming the GPU without Graphics © mmxii mcd “Only 10% of our pixels require lots of samples for soft shadows, but determining which 10% is slower than always doing the samples. ” by ID_AA_Carmack, Twitter, 111017 Parallelism • GPUs do a lot of work in parallel • Pipelining, SIMD and MIMD • What work are they doing? © mmxii mcd What’s running on 16 unifed shaders? 128 fragments in parallel 16 cores = 128 ALUs MIMD 16 simultaneous instruction streams 5 Slide courtesy Kayvon Fatahalian vertices/fragments primitives 128 [ OpenCL work items ] in parallel CUDA threads vertices primitives fragments 6 Slide courtesy Kayvon Fatahalian Unified Shader Architecture • Let’s take the ATI Radeon 5870 • From 2009 • What are the components of the modern graphics hardware pipeline? © mmxii mcd Unified Shader Architecture Grouper Rasterizer Unified shader Texture Z & Alpha FrameBuffer © mmxii mcd Unified Shader Architecture Grouper Rasterizer Unified Shader Texture Z & Alpha FrameBuffer ATI Radeon 5870 © mmxii mcd Shader Inputs Vertex Grouper Tessellator Geometry Grouper Rasterizer Unified Shader Thread Scheduler 64 KB GDS SIMD 32 KB LDS Shader Processor 5 32bit FP MulAdd Texture Sampler Texture 16 Shader Processors 256 KB GPRs 8 KB L1 Tex Cache 20 SIMD Engines Shader Export Z & Alpha 4 Z & Alpha Units 4 Memory Controllers 32 Z tests, 8 alpha blends Z & Color cache 128 KB L2 Tex Cache FrameBuffer ATI Radeon 5870 © mmxii mcd Tiled Based Architectures • There are two major styles of GPU architecture : • Traditional • One we work with and discuss • Dominant in Desktop (Nvidia, AMD, Intel) • Tiling based • Dominant in Mobile (ARM Mali, Qualcomm Adreno) • Different to tile based rasterisation © mmxv mcd Examples of Tiled Based Architectures • ARM Mali (Mobile) • Imagination Technologies • Kyro II (PC graphics) • Sega Dreamcast (console) • PowerVR (Mobile) • Apple GPU? • Other notable ones • Original Intel Larrabee used a Software Tiled based rasterizer • (became Xeon Phi) © mmxv mcd Tiled Based Rendering 1. 3. 5. • Screen is divided into Tiles (e.g. 32x32 pixels) 4. 6. • Run vertex shader to project 2. triangles to the screen 1. 2. 3. 4. 5. 6. • Create Tile Triangle Lists (or Bins) • Each list contains pointers to all triangles in the tile © mmxvii mcd Tiled Based Rendering 1. 2. 3. 4. 5. 6. • Take one tile and render all triangles in that tile list • Use 1 tile sized on-chip memory • After processing all triangles On-chip in the list memory • Copy tile memory, with Z 1. 3. 5. and Color, from on-chip Off-chip memory to main memory memory 2. 4. 6. © mmxvii mcd Tiled Based Rendering 1. 2. 3. 4. 5. 6. • Take one tile and render all triangles in that tile list • Use 1 tile sized on-chip memory • After processing all triangles On-chip in the list memory • Copy tile memory, with Z 1. 3. 5. and Color, from on-chip Off-chip memory to main memory memory 2. 4. 6. © mmxvii mcd Tiled Based Rendering 1. 2. 3. 4. 5. 6. • Take one tile and render all triangles in that tile list • Use 1 tile sized on-chip memory • After processing all triangles On-chip in the list memory • Copy tile memory, with Z 1. 3. 5. and Color, from on-chip Off-chip memory to main memory memory 2. 4. 6. © mmxvii mcd Tiled Based Rendering 1. 2. 3. 4. 5. 6. • Take one tile and render all triangles in that tile list • Use 1 tile sized on-chip memory • After processing all triangles On-chip in the list memory • Copy tile memory, with Z 1. 3. 5. and Color, from on-chip Off-chip memory to main memory memory 2. 4. 6. © mmxvii mcd Tiled Based Rendering 1. 2. 3. 4. 5. 6. • Take one tile and render all triangles in that tile list • Use 1 tile sized on-chip memory • After processing all triangles On-chip in the list memory • Copy tile memory, with Z 1. 3. 5. and Color, from on-chip Off-chip memory to main memory memory 2. 4. 6. © mmxvii mcd Tiled Based Rendering 1. 2. 3. 4. 5. 6. • Take one tile and render all triangles in that tile list • Use 1 tile sized on-chip memory • After processing all triangles On-chip in the list memory • Copy tile memory, with Z 1. 3. 5. and Color, from on-chip Off-chip memory to main memory memory 2. 4. 6. © mmxvii mcd Mali Bifrost GPU • Hierarchical tiling unit - 16x16, 32x32, 64x64 bins • Switch to TLP quad execution design • Better for compute © mmxvii mcd image courtesy Jem Davies, “Bifrost GPU”, 2016, © ARM Mali-G71 shader core design © mmxvii mcd image courtesy Jem Davies, “Bifrost GPU”, 2016, © ARM Tiling : Disadvantages • More on chip memory for sorting triangles into tiles • Increased bandwidth for triangle sorting • State changes now happen for every tile (can take a lot of time) • Triangles are processed multiple times (can be solved with Mali hierarchical tiling) • On chip memory limits places hard limits on number of triangles • Overflow causes spilling and creates a big hit on performance • Too much geometry will flush whole pipeline (ParameterBuffer overflow) © mmxvii mcd Tiling : Advantages • Easily parallelizable • Allows access to frame buffer in a small high speed memory • Z, Color accesses close to free • MSAA is almost free (5-10% of rendering time) • Alpha Blending is significantly cheaper © mmxvii mcd Comparing Tiling to the Traditional/Straight Pipeline • Advantages • Less complex pipeline • More predictable performance • Z Framebuffer compression techniques reduce bandwidth • State changes once for frame (but too many still a problem) • Disadvantages • Impossible to have frame buffer in on chip memory • Accessing the frame buffer cost power and latency © mmxvii mcd Expanding the graphics pipeline • Vertex shaders & Pixel shaders • But wait there’s more !!! • Geometry Shaders • Triangle in, many triangles out • Like a triangle shader • Tessellation (in OpenGL terminology) • Control Shader • Tessellator • Evaluation Shader © mmxii mcd AV0 Geometry Shaders AV2 • Input and output can be points, lines, triangles AV1 Vertex shader • But lines and triangles can also have adjacency information (3 for triangles, 2 for lines) • Can think of it as the ‘Triangle shader’ Geometry shader • 1:n ratio of input:output • 1 primitive in, many (or no) primitives out Must set a max_vertices for number of outputs Rasterization • • Number of vertices, number of output components • Enables storing processed vertices into multiple buffers Pixel shader • Output lines and triangles are strips • If you want individual, you need EndPrimitive Layered rendering Z & Alpha • • Send primitives to specific layers of framebuffer • Good for rendering cube-based shadow mapping and environment FrameBuffer maps DirectX 10, OpenGL 3.0 feature © mmxii mcd • Geometry Shader Applications • Single pass render to cube map • Using gl_Layer • Point sprite expansion/generation • Fur/Fin generation • Shadow volume extrusion • Tessellation? © mmxv mcd Tessellation • Difference between games and film is geometric detail • Film uses higher order surfaces (HOS) • Subdivision surfaces • Bezier patches, NURBs • Tessellation converts HOS into triangles © mmxii mcd © Disney/Pixar. Image from Nießner, M., et al. “Feature-adaptive GPU rendering of Catmull-Clark subdivision surfaces”, TOG January 2012 Vertex shader Tessellation Control shader Tessellator • Benefits Evaluation shader • Higher order surfaces such as subdivision surfaces Higher detailed surfaces using displaced Geometry shader • mapping View-dependent levels-of-detail Rasterization • • Input is quad or triangle patches Pixel shader • DirectX 11 (Hull and Domain shader), OpenGL 4.0 feature Z & Alpha • On mobile, Android Extension Pack FrameBuffer © mmxii mcd Tessellation • Control shader • Transforms up to 32 control points of a patch (typically 16) • Calculates tessellation factor for patch edges in tessellator • Tessellator • Outputs UV coordinates • Evaluation shader • Calculates the vertex position of the output of tessellator in the patch © mmxiiImages mcd courtesy Charles Loop, “Hardware Subdivision and Tessellation of Catmull-Clark Surfaces”, GTC2010 © mmxiiImages mcd courtesy Charles Loop, “Hardware Subdivision and Tessellation of Catmull-Clark Surfaces”, GTC2010 displacement mapping © mmxiiImages mcd courtesy Charles Loop, “Hardware Subdivision and Tessellation of Catmull-Clark Surfaces”, GTC2010 OpenCL • OpenCL - Open Compute Language • Application Programming Interface (API) • Runs on GPUs • Write programs (shaders) for GPU using OpenCL • Industry standard between multiple companies © mmxii mcd GPU as a parallel processor • GPU parallelism increased significantly during the last decade • Roughly 1000x more pixels/sec in 10 years • People started to use this for more than just graphics • How to enable using the GPU as a parallel processor? • How could we change OpenGL? © mmxii mcd Designing OpenCL Grouper Rasterizer Unified shader • Take out all the graphics • Triangles, textures, Z testing, alpha blending Z & Alpha • Keep simple small programs (graphics shaders) • Compute kernels FrameBuffer • Use the C programming model for kernels (ANSI C99) • Low-level abstraction • High-performance, but device independent • CPUs are also becoming parallel so OpenCL works there too © mmxii mcd Designing OpenCL Scheduler Unified shader • Take out all the graphics • Triangles, textures, Z testing, alpha blending FrameBuffer • Keep simple small programs (graphics shaders) • Compute kernels

Michael Doggett Department of Computer Science Lund University Overview

Download Speeds Performance and Optimized for Long Battery Life

GPU Developments 2018

An Evolution of Mobile Graphics

POWERVR 3D Application Development Recommendations

Rasterization & the Graphics Pipeline 15.1 Rasterization Basics

Powervr Graphics - Latest Developments and Future Plans

Amd Mobile Gpu

Case 6:19-Cv-00474-ADA Document 46 Filed 05/05/20 Page 1 of 208

Arxiv:1910.06663V1 [Cs.PF] 15 Oct 2019

Real-Time Ray Traced Ambient Occlusion and Animation Image Quality and Performance of Hardware- Accelerated Ray Traced Ambient Occlusion

3/4-Discrete Optimal Transport

Difi: Fast Distance Field Computation Using Graphics Hardware