GPU architecture II: Scheduling the graphics pipeline
Mike Houston, AMD / Stanford
Aaron Lefohn, Intel / University of Washington
Winter 2011 – Beyond Programmable Shading

Notes
• The front half of this talk is almost verbatim from "Keeping Many Cores Busy: Scheduling the Graphics Pipeline" by Jonathan Ragan-Kelley, from SIGGRAPH Beyond Programmable Shading, 2010.
• I'll be adding my own twists based on my experience working on some of the hardware in the talk.
This talk
• How to think about scheduling GPU-style pipelines
– Four constraints which drive scheduling decisions
• Examples of these concepts in real GPU designs
• Goals
– Know why GPUs and APIs impose the constraints they do.
– Develop intuition for what they can do well.
– Understand key patterns for building your own pipelines.
• Deep dive into the ATI Radeon™ HD 5800 Series ("Cypress") architecture
– What questions do you have?
First, a definition
Scheduling [n.]:
Assigning computations and data to resources in space and time.
The workload: Direct3D
[Figure: the Direct3D logical pipeline — IA → VS → PA → HS → Tess → DS → PA → GS → Rast → PS → Blend — with data flowing between stages. Legend: fixed-function stage vs. programmable stage.]
The machine: a modern GPU
[Figure: a modern GPU — an array of shader cores, each with texture units, plus fixed-function logic: input assembler, primitive assembler, rasterizer, output blend, and a task distributor providing fixed-function control.]
Scheduling a draw call as a series of tasks
[Figure: one draw call as a series of tasks — IA → VS → PA → Rast → three PS tasks → three Blend tasks — laid out over time across the shader cores and fixed-function units.]
An efficient schedule keeps hardware busy
[Figure: an efficient schedule — IA, VS, PA, Rast, PS, and Blend tasks from many primitives interleaved across all cores so that every unit stays busy.]
Choosing which tasks to run when (and where)

• Resource constraints
– Tasks can only execute when there are sufficient resources for their computation and their data.
• Coherence
– Control coherence is essential to shader core efficiency.
– Data coherence is essential to memory and communication efficiency.
• Load balance
– Irregularity in execution time creates bubbles in the pipeline schedule.
• Ordering
– Graphics APIs define strict ordering semantics, which restrict possible schedules.
Resource constraints limit scheduling options
[Figure: IA and VS have run, PS tasks fill the shader cores, and the next task has nowhere to go.]
[Figure: the same schedule — with no free resources, the waiting task can never be placed. Deadlock?]
Key concept: pre-allocation of resources helps guarantee forward progress.
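The key concept above can be sketched as a toy scheduler. This is an illustrative model, not any real GPU's allocator; all names (`Pool`, `try_launch`) are hypothetical. The point is that a task reserves *everything* it will ever need before it launches, so a running task can never block waiting for an allocation.

```python
# Hypothetical sketch: launch a task only once all of its resources
# (registers AND an output slot) are reserved up front, so no running
# task ever stalls waiting for an allocation.

class Pool:
    def __init__(self, capacity):
        self.free = capacity

    def reserve(self, n):
        """Reserve n units without blocking; fail if they aren't free."""
        if self.free >= n:
            self.free -= n
            return True
        return False

    def release(self, n):
        self.free += n

def try_launch(task, regs, out_fifo):
    # Reserve registers and an output slot before launching. If either
    # reservation fails, roll back and wait outside the core -- never
    # hold one resource while waiting on another.
    if not regs.reserve(task["regs"]):
        return False
    if not out_fifo.reserve(1):
        regs.release(task["regs"])  # roll back to avoid a partial hold
        return False
    return True

regs = Pool(capacity=256)
out_fifo = Pool(capacity=2)
tasks = [{"id": i, "regs": 100} for i in range(4)]
launched = [t["id"] for t in tasks if try_launch(t, regs, out_fifo)]
print(launched)  # only the tasks that fully fit get launched
```

Tasks that cannot reserve their resources simply wait outside the machine; nothing half-launched can ever deadlock the cores.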
Coherence is a balancing act
Intrinsic tension between:
• Horizontal (control, fetch) coherence and vertical (producer-consumer) locality.
• Locality and load balance.
Graphics workloads are irregular
[Figure: a rasterizer feeding fragments to both SuperExpensiveShader() and TrivialShader().]
But: Shaders are optimized for regular, self-similar work. Imbalanced work creates bubbles in the task schedule.
Solution: Dynamically generating and aggregating tasks isolates irregularity and recaptures coherence. Redistributing tasks restores load balance.
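The generate/aggregate/redistribute idea can be made concrete with a small sketch (illustrative only; `redistribute` and its parameters are invented for this example): primitives emit wildly different fragment counts, fragments are packed into uniform batches, and batches go to the least-loaded core.

```python
# Illustrative sketch: irregular amplification (each primitive emits a
# different number of fragments) is tamed by aggregating fragments into
# regular batches and handing each batch to the least-loaded core.

from heapq import heappush, heappop

def redistribute(frags_per_prim, n_cores=4, batch=8):
    # 1. Generate: primitives emit irregular amounts of work.
    fragments = []
    for prim, n in enumerate(frags_per_prim):
        fragments += [prim] * n
    # 2. Aggregate: pack fragments into regular, self-similar batches.
    batches = [fragments[i:i + batch]
               for i in range(0, len(fragments), batch)]
    # 3. Redistribute: greedy assignment to the least-loaded core.
    heap = [(0, core) for core in range(n_cores)]
    loads = [0] * n_cores
    for b in batches:
        load, core = heappop(heap)
        loads[core] = load + len(b)
        heappush(heap, (loads[core], core))
    return loads

# One expensive primitive (64 fragments) among cheap ones -- yet the
# per-core loads come out nearly even.
print(redistribute([1, 64, 2, 30, 1]))
```

Without the aggregation step, the 64-fragment primitive would pin one core while the others idle; batching isolates the irregularity at one point in the pipeline.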
Redistribution after irregular amplification
[Figure: rasterization amplifies the draw call into many PS tasks, which are redistributed across the shader cores before the Blends.]
Key concept: manage irregularity by dynamically generating, aggregating, and redistributing tasks.

Ordering
Rule: all framebuffer updates must appear as though all triangles were drawn in strict sequential order.
Key concept: carefully structure task redistribution to maintain API ordering.
Building a real pipeline
Static tile scheduling
The simplest thing that could possibly work.
[Figure: one Vertex (front-end) core feeding four Pixel (back-end) cores, each owning a fixed screen tile.]

Multiple cores: 1 front-end, n back-ends. Exemplar: ARM Mali 400.
• Locality: captured within tiles
• Resource constraints: static = simple
• Ordering: single front-end; sequential processing within each tile

Exemplar: ARM Mali 400
The problem: load imbalance. One core is swamped while the others sit idle — there is only one task creation point and no dynamic task redistribution.
Exemplar: ARM Mali 400
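A minimal sketch of the static-tile idea (the tiling function and core mapping here are invented for illustration, not Mali's actual scheme): each pixel's tile, and therefore its core, is fixed at design time, so a draw call concentrated in one tile lands entirely on one core.

```python
# Minimal static tile scheduling model: the screen is carved into
# fixed tiles, each permanently owned by one pixel core. Per-core work
# is whatever lands in its tiles -- no redistribution.

def core_for_pixel(x, y, tile=16, cores=4, tiles_x=4):
    tile_id = (y // tile) * tiles_x + (x // tile)
    return tile_id % cores  # static assignment, fixed at design time

def per_core_load(fragments, cores=4):
    loads = [0] * cores
    for x, y in fragments:
        loads[core_for_pixel(x, y)] += 1
    return loads

# A draw call concentrated in one 16x16 corner: one core does all the
# work while the other three idle -- exactly the imbalance above.
hot_corner = [(x, y) for x in range(16) for y in range(16)]
print(per_core_load(hot_corner))
```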
Sort-last fragment shading
[Figure: one Vertex core and a rasterizer feeding a pool of Pixel cores.]

Redistribution restores fragment load balance. But how can we maintain order?
Exemplars: NVIDIA G80, ATI RV770
1. Preallocate outputs in FIFO order.
2. Complete shading asynchronously.
Exemplars: NVIDIA G80, ATI RV770 Winter 2011 – Beyond Programmable Shading 26 Sort-last fragment shading
Blend Pixel fragments in FIFO order
Pixel Output Vertex Rasterizer Blend Pixel
Pixel
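The three steps above amount to a reorder buffer. Here is a hedged sketch (the class and method names are illustrative, not a real driver API): slots are claimed in submission order, shading finishes in any order, and the blend stage drains strictly in order.

```python
# Sketch of sort-last ordering: output slots are preallocated in
# submission (FIFO) order, shading completes in any order, and the
# blend unit drains the buffer strictly in the original order.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.fifo = deque()   # slot IDs in submission order
        self.done = {}        # slot ID -> shaded result

    def preallocate(self, frag_id):
        self.fifo.append(frag_id)          # order fixed at dispatch

    def complete(self, frag_id, result):
        self.done[frag_id] = result        # may arrive out of order

    def drain(self):
        """Blend every fragment whose turn has come, in FIFO order."""
        blended = []
        while self.fifo and self.fifo[0] in self.done:
            fid = self.fifo.popleft()
            blended.append(self.done.pop(fid))
        return blended

rb = ReorderBuffer()
for fid in [0, 1, 2, 3]:
    rb.preallocate(fid)
for fid in [2, 0, 3, 1]:          # shading finishes out of order
    rb.complete(fid, f"frag{fid}")
print(rb.drain())                 # blends in the original order
```

Note that a completed fragment whose predecessors are still shading simply sits in the buffer; order is preserved without stalling the shader cores.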
Unified shaders
Solve load balance by time-multiplexing different stages onto shared processors according to load
[Figure: a pool of shader cores with texture units, shared by all stages; fixed-function input assembler, primitive assembler, rasterizer, output blend, and a task distributor.]
Exemplars: NVIDIA G80, ATI RV770

Unified shaders: time-multiplexing cores
[Figure: the IA → VS → PA → Rast → PS → Blend tasks of one draw call time-multiplexed across the shader cores.]

Exemplars: NVIDIA G80, ATI RV770

Prioritizing the logical pipeline
Stages are prioritized by pipeline position (higher = earlier):
IA 5, VS 4, PA 3, Rast 2, PS 1, Blend 0
Fixed-size queues provide the storage between stages.
Scheduling the pipeline
[Figure: two shader cores working through the stage queues over time — first the VS tasks run on both cores.]
When the highest-priority stage (VS) is stalled because its output queue is full, lower-priority PS tasks that are ready to run fill the cores instead.
Queue sizes and backpressure provide a natural knob for balancing horizontal batch coherence and producer-consumer locality.
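The priority-plus-backpressure rule can be sketched in a few lines (an illustrative model; the stage list, queue capacity, and `pick_stage` are invented for this example): run the highest-priority stage that has input *and* room in its output queue.

```python
# Illustrative sketch: stages in descending priority (IA highest),
# each fed by a fixed-size queue. The scheduler picks the
# highest-priority stage that has input AND room in its output
# queue -- full queues (backpressure) push work down the pipeline.

STAGES = ["IA", "VS", "PA", "Rast", "PS", "Blend"]  # priority order
QUEUE_CAP = 2

def pick_stage(queues):
    """queues[i] holds items waiting for stage i; returns a stage name."""
    for i, stage in enumerate(STAGES):
        has_input = len(queues[i]) > 0
        out_has_room = (i + 1 >= len(STAGES)
                        or len(queues[i + 1]) < QUEUE_CAP)
        if has_input and out_has_room:
            return stage
    return None  # nothing runnable

# VS has work, but every queue through the middle of the pipeline is
# full -- backpressure skips down to PS, which still has output room.
queues = [[], ["v0"], ["p0", "p1"], ["r0", "r1"], ["f0", "f1"], []]
print(pick_stage(queues))
```

Larger queues let more horizontally coherent batches accumulate before a stage runs; smaller queues force producer output to be consumed sooner, which is exactly the knob the slide describes.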
OptiX
Summary
Key concepts
• Think of scheduling the pipeline as mapping tasks onto cores.
• Preallocate resources before launching a task.
– Preallocation helps ensure forward progress and prevent deadlock.
• Graphics is irregular.
– Dynamically generating, aggregating, and redistributing tasks at irregular amplification points regains coherence and load balance.
• Order matters.
– Carefully structure task redistribution to maintain ordering.
Why don't we have dynamic resource allocation? (e.g. recursion, malloc() in shaders)
Static preallocation of resources guarantees forward progress.
Tasks which outgrow available resources can stall, causing deadlock.
Geometry shaders are slow because they allow dynamic amplification in shaders.

Pick your poison:
• Always stream through DRAM.
– Exemplar: ATI R600
– Smooth falloff for large amplification, but very slow for small amplification (DRAM latency).
• Scale down parallelism to fit.
– Exemplar: NVIDIA G80
– Fast for small amplification, but poor shader throughput (no parallelism) for large amplification.
Why isn't rasterization programmable?

Yes, partly because it is computationally intensive, but also:
It is highly irregular.
It must generate and aggregate regular output.
It must integrate with an order-preserving task redistribution mechanism.
Questions for the future
Can we relax the strict ordering requirements?
Can you build a generic scheduler for application-defined pipelines?
What application-specific information would a generic scheduler need to work well?
Starting points to learn more
• The next step: parallel primitive processing
– Eldridge et al. Pomegranate: A Fully Scalable Graphics Architecture. SIGGRAPH 2000.
– Tim Purcell. Fast Tessellated Rendering on Fermi GF100. Hot3D, HPG 2010.
• Scheduling cyclic graphs, in software, on current GPUs
– Parker et al. OptiX: A General Purpose Ray Tracing Engine. SIGGRAPH 2010.
• Details of the ARM Mali design
– Tom Olson. Mali-400 MP: A Scalable GPU for Mobile Devices. Hot3D, HPG 2010.
“Cypress” Deep Dive
ATI Radeon™ HD 5870 GPU Features
20 SIMD engines (MPMD)
– Each with 16 Stream Cores (SC)
– Each SC with 5 Processing Elements (PE) (1600 total)
– Each PE IEEE 754-2008 precision capable: denorm support, fma, flags, round modes
– Fast integer support
– Each with local data share memory: 32 kB shared low-latency memory, 32 banks with hardware conflict management, 32 integer atomic units
80 Read Address Probes
– 4 addresses per SIMD engine (32–128 bits data)
– 4 filter or convert logic per SIMD
Concurrent Global Memory access
– 32 SC read/write/integer-atomic accesses per clock
– Relaxed Unordered Memory Consistency Model
– On-chip 64 kB global data share
153 GB/sec GDDR5 memory interface

ATI Radeon™ HD 4870 → HD 5870 (difference):
– Area: 263 mm² → 334 mm² (1.27x)
– Transistors: 956 million → 2.15 billion (2.25x)
– Memory bandwidth: 115 GB/sec → 153 GB/sec (1.33x)
– L2–L1 read bandwidth: 512 bytes/clk → 512 bytes/clk (1x)
– L1 bandwidth: 640 bytes/clk → 1280 bytes/clk (2x)
– Vector GPR: 2.62 MB → 5.24 MB (2x)
– LDS memory: 160 kB → 640 kB (4x)
– LDS bandwidth: 640 bytes/clk → 2560 bytes/clk (4x)
– Concurrent threads: 15872 → 31744 (2x)
– Shader (ALU units): 800 → 1600 (2x)
– Board power*: idle 90 W → 27 W (0.3x); max 160 W → 188 W (1.17x)
Compute Aspects of Radeon™ HD 5870
• Stream Cores
• Local Data Share (LDS)
• SIMD Engine
• Load / Store / Atomic Data Access
• Dispatch / Indirect Dispatch
• Global Data Share (GDS)
Stream Core with Processing Elements (PE)
Each Stream Core unit includes:
• 4 PEs
– 4 independent SP or integer ops
– 2 DP adds, or dependent SP pairs
– 1 DP fma or mult, or SP dot4
• 1 Special Function PE
– 1 SP or integer operation
– SP or DP transcendental ops
• Operand prep logic
• General purpose registers
• Data forwarding and predication logic
Processing Element (PE) Precision Improvements
– FMA (Fused Multiply-Add): IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero, and full denormal support in hardware for SP and DP.
– MULADD instruction without truncation, enabling a MULieee followed by an ADDieee to be combined, with round and normalization after both the multiplication and the subsequent addition.
– IEEE rounding modes (round to nearest even, round toward +infinity, round toward –infinity, round toward zero) supported under program control anywhere in the shader. Double and single precision modes are controlled separately. Applies to all slots in a VLIW.
– Denormal programmable mode control for SP and DP independently. Separate control for input flush-to-zero and underflow flush-to-zero.
– FP conversion ops between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision.
– Exception detection in hardware for floating point numbers, with a software recording and reporting mechanism: inexact, underflow, overflow, division by zero, denormal, invalid operation.
– 64-bit transcendental approximation: hardware-based double precision approximation for reciprocal, reciprocal square root, and square root.
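Per-instruction rounding-mode control is hard to show with Python's binary floats (they are fixed to round-to-nearest-even), but the standard `decimal` module exposes the same four IEEE modes, which is enough to illustrate what mode control means for a single result. This is an analogy, not the GPU's arithmetic.

```python
# Illustration of IEEE rounding-mode control using Python's decimal
# module: the same division, computed under four different modes.
# (The GPU applies such modes to binary floats; decimal is just a
# convenient stand-in that lets us switch modes per operation.)

from decimal import Decimal, localcontext, ROUND_HALF_EVEN, \
    ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

def divide(mode, prec=4):
    """Compute 1/3 to `prec` significant digits under `mode`."""
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = mode
        return str(Decimal(1) / Decimal(3))

for mode in (ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN):
    print(mode, divide(mode))
```

Only round-toward-+infinity changes the last digit of 1/3 here; for other operands each mode produces its own result, which is why interval arithmetic and conformance tests need the modes to be program-controllable.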
Processing Element (PE) Improved IPC
– Co-issue of dependent ops in one VLIW instruction, with full IEEE intermediate rounding & normalization:
– Dot4 (A = A*B + C*D + E*F + G*H)
– Dual Dot2 (A = A*B + C*D; F = G*H + I*J)
– Dual dependent multiplies (A = A*B*C; F = G*H*I)
– Dual dependent adds (A = B+C+D; E = F+G+H)
– Dependent muladd (A = A*B+C + D*E; F = G*H+I + J*K)
– 24-bit integer MUL, MULADD (4 co-issue)
– Heavy use for integer thread-group address calculation
Processing Element (PE) New Integer Ops
• 32b operand Count Bits Set
• 64b operand Count Bits Set
• Insert Bit Field
• Extract Bit Field
• Find First Bit (high, low, signed high)
• Reverse Bits
• Extended integer math
– Integer add with carry
– Integer subtract with borrow
• 1-bit prefix sum on a 64b mask (useful for compaction)
• Shader-accessible 64-bit counter
• Uniform indexing of constants
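To pin down the semantics of a few of these ops, here are plain-Python equivalents (software sketches of what the hardware does in one instruction, with illustrative function names): population count, bit-field extract, and the 1-bit prefix sum over a lane mask that drives stream compaction, where each active lane learns its output index.

```python
# Software equivalents of three of the integer ops above.

def count_bits_set(x):
    """Population count of a 64-bit operand."""
    return bin(x & (2**64 - 1)).count("1")

def extract_bit_field(x, offset, width):
    """Extract `width` bits of x starting at bit `offset`."""
    return (x >> offset) & ((1 << width) - 1)

def prefix_sum_mask(mask, lanes=64):
    """Exclusive 1-bit prefix sum over a lane mask: for each lane,
    the number of set bits in lower lanes. An active lane uses this
    as its write index when compacting a sparse stream."""
    out, running = [], 0
    for lane in range(lanes):
        out.append(running)
        running += (mask >> lane) & 1
    return out

mask = 0b1011  # lanes 0, 1, 3 hold live data
idx = prefix_sum_mask(mask, lanes=4)
print(count_bits_set(mask), hex(extract_bit_field(0xABCD, 4, 8)), idx)
```

With `idx`, lanes 0, 1, 3 write to compacted slots 0, 1, 2 respectively, with no gaps; that is the compaction use case the slide mentions.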
Processing Element (PE) Special Ops
– Conversion ops
– FP32 to FP64 and FP64 to FP32 (with IEEE conversion rules)
– FP32 to FP16 and FP16 to FP32 (with IEEE conversion rules)
– FP32 to Int/UInt and UInt/Int to FP32
– Very fast 8-bit Sum of Absolute Differences (SAD)
– 4x1 SAD per lane, with 4x4 8-bit SAD in one VLIW
– Used for video encoding, computer vision
– Exposed via OpenCL extension
– Video ops
– 8-bit packed to float and float to 8-bit packed conversion ops
– 4× 8-bit pixel average (bilinear interpolation with programmable round)
– Arbitrary byte or bit extraction from 64 bits
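For reference, the 4x1 SAD op computes the following (a plain-Python rendering of the semantics, not the hardware path): the sum of absolute differences over four 8-bit pixels, which is the inner loop of block matching in video motion search.

```python
# What a 4x1 8-bit SAD instruction computes, written out in Python.

def sad4(a, b):
    """Sum of absolute differences over four 8-bit pixel values."""
    assert len(a) == len(b) == 4
    return sum(abs(x - y) for x, y in zip(a, b))

# Comparing a 4-pixel strip of a candidate block against a reference:
print(sad4([10, 20, 30, 40], [12, 18, 33, 40]))
```

A motion-search kernel evaluates this for many candidate offsets and keeps the one with the smallest SAD; doing four such rows (the 4x4 form) in one VLIW issue is what makes the op interesting for encoders.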
Local Data Share (LDS)

Share data between work items of a work group to increase performance.
• High bandwidth access per SIMD Engine
– Peak is double the external R/W bandwidth
– Full coalesce R/W/Atomic, with optimization for broadcast reads
• Low latency access per SIMD Engine
– 0-latency direct reads (conflict-free or broadcast)
– 1 VLIW instruction latency for an LDS indirect op
– All bank conflicts are hardware detected and serialized as necessary, with fast support for broadcast reads
• Hardware allocation of LDS space per thread group dispatch
– Base and size stored with the wavefront for private access
– Hardware range check: out-of-bound access attempts return 0 and kill writes
• 32 byte/ubyte/short/ushort reads/writes per clk (reads are sign extended)
• 32 dword accesses per clock
– Per-lane 32-bit Read, Read2, Write, Write2
– Per-lane 32-bit Atomic: add, sub, rsub, inc, dec, min, max, and, or, xor, mskor, exchange, exchange2, compare_swap(int), compare_swap(float w/ NaN/Inf)
– Return pre-op value to Stream Core
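The bank-conflict rule above can be modeled in a few lines (a simplified cost model with invented names, not the exact Cypress arbitration logic): 32 banks addressed by dword address mod 32, one cycle per distinct address within a bank, and same-address reads collapsing into a broadcast.

```python
# Simplified LDS bank-conflict model: 32 banks, bank = address mod 32.
# Lanes hitting distinct banks proceed together in one cycle; lanes
# hitting the same bank at DIFFERENT addresses serialize; lanes
# reading the SAME address are a broadcast (still one cycle).

from collections import defaultdict

def lds_cycles(addrs, banks=32):
    per_bank = defaultdict(set)        # bank -> distinct addresses
    for a in addrs:
        per_bank[a % banks].add(a)
    # Cost = the worst-case number of distinct addresses in one bank.
    return max(len(s) for s in per_bank.values())

print(lds_cycles(range(32)))        # conflict-free: one address per bank
print(lds_cycles([0] * 32))         # broadcast: same address everywhere
print(lds_cycles([0, 32, 64, 96]))  # 4-way conflict, all on bank 0
```

This is why LDS layouts often pad arrays so that strided accesses spread across banks rather than piling onto one.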
SIMD Engine
• A SIMD Engine can process wavefronts from multiple kernels concurrently
• Masking and/or branching is used to enable thread divergence within a wavefront
– Enabling each thread in a wavefront to traverse a unique program execution path
• Full hardware barrier support for up to 8 work groups per SIMD Engine (for thread data sharing)
• Each Stream Core receives up to the following per VLIW instruction issue:
– 5 unique ALU ops, or 4 unique ALU ops with an LDS op (up to 3 operands per thread)
• LDS and global memory access for byte, ubyte, short, ushort reads/writes supported at 32-bit dword rates
• Private loads and read-only texture reads via the Read Cache
• Unordered shared consistent loads/stores/atomics via the R/W Cache
• Wavefront length of 64 threads, where each thread executes a 5-way VLIW instruction each issue
– ¼ wavefront (16 threads) on each of 4 clocks (T0–15, T16–31, T32–47, T48–63)
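The masking mechanism for divergence can be sketched as follows (an illustrative model with invented names, not real ISA behavior): the wavefront executes both sides of a branch, and a per-lane execution mask gates which side's writes commit for each lane.

```python
# Sketch of wavefront divergence via execution masking: all lanes step
# through both sides of a branch; a per-lane mask selects which side's
# writes actually commit, so every lane follows its own logical path.

def run_wavefront(values):
    mask = [v % 2 == 0 for v in values]   # per-lane branch condition
    out = list(values)
    # "then" side issues for the whole wavefront; the mask gates writes
    for i, live in enumerate(mask):
        if live:
            out[i] = values[i] // 2       # even lanes: halve
    # "else" side issues with the inverted mask
    for i, live in enumerate(mask):
        if not live:
            out[i] = values[i] * 3 + 1    # odd lanes: 3n + 1
    return out

print(run_wavefront([4, 7, 10, 5]))  # [2, 22, 5, 16]
```

The cost model falls out directly: a fully divergent branch pays for both sides, which is why control coherence within a wavefront matters so much for shader core efficiency.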