Lecture 14: Scheduling the Graphics Pipeline on a GPU

Visual Computing Systems
Stanford CS348K, Spring 2020

Today

▪ Real-time 3D graphics workload metrics

▪ Scheduling the graphics pipeline on a modern GPU

Quick Review

Real-time graphics pipeline architecture

[Diagram: pipeline stages, with memory buffers (system state) feeding each stage]

Vertex data buffers → Vertex Generation → vertex stream
Buffers, textures → Vertex Processing → vertex stream
Primitive Generation → primitive stream
Buffers, textures → Primitive Processing → primitive stream
Fragment Generation (Rasterization) → fragment stream
Buffers, textures → Fragment Processing → fragment stream
Pixel Operations → output image buffer

Programming the graphics pipeline

▪ Issue draw commands → output image contents change

Type          Command
State change  Bind shaders, textures, uniforms
Draw          Draw using vertex buffer for object 1
State change  Bind new uniforms
Draw          Draw using vertex buffer for object 2
State change  Bind new state
Draw          Draw using vertex buffer for object 3
State change  Change depth test function
State change  Bind new shader
Draw          Draw using vertex buffer for object 4

Final rendered output should be consistent with the results of executing these commands in the order they are issued to the graphics pipeline.

GPU: heterogeneous parallel processor

[Diagram: many SIMD execution units with caches, alongside fixed-function hardware for texture, tessellation, clip/cull, rasterization, and Z-buffer/blend, all connected to GPU memory and fed by a scheduler / work distributor]

We're now going to talk about this scheduler.

Graphics workload metrics

Let's consider different workloads: average triangle size

Image credits:
https://www.theverge.com/2013/11/29/5155726/next-gen-supplementary-piece
http://www.mobygames.com/game/android/ghostbusters-slime-city/screenshots/gameShotId,852293/

Triangle size (data from 2010)

[Bar chart: percentage of total triangles vs. triangle area in pixels, binned [0-1], [1-5], [5-10], [10-20], ... [90-100], [>100]]

Low geometric detail

Credit: Pro Evolution Soccer 2010 (Konami)

Surface tessellation: procedurally generate fine triangle mesh from coarse mesh representation

[Slide shows the paper "Approximating Subdivision Surfaces with Gregory Patches for Hardware Tessellation" — Charles Loop (Microsoft Research), Scott Schaefer (Texas A&M University), Tianyun Ni and Ignacio Castaño (NVIDIA) — including its Figure 1 (input control mesh, generated parametric patches, displacement-mapped result) and Figure 2 (the Direct3D 11 graphics pipeline, with the new tessellator unit and programmable hull and domain shader stages). Images: coarse geometry vs. post-tessellation (fine) geometry. Image credit: Loop et al. 2009]

Graphics pipeline with tessellation

Five programmable stages in modern pipeline (OpenGL 4, Direct3D 11):

Vertex Generation → coarse vertices
Vertex Processing (1 in / 1 out) → coarse primitives
Coarse Primitive Processing (1 in / 1 out)
Tessellation (1 in / N out) → fine vertices
Fine Vertex Processing (1 in / 1 out)
Fine Primitive Generation (3 in / 1 out, for tris) → fine primitives
Fine Primitive Processing (1 in / small N out)
Rasterization / Fragment Generation (1 in / N out) → fragments
Fragment Processing (1 in / 1 out) → pixels
Frame-Buffer Ops (1 in / 0 or 1 out)

(For comparison, the slide also shows the pre-tessellation pipeline: Vertex Generation → Vertex Processing → Primitive Generation → Primitive Processing → Rasterization → Fragment Processing → Frame-Buffer Ops.)

Scene depth complexity

[Image credit: Imagination Technologies]

Rough approximation: T × A = S × D
  T = # triangles
  A = average triangle area
  S = pixels on screen
  D = average depth complexity

Amount of data generated (size of stream between consecutive stages)

[Diagram: the tessellation pipeline annotated with stream sizes]
- Input: compact geometric model (coarse vertices)
- After tessellation: high-resolution (post-tessellation) mesh
- "Diamond" structure of graphics workload: intermediate data streams tend to be larger than scene inputs or image output
- Output: frame-buffer pixels

Key 3D graphics workload metrics

▪ Data amplification from stage to stage
  - Average triangle size (amplification in rasterizer: 1 triangle → N pixels)
  - Expansion during primitive processing (if enabled)
  - Tessellation factor (if tessellation enabled)

▪ [Vertex/fragment/geometry] shader cost
  - How many instructions?
  - Ratio of math to data access instructions?

▪ Scene depth complexity
  - Determines number of depth and color buffer writes

Graphics pipeline workload changes dramatically across draw commands

▪ Triangle size is scene and frame dependent
  - Move far away from an object, triangles get smaller
  - Varies within a frame (characters are usually higher-resolution meshes than buildings)

▪ Varying complexity of materials, different numbers of lights illuminating surfaces
  - Tens to a few hundred instructions per shader

▪ Stages can be disabled
  - Depth-only rendering = NULL fragment shader
  - Post-processing effects = no vertex work

▪ Thousands of state changes and draw calls per frame

Example: rendering a "depth map" requires vertex but no fragment shading

Parallelizing the graphics pipeline

Adopted from slides by Kurt Akeley and Pat Hanrahan (Stanford CS448 Spring 2007)

GPU: heterogeneous parallel processor

[Diagram repeated: SIMD execution units with caches, plus fixed-function texture, tessellation, clip/cull, rasterize, and Z-buffer/blend hardware around the scheduler / work distributor and GPU memory]

We're now going to talk about this scheduler.

Reminder: requirements + workload challenges

▪ Pipeline accepts sequence of commands
  - Draw commands
  - State modification commands

▪ Processing commands has sequential semantics
  - Effects of command A must be visible before those of command B

▪ Relative cost of pipeline stages changes frequently and unpredictably (e.g., due to changing triangle size, rendering mode)

▪ Ample opportunities for parallelism
  - Many triangles, vertices, fragments, etc.

Simplified pipeline

Application

Geometry Processing
(For now: just consider all work, i.e., vertex/primitive processing, tessellation, etc., as "geometry" processing. I'm drawing the pipeline this way to match the suggested readings under this lecture.)

Rasterization

Fragment Processing

Frame-Buffer Ops

Output image

Simple parallelization (pipeline parallelism)

Separate hardware unit is responsible for executing the work in each stage.

What is my maximum speedup? (At most the number of pipeline stages; in practice, throughput is limited by the slowest stage.)

Application

Geometry Processing

Rasterization

Fragment Processing

Frame-Buffer Ops

Output image

A cartoon GPU: assume we have four separate processing pipelines. Leverages the data-parallelism present in the rendering computation.

Application

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

More realistic GPU

▪ A set of programmable cores (run vertex and fragment shader programs)
▪ Hardware for rasterization, texture mapping, and frame-buffer access

Rasterizer Rasterizer Rasterizer Rasterizer Depth Test Depth Test Depth Test Depth Test Shader Shader Shader Shader Processor Core Processor Core Processor Core Processor Core

Data Cache Texture Data Cache Texture Data Cache Texture Data Cache Texture

Render Target Blend Render Target Blend Render Target Blend Render Target Blend

Molnar's "sorting" taxonomy: implementations characterized by where communication occurs in the pipeline

Sort first: communication between Application and Geometry Processing
Sort middle: between Geometry Processing and Rasterization
Sort last fragment: between Fragment Processing and Frame-Buffer Ops
Sort last image composition: after Frame-Buffer Ops, before the output image

[Diagram: four replicated pipelines (Geometry Processing, Rasterization, Fragment Processing, Frame-Buffer Ops) between the Application and the output image, with the four sort points labeled]

Note: The term "sort" can be misleading for some. It may be helpful to instead use the term "distribution" rather than sort: the implementations are characterized by how and when they redistribute work onto processors. *

* The origin of the term "sort" is from "A Characterization of Ten Hidden-Surface Algorithms," Sutherland et al. 1974.

Sort first

Sort first

Application Sort!

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

Assign each replicated pipeline responsibility for a region of the output image. Do a minimal amount of work (compute screen-space vertex positions of each triangle) to determine which region(s) each input primitive overlaps.

Sort first work partitioning (partition the primitives to parallel units based on screen overlap)

[Screen divided into four regions:]
  1 | 2
  3 | 4
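The partitioning step can be sketched as follows. This is a minimal sketch: it assumes a 2×2 region layout matching the picture, and it approximates triangle/region overlap with the triangle's screen-space bounding box (a conservative test):

```python
# Sort-first sketch: compute only the screen-space bounding box of each
# triangle, then assign it to every pipeline whose region it overlaps.
def regions_overlapped(tri_xy, screen_w, screen_h):
    """tri_xy: three (x, y) screen-space vertices. Returns the set of
    region ids for a 2x2 tiling numbered 1 2 / 3 4 as on the slide."""
    xs = [v[0] for v in tri_xy]
    ys = [v[1] for v in tri_xy]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    half_w, half_h = screen_w / 2, screen_h / 2
    regions = set()
    corners = [(0, 0), (half_w, 0), (0, half_h), (half_w, half_h)]
    for r, (rx, ry) in enumerate(corners, start=1):
        # conservative test: bounding boxes overlap
        if (min_x < rx + half_w and max_x > rx and
                min_y < ry + half_h and max_y > ry):
            regions.add(r)
    return regions

# A triangle straddling the vertical midline of a 100x100 screen goes
# to both top regions (duplicated geometry work -- the "bad" case later):
print(regions_overlapped([(40, 10), (60, 10), (50, 30)], 100, 100))  # {1, 2}
```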

Sort first

Application

Sort!

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

▪ Good:
  - Simple parallelization: just replicate rendering pipeline and operate independently in screen regions (order maintained in each)
  - More parallelism = more performance
  - Small amount of sync/communication (communicate original triangles)
  - Early fine occlusion cull ("early z") just as easy as in a single pipeline

Sort first

Application

Sort!

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

▪ Bad:
  - Potential for workload imbalance (one part of screen contains most of scene)
  - Extra cost of triangle "pre-transformation" (needed to sort)
  - "Tile spread": as screen tiles get smaller, primitives cover more tiles (duplicate geometry processing across multiple parallel pipelines)

Sort first examples

▪ WireGL/Chromium* (parallel rendering with a cluster of GPUs)
  - "Front-end" node sorts primitives to machines
  - Each GPU is a full rendering pipeline (responsible for part of screen)

▪ Initial parallel versions of Pixar's RenderMan
  - Multi-core software renderer
  - Sort surfaces into screen tiles prior to tessellation

* Chromium can also be configured as a sort-last image composition system

Sort middle

Sort middle

Application

Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Sort!

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

Distribute primitives to pipelines (e.g., round-robin distribution). Assign each rasterizer a region of the render target. Sort after geometry processing, based on the screen-space projection of primitive vertices.

Interleaved mapping of screen

▪ Decreases the chance of one rasterizer processing most of the scene
▪ Most triangles overlap multiple screen regions (often overlap all)

Interleaved mapping (regions alternate between rasterizers):
  1 2 1 2
  2 1 2 1

Tiled mapping: each rasterizer owns one contiguous region of the screen
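The two static mappings can be contrasted in a few lines. This sketch assumes four rasterizers (0-3) rather than the slide's two, and made-up tile-grid sizes; the point is how many distinct owners a small cluster of screen tiles touches:

```python
# Two static assignments of screen tiles to rasterizers.
def interleaved_owner(tile_x, tile_y, n=2):
    # n x n pattern repeated across the screen: adjacent tiles go to
    # different rasterizers, so no one rasterizer owns a hot region
    return (tile_x % n) + (tile_y % n) * n

def tiled_owner(tile_x, tile_y, tiles_w, tiles_h, n=2):
    # contiguous quadrants: each rasterizer owns one block of the screen
    return (min(tile_x * n // tiles_w, n - 1)
            + min(tile_y * n // tiles_h, n - 1) * n)

# A small 2x2 cluster of tiles (e.g., one medium triangle) touches all
# four rasterizers under interleaving (good balance, but a broadcast),
# yet only one owner under the tiled mapping (little communication,
# but possible imbalance):
cluster = [(x, y) for x in range(2) for y in range(2)]
print({interleaved_owner(x, y) for x, y in cluster})  # {0, 1, 2, 3}
print({tiled_owner(x, y, 8, 8) for x, y in cluster})  # {0}
```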

Fragment interleaving in NVIDIA Fermi

Fine granularity interleaving Coarse granularity interleaving

Question 1: what are the benefts/weaknesses of each interleaving? Question 2: notice anything interesting about these patterns?

[Image source: NVIDIA]

Sort middle interleaved

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Sort! - BROADCAST

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

▪ Good:
  - Workload balance: both for geometry work AND onto rasterizers (due to interleaving)
  - Does not duplicate geometry processing for each overlapped screen region

Sort middle interleaved

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Sort! - BROADCAST

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

▪ Bad:
  - Bandwidth scaling: sort is implemented as a broadcast (each triangle goes to many/all rasterizers because of the interleaved screen mapping)
  - If tessellation is enabled, must communicate many more primitives than sort first
  - Duplicated per-triangle work across rasterizers

SGI RealityEngine [Akeley 93]: a sort-middle interleaved design

Sort-middle tiled (a.k.a. "chunking", "bucketing", "binning")

Step 1: sort triangles into bins
▪ Divide screen into tiles, one triangle list per "tile" of screen (called a "bin")
▪ Core runs vertex processing, computes 2D triangle/screen-tile overlap, inserts triangle into appropriate bin(s)

List of scene triangles

Core 1 Core 2 Core 3 Core 4

After processing the first five triangles (screen divided into a grid of 12 bins; triangles 1-3 fall in bin 1's tile, triangle 4 straddles bins 1 and 2, triangle 5 falls in bin 2's tile):

Bin 1 list: 1, 2, 3, 4
Bin 2 list: 4, 5
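Step 1 can be sketched as a bucketing pass over the triangle list. The tile size is an assumption of this sketch, and overlap is approximated by the triangle's 2D bounding box (conservative: a triangle may be inserted into a bin it does not actually cover):

```python
from collections import defaultdict

def bin_triangles(triangles, screen_w, screen_h, tile=16):
    """triangles: list of three (x, y) screen-space vertex tuples.
    Returns {bin_index: [triangle ids]}, bins numbered row-major."""
    tiles_w = (screen_w + tile - 1) // tile
    bins = defaultdict(list)
    for tri_id, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # conservative overlap test: the triangle's 2D bounding box
        x0, x1 = int(min(xs)) // tile, int(max(xs)) // tile
        y0, y1 = int(min(ys)) // tile, int(max(ys)) // tile
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * tiles_w + tx].append(tri_id)
    return bins

# A small triangle lands in one bin; a larger one is inserted into
# every bin its bounding box touches:
bins = bin_triangles([[(2, 2), (10, 2), (6, 10)],
                      [(2, 2), (30, 2), (16, 30)]], 32, 32)
print(dict(bins))  # {0: [0, 1], 1: [1], 2: [1], 3: [1]}
```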

Sort-middle interleaved vs. binning

Processor 1 | Processor 2
Processor 3 | Processor 4

Interleaved (static) assignment of screen tiles to processors:
  0 1 2 3 0 1
  2 3 0 1 2 3
  0 1 2 3 0 1
  2 3 0 1 2 3

Assignment to bins (B0 ... B23): the list of bins is a work queue; bins are dynamically assigned to processors.

Step 2: per-tile processing

▪ Cores process bins in parallel, performing rasterization, fragment shading, and frame-buffer update

▪ While there are more bins to process:
  - Assign bin to available core
  - For all triangles in the bin's list:
    - Rasterize
    - Fragment shade
    - Depth test
    - Update frame buffer
  - Write final pixels for the N×N tile of the render target

[Diagram: a processor core with rasterizer, depth test, shader, data cache, texture, and render-target blend units, consuming the bin's triangle list]

Reminder: reading less data conserves power

▪ Goal: redesign algorithms so that they make good use of on-chip memory or processor caches
  - And therefore transfer less data from memory

▪ A fact you might not have heard: it is far more costly (in energy) to load/store data from memory than it is to perform an arithmetic operation

"Ballpark" numbers [Sources: Bill Dally (NVIDIA), Tom Olson (ARM)]
  - Integer op: ~1 pJ *
  - Floating point op: ~20 pJ *
  - Reading 64 bits from small local SRAM (1 mm away on chip): ~26 pJ
  - Reading 64 bits from low-power mobile DRAM (LPDDR): ~1200 pJ

Implication: reading 10 GB/sec from memory costs ~1.6 watts

* Cost to just perform the logical operation, not counting the overhead of instruction decode, loading data from registers, etc.

What should the screen size of the bins be?

▪ Small enough for a tile of the color buffer and depth buffer (potentially supersampled) to fit in a shader processor core's on-chip storage (i.e., cache)

▪ Tile sizes in range 16x16 to 64x64 pixels are common

▪ ARM Mali GPU: commonly uses 16x16 pixel tiles

Binning "sorts" the scene in 2D space to enable efficient color/depth buffer access

Consider rendering without a sort (process triangles in order given): this sample is updated three times, but may have fallen out of cache in between accesses.

[Diagram: triangles 1-8 scattered across the screen, several overlapping the same sample]

Now consider step 2 of a tiled renderer:

  initialize Z and color buffer for tile
  for each triangle in tile:
      for each fragment:
          shade fragment
          update depth/color
  write color tile to final image buffer

Q. Why doesn't the renderer need to read the color or depth buffer from memory?
Q. Why doesn't the renderer need to write the depth buffer to memory? *

* Assuming the application does not need the depth buffer for other purposes.

Tile-based deferred rendering (TBDR)

▪ Many mobile GPUs implement deferred shading in the hardware!
▪ Divide step 2 of the tiled pipeline into two phases:
  - Phase 1: compute what fragment is visible at every sample
  - Phase 2: perform shading of only the visible quad fragments

[Diagram: triangles T1-T4 overlapping a tile's samples 1-8; after phase 1, each sample records the triangle visible at it, e.g., T1 visible at samples 1-3]
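A minimal sketch of the two phases for one tile. The fragment representation and shading interface here are assumptions of the sketch; the point is that each sample is shaded at most once, regardless of depth complexity:

```python
def tbdr_tile(fragments, num_samples, shade):
    """fragments: iterable of (sample_index, depth, triangle_id) produced
    by rasterizing the tile's triangles. shade(tri, sample) -> color."""
    # Phase 1: depth-only visibility pass over all fragments
    depth = [float("inf")] * num_samples
    visible = [None] * num_samples
    for sample, z, tri in fragments:
        if z < depth[sample]:
            depth[sample] = z
            visible[sample] = tri
    # Phase 2: shade only the surviving (visible) fragment per sample
    return [None if tri is None else shade(tri, s)
            for s, tri in enumerate(visible)]

# Three triangles cover sample 0, but shade() runs only once for it:
calls = []
frags = [(0, 0.9, "T1"), (0, 0.5, "T2"), (0, 0.7, "T3"), (1, 0.3, "T1")]
colors = tbdr_tile(frags, 2, lambda tri, s: calls.append(tri) or tri)
print(colors, len(calls))  # ['T2', 'T1'] 2
```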

Sort middle tiled (chunked)

▪ Good:
  - Good load balance (distribute many buckets onto rasterizers)
  - Low bandwidth requirements (why? when?)
  - Challenge: the "bucketing" sort has low contention (assuming each triangle only touches a small number of buckets), but there still is contention

▪ Recent examples:
  - Many mobile GPUs: Imagination PowerVR, ARM Mali, Qualcomm
  - Parallel software rasterizers:
    - Intel Larrabee software rasterizer
    - NVIDIA CUDA software rasterizer

Sort last

Sort last fragment

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Sort! - point-to-point

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

Distribute primitives to the top of the pipelines (e.g., round robin). Sort after fragment processing, based on the (x, y) position of each fragment.

Sort last fragment

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Sort! - point-to-point

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

▪ Good:
  - No redundant geometry processing or rasterization (but early z-cull is a problem)
  - Point-to-point communication during sort
  - Interleaved pixel mapping results in good workload balance for frame-buffer ops

Sort last fragment

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Sort! - point-to-point

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

▪ Bad:
  - Pipelines may stall due to primitives of varying size (due to the ordering requirement)
  - Bandwidth scaling: many more fragments than triangles
  - Hard to implement early occlusion cull (more bandwidth challenges)

Sort last image composition

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

frame buffer 0 | frame buffer 1 | frame buffer 2 | frame buffer 3

Merge Output image

Each pipeline renders some fraction of the geometry in the scene. Combine the color buffers, according to depth, into the final image.
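The merge step can be sketched as a per-pixel depth compare over the N color/depth buffer pairs. The data layout here is an assumption of the sketch, and note (anticipating the next slide) that this merge cannot honor order-dependent operations such as alpha blending:

```python
# Sort-last image composition: keep, per pixel, the color whose depth
# is nearest across all pipelines' full-screen buffers.
def depth_composite(buffers):
    """buffers: list of (color, depth) list pairs, all the same length."""
    n = len(buffers[0][0])
    out_color = [None] * n
    out_depth = [float("inf")] * n
    for color, depth in buffers:
        for i in range(n):
            if depth[i] < out_depth[i]:
                out_depth[i] = depth[i]
                out_color[i] = color[i]
    return out_color

# Pipeline 0 wins pixel 0 (nearer depth); pipeline 1 wins pixel 1:
print(depth_composite([(["red", "red"], [0.5, 0.9]),
                       (["blue", "blue"], [0.7, 0.2])]))  # ['red', 'blue']
```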

Sort last image composition

▪ Breaks the graphics pipeline architecture's abstraction: cannot maintain the pipeline's sequential semantics

▪ Simple implementation: N separate rendering pipelines
  - Can use off-the-shelf GPUs to build a massive rendering system
  - Coarse-grained communication (image buffers)

▪ Similar load imbalance problems as sort-last fragment

▪ Under high depth complexity, the bandwidth requirement is lower than sort-last fragment
  - Communicate final pixels, not all fragments

Recall: modern OpenGL 4 / Direct3D 11 pipeline (five programmable stages)

Vertex Generation → coarse vertices
Vertex Processing (1 in / 1 out) → coarse primitives
Coarse Primitive Processing (1 in / 1 out)
Tessellation (1 in / N out) → fine vertices
Fine Vertex Processing (1 in / 1 out)
Fine Primitive Generation (3 in / 1 out, for tris) → fine primitives
Fine Primitive Processing (1 in / small N out)
Rasterization / Fragment Generation (1 in / N out) → fragments
Fragment Processing (1 in / 1 out) → pixels
Frame-Buffer Ops (1 in / 0 or 1 out)

Modern GPU: programmable parts of the pipeline virtualized on a pool of programmable cores

[Diagram: a command processor / vertex generation unit feeding a work distributor/scheduler; a pool of programmable cores shared over a high-speed interconnect with fixed-function texture, tessellation, rasterizer, and frame-buffer-ops units; inter-stage vertex, primitive, and fragment queues between them]

Hardware is a heterogeneous collection of resources (programmable and non-programmable).
Programmable resources are time-shared by vertex/primitive/fragment processing work.
Must keep programmable cores busy: sort everywhere.
Hardware work distributor assigns work to cores (based on contents of inter-stage queues).

Sort everywhere (how modern high-end GPUs are scheduled)

[Eldridge 00]

Sort everywhere

Application Distribute

Geometry Processing Geometry Processing Geometry Processing Geometry Processing

Redistribute! - point-to-point

Rasterization Rasterization Rasterization Rasterization

Fragment Processing Fragment Processing Fragment Processing Fragment Processing

Sort! - point-to-point

Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops Frame-Buffer Ops

Output image

Distribute primitives to the top of the pipelines. Redistribute after geometry processing (e.g., round robin). Sort after fragment processing, based on the (x, y) position of each fragment.

Implementing sort everywhere

(Challenge: rebalancing work at multiple places in the graphics pipeline to achieve efficient parallel execution, while maintaining triangle draw order)
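The ordering mechanism the following walkthrough builds up can be sketched as a token protocol (the names, queue shapes, and round-robin batching here are assumptions of the sketch): a rasterizer appends a "switch" token after each batch, and a frame-buffer unit advances to the next rasterizer's queue only when it sees that token, so fragments retire in triangle draw order even though rasterizers run in parallel:

```python
from collections import deque

SWITCH = object()   # token: "this rasterizer's batch is done; move on"

def framebuffer_consume(queues):
    """queues: one deque per rasterizer, holding fragments and SWITCH
    tokens. Yields fragments in draw order; stalls (returns) rather
    than skipping ahead if the current queue is empty."""
    current = 0
    while True:
        q = queues[current]
        if not q:
            return                      # must wait for the current queue
        item = q.popleft()
        if item is SWITCH:
            current = (current + 1) % len(queues)
        else:
            yield item

# Rast 0 got T1 and T2 (batch of two), rast 1 got T3: even if rast 1
# finishes first, its fragments drain only after rast 0's batch.
q0 = deque(["T1 frag", "T2 frag", SWITCH])
q1 = deque(["T3 frag", SWITCH])
print(list(framebuffer_consume([q0, q1])))  # ['T1 frag', 'T2 frag', 'T3 frag']
```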

Starting state: draw commands enqueued for pipeline

Input: three triangles to draw (the fragments to be generated for each triangle by rasterization are shown below).

Command queue (oldest at bottom): Draw T3, Draw T2, Draw T1

[Diagram: a Geometry stage feeding Rasterizer 0 and Rasterizer 1, each feeding Frag Processing 0/1 and then Frame-buffer 0/1. T1 and T2 each produce four fragments; T3 produces three. Assume batch size is 2 for assignment to rasterizers. The render target is interleaved between frame-buffer 0 and frame-buffer 1 (0 1 / 1 0 checkerboard).]

After geometry processing, first two processed triangles assigned to rast 0

[Diagram state: the input queue holds Draw T3; Rasterizer 0's queue holds T1 and T2 (batch size is 2 for assignment to rasterizers).]

Assign next triangle to rast 1 (round-robin policy, batch size = 2). Q. What is the 'next' token for?

[Diagram state: Rasterizer 0's queue holds T1, T2, then a 'Next' token; Rasterizer 1's queue holds T3.]

Rast 0 and rast 1 can process T1 and T3 simultaneously (shaded fragments enqueued in frame-buffer unit input queues)

[Diagram state: 'Next' and T2 remain queued at Rasterizer 0; the frame-buffer input queues now hold T1's fragments (T1,1-T1,4) and T3's fragments (T3,1-T3,3), split between FB 0 and FB 1 by the interleaved pixel mapping.]

FB 0 and FB 1 can simultaneously process fragments from rast 0 (Notice updates to frame buffer)

Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Next T2

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,3 1 0 T1,1 T3,1 T3,2 T1,2 T1,3 T1,4 Interleaved Frame-buffer 0 Frame-buffer 1 render target

Fragments from T3 cannot be processed yet. Why?

Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Next T2

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,3 1 0 T1,1 T3,1 T3,2 T1,2 T1,3 T1,4 Interleaved Frame-buffer 0 Frame-buffer 1 render target

Rast 0 processes T2 (Shaded fragments enqueued in frame-buffer unit input queues)

Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Next

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T2,3 T3,3 T2,4 1 0 T1,1 T2,1 T3,1 T2,2 T3,2 T1,2 T1,3 T1,4 Interleaved Frame-buffer 0 Frame-buffer 1 render target

Rast 0 broadcasts ‘next’ token to all frame-buffer units

Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 Switch Switch T2,3 T3,3 T2,4 1 0 T1,1 T2,1 T3,1 T2,2 T3,2 T1,2 T1,3 T1,4 Interleaved Frame-buffer 0 Frame-buffer 1 render target

FB 0 and FB 1 can simultaneously process fragments from rast 0 (Notice updates to frame buffer)

Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,3 1 0 T1,1 Switch T3,1 Switch T3,2 T1,2 T1,3 T1,4 T2,1 Interleaved render target Frame-buffer 0 Frame-buffer 1 T2,2 T2,3 T2,4

Switch token reached: frame-buffer units start processing input from rast 1 Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,3 1 0 T1,1 T3,1 T3,2 T1,2 T1,3 T1,4 T2,1 Interleaved render target Frame-buffer 0 Frame-buffer 1 T2,2 T2,3 T2,4

FB 0 and FB 1 can simultaneously process fragments from rast 1 (Notice updates to frame buffer)

Input:

Geometry Draw T1 1 2 3 4 Draw T2 1 2 3 4

Draw T3 1 2 3

Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 1 0 T1,1 T1,2 T1,3 T3,1 T3,2 T1,4 T2,1 T3,3 Interleaved render target Frame-buffer 0 Frame-buffer 1 T2,2 T2,3 T2,4
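The single-geometry-unit walkthrough above can be captured in a small simulation: round-robin distribution in batches of two, a switch token broadcast downstream each time a producer rotates, and frame-buffer units that rotate among their input queues only when they consume a token. This is an illustrative sketch (the triangle fragment positions are made up), not hardware pseudocode:

```python
from collections import deque

SWITCH = "SWITCH"  # token broadcast downstream when a producer rotates

# Hypothetical fragment positions for three triangles (made up for the sketch).
TRIS = {"T1": [(0, 0), (1, 0), (0, 1), (1, 1)],
        "T2": [(2, 0), (3, 0), (2, 1), (3, 1)],
        "T3": [(0, 2), (1, 2), (0, 3)]}
DRAW_ORDER = ["T1", "T2", "T3"]
BATCH = 2

# Stage 1: distribute draws to two rasterizer queues round-robin in batches,
# appending a switch token each time distribution rotates to the next unit.
rast_q = [deque(), deque()]
cur = 0
for i, tri in enumerate(DRAW_ORDER):
    if i > 0 and i % BATCH == 0:
        rast_q[cur].append(SWITCH)
        cur = (cur + 1) % 2
    rast_q[cur].append(tri)
for q in rast_q:
    q.append(SWITCH)  # close out the final span on each queue

# Stage 2: rasterize. Each fragment routes to the frame-buffer unit that owns
# its pixel (checkerboard interleave); switch tokens broadcast to ALL FB queues.
fb_q = [[deque(), deque()] for _ in range(2)]  # fb_q[fb_unit][rasterizer]
for r in range(2):
    for item in rast_q[r]:
        if item == SWITCH:
            for fb in range(2):
                fb_q[fb][r].append(SWITCH)
        else:
            for (x, y) in TRIS[item]:
                fb_q[(x + y) % 2][r].append((item, (x, y)))

# Stage 3: each FB unit drains its per-rasterizer queues strictly in span
# order, rotating only when it consumes a switch token: draw order is restored.
def drain(fb):
    out, r = [], 0
    while any(fb_q[fb]):
        item = fb_q[fb][r].popleft()
        if item == SWITCH:
            r = (r + 1) % 2
        else:
            out.append(item[0])
    return out

results = {fb: drain(fb) for fb in range(2)}
for fb, seen in results.items():
    # Triangles appear at each FB unit in submission order.
    assert all(DRAW_ORDER.index(a) <= DRAW_ORDER.index(b)
               for a, b in zip(seen, seen[1:]))
print(results)
```

Note the key property the tokens buy: no global synchronization, yet every frame-buffer unit observes triangles in submission order.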

Extending to parallel geometry units

Starting state: commands enqueued (Draw T1, Draw T2, Draw T3, Draw T4) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1 Assume batch size is 2 for Frag Processing 0 Frag Processing 1 assignment to geom units and to rasterizers.

0 1 1 0 Interleaved render target Frame-buffer 0 Frame-buffer 1

Distribute triangles to geom units round-robin (batches of 2)

Distrib

Next Input: T2 T4 T1 T3 Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 1 0 Interleaved render target Frame-buffer 0 Frame-buffer 1

Geom 0 and geom 1 process triangles in parallel (Results after T1 processed are shown. Note big triangle T1 broken into multiple work items. [Eldridge et al.]) Distrib Input: Next T4 T2 T3 Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next 5 T1,b T1,a T1,c Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 1 0 Interleaved render target Frame-buffer 0 Frame-buffer 1

Geom 0 and geom 1 process triangles in parallel (Triangles enqueued in rast input queues. Note big triangles broken into multiple work items. [Eldridge et al.]) Distrib Input:

Next Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next Next 5 T1,b T3,b T2 T1,a T3,a T1,c T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 1 0 Interleaved render target Frame-buffer 0 Frame-buffer 1

Geom 0 broadcasts ‘next’ token to rasterizers

Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Switch Next Next Switch 5 T1,b T3,b T2 T1,a T3,a T1,c T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 1 0 Interleaved render target Frame-buffer 0 Frame-buffer 1

Rast 0 and rast 1 process triangles from geom 0 in parallel (Shaded fragments enqueued in frame-buffer unit input queues) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next 5 Switch T3,b Next T3,a Switch T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T1,5 T2,3 T2,4 T1,3 T2,1 T1,4 T2,2 1 0 T1,1 T1,6 T1,2 T1,7 Interleaved render target Frame-buffer 0 Frame-buffer 1

Rast 0 broadcasts ‘next’ token to FB units (end of geom 0, rast 0)

Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next 5 T3,b Switch T3,a Switch T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

Switch 0 1 T1,5 T2,3 Switch T2,4 T1,3 T2,1 T1,4 T2,2 1 0 T1,1 T1,6 T1,2 T1,7 Interleaved render target Frame-buffer 0 Frame-buffer 1

Frame-buffer units process frags from (geom 0, rast 0) in parallel (Notice updates to frame buffer) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next 5 T3,b Switch T3,a Switch T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T2,3 T2,4 T2,1 T2,2 1 0 Switch T1,6 Switch T1,7 T1,1 Interleaved T1,2 T1,3 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5

“End of rast 0” token reached by FB: FB units start processing input from rast 1 (fragments from geom 0, rast 1) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next 5 T3,b Switch T3,a Switch T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T2,3 T2,4 T2,1 T2,2 1 0 T1,6 T1,7 T1,1 Interleaved T1,2 T1,3 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5

“End of geom 0” token reached by rast units: rast units start processing input from geom 1 (note “end of geom 0, rast 1” token sent to rast input queues) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 Next 5 T3,b T3,a T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

Switch Switch 0 1 T2,3 T2,4 T2,1 T2,2 1 0 T1,6 T1,7 T1,1 Interleaved T1,2 T1,3 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5

Rast 0 processes triangles from geom 1 (Note rast 1 has work to do, but cannot make progress because its output queues are full) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5 Next T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

Switch Switch 0 1 T3,5 T2,3 T2,4 T3,3 T2,1 T3,4 T2,2 1 0 T3,1 T1,6 T3,2 T1,7 T1,1 Interleaved T1,2 T1,3 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5

Rast 0 broadcasts “end of geom 1, rast 0” token to frame-buffer units

Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5 T4 Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

Switch Switch Switch 0 1 T3,5 T2,3 Switch T2,4 T3,3 T2,1 T3,4 T2,2 1 0 T3,1 T1,6 T3,2 T1,7 T1,1 Interleaved T1,2 T1,3 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5

Frame-buffer units process frags from (geom 0, rast 1) in parallel (Notice updates to frame buffer. Also notice rast 1 can now make progress since space has become available)

Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

Switch 0 1 T3,5 Switch T3,3 T4,2 T3,4 T4,1 1 0 T3,1 Switch T3,2 Switch T1,1 T2,1 T2,2 Interleaved T1,2 T1,3 T2,3 T2,4 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5 T1,6 T1,7

Switch token reached by FB: FB units start processing input from (geom 1, rast 0)

Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

Switch 0 1 T3,5 Switch T3,3 T3,4 1 0 T3,1 T4,2 T3,2 T4,1 T1,1 T2,1 T2,2 Interleaved T1,2 T1,3 T2,3 T2,4 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5 T1,6 T1,7

Frame-buffer units process frags from (geom 1, rast 0) in parallel (Notice updates to frame buffer) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,1 T3,2 1 0 T3,3 T3,4 Switch T4,2 Switch T4,1 T1,1 T3,5 T2,1 T2,2 Interleaved T1,2 T1,3 T2,3 T2,4 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5 T1,6 T1,7

Switch token reached by FB: FB units start processing input from (geom 1, rast 1) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,1 T3,2 1 0 T3,3 T3,4 T4,2 T4,1 T1,1 T3,5 T2,1 T2,2 Interleaved T1,2 T1,3 T2,3 T2,4 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5 T1,6 T1,7

Frame-buffer units process frags from (geom 1, rast 1) in parallel (Notice updates to frame buffer) Distrib Input:

Draw T1 1 2 3 4 Geometry 0 Geometry 1 5 6 7

Draw T2 1 2 3 4

Draw T3 1 2 3 4 5

Draw T4 1 2 Rasterizer 0 Rasterizer 1

Frag Processing 0 Frag Processing 1

0 1 T3,1 T3,2 1 0 T3,3 T3,4 T4,1 T4,2 T1,1 T3,5 T2,1 T2,2 Interleaved T1,2 T1,3 T2,3 T2,4 render target Frame-buffer 0 Frame-buffer 1 T1,4 T1,5 T1,6 T1,7
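With two levels of parallel producers, the tokens nest: each rasterizer forwards an end-of-span token per (geometry unit, rasterizer) pair, so a frame-buffer unit ends up draining its queues in exactly the lexicographic (geom, rast) order the slides just stepped through. A tiny sketch of that ordering invariant (illustrative only; hardware encodes it with tokens, not an explicit loop):

```python
from itertools import product

# Consumption order at a frame-buffer unit: one pass over the rasterizer
# queues per geometry span, i.e., lexicographic (geom, rast) order.
def fb_drain_order(num_geom: int, num_rast: int):
    return list(product(range(num_geom), range(num_rast)))

print(fb_drain_order(2, 2))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```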

Parallel scheduling with data amplification

Geometry amplification ▪ Consider examples of one-to-many stage behavior during geometry processing in the graphics pipeline:

- Clipping amplifies geometry (clipping can result in multiple output primitives)

- Tessellation: the pipeline permits thousands of vertices to be generated from a single base primitive (challenging to maintain highly parallel execution)

- Primitive processing (“geometry shader”) outputs up to 1024 floats worth of vertices per input primitive
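For scale, a back-of-envelope on the geometry-shader limit above (the 16-floats-per-vertex layout is an assumption for illustration, not from the slides):

```python
# D3D11 geometry shader output limit, per the slide: 1024 floats per primitive.
MAX_GS_OUTPUT_FLOATS = 1024
BYTES_PER_FLOAT = 4
floats_per_vertex = 16  # assumed layout: position + a few attributes

max_output_bytes = MAX_GS_OUTPUT_FLOATS * BYTES_PER_FLOAT       # bytes per prim
max_verts_per_prim = MAX_GS_OUTPUT_FLOATS // floats_per_vertex  # amplification
print(max_output_bytes, max_verts_per_prim)  # 4096 64
```

So a single input primitive can expand to 4 KB of vertex data, roughly 64 vertices under this assumed layout.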

Thought experiment

Command Processor

T2 T4 T6 T8 T1 T3 T5 T7

Geometry Amplifier Geometry Amplifier Geometry Amplifier Geometry Amplifier

. . . Rasterizer . . .

Assume round-robin distribution of eight primitives to geometry pipelines, one rasterizer unit.

Consider the case of large amplification when processing T1

Command Processor

T2

Geometry Amplifier Geometry Amplifier Geometry Amplifier Geometry Amplifier

T1,6 T4,4 T6,5 T8,3 T1,5 T4,3 T6,4 T8,2 T1,4 T4,2 T6,3 T8,1 T1,3 T4,1 T6,2 T7,3 T1,2 T3,2 T6,1 T7,2 T1,1 T3,1 T5,1 T7,1

Notice: output from T1 processing fills output queue . . . Rasterizer . . .

Result: one geometry unit (the one producing outputs from T1) is feeding the entire downstream pipeline
- Serialization of geometry processing: other geometry units are stalled because their output queues are full (they cannot be drained until all work from T1 is completed)
- Underutilization of the rest of the chip: it is unlikely that one geometry producer is fast enough to produce pipeline work at a rate that fills the resources of the rest of the GPU.

Thought experiment: design a scheduling strategy for this case Command Processor

T2

Geometry Amplifier Geometry Amplifier Geometry Amplifier Geometry Amplifier

T1,6 T4,4 T6,5 T8,3 T1,5 T4,3 T6,4 T8,2 T1,4 T4,2 T6,3 T8,1 T1,3 T4,1 T6,2 T7,3 T1,2 T3,2 T6,1 T7,2 T1,1 T3,1 T5,1 T7,1

. . . Rasterizer . . .

1. Design a solution that is performant when the expected amount of data amplification is low.
2. Design a solution that is performant when the expected amount of data amplification is high.
3. What about a solution that works well for both?

The ideal solution always executes with maximum parallelism (no stalls), with maximal locality (units read and write to fixed-size, on-chip inter-stage buffers), and (of course) preserves order.

Implementation 1: fixed on-chip storage

Command Processor

Geometry Amplifier Geometry Amplifier Geometry Amplifier Geometry Amplifier

Small, on-chip buffers . . . Rasterizer . . .

Approach 1: make on-chip buffers big enough to handle common cases, but tolerate stalls
- Run fast for low amplification (never move output queue data off chip)
- Run very slow under high amplification (serialization of processing due to blocked units). Bad performance cliff.

Implementation 2: worst-case allocation

Command Processor

Geometry Amplifier Geometry Amplifier Geometry Amplifier Geometry Amplifier

......

Large, in-memory buffers . . . Rasterizer . . .

Approach 2: never block geometry units: allocate worst-case space in off-chip buffers (stored in DRAM)
- Run slower for low amplification (data goes off chip, then is read back in by rasterizers)
- No performance cliff for high amplification (still maximum parallelism; data still goes off chip)
- What is the overall worst-case buffer allocation if the four geometry units above are Direct3D 11 geometry shaders?

Implementation 3: hybrid
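One hedged way to approach the worst-case-allocation question above for Implementation 2: multiply the per-primitive GS output limit by the number of primitives each unit may have in flight and by the number of geometry units. The queue depth here is an assumed value for illustration, not a real GPU's:

```python
GS_LIMIT_BYTES = 1024 * 4   # 1024 floats per primitive (D3D11 GS output limit)
QUEUE_DEPTH = 32            # assumption: primitives buffered per geometry unit
NUM_GEOM_UNITS = 4          # from the figure

# Worst case: every buffered primitive in every unit amplifies to the limit.
total_bytes = GS_LIMIT_BYTES * QUEUE_DEPTH * NUM_GEOM_UNITS
print(total_bytes // 1024, "KB")  # 512 KB
```

The point of the exercise: worst-case allocation grows with both queue depth and unit count, which is why it lives in DRAM rather than on chip.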

Command Processor

Geometry Amplifier Geometry Amplifier Geometry Amplifier Geometry Amplifier (on-chip buffers)

Off-chip (spill) buffers

...... Rasterizer ......

Hybrid approach: allocate output buffers on chip, but spill to off-chip, worst-case-size buffers under high amplification
- Run fast for low amplification (high parallelism, no memory traffic)
- Less of a performance cliff for high amplification (high parallelism, but incurs more memory traffic)

NVIDIA GPU implementation: optionally resort work after the “Hull” shader stage (since the amplification factor is known at that point)
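Returning to Implementation 3: the spill policy can be sketched as a FIFO that prefers a small on-chip queue and overflows to a DRAM-backed buffer instead of stalling the producer. This is an illustrative model, not a driver implementation; note that once anything has spilled, new items also spill so FIFO order is preserved:

```python
from collections import deque

class HybridOutputBuffer:
    """FIFO with a small on-chip fast path and an off-chip spill path."""
    def __init__(self, on_chip_capacity: int):
        self.on_chip = deque()
        self.cap = on_chip_capacity
        self.spill = []                # stands in for a DRAM-backed buffer

    def push(self, item):
        # Fast path only while on-chip space remains AND nothing has spilled;
        # otherwise items must spill to keep FIFO order intact.
        if len(self.on_chip) < self.cap and not self.spill:
            self.on_chip.append(item)  # fast path: no memory traffic
        else:
            self.spill.append(item)    # slow path: off-chip traffic, no stall

    def pop(self):
        # Consume in FIFO order: on-chip entries first, then spilled ones.
        if self.on_chip:
            return self.on_chip.popleft()
        return self.spill.pop(0)

buf = HybridOutputBuffer(on_chip_capacity=2)
for v in ["a", "b", "c", "d"]:        # "c" and "d" overflow on-chip space
    buf.push(v)
print(len(buf.spill))                 # 2 items spilled off chip
print([buf.pop() for _ in range(4)])  # ['a', 'b', 'c', 'd'] (order preserved)
```

Under low amplification the spill list stays empty (fast, no memory traffic); under high amplification the producer never blocks, at the cost of off-chip bandwidth, which is exactly the trade-off on the slide.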
