4/18/2007

Real-Time Graphics Architecture

Lecture 4: Parallelism and Communication

Kurt Akeley Pat Hanrahan http://graphics.stanford.edu/cs448-07-spring/

CS448 Lecture 4, Kurt Akeley, Pat Hanrahan, Spring 2007

Topics

1. Frame buffers
2. Types of parallelism
3. Communication patterns and requirements
4. Sorting classification for parallel rendering (with examples)


Frame Buffers

Raster vs. calligraphic

Raster (image order) dominant choice
Calligraphic (object order)
  Earliest choice (Sketchpad)
  E&S terminals in the 70s and 80s
  Works with light pens
  Scene complexity affects
  Monitors are expensive
  Still required for FAA simulation
    Increases absolute brightness of light points


Frame buffer definitions

What is a frame buffer? What can we learn by considering different definitions?


Frame buffer definition #1

Storage for commands that are executed to refresh the display
  Allows for raster or calligraphic display (e.g. Megatech)
  “Frame buffer” for calligraphic display is a “display list”
  OpenGL “render list”?
Key point: frame buffer contents are interpreted
  Color mapping
  Image scaling, warping
  Window system (overlay, separate windows, …)
  Address Recalculation Pipeline


Frame buffer definition #2

Image memory used to decouple the render frame rate from the display frame rate
  Meets common understanding of frame buffer as image
  Leads naturally to double buffering
    One render buffer, one display buffer, swap
    n-buffering also possible, can control latency
Key idea: decoupling enables general-purpose GPU
  Visual simulation has high render frame rate
  MCAD has low render frame rate
  Window manager has no frame rate
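A minimal sketch of the double-buffering idea, assuming a tiny in-memory image; the class name `DoubleBuffer` and its fields are illustrative only and do not correspond to any real API. The renderer writes the back buffer while the display keeps reading the front buffer, and a swap exchanges their roles without copying pixels.

```python
class DoubleBuffer:
    """Two image buffers: the renderer writes `back`, the display reads `front`."""

    def __init__(self, width, height, clear=0):
        self.front = [[clear] * width for _ in range(height)]
        self.back = [[clear] * width for _ in range(height)]

    def swap(self):
        # Exchange the roles of the two buffers; no pixels are copied.
        self.front, self.back = self.back, self.front


fb = DoubleBuffer(4, 2)
fb.back[0][0] = 255             # render into the back buffer
shown_before = fb.front[0][0]   # the display still scans out the old frame
fb.swap()
shown_after = fb.front[0][0]    # the new frame becomes visible after the swap
```

Because the display only ever reads `front`, the render rate and the display rate no longer need to match.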


Frame buffer definition #3

All pixel-assigned memory used to assemble and display the images being rendered
Key point: frame buffer is active participant in rendering
  Leads to non-color buffers: depth, stencil, window control
  OpenGL treats these buffers as part of frame buffer
    Some reserve “frame buffer” for color images
  Should be n-buffered in some cases (sort last)
  RealityEngine frame buffer can be deeper than wide or high
History cycles through this definition
  2-D manipulation
  3-D painter’s algorithm
  3-D depth, stencil, accumulation, multi-pass
  Programmable


Frame buffer is optional

Calligraphic display
  If we don’t define display list as frame buffer
“Follow-the-beam” rendering
  Minimizes latency
  Saves cost if frames are never “dropped”
Talisman-like image assembly (3-D sprites)
  Old idea (visual simulation, window systems)
GigaPixel render tile
  Frame buffer stores color images only
  Depth, stencil, etc. in small tile


Dominant architecture is consistent

SGI architectures look like ATI architectures, which look like NVIDIA architectures
  Details are evolving, but big picture remains the same

Why is this?
  Simplicity of design
  Simplicity of algorithms
  Simplicity of immediate-mode approach


Simplicity of design

Frame buffer operations
  Blending: merge fragment and pixel color
  Depth Buffering: save nearest fragment
  Stencil Buffering: simple pixel state machine
  Accumulation Buffering: high-resolution color arithmetic
  Antialiasing: (to be covered later)
  …
All frame buffer operations:
  Combine fragment and pixel data (not just a replace)
    But replace operation is optimized, e.g., no parity/ECC
  Are local (no intra-pixel dependencies)
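The "combine fragment and pixel data" idea can be sketched for a single pixel as follows. This is an illustrative model, not how any particular hardware implements it: the function name `merge_fragment`, the tuple layouts, and the convention that a smaller depth is nearer are all assumptions.

```python
def merge_fragment(pixel, fragment):
    """Combine one fragment with the pixel it covers.

    pixel:    (color, depth) currently stored in the frame buffer
    fragment: (color, alpha, depth) produced by rasterization
    Smaller depth means nearer (an assumed convention).
    """
    p_color, p_depth = pixel
    f_color, f_alpha, f_depth = fragment
    if f_depth >= p_depth:
        # Depth buffering: the stored sample is nearer, so keep it.
        return pixel
    # Blending: merge fragment and pixel color using the fragment's alpha.
    blended = tuple(f_alpha * f + (1.0 - f_alpha) * p
                    for f, p in zip(f_color, p_color))
    return (blended, f_depth)


stored = ((0.0, 0.0, 0.0), 1.0)
near = merge_fragment(stored, ((1.0, 0.5, 0.0), 0.5, 0.4))
far = merge_fragment(near, ((1.0, 1.0, 1.0), 1.0, 0.9))
```

Only the one pixel covered by the fragment is read and written, which is the locality property the slide points to.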

Why aren’t fragment operations programmable?


Simplicity of algorithms

Frame buffer employs brute-force simplicity
  Hidden surface elimination: Depth-buffer vs. sort/painter
  Capping: Stencil-based vs. object calculations
Image-space algorithm is efficient
  Just samples, never “object” information, locality
  Just-in-time calculation, steady cost function
Accumulation Buffer (high-resolution color arithmetic)
  The Accumulation Buffer, Haeberli and Akeley, Proceedings of SIGGRAPH ’90
  using 3D textures
Multi-pass rendering
  Interactive Multi-pass Programmable Shading, Peercy, Olano, Airey, and Ungar, Proceedings of SIGGRAPH ’00
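As a rough illustration of why an accumulation buffer needs higher-resolution arithmetic than the color buffer, the sketch below averages several 8-bit passes in a wider accumulator. The function name `accumulate` and the flat-list image layout are assumptions made for the example, not the method of the cited paper.

```python
def accumulate(passes):
    """Average several 8-bit images using a wider accumulator."""
    n = len(passes)
    acc = [0] * len(passes[0])        # wide integer accumulator, one entry per pixel
    for image in passes:
        for i, value in enumerate(image):
            acc[i] += value           # sums may exceed 255; nothing is clamped here
    return [round(total / n) for total in acc]


# Three passes over a two-pixel image (values are 8-bit intensities).
averaged = accumulate([[250, 10], [254, 20], [252, 30]])
```

The intermediate sum for the first pixel is 756, which would not fit in an 8-bit color channel.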


Simplicity of immediate-mode

Frame buffer contents are “context”
Matches 2D/window-rendering model

[Diagram: Rendering System (little graphics state here); Frame buffer (most graphics state here)]


Decreasing display bandwidth burden

Historically display bandwidth was a limiting factor
  Hence “Sproull’s Rule”: fill rate >= display rate
Now display bandwidth is almost inconsequential

Year  System               FB (GB/s)  Disp (GB/s)  Disp / FB
1984  SGI 2000-series      0.3        0.14         1/2
1988  SGI GTX              1.8 *      0.29         1/6
1996  SGI InfiniteReality  12.8       0.60         1/20
2006  NVIDIA 7900 GTX      51.2       0.75         1/70

* VRAM provided separate video bandwidth
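The last column of the table can be checked with a few lines of arithmetic. The snippet below is only a sketch that reuses the figures from the table; the variable names are illustrative.

```python
# (year, system, frame buffer bandwidth, display bandwidth), as listed in the table
systems = [
    (1984, "SGI 2000-series", 0.3, 0.14),
    (1988, "SGI GTX", 1.8, 0.29),
    (1996, "SGI InfiniteReality", 12.8, 0.60),
    (2006, "NVIDIA 7900 GTX", 51.2, 0.75),
]

# Frame buffer bandwidth divided by display bandwidth, i.e. the N in "1/N".
ratios = {name: fb / disp for _, name, fb, disp in systems}

for name, ratio in ratios.items():
    print(f"{name}: display is about 1/{round(ratio)} of frame buffer bandwidth")
```

The computed values (roughly 2, 6, 21, and 68) agree with the rounded fractions shown on the slide.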


Parallelism and Communication

Parallelism and communication

Parallelism – using multiple computational units to process work in parallel
Communication – connecting the computational units to allow work to be distributed and aggregated
Issues
  Dependencies
    Ordering
    Sorting
  Scalability
    Computation
    Bandwidth
    Load balancing


Parallelism taxonomy

Hardware parallelism: simultaneous execution on multiple processors
Virtual parallelism: time sharing a single processor, usually with hardware support

Data parallelism [aka “parallelism”] (same task on similar data sets)
  Hardware parallelism
    Frame-parallelism (batch, SGI N-clops)
    Object-parallelism (geometry)
    Image-parallelism (fragment/pixel)
  Virtual parallelism
    Multi-processing (graphics context switching)
    Multi-threading (almost defines a GPU-like processor)

Task parallelism (different tasks on similar OR differing data sets)
  Hardware parallelism
    Multi-processing (on multiple CPUs)
    Pipelining (the graphics pipeline)
  Virtual parallelism
    Multi-processing (time sharing a single CPU)
    Multi-threading (Direct3D-10 “common-core”)


Graphics is embarrassingly parallel

Ample self-similar data sets …
  Frames, vertexes, fragments, texels, pixels
With minimal dependencies
  Few intra-set dependencies
    Pixels (in the frame buffer) are the significant exception
  Inter-set dependencies are purely sequential
“Graphics pipeline” is designed to minimize dependencies
  Other graphics architectures have more dependencies
    E.g., for global lighting effects
But graphics pipeline has huge redundancies
  Hence many opportunities for optimization …
  How hard should we work to do things wrong?


Geometry parallelism trend (SGI)

[Figure: bar chart, vertical axis 0 to 20; labels: Model, Transform Length, Transform Width; systems: 1000, 2000, G, GTX, VGX, RE, IR]


Image parallelism trend (SGI)

[Figure: bar chart of Rasterization, vertical axis 0 to 400; systems: 1000, 2000, G, GTX, VGX, RE, IR]


The clear trend

Shorter and wider

Why?


Communication taxonomy

Sorting (Fundamental)
Distribution (Introduced by object parallelism)
Routing (Introduced by image parallelism)
Texturing


Sorting is fundamental

[Diagram: Sorting; Distribution (object), Routing (image); Texturing]

I. E. Sutherland, R. F. Sproull, and R. A. Schumacker, A characterization of ten hidden surface algorithms
  Classified by order of x, y, z radix sorts


Pipelining vs. parallelism

Issue                       Task Parallelism (pipelining)  Data Parallelism
Ordering dependencies       Easy                           Challenging
Sorting dependencies        Easy                           Challenging
Computation scalability     Poor                           Challenging
Bandwidth scalability       Poor                           Challenging
Load balancing scalability  (Nearly) impossible            Challenging


Ordering challenges

Fundamental:
  Frame buffer operations
    Painter’s algorithm
  Memory hazards
    Texture writes
    Render
    Copy to texture
    Render
    Readback
From pipelining:
  Changes to graphics state
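A small numeric sketch of why frame buffer operations make ordering fundamental: blending two translucent layers gives a different pixel depending on which is drawn first. The `over` helper and the single-channel intensities are assumptions chosen to keep the example short.

```python
def over(src_color, src_alpha, dst_color):
    """Blend a source layer over the color already stored in the pixel."""
    return src_alpha * src_color + (1.0 - src_alpha) * dst_color


background = 0.0
layer_a = (1.0, 0.5)   # (intensity, alpha)
layer_b = (0.2, 0.5)

# Draw A first, then B on top of it.
a_then_b = over(*layer_b, over(*layer_a, background))
# Draw B first, then A on top of it.
b_then_a = over(*layer_a, over(*layer_b, background))
```

Here `a_then_b` is about 0.35 while `b_then_a` is about 0.55, so a parallel renderer that reorders these two primitives changes the image.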


Sorting taxonomy

Pipeline stages, with the sort location that defines each class:

Application
Command
  (Sort-First)
Geometry
  (Sort-Middle)
Rasterization
Texture
  (Sort-Last Fragment)
Fragment
  (Sort-Last Image Composition)
Display


Sort-First


Sort-first

[Diagram: parallel pipelines, each App -> Cmd -> SORT -> Geom -> Rast -> Tex -> Frag -> Disp]

Pre-transformation
Point-to-point communication scales
Coarse tiling incurs load imbalance

Princeton Display Wall, Stanford WireGL

Sort-first

Order                     Automatic (conceptually)
Sort                      Pre-stage (cheat ☺)
Compute scalability       Good
Bandwidth scalability     Good
Load balance scalability  Poor


Sort-first

[Diagram: parallel pipelines, each App -> Cmd -> SORT -> Geom -> Rast -> Tex -> Frag, joined by ROUTE -> a single Disp]


Ring parallelism

[Diagram: one App feeds a chain of Cmd units linked by DIST (Cmd - DIST - Cmd - DIST - Cmd - DIST - … - DIST - Cmd); each Cmd drives its own Geom -> Rast -> Tex -> Frag pipeline; all pipelines join at ROUTE -> Disp]

3DLABs


Sort-Middle

Image-space work distribution

Parke - Tiled
Fuchs - Interleaved
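The two image-space distributions can be compared with a toy pixel-to-processor mapping. This is a sketch under simple assumptions (a screen that divides evenly, a rectangular primitive, hypothetical function names), not the scheme of any specific machine.

```python
def tiled_owner(x, y, width, height, nx, ny):
    """Tiled: the screen is cut into nx * ny contiguous regions."""
    return (y // (height // ny)) * nx + (x // (width // nx))


def interleaved_owner(x, y, nx, ny):
    """Interleaved: neighboring pixels belong to different processors."""
    return (y % ny) * nx + (x % nx)


def load(owner, rect, n):
    """Count the pixels of rect = (x0, y0, x1, y1), end-exclusive, per processor."""
    counts = [0] * n
    x0, y0, x1, y1 = rect
    for y in range(y0, y1):
        for x in range(x0, x1):
            counts[owner(x, y)] += 1
    return counts


# An 8x8 screen, 2x2 processors, and a 4x4 primitive in the top-left corner.
rect = (0, 0, 4, 4)
tiled_load = load(lambda x, y: tiled_owner(x, y, 8, 8, 2, 2), rect, 4)
interleaved_load = load(lambda x, y: interleaved_owner(x, y, 2, 2), rect, 4)
```

With tiling, one processor receives all 16 pixels; with interleaving, each of the four processors receives 4.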


Sort-middle interleaved

[Diagram: App -> DISTRIBUTE -> parallel Cmd -> Geom -> SORT -> parallel Rast -> Tex -> Frag -> ROUTE -> Disp]

Geometry work load-balanced, except clipping and tessellation
Broadcast communication does not scale, but supports ordering
Finely interleaved screen tiling ensures excellent load balance

SGI Graphics Workstations: RealityEngine, InfiniteReality

Sort-middle interleaved

Order                     Force sequence at triangle sort
Sort                      Broadcast
Compute scalability       Good
Bandwidth scalability     Limited by sort broadcast
Load balance scalability  Good


SGI RealityEngine

[Figure: SGI RealityEngine block diagram; labeled bandwidths: 240 MB/s, 1600 MB/s, 3200 MB/s]


Sort-middle tiled

[Diagram: parallel App -> Cmd -> Geom -> SORT -> parallel Rast -> Tex -> Frag -> ROUTE -> Disp, Disp]

Point-to-point communication scales
Coarse tiling incurs load imbalance

UNC PixelPlanes, Stanford Argus


Sort-middle tiled (immediate mode)

Order                     Force sequence at triangle sort
Sort                      Can approach point-to-point
Compute scalability       Good
Bandwidth scalability     Good
Load balance scalability  Poor for rasterization (due to large triangles)
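A sketch of the triangle sort for a tiled screen, assuming each triangle is routed by its screen-space bounding box (a conservative test) and that tiles are square; the function name and tile size are hypothetical. A small triangle reaches one rasterizer, which is why the sort can approach point-to-point, while a large triangle must be sent to many.

```python
def tiles_overlapped(bbox, tile_size):
    """Return the (tx, ty) tiles touched by a screen-space bounding box.

    bbox = (xmin, ymin, xmax, ymax) in inclusive pixel coordinates.
    """
    xmin, ymin, xmax, ymax = bbox
    return [(tx, ty)
            for ty in range(ymin // tile_size, ymax // tile_size + 1)
            for tx in range(xmin // tile_size, xmax // tile_size + 1)]


small = tiles_overlapped((10, 10, 20, 20), 64)    # fits inside a single tile
large = tiles_overlapped((0, 0, 255, 127), 64)    # spans a 4 x 2 block of tiles
```

The per-tile pixel work still depends on where triangles land, so this routing step does nothing by itself to balance rasterization load.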


Sort-middle tiled (chunked)

Order                     Force sequence at triangle sort
                          Full-frame delay, render to texture difficulties
Sort                      Can approach point-to-point
Compute scalability       Good
Bandwidth scalability     Good
Load balance scalability  Good


UNC Pixel-Planes 5 (1990)


Sort-Last


Sort-last fragment

[Diagram: parallel App -> Cmd -> Geom -> Rast -> Tex -> SORT -> parallel Frag -> ROUTE -> Disp, Disp]

Improved texture locality
No redundant work in FG
Exposes rasterization load imbalance to application
Point-to-point communication scales, but requires more bw
Finely interleaved screen tiling ensures excellent load balance
Possible, but difficult, to maintain ordering

Kubota Denali, E&S Freedom 3000

Sort-last fragment

Order                     Force sequence at fragment sort
Sort                      Point-to-point, high bandwidth
Compute scalability       Good
Bandwidth scalability     OK (sorting is the bottleneck)
Load balance scalability  OK (exposed to application)


Kubota Denali (1993)

[Figure: Kubota Denali block diagram; recoverable labels: TEM, FBM, X6, X10, 48, 24, 5]

Denali Technical Overview 1.0, Kubota Pacific Computer, 1993


Image composition

Z comp

Other combiners possible
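The Z-compare combiner can be sketched as a per-pixel choice between two fully rendered images. The sample layout `(color, depth)` and the smaller-is-nearer convention are assumptions for this example; other combiners would replace the comparison.

```python
def z_composite(image_a, image_b):
    """Merge two images of (color, depth) samples, keeping the nearer sample per pixel."""
    return [a if a[1] <= b[1] else b for a, b in zip(image_a, image_b)]


# Two pipelines each rendered part of the scene at full screen size (two pixels here).
from_pipeline_a = [("red", 0.3), ("red", 0.9)]
from_pipeline_b = [("blue", 0.5), ("blue", 0.2)]
final = z_composite(from_pipeline_a, from_pipeline_b)
```

Since the choice depends only on depth, the result is the same whichever pipeline's image arrives first, but any information about submission order is gone.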


Sort-last image composition

[Diagram: parallel pipelines, each App -> Cmd -> Geom -> Rast -> Tex -> Frag, joined by SORT -> Disp, Disp]

Exposes rasterization load imbalance to application
Point-to-point ring interconnect scales
Two-stage image composition loses ordering

UNC/HP PixelFlow, Aizu VC-1, Stanford Lightning-2

Sort-last image composition

Order                     Not fully supported!
Sort                      One to many for each pipeline
Compute scalability       Excellent
Bandwidth scalability     Excellent
Load balance scalability  OK (exposed to application)


UNC Pixel Flow

From J. Poulton, J. Eyles, S. Molnar, H. Fuchs, Pixel Flow: The Realization


Sort-Everywhere

Sort-everywhere: Pomegranate

[Diagram: parallel App -> Cmd -> Geom -> DISTRIBUTE -> parallel Rast -> Tex (with Mem) -> SORT -> parallel Frag (with Mem) -> ROUTE -> Disp, Disp]


Architecture comparison

[Table: X indicates an issue; the column position of each mark is not recoverable]

Columns: Sort-first, Sort-middle tiled (immd), Sort-middle tiled (chunk), Sort-middle interleaved, Sort-last fragment, Sort-last image comp., Sort-everywhere

Ordered                   (X) X
Compute scalability
Bandwidth scalability     X X
Load balance scalability  X X X X


Summary

GPU architecture trend
  Pipeline -> hardware-parallel -> virtual-parallel


Readings

Required
  1. S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering
  2. Fuchs et al., A heterogeneous multiprocessor graphics system using processor-enhanced memories (PP5)
  3. Eyles et al., PixelFlow: The Realization
Recommended
  1. F. I. Parke, Simulation and expected performance analysis of multiple processor z-buffer systems
  2. H. Fuchs, Distributing a visible surface algorithm over multiple processors
