4/18/2007
Real-Time Graphics Architecture
Lecture 4: Parallelism and Communication
Kurt Akeley Pat Hanrahan http://graphics.stanford.edu/cs448-07-spring/
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Topics
1. Frame buffers 2. Types of parallelism 3. Communication patterns and requirements 4. Sorting classification for parallel rendering (with examples)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
1 4/18/2007
Frame Buffers
Raster vs. calligraphic
Raster (image order) dominant choice Calligraphic (object order) Earliest choice (Sketchpad) E&S terminals in the 70s and 80s Works with light pens Scene complexity affects frame rate Monitors are expensive Still required for FAA simulation Increases absolute brightness of light points
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
2 4/18/2007
Frame buffer definitions
What is a frame buffer? What can we learn by considering different definitions?
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frame buffer definition #1
Storage for commands that are executed to refresh the display Allows for raster or calligraphiccalligraphic display (e. g. Megatech) “Frame buffer” for calligraphic display is a “display list” OpenGL “render list”? Key point: frame buffer contents are interpreted Color mapping Image scaling, warping Window system (overlay, separate windows, …) Address Recalculation Pipeline
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
3 4/18/2007
Frame buffer definition #2
Image memory used to decouple the render frame rate from the display frame rate Meets common understanding of frame buffer as image Leads naturally to double buffering One render buffer, one display buffer, swap n-buffering also possible, can control latency Key idea: decoupling enables general-purpose GPU Visual simulation has high render frame rate MCAD has low render frame rate Window manager has no frame rate
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frame buffer definition #3
All pixel-assigned memory used to assemble and display the images being rendered Key point: frame buffer is active participant in rendering Leads to non-color buffers: depth, stencil, window control OpenGL treats these buffers as part of frame buffer Some reserve “frame buffer” for color images Should be n-buffered in some cases (sort last) RealityEngine frame buffer can be deeper than wide or high History cycles through this definition 2-D manipulation 3-D painters algorithm 3-D depth, stencil, accumulation, multi-pass Programmable shading
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
4 4/18/2007
Frame buffer is optional
Calligraphic display If we don’t define display list as frame buffer “Follow-the-beam” rendering Minimizes latency Saves cost if frames are never “dropped” Talisman-like image assembly (3-D sprites) Old idea (visual simulation, window systems) GigaPixel render tile Frame buffer stores color images only Depth, stencil, etc. in small tile
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Dominant architecture is consistent
SGI architectures look like ATI architectures, which look like NVIDIA architectures Details are evolving, but big picture remains the same
Why is this? Simplicity of design Simplicity of algorithms Simplicity of immediate-mode approach
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
5 4/18/2007
Simplicity of design
Frame buffer operations Blending: merge fragment and pixel color Deppgth Buffering: save nearest frag ment Stencil Buffering: simple pixel state machine Accumulation Buffering: high-resolution color arithmetic Antialiasing: (to be covered later) …. All frame buffer operations: Combine fragment and pixel data (not just a replace) But replace operation is optimized, e.g., no parity/ECC Are local (no intra-pixel dependencies)
Why aren’t fragment operations programmable?
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Simplicity of algorithms
Frame buffer employs brute-force simplicity Hidden surface elimination: Depth-buffer vs. sort/painter Capping: Stencil-based vs. object calculations Image-space algorithm is efficient Just samples, never “object” information, locality Just-in-time calculation, steady cost function Accumulation Buffer (high-resolution color arithmetic) The Accumulation Buffer, Haeberli and Akeley, Proceedings of SIGGRAPH ‘90 Volume rendering using 3D textures Multi-pass rendering Interactive Multi-pass Programmable Shading, Peercy, Olano, Airey, and Ungar, Proceedings of SIGGRAPH ‘00
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
6 4/18/2007
Simplicity of immediate-mode
Frame buffer contents are “context” Matches 2D/window-rendering model
Rendering System
Little graphics Frame buffer: state here most graphics state here
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Decreasing display bandwidth burden
Historically display bandwidth was a limiting factor Hence “Sproull’s Rule”: fill rate >= display rate Now display bandwidth is almost inconsequential
Year System FB (GB) Disp (GB) Disp / FB 1984 SGI 2000-series 0.3 0.14 1/2 1988 SGI GTX 1.8 * 0.29 1/6 1996 SGI InfiniteReality 12.8 0600.60 1/20 2006 NVIDIA 7900 GTX 51.2 0.75 1/70
* VRAM provided separate video bandwidth
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
7 4/18/2007
Parallelism and Communication
Parallelism and communication
Parallelism – using multiple computational units to processes work in parallel Communication – connecting the computational units to allow work to be distributed and aggregated Issues Dependencies Ordering Sorting Scalability Computation Bandwidth Load balancing CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
8 4/18/2007
Parallelism taxonomy
Hardware parallelism Virtual parallelism (time sharing a single (simultaneous execution on processor, usually with multiple processors) hardware support)
Data parallelism [aka “parallelism”] (same task on similar data sets)
Task parallelism (different tasks on similar OR differing data sets)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Parallelism taxonomy
Hardware parallelism Virtual parallelism (time sharing a single (simultaneous execution on processor, usually with multiple processors) hardware support)
Frame-parallelism Data parallelism (batch, SGI N-clops) [aka “parallelism”] (same task on similar Object-parallelism data sets) (geometry) Image-parallelism (fragment/pixel)
Task parallelism (different tasks on similar OR differing data sets)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
9 4/18/2007
Parallelism taxonomy
Hardware parallelism Virtual parallelism (time sharing a single (simultaneous execution on processor, usually with multiple processors) hardware support)
Frame-parallelism Data parallelism (batch, SGI N-clops) [aka “parallelism”] (same task on similar Object-parallelism data sets) (geometry) Image-parallelism (fragment/pixel)
Task parallelism Multi-processing (on multiple CPUs) (different tasks on similar OR differing Pipelining data sets) (the graphics pipeline)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Parallelism taxonomy
Hardware parallelism Virtual parallelism (time sharing a single (simultaneous execution on processor, usually with multiple processors) hardware support)
Frame-parallelism Data parallelism (batch, SGI N-clops) Multi-processing [aka “parallelism”] (graphics context switching) (same task on similar Object-parallelism data sets) (geometry) Multi-threading (almost defines a GPU-like Image-parallelism processor) (fragment/pixel)
Task parallelism Multi-processing (on multiple CPUs) (different tasks on similar OR differing Pipelining data sets) (the graphics pipeline)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
10 4/18/2007
Parallelism taxonomy
Hardware parallelism Virtual parallelism (time sharing a single (simultaneous execution on processor, usually with multiple processors) hardware support)
Frame-parallelism Data parallelism (batch, SGI N-clops) Multi-processing [aka “parallelism”] (graphics context switching) (same task on similar Object-parallelism data sets) (geometry) Multi-threading (almost defines a GPU-like Image-parallelism processor) (fragment/pixel)
Multi-processing Multi-processing Task parallelism (time sharing a single CPU) (on multiple CPUs) (different tasks on similar OR differing Pipelining Multi-threading data sets) (Direct3D-10 “common- (the graphics pipeline) core”)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Graphics is embarrassingly parallel
Ample self-similar data sets … Frames, vertexes, fragments, texels, pixels With minimal dependencies Few intra-set dependencies Pixels (in the frame buffer) are the significant exception Inter-set dependencies are purely sequential “Graphics pipeline” is designed to minimize dependencies Other graphics architectures have more dependencies E.g., for global lighting effects But graphics pipeline has huge redundancies Hence many opportunities for optimization … How hard should we work to do things wrong ?
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
11 4/18/2007
Geometry parallelism trend (SGI)
20
15 Model 10 Transform Length Transform Width 5
0
X 00 G X IR 0 000 T RE 1 2 G VG
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Image parallelism trend (SGI)
Rasterization
400 300 200 100 0 1000 2000 G GTX VGX RE IR
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
12 4/18/2007
The clear trend
Shorter and wider
Why ?
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Communication taxonomy
Sorting
Distribution Object Image Routing (Introduced by (Introduced by parallelism) parallelism)
Texturing
Fundamental
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
13 4/18/2007
Sorting is fundamental
Sorting
Distribution Object Image Routing
Texturing
I. E. Sutherland, R. F. Sproull, and R. A. Schumacher, A characterization of ten hidden surface algorithms Classified by order of x, y, z radix sorts CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Pipelining vs. parallelism
Task Parallelism Issue Data Parallelism (pipelining)
Ordering Easy Challenging dependencies
Sorting Easy Challenging dependencies Computation scalability Bandwidth scalability Load balancing scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
14 4/18/2007
Pipelining vs. parallelism
Task Parallelism Issue Data Parallelism (pipelining)
Ordering Easy Challenging dependencies
Sorting Easy Challenging dependencies Computation Poor Challenging scalability Bandwidth Poor Challenging scalability Load balancing (Nearly) impossible Challenging scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Ordering challenges
Fundamental: Frame buffer operations Painter’l’s algorit hm Memory hazards Texture writes Render Copy to texture Render Readback From pipelining: Changes to graphics state
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
15 4/18/2007
Sorting taxonomy
Application
Command Sort-First
Geometry Sort-Middle Rasterization
Texture Sort-Last Fragment
Fragment Sort-Last Image Composition Display
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-First
16 4/18/2007
Sort-first
App App Pre-transformation Cmd Cmd
SORT Point-to-point communication scales Geom Geom Rast Rast Coarse tiling incurs load Tex Tex imbalance Frag Frag Disp Disp
Princeton Display Wall, Stanford WireGL CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-first
Order Automatic (conceptually) Sort Pre-stage (cheat ☺) Compute scalability Good Bandwidth scalability Good Load balance scalability Poor
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
17 4/18/2007
Sort-first
App App Cmd Cmd
SORT Geom Geom Rast Rast Tex Tex Frag Frag
ROUTE Disp
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Ring parallelism
App
Cmd DIST Cmd DIST Cmd DIST … DIST Cmd
Geom Geom Geom Geom Rast Rast Rast Rast Tex Tex Tex Tex Frag Frag Frag Frag
ROUTE Disp
3DLABs
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
18 4/18/2007
Sort-Middle
Image-space work distribution
Parke - Tiled Fuchs - Interleaved
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
19 4/18/2007
Sort-middle interleaved
App
DISTRIBUTE Cmd Cmd Geometry work load-balanced, except clipping and tesselation Geom Geom Broadcast communication does SORT not scale, but supports ordering Rast Rast Finely interleaved screen tiling Tex Tex en sur es ex cellen t load balan ce Frag Frag
ROUTE Disp SGI Graphics Workstations: RealityEngine, InfiniteReality CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-middle interleaved
Order Force sequence at triangle sort Sort Broadcast Compute scalability Good Bandwidth scalability Limited by sort broadcast Load balance scalability Good
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
20 4/18/2007
SGI RealityEngine
240 MB/s
1600 MB/s
3200 MB/s
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-middle tiled
App App Cmd Cmd Geom Geom Point-to-point communication SORT scales Rast Rast Coarse tiling incurs load Tex Tex imbalance Frag Frag
ROUTE Disp Disp UNC PixelPlanes, Stanford Argus CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
21 4/18/2007
Sort-middle tiled (immediate mode)
Order Force sequence at triangle sort Sort Can approach point-to-point Compute scalability Good Bandwidth scalability Good Load balance scalability Poor for rasterization (due to large triangles)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-middle tiled (chunked)
Order Force sequence at triangle sort Full-frame delay, render to texture difficulties Sort Can approach point-to-point Compute scalability Good Bandwidth scalability Good Load balance scalability Good
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
22 4/18/2007
UNC Pixel-Planes5 (1990)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-Last
23 4/18/2007
Sort-last fragment
App App Cmd Cmd Improved texture locality No redundant work in FG Geom Geom Exposes rasterization load Rast Rast imbalance to application Tex Tex Point-to-point communication scales, but requires more bw SORT
Frag Frag Finely interleaved screen tiling
ROUTE insures excellent load balance
Disp Disp Possible, but difficult, to maintain ordering Kubota Denali, E&S Freedom 3000 CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-last fragment
Order Force sequence at fragment sort Sort Point-to-point, high bandwidth Compute scalability Good Bandwidth scalability OK (sorting is the bottleneck) Load balance scalability OK (exposed to application)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
24 4/18/2007
Kubota Denali (1993)
TEM X6 48 5
24 X10
FBM
Denali Technical Overview 1.0 Kubota Pacific Computer, 1993
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Image composition
Z comp
Other combiners possible
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
25 4/18/2007
Sort-last image composition
App App Exposes rasterization load imbalance to application Cmd Cmd Point-to-point ring interconnect Geom Geom scales Rast Rast Tex Tex Frag Frag
SORT Two-stage image composition Disp Disp loses ordering
UNC/HP PixelFlow, Aizu VC-1, Stanford Lightning-2 CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-last image composition
Order Not fully supported ! Sort One to many for each pipeline Compute scalability Excellent Bandwidth scalability Excellent Load balance scalability OK (exposed to application)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
26 4/18/2007
UNC Pixel Flow
From J. Poulton, J. Eyles, S. Molnar, H. Fuchs, Pixel Flow: The Realization
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
27 4/18/2007
Sort-Everywhere
Sort-everywhere: Pomegranate
App App Cmd Cmd Geom Geom
DISTRIBUTE Rast Rast Mem Tex Tex Mem
SORT Mem Frag Frag Mem
ROUTE Disp Disp
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
28 4/18/2007
Architecture comparison
X indicates ere iled (immd) iled (chunk) nterleaved ment ge comp. i t t h g an issue a Sort-first Sort-middle Sort-middle Sort-middle fra Sort-last im Sort-last Sort-everyw
Ordered (X) X Compute scalability Bandwidth scalability XX Load balance scalability XXXX
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Summary
GPU architecture trend Pipeline hardware-parallel virtual-parallel
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
29 4/18/2007
Readings
Required 1. S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering 2. Fuchs et al., A heterogenous multiprocessor graphics system using processor-enhanced memories (PP5). 3. Eyles et al., PixelFlow: The Realization Recommended 1. F. I. Parke, Simulation and expected performance analysis of multiple processor z-buffer systems 2. H. Fuchs, Distributing a visible surface algorithm over multiple processors
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Real-Time Graphics Architecture
Lecture 4: Parallelism and Communication
Kurt Akeley Pat Hanrahan http://graphics.stanford.edu/cs448-07-spring/
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
30