Course Roadmap Evolution of the Programmable Graphics ► (GLSL) ►GPGPU (GLSL) Pipeline ƒ Briefly ►GPU Computing (CUDA, OpenCL) Lecture 2 ►Choose your own adventure Original Slides by: Suresh Venkatasubramanian Choose your own adventure Updates by Joseph Kider ƒ Student Presentation ƒ Final Project ►Goal: Prepare you for your presentation and project

Lecture Outline The evolution of the pipeline

► A historical perspective on the graphics pipeline Elements of the graphics pipeline: Parameters controlling design of the pipeline: ƒ Dimensions of innovation. 1. A scene description: vertices, triangles, colors, lighting 1. Where is the boundary ƒ Where we are today between CPU and GPU ? 2. Transformations that map the ƒ Fixed-function vs programmable pipelines scene to a camera viewpoint 2. What transfer method is used ? ► A closer look at the fixed function pipeline 3. “Effects”: texturing, shadow 3. What resources are provided ƒ Walk thru the sequence of operations mapping, lighting calculations at each step ? ƒ Reinterpret these as stream operations 4. Rasterizing: converting geometry 4. What units can access which into pixels GPU memory elements ? ► We can program the fixed-function pipeline ! 5. Pixel processing: depth tests, ƒ Some examples stencil tests, and other per-pixel ► What constitutes data and memory, and how operations. access affects program design. Generation I: 3dfx Voodoo (1996) Aside: Mario Kart 64

• One of the first true 3D game cards ►High fragment load / low vertex load • Worked by supplementing standard 2D . • Did not do vertex transformations: these were done in the CPU •Did dotexture mapping, z-buffering. http://accelenation.com/?ac.id.123.2

Rasterization Vertex Primitive Raster Frame Vertex Primitive and Frame Transforms Assembly Operations Buffer Transforms Assembly Interpolation Buffer

CPU GPU PCI

Image from: http://www.gamespot.com/users/my_shoe/

Aside: Mario Kart Wii Generation II: GeForce/ 7500 (1998)

• Main innovation: shifting the „ High fragment load / low vertex load? transformation and lighting calculations to the GPU • Allowed multi-texturing: giving bump maps, light maps, and others.. • Faster AGP bus instead of PCI

http://accelenation.com/?ac.id.123.5

Rasterization Vertex Primitive Raster Frame Vertex Primitive and Frame Transforms Assembly Operations Buffer Transforms Assembly Interpolation Buffer

GPU AGP

Image from: http://wii.ign.com/dor/objects/949580/mario-kart-wii/images/ Generation III: GeForce3/Radeon 8500(2001) Generation IV: Radeon 9700/GeForce FX (2002)

• For the first time, allowed limited • This generation is the first generation amount of programmability in the of fully-programmable graphics cards vertex pipeline • Different versions have different • Also allowed volume texturing and resource limits on fragment/vertex multi-sampling (for antialiasing) programs

http://accelenation.com/?ac.id.123.7 http://accelenation.com/?ac.id.123.8

Rasterization Rasterization Vertex Primitive Raster Frame Vertex Primitive Raster Frame Vertex Primitive and Frame Vertex Primitive and Frame Transforms Assembly Operations Buffer Transforms Assembly Operations Buffer Transforms Assembly Interpolation Buffer Transforms Assembly Interpolation Buffer

GPU AGP AGP ProgrammableProgrammable SmallSmall vertex vertex ProgrammableProgrammable FragmentFragment shadersshaders VertexVertex shader ProcessorProcessor Texture Memory

Generation IV.V: GeForce6/X800 (2004) NVIDIA NV40 Architecture

► Simultaneous rendering to multiple buffers 6 vertex ► True conditionals and loops shader units ► PCIe bus ► Vertex texture fetch Vertex Texture Fetch

Rasterization Vertex Primitive Raster Frame Vertex Primitive and Frame Transforms Assembly Operations Buffer Transforms Assembly Interpolation Buffer

PCIe ProgrammableProgrammable ProgrammableProgrammable FragmentFragment 16 fragment VertexVertex shader shader ProcessorProcessor shader units

Texture Memory Texture Memory

Slide adapted from Suresh Venkatasubramanian and Joe Kider Image from GPU Gems 2: http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter30.html D3D 10 Pipeline Generation IV.V: GeForce6/X800 (2004) Not exactly a quantum leap, but… ► Simultaneous rendering to multiple buffers ► True conditionals and loops ► Higher precision throughput in the pipeline (64 bits end-to-end, compared to 32 bits earlier.) ► PCIe bus ► More memory/program length/texture accesses ► Texture access by vertex shader

Rasterization Vertex Primitive Raster Frame Vertex Primitive and Frame Transforms Assembly Operations Buffer Transforms Assembly Interpolation Buffer

AGP ProgrammableProgrammable ProgrammableProgrammable FragmentFragment VertexVertex shader shader ProcessorProcessor

Texture Memory Texture Memory

Image from David Blythe : http://download.microsoft.com/download/f/2/d/f2d5ee2c-b7ba-4cd0-9686-b6508b5479a1/direct3d10_web.pdf

Fixed-function pipeline Generation V: GeForce8800/HD2900 (2006) 3D API Commands 3D3D API: API: 3D3D Complete quantum leap OpenGLOpenGL or or ApplicationApplication Direct3D Or Game ► Ground-up rewrite of GPU Direct3D Or Game CPU-GPU Boundary (AGP/PCIe) Data Stream ► Support for DirectX 10, and all Command & it implies (more on this later) GPU ► Geometry Shader Vertex Pixel ► Support for General GPU Index Assembled Pixel programming Primitives Location Updates Stream Stream ► Shared Memory (NVIDIA only) Rasterization GPU Primitive Raster Frame GPU Primitive and Frame Front End Assembly Operations Buffer Front End Assembly Interpolation Buffer

Input Programmable ProgrammableProgrammable Pre-transformed Pre-transformed Input ProgrammableProgrammable Raster Assembler Geometry PixelPixel Fragments Assembler VertexVertex shader shader Operations Vertices Shader ShaderShader

ProgrammableProgrammable Vertices ProgrammableProgrammable AGP Fragments Transformed VertexVertex FragmentFragment Transformed Output ProcessorProcessor ProcessorProcessor Merger Geometry : Point Sprites Geometry Shaders: Point Sprites

Geometry Shaders NVIDIA G80 Architecture

Image from David Blythe : http://download.microsoft.com/download/f/2/d/f2d5ee2c-b7ba-4cd0-9686-b6508b5479a1/direct3d10_web.pdf Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf NVIDIA G80 Architecture Why Unify Shader Processors?

Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Why Unify Shader Processors? Unified Shader Processors

Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf Terminology Shader Capabilities

Shader Direct3D OpenGL Video card Model Example NVIDIA GeForce 6800 292.xATI Radeon X800

NVIDIA GeForce 8800 3 10.x 3.x ATI Radeon HD 2900

NVIDIA GeForce GTX 480 4 11.x 4.x ATI Radeon HD 5870

Table courtesy of A K Peters, Ltd. http://www.realtimerendering.com/

Shader Capabilities Evolution of the Programmable Graphics Pipeline

Table courtesy of A K Peters, Ltd. http://www.realtimerendering.com/ Slide from Mike Houston: http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf Evolution of the Programmable Evolution of the Programmable Graphics Pipeline Graphics Pipeline

►Not covered today: ƒ SM 5 / D3D 11 / GL 4 ƒ Tessellation shaders ►*cough* student presentation *cough* ƒ Later this semester: NVIDIA Fermi ►Dual warp scheduler ►Configurable L1 / shared memory ►Double precision ►…

Slide from Mike Houston: http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf

New Tool: AMD System Monitor ►Released 01/04/2011 A closer look at the ► http://support.amd.com/us/kbarticles/Pages/AMDSystemMonitor.aspx A closer look at the http://support.amd.com/us/kbarticles/Pages/AMDSystemMonitor.aspx fixed-function pipeline Pipeline Input ModelView Transformation

Vertex Image F(x,y) = (r,g,b,a) ►Vertices mapped from object space to world (x, y, z) space (r, g, b,a) ►M = model transformation (scene) (Nx,Ny,Nz)

(tx, ty,[tz]) ►V = view transformation (camera) Each matrix transform (tx, ty) X’ X is applied to each vertex in the input Y (tx, ty) Y’ stream. Think of this Material Z’ M * V * Z as a kernel operator. properties* W’ 1

Lighting Clipping/Projection/Viewport(3D)

Lighting information is combined with ►More matrix transformations that operate on normals and other parameters at each a vertex to transform it into the viewport vertex in order to create new colors. space. Color(v) = emissive + ambient + diffuse + ►Note that a vertex may be eliminated from specular the input stream (if it is clipped). Each term in the right hand side is a function of ►The viewport is two-dimensional: however, the vertex color, position, normal and material vertex z-value is retained for depth testing. properties.

Clip test is first example of a conditional in the pipeline. However, it is not a fully general conditional. Why ? Rasterizing+Interpolation Per-fragment operations

►All primitives are now converted to ►The rasterizer produces a stream of fragments. fragments. ►Data type change ! Vertices to fragments ►Each fragment undergoes a series of tests with increasing complexity. Texture coordinates are interpolated from Fragment attributes: Test 1: Scissor texture coordinates of vertices. Scissor test is analogous to clipping (r,g,b,a) If (fragment lies in fixed rectangle) operation in fragment space instead of This gives us a linear interpolation operator let it pass else discard it vertex space. (x,y,z,w) for free. VERY USEFUL ! (tx,ty), … Test 2: Alpha Alpha test is a slightly more general If( fragment.a >= ) conditional. Why ? let it pass else discard it.

Per-fragment operations Per-fragment operations

► Stencil test: S(x, y) is stencil buffer value for ► Stencil and depth tests are more general fragment with coordinates (x,y) conditionals. Why ? ► If f(S(x,y)), let pixel pass else kill it. ► These are the only tests that can change the state Update S(x, y) conditionally depending on of internal storage (stencil buffer, depth buffer). f(S(x,y)) and g(D(x,y)). ► One of the update operations for the stencil buffer ► Depth test: D(x, y) is depth buffer value. is a “count” operation. Remember this! ► If g(D(x,y)) let pixel pass else kill it. ► Unfortunately, stencil and depth buffers have Update D(x,y) conditionally. lower precision (8, 24 bits resp.) Post-processing Quick Review: Buffers

►Blending: pixels are accumulated into final ►Color Buffers storage ƒ Front-left new-val = old-val op pixel-value ƒ Front-right ƒ Back-left If op is +, we can sum all the (say) red ƒ Back-right components of pixels that pass all tests. ►Depth Buffer (z-buffer) Problem: In generation<= IV, blending can ►Stencil Buffer only be done in 8-bit channels (the channels ►Accumulation Buffer sent to the video card); precision is limited. Accumulation Buffer

We could use accumulation buffers, but they are very slow.

Quick Review: Tests Readback = Feedback

► Scissor Test What is the output of a “computation” ? If(fragment exists inside rectangle) keep Else 1. Display on screen. delete ► Alpha Test – Compare fragment’s alpha value against 2. Render to buffer and retrieve values (readback) reference value Readbacks are VERY slow ! ► Stencil Test – Compare fragment against stencil map What options do we have ? ► Depth Test – Compare a fragment’s depth to the depth 1. Render to off-screen buffers PCI and AGP buses are asymmetric: DMA like accumulation buffer value already present in the depth buffer enables fast transfer TO graphics card. ƒ Never Reverse transfer has traditionally not 2. Copy from framebuffer to ƒ Always texture memory ? ƒ Less been required, and is much slower. ƒ Less-Equal PCIe is symmetric but still very slow 3. Render directly to a texture ? ƒ Greater-Equal ƒ Greater compared to GPU speeds. ƒ Not-Equal This motivates idea of “pass” being an atomic “unit cost” operation. An Example: Voronoi Time for a puzzle… Diagrams.

Definition Example

►You are given n sites (p1, p2, p3, … pn) in the plane (think of each site as having a color) ►For any point p in the plane, it is closest to So how do we do this on the graphics card? some site pj. Color p with color i. Note, this does not use any programmable features of the card ►Compute this colored map on the plane. In other words, Compute the nearest-neighbour diagram of the sites. Hint: Think in one dimension The Procedure higher ►In order to compute the lower envelope, we need to determine, at each pixel, the fragment having the smallest depth value. ►This can be done with a simple depth test. ƒ Allow a fragment to pass only if it is smaller than the current depth buffer value, and update The lower envelope of “cones” centered at the points is the Voronoi the buffer accordingly. diagram of this set of points. ►The fragment that survives has the correct color.

Let’s make this more complicated A First Step

►The 1-median of a set of sites is a point q* Can we compute, for each pixel q, the value that minimizes the sum of distances from all F(q) = Σ d(p, q) sites to itself. q* = arg min Σ d(p, q) q* = arg min Σ d(p, q) We can use the cone trick from before, and instead of computing the minimum depth value, compute the sum of all depth values using blending.

WRONG ! RIGHT ! What’s the catch ? We can’t blend depth values !

► Using texture interpolation helps here. ► Instead of drawing a single cone, we draw a Now we apply a shaded cone, with an appropriately constructed texture map. streaming perspective… ► Then, fragment having depth z has color component 1.0 * z. ► Now we can blend the colors. ► OpenGL has an aggregation operator that will return the overall min

Warning: we are ignoring issues of precision.precision.

Two kinds of data Who has access ?

► Stream data (data ► “Persistent” data ► Memory “connectivity” in the graphics use of a GPU is tricky. associated with vertices (associated with buffers). ► In a traditional C program, all global variables can be written by all routines. and fragments) ƒ Depth, stencil, textures. ► In the fixed-function pipeline, certain data is private. ƒ Color/position/texture ► Can be modifed by ƒ A fragment cannot change a depth or stencil value of a location different coordinates. multiple fragments in a from its own. ƒ The framebuffer can be copied to a texture; a depth buffer cannot be ƒ Functionally similar to single pass. ƒ The framebuffer can be copied to a texture; a depth buffer cannot be member variables in a C++ copied in this way, and neither can a stencil buffer. object. ► Functionally similar to a ƒ Only a stencil buffer can count (efficiently) global array BUT each ► In the fixed-function pipeline, depth and stencil buffers can be used in ƒ Can be used for limited a multi-pass computation only via readbacks. message passing: I modify fragment only gets one fragment only gets one ► A texture cannot be written directly. an object state and send it location to change. to you. ► In programmable GPUs, the memory connectivity becomes more open, ► Can be used to but there are still constraints. communicate across passes. Understanding access constraints and memory “connectivity” is a key passes. step in programming the GPU. How does this relate to stream Graphics pipeline programs ? 3D API Commands 3D3D API: API: 3D3D ► The most important question to ask when OpenGLOpenGL or or ApplicationApplication Direct3D Or Game programming the GPU is: Direct3D Or Game

Data Stream CPU-GPU Boundary Command &

What can I do in one pass ? GPU

Vertex Pixel ► Limitations on memory connectivity mean that a Index Assembled Pixel Primitives Location Updates step in a computation may often have to be Stream Stream Rasterization GPU Primitive Raster Frame GPU Primitive and Frame deferred to a new pass. Front End Assembly Operations Buffer Front End Assembly Interpolation Buffer ► For example, when computing the second smallest element, we could not store the current minimum

in read/write memory. Vertex pipeline Fragment pipeline ► Thus, the “communication” of this value has to happen across a pass.