Fermi GF100 (GPU)

Craig M. Wittenbrink, Emmett Kilgariff, Arjun Prabhu Acknowledgements

We would like to thank Jonah Alben, John Robinson, Mark Daly, Henry Moreton, and the entire Fermi GF100 Development team for success of this project Outline

GF100 Specifications GF100 Architecture Tessellation Physics Computational Graphics Demos GF100 Specifications

3 Billion Transistors in 40 nm process (TSMC) Up to 512 CUDA / unified shader cores 384-bit GDDR5 memory interface, 6GB capacity GeForce GTX480: Graphics Enthusiast Full Microsoft DirectX 11 Shader Model 5.0 32x coverage sample antialiasing (AA) 128-bit floating point high dynamic-range lighting with AA Tesla C2070: Compute HPC CUDA programming: C, C++, OpenCL, DirectCompute, or Fortran 515 GigaFLOPS double-precision peak floating point ECC Register Files, L1/L2 caches, shared memory and DRAM GF100 Host Interface GigaThread Engine

GPC GPC Raster Engine Raster Engine

Architecture Controller Memory SM SM SM SM SM SM SM SM 512 CUDA cores Memory Controller 16 PolyMorph Engines

Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine 4 raster units Controller Memory 64 texture units L2 Cache Memory Controller 48 ROP units Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine

384-bit GDDR5 Controller Memory

SM SM SM SM SM SM SM SM Memory Controller

Raster Engine Raster Engine GPC GPC GF100 SM Architecture

SM – Streaming Multiprocessor 32 CUDA cores: 4x GT200 GT200 – previous high-end GPU by NVIDIA 48 or 16KB of shared memory: 3x GT200 16 or 48KB of L1 cache: No L1 on GT200 ISA improvements 32-bit integer operations FMA (fused multiply add) IEEE-754 2008 4 Texture units 1 PolyMorph Engine Cache Architecture for Graphics

Chip

Hull, Domain & Vertex Fetch Vertex Shader Rasterizer Pixel Shader ROP Data stays on die Geomtry Shaders

L1 & L2 Caches L1 cache Register spilling

Stack ops DRAM Global LD/ST

L2 Cache Vertex, SM, Texture and ROP Data Cache Architecture Comparison

GT200 GF100 Benefit

L1 Texture Cache 12 KB 12 KB Fast texture filtering (per SM) Dedicated L1 ✘ 16 or 48 KB Efficient physics and LD/ ST Cache ray tracing Total Shared 16KB 16 or 48 KB More data reuse Memory among threads L2 Cache 256KB 768 KB Greater texture (TEX read only) (all clients coverage, robust read/ write) compute performance DX10 Game Problems, too little detail

Geometry throughput is challenge Flat head Silhouette results from coarse triangulation There are other artifacts

Image from Far Cry® 2, courtesy of Ubisoft Offline Film Rendering uses Tessellation Tessellation + displacement mapping modelling standard Rich geometric detail More , Takes More time

© Disney Enterprises, Inc. and Jerry Bruckheimer, Inc. All rights reserved. Image courtesy Industrial Light & Magic. Tessellation & Displacement Maps Add Geometric Detail

“z” Original Displaced vertices “y” “x” Geometry “v” “u” Tessellated Geometry

Final Geometry Displacement Map

http://en.wikipedia.org/wiki/File:Displacement.jpg Tessellation in DirectX 11 From input assembly

Vertex Hull shader input new patch primitive Patch New pipeline Assembly Outputs tessFactors, modes stages Runs pre-expansion (1 arrow) Hull

Control Tessellator Fixed function tessellation stage points input TessFactors and modes Domain Outputs triangles and lines

On chip data expansion (3 arrows) Primitive Assembly

Geometry Tessellation in DirectX 11, cont From input assembly

Vertex Domain shader Input TessFactors, (u,v) domain points Patch Assembly Outputs vertices in (x,y,z,w)

Hull “v” “u” Control Tessellator points

“z” Domain “z”

“x” Primitive Assembly

“y” Geometry GF100 Scalable Parallel Implementation

Raster Engine HostHost InterfaceInterface GigaThread EngineEngine Edge Setup Rasterizer Z-Cull GPC GPC Raster Engine RasterRaster Engine Engine Memory Controller Memory Memory Controller Memory SM SM SM SM SM SM SM SM PolyMorph Engine

Vertex Fetch Tessellator Viewport Transform GPC GPC

SM SM SM SM SM SM SM SM Memory Controller Memory Controller Attribute Setup Stream Output

Polymorph Engine Polymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine Memory Controller Memory Memory Controller Memory Distributed, parallel geometry 4 Raster Engines L2L2 CacheCache Memory Controller 16 PolyMorph Engines Memory Controller Polymorph Engine Polymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine 8x geometry performance Memory Controller Memory Memory Controller Memory SM SM SM SM SM SM SM SM vs. GT200 GPC GPC

SM SM SM SM SM SM SM SM Memory Controller Memory Controller Enables > 2.5 triangles/clock Raster Engine RasterRaster Engine Engine GPC GPC GF100 Enables Geometric Realism

Tessellation provides interactive smooth silhouette Film like capabilities in games

More geometry detail—3D Stereo viewing benefits Dynamic shadows 3D Vision™

Memory footprint & BW savings Store coarse geometry, expand on-demand Enables more complex animations

Scalability Dynamic LOD allows for performance/quality tradeoffs

Scale into the future – resolution, compute power © Kenneth Scott, id Software 2008 Dramatic Increase in Geometry Capability

2 Shader TeraFLOP/s triangle/s GeForce GTX 480 1.8

1.6 Geometry 1.4

1.2

1

0.8

0.6 Shader

0.4

0.2

0 Ti4600 6800 Ultra 8800GTX GTX285 GTX480 20022004 2006 2008 2010 Advance in Geometric Complexity 20 18 16 14 12 10 8 6 4 2

Millions of polygons per frame per polygons of Millions -

2003 2010 Physics: PhysX® Example: Fluid Simulation

PhysX® is NVIDIA physics engine Uses CUDA SW stack Used in over 150 computer games

Particle-based fluid simulation (SPH) 128,000 particles Includes tension

3x speed up going from GT200 to GF100 67 to 200 fps

Games fluid for water splashes, mud, blood… Computational Graphics: Image Processing – Motion Blur Object Color Buffer Previous Shading Model Position Back Buffer Vertex Unblurred Shader Image Current Motion Blur Model Shader Final Blurred Position Image Pixel Velocity Buffer Previous Shader Screen Space Camera Multiple Position velocity of render every pixel target Current (MRT) Camera Position Detected Zones Computational Graphics: Ray Tracing

Combine Rasterization with Raytracing Rasterize primary rays Raytrace shadows and reflections

4x faster than GT200 Efficient use of cache architecture GF100 GPU Compute Performance

4.0

3.5

3.0

2.5

2.0 GF100 GT200GTX285 1.5

1.0

0.5

0.0 © Capcom, Inc. PhysX Fluid Dark Void Ray Tracing AI Demo: SuperSonic Sled--tessellation, physics, and computational graphics SuperSonic Sled: Tessellation - Terrain

Terrain Displacement Map SuperSonic Sled: Physics – Smoke, Bridge, Fireballs f (p,v Large ,m) Pieces 128 CPU Objects 128 GPU Dust Small Chunks (1500 Particles Objects) GPU Debris SuperSonic Sled: Computational Graphics- Motion Blur Demo Conclusions

GF100 Improves Geometry processing up to 8X over GT200 New Architecture and New family of chips GF100 Architecture breaks 1 triangle per clock barrier More than 2.5 Triangles/clock High throughput tessellation has benefits for interactive gaming Demonstration of Tessellation Physics Computational Graphics