Fermi GF100 Graphics Processing Unit (GPU)
Total Page:16
File Type:pdf, Size:1020Kb
Fermi GF100 Graphics Processing Unit (GPU) Craig M. Wittenbrink, Emmett Kilgariff, Arjun Prabhu Acknowledgements We would like to thank Jonah Alben, John Robinson, Mark Daly, Henry Moreton, and the entire Fermi GF100 Development team for success of this project Outline GF100 Specifications GF100 Architecture Tessellation Physics Computational Graphics Demos GF100 Specifications 3 Billion Transistors in 40 nm process (TSMC) Up to 512 CUDA / unified shader cores 384-bit GDDR5 memory interface, 6GB capacity GeForce GTX480: Graphics Enthusiast Full Microsoft DirectX 11 Shader Model 5.0 32x coverage sample antialiasing (AA) 128-bit floating point high dynamic-range lighting with AA Tesla C2070: Compute HPC CUDA programming: C, C++, OpenCL, DirectCompute, or Fortran 515 GigaFLOPS double-precision peak floating point ECC Register Files, L1/L2 caches, shared memory and DRAM GF100 Host Interface GigaThread Engine GPC GPC Raster Engine Raster Engine Architecture Memory Controller SM SM SM SM SM SM SM SM 512 CUDA cores Memory Controller 16 PolyMorph Engines Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine 4 raster units Memory Controller 64 texture units L2 Cache Memory Controller 48 ROP units Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine Polymorph Engine 384-bit GDDR5 Memory Controller SM SM SM SM SM SM SM SM Memory Controller Raster Engine Raster Engine GPC GPC GF100 SM Architecture SM – Streaming Multiprocessor 32 CUDA cores: 4x GT200 GT200 – previous high-end GPU by NVIDIA 48 or 16KB of shared memory: 3x GT200 16 or 48KB of L1 cache: No L1 on GT200 ISA improvements 32-bit integer operations FMA (fused multiply add) IEEE-754 2008 4 Texture units 1 PolyMorph Engine Cache Architecture for Graphics Chip Hull, Domain & Vertex Fetch Vertex Shader Rasterizer Pixel Shader ROP Data stays on die Geomtry Shaders L1 & L2 Caches L1 cache Register spilling Stack ops DRAM Global LD/ST L2 Cache Vertex, SM, Texture and ROP Data Cache Architecture Comparison GT200 GF100 Benefit L1 Texture Cache 12 KB 12 KB Fast texture filtering (per SM) Dedicated L1 ✘ 16 or 48 KB Efficient physics and LD/ ST Cache ray tracing Total Shared 16KB 16 or 48 KB More data reuse Memory among threads L2 Cache 256KB 768 KB Greater texture (TEX read only) (all clients coverage, robust read/ write) compute performance DX10 Game Problems, too little detail Geometry throughput is challenge Flat head Silhouette results from coarse triangulation There are other artifacts Image from Far Cry® 2, courtesy of Ubisoft Offline Film Rendering uses Tessellation Tessellation + displacement mapping modelling standard Rich geometric detail More Geometry, Takes More time © Disney Enterprises, Inc. and Jerry Bruckheimer, Inc. All rights reserved. Image courtesy Industrial Light & Magic. Tessellation & Displacement Maps Add Geometric Detail “z” Original Displaced vertices “y” “x” Geometry “v” “u” Tessellated Geometry Final Geometry Displacement Map http://en.wikipedia.org/wiki/File:Displacement.jpg Tessellation in DirectX 11 From input assembly Vertex Hull shader input new patch primitive Patch New pipeline Assembly Outputs tessFactors, modes stages Runs pre-expansion (1 arrow) Hull Control Tessellator Fixed function tessellation stage points input TessFactors and modes Domain Outputs triangles and lines On chip data expansion (3 arrows) Primitive Assembly Geometry Tessellation in DirectX 11, cont From input assembly Vertex Domain shader Input TessFactors, (u,v) domain points Patch Assembly Outputs vertices in (x,y,z,w) Hull “v” “u” Control Tessellator points “z” Domain “z” “x” Primitive Assembly “y” Geometry GF100 Scalable Parallel Implementation Raster Engine HostHost InterfaceInterface GigaThread EngineEngine Edge Setup Rasterizer Z-Cull GPC GPC Raster Engine RasterRaster Engine Engine Memory Controller Memory Controller SM SM SM SM SM SM SM SM PolyMorph Engine Vertex Fetch Tessellator Viewport Transform GPC GPC SM SM SM SM SM SM SM SM Memory Controller Memory Controller Attribute Setup Stream Output Polymorph Engine Polymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine Memory Controller Memory Controller Distributed, parallel geometry 4 Raster Engines L2L2 CacheCache Memory Controller 16 PolyMorph Engines Memory Controller Polymorph Engine Polymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph EngineEngine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine PolymorphPolymorph Engine Engine 8x geometry performance Memory Controller Memory Controller SM SM SM SM SM SM SM SM vs. GT200 GPC GPC SM SM SM SM SM SM SM SM Memory Controller Memory Controller Enables > 2.5 triangles/clock Raster Engine RasterRaster Engine Engine GPC GPC GF100 Enables Geometric Realism Tessellation provides interactive smooth silhouette Film like capabilities in games More geometry detail—3D Stereo viewing benefits Dynamic shadows 3D Vision™ Memory footprint & BW savings Store coarse geometry, expand on-demand Enables more complex animations Scalability Dynamic LOD allows for performance/quality tradeoffs Scale into the future – resolution, compute power © Kenneth Scott, id Software 2008 Dramatic Increase in Geometry Capability 2 Shader TeraFLOP/s triangle/s GeForce GTX 480 1.8 1.6 Geometry 1.4 1.2 1 0.8 0.6 Shader 0.4 0.2 0 Ti4600 6800 Ultra 8800GTX GTX285 GTX480 20022004 2006 2008 2010 Advance in Geometric Complexity 20 18 16 14 12 10 8 6 4 2 Millions of polygons per frame per polygons of Millions - 2003 2010 Physics: PhysX® Example: Fluid Simulation PhysX® is NVIDIA physics engine Uses CUDA SW stack Used in over 150 computer games Particle-based fluid simulation (SPH) 128,000 particles Includes surface tension 3x speed up going from GT200 to GF100 67 to 200 fps Games fluid for water splashes, mud, blood… Computational Graphics: Image Processing – Motion Blur Object Color Buffer Previous Shading Model Position Back Buffer Vertex Unblurred Shader Image Current Motion Blur Model Shader Final Blurred Position Image Pixel Velocity Buffer Previous Shader Screen Space Camera Multiple Position velocity of render every pixel target Current (MRT) Camera Position Detected Zones Computational Graphics: Ray Tracing Combine Rasterization with Raytracing Rasterize primary rays Raytrace shadows and reflections 4x faster than GT200 Efficient use of cache architecture GF100 GPU Compute Performance 4.0 3.5 3.0 2.5 2.0 GF100 GT200GTX285 1.5 1.0 0.5 0.0 © Capcom, Inc. PhysX Fluid Dark Void Ray Tracing AI Demo: SuperSonic Sled--tessellation, physics, and computational graphics SuperSonic Sled: Tessellation - Terrain Terrain Displacement Map SuperSonic Sled: Physics – Smoke, Bridge, Fireballs f (p,v Large ,m) Pieces 128 CPU Objects 128 GPU Dust Small Chunks (1500 Particles Objects) GPU Debris SuperSonic Sled: Computational Graphics- Motion Blur Demo Conclusions GF100 Improves Geometry processing up to 8X over GT200 New Architecture and New family of chips GF100 Architecture breaks 1 triangle per clock barrier More than 2.5 Triangles/clock High throughput tessellation has benefits for interactive gaming Demonstration of Tessellation Physics Computational Graphics.