Modern GPU Architectures

Varun Sampath
University of Pennsylvania
CIS 565 - Spring 2012

Agenda / GPU Decoder Ring

• Fermi / GF100 / GeForce GTX 480
  – "Fermi Refined" / GF110 / GeForce GTX 580
  – "Little Fermi" / GF104 / GeForce GTX 460
• Cypress / Evergreen / RV870 / Radeon HD 5870
  – Cayman / Northern Islands / Radeon HD 6970
• Tahiti / Southern Islands / GCN / Radeon HD 7970
• Kepler / GK104 / GeForce GTX 680
• Future
  – Project Denver
  – Heterogeneous System Architecture

From G80/GT200 to Fermi

• GPU compute becomes a driver for innovation
  – Unified address space
  – Control flow advancements
  – Arithmetic performance
  – Atomics performance
  – Caching
  – ECC (is this seriously a graphics card?)
  – Concurrent kernel execution & fast context switching

Unified Address Space

• PTX 2.0 ISA supports 64-bit virtual addressing (40-bit in Fermi)
• CUDA 4.0+: address space shared with CPU
• Advantages?

[Image from NVIDIA]

Unified Address Space (continued)

cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyDefault)

• Runtime manages where buffers live
• Enables copies between different devices (not only GPUs) via DMA
  – Called GPUDirect
  – Useful for HPC clusters
• Pointers for global and shared memory are equivalent

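To make the call above concrete, here is a minimal sketch of a host program around it; the buffer name and size are illustrative. With unified virtual addressing (CUDA 4.0+ on a 64-bit platform), cudaMemcpyDefault lets the runtime infer the copy direction from the pointer values themselves.

    #include <cuda_runtime.h>

    int main() {
        float h_buf[1024];                      // host buffer (illustrative size)
        for (int i = 0; i < 1024; ++i) h_buf[i] = (float)i;

        float* d_buf = NULL;
        cudaMalloc((void**)&d_buf, sizeof(h_buf));   // device buffer

        // No explicit cudaMemcpyHostToDevice: with a unified address space,
        // the runtime can tell which pointer is host and which is device.
        cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyDefault);

        cudaFree(d_buf);
        return 0;
    }
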
Control Flow Advancements

• Predicated instructions
  – Avoid branching stalls (no branch predictor)
• Indirect function calls: call{.uni} fptr, flist;
• What does this enable support for?
  – Function pointers
  – Virtual functions
  – Exception handling
• Fermi gains support for recursion

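As a sketch of what the indirect call buys you, here is a hypothetical kernel (the names op_add, op_mul, and d_ops are invented for illustration) that selects a device function through a pointer table at run time, a pattern Fermi (sm_20) makes possible:

    #include <cuda_runtime.h>

    __device__ float op_add(float a, float b) { return a + b; }
    __device__ float op_mul(float a, float b) { return a * b; }

    typedef float (*binop_t)(float, float);

    // Statically initialized table of device function pointers; calling
    // through it compiles to an indirect call because the target is
    // unknown at compile time.
    __device__ binop_t d_ops[2] = { op_add, op_mul };

    __global__ void apply(float* out, const float* a, const float* b, int which) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = d_ops[which](a[i], b[i]);
    }
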
Arithmetic

• Improved support for the IEEE 754-2008 floating-point standard
• Double-precision performance at half speed
• Native 32-bit integer arithmetic
• Does any of this help for graphics code?

Cache Hierarchy

• 64KB L1 cache per SM
  – Split into 16KB and 48KB pieces
  – Developer chooses whether shared memory or cache gets the larger share
• 768KB L2 cache per GPU
  – Makes atomics really fast. Why?
  – 128B cache line
  – Loosens memory coalescing constraints

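The 16/48 KB split is a per-kernel choice. A plausible sketch using the runtime API (both kernel names are invented placeholders):

    #include <cuda_runtime.h>

    __global__ void pointer_chase() { /* irregular, cache-hungry accesses */ }
    __global__ void tiled_matmul()  { /* heavy __shared__ tile usage */ }

    void configure_caches() {
        // Prefer 48 KB L1 / 16 KB shared for irregular access patterns...
        cudaFuncSetCacheConfig(pointer_chase, cudaFuncCachePreferL1);
        // ...and 16 KB L1 / 48 KB shared when shared memory is the bottleneck.
        cudaFuncSetCacheConfig(tiled_matmul, cudaFuncCachePreferShared);
    }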

The Fermi SM

• Dual warp schedulers – why?
• Two banks of 16 CUDA cores, 16 LD/ST units, 4 SFU units
• A warp can now complete as quickly as 2 cycles

[Image from NVIDIA]

The Stats

[Image from Stanford CS193g]

The Review in March 2010

• Compute performance unbelievable; gaming performance, on the other hand…
  – "The GTX 480… it's hotter, it's noisier, and it's more power hungry, all for 10-15% more performance." – AnandTech (article titled "6 months late, was it worth the wait?")
• Massive 550mm² die
  – Only 14 of 16 SMs could be enabled (480/512 cores)

"Fermi Refined" – GTX 580

• All 32-core SMs enabled
• Clocks ~10% higher
• Transistor mix enables lower power consumption

"Little Fermi" – GTX 460

• Smaller memory bus (256-bit vs. 384-bit)
• Much lower transistor count (1.95B)
• Superscalar execution: one scheduler dual-issues
• Reduce overhead per core?

[Image from AnandTech]

A 2010 Comparison

                        NVIDIA GeForce GTX 480   ATI Radeon HD 5870
  Cores                 480                      1600
  Memory bandwidth      177.4 GB/s               153.6 GB/s
  Single precision      1.34 TFLOPS              2.72 TFLOPS
  Transistors           3 billion                2.15 billion

Over double the FLOPS with fewer transistors!

VLIW Architecture

• Very-Long-Instruction-Word
• Each instruction clause contains up to 5 instructions for the ALUs to execute in parallel
  + Saves on scheduling and interconnect (clause "packing" done by the compiler)
  – Utilization
• What is going on here?

[Image from AnandTech]

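The utilization question comes down to instruction-level parallelism within a clause. A schematic, CUDA-flavored sketch for illustration (AMD's compiler faces the same dependence constraints when packing VLIW clauses; the function names are invented):

    // Good case: four independent multiply-adds can share one clause.
    __device__ float4 independent(float4 v, float4 a, float4 b) {
        v.x = v.x * a.x + b.x;   // these four operations have no
        v.y = v.y * a.y + b.y;   // dependencies on one another, so a
        v.z = v.z * a.z + b.z;   // VLIW compiler can pack them together:
        v.w = v.w * a.w + b.w;   // roughly 4 of 5 slots used
        return v;
    }

    // Bad case: a serial dependency chain leaves slots idle in every
    // clause; utilization collapses toward 1 of 5.
    __device__ float dependent(float x, float a, float b) {
        x = x * a + b;   // each line needs the previous result,
        x = x * a + b;   // so nothing can be packed alongside it
        x = x * a + b;
        x = x * a + b;
        return x;
    }
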
Assembly Example

AMD VLIW IL vs. NVIDIA PTX

[Image from AnandTech]

Execution Comparison

[Image from AnandTech]

The Rest of Cypress

• 16 streaming processors (SPs) packed into a SIMD core / compute unit (CU)
  – Executes a "wavefront" (a 64-thread warp) over 4 cycles
• 20 CUs * 16 SPs * 5 ALUs = 1600 ALUs

[Image from AnandTech]

Performance Notes

• VLIW architecture relies on instruction-level parallelism
• Excels at heavy floating-point workloads with low register dependencies
  – Shaders?
• Memory hierarchy not as aggressive as Fermi's
  – Read-only texture caches
  – Can't feed all of the ALUs in an SP in parallel
• Fewer registers per ALU
• Lower LDS capacity and bandwidth per ALU

[Charts: Black-Scholes OpenCL performance (work-group size 256, processing 8 million options; execution time in seconds vs. number of work-items, 16384 to 65536) and SAT OpenCL performance (work-group size 256; execution time in seconds vs. problem size, 256x256 to 2048x2048), each comparing Fermi and Barts]

Optimizing for AMD Architectures

• Many of the ideas are the same – only the constants and names change
  – Staggered offsets (partition camping)
  – Local Data Share (shared memory) bank conflicts (see the sketch below)
  – Memory coalescing
  – Mapped and pinned memory
  – NDRange (grid) and work-group (block) sizing
  – Loop unrolling
• Big change: be aware of VLIW utilization
• Consult the AMD OpenCL Programming Guide

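As one concrete instance of the list above, here is the classic padding trick for shared-memory bank conflicts, sketched in CUDA terms (the same idea applies to AMD's LDS); the kernel is illustrative and assumes n is a multiple of TILE:

    #define TILE 32

    __global__ void transpose_tile(float* out, const float* in, int n) {
        // A TILE x TILE array would map a column walk onto a single bank;
        // padding each row by one element staggers the columns across banks.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }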

AMD's Cayman Architecture – A Shift

• AMD found average VLIW utilization in games was 3.4/5
• Shift to a VLIW4 architecture
• Increased SIMD core count at the expense of VLIW width
• Found in the Radeon HD 6970 and 6950

Paradigm Shift – Graphics Core Next

• Switch to a SIMD-based instruction set architecture (no VLIW)
  – 16-wide SIMD units executing a wavefront
  – 4 SIMD units + 1 scalar unit per compute unit
  – Hardware scheduling
• Memory hierarchy improvements
  – Read/write L1 & L2 caches, larger LDS
• Programming goodies
  – Unified address space, exceptions, functions, recursion, fast context switching
• Sound familiar?

Radeon HD 7970: 32 CUs * 4 SIMDs/CU * 16 ALUs/SIMD = 2048 ALUs

[Image from AMD]

NVIDIA's Kepler

                          GeForce GTX 680   GeForce GTX 580
  Shader processors       1536              512
  Process                 28nm              40nm
  Memory bandwidth        192.2 GB/s        192.4 GB/s
  TDP                     195W              244W
  Double-precision rate   1/24              1/8
  Transistors             3.5 billion       3 billion

• A focus on efficiency
• Transistor scaling alone cannot account for the massive core-count increase or the lower power consumption
• Kepler die size is 56% of GF110's

[Image from AMD]

Kepler's SMX

• Removal of the shader clock means a warp executes in 1 GPU clock cycle
  – Need 32 SPs per clock
  – Ideally 8 instructions issued every cycle (2 for each of 4 warps)
• Kepler SMX has 6x the SP count
  – 192 SPs, 32 LD/ST units, 32 SFUs
  – New FP64 block
• Memory (compared to Fermi)
  – Register file size has only doubled
  – Shared memory/L1 size is the same
  – L2 decreases

[Image from NVIDIA]

Performance

• GTX 680 may be a compute regression but a gaming leap
• "Big Kepler" expected to remedy the gap
  – Double performance necessary

[Images from AnandTech]

Future: Integration

• Hardware benefits from merging CPU & GPU
  – Mobile (smartphone/laptop): lower energy and area consumption
  – Desktop/HPC: higher density, interconnect bandwidth
• Software benefits
  – Mapped pointers, unified addressing, consistency rules, and programming languages all point toward the GPU as a vector co-processor

The Contenders

• AMD – Heterogeneous System Architecture
  – Virtual ISA, makes use of CPU or GPU transparent
  – Enabled by Fusion APUs, blending x86 and AMD GPUs
• NVIDIA – Project Denver
  – Desktop-class ARM processor effort
  – Target server/HPC market?
• Intel
  – Larrabee → Intel MIC

References

• NVIDIA Fermi Compute Architecture Whitepaper. Link
• NVIDIA GeForce GTX 680 Whitepaper. Link
• Schroeder, Tim C. "Peer-to-Peer & Unified Virtual Addressing." CUDA Webinar. Slides
• AMD Financial Analyst Day Presentations. Link
• AnandTech Review of the Radeon HD 5870. Link
• AMD OpenCL Programming Guide (v1.3f). Link
• NVIDIA CUDA Programming Guide (v4.2). Link

Bibliography

• Beyond3D's Fermi GPU and Architecture Analysis. Link
• RWT's article on Fermi. Link
• AnandTech Review of the GTX 460. Link
• AnandTech Review of the GTX 680. Link