Modern GPU Architectures
Varun Sampath
University of Pennsylvania
CIS 565 – Spring 2012

Agenda / GPU Decoder Ring
• Fermi / GF100 / GeForce GTX 480
  – "Fermi Refined" / GF110 / GeForce GTX 580
  – "Little Fermi" / GF104 / GeForce GTX 460
• Cypress / Evergreen / RV870 / Radeon HD 5870
  – Cayman / Northern Islands / Radeon HD 6970
• Tahiti / Southern Islands / GCN / Radeon HD 7970
• Kepler / GK104 / GeForce GTX 680
• Future
  – Project Denver
  – Heterogeneous System Architecture

From G80/GT200 to Fermi
• GPU compute becomes a driver for innovation
  – Unified address space
  – Control flow advancements
  – Arithmetic performance
  – Atomics performance
  – Caching
  – ECC (is this seriously a graphics card?)
  – Concurrent kernel execution & fast context switching

Unified Address Space
• PTX 2.0 ISA supports 64-bit virtual addressing (40-bit in Fermi)
• CUDA 4.0+: address space shared with the CPU
• Advantages?
(Image from NVIDIA)

Unified Address Space (continued)
cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyDefault)
• Runtime manages where buffers live
• Enables copies between different devices (not only GPUs) via DMA
  – Called GPUDirect
  – Useful for HPC clusters
• Pointers for global and shared memory are equivalent
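The single cudaMemcpy line on the slide above is doing a lot of work. A minimal sketch of how it fits together, assuming CUDA 4.0+ on a 64-bit host with a Fermi-class GPU; the buffer names and the 1 MB size are invented for illustration:

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        float *h_buf = NULL, *d_buf = NULL;

        // Pinned host memory is mapped into the same virtual address space as
        // device memory (plain malloc also works with cudaMemcpyDefault, but
        // pinned buffers allow faster DMA transfers).
        cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);
        cudaMalloc((void**)&d_buf, bytes);

        // With unified virtual addressing the runtime infers the direction from
        // the pointers themselves, so the single kind cudaMemcpyDefault replaces
        // cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost.
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }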
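The GPUDirect / DMA bullet also covers direct device-to-device transfers. A hedged sketch of a peer-to-peer copy between two GPUs using the CUDA 4.0 peer-access API; the device indices and buffer names are invented, and at least two GPUs are assumed:

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        float *buf0 = NULL, *buf1 = NULL;

        cudaSetDevice(0);
        cudaMalloc((void**)&buf0, bytes);
        cudaSetDevice(1);
        cudaMalloc((void**)&buf1, bytes);

        // If device 1 can address device 0's memory, enable peer access so the
        // copy travels over PCIe DMA instead of staging through host memory.
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);
        if (canAccess)
            cudaDeviceEnablePeerAccess(0, 0);   // current device is 1; map device 0

        // UVA lets the runtime work out which GPU owns each pointer.
        cudaMemcpy(buf1, buf0, bytes, cudaMemcpyDefault);

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }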
Control Flow Advancements
• Predicated instructions
  – avoid branching stalls (no branch predictor)
• Indirect function calls: call{.uni} fptr, flist;
• What does this enable support for?
  – Function pointers
  – Virtual functions
  – Exception handling
• Fermi gains support for recursion

Arithmetic
• Improved support for the IEEE 754-2008 floating-point standard
• Double-precision performance at half the single-precision rate
• Native 32-bit integer arithmetic
• Does any of this help for graphics code?

Cache Hierarchy
• 64KB L1 cache per SM
  – Split into 16KB and 48KB pieces
  – Developer chooses whether shared memory or cache gets the larger space
• 768KB L2 cache per GPU
  – Makes atomics really fast. Why?
  – 128B cache line
  – Loosens memory coalescing constraints
(Image from NVIDIA)

The Fermi SM
• Dual warp schedulers – why?
• Two banks of 16 CUDA cores, 16 LD/ST units, 4 SFU units
• A warp can now complete as quickly as 2 cycles
(Image from NVIDIA)

The Stats
(Image from Stanford CS193g)

The Review in March 2010
• Compute performance unbelievable, gaming performance on the other hand…
  – "The GTX 480… it's hotter, it's noisier, and it's more power hungry, all for 10-15% more performance." – AnandTech (article titled "6 months late, was it worth the wait?")
• Massive 550 mm² die
  – Only 14 of 16 SMs could be enabled (480 of 512 cores)

"Fermi Refined" – GTX 580
• All 16 32-core SMs enabled
• Clocks ~10% higher
• Transistor mix enables lower power consumption

"Little Fermi" – GTX 460
• Smaller memory bus (256-bit vs. 384-bit)
• Much lower transistor count (1.95B)
• Superscalar execution: one scheduler dual-issues
• Reduce overhead per core?
(Image from AnandTech)

A 2010 Comparison
• NVIDIA GeForce GTX 480: 480 cores, 177.4 GB/s memory bandwidth, 1.34 TFLOPS single precision, 3 billion transistors
• ATI Radeon HD 5870: 1600 cores, 153.6 GB/s memory bandwidth, 2.72 TFLOPS single precision, 2.15 billion transistors
• Over double the FLOPS for fewer transistors! What is going on here?

VLIW Architecture
• Very Long Instruction Word
• Each instruction clause contains up to 5 instructions for the ALUs to execute in parallel
+ Saves on scheduling and interconnect (clause "packing" done by the compiler)
– Utilization
(Image from AnandTech)

Assembly Example
• AMD VLIW IL vs. NVIDIA PTX (side-by-side listings in the original slide)

Execution Comparison
(Image from AnandTech)

The Rest of Cypress
• 16 Streaming Processors packed into a SIMD Core / compute unit (CU)
  – Execute a "wavefront" (a 64-thread warp) over 4 cycles
• 20 CUs * 16 SPs * 5 ALUs = 1600 ALUs
(Image from AnandTech)

Performance Notes
• VLIW architecture relies on instruction-level parallelism
• Excels at heavy floating-point workloads with low register dependencies
  – Shaders?
• Memory hierarchy not as aggressive as Fermi's
  – Read-only texture caches
  – Can't feed all of the ALUs in an SP in parallel
• Fewer registers per ALU
• Lower LDS capacity and bandwidth per ALU

(Chart: Black-Scholes OpenCL performance with a work-group size of 256, processing 8 million options – execution time (s) vs. number of work-items from 16384 to 65536, Fermi vs. Barts)
(Chart: SAT OpenCL performance with a work-group size of 256 – execution time (s) vs. problem size from 256x256 to 2048x2048, Fermi vs. Barts)

Optimizing for AMD Architectures
• Many of the ideas are the same – the constants & names just change
  – Staggered offsets (partition camping)
  – Local Data Share (shared memory) bank conflicts
  – Memory coalescing
  – Mapped and pinned memory
  – NDRange (grid) and work-group (block) sizing
  – Loop unrolling
• Big change: be aware of VLIW utilization
• Consult the OpenCL Programming Guide

AMD's Cayman Architecture – A Shift
• AMD found average VLIW utilization in games was 3.4/5
• Shift to a VLIW4 architecture
• Increased SIMD core count at the expense of VLIW width
• Found in the Radeon HD 6970 and 6950

Paradigm Shift – Graphics Core Next
• Switch to a SIMD-based instruction set architecture (no VLIW)
  – 16-wide SIMD units executing a wavefront
  – 4 SIMD units + 1 scalar unit per compute unit
  – Hardware scheduling
• Memory hierarchy improvements
  – Read/write L1 & L2 caches, larger LDS
• Programming goodies
  – Unified address space, exceptions, functions, recursion, fast context switching
• Sound familiar?
• Radeon HD 7970: 32 CUs * 4 SIMDs/CU * 16 ALUs/SIMD = 2048 ALUs
(Image from AMD)
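Recursion appears both in the Fermi control-flow list and in GCN's "programming goodies" above. A minimal sketch of device-side recursion as CUDA exposes it on sm_20 (Fermi-class) and later parts; the factorial kernel and the stack-size value are invented purely for illustration:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each thread walks its own call stack; recursion depth is limited by the
    // per-thread stack size, which can be raised with cudaDeviceSetLimit.
    __device__ int factorial(int n) {
        return (n <= 1) ? 1 : n * factorial(n - 1);
    }

    __global__ void fact_kernel(int *out) {
        out[threadIdx.x] = factorial(threadIdx.x + 1);
    }

    int main() {
        cudaDeviceSetLimit(cudaLimitStackSize, 4096);   // bytes per thread

        int *d_out = NULL;
        cudaMalloc((void**)&d_out, 8 * sizeof(int));
        fact_kernel<<<1, 8>>>(d_out);                   // compile with: nvcc -arch=sm_20

        int h_out[8];
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 8; ++i)
            printf("%d! = %d\n", i + 1, h_out[i]);
        cudaFree(d_out);
        return 0;
    }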
NVIDIA's Kepler
• NVIDIA GeForce GTX 680: 1536 SPs, 28nm process, 192.2 GB/s memory bandwidth, 195W TDP, 1/24 double-precision rate, 3.5 billion transistors
• NVIDIA GeForce GTX 580: 512 SPs, 40nm process, 192.4 GB/s memory bandwidth, 244W TDP, 1/8 double-precision rate, 3 billion transistors
• A focus on efficiency
• Transistor scaling not enough to account for the massive core count increase or the power consumption
• Kepler die size is 56% of GF110's
(Image from AMD)

Kepler's SMX
• Removal of the shader clock means a warp executes in 1 GPU clock cycle
  – Need 32 SPs per clock
  – Ideally 8 instructions issued every cycle (2 for each of 4 warps)
• Kepler SMX has 6x the SP count
  – 192 SPs, 32 LD/ST units, 32 SFUs
  – New FP64 block
• Memory (compared to Fermi)
  – Register file size has only doubled
  – Shared memory/L1 size is the same
  – L2 decreases
(Image from NVIDIA)

Performance
• GTX 680 may be a compute regression but a gaming leap
• "Big Kepler" expected to remedy the gap
  – Double performance necessary
(Images from AnandTech)

Future: Integration
• Hardware benefits from merging CPU & GPU
  – Mobile (smartphone / laptop): lower energy and area consumption
  – Desktop / HPC: higher density, interconnect bandwidth
• Software benefits
  – Mapped pointers, unified addressing, consistency rules, and programming languages all point toward the GPU as a vector co-processor

The Contenders
• AMD – Heterogeneous System Architecture
  – Virtual ISA, makes use of CPU or GPU transparent
  – Enabled by Fusion APUs, blending x86 and AMD GPUs
• NVIDIA – Project Denver
  – Desktop-class ARM processor effort
  – Target server/HPC market?
• Intel
  – Larrabee → Intel MIC

References / Bibliography
• NVIDIA Fermi Compute Architecture Whitepaper. Link
• NVIDIA GeForce GTX 680 Whitepaper. Link
• Schroeder, Tim C. "Peer-to-Peer & Unified Virtual Addressing." CUDA Webinar. Slides
• AMD Financial Analyst Day Presentations. Link
• AMD OpenCL Programming Guide (v1.3f). Link
• NVIDIA CUDA Programming Guide (v4.2). Link
• Beyond3D's Fermi GPU and Architecture Analysis. Link
• RWT's article on Fermi. Link
• AnandTech Review of the Radeon HD 5870. Link
• AnandTech Review of the GTX 460. Link
• AnandTech Review of the GTX 680. Link