Amd “Trinity” Apu
Total Page:16
File Type:pdf, Size:1020Kb
HOT CHIPS 2012 AMD “TRINITY” APU Sebastien Nussbaum AMD Fellow Trinity SOC Architect AMD APU “TRINITY” WITH AMD DISCRETE CLASS GRAPHICS ALL NEW ARCHITECTURE FOR UP TO 50% GPU1 AND UP TO 25% BETTER X86 PERFORMANCE2 . “Piledriver” Cores – Improved performance and power efficiency – 3rd-Gen Turbo Core technology – Quad CPU Core with total of 4MB L2 . 2nd-Gen AMD Radeon™ with DirectX® 11 support – 384 Radeon™ Cores 2.0 . HD Media Accelerator – Accelerates and improves HD playback – Accelerates media conversion – Improves streaming media – Allows for smooth wireless video . Enhanced Display Support – AMD Eyefinity Technology3 – 3 Simultaneous DisplayPort 1.2 or HDMI/DVI links – Up to 4 display heads with display multi-streaming 2 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 “TRINITY” FLOORPLAN 32nm SOI, 246mm2, 1.303BN TRANSISTORS DDR3 Controller Dual Channel DDR3 Memory Controller Memory Scheduler AMD HD Media Accelerator Channel (UVD, AMD Accelerated Accelerator Accelerator Video Converter) L2 L2 MediaHD AMD Northbridge Unified Northbridge Cache Cache GPU AMD Radeon™ GPU Dual Dual Up to 4 Core Core “Piledriver” x86 x86 Cores with total 4MB L2 Module Module HDMI, DisplayPort 1.2, DVI controllers PCI Express® I/O — PCIe Display DP/ Display Controller PCIe® ® 24 lanes, optional digital PCIe® PLL HDMI® display interfaces 3 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 AMD 2ND GENERATION “BULLDOZER” CORE: “PILEDRIVER” 32nm "PILEDRIVER" COMPUTE MODULE x86 CORE REDESIGN . Shared Fetcher / prediction pipeline - 64KB I-Cache . Shared 4-way x86 decoder Prediction Queue ICache L1 BTB . Shared Floating Point Unit - dual 128-bit FMA pipes Fetch Queue L2 BTB . Shared 16-way 2MB L2; Ucode ROM 4 x86 Decoders . Dedicated integer cores – Register renaming based on physical register file Int Int – Unified scheduler per core Scheduler FP Scheduler Scheduler – Way-predicted 16KB L1 D-cache Instr Instr Out-of-order Load-Store Unit Retire Retire -bit – -bit MMX MMX FMAC FMAC AGen AGen AGen AGen 128 128 EX, DIV EX, DIV EX, . ISA additions: FMA3, F16C MULEX, MULEX, . Lightweight profiling support in HW . “Piledriver” performance increase over “Stars” Ld/ST Unit L1 DTLB L1 DCache FP Ld Buffer Ld/ST Unit L1 DTLB L1 DCache – 14% improvement for desktop5 Data Prefetcher Shared L2 Cache – 25% improvement for notebook2 – AMD Turbo Core 3.0 5 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 “PILEDRIVER” CORE FLOOR PLAN Microcode ROM Branch Instruction 64KB L1 Predictor Decode I-Cache Integer Integer Integer Integer Datapath Scheduler Scheduler Datapath 16KB L1 16KB L1 2MB L2 Cache Unit D-Cache D-Cache Load/Store Load/Store Cache L2 Floating-Point Unit 6 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 “PILEDRIVER” IMPROVEMENTS & ENHANCEMENTS VS. “BULLDOZER” “Bulldozer” Hybrid Predictor Augmented with 2nd level predictor ISA extensions FMA3, F16C Improved scheduling FPU and INT Faster Instruction exe SYSCALL & SYSRET HW Divider 2x Larger L1 TLB Improved Store-to- Load Forwarding L2 efficiency and prefetching HW L1 Pre-fetcher improvements improvements 7 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 “PILEDRIVER” IMPROVEMENTS 30% higher . Design optimized for wide 10% lower . Loop Predictor CPU Freq 13 dynamic operational range (0.8V to 1.3V) . Way Predictor . 30% higher frequency at same power vs. “Bulldozer”13 . Dispatch gating based on voltage as “Stars” CPU Core in group size “Llano” . Clock Gating . Reduction in high power flops Efficient . 50% more base product Power . Intelligent L2 content operation management frequency vs. “Llano” at tracking to speed up L2 same 35W SOC TDP Latency reduction flush A10-4600M vs A8-3600M . State save/restore latency improvements to speed up power gating 8 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 MEDIA PROCESSING ACCELERATION AMD’S UNIFIED VIDEO DECODER (UVD) UVD UVD UVD 1st generation 2nd generation AMD A-Series APU H.264 / AVCHD H.264 / AVCHD H.264 / AVCHD VC-1 / WMV profile D VC-1 / WMV profile D VC-1 / WMV profile D Video MPEG-2 MPEG-2 Formats MPEG-4 / DivX Bitstream decode Bitstream decode Bitstream decode Picture-in-Picture Picture-in-Picture Dual stream HD+SD Dual stream HD+SD Features Dual stream HD+HD 10 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 “TRINITY” ACCELERATED VIDEO CONVERTER (“AVC”) Core . Multi-stream hardware H.264 HD Encoder functionality . Power-efficient and faster than real-time6 1080p @60fps Quality . 4:2:0 color sampling video features . Optimizations for scene changes (games and video) . Variable compression quality Interfacing . Audio / Video multiplexing features . Input from frame buffer for transcoding and video conferencing . Input from GPU display engine for wireless display7 11 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 VIDEO ENCODING SYSTEM OPERATION APU x86 Bitrate Feedback GPU Rate Control Current frame (Uncompressed YUV420) Forward Transform (e.g. FDCT) Reference frames in the Intra group of pictures (GOP) Prediction Quantization Entropy 0100110010110 Encode 10111100… Motion H.264 Estimation Compressed Stream AMD’s Accelerated Video Converter 12 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 GPU DESIGN UPDATES FOR GAMING AND COMPUTE 3D ENGINE Command Processor Thread Generator . DirectX® 11 – SM 5.0, OpenCL™ 1.1, DC 11 Instr. Cache Ultra-Threaded Dispatch Processor . GPU Core made of 384 Radeon™ Cores , each Constant capable of 1 SP FMAC per cycle Cache 32kB Local DataShare 32kB Local DataShare 32kB Local DataShare SIMD Engine 1 Engine SIMD 2 Engine SIMD 6 Engine SIMD – Organized as 96 stream processing units – each 4- BufferExport way VLIW (vs. 5-way in Llano) Memory Controller – 6 SIMDs (each contains 16 processing units) – Each SIMD share 1 texture unit – achieving 4:1 ALU:Texture rate Registers GlobalSynchronization . 32 depth / stencil per clock, 8 color per clock Fetch Fetch Fetch 64kB GlobalDataShare . 24x multi-sample and super sample, 16x 8kB 8kB 8kB anisotropic filtering L1 L1 L1 . Improved hardware tessellator vs. “Llano” . Compute improvements 128kB 128kB 128kB 128kB L2 L2 L2 L2 – Asynchronous dispatch: multiple compute kernels with independent address space simultaneously 14 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 PERFORMANCE ACHIEVEMENTS PERFORMANCE INCREASE ON CLIENT WORKLOADS (FOR 35W TDP ) “TRINITY” VS. “LLANO” CPU PERFORMANCE INCLUDING POWER MANAGEMENT, FREQ AND IPC GAINS Digital Media Sony Vegas PowerDirector 9 HD Transcode Handbrake HD Handbrake DVD iTunes Quicktime Pro Photoshop Elements 2.0 0% 10% 20% 30% 40% 50% 60% Web & Productivity: Compression & Cryptography ® PCMark7 Score ® PCMark7 Productivity Windows Send & Compress 7Zip GUIMark2 HTML5 GUIMark2 Bitmap 0% 10% 20% 30% 40% 50% 60% 70% Experimental setup: see footnote 10 16 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 PERFORMANCE AND POWER COMPARISON VS PRIOR-GENERATION ® Visual Performance - 3DMark Vantage Performance Battery Life Hours - Windows Idle See footnote 11 (Est. 62 Whr. Battery) A10 12.2 4530 A10 A8 12.2 3850 A8 Trinity A6 12.2 A6 2650 12.2 Llano A4 A4 2200 0 5 10 15 0 2000 4000 6000 See footnotes 4 and 8 or battery life measurement considerations General Performance - PCMark® Vantage Overall Compute Capacity - Calculated CTP SP GFLOPS See footnote 11 A10 6125 A10 603 A8 405 A8 5656 A6 306 A6 4882 A4 204 A4 4600 0 200 400 600 800 9 0 2000 4000 6000 8000 See footnote Trinity performance based on estimates and/or preliminary benchmarks and are subject to change. 17 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 AMD TURBO CORE 3.0 TECHNOLOGY AMD TURBO CORE 3.0 TECHNOLOGY : OVERVIEW UTILIZE CALCULATED AVAILABLE DYNAMIC THERMAL HEADROOM TO IMPROVE PERFORMANCE . 10-20 ºC variations across die during peak load GPU-dominated workload (3DMark®) CPU-dominated workload (Livermore Loop 1Thread) with single thread application on CPU CPU0=17W, GPU=4.2W CPU0=2.7W, GPU=23.9W CPU0 At Tjmax GPU At Tjmax Die Y Die X Die Y dim dim dim Die X dim Simulation results for engineering discussion – no claims made to applicability to specific configuration of sold products. Chip divided into “Thermal Entities” (TE) – Thermal Entity calculate power and thermal density Fan . Thermal RC network – Transfer coefficients that describe thermal transfer between TEs, substrate and package are characterized – Numerical analysis firmware runs on the management processor which calculates per TE temperatures – TEs are throttled using voltage/frequency adjustments according to workload heuristics 19 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 AMD TURBO CORE 3.0 TECHNOLOGY: CALCULATED VS. MEASURED TEMPERATURE CPU module 0 (Calculated) GPU (Calculated) Hotspot (measured) CPU module 1 (Calculated) 100ºC temp temp Estimated +/- 3- 5C difference in 80ºC 3DMark® Vantage Measured hotspot temperature calculated hotspot 0s time 1200s vs. measured hot 100ºC spot temperature, at steady thermal state temp temp Measured hotspot temperature 80ºC CPU Power Virus Experimental results for engineering review, no observable product functional operational difference results from thermal differences. No claims made to accuracy 20 AMD “Trinity” HotChips 2012 | Sebastien Nussbaum | July 2012 AMD TURBO CORE 3.0 TECHNOLOGY – PERFORMANCE . Workloads of moderate activity have high residency at maximum frequency – Thermal headroom allows hotspot to remain below maximum control temperature . Higher activity workloads offer fewer opportunities to raise frequency