AMD “Kabini” APU SOC DAN BOUVIER BEN BATES, WALTER FRY, SREEKANTH GODEY HOT CHIPS 25 AUGUST 2013
Total Page:16
File Type:pdf, Size:1020Kb
AMD “Kabini” APU SOC DAN BOUVIER BEN BATES, WALTER FRY, SREEKANTH GODEY HOT CHIPS 25 AUGUST 2013 “KABINI” FLOORPLAN 28nm technology, 105 mm2, 914M transistors KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 2 "JAGUAR" CORE DESIGN GOALS IMPROVE ON “BOBCAT”: UPDATE THE INCREASE PROCESS PERFORMANCE IN A ISA/FEATURE SET PORTABILITY GIVEN POWER ENVELOPE “Jaguar” added: More IPC SSE4.1, SSE4.2 40-bit physical AES, CLMUL address-capable Better frequency at given voltage MOVBE Improved Improved power efficiency through AVX, virtualization clock gating and unit redesign XSAVE/XSAVEOPT F16C, BMI1 KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 3 "JAGUAR" ENHANCEMENTS 32KB Branch Improved IC prefetcher ICACHE Prediction Larger IB for improved New hardware divider fetch/decode decoupling Decode and New/improved COPs: Microcode ROMS CRC32/SSE4.2, BMI1, POPCNT, LZCNT More OOO resources Int Rename FP Decode Rename 128b native hardware Scheduler Scheduler 4 SP muls + 4 SP adds FP Scheduler Int PRF 1 DP mul + 2 DP adds FP PRF ALU ALU LAGU SAGU ISA: many new COPs VALU VALU Mul 256b AVX support VIMul St Conv. Ld/St Queues redesign Div New zero optimizations Enhanced tablewalks FPAdd FPMul 32KB Ld/St 128b data path to FPU DCache Queues Improved write combining To/From BU More outstanding Shared Cache Unit transactions KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 4 "JAGUAR" SHARED CACHE UNIT Shared cache is major design addition 32KB Branch ICACHE Prediction in “Jaguar” Decode and Microcode ROMS Int Rename FP Decode Rename Scheduler Scheduler FP Scheduler Supports 4 cores Int PRF FP PRF ALU ALU LAGU SAGU VALU VALU Mul VIMul St Conv. Div FPAdd FPMul Total shared 2MB, 16-way 32KB Ld/St DCache Queues BU Supported by 4 L2D banks L2 cache is inclusive Allows using L2 tags as probe filter L2 Interface L2 tags reside in interface block Divided into 4 banks L2D L2D L2D bank look-up only after L2 tag hit L2 interface block runs at core clock L2D L2D New L2 stream prefetcher per core KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 5 UNIFIED NORTHBRIDGE AND MEMORY EFFICIENT BANDWIDTH DELIVERY TO DELIVER VISUAL EXPERIENCE UNIFIED NORTHBRIDGE Supports Core 0 Core 1 Core 2 Core 3 DDR3 interface Quad-core Module Interface to graphics memory controller 256b Interface to I/O subsystem APU power management System Request Interface (SRI) PCI is the interconnect to I/O devices Crossbar (XBAR) MEMORY SUPPORT UNIFIED UNIFIED 64-bit interface with 4 memory ranks Link Controller Memory Controller NORTHBRIDGE NORTHBRIDGE Supports 1.25V, 1.35V, and 1.5V DIMMS Up to 10.3 GB/s with DDR3-1600 256b AMD Radeon™ 256b Up to 32 GB capacity in notebook FT3 BGA Memory Bus package 256b I/O Graphics Memory DRAM Supports memory P-states — with memory Controller Controller Controller speed changes on the fly KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 6 GRAPHICS CORE NEXT ARCHITECTURE A new GPU design Cutting-edge graphics performance and features for a new era High compute density with multi-tasking of computing Built for power efficiency Optimized for heterogeneous computing Support for high-level language features for heterogeneous compute KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 7 AMD RADEON™ HD 8000 GRAPHICS CORE NEXT ARCHITECTURE First APU with GCN architecture ACE ACE API support: Command Processor Graphics: DirectX 11.1, OpenGL 4.3, OpenGL ES 3.0 ACE Compute: OpenCL1.2, DirectCompute, C++ AMP ACE Hardware configuration: Geometry Engine Geometry engine Rasterizer ¼ prim/clock Global Data Share Two GCN compute units (CUs) 1 render back-end I$ 4 pixel color raster operation pipelines (ROPs) 16 depth test (Z) / stencil ops K$ Color cache (C$) / Depth cache (Z$) 128KB read/write L2 cache 4KB global data share with global synchronization resources L2 Cache Advanced power management: Fine-grain clock\clock tree gating PowerTune – dynamic V/F scaling with power containment System Fabric Zero core power – power gating KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 8 GCN COMPUTE UNIT Basic GPU building block of unified shader system Each CU can execute instructions from multiple New instruction set architecture kernels simultaneously Non-VLIW Designed for programming simplicity, high Vector unit + scalar co-processor utilization, high throughput, multi-tasking Distributed programmable scheduler Unstructured flow control, function calls, recursion, exception Consistent with AMD dGPU architecture so support kernels developed for GCN run anywhere Un-typed, typed, and image memory operations Flat address support Branch & Message Unit Texture Fetch Load / Store Units (16) Texture Filter Scheduler Units L1 Cache (16KB Cache Texture Filter Units Local Scalar Vector Units (SIMD-16) Vector Units (SIMD-16) Unit Vector Units (SIMD-16) Vector Units (SIMD-16) Texture Filter Data Units Share ) Vector Registers (64KB) Vector Registers (64KB) (64KB) Vector Registers (64KB) Vector Registers (64KB) Texture Filter Scalar Units Compute Unit Registers (4KB) KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 9 GCN R/W CACHE Reads and writes cached Command Bandwidth amplification Processors 64-128KB R/W L2 Improved behavior on more memory access patterns per MC Channel Compute Improved write-to-read re-use performance 16KB Vector DL1 Unit Relaxed consistency memory model Consistency controls available to control locality of 16KB Scalar DL1 32KB Instruction L1 X load/store B A GPU-coherent R Acquire/Release semantics control data visibility across the machine 64-128KB R/W L2 L2 coherent = all CUs, ACE, and command per MC Channel processors can have the same view of data Compute Global atomics 16KB Vector DL1 Unit Performed in L2 cache KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 10 AMD RADEON HD 8000 GCN ARCHITECTURE ACE Dual-display support ACE Command Processor HDMI and DisplayPort™ support up to 4K x 2K 30Hz ACE Wireless display ACE Low-power self-refresh Geometry Engine Video Codec Engine (VCE) fixed-function Rasterizer Multi-stream hardware H.264 HD encoder Global Data Share Power-efficient and faster than real-time 1080p@60fps Scalable video coding (SVC) I$ Universal Video Decoder (UVD) fixed-function K$ with codecs for: H.264 VC-1 L2 Cache MPEG-2 MVC DivX System Fabric WMV MFT WMV-native UVD VCE Display Controllers KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 11 DISPLAY TECHNOLOGY LEADERSHIP WIRELESS PANEL SELF- DYNAMIC ULTRAHD DISPLAY REFRESH REFRESH RATE 4096 3840 2160 1920 2160 1080 Drive 4K x 2K resolution Wi-Fi CERTIFIED, Significant power savings with Automatically and seamlessly displays* via HDMI and Miracast™ Support embedded DisplayPort panel reduce panel refresh rate to DisplayPort Low latency & low power self-refresh* save power* * Supported panel required KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 12 “KABINI” ACCELERATED COMPUTING A COMPLETE SYSTEM-ON-A-CHIP SOLUTION FCL FCL Allows the GPU to access coherent cache/memory 128b (each direction) path for I/O access to memory GPU access to coherent memory space CPU access to dedicated GPU framebuffer I/O GPU CPU GRAPHICS MEMORY BUS 256b (each direction) for GMC access to memory Full-bandwidth path for graphics to system memory DRAM-friendly stream of reads and write Bypasses coherency mechanism DDR3 NB INTEGRATED SYSTEM CONTROL AND I/O (FCH) Integrated FCH Provides complete system connectivity USB 3.0, USB 2.0, SATA 3, GPIO Integrated system clock generator Reduced motherboard footprint required AMD Radeon Memory Bus Allows GPU to access Higher I/O performance at reduced power consumption system memory with high bandwidth *Functional units not to scale KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 13 THE "KABINI" SYSTEM 2 SODIMMs DDR3 800-1600 DDR3 DDR3 Optionally: 18bpp LVDS SDXC - 2TB, (64b) UHS-I - 104MBps eDP SDIO or SD card: “Jaguar” “Jaguar” eDP LCD panel Core Core DisplayPort / HDMI 2MB Shared L2 2x USB 3.0 VGA External ports “Jaguar” “Jaguar” x4 PCI Express Discrete GPU - MXM Fingerprint scanner, Core Core PowerXpress™ WWAN, Bluetooth, 8x USB 2.0 AMD 2 x SATA 3 mini card HDD,DVD Radeon 8000 Display GPU x1 PCIe to PCIe mini card slot HW monitor, Control EC/KBC Media fan control, Acceleration x1 PCIe to PCIe mini card slot thermal sensor, TPM LPC keyboard scan x1 PCIe Wi-Fi controller FCH SPI x1 PCIe BIOS flash Gb Ethernet controller KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 14 CHIP-LEVEL POWER DISTRIBUTIONS TDP 100% Power consumption Core 3 90% VCE (and hence performance) is set by the cooling 80% GPU GPU GPU Display Core 2 DAC & PHYs capabilities of the power power power L2$ platform 70% 60% Core 1 Power varies a lot by GPU CPU 50% PADS power workload CPU CPU Core 0 power power 40% We measure and Display Display 30% power Display power manage the power of power UVD each component on the 20% FCH FCH FCH MC chip to generate the best power power power FCH 10% performance/watt Other chip Other chip Other chip Memory PADS power power power 0% App 1 App 2 App 3 KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 15 DIGITAL POWER MONITORING CPU CPU CPU CPU To manage temperature and Monitor Events send the power wherever Weights Weights Weights Weights it’s needed, we use power Weights Events monitors in all chip CAC CAC CAC CAC components Add-in Leakage Monitor Monitor Monitor Monitor “Kabini” and “Temash” have Calculate Power APU 2 power monitors in each (P = PLeakage + CacV f) P-states CPU, the GPU, the display Pboost interface, and the FCH Add in calculated GPU power AMD Turbo CORE P1 Manager Change P2 The central