AMD Raven Ridge
DELIVERING A NEW LEVEL OF VISUAL PERFORMANCE IN AN SOC
AMD “RAVEN RIDGE” APU
Dan Bouvier, Jim Gibney, Alex Branover, Sonu Arora
Presented by: Dan Bouvier, Corporate VP, Client Products Chief Architect
AMD CONFIDENTIAL

RAISING THE BAR FOR THE APU VISUAL EXPERIENCE
MOBILE APU GENERATIONAL PERFORMANCE GAINS
• Up to 200% MORE CPU PERFORMANCE
• Up to 128% MORE GPU PERFORMANCE
• Up to 58% LESS POWER
[Chart: CPU Performance, GPU Performance, and Power; AMD Ryzen™ 7 2700U vs. 7th Gen AMD A-Series APU]
• FIRST “Zen”-based APU
• HIGH-PERFORMANCE on-die “Vega”-based graphics
• LONG BATTERY LIFE, premium form factors
• Scaled GPU and CPU up to reach target frame rate
• Managed power delivery and thermal dissipation
• Improved memory bandwidth efficiency
• Upgraded display experience
• Increased package performance density
2 | AMD Ryzen™ Processors with Radeon™ Vega Graphics - Hot Chips 30 | * See footnotes for details.

“RAVEN RIDGE” APU
[Block diagram, by region:]
• AMD “ZEN” x86 CPU CORES: “ZEN” CPU (4 CORE | 8 THREAD), CPU 0 to CPU 3, 4MB L3 Cache
• FULL SYSTEM CONNECTIVITY: USB 3.1, USB 2.0, NVMe, SATA, PCIe Discrete GFX, PCIe GPP
• HIGH BANDWIDTH SOC FABRIC & MEMORY SYSTEM: Infinity Fabric, two X64 DDR4 interfaces
• System Management Unit, Platform Security Processor
• ACCELERATED MULTIMEDIA EXPERIENCE: Multimedia Engines (Video Codec Next, Audio ACP)
• AMD “VEGA” GPU: AMD GFX+ (11 COMPUTE UNITS), 1MB L2 Cache
• INTEGRATED SENSOR FUSION HUB: Sensor Fusion Hub
• UPGRADED DISPLAY ENGINE: Display Controller Next

SIGNIFICANT DENSITY INCREASE
“Raven Ridge” die
• BGA Package: 25 x 35 x 1.38mm
• Technology: GLOBALFOUNDRIES 14nm, 11 layer metal
• Transistor count: 4.94B
• Die Size: 209.78mm²
59% more transistors and 16% smaller die than prior generation “Bristol Ridge” APU

INTEGRATED “VEGA” GRAPHICS
Graphics Engine
• Up to 11 Next Gen Compute Unit (NCU)
• 1 MB L2
• Flexible Geometry Engine
• 1 Draw Stream Binning Rasterizer
• 16 Pixel Units (32bpp)
• 44 Texture Units
DirectX® 12.1 Features
• Conservative Rasterization
• Raster Ordered Views
• Standard Swizzle
• Axis Aligned Rectangular Primitives
Throughput at 11 NCU
• 1200 MTri/sec @ 1200 MHz
• Rendering 19.2 GPix/sec @ 1200 MHz
• 1690 FP32 GFLOPS / 3379 FP16 GFLOPS @ 1200 MHz
• 52.8 GTex per second @ 1200 MHz
[Diagram: CP, Geometry/Raster/RB+, Shader System (SS) with Workload Manager, four ACEs and NCU Array, Core Fabric, four L2 slices, sDMA, Infinity Fabric]

“ZEN” CPU IMPROVES VISUAL FRAME RATE
[Diagram: “Zen” core. Front end: 64K 4-way I-Cache, branch prediction, decode (4 instructions/cycle), op cache, micro-op queue, 6 ops dispatched. Integer: rename, schedulers, physical register file, 4 ALU, 2 AGU. Floating point: rename, scheduler, register file, 2 MUL, 2 ADD. Memory: load/store queues (2 loads + 1 store per cycle), 32K 8-way D-Cache, 512K 8-way L2 (I+D) cache.]
[Diagram: core complex floorplan, CORE 0 to CORE 3, each with 512K L2 and adjacent L3 slices]
High performance “Zen” core
• Free up more power for GPU
• Up to 4 “ZEN” CPU cores
• 512KB L2 Cache per core
• 4MB Shared L3 Cache

“ZEN” WITH PRECISION BOOST 2
• Governed by CPU temperature, current, load
• Seeks highest possible frequency from environmental inputs, graceful roll-off
• Opens new boost opportunities for real-world nT workloads (e.g., games)
• 25MHz granularity

TUNE FOR THE PHASES OF VISUAL WORKLOADS
STEER POWER WHERE IT’S BEST USED
• Trade power/current based on dynamic utilization:
  − Core ↔ Core
  − CPU ↔ GPU
• On-die regulation and fine-grained frequency control enable fast, accurate frequency and voltage changes
• Fine-grained p-states (FGPS) across the IPs: continuous frequency control

“ZEN” CPU AND “VEGA” GFX CO-MANAGEMENT WITH INFINITY FABRIC
• CPU threads feed major GPU resources: 3D engine, compute engine, and DMA engine (data fetch and writeback)
• CPU “submits” tasks, GFX “renders” or “computes”
• One coherent control and data interface to integrate and manage the full SoC
• Power budgeting based on activity and efficiency
• Enhanced flow for quiescing/powering-off CPU-GFX component
[Diagram: “ZEN” CORE COMPLEX (four “Zen” cores, L3 Cache) and “VEGA” GRAPHICS (Graphics Pipeline, Compute Engine, Pixel Engines, L2 Cache) attached to Infinity Fabric, together with Multimedia Engines, I/O and System Hub, DDR4 Memory Controllers, and Display Engine]

FAST DEPLOYMENT OF NEW ARCHITECTURE
MODULAR AMD INFINITY FABRIC
• Standard port definition for IP connections (SDP = Scalable Data Port)
  − Common interface definition used for CPU, GPU, I/O, multi-media hubs, display, memory controller
• Coherent HyperTransport™ transport layer
  − Builds upon generations of coherent fabric development
  − Flexible topology to adapt to diverse SoC configurations
• SDP hides complexities of coherence protocol from connected IP
[Diagram: Engines and Accelerators connect through SDP Interface Modules to the Infinity Fabric Transport Layer, alongside Memory Controllers and the I/O Subsystem]

“RAVEN RIDGE” INFINITY FABRIC
[Diagram: CPU Core Complex and two Graphics Containers, each behind a Coherent Master; Memory Controllers behind Coherent Slaves; Transport Layer Switches]
“Raven Ridge” Optimizations
• 32 Byte internal datapath width
• Up to 1.6GHz for bandwidth exceeding 50GB/s
• Up to 5 transfers/clock per switch
• Improved CPU latency under load, while maintaining DRAM efficiency
• Structured for multi-region power gating
• Floorplan-aware, optimized display to memory routing
[Diagram: Memory Controllers behind a Coherent Slave; Transport Layer Switches; IO Master/Slave to the I/O Subsystem; Region A with Non Coherent Masters for the Display Controller and Multimedia Hub]

QUALITY OF SERVICE FOR SMOOTH VISUAL EXPERIENCE
Three Request Classes
• Hard real time:
  − High BW (e.g., display surface refresh)
  − Low BW (e.g., audio)
• Soft real time (e.g., video playback)
• Non real time (e.g., typical CPU/GPU/IO requests)
Architectural Mechanisms
• Multiple virtual channels
• Priority classes (Low/Medium/High/Urgent)
• End-to-end priority escalation by VC for out of bounds conditions
Switch-level View of QoS Architecture
[Diagram: Transport Request, Response, Probe, and Data Queues, each with BUFFERS and PICKERS; VC Dedicated Tokens and Shared Pool Tokens]
Picker arbitration is generally age ordered, except when younger passes older due to:
1) priority
2) VC resource availability
3) other resource such as output port busy

MEMORY BOUND PERFORMANCE OPTIMIZATION
[Diagram: 4MB L3 Cache, “Zen” CPU Core Complex, “Vega” GFX Engine, Memory Controllers, Fabric Transport Layer. Callouts: 1MB Shared L2 Cache; Larger dedicated GPU TLB cache; Deferred Primitive Batch Binning; Multi Level DRAM Aware Reordering; Memory Efficient Quality of Service; Deeper Arbitration Queues]
New features and optimized SoC configuration contribute to improved memory-limited performance:
• Caching and algorithms to reduce memory requests
• Improved lossless compression usage (DCC)
• Better request ordering to reduce DRAM page conflicts and read/write turnarounds
[Diagram: Memory Controllers, Display Controller, Multimedia Hub. Callout: Direct Reads of Compressed memory]

“RAVEN RIDGE” GRAPHICS SCALING
GENERATIONAL IMPROVEMENTS FOR MEMORY BOUND GAMING PERFORMANCE
Gaming performance scaling uplift due to new AMD Vega GPU features:
• 4x larger GFX L2 cache, unified across all graphics clients
• DSBR (Draw Stream Binning Rasterizer) feature reduces bandwidth
• Improved lossless DCC memory compression

NEW GENERATION DISPLAY AND VIDEO CODEC ENGINE
Display Engine (DCN)
• Flexible display pipe architecture
  − Up to four 4kp60 displays
• Low power display engine with DCC, 4K2K@60Hz @Vmin
• HDR support
  − From 32bpp to 64bpp surfaces
  − From sRGB to BT2020
• Higher bandwidth interfaces: HDMI 2.1, DP 1.4, HBR3
• USB-Type C with display alt-mode
Video Codec (VCN)
• Unified encode and decode engine
  − Up to 4kp60 HEVC 10b decode
  − Up to 4kp30 HEVC 8b encode
• Low power video playback: 4kp30 @Vmin
• HEVC 10b decode
• HEVC encode for superior quality Skype
• VP9 decode for efficient YouTube playback
[Diagram: DISPLAY ENGINE with Memory Interface Hub on Infinity Fabric and four Input/Output Processing Pipes; AltMode Ctrl; Display, USB, and Type C USB/DP Mux]

EFFICIENT POWER DELIVERY WITH DIGITAL LOW-DROPOUT REGULATORS
[Diagram: “ZEN” CORE COMPLEX with CPU 0 to CPU 3 and L3 in the CPU Region; System VDD Package Rail]
• Current delivery overprovisioned for worst-case overlap between CPU and GPU
• Fine-grain LDO control allows for efficient tracking of the CPU and GFX phases, powered by a unified VDD power rail
• 1st stage: off-chip motherboard vreg; 2nd stage: on-chip vreg with digital LDO
• Multiple digital LDO regions for CPU cores, graphics core, and sub-regions
  − Idle engine is powered off
• Allows more peak CPU/GPU current to improve boost performance
[Diagram: Voltage Regulator feeding the “VEGA” GRAPHICS COMPLEX, with GFX Compute Region, GFX Region, and VDD Region. Legend: shaded outline represents an LDO Regulated / Power Gating Region]
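
The peak figures quoted on the “Integrated Vega Graphics” and “Raven Ridge Infinity Fabric” slides follow from unit counts multiplied by clock rate. The sketch below reproduces that arithmetic. It assumes values the slides do not state: 64 FP32 lanes per compute unit, one fused multiply-add (2 FLOPs) per lane per clock, packed FP16 at twice the FP32 rate, and one result per pixel or texture unit per clock. The function and constant names are illustrative only.

```python
# Back-of-envelope check of the peak rates quoted in the deck.
# Slide values: 11 NCU, 16 pixel units, 44 texture units, 1200 MHz GPU clock,
# 32-byte fabric datapath at up to 1.6 GHz.
# Assumptions (not on the slides): 64 FP32 lanes per NCU, 2 FLOPs per lane
# per clock (one FMA), FP16 at 2x the FP32 rate, one result per unit per clock.

GPU_CLOCK_MHZ = 1200
NCU = 11
PIXEL_UNITS = 16
TEXTURE_UNITS = 44
LANES_PER_NCU = 64          # assumption
FLOPS_PER_LANE_CLOCK = 2    # assumption: one fused multiply-add

FABRIC_WIDTH_BYTES = 32
FABRIC_CLOCK_MHZ = 1600


def peak_rates():
    """Return peak rates in giga-units per second."""
    fp32 = NCU * LANES_PER_NCU * FLOPS_PER_LANE_CLOCK * GPU_CLOCK_MHZ / 1000
    return {
        "fp32_gflops": fp32,
        "fp16_gflops": 2 * fp32,
        "gpix_per_s": PIXEL_UNITS * GPU_CLOCK_MHZ / 1000,
        "gtex_per_s": TEXTURE_UNITS * GPU_CLOCK_MHZ / 1000,
        "fabric_gb_per_s": FABRIC_WIDTH_BYTES * FABRIC_CLOCK_MHZ / 1000,
    }


if __name__ == "__main__":
    for name, value in peak_rates().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the FP32 and FP16 results (1689.6 and 3379.2 GFLOPS) round to the quoted 1690 and 3379, the pixel rate matches 19.2 GPix/sec, the texture rate comes to 52.8 in units of GTex/s, and the fabric figure of 51.2 GB/s is consistent with "exceeding 50GB/s". The quoted 1200 MTri/sec at 1200 MHz corresponds to one triangle per clock.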