Amd Ryzen™ Processor Software Optimization Presented by Kenneth Mitchell
Total Page:16
File Type:pdf, Size:1020Kb
AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION PRESENTED BY KENNETH MITCHELL LET’S BUILD… 2020 MAY 15, 2020 ABSTRACT Join AMD Game Engineering team members for an introduction to the AMD Ryzen™ family of processors followed by advanced optimization topics. Learn about the high-performance AMD "Zen 2" microarchitecture and profiling tools. Gain insight into code optimization opportunities and lessons learned. Examples may include C/C++, assembly, and hardware performance-monitoring counters. AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 2 SPEAKER BIOGRAPHY Ken Mitchell is a Principal Member of Technical Staff in the Radeon™ Technologies Group/AMD ISV Game Engineering team where he focuses on helping game developers utilize AMD processors efficiently. His previous work includes automating and analyzing PC applications for performance projections of future AMD products as well as developing benchmarks. Ken studied computer science at the University of Texas at Austin. AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 3 AGENDA • Success Stories • “Zen 2” Architecture Processors • AMD uProf Profiler • Optimizations and Lessons Learned • Contacts AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 4 SUCCESS STORIES AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 5 SUCCESS STORIES BORDERLANDS 3 GEARS 5 WORLD WAR Z DirectX® 12 DirectX® 12 Vulkan® AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 6 “ZEN 2” ARCHITECTURE PROCESSORS AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 7 “ZEN 2” PRODUCT EXAMPLES NOTEBOOK DESKTOP HIGH END DESKTOP “Renoir” “Matisse” “Castle Peak” AMD Ryzen™ 7 4800U 8-Core AMD Ryzen™ 9 3950X 16-Core AMD Ryzen™ Threadripper™ Processor Processor 3990X 64-Core Processor AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 8 MICROARCHITECTURE AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 9 ADVANCES IN “ZEN 2” MICROARCHITECTURE 32K ICACHE BRANCH • +15% IPC Improvement from “Zen” to “Zen 2” 8 way PREDICTION • 2x op cache capacity DECODE OP CACHE • Reoptimized L1I cache 4 instructions/cycle 8 ops/cycle rd Micro-op Queue • 3 address generation unit 6 ops dispatched FLOATING • 2x FP data path width INTEGER POINT Integer Rename Floating Point Rename • 2x L3 capacity • Improved branch prediction accuracy Sche Sche Sche Sche Sche Sche Sche Scheduler duler duler duler duler duler duler duler • Hardware optimized Security Mitigations Integer Physical Register File FP Register File • Secure Virtualization with Guest Mode Execute Trap (GMET) ALU ALU ALU ALU AGU AGU AGU MUL ADD MUL ADD • Improved SMT fairness • for ALU and AGU schedulers 512K L2 2 loads + 1 store LOAD/STORE 32K DCACHE (I+D)CACHE • Improved Write Combining Buffer per cycle QUEUES 8 way 8 way AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 10 DATA FLOW AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 11 “RENOIR” 8 CORE PROCESSOR Unified 32B fetch 32K 32B/cycle DRAM 32B/cycle Memory 16B/cycle I-Cache Channel 8-way Controller 512K L2 uclk memclk 32B/cycle 32B/cycle I+D Cache 4M L3 2*32B load 32B/cycle 8-way Data 32B/cycle 32K GFX9 D-Cache I+D Cache Fabric 1*32B store 8-way 16-way 32B/cycle cclk l3clk Media IO Hub 64B/cycle Controller fclk lclk AMD Ryzen™ 7 4800U, 15W TDP, 8 Cores, 16 Threads, 4.2 GHz max boost clock, 1.8 GHz base clock, integrated GPU. * Monolithic Die. Each 4M L3 Cache has its own 32B/cycle link to the data fabric. 64b DDR4 Channel Shown. AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 12 CCD CCD “MATISSE” 16 CORES PROCESSOR IOD Unified 32B fetch 32K 32B/cycle DRAM 32B/cycle Memory 16B/cycle I-Cache Channel 8-way Controller 512K L2 uclk memclk 32B/cycle 16M L3 32B/cycle R I+D Cache Data 2*32B load 32K 32B/cycle 8-way I+D Cache 16B/cycle W Fabric D-Cache 16-way 1*32B store 8-way cclk 64B/cycle IO Hub l3clk Controller fclk lclk AMD Ryzen™ 9 3950X, 105W TDP, 16 Cores, 32 Threads, 4.7 GHz max boost clock, 3.5 GHz base clock. * Two Core Complex Die (CCD). Each CCD has two 16M L3 Cache Complexes. * The L3 Cache Complexes within a CCD share a single link to the Data Fabric. AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 13 0 1 4 5 IOD “CASTLE PEAK” 64 CORE PROCESSOR 2 3 6 7 Unified 32B fetch 32K 32B/cycle DRAM 32B/cycle Memory 16B/cycle I-Cache Channel 8-way Controller 512K L2 uclk memclk 32B/cycle 16M L3 32B/cycle R I+D Cache Data 2*32B load 32K 32B/cycle 8-way I+D Cache Fabric 16B/cycle W D-Cache 16-way Quadrant 1*32B store 8-way cclk 64B/cycle IO Hub l3clk Controller fclk lclk AMD Ryzen™ Threadripper™ 3990X, 280W TDP, 64 Cores, 128 Threads, 4.3 GHz max boost clock, 2.9 GHz base clock. * Two CCDs per Data Fabric Quadrant. * Two Data Fabric Quadrants have Unified Memory Controllers and two have IO Hubs. AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 14 INSTRUCTION SET AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 15 INSTRUCTION SET EVOLUTION AES AVX BMI SHA XOP ADX TBM FMA F16C AVX2 BMI2 CLWB SMEP SSSE3 FMA4 SMAP XSAVE SSE4.1 SSE4.2 RDRND XSAVES XSAVEC CLZERO MOVBE XGETBV RDSEED OSXSAVE FSGSBASE EXAMPLE XSAVEOPT MONITORX WBNOINVD CLFLUSHOPT YEAR FAMILY PRODUCT FAMILY ARCHITECTURE PRODUCT PCLMULQDQ 2019 17h “Matisse” “Zen2” Ryzen™ 9 3950X 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 2017 17h “Summit Ridge”, ”Pinnacle Ridge” “Zen”, “Zen+” Ryzen™ 7 2700X 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 2015 15h “Carrizo”, “Bristol Ridge” “Excavator” A12-9800 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 2014 15h “Kaveri”, “Godavari” “Steamroller” A10-7890K 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 2012 15h “Vishera” “Piledriver” FX-8370 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 2011 15h “Zambezi” “Bulldozer” FX-8150 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 2013 16h “Kabini” “Jaguar” A6-1450 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 2011 14h “Ontario” “Bobcat” E-450 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2011 12h “Llano” “Husky” A8-3870 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 “Zen 2” added CLWB and the AMD vendor specific instruction WBNOINVD. AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 16 SOFTWARE PREFETCH LEVEL INSTRUCTIONS Prefetch T0|T1|T2|NTA • Loads a cache line from the specified memory address into the data- L1 cache level specified by the locality reference T0, T1, T2, or NTA. Aggressively • If a memory fault is detected, a bus cycle is not initiated and the Evict instruction is treated as an NOP. Prefetch • Prefetch levels T0/T1/T2 are treated identically in “Zen” & “Zen 2” NTA microarchitectures. L2 • The non-temporal cache fill hint, indicated with PREFETCHNTA, reduces cache pollution for data that will only be used once. It is not suitable for cache blocking of small data sets. Lines filled into the L2 cache with PREFETCHNTA are marked for quicker eviction from the L2 and when evicted from the L2 are not inserted into the L3. L3 • The operation of this instruction is implementation-dependent. Prefetch fill & evict policies may differ for other processor vendors or microarchitecture generations. memory AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 17 “MATISSE” CACHE AND MEMORY AMD PUBLIC | LET’S BUILD… 2020 | AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION | MAY 15, 2020 | 18 CACHE LATENCY • Cache line size is 64 Bytes. Count/ Level Capacity Sets Ways Line Size Latency CCD • 2 cpu clock cycles to move a single cache line. • L2 is inclusive of L1. uop 8 4 K uops 64 8 8 uops NA • lines filled into L1 are also filled into L2. • L3 is filled from L2 victims of all 4 cores within its CCX.