ANASUA BHOWMIK RYZEN PROCESSOR MICROARCHITECTURE
ANASUA BHOWMIK PRINCIPAL MEMBER OF TECHNICAL STAFF MICROPROCESSOR PERFORMANCE MODELING AND ARCHITECTURE TEAM AMD, BANGALORE
1 | JULY 2018 | CONFIDENTIAL AGENDA
Road to RYZEN – AMD’s new X86 CPU High level architecture of Zen ‒ Core engine ‒ Floating point unit ‒ Cache hierarchy ‒ Simultaneous Multithreading Zen based SoCs Design Challenges Contributions from India Design Center Next step
2 | JULY 2018 | CONFIDENTIAL RYZEN CPU
High performance CPU 8 cores 16 threads Zen core Over 4.8 billion transistors Over 2000 meters of wiring 14 nm process technology
3 | JULY 2018 | CONFIDENTIAL CPU COMPLEX A CPU complex (CCX) is four cores connected to an L3 Cache. The L3 Cache is 16-way associative, 8MB, mostly L L L L exclusive of L2. CORE 0 2 L2M L3M 3 L3M L3M 3 L3M L2M 2 CORE 1 C 512K 1MB C 1MB 1MB C 1MB 512K C T T T T The L3 Cache is made of 4 L L L L slices, by low-order address interleave. Every core can access every cache with same average L L L L latency CORE 2 2 L2M L3M 3 L3M L3M 3 L3M L2M 2 CORE 3 C 512K 1MB C 1MB 1 MB C 1MB 512K C CORE 3 Ryzen CPU in Thread Ripper T T T T L L L L has 2 CCXs Epyc Server has 8 CCXs Ryzen mobile has 1 CCX
4 | JULY 2018 | CONFIDENTIAL AMD CPU DESIGN OPTIMIZATION POINTS “ZEN”
High Performance “Excavator”
PERFORMANCE ONE CORE FROM FANLESS NOTEBOOKS Low Power TO SUPERCOMPUTERS “Jaguar”
LOWER POWER MORE PERFORMANCE SMALLER MORE AREA, POWER
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
5 | JULY 2018 | CONFIDENTIAL DEFYING CONVENTION: A WIDE, HIGH PERFORMANCE, EFFICIENT CORE
“ZEN” Instructions-Per-Clock
52% more work per cycle* “Excavator” “Steamroller” Total “Piledriver” Efficiency “Bulldozer” Gain
At = Energy Per Cycle Energy Per Cycle
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
6 | JULY 2018 | CONFIDENTIAL DESIGNED FROM THE GROUND UP FOR OPTIMAL BALANCE OF “ZEN” PERFORMANCE AND POWER
Totally New New High-Bandwidth, High-performance Low Latency Cache Core Design System
Simultaneous Energy-efficient FinFET Multithreading (SMT) Design Scales from for High Throughput Enterprise to Client Products
7 | JULY 2018 | CONFIDENTIAL 64K I-Cache Branch Prediction 4 way ZEN MICROARCHITECTURE
Decode Op Cache Fetch Four x86 instructions Op Cache instructions Micro-op Queue 4 instructions/cycle Micro-ops Integer units 6 ops dispatched ‒ Large rename space – 168 Registers ‒ INTEGER FLOATING POINT 192 instructions in flight/8 wide retire Load/Store units Integer Rename Floating Point Rename ‒ 72 Out-of-Order Loads supported Floating Point units Scheduler Scheduler Scheduler Scheduler Scheduler Scheduler Scheduler ‒ built as 4 pipes, 2 Fadd, 2 Fmul I-Cache 64K, 4-way Integer Physical Register File FP Register File D-Cache 32K, 8-way L2 Cache 512K, 8-way ALU ALU ALU ALU AGU AGU MUL ADD MUL ADD Large shared L3 cache – 8M 16 way 2 threads per core 512K 2 loads + 1 store Load/Store 32K D-Cache per cycle L2 (I+D) Cache Queues 8 Way 8 Way
8 | JULY 2018 | CONFIDENTIAL FETCH Next PC Redirect from DE/EX Decoupled Branch Prediction L0/L1/L2 TLB TLB in the BP pipe ‒ 8 entry L0 TLB, all page sizes ‒ 64 entry L1 TLB, all page sizes ‒ 512 entry L2 TLB, no 1G pages L1/L2 BTB Hash Perceptron Return Stack/ITA 2 branches per BTB entry Large L1 / L2 BTB To Op 32 entry return stack Cache Physical Request Queue Micro-Tags Indirect Target Array (ITA) 64K, 4-way Instruction cache Micro-tags for IC & Op cache 64K Instruction Cache 32 byte fetch 32 bytes/ cycle from L2 32 bytes to Decode
9 | JULY 2018 | CONFIDENTIAL From IC From Micro Tags DECODE Instruction Byte Buffer
Inline Instruction-length Decoder Pick Decode 4 x86 instructions Op Cache Op cache Decode Micro-op Queue Instructions Micro-ops Stack Engine Branch Fusion Micro-op Queue Memory File for Store to Load Forwarding
Microcode Rom Stack Engine To FP, 4 Micro-ops Dispatch
To EX, 6 Micro-ops
10 | JULY 2018 | CONFIDENTIAL Redirect 6 Micro-op Dispatch To DE EXECUTE to Fetch
Map Retire Queue 6x14 entry Scheduling Queues 168 entry Physical Register File 6 issue per cycle ‒ 4 ALU’s, 2 AGU’s ALQ0 ALQ1 ALQ2 ALQ3 AGQ0 AGQ1 192 entry Retire Queue
168 Entry Physical Register File 2 Branches per cycle Move Elimination
Forwarding Muxes 8-Wide Retire
ALU0 ALU1 ALU2 ALU3 AGU0 AGU1
LS
11 | JULY 2018 | CONFIDENTIAL AGU0 AGU1 To Ex LOAD/STORE AND L2
Store Queue 72 Out of Order Loads Load Queue 44 entry Store Queue Pre Fetch Split TLB/Data Pipe, store pipe
L0 Pick L1 Pick 64 entry L1 TLB, all page sizes Store Pipe Pick 1.5K entry L2 TLB, no 1G pages 32K, 8 way Data Cache TLB0 DAT0 DAT1 TLB1 STP Store Commit ‒ Supports two 128-bit accesses Optimized L1 and L2 Prefetchers L1/L2 TLB + 512K, private (2 threads), inclusive L2 DC tags
32K Data Cache
MAB WCB To FP
32 bytes to/from L2 To L2
12 | JULY 2018 | CONFIDENTIAL 8 Micro-op 4 Micro-op Dispatch Retire FLOATING POINT
192 Entry NSQ Retire Queue 2 Level Scheduling Queue 160 entry Physical Register File 128 bit 8 Wide Retire SQ LDCVT Loads 1 pipe for 1x128b store Accelerated Recovery on Flushes Int to FP 160 Entry Physical Register File FP to Int SSE, AVX1, AVX2, AES and legacy mmx/x87 compliant
Forwarding Muxes 2 AES units
MUL0 ADD0 MUL1 ADD1
13 | JULY 2018 | CONFIDENTIAL ZEN CACHE HIERARCHY CORE 0
32B fetch 64K 32B/cycle Fast private 512K L2 cache I-Cache 512K L2 4-way 32B/cycle Fast shared L3 cache I+D 2*16B load 32B/cycle High bandwidth enables prefetch 32K Cache D-Cache 8-way 8M L3 improvements 1*16B store 8-way 32B/cycle I+D L3 is filled from L2 victims Cache Fast cache-to-cache transfers 16-way Large Queues for Handling L1 and L2 misses
14 | JULY 2018 | CONFIDENTIAL CPU COMPLEX
A CPU complex (CCX) is four cores connected to an L L L L CORE 0 2 L2M L3M 3 L3M L3M 3 L3M L2M 2 CORE 1 L3 Cache. C 512K 1MB C 1MB 1MB C 1MB 512K C T T T T The L3 Cache is 16-way L L L L associative, 8MB, mostly exclusive of L2. The L3 Cache is made of 4 slices, by low-order address L L L L interleave. CORE 2 2 L2M L3M 3 L3M L3M 3 L3M L2M 2 CORE 3 C 512K 1MB C 1MB 1 MB C 1MB 512K C CORE 3 T T T T Every core can access every L L L L cache with same average latency Ryzen has 2 CCXs Naples Server has 4 CCXs
15 | JULY 2018 | CONFIDENTIAL L0/L1/L2 64K I-Cache 4 way Branch Prediction ITLB SMT OVERVIEW
Decode MicroOp-opCache Cache All structures fully available in 1T mode Micro-op Queue instructions micro-ops Front End Queues are round robin with Vertically Threaded 6 ops dispatched priority overrides Increased throughput from SMT INTEGER Retire Queue FLOATING POINT
Integer Rename Floating Point Rename
Schedulers Scheduler Competitively shared structures Integer Physical Register File FP Register File Statically Partitioned 2x AGUs 4x ALUs MUL ADD MUL ADD
Store Queue 512K L1/L2 32K D-Cache L2 (I+D) Cache LoadLoad Queue DTLB 8 Way 8 Way
16 | JULY 2018 | CONFIDENTIAL ZEN PERFORMANCE & POWER IMPROVEMENTS
BETTER CORE ENGINE BETTER CACHE SYSTEM LOWER POWER
‐ SMT - two threads per core ‐ Write back L1 cache ‐ Aggressive clock gating with ‐ Better branch prediction ‐ Faster L2 cache multi-level regions ‐ Large Op Cache ‐ Faster L3 cache ‐ Write back L1 cache ‐ Wider micro-op dispatch 6 vs. 4 ‐ Faster Load to FPU: 7 vs. 9 cycles ‐ Large Op Cache ‐ Larger Instruction Schedulers ‐ Better L1 and L2 data prefetcher ‐ Stack Engine Integer: 84 vs. 48 | FP: 96 vs. 60 ‐ Close to 2x the L1 and L2 bandwidth ‐ Move elimination ‐ Larger retire 8 ops vs. 4 ops ‐ Total L3 bandwidth up 5x ‐ Power focus from project ‐ 4 issue FPU inception ‐ Larger Retire Queue 192 vs. 128 ‐ Larger Load Queue 72 vs. 44 ‐ Larger Store Queue 44 vs. 32
52% IPC PERFORMANCE UPLIFT
17 | JULY 2018 | CONFIDENTIAL RYZEN BASED SOCS
EPYC SERVER ‐ 8 CCX, 4 Dies in each socket ‐ Connected by fast Infinity fabric ‐ Support for virtualization, encryption, ECC ‐ Targeted for Data centers
THREAD RIPPER DESKTOP. ‒ High Frequency ‒ Gaming performance
RAVEN RIDGE LOW POWER APU (CPU + GPU) ‒ 4 M L3 ‒ Power optimization
18 | JULY 2018 | CONFIDENTIAL DESIGN CHALLENGES Long and expensive processor development project ‒ More than 2 million engineering hours for Zen over four years Need to make key design decisions early in the project which has long lasting impact ‒ Dependence on unknowns like process readiness in the fab, yield ‒ Area, power ‒ Availability of technology Conflicting needs of various SoCs ‒ Server : Bigger LLC, MT performance more important than 1T performance ‒ Desktop: High frequency, 1T is more important (aggressive prefetching) ‒ Client: Area and power sensitive, smaller LLC ‒ Semicustom: Specific ISA supports ‒ Ultra low power Workloads to consider Competitiveness of the parts
19 | JULY 2018 | CONFIDENTIAL CONTRIBUTIONS FROM INDIA DESIGN CENTER
Performance modeling, micro-architectural feature exploration, performance verification ‒ Development of cache hierarchy performance model (L2, L3, prefetcher) and architecture Traffic optimization between cores and memory All (Server, Client and Cores) workloads analysis, characterization and collection Core, Data Fabric and SoC level verification L3 RTL development for Zen based client part Zen based client SoC architecture development Compiler optimizations for Zen based server (Epyc)
20 | JULY 2018 | CONFIDENTIAL WORK IN PROGRESS
Zen2 hardware is in the lab ‒ 7 nm core with significant performance uplift over Zen Microarchitecture development, performance modeling and correlation for core following Zen2 Exploring micro-architectural ideas for future processors Improving performance analysis and modeling methodologies ‒ Performance bottleneck analysis ‒ Very high thread count simulation, impact of sharing and locks ‒ Warming-up large caches ‒ Execution driven vs trace driven simulation ‒ SoC level modeling – CPU/GPU/accelerators interaction Workloads analysis ‒ Cloud, virtualized workloads, machine learning, HPC workloads
Lots of challenging problem to work on!!
21 | JULY 2018 | CONFIDENTIAL PROJECT TIMELINE Compare Perf model/HW Diagnose Perf problems Evaluate fixes
Competitive analysis Silicon Mix simulated/measured data Performance “Relative projections” Validation and Debug
Performance Projections
Hundreds of small (<1%) micro-architectural tradeoffs Performance Correlation Compare Perf model/RTL Find/fix perf bugs in both Validate Perf model Develop high-level model Feature Modeling and evaluation with configurable architecture Perform trade-off studies RTL Development, verification, physical design
Architectural Development
Project Timeline Concept High Level Tapeout Parts in Product Launch 22 | JULY 2018 | CONFIDENTIAL Design exit the lab PERFORMANCE ANALYSIS AND MODELING
Develop and explore micro-architectural ideas for performance improvement
Variety of tools and technologies deployed for performance analysis & architectural innovation ‒ Cycle accurate simulator Constant state of evolution, due to changing workloads, and architecture ‒ More IPC, MP, SMT, large caches ‒ SW and HW more tightly linked than ever Correct performance projection is critical for the success of the project Workloads analysis, characterization and tracing Server workloads – SPEC CPU 2006, SPEC CPU2017, Legacy Enterprise workloads, Spec JBB, Cloud, NoSql database Client workloads – Games, Cinebench, Productivity suite, various use case scenarios
23 | JULY 2018 | CONFIDENTIAL DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION © 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
24 | JULY 2018 | CONFIDENTIAL AMD X86 CORES: DRIVING COMPETITIVE PERFORMANCE
“ZEN”
52% More Instructions
“Excavator” Core Per Clock* INSTRUCTIONS INSTRUCTIONS PER CLOCK
“Bulldozer Core
*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
25 | JULY 2018 | CONFIDENTIAL PERFORMANCE ANALYSIS TOOLS
New ISA feature prototyping eg. X86-64, perf model feedback
Insight into current Basic statistics HW behavior Parameterized statistical models
Hardware Counters + spreadsheet model 1-2 GIPS Cycle accurate CPU and memory models Functional model
Trace Analyzers 1-100 MIPS Abstraction Level Abstraction 10-100 KIPS
Perf model correlation/ find performance bugs in RTL Cycle accurate models 1-10 Kcyc/sec
RTL 1-10 cyc/sec
Speed
26 | JULY 2018 | CONFIDENTIAL