Ryzen-Slides

RYZEN PROCESSOR MICROARCHITECTURE

ANASUA BHOWMIK
PRINCIPAL MEMBER OF TECHNICAL STAFF
MICROPROCESSOR PERFORMANCE MODELING AND ARCHITECTURE TEAM
AMD, BANGALORE

1 | JULY 2018 | CONFIDENTIAL

AGENDA

• Road to RYZEN – AMD’s new X86 CPU
• High level architecture of Zen
  ‒ Core engine
  ‒ Floating point unit
  ‒ Cache hierarchy
  ‒ Simultaneous Multithreading
• Zen based SoCs
• Design Challenges
• Contributions from India Design Center
• Next step

RYZEN CPU

• High performance CPU
• 8 cores, 16 threads
• Zen core
• Over 4.8 billion transistors
• Over 2000 meters of wiring
• 14 nm process technology

CPU COMPLEX

• A CPU complex (CCX) is four cores connected to an L3 Cache.
• The L3 Cache is 16-way associative, 8MB, mostly exclusive of L2.
• The L3 Cache is made of 4 slices, by low-order address interleave.
• Every core can access every cache with the same average latency.
• Ryzen CPU in Thread Ripper has 2 CCXs
• Epyc Server has 8 CCXs
• Ryzen mobile has 1 CCX

[Diagram: CCX floorplan. CORE 0 to CORE 3 at the corners, each next to its 512K L2 (L2M, L2CTL) and 1MB L3 macros (L3M, L3CTL).]
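As an illustration of "4 slices, by low-order address interleave", the sketch below maps a physical address to a slice index. The 64-byte line size and the choice of the address bits directly above the line offset are assumptions made for this example; the slide does not say which bits are used.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines (not stated on the slide)
NUM_SLICES = 4   # from the slide: the L3 is made of 4 slices


def l3_slice(phys_addr: int) -> int:
    """Return an illustrative L3 slice index for a physical address.

    Consecutive cache lines rotate through the slices, which is one simple
    way to realize a low-order address interleave.
    """
    line_number = phys_addr // LINE_BYTES
    return line_number % NUM_SLICES


if __name__ == "__main__":
    for addr in (0x0000, 0x0040, 0x0080, 0x00C0, 0x0100):
        print(f"addr {addr:#06x} -> slice {l3_slice(addr)}")
```

With this mapping a sequential stream touches all four slices in turn, which is consistent with every core seeing the same average latency to the L3.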

AMD CPU DESIGN OPTIMIZATION POINTS

ONE CORE FROM FANLESS NOTEBOOKS TO SUPERCOMPUTERS

[Chart: design optimization points for “Jaguar” (Low Power), “Excavator” (High Performance) and “ZEN”. Vertical axis: PERFORMANCE. Horizontal axis: from LOWER POWER, SMALLER to MORE PERFORMANCE, MORE AREA, POWER.]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.

DEFYING CONVENTION: A WIDE, HIGH PERFORMANCE, EFFICIENT CORE

• 52% more work per cycle*
• At = Energy Per Cycle (equal energy per cycle)
• Total Efficiency Gain

[Chart: Instructions-Per-Clock and Energy Per Cycle for “Bulldozer”, “Piledriver”, “Steamroller”, “Excavator” and “ZEN”.]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.

“ZEN”: DESIGNED FROM THE GROUND UP FOR OPTIMAL BALANCE OF PERFORMANCE AND POWER

• Totally New High-performance Core Design
• New High-Bandwidth, Low Latency Cache System
• Simultaneous Multithreading (SMT) for High Throughput
• Energy-efficient FinFET Design Scales from Enterprise to Client Products

ZEN MICROARCHITECTURE

• Fetch Four x86 instructions
• Op Cache instructions
• Integer units
  ‒ Large rename space – 168 Registers
  ‒ 192 instructions in flight/8 wide retire
• Load/Store units
  ‒ 72 Out-of-Order Loads supported
• Floating Point units
  ‒ built as 4 pipes, 2 Fadd, 2 Fmul
• I-Cache 64K, 4-way
• D-Cache 32K, 8-way
• L2 Cache 512K, 8-way
• Large shared L3 cache – 8M 16 way
• 2 threads per core

[Diagram: Zen core block diagram. Front end: Branch Prediction, 64K 4-way I-Cache, Decode (4 instructions/cycle), Op Cache, Micro-op Queue, 6 ops dispatched. INTEGER: Integer Rename, six Schedulers, Integer Physical Register File, 4 ALUs, 2 AGUs. FLOATING POINT: Floating Point Rename, Scheduler, FP Register File, MUL, ADD, MUL, ADD. Memory: Load/Store Queues, 32K 8-way D-Cache with 2 loads + 1 store per cycle, 512K 8-way L2 (I+D) Cache.]
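The cache sizes and associativities above determine how many sets each cache has. The sketch below works that out, assuming 64-byte lines (an assumption; the slide gives only size and ways).

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines (not stated on the slide)

# (size in bytes, ways) as listed on the slide
CACHES = {
    "I-Cache": (64 * 1024, 4),
    "D-Cache": (32 * 1024, 8),
    "L2": (512 * 1024, 8),
    "L3": (8 * 1024 * 1024, 16),
}


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


if __name__ == "__main__":
    for name, (size, ways) in CACHES.items():
        print(f"{name}: {num_sets(size, ways)} sets")
```

Under the 64-byte assumption this gives 256 sets for the I-Cache, 64 for the D-Cache, 1024 for the L2 and 8192 for the L3.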

FETCH

• Decoupled Branch Prediction
• TLB in the BP pipe
  ‒ 8 entry L0 TLB, all page sizes
  ‒ 64 entry L1 TLB, all page sizes
  ‒ 512 entry L2 TLB, no 1G pages
• 2 branches per BTB entry
• Large L1 / L2 BTB
• 32 entry return stack
• Indirect Target Array (ITA)
• 64K, 4-way Instruction cache
• Micro-tags for IC & Op cache
• 32 byte fetch

[Diagram: fetch pipeline. Blocks: Next PC, Redirect from DE/EX, L0/L1/L2 TLB, L1/L2 BTB, Hash Perceptron, Return Stack/ITA, Physical Request Queue, Micro-Tags, To Op Cache, 64K Instruction Cache. Links: 32 bytes/cycle from L2, 32 bytes to Decode.]
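To make the "32 entry return stack" concrete, here is a minimal sketch of a return address predictor as a fixed-size circular stack. The overwrite-oldest behavior on overflow is an assumption chosen for the example; the slide gives only the entry count.

```python
class ReturnStack:
    """Fixed-size circular return-address stack (illustrative model)."""

    def __init__(self, entries: int = 32):  # 32 entries, from the slide
        self.entries = entries
        self.slots = [None] * entries
        self.top = 0    # index of the next free slot (wraps around)
        self.depth = 0  # number of valid entries

    def push(self, return_addr: int) -> None:
        """On a call: record the return address, overwriting the oldest if full."""
        self.slots[self.top] = return_addr
        self.top = (self.top + 1) % self.entries
        self.depth = min(self.depth + 1, self.entries)

    def pop(self):
        """On a return: predict the most recent return address, or None if empty."""
        if self.depth == 0:
            return None
        self.top = (self.top - 1) % self.entries
        self.depth -= 1
        return self.slots[self.top]
```

Call chains deeper than the stack lose their oldest entries, so the outermost returns of a very deep chain would fall back to another predictor or mispredict.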

DECODE

• Inline Instruction-length Decoder
• Decode 4 x86 instructions
• Op cache
• Micro-op Queue
• Stack Engine
• Branch Fusion
• Memory File for Store to Load Forwarding

[Diagram: decode pipeline. Blocks: From IC, From Micro Tags, Instruction Byte Buffer, Pick, Decode (Instructions), Op Cache (Micro-ops), Micro-op Queue, Microcode Rom, Stack Engine, Dispatch. Outputs: To EX, 6 Micro-ops; To FP, 4 Micro-ops.]
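Branch fusion combines a flag-setting instruction with the conditional branch that follows it, so the pair occupies one slot downstream. The sketch below shows the idea on a list of mnemonics. The sets of fusible instructions are hypothetical and chosen only for illustration; the slide does not list which pairs Zen fuses.

```python
# Hypothetical sets for illustration; not a statement of what Zen fuses.
FUSIBLE_FIRST = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jge", "jg", "jle"}


def fuse_branches(instructions):
    """Merge each adjacent (compare/test, conditional jump) pair into one op."""
    fused = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in FUSIBLE_FIRST and nxt in CONDITIONAL_JUMPS:
            fused.append(f"{cur}+{nxt}")  # one fused op replaces two
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


if __name__ == "__main__":
    print(fuse_branches(["add", "cmp", "jne", "mov", "test", "je"]))
```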

EXECUTE

• 6x14 entry Scheduling Queues
• 168 entry Physical Register File
• 6 issue per cycle
  ‒ 4 ALUs, 2 AGUs
• 192 entry Retire Queue
• 2 Branches per cycle
• Move Elimination
• 8-Wide Retire

[Diagram: integer execution unit. Blocks: 6 Micro-op Dispatch, Redirect to Fetch, To DE, Map, Retire Queue, scheduling queues ALQ0 to ALQ3 and AGQ0, AGQ1, 168 Entry Physical Register File, Forwarding Muxes, ALU0 to ALU3, AGU0, AGU1.]
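Move elimination can be pictured at the rename stage: a register-to-register move is handled by pointing the destination at the source's physical register, so no ALU op is issued. The sketch below is a simplified model under stated assumptions: reference counting is one possible bookkeeping scheme, and registers are freed at rename time rather than at retire as real hardware would. Only the 168-entry register file size comes from the slide.

```python
class RenameMap:
    """Toy register rename table with move elimination (illustrative)."""

    def __init__(self, arch_regs, num_physical: int = 168):  # 168 from the slide
        self.free = list(range(num_physical))
        self.refcount = [0] * num_physical
        self.table = {}
        for reg in arch_regs:
            self.table[reg] = self._allocate()

    def _allocate(self) -> int:
        preg = self.free.pop(0)
        self.refcount[preg] = 1
        return preg

    def _release(self, preg: int) -> None:
        # Simplification: freed immediately instead of at retire.
        self.refcount[preg] -= 1
        if self.refcount[preg] == 0:
            self.free.append(preg)

    def rename_dest(self, dest: str) -> int:
        """Ordinary op: give the destination a fresh physical register."""
        old = self.table[dest]
        self.table[dest] = self._allocate()
        self._release(old)
        return self.table[dest]

    def rename_move(self, dest: str, src: str) -> int:
        """Eliminated move: alias dest to src's physical register."""
        old = self.table[dest]
        new = self.table[src]
        self.refcount[new] += 1
        self.table[dest] = new
        self._release(old)
        return new
```

After `rename_move("rax", "rbx")` both architectural registers name the same physical register, and a later write to either one allocates a new register without disturbing the other.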

LOAD/STORE AND L2

• 72 Out of Order Loads
• 44 entry Store Queue
• Split TLB/Data Pipe, store pipe
• 64 entry L1 TLB, all page sizes
• 1.5K entry L2 TLB, no 1G pages
• 32K, 8 way Data Cache
  ‒ Supports two 128-bit accesses
• Optimized L1 and L2 Prefetchers
• 512K, private (2 threads), inclusive L2

[Diagram: load/store unit. Blocks: AGU0, AGU1, To Ex, Store Queue, Load Queue, Pre Fetch, L0 Pick, L1 Pick, Store Pipe Pick, TLB0, DAT0, DAT1, TLB1, STP, Store Commit, L1/L2 TLB + DC tags, 32K Data Cache, MAB, WCB, To FP, To L2. Link: 32 bytes to/from L2.]
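The TLB entry counts translate into "reach", the amount of memory that can be mapped without a TLB miss. The sketch below computes it for the standard x86-64 page sizes (4K, 2M, 1G). It assumes "1.5K" means 1536 entries and, as a further simplification, that every entry holds a page of the same size.

```python
PAGE_BYTES = {"4K": 4 * 1024, "2M": 2 * 1024**2, "1G": 1024**3}

L1_DTLB_ENTRIES = 64    # from the slide, all page sizes
L2_DTLB_ENTRIES = 1536  # assumption: "1.5K" read as 1.5 * 1024, no 1G pages


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


if __name__ == "__main__":
    for size in ("4K", "2M", "1G"):
        print(f"L1 DTLB, {size} pages: {tlb_reach(L1_DTLB_ENTRIES, PAGE_BYTES[size])} bytes")
    for size in ("4K", "2M"):
        print(f"L2 DTLB, {size} pages: {tlb_reach(L2_DTLB_ENTRIES, PAGE_BYTES[size])} bytes")
```

With 4K pages this works out to 256 KiB of reach for the L1 TLB and 6 MiB for the L2 TLB; with 2M pages the L2 TLB covers 3 GiB.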

FLOATING POINT

• 2 Level Scheduling Queue
• 160 entry Physical Register File
• 8 Wide Retire
• 1 pipe for 1x128b store
• Accelerated Recovery on Flushes
• SSE, AVX1, AVX2, AES and legacy MMX/x87 compliant
• 2 AES units

[Diagram: floating point unit. Blocks: 4 Micro-op Dispatch, 8 Micro-op Retire, 192 Entry Retire Queue, NSQ, SQ, LDCVT, 128 bit Loads, Int to FP, FP to Int, 160 Entry Physical Register File, Forwarding Muxes, MUL0, ADD0, MUL1, ADD1.]

ZEN CACHE HIERARCHY

• Fast private 512K L2 cache
• Fast shared L3 cache
• High bandwidth enables prefetch improvements
• L3 is filled from L2 victims
• Fast cache-to-cache transfers
• Large Queues for Handling L1 and L2 misses

[Diagram: CORE 0 cache hierarchy. 64K 4-way I-Cache (32B fetch) and 32K 8-way D-Cache (2*16B load, 1*16B store) connect to the 512K 8-way L2 I+D Cache, which connects to the 8M 16-way L3 I+D Cache. Each link is labeled 32B/cycle.]
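"L3 is filled from L2 victims" describes a victim-cache arrangement: a line enters the L3 only when the L2 evicts it. The sketch below models this with two small LRU structures. It is a deliberate simplification: both levels are fully associative and strictly exclusive, whereas the slides describe the real L3 as 16-way and "mostly exclusive".

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy L2 plus victim L3, both fully associative LRU (illustrative)."""

    def __init__(self, l2_lines: int, l3_lines: int):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, line: int) -> str:
        if line in self.l2:
            self.l2.move_to_end(line)  # refresh LRU position
            return "L2 hit"
        if line in self.l3:
            del self.l3[line]          # exclusive: line moves back to L2
            result = "L3 hit"
        else:
            result = "miss"            # would be served from memory
        self._fill_l2(line)
        return result

    def _fill_l2(self, line: int) -> None:
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line
            self._fill_l3(victim)                    # the L2 victim fills L3
        self.l2[line] = True

    def _fill_l3(self, line: int) -> None:
        if len(self.l3) >= self.l3_lines:
            self.l3.popitem(last=False)
        self.l3[line] = True
```

A line brought in by a miss goes straight to the L2 and reaches the L3 only when evicted, so the two levels together hold close to the sum of their capacities.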

CPU COMPLEX

• A CPU complex (CCX) is four cores connected to an L3 Cache.
• The L3 Cache is 16-way associative, 8MB, mostly exclusive of L2.
• The L3 Cache is made of 4 slices, by low-order address interleave.
• Every core can access every cache with the same average latency.
• Ryzen has 2 CCXs
• Naples Server has 4 CCXs

[Diagram: CCX floorplan. CORE 0 to CORE 3 at the corners, each next to its 512K L2 (L2M, L2CTL) and 1MB L3 macros (L3M, L3CTL).]

SMT OVERVIEW

• All structures fully available in 1T mode
• Front End Queues are round robin with priority overrides
• Increased throughput from SMT

[Diagram: Zen core block diagram with each structure marked by its SMT sharing policy. Legend: Vertically Threaded, Competitively shared structures, Statically Partitioned. Blocks: Branch Prediction, L0/L1/L2 ITLB, 64K 4-way I-Cache, Decode, Op Cache, Micro-op Queue (instructions, micro-ops), 6 ops dispatched, Retire Queue, Integer Rename, Floating Point Rename, Schedulers, Integer Physical Register File, FP Register File, 4x ALUs, 2x AGUs, MUL, ADD, MUL, ADD, Load Queue, Store Queue, L1/L2 DTLB, 32K 8-way D-Cache, 512K 8-way L2 (I+D) Cache.]
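"Round robin with priority overrides" can be sketched as a small arbiter that alternates between ready threads unless one thread is flagged as having priority. What raises the priority flag in the real design is not described on the slide, so `priority` here is a hypothetical input.

```python
def pick_thread(ready, last_picked, priority=None):
    """Choose which SMT thread uses a shared front-end slot this cycle.

    ready:       list of booleans, one per thread, True if it has work
    last_picked: thread id chosen on the previous cycle
    priority:    optional thread id that wins whenever it is ready
                 (hypothetical override input for this sketch)
    Returns the chosen thread id, or None if no thread is ready.
    """
    num_threads = len(ready)
    if priority is not None and ready[priority]:
        return priority
    # Round robin: start searching from the thread after the last winner.
    for offset in range(1, num_threads + 1):
        candidate = (last_picked + offset) % num_threads
        if ready[candidate]:
            return candidate
    return None
```

When only one thread is ready it wins every cycle, which matches the first bullet: in 1T mode the single thread gets the whole structure.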

ZEN PERFORMANCE & POWER IMPROVEMENTS

BETTER CORE ENGINE
‒ SMT - two threads per core
‒ Better branch prediction
‒ Large Op Cache
‒ Wider micro-op dispatch 6 vs. 4
‒ Larger Instruction Schedulers, Integer: 84 vs. 48 | FP: 96 vs. 60
‒ Larger retire 8 ops vs. 4 ops
‒ 4 issue FPU
‒ Larger Retire Queue 192 vs. 128
‒ Larger Load Queue 72 vs. 44
‒ Larger Store Queue 44 vs. 32

BETTER CACHE SYSTEM
‒ Write back L1 cache
‒ Faster L2 cache
‒ Faster L3 cache
‒ Faster Load to FPU: 7 vs. 9 cycles
‒ Better L1 and L2 data prefetcher
‒ Close to 2x the L1 and L2 bandwidth
‒ Total L3 bandwidth up 5x

LOWER POWER
‒ Aggressive clock gating with multi-level regions
‒ Write back L1 cache
‒ Large Op Cache
‒ Stack Engine
‒ Move elimination
‒ Power focus from project inception

52% IPC PERFORMANCE UPLIFT

RYZEN BASED SOCS

• EPYC SERVER
  ‒ 8 CCX, 4 Dies in each socket
  ‒ Connected by fast Infinity Fabric
  ‒ Support for virtualization, encryption, ECC
  ‒ Targeted for Data centers
• THREAD RIPPER DESKTOP
  ‒ High Frequency
  ‒ Gaming performance
• RAVEN RIDGE LOW POWER APU (CPU + GPU)
  ‒ 4M L3
  ‒ Power optimization
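The CCX counts quoted in these slides translate directly into core, thread and L3 totals. The sketch below does the arithmetic using figures stated elsewhere in the deck (4 cores per CCX, 2 threads per core, 8MB of L3 per CCX); the function name and dictionary layout are illustrative.

```python
CORES_PER_CCX = 4     # from the CPU COMPLEX slide
THREADS_PER_CORE = 2  # SMT, two threads per core
L3_MB_PER_CCX = 8     # from the CPU COMPLEX slide


def soc_totals(num_ccx: int, l3_mb_per_ccx: int = L3_MB_PER_CCX) -> dict:
    """Aggregate cores, threads and L3 capacity for a part with num_ccx CCXs."""
    cores = num_ccx * CORES_PER_CCX
    return {
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "l3_mb": num_ccx * l3_mb_per_ccx,
    }


if __name__ == "__main__":
    print("2 CCX:", soc_totals(2))
    print("8 CCX:", soc_totals(8))
    print("1 CCX with 4M L3:", soc_totals(1, l3_mb_per_ccx=4))
```

Two CCXs give 8 cores and 16 threads, matching the RYZEN CPU slide, and the 8 CCXs quoted for EPYC give 32 cores, 64 threads and 64MB of L3 per socket.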

DESIGN CHALLENGES

• Long and expensive processor development project
  ‒ More than 2 million engineering hours for Zen over four years
• Need to make key design decisions early in the project that have long-lasting impact
  ‒ Dependence on unknowns like process readiness in the fab, yield
  ‒ Area, power
  ‒ Availability of technology
• Conflicting needs of various SoCs
  ‒ Server: Bigger LLC, MT performance more important than 1T performance
  ‒ Desktop: High frequency, 1T is more important (aggressive prefetching)
  ‒ Client: Area and power sensitive, smaller LLC
  ‒ Semicustom: Specific ISA support
  ‒ Ultra low power
• Workloads to consider
• Competitiveness of the parts

CONTRIBUTIONS FROM INDIA DESIGN CENTER

• Performance modeling, micro-architectural feature exploration, performance verification
  ‒ Development of cache hierarchy performance model (L2, L3, prefetcher) and architecture
• Traffic optimization between cores and memory
• Analysis, characterization and collection of all (Server, Client and Core) workloads
• Core, Data Fabric and SoC level verification
• L3 RTL development for Zen based client part
• Zen based client SoC architecture development
• Compiler optimizations for Zen based server (Epyc)

WORK IN PROGRESS

• Zen2 hardware is in the lab
  ‒ 7 nm core with significant performance uplift over Zen
• Microarchitecture development, performance modeling and correlation for the core following Zen2
• Exploring micro-architectural ideas for future processors
• Improving performance analysis and modeling methodologies
  ‒ Performance bottleneck analysis
  ‒ Very high thread count simulation, impact of sharing and locks
  ‒ Warming up large caches
  ‒ Execution driven vs trace driven simulation
  ‒ SoC level modeling – CPU/GPU/accelerators interaction
• Workloads analysis
  ‒ Cloud, virtualized workloads, machine learning, HPC workloads

Lots of challenging problems to work on!
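Two of the methodology items, trace driven simulation and warming up large caches, can be shown together in a few lines. The sketch below replays an address trace through a small fully associative LRU cache and only starts counting hits and misses after a warm-up window, so cold-start misses do not distort the result. The cache organization and the trace are placeholders for illustration, not a description of AMD's simulators.

```python
from collections import OrderedDict


def simulate(trace, capacity_lines, warmup_accesses):
    """Trace-driven LRU cache simulation with a warm-up phase.

    trace:           iterable of cache-line addresses
    capacity_lines:  number of lines the cache can hold
    warmup_accesses: leading accesses that update the cache but are not counted
    Returns (hits, misses) measured after warm-up.
    """
    cache = OrderedDict()
    hits = misses = 0
    for i, line in enumerate(trace):
        hit = line in cache
        if hit:
            cache.move_to_end(line)          # refresh LRU position
        else:
            if len(cache) >= capacity_lines:
                cache.popitem(last=False)    # evict least recently used
            cache[line] = True
        if i >= warmup_accesses:             # only measure after warm-up
            if hit:
                hits += 1
            else:
                misses += 1
    return hits, misses


if __name__ == "__main__":
    trace = [1, 2, 3, 4] * 3                 # placeholder trace
    print("no warm-up:", simulate(trace, 4, 0))
    print("warm-up of 4:", simulate(trace, 4, 4))
```

The larger the cache, the longer the warm-up window has to be before the measured miss rate is representative, which is why warming up large caches is listed as a methodology problem in its own right.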

PROJECT TIMELINE

[Figure: overlapping project phases along a Project Timeline axis with milestones Concept, High Level Design exit, Tapeout, Parts in the lab, Product Launch.]

• Architectural Development
  ‒ Develop high-level model
  ‒ Feature modeling and evaluation with configurable architecture
  ‒ Perform trade-off studies
• RTL Development, verification, physical design
• Performance Correlation
  ‒ Compare Perf model/RTL
  ‒ Find/fix perf bugs in both
  ‒ Validate Perf model
• Performance Projections
  ‒ Competitive analysis
  ‒ Mix simulated/measured data
  ‒ “Relative projections”
• Silicon Performance Validation and Debug
  ‒ Compare Perf model/HW
  ‒ Diagnose Perf problems
  ‒ Evaluate fixes
• Hundreds of small (<1%) micro-architectural tradeoffs

PERFORMANCE ANALYSIS AND MODELING

• Develop and explore micro-architectural ideas for performance improvement
• Variety of tools and technologies deployed for performance analysis & architectural innovation
  ‒ Cycle accurate simulator
• Constant state of evolution, due to changing workloads and architecture
  ‒ More IPC, MP, SMT, large caches
  ‒ SW and HW more tightly linked than ever
• Correct performance projection is critical for the success of the project
• Workloads analysis, characterization and tracing
  ‒ Server workloads – SPEC CPU 2006, SPEC CPU2017, Legacy Enterprise workloads, SPECjbb, Cloud, NoSQL database
  ‒ Client workloads – Games, Cinebench, Productivity suite, various use case scenarios
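Since correct performance projection depends on how well the model tracks a reference (RTL or hardware), a correlation check usually comes down to a per-workload relative error. The sketch below computes that error and picks out the worst offender. The workload names and IPC values in the usage section are made-up placeholders, not measured data.

```python
def correlation_report(model_ipc, reference_ipc):
    """Relative IPC error of the performance model against a reference.

    Both arguments map workload name -> IPC. A positive error means the
    model is optimistic for that workload.
    """
    report = {}
    for workload, ref in reference_ipc.items():
        model = model_ipc[workload]
        report[workload] = (model - ref) / ref
    return report


def worst_workload(report):
    """Workload with the largest absolute relative error."""
    return max(report, key=lambda w: abs(report[w]))


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    model = {"workload_a": 1.10, "workload_b": 0.80}
    reference = {"workload_a": 1.00, "workload_b": 1.00}
    report = correlation_report(model, reference)
    for name, err in report.items():
        print(f"{name}: {err:+.1%}")
    print("worst:", worst_workload(report))
```

Sorting workloads by this error is a simple way to decide where to look first for a bug in either the model or the design.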

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION © 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

AMD X86 CORES: DRIVING COMPETITIVE PERFORMANCE

• “ZEN”: 52% More Instructions Per Clock*

[Chart: INSTRUCTIONS PER CLOCK for the “Bulldozer” Core, the “Excavator” Core and “ZEN”.]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.

PERFORMANCE ANALYSIS TOOLS

[Chart: performance analysis tools arranged by Abstraction Level (vertical axis) against Speed (horizontal axis).]

• Hardware Counters + spreadsheet model: 1-2 GIPS
• Functional model
• Trace Analyzers
• Cycle accurate CPU and memory models
• Cycle accurate models: 1-10 Kcyc/sec
• RTL: 1-10 cyc/sec
• Other speed labels on the chart: 1-100 MIPS, 10-100 KIPS
• Uses called out on the chart:
  ‒ New ISA feature prototyping, e.g. x86-64, perf model feedback
  ‒ Insight into current HW behavior
  ‒ Basic statistics
  ‒ Parameterized statistical models
  ‒ Perf model correlation/find performance bugs in RTL
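The speed labels on this chart span roughly nine orders of magnitude, which is what drives the choice of tool. The sketch below turns a rate into wall-clock time for a fixed amount of work. The workload sizes (one billion instructions, one million cycles) are arbitrary examples; only the rates come from the slide.

```python
def seconds_to_simulate(work_units: float, units_per_second: float) -> float:
    """Wall-clock seconds to simulate work_units at a given simulation rate."""
    return work_units / units_per_second


if __name__ == "__main__":
    instructions = 1e9  # arbitrary example: one billion instructions
    for label, rate in [("1 GIPS", 1e9), ("1 MIPS", 1e6), ("10 KIPS", 10e3)]:
        hours = seconds_to_simulate(instructions, rate) / 3600
        print(f"{label}: {hours:.4f} hours for 1e9 instructions")

    cycles = 1e6  # arbitrary example: one million cycles
    for label, rate in [("10 Kcyc/sec", 10e3), ("10 cyc/sec", 10.0), ("1 cyc/sec", 1.0)]:
        hours = seconds_to_simulate(cycles, rate) / 3600
        print(f"{label}: {hours:.4f} hours for 1e6 cycles")
```

At 10 KIPS a billion instructions take about 28 hours, and at 1 cyc/sec even a million cycles take more than 11 days, so the slowest and most detailed levels can only be applied to short, carefully chosen intervals.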
