ISSCC 2021 F5.5

Architecting Chiplet Solutions for High Volume Products

Samuel Naffziger

Outline

• Trends and driving forces behind chiplet architecture
• Chiplet benefits
• Design challenges with chiplet integration – a case study
• Potential future directions

GPU and CPU Performance Trends

[Charts: two trends over time — GPU single-precision floating-point operations per second (1 GF to 100,000 GF, 2006–2020) and SPECint®_rate2006 2P-server throughput performance ratio (1X to 100X, 2008–2020).]

See Endnotes

Moore's Law Keeps Slowing

[Chart: process node introductions — 90nm, 65nm, 45nm, 32nm, 22nm, 14nm, 10/7nm — plotted against their introduction years from 2004 to 2020, showing the node cadence stretching out.]

AMD internal estimate – See Endnotes

While Costs Continue to Increase

[Chart: normalized cost per yielded mm² for a 250mm² die (scale 1.00–6.00), rising with each node from 45nm through 32nm, 28nm, 20nm, 14/16nm, and 7nm to 5nm.]

Source: AMD – See Endnotes

Die Size Trend

Die Size Increases Over Time in Server CPUs and GPUs

[Chart: die size (mm², log scale from 100 to 1000) of server CPUs and GPUs from 2006 to 2020, trending upward toward the reticle limit.]

See Endnotes

Chiplet Architectures to Extend Performance Gains

Bigger Chips to Offset Technology Slowdown

If technology scaling only gives you (say) 1.5x more devices per 24 months, why not just make chips 1.33x bigger to get 2x transistors?

[Chart: server-CPU die size (mm², log scale from 100 to 1000) from 2004 to 2020, plotted against the reticle limit.]

A hypothetical/academic yield example [1] (not real yield rates): a wafer of smaller die gives 395 chips → 362 good die (8% yield loss), while a wafer of die roughly twice as large gives 192 chips → 162 good die (16% yield loss). A quick sketch of this trade-off follows below.
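This is a minimal sketch, assuming a Poisson defect-yield model and the standard gross-die-per-wafer approximation; the wafer size, die areas, and defect density are illustrative placeholders, not the numbers behind the chart above or behind [1].

```python
# Minimal sketch: bigger die -> fewer die per wafer AND lower yield.
# Assumed Poisson yield model and standard die-per-wafer approximation;
# D0 and the die areas are hypothetical, not real process data.
import math

def dies_per_wafer(die_area_mm2, wafer_diam_mm=300):
    """Approximate gross die per wafer for a roughly square die."""
    r = wafer_diam_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diam_mm / math.sqrt(2 * die_area_mm2))

def poisson_yield(die_area_mm2, d0_per_mm2):
    """Defect-limited yield: Y = exp(-D0 * A)."""
    return math.exp(-d0_per_mm2 * die_area_mm2)

D0 = 0.0004  # defects per mm^2 (hypothetical)
for area in (200, 400):  # a die, and a die twice as large
    gross = dies_per_wafer(area)
    y = poisson_yield(area, D0)
    print(f"{area} mm^2: {gross} gross die, {gross * y:.0f} good die "
          f"({(1 - y):.0%} yield loss)")
```

With these placeholder values the yield loss roughly doubles (about 8% to 15%) when the die area doubles, mirroring the shape of the example on the slide.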

[1] Kannan, Enright Jerger, and Loh, "Enabling Interposer-Based Disintegration of Multi-Core Processors," International Symposium on Microarchitecture (MICRO), 2015.

Chiplets Background

• Alternative: build multiple smaller chips
• Historically, this was not needed for most markets
  – Except for the largest systems, Moore's Law was sufficient to meet compute needs
• Chiplets are not free
  – Additional area for interfaces and replicated logic
  – Higher packaging costs
  – Additional design effort and complexity
  – Past methodologies are less suited for chiplets

[Diagram: one generation later, a monolithic chip with 2X capability vs. two X-capability chips assembled together.]

2X device functionality costs > 2X silicon area

Progression of Advanced Integration/Packaging

[Images: ceramic-substrate MCMs, organic-substrate MCMs, and a silicon interposer [1].]

• The concept of partitioning systems into multiple chips is not new
• The evolution of packaging technology has changed the trade-offs in terms of cost, bandwidth, latency, energy, etc.

[1] Carsten Schulz, Wikimedia Commons (link, license)

High-Level Approach to Chiplets

[Diagram: wafer → test → assemble → functional SoCs, shown for multiple chiplet wafers.]

A Case Study with AMD EPYC™ Server Processors

1st Gen AMD EPYC Architecture

• The MCM approach has many advantages
  – Higher yield; enables an increased feature set
  – Multi-product leverage
• AMD EPYC™ processors: 4 x 213mm² die per package = 852mm² of silicon per package*
• A hypothetical monolithic EPYC processor would be ~777mm²*
  – Remove the die-to-die Infinity Fabric™ PHYs and logic (4 per die), duplicated logic, etc.
• 852mm² / 777mm² = ~10% MCM area overhead
• 32-core die cost: 1.0X for the monolithic die vs. 0.59X¹ for the four-die MCM (a toy yield-model comparison follows below)

[Diagram: a traditional monolithic die vs. the 1st Gen EPYC package with four die (Die 0–Die 3), each containing CCXs, DDR, and I/O.]
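The area and cost arithmetic above can be reproduced approximately with a toy model. This is a hedged sketch: the Poisson yield model and the defect density D0 are assumptions, not AMD's internal yield model, so the cost ratio it produces only lands near the quoted 0.59X.

```python
# Toy comparison of 4 x 213 mm^2 chiplets vs. one ~777 mm^2 monolithic die,
# assuming a Poisson yield model (Y = exp(-D0 * A)) and ignoring package and
# test cost.  D0 is a made-up mature-node defect density.
import math

D0 = 0.001            # defects per mm^2 (hypothetical)
CHIPLET_AREA = 213.0  # mm^2 per 1st Gen EPYC die
N_CHIPLETS = 4
MONO_AREA = 777.0     # mm^2, hypothetical monolithic 32-core die

def silicon_cost_of_good_die(area_mm2):
    """Silicon cost of one functional die, in yielded-mm^2 units."""
    return area_mm2 / math.exp(-D0 * area_mm2)

mcm_cost = N_CHIPLETS * silicon_cost_of_good_die(CHIPLET_AREA)
mono_cost = silicon_cost_of_good_die(MONO_AREA)

print(f"MCM area overhead: {N_CHIPLETS * CHIPLET_AREA / MONO_AREA - 1:.0%}")    # ~10%
print(f"32-core MCM silicon cost vs. monolithic: {mcm_cost / mono_cost:.2f}x")  # ~0.6x
```

The ~10% area overhead falls out directly from the die areas; the cost ratio (~0.6x with this assumed D0) is in the neighborhood of the 0.59X figure, which was derived from AMD's historical defect-density data.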

1. Based on an AMD internal yield model using historical defect density data for mature technologies

What to do for an Encore?

• Leadership performance requires the benefits of 7nm
• Yet the cost of advanced technologies keeps increasing
• The Gen1 architecture does not scale well to double the core counts
• Innovation required

7nm compute efficiency gains¹: 2X density, >1.25X frequency (at the same power), 0.5X power (at the same performance)

[Chart: normalized cost per yielded mm² for a 250mm² die (scale 1.00–6.00), rising across process nodes.]

1. Based on June 8, 2018 AMD internal testing of a same-architecture product ported from 14 to 7 nm technology with a similar implementation flow/methodology, using performance from SGEMM.

Chiplets to Maximize 7nm Benefits

• High-performance server and desktop processors are IO-heavy
• Analog devices and bump pitches for IO benefit very little from leading-edge technology, and that technology is very costly
• Solution: partition the SoC, reserving the expensive leading-edge silicon for the CPU cores while leaving the IO and memory interfaces in N-1 generation silicon

[Die photos: on the prior-generation processor die, the CPU cores + L3 comprise ~56% of the area — these circuits see the increased 7nm gains — while the remaining ~44% (IFOP SerDes, DDR/IO, SMU, DFx) sees very little performance or density improvement from 7nm. The 7nm CCD (Zen2 cores, L3, IFOP SerDes, SMU, DFx) is ~86% CPU + L3.]
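As a rough back-of-the-envelope illustration (assuming, hypothetically, that the scalable ~56% of the die gets the full 2X density benefit of 7nm while the remaining ~44% gets essentially none): a monolithic 7nm port would shrink only to about 0.56/2 + 0.44 ≈ 0.72 of its original area, yet would pay leading-edge wafer cost for the ~44% that barely benefits. Partitioning instead moves that ~44% onto cheaper N-1 silicon and fills the 7nm CCD almost entirely (~86%) with the logic that actually scales.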

Chiplets Evolved – Hybrid Multi-die Architecture

Traditional Monolithic → 1st Gen EPYC CPU → 2nd Gen EPYC CPU

• Use an advanced technology where it is needed most: each IP in its optimal technology
• Superior technology for CPU performance
• A centralized I/O die improves NUMA and power, connected with 2nd Gen Infinity Fabric™

Connecting the Chiplets

• Silicon interposers and bridges provide high wire density, but have limited reach
  – They only support die-edge connectivity, which limits the number of chiplets and cores that can be supported
• Performance goals required more Core Complex Die (CCDs) than can be tiled adjacent to the IOD
• Solution: retain the on-package SerDes links for die-to-die connections

[Diagram: a theoretical interposer-based arrangement (CCDs and IOD on an interposer) vs. the selected MCM approach.]

Package Routing Challenges

• The prior generation already consumed almost all package routing resources for memory and IO
• Connecting 9 chiplets in the same package requires innovation

[Diagram: 1st Gen AMD EPYC™ package routing [1] — four die (Die 0–Die 3), each with two CCXs plus DDR and I/O interfaces.]

[1] Beck et al., ISSCC 2018

Under-CCD Routing

[Diagram: 2nd Gen EPYC package floorplan — eight CCDs around the central IOD, with DDR interfaces flanking the IOD and SerDes along the top and bottom package edges.]

Routing Infinity Fabric on Package (IFOP) SerDes links from the IOD to the two-deep chiplets required sharing routing layers with off-package SerDes and competing with power-delivery requirements.

Zen vs. Zen 2 VDDM Distribution

Dense SRAMs require a separate rail

• Zen: VDDM distribution via a package plane
• Zen 2: VDDM distribution via RDL only

Zen 2 VDDM Design Challenges

• RDL is more resistive than a dedicated package layer
• Therefore, overall VDDM current draw was reduced by ~80% compared to Zen [Singh ISSCC 2020]
• New, smaller, and distributed LDO design: 4 VDDM LDOs inside the L3
• Ensured sufficient routing porosity through the integrated LDOs to enable critical routing
• These improvements kept the IR-drop impact to ≈10mV
• The RDL-only scheme enables 80 IFOP package-routed signals under the CCD

[Die plot: VDDM RDL spanning the LDO, L2, and L3 across the Core+L2 and 4MB L3 slices.]
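For intuition (with hypothetical numbers, not measured data): the resistive drop is simply V = I × R, so for a fixed RDL resistance the drop scales directly with current. A delivery path that would sag around 50mV at Zen-like VDDM current sags only ~10mV at one-fifth of that current, which is why the ~80% current reduction is what makes the package-plane-free, RDL-only distribution workable.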

Package Integration, Server, and Desktop

• Bump pitch for 14nm and 7nm is 150um and 130um respectively
• Transitioned the IOD from solder bumps to copper pillars, enabling a common interface for IOD + CCD
  – Conducive to tighter bump pitches (compact)
  – Enabled a common die height after assembly
  – Higher maximum current (electromigration) limits

[Diagrams: the 2nd Gen AMD EPYC™ server processor with eight Zen2 CCDs, 128 total x16 SerDes, and 72 data + 8 clock/control die-to-die signals per CCD; the 3rd Gen AMD Ryzen™ processor with two Zen2 CCDs on a client IOD containing the die-to-die Infinity Fabric, IO controllers and PHYs, and 2 x DDR4 PHYs.]
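For a roughly area-limited interface, connection density scales with the inverse square of bump pitch, so moving from a 150um to a 130um pitch buys about (150/130)² ≈ 1.33X more connections in the same footprint — one reason the copper-pillar transition matters for fitting the common die-to-die interface.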

Chiplet-enabled Socket Upgrades

• The chiplet architecture allows migration to "Zen 3" CCDs without disrupting the platform
• Re-use the client IOD
• Enable in-place upgrades to "Zen 3" for AMD Socket AM4 with ~19% IPC increase

[Diagram: 3rd Gen AMD Ryzen™ processor and 4th Gen AMD Ryzen™ processor, each pairing two CCDs with the same client IOD (die-to-die Infinity Fabric, IO controllers and PHYs, 2 x DDR4 PHYs); the newer part swaps in Zen3 CCDs.]

Improving Memory Performance

• Server memory latency is a key factor in performance
• A goal for 2nd Gen was to improve on the 2017 1st Gen EPYC™ CPU design
• Non-Uniform Memory Access (NUMA) behaviors are a result of memory interfaces being distributed across die
• Significant delays from NUMA1 to NUMA2 impact performance for some applications

Prior generation (EPYC™ 7001 Series processors): 3 NUMA distances, 8 NUMA domains

NUMA domain   Latency¹ (ns)
NUMA1         90
NUMA2         141
NUMA3         234
Avg. Local²   128

1: AMD internal testing with DRAM page miss
2: 75% NUMA 2 + 25% NUMA 1 traffic mix

2nd Gen AMD EPYC™ CPU Improved Memory Latency

• A central IOD enables a single NUMA domain per socket
• Improved average memory latency¹ by approximately 24ns (~19%)² — a quick check of this arithmetic follows below
• Minimum (local) latency only increases by approximately 4ns with the chiplet architecture

[Diagram: the 2nd Gen EPYC socket as a single NUMA domain — eight CCDs (CCD0–CCD7) around the IOD, UMC0–UMC7, xGMI and PCIe interfaces, and memory channels MA–MH interleaved. At 1.46GHz FCLK / DDR4-2933 (coupled), measured latencies from a core to progressively farther memory channels are ~94ns (local), ~97ns, ~104ns, and ~114ns, for a measured average of ~104ns; each repeater adds 1 FCLK and each switch adds 2 FCLK (low-load bypass, best case).]
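The arithmetic behind the quoted improvement can be checked directly from the numbers on these two slides; the 75% NUMA 2 / 25% NUMA 1 traffic mix comes from the previous slide's footnote and the 2nd Gen figures are the measured values above.

```python
# Quick check of the latency arithmetic quoted on these slides.  The 1st Gen
# per-domain latencies and the 75%/25% traffic mix come from the prior
# slide's table and footnote; the 2nd Gen numbers are the measured values.
gen1 = {"NUMA1": 90, "NUMA2": 141, "NUMA3": 234}          # ns, 1st Gen EPYC
gen1_avg = 0.75 * gen1["NUMA2"] + 0.25 * gen1["NUMA1"]    # "Avg. Local"

gen2_local, gen2_avg = 94, 104                             # ns, measured

print(f"1st Gen weighted average: {gen1_avg:.0f} ns")               # ~128 ns
print(f"Average improvement: {gen1_avg - gen2_avg:.0f} ns "
      f"({(gen1_avg - gen2_avg) / gen1_avg:.0%})")                   # ~24 ns, ~19%
print(f"Local-latency increase: {gen2_local - gen1['NUMA1']} ns")    # ~4 ns
```

Running this reproduces the ~128ns prior-generation average, the ~24ns (~19%) average improvement, and the ~4ns local-latency increase quoted above.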

1: AMD internal testing with DRAM page miss
2: EPYC 7002 Series NUMA 1 vs. EPYC 7001 Series Avg. Local; EPYC 7002 Series NUMA 2 vs. EPYC 7001 Series NUMA 3

2nd Gen AMD EPYC™ Chiplet Performance vs. Cost

• Higher core counts and performance than possible with a monolithic design
• Lower costs at all core count/performance points in the product line
• Cost scales down with performance by depopulating chiplets (see the sketch below)
• 14nm technology for the IOD reduces fixed cost

[Chart: normalized die cost (scale 0–2) at 64, 48, 32, 24, and 16 cores, comparing the chiplet 7nm + 14nm design against a hypothetical monolithic 7nm design.]
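A minimal sketch of why the chiplet curve in this chart slopes down: lower-core-count parts simply populate fewer 7nm CCDs on the same 14nm IOD, while a hypothetical monolithic design pays for one large 7nm die at every point. All areas, relative wafer costs, and the defect density below are made-up illustrative values, not AMD cost data.

```python
# Illustrative-only cost-scaling sketch (all numbers hypothetical).
import math

D0 = 0.0007                       # defects per mm^2, hypothetical
CCD_AREA, IOD_AREA = 80.0, 400.0  # mm^2, hypothetical
COST_7NM, COST_14NM = 2.0, 1.0    # relative cost per mm^2, hypothetical

def good_die_cost(area, cost_per_mm2):
    """Cost of one functional die under a Poisson yield model."""
    return area * cost_per_mm2 / math.exp(-D0 * area)

def chiplet_cost(n_ccd):
    return (n_ccd * good_die_cost(CCD_AREA, COST_7NM)
            + good_die_cost(IOD_AREA, COST_14NM))

def monolithic_cost(n_ccd):
    # Everything on one 7nm die (the largest case would exceed the reticle).
    return good_die_cost(n_ccd * CCD_AREA + IOD_AREA, COST_7NM)

base = chiplet_cost(8)  # normalize to the 64-core chiplet design
for cores, n_ccd in [(64, 8), (48, 6), (32, 4), (24, 3), (16, 2)]:
    print(f"{cores:>2} cores: chiplet {chiplet_cost(n_ccd)/base:.2f}, "
          f"hypothetical monolithic {monolithic_cost(n_ccd)/base:.2f}")
```

Even with these toy numbers, the chiplet cost falls roughly linearly as CCDs are depopulated, while the monolithic cost stays burdened by the large 7nm die and its yield loss.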

Potential Future Directions

Design For Reuse and Flexibility

[Diagram: a chiplet library, including 3rd-party chiplets, assembled onto a common substrate.]

• Generality vs. optimization
  – Interface widths and speeds, supported functionality/protocol(s), fixed pinouts
  – Memory options (e.g., not all systems need HBM)
  – Form factor/chiplet size (hard to support too many sizes; unnecessarily large silicon increases costs)
  – Power delivery: pinouts, supported voltage(s), voltage regulation, current draw, required decap
  – Thermal budgeting/allowance, cooling solutions
• Design for reuse in large-scale design
  – How to leverage wafer-scale systems across smaller scales?

ISSCC Forum 2021 28 of 42 SC2-1 Slide 28 Architectural Partitioning

• Decomposition for a system
• Multi-objective technology selection and optimization
• Decomposition for N arbitrary, undesigned, unimagined systems
• Interfaces
  – Datapath interfaces: physical, protocol
  – Control interfaces: power management, debug, profiling, security, system alerts (e.g., thermal emergency)

ISSCC Forum 2021 29 of 42 SC2-1 Slide 29 Memory Considerations in Partitioned Architectures

• Memory consistency models, coherency, virtual memory

[Diagram: shared data (X) accessed by agents P and P' spread across chiplets.]

ISSCC Forum 2021 30 of 42 SC2-1 Slide 30 Security, Untrusted Chiplets, etc.

• QoS and security: denial of service (accidental or malicious), snooping, side channels

ISSCC Forum 2021 31 of 42 SC2-1 Slide 31 What might come next?

• Reduce chiplet connection overhead with interposers and denser interconnect
• Memory stacking directly on compute
• True 3D stacking beyond memory
  – HBM and Hybrid Memory Cube are already doing memory-on-logic of sorts
  – Expand to heterogeneous components

ISSCC Forum 2021 32 of 42 SC2-1 Slide 32 3D Stacking Options vs. TSV Pitch

[Diagram: 3D stacking options at increasing granularity vs. required TSV pitch — circuit slicing (e.g., a datapath split across die as X[31:0]/Y[31:0] on one die and X[63:32]/Y[63:32] on the other, with carries crossing TSVs), IP folding/splitting, macro on macro, IP on IP (e.g., cores on uncore), and full die-to-die (e.g., DRAM on CPU, CPU on CPU, cores on cores).]

ISSCC Forum 2021 33 of 42 SC2-1 Stacked DRAM Reliability Challenges

• Traditional DRAM provides ECC with redundant storage: 64 data bits + 8 ECC bits = 72 bits, spread across many chips (e.g., a 9th DRAM chip dedicated to ECC alongside 8 data chips)
• With stacked DRAM this is hard to do, because all the bits come from a single chip
  – Options include dedicating ½ of a DRAM chip to ECC (with ½ wasted?) or adding a 5th ECC chip to 4 data chips
• True whether the stacked DRAM is used as cache or as system memory — the worked overhead numbers follow below

[Diagram: a conventional module with data chips plus a dedicated ECC chip, contrasted with 4-chip and 8-chip stacked-DRAM configurations that must find room for the ECC bits.]
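For scale (standard ECC arithmetic, not figures from the slide): 8 check bits per 64 data bits is a 12.5% capacity overhead (72/64). On a conventional module built from x8 devices that overhead maps cleanly onto one extra chip — eight data chips plus a ninth for ECC. When all 72 bits must come from one stack, the overhead no longer aligns with a chip boundary, so the options above either waste capacity (roughly half of an added chip goes unused) or carve the check bits out of the data devices themselves.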

Parametric Variations

• Slow die + fast die challenge
  – Do not want to sell it as an all-slow part
  – Perform pre-stack binning (how?)
  – Restrict frequency boosting to the fast core(s)? [1]
• Clock distribution
  – Skew/jitter across layers [2]

[1] WikiMedia, Creative Commons (link, license)
[2] amd.com

Known Good Die Testing

• One faulty die could ruin the entire stack (very expensive)
• Need to test functionality (at least partially) before stacking/bonding
• Challenges
  – Limited connectivity (few pads, maybe none? only TSVs/uBumps?)
  – Limited ability to supply power
  – Incomplete functionality: what if the PLL/clocking is on a different chip?


Thermal Requirements

• 3D stacking increases power density
• Need to creatively interleave cool and hot components in the vertical dimension
• BEOL dielectrics act as thermal insulators, reducing cooling efficiency with stack height
• Power management must become much more sophisticated


Summary

• Silicon scaling is running into multiple challenges
• 2D chiplets are a path to keep things going
  – Challenges with managing interconnect overhead (power, cost)
  – Generalized 3rd-party IP integration is promising but requires standardization of interfaces and innovation in security and memory management
• 3D stacking reduces overhead while introducing new challenges
  – Binning, testing, thermals, power delivery, and management
• Plenty of opportunity for innovation to support scaling in all dimensions

Acknowledgment

Special thanks to Gabe Loh, Kevin Lepak, and Mahesh Subramony for their contributions to this material, and of course the talented AMD design teams from around the world for the AMD EPYC™ server processor engineering achievements

Endnotes

Slides 2, 3, 4, 5, 12: Lisa T. Su, Samuel Naffziger, and Mark Papermaster, "Multi-Chip Technologies to Unleash Computing Performance Gains over the Next Decade," IEDM Conference 2017.

Slide 6: Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp. https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

© 2020 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
