ISSCC 2021 F5.5
Architecting Chiplet Solutions for High Volume Products
Samuel Naffziger
ISSCC Forum 2021

Outline
• Trends and driving forces behind chiplet architecture
• Chiplet benefits
• Design challenges with chiplet integration – a case study
• Potential future directions

GPU and CPU Performance Trends
[Figure, two panels. Left: GPU single-precision floating-point operations per second trend (GFLOPS, log scale from 1 GF to 100,000 GF), 2006 to 2020. Right: SPECint®_rate2006 2P server performance trend over time (throughput performance ratio, log scale from 1X to 100X), 2008 to 2020. See Endnotes.]

Moore’s Law Keeps Slowing
[Figure: process nodes (90nm, 65nm, 45nm, 32nm, 22nm, 14nm, 10/7nm) plotted against year, 2004 to 2020. AMD internal estimate; see Endnotes.]

While Costs Continue to Increase
[Figure: normalized cost per yielded mm² for a 250mm² die (y-axis 0 to 6.00) at 45nm, 32nm, 28nm, 20nm, 14/16nm, 7nm, and 5nm. Source: AMD; see Endnotes.]

Die Size Trend
Die Size Increases Over Time in Server CPUs and GPUs
[Figure: die size (mm², log scale from 100 to 1000) for server CPUs and GPUs, 2006 to 2020, with the reticle limit marked. See Endnotes.]

Chiplet Architectures to Extend Performance Gains

Bigger Chips to Offset Technology Slowdown
If technology scaling only gives you (say) 1.5x more devices per 24 months, why not just make chips 1.33x bigger to get 2x transistors?
[Figure: server CPU die size (mm², log scale) versus year, 2004 to 2020, with the reticle limit marked. Wafer insets: 395 chips, 362 good die (8% yield loss); 192 chips, 162 good die (16% yield loss). Hypothetical/academic example [1], not real yield rates.]
[1] Kannan, Enright Jerger, Loh, “Enabling Interposer-based Disintegration of Multi-core Processors,” International Symposium on Microarchitecture (MICRO), 2015

Chiplets Background
• Alternative: build multiple smaller chips
• Historically, not needed for most markets
– Except for the largest systems, Moore’s Law was sufficient to meet compute needs
• Chiplets not free
– Additional area for interfaces, replicated logic
– Higher packaging costs
– Additional design effort, complexity
– Past methodologies less suited for chiplets
[Figure: a 2X die compared with X-sized chiplets, including a "One generation later" panel.]
2X device functionality costs > 2X silicon area
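The wafer counts quoted on the "Bigger Chips" slide can be turned into a quick cost comparison. The sketch below is illustrative only: it assumes both wafers cost the same and that the 192-chip case is a die of roughly twice the area of the 395-chip case (the slide does not state the areas). The dictionary and function names are made up for this example.

```python
# Wafer counts from the hypothetical/academic example [1] on the earlier slide.
SMALL = {"chips": 395, "good": 362}  # smaller die
LARGE = {"chips": 192, "good": 162}  # larger die, ASSUMED to be about 2X the area


def yield_loss(die):
    """Fraction of the die on a wafer that are not good."""
    return 1.0 - die["good"] / die["chips"]


def relative_cost_per_good_die(die, reference):
    """Cost of one good die relative to `reference`, assuming equal wafer cost."""
    return reference["good"] / die["good"]


small_loss = yield_loss(SMALL)
large_loss = yield_loss(LARGE)
cost_ratio = relative_cost_per_good_die(LARGE, SMALL)

print(f"small die yield loss: {small_loss:.1%}")
print(f"large die yield loss: {large_loss:.1%}")
print(f"one good large die costs {cost_ratio:.2f}x one good small die")
```

Under these assumptions the larger die costs a little more than 2.2x the smaller one per good die, which is the sense in which 2X functionality costs more than 2X silicon area.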

Progression of Advanced Integration/Packaging
[Figure: ceramic substrate MCMs, organic substrate MCMs, silicon interposer [1]]
• Concept of partitioning systems into multiple chips is not new
• Evolution of packaging technology has changed the trade-offs in terms of cost, bandwidth, latency, energy, etc.
[1] Carsten Schulz, Wikimedia Commons (link, license)

High-Level Approach to Chiplets
[Figure: chiplet flow: Wafer → Test → Assemble → Functional SoCs.]

A Case Study with AMD EPYC™ Server Processors

1st Gen AMD EPYC Architecture
• MCM approach has many advantages
– Higher yield, enables increased feature-set
– Multi-product leverage
Traditional monolithic (hypothetical EPYC monolithic processor): ~777mm² *; 32C die cost 1.0X
1st Gen EPYC (MCM): 4 × 213mm² die/package = 852mm² Si/package *; 32C die cost 0.59X (note 1)
* Remove die-to-die Infinity Fabric™ PHYs and logic (4/die), duplicated logic, etc.
852mm² / 777mm² = ~10% MCM area overhead
[Figure: floorplans of the hypothetical monolithic processor and of the 1st Gen EPYC package (Die 0 to Die 3, each with CCX, DDR, and I/O blocks).]
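To see how four smaller die can cost less than one large die despite the ~10% area overhead, here is a minimal sketch using a textbook Poisson yield model. This is not AMD's internal yield model: the Poisson form, the defect densities swept below, and all function names are assumptions for illustration, and packaging, test, and wafer-edge effects are ignored.

```python
import math

MONO_AREA_MM2 = 777.0     # hypothetical monolithic die (from the slide)
CHIPLET_AREA_MM2 = 213.0  # one 1st Gen EPYC die (from the slide)
CHIPLETS_PER_PACKAGE = 4


def poisson_yield(area_mm2, d0_per_cm2):
    """ASSUMED yield model: exp(-area * defect density), area converted to cm^2."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)


def good_die_cost(area_mm2, d0_per_cm2):
    """Relative silicon cost of one good die: area divided by yield."""
    return area_mm2 / poisson_yield(area_mm2, d0_per_cm2)


def mcm_to_monolithic_cost(d0_per_cm2):
    """Silicon cost of the 4-die MCM relative to the monolithic die."""
    mcm = CHIPLETS_PER_PACKAGE * good_die_cost(CHIPLET_AREA_MM2, d0_per_cm2)
    mono = good_die_cost(MONO_AREA_MM2, d0_per_cm2)
    return mcm / mono


area_overhead = CHIPLETS_PER_PACKAGE * CHIPLET_AREA_MM2 / MONO_AREA_MM2 - 1.0
print(f"MCM area overhead: {area_overhead:.1%}")

# Hypothetical defect densities in defects/cm^2, chosen only to show the trend.
for d0 in (0.0, 0.05, 0.10, 0.20):
    print(f"D0 = {d0:.2f}/cm^2 -> MCM/monolithic cost = {mcm_to_monolithic_cost(d0):.2f}")
```

With zero defects the MCM is simply more expensive by the area overhead; as the assumed defect density rises, the yield advantage of the smaller die outweighs that overhead.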
1. Based on AMD internal yield model using historical defect density data for mature technologies

What to do for an Encore?
• Leadership performance requires 7nm benefits
• Yet the cost of advanced technologies is increasing
• Gen1 architecture does not scale well to double core counts
• Innovation required
7nm Compute Efficiency Gains (note 1): 2X density, >1.25X frequency (same power), 0.5X power (same performance)
[Figure: normalized cost per yielded mm² for a 250mm² die (y-axis 0 to 6.00).]
1. Based on June 8, 2018 AMD internal testing of same-architecture product ported from 14 to 7 nm technology with similar implementation flow/methodology, using performance from SGEMM.

Chiplets to Maximize 7nm Benefits
• High-performance server and desktop processors are IO-heavy
• Analog devices and bump pitches for IO benefit very little from leading edge technology, and that technology is very costly
• Prior Generation RYZEN™ Processor Die
– CPU core + L3 on this die comprises ~56% of the area; these circuits see increased 7nm gains
– Remaining ~44% sees very little performance and density improvement from 7nm
• Solution: Partition the SoC, reserving the expensive leading-edge silicon for CPU cores while leaving the IO and memory interfaces in N-1 generation silicon
[Figure: 7nm CCD floorplan showing Zen2 cores, L3, IFOP SerDes, SMU, and DFx.]
7nm CCD is ~86% CPU + L3
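The area fractions above suggest a back-of-the-envelope estimate of what a full monolithic port to 7nm would buy. The sketch below uses the 2X density figure quoted earlier in the deck for the CPU core + L3 portion and assumes, as a simplification of "very little", that the remaining ~44% does not shrink at all. The function and constant names are hypothetical.

```python
DENSITY_GAIN_7NM = 2.0       # "2X density" quoted earlier in the deck
SCALING_FRACTION = 0.56      # CPU core + L3 share of the prior-generation die
NON_SCALING_FRACTION = 0.44  # IO, analog, and other circuits


def ported_area(scaling, non_scaling, density_gain, non_scaling_gain=1.0):
    """Relative die area after porting everything to the new node.

    non_scaling_gain=1.0 is an ASSUMPTION standing in for "very little"
    density improvement on the IO-heavy portion.
    """
    return scaling / density_gain + non_scaling / non_scaling_gain


monolithic_7nm_area = ported_area(SCALING_FRACTION, NON_SCALING_FRACTION, DENSITY_GAIN_7NM)
core_share_after_port = (SCALING_FRACTION / DENSITY_GAIN_7NM) / monolithic_7nm_area

print(f"monolithic 7nm die area vs. prior generation: {monolithic_7nm_area:.2f}x")
print(f"share of that 7nm die that is CPU core + L3: {core_share_after_port:.0%}")
```

Under these assumptions the ported die shrinks only to about 0.72x, and well under half of the expensive 7nm silicon would hold CPU core + L3, compared with the ~86% quoted for the 7nm CCD.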

Chiplets Evolved – Hybrid Multi-die Architecture
[Figure: three package diagrams: Traditional Monolithic, 1st Gen EPYC CPU, 2nd Gen EPYC CPU.]
• Use an Advanced Technology Where it is Needed Most
• Each IP in its Optimal Technology, 2nd Gen Infinity Fabric™ Connected
• Centralized I/O Die Improves NUMA
• Superior Technology for CPU Performance and Power

Connecting the Chiplets
• Silicon interposers and bridges provide high wire density, but have limited reach
– Only supports die edge connectivity, which limits number of chiplets and cores that can be supported
• Performance goals required more Core Complex Die (CCDs) than can be tiled adjacent to the IOD
• Solution is to retain the on-package SerDes links for die-die connections
[Figure: theoretical interposer-based layout (CCDs adjacent to the IOD on an interposer) versus the selected MCM approach.]

Package Routing Challenges
• Prior generation already consumed almost all package routing resources for memory and IO
• Connecting 9 chiplets in the same package requires innovation
[Figure: 1st Gen AMD EPYC™ package routing across Die 0 to Die 3, with CCX, DDR, and I/O regions [1]]
[1] Beck ISSCC 2018

Under-CCD Routing
[Figure: package layout with a central IOD, DDR and SERDES interfaces around it, and eight CCDs arranged two deep.]
Routing Infinity Fabric on Package (IFOP) SerDes links from IOD to the two-deep chiplets required sharing routing layers with off-package SerDes and competing with power delivery requirements

Zen vs. Zen 2 VDDM Distribution
Dense SRAMs require a separate rail
• Zen: VDDM distribution via package plane
• Zen 2: VDDM distribution via RDL only

Zen 2 VDDM Design Challenges
• RDL is more resistive than a dedicated package layer
• Therefore, we reduced overall VDDM current draw by ~80% compared to Zen [Singh ISSCC 2020]
• New, smaller, and distributed LDO design
• Ensured sufficient routing porosity through the integrated LDOs to enable critical routing
• These improvements kept the IR drop to ≈10mV impact
[Figure: CCX floorplan with Core+L2 blocks and 4MB L3 slices; 4 VDDM LDOs inside the L3; VDDM RDL spanning L2 and L3. Enables 80 IFOP package-routed signals under the CCD.]
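The reasoning behind the current reduction follows from Ohm's law: IR drop is the product of current and rail resistance, so a rail that is more resistive can be tolerated if the current falls in proportion. The sketch below is a minimal illustration; the ~80% figure is from the slide, while the function name and the resistance multipliers are hypothetical.

```python
CURRENT_REDUCTION = 0.80  # "~80%" lower VDDM current draw, from the slide


def relative_ir_drop(current_reduction, resistance_scale):
    """IR drop relative to a baseline after the current falls and the
    rail resistance is multiplied by `resistance_scale` (V = I * R)."""
    return (1.0 - current_reduction) * resistance_scale


# Resistance multiplier at which the IR drop equals the baseline again.
break_even_resistance_scale = 1.0 / (1.0 - CURRENT_REDUCTION)

for scale in (1.0, 2.0, 5.0):  # hypothetical resistance multipliers
    print(f"R x{scale:.0f} -> IR drop x{relative_ir_drop(CURRENT_REDUCTION, scale):.2f}")
print(f"break-even resistance increase: {break_even_resistance_scale:.1f}x")
```

In other words, an ~80% current cut leaves room for roughly a 5x more resistive distribution path before the IR drop returns to where it started.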

Package Integration, Server, and Desktop
• Bump pitch for 14nm and 7nm is 150µm and 130µm respectively
• Transitioned IOD from solder bumps to copper pillars, enabling a common interface for IOD+CCD
– Conducive to tighter bump pitches (compact)
– Enabled common die height after assembly
– Higher max current (electromigration) limits
[Figure: 2nd Gen AMD EPYC™ server processor (Zen2 CCDs around the IOD; 128 total x16 SERDES; 72 Data + 8 Clk/Ctl (total/CCD)) and 3rd Gen AMD Ryzen™ processor (Zen2 CCDs; Infinity Fabric (die-to-die); IO controllers and PHYs; 2 x DDR4 PHYs).]

Chiplet-enabled Socket Upgrades
• Chiplet architecture allows migration to “Zen 3” CCDs without disrupting the platform
• Re-use the client IOD
• Enable in-place upgrades to “Zen 3” for AMD Socket AM4 with ~19% IPC increase
[Figure: 3rd Gen AMD Ryzen™ processor and 4th Gen AMD Ryzen™ processor (Zen3 CCDs), each with Infinity Fabric (die-to-die), IO controllers and PHYs, and 2 x DDR4 PHYs.]

Improving Memory Performance
Prior Generation EPYC™ 7001 Series Processors
• Server memory latency is a key factor in performance
• A goal for 2nd Gen was to improve on the 2017 1st Gen EPYC™ CPU design
• Non-Uniform-Memory-Access (NUMA) behaviors are a result of memory interfaces being distributed across die
• Significant delays from NUMA1 to NUMA2 impact performance for some applications
3 NUMA Distances, 8 NUMA Domains
NUMA Domain: Latency (ns) (note 1)
NUMA1: 90
NUMA2: 141
NUMA3: 234
Avg. Local (note 2): 128
1: AMD internal testing with DRAM page miss
2: 75% NUMA 2 + 25% NUMA 1 traffic mix

2nd Gen AMD EPYC™ CPU Improved Memory Latency
• Central IOD enables a single NUMA domain per socket
• Improved average memory latency (note 1) by approximately 24ns (~19%) (note 2)
• Minimum (local) latency only increases approximately 4ns with chiplet architecture
Single Domain: CCD0, CCD1, IO0, CCD2, CCD3, IO1, CCD4, CCD5, IO2, CCD6, CCD7, IO3, MA/MB/MC/MD/ME/MF/MG/MH interleaved
1.46GHz / DDR2933 (coupled)
1: Local 94ns
2: ~97ns
3: ~104ns
4: ~114ns
Measured Avg: ~104ns
Repeater: 1 FCLK (1.46GHz)
Switch: 2 FCLK (1.46GHz) (low-load bypass, best-case)
[Figure: IOD floorplan showing attach points for CCD0 to CCD7, UMC0 to UMC7, memory channels MA to MH, IO0 to IO3, G0 to G3 and P0 to P3 PCIe/xGMI links, and S-Links.]
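The latency claims on this slide can be cross-checked against the 1st Gen numbers with a few lines of arithmetic. The sketch below assumes that the ~24ns figure compares the 2nd Gen measured average (~104ns) with the 1st Gen "Avg. Local" value, and that "Avg. Local" is the 75% NUMA2 + 25% NUMA1 mix from the earlier footnote. Variable names are made up for this example.

```python
# 1st Gen (EPYC 7001 Series) latencies in ns, from the earlier table.
GEN1_NS = {"NUMA1": 90, "NUMA2": 141, "NUMA3": 234}

# 2nd Gen (EPYC 7002 Series) figures in ns, from this slide.
GEN2_LOCAL_NS = 94
GEN2_MEASURED_AVG_NS = 104

# "Avg. Local": 75% NUMA2 + 25% NUMA1 traffic mix.
gen1_avg_local_ns = 0.75 * GEN1_NS["NUMA2"] + 0.25 * GEN1_NS["NUMA1"]

improvement_ns = gen1_avg_local_ns - GEN2_MEASURED_AVG_NS
improvement_pct = 100.0 * improvement_ns / gen1_avg_local_ns
local_penalty_ns = GEN2_LOCAL_NS - GEN1_NS["NUMA1"]

print(f"1st Gen average local latency: {gen1_avg_local_ns:.2f} ns")
print(f"average improvement: {improvement_ns:.2f} ns ({improvement_pct:.1f}%)")
print(f"minimum (local) latency increase: {local_penalty_ns} ns")
```

The mix reproduces the 128ns average, and the differences land on the quoted ~24ns (~19%) improvement and ~4ns local penalty.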
1: AMD internal testing with DRAM page miss
2: EPYC 7002 Series NUMA 1 vs. EPYC 7001 Series Avg. Local; EPYC 7002 Series NUMA 2 vs. EPYC 7001 Series NUMA 3

2nd Gen AMD EPYC™ Chiplet Performance vs. Cost
• Higher core counts and performance than possible with a monolithic design
• Lower costs at all core count/performance points in product line
• Cost scales down with performance by depopulating chiplets
• 14nm technology for IOD reduces fixed cost
[Figure: normalized die cost (0 to 2) at 64, 48, 32, 24, and 16 cores for Chiplet 7nm + 14nm versus Hypothetical Monolithic 7nm.]
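The "depopulating chiplets" point can be pictured as a fixed IOD cost plus a per-CCD cost that scales with core count. The sketch below is purely illustrative: the unit costs are placeholders in arbitrary units, not AMD data, and it assumes every CCD is fully populated with 8 cores (64 cores across 8 CCDs in the full configuration). The function name is hypothetical.

```python
import math

CORES_PER_CCD = 8  # 64 cores across 8 CCDs in the full configuration


def package_silicon_cost(cores, ccd_cost, iod_cost):
    """Silicon cost of a chiplet package: one IOD plus enough CCDs.

    ASSUMES fully populated CCDs; `ccd_cost` and `iod_cost` are
    placeholder values in arbitrary units.
    """
    ccds = math.ceil(cores / CORES_PER_CCD)
    return iod_cost + ccds * ccd_cost


for cores in (64, 48, 32, 24, 16):
    cost = package_silicon_cost(cores, ccd_cost=1.0, iod_cost=2.0)
    print(f"{cores:2d} cores -> relative silicon cost {cost:.1f}")
```

The IOD term is the fixed cost that does not shrink with core count, which is why building it in a cheaper 14nm process matters most for the lower core-count parts.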

Potential Future Directions

Design For Reuse and Flexibility
[Figure: chiplet library, 3rd-party chiplets, common substrate.]
• Generality vs. Optimization
– Interface widths and speeds, supported functionality/protocol(s), fixed pinouts
– Memory options (e.g., not all systems need HBM)
– Form factor/chiplet size (hard to support too many sizes, unnecessarily large silicon increases costs)
– Power delivery: pinouts, supported voltage(s), voltage regulation, current draw, required decap
– Thermal budgeting/allowance, cooling solutions
• Design for reuse in large-scale design
– How to leverage wafer-scale systems across smaller scales?

Architectural Partitioning
• Decomposition for a system
– Multi-objective technology selection and optimization
• Decomposition for N arbitrary, undesigned, unimagined systems
• Interfaces
– Datapath interfaces: physical, protocol
– Control interfaces: power management, debug, profiling, security, system alerts (e.g., thermal emergency)
[Figure: system partitioning diagram labeled "???".]

Memory Considerations in Partitioned Architectures
Memory consistency models, coherency, virtual memory
[Figure: diagram with processors P and P′ and copies of data X.]

Security, Untrusted Chiplets, etc.
QoS, Security: DOS (accidental, malicious), snooping, side channels

What might come next?
• Reduce chiplet connection overhead with interposers and denser interconnect
• Memory stacking directly on compute
• True 3D stacking beyond memory
– HBM and Hybrid Memory Cube already doing memory-on-logic of sorts
– Expand to heterogeneous components

3D Stacking Options vs. TSV Pitch
[Figure: 3D stacking options arranged by TSV pitch: Circuit Slicing (e.g., X[31:0] and Y[31:0] on Die 2, X[63:32] and Y[63:32] on Die 1, carry through TSV), IP Folding/Splitting, Macro on Macro, IP on IP, Full die-to-die. Examples labeled: Cores on Cores, Cores on Uncore, DRAM on CPU, CPU on CPU.]

Stacked DRAM Reliability Challenges
• Traditional DRAM provides ECC with redundant storage
– 64 data + 8 ECC = 72 bits
• With stacked DRAM, hard to do because all bits come from a single chip
• True whether stacked DRAM is cache or system memory
[Figure: ECC arrangements: 8 DRAM chips (data) + 9th DRAM chip (ECC); 4 DRAM chips (data) + 5th DRAM chip (ECC); 4 DRAM chips (data) + ½ DRAM chip for ECC (½ wasted) (?).]
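The chip counts in the figure follow from simple arithmetic on the 64 data + 8 ECC bit split. The sketch below assumes the 64 data bits are divided evenly across identical chips, so the ECC bits need a fraction of one more chip of the same width; the function name is hypothetical.

```python
DATA_BITS = 64
ECC_BITS = 8


def ecc_chip_fraction(data_chips):
    """Fraction of one additional identical DRAM chip needed for the ECC bits,
    ASSUMING the data bits are split evenly across `data_chips` chips."""
    bits_per_chip = DATA_BITS / data_chips
    return ECC_BITS / bits_per_chip


for chips in (8, 4, 1):
    fraction = ecc_chip_fraction(chips)
    print(f"{chips} data chip(s): ECC needs {fraction:.3f} of an extra chip")
```

With 8 data chips the ECC fills exactly one extra chip; with 4 it fills only half of one; and when all bits come from a single chip, as with stacked DRAM, there is no natural extra chip to hold the redundant bits.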

Parametric Variations
• Slow Die + Fast Die challenge
– Do not want to sell as all-slow part
– Perform pre-stack binning (how?)
– Restrict frequency-boosting to fast core(s)?
• Clock distribution
– Skew/jitter across layers
[Figures [1] and [2]]
[1] WikiMedia, Creative Commons (link, license) [2] amd.com

Known Good Die Testing
• One faulty die could ruin the entire stack (very expensive)
• Need to test functionality (at least partially) before stacking/bonding
• Challenges
– Limited connectivity (few pads, maybe none? Only TSVs/uBumps?)
– Limited ability to supply power
– Incomplete functionality
– What if PLL/clocking is on a different chip?
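Why a single faulty die is so costly can be shown with a simple compound-yield calculation. The sketch below assumes that dies are stacked without any pre-bond testing, that die failures are independent, and that bonding itself never fails; the 90% die yield is a hypothetical value, not a figure from the talk.

```python
def stack_yield(die_yield, dies_in_stack):
    """Probability that every die in an untested stack is good,
    ASSUMING independent die failures and perfect bonding."""
    return die_yield ** dies_in_stack


HYPOTHETICAL_DIE_YIELD = 0.90  # placeholder value for illustration

for height in (1, 2, 4, 8):
    print(f"{height} die(s): stack yield {stack_yield(HYPOTHETICAL_DIE_YIELD, height):.1%}")
```

The stack yield falls geometrically with height, which is the motivation for identifying known good die before bonding.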
[1]

Thermal requirements
• 3D stacking increases power density
• Need to creatively interleave cool and hot components in the vertical dimension
• BEOL dielectrics act as thermal insulators, reducing cooling efficiency with stack height
• Power management must become much more sophisticated

Summary
• Silicon scaling running into multiple challenges
• 2D Chiplets are a path to keep things going
– Challenges with managing interconnect overhead (power, cost)
– Generalized 3rd party IP integration promising but requires standardization of interfaces, innovation in security and memory management
• 3D stacking reduces overhead while introducing new challenges
– Binning, testing, thermals, power delivery, and management
• Plenty of opportunity for innovation to support scaling in all dimensions

Acknowledgment
Special thanks to Gabe Loh, Kevin Lepak, and Mahesh Subramony for their contributions to this material, and of course the talented AMD design teams from around the world for the AMD EPYC™ server processor engineering achievements

Endnotes
Slides 2, 3, 4, 5, 12: Lisa T. Su, Samuel Naffziger, and Mark Papermaster, “Multi-Chip Technologies to Unleash Computing Performance Gains over the Next Decade,” IEDM Conference 2017.
Slide 6: Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp. https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
© 2020 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.