ISSCC 2021 F5.5 Architecting Chiplet Solutions for High Volume Products
Samuel Naffziger
ISSCC Forum 2021 (42 slides)

Slide 2: Outline
• Trends and driving forces behind chiplet architecture
• Chiplet benefits
• Design challenges with chiplet integration: a case study
• Potential future directions

Slide 3: GPU and CPU Performance Trends
[Figure, two panels. Left: "GPU Single Precision Floating Point Operations Per Second Trend", single precision GFLOPS on a log axis from 1 GF to 100000 GF, 2006 to 2020. Right: "SPECint®_rate2006 2P Server Performance Trend Over Time", throughput performance ratio on a log axis from 1X to 100X, 2008 to 2020. See Endnotes.]

Slide 4: Moore's Law Keeps Slowing
[Figure: timeline of process nodes 90nm, 65nm, 45nm, 32nm, 22nm, 14nm, and 10/7nm against the years 2004 to 2020. AMD internal estimate; see Endnotes.]

Slide 5: While Costs Continue to Increase
[Figure: "Cost Per Yielded mm² for a 250mm² Die", normalized cost per yielded mm² (axis 0 to 6.00) for the 45nm, 32nm, 28nm, 20nm, 14/16nm, 7nm, and 5nm nodes. Source: AMD; see Endnotes.]

Slide 6: Die Size Trend
[Figure: "Die Size Increases Over Time in Server CPUs and GPUs", die size (mm²) on a log axis from 100 to 1000 with the reticle limit marked, 2006 to 2020. Series: Server CPU, GPU. See Endnotes.]

Slide 7: Chiplet Architectures to Extend Performance Gains

Slide 8: Bigger Chips to Offset Technology Slowdown
If technology scaling only gives you (say) 1.5x more devices per 24 months, why not just make chips 1.33x bigger to get 2x transistors?
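The arithmetic behind this question, and the yield-loss percentages quoted for the hypothetical wafer example on this slide, can be checked with a short sketch. The function names are illustrative and not from the talk, and the die counts are the slide's hypothetical figures, not real yield data.

```python
def area_growth_needed(target_gain: float, density_gain: float) -> float:
    """Die-area multiplier needed to reach target_gain in transistor count
    when process scaling alone provides density_gain."""
    return target_gain / density_gain


def yield_loss(total_die: int, good_die: int) -> float:
    """Fraction of the die on a wafer that are not good."""
    return 1.0 - good_die / total_die


# 2x transistors from a process that only gives 1.5x more devices
growth = area_growth_needed(2.0, 1.5)

# Hypothetical wafer example from the slide: smaller die vs. larger die
small_die_loss = yield_loss(395, 362)
large_die_loss = yield_loss(192, 162)

print(f"area growth needed: {growth:.2f}x")
print(f"yield loss at 395 chips per wafer: {small_die_loss:.1%}")
print(f"yield loss at 192 chips per wafer: {large_die_loss:.1%}")
```

In this example the larger die loses roughly twice the fraction of die to defects (about 16% versus about 8%), which is the cost that simply growing the die runs into.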
[Figure: server CPU die size (mm²) on a log axis from 100 to 1000 with the reticle limit marked, 2004 to 2020. Wafer example: 395 chips, 362 good die (8% yield loss) versus 192 chips, 162 good die (16% yield loss). Hypothetical/academic example [1], not real yield rates.]
[1] Kannan, Enright Jerger, Loh, "Enabling Interposer-based Disintegration of Multi-core Processors," International Symposium on Microarchitecture (MICRO), 2015

Slide 9: Chiplets Background
• Alternative: build multiple smaller chips
• Historically, not needed for most markets
  - Except for the largest systems, Moore's Law was sufficient to meet compute needs
• Chiplets not free
  - Additional area for interfaces, replicated logic
  - Higher packaging costs
  - Additional design effort, complexity
  - Past methodologies less suited for chiplets
• 2X device functionality costs > 2X silicon area
[Figure: one 2X die versus two X dies, repeated "one generation later".]

Slide 10: Progression of Advanced Integration/Packaging
[Figure: Ceramic Substrate MCMs; Organic Substrate MCMs; Silicon Interposer [1]]
• Concept of partitioning systems into multiple chips is not new
• Evolution of packaging technology has changed the trade-offs in terms of cost, bandwidth, latency, energy, etc.
[1] Carsten Schulz, Wikimedia Commons (link, license)

Slide 11: High-Level Approach to Chiplets
[Figure: wafer, test, and assemble steps leading to functional SoCs, shown for two cases.]

Slide 12: A Case Study with AMD EPYC™ Server Processors

Slide 13: 1st Gen AMD EPYC Architecture
• MCM approach has many advantages
  - Higher yield, enables increased feature-set
  - Multi-product leverage
[Figure: 1st Gen AMD EPYC package with four die (Die 0 to Die 3), each with CCX, DDR, and I/O blocks, beside a hypothetical monolithic EPYC processor.]
• 1st Gen EPYC: 4 x 213mm² die/package = 852mm² Si/package*
• Hypothetical monolithic EPYC processor: ~777mm²*
  - Remove die-to-die Infinity Fabric PHYs and logic (4/die), duplicated logic, etc.
• 852mm² / 777mm² = ~10% MCM area overhead
• 32C die cost: 1.0X for the monolithic die versus 0.59X (note 1) for 1st Gen EPYC
1. Based on AMD internal yield model using historical defect density data for mature technologies

Slide 14: What to do for an Encore?
• Leadership performance requires 7nm benefits
• Yet the cost of advanced technologies is increasing
• Gen1 architecture does not scale well to double core counts
• Innovation required
7nm Compute Efficiency Gains:
• 2X density (note 1)
• >1.25X frequency (note 1), at the same power
• 0.5X power (note 1), at the same performance
[Figure: "Cost Per Yielded mm² for a 250mm² Die", normalized cost per yielded mm², axis 0 to 6.00.]
1. Based on June 8, 2018 AMD internal testing of same-architecture product ported from 14 to 7 nm technology with similar implementation flow/methodology, using performance from SGEMM.
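The ~10% MCM area overhead quoted for 1st Gen EPYC follows directly from the die sizes given on that slide. A minimal sketch of the arithmetic, using illustrative names that are not from the talk:

```python
# Figures quoted on the 1st Gen EPYC slide
DIE_AREA_MM2 = 213           # area of each of the four die
DIE_PER_PACKAGE = 4
MONOLITHIC_AREA_MM2 = 777    # estimate for the hypothetical monolithic die


def mcm_area_overhead(die_area: float, die_count: int,
                      monolithic_area: float) -> tuple[float, float]:
    """Return (total MCM silicon area, fractional overhead vs. monolithic)."""
    total = die_area * die_count
    return total, total / monolithic_area - 1.0


total_area, overhead = mcm_area_overhead(
    DIE_AREA_MM2, DIE_PER_PACKAGE, MONOLITHIC_AREA_MM2)
print(f"MCM silicon per package: {total_area} mm^2")
print(f"area overhead vs. monolithic: {overhead:.1%}")
```

The 0.59X die-cost figure cannot be reproduced this way: the slide attributes it to an AMD internal yield model, which is not given here.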
Slide 15: Chiplets to Maximize 7nm Benefits
[Figure: prior generation RYZEN™ processor die.]
• High-performance server and desktop processors are IO-heavy
• Analog devices and bump pitches for IO benefit very little from leading edge technology, and that technology is very costly
• CPU core + L3 on this die comprises ~56% of the area
  - These circuits see increased 7nm gains
• Remaining ~44% sees very little performance and density improvement from 7nm
• Solution: Partition the SOC, reserving the expensive leading-edge silicon for CPU cores while leaving the IO and memory interfaces in N-1 generation silicon
[Figure: 7nm CCD floorplan with Zen2 cores, L3, IFOP SerDes, SMU, and DFx blocks.]
• 7nm CCD is ~86% CPU + L3

Slide 16: Chiplets Evolved - Hybrid Multi-die Architecture
[Figure: Traditional Monolithic, 1st Gen EPYC CPU, and 2nd Gen EPYC CPU side by side.]
• Use an Advanced Technology Where it is Needed Most
• Each IP in its Optimal Technology, 2nd Gen Infinity Fabric™ Connected
• Centralized I/O Die Improves NUMA
• Superior Technology for CPU Performance and Power

Slide 17: Connecting the Chiplets
• Silicon interposers and bridges provide high wire density, but have limited reach
  - Only supports die edge connectivity, which limits number of chiplets and cores that can be supported
• Performance goals required more Core Complex Die (CCDs) than can be tiled adjacent to the IOD
• Solution is to retain the on-package SerDes links for die-die connections
[Figure: "Theoretical Interposer-based" layout with four CCDs adjacent to the IOD on an interposer, beside the "Selected MCM Approach".]

Slide 18: Package Routing Challenges
• Prior generation already consumed almost all package routing resources for memory and IO
• Connecting 9 chiplets in the same package requires innovation
[Figure: 1st Gen AMD EPYC™ package [1] with Die 0 to Die 3, each with CCX, DDR, and I/O blocks.]
[1] Beck ISSCC 2018

Slide 19: Under-CCD Routing
[Figure: package layout with CCDs on both sides of the IOD, showing DDR and SERDES routing.]
• Infinity Fabric on Package (IFOP) SerDes links from IOD to the two-deep chiplets required sharing routing layers with off-package SerDes and competing with power delivery requirements

Slide 20: Zen vs. Zen 2 VDDM Distribution
• Dense SRAMs require a separate rail
• Zen: VDDM distribution via package plane
• Zen 2: VDDM distribution via RDL only

Slide 21: Zen 2 VDDM Design Challenges
• RDL is more resistive than a dedicated package layer
• Therefore, we reduced overall VDDM current draw by ~80% compared to Zen [Singh ISSCC 2020]
• New, smaller, and distributed LDO design
• Ensured sufficient routing porosity through the integrated LDOs to enable critical routing
• These improvements kept the IR drop to ≈10mV impact
[Figure: CCD floorplan with Core+L2 blocks and L3 4MB slices. Callouts: "Enables 80 IFOP package routed signals under the CCD"; "4 VDDM LDOs inside the L3"; "VDDM RDL spanning L2 and L3"; LDO.]

Slide 22: Package Integration, Server, and Desktop
• Bump pitch for 14nm and 7nm is 150µm and 130µm respectively
• Transitioned IOD from solder bumps to copper pillars, enabling a common interface for IOD+CCD
  - Conducive to tighter bump pitches (compact)
  - Enabled common die height after assembly
  - Higher max current (electromigration) limits
[Figure: 2nd Gen AMD EPYC™ server processor and 3rd Gen AMD Ryzen™ processor packages built from Zen2 CCDs. Labels: 128 total x16 SERDES; 72 Data + 8 Clk/Ctl (total/CCD); Infinity Fabric (die-to-die); IO Controllers and PHYs; 2 x DDR4 PHYs.]

Slide 23: Chiplet-enabled Socket Upgrades
• Chiplet architecture allows migration to "Zen 3" CCDs without disrupting the platform
• Re-use the client IOD
• Enable in-place upgrades to "Zen 3" for AMD Socket AM4 with ~19% IPC increase
[Figure: 3rd Gen AMD Ryzen™ processor and 4th Gen AMD Ryzen™ processor packages. Labels include Zen3 CCDs and Infinity Fabric (die-to-die).]
[Figure labels for both packages: Infinity Fabric (die-to-die); IO Controllers and PHYs; 2 x DDR4 PHYs.]

Slide 24: Improving Memory Performance
• Server memory latency is a key factor in performance
• A goal for 2nd Gen was to improve on the 2017 1st Gen EPYC™ CPU design
• Non-Uniform-Memory-Access (NUMA) behaviors are a result of memory interfaces being distributed across die
• Significant delays from NUMA1 to NUMA2 impact performance for some applications

Prior Generation EPYC™ 7001 Series Processors: 8 NUMA domains, 3 NUMA distances

Domain    Latency (ns) (note 1)
NUMA1     90
NUMA2     141
NUMA3     234
Avg.