“Zeppelin”: An SoC for Multichip Architectures
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE JOURNAL OF SOLID-STATE CIRCUITS

Thomas Burd, Senior Member, IEEE, Noah Beck, Sean White, Milam Paraschou, Member, IEEE, Nathan Kalyanasundharam, Gregg Donley, Alan Smith, Member, IEEE, Larry Hewitt, and Samuel Naffziger, Fellow, IEEE

Abstract: AMD’s “Zeppelin” system-on-a-chip (SoC) combines eight high-performance “Zen” cores with a shared 16-MB L3 Cache, along with six high-speed I/O links and two DDR4 channels, using the infinity fabric (IF) to provide a high-speed, low-latency, and power-efficient connectivity solution. This solution allows the same SoC silicon die to be designed into three separate packages and provides highly competitive solutions in three different market segments. IF is critical to this high-leverage design re-use, utilizing a coherent, scalable data fabric (SDF) for on-die communication, as well as inter-die links, extending up to eight dies across two packages. To support this scalability, an energy-efficient, custom physical-layer link was designed for in-package, high-speed communication between the dies. Using an additional scalable control fabric (SCF), a hierarchical power and system management unit (SMU) was used to monitor and manage a distributed set of dies to ensure the products stay within infrastructure limits. It was essential that the floor plan of the SoC was co-designed with the package substrate. The SoC used a 14-nm FinFET process technology and contains 4.8B transistors on a 213 mm² die.

Index Terms: 14 nm, high-frequency design, microprocessors, multi-chip module (MCM), scalable fabric, system-on-a-chip (SoC) architecture.

Manuscript received May 18, 2018; revised August 4, 2018 and September 17, 2018; accepted September 18, 2018. This paper was approved by Guest Editor Masato Motomura. (Corresponding author: Thomas Burd.)
T. Burd, N. Kalyanasundharam, and G. Donley are with Advanced Micro Devices, Santa Clara, CA 95054 USA.
N. Beck and S. White are with Advanced Micro Devices, Boxborough, MA 01719 USA.
M. Paraschou and S. Naffziger are with Advanced Micro Devices, Fort Collins, CO 80528 USA.
A. Smith and L. Hewitt are with Advanced Micro Devices, Austin, TX 78735 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2018.2873584
0018-9200 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

AMD’s next-generation system-on-a-chip (SoC), codenamed “Zeppelin,” was designed with the flexibility to allow the single silicon design to target products in a multitude of markets, including server, mainstream desktop PCs, and high-end desktop PCs [1]. The Zeppelin SoC was designed in Global Foundries’ 14-nm LPP FinFET process technology, utilizing a back-end stack of 13 copper interconnect layers with a top-level aluminum redistribution layer [2], [3].

The highest priority design goal was to provide an SoC architected with leadership server capabilities that also has the scalability and configurability to support additional complementary markets. These include:

1) Client Market: Single-chip AM4 package with two DDR4 channels and 24 PCIe Gen3 lanes [4], platform compatible with the previous-generation AMD SoCs.
2) High-End Desktop Market: Two-chip sTR4 package with four DDR4 channels and 64 PCIe Gen3 lanes.
3) Server Market: Four-chip SP3 package with eight DDR4 channels and 128 PCIe Gen3 lanes for one-socket systems, scalable with coherent interconnect to support two-socket systems.

The critical enabler for this flexibility is the infinity fabric (IF), comprised of two key components, or planes. The first is the scalable data fabric (SDF) that provides coherent data transport between cores, memory controllers, and IO, and can do so within the same die, across dies within the same package, or between packages in a two-socket system. The second is the scalable control fabric (SCF) that provides a common command and control mechanism for system configurability and management. Similar to the SDF, the SCF connects all the components within the SoC, among dies within the same package, and between packages in a two-socket system.

A flexible, yet power-efficient physical implementation of the IF was a key requirement for competitive products, which drove a customized, on-package, high-speed Serializer–Deserializer (SerDes) link interface. While not as power efficient as other on-package interconnect solutions, such as embedded multi-die interconnect bridge (EMIB), at 2 pJ/bit versus 1.2 pJ/bit [5], the IF solution provides much greater product design flexibility. EMIB requires dies to be physically adjacent, while IF utilizes package routing layers to support much more complex connection topologies, but with a custom SerDes solution to minimize transmission energy as compared to existing off-package SerDes solutions.

II. ARCHITECTURE

A. Functional Overview

The SoC, as shown in Fig. 1, consists of two core complexes (CCXs), in which each complex contains four high-performance “Zen” x86 cores providing two-way simultaneous multi-threading (SMT), each with a 512-kB L2 Cache, and a shared 8-MB L3 Cache [3]. There are two DDR4 channels with ECC supporting two DIMMs per channel at speeds up to 2666 MT/s. There are two combo physical-layer links, each of which can be configured as a 16-lane PCIe Gen3 interface, or an eight-lane SATA interface, or a 16-lane inter-socket SerDes interface. An additional four high-speed SerDes interfaces provide die-to-die links. There is an IO complex that provides an integrated southbridge, including PCIe and SATA controllers, four USB 3.1 Gen 1 ports, as well as SPI, LPC, UART, I2C, and RTC interfaces. All of these components are connected with the IF providing coherent data transport between all the IPs on the SoC.

Fig. 1. “Zeppelin” SoC architecture.

B. Core Complex and Cache Hierarchy

The CCX, detailed in [3] and [6], can fetch and decode up to four instructions per cycle (IPC), and dispatch up to six micro-operations per cycle, utilizing eight parallel execution units, providing 52% higher IPC performance than the prior-generation AMD x86 processor core [7]. As shown in Fig. 2, within the Zen core there is a 64 kB, four-way set-associative instruction cache with 32 B/cycle of fetch bandwidth, and a 32 kB, eight-way set-associative data cache with 48 B/cycle of load/store bandwidth. The private, 512 kB L2 cache supports 64 B/cycle of bandwidth to the L1 caches with 12-cycle latency. The fast, shared L3 cache supports 32 B/cycle of bandwidth to the L2 caches with a 35-cycle latency. The L3 cache is filled from L2 victims for all four cores of the CCX, and L2 tags are duplicated in the L3 cache for probe filtering and fast cache transfer. The hierarchy can support up to 50 outstanding misses from L2 to L3 per core, and 96 outstanding misses from L3 to main memory.

Fig. 2. “Zen” cache hierarchy.

C. Infinity Fabric

The IF’s SDF was built around several design tenets:

2) bandwidth scalability to support a broad range of products from Ryzen™ Mobile to EPYC™ servers (and even Radeon™ GPUs);
3) guaranteed quality of service (QoS) for real-time clients;
4) standardized interfaces to enable automated build flows for rapid deployment of network-on-chip (NOC); and
5) low latency, which is perhaps the most important tenet.

IF uses the enhanced coherent HyperTransport (cHT+) protocol built upon the cHT used in multiple generations of server deployments [8]. Zeppelin uses a seven-state MDOEFSI coherence protocol, in which the states are exclusively modified (M), dirty (D), shared modified (O), exclusive clean (E), forwarder clean (F), shared clean (S), and invalid (I). A distributed SRAM-based full directory is supported. The directory protocol supports directed multi-cast and broadcast probes. The protocol also allows for probe responses to be combined at the links.

SDF uses two standard interfaces: scalable data port (SDP) and fabric transport interface (FTI). Along with the standard interfaces, a modular design was key to building complex topologies. The main blocks within the data fabric, as shown in Fig. 3, are master, slave, transport switch, and Coherent AMD Socket Extender (CAKE). There are two types of masters on Zeppelin: cache coherent master (CCM) and an IO master and slave (IOMS). The master block in the data fabric abstracts the complexities of identifying the request target and routing functions away from the clients. Clients of the data fabric that initiate requests use an SDP port to talk to a master block in the data fabric. Clients that service requests use a slave SDP port. There are two types of slaves: the coherent slave (CS), traditionally known as a home agent, which hosts the directory, participates in ordering, and is responsible for maintaining coherency; and the IO slave, which provides access to devices. IOMS is built as a single block to allow upstream responses to push prior posted writes on the same port. CS interfaces with the memory controller shown as UMC in Fig. 1.

Fig. 3. Infinity data fabric topology.

The Zeppelin SoC has two DDR4 channels, two CCXs, and support for up to four IF on-package (IFOP) links and two IF inter-socket (IFIS) links.
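The per-die resource counts given in the text (two CCXs of four cores, 8 MB of L3 per CCX, two DDR4 channels, two 16-lane combo links) can be scaled by the number of dies in each package as a quick consistency check against the package figures listed in the Introduction. The following is a minimal sketch under stated assumptions: the `Package` class and its property names are hypothetical helpers introduced here for illustration, and `max_combo_lanes` is only an upper bound that assumes every combo link is configured as PCIe.

```python
from dataclasses import dataclass

# Per-die figures quoted in the text.
CCX_PER_DIE = 2
CORES_PER_CCX = 4
THREADS_PER_CORE = 2          # two-way SMT
L3_MB_PER_CCX = 8
DDR4_CHANNELS_PER_DIE = 2
COMBO_LINKS_PER_DIE = 2
LANES_PER_COMBO_LINK = 16


@dataclass(frozen=True)
class Package:
    """Hypothetical helper: a package built from identical Zeppelin dies."""
    name: str
    dies: int

    @property
    def cores(self) -> int:
        return self.dies * CCX_PER_DIE * CORES_PER_CCX

    @property
    def threads(self) -> int:
        return self.cores * THREADS_PER_CORE

    @property
    def l3_mb(self) -> int:
        return self.dies * CCX_PER_DIE * L3_MB_PER_CCX

    @property
    def ddr4_channels(self) -> int:
        return self.dies * DDR4_CHANNELS_PER_DIE

    @property
    def max_combo_lanes(self) -> int:
        # Upper bound only: assumes every combo link is used as 16-lane PCIe.
        return self.dies * COMBO_LINKS_PER_DIE * LANES_PER_COMBO_LINK


# Die counts per package as listed in the Introduction.
PACKAGES = [Package("AM4", 1), Package("sTR4", 2), Package("SP3", 4)]

if __name__ == "__main__":
    for p in PACKAGES:
        print(p.name, p.cores, p.threads, p.l3_mb,
              p.ddr4_channels, p.max_combo_lanes)
```

Under these assumptions the sTR4 and SP3 packages come out at 4 and 8 DDR4 channels and at 64 and 128 lanes, which agrees with the figures in the text. The AM4 bound is 32 lanes, while the text lists 24 PCIe Gen3 lanes for that package; this section does not say how the remaining lanes are used, so the sketch does not model it.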
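The seven MDOEFSI coherence states named in the text can be written down as a simple enumeration. This is an illustrative sketch only: the state letters and their descriptions come from the text, but the `CoherenceState` type, the grouping into clean states (taken purely from the word "clean" in the state names), and the helper functions are assumptions made here and say nothing about how the protocol transitions between states.

```python
from enum import Enum


class CoherenceState(Enum):
    """The seven states of the MDOEFSI protocol, as named in the text."""
    M = "exclusively modified"
    D = "dirty"
    O = "shared modified"
    E = "exclusive clean"
    F = "forwarder clean"
    S = "shared clean"
    I = "invalid"


# Assumption: a state counts as "clean" when the text's name for it says so.
CLEAN_STATES = frozenset(
    {CoherenceState.E, CoherenceState.F, CoherenceState.S}
)


def is_clean(state: CoherenceState) -> bool:
    """True for the states whose names in the text include 'clean'."""
    return state in CLEAN_STATES


def holds_valid_copy(state: CoherenceState) -> bool:
    """Assumption: every state except invalid (I) refers to a valid line."""
    return state is not CoherenceState.I


def protocol_name() -> str:
    """Concatenate the state letters in definition order."""
    return "".join(state.name for state in CoherenceState)


if __name__ == "__main__":
    print(protocol_name())
    for state in CoherenceState:
        print(state.name, state.value, is_clean(state), holds_valid_copy(state))
```

Keeping the letters in definition order means the protocol name can be rebuilt from the enumeration itself, which is a cheap guard against a state being dropped or reordered when the list is copied.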