
IEEE JOURNAL OF SOLID-STATE CIRCUITS

"Zeppelin": An SoC for Multichip Architectures

Thomas Burd, Senior Member, IEEE, Noah Beck, Sean White, Milam Paraschou, Member, IEEE, Nathan Kalyanasundharam, Gregg Donley, Alan Smith, Member, IEEE, Larry Hewitt, and Samuel Naffziger, Fellow, IEEE

Abstract— AMD's "Zeppelin" system-on-a-chip (SoC) combines eight high-performance "Zen" cores with a shared 16-MB L3 Cache, along with six high-speed I/O links and two DDR4 channels, using the infinity fabric (IF) to provide a high-speed, low-latency, and power-efficient connectivity solution. This solution allows the same SoC silicon die to be designed into three separate packages and provides highly competitive solutions in three different market segments. IF is critical to this high-leverage design re-use, utilizing a coherent, scalable data fabric (SDF) for on-die communication, as well as inter-die links, extending up to eight dies across two packages. To support this scalability, an energy-efficient, custom physical-layer link was designed for in-package, high-speed communication between the dies. Using an additional scalable control fabric (SCF), a hierarchical power and system management unit (SMU) monitors and manages the distributed set of dies to ensure the products stay within infrastructure limits. It was essential that the floor plan of the SoC was co-designed with the package substrate. The SoC uses a 14-nm FinFET process technology and contains 4.8B transistors on a 213 mm² die.

Index Terms— 14 nm, high-frequency design, microprocessors, multi-chip module (MCM), scalable fabric, system-on-a-chip (SoC) architecture.

I. INTRODUCTION

AMD's next-generation system-on-a-chip (SoC), code-named "Zeppelin," was designed with the flexibility to allow the single silicon design to target products in a multitude of markets, including server, mainstream desktop PCs, and high-end desktop PCs [1]. The Zeppelin SoC was designed in Global Foundries' 14-nm LPP FinFET process technology, utilizing a back-end stack of 13 copper interconnect layers with a top-level aluminum redistribution layer [2], [3].

The highest priority design goal was to provide an SoC that was architected with leadership server capabilities but, in addition, also has the scalability and configurability to support additional complementary markets. These include:
1) Client Market: Single-chip AM4 package with two DDR4 channels and 24 PCIe Gen3 lanes [4], which is platform compatible with previous-generation AMD SoCs.
2) High-End Desktop Market: Two-chip sTR4 package with four DDR4 channels and 64 PCIe Gen3 lanes.
3) Server Market: Four-chip SP3 package with eight DDR4 channels and 128 PCIe Gen3 lanes for one-socket systems, scalable with a coherent interconnect to support two-socket systems.

The critical enabler for this flexibility is the infinity fabric (IF), comprised of two key components, or planes. The first is the scalable data fabric (SDF), which provides coherent data transport between cores, memory controllers, and IO, and can do so within the same die, across dies within the same package, or between packages in a two-socket system. The second is the scalable control fabric (SCF), which provides a common command and control mechanism for system configurability and management. Similar to the SDF, the SCF connects all the components within the SoC, among dies within the same package, and between packages in a two-socket system.

A flexible, yet power-efficient, physical implementation of the IF was a key requirement for competitive products, which drove a customized, on-package, high-speed Serializer–Deserializer (SerDes) link interface. While not as power efficient as other on-package interconnect solutions, such as the embedded multi-die interconnect bridge (EMIB), at 2 pJ/bit versus 1.2 pJ/bit [5], the IF solution provides much greater product design flexibility. EMIB requires dies to be physically adjacent, while IF utilizes package routing layers to support much more complex connection topologies, but with a custom SerDes solution to minimize transmission energy as compared to existing off-package SerDes solutions.
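To make the product scaling concrete, the sketch below simply restates the die, DDR4-channel, and PCIe-lane counts from the configuration list above and derives rough aggregate bandwidths. The DDR4-2666 and PCIe Gen3 per-lane rates used here are nominal public figures, not values reported in this paper, and the sketch is illustrative only.

```python
# Illustrative sketch (not AMD code): restate the three package configurations
# listed above and derive rough aggregate interface bandwidths per package.
DDR4_2666_GB_S_PER_CHANNEL = 2666e6 * 8 / 1e9          # ~21.3 GB/s per 64-bit channel
PCIE_GEN3_GB_S_PER_LANE = 8e9 * 128 / 130 / 8 / 1e9    # ~0.985 GB/s per lane per direction

packages = {
    # name: (dies, DDR4 channels, PCIe Gen3 lanes)
    "AM4 (client)":            (1, 2,  24),
    "sTR4 (high-end desktop)": (2, 4,  64),
    "SP3 (server, 1 socket)":  (4, 8, 128),
}

for name, (dies, ddr_ch, pcie_lanes) in packages.items():
    ddr_bw = ddr_ch * DDR4_2666_GB_S_PER_CHANNEL
    pcie_bw = pcie_lanes * PCIE_GEN3_GB_S_PER_LANE
    print(f"{name}: {dies} die(s), {ddr_ch} DDR4 ch (~{ddr_bw:.0f} GB/s), "
          f"{pcie_lanes} PCIe lanes (~{pcie_bw:.0f} GB/s per direction)")
```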

II. ARCHITECTURE

A. Functional Overview

The SoC, as shown in Fig. 1, consists of two core complexes (CCXs), in which each complex contains four high-performance "Zen" x86 cores providing two-way simultaneous multi-threading (SMT), each with a 512-kB L2 Cache and a shared 8-MB L3 Cache [3]. There are two DDR4 channels with ECC supporting two DIMMs per channel at speeds up to 2666 MT/s. There are two combo physical-layer links, each of which can be configured as a 16-lane PCIe Gen3 interface, an eight-lane SATA interface, or a 16-lane inter-socket SerDes interface. An additional four high-speed SerDes interfaces provide die-to-die links.


Fig. 1. "Zeppelin" SoC architecture.
Fig. 3. Infinity data fabric topology.

There is an IO complex that provides an integrated southbridge, including PCIe and SATA controllers, four USB 3.1 Gen 1 ports, as well as SPI, LPC, UART, I2C, and RTC interfaces. All of these components are connected with the IF, providing coherent data transport between all the IPs on the SoC.

Fig. 2. "Zen" cache hierarchy.

B. Core Complex and Cache Hierarchy

The CCX, detailed in [3] and [6], can fetch and decode up to four instructions per cycle (IPC) and dispatch up to six micro-operations per cycle, utilizing eight parallel execution units, providing 52% higher IPC performance than the prior-generation x86 processor core [7]. As shown in Fig. 2, within the Zen core there is a 64 kB, four-way set-associative instruction cache with 32 B/cycle of fetch bandwidth, and a 32 kB, eight-way set-associative data cache with 48 B/cycle of load/store bandwidth. The private, 512 kB L2 cache supports 64 B/cycle of bandwidth to the L1 caches with 12-cycle latency. The fast, shared L3 cache supports 32 B/cycle of bandwidth to the L2 caches with a 35-cycle latency. The L3 cache is filled from L2 victims for all four cores of the CCX, and L2 tags are duplicated in the L3 cache for probe filtering and fast cache transfer. The hierarchy can support up to 50 outstanding misses from L2 to L3 per core, and 96 outstanding misses from L3 to main memory.

C. Infinity Fabric

The IF's SDF was built around several design tenets:
1) support for a diverse range of topologies, both regular and irregular;
2) bandwidth scalability to support a broad range of products from Ryzen™ Mobile to EPYC™ servers (and even Radeon™ GPUs);
3) guaranteed quality of service (QoS) for real-time clients;
4) standardized interfaces to enable automated build flows for rapid deployment of a network-on-chip (NoC); and
5) low latency, which is perhaps the most important tenet.

IF uses the enhanced coherent HyperTransport (cHT+) protocol, built upon the cHT used in multiple generations of server deployments [8]. Zeppelin uses a seven-state MDOEFSI coherence protocol, in which the states are exclusively modified (M), dirty (D), shared modified (O), exclusive clean (E), forwarder clean (F), shared clean (S), and invalid (I). A distributed SRAM-based full directory is supported. The directory protocol supports directed multi-cast and broadcast probes. The protocol also allows for probe responses to be combined at the links.

SDF uses two standard interfaces—scalable data port (SDP) and fabric transport interface (FTI). Along with the standard interfaces, a modular design was key to building complex topologies. The main blocks within the data fabric, as shown in Fig. 3, are master, slave, transport switch, and Coherent Socket Extender (CAKE). There are two types of masters on Zeppelin—the cache coherent master (CCM) and the IO master and slave (IOMS). The master block in the data fabric abstracts the complexities of identifying the request target and routing functions away from the clients. Clients of the data fabric that initiate requests use an SDP port to talk to a master block in the data fabric. Clients with service requests use a slave SDP port. There are two types of slaves: the coherent slave (CS), traditionally known as a home agent, which hosts the directory, participates in ordering, and is responsible for maintaining coherency; and the IO slave, which provides access to devices. IOMS is built as a single block to allow upstream responses to push prior posted writes on the same port. CS interfaces with the memory controller, shown as UMC in Fig. 1.

The Zeppelin SoC has two DDR4 channels, two CCXs, support for up to four IF on-package (IFOP) links, and two IF inter-socket (IFIS) links. The data fabric topology chosen isolates local traffic from remote and pass-through traffic and reduces interference.
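As a rough illustration of the directory protocol described above, the following toy model enumerates the seven MDOEFSI states and shows how a full directory enables a directed multi-cast probe to known sharers, falling back to a broadcast probe when no sharer information is available. This is a sketch for intuition only; the data structures and request flow are invented and are not AMD's coherence engine.

```python
# Toy MDOEFSI directory sketch (illustrative only, not the actual IF design).
from enum import Enum, auto

class State(Enum):
    M = auto()  # exclusively modified
    D = auto()  # dirty
    O = auto()  # shared modified
    E = auto()  # exclusive clean
    F = auto()  # forwarder clean
    S = auto()  # shared clean
    I = auto()  # invalid

class ToyDirectory:
    """Per cache line, track which CCX caches hold it and in what state."""
    def __init__(self):
        self.lines = {}  # line address -> {ccx_id: State}

    def probe_targets(self, addr, requester):
        """Probe fan-out for a request from `requester`: a precise sharer
        list allows a directed multi-cast probe; with no directory
        information, fall back to a broadcast probe."""
        sharers = self.lines.get(addr)
        if sharers is None:
            return "broadcast"
        return [ccx for ccx, st in sharers.items()
                if ccx != requester and st is not State.I]

directory = ToyDirectory()
directory.lines[0x4000] = {0: State.F, 2: State.S}
print(directory.probe_targets(0x4000, requester=1))  # directed probe to CCXs 0 and 2
```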


Fig. 5. Chip-to-chip communication path.

Fig. 4. Transport switch micro-architecture.

The two key blocks within the data fabric transport layer are the transport command and data switch (TCDX) and CAKE. The PIE block hosts power management, the interrupt controller, and other miscellaneous functions. Our past design experience and studies led to the choice of a base switch design that is scalable from radix 2 to radix 6 and can support scheduling up to six ports in parallel for maximal throughput, or be reduced at build time to minimize power and area. A centralized queue design was chosen to minimize total buffering and to also ease implementation of fair arbitration and QoS. Schedulers can pick out of order to support QoS. Input-to-output bypass paths are used to minimize latency. Packet header state is held in the central queue, while the packet payload is held in a static buffer structure for minimal area and power, as shown in Fig. 4. Virtual channels are used to prevent interference between the traffic classes and for deadlock avoidance. The transport layer is a collection of switches and interfaces with other switches and blocks within the data fabric using the standard FTI interface. FTI comprises request, response, probe, and data channels. The data channel width is 32 B and is shared between requests and responses.

An efficient high-bandwidth chip-to-chip communication method in SDF is key to the multichip package approach, enabled by the CAKE component of the SDF. CAKE is designed to take the local chip's FTI transactions and encode them into 128-bit flits; it is bidirectional, also decoding flits each cycle. The flits are suitable for transmission over any SerDes interface to another chip within the system. To eliminate clock-domain crossing latency, the clocks of all the SDF components, including the CAKE component in Zeppelin, run at the system DRAM's MEMCLK frequency.

CAKE provides bidirectional traffic flow between the two fabrics, as shown in Fig. 5. The ingress CAKE logic processes incoming FTI traffic for transfer to the physical coding sublayer (PCS); the egress CAKE logic receives transfers from the PCS for transfer to its outbound FTI. CAKE handles all the types of FTI transactions on both ingress and egress paths: requests, responses, probes, and data. CAKEs are always paired, with the ingress path (FTI to CAKE to PCS) of one connected to the egress path (PCS to CAKE to FTI) of the other. The PCS interface is narrower than the FTI interface, transferring 128 bits each clock. The narrower PCS interconnect requires a special CAKE FTI-side queue design to achieve the required throughput. CAKE uses address caching (A-cache) to improve the link protocol efficiency.

Command compression and packing allow up to two commands, four responses, or a combination of packets to be assembled in a single flit. A command flit is divided into four 32-bit sectors. A compressed command uses two sectors and a compressed response uses a single sector. Data and command flits are separate. When four cache line read responses are packed in a single command flit, then 16 data flits will follow the command flit to transfer four 64-B data chunks. PCS blocks add a 16-bit CRC value per flit. IFOP links use extra wires to carry CRC, while IFIS links carry CRC over the same set of wires. CAKE also has the optional capability to encrypt data sent across links.
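The flit packing and CRC rules above can be illustrated with a short sketch. Only the sector budget (four 32-bit sectors per 128-bit command flit, two sectors for a compressed command, one for a compressed response) and the 16-bit per-flit CRC are taken from the text; the packing policy details, field contents, and CRC polynomial below are assumptions made for illustration.

```python
# Illustrative flit-packing sketch (not the actual IF encoding).
from dataclasses import dataclass
from typing import List

SECTORS_PER_FLIT = 4                       # 4 x 32 bits = 128-bit command flit
SECTOR_COST = {"command": 2, "response": 1}

@dataclass
class Packet:
    kind: str     # "command" or "response"
    payload: int  # opaque, already-compressed contents

def pack_flits(packets: List[Packet]) -> List[List[Packet]]:
    """Greedily pack compressed packets into flits without exceeding four
    sectors per flit (e.g., 2 commands, 4 responses, or a mix)."""
    flits, current, used = [], [], 0
    for pkt in packets:
        cost = SECTOR_COST[pkt.kind]
        if used + cost > SECTORS_PER_FLIT:
            flits.append(current)
            current, used = [], 0
        current.append(pkt)
        used += cost
    if current:
        flits.append(current)
    return flits

def crc16(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Generic CRC-16 stand-in for the per-flit PCS CRC."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

flits = pack_flits([Packet("command", 0xA), Packet("response", 0xB),
                    Packet("response", 0xC), Packet("command", 0xD)])
print([len(f) for f in flits], hex(crc16(b"\x0a\x0b\x0c\x0d")))
```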


To prevent the IF from becoming a system bottleneck, the bandwidth of an IFOP link is overprovisioned by approximately a factor of 2, relative to the DDR4 channel bandwidth, for mixed read/write traffic. This is sufficient to provide excellent scaling from one die all the way up to eight dies in a two-socket server system.

D. IO Subsystem Muxing

A critical part of the Zeppelin flexibility to target different markets with the same exact silicon is the combo SerDes, which provides an aggregate of 32 lanes of high-speed, multi-protocol I/O. As shown in Fig. 6, it can be configured in one of three basic ways with further fine-grain granularity. The combo SerDes on the top edge of the die can be configured as either a 16-lane PCIe Gen3 link or a 16-lane IFIS link for inter-socket communication in a 2P server system. The bottom combo SerDes can be configured in the same way as the top, but with the additional capability to support eight lanes of SATA. The 16-lane PCIe links can then be further sub-divided into as many as eight links of mixed eight-lane, four-lane, two-lane, or one-lane sizes. The flexibility of configuration was essential for enabling a solution to the packaging problem, described further in detail in Section III.

Fig. 6. Reconfigurable combo SerDes.

E. Hierarchical Power and System Management

The SoC implements a hierarchical power and system management subsystem responsible for monitoring conditions and managing performance to infrastructure limits. This consists of an embedded controller called the system management unit (SMU) in each of the chips within the package and a supporting control bus called the SCF.

SCF is a low-bandwidth control bus similar to AXI [9] that operates within each chip within the package. Access is provided through SCF from the SMU to each IP within the chip. In addition, SCF is passed to each chip within the package through a chip-to-chip bus. The logic converts the SCF protocol internal to the chip into a high-speed narrow format that is passed through SerDes links between the chips. Similarly, SCF is passed between each chip of each socket through a socket-to-socket bus. Thus, SCF provides each SMU with access to IPs within all chips of all sockets in the system and allows the SMUs of each chip to communicate with each other.

Each SMU consists of an embedded controller with accompanying memory sufficient to execute power and system management firmware algorithms. These algorithms provide a wide variety of functions but are largely intended to manage performance, in the form of CPU frequency, to several infrastructure limits. These limits include package power, temperature, current for each power plane, voltage, and others.

There is one SMU in each die in the package. All of the SMUs perform "slave" functions and one of the SMUs performs the "master" functions as well; e.g., in the four-die product, one SMU is assigned to perform master and slave functions, while the other three SMUs perform only slave functions. The slave functions include capturing measurement data associated with the local die that is needed to manage performance within the infrastructure limits and preparing it to be provided to the master SMU. The master functions include processing package-wide measurements to determine the response.

Each infrastructure limit algorithm (including package power, temperature, current, and voltage) is managed independently by the master to determine the infrastructure-limited frequency. The structure for each infrastructure limit algorithm is similar, running on a loop that repeats approximately every 1 ms, which includes the following.
1) Capture data from all slaves.
2) Collapse the data into a single result.
3) Apply the result as the next input to a proportional–integral–differential (PID) controller; the PID controller smooths out responses to sudden changes in the input data and prevents overshoot; PID controller parameters are tuned for each infrastructure limit algorithm.
4) The output of the PID controller is the maximum frequency that can be supported by the infrastructure limit at that time.

For example, in the case of the package power limit, for step 1 above, the current and the voltage of each power plane are passed from the slaves to the master; for step 2 above, the power for each of the planes of each die is calculated and added together, and the sum becomes the input of the PID controller. A similar process is followed for temperature (using the highest temperature from each of the dies for step 2), for current (using the sum for step 2), and so forth. Firmware then selects the most-constraining (lowest) of each of the infrastructure limit frequencies.

Thus, through these algorithms, global CPU frequency is managed to infrastructure limits, as shown in Fig. 7. Frequency is reduced when infrastructure limits are reached and increased when there is an operating margin. The master passes the frequency change requests to the slaves, and each slave (and the master) applies the frequency changes to the CPU cores of the local die.
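A minimal firmware-style sketch of the roughly 1-ms control loop described above is given below for the package-power limit. The PID form, gains, frequency range, and data structures are placeholders for illustration and are not AMD firmware values; only the loop structure (capture, collapse, PID, select the lowest limit) follows the text.

```python
# Illustrative master-SMU control-loop sketch (placeholder values, not firmware).
class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint, self.integral, self.prev_err = setpoint, 0.0, 0.0

    def step(self, measurement, dt=1e-3):
        err = self.setpoint - measurement            # positive when under the limit
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def collapse_power(slave_reports):
    """Step 2: sum V*I over every power plane of every die."""
    return sum(v * i for die in slave_reports for (v, i) in die)

power_pid = PID(kp=5.0, ki=50.0, kd=0.0, setpoint=180.0)   # 180-W package limit
f_base, f_max = 2200.0, 3200.0                             # MHz, placeholder range

def control_tick(slave_reports, other_limit_freqs_mhz):
    pkg_power = collapse_power(slave_reports)                              # steps 1-2
    f_power = min(f_max, max(f_base, f_base + power_pid.step(pkg_power)))  # steps 3-4
    return min([f_power] + other_limit_freqs_mhz)          # most-constraining limit wins

reports = [[(0.95, 40.0), (1.05, 8.0)] for _ in range(4)]  # (V, A) per plane, 4 dies
print(control_tick(reports, other_limit_freqs_mhz=[3000.0, 2900.0]))
```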


Fig. 7. SMU firmware control loop.

These frequency changes are coordinated with global voltage plane control by the master; if frequency is increasing, the master commands the external voltage regulator to increase the voltage (through a standard interface bus) before passing the frequency increase commands to the slaves; if frequency is decreasing, the master passes the frequency decrease commands to the slaves, waits for the change to be acknowledged, and then commands the external voltage regulator to decrease the voltage. This enables continuous CPU operation while the SMU dynamically varies supply voltage and frequency.

In addition, slaves monitor local conditions, such as the number of cores that are active on the die, and may apply lower CPU frequencies than are globally requested by the master. Such variations in frequency are incorporated by the slaves into the measurement data that are passed to the master for package-wide optimization.
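The ordering rule for coordinated voltage and frequency changes can be summarized in a short sketch. The VRM and slave interfaces and the linear voltage/frequency curve below are invented stand-ins for illustration, not the actual SCF messages or VRM bus protocol.

```python
# Sketch of the raise-voltage-first / lower-voltage-last ordering described above.
def volts_for(freq_mhz: float) -> float:
    """Placeholder voltage/frequency curve for illustration only."""
    return 0.70 + 0.30 * (freq_mhz - 2200.0) / 1000.0

class StubVRM:
    def set_voltage(self, v): print(f"VRM -> {v:.3f} V")

class StubSlaveSMU:
    def set_frequency(self, f): print(f"slave: CCLK -> {f:.0f} MHz")
    def wait_for_ack(self): pass  # real firmware waits for a completion message

def apply_frequency_change(new_mhz, cur_mhz, vrm, slaves):
    if new_mhz > cur_mhz:
        vrm.set_voltage(volts_for(new_mhz))        # raise voltage first...
        for s in slaves:
            s.set_frequency(new_mhz)               # ...then raise frequency
    elif new_mhz < cur_mhz:
        for s in slaves:
            s.set_frequency(new_mhz)               # lower frequency first...
        for s in slaves:
            s.wait_for_ack()                       # ...wait for acknowledgment...
        vrm.set_voltage(volts_for(new_mhz))        # ...then lower voltage
    return new_mhz

apply_frequency_change(2700, 2400, StubVRM(), [StubSlaveSMU() for _ in range(4)])
```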
SMUs also manage state transitions, such as transitions between the fully operational state and various low-power states, which may include, depending upon the specific product:
1) PC6, in which the CPU power plane can be brought to a low voltage when there are no active CPU cores.
2) DFC1, in which most of the IO and memory interfaces are disabled and placed in a low-power state when there are no transactions occurring in the system.
3) IFIS link width, in which the IFIS busses between sockets are changed in width in response to conditions in order to optimize power.
4) S3, or suspend to RAM, which is a sleep state in which most of the system power planes are disabled.
5) S5, a system OFF state.

III. PACKAGE DESIGN

To achieve the ultimate goal of flexibility to support design re-use, the SoC floor plan, as shown in Fig. 8, was co-designed and optimized in conjunction with the package design. The most-constraining package that dictated the floor plan requirements is the four-die multi-chip module (MCM) for one-socket and two-socket server configurations. To ensure the shortest package-level routes for the high-speed DDR4 I/O, the two blocks were placed along one long edge of the die, and the die on the right side of the package is rotated 180° to ensure the DDR blocks are always on the outer edge of the die when placed onto the package substrate. Since the inter-socket high-speed SerDes was the next most constraining, the combo SerDes was placed at opposite corners to always ensure a short distance to the package edge for these links. Finally, a fourth IFOP was added to ensure that the inter-package MCM IF routing could be accomplished with only four package layers. In any product, there is always at least one unused IFOP block, but this "dark silicon" was necessary to achieve the overall design goals of re-usability.

Fig. 8. SoC floor plan and die photograph.
Fig. 9. DDR4 + IFOP routing.

Fig. 9 shows the routing for the first two package layers, which support one layer per DDR4 channel (in white) for all four dies. For the diagonal IFOP connections (in blue), one layer is used for each link, and the vertical IFOP links (in orange) utilize both layers. Fig. 10 shows the next two package routing layers. Again, one layer per DDR4 channel (in white) is used for all four dies, and the IFIS links (in orange) utilize two layers each.

The four-die MCM SP3 package has a total of 4094 LGA pins on a 58 mm × 75 mm organic substrate. There are 534 IF high-speed chip-to-chip routes providing in excess of 256-GB/s total in-package bandwidth. In addition, there are 1760 high-speed pins providing in excess of 450 GB/s of total off-package bandwidth for the IFIS, PCIe Gen3, and DDR4 links.
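Dividing the quoted aggregates by the quoted route and pin counts gives rough average per-wire rates, as in the sketch below. These derived numbers are illustrative averages implied by the totals above; they are not per-lane signaling rates reported in this paper.

```python
# Back-of-the-envelope check of the quoted package-level aggregates.
IN_PACKAGE_ROUTES, IN_PACKAGE_GB_S = 534, 256.0    # chip-to-chip routes, total GB/s
OFF_PACKAGE_PINS, OFF_PACKAGE_GB_S = 1760, 450.0   # high-speed pins, total GB/s

per_route_gbit = IN_PACKAGE_GB_S / IN_PACKAGE_ROUTES * 8   # average Gb/s per route
per_pin_gbit = OFF_PACKAGE_GB_S / OFF_PACKAGE_PINS * 8     # average Gb/s per pin

print(f"~{per_route_gbit:.1f} Gb/s average per in-package route, "
      f"~{per_pin_gbit:.1f} Gb/s average per off-package pin")
```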


Fig. 10. DDR4 + IFIS routing.
Fig. 11. Package-level split power planes.

A. Power Distribution

The most important reason for restricting the package signal routing to only four layers was to free up the remainder of the package layers for power planes, which need to support the multiple high-current supplies that require very low impedance to maintain performance at the maximum power. The VDDCPU rail, which powers the two CCXs (8 cores total + 16 MB of L3 Cache) per die, requires up to 180 A at thermal design current (TDC), and the VDDIO rail, which powers all the SerDes and DDR4, requires up to 60 A at TDC. With the rotation of the dies, as shown in Fig. 11, a partial package plane for VDDCPU can cover all eight CCXs in the four-die MCM, while allowing the VDDIO rail to cover at least the outer SerDes and, more importantly, the DDR4, given its tight electrical specifications, which make it more susceptible to power supply noise than the inner SerDes. A third high-power supply, VDDSOC, provides up to 65 A at TDC to the bulk of the rest of the logic on the die outside of the two CCXs and the eight SerDes, and shares package planes with a myriad of other power supplies needed, 10 in all.

To provide robust power integrity, there are approximately 300 µF of bypass capacitors, which attach to the top side of the package substrate, around the entire periphery of the dies. In addition, each of the four dies uses on-chip metal–insulator–metal capacitors (MIMCap) in the back-end copper interconnect stack to effectively double the intrinsic on-die capacitance, for improved first-droop response on the CPU power supply.

IV. PHYSICAL DESIGN

A critical physical design challenge of the Zeppelin SoC was to integrate the IF connecting up the high-performance Zen CPU cores along with the six high-speed SerDes links and two DDR4 channels, while maintaining low latency at very good power efficiency. Integrating the large set of IPs across a multitude of clock and power domains required a significant design and validation effort to ensure functional silicon. An essential component to maintain excellent power efficiency was a customized, low-power SerDes for intra-package die-to-die communication links.

A. Infinity Fabric SerDes

Two different custom high-speed SerDes links provide the connectivity for the IF. The IFIS SerDes is used to communicate across longer socket-to-socket PCB traces and is relatively similar to other high-speed SerDes, such as PCIe Gen3, with a power of approximately 11 pJ/b. The IFOP SerDes, on the other hand, was optimized for minimum power across the much shorter package substrate route lengths, achieving a power efficiency of 2 pJ/b. There were three key elements to achieving significantly lower power.

First, single-ended signaling is used, as compared to traditional differential signaling. This results in approximately 50% of the power, due to the fact that a single signal transmission requires approximately half the power of a pair. Differential signaling is typically used for high-speed links due to its immunity to noise coupling, ground potential differences, and other susceptibilities; therefore, to support the single-ended signaling at the targeted data rates, extensive signal integrity design effort and analysis are required.

Second, a zero-power state exists when a logic 0 is transmitted, as a result of the enablement of the transmitter pull-down in conjunction with the receiver termination to ground, which results in zero sourced transmit current. The data transmit driver is powered by an internally regulated supply VIFOP. Fig. 12 shows the signaling levels transmitted, indicating a 0-V (GND) level for logic "0" and a VIFOP/2 level for logic "1," or half the voltage of the VIFOP SerDes supply. Additional power savings are realized by forcing the zero-power state during link idle to significantly reduce idle power dissipation.

Finally, data-bit inversion encoding is utilized to optimize the bit patterns that are transmitted to take advantage of the lower power logic 0 state, by minimizing bit flips as well as maximizing the residency in a logic 0 state, saving on average 10% power per bit.

Fig. 12. IFOP SerDes.
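The data-bit inversion idea can be sketched as follows. The 8-bit word size, the invert-flag convention, and the equal weighting of transmitted ones and bit flips are assumptions for illustration, not the actual IFOP encoding; only the underlying goal of favoring the zero-power logic 0 state and fewer transitions comes from the text.

```python
# Illustrative data-bit inversion sketch (invented encoding details).
def popcount(x: int) -> int:
    return bin(x).count("1")

def encode_word(word: int, prev_tx: int, width: int = 8):
    """Choose true or inverted transmission to minimize ones plus transitions;
    returns (wire pattern, invert flag)."""
    mask = (1 << width) - 1
    inv = ~word & mask
    cost_true = popcount(word) + popcount(word ^ prev_tx)
    cost_inv = popcount(inv) + popcount(inv ^ prev_tx) + 1  # +1 for the flag bit
    return (inv, 1) if cost_inv < cost_true else (word, 0)

prev = 0x00
for w in (0xFF, 0xF0, 0x01):
    tx, flag = encode_word(w, prev)
    print(f"data {w:#04x} -> wire {tx:#04x} (invert={flag})")
    prev = tx
```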


B. Clock Domains

Enabled by each core's integrated voltage regulator, there are per-core clock domains, for eight CPU clock domains in total (CCLK1:8), as shown in Fig. 13. In addition, there is a clock domain for the L3 of each CCX (CCLKL3a:b). To streamline the synchronization between the CPU and its respective L3 cache, and to help minimize latency, each L3 is always clocked at the maximum frequency of the CCX's four component CPUs. FCLK is the clock domain of the IF and, to minimize latency and interface complexity, it is at the same frequency as the DDR MEMCLK clock-domain frequency. The IFOP SerDes is on the GCLK domain, which is also at the same frequency as MEMCLK and FCLK, for similar reasons. The combo SerDes has a multiplexed clock domain depending on whether it is configured for IFIS (GCLK) or PCIe (PCI_CLK). Finally, the IO complex has a primary clock domain on LCLK for interfacing to the IF and the SerDes, in addition to a multitude of local IO clock domains for USB, SATA, and GPIO.

Fig. 13. SoC clock domains.

C. Voltage Domains

Further leveraging the core's voltage regulator, there are per-core voltage domains that can also be fully power gated to put a particular core into sleep mode, as shown in Fig. 14. The two L3 caches sit on the same externally regulated domain, VDDCPU, which is also the input to the cores' regulators. When all the cores are in a low-voltage state, the SMU can reduce the voltage on the external voltage regulator module (VRM) for additional system power savings, and the SMU can set the external VRM to a low voltage when all the cores are disabled in the PC6 state. The VDDIO supply powers all the SerDes and DDR4 supplies and utilizes local low-dropout regulators (LDOs) to provide local supply isolation to minimize the supply ripple for the tight specifications required for the high-speed SerDes links and DDR4 channels. The VDDSOC supply provides the necessary voltage for the vast majority of the remainder of the SoC and can be externally disabled to put the SoC into low-power sleep modes. The VDDSOC_S5 domain remains active in all but the lowest sleep state, the mechanical OFF state, G3.

Fig. 14. SoC voltage domains.

V. RESULTS

Despite the added complexity of the SoC and package design, the four-die MCM proved to be more cost efficient than a monolithic single-chip module (SCM) design. While the four dies total 852 mm² of silicon, creating a monolithic 32-core die without the multi-chip support would only save about 10% of the area, resulting in a 777 mm² die size [10]. It is projected that a single die would cost approximately 40% more to manufacture and test than the four small chips. Adding to the cost benefits, the multi-chip design provides approximately 20% higher full 32-core yield than would a single-chip version. To make only the largest core-count 32-core parts, the cost for the large die jumps to 70% more than the cost of the four small chips. A very high-yielding multi-chip assembly process is required, or the improved silicon yields are lost at the package level. Internal data have demonstrated success at achieving assembly yields that have a negligible impact on overall cost. To ensure that chips with similar maximum frequency capabilities can be matched to each other for assembly into the same package, on-die frequency sensors containing representative critical-path logic are consulted at wafer-level test before chips are selected for assembly into packages [11].
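The cost comparison above reduces to simple arithmetic on the quoted figures, as the sketch below shows; the normalization is illustrative and does not represent AMD cost data.

```python
# Rough cost-comparison sketch using only the figures quoted above.
MCM_DIE_AREA_MM2, MCM_DIES = 213, 4
MONO_DIE_AREA_MM2 = 777

mcm_silicon = MCM_DIE_AREA_MM2 * MCM_DIES            # 852 mm^2 of total silicon
area_saving = 1 - MONO_DIE_AREA_MM2 / mcm_silicon    # ~10% smaller monolithic die

mcm_cost = 1.0                                       # normalized cost of 4 small dies
mono_cost_all_bins = 1.40 * mcm_cost                 # ~40% more to make and test
mono_cost_32c_only = 1.70 * mcm_cost                 # ~70% more for full 32-core parts

print(f"monolithic area saving: {area_saving:.1%}")
print(f"monolithic cost premium: {mono_cost_all_bins - 1:.0%} (all bins), "
      f"{mono_cost_32c_only - 1:.0%} (32-core parts only)")
```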


Fig. 16. MCM core voltage variation.²

As demonstrated in Fig. 15, the goal of scalable performance from a single die (AM4 package) to an eight-die (dual-socket SP3 package) configuration was achieved.

Fig. 15. Scalable performance.¹

A. Power Supply Droop Mitigation

A critical design challenge resulting from the four-die SP3 package was to maintain a sufficiently low voltage drop to the high-power CPU cores over a much larger area. To achieve the required low lateral impedance, multiple package planes were tightly connected together with micro-vias for the high-power VDDCPU power supply. Despite peak currents that can exceed 200 A, per-core measurements demonstrate ±25-mV voltage drop under a max-power droop workload, as shown in Fig. 16.

Per-core ring oscillators, calibrated for temperature and voltage, record min/max voltage estimates, sampled at a rate of 470 MS/s. Static differences between cores can be compensated by the per-core LDOs to improve system-level power efficiency. Dynamic differences are mitigated by clock stretching and DPM power states, as described in [11]. In Zeppelin, these techniques yield up to 5% higher frequencies than could otherwise be achieved. Measured data show excellent tracking of per-core voltage from the digital LDO with a millivolt-accurate target voltage, as shown in Fig. 17. When the target voltage is sufficiently below the LDO input voltage, the LDO output tracks within a millivolt. When the target voltage is close to, or above, the LDO input voltage, the LDO output clamps at a well-regulated voltage and requires an increase in the platform VRM voltage (VID) to raise the LDO input voltage to achieve the desired target LDO voltage. The improved per-core tracking of voltage, as required to hit the target frequency, provides additional system-level power efficiency.

Fig. 17. CPU LDO voltage tracking.²
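The tracking and clamping behavior described above can be captured in a very simple model; the dropout value and voltages below are placeholders for illustration, not characterized silicon data.

```python
# Simplified per-core digital LDO model (placeholder numbers, not silicon data).
LDO_DROPOUT_V = 0.02   # assumed minimum headroom for illustration

def ldo_output(target_v: float, input_v: float) -> float:
    """Track the target when headroom allows; otherwise clamp near the input."""
    return min(target_v, input_v - LDO_DROPOUT_V)

vddcpu = 1.00  # externally regulated LDO input (set by the platform VID)
for target in (0.85, 0.95, 1.00, 1.05):
    out = ldo_output(target, vddcpu)
    mode = "tracking" if out == target else "clamped (raise VID to go higher)"
    print(f"target {target:.2f} V -> core {out:.2f} V  [{mode}]")
```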
B. Adaptive Frequency

With the hierarchical system power management unit described in Section II-E, the maximum, sustainable, power-limited frequency is a function of core count and workload power intensity. The core frequency for a 32-core, SP3-socket server part, with a TDP of 180 W, is shown in Fig. 18. Although traditional margining would limit the 32-active-core frequency to 2.4 GHz, with adaptive sensing to detect the lower power of light application workloads, up to 300 MHz of additional frequency can be delivered for these less demanding workloads. For 12 active cores or less, the part operates in a boost mode allowing higher frequencies without violating infrastructure power and current limits. For 12 active cores, traditional margining would limit the frequency to 3.0 GHz, but with adaptive sensing, up to 200 MHz of additional frequency can be delivered for light application workloads. Fig. 19 shows a core frequency plot for an eight-core desktop part, with a TDP of 105 W, again showing increasing frequency as the active core count goes down.

¹AMD Ryzen™ 7 1800X CPU scored 211, using estimated scores based on testing performed in AMD Internal Labs as of March 30, 2017. System config: Ryzen™ 7 1800X: AMD Myrtle-SM with 95-W R7 1800X, 32-GB DDR4-2667 RAM, Crucial CT256M550SSD, Ubuntu 15.10, GCC -O2 v4.6 compiler suite. AMD Ryzen™ Threadripper™ 1950X CPU scored 375, using estimated scores based on testing performed in AMD Internal Labs as of September 7, 2017. System config: Ryzen™ Threadripper™ 1950X: AMD Whitehaven-DAP with 180-W TR 1950X, 64-GB DDR4-2667 RAM, CT256M4SSD disk, Ubuntu 15.10, GCC -O2 v4.6 compiler suite. AMD EPYC™ 7601 CPU scored 702 in a one-socket system, using estimated scores based on internal AMD testing as of June 6, 2017: 1 × EPYC™ 7601 CPU in HPE Cloudline CL3150, Ubuntu 16.04, GCC -O2 v6.3 compiler suite, 256 GB (8 × 32-GB 2Rx4 PC4-2666) memory, 1 × 500-GB SSD. AMD EPYC™ 7601 scored 1390 in a two-socket system, using estimated scores based on internal AMD testing as of June 6, 2017: 2 × EPYC™ 7601 CPU in Supermicro AS-1123US-TR4, Ubuntu 16.04, GCC -O2 v6.3 compiler suite, 512 GB (16 × 32-GB 2Rx4 PC4-2666 running at 2400) memory, 1 × 500-GB SSD.

²Power measurements taken from an SP3 Diesel non-DAP AMD evaluation system, with AMD EPYC™ processor rev B1 parts, BIOS revision WDL7405N, Windows Server 2016, running a Max Power pattern at 2.5-GHz core frequency.


Fig. 18. Frequency versus core count: AMD EPYC™ 7601.

Fig. 21. Single-socket AMD EPYC™ system (SP3).

Fig. 22. Dual-socket AMD EPYC™ system (2× SP3).

Fig. 19. Frequency versus core count: AMD Ryzen™ 2700X.

For the reliability-limited low-core-count frequencies, the adaptive sensing will allow additional frequency uplift as the core temperature drops below the maximum junction temperature, TJMAX. In this scenario, the core voltage can be increased while maintaining the same overall product reliability margin.

To ensure product functionality and long-term reliability, it is critical to keep silicon temperatures below TJMAX. Fig. 20 shows the per-core and IF temperatures of the same eight-core desktop part while running five iterations of the Cinebench benchmark. Coming from a cold start, the part quickly comes to the steady state, where two cores are at TJMAX, while the others are within 5 °C. The IF temperature is almost 15 °C cooler, demonstrating the significant temperature gradients seen on these products, as reported by the numerous temperature sensors throughout the SoC.

Fig. 20. Temperature versus time for AMD Ryzen™ 2700X.

C. Package Configurations

The single-socket SP3 package configuration is shown in Fig. 21. There are eight channels of DDR4, along with 128 lanes of high-speed IO, and, as described earlier, three of the four IFOPs per die provide the intra-socket connectivity.

The dual-socket SP3 package configuration is shown in Fig. 22, providing twice the DDR4 channels, for a total of 16, and the same 128 lanes of high-speed IO. An additional 128 lanes of high-speed IO are repurposed for 4 × 32 socket-to-socket links.


For the high-end desktop market, two dies are placed into an sTR4 package, for which the substrate has the same physical dimensions as SP3, as shown in Fig. 23. Dummy dies are placed so there are still a total of four dies in the sTR4 package, in order to preserve the package's structural integrity. The sTR4 package provides four DDR4 channels along with 64 lanes of high-speed IO. In this market segment, two of the IFOPs per die remain unused, and the other two provide twice the intra-socket die-to-die connectivity that is present in the SP3 package.

Fig. 23. AMD Ryzen™ Threadripper™ system (sTR4).

For the mainstream client market, a single die is placed into an AM4 package, as shown in Fig. 24. This package provides two DDR4 channels along with 24 lanes of high-speed IO. In this configuration, with only one die, all four IFOPs remain unused. However, despite this apparent inefficiency, by providing a scalable, single SoC design to hit all four product configurations, the overall aggregated cost of goods comes in lower.

Fig. 24. AMD Ryzen™ system (AM4).

VI. CONCLUSION

AMD's Zeppelin SoC, consisting of eight new Zen CPU cores, provides the high flexibility required for a single silicon design to deliver scalable performance from one die up to eight dies, enabling a multitude of products in not only the server market but also the high-end desktop market, as well as the mainstream client market. Despite the overhead to simultaneously support all these markets, the high-yielding, modestly sized die of 213 mm² provides the most cost-effective solution by a significant margin. The key enabler of this scalability is AMD's IF, providing a scalable, low-latency, and coherent connectivity solution to connect the two CPU complexes, along with the six high-speed on-die SerDes and two DDR4 channels, in a variety of form factors. As Moore's law slows in its ability to deliver more transistors per area, multichip architectures such as Zeppelin are necessary to provide continued, and significant, increases in functionality that can be delivered in a single package solution.

ACKNOWLEDGMENT

The authors would like to thank our very talented AMD design teams across Austin, Bengaluru, Boston, Fort Collins, Hyderabad, Markham, Shanghai, and Santa Clara who contributed to Zen and Zeppelin.

REFERENCES

[1] N. Beck, S. White, M. Paraschou, and S. Naffziger, "'Zeppelin': An SoC for multichip architectures," in IEEE ISSCC Dig. Tech. Papers, Feb. 2018, pp. 40–42.
[2] T. Song et al., "A 14 nm FinFET 128 Mb SRAM with VMIN enhancement techniques for low-power applications," IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 158–169, Jan. 2017.
[3] T. Singh et al., "Zen: A next-generation high-performance x86 core," IEEE J. Solid-State Circuits, vol. 53, no. 1, pp. 102–114, Jan. 2018.
[4] PCI Express Base Specification Revision 3.1a, PCI-SIG, Beaverton, OR, USA, Dec. 2015.
[5] D. Greenhill et al., "A 14 nm 1 GHz FPGA with 2.5D transceiver integration," in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, pp. 54–55.
[6] M. Clark, "A new x86 core architecture for the next generation of computing," in Proc. Hot Chips, Aug. 2016, pp. 1–19.
[7] B. Munger et al., "Carrizo: A high performance, energy efficient 28 nm APU," IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 105–116, Jan. 2016.
[8] P. Conway and B. Hughes, "The AMD northbridge architecture," IEEE Micro, vol. 27, no. 2, pp. 10–21, Mar. 2007.
[9] AMBA AXI Protocol Specification, ARM, Cambridge, U.K., Jun. 2003.
[10] K. Lepak et al., "The next generation AMD enterprise server product architecture," in Proc. Hot Chips, Aug. 2017, pp. 1–22.
[11] S. Sundaram et al., "Bristol Ridge: A 28-nm x86 performance-enhanced microprocessor through system power management," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 89–97, Jan. 2017.

Thomas Burd (M'93–SM'17) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the University of California at Berkeley, Berkeley, CA, USA, in 1992, 1994, and 2001, respectively.
He was a Consultant with multiple startups in Silicon Valley. In 2005, he joined Advanced Micro Devices (AMD), Santa Clara, CA, USA, where he has worked on multiple generations of high-performance x86 cores, including the Bulldozer family of cores and, more recently, the Zen family of cores, in the areas of physical design architecture, design for reliability, power delivery, and analysis methodology. He is currently a Senior Fellow Design Engineer with AMD and a Physical Design Architect for the next-generation Zen core. He has authored or co-authored over 25 conference and journal publications, in addition to the book Energy Efficient Microprocessor Design. He is an inventor of five U.S. patents.
Dr. Burd has been serving on the Technical Program Committee for the International Solid-State Circuits Conference since 2017. He served on the Technical Program Committee for the Symposium on Very Large Scale Integration Circuits from 2012 to 2015, the International Conference on Computer Aided Design from 2003 to 2005, and Hot Chips in 1996. He was a recipient of the 2001 ISSCC Lewis Winner Award for the Best Conference Paper and the 1998 Analog Devices Outstanding Student Award for recognition of excellence in IC design.


Noah Beck received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1997 and 1999, respectively.
From 1999 to 2004, he was with Sun Microsystems, Chelmsford, MA, USA, where he worked in system-on-a-chip (SoC) verification. In 2004, he joined Advanced Micro Devices (AMD), Boxborough, MA, USA, where he is currently a Server SoC Architect. He holds two U.S. patents.

Sean White received the B.S. degree in physics and the M.S. degree in electrical engineering from the Worcester Polytechnic Institute, Worcester, MA, USA.
He was a System and Silicon Designer with Data General, Westborough, MA, USA, and multiple startup companies, where he was involved in a number of different server systems. In 2002, he joined Advanced Micro Devices (AMD), Boxborough, MA, USA, where he has contributed to the design of a variety of processor silicon projects, most recently as a Server System-on-a-Chip (SoC) Architect for the Zeppelin SoC used in AMD's EPYC™ and Ryzen™ processor product lines. He is currently a Fellow with AMD.

Gregg Donley received the B.S.E. and M.S.E. degrees in computer engineering from the University of Michigan, Ann Arbor, MI, USA, in 1987 and 1989, respectively.
In 1989, he joined Unisys, San Jose, CA, USA. In 1997, he was a Co-Founder of Network Virtual Systems, San Jose, CA, USA, functioning in the capacity of a Vice President of engineering, designing large-scale SMP servers. In 2004, he joined Advanced Micro Devices (AMD), Sunnyvale, CA, USA. He is currently a Principal Member of Technical Staff with AMD, where he is working on die-to-die interconnects.

Alan Smith (M'98) received the B.S. and M.S. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2000 and 2011, respectively.
He served on the Industry Advisory Board for the Computer Systems Engineering Program with the University of Georgia, Athens, GA, USA. He is currently a Principal Member of Technical Staff with the Infinity Fabric (IF) Architecture Team, Advanced Micro Devices (AMD), Austin, TX, USA. He co-architected and designed the IF Network-on-Chip (NoC). His research interests include NoC and GPU fabric architecture.

Milam Paraschou (M'02) received the B.S. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 1997.
From 1997 to 2001, he was with Digital Mediacom, Bloomington, MN, USA, where he was involved in LVDS, SRAM, and bandgap designs. In 2001, he joined MathStar, Inc., Hillsboro, OR, USA, where he was involved in a gigabit multichannel laser driver. From 2003 to 2006, he was a Lead on an FBDIMM gigabit multi-channel receiver. From 2006 to 2011, he was with Mixed Mode Solutions, Inc., Farmington, MN, USA, where he was involved in various designs including SRAM, multiphase DLL, IO, and Serializer–Deserializer (SerDes). Since 2011, he has been with Advanced Micro Devices (AMD), Fort Collins, CO, USA, where he is involved in PCIe design. From 2013 to 2017, he managed a team involved in SerDes design. He is currently a Principal Member of Technical Staff with AMD, where he is involved in multi-channel short-reach SerDes data links for AMD's next-generation microprocessor. He holds five U.S. patents.

Nathan Kalyanasundharam received the B.S. degree in electronics and communication engineering from the National Institute of Technology, Kurukshetra, India, in 1990, and the M.S.E.E. degree from Texas A&M University, College Station, TX, USA, in 1995.
In 2002, he joined Advanced Micro Devices (AMD), Santa Clara, CA, USA, where he is currently a Senior Fellow and a Lead Architect for AMD's Infinity Fabric (IF). He also manages the fabric design and architecture group.

Larry Hewitt was born in San Bernardino, CA, USA, in 1960. He received the B.S. degree in electrical engineering from the University of California at Irvine, Irvine, CA, USA, in 1982.
Since 1982, he has been a Computer Architect for various firms, during which time he has contributed to military display systems, personal computer system design, audio semiconductors, graphics and PCI bridges, and microprocessors. He is currently a Fellow with Advanced Micro Devices, Austin, TX, USA, where he leads power management architecture for microprocessor products. He holds more than 75 invention patents.

Samuel Naffziger (M'02–SM'11–F'14) received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, CA, USA, in 1988, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 1993.
In 1988, he joined Hewlett Packard, Fort Collins, CO, USA, and then Intel, Fort Collins, where he was involved in power technology, processor architecture, and circuit design. Since 2006, he has been with Advanced Micro Devices (AMD), Fort Collins. He is currently a Corporate Fellow with AMD, where he is responsible for technology strategy and power technology. He has authored over 30 publications. He holds 120 U.S. patents in processor circuits, architecture, and power management.