IBM’s Next Generation POWER Processor

Hot Chips August 18-20, 2019 Jeff Stuecheli Scott Willenborg William Starke Proposed POWER Processor Technology and I/O Roadmap Focus of 2018 talk

POWER7 Architecture POWER8 Architecture POWER9 Architecture POWER10

2010 2012 2014 2016 2017 2018 2020 2021 POWER7 POWER7+ POWER8 POWER8 P9 SO P9 SU P9 AIO P10 8 cores 8 cores 12 cores w/ NVLink 12/24 cores 12/24 cores 12/24 cores TBA cores 45nm 32nm 22nm 12 cores 14nm 14nm 14nm 22nm New Micro- New Micro- Enhanced Enhanced New Micro- Enhanced Enhanced Micro- New Micro- Architecture Micro- Architecture Architecture Micro- Micro- Architecture Architecture Architecture Architecture Architecture With NVLink Direct attach memory Buffered New New Process Memory Memory New Process New Process Subsystem New Process Technology Technology Technology New Process Technology Technology

Up To Up To Up To Up To Up To Up To Up To Up To Sustained Memory Bandwidth 65 GB/s 65 GB/s 210 GB/s 210 GB/s 150 GB/s 210 GB/s 650 GB/s 800 GB/s

Standard I/O Interconnect PCIe Gen2 PCIe Gen2 PCIe Gen3 PCIe Gen3 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen5

20 GT/s 25 GT/s Advanced I/O Signaling N/A N/A N/A 25 GT/s 25 GT/s 32 & 50 GT/s 160GB/s 300GB/s 300GB/s 300GB/s

CAPI 2.0, CAPI 2.0, CAPI 2.0, CAPI 1.0 , Advanced I/O Architecture N/A N/A CAPI 1.0 OpenCAPI3.0, OpenCAPI3.0, OpenCAPI4.0, TBA NVLink NVLink NVLink NVLink

© 2019 IBM Corporation Statement of Direction, Subject to Change 2 Proposed POWER Processor Technology and I/O Roadmap Focus of today’s talk

POWER7 Architecture POWER8 Architecture POWER9 Architecture POWER10

2010 2012 2014 2016 2017 2018 2020 2021 POWER7 POWER7+ POWER8 POWER8 P9 SO P9 SU P9 AIO P10 8 cores 8 cores 12 cores w/ NVLink 12/24 cores 12/24 cores 12/24 cores TBA cores 45nm 32nm 22nm 12 cores 14nm 14nm 14nm 22nm New Micro- New Micro- Enhanced Enhanced New Micro- Enhanced Enhanced Micro- New Micro- Architecture Micro- Architecture Architecture Micro- Micro- Architecture Architecture Architecture Architecture Architecture With NVLink Direct attach memory Buffered New New Process Memory Memory New Process New Process Subsystem New Process Technology Technology Technology New Process Technology Technology

Up To Up To Up To Up To Up To Up To Up To Up To Sustained Memory Bandwidth 65 GB/s 65 GB/s 210 GB/s 210 GB/s 150 GB/s 210 GB/s 650 GB/s 800 GB/s

Standard I/O Interconnect PCIe Gen2 PCIe Gen2 PCIe Gen3 PCIe Gen3 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen5

20 GT/s 25 GT/s Advanced I/O Signaling N/A N/A N/A 25 GT/s 25 GT/s 32 & 50 GT/s 160GB/s 300GB/s 300GB/s 300GB/s

CAPI 2.0, CAPI 2.0, CAPI 2.0, CAPI 1.0 , Advanced I/O Architecture N/A N/A CAPI 1.0 OpenCAPI3.0, OpenCAPI3.0, OpenCAPI4.0, TBA NVLink NVLink NVLink NVLink

© 2019 IBM Corporation Statement of Direction, Subject to Change 3 Proposed POWER Processor Technology and I/O Roadmap Looking forward

POWER7 Architecture POWER8 Architecture POWER9 Architecture POWER10

2010 2012 2014 2016 2017 2018 2020 2021 POWER7 POWER7+ POWER8 POWER8 P9 SO P9 SU P9 AIO P10 8 cores 8 cores 12 cores w/ NVLink 12/24 cores 12/24 cores 12/24 cores TBA cores 45nm 32nm 22nm 12 cores 14nm 14nm 14nm 22nm New Micro- New Micro- Enhanced Enhanced New Micro- Enhanced Enhanced Micro- New Micro- Architecture Micro- Architecture Architecture Micro- Micro- Architecture Architecture Architecture Architecture Architecture With NVLink Direct attach memory Buffered New New Process Memory Memory New Process New Process Subsystem New Process Technology Technology Technology New Process Technology Technology

Up To Up To Up To Up To Up To Up To Up To Up To Sustained Memory Bandwidth 65 GB/s 65 GB/s 210 GB/s 210 GB/s 150 GB/s 210 GB/s 650 GB/s 800 GB/s

Standard I/O Interconnect PCIe Gen2 PCIe Gen2 PCIe Gen3 PCIe Gen3 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen5

20 GT/s 25 GT/s Advanced I/O Signaling N/A N/A N/A 25 GT/s 25 GT/s 32 & 50 GT/s 160GB/s 300GB/s 300GB/s 300GB/s

CAPI 2.0, CAPI 2.0, CAPI 2.0, CAPI 1.0 , Advanced I/O Architecture N/A N/A CAPI 1.0 OpenCAPI3.0, OpenCAPI3.0, OpenCAPI4.0, TBA NVLink NVLink NVLink NVLink

© 2019 IBM Corporation Statement of Direction, Subject to Change 4 Future Evolution of System Architecture

Yesterday’s Plumbing Tomorrow’s Differentiation B B B B B B B B B B B B B B B B B B Emerging JEDEC Buffers OMI Memory

Cores & Cores & Cores & Caches System Bus Caches Caches 768 GB/s PowerAXON & PCI Yesterday’s Plumbing PCI Gen N+1 SMP / OpenCAPI / NVLink Tomorrow’s Differentiation

© 2019 IBM Corporation 55 POWER9 – Premier Acceleration Platform • Extreme Processor / Accelerator Bandwidth and Reduced Latency • Coherent Memory and Virtual Addressing Capability for all Accelerators POWER9 • OpenPOWER Community Enablement – Robust Accelerated Compute Options PowerAccel

• State of the Art I/O and Acceleration Attachment Signaling PCIe Devices PCIe G4 I/O – PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth PCIe – G4 ASIC / CAPI – 25 G Common Link x 96 lanes – 600 GB/s duplex bandwidth FPGA CAPI 2.0 Devices NVLink • Robust Accelerated Compute Options with OPEN standards NVLink GPUs 25G OpenCAPI – On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2 ASIC / – OpenCAPI Memory FPGA – CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4 Devices – DRAM/ OMI – NVLink – Next generation of GPU/CPU bandwidth SCM/ etc OpenCAPI High bandwidth, low latency and open interface – – On Chip Accel – OMI – High bandwidth and/or differentiated for acceleration

© 2019 IBM Corporation 6 THE WORLD’S TWO MOST POWERFUL BUILT FOR THE AI ERA WITH OPEN COLLABORATION

© 2019 IBM Corporation OpenCAPI Design Goals

• Designed to support range of devices – Coherent Caching Accelerators – Network Controllers – Differentiated Memory o High Bandwidth o Low Latency o Storage Class Memory – Storage Controllers

• Asymmetric design, endpoint optimized for host and device attach – ISA of Host Architecture: Need to hide difference in Coherence, Memory Model, Address Translation, etc. – Design schedule: The design schedule of a high performance CPU host is typically on the order of multiple years, conversely, accelerator devices have much shorter development cycles, typically less than a year. – Timing Corner: ASIC and FPGA technologies run at lower frequencies and timing optimization as CPUs. – Plurality of devices: Effort in the host, both IP and circuit resource, have a multiplicative effect. – Trust: Attached devices are susceptible to both intentional and unintentional trust violations – : Hosts have high variability in protocol. Host cannot trust attached device to obey rules.

© 2019 IBM Corporation OpenCAPI Design Goals

• Low Latency and High Bandwidth – Fixed width DL CRC – Aligned TL – Aligned Data – Separately pipelined control/tag vs data o Compromise in switching capability

© 2019 IBM Corporation POWER9 Family Memory Architecture

Scale Up Superior RAS, High bandwidth, High Capacity Buffered Memory Agnostic interface for alternate memory innovations

Scale Out Low latency access Direct Attach Memory Commodity packaging form factor

OpenCAPI Agnostic Buffered Memory (OMI)

Near Tier Commodity Enterprise Storage Class Extreme Bandwidth Low Latency RAS Extreme Capacity Low Capacity Low Cost Capacity Persistence Bandwidth

IBM Confidential © 2019 IBM Corporation Same Open Memory Interface used for all Systems and Memory Technologies 10 Primary Tier Memory Options

72b ~2666 MHz bidi DDR4 RDIMM 8b Capacity ~256 GB BW ~150 GB/sec Same System

72b ~2666 MHz bidi DDR4 LRDIMM Only 5-10ns

8b 8b Capacity ~2 TB higher load-to-use BW ~150 GB/sec than RDIMM (< 5ns for LRDIMM)

8b ~25G diff DDR4 OMI DIMM

BUF 8b 8b Capacity ~256GB→4 TB BW ~320 GB/sec Same System BW Opt OMI DIMM 8b ~25G diff

BUF 16b Capacity ~128→512 GB OMI Strategy OMI BW ~650 GB/sec

1024b On module On Module HBM Si interposer Capacity ~16→32 GB Unique System BW ~1 TB/sec

© 2019 IBM Corporation 11 Final Addition to the POWER9 Processor Family Processor Chip Details The Bandwidth Beast Open Memory Interface (OMI) • 728 mm2 ( 25.3 x 28.8 mm) Advanced I/O (AIO) • 16 channels x8 at 25 GT/s • 8 Billion Transistors • 650 GB/s peak 1:1 r/w bandwidth • Up to 24 SMT4 Cores PowerAXON (x48) Memory Signaling (8x8 OMI) Technology Agnostic Up to 120 MB eDRAM L3 cache • • Core Core Core Core Core Core Core Core Offered w/ Microchip DDR4 buffer

SMPandAccelerator Interconnect •

(410 GB/s peak bandwidth) LocalSignalingSMP (3x30) Semiconductor Technology PCIe Gen4 Signaling(x48) L2 L2 L2 L2 10 MB L3 Region • 14nm finFET L3 L3 PowerAXON 25 GT/s Attach • Improved device performance L2 L2 L2 L2 • Up to 16 socket glue-less SMP • Reduced energy (4x24 SMP added to 3x30 local) Core Core Core Core Core Core Core Core • eDRAM • Up to x48 NVIDIA NVLINK GPU • 17 layer metal stack attach L3 L3 • Up to x48 OpenCAPI 4.0 coherent L2 L2 L2 L2 accelerator / memory attach High Bandwidth Signaling • 25 GT/s low energy differential Core Core Core Core Core Core Core Core Industry Standard I/O Attach • PowerAXON, OMI memory PowerAXON (x48) Memory Signaling (8x8 OMI) • x48 PCIe Gen 4 at 16 GT/s • 16 GT/s low energy differential • Up to x16 CAPI 2.0 coherent • Local SMP 2 TB/s Raw Signaling Bandwidth accelerator / storage attach • 16 GT/s PCIe Gen4 Shared by 6 Attach Protocols © 2019 IBM Corporation 12 Unleashing the Bandwidth: POWER9 AIO + NVIDIA GPU

Potential System Offering NVLink GPU GPU

GPU GPU OMI Memory OMI Memory NVLink

OpenCAPI OpenCAPI IBM IBM POWER9 AIO POWER9 AIO PCIe Gen4 SMP PCIe Gen4

OMI Memory OMI Memory © 2019 IBM Corporation Statement of Direction, Subject to Change 13 OpenCAPI 4.0: Asymmetric Open Accelerator Attach Roadmap of Capabilities and Host Silicon Delivery

Accelerator Protocol CAPI 1.0 CAPI 2.0 OpenCAPI 3.0 OpenCAPI 4.0 OpenCAPI 5.0 First Host Silicon POWER8 POWER9 SO POWER9 SO POWER9 AIO POWER10 (GA 2014) (GA 2017) (GA 2017) (GA 2020) (GA 2021) Functional Partitioning Asymmetric Asymmetric Asymmetric Asymmetric Asymmetric Host Architecture POWER POWER Any Any Any Cache Line Size Supported 128B 128B 64/128/256B 64/128/256B 64/128/256B Attach Vehicle PCIe Gen 3 PCIe Gen 4 25 G (open) 25 G (open) 32/50 G (open) Tunneled Tunneled Native DL/TL Native DL/TL Native DL/TL Address Translation On Accelerator Host Host (secure) Host (secure) Host (secure) Native DMA to Host Mem No Yes Yes Yes Yes Atomics to Host Mem No Yes Yes Yes Yes Host Thread Wake-up No Yes Yes Yes Yes Host Memory Attach Agent No No Yes Yes Yes Low Latency Short Msg 4B/8B MMIO 4B/8B MMIO 4B/8B MMIO 128B push 128B push Posted Writes to Host Mem No No No Yes Yes Caching of Host Mem RA Cache RA Cache No VA Cache VA Cache

© 2019 IBM Corporation 14 POWER9 Connectivity Variants POWER9 Scale Out POWER9 Scale Up POWER9 Advanced I/O Direct Attach Memory DMI Buffered (Centaur) Memory OMI Buffered Memory 2 Socket SMP 16 Socket SMP 16 Socket SMP

4 x DDR4 memory 4 x DMI buffered memory 8 x OMI buffered memory x24 Accelerator x48 PowerAXON x48 PowerAXON Interconnect Interconnect Advanced Interconnect

- OMI Bandwidth - OpenCAPI function - NVLINK function

3x bandwidth 6x bandwidth per mm2 as per mm2 as DDR signaling DDR signaling

x24 Accelerator 2 x Local SMP x48 PowerAXON 3 x Local SMP x48 PowerAXON 3 x Local SMP 4 x DDR4 memory 4 x DMI buffered memory 8 x OMI buffered memory © 2019 IBM Corporation 15 DRAM DIMM Comparison

IBM Centaur DIMM OMI DDIMM

• Technology agnostic • Low cost JEDEC DDR • Ultra-scale system density DIMM • Enterprise reliability • Low-latency • High bandwidth

© 2019 IBM Corporation Approximate Scale 16 Open Memory Interface (OMI)

• Signaling: 25.6GHz vs DDR4 @ 3200 MHz – 4x raw bandwidth per I/O signal – 1.3x mixed traffic utilization • Idle load-to-use latency over traditional DDR: – POWER8/9 Centaur design ~10 ns – OMI target of ~5-10 ns (RDIMM) – OMI target of < 5ns (LRDIMM)

• IBM Centaur: One proprietary DMI design • Microchip SMC 1000: – Open (OMI) design – Emerging JEDEC Standard

© 2019 IBM Corporation OMI Memory Latency

SMC 1000

HSS HSS DDR4 Cmd/Addr Lane Deskew Fast Activate Driver Transmit Receiver Transmit PHY

DRAM DRAM DRAM DRAM HSS HSS Frame Read DDR4 Data Receiver Receive Transmitter Formatting Dataflow Receive PHY

IBM POWER9 AIO

Microchip’s SMC 1000 8x25G features an innovative low latency design that delivers less than four ns incremental latency over a traditional integrated DDR controller with LRDIMM. This results in OMI-based DDIMM products having virtually identical bandwidth and latency performance to comparable LRDIMM products.

6X bandwidth / PHY area advantage gives POWER9 AIO bandwidth of 16 DDR ports © 2019 IBM Corporation 18 Final Addition to POWER9: Stay Tuned for POWER10

PowerAXON The Bandwidth Beast OMI Memory POWER9 with Advanced I/O (AIO)

PowerAXON (x48) Memory Signaling (8x8 OMI)

Core Core Core Core Core Core Core Core

GPU SMPandAccelerator Interconnect LocalSignalingSMP (3x30)

PCIe Gen4 Signaling(x48) L2 L2 L2 L2 10 MB NVLink L3 L3 Region L3

GPU L2 L2 L2 L2

Core Core Core Core Core Core Core Core

GPU L3 L3 L2 L2 L2 L2

Core Core Core Core Core Core Core Core

PowerAXON (x48) Memory Signaling (8x8 OMI)

Thank You! © 2019 IBM Corporation 19