IBM's Next Generation POWER Processor

Hot Chips, August 18-20, 2019
Jeff Stuecheli, Scott Willenborg, William Starke

Proposed POWER Processor Technology and I/O Roadmap

This roadmap slide appears three times in the deck, captioned in turn "Focus of 2018 talk", "Focus of today’s talk" and "Looking forward".

Year  Chip              Cores  Process  Micro-architecture and technology
2010  POWER7            8      45nm     New micro-architecture; new process technology
2012  POWER7+           8      32nm     Enhanced micro-architecture; new process technology
2014  POWER8            12     22nm     New micro-architecture; new process technology
2016  POWER8 w/ NVLink  12     22nm     Enhanced micro-architecture with NVLink
2017  P9 SO             12/24  14nm     New micro-architecture; direct attach memory; new process technology
2018  P9 SU             12/24  14nm     Enhanced micro-architecture; buffered memory
2020  P9 AIO            12/24  14nm     Enhanced micro-architecture; new memory subsystem
2021  P10               TBA    -        New micro-architecture; new process technology

Chip              Sustained memory bandwidth (up to)  Standard I/O interconnect  Advanced I/O signaling  Advanced I/O architecture
POWER7            65 GB/s                             PCIe Gen2                  N/A                     N/A
POWER7+           65 GB/s                             PCIe Gen2                  N/A                     N/A
POWER8            210 GB/s                            PCIe Gen3                  N/A                     CAPI 1.0
POWER8 w/ NVLink  210 GB/s                            PCIe Gen3                  20 GT/s, 160 GB/s       CAPI 1.0, NVLink
P9 SO             150 GB/s                            PCIe Gen4 x48              25 GT/s, 300 GB/s       CAPI 2.0, OpenCAPI 3.0, NVLink
P9 SU             210 GB/s                            PCIe Gen4 x48              25 GT/s, 300 GB/s       CAPI 2.0, OpenCAPI 3.0, NVLink
P9 AIO            650 GB/s                            PCIe Gen4 x48              25 GT/s, 300 GB/s       CAPI 2.0, OpenCAPI 4.0, NVLink
P10               800 GB/s                            PCIe Gen5                  32 & 50 GT/s            TBA

© 2019 IBM Corporation. Statement of Direction, Subject to Change.

Future Evolution of System Architecture

[Figure: block diagram contrasting "Yesterday’s Plumbing" with "Tomorrow’s Differentiation". Labels: Cores & Caches; System Bus; memory buffers (B); Emerging JEDEC Buffers; OMI Memory; 768 GB/s; PowerAXON & PCI; PCI Gen N+1; SMP / OpenCAPI / NVLink.]

POWER9 – Premier Acceleration Platform

• Extreme Processor / Accelerator Bandwidth and Reduced Latency
• Coherent Memory and Virtual Addressing Capability for all Accelerators
• OpenPOWER Community Enablement – Robust Accelerated Compute Options
• State of the Art I/O and Acceleration Attachment Signaling
  – PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
  – 25 G Common Link x 96 lanes – 600 GB/s duplex bandwidth
• Robust Accelerated Compute Options with OPEN standards
  – On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2
  – CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4
  – NVLink – Next generation of GPU/CPU bandwidth
  – OpenCAPI – High bandwidth, low latency and open interface
  – OMI – High bandwidth and/or differentiated for acceleration

[Figure: POWER9 with its PowerAccel attach options. Labels: PCIe G4 I/O to PCIe devices; CAPI 2.0 to ASIC / FPGA devices; NVLink to Nvidia GPUs; 25G OpenCAPI to ASIC / FPGA devices; OMI to memory (DRAM / SCM / etc.); On Chip Accel.]

THE WORLD’S TWO MOST POWERFUL SUPERCOMPUTERS BUILT FOR THE AI ERA WITH OPEN COLLABORATION

OpenCAPI Design Goals

• Designed to support range of devices
  – Coherent Caching Accelerators
  – Network Controllers
  – Differentiated Memory
    o High Bandwidth
    o Low Latency
    o Storage Class Memory
  – Storage Controllers
• Asymmetric design, endpoint optimized for host and device attach
  – ISA of Host Architecture: Need to hide differences in Coherence, Memory Model, Address Translation, etc.
  – Design schedule: The design schedule of a high performance CPU host is typically on the order of multiple years; conversely, accelerator devices have much shorter development cycles, typically less than a year.
  – Timing Corner: ASIC and FPGA technologies run at lower frequencies and with less timing optimization than CPUs.
  – Plurality of devices: Effort in the host, both IP and circuit resource, has a multiplicative effect.
  – Trust: Attached devices are susceptible to both intentional and unintentional trust violations.
  – Cache coherence: Hosts have high variability in protocol. Host cannot trust attached device to obey rules.

OpenCAPI Design Goals

• Low Latency and High Bandwidth
  – Fixed width DL CRC
  – Aligned TL
  – Aligned Data
  – Separately pipelined control/tag vs data
    o Compromise in switching capability

POWER9 Family Memory Architecture

Scale Up: Buffered Memory
• Superior RAS, High bandwidth, High Capacity
• Agnostic interface for alternate memory innovations

Scale Out: Direct Attach Memory
• Low latency access
• Commodity packaging form factor

OpenCAPI Agnostic Buffered Memory (OMI)
• Near Tier: Extreme Bandwidth, Low Capacity
• Commodity: Low Latency, Low Cost
• Enterprise: RAS, Capacity, Bandwidth
• Storage Class: Extreme Capacity, Persistence

Same Open Memory Interface used for all Systems and Memory Technologies

Primary Tier Memory Options

Option           Interface                         Capacity       Bandwidth
DDR4 RDIMM       72b, ~2666 MHz, bidirectional     ~256 GB        ~150 GB/sec
DDR4 LRDIMM      72b, ~2666 MHz, bidirectional     ~2 TB          ~150 GB/sec
DDR4 OMI DIMM    8b, ~25G differential, buffered   ~256 GB→4 TB   ~320 GB/sec
BW Opt OMI DIMM  8b, ~25G differential, buffered   ~128→512 GB    ~650 GB/sec
On Module HBM    1024b, on-module Si interposer    ~16→32 GB      ~1 TB/sec

OMI DIMM: only 5-10ns higher load-to-use than RDIMM (< 5ns for LRDIMM).
Other annotations on the slide: Same System (shown twice), OMI Strategy, Unique System (next to On Module HBM).

Final Addition to the POWER9 Processor Family
The Bandwidth Beast: Advanced I/O (AIO)

Processor Chip Details
• 728 mm² (25.3 x 28.8 mm)
• 8 Billion Transistors
• Up to 24 SMT4 Cores
• Up to 120 MB eDRAM L3 cache

Semiconductor Technology
• 14nm finFET
• Improved device performance
• Reduced energy
• eDRAM
• 17 layer metal stack

High Bandwidth Signaling
• 25 GT/s low energy differential
  – PowerAXON, OMI memory
• 16 GT/s low energy differential
  – Local SMP
• 16 GT/s PCIe Gen4

Open Memory Interface (OMI)
• 16 channels x8 at 25 GT/s
• 650 GB/s peak 1:1 r/w bandwidth
• Technology Agnostic
• Offered w/ Microchip DDR4 buffer (410 GB/s peak bandwidth)

PowerAXON 25 GT/s Attach
• Up to 16 socket glue-less SMP (4x24 SMP added to 3x30 local)
• Up to x48 NVIDIA NVLINK GPU attach
• Up to x48 OpenCAPI 4.0 coherent accelerator / memory attach

Industry Standard I/O Attach
• x48 PCIe Gen 4 at 16 GT/s
• Up to x16 CAPI 2.0 coherent accelerator / storage attach

2 TB/s Raw Signaling Bandwidth Shared by 6 Attach Protocols

[Figure: die floorplan. Labels: PowerAXON (x48), top and bottom; Memory Signaling (8x8 OMI), top and bottom; Local SMP Signaling (3x30); PCIe Gen4 Signaling (x48); SMP and Accelerator Interconnect; Core; L2; L3; 10 MB L3 Region.]

Unleashing the Bandwidth: POWER9 AIO + NVIDIA GPU

Potential System Offering

[Figure: two IBM POWER9 AIO processors linked by SMP, each with OMI Memory, PCIe Gen4 and OpenCAPI attach, and connected over NVLink to four GPUs.]
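The duplex bandwidth figures quoted in the slides (192 GB/s for PCIe Gen 4 x48, 600 GB/s for the 25 G Common Link x96) follow from lane count times transfer rate. The sketch below reproduces that arithmetic. It assumes one bit per transfer per lane, both directions active at once, and no encoding or protocol overhead; the function name and the grouping of links are illustrative and do not come from the deck. Lane counts are taken from the chip details slide (two x48 PowerAXON groups, 16 OMI channels of x8, 3x30 local SMP, x48 PCIe).

```python
def duplex_gbytes_per_s(lanes: int, gt_per_s: float) -> float:
    """Raw duplex bandwidth in GB/s.

    Assumption (not from the slides): each lane moves 1 bit per transfer
    in each direction, with no encoding or protocol overhead.
    """
    per_direction = lanes * gt_per_s / 8  # Gb/s -> GB/s
    return 2 * per_direction              # both directions at once


# (lanes, GT/s) per link group, read off the slides.
links = {
    "PCIe Gen4 x48": (48, 16),
    "PowerAXON 2 x48": (96, 25),
    "OMI 16 channels x8": (128, 25),
    "Local SMP 3x30": (90, 16),
}

raw = {name: duplex_gbytes_per_s(*spec) for name, spec in links.items()}
total = sum(raw.values())

for name, gbps in raw.items():
    print(f"{name:20s} {gbps:7.1f} GB/s")
print(f"{'Total':20s} {total:7.1f} GB/s")
```

Under these assumptions the PCIe and PowerAXON rows match the quoted 192 GB/s and 600 GB/s, and the four groups sum to 1952 GB/s, which is one plausible reading of the "2 TB/s Raw Signaling Bandwidth" headline. The raw OMI figure (800 GB/s) is higher than the 650 GB/s peak the slides quote for memory, presumably because the quoted number accounts for overhead that this sketch ignores.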
