POWER8/9 Deep Dive

Jeff Stuecheli
POWER Systems, IBM Systems
© 2016 IBM Corporation

POWER Processor Technology Roadmap

POWER7 (45 nm, 1H10): Enterprise
- 8 Cores
- SMT4
- eDRAM L3 Cache

POWER7+ (32 nm, 2H12): Enterprise
- 2.5x Larger L3 cache
- On-die acceleration
- Zero-power core idle state

POWER8 Family (22 nm, 1H14 – 2H16): Enterprise & Big Data Optimized
- Up to 12 Cores
- SMT8
- CAPI Acceleration
- High Bandwidth GPU Attach

POWER9 Family (14 nm, 2H17 – 2H18+): Built for the Cognitive Era
- Enhanced Core and Chip Architecture Optimized for Emerging Workloads
- Processor Family with Scale-Up and Scale-Out Optimized Silicon
- Premier Platform for Accelerated Computing

POWER8 Processor Family

Systems: Power E850; NVLINK GPU Enabled SuperCompute Node; Power S8xxLC; Power S8xx/S8xxL; Power E870/E880
Chips: Enterprise Chip; SuperCompute Chip; Scale-Out Chip; Enterprise Chip
Packages: Entry SCM; SC SCM; Scale-Out DCM; Enterprise SCM
Chip organization: Single Large Chip; Dual Small Chips; Single Large Chip
Cores: Up to 12 cores; Up to 2 x 6 cores; Up to 12 cores
GPU: NVLINK GPU Attach
SMP: Up to 4 socket SMP; Up to 4 socket SMP; Up to 16 socket SMP
PCI: Up to 48x PCI lanes; Up to 32x PCI lanes
Memory: Half memory; Full Memory; Full Memory
Other: Cost reduced

IBM S822LC with Nvidia P100 and NVLink

· Up to four P100 per system: extreme performance;
· OpenPOWER hardware design: leveraging technologies from IBM and OpenPOWER partners;
· Focused on high performance applications: advanced analytics, Deep Neural Networks, Machine Learning.

Nvidia Tesla P100 – First GPU with NVLink

· Extreme performance: powering HPC, deep learning, and many more GPU Computing areas;
· NVLink™: NVIDIA’s new high speed, high bandwidth interconnect for maximum application scalability;
· HBM2: fastest, high capacity, extremely efficient stacked GPU memory architecture;
· Unified Memory and Compute Preemption: significantly improved programming model;
· 16nm FinFET: enables more features, higher performance, and improved power efficiency.
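The P100 slide quotes 40 GB/s for NVLink against 16 GB/s for PCIe Gen3. A minimal sketch of what that ratio means for an idealized transfer follows. It assumes the quoted figures are sustained peak rates and ignores latency and protocol overhead; the function names are hypothetical and not part of any IBM or NVIDIA API.

```python
# Bandwidth figures as quoted on the slide, in GB/s.
NVLINK_GB_PER_S = 40.0
PCIE_GEN3_GB_PER_S = 16.0


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: payload size divided by peak bandwidth.

    Assumption: no latency, no protocol overhead, link fully utilized.
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


def nvlink_speedup() -> float:
    """Ratio of the two quoted peak bandwidths."""
    return NVLINK_GB_PER_S / PCIE_GEN3_GB_PER_S


if __name__ == "__main__":
    payload_gb = 16.0  # hypothetical payload size, chosen for round numbers
    print("PCIe Gen3:", transfer_seconds(payload_gb, PCIE_GEN3_GB_PER_S), "s")
    print("NVLink:   ", transfer_seconds(payload_gb, NVLINK_GB_PER_S), "s")
    print("Speedup:  ", nvlink_speedup())
```

Real transfers will fall short of these numbers; the sketch only shows how the quoted bandwidths translate into a ratio.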
NVLink significantly increases performance for both GPU-to-GPU communications and GPU access to system memory.

40 GB/s versus 16 GB/s of PCIe Gen3
5.3 TFLOPS of double precision floating point (FP64) performance
10.6 TFLOPS of single precision (FP32) performance
21.2 TFLOPS of half-precision (FP16) performance
Up to 3 times the performance of previous generation
3 times memory bandwidth of K40/M40

NVIDIA Roadmap on POWER

Kepler – K40, K80; Pascal – P100; Volta
CUDA 5.5 – 7.0; CUDA 8; CUDA 9
Unified Memory
Kepler; Pascal; Volta
SXM2; SXM2
PCIe; NVLink; NVLink 2.0
POWER8; POWER8+; POWER9
Buffered Memory
POWER8; POWER8 NVLink; POWER9

POWER9 Family – Deep Workload Optimizations

Emerging Analytics, AI, Cognitive
- New core for stronger thread performance
- Delivers 2x compute resource per socket
- Built for acceleration – OpenPOWER solution enablement

Technical / HPC
- Highest bandwidth GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High bandwidth direct attach memory

Cloud / HSDC
- Power / Packaging / Cost optimizations for a range of platforms
- Superior virtualization features: security, power management, QoS, interrupt
- State of the art IO technology for network and storage performance

Enterprise
- Large, flat, Scale-Up Systems
- Buffered memory for maximum capacity
- Leading RAS
- Improved caching

POWER9 Processor – Common Features

New Core Microarchitecture
• Stronger thread performance
• Efficient agile pipeline
• POWER ISA v3.0

Enhanced Cache Hierarchy
• 120MB NUCA L3 architecture
• 12 x 20-way associative regions
• Advanced replacement policies
• Fed by 7 TB/s on-chip bandwidth

Cloud + Virtualization Innovation
• Quality of service assists
• New interrupt architecture
• Workload optimized frequency
• Hardware enforced trusted execution

Leadership Hardware Acceleration Platform
• Enhanced on-chip acceleration
• Nvidia NVLink 2.0: High bandwidth and advanced new features (25G)
• CAPI 2.0: Coherent accelerator and storage attach (PCIe G4)
• New CAPI: Improved latency and bandwidth, open interface (25G)

State of the Art I/O Subsystem
• PCIe Gen4 – 48 lanes

High Bandwidth Signaling Technology
• 16 Gb/s interface
  – Local SMP
• 25 Gb/s Common Link interface
  – Accelerator, remote SMP

14nm finFET Semiconductor Process
• Improved device performance and reduced energy
• 17 layer metal stack and eDRAM
• 8.0 billion transistors

[Figure: chip floorplan showing cores with shared L2 and L3 regions, SMP/accelerator signaling, memory signaling, SMP signaling, PCIe, on-chip accelerators, and the SMP interconnect and off-chip accelerator enablement.]

POWER9 Processor Family
Four targeted implementations

Core Count / Size
• SMT4 Core: 24 SMT4 Cores / Chip, Linux Ecosystem Optimized
• SMT8 Core: 12 SMT8 Cores / Chip, PowerVM Ecosystem Continuity

SMP scalability / Memory subsystem
Scale-Out – 2 Socket Optimized
Robust 2 socket SMP system
Direct Memory Attach
• Up to 8 DDR4 ports
• Commodity packaging form factor

Scale-Up – Multi-Socket Optimized
Scalable System Topology / Capacity
• Large multi-socket
Buffered Memory Attach
• 8 Buffered channels

New POWER9 Cores
Optimized for Stronger Thread Performance and Efficiency

• Increased execution bandwidth efficiency for a range of workloads including commercial, cognitive and analytics
• Sophisticated instruction scheduling and branch prediction for unoptimized applications and interpretive languages
• Adaptive features for improved efficiency and performance especially in lower memory bandwidth systems

Available with SMT8 or SMT4 Cores
8 or 4 threaded core built from modular execution slices
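The chip-level figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only numbers from the slides (24 SMT4 or 12 SMT8 cores per chip, 120MB of L3 in 12 regions). It assumes the L3 capacity is split evenly across the 12 regions, which the slides do not state explicitly; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreVariant:
    """One of the two POWER9 core configurations named on the slide."""
    name: str
    cores_per_chip: int
    threads_per_core: int

    @property
    def threads_per_chip(self) -> int:
        # Hardware threads per chip = cores x SMT ways.
        return self.cores_per_chip * self.threads_per_core


# Figures taken from the "Four targeted implementations" slide.
SMT4 = CoreVariant("SMT4", cores_per_chip=24, threads_per_core=4)
SMT8 = CoreVariant("SMT8", cores_per_chip=12, threads_per_core=8)

# Figures taken from the "Enhanced Cache Hierarchy" bullets.
L3_TOTAL_MB = 120
L3_REGIONS = 12


def l3_mb_per_region(total_mb: int = L3_TOTAL_MB, regions: int = L3_REGIONS) -> float:
    """Per-region L3 capacity, ASSUMING an even split across regions."""
    return total_mb / regions


if __name__ == "__main__":
    for variant in (SMT4, SMT8):
        print(variant.name, "threads per chip:", variant.threads_per_chip)
    print("L3 per region (MB):", l3_mb_per_region())
```

Under these assumptions both core variants expose the same number of hardware threads per chip; they differ in how those threads are grouped into cores.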
[Figure: POWER9 SMT8 and SMT4 core block diagrams. Each shows branch predict, instruction cache (16i for SMT8, 8i for SMT4), instruction buffer, decode/dispatch (12i for SMT8, 6i for SMT4), QP, Crypto, BRU and DFU units, and rows of 64b VSU and DW LSU slices.]

POWER9 SMT8 Core
• PowerVM Ecosystem Continuity
• Strongest Thread
• Optimized for Large Partitions

POWER9 SMT4 Core
• Linux Ecosystem Focus
• Core Count / Socket
• Virtualization Granularity

POWER9 Core Execution Slice Microarchitecture

Modular Execution Slices

[Figure: a 64b slice (64b VSU plus DW LSU) pairs into a 128b super-slice; the POWER9 SMT4 core is built from 2 x 128b super-slices and the POWER9 SMT8 core from 4 x 128b super-slices. Shown alongside the POWER8 SMT8 core with its IFU, ISU, FXU, VSU, DFU and LSU blocks.]

Re-factored Core Provides Improved Efficiency & Workload Alignment
• Enhanced pipeline efficiency with modular execution and intelligent pipeline control
• Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
• Shared compute resource optimizes data-type interchange

POWER9 Core Pipeline Efficiency

Shorter Pipelines with Reduced Disruption

Improved application performance for modern codes
• Shorten fetch to compute by 5 cycles
• Advanced branch prediction

Higher performance and pipeline utilization
• Improved instruction management
  – Removed instruction grouping and reduced cracking
  – Enhanced instruction fusion
  – Complete up to 128 (64 – SMT4 Core) instructions per cycle

Reduced latency and improved scalability
• Local pipe control of load/store operations
  – Improved hazard avoidance
  – Local recycles – reduced hazard disruption
  – Improved lock management

[Figure: POWER8 and POWER9 pipeline stage diagrams, annotated "Fetch to Compute Reduced by 5 cycles" and "Reduced Hazard Disruption".]

POWER9 – Core Compute

Symmetric Engines Per Data-Type for Higher Performance on Diverse Workloads

SMT4 Core Resources

Fetch / Branch
• 32kB, 8-way Instruction Cache
• 8 fetch, 6 decode
• 1x branch execution

Slices issue VSU and AGEN
• 4x scalar-64b / 2x vector-128b
• 4x load/store AGEN

Vector Scalar Unit (VSU) Pipes
• 4x ALU + Simple (64b)
• 4x FP + FX-MUL + Complex (64b)
• 2x Permute (128b)
• 2x Quad Fixed (128b)
• 2x Fixed Divide (64b)
• 1x Quad FP & Decimal FP
• 1x Cryptography

Load Store Unit (LSU) Slices
• 32kB, 8-way Data Cache
• Up to 4 DW load or store

Efficient Cores Deliver 2x Compute Resource per Socket

[Figure: SMT4 core block diagram with predecode, L1 instruction cache, IBUF, decode/crack, branch prediction, dispatch (allocate/rename), instruction completion table, a branch slice, slices 0-3 (ALU, AGEN, XS, FP, MUL, XC, PM, QFX, QP/DFU, DIV, CRYPT, ST-D), L1D$ 0-3, LRQ 0/1 and 2/3, and SRQ 0-3.]

POWER ISA v3.0

New Instruction Set Architecture Implemented on POWER9

Broader data type support
• 128-bit IEEE 754 Quad-Precision Float
  – Full width quad-precision for financial and security applications
• Expanded BCD and 128b Decimal Integer
  – For database and native analytics
• Half-Precision Float Conversion
  – Optimized
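To illustrate what half-precision float conversion involves, here is a minimal software sketch using Python's standard `struct` module, whose `e` format code packs IEEE 754 binary16 values. It does not use or model the POWER ISA v3.0 instructions themselves; the helper names are hypothetical and only show the rounding that any such conversion implies.

```python
import struct


def to_half_bits(x: float) -> int:
    """Round x to IEEE 754 binary16 and return the 16-bit pattern."""
    # "<e" packs a half-precision float; "<H" reads it back as an unsigned short.
    return struct.unpack("<H", struct.pack("<e", x))[0]


def from_half_bits(bits: int) -> float:
    """Expand a 16-bit binary16 pattern to a Python float (binary64)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_trip(x: float) -> float:
    """Convert to half precision and back, exposing the rounding error."""
    return from_half_bits(to_half_bits(x))


if __name__ == "__main__":
    for value in (1.0, 0.5, 0.1, 65504.0):
        bits = to_half_bits(value)
        print(f"{value!r:>8} -> 0x{bits:04X} -> {from_half_bits(bits)!r}")
```

Values such as 1.0 and 0.5 survive the round trip exactly, while 0.1 does not, which is the precision trade-off a half-precision storage format accepts in exchange for halving memory footprint relative to single precision.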
