The IBM POWER9 Scale up Processor
Total Page:16
File Type:pdf, Size:1020Kb
The IBM POWER9 Scale Up Processor Jeff Stuecheli William Starke POWER Systems, IBM Systems © 2018 IBM Corporation POWER Processor Technology Roadmap POWER9 Family 14nm POWER8 Family Built for the Cognitive Era 22nm POWER7+ − Enhanced Core and Chip 32 nm Enterprise & Architecture Optimized for POWER7 Big Data Optimized Emerging Workloads 45 nm Enterprise − Processor Family with - Up to 12 Cores Scale-Up and Scale-Out Enterprise - 2.5x Larger L3 cache - SMT8 - On-die acceleration - CAPI Acceleration Optimized Silicon - 8 Cores - Zero-power core idle state - High Bandwidth GPU Attach - SMT4 − Premier Platform for - eDRAM L3 Cache Accelerated Computing 1H10 2H12 1H14 – 2H16 2H17 – 2H18+ © 2018 IBM Corporation 2 POWER9 Processor – Common Features SMP/Accelerator Signaling Memory Signaling Leadership New Core Microarchitecture Hardware Acceleration Platform • Stronger thread performance Core Core Core Core • Enhanced on-chip acceleration • Efficient agile pipeline L2 L2 L2 L2 • Nvidia NVLink 2.0: High bandwidth • POWER ISA v3.0 L3 Region L3 Region L3 Region L3 Region and advanced new features (25G) L3 Region L3 Region L3 Region L3 Region • CAPI 2.0: Coherent accelerator and L2 L2 L2 L2 Enhanced Cache Hierarchy storage attach (PCIe G4) • 120MB NUCA L3 architecture Core Core Core Core • OpenCAPI: Improved latency and bandwidth, open interface (25G) • 12 x 20-way associative regions PCIe Signaling PCIe On-Chip Accel Signaling SMP L3 Region L3 Region SMP Interconnect & L3 Region L3 Region • Advanced replacement policies L2 L2 Chip Accelerator Enablement L2 L2 State of the Art I/O Subsystem - • Fed by 7 TB/s on-chip bandwidth Off Core Core Core Core • PCIe Gen4 – 48 lanes Cloud + Virtualization Innovation SMP/Accelerator Signaling Memory Signaling High Bandwidth Signaling Technology • Quality of service assists 14nm finFET Semiconductor Process • 16 Gb/s interface • New interrupt architecture • Improved device performance and – Local SMP • Workload optimized frequency reduced energy • PowerAXON 25 GT/sec Link interface • Hardware enforced trusted execution • 17 layer metal stack and eDRAM – Accelerator, remote SMP • 8.0 billion transistors © 2018 IBM Corporation 3 POWER9 – Dual Memory Subsystems Scale Out Scale Up Direct Attach Memory Buffered Memory 8 Direct DDR4 Ports 8 Buffered Channels • Up to 150 GB/s of sustained bandwidth • Up to 230GB/s of sustained bandwidth • Low latency access • Extreme capacity – up to 8TB / socket • Commodity packaging form factor • Superior RAS with chip kill and lane sparing • Adaptive 64B / 128B reads • Compatible with POWER8 system memory • Agnostic interface for alternate memory innovations © 2018 IBM Corporation 4 PowerAXON High-speed 25 GT/s Signaling Power Processor Utilize Best-of-Breed Power Processor 25 GT/s Optical-Style Flexible & Modular Coherence Packaging Signaling Technology OpenCAPI Infrastructure App FPGA © 2018 IBM Corporation 5 Scale Out Scale Up Direct Attach Memory Buffered Memory 2 Socket SMP 16 Socket SMP PowerAXON PowerAXON x24 x48 DDRx4 DMIx4 Mem Cntl Mem Cntl SMPx2 SMPx3 PCIe PCIe Mem Cntl Mem Cntl DDRx4 DMIx4 PowerAXON PowerAXON x24 x48 © 2018 IBM Corporation 6 Scale Up Chipset Block Diagram PCI G4 Processor Accel Local 3x32 x48 I/O Cores SMP Attach Links (16 GHz) Cache and (16 GHz) Interconnect 32B 4 ops Bypass Remote 32B SMP Link Active 4 KB Operations OpenCAPI NVLINK Control 64 + 12 2 KB Control Control Connect to Framing / 4x8 6x8 4x24 Memory other POWER9 Parsing PowerAXON 25 GHz Signaling Scale-up chips, Control GPU’s, and x96 OpenCAPI Devices 8 Channels Command, 1B write, per Processor 2B read @ 9.6 GHz 4 ports of Centaur Memory Buffer 1600 MHz 10B DDR4 DDR4 Operation Framing / Sched DRAM and Data Parsing 16 MB DRAM 8 x Control Cache DRAM © 2018 IBM Corporation DRAM 7 Four Socket Server Topology 16 GHz x32 1 TB POWER9 POWER9 Buffered POWER9 POWER9 Scale-Up Scale-Up Memory Scale-Up Scale-Up © 2018 IBM Corporation 8 Sixteen Socket Server Topology 16 GHz x32 25 GHz x24 POWER9 POWER9 POWER9 POWER9 Scale-Up Scale-Up Scale-Up Scale-Up POWER9 POWER9 POWER9 POWER9 Scale-Up Scale-Up Scale-Up Scale-Up POWER9 POWER9 POWER9 POWER9 Scale-Up Scale-Up Scale-Up Scale-Up POWER9 POWER9 POWER9 POWER9 Scale-Up Scale-Up Scale-Up Scale-Up © 2018 IBM Corporation 9 Power E980 Server Modular, Scalable POWER9 server . 1 to 4 x 5U CEC drawers + 2U Control Unit POWER9 Enterprise processor Up to 192 cores in a single system Up to 64 TB DDR4 memory Up to 32 PCIe Gen4 slots PowerAXON 25Gb/s ports . Used for SMP cabling between nodes - 4x bandwidth improvement . Enabled for OpenCAPI accelerators © 2018 IBM Corporation 10 PCIe Expansion Drawer PCIe NVMe Bays SMP/etc Conn. Cable Card CXP Conn FPGA CXP Conn Cable Card CXP Conn FPGA CXP Conn Fan-out Module Fan-out Module © 2018 IBM Corporation 11 GPU on PowerAXON 16 GHz x32 25 GHz x8 POWER9 POWER9 POWER9 POWER9 Scale-Up Scale-Up Scale-Up Scale-Up GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU © 2018 IBM Corporation Demonstration of POWER9 capability, does not imply specific system plans 12 Storage Class Memory on PowerAXON 16 GHz x32 25 GHz x8 POWER9 POWER9 POWER9 POWER9 Scale-Up Scale-Up Scale-Up Scale-Up 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM 32 TB SCM © 2018 IBM Corporation Demonstration of POWER9 capability, does not imply specific system plans 13 Future Evolution of System Architecture Yesterday’s Plumbing Tomorrow’s Differentiation B B B B B B B B B B B B B B B B B B JEDEC Buffer OpenCAPI Northbound Cores & Cores & Cores & Caches System Bus Caches Caches 768 GB/s OpenCAPI & PCI Yesterday’s Plumbing PCI Gen X OpenCAPI / NVLink Tomorrow’s Differentiation CPU/Accelerator Bandwidth 1x 2x 5x 7-10x System bottleneck 14 © 2018 IBM Corporation 14 OpenCAPI Memory • Signaling AXON @25.6GHz vs DDR4 @ 3200 MHz – 4 times the bandwidth per IO • Idle latency over traditional DDR – POWER8/9 Centaur design ~10 ns – OpenCAPI target of ~5 ns • Centaur One proprietary design • OpenCAPI Open SCM © 2018 IBM Corporation 15 Proposed POWER Processor Technology and I/O Roadmap Todays talk POWER7 Architecture POWER8 Architecture POWER9 Architecture POWER10 2010 2012 2014 2016 2017 2018 2019 2020+ POWER7 POWER7+ POWER8 POWER8 P9 SO P9 SU P9 P10 8 cores 8 cores 12 cores w/ NVLink 12/24 cores 12/24 cores w/ Adv. I/O TBA cores 45nm 32nm 22nm 12 cores 14nm 14nm 12/24 cores 22nm 14nm New Micro- Enhanced New Micro- Enhanced New Micro- Enhanced New Micro- Architecture Micro- Architecture Micro- Architecture Micro- Enhanced Architecture Architecture Architecture Architecture Micro- With NVLink Direct attach Architecture memory Buffered New Process New Process New Process Memory New New Technology Technology Technology New Process Memory Technology Technology Subsystem 65 GB/s 65 GB/s 210 GB/s 210 GB/s 150 GB/s 210 GB/s 350+ GB/s 435+ GB/s Sustained Memory Bandwidth PCIe Gen2 PCIe Gen2 PCIe Gen3 PCIe Gen3 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen4 x48 PCIe Gen5 Standard I/O Interconnect 20 GT/s 25 GT/s 25 GT/s N/A N/A N/A 25 GT/s 32 & 50 GT/s 160GB/s Advanced I/O Signaling 300GB/s 300GB/s 300GB/s CAPI 2.0, CAPI 2.0, CAPI 2.0, CAPI 1.0 , OpenCAPI3.0, OpenCAPI3.0, OpenCAPI4.0, N/A N/A CAPI 1.0 NVLink 1.0 TBA Advanced I/O Architecture NVLink2.0 NVLink2.0 NVLink3.0 © 2018 IBM Corporation Statement of Direction, Subject to Change 16 Thanks Questions? © 2018 IBM Corporation.