POWER Processor

POWER Processor Technology Overview
Myron Slota
POWER Systems, IBM Systems
© 2017 IBM Corporation

Quarter Century of POWER
Legacy of Leadership Innovation Driving Client Value
[Figure: timeline of POWER processors, 1990 to 2015, grouped by segment]
• Business: RS64I Apache, RS64II North Star, RS64III Pulsar, RS64IV Sstar, A10 Cobra, A35 Muskie
• Workstation: POWER1, RSC, POWER2, P2SC, POWER3 -630
• PC: 601, 603, 604e
• Modern UNIX Era: POWER4/4+ (180/130nm), POWER5/5+ (130/90nm), POWER6 (65nm), POWER7/7+ (45/32nm), POWER8 (22nm)
• Feature sizes shown for the earlier parts: 1.0um, 0.72um, 0.6um, 0.5um, 0.35um, 0.25um, 0.22um, 0.18um

IBM Optimized Semiconductor Technology
World class technology with value-added features for server business.
POWER9 is built on 14nm finFET technology transitioned to GlobalFoundries
17-layer copper wire
Silicon On Insulator
- Faster Transistor, Less Noise
On-chip eDRAM (14nm)
- 6x latency improvement
- No off-chip signaling required
- 8x bandwidth improvement
- 3x less area than SRAM
- 5x less energy than SRAM
Dense interconnect
- Faster connections
- Low latency distance paths
- High density complex circuits
- 2X wire per transistor
[Figure: eDRAM Cell, labeled DT]

“IBM is committed to meeting the rising demands of cognitive systems and cloud computing.
GF’s leading performance in 7LP process technology, reflecting our joint Research collaboration, will allow IBM Power and Mainframe systems to push beyond limitations to provide high-performance computing solutions while aggressively pursuing 5nm to advance our leadership for years to come.”
Tom Rosamilia, Senior Vice President, IBM Systems

Recent and Future POWER Processor Roadmap
• POWER7: 45 nm, 2010
• POWER7+: 32 nm, 2012
• POWER8 Family: 22nm, 2014 – 2016
• POWER9 Family: 14nm (SO, SU), 2H17 – 2H18+
• POWER10 Family: 2020+
[Figure labels, per generation: Enterprise; New uArch/Tech; Spin in new Tech; System roll-out; Enterprise + Scale-out / OpenPOWER; Scale-out / OpenPOWER; Accelerated / OpenPOWER; New uArch/Tech, multiple optimized]
[Figure labels, trend across generations: General Purpose, IBM only; IBM AIX, IBM i platforms; Enterprise focused; moving toward Highly Optimized (Cognitive / Analytics); Cloud delivery models, Accelerators; Significant Linux / Open / Partner focus]

POWER Processor Technology
Powerful Cores
• Aggressive OOO design with 4 or 8 threads
• Optimized for wide range of algorithms
Robust Scaling
• Large NUCA L3 architecture, eDRAM
• Up to 8 memory channels per socket
• SMP interconnect and on chip switching
Advanced Virtualization
• Coarse or fine grained VM per core
• Advanced features for QoS
Leadership Hardware Acceleration Platform
• Coherent Accelerator Processor Interface provides reduced latency and high BW
• Robust Accelerated computing roadmap supported by OpenPOWER partners
POWER8
• 22 nm SOI technology
• 12 x SMT8 cores
• PCIe G3, up to 48 lanes per package
• CAPI 1.0
• POWER8 w/ NVLINK: GPU NVLink 1.0
POWER9
• 14 nm finFET technology
• 24 x SMT4 or 12 x SMT8 cores
• PCIe G4, 48 lanes per socket
• CAPI 2.0
• 25Gbps Link, 48 lanes for acceleration
• OpenCAPI 3.0
• GPU NVLink 2.0

POWER9 Processor Chipset
4 Targeted Deployments
Core Count / Size
• SMT4 Core: 24 SMT4 Cores / Chip, Linux Ecosystem Optimized
• SMT8 Core: 12 SMT8 Cores / Chip, PowerVM Ecosystem Focus
SMP / Memory
• Scale-Out – 2 Socket Optimized
  – Robust 2 socket SMP system
  – Direct Memory Attach: up to 8 DDR4 ports, commodity packaging form factor
  – OpenCAPI
• Scale-Up – 16-Socket Optimized
  – Scalable System Topology / Capacity: large multi-socket
  – Buffered Memory Attach: 8 Buffered channels
  – OpenCAPI

POWER9: Improved Per Thread Performance with SMT4 or SMT8 Cores
• ‘Modular Execution’ enables SMT4 and SMT8 cores to be efficiently built from same DNA
• 96 threads per chip: 12 SMT8 cores or 24 SMT4 cores
  – >2x threads per chip versus X86 offerings
• Each thread significantly stronger than in POWER8 due to increased HW resources
POWER9 SMT8 Core
• PowerVM Ecosystem Continuity
• Strongest Thread
• Optimized for Large Partitions
POWER9 SMT4 Core
• Linux Ecosystem Focus
• Core Count / Socket
• Virtualization Granularity

New POWER9 Core MicroArchitecture
Optimized for Cognitive Workloads and Stronger Thread Performance
• Shorter pipeline
• Increased execution bandwidth for a range of workloads including commercial, cognitive and analytics
• Sophisticated instruction scheduling & branch prediction for unoptimized code and interpretive languages
• Adaptive features for improved efficiency and performance
• Shared compute resource optimizes data-type interchange
Symmetric Engines Per Data-Type for Higher Performance on Diverse Workloads

POWER9 – Data Capacity & Throughput
Big Caches for Massively Parallel Compute and Heterogeneous Interaction
L3 Cache: 120 MB (POWER8 96 MB) Shared Capacity NUCA Cache
• 12 Regions – one per 8 threads – provide dedicated local capacity
• Cache regions share data and capacity on demand
Extreme Switching Bandwidth for the Most Demanding Compute and Accelerated Workloads
High-Throughput On-Chip Fabric
• POWER9: Over 7 TB/s On-chip Switch
• Move Data in/out at 256 GB/s per SMT8 Core
[Figure: POWER9 chip diagram with 17 Layers of Metal, eDRAM, and Processing Cores]
[Figure labels: twelve 10M L3 cache regions; 7 TB/s on-chip switch; 256 GB/s x 12 core ports; interfaces DDR, SMP, PCIe, CAPI, NVLink 2, OpenCAPI; attached Memory, POWER9, PCIe Device, IBM & Partner Devices, Nvidia GPU]

POWER – Dual Memory Subsystems
POWER9 Scale Out: Direct Attach Memory
8 Direct DDR4 Ports
• Up to 130 GB/s of sustained bandwidth
• Low latency access
• Commodity packaging form factor
• Adaptive 64B / 128B reads
POWER8 and POWER9 Scale Up: Buffered Memory
8 Buffered Channels
• Up to 230GB/s of sustained bandwidth
• Extreme capacity – up to 8TB / socket
• Superior RAS with chip kill and lane sparing
• Agnostic interface for alternate memory innovations

POWER9 Processor Modular High-speed 25 Gb/s Signaling
• Utilize Best-of-Breed 25Gbps Optical-Style Signaling Technology
• Flexible & Modular Packaging Infrastructure
[Figure labels: Power Processor, Power Processor, Coherence, OpenCAPI, App, FPGA]

16 Socket 2-Hop POWER9 Enterprise System Topology
• Horizontal Full Connect
• 4 Socket CEC
• New 25 GT/s SMP Cable: 4X Bandwidth
• Vertical Full Connect

POWER9 – Premier Acceleration Platform
• Extreme Processor / Accelerator Bandwidth and Reduced Latency
• Coherent Memory and Virtual Addressing Capability for all Accelerators
• OpenPOWER and OpenCAPI Community Enablement – Robust Accelerated Compute Options
• State of the Art I/O and Acceleration Attachment Signaling
  – PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
  – 25Gbps Link x 48 lanes – 300 GB/s duplex bandwidth
• Robust Accelerated Compute Options with OPEN standards
  – On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2
  – CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4
  – NVLink 2.0 – Next generation of GPU/CPU bandwidth and integration
  – OpenCAPI – High bandwidth, low latency and open interface using 25Gbps Link
[Figure labels: POWER9 with PowerAccel, PCIe G4 I/O, 25G Link, and On Chip Accel; attached PCIe Devices, ASIC / FPGA Devices over CAPI 2.0, Nvidia GPUs over NVLink 2.0, ASIC / FPGA Devices over OpenCAPI]

Accelerated Solution Enablement
[Figure labels: CAPI over PCIe Gen4, 128 GBps; OpenCAPI over 25G, 200 GBps; host-architecture agnostic bus coherence; Accelerators, Storage, Network, Advanced Memory (SCM)]
• POWER8: CAPI (Coherent Accelerator Processor Interface):
  – Enables coherent attach of external devices over PCIe Gen3 physicals
  – Simplifies programming model, eliminates code-path overhead of accelerator / storage / network access
POWER9:
• CAPI 2.0: PCIe Gen4 provides 4x bandwidth of POWER8
• OpenCAPI 3.0: 100% Open Interface Architecture with low-latency, high bandwidth attach (up to 200GBps)
  – Ability to connect to user-level accelerators, storage + network devices, and advanced memories

CAPI Technology Overview
Typical I/O Model Flow
• Steps: DD Call; Copy or Pin Source Data; MMIO Notify Accelerator; Acceleration; Poll / Int Completion; Copy or Unpin Result Data; Ret. From DD Completion
• Instruction counts shown for the software steps: 300, 10,000, 3,000, 1,000, 1,000
• Acceleration: Application Dependent, but Equal to below
Flow with a Coherent Model
• Steps: Shared Mem. Notify Accelerator (400 Instructions); Acceleration; Shared Memory Completion (100 Instructions)
• Acceleration: Application Dependent, but Equal to above
[Figure: CAPI FPGA containing the IBM Supplied POWER Service Layer and Functions 0, 1, 2, ... n, attached over PCIe to the CAPP in the Power Processor]

Added Advantages of Coherent Attachment Over I/O Attachment
Virtual Addressing & Data Caching
  – Shared Memory Model
  – Lower latency for highly referenced data
Easier, More Natural Programming
  – Traditional thread level programming
  – Long latency of I/O typically requires restructuring of application
Enables Applications Not Possible on I/O
  – Pointer chasing, etc…

POWER9 – Ideal for Acceleration
Extreme CPU/Accelerator Bandwidth
• PCIe Gen3 x16 (CPU to Accelerator / GPU): 1x
• PCIe Gen4 x16 (CPU to Accelerator / GPU): 2x
• POWER8 with NVLink 1.0 (CPU to GPU): 5x
• POWER9 with 25G Link, OpenCAPI 3.0 / NVLink 2.0 (CPU to Accelerator / GPU): 7-10x
Increased Performance / Features / Acceleration Opportunity
Seamless CPU/Accelerator Interaction
• Coherent memory sharing
• Enhanced virtual address translation on chip
• Data interaction
Broader Application of Heterogeneous Compute
• Designed for efficient programming models
• Accelerate complex analytic / cognitive applications
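
The headline figures quoted across these slides can be cross-checked with simple arithmetic. The sketch below is only a sanity check, not anything from the deck itself. It assumes a raw PCIe Gen4 signaling rate of 16 Gbit/s per lane per direction (an assumption brought in from outside the slides, ignoring encoding overhead), and it sums the I/O-model instruction counts exactly as they are listed in the flow figure, without claiming which count belongs to which step. All function and variable names are hypothetical.

```python
def duplex_gb_per_s(lanes: int, gbit_per_lane: float) -> float:
    """Raw duplex bandwidth in GB/s: lanes x per-lane rate x 2 directions / 8 bits per byte."""
    return lanes * gbit_per_lane * 2 / 8


# Assumption (not from the slides): PCIe Gen4 signals at 16 Gbit/s per lane per direction.
pcie_gen4_duplex = duplex_gb_per_s(lanes=48, gbit_per_lane=16.0)   # 192.0, matches "192 GB/s duplex"
link_25g_duplex = duplex_gb_per_s(lanes=48, gbit_per_lane=25.0)    # 300.0, matches "300 GB/s duplex"

# Thread count per chip for the two core variants quoted on the slides.
threads_smt8 = 12 * 8   # 96
threads_smt4 = 24 * 4   # 96

# L3 capacity: 12 regions of 10 MB each.
l3_total_mb = 12 * 10   # 120

# Sum of the per-core fabric ports (256 GB/s x 12). This is only the core-side
# share; the slides quote "over 7 TB/s" for the on-chip switch as a whole.
core_ports_gb_per_s = 12 * 256   # 3072

# Software overhead per accelerator call, using the counts as listed in the figure.
io_model_instructions = sum([300, 10_000, 3_000, 1_000, 1_000])   # 15300
coherent_model_instructions = 400 + 100                           # 500

if __name__ == "__main__":
    print(f"PCIe Gen4 x48 duplex: {pcie_gen4_duplex} GB/s")
    print(f"25G link x48 duplex:  {link_25g_duplex} GB/s")
    print(f"Threads per chip:     {threads_smt8} (SMT8) / {threads_smt4} (SMT4)")
    print(f"L3 total:             {l3_total_mb} MB")
    print(f"Core fabric ports:    {core_ports_gb_per_s} GB/s")
    print(f"I/O vs coherent overhead: {io_model_instructions} vs "
          f"{coherent_model_instructions} instructions")
```

On these numbers the coherent flow spends roughly one thirtieth of the listed software instructions of the typical I/O flow, with the acceleration time itself unchanged in both cases.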
