POWER Processor

Technology Overview

Myron Slota POWER Systems, IBM Systems

© 2017 IBM Corporation

Quarter Century of POWER
Legacy of Leadership Innovation Driving Client Value

[Figure: timeline, 1990 to 2015, of processor generations and process technology (1.0um down to 22nm). POWER line: POWER1, RSC, POWER2, P2SC, POWER3 -630, POWER4/4+, POWER5/5+, POWER6, POWER7/7+, POWER8. Business line: Cobra A10, Muskie A35, RS64I Apache, RS64II North Star, RS64III Pulsar, RS64IV Sstar. PC line: 601, 603, 604e. One stretch of the timeline is labeled "Modern UNIX Era".]

IBM Optimized Semiconductor Technology

World-class technology with value-added features for server business. POWER9 is built on 14nm finFET technology transitioned to GlobalFoundries.

17-layer copper wire

On-chip eDRAM (14nm)
- Faster, less noise
- 6x latency improvement
- No off-chip signaling required
- 8x bandwidth improvement
- 3x less area than SRAM
- 5x less energy than SRAM

Dense interconnect
- Faster connections
- Low latency distance paths
- High density complex circuits
- 2X wire per transistor

“IBM is committed to meeting the rising demands of cognitive systems and . GF’s leading performance in 7LP process technology, reflecting our joint Research collaboration, will allow IBM Power and Mainframe systems to push beyond limitations to provide high-performance computing solutions while aggressively pursuing 5nm to advance our leadership for years to come.” Tom Rosamilia, Senior Vice President, IBM Systems

Recent and Future POWER Processor Roadmap

• POWER7: 45 nm, 2010
• POWER7+: 32 nm, 2012
• POWER8 Family: 22nm, 2014 – 2016
• POWER9 Family (SO, SU): 14nm, 2H17 – 2H18+
• POWER10 Family: 2020+

Roadmap annotations:
• New uArch/Tech → System roll-out
• Spin in new Tech → System roll-out
• New uArch/Tech, multiple optimized
• Enterprise
• Enterprise + Scale-out / OpenPOWER
• Scale-out / OpenPOWER, Accelerated / OpenPOWER
• Enterprise focused; IBM AIX, IBM i platforms; General Purpose, IBM only
• Significant / Open / Partner focus; Cloud delivery models, Accelerators
• Highly Optimized (Cognitive / Analytics)

POWER Processor Technology

Powerful Cores
• Aggressive OOO design with 4 or 8 threads
• Optimized for wide range of algorithms

Robust Scaling
• Large NUCA L3 architecture, eDRAM
• Up to 8 memory channels per socket
• SMP interconnect and on chip switching

Advanced Virtualization
• Coarse or fine grained VM per core
• Advanced features for QoS

Leadership Hardware Acceleration Platform
• Coherent Accelerator Processor Interface provides reduced latency and high BW
• Robust Accelerated computing roadmap supported by OpenPOWER partners

POWER8
• 22 nm SOI technology
• 12 x SMT8 cores
• PCIe G3, up to 48 lanes per package
• CAPI 1.0
• POWER8 w/ NVLINK: GPU NVLink 1.0

POWER9
• 14 nm finFET technology
• 24 x SMT4 or 12 x SMT8 cores
• PCIe G4, 48 lanes per socket
• CAPI 2.0
• 25Gbps Link, 48 lanes for acceleration
• OpenCAPI 3.0
• GPU NVLink 2.0

POWER9 Processor Chipset: 4 Targeted Deployments

Core Count / Size
• SMT4 Core: 24 SMT4 Cores / Chip, Linux Ecosystem Optimized
• SMT8 Core: 12 SMT8 Cores / Chip, PowerVM Ecosystem Focus

SMP / Memory
• Scale-Out – 2 Socket Optimized
  – Robust 2 socket SMP system
  – Direct Memory Attach: up to 8 DDR4 ports, commodity packaging form factor
• Scale-Up – 16-Socket Optimized
  – Scalable System Topology / Capacity: large multi-socket
  – Buffered Memory Attach: 8 buffered channels

Each of the four deployments is shown with OpenCAPI.

POWER9: Improved Per Performance with SMT4 or SMT8 Cores

• ‘Modular Execution’ enables SMT4 and SMT8 cores to be efficiently built from the same DNA
• 96 threads per chip: 12 SMT8 cores or 24 SMT4 cores
  – >2x threads per chip versus offerings
• Each thread significantly stronger than in POWER8 due to increased HW resources

POWER9 SMT8 Core
• PowerVM Ecosystem Continuity
• Strongest Thread
• Optimized for Large Partitions

POWER9 SMT4 Core
• Linux Ecosystem Focus
• Core Count / Socket
• Virtualization Granularity
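The "96 threads per chip" figure follows directly from the two core configurations on this slide. A minimal sketch of that arithmetic is below; the dictionary and function names are illustrative only and do not come from the deck.

```python
# Core configurations as stated on the slide: (cores per chip, SMT threads per core).
POWER9_CONFIGS = {
    "SMT8": (12, 8),
    "SMT4": (24, 4),
}


def threads_per_chip(cores: int, smt_width: int) -> int:
    """Hardware threads per chip = cores x threads per core."""
    return cores * smt_width


totals = {name: threads_per_chip(*cfg) for name, cfg in POWER9_CONFIGS.items()}

for name, total in totals.items():
    print(f"{name}: {total} threads per chip")
```

Both configurations land on the same thread count, which is consistent with the slide's point that the two cores are built from the same modular design and differ in how threads are grouped into cores.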

New POWER9 Core Optimized for Cognitive Workloads and Stronger Thread Performance

• Shorter
• Increased execution bandwidth for a range of workloads including commercial, cognitive and analytics
• Sophisticated instruction scheduling & branch prediction for unoptimized code and interpretive languages
• Adaptive features for improved efficiency and performance
• Shared compute resource optimizes data-type interchange

Symmetric Engines Per Data-Type for Higher Performance on Diverse Workloads

POWER9 – Data Capacity & Throughput

Big Caches for Massively Parallel Compute and Heterogeneous Interaction

L3 Cache: 120 MB Shared Capacity NUCA Cache (POWER8 96 MB)
• 12 Regions – one per 8 threads – provide dedicated local capacity
• Cache regions data and capacity on demand
• 17 Layers of Metal, eDRAM

Extreme Switching Bandwidth for the Most Demanding Compute and Accelerated Workloads

High-Throughput On-Chip Fabric
• POWER9: Over 7 TB/s On-chip Switch
• Move Data in/out at 256 GB/s per SMT8 Core

[Figure: POWER9 processing cores above twelve L3 regions labeled 10M each, joined by the 7 TB/s on-chip switch (256 GB/s x 12 to the cores). Links labeled DDR, SMP, PCIe, CAPI, NVLink 2 and OpenCAPI connect to memory, other POWER9 chips, PCIe devices, NVLink GPUs, and IBM & partner devices.]
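The cache and fabric numbers on this slide can be cross-checked with simple arithmetic. The sketch below assumes the twelve "10M" figure labels mean 10 MB per L3 region, which is an interpretation of the diagram rather than something the slide states outright; all names are illustrative.

```python
# Values read from the slide; MB_PER_REGION is an assumed reading of the "10M" labels.
L3_REGIONS = 12
MB_PER_REGION = 10
THREADS_PER_REGION = 8
SMT8_CORES = 12
GB_PER_S_PER_SMT8_CORE = 256

l3_total_mb = L3_REGIONS * MB_PER_REGION
threads_total = L3_REGIONS * THREADS_PER_REGION
core_fabric_gb_per_s = SMT8_CORES * GB_PER_S_PER_SMT8_CORE

print(f"L3 capacity: {l3_total_mb} MB")
print(f"Threads served: {threads_total}")
print(f"Aggregate core bandwidth: {core_fabric_gb_per_s} GB/s")
```

Under that assumption the regions sum to the quoted 120 MB, and one region per 8 threads covers 96 threads. The aggregate core figure (256 GB/s x 12) is only the core-facing share of the on-chip switch, whose quoted total is over 7 TB/s.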

POWER – Dual Memory Subsystems

POWER8 and POWER9 Scale Out POWER9 Scale Up Direct Attach Memory Buffered Memory

8 Direct DDR4 Ports
• Up to 130 GB/s of sustained bandwidth
• Low latency access
• Commodity packaging form factor
• Adaptive 64B / 128B reads

8 Buffered Channels
• Up to 230GB/s of sustained bandwidth
• Extreme capacity – up to 8TB / socket
• Superior RAS with chip kill and lane sparing
• Agnostic interface for alternate memory innovations
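To compare the two memory subsystems per channel, the quoted sustained bandwidths can be divided across the eight ports or channels. This is a rough sketch that assumes bandwidth is spread evenly across channels, which the slide does not claim; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySubsystem:
    """Slide figures for one memory attach option (names are illustrative)."""
    name: str
    channels: int
    sustained_gb_per_s: float

    def per_channel_gb_per_s(self) -> float:
        # Assumes an even split across channels; real traffic may not be even.
        return self.sustained_gb_per_s / self.channels


direct = MemorySubsystem("Direct attach, 8 DDR4 ports", 8, 130.0)
buffered = MemorySubsystem("Buffered, 8 channels", 8, 230.0)

for subsystem in (direct, buffered):
    print(f"{subsystem.name}: {subsystem.per_channel_gb_per_s():.2f} GB/s per channel")
```

With these inputs the even split gives 16.25 GB/s per direct-attach port and 28.75 GB/s per buffered channel.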

POWER9 Processor Modular High-speed 25 Gb/s Signaling

• Utilize Best-of-Breed 25Gbps Optical-Style Signaling Technology
• Flexible & Modular Packaging Infrastructure

[Figure: Power Processor with coherence logic attached over OpenCAPI to an application (App) on an FPGA.]

16 Socket 2-Hop POWER9 Enterprise System Topology

• 4 Socket CEC
• Horizontal Full Connect
• Vertical Full Connect
• New 25 GT/s SMP Cable: 4X Bandwidth
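One way to see why this topology is called "2-Hop" is to model it as a graph and measure the worst-case distance between sockets. The sketch below assumes that "Horizontal Full Connect" links every pair of sockets within a 4-socket CEC and that "Vertical Full Connect" links sockets in the same position across the four CECs. That reading of the diagram is an assumption, and all names are illustrative.

```python
from collections import deque
from itertools import product

CECS = 4             # four 4-socket CECs make 16 sockets
SOCKETS_PER_CEC = 4


def build_topology():
    """Return an adjacency map keyed by (cec, position) under the assumed wiring."""
    nodes = list(product(range(CECS), range(SOCKETS_PER_CEC)))
    links = {node: set() for node in nodes}
    for a, b in product(nodes, nodes):
        if a == b:
            continue
        same_cec = a[0] == b[0]        # horizontal full connect (assumed)
        same_position = a[1] == b[1]   # vertical full connect (assumed)
        if same_cec or same_position:
            links[a].add(b)
    return links


def hops_from(links, start):
    """Breadth-first search giving the hop count from start to every socket."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


links = build_topology()
max_hops = max(max(hops_from(links, node).values()) for node in links)
links_per_socket = {len(neighbors) for neighbors in links.values()}

print(f"sockets: {len(links)}, links per socket: {links_per_socket}, max hops: {max_hops}")
```

Under these assumptions every socket has six SMP links (three horizontal, three vertical), and any socket reaches any other in at most two hops: one within its CEC and one across CECs.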

POWER9 – Premier Acceleration Platform

• Extreme Processor / Accelerator Bandwidth and Reduced Latency
• Coherent Memory and Virtual Addressing Capability for all Accelerators
• OpenPOWER and OpenCAPI Community Enablement – Robust Accelerated Compute Options
• State of the Art I/O and Acceleration Attachment Signaling
  – PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
  – 25Gbps Link x 48 lanes – 300 GB/s duplex bandwidth
• Robust Accelerated Compute Options with OPEN standards
  – On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2
  – CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4
  – NVLink 2.0 – Next generation of GPU/CPU bandwidth and integration
  – OpenCAPI – High bandwidth, low latency and open interface using 25Gbps Link

[Figure: POWER9 with PowerAccel. PCIe G4 I/O attaches PCIe devices; CAPI 2.0 over PCIe G4 attaches ASIC / FPGA devices; NVLink 2.0 over the 25G link attaches Nvidia GPUs; OpenCAPI over the 25G link attaches ASIC / FPGA devices; on-chip accelerators sit on the processor itself.]
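The two duplex bandwidth figures on this slide can be reproduced from lane count and per-lane signaling rate. The sketch below assumes the slide uses raw signaling rates with no encoding or protocol overhead, and it takes 16 GT/s per lane for PCIe Gen4 from the PCIe standard rather than from the slide; the function name is illustrative.

```python
def duplex_gb_per_s(lanes: int, gbit_per_s_per_lane: float) -> float:
    """Raw duplex bandwidth in GB/s: both directions, no encoding overhead (assumed)."""
    per_direction_gb_per_s = lanes * gbit_per_s_per_lane / 8  # 8 bits per byte
    return 2 * per_direction_gb_per_s


PCIE_GEN4_GBIT_PER_LANE = 16   # PCIe Gen4 signaling rate, not stated on the slide
LINK_25G_GBIT_PER_LANE = 25    # "25Gbps Link" from the slide
LANES = 48

pcie_gen4_duplex = duplex_gb_per_s(LANES, PCIE_GEN4_GBIT_PER_LANE)
link_25g_duplex = duplex_gb_per_s(LANES, LINK_25G_GBIT_PER_LANE)

print(f"PCIe Gen4 x48: {pcie_gen4_duplex:.0f} GB/s duplex")
print(f"25G link x48:  {link_25g_duplex:.0f} GB/s duplex")
```

With those inputs the arithmetic matches the slide's 192 GB/s and 300 GB/s. Delivered throughput on real hardware would be lower once encoding and protocol overhead are counted.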

Accelerated Solution Enablement

[Figure: accelerators, storage, network and advanced memory (SCM) devices attached to the processor. Labels include CAPI, PCIe Gen4, OpenCAPI, 25G, host-architecture-agnostic coherence bus, 128 GBps and 200 GBps.]

POWER8:
• CAPI (Coherent Accelerator Processor Interface):
  – Enables coherent attach of external devices over PCIe Gen3 physicals
  – Simplifies programming model, eliminates code-path overhead of accelerator / storage / network access

POWER9:
• CAPI 2.0: PCIe Gen4 provides 4x bandwidth of POWER8
• OpenCAPI 3.0: 100% Open Interface Architecture with low-latency, high bandwidth attach (up to 200GBps)
  – Ability to connect to user-level accelerators, storage + network devices, and advanced memories

CAPI Technology Overview

Typical I/O Model Flow
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion
Instruction counts shown for the software steps: 300, 10,000, 3,000, 1,000 and 1,000 instructions. Acceleration: application dependent, but equal to below.

Flow with a Coherent Model
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
Instruction counts shown for the software steps: 400 and 100 instructions. Acceleration: application dependent, but equal to above.

[Figure: CAPI FPGA containing the IBM Supplied POWER Service Layer and accelerator functions 0, 1, 2 ... n, attached over PCIe to the CAPP on the Power Processor.]
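The software overhead of the two flows can be compared by summing the instruction counts shown on the slide. The sketch below uses only the counts that appear in the text; the extraction does not preserve which count belongs to which step, so only the totals are computed. The acceleration step itself is excluded because the slide describes it as application dependent and equal in both flows. Variable names are illustrative.

```python
# Instruction counts as they appear on the slide, in the order shown.
IO_MODEL_INSTRUCTIONS = [300, 10_000, 3_000, 1_000, 1_000]
COHERENT_MODEL_INSTRUCTIONS = [400, 100]

io_total = sum(IO_MODEL_INSTRUCTIONS)
coherent_total = sum(COHERENT_MODEL_INSTRUCTIONS)
overhead_ratio = io_total / coherent_total

print(f"Typical I/O model overhead: {io_total} instructions")
print(f"Coherent model overhead:    {coherent_total} instructions")
print(f"Ratio: {overhead_ratio:.1f}x")
```

On these figures the coherent flow needs about 500 instructions of software overhead per call against roughly 15,300 for the I/O flow, which is the code-path reduction the slide attributes to coherent attachment.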

Added Advantages of Coherent Attachment Over I/O Attachment

Virtual Addressing & Data Caching
– Shared Memory Model
– Lower latency for highly referenced data

Easier, More Natural Programming
– Traditional thread level programming
– Long latency of I/O typically requires restructuring of application

Enables Applications Not Possible on I/O
– Pointer chasing, etc…

POWER9 – Ideal for Acceleration

Extreme CPU/Accelerator Bandwidth

[Figure: relative CPU to accelerator/GPU bandwidth. PCIe Gen3 x16: 1x. PCIe Gen4 x16: 2x. POWER8 with NVLink 1.0: 5x. POWER9 with 25G Link (OpenCAPI 3.0, NVLink 2.0): 7-10x. Arrow label: Increased Performance / Features / Acceleration Opportunity.]

Seamless CPU/Accelerator Interaction
• Coherent memory sharing
• Enhanced virtual address translation on chip
• Data interaction with reduced SW & HW overhead

Broader Application of Heterogeneous Compute
• Designed for efficient programming models
• Accelerate complex analytic / cognitive applications
• Lower latency, higher bandwidth

POWER9 Processor Open Innovation Interfaces: OpenCAPI

[Figure: side-by-side comparison. CAPI 1.0, 2.0 Architecture: a POWER chip (core, memory, I/O, L2/L3 cache, CAPP) attaches over PCIe to a CAPI-accelerated chip that holds the PSL (POWER-specific coherence IP), the architected programming interface and the accelerated application. OpenCAPI Architecture: the POWER-specific coherence IP (CAPP, PSL) sits on the POWER chip, which attaches over the 25G link to a chip holding the architected programming interface and the accelerated application.]

Open Industry Coherent Attach
- Latency / Bandwidth Improvement
- Removes Overhead from Attach Silicon
- Eliminates “Von-Neumann Bottleneck”
- FPGA / Parallel Compute Optimized
- Network/Memory/Storage Innovation

POWER ISA Version 3.0: POWER Architecture at a Glance (new for POWER9 in blue)

Broad data type support & engines
• SIMD architecture and evolving instruction set (dozens of new instructions on POWER9)
  – 128b native SIMD, doubleword, word, halfword,
  – Half-precision float conversion
• Decimal float, BCD, 128b IEEE 754 Quad Precision Float, 128b Quad Precision Fixed Point and Decimal Integer
• Random Number Generation Instruction

Memory Management
• Hashed Page Table: 4k, 64k, 16M, 16G pages
• Radix Page Table: 4k, 64k, 2M, 1G pages, with full virtualization → Reduce the Translation Lookaside Buffer (TLB)
• Little and big endian data handling
• Memory Atomics
• Weak ordered memory model
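Page size determines how many translations are needed to cover a given amount of memory, which is why the supported page sizes matter for TLB pressure. The sketch below counts the pages needed to map a 1 GiB region with each radix page size listed above. It assumes binary units (4k = 4 KiB and so on), and the 1 GiB region is an arbitrary example rather than a figure from the slide.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Radix page sizes listed on the slide, interpreted as binary units (assumption).
RADIX_PAGE_SIZES = {"4k": 4 * KIB, "64k": 64 * KIB, "2M": 2 * MIB, "1G": 1 * GIB}


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover a region, rounding up."""
    return -(-region_bytes // page_bytes)


REGION_BYTES = 1 * GIB  # example working set, chosen for illustration

pages_for_region = {
    label: pages_needed(REGION_BYTES, size) for label, size in RADIX_PAGE_SIZES.items()
}

for label, count in pages_for_region.items():
    print(f"{label:>3} pages: {count} translations to cover 1 GiB")
```

Each page needs its own translation, so moving from 4k to 2M pages cuts the number of distinct translations for this region from 262,144 to 512, and a single 1G page covers it entirely.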

Cloud and Accelerator Optimization
• New Architecture with automated partition routing
• User access to virtualized acceleration features
• QoS controls

Energy & Frequency Management
• Energy management instructions
• Workload Optimized Frequency
  – Manage energy between threads and cores with reduced wakeup latency

Thank you

Myron Slota
IBM Development
mslota@us..com

Additional content:
• OpenCAPI
• Systems Overview
• Large System SMP Topology
• Core Microarchitecture
• POWER Porting

POWER Systems Portfolio

POWER Systems

Enterprise Systems (4 socket to 16 sockets)
• 4U 4 socket or modular 5U (1-4)
• Up to 1536 HW threads
• Up to 32-64 TB of system memory
• 3.6 to 4.3 GHz processor speed

Scale out Systems (1-2 sockets)
• 1U to 4U
• Up to 192 HW threads
• Up to 16 TB of system memory
• 2.0 to 4.2 GHz processor speed

POWER Features

Leading Scale & Capacity vs. x86
• Large HW thread pool
• Large on-chip caches
• Memory bandwidth and capacity
• SMP
• Accelerated computing interconnect

Robust Enterprise Capability vs. x86
• Reliability, Availability, Serviceability (RAS)
• PowerVM system management
• Capacity on demand and high availability fail-over

Optimized for Open Price / Performance
• Ubuntu, SUSE, Red Hat, OpenStack, PowerKVM
• Linux only system offerings
• Open/Industry designs + stacks
• OpenPOWER Foundation

[Figure labels: Scale, Reliability, Virtualization Optimized, Cluster Optimized.]

POWER9 Core Execution Slice Microarchitecture

Modular Execution Slices

[Figure: the 64b execution slice (ISU, VSU, LSU, DW) is the building block. Two slices form a 128b super-slice. The POWER9 SMT4 core is built from 2 x 128b super-slices and the POWER9 SMT8 core from 4 x 128b super-slices, shown next to the POWER8 SMT8 core with its IFU, ISU, FXU, VSU, DFU and LSU blocks.]

Re-factored Core Provides Improved Efficiency & Workload Alignment
• Enhanced pipeline efficiency with modular execution and intelligent pipeline control
• Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
• Shared compute resource optimizes data-type interchange

Large Memory SMP Consolidation

Huge “In-memory” Analytics
• Consolidate cluster of servers onto a single image SMP – removing network bottleneck
• Ideal linear scalability up to 16-sockets / 192 cores on Analytic workloads with DB2 BLU
• SAP HANA
• Spark

[Figure: 32-48 way drawers combined into a 128-192 way SMP system; link bandwidths labeled 76.8 GB/s and 25.6 GB/s.]

POWER9 – Core Compute

Symmetric Engines Per Data-Type for Higher Performance on Diverse Workloads

SMT4 Core Resources

Fetch / Branch
• 32kB, 8-way Instruction Cache
• 8 fetch, 6 decode
• 1x branch execution

Slices issue VSU and AGEN
• 4x scalar-64b / 2x vector-128b
• 4x load/store AGEN

Vector Scalar Unit (VSU) Pipes
• 4x ALU + Simple (64b)
• 4x FP + FX-MUL + Complex (64b)
• 2x Permute (128b)
• 2x Quad Fixed (128b)
• 2x Fixed Divide (64b)
• 1x Quad FP & Decimal FP
• 1x Cryptography

Load Store Unit (LSU) Slices
• 32kB, 8-way Data Cache
• Up to 4 DW load or store

[Figure: SMT4 core pipeline. Predecode, L1 instruction cache, IBUF, decode / crack (x8 fetch, x6 decode) and branch prediction feed dispatch (allocate / rename) with the instruction completion table. A branch slice (BRU) and slices 0 to 3 each hold ALU, AGEN, XS, FP, MUL, XC and ST-D units, with shared CRYPT, PM, QFX, QP/DFU and DIV units per 128b super-slice. Below sit L1D$ 0 to 3, LRQ 0/1 and 2/3, and SRQ 0 to 3.]

Efficient Cores Deliver 2x Compute Resource per Socket

Special notices

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.
IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment. Revised September 26, 2006

Special notices (continued)

IBM, the IBM logo, ibm.com, AIX, AIX (logo), IBM , DB2 Universal Database, POWER, PowerLinux, PowerVM, PowerVM (logo), PowerHA, Power Architecture, Power Family, POWER , Power Systems, Power Systems (logo), POWER2, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, POWER7, POWER7+, and POWER8 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

NVIDIA, the NVIDIA logo, and NVLink are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries or both. PowerLinux™ uses the registered trademark Linux® pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the Linux® mark on a world-wide basis. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org. The OpenPOWER word mark and the OpenPOWER Logo mark, and related marks, are trademarks and service marks licensed by OpenPOWER.

Other company, product and service names may be trademarks or service marks of others.
