POWER10 Processor Chip

Technology and Packaging:
- 602 mm2 7nm Samsung process (18B devices)
- 18-layer metal stack, enhanced device
- Single-chip or dual-chip sockets

Computational Capabilities:
- Up to 15 SMT8 cores (2 MB L2 cache per core), up to 120 simultaneous hardware threads
- Up to 120 MB L3 cache (low-latency NUCA management)
- 3x energy efficiency relative to POWER9
- Enterprise thread-strength optimizations
- AI- and security-focused ISA additions
- 2x general, 4x matrix SIMD relative to POWER9
- EA-tagged L1 cache, 4x MMU relative to POWER9

Open Memory Interface:
- 16 x8 links at up to 32 GT/s (1 TB/s)
- Technology-agnostic support: near/main/storage tiers
- Minimal (< 10 ns) latency adder vs. direct-attach DDR

[Die photo courtesy of Samsung Foundry: 16 SMT8 cores with 2 MB L2 each, local 8 MB L3 regions in two 64 MB L3 hemispheres, 8x8 OMI memory signaling, PowerAXON x32+4 corners, PCIe Gen 5 x16 signaling]

PowerAXON Interface:
- 16 x8 links at up to 32 GT/s (1 TB/s)
- SMP interconnect for up to 16 sockets
- OpenCAPI attach for memory, accelerators, I/O
- Integrated clustering (memory semantics)

PCIe Gen 5 Interface:
- x64 per DCM at up to 32 GT/s

Socket Composability: SCM & DCM

Single-Chip Module (up to 16 SCM sockets, horizontal and vertical full connect):
- 602 mm2 7nm (18B devices)
- Focus: core/thread strength; capacity & bandwidth per compute
- Up to 15 SMT8 cores (4+ GHz)
- Memory: x128 @ 32 GT/s
- SMP/Cluster/Accel: x128 @ 32 GT/s
- I/O: x32 PCIe G5
- System scale (broad range): 1 to 16 sockets

Dual-Chip Module (up to 4 DCM sockets):
- 1204 mm2 7nm (36B devices)
- Focus: throughput per socket; compute & I/O density
- Up to 30 SMT8 cores (3.5+ GHz)
- Memory: x128 @ 32 GT/s
- SMP/Cluster/Accel: x192 @ 32 GT/s
- I/O: x64 PCIe G5
- 1 to 4 sockets
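The headline interface and thread numbers are simple products, and can be sanity-checked. The sketch below assumes the quoted 1 TB/s counts both link directions and that each lane carries roughly 1 Gb/s per GT/s (both common conventions, not stated on the slide):

```python
# Sanity check of the headline numbers (assumptions: "1 TB/s" is the
# bidirectional aggregate, and 1 GT/s ~ 1 Gb/s per lane).
def aggregate_gbs(ports: int, lanes_per_port: int, gts: float) -> float:
    """Aggregate bidirectional bandwidth in GB/s for a group of links."""
    lanes = ports * lanes_per_port
    per_direction = lanes * gts / 8   # GB/s in one direction
    return per_direction * 2          # both directions

omi = aggregate_gbs(ports=16, lanes_per_port=8, gts=32)   # 16 x8 @ 32 GT/s
axon = aggregate_gbs(ports=16, lanes_per_port=8, gts=32)  # same for PowerAXON

threads = 15 * 8   # up to 15 SMT8 cores -> 120 hardware threads
```

Both interface groups come out at 1024 GB/s, i.e. the quoted 1 TB/s each.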

(Multi-socket configurations show processor capability only, and do not imply system product offerings)

System Composability: PowerAXON & Open Memory Interfaces

- Multi-protocol “Swiss-army-knife”: flexible / modular interfaces
- POWER10 chip: 1 TB/s PowerAXON plus 1 TB/s OMI memory bandwidth
- PowerAXON corners: 4x8 @ 32 GT/s; OMI edges: 8x8 @ 32 GT/s
- Built on best-of-breed low-power, low-latency, high-bandwidth signaling technology
- 6x bandwidth per mm2 compared to DDR4 signaling

Enterprise Scale and Bandwidth: SMP & Main Memory

- Multi-protocol “Swiss-army-knife”: flexible / modular interfaces; allocate the bandwidth however you need to use it
- Robustly scalable: “glueless” SMP with high bisection bandwidth, building up to 16 SCM sockets
- Initial offering: up to 4 TB per socket of OMI DRAM memory; 410 GB/s peak bandwidth (Microchip DDR4 buffer); < 10 ns latency adder
- DIMM-swap upgradeable: DDR5 OMI DRAM memory with higher bandwidth and higher capacity, on the same best-of-breed low-power, low-latency, high-bandwidth signaling
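The 410 GB/s peak figure is consistent with commodity DIMM rates behind the 16 OMI channels. A rough check, assuming DDR4-3200-class DIMMs on a 64-bit DRAM interface (the slide names the buffer but not the DIMM speed):

```python
# Rough reconstruction of the "410 GB/s peak" number (assumption:
# DDR4-3200 DIMMs, one 8-byte-wide DRAM channel per OMI channel).
mt_per_s = 3200          # DDR4-3200 transfer rate, MT/s
bytes_per_transfer = 8   # 64-bit DRAM data bus
channels = 16            # 8x8 OMI on two edges -> 16 memory channels

per_channel = mt_per_s * bytes_per_transfer / 1000   # 25.6 GB/s per channel
peak = per_channel * channels                        # ~409.6 GB/s per socket
```

16 channels at 25.6 GB/s each gives 409.6 GB/s, which rounds to the quoted 410 GB/s.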

(PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings)

Data Plane Bandwidth and Capacity: Open Memory Interfaces

OMI-attached GDDR DIMMs can provide a low-capacity, high-bandwidth alternative to HBM, without HBM's packaging rigidity and cost (up to 800 GB/s sustained).


OMI-attached storage-class memory can provide high-capacity, encrypted, persistent memory in a DIMM slot. (POWER10 systems can support 2 petabytes of addressable load/store memory.)

System Heterogeneity and Data Plane Capacity: OpenCAPI

OpenCAPI attaches FPGA- or ASIC-based accelerators to a POWER10 host with high bandwidth and low latency.

[Diagram: OpenCAPI-attached accelerator (ASIC or FPGA running an accelerated app) and OpenCAPI-attached SCM on the PowerAXON links, alongside OMI main-tier DRAM]

OpenCAPI-attached storage-class memory can provide high-capacity, encrypted, persistent memory in a device form factor. (POWER10 systems can support 2 petabytes of addressable load/store memory.)

Pod Composability: PowerAXON Memory Clustering

The Memory Inception capability enables a system to map another system's memory as its own. Multiple systems can be clustered, sharing each other's memory.


Memory Clustering: Distributed Memory Disaggregation and Sharing

Use case: share load/store memory amongst directly connected neighbors within a pod.

Unlike other schemes, memory can be used:
- As low-latency local memory
- As NUMA-latency remote memory

Example: pod of 8 systems, each with 8 TB
- Workload A requirement: 4 TB low latency
- Workload B requirement: 24 TB relaxed latency
- Workload C requirement: 8 TB low latency plus 16 TB relaxed latency
- All requirements are met by the configuration shown

The POWER10 2-petabyte memory size enables much larger configurations.
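The pod example above is easy to check arithmetically. The sketch below just totals supply and demand (the workload names come from the slide; the dictionary layout is illustrative, not a placement algorithm):

```python
# Feasibility check for the example pod: 8 systems x 8 TB each.
pod_tb = {f"sys{i}": 8 for i in range(8)}        # 64 TB total in the pod

demands_tb = {
    "A": {"low_latency": 4,  "relaxed": 0},
    "B": {"low_latency": 0,  "relaxed": 24},
    "C": {"low_latency": 8,  "relaxed": 16},
}

total = sum(pod_tb.values())                     # 64 TB available
needed = sum(d["low_latency"] + d["relaxed"] for d in demands_tb.values())

# 52 TB demanded of 64 TB available, so all three workloads fit,
# with low-latency portions served from each workload's home system.
feasible = needed <= total
```

The pod has 12 TB of headroom beyond the 52 TB demanded, which is what lets the shown configuration satisfy all three workloads at once.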

(Memory cluster configurations show processor capability only, and do not imply system product offerings)

Memory Clustering: Enterprise-Scale Memory Sharing

- Pod of large enterprise systems: distributed sharing at petabyte scale
- Or hub-and-spoke, with a memory server and memory-less compute nodes

Memory Clustering: Pod-Level Clustering

Use case: low-latency, high-bandwidth messaging scaling to 1000s of nodes.

Leverage the 2-petabyte addressability to create a memory window into each destination for messaging mailboxes: a sender performs load/store writes directly into the receiver's window.
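The mailbox scheme amounts to carving fixed slots out of one large load/store address space. The sketch below models that addressing; the window base, mailbox size, and node count are illustrative choices of mine, not the POWER10 firmware layout:

```python
# Illustrative mailbox addressing within a large load/store window.
MAILBOX_BYTES = 4096      # one page-sized mailbox per sender (assumption)
NODES = 1024              # "1000's of nodes"

def mailbox_addr(window_base: int, sender_id: int) -> int:
    """Address where `sender_id` writes inside one destination's window."""
    return window_base + sender_id * MAILBOX_BYTES

# Each destination exposes a window large enough for every potential sender.
window_span = NODES * MAILBOX_BYTES          # 4 MiB per destination
addressable = 2 * 1024**5                    # 2 PB of load/store space

# Windows to every peer occupy a tiny fraction of the addressable range.
fits = NODES * window_span <= addressable
```

Even with a window into all 1024 peers, the mailbox regions consume about 4 GiB, a negligible slice of the 2 PB address space, which is the point of the slide.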

System Composability: PCIe Gen 5 Industry I/O Attach

[Diagram: PCIe Gen 5 attach alongside PowerAXON (OpenCAPI accelerator and SCM) and OMI main-tier DRAM]

POWER10 General-Purpose Socket Performance Gains

[Chart: relative socket performance, POWER9 baseline = 1 (scale to 4), for POWER10 integer, POWER10 enterprise, POWER10 floating point, and POWER10 memory streaming; the memory-streaming bar is shown for both DDR4 OMI and DDR5 OMI memory]

(Performance assessments based upon pre-silicon engineering analysis of POWER10 dual-socket server offering vs POWER9 dual-socket server offering)

Powerful Core = Enterprise Strength + AI Infused

New Enterprise Micro-architecture:
- Flexibility: 1-2 POWER10 chips per socket; up to 8 threads per core / 240 per socket (up to 30 SMT8 cores or up to 60 SMT4 cores)
- Optimized for performance and efficiency
- +30% average core performance*
- +20% average single-thread performance*
- 2.6x core performance efficiency* (3x at the socket)

AI Infused:
- 4x matrix SIMD acceleration*
- 2x bandwidth and 2x general SIMD*
- 4x L2 cache capacity with improved thread isolation*
- New ISA with AI data-types: prefix and fusion, enhanced SIMD and branch control, 2x load / 2x store, 4x MMU

* versus POWER9

Powerful Core: Enterprise Flexibility and Hardware-Based Workload Balance

Multiple world-class software stacks:
- PowerVM, KVM
- AIX, IBM i, Linux on Power, OpenShift
- Resilience and full-stack integrity

Automatic thread resource balancing:
[Diagram: SMT8 core (instruction cache, instruction buffer, eight 128b execution slices with VSUs, MMAs, and QP/DF units, eight 32B load/store ports, L1 data cache) shown running 1-2 threads per half-core vs. 2 threads per slice pair with 8 threads active]

Partition flexibility and security:
- Full-core-level LPARs
- Thread-based LPAR scheduling
- NEW with the PowerVM hypervisor: nested KVM on PowerVM; hardware-assisted container/VM isolation

Powerful Architecture: AI Infused and Future Ready

POWER10 implements Power ISA v3.1
- v3.1 is the latest open Power ISA contributed to the OpenPOWER Foundation: royalty-free and inclusive of patents for compliant designs

POWER10 Architecture: Feature Highlights

Prefix Architecture (RISC-friendly 8B instructions, including modified and new opcode forms):
- Greatly expanded opcode space, pc-relative addressing, MMA masking, etc.
- Format: a 4B prefix word (PO=1) followed by a 4B suffix word

New Instructions and Datatypes:
- New scalar instructions for control flow and operation symmetry: set-Boolean extensions; quad-precision extensions; 128b integer extensions; test LSB by byte; byte-reverse GPR; integer mul/div modulo; string isolate/clear; pause, wait-reserve
- New SIMD instructions for AI, throughput, and data manipulation: 32-byte load/store vector-pair; MMA (matrix math assist) with reduced precision; bfloat16 converts; permute variations (extract, insert, splat, blend); compress/expand assist; mask generation; bit manipulation
- Storage management: persistent-memory barrier/flush; store sync; translation extensions
- Debug: PMU sampling and filtering; debug watchpoints; tracing

Advanced System Features and Ease of Use:
- Hot/cold page tracking: recording for memory management
- Copy/Paste extensions: memory movement; continued on-chip acceleration (gzip, 842 compression, AES/SHA)
- Advanced EnergyScale: adaptive power management; additional performance boost across the operating range
- Security for Cloud: transparent isolation and security for enterprise cloud workloads; nested virtualization with KVM on PowerVM; secure containers; main-memory encryption; dynamic execution control; secure PMU

Security: End-to-End for the Enterprise Cloud
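Among the new SIMD datatypes, the bfloat16 converts are simple to illustrate: bfloat16 keeps fp32's sign and 8-bit exponent but only the top 7 mantissa bits. The sketch below models the conversion by truncation; real hardware converts typically round to nearest-even, and this is my illustration, not POWER10's exact instruction semantics:

```python
import struct

# bfloat16 is the top 16 bits of an IEEE-754 float32. Truncating the
# low 16 mantissa bits is the simplest (round-toward-zero) convert.
def fp32_to_bf16_bits(x: float) -> int:
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Values with short mantissas (like 1.5) survive the round trip exactly.
roundtrip = bf16_bits_to_fp32(fp32_to_bf16_bits(1.5))
```

Because exponent range matches fp32, bfloat16 trades mantissa precision for dynamic range, which is why it appears alongside int8 in the MMA reduced-precision modes.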

[Diagram: security stack from applications, middleware, and workloads down through containers and VMs to confidential-computing and cloud security services]

- Secure containers: transparent to applications; end-to-end encryption
- Secure nested virtualization (KVM on PowerVM): stronger container isolation without a performance penalty; hardware-enabled and transparent
- Hardened container memory isolation
- Crypto performance: core crypto improvements for today's algorithms (AES, SHA3) and ready for the future (PQC, FHE)
- Main memory encryption: stronger confidentiality against physical attacks
- Dynamic Execution Control Register (DEXCR): performance-enhanced side-channel avoidance

Powerful Core: Enterprise Strength and Energy Efficiency

P10 core micro-architecture (half of an SMT8 core shown = SMT4-core equivalent):

[Diagram: fetch/branch with 48 KB 6-way EA-tagged L1 instruction cache (32B fetch), predecode with fusion/prefix support, and branch predictors; 8-instruction fetch into a 128-entry instruction buffer; decode/fuse of 8 iops into a 512-entry instruction table; four 128b execution slices (20 entries each) plus a 2x512b MMA accelerator; 32 KB 8-way EA-tagged L1 data cache with two 32B loads and one 32B gathered store; 128-entry (SMT) / 64-entry (ST) load queue; 12-entry load-miss queue; 80-entry (SMT) / 40-entry (ST) store queue; 16-stream prefetch; 64-entry ERAT; 4K-entry hashed-index TLB; 48-entry L3 prefetch]

Performance (capacity vs. POWER9: improved, >= 2x, up to 4x):
- Double SIMD plus inference acceleration: 2x SIMD, 4x MMA, 4x AES/SHA
- Larger working sets: 1.5x L1 instruction cache, 4x L2, 4x TLB
- Deeper/wider instruction windows
- Data latency in cycles: L2 13.5 (down 2), L3 27.5 (down 8); L1-D cache 4, +0 for store-forward (down 2); TLB access +8.5 (down 7)
- Branch: target registers held with GPRs in the main regfile; new target and direction predictors; 2x BHT
- Fusion: fixed, SIMD, and other merge and back-to-back pairs; load and store fusion of consecutive storage (LD+LD, ST+ST, ST+ST+ST+ST)

Energy efficiency (watt vs. POWER9: improved):
- Improved clock gating
- Design and micro-architecture efficiency: better branch accuracy (less wasted work); fusion and store gathering (fewer units of work)
- Reduced ports per access: sliced target regfile with reduced read ports per entry
- EA-tagged L1-D and L1-I caches: CAM with cache way/index; ERAT lookup only on a cache miss

Net: 1.3x performance at 0.5x power = 2.6x performance per watt, POWER10 core vs. POWER9 core

Powerful Core: AI Infused Bandwidth and Compute

2x bytes from all sources (OMI, L3, L2, L1 caches)*:
- 4 32B loads and 2 32B stores per SMT8 core per cycle (via new ISA or fusion)
- Thread maximum: 2 32B loads, 1 32B store
- OMI memory to one core: 256 GB/s peak, 120 GB/s sustained, with 3x L3 prefetch and memory-prefetch extensions

[Diagram: per-core byte flows — OMI 64B read / 32B write; local 8 MB L3 region 64B read and 64B write; 2 MB L2 128B read / 64B write; L1 feeding eight 16B SIMD engines and four 64B MMA units]

2x bandwidth-matched SIMD*:
- 8 independent SIMD engines per core (fixed, float, permute)

4-32x matrix math acceleration*:
- 4 512b engines per core = 2048b of results per cycle
- Matrix-math outer products: A <- +/-A +/- XY^T
- Double, single, and reduced precision

Rank k | Operand type (X, Y) | X   | Y^T | Accumulator | Peak [FL]OPS/cycle: instruction / thread / SMT8 core
1      | Float-64 (DP)       | 4x1 | 1x2 | 4x2 (fp64)  | 16 / 32 / 64
1      | Float-32 (SP)       | 4x1 | 1x4 | 4x4 (fp32)  | 32 / 64 / 128
2      | Float-16 (HP)       | 4x2 | 2x4 | 4x4 (fp32)  | 64 / 128 / 256
2      | Bfloat-16           | 4x2 | 2x4 | 4x4 (fp32)  | 64 / 128 / 256
2      | Int-16              | 4x2 | 2x4 | 4x4 (int32) | 64 / 128 / 256
4      | Int-8               | 4x4 | 4x4 | 4x4 (int32) | 128 / 256 / 512
8      | Int-4               | 4x8 | 8x4 | 4x4 (int32) | 256 / 512 / 1024

AI Infused Core: Inference Acceleration

- 4x+ per-core throughput*; 3x to 6x thread latency reduction (SP, int8)*
- [Diagram: int8 xvi8ger4pp — a rank-4 outer-product accumulate of 4x4 int8 operands into a 512b (4x4 int32) ACC; 4 per cycle per SMT8 core]
- POWER10 Matrix Math Assist (MMA) instructions: 8 architected 512b accumulator (ACC) registers; 4 parallel units per SMT8 core
- Consistent VSR 128b register architecture: minimal SW-ecosystem disruption (no new register state); application performance via updated libraries (OpenBLAS, etc.); POWER10 aliases each 512b ACC to four 128b VSRs, and the architecture allows redefinition of the ACC
- Dense math engine: built for data-reuse algorithms; includes a separate physical register file (ACC); 2x efficiency vs. traditional SIMD for MMA
- Matrix optimized / high efficiency: result data remains local to the compute

[Diagram: inference-accelerator dataflow, 2 per SMT8 core — 32B loads deliver 16B + 16B operands from the main regfiles to the MMA; 64B ACC results stay in the accelerator]
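The rank-k outer-product shapes above can be modeled directly. Below is a pure-Python sketch of one rank-4 int8 accumulate in the style of xvi8ger4pp — the operand layout follows the table, but saturation, lane ordering, and prefixed-instruction masking are deliberately omitted, so treat it as a model, not the instruction's exact semantics:

```python
# Model of one MMA rank-k update: ACC(4x4) += X(4xk) @ Y^T(kx4).
# For xvi8ger4pp-style int8, k = 4 and the accumulator is int32.
def mma_rank_k_update(acc, x, yt):
    rows, cols, k = len(acc), len(acc[0]), len(x[0])
    for i in range(rows):
        for j in range(cols):
            acc[i][j] += sum(x[i][r] * yt[r][j] for r in range(k))
    return acc

acc = [[0] * 4 for _ in range(4)]                # 4x4 int32 accumulator
x = [[1, 2, 3, 4] for _ in range(4)]             # 4x4 int8 operand
yt = [[1 if r == j else 0 for j in range(4)]     # identity Y^T for clarity
      for r in range(4)]
mma_rank_k_update(acc, x, yt)                    # each acc row becomes [1, 2, 3, 4]

# Peak-rate check against the table: a rank-1 fp32 update into a 4x4
# accumulator is 16 multiply-adds = 32 FLOPs per instruction; with
# 4 MMA units, that is 128 FLOPs per SMT8 core per cycle.
flops_per_instr = 4 * 4 * 2
flops_per_core = flops_per_instr * 4
```

The same loop with k = 1, 2, 4, or 8 reproduces every row of the table, which is why one accumulate instruction scales its throughput with the operand precision.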

* versus POWER9

POWER10 SIMD / AI Socket Performance Gains

[Chart: relative socket performance, POWER9 baseline = 1 (scale to 20): POWER10 Linpack; POWER10 inference (ResNet-50) FP32; POWER10 inference (ResNet-50) BFloat16; POWER10 inference (ResNet-50) INT8]
