POWER10 Processor Chip
Total Page:16
File Type:pdf, Size:1020Kb
POWER10 Processor Chip Technology and Packaging: PowerAXON PowerAXON - 602mm2 7nm Samsung (18B devices) x x 3 SMT8 SMT8 SMT8 SMT8 3 - 18 layer metal stack, enhanced device 2 Core Core Core Core 2 - Single-chip or Dual-chip sockets + 2MB L2 2MB L2 2MB L2 2MB L2 + 4 4 SMP, Memory, Accel, Cluster, PCI Interconnect Cluster,Accel, Memory, SMP, Computational Capabilities: PCI Interconnect Cluster,Accel, Memory, SMP, Local 8MB - Up to 15 SMT8 Cores (2 MB L2 Cache / core) L3 region (Up to 120 simultaneous hardware threads) 64 MB L3 Hemisphere Memory Signaling (8x8 OMI) (8x8 Signaling Memory - Up to 120 MB L3 cache (low latency NUCA mgmt) OMI) (8x8 Signaling Memory - 3x energy efficiency relative to POWER9 SMT8 SMT8 SMT8 SMT8 - Enterprise thread strength optimizations Core Core Core Core - AI and security focused ISA additions 2MB L2 2MB L2 2MB L2 2MB L2 - 2x general, 4x matrix SIMD relative to POWER9 - EA-tagged L1 cache, 4x MMU relative to POWER9 SMT8 SMT8 SMT8 SMT8 Core Core Core Core Open Memory Interface: 2MB L2 2MB L2 2MB L2 2MB L2 - 16 x8 at up to 32 GT/s (1 TB/s) - Technology agnostic support: near/main/storage tiers - Minimal (< 10ns latency) add vs DDR direct attach 64 MB L3 Hemisphere PowerAXON Interface: - 16 x8 at up to 32 GT/s (1 TB/s) SMT8 SMT8 SMT8 SMT8 x Core Core Core Core x - SMP interconnect for up to 16 sockets 3 2MB L2 2MB L2 2MB L2 2MB L2 3 - OpenCAPI attach for memory, accelerators, I/O 2 2 + + - Integrated clustering (memory semantics) 4 4 PCIe Gen 5 PCIe Gen 5 PowerAXON PowerAXON PCIe Gen 5 Interface: Signaling (x16) Signaling (x16) - x64 / DCM at up to 32 GT/s Die Photo courtesy of Samsung Foundry Socket Composability: SCM & DCM Horizontal Full Connect Up to 16 SCM Single-Chip Module Focus: Sockets - 602mm2 7nm (18B devices) - Core/thread Strength - Up to 15 SMT8 Cores (4+ GHz) - Capacity & Bandwidth / Compute - Memory: x128 @ 32 GT/s - SMP/Cluster/Accel: x128 @ 32 GT/s - I/O: x32 PCIe G5 Vertical Full Connect - System Scale (Broad Range) - 1 to 16 sockets Dual-Chip Module Focus: - 1204mm2 7nm (36B devices) - Throughput / Socket Up to - Up to 30 SMT8 Cores (3.5+ GHz) 4 DCM - Compute & I/O Density Sockets - Memory: x128 @ 32 GT/s - SMP/Cluster/Accel: x192 @ 32 GT/s - I/O: x64 PCIe G5 - 1 to 4 sockets (Multi-socket configurations show processor capability only, and do not imply system product offerings) IBM POWER10 System Composability: PowerAXON & Open Memory Interfaces Multi-protocol “Swiss-army-knife” Flexible / Modular Interfaces POWER10 Chip 1 Terabyte / Sec 1 Terabyte / Sec PowerAXON OMI Memory PowerAXON corner OMI edge 4x8 @ 32 GT/s 8x8 @ 32 GT/s Built on best-of-breed Low Power, Low Latency, 6x bandwidth / mm2 High Bandwidth compared to DDR4 Signaling Technology signaling IBM POWER10 System Enterprise Scale and Bandwidth: SMP & Main Memory Multi-protocol Initial Offering: Up to 4 TB / socket “Swiss-army-knife” OMI DRAM memory Flexible / Modular Interfaces 410 GB/s peak bandwidth Build up to 16 SCM socket (MicroChip DDR4 buffer) Robustly Scalable Allocate the bandwidth < 10ns latency adder High Bisection Bandwidth however you need to use it “Glueless” SMP POWER10 Chip 1 Terabyte / Sec 1 Terabyte / Sec PowerAXON OMI Memory Main tier DRAM DIMM swap upgradeable: Built on best-of-breed DDR5 OMI DRAM memory Low Power, Low Latency, with higher bandwidth High Bandwidth and higher capacity Signaling Technology (PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings) IBM POWER10 Data Plane Bandwidth and Capacity: Open Memory Interfaces OMI-attached GDDR DIMMs can provide low-capacity, high bandwidth alternative to HBM, without packaging rigidity & cost (Up to 800 GB/s sustained) POWER10 Chip 1 Terabyte / Sec 1 Terabyte / Sec PowerAXON OMI Memory Main tier DRAM SCM OMI-attached storage class memory can provide high-capacity, encrypted, persistent memory in a DIMM slot. (POWER10 systems can support 2 petabytes of addressable load/store memory) (PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings) IBM POWER10 System Heterogeneity and Data Plane Capacity: OpenCAPI OpenCAPI attaches FPGA or ASIC-based Accelerators to POWER10 host with High Bandwidth and Low Latency POWER10 Chip 1 Terabyte / Sec 1 Terabyte / Sec Accelerated App SCM ASIC or FPGA OpenCAPI Attach PowerAXON OMI Memory Main tier DRAM SCM OpenCAPI-attached storage class memory can provide high-capacity, encrypted, persistent memory in a device form factor. (POWER10 systems can support 2 petabytes of addressable load/store memory) (PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings) IBM POWER10 Pod Composability: PowerAXON Memory Clustering Memory Inception capability enables a system to map another system’s memory as its own. Multiple systems can be clustered, sharing each other’s memory POWER10 Chip 1 Terabyte / Sec 1 Terabyte / Sec Accelerated App SCM ASIC or FPGA OpenCAPI Attach PowerAXON OMI Memory Main tier DRAM SCM (PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings) IBM POWER10 Memory Clustering: Distributed Memory Disaggregation and Sharing Use case: Share load/store memory amongst directly connected neighbors within Pod Workload A A A A A B B B B Unlike other schemes, memory can be used: - As low latency local memory Workload B C C C - As NUMA latency remote memory Example: Pod = 8 systems each with 8TB Workload C C C C C C C C C Workload A Rqmt: 4 TB low latency Workload B Rqmt: 24 TB relaxed latency C C C B B B B Workload C Rqmt: 8 TB low latency plus 16TB relaxed latency C C C B B B B All Rqmts met by configuration shown C C C B B B B POWER10 2 Petabyte memory size enables much larger configurations C C C B B B B C B B B B (Memory cluster configurations show processor capability only, and do not imply system product offerings) IBM POWER10 Memory Clustering: Enterprise-Scale Memory Sharing Pod of Large Enterprise Systems Distributed Sharing at Petabyte Scale Or Hub-and-spoke with memory server and memory-less compute nodes (Memory cluster configurations show processor capability only, and do not imply system product offerings) IBM POWER10 Memory Clustering: Pod-level Clustering Use case: Low latency, high bandwidth messaging scaling to 1000’s of nodes Leverage 2 Petabyte addressability to create memory window into each destination for messaging mailboxes Sender Receiver (Memory cluster configurations show processor capability only, and do not imply system product offerings) IBM POWER10 System Composability: PCIe Gen 5 Industry I/O Attach POWER10 Chip 1 Terabyte / Sec 1 Terabyte / Sec Accelerated App SCM ASIC or FPGA OpenCAPI Attach PowerAXON OMI Memory Main tier DRAM PCIe G5 SCM (PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings) IBM POWER10 POWER10 General Purpose Socket Performance Gains DDR5 OMI 4 Memory 3 DDR4 2 OMI Memory 1 POWER9 POWER10 POWER10 POWER10 POWER10 Baseline Integer Enterprise Floating Point Memory Streaming (Performance assessments based upon pre-silicon engineering analysis of POWER10 dual-socket server offering vs POWER9 dual-socket server offering) IBM POWER10 Powerful Core = Enterprise Strength + AI Infused New Enterprise Micro-architecture - Flexibility 1-2 POWER10 chips per socket - Up to 8 threads per core / 240 per socket • Up to 30 SMT8 Cores • Up to 60 SMT4 Cores - Optimized for performance and efficiency - +30% avg. core performance* PowerAXON PowerAXON x x 3 SMT8 SMT8 SMT8 SMT8 3 - +20% avg. single thread performance* 2 Core Core Core Core 2 + 2MB L2 2MB L2 2MB L2 2MB L2 + SMP, Memory, SMP,Memory, Accel, Cluster, PCI Interconnect - 2.6x core performance efficiency* (3x @ socket) 4 SMP,Memory, Accel, Cluster, PCI Interconnect 4 64 MB L3 Hemisphere Memory Memory Signaling (8x8 OMI) Memory Memory Signaling (8x8 OMI) SMT8 SMT8 SMT8 SMT8 SMT8 Core Core Core Core Core 2MB L2 2MB L2 2MB L2 2MB L2 AI Infused 4x Matrix SIMD SMT8 SMT8 SMT8 SMT8 Core Core Core Core 2MB L2 2MB L2 2MB L2 2MB L2 - 4x matrix SIMD acceleration* New ISA Enhanced Local 8MB 2x General L3 region - 2x bandwidth & general SIMD* Prefix Control & 64 MB L3 Hemisphere Fusion SIMD Branch - 4x L2 cache capacity with SMT8 SMT8 SMT8 SMT8 x Core Core Core Core x 3 2MB L2 2MB L2 2MB L2 2MB L2 3 improved thread isolation* 2 2 2x Load, 2x Store + + 4x MMU 4 PCIe Gen 5 PCIe Gen 5 4 - New ISA with AI data-types PowerAXON Signaling (x16) Signaling (x16) PowerAXON 4x L2 Cache * versus POWER9 (Performance assessments based upon pre-silicon engineering analysis of POWER10 dual-socket server offering vs POWER9 dual-socket server offering) IBM POWER10 Powerful Core : Enterprise Flexibility Hardware Based Workload Balance Instruction Cache SMT8 Core Multiple World-class Software Stacks Instruction Buffer Resilience and full stack integrity Slice Slice Slice Slice Slice Slice Slice Slice • PowerVM, KVM 128b 128b 128b 128b 128b 128b 128b 128b MMA MMA MMA VSU VSU VSU VSU VSU VSU VSU VSU MMA QP / DF QP / DF • AIX, IBMi, Linux on Power, OpenShift Store Load Store Load Load Store Load Store (32B) (32B) (32B) (32B) (32B) (32B) (32B) (32B) 1-2 Threads L1 Data Cache 1-2 Threads Automatic Thread Resource Balancing Instruction Cache SMT8 Core Partition flexibility and security Instruction Buffer • Full-core level LPAR • Thread-based LPAR scheduling Slice Slice Slice Slice Slice Slice Slice Slice 128b 128b 128b 128b 128b 128b 128b 128b MMA MMA MMA • NEW: With PowerVM Hypervisor VSU VSU VSU VSU VSU VSU MMA VSU VSU QP QP / DF QP / DF • Nested KVM + PowerVM Store Load Store Load Load Store Load Store • Hardware assisted container/VM isolation (32B) (32B) (32B) (32B) (32B) (32B) (32B) (32B) 2 Threads 2 Threads 2 Threads 2 Threads L1 Data Cache 8 Threads Active IBM POWER10 Powerful Architecture : AI Infused and Future Ready POWER10 implements Power ISA v3.1 • v3.1 was the latest open Power ISA contributed to the OpenPOWER Foundation: Royalty free and inclusive of patents for compliant designs POWER10 Architecture – Feature Highlights Greatly expanded opcode space, pc- RISC friendly 8B instructions including modified and new opcode forms.