IBM POWER9 SMT Deep Dive Summit Training Workshop
Brian Thompto
POWER Systems, IBM Systems
© 2018 IBM Corporation

POWER9 Processor

Performance Optimized for Cognitive Workloads
• New Core Microarchitecture
• Enhanced cache hierarchy: up to 120 MB / chip
• On-Chip Super-Highway: connects cores, caches and accelerators / GPUs
• 14nm silicon technology

Open Interfaces for Accelerated Computing
• 1st processor introduction of PCIe Gen4
• 25G Coherent Link: next-gen CAPI technology, NVLink 2.0 for GPU attach

Family of Scale-out & Scale-up Optimized Offerings
• Dual memory subsystems optimized for Scale Out (latency/density) & Enterprise (capacity/bandwidth/RAS)
• 12 SMT8 or 24 SMT4 cores (96 threads)
• High bandwidth scale-up fabric: 2-16 socket offerings with 2-4x chip-to-chip interconnect bandwidth

POWER9 – AC922 with 6 GPUs

[Figure: POWER9 chip floorplan. Cores with their L2 caches and L3 regions surround the PCIe, on-chip accelerator, SMP interconnect and off-chip accelerator enablement logic; the chip edges carry SMP/accelerator signaling, memory signaling, PCIe signaling and SMP signaling.]

• POWER9 chip with 22 / 24 active cores
• Up to 88 threads / socket

Images / diagrams modified from: "IBM POWER9 systems designed for commercial cognitive and cloud", IBM J. Res. & Dev., vol. 62, no. 4/5, 2018; "POWER9: Processor for the cognitive era", Proc. Hot Chips 28 Symp., pp. 1-19, Aug. 2016.
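The thread counts quoted on these slides follow from multiplying the number of cores by the threads each core can run. A minimal sketch of that arithmetic is given here; the helper name `hardware_threads` is hypothetical and is not taken from the slides or from any IBM tool.

```python
def hardware_threads(cores: int, threads_per_core: int) -> int:
    """Hardware threads on one chip (illustrative helper, not an IBM API)."""
    return cores * threads_per_core


# Figures taken from the slides above.
full_chip_smt4 = hardware_threads(cores=24, threads_per_core=4)  # 24 SMT4 cores
full_chip_smt8 = hardware_threads(cores=12, threads_per_core=8)  # 12 SMT8 cores
summit_socket = hardware_threads(cores=22, threads_per_core=4)   # 22 of 24 cores active

print(f"24 x SMT4: {full_chip_smt4} threads")
print(f"12 x SMT8: {full_chip_smt8} threads")
print(f"22 x SMT4: {summit_socket} threads")
```

Both full-chip configurations give 96 threads, and the 22-core configuration gives the 88 threads per socket quoted for the chip with 22 of 24 cores active.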
POWER9 – Core and Cache Topology

[Figure: POWER9 chip floorplan with one core pair expanded: two SMT4 cores (1-4 threads each) sharing an L2 cache (512k) and an L3 cache region (10 MB).]

• POWER9 chip with 22 / 24 active cores; up to 88 threads / socket
• 2 x POWER9 SMT4 core: 1-4 threads each
• L2 cache (512k) and L3 cache (10 MB): 1-8 threads

Images / diagrams modified from: "POWER9: Processor for the cognitive era", Proc. Hot Chips 28 Symp., pp. 1-19, Aug. 2016.

POWER9: Cache Capacity

Caches per pair of SMT4 cores (1-8 threads)
• L2: 512k, 8-way
• L3: 10 MB, 20-way
• Enhanced L3 cache effectiveness with enhanced replacement
• Aggregate 110 MB, 11 x 20-way associativity, when 22 cores are active (out of 24) on Summit

[Figure: POWER9 chip (17 layers of metal, eDRAM) with twelve 10 MB L3 regions around the processing cores. Bandwidth labels: 7 TB/s; 256 GB/s to each core pair. Interface labels: DDR, SMP, PCIe, CAPI, NVLink 2, New CAPI. Attached devices: memory, POWER9, Mellanox PCIe device, Nvidia GPU, IBM & partner devices.]

New POWER9 Microarchitecture

Optimized for Cognitive Workloads & Stronger Thread Performance
• Shorter pipeline & improved scheduling / branch prediction for unoptimized code & interpretive languages
• Increased execution bandwidth for a range of workloads including commercial, cognitive and analytics
• Adaptive features for improved efficiency and performance

POWER9 Pipeline (SMT4)

[Figure: pipeline diagram of the POWER9 SMT4 core, sliced micro-architecture. Front-end stages: Fetch, Instr. $, Decode1, Decode2, Crack, Resource1, Resource2, Slice Route, Map. Back end: LSU, VSU and Branch pipes (branch slice x 1, execution slices x 4), followed by Finish and Complete.]

• Fetch: 8 instructions
• Decode: 6 instructions
• Crack: internal-operation generation
• Dispatch: 6 instructions
• Slice Route / Map: resource assignment and routing
• Complete: 64 instructions or internal-operations

Images / diagrams modified from: "IBM POWER9 processor core", IBM Journal of Research and Development, vol. 62, no. 4/5, pp. 2:1-2:12, 2018.

POWER9 SMT4-Core Microarchitecture

[Figure: POWER9 SMT4 core, sliced micro-architecture, built up from slices: IFU and ISU feeding execution super-slices and LSU super-slices.]

• 64b slice: 64b VSU with DW LSU (64b compute, 64b load/store)
• 128b super-slice: 2 x 64b or 1 x 128b
• SMT4 core: 2 x 128b super-slices

Images / diagrams modified from: "POWER9: Processor for the cognitive era", Proc. Hot Chips 28 Symp., pp. 1-19, Aug. 2016; "IBM POWER9 processor core", IBM Journal of Research and Development, vol. 62, no. 4/5, pp. 2:1-2:12, 2018.
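The aggregate figures on the cache capacity slide can be reproduced from the per-pair numbers. The sketch below assumes that the 22 active cores form 11 complete core pairs and that "512k" means 512 KiB; the constant names are illustrative and do not come from the slides.

```python
# Per-pair cache figures from the cache capacity slide.
ACTIVE_CORES = 22        # active cores per socket on Summit (out of 24)
CORES_PER_PAIR = 2       # two SMT4 cores share one L2 and one L3 region
L2_KIB_PER_PAIR = 512    # assumption: "512k" read as 512 KiB
L3_MB_PER_PAIR = 10
L3_WAYS_PER_REGION = 20

# Assumption: every active core belongs to a fully active pair.
core_pairs = ACTIVE_CORES // CORES_PER_PAIR
aggregate_l3_mb = core_pairs * L3_MB_PER_PAIR
aggregate_l2_mib = core_pairs * L2_KIB_PER_PAIR / 1024

print(f"core pairs (L3 regions): {core_pairs}")
print(f"aggregate L3: {aggregate_l3_mb} MB "
      f"({core_pairs} x {L3_WAYS_PER_REGION}-way regions)")
print(f"aggregate L2: {aggregate_l2_mib} MiB")
```

Under these assumptions the 11 regions give the 110 MB aggregate L3 stated on the slide; the aggregate L2 value is derived here and is not quoted in the deck.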
POWER9 – Core Compute

SMT4 Core Resources (SMT4 core x 22 per socket for Summit systems)

Fetch / Branch
• 32kB, 8-way Instruction Cache
• 8 fetch, 6 decode
• 1x branch execution

Slices issue VSU and AGEN
• 4x scalar-64b / 2x vector-128b
• 4x load/store AGEN

Vector Scalar Unit (VSU) Pipes
• 4x ALU + Simple (64b)
• 4x FP + FX-MUL + Complex (64b)
• 2x Permute (128b)
• 2x Quad Fixed (128b)
• 2x Fixed Divide (64b)
• 1x Quad FP & Decimal FP
• 1x Cryptography

Load Store Unit (LSU) Slices
• 32kB, 8-way Data Cache
• Up to 4 DW load or store

[Figure: SMT4 core block diagram. Front end: predecode, L1 instruction cache, IBUF (x8), decode / crack, branch prediction, dispatch with allocate / rename (x6), instruction / iop completion table. Execution: branch slice (BRU) and slices 0-3 with AGEN, ALU, XS, FP, MUL, XC, PM, QFX, QP/DFU, DIV, CRYPT and ST-D units, grouped into 128b super-slices. Load/store: L1D$ 0-3, LRQ 0/1 and 2/3, SRQ 0-3.]

Thread Sharing of POWER9 Cores + Cache

Subset of cores enabled with Single Thread (ST): ST x 1 core, 11 threads per socket
• Individual cores are inactive
• Inactive cores allow higher socket frequency via WOF Frequency Boost
• One thread gets access to the full Level-2 / Level-3 cache region

[Figure: core pair with one core running 1 thread and the other core power gated; the L2 cache and 10 MB L3 cache region serve 1 thread.]

Thread Sharing of POWER9 Cores + Cache

1 Thread Active per Core (ST): ST x 2 cores, 22 threads per socket
• Each thread gets ½ of the core execution resources
• Threads share the Level-2 / Level-3 cache

[Figure: core pair with 1 thread on each core; the L2 cache and 10 MB L3 cache region serve 2 threads.]

Thread Sharing of POWER9 Cores + Cache

2 Threads Active per Core (SMT2): SMT2 x 2 cores, 44 threads per socket
• Each pair of threads shares ½ of each core’s execution resources
• 4 threads share the Level-2 / Level-3 cache

[Figure: core pair with 2 threads on each core; the L2 cache and 10 MB L3 cache region serve 4 threads.]
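The sharing configurations can be tabulated from three inputs: cores active per pair, threads per core, and the 11 core pairs per Summit socket. The sketch below does that and also prints an even-split L3 share per thread. That per-thread figure is an illustrative assumption only, since the slides say the threads share the cache region and do not describe a fixed partition; the class and field names are hypothetical. The SMT4 row uses the 88 threads per socket quoted earlier in the deck.

```python
from dataclasses import dataclass

CORE_PAIRS_PER_SOCKET = 11  # 22 active cores on a Summit socket, two per pair
L3_REGION_MB = 10           # one L3 region per core pair


@dataclass(frozen=True)
class SharingMode:
    """One thread-sharing configuration of a core pair (illustrative type)."""
    name: str
    active_cores_per_pair: int
    threads_per_core: int

    @property
    def threads_per_region(self) -> int:
        # Threads that share one L2 cache / L3 cache region.
        return self.active_cores_per_pair * self.threads_per_core

    @property
    def threads_per_socket(self) -> int:
        return CORE_PAIRS_PER_SOCKET * self.threads_per_region

    @property
    def l3_mb_per_thread(self) -> float:
        # Assumption: an even split, used only as a rough average.
        return L3_REGION_MB / self.threads_per_region


MODES = [
    SharingMode("ST x 1 core", active_cores_per_pair=1, threads_per_core=1),
    SharingMode("ST x 2 cores", active_cores_per_pair=2, threads_per_core=1),
    SharingMode("SMT2 x 2 cores", active_cores_per_pair=2, threads_per_core=2),
    SharingMode("SMT4 x 2 cores", active_cores_per_pair=2, threads_per_core=4),
]

for mode in MODES:
    print(f"{mode.name:<15} {mode.threads_per_region} thread(s)/region  "
          f"{mode.threads_per_socket:>2} threads/socket  "
          f"~{mode.l3_mb_per_thread:.2f} MB L3 per thread")
```

The trade-off the table makes visible is that each step up in thread count per socket halves the average cache available to a thread, while the single-core ST mode gives one thread the whole region.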
Thread Sharing of POWER9 Cores + Cache

[Figure: the four sharing configurations side by side, each a core pair above its shared L2 cache and 10 MB L3 cache region.]

• ST x 1 core: 1 thread per cache region, 11 threads per socket
• ST x 2 cores: 2 threads per cache region, 22 threads per socket
• SMT2 x 2 cores: 4 threads per cache region, 44 threads per socket
• SMT4 x 2 cores: 8 threads per cache region, 88 threads per socket

4 Threads