U.S. Department of Energy
Office of Science
Bassi/Power5 Architecture
John Shalf NERSC Users Group Meeting Princeton Plasma Physics Laboratory June 2005 U.S. Department of Energy POWER5 IH Overview
Office of Science.
POWER5 IH Node
H C R U6 4W or 8W POWER5 IBM Architecture Processors POWER5 IH System L3 Cache 144MB / 288MB (total) 2U rack chassis Memory 2GB - 128/256GB .Rack: 24" X 43 " Deep, Full Drawer 2U ( 24" rack ) Packaging 16 Nodes / Rack DASD / Bays 2 DASD ( Hot Plug)
HC R U6 IBM I/O Expansion 6 slots ( Blindswap)
HC R U6 IBM Integrated SCSI Ultra 320 HC R U6 IBM
HC R U6 IBM 4 Ports Integrated Ethernet HC R U6 IBM 10/100/1000
HC R U6 IBM RIO Drawers Yes ( 1/2 or 1 ) HC R U6 IBM
HC R U6 IBM LPAR Yes
HC R U6 IBM Switch HPS
HC R U6 IBM OS AIX 5.2 & Linux 16 Systems / Rack 128 Processors / Rack
Image From IBM U.S. Department of Energy POWER5 IH Physical Structure
Office of Science. DCM (8X) SMI (32X)
DCA Power Connector (6X) Pluggable DIMM (64X)
DASD BackPlane
Processor Board
Blower (2X)
PCI Risers (2x) I/O Board
Image From IBM U.S. Department of Energy IBM Power Series Processors
Office of Science
Power3+ Power4 Power5 MHz 375 1300 1900 FLOPS/clock 4 4 4 Peak FLOPS 1.5 5.2 7.6 L1 (D-Cache) 64k 32k 32k L2 (unified) 8MB 1.5M (0.75M) 1.9M L3 (unified) -- 32M (16M) 36M STREAM GB/s 0.4 (0.7) 1.4 (2.4) 5.4 (7.6) Bytes/FLOP 0.3 0.44 0.7 U.S. Department of Energy Power3 vs. Power5 die
Office of Science Power3 Power5
Image From IBM Power3 Redbook Image From IBM Power5 Redbook U.S. Department of Energy Power3 vs. Power5 die
Office of Science Power3 Power5
Image From IBM Power3 Redbook Image From IBM Power5 Redbook U.S. Department of Energy Power3 vs. Power5 die
Office of Science Power3 Power5
Image From IBM Power3 Redbook Image From IBM Power5 Redbook U.S. Department of Energy Memory Subsystem
Office of Science
Image From IBM Power5 Redbook U.S. Department of Energy SMP Fabric
Office of Science
Power3 Power4 Power5 P P P P P P P P P P P L2 L2 L2 L2 L2 L3 L2 L2 L3 Memory Bus FC FC FC FC
L3 L3 Memory Memory Controller Controller Memory Memory Memory Memory Controller Controller Controller Controller
SMI SMI SMI SMI SMI SMI SMI SMI DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM U.S. Department of Energy SMP Fabric
Office of Science
Power3 Power4 Power5 SC P P P P P P P P P L2 L2 L2 L2 L2 L3 L2 L2 L3 Memory Bus FC FC FC FC
L3 L3 Memory Memory Controller Controller Memory Memory Memory Memory Controller Controller Controller Controller
SMI SMI SMI SMI SMI SMI SMI SMI DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM
Remaining core gets •All of L2 + L3 cache •All memory BW •Increased Clock Rate U.S. Department of Energy System Packaging (MCM vs. DCM) Office of Science
. Multi-Chip Module (MCM) . Dual-Chip Module (DCM) . 4 x Power5 + L3 Caches . 1 x Power5 + L3 Cache . Up to 8 Power5 cores . Up to 2 Power5 cores . L2 shared among cores . Private L2 and L3
Image From IBM Power5 Redbook U.S. Department of Energy System Packaging (MCM vs. DCM) Office of Science
4-8 DIMM Slots 4-8 DIMM Slots 1GB-16GB Memory1GB-16GB Memory
4-8 DIMM Slots
L3 L3 2xPCI Express x16 2xPCI Express x16 2x8 GB/s 2x(4+4)P5 P5 2x8 GB/s (4+4)
L3 L3 L3 2xPCI Express x16 2xPCI Express x16 2xPCI Express x16 2x8 GB/s (4+4)P5 P5 2x8 GB/s (4+4) P5 2x8 GB/s (4+4)
4-8 DIMM Slots 4-8 DIMM Slots 1GB-16GB Memory1GB-16GB Memory U.S. Department of Energy Power5 Cache Hierarchy
Office of Science
P5 Squadron • 4 MCM • 64 proc SMP
P5 Squadron IH • 4 DCM • 8 [email protected] •16 [email protected]
Image From IBM Power5 Redbook U.S. Department of Energy Power5-IH Node Architecture
Office of Science 8 Dimms 8 Dimms 8 Dimms 8 Dimms 4 SMI2s 4 SMI2s 4 SMI2s 4 SMI2s
8B 16B 8B 16B 8B 16B 8B 16B DCM DCM DCM DCM GX+ GX+ GX+ GX+ 4B 4B 4B 4B L3 4B L3 4B L3 4B L3 4B 16B P5 16B P5 16B P5 16B P5 36MB 36MB 36MB 36MB 16B 1 Core 16B 1 Core 16B 1 Core 16B 1 Core 1.92MB 1.92MB 1.92MB 1.92MB
DCM DCM DCM DCM
P5 P5 P5 P5 L3 1 Core GX+ L3 1 Core GX+ L3 1 Core GX+ L3 1 Core GX+ 16B 1.92MB 4B 16B 1.92MB 4B 16B 1.92MB 4B 16B 1.92MB 4B 36MB 16B 4B 36MB 16B 4B 36MB 16B 4B 36MB 16B 4B
8B 16B 8B 16B 8B 16B 8B 16B
8 Dimms 8 Dimms 8 Dimms 8 Dimms 4 SMI2s 4 SMI2s 4 SMI2s 4 SMI2s
Notes: 1) SMP Buses run at 1.0 Ghz Address/Control Buses in Green Image From IBM 2) L3 Buses run at 1.0 Ghz 3) Memory Buses run at 1.066 Ghz U.S. Department of Energy Power5 IH Memory Affinity
Office of Science Mem0 Mem1 Mem2 Mem3
Proc 0 Proc 1 Proc 2 Proc 3
Proc 4 Proc 5 Proc 6 Proc 7
Mem4 Mem5 Mem6 Mem7 U.S. Department of Energy Power5 IH Memory Affinity
Office of Science Proc0 to Mem0 == 90ns
Mem0 Mem1 Mem2 Mem3
Proc 0 Proc 1 Proc 2 Proc 3
Proc 4 Proc 5 Proc 6 Proc 7
Mem4 Mem5 Mem6 Mem7 U.S. Department of Energy Power5 IH Memory Affinity
Office of Science Proc0 to Mem0 == 90ns Proc0 to Mem7 == 200+ns
Mem0 Mem1 Mem2 Mem3
Proc 0 Proc 1 Proc 2 Proc 3
Proc 4 Proc 5 Proc 6 Proc 7
Mem4 Mem5 Mem6 Mem7 U.S. Department of Energy Page Mapping
Office of Science
Proc0 Mem# Mem0 Mem1 Mem2 Mem3 Page# RR 0 0 1 1 Proc 0 Proc 1 Proc 2 Proc 3 2 2 3 3 5 4 6 5 7 6 Proc 4 Proc 5 Proc 6 Proc 7 8 7 9 0 Mem4 Mem5 Mem6 Mem7 10 1 11 2 Linear Walk through Memory Addresses 12 3 13 4 •Default Affinity is round_robin (RR) 14 5 •Pages assigned round-robin to mem ctrlrs. 15 6 •Average latency ~170-190ns 16 7 U.S. Department of Energy Page Mapping
Office of Science
Proc0 Mem# Mem# Mem0 Mem1 Mem2 Mem3 Page# RR MCM 0 0 0 1 1 0 Proc 0 Proc 1 Proc 2 Proc 3 2 2 0 3 3 0 5 4 0 6 5 0 7 6 0 Proc 4 Proc 5 Proc 6 Proc 7 8 7 0 9 0 0 Mem4 Mem5 Mem6 Mem7 10 1 0 11 2 0 Linear Walk through Memory Addresses 12 3 0 13 4 0 •MCM affinity (really DCM affinity in this case) 14 5 0 •Pages assigned to mem ctrlr where first touched. 15 6 0 •Average latency 90ns + no fabric contention 16 7 0 U.S. Department of Energy Memory Affinity Directives
Office of Science
. For processor-local memory affinity, you should also set environment variables to . MP_TASK_AFFINITY=MCM . MEMORY_AFFINITY=MCM
. For OpenMP need to eliminate memory affinity . Unset MP_TASK_AFFINITY . MEMORY_AFFINITY=round_robin (depending on OMP memory usage pattern) U.S. Department of Energy Large Pages
Office of Science . Enable Large Pages . -blpdata (at link time) . Or ldedit -blpdata
. Effect on STREAM performance . TRIAD without -blpdata: 5.3GB/s per task . TRIAD with -blpdata: 7.2 GB/s per task (6.9 loaded) U.S. Department of Energy A Short Commentary about Latency Office of Science . Little’s Law: bandwidth * latency = concurrency
. For Power-X (some arbitrary) single-core: . 150ns * 20 Gigabytes/sec (DDR2 memory) . 3000 bytes of data in flight . 23.4 cache lines (very close to 24 memory request queue depth) . 375 operands must be prefetched to fully engage the memory subsystem . THAT’S a LOT of PREFETCH!!! (esp. with 32 architected registers!) U.S. Department of Energy Deep Memory Request Pipelining Using Stream Prefetch
Office of Science
Image From IBM Power4 Redbook U.S. Department of Energy Stanza Triad Results
Office of Science
. Perfect prefetching: . performance is independent of L, the stanza length . expect flat line at STREAM peak . our results show performance depends on L U.S. Department of Energy XLF Prefetch Directives
Office of Science . DCBT (Data Cache Block Touch) explicit prefetch . Pre-request some data . !IBM PREFETCH_BY_LOAD(arrayA(I)) . !IBM PREFETCH_FOR_LOAD(variable list) . !IBM PREFETCH_FOR_STORE(variable list) . Stream Prefetch . Install stream on a hardware prefetch engine . Syntax: !IBM PREFETCH_BY_STREAM(arrayA(I)) . PREFETCH_BY_STREAM_BACKWARD(variable list) . DCBZ (Data Cache Block Zero) . Use for store streams . Syntax: !IBM CACHE_ZERO(StoreArrayA(I)) . Automatically include using -qnopteovrlp option (unreliable) . Improve performance by another 10% U.S. Department of Energy Power5: Protected Streams
Office of Science . Protected Prefetch Streams . There are 12 filters and 8 prefetch engines . Need control of stream priority and prevent rolling of the filter list (takes a while to ramp up prefetch) . Helps give hardware hints about “stanza” streams . PROTECTED_UNLIMITED_STREAM_SET_GO_FORWARD(variable, streamID) . PROTECTED_UNLIMITED_STREAM_SET_GO_BACKWARD(var,streamID) . PROTECTED_STREAM_SET_GO_FORWARD/BACKWARD(var,streamID) . PROTECTED_STREAM_COUNT(ncachelines) . PROTECTED_STREAM_GO . PROTECTED_STREAM_STOP(stream_ID) . PROTECTED_STREAM_STOP_ALL . STREAM_UNROLL . Use more aggressive loop unrolling and SW pipelining in conjunction with prefetch . EIEIO (Enforce Inorder Execution of IO)