The Tofu Interconnect D for Supercomputer Fugaku
Yuichiro Ajima, Fujitsu Limited
20th June 2019, ExaComm 2019
Copyright 2019 FUJITSU LIMITED

Overview of Fujitsu's Computing Products
- Fujitsu continues to develop general purpose computing products and new domain specific computing devices
[Figure: product map]
  General Purpose Computing Products:
    Servers: Mainframes, SPARC Servers, x86 Servers
    High Performance Computing: Supercomputers, PC Clusters
    (label: w/ Fujitsu own processor)
  Domain Specific Computing Devices: Deep Learning, Combinatorial Problems, Quantum Computing

Role of Supercomputer in Fujitsu Products
- The supercomputer is a technology driver for Fujitsu's computing products in technologies including packaging and interconnect

Development of Packaging Technology
[Figure: timeline with years 2003, 2009, 2012, 2015, 2021; systems HPC2500, FX1, K computer, FX10, FX100, Fugaku (the Fugaku image is a prototype); milestones Single-Socket Node, Water-Cooling, 3D-Stacked Memory, 2.5D Package]
- Fujitsu has developed single-socket-node, water-cooled supercomputers with 3D-stacked memory
- Fugaku will integrate memory stacks into a CPU package

Development of Interconnect Technology
[Figure: timeline with years 2003, 2009, 2012, 2015, 2021; systems HPC2500, FX1, K computer, FX10, FX100, Fugaku (the Fugaku image is a prototype); interconnects DTU, InfiniBand, Tofu1, Tofu2, TofuD]
- Tofu Interconnect (Tofu1) for K computer
  - 6D mesh/torus network, virtual torus rank mapping, Tofu Barrier
- Tofu2 added new functions: atomic operations, cache injection
- Tofu Interconnect D (TofuD) for Fugaku
  - Increased resources for high-density node configuration
  - Fault resilience: dynamic packet slicing
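The 6D mesh/torus network listed for Tofu1 addresses each node by six coordinates (X, Y, Z, A, B, C), with A×B×C fixed to 2×3×2 across the Tofu family. As a minimal sketch of that addressing, the Python below counts the nodes of a configuration and flattens a coordinate into a single index. The function names and the row-major ordering are illustrative assumptions, not Tofu's actual rank mapping.

```python
from math import prod

# A, B, C axis sizes are fixed to 2 x 3 x 2 in the Tofu family.
ABC_SHAPE = (2, 3, 2)


def shape_6d(x, y, z):
    """Return the full 6D shape (X, Y, Z, A, B, C) for given X, Y, Z sizes."""
    return (x, y, z) + ABC_SHAPE


def node_count(x, y, z):
    """Total number of nodes: X * Y * Z * 2 * 3 * 2."""
    return prod(shape_6d(x, y, z))


def linear_index(coord, shape):
    """Flatten a 6D coordinate to a row-major index.

    Row-major order is a hypothetical choice made for illustration only;
    it is not the virtual 3D-torus rank mapping used by Tofu.
    """
    if len(coord) != len(shape):
        raise ValueError("coordinate and shape must have the same length")
    index = 0
    for c, size in zip(coord, shape):
        if not 0 <= c < size:
            raise ValueError(f"coordinate {c} out of range for axis size {size}")
        index = index * size + c
    return index
```

For example, `node_count(2, 2, 4)` gives 192 for a 2×2×4×2×3×2 arrangement, and `node_count(1, 1, 4)` gives 48.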
Features of the Tofu interconnect family
- 6D Mesh/Torus Network
- Virtual 3D-Torus Rank-mapping
- Tofu Barrier
- Characteristics of Torus Network

6D Mesh/Torus Network
- Six coordinate axes: X, Y, Z, A, B, C
  - X, Y, Z: the size varies according to the system configuration
  - A, B, C: the size is fixed to 2×3×2
- Tofu stands for "torus fusion": (X, Y, Z)×(A, B, C)
[Figure: X×Y×Z×2×3×2 network drawn along the axes X, Y, Z, A, B, C]

Virtual 3D-Torus Rank-mapping
- A rank-mapping option for topology-awareness
- A 3D torus rank can be mapped to a 6D submesh even if there is an offline node
- This fault tolerance contributes to the system availability
[Figure: numbered 3D torus ranks mapped onto a 6D submesh along the axes X, Y, Z, A, B, C]

Tofu Barrier
- Tofu Barrier offloads Barrier and Allreduce communications
  - Barrier channel (BCH) is an interface
  - Barrier gate (BG) is a communication engine
- Tofu barrier can execute an arbitrary communication algorithm
  - Recursive-doubling algorithm uses log2(n) of BGs in each process
  - Reduce-broadcast algorithm uses a maximum of 5 BGs in each process
[Figure: Process 0 to Process 9; legend: BCH + start/end BG, intermediate BG]

Characteristics of Torus Network

System              Network            Total Injection    Injection/   Bisection
                                       Bandwidth (PB/s)   Bisection    Bandwidth (TB/s)
Blue Gene/Q         Torus (5D)         1.97               40X          49
K Computer          Mesh/Torus (6D)    1.66               36X          46
                    Virtual Torus (3D)                                 34
Sunway TaihuLight   Tapered Fat-Tree   0.51               7.3X         70
Piz Daint           Dragonfly          0.07               2.0X         36
Summit              Fat-Tree           0.12               1.0X         115
Oakforest-PACS      Fat-Tree           0.10               1.0X         102

- All systems have the same order of bisection bandwidth
  - No significant performance difference in global data exchange
- Torus networks have higher total injection bandwidth
  - Topology-aware communication such as
    nearest-neighbor data exchange results in higher performance

The Design of TofuD
- High Density Node Configuration
- Link Configuration and Injection Bandwidth
- Packaging
- Dynamic Packet Slicing
- Increased Tofu Barrier Resources

High Density Node Configuration
- The processor die is smaller than that of FX100
- The off-chip channels are halved
  - # memory stacks: 8 to 4
  - # high speed serial lanes for Tofu: 40 to 20
- The area of Tofu interconnect shrinks to about 1/3 size
[Figure: die layouts. FX100: SPARC64™ XIfx (20nm) with Tofu2 and 8 HMCs; Fugaku: A64FX (7nm) with TofuD and 4 HBMs]

High Density Node Configuration (cont.)
- More resources are integrated into the CPU
  - # CPU Memory Groups (NUMA nodes): 2 to 4
    - The expected number of processes per node is also doubled
  - # Tofu Network Interfaces: 4 to 6
    - Provide more resources and accelerate collective communications
[Figure: block diagrams. SPARC64 XIfx (Tofu2): CMGs with cores, HMCs, PCIe, TNI0 to TNI3, Tofu network router with 10 ports × 4 lanes. A64FX (TofuD): CMGs with cores, HBM2 stacks, PCIe, NOC, TNI0 to TNI5, Tofu network router with 10 ports × 2 lanes]

Link Configuration and Injection Bandwidth

                                      Tofu1   Tofu2      TofuD
Data rate (Gbps)                      6.25    25.78125   28.05
Number of signal lanes per link       8       4          2
Link bandwidth (GB/s)                 5.0     12.5       6.8
Number of TNIs per node               4       4          6
Injection bandwidth per node (GB/s)   20      50         40.8

- Data transfer rate increased from 25 Gbps to 28 Gbps
  - Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s
- TofuD simultaneously transmits in 6 directions
  - Increased from 4 directions in the case
    of Tofu1 and Tofu2
- Total injection bandwidth per node is 40.8 GB/s
  - Approximately twice that of Tofu1 or 80% that of Tofu2

Packaging: CPU Memory Unit (CMU)
- Two CPUs connected with C-axis
  - X×Y×Z×A×B×C = 1×1×1×1×1×2
- Two or three active optical cable (AOC) cages on the board
  - Each cable bundles two lanes of signals from each of the two CPUs
[Figure: CMU board with two CPUs and AOC cages, labeled AOC (X), AOC (Y), AOC (Z)]

Packaging: Rack Structure
- Rack
  - 8 shelves
  - 192 CMUs (384 CPUs)
- Shelf
  - 24 CMUs (48 CPUs)
  - X×Y×Z×A×B×C = 1×1×4×2×3×2
- Top or bottom half of rack
  - 4 shelves
  - X×Y×Z×A×B×C = 2×2×4×2×3×2
[Figure: rack (prototype) and shelves]

Dynamic Packet Slicing: Split Mode
- The physical layer of TofuD is independent for each lane
  - In ordinary multi-lane transmission, the physical layer has a media-independent interface and hides the number of signal lanes
- A packet is sliced and each slice is injected into a different lane
  - The routing header of the packet is copied to both slices for virtual cut-through packet transfer
- This is normal operation and is called split mode
[Figure: a packet with its routing header split into slice 0 and slice 1]

Dynamic Packet Slicing: Duplicate Mode
- When the error rate is high, the operation mode falls back to duplicate mode, which duplicates packets
  - If the error rate returns to low, the link can return to split mode
- Each lane is never disconnected independently
  - The error rates of both lanes are continuously monitored and fed back
[Figure: the same packet, with routing header, sent on both lanes, with error rate feedback]

Increased Tofu Barrier Resources

Scope   Item                             Tofu1/2   TofuD
TNI     Number of BCHs                   8         16
TNI     Number of BGs                    64        48
        Number of TNIs w/ Tofu Barrier   1         6
Node    Number of BCHs                   8         96
Node    Number of BGs                    64        288

- The number of Tofu Barrier resources
  significantly increased
  - All 6 TNIs of TofuD have Tofu barrier
  - Only TNI #0 of Tofu1/2 has Tofu barrier
- This change is intended to support intra-node synchronization

Performance Evaluations
- Put Latencies
- Latency Breakdown
- Injection Rates
- Tofu Barrier

Put Latencies
- 8B Put transfer between nodes on the same board
  - The low-latency features were used

        Communication settings       Latency
Tofu1   Descriptor on main memory    1.15 µs
        Direct Descriptor            0.91 µs
Tofu2   Cache injection OFF          0.87 µs
        Cache injection ON           0.71 µs
TofuD   To/From far CMGs             0.54 µs
        To/From near CMGs            0.49 µs

Annotated reductions: 0.20 µs (0.91 µs to 0.71 µs) and 0.22 µs (0.71 µs to 0.49 µs)

- Tofu2 reduced the Put latency by 0.20 µs from that of Tofu1
  - The cache injection feature contributed to this reduction
- TofuD reduced the Put latency by 0.22 µs from that of Tofu2

Latency Breakdown
[Figure: stacked bar chart of latency (nsec, 0 to 1000) for Tofu1, Tofu2, and TofuD; components: Tx CPU, Tx Host bus, Tx TNI, Packet Transfer, Rx TNI, Rx Host bus, Rx CPU; annotations: Tx Optimization, Cache injection, Increased Overhead in Physical Layer, Overhead Reduced, Rx Optimization]
- The overhead increase in Tofu2 has been reduced

Injection Rates per Node
- Simultaneous Put transfers to the nearest-neighbor nodes
  - Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs

               Injection rate   Efficiency
Tofu1 (K)      15.0 GB/s        77%
Tofu1 (FX10)   17.6 GB/s        88%
Tofu2          45.8 GB/s        92%
TofuD          38.1 GB/s        93%

- The efficiencies of Tofu1 were lower than 90%
  - Because of a bottleneck in the bus that connects CPU and ICC
- The efficiencies of Tofu2 and TofuD exceeded 90%
  - Integration into the processor chip removed the bottleneck

Tofu Barrier: Intra-Node Synchronization
- The test program synchronized multiple BCHs in a node
  - For 8 and 16 BCHs, some TNIs are shared by
    multiple BCHs
  - Sharing a TNI serializes the processing of BCHs/BGs

Number of BCHs                   1   4   8   16   48
Number of used TNIs              1   4   6   6    6
Number of communication stages   2   2   4   6    9
Max.