The Tofu Interconnect D for Supercomputer Fugaku
Yuichiro Ajima, Fujitsu Limited
20th June 2019, ExaComm 2019
Copyright 2019 FUJITSU LIMITED

Overview of Fujitsu’s Computing Products
Fujitsu continues to develop general-purpose computing products and new domain-specific computing devices.
General Purpose Computing Products: Mainframes, SPARC Servers, x86 Servers, Supercomputers, PC Clusters
  (group labels in the figure: Servers w/ Fujitsu own processor; High Performance Computing)
Domain Specific Computing Devices: Deep Learning, Combinatorial Problems, Quantum Computing

Role of Supercomputer in Fujitsu Products
The supercomputer is a technology driver for Fujitsu’s computing products, in technologies including packaging and interconnect.

Development of Packaging Technology
[Figure: timeline of HPC2500, FX1, K computer, FX10, FX100 and Fugaku (the Fugaku image is a prototype) over the years 2003, 2009, 2012, 2015 and 2021, marking Single-Socket Node, Water-Cooling, 3D-Stacked Memory and 2.5D Package]
Fujitsu has developed single-socket nodes and water-cooled supercomputers with 3D-stacked memory.
Fugaku will integrate memory stacks into a CPU package.

Development of Interconnect Technology
[Figure: timeline of HPC2500, FX1, K computer, FX10, FX100 and Fugaku (the Fugaku image is a prototype) over the years 2003, 2009, 2012, 2015 and 2021, marking the interconnects DTU, InfiniBand, Tofu1, Tofu2 and TofuD]
Tofu Interconnect (Tofu1) for K computer
  6D mesh/torus network, virtual torus rank mapping, Tofu Barrier
Tofu2 added new functions: atomic operations, cache injection
Tofu Interconnect D (TofuD) for Fugaku
  Increased resources for high-density node configuration
  Fault resilience: dynamic packet slicing

Features of the Tofu interconnect family
  6D Mesh/Torus Network
  Virtual 3D-Torus Rank-mapping
  Tofu Barrier
  Characteristics of Torus Network

6D Mesh/Torus Network
Six coordinate axes: X, Y, Z, A, B, C
  X, Y, Z: the size varies according to the system configuration
  A, B, C: the size is fixed to 2×3×2
Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)
  X×Y×Z×2×3×2
[Figure: the six coordinate axes X, Y, Z, A, B, C]
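
As a minimal sketch of the coordinate system above, the following Python enumerates the twelve (A, B, C) positions that sit at each (X, Y, Z) grid point and counts the nodes of an X×Y×Z×2×3×2 network. The function names and the example size 3×4×5 are hypothetical and do not describe a real system configuration.

```python
from itertools import product

ABC_SHAPE = (2, 3, 2)  # fixed sizes of the A, B and C axes, as stated on the slide


def abc_coordinates():
    """Return every (a, b, c) position inside one (x, y, z) grid point."""
    return list(product(*(range(size) for size in ABC_SHAPE)))


def node_count(x, y, z):
    """Number of nodes in an X x Y x Z x 2 x 3 x 2 network."""
    a, b, c = ABC_SHAPE
    return x * y * z * a * b * c


if __name__ == "__main__":
    print(len(abc_coordinates()))  # 12 positions per (x, y, z) point
    print(node_count(3, 4, 5))     # hypothetical X, Y, Z sizes
```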

Virtual 3D-Torus Rank-mapping
A rank-mapping option for topology-awareness
A 3D torus rank can be mapped to a 6D submesh even if there is an offline node
  This fault tolerance contributes to the system availability
[Figure: rings of ranks (numbered 0 to 7 and 0 to 11) laid out on a 6D submesh along the X, Y, Z, A, B and C axes]
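
One ingredient of mapping a torus onto a mesh can be illustrated by folding a ring onto a two-row mesh, so that consecutive ranks, including the wrap-around pair, are always direct mesh neighbors. The sketch below is only an illustration under that assumption: it is not Fujitsu's rank-mapping algorithm, it does not model offline-node avoidance, and the function names are hypothetical.

```python
def fold_ring_onto_mesh(n_cols):
    """Place a ring of 2 * n_cols ranks on a 2 x n_cols mesh.

    Ranks walk along row 0 left to right, then back along row 1 right to left.
    The returned list maps rank -> (row, column).
    """
    top = [(0, col) for col in range(n_cols)]
    bottom = [(1, col) for col in range(n_cols - 1, -1, -1)]
    return top + bottom


def is_ring_of_mesh_neighbors(path):
    """Check that every rank and its successor (with wrap-around) are one hop apart."""
    n = len(path)
    for i in range(n):
        (r0, c0), (r1, c1) = path[i], path[(i + 1) % n]
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False
    return True


if __name__ == "__main__":
    ring = fold_ring_onto_mesh(4)
    print(ring)
    print(is_ring_of_mesh_neighbors(ring))
```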

Tofu Barrier
Tofu Barrier offloads Barrier and Allreduce communications
  Barrier channel (BCH) is an interface
  Barrier gate (BG) is a communication engine
Tofu Barrier can execute an arbitrary communication algorithm
  Recursive-doubling algorithm uses log2(n) BGs in each process
  Reduce-broadcast algorithm uses a maximum of 5 BGs in each process
[Figure: BG connections between processes (Process 0 to Process 9); legend: BCH + start/end BG, intermediate BG]
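
The log2(n) figure for recursive doubling follows from the number of exchange stages the algorithm needs. The sketch below simulates a sum-Allreduce in software for a power-of-two number of processes and counts the stages; it illustrates the algorithm only and says nothing about how BGs are actually wired in hardware. The function name is hypothetical.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce over len(values) processes.

    Returns (per-process results, number of exchange stages).
    The process count must be a power of two in this simplified sketch.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of processes must be a power of two")
    current = list(values)
    stages = 0
    distance = 1
    while distance < n:
        # In each stage, rank r exchanges with the rank that differs in one bit,
        # and both keep the combined partial result.
        current = [current[r] + current[r ^ distance] for r in range(n)]
        distance *= 2
        stages += 1
    return current, stages


if __name__ == "__main__":
    results, stages = recursive_doubling_allreduce(list(range(8)))
    print(results, stages)
```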

Characteristics of Torus Network
System              Network              Total injection    Injection /   Bisection
                                         bandwidth (PB/s)   bisection     bandwidth (TB/s)
Blue Gene/Q         Torus (5D)           1.97               40X           49
K computer          Mesh/Torus (6D)      1.66               36X           46
                    Virtual Torus (3D)                                    34
Sunway TaihuLight   Tapered Fat-Tree     0.51               7.3X          70
Piz Daint           Dragonfly            0.07               2.0X          36
Summit              Fat-Tree             0.12               1.0X          115
Oakforest-PACS      Fat-Tree             0.10               1.0X          102
All systems have the same order of bisection bandwidth
  No significant performance difference in global data exchange
Torus networks have higher total injection bandwidth
  Topology-aware communication such as nearest-neighbor data exchange results in higher performance
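
The multipliers in the table (40X, 36X and so on) appear to be the total injection bandwidth divided by the bisection bandwidth. The short check below recomputes that ratio from the two bandwidth columns. The interpretation of the multiplier is an assumption that the arithmetic happens to support, and the names are hypothetical.

```python
# (system, total injection bandwidth in PB/s, bisection bandwidth in TB/s, ratio on the slide)
SYSTEMS = [
    ("Blue Gene/Q", 1.97, 49, 40.0),
    ("K computer", 1.66, 46, 36.0),
    ("Sunway TaihuLight", 0.51, 70, 7.3),
    ("Piz Daint", 0.07, 36, 2.0),
    ("Summit", 0.12, 115, 1.0),
    ("Oakforest-PACS", 0.10, 102, 1.0),
]


def injection_to_bisection(injection_pb_s, bisection_tb_s):
    """Ratio of total injection bandwidth to bisection bandwidth (1 PB/s = 1000 TB/s)."""
    return injection_pb_s * 1000.0 / bisection_tb_s


if __name__ == "__main__":
    for name, injection, bisection, slide_ratio in SYSTEMS:
        ratio = injection_to_bisection(injection, bisection)
        print(f"{name:18s} computed {ratio:5.1f}X, slide {slide_ratio}X")
```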
20th June 2019, ExaComm 2019 9 Copyright 2019 FUJITSU LIMITED The Design of TofuD
High Density Node Configuration Link Configuration and Injection Bandwidth Packaging Dynamic Packet Slicing Increased Tofu Barrier Resources
20th June 2019, ExaComm 2019 10 Copyright 2019 FUJITSU LIMITED High Density Node Configuration
The processor die size gets smaller from FX100 The off-chip channels are halved # memory stacks: 8 to 4 # high speed serial lanes for Tofu: 40 to 20 The area of Tofu interconnect shrinks to about 1/3 size
Tofu2 TofuD
HMC HMC HBM HBM
HMC HMC
HMC HMC HBM HBM
HMC HMC Fugaku – A64FX (7nm)
FX100 – SPARC64TM XIfx (20nm)
20th June 2019, ExaComm 2019 11 Copyright 2019 FUJITSU LIMITED High Density Node Configuration (cont.)
More resources are integrated into the CPU # CPU Memory Groups (NUMA nodes): 2 to 4 The expected number of processes per node is also doubled # Tofu Network Interfaces: 4 to 6 Provide more resources and accelerate collective communications
HMC HMC HMC HMC HBM2 HBM2 SPARC64 XIfx A64FX CMG CMG CMG PCIe PCIe c c c c c c c c c c c c c c c c c c c c c c c c c c c TNI0 TNI0 c c c c c c c c c c c c c c c c
TNI1 10 ports 10 TNI1 ports 10
10 ports 10 NOC TNI2
× × × TNI3 c c c c c c c c TNI2 c c c c c c c c
c c c c c c c c c c c c c c c c TNI4 lanes lanes c c c c c c c c TNI5 lanes
TNI3 2 c lanes 4
c c Network Router Tofu Tofu Network Network Router Tofu CMG Tofu2 CMG CMG TofuD HMC HMC HMC HMC HBM2 HBM2
20th June 2019, ExaComm 2019 12 Copyright 2019 FUJITSU LIMITED Link Configuration and Injection Bandwidth
Tofu1 Tofu2 TofuD Data rate (Gbps) 6.25 25.78125 28.05 Number of signal lanes per link 8 4 2 Link bandwidth (GB/s) 5.0 12.5 6.8 Number of TNIs per node 4 4 6 Injection bandwidth per node (GB/s) 20 50 40.8
Data transfer rate increased from 25 Gbps to 28 Gbps Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s TofuD simultaneously transmits in 6 directions Increased from 4 directions in the case of Tofu1 and Tofu2 Total injection bandwidth per node is 40.8 GB/s Approximately, twice that of Tofu1 or 80% that of Tofu2
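
The link and injection bandwidths in the table can be reproduced from the lane rate, the lane count and the number of TNIs if a line-coding efficiency is assumed. The slides do not state the coding; the sketch below assumes 8b/10b for Tofu1 and 64b/66b for Tofu2 and TofuD, which happens to match the quoted values. All names are hypothetical.

```python
# name: (lane rate in Gbps, lanes per link, ASSUMED coding efficiency, TNIs per node)
LINKS = {
    "Tofu1": (6.25, 8, 8 / 10, 4),       # 8b/10b is an assumption
    "Tofu2": (25.78125, 4, 64 / 66, 4),  # 64b/66b is an assumption
    "TofuD": (28.05, 2, 64 / 66, 6),     # 64b/66b is an assumption
}


def link_bandwidth_gb_s(rate_gbps, lanes, efficiency):
    """Payload bandwidth of one link in GB/s (8 bits per byte)."""
    return rate_gbps * lanes * efficiency / 8.0


def injection_bandwidth_gb_s(rate_gbps, lanes, efficiency, tnis):
    """Per-node injection bandwidth: one link's bandwidth per TNI."""
    return link_bandwidth_gb_s(rate_gbps, lanes, efficiency) * tnis


if __name__ == "__main__":
    for name, (rate, lanes, eff, tnis) in LINKS.items():
        link = link_bandwidth_gb_s(rate, lanes, eff)
        node = injection_bandwidth_gb_s(rate, lanes, eff, tnis)
        print(f"{name}: link {link:.1f} GB/s, injection {node:.1f} GB/s")
```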
20th June 2019, ExaComm 2019 13 Copyright 2019 FUJITSU LIMITED Packaging – CPU Memory Unit (CMU)
Two CPUs connected with C-axis X×Y×Z×A×B×C = 1×1×1×1×1×2 Two or three active optical cable (AOC) cages on the board Each cable bundles two lanes of signals from each of the two CPUs
CPU AOC (X) AOC AOC (Y) AOC AOC (Z) CPU
20th June 2019, ExaComm 2019 14 Copyright 2019 FUJITSU LIMITED Packaging – Rack Structure
Rack 8 shelves Rack (prototype) 192 CMUs (384 CPUs)
Shelf 24 CMUs (48 CPUs) X×Y×Z×A×B×C = 1×1×4×2×3×2
Top or bottom half of rack 4 shelves X×Y×Z×A×B×C = 2×2×4×2×3×2
Shelves
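
The CPU counts and the 6D shapes quoted for each packaging level can be cross-checked against each other: the product of the six axis sizes should equal the number of CPUs at that level. The following sketch does that check, with hypothetical names and with the half-rack CMU count derived as 4 shelves × 24 CMUs.

```python
from math import prod

CPUS_PER_CMU = 2

# level: (number of CMUs, X x Y x Z x A x B x C shape from the slides)
LEVELS = {
    "CMU": (1, (1, 1, 1, 1, 1, 2)),
    "shelf": (24, (1, 1, 4, 2, 3, 2)),
    "half rack": (4 * 24, (2, 2, 4, 2, 3, 2)),  # 4 shelves of 24 CMUs
}


def cpus_from_cmus(cmus):
    """CPU count derived from the number of CMUs."""
    return cmus * CPUS_PER_CMU


def cpus_from_shape(shape):
    """CPU count derived from the 6D submesh shape."""
    return prod(shape)


if __name__ == "__main__":
    for level, (cmus, shape) in LEVELS.items():
        print(level, cpus_from_cmus(cmus), cpus_from_shape(shape))
    # A full rack is two half racks.
    print("rack", 2 * cpus_from_shape(LEVELS["half rack"][1]))
```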
20th June 2019, ExaComm 2019 15 Copyright 2019 FUJITSU LIMITED Dynamic Packet Slicing – Split Mode
The physical layer of TofuD is independent for each lane In the ordinary multi-lane transmission, the physical layer has media- independent interface and hides the number of signal lanes A packet is sliced and each is injected into a different lane The routing header of the packet is copied to both slices for virtual cut- through packet transfer This is normal operation and is called split mode Slice 0 Slice 0
Slice 1 Slice 1 Packet Routing Header
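
Split mode can be pictured in software as cutting the packet body in two and giving each slice its own copy of the routing header. The sketch below assumes contiguous halves and a byte-string packet. The actual slicing granularity and frame format are not given on the slides, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    lane: int
    routing_header: bytes  # copied into every slice so each lane can be routed
    payload: bytes


def split_packet(routing_header, payload, lanes=2):
    """Cut the payload into `lanes` contiguous pieces, one slice per lane."""
    chunk = -(-len(payload) // lanes)  # ceiling division
    return [
        Slice(lane, routing_header, payload[lane * chunk:(lane + 1) * chunk])
        for lane in range(lanes)
    ]


def reassemble(slices):
    """Rebuild (routing_header, payload) from the slices of one packet."""
    ordered = sorted(slices, key=lambda s: s.lane)
    return ordered[0].routing_header, b"".join(s.payload for s in ordered)


if __name__ == "__main__":
    parts = split_packet(b"HDR", b"0123456789")
    print(parts)
    print(reassemble(parts))
```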
20th June 2019, ExaComm 2019 16 Copyright 2019 FUJITSU LIMITED Dynamic Packet Slicing – Duplicate Mode
When the error rate is high, the operation mode falls down to duplicate mode that duplicates packets If the error rate returns to low, the link can return to split mode Each lane is never disconnected independently The error rates of both lanes are continuously monitored and fed back
Error rate feedback Slice 0 Packet
Slice 1 Packet Packet Routing Header
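
The switch between split mode and duplicate mode can be sketched as a small state machine driven by the monitored lane error rates. The two thresholds and the hysteresis between them are illustrative assumptions, as are all names; the slides do not specify how the hardware decides.

```python
SPLIT, DUPLICATE = "split", "duplicate"


class LinkModeController:
    """Chooses the link mode from the error rates reported for both lanes."""

    def __init__(self, high=1e-6, low=1e-9):  # hypothetical thresholds
        if low > high:
            raise ValueError("low threshold must not exceed high threshold")
        self.high = high
        self.low = low
        self.mode = SPLIT  # split mode is normal operation

    def update(self, lane_error_rates):
        """Feed back the latest per-lane error rates and return the new mode."""
        worst = max(lane_error_rates)
        if self.mode == SPLIT and worst >= self.high:
            self.mode = DUPLICATE
        elif self.mode == DUPLICATE and worst <= self.low:
            self.mode = SPLIT
        return self.mode


def lane_contents(mode, packet):
    """What each of the two lanes carries for one packet in the given mode."""
    if mode == DUPLICATE:
        return [packet, packet]  # both lanes stay in use, carrying copies
    half = -(-len(packet) // 2)
    return [packet[:half], packet[half:]]


if __name__ == "__main__":
    ctrl = LinkModeController()
    for rates in [(1e-12, 1e-12), (1e-5, 1e-12), (1e-8, 1e-12), (1e-10, 1e-11)]:
        mode = ctrl.update(rates)
        print(rates, mode, lane_contents(mode, b"packet"))
```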
20th June 2019, ExaComm 2019 17 Copyright 2019 FUJITSU LIMITED Increased Tofu Barrier Resources
Tofu1/2 TofuD Number of BCHs 8 16 TNI Number of BGs 64 48 Number of TNI w/ Tofu Barrier 1 6 Node Number of BCHs 8 96 Number of BGs 64 288
The number of Tofu Barrier resources significantly increased All 6 TNIs of TofuD have Tofu barrier Only TNI #0 of Tofu1/2 has Tofu barrier This change intended to support intra-node synchronization
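
The per-node rows of the table follow from the per-TNI rows multiplied by the number of TNIs that carry Tofu Barrier. A short check, with hypothetical names:

```python
# generation: (BCHs per TNI, BGs per TNI, TNIs with Tofu Barrier)
RESOURCES = {
    "Tofu1/2": (8, 64, 1),
    "TofuD": (16, 48, 6),
}


def per_node(generation):
    """Return (BCHs per node, BGs per node) for the given generation."""
    bchs, bgs, tnis = RESOURCES[generation]
    return bchs * tnis, bgs * tnis


if __name__ == "__main__":
    for generation in RESOURCES:
        print(generation, per_node(generation))
```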
20th June 2019, ExaComm 2019 18 Copyright 2019 FUJITSU LIMITED Performance Evaluations
Put Latencies Latency Breakdown Injection Rates Tofu Barrier
20th June 2019, ExaComm 2019 19 Copyright 2019 FUJITSU LIMITED Put Latencies
8B Put transfer between nodes on the same board The low-latency features were used Communication settings Latency Tofu1 Descriptor on main memory 1.15 µs Direct Descriptor 0.91 µs Tofu2 Cache injection OFF 0.87 µs 0.20 µs Cache injection ON 0.71 µs TofuD To/From far CMGs 0.54 µs 0.22 µs To/From near CMGs 0.49 µs Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1 The cache injection feature contributed to this reduction TofuD reduced the Put latency by 0.22 μs from that of Tofu2
20th June 2019, ExaComm 2019 20 Copyright 2019 FUJITSU LIMITED Latency Breakdown
1000 Rx CPU 900 Rx Host bus Rx TNI 800 Cache injection Packet Transfer 700 Tx TNI Tx Host bus 600 Tx CPU Tx Optimization 500
400 Increased Overhead in Physical Layer Latency (nsec) Latency 300 Overhead Reduced
200
100 Rx Optimization
0 Tofu1 Tofu2 TofuD The overhead increase in Tofu2 has been reduced
20th June 2019, ExaComm 2019 21 Copyright 2019 FUJITSU LIMITED Injection Rates per Node
Simultaneous Put transfers to the nearest-neighbor nodes Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs Injection rate Efficiency Tofu1 (K) 15.0 GB/s 77 % Tofu1 (FX10) 17.6 GB/s 88 % Tofu2 45.8 GB/s 92 % TofuD 38.1 GB/s 93 % The efficiencies of Tofu1 were lower than 90% Because of a bottleneck in the bus that connects CPU and ICC The efficiencies of Tofu2 and TofuD exceeded 90 % Integration into the processor chip removed the bottleneck
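
The efficiency column can be approximated as the measured injection rate divided by the nominal injection bandwidth per node from the Link Configuration slide (20, 50 and 40.8 GB/s). That definition is an assumption, and the names below are hypothetical. It reproduces the FX10, Tofu2 and TofuD figures after rounding, while the K computer row comes out at 75 % rather than the 77 % on the slide, so that entry presumably rests on an unrounded measurement or a different baseline.

```python
# system: (measured injection rate in GB/s, nominal injection bandwidth in GB/s)
MEASURED = {
    "Tofu1 (K)": (15.0, 20.0),
    "Tofu1 (FX10)": (17.6, 20.0),
    "Tofu2": (45.8, 50.0),
    "TofuD": (38.1, 40.8),
}


def efficiency_percent(measured_gb_s, nominal_gb_s):
    """Measured rate as a percentage of the nominal injection bandwidth."""
    return 100.0 * measured_gb_s / nominal_gb_s


if __name__ == "__main__":
    for system, (measured, nominal) in MEASURED.items():
        print(f"{system:13s} {efficiency_percent(measured, nominal):5.1f} %")
```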
20th June 2019, ExaComm 2019 22 Copyright 2019 FUJITSU LIMITED Tofu Barrier – Intra-Node Synchronization
The test program synchronized multiple BCHs in a node For 8 and 16 BCHs, some TNIs are shared by multiple BCHs Sharing TNI causes the serialization of BCHs/BGs processing Number of BCHs 1 4 8 16 48 Number of used TNIs 1 4 6 6 6 Number of communication stages 2 2 4 6 9 Max. number of BCHs per TNI 1 1 2 3 8 Max. number of BGs per TNI 2 2 5 9 24
Two delay estimation approaches One approach only considers the number of communication stages and the delays of BCH (0.48 μs) and BG (0.13 μs) The other approach considers the serialization of BCH and BG processing using the same TNI
20th June 2019, ExaComm 2019 23 Copyright 2019 FUJITSU LIMITED Tofu Barrier – Results
5.0 Estimate considering serialization Evaluation Result 4.0 Simple Estimate
sec) 3.0 μ
2.0
underestimate Latency( 1.0
0.0 1 4 16 64 Number of BCHs per node The simple estimate results were too low The estimates considering the serialization of BCHs/BGs processing were consistent with the evaluation results MPI needs to assign BCHs in the same synchronization group to different TNIs to avoid serialization penalties
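
One simple way to spread the BCHs of a synchronization group over the TNIs is round-robin assignment. The sketch below is a hypothetical policy, not the behavior of any particular MPI library. With 6 TNIs it yields the same "number of used TNIs" and "max. number of BCHs per TNI" values as the intra-node synchronization table for 1, 4, 8, 16 and 48 BCHs.

```python
from collections import Counter

NUM_TNIS = 6  # TNIs per TofuD node


def assign_round_robin(num_bchs, num_tnis=NUM_TNIS):
    """Return the TNI index chosen for each BCH, cycling through the TNIs."""
    return [bch % num_tnis for bch in range(num_bchs)]


def sharing_stats(num_bchs, num_tnis=NUM_TNIS):
    """Return (number of used TNIs, maximum number of BCHs on any one TNI)."""
    load = Counter(assign_round_robin(num_bchs, num_tnis))
    return len(load), max(load.values())


if __name__ == "__main__":
    for n in (1, 4, 8, 16, 48):
        print(n, sharing_stats(n))
```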
20th June 2019, ExaComm 2019 24 Copyright 2019 FUJITSU LIMITED Summary
Supercomputer is a technology driver for Fujitsu’s computing products in technologies including packaging and interconnect TofuD is developed for Fugaku and designed for high-density node configuration and fault-resilience The design of TofuD Node configuration, injection rate and packaging Dynamic Packet Slicing Increased Tofu Barrier Resources The evaluation results of TofuD Latency was 0.49 μs, which was reduced by 0.22 μs from that of Tofu2 Injection rate was 38.1 GB/s and the link efficiency exceeds 90% The evaluation results of Tofu Barrier showed the serialization penalties which a proper MPI implementation can avoid
20th June 2019, ExaComm 2019 25 Copyright 2019 FUJITSU LIMITED 26 Copyright 2019 FUJITSU LIMITED