The Tofu Interconnect D for

20th June 2019 Yuichiro Ajima Limited

20th June 2019, ExaComm 2019. Copyright 2019 FUJITSU LIMITED

Overview of Fujitsu’s Computing Products

• Fujitsu continues to develop general-purpose computing products and new domain-specific computing devices

[Figure: map of Fujitsu’s computing products. General Purpose Computing Products: servers with Fujitsu’s own processors, mainframes, SPARC servers, x86 servers, supercomputers, PC clusters, High Performance Computing. Domain Specific Computing Devices: deep learning, combinatorial problems, quantum computing.]

Role of Supercomputer in Fujitsu Products

• The supercomputer is a technology driver for Fujitsu’s computing products, in technologies including packaging and interconnect

Development of Packaging Technology

[Figure: timeline (2003, 2009, 2012, 2015, 2021) of HPC2500, FX1, K computer, FX10, FX100, and Fugaku (prototype image), annotated with the milestones single-socket node, water cooling, 3D-stacked memory, and 2.5D package]

• Fujitsu has developed single-socket-node, water-cooled supercomputers with 3D-stacked memory
• Fugaku will integrate memory stacks into a CPU package

Development of Interconnect Technology

[Figure: timeline (2003, 2009, 2012, 2015, 2021) of HPC2500, FX1, K computer, FX10, FX100, and Fugaku (prototype image), annotated with the interconnects DTU, InfiniBand, Tofu1, Tofu2, and TofuD]

• Tofu Interconnect (Tofu1) for K computer
  - 6D mesh/torus network, virtual torus rank mapping, Tofu Barrier
• Tofu2 added new functions: atomic operations and cache injection
• Tofu Interconnect D (TofuD) for Fugaku
  - Increased resources for high-density node configuration
  - Fault resilience: dynamic packet slicing

Features of the Tofu interconnect family

• 6D Mesh/Torus Network
• Virtual 3D-Torus Rank-mapping
• Tofu Barrier
• Characteristics of Torus Network

6D Mesh/Torus Network

• Six coordinate axes: X, Y, Z, A, B, C
  - X, Y, Z: the size varies according to the system configuration
  - A, B, C: the size is fixed to 2×3×2
• Tofu stands for “torus fusion”: (X, Y, Z)×(A, B, C)

[Figure: X×Y×Z×2×3×2 network drawn with axes X, Y, Z, A, B, C]
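
As a rough illustration of the coordinate system on this slide, the following Python sketch counts nodes and converts between a linear node index and (X, Y, Z, A, B, C) coordinates. The row-major ordering and all function names are assumptions made for this example; they are not Tofu's actual node numbering.

```python
ABC_SHAPE = (2, 3, 2)  # sizes of the A, B, C axes are fixed (from the slide)


def node_count(x, y, z):
    """Number of nodes in an X x Y x Z x 2 x 3 x 2 system."""
    a, b, c = ABC_SHAPE
    return x * y * z * a * b * c


def index_to_coords(index, xyz_shape):
    """Hypothetical row-major mapping: linear index -> (X, Y, Z, A, B, C)."""
    shape = tuple(xyz_shape) + ABC_SHAPE
    coords = []
    for size in reversed(shape):      # peel off the fastest-varying axis first
        index, remainder = divmod(index, size)
        coords.append(remainder)
    return tuple(reversed(coords))


def coords_to_index(coords, xyz_shape):
    """Inverse of index_to_coords under the same assumed ordering."""
    shape = tuple(xyz_shape) + ABC_SHAPE
    index = 0
    for coord, size in zip(coords, shape):
        index = index * size + coord
    return index
```

For example, `node_count(1, 1, 1)` gives the 12 nodes of a single 2×3×2 unit, and `index_to_coords(1, (2, 2, 4))` moves one step along the C axis under the assumed ordering.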

Virtual 3D-Torus Rank-mapping

• A rank-mapping option for topology awareness
• A 3D-torus rank layout can be mapped to a 6D submesh even if there is an offline node
• This fault tolerance contributes to system availability

[Figure: numbered ranks of a virtual torus laid out across a 6D submesh, with axes X, Y, Z, A, B, C]

Tofu Barrier

• Tofu Barrier offloads Barrier and Allreduce communications
  - The barrier channel (BCH) is an interface
  - The barrier gate (BG) is a communication engine
• Tofu Barrier can execute an arbitrary communication algorithm
  - The recursive-doubling algorithm uses log2(n) BGs in each process
  - The reduce-broadcast algorithm uses a maximum of 5 BGs in each process

[Figure: BG chains for Process 0 through Process 9; legend: BCH + start/end BG, intermediate BG]
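
The log2(n) figure for recursive doubling can be seen in a small software simulation. The sketch below is plain Python, not the Tofu Barrier hardware interface: in stage k, process r exchanges its partial result with process r XOR 2^k, so n processes finish in log2(n) stages. Reading one stage as one BG per process is an interpretation of the slide, and the function name is hypothetical.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce across len(values) processes.

    Returns (final value held by each process, number of stages).
    This sketch handles power-of-two process counts only.
    """
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("process count must be a power of two")
    stages = n.bit_length() - 1          # log2(n) for a power of two
    current = list(values)
    for k in range(stages):
        # Every process adds the partial sum of its partner r XOR 2**k.
        current = [current[r] + current[r ^ (1 << k)] for r in range(n)]
    return current, stages
```

With eight processes holding 0 to 7, every process ends with 28 after three stages.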

Characteristics of Torus Network

System           Network              Total injection bandwidth (PB/s)   Bisection bandwidth (TB/s)
Blue Gene/Q      Torus (5D)           1.97 (40X)                         49
K Computer       Mesh/Torus (6D)      1.66 (36X)                         46
                 Virtual Torus (3D)                                      34
TaihuLight       Tapered Fat-Tree     0.51 (7.3X)                        70
Piz Daint        Dragonfly            0.07 (2.0X)                        36
(not named)      Fat-Tree             0.12 (1.0X)                        115
Oakforest-PACS   Fat-Tree             0.10 (1.0X)                        102

• All systems have the same order of bisection bandwidth
  - No significant performance difference in global data exchange
• Torus networks have higher total injection bandwidth
  - Topology-aware communication such as nearest-neighbor data exchange results in higher performance
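
The multipliers in the table (40X, 36X, and so on) appear to be the ratio of total injection bandwidth to bisection bandwidth; that reading is an inference from the numbers, not something the slide states. The sketch below recomputes the ratio from the rounded table values, so small differences from the printed multipliers are expected.

```python
# (system, network, total injection in PB/s, bisection in TB/s), from the table
ROWS = [
    ("Blue Gene/Q", "Torus (5D)", 1.97, 49),
    ("K Computer", "Mesh/Torus (6D)", 1.66, 46),
    ("TaihuLight", "Tapered Fat-Tree", 0.51, 70),
    ("Piz Daint", "Dragonfly", 0.07, 36),
    ("Oakforest-PACS", "Fat-Tree", 0.10, 102),
]


def injection_to_bisection(injection_pb_s, bisection_tb_s):
    """Ratio of total injection to bisection bandwidth (1 PB/s = 1000 TB/s)."""
    return injection_pb_s * 1000.0 / bisection_tb_s


for system, network, injection, bisection in ROWS:
    ratio = injection_to_bisection(injection, bisection)
    print(f"{system:15s} {network:17s} {ratio:5.1f}X")
```

The torus systems come out well above the fat-tree and dragonfly systems, which is the point the second bullet makes.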

The Design of TofuD

• High Density Node Configuration
• Link Configuration and Injection Bandwidth
• Packaging
• Dynamic Packet Slicing
• Increased Tofu Barrier Resources

High Density Node Configuration

• The processor die is smaller than that of FX100
• The off-chip channels are halved
  - Number of memory stacks: 8 to 4
  - Number of high-speed serial lanes for Tofu: 40 to 20
• The area of the Tofu interconnect shrinks to about 1/3

[Figure: chip layouts of FX100’s SPARC64™ XIfx (20 nm, Tofu2, eight HMC stacks) and Fugaku’s A64FX (7 nm, TofuD, four HBM stacks)]

High Density Node Configuration (cont.)

• More resources are integrated into the CPU
• Number of CPU Memory Groups (CMGs, NUMA nodes): 2 to 4
  - The expected number of processes per node is also doubled
• Number of Tofu Network Interfaces (TNIs): 4 to 6
  - Provides more resources and accelerates collective communications

[Figure: block diagrams of SPARC64 XIfx (Tofu2: CMGs, HMC, TNI0 to TNI3) and A64FX (TofuD: CMGs, HBM2, NOC, PCIe, TNI0 to TNI5), each with a Tofu network router with 10 ports]

Link Configuration and Injection Bandwidth

                                      Tofu1   Tofu2      TofuD
Data rate (Gbps)                      6.25    25.78125   28.05
Number of signal lanes per link       8       4          2
Link bandwidth (GB/s)                 5.0     12.5       6.8
Number of TNIs per node               4       4          6
Injection bandwidth per node (GB/s)   20      50         40.8

• Data transfer rate increased from 25 Gbps to 28 Gbps
• Link bandwidth reduced from 12.5 GB/s to 6.8 GB/s
• TofuD simultaneously transmits in 6 directions
  - Increased from 4 directions in the case of Tofu1 and Tofu2
• Total injection bandwidth per node is 40.8 GB/s
  - Approximately twice that of Tofu1, or 80% of that of Tofu2

Packaging – CPU Memory Unit (CMU)

• Two CPUs are connected along the C-axis
  - X×Y×Z×A×B×C = 1×1×1×1×1×2
• Two or three active optical cable (AOC) cages are on the board
  - Each cable bundles two lanes of signals from each of the two CPUs

[Figure: CMU board with two CPUs and AOC cages for the X, Y, and Z axes]

Packaging – Rack Structure

• Rack
  - 8 shelves
  - 192 CMUs (384 CPUs)
• Shelf
  - 24 CMUs (48 CPUs)
  - X×Y×Z×A×B×C = 1×1×4×2×3×2
• Top or bottom half of rack
  - 4 shelves
  - X×Y×Z×A×B×C = 2×2×4×2×3×2

[Figure: photo of a prototype rack, with the shelves labeled]
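
The shelf and rack counts follow directly from the 6D shapes. The short sketch below multiplies out the dimensions given on this slide; the constant and function names are chosen for this example only.

```python
from math import prod

# 6D shapes (X, Y, Z, A, B, C) as listed on the slide
SHELF = (1, 1, 4, 2, 3, 2)
HALF_RACK = (2, 2, 4, 2, 3, 2)
CPUS_PER_CMU = 2  # one CMU carries two CPUs joined along the C-axis


def cpus(shape):
    """One CPU (node) per 6D coordinate."""
    return prod(shape)


def cmus(shape):
    return cpus(shape) // CPUS_PER_CMU


print("shelf:", cmus(SHELF), "CMUs,", cpus(SHELF), "CPUs")
print("rack: ", 2 * cmus(HALF_RACK), "CMUs,", 2 * cpus(HALF_RACK), "CPUs")
```

A half rack holds four shelves, which is consistent with its X and Y sizes each being twice those of a shelf.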

Dynamic Packet Slicing – Split Mode

• The physical layer of TofuD is independent for each lane
  - In ordinary multi-lane transmission, the physical layer has a media-independent interface and hides the number of signal lanes
• A packet is sliced into two, and each slice is injected into a different lane
  - The routing header of the packet is copied to both slices for virtual cut-through packet transfer
• This is normal operation and is called split mode

[Figure: a packet with its routing header split into Slice 0 and Slice 1, one per lane]

Dynamic Packet Slicing – Duplicate Mode

• When the error rate is high, the link falls back to duplicate mode, which duplicates each packet across the lanes
  - If the error rate returns to low, the link can return to split mode
• A lane is never disconnected on its own
  - The error rates of both lanes are continuously monitored and fed back

[Figure: the whole packet with its routing header sent on both lanes, with error rate feedback]
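
The two modes can be pictured with a toy model. The sketch below is only a software analogy: the thresholds, the hysteresis rule, the use of the worse lane's error rate, and the way the payload is cut in half are all assumptions of this example, since the slides give no numeric values or slicing details.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the slides do not state any numbers.
HIGH_ERROR_RATE = 1e-6
LOW_ERROR_RATE = 1e-9


@dataclass
class LinkState:
    mode: str = "split"  # "split" (normal operation) or "duplicate"


def update_mode(state, lane_error_rates):
    """Switch modes based on the worse of the two monitored lane error rates."""
    worst = max(lane_error_rates)
    if state.mode == "split" and worst > HIGH_ERROR_RATE:
        state.mode = "duplicate"
    elif state.mode == "duplicate" and worst < LOW_ERROR_RATE:
        state.mode = "split"
    return state.mode


def transmit(header, payload, mode):
    """Return the bytes placed on lane 0 and lane 1 for one packet."""
    if mode == "split":
        half = (len(payload) + 1) // 2
        # The routing header is copied to both slices.
        return header + payload[:half], header + payload[half:]
    # Duplicate mode: each lane carries the whole packet.
    return header + payload, header + payload
```

In this model both lanes stay in use in either mode, which mirrors the statement that a lane is never disconnected on its own.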

Increased Tofu Barrier Resources

                                       Tofu1/2   TofuD
Per TNI    Number of BCHs              8         16
           Number of BGs               64        48
Number of TNIs with Tofu Barrier       1         6
Per node   Number of BCHs              8         96
           Number of BGs               64        288

• The number of Tofu Barrier resources is significantly increased
  - All 6 TNIs of TofuD have Tofu Barrier
  - Only TNI #0 of Tofu1/2 has Tofu Barrier
• This change is intended to support intra-node synchronization
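
The per-node rows of the table are the per-TNI rows multiplied by the number of TNIs that have Tofu Barrier. A minimal check, with dictionary keys and the function name invented for this example:

```python
# Per-TNI resources and number of barrier-capable TNIs, from the table
RESOURCES = {
    "Tofu1/2": {"bch_per_tni": 8, "bg_per_tni": 64, "tnis_with_barrier": 1},
    "TofuD": {"bch_per_tni": 16, "bg_per_tni": 48, "tnis_with_barrier": 6},
}


def per_node(generation):
    """Return (BCHs per node, BGs per node)."""
    r = RESOURCES[generation]
    return (r["bch_per_tni"] * r["tnis_with_barrier"],
            r["bg_per_tni"] * r["tnis_with_barrier"])


for generation in RESOURCES:
    print(generation, per_node(generation))
```

Although each TofuD TNI has fewer BGs than the single barrier-capable TNI of Tofu1/2, the node total grows because all six TNIs contribute.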

Performance Evaluations

• Put Latencies
• Latency Breakdown
• Injection Rates
• Tofu Barrier

Put Latencies

• 8B Put transfer between nodes on the same board
• The low-latency features were used

        Communication settings      Latency    Reduction from previous generation
Tofu1   Descriptor on main memory   1.15 μs
        Direct Descriptor           0.91 μs
Tofu2   Cache injection OFF         0.87 μs
        Cache injection ON          0.71 μs    0.20 μs
TofuD   To/From far CMGs            0.54 μs
        To/From near CMGs           0.49 μs    0.22 μs

• Tofu2 reduced the Put latency by 0.20 μs from that of Tofu1
  - The cache injection feature contributed to this reduction
• TofuD reduced the Put latency by 0.22 μs from that of Tofu2
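
The two quoted reductions compare the best setting of each generation with the best setting of the one before. A small sketch of that arithmetic, with names chosen for this example:

```python
# Lowest measured Put latency per generation, in microseconds (from the table)
BEST_PUT_LATENCY_US = {"Tofu1": 0.91, "Tofu2": 0.71, "TofuD": 0.49}


def reduction_us(older, newer):
    """Latency saved by the newer generation, rounded to two decimals."""
    return round(BEST_PUT_LATENCY_US[older] - BEST_PUT_LATENCY_US[newer], 2)


print(reduction_us("Tofu1", "Tofu2"))
print(reduction_us("Tofu2", "TofuD"))
```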

Latency Breakdown

[Figure: stacked bar chart of latency (nsec, 0 to 1000) for Tofu1, Tofu2, and TofuD. Legend: Rx CPU, Rx host bus, Rx TNI, packet transfer, Tx TNI, Tx host bus, Tx CPU. Annotations: Tx optimization, Rx optimization, cache injection, increased overhead in physical layer, overhead reduced]

• The overhead increase in Tofu2 has been reduced in TofuD

Injection Rates per Node

• Simultaneous Put transfers to the nearest-neighbor nodes
  - Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs

               Injection rate   Efficiency
Tofu1 (K)      15.0 GB/s        77 %
Tofu1 (FX10)   17.6 GB/s        88 %
Tofu2          45.8 GB/s        92 %
TofuD          38.1 GB/s        93 %

• The efficiencies of Tofu1 were lower than 90 %
  - Because of a bottleneck in the bus that connects the CPU and the ICC
• The efficiencies of Tofu2 and TofuD exceeded 90 %
  - Integration into the processor chip removed the bottleneck

Tofu Barrier – Intra-Node Synchronization

• The test program synchronized multiple BCHs in a node
  - For 8 and 16 BCHs, some TNIs are shared by multiple BCHs
  - Sharing a TNI serializes the processing of its BCHs and BGs

Number of BCHs                   1   4   8   16   48
Number of used TNIs              1   4   6   6    6
Number of communication stages   2   2   4   6    9
Max. number of BCHs per TNI      1   1   2   3    8
Max. number of BGs per TNI       2   2   5   9    24

• Two delay estimation approaches
  - One approach considers only the number of communication stages and the delays of a BCH (0.48 μs) and a BG (0.13 μs)
  - The other approach also considers the serialization of BCH and BG processing on the same TNI
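
Two pieces of this slide lend themselves to a quick calculation. The "Max. number of BCHs per TNI" row is the ceiling of BCHs divided by used TNIs, which the sketch below checks against the table. The simple estimate is shown in one plausible form, a single BCH delay plus one BG delay per communication stage; that formula is an assumption of this example, and the slides do not spell out either estimate, so no serialization-aware formula is attempted here.

```python
from math import ceil

BCH_DELAY_US = 0.48  # from the slide
BG_DELAY_US = 0.13   # from the slide

# number of BCHs -> (used TNIs, communication stages), from the table
CASES = {1: (1, 2), 4: (4, 2), 8: (6, 4), 16: (6, 6), 48: (6, 9)}


def max_bchs_per_tni(bchs, tnis):
    """BCHs spread as evenly as possible over the used TNIs."""
    return ceil(bchs / tnis)


def simple_estimate_us(stages):
    """ASSUMED form of the simple estimate: one BCH delay + one BG delay per stage."""
    return BCH_DELAY_US + stages * BG_DELAY_US


for bchs, (tnis, stages) in CASES.items():
    print(bchs, max_bchs_per_tni(bchs, tnis), round(simple_estimate_us(stages), 2))
```

Under this assumed form the estimate grows only with the stage count, so it cannot reflect the extra delay when several BCHs queue on the same TNI.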

Tofu Barrier – Results

[Figure: latency (μs, 0.0 to 5.0) versus number of BCHs per node (1 to 64, logarithmic axis), comparing the simple estimate, the estimate considering serialization, and the evaluation result; the simple estimate is marked as an underestimate]

• The simple estimates were too low
• The estimates that consider the serialization of BCH/BG processing were consistent with the evaluation results
• MPI needs to assign BCHs in the same synchronization group to different TNIs to avoid serialization penalties

Summary

• The supercomputer is a technology driver for Fujitsu’s computing products, in technologies including packaging and interconnect
• TofuD is developed for Fugaku and designed for high-density node configuration and fault resilience
• The design of TofuD
  - Node configuration, injection rate, and packaging
  - Dynamic Packet Slicing
  - Increased Tofu Barrier Resources
• The evaluation results of TofuD
  - Latency was 0.49 μs, 0.22 μs lower than that of Tofu2
  - Injection rate was 38.1 GB/s, and the link efficiency exceeded 90%
• The evaluation results of Tofu Barrier showed serialization penalties, which a proper MPI implementation can avoid
