Networks: How to Design

Gilad Shainer, [email protected]

2 TOP500 Statistics

3 World Leading Large-Scale Systems

• National Supercomputing Centre in Shenzhen – Fat-tree, 5.2K nodes, 120K cores, NVIDIA GPUs, China (Petaflop)

• Tokyo Institute of Technology – Fat-tree, 4K nodes, NVIDIA GPUs, Japan (Petaflop)

• Commissariat a l'Energie Atomique (CEA) – Fat-tree, 4K nodes, 140K cores, France (Petaflop)

• Los Alamos National Lab - Roadrunner – Fat-tree, 4K nodes, 130K cores, USA (Petaflop)

• NASA – , 9.2K nodes, 82K cores – NASA, USA

• Jülich JuRoPa – Fat-tree, 3K nodes, 30K cores, Germany

• Sandia National Labs – Red Sky – 3D-Torus, 5.4K nodes, 43K cores, USA

4 ORNL “Spider” System – Lustre File System

• Oak Ridge National Lab central storage system
– 13400 drives
– 192 Lustre OSS
– 240GB/s bandwidth
– InfiniBand interconnect
– 10PB capacity

5 Network Topologies

• Fat-tree (CLOS), Mesh, 3D-Torus topologies

• CLOS (fat-tree)
– Can be fully non-blocking (1:1) or blocking (x:1)
– Typically enables best performance
  • Non-blocking bandwidth, lowest network latency

• Mesh or 3D Torus
– Blocking network, cost-effective for systems at scale
– Great performance solutions for applications with locality
– Support for dedicated sub-networks
– Simple expansion for future growth

[Figure: 3x3 grid of nodes labeled 0,0 through 2,2]

6 d-Dimensional Torus

• Formal definition

– T=(V,E) is said to be a d-dimensional torus of size N1xN2x…xNd if:

• V={(v1,v2,…,vd) : 0 ≤ vi ≤ Ni - 1 for each i}

• E={(u,v) : there exists j such that 1) for each i≠j, vi=ui AND 2) vj=(uj±1) mod Nj}

• Examples

– N1=5: a ring of five nodes, 0 1 2 3 4

– N1=N2=3: a 3x3 grid of nodes, 0,0 through 2,2
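The formal definition can be turned into a few lines of Python. This is a minimal sketch, not part of the original slides; the function names `torus_vertices`, `torus_neighbors` and `torus_edges` are hypothetical and chosen for illustration only.

```python
from itertools import product

def torus_vertices(sizes):
    """All vertices (v1, ..., vd) with 0 <= vi <= Ni - 1."""
    return list(product(*(range(n) for n in sizes)))

def torus_neighbors(v, sizes):
    """Vertices that differ from v by +/-1 (mod Nj) in exactly one dimension j."""
    result = set()
    for j, n in enumerate(sizes):
        for step in (1, -1):
            u = list(v)
            u[j] = (v[j] + step) % n
            if tuple(u) != v:  # skip self-loops when Nj == 1
                result.add(tuple(u))
    return sorted(result)

def torus_edges(sizes):
    """Undirected edge set E, each edge stored once as a frozenset {u, v}."""
    return {frozenset((v, u))
            for v in torus_vertices(sizes)
            for u in torus_neighbors(v, sizes)}

# The two examples from the slide: a 5-node ring and a 3x3 torus.
print(torus_neighbors((0,), (5,)))
print(torus_neighbors((0, 0), (3, 3)))
print(len(torus_edges((5,))), len(torus_edges((3, 3))))
```

In the 3x3 case every node has four neighbors because of the wraparound links, which is what distinguishes a torus from a plain mesh.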

7 3D-Torus System – Key Items

• Multiple server nodes per cube junction
• The smaller the 3D cube size, the better
– Lowest latency between remote nodes
– Minimizes throughput contention
• Ability to connect storage
• Support for separate networks
– Dedicated network (links) for specific applications/usage
– Example: links dedicated for collectives or specific jobs

8 InfiniBand 3D Torus

9 Routing for 3D Torus (Avoiding Deadlocks)

• Setting up routing might look simple
– Just route packets on the shortest path between source and destination

• In lossless networks trivial routing can be disastrous

Communication pairs 1. 02 2. 13 3. 24

4. 30 0 2 1 3 2 4 3 0 4 1 5. 41

10 Avoiding Deadlock – Restrictive Approach

• Idea
– Define a set of rules forbidding usage of some resources, or a (temporal) combination of resources, which will guarantee freedom from deadlock
– Design a routing complying with the rules

[Figure: ring example with routes chosen to comply with the rules]

11 Avoiding Deadlock – Separation Approach

• Idea
– Decompose each (unidirectional) physical link into several logical channels with private buffer resources
– Use logical channels to separate the network into virtual networks, each dependency-cycle-free
– Assign communication pairs (with their paths) to the virtual networks

• Back to our routing: shortest path
– Virtual mapping: if a path uses the 0→4 or 4→0 link, assign it to the red virtual network, else to the black one

[Figure: ring of nodes 0 to 4 with paths colored red or black]
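The red/black mapping can be checked mechanically. The sketch below is illustrative only: it assumes a five-node ring, the five communication pairs from the earlier routing slide (0→2, 1→3, 2→4, 3→0, 4→1), and clockwise shortest paths. The function names are hypothetical, and the acyclicity check repeatedly peels off channels that nothing else is waiting on.

```python
from collections import defaultdict

WRAP_LINKS = {(0, 4), (4, 0)}  # paths touching these links go to the red virtual network

def clockwise_path(src, dst, n=5):
    """Directed links used going clockwise from src to dst on an n-node ring."""
    channels = []
    node = src
    while node != dst:
        nxt = (node + 1) % n
        channels.append((node, nxt))
        node = nxt
    return channels

def virtual_network(path):
    """Mapping rule from the slide: red if the path uses 0->4 or 4->0, else black."""
    return "red" if WRAP_LINKS & set(path) else "black"

def dependencies_acyclic(paths):
    """True if the channel dependency graph of these paths has no cycle."""
    edges = {(a, b) for p in paths for a, b in zip(p, p[1:])}
    nodes = {ch for edge in edges for ch in edge}
    while nodes:
        # channels with no incoming dependency from a channel still in the graph
        free = {n for n in nodes
                if not any(b == n and a in nodes for a, b in edges)}
        if not free:
            return False  # every remaining channel waits on another one: a cycle
        nodes -= free
    return True

pairs = [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]
networks = defaultdict(list)
for s, d in pairs:
    path = clockwise_path(s, d)
    networks[virtual_network(path)].append(path)

all_paths = [p for paths in networks.values() for p in paths]
print("single network acyclic:", dependencies_acyclic(all_paths))
for name, paths in sorted(networks.items()):
    print(name, "acyclic:", dependencies_acyclic(paths))
```

With one shared network the dependencies form a cycle. After the split, each virtual network contains only a chain of dependencies, because the red and black channels use private buffers and do not wait on each other.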

12 InfiniBand 3D Torus

• InfiniBand drivers include subnet management for
– Fat Tree: min hop, up/down, etc.
– 3D Torus: Dimension Ordered Routing
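Dimension Ordered Routing can be illustrated as follows: a packet first travels along x until the x coordinate matches the destination, then along y, then along z. This is a simplified sketch of the general technique, not the subnet manager's actual implementation; the function names are hypothetical, and it assumes each dimension is traversed in the shorter wraparound direction.

```python
def shortest_step(src, dst, size):
    """+1 or -1: the direction of the shorter way around a ring of the given size."""
    forward = (dst - src) % size
    return 1 if forward <= size - forward else -1

def dimension_ordered_route(src, dst, sizes):
    """Correct x completely, then y, then z. Returns the list of junctions visited."""
    current = list(src)
    route = [tuple(current)]
    for dim, size in enumerate(sizes):
        if current[dim] == dst[dim]:
            continue
        step = shortest_step(current[dim], dst[dim], size)
        while current[dim] != dst[dim]:
            current[dim] = (current[dim] + step) % size
            route.append(tuple(current))
    return route

# Route across an 8x8x8 torus; y wraps around from 0 to 7 in a single hop.
print(dimension_ordered_route((0, 0, 0), (1, 7, 2), (8, 8, 8)))
```

The sketch only computes paths. It does not by itself handle the dependency cycles that wraparound links can create, which is where the separation into virtual networks comes in.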

13 Mixed Topologies

• Fat-tree topology provides the best-performing solution

• 3D-Torus can be more cost effective, easier to scale, good fit for applications with locality

• Mixed topology

– System connected as 3D Torus
– Fast Fat-tree for collective operations

[Figure: 3x3 grid of nodes labeled 0,0 through 2,2]

14 Notes

• The following Fat-Tree network configurations:
– Flat network
– No unused ports
– Two layers of switch fabric (L1 and L2)

• The following 3D Torus configurations:
– Each 3D Torus junction is a 36-port switch
– Number of switches refers to 36-port switches

• InfiniBand is a great interconnect technology to enable flat connectivity of thousands and tens-of-thousands of servers in future Mega Warehouse Data Centers

15 Example: Non-blocking, Fat-Tree, 40Gb/s

648 L1 36-port switches, 18 L2 648-port switches
18 servers per L1 switch
Total: 11664 servers (nodes)
Throughput: 40Gb/s to the node

[Figure: L1 switches with 18 servers each, attached to a cloud labeled "Non-blocking Network"]

16 Example: 2:1 Oversubscription, Fat-Tree, 40Gb/s

648 L1 36-port switches, 12 L2 648-port switches
24 servers per L1 switch
Total: 15552 servers (nodes)
Throughput: 20Gb/s to the node

[Figure: L1 switches with 24 servers each, attached to a cloud labeled "Non-blocking Network"]

17 Example: 3:1 Oversubscription, Fat-Tree, 40Gb/s

648 L1 36-port switches, 9 L2 648-port switches
27 servers per L1 switch
Total: 17496 servers (nodes)
Throughput: about 13.3Gb/s to the node

[Figure: L1 switches with 27 servers each, attached to a cloud labeled "Non-blocking Network"]

18 Example: 8:1 Oversubscription, Fat-Tree, 40Gb/s

324 L1 36-port switches, 2 L2 648-port switches
32 servers per L1 switch
Total: 10368 servers (nodes)
Throughput: 5Gb/s to the node

[Figure: L1 switches with 32 servers each, attached to a cloud labeled "Non-blocking Network"]
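The four fat-tree examples follow from the same arithmetic: every port of a 36-port L1 switch is either a server port or an uplink, and all uplinks terminate on 648-port L2 switches. The sketch below reproduces the numbers under those assumptions; the function name and its parameters are hypothetical.

```python
def two_level_fat_tree(l1_switches, servers_per_l1,
                       l1_ports=36, l2_ports=648, link_gbps=40):
    """Size a two-layer fat tree with no unused ports on the L1 switches."""
    uplinks_per_l1 = l1_ports - servers_per_l1
    total_uplinks = l1_switches * uplinks_per_l1
    return {
        "servers": l1_switches * servers_per_l1,
        # ceiling division: number of L2 switches needed to terminate all uplinks
        "l2_switches": -(-total_uplinks // l2_ports),
        "oversubscription": servers_per_l1 / uplinks_per_l1,
        "gbps_per_node": link_gbps * uplinks_per_l1 / servers_per_l1,
    }

# (L1 switches, servers per L1 switch) for the 1:1, 2:1, 3:1 and 8:1 examples
for l1, servers in [(648, 18), (648, 24), (648, 27), (324, 32)]:
    print(l1, servers, two_level_fat_tree(l1, servers))
```

Moving ports from uplinks to servers raises the server count per L1 switch and lowers both the L2 switch count and the worst-case bandwidth per node, which is the cost/performance trade-off the examples walk through.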

19 Example: 3D Torus

3D Torus Switch Junction
– Six directions (+x, -x, +y, -y, +z, -z), 120Gb/s each
– Links: 40Gb/s each
– 18 servers (nodes) per junction

3D Torus size: 8x8x8 (512 36-port switches)
Total number of servers: 9216
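The torus numbers can be checked the same way. This sketch assumes that the 120Gb/s per direction is made of three 40Gb/s links (an inference from the figures, not stated on the slide), which leaves 18 of the 36 switch ports for servers. The function name and parameters are hypothetical.

```python
def torus_3d(dims, switch_ports=36, links_per_direction=3, link_gbps=40):
    """Size a 3D torus built from one switch per junction."""
    directions = 6  # +x, -x, +y, -y, +z, -z
    fabric_ports = directions * links_per_direction
    servers_per_switch = switch_ports - fabric_ports
    switches = dims[0] * dims[1] * dims[2]
    return {
        "switches": switches,
        "servers_per_switch": servers_per_switch,
        "servers": switches * servers_per_switch,
        "gbps_per_direction": links_per_direction * link_gbps,
        # longest shortest path: half-way around the ring in every dimension
        "max_switch_hops": sum(n // 2 for n in dims),
    }

print(torus_3d((8, 8, 8)))
```

The hop count grows with the cube size, which is one way to read the earlier point that smaller cubes give lower latency between remote nodes.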

20 3D Torus Connections Example

[Figure: a 36-port switch connecting Node 0 through Node 16 and ION1, with links in the -z, +z, -y, +y, -x and +x directions]

21 Choosing the Right Topology

• Performance: Fat Tree
– Application locality? 3D Torus can become an option
– Multiple users/applications? Fat Tree
– Non-blocking? Fat Tree

• Cost
– Depends on the size of the system
– Very large systems can be more cost effective with 3D Torus

• Future expansion? 3D Torus will be easier to expand

22 Network Offloading

• Transport offloads – critical for CPU efficiency
• Congestion avoidance – must be done in the network
• Applications offloading (MPI offloading)
– For example: MPI Collectives Offloads

[Chart: lower is better]
– Software MPI: losing performance beyond 20% CPU computation availability
– Collectives Offload based MPI: beyond 80% CPU computation availability without any performance loss!

23 Thank You

www.hpcadvisorycouncil.com [email protected]
