Topologies How to Design
Networks: Topologies How to Design
Gilad Shainer, [email protected]

TOP500 Statistics
[Figure: TOP500 statistics charts]

World Leading Large-Scale Systems
• National Supercomputing Centre in Shenzhen – Fat-tree, 5.2K nodes, 120K cores, NVIDIA GPUs, China (Petaflop)
• Tokyo Institute of Technology – Fat-tree, 4K nodes, NVIDIA GPUs, Japan (Petaflop)
• Commissariat à l'Énergie Atomique (CEA) – Fat-tree, 4K nodes, 140K cores, France (Petaflop)
• Los Alamos National Lab, Roadrunner – Fat-tree, 4K nodes, 130K cores, USA (Petaflop)
• NASA – Hypercube, 9.2K nodes, 82K cores, USA
• Jülich JuRoPa – Fat-tree, 3K nodes, 30K cores, Germany
• Sandia National Labs, Red Sky – 3D-Torus, 5.4K nodes, 43K cores, USA

ORNL “Spider” System – Lustre File System
• Oak Ridge National Lab central storage system
  – 13400 drives
  – 192 Lustre OSS
  – 240GB/s bandwidth
  – InfiniBand interconnect
  – 10PB capacity

Network Topologies
• Fat-tree (CLOS), Mesh, 3D-Torus topologies
• CLOS (fat-tree)
  – Can be fully non-blocking (1:1) or blocking (x:1)
  – Typically enables best performance
    • Non-blocking bandwidth, lowest network latency
• Mesh or 3D Torus
  – Blocking network, cost-effective for systems at scale
  – Great performance solutions for applications with locality
  – Support for dedicated sub-networks
  – Simple expansion for future growth
[Figure: 3x3 grid of nodes labeled (0,0) to (2,2)]

d-Dimensional Torus Topology
• Formal definition
  – $T = (V, E)$ is said to be a $d$-dimensional torus of size $N_1 \times N_2 \times \dots \times N_d$ if:
    • $V = \{(v_1, v_2, \dots, v_d) : 0 \le v_i \le N_i - 1\}$
    • $E = \{(u, v) : \exists j \text{ such that (1) } v_i = u_i \text{ for each } i \ne j \text{ and (2) } v_j = (u_j \pm 1) \bmod N_j\}$
• Examples
  – $N_1 = 5$: a ring of nodes 0, 1, 2, 3, 4
  – $N_1 = N_2 = 3$: a 3x3 torus of nodes (0,0) to (2,2)

3D-Torus System – Key Items
• Multiple server nodes per cube junction
• The smaller the 3D cube size, the better
  – Lowest latency between remote nodes
  – Minimizing throughput contention
• Ability to connect storage
• Support for separate networks
  – Dedicated network (links) for specific applications/usage
  – Example: links dedicated for collectives or specific jobs

InfiniBand 3D Torus
[Figure: InfiniBand 3D torus]

Routing for 3D Torus (Avoiding Deadlocks)
• Setting routing might look simple
  – Just route packets on the shortest path between source and destination
• In lossless networks trivial routing can be disastrous
• Communication pairs on a five-node ring (nodes 0 to 4)
  1. 0 → 2
  2. 1 → 3
  3. 2 → 4
  4. 3 → 0
  5. 4 → 1

Avoiding Deadlock – Restrictive Approach
• Idea
  – Define a set of rules forbidding usage of some resources, or a (temporal) combination of resources, which will guarantee freedom from deadlock
  – Design a routing complying with the rules
[Figure: the five-node ring example]

Avoiding Deadlock – Separation Approach
• Idea
  – Decompose each (unidirectional) physical link into several logical channels with private buffer resources
  – Use logical channels to separate the network into virtual networks, each dependency-cycle-free
  – Assign communication pairs (with their paths) to the virtual networks
• Back to our ring
  – Routing: shortest path
  – Virtual mapping: if a path uses the 0 → 4 or 4 → 0 link, map it to the red virtual network, else to the black one

InfiniBand 3D Torus
• InfiniBand drivers include subnet management for
  – Fat Tree – min hop, up/down, etc.
  – 3D Torus – Dimension Ordered Routing

Mixed Topologies
• Fat-tree topology provides the best performance solution
• 3D-Torus can be more cost effective, easier to scale, good fit for applications with locality
• Mixed topology
  – System connected as 3D Torus
  – Fast Fat-tree for collective operations
[Figure: 3x3 grid of nodes labeled (0,0) to (2,2)]

Notes
• Following Fat-Tree network configurations
  – Flat network
  – No unused port
  – Two layers of switch fabric (L1 and L2)
• Following 3D Torus configurations
  – Each 3D Torus junction is a 36-port switch
  – Number of switches refers to 36-port switches
• InfiniBand is a great interconnect technology to enable flat connectivity of thousands and tens-of-thousands of servers in future Mega Warehouse Data Centers

Example: Non-blocking, Fat-Tree, 40Gb/s
• L1: 648 36-port switches, 18 servers per switch
• L2: 18 648-port switches (non-blocking network)
• Total: 11664 servers (nodes)
• Throughput: 40Gb/s to the node

Example: 2:1 Oversubscription, Fat-Tree, 40Gb/s
• L1: 648 36-port switches, 24 servers per switch
• L2: 12 648-port switches (non-blocking network)
• Total: 15552 servers (nodes)
• Throughput: 20Gb/s to the node

Example: 3:1 Oversubscription, Fat-Tree, 40Gb/s
• L1: 648 36-port switches, 27 servers per switch
• L2: 9 648-port switches (non-blocking network)
• Total: 17496 servers (nodes)
• Throughput: 13Gb/s to the node

Example: 8:1 Oversubscription, Fat-Tree, 40Gb/s
• L1: 324 36-port switches, 32 servers per switch
• L2: 2 648-port switches (non-blocking network)
• Total: 10368 servers (nodes)
• Throughput: 5Gb/s to the node

Example: 3D Torus
• 3D Torus switch junction
  – 120Gb/s in each of the six directions (+x, -x, +y, -y, +z, -z)
  – 18 servers (nodes), 40Gb/s each
• 3D Torus size: 8x8x8 (512 36-port switches)
• Total number of servers: 9216

3D Torus Connections Example
[Figure: a 36-port switch connecting ION1 and Node 0 through Node 16, with links in the -z, +z, -y, +y, -x, +x directions]

Choosing the Right Topology
• Performance: Fat Tree
  – Application locality? 3D Torus can become an option
  – Multiple users/applications? Fat Tree
  – Non-blocking? Fat Tree
• Cost
  – Depends on the size of the system
  – Very large systems can be more cost effective with 3D Torus
• Future expansion? 3D Torus will be easier to expand

Network Offloading
• Transport offloads – critical for CPU efficiency
• Congestion avoidance – must be done in the network
• Applications offloading (MPI offloading)
  – For example: MPI Collectives Offloads
[Figure: chart, lower is better]
  – Software MPI: losing performance beyond 20% CPU computation availability
  – Collectives Offload based MPI: beyond 80% CPU computation availability without any performance loss

Thank You
www.hpcadvisorycouncil.com
[email protected]
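
The five-node ring used in the deadlock slides can be checked mechanically. The sketch below is only an illustration, not code from the presentation: it builds the shortest-path routes for the five communication pairs, derives the channel dependencies (a packet holding one link while waiting for the next), and tests for a cycle. It then applies the separation rule from the slides, where paths crossing the link between nodes 0 and 4 go to the red virtual network and the rest to the black one. The function names `ring_path`, `has_dependency_cycle`, and `uses_wrap_link` are hypothetical, and treating each virtual network as an independent dependency graph is an assumption that models the private buffers per logical channel.

```python
def ring_path(src, dst, n):
    """Shortest path on an n-node ring, as a list of directed links (u, v)."""
    forward = (dst - src) % n
    step = 1 if forward <= n - forward else -1
    links, cur = [], src
    while cur != dst:
        nxt = (cur + step) % n
        links.append((cur, nxt))
        cur = nxt
    return links


def has_dependency_cycle(paths):
    """True if the channel dependency graph built from `paths` has a cycle."""
    deps = {}
    for path in paths:
        for link in path:
            deps.setdefault(link, set())
        # A packet holds one link while requesting the next one on its path.
        for held, wanted in zip(path, path[1:]):
            deps[held].add(wanted)

    state = {}  # link -> "active" (on the DFS stack) or "done"

    def visit(link):
        state[link] = "active"
        for nxt in deps[link]:
            if state.get(nxt) == "active":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[link] = "done"
        return False

    return any(link not in state and visit(link) for link in deps)


N = 5
pairs = [(i, (i + 2) % N) for i in range(N)]   # 0->2, 1->3, 2->4, 3->0, 4->1
paths = [ring_path(s, d, N) for s, d in pairs]


def uses_wrap_link(path):
    """Separation rule from the slides: does the path cross the 0-4 link?"""
    return (0, 4) in path or (4, 0) in path


red = [p for p in paths if uses_wrap_link(p)]
black = [p for p in paths if not uses_wrap_link(p)]

print("single network has cycle:", has_dependency_cycle(paths))
print("red network has cycle:   ", has_dependency_cycle(red))
print("black network has cycle: ", has_dependency_cycle(black))
```

With all five paths sharing one set of buffers, the dependencies close a loop around the ring; once the two paths that cross the 0-4 link are moved to their own virtual network, neither network contains a cycle.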
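
The formal torus definition and the mention of Dimension Ordered Routing can be made concrete with a short sketch. It is an illustration under stated assumptions, not the subnet manager's implementation: `torus_neighbors` follows the edge rule (exactly one coordinate changes by ±1 modulo its dimension size), and `dimension_ordered_route` corrects one coordinate at a time, taking the shorter way around each ring. Both function names are hypothetical. Note that on a torus, dimension ordering alone does not remove the dependency cycle inside each wraparound ring, so a practical design would still need something like the virtual-network separation described in the slides.

```python
def torus_neighbors(node, dims):
    """Nodes adjacent to `node` in a torus of size dims[0] x dims[1] x ..."""
    result = set()
    for j, size in enumerate(dims):
        for step in (1, -1):
            other = list(node)
            other[j] = (node[j] + step) % size
            if tuple(other) != node:      # skip self-loops when size == 1
                result.add(tuple(other))
    return result


def dimension_ordered_route(src, dst, dims):
    """Path from src to dst that fixes dimension 0 first, then 1, and so on."""
    path = [tuple(src)]
    cur = list(src)
    for j, size in enumerate(dims):
        forward = (dst[j] - cur[j]) % size
        step = 1 if forward <= size - forward else -1   # shorter way around
        while cur[j] != dst[j]:
            cur[j] = (cur[j] + step) % size
            path.append(tuple(cur))
    return path


# The two examples from the definition slide, plus the 8x8x8 system.
print(sorted(torus_neighbors((0,), (5,))))
print(sorted(torus_neighbors((0, 0), (3, 3))))
print(len(torus_neighbors((0, 0, 0), (8, 8, 8))))
print(dimension_ordered_route((0, 0, 0), (7, 1, 0), (8, 8, 8)))
```

In the 8x8x8 case every junction has six neighbors, which matches the six link directions (+x, -x, +y, -y, +z, -z) of the switch junction example.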
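
The fat-tree example figures follow from port arithmetic, which the sketch below reproduces. It assumes what the Notes slide states (two switch layers, no unused port) and additionally assumes that each L1 switch splits its ports so that servers : uplinks equals the oversubscription ratio, and that per-node throughput under full load is the link rate divided by that ratio. The function `fat_tree_size` and its parameter names are hypothetical.

```python
def fat_tree_size(l1_switches, oversubscription=1,
                  l1_ports=36, l2_ports=648, link_gbps=40):
    """Size a two-layer fat-tree in which every L1 port is used.

    `oversubscription` is the ratio of server ports to uplink ports
    on each L1 switch (1 means non-blocking).
    """
    if l1_ports % (oversubscription + 1) != 0:
        raise ValueError("L1 ports do not split evenly at this ratio")
    uplinks_per_l1 = l1_ports // (oversubscription + 1)
    servers_per_l1 = l1_ports - uplinks_per_l1

    total_uplinks = l1_switches * uplinks_per_l1
    if total_uplinks % l2_ports != 0:
        raise ValueError("uplinks do not fill a whole number of L2 switches")

    return {
        "servers_per_l1": servers_per_l1,
        "uplinks_per_l1": uplinks_per_l1,
        "l2_switches": total_uplinks // l2_ports,
        "servers": l1_switches * servers_per_l1,
        "gbps_per_node": link_gbps / oversubscription,
    }


# The four configurations from the example slides.
for l1, ratio in [(648, 1), (648, 2), (648, 3), (324, 8)]:
    print(f"{ratio}:1 with {l1} L1 switches ->", fat_tree_size(l1, ratio))
```

The same arithmetic applies to the 3D torus example: 512 junctions with 18 servers each give 9216 servers, and 18 server ports plus 3 links in each of 6 directions fill a 36-port switch.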