Networks for High-Performance Computing
Louis Jenkins
University of Rochester

Abstract— In high-performance computing (HPC), low-latency and high-bandwidth networks are required to maintain high performance in a distributed computing environment. According to the Top500, the two most widely used networks in the top 10 supercomputers are Cray's Aries network interconnect and Mellanox's Infiniband network interconnect, which take radically different approaches to their design. This survey explores the design of each network interconnect, the ways they provide reliability and fault tolerance, and their support for remote direct memory access (RDMA), and analyzes the performance of an Infiniband cluster and a Cray-XC50 supercomputer. Finally, the near future of high-performance interconnects, obtained from an interview with Steve Scott, Chief Technology Officer (CTO) of Cray Inc., a Hewlett Packard Enterprise company, is also examined.

I. INTRODUCTION

To address the ever-increasing need for computational power, computation has become more and more decentralized over time. The technological advancements from uni-processor to multi-processor systems, from single-socket to multi-socket NUMA domains, and from shared memory to distributed memory are all testaments to this fact. Decentralizing computation has many advantages, such as increasing the number of processing elements (PEs) available for balancing the computational workload, but it does not come without its challenges. The lower down the memory hierarchy you go, the higher the latency, and the difference between one part of the hierarchy and another is sometimes as much as an order of magnitude. While accesses to memory that is local to the PE may vary considerably, as in the case of L1 Cache, L2 Cache, L3 Cache, and main memory (DRAM), they are usually within the span of nanoseconds, whereas accesses to memory that is remote to the PE can take microseconds to milliseconds. Computations that span multiple PEs require communication over some type of network, and in the context of high-performance computing, the larger the bandwidth and the smaller the latency, the better.

Supercomputers and compute clusters primarily tend to use one of two high-performance network interconnects: Infiniband[1], [2] from Mellanox, and Aries[3], successor of Gemini[4], [5], from Cray. Commodity clusters also require high-performance networks, although Infiniband may be excessive for their computational needs, and so they may opt for RDMA over Converged Ethernet (RoCE)[6], which is common in big data centers. These high-performance interconnects require their own unique design and implementation that minimizes latency, maximizes bandwidth, and ensures the reliability of data sent between PEs on the network. These interconnects implement Remote Direct Memory Access (RDMA), which allows memory on a remote PE to be accessed directly through the Network Interface Controller (NIC) without the intervention of the CPU, or even the GPU. RDMA allows high-bandwidth, low-latency remote memory operations, both incoming and outgoing, to avoid saturating the CPU, freeing it up to perform more computation. RDMA for GPUs, such as NVidia's GPUDirect, allows not just host-to-remote-device but even device-to-remote-device transfers to be performed without any unnecessary copying to and from the CPU, whereas a device-to-device transfer would normally require a device-to-host, a host-to-host, and then a host-to-device copy. RDMA can also be used to atomically update remote memory such that the update appears to happen all at once or not at all, which opens up many potential applications.
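To make the one-sided nature of RDMA concrete, the sketch below shows roughly how a remote write can be posted through the libibverbs API used by Infiniband (and RoCE) hardware. It is only a minimal illustration, not code from this survey: it assumes that a reliable-connection queue pair has already been connected, that the local buffer was registered with ibv_reg_mr(), and that the remote virtual address and rkey were exchanged out of band; the function name and parameters are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA write from a local registered buffer to a remote
 * address, then spin on the completion queue until the NIC reports that
 * the write has completed.  The remote CPU is never involved. */
static int rdma_write_blocking(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *mr, void *local_buf, uint32_t len,
                               uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* local source buffer */
        .length = len,
        .lkey   = mr->lkey,              /* key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided put */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* exchanged out of band */
    wr.wr.rdma.rkey        = remote_rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll the completion queue; a real application might block instead. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```

Because the opcode is IBV_WR_RDMA_WRITE, the data lands in the remote PE's memory without the remote CPU ever being interrupted, which is exactly the property these interconnects are designed to exploit.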
In section 2, the properties of a high-performance network are explored in detail; in section 3, the design and implementation of Infiniband is examined; and in section 4, the design and implementation of Gemini and its successor Aries are looked at in detail. In section 5, performance benchmark results from the Ohio State University (OSU) benchmarks are analyzed on a Cray-XC50 and an Infiniband compute cluster, and in section 6, we take a look at the near future of high-performance computing, courtesy of Steve Scott, CTO of Cray Inc., a Hewlett Packard Enterprise company. Finally, in section 7, we have our conclusion.

II. PROPERTIES OF HIGH-PERFORMANCE NETWORKS

There are many characteristics and properties that make up networks suitable for high-performance computing. The structure of the network, that is, its topology, as well as the properties relating to fault tolerance and reliability, and the types of communication supported between PEs in the same network, are all important. A supercomputer or compute cluster lacking in any of these will be lacking as a whole, as each individual part plays a role in extracting as much performance as possible from the underlying hardware.

A. Network Topology

The topology of a network describes the overall makeup and layout of switches, nodes, and links, and the ways they are connected. This layout is crucial for maximizing bandwidth, minimizing latency, ensuring reliability, and enabling fault tolerance in a network. There are many types of topological structures in a network, but only those most commonly used in the top supercomputers and compute clusters will be considered; all others are outside the scope of this work.

The Fat Tree is a tree data structure where branches are larger, or "fatter", the farther up the tree they are. The 'thickness' of a branch is an analog for the amount of bandwidth available to that node; the fatter the branches, the larger the bandwidth. The Fat Tree, in the context of networking, has all non-leaf nodes as switches, with leaf nodes being PEs, as demonstrated in Figure 1[7]. The Fat Tree is extremely cost-effective, as adding a new PE to the network can be as simple as adding a logarithmic number of switches in the worst case. Communication from one PE to another is handled by single-source shortest path (SSSP) routing, which sends packets up and down the tree from the source PE to the target PE.

Fig. 1. Fat Tree Network Topology

The Torus topology, with a 2D Torus network shown in Figure 2[7], can be generalized to N dimensions, where the PEs are placed on a coordinate system. For example, in a 3D Torus network topology, each PE connects to 6 different PEs, in the ±x, ±y, and ±z directions. The Torus offers a lot of diversity in routing, as there are a wide variety of ways to send data from a source PE to a target PE, and hence increased bandwidth, but this comes at the downside of higher cost and a much more difficult setup.

Fig. 2. 2D Torus Network Topology
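As a small, purely illustrative aside (not part of the original survey), the following C sketch enumerates the six neighbors of a PE in a hypothetical 4x4x4 3D Torus; the dimension size, coordinate scheme, and PE numbering are assumptions made only for this example, but the modular wrap-around is what distinguishes a Torus from a plain Mesh.

```c
#include <stdio.h>

#define DIM 4  /* hypothetical 4x4x4 torus, i.e. 64 PEs */

/* Convert (x, y, z) coordinates into a linear PE identifier. */
static int pe_id(int x, int y, int z) {
    return (z * DIM + y) * DIM + x;
}

/* Print the six neighbors of a PE in a 3D torus: +/-x, +/-y, +/-z,
 * with coordinates wrapping around at the edges. */
static void print_neighbors(int x, int y, int z) {
    int deltas[6][3] = {
        {+1, 0, 0}, {-1, 0, 0},
        {0, +1, 0}, {0, -1, 0},
        {0, 0, +1}, {0, 0, -1},
    };
    for (int i = 0; i < 6; i++) {
        int nx = (x + deltas[i][0] + DIM) % DIM;  /* +DIM handles the -1 wrap */
        int ny = (y + deltas[i][1] + DIM) % DIM;
        int nz = (z + deltas[i][2] + DIM) % DIM;
        printf("neighbor %d: PE %d at (%d, %d, %d)\n",
               i, pe_id(nx, ny, nz), nx, ny, nz);
    }
}

int main(void) {
    /* Even a corner PE has six neighbors thanks to the wrap-around links. */
    print_neighbors(0, 0, 0);
    return 0;
}
```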
The Dragonfly topology is a hybrid, in that PEs are grouped together, with at least one PE in each group linked to some PE in some other group. The topology used inside a group is implementation-dependent, but usually a Fat Tree, Torus, or Mesh topology is used, as shown in Figure 3.

Fig. 3. Dragonfly Network Topology

B. Fault Tolerance and Reliability

In a network, reliability is an important feature for ensuring consistency, and when done well, it can serve to extract even greater performance. A packet in an unreliable network may be lost, may have to be retransmitted at the cost of additional time, or, at absolute worst, could result in the acceptance of corrupted data that throws the application into jeopardy. In striving to balance both reliability and performance, each high-performance network must make some rather unique design decisions and trade-offs that differ from how these concerns are handled in a run-of-the-mill network. One way that packets are protected is via a Cyclic Redundancy Check (CRC), which adds a fixed-width checksum to the original payload, where the checksum is calculated in a way that is sensitive to the position and value of each individual bit. Error Correcting Codes (ECC) are also used on the link itself to add an additional layer of reliability and recoverability on an error.

When a PE goes down in a network, network traffic must be routed around it, as it may no longer be able to properly receive, send, or even forward packets that route through it.

TABLE I
ATOMIC MEMORY OPERATORS (SUPPORTED OPERAND WIDTHS IN BITS)

Atomic Operation   Gemini   uGNI    RoCE   Infiniband
Add (Integer)      64       32,64   N/A    N/A
Add (Float)        N/A      32      N/A    N/A
Fetch-Add          64       32,64   64     64
And                64       32,64   N/A    N/A
Fetch-And          64       32,64   64     64
Or                 64       32,64   N/A    N/A
Fetch-Or           64       32,64   64     64
Min (Integer)      64       32,64   N/A    N/A
Min (Float)        N/A      32      N/A    N/A
Max (Integer)      64       32,64   N/A    N/A
Max (Float)        N/A      32      N/A    N/A
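To relate the table to actual RDMA usage, the sketch below shows one plausible way a 64-bit remote fetch-add could be posted with libibverbs on Infiniband or RoCE hardware. It is an illustration rather than code from this survey: it assumes a connected queue pair on a device that advertises atomic support, a locally registered 8-byte result buffer, and a remote address and rkey exchanged out of band; the function name and parameters are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a remote 64-bit fetch-and-add.  The NIC writes the previous value of
 * the remote word into the local 8-byte buffer described by `sge`; the
 * remote CPU never executes any code. */
static int rdma_fetch_add64(struct ibv_qp *qp, struct ibv_mr *result_mr,
                            uint64_t *result_buf, uint64_t remote_addr,
                            uint32_t remote_rkey, uint64_t add_value)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf, /* old remote value lands here */
        .length = sizeof(uint64_t),
        .lkey   = result_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.opcode                = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;  /* must be 8-byte aligned */
    wr.wr.atomic.rkey        = remote_rkey;
    wr.wr.atomic.compare_add = add_value;    /* amount to add; `swap` is unused */

    return ibv_post_send(qp, &wr, &bad_wr);  /* completion is polled elsewhere */
}
```

Once the corresponding completion is reaped, result_buf holds the pre-increment value of the remote word, giving the all-at-once-or-not-at-all behavior described in the introduction.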