State-of-the-art Network Interconnects for Computer Clusters in High Performance Computing

Rubén D. Sousa De León

Technische Universität München, Computational Science and Engineering M.Sc. Program, Boltzmannstr. 3, Munich, Germany [email protected]

Abstract. This paper presents a qualitative analysis of three of the most widely used interconnect technologies in the high performance computing (HPC) scene today: Myrinet, Quadrics, and Infiniband. The most important properties of each interconnect technology are described, and the role each of them plays in the efficiency of a clustered system is analysed. A comparison of the performance of each interconnect at the MPI level and at the application level is then presented, using results obtained from tests performed by different teams at several research institutes in the United States. Finally, the future trends of high performance network interconnect technologies are analysed based on the results of these comparisons and on the current behaviour of the market with respect to product development and support by the major manufacturers in the industry.

1 Introduction

During the past few years the rapid fall in the prices of individual computers and the fast increase in their computing capabilities have led to the idea of grouping individual servers together in clusters as an alternative for high performance computing (HPC) applications, one that is cheaper and thus more accessible than the traditional concept of custom-made supercomputers. The great problem with using individual servers interconnected through some sort of network is the existence of several bottlenecks that decrease the overall performance of such systems. That is the main reason for developing special network interconnects designed to meet the requirements of high performance computing, which are, mainly, low inter-node communication latency, high bandwidth for transmitting messages between nodes, scalability, programmability and reliability.

Currently [2] we may find a wide spectrum of network interconnect technologies available in the HPC industry, both proprietary and open (both single- and multi-vendor). Among the proprietary interconnects are HP's HyperFabric2, HP's ServerNet II, IBM's SP Switch 2, SGI's NUMAlink and Sun's Sun Fire Link. On the other hand, there are several products with public specifications that are only available from one specific vendor; among these, Myrinet from Myricom, QsNet and QsNet II from Quadrics, the Gigabyte System Network (GSN) from SGI and the Scalable Coherent Interface (SCI) from Sun may be mentioned. A third category of interconnects comprises those with open specifications that are industry standards available from multiple vendors; the most important of these are Infiniband and Gigabit Ethernet.

Table 1. Network interconnect technologies available at present [2].

| Technology | Vendor | Latency | Bandwidth per link (unidirectional) |
|---|---|---|---|
| NUMAlink | SGI | 1.5 – 3 µsec | 1500 MB/s |
| QsNet II | Quadrics | 1.6 µsec | 900 MB/s |
| ServerNet | HP | 3 µsec | 125 MB/s |
| Sun Fire Link | Sun | 3 – 5 µsec | 792 MB/s |
| Myrinet XP2 | Myricom | 5.5 µsec | 495 MB/s |
| Myrinet XP | Myricom | 7 – 9 µsec | 230 MB/s |
| SCI | Sun | 10 µsec | 1 Gb/s |
| GSN | SGI | 13 – 30 µsec | 800 MB/s |
| SP Switch 2 | IBM | 18 µsec | 500 MB/s |
| HyperFabric2 | HP | 22 µsec | 320 MB/s |
| InfiniBand | Multiple | 3 – 20 µsec | 785 MB/s |
| Gigabit Ethernet | Multiple | 40 µsec | 128 MB/s |

Of all the products mentioned above for linking together the nodes of an HPC cluster, only three are considered in this paper: Infiniband, Myrinet and Quadrics. The proprietary interconnects are not considered for analysis because they are specifically designed to meet custom requirements which are not standard for all users. Among the industry-standard products, Gigabit Ethernet is not considered in spite of being one of the most widely used technologies in the Top500 listings. The main reason for leaving it out is that it does not meet the requirements of HPC applications, such as very high bandwidth and very low latency.

2 Main Characteristics of the Interconnects to be Compared

2.1 Infiniband

The Infiniband Architecture was defined as an industry standard in June 2001 and is the result of the fusion of the Next Generation I/O (NGIO) and Future I/O (FIO) initiatives. The InfiniBand Trade Association (IBTA) was formed by industry leaders with the goal of designing a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies. It set out to create a single, unified I/O fabric that takes the best features of currently available technologies and merges them into one open standard. The standard [4] includes product definitions for both copper and glass-fibre connections, switch and router properties, definitions for high-bandwidth multiple connections, a description of the way messages are broken up into packets and reassembled, as well as routing, prioritizing and error handling capabilities. This makes Infiniband independent of any particular technology, which is one of its strong points.

The Infiniband architecture defines some revolutionary concepts that have been key to its rapid acceptance as an inter-node communication layer. It addresses one of the most important issues that has diminished the performance of clusters for HPC, namely the I/O bottleneck created by the limitations of the PCI bus. To eliminate some of these limitations, several changes in the way I/O communication and inter-process communication (IPC) are performed are introduced. On one side, it changes from traditional memory-mapped access to channel-based access, which results in higher CPU efficiency, greater scalability, memory isolation and easier recovery. Also, and perhaps the most important contribution, it introduces the concept of a switched fabric or "system area network" that interconnects all of the I/O nodes and processing nodes through dedicated links consisting of cascaded switches, instead of the usual parallel buses with all their limitations. In a switched fabric, I/O devices may reside outside the box, directly connected to the network through a Target Channel Adapter (TCA), thus completely eliminating the overhead caused by requesting the CPU to fetch information from memory and then process it. Another improvement that also relieves load from the CPU is the elimination of Load/Store operations, which are replaced by DMA scheduling that allows the hardware to move data without CPU intervention. The architecture also offers reliability features such as quality of service (QoS), fail-over in switches and support for redundant fabrics.

An Infiniband network consists of three main building blocks: Host Channel Adapters (HCA), which allow processing nodes and I/O nodes to be connected to the fabric; Target Channel Adapters (TCA), which allow I/O devices to be connected directly to the fabric without the intervention of a host node; and Infiniband switches, which connect all the hosts and devices together. Infiniband currently defines three levels of connection links: a basic 1X link with a speed of 2.5 Gbit/s (312.5 MByte/s), a 4X link with a speed of 10 Gbit/s (1.25 GByte/s) and a 12X link with a speed of 30 Gbit/s (3.75 GByte/s). Currently, Infiniband still makes use of the PCI bus, either as PCI-X or PCI-Express, to connect nodes to the switched fabric, so the actual performance is limited by the specific PCI bus used. Eventually the switched fabric will connect directly to the system logic [3].
Clusters connected through an Infiniband network are usually arranged in a quaternary fat-tree¹ topology, which allows for full bisection bandwidth and good scalability of the network. Currently a single subnet can accommodate up to 48000 nodes, and subnets can be connected with each other through Infiniband routers. The MPI² standard is implemented over Infiniband by a variant called MVAPICH, which uses the VAPI software interface of the Infiniband HCAs. This implementation is discussed in [11].

Although Infiniband has gained great acceptance in the network interconnect industry, it is not the definitive solution. Some of the disadvantages that can be foreseen include the following: there is only one provider of an InfiniBand core chipset (Mellanox), constituting a risk of supply disruption; this lack of diversity also generally reduces innovation and over time drives prices higher; there is limited expertise in the field with the protocol, which may complicate implementation; it is a new networking protocol and relies heavily on gateway functionality (storage area network and IP) to get the full benefits of a unified fabric; only a limited set of diagnostic and troubleshooting tools is available in shipping products; and it has had limited use in enterprise environments, which is the true test for sustaining long-term viability of the protocol [9].

¹ Fat tree: a network that has the structure of a binary (or quad) tree but that is modified such that the available bandwidth is higher near the root than near the leaves. This stems from the fact that a root processor often has to gather or broadcast data to all other processors, and without this modification contention would occur near the root.
² MPI: the Message Passing Interface, a message passing library that implements the message passing style of programming. Presently MPI is the de facto standard for this kind of programming.

2.2 Myrinet

Myrinet was created in 1994 by Myricom and since 1998 has been an American National Standard, ANSI/VITA 26-1998, developed under the auspices of the VMEbus International Trade Association (VITA). The current implementations of Myrinet offer bandwidths ranging from 200 to 500 MByte/s per link, which allows a maximum of 1 GByte/s in bidirectional communication (500 MByte/s in each direction), and inter-node latencies as low as 5 µs. Myrinet networks consist of two major components: PCI-X network interface cards (NICs) and Myrinet 2000 switches with a maximum of 32 ports. A product called "Network in a box" is also offered, which allows up to 256 nodes to be connected, and several thousand nodes by adding more such devices. The cables for connecting the nodes are traditionally offered in copper, but the latest implementations use fibre cables.

Clusters connected through a Myrinet network use a Clos network topology³, which allows for full bisection and maximum throughput at the network leaves, as well as for deadlock-free routing. Myrinet also implements error control, flow control, link monitoring and adaptive routing protocols; the latter should allow the network to find alternative paths to deliver messages in case of failure of one or several nodes.

For sending and receiving messages through the network, Myrinet implements its own low-level message passing layer called GM. It provides several features such as: concurrent, protected, user-level access to the Myrinet NIC; reliable, ordered delivery of messages; automatic mapping and route computation; automatic recovery from transient network problems; scalability to thousands of nodes; low latency, high data rate and very low host-CPU utilization; and extensible software that allows simultaneous direct support of the GM API, IP (TCP, UDP), MPI and other APIs. The GM message passing layer achieves extremely low latency and CPU utilization overhead by means of a technique called "operating-system bypass" (OS-bypass). This consists in allowing application programs to send and receive messages without any system calls once memory has been allocated and registered. Instead, the GM API functions communicate through common memory with the Myrinet Control Program that executes continuously on the processor of the Myrinet NIC. Another feature that allows for low latency in the Myrinet network is the low latency of the Myrinet switch [9]. For HPC applications Myricom developed MPICH-GM, an MPICH implementation that exploits the full message passing capabilities of the GM layer. This is implemented by re-targeting the MPICH Channel Interface to the GM messaging layer.

Myrinet has some disadvantages related to the fact that it is a single-vendor solution with a long product lifecycle, which can lead to limited support from the vendor. Using Myrinet also requires twice as many switch ports as alternative solutions, so costs may rise significantly. Furthermore, Myrinet has a closed software support model with little software flexibility, which makes maintenance and implementation cumbersome, and there are currently no plans for interconnect speeds higher than 10 Gbit/s [9].

³ Clos network: a logarithmic network in which the nodes are attached to switches that form a spine that ultimately connects all nodes.

2.3 Quadrics

Quadrics traces its origins to 1990 and the British firm Meiko and its Computing Surface 2 (CS-2); after Meiko closed down, part of the CS-2 technology was further developed by the Italian firm Alenia, which evolved into today's Quadrics. The first implementation of the Quadrics network was QsNet, which proved to be a very fast and reliable network. Recently, the second generation, QsNet II, has been released.

The QsNet II communication layer consists of two building blocks: a programmable Elan4 network interface and the Elite4 switch, which are connected in a fat-tree topology. The Elan4 plugs into the PCI-X bus of a processing node that contains one or more CPUs and serves as an interface between this node and the rest of the high performance network. It has an internal 64-bit architecture and supports 64-bit virtual addresses. Besides generating and accepting packets to and from the network, it provides local processing power to implement the high-level message passing protocols required in parallel processing. The network is constructed from Elite4 switch components, which are capable of switching eight bidirectional communication links. The use of bidirectional links is a distinctive characteristic of Quadrics, which helps to a certain extent to reduce latency, currently as low as 3 µs. Each link carries data in both directions simultaneously at 1.3 GByte/s, and the link bandwidth is shared between two virtual channels. The network supports broadcast transmission across selected ranges of nodes in addition to point-to-point connectivity between arbitrary nodes. Currently the Elite4 switches are only available with 16 or 128 ports, with nothing in between, which reduces flexibility when putting together networks of other sizes. The switches support two priority levels, which greatly helps in achieving a fair distribution of message packets; this is more than Myrinet provides but less than the 16 priority levels of Infiniband. The in-switch latency is very low, about 35 ns. This is achieved thanks to the STEN (Short Transaction Engine) processor, which assembles short packets for transmission into the network and is optimized for short reads and writes and for protocol control; all packets it issues are pipelined, which provides very low latencies. The Elite4 switches are connected together in a fat-tree topology, which allows scaling up to several thousands of nodes.

Quadrics provides its libraries, libelan and libelan4, on top of the Elan4 network. Within these default Quadrics programming libraries, a parallel job first acquires a job-wise capability. Then each process is allocated a virtual process ID (VPID); together they form a static pool of processes, i.e., the process membership and the connections among them cannot change. Inter-process communication is supported by two different models: Queue-based Directed Message Access (QDMA) and Remote Directed Message Access (RDMA). QDMA allows processes to post messages (up to 2 KB) to a remote queue of another process; RDMA enables processes to write messages directly into remote memory exposed by other processes. Libelan also provides a very useful chained-event mechanism, which allows one operation to be triggered upon the completion of another; this can be utilized to support fast and asynchronous progress of two back-to-back operations. The MPI message passing standard is implemented over the libelan4 library in a custom implementation developed by Quadrics.

The major disadvantage of Quadrics is that it is the only company that develops, markets, sells and supports the product, which greatly limits the availability and frequency of new products and also keeps prices at a higher level than other interconnects.

3 Performance Comparison of Three Network Interconnects

The results presented in this section are extracted from tests performed by several research teams; the main results are taken from [1]. These were obtained using a common test bed for all three interconnects in order to reduce the influence of external factors on the results. The tests target two different levels of performance: the MPI implementation level and the application level. Each network interconnect was tested with an MPI implementation specially designed to obtain the best performance on the respective network: for Myrinet, MPICH-GM; for Infiniband, MVAPICH; and for Quadrics, MPICH-Qs. All of these are based on the widespread MPICH implementation of the MPI standard. Besides the tests performed at the MPI level, some tests at the application level are analysed to understand the effect that the MPI performance parameters have on application performance. The results analysed here allow comparisons among the three interconnects because the experimental setup is kept consistent.

Since [1] was published, improvements have been made to each network interconnect under analysis, and new communication buses such as PCI-Express have become available. The base comparison is made using the first Quadrics generation QsNet on a PCI bus, the second Myrinet generation Myrinet 2000 with the GM message layer on a PCI-X bus, and Infiniband 4X on a PCI-X bus. In order to give the most current view of the state of the interconnect technologies, some important results for the new QsNet II generation on a PCI-X bus [5], the new GM-2 messaging layer from Myrinet, also on a PCI-X bus [6], and Infiniband 4X on a PCI-Express bus [7] are also analysed. Nevertheless, these more recent results do not allow valid comparisons to be made because, as opposed to the results presented in [1], they were not measured under common, controlled experimental conditions.

3.2 Comparison Tests at the MPI Level

3.2.1 Latency

This test is conducted in a ping-pong fashion and the latency⁴ is derived from the round-trip time. The smallest latency for small message sizes was shown by Quadrics (4.6 µs), followed by Myrinet (6.7 µs) and Infiniband (6.8 µs), but for large message sizes Infiniband has a clear advantage due to its higher available bandwidth. Results are shown in figure 1. For the recent QsNet II implementation of Quadrics the latency is further improved to a minimum of 1.38 µs [5]. In a similar way, the latency for Myrinet adapters with a 333 MHz bus and the GM-2 message layer (both upgrades in new Myrinet products) decreases to 5.7 µs [6]. In the case of Infiniband HCAs over a PCI-Express bus the latency decreases to 4.1 µs [7]. Even though it is not possible to perform valid comparisons with these newer values, it is evident that the new features implemented resulted in significant performance improvements.

⁴ Communication latency: the time overhead occurring when a message is sent over a communication network from one processor to another.

Although unidirectional tests provide important information about the peak performance of each interconnect, they do not reflect the real conditions under which communication takes place in a computer cluster. Bidirectional tests put more stress on the communication layer and may therefore provide more useful information for understanding the bottlenecks in communication. In these tests two nodes, sender and receiver, send data simultaneously. Results for bidirectional latency over each of the three interconnects are shown in figure 3. Infiniband reaches a minimum of 7.0 µs, Quadrics shows 7.4 µs and Myrinet performs at 10.1 µs. Both Quadrics and Myrinet are more affected than Infiniband. This shows that Infiniband performs better in communication-intensive environments than Quadrics and Myrinet, even though its best-case unidirectional latency is the worst of the three.
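The ping-pong measurement described above is simple enough to sketch in a few lines of MPI. The following is only an illustration of the method, not the benchmark code used in [1]; the message size, iteration count and warm-up phase are arbitrary choices made here for the example.

```c
/* Minimal MPI ping-pong latency sketch: rank 0 sends to rank 1 and waits
 * for the echo; one-way latency is estimated as half the round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000, warmup = 1000, size = 4;   /* 4-byte messages */
    char *buf = calloc(1, size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = 0.0;
    for (int i = 0; i < iters + warmup; i++) {
        if (i == warmup)
            t0 = MPI_Wtime();                           /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) / iters / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two processes (e.g., one per node) so that the measured time reflects the interconnect rather than shared-memory communication.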

Fig. 1. MPI Latency across three interconnects
Fig. 2. MPI Bandwidth
Fig. 3. MPI Bi-directional Latency
Fig. 4. MPI Bi-directional Bandwidth
Fig. 5. MPI Host Overhead in Latency Test
Fig. 6. Overlap Potential

3.2.2 Bandwidth

This test is used to determine the maximum sustained data rate that can be achieved at the network level using non-blocking MPI functions. Unidirectional results are shown for groups of 4 and 16 consecutive messages of varying sizes on each of the interconnects. Infiniband performance is significantly superior, with a maximum bandwidth of 841 MByte/s, compared with 308 MByte/s and 235 MByte/s achieved by Quadrics and Myrinet, respectively. Results are shown in figure 2. When analysing the performance of Infiniband over PCI-Express [7], a maximum unidirectional bandwidth of 971 MByte/s is achieved using the same MPI implementation used in the previous tests; with a new MPI implementation that uses two ports of an Infiniband HCA, a bandwidth of 1497 MByte/s is achieved. Recent Myrinet implementations [6] achieve 495 MByte/s as the maximum bandwidth, while Quadrics QsNet II [5] reaches its best bandwidth at 912 MByte/s.

The bi-directional bandwidth analysis shows that Infiniband goes up to 900 MByte/s, Quadrics achieves 375 MByte/s and Myrinet 473 MByte/s; however, for large messages Myrinet bandwidth decreases. Bi-directional bandwidth tests on Infiniband using the PCI-Express bus instead of PCI-X, with an MPI implementation that uses one communication port, show a bandwidth of 1927 MByte/s for large messages [7]; with the MPI implementation that uses two ports of an Infiniband HCA, the bandwidth goes up to 2721 MByte/s. Comparing these measurements with the results obtained using the PCI-X bus, it is obvious that the PCI-X bus is a major bottleneck for Infiniband performance. QsNet II over PCI-X gives bi-directional bandwidth values that are also around 900 MByte/s [5]; here it is also visible that PCI-X sets a limit on QsNet II performance. Myrinet bandwidth using the GM-2 message passing layer, on the other hand, reaches its limit at 770 MByte/s. Theoretically, Myrinet performance under these conditions should approach 1 GByte/s of bandwidth, or 500 MByte/s over each link, but unfortunately the computational overhead in the GM-2 firmware limits the performance [6].
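A bandwidth test of this kind keeps a window of non-blocking operations in flight before waiting for their completion. Below is a minimal sketch of the idea; the window size, message size and per-window acknowledgement are assumptions made for this illustration rather than the exact parameters used in [1].

```c
/* Streaming bandwidth sketch: rank 0 keeps WINDOW non-blocking sends in
 * flight, rank 1 pre-posts the matching receives and acknowledges each
 * window; bandwidth = bytes moved / elapsed time at the sender. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 16                        /* messages in flight per window */

int main(int argc, char **argv)
{
    const int iters = 100;
    const int size = 1 << 20;            /* 1 MiB messages */
    char *buf = calloc(WINDOW, size);    /* a distinct slice per in-flight message */
    MPI_Request req[WINDOW];
    char ack = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        for (int w = 0; w < WINDOW; w++) {
            char *p = buf + (size_t)w * size;
            if (rank == 0)
                MPI_Isend(p, size, MPI_CHAR, 1, w, MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(p, size, MPI_CHAR, 0, w, MPI_COMM_WORLD, &req[w]);
        }
        if (rank == 0) {
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n",
               (double)iters * WINDOW * size / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

The acknowledgement after each window keeps sender and receiver in step, so the timing reflects data actually delivered rather than data merely buffered by the MPI library.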

3.2.3 Host Overhead

Host overhead is a very important parameter in the performance analysis of network interconnects for HPC cluster computing, because the more CPU resources are used to carry out communication among the computational nodes, the fewer CPU resources are available for actually performing computations, which negates one of the main advantages of grouping computers together in a cluster. The overhead includes both the sender side and the receiver side, and it is obtained by measuring the time spent in communication calls. For Myrinet and Infiniband, the overheads are around 0.8 µs and 1.7 µs, respectively, for short messages, and increase only slightly as the message size grows. The Quadrics overhead is around 3.3 µs, higher than the other two despite Quadrics having the lowest latency.
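One simple way to approximate the sender-side host overhead, sketched below, is to time only the MPI calls themselves while leaving a gap in which the transfer can progress without the host CPU. This is only an illustration of the general idea; the exact procedure and parameters used in [1] may differ.

```c
/* Host-overhead sketch (sender side): time spent inside MPI_Isend and
 * MPI_Wait is attributed to the host CPU; the gap between them is filled
 * with computation so the transfer itself can progress off the CPU. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;             /* keeps the dummy work from being optimized away */

static void compute(long n)              /* dummy computation of about n floating point ops */
{
    double x = 1.0;
    for (long i = 0; i < n; i++)
        x = x * 1.000001 + 0.000001;
    sink = x;
}

int main(int argc, char **argv)
{
    const int iters = 1000, size = 1024;
    char *buf = calloc(1, size);
    double mpi_time = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < iters; i++) {
        MPI_Request req;
        if (rank == 0) {
            double t = MPI_Wtime();
            MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            mpi_time += MPI_Wtime() - t;

            compute(100000);             /* overlap window: transfer runs in the background */

            t = MPI_Wtime();
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            mpi_time += MPI_Wtime() - t;
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    if (rank == 0)
        printf("average sender-side overhead: %.2f us\n", mpi_time / iters * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```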

3.2.4 Communication/Computation Overlap

Another important characteristic of HPC cluster network interconnects is the ability to overlap communication among nodes with computation without greatly diminishing the computations performed on the cluster. Communication/computation overlapping is a technique that MPI programmers use to achieve better performance. In order to test the overlap capabilities of each interconnect under consideration, non-blocking MPI functions are used: the nodes issue send and receive operations while performing a computation loop, and the overlap potential is measured as the maximum time that the computational loop can run without increasing the measured latency. Results for this test are presented in figure 6, where a higher value represents a better overlap capability. For small messages, Infiniband and Myrinet have better overlap potential than Quadrics because of their higher latencies and lower host overheads. Nevertheless, the Quadrics overlap potential increases steadily with message size, whereas the overlap potential for Myrinet and Infiniband reaches a steady value for large messages. The basic reason for this is the type of protocol used by the MPI implementations: for small messages the eager protocol is used, whereas for large messages the rendezvous protocol is used. The rendezvous protocol requires a handshake between the nodes involved in the communication, and in the case of Infiniband and Myrinet this handshake requires host intervention, thus decreasing the overlap potential. Quadrics is able to make communication progress asynchronously by taking advantage of its programmable network interface.
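The overlap measurement can be sketched as follows: post a non-blocking send/receive pair, run a dummy computational loop of configurable length, then wait. Sweeping the amount of computation upward and observing when the measured time per iteration starts to grow yields the overlap potential. The loop granularity and message size below are illustrative assumptions, not the values used in [1].

```c
/* Communication/computation overlap sketch: rank 0 posts a non-blocking
 * send and the receive for the echo, runs 'work' iterations of dummy
 * computation, then waits for both requests to complete. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;

static void compute(long n)              /* dummy computational loop */
{
    double x = 1.0;
    for (long i = 0; i < n; i++)
        x = x * 1.000001 + 0.000001;
    sink = x;
}

int main(int argc, char **argv)
{
    const int iters = 1000, size = 1024;
    long work = argc > 1 ? atol(argv[1]) : 0;   /* computation per iteration */
    char *sbuf = calloc(1, size), *rbuf = calloc(1, size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Request req[2];
            MPI_Irecv(rbuf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[1]);
            compute(work);                       /* computation overlapped with the transfer */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(rbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(rbuf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("work=%ld  time per iteration: %.2f us\n",
               work, (MPI_Wtime() - t0) / iters * 1e6);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```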

3.2.5 Impact of Buffer Reuse

In most micro-benchmarks designed to test communication performance, the sender and receiver sides each use only one buffer, which is reused until the test finishes. Real applications, however, usually use many buffers for communication. The buffer reuse pattern can have a significant impact on performance due to the address translation mechanisms used in these interconnects and the registration and de-registration of buffers needed to achieve zero-copy communication; this is an effect that tests using only one buffer cannot characterize. The test is performed by defining a buffer reuse percentage R, which represents the fraction of iterations that are performed with one single buffer. By changing R (0%, 50%, 100%) the effect of the buffer reuse pattern on communication performance is observed. In this test all three MPI implementations prove to be very sensitive to buffer reuse patterns: as the percentage of buffer reuse decreases, network performance is strongly affected. Quadrics performance is greatly reduced when the buffer reuse percentage is lowered; Infiniband is also affected, but to a lesser extent; Myrinet handles communication patterns with low buffer reuse percentages more gracefully than both Infiniband and Quadrics [1].
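A buffer-reuse pattern with a given percentage R can be generated along the following lines (again a simplified sketch, not the benchmark of [1]): R out of every 100 round trips use one fixed buffer, which stays registered after its first use, while the remaining round trips use fresh buffers and therefore pay the registration and address-translation cost again.

```c
/* Buffer-reuse sketch: with reuse percentage R, R out of every 100 round
 * trips use one fixed (already registered) buffer, while the rest cycle
 * through fresh buffers taken from a pre-allocated pool. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 64 * 1024;
    const int R = argc > 1 ? atoi(argv[1]) : 100;     /* reuse percentage */
    char **pool = malloc(iters * sizeof(char *));
    char *fixed = calloc(1, size);
    int rank;

    for (int i = 0; i < iters; i++)
        pool[i] = calloc(1, size);       /* distinct buffers for the "no reuse" case */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        char *buf = (i % 100 < R) ? fixed : pool[i];  /* reused vs. fresh buffer */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("R=%d%%  round-trip time: %.2f us\n",
               R, (MPI_Wtime() - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}
```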

3.2.6 Memory Usage

The amount of memory allocated by the processing nodes to establish communication through a given network interconnect is a very important parameter for both the scalability of the cluster and the communication/computation overlap potential, because the more memory the MPI implementation allocates, the more likely it is to adversely affect application performance. For this test a simple MPI barrier program was run and the amount of memory used was measured. Results show that Myrinet and Quadrics consume a relatively small amount of memory which does not increase with the number of nodes. Memory consumption over Infiniband, on the other hand, increases with the number of nodes. The reason is that, in the current implementation, a connection is set up between every pair of nodes during initialization and a certain amount of memory is reserved for each connection; total memory consumption therefore increases with the number of connections. Some alternative solutions for this problem are suggested in [1].
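The measurement itself can be approximated with a trivial program along these lines. Reading the resident set size from /proc/self/status is a Linux-specific convenience chosen for this sketch; it is not necessarily the method used in [1].

```c
/* Memory-usage sketch: after MPI_Init has set up all connections, each
 * process reports its resident set size (Linux: VmRSS in /proc/self/status). */
#include <mpi.h>
#include <stdio.h>

static long vmrss_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);         /* the whole "application" */

    printf("rank %d of %d: resident memory %ld kB\n", rank, nprocs, vmrss_kb());

    MPI_Finalize();
    return 0;
}
```

Running this with an increasing number of processes exposes the per-connection memory growth described above.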

3.3 Comparison Tests at the Application Level

MPI performance parameters such as latency, bandwidth, overhead, computation/communication overlap, memory usage and buffer reuse may have a significant effect on application performance in a cluster. It is therefore of great interest to take a closer look at the relationship between these parameters and the applications because, after all, the whole point of building computer clusters is to obtain very good application performance. The series of tests performed in [1] to analyse the performance of each interconnect is based on the NAS Parallel Benchmarks⁵ and sweep3D [13]. These tests show important results that can be summarized as follows.

⁵ The Numerical Aerodynamic Simulation (NAS) program comprises a set of benchmarks which are derived from computational fluid dynamics codes and have gained wide acceptance as a standard indicator of supercomputer performance.

3.3.1 Overall Application Performance Results

For all the benchmarks performed over each interconnect, MPI over Infiniband performs better than the other two implementations. One of the decisive factors is the higher bandwidth available over Infiniband, especially when very large messages are involved in the communication. For applications with small messages, the performance of Quadrics and Myrinet is more comparable to that of Infiniband because they are optimized for this kind of communication.

3.3.2 Scalability with System Size

Performance was measured using 2, 4 and 8 nodes for running the tests over each interconnect, and the execution time on each network was recorded. All three interconnects show improvements in execution time as more nodes are added to the computational cluster, so in general they all scale well; however, for some of the tests Infiniband performs better than Myrinet and Quadrics [1].

3.3.3 Impact of Computation/Communication Overlap

Characterizing the effect of communication/computation overlap is a difficult task in real applications. In order to study it, the non-blocking MPI calls made by the applications are collected in an extended MPI activity log. The information provided by the log shows that some applications do not even make use of non-blocking calls, and for those that do, the average message size is large. The large message size gives some advantage to MPI over Quadrics, which, in the computation/communication overlap tests shown before, performs better with large messages.

3.3.4 Impact of Buffer Reuse

The results of this test show that the impact of buffer reuse does not depend on the MPI implementation itself. As stated before, all three MPI implementations are seriously affected by low buffer reuse rates, so the real impact of buffer reuse depends on whether a given application is optimized to reuse its buffers or not. The applications that form part of these tests are optimized to do so, and thus buffer reuse has no big impact on their performance.

4 Future Trends in High Performance Network Interconnects

A series of tests performed on the three most popular network interconnects available today shows that they all have their advantages and disadvantages. They all show relatively low latencies, high reliability and higher bandwidths than other commodity interconnects such as Gigabit Ethernet. None of them can be readily discarded, but an analysis of which will be preferred by the HPC community may be carried out based on the results of the tests and other factors such as price/performance, available support, available software, available providers and so on.

The current leading interconnect in the HPC scene is Myrinet, used by 38.6% of the fastest computers in the world as stated in the November 2004 Top500 list. In this list it is also surprising, and somewhat contradictory with what has been presented in this paper, that the second most used interconnect is Gigabit Ethernet, with 35.2% of the group. This has several reasons, the most important being that Gigabit Ethernet is the cheapest solution available on the market, and for many commercial applications such as data centres and web servers it provides enough performance; it is, however, not good enough for scientific applications, which are more demanding. This can be seen when the computers or clusters in the Top500 list are grouped by interconnect family and by performance capabilities at the same time [14]: Gigabit Ethernet is not at the top of that grouping. Going back to the Top500 list from November 2004, Quadrics is used by 4.4% of the list and Infiniband by only 2.2%. In spite of having been very well received in the supercomputer industry, it seems that Infiniband is not as popular as claimed; one factor that may explain the low percentage of Infiniband clusters in the Top500 is that Infiniband only hit the market in late 2003, and by that time any new powerful cluster was already budgeted and most probably under construction. It remains to be seen whether this figure will increase when the June 2005 list is released.


First, it is important to notice that Infiniband has several technical advantages over Myrinet and Quadrics, the main one being that it is designed to connect directly to the main processor logic, thus avoiding the bottleneck created by the PCI bus or any of its variants. This type of implementation has not yet been realised, but a very close form of it has been launched in a product known as InfiniPath, which connects directly to the HyperTransport interface of AMD Opteron processors. It remains to be shown whether this can readily be used in practical applications and whether other manufacturers will continue to implement this type of solution. The fact that Infiniband is not intrinsically bound to the PCI bus, as opposed to Myrinet and Quadrics, is a major factor in the possibilities for its future development.

From a different point of view, Infiniband has another advantage that turned out to be of great use for Ethernet in its time: it is an open standard supported and developed by a great number of important manufacturers in the computer industry. This gives the product several advantages, such as a continuous price drop due to competition, constant availability of new solutions developed by different vendors, broad support and greater availability in commodity servers manufactured by those companies that prefer the product. Currently companies such as IBM, Intel, Sun, HP, Dell, Voltaire and Mellanox, among others, offer or are developing solutions to be used over Infiniband networks, either at the HCA level or at the switch/router level. Even more important is the fact that major development projects are under way to produce faster, cheaper and smaller Infiniband components.

From the cost point of view, Myrinet offers prices that are slightly lower than those of Infiniband, and Quadrics is the most expensive solution. Here it is again evident that, if budget is the major conditioning factor, Gigabit Ethernet is a good choice, with prices roughly 20 times lower. The prices for different interconnect solutions during the first quarter of 2005 are shown in Table 2.

Table 2. Component cost for different interconnect solutions [15].

| Technology | Topology | NIC Cost | Switch Cost/node | Cost/Node | MPI Lat (µs) | 1-Dir MB/s | Bi-Dir MB/s |
|---|---|---|---|---|---|---|---|
| Gig Ethernet | Bus | $50 | $50 | $100 | 30 | 100 | 150 |
| SCI | Torus | $1,600 | $0 | $1,600 | 5 | 300 | 400 |
| QsNetII (R) | Fat Tree | $1,200 | $1,700 | $2,900 | 3 | 880 | 900 |
| QsNetII (E) | Fat Tree | $1,000 | $700 | $1,700 | 3 | 880 | 900 |
| Myrinet (D) | Clos | $595 | $400 | $995 | 6.5 | 240 | 480 |
| Myrinet (E) | Clos | $995 | $400 | $1,395 | 6 | 450 | 900 |
| IB 4x | Fat Tree | $1,000 | $400 | $1,400 | 6 | 820 | 790 |
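To put these per-node figures in perspective, consider a hypothetical 64-node cluster; the following is an illustrative calculation based only on the NIC and per-node switch costs of Table 2, ignoring cables, host hardware and volume discounts:

Gigabit Ethernet: 64 × $100 = $6,400
Myrinet (D): 64 × $995 = $63,680
InfiniBand 4x: 64 × $1,400 = $89,600
QsNet II (R): 64 × $2,900 = $185,600

The interconnect alone thus accounts for roughly a tenfold to thirtyfold cost difference between Gigabit Ethernet and the high-performance options, which is consistent with the price argument made above.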

Although it has been shown here that Infiniband has the greatest chance of dominating the cluster computing market in the coming years, it is by no means the perfect or definitive solution, nor is its success guaranteed. There are still several shortcomings: very few support tools are available for the implementation and troubleshooting of Infiniband-based networks, and while a great number of companies are working on developing Infiniband products, none of the big players (such as IBM, Dell and others) has yet made any major release of products containing Infiniband solutions, which is a key factor in boosting sales of Infiniband as a mass product and thus lowering prices and making the technology more broadly available.

5 Conclusion

In this paper some properties of the latest network interconnect technologies for HPC cluster computing have been discussed and compared through a series of performance analyses at both the MPI implementation level and the application level. These tests show that all three interconnects have very good latency performance, but in the case of bandwidth it is also clear that in all current implementations the PCI bus, in any of its versions, is a major bottleneck. The importance of host overhead and of the communication/computation overlap potential was also discussed as a major factor in application performance. One of the most important results that can be highlighted is the major limitation that PCI buses impose on network communication in general; it is clear that this bottleneck must be overcome to achieve optimal performance in the future. After a thorough analysis of the technical characteristics and the market trends in the network interconnect industry, Infiniband seems to be the best candidate to dominate the industry in the forthcoming years.

6 References

1. J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. K. Panda. Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics. In SuperComputing 2003 Conference, Phoenix, AZ, November 2003.
2. Cambridge Consulting. The Optimal Interconnect for High Performance Clustered Environments: A Look at InfiniBand and Hewlett-Packard's Implementation. Cambridge Consulting, May 30, 2004.
3. Mellanox Technologies Inc. Understanding PCI Bus, PCI-Express and InfiniBand Architecture: Interaction among the Three Technologies. Rev. 1.20.
4. William T. Futral. InfiniBand Architecture Development and Deployment: A Strategic Guide to Server I/O Solutions. Intel Corporation, 2001.
5. Jon Beecroft, David Addison, David Hewson, Moray McLaren, Fabrizio Petrini, and Duncan Roweth. Quadrics QsNet II: Pushing the Limit of the Design of High-Performance Networks for Supercomputers. In IEEE Micro, March 2005.
6. Myricom. Myrinet Performance Measurements. http://www.myri.com/myrinet/performance/. November 2004.
7. J. Liu, A. Mamidala, A. Vishnu, and D. K. Panda. Performance Evaluation of InfiniBand with PCI Express. In IEEE Micro, 2005.
8. J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. Kini, P. Wyckoff, and D. K. Panda. Micro-Benchmark Performance Comparison of High-Speed Cluster Interconnects. In IEEE Micro, January/February 2004.
9. Cisco Systems Inc. Understanding Server Interconnect Technology. White paper, 2005.
10. Aad J. van der Steen and Jack J. Dongarra. Overview of Recent Supercomputers. October 7, 2004.
11. R. Martin, A. Vahdat, D. Culler, and T. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the International Symposium on Computer Architecture, 1997.
12. David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS Parallel Benchmarks 2.0. In The International Journal of Supercomputer Applications, 1995.
13. Los Alamos National Laboratory (LANL). LANL's ASCI SWEEP3D Compact Application. http://www.c3.lanl.gov/par_arch/CODES/SWEEP3D/sweep3d_readme.html
14. Jack Dongarra. Present and Future Supercomputer Architectures and their Interconnects. In the International Supercomputer Conference, Heidelberg, Germany, June 22-25, 2004.
15. Jack Dongarra. An Overview of Supercomputers, Clusters and Grid. In Teraflop Workbench, March 18, 2005.
16. Lloyd Dickman. An Introduction to the PathScale InfiniPath HTX Adapter. PathScale Inc., 2005.