State-of-the-art Network Interconnects for Computer Clusters in High Performance Computing

Rubén D. Sousa De León

Technische Universität München, Computational Science and Engineering M.Sc. Program, Boltzmannstr. 3, Munich, Germany [email protected]

Abstract. This paper presents a qualitative analysis of three of the most widely used interconnect technologies in the high performance computing (HPC) scene today: Myrinet, Quadrics, and Infiniband. The most important properties of each interconnect technology are described, and the role each of them plays in the efficiency of a clustered system is analysed. A comparison of the performance of each interconnect at the MPI level and at the application level is then presented, using results obtained from tests performed by different teams at several research institutes in the United States. Finally, the future trends of high performance network interconnect technologies are analysed based on the results of these comparisons and on the current behaviour of the market with respect to product development and support by the major manufacturers in the industry.

1 Introduction

During the past few years the rapid fall in the prices of individual computers and the fast increase in their computing capabilities have led to the idea of grouping individual servers together in clusters as an alternative for high performance computing (HPC) applications, one that is cheaper and thus more accessible than the traditional concept of custom-made supercomputers. The great problem with using individual servers interconnected through some sort of network is the existence of several bottlenecks that decrease the overall performance of such systems. That is the main reason for developing special network interconnects designed to meet the requirements of high performance computing, which are, mainly, low inter-node communication latency, high bandwidth for transmitting messages between nodes, scalability, programmability and reliability.

Currently [2] we may find a wide spectrum of network interconnect technologies available in the HPC industry, both proprietary and open (both single- and multi-vendor). Among the proprietary interconnects are HP's HyperFabric2, HP's ServerNet II, IBM's SP Switch 2, SGI's NUMAlink and Sun's Sun Fire Link. On the other hand, there are several products with public specifications that are only available from one specific vendor; among these, Myrinet from Myricom, QsNet and QsNet II from Quadrics, the Gigabyte System Network (GSN) from SGI and the Scalable Coherent Interface (SCI) from Sun may be mentioned. A third category of interconnects comprises those with open specifications that are industry standards available from multiple vendors; the most important of these are Infiniband and Gigabit Ethernet.

Table 1. Network interconnect technologies available at present [2].

| Technology | Vendor | Latency | Bandwidth per link (unidirectional) |
|---|---|---|---|
| NUMAlink | SGI | 1.5 – 3 µsec | 1500 MB/s |
| QsNet II | Quadrics | 1.6 µsec | 900 MB/s |
| ServerNet | HP | 3 µsec | 125 MB/s |
| Sun Fire Link | Sun | 3 – 5 µsec | 792 MB/s |
| Myrinet XP2 | Myricom | 5.5 µsec | 495 MB/s |
| Myrinet XP | Myricom | 7 – 9 µsec | 230 MB/s |
| SCI | Sun | 10 µsec | 1 Gb/s |
| GSN | SGI | 13 – 30 µsec | 800 MB/s |
| SP Switch 2 | IBM | 18 µsec | 500 MB/s |
| HyperFabric2 | HP | 22 µsec | 320 MB/s |
| InfiniBand | Multiple | 3 – 20 µsec | 785 MB/s |
| Gigabit Ethernet | Multiple | 40 µsec | 128 MB/s |

Of all the products mentioned above for linking together the nodes of an HPC cluster, only three are considered in this paper: Infiniband, Myrinet and Quadrics. The proprietary interconnects are not considered for analysis because they are specifically designed to meet custom requirements which are not standard for all users. Among the industry-standard products, Gigabit Ethernet is not considered in spite of being one of the most widely used technologies in the Top500 listings. The main reason for leaving it out is that it does not meet the requirements of HPC applications, such as very high bandwidth and very low latency.

2 Main Characteristics of the Interconnects to be Compared

2.1 Infiniband

The Infiniband Architecture was defined as an industry standard in June 2001 and is the result of the fusion of the Next Generation I/O (NGIO) and Future I/O (FIO) initiatives. The InfiniBand Trade Association (IBTA) was formed by industry leaders with the goal of designing a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies. It set out to create a single, unified I/O fabric that takes the best features of currently available technologies and merges them into one open standard. The standard [4] includes product definitions for both copper and glass-fibre connections, switch and router properties, definitions for high-bandwidth multiple connections, a description of the way messages are broken up into packets and reassembled, as well as routing, prioritizing and error handling capabilities. This makes Infiniband independent of any particular technology, which is one of its strong points.

The Infiniband architecture defines some revolutionary concepts that have been key to its rapid acceptance as an inter-node communication layer. It addresses one of the most important issues that has diminished the performance of clusters for HPC, namely the I/O bottleneck created by the limitations of the PCI bus. To eliminate some of these limitations, several changes in the way I/O communication and inter-process communication (IPC) are performed are introduced. On one side, it changes from traditional memory-mapped access to channel-based access, which results in higher CPU efficiency, greater scalability, memory isolation and easier recovery. Also, and perhaps the most important contribution, it introduces the concept of a switched fabric or "system area network" that interconnects all of the I/O nodes and processing nodes through dedicated links consisting of cascaded switches, instead of the usual parallel buses with all their limitations. In a switched fabric, I/O devices may reside outside the box, directly connected to the network through a Target Channel Adapter (TCA), thus completely eliminating the overhead caused by requesting the CPU to fetch information from memory and then process it. Another improvement that also relieves load from the CPU is the elimination of Load/Store operations, which are replaced by DMA scheduling that allows the hardware to move data without CPU intervention. The architecture also offers reliability features such as quality of service (QoS), fail-over in switches and support for redundant fabrics.

An Infiniband network consists of three main building blocks: Host Channel Adapters (HCA), which allow processing nodes and I/O nodes to be connected to the fabric; Target Channel Adapters (TCA), which allow I/O devices to be connected directly to the fabric without the intervention of a host node; and Infiniband switches, which connect all the hosts and devices together. Infiniband currently defines three levels of connection links: a basic 1X link with a speed of 2.5 Gbit/s (312.5 MByte/s), a 4X link with a speed of 10 Gbit/s (1.25 GByte/s) and a 12X link with a speed of 30 Gbit/s (3.75 GByte/s). Currently, Infiniband still makes use of the PCI bus, either as PCI-X or PCI-Express, to connect nodes to the switched fabric, so the actual performance is limited by the specific PCI bus used. Eventually the switched fabric will connect directly to the system logic [3].
Clusters connected through an Infiniband network are usually arranged in a quaternary fat-tree¹ topology, which allows for full bisection bandwidth and good scalability of the network. Currently a single subnet can accommodate up to 48000 nodes, and subnets can be connected with each other through Infiniband routers. The MPI² standard is implemented over Infiniband by a variant called MVAPICH, which uses the VAPI software interface of the Infiniband HCAs. This implementation is discussed in [11].

Although Infiniband has gained great acceptance in the network interconnect industry, it is not the definitive solution. Some of the disadvantages that can be foreseen include the following: there is only one provider of an InfiniBand core chipset (Mellanox), constituting a risk of supply disruption; this lack of diversity also generally reduces innovation and over time drives prices higher; there is limited expertise in the field with the protocol, which may complicate implementation; it is a new networking protocol and relies heavily on gateway functionality (storage area network and IP) to get the full benefits of a unified fabric; only a limited set of diagnostic and troubleshooting tools is available in shipping products; and it has had limited use in enterprise environments, which is the true test for sustaining long-term viability of the protocol [9].

¹ Fat tree: a network that has the structure of a binary (or quad) tree but that is modified such that the available bandwidth is higher near the root than near the leaves. This stems from the fact that a root processor often has to gather or broadcast data to all other processors, and without this modification contention would occur near the root.
² MPI: the Message Passing Interface, a message passing library that implements the message passing style of programming. Presently MPI is the de facto standard for this kind of programming.

2.2 Myrinet

Myrinet was created in 1994 by Myricom and since 1998 has been an American National Standard, ANSI/VITA 26-1998, developed under the auspices of the VMEbus International Trade Association (VITA). The current implementations of Myrinet offer bandwidths ranging from 200 to 500 MByte/s per link, which allows a maximum of 1 GByte/s in bidirectional communication (500 MByte/s in each direction), and inter-node latencies as low as 5 µs. Myrinet networks consist of two major components: PCI-X network interface cards (NICs) and Myrinet 2000 switches with a maximum of 32 ports. A product called "Network in a box" is also offered, which allows up to 256 nodes to be connected, and several thousand nodes by adding more such devices. The cables for connecting the nodes are traditionally offered in copper, but the latest implementations use fibre cables.

Clusters connected through a Myrinet network use a Clos network topology³, which allows for full bisection and maximum throughput at the network leaves, as well as for deadlock-free routing. Myrinet also implements error control, flow control, link monitoring and adaptive routing protocols; the latter should allow the network to find alternative paths to deliver messages in case of failure of one or several nodes.

For sending and receiving messages through the network, Myrinet implements its own low-level message passing layer called GM. It provides several features such as: concurrent, protected, user-level access to the Myrinet NIC; reliable, ordered delivery of messages; automatic mapping and route computation; automatic recovery from transient network problems; scalability to thousands of nodes; low latency, high data rate and very low host-CPU utilization; and extensible software that allows simultaneous direct support of the GM API, IP (TCP, UDP), MPI and other APIs. The GM message passing layer achieves extremely low latency and CPU utilization overhead by means of a technique called "operating-system bypass" (OS-bypass). This consists in allowing application programs to send and receive messages without any system calls once memory has been allocated and registered. Instead, the GM API functions communicate through common memory with the Myrinet Control Program that executes continuously on the processor of the Myrinet NIC. Another feature that allows for low latency in the Myrinet network is the low latency of the Myrinet switch [9]. For HPC applications Myricom developed MPICH-GM, an MPICH implementation that exploits the full message passing capabilities of the GM layer. This is implemented by re-targeting the MPICH Channel Interface to the GM messaging layer.

Myrinet has some disadvantages related to the fact that it is a single-vendor solution with a long product lifecycle, which can lead to limited support from the vendor. Using Myrinet also requires twice as many switch ports as alternative solutions, so costs may rise significantly. Furthermore, Myrinet has a closed software support model with little software flexibility, which makes maintenance and implementation cumbersome, and there are currently no plans for interconnect speeds higher than 10 Gbit/s [9].

³ Clos network: a logarithmic network in which the nodes are attached to switches that form a spine that ultimately connects all nodes.

2.3 Quadrics

Quadrics traces its origins to 1990 and the British firm Meiko and its Computing Surface 2 (CS-2); after Meiko closed down, part of the CS-2 technology was further developed by the Italian firm Alenia, which evolved into today's Quadrics. The first implementation of the Quadrics network was QsNet, which proved to be a very fast and reliable network. Recently, the second generation, QsNet II, has been released.

The QsNet II communication layer consists of two building blocks: a programmable Elan4 network interface and the Elite4 switch, which are connected in a fat-tree topology. The Elan4 plugs into the PCI-X bus of a processing node that contains one or more CPUs and serves as an interface between this node and the rest of the high performance network. It has an internal 64-bit architecture and supports 64-bit virtual addresses. Besides generating and accepting packets to and from the network, it provides local processing power to implement the high-level message passing protocols required in parallel processing. The network is constructed from Elite4 switch components, which are capable of switching eight bidirectional communication links. The use of bidirectional links is a distinctive characteristic of Quadrics, which helps to a certain extent to reduce latency, currently as low as 3 µs. Each link carries data in both directions simultaneously at 1.3 GByte/s, and the link bandwidth is shared between two virtual channels. The network supports broadcast transmission across selected ranges of nodes in addition to point-to-point connectivity between arbitrary nodes. Currently the Elite4 switches are only available with 16 or 128 ports, with nothing in between, which reduces flexibility when putting together networks of other sizes. The switches support two priority levels, which greatly helps in achieving a fair distribution of message packets; this is more than Myrinet provides but less than the 16 priority levels of Infiniband. The in-switch latency is very low, about 35 ns. This is achieved thanks to the STEN (Short Transaction Engine) processor, which assembles short packets for transmission into the network and is optimized for short reads and writes and for protocol control; all packets it issues are pipelined, which provides very low latencies. The Elite4 switches are connected together in a fat-tree topology, which allows scaling up to several thousands of nodes.

Quadrics provides its libraries, libelan and libelan4, on top of the Elan4 network. Within these default Quadrics programming libraries, a parallel job first acquires a job-wise capability. Then each process is allocated a virtual process ID (VPID); together they form a static pool of processes, i.e., the process membership and the connections among them cannot change. Inter-process communication is supported by two different models: Queue-based Directed Message Access (QDMA) and Remote Directed Message Access (RDMA). QDMA allows processes to post messages (up to 2 KB) to a remote queue of another process; RDMA enables processes to write messages directly into remote memory exposed by other processes. Libelan also provides a very useful chained-event mechanism, which allows one operation to be triggered upon the completion of another; this can be utilized to support fast and asynchronous progress of two back-to-back operations. The MPI message passing standard is implemented over the libelan4 library in a custom implementation developed by Quadrics.

The major disadvantage of Quadrics is that it is the only company that develops, markets, sells and supports the product, which greatly limits the availability and frequency of new products and also keeps prices at a higher level than other interconnects.

3 Performance Comparison of Three Network Interconnects

The results presented in this section are extracted from tests performed by several research teams; the main results are taken from [1]. These were obtained using a common test bed for all three interconnects in order to reduce the influence of external factors on the results. The tests target two different levels of performance: the MPI implementation level and the application level. Each network interconnect was tested with an MPI implementation specially designed to obtain the best performance on the respective network: for Myrinet, MPICH-GM; for Infiniband, MVAPICH; and for Quadrics, MPICH-Qs. All of these are based on the widespread MPICH implementation of the MPI standard. Besides the tests performed at the MPI level, some tests at the application level are analysed to understand the effect that the MPI performance parameters have on application performance. The results analysed here allow comparisons among the three interconnects because the experimental setup is kept consistent.

Since [1] was published, improvements have been made to each network interconnect under analysis, and new communication buses such as PCI-Express have become available. The base comparison is made using the first Quadrics generation QsNet on a PCI bus, the second Myrinet generation Myrinet 2000 with the GM message layer on a PCI-X bus, and Infiniband 4X on a PCI-X bus. In order to give the most current view of the state of the interconnect technologies, some important results for the new QsNet II generation on a PCI-X bus [5], the new GM-2 messaging layer from Myrinet, also on a PCI-X bus [6], and Infiniband 4X on a PCI-Express bus [7] are also analysed. Nevertheless, these more recent results do not allow valid comparisons to be made because, as opposed to the results presented in [1], they were not measured under common, controlled experimental conditions.

3.2 Comparison Tests at the MPI Level

3.2.1 Latency

This test is conducted in a ping-pong fashion and the latency⁴ is derived from the round-trip time. The smallest latency for small message sizes was shown by Quadrics (4.6 µs), followed by Myrinet (6.7 µs) and Infiniband (6.8 µs), but for large message sizes Infiniband has a clear advantage due to its higher available bandwidth. Results are shown in figure 1. For the recent QsNet II implementation of Quadrics the latency is further improved to a minimum of 1.38 µs [5]. In a similar way, the latency for Myrinet adapters with a 333 MHz bus and the GM-2 message layer (both upgrades in new Myrinet products) decreases to 5.7 µs [6]. In the case of Infiniband HCAs over a PCI-Express bus the latency decreases to 4.1 µs [7]. Even though it is not possible to perform valid comparisons with these newer values, it is evident that the new features implemented resulted in significant performance improvements.

⁴ Communication latency: the time overhead occurring when a message is sent over a communication network from one processor to another.

Although unidirectional tests provide important information about the peak performance of each interconnect, they do not reflect the real conditions under which communication takes place in a computer cluster. Bidirectional tests put more stress on the communication layer and may therefore provide more useful information for understanding the bottlenecks in communication. In these tests two nodes, sender and receiver, send data simultaneously. Results for bidirectional latency over each of the three interconnects are shown in figure 3. Infiniband reaches a minimum of 7.0 µs, Quadrics shows 7.4 µs and Myrinet performs at 10.1 µs. Both Quadrics and Myrinet are more affected than Infiniband. This shows that Infiniband performs better in communication-intensive environments than Quadrics and Myrinet, even though its best-case unidirectional latency is the worst of the three.
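The ping-pong measurement described above is simple enough to sketch in a few lines of MPI. The following is only an illustration of the method, not the benchmark code used in [1]; the message size, iteration count and warm-up phase are arbitrary choices made here for the example.

```c
/* Minimal MPI ping-pong latency sketch: rank 0 sends to rank 1 and waits
 * for the echo; one-way latency is estimated as half the round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000, warmup = 1000, size = 4;   /* 4-byte messages */
    char *buf = calloc(1, size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = 0.0;
    for (int i = 0; i < iters + warmup; i++) {
        if (i == warmup)
            t0 = MPI_Wtime();                           /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) / iters / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two processes (e.g., one per node) so that the measured time reflects the interconnect rather than shared-memory communication.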

Fig. 1. MPI Latency across three interconnects
Fig. 2. MPI Bandwidth
Fig. 3. MPI Bi-directional Latency
Fig. 4. MPI Bi-directional Bandwidth
Fig. 5. MPI Host Overhead in Latency Test
Fig. 6. Overlap Potential

3.2.2 Bandwidth

This test is used to determine the maximum sustained data rate that can be achieved at the network level using non-blocking MPI functions. Unidirectional results are shown for groups of 4 and 16 consecutive messages of varying sizes on each of the interconnects. Infiniband performance is significantly superior, with a maximum bandwidth of 841 MByte/s, compared with 308 MByte/s and 235 MByte/s achieved by Quadrics and Myrinet, respectively. Results are shown in figure 2. When analysing the performance of Infiniband over PCI-Express [7], a maximum unidirectional bandwidth of 971 MByte/s is achieved using the same MPI implementation used in the previous tests; with a new MPI implementation that uses two ports of an Infiniband HCA, a bandwidth of 1497 MByte/s is achieved. Recent Myrinet implementations [6] achieve 495 MByte/s as the maximum bandwidth, while Quadrics QsNet II [5] reaches its best bandwidth at 912 MByte/s.

The bi-directional bandwidth analysis shows that Infiniband goes up to 900 MByte/s, Quadrics achieves 375 MByte/s and Myrinet 473 MByte/s; however, for large messages Myrinet bandwidth decreases. Bi-directional bandwidth tests on Infiniband using the PCI-Express bus instead of PCI-X, with an MPI implementation that uses one communication port, show a bandwidth of 1927 MByte/s for large messages [7]; with the MPI implementation that uses two ports of an Infiniband HCA, the bandwidth goes up to 2721 MByte/s. Comparing these measurements with the results obtained using the PCI-X bus, it is obvious that the PCI-X bus is a major bottleneck for Infiniband performance. QsNet II over PCI-X gives bi-directional bandwidth values that are also around 900 MByte/s [5]; here it is also visible that PCI-X sets a limit on QsNet II performance. Myrinet bandwidth using the GM-2 message passing layer, on the other hand, reaches its limit at 770 MByte/s. Theoretically, Myrinet performance under these conditions should approach 1 GByte/s of bandwidth, or 500 MByte/s over each link, but unfortunately the computational overhead in the GM-2 firmware limits the performance [6].
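A bandwidth test of this kind keeps a window of non-blocking operations in flight before waiting for their completion. Below is a minimal sketch of the idea; the window size, message size and per-window acknowledgement are assumptions made for this illustration rather than the exact parameters used in [1].

```c
/* Streaming bandwidth sketch: rank 0 keeps WINDOW non-blocking sends in
 * flight, rank 1 pre-posts the matching receives and acknowledges each
 * window; bandwidth = bytes moved / elapsed time at the sender. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 16                        /* messages in flight per window */

int main(int argc, char **argv)
{
    const int iters = 100;
    const int size = 1 << 20;            /* 1 MiB messages */
    char *buf = calloc(WINDOW, size);    /* a distinct slice per in-flight message */
    MPI_Request req[WINDOW];
    char ack = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        for (int w = 0; w < WINDOW; w++) {
            char *p = buf + (size_t)w * size;
            if (rank == 0)
                MPI_Isend(p, size, MPI_CHAR, 1, w, MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(p, size, MPI_CHAR, 0, w, MPI_COMM_WORLD, &req[w]);
        }
        if (rank == 0) {
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n",
               (double)iters * WINDOW * size / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

The acknowledgement after each window keeps sender and receiver in step, so the timing reflects data actually delivered rather than data merely buffered by the MPI library.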

3.2.3 Host Overhead

Host overhead is a very important parameter in the performance analysis of network interconnects for HPC cluster computing, because the more CPU resources are used to carry out communication among the computational nodes, the fewer CPU resources are available for actually performing computations, which negates one of the main advantages of grouping computers together in a cluster. The overhead includes both the sender side and the receiver side, and it is obtained by measuring the time spent in communication calls. For Myrinet and Infiniband, the overheads are around 0.8 µs and 1.7 µs, respectively, for short messages, and increase only slightly as the message size grows. The Quadrics overhead is around 3.3 µs, higher than the other two despite Quadrics having the lowest latency.
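One simple way to approximate the sender-side host overhead, sketched below, is to time only the MPI calls themselves while leaving a gap in which the transfer can progress without the host CPU. This is only an illustration of the general idea; the exact procedure and parameters used in [1] may differ.

```c
/* Host-overhead sketch (sender side): time spent inside MPI_Isend and
 * MPI_Wait is attributed to the host CPU; the gap between them is filled
 * with computation so the transfer itself can progress off the CPU. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;             /* keeps the dummy work from being optimized away */

static void compute(long n)              /* dummy computation of about n floating point ops */
{
    double x = 1.0;
    for (long i = 0; i < n; i++)
        x = x * 1.000001 + 0.000001;
    sink = x;
}

int main(int argc, char **argv)
{
    const int iters = 1000, size = 1024;
    char *buf = calloc(1, size);
    double mpi_time = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < iters; i++) {
        MPI_Request req;
        if (rank == 0) {
            double t = MPI_Wtime();
            MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            mpi_time += MPI_Wtime() - t;

            compute(100000);             /* overlap window: transfer runs in the background */

            t = MPI_Wtime();
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            mpi_time += MPI_Wtime() - t;
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    if (rank == 0)
        printf("average sender-side overhead: %.2f us\n", mpi_time / iters * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```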

3.2.4 Communication/Computation Overlap

Another important characteristic of HPC cluster network interconnects is the ability to overlap communication among nodes with computation without greatly diminishing the computations performed on the cluster. Communication/computation overlapping is a technique that MPI programmers use to achieve better performance. In order to test the overlap capabilities of each interconnect under consideration, non-blocking MPI functions are used: the nodes issue send and receive operations while performing a computation loop, and the overlap potential is measured as the maximum time that the computational loop can run without increasing the measured latency. Results for this test are presented in figure 6, where a higher value represents a better overlap capability. For small messages, Infiniband and Myrinet have better overlap potential than Quadrics because of their higher latencies and lower host overheads. Nevertheless, the Quadrics overlap potential increases steadily with message size, whereas the overlap potential for Myrinet and Infiniband reaches a steady value for large messages. The basic reason for this is the type of protocol used by the MPI implementations: for small messages the eager protocol is used, whereas for large messages the rendezvous protocol is used. The rendezvous protocol requires a handshake between the nodes involved in the communication, and in the case of Infiniband and Myrinet this handshake requires host intervention, thus decreasing the overlap potential. Quadrics is able to make communication progress asynchronously by taking advantage of its programmable network interface.
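The overlap measurement can be sketched as follows: post a non-blocking send/receive pair, run a dummy computational loop of configurable length, then wait. Sweeping the amount of computation upward and observing when the measured time per iteration starts to grow yields the overlap potential. The loop granularity and message size below are illustrative assumptions, not the values used in [1].

```c
/* Communication/computation overlap sketch: rank 0 posts a non-blocking
 * send and the receive for the echo, runs 'work' iterations of dummy
 * computation, then waits for both requests to complete. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;

static void compute(long n)              /* dummy computational loop */
{
    double x = 1.0;
    for (long i = 0; i < n; i++)
        x = x * 1.000001 + 0.000001;
    sink = x;
}

int main(int argc, char **argv)
{
    const int iters = 1000, size = 1024;
    long work = argc > 1 ? atol(argv[1]) : 0;   /* computation per iteration */
    char *sbuf = calloc(1, size), *rbuf = calloc(1, size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Request req[2];
            MPI_Irecv(rbuf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[1]);
            compute(work);                       /* computation overlapped with the transfer */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(rbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(rbuf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("work=%ld  time per iteration: %.2f us\n",
               work, (MPI_Wtime() - t0) / iters * 1e6);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```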

3.2.5 Impact of Buffer Reuse

In most micro-benchmarks designed to test communication performance, the sender and receiver sides each use only one buffer, which is reused until the test finishes. Real applications, however, usually use many buffers for communication. The buffer reuse pattern can have a significant impact on performance due to the address translation mechanisms used in these interconnects and the registration and de-registration of buffers needed to achieve zero-copy communication; this is an effect that tests using only one buffer cannot characterize. The test is performed by defining a buffer reuse percentage R, which represents the fraction of iterations that are performed with one single buffer. By changing R (0%, 50%, 100%) the effect of the buffer reuse pattern on communication performance is observed. In this test all three MPI implementations prove to be very sensitive to buffer reuse patterns: as the percentage of buffer reuse decreases, network performance is strongly affected. Quadrics performance is greatly reduced when the buffer reuse percentage is lowered; Infiniband is also affected, but to a lesser extent; Myrinet handles communication patterns with low buffer reuse percentages more gracefully than both Infiniband and Quadrics [1].
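A buffer-reuse pattern with a given percentage R can be generated along the following lines (again a simplified sketch, not the benchmark of [1]): R out of every 100 round trips use one fixed buffer, which stays registered after its first use, while the remaining round trips use fresh buffers and therefore pay the registration and address-translation cost again.

```c
/* Buffer-reuse sketch: with reuse percentage R, R out of every 100 round
 * trips use one fixed (already registered) buffer, while the rest cycle
 * through fresh buffers taken from a pre-allocated pool. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 64 * 1024;
    const int R = argc > 1 ? atoi(argv[1]) : 100;     /* reuse percentage */
    char **pool = malloc(iters * sizeof(char *));
    char *fixed = calloc(1, size);
    int rank;

    for (int i = 0; i < iters; i++)
        pool[i] = calloc(1, size);       /* distinct buffers for the "no reuse" case */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        char *buf = (i % 100 < R) ? fixed : pool[i];  /* reused vs. fresh buffer */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("R=%d%%  round-trip time: %.2f us\n",
               R, (MPI_Wtime() - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}
```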

3.2.6 Memory Usage

The amount of memory allocated by the processing nodes to establish communication through a given network interconnect is a very important parameter for both the scalability of the cluster and the communication/computation overlap potential, because the more memory the MPI implementation allocates, the more likely it is to adversely affect application performance. For this test a simple MPI barrier program was run and the amount of memory used was measured. Results show that Myrinet and Quadrics consume a relatively small amount of memory which does not increase with the number of nodes. Memory consumption over Infiniband, on the other hand, increases with the number of nodes. The reason is that, in the current implementation, a connection is set up between every pair of nodes during initialization and a certain amount of memory is reserved for each connection; total memory consumption therefore increases with the number of connections. Some alternative solutions for this problem are suggested in [1].
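The measurement itself can be approximated with a trivial program along these lines. Reading the resident set size from /proc/self/status is a Linux-specific convenience chosen for this sketch; it is not necessarily the method used in [1].

```c
/* Memory-usage sketch: after MPI_Init has set up all connections, each
 * process reports its resident set size (Linux: VmRSS in /proc/self/status). */
#include <mpi.h>
#include <stdio.h>

static long vmrss_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Barrier(MPI_COMM_WORLD);         /* the whole "application" */

    printf("rank %d of %d: resident memory %ld kB\n", rank, nprocs, vmrss_kb());

    MPI_Finalize();
    return 0;
}
```

Running this with an increasing number of processes exposes the per-connection memory growth described above.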

3.3 Comparison Tests at the Application Level

MPI performance parameters such as latency, bandwidth, overhead, computation/communication overlap, memory usage and buffer reuse may have a significant effect on application performance in a cluster. It is therefore of great interest to take a closer look at the relationship between these parameters and the applications because, after all, the whole point of building computer clusters is to obtain very good application performance. The series of tests performed in [1] to analyse the performance of each interconnect is based on the NAS Parallel Benchmarks⁵ and sweep3D [13]. These tests show important results that can be summarized as follows.

⁵ The Numerical Aerodynamic Simulation (NAS) program comprises a set of benchmarks which are derived from computational fluid dynamics codes and have gained wide acceptance as a standard indicator of supercomputer performance.

3.3.1 Overall Application Performance Results

For all the benchmarks performed over each interconnect, MPI over Infiniband performs better than the other two implementations. One of the decisive factors is the higher bandwidth available over Infiniband, especially when very large messages are involved in the communication. For applications with small messages, the performance of Quadrics and Myrinet is more comparable to that of Infiniband because they are optimized for this kind of communication.

3.3.2 Scalability with System Size

Performance was measured using 2, 4 and 8 nodes for running the tests over each interconnect, and the execution time on each network was recorded. All three interconnects show improvements in execution time as more nodes are added to the computational cluster, so in general they all scale well; however, for some of the tests Infiniband performs better than Myrinet and Quadrics [1].

3.3.3 Impact of Computation/Communication Overlap

Characterizing the effect of communication/computation overlap is a difficult task in real applications. In order to study it, the non-blocking MPI calls made by the applications are collected in an extended MPI activity log. The information provided by the log shows that some applications do not even make use of non-blocking calls, and for those that do, the average message size is large. The large message size gives some advantage to MPI over Quadrics, which, in the computation/communication overlap tests shown before, performs better with large messages.

3.3.4 Impact of Buffer Reuse

The results of this test show that the impact of buffer reuse does not depend on the MPI implementation itself. As stated before, all three MPI implementations are seriously affected by low buffer reuse rates, so the real impact of buffer reuse depends on whether a given application is optimized to reuse its buffers or not. The applications that form part of these tests are optimized to do so, and thus buffer reuse has no big impact on their performance.

4 Future Trends in High Performance Network Interconnects

A series of tests performed on the three most popular network interconnects available today shows that they all have their advantages and disadvantages. They all show relatively low latencies, high reliability and higher bandwidths than other commodity interconnects such as Gigabit Ethernet. None of them can be readily discarded, but an analysis of which will be preferred by the HPC community may be carried out based on the results of the tests and other factors such as price/performance, available support, available software, available providers and so on.

The current leading interconnect in the HPC scene is Myrinet, used by 38.6% of the fastest computers in the world as stated in the November 2004 Top500 list. In this list it is also surprising, and somewhat contradictory with what has been presented in this paper, that the second most used interconnect is Gigabit Ethernet, with 35.2% of the group. This has several reasons, the most important being that Gigabit Ethernet is the cheapest solution available on the market, and for many commercial applications such as data centres and web servers it provides enough performance; it is, however, not good enough for scientific applications, which are more demanding. This can be seen when the computers or clusters in the Top500 list are grouped by interconnect family and by performance capabilities at the same time [14]: Gigabit Ethernet is not at the top of that grouping. Going back to the Top500 list from November 2004, Quadrics is used by 4.4% of the list and Infiniband by only 2.2%. In spite of having been very well received in the supercomputer industry, it seems that Infiniband is not as popular as claimed; one factor that may explain the low percentage of Infiniband clusters in the Top500 is that Infiniband only hit the market in late 2003, and by that time any new powerful cluster was already budgeted and most probably under construction. It remains to be seen whether this figure will increase when the June 2005 list is released.


First, it is important to notice that Infiniband has several technical advantages over Myrinet and Quadrics, the main one being that it is designed to connect directly to the main processor logic, thus avoiding the bottleneck created by the PCI bus or any of its variants. This type of implementation has not yet been realised, but a very close form of it has been launched in a product known as InfiniPath, which connects directly to the HyperTransport interface of AMD Opteron processors. It remains to be shown whether this can readily be used in practical applications and whether other manufacturers will continue to implement this type of solution. The fact that Infiniband is not intrinsically bound to the PCI bus, as opposed to Myrinet and Quadrics, is a major factor in the possibilities for its future development.

From a different point of view, Infiniband has another advantage that turned out to be of great use for Ethernet in its time: it is an open standard supported and developed by a great number of important manufacturers in the computer industry. This gives the product several advantages, such as a continuous price drop due to competition, constant availability of new solutions developed by different vendors, broad support and greater availability in commodity servers manufactured by those companies that prefer the product. Currently companies such as IBM, Intel, Sun, HP, Dell, Voltaire and Mellanox, among others, offer or are developing solutions to be used over Infiniband networks, either at the HCA level or at the switch/router level. Even more important is the fact that major development projects are under way to produce faster, cheaper and smaller Infiniband components.

From the cost point of view, Myrinet offers prices that are slightly lower than those of Infiniband, and Quadrics is the most expensive solution. Here it is again evident that, if budget is the major conditioning factor, Gigabit Ethernet is a good choice, with prices roughly 20 times lower. The prices for different interconnect solutions during the first quarter of 2005 are shown in Table 2.

Table 2. Component cost for different interconnect solutions [15].

| Technology | Topology | NIC Cost | Switch Cost/node | Cost/Node | MPI Lat (µs) | 1-Dir MB/s | Bi-Dir MB/s |
|---|---|---|---|---|---|---|---|
| Gig Ethernet | Bus | $50 | $50 | $100 | 30 | 100 | 150 |
| SCI | Torus | $1,600 | $0 | $1,600 | 5 | 300 | 400 |
| QsNetII (R) | Fat Tree | $1,200 | $1,700 | $2,900 | 3 | 880 | 900 |
| QsNetII (E) | Fat Tree | $1,000 | $700 | $1,700 | 3 | 880 | 900 |
| Myrinet (D) | Clos | $595 | $400 | $995 | 6.5 | 240 | 480 |
| Myrinet (E) | Clos | $995 | $400 | $1,395 | 6 | 450 | 900 |
| IB 4x | Fat Tree | $1,000 | $400 | $1,400 | 6 | 820 | 790 |
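To put these per-node figures in perspective, consider a hypothetical 64-node cluster; the following is an illustrative calculation based only on the NIC and per-node switch costs of Table 2, ignoring cables, host hardware and volume discounts:

Gigabit Ethernet: 64 × $100 = $6,400
Myrinet (D): 64 × $995 = $63,680
InfiniBand 4x: 64 × $1,400 = $89,600
QsNet II (R): 64 × $2,900 = $185,600

The interconnect alone thus accounts for roughly a tenfold to thirtyfold cost difference between Gigabit Ethernet and the high-performance options, which is consistent with the price argument made above.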

Although it has been shown here that Infiniband has the greatest chance of dominating the cluster computing market in the coming years, it is by no means the perfect or definitive solution, nor is its success guaranteed. There are still several shortcomings: very few support tools are available for the implementation and troubleshooting of Infiniband-based networks, and while a great number of companies are working on developing Infiniband products, none of the big players (such as IBM, Dell and others) has yet made any major release of products containing Infiniband solutions, which is a key factor in boosting sales of Infiniband as a mass product and thus lowering prices and making the technology more broadly available.

5 Conclusion

In this paper some properties of the latest network interconnect technologies for HPC cluster computing have been discussed and compared through a series of performance analyses at both the MPI implementation level and the application level. These tests show that all three interconnects have very good latency performance, but in the case of bandwidth it is also clear that in all current implementations the PCI bus, in any of its versions, is a major bottleneck. The importance of host overhead and of the communication/computation overlap potential was also discussed as a major factor in application performance. One of the most important results that can be highlighted is the major limitation that PCI buses impose on network communication in general; it is clear that this bottleneck must be overcome to achieve optimal performance in the future. After a thorough analysis of the technical characteristics and the market trends in the network interconnect industry, Infiniband seems to be the best candidate to dominate the industry in the forthcoming years.

6 References

1. J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. K. Panda. Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics. In SuperComputing 2003 Conference, Phoenix, AZ, November 2003.
2. Cambridge Consulting. The Optimal Interconnect for High Performance Clustered Environments: A Look at InfiniBand and Hewlett-Packard's Implementation. Cambridge Consulting, May 30, 2004.
3. Mellanox Technologies Inc. Understanding PCI Bus, PCI-Express and InfiniBand Architecture: Interaction among the Three Technologies. Rev. 1.20.
4. William T. Futral. InfiniBand Architecture Development and Deployment: A Strategic Guide to Server I/O Solutions. Intel Corporation, 2001.
5. Jon Beecroft, David Addison, David Hewson, Moray McLaren, Fabrizio Petrini, and Duncan Roweth. Quadrics QsNet II: Pushing the Limit of the Design of High-Performance Networks for Supercomputers. In IEEE Micro, March 2005.
6. Myricom. Myrinet Performance Measurements. http://www.myri.com/myrinet/performance/. November 2004.
7. J. Liu, A. Mamidala, A. Vishnu, and D. K. Panda. Performance Evaluation of InfiniBand with PCI Express. In IEEE Micro, 2005.
8. J. Liu, B. Chandrasekaran, W. Yu, J. Wu, D. Buntinas, S. Kini, P. Wyckoff, and D. K. Panda. Micro-Benchmark Performance Comparison of High-Speed Cluster Interconnects. In IEEE Micro, January/February 2004.
9. Cisco Systems Inc. Understanding Server Interconnect Technology. White paper, 2005.
10. Aad J. van der Steen and Jack J. Dongarra. Overview of Recent Supercomputers. October 7, 2004.
11. R. Martin, A. Vahdat, D. Culler, and T. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the International Symposium on Computer Architecture, 1997.
12. David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS Parallel Benchmarks 2.0. In The International Journal of Supercomputer Applications, 1995.
13. Los Alamos National Laboratory (LANL). LANL's ASCI SWEEP3D Compact Application. http://www.c3.lanl.gov/par_arch/CODES/SWEEP3D/sweep3d_readme.html
14. Jack Dongarra. Present and Future Supercomputer Architectures and their Interconnects. In the International Supercomputer Conference, Heidelberg, Germany, June 22-25, 2004.
15. Jack Dongarra. An Overview of Supercomputers, Clusters and Grid. In Teraflop Workbench, March 18, 2005.
16. Lloyd Dickman. An Introduction to the PathScale InfiniPath HTX Adapter. PathScale Inc., 2005.