Cluster Interconnect Overview
Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes
Scalable Computing Laboratory, Ames Laboratory, Ames, IA 50011

Today cluster computers are more commonplace than ever, and there are a variety of choices for the interconnect. The right choice for a particular installation will depend on a variety of factors including price, raw performance, and scalability. This paper will present an overview of the popular network technologies available today, including Gigabit Ethernet, 10 Gigabit Ethernet, Myrinet, SCI, Quadrics, and InfiniBand. Included will be comparisons of cost and performance for each, along with suggestions for when each might present the best choice for a cluster installation.

Over the past few years the cost and performance of interconnects have progressed to the point where today most new clusters use a primary interconnect in the 1 to 10 Gbps range. There are several interconnect choices in this performance range, varying in cost, latency, and achievable bandwidth. Choosing the correct one for a particular application is an important and often expensive decision. This paper will present a direct comparison of Gigabit Ethernet, 10 Gigabit Ethernet, Myrinet, SCI, Quadrics, and InfiniBand. For each of these network technologies we will examine issues of cost, performance (latency and bandwidth), and scalability.

Some of the network interconnects in this review, such as Gigabit Ethernet and Myrinet, have been around for quite some time. Others, such as InfiniBand and 10 Gigabit Ethernet, are quite new. In addition, even well-known technologies such as Myrinet are evolving in terms of both hardware and software implementations. For example, Myricom now offers both single- and dual-port NICs and is in the process of finishing a substantial rewrite of its software stack.

Figure 1 illustrates the performance of several of the networks in this study. While it is clear that InfiniBand and 10 Gigabit Ethernet outperform the other choices, that does not necessarily mean they are the best choice. Indeed, 10 Gigabit Ethernet is currently cost prohibitive for most applications. On the other hand, Gigabit Ethernet has the lowest performance of the technologies listed, but due to its very low cost it is still a quite suitable choice for small clusters.

Figure 1. Bandwidth (Mbps) versus message size (bytes) for InfiniBand (Infinicon MPI, Infinicon SDP-TCP, and InfiniBand TCP), 10 Gb Ethernet, SCI MPI and TCP (Tyan 2466N), Myrinet 2000C MPI and TCP, AceNIC Gigabit Ethernet (9000 and 1500 byte MTU), and Dell/Broadcom Gigabit Ethernet.

In addition to the raw native performance we will also present results for high-level protocols such as MPI and TCP/IP. TCP/IP is especially interesting since it is a very important protocol for applications, yet TCP/IP performance is often significantly lower on a given network technology than the MPI or native performance. Good TCP/IP performance also depends much more heavily on careful system and application tuning.
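As a concrete illustration of the kind of tuning involved, the sketch below enlarges the per-socket send and receive buffers and disables Nagle coalescing for small messages. It is illustrative only: the 1 MB buffer size and use of TCP_NODELAY are assumptions for the example, not the settings used on our test systems, and system-wide limits (for example net.core.rmem_max on Linux) may also cap the effective buffer sizes.

/* Minimal sketch of per-socket TCP tuning (assumed values, not the
 * settings from the paper's test systems). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int bufsize = 1 << 20;   /* 1 MB buffers: an assumed value, tune per network */
    int nodelay = 1;         /* disable Nagle coalescing for small messages */

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof bufsize) < 0 ||
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof bufsize) < 0 ||
        setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof nodelay) < 0)
        perror("setsockopt");

    /* ... connect()/send()/recv() as usual ... */
    close(sock);
    return 0;
}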

The API for a protocol also poses several performance issues. TCP/IP requires the sockets API, which has the significant drawback that most implementations require at least one memory copy on the receive side and normally require an OS context switch for both send and receive. Several vendors have addressed this by offering different forms of sockets direct transport. Figure 2 shows sockets API performance for several different networks and methods of transport. The highest performing sockets results required significant tuning of TCP/IP settings, including a 9000 byte MTU on 10 Gigabit Ethernet and sockets direct on InfiniBand. Sockets performance is also significantly affected by CPU clock speed, while raw InfiniBand VAPI performance is not. We expect that most other OS-bypass/RDMA interconnects, such as Quadrics and SCI, will show similar behavior.

Figure 2. Sockets API bandwidth versus message size (bytes) for Infinicon Sockets Direct (1 MB socket buffers), Intel 10 GigE (maximum ixgb buffers, no RX delay), Infinicon SDP with default settings (2.4 GHz Xeon), Intel 10 GigE (2.4 GHz Xeon, 100 MHz PCI-X), Infinicon Sockets Direct (2.2 GHz Xeon), Infinicon IP over IB (2.4 GHz Xeon), SCI Sockets (default library settings), IP over SCI (32 KB MTU, 1 MB socket buffer), and IP over SCI (default MTU, 1 MB socket buffer).
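To make the measurement concrete, the sketch below shows the general shape of a streaming sockets bandwidth test like those behind the curves in Figure 2. It is not the benchmark code used for this paper; the peer address, port, per-point transfer size, and the assumption of a discard-style receiver on the other end are all illustrative choices, and a real run would sweep the message size as in the figure.

/* Minimal streaming sockets bandwidth sketch (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "127.0.0.1"; /* assumed peer */
    size_t msg_size  = argc > 2 ? (size_t)atol(argv[2]) : 65536;
    size_t total     = 1UL << 30;              /* send 1 GB per test point */
    char *buf        = calloc(1, msg_size);

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, host, &peer.sin_addr);
    if (connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (size_t sent = 0; sent < total; ) {
        ssize_t n = send(sock, buf, msg_size, 0);
        if (n <= 0) { perror("send"); return 1; }
        sent += (size_t)n;
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%zu-byte messages: %.1f Mbps\n", msg_size, 8.0 * total / secs / 1e6);
    close(sock);
    free(buf);
    return 0;
}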

We intend to show further results for various methods of improving sockets performance and to discuss the resulting issues. Different approaches to improving sockets performance also have overall system design and security implications that are not initially obvious. For example, any sort of sockets direct or TCP offload approach has the potential to bypass firewall rules in the OS. For these and other reasons, there has been significant resistance within the Linux kernel development community to supporting TCP offload engines. This is not as large an issue for system area networks like InfiniBand, but it could be a substantial barrier to the adoption of TCP offload for Ethernet.

In real-world use, the API one chooses can bias results toward one type of usage pattern that is supported but not as well optimized. MPI is generally optimized by vendors for ping-pong latency and bandwidth numbers, which usually results in some sort of polling mechanism being used. In contrast, TCP/IP sockets use an interrupt-driven model with significant support and optimization in the operating system itself. In one case, a computational chemistry application that runs two processes per CPU was faster with TCP/IP sockets over Myrinet than with GM over Myrinet, even though the bandwidth and latency of the Myrinet MPI implementation are significantly better.
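For reference, the sketch below shows the canonical two-rank MPI ping-pong pattern that vendors typically tune for, and whose polling behavior is discussed above. It is illustrative only; the iteration count and the 8-byte message size are assumptions for the example, not the parameters used in our measurements.

/* Minimal MPI ping-pong latency sketch; run with two ranks,
 * e.g. "mpirun -np 2 ./pingpong". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000, size = 8;   /* 8-byte latency test (assumed) */
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = round-trip time / 2 */
        printf("avg one-way latency: %.2f us\n", (t1 - t0) / iters / 2 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}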
