End-to-End Performance of 10-Gigabit Ethernet on Commodity Systems

Justin (Gus) Hurwitz and Wu-chun Feng, Los Alamos National Laboratory

Intel's network interface card for 10-Gigabit Ethernet (10GbE) allows individual computer systems to connect directly to 10GbE Ethernet infrastructures. Results from various evaluations suggest that 10GbE could serve in networks from LANs to WANs.

From its humble beginnings as shared Ethernet to its current success as switched Ethernet in local-area networks (LANs) and system-area networks, and its anticipated success in metropolitan- and wide-area networks (MANs and WANs), Ethernet continues to evolve to meet the increasing demands of packet-switched networks. It does so at low implementation cost while maintaining high reliability and relatively simple (plug-and-play) installation, administration, and maintenance.

Although the recently ratified 10-Gigabit Ethernet standard differs from earlier Ethernet standards, primarily in that 10GbE operates only over fiber and only in full-duplex mode, the differences are largely superficial. More importantly, 10GbE does not make obsolete current investments in network infrastructure. The 10GbE standard ensures interoperability not only with existing Ethernet but also with other networking technologies such as Sonet, thus paving the way for Ethernet's expanded use in MANs and WANs.

Although the developers of 10GbE mainly intended it to allow easy migration to higher performance levels in backbone infrastructures, 10GbE-based devices can also deliver such performance to bandwidth-hungry host applications via Intel's new 10GbE network interface card (or adapter). We implemented optimizations to Linux, the Transmission Control Protocol (TCP), and the 10GbE adapter configurations and performed several evaluations. Results showed extraordinarily high throughput with low latency, indicating that 10GbE is a viable interconnect for all network environments.

Architecture of a 10GbE adapter

The world's first host-based 10GbE adapter, officially known as the Intel PRO/10GbE LR server adapter, introduces the benefits of 10GbE connectivity into LAN and system-area network environments, thereby accommodating the growing number of large-scale cluster systems and bandwidth-intensive applications, such as imaging and data mirroring. This first-generation 10GbE adapter contains a 10GbE controller implemented in a single chip that provides functions for both the media access control and physical layers. The controller is optimized for servers that use the I/O bus backplanes of the Peripheral Component Interface (PCI) and its higher-speed extension, PCI-X.

Figure 1 gives an architectural overview of the 10GbE adapter, which consists of three main components: an 82597EX 10GbE controller, 512 Kbytes of flash memory, and a 1,310-nm wavelength single-mode serial optics transmitter and receiver.

[Figure 1. Architecture of the 10GbE adapter. The three main components are a 10GbE controller, flash memory, and serial optics. The labeled blocks show the 82597EX controller's MAC, DMA engine, and PCI-X interface (8.5 Gbps) on the host side; an XGXS/XAUI interface of four 3.125-Gbps lanes with 8b/10b PCS coding feeding the 1,310-nm serial optics (10.3 Gbps in, 10.3 Gbps out) on the line side; and the 512-Kbyte flash.]
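For reference, the bandwidth figures that label Figure 1 follow directly from bus width, clock rate, and line coding; the worked arithmetic below is ours, added for clarity rather than taken from the article:

\[
\text{PCI-X: } 64~\text{bits} \times 133~\text{MHz} \approx 8.5~\text{Gbps}
\qquad\qquad
\text{XAUI: } 4 \times 3.125~\text{Gbps} \times \tfrac{8}{10} = 10~\text{Gbps}
\]

The same width-times-clock product gives the 25.6-Gbps front-side-bus figure quoted for the test hosts later in this section (64 bits × 400 MHz = 25.6 Gbps).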
The 10GbE controller includes an Ethernet interface that delivers high performance by providing direct access to all memory without using mapping registers, minimizing the interrupts and programmed I/O (PIO) read access required to manage the device, and offloading simple tasks from the host CPU.

As is common practice with high-performance adapters such as Myricom's Myrinet [1] and Quadrics' QsNet [2], the 10GbE adapter frees up host CPU cycles by performing certain tasks (in silicon) on behalf of the host CPU. However, the 10GbE adapter focuses on host off-loading of certain tasks, specifically TCP and Internet Protocol (IP) checksums and TCP segmentation, rather than relying on remote direct memory access (RDMA) and source routing. Consequently, unlike Myrinet and QsNet, the 10GbE adapter provides a general-purpose solution to applications that does not require modifying application code to achieve high performance. (The "Putting the 10GbE Numbers in Perspective" sidebar compares 10GbE's performance to these other adapters.)
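Because much of the adapter's benefit comes from these stateless offloads, it is worth confirming that the kernel has actually enabled them before benchmarking. The following minimal sketch is illustrative only, not the test code used in these evaluations; it queries checksum and TCP-segmentation offload through Linux's standard ethtool ioctl interface, and the interface name eth1 is an assumption.

/* query_offload.c: check RX/TX checksum and TCP segmentation offload
 * via the SIOCETHTOOL ioctl. The interface name "eth1" is an assumption. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int query(int fd, const char *ifname, unsigned int cmd, const char *label)
{
    struct ethtool_value ev = { .cmd = cmd };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ev;            /* ethtool request/response buffer */

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror(label);
        return -1;
    }
    printf("%s: %s\n", label, ev.data ? "on" : "off");
    return 0;
}

int main(void)
{
    const char *ifname = "eth1";           /* assumed 10GbE interface name */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    query(fd, ifname, ETHTOOL_GRXCSUM, "rx-checksum offload");
    query(fd, ifname, ETHTOOL_GTXCSUM, "tx-checksum offload");
    query(fd, ifname, ETHTOOL_GTSO,    "tcp-segmentation offload");
    return 0;
}

This is the same ioctl interface that the userspace ethtool utility uses for its offload reporting, so the check mirrors what that tool shows.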
Sidebar: Putting the 10GbE Numbers in Perspective

To appreciate the 10GbE adapter's performance, we consider it in the context of similar network interconnects. In particular, we compare 10GbE's performance with that of its predecessor, 1GbE, and the most common interconnects for high-performance computing (HPC): Myrinet, QsNet, and InfiniBand.

1GbE
With our extensive experience with 1GbE chipsets, we can achieve near line-speed performance with a 1,500-byte maximum transmission unit (MTU) in a LAN or system-area network environment with most payload sizes. Additional optimizations in a WAN environment should let us achieve similar performance while still using a 1,500-byte MTU.

Myrinet
We compare 10GbE's performance to Myricom's published performance numbers for its Myrinet adapters (available at http://www.myri.com/myrinet/performance/ip.html and http://www.myri.com/myrinet/performance/index.html). Using Myrinet's GM API, sustained unidirectional bandwidth is 1.984 Gbps and bidirectional bandwidth is 3.912 Gbps. Both numbers are within 3 percent of the 2-Gbps unidirectional hardware limit. The GM API provides latencies of 6 to 7 µs.

Using the GM API, however, might require rewriting portions of legacy application code. Myrinet provides a TCP/IP emulation layer to avoid this extra work. This layer's performance, however, is notably less than that of the GM API: bandwidth drops to 1.853 Gbps, and latencies skyrocket to more than 30 µs. When running the Message-Passing Interface (MPI) atop the GM API, bandwidth reaches 1.8 Gbps with latencies between 8 and 9 µs.

InfiniBand and QsNet II
InfiniBand and Quadrics' QsNet II offer MPI throughput performance that is comparable to 10GbE's socket-level throughput. Both InfiniBand and Quadrics have demonstrated sustained throughput of 7.2 Gbps. Using the Elan4 libraries, QsNet II achieves sub-2-µs latencies [1], while InfiniBand's latency is around 6 µs [2].

Our experience with Quadrics' previous-generation QsNet (using the Elan3 API) produced unidirectional MPI bandwidth of 2.456 Gbps and a 4.9-µs latency. As with Myrinet's GM API, the Elan3 API may require application programmers to rewrite the application's network code, typically to move from a socket API to the Elan3 API. To address this issue, Quadrics also has a highly efficient TCP/IP implementation that produces 2.240 Gbps of bandwidth and less than 30-µs latency. Additional performance results are available elsewhere [3].

We have yet to see performance analyses of TCP/IP over either InfiniBand or QsNet II. Similarly, we have yet to evaluate MPI over 10GbE. This makes direct comparison between these interconnects and 10GbE relatively meaningless at present. This lack of comparison might be for the best. 10GbE is a fundamentally different architecture, embodying a design philosophy unlike that of other interconnects for high-performance computing. Ethernet standards, including 10GbE, are meant to transfer data in any network environment; they value versatility over performance. Interconnects designed for high-performance computing environments (for example, supercomputing clusters) are specific to system-area networks and emphasize performance over versatility.

References
1. J. Beecroft et al., "Quadrics QsNet II: A Network for Supercomputing Applications," presented at Hot Chips 14 (HotC 2003); available from http://www.quadrics.com.
2. J. Liu et al., "Micro-Benchmark Level Performance Comparison of High-Speed Cluster Interconnects," Proc. Hot Interconnects 11 (HotI 2003), IEEE CS Press, 2003, pp. 60-65.
3. F. Petrini et al., "The Quadrics Network: High-Performance Clustering Technology," IEEE Micro, vol. 22, no. 1, Jan.-Feb. 2002, pp. 46-57.

Testing environment and tools

We evaluate the 10GbE adapter's performance in three different LAN or system-area network environments, as Figure 2 shows:

• a direct single flow between two computers connected back-to-back via a crossover cable,
• an indirect single flow between two computers through a Foundry FastIron 1500 switch, and
• multiple flows through the Foundry FastIron 1500 switch.

Host computers were either Dell PowerEdge 2650 (PE2650) or Dell PowerEdge 4600 (PE4600) servers.

Each PE2650 contains dual 2.2-GHz Intel Xeon CPUs running on a 400-MHz front-side bus (FSB), using a ServerWorks GC-LE chipset with 1 Gbyte of memory and a dedicated 133-MHz PCI-X bus for the 10GbE adapter. Theoretically, this architectural configuration provides 25.6-Gbps CPU bandwidth, up to 25.6-Gbps memory bandwidth, and 8.5-Gbps network bandwidth via the PCI-X bus.

Each PE4600 contains dual 2.4-GHz Intel Xeon CPUs running on a 400-MHz FSB, using a ServerWorks GC-HE chipset with 1 Gbyte of memory and a dedicated 100-MHz PCI-X bus for the 10GbE adapter. This configuration provides theoretical bandwidths of 25.6, 51.2, and 6.4 Gbps for the CPU, memory, and PCI-X bus, respectively. Since publishing our initial 10GbE results, we have run additional tests on Xeon processors with faster clocks and FSBs, as well as on other systems.

For the multiple-flow tests, the Foundry FastIron 1500 switch aggregates 10GbE streams from (or to) many hosts into a 10GbE stream to (or from) a single host. The total backplane bandwidth (480 Gbps) in the switch far exceeds our test needs because both 10GbE ports are limited to 8.5 Gbps.

From a software perspective, all the hosts run current installations of Debian Linux with customized kernel builds and tuned TCP/IP stacks. We tested numerous kernels from the Linux 2.4 and 2.5 distributions.
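A representative piece of such TCP tuning is enlarging socket buffers so they cover the bandwidth-delay product of a 10GbE path. The fragment below is a minimal illustration of per-socket tuning rather than the actual benchmark harness used here; the 4-Mbyte buffer request, peer address, and port are assumptions, and the kernel clamps such requests to its net.core.wmem_max and net.core.rmem_max limits.

/* tuned_sender.c: open a TCP connection with enlarged socket buffers,
 * the kind of per-socket tuning implied by "tuned TCP/IP stacks".
 * Peer address, port, and buffer size are illustrative assumptions. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    const int bufsize = 4 * 1024 * 1024;   /* 4-Mbyte request (assumption) */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    /* Request larger send/receive buffers before connecting; the kernel
     * clamps the request to net.core.wmem_max / net.core.rmem_max. */
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &granted, &len);
    printf("send buffer granted: %d bytes\n", granted);

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                         /* assumed receiver port */
    inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);   /* assumed peer address */

    if (connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        close(sock);
        return 1;
    }

    /* ... stream data here, for example repeated write() calls of a fixed
     * payload size, timing the transfer to compute throughput ... */

    close(sock);
    return 0;
}

System-wide equivalents (the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem sysctls together with the net.core maximums) tune every connection at once and are presumably part of what the kernel-level tuning above refers to.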