Proceedings of the 9th IEEE Hot Interconnects (HotI'01), Palo Alto, California, August 2001.

The Quadrics Network (QsNet): High-Performance Clustering Technology

Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg
Computer & Computational Sciences Division

Los Alamos National Laboratory
{fabrizio,feng,hoisie,scoll,eitanf}@lanl.gov

This work was supported by the U.S. Department of Energy through Los Alamos National Laboratory contract W-7405-ENG-36. This paper is LA-UR 01-4100.

Abstract

The Quadrics interconnection network (QsNet) contributes two novel innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand.

In this paper, we present an initial performance evaluation of QsNet. We first describe the main hardware and software features of QsNet, followed by the results of benchmarks that we ran on our experimental, Intel-based, Linux cluster built around QsNet. Our initial analysis indicates that QsNet performs remarkably well, e.g., user-level latency under 2 µs and bandwidth over 300 MB/s.

Keywords: performance evaluation, interconnection networks, user-level communication, InfiniBand, Linux, MPI.

1 Introduction

With the increased importance of scalable system-area networks for cluster (super)computers, web-server farms, and network-attached storage, the interconnection network and its associated software libraries and hardware have become critical components in achieving high performance. Such components will greatly impact the design, architecture, and use of the aforementioned systems in the future.

Key players in high-speed interconnects include Gigabit Ethernet (GigE) [11], GigaNet [13], SCI [5], Myrinet [1], and GSN (HiPPI-6400) [12]. These interconnect solutions differ from one another with respect to their architecture, programmability, scalability, performance, and ease of integration into large-scale systems. While GigE resides at the low end of the performance spectrum, it provides a low-cost solution. GigaNet, GSN, Myrinet, and SCI add programmability and performance by providing communication processors on the network interface cards and implementing different types of user-level communication protocols.

The Quadrics network (QsNet) surpasses the above interconnects in functionality by including a novel approach to integrate the local virtual memory of a node into a globally shared, virtual-memory space; a programmable processor in the network interface that allows the implementation of intelligent communication protocols; and an integrated approach to network fault detection and fault tolerance. Consequently, QsNet already possesses many of the salient aspects of InfiniBand [2], an evolving standard that also provides an integrated approach to high-performance communication.

2 QsNet

QsNet consists of two hardware building blocks: a programmable network interface called Elan [9] and a high-bandwidth, low-latency communication switch called Elite [10]. With respect to software, QsNet provides several layers of communication libraries that trade off between performance and ease of use. These hardware and software components combine to enable QsNet to provide the following: (1) efficient and protected access to a global virtual memory via remote DMA operations and (2) enhanced network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets.

2.1 Elan Network Interface

The Elan network interface connects the high-performance, multi-stage Quadrics network to a processing node containing one or more CPUs. In addition to generating and accepting packets to and from the network, the Elan provides substantial local processing power to implement high-level, message-passing protocols such as MPI.

The internal functional structure of the Elan, shown in Figure 1, centers around two primary processing engines: the µcode processor and the thread processor.

Figure 1. Elan Functional Units. (The block diagram shows the link mux and FIFOs, the inputter, the DMA buffers, the µcode and thread processors, the MMU and TLB with table-walk engine, the clock and statistics registers, the 4-way set-associative cache, the SDRAM interface, and the 64-bit/66-MHz PCI interface.)

The 32-bit µcode processor supports four threads of execution, where each thread can independently issue pipelined memory requests to the memory system. Up to eight requests can be outstanding at any given time. The scheduling for the µcode processor enables a thread to wake up, schedule a new memory access on the result of a previous memory access, and go back to sleep in as few as two system-clock cycles.

The four µcode threads are described below: (1) inputter thread: handles input transactions from the network; (2) DMA thread: generates DMA packets to be written to the network, prioritizes outstanding DMAs, and time-slices large DMAs so that small DMAs are not adversely blocked; (3) processor-scheduling thread: prioritizes and controls the scheduling and descheduling of the thread processor; and (4) command-processor thread: handles operations requested by the host (i.e., "command") processor at user level.

The thread processor is a 32-bit RISC processor that aids in the implementation of higher-level messaging libraries without explicit intervention from the main CPU. To better support these libraries, its instruction set was augmented with extra instructions to construct network packets, manipulate events, efficiently schedule threads, and block-save and restore a thread's state when scheduling.

The MMU translates 32-bit virtual addresses into either 28-bit local SDRAM physical addresses or 48-bit PCI physical addresses. To translate these addresses, the MMU contains a 16-entry, fully-associative translation lookaside buffer (TLB) and a small data path and state machine used to perform table walks to fill the TLB and to save trap information when the MMU faults.

The Elan contains routing tables that translate every virtual processor number into a sequence of tags that determine the network route. Several routing tables can be loaded in order to support different routing strategies.

The Elan has 8KB of cache memory, organized as 4 sets of 2KB, and 64MB of SDRAM memory. The cache line size is 32 bytes. The cache performs pipelined fills from SDRAM and can issue a number of cache fills and write-backs for different units while still being able to service accesses for units that hit in the cache. The interface to SDRAM is 64 bits wide, with 8 check bits added to provide error-code correction. The memory interface also contains a 32-byte write buffer and a 32-byte read buffer.

2.2 Elite Switch

The Elite provides (1) eight bidirectional links supporting two virtual channels in each direction; (2) an internal 16×8 full crossbar switch (the crossbar has two input ports for each input link, to accommodate the two virtual channels); (3) a nominal transmission bandwidth of 400 MB/s in each link direction and a flow-through latency of 35 ns; (4) packet error detection and recovery, with routing and data transactions CRC-protected; (5) two priority levels combined with an aging mechanism to ensure fair delivery of packets in the same priority level; (6) hardware support for broadcasts; and (7) adaptive routing. The switches are interconnected in a quaternary fat-tree topology, which belongs to the more general class of k-ary n-trees [7, 6].
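
To make the topology concrete: a k-ary n-tree connects k^n processing nodes through n levels of k^(n-1) switches each. The short C program below is our own back-of-the-envelope illustration (not QsNet software); for the quaternary fat tree of dimension two used in our testbed, it yields the 16 nodes and 8 switches described in Section 4.

#include <stdio.h>

/* A k-ary n-tree has k^n processing nodes and n levels of k^(n-1)
 * switches, i.e., n * k^(n-1) switches in total [6, 7]. */
static long ipow(long b, int e) { long r = 1; while (e-- > 0) r *= b; return r; }

int main(void) {
    int k = 4, n = 2;  /* quaternary fat tree of dimension two */
    printf("%d-ary %d-tree: %ld nodes, %ld switches\n",
           k, n, ipow(k, n), n * ipow(k, n - 1));
    return 0;          /* prints: 4-ary 2-tree: 16 nodes, 8 switches */
}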

Elite networks are source-routed, and the transmission of each packet is pipelined into the network using wormhole flow control. At the link level, each packet is partitioned into smaller units called flits (flow control digits) [3] of 16 bits. Every packet is closed by an End-Of-Packet (EOP) token, but this is normally only sent after receipt of a packet-acknowledgment token. This implies that every packet transmission creates a virtual circuit between source and destination.

Packets can be sent to multiple destinations using the broadcast capability of the network. For a broadcast packet to be successfully delivered, a positive acknowledgment must be received from all the recipients of the broadcast group. All Elans connected to the network are capable of receiving the broadcast packet but, if desired, the broadcast set can be limited to a subset of physically contiguous Elans.

2.3 Global Virtual Memory

The Elan can transfer information directly between the address spaces of groups of cooperating processes while maintaining hardware protection between the process groups. This capability is a sophisticated extension to the conventional virtual-memory mechanism and is known as virtual operation. Virtual operation is based on two concepts: (1) the Elan virtual memory and (2) the Elan context.

2.3.1 Elan Virtual Memory

The Elan contains an MMU to translate the virtual memory addresses issued by the various on-chip functional units (Thread Processor, DMA Engine, and so on) into physical addresses. These physical memory addresses may refer to either Elan local memory (SDRAM) or the node's main memory. To support main-memory accesses, the configuration tables for the Elan MMU are synchronized with the main processor's MMU tables so that the virtual address space can be accessed by the Elan. The synchronization of the MMU tables is the responsibility of the system code and is invisible to the user programmer.

The MMU in the Elan can translate between virtual addresses written in the format of the main processor (e.g., a 64-bit word, big-Endian architecture such as the AlphaServer) and virtual addresses written in the Elan format (a 32-bit word, little-Endian architecture). For a processor with a 32-bit architecture (e.g., an Intel Pentium), a one-to-one mapping is all that is required.

Figure 2. Virtual Address Translation. (The main processor's 64-bit virtual address space, containing stack, text, data, bss, heap, and a 200MB memory-allocator heap, is mapped onto the Elan's 32-bit virtual address space, which is backed by Elan SDRAM and main memory.)

In Figure 2, the mapping for a 64-bit processor is shown. The 64-bit addresses starting at 0x1FF0C808000 are mapped to the Elan's 32-bit addresses starting at 0xC808000. This means that virtual addresses in the range 0x1FF0C808000 to 0x1FFFFFFFFFF can be accessed directly by the main processor, while the Elan can access the same memory by using addresses in the range 0xC808000 to 0xFFFFFFFF. In our example, the user may allocate main memory using malloc, and the process heap may grow outside the region directly accessible by the Elan, which is delimited by 0x1FFFFFFFFFF. In order to avoid this problem, both main and Elan memory can be allocated using a consistent memory-allocation mechanism. As shown in Figure 2, the MMU tables can be set up to map a common region of virtual memory called the memory-allocator heap. The allocator maps physical pages, of either main memory or Elan memory, into this virtual address range on demand. Thus, using allocation functions provided by the Elan library, portions of virtual memory can be allocated either from main or Elan memory, and the MMUs of both the main processor and the Elan can be kept consistent.

For reasons of efficiency, some objects can be located on the Elan, e.g., communication buffers or DMA descriptors, which the Elan can process independently of the main processor.
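
As a concrete illustration of this mechanism, the sketch below allocates one buffer from main memory and one from Elan SDRAM through such a consistent allocator. The entry points and signatures (elan3_allocMain, elan3_allocElan, and the allocator handle alloc) are assumptions made for this example rather than the verbatim Elan3lib interface; the Elan Programming Manual [8] defines the actual calls.

/* Hedged sketch: allocator names and signatures are assumed, not
 * quoted from Elan3lib; see the Elan Programming Manual [8].
 * 'alloc' is an allocator handle obtained at initialization. */
void *cmd_buf  = elan3_allocMain(alloc, 64, 4096); /* backed by main memory;
                                                      visible to both the main
                                                      and the Elan MMU        */
void *dma_desc = elan3_allocElan(alloc, 64, 128);  /* object in Elan SDRAM,
                                                      processed by the Elan
                                                      without the main CPU    */

Because both regions live in the shared memory-allocator heap, the main processor and the Elan can name them with consistent virtual addresses.
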
2.3.2 Elan Context

In a conventional virtual-memory system, each user process is assigned a process identification number (PID), which selects the set of MMU tables used and, therefore, the physical address spaces accessible to it. QsNet extends this concept so that the user address spaces in a parallel program can intersect. The Elan replaces the PID value with a context value. User processes can directly access an exported segment of remote memory by using a combination of a context value and a virtual address. Furthermore, the context value also determines which remote processes can access the address space via the Elan network and where those processes reside. If the user process is multithreaded, the threads will share the same context just as they share the same main-memory address space. If the node has multiple physical CPUs, the individual threads may actually be executed by different CPUs; however, they will still share the same context.

2.4 Network Fault Detection & Fault Tolerance

QsNet implements network fault detection and tolerance in hardware. (It is important to note that this fault detection and tolerance occurs between two communicating Elans.) Under normal operation, the source Elan transmits a packet (i.e., route information for source routing, followed by one or more transactions). When the receiver in the destination Elan receives a transaction with an "ACK Now" flag, it means that it is the last transaction for the packet. The destination Elan then sends a packet-acknowledgment (PA) token back to the source Elan. Only when the source Elan receives the PA token is it allowed to send an EOP acknowledgment token to the destination to indicate the completion of the packet transfer. In short, the fundamental rule of Elan network operation is that, for every packet sent down a link, a single PA token will be sent back, and the link will not be re-used until the PA token has been sent.

If an Elan detects an error during the transmission of a packet over QsNet, it immediately sends out an error message without waiting for a PA token to be received. If an Elite detects an error, it automatically transmits an error message back to the source and the destination. During this process, the faulty link and/or switch is isolated via per-hop fault detection [9], and the source receives notification about the faulty component and can re-try the packet transmission a default number of times. If this is not successful, the source can appropriately re-configure its routing tables (recall that QsNet is source-routed) so as not to use the faulty component.
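
The source-side behavior can be condensed into the following C-style sketch. This is our own conceptual rendering of logic that the Elan implements in hardware; identifiers such as MAX_RETRIES, send_packet, and wait_for_token are illustrative, not part of any QsNet API.

/* Conceptual model of the source Elan's end-to-end retry protocol.
 * All names are illustrative; the real mechanism is in hardware. */
for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
    send_packet(route, transactions);    /* route info, then transactions  */
    if (wait_for_token() == PA_TOKEN) {  /* destination saw "ACK Now"      */
        send_eop_token();                /* close the transfer and circuit */
        return SUCCESS;
    }
    /* an error message arrived instead; the faulty hop is isolated: retry */
}
reconfigure_routing_tables();            /* route around the faulty component */
return FAILURE;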

3 Programming Libraries

Figure 3 shows the different programming libraries [8] for the Elan network interface. These libraries trade off speed with machine independence and programmability. Starting from the bottom, Elan3lib provides the lowest-level, user-space programming interface to the Elan3. At this level, processes in a parallel job can communicate with each other through an abstraction of distributed virtual shared memory. Each process in a parallel job is allocated a virtual process id (VPID) and can map a portion of its address space into the Elan. These address spaces, taken in combination, constitute a distributed virtual shared memory. Remote memory (i.e., memory on another processing node) can be addressed by a combination of a VPID and a virtual address. Since the Elan has its own MMU, a process can select which part of its address space should be visible across the network, determine specific access rights (e.g., write-only or read-only), and select the set of potential communication partners.
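
A hedged sketch of what this addressing model looks like to a programmer follows; the descriptor fields and the functions issue_dma and wait_event are our own illustrative names, not the literal Elan3lib API. The point is that a remote location is named by a (VPID, virtual address) pair, and the transfer is a remote DMA that requires no attention from the remote CPU.

#include <stddef.h>

/* Illustrative remote-DMA descriptor; field and function names are
 * assumptions, not the actual Elan3lib structures. */
struct rdma_desc {
    int         dest_vpid; /* virtual process id of the target process */
    void       *dest_addr; /* virtual address in the target's space    */
    const void *src_addr;  /* local source buffer                      */
    size_t      nbytes;    /* transfer size in bytes                   */
};

/* A remote put: the Elan DMA engine moves the data, and protection is
 * enforced by the Elan context of the communicating processes. */
void remote_put(struct rdma_desc *d) {
    issue_dma(d);   /* hand the descriptor to the DMA engine (assumed) */
    wait_event(d);  /* block on the completion event (assumed)         */
}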

Figure 3. Elan3 Programming Libraries. (User applications sit on MPI; MPI is built on Tports and Elanlib; Elanlib is built on the user-space Elan3lib; system calls and the Elan kernel communications reside in kernel space.)

Elanlib is a higher-level interface that frees the programmer from the revision-dependent details of the Elan and extends Elan3lib with point-to-point, tagged message-passing primitives (called Tagged Message Ports, or Tports). Standard communication libraries such as MPI-2 or Cray shmem are implemented on top of Elanlib.
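
The flavor of the Tports interface can be conveyed by a minimal fragment; the entry points below (tport_tx_start, tport_rx_start, tport_wait) are placeholder names of our own, not the actual Elanlib functions.

/* Placeholder names: a tagged send and a tag-matched receive in the
 * style of Tports. Tag matching is performed by the Elan thread
 * processor, not by the main CPU. */
tx = tport_tx_start(tport, dest_vpid, TAG_DATA, sbuf, nbytes);
rx = tport_rx_start(tport, ANY_VPID,  TAG_DATA, rbuf, nbytes);
tport_wait(tx);
tport_wait(rx);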

4 Experiments

We tested the main features of QsNet on an experimental cluster with 16 dual-processor SMPs equipped with 733-MHz Pentium IIIs. Each SMP uses a motherboard based on the Serverworks HE chipset with 1GB of SDRAM and two 64-bit/66-MHz PCI slots (one of which is used by the Elan3 PCI card QM-400). The interconnection network is a quaternary fat tree of dimension two, composed of eight 8-port Elite switches integrated on the same board. The operating system used during the evaluation was Linux 2.4.0-test7.

To expose the basic performance of QsNet, we wrote our benchmarks at the Elan3lib level. We also briefly analyze the overhead introduced by Elanlib and an implementation of MPI-2 [4] (based on a port of MPICH onto Elanlib).

To identify different bottlenecks, the communication buffers for our unidirectional ping, bidirectional ping, and hotspot tests are placed either in main or in Elan memory. The communication alternatives include main memory to main memory, Elan memory to Elan memory, Elan memory to main memory, and main memory to Elan memory.

4.1 Unidirectional Ping

Figure 4(a) shows the results for unidirectional ping. The asymptotic bandwidth for all communication libraries and buffer mappings lies in a narrow range, from 307 MB/s for MPI to 335 MB/s for Elan3lib. The results also show a small asymmetry between read and write performance on the PCI bus; with Elan3lib, the read and write bandwidths over the PCI bus differ only slightly. The peak bandwidth of 335 MB/s is reached when both source and destination buffers are placed in Elan memory.

The graphs in Figure 4(a) can be logically organized into three groups: those relative to Elan3lib with the source buffer in Elan memory, Elan3lib with the source buffer in main memory, and Tports and MPI. In the first group, the latency is low for small and medium-sized messages. This basic latency is increased in the second group by the extra delay needed to start the remote DMA over the PCI bus. Finally, both Tports and MPI use the thread processor to perform tag matching, and this increases the overhead further.

Figure 4(b) shows the latency of messages in the range 0 bytes to 4KB. With Elan3lib, the latency for 0-byte messages is just under 2 µs and remains almost constant for messages up to 64 bytes, because these messages can be packed as a single write-block transaction. The latency at the Tports and MPI levels increases by a few microseconds. From the Elan3lib level, in which latency is mostly hardware, system software is needed to run as a thread in the Elan µcode processor in order to match the message tags; this introduces the extra overhead responsible for the higher latency values.

The noise at small message sizes, visible in Figure 4(b), is due to the message transmission policy. Messages smaller than a fixed inlining threshold are sent together with the message envelope so that they are available immediately when a receiver posts a request for them. Larger messages are always sent synchronously, only after the receiver has posted a matching request.
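
Although our measurements were taken with benchmarks written at the Elan3lib level, the structure of the unidirectional ping is easy to convey in portable MPI. The program below is a generic ping-pong written for this discussion, not our actual benchmark code; it estimates one-way latency as half of the averaged round-trip time and must be run with at least two ranks.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, reps = 1000;
    int len = (argc > 1) ? atoi(argv[1]) : 0;   /* message size in bytes */
    char *buf = malloc(len > 0 ? len : 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {          /* send, then wait for the echo */
            MPI_Send(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* echo the message back        */
            MPI_Recv(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)                /* one-way latency = half the round trip */
        printf("%d bytes: %.2f us one-way\n", len, dt / (2.0 * reps) * 1e6);
    MPI_Finalize();
    free(buf);
    return 0;
}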

Figure 4. Unidirectional and Bidirectional Pings. (Four panels plot performance against message size for MPI, Tports, and Elan3lib with the four buffer placements: (a) unidirectional ping bandwidth, (b) unidirectional ping latency, (c) bidirectional ping bandwidth, and (d) bidirectional ping latency.)

4.2 Bidirectional Ping

Figure 4(c) shows that the claimed bidirectionality of the network is not fully achievable. The maximum unidirectional value, obtained as half of the measured bidirectional traffic, is about 280 MB/s, whereas in the previous case it was 335 MB/s. This gap in bandwidth exposes bottlenecks in the network and in the network interface, as opposed to the PCI bus. The causes of this performance degradation are the interleaving of the DMA engine with the inputter, the sharing of the internal data bus of the Elan, and interference at the link level in the Elite network. Counter-intuitively, this value is achieved when the source buffer is in main memory and the destination buffer is in Elan memory, and not when both buffers are in Elan memory; in that case, the Elan memory is the bottleneck. The bidirectional bandwidth for main-memory-to-main-memory traffic is 160 MB/s for all libraries. Figure 4(d) shows how the bidirectional traffic affects latency with Elan3lib, Tports, and MPI.
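
For completeness, bidirectional traffic of this kind can be generated with nonblocking operations, as in the MPI fragment below (again an illustration for this discussion, not our Elan3lib benchmark). Each rank posts its receive before its send so that both directions are in flight simultaneously, and the aggregate bandwidth is compared against twice the unidirectional figure.

/* One iteration of a bidirectional ping between two ranks. rbuf, sbuf,
 * len, and peer are set up as in the unidirectional test. */
MPI_Request rq[2];
MPI_Irecv(rbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &rq[0]);
MPI_Isend(sbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &rq[1]);
MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);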

4.3 Hotspot

In this experiment, we read from and write to the same memory location from an increasing number of processors (one per SMP). The bandwidth plots are shown in Figure 5. The upper curves are the aggregate bandwidth of all processes; they are remarkably flat for both read and write hotspots. The lower curves show the per-SMP bandwidth. The scalability of this type of memory operation is very good, up to the available number of processors in our cluster.

Figure 5. Read and Write Hotspot. (Global and per-SMP read and write bandwidths, in MB/s, for 2 to 8 SMPs.)
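
The hotspot pattern can be expressed portably with MPI-2 one-sided operations [4], although our measurements used Elan3lib directly; the fragment below is an illustrative analogue in which every rank targets a single word exposed by rank 0.

/* MPI-2 one-sided analogue of the read-hotspot test; REPS is an
 * arbitrary repetition count for this sketch. */
#define REPS 10000
long hot = 0, val;
MPI_Win win;
MPI_Win_create(&hot, sizeof hot, sizeof hot, MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
for (int i = 0; i < REPS; i++)      /* every rank reads the same word */
    MPI_Get(&val, 1, MPI_LONG, 0, 0, 1, MPI_LONG, win);
MPI_Win_fence(0, win);              /* completes all pending gets     */
MPI_Win_free(&win);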

5 Conclusion

In this paper, we presented two novel innovations of QsNet: (1) the integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance that can detect faults and automatically re-transmit packets. Next, we briefly presented the results of benchmark tests on QsNet, targeting essential performance characteristics. At the lowest level of the communication hierarchy, the unidirectional latency is under 2 µs and the bandwidth as high as 335 MB/s. Bidirectional measurements indicate a degradation in performance, which we analyzed and explained in the paper. At higher levels in the communication hierarchy, Tports still exhibit excellent performance figures, comparable to the ones at the Elan3lib level. In summary, our analysis shows that, in all the components of the performance space we analyzed, the network and its libraries deliver excellent performance to the end user.

Future work includes scalability analysis for larger configurations, performance of a larger subset of collective communication patterns, and performance analysis of scientific applications.

References

[1] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawick, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29-36, January 1995.

[2] Daniel Cassiday. InfiniBand Architecture Tutorial. Hot Chips 12 Tutorial, August 2000.

[3] William J. Dally and Charles L. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, C-36(5):547-553, May 1987.

[4] William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. MPI - The Complete Reference, volume 2, The MPI Extensions. The MIT Press, 1998.

[5] Hermann Hellwagner. The SCI Standard and Applications of SCI. In Hermann Hellwagner and Alexander Reinefeld, editors, SCI: Scalable Coherent Interface, volume 1291 of Lecture Notes in Computer Science, pages 95-116. Springer-Verlag, 1999.

[6] Fabrizio Petrini and Marco Vanneschi. k-ary n-trees: High Performance Networks for Massively Parallel Architectures. In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, pages 87-93, Geneva, Switzerland, April 1997.

[7] Fabrizio Petrini and Marco Vanneschi. Performance Analysis of Wormhole Routed k-ary n-trees. International Journal on Foundations of Computer Science, 9(2):157-177, June 1998.

[8] Quadrics Supercomputers World Ltd. Elan Programming Manual, January 1999.

[9] Quadrics Supercomputers World Ltd. Elan Reference Manual, January 1999.

[10] Quadrics Supercomputers World Ltd. Elite Reference Manual, November 1999.

[11] Rich Seifert. Gigabit Ethernet: Technology and Applications for High Speed LANs. Addison-Wesley, May 1998.

[12] D. Tolmie, T. M. Boorman, A. DuBois, D. DuBois, W. Feng, and I. Philp. From HiPPI-800 to HiPPI-6400: A Changing of the Guard and Gateway to the Future. In Proceedings of the 6th International Conference on Parallel Interconnects (PI'99), October 1999.

[13] Werner Vogels, David Follett, Jenwi Hsieh, David Lifka, and David Stern. Tree-Saturation Control in the AC3 Velocity Cluster. In Hot Interconnects 8, Stanford University, Palo Alto, CA, August 2000.