Proceedings of the 9th IEEE Hot Interconnects (HotI'01), Palo Alto, California, August 2001.

The Quadrics Network (QsNet): High-Performance Clustering Technology

Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg
Computer & Computational Sciences Division

Los Alamos National Laboratory
{fabrizio,feng,hoisie,scoll,eitanf}@lanl.gov

This work was supported by the U.S. Department of Energy through Los Alamos National Laboratory contract W-7405-ENG-36. This paper is LA-UR 01-4100.

Abstract

The Quadrics interconnection network (QsNet) contributes two novel innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand.

In this paper, we present an initial performance evaluation of QsNet. We first describe the main hardware and software features of QsNet, followed by the results of benchmarks that we ran on our experimental, Intel-based, Linux cluster built around QsNet. Our initial analysis indicates that QsNet performs remarkably well, e.g., user-level latency under 2 µs and bandwidth over 300 MB/s.

Keywords: performance evaluation, interconnection networks, user-level communication, InfiniBand, Linux, MPI.

1 Introduction

With the increased importance of scalable system-area networks for cluster (super)computers, web-server farms, and network-attached storage, the interconnection network and its associated software libraries and hardware have become critical components in achieving high performance. Such components will greatly impact the design, architecture, and use of the aforementioned systems in the future.

Key players in high-speed interconnects include Gigabit Ethernet (GigE) [11], GigaNet [13], SCI [5], Myrinet [1], and GSN (HiPPI-6400) [12]. These interconnect solutions differ from one another with respect to their architecture, programmability, scalability, performance, and ease of integration into large-scale systems. While GigE resides at the low end of the performance spectrum, it provides a low-cost solution. GigaNet, GSN, Myrinet, and SCI add programmability and performance by providing communication processors on the network interface cards and implementing different types of user-level communication protocols.

The Quadrics network (QsNet) surpasses the above interconnects in functionality by including a novel approach to integrate the local virtual memory of a node into a globally shared, virtual-memory space; a programmable processor in the network interface that allows the implementation of intelligent communication protocols; and an integrated approach to network fault detection and fault tolerance. Consequently, QsNet already possesses many of the salient aspects of InfiniBand [2], an evolving standard that also provides an integrated approach to high-performance communication.

2 QsNet

QsNet consists of two hardware building blocks: a programmable network interface called Elan [9] and a high-bandwidth, low-latency communication switch called Elite [10]. With respect to software, QsNet provides several layers of communication libraries that trade off between performance and ease of use. These hardware and software components combine to enable QsNet to provide the following: (1) efficient and protected access to a global virtual memory via remote DMA operations and (2) enhanced network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets.

2.1 Elan Network Interface

The Elan network interface connects the high-performance, multi-stage Quadrics network to a processing node containing one or more CPUs. In addition to generating and accepting packets to and from the network, the Elan provides substantial local processing power to implement high-level, message-passing protocols such as MPI.

The internal functional structure of the Elan, shown in Figure 1, centers around two primary processing engines: the µcode processor and the thread processor.

Figure 1. Elan Functional Units. (The block diagram shows the link mux and FIFOs, the inputter, the DMA buffers, the µcode and thread processors, the MMU and TLB with table-walk engine, the clock and statistics registers, the 4-way set-associative cache, the SDRAM interface, and the 64-bit/66-MHz PCI interface.)

The 32-bit µcode processor supports four threads of execution, where each thread can independently issue pipelined memory requests to the memory system. Up to eight requests can be outstanding at any given time. The scheduling for the µcode processor enables a thread to wake up, schedule a new memory access on the result of a previous memory access, and go back to sleep in as few as two system-clock cycles.

The four µcode threads are described below: (1) inputter thread: handles input transactions from the network; (2) DMA thread: generates DMA packets to be written to the network, prioritizes outstanding DMAs, and time-slices large DMAs so that small DMAs are not adversely blocked; (3) processor-scheduling thread: prioritizes and controls the scheduling and descheduling of the thread processor; and (4) command-processor thread: handles operations requested by the host (i.e., "command") processor at user level.

The thread processor is a 32-bit RISC processor that aids in the implementation of higher-level messaging libraries without explicit intervention from the main CPU. To better support these libraries, its instruction set was augmented with extra instructions to construct network packets, manipulate events, efficiently schedule threads, and block-save and restore a thread's state when scheduling.

The MMU translates 32-bit virtual addresses into either 28-bit local SDRAM physical addresses or 48-bit PCI physical addresses. To translate these addresses, the MMU contains a 16-entry, fully-associative translation lookaside buffer (TLB) and a small data path and state machine used to perform table walks to fill the TLB and to save trap information when the MMU faults.

The Elan contains routing tables that translate every virtual processor number into a sequence of tags that determine the network route. Several routing tables can be loaded in order to support different routing strategies.

The Elan has 8KB of cache memory, organized as 4 sets of 2KB, and 64MB of SDRAM memory. The cache line size is 32 bytes. The cache performs pipelined fills from SDRAM and can issue a number of cache fills and write-backs for different units while still being able to service accesses for units that hit in the cache. The interface to SDRAM is 64 bits wide, with 8 check bits added to provide error-code correction. The memory interface also contains a 32-byte write buffer and a 32-byte read buffer.

2.2 Elite Switch

The Elite provides (1) eight bidirectional links supporting two virtual channels in each direction; (2) an internal 16×8 full crossbar switch (the crossbar has two input ports for each input link, to accommodate the two virtual channels); (3) a nominal transmission bandwidth of 400 MB/s in each link direction and a flow-through latency of 35 ns; (4) packet error detection and recovery, with routing and data transactions CRC-protected; (5) two priority levels combined with an aging mechanism to ensure fair delivery of packets in the same priority level; (6) hardware support for broadcasts; and (7) adaptive routing. The switches are interconnected in a quaternary fat-tree topology, which belongs to the more general class of k-ary n-trees [7, 6].
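
To make the topology concrete: a k-ary n-tree connects k^n processing nodes through n levels of k^(n-1) switches each. The short C program below is our own back-of-the-envelope illustration (not QsNet software); for the quaternary fat tree of dimension two used in our testbed, it yields the 16 nodes and 8 switches described in Section 4.

#include <stdio.h>

/* A k-ary n-tree has k^n processing nodes and n levels of k^(n-1)
 * switches, i.e., n * k^(n-1) switches in total [6, 7]. */
static long ipow(long b, int e) { long r = 1; while (e-- > 0) r *= b; return r; }

int main(void) {
    int k = 4, n = 2;  /* quaternary fat tree of dimension two */
    printf("%d-ary %d-tree: %ld nodes, %ld switches\n",
           k, n, ipow(k, n), n * ipow(k, n - 1));
    return 0;          /* prints: 4-ary 2-tree: 16 nodes, 8 switches */
}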

Elite networks are source-routed, and the transmission of each packet is pipelined into the network using wormhole flow control. At the link level, each packet is partitioned into smaller units called flits (flow control digits) [3] of 16 bits. Every packet is closed by an End-Of-Packet (EOP) token, but this is normally only sent after receipt of a packet-acknowledgment token. This implies that every packet transmission creates a virtual circuit between source and destination.

Packets can be sent to multiple destinations using the broadcast capability of the network. For a broadcast packet to be successfully delivered, a positive acknowledgment must be received from all the recipients of the broadcast group. All Elans connected to the network are capable of receiving the broadcast packet but, if desired, the broadcast set can be limited to a subset of physically contiguous Elans.

2.3 Global Virtual Memory

The Elan can transfer information directly between the address spaces of groups of cooperating processes while maintaining hardware protection between the process groups. This capability is a sophisticated extension to the conventional virtual-memory mechanism and is known as virtual operation. Virtual operation is based on two concepts: (1) the Elan virtual memory and (2) the Elan context.

2.3.1 Elan Virtual Memory

The Elan contains an MMU to translate the virtual memory addresses issued by the various on-chip functional units (Thread Processor, DMA Engine, and so on) into physical addresses. These physical memory addresses may refer to either Elan local memory (SDRAM) or the node's main memory. To support main-memory accesses, the configuration tables for the Elan MMU are synchronized with the main processor's MMU tables so that the virtual address space can be accessed by the Elan. The synchronization of the MMU tables is the responsibility of the system code and is invisible to the user programmer.

The MMU in the Elan can translate between virtual addresses written in the format of the main processor (e.g., a 64-bit word, big-Endian architecture such as the AlphaServer) and virtual addresses written in the Elan format (a 32-bit word, little-Endian architecture). For a processor with a 32-bit architecture (e.g., an Intel Pentium), a one-to-one mapping is all that is required.

Figure 2. Virtual Address Translation. (The main processor's 64-bit virtual address space, containing stack, text, data, bss, heap, and a 200MB memory-allocator heap, is mapped onto the Elan's 32-bit virtual address space, which is backed by Elan SDRAM and main memory.)

In Figure 2, the mapping for a 64-bit processor is shown. The 64-bit addresses starting at 0x1FF0C808000 are mapped to the Elan's 32-bit addresses starting at 0xC808000. This means that virtual addresses in the range 0x1FF0C808000 to 0x1FFFFFFFFFF can be accessed directly by the main processor, while the Elan can access the same memory by using addresses in the range 0xC808000 to 0xFFFFFFFF. In our example, the user may allocate main memory using malloc, and the process heap may grow outside the region directly accessible by the Elan, which is delimited by 0x1FFFFFFFFFF. In order to avoid this problem, both main and Elan memory can be allocated using a consistent memory-allocation mechanism. As shown in Figure 2, the MMU tables can be set up to map a common region of virtual memory called the memory-allocator heap. The allocator maps physical pages, of either main memory or Elan memory, into this virtual address range on demand. Thus, using allocation functions provided by the Elan library, portions of virtual memory can be allocated either from main or Elan memory, and the MMUs of both the main processor and the Elan can be kept consistent.

For reasons of efficiency, some objects can be located on the Elan, e.g., communication buffers or DMA descriptors, which the Elan can process independently of the main processor.
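
As a concrete illustration of this mechanism, the sketch below allocates one buffer from main memory and one from Elan SDRAM through such a consistent allocator. The entry points and signatures (elan3_allocMain, elan3_allocElan, and the allocator handle alloc) are assumptions made for this example rather than the verbatim Elan3lib interface; the Elan Programming Manual [8] defines the actual calls.

/* Hedged sketch: allocator names and signatures are assumed, not
 * quoted from Elan3lib; see the Elan Programming Manual [8].
 * 'alloc' is an allocator handle obtained at initialization. */
void *cmd_buf  = elan3_allocMain(alloc, 64, 4096); /* backed by main memory;
                                                      visible to both the main
                                                      and the Elan MMU        */
void *dma_desc = elan3_allocElan(alloc, 64, 128);  /* object in Elan SDRAM,
                                                      processed by the Elan
                                                      without the main CPU    */

Because both regions live in the shared memory-allocator heap, the main processor and the Elan can name them with consistent virtual addresses.
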
2.3.2 Elan Context

In a conventional virtual-memory system, each user process is assigned a process identification number (PID), which selects the set of MMU tables used and, therefore, the physical address spaces accessible to it. QsNet extends this concept so that the user address spaces in a parallel program can intersect. The Elan replaces the PID value with a context value. User processes can directly access an exported segment of remote memory by using a combination of a context value and a virtual address. Furthermore, the context value also determines which remote processes can access the address space via the Elan network and where those processes reside. If the user process is multithreaded, the threads will share the same context just as they share the same main-memory address space. If the node has multiple physical CPUs, the individual threads may actually be executed by different CPUs; however, they will still share the same context.

2.4 Network Fault Detection & Fault Tolerance

QsNet implements network fault detection and tolerance in hardware. (It is important to note that this fault detection and tolerance occurs between two communicating Elans.) Under normal operation, the source Elan transmits a packet (i.e., route information for source routing, followed by one or more transactions). When the receiver in the destination Elan receives a transaction with an "ACK Now" flag, it means that it is the last transaction for the packet. The destination Elan then sends a packet-acknowledgment (PA) token back to the source Elan. Only when the source Elan receives the PA token is it allowed to send an EOP acknowledgment token to the destination to indicate the completion of the packet transfer. In short, the fundamental rule of Elan network operation is that, for every packet sent down a link, a single PA token will be sent back, and the link will not be re-used until the PA token has been sent.

If an Elan detects an error during the transmission of a packet over QsNet, it immediately sends out an error message without waiting for a PA token to be received. If an Elite detects an error, it automatically transmits an error message back to the source and the destination. During this process, the faulty link and/or switch is isolated via per-hop fault detection [9], and the source receives notification about the faulty component and can re-try the packet transmission a default number of times. If this is not successful, the source can appropriately re-configure its routing tables (recall that QsNet is source-routed) so as not to use the faulty component.
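
The source-side behavior can be condensed into the following C-style sketch. This is our own conceptual rendering of logic that the Elan implements in hardware; identifiers such as MAX_RETRIES, send_packet, and wait_for_token are illustrative, not part of any QsNet API.

/* Conceptual model of the source Elan's end-to-end retry protocol.
 * All names are illustrative; the real mechanism is in hardware. */
for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
    send_packet(route, transactions);    /* route info, then transactions  */
    if (wait_for_token() == PA_TOKEN) {  /* destination saw "ACK Now"      */
        send_eop_token();                /* close the transfer and circuit */
        return SUCCESS;
    }
    /* an error message arrived instead; the faulty hop is isolated: retry */
}
reconfigure_routing_tables();            /* route around the faulty component */
return FAILURE;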

3 Programming Libraries

Figure 3 shows the different programming libraries [8] for the Elan network interface. These libraries trade off speed with machine independence and programmability. Starting from the bottom, Elan3lib provides the lowest-level, user-space programming interface to the Elan3. At this level, processes in a parallel job can communicate with each other through an abstraction of distributed virtual shared memory. Each process in a parallel job is allocated a virtual process id (VPID) and can map a portion of its address space into the Elan. These address spaces, taken in combination, constitute a distributed virtual shared memory. Remote memory (i.e., memory on another processing node) can be addressed by a combination of a VPID and a virtual address. Since the Elan has its own MMU, a process can select which part of its address space should be visible across the network, determine specific access rights (e.g., write-only or read-only), and select the set of potential communication partners.
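
A hedged sketch of what this addressing model looks like to a programmer follows; the descriptor fields and the functions issue_dma and wait_event are our own illustrative names, not the literal Elan3lib API. The point is that a remote location is named by a (VPID, virtual address) pair, and the transfer is a remote DMA that requires no attention from the remote CPU.

#include <stddef.h>

/* Illustrative remote-DMA descriptor; field and function names are
 * assumptions, not the actual Elan3lib structures. */
struct rdma_desc {
    int         dest_vpid; /* virtual process id of the target process */
    void       *dest_addr; /* virtual address in the target's space    */
    const void *src_addr;  /* local source buffer                      */
    size_t      nbytes;    /* transfer size in bytes                   */
};

/* A remote put: the Elan DMA engine moves the data, and protection is
 * enforced by the Elan context of the communicating processes. */
void remote_put(struct rdma_desc *d) {
    issue_dma(d);   /* hand the descriptor to the DMA engine (assumed) */
    wait_event(d);  /* block on the completion event (assumed)         */
}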

Figure 3. Elan3 Programming Libraries. (User applications sit on MPI; MPI is built on Tports and Elanlib; Elanlib is built on the user-space Elan3lib; system calls and the Elan kernel communications reside in kernel space.)

Elanlib is a higher-level interface that frees the programmer from the revision-dependent details of the Elan and extends Elan3lib with point-to-point, tagged message-passing primitives (called Tagged Message Ports, or Tports). Standard communication libraries such as MPI-2 or Cray shmem are implemented on top of Elanlib.
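
The flavor of the Tports interface can be conveyed by a minimal fragment; the entry points below (tport_tx_start, tport_rx_start, tport_wait) are placeholder names of our own, not the actual Elanlib functions.

/* Placeholder names: a tagged send and a tag-matched receive in the
 * style of Tports. Tag matching is performed by the Elan thread
 * processor, not by the main CPU. */
tx = tport_tx_start(tport, dest_vpid, TAG_DATA, sbuf, nbytes);
rx = tport_rx_start(tport, ANY_VPID,  TAG_DATA, rbuf, nbytes);
tport_wait(tx);
tport_wait(rx);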

4 Experiments

We tested the main features of QsNet on an experimental cluster with 16 dual-processor SMPs equipped with 733-MHz Pentium IIIs. Each SMP uses a motherboard based on the Serverworks HE chipset with 1GB of SDRAM and two 64-bit/66-MHz PCI slots (one of which is used by the Elan3 PCI card QM-400). The interconnection network is a quaternary fat tree of dimension two, composed of eight 8-port Elite switches integrated on the same board. The operating system used during the evaluation was Linux 2.4.0-test7.

To expose the basic performance of QsNet, we wrote our benchmarks at the Elan3lib level. We also briefly analyze the overhead introduced by Elanlib and an implementation of MPI-2 [4] (based on a port of MPICH onto Elanlib).

To identify different bottlenecks, the communication buffers for our unidirectional ping, bidirectional ping, and hotspot tests are placed either in main or in Elan memory. The communication alternatives include main memory to main memory, Elan memory to Elan memory, Elan memory to main memory, and main memory to Elan memory.

4.1 Unidirectional Ping

Figure 4(a) shows the results for unidirectional ping. The asymptotic bandwidth for all communication libraries and buffer mappings lies in a narrow range, from 307 MB/s for MPI to 335 MB/s for Elan3lib. The results also show a small asymmetry between read and write performance on the PCI bus; with Elan3lib, the read and write bandwidths over the PCI bus differ only slightly. The peak bandwidth of 335 MB/s is reached when both source and destination buffers are placed in Elan memory.

The graphs in Figure 4(a) can be logically organized into three groups: those relative to Elan3lib with the source buffer in Elan memory, Elan3lib with the source buffer in main memory, and Tports and MPI. In the first group, the latency is low for small and medium-sized messages. This basic latency is increased in the second group by the extra delay needed to start the remote DMA over the PCI bus. Finally, both Tports and MPI use the thread processor to perform tag matching, and this increases the overhead further.

Figure 4(b) shows the latency of messages in the range 0 bytes to 4KB. With Elan3lib, the latency for 0-byte messages is just under 2 µs and remains almost constant for messages up to 64 bytes, because these messages can be packed as a single write-block transaction. The latency at the Tports and MPI levels increases by a few microseconds. From the Elan3lib level, in which latency is mostly hardware, system software is needed to run as a thread in the Elan µcode processor in order to match the message tags; this introduces the extra overhead responsible for the higher latency values.

The noise at small message sizes, visible in Figure 4(b), is due to the message transmission policy. Messages smaller than a fixed inlining threshold are sent together with the message envelope so that they are available immediately when a receiver posts a request for them. Larger messages are always sent synchronously, only after the receiver has posted a matching request.
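
Although our measurements were taken with benchmarks written at the Elan3lib level, the structure of the unidirectional ping is easy to convey in portable MPI. The program below is a generic ping-pong written for this discussion, not our actual benchmark code; it estimates one-way latency as half of the averaged round-trip time and must be run with at least two ranks.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, reps = 1000;
    int len = (argc > 1) ? atoi(argv[1]) : 0;   /* message size in bytes */
    char *buf = malloc(len > 0 ? len : 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {          /* send, then wait for the echo */
            MPI_Send(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* echo the message back        */
            MPI_Recv(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)                /* one-way latency = half the round trip */
        printf("%d bytes: %.2f us one-way\n", len, dt / (2.0 * reps) * 1e6);
    MPI_Finalize();
    free(buf);
    return 0;
}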

Figure 4. Unidirectional and Bidirectional Pings. (Four panels plot performance against message size for MPI, Tports, and Elan3lib with the four buffer placements: (a) unidirectional ping bandwidth, (b) unidirectional ping latency, (c) bidirectional ping bandwidth, and (d) bidirectional ping latency.)

4.2 Bidirectional Ping

Figure 4(c) shows that the claimed bidirectionality of the network is not fully achievable. The maximum unidirectional value, obtained as half of the measured bidirectional traffic, is about 280 MB/s, whereas in the previous case it was 335 MB/s. This gap in bandwidth exposes bottlenecks in the network and in the network interface, as opposed to the PCI bus. The causes of this performance degradation are the interleaving of the DMA engine with the inputter, the sharing of the internal data bus of the Elan, and interference at the link level in the Elite network. Counter-intuitively, this value is achieved when the source buffer is in main memory and the destination buffer is in Elan memory, and not when both buffers are in Elan memory; in that case, the Elan memory is the bottleneck. The bidirectional bandwidth for main-memory-to-main-memory traffic is 160 MB/s for all libraries. Figure 4(d) shows how the bidirectional traffic affects latency with Elan3lib, Tports, and MPI.
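
For completeness, bidirectional traffic of this kind can be generated with nonblocking operations, as in the MPI fragment below (again an illustration for this discussion, not our Elan3lib benchmark). Each rank posts its receive before its send so that both directions are in flight simultaneously, and the aggregate bandwidth is compared against twice the unidirectional figure.

/* One iteration of a bidirectional ping between two ranks. rbuf, sbuf,
 * len, and peer are set up as in the unidirectional test. */
MPI_Request rq[2];
MPI_Irecv(rbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &rq[0]);
MPI_Isend(sbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &rq[1]);
MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);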

4.3 Hotspot

In this experiment, we read from and write to the same memory location from an increasing number of processors (one per SMP). The bandwidth plots are shown in Figure 5. The upper curves are the aggregate bandwidth of all processes; they are remarkably flat for both read and write hotspots. The lower curves show the per-SMP bandwidth. The scalability of this type of memory operation is very good, up to the available number of processors in our cluster.

Figure 5. Read and Write Hotspot. (Global and per-SMP read and write bandwidths, in MB/s, for 2 to 8 SMPs.)
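
The hotspot pattern can be expressed portably with MPI-2 one-sided operations [4], although our measurements used Elan3lib directly; the fragment below is an illustrative analogue in which every rank targets a single word exposed by rank 0.

/* MPI-2 one-sided analogue of the read-hotspot test; REPS is an
 * arbitrary repetition count for this sketch. */
#define REPS 10000
long hot = 0, val;
MPI_Win win;
MPI_Win_create(&hot, sizeof hot, sizeof hot, MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
for (int i = 0; i < REPS; i++)      /* every rank reads the same word */
    MPI_Get(&val, 1, MPI_LONG, 0, 0, 1, MPI_LONG, win);
MPI_Win_fence(0, win);              /* completes all pending gets     */
MPI_Win_free(&win);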

5 Conclusion

In this paper, we presented two novel innovations of QsNet: (1) the integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance that can detect faults and automatically re-transmit packets. Next, we briefly presented the results of benchmark tests on QsNet, targeting essential performance characteristics. At the lowest level of the communication hierarchy, the unidirectional latency is under 2 µs and the bandwidth as high as 335 MB/s. Bidirectional measurements indicate a degradation in performance, which we analyzed and explained in the paper. At higher levels in the communication hierarchy, Tports still exhibit excellent performance figures, comparable to the ones at the Elan3lib level. In summary, our analysis shows that, in all the components of the performance space we analyzed, the network and its libraries deliver excellent performance to the end user.

Future work includes scalability analysis for larger configurations, performance of a larger subset of collective communication patterns, and performance analysis of scientific applications.

References

[1] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawick, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29-36, January 1995.

[2] Daniel Cassiday. InfiniBand Architecture Tutorial. Hot Chips 12 Tutorial, August 2000.

[3] William J. Dally and Charles L. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, C-36(5):547-553, May 1987.

[4] William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. MPI - The Complete Reference, volume 2, The MPI Extensions. The MIT Press, 1998.

[5] Hermann Hellwagner. The SCI Standard and Applications of SCI. In Hermann Hellwagner and Alexander Reinefeld, editors, SCI: Scalable Coherent Interface, volume 1291 of Lecture Notes in Computer Science, pages 95-116. Springer-Verlag, 1999.

[6] Fabrizio Petrini and Marco Vanneschi. k-ary n-trees: High Performance Networks for Massively Parallel Architectures. In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, pages 87-93, Geneva, Switzerland, April 1997.

[7] Fabrizio Petrini and Marco Vanneschi. Performance Analysis of Wormhole Routed k-ary n-trees. International Journal on Foundations of Computer Science, 9(2):157-177, June 1998.

[8] Quadrics Supercomputers World Ltd. Elan Programming Manual, January 1999.

[9] Quadrics Supercomputers World Ltd. Elan Reference Manual, January 1999.

[10] Quadrics Supercomputers World Ltd. Elite Reference Manual, November 1999.

[11] Rich Seifert. Gigabit Ethernet: Technology and Applications for High Speed LANs. Addison-Wesley, May 1998.

[12] D. Tolmie, T. M. Boorman, A. DuBois, D. DuBois, W. Feng, and I. Philp. From HiPPI-800 to HiPPI-6400: A Changing of the Guard and Gateway to the Future. In Proceedings of the 6th International Conference on Parallel Interconnects (PI'99), October 1999.

[13] Werner Vogels, David Follett, Jenwi Hsieh, David Lifka, and David Stern. Tree-Saturation Control in the AC3 Velocity Cluster. In Hot Interconnects 8, Stanford University, Palo Alto, CA, August 2000.