Designing Efficient Network Interfaces For System Area Networks

Inaugural dissertation submitted for the academic degree of Doctor of Natural Sciences at the Universität Mannheim

presented by Dipl.-Inf. Lars Rzymianowicz from Rendsburg

Mannheim, 2002

Dean: Professor Dr. Herbert Popp, Universität Mannheim
Referee: Professor Dr. Ulrich Brüning, Universität Mannheim
Co-referee: Professor Dr. Volker Lindenstruth, Universität Heidelberg

Date of the oral examination: August 28, 2002


Abstract

Designing Efficient Network Interfaces For System Area Networks

by Lars Rzymianowicz Universität Mannheim

The network is the key component of a Cluster of Workstations/PCs. Its performance, measured in terms of bandwidth and latency, has a great impact on the overall system performance. It quickly became clear that traditional WAN/LAN technology is not well suited for interconnecting powerful nodes into a cluster: its poor performance too often slows down communication-intensive applications. This observation led to the birth of a new class of networks called System Area Networks (SAN).

But even SANs like Myrinet, SCI or ServerNet do not deliver an appropriate level of performance. Some are hampered by the fact that they were originally targeted at another field of application. E.g., SCI was intended to serve as a cache-coherent interconnect for fine-grain communication between tightly coupled nodes. Its architecture is optimized for this area and behaves less optimally for bandwidth-hungry applications. Earlier versions of Myrinet suffered from slow versions of their proprietary LANai network processor and slow on-board SRAM. And even though a standard I/O bus with a physical bandwidth of more than 500 Mbyte/s (PCI 64 bit/66 MHz) has been available for years, typical SANs only offer between 100-200 Mbyte/s.

All the disadvantages of current implementations led to the idea of developing a new SAN capable of delivering the performance needed by today's clusters and of keeping up with the fast progress in CPU and memory performance. It should completely remove the network as communication bottleneck and support efficient methods for host-network interaction. Furthermore, it should be ideally suited for use in small-scale (2-8 CPUs) SMP nodes, which are used more and more as cluster nodes. And last but not least, it should be a cost-efficient implementation.

All these requirements guided the specification of the ATOLL network. On a single chip, not one but four network interfaces (NI) have been implemented, together with an on-chip 4x4 full-duplex switch and four link interfaces. This unique “Network on a Chip” architecture is best suited for interconnecting SMP nodes, where multiple CPUs are given an exclusive NI and do not have to share a single interface. It also removes the need for any additional switching hardware, since the four byte-wide full-duplex links can be connected by cables with neighbor nodes in an arbitrary network topology.

Despite its complexity and size, the whole network interface card (NIC) consists only of a single chip and 4 cable connectors, a very cost-efficient architecture. Each link provides 250 Mbyte/s in one direction, offering a total bisection bandwidth of 2 Gbyte/s at the network side. The next generation of I/O bus technology, a 64 bit/133 MHz PCI-X bus interface, has been integrated to make use of this high bandwidth. A novel combination of different data transfer methods has been implemented. Each of the four NIs offers transfer via Direct Memory Access (DMA) or Programmed I/O (PIO). The PIO mode eliminates any need for an intermediate copy of message data and is ideally suited for fine-grain communication, whereas the DMA mode is best suited for larger message sizes. In addition, new techniques for event notification and error correction have been included in the ATOLL NI. Intensive simulations show that the ATOLL architecture can deliver the performance expected. For the first time in Cluster Computing, the network is no longer the communication bottleneck.

Specifying such a complex design is one task, implementing it in an Application Specific Integrated Circuit (ASIC) is an even greater challenge. From implementing the specification in a Register-Transfer Level (RTL) model to the final VLSI layout generation it took almost three years. Implemented in a state-of-the-art IC technology, the 0.18 µm CMOS process of UMC, Taiwan, the ATOLL chip is one of the fastest and most complex ASICs ever designed outside the commercial IC industry. With a die size of 5.8 x 5.8 mm², 43 on-chip SRAM blocks with 100 kbit in total, 6 asynchronous clock domains (133-250 MHz), one large PCI-X IP cell and full-custom LVDS and PCI-X I/O cells, a carefully planned design flow had to be followed. Only the design of the full-custom I/Os and the place & route of the layout were done by external partners. All the rest of the design flow, from RTL coding to simulation, from synthesis to design for test, was done in-house. Finally, the completed layout was handed over to sample production in February 2002; first engineering samples are expected to be delivered 10 weeks later.

The ATOLL ASIC is one of the most complex and fastest chips ever implemented at a European university. Recently, the design won third place in the design contest organized at the Design, Automation & Test in Europe (DATE) conference, the premier European event for electronic design.

Table of Contents

CHAPTER 1 Introduction

1.1 Cluster Computing
   1.1.1 Trends
   1.1.2 Managing large installations
   1.1.3 Driving factors and future directions
1.2 System Area Networks
   1.2.1 The need for a new class of networks
   1.2.2 Emerging from existing technology
1.3 ASIC Design
   1.3.1 Using 10+ million transistors
   1.3.2 Timing closure
   1.3.3 Power dissipation
   1.3.4 Verification bottleneck
1.4 Contributions
1.5 Organization

CHAPTER 2 System Area Networks

2.1 Wide/Local Area Networks
   2.1.1 User-level message layers
2.2 Design goals
   2.2.1 Price versus performance
   2.2.2 Scalability
   2.2.3 Reliability
2.3 General architecture
   2.3.1 Shared memory vs. distributed memory
   2.3.2 NI location
2.4 Design details
   2.4.1 Physical layer
   2.4.2 Switching
   2.4.3 Routing
   2.4.4 Flow control
   2.4.5 Error detection and correction
2.5 Data transfer
   2.5.1 Programmed I/O versus Direct Memory Access
   2.5.2 Control transfer
   2.5.3 Collective operations
2.6 SCI
   2.6.1 Targeting DSM systems
   2.6.2 The Dolphin SCI adapter
   2.6.3 Remarks
2.7 ServerNet
   2.7.1 Scalability and reliability
   2.7.2 Link technology

   2.7.3 Data transfer
   2.7.4 Switches
   2.7.5 Software
   2.7.6 Remarks
2.8 Myrinet
   2.8.1 NIC architecture
   2.8.2 Transport layer and switches
   2.8.3 Software and performance
   2.8.4 Remarks
2.9 QsNet
   2.9.1 NIC architecture
   2.9.2 Switches and topology
   2.9.3 Programming interface and performance
   2.9.4 Remarks
2.10 IBM SP Switch2
   2.10.1 NIC architecture
   2.10.2 Network switches
   2.10.3 Remarks
2.11 Infiniband
   2.11.1 Architecture
   2.11.2 Protocol stack
   2.11.3 Remarks

CHAPTER 3 The ATOLL System Area Network

3.1 A new SAN architecture: ATOLL
   3.1.1 Design details of ATOLL
3.2 Top-level architecture
   3.2.1 Address space layout
3.3 PCI-X interface
3.4 Synchronization interface
   3.4.1 Completer interface
   3.4.2 Slave-Write data path
   3.4.3 Slave-Read data path
   3.4.4 Master-Write data path
   3.4.5 Master-Read data path
   3.4.6 Requester interface
   3.4.7 Device initialization
3.5 Port interconnect
   3.5.1 ATOLL control and status registers
3.6 Host port
   3.6.1 Address layout
   3.6.2 PIO-send unit
   3.6.3 PIO-receive unit
   3.6.4 Data structures for DMA-based communication
   3.6.5 Status/control registers
   3.6.6 DMA-send unit
   3.6.7 DMA-receive unit

   3.6.8 Replicator
3.7 Network port
   3.7.1 Message frames and link protocol
   3.7.2 Send path
   3.7.3 Receive path
3.8 Crossbar
3.9 Link port
   3.9.1 Link protocol
   3.9.2 Output port
   3.9.3 Input port

CHAPTER 4 Implementation

4.1 Design Flow
4.2 Design entry
   4.2.1 RTL coding
   4.2.2 Clock and reset logic
4.3 Functional simulation
   4.3.1 Simulation testbed
   4.3.2 Verification strategy
   4.3.3 Runtime issues
4.4 Logic synthesis
   4.4.1 Automated synthesis flow
   4.4.2 Timing closure
   4.4.3 Design for testability
4.5 Layout generation
   4.5.1 Post-layout optimization
   4.5.2 Post-layout simulation

CHAPTER 5 Performance Evaluation

5.1 Latency
5.2 Bandwidth
5.3 Resource Utilization

CHAPTER 6 Conclusions

6.1 The ATOLL SAN
6.2 Future work

Acknowledgments
Bibliography

List of Figures

CHAPTER 1 Introduction
Figure 1-1. Number of machine types in the Top500 Supercomputer list [6]
Figure 1-2. Fields of development for Cluster Computing
Figure 1-3. The productivity gap in IC development
Figure 1-4. Cell vs. wire delay
Figure 1-5. Power dissipation of ICs in the next decade [16]

CHAPTER 2 System Area Networks
Figure 2-1. User-level vs. OS-based TCP/IP communication
Figure 2-2. Write operation to remote memory
Figure 2-3. Possible NI locations
Figure 2-4. A bidirectional link and the general message format
Figure 2-5. Ultra-fast serial, switched connections replace parallel bus architectures
Figure 2-6. Packet switching techniques
Figure 2-7. Routing mechanisms
Figure 2-8. Messages forming a deadlock
Figure 2-9. PIO vs. DMA data transfer
Figure 2-10. Architecture of Dolphin’s SCI card [40]
Figure 2-11. A sample ServerNet network [46]
Figure 2-12. ServerNet address space [45]
Figure 2-13. Architecture of the latest Myrinet-2000 fiber NIC [49]
Figure 2-14. A Clos network with 128 nodes [50]
Figure 2-15. Block diagram of the Elan-3 ASIC [57]
Figure 2-16. Elan programming libraries [57]
Figure 2-17. Block diagram of the Switch2 node adapter [59]
Figure 2-18. The InfiniBand architecture [61]
Figure 2-19. InfiniBand layered architecture [61]
Figure 2-20. InfiniBand data packet format [61]

CHAPTER 3 The ATOLL System Area Network
Figure 3-1. The ATOLL structure
Figure 3-2. Top-level ATOLL architecture
Figure 3-3. Address layout of the ATOLL PCI-X device
Figure 3-4. PCI-X interface architecture [65]
Figure 3-5. Structure of the synchronization interface
Figure 3-6. Completer interface signals
Figure 3-7. Slave-Write path signals
Figure 3-8. Slave-Read path signals
Figure 3-9. Typical Slave-Read transfer
Figure 3-10. Master-Write path signals
Figure 3-11. Master-Read path signals
Figure 3-12. Requester interface signals
Figure 3-13. Device initialization and debug registers
Figure 3-14. Structure of the port interconnect
Figure 3-15. Hardware control/status and global counter
Figure 3-16. Host port specific control/status registers
Figure 3-17. Loading the PIO-receive time-out counter

Figure 3-18. Interrupt registers
Figure 3-19. Link retry register
Figure 3-20. Structure of the host port
Figure 3-21. Interface between host and network port
Figure 3-22. Host port address layout
Figure 3-23. Mapping a linear address sequence to a FIFO
Figure 3-24. Layout of PIO-send page
Figure 3-25. Structure of the PIO-send unit
Figure 3-26. Utilizing a ring buffer for PIO-receive
Figure 3-27. Layout of PIO-receive page
Figure 3-28. Structure of the PIO-receive unit
Figure 3-29. Data region
Figure 3-30. Job descriptor
Figure 3-31. Control flow of sending a DMA message
Figure 3-32. Structure of the DMA-send unit
Figure 3-33. Control flow of the DMA-receive unit
Figure 3-34. Structure of the DMA-receive unit
Figure 3-35. Message frames
Figure 3-36. Structure of the send network port unit
Figure 3-37. Structure of the receive network port unit
Figure 3-38. Structure of the crossbar
Figure 3-39. Structure of the output link port
Figure 3-40. Reverse flow control mechanism
Figure 3-41. Structure of the input link port

CHAPTER 4 Implementation
Figure 4-1. ATOLL design flow
Figure 4-2. A dual-clock synchronization FIFO
Figure 4-3. Testbed for the ATOLL ASIC
Figure 4-4. Synthesis flow
Figure 4-5. Logic synthesis lacks physical information
Figure 4-6. Improvement of timing slack and cell area
Figure 4-7. Multiplexed flipflop scan style
Figure 4-8. Floorplan used for the ATOLL ASIC
Figure 4-9. Improving a timing-violated path
Figure 4-10. Timing optimization during IPO/ECO

CHAPTER 5 Performance Evaluation
Figure 5-1. Latency for a single host port in use
Figure 5-2. Latency for multiple host ports in use
Figure 5-3. Bandwidth for a single host port in DMA mode
Figure 5-4. Bandwidth for multiple host ports in use
Figure 5-5. Network link utilization
Figure 5-6. Idle time introduced by small link packets
Figure 5-7. FIFO fill level variation

CHAPTER 6 Conclusions

List of Tables

CHAPTER 1 Introduction
Table 1-1. Bandwidth gap of clusters vs. MPPs

CHAPTER 2 System Area Networks
Table 2-1. Comparison of user-level libraries for Fast
Table 2-2. Supercomputers and their different network topology
Table 2-3. Performance of GM and SCore over Myrinet [55], [56]
Table 2-4. Performance of QsNet [57]

CHAPTER 3 The ATOLL System Area Network
Table 3-1. Control registers of ATOLL
Table 3-2. Status registers of ATOLL
Table 3-3. Debug registers of ATOLL
Table 3-4. Extension registers of ATOLL
Table 3-5. Send status/control registers of a host port
Table 3-6. Receive status/control registers of a host port
Table 3-7. Layout of the replicator area
Table 3-8. Encoding of data and control bytes

CHAPTER 4 Implementation
Table 4-1. Testbench statistics
Table 4-2. Internal scan chains

CHAPTER 5 Performance Evaluation

CHAPTER 6 Conclusions


1 Introduction

While in Desktop Computing the latest improvements in the performance of computer hardware seem to have outrun the demand of typical software, High Performance Computing (HPC) continues to be one of the main drivers for accelerating hardware like microprocessors or networks. The need to solve large problems like weather forecasting or earthquake simulation drives the development of faster CPUs, while vice versa faster hardware enables scientists to attack even larger problems. This chapter introduces Cluster Computing as a new alternative for accelerating High Performance Computing. It also discusses the emergence of a new class of networks called System Area Networks. Finally, a short introduction is given to the field of ASIC design and its most important problems.

1.1 Cluster Computing

Cluster Computing1 [1], [2] has established itself as a serious alternative to Massively Parallel Processing (MPP) and Vector Computing in the field of High Performance Computing. The initial idea was developed back in the 1960s, when IBM linked several of their mainframes together to provide a platform capable of dealing with large commercial workloads. However, MPP and vector machines from companies like Cray, SGI, IBM, Intel, NEC, Hitachi, etc. dominated the HPC world throughout the 1970s and 1980s. With the emergence of the personal computer (PC) and its fast progress in terms of performance, mainly driven by Intel's x86 microprocessors, it became a viable option to use interconnected standard PCs as a platform for running HPC applications. Several factors accelerated this trend:

• increasing performance of desktop CPUs from Intel/AMD, closing the gap to high-end RISC microprocessors (Alpha, MIPS, PowerPC, SPARC)

• high performance networks interfacing to standard PC I/O technology like PCI

• low costs of mass-fabricated PC components, compared to classic MPP or vector machines, which are built in quantities of a few hundred or thousand

1. A cluster is a collection of interconnected computers working together as a single system.


• standard HPC libraries like MPI [3] or PVM [4] are freely available in different implementations across a wide variety of platforms

• a stable, high performance Unix-style operating system is freely available with Linux

1.1.1 Trends

The first so-called Beowulf cluster [5] was assembled by the team around Donald Becker and Thomas Sterling at NASA's Goddard Space Flight Center in 1994. It consisted of 16 PCs equipped with Intel 486-DX4 100 MHz CPUs and 16 Mbyte RAM, connected via 10 Mbit Ethernet. This way of building a low-cost, yet powerful supercomputer was adopted by many research groups throughout the world. Today several thousand clusters are in operation, the largest installations with more than 1,000 nodes. One of the largest clusters ever built, the ASCI Red system from Intel with more than 9,000 Pentium Pro machines at the Sandia National Labs, USA, was No. 1 on the Top500 supercomputer list [6]1 from 1997 to 2001.

Figure 1-1. Number of machine types in the Top500 Supercomputer list [6]

[Stacked percentage chart for the Nov ’99, Nov ’00 and Nov ’01 lists: shares of MPP, SMP, clusters of SMPs, and Beowulf clusters.]

Besides PC clusters, several companies build clusters out of small- to medium-scale SMP machines. E.g., IBM uses its RS/6000 SP nodes with up to 16 CPUs per SMP, whereas Compaq builds its SC Series supercomputers by clustering AlphaServer GS machines with up to 32 CPUs. These machines are also often referred to as clusters of SMPs or constellations. Figure 1-1 shows the increasing usage of clusters, according to the Top500 lists of the last three years.

The current trend is to move away from traditional supercomputers to more cost-efficient cluster systems consisting of Commodity-Off-The-Shelf (COTS) components. Another main advantage is the better scalability of clusters: users can start with a small system and add nodes from time to time to match an increasing need for performance. A strong indicator for this trend is the fact that all recent Teraflop systems of the national Accelerated Strategic Computing Initiative (ASCI) program in the USA are clusters of SMPs. These systems are normally assembled in multiple steps, starting with a small installation followed by several upgrades. Only one of the top ten systems of the latest Top500 list is a traditional supercomputer, a Hitachi vector machine; all other entries are clusters of SMPs.

1. “TOP500 Supercomputer Sites”, www.top500.org

1.1.2 Managing large installations

Since the number of nodes inside a typical cluster grows fast, it becomes more complicated to make efficient use of the system. A lot of effort has been put into the implementation of resource management software. These tools help to install and configure the operating system and parallel libraries across hundreds of nodes with possibly different components and equipment. Another major task is the scheduling of parallel jobs and the allocation of processes to idling nodes. And with an increasing chance of failure of single nodes inside a cluster with 1,000 nodes or more, terms like availability, checkpointing, fault detection and isolation become more important. So the focus in Cluster Computing is shifting from developing fast hardware towards implementing software to manage and easily use installations with 100 or more nodes. One of the main goals of software development for clusters is to present the cluster as a so-called Single System Image (SSI) to the user: the underlying architecture is hidden from the user, who sees the cluster as a single, large parallel computer.

1.1.3 Driving factors and future directions

Figure 1-2. Fields of development for Cluster Computing

[Diagram: Cluster Computing in the center, driven by hardware (fast desktop microprocessors, cheap dual/quad motherboards, system area networks) on one side and software (message passing libraries, administration tools, fault-tolerant software, multi-process debuggers, user-level network layers) on the other.]

Figure 1-2 depicts all fields of development that contribute to the increasing use of clusters for High Performance Computing. Recent research activities extend the idea of Cluster Computing to an even further decoupled architecture called the Grid. Grid Computing [7]

connects several computing resources (clusters, single SMP/MPP/vector/PC machines) in different locations into one single computing system. To overcome the heterogeneity of all components (different platforms, operating systems, networks, etc.), a common protocol is defined to exchange data between all participating nodes inside the Grid. First implementations are available, but a wide adoption of Grid Computing is yet to come.

1.2 System Area Networks

A fast network is the key component of a high performance cluster. First installations used traditional Local Area Network (LAN) technology like 10/100 Mbit Ethernet as the interconnect between nodes inside the cluster. But it quickly became clear that these networks are a substantial performance bottleneck. Traditional MPP supercomputers like the Cray T3E or the SGI Origin rely on dedicated and proprietary high performance networks with node-to-node bandwidths of 300 Mbyte/s and more. With typical bandwidths of more than 1 Gbyte/s inside a node, these interconnects can handle the communication demand of even highly fine-grain parallel applications.

1.2.1 The need for a new class of networks

To be competitive in the field of High Performance Computing, clusters need to be equipped with networks matching the performance of these proprietary solutions. First experiences were made with existing solutions, either Wide Area Networks (WAN) or LANs. Networks like ATM, HiPPI or SCI offer more physical bandwidth than 100 Mbit Fast Ethernet, but were designed with different applications in mind.

Table 1-1. Bandwidth gap of clusters vs. MPPs (typical system configurations in the year 2000)

machine                                              system bus    internode network   system/network ratio
Cray T3E-1350 with Alpha 21164 675 MHz               1.2 Gbyte/s   650 Mbyte/s         1.85
SGI Origin 3800 with MIPS R14k 500 MHz               3.2 Gbyte/s   1.6 Gbyte/s         2
PC cluster with Pentium III 1 GHz and Fast Ethernet  1 Gbyte/s     12 Mbyte/s          85

E.g., ATM is tuned for wide area connections with its relatively small packet size of 53 bytes and its support for Quality of Service (QoS). And with 155/622 Mbit/s physical bandwidth it offers more than Fast Ethernet, but is still way behind multi-gigabit networks. On the other hand, a network like HiPPI supports up to 1.6 Gbit/s, but is so expensive that it clashes with the low-cost idea of Beowulf Computing. Table 1-1 gives an impression of the gap between system and network bandwidth inside different parallel architectures.
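The practical impact of this gap can be sketched with the common linear cost model for message transfer, T(n) = L + n/B (start-up latency plus message size over bandwidth). The sketch below uses the 12 Mbyte/s Fast Ethernet figure from Table 1-1 and typical SAN values from Section 1.2.2; the 100 µs Fast Ethernet software latency is an assumed, purely illustrative value:

```python
def transfer_time_us(msg_bytes, latency_us, bandwidth_mbyte_s):
    """Linear cost model T(n) = L + n/B.

    A bandwidth of 1 Mbyte/s equals 1 byte per microsecond, so dividing
    bytes by Mbyte/s yields microseconds directly.
    """
    return latency_us + msg_bytes / bandwidth_mbyte_s

# Fast Ethernet (12 Mbyte/s, ~100 us assumed software latency) versus an
# illustrative SAN (200 Mbyte/s, 10 us one-way latency):
for size in (64, 4096, 65536):
    t_eth = transfer_time_us(size, 100, 12)
    t_san = transfer_time_us(size, 10, 200)
    print(f"{size:6d} bytes: Ethernet {t_eth:9.1f} us, SAN {t_san:7.1f} us")
```

For small messages the start-up latency dominates, for large messages the bandwidth term does; a SAN improves both by roughly an order of magnitude.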

With almost two orders of magnitude between system bus and network bandwidth, clusters with standard LAN technology are no match for traditional supercomputers. This quickly led to several projects, both at universities and at commercial companies. The goal was to develop a low latency, high bandwidth network with a range of a few meters (up to 10). Two components had to be constructed:

• a Network Interface Card (NIC), which provides a link interface via cable into the network and uses a standard interface to connect to the host system (for PCs that is the PCI bus)

• a multi-port switch, which is used to connect single nodes into a cluster. The number of ports typically lies in the range of 6 to 32.

1.2.2 Emerging from existing technology

This new class of networks was named System Area Networks (SAN) to point out their different application in contrast to existing LAN/WANs. Most developments adopted existing technologies from the world of classical parallel computers. E.g., the first version of Myrinet, one of the most successful SANs today, was originally developed for a fine-grain supercomputer called Mosaic [8] by research groups at the California Institute of Technology (Caltech) and the University of Southern California (USC). The company Quadrics, which now offers the QsNet SAN, emerged from the well-known supercomputer manufacturer Meiko Ltd., which built cache-only supercomputers like the CS-2 [9]. The main component of QsNet, the ELAN III ASIC, is the third generation of the ELAN communication processor introduced in the CS-2.

At the end of the 1990s, several SANs were introduced and widely used in clusters. Networks like Myrinet, ServerNet, QsNet and SCI will be discussed in detail in Chapter 2. With bandwidths in the range of 100-400 Mbyte/s and a one-way latency of around 10 µs, they enabled clusters to compete with traditional supercomputers.

1.3 ASIC Design

The development of logic circuits as Application Specific Integrated Circuits (ASIC) continues at a rate predicted by Gordon E. Moore back in 1965. This famous Moore's Law [10] predicts that the number of transistors per IC doubles every 18 months. It has been valid throughout the last 30 years and seems to continue to be true for the near future. It is made possible by constant advancements in semiconductor technology and silicon fabrication.
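The 18-month doubling period can be written as a simple formula: after m months, a budget of N0 transistors grows to N(m) = N0 · 2^(m/18). A minimal sketch:

```python
def transistors(initial, months, doubling_period_months=18):
    """Project a transistor budget under Moore's Law-style exponential growth."""
    return initial * 2 ** (months / doubling_period_months)

# A 10-million-transistor chip, three years (36 months) later:
print(int(transistors(10_000_000, 36)))   # -> 40000000
```

Three years correspond to two doubling periods, i.e. a factor of four.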

1.3.1 Using 10+ million transistors

ASIC designers increasingly face the problem of being able to make use of all the potential transistors on a silicon die. This is known as the productivity gap. Every few years, the Electronic Design Automation (EDA) industry needs a big step forward in methodology to keep pace with the steep technology curve. Figure 1-3 depicts this situation and shows some of the improvements of past years and decades.

Figure 1-3. The productivity gap in IC development

[Graph, 1970-2010: transistors per chip rising from 10k to 100m while designer productivity lags behind; methodology steps from schematic entry to HDL entry with logic synthesis, and on to IP/SoC reuse and physical synthesis.]

Designers steadily increase the level of abstraction for modeling logic circuits to enhance their productivity. They moved from full-custom VLSI layout to schematic entry and on to Hardware Description Languages (HDL) like Verilog [11] and VHDL [12], which are tightly coupled with logic synthesis. The next big step in modeling abstraction would be moving to behavioral or architectural descriptions of logic circuits. First steps have been made in this direction, but it is not yet clear whether the languages used will be based on C/C++, like SystemC [13], or be extensions of an existing HDL, like Superlog [14].

While ASICs approach the 100 million transistor count and clock frequencies of multiple GHz, designers face a handful of severe problems.

1.3.2 Timing closure

The IC design flow used over the last years is split into two separate steps, called frontend and backend. The frontend flow uses logic synthesis to turn a design specified in an HDL into a so-called netlist of logic cells. Optimization goals like area or timing guide the synthesis process in the right direction. To calculate the timing delay of logic paths, tools rely on quite precise cell delays and estimated wire delays. Estimation is necessary, since the synthesis process only defines the interconnection of logic cells, not their location on the chip. The estimations were not a problem when cell delays dominated wire delays, as in older process technologies (0.35-1 µm). But with shrinking structures this proportion gets inverted. This trend is shown in Figure 1-4. Wire delays are going to dominate cell delays in process technologies beyond 0.18 µm. As a consequence, the estimations of synthesis tools get more and more imprecise. This leads to huge differences in timing before and after layout.

Figure 1-4. Cell vs. wire delay1

[Graph: cell delay falls while wire delay rises as the process shrinks from 0.8 µm to 0.1 µm.]

These timing mispredictions force the designer to iterate several times between frontend and backend design to reach the timing goal. These iterations can last several weeks or even months, which is unacceptable, since time-to-market is a major factor for success. EDA companies address this problem by incorporating physical design into the synthesis process. This is known as physical synthesis or, when frontend and backend design are fully integrated, as an RTL-to-GDSII flow.

1.3.3 Power dissipation

With frequencies beyond 1 GHz and more than 10 million transistors on chip, current microprocessors dissipate between 40-70 W. Extensive cooling is needed to prevent the CPU from overheating and being damaged by effects like electromigration [15]. The speed of an ASIC is also reduced by rising temperatures. Recent projections [16] show that power is quickly becoming the main hurdle for future generations of chips, as depicted in Figure 1-5.

1. The figure visualizes only the trend; actual numbers may vary from vendor to vendor.

Several techniques are used to reduce the power consumption of ICs. The problem is attacked in both domains, design methodology as well as process technology. Semiconductor manufacturers develop new technologies with reduced supply voltages to keep power consumption at an acceptable level. New fabrication techniques like Silicon-On-Insulator (SOI) reduce the amount of leakage current.

Figure 1-5. Power dissipation of ICs in the next decade1 [16]

[Log-scale graph, 1970-2010: projected power density in W per cm², with reference levels for a hot plate, a nuclear reactor, and the Sun's surface.]

IC designers attack the problem at several abstraction levels. For very high frequencies of 1 GHz and more it has been found that about 50-70 % of the total power is consumed by the clock tree of a chip. This identifies the clock tree as an ideal point for power optimization. The trend of building System-on-a-Chip (SoC) designs with lots of components on a single die also lowers the utilization factor of on-chip components. Rarely are all components active at the same time; some may idle, waiting for input data, etc. So one can disable certain functional units for the time they are not needed. This is done by suppressing the clock signals for the whole unit, a technique called clock gating. E.g., a microprocessor could disable its floating point unit as long as no floating point instructions enter the instruction buffer. This can save a significant amount of power while running integer-dominated applications. Another method is to adjust the main frequency to the current demand for processing power. This technique is used heavily for mobile computers like laptops or PDAs.
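Clock gating as described above is typically implemented with a latch-based gating cell, so that glitches on the enable signal cannot produce runt clock pulses. A minimal Verilog sketch (module and signal names are illustrative; in practice a pre-characterized integrated clock-gating cell from the standard cell library is instantiated instead):

```verilog
module clock_gate (
    input  wire clk,
    input  wire enable,      // high while the unit has work to do
    output wire gated_clk
);
    reg enable_latch;

    // Transparent latch: captures 'enable' only while the clock is low
    always @(clk or enable)
        if (!clk)
            enable_latch = enable;

    // The AND gate can therefore only open or close between clock
    // pulses, so no truncated pulses reach the downstream unit.
    assign gated_clk = clk & enable_latch;
endmodule
```

A naive `assign gated_clk = clk & enable;` without the latch would pass any glitch on `enable` straight into the clock network.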

1. if current trends continue without major improvements in power reduction
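To give a feel for the magnitude of such clock-gating savings, the following back-of-the-envelope sketch models a chip whose gated units draw clock power only while active. The unit names, power figures and activity factors are invented for illustration and do not describe any real processor:

```python
# Toy model of clock gating: a gated unit contributes its clock-related
# power only during the fraction of cycles in which it is actually active.
# All numbers below are assumptions for illustration only.
units = {
    # name: (clock-related power in W, fraction of cycles the unit is active)
    "integer_core": (20.0, 0.95),
    "fpu":          (12.0, 0.05),  # mostly idle in integer-dominated code
    "load_store":   (10.0, 0.60),
}

power_ungated = sum(p for p, _ in units.values())
power_gated = sum(p * act for p, act in units.values())

print(f"without clock gating: {power_ungated:.1f} W")
print(f"with clock gating:    {power_gated:.1f} W")
```

Even this crude model shows why gating a rarely used unit like the FPU pays off disproportionately.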


1.3.4 Verification bottleneck

Another major problem is to validate the design before shipping the layout to the chip manufacturer. With increasing design complexity, the verification space, i.e. the number of different input and state combinations, becomes almost unmanageable. More and more effort has to be put into the functional test of a design, both in terms of testbench complexity and the processing power to run them. E.g., the team that developed the newest UltraSPARC III microprocessor from Sun [17] used a server farm with 3,000 CPUs in total to run the huge number of testbenches in an acceptable time frame. A single verification run can easily consume several Gbyte of memory while running for hours or days.

Verification is needed at all levels of the design flow. From high-level simulations of abstract functional implementations down to transistor-level simulations of the final layout, after each stage one has to verify that the design still meets all goals defined in the specification. Catching bugs as early as possible has become a significant factor in meeting the time-to-market goals of an IC project.

1.4 Contributions

This dissertation introduces a major redesign of the ATOLL architecture for a high performance System Area Network. It combines several unique features not found in current solutions, like the support for multiple network interfaces and the inclusion of an on-chip switch component. Data transfer between the host system and the network is optimized by a combination of PIO- and DMA-based mechanisms. A novel event notification technique greatly enhances the capabilities of the NI.

Besides discussing the architecture, it also describes the implementation of the design in a state-of-the-art semiconductor technology. Putting all the described functionality into a single chip is an extremely difficult task and has never been done before. A carefully planned design flow has been established to manage this large project with limited resources and manpower.

Extensive simulations were done to prove the functional correctness of the design and to make sure all performance goals are met. At the end, ideas for the next generation of ATOLL are discussed.

Though the author is responsible for the largest part of the design and implementation work regarding the ATOLL chip, several colleagues at the Chair of Computer Architecture have helped by designing some significant parts of the chip. Leaving out those parts in this thesis would prevent the reader from gaining a deep understanding of the whole architecture. Instead, the parts contributed by others are discussed here as well and marked by footnotes. References to additional literature have been added where possible.

1.5 Organization

The dissertation is organized in six chapters. This first chapter introduced Cluster Computing and System Area Networks in general. It presented current trends and also gave an insight into the problems of modern IC design. The following chapter then discusses current SANs in more detail. After listing the main design concepts, a broad overview is given of the architectural features of current networks. Chapter 3 presents the motivation for a novel SAN architecture called ATOLL. The rest of the chapter then introduces the ATOLL architecture. The main ideas behind ATOLL are presented, as well as their implementation. Chapter 4 follows with a broad overview of the development of the ATOLL ASIC. The main design steps are presented, together with the most important results. This is followed by a performance evaluation in Chapter 5. Finally, Chapter 6 summarizes the results, draws conclusions and discusses areas of future work.


2 System Area Networks

The network is the most critical component of a cluster. Its capabilities and performance directly influence the applicability of the whole system for HPC applications. After describing some traditional network technology, the most important general design issues for high performance networks are discussed. This is followed by a survey of existing SAN solutions. Their architecture and main properties are described and evaluated in the order in which the networks have evolved over the years.

2.1 Wide/Local Area Networks

According to recent cluster rankings1, about half of all clusters are still equipped with standard 100 Mbit/s Fast Ethernet network technology. This has mainly two reasons: costs and application behavior. While several SANs are available today, they cannot really compete with the mass-market prices of Fast Ethernet, even regarding their price/performance ratio. On the other hand, lots of applications were fine-tuned to the limited performance of LANs in the early days of Cluster Computing. When only Ethernet was available, programmers had no choice but to avoid communication where possible and to use more coarse-grained communication patterns in their applications. So a large set of programs is tailored towards the high latency and low bandwidth of Ethernet. Running these applications on a cluster equipped with a high performance SAN does not use the full potential of those interconnects. Significant modifications to the program code would be needed, but are rarely done.

2.1.1 User-level message layers

The first clusters running MPI/PVM applications used a normal TCP/IP layer to communicate over Fast Ethernet. But the TCP/IP protocol stack inside an operating system (OS) kernel is quite large, resulting in excessive latencies in the order of 50-70 us for sending a short message between two nodes. This high latency clashes with the goal of competing with supercomputers, which normally offer one-way latencies below 10 us. Since about 90 % of this latency can be attributed to software, researchers started to implement so-called user-level message layers [18], which bypass the OS for inter-node communication. By removing the OS from the communication path, the sending/receiving of a message can be sped up significantly, as shown in Figure 2-1.

1. "Clusters @ Top500", clusters.top500.org

Figure 2-1. User-level vs. OS-based TCP/IP communication

[Figure: an application on top of the network interface; the traditional path passes through the TCP/IP layer inside the operating system, while the new user-level path goes through a user-level library directly to the network interface.]

Implementations like U-Net [19], GAMMA [20] and Fast Messages [21] all provide a low-level Application Programming Interface (API) to the network. Only some initialization routines interact with the OS. All the functions to send/receive messages between different nodes of a cluster access the network interface directly. Implementations mainly differ in their levels of security and reliability. The fastest implementations simply ignore any security issues (memory protection, multiplexing the NI between processes), since most production clusters run a single parallel job with a one-to-one mapping of processes to CPUs for highest application performance.

Table 2-1. Comparison of user-level libraries for Fast Ethernet (a)

User-level library   System configuration                          Latency (us)   Bandwidth (Mbyte/s)
U-Net                DEC 21140 chipset, Intel Pentium 133 MHz      30.0           12.1
GAMMA                DEC 21143 chipset, AMD K7 500 MHz             14.3           12.1
TCP/IP               DEC 21143 chipset, Intel Pentium II 350 MHz   58             10.5

a. taken from the GAMMA website: www.disi.unige.it/project/gamma

Table 2-1 shows that user-level libraries can reduce latency by 50-75 % compared to TCP/IP. But the low physical bandwidth of Ethernet remains a critical bottleneck.

A few other LAN/WANs have been tested as cluster interconnects, but proved to be as inefficient as Fast Ethernet. As mentioned earlier, ATM provides more physical bandwidth, but its protocol is oriented more towards Quality-of-Service (QoS) applications

like streaming audio/video media. So overall, most LAN/WAN technology is inappropriate as a cluster interconnect. Only Fast Ethernet, and recently also its upgrade Gigabit Ethernet, can be used in combination with user-level message layers, if the applications are mostly sensitive to latency and not to bandwidth.

2.2 Design goals

Before several cluster interconnects are presented in detail, this section gives an overview of the main design trade-offs for interconnect hardware. For each design topic, several possibilities are presented and evaluated. With this basic knowledge in mind, the reader should be able to rate concrete implementations according to their usability and performance for specific applications.

Several decisions must be made when designing a cluster interconnect. The most important is undeniably the price/performance trade-off.

2.2.1 Price versus performance

In the last few years clusters of PCs have gained huge popularity due to the extremely low prices of standard PCs. Traditional supercomputer technology is replaced more and more by tightly interconnected PCs. In the interconnect market, though, a huge gap exists between interconnects of moderate bandwidth like Fast Ethernet at a low price ($50-100 for a network adapter) and high performance networks like Myrinet or ServerNet ($1000 and more). Of course, this is also a consequence of low production volumes. But other factors, such as onboard RAM or expensive physical layers such as Fibre Channel, can raise costs significantly.

2.2.2 Scalability

Table 2-2. Supercomputers and their different network topologies

Machine                  Topology
Cray T3E                 3D torus
IBM SP2                  omega (multistage)
SGI Origin 2000          hierarchical fat hypercube
nCube                    hypercube
Thinking Machines CM-5   fat tree

Scalability is another crucial issue. It refers to the network's ability to scale almost linearly with the number of nodes. A good topology is the key factor for good scalability. Interconnects in traditional supercomputers normally have a fixed network topology (mesh, hypercube, etc.), and hardware/software relies on this fixed topology. Table 2-2 gives an overview of the variety of network topologies used in recent supercomputers.

But clusters are more dynamic. Often a small system is set up to test whether the cluster fits the application needs. With increasing demand for computing power, more and more nodes are added to the system. The network should tolerate the increased load and deliver nearly the same bandwidth and latency to small clusters (8-32 nodes) and to large ones (hundreds of nodes). A large mesh will show increased latency compared to a small one, since the average distance between nodes also increases. Large switches (16x16, 24x24) forming a cluster-of-clusters topology can help to compensate for this effect [22]. Similarly, a hypercube network cannot be upgraded from 64 to 96 nodes, because it needs a power of two as node count. Therefore, modern cluster interconnects should allow an arbitrary network topology. Hardware/software determines the topology at system start-up and initializes routing tables, etc.
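To illustrate how such start-up initialization might work, the following Python sketch derives a routing table for one switch of an arbitrary topology with a breadth-first search. The adjacency-list representation and the ring example are assumptions for illustration, not a description of any particular network's mechanism:

```python
from collections import deque

def build_routing_table(links, source):
    """Compute, for the given source switch, the first hop (outgoing port)
    on a shortest path to every other node.
    links: adjacency list {node: [neighbors]} describing the topology."""
    table = {}                       # destination -> outgoing port
    visited = {source}
    queue = deque()
    for neighbor in links[source]:
        visited.add(neighbor)
        table[neighbor] = neighbor   # direct neighbors: forward straight to them
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in visited:
                visited.add(nxt)
                table[nxt] = table[node]   # inherit the first hop
                queue.append(nxt)
    return table

# Example: a 4-node ring A-B-C-D-A
ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
print(build_routing_table(ring, "A"))
```

Because the table is computed rather than wired in, adding nodes only requires rerunning this step at the next system start-up.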

2.2.3 Reliability

Applications for parallel computing can be roughly divided into two main classes, scientific and business computing. Especially in the business field, corrupted or lost message data cannot be tolerated. To guarantee data delivery, protocol software of traditional WAN/LAN networks computes CRCs, buffers data, acknowledges messages, and retransmits corrupted data. This protocol layer has been identified as one main reason for the poor latency of current networks. For clusters, with their need for low latency and thin protocol layers, this overhead must be minimized.
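For illustration, the per-message CRC that such protocol software computes can be sketched as follows; CRC-16-CCITT (polynomial 0x1021, initial value 0xFFFF) is chosen only as an example, not because any particular network prescribes it:

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16-CCITT (polynomial 0x1021, MSB first), as protocol
    software computes it over a message before transmission."""
    for byte in data:
        crc ^= byte << 8              # fold the next byte into the register
        for _ in range(8):            # shift out one bit at a time
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

payload = b"message data"
checksum = crc16_ccitt(payload)
# the checksum is appended to the message; the receiver recomputes and compares
print(hex(checksum))
```

Walking over every payload byte in software like this is exactly the per-message cost that moving the CRC into NI hardware removes.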

First, cluster interconnects with their short range physical layers have proven to be almost error-free. The computation of CRCs can easily be done on the fly by the NI itself. Possible errors can be signaled to software through interrupts or status registers. To relieve software from buffering message data, the NI could also temporarily buffer the message data and initiate retransmissions in case of errors. Overall, the cluster interconnect should present itself to the user as a reliable network without additional software overhead for safe data transmission.

2.3 General architecture

A general design decision must be made between a dumb NI, which is controlled and managed by the CPU, and an intelligent, autonomous NI performing most of the work by itself. The first solution has the advantage of low design effort, resulting in short time-to-market and low redesign costs. On the other hand, enabling the NI to do jobs such as data transfer or matching a receiver ID with its network address/path can free the microprocessor from this work and reduce the start-up latency of message transfers.

The advantages of both methods can be combined by adding a dedicated communication processor to the system [23]. This node design has been chosen for some parallel architectures (Intel Paragon, MANNA [24]) and resulted in good performance, especially for communication-intensive applications. In the following, the two main trade-offs are presented.

2.3.1 Shared memory vs. distributed memory

The first decision of a designer of cluster interconnects is the memory (programming) model to be supported. The shared memory model makes the cluster network transparent to processes through a common global address space. Virtual memory management hardware and software (MMU, page tables) is used to map virtual addresses to local or remote physical addresses. Since the overhead of applying this model to the whole address space is quite large, interconnects supporting shared memory offer the ability to map remote memory pages into local applications' address spaces, like DEC's (later Compaq's) Memory Channel [25].

Figure 2-2. Write operation to remote memory

[Figure: two nodes, each with CPU and memory on a system bus and an NI on the I/O bus, connected through the interconnection network; the three numbered steps of a remote write are marked along the path from the CPU of node 1 to the memory of node 2.]

Figure 2-2 shows an example of a write operation to remote memory, where the NI resides on the I/O bus. The operation can be split up into three main steps, labeled with their corresponding numbers:

1. the CPU writes the message data to a shared memory region whose virtual memory address is mapped to the NI on the I/O bus


2. the NI indexes an address translation table with the write address to determine the destination node of the transaction. It then transfers the data to the remote node for further processing, together with a remote write address

3. the destination node receives the data and uses the address to write it to local memory. If the address is virtual, it has to do another translation step. But this could also have been done already by the sending NI; it depends on whether the shared address space uses virtual or physical addresses
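The translation in step 2 can be sketched in a few lines; the page size, table contents and return convention below are invented for illustration and do not describe a real NI:

```python
PAGE_SIZE = 4096  # assumed page size

# shared-memory page number -> (destination node, remote page number);
# in a real system the OS would fill this table when pages are mapped
translation_table = {
    0x10: (2, 0x80),   # local page 0x10 lives on node 2 at remote page 0x80
}

def translate_write(address):
    """What the sending NI does: split the address into page and offset,
    look up the page, and produce (destination node, remote address)."""
    page, offset = divmod(address, PAGE_SIZE)
    node, remote_page = translation_table[page]
    return node, remote_page * PAGE_SIZE + offset

node, remote_addr = translate_write(0x10 * PAGE_SIZE + 0x24)
print(node, hex(remote_addr))
```

Note that the offset within the page passes through unchanged; only the page number is translated, which keeps the table small.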

A lot of work has to be done by the NI if the virtual shared memory is intended to be cache-coherent across all cluster nodes, as known from SMP systems. A cache coherence protocol must observe the memory space on a cache line or page basis. Writes must be propagated to all nodes owning a copy of the memory cell, or these copies must be invalidated. In short, the overhead of cluster-wide cache coherence can be manageable for small systems, but gets inefficient for large node counts. The only remaining large-scale shared memory supercomputer today is the SGI Origin [26] with its so-called ccNUMA architecture.

In the distributed memory model, message passing software makes the network visible to applications. Data can be sent to other nodes through send/receive API calls. Compared to the shared memory model, the user has to explicitly call communication routines to transfer data to or from the network. Besides Memory Channel and SCI, which support the shared memory model, all remaining interconnects presented here rely on the message passing model.

2.3.2 NI location

Figure 2-3. Possible NI locations

[Figure: a node with CPU, system bus, and I/O bus; NI-1 is integrated into the CPU, NI-2 is attached to the system bus, and NI-3 sits on the I/O bus.]


The location of the NI inside a system has a great impact on its performance and usability. In general, the nearer it is to the microprocessor, the more bandwidth is typically available.

As depicted in Figure 2-3, there are three possible locations for the NI:

NI-1

An interesting solution is support for communication at the instruction set level inside a microprocessor. By moving data into special communication registers, it is transferred into the network at a rate equal to the processor speed. This technique has been realized in some architectures in the past; its most famous representative is the Transputer [27] from INMOS. With four on-chip links running at full processor clock speed, the Transputer was an ideal candidate as a building block for grid-interconnected massively parallel computers. Similar implementations are the iWarp [28] or related systolic architectures. Although these architectures are very interesting from the designer's view, the market for this kind of microprocessor proved to be too small. Most implementations reached the prototype phase, but had no commercial success. Some research projects also tried to include a network interface at the cache level, but these met the same fate. Another attempt in this direction is the Alpha 21364 (EV7) microprocessor [29], which has 4 on-chip inter-processor links, each providing a data rate of 6.4 Gbyte/s. But Compaq has recently announced the discontinuation of the Alpha CPU family, so that its successor, the EV8, will not be fabricated.

NI-2

Assuming a high performance system bus design, this location is an ideal place for a network interface. Today's system buses offer very high bandwidths in the range of several Gbyte/s. Common cache coherence mechanisms can be used to efficiently observe the NI status. The processor can poll cache-coherent NI registers without consuming bus bandwidth. If a register changes its state (e.g., a status flag is set to indicate message arrival), the NI invalidates the observed cache line, and on the next load instruction the new value is fetched from the NI. DMA controllers can read/write data from/to main memory using burst cycles at a very high bandwidth. Although there are several advantages to designing the NI with a system bus interface, only a few NIs are implemented in this way. The reason is that each processor has its own bus architecture, which ties an NI implementation to a specific processor. The market for cluster interconnects is not yet large enough to justify such a specialization. Furthermore, commercial interests are likely to prevent the emergence of standard processor bus architectures, even though more than just SANs would benefit from them. Only proprietary interconnects can be designed for the system bus; an example is the SAN adapter of the IBM SP2.

NI-3

Most current interconnects have I/O bus interfaces, mainly PCI. The reason is the great acceptance of PCI as a standard I/O bus. PCI-based NIs can be plugged into any PC or workstation, even forming heterogeneous clusters. A 32 bit/33 MHz PCI device can deliver a peak data rate of 132 Mbyte/s, which can be nearly reached with long DMA bursts. To keep the PCI bus from becoming the main bottleneck between system buses and physical layers with gigabytes per second of bandwidth, most SANs have already moved on to 64 bit/66 MHz PCI bus interfaces. Since the I/O bus even then remains a major bottleneck, SAN developers await the implementation of upcoming I/O bus standards like PCI-X [30] and 3GIO [31]. But a transition only makes sense when they are widely used in the mainstream PC industry. A disadvantage of the I/O bus location is the loss of properties such as cache coherence.
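The peak rates quoted above follow directly from bus width times clock rate, as this small sketch shows (one transfer per clock cycle, ignoring arbitration and protocol overhead):

```python
def peak_bandwidth(width_bits, clock_mhz):
    """Peak data rate of a parallel bus in Mbyte/s: width_bits bits are
    transferred once per clock cycle."""
    return width_bits // 8 * clock_mhz

print(peak_bandwidth(32, 33))   # classic 32 bit/33 MHz PCI
print(peak_bandwidth(64, 66))   # 64 bit/66 MHz PCI
```

The 528 Mbyte/s of 64 bit/66 MHz PCI is the "more than 500 Mbyte/s" figure quoted for this bus earlier.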

Most interconnects presented in this chapter use the I/O bus as their interface to the host.

2.4 Design details

In the following, a closer look is taken at some specific implementation details. Small modifications of the hardware can have a great impact on the NI's overall performance. This section focuses on the main mechanisms of interconnection networks. Additional literature [32], [33] is recommended for an even more detailed analysis.

A general rule of thumb could be: keep the frequent case simple and fast. For example, mechanisms for error detection and correction should be implemented in a way that adds no overhead to error-free transmissions. In the very rare case of a transmission error, some overhead can be accepted, since the error rates of current physical layers are very low. The NI should also be able to pipeline data transfers, so that the head of a message packet can be fed into the network while the tail is still being fetched from memory. This enables low start-up latencies and good overall throughput.

The term link protocol refers to the layout of the messages transmitted over the physical layer and to the interaction between communicating link endpoints. Figure 2-4 shows two link endpoints (which could reside in an NI or a switch), connected by two unidirectional channels for sending and receiving data. It also depicts the general layout of a message. Typically, message data is enclosed by special control datawords, which can be used to detect the start/end of message data and to signal link protocol events (receiver cannot accept more data, request for retransmission, etc.).

Figure 2-4. A bidirectional link and the general message format

[Figure: two link end points, residing in an NI or switch, connected by two unidirectional n-bit channels; a message consists of an address/destination part, a type/header part, and the data/payload.]

2.4.1 Physical layer

Choosing the right physical medium for a channel is a trade-off between raw data rate, availability and cable costs. Copper is still the most widely used physical medium for link cables, but optical links, which have long been used to enhance the capacity of Wide Area Networks, are on the verge of penetrating the LAN and SAN markets. Myricom, one of the leading SAN suppliers, announced fiber links in July 2001, together with the intention to replace all copper-based cables with fiber within a few months.

Figure 2-5. Ultra-fast serial, switched connections replace parallel bus architectures

[Figure: in I/O and system technology, parallel buses (PCI, SCSI, ATA) are being replaced by serial, switched interconnects (USB, HyperTransport, InfiniBand, 3GIO) and point-to-point designs like PCI-X at 133 MHz; in System Area Networks, Myrinet as an example evolved from byte-parallel copper links at 160 MHz to serial 2.5 Gbps copper and then serial 2.5 Gbps optical fiber links.]


Another trend to observe is the replacement of medium-fast parallel links by high-speed serial connections. This is not only true for the SAN market, but also for the whole I/O infrastructure in PCs and workstations. Figure 2-5 shows this trend by listing current and future I/O and network technologies. The newest I/O and system technologies start with serial implementations, but provide an upgrade path for using multiple connections in parallel. They all move from a bus-based topology to a fully switched network for better bandwidth utilization and easier implementation. Myrinet serves here as an example for the SAN market: it started byte-parallel, went serial, and then switched from copper to fiber. The next announced step is the use of multiple serial connections per NIC.

One of the main reasons for using serial connections is the reduced pin count of switches. The latest switch technology with parallel links is limited by the pin count of IC devices. To transmit signals at a high clock rate (200-500 MHz) with reasonable power consumption, the Low Voltage Differential Signaling (LVDS) technique is used, in which two wires carry complementary levels of each signal. So an 8x8 unidirectional switch with 32 bit differential signal lines would need 1024 (= (8+8)*32*2) pins for the links alone, which is at the upper limit of today's IC packaging. Bytewide links, as used by most SANs, have been a good compromise in recent years: network switches of moderate sizes can be built, while the raw data rate still exceeds that of serial media.
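The pin-count arithmetic from the example can be captured in a one-line formula; the function below merely restates the calculation in the text and then applies it to a serial variant of the same switch:

```python
def link_pins(in_ports, out_ports, link_width_bits, wires_per_bit=2):
    """Pins needed for the links of a switch: every port in each direction
    carries link_width_bits bits, each as a differential (2-wire) pair."""
    return (in_ports + out_ports) * link_width_bits * wires_per_bit

print(link_pins(8, 8, 32))  # the 8x8 switch with 32-bit parallel links
print(link_pins(8, 8, 1))   # the same switch with serial differential links
```

Going serial shrinks the link pin count of this switch from 1024 to 32, which is why serial links allow much larger switch radices.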

But recent developments have made it possible to transmit signals via copper cables at rates of 1-3 Gbps, with 10 Gbps technology on the horizon. Another limitation of electrical transmission at these fast rates is the limited range. Normally, only a few meters can be spanned before the signals need to be received or refreshed. And signals become more sensitive to noise from parallel bit lines or other electrical equipment in the surroundings. Optical layers have a clear advantage here, since an optical fiber does not emit any electromagnetic radiation at all. And techniques like Dense Wavelength Division Multiplexing (DWDM) or Time Division Multiplexing (TDM) promise to lift the data rate of optical fibers to 10 Gbps and more.

2.4.2 Switching

The term switching refers to the method by which data is forwarded from the source to the destination in a network. Two main packet switching techniques, depicted in Figure 2-6, are used in today's networks: store & forward and cut-through switching. The first stores a complete message packet in a network stage before the data is sent to the next one. This mechanism needs an upper bound for the packet size (MTU, Maximum Transfer Unit) and some buffer space to store one or several packets temporarily.


In Figure 2-6 (a), packet p0 has just arrived at the switch through port 1 and is placed into the packet buffer pool. Packets p1 and p2 have been received completely and are now forwarded towards their destinations through different ports. This is the common switching technique in LAN/WANs, because it is easier to implement and the recovery from transmission errors involves only the two participating network stages.

Figure 2-6. Packet switching techniques

[Figure: (a) store & forward switching, where arriving packets (p0, p1, p2) are first placed into a packet buffer pool inside the switch before being forwarded through their outgoing ports; (b) cut-through switching, where the flits of a message stretch across several switches on the way to the destination.]

Newer SANs like ServerNet, Myrinet and QsNet use cut-through switching (also referred to as wormhole switching), where the data is immediately forwarded to the next stage as soon as the address header is decoded. Figure 2-6 (b) shows how a message finds its way through the network like a 'worm'. Low latency and the need for only a small amount of buffer space are the advantages of this technique. But error handling is more complicated, since more network stages are involved: corrupted data might be forwarded towards the destination before it is recognized as erroneous.
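A first-order latency model makes the difference between the two techniques explicit. The sketch below ignores routing delay and contention, and the hop count and packet sizes in the example are arbitrary:

```python
def store_and_forward_latency(packet_bytes, mbytes_per_s, hops):
    """The full packet is received and retransmitted at every stage, so it
    crosses hops + 1 links one after another. Result in microseconds
    (1 Mbyte/s = 1 byte/us)."""
    return (hops + 1) * packet_bytes / mbytes_per_s

def cut_through_latency(packet_bytes, header_bytes, mbytes_per_s, hops):
    """Each stage forwards as soon as the header is decoded; only the
    header delay accumulates per hop. Result in microseconds."""
    return hops * header_bytes / mbytes_per_s + packet_bytes / mbytes_per_s

# 1024-byte packet, 8-byte header, 100 Mbyte/s links, 3 switches on the path
print(store_and_forward_latency(1024, 100, 3))  # roughly 41 us
print(cut_through_latency(1024, 8, 100, 3))     # roughly 10.5 us
```

With cut-through, latency is dominated by a single serialization of the packet, which is why the technique pays off most on multi-hop paths.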

2.4.3 Routing

The address header of a message carries the information needed by the routing hardware inside a switch to determine the right outgoing channel, which brings the data nearer to its destination. Although a lot of deterministic and adaptive routing algorithms have been proposed, the latter will not be studied here. Adaptive routing schemes try to dynamically find alternative paths through the network in case of overloaded paths or even broken links. But adaptive routing has not found its way into real hardware yet. Two mechanisms are used in today's interconnects: source-path/wormhole and table-based routing [34].


In Figure 2-7 (a), an example of wormhole routing [35] is given. A message enters a switch on port 0 and carries the routing information at the head of the message packet. As soon as the first dataword is received, the routing hardware can determine the outgoing channel. Used routing data is stripped off, so the routing information for the next switch now leads the message. The entire path to the destination is attached to the message at its source location.

Figure 2-7. Routing mechanisms

[Figure: (a) source-path routing, where the leading dataword of a message entering on port 0 selects the outgoing port and is stripped off; (b) table-based routing, where the destination id of an incoming message indexes a routing table that stores the outgoing port for each destination node.]

In Figure 2-7 (b), a switch containing a complete routing table is shown. For each destination node its corresponding port is stored. When a message enters the switch, a table lookup determines the right outgoing channel. Routing table size is proportional to the number of nodes, which can be a limiting factor for large cluster configurations with hundreds or even thousands of nodes. The former method is easier to implement and faster, whereas the latter provides more flexibility. The routing table could provide alternative paths if the currently addressed path is overloaded. Or, based on link utilization information, the routing engine could try to find the fastest path towards the destination. But one needs to be careful with such non-deterministic routing. Link protocols may rely on messages being delivered in order, which is not assured with such a form of adaptive routing. Another problem is the prevention of a so-called livelock, where a message is always routed over alternative paths, but never reaches its final destination.
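The two lookup styles can be contrasted in a few lines of Python; the message layout and port numbering are invented for illustration:

```python
def route_source_path(message):
    """Source-path routing: the leading entry names the outgoing port
    and is stripped off before the message is forwarded."""
    port, remainder = message[0], message[1:]
    return port, remainder

def route_table_based(message, routing_table):
    """Table-based routing: the destination id in the header indexes the
    table; the message itself is forwarded unchanged."""
    return routing_table[message[0]], message

# Source-path: the sender attached the full path (ports 1, 0, 2)
msg = [1, 0, 2, "payload"]
port, msg = route_source_path(msg)   # first switch forwards through port 1
print(port, msg)

# Table-based: destination id 7 is mapped to port 2 in this switch
port, _ = route_table_based([7, "payload"], {7: 2})
print(port)
```

Note that the source-path switch needs no state at all, while the table-based switch needs one entry per reachable node.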

A problem for both routing mechanisms is the avoidance of deadlocks. A deadlock appears when several messages block each other in such a way that no message can reach its destination and the network is blocked as a whole. This situation is depicted in Figure 2-8: all messages form a circle of chained, blocked port requests. No message can progress, thus the network is jammed. One can solve this problem by restricting the routing of messages in such a way that these circles of requests cannot appear. One possible solution is strict x-y dimension routing in 2D meshes, where all messages are routed first in the horizontal direction and then in the vertical. Many other solutions exist [36], more or less complex and efficient.

Figure 2-8. Messages forming a deadlock

[Figure: four switches arranged in a square, each with ports N, W, E, S; messages msg0-msg3 each request the port held by the next message in the cycle (to port N, E, S, W respectively), forming a circle of blocked requests.]
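The x-y dimension routing mentioned above is simple enough to state in a few lines; the coordinates and direction names are illustrative:

```python
def xy_route(current, dest):
    """Deadlock-free dimension-order routing in a 2D mesh: resolve the x
    distance completely before moving in y; returns the outgoing direction."""
    (cx, cy), (dx, dy) = current, dest
    if cx != dx:
        return "E" if cx < dx else "W"
    if cy != dy:
        return "N" if cy < dy else "S"
    return None  # message has arrived

# Follow a message from (0, 0) to (2, 1) through the mesh
step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
pos, path = (0, 0), []
while (d := xy_route(pos, (2, 1))) is not None:
    path.append(d)
    pos = (pos[0] + step[d][0], pos[1] + step[d][1])
print(path)
```

Because no message ever turns from the y dimension back into x, the circular wait of Figure 2-8 cannot form.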

2.4.4 Flow control

Flow control [37] is used to avoid buffer overruns inside link end points, which would result in the loss of data. Before the sender can start a transmission, the receiver must signal its ability to accept the data. One possible solution is a credit-based scheme, where each sender gets a number of credits from the receiver. On each packet transmission, the sender consumes a credit and stops when all credits are consumed. After freeing some buffer space, the receiver can restart the transmission by handing additional credits to the sender. Alternatively, the receiver can simply signal the sender whether it can accept data or not. In both cases, the flow control information travels in the opposite direction relative to the data (reverse flow control). For example, Myrinet inserts STOP and GO control bytes into the opposite channel of a full-duplex link to stop or restart data transmission on the sender side.
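A minimal model of the credit-based variant looks as follows; the buffer size and packet names are arbitrary, and the hardware handshakes are collapsed into method calls:

```python
class CreditLink:
    """Credit-based flow control: the sender holds one credit per free
    receiver buffer slot and must stall when all credits are used up."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # granted by the receiver at start-up
        self.buffer = []              # receiver-side packet buffer

    def send(self, packet):
        if self.credits == 0:
            return False              # no credit left: sender stalls
        self.credits -= 1
        self.buffer.append(packet)
        return True

    def consume(self):
        """The receiver drains one packet and hands a credit back."""
        packet = self.buffer.pop(0)
        self.credits += 1
        return packet

link = CreditLink(buffer_slots=2)
print(link.send("p0"), link.send("p1"), link.send("p2"))  # third send must stall
link.consume()                                            # frees a slot, returns a credit
print(link.send("p2"))                                    # now it succeeds
```

The invariant is that credits never exceed the free buffer slots, so the receiver can never be overrun regardless of link delay.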

2.4.5 Error detection and correction

Though today's physical layers have very low error rates, the network must offer some mechanisms for error detection and possibly correction in hardware. In the era of user-level NI protocols, it is no longer acceptable that software has to compute a CRC. This task can easily be done in hardware. For example, the NI adapter can compute a CRC on the fly while data is transferred to it. This CRC is appended to the message data and can be checked at each network stage. If an error is detected, the message can be marked as corrupted. The receiver can then send a request for retransmission back to the sender. This, of course, assumes that the complete data is buffered on the sender side.

Especially in fields like business computing, with their need for fault-tolerant hardware, it is also common to replicate hardware. E.g., some vendors add a complete second network for redundancy and always transmit data via both connections. The additional costs are acceptable for applications like transaction servers, where even a few minutes of system failure can produce a significant drop in sales volume. But in more cost-sensitive areas like Cluster Computing, users tend to handle program failure in software via checkpointing or the like.

2.5 Data transfer

Efficient transfer of message data between the node's main memory and the NI is a critical factor in achieving nearly the physical bandwidth in real user applications. To reach this goal, modern NI protocol software involves the OS only when the network device is opened or closed by user applications. Normal data transfer is done completely in user-level mode by library routines to avoid the costs of OS calls. The goal is a zero-copy mechanism, where data is directly transferred between the user space in main memory and the network. Examples are shown for an NI located on the PCI bus, since this is the current (but not preferred, as mentioned earlier) location of today's network adapters. Also, the focus is more on interconnects for message passing because of the broader design space.

2.5.1 Programmed I/O versus Direct Memory Access

Message data can be transferred in two ways: Programmed I/O (PIO), where the processor copies data between memory and the NI, and Direct Memory Access (DMA), where the network device itself initiates the transfer. Figure 2-9 depicts both mechanisms on the sender side. PIO only requires that some NI registers are mapped into the user space. The CPU is then able to copy user data from any virtual address directly into the NI and vice versa. PIO offers very low start-up times, but becomes inefficient with increasing message size, since processor time is consumed by simple data copy routines. DMA needs a bit more setup time, since the DMA controller inside the NI normally needs physical addresses to transfer the correct data. Most interconnects offering DMA transfer require that pages are pinned down in memory, so the OS cannot swap them out to disk. This makes it feasible to hand over physical addresses to the NI, but adds an additional copying step to transfer the user data into the DMA region (loss of the zero-copy property). After data is copied (step DMA1 in Figure 2-9), the processor starts the transfer by creating an entry in a job queue (DMA2), which can reside either in main memory or in the NI. The NI sets up a DMA transfer to read the message data from memory (DMA3), which is then fed into the network. DMA is not suitable for small messages, but for large messages it relieves the processor so it can do useful work.
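The two transfer styles and the three DMA steps can be sketched as follows; all structures (the job queue, the pinned DMA region address, the register list) are hypothetical stand-ins for real driver and NI state:

```python
# Sketch of the PIO and DMA send paths (hypothetical structures, not a driver API).
main_memory = {}                      # physical address -> data

def pio_send(ni_registers, data):
    """PIO: the CPU copies the payload word by word into mapped NI registers."""
    for word in data:
        ni_registers.append(word)     # one store per word burns CPU time

def dma_send(user_data, dma_region_addr, job_queue):
    main_memory[dma_region_addr] = list(user_data)   # DMA1: copy into pinned DMA region
    job_queue.append({"addr": dma_region_addr,       # DMA2: post a job-queue entry
                      "len": len(user_data)})

def ni_process(job_queue):
    """DMA3: the NI reads the descriptor and fetches the payload from memory."""
    job = job_queue.pop(0)
    return main_memory[job["addr"]][:job["len"]]     # data is then fed into the network

registers = []
pio_send(registers, "hi")
assert registers == ["h", "i"]

queue = []
dma_send("msg", dma_region_addr=0x1000, job_queue=queue)
assert ni_process(queue) == ["m", "s", "g"]
```

The extra copy in `dma_send` is exactly the loss of the zero-copy property mentioned above.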

Figure 2-9. PIO vs. DMA data transfer


Several factors influence the performance of both mechanisms. The simplest PIO implementation writes message data sequentially into a single NI register, which resides in I/O space. This normally results in single bus cycles and poor bandwidth. To achieve an acceptable bandwidth, the processor must be able to issue burst cycles. This can be done by choosing a small address area as target, which is treated as memory. Writing to these consecutive addresses enables the CPU or the I/O bridge to apply techniques like write combining, where several consecutive write operations are assembled in a special write buffer and issued as a burst transaction. This mechanism can be found in most modern microprocessor architectures. Another solution would be instruction set support for cache control (cache line flush, etc.), as implemented in the PowerPC architecture. Since the PCI bus implements variable-length burst transactions, a DMA controller inside the NI can try to read/write a large block of data in one burst cycle. Experiments have shown that it is possible to reach about 90 % of the peak bandwidth with long bursts (110-120 Mbyte/s on a 32 bit/33 MHz PCI bus with 132 Mbyte/s peak bandwidth).
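The effect of write combining can be sketched as follows; the burst length and addresses are illustrative, and the flush policy is simplified compared to a real write buffer:

```python
# Sketch of write combining: consecutive stores to adjacent addresses are merged
# into one burst transaction (buffer size and addresses are illustrative).
def combine_writes(writes, burst_len=8):
    """writes: list of (address, data) pairs; returns the issued bus transactions."""
    bursts = []
    cur_addrs, cur_data = [], []
    for addr, data in writes:
        if cur_addrs and addr != cur_addrs[-1] + 1:
            bursts.append((cur_addrs[0], cur_data))   # flush on non-consecutive address
            cur_addrs, cur_data = [], []
        cur_addrs.append(addr)
        cur_data.append(data)
        if len(cur_addrs) == burst_len:               # write buffer full: issue a burst
            bursts.append((cur_addrs[0], cur_data))
            cur_addrs, cur_data = [], []
    if cur_addrs:
        bursts.append((cur_addrs[0], cur_data))
    return bursts

# 16 consecutive single-word writes collapse into two 8-word bursts:
txns = combine_writes([(a, a) for a in range(16)])
assert len(txns) == 2
```

Without combining, the same 16 stores would appear on the bus as 16 single cycles, which is precisely the poor-bandwidth case described above.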

To sum up, PIO is superior to DMA for small messages up to a certain size, beyond which the copy overhead stalls the processor for too long. If one recalls that the majority of typical network traffic consists of small messages, it becomes clear that an NI designer should implement support for both mechanisms.
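The crossover argument can be made concrete with a back-of-the-envelope cost model; all parameters below are illustrative assumptions in arbitrary time units, not measured values:

```python
# Back-of-the-envelope model of the PIO/DMA trade-off (illustrative parameters).
def pio_cost(nbytes, per_byte_copy=0.01):
    """PIO: the CPU is busy for every byte it copies into the NI."""
    return nbytes * per_byte_copy

def dma_cost(nbytes, setup=5.0, per_byte_copy=0.01):
    """DMA: fixed setup (job-queue entry, descriptor fetch) plus a small per-byte
    share for the copy into the pinned DMA region; the NI moves the rest."""
    return setup + nbytes * per_byte_copy * 0.2

assert pio_cost(64) < dma_cost(64)            # small message: PIO wins
assert pio_cost(10_000) > dma_cost(10_000)    # large message: DMA relieves the CPU
```

With these (made-up) numbers the crossover lies in the sub-kilobyte range, which matches the qualitative picture: below it PIO's low start-up cost dominates, above it DMA's fixed setup amortizes.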


2.5.2 Control transfer

If DMA is used for transferring message data, another critical design choice is the mechanism used to signal to the processor that a complete message has been received. This is often referred to as control transfer. In polling mode, the CPU continuously reads an NI status register. The NI sets a flag bit in case of a completed transaction. If the NI resides on the I/O bus, this can waste a lot of valuable bandwidth. As an improvement, the NI can mirror its status into main memory. This enables the processor to poll on cache-coherent memory, thus saving bandwidth.

Another solution is to interrupt the CPU. But this results in a context switch to kernel mode, which is an expensive operation. A hybrid solution could enable the NI to issue an interrupt when message data has been present for a certain amount of time without being picked up. A programmable watchdog timer inside the NI could do this job.
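The hybrid scheme can be sketched as follows; the class, the tick-based timer and the timeout value are illustrative, not taken from any real NI:

```python
# Sketch of hybrid control transfer: poll a status word mirrored into main
# memory, with an NI watchdog raising an interrupt as a fallback (illustrative).
class NIWatchdog:
    def __init__(self, timeout_ticks=3):
        self.timeout = timeout_ticks
        self.pending_since = None
        self.status_mirror = 0            # mirrored into cache-coherent main memory
        self.interrupts = 0

    def message_arrived(self, now):
        self.status_mirror = 1            # the CPU can poll this cheaply
        self.pending_since = now

    def tick(self, now):
        """Called every tick; fires an interrupt if the message was never picked up."""
        if self.pending_since is not None and now - self.pending_since >= self.timeout:
            self.interrupts += 1          # fall back to an (expensive) interrupt
            self.pending_since = None

    def cpu_poll(self):
        if self.status_mirror:
            self.status_mirror = 0
            self.pending_since = None     # consumed by polling, no interrupt needed
            return True
        return False

ni = NIWatchdog(timeout_ticks=3)
ni.message_arrived(now=0)
assert ni.cpu_poll()                      # picked up by polling...
assert ni.interrupts == 0                 # ...so no interrupt was raised
ni.message_arrived(now=10)
for t in range(10, 15):
    ni.tick(t)                            # CPU never polls this time
assert ni.interrupts == 1                 # watchdog fired after the timeout
```

A busy receiver thus pays only cheap polls, while an idle one is still notified promptly.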

2.5.3 Collective operations

So far, we have only presented mechanisms to send or receive messages in a point-to-point manner. Software for Parallel Computing often uses collective communication techniques such as barrier synchronization or multicasts. This is especially true with networks for virtual shared memory, where updated data must be distributed to all other nodes. For example, supercomputers like the Cray X-MP or Alliant FX [38] offered a dedicated synchronization network. Today's cluster networks leave this task to software, where tree-based algorithms map a broadcast to a hierarchical send/forward scheme. Only a few interconnects have direct hardware support for collective operations. The barrier register of the Synfinity interconnect [39] is one example.
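The tree-based send/forward scheme mentioned above can be sketched as a binomial-tree schedule; this is one common software mapping, not the only possible one:

```python
# Sketch of mapping a broadcast onto a hierarchical send/forward tree
# (binomial-tree style), as software layers do on point-to-point networks.
def broadcast_schedule(nodes, root=0):
    """Return the list of point-to-point sends (src, dst) that reach all nodes."""
    sends, have_data = [], {root}
    stride = 1
    while stride < nodes:
        for src in sorted(have_data):     # every holder forwards one copy per round
            dst = src + stride
            if dst < nodes and dst not in have_data:
                sends.append((src, dst))
        have_data.update(d for _, d in sends)
        stride *= 2
    return sends

sends = broadcast_schedule(8)
assert len(sends) == 7                    # n-1 messages reach all 8 nodes
assert {d for _, d in sends} == set(range(1, 8))
```

The number of rounds grows only logarithmically with the node count, which is why this mapping is preferred over the root sending n-1 messages itself.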

Networks with a shared bus like Fast Ethernet can easily broadcast data, whereas the integration into point-to-point networks like Myrinet or ServerNet is more complicated. Often the hardware realization of collective operations implies a restriction of the network topology. This issue is an area for further improvement of today's cluster networks.

2.6 SCI

SCI (Scalable Coherent Interface) [40] is an IEEE standard (ANSI/IEEE Std 1596-1992) finally approved in 1992. The goal was to develop a network capable of interconnecting multiple systems into one unified distributed memory machine. It should overcome the limitations of bus-based approaches, which have a very limited scalability (up to 32/64 CPUs). The standard defines a set of hardware protocols and memory transactions, with the option of a hardware cache coherence protocol.


2.6.1 Targeting DSM systems

Several goals influenced the standardization of SCI as a Distributed Shared Memory (DSM) network. Of course, the premier goal is high performance in all relevant aspects. To compete as an interconnect for DSM machines with bus-based SMP systems, SCI needs to provide the same level of data bandwidth, message latency and low CPU overhead as its competing architectures. The continuous progress in CPU and memory speed should not outrun the network in terms of performance. The second important goal is scalability in many respects. SCI should be able to scale well beyond hundreds of nodes. This applies to the cache coherence mechanism, to the interconnect technology in terms of the media used and its length, as well as to the addressing scheme to share a terascale memory space.

In the following, the main concepts behind SCI are presented:

• all SCI networks should be built out of unidirectional, point-to-point links. Most implementations use bit-parallel, copper-based links with a range of a few meters. This is normally sufficient to interconnect a few machines into a tightly-coupled DSM system.

• links should rely on the latest signaling technology to provide the best performance. Link speeds today vary between 500/667 Mbyte/s for SAN cables (LVDS, CMOS) and 1 Gbyte/s for intra-system connections in more advanced, but costly technologies (GaAs, BiCMOS).

• though it is most common today to interconnect PCs or workstations with SCI, the intention was also to connect different system components (memory modules, disk arrays, etc.) to an SCI network. But this would require a broad adoption of SCI interface technology. All relevant systems use either SMP or single-CPU machines as nodes. SCI can support up to 64k nodes, but real systems normally use a two-digit number of nodes.

• the SCI standard does not specify a particular network topology, but since most adapters have two unidirectional links (1 in-link, 1 out-link), common topologies are single rings or, for a larger number of nodes, 2D tori or rings of rings. With faster nodes and their improved single-node bandwidth (PCI 64 bit/66 MHz), the ring structures have become a significant bottleneck, especially for larger installations. Manufacturers have responded with small-scale switches (8/16 ports) or adapters with two or even three link connectors to form 2D/3D ring structures.


• the transaction layer uses split transactions (request/response) to access remote data. The standard defines a limit of 64 outstanding requests, but most implementations use a lower number due to buffer restrictions. SCI provides a global, physically distributed 64 bit address space. The upper 16 bits of an address are used as the node identifier.

• though coherency is one of the two defining features of SCI (besides scalability), it is only included as an option in the standard. Due to its complexity and need for additional resources (cache coherence engine, distributed directory), only a few DSM vendors have implemented it. Since most SCI adapters connect to I/O bus technology like PCI, cache coherence is not supported there.
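The address decomposition described in the list above (upper 16 bits as node identifier, which also explains the 64k-node limit) can be sketched directly:

```python
# SCI global 64 bit address: upper 16 bits select the node,
# lower 48 bits address memory within that node.
def split_sci_address(addr):
    node_id = addr >> 48                  # upper 16 bits: node identifier
    local = addr & ((1 << 48) - 1)        # lower 48 bits: node-local address
    return node_id, local

def make_sci_address(node_id, local):
    return (node_id << 48) | local

node, local = split_sci_address(make_sci_address(0x002A, 0xDEAD_BEEF))
assert node == 0x2A and local == 0xDEAD_BEEF
```

Sixteen bits of node identifier give exactly the 64k addressable nodes mentioned above.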

2.6.2 The Dolphin SCI adapter

Figure 2-10. Architecture of Dolphin’s SCI card [40]


The SCI standard was intended to specify an open interface to provide interoperability across multiple vendors, connectors and devices. But the complexity of the specification has led to implementations that concentrate on the necessary features and leave out others. This has resulted in a market with a few proprietary, incompatible solutions. Apart from a few vendors of DSM systems, like Data General's AViiON [41] (now part of EMC), Convex Exemplar [42] (now HP) and Sequent's NUMA-Q [43] (now IBM), the most widely used SCI SAN implementation is developed by Dolphin Interconnect Solutions1.

1. www.dolphinics.com


Figure 2-10 depicts the block diagram of the two main chips on the NIC, the PCI-SCI Bridge (PSB) and the Link Controller (LC). Both chips communicate via a proprietary B-Link bus. This decoupled architecture allows attaching multiple LC chips to the B-Link, separate development of future generations of both chips, and the use of LC chips in non-PCI environments. The main blocks of these chips are:

• the PCI Master/Slave Interface handles all PCI bus transactions. It also contains a DMA controller for direct memory-to-memory transfers.

• Read/Write Buffers contain slots for 128 byte data packets associated with every SCI transaction.

• the Protocol Engine manages all SCI transfers. Up to 16 read/write streams can be handled simultaneously.

• the Address Translation Cache (ATC) is used to map PCI to SCI addresses. It also contains some page attributes. The whole Address Translation Table (ATT) is located in separate SRAM on the adapter; an ATC miss triggers a reload of the referenced entry into the ATC.

• TX/RX Buffers on the LC inject/extract SCI packets into/out of the SCI link. Packets addressing other nodes are simply forwarded via a Bypass FIFO.
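The ATC/ATT interplay described in the list above can be sketched as follows; the cache capacity, eviction policy and entry layout are illustrative, not Dolphin's actual design:

```python
# Sketch of the ATC backed by the full ATT in adapter SRAM: an ATC miss
# reloads the referenced entry (structures and sizes are illustrative).
class AddressTranslationCache:
    def __init__(self, att, capacity=4):
        self.att = att                    # full translation table in adapter SRAM
        self.cache = {}                   # small on-chip cache
        self.capacity = capacity
        self.misses = 0

    def translate(self, pci_page):
        if pci_page not in self.cache:    # ATC miss: reload the entry from the ATT
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))   # evict oldest (FIFO-like)
            self.cache[pci_page] = self.att[pci_page]
        return self.cache[pci_page]       # SCI address plus page attributes

att = {page: {"sci_addr": 0x1000 + page, "writable": True} for page in range(16)}
atc = AddressTranslationCache(att)
assert atc.translate(3)["sci_addr"] == 0x1003   # first access misses...
assert atc.translate(3)["sci_addr"] == 0x1003   # ...the second one hits
assert atc.misses == 1
```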

The latest version of the PSB (PSB66) offers a 64 bit/66 MHz PCI interface, and the most recent LC chip (LC3) offers a link bandwidth of 667 Mbyte/s. Besides the normal CPU-initiated read/write transactions, some additional features have been implemented. A programmable DMA engine processes a linked list of control blocks specifying data to be sent to remote nodes. This frees the CPU in case of large message transfers for better throughput. A feature called mailbox triggers on specially tagged SCI packets, which are placed into a separate message pool. An interrupt is raised on reception of such packets to make the CPU aware of the received data. Since several concurrent write accesses can occur in a multi-CPU node, special write gathering is performed to forward data via the SCI link in larger blocks. A store barrier mechanism offers the possibility to flush gathered write data for separate pages.

2.6.3 Remarks

SCI has been used successfully as an interconnect in DSM systems, but is less efficient in a cluster environment due to certain architectural properties. Many components have been optimized for fine-grain communication patterns to support system-wide coherence. A relatively small packet size, unidirectional links for ring structures, slow remote read operations, etc. prevent the hardware from unfolding its full potential when used as a cluster interconnect. Studies [44] have shown that the sustained node-to-node bandwidth on SCI is more limited than in other networks.

Another bottleneck is the limited ability to form scalable networks. A popular topology for SCI is a 2D ring structure; e.g., the largest SCI Cluster in Europe at the Paderborn Center for Parallel Computing (PC2) uses a unidirectional 8x12 2D torus. But this also means that on a ring containing 12 nodes, all 12 nodes have to share the link bandwidth of 500 Mbyte/s. This is a drawback of SCI systems compared to other full-duplex, fully switched solutions like Myrinet or QsNet.

All this has led to a relatively low acceptance of SCI as a cluster interconnect. While other technologies are now used to build Clusters of a thousand or more nodes, SCI Clusters are rarely larger than 32 nodes. Vendors have responded with small-scale (6-8 port) switches, but they can only soften the bandwidth bottleneck. Compared to the SAN market, the manufacturers of SCI-based DSM systems have developed some efficient and fast systems. But all three mentioned before (Data General, Convex and Sequent) have gone out of business or have been taken over by other companies. Most product lines have been phased out.

2.7 ServerNet

In 1995, Tandem introduced one of the first commercially available implementations of a SAN, called ServerNet. Since its introduction, ServerNet equipment has been sold by Tandem (now owned by Compaq) for more than one billion dollars. In 1998, Tandem announced the availability of ServerNet II [45], the follow-up to the first version, which raises the bandwidth and adds new features while preserving full compatibility with ServerNet I.

With ServerNet, Tandem, a major computer manufacturer in the business area, addressed one of the main server problems: limited I/O bandwidth. Tandem's customers, mainly companies running large database applications, needed more I/O bandwidth to keep up with the growing data volumes their servers had to handle. So ServerNet was intended as a high bandwidth interconnect between processors and I/O devices, but quickly turned into a general purpose SAN.

2.7.1 Scalability and reliability

With scalable I/O bandwidth as the primary goal, ServerNet consists of two main components: endnodes with interfaces to the system bus or various I/O interfaces, and routers to connect all endnodes into one clustered system. One main design goal was the ability to transfer data directly between two I/O devices, thus relieving processors of plain data copy jobs. By being able to serve multiple simultaneous I/O transfers, ServerNet removes the I/O bottleneck and enables the construction of scalable clustered servers. Figure 2-11 shows a sample system configuration. Most ServerNet configurations, besides Tandem's own Himalaya series, use an I/O (PCI) adapter instead of attaching directly to the system bus.

Figure 2-11. A sample ServerNet network [46]


2.7.2 Link technology

ServerNet is a full-duplex, wormhole-switched network. The first implementation uses 9 bit parallel physical interfaces with LVDS/ECL signaling, running at 50 MHz. ServerNet II raises the physical bandwidth to 125 Mbyte/s, driving standard 8b/10b serializers/deserializers to connect to 1000BaseX (Gigabit Ethernet) standard cables. With the support of serial copper cables, ServerNet is able to span significantly longer distances. For compatibility reasons, ServerNet II components also implement the interface of the first version. Together with additional converter logic, ServerNet I and II components can be mixed within one system, enabling the customer to easily upgrade an existing cluster with components of the new generation without the need to replace ServerNet I components. Links operate asynchronously and avoid buffer overrun through periodic insertion of SKIP control symbols, which are dropped by the receiver. Special flow control symbols are exchanged between two link endnodes to ensure that data does not have to be dropped due to lack of buffer space.

2.7.3 Data transfer

The basic data transfer mechanism supported is a DMA-based remote memory read/write. An endnode can be instructed to read/write a data packet of up to 64 byte (512 byte in ServerNet II) from/to a remote memory location. The address of a packet consists of a 20 bit ID and a 32/64 bit address field. The ServerNet ID uniquely identifies an endnode and the route towards the destination.

The address can be viewed as a virtual ServerNet address. The lower 12 bits are the page offset, whereas the upper bits are an index into the Address Validation and Translation Table (AVT). Via this indirection, the receiver is able to check the read/write permissions of the sender, as depicted in Figure 2-12. To support communication models for which the destination of the message is not known in advance, the address can also specify one of several packet queues, to which data is then appended.
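The validation step can be sketched as follows; the field widths come from the text above, while the AVT entry layout and permission model are illustrative assumptions:

```python
# Sketch of ServerNet address handling: the 20 bit ID routes the packet, the AVT
# index is validated and translated at the receiver (table contents illustrative).
def parse_address(sn_id, addr):
    offset = addr & 0xFFF                 # lower 12 bits: page offset
    avt_index = addr >> 12                # upper bits index the AVT
    return sn_id, avt_index, offset

def resolve(avt, sender_id, avt_index, offset, write):
    entry = avt[avt_index]                # Address Validation and Translation Table
    if sender_id not in entry["allowed"] or (write and not entry["writable"]):
        raise PermissionError("AVT check failed")
    return entry["phys_page"] | offset    # local physical address

avt = {5: {"phys_page": 0x4000, "allowed": {7}, "writable": False}}
sn_id, idx, off = parse_address(sn_id=7, addr=(5 << 12) | 0x123)
assert resolve(avt, sn_id, idx, off, write=False) == 0x4123
try:
    resolve(avt, sn_id, idx, off, write=True)     # write to a read-only page
    assert False
except PermissionError:
    pass
```

The indirection is what gives the receiver, not the sender, the final say over which memory a remote node may touch.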

Figure 2-12. ServerNet address space [45]

ServerNet address format: | ServerNet ID (20 bit) | AVT index (20/52 bit) | offset (12 bit) |
The ServerNet ID indexes the routing table (routing decision); the AVT index passes through the address validation & translation table to yield a local physical address.

A main feature of ServerNet is its support for guaranteed, error-free, in-order delivery of data on various levels. On the link layer, a CRC check is done in each network stage to validate the correct reception of the message. Each link is checked through the periodic exchange of heartbeat control symbols. Each endpoint ensures correct transmission by sending acknowledgments back to the sender. In case of errors, the hardware invokes driver routines for error handling.

2.7.4 Switches

ServerNet I offers 6 port switches, which can be connected in an arbitrary topology. Router II, the next generation of ServerNet switches, raises the number of ports to 12. Input and output ports contain FIFOs to buffer a certain amount of data and are connected through a 13x13 crossbar. The additional port is used to inject or extract control packets. Each Router offers a JTAG and processor interface for debug or management services. One special feature of ServerNet switches is the ability to form so-called Fat Pipes: several physical links can be used to form one logical link connecting two identical link endpoints. The switches can then be configured to dynamically choose one of the links, which leads to better link utilization under heavy load.

2.7.5 Software

The good reliability of the ServerNet hardware makes it possible to implement low-overhead protocol layers and driver software. Tandem clusters run the UNIX and WindowsNT operating systems. With its packet queues, the second generation of this SAN introduces a mechanism to efficiently support the message passing model of the Virtual Interface Architecture (VIA)1 [47], a message layer specification for cluster networks.

To provide an easy way of managing the network, a special class of packets called In-Band Control (IBC) packets is defined. These packets use the same links as normal data packets, but are interpreted by an 8 bit microcontroller. The IBC protocol is responsible for initialization, isolation of faulty nodes and several other management issues. IBC packets are used to gather status data from, or scatter control data to, all ServerNet components.

2.7.6 Remarks

Though it is hard to find detailed performance numbers, ServerNet technology appears to be a very reliable and, with its second generation, also high-performance SAN. ServerNet focuses on the business/server market and has so far found little acceptance among researchers in the area of technical computing, though it would be interesting to see the performance of message passing libraries such as MPI and PVM on it.

ServerNet implements many properties that are extremely useful for cluster computing: error handling on various levels, a protection scheme (AVT), standard physical layers (1000BaseX cables) and support for network management (IBC). But despite its commercial success, Compaq has lately announced the discontinuation of ServerNet in favor of the upcoming InfiniBand interconnect. Whether the ServerNet technology will be developed further to fit the InfiniBand requirements, or abandoned in favor of external network technology, is not clear yet.

1. www.viarch.org


2.8 Myrinet

Myrinet [48] is a SAN evolved from supercomputer technology and the main product of Myricom1, a company founded in 1994. It has become quite popular in the research community, resulting in 150 installations of various sizes through June 1997. Today thousands of clusters are equipped with the Myrinet network, with some really large installations of 256+ nodes. A major key to its success is the fact that all hardware and software specifications are open and public.

The Myrinet technology is based on two earlier research projects, namely Mosaic and Atomic LAN, by Caltech and USC research groups. Mosaic was a fine-grain supercomputer, which needed a truly scalable interconnection network with lots of bandwidth. The Atomic LAN project was based on Mosaic technology and can be regarded as a research prototype of Myrinet, implementing the major features such as network mapping and address-to-route translation, however with some limitations: short distances (1 m) and a topology (1D chains) not very suitable for larger systems. Eventually, members of both groups founded Myricom to bring their SAN technology into commercial business.

2.8.1 NIC architecture

Figure 2-13. Architecture of the latest Myrinet-2000 fiber NIC [49]


Regarding the link and packet layer, Myrinet is very similar to ServerNet (or vice versa). They differ considerably in the design of the host interface. A Myrinet host interface consists of two major components: the LANai chip and its associated SRAM memory. The LANai is a custom VLSI chip that controls the data transfer between the host and the network. Its main component is a programmable microcontroller, which controls DMA engines responsible for the data transfers host to/from onboard memory and

1. www.myri.com

memory to/from network. So message data must first be written to the NI SRAM before it can be injected into the network. This intermediate buffering adds latency that grows with message size. The SRAM also stores the Myrinet Control Program (MCP) and several job queues. A recent improvement is the upgrade of the link from a byte-parallel copper-based implementation to a serial optical fiber. The basic architecture is depicted in Figure 2-13.

More than other SAN developers, Myricom has continuously improved the architecture and the hardware components of the Myrinet network:

• the first version of the NIC was based on Sun's SBus. But with the broad adoption of PCI and PCs replacing RISC workstations as the main node systems, a PCI-based NIC was developed. The first version implemented a 32 bit/33 MHz PCI interface; a later version upgraded to 64 bit/66 MHz.

• the LANai chip started at 33 MHz; the latest version (v9) runs at 133/200 MHz.

• on-board SRAM was steadily enlarged, from 512 Kbyte up to 8 Mbyte on the latest NIC versions. This was necessary to satisfy the need for more buffer space on the NIC to store larger MCPs, provide more space for message data, etc.

• the first links were full-duplex byte-parallel links running at 1.28 Gbit/s in each direction over copper cables of up to 10 m. With Myrinet-2000, a serial version of the copper-based links was introduced, running at 2 Gbit/s.

2.8.2 Transport layer and switches

Data packets can be of any length and are forwarded using cut-through switching. They consist of a routing header, a type field, the payload and a trailing CRC. Myrinet uses wormhole routing. When a packet enters a switch, the first header byte encodes the outgoing port. The switch strips off the leading byte and forwards the remaining part of the packet to the appropriate output port. When the packet enters its destination host interface, the routing header has been completely consumed and the type field leads the message. Special control symbols (STOP, GO) are used to implement reverse flow control.
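The header-stripping behavior of this source routing can be sketched directly; the port numbers, type field value and payload are illustrative:

```python
# Sketch of Myrinet-style source routing: each switch consumes the leading
# header byte to pick its output port (fabric and values are illustrative).
def switch(packet):
    """Strip the leading routing byte; return (output_port, remaining_packet)."""
    return packet[0], packet[1:]

def deliver(packet, hops):
    """Forward a packet through `hops` switch stages."""
    for _ in range(hops):
        port, packet = switch(packet)
    return packet                          # routing header fully consumed

# Route over two switches (ports 3, then 1); then type field + payload + CRC:
packet = [3, 1, 0x50, ord("h"), ord("i"), 0xAB]
at_host = deliver(packet, hops=2)
assert at_host[0] == 0x50                  # the type field now leads the message
assert at_host[1:-1] == [ord("h"), ord("i")]
```

Because each stage only inspects one byte, switches stay simple and need no routing tables; the sender alone determines the path.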

On the link level, the trailing CRC is computed in each network stage and substituted for the previous one. A packet with a nonzero CRC entering a host interface thus indicates a transmission error. MTBF (Mean Time Between Failures) values of several million hours are reported for switches and interfaces. On detection of cable faults or node failure, alternative routes are computed by the LANai. To prevent deadlocks from long-term blocked messages, time-outs generate a forward reset (FRES) signal, which causes the blocking stage to reset itself.
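The per-stage CRC substitution can be sketched as follows; a trivial XOR checksum stands in for Myrinet's real CRC polynomial, and the marking of corrupted packets is reduced to an error flag:

```python
# Sketch of per-stage CRC handling: each stage checks the incoming trailing CRC
# and substitutes a freshly computed one, so an error is localized to the faulty
# link. A simple XOR checksum stands in for the real CRC.
def crc8(payload):
    c = 0
    for byte in payload:
        c ^= byte
    return c

def link_transmit(packet, flip_bit=False):
    """Model the wire; optionally corrupt one bit of the payload."""
    packet = packet[:]
    if flip_bit:
        packet[0] ^= 0x01
    return packet

def stage_check(packet):
    """Check the trailing CRC, then substitute a recomputed one."""
    payload, crc = packet[:-1], packet[-1]
    error = crc8(payload) != crc
    return payload + [crc8(payload)], error

packet = [1, 2, 3] + [crc8([1, 2, 3])]
packet, err = stage_check(link_transmit(packet))
assert not err                                  # clean hop passes the check
packet, err = stage_check(link_transmit(packet, flip_bit=True))
assert err                                      # corruption detected at the next stage
packet, err = stage_check(link_transmit(packet))
assert not err                                  # fresh CRC: the error stays localized
```

Because every stage recomputes the CRC, a mismatch pinpoints the single faulty link rather than the whole path.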

The latest Myrinet switch technology [50] is built around a single crossbar chip with 16 ports, called XBar16. A rack-mountable line card equipped with one XBar16 offers 8 front panel ports for connecting nodes; the other 8 ports connect to a backplane interface. Multiple line cards can be inserted into racks of different sizes. A backplane is mounted with 8 XBar16 chips, forming the spine of a Myrinet network. The preferred topology for a Myrinet network is the Clos network. Compared to other popular network topologies like 2D/3D tori/grids, hypercubes, fat trees, etc., a Clos network offers full bisection bandwidth, full rearrangeability, good scaling and multi-path redundancy. It is the preferred topology for the latest large-scale Myrinet installations with 256 and more nodes. Figure 2-14 shows a sample configuration for a 128 node Clos network.

Figure 2-14. A Clos network with 128 nodes [50]

2.8.3 Software and performance

As mentioned before, all Myrinet specifications are open and public. The device driver code and the MCP are distributed as source code to serve as documentation and as a base for porting new protocol layers onto Myrinet. This has motivated many research groups to implement their own message layers and is one of the main reasons for the popularity of Myrinet. Device drivers are available for Linux, Solaris, WindowsNT, DEC Unix, Irix and VxWorks on Pentium (Pro), Sparc, Alpha, MIPS and PowerPC processors. A patched GNU C compiler is available to develop MCP programs.


The performance of the Myrinet network is highly dependent on the software layer used to access the Myrinet hardware. Quite a number of software layers have been implemented, e.g. Active Messages [51], Fast Messages [52], BIP [53], Parastation [54] and many others. The poor quality of early Myrinet software was one issue leading to the development of many external implementations.

Over the last two years, Myricom has developed the GM message layer [55], a quite stable and fast software layer, which is now broadly used for new installations. The second message layer with a significant user base is SCore [56] from the Japanese Real World Computing Partnership1 (RWCP). Table 2-3 summarizes basic performance numbers for these two layers.

Table 2-3. Performance of GM and SCore over Myrinet [55], [56]

                                 GM (a)         SCore (b)
  sustained bandwidth            245 Mbyte/s    146 Mbyte/s
  message size with 50% BWmax    900 Byte       600 Byte
  one-way latency                7 us           13.3 us

a. 64 bit/66 MHz PCI, Myrinet-2000, fiber, LANai 9 at 200 MHz
b. 64 bit/33 MHz PCI, Myrinet-SAN, copper, LANai 7 at 66 MHz

2.8.4 Remarks

The great flexibility of the hardware due to the programmable LANai microcontroller is one of the major advantages of Myrinet. It has attracted a lot of attention from the research community and fueled the implementation of many message layers on top of the Myrinet network. Another reason for its success is Myricom's policy of continuous small improvements to hardware and software.

Bottlenecks like slow onboard SRAM or LANai chips have been removed, and early versions of low-performance software have been replaced. Regarding market share, Myrinet seems to dominate the SAN market right now. Having shipped more than 5,000 NICs and almost 10,000 switch ports in 1Q/2001, Myrinet is the network to beat in this area. And with switch technology scaling to 1,000 nodes and more, it is well prepared for future teraflop Cluster installations.

1. pdswww.rwcp.or.jp


2.9 QsNet

QsNet [57] is a SAN developed by Quadrics Supercomputers World Ltd1. Similar to other SANs, QsNet has its roots in traditional supercomputer technology, since Quadrics emerged from the well-known supercomputer manufacturer Meiko Ltd. Influences of their cache-only machines are still visible in the QsNet architecture. QsNet is the only SAN with a seamless integration of the network interface into the node's memory system. This unique feature of a globally shared, virtual memory space is made possible by address translation and mapping hardware integrated directly into the NIC.

2.9.1 NIC architecture

Figure 2-15. Block diagram of the Elan-3 ASIC [57]


The third generation of the Elan ASIC is the key component of the QsNet NIC. Its architecture is depicted in Figure 2-15. In the following, a brief description of the functional units is given:

• a 64 bit/66 MHz PCI interface is used for communication with the host.

1. www.quadrics.com


• full-duplex 10 bit LVDS links connect the NIC via copper cables to the network at a rate of 400 Mbyte/s per direction.

• a 32 bit microcode processor supporting up to four concurrent threads. The threads have different tasks: controlling the inputter, setting up the DMA engine, scheduling threads, and communicating with the host.

• a 32 bit thread processor, which offloads processing of higher level library tasks from the host CPU.

• a Memory Management Unit (MMU) with a Translation Look-Aside Buffer (TLB) to do table walks and translate virtual into physical addresses.

• a 64 bit SDRAM interface to connect to 64 Mbyte of external SDRAM, together with an 8 Kbyte on-chip four-way set-associative memory cache.

2.9.2 Switches and topology

Besides the NIC, two different switches (16 and 128 ports) are used to connect QsNet nodes into a fat-tree network. The basic building block is a line card with 8 Elite-3 switch chips. Each Elite-3 chip is an 8-port full-duplex switch with two virtual channels per input link. Multiple line cards are then used to construct a full-bisection, multi-route fat-tree network.

Source-path routing is used to deliver network packets to their destination. The sender attaches a sequence of routing tags to the head of a message. Each network stage interprets the first routing tag, removes it and forwards the message towards its destination. Special tags are used to support a broadcast function, which can be utilized to send a message simultaneously to all remote nodes, or even a group of distinct nodes. At the link level, all network traffic is pipelined in a wormhole manner, with an end-to-end acknowledgment of packets. In case of transmission errors, the sending NIC retries the transmission automatically without intervention from the host side.
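The per-stage tag handling described above can be sketched in a few lines of Python (an illustration only, not Quadrics code; all names are invented):

```python
def next_hop(packet):
    """One switch stage: consume the leading routing tag, return (port, rest)."""
    tags, payload = packet
    out_port, remaining = tags[0], tags[1:]   # interpret and remove first tag
    return out_port, (remaining, payload)

# A 3-stage path: the sender precomputed one output port per stage.
packet = ([2, 5, 1], b"data")
ports = []
while packet[0]:                              # tags left -> not yet at destination
    port, packet = next_hop(packet)
    ports.append(port)
```

After traversal, the tag list is empty and only the payload reaches the destination, which is why each sender must know the full path in advance.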

2.9.3 Programming interface and performance

Figure 2-16 shows the overall structure of the programming interface for a QsNet network. A layer called Elan3lib directly interacts with the hardware. Kernel routines are mainly used for initialization tasks, like mapping parts of a process' local address space into a globally shared virtual address space. The Elan3lib supports a programming model with cooperating processes. The main functions are used to map/allocate memory and to set up remote DMA transfers. Processes communicate mostly via events, e.g. to synchronize the host process with a thread running on the Elan chip. The Elanlib is a higher layer, hiding all the hardware- and revision-dependent details. It offers a point-to-point message passing model with the use of tags to filter messages at the receiving side. It supports both synchronous and asynchronous message delivery.

Figure 2-16. Elan programming libraries [57]
(The figure shows the layering: user applications on top of shmem, MPI and tport, which build on Elanlib; Elanlib and Elan3lib reside in user space, with system calls into the Elan kernel routines in kernel space.)

The unique feature of QsNet is its ability to directly send data from a process' virtual address space without any intermediate copying. The MMU inside the Elan-3 chip is synchronized to the MMU of the host CPU, or more exactly, MMU tables are kept consistent between the QsNet NIC and the OS kernel running on the host node. That way, a user process can call a send routine of the Elanlib with a virtual address, and the Elan-3 MMU translates the virtual into a physical address, either in main memory or in the SDRAM on the QsNet adapter. This is made possible by extending the OS kernel with functions to ensure consistency of MMU tables. Special memory allocation functions of the Elanlib offer the possibility to map portions of the on-board SDRAM into user processes, e.g. to give the NIC fastest access to DMA descriptor tables.

Table 2-4. Performance of QsNeta [57]

                                Elan3lib       MPI
sustained bandwidth             335 Mbyte/s    307 Mbyte/s
message size with 50 % BWmax    900 Byte       3 Kbyte
one-way latency                 2.4 us         5.0 us

a. Dual 733 MHz Pentium III, Serverworks HE chipset, Linux 2.4

QsNet currently leads all SANs in performance, due to its advanced hardware support of message passing primitives. Its unique ability to communicate directly between virtual address spaces without intermediate copies removes a lot of processing overhead. Advanced features like a hardware broadcast boost the performance of collective operations, like data multicast or synchronization barriers. This is especially true for large-scale clusters with a performance of several teraflops. To make effective use of these features, the OS kernel has to be patched, and libraries like MPI have to be highly optimized. Table 2-4 displays the main performance numbers, both for the Elan3lib and MPI.

2.9.4 Remarks

Though it offers the best performance of today's SANs, QsNet has not attracted the same level of attention as e.g. Myrinet. The reason is a relatively high price, in the order of three times the price of competitive solutions. A significant part of the costs is due to the large amount of SDRAM on the NIC (64 Mbyte). Quadrics has a strong relationship with the High Performance Computing division of Compaq. QsNet is the preferred interconnect for their Alpha SC series [58], based on SMP nodes with multiple Alpha CPUs. Though only a few cluster installations exist, these are quite impressive. The latest one is the Terascale Computing System at the Pittsburgh Supercomputing Center (PSC). A cluster of 750 quad-processor Compaq AlphaServer ES45s, each node equipped with two QsNet adapters, delivers a peak performance of 6 teraflops. At the date of installation (October 2001), this system, named ‘Le Mieux’, was the most powerful supercomputer dedicated to unclassified research.

2.10 IBM SP Switch2

Though a proprietary network not intended for the PC cluster market, the Switch2 interconnect [59] from IBM is very similar in its architecture and use to more general SAN solutions like Myrinet or QsNet. An intelligent network adapter, driven by an embedded microcontroller, sends/receives data to/from the network. The first generation of SP Switch technology has been used to interconnect IBM RS/6000 machines since the mid-1990s. But at the end of the decade it was outdated, with its 150 Mbyte/s link bandwidth and slow on-board logic.

To remain one of the top HPC manufacturers, IBM developed, with the second generation of SP Switch technology, an interconnect able to sustain the performance of SP clusters. Switch2 is a key component for IBM's RS/6000 SP parallel machines. These machines are clusters of high-end SMP workstations. Several terascale systems are IBM SP machines, among them the most powerful supercomputer today, the ASCI White1 machine. With its 8,192 CPUs in total, the machine is capable of delivering 12.3 teraflops, twice as much computing power as the second fastest machine (Le Mieux).

1. www.llnl.gov/asci


Though, due to its proprietary interface, not usable for general cluster computing, its technology is quite interesting and is therefore briefly presented here.

2.10.1 NIC architecture

Figure 2-17. Block diagram of the Switch2 node adapter [59]
(The figure partitions the adapter into a high-speed switch interface around the STI and TBIC3 chips, a data segmentation and reassembly region with RDRAM buffers and the Memory Interface Chip (MIC), a microcode region around the PowerPC 740 and its SRAM, and a node operations interface with the node bus adapter (NBA) and the bus connector.)

Figure 2-17 depicts the top-level architecture of the SP Switch2 host adapter. As shown in the figure, one can partition the network adapter into four regions: a high-speed switch interface, a module for data segmentation and reassembly, one for running microcode, and one region to interface to the node system. The main components of these regions are:

• a node bus adapter (NBA) controls the communication with the host via a 16 byte, 125 MHz 6XX bus connector. The 6XX bus is the main system bus of the node, directly connecting the Switch2 adapter to the CPUs and memory modules. CPUs can issue load/store instructions to access the adapter.

• a Self Timed Interface (STI) chip connects the adapter via a byte-parallel link to the network. A link is a full-duplex connection driving differential signals at 500 MHz. The link can either be an on-board connection of a few inches, or a copper cable of up to 10 meters. The TBIC3 chip is an interface controller, connecting the STI to the on-board RAM and the PowerPC 740 microprocessor. It contains hardware to offload packet reassembly and segmentation from the main CPU.


• 16 Mbyte of fast Rambus RDRAM is located on the adapter to provide sufficient buffer space for message data. A Memory Interface Chip (MIC) controls the RDRAM and parallelizes multiple accesses to the RDRAM from both the NBA and the TBIC3.

• a PowerPC 740 microprocessor is used to control all on-board components via microcode. It is responsible for packet header generation, making routing decisions, handling error conditions and communicating with the host. Its microcode program is stored in 4 Mbyte SRAM, along with some other status/control information.

The node architecture is highly decoupled to allow several data transmissions to occur at the same time. While the host CPU is transferring new message data to the RDRAM, the PPC 740 may read header information from the RDRAM, and the TBIC3 is forwarding a message from RDRAM to the STI. All those datapaths provide a bandwidth of at least 1 Gbyte/s, with an aggregate bandwidth of 2.4 Gbyte/s to the RDRAM.

The general purpose PowerPC 740 microprocessor offloads a lot of tasks from the host CPU. It basically provides a low-level message passing interface to the host CPU. Higher software layers like MPI or IP then build up on these routines. To send a message, the host simply writes a work ticket into a job queue residing in the SRAM. The PPC 740 monitors this queue, and then sets up the data transfer via DMA from the NBA into RDRAM. After data is completely written to RDRAM, the microprocessor sets up header and data information for the TBIC3, which then forwards data autonomously to the STI. Message data is segmented into 1 Kbyte units. On the receiving side, the TBIC3 forwards incoming header information to the PPC 740 for further investigation. The CPU then decides where to place message data and communicates this information to the TBIC3. The TBIC3 sets up DMA transfers to write message data into the RDRAM. Upon completion, it notifies the microprocessor, which then instructs the NBA to write the message out to the host's main memory.
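The send path described above can be illustrated with a hedged Python model (the queue layout and all names are invented for illustration; only the work-ticket handshake and the 1 Kbyte segment size come from the text):

```python
from collections import deque

SEG = 1024                   # message data is segmented into 1 Kbyte units

job_queue = deque()          # stands in for the ticket queue in on-board SRAM
link_out = []                # stands in for segments handed to the TBIC3/STI

def host_send(data):
    """The host only writes a work ticket; the NIC does the rest."""
    job_queue.append({"buf": data})

def nic_service():
    """Stand-in for the PPC 740: drain tickets, segment, forward."""
    while job_queue:
        ticket = job_queue.popleft()
        data = ticket["buf"]                 # simulated DMA from host memory
        for off in range(0, len(data), SEG):
            link_out.append(data[off:off + SEG])

host_send(b"x" * 2500)
nic_service()
```

A 2500 byte message thus leaves the adapter as two full 1024 byte segments plus a 452 byte tail, with the host CPU involved only in the initial ticket write.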

In addition to the message transfers, the Switch2 offers some advanced features. One is the generation of a Time Of Day (TOD) signal. This signal is used to synchronize all components of the network to a master TOD, even with a compensation of cable delays. This way, synchronization of the whole network can be maintained to within about 1 us.

2.10.2 Network switches

SP machines are mostly connected in a bidirectional multistage interconnection network (BMIN) topology. Similar to the fat tree network of QsNet, which is a BMIN, they offer high scalability, multi-route paths and reward communication locality. SP switches are 32-port switches, where 16 ports are normally used to connect to nodes, while the other 16 ports are used to interconnect with other switches. Such a switch contains 8 interconnected 8-port Switch3 [60] chips.

SP switches use source-path wormhole routing with a credit-based flow control scheme to prevent buffer overflow. Each Switch3 chip contains input/output ports, together with a large central buffer queue. This 8 Kbyte central buffer is used to store message data in large chunks in case its targeted output port is blocked by another message in transit. This mechanism reduces head-of-line blocking, where a blocked message occupies several network stages and prevents all other messages on this path from advancing.
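A minimal sketch of such a credit-based flow control scheme, assuming one credit per receive-buffer slot (an illustration, not IBM's actual protocol):

```python
class CreditLink:
    """Sender may only transmit while it holds credits, so the
    receive buffer can never overflow."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # one credit per receive-buffer slot
        self.rx_buffer = []

    def send(self, flit):
        if self.credits == 0:
            return False              # must stall: receiver has no free slot
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def consume(self):
        flit = self.rx_buffer.pop(0)  # receiver drains one slot...
        self.credits += 1             # ...and returns a credit to the sender
        return flit

link = CreditLink(buffer_slots=2)
assert link.send("a") and link.send("b")
assert not link.send("c")             # blocked: buffer full, no credits left
link.consume()                        # frees one slot, returns one credit
assert link.send("c")                 # sender may proceed again
```

The invariant is that credits held plus flits buffered always equals the buffer size, which is what makes overflow impossible by construction.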

Each output port has two output queues assigned, one low- and one high-priority queue. High-priority packets may bypass other messages for faster delivery. Two recent additions in routing enable better performance for an SP network. First, a restricted form of adaptive routing can be used to let switches determine the fastest network path. This feature is especially helpful under heavy workloads to utilize the network in the most efficient manner. Second, multicast routing is provided by enabling multiple output ports to read the same packet from the central buffer queue. The output ports can be specified either via a table lookup or via encoding of routing bytes.

2.10.3 Remarks

The new generation of IBM's SP Switch technology offers enough performance to let clusters of RS/6000 SP SMP nodes compete in the HPC market. With a link bandwidth of 500 Mbyte/s and a scalable network offering advanced features like multicast or adaptive routing, SP machines are especially well suited for large terascale installations. Three out of currently six large teraflop machines of the Accelerated Strategic Computing Initiative (ASCI) are IBM SP systems.

Up to 350 Mbyte/s sustained bandwidth has been measured for MPI applications. The relatively high latency of 17 us can be attributed to the highly decoupled architecture of the node adapter. The 6XX bus interface is both an advantage and a drawback. The adapter can directly access main memory, without an I/O bridge in between. But the use of the Switch2 NICs is limited to two versions of IBM's POWER3 SMP nodes. For better throughput in SMP nodes with up to 8 CPUs, each node board offers two slots, so one can attach two Switch2 adapters to each node.

Features like multicast and adaptive routing have not been implemented with such a level of hardware support in any other SAN yet. It will be interesting to see if and how other SAN manufacturers will adopt them.


2.11 Infiniband

I/O bandwidth is more and more becoming a limited resource in today's server systems. To overcome this bottleneck, a lot of different I/O technologies have been developed for various application areas. Technologies like PCI, USB, AGP, SCSI, IDE, Firewire, Ethernet, etc. are used to connect several device classes to a system, among them networking, storage, input/output and graphics. Most of them are outmoded shared-bus architectures, with poor scalability and serious bandwidth limitations. Furthermore, most of them need a significant amount of CPU intervention, eating up a lot of the performance benefit from faster microprocessors and memory modules.

To overcome the architectural limitations of bus-based approaches, several major computer vendors have worked towards a new standard for I/O connectivity. Future I/O and Next Generation I/O (NGIO) were two competing solutions from different vendors. The need for a unified I/O technology finally led to the formation of the InfiniBand Trade Association1 (IBTA) in 1999 through a merger of those two forums. Its goal is to provide a unified platform for server-to-server and I/O connectivity, based on a message-based fabric network. In October 2000, the organization released the first version of the InfiniBand specification [61], a three volume set of documents describing the architecture, configuration and use of an InfiniBand fabric.

Companies like IBM [62] and Intel are heavily pushing the development of IB hardware. First products are available now, but are mainly used for software development and prototype demonstrations. Given its target of replacing different technologies, it is expected that IB products will first enter the market in one or two main areas and will spread to other application areas over time. First installations might appear in the storage area, where the demand for bandwidth is extremely high, or in high-end servers for enterprise-class business applications, like databases, transaction systems or webservers. It is then expected that the technology moves down into the PC/workstation mass market and turns into the unified I/O technology it is said to be.

2.11.1 Architecture

Figure 2-18 depicts the general architecture of an IB network. It contains four building blocks: Host Channel Adapters (HCA), Target Channel Adapters (TCA), switches and routers. An HCA is an active network interface, similar to a SAN NIC. It interacts with the host to generate/consume network packets. The HCA provides support for DMA-based data transfer, memory protection and address translation, and multiple concurrent

1. www.infinibandta.org

accesses to the network from several processes. A typical HCA location is inside a server, where it interfaces to the system bus via a memory controller to provide fast access to the IB network. The TCA is a reduced, passive version of the HCA, mainly intended to serve as the IB interface for I/O devices like disks, graphics, etc. Switches connect local components into an IB subnet, whereas routers connect the subnet to a larger global InfiniBand network.

Figure 2-18. The InfiniBand architecture [61]

The foundation of communication in an InfiniBand fabric is the ability to queue up a set of jobs that hardware executes. This is done via Queue Pairs (QP), one for send and one for receive operations. User applications place work requests in appropriate queues, which are then processed by the hardware. On completion, the hardware acknowledges the finished job via a completion queue.

Applications can set up multiple QPs, each one independent from the others. A send job specifies the local data to be sent, and can include the remote address where to place the data. On the receiving side, a job specifies where to place incoming data. Most of the communication mechanisms for IB have been adopted from the Virtual Interface Architecture (VIA), a previous standardization effort for communication within clusters.
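The queue-pair model described above can be sketched as follows (a simplified illustration; real verbs-style work requests carry many more attributes than shown here):

```python
from collections import deque

class QueuePair:
    """One send queue, one receive queue, plus a completion queue
    through which the 'hardware' acknowledges finished jobs."""

    def __init__(self):
        self.send_q = deque()
        self.recv_q = deque()
        self.completion_q = deque()

    def post_send(self, local_buf):
        self.send_q.append(local_buf)     # work request: data to transmit

    def post_recv(self, sink):
        self.recv_q.append(sink)          # work request: where to place data

def hw_process(src_qp, dst_qp):
    """Stand-in for the NIC: move one message, then signal completion."""
    data = src_qp.send_q.popleft()
    sink = dst_qp.recv_q.popleft()        # receive job says where data goes
    sink.extend(data)
    src_qp.completion_q.append("send done")
    dst_qp.completion_q.append("recv done")

a, b = QueuePair(), QueuePair()
buf = []
b.post_recv(buf)                          # receiver pre-posts a buffer
a.post_send([1, 2, 3])
hw_process(a, b)
```

Note that the receiver must post its buffer before the message arrives; the application never touches the data path itself, it only polls the completion queues.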


InfiniBand supports connection-oriented and datagram communication. A connected service establishes a one-to-one relationship between a local and a remote QP. A datagram QP is not tied to a single remote consumer.

2.11.2 Protocol stack

The specification separates IB into several layers, as shown in Figure 2-19: transport, network, link and physical layer. This layered approach helps to hide implementation details between layers, which use a fixed service interface to build on each other.

Figure 2-19. InfiniBand layered architecture [61]
(The figure shows an end node with the application on top of the transport, network, link and physical layers, signaling to the IB subnet.)

Physical layer

The physical layer specifies how single bits are put on the wire to form symbols. It defines control symbols used for framing (start/end of packet), data symbols and fillers (idles). A protocol defines correct packet formats, e.g. the alignment of framing symbols or the length of packet sections. The physical layer is responsible for establishing a physical link when possible, informing the link layer whether the link is up or down, monitoring the status of the link, and passing data and control bytes between the link layer and the remote link endpoint.

Link layer

The link layer describes the packet format and protocols for packet operation. This includes flow control and routing of packets within a subnet. It basically defines two types of packets: link management and data packets. Link management packets are exchanged between the two link layers on a connection, and are used to train and maintain link operation. They negotiate operational parameters such as bit rate, link width, etc. They are also used to convey flow control credits. Data packets carry a payload of up to 4 Kbyte. A concept called Virtual Lanes (VL) is used to multiplex a single physical link between several logical links. Up to 16 different VLs may be implemented, but only VL0 (data) and VL15 (management) are required.

Figure 2-20. InfiniBand data packet format [61]

Figure 2-20 depicts the IB data packet format and which layer utilizes which part of the packet to encapsulate needed information. The Local Route Header (LRH) specifies source and destination within a subnet, and the VL to use. The payload of a packet can be 0-4 Kbyte in size. Each packet is completed by two CRC words: an Invariant CRC (ICRC) and a Variant CRC (VCRC). The ICRC covers all fields which should not change during a transmission, while the VCRC covers all fields. This combination allows switches and routers to modify header fields and still maintain end-to-end data integrity.
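The effect of splitting the checksum into an invariant and a variant part can be illustrated with a small Python sketch (zlib.crc32 stands in for both polynomials here; the actual specification uses CRC-32 for the ICRC and CRC-16 for the VCRC, and the field layout below is simplified):

```python
import zlib

def icrc(packet):
    # invariant CRC: covers only fields that never change in flight
    return zlib.crc32(packet["grh"] + packet["bth"] + packet["payload"])

def vcrc(packet):
    # variant CRC: covers all fields, recomputed hop by hop
    return zlib.crc32(packet["lrh"] + packet["grh"] +
                      packet["bth"] + packet["payload"])

pkt = {"lrh": b"LRH0", "grh": b"GRH", "bth": b"BTH", "payload": b"data"}
i0 = icrc(pkt)
v0 = vcrc(pkt)

pkt["lrh"] = b"LRH1"        # a switch rewrites the local route header...
assert icrc(pkt) == i0      # ...end-to-end integrity of invariant fields holds
assert vcrc(pkt) != v0      # ...but the variant CRC must be recomputed
```

This is exactly why a router may legally modify the LRH: it only has to regenerate the VCRC, while the receiver can still verify the untouched fields against the sender's ICRC.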

Network layer

The network layer implements the protocol to route packets between subnets. It uses the Global Route Header (GRH) to identify source and destination ports across multiple subnets in the format of an IPv6 address. The GRH is interpreted by routers, which may modify the LRH and GRH to forward a packet towards its destination.


Transport layer

The transport layer is responsible for delivering a packet to the proper QP and for instructing the QP on how to process the payload. Segmentation of messages larger than a Maximum Transfer Unit (MTU) is also a task of this layer. It utilizes two fields of the packet header to accomplish its job. The Base Transport Header (BTH) specifies the destination QP, the packet sequence number and a packet type (send, remote DMA write, remote DMA read, atomic). The sequence number is used by reliable connections to detect lost packets. The Extended Transport Header (ETH) gives type-specific information, like remote addresses, total message length, etc. An Immediate Data (IData) word of 4 byte can be attached to the packet. This word is placed into the completion queue entry on the receiving side, allowing single data words to be delivered without transferring payload into DMA areas.

A Software Transport Interface is defined, describing how to configure, access and operate IB communication structures. Additional management services are defined to provide an interface for network-wide configuration and administration of IB components.

2.11.3 Remarks

The aggressive goal of InfiniBand to completely take over the whole I/O and server connectivity market is a challenging task. It adopts a lot of mechanisms and technologies from current SANs and tries to apply them to all forms of I/O. Whether it will succeed or fail is also highly dependent on the level of cooperation between the leading industry companies backing IB. The global approach of InfiniBand stands in contrast to possible optimizations for a specific application area, such as cluster computing. E.g., the header overhead of an IB packet is quite large: up to 106 byte. This means that in applications with a fine-grain communication pattern, the protocol overhead consumes a significant portion of the physical bandwidth.
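A quick back-of-the-envelope calculation, using the 106 byte worst-case overhead quoted above, shows how strongly payload size governs the usable fraction of the physical bandwidth:

```python
OVERHEAD = 106  # worst-case IB per-packet header/CRC overhead in bytes

def efficiency(payload_bytes):
    """Fraction of the wire bandwidth that carries actual payload."""
    return payload_bytes / (payload_bytes + OVERHEAD)

# fine-grain traffic: a 64 byte message uses well under half the link
small = efficiency(64)      # 64 / 170  ~ 0.38
# bulk traffic at the full 4 Kbyte MTU is nearly free of overhead
large = efficiency(4096)    # 4096 / 4202 ~ 0.97
```

So a fine-grain message pattern sees roughly 38 % of the raw bandwidth, while MTU-sized packets reach about 97 %, which is the point the paragraph above makes.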

On the other hand, competing solutions can try to present an InfiniBand software interface to applications, while breaking down IB communication structures onto more efficient hardware. At this point, one cannot say if it will really become the general purpose interconnect, or end up as a replacement for SCSI and Fibre Channel in the high-end storage market.



3 The ATOLL System Area Network

The idea to develop a new System Area Network was driven by the need for a high performance cluster interconnect that would reduce costs to a minimum and therefore be able to replace Ethernet as the most commonly used cluster network. Cost reduction goes hand in hand with the limited opportunities of a small group of researchers to design and implement such a complex network.

So a large scale integration of all components was a major factor guiding the process of specification and design of a new SAN. This led to the idea to break with the traditional partitioning of a network into node interfaces and switches connecting them together. The result is a combined interface/switch device, which serves as a single basic building block for a new generation of SANs.

3.1 A new SAN architecture: ATOLL

The main idea behind this approach was formulated earlier within another context. The first version of ATOLL [63] was designed as a system component for a massively parallel architecture called PowerMANNA [64]. The node design consisted of a quad-CPU board, equipped with PowerPC 620 microprocessors. Besides several memory banks, an ATOLL chip also connects to the system bus to provide a low latency, high bandwidth network connecting all nodes within a system chassis in a 2D grid. The ATOLL chip includes four Bus Master Engines (BME) to give each node CPU exclusive access to the network. Special instructions provide the ability to start communication jobs in an atomic way. This, together with the extremely low latency, gave the design its name: ATOmic Low Latency (ATOLL).

Most internal structures have been redesigned, due to the different environments of the MPP and the SAN version of ATOLL. But the overall 4x4 structure, as shown in Figure 3-1, remained as the main characteristic. Several techniques have been adopted, like wormhole routing and a mechanism for link-level error correction and retransmission of corrupted data.


Figure 3-1. The ATOLL structure
(The figure shows four host interfaces and four link interfaces connected through a central 8x8 crossbar.)

The number of host and link interfaces is a trade-off between the goal to provide as much performance as possible and what is technically feasible. The bottleneck of today's SANs is still the network, considering that advanced 64 bit/66 MHz PCI bus implementations offer a bandwidth of up to 528 Mbyte/s. With PCI-X and its 1 Gbyte/s entering the market, this leaves a lot of bandwidth unused. And since dual-CPU, and in the near future quad-CPU, nodes become an attractive option as cluster node architecture, multiple host interfaces can overcome the overhead associated with multiplexing a single NI.

The upper limit for the number of host interfaces is given by the amount of resources needed to implement them, e.g. control logic, data buffers, etc. The upper limit for the number of link interfaces is given by the amount of pins needed in the IC package for implementing the parallel differential signal lines. The 4x4 structure turned out to be a well balanced system architecture to completely remove the network as communication bottleneck for the first time in Cluster Computing.

3.1.1 Design details of ATOLL

Some of the major design decisions are a natural consequence of the experiences with other solutions. ATOLL is a message passing network interfacing to the dominant I/O technology PCI, or more specifically, to its latest upgrade PCI-X. It utilizes source path and wormhole routing on the link level to enable very fast data forwarding. Byte-parallel copper links were the best choice for the interconnect at the start of the project. SANs are moving now towards high speed serial links, but this technology was still premature at that time.


A unique technique to detect and immediately correct transmission errors on links has been implemented to remove costly error checking and correction in software. Since newest cabling technology provides an almost error-free environment, even for high signal speeds, this mechanism offers the possibility to treat the network as a reliable medium.
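The text does not spell out the exact ATOLL retry protocol, but link-level retransmission schemes of this kind are commonly organized as follows (a hedged Python sketch with invented names: a sender-side retransmission buffer, acknowledgments that release it, and replay from the first corrupted packet onward):

```python
class RetransmitLink:
    """Sender side of a link-level retry scheme: keep packets until
    acknowledged, replay on a negative acknowledgment."""

    def __init__(self):
        self.unacked = []                 # retransmission buffer

    def send(self, seq, data):
        self.unacked.append((seq, data))  # keep a copy until acked
        return (seq, data)

    def ack(self, seq):
        # receiver confirmed everything up to and including seq
        self.unacked = [(s, d) for s, d in self.unacked if s > seq]

    def nack(self, seq):
        # receiver saw a corrupted packet: replay it and its successors
        return [p for p in self.unacked if p[0] >= seq]

link = RetransmitLink()
for i in range(3):
    link.send(i, f"pkt{i}")
link.ack(0)                # packet 0 arrived intact, buffer slot freed
replay = link.nack(2)      # packet 2 arrived corrupted, gets retransmitted
```

The key property, matching the paragraph above, is that recovery happens entirely inside the link hardware: software above it never sees the corrupted packet.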

The mechanisms for data transfer between the host and the NI were driven by the fact that the performance of DMA- or PIO-based approaches is highly dependent on the granularity of the communication. Therefore, the decision was made to support both techniques. This makes it possible to pick the fastest transfer, based on message size and other characteristics of the node system. A novel event notification mechanism avoids costly interrupts and enables the CPU to poll on cache-coherent memory.

Some advanced features had to be omitted to keep the complexity at a manageable level. It would have been interesting to implement features like adaptive and multicast routing. As discussed earlier, they can give a huge performance boost, especially for large installations with hundreds of nodes. But the first version of ATOLL targets the market of small to medium clusters, with the number of nodes somewhere between 8 and 256. The current design will be sufficient for these dimensions.

In the following, the major features and mechanisms for the ATOLL System Area Net- work are summarized:

• best cost-efficient solution by integrating all necessary SAN components into a single IC

• support for SMP nodes by multiple independent host interfaces

• removing the need for external switch hardware by integration of a switch component

• high sustained bandwidth of multiple concurrent data streams by implementing a highly decoupled architecture

• PIO- and DMA-based data transfer to/from host

• efficient control transfer via coherent NIC status information in host memory

• error detection and correction on the link level

The rest of this chapter will introduce the architecture of the ATOLL network chip. The main focus is on the implemented functionality, and how to make use of it via accessing the control/status registers of the device. Each top-level unit will be described separately, regarding its design and its typical use and operation. Given the huge complexity of the implementation (over 400 unique modules with about 30,000 lines of code), the level of detail had to be restricted. So not every single state machine or block of control logic is covered in all aspects. But the description of the most important mechanisms and units should enable the reader to gain an in-depth insight into the ATOLL architecture.

3.2 Top-level architecture

Figure 3-2. Top-level ATOLL architecture
(The figure shows the PCI-X interface and synchronization interface feeding a port interconnect, which connects four host ports; each host port pairs with a network port and a link port, all joined by the central 8x8 crossbar.)

ATOLL is a true 64 bit architecture. All addresses used by the device to access data structures in main memory have 64 bit base addresses. However, the actual pointers used to reference the start/end of individual data units are 32 bit offsets. This provides a sufficient amount of continuous memory space for data areas (4 Gbyte), while limiting the expense of internal arithmetic units. When a read/write address is forwarded to the PCI-X interface, the base address is added to the current offset. Figure 3-2 depicts the top-level architecture of ATOLL.
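The base-plus-offset scheme can be sketched in a few lines (the base address is an invented example; the point is that internal pointers stay 32 bit wide and the widening add happens only at the PCI-X interface):

```python
BASE = 0x0000_1234_0000_0000        # 64 bit base address (illustrative value)

def to_pci_address(offset32):
    """Widen a 32 bit internal offset to a full 64 bit PCI-X address."""
    assert 0 <= offset32 < 2**32    # offsets address at most a 4 Gbyte window
    return BASE + offset32          # widening add done at the PCI-X interface

assert to_pci_address(0x10) == 0x0000_1234_0000_0010
```

All internal queues and arithmetic thus only ever compare and increment 32 bit values, which is what keeps the on-chip arithmetic units cheap.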

In the following, a brief overview of each functional unit is given:

PCI-X interface

the PCI-X interface is used to communicate with the host system. It can act as bus master or slave, and provides sufficient support for the latest improvements to the PCI-X bus protocol, e.g. split transactions.

Synchronization interface

since the core of ATOLL runs with a higher clock frequency than the PCI-X interface, all control and data signals crossing this clock domain border must be synchronized to prevent signal corruption. This is done in a safe manner by the synchronization interface. It also converts the application interface of the PCI-X interface into a highly independent and concurrent interface for all four possible data transfer directions (read/write, master/slave).

Port interconnect

the port interconnect multiplexes the access to the PCI-X interface between all four host ports. Sufficient buffer space is provided to assemble multiple transfer requests. It also contains the global status and configuration register sets for ATOLL.

Host port

a host port contains all logic to enable PIO- and DMA-based message transfer. A small interchangeable context keeps all addresses and offsets needed to access data structures for messages residing in main memory. Large SRAM blocks supply enough buffer space to take up message data and store it for further processing. Multiple concurrent data transfers can be active at a time, e.g. sending data into the network from a FIFO in the DMA unit, while reading data from the receive FIFO of the PIO unit.

Network port

the network port converts a stream of tagged 64 bit data words from the host port into a 9 bit-wide data stream conforming to the link packet protocol and vice versa.

Crossbar

the crossbar is a full-duplex 8x8 port switch. It interprets routing bytes of incoming messages to decode the outgoing port, which can be any of the 8 ports.

Link port

the link port provides a full-duplex interface to the network. It prevents buffer overrun by enforcing a reverse flow control scheme. Special retransmission hardware automatically detects corrupted link packets and retransmits them. Enough buffer space is included to support cable lengths of up to 20 m.

The whole architecture is optimized to provide the highest level of sustained bandwidth and an extremely low latency. All host/network/link port units consist of independent modules for both transfer directions. In contrast to some other SAN interfaces, message data is not temporarily stored in large external RAM modules. Rather, multiple smaller data FIFOs are spread all along the data paths from the network to the host. These can be viewed as distributed on-chip data RAM.

3.2.1 Address space layout

The whole ATOLL device requests a PCI-X address space of 1 Mbyte at system start-up. Only the first 260 Kbyte of this address space are currently used. Different parts of the address space are assigned separate memory pages to provide the possibility of using varying memory page control schemes. E.g., some pages could be defined as cachable, but pages containing status registers should not be cached, to make sure that each access to them returns valid and up-to-date data. Since the intention is to support all possible cluster node platforms (x86, Alpha, SPARC, etc.), the decision was made to select 8 Kbyte pages: the Alpha architecture uses this memory page size, whereas x86 microprocessors use 4 Kbyte pages. So address regions with different memory page attributes can be implemented on both platforms.

Figure 3-3. Address layout of the ATOLL PCI-X device

  offset    region                      size
  +00000h   host port 0 (4 pages)       32 Kbyte
  +08000h   host port 1 (4 pages)       32 Kbyte
  +10000h   host port 2 (4 pages)       32 Kbyte
  +18000h   host port 3 (4 pages)       32 Kbyte
  +20000h   cntl/stat regs (1 page)     8 Kbyte
  +22000h   unused (15 pages)           120 Kbyte
  +40000h   init/debug regs (1 page)    8 Kbyte
  +42000h   unused (95 pages)           760 Kbyte
  +FFFFFh   end of 1 Mbyte address space

Another reason for separate pages is the level of protection for different address areas. The user needs access to the registers controlling the PIO- and DMA-based message transfer, but the access to critical control registers should be protected. E.g. a normal user should not be able to alter the frequency of the core clock, or reset parts of the chip.

56 The ATOLL System Area Network

In all 8 Kbyte pages only the lower 4 Kbyte are used; the upper 4 Kbyte are left unused. Read accesses to any unused address inside the whole ATOLL address space return the data value 0, and write accesses consume the written data without any further action. This prevents system failures caused by erroneous, misaligned accesses to the device. Figure 3-3 shows the address layout of the ATOLL PCI-X device. All addresses given are offsets relative to the base address, in hex format.

The page for the initialization and debug registers at address 40000h was appended at a late stage of the design cycle. To ease the insertion of the registers and the decoding of addresses, it was simply placed at an address with a unique address bit (a set addr[18] bit references the init/debug registers). In future versions of the architecture it could be located just after the control/status registers to implement a more compact address layout.

The layout of the different pages is discussed in detail later in this chapter, in the appropriate sections. All four host port address regions have exactly the same layout and are further illustrated in the section about the host port. The control/status registers are located in the port interconnect and are discussed there. Finally, the initialization/debug registers are located in the synchronization interface and are specified in its section.

57 PCI-X interface

3.3 PCI-X interface

Figure 3-4. PCI-X interface architecture [65]


The PCI-X bus interface module used in the ATOLL chip is an external IP cell from Synopsys, Inc. [65]. Its top-level architecture is depicted in Figure 3-4. It implements a bus interface fully compliant with the PCI-X bus specification [66]. The IP cell is split into four main blocks:

DW_pcix_ifc

the DW_pcix_ifc module contains the PCI bus interface. It performs multiplexing of outgoing addresses and data onto the PCI-X AD bus, and registers all incoming signals. Parity generation and checking is part of this module, as well as detection of the PCI-X bus mode (32/64 bit, 33/66/100/133 MHz).

DW_pcix_com

the DW_pcix_com module implements the Completer logic. It is responsible for all actions necessary when the device acts as a bus slave. Data written onto the device needs to be forwarded to the application. When addressed by a read cycle, the address has to be delivered to the application, which then returns the data.

DW_pcix_req

the DW_pcix_req module contains the Requester logic. It controls all bus master transactions triggered by the application. It automatically retries transfers, if a slave device (like the PCI bus bridge) disconnects in the middle of a data transfer.

DW_pcix_config

the DW_pcix_config module implements the PCI configuration space, as defined by the PCI bus specification. It serves read/write requests to the configuration space.

The complexity of the PCI-X bus interface is quite high, since it implements all the special protocol cases defined in the specification. This complexity is also visible at the interface on the application side. But since the functionality needed by the ATOLL core logic is quite restricted, many features of the PCI-X interface are ignored and disabled. Some examples are:

• the configuration module provides an interface to the application to read/write configuration registers, and to modify/control its behavior. This interface is completely disabled: all output signals of the DW_pcix_config unit are left unconnected, and all input signals are set to their inactive value.

59 Synchronization interface

• the completer interface provides additional signals to force a disconnect of the current bus transaction, e.g. when the requested data must be fetched from external RAM. But since the ATOLL core can deliver all data within a few cycles, this feature is not needed and the corresponding signals are disabled.

In the following, the main parameters are given, which were defined during generation of an IP cell adapted to the needs of the ATOLL core logic:

• the device can function as PCI-X or PCI device, as defined during system start-up

• the device is capable of acting as a 64 bit bus device. It can request 64 bit transactions and respond to them, but it can also fall back into a 32 bit-only mode

• the device is capable of running at the highest bus speed defined by the specification, 133 MHz. It also supports lower bus frequencies of 100, 66 and 33 MHz

• a single 64 bit Base Address Register (BAR) is defined, requesting an address space of 1 Mbyte at system start-up. This address space is defined as prefetchable

• the cache line size register is configured to support up to 4 bits, resulting in a maximum cache line size of 16 DWORDs

• no power management functions are implemented

• the registers for minimum grant (MIN_GNT) and maximum latency (MAX_LAT) are both set to 255, their maximum value. This signals the request of the device for long bus burst transfers

• the maximum memory read byte count is set to its highest possible value to support burst transfers of up to 4 Kbyte

• the signal INTA is used by the device to generate interrupts

• split/delayed bus transactions are supported

3.4 Synchronization interface

The synchronization interface connects the PCI-X bus module to the ATOLL core logic. On the PCI-X side, it implements the completer and requester interface defined by the application side of the unit. The completer interface is used for read/write accesses in slave mode, when the ATOLL device is the target of a bus transaction. The requester interface is used for reading data from main memory in master mode, or writing data to memory. On the ATOLL side, these combined read/write interfaces are split into separate, mostly independent paths. This results in four dedicated interfaces: Slave-Read,


Slave-Write, Master-Read and Master-Write. Between those interfaces, in the middle of the synchronization interface, all control and data signals pass a clock boundary via special synchronization elements, which are described in detail later on. Figure 3-5 depicts this structure, visualizing the main data flow direction.

Figure 3-5. Structure of the synchronization interface

[The figure shows the PCI-X interface (Completer and Requester) in the 133 MHz clock domain, connected across the clock domain border via synchronization FIFOs and the init/debug registers to the ATOLL core in the 250 MHz clock domain, with the four data paths Slave-Write, Slave-Read, Master-Read and Master-Write.]

3.4.1 Completer interface

The completer interface needs to separate read and write accesses, and interacts with the Slave-Write and Slave-Read data paths. All signals of the interface used on the PCI-X side are shown in Figure 3-6. The general signals are utilized for both transfer directions. The PCI-X bus specification defines several types of read/write bus commands, but ATOLL only distinguishes between read and write operations. Specifying the number of bytes of a bus transfer is a new feature introduced with PCI-X, so it has no meaning when operating in plain PCI mode. Data is transferred via two-way handshaking: the producer simply signals that data is ready, and the consumer signals that it is ready to accept data.


Transfers occur on each clock edge with both signals set. Detailed timing diagrams of the completer interface can be found in the databook [65] of the IP cell.

Figure 3-6. Completer interface signals

general:
  com_devsel              device is selected
  cdp_64bit_xfr           data is 64 bit wide
  ifc_addr[18:0]          start address
  ifc_cmd[3:0]            PCI-X bus command
  cdp2app_bytecnt[12:0]   number of bytes

write data to device:
  cdp2app_data[63:0]      write data
  cdp2app_data_rdy        data is ready
  app2cdp_rdy4data        app can accept data

read data from device:
  app2cdp_data[63:0]      read data
  app2cdp_data_rdy        data is ready
  cdp2app_rdy4data        cdp can accept data

(cdp = completer data path, ifc = interface, app = application, com = completer)

For write transactions, the completer stores the start address and pushes all incoming data into a synchronization FIFO towards the Slave-Write data path. At the end, the address is handed over as well, together with the number of 64 bit words transmitted. This method limits write bursts to a length of 64 words, the depth of the data FIFO, but it saves resources compared to a solution where each single data word is tagged with its address.
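The address-plus-count handover can be sketched as a small software model; structure and function names here are assumptions for illustration, not taken from the actual design:

```c
#include <assert.h>
#include <stdint.h>

#define WR_FIFO_DEPTH 64   /* depth of the Slave-Write data FIFO */

/* Hypothetical model of the completer write scheme: only the start
 * address is stored, incoming words are pushed into the FIFO, and the
 * (address, word count) pair is handed over at the end of the burst. */
struct wr_burst {
    uint32_t start_addr;
    uint64_t fifo[WR_FIFO_DEPTH];
    int      count;
};

static void wr_burst_start(struct wr_burst *b, uint32_t addr) {
    b->start_addr = addr;
    b->count = 0;
}

/* push one 64 bit word; returns -1 once the 64 word burst limit is hit */
static int wr_burst_push(struct wr_burst *b, uint64_t data) {
    if (b->count >= WR_FIFO_DEPTH)
        return -1;
    b->fifo[b->count++] = data;
    return 0;
}
```

The burst length limit falls directly out of the FIFO depth, matching the resource trade-off described above.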

In case of a read request, the address is immediately forwarded to the ATOLL core to request the data. After a few cycles, the ATOLL core delivers the data, passing it through another 64 word-deep FIFO stage. There is one side condition regarding the relation of both clocks when the device is in PCI-X mode: PCI-X forbids wait cycles on the target side once the target has started to deliver data. So after the first data word is transferred, the target must deliver data on each successive clock cycle.

Since the Slave-Read path on the ATOLL side delivers a data word on every second cycle, the ratio of PCI-X to ATOLL clock frequency must be at least 1:2. So if the PCI-X interface runs at its highest frequency of 133 MHz, the ATOLL clock must be at least 266 MHz. This restriction is a remnant of an earlier version of the design; a future version could simply implement the ATOLL side so that it delivers data on each clock cycle. When running in plain PCI mode, the end of the transfer is signaled via a deasserted device select signal. The completer then deasserts the valid signal for the address, so the ATOLL side knows it can stop delivering data. All prefetched data still in the data FIFO is then discarded by flushing the FIFO.

For both transfer directions, the completer interface is also responsible for converting a 64 bit data stream into a 32 bit one, if the PCI-X bus is only capable of running 32 bit transactions.

3.4.2 Slave-Write data path

Figure 3-7. Slave-Write path signals

  SlaveWrite_Address[17:0]   address
  SlaveWrite_Data[63:0]      data
  SlaveWrite_Valid           data is valid
  SlaveWrite_Stop            stop transfer (data cannot be accepted)

On the ATOLL side, the Slave-Write path is quite simple. As soon as an address-length pair is handed over from the completer, this unit forwards each data word from the FIFO to the ATOLL core, together with its address. Data is again transferred via two-way handshaking; the signals are shown in Figure 3-7. On every clock edge, data is latched if it is valid and the consumer side can accept the data. This valid/stop scheme is used throughout the whole architecture to transfer data between adjacent units.
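The valid/stop rule can be written down as a minimal cycle-based sketch; the names below are illustrative helpers, not the hardware signals themselves:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of the valid/stop scheme between adjacent ATOLL units:
 * a word is latched on a clock edge iff the producer drives valid and
 * the consumer does not drive stop. */
struct vs_link {
    bool     valid;  /* producer: data is valid      */
    bool     stop;   /* consumer: cannot accept data */
    uint64_t data;   /* the word on the wires        */
};

/* returns true and latches the word if a transfer happens this cycle */
static bool vs_clock_edge(const struct vs_link *l, uint64_t *latched) {
    if (l->valid && !l->stop) {
        *latched = l->data;
        return true;
    }
    return false;
}
```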

3.4.3 Slave-Read data path

Figure 3-8. Slave-Read path signals

  SlaveRead_Address[17:0]      address
  SlaveRead_AddressValid       address is valid
  SlaveRead_InterfaceFull      data fifo is full
  SlaveRead_DataOut[63:0]      requested data
  SlaveRead_InterfaceShiftIn   push data into fifo

The completer delivers the address, a valid signal and the byte count for a read access. When running in plain PCI mode, the PCI bus does not use a byte count, but instead signals the end of a transfer via deasserting a signal prior to the last cycle. Since the only burst read access to ATOLL could be the reading of message data from the data FIFOs of the PIO mode, a prefetching mechanism is used so that a request does not fail due to missing data. In plain PCI mode, the completer therefore sets the byte count to the maximum possible value. The Slave-Read path then fetches data from the ATOLL core until the address is signaled as invalid, or the byte count is satisfied.


Figure 3-8 depicts the interface signals to the ATOLL core. In case the core delivers data too fast, e.g. when the PCI-X interface runs only at 33/66 MHz, a full data FIFO is signaled to prevent buffer overflow. When running at 100/133 MHz, data is transferred every second clock cycle; below that, the interface is slowed down to transfer data only every fourth cycle. This mechanism was introduced to prevent large amounts of prefetched data from accumulating in the data FIFO, which, in an early version of the implementation, had to be shifted out one by one after the transaction ended (the FIFO simply lacked a flush signal). Later on, a custom version of the data FIFO was developed with such a flush signal, rendering this mechanism unnecessary. It could simply be removed in a later version of the implementation. Figure 3-9 visualizes a typical Slave-Read transfer.
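The throttling rule reduces to a small frequency-dependent fetch interval; a sketch with an assumed helper name:

```c
#include <assert.h>

/* Sketch of the (now obsolete) prefetch throttle: at 100/133 MHz the
 * Slave-Read path fetches a word from the core every 2nd cycle, at
 * 33/66 MHz only every 4th, limiting surplus prefetched data in the
 * FIFO when the bus drains slowly. */
static int slave_read_fetch_interval(int pci_mhz) {
    return (pci_mhz >= 100) ? 2 : 4;
}
```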

Figure 3-9. Typical Slave-Read transfer

[The figure shows four phases: (a) the completer requests 8 words from address 02048h; (b) the ATOLL core starts delivering data into the data FIFO; (c) the last data words are transferred; (d) the rest of the prefetched data is flushed from the FIFO.]

3.4.4 Master-Write data path

When acting as bus master, the ATOLL device tries to transfer data in bursts as large as possible, to use the bus efficiently. When the ATOLL core writes out data into main memory, each data word is transferred together with its destination address. An additional signal marks the last data word of a burst. Data is handed over via the normal two-way handshake. Figure 3-10 displays all interface signals.

Figure 3-10. Master-Write path signals

  MasterWrite_AddressOut[63:0]   address
  MasterWrite_DataOut[63:0]      data
  MasterWrite_last_addr          last data of a burst
  MasterWrite_Valid              data is valid
  MasterWrite_Stop               stop transfer

The control logic of the Master-Write data path pushes incoming data into a FIFO. It stores the first address and only counts incoming words as long as they form a continuous stream of addresses. At the end of a continuous data block, it hands over address and word count to the requester unit, which then passes the data burst on to the PCI-X interface. There are four conditions marking the end of a burst:

• the address of a data word does not continue the previous address sequence. This might happen if multiple data streams from different host ports are mixed

• the ATOLL core signals the last data word of a burst

• the data FIFO is full, so the maximum size for a single burst is reached (currently 64 words)

• or no data has been transferred for a certain amount of time (currently 128 cycles)
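The four conditions above can be combined into a single termination predicate; the following sketch uses assumed parameter and macro names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MW_MAX_BURST   64    /* data FIFO depth = maximum burst length     */
#define MW_IDLE_LIMIT 128    /* cycles without new data that close a burst */

/* Illustrative model of the burst termination logic: expected_addr is
 * the address that would continue the current contiguous burst. */
static bool mw_burst_ends(uint64_t next_addr, uint64_t expected_addr,
                          bool last_flag, int words_in_fifo, int idle_cycles)
{
    return next_addr != expected_addr      /* stream from another host port */
        || last_flag                       /* core marks last word of burst */
        || words_in_fifo >= MW_MAX_BURST   /* data FIFO is full             */
        || idle_cycles   >= MW_IDLE_LIMIT; /* time-out without data         */
}
```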

3.4.5 Master-Read data path

Reading data from main memory requires more complex logic. To utilize the PCI-X interface efficiently, requests to read data are handled in a split-phase manner. The ATOLL core issues a request to load a certain number of data words from a specific address. Up to 32 words can be requested with a single job. The Master-Read path tries to combine successive jobs into one large burst. It then instructs the requester unit to fetch data from the PCI-X bus and forwards the data to the ATOLL core.

Since multiple host ports might request data, each job is assigned one of 8 job IDs, to be able to associate returned data with the job requesting it. So up to 8 jobs might be outstanding, each with up to 32 words. That way, the PCI-X interface is kept busy and only idles if no requests for data are active at all. Figure 3-11 shows the interface signals, grouped into signals used for job creation or completion.


Figure 3-11. Master-Read path signals

job creation:
  MasterRead_Address[63:0]     address
  MasterRead_LengthOut[4:0]    number of words requested
  MasterRead_ValidOut          request is valid
  MasterRead_IDCreate[2:0]     new job ID
  MasterRead_StopOut           stop transfer

job completion:
  MasterRead_Data[63:0]        read data
  MasterRead_LengthIn[4:0]     word count of job
  MasterRead_IDComplete[2:0]   ID of completed job
  MasterRead_ValidIn           data is valid
  MasterRead_ShiftOut          shift out data from fifo

The control logic of the Master-Read path keeps outstanding jobs in an internal queue. Incoming data is then tagged with the correct ID and word count before handing it over to the ATOLL core. Job requests are served strictly in order to ensure a fair service.
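The job bookkeeping can be sketched in software; the structure and function names below are assumptions for illustration:

```c
#include <assert.h>

#define MR_MAX_JOBS   8   /* 3 bit job IDs              */
#define MR_MAX_WORDS 32   /* maximum words per read job */

/* Illustrative model: jobs receive IDs 0..7 in issue order and are
 * completed strictly in the same order. */
struct mr_queue {
    int next_id;      /* ID assigned to the next job      */
    int oldest;       /* ID of the oldest outstanding job */
    int outstanding;  /* jobs currently in flight         */
};

/* returns the new job ID, or -1 if the queue is full / length illegal */
static int mr_create_job(struct mr_queue *q, int words) {
    if (q->outstanding >= MR_MAX_JOBS || words < 1 || words > MR_MAX_WORDS)
        return -1;
    int id = q->next_id;
    q->next_id = (q->next_id + 1) % MR_MAX_JOBS;
    q->outstanding++;
    return id;
}

/* completes the oldest job (strict in-order service); -1 if none */
static int mr_complete_oldest(struct mr_queue *q) {
    if (q->outstanding == 0)
        return -1;
    int id = q->oldest;
    q->oldest = (q->oldest + 1) % MR_MAX_JOBS;
    q->outstanding--;
    return id;
}
```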

3.4.6 Requester interface

The requester unit needs to multiplex the access to the PCI-X interface between the Master-Write and the Master-Read data paths. Requests for work are scheduled in a fair round-robin fashion between them, to prevent starvation of one of the paths. Both job scenarios follow the same procedure, which is briefly described below:

• if a request for a job is handed over by one of the two paths, the requester loads the start address and the byte count into the PCI-X interface

• it signals the type of bus command to the interface. These are Memory Read/Write Block in case of PCI-X, and Memory Read/Write in plain PCI mode

• data is then transferred via the usual two-way handshaking

• after delivering the last data word, the PCI-X interface is released and the requester looks for new job requests

In contrast to the completer interface, the conversion of 64 bit data into 32 bit data is performed automatically by the PCI-X interface. An interruption of the bus transaction is also managed by the interface, e.g. when a bus target disconnects in the middle of a transfer. All these special cases are served and managed by the interface, greatly simplifying the requester logic. Figure 3-12 lists all interface signals of the requester part.


Figure 3-12. Requester interface signals

general:
  app_adr[63:0]        start address
  app_cmd[3:0]         PCI-X bus command
  app_bytecnt[11:0]    number of bytes
  app_adr_ld           load address, start job
  rsm_busy             requester is busy

write data to PCI-X:
  app2rdp_data[63:0]   write data
  app2rdp_data_rdy     data is ready
  rdp2app_rdy4data     requester can accept data

read data from PCI-X:
  rdp2app_data[63:0]   read data
  rdp2app_data_rdy     data is ready
  app2rdp_rdy4data     app can accept data

(rsm = requester state machine, app = application, rdp = requester data path)

3.4.7 Device initialization

A few initialization and debug registers have been placed into the synchronization interface. The registers reside on a separate page, because some critical functions are provided which should only be accessible to a privileged user/administrator. They have been moved out of the ATOLL core, since it cannot be assumed that a stable clock is running in the core. The configuration of the ATOLL core clock is also located here. Only four 64 bit registers are allocated in total, split up into 32 bit low/high parts.

Figure 3-13 gives an overview of each register and the meaning of its individual bits. Not all bits are used; these are grayed out in the figure. In addition, some registers are read-only, ignoring write accesses to them. The given address offset is relative to the base address of the whole ATOLL device.


Figure 3-13. Device initialization and debug registers

reg 0 low (offset 40000h, read/write), reset register:
  clk_sel_res_n, xbar_res_n, pi_res_n, np_res_n[3:0], lp_res_n[3:0], hp_res_n[3:0] (remaining bits not used)

reg 0 high (read/write), PLL register:
  feedback cnt[5:0], avoid_glitch, slave_select[1:0] and master/slave_select for the XO-PLL (remaining bits not used)

reg 1 low (offset 40008h, read/write): n/a

reg 1 high (read only), status register:
  pcix_mode, ifc_bus64, pcix_133m, pcix_100m, pcix_66m, PCI-DLL lock (remaining bits not used)

reg 2 low (offset 40010h, read/write), bist_start_low:
  bist_start[9:0] for xxx port 0, xxx port 1, xxx port 2

reg 2 high (read/write), bist_start_high:
  bist_start[9:0] for xxx port 3, bist_start bits for the synch. int. (Slv-Wr, Mst-Wr, Mst-Rd), PCI-DLL bypass

reg 3 low (offset 40018h, read only), bist_ok_low:
  bist_ok[9:0] for xxx port 0, xxx port 1, xxx port 2

reg 3 high (read only), bist_ok_high:
  bist_ok[9:0] for xxx port 3, bist_ok bits for the synch. int. (Slv-Wr, Mst-Wr, Mst-Rd)


Reset register

besides the global power-on reset signal, each major block of ATOLL has been assigned a unique reset signal. This offers the possibility to reset a specific part of the device without reinitializing the whole chip. The abbreviations stand for: host port (hp), network port (np), link port (lp), port interconnect (pi), crossbar (xbar) and clock logic (clk_sel).

PLL register

the ATOLL clock logic includes a configurable Phase Locked Loop (PLL) to set the internal clock anywhere within 175-350 MHz, in steps of 14 MHz. The 6 bit feedback counter configures the clock frequency. Besides running a chip with its own clock, there is the possibility to take one of the four incoming clocks from the links as main clock signal, selected via slave_select[1:0]. The signal master/slave_select is then used to switch between the master and the selected slave (link) clock. Another signal can be set to make sure this transition is made without glitches on the clock signal.

Status register

The lower 5 bits sample status bits set by the PCI-X interface. At system start-up, it is detected whether the device runs in PCI or PCI-X mode, whether it sits on a 64 bit bus, and which PCI bus frequency is configured. The signal PCI-DLL lock is set when the on-chip Delay Locked Loop (DLL) has adjusted the PCI clock to a fixed clock tree delay, as specified in the PCI-X specification.

BIST start register

All SRAM memory cells have Built-In Self Test (BIST) logic attached. It checks the memory for erroneous bits by successive reads/writes of certain bit patterns. Normally this is done via the JTAG test logic, which is used to check chips for manufacturing faults, but these registers provide an additional way to run the BIST tests via software. Setting the start bits triggers the internal control engine of the BIST logic to run through all memory cells. Each bit controls one of the 43 SRAM cells in the ATOLL chip. All cells within a 10 bit word of xxx port 0/1/2/3 are in the same order ([0:9]): hp_pio_snd, hp_pio_rcv, hp_dma_snd, hp_dma_rcv, np_snd, np_rcv_buf0, np_rcv_buf1, lp_in, lp_out_buf0, lp_out_buf1. Three bits are used to control the three SRAMs in the synchronization interface. An additional bit is used to bypass the PCI-DLL.
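The per-port bit ordering can be captured in a small helper. Note the assumption: following one plausible reading of the register figure, ports 0-2 occupy the low 32 bit half of the register and port 3 starts the high half at bit 32; the helper name is invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (not a driver API): bit position of the BIST start
 * bit for SRAM cell `cell` (0..9, in the order listed in the text) of
 * `port` (0..3). Assumption: ports 0-2 pack into the low 32 bit half,
 * port 3 begins the high half at bit 32. */
static inline uint64_t bist_start_bit(int port, int cell) {
    return 1ULL << ((port < 3) ? port * 10 + cell : 32 + cell);
}
```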

69 Port interconnect

BIST ok register

These bits correspond to the BIST start bits. Due to the different sizes of the SRAMs, some checks are faster than others, but after 10 µs all SRAMs are tested and all 43 bits should be set. Any bit remaining unset flags a flawed SRAM.

3.5 Port interconnect

Figure 3-14. Structure of the port interconnect

[The figure shows the port interconnect between the synchronization interface and host ports 0-3: the Slave-Write and Slave-Read control units, the Master-Write and Master-Read control units (each with an arbiter), and the control/status registers.]


The port interconnect multiplexes the four ATOLL core data paths described in “Synchronization interface” on page 60 between all four host ports: Slave-Write, Slave-Read, Master-Write and Master-Read. It temporarily buffers data in FIFOs to ease the implementation as an IC; the data nets would otherwise travel a long distance across the chip from the host ports to the synchronization interface, making the task of a physical implementation extremely difficult. And some additional pipeline stages on each path make it possible to assemble a few data words in this module, even if the path is blocked further ahead.

The implementation of both Slave paths is straightforward. An incoming address from the synchronization interface is decoded to find the target host port. This can be done by analyzing only 2 bits of the address, according to “Address space layout” on page 56. In case of the Slave-Write path, the addressed host port then pops the data from the FIFO. If data is to be read from one of the host ports, the addressed host port starts delivering data until the Slave-Read path deasserts the valid signal for the address, thus ending the transfer. Both data paths also handle requests for the control/status registers.
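Given the layout from Section 3.2.1 (host port n at offset n x 32 Kbyte), the decode indeed reduces to two address bits; a sketch with an assumed helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Host ports occupy four 32 Kbyte regions starting at offset 0 of the
 * 1 Mbyte BAR, so address bits [16:15] select the port. */
static int atoll_host_port(uint32_t off) {
    if (off >= 0x20000)        /* beyond the four host port regions */
        return -1;
    return (off >> 15) & 3;    /* 32 Kbyte = 2^15 bytes per port */
}
```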

In case of both Master paths, access must be multiplexed between all host ports. Each path contains an arbiter, which observes request signals from the host ports. It grants the data path to only one host port, and releases the grant when the request signal is deasserted. This is forced by a host port itself after a specific number of cycles, to prevent one host port from blocking access to the data path. Arbitration is done based on a fair round-robin scheduling policy. The Master-Write path is multiplexed between 8 requesters, since two individual units inside each host port need to write data out to memory. The main unit is the one receiving messages in DMA mode, spooling data out to buffer regions residing in main memory. The other unit mirrors relevant status information into memory, removing the need to poll the device for it. These status packets consist of only four 64 bit data words, so the impact on the overall data throughput is relatively small.
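The arbiters follow the classic round-robin pattern: search for the next pending request starting just after the last grant. A minimal sketch (illustrative model, not the actual RTL):

```c
#include <assert.h>
#include <stdbool.h>

#define NREQ 8   /* the Master-Write path arbitrates among 8 requesters */

/* Grant the first requester found when searching round-robin from the
 * position after the last grant; returns -1 if no request is pending. */
static int rr_arbitrate(const bool req[NREQ], int last_grant) {
    for (int i = 1; i <= NREQ; i++) {
        int cand = (last_grant + i) % NREQ;
        if (req[cand])
            return cand;
    }
    return -1;
}
```

Because the search always starts after the previous winner, every requester is reached within NREQ arbitration rounds, which gives the fairness property mentioned above.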

3.5.1 ATOLL control and status registers

A separate module1 hosts the control/status registers needed for managing the use of ATOLL. These are registers configuring the global state of the device, as well as host port specific registers. Since they control important settings of the device, these registers should be neither visible nor accessible to the normal user. They are normally controlled by a supervisor, mostly via an administration interface. These registers provide the following functionality:

1. designed and implemented by Prof. Dr. Ulrich Brüning


• link driver control

• start/stop of DMA engines

• interrupt generation (mask & clearance)

• global counter

• debug control of the crossbar

• additional status information

All registers can be split into four categories: control, status, debug and extension. The registers are aligned to cache line boundaries of eight 64 bit words. This prevents the CPU from fetching more than one register with each access, which could load several registers into prefetch buffers in the system’s chipset. Though all registers are free of side effects, this decision was made to ensure that a user always gets the up-to-date status of a register.
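The cache line alignment shows up directly in the offsets of Table 3-1, which follow an index times 64 byte stride; the macro and helper names below are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define REG_BASE   0x20000u  /* control/status page within the ATOLL BAR */
#define REG_STRIDE 0x40u     /* one cache line: eight 64 bit words       */

/* offset of the n-th control register, per the alignment rule above */
static inline uint32_t ctrl_reg_offset(unsigned n) {
    return REG_BASE + n * REG_STRIDE;
}
```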

Table 3-1. Control registers of ATOLL

  register name   offset   mode, width   comment
  hw_cntl         20000h   r/w, 32       hardware control, link enable, DMA engines
  cnt_load        20040h   r/w, 64       global counter load
  hp_cntl         20080h   r/w, 64       host port specific control
  hp_dma_thres    200C0h   r/w, 32       threshold value to determine receive mode (DMA/PIO)
  irq_timeout     20100h   r/w, 32       IRQ time-out value
  irq_mask        20140h   r/w, 32       IRQ mask setting
  irq_clear       20180h   r/w, 32       IRQ clearance
  hp_timeout      201C0h   r/w, 32       host port time-out value for PIO mode

In Table 3-1, all control registers are listed with their address offset relative to the device base address, their mode (read/write, read-only) and their width (32 or 64 bit). The status registers make various status information visible to the user of the device. They are listed in Table 3-2 in the same manner as in the previous table.

The internal crossbar offers the possibility to observe its status and to insert/pull out link data in case of severe problems. Table 3-3 lists the eight registers, their use is described in detail in the ATOLL Hardware Reference Manual [67].


Table 3-2. Status registers of ATOLL

  register name   offset   mode, width   comment
  hw_state        20200h   ro, 32        hardware state, clock configuration, DMA engines
  cnt             20240h   ro, 64        global counter value
  hp_state        20280h   ro, 32        host port specific status, last mode of access
  lp_retry        202C0h   ro, 32        link port retry counter
  hp_irq_case     20340h   ro, 32        host port IRQ cases, out of buffer space
  irq_case        20380h   ro, 32        global IRQ cases, link error, clock failure

Table 3-3. Debug registers of ATOLL

  register name   offset   mode, width   comment
  debug_xbar0     20400h   r/w, 64       debug info from crossbar port 0
  debug_xbar1     20440h   r/w, 64       debug info from crossbar port 1
  debug_xbar2     20480h   r/w, 64       debug info from crossbar port 2
  debug_xbar3     204C0h   r/w, 64       debug info from crossbar port 3
  debug_xbar4     20500h   r/w, 64       debug info from crossbar port 4
  debug_xbar5     20540h   r/w, 64       debug info from crossbar port 5
  debug_xbar6     20580h   r/w, 64       debug info from crossbar port 6
  debug_xbar7     205C0h   r/w, 64       debug info from crossbar port 7

Finally, two more registers are used as extension to the above set of registers to control the debugging of the crossbar and to read/write data on 8 general purpose pins of the chip, as listed in Table 3-4.

Table 3-4. Extension registers of ATOLL

  register name   offset   mode, width   comment
  debug_cntl      20600h   r/w, 32       control of crossbar debug
  gp_io           20640h   r/w, 32       8 general purpose I/O pins

In the following, all registers are described in detail. They are listed in groups of func- tional classes, rather than in the strict sequence given in the tables.


Figure 3-15. Hardware control/status and global counter

hw_cntl (reg 0, offset 20000h, read/write):
  hp[x] enable DMA rcv, hp[x] enable DMA snd, lp[x] loopback mode, link retry mux[1:0], link[x] driver enable, xbar debug enable ([x] = [3:0]; bits 63-32 not used)

hw_state (reg 8, offset 20200h, read-only):
  hp[x] DMA snd idle, hp[x] DMA rcv idle, link[x] active, link[x] PLL lock, internal PLL lock, XO PLL lock, clk phase synch. ([x] = [3:0]; bits 63-32 not used)

cnt_load (reg 1, offset 20040h, read/write):
  load written value into 64 bit global counter

cnt (reg 9, offset 20240h, read-only):
  64 bit global counter, running with internal core clock

Figure 3-15 depicts the hardware control/status registers, as well as the global counter registers. The latter is simply an internal counter running with the internal ATOLL core clock; a write access to the cnt_load register loads the written value into the counter. The hardware control/status registers assemble various configuration bits, which are described in more detail below:

• hw_cntl[7:0] enables the DMA engines inside the four host ports. The lower four bits control the DMA engines in the receive unit, whereas the upper four bits control the send DMA units. These bits are helpful, if the context of the host ports should be switched, e.g. in case the device is multiplexed between several user processes

74 The ATOLL System Area Network

• hw_cntl[15:12] controls a loopback mode of the link ports. Message data normally sent over the link is then fed back into the device via the input path of the link port. This is helpful to isolate cable failure from internal chip problems

• hw_cntl[17:16] controls the generation of an interrupt based on link bit errors. Bit errors force a link packet retransmission by the link port. This event is signaled by the link port, and accumulated in internal 4 bit counters. link_retry mux[1:0] determines which bit of the accumulator triggers an interrupt. So one can configure it such that an interrupt is generated after 1, 2, 4 or 8 link packet retransmissions

• hw_cntl[23:20] enables the LVDS driver I/O cells, so link data is driven over the cable

• hw_cntl[31] is used to activate a special mode for the crossbar debug registers

• hw_state[7:0] signals idling DMA engines in the host ports. This is useful to wait for the completion of DMA jobs after the DMA engines have been disabled. By accident, the bits for send/receive have been swapped relative to the hw_cntl bits

• hw_state[23:20] are set, if the corresponding link is active

• hw_state[27:24] are set, if the corresponding link PLL is locked

• hw_state[30:28] are set if the two PLLs for the main ATOLL core clock are locked. The highest bit is set if the phases of two clock signals are synchronized, when the main clock is to be switched from the on-chip clock signal to a link clock
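The bit assignments above can be captured as C masks; a sketch under the assumption that hw_cntl is accessed as a plain 32 bit word (all names are hypothetical):

```c
#include <stdint.h>

/* hw_cntl bit fields (offset 20000h), following the description above. */
#define HW_CNTL_DMA_RCV_EN(x)   (1u << (x))        /* x = 0..3 */
#define HW_CNTL_DMA_SND_EN(x)   (1u << (4 + (x)))
#define HW_CNTL_LOOPBACK(x)     (1u << (12 + (x)))
#define HW_CNTL_RETRY_MUX_SHIFT 16                 /* 2-bit mux field */
#define HW_CNTL_LVDS_EN(x)      (1u << (20 + (x)))
#define HW_CNTL_XBAR_DEBUG      (1u << 31)

/* The retry mux selects which bit of the 4-bit retry accumulator
 * raises an interrupt: mux value m fires after 2^m retransmissions. */
static inline unsigned retries_per_irq(unsigned mux)
{
    return 1u << (mux & 3u);
}
```

For example, a mux value of 3 raises the interrupt only after the accumulator reaches 8 retransmissions.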

Figure 3-16 shows the registers to control the host ports. Their meaning is as follows:

• hp_cntl[3:0] determines the update frequency of status information written out by the PIO receive unit inside a host port. After n words have been received from the network and pushed into the 64 word data FIFO, the unit triggers an update of the mirrored FIFO fill level in main memory. To span the whole range of the FIFO, the 4 bit value is multiplied by 4, so updates can occur after 4, 8, 12, 16, ..., 60 words. If set to 0, this mechanism is disabled. Setting this value is a trade-off between the need for up-to-date status information and limiting the bandwidth used for updates. The last word of a message frame (header or data) always triggers the update, so one is informed when enough or all remaining data is available

• hp_cntl[7:4] does the same job, but for the PIO send module. If the fill level of the FIFO varies by the threshold value, an update is triggered. Also writing the last word of the data frame triggers an update


• hp_cntl[31:8] have the same meaning as hp_cntl[7:0], but for the other host ports

Figure 3-16. Host port specific control/status registers

hp_cntl: reg 2 (low/high), read/write, offset 20080h (PIO update thresholds and PIO send complete bits per host port)
hp_dma_thres: reg 3, read/write, offset 200C0h (one 8 bit DMA threshold per host port)
hp_timeout: reg 7, read/write, offset 201C0h (one 8 bit PIO timeout value per host port)
hp_state: reg 10, read-only, offset 20280h (PIO rcv tags and PIO snd last mode per host port)

• hp_cntl[56, 48, 40, 32] can be used to disable the immediate forwarding of data written into the PIO send data FIFO towards the network. If set, the PIO send unit waits until all data of the message is written to the FIFO. This prevents the insertion of incomplete messages into the network in case the user process is interrupted (normal process scheduling, interrupt service) or simply crashes due to an error in the program code

• the 8 bit values specified in the hp_dma_thres register determine the mode used to receive a message. This threshold value is compared to the length of an incoming message in the receive part of a host port, or more specifically, to the length of the data frame. The message is forwarded to the DMA-receive unit if this length is greater than or equal to the threshold value. Setting the threshold to 0 thus pipes all messages to the DMA-receive unit, disabling the PIO mode for receiving messages. Since length and threshold are specified in terms of 64 bit data words, the largest message that can be received via the PIO mode is 254*8=2032 byte, if the threshold is set to FFh
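The PIO/DMA decision on the receive side thus reduces to a single comparison; a sketch (the function name is hypothetical), with lengths in 64 bit words as described above:

```c
#include <stdint.h>
#include <stdbool.h>

/* Decide the receive path for an incoming message, following the
 * hp_dma_thres rule: data-frame lengths >= threshold go to the
 * DMA-receive unit, shorter messages are delivered via PIO.
 * A threshold of 0 forces DMA for everything. */
static inline bool use_dma_receive(uint32_t data_words, uint8_t threshold)
{
    return data_words >= (uint32_t)threshold;
}
```

With the maximum threshold of FFh, messages of up to 254 data words (2032 byte) still take the PIO path.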

Figure 3-17. Loading the PIO-receive time-out counter

(the 8 bit time-out value is spread over every second bit of the upper 16 bits of the 32 bit time-out counter; all other counter bits are loaded with 0)

• the hp_timeout register specifies an upper limit for the time data may sit in the data FIFO of the PIO-receive unit without being read. This mechanism shall prevent data from clogging the host port and blocking a path backwards into the network. An internal 32 bit counter is loaded with the 8 bit time-out value each time a data word is shifted into or out of the FIFO. To span a large time segment, every second bit of the upper 16 bits is loaded with a bit of the time-out value, as shown in Figure 3-17. This makes it possible to configure a time-out of up to 5.7 s, in steps of 262 us, assuming a 4 ns clock cycle. Each cycle in which nothing happens, the counter is decremented. If it runs down to 0, an interrupt is generated to force the host to pull the data out of the host port

• the hp_state register provides some information about the status of the PIO units. In case of an interrupted or incomplete PIO-based message transfer, one must be able to recover the situation, avoiding the need to reset the complete device. So driver software needs to be able to detect in which state the PIO unit was left by a user process. In case of the PIO-Send module, this register shows the mode of the last access to the unit. More specifically, it shows the tags of the last data word written to the FIFO. According to the tags, driver software is able to complete the message by writing the missing data to the FIFO. In case of the PIO-Receive unit, one simply sees the tags of the data word at the head of the FIFO. So software is able to pull out the rest of a message word by word. The 4 bit tags are {Last, Data, Header, Route}, according to the three frames a message is composed of, and an additional bit to mark the last word of a frame
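The spreading of the hp_timeout value shown in Figure 3-17 can be reproduced in software. A sketch, assuming bit i of the 8 bit value lands on counter bit 16+2i, which matches the stated 262 us granularity (2^16 cycles of 4 ns) and the 5.7 s maximum for FFh:

```c
#include <stdint.h>

/* Expand the 8-bit hp_timeout value into the 32-bit counter load
 * value, placing the bits on every second position of the upper
 * 16 bits (Figure 3-17). With a 4 ns core clock this gives time-outs
 * in steps of 2^16 * 4 ns = 262 us, up to about 5.7 s for FFh. */
static inline uint32_t timeout_counter_load(uint8_t timeout)
{
    uint32_t load = 0;
    for (int i = 0; i < 8; i++)
        if (timeout & (1u << i))
            load |= 1u << (16 + 2 * i);
    return load;
}
```

For FFh the load value becomes 55550000h, i.e. 1431633920 cycles, which at 4 ns per cycle is roughly 5.7 s.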


Figure 3-18. Interrupt registers

irq_timeout: reg 4, read/write, offset 20100h (32 bit IRQ timeout value)
irq_mask: reg 5, read/write, offset 20140h (mask bits corresponding to irq_case)
irq_clear: reg 6, read/write, offset 20180h (clear corresponding IRQ bit in irq_case)
hp_irq_case: reg 13, read-only, offset 20340h (per host port: bits 4:0 unused, bit 5 access to empty fifo, bit 6 data region full, bit 7 descriptor table full)
irq_case: reg 14, read-only, offset 20380h (link, host port and crossbar interrupt causes)

Figure 3-18 gives an overview of all registers related to the generation of interrupts. There are various sources for interrupt generation, and a user is able to mask those conditions which should not result in a system interrupt. Each register is described in detail in the following:

• the irq_timeout register is used in case the host system has crashed and is not able to serve an interrupt. In such a case, the node should not block communication in the rest of the cluster by letting pending incoming messages block paths backwards into the network. In case of such a severe system failure, the device should simply consume all incoming data traffic. So every time an interrupt request is signaled by the device, an internal counter is loaded with this time-out value. If it runs down to 0 without any reaction from the host side, it is assumed that the host is down. This information is then flagged to all units

• the irq_mask register is used to mask specific interrupt sources. The bits correspond to the IRQ bits in the irq_case register. An interrupt is masked, when its bit is set to 0

• the irq_clear register is utilized to clear a corresponding interrupt in the irq_case register. Writing a 1 to a bit clears the interrupt

• the hp_irq_case register offers some more information in case an interrupt is generated by a host port. It flags a read access to an empty data FIFO in the PIO-receive unit. In case of the DMA-receive module, two data buffers in main memory can be filled to a level where no more messages can be written out to memory. These are the descriptor table and the data region. These events are flagged by this register. The remaining 5 bits of the byte used for each host port were used in previous versions, but late modifications rendered them unnecessary
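The interplay of irq_case, irq_mask and irq_clear can be modeled with plain variables; a sketch of the described semantics (the hardware performs this masking itself, the struct here only mirrors it, and all names are hypothetical):

```c
#include <stdint.h>

/* Model of the interrupt registers: irq_case latches events,
 * irq_mask gates them (0 = masked), irq_clear is write-1-to-clear. */
struct atoll_irq {
    uint32_t irq_case;
    uint32_t irq_mask;
};

/* Only unmasked, latched events raise the interrupt line. */
static inline uint32_t irq_pending(const struct atoll_irq *r)
{
    return r->irq_case & r->irq_mask;
}

/* Effect of writing 1s to irq_clear: the latched bits are dropped. */
static inline void irq_ack(struct atoll_irq *r, uint32_t bits)
{
    r->irq_case &= ~bits;
}
```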

All possible interrupts are flagged by the irq_case register. The specific bits are described in the following:

• irq_case[3:2] are set if error conditions occur in one of the input or the output paths of the crossbar

• irq_case[7:4] flag an accumulated number of bit errors and link packet retransmissions on a link. The actual limit for the generation of this interrupt is controlled via the two lp_retry_mux bits in the hw_cntl register

• irq_case[11:8] are used to flag an error in the PIO-receive units of a host port. They were intended to be the ‘logical OR’ of all possible PIO error conditions, which are then specified in the hp_irq_case in detail. But due to late modifications all but one error condition were dropped, so these bits are basically the same as the hp[x] irq case[5] bits, showing a read access to an empty PIO-receive data FIFO

• irq_case[15:12] show a general error in the DMA-receive units of a host port. They are the ‘logical OR’ of the two error conditions specified in the hp_irq_case register

• irq_case[23:20] and irq_case[19:16] observe the status of link active signals and are set, if a link cable is plugged or unplugged


• irq_case[31:28] and irq_case[27:24] observe the status of the associated link clocks and flag their activity or failure

Figure 3-19. Link retry register

lp_retry: reg 11, read-only, offset 202C0h (one 8 bit retry counter per link port)

The lp_retry status register, as shown in Figure 3-19, provides for each link the value of an 8 bit accumulator counting bit errors on links and the corresponding link packet retransmissions. They can be used to measure bit error rates over a greater time range.

Besides the registers presented over the last pages, there are other, less important ones, which are not described in detail here. There is an additional register gp_io to control 8 general purpose I/O pins. And some registers are used to debug the internal status of the crossbar. More specific information about them can be looked up in the ATOLL Hardware Reference Manual [67].

In the following, a typical interrupt service sequence is given:

• assuming the descriptor table of host port 3 is full and the receiving part is not able to spool out a new message to memory, the host port would flag this event to the interrupt control unit

• this unit compares the event with the corresponding bit in the irq_mask register to make sure the event is not masked

• it then raises the interrupt line, which is driven out to the PCI-X bus by the PCI-X bus interface

• a system trap calls the ATOLL driver software to deal with the interrupt

• the driver software checks the irq_case register to figure out the reason for the interrupt

• it allocates memory space for the descriptor table, either by enlarging it or by moving messages into a temporary buffer


• the interrupt is then cleared by the software by writing to the corresponding bit of the irq_clear register. The host port resumes processing messages as soon as there are again free slots in the descriptor table

3.6 Host port

Figure 3-20. Structure of the host port

(block diagram: PIO-send, DMA-send, PIO-receive and DMA-receive units, each with a 64x68 data FIFO; a status/control register file and the replicator with its event buffer; hp2np and np2hp paths connecting the port interconnect side with the network port)

The host port [68]1 is the main building block of the ATOLL chip. It is responsible for data transfer from or to the network, either by PIO or DMA. Each host port has a small working set of some status and control registers needed to access data structures residing in main memory. Data paths for sending and receiving messages are strictly separated, offering the possibility of multiple concurrent data transfers into and out of the network. Each mode is also handled separately, so four unique modules are used to handle data

1. Berthold Lehmann implemented an early simulation model of the host port

transfer: PIO-send, PIO-receive, DMA-send and DMA-receive. Another unit keeps the working set and interfaces with all transfer modules. A sixth module called replicator is responsible for updating relevant status information residing in main memory. This gives a CPU fast and cacheable access to information without the need to frequently poll the device. Finally, two units are used to control the access of the host port to the interfacing network port. In case of sending messages, the access must be multiplexed between the PIO- and DMA-send units. For incoming messages, one must forward them either to the PIO- or the DMA-receive unit. This decision is configurable and relies on the size of the incoming message. Figure 3-20 depicts the overall structure of a host port.

Figure 3-21. Interface between host and network port.

is_last: last word of a frame
is_data: data frame
is_header: header frame
is_route: routing frame
data[63:0]: message data
valid: data is valid
stop: no space for data

The interfaces towards the port interconnect correspond to the ones between the port interconnect and the synchronization interface described earlier. So DMA-receive and replicator use a Master-Write interface, DMA-send uses the Master-Read interface, PIO-send the Slave-Write interface, and the PIO-receive unit utilizes a Slave-Read interface. On the network port side, the data protocol is even simpler. A stream of tagged 64 bit words is transferred via the previously mentioned two-way handshake signaling. Two independent interfaces handle incoming and outgoing data traffic in parallel. Figure 3-21 depicts the interface signals. The tags associate each data word with one of the frames of an ATOLL message, and additionally mark the last word of a frame.

3.6.1 Address layout

Each host port occupies 4 consecutive 8 Kbyte pages in the address space of the ATOLL device, as described earlier in “Address space layout” on page 56. These are used to transfer data in PIO mode by read/write accesses to specific addresses. For the DMA-based data transfer, only a small working set is controlled via this address area. Figure 3-22 shows the general use of each page, whereas the detailed address layout of each separate page is described later in the corresponding sections. As stated earlier, ATOLL uses 8 Kbyte pages, but leaves the upper 4 Kbyte unused. Different pages might need different memory control/caching options, and this layout offers the possibility to support a wide variety of architectures with different page sizes.


Figure 3-22. Host port address layout

PIO-send: host port base address
send control: +02000h
PIO-receive: +04000h
receive control: +06000h
(each 8 Kbyte page leaves its upper 4 Kbyte unused)

3.6.2 PIO-send unit

Programmed I/O is intended as a fast and efficient way to communicate small amounts of data. A lot of parallel applications tend to transfer only a few bytes per message, spreading global parameters or exchanging border values of a grid-based computation. A DMA-based data transfer involves copying data into buffers, setting up a job descriptor and waiting for the device to complete the job. This overhead can be ignored for larger messages, but adds up for small ones, significantly decreasing performance. So a method is needed to transfer a few words with nearly no overhead at all. These ideas led to the implementation of the PIO-send mechanism found in the ATOLL chip.

Figure 3-23. Mapping a linear address sequence to a FIFO

(writes to the linear address sequence 0000h, 0008h, 0010h, 0018h push the data words D0, D1, D2, D3 in order into the send FIFO)


The FIFO for keeping the data to be sent is directly made accessible to the user. The PCI-X bus is most efficiently used with burst transfers, combining as much data as possible into a single transaction. So pushing data into the FIFO is done by writing the data to a linear sequence of addresses. This allows the data to assemble itself in CPU and chipset write buffers to form a large burst transfer. Since the PCI-X bus delivers data strictly in order, there is no chance of data getting mixed up by out-of-order delivery. This mechanism is shown in Figure 3-23.

Figure 3-24. Layout of PIO-send page

routing: page base address, 64 words
header: +0200h, 64 words
data: +0400h, 256 words
last routing: +0C00h, 32 words
last header: +0D00h, 32 words
last data: +0E00h, 32 words
unused: +0F00h, 32 words

But in addition to the raw data, the PIO-send unit also needs to know the specific tags of each data word, according to the framing of ATOLL messages. This is done by providing different regions inside the page for each frame. And since the last word of each frame must be marked as well, there are separate addresses for only the last words of a frame. Most of a message is normally payload data, so this area has been assigned the largest address region. All address areas are depicted in Figure 3-24, together with their start offset relative to the page base address and their size in terms of 64 bit words. Since the FIFO has a depth of 64 words, all areas are sufficient to serve the largest bursts possible.

Figure 3-25 gives an overview of the structure of the PIO-send module. Apart from normal message data, this module also processes data which should be written into the status/control register file of the host port. So incoming data is first analyzed, and then handed over to the register file or pushed into the send FIFO. As soon as data is pushed into the FIFO, control logic requests access to the network port. If the access is granted, data is forwarded to the network port. This immediate forwarding of the message can be delayed until a full message has been written to the FIFO, using the configuration bit in the hp_cntl register described earlier. A separate unit monitors the fill level of the FIFO and passes this information on to the replicator and the register file. Assuming a message with one routing word, 4 header words and 8 data words, a typical sequence of sending a message via the PIO mode is as follows:

• first, the user checks the FIFO fill level to make sure that there is enough space left for all message data

• the single routing word is written to the ‘last routing’ area

• 3 header words are written to the ‘header’ area, the 4th word to ‘last header’

• 7 data words are written to the ‘data’ area, the 8th word to ‘last data’
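The write addresses for this example sequence follow directly from the page layout of Figure 3-24. A sketch computing the target area offset for a word (the function name and enum are hypothetical; offsets are relative to the PIO-send page base):

```c
#include <stdint.h>

/* Frame type of a word being pushed into the PIO-send page. */
enum frame { ROUTE = 0, HEADER = 1, DATA = 2 };

/* Return the page offset of the area a word must be written to,
 * per Figure 3-24: normal areas at 000h/200h/400h, the areas that
 * additionally mark the last word of a frame at C00h/D00h/E00h. */
static inline uint32_t pio_send_offset(enum frame f, int is_last)
{
    static const uint32_t area[3]      = { 0x000u, 0x200u, 0x400u };
    static const uint32_t last_area[3] = { 0xC00u, 0xD00u, 0xE00u };
    return is_last ? last_area[f] : area[f];
}
```

Within an area, consecutive words of a frame are written to consecutive 8 byte addresses, so they can merge into a single burst on the way to the device.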

Figure 3-25. Structure of the PIO-send unit

(block diagram: data arriving via the Slave-Write interface of the port interconnect passes a message analyzer, which hands it to the status/control register file or pushes it into the 64 word x 68 bit data FIFO; a control FSM forwards FIFO data to the network port)

3.6.3 PIO-receive unit

First versions of the PIO-receive unit implemented the same method as used by the PIO-send module, mapping the data FIFO to a set of contiguous address areas. But problems come up with this approach when one considers prefetching data from the FIFO at various levels (chipset, CPU cache). Declaring the PIO-receive page as prefetchable is necessary to enable efficient data transfer between the device and the CPU. But one cannot assure that prefetched data is also consumed by the CPU. E.g., in case of an interrupt some prefetched data might get lost, since a read access to the FIFO always pops the data out of the FIFO as a side effect. So a subsequent reload of previously prefetched data results in delivery of wrong data.

Figure 3-26. Utilizing a ring buffer for PIO-receive

(the CPU directly reads data from the relevant slots of the ring buffer; a push of data from the network advances the write pointer, and after reading, the CPU advances the read pointer by writing its new value)

To deal with this problem, one needs to remove the side effect of the read access. Therefore, the decision was made to give the user direct control over the deletion of data from the FIFO. This is done by treating the FIFO as a pointer-controlled ring buffer based on RAM (in fact, that is exactly the implementation method for large FIFOs in ATOLL). So data is pushed into the FIFO by the interfacing network port, advancing the write pointer of the ring buffer. A user is able to read data from the ring buffer, and frees up the accessed slots of the ring buffer by writing a new value of the read pointer to the unit. This mechanism is shown in Figure 3-26.
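A software model of this pointer-controlled ring buffer, with the memory-mapped buffer replaced by a plain array (all names are hypothetical):

```c
#include <stdint.h>

#define RB_WORDS 64u  /* depth of the PIO-receive ring buffer */

/* Read n words starting at the current read pointer, then return the
 * new read pointer value the CPU must write back to the unit to free
 * the slots. Until that write happens, the hardware read pointer and
 * the FIFO fill level stay untouched, so an interrupted reader loses
 * nothing. */
static uint32_t rb_read(const uint64_t *rb, uint32_t read_ptr,
                        uint32_t n, uint64_t *out)
{
    for (uint32_t i = 0; i < n; i++)
        out[i] = rb[(read_ptr + i) % RB_WORDS];
    return (read_ptr + n) % RB_WORDS;
}
```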

Figure 3-27. Layout of PIO-receive page

(all addresses of the page are mapped to the 64 word ring buffer, which repeats at +0000h, +0200h, +0400h, ..., +0E00h)

The page assigned to the PIO-receive module now references the ring buffer, whereby the buffer is continuously mapped all over the page, as depicted in Figure 3-27. This is simply done by taking the lower 6 bits of the start address as offset into the ring buffer. This also enables a user to read across the upper border from word 63 to word 0 without interruption. The first address is taken as start offset, then the unit delivers data as long as the transaction is valid.

Figure 3-28. Structure of the PIO-receive unit

(block diagram: read requests arriving via the Slave-Read interface pass an address analyzer, which directs them to the status/control register file or to the 64 word x 68 bit ring buffer/FIFO filled by the network port; a control FSM manages the accesses)

Figure 3-28 shows the structure of the PIO-receive unit. An incoming address is first analyzed to detect whether the read request targets the register file or the data FIFO. In case of the register file, the relevant offset is simply forwarded and data is returned on the next cycle. When addressing the ring buffer, data is popped from it until the transaction is ended. But internally, the pop operation is done using a temporary FIFO pointer. The read pointer, which is also used to determine the fill level of the FIFO, is left untouched until it is explicitly updated by a write access to its entry in the host port register file.

3.6.4 Data structures for DMA-based communication

DMA-based communication is a good way to offload work from the main CPU. The CPU communicates with the device via a set of data structures and job descriptors, which exactly describe the work the device should do. There are two main data areas residing in main memory for DMA-based communication in ATOLL, called data region (DR) and descriptor table (DT). The data region is simply a place to assemble message data which is to be sent into the network by the device. This data region is needed, since ATOLL can only deal with physical addresses, and not with the virtual addresses of the source data locations in the user’s virtual address space. Though this is not impossible to implement, as demonstrated by Quadrics QsNet, its complex implementation was abandoned for the first version of ATOLL. So ATOLL needs a pinned-down memory area to temporarily buffer the message payload.

Figure 3-29. Data region

(layout: the DR base address is followed by the static routing table, reaching down to the lower bound; below it, the header/data area is used as a ring buffer indexed by read and write pointers, up to the upper bound)

Figure 3-29 depicts the layout of a data region. A base address marks the beginning, whereas an upper bound marks the end of the area. Read and write pointers are used to index into this region, which is used as a ring buffer. In contrast to header and data payload, the routing information of a message is normally static. At system start, all routing paths to all other nodes are computed and remain fixed during the application run. So to prevent copying the same data again and again into the data region, the decision was made to provide a small static area to keep the routing table. This is done by defining a lower bound, which is used as the next offset in case data hits the upper bound. Separate data regions exist for sending and receiving data, with the receiving data region lacking a lower bound. It is simply not needed, since incoming messages carry no routing information.

The second data area is the descriptor table, which keeps fixed-size job descriptors for each DMA message. The job descriptors consist of four 64 bit words. Three words are used to describe the three message frames, their offset relative to the base address of the data region and their length. The 4th word is an additional tag, which can be utilized by the software to distinguish between different protocols or message types.

Figure 3-30 shows the layout of a job descriptor. The offsets point to the associated frame data, relative to the base address of the data region. Lengths are given in terms of 64 bit words. The upper byte of the routing length is used as an additional command byte. Currently, only one special command is encoded. By setting the lowest bit of this byte to 1, one can mark a job as non-consuming. This means that data is read from the data region as normal, but the read pointer is not advanced afterwards. This offers the possibility to reference the data again. This method can be used to implement broadcasts on the sending side by copying the payload just once into the data region and referencing it with multiple descriptors.

Figure 3-30. Job descriptor

routing word: offset (bits 63:32), command byte (bits 31:24), length (bits 23:0)
header word: offset (bits 63:32), length (bits 23:0)
data word: offset (bits 63:32), length (bits 23:0)
tag word: free for use by software

The command byte is used on the sending side only. When receiving a message, this byte is used as a status byte to flag special events associated with the message. In the current version of ATOLL, only the lowest bit is used to mark incomplete messages. This event occurs when the number of words received from the network differs from the message length information, which is always at the start of the header frame of a message. E.g., this can occur when a user process core dumps while sending a message in PIO mode, and the driver software completes the message, but not with the right number of words.
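A job descriptor can be packed with a few helper definitions. Note that the exact field positions (offset in bits 63:32, command/status byte in bits 31:24, length in bits 23:0) are an assumption based on the bit markers of Figure 3-30, and all names are hypothetical:

```c
#include <stdint.h>

/* Command bit in the routing word (sending side): mark the job as
 * non-consuming, i.e. the DR read pointer is not advanced, so the
 * same payload can be referenced by multiple descriptors. */
#define DESC_NON_CONSUMING  0x01u

/* Pack one descriptor word: DR offset, command/status byte, length
 * in 64-bit words (assumed layout, see lead-in). */
static inline uint64_t desc_word(uint32_t offset, uint8_t cmd, uint32_t len)
{
    return ((uint64_t)offset << 32) | ((uint64_t)cmd << 24)
         | ((uint64_t)len & 0xFFFFFFu);
}

/* The four-word job descriptor of Figure 3-30. */
struct atoll_desc {
    uint64_t routing;  /* command byte is meaningful here only */
    uint64_t header;
    uint64_t data;
    uint64_t tag;      /* free for use by software */
};
```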

Similar to the data region, there are separate descriptor tables for the DMA-send and the DMA-receive units. As described earlier, a host port generates an interrupt when these resources are exhausted on the receiving side. Driver software should then always free up a large amount of memory space to make sure that messages waiting to be received do not block the network.

3.6.5 Status/control registers

The status/control registers needed for a host port are stored in a separate register file. It interfaces with all other components of a host port to gather the status data needed and spreads out the control information to the units. A separate page is used for the send and the receive part, but only about a dozen registers are implemented for each page. Most of them store the base addresses and offsets needed for DMA-based message transfer, whereas only a few are needed to control the PIO mode.

Table 3-5 lists all registers of the send page. The offsets are relative to the page base address. All registers are 64 bit words, but not all bits might be relevant. Only the base addresses occupy all 64 bits, the other offsets are 32 bit wide. And the fill level of the PIO-send FIFO fits into the lower 6 bits of the associated register.

Table 3-5. Send status/control registers of a host port

register name        offset  mode, width  comment
snd_DT_base          0000h   r/w, 64      base address of the DT
snd_DT_read_ptr      0008h   r/w, 32      read pointer of the DT
snd_DT_write_ptr     0010h   r/w, 32      write pointer of the DT
snd_DT_upper_bound   0018h   r/w, 32      upper bound of the DT
snd_DR_base          0020h   r/w, 64      base address of the DR
snd_DR_lower_bound   0028h   r/w, 32      lower bound of the DR
snd_DR_read_ptr      0030h   r/w, 32      read pointer of the DR
snd_DR_write_ptr     0038h   r/w, 32      write pointer of the DR
snd_DR_upper_bound   0040h   r/w, 32      upper bound of the DR
snd_repl_base        0048h   r/w, 64      base address of the replicator
snd_fifo_fill_level  0800h   ro, 6        fill level of the PIO-send data FIFO

Table 3-6 lists all registers of the receive page. The DMA registers are almost the same as for the send page, except for the unnecessary lower bound register for the data region.

Table 3-6. Receive status/control registers of a host port

register name        offset  mode, width  comment
rcv_DT_base          0000h   r/w, 64      base address of the DT
rcv_DT_read_ptr      0008h   r/w, 32      read pointer of the DT
rcv_DT_write_ptr     0010h   r/w, 32      write pointer of the DT
rcv_DT_upper_bound   0018h   r/w, 32      upper bound of the DT
rcv_DR_base          0020h   r/w, 64      base address of the DR
rcv_DR_read_ptr      0028h   r/w, 32      read pointer of the DR
rcv_DR_write_ptr     0030h   r/w, 32      write pointer of the DR
rcv_DR_upper_bound   0038h   r/w, 32      upper bound of the DR
rcv_fifo_fill_level  0800h   ro, 6        fill level of the PIO-receive data FIFO
rcv_fifo_read_ptr    0808h   r/w, 6       read pointer of the PIO-receive data FIFO
semaphore            FC00h   r/w, 64      semaphore, set on read as side effect
cnt                  FE00h   ro, 64       global counter

Besides the fill level of the PIO-receive data FIFO, the receive page also includes a register for the FIFO read pointer. Writing a new value to it is equivalent to shifting out the previously read data words. Each host port has also read access to the internal global counter, which is located in the supervisor initialization registers described earlier. It is simply spread out to all host ports, so user applications can e.g. precisely time certain ATOLL library calls.

Another special feature is the semaphore register. It is intended to be utilized in case the host port needs to be multiplexed between several applications. This is normally not the case for production-type clusters, since performance is degraded significantly. But it might make sense for application development or debugging purposes. A write access simply writes the specified data to the register. But a read access returns the stored value and, as a side effect, sets all bits to 1. So one could initialize the semaphore by writing a 0 to it. When several user processes try to read the register, only one gets the 0 as value; all others see all bits set to 1. So the one with the 0 has acquired the semaphore, all others must wait until the locking application frees it by writing a 0 back to the register.
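The described read side effect yields a simple test-and-set lock. Modeled here with a plain variable in place of the memory-mapped register (all names are hypothetical; the real register makes the read-and-set step atomic in hardware):

```c
#include <stdint.h>

/* Model of the semaphore register: a read returns the stored value
 * and, as a side effect, sets all bits to 1. */
static uint64_t sem_read(uint64_t *sem)
{
    uint64_t old = *sem;
    *sem = ~0ull;          /* hardware side effect of the read */
    return old;
}

/* A process owns the lock when its read returned 0. */
static int sem_try_acquire(uint64_t *sem)
{
    return sem_read(sem) == 0;
}

/* Release by writing 0 back; a plain write just stores the value. */
static void sem_release(uint64_t *sem)
{
    *sem = 0;
}
```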

3.6.6 DMA-send unit

Figure 3-31. Control flow of sending a DMA message

(flow: the application writes header and data to the data region, creates a new job descriptor in the descriptor table, and triggers the host port by updating the DT & DR write pointers; the DMA-send unit loads the job descriptor from the DT, loads all 3 frames from the DR and forwards the message to the network, then frees the memory by updating the DT & DR read pointers in the host port)

The DMA-send unit is responsible for observing the data areas for sending messages and for processing all valid entries in the send descriptor table. If a job is available, it loads the descriptor and subsequently all message frame data. As soon as requested data returns from the PCI-X interface, it is forwarded to the network port. Figure 3-31 shows the general control flow of sending a message via the DMA mode, for both the user application and the DMA-send unit. The application triggers the DMA-send control logic by writing new values of the write pointers for DT & DR into the host port. These are continuously monitored by the host port. As soon as the read and write pointers of the descriptor table are unequal, a valid job entry is present and is processed.
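The trigger condition the host port monitors can be sketched as a simple ring-buffer pointer comparison. Pointer width and wrap-around behavior here are illustrative assumptions, not the actual ATOLL register layout:

```c
#include <stdint.h>

/* Hypothetical sketch of the DMA-send trigger: a job is pending whenever
 * the read and write pointers of the send descriptor table (DT) differ. */
typedef struct {
    uint32_t dt_read_ptr;   /* advanced by the DMA-send unit    */
    uint32_t dt_write_ptr;  /* advanced by the user application */
} host_port_t;

/* Application side: posting a descriptor means bumping the write pointer. */
void post_descriptor(host_port_t *hp, uint32_t num_entries) {
    hp->dt_write_ptr = (hp->dt_write_ptr + 1) % num_entries;
}

/* Device side: a valid job entry is present while the pointers are unequal. */
int job_pending(const host_port_t *hp) {
    return hp->dt_read_ptr != hp->dt_write_ptr;
}
```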

Figure 3-32. Structure of the DMA-send unit


Figure 3-32 depicts the structure of the DMA-send unit. It is split into two subdesigns. One part is responsible for requesting data from memory, labeled ‘job creation’ in the figure. As described earlier, the Master-Read interface implements a split-phase transaction scheme, in which blocks of up to 32 words can be requested. Each job is assigned a unique ID, with up to 8 jobs in progress. The ‘job creation’ part keeps copies of the relevant status/control registers in a working set module. This working set is continuously updated from the status/control register file of the host port.

As soon as an entry is detected in the descriptor table, it is requested from memory and stored in the working set. Step by step, the ‘job creation’ part then computes the load addresses from fixed base addresses and the offsets given in the descriptor. A set of jobs is generated to fetch all frame data from memory. The associated job IDs are pushed into a job queue. This is necessary since multiple host ports might request data in an arbitrary sequence, so each host port must keep track of the jobs it has requested to match them against the IDs of returning jobs.
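With at most 8 outstanding jobs, each carrying a small unique ID, tracking which returning IDs belong to a given host port can be as simple as a bitmask. This is an illustrative sketch of the bookkeeping, not the actual working-set implementation:

```c
#include <stdint.h>

/* Hypothetical job tracker: up to 8 outstanding read jobs, each with a
 * unique 3 bit ID, remembered in one bit of an 8 bit mask. */
typedef struct { uint8_t pending; } job_tracker_t;

void job_issued(job_tracker_t *t, unsigned id) {
    t->pending |= (uint8_t)(1u << id);          /* mark ID as ours     */
}

int job_is_mine(const job_tracker_t *t, unsigned id) {
    return (t->pending >> id) & 1u;             /* does this ID match? */
}

void job_returned(job_tracker_t *t, unsigned id) {
    t->pending &= (uint8_t)~(1u << id);         /* ID is free again    */
}
```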


The second part is called ‘job completion’ and is responsible for accepting the data the other part has requested. It therefore monitors the IDs of incoming jobs to detect the ones it should process. Incoming data is then placed in a FIFO, marked with its corresponding tags. As soon as data is in the FIFO, this part requests access to the network port. If granted, data is transferred towards the network. Since the FIFO can only store up to 64 words, the unit cannot request more data than that. Doing so could cause all data paths to be blocked in case the path towards the network is not available. This can happen when the PIO-send unit is also transferring a message, or when the path through the crossbar is occupied by a message just passing through the ATOLL chip.

3.6.7 DMA-receive unit

Figure 3-33. Control flow of the DMA-receive unit


The DMA-receive unit is the counterpart to the DMA-send unit. It receives message data from the network port and spools it out to the DMA data structures in main memory. Figure 3-33 shows the main control sequence of the unit. An incoming message consists only of the header and the data frame, since the routing frame is no longer needed.

Similar to the DMA-send unit, the DMA-receive module keeps the relevant registers in a small working set. It then writes header and data out to memory. After all message data has been processed, it assembles a descriptor pointing to the data spooled out to memory. This job descriptor is then stored in the receive DT, which is monitored by the user application. The write pointers of both DT & DR are updated to reflect the newly arrived message.


Figure 3-34. Structure of the DMA-receive unit


Figure 3-34 gives an overview of the structure of the DMA-receive module. Incoming data is first stored in a FIFO. Addresses are computed for each data word from the register values of the working set. Before handing data over to the Master-Write interface of the port interconnect, the control logic makes sure that enough memory space is available to store the message in the DMA-receive data structures. If this is not the case, the unit generates an interrupt and waits until software frees up memory and updates the relevant read pointers of DT & DR.

Though each data word is transferred together with its address, it makes no sense to mix multiple data streams from several active DMA-receive units on a word-by-word basis. So as soon as data enters the FIFO from the network side, the unit requests exclusive access to the port interconnect. If granted, it keeps the request asserted until no more data is available. Only then is the request deasserted, and other host ports may get access to the port interconnect. This way a strong fragmentation of the data stream is avoided, increasing the likelihood of assembling larger burst transfers.

3.6.8 Replicator

The replicator is used to avoid polling status registers of the ATOLL device, which could degrade overall system performance by interfering with data transfer bursts. Those registers needed by application software are automatically written out to a separate portion of main memory. There are 8 registers in total written out, separated into two groups of 4 registers, one for send and one for receive operations. An update consists of all 4 registers of each block. New register values are handed over to the replicator, and additional trigger signals force it to update the values in memory.

The replicator uses the same Master-Write interface of the port interconnect as the DMA-receive unit. Table 3-7 shows the layout of the mirrored registers in main memory. Each host port can be assigned a unique base address for its replicator through the associated register in its control/status register file. The offsets given are relative to this base address. Though located in normal main memory, those locations are declared as read-only, since they should not be altered by software (except for initialization at system start-up).

Table 3-7. Layout of the replicator area

register name         offset  mode, width  comment
snd_DT_read_ptr       0000h   ro, 32       read pointer of the send DT
snd_DR_read_ptr       0008h   ro, 32       read pointer of the send DR
snd_fifo_fill_level   0010h   ro, 6        fill level of the PIO-send data FIFO
semaphore             0018h   ro, 64       current semaphore value
rcv_DT_write_ptr      0020h   ro, 32       write pointer of the receive DT
rcv_DR_write_ptr      0028h   ro, 32       write pointer of the receive DR
rcv_fifo_fill_level   0030h   ro, 6        fill level of the PIO-receive data FIFO
rcv_fifo_read_ptr     0038h   ro, 6        read pointer of the PIO-receive data FIFO

The read and write pointers of the DMA modes are updated on each change. This is also true for the semaphore and the read pointer of the PIO-receive FIFO. But as described in previous sections, the update frequency of the FIFO fill levels can be configured. So when using these registers in the replication area, one must keep in mind that the real value might differ from the mirrored one by up to the threshold value. E.g., if an update is made every 8 words and the mirrored FIFO fill level is 32, then the real FIFO fill level inside the host port can be anywhere in the range 25-39. Even worse, there can be pending updates in the device waiting to be transferred to main memory. This might happen if the data paths are congested due to multiple active host ports.

So before interpreting the value in the replication area, an application should once read the value directly from the device to make sure that it matches the mirrored one. And when doing calculations with these values, e.g. to check whether the PIO-send FIFO has enough space left for a message to send, the software should take the threshold into account.
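The threshold-aware space check suggested above can be sketched as follows. With an update threshold of T words, the real fill level may deviate from the mirrored one by up to T - 1, so a conservative check assumes the worst case; the function name and parameters are illustrative:

```c
#include <stdint.h>

/* Conservative check whether a PIO-send message fits into the FIFO,
 * given only the mirrored fill level from the replication area.
 * (Illustrative sketch; depth/threshold values are parameters.) */
int pio_send_fits(uint32_t mirrored_fill, uint32_t fifo_depth,
                  uint32_t threshold, uint32_t msg_words) {
    /* worst case: the FIFO holds threshold - 1 more words than mirrored */
    uint32_t worst_fill = mirrored_fill + (threshold - 1);
    if (worst_fill > fifo_depth)
        worst_fill = fifo_depth;
    return (fifo_depth - worst_fill) >= msg_words;
}
```

With the example from the text (threshold 8, mirrored fill 32, depth 64), the worst-case fill is 39, leaving 25 words of guaranteed free space.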


3.7 Network port

The network port converts the tagged data stream of 64 bit words into a byte-wide stream according to the link protocol defined for ATOLL, and vice versa. It is split into two separate units for sending and receiving messages. There is no interaction between these two units, so both can process messages concurrently. The sending part also generates CRC values for error detection later in each network stage.

3.7.1 Message frames and link protocol

As stated earlier, an ATOLL message consists of three frames: routing, header and data. The routing frame is consumed while the message is routed towards its destination, so only the header and data frames enter the receiving path of a network port. Each frame consists of several link packets, which can contain up to 64 data bytes each. Only full 64 bit words are used to form a message, so the number of data bytes is always a multiple of 8. At least one data word must be present in every frame to ease the implementation.

Figure 3-35. Message frames


Figure 3-35 gives an overview of the framing of ATOLL messages and the link protocol used for the byte-wide data stream in the network. Separate control bytes are used to enclose the normal message data bytes and to ensure correct framing. An additional ninth bit is used to distinguish between control and data bytes (0 = data, 1 = cntl).

There are 4 control bytes used at the network port to build up the framing of message data:

• SOF (Start Of Frame) is the first byte of a new frame


• a CRC value is computed for each link packet in the header and data frames. It is appended to the end of the link packet

• EOP (End Of Packet) marks the end of a link packet

• EOM (End Of Message) marks the end of the whole message, and replaces the EOP byte of the last link packet

Link packets in the routing frame do not have a CRC value attached, since one must react to bit errors in routing bytes immediately. This is handled by using an error detecting code for the routing bytes, which is discussed in detail later on.

Each frame can be composed of several link packets, but in the normal case a single link packet is enough for the routing and the header frame. E.g., 64 routing bytes are sufficient to define routing paths in a 2D grid of 32 x 32 = 1024 nodes. A special constraint exists for the header frame. Its first word must be the length of the message, given as separate 32 bit values for the header and data frames. The second word must be the tag. This is necessary to give the host port at the receiving side the possibility to check autonomously whether the message should be received in PIO or DMA mode. As stated earlier, this decision is made based on the length of the message and a configurable threshold. The DMA-send unit automatically ensures this constraint, but software using the PIO-send mode has to take care of it, too.
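The header-frame constraint and the resulting PIO/DMA decision can be sketched as follows. The field order within the first word and the struct layout are assumptions for illustration only:

```c
#include <stdint.h>

/* Hypothetical view of the first two 64 bit words of a header frame:
 * word 1 carries the two 32 bit frame lengths, word 2 the tag. */
typedef struct {
    uint32_t header_len;   /* length of the header frame (word 1, assumed order) */
    uint32_t data_len;     /* length of the data frame   (word 1, assumed order) */
    uint64_t tag;          /* message tag                (word 2)                */
} msg_header_start_t;

/* Receive-side decision: messages above a configurable length threshold
 * are received in DMA mode, shorter ones in PIO mode. */
int use_dma(const msg_header_start_t *h, uint64_t threshold) {
    return ((uint64_t)h->header_len + h->data_len) > threshold;
}
```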

3.7.2 Send path

Figure 3-36. Structure of the send network port unit



Figure 3-36 depicts the overall structure of the send unit of the network port. It can be viewed as a 3-stage pipeline consisting of an input, a processing and an output stage. The host port delivers a stream of tagged 64 bit words. This data is temporarily stored in a small FIFO. The following processing stage shifts the data out word by word and forms a stream of bytes, tagged with the cntl/data flag bit. It inserts control bytes according to the link protocol where necessary. A CRC generator computes the CRC value of the processed data. After a full link packet has been processed, this CRC value is appended and the link packet is completed by an EOP or EOM control byte. This stream of link data is pushed into an output FIFO, which keeps the data until it is transferred towards the crossbar.

3.7.3 Receive path

Figure 3-37 shows the structure of the receive unit inside the network port. It basically consists of the same 3-stage processing pipeline as the send part, just in the opposite direction. Data assembles in a small input FIFO. The following protocol stage strips the control bytes off the link data stream and rebuilds a stream of tagged 64 bit words. There might be some unused routing bytes left at the head of the message. This happens e.g. when only 3 network hops are needed: since frames are made of whole 64 bit words, 5 of the 8 routing bytes are unused. These are simply dropped until the header frame begins.

Figure 3-37. Structure of the receive network port unit


The output stage is a bit more complex than in the send unit. Incoming link packets might be marked as corrupted, if transmission errors were discovered somewhere along the path the message took through the network. These packets end with a special error control byte EOP_ERR (End Of Packet ERRor) instead of the normal EOP byte. Since such packets are automatically retransmitted to the network stage which detected the error, the same link packet follows the corrupted one in the data stream. So the corrupted link packet needs to be filtered out at the receive unit of the network port. This means data pushed into the output stage cannot be forwarded to the host port prior to the complete reception of a correct link packet.

Providing only one output FIFO would result in a significant performance penalty, since the data stream would be repeatedly stopped and restarted. This would happen because pushing data into the FIFO and popping data from it could not be overlapped: in case of erroneous link packets one would otherwise need to keep track of which words to transfer and which ones to delete.

To keep the data flowing, a second FIFO was implemented, and the data path switches between the two on each link packet. The normal operation of this unit is then as follows:

• a link packet is processed and pushed into FIFO 0

• after making sure it is valid, the next incoming packet is pushed into FIFO 1 and the control logic of the output stage is ordered to forward the data in FIFO 0

• while the second link packet is processed, the first one is transferred to the host port

• after finishing the second packet, FIFO 0 is empty again and the roles are switched once more
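The alternation described in the steps above can be sketched as a small behavioral model. This is purely illustrative and not the actual RTL; a corrupted packet is dropped simply by resetting the FIFO it was pushed into:

```c
#include <stdint.h>

/* Behavioral sketch of the double-buffered output stage. */
typedef struct { int fill; int valid; } fifo_t;

typedef struct {
    fifo_t f[2];
    int in_sel;           /* index of the FIFO currently being filled */
} rx_output_t;

/* Called at the end of each link packet: a CRC-correct packet is marked
 * valid for the host port and the roles flip; a corrupted packet is
 * discarded so the retransmitted copy can take its place. */
void packet_done(rx_output_t *rx, int crc_ok) {
    if (crc_ok) {
        rx->f[rx->in_sel].valid = 1;   /* ready to drain to host port */
        rx->in_sel ^= 1;               /* fill the other FIFO next    */
    } else {
        rx->f[rx->in_sel].fill = 0;    /* drop the corrupted packet   */
    }
}
```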

Simulations showed a significant gain in the sustained throughput of the unit when using two output FIFOs instead of just one. With only one FIFO, a few wait cycles were introduced in the network port. Even more serious, these short blockages of the data stream triggered the flow control of previous links, causing larger idle times on the links along the data path.

3.8 Crossbar

The crossbar [69]1 is a full-duplex 4x4 switch, providing an all-to-all connection between the 4 network ports on one side and the 4 link ports on the other side. It is made of 8 identical crossbar ports, each of which is split into an input and an output unit. An additional debug interface observes the status of the crossbar and can be utilized to resolve major failures, like deadlocks or misrouted messages. The overall structure of the crossbar is shown in Figure 3-38.

The input unit of a crossbar port strips off the first routing byte of an incoming message and sets the request for the addressed crossbar port. This can be any of the 8 ports, including itself. It then waits for the other side to grant access and forwards data until the end of the message is reached. To avoid monopolizing a specific crossbar port, each input unit deasserts its request for at least 3 cycles between back-to-back messages. This gives other ports a chance to request the port.

1. designed and implemented by Prof. Dr. Ulrich Brüning and Jörg Kluge

Figure 3-38. Structure of the crossbar


The output unit of a crossbar port arbitrates its data path in a fair round-robin fashion. Once a request is served, data flows into the unit and is buffered in a small FIFO. If the following unit signals its ability to accept data, the message is transferred out of the crossbar. All interfaces use the two-way handshaking introduced earlier.

The additional debug interface observes the status of the crossbar and can be accessed via the debug registers listed in “ATOLL control and status registers” on page 71. The detailed implementation of the crossbar and the use of the debug interface are beyond the scope of this document. Further information can be found in the ATOLL Hardware Reference Manual [67].


3.9 Link port

The link port is the gateway to the network. It directly drives and receives the signals on the link cables. Like all other top-level building blocks of ATOLL, it is split into independent units for sending and receiving data. Its main task is to ensure proper and error-free transmission of data. This includes a reverse flow control protocol to prevent buffer overflow on the receiving side in case of blocked data paths.

A unique feature of an ATOLL link is its per-link error detection and correction. In contrast to the end-to-end error detection and software-driven retransmission found in other networks, ATOLL link packets are checked in each network stage. Retransmission is handled completely in hardware and occurs immediately on the link where the error was introduced. This provides an extremely fast way to deal with the rare bit errors caused by environmental influences on the link cables.

3.9.1 Link protocol

Two types of bytes make up the link data stream: data and control bytes. They are differentiated by an additional ninth bit. Data bytes themselves can be separated into two classes: routing bytes and normal payload bytes of the header and data frames. A routing byte is special, since it carries the information about the output port its message should take at the next crossbar. A bit error in a routing byte would result in a catastrophic failure of the network. The message would take a different path than specified, causing it to arrive at a false destination or, even worse, to end up stranded in some network stage due to missing routing information. So one must react immediately to errors in routing bytes; the normal CRC link packet protection scheme of the other two frames is useless here.

Therefore, the upper bit of a routing byte is a parity bit calculated from the remaining 7 bits. This offers the opportunity to detect a single-bit error immediately. A routing byte found to be faulty is replaced with a special CANCEL control byte, signaling to all downstream units that this routing link packet is to be ignored. It is then retransmitted just like any other erroneous link packet.
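A routing byte check along these lines can be sketched in C. The text only states that d[7] is a parity bit over the remaining 7 bits; the even-parity convention used below is an assumption:

```c
#include <stdint.h>

/* Sketch of the routing byte parity check (even-parity convention is an
 * assumption; the parity bit sits in d[7], data in d[6:0]). */
int routing_byte_ok(uint8_t b) {
    unsigned ones = 0;
    for (int i = 0; i < 7; i++)      /* count ones in the low 7 bits */
        ones += (b >> i) & 1u;
    unsigned par = (b >> 7) & 1u;    /* stored parity bit d[7]       */
    return par == (ones & 1u);       /* single-bit errors flip this  */
}
```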

Table 3-8 gives an overview of the format and encoding of all data and control bytes. The first two rows show the encoding of normal data and routing bytes, whereby bit d[8] distinguishes between data and control bytes. The rest of the table depicts the encoding of all 11 control bytes used in the ATOLL link protocol.

101 Link port

Table 3-8. Encoding of data and control bytes

hex  d[8] d[7] d[6] d[5] d[4] d[3] d[2] d[1] d[0]  comment
0xx   0    -    -    -    -    -    -    -    -    normal data byte
0xx   0   par   -    -    -    -    -    -    -    routing byte with parity bit
     d[5] d[4] d[3] d[2] d[1] p[2] d[0] p[1] p[0]  ECC bit positions, Hamming code
1FF   1    1    1    1    1    1    1    1    1    IDLE (filler byte)
100   1    0    0    0    0    0    0    0    0    SOF (Start Of Frame)
107   1    0    0    0    0    0    1    1    1    EOM (End Of Message)
119   1    0    0    0    1    1    0    0    1    EOP (End Of Packet)
11E   1    0    0    0    1    1    1    1    0    EOP_ERR (End Of Packet ERRor)
12A   1    0    0    1    0    1    0    1    0    STOP (STOP sending of data)
12D   1    0    0    1    0    1    1    0    1    CONT (CONTinue sending data)
133   1    0    0    1    1    0    0    1    1    POSACK (POSitive ACKnowledge)
134   1    0    0    1    1    0    1    0    0    NEGACK (NEGative ACKnowledge)
14B   1    0    1    0    0    1    0    1    1    RETRANS (RETRANSmit link packet)
14C   1    0    1    0    0    1    1    0    0    CANCEL (CANCEL routing)

Since only data bytes are protected by link packet CRCs or parity bits, another method had to be found for the control bytes. They are protected by a Hamming code, an error correcting code (ECC). So an error is not only detected, but also corrected on the fly. This is accomplished by using 3 bits of the byte as parity bits, marked as p[2:0] in the table. The parity bits are encoded as follows:

p[0] = XOR(d[6], d[4], d[2])
p[1] = XOR(d[6], d[5], d[2])
p[2] = XOR(d[6], d[5], d[4])

Using this Hamming code, a correct control byte can be identified by a recomputed syndrome p[2:0] = 000. Any nonzero value of the parity bits points to the erroneous bit position, which is then simply inverted to correct a single-bit error. E.g., a value of 011 for the parity bits identifies bit d[2] as incorrect.
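The check-and-correct step follows directly from the three parity equations: the syndrome compares the stored parity bits (d[0], d[1], d[3]) against the recomputed ones, and a nonzero syndrome s names Hamming position s, i.e. physical bit s-1. A minimal sketch (note that d[7] is not covered by the three equations given):

```c
#include <stdint.h>

static unsigned bit(uint8_t b, unsigned i) { return (b >> i) & 1u; }

/* Syndrome over the low 7 bits of a control byte: 0 means correct,
 * a nonzero value is the Hamming position of the flipped bit. */
unsigned syndrome(uint8_t b) {
    unsigned p0 = bit(b, 6) ^ bit(b, 4) ^ bit(b, 2) ^ bit(b, 0);
    unsigned p1 = bit(b, 6) ^ bit(b, 5) ^ bit(b, 2) ^ bit(b, 1);
    unsigned p2 = bit(b, 6) ^ bit(b, 5) ^ bit(b, 4) ^ bit(b, 3);
    return (p2 << 2) | (p1 << 1) | p0;
}

/* Correct a single-bit error: syndrome s points to physical bit s - 1. */
uint8_t hamming_correct(uint8_t b) {
    unsigned s = syndrome(b);
    if (s != 0)
        b ^= (uint8_t)(1u << (s - 1));
    return b;
}
```

For example, flipping bit d[2] of EOP (0x19) yields syndrome 011 = 3, pointing at Hamming position 3, i.e. bit d[2], matching the example in the text.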

The control bytes SOF, EOM, EOP, EOP_ERR and CANCEL were described earlier. The IDLE byte is simply used as a filler, since a byte is transmitted over the link on each clock edge. STOP and CONT are the bytes used for the reverse flow control. POSACK and NEGACK are used by a receiving link port to signal a good or a bad link packet to the sender. The RETRANS byte precedes any retransmitted link packet.


3.9.2 Output port

Figure 3-39. Structure of the output link port


Figure 3-39 gives an overview of the structure of the link port output unit. It is split into 4 areas according to the different tasks it manages. The input path receives message data from the crossbar and stores it temporarily in a small input FIFO. Control logic then forwards each link packet towards the output path, storing the data in another FIFO.

The retransmit path is responsible for keeping a copy of each link packet sent over the link. If the receiving side signals a corrupted packet, this unit can retransmit the erroneous packet. Since the acknowledgment from the other side arrives with a certain delay (error detecting logic, cable propagation time), it would be a waste of bandwidth to wait for it after each single link packet. So the retransmit path has two identical retransmit buffers, allowing it to store 2 packets in parallel.

While the unit waits for the acknowledgment of the first packet, a second one can be transmitted. Thus transmission and acknowledgment of link packets are overlapped, offering the possibility to operate the link at full capacity. Sometimes only small packets are transferred, e.g. routing or header packets with just 1-2 words. If both buffers are filled and waiting for their corresponding acknowledgments, the input path is stopped until one of the buffers is free again.

Figure 3-40. Reverse flow control mechanism


Figure 3-40 shows how the reverse flow control path is utilized to insert control bytes into the opposite data stream of a full-duplex link:

• A: the FIFO on the receiving side runs full, because the path is blocked further ahead

• B: the input path requests its associated output path to send a STOP signal

• C: the STOP byte is inserted into the data stream on the opposite path of a link

• D: the input path of the sender filters out the STOP byte and signals its reception to the output path

• E: the sending output path recognizes the STOP request and stops transmission of message data. Instead, it sends only IDLE fillers

The same procedure is used when the blockage of the path is removed and the FIFO can store data again. This is signaled by sending a CONT control byte. The mechanism is also used for the acknowledgment of link packets: the checking input path requests to send a POSACK or NEGACK byte, which is filtered out at the other side. So the reverse flow control path of the output unit is responsible for inserting these four control bytes into the link data stream. Additional arbiter and datapath logic multiplexes the access to the link cable between the three subunits of the output path.


3.9.3 Input port

Figure 3-41. Structure of the input link port


The structure of the input path of the link port is shown in Figure 3-41. It can be viewed as a pipeline of 3 stages: data synchronization, error checking & decoding, and storage. Since every ATOLL chip has its own internal clock, incoming data must first be synchronized to the clock of the receiving node. An additional tenth signal line in the cable transfers the clock. This is done at half the data rate to prevent the introduction of noise onto the data wires. The clock signal is recovered by a PLL and used to push data into a dual-clock data FIFO. Only non-IDLE bytes are pushed into the FIFO, to prevent an overflow caused by a slightly faster running sender. The output port makes sure that at least 1 IDLE byte is sent per 64 byte link packet.

The second stage decodes all control bytes and filters out the flow control signals, which need to be forwarded to the corresponding output path (STOP, CONT, etc.). It checks routing bytes for parity errors, corrects single-bit errors in control bytes using the Hamming code, and checks the CRC of link packets. For debugging purposes, a multiplexer can be used to choose between the data from the cable and the data sent out by the link port. This offers the possibility to run the link port in a kind of loopback mode.

Finally, data is pushed into a large FIFO. This FIFO is observed by control logic, which triggers the sending of the STOP and CONT control bytes to halt and restart the link data stream. The FIFO can store up to 256 bytes, and the data stream is halted when it is filled to 50 %. For a short period of time data still arrives at the link port, since the request needs some time to reach the opposite side. This configuration is sufficient to support links of up to 25 m. Falling below the 50 % mark again triggers the request to send a CONT byte.
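To see why a 50 % threshold leaves enough headroom for a 25 m cable, one can bound the data still in flight after a STOP is requested. The propagation delay and byte rate in the usage below are illustrative assumptions, not ATOLL figures; only the FIFO depth and threshold come from the text:

```c
/* Back-of-the-envelope check: can the headroom above the STOP threshold
 * absorb all bytes still arriving during one round trip of the STOP
 * request? All timing parameters are passed in as assumptions. */
int headroom_sufficient(double cable_m, double ns_per_m, double ns_per_byte,
                        unsigned fifo_depth, unsigned stop_threshold) {
    double round_trip_ns = 2.0 * cable_m * ns_per_m;  /* request + reaction */
    double in_flight = round_trip_ns / ns_per_byte;   /* bytes still sent   */
    return (double)(fifo_depth - stop_threshold) >= in_flight;
}
```

E.g., with an assumed ~5 ns/m cable delay and 4 ns per byte, a 25 m link keeps well under the 128 bytes of headroom, while a much longer link would not.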



4 Implementation

The implementation of the ATOLL architecture as an Application Specific Integrated Circuit (ASIC) [70] has been a huge challenge. With its complexity and aggressive level of integration, the task of implementing the ATOLL chip is comparable to high-end commercial chip developments at major semiconductor companies. The goal could only be attained by following a carefully planned project schedule, maximizing the productivity of the manpower at our disposal. Other big hurdles were the restricted financial budget and the limited access to chip development tools. Normally, ASIC projects of this size are heavily supported by the technology and tool vendors to help the design team solve critical issues. The support available to the ATOLL project from this side was very limited. But despite all these obstacles, the design team was able to ship the final transistor layout to production in February 2002, almost 3 years after implementing the first simulation modules.

The Europractice IC Service1, which is funded by the European Union (EU), is normally the only way for European research labs and universities to fabricate a few prototype chips to validate the practicability of their ideas. The only state-of-the-art CMOS process available at the time the project was planned was the 0.18 um process from UMC, Taiwan, so this technology was chosen for the implementation of ATOLL. Most Electronic Design Automation (EDA) tools used for the chip development were also acquired via Europractice. Almost all of the tools are developed by Synopsys, Inc. and Cadence Design Systems, Inc., two of the leading EDA companies.

The ATOLL ASIC is a standard cell based design. Using such pre-configured cell libraries, containing logic gates like AND, OR, NAND, NOR, NOT, DFF, etc. in different sizes and with several driving strengths, speeds up the design time, since one does not have to deal with traditional VLSI design techniques. The disadvantage is a loss of performance in terms of speed and a greater power consumption, since these cell libraries are often very conservatively characterized to ensure proper function in silicon. Europractice offers such a library for the UMC process from Virtual Silicon Technologies, Inc. (VST).

1. www.europractice.imec.be


To concentrate on the development of the logic unique to ATOLL, several commonly used building blocks were integrated by using external Intellectual Property (IP) cells. These cells are:

• a PCI-X interface, donated by Synopsys

• DFF-based fifo structures, included in the DesignWare IP library from Synopsys

• RAMs of different sizes, generated by a RAM compiler from VST

• two kinds of PLLs, generated by a PLL compiler from VST

• three special full-custom I/O cells (PCI-X, LVDS-IN, LVDS-OUT), developed by an analog expert team from the University of Kaiserslautern

4.1 Design Flow

The design flow [71] largely follows the standard ASIC design flow used over the last decade to design digital ICs. The size of the design and its aggressive target frequency would have been more manageable with some of the new EDA tools targeting high-end designs. These new tools merge the logical and physical design steps, providing a faster implementation and more predictable results. Examples are Physical Compiler from Synopsys and Physically Knowledgeable Synthesis (PKS) from Cadence. But these tools were not part of the Europractice tool packages at the start of the implementation phase. The Cadence PKS tools have been made available lately, though.

So a design flow had to be established from the accessible tools. Where necessary, additional tools, e.g. for design entry and HDL linting, were acquired to further enhance productivity, if affordable. Figure 4-1 depicts the overall design flow, the tools used for each step and the design formats exchanged between them. Since the design team had almost no experience with backend design (placement & routing of standard cells), and to further reduce the workload, the decision was made to draw on the backend service offered by IMEC, Belgium for Europractice customers. Each of the illustrated steps is described in detail in the following sections.


Figure 4-1. ATOLL design flow

design entry / RTL coding — HDL Designer (Mentor): design entry; Verification Navigator (TransEDA): HDL linting → Verilog RTL

functional simulation — NC-Sim (Cadence): Verilog simulation → Verilog RTL

logic synthesis — Design Compiler (Synopsys): logic synthesis; PrimeTime (Synopsys): static timing analysis → Verilog netlist

test insertion — DfT Compiler (Synopsys): scan insertion; BSD Compiler (Synopsys): boundary scan (JTAG); TetraMAX (Synopsys): ATPG → Verilog netlist

floorplanning, place & route — Apollo II (Avant!): floorplan, place & route; Star-RCXT (Avant!): parasitic extraction → Verilog netlist, SDF, PDEF, set_load

post-layout optimization — Floorplan Manager (Synopsys): IPO/ECO netlist optimization; gate-level simulation — NC-Sim (Cadence): Verilog simulation

tape out to UMC — GDSII

SDF: Standard Delay Format; PDEF: Physical Design Exchange Format; set_load: net capacitances; GDSII: polygon layer masks

4.2 Design entry

The whole design has been implemented using the Verilog HDL at Register-Transfer Level (RTL), i.e. by specifying the behavior of the logic in a cycle-accurate manner. The design process followed a mixed bottom-up, top-down approach. From the top level of the chip, a hierarchy of modules has been implemented. A step-by-step refinement of module interfaces and the logic inside functional units gave early feedback on the consequences of design decisions and sometimes led to the rearrangement of logic for better performance. Other modules have been designed at the same time by merging basic functional units into larger and more complex blocks.


To deal with the large design hierarchy and to enhance its visualization, the decision was made to use a schematic-based design entry tool. After evaluating several alternatives, the HDL Designer Series from Mentor Graphics was chosen. It offers good visualization capabilities, multiple entry formats (schematics, state machines, truth tables, HDL code) and team-based design management. Its RCS-based version management has been utilized to prevent conflicting modifications to the design, as well as to provide the ability to fall back to previous versions of modules.

For an efficient implementation of complex control logic as Finite State Machines (FSM), a custom-developed tool called FSMDesigner [72] has been used. Its optimized HDL code generation proved to be a valuable help in dealing with complex, hard-to-implement control logic. The most obvious advantage was the ability to debug control logic at the more abstract level of FSMs, compared to plain HDL code. This provided a fast turnaround time while debugging the design. Given that most of the functional bugs were discovered in the control part of a unit, this helped to save weeks of development time. In a late stage of the design, the state machine editor of the HDL Designer Series was used instead, after it had been figured out how to apply our special implementation style for FSMs to it.

4.2.1 RTL coding

Since the level of expertise in writing Verilog RTL code varied a lot inside the design team, a way had to be found to make sure that all designers produce quality code, which could easily be merged into the whole design. A set of rules and guidelines [73] for writing Verilog code was established, similar to the Reuse Methodology Manual [74], which is widely used in industry. To automate the compliance checking of the code, a special HDL linting tool was acquired, called VN-Check. This tool is part of the larger Verification Navigator (VN) tool suite from TransEDA. A rule database was implemented, and each time a designer wanted to check in new code, it was first passed through the linter. This proved to be a fast and efficient way to catch a lot of coding errors, which normally cause problems later in the flow. E.g., it ensured a consistent clocking and reset method, avoided unwanted latches, and forced designers to follow a consistent naming style.
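The kind of rule such a linter enforces can be illustrated with a deliberately simplified sketch (Python, not the actual VN-Check rule database): flagging a combinational always block whose if lacks an else, a classic source of unwanted latches. A real HDL linter parses the language properly instead of matching text.

```python
import re

def lint_unwanted_latch(verilog_src):
    """Toy lint check: in a combinational always block, an 'if' without a
    matching 'else' can infer an unwanted latch. Returns a list of warnings.
    (Illustrative only -- real linters like VN-Check parse the full HDL.)"""
    warnings = []
    # crude split into always blocks, assuming a terminating 'end'
    for m in re.finditer(r'always\s*@\s*\(([^)]*)\)(.*?)\bend\b',
                         verilog_src, re.S):
        sensitivity, body = m.group(1), m.group(2)
        if 'posedge' in sensitivity or 'negedge' in sensitivity:
            continue  # clocked block: infers flipflops, not latches
        if body.count('if') > body.count('else'):
            warnings.append('combinational always block: '
                            'if without else may infer a latch')
    return warnings

# A missing else in a combinational block is flagged, a complete one is not.
assert lint_unwanted_latch(
    "always @(sel or a) begin if (sel) y = a; end")
assert not lint_unwanted_latch(
    "always @(sel or a) begin if (sel) y = a; else y = 1'b0; end")
```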

4.2.2 Clock and reset logic

Besides the implementation of the architecture, the clock and reset logic1 of an ASIC is always a critical factor. The whole chip integrates 6 different clock domains: PCI-X, ATOLL and 4 link domains. The PCI-X clock is generated by logic on the host main

1. designed and implemented by Patrick Schulz for ATOLL

110 Implementation board. Based on the system configuration, it can be set to 33/66/100/133 MHz. An on- chip Delay Locked Loop (DLL)1 has been implemented to provide a fixed clock tree delay, as requested by the PCI-X specification. The main ATOLL clock is generated by an on-chip oscillator, which uses an external crystal residing on the network card next to the ASIC. It is then internally multiplied by a PLL. The multiply factor is configurable, so one can run the chip from 175 MHz up to 350 MHz. The stepping width for the PLL is 14 MHz. This offers the possibility to tweak the clock frequency of the ASIC up to its physical limit, since the assumptions made during the design process, e.g. bad supply voltage and high temperature, are often far too pessimistic. Finally, to sample data coming over a link from another node, the clock is sent over the link as additional signal line. This is done at half the rate of the original clock to prevent unnecessary noise on the cable. At the receiving side, this clock signal is again doubled, phase-aligned and inverted to sample incoming data into a synchronization fifo. This fifo is then read with the internal ATOLL clock.

Figure 4-2. A dual-clock synchronization fifo

[Figure content: a dual-port RAM with a gray-coded write pointer (wptr/wadr) in the push clock domain and a gray-coded read pointer (rptr/radr) in the pop clock domain; each pointer is passed through synchronization DFFs into the opposite domain to generate the full/empty flags.]
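The gray-coded pointers shown in Figure 4-2 rely on the property that consecutive code words differ in exactly one bit. A minimal Python sketch of the standard binary-to-Gray conversion:

```python
def bin_to_gray(b):
    """Convert a binary count to its Gray-code equivalent."""
    return b ^ (b >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count('1')

# Adjacent counter values differ in exactly one Gray-code bit, so a
# pointer sampled mid-transition yields either the old or the new value,
# never an arbitrary pattern (unlike binary 0111 -> 1000, where all four
# bits change at once).
codes = [bin_to_gray(i) for i in range(16)]
assert all(hamming(codes[i], codes[i + 1]) == 1 for i in range(15))
assert hamming(codes[15], codes[0]) == 1  # the wrap-around is also single-bit
```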

All signals crossing a clock domain border must be synchronized into the receiving clock domain. Sampling an asynchronous signal can result in metastability of flipflops, letting them oscillate for a certain amount of time. The probability of metastability can be reduced by sampling an asynchronous signal multiple times. For modern technologies, a double-sampling is sufficient to ensure proper function of logic.
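The benefit of the second sampling stage can be estimated with the common mean-time-between-failures model for synchronizers. The sketch below uses purely illustrative constants (tau, T0, clock rates), not values characterized for the ATOLL process:

```python
import math

def synchronizer_mtbf(t_resolve_s, tau_s=50e-12, t0_s=1e-9,
                      f_clk_hz=250e6, f_data_hz=50e6):
    """Common synchronizer failure model:
    MTBF = e^(t_resolve / tau) / (T0 * f_clk * f_data).
    tau and T0 are process-dependent flipflop constants; all values
    here are illustrative, not characterized for ATOLL."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# A second flipflop grants one extra clock period (4 ns at 250 MHz) of
# resolution time, multiplying the MTBF by e^(4 ns / tau) -- many orders
# of magnitude, which is why double sampling suffices in practice.
one_stage = synchronizer_mtbf(1e-9)          # ~1 ns of slack left in the cycle
two_stage = synchronizer_mtbf(1e-9 + 4e-9)   # plus one full extra cycle
assert two_stage > one_stage
```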

1. designed and implemented by Prof. Dr. Ulrich Brüning


So passing data or control signals across a clock domain border is simply done by using two flipflops in a pipelined fashion. Where it is necessary to pass signals in opposite directions, e.g. the two-way handshake signals, a dual-clock fifo structure is utilized to safely transfer data. Some things [75] have to be observed while designing such a dual-clock fifo. Basically, the push and pop interfaces are driven by different clocks. Data is stored in a pointer-controlled RAM, with the write pointer residing in the push interface, and the read pointer controlled by the pop interface. Internally, these pointers are then synchronized to the opposite interface to calculate the fifo fill level and to generate the appropriate control flags (full, empty, etc.). Using normal binary-coded pointers could result in a failure, since sampling a value which is just being incremented from 0111 to 1000 could result in any possible bit combination. This is avoided by using a gray code for the pointers, allowing only one bit to change on each transition. This could result in sampling an old pointer value, but not a completely wrong one. Figure 4-2 depicts the structure of such a dual-clock fifo.

4.3 Functional simulation

Regarding the size of the design, simulation runtime was a major issue. Therefore, prior to running any simulations, a benchmark was set up to compare the performance of all accessible simulators: Verilog-XL, NC-Sim (both Cadence), VCS (Synopsys), and Modelsim (Mentor). The compiling simulators NC-Sim and VCS clearly dominated the interpreting XL and Modelsim. Since more licenses were available for NC-Sim, it was chosen as the main simulator for both RTL and gate-level simulations.

4.3.1 Simulation testbed

Figure 4-3. Testbed for the ATOLL ASIC

[Figure content: two nodes, each consisting of a PCI-X Master BFM, a PCI-X Slave BFM and a PCI-X Monitor on a PCI-X bus attached to an ATOLL ASIC; the two ASICs are connected by their 4 links.]

To test the ATOLL ASIC in an environment as close to its real use as possible, a testbed1 has been set up to simulate a network of 2 PCs connected by the ATOLL network, as

1. implemented with the help of Patrick Schulz

shown in Figure 4-3. The two ATOLL ASICs are connected back-to-back by their 4 links. On the host side, Synopsys PCI-X FlexModels were used. These are a set of bus functional models (BFM), configured to act as the central components of a PC node:

• a Master BFM is used to model the CPU

• a Slave BFM is used to model main memory

• a Monitor BFM is used to check all ongoing bus transactions for PCI-X compliance

Several Verilog tasks have been developed to control the simulation via the Master BFM. They also encapsulate all calls to the FlexModel BFMs, so higher-level testbenches do not have to deal with the control of single PCI-X bus cycles. Some examples of such tasks are:

• atoll_init: initializes the ATOLL ASIC and sets up data structures in main memory

• send_dma: assembles message data in main memory, enqueues a new send descriptor into the descriptor table, and triggers the DMA engine inside the ATOLL ASIC to process the currently generated DMA job

• receive_pio: receives a message by Programmed I/O

Summarizing, 22 Verilog tasks with more than 2.000 lines of code have been implemented. Using these basic tasks as a kind of “testbench API”, several top-level testbenches have been developed. They are used to test the implementation for correctness and to sort out all functional bugs. The implemented tasks could serve as a starting point for the development of a low-level ATOLL message layer. They issue a sequence of load/store operations to control the various features for message transfer, very similar to the algorithms software would need to implement.
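The idea of such a “testbench API” — high-level operations built from individual load/store accesses, so top-level tests never touch single bus cycles — can be modeled as follows. The real tasks are Verilog; all names, addresses and the descriptor layout below are invented for illustration:

```python
# Illustrative model of a testbench API. Register addresses, the memory
# map and the descriptor layout are invented, not the ATOLL ones.
class AtollDriver:
    def __init__(self):
        self.mem = {}        # models host main memory
        self.regs = {}       # models ATOLL device registers
        self.desc_tail = 0   # next free slot in the descriptor table

    def store(self, addr, value, space='mem'):
        (self.mem if space == 'mem' else self.regs)[addr] = value

    def atoll_init(self):
        """Like atoll_init: enable the device, set up data structures."""
        self.store(0x0, 1, space='regs')   # hypothetical enable bit
        self.store(0x1000, 0)              # hypothetical descriptor table base

    def send_dma(self, payload, link):
        """Like send_dma: place data in memory, enqueue a descriptor,
        then trigger the DMA engine -- all as plain stores."""
        base = 0x2000 + 64 * self.desc_tail
        for i, word in enumerate(payload):          # message data in memory
            self.store(base + 8 + i, word)
        self.store(base, (link, len(payload)))      # enqueue descriptor
        self.desc_tail += 1
        self.store(0x8, self.desc_tail, space='regs')  # ring the doorbell

drv = AtollDriver()
drv.atoll_init()
drv.send_dma([0xCAFE, 0xBABE], link=2)
```

A testbench then simply calls `drv.send_dma(..., link=2)` instead of scripting dozens of PCI-X bus cycles, which is exactly the abstraction the Verilog tasks provide.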

4.3.2 Verification strategy

A specific verification strategy was chosen to be followed throughout the verification process to coordinate all simulation efforts. This method [76] is called ‘shotgun & sniper rifle’, according to its dual-way approach. A set of specific corner-case testbenches is used to test all the critical issues the designers can think of (‘sniper shots’). These testbenches are small and use fixed parameters to call the basic tasks. On the other side, some testbenches use a large sequence of tasks with random parameters (e.g., message size, link to send data on, DMA or PIO mode, etc.) to cover large portions of the verification space (‘shotgun’). In total, 11 ‘sniper’ testbenches and one large parameterized ‘shotgun’ testbench were implemented, together about 15.000 lines of code.
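A ‘shotgun’ stimulus generator of this kind can be sketched in a few lines. The parameter ranges follow the text (message size up to 4 Kbyte, 4 links, DMA or PIO mode); the fixed seed mirrors the reproducible pseudo-random behavior of Verilog simulators:

```python
import random

def shotgun_jobs(n, seed=42):
    """Generate n random message-transfer jobs. With a fixed seed the
    sequence is reproducible: rerunning the testbench replays exactly
    the same stimulus, like Verilog's pseudo-random $random."""
    rng = random.Random(seed)
    return [{
        'size': rng.randint(1, 4096),   # payload: a few bytes up to 4 Kbyte
        'link': rng.randrange(4),       # one of the 4 links
        'mode': rng.choice(['DMA', 'PIO']),
    } for _ in range(n)]

# Same seed -> identical stimulus, so a failing run can be replayed with
# dumping enabled; a new seed covers a different slice of the space.
assert shotgun_jobs(100, seed=7) == shotgun_jobs(100, seed=7)
assert shotgun_jobs(100, seed=7) != shotgun_jobs(100, seed=8)
```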


4.3.3 Runtime issues

Memory usage of the simulator is surprisingly low, the compiled design uses only 63 Mbyte in total. But it became a problem after starting to write dumps for analysis purposes. When dumping the whole design, the dump data inflates to more than 1 Gbyte, causing the machine to swap memory pages. This slows down the simulation significantly. Dumping has been limited to the first levels of hierarchy to overcome this performance drop. While debugging errors buried deeply in the design hierarchy, a second run of the erroneous testbench recorded only the events from some specific modules.

The small corner-case testbenches run only for some minutes. But the large regression testbenches simulate sending/receiving thousands of messages of random size via a random combination of network interfaces and links. They run for days, and even when limiting the dump to 3-5 levels of hierarchy, the dump files still grow to several Gbyte after a day. So the decision has been made to turn off dumping altogether for these testbenches to run them as fast as possible.

The ability to write checkpoints of the simulation out to disk, to be able to restart it from a point just before an error occurred, would have been very helpful. But unfortunately, the FlexModels from Synopsys cannot be restarted from a checkpoint written by NC-Sim. So one has to rerun the whole testbench to write a dump containing the occurred error, even if it happens after several days. Since this methodology would have disrupted the time schedule of the whole project, a workaround has been found in stopping the regression test after two days. The testbench is then started over and over again with modified random seeds (Verilog random numbers are pseudo-random, so the same simulation will always generate the same sequence of random numbers).

Though one run of such a regression testbench runs for 48 hours, it simulates only about 110 ms of real time. About 50.000 messages are sent in such a run, with a data payload from a few bytes up to 4 Kbyte. The small ‘sniper shot’ testbenches caught about 80 % of all functional bugs found. Once they ran without errors, the regression tests still found some bugs, but the error rate dropped rapidly in the first weeks of simulation.

After several (about 15) runs of the large ‘shotgun’ testbenches completed without errors, the decision was made to declare the design stable enough to be implemented in silicon. Table 4-1 shows some simulation statistics.


Table 4-1. Testbench statistics

                            ‘sniper shot’              ‘shotgun’
runtime                     10-30 min                  48 h
no. of simulated messages   250                        50.000
started after ...           every design modification  all the time
no. of total runs           60-80                      50-60

4.4 Logic synthesis

Setting up a synthesis flow for such a large design is a non-trivial task. Since the whole design is too large to be synthesized in one top-down compile, it had to be broken down into smaller parts using a “Divide & Conquer” approach. Several methods are well known for using Design Compiler from Synopsys on such large designs, sometimes referred to as bottom-up or compile-characterize-write_script-recompile flows.

4.4.1 Automated synthesis flow

All these mixed top-down/bottom-up flows require a lot of scripting and data management to efficiently constrain and compile subdesigns, which are then glued together into higher-level modules. Every major ASIC company has set up its own flow based on preconfigured scripts and custom compile strategies. In 1998, Synopsys introduced a TCL command line mode for their tools, significantly enhancing the ability to implement custom procedures and functions. This enhancement and the need for a stable and easy-to-use synthesis flow resulted in the release of a feature called Automated Chip Synthesis (ACS) [77]. ACS is a set of TCL procedures intended to automate the bottom-up compilation of large designs. It automatically partitions the design, propagates constraints down the hierarchy, writes out compile scripts for each partition, and finally generates a Makefile to control the whole synthesis flow of the design. It reduces the setup of the flow to specifying only top-level constraints and a few scripts to drive the ACS flow. Recent benchmarks [78] show that ACS produces good results in terms of timing and area compared to other methods. The fact that it is based on TCL procedures makes it highly configurable. That proved to be a major advantage, as it was necessary to patch one ACS procedure to work around a serious bug.

The 3 main procedures of ACS are:

• acs_compile_design: does a hierarchical compile of the design using user-specified top-level constraints


• acs_recompile_design: extracts timing constraints from a previous ACS run, and uses them to run a full compile on the RTL design

• acs_refine_design: extracts timing constraints from a previous ACS run, and uses them to run an incremental compile on the current netlist
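The bottom-up ordering that ACS automates can be illustrated with a dependency-ordered compile schedule. This is a Python sketch with an invented module hierarchy that only loosely resembles ATOLL; real ACS emits Design Compiler scripts plus a Makefile, not a Python list:

```python
def bottom_up_order(hierarchy, top):
    """Return a compile order in which every submodule is compiled before
    the module instantiating it (a post-order traversal of the design
    hierarchy). 'hierarchy' maps a module to the submodules it instantiates."""
    order, seen = [], set()

    def visit(mod):
        if mod in seen:
            return
        seen.add(mod)
        for sub in hierarchy.get(mod, []):  # leaves have no entry
            visit(sub)
        order.append(mod)                   # parent only after its children

    visit(top)
    return order

# Invented toy hierarchy, loosely resembling the ATOLL structure:
hier = {
    'atoll_top': ['pcix_if', 'ni', 'xbar'],
    'ni':        ['host_port', 'net_port'],
}
assert bottom_up_order(hier, 'atoll_top') == \
    ['pcix_if', 'host_port', 'net_port', 'ni', 'xbar', 'atoll_top']
```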

Figure 4-4. Synthesis flow

[Figure content: flow from the RTL design with top-level constraints down to layout —
• ACS compile: hierarchical full compile of RTL to extract partition I/O constraints
• ACS recompile: hierarchical full compile of RTL using extracted I/O constraints from the previous netlist
• ACS refine: hierarchical incremental compile of the netlist using constraints from the recompile run
• top-down incr. compile: flat top-down incremental compile to work on critical paths and reduce area
• scan/JTAG insertion: insertion of scan chains and boundary scan (JTAG) logic
• top-down incr. compile: flat top-down incremental compile to fix critical paths introduced by test insertion, clean up the netlist
• to layout]

Those three commands were used in the above order to establish a base netlist for further refinements. Top-level compiles were then used to clean up the netlist, globally optimize remaining critical paths and prepare the netlist for layout. Figure 4-4 depicts the overall synthesis flow used for the ATOLL ASIC.

4.4.2 Timing closure

At the beginning, the design was pad-limited (around 380 functional I/O pads), which means that the area of the chip is determined by the area needed to place all I/O cells, not by the size of the core logic. So area was not a major constraint during synthesis of the core logic. Timing was more critical, especially in the PCI-X clock domain. Very early in

the design flow some basic compile runs on parts of the RTL design were used to check whether any parts would be a major problem regarding timing closure. Based on these runs, a lot of Verilog modules have been rewritten. E.g. pipeline stages were inserted, and large functional blocks were broken down into several smaller ones. This proved to be a big advantage for the further work, since once the whole synthesis flow was started, it was never necessary to go back to RTL coding.

Traditional logic synthesis tools know the logical structure of a design, and have fairly detailed information about the timing of standard cells. But they lack any information about the physical implementation of a design, since this is done at a later stage of the whole flow. To calculate the delay of logic paths and find an optimal netlist composed of interconnected standard cells, tools need to estimate the effect of nets in terms of length and the delay introduced by them. Library vendors provide with each cell library a set of wire load models. A wire load model is a way to predict the load, and with it the delay, of nets based on a statistical evaluation of previously implemented designs using the same technology. But since designs might differ a lot, these estimations can be very imprecise. This was no major concern with older technologies, when cell delays dominated the wire delays. But with technologies of 0.18 um or even smaller geometries, this proportion has changed dramatically. The result is that the physical implementation has a huge impact on the overall performance of a design.

Figure 4-5. Logic synthesis lacks physical information

[Figure content: two placed modules A and B; net_0 and net_1 lie inside the same module A but have very different lengths, while net_2 and net_3 both cross between the modules but connect cells placed far apart — yet synthesis treats each pair the same way.]

Wire load models are often a set of tables containing net lengths/delays, indexed by the fanout (the number of cells driven by the net) of the net. Different tables are provided for modules of different size. So a synthesis tool treats nets of the same level of hierarchy the same way. Assuming a placed design composed of two modules, as shown in Figure 4-5, this can lead to mispredicted net lengths:

• though net_0 and net_1 are part of the same module, their lengths differ a lot

• net_2 and net_3 both run between the two modules, but the cells connected to them are placed very differently
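A wire load model lookup of the kind described above can be sketched as follows. The fanout-to-length table and the per-micron capacitance below are invented; real tables ship with the cell library:

```python
# Toy wire load model: predicted net length (and thus load and delay) is a
# function of fanout only, so all nets in the same module size class get the
# same estimate -- the root cause of the mispredictions in Figure 4-5.
WIRE_LOAD_TABLE = {1: 40, 2: 80, 3: 130, 4: 190}  # fanout -> length in um (invented)
SLOPE_UM = 70            # linear extrapolation per extra fanout (invented)
CAP_PF_PER_UM = 0.0002   # net capacitance per um (invented)

def estimate_net(fanout):
    """Return (length_um, cap_pf) for a net, as synthesis would predict it."""
    if fanout in WIRE_LOAD_TABLE:
        length = WIRE_LOAD_TABLE[fanout]
    else:  # beyond the table: linear extrapolation for high-fanout nets
        length = WIRE_LOAD_TABLE[4] + SLOPE_UM * (fanout - 4)
    return length, length * CAP_PF_PER_UM

# net_0 and net_1 of Figure 4-5 both have fanout 1, so the tool predicts
# identical lengths -- even if one of them is placed many times longer.
assert estimate_net(1) == estimate_net(1)
```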


These inaccuracies get even worse the larger the design is. To enhance the accuracy, one can generate so-called custom wire load models, which are specific to the design implemented. They were generated for ATOLL from a trial layout. But it turned out that even these custom wire load models were still inexact. Dealing with very high-fanout nets [79] was another crucial issue.

Figure 4-6. Improvement of timing slack and cell area

[Figure content: two plots of WNS (upper) and TNS plus cell count (lower) for the PCI-X and ATOLL clock domains after each flow step (ACS compile, ACS recompile, ACS refine, top-down incr. compile, test insertion, top-down incr. compile). Recoverable data points: WNS PCI-X stays at 3,06 ns through the ACS runs and ends below 1 ns; WNS ATOLL starts around 1,25-1,48 ns and ends around 0,5-0,8 ns. TNS PCI-X falls from 2360 ns to 489 ns, TNS ATOLL from 2192 ns to 238 ns. Cell counts per step: 232238, 230227, 218146, 213039, 216393, 215780.]

Figure 4-6 shows the timing and area results after each step of the synthesis flow. The numbers are given separately for each of the two major clock domains, PCI-X and ATOLL. The upper diagram depicts the Worst Negative Slack (WNS), which is the logic path with the largest timing violation. The lower one shows the Total Negative Slack


(TNS), which is the sum of all violated paths. Additionally, it lists the number of standard cells used to implement the design as a gate-level netlist.
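The two metrics can be stated precisely in a few lines. Sign convention: violated paths have negative slack; the figures in this chapter report the magnitudes. The sample values are invented:

```python
def wns_tns(path_slacks_ns):
    """Worst Negative Slack: the single worst (most negative) path slack.
    Total Negative Slack: the sum of all negative slacks. Paths that meet
    timing (slack >= 0) contribute to neither metric."""
    violated = [s for s in path_slacks_ns if s < 0]
    wns = min(violated) if violated else 0.0
    tns = sum(violated)
    return wns, tns

# Invented sample: three violated paths, one path with positive margin.
slacks = [-0.50, -0.10, -0.05, 0.30]
wns, tns = wns_tns(slacks)
# wns is the worst single path (-0.50 ns), tns the sum of all violations
# (-0.65 ns); fixing many small violations shrinks TNS without moving WNS.
```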

The WNS in the PCI-X domain remains stable over the 3 ACS runs, since the PCI-X core is precompiled via encrypted synthesis scripts. These scripts are delivered by Synopsys together with the encrypted Verilog source code. The resulting netlist is read in at the start and protected against modification until the first top-down incremental compile. However, the TNS in the PCI-X domain shrinks a bit, because the synchronization logic interfacing the PCI-X core to the rest of the chip lies outside the protected PCI-X core.

WNS and TNS of the ATOLL domain are significantly reduced during the ACS runs, with a temporarily larger WNS for the second run. This is a consequence of too pessimistic constraints extracted from the first trial, misleading the recompile step. The first top-down incremental compile then improves the timing of the PCI-X part a lot, whereas most optimizations possible for the ATOLL domain seem to have been done by the ACS runs. Test insertion then adds some slack, mostly because of inserting the JTAG boundary scan logic into paths to/from I/O pads. These paths are critical, especially paths through the PCI-X pads. The number of cells shrinks significantly with the ACS refine step and the first incremental compile on the gate-level netlist. The last steps only make local optimizations with little impact on the overall cell count.

Though timing and area improve a lot over the whole flow, there are still some violated paths in the netlist at the end. This is caused by very strict and pessimistic constraints on some parts of the design. An early version of the flow finished with no slack at all, but was too optimistic regarding net lengths and load capacitances in some places. So the post-layout timings differed a lot from the pre-layout numbers, which caused post-layout optimization to fail. The constraints and also the wire load models were then tightened for some critical modules. Most of the violated paths run through I/O pads. Since they are on top level, the tool wrongly estimated a long net between the pad and the core logic. But since these nets are quite short after layout, the pre-layout slack was accepted.

Another interesting fact is that the TNS of the PCI-X domain at the end is still twice the TNS of the ATOLL domain, though the PCI-X part makes up only about 10 % of the whole design. The reason for this unbalanced ratio is the fact that the PCI-X part includes a lot of tightly constrained I/O pads, which are still violated after synthesis. On the other hand, the ATOLL core logic is connected to LVDS pads with less stringent constraints, so fewer violations occur in it.


4.4.3 Design for testability

Several factors can cause errors in ASIC production. The fraction of fully functional chips, compared to the total number of dies, is called yield and typically varies between 40-80 % of the whole production volume. To efficiently sort the good from the bad chips, one adds on-chip test logic, which is utilized to test silicon dies before they are packaged and mounted on Printed Circuit Boards (PCB). Separate tests are used to isolate errors in chip packages and PCBs. The whole area is often referred to as Design for Testability (DfT), and its whole variety of methods and applications is beyond the scope of this document. An in-depth discussion of the DfT methods1 used for ATOLL can be looked up in additional literature [80]; this section will give only a broad overview.

ATOLL contains three types of test logic:

• full scan using multiplexed flipflop scan style

• JTAG-compliant boundary scan

• Built-In Self Test (BIST) for all RAMs

Full scan means that all registers in the design are replaced with scannable DFFs. These can be switched via a scan_enable signal into scan mode, forming a large chain of scan flipflops. Figure 4-7 depicts the internal structure of such a scan DFF.

Figure 4-7. Multiplexed flipflop scan style

[Figure content: a multiplexer in front of the D input selects between data_in (scan_enable = 0) and scan_in (scan_enable = 1); the flipflop output drives both data_out and scan_out.]

ATOLL contains 4 full scan chains, which are partitioned according to the different clock domains. Only the flipflops driven by the 4 link clocks are assembled in one chain, since each link clock domain only drives about 160 DFFs. One chain is used for the PCI-X clock domain, and the other 2 scan chains link up all registers of the main ATOLL clock.

1. implemented by Patrick Schulz


The start and end points of these chains are some general purpose I/O cells, which are not timing critical and can tolerate some additional load. Table 4-2 lists all full scan chains.

Table 4-2. Internal scan chains

scan chain               clock domain   number of DFFs
0: GP_IO[7] -> GP_IO[3]  ATOLL          20.965
1: GP_IO[6] -> GP_IO[2]  ATOLL          20.965
2: GP_IO[5] -> GP_IO[1]  PCI-X          4.617
3: GP_IO[4] -> GP_IO[0]  4 link clocks  660

Not all DFFs can be linked up in the scan chains, e.g. if they are driven by a non-controllable clock. But these are only a few, e.g. the DFFs in the test and clock logic. In total, about 98.5 % of all DFFs are scanned. The unbalanced lengths of the scan chains affect the time needed for testing. But this can be accepted due to the low production volume targeted for the ATOLL chip.
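The shift/capture behavior of these multiplexed-flipflop scan cells can be modeled in a few lines. This is a Python sketch of a short chain; real scan vectors are produced by the ATPG tool:

```python
class ScanChain:
    """Model of a chain of multiplexed-flipflop scan cells: with
    scan_enable = 1 every clock shifts the chain by one bit; with
    scan_enable = 0 the flipflops capture functional data_in values."""
    def __init__(self, length):
        self.regs = [0] * length

    def clock(self, scan_enable, scan_in=0, data_in=None):
        if scan_enable:                      # shift mode
            self.regs = [scan_in] + self.regs[:-1]
        else:                                # capture mode
            self.regs = list(data_in)
        return self.regs[-1]                 # scan_out = last flipflop

# Shift a test pattern into a 4-bit chain (one bit per clock; the last
# bit shifted in ends up in the first flipflop), then capture a response.
chain = ScanChain(4)
for bit in (1, 1, 0, 1):
    chain.clock(scan_enable=1, scan_in=bit)
assert chain.regs == [1, 0, 1, 1]
chain.clock(scan_enable=0, data_in=[0, 1, 1, 0])
assert chain.regs == [0, 1, 1, 0]   # captured response, ready to shift out
```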

Boundary scan logic [81] is used to drive specific values out of the chip via the I/O cells. This is utilized for board-level tests, using the standardized JTAG interface. Internal logic, called the Test Access Port (TAP) controller, provides the signals used to control the boundary scan logic. The boundary scan cells are located between the I/O cell ring and the core logic. JTAG is quite popular for this task, since it only needs 5 additional I/O pads to shift data in and out of the chip. Besides the described task, it is also used in ATOLL to control the BIST logic.

The internal BIST logic [82]1 is used to test all 43 instantiated RAM macros in the ATOLL chip. It uses a 12-N-March algorithm to repeatedly write 0’s and 1’s to each RAM cell to detect any faulty bit cells. As stated earlier, it is fully controllable and observable through either the JTAG TAP controller or supervisor registers. Normally, the BIST is run together with the board-level test, just after board production. The software-controlled BIST is intended for checking boards which show unstable behavior while in use.

4.5 Layout generation

As mentioned earlier, the IMEC IC Backend Service group was hired to do the layout job. Place & Route of a design is a difficult and crucial step in the design flow and needs a lot of experience and knowledge about the tools and the technology. Since this know-how was not present in the local design group, it was not feasible to learn this task in an acceptable amount of time without further extending the time frame of the project.

1. implemented by Erich Krause


In addition to handing over the gate-level netlist to the backend team for layout preparation, some more data is necessary to ensure an efficient and optimal implementation. Since the ATOLL design has some aggressive timing goals, the layout flow was carefully planned. A smooth and close collaboration of the layout step with synthesis is an important factor to keep the number of post-layout optimization steps at a minimum. From discussions with other ASIC designers who did timing-critical chips, and from various literature about high-end ASIC design, it was concluded that three things are crucial for reaching timing closure:

• a well considered floorplan1 for placing macros and top-level blocks

• using custom wire load models for precise prediction of wire delays during synthesis

• a timing-driven layout flow to optimize critical paths during cell placement and wire routing

Figure 4-8. Floorplan used for the ATOLL ASIC

[Figure content: floorplan of the 5,8 mm die with 81-118 I/Os per side. The LVDS I/Os sit along the top edge above the Xbar/Links region; below it, the PLL and the four NI blocks (NI = host & network port) form a row; the PCI-X/Interface region sits at the bottom above the PCI-X I/Os; a boundary scan logic ring surrounds the core; remaining pads are GND, VCC and NC.]

Figure 4-8 depicts the floorplan used for the ATOLL ASIC. But it took some time to convince the layout team at IMEC to use it. The default method used for most of their customers is to preplace only the RAM macros, and let the tool place all cells freely. After

1. designed with the help of Prof. Dr. Ulrich Brüning

analyzing the first layout trials it became clear that this default methodology may work for the typical small- to medium-sized designs done at IMEC, which have modest timing requirements. But it is not suited for the kind of large, timing-critical designs like the ATOLL chip. Since the top level of the chip is well structured, it really paid off to bring in the knowledge about the logical structure of the design. The core area was separated into the 6 regions shown in the floorplan. During placement, cells contained in the logical structure associated with these regions were allowed to be placed only within their region, with only some small exceptions allowed to avoid highly congested areas. All 43 RAM macros were preplaced according to the floorplan, along with some other critical cells. E.g., the DFFs driving the LVDS outputs were preplaced, making sure that the skew between single bits of the same link is as low as possible.

The usage of custom wire load models proved to be an advantage, as described earlier. But even after several runs to fine-tune them, there were parts of the design where pre- and post-layout net delays varied significantly. This inaccuracy is an inherent problem of using wire load models to estimate net lengths, and it gets worse the larger a design is. However, a lot of violated paths could be optimized by buffer insertion or cell sizing during post-layout optimization.

To overcome the gap between constraint-driven logic synthesis and physical layout, newer versions of Place & Route tools offer the possibility to bring in some knowledge about the timing of a design. A timing-aware placement of cells then greatly enhances the performance of the layout, since the tool focuses on the optimal layout of critical paths. Unfortunately, the backend team could not use this type of flow for the ATOLL chip. Trying to use timing-driven placement, the tool constantly crashed and never finished. So one had to fall back to the conventional wire length-driven flow, which optimizes net lengths between all cells, whether they are part of a critical path or not. This was the main cause for the large number of post-layout optimization steps.

4.5.1 Post-layout optimization

After the design is placed and routed, one needs to analyze the timing of the layout, since it might vary a lot from the netlist version produced by synthesis. Several standardized data formats exist to hand over data between different design tools. The following data was delivered by the backend team for timing analysis and post-layout optimization:

• the gate-level netlist, now containing buffer trees for all clock and reset nets

• extracted point-to-point timing of cells in the Standard Delay Format (SDF)


• extracted capacitive load information of nets (set_load commands for the synthesis tool)

• physical cell locations in the Physical Design Exchange Format (PDEF)

This data is fed back into the synthesis tool, which is used to run a so-called In-Place Optimization (IPO) or Engineering Change Order (ECO). Since the synthesis tool is now aware of the real cell and net delays, it can calculate the exact timing of each path and try to optimize the ones that do not meet the timing goal. The following methods are used to do this:

• cell upsizing is used to enhance the driving capability of a gate by replacing a cell with the same logic function, but a higher driving strength. This speeds up the path, but also increases area

• cell downsizing is done by replacing a cell with a lower drive version of it. This can be useful to reduce the load of nets driving this cell, or to save power

• buffer insertion is used to break up long nets, resulting in reduced load and faster transition times for driving cells

• buffer removal is done when synthesis has added too many buffers to a path, some of which are unnecessary
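As a rough illustration, the choice between these optimization moves can be sketched as a small heuristic. All thresholds and the drive-strength limit below are invented for the example and do not come from the actual ATOLL flow or cell library:

```python
MAX_DRIVE = 20  # assumed largest available drive strength (illustrative)

def pick_move(cell_delay_ns, net_cap_pf, drive):
    """Pick one of the IPO moves for a single path stage.

    cell_delay_ns: delay contributed by the cell
    net_cap_pf:    capacitive load of the net it drives
    drive:         current relative drive strength of the cell
    """
    if cell_delay_ns > 1.0 and drive < MAX_DRIVE:
        return "upsize"          # cell far too slow for its load
    if net_cap_pf > 0.3:
        return "insert_buffer"   # break up a heavily loaded net
    if cell_delay_ns < 0.05 and drive > 1:
        return "downsize"        # over-driven cell: recover area/power
    return "keep"                # stage is fine (buffer removal is a
                                 # separate, netlist-level cleanup step)
```

A real IPO run evaluates such moves in the context of the whole path and its slack; this sketch only captures the per-stage intuition behind the four methods.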

Figure 4-9. Improving a timing-violated path (before optimization: a path through SDFRPQ1, INVD1/INVD2, AOI22D1, NAN4D1, MUXB2D1 and SDFRPB1 cells with a total delay of 7.56 ns; after cell upsizing and buffer insertion the same path takes 3.73 ns)

Figure 4-9 shows an example of how the timing delay of a path can be halved by these modifications. E.g. some mispredictions of the wire load lead to excessive cell delays of nearly 2 ns. Upsizing those cells or splitting huge net loads can significantly speed up the logic. Especially during the first IPO iterations such drastic improvements were made to parts of the design. In case of buffer insertion, the synthesis tool does not rely on wire load models to predict the length of newly created nets. Since cell locations are known from the PDEF information, a basic routing algorithm is used to calculate the net length based on the distance between both cells.
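The net length estimate just described can be sketched as follows. The Manhattan distance between the two cell locations (as known from the PDEF data) stands in for the routing estimate; the per-um resistance and capacitance values are made up for the example and are not real library data:

```python
def estimated_net_delay_ns(src, dst, res_per_um=0.4, cap_per_um=0.0002):
    """Estimate the delay of a newly created net from cell placement.

    src, dst: (x, y) cell locations in um, as read from the PDEF data.
    The Manhattan distance serves as a simple routing estimate; a lumped
    Elmore-style RC product then gives the delay. Unit resistance
    (ohm/um) and capacitance (pF/um) are illustrative values only.
    """
    length_um = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    r_ohm = res_per_um * length_um
    c_pf = cap_per_um * length_um
    return 0.69 * r_ohm * c_pf * 1e-3   # ohm * pF = 1e-3 ns
```

Because both R and C grow with length, the estimated delay grows quadratically with distance, which is exactly why breaking a long net with a buffer pays off.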

Figure 4-10. Timing optimization during IPO/ECO (upper part: WNS in ns per iteration, falling from 9.07 ns after ECO1 to about 1.4-1.6 ns in the final iterations; lower part: TNS in ns, falling from 21,415 ns to a few hundred ns, together with the cell utilization, which peaks at 95 % before the die enlargement and then drops to about 71 %)


After all possible optimizations are done, the new gate-level netlist is again transferred to the layout tool. An ECO step then compares the old and the new version of the netlist, making the necessary changes. During this process it might be necessary to move upsized cells, since their area increased. This again has side effects on all surrounding logic, leading to different net lengths and loads. So timing data can again vary between the results of the IPO during synthesis and after the ECO done by the layout tool. But normally the difference shrinks and the timing converges towards the goal.

All in all, 6 post-layout iterations were done for the ATOLL ASIC. Figure 4-10 shows how the global WNS and TNS improve from one optimization step to the next. An ECO result refers to the timing after updating the layout, whereas IPO refers to the numbers achieved after running an optimization in the synthesis tool. According to Figure 4-6, the first layout is done on a slightly violated netlist with 1.10 ns WNS and 700 ns TNS. Numbers after the first layout, called ECO1, are much worse. The WNS increases by a factor of 9, and the TNS even by a factor of 30. During all following iterations, the timing bounces up and down between ECO and IPO runs.
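The two metrics plotted in Figure 4-10 are computed from the per-path slack report in a straightforward way. A minimal sketch (sign convention: negative slack means a violated path; both metrics are reported as positive magnitudes, as in the diagram):

```python
def wns_tns(path_slacks_ns):
    """Worst Negative Slack and Total Negative Slack of a slack report.

    path_slacks_ns: one slack value per timing path (negative = violated).
    Returns (wns, tns) as positive magnitudes; (0, 0) if nothing violates.
    """
    violations = [s for s in path_slacks_ns if s < 0]
    wns = -min(violations) if violations else 0.0
    tns = -sum(violations)
    return wns, tns
```

WNS captures the single worst path (and thus the achievable clock period), while TNS sums all violations and therefore tracks the overall amount of remaining optimization work.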

The first 3 iterations reduce the timing slack a lot, but at the expense of area. In the lower part of the diagram, the cell utilization is given. The core area of an ASIC is subdivided into rows of cell slots, and the ratio of occupied vs. total slots is referred to as cell utilization. The higher this value, the more difficult it is for a layout tool to run an ECO, since it is very limited in its decisions where to place and move cells and where to route nets. This effect is visualized by the relatively small improvements made from ECO2 to ECO3. After running IPO3 in the synthesis tool, the cell utilization had grown to 95 %. The layout tool then refused to run an ECO on the netlist delivered by IPO3, since a few areas of the chip were so heavily congested that no more nets could be routed through them.

The decision was made to enlarge the die by adding 12 ‘not connected’ (NC) dummy I/O cells to each side of the I/O ring. Since an ECO can only be run on a fixed core area, a full Place & Route had to be done, referred to as ECO4 in the diagram. Cell utilization dropped below 70 %, but on the other hand all cells were newly placed, resulting in some more timing slack. The remaining iterations shrank the TNS to below 1,000 ns. Some paths remained violated and could not be optimized to meet their timing goal.

An in-depth analysis showed that most of these violated timing paths run through the PCI-X I/O cells. Some of the boundary scan logic was replaced during the die enlargement in a bad way, resulting in some very long paths. These paths were optimized a lot by buffer insertion and excessive cell upsizing, but about 20 paths still have a slack of 1-2 ns. The worst slack in the ATOLL clock domain was about 1.4 ns, with only a handful of paths above 0.5 ns slack.

Since the deadline for the design submission to fabrication was reached after running ECO6, the decision was made to accept the remaining slack and go into production. This decision is backed by two facts:

• the ATOLL core clock is configurable, so it can be tailored to the real capabilities of the chips

• all design steps were done based on worst-case technology data. In terms of the cell library used for ATOLL this means 125 °C temperature, 1.62 V core voltage (instead of 1.8 V nominal) and a bad process factor. In reality, things are not that bad. Assuming real-world conditions (70-80 °C, 1.8 V), timing should be about 20 % better than estimated by the Static Timing Analysis (STA) tools

4.5.2 Post-layout simulation

A simulation of the gate-level netlist with annotated SDF cell delays was executed to ensure that no logic got lost somewhere in the design flow, as happened at one point during the ACS synthesis runs. Another issue was the validation of the critical clock and reset logic, e.g. the PCI-X DLL, which includes a chain of delay elements. The same testbenches as for RTL simulation were used, just replacing the RTL design with the gate-level netlist. However, a few modifications were needed to get the simulation up and running. E.g., all DFFs are annotated with setup/hold timing checks. This is a problem for DFFs on clock domain borders, since the incoming asynchronous data will trigger setup/hold violations during simulation. Resulting metastable data outputs are suppressed by double-sampling these signals, as described earlier. But the simulation models do not reflect this behavior. They propagate an ‘x’ (unknown) value, causing the whole simulation to fail. This problem was solved by extracting a list of the affected DFFs during synthesis. A Perl script was then used to find the related entries in the SDF data and to disable the relevant timing checks.
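The approach of that script can be illustrated in a few lines. The sketch below uses Python instead of Perl and assumes a simplified, line-oriented SDF layout (real SDF is a nested s-expression format and needs more careful parsing); the cell block and instance names are invented example data:

```python
def disable_cdc_checks(sdf_lines, cdc_instances):
    """Drop SETUPHOLD timing-check lines for clock-domain-crossing DFFs.

    sdf_lines:     the SDF file as a list of lines
    cdc_instances: instance names of the affected DFFs, as extracted
                   during synthesis
    """
    out, in_cdc_cell = [], False
    for line in sdf_lines:
        if "(INSTANCE" in line:
            instance = line.split("(INSTANCE", 1)[1].strip(" )\n")
            in_cdc_cell = instance in cdc_instances
        if in_cdc_cell and "SETUPHOLD" in line:
            continue  # suppress the check: async input, double-sampled
        out.append(line)
    return out

# Invented example data in the simplified layout:
sdf = [
    '(CELL (CELLTYPE "SDFRPQ1")',
    '  (INSTANCE u_sync/meta_ff)',
    '  (TIMINGCHECK (SETUPHOLD (posedge D) (posedge CP) (0.12) (0.05))))',
    '(CELL (CELLTYPE "SDFRPQ1")',
    '  (INSTANCE u_core/data_ff)',
    '  (TIMINGCHECK (SETUPHOLD (posedge D) (posedge CP) (0.12) (0.05))))',
]
cleaned = disable_cdc_checks(sdf, {"u_sync/meta_ff"})
```

Only the check on the synchronizer flip-flop is removed; all checks on ordinary, single-clock-domain flip-flops remain active.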

Early versions of the netlist, which still had some timing violations, were used at a reduced clock speed to run one of the large regression testbenches. Of course, it was significantly slower than the RTL simulation. But the testbench ran for nearly two days without failure.



5 Performance Evaluation

Besides the aggressive scale of integration, implementing a high performance network was the primary goal of the ATOLL development. Two types of metrics are important in the field of Cluster Computing. Message latency measures the time needed to send data from one node of the cluster to another one. It lies in the range of a few microseconds for modern networks and is the dominant performance metric for applications with a fine-grain communication behavior. If a user application tends to exchange only a few bytes, but at a fast rate, the network should not slow down the program.

The second important factor is the sustained bandwidth a network can provide. It measures the actual data rate on a network link, compared to the physical bandwidth limit. A well developed network should come very close to the physical limit, proving to make efficient use of the network resources. Bandwidth is usually measured by setting up a continuous data stream with a fixed message size, sending the same message hundreds or even thousands of times. Summing up the total amount of data sent and dividing it by the time needed erases start-up effects and is the preferred method to measure the so-called sustained bandwidth of a network.
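The measurement procedure just described boils down to a few lines. A sketch with a stand-in `send` callable (a real benchmark would call the network's send primitive instead):

```python
import time

def sustained_bandwidth(send, msg_size, repeats=1000):
    """Measure sustained bandwidth: stream one fixed-size message many
    times and divide total bytes by total time, which amortizes the
    start-up effects away. Returns byte/s.
    """
    payload = bytes(msg_size)
    start = time.perf_counter()
    for _ in range(repeats):
        send(payload)
    elapsed = time.perf_counter() - start
    return msg_size * repeats / elapsed

# Stand-in transport: just collect the messages instead of sending them.
received = []
bw = sustained_bandwidth(received.append, 4096, repeats=100)
```

The division by total elapsed time is what distinguishes sustained bandwidth from a single-message measurement: the fixed per-message start-up cost shrinks relative to the payload as the repeat count grows.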

Those two metrics are usually measured on the application level. So no distinction is made between the performance of software and hardware. And usually the network hardware accounts for only a small part of the number, especially regarding latency. E.g. a typical parallel cluster application linked to the MPI message passing library may call an MPI_send function. This function implements only a high-level MPI layer and calls a mid-level point-to-point abstraction layer. This again can call a low-level ATOLL hardware layer. The system architecture of a cluster is also an important factor. The CPU bus interface, the memory controller and the I/O bridge can have a great impact on the performance of the whole system.

Since the performance of an ATOLL-equipped cluster can only be measured on a real system, this chapter focuses on measuring the performance of only the network part. No comparisons are made to other network solutions, since it would be unfair to compare ATOLL’s network-only numbers to full application-level numbers of other networks. Hardware-only performance values have not been published for other implementations.


Some performance measurements regarding the ATOLL software can be looked up in the dissertation of Mathias Waack [83].

All presented performance measurements were derived from simulations of the environment specified in “Simulation testbed” on page 112. Numbers are given first for a single host port in use. But since the multiple-interface architecture is a defining feature of ATOLL, both metrics are also given for two and all four host ports in parallel use. This provides an insight into the applicability of the ATOLL network for clustering dual- or quad-CPU machines.

5.1 Latency

The latency is measured both for PIO- and DMA-based message transfer. Since it usually scales linearly with an increasing message size, it is only quantified for relatively small messages. Latency is measured from the first access to the ATOLL device on the sending node until the last data transfer on the receiving side. For the PIO mode this means timing starts when the first routing word is written to the PIO send fifo. The transfer is complete when the last data word is read from the PIO receive fifo. Regarding the DMA mode, the measurement begins with triggering the DMA engine inside the host port by updating the relevant descriptor table pointer. And it ends with the receiving host port updating the relevant pointers in the replication area. This means that the assembling of messages in the data structures residing in main memory is not included in the timed process. Since the copying of message data into and out of the buffers in memory can be fully overlapped with the send/receive operations controlled by the ATOLL device, this is a proper reduction of the latency measurement to the crucial part of the whole operation.

Figure 5-1. Latency for a single host port in use (one-way latency in us over message size in byte: from about 2.4 us at 32 byte to about 3.6 us at 128 byte, with the PIO and DMA curves almost identical)

Figure 5-1 depicts the one-way latency for a single host port in use. Latency starts at about 2.4 us for a message size of 32 byte. It then slightly increases with the size. Surprisingly,

the difference between PIO and DMA mode is almost negligible. At first glance, one could assume that the PIO mode would be slightly faster, since it directly writes data into the network on the sending side. Reading data from memory in DMA mode has some more latency. But this is outweighed by the need to break up the PIO mode send operation into 6 different PCI-X cycles, according to the PIO send address layout (3 frames, 3 additional accesses for each last word of a frame). Some additional latency is also added on the receiving side, since the complete reception of a message is not immediately notified to the host CPU. The CPU polls the fifo fill level with a certain frequency, about one access every 0.3 us.

On the other hand, DMA mode needs only 3 PCI-X read transactions on the sending side (descriptor, routing, header & data combined). Once data enters the ATOLL chip, it is forwarded quite fast towards the network. E.g. it takes about 0.24 us for the first byte of a message to travel from the PCI-X bus to the network link.
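The rough accounting of the last two paragraphs can be put into a toy cost model. Only the 0.24 us chip traversal time and the 0.3 us polling period come from the simulation; the per-transaction and remaining-path costs below are invented so that the example reproduces the qualitative result (both modes end up close together), not measured values:

```python
CHIP_TO_LINK_US = 0.24   # first byte PCI-X -> link (from the simulation)
POLL_PERIOD_US = 0.30    # receive-side fifo polling period (from the text)

def pio_latency_us(write_txn_us=0.25, rest_us=1.0):
    """6 PCI-X write cycles on send, plus chip/wire/receive path, plus on
    average half a polling period before the CPU notices the message.
    write_txn_us and rest_us are invented constants."""
    return 6 * write_txn_us + CHIP_TO_LINK_US + rest_us + POLL_PERIOD_US / 2

def dma_latency_us(read_txn_us=0.55, rest_us=1.0):
    """3 PCI-X read transactions (descriptor, routing, header & data);
    reads are assumed costlier than writes due to bus turnaround."""
    return 3 * read_txn_us + CHIP_TO_LINK_US + rest_us
```

With these assumed costs, the higher per-transaction price of DMA reads cancels against the larger number of PIO writes plus the polling delay, which is the effect observed in Figure 5-1.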

Figure 5-2. Latency for multiple host ports in use (one-way latency in us over message size in byte for 1, 2 and 4 active host ports: from about 2.4 us with one host port at 32 byte up to about 3.8 us with four host ports at 128 byte)

Figure 5-2 shows that the number of host ports active in parallel has only a minor effect on message latency. Only DMA mode transfers were used. E.g. a 15 % increase in latency with all four host ports in use is an acceptable value, compared to a single host port. Further investigation revealed that the multiple accesses to the PCI-X bus from different host ports were smoothly interleaved by the logic in the port interconnect and the synchronization interface. During an active PCI-X transfer, several read requests can queue up in the related data paths and can be started as soon as the current transfer is finished. The idle time between back-to-back PCI-X transfers also depends on the performance of the bus arbiter. The simulation used 6 PCI-X cycles to switch control of the bus; a similar value should be found in real implementations.


All in all, the latency numbers taken from simulation are very promising. Even when operating in a quad-CPU node system with multiple message transfers in parallel, latency is remarkably low. This should distinguish ATOLL from other networks, which are typically multiplexed in software when installed inside an SMP node.

5.2 Bandwidth

Measuring the bandwidth of a network is normally done for very large messages. But simulating the sending of a 1 Mbyte message, and this even 10-100 times, was not practical in the described simulation environment. It would have required weeks of simulation runtime. So instead, messages between 1-4 Kbyte are used, repeated 10 times in a row. This should deliver a good insight into the performance of the ATOLL network regarding bandwidth.

Figure 5-3. Bandwidth for a single host port in DMA mode (bandwidth in Mbyte/s over message size in byte: 162 at 1,024 byte, 187 at 2,048 byte, 201 at 3,072 byte and 213 at 4,096 byte)

Figure 5-3 depicts the bandwidth of DMA-based message transfer for a single active host port. Reaching 213 Mbyte/s for a 4 Kbyte message is a very good number. These values will decrease by 5-15 % when adding software overhead, but they should still be very competitive. The bandwidth asymptotically approaches a maximum sustained bandwidth of 225-230 Mbyte/s, or about 90 % of the theoretical maximum bandwidth. This shows a good utilization of internal data paths.
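The shape of this curve follows the usual linear cost model T(m) = t0 + m/Bmax, so the sustained bandwidth is B(m) = m/T(m). With a start-up overhead of roughly 2 us and an asymptote of 230 Mbyte/s (both chosen by eye to match the numbers above, not fitted parameters from the simulation), the model reproduces the measured trend:

```python
def model_bandwidth_mbs(msg_bytes, t0_us=2.0, b_max_mbs=230.0):
    """Linear cost model: B(m) = m / (t0 + m / B_max), in Mbyte/s.

    t0_us (per-message start-up cost) and b_max_mbs (asymptotic
    sustained bandwidth) are eyeballed illustrative values.
    """
    transfer_s = t0_us * 1e-6 + msg_bytes / (b_max_mbs * 1e6)
    return msg_bytes / transfer_s / 1e6
```

The model makes explicit why the curve is still rising at 4 Kbyte: the fixed 2 us overhead is only amortized away for much larger messages, where B(m) approaches Bmax.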

Figure 5-4 gives an insight into the bandwidth for multiple host ports in use. Similar to latency, the impact of multiple parallel message transfers is relatively low. Bandwidth drops by only 9 % with all four host ports sending 4 Kbyte messages. The reason for the good scalability of the ATOLL device is the same as for latency. Multiple message transfers are handled very efficiently by the internal logic. The requests from all host ports to

read data from main memory are queued and served with minimum overhead. The bottleneck is the PCI-X bus, but with its physical bandwidth of 1 Gbyte/s it is still able to keep all requesters busy. Regarding the internal conversion of a 64 bit data stream into a byte-wide link protocol in the network port, it is sufficient for a host port to deliver one data word every eighth cycle. This rate can almost be held up, even with all host ports busy.

Figure 5-4. Bandwidth for multiple host ports in use (bandwidth in Mbyte/s over message size in byte for 1, 2 and 4 active host ports: from 213 at 4,096 byte with one host port down to about 195 with all four host ports active)

5.3 Resource Utilization

During the functional simulation of the ATOLL chip implementation, an efficient utilization of internal resources was a major goal, besides the validation of the logic. Early simulations brought up some bottlenecks resulting from too few on-chip resources, which were then resolved by enlarging fifo structures or performing similar enhancements.

Figure 5-5. Network link utilization (link utilization in % over message size in byte: from about 65 % at 1,024 byte up to about 92 % at 4,096 byte)


One major resource is the network link. Its efficient use is a crucial factor of overall performance. So one driving force behind the definition of the link protocol was to keep the control overhead for sending packets as low as possible. The amount of control bytes, as defined in the link protocol for message framing, reverse flow control and packet acknowledgment, is kept to the minimum possible.

Figure 5-5 shows the link utilization, given as the ratio of raw message data to total bytes sent over the link for a message. Control overhead is quite high below a message size of 1 Kbyte and makes up about a third of link traffic. But utilization quickly exceeds 90 % and approaches a maximum value of nearly 95 % for large messages.
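A simple model makes this behaviour plausible: per-packet control bytes plus a fixed amount of small head-of-message packets dominate for short messages and are amortized for long ones. All constants below are invented for illustration and are not the actual ATOLL link protocol values:

```python
def link_utilization(data_bytes, max_payload=64, ctrl_per_packet=4,
                     head_frames_bytes=96):
    """Ratio of raw message data to total bytes on the link.

    ctrl_per_packet:   framing/flow-control/ack bytes per link packet
    head_frames_bytes: routing and header frames (small link packets at
                       the head of each message)
    Both overhead constants are illustrative.
    """
    packets = -(-data_bytes // max_payload)   # ceiling division
    total = data_bytes + packets * ctrl_per_packet + head_frames_bytes
    return data_bytes / total
```

Because the per-packet overhead grows only linearly with the data while the head-frame cost is fixed, the utilization rises monotonically towards an asymptote just below 100 %.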

Control bytes account for only a small portion of link overhead. Most of the wasted bandwidth is caused by small link packets, which normally appear at the head of a message. The maximum size of 64 bytes is almost never used by link packets belonging to the routing and header frames. Most of the messages sent via the ATOLL network should manage to get along with 1-3 data words (up to 24 data bytes). These consecutive link packets introduce some idle time in the link ports, since only 2 packets can be in transit at one time. After both retransmission buffers have been filled, the link port waits for the positive acknowledgment of the first packet. Only when the POSACK control byte for this packet has been received is the buffer freed and the next packet transmitted.

Figure 5-6. Idle time introduced by small link packets (timeline of the outgoing and incoming link: two back-to-back 16 byte link packets fill both retransmission buffers, so the third, 64 byte packet can only be sent after the first POSACK has been received, introducing idle time on the outgoing link)

Though only about 20 idle cycles are introduced when sending two back-to-back link packets with 16 byte payload, this short period of a blocked data path propagates in both directions along the link. E.g. it causes the delayed issue of a read request to main memory in the host port due to the lack of space in the main send data fifo. Figure 5-6 depicts the situation mentioned, showing the idle time introduced by a late acknowledgment byte. But this situation is a rare event in the ATOLL architecture and the additional resources

needed for more retransmission buffers far outweighed the gain. So the decision was made to get along with two buffers.
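The effect can be reproduced with a toy timeline of the two-buffer protocol. Transmission is modeled at one cycle per byte; the POSACK for a packet is assumed to arrive a fixed number of cycles after the packet has left (40 cycles here, an invented value):

```python
def send_timeline(packet_sizes, ack_delay=40, window=2):
    """Toy model of a link port with two retransmission buffers.

    packet_sizes: payload bytes per link packet, in send order
    Returns (total_cycles, idle_cycles): the sender stalls whenever
    both buffers still hold unacknowledged packets.
    """
    t, idle, pending_acks = 0, 0, []
    for size in packet_sizes:
        pending_acks = [a for a in pending_acks if a > t]  # acks received
        if len(pending_acks) >= window:      # both buffers occupied
            idle += pending_acks[0] - t      # wait for the first POSACK
            t = pending_acks[0]
            pending_acks.pop(0)
        t += size                            # transmit (1 cycle/byte)
        pending_acks.append(t + ack_delay)   # its POSACK arrives later
    return t, idle
```

Two small 16 byte packets at the head of a message stall the third packet for a couple of dozen cycles, while a stream of full 64 byte packets keeps the link continuously busy, matching the behaviour shown in Figure 5-6.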

While message data travels through several top-level units in the ATOLL device, it is temporarily stored in multiple data fifos spread all over the architecture. So a data path from the PCI-X bus towards the network can be seen as a very deep pipeline, with about 40 register stages in total. The size of the fifos is tailored to provide a steady and continuous data stream. It should be prevented that any of the main units runs out of data, once a message transfer is started.

Figure 5-7. Fifo fill level variation (minimum to maximum fill level of the send-path fifos, from the PCI-X bus towards the network: 0-48 %, 0-75 %, 14-93 %, 24-88 % and 32-92 %, for the fifos in the synchronization unit (MasterRead), port interconnect (requester), host port (DMA-send), network port (send) and link port (send))

Figure 5-7 lists the variation of the fifo fill level along the data path for the DMA-send mode. It is measured while sending a relatively large message. The numbers display minimum and maximum fullness, once the message transfer has started. One can see that the first two fifos still run empty during transmission. Only a fraction of the total PCI-X bandwidth is used by a single path, so it is not necessary to keep these fifos filled. The closer one gets towards the network, the higher the average fill level of the fifos. Once started, the last 3 fifos never run out of data during a message transfer. The fill level still varies a lot, mainly just after the operation started. But once a few data words have been processed, a contiguous data stream keeps all units busy.
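The fill-level behaviour can be mimicked with a toy single-fifo simulation: a producer pushes one word every p cycles, a consumer pops one every c cycles. Depth and rates are invented; the point is only that a producer faster than its consumer keeps the fifo from running empty, as observed for the last three fifos:

```python
def fifo_fill_range(push_period, pop_period, depth=16, n_words=200):
    """Simulate one fifo stage; returns (min_fill, max_fill) observed
    after the start-up phase. All parameters are illustrative.
    """
    fill, pushed, t, trace = 0, 0, 0, []
    while pushed < n_words:
        if t % push_period == 0 and fill < depth:
            fill += 1                       # producer delivers a word
            pushed += 1
        if t % pop_period == 0 and fill > 0:
            fill -= 1                       # consumer drains a word
        trace.append(fill)
        t += 1
    settled = trace[len(trace) // 4:]       # ignore the start-up transient
    return min(settled), max(settled)
```

A fast producer (push every cycle, pop every 8 cycles, as in the 64-bit-to-byte conversion of the network port) keeps the fifo near full; the reverse rate relation lets it run empty, as the first two fifos in Figure 5-7 do.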

All performance measurements were done assuming no network contention. In reality, this assumption is of course too optimistic. So performance numbers will also decrease due to network congestion. But the performance values presented in this chapter demonstrate that the ATOLL architecture is capable of utilizing most internal resources to the maximum. The network is no longer the communication bottleneck in a cluster equipped with the ATOLL network. Instead, the interface to the host system, in this case the PCI-X bus, is the limiting factor. One can overcome this limitation only by locating the network interface closer to the CPU, e.g. on the system bus. But as mentioned earlier, this restricts the use of the NI to a single microprocessor architecture.



6 Conclusions

Clusters are emerging as a competitive alternative to Vector or MPP supercomputers in the field of High Performance Computing. The excellent price/performance ratio of mass-market PC technology makes it very attractive to assemble lots of desktop computers and tie them together with a high performance network. These Beowulf computers start to show up in rankings like the Top500 supercomputer list, and are even more popular for assembling small- to medium-sized clusters of 32-256 nodes.

The key component is a fast network. A new class of so-called System Area Networks emerged, since the traditional LAN/WAN technology was quickly identified as a performance bottleneck. SANs like Myrinet, SCI and QsNet offer message latencies in the order of 5-15 us and sustained bandwidths in the range of 100-300 Mbyte/s. But the price/performance ratio of most networks is still too high to gain a broader acceptance of these new SANs. So the majority of clusters is still equipped with standard Ethernet technology.

Recent trends in PC technology pose new problems to SANs. E.g. SMP desktop computers are becoming available at a price advantage, compared to multiple single-CPU machines. New clusters are often assembled with dual-CPU nodes. This offers a cost advantage, together with lower requirements for area, administration and power usage. And the next generation of microprocessor technology will even make 4-8 CPU SMP nodes an attractive node option, mostly based on better SMP support by CPUs.

To sum up, current SAN technology has helped Cluster Computing to make a big step forward, but there is still plenty of room for improvements in performance and cost.

6.1 The ATOLL SAN

This dissertation introduces a novel SAN version of the ATOLL architecture, derived from a first MPP version of the architecture. It integrates all necessary components of a network into one single chip, including the switch. ATOLL provides support for not only one, but four network interfaces by an aggressive replication of resources. Four byte-wide link interfaces running at 250 MHz offer the possibility to directly connect nodes in different topologies, without the need for any external switching hardware.


ATOLL offers a sophisticated mechanism to dynamically choose between PIO- and DMA-based message transfer. This supports extremely low start-up latency for small messages through a Programmed I/O interface, as well as high sustained bandwidth for large messages by autonomous DMA engines inside each host port. An efficient notification mechanism avoids the use of costly interrupts by a cache-coherent polling of status registers in main memory. The consecutive sending/receiving of messages can be done in an overlapping fashion, keeping the utilization of internal resources at a very high level.

An error detection and correction protocol avoids the need for end-to-end control of data transmissions. Bit errors introduced by environmental effects are discovered and resolved by a packet retransmission mechanism on each link. On the host side, a state-of-the-art PCI-X interface offers up to 1 Gbyte/s of I/O bandwidth. This is needed to serve the bandwidth requirements of the ATOLL core, which has a bisection bandwidth of 2 Gbyte/s on the network side.

Early performance evaluations promise extremely low latency and a very competitive sustained bandwidth for ATOLL. Multiple data transfers via all four host ports can be supported with a negligible performance impact. This will be an outstanding feature of a cluster composed of SMP nodes equipped with the ATOLL network card.

Besides redesigning the architecture of ATOLL towards a SAN, this dissertation also describes the implementation of the ATOLL ASIC. Its sheer size and complexity posed a lot of problems, which were solved by a well-planned design flow and a sophisticated design methodology. Despite not quite reaching the timing goal, a transistor layout of the chip was shipped to prototype production in February 2002. It is expected that the real chips can be operated at about 90 % of the targeted clock frequency.

The ATOLL ASIC is one of the most complex and fastest chips ever implemented by a European university. Recently, the design won third place in the design contest organized at the Design, Automation & Test in Europe (DATE) conference1, the premier European event for electronic design.

6.2 Future work

The future development of a second generation of ATOLL will be greatly influenced by technology trends in the whole computer industry. Upcoming interconnect standards like InfiniBand target the server-to-server connection. But it is still questionable whether it will really take over the whole range of I/O as predicted. Other emerging standards like 3GIO

1. www.date-conference.com

from Intel are still in the specification phase, but might become a serious contender for InfiniBand in the area of low-level I/O connectivity, as needed for graphics, I/O devices (keyboard, mouse, modem), etc. InfiniBand might be restricted to the field now known as Storage Area Networks, replacing such technologies as Fibre Channel and all variations of SCSI. But how both technologies coexist is still part of an ongoing discussion [84]. On the other hand, recent optimizations have been specified for existing technologies like PCI-X. The next generation of PCI-X, as specified in the upcoming v2.0 specification, will support dual- or quad-pumped data busses, raising the maximum physical bandwidth to 4 Gbyte/s.

Of course, the definition of the next generation architecture will also be greatly influenced by the experiences users gain with the first generation of ATOLL. E.g. the PIO mode takes up significant resources inside the chip. If the performance gain, compared to the DMA mode, is not large enough to justify the additional resources, it might be an alternative to drop it and use the freed resources for a better implementation of the DMA mode.

Mainly there are two options for the development of the next version of ATOLL. A smaller and easier-to-manage option is a so-called technology shrink. It is normally done by making only minor modifications to the overall architecture, but implementing it in the newest technology. So within 1-2 years, one could again implement almost the same architecture by the following steps:

• using a 0.10 um CMOS technology, targeting a core frequency of 500-700 MHz

• going from parallel copper cables to serial fiber links. The reduced pin count needed would also offer the possibility to increase the number of link interfaces to 6-8

• moving from a PCI-X v1.0 bus interface to a PCI-X v2.0 interface offering up to 4 Gbyte/s

• enlarging internal RAM/fifo structures to provide more buffer space

These enhancements would help to keep up with the progress in desktop technology, resulting in a network with three or four times the performance of the current version. This is a possible alternative if no major design flaws are detected during the widespread usage of the first version.

A more challenging approach would be a major redesign of the whole architecture. The current one is efficient, but not very flexible. Adding new features is almost impossible, due to the fixed implementation of all control logic in hardware. The current trend in IC design is moving towards so-called Systems-on-a-Chip (SoC) designs. Rather than designing all necessary logic from scratch, one assembles pre-built IP blocks into a larger

system. These blocks can be bus interfaces, like the PCI-X interface already used in ATOLL. But a lot of microprocessor cores are also available, e.g. MIPS, ARM, PowerPC, etc. So one could take advantage of the millions of transistors modern IC technology offers by a mixture of soft- and hardmacros, which are glued together by some amount of self-implemented logic. Recent market surveys have shown that the current percentage of chip area used by IP cells is about 20-30 %. Some reports [85] predict that this number will increase to 80-90 % within the next 5 years. This way, the implementation effort could be kept manageable, since most of the logic comes as completely verified transistor layout. More attention could be devoted to the definition of the overall architecture and its fine tuning for highest performance.

The architecture would head towards those of Myrinet, QsNet or the IBM SP network, which all have a programmable microprocessor as the key component of the NIC. But ATOLL would offer the whole system combined on a single chip. And with today’s advanced technology, even multiple controllers could be implemented. This could be used to split up the work into a host and a network side, avoiding any bottlenecks introduced by off-loading too much work onto a single controller.

Both options have their pros and cons. But both show that the ATOLL architecture has the potential to compete with the most advanced commercial solutions in the SAN market in the future.

Acknowledgments

This thesis would not have been possible without the help and contributions of many colleagues. My advisor Professor Dr. Ulrich Brüning was always available to discuss technical issues and provided a working environment which made it possible to finally complete one of the largest non-commercial chip projects.

The whole team of the Chair of Computer Architecture contributed to the development of the ATOLL network. My colleague Patrick Schulz took over crucial parts of the implementation, like the insertion of test structures, and provided significant support while setting up the complex functional testbed. And my colleague Mathias Waack implemented all the software necessary to make use of the ATOLL network from a user’s point of view.

I would like to thank all those individuals for their contributions, which gave me the possibility to complete my own work.

Bibliography

[1] Rajkumar Buyya (editor). "High Performance Cluster Computing: Architectures and Systems, Volume 1." Prentice Hall PTR, 1999.

[2] Gregory F. Pfister. "In Search of Clusters, Second Edition." Prentice Hall PTR, January 1998.

[3] William Gropp, Ewing Lusk, Anthony Skjellum. "Using MPI: Portable Parallel Programming with the Message-Passing Interface." MIT Press, 1994.

[4] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, Vaidy Sunderam. "PVM: Parallel Virtual Machine - A Users’ Guide and Tutorial for Networked Parallel Computing." MIT Press, 1994.

[5] Thomas Sterling, Daniel Savarese, Donald J. Becker, John E. Dorband, Udaya A. Ranawake, Charles V. Packer. "BEOWULF: A Parallel Workstation For Scientific Computation." CRC Press, Proceedings of the 24th International Conference on Parallel Processing, Volume I, pages 11-14, Boca Raton, FL, August 1995.

[6] The Top500 Supercomputer List, www.top500.org

[7] Ian Foster, Carl Kesselman (editors). "The Grid: Blueprint for a New Computing Infrastructure." Morgan Kaufmann Publishers, July 1998.

[8] Jakov N. Seizovic. "The Architecture and Programming of a Fine-Grain Multicomputer." California Institute of Technology (Caltech), Technical Report CS-TR-93-18, 1993.

[9] Duncan Roweth. "Industrial Presentation: The Meiko CS-2 System Architecture." ACM Press, Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, page 213, June 1993.

[10] Gordon E. Moore. "Cramming more Components onto Integrated Circuits." Electronics, Volume 38, Number 8, April 1965.

[11] Donald E. Thomas, Philip R. Moorby. "The Verilog Hardware Description Language, Third Edition." Kluwer Academic Publishers, 1996.

[12] Ben Cohen. "VHDL Coding Styles and Methodologies, 2nd Edition." Kluwer Academic Publishers, 1999.

[13] J. Bhasker, Sanjiv Narayan. "RTL Modeling Using SystemC." Proceedings of the International HDL Conference, San Jose, CA, March 2002.

[14] Peter L. Flake, Simon J. Davidmann, David J. Kelf. "SUPERLOG: Evolving Ver- ilog and C for System-on-Chip Design." Proceedings of the International HDL Conference, San Jose, CA, March 2000.

[15] Nagaraj, Frank Cano, Haldun Haznedar, Duane Young. "A Practical Approach to Static Signal Electromigration Analysis." ACM/IEEE Press, Proceedings of the 1998 Conference on Design Automation (DAC), pages 572-577, Los Alamitos, CA, June 1998.

[16] Patrick P. Gelsinger. "Microprocessors for the New Millennium - Challenges, Opportunities and New Frontiers." Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, February 2001.

[17] Sun Microsystems, Inc. "Sun Compute Farms for Electronic Design." Technical White Paper, 2000.

[18] Raoul A. Bhoedjang, Tim Rühl, Henri E. Bal. "User-Level Network Interface Protocols." IEEE Computer, Vol. 31, No. 11, pages 53-60, November 1998.

[19] Matt Welsh, Anindya Basu, Thorsten von Eicken. "Low-Latency Communication over Fast Ethernet." Proceedings EUROPAR-96, August 1996.

[20] Giuseppe Ciaccio. "Optimal Communication Performance on Fast Ethernet with GAMMA." Springer, Proceedings of the International Workshop on Personal Computers based Networks Of Workstations (PC-NOW), Lecture Notes in Computer Science (LNCS) 1388, Orlando, Florida, March 1998.

[21] Scott Pakin, Vijay Karamcheti, Andrew A. Chien. "Fast Messages: Efficient, Portable Communication for Workstation Clusters and MPPs." IEEE Concurrency: Parallel Distributed & Mobile Computing, Vol. 5, No. 2, pages 60-73, April 1997.

[22] Yueming Hu. "A Simulation Research on Multiprocessor Interconnection Networks with Wormhole Routing." IEEE-CS Press, Proceedings of the International Conference on Advances in Parallel and Distributed Computing, Shanghai, China, March 1997.

[23] B. H. Lim, P. Heidelberger, P. Pattnaik, M. Snir. "Message Proxies for Efficient, Protected Communication on SMP Clusters." IEEE-CS Press, Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pages 116-127, February 1997.

[24] Wolfgang K. Giloi, Ulrich Brüning, Wolfgang Schröder-Preikschat. "MANNA: Prototype of a Distributed Memory Architecture with Maximized Sustained Performance." Proceedings Euromicro PDP Workshop, 1996.

[25] Marco Fillo, Richard B. Gillett. "Architecture and Implementation of Memory Channel 2." Digital Technical Journal, Vol. 9/1, 1997.

[26] James Laudon, Daniel Lenoski. "The SGI Origin: A ccNUMA Highly Scalable Server." ACM Press, Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA), Computer Architecture News, Vol. 25/2, pages 241-251, June 1997.

[27] Colin Whitby-Strevens. "The Transputer." IEEE Press, Proceedings of the 12th International Symposium on Computer Architecture (ISCA), pages 292-300, Boston, MA, June 1985.

[28] Thomas Gross, David R. O’Hallaron. "iWarp: Anatomy of a Parallel Computing System." MIT Press, 1998.

[29] Robert O. Mueller, A. Jain, W. Anderson, T. Benninghoff, D. Bertucci, et al. "A 1.2GHz Alpha Microprocessor with 44.8GB/s Chip Pin Bandwidth." Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, February 2001.

[30] PCI Special Interest Group. "PCI-X Addendum to the PCI Local Bus Specification." PCI-SIG, Revision 1.0, September 1999.

[31] Ajay V. Bhatt. "Creating a Third Generation I/O Interconnect." Technology and Research Labs, Intel Corp., Whitepaper, 2001.

[32] Isaac D. Scherson, Abdou S. Youssef (editors). "Interconnection Networks for High-Performance Parallel Computers." IEEE-CS Press, 1994.

[33] Andrew A. Chien, Mark D. Hill, Shubhendu S. Mukherjee. "Design Challenges for High-Performance Network Interfaces." IEEE Computer, Vol. 31, No. 11, pages 42-45, November 1998.

[34] Jose Duato, Sudhakar Yalamanchili, Lionel Ni. "Interconnection Networks: An Engineering Approach." IEEE-CS Press, 1997.

[35] Prasant Mohapatra. "Wormhole routing techniques for directly connected multicomputer systems." ACM Computing Surveys, Vol. 30, No. 3, pages 374-410, September 1998.

[36] William J. Dally and Charles L. Seitz. "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks." IEEE Transactions on Computers, Vol. C-36, No. 5, pages 547-553, 1987.

[37] Allan R. Osborn, Douglas W. Browning. "A Comparative Study of Flow Control Methods in High-Speed Networks." IEEE-CS Press, Proceedings of the 12th Annual International Phoenix Conference on Computers and Communications, pages 353-359, Tempe, AZ, March 1993.

[38] David C. DiNucci. "Programming Parallel Processors: Alliant FX/8." Addison-Wesley, pages 27-42, 1988.

[39] Jeff Larson. "The HAL Interconnect PCI Card." Springer, 2nd International Workshop CANPC, Lecture Notes in Computer Science, Vol. 1362, Las Vegas, NV, February 1998.

[40] Hermann Hellwagner, Alexander Reinefeld (editors). "SCI: Scalable Coherent Interface, Architecture and Software for High-Performance Compute Clusters." Springer, Lecture Notes in Computer Science, Vol. 1734, 1999.

[41] Data General Corp. "AViiON AV 20000 Server Technical Overview." White Paper, 1997.

[42] G. Abandah, E. Davidson. "Effects of Architectural and Technological Advances on the HP/Convex Exemplar's Memory and Communication Performance." ACM Press, Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA), ACM Computer Architecture News, Vol. 26, No. 3, pages 318-329, June 1998.

[43] David X. Wang. "New Scalable Parallel Computer Architecture - Non-Uniform Memory Access (NUMA-Q)." IEEE Press, International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, NV, June 1997.

[44] Ch. Kurmann, Thomas Stricker. "A Comparison of two Gigabit SAN/LAN technologies: Scalable Coherent Interface versus Myrinet." Proceedings of the SCI Europe Conference (EMMSEC), Bordeaux, France, September 1998.

[45] David Garcia, William Watson. "ServerNet II." Springer, Proceedings CANPC Workshop, Lecture Notes in Computer Science, Vol. 1417, 1998.

[46] Alan Heirich, David Garcia, Michael Knowles, Robert Horst. "ServerNet-II: a Reliable Interconnect for Scalable High Performance Cluster Computing." Technical Report, Compaq Computer Corporation, September 1998.

[47] Compaq Computer Corp., Intel Corp., Microsoft Corp. "Virtual Interface Architecture Specification." Version 1.0, December 1997.

[48] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, Wen-King K. Su. "Myrinet - A Gigabit-per-Second Local-Area-Network." IEEE Micro, Vol. 15, No. 1, pages 29-36, February 1995.

[49] Myricom, Inc. www.myri.com.

[50] Myricom, Inc. "Guide to Myrinet-2000 Switches and Switch Networks." User Guide, August 2001.

[51] Steven S. Lumetta, Alan Mainwaring, David E. Culler. "Multi-protocol Active Messages on a Cluster of SMPs." ACM/IEEE-CS Press, Proceedings of the ACM/IEEE International Conference on Supercomputing, San Jose, CA, November 1997.

[52] Scott Pakin, Mario Lauria, Andrew Chien. "High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet." ACM/IEEE-CS Press, Proceedings of the ACM/IEEE International Conference on Supercomputing, San Diego, CA, November 1995.

[53] Loic Prylli, Bernard Tourancheau. "BIP: A New Protocol Designed for High Performance Networking on Myrinet." Springer, Lecture Notes in Computer Science, Vol. 1388, 1998.

[54] Joachim M. Blum, Thomas M. Warschko, Walter F. Tichy. "PULC: ParaStation User-Level Communication. Design and Overview." Springer, Lecture Notes in Computer Science, Vol. 1388, 1998.

[55] Myricom, Inc. "The GM Message Passing System." User Guide, 2000.

[56] Yutaka Ishikawa, Hiroshi Tezuka, Atsushi Hori, Shinji Sumimoto, Toshiyuki Takahashi, Francis O'Carroll, Hiroshi Harada. "RWC PC Cluster II and SCore Cluster System Software - High Performance Linux Cluster." Proceedings of the 5th Annual Linux Expo, pages 55-62, 1999.

[57] Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, Eitan Frachtenberg. "The Quadrics Network: High-Performance Clustering Technology." IEEE Micro, No. 1, pages 46-57, January 2002.

[58] Compaq Computer Corp. "Compaq Alpha SC Announcement." Whitepaper, November 1999.

[59] IBM. "RS/6000 SP: SP Switch2 Technology and Architecture." Whitepaper, March 2001.

[60] Craig B. Stunkel, Jay Herring, Bulent Abali, Rajeev Sivaram. "A New Switch Chip for IBM RS/6000 SP Systems." ACM/IEEE-CS Press, Proceedings of the ACM/IEEE International Conference on Supercomputing, Portland, OR, November 1999.

[61] InfiniBand Trade Association. "InfiniBand Architecture Specification." Vol.1, 2 & 3, Rev 1.0, October 2000.

[62] IBM. "InfiniBand: Satisfying the Hunger for Network Bandwidth." Whitepaper, June 2001.

[63] Ulrich Brüning, Lambert Schaelicke. "Atoll: A High-Performance Communication Device for Parallel Systems." IEEE-CS Press, Proceedings of the Conference on Advances in Parallel and Distributed Computing, pages 228-234, 1997.

[64] Peter M. Behr, Samuel Pletner, A. C. Sodan. "PowerMANNA: A Parallel Architecture Based on the PowerPC MPC620." IEEE-CS Press, Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, pages 277-286, Toulouse, France, January 2000.

[65] Synopsys, Inc. "DesignWare DW_pcix MacroCell Databook." October 2001.

[66] PCI Special Interest Group (PCISIG). "PCI-X Addendum to the PCI Local Bus Specification." Revision 1.0, September 1999.

[67] Ulrich Brüning, Jörg Kluge, Lars Rzymianowicz, Patrick Schulz. "ATOLL Hardware Reference Manual." Internal Databook, 2002.

[68] Berthold Lehmann. "Implementation of an efficient hostport for the ATOLL SAN-Adapter." Diploma Thesis, Institute of Computer Engineering, University of Mannheim, March 1999.

[69] Jörg Kluge, Ulrich Brüning, Markus Fischer, Lars Rzymianowicz, Patrick Schulz, Mathias Waack. "The ATOLL approach for a fast and reliable System Area Network." Third Intl. Workshop on Advanced Parallel Processing Technologies (APPT'99) conference, Changsha, P.R. China, October 1999.

[70] Michael J. S. Smith. "Application-Specific Integrated Circuits." Addison-Wesley, VLSI Systems Series, 1997.

[71] Pran Kurup, Taher Abbasi, Ricky Bedi. "It’s the Methodology, Stupid!" ByteK Designs, Inc., 1998.

[72] Lars Rzymianowicz. "FSMDesigner: Combining a Powerful Graphical FSM Editor and Efficient HDL Code Generation with Synthesis in Mind." Proceedings of the 8th International HDL Conference and Exhibition (HDLCON), pages 63-68, Santa Clara, CA, April 1999.

[73] Tim Leuchter, Lars Rzymianowicz. "Guidelines for writing efficient RTL-level Verilog HDL code." University of Mannheim, http://mufasa.informatik.uni-mannheim.de/lsra/persons/lars/verilog_guide

[74] Michael Keating, Pierre Bricaud. "Reuse Methodology Manual for System-On-A-Chip Designs, 2nd Edition." Kluwer Academic Publishers, June 1999.

[75] Clifford E. Cummings. "Synthesis and Scripting Techniques for Designing Multi-Asynchronous Clock Designs." Synopsys User Group Meeting (SNUG), San Jose, CA, March 2001.

[76] Peet James. "Shotgun Verification: The Homer Simpson Guide to Verification." Synopsys User Group Meeting (SNUG), Boston, MA, September 2001.

[77] Synopsys, Inc. "Automated Chip Synthesis User Guide, v2001-08." Synopsys Online Documentation (SOLD), August 2001.

[78] Steve Golson. "A Comparison of Hierarchical Compile Strategies." Synopsys User Group Meeting (SNUG), San Jose, CA, March 2001.

[79] Rick Furtner. "High Fanout without High Stress: Synthesis and Optimization of High-Fanout Nets using Design Compiler 2000.11." Synopsys User Group Meeting (SNUG), Boston, MA, September 2001.

[80] Patrick Schulz. "Design for Test (DfT) and Testability of a Multi-Million Gate ASIC." Diploma Thesis, Institute of Computer Engineering, University of Mannheim, November 2000.

[81] Harry Bleeker, Peter van den Eijnden, Frans de Jong. "Boundary-Scan Test: A Practical Approach." Kluwer Academic Publishers, 1993.

[82] Erich Krause. "Integration, Test and Validation of a PCI IP Macrocell into the Multi-Million Gate ATOLL ASIC." Diploma Thesis, Institute of Computer Engineering, University of Mannheim, April 2001.

[83] Mathias Waack. "Concepts, Design and Implementation of efficient Communication Software for System Area Networks." Institute of Computer Engineering, University of Mannheim, June 2002.

[84] IBM. "InfiniBand and Arapahoe/3GIO: Complementary Technologies for the Data Center." Whitepaper, February 2002.

[85] International Business Strategies, Inc. "Analysis of SoC Design Costs: A Custom Study for Synopsys Professional Services." Technical Report, February 2002.
