Cluster Computing 6, 95–104, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

InfiniBand: The “De Facto” Future Standard for System and Local Area Networks or Just a Scalable Replacement for PCI Buses?

TIMOTHY MARK PINKSTON ∗ University of Southern California

ALAN F. BENNER IBM Corporation

MICHAEL KRAUSE Hewlett Packard

IRV M. ROBINSON Intel Corporation

THOMAS STERLING California Institute of Technology

Abstract. InfiniBand is a new industry-wide general-purpose interconnect standard designed to provide significantly higher levels of reliability, availability, performance, and scalability than alternative I/O technologies. More than two years after its official release, many are still trying to understand what the profitable uses for this new and promising interconnect technology are, and how the technology might evolve. In this article, we provide a summary of several industry and academic perspectives on this issue expressed during a panel discussion at the Workshop for Communication Architecture for Clusters (CAC), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) in April 2001, in hopes of narrowing down the design space for InfiniBand-based systems.

Keywords: InfiniBand, I/O, system area network, fabric, interconnection network standard

1. Introduction

In an attempt to solve a wide spectrum of problems associated with server I/O, many commercial entities worked together to develop an industry-wide general-purpose interconnect standard called InfiniBand [1]. InfiniBand was designed to provide significantly higher levels of reliability, availability, performance, and scalability than could be achieved with alternative server I/O technology. In October of 2000, the first version of the InfiniBand specs was released with much fanfare. At its release, this non-proprietary, low-overhead, point-to-point communication standard was poised to become the interconnection network fabric technology on which commodity and high-end servers could be based [2].

The first generation of InfiniBand products has appeared, and prototype InfiniBand-based clustered applications have been demonstrated, but it is not yet clear in which areas InfiniBand technology will be most successfully employed as it matures. Since its release, many realize that InfiniBand is not a panacea and was never meant to be one. There has been much effort put towards understanding just what this technology is best used for, how it should be integrated into future systems, and how it might be improved. In addition, not everyone is enamored with this technology. Some claim it is too expensive, others that it is too complex, still others that it attempts to address too many disparate problems. Moreover, some believe that because of the way InfiniBand has been positioned, it directly competes with PCI, Ethernet, Fibre Channel, and other well-established industry standards and, thus, may never be widely accepted. While there is some validity in some of these claims and beliefs, the reality is that InfiniBand is the first technology to really solve the entire server I/O problem and much of the high-speed, low-latency inter-processor communication (IPC) problem within a single, open industry standard specification.

Nevertheless, since nature abhors a vacuum, it is likely that many vendors will continue to invest in evolutionary approaches to solve some of the same problems addressed by InfiniBand. It will take some time for any new technology targeted for the server market to gain a foot-hold – many believing that 2003/2004 will be the time frame at which InfiniBand could really start to take flight. More important than when, however, is the question of where it makes sense to deploy this new technology. What will be the possible application areas for InfiniBand: I/O interconnect, system area network (SAN), storage area network (STAN), or local area network (LAN)? Is it useful only for IPC, or might it also be useful as a unified network fabric (backbone) in servers, server clusters, and data centers? Is there interesting research to be done on InfiniBand architecture? Will InfiniBand have a significant impact on the way in which future systems are designed, or might it have only limited impact like some of its predecessors, e.g., VIA, SCI, etc.? These and other such questions were raised and debated during a panel discussion at the CAC Workshop, held in conjunction with IPDPS'01. As with many such panel discussions, a wide variety of views were expressed, with similarities as well as disagreements among them. This article represents an attempt to summarize and clarify the various converging and conflicting perspectives shared during that workshop, to help narrow the possible design space for InfiniBand-based systems.

∗ Corresponding author.

Figure 1. Conceptual diagram of InfiniBand's layered architecture.

2. InfiniBand overview

InfiniBand is a layered architecture that provides physical, data link, network, and transport layer services (see figure 1). At the physical and data link layers, its switch-based architecture allows for richly connected, arbitrary topologies to be configured with some degree of flexibility in routing across logical and physical channels. It provides scalable, increased I/O bandwidth for driving I/O at link rates from 2.5 Gbps to 12 times that rate, increased distance (as compared to PCI) of up to 300 meters, and standardized form factors for supporting a variety of simple to complex I/O solutions, including serial or parallel lines, copper or fiber links, and wide or tall modules. It also provides support for traffic prioritization, deadlock avoidance, and segregation of traffic classes. At the network and transport layers, it provides various types of connection-oriented and datagram communication services between consumers at network endpoints, including remote direct memory access (RDMA) and atomic operations. It also provides standardized fabric management services, fault isolation/containment, and reliability functions.

Figure 2. Conceptual diagram of an InfiniBand fabric.

Discrete message passing via send and receive queue pairs (QPs) and completion queue elements (CQEs) is supported, as shown in figure 2. Its programming model is derived from the Virtual Interface Architecture (VIA) [3]; however, InfiniBand is intended to enable the most efficient interface possible between a message passing interconnection network and a server's memory controller, to facilitate highly efficient data transfers. For example, data movement is via DMA, scheduled by fabric-connected devices, which enables data movement without CPU interaction. In support of this, InfiniBand is defined to make it practical to implement protocol stack processing in ASICs, with a strategy for integration of InfiniBand Host Channel Adapters (HCAs) and target channel adapters (TCAs) into server chipsets (see note 1). While chipset integration of other interconnection networks is certainly possible, InfiniBand was conceived to make the process easier and provide the highest performance and efficiency of any non-proprietary alternative. One such efficiency (see note 2) has been achieved by the use of a message passing network which can be used for nearly every kind of server I/O, perhaps making I/O buses like PCI superfluous. It is this application – as a single fabric for server I/O use – that causes the greatest speculation regarding its role juxtaposed to other standard alternatives, such as PCI (for I/O) and Ethernet (for IPC). This issue is addressed in the following sections.

Although many important elements have been specified in the standard, some details have been left for vendor innovation or have not been specified in the current version, possibly left for future improvement of the standard. For example, at the higher layers, operations over InfiniBand are strategically specified at a functional level using verbs in a vendor-neutral, operating-system-independent way. As application programmer interfaces (APIs) are included in the operating system, it is up to operating system vendors to decide how the verbs should be mapped to particular operating systems to support various APIs. While this level of specification purposefully allows for vendor differentiation, some details are not specified at any level. For instance, since wide-area network (WAN) and server network architecture issues and requirements are quite different from one another, there is no WAN support. Also, there is no support for cache-coherent non-uniform memory-access (cc-NUMA) data transactions, since it may not be possible to define a stable, vendor-neutral architecture. Moreover, at the lower layers, no strategy is specified for ensuring quality-of-service guarantees, computing deadlock-free routing paths, or updating forwarding tables in a deadlock-free manner when the network undergoes reconfiguration, other than by dropping packets. Such unresolved issues present opportunities for further research on InfiniBand architecture, some of which are actively being pursued [4–8].
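As a concrete illustration of the queue-pair model just described, the sketch below uses the OpenFabrics libibverbs interface, one later open-source realization of the verbs concept; the specification itself defines verbs only abstractly, so the exact calls shown are an illustration rather than part of the standard, and the buffer size, queue depths, and message contents are arbitrary choices for the example. Connection establishment (queue pair state transitions and the out-of-band exchange of queue pair numbers and port addresses) and most error handling are omitted for brevity.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Open the first host channel adapter (HCA) found on this node. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no HCA found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);  /* completion queue  */

    /* Register a buffer so the adapter can DMA it without CPU involvement. */
    static char buf[4096];
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf), IBV_ACCESS_LOCAL_WRITE);

    /* Create a reliable-connection queue pair: a send queue and a receive queue. */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

    /* ... the QP would be connected to a remote QP and moved to the
     *     ready-to-send state here (omitted) ...                     */

    /* Post a send work request: a discrete message, not a load/store. */
    strcpy(buf, "hello over the fabric");
    struct ibv_sge sge = { .addr = (uintptr_t)buf,
                           .length = (uint32_t)strlen(buf) + 1, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad_wr = NULL;
    ibv_post_send(qp, &wr, &bad_wr);

    /* Completion is reported asynchronously through the completion queue. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                       /* spin until the adapter posts a completion */
    printf("work request %llu finished with status %d\n",
           (unsigned long long)wc.wr_id, (int)wc.status);

    ibv_destroy_qp(qp); ibv_dereg_mr(mr); ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
    return 0;
}

The same queue pair abstraction carries RDMA reads and writes and atomic operations; only the opcode and the remote address/key fields of the work request change.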
3. InfiniBand as an I/O fabric

One of InfiniBand's original requirements was to be usable as a next generation I/O fabric for server systems. Given this, it has been positioned as a possible "replacement" for the PCI suite of general-purpose I/O interconnects. PCI and its derivatives represent a simple, cost-effective means for connecting a small number of devices to a server using a shared-memory programming model. As servers continue to grow in complexity, however, they are starting to outgrow the limitations of PCI's simplicity. Recently, advancements in PCI-based technologies (i.e., PCI 2.2, PCI-X, PCI-X 2.0, etc.) [9] and the advent of a variety of switched I/O fabrics such as RapidIO, HyperTransport, and PCI Express (formerly 3GIO) have caused a fair amount of confusion about the role of InfiniBand in the I/O arena.

Some of these I/O fabrics offer bandwidths of 400 Mbps to 16 Gbps (aggregate, full-duplex application bandwidth), inter-rack distances of up to 5 meters, standardized hot plug and swap capability, high fan-out attachment of multiple cards, and load/store/interrupt semantics which are software compatible with traditional PCI-based I/O. This would suggest prolonged usage of these I/O fabrics in future server systems. Nevertheless, there are a few very important capabilities that InfiniBand offers that these multi-drop and switched I/O fabrics do not. Among these are protection, partitioning, operating system (OS) by-pass, and transport level features.

PCI-based I/O technologies rely on a trust model of the highest degree since they provide open access to memory. Although misbehaving PCI devices could be relatively rare, the potential for intentional or unintentional user corruption in large database servers, for example, is unacceptable. The problem increases in scope when one considers the growing functionality and complexity of PCI-connected devices, which makes the possibility of errant operations even greater. This is only one of many such deficiencies inherent to PCI-based I/O architectures. InfiniBand separates itself from the PCI comparison by having additional I/O functionality features. Among these are a sophisticated virtual memory protection scheme (using registration), atomic and remote memory access, protection key-based fabric partitioning, support for multiple subnets, a rich set of connection-oriented and datagram transport functions, multiple logical channels (queue pairs) per channel adapter, architected operations queues on each channel, direct data placement into user space (i.e., OS by-pass), congestion management (i.e., automatic path migration for fail-over and load-balancing of data flows across different physical paths), and greater distance capability.

Logical partitioning is an especially useful feature supported by InfiniBand. With this, a single large server can be made to appear as multiple consolidated smaller servers of various sizes. For example, some systems (see note 3) can support up to several thousand virtual servers that are time-multiplexed across a single physical symmetric multiprocessor (SMP) machine, with virtualized processors, virtualized memory, and virtualized I/O for each virtual server. Virtualized I/O is, in some ways, the most difficult part, since I/O involves interaction with the outside world. With a load/store/interrupt interface to I/O devices, as is done in PCI, time-multiplexing between logical partitions in a protected way requires a great deal of overhead and complexity. With an InfiniBand-type queued-messages interface to I/O, each queue pair can be uniquely assigned to a logical partition, and the queue pairs can time-multiplex external interfaces independent of the host virtual servers. In concept, this is very similar to the idea of OS-bypass for user-space communications, except that rather than multiple user-space processes sharing a virtual network interface through their own queue pairs, kernels share virtual I/O adapters through end-node and fabric partitioning. The hardware HCAs and TCAs handle traffic multiplexing, access control, and scheduling to prevent one virtual server's kernel from adversely affecting another's. Since this is done in a well-defined, inter-operable way, it should allow industry standard TCAs to be shared among different OS kernels, greatly increasing the simplicity and availability of this key server consolidation technology.
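The protection-key mechanism behind this partitioning can be pictured with a small sketch. Every InfiniBand packet and queue pair carries a 16-bit partition key (P_Key) whose low 15 bits name a partition and whose high bit marks full versus limited membership; channel adapters discard traffic whose keys do not agree. The fragment below is a simplified, hypothetical illustration of that check (the partition values and function name are invented for the example), not an excerpt from the specification or from any adapter implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PKEY_FULL_MEMBER 0x8000u   /* high bit set: full member; clear: limited member */
#define PKEY_BASE_MASK   0x7FFFu   /* low 15 bits identify the partition               */

/* Hypothetical check a channel adapter might apply before delivering a packet
 * to a queue pair: both keys must name the same partition, and at least one
 * of the two endpoints must be a full member of it. */
static bool pkey_allows(uint16_t qp_pkey, uint16_t pkt_pkey)
{
    if ((qp_pkey & PKEY_BASE_MASK) != (pkt_pkey & PKEY_BASE_MASK))
        return false;                            /* different partitions */
    return (qp_pkey & PKEY_FULL_MEMBER) || (pkt_pkey & PKEY_FULL_MEMBER);
}

int main(void)
{
    uint16_t db_partition_qp  = 0x8012;  /* QP owned by a "database" logical partition */
    uint16_t web_partition_qp = 0x8034;  /* QP owned by a "web" logical partition      */

    printf("same partition:  %s\n", pkey_allows(db_partition_qp, 0x8012) ? "deliver" : "drop");
    printf("cross partition: %s\n", pkey_allows(db_partition_qp, web_partition_qp) ? "deliver" : "drop");
    return 0;
}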
Given the differences in system programming models, usage models, and functionality, considering IBA simply as a PCI replacement does not seem very accurate. It might be more properly said that InfiniBand can mitigate or possibly even eliminate many of the limitations of PCI buses in servers and server clusters. If there is truly a PCI replacement to emerge, it would likely be PCI Express (formerly 3GIO) [10], a new architecture implementing the PCI programming model on an IBA-like bundled-link serial connection. PCI Express will almost certainly provide additional capabilities beyond those provided by PCI, but it represents an evolutionary rather than a revolutionary strategy – InfiniBand changes the paradigm, espousing that server I/O should be done using message passing. For simple systems consisting of a single low-end server and a small population of added devices, the need for InfiniBand is certainly less urgent than for high-end servers or server clusters with numerous shared devices that could benefit from a highly functional I/O technology.
4. InfiniBand as a SAN/LAN fabric

InfiniBand could be used as a SAN/LAN fabric in high-performance server clusters – particularly between different kinds of servers – or in commodity clusters comprised of workstations and PCs. Clustering has been very important for many years in some specialized, low-volume applications where application fail-over and parallel processing or load distribution across many separate machines are important. Recently, the notion of cluster systems has grown in popularity for high-performance scientific computing as well as commercial computing. This is due to the fact that commodity clusters, including the Beowulf class [11–13], have been shown to provide a dramatic improvement in cost-performance ratio. Therefore, the development of a new industry standard communication subsystem that promises order-of-magnitude improvement and benefits from economy of scale through mass-market production must be considered an important potential opportunity for cluster-based computing systems.

Throughout the evolution of commodity (i.e., low-end) clusters over the last decade, the dominant constraining factor has been the interconnection network. While it is true that the effectiveness of commodity clusters is sensitive to the data movement demands of specific computational tasks, it is also apparent that the bandwidth, latency, cost, scalability, and other properties of the underlying network architecture have been both an enabler and an inhibitor to cluster systems. From systems leveraging commodity off-the-shelf (COTS) local area networks such as Ethernet and ATM to systems employing more specialized system-area and storage-area networks such as the SP2 switch, ServerNet II, Myrinet, cLAN, QSW, and SCI, the network infrastructure has determined the pace of advancement of commodity clusters as well as their impact on end-user applications in science, technology, defense, and commerce. For example, weakly-coupled compute-centric applications that rely less on the bandwidth and latency properties of the network could use low cost LAN technology such as 100 Mbps Fast Ethernet with average inter-node latencies of 100–200 µs. With commodity compute nodes, this, arguably, could provide the best price-performance, ranging from $0.25/Mflops to $1/Mflops sustained. However, other classes of more tightly-coupled data-centric applications that impose more frequent and synchronized exchanges of intermediate data among concurrent tasks and/or storage devices would work well only on cluster systems employing lower-latency, higher-bandwidth network technology such as Myrinet. Many quickly-growing e-commerce and database processing applications (among many others) requiring shared access to files or block storage devices fall into this latter category.

For a wide range of applications typical of Beowulf-class commodity clusters, the bandwidth requirement, arguably, is correlated with the floating-point performance of the computational nodes. One figure of merit is that one to four sustained bits per second is required for each sustained floating-point operation per second (i.e., 1–4 bps/flops sustained), depending on the application and system scale. Current generation COTS compute nodes incorporate one to four microprocessors, each with a peak performance of one to four Gflops. This is likely to increase to ten Gflops within the next few years, though their use in Beowulf-class systems may take longer. Typically, a sustained floating-point performance of approximately one quarter of this peak is achieved, although efficiencies of significantly less than that are possible. Thus, over the next one to three years, SAN/LAN per-port sustained bandwidth should, theoretically, be capable of between one and ten Gbps. InfiniBand's per channel bandwidth capability is consistent with this projected range needed by Beowulf-class clusters over the next few years. It remains to be seen, however, whether InfiniBand's price/performance will be sufficient to warrant change-over from entrenched Myrinet and Ethernet technology families, which continue to advance [14].
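As a rough worked example of this figure of merit (the node parameters below are representative values chosen for illustration, not measurements of any particular system), the required per-port bandwidth can be estimated as

    B_port ≈ r × p × R_peak × e   bits per second,

where r is the communication intensity (1–4 bps/flops sustained), p the number of processors per node, R_peak the peak rate per processor, and e ≈ 0.25 the sustained fraction of peak. A current node with p = 1 and R_peak = 4 Gflops at r = 1 bps/flops needs about 1 × 1 × 4 × 0.25 = 1 Gbps, while a near-term node with p = 4 and R_peak = 10 Gflops needs about 1 × 4 × 10 × 0.25 = 10 Gbps, bracketing the one-to-ten Gbps range quoted above.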
Latency requirements for SAN/LAN technologies are more difficult to quantify. Many latency-tolerant algorithms have been developed over recent years for a wide range of application classes, permitting effective use of commodity clusters for an increasing body of problems. For commodity clusters employing cLAN and Myrinet, for example, best-case end-to-end latencies at or below 10 µs can be achieved, allowing some applications to achieve superior overall performance as compared to using the Ethernet network family. It is expected that networks in the microsecond latency range will significantly enhance the utility of such clusters. This target must be realizable across cluster systems comprising hundreds or even thousands of nodes. While the commercial sweet spot for Beowulf-class clusters is centered around 64 processors (plus or minus a factor of two), there is much need for low cost systems in the multi-Teraflops performance regime integrating up to 10,000 processors. Even with this many nodes, the worst-case unloaded network latency should not exceed a few microseconds, if possible. As the optimal topology may vary depending on the scale of the system, another requirement is that the network should support a diversity of topologies. Consistent with this, InfiniBand allows the formation of arbitrary network topologies with an estimated sub-microsecond pin-to-pin switch latency.

In addition to similar trends in bandwidth and latency requirements, the server industry has seen a trend toward denser rack-mounted chassis and a newly emerging blade design strategy, particularly in high-end machines. With blade-based design, servers are reduced essentially to "cards" in a backplaned cage which, in turn, is rack-mounted. Server clusters may be configured using homogeneous nodes or heterogeneous mixtures of thin appliance blades, SMPs, and mainframe systems. Increasing the density as such requires increasing the interconnect's scalability, expandability, and efficiency (form-factor). This makes the movement towards a single fabric capable of handling both IPC and shared storage I/O in server clusters even more urgent. InfiniBand includes in its architectural specifications various mechanical form factors and a rich set of modules with a decidedly server orientation compatible with blade-based design. In addition, InfiniBand provides a means of enabling inter-processor and storage communication on the same interconnection network fabric, between many servers and many storage subsystems. These capabilities are, perhaps, among InfiniBand's greatest strengths.

In such environments, communication between different servers or commodity compute nodes could be done with standard TCP/IP over Ethernet or some other industry standard network. However, for many applications, the overall cluster performance would be limited by TCP/IP protocol processing overhead, which is optimized for long-distance communications over unreliable, long-latency, low-bandwidth links. This is more overhead than is necessary in a cluster environment. As stated previously, clustering fabrics have existed for many years on specific platforms but have been either proprietary (single-source), like the SP2 switch [15] and ServerNet II [16], and/or not inter-operable across other networking platforms, like Myrinet [17]. The need has arisen for an efficient, high-bandwidth, low-overhead, open and inter-operable cluster interconnect fabric such as InfiniBand, which has provisions for "raw" packets to be transported to targets which use a protocol other than that defined by the network architecture. Several other technologies have been promulgated in the past (e.g., FDDI, Fibre Channel, HIPPI-6400, etc.), but none have gained a widespread foot-hold. InfiniBand appears to be the first to get right the combination of wide industry support, clear hardware/software architecture support with sufficient protocol offload to channel adapter hardware, and the performance capabilities necessary to work sufficiently well in this context.

Some particular advantages of InfiniBand in this area are that the components are available from multiple sources and are being supported inter-operably across a multiplicity of different server platforms. In addition, InfiniBand's support for high performance discrete message passing (as opposed to TCP's stream-based orientation) makes it especially convenient to solve the IPC problem. Nevertheless, stream-based mechanisms (such as Sockets) could be mapped over InfiniBand's transport layer. Since the InfiniBand architecture is optimized around tightly-coupled clusters with a high proportion of the protocol processing functions offloaded to high-function channel adapters, the processing overhead will be relatively low, leaving more processing power available for application-level functions. Similarly, the optimization of the HW/SW functionality split leads to a better likelihood of effectively utilizing the full amount of link bandwidth available with InfiniBand than with many other industry standard technologies.

5. InfiniBand as a STAN fabric

Perhaps the biggest open question is whether InfiniBand can compete against Fibre Channel and Ethernet for storage area networking and network-attached storage (NAS) traffic, respectively. InfiniBand may not be viewed as a solid contender due not to any major technical issues but, rather, to economic realities of the industry as a whole. Storage vendors need to use technology that will be ubiquitous and provide connectivity to existing infrastructure while enabling new services such as high-speed remote mirroring or remote data locality management. So far, the only two interconnects of major interest that have surfaced are Fibre Channel and iSCSI, which transports SCSI operations (e.g., disk block read and write operations) over TCP/IP network interfaces. If InfiniBand components can become readily and cheaply available, the high efficiency and throughput of InfiniBand fabrics could be used very effectively for interconnecting servers with storage devices such as RAID or tape systems, thus providing fewer translation levels and a less expensive storage infrastructure than current common practices allow. The problem, however, is that competing technologies are continuing to develop.

Storage management and operating system support are dramatically improving, allowing Fibre Channel solutions to span the entire gamut of server and storage design points. With the addition of new long-haul optics and a draft standard for Fibre Channel over IP, new distributed storage solutions for disaster protection and distributed content management are forthcoming. What's more, iSCSI is a new standard being developed and backed by nearly every server, storage, storage management, network equipment provider, and operating system vendor. The potential for this technology is quite high. Its primary benefits are that it is a consolidated unified fabric and storage interconnect capable of delivering differentiated services to meet service level agreements and capable of delivering a security infrastructure to maintain customer privacy. In addition, it is able to support the same storage services to all endnodes (servers, storage, desktops, appliances, laptops, etc.) independent of their locality, over wired and wireless Ethernet solutions (including 10 GbE). Given these developments, it is possible that the majority of storage devices will use Fibre Channel or Ethernet fabrics instead of InfiniBand for economic and accessibility reasons.

Figure 3. Conceptual diagram of a typical data center.

6. Putting it all together: possible applications for InfiniBand

As stated previously, a major trend in server design recently has been the push towards server packaging in form factors that look more like telecommunications equipment than traditional computer equipment. Vertically-oriented blades containing electronic components are inserted into midplane or backplane cards of racks, and the racks provide aggregated power, packaging, cooling, and cabling for dozens or hundreds of cards. As silicon technology trends continue to drive more on-chip integration, and the relative prices of silicon and cards decrease with respect to fans, power supplies, sheet metal and cables, we may expect to see hundreds or thousands of general-purpose and special-function processors on blades aggregated together into single-rack or multi-rack systems.

Key questions for this emerging system design philosophy are "How tightly integrated will the different blades be?" and "What communication mechanisms will most commonly be used?" This is an area where InfiniBand has a strong chance of being a key technology. As noted before, there are several other alternatives besides InfiniBand. Compact PCI systems (as opposed to conventional PCI), for example, have supported low-bandwidth inter-blade communication across a shared PCI-derivative backplane for many years. Alternatively, blade servers could use Ethernet-based backplane switches to provide LAN-in-a-box performance, and some blade servers may use SMP backplane interconnects to provide very tight coupling across a small number of blades. However, for highly-efficient backplane communication across hundreds or thousands of blades, there are no industry-standard networks that compare to InfiniBand technology.

As an example of possible target areas for InfiniBand deployment, let us consider the data center. One possible data center configuration is illustrated in figure 3. It is comprised of three major subsystems upon which the four standard data center service tiers (access, web, application, and database/back-end) are supported. The processing subsystem is composed of servers, appliances, and I/O chassis/modules. Servers and appliances provide computational and data manipulation services, while I/O chassis/modules provide the communication services to the other subsystems. The unified fabric subsystem is composed of switches, routers, and appliances used to interconnect the processing subsystem, the storage subsystem, and the outside world (Internet or private LAN). The storage subsystem is composed of storage endnodes including disk arrays (RAID), tape libraries, storage area networks, etc. Assuming these three subsystems, we examine how InfiniBand might compare against other interconnect technologies applicable to each domain.

6.1. Processing subsystem

Within the processing subsystem, the PCI technology suite currently is the dominant server/appliance I/O point of attachment. With the advent of PCI-X 2.0 and serial PCI (3GIO), this may remain true for lower-end systems for at least the next 10 years, for many of the reasons discussed previously. In this case, InfiniBand could use PCI derivatives as the point of attachment to insert an InfiniBand Host Channel Adapter (HCA) into a server, to provide low-latency IPC between server endnodes and for server-to-I/O-module expansion chassis (i.e., PCI bridging) [18]. In this scenario, InfiniBand is not required to be an intrinsic component of the server enclosure but, rather, can be added as a hot-plug capability to any existing or future low-end server design. That is, lower-end server chipsets can be optimized to provide a single, well-understood and high-speed I/O interconnect technology (e.g., the PCI technology suite) that can be quickly adapted to whatever the customer requires. Then, InfiniBand HCAs with an extremely simplified management infrastructure can be used to provide attachment to external I/O chassis to increase I/O scalability. These I/O chassis can further be dedicated or shared among a set of hosts, providing a cost-effective, low-footprint solution. On the other hand, high-end servers could implement InfiniBand as the native I/O infrastructure. In this case, PCI derivatives would attach below InfiniBand, mainly for backward compatibility with legacy adapters; InfiniBand would be where PCI hubs attach. This use of InfiniBand is likely to be several years out, but it is already planned for some of IBM's high-end servers, e.g., zSeries mainframes (previously, S/390), iSeries business and financial servers (previously, AS/400), and the pSeries SMPs (previously, RS/6000 systems).

A possible emerging alternative to using InfiniBand for IPC within the processing subsystem is to use Gigabit or 10 Gigabit Ethernet (10 GbE) with lighter-weight Send/RDMA/OS-bypass protocol capability, which may be on the horizon. This could deliver high-bandwidth solutions across the subsystem fabric and to the outside world, which is something that InfiniBand may not be well designed for. However, Ethernet switches typically have latency overheads in the range of 4–9 µs (as measured from the time a packet starts to enter the switch until it starts to leave the switch), which is still unacceptable. InfiniBand switches will typically have latencies in the 50–200 ns range. Hence, InfiniBand is likely to provide direct benefit for such application environments. The key question is whether or not Ethernet switch developers will take steps to aggressively lower their switch latency to within the competitive range so as to prevent InfiniBand from gaining a foot-hold here.
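To put these switch latencies in perspective, consider a packet that crosses three switch stages between two nodes (an illustrative path length, not one prescribed by either technology). The switch contribution alone is roughly

    3 × (4–9 µs) ≈ 12–27 µs for Ethernet,   versus   3 × (50–200 ns) ≈ 0.15–0.6 µs for InfiniBand.

The former by itself exceeds the few-microsecond end-to-end budget discussed in section 4, while the latter leaves nearly all of that budget for the channel adapters and software.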
6.2. Unified fabric and storage subsystems

The unified fabric subsystem is the nexus within the data center for communication between the processing subsystem, the storage subsystem, and the outside world. This subsystem is composed of switches, routers, and appliances, with each element providing different levels of basic and value-add services (e.g., firewalls, load-balancers, virtual private network gateways, quality-of-service controls, etc.). It need not be composed of only one type of network technology. For instance, this fabric could be composed of three separate use-specific networks – one for efficient, short-distance communication to devices within the processing subsystem; one for connecting to other networks; and one for connecting to storage subsystem devices. On the other hand, the unified fabric could be composed of one all-purpose network technology.

As alluded to earlier, there are several key attributes needed by the fabric interconnect: the use of open, standard interfaces with well-understood end-to-end link protocols/semantics and established compliance and inter-operability; seamless integration with the Internet; low-cost, high-volume components that can be integrated across the entire spectrum of price/performance design points; support for multiple physical implementations, allowing the link protocol to operate across arbitrary distances while meeting performance requirements within the data center, metropolis, and wide-area; and the ability to facilitate rapid innovation while providing forward and backward customer investment protection. Many interconnects have tried to deliver these attributes, but it is argued that only one all-purpose network is able to solve all the problems – Ethernet. Currently, Ethernet is considered the "de facto" interconnect of choice if the unified fabric subsystem is to be implemented by one technology. With the upcoming release of 10 GbE, which defines a physical layer that supports distances of up to 40 km (the protocol is capable of operating over any distance within this range at maximum performance), data center administrators would be able to leverage the same technology everywhere and reap the corresponding benefits and cost savings.

Given InfiniBand's decidedly upper hand in supporting low-latency IPC within the processing subsystem, might it challenge Ethernet within the unified fabric subsystem for this specific use? This would follow the argument that the center of investment for Ethernet is moving toward metropolitan-area networks (MANs); therefore, SANs and STANs will be optimized using application-specific technologies such as InfiniBand and Fibre Channel, respectively. Under this scenario, future systems will need a mixture of both general-purpose blades, which operate as present-day uni- or multiprocessor servers, and special-purpose blades, which will operate as I/O adapters attached as devices on a separate server. Thus, delivering the value-add functionality supported in today's unified fabric subsystem will require the industry to integrate blade modules that can translate between different protocol domains, e.g., that of InfiniBand and other network technologies.

With InfiniBand, different blades can be flexibly configured into servers of varying performance by logically connecting queue pairs together and running different software functions on each one. Each blade can be separately optimized for specific functions, e.g., using a network processor for the networking blade, a general-purpose processor for the application blade, etc. For example, a multi-blade server that allows true TCP offload could be configured such that the TCP networking blade terminates TCP sockets and communicates with the host on a different blade.
Also, a separate storage blade or storage interface blade can offload storage functions. That is, the application blade makes file-level requests (NAS storage requests) to the storage blade, which then makes SAN block-level requests to the disks. These disks may be in the same rack or in different racks, connected through other InfiniBand links. Another possible system could be a multi-blade server that has router functionality. Each blade operates as a line card in a high-throughput router, and an InfiniBand-switched backplane operates as the switching core of the router. This allows the construction of a server/router with extremely high throughput and close tie-in to application blades that are internally interconnected through an efficient, flow-controlled and protected protocol. All of these examples require efficient, well-controlled cluster communications between separately-operating devices over short distances in local environments, which is a key design target for InfiniBand.

Such modularization represents new and exciting business and technology opportunities. However, it is likely to be seen as a threat to many network equipment providers and, thus, may only be adopted reluctantly. This being the case, the unified fabric subsystem may continue to be dominated by Ethernet in the short term but, eventually, could migrate to an optimized fabric composed of two or three use-specific networks, with InfiniBand being the technology of choice for low-latency, short-distance IPC. This scenario is likely to come sooner rather than later as multi-blade server clusters grow in popularity.

7. Conclusion

InfiniBand was designed to solve a set of server I/O problems, with extended support for low-latency inter-processor communication also included. In this article, areas have been identified where InfiniBand currently has essentially no open industry-standard, well-accepted competitors: low-latency, high-bandwidth fabrics for commodity clusters, virtualized I/O, and multi-blade tightly-coupled servers or heterogeneous server clusters. Each of these applications is for a different type of system, but all will eventually be important as inter-operable parts of sophisticated multi-tier enterprise and Internet data centers. Due to economic realities, it is highly unlikely that InfiniBand will be used as an all-purpose data center backbone with any great success anytime soon, but it is likely to play a significant role as a use-specific network within the data center. It is likely that InfiniBand will be deployed primarily within the processing subsystem initially; but, depending on market forces, it could possibly be used within the storage subsystem as well. One related concern in the arena of commodity clusters is whether subsetting of the standard might reduce its ability to exploit economy of scale, especially given that, as it is, commodity clusters make up only a relatively small portion of the computer marketplace. This remains an open issue.

One of the clear long-term trends in computer design has been the increasing complexity of systems. In the past, computer systems were very simple and centralized, with a single processor and a single operating system managing applications across a single main memory and a set of I/O cards. In the present and future, however, computer systems will continue to move toward a distributed, more autonomic model that, in some senses, works by analogy with the human body. Just as dozens of different specialized organs are tightly coupled together in the body and communicate with each other through a central nervous system to work as an integrated unit, future server complexes will have dozens of modules or blades, each specialized and/or configurable to perform specific operations as part of an integrated system connected by an underlying fabric. A single integrated system might consist of particular blades or modules dedicated to storage, to TCP/IP communication with the outside world, to running numerically-intensive applications, to cryptography, to transaction processing, or to managing resources across the rest of the system. Accordingly, different virtualized portions of the server or server cluster will work independently yet cooperatively to configure, heal, optimize, and protect themselves and the system as a whole. In such a system model, the interconnect fabric would act as the "central nervous system" type of communication mechanism that ties everything together. That fabric must be efficient, distributed, flexible, partitionable, reliable, scalable, and inter-operable across the whole integrated system. From a technical point of view, InfiniBand appears well positioned to play an integral part in such a fabric for the foreseeable future.

Will the advantages offered by InfiniBand be significant enough to warrant a change-over from the evolutionary path of existing server and cluster interconnect subsystems? Many indicators suggest "yes", particularly for certain optimized uses, but perhaps it is still too early to tell. After all, InfiniBand is only a specification, not a product – it is the actual products that, ultimately, will determine the final outcome. Nonetheless, the underlying concept of a tightly-coupled, multi-layered interconnect architecture with the potential of wide deployment across a broad range of server I/O products is exciting and well worth waiting for.

Notes

1. Chipsets is a term for the collection of IC devices in a server that incorporate memory controllers, microprocessor interconnects, and I/O bridges.
2. Efficiency in this case refers to the ability to provide the greatest I/O bandwidth with the least processor overhead.
3. IBM's zSeries eServers is one such system.

References

[1] InfiniBand Trade Association, InfiniBand Architecture Specification, Vol. 1, Release 1.0a (October 2000). Available at http://www.infinibandta.com.
[2] G.F. Pfister, An introduction to the InfiniBand architecture, in: Proceedings of the Cluster Computing Conference (Cluster00) (November 2000) Ch. 57.

[3] Virtual Interface Architecture Specification, Version 1.0 (December 1997), http://www.viarch.org.
[4] J. Pelissier, Providing quality of service over InfiniBand architecture fabrics, in: Proceedings of the Symposium on Hot Interconnects (August 2000).
[5] J.C. Sancho, A. Robles and J. Duato, Effective strategy to compute forwarding tables for InfiniBand networks, in: Proceedings of the International Conference on Parallel Processing, September 2001 (IEEE Computer Society Press, 2001) pp. 48–57.
[6] P. Lopez, J. Flich and J. Duato, Deadlock-free routing in InfiniBand through destination renaming, in: Proceedings of the International Conference on Parallel Processing, September 2001 (IEEE Computer Society Press, 2001) pp. 427–434.
[7] T.M. Pinkston, B. Zafar and J. Duato, A method for applying double scheme dynamic reconfiguration over InfiniBand, USC Technical Report (March 2002).
[8] F.J. Alfaro, J.L. Sanchez and J. Duato, A strategy to manage time sensitive traffic in InfiniBand, in: Workshop on Communication Architecture for Clusters (CAC'02) (April 2002).
[9] PCI SIG, PCI Specifications, www.pcisig.com/specifications.
[10] A.V. Bhatt, Creating a Third Generation I/O Interconnect (White Paper), http://www.pcisig.com/data/news_room/3gio/3gio_whitepaper.pdf (2001).
[11] T.L. Sterling, J. Salmon, D.J. Becker and D.F. Savarese, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters (MIT Press, Cambridge, MA, 1999).
[12] T. Sterling, Beowulf Cluster Computing with Linux (MIT Press, Cambridge, MA, 2001).
[13] T. Sterling, Beowulf Cluster Computing with Windows (MIT Press, Cambridge, MA, 2001).
[14] C.L. Seitz, Recent advances in cluster networks, in: International Conference on Cluster Computing (Keynote Speech) (October 2001).
[15] C. Stunkel et al., The SP2 high-performance switch, IBM Systems Journal 34(2) (1995) 185–204.
[16] D. Garcia and W. Watson, ServerNet II, in: Proceedings of the 2nd PCRCW (Springer, 1997) p. 109.
[17] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J. Seizovic and W. Su, Myrinet – A gigabit per second local area network, IEEE Micro (February 1995) 29–36.
[18] C. Eddington, InfiniBridge: An integrated InfiniBand switch and channel adapter, in: Proceedings of the Symposium on Hot Chips (August 2001).

Timothy Mark Pinkston completed his B.S.E.E. degree from The Ohio State University in 1985 and his M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1986 and 1993, respectively. Prior to joining the University of Southern California (USC) in 1993, he was a Member of Technical Staff at Bell Laboratories, a Hughes Doctoral Fellow at Hughes Research Laboratory, and a visiting researcher at IBM T.J. Watson Research Laboratory. Presently, Dr. Pinkston is an Associate Professor in the Computer Engineering Division of the EE-Systems Department at USC and heads the SMART Interconnects Group. His current research interests include the development of deadlock-free adaptive routing techniques and optoelectronic network router architectures for achieving high-performance communication in parallel computer systems – massively parallel processor (MPP) and network-based (NOW) computing systems. Dr. Pinkston has authored over fifty refereed technical papers and has received numerous awards, including the Zumberge Fellow Award, the National Science Foundation Research Initiation Award, and the National Science Foundation Career Award. Dr. Pinkston is a member of the ACM and a senior member of the IEEE. He has also been a member of the program committee for several major conferences (ISCA, HPCA, ICPP, IPPS/IPDPS, ICDCS, SC, CS&I, CAC, PCRCW, OC, MPPOI, IEEE LEOS, WOCS, and WON), the Program Chair for HiPC'03, the Program Co-Chair for MPPOI'97, the Workshops Chair for ICPP'01, and the Finance Chair for Cluster 2001. Recently, he has served as an Associate Editor for the IEEE Transactions on Parallel and Distributed Systems (1998–2002).

Alan F. Benner is a member of the Server Technology Architecture and Performance Group in IBM's eServer development division, working in hardware and software for high-performance server networking. He received his B.S. in physics from Harvey Mudd College, and M.S. and Ph.D. degrees in physics from the University of Colorado at Boulder. Between 1986 and 1988 he was at the Photonics Networks and Components research department of AT&T Bell Laboratories. Since 1992, Dr. Benner has been with IBM, at several development labs and at the Zurich Research lab. His primary interests have been in optical and electronic networking, with particular impact on development of the RS/6000 SP parallel supercomputer, enterprise-scale Internet switch/routers, and the InfiniBand architecture for server I/O and high-performance clustering. E-mail: [email protected]

Michael Krause is a senior I/O architect at Hewlett Packard, where he has worked for the last 17 years. Michael is responsible for HP platform I/O architecture and is the HP technical lead for InfiniBand and 3GIO. In addition, Michael has served as the IBTA Link Workgroup Co-Chair and works within the IETF with a focus on new technologies used to create ubiquitous use of RDMA/OS Bypass/Direct Data Placement capabilities that will provide customers with high-performance/QoS-based service delivery.

Irv M. Robinson is Architecture Director for the Fabric Components Division of Intel Corporation. He is also the Co-Chair of the Technical Working Group of the InfiniBand Trade Association. His interests include clustered servers and interconnects, parallel databases, and transaction processing systems.

Thomas Sterling received his Ph.D. from MIT in 1984 and has held research scientist positions with the Harris Corporation's Advanced Technology Department, the IDA Supercomputing Research Center, and the USRA Center of Excellence in Space Data and Information Sciences. In 1996 Dr. Sterling received a joint appointment at the NASA Jet Propulsion Laboratory's High Performance Computing group, where he is a Principal Scientist, and the California Institute of Technology's Center for Advanced Computing Research, where he is a Faculty Associate. For the last 20 years, he has engaged in applied research in parallel processing hardware and software systems for high performance computing. Sterling was a developer of the Concert shared memory multiprocessor, the YARC static dataflow computer, and the Associative Template Dataflow computer concept, and has conducted extensive studies of distributed shared memory cache coherence systems.

In 1994, Dr. Sterling led the team at the NASA Goddard Space Flight Center that developed the first Beowulf-class PC clusters, including the Ethernet networking software for the Linux operating system, and is an author of the 1999 book, "How to Build a Beowulf," published by MIT Press. Since 1994, Sterling has been a leader in the national Petaflops initiative, chairing three workshops on petaflops systems development and chairing the subgroup on the Petaflops computing implementation plan for the President's Information Technology Advisory Committee. He is also an author of the book, "Enabling Technologies for Petaflops Computing," published by MIT Press in 1995. Sterling was the Principal Investigator for the interdisciplinary Hybrid Technology Multithreaded (HTMT) architecture research project sponsored by NASA, NSA, NSF, and DARPA, which ended in 2000. Currently, Sterling is Principal Investigator leading the Gilgamesh Architecture research project sponsored by NASA.