InfiniBand Architecture: Bridge Over Troubled Waters

That is a bad bridge which is shorter than the stream.

— German proverb

Research Note
David Pendery
Jonathan Eunice
27 April 2000

The PCI peripheral expansion bus has had a long and illustrious history. Since its inception in 1991, system vendors and users have embraced it like few technical standards before or since. PCI provides a substantial volume of the I/O bandwidth and peripheral connectivity across the range of RISC to CISC; PC to enterprise server; proprietary to commodity. User requirements at the advent of the 21st century, however, have rapidly evolved. Not only has computer performance advanced enormously, the very landscape of IT use and connectivity has changed. The PCI standard we have converged on and relied upon for close to a decade is being rapidly outstripped by the demands of ever larger databases, transaction loads, and network user bases. The bridge is beginning to look shorter than the stream.

Fortunately, help is on the way. The InfiniBand™ Architecture is the industry's answer to the growing I/O problem. InfiniBand replaces the bus-based PCI with a high-bandwidth (multiple gigabytes per second) switched network topology, and shifts I/O control responsibility from processors to intelligent I/O engines commonly known as channels. These approaches have long enabled the world's largest servers; InfiniBand now brings them down to virtually every server. InfiniBand is not yet a product, nor even really a standard. The first full specification won't be available until this summer, with the first products appearing in 2001. Initial indications, however, are greatly encouraging. InfiniBand is the right technological advance, emerging at the right time and for the right reasons. To employ a bit of adolescent patois, InfiniBand rocks.


Copyright © 2000 Illuminata, Inc., 187 Main Street, Nashua, NH 03060, 603.598.0099, www.illuminata.com


Presto, Change-O

When PCI established itself in the early 1990s, 66 MHz processors and 10 Mbps networks were fast. 0.8 micron CMOS semiconductor fabrication was state of the art. Early transaction processing benchmarks churned out a whopping 54 transactions per minute.[1] Data warehousing had just been invented. Client-server applications and deployments were increasing, but only the digerati had email, and the Internet as we know it was still years distant.

What a difference a decade makes! Today, multi-terabyte databases running on clustered servers, if not exactly commonplace, are a reality in many shops. Storage has been decoupled from the server, and often extended over a storage-optimized network (SAN). Intel's Pentium III Xeons, now the workhorse of servers not just PCs, are fabbed at 0.18 micron and run at 800 MHz; 0.13 micron, 1 GHz chips are on the way. The top TPC-C server does 135,815 transactions per minute, and the Internet is now the workshop of IT.

These are the new reality, driving ever-higher user expectations. Fast, unencumbered I/O is the lifeblood of this evolving corpus. Never has such variety of I/O been required to link such scale of hardware and software in such transparent and accelerated ways. And yet, never before have the incumbent I/O technologies been so outstripped by processor capabilities.

PCI = Problematic Computing Interface?

Introduced in response to a morass of incompatible peripheral connectivity and I/O options of a decade ago, PCI has been a blessing. Over time it expunged the alphabet-soup that was AT/ISA, EISA, HP-PB, MCA, VME, NuBus, SBus, and TurboChannel, among others. It ushered in a long period of wide industry acceptance of a single standard, and thus a stability and predictability that made both product development and selection pleasingly straightforward.

PCI not only standardized I/O attributes, it enabled high bandwidth. Its initial 133 MBps[2] may seem modest today, but it greatly outpaced then-standard ISA's maximum 10 MBps and EISA's 33 MBps data transfer rates. As a commodity standard, it minimizes cost to achieve high shipment volumes. Even so, PCI has neatly outperformed virtually all alternatives, including those quite proprietary and specialized. Then, in a classic case of volume sales providing the investment dollars needed to move a heretofore commodity product upmarket, PCI has dramatically extended its reach. Enhanced versions have doubled both clock speed and bus width, making 264 MBps easily achieved today, with 500+ MBps options available. The HotPlug PCI extension made PCI suitable for high availability servers, and its CompactPCI derivative has driven into embedded systems and telco gear.

Although its attributes promised it a long life, PCI's very architecture is ultimately limiting. PCI is built upon that simple connectivity structure, the parallel bus. The simple, economical bus structure has been at the base of so many electronic products for so long that it's virtually taken for granted. Yet busses have inherent drawbacks:

• Disorderly contention for resources by peripherals, memory, and CPUs. Disorder breeds inefficiency and suboptimal performance.

• Vexing failure modes. Not only is the bus a potential single point of system failure, failure isolation is difficult or impossible. If one attached card fails, it can cause the entire system to fail. Worse, discovering which card caused the failure is at best a hit-or-miss proposition—a misery in a world needing high availability.

• Severe physical stipulations and limitations. As bus length increases to accommodate more, or more widely dispersed, expansion devices, signaling properties become less stable. The same thing is true for clock rates. The faster the modulation, the shorter the feasible bus, and the fewer peripheral interconnects are possible. In the extreme case, the 133 MHz defined for PCI-X, there can only be a single connector per bus!

PCI's shared structure cannot keep up on a performance basis, nor are its manageability and availability attributes acceptable. As next-generation computing platforms are being planned and implemented, PCI will gradually be left behind, as antiquated as 66 MHz microprocessors and 40 MB disk drives.

[1] The first TPC-C result, published in 1992.
[2] Megabytes per second. Bandwidth figures are nominal, not typical. Such naive peak rates don't consider practical slowdowns such as contention and protocol overhead.
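The bandwidth figures quoted for PCI and its extensions all follow from one piece of arithmetic: nominal bandwidth is clock rate times bus width. A quick back-of-the-envelope check, in C and purely for illustration (as footnote 2 notes, these are peak figures, not sustained throughput):

```c
#include <stdio.h>

/* Nominal parallel-bus bandwidth: clock (MHz) x width (bits) / 8 bits-per-byte.
 * Peak figures only; contention and protocol overhead (footnote 2) push
 * real-world throughput well below them. */
static double bus_mbps(double clock_mhz, int width_bits)
{
    return clock_mhz * width_bits / 8.0;
}

int main(void)
{
    printf("PCI    33 MHz x 32 bits: %4.0f MBps (the original 133 MBps)\n",  bus_mbps(33.33, 32));
    printf("PCI    66 MHz x 32 bits: %4.0f MBps (the 264 MBps noted above)\n", bus_mbps(66.0, 32));
    printf("PCI    66 MHz x 64 bits: %4.0f MBps (the 500+ MBps options)\n",   bus_mbps(66.0, 64));
    printf("PCI-X 133 MHz x 64 bits: %4.0f MBps (roughly 1 GBps)\n",          bus_mbps(133.0, 64));
    return 0;
}
```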



Incremental Upgrades

One could continue to improve PCI a bit, or work around its limitations. Servers needing both high bandwidth and large numbers of expansion slots, for example, are often outfitted with multiple, independent PCI buses. This comes at a cost, of course, but averts an immediate capacity crisis.

The latest PCI-X revision goes further, cleaning up the electrical signal definitions to drive towards 1 GBps (133 MHz x 64 bits).[3] It's a significant and promising extension that will extend PCI's life by several years. Even improvements as extensive as PCI-X, however, have ever diminishing returns. The writing is on the wall. Despite PCI's notable run of success, and the fact that it will remain with us for years to come, its ultimate headroom is limited. Bus architectures are fundamentally outpaced by our users' and applications' voracious need for data, and thus high rates of I/O. Rather than more patches, what we now need is a jump as dramatic as PCI was when it was first introduced. As Mitch Shults, Intel's point man on I/O strategies, says, "the industry has got to move to some fundamentally new architecture." Enter InfiniBand.

On the Way to IBTA

The road to a future I/O standard has been rocky. Even for PCI, vendors were reluctant to give up their favored proprietary options. Sun, for example, while it has supported PCI, to this day favors its own SBus design in its premium servers. But the vastly better economics of a single standard, both for IT producers and consumers, has won the day.

The once-divergent groups such as NGIO (Next Generation I/O, led by Intel) and Future I/O (led by IBM, Compaq, and HP) cast their fates together in August 1999, a move that led to the foundation of the InfiniBand Trade Association (IBTA). The IBTA is largely based on the successful PCI SIG model. Even after formation there have been some disagreements about how quickly InfiniBand should appear, and how encompassing it should be when it does appear. There have also been tensions between IBTA members and external constituencies such as the embedded systems community. But this is to be expected. The free market is contentious by nature. And, as they say, you can't make an omelette without breaking a few eggs. At the end of the day, these participants know that the customer uptake rate for their next-generation servers depends on solving I/O bottlenecks, and on not creating a divisive standards war. Thus, whatever disagreements they may have, they are all highly motivated to find a common and standard solution.

IBTA leaders (officially, "Steering Members") IBM, Intel, Compaq, Hewlett-Packard, Dell, Microsoft, and Sun reason that it's better to have a smaller group get something practical and effective out the door than to hear everyone's wishlist. In addition to the Steering Members, Sponsoring Members include 3Com, Adaptec, Cisco, Fujitsu-Siemens, Hitachi, Lucent, NEC, and Nortel Networks. It's a potent brain trust, among them the owners of the best I/O technologies and intellectual property in the industry.

The Goods

InfiniBand is the cavalry to the rescue, the I/O standard and workhorse emerging for the new generation. So what exactly is it?

InfiniBand is a network approach to I/O. A system connects to the I/O "fabric" with one or more Host Channel Adapters (HCAs). Devices, such as storage and network controllers, would attach to the fabric with a Target Channel Adapter (TCA). InfiniBand adapters (generically, CAs) are addressed by IPv6 addresses, just as any other network node might be.

The "fabric" concept may seem abstract to someone who's used to fitting a card in a slot, but it's exactly what happens on any other network, whether of the traditional LAN/WAN/Internet variety, or the storage area networks (SANs) now rapidly entering data centers. The physical fabric combines connectors, cables, and switches. Current specifications call for one-, four-, and twelve-wide link options, corresponding to 500 MBps, 2 GBps, and 6 GBps bandwidths.[4][5]

[3] A speed that was aggressive for even the best system busses just five years back.
[4] Serial links are conventionally described in bits/sec, not the bytes/sec of parallel links. Each InfiniBand width drives 2.5 Gbps (250 MBps) in each direction. Doing the math, 4-wide = 10 Gbps (1 GBps/direction), 12-wide = 30 Gbps (3 GBps/direction).
[5] When implemented in copper links, each "width" unit uses two wire pairs for differential signalling, resulting in 4-, 16-, and 48-wire copper connections.
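The link bandwidths quoted above follow directly from footnote 4's arithmetic: each width unit carries 2.5 Gbps of signalling, roughly 250 MBps of data, in each direction, and the aggregate figures count both directions. A small illustrative check in C:

```c
#include <stdio.h>

/* Per footnote 4: each InfiniBand "width" unit signals at 2.5 Gbps,
 * delivering roughly 250 MBps of data in each direction. */
#define GBPS_PER_WIDTH 2.5
#define MBPS_PER_WIDTH 250.0

int main(void)
{
    const int widths[] = { 1, 4, 12 };
    for (int i = 0; i < 3; i++) {
        int w = widths[i];
        double gbps_dir = w * GBPS_PER_WIDTH;   /* signalling rate, one direction   */
        double mbps_dir = w * MBPS_PER_WIDTH;   /* data rate, one direction         */
        double mbps_agg = 2.0 * mbps_dir;       /* both directions, as quoted above */
        printf("%2d-wide: %4.1f Gbps/direction, %4.0f MBps/direction, %4.0f MBps aggregate\n",
               w, gbps_dir, mbps_dir, mbps_agg);
    }
    return 0;
}
```

Running this reproduces the 500 MBps, 2 GBps, and 6 GBps figures for the one-, four-, and twelve-wide links.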



[Figure: an InfiniBand link. A host Channel Adapter connects to a target Channel Adapter over an InfiniBand link; messages are broken into packets and carried over virtual lanes; the physical wire is a four-wire link of differential pairs feeding RJ45-like ports.]

Whereas PCI distances can be easily measured in inches or centimeters, InfiniBand links are designed to reach ~17 meters (data center distances) using copper cabling, or 100 meters (intra-building or small-campus distances) with fibre optic links. The connectors will resemble today's RJ45 connectors and ports. Extenders, protocol switchers, and fibre cabling may increase this a bit, say to 1 km (with multimode fibre) or a few km (with single mode fibre). The 1,000 km common to WAN links are impractical given the need to minimize end-to-end latency.

Regardless of the number of wires (i.e., bandwidth grade) or physical dispersion, InfiniBand uses a single set of logical structures for how nodes are addressed (IPv6), what protocols and APIs are used, and how the components are pieced together.

Time-to-market issues will make early TCA implementations equivalent to HCAs, but later refined implementations will rapidly cost- and space-minimize TCAs to enable high-volume sales and inclusion into denser and more embedded configurations. Card-sized CAs will give way to multi-chip semiconductor implementations, then single-chip, and finally modules that can be optionally included in CPUs and ASICs. As with most I/O options, high-end servers, storage arrays, and peripherals will be first to implement and deploy InfiniBand. These are the units that most need the added performance, and for which the higher initial costs will be most easily absorbed.

InfiniBand Everywhere

Ultimately, "InfiniBand everywhere" will be the rallying cry, just as PCI expanded its purview to both larger and smaller systems. Within a few years, we predict that InfiniBand will be the default way of connecting servers to other servers (in clusters and MPP systems), to storage (somewhat displacing Fibre Channel SANs, especially in rack- and room-area fabrics), and to network adapters and infrastructure (including directly into Internet routers and switches).

Despite these high-end ambitions, as the successor to PCI, InfiniBand is still about in-chassis I/O, shipped in high volume and at low price points. This deployment, which makes a switched network a rack- and motherboard-level feature, will remake system form factors. 2U and 1U rack-and-stack servers may seem like dense computing today, but InfiniBand's small connectors, flexible cabling, and network approach will fundamentally compress computing complexes. Within a few years, expect today's 0.5–1U per CPU densities to fall well under 0.5U/processor, perhaps beneath 0.2U per. Density isn't everything—cost and high-availability are also key—but ISPs, ASPs, and other service providers will be particularly glad to further minimize IT footprints.

Though further out than server and workstation deployments, embedded computing is another area of InfiniBand opportunity. Network switches, telco gear, wireless hubs, industrial automation, and telemetry units are all eventual targets.[6]

Part and parcel of the InfiniBand transformation will be the leveraging of switch design skills and investments at system, network, and storage OEMs to support generalized InfiniBand fabrics. In great measure, InfiniBand will work because it brings so many strong players to the table.

[6] Albeit in competition with the still-viable CompactPCI and Motorola's emerging RapidIO initiative.
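The "single set of logical structures" noted above is easiest to appreciate from a programmer's perspective: whatever the link width or medium, a channel adapter is simply a network endpoint with an IPv6 address. The following sketch is purely illustrative (the structure and field names are our own, not the IBTA's); it only shows how naturally standard IPv6 machinery applies to CAs.

```c
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>   /* struct in6_addr, INET6_ADDRSTRLEN */
#include <arpa/inet.h>    /* inet_pton, inet_ntop */

/* Hypothetical record for a channel adapter. The point is only that a CA is
 * addressed like any other IPv6 node, regardless of its link width or medium. */
struct channel_adapter {
    struct in6_addr addr;        /* IPv6 address of the HCA or TCA            */
    int             link_width;  /* 1, 4, or 12                               */
    int             is_target;   /* 0 = host adapter (HCA), 1 = target (TCA)  */
};

int main(void)
{
    struct channel_adapter tca;
    memset(&tca, 0, sizeof tca);
    tca.link_width = 4;
    tca.is_target  = 1;

    /* Example address only. */
    if (inet_pton(AF_INET6, "fe80::1234:5678:9abc:def0", &tca.addr) != 1) {
        fprintf(stderr, "bad address\n");
        return 1;
    }

    char buf[INET6_ADDRSTRLEN];
    inet_ntop(AF_INET6, &tca.addr, buf, sizeof buf);
    printf("%s adapter at %s, %d-wide link\n",
           tca.is_target ? "Target" : "Host", buf, tca.link_width);
    return 0;
}
```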



Changing the Guard

Saying that InfiniBand is a networked I/O standard is true, but hardly scratches the surface of the design. It is, for example, also a channel-based approach.

Instead of the memory-mapped "load/store" paradigm of PCI, InfiniBand uses a message-passing "send/receive" model. This, in concert with the endpoint addressability, is essential in ensuring utterly robust, reliable operations. Transmissions are demarcated into distinct "work queue pairs," with packets distributed and disseminated throughout the InfiniBand network. Adapters take on the responsibility for handling transmission protocols, and InfiniBand switches take on responsibility for making sure packets get where they're supposed to be. This distribution of work is common, for example, in S/390 mainframes.[7]

Tom Bradicich, IBM's Intel server technologist and a prime mover behind InfiniBand, is fond of using a "mailman" metaphor. CPUs and hosts pass data into memory for use by targets, and then move along to other tasks, just as a mailman drops your messages and moves along to the next house. This functioning is "fundamental to the specification." Wrenching I/O off the PCI bus, imbuing it with higher-order organization schema, and pressing it into better-managed and more tightly controlled service inside and outside the box is InfiniBand's raison d'être.

The controlling mechanisms are quite sophisticated. Addressing nodes with IPv6, for example, will allow easy and direct linkage with Internet routers and gateways. And while physical layer implementations are organized around a given number of "wires," the logical structure is very general. The links are bi-directional and composed of up to 16 "virtual lanes," any of which a given packet may travel.
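To make the contrast with PCI's load/store model concrete, here is a deliberately simplified sketch of the send-queue half of a work queue pair. All of the names and structures are our own illustration, not the IBTA's still-evolving verbs and data structures; the point is only the division of labor: the host posts a work request describing a message and moves along, mailman-style, while the adapter and switches take it from there.

```c
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>   /* struct in6_addr */

#define QUEUE_DEPTH 16

/* A work request: "send this buffer to that endpoint". The CPU only fills
 * this in and posts it; it does not shepherd the data byte by byte. */
struct work_request {
    struct in6_addr dest;      /* IPv6 address of the remote channel adapter */
    const void     *buf;       /* message payload in host memory             */
    size_t          len;
};

/* One half of a hypothetical work queue pair (the send queue). A real adapter
 * would pair this with a receive queue and raise completions; omitted here. */
struct send_queue {
    struct work_request wr[QUEUE_DEPTH];
    int head, tail;
};

/* Host side: post a request and return immediately ("drop the mail"). */
static int post_send(struct send_queue *q, const struct work_request *wr)
{
    int next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head)
        return -1;             /* queue full */
    q->wr[q->tail] = *wr;
    q->tail = next;
    return 0;
}

/* Adapter side: drain the queue, segment messages into packets, and hand
 * them to the fabric. Here we just print what would happen. */
static void adapter_process(struct send_queue *q)
{
    while (q->head != q->tail) {
        struct work_request *wr = &q->wr[q->head];
        printf("adapter: sending %zu-byte message (segmented into packets)\n", wr->len);
        q->head = (q->head + 1) % QUEUE_DEPTH;
    }
}

int main(void)
{
    struct send_queue sq = { .head = 0, .tail = 0 };
    const char msg[] = "block write to remote storage controller";
    struct work_request wr = { .buf = msg, .len = sizeof msg };
    memset(&wr.dest, 0, sizeof wr.dest);   /* destination address elided */

    post_send(&sq, &wr);       /* the CPU's involvement ends here        */
    adapter_process(&sq);      /* the channel adapter does the I/O work  */
    return 0;
}
```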

The idea is getting "server I/O onto the network," and ultimately the Internet. This goes well beyond the remote I/O, for example, found in a few of today's newest high-end servers. There are rich possibilities with this flexible methodology. The technology's switched design, message/packet basis, fat pipes, and extensive controlling mechanisms will underpin architectures and network schema for the next decade.

Thinking Outside The Box

Future IT will be largely dictated by Internet-style networked computing. In some ways, the Internet mindset is simply an extension of trends that had been developing for three decades. Over time, compute functions have become steadily more atomized and distributed, devices have become more intelligent, client-server has been integrated into the Web, and the local network has extended into a global network. In short, data have steadily been cast farther away from their home bases. I/O—by definition the movement of data—has of necessity had to be integrated across wider spans.

This dilative phenomenon is "the externalization of I/O." External I/O requires common protocols to link the traffic between the computing devices and controlling mechanisms referred to above.

The application is king. But oftentimes, applications and databases are starved for data. There are bottlenecks, latencies, and congestion that simply arrest performance. Here, InfiniBand will make a difference. Very large databases are now in the terabyte and above range, with some spanning to 50 TB. Very large, distributed engines are needed to process such information. These massively parallel systems or compute farms are essentially clustered systems. Whether the DBMS provider is IBM or Oracle, NCR or Compaq's Tandem division, these clusters require high-bandwidth, low-latency interconnects. Specialized proprietary designs are the common result. IBM's SP switch, NCR's BYNET, and Compaq's ServerNet and Memory Channel are commonly used in their largest distributed engines.

[7] Perhaps there really is nothing new under the sun!



[Figure: two block diagrams. In the first, a host's CPUs and memory connect through an HCA and DMA engine over the system interconnect to an InfiniBand switch, which links via 1-, 4-, or 12-wire links to TCAs and DMA controllers fronting the device functions being controlled (disks, a compression engine, etc.), either directly or by way of IB switches. In the second, the same host-side structure connects through channel adapters and an InfiniBand switch to a network router, for network connections over InfiniBand or other network protocols.]
As InfiniBand moves into its more refined switched generation (in 2002-2003), it will provide exactly the sort of high-speed, low-latency packet switching needed by these clusters. Its support for cascaded switches, fabric partitions (also called zones), and inherent multicasting is well-suited for large clusters. This prowess combines with its "industry standardization" to drive what have been high-end cluster technologies into the mainstream.

Once fully formed, InfiniBand will enable massive horizontal scalability and transparent I/O sharing among cluster nodes. Indeed, just as InfiniBand brings the intelligent channel idea down from the S/390, it will enable the intelligent-everything (CPU module, disk, network controller) model of Compaq's Himalayas. This is the key to not only huge performance scalability, but the ability to do so in a highly available—even fault tolerant—way.

Three Precious Words

Reliability, availability, and serviceability may not compare to "I love you," but CIOs can't say "I love you" to any server technology that doesn't have RAS at the heart of its design. InfiniBand again connects with the horsehide. Not only does InfiniBand directly support the RAS attributes inherent in multi-system clustering, its physical, electronic, and logical design are RAS-friendly. Also remember that the NCR, Tandem, and IBM parallel/cluster systems discussed above comprise some of the highest-quality hardware and software technologies in IT. InfiniBand will take its place as the backbone of these clusters.

Further, the InfiniBand protocols support channel-to-channel I/O failover in InfiniBand links, should a Host detect a Target failure. Redundant InfiniBand links will of course be required for this functionality.

Note also that clustering is based on a distributed-memory model that improves availability by diffusing points of failure. In a similar vein, InfiniBand's concept of creating myriad I/O controllers, most of them located outside the server chassis, enables component separation and redundancy, eliminating the PCI bus's single domain of failure. Finally, InfiniBand's message-passing paradigm and protocols incorporate layers of error management. The technology is also being designed for device hot-addability, including device look-up and registration, which will aid IT professionals in dynamically managing, modifying, and augmenting their networks.
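The channel-to-channel failover described above is easy to picture in outline. In this illustrative sketch (the structure and names are our assumptions, not the specification's), a host holding redundant links to a target simply re-selects a healthy path when its monitoring flags a failure:

```c
#include <stdio.h>

/* Illustrative only: a host holding two redundant InfiniBand paths to the
 * same target. Names and structure are our own, not the specification's. */
struct path {
    const char *label;
    int         healthy;      /* set by whatever link/target monitoring exists */
};

struct redundant_channel {
    struct path  primary;
    struct path  alternate;
    struct path *active;
};

/* Pick a usable path, failing over from primary to alternate if needed. */
static struct path *select_path(struct redundant_channel *ch)
{
    if (ch->active && ch->active->healthy)
        return ch->active;
    if (ch->primary.healthy)
        ch->active = &ch->primary;
    else if (ch->alternate.healthy)
        ch->active = &ch->alternate;
    else
        ch->active = NULL;    /* no redundant link left: the I/O error surfaces */
    return ch->active;
}

int main(void)
{
    struct redundant_channel ch = {
        .primary   = { "link A (via switch 1)", 1 },
        .alternate = { "link B (via switch 2)", 1 },
        .active    = NULL,
    };

    printf("I/O via %s\n", select_path(&ch)->label);

    ch.primary.healthy = 0;   /* host detects a target/link failure */
    printf("failover: I/O via %s\n", select_path(&ch)->label);
    return 0;
}
```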


Clusters by definition increase scaling: that's how vendors get those 64-, 128-, 256-, and 1,024-way systems. InfiniBand's cascadeable switching will stretch clustering in a big way, dramatically accentuating horizontal scalability. Using InfiniBand switching, partitioning, and fabric management, combined with memory-management and control, servers will be configured into first- and second-order networks with many hosts and I/O end nodes. Memory and other resources will be shared in these overlapping subnets of physical and virtual servers and their supporting components, all playing roles in clustered environments. Additionally, these subnets can be separated for functional isolation, increasing management control, availability, and performance.

You Can Take I/O Out of the Network, But You Can't Take the Network Out of I/O

We have referred to the Internet and its impact on enterprise computing. The clustered servers and storage that control inter- and intra-application communications will be at the eye of the Internet computing whirlwind for years to come. InfiniBand's physical and logical attributes will extend server I/O far and wide—out of the box, out of the data center, out of the network, and onto the Internet. The IBTA is basing the addressability of the InfiniBand Architecture in large part on IPv6, enabling not only efficient local manageability, but also prepping InfiniBand for its ventures out onto the Internet. Source (HCA) and destination (TCA) IPv6 addresses are embodied in the InfiniBand Global Route Header, which is used to route packets between HCAs and TCAs, across linked subnets, and out into the world at large. It's a very well-thought-out design, simultaneously accommodating local-, wide-, and global-area fabric management.

Conclusion

"The old order changeth, yielding place to new," wrote Tennyson. IT professionals live these words. For system designers, InfiniBand starts this cycle anew with a generational change in computing architectures. Bus-based I/O is giving way to switched links, and processor-driven I/O is giving way to intelligent I/O engines, or channels. InfiniBand both enables this change and provides a standard for it that unifies interconnectivity across servers, storage, and networking as few technologies have done before.

Five IBTA working groups are busily designing InfiniBand—working out its protocols, electrical signaling, register models, data structures, verbs, memory and semantic operations, software, management, and physical/mechanical specifications. Version 0.9 runs some 900 detailed pages. Version 1 is due in the summer. We can hardly wait for the technology's improved bandwidth and grand-scale architectural possibilities!

That InfiniBand will eventually meet its own maturity and demise hardly quells our enthusiasm—that endgame is another ten years distant. We like InfiniBand because we like the idea of server and workstation data flowing along communication lines across lattices of clustered, process-sharing hardware and software. InfiniBand is the right technology at the right time for the right reasons to realize this dream. InfiniBand rocks!
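One technical note before the component summary that follows: the Global Route Header mentioned above is what lets a packet travel beyond its local subnet, by carrying source (HCA) and destination (TCA) IPv6 addresses. The struct below is an illustrative stand-in (the field names and the subnet test are our assumptions, not the specification's layout), showing how a header with two IPv6 addresses is enough to decide whether a packet stays on the local subnet or is handed to a router.

```c
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>   /* struct in6_addr */
#include <arpa/inet.h>    /* inet_pton */

/* Illustrative stand-in for a global route header: just the two IPv6
 * addresses the text describes. The real header layout is the IBTA's. */
struct route_header {
    struct in6_addr src;      /* source channel adapter (HCA)      */
    struct in6_addr dst;      /* destination channel adapter (TCA) */
};

/* Toy routing decision: if the upper 64 bits (the "subnet" part in this
 * sketch) differ, the packet must leave the local subnet via a router. */
static int needs_router(const struct route_header *h)
{
    return memcmp(h->src.s6_addr, h->dst.s6_addr, 8) != 0;
}

int main(void)
{
    struct route_header h;
    inet_pton(AF_INET6, "fec0:0:0:1::10", &h.src);   /* example addresses only */
    inet_pton(AF_INET6, "fec0:0:0:2::20", &h.dst);

    printf("route: %s\n", needs_router(&h)
           ? "cross-subnet, hand to a router"
           : "local subnet, switch directly");
    return 0;
}
```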



Components and Features of InfiniBand

InfiniBand Feature: Host and Target Channel Adapters
Attributes: Multi-port; tool-free, single-axis insertion; commodity form factors; IPv6 addressable
Advantages and Uses: Allows direct access to memory by applications and other I/O; reduces CPU, OS kernel, and peripheral traffic to memory; allows for remote DMA; translates and validates messages; enables load-balancing and redundancy qualities

InfiniBand Feature: Switches/Routers
Attributes: IPv6 addressable and routeable; cascadeable switches; partitionable traffic zoning; QoS enablement
Advantages and Uses: Relieves …; enables partitioning and subnet creation for clustering and manageability; enables multicasting; allows for scalable cascading of multiple switches; routers dispatch data across switched subnets

InfiniBand Feature: Links
Attributes: One-, four-, and 12-wide bandwidths; copper (differential signalling) and fibre physical links; 1-16 bi-directional, independently assigned channels/lanes per link (one reserved for fabric management, one for application usage); credit-based flow control; static rate control; auto-negotiation/mapping algorithm
Advantages and Uses: Multiplexing and logically connected address spaces allow for refined manageability and arbitration; connects hosts and targets of different speeds and widths; multiple speed grades match varying price points

InfiniBand Feature: Messages/Packets
Attributes: Routing at the packet level; message segmentation and re-assembly; Cyclical Redundancy Checking; interleaved packets across channels; IPv6 addressing headers; memory protection; remote DMA
Advantages and Uses: Allows for granularity in identifying and controlling IPC and I/O processes; enables error detection and correction; allows for refined application and I/O management and control; allows for IP-compatible fabric management; enables processor-independent and serverless I/O
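The Links entry above lists credit-based flow control among the attributes. The mechanism is simple to sketch: the receiving end of a link advertises how many packet buffers it has free, and the sender transmits only while it holds credits, so a fast sender can never overrun a slow receiver. The sketch below is illustrative only; credit units, counts, and names are invented for the example, not taken from the specification.

```c
#include <stdio.h>

/* Illustrative credit-based flow control for one virtual lane.
 * Credit granularity and counts are invented for the example. */
struct lane {
    int credits;              /* packet buffers the receiver has advertised */
};

/* Sender side: transmit only if a credit is available. */
static int try_send(struct lane *l, int packet_id)
{
    if (l->credits == 0) {
        printf("packet %d waits: no credits\n", packet_id);
        return 0;
    }
    l->credits--;
    printf("packet %d sent (%d credits left)\n", packet_id, l->credits);
    return 1;
}

/* Receiver side: as buffers drain, hand credits back to the sender. */
static void return_credits(struct lane *l, int n)
{
    l->credits += n;
    printf("receiver returns %d credits (%d available)\n", n, l->credits);
}

int main(void)
{
    struct lane vl = { .credits = 2 };   /* receiver starts with 2 free buffers */

    try_send(&vl, 1);
    try_send(&vl, 2);
    try_send(&vl, 3);        /* blocked: the link never overruns the receiver */
    return_credits(&vl, 1);
    try_send(&vl, 3);        /* now it goes */
    return 0;
}
```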
