White Paper: InfiniBand, PCI Express, and Intel® Xeon™ Processors with Extended Memory 64 Technology (Intel® EM64T)

Towards a Perfectly Balanced Architecture

1.0 The Problem

The performance and efficiency of applications attract a significant amount of attention from computing users and suppliers. This is easy to understand: users want to know the productivity and return on investment of the systems they are considering purchasing.

However, manufacturers will often develop special and costly hardware, or tune an existing platform to maximize performance for a given application. Often the customer receives an efficient solution for that specific application only to discover:
• Performance does not carry over to other applications or even to different datasets
• The system is expensive to purchase and costly to maintain
• The solution does not scale well beyond the delivered configuration

In seeking solutions to this problem, many lessons have been learned about application efficiency, including concepts regarding the balance of CPU, memory, and I/O subsystem performance.



The ultimate solution to this scenario is to achieve a broad range of application performance and efficiency from a low-cost, high-volume, industry-standard computing architecture. This is a key motivation behind industry standards like PCI Express and InfiniBand, and it is the subject of this paper.

2.0 System Balance in Computing Platforms

System balance is the ability of the system to maximize processor productivity, that is, to feed the compute processors' demand for data. For example, parallel applications such as simulation and modeling are compute and communication intensive. These applications must perform millions of calculations and then exchange intermediate results before beginning another iteration. To maximize processor performance, this exchange of data must be fast enough to prevent the compute processor from sitting idle, waiting for data. Thus, the faster the processors, the greater the bandwidth required and the lower the latency that can be tolerated. In addition, different applications stress different aspects of the computing system. A balanced system takes into account the needs of the applications and matches memory, I/O, and interconnect performance with computing power to keep the processors fed.

Therefore it is clear that system and application efficiency are influenced by three basic elements of the computing architecture:
1. Processor (CPU) throughput
2. Memory sub-system performance
3. Input/Output (I/O) sub-system performance

A weakness in any one of these three legs results in a potentially crippling degradation in efficiency and overall platform performance. Thus it is critical to architect a platform perfectly balanced in each of these three important elements.

For example, the implicit Finite Element Analysis (FEA) used in the dynamic simulation of automotive or aerospace structures is very demanding of memory bandwidth, requiring, on average, 1.2 bytes per floating point operation performed. In these applications, processors will waste many idle cycles waiting for memory access on systems where memory bandwidth is not matched to CPU performance. This mismatch translates to longer run times and fewer jobs that can be processed. A poorly balanced 128-processor cluster may deliver the same performance as a 64-processor system, wasting expensive capital, computing, and management resources.

Another example: over the past few decades, standard I/O technologies have not kept pace with improvements in CPU and memory performance, creating a system imbalance which impacts overall platform performance and scalability. For parallel applications this is doubly troublesome because the I/O subsystem usually does double duty: both clustering and storage traffic must cross the I/O channel to keep the CPU fully utilized. The result is an I/O bottleneck which limits the overall system performance achievable and demands a platform architecture upgrade.

3.0 Industry Solutions to Improve System Balance

Fortunately, several years ago, key system and I/O architects recognized this looming platform imbalance and developed several new technologies, including 64-bit addressing, InfiniBand, and PCI Express (PCIe) I/O interfaces, to address these potential limitations.

The result is a platform upgrade that delivers a highly balanced compute architecture achieved with the combination of:
• Intel® Xeon™ Processors
• Intel® Extended Memory 64 Technology (EM64T) and DDR2 SDRAM memory


• I/O Subsystem with InfiniBand and PCI Express

4.0 Amdahl’s Law and the Weakest Link

Amdahl’s law (named for Gene Amdahl) is one of the most basic concepts in computer architecture, and it points out the importance of having a balanced system architecture. Amdahl’s law states that the performance gain that can be obtained by improving some portion (sub-system) of a computer is limited by the fraction of time that this sub-system contributes to the overall processing task. Mathematically the law can be expressed as:

Speedup = T_wo / T_w    (EQ 1)

where:
T_wo = execution time for the entire task without the sub-system improvement
T_w = execution time for the entire task with the sub-system improvement

Thus speedup represents how much faster the task will run with the improved sub-system. What is important to recognize is that if the contribution of one element of the overall system dominates the total execution time, then performance improvements in the other two components will have little effect on the overall performance. This is fairly intuitive: if one sub-system contributes 95% of the total run time, it does not make sense to expend effort optimizing the sub-systems that contribute only the remaining 5% of run time. Instead it makes sense to focus on the weakest link.
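To make EQ 1 concrete, the following short Python sketch evaluates the speedup formula; the 95%/5% split and the ten-fold improvement are hypothetical numbers chosen only to illustrate the weakest-link point, not figures from this paper.

# Minimal sketch of Amdahl's law as written in EQ 1.
def speedup(t_without: float, t_with: float) -> float:
    """Speedup = T_wo / T_w."""
    return t_without / t_with

# Hypothetical task: one sub-system contributes 95 time units, the rest 5.
# Making the 5-unit portion ten times faster barely moves the total.
t_without = 95.0 + 5.0
t_with = 95.0 + 5.0 / 10.0
print(f"{speedup(t_without, t_with):.3f}")  # ~1.047, i.e. less than a 5% overall gain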

A real-world example makes this clearer. Consider the case of a distributed (clustered) database with the entire database image distributed across 16 nodes. Oracle 10g Grid Database is a good example of this type of distributed system and achieves significantly improved price/performance vs. typical “big box” symmetrical multi-processing (SMP) machines. Oracle distributes data across all the nodes in the cluster with its “Cache Fusion” architecture. A typical low-level operation in this type of architecture requires a given node to fetch a large (say 64KByte) block of data from another machine, store the data in memory, and perform some processing on the data (e.g., search for the largest value in a record of such values). Fundamentally, then, there are three elements which contribute to the overall compute time:
1. Get the data from the other node (I/O)
2. Fetch the data from memory
3. Process the data and generate a result

Consider the following cluster architecture:
• Processor: Intel 2.8GHz Xeon™ CPUs
• Memory: DDR 300MHz, 128 bits wide
• I/O: PCI-X and Gigabit Ethernet (later we will compare to a system utilizing Intel® EM64T, PCI Express and InfiniBand)

The first task is to get the data from the other node. With an I/O sub-system based on PCI-X and Gigabit Ethernet it requires on the order of 1100us to transfer 64K Bytes of data.

The next step is to get the data from memory. Assuming the memory operates at a 300MHz data rate with a 128-bit width and 50% bus efficiency (a conservative figure), the data can be fetched in about 27us.

Finally the data is processed by the CPU. The amount of work done in this step is variable and highly dependent on the actual processing task at hand, but for concreteness, assume that the algorithm being performed requires on the order of 3 instructions per byte. For a 2.8 GHz processor the processing contribution is thus about 70us.
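As a rough cross-check, the Python sketch below reproduces this arithmetic from the stated figures (64 KByte block, ~1100us for the Gigabit Ethernet/PCI-X transfer, 300MHz x 128-bit memory at 50% efficiency, 3 instructions per byte); the only added assumption is that the CPU retires about one instruction per clock.

# Back-of-the-envelope model of the three contributions described above.
BLOCK = 64 * 1024                        # bytes in the fetched block

io_time = 1092.3e-6                      # ~1100us to move 64KB over GigE/PCI-X (given above)

mem_bw = 300e6 * (128 / 8) * 0.5         # 300MHz x 16 bytes x 50% efficiency
mem_time = BLOCK / mem_bw                # ~27.3us

cpu_time = (3 * BLOCK) / 2.8e9           # 3 instructions/byte, assumed 1 instruction/cycle at 2.8GHz
total = io_time + mem_time + cpu_time    # ~1189.8us

print(f"I/O {io_time * 1e6:.1f}us, memory {mem_time * 1e6:.1f}us, "
      f"CPU {cpu_time * 1e6:.1f}us, total {total * 1e6:.1f}us")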


Thus the total execution time is:

  Sub-System              Execution Time Contribution
  I/O                     1092.3us
  Memory                  27.3us
  Processing              70.2us
  Total Execution Time    1189.8us

Clearly the first term dominates the total execution time.

5.0 Limited Performance Gains in an Un-Balanced Platform

Now consider the speedup achieved when the CPU clock frequency is increased from 2.8GHz to 3.4GHz. While this represents a substantial improvement in CPU performance, the improvement in overall run time is actually fairly small. The data is summarized as:

  Sub-System              Execution Time Contribution
  I/O                     1092.3us
  Memory                  27.3us
  Processing              57.8us
  Total Execution Time    1177.4us

Thus the overall run time improves by only about 1% despite the much larger improvement in CPU clock frequency.

Similarly, a performance boost in memory transfer rate from 300MHz to 400MHz results in only a 0.6% overall performance improvement. Combining both the CPU speedup and the memory bandwidth speedup results in a paltry 1.6% improvement in overall execution time. Despite substantial improvements in both CPU and memory performance, the overall performance is barely improved.
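To see where these percentages come from, here is a small Python sketch that scales the baseline contributions by the clock-rate ratios, assuming (as the figures above imply) that memory and processing times scale inversely with their respective clock frequencies while the I/O time is unchanged:

# Marginal gains on the unbalanced (GigE/PCI-X) platform.
io, mem, cpu = 1092.3, 27.3, 70.2        # baseline contributions in microseconds
base = io + mem + cpu                    # ~1189.8us

cpu_34 = cpu * (2.8 / 3.4)               # ~57.8us with a 3.4GHz CPU
mem_400 = mem * (300 / 400)              # ~20.5us with 400MHz memory

for label, total in [("3.4GHz CPU", io + mem + cpu_34),
                     ("400MHz memory", io + mem_400 + cpu),
                     ("both upgrades", io + mem_400 + cpu_34)]:
    print(f"{label}: {total:.1f}us, {(base - total) / base:.1%} faster")
# Prints roughly 1.0%, 0.6%, and 1.6% - the I/O term still dominates.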

Clearly this relatively small speedup is a result of focusing on improvements in the CPU and memory sub-systems without addressing the largest contribution to overall run time, in this case the I/O contribution. For other applications which are compute rather than I/O intensive, the processing contribution might dominate and considerably better speedup would be achieved. Nonetheless, for a very large class of applications (such as the clustered database application described here) I/O is extremely important. More importantly, servers are general-purpose machines and one can never know exactly which applications they will be required to support. Thus it is vital that the entire system architecture be balanced so that significant performance gains in one sub-system actually result in substantial overall speedup.

6.0 Upgrading the Platform with PCI Express and InfiniBand

Clearly the I/O component dominates the overall run time. Fortunately, system architects from Intel and the leading server vendors recognized the need for improved I/O performance and defined new technologies such as InfiniBand and PCI Express to address the I/O bottleneck, and Intel® EM64T to expand the performance and scalability of the memory sub-system. New server platforms with PCI Express and Intel® EM64T memory sub-systems are beginning to ship from major server vendors, and Mellanox is now shipping an 8x PCI Express version of the InfiniHost HCA that matches perfectly with these platforms. The adapter features both the new 8x PCI Express interface and a new HCA device with improved caching and higher I/O operation rates, which transparently improve application speed.

[Figure: block diagrams contrasting a Typical Server Architecture (Xeon processor with EM64T, front side bus, memory controller/chipset, DDR2 memory, and a PCI-X bridge with a shared PCI-X bus to the InfiniBand HCA) with a PCI Express Architecture (the InfiniBand HCA attached directly over “one fat serial pipe”, with dual 10Gb/s InfiniBand links).]

System architecture advancements with PCI Express and InfiniBand both simplify and improve the performance of next-generation servers.

The new server platform architectures actually bring I/O closer to the CPU and memory sub-system. Previous generations of server architectures required data to traverse PCI-X I/O bridges to reach the CPU and memory sub-system. With InfiniBand and PCI Express there is one fat serial pipe running directly between servers and their CPU and memory sub-systems. This reduces chip count and complexity, improves both bandwidth and latency, and, as will be shown, improves overall system-level balance and performance.

This improvement in I/O performance is achievable simply by adding an InfiniBand Host Channel Adapter card to an 8X-enabled PCI Express server platform. Both of these are industry-standard components available as off-the-shelf products from multiple system vendors. The Supermicro platform shown in Figure 1, featuring a 3.4 GHz Xeon CPU, the E7520 chipset, 4GB of memory, and an 8X PCI Express slot with a dual-port InfiniBand HCA, is a good example of this architecture.


Figure 1. Supermicro Platform 6014H-82 - 1U DP High Performance Server with PCI Express Slot

[Photo callouts: dual Intel Xeon CPUs with EM64T, DDR2 memory, and the 8X PCI Express slot.]

Figure 2. Low Profile 8x PCI Express HCA Adapter Card Based on the InfiniHost III Ex Device


With InfiniBand and PCI-X, the delivered I/O bandwidth improves to more than eight times that of Gigabit Ethernet. With dual InfiniBand ports and PCI Express, the improvement is even more dramatic, achieving over 20Gb/s of net delivered data bandwidth. Obviously the combination of InfiniBand and PCI Express yields impressive bandwidth performance improvements to the system architecture, which translates to improved block transfer latency.

7.0 Balanced Platform Speedup with Intel® EM64T, PCI Express, and InfiniBand

Thus the stage is set for a platform upgrade that adds PCI Express and InfiniBand, resulting in considerably better I/O performance. The combination of PCI Express and InfiniBand delivers effective I/O bandwidth of over 900MB/sec for 64 KByte blocks! This blistering bandwidth cuts the time to fetch the 64 KByte block from a remote node from 1092.3us to 72.8us. Examining the overall performance with the I/O upgrade to InfiniBand and PCI Express yields:

  Sub-System                        Execution Time Contribution
  I/O (InfiniBand & PCI Express)    72.8us
  Memory (300MHz/128bits)           27.3us
  Processing (2.8GHz CPU)           70.2us
  Total Execution Time              170.3us

Now we’re talking! The total execution time has been reduced from ~1190us to ~170us, a reduction of about 86%! Clearly, focusing on the largest contribution to the total execution time yields impressive speedup.

But even better, the platform is now balanced, so that speedups of the other sub-systems will generate substantial additional performance improvements.

Now, with the PCI Express and InfiniBand I/O sub-system, consider upgrading the CPU performance from 2.8GHz to 3.4GHz:

  Sub-System                        Execution Time Contribution
  I/O (InfiniBand & PCI Express)    72.8us
  Memory (300MHz/128bits)           27.3us
  Processing (3.4GHz CPU)           57.8us
  Total Execution Time              157.9us


By moving to the faster CPU, the total execution time is reduced by 12.4us for an overall performance improvement of 7.8%. Similarly, moving to the faster 400MHz memory sub-system further reduces the run time to 151.1us, an overall performance improvement of 11.3% from the baseline with the PCI Express and InfiniBand I/O sub-system.
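The same back-of-the-envelope Python model, now assuming the ~900MB/sec of delivered InfiniBand/PCI Express bandwidth quoted above and the same clock-rate scaling as before, reproduces these balanced-platform figures:

# Balanced-platform model: only the I/O term changes at first.
BLOCK = 64 * 1024
io = BLOCK / 900e6 * 1e6                 # ~72.8us over InfiniBand + PCI Express
mem, cpu = 27.3, 70.2                    # unchanged 300MHz memory, 2.8GHz CPU

old_total = 1189.8                       # GigE/PCI-X baseline from Section 4
new_total = io + mem + cpu               # ~170.3us
print(f"run time cut by {(old_total - new_total) / old_total:.0%}")   # ~86%

# With the I/O bottleneck removed, CPU and memory upgrades now pay off.
with_cpu = io + mem + cpu * (2.8 / 3.4)                   # ~157.9us with a 3.4GHz CPU
with_both = io + mem * (300 / 400) + cpu * (2.8 / 3.4)    # ~151.1us with 400MHz memory too
print(f"CPU upgrade saves {new_total - with_cpu:.1f}us; "
      f"both upgrades improve run time by {(new_total - with_both) / new_total:.1%}")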

Clearly, the move to PCI Express and InfiniBand yields the biggest performance increase, but by re-establishing a balanced system architecture this move also allows additional gains to be achieved by increasing the performance of other system components. Together, the Intel® Xeon processor with EM64T and the PCI Express and InfiniBand I/O sub-system achieve the ideal of a balanced computing platform architecture.

8.0 No Software Hurdles

Frequently, the adoption of new platform architectures is slowed significantly by the requirement for new software. Fortunately, in this case the move to PCI Express is completely transparent. The InfiniBand driver software is structured to migrate transparently from PCI-X to PCI Express while providing complete backwards compatibility. Furthermore, the InfiniBand software is fully interoperable, so heterogeneous clusters using both PCI-X and PCI Express platforms can be created. Therefore, the investments made by Mellanox as well as customers in software drivers and applications are preserved and readily usable on the new platforms.

9.0 Summary

A balanced computing architecture must address equally the triad of processing, memory, and I/O in order to achieve optimized overall system performance. Neglecting any one element means that performance gains in the other two elements are squandered and do not result in significant overall performance improvements. The combination of new high-performance Xeon processors, the higher-bandwidth Intel® EM64T memory technology, and a dramatically improved I/O sub-system with PCI Express and InfiniBand yields this balanced platform. Once balance has been restored to the overall system architecture, performance gains in each of the elements yield substantial gains in overall performance. It is expected that advanced processors and memory sub-systems will continue to track Moore’s law and deliver increasing clock speed and performance. Fortunately, the combination of InfiniBand and PCI Express has re-balanced the platform so that these advances actually deliver benefits at the system level. Furthermore, both InfiniBand and PCI Express have a roadmap to continue to increase performance (with double and even quad data rate signalling, fatter pipes, etc.). In short, the industry is delivering new processor, memory, and I/O technologies just in time to keep the steady advance in cost-effective system-level performance marching along.

Mellanox, InfiniBridge, InfiniHost and InfiniScale are registered trademarks of Mellanox Technologies, Inc. InfiniBand (TM/SM) is a trademark and service mark of the InfiniBand Trade Association. All other trademarks are claimed by their respective owners.
