White Paper: InfiniBand, PCI Express, and Intel® Xeon™ Processors with Extended Memory 64 Technology (Intel® EM64T)

Towards a Perfectly Balanced Architecture

1.0 The Problem

The performance and efficiency of applications attract a significant amount of attention from computing users and suppliers. This is easy to understand: users want to know the productivity and return on investment of the systems they are considering purchasing.

However, manufacturers will often develop special and costly hardware, or tune an existing platform to maximize performance for a given application. Often the customer receives an efficient solution for that specific application only to discover:
• Performance does not carry over to other applications or even to different datasets
• The system is expensive to purchase and costly to maintain
• The solution does not scale well beyond the delivered configuration

In seeking solutions to this problem, many lessons have been learned about application efficiency, including concepts regarding the balance of CPU, memory, and I/O subsystem performance.



The ultimate solution to this scenario is to achieve a broad range of application performance and efficiency from a low-cost, high-volume, industry-standard computing architecture. This is a key motivation behind industry standards like PCI Express and InfiniBand, and it is the subject of this paper.

2.0 System Balance in Computing Platforms

System balance is the ability of the system to maximize processor productivity, that is, to feed the compute processors' demand for data. For example, parallel applications such as simulation and modeling are compute and communication intensive. These applications must perform millions of calculations and then exchange intermediate results before beginning another iteration. To maximize processor performance, this exchange of data must be fast enough to prevent the compute processor from sitting idle, waiting for data. Thus, the faster the processors, the greater the bandwidth required and the lower the latency that can be tolerated. In addition, different applications stress different aspects of the computing system. A balanced system takes into account the needs of the applications and matches memory, I/O, and interconnect performance with computing power to keep the processors fed.

Therefore it is clear that system and application efficiency are influenced by three basic elements of the computing architecture:
1. Processor (CPU) throughput
2. Memory sub-system performance
3. Input/Output (I/O) sub-system performance

A weakness in any one of these three legs results in a potentially crippling degradation in efficiency and overall platform performance. Thus it is critical to architect a platform perfectly balanced in each of these three important elements.

For example, the implicit Finite Element Analysis (FEA) used in the dynamic simulation of automotive or aerospace structures is very demanding of memory bandwidth, requiring, on average, 1.2 bytes per floating point operation performed. In these applications, processors will waste many idle cycles waiting for memory access on systems where memory bandwidth is not matched to CPU performance. This mismatch translates to longer run times and fewer jobs that can be processed. A poorly balanced 128-processor cluster may deliver the same performance as a 64-processor system, wasting expensive capital, computing, and management resources.

Another example: over the past few decades, standard I/O technologies have not kept pace with improvements in CPU and memory performance, creating a system imbalance which impacts overall platform performance and scalability. For parallel applications this is doubly troublesome because the I/O subsystem usually does double duty: both clustering and storage traffic must cross the I/O channel to keep the CPU fully utilized. The result is an I/O bottleneck which limits the overall system performance achievable and demands a platform architecture upgrade.

3.0 Industry Solutions to Improve System Balance

Fortunately, several years ago, key system and I/O architects recognized this looming platform imbalance and developed several new technologies, including 64-bit addressing, InfiniBand, and PCI Express (PCIe) I/O interfaces, to address these potential limitations.

The result is a platform upgrade that delivers a highly balanced compute architecture achieved with the combination of:
• Intel® Xeon™ Processors
• Intel® Extended Memory 64 Technology (EM64T) and DDR2 SDRAM memory


• I/O Subsystem with InfiniBand and PCI Express

4.0 Amdahl’s Law and the Weakest Link

Amdahl’s law (named for Gene Amdahl) is one of the most basic concepts in computer architecture, and it points out the importance of having a balanced system architecture. Amdahl’s law states that the performance gain that can be obtained by improving some portion (sub-system) of a computer is limited by the fraction of time that this sub-system contributes to the overall processing task. Mathematically the law can be expressed as:

Speedup = T_wo / T_w    (EQ 1)

where:
T_wo = execution time for the entire task without the sub-system improvement
T_w = execution time for the entire task with the sub-system improvement

Thus speedup represents how much faster the task will run with the improved sub-system. What is important to recognize is that if the contribution of one element of the overall system dominates the total execution time, then performance improvements in the other two components will have little effect on the overall performance. This is fairly intuitive: if one sub-system contributes 95% of the total run time, it does not make sense to expend effort optimizing the sub-systems that contribute only the remaining 5% of run time. Instead it makes sense to focus on the weakest link.
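To make EQ 1 concrete, the following short Python sketch evaluates the speedup formula; the 95%/5% split and the ten-fold improvement are hypothetical numbers chosen only to illustrate the weakest-link point, not figures from this paper.

# Minimal sketch of Amdahl's law as written in EQ 1.
def speedup(t_without: float, t_with: float) -> float:
    """Speedup = T_wo / T_w."""
    return t_without / t_with

# Hypothetical task: one sub-system contributes 95 time units, the rest 5.
# Making the 5-unit portion ten times faster barely moves the total.
t_without = 95.0 + 5.0
t_with = 95.0 + 5.0 / 10.0
print(f"{speedup(t_without, t_with):.3f}")  # ~1.047, i.e. less than a 5% overall gain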

A real-world example makes this clearer. Consider the case of a distributed (clustered) database with the entire database image distributed across 16 nodes. Oracle 10g Grid Database is a good example of this type of distributed system and achieves significantly improved price/performance vs. typical “big box” symmetrical multi-processing (SMP) machines. Oracle distributes data across all the nodes in the cluster with its “Cache Fusion” architecture. A typical low-level operation in this type of architecture requires a given node to fetch a large (say 64KByte) block of data from another machine, store the data in memory, and perform some processing on the data (e.g., search for the largest value in a record of such values). Fundamentally, then, there are three elements which contribute to the overall compute time:
1. Get the data from the other node (I/O)
2. Fetch the data from memory
3. Process the data and generate a result

Consider the following cluster architecture:
• Processor: Intel 2.8GHz Xeon™ CPUs
• Memory: DDR 300MHz, 128 bits wide
• I/O: PCI-X and Gigabit Ethernet (later we will compare to a system utilizing Intel® EM64T, PCI Express and InfiniBand)

The first task is to get the data from the other node. With an I/O sub-system based on PCI-X and Gigabit Ethernet it requires on the order of 1100us to transfer 64K Bytes of data.

The next step is to get the data from memory. Assuming the memory operates at a 300MHz data rate with a 128-bit width and 50% bus efficiency (a conservative figure), the data can be fetched in about 27us.

Finally the data is processed by the CPU. The amount of work done in this step is variable and highly dependent on the actual processing task at hand, but for concreteness, assume that the algorithm being performed requires on the order of 3 instructions per byte. For a 2.8 GHz processor the processing contribution is thus about 70us.
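As a rough cross-check, the Python sketch below reproduces this arithmetic from the stated figures (64 KByte block, ~1100us for the Gigabit Ethernet/PCI-X transfer, 300MHz x 128-bit memory at 50% efficiency, 3 instructions per byte); the only added assumption is that the CPU retires about one instruction per clock.

# Back-of-the-envelope model of the three contributions described above.
BLOCK = 64 * 1024                        # bytes in the fetched block

io_time = 1092.3e-6                      # ~1100us to move 64KB over GigE/PCI-X (given above)

mem_bw = 300e6 * (128 / 8) * 0.5         # 300MHz x 16 bytes x 50% efficiency
mem_time = BLOCK / mem_bw                # ~27.3us

cpu_time = (3 * BLOCK) / 2.8e9           # 3 instructions/byte, assumed 1 instruction/cycle at 2.8GHz
total = io_time + mem_time + cpu_time    # ~1189.8us

print(f"I/O {io_time * 1e6:.1f}us, memory {mem_time * 1e6:.1f}us, "
      f"CPU {cpu_time * 1e6:.1f}us, total {total * 1e6:.1f}us")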


Thus the total execution time is:

  Sub-System              Execution Time Contribution
  I/O                     1092.3us
  Memory                  27.3us
  Processing              70.2us
  Total Execution Time    1189.8us

Clearly the first term dominates the total execution time.

5.0 Limited Performance Gains in an Un-Balanced Platform

Now consider the speedup achieved when the CPU clock frequency is increased from 2.8GHz to 3.4GHz. While this represents a substantial improvement in CPU performance, the improvement in overall run time is actually fairly small. The data is summarized as:

  Sub-System              Execution Time Contribution
  I/O                     1092.3us
  Memory                  27.3us
  Processing              57.8us
  Total Execution Time    1177.4us

Thus the overall run time improves by only about 1% despite the much larger improvement in CPU clock frequency.

Similarly, a performance boost in memory transfer rate from 300MHz to 400MHz results in only a 0.6% overall performance improvement. Combining both the CPU speedup and the memory bandwidth speedup results in a paltry 1.6% improvement in overall execution time. Despite substantial improvements in both CPU and memory performance, the overall performance is barely improved.
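To see where these percentages come from, here is a small Python sketch that scales the baseline contributions by the clock-rate ratios, assuming (as the figures above imply) that memory and processing times scale inversely with their respective clock frequencies while the I/O time is unchanged:

# Marginal gains on the unbalanced (GigE/PCI-X) platform.
io, mem, cpu = 1092.3, 27.3, 70.2        # baseline contributions in microseconds
base = io + mem + cpu                    # ~1189.8us

cpu_34 = cpu * (2.8 / 3.4)               # ~57.8us with a 3.4GHz CPU
mem_400 = mem * (300 / 400)              # ~20.5us with 400MHz memory

for label, total in [("3.4GHz CPU", io + mem + cpu_34),
                     ("400MHz memory", io + mem_400 + cpu),
                     ("both upgrades", io + mem_400 + cpu_34)]:
    print(f"{label}: {total:.1f}us, {(base - total) / base:.1%} faster")
# Prints roughly 1.0%, 0.6%, and 1.6% - the I/O term still dominates.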

Clearly this relatively small speedup is a result of focusing on improvements in the CPU and memory sub-systems without addressing the largest contribution to overall run time, in this case the I/O contribution. For other applications which are compute rather than I/O intensive, the processing contribution might dominate and considerably better speedup would be achieved. Nonetheless, for a very large class of applications (such as the clustered database application described here) I/O is extremely important. More importantly, servers are general-purpose machines and one can never know exactly which applications they will be required to support. Thus it is vital that the entire system architecture be balanced so that significant performance gains in one sub-system actually result in substantial overall speedup.

6.0 Upgrading the Platform with PCI Express and InfiniBand

Clearly the I/O component dominates the overall run time. Fortunately, system architects from Intel and the leading server vendors recognized the need for improved I/O performance and defined new technologies such as InfiniBand and PCI Express to address the I/O bottleneck, and Intel® EM64T to expand the performance and scalability of the memory sub-system. New server platforms with PCI Express and Intel® EM64T memory sub-systems are beginning to ship from major server vendors, and Mellanox is now shipping an 8x PCI Express version of the InfiniHost HCA that matches perfectly with these platforms. The adapter features both the new 8x PCI Express interface and a new HCA device with improved caching and higher I/O operation rates, which transparently improve application speed.

[Figure: block diagrams contrasting a Typical Server Architecture (Xeon processor with EM64T, front side bus, memory controller/chipset, DDR2 memory, and a PCI-X bridge with a shared PCI-X bus to the InfiniBand HCA) with a PCI Express Architecture (the InfiniBand HCA attached directly over “one fat serial pipe”, with dual 10Gb/s InfiniBand links).]

System architecture advancements with PCI Express and InfiniBand both simplify and improve the performance of next-generation servers.

The new server platform architectures actually bring I/O closer to the CPU and memory sub-system. Previous generations of server architectures required data to traverse PCI-X I/O bridges to reach the CPU and memory sub-system. With InfiniBand and PCI Express there is one fat serial pipe running directly between servers and their CPU and memory sub-systems. This reduces chip count and complexity, improves both bandwidth and latency, and, as will be shown, improves overall system-level balance and performance.

This improvement in I/O performance is achievable simply by adding an InfiniBand Host Channel Adapter card to an 8X-enabled PCI Express server platform. Both of these are industry-standard components available as off-the-shelf products from multiple system vendors. The Supermicro platform shown in Figure 1, featuring a 3.4 GHz Xeon CPU, the E7520 chipset, 4GB of memory, and an 8X PCI Express slot with a dual-port InfiniBand HCA, is a good example of this architecture.


Figure 1. Supermicro Platform 6014H-82 - 1U DP High Performance Server with PCI Express Slot

[Photo callouts: dual Intel Xeon CPUs with EM64T, DDR2 memory, and the 8X PCI Express slot.]

Figure 2. Low Profile 8x PCI Express HCA Adapter Card Based on the InfiniHost III Ex Device


With InfiniBand and PCI-X, the delivered I/O bandwidth improves to more than eight times that of Gigabit Ethernet. With dual InfiniBand ports and PCI Express, the improvement is even more dramatic, achieving over 20Gb/s of net delivered data bandwidth. Obviously the combination of InfiniBand and PCI Express yields impressive bandwidth performance improvements to the system architecture, which translates to improved block transfer latency.

7.0 Balanced Platform Speedup with Intel® EM64T, PCI Express, and InfiniBand

Thus the stage is set for a platform upgrade that adds PCI Express and InfiniBand, resulting in considerably better I/O performance. The combination of PCI Express and InfiniBand delivers effective I/O bandwidth of over 900MB/sec for 64 KByte blocks! This blistering bandwidth cuts the time to fetch the 64 KByte block from a remote node from 1092.3us to 72.8us. Examining the overall performance with the I/O upgrade to InfiniBand and PCI Express yields:

  Sub-System                        Execution Time Contribution
  I/O (InfiniBand & PCI Express)    72.8us
  Memory (300MHz/128bits)           27.3us
  Processing (2.8GHz CPU)           70.2us
  Total Execution Time              170.3us

Now we’re talking! The total execution time has been reduced from ~1190us to ~170us, a reduction of about 86%! Clearly, focusing on the largest contribution to the total execution time yields impressive speedup.

But even better, the platform is now balanced, so that speedups of the other sub-systems will generate substantial additional performance improvements.

Now, with the PCI Express and InfiniBand I/O sub-system, consider upgrading the CPU performance from 2.8GHz to 3.4GHz:

  Sub-System                        Execution Time Contribution
  I/O (InfiniBand & PCI Express)    72.8us
  Memory (300MHz/128bits)           27.3us
  Processing (3.4GHz CPU)           57.8us
  Total Execution Time              157.9us


By moving to the faster CPU, the total execution time is reduced by 12.4us for an overall performance improvement of 7.8%. Similarly, moving to the faster 400MHz memory sub-system further reduces the run time to 151.1us, an overall performance improvement of 11.3% from the baseline with the PCI Express and InfiniBand I/O sub-system.
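The same back-of-the-envelope Python model, now assuming the ~900MB/sec of delivered InfiniBand/PCI Express bandwidth quoted above and the same clock-rate scaling as before, reproduces these balanced-platform figures:

# Balanced-platform model: only the I/O term changes at first.
BLOCK = 64 * 1024
io = BLOCK / 900e6 * 1e6                 # ~72.8us over InfiniBand + PCI Express
mem, cpu = 27.3, 70.2                    # unchanged 300MHz memory, 2.8GHz CPU

old_total = 1189.8                       # GigE/PCI-X baseline from Section 4
new_total = io + mem + cpu               # ~170.3us
print(f"run time cut by {(old_total - new_total) / old_total:.0%}")   # ~86%

# With the I/O bottleneck removed, CPU and memory upgrades now pay off.
with_cpu = io + mem + cpu * (2.8 / 3.4)                   # ~157.9us with a 3.4GHz CPU
with_both = io + mem * (300 / 400) + cpu * (2.8 / 3.4)    # ~151.1us with 400MHz memory too
print(f"CPU upgrade saves {new_total - with_cpu:.1f}us; "
      f"both upgrades improve run time by {(new_total - with_both) / new_total:.1%}")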

Clearly, the move to PCI Express and InfiniBand yields the biggest performance increase, but by re-establishing a balanced system architecture this move also allows additional gains to be achieved by increasing the performance of other system components. Together, the Intel® Xeon processor with EM64T and the PCI Express and InfiniBand I/O sub-system achieve the ideal of a balanced computing platform architecture.

8.0 No Software Hurdles

Frequently, the adoption of new platform architectures is slowed significantly by the requirement for new software. Fortunately, in this case the move to PCI Express is completely transparent. The InfiniBand driver software is structured to migrate transparently from PCI-X to PCI Express while providing complete backwards compatibility. Furthermore, the InfiniBand software is fully interoperable, so heterogeneous clusters using both PCI-X and PCI Express platforms can be created. Therefore, the investments made by Mellanox as well as customers in software drivers and applications are preserved and readily usable on the new platforms.

9.0 Summary

A balanced computing architecture must address equally the triad of processing, memory, and I/O in order to achieve optimized overall system performance. Neglecting any one element means that performance gains in the other two elements are squandered and do not result in significant overall performance improvements. The combination of new high-performance Xeon processors, the higher-bandwidth Intel® EM64T memory technology, and a dramatically improved I/O sub-system with PCI Express and InfiniBand yields this balanced platform. Once balance has been restored to the overall system architecture, performance gains in each of the elements yield substantial gains in overall performance. It is expected that advanced processors and memory sub-systems will continue to track Moore’s law and deliver increasing clock speed and performance. Fortunately, the combination of InfiniBand and PCI Express has re-balanced the platform so that these advances actually deliver benefits at the system level. Furthermore, both InfiniBand and PCI Express have a roadmap to continue to increase performance (with double and even quad data rate signalling, fatter pipes, etc.). In short, the industry is delivering new processor, memory, and I/O technologies just in time to keep the steady advance in cost-effective system-level performance marching along.

Mellanox, InfiniBridge, InfiniHost and InfiniScale are registered trademarks of Mellanox Technologies, Inc. InfiniBand (TM/SM) is a trademark and service mark of the InfiniBand Trade Association. All other trademarks are claimed by their respective owners.
