Performance Evaluation of Supercomputers Using HPCC and IMB Benchmarks
Subhash Saini(1), Robert Ciotti(1), Brian T. N. Gunney(2), Thomas E. Spelce(2), Alice Koniges(2), Don Dossa(2), Panagiotis Adamidis(3), Rolf Rabenseifner(3), Sunil R. Tiyyagura(3), Matthias Mueller(4), and Rod Fatoohi(5)

(1) Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, California 94035, USA; {ssaini, ciotti}@nas.nasa.gov
(2) Lawrence Livermore National Laboratory, Livermore, California 94550, USA; {gunney, koniges, dossa1}@llnl.gov
(3) High-Performance Computing Center Stuttgart (HLRS), University of Stuttgart, Allmandring 30, D-70550 Stuttgart, Germany; {adamidis, rabenseifner, sunil}@hlrs.de
(4) ZIH, TU Dresden, Zellescher Weg 12, D-01069 Dresden, Germany; [email protected]
(5) San Jose State University, One Washington Square, San Jose, California 95192, USA; [email protected]

Abstract

The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of the processor, memory subsystem and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.

1. Introduction

The performance of the processor, memory subsystem and interconnect is a critical factor in the overall performance of a computing system, and thus of the applications running on it. The HPC Challenge (HPCC) benchmark suite is designed to give a picture of overall supercomputer performance, including floating-point compute power, memory subsystem performance and global network behavior [1,2]. In this paper, we use the HPCC suite as a first comparison of the systems.

Additionally, the message-passing paradigm has become the de facto standard for programming high-end parallel computers. As a result, the performance of a majority of applications depends on the performance of the MPI functions as implemented on these systems. Simple bandwidth and latency tests are the two traditional metrics for assessing the performance of a system's interconnect fabric. However, these two measures alone are not adequate to predict the performance of real-world applications. For instance, traditional methods characterize the network by the latency of zero-byte messages and by the peak bandwidth of very large messages, ranging from 1 MB to 4 MB, on small systems (typically 32 to 64 processors). Yet real-world applications tend to send messages ranging from 10 KB to 2 MB, using not only point-to-point communication but often a variety of communication patterns, including collective and reduction operations.
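To make the distinction concrete, the following is a minimal sketch of the traditional ping-pong measurement described above, written in C with MPI (the language of the IMB suite). It is illustrative only and is not the HPCC or IMB code; the repetition count and the 4 MB upper message size are arbitrary choices for the sketch, and it assumes exactly two MPI ranks.

    /* Minimal ping-pong sketch of the traditional latency/bandwidth test:
     * latency from zero-byte messages, bandwidth from one large (4 MB)
     * message size.  Not the HPCC or IMB implementation. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 1000                 /* repetitions per measurement (arbitrary) */
    #define BIG  (4 * 1024 * 1024)    /* 4 MB, upper end of the "peak bandwidth" range */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        double t0, lat, bw;
        char *buf = calloc(BIG, 1);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {              /* the sketch assumes exactly two ranks */
            if (rank == 0) fprintf(stderr, "run with exactly 2 MPI ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        int peer = 1 - rank;

        /* Zero-byte ping-pong: half the round-trip time gives the latency. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            }
        }
        lat = (MPI_Wtime() - t0) / (2.0 * REPS);

        /* 4 MB ping-pong: bytes moved per unit time gives the peak bandwidth. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, BIG, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, BIG, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, BIG, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, BIG, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            }
        }
        bw = (2.0 * BIG * REPS) / (MPI_Wtime() - t0);

        if (rank == 0)
            printf("latency %.2f usec, bandwidth %.2f MB/s\n", lat * 1e6, bw / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

The two numbers this sketch reports, zero-byte latency and large-message bandwidth, are exactly the metrics that the remainder of the paper supplements with collective operations and mid-size messages.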
The recently renamed Intel MPI Benchmarks (IMB, formerly the Pallas MPI Benchmarks) provide more information than such simple tests by covering a variety of MPI-specific operations [3,4]. In this paper, we have used a subset of the IMB benchmarks that we consider important based on our application workload, and we report the performance results for the five computing systems. Since the systems tested vary in age and cost, our goal is not to characterize one as "better" than another, but rather to identify the strengths and weaknesses of the underlying hardware and interconnect networks for particular operations.

To meet our goal of testing a variety of architectures, we analyze performance on five specific systems: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8 [5, 12]. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present IMB 2.3 results to study the performance of 11 MPI communication functions for various message sizes. However, in this paper we present results only for the 1 MB message size, since the average message size in many real-world applications is about 1 MB.

2. High End Computing Platforms

Table 1 gives the system characteristics of the five systems. The computing systems we have studied use three types of network topology, namely fat-tree, multi-stage crossbar and 4-dimensional hypercube.

Table 1: System characteristics of the five computing platforms.

Platform             | Type   | CPUs/node | Clock (GHz) | Peak/node (Gflop/s) | Network     | Network Topology     | Operating System | Processor Vendor | System Vendor | Location
SGI Altix BX2        | Scalar | 2         | 1.6         | 12.8                | NUMALINK 4  | Fat-tree             | Linux (SuSE)     | Intel            | SGI           | NASA (USA)
Cray X1              | Vector | 4         | 0.800       | 12.8                | Proprietary | 4D-Hypercube         | UNICOS           | Cray             | Cray          | NASA (USA)
Cray Opteron Cluster | Scalar | 2         | 2.0         | 8.0                 | Myrinet     | Fat-tree             | Linux (Red Hat)  | AMD              | Cray          | NASA (USA)
Dell Xeon Cluster    | Scalar | 2         | 3.6         | 14.4                | InfiniBand  | Fat-tree             | Linux (Red Hat)  | Intel            | Dell          | NCSA (USA)
NEC SX-8             | Vector | 8         | 2.0         | 16.0                | IXS         | Multi-stage crossbar | Super-UX         | NEC              | NEC           | HLRS (Germany)

3. Benchmarks Used

We use the HPC Challenge benchmark suite [1, 2] and the Intel MPI Benchmarks version 2.3 (IMB 2.3), as described below.

3.1 HPC Challenge Benchmarks

We have run the full HPC Challenge benchmark suite [1,2] on the SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon Cluster and NEC SX-8. The HPC Challenge benchmarks are multi-faceted and provide comprehensive insight into the performance of modern high-end computing systems. They are intended to test various attributes that contribute significantly to understanding the performance of high-end computing systems. These benchmarks stress not only the processors, but also the memory subsystem and the system interconnect. They provide a better understanding of an application's performance on these computing systems and are better indicators of how high-end computing systems will perform across a wide spectrum of real-world applications.

3.2 Intel MPI Benchmarks

IMB 2.3 is the successor of the Pallas MPI Benchmarks (PMB 2.2) from Pallas GmbH [9]; in September 2003, the HPC division of Pallas merged with Intel Corp. The IMB suite is widely used in the high-performance computing community to measure the performance of important MPI functions. The benchmarks are written in ANSI C using the message-passing paradigm and comprise about 10,000 lines of code. The suite has three parts: (a) IMB for MPI-1, (b) MPI-2 one-sided communication, and (c) MPI-2 I/O. In standard mode, message sizes are 0, 1, 2, 4, 8, ..., 4194304 bytes. There are three classes of benchmarks, namely single transfer, parallel transfer and collective benchmarks.
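As an illustration of how a collective-class measurement proceeds, the sketch below times MPI_Allreduce over the standard message-size progression 0, 1, 2, 4, ..., 4194304 bytes. It is written in C with MPI in the spirit of IMB but is not taken from the IMB source; the repetition count and the use of MPI_FLOAT with MPI_SUM are illustrative choices.

    /* Sketch of a collective-class benchmark: time MPI_Allreduce over the
     * standard message sizes and report the average time per call. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_BYTES 4194304
    #define REPS      100

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The reduction operates on MPI_FLOAT, so the element count is bytes/4. */
        float *in  = calloc(MAX_BYTES, 1);
        float *out = calloc(MAX_BYTES, 1);

        for (size_t bytes = 0; bytes <= MAX_BYTES; bytes = bytes ? 2 * bytes : 1) {
            int count = (int)(bytes / sizeof(float));
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int r = 0; r < REPS; r++)
                MPI_Allreduce(in, out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
            double t = (MPI_Wtime() - t0) / REPS;   /* average time per call on this rank */
            if (rank == 0)
                printf("%10zu bytes  %12.2f usec\n", bytes, t * 1e6);
        }

        free(in);
        free(out);
        MPI_Finalize();
        return 0;
    }

The real IMB suite additionally varies the number of participating processes and reports minimum, maximum and average times across all ranks; the sketch reports only the time seen by rank 0.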
4. Results

In this section we present the results of the HPC Challenge and IMB benchmarks for the five supercomputers.

4.1 HPC Challenge Benchmarks

4.1.1 Balance of Communication to Computation

For multi-purpose HPC systems, the balance of processor speed with memory, communication, and I/O bandwidth is important. In this section, we analyze the ratio of inter-node communication bandwidth to computational speed. To characterize the communication bandwidth between SMP nodes, we use the random ring bandwidth, because for a large number of SMP nodes most MPI processes communicate with MPI processes on other SMP nodes. This means that, with 8 or more SMP nodes, the random ring bandwidth reports the available inter-node communication bandwidth per MPI process. Although the balance is calculated on the basis of MPI processes, its value should in principle be independent of the programming model, i.e., whether each SMP node runs several single-threaded MPI processes or a few (or even one) multi-threaded MPI processes, as long as the number of MPI processes on each SMP node is large enough to saturate the inter-node network [5].

Figure 1 shows the scaling of the accumulated random ring performance with computational speed. To compare measurements with different numbers of CPUs and on different architectures, all data is presented relative to the computational performance expressed by the Linpack HPL value: the HPCC random ring bandwidth is multiplied by the number of MPI processes, and the computational speed is benchmarked with HPL.

[...] especially between 32 CPUs and 64 CPUs. The NEC SX-8 scales well, which can be seen from the only slight inclination of its curve. In the case of the SGI Altix, it is worth noting the difference in the ratio between the Numalink3 and Numalink4 interconnects within the same box (512 CPUs). Although the theoretical peak bandwidth only doubled from Numalink3 to Numalink4, the random ring performance improves by a factor of 4 for runs of up to 256 processors. A steep decrease in the B/kFlop value for the SGI Altix with Numalink4 is observed for runs above 512 CPUs (from 203.12 B/kFlop at 506 CPUs to 23.18 B/kFlop at 2024 CPUs). This can also be seen in the crossover of the ratio curves between the Altix and the NEC SX-8. With Numalink3, the ratio is 93.81 B/kFlop (440 CPUs) when run within a single box. For the NEC SX-8 it is 59.64 B/kFlop (576 CPUs), and this value is consistent between the 128-CPU and 576-CPU runs. For the Cray Opteron it is 24.41 B/kFlop (64 CPUs).

[Figure 1: Accumulated random ring bandwidth (GByte/s) versus Linpack (HPL) performance (TFlop/s), log-log scale, for the NEC SX-8 (32-576 CPUs), Cray Opteron (4-64 CPUs), SGI Altix Numalink3 (64-440 CPUs) and SGI Altix Numalink4 (64-2024 CPUs).]
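The balance plotted in Figure 1 can be reproduced directly from the two HPCC quantities involved. The short C sketch below shows the arithmetic; the three input values are hypothetical placeholders, not measurements from this study.

    /* The balance metric used above, as plain arithmetic: accumulated random
     * ring bandwidth (per-process bandwidth times the number of MPI processes)
     * divided by the HPL performance, expressed in Bytes/kFlop.
     * All input values are placeholders. */
    #include <stdio.h>

    int main(void)
    {
        double ring_bw_per_proc = 0.5e9;  /* HPCC random ring bandwidth per MPI process, Bytes/s (placeholder) */
        int    nprocs           = 512;    /* number of MPI processes (placeholder) */
        double hpl_tflops       = 2.5;    /* HPL result, TFlop/s (placeholder) */

        double accumulated_bw = ring_bw_per_proc * nprocs;   /* Bytes/s */
        double hpl_kflops     = hpl_tflops * 1e9;            /* 1 TFlop/s = 1e9 kFlop/s */

        printf("accumulated random ring bandwidth: %.1f GByte/s\n", accumulated_bw / 1e9);
        printf("balance: %.2f Bytes/kFlop\n", accumulated_bw / hpl_kflops);
        return 0;
    }

With these placeholder inputs the sketch prints 256.0 GByte/s and 102.40 Bytes/kFlop, the same units in which the Altix, SX-8 and Cray Opteron ratios above are quoted.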