Overview of the HPC Challenge Benchmark Suite

Jack J. Dongarra (Oak Ridge National Laboratory and University of Tennessee Knoxville) and Piotr Luszczek (University of Tennessee Knoxville)

Abstract— The HPC Challenge¹ benchmark suite has been released by the DARPA HPCS program to help define the performance boundaries of future Petascale computing systems. HPC Challenge is a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance Linpack (HPL) benchmark used in the Top500 list. Thus, the suite is designed to augment the Top500 list, providing benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality, and providing a framework for including additional tests. In particular, the suite is composed of several well known computational kernels (STREAM, HPL, matrix multiply – DGEMM, parallel matrix transpose – PTRANS, FFT, RandomAccess, and bandwidth/latency tests – b_eff) that attempt to span high and low spatial and temporal locality space. By design, the HPC Challenge tests are scalable, with the size of the data sets being a function of the largest HPL matrix for the tested system.

I. HIGH PRODUCTIVITY COMPUTING SYSTEMS

The DARPA High Productivity Computing Systems (HPCS) program [1] is focused on providing a new generation of economically viable, high productivity computing systems for national security and for the industrial user community. HPCS program researchers have initiated a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and, ultimately, productivity in the High End Computing (HEC) domain.

The HPCS program seeks to create trans-petaflops systems of significant value to the Government HPC community. Such value will be determined by assessing many additional factors beyond just theoretical peak flops (floating-point operations). Ultimately, the goal is to decrease the time-to-solution, which means decreasing both the execution time and the development time of an application on a particular system. Evaluating the capabilities of a system with respect to these goals requires a different assessment process. The goal of the HPCS assessment activity is to prototype and baseline a process that can be transitioned to the acquisition community for 2010 procurements. As part of this effort we are developing a scalable benchmark for the HPCS systems.

The basic goal of performance modeling is to measure, predict, and understand the performance of a program or set of programs on a computer system. The applications of performance modeling are numerous, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems.

This paper is organized as follows: sections II and III give the motivation and an overview of HPC Challenge, while section IV provides a more detailed description of the HPC Challenge tests and section V talks briefly about the scalability of the tests; sections VI, VII, and VIII describe the rules, the software installation process, and some of the current results, respectively. Finally, section IX concludes the paper.

II. MOTIVATION

The DARPA High Productivity Computing Systems (HPCS) program has initiated a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and, ultimately, productivity in the HPC domain. With this in mind, a set of computational kernels was needed to test and rate a system. The HPC Challenge suite of benchmarks consists of four local (matrix-matrix multiply, STREAM, RandomAccess, and FFT) and four global (High Performance Linpack – HPL, parallel matrix transpose – PTRANS, RandomAccess, and FFT) kernel benchmarks. HPC Challenge is designed to approximately bound computations of high and low spatial and temporal locality (see Fig. 1). In addition, because the HPC Challenge kernels consist of simple mathematical operations, the suite provides a unique opportunity to look at language and parallel programming model issues. In the end, the benchmark is to serve both the system user and designer communities [2].

¹This work was supported in part by the DARPA, NSF, and DOE through the DARPA HPCS program under grant FA8750-04-1-0219.
III. THE BENCHMARK TESTS

The first phase of the project developed, hardened, and reported on a number of benchmarks. The collection includes tests on a single processor (local) and tests over the complete system (global). In particular, to characterize the architecture of the system we consider three testing scenarios:
1) Local – only a single processor is performing computations.
2) Embarrassingly Parallel – each processor in the entire system is performing computations but they do not communicate with each other explicitly.
3) Global – all processors in the system are performing computations and they explicitly communicate with each other.

The HPC Challenge benchmark consists at this time of 7 performance tests: HPL [3], STREAM [4], RandomAccess, PTRANS, FFT (implemented using FFTE [5]), DGEMM [6], [7], and b_eff (MPI latency/bandwidth test) [8], [9], [10]. HPL is the Linpack TPP (toward peak performance) benchmark; it stresses the floating point performance of a system. STREAM is a benchmark that measures sustainable memory bandwidth (in Gbyte/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from multiprocessor memory. Latency/Bandwidth measures (as the name suggests) latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.

Many of the aforementioned tests were widely used before HPC Challenge was created. At first, this may make our benchmark seem merely a packaging effort. However, almost all components of HPC Challenge were augmented from their original form to provide a consistent verification and reporting scheme. We should also stress the importance of running these very tests on a single machine and reporting the results at once: the tests were useful to the HPC community separately before, and the unified HPC Challenge framework creates an unprecedented view of the performance characterization of a system – a comprehensive view that captures data for various types of analysis depending on end user needs.

Each of the included tests examines system performance for various points of the conceptual spatial and temporal locality space shown in Fig. 1. The rationale for such a selection of tests is to measure performance bounds on metrics important to HPC applications. The expected behavior of an application is to go through various locality space points during runtime. Consequently, an application may be represented as a point in the locality space that is an average (possibly time-weighed) of its various locality behaviors. Alternatively, a decomposition can be made into time-disjoint periods in which the application exhibits a single locality characteristic; the application's performance is then obtained by combining the partial results from each of the periods.

Another aspect of performance assessment addressed by HPC Challenge is the ability to optimize benchmark code. For that, we allow two different runs to be reported:
• Base run done with the provided reference implementation.
• Optimized run that uses architecture specific optimizations.
The base run, in a sense, represents the behavior of legacy code because it is conservatively written using only widely available programming languages and libraries. It reflects a commonly used approach to parallel processing, sometimes referred to as hierarchical parallelism, that combines the Message Passing Interface (MPI) with threading from OpenMP. At the same time we recognize the limitations of the base run and hence allow (or even encourage) optimized runs to be made. The optimizations may include alternative implementations in different programming languages using parallel environments available specifically on the tested system. To stress the productivity aspect of the HPC Challenge benchmark, we require that information about the changes made to the original code be submitted together with the benchmark results. While we understand that full disclosure of optimization techniques may sometimes be impossible (due to trade secrets), we ask at least for some guidance for users that would like to use similar optimizations in their applications.
[Fig. 1. Targeted application areas in the memory access locality space. The figure places the HPC Challenge kernels (HPL, DGEMM, PTRANS, STREAM, FFT, RandomAccess, b_eff) and representative applications (CFD, Radar X-section, TSP, DSP) along the spatial and temporal locality axes.]

IV. BENCHMARK DETAILS

Almost all tests included in our suite operate on either matrices or vectors. The size of the former we will denote below as n and of the latter as m. The following holds throughout the tests: n² ≈ m ≈ Available Memory. In other words, the data for each test is scaled so that the matrices or vectors are large enough to fill almost all available memory.

HPL (High Performance Linpack) is an implementation of the Linpack TPP (Toward Peak Performance) variant of the original Linpack benchmark, which measures the floating point rate of execution for solving a linear system of equations. HPL solves a linear system of equations of order n,

  Ax = b;  A ∈ R^(n×n);  x, b ∈ R^n,

by first computing the LU factorization with row partial pivoting of the n by n+1 coefficient matrix:

  P[A, b] = [[L, U], y].

Since the row pivoting (represented by the permutation matrix P) and the lower triangular factor L are applied to b as the factorization progresses, the solution x is obtained in one step by solving the upper triangular system

  Ux = y.

The lower triangular matrix L is left unpivoted and the array of pivots is not returned. The operation count is (2/3)n³ − (1/2)n² for the factorization phase and 2n² for the solve phase. Correctness of the solution is ascertained by calculating the scaled residuals:

  ‖Ax − b‖_∞ / (ε ‖A‖_1 n),
  ‖Ax − b‖_∞ / (ε ‖A‖_1 ‖x‖_1), and
  ‖Ax − b‖_∞ / (ε ‖A‖_∞ ‖x‖_∞ n),

where ε is the machine precision for 64-bit floating-point values and n is the size of the problem.
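For concreteness, the first of these residual checks can be written in a few lines of C. This is an illustrative sketch, not the HPL source (which evaluates the residuals on distributed data); the function name scaled_residual is ours.

    #include <math.h>
    #include <float.h>

    /* Illustrative check of ||Ax - b||_inf / (eps * ||A||_1 * n);
     * the actual benchmark computes all three scaled residuals. */
    static double scaled_residual(const double *A, const double *x,
                                  const double *b, int n)
    {
        double r_inf = 0.0, a_one = 0.0;
        /* ||Ax - b||_inf : infinity norm of the residual vector */
        for (int i = 0; i < n; i++) {
            double r = -b[i];
            for (int j = 0; j < n; j++)
                r += A[i * n + j] * x[j];
            if (fabs(r) > r_inf) r_inf = fabs(r);
        }
        /* ||A||_1 : maximum absolute column sum */
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += fabs(A[i * n + j]);
            if (s > a_one) a_one = s;
        }
        return r_inf / (DBL_EPSILON * a_one * (double)n);
    }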

RandomAccess measures the rate of integer updates to random memory locations (GUPS, giga-updates per second). The operation being performed on an integer array of size m is

  x ← f(x);  f : Z^m → Z^m;  x ∈ Z^m,

where each update has the form f : x ↦ x ⊕ a_i, with a_i a pseudo-random sequence. The operation count is 4m, and since all the operations are on integral values over the GF(2) field, they can be checked exactly with a reference implementation. The verification procedure allows 1% of the operations to be incorrect (skipped), which allows loosening the concurrent memory update semantics on shared memory architectures.
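A minimal serial sketch of the update loop follows. The XOR recurrence over GF(2) with the constant POLY matches the commonly published form of the HPCC generator, but the seed handling here is simplified (the distributed code can start the sequence at an arbitrary offset and spreads the 4m updates across all processors), so treat the details as illustrative.

    #include <stdint.h>

    #define POLY 7ULL  /* x^63 + x^2 + x + 1 over GF(2) */

    /* Serial sketch of RandomAccess: each pseudo-random value a_i
     * selects a table location and is XOR-ed into it.  m must be a
     * power of two so that (a_i & (m-1)) indexes the table. */
    void random_access_update(uint64_t *table, uint64_t m)
    {
        uint64_t ai = 1;  /* simplified seed of the sequence */
        for (uint64_t n = 0; n < 4 * m; n++) {
            /* advance the GF(2) linear generator */
            ai = (ai << 1) ^ ((int64_t)ai < 0 ? POLY : 0);
            /* update: x <- x XOR a_i at a pseudo-random location */
            table[ai & (m - 1)] ^= ai;
        }
    }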
DGEMM measures the floating point rate of execution of double precision real matrix-matrix multiplication. The exact operation performed is

  C ← βC + αAB,  where A, B, C ∈ R^(n×n); α, β ∈ R.

The operation count for the multiply is 2n³, and correctness of the operation is ascertained by calculating the scaled residual ‖C − Ĉ‖ / (ε n ‖C‖_F), where Ĉ is the result of a reference implementation of the multiplication.
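This operation maps directly onto the Level 3 BLAS routine DGEMM; a base run links an optimized BLAS, while the unblocked loop nest below merely illustrates the 2n³-flop computation being timed (row-major storage assumed).

    /* Naive form of C <- beta*C + alpha*A*B for square n x n
     * matrices; for illustration only, an optimized BLAS performs
     * the same operation in the actual benchmark. */
    void dgemm_ref(int n, double alpha, const double *A,
                   const double *B, double beta, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = beta * C[i * n + j] + alpha * s;
            }
    }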
STREAM is a simple benchmark program that measures sustainable memory bandwidth (in Gbyte/s) and the corresponding computation rate for four simple vector kernels:

  COPY:  c ← a
  SCALE: b ← αc
  ADD:   c ← a + b
  TRIAD: a ← b + αc

where a, b, c ∈ R^m and α ∈ R. As mentioned earlier, we try to operate on large data objects. The size of these objects is determined at runtime, which contrasts with the original version of the STREAM benchmark, which uses static storage and a size determined at compile time. The original benchmark gives the compiler more information (and control) over data alignment, loop trip counts, etc. The benchmark measures Gbyte/s; the number of items transferred is either 2m or 3m depending on the kernel. The norm of the difference between the reference and computed vectors is used to verify the result: ‖x − x̂‖.

FFT measures the floating point rate of execution of a double precision complex one-dimensional Discrete Fourier Transform (DFT) of size m,

  Z_k ← Σ_j z_j e^(−2πi jk/m);  1 ≤ k ≤ m;  z, Z ∈ C^m.

The operation count is taken to be 5m log₂ m for the calculation of the computational rate (in Gflop/s). Verification is done with the residual ‖x − x̂‖ / (ε log m), where x̂ is the result of applying a reference implementation of the inverse transform to the outcome of the benchmarked code (in infinite-precision arithmetic the residual should be zero).

PTRANS (parallel matrix transpose) exercises the communication pattern where pairs of processors exchange large messages simultaneously. It is a useful test of the total communications capacity of the system interconnect. The performed operation sets a random n by n matrix to the sum of its transpose and another random matrix,

  A ← Aᵀ + B;  A, B ∈ R^(n×n).

The data transfer rate (in Gbyte/s) is calculated by dividing the size of n² matrix entries by the time it took to perform the transpose. A scaled residual of the form ‖A − Â‖ / (ε n) verifies the calculation.

Communication bandwidth and latency is a set of tests that measure the latency and bandwidth of a number of simultaneous communication patterns. The patterns are based on b_eff (the effective bandwidth benchmark), although they differ slightly from the original b_eff. The operation count is linearly dependent on the number of processors in the tested system, and the time the tests take depends on the parameters of the tested network. The checks are built into the benchmark code by verifying data after it has been received.
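The flavor of these measurements can be conveyed with a basic MPI ping-pong between two ranks, from which one-way time and bandwidth are derived. This generic sketch is not the b_eff code, which cycles through many simultaneous patterns (such as random rings); the message size and repetition count below are arbitrary examples, and at least two ranks are required.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nrep = 100;
        int nbytes = 1 << 20;  /* 1 MiB message (example) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = malloc(nbytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < nrep; i++) {
            if (rank == 0) {        /* send, then wait for the echo */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* echo the message back */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* half the round-trip time approximates one-way time */
        double dt = (MPI_Wtime() - t0) / (2.0 * nrep);
        if (rank == 0)
            printf("one-way time %.3e s, bandwidth %.3f Gbyte/s\n",
                   dt, nbytes / dt / 1e9);
        free(buf);
        MPI_Finalize();
        return 0;
    }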

V. SCALABILITY ISSUES FOR BENCHMARKS WITH SCALABLE INPUT DATA

A. Notation

The following symbols are used in the formulae:
• P – number of CPUs
• M – total size of system memory
• r – rate of execution (unit: Gflop/s, GUPS, etc.)
• t – time
• N – size of global matrix for HPL and PTRANS
• V – size of global vector for RandomAccess and FFT

B. Assumptions

Memory size per CPU is constant as the system grows. This assumption is based on the architectural design of almost all systems in existence. Hence, the total amount of memory available in the entire system is linearly proportional to the number of processors. For HPL the dominant cost is CPU-related because computation has a higher complexity order than communication: O(n³) versus O(n²).

C. Scalability of Time to Solution

Time complexity for HPL is O(n³) (the hidden constant is 2/3), so the time to solution is:

  t_HPL ∝ N³ / r_HPL    (1)

Since N is the size of a square matrix, we need to take the square root of the available memory:

  N ∝ √M ∝ √P

The rate of execution for HPL is determined by the number of processors, since computation (rather than communication) dominates in terms of complexity:

  r_HPL ∝ P

This leads to:

  t_HPL ∝ √P

Time complexity for RandomAccess is O(n), so the time to solution is:

  t_RandomAccess ∝ V / r_RandomAccess    (2)

The main table for RandomAccess should be as big as half of the total memory, so we get:

  V ∝ M ∝ P

The rate of execution for RandomAccess can be argued to have various forms:
• If we assume that the interconnect scales with the number of processors then the rate would also scale: r_RandomAccess ∝ P.
• Real-life experience tells us that r_RandomAccess depends on the interconnect and is independent of the number of processors due to interconnect inefficiency: r_RandomAccess ∝ 1. Even worse, it is also conceivable that the rate is a decreasing function of the number of processors: r_RandomAccess ∝ 1/P.

Conservatively assuming that r_RandomAccess ∝ 1, it follows that:

  t_RandomAccess ∝ P

D. Scalability Tests

It is easy to verify the theoretical framework from the previous section for the HPL test. Table I shows the relevant data from the first and the latest editions of the TOP500 list. The time to solution grew over 8 times while the square root of the processor count grew over 11 times. This amounts to about 27% error, which is a relatively good result considering how much has changed during those 12 years, including progress in software (algorithmic improvements) and hardware (chip architecture, materials, and fabrication).

TABLE I
SUBMISSION DATA FOR THE NUMBER ONE ENTRIES ON THE FIRST AND THE LATEST EDITIONS OF THE TOP500 LISTS.

  Date            Rmax [Gflop/s]   Nmax      P        t [s]   √P
  June 1993       59.7             52224     1024     1591    32
  November 2005   280600           1769471   131072   13163   362

The computer systems that were used in the initial RandomAccess tests are listed in Table II. The table also shows the initially obtained RandomAccess number for each system. This number was used to gauge the running time of RandomAccess in the final version of the code.

TABLE II
COMPUTER SYSTEMS USED IN TESTS. THE RANDOMACCESS NUMBER WAS OBTAINED DURING EARLY TESTING OF THE CURRENT CODE.

  Processor   Interconnect   GUPS      Peak [Gflop/s]   Mem [GiB]   SMP [CPU]
  Power4      Federation     0.00262   4∗1.7            1           32
  Cray X1     2D torus       0.14544   16∗0.8           4           4
  Opteron     Myrinet 2000   0.00319   2∗1.4            1           2
  Itanium 2   NUMAlink       0.00305   4∗1.5            8           4
  Xeon        InfiniBand     0.00065   2∗2.4            1           2

  Peak (per processor) = flops per cycle ∗ frequency; Mem – memory per processor; SMP – number of processors in each SMP node.

Table III shows estimates of the time it takes to run HPL and RandomAccess using the analysis from the previous section. For a 256-CPU system RandomAccess takes much longer to run than HPL. The only exception is the Cray X1: it has a large amount of memory per CPU, which makes for a longer HPL run, and a large GUPS number, which makes for a short RandomAccess run.

TABLE III
ESTIMATED TIME IN MINUTES TO PERFORM A FULL SYSTEM TEST OF HPL AND RANDOMACCESS (R.A. = RANDOMACCESS).

  Manufacturer   System      64 CPUs           256 CPUs
                             HPL      R.A.     HPL       R.A.
  IBM            Power4      29.08    108.9    58.1      435.7
  Cray           X1          123.6    7.8      247.2     31.4
  Atipa          Opteron     70.6     89.6     141.2     358.4
  SGI            Itanium 2   745.9    750.0    1491.9    3000.2
  Voltaire       Xeon        41.2     440.2    82.4      1760.8

One of HPC Challenge's guiding principles was not to let the running time exceed twice the HPL time. Keeping the code in line with this principle meant reducing the time required by RandomAccess by artificially terminating the original algorithm and reflecting this change in the verification procedure. Both of these enhancements have been implemented in the current version of the code.
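Estimates of the kind shown in Table III can be reproduced with simple arithmetic from equations (1) and (2) together with the leading constants given above: (2/3)N³ flops for HPL and 4 updates per table word for RandomAccess. The sketch below does exactly that; all input figures are user-supplied examples, not measurements, and the actual published estimates may use slightly different constants.

    #include <stdio.h>

    /* Estimate full-system HPL and RandomAccess run times.
     * Rates are whole-system figures supplied by the user
     * (e.g., scaled up from the per-system data in Table II). */
    int main(void)
    {
        double N = 80000.0;      /* HPL matrix order (example)       */
        double V = 4.0e9;        /* RandomAccess table size (example) */
        double r_hpl = 256.0e9;  /* flop/s (example)                 */
        double r_ra = 0.003e9;   /* updates/s, i.e. 0.003 GUPS       */

        double t_hpl = (2.0 / 3.0) * N * N * N / r_hpl; /* eq. (1) */
        double t_ra = 4.0 * V / r_ra;                   /* eq. (2) */
        printf("HPL: %.1f min, RandomAccess: %.1f min\n",
               t_hpl / 60.0, t_ra / 60.0);
        return 0;
    }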

VI. RULES FOR RUNNING THE BENCHMARK

There must be one baseline run submitted for each computer system entered in the archive. There may also exist an optimized run for each computer system.

1) Baseline Runs
Optimizations as described below are allowed.
a) Compile and load options
Compiler or loader flags which are supported and documented by the supplier are allowed. These include porting, optimization, and preprocessor invocation.
b) Libraries
Linking to optimized versions of the following libraries is allowed:
• BLAS
• MPI
Acceptable use of such libraries is subject to the following rules:
• All libraries used shall be disclosed with the results submission. Each library shall be identified by library name, revision, and source (supplier). Libraries which are not generally available are not permitted unless they are made available by the reporting organization within 6 months.
• Calls to library subroutines should have equivalent functionality to that in the released benchmark code. Code modifications to accommodate various library call formats are not allowed.
• Only complete benchmark output may be submitted – partial results will not be accepted.

2) Optimized Runs
a) Code modification
Provided that the input and output specification is preserved, the following routines may be substituted:
• In HPL: HPL_pdgesv(), HPL_pdtrsv() (factorization and substitution functions)
• No changes are allowed in the DGEMM component
• In PTRANS: pdtrans()
• In STREAM: tuned_STREAM_Copy(), tuned_STREAM_Scale(), tuned_STREAM_Add(), tuned_STREAM_Triad()
• In RandomAccess: MPIRandomAccessUpdate() and RandomAccessUpdate()
• In FFT: fftw_malloc(), fftw_free(), fftw_one(), fftw_mpi(), fftw_create_plan(), fftw_destroy_plan(), fftw_mpi_create_plan(), fftw_mpi_local_sizes(), fftw_mpi_destroy_plan() (all of these functions are compatible with FFTW 2.1.5 [11], [12])
• In the b_eff component alternative MPI routines might be used for communication, but only standard MPI calls are to be performed and only to the MPI library that is widely available on the tested system.
b) Limitations of Optimization
i) Code with limited calculation accuracy
The calculation should be carried out in full precision (64-bit or the equivalent). However, the substitution of algorithms is allowed (see next).
ii) Exchange of the used mathematical algorithm
Any change of algorithms must be fully disclosed and is subject to review by the HPC Challenge Committee. Passing the verification test is a necessary condition for such an approval. The substituted algorithm must be as robust as the baseline algorithm. For the matrix multiply in the HPL benchmark, Strassen's algorithm may not be used as it changes the operation count of the algorithm.
iii) Using the knowledge of the solution
Any modification of the code or input data sets which uses knowledge of the solution or of the verification test is not permitted.
iv) Code to circumvent the actual computation
Any modification of the code to circumvent the actual computation is not permitted.
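As an illustration of the kind of substitution rule 2a permits, the sketch below shows a hypothetical OpenMP variant of the STREAM Triad kernel supplied through the tuned_STREAM_Triad() hook named above. The argument list is an assumption made for self-containment; the hook in the distributed source operates on the benchmark's own arrays, so the actual signature must be checked there.

    /* Hypothetical substituted Triad kernel for an optimized run:
     * a <- b + alpha * c, parallelized across OpenMP threads.
     * Compile with OpenMP enabled (e.g., -fopenmp).  The real hook's
     * signature in the HPC Challenge source may differ. */
    void tuned_STREAM_Triad(double *a, const double *b, const double *c,
                            double alpha, long m)
    {
        long j;
    #pragma omp parallel for
        for (j = 0; j < m; j++)
            a[j] = b[j] + alpha * c[j];
    }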
The reference implementation HPL pdtrsv() (factorization and substitution should be used for the base run. The installation of the functions) software requires creating a script file for Unix’s make(1) • no changes are allowed in the DGEMM compo- utility. The distribution archive comes with script files for nent many common computer architectures. Usually, few changes • In PTRANS: pdtrans() to one of these files will produce the script file for a given • In STREAM: platform. tuned STREAM Copy(), After, a successful compilation the benchmark is ready to tuned STREAM Scale(), run. However, it is recommended that changes are made to tuned STREAM Add(), the benchmark’s input file such that the sizes of data to use tuned STREAM Triad() during the run are appropriate for the tested system. The sizes should reflect the available memory on the system and number [8] A. E. Koniges, R. Rabenseifner, and K. Solchenbach, “Benchmark of processors available for computations. design for characterization of balanced high-performance architectures,” in Proceedings of the 15th International Parallel and Distributed We have collected a comprehensive set of notes on the Processing Symposium (IPDPS’01), Workshop on Massively Parallel HPC Challenge benchmark. They can be found at http: Processing (WMPP), vol. 3. San Francisco, CA: In IEEE Computer //icl.cs.utk.edu/hpcc/faq/. Society Press, April 23-27 2001. [9] R. Rabenseifner and A. E. Koniges, “Effective communication and J. Dongarra and Yiannis Cotronis VIII.EXAMPLE RESULTS file-i/o bandwidth benchmarks,” in (Eds.), Recent Advances in Parallel Virtual Machine and Message Fig. 2 shows a sample rendering of the results web page: Passing Interface, Proceedings of the 8th European PVM/MPI Users’ http://icl.cs.utk.edu/hpcc/hpcc results. Group Meeting, EuroPVM/MPI 2001. Santorini, Greece: LNCS 2131, September 23-26 2001, pp. 24–35. cgi. It is impossible to show here all of the results for [10] R. Rabenseifner, “Hybrid parallel programming on HPC platforms,” in nearly 60 systems submitted so far to the web site. The Proceedings of the Fifth European Workshop on OpenMP, EWOMP ’03, results database is publicly available at the aforementioned Aachen, Germany, September 22-26 2003, pp. 185–194. [11] M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture address and can be exported to Excel spreadsheet or an XML for the FFT,” in Proc. 1998 IEEE Intl. Conf. Acoustics Speech and Signal file. Fig. 3 shows a sample kiviat diagram generated using Processing, vol. 3. IEEE, 1998, pp. 1381–1384. the benchmark results. Kiviat diagrams can be generated [12] ——, “The design and implementation of FFTW3,” Proceedings of the IEEE, vol. 93, no. 2, 2005, special issue on ”Program Generation, at the website and allow easy comparative analysis for Optimization, and Adaptation”. multi-dimensional results from the HPC Challenge database.

VIII. EXAMPLE RESULTS

Fig. 2 shows a sample rendering of the results web page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi. It is impossible to show here all of the results for the nearly 60 systems submitted so far to the web site. The results database is publicly available at the aforementioned address and can be exported to an Excel spreadsheet or an XML file. Fig. 3 shows a sample kiviat diagram generated using the benchmark results. Kiviat diagrams can be generated at the website and allow easy comparative analysis of multi-dimensional results from the HPC Challenge database.

[Fig. 2. Sample HPC Challenge results web page.]

IX. CONCLUSIONS

No single test can accurately compare the performance of HPC systems. The HPC Challenge benchmark test suite stresses not only the processors, but the memory system and the interconnect. It is a better indicator of how an HPC system will perform across a spectrum of real-world applications. Now that the more comprehensive, informative HPC Challenge benchmark suite is available, it can be used in preference to comparisons and rankings based on single tests.

The real utility of the HPC Challenge benchmarks is that architectures can be described with a wider range of metrics than just flop/s from HPL. When looking only at HPL performance and the Top500 list, inexpensive build-your-own clusters appear to be much more cost effective than more sophisticated HPC architectures. Even a small percentage of random memory accesses in real applications can significantly affect the overall performance of that application on architectures not designed to minimize or hide memory latency. The HPC Challenge benchmarks provide users with additional information to justify policy and purchasing decisions. We expect to expand, and perhaps remove, some existing benchmark components as we learn more about the collection.

REFERENCES
[1] "High Productivity Computer Systems," http://www.highproductivity.org/.
[2] W. Kahan, "The baleful effect of computer benchmarks upon applied mathematics, physics and chemistry," 1997. The John von Neumann Lecture at the 45th Annual Meeting of SIAM, Stanford University.
[3] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present, and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 1–18, 2003.
[4] J. McCalpin, "STREAM: Sustainable Memory Bandwidth in High Performance Computers," http://www.cs.virginia.edu/stream/.
[5] D. Takahashi and Y. Kanada, "High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers," The Journal of Supercomputing, vol. 15, no. 2, pp. 207–228, 2000.
[6] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, "Algorithm 679: A set of Level 3 Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, vol. 16, pp. 1–17, March 1990.
[7] ——, "A set of Level 3 Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software, vol. 16, pp. 18–28, March 1990.
[8] A. E. Koniges, R. Rabenseifner, and K. Solchenbach, "Benchmark design for characterization of balanced high-performance architectures," in Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS'01), Workshop on Massively Parallel Processing (WMPP), vol. 3, San Francisco, CA: IEEE Computer Society Press, April 23–27, 2001.
[9] R. Rabenseifner and A. E. Koniges, "Effective communication and file-I/O bandwidth benchmarks," in J. Dongarra and Y. Cotronis (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proceedings of the 8th European PVM/MPI Users' Group Meeting, EuroPVM/MPI 2001, Santorini, Greece: LNCS 2131, September 23–26, 2001, pp. 24–35.
[10] R. Rabenseifner, "Hybrid parallel programming on HPC platforms," in Proceedings of the Fifth European Workshop on OpenMP, EWOMP '03, Aachen, Germany, September 22–26, 2003, pp. 185–194.
[11] M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the FFT," in Proc. 1998 IEEE Intl. Conf. Acoustics Speech and Signal Processing, vol. 3, IEEE, 1998, pp. 1381–1384.
[12] ——, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, 2005. Special issue on "Program Generation, Optimization, and Adaptation."

[Fig. 3. Sample kiviat diagram of HPC Challenge results for three clusters with 64 AMD Opteron 2.2 GHz processors and three different interconnects (Rapid Array, Quadrics, GigE). The axes show normalized values (0.2 to 1.0) of G-HPL, G-PTRANS, G-RandomAccess, G-FFT, S-DGEMM, S-STREAM Triad, RandomRing Bandwidth, and RandomRing Latency.]