Introduction to the HPC Challenge Benchmark Suite
Piotr Luszczek (1), Jack J. Dongarra (1,2), David Koester (3), Rolf Rabenseifner (4,5), Bob Lucas (6,7), Jeremy Kepner (8), John McCalpin (9), David Bailey (10), and Daisuke Takahashi (11)

(1) University of Tennessee Knoxville, (2) Oak Ridge National Laboratory, (3) MITRE, (4) High Performance Computing Center (HLRS), (5) University Stuttgart, www.hlrs.de/people/rabenseifner, (6) Information Sciences Institute, (7) University of Southern California, (8) MIT Lincoln Lab, (9) IBM Austin, (10) Lawrence Berkeley National Laboratory, (11) University of Tsukuba

March 2005

Abstract

The HPC Challenge benchmark suite (*) has been released by the DARPA HPCS program to help define the performance boundaries of future Petascale computing systems. HPC Challenge is a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance Linpack (HPL) benchmark used in the Top500 list. Thus, the suite is designed to augment the Top500 list, providing benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality, and providing a framework for including additional tests. In particular, the suite is composed of several well-known computational kernels (STREAM, HPL, matrix multiply – DGEMM, parallel matrix transpose – PTRANS, FFT, RandomAccess, and bandwidth/latency tests – b_eff) that attempt to span high and low spatial and temporal locality space. By design, the HPC Challenge tests are scalable, with the size of the data sets being a function of the largest HPL matrix for the tested system.

(*) This work was supported in part by DARPA, NSF, and DOE through the DARPA HPCS program under grant FA8750-04-1-0219.

1 High Productivity Computing Systems

The DARPA High Productivity Computing Systems (HPCS) program [1] is focused on providing a new generation of economically viable high productivity computing systems for national security and for the industrial user community. HPCS program researchers have initiated a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and, ultimately, productivity in the High End Computing (HEC) domain.

The HPCS program seeks to create trans-petaflops systems of significant value to the Government HPC community. Such value will be determined by assessing many additional factors beyond just theoretical peak flops (floating-point operations). Ultimately, the goal is to decrease the time-to-solution, which means decreasing both the execution time and the development time of an application on a particular system. Evaluating the capabilities of a system with respect to these goals requires a different assessment process. The goal of the HPCS assessment activity is to prototype and baseline a process that can be transitioned to the acquisition community for 2010 procurements. As part of this effort we are developing a scalable benchmark for the HPCS systems.

The basic goal of performance modeling is to measure, predict, and understand the performance of a computer program or set of programs on a computer system. The applications of performance modeling are numerous, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems.

This paper is organized as follows: Sections 2 and 3 give the motivation for and an overview of HPC Challenge, Sections 4 and 5 provide a more detailed description of the HPC Challenge tests, and Section 6 talks briefly about the scalability of the tests; Sections 7, 8, and 9 describe the rules, the software installation process, and some of the current results, respectively. Finally, Section 10 concludes the paper.

2 Motivation

The DARPA High Productivity Computing Systems (HPCS) program has initiated a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and, ultimately, productivity in the HPC domain. With this in mind, a set of computational kernels was needed to test and rate a system. The HPC Challenge suite of benchmarks consists of four local (matrix-matrix multiply, STREAM, RandomAccess, and FFT) and four global (High Performance Linpack – HPL, parallel matrix transpose – PTRANS, RandomAccess, and FFT) kernel benchmarks. HPC Challenge is designed to approximately bound computations of high and low spatial and temporal locality (see Figure 1). In addition, because the HPC Challenge kernels consist of simple mathematical operations, the suite provides a unique opportunity to look at language and parallel programming model issues. In the end, the benchmark is to serve both the system user and designer communities [2].

3 The Benchmark Tests

The first phase of the project has developed, hardened, and reported on a number of benchmarks. The collection includes tests on a single processor (local) and tests over the complete system (global). In particular, to characterize the architecture of the system we consider three testing scenarios:

1. Local – only a single processor is performing computations.

2. Embarrassingly Parallel – each processor in the entire system is performing computations, but they do not communicate with each other explicitly.

3. Global – all processors in the system are performing computations and they explicitly communicate with each other.

The HPC Challenge benchmark consists at this time of seven performance tests: HPL [3], STREAM [4], RandomAccess, PTRANS, FFT (implemented using FFTE [5]), DGEMM [6, 7], and b_eff (an MPI latency/bandwidth test) [8, 9, 10]. HPL is the Linpack TPP (Toward Peak Performance) benchmark; it stresses the floating-point performance of a system. STREAM is a benchmark that measures sustainable memory bandwidth (in Gbyte/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from multiprocessor memory. Latency/Bandwidth measures (as the name suggests) the latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.

Many of the aforementioned tests were widely used before HPC Challenge was created. At first, this may make our benchmark seem merely a packaging effort. However, almost all components of HPC Challenge were augmented from their original form to provide a consistent verification and reporting scheme. We should also stress the importance of running these very tests on a single machine and having the results available at once. The tests were useful to the HPC community separately before, but within the unified HPC Challenge framework they create an unprecedented view of the performance characterization of a system – a comprehensive view that captures the data under the same conditions and allows for a variety of analyses depending on end-user needs.

Each of the included tests examines system performance for various points of the conceptual spatial and temporal locality space shown in Figure 1. The rationale for this selection of tests is to measure performance bounds on metrics important to HPC applications. Applications are expected to pass through various locality space points during runtime. Consequently, an application may be represented as a point in the locality space that is an average (possibly time-weighted) of its various locality behaviors. Alternatively, a decomposition can be made into time-disjoint periods in which the application exhibits a single locality characteristic. The application's performance is then obtained by combining the partial results from each period.

Another aspect of performance assessment addressed by HPC Challenge is the ability to optimize benchmark code. For that purpose we allow two different runs to be reported:

• Base run done with the provided reference implementation.

• Optimized run that uses architecture-specific optimizations.

The base run, in a sense, represents the behavior of legacy code because it is conservatively written using only widely available programming languages and libraries. It reflects a commonly used approach to parallel processing, sometimes referred to as hierarchical parallelism, that combines the Message Passing Interface (MPI) with threading from OpenMP. At the same time, we recognize the limitations of the base run and hence we allow (or even encourage) optimized runs to be made. The optimizations may include alternative implementations in different programming languages using parallel environments available specifically on the tested system. While details of such optimizations may sometimes be impossible to obtain (due to, for example, trade secrets), we ask at least for some guidance for the users that would like to use similar optimizations in their applications.

4 Benchmark Details

Almost all tests included in our suite operate on either matrices or vectors. The size of the former we will denote below as n and that of the latter as m. The following holds throughout the tests:

n^2 ≈ m ≈ Available Memory

In other words, the data for each test are scaled so that the matrices or vectors are large enough to fill almost all of the available memory.

HPL (High Performance Linpack) is an implementation of the Linpack TPP (Toward Peak Performance) variant of the original Linpack benchmark, which measures the floating-point rate of execution for solving a linear system of equations. HPL solves a linear system of equations of order n,

Ax = b;  A ∈ R^(n×n);  x, b ∈ R^n,

by first computing the LU factorization with row partial pivoting of the n by n + 1 coefficient matrix:

P[A, b] = [[L, U], y].

Since the row pivoting (represented by the permutation matrix P) and the lower triangular factor L are applied to b as the factorization progresses, the solution x is obtained in one step by solving the upper triangular system:

Ux = y.

The lower triangular matrix L is left unpivoted and the array of pivots is not returned. The operation count is (2/3)n^3 - (1/2)n^2 for the factorization phase and 2n^2 for the solve phase. Correctness of the solution is ascertained by calculating the scaled residuals:

‖Ax - b‖_∞ / (ε ‖A‖_1 n),
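As an illustration, the scaled residual above can be computed in a few lines of NumPy. This is a minimal sketch of the verification idea only, not the HPL reference code (which is written in C on top of MPI and the BLAS):

```python
import numpy as np

def scaled_residual(A, x, b):
    """Scaled residual in the style of HPL: ||Ax - b||_inf / (eps * ||A||_1 * n)."""
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps              # machine epsilon for the working precision
    r_inf = np.linalg.norm(A @ x - b, ord=np.inf)
    a_one = np.linalg.norm(A, ord=1)         # maximum absolute column sum
    return r_inf / (eps * a_one * n)

# Solve a small random system, then check it the way HPL scales its residual.
rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)                    # LU with partial pivoting internally
print(scaled_residual(A, x, b))              # O(1) or smaller for a successful solve
```

The point of the epsilon-and-norm scaling is that the acceptance test stays meaningful across problem sizes and precisions: a correct solve yields a small O(1) value regardless of n.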
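The scaling rule n^2 ≈ m ≈ Available Memory stated at the beginning of this section can be turned into a small sizing helper. The sketch below is hypothetical: the 80% memory fraction and the 8-byte double-precision element size are illustrative assumptions, not prescribed by the benchmark rules:

```python
def hpcc_problem_sizes(total_memory_bytes, fraction=0.80, elem_bytes=8):
    """Choose a matrix order n and vector length m that nearly fill memory.

    fraction   -- share of memory devoted to benchmark data (assumed here)
    elem_bytes -- bytes per element; 8 for IEEE double precision
    """
    budget = int(total_memory_bytes * fraction) // elem_bytes  # elements that fit
    n = int(budget ** 0.5)   # matrix tests store n * n elements, so n^2 ~ memory
    m = budget               # vector tests store m elements, so m ~ memory
    return n, m

# Example: sizing benchmark data for a system with 16 GiB of memory.
n, m = hpcc_problem_sizes(16 * 2**30)
```

This is why the suite's data sets grow with the machine: the tests are sized from the available memory rather than fixed, so caches cannot hold the working set on any system under test.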
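Similarly, the access pattern stressed by the RandomAccess test of Section 3 can be sketched as follows. The xorshift generator and the tiny table here are illustrative stand-ins: the actual benchmark uses its own pseudo-random stream, sizes the table to a large fraction of memory, and reports the update rate in Giga-updates per second (GUPS):

```python
def random_access_updates(log2_size=16, seed=1):
    """Apply GUPS-style random updates: table[random index] ^= random value.

    The update count is four times the table size, mirroring the RandomAccess
    convention; the generator below is a simplified stand-in, not HPCC's.
    """
    size = 1 << log2_size
    table = list(range(size))            # initialize table[i] = i
    ran = seed
    for _ in range(4 * size):
        # One Marsaglia xorshift64 step (illustrative 64-bit generator).
        ran ^= (ran << 13) & 0xFFFFFFFFFFFFFFFF
        ran ^= ran >> 7
        ran ^= (ran << 17) & 0xFFFFFFFFFFFFFFFF
        table[ran & (size - 1)] ^= ran   # XOR update at a pseudo-random location
    return table

table = random_access_updates()
```

Because consecutive updates land at unrelated addresses, the kernel defeats caching and prefetching entirely, which is exactly the low-spatial-locality corner of the locality space that RandomAccess is meant to probe.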