Original Article

High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems

The International Journal of High Performance Computing Applications 1–8. © The Author(s) 2015. DOI: 10.1177/1094342015593158. hpc.sagepub.com

Jack Dongarra [1], Michael A Heroux [2] and Piotr Luszczek [3]

Abstract

We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.

Keywords: Preconditioned conjugate gradient, multigrid smoothing, additive Schwarz, HPC benchmarking, validation and verification

[1] Department of Electrical Engineering and Computer Science, University of Tennessee, USA; Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA; School of Mathematics and School of Computer Science, University of Manchester, UK
[2] Scalable Algorithms Department, Sandia National Laboratories, Albuquerque, New Mexico, USA
[3] Department of Electrical Engineering and Computer Science, University of Tennessee, USA

Corresponding author: Piotr Luszczek, University of Tennessee, 1122 Volunteer Blvd, Suite 203, Knoxville, TN 37996-3450, USA. Email: [email protected]

1 Introduction

Many aspects of the physical world may be modeled with partial differential equations (PDEs) and lend a hand to predictive capability so as to aid scientific discovery and engineering optimization. The high-performance conjugate-gradient (HPCG) benchmark is used to test a high-performance computing (HPC) machine's ability to solve these important scientific problems. To that end, the primary scope of the project is to measure the execution rate of Krylov subspace solvers on distributed-memory hardware. In doing so, HPCG aims to increase the prominence of sparse matrix methods and put them on an equal footing with other benchmarks of high-end machines.

Over the years, the field of iterative methods has grown in significance, and today it offers a wide range of algorithms that form the backbone of non-linear and differential equation solvers. HPCG aims to tackle the complexity of the field by offering a simple test that represents the performance characteristics of these algorithms. In particular, the conjugate-gradient algorithm and a symmetric Gauss–Seidel preconditioner were chosen for measurement, and they are used to solve the Poisson differential equation on a regular three-dimensional grid discretized with a 27-point stencil.

The HPCG benchmark (Dongarra and Heroux, 2013) is a tool for ranking computer systems based on a simple additive Schwarz, symmetric Gauss–Seidel preconditioned conjugate-gradient solver. HPCG is similar in its purpose to HPL (Dongarra et al., 2003), which is currently used to rank systems as part of the TOP500 project (Meuer et al., 2013), but HPCG is intended to better represent how today's applications perform.

HPCG generates a regular sparse linear system that is mathematically similar to a finite element, finite volume or finite difference discretization of a three-dimensional heat diffusion equation on a semi-regular grid. The problem is solved using domain decomposition (Smith et al., 1996) with a conjugate-gradient method that uses an additive Schwarz preconditioner. Each subdomain is preconditioned using a symmetric Gauss–Seidel sweep.

The HPL benchmark (Dongarra et al., 2003) is one of the most widely recognized and discussed metrics for ranking high-performance computing systems. When HPL gained prominence as a performance metric in the early 1990s, there was a strong correlation between its predictions of system rankings and the ranking that full-scale applications would realize.

Computer-system vendors pursued designs that would increase HPL performance, which would in turn improve overall application performance. Currently, HPL remains tremendously valuable as a measure of historical trends, and as a stress test, especially for the leadership-class systems which are pushing the boundaries of current technology. Furthermore, HPL provides the HPC community with a valuable outreach tool, understandable to the outside world. Anyone with an appreciation of computing is impressed by the tremendous increases in performance that HPC systems have attained over the past few decades in terms of HPL. At the same time, HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations, which tend to have much stronger needs for high bandwidth and low latency. This is tied to the irregular access patterns to data that these codes tend to exhibit. In fact, we have reached a point where designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system. Worse yet, we expect the gap between HPL predictions and real application performance to increase in the future. Potentially, the fast track to a computer system with the potential to run HPL at 1 Eflop/s (10^18 floating-point calculations per second) is a design that may be very unattractive for our real applications. Without some intervention, future architectures targeted towards good HPL performance will not be a good match for our applications. As a result, we seek a new metric that will have a stronger correlation to our application base and will therefore drive system designers in directions that will enhance application performance for a broader set of HPC applications.

2 Related work

Similar benchmarks have been proposed and used before. In particular, the NAS Parallel Benchmarks (NPB) (Bailey et al., 1994, 1995; der Wijngaart, 2002) include a conjugate-gradient benchmark, and it shares many attributes with the HPCG benchmark. Despite the wide use of this benchmark, it carries the critical design decision that the matrix is chosen to have a random sparsity pattern with a uniform distribution of entries per row. This choice has led to the known side effect that a two-dimensional distribution of the matrix achieves optimal performance. Therefore, the computational and communication patterns are non-physical. Furthermore, no preconditioning is present, so the important features of a local sparse triangular solver are not represented and are not easily introduced, again because of the choice of a non-physical sparsity pattern. Although the NPB conjugate gradient has been extensively used for HPC analysis, it does not meet the criteria for our target application mix and, consequently, we do not consider it to be appropriate as a broad metric for our effort.

A lesser known but nonetheless relevant benchmark, the iterative solver benchmark (Dongarra et al., 2001), specifies the execution of a preconditioned conjugate-gradient and generalized minimal residual (GMRES) iteration using physically meaningful sparsity patterns and several preconditioners. As such, its scope is broader than what we propose here, but this benchmark does not address scalable distributed-memory parallelism or nested parallelism.
The HPL benchmark (Dongarra et al., 2003) has been a yardstick of supercomputing performance for over four decades and a basis for the biannual TOP500 (Meuer et al., 2013) list of the world's 500 fastest supercomputers for over three decades. HPCG has a similar aim by measuring the computation and communication patterns currently prevalent in a vast number of applications of computational science at multiple scales of deployment. HPCG measures the performance of the sparse iterative solver in order to reward balanced system design, as opposed to stressing the specific hardware components exercised by HPL, as elaborated in detail above.

The HPC Challenge Benchmark Suite (Dongarra and Heroux, 2013; Luszczek et al., 2006; Luszczek and Dongarra, 2010) has established itself as a performance measurement framework with a comprehensive set of computational and, more importantly, memory-access patterns that build on the popularity and relevance of HPL but add a much richer view of the benchmarked hardware. In comparison to HPCG, the most differentiating factor tends to be the focus on a multidimensional view of the tested system rather than on a single figure of merit. The HPC Challenge delivers a suite of performance metrics that may be filtered, combined or singled out according to the end-user needs and application profiles. Also of importance is the fact that the HPC Challenge does not include a component that measures sparse-solver performance directly; instead, it would have to be derived from various bandwidth and latency measurements performed across the memory hierarchy and the communication interconnect.

3 Background and goals

HPCG is designed to measure performance that is representative of many important scientific calculations with low computation-to-data-access ratios, which we call Type 1 data-access patterns. To simulate these patterns, which are commonly found in real applications, HPCG exhibits the same irregular access to memory and fine-grain recursive computations.


In contrast to the new HPCG metric, HPL is a program that factors and solves a large dense system of linear equations using Gaussian elimination with partial pivoting. The dominant calculations in this algorithm are dense matrix–matrix multiplication and related kernels, which we call Type 2 patterns. With proper organization of the computation, data access is predominantly unit stride and its cost is mostly hidden by concurrently performing computations on previously retrieved data. This kind of algorithm strongly favors computers with very high floating-point computation rates and adequate streaming memory systems. The performance issues related to Type 1 patterns may be fully eliminated when the code only exhibits Type 2 patterns, and this may lead hardware designers to not include Type 1 patterns in the design decisions for next-generation systems.

In general, we advocate that a well-rounded computer system should be designed to execute both Type 1 and Type 2 patterns efficiently, as this combination allows the system to run a broad mix of applications and run them well. Consequently, for a meaningful metric to test the true capabilities of a general-purpose computer, it should stress both Type 1 and Type 2 patterns; however, HPL only stresses Type 2 patterns and, as a metric, is incapable of measuring Type 1 patterns.

Another issue with the existing performance metrics stems from the emergence of accelerators, which are extremely effective (relative to CPUs) with Type 2 patterns, but much less so with Type 1 patterns. This is related to the divide that exists between massively parallel throughput workloads and latency-sensitive ones. For many users, HPL results show a skewed picture relative to Type 1 application performance, especially on machines that are heavily Type 2 biased, like a machine that features accelerators for the majority of its computational power.

For example, the Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GiB AMD Opteron processor and a 6 GiB NVIDIA K20 GPU (Facility, 2013b). Titan was the top-ranked system on TOP500 in November 2012 using HPL; however, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration (Facility, 2013a; Joubert et al., 2009).
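To make the Type 1 / Type 2 distinction discussed above concrete, the two loop nests below sketch the contrast; neither is taken from HPL or HPCG, and the plain-array layout is an assumption of this illustration. The dense multiply streams through rows with unit stride and reuses every operand many times, whereas the sparse product gathers through a column-index array and touches each matrix entry only once.

// Type 2 pattern: dense matrix-matrix multiply, C += A*B (unit-stride access, high data reuse).
void dense_gemm(int n, const double* A, const double* B, double* C) {
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
      for (int j = 0; j < n; ++j)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Type 1 pattern: sparse matrix-vector multiply in compressed sparse row form
// (indirect addressing, little data reuse, bandwidth- and latency-bound).
void sparse_spmv(int nrows, const int* row_ptr, const int* col_idx,
                 const double* val, const double* x, double* y) {
  for (int i = 0; i < nrows; ++i) {
    double sum = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
      sum += val[k] * x[col_idx[k]];   // gather through col_idx: irregular access
    y[i] = sum;
  }
}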
The HPCG Benchmark can help alleviate many of the problems described above by using the following principles.

1. Provides coverage of the major communication and computational patterns: the major communication patterns (both global and neighborhood collectives) and computational patterns (vector updates, dot products, sparse matrix–vector multiplications and local triangular solves) from our production differential equation codes, both implicit and explicit, are present in this benchmark. Emerging asynchronous collectives and other latency-hiding techniques can be explored in the context of HPCG and aid in their adoption and optimization in future systems.
2. Represents a minimal collection of the major patterns: HPCG is the smallest benchmark code containing these major patterns while at the same time representing a real mathematical computation (which aids in the validation and verification efforts).
3. Rewards investment in high performance of collectives: neighborhood and all-reduce collectives represent essential performance bottlenecks for our applications and can benefit from high-quality system design; improving the performance of HPCG will improve the performance of real applications.
4. Rewards investment in local-memory system performance: the local-processor performance of HPCG is largely determined by the effective use of the local-memory system. Improvements in the implementation of the HPCG data structures, the compilation of the HPCG code and the performance of the underlying system will improve the HPCG benchmark results and real application performance, along with informing application developers of new approaches to optimizing their own implementations.

Any new metric we introduce must satisfy a number of requirements. Two overarching goals for the metric are as follows.

1. To accurately predict the system rankings for a target suite of applications: the ranking of computer systems using the new metric must correlate strongly to how our real applications would rank these same systems.
2. To drive improvements to computer systems to benefit relevant applications: the metric should be designed so that, as we try to optimize metric results for a particular platform, the changes will also lead to better performance in the identified real applications; furthermore, computation of the metric should drive system reliability in ways that help the applications.


4 CG iteration setup and execution

The HPCG benchmark generates a synthetic discretized three-dimensional PDE model problem (Mattheij et al., 2005), and computes preconditioned conjugate-gradient iterations for the resulting sparse linear system. The model problem can be interpreted as a single degree-of-freedom heat diffusion equation with zero Dirichlet boundary conditions. The PDE is discretized with a finite-difference scheme on a three-dimensional rectangular grid domain with fixed spacing of the nodes.

The global domain dimensions are (nx × Px) × (ny × Py) × (nz × Pz), where nx × ny × nz are the local subgrid dimensions in the x, y and z dimensions, respectively, assigned to each Message Passing Interface (MPI) process. The local grid dimensions are read from the data file hpcg.dat, or can also be passed in as command-line arguments. The dimensions Px × Py × Pz constitute a factoring of the MPI process space that is computed automatically in the HPCG setup phase. We impose ratio restrictions on both the local and global x, y and z dimensions, which are enforced in the setup phase of HPCG.

HPCG then performs m sets of n iterations, using the same initial guess each time, where m and n are sufficiently large to test the system resilience and ability to remain operational. By doing this, we can compare the numerical results for 'correctness' at the end of each of the m sets. A single-set computation is shown in Algorithm 1 below.

The setup phase constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in the three-dimensional domain, such that the equation at point (i, j, k) depends on the values at its location and its 26 surrounding neighbors. The matrix is constructed to be weakly diagonally dominant for the interior points of the global domain, and strongly diagonally dominant for the boundary points, reflecting a synthetic conservation principle for the interior points and the impact of the zero Dirichlet boundary values on the boundary equations. The resulting sparse linear system has the following properties:

- a sparse matrix with 27 nonzero entries per row for interior equations and 7 to 18 nonzero terms for boundary equations;
- a symmetric, positive definite, nonsingular linear operator;
- the boundary condition is reflected by subtracting one from the diagonal;
- a generated known exact solution vector with all values equal to one;
- a matching right-hand-side vector; and
- an initial guess of all zeros.
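As a rough illustration of the construction just described, the sketch below assembles the 27-point stencil system for a single local nx x ny x nz subgrid in compressed sparse row form. The CsrMatrix type, the build_27pt_subgrid name and the particular values (26 on the diagonal, -1 off the diagonal) are assumptions of this sketch rather than the reference GenerateProblem routine; the MPI distribution and connections to neighboring subdomains are omitted.

#include <cstdint>
#include <vector>

struct CsrMatrix {
  std::vector<std::int64_t> row_ptr;  // size nrows + 1
  std::vector<std::int64_t> col_idx;  // column index of each nonzero
  std::vector<double> val;            // value of each nonzero
};

CsrMatrix build_27pt_subgrid(int nx, int ny, int nz) {
  CsrMatrix A;
  const std::int64_t nrows = static_cast<std::int64_t>(nx) * ny * nz;
  A.row_ptr.reserve(nrows + 1);
  A.row_ptr.push_back(0);
  auto id = [&](int gx, int gy, int gz) {           // lexicographic numbering of grid points
    return (static_cast<std::int64_t>(gz) * ny + gy) * nx + gx;
  };
  for (int iz = 0; iz < nz; ++iz)
    for (int iy = 0; iy < ny; ++iy)
      for (int ix = 0; ix < nx; ++ix) {
        const std::int64_t row = id(ix, iy, iz);
        // Visit the 3x3x3 neighborhood; neighbors falling outside the subgrid are dropped,
        // which is why boundary rows carry fewer nonzeros than the 27 of an interior row.
        for (int dz = -1; dz <= 1; ++dz)
          for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
              const int jx = ix + dx, jy = iy + dy, jz = iz + dz;
              if (jx < 0 || jx >= nx || jy < 0 || jy >= ny || jz < 0 || jz >= nz) continue;
              const std::int64_t col = id(jx, jy, jz);
              A.col_idx.push_back(col);
              // A diagonal of 26 against at most 26 off-diagonal entries of -1 gives the weak
              // diagonal dominance in the interior noted above; boundary rows are strongly dominant.
              A.val.push_back(col == row ? 26.0 : -1.0);
            }
        A.row_ptr.push_back(static_cast<std::int64_t>(A.col_idx.size()));
      }
  return A;
}

With the exact solution fixed at all ones, the matching right-hand side is simply the vector of row sums of this matrix, and the initial guess is the zero vector.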
The central purpose of defining this sparse linear system is to provide a rich vehicle for executing a collection of important computational kernels; however, the benchmark is not about computing a high-fidelity solution to this problem. In fact, iteration counts are fixed in the benchmark code and we do not expect convergence to the solution, regardless of problem size. We do use the spectral properties of both the problem and the preconditioned conjugate-gradient algorithm as part of software verification.

The conjugate-gradient method allows the code to maintain the orthogonality relationship with a short three-term recurrence formula. This, in turn, allows the linear system data to be scaled arbitrarily without worrying about excessive growth of the storage requirements for the orthogonal basis.

The regularity of the discretization grid of the model PDE gives plenty of opportunity to optimize the sparse data structure for efficient computation. There are known results on how to optimally partition and reorder the mesh points to achieve good load balance, small communication volume and good local performance. We feel that allowing such optimizations would violate the spirit of the benchmark and trivialize its results. Instead, we insist that knowledge of the regularity of the problem should not be taken into consideration when porting and optimizing the code for the user machine. The discretization should be treated as a generic mesh without any properties known a priori. In exchange, users may take advantage of the simplicity of the mesh to find problems with their optimizations, since many aspects of the optimal solution are known in closed form and can serve as a useful debugging tool.

In a similar fashion, we prohibit the use of knowledge of the problem when performing the conjugate-gradient iteration. However, we recognize that users may wish to use knowledge of the spectrum of the discretization matrix to estimate the accuracy of their optimized solver.

Algorithm 1: Preconditioned conjugate-gradient algorithm used by HPCG.

p_0 := x_0, r_0 := b - A*p_0
for i = 1, 2, ..., max_iterations do
    z_i := M^{-1} * r_{i-1}
    if i = 1 then
        p_i := z_i
        alpha_i := dot_prod(r_{i-1}, z_i)
    else
        alpha_i := dot_prod(r_{i-1}, z_i)
        beta_i := alpha_i / alpha_{i-1}
        p_i := beta_i * p_{i-1} + z_i
    end if
    alpha_i := dot_prod(r_{i-1}, z_i) / dot_prod(p_i, A*p_i)
    x_{i+1} := x_i + alpha_i * p_i
    r_i := r_{i-1} - alpha_i * A*p_i
    if ||r_i||_2 < tolerance then STOP
end for
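The loop structure of Algorithm 1 maps directly onto a handful of kernels. The sketch below is a minimal serial rendering, reusing the CsrMatrix type from the earlier sketch; the spmv, symgs, dot and pcg names are placeholders of this illustration rather than the reference HPCG kernels, and the distributed pieces (the halo exchange before the sparse matrix-vector product and the MPI_Allreduce inside each dot product) are omitted.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

using Vector = std::vector<double>;

// y = A*x (local sparse matrix-vector product; the halo exchange is omitted here).
void spmv(const CsrMatrix& A, const Vector& x, Vector& y) {
  for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i) {
    double sum = 0.0;
    for (std::int64_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
      sum += A.val[k] * x[A.col_idx[k]];
    y[i] = sum;
  }
}

// z = M^{-1} r: one forward and one backward Gauss-Seidel sweep (symmetric Gauss-Seidel).
void symgs(const CsrMatrix& A, const Vector& r, Vector& z) {
  const std::int64_t n = static_cast<std::int64_t>(z.size());
  auto sweep = [&](std::int64_t begin, std::int64_t end, std::int64_t step) {
    for (std::int64_t i = begin; i != end; i += step) {
      double sum = r[i], diag = 1.0;
      for (std::int64_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
        if (A.col_idx[k] == i) diag = A.val[k];
        else sum -= A.val[k] * z[A.col_idx[k]];
      }
      z[i] = sum / diag;
    }
  };
  sweep(0, n, 1);        // forward sweep
  sweep(n - 1, -1, -1);  // backward sweep makes the preconditioner symmetric
}

// Local dot product; a distributed run would MPI_Allreduce the partial sums.
double dot(const Vector& a, const Vector& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Preconditioned CG as in Algorithm 1: fixed iteration budget, early exit on the residual norm.
void pcg(const CsrMatrix& A, const Vector& b, Vector& x, int max_iterations, double tolerance) {
  const std::size_t n = x.size();
  Vector r(n), z(n, 0.0), p(n), Ap(n);
  spmv(A, x, Ap);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];       // r0 = b - A*x0
  double rz_old = 0.0;
  for (int it = 1; it <= max_iterations; ++it) {
    std::fill(z.begin(), z.end(), 0.0);
    symgs(A, r, z);                                              // z_i = M^{-1} r_{i-1}
    const double rz = dot(r, z);                                 // dot_prod(r_{i-1}, z_i)
    if (it == 1) {
      p = z;                                                     // p_1 = z_1
    } else {
      const double beta = rz / rz_old;                           // beta_i = alpha_i / alpha_{i-1}
      for (std::size_t i = 0; i < n; ++i) p[i] = beta * p[i] + z[i];
    }
    spmv(A, p, Ap);
    const double alpha = rz / dot(p, Ap);                        // alpha_i = (r,z) / (p, A*p)
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += alpha * p[i];                                      // x_{i+1} = x_i + alpha_i*p_i
      r[i] -= alpha * Ap[i];                                     // r_i = r_{i-1} - alpha_i*A*p_i
    }
    rz_old = rz;
    if (std::sqrt(dot(r, r)) < tolerance) break;                 // stop when ||r_i||_2 < tolerance
  }
}

Because the benchmark fixes the iteration counts and does not expect convergence, the early exit above exists mainly to mirror the STOP test in Algorithm 1.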

5 Elements of multigrid and coarse grid solve

The multigrid method is considered by many as being ideally suited for elliptic PDEs, but by varying the discretization it is possible to apply it successfully to a much larger class of linear and non-linear PDEs (Trottenberg et al., 2001: 2). As described so far, HPCG does not directly characterize all of the computational and communication patterns exhibited by multigrid solvers. Specifically, the dominant performance bottleneck at the coarse-grid levels is latency, rather than the bandwidth that dominates the message exchanges at the fine-grid levels and the dot products of the preconditioned conjugate-gradient method. For that reason, version 2.0 of HPCG introduced a multigrid component in the reference code to model the behavior of multi-level methods. The problem that we faced when introducing this new functionality was the potential of a substantial increase in the code complexity. To minimize the impact of the change, we reused the existing components and recast them in terms of commonly used parts of a typical multigrid solver. The smoother/solver for all of the levels of our simulated geometric multigrid is the Gauss–Seidel (locally) preconditioned conjugate-gradient solver. The mesh coarsening and refinement (restriction and prolongation) is done by halving the number of points in every dimension, and thus each coarse-grid level has eight times fewer points than the neighboring fine-grid level.

Just as in the case of the preconditioned conjugate gradient, our goal is only to provide basic components, rather than a complete multigrid solver. Consequently, we include neither the full V nor W cycles, and neither do we perform an accurate solve at the coarsest grid level. Instead, we limit the number of grid levels to three, resulting in a 256-fold reduction in the number of grid points, which is sufficient to address most of the bandwidth–latency bottlenecks and expose the performance of common algorithmic trade-offs. We also captured in this limited implementation the prevalent recursive patterns of code execution and the integer arithmetic required to capture some of the mesh manipulation.

6 Validation and verification components

HPCG detects and measures variances from bitwise-identical computations because it is widely believed that future computer systems will not be able to provide deterministic execution paths for floating-point computations. Because floating-point addition is not associative, we may not have bitwise-reproducible results, even when running the same exact computation twice on the same number of processors of the same system. This is in contrast with many of the MPI-only applications of today, and presents a big challenge to applications that must certify their computational results and conduct debugging in the presence of bitwise variability. HPCG makes the deviation from bitwise reproducibility apparent.

To detect anomalies during the iteration phases, HPCG computes preconditions, postconditions and invariants. These are likely to eliminate the majority of errors that might creep in when implementing an optimized version of the benchmark.

The computational kernels in HPCG may be optimized by the end user to take full advantage of the tested hardware. The reference code that we provide is focused on portability, which may often have negative effects on the performance on a specific system. To validate the user-provided kernels, HPCG includes a symmetry test for the sparse matrix multiply with the discretization matrix A, |x^T A y - y^T A x|, and for the symmetric Gauss–Seidel preconditioner M, |x^T M y - y^T M x|.

A spectral test is also included in HPCG to test for the fast convergence of the conjugate-gradient algorithm on a modified matrix A that is close to being diagonal. The theoretical framework underlying the conjugate-gradient solver (Saad, 2003) guarantees a short and fixed number of iterations for such matrices, and invalid optimizations attempted by the user should violate this property. The spectral test is meant to detect potential anomalies in the optimized implementation related to inaccurate calculations and convergence-rate changes due to user-defined matrix ordering.

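As an illustration of the symmetry checks described above, the sketch below evaluates |x^T A y - y^T A x| for the sparse matrix-vector product and the analogous quantity for the symmetric Gauss-Seidel preconditioner, reusing the CsrMatrix type and the spmv, symgs and dot helpers from the earlier sketches. The symmetry_checks name and the single tolerance are assumptions of this sketch, not the thresholds of the reference test driver.

#include <cmath>

bool symmetry_checks(const CsrMatrix& A, const Vector& x, const Vector& y, double tol) {
  const std::size_t n = x.size();
  Vector Ax(n), Ay(n), Mx(n, 0.0), My(n, 0.0);

  spmv(A, x, Ax);                                                 // A*x and A*y
  spmv(A, y, Ay);
  const double spmv_error = std::fabs(dot(x, Ay) - dot(y, Ax));   // |x'Ay - y'Ax|

  symgs(A, x, Mx);   // symgs applies the symmetric Gauss-Seidel preconditioner M^{-1}
  symgs(A, y, My);
  const double symgs_error = std::fabs(dot(x, My) - dot(y, Mx));  // preconditioner symmetry

  // Finite-precision arithmetic leaves roundoff-sized residues, hence a tolerance
  // rather than a test for exact zero.
  return spmv_error < tol && symgs_error < tol;
}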

7 Allowed and disallowed optimizations

Good performance results from an HPCG run may only be achieved after hardware-specific optimizations. Unfortunately, in the reference implementation of the benchmark, it is nearly impossible to keep up with hardware progress and include the optimizations required on contemporary supercomputing platforms. Instead, we aim for simplicity of the reference implementation and offer here a number of ideas for improving performance for user runs.

One of the performance-critical aspects of efficient sparse computations is the partitioning and ordering of the mesh points. By default, HPCG uses the lexicographical ordering; however, the user can change this in order to achieve more optimal results. For example, using red–black ordering is especially beneficial in the Gauss–Seidel preconditioner, which is inherently sequential without an appropriate renumbering of the elements. The numbering scheme established by the user before the iterations begin is carried throughout the timed computations in user-defined data structures and is passed to the computational kernels, which may then take advantage of the user ordering by providing a specialized kernel.
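The reordering idea above can be sketched as follows. For the 27-point stencil used here, the two-colour red-black split does not by itself make all same-coloured points independent (diagonally adjacent points share a colour), so this sketch shows the eight-colour generalization based on coordinate parities, under which points of the same colour are never stencil neighbours and a Gauss-Seidel sweep can update each colour block in parallel. The colour_ordering name and layout are assumptions of this sketch, not part of the reference code.

#include <cstdint>
#include <vector>

std::vector<std::int64_t> colour_ordering(int nx, int ny, int nz) {
  std::vector<std::int64_t> perm;   // perm[k] = old (lexicographic) index of the k-th point
  perm.reserve(static_cast<std::size_t>(nx) * ny * nz);
  for (int colour = 0; colour < 8; ++colour) {
    const int cx = colour & 1, cy = (colour >> 1) & 1, cz = (colour >> 2) & 1;
    // All points whose (ix, iy, iz) parities match this colour; they are mutually
    // non-adjacent under the 27-point stencil and can be smoothed concurrently.
    for (int iz = cz; iz < nz; iz += 2)
      for (int iy = cy; iy < ny; iy += 2)
        for (int ix = cx; ix < nx; ix += 2)
          perm.push_back((static_cast<std::int64_t>(iz) * ny + iy) * nx + ix);
  }
  return perm;
}

The resulting permutation would be applied to the matrix rows and columns and to all vectors before the timed phase, which is precisely the kind of user-established numbering scheme that the benchmark carries through its data structures.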
Another likely source of improved performance could be the use of system-specific communication infrastructure: both the hardware and the software that take full advantage of the communication network.

The reference implementation uses a small set of MPI functions that are very likely to be portable across a wide range of distributed-memory systems and, if optimized, will deliver a good portion of the optimal performance; however, custom implementations of HPCG are likely to contain a bigger variety of communication options. In MPI, there is a possibility of improving performance with various communication modes, such as one-sided, buffered, ready or persistent. It is also possible to use newer additions to the MPI standard, such as the neighborhood collectives (Hoefler et al., 2007), which have by now become much more prevalent and thus are likely to be optimized for the interconnect hardware by the system vendors or integrators. There also exists, of course, the possibility to go beyond the MPI standard and use lower-level application program interfaces such as the Common Communication Interface. This might not be an option for large code bases, but it would be feasible within the context of HPCG, where only a handful of communication scenarios are used. We do not envision at this point the need to use vendor-specific interfaces, and the reference implementation is restricted to a widely implemented subset of the MPI standard.

The commonly used optimization in sparse iterative methods is aimed at the matrix–vector multiplication (Byun et al., 2012; Im et al., 2004; Liu et al., 2013; Vuduc et al., 2005). As with other optimizations, we opt not to include hardware-specific code in the reference implementation; instead, we provide the user with a set of interfaces and data structures that allow them to easily include many of the existing implementations of these computational kernels.

In order to maintain the wide applicability of the HPCG results and optimizations, we explicitly prohibit the use of knowledge of either the sparsity pattern of the discretization matrix (this includes the symmetry of the discretization), its structure and connectivity pattern, or the dimensionality of the domain. In our view, this invites the use of generic methods for matrix partitioning and hardware-specific optimization of the computational kernels.

At a higher level of abstraction, knowledge of the spectral properties of the matrix could be used to artificially accelerate the conjugate-gradient iterations or to provide a nearly optimal preconditioner. This might strike some as an artificial constraint, because in practice it is always beneficial to take advantage of any numerical properties of the matrix; however, for an unknown problem structure and spectrum it is usually more costly to obtain this kind of information than to perform the conjugate-gradient iteration, barring any knowledge that can come from the domain that originated the PDE. In a similar vein, we do not allow the use of variants of the conjugate-gradient method that completely bypass the challenging aspects of the classic rendition of the algorithm. Some examples of this would be the reordered variants of the conjugate gradient (Chronopoulos and Gear, 1989; D'Azevedo et al., 1993; Dongarra and Eijkhout, 2003; Eijkhout, 1992; Meurant, 1987) or the pipelined conjugate gradient (Ghysels and Vanroose, 2012).

8 Performance results

Figure 1. One of the early performance results from running HPCG and HPL on up to 32 nodes or 512 MPI processes. Two versions are provided, with the semi-logarithmic scale (top) and logarithmic scale (bottom).

HPCG has already been run on a number of large-scale supercomputing installations in Europe, Japan and the USA. It is not feasible to list them all here, for lack of space and also due to the early nature of the results, as the community is still gaining experience in running the code. Instead, we present very preliminary results from a fairly small-scale deployment. These are presented in Figure 1 and should not be interpreted as official results for the tested system, but rather as a preliminary comparison between the results that can be expected from HPCG and what is commonly reported as a result for HPL. The figure clearly shows a number of trends. Firstly, HPL follows relatively closely the peak performance of the machine, a fact well known to benchmarking practitioners and most HPC experts. Secondly, HPCG exhibits performance levels that are far below the levels seen by HPL; again, this hardly comes as a surprise to anybody in the high-end and supercomputing fields, and may be attributed to many factors, with the most commonly cited one being the so-called 'memory wall'.
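A rough, illustrative estimate (not taken from the measurements reported here) shows why a large gap is expected. A sparse matrix-vector product with the 27-point stencil performs about two floating-point operations per stored nonzero but must read at least the 8-byte value and a 4- to 8-byte column index for it, plus vector data, so its arithmetic intensity is on the order of 0.1-0.2 flop per byte. A node with, say, 100 GB/s of memory bandwidth can therefore sustain only on the order of 10-20 Gflop/s in this kernel regardless of its floating-point peak, whereas the dense kernels of HPL reuse data in cache and can approach that peak.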

Finally, it is worth noting that, despite low absolute values, HPCG scales equally well when compared with HPL, which might be attributed to the custom interconnection of the tested system.

9 Future work

Future work will include a thorough validation testing of the HPCG benchmark against a suite of applications on current high-end systems, using techniques similar to those identified in the Mantevo project (Heroux et al., 2009). Furthermore, we plan to fully specify opportunities and restrictions on changes to the reference version of the code, to ensure that only changes that have relevance to our application base are permitted.

Acknowledgements

The authors thank the Department of Energy National Nuclear Security Agency for the funding provided for this work. We also thank Simon Hammond, Mahesh Rajan, Doug Doerfler and Christian Trott for their efforts to test early versions of HPCG and for giving valuable feedback.

Funding

This work was supported by the US Department of Energy (grant number 14-1589).

References

Bailey D, Barscz E, Barton J, et al. (1994) The NAS parallel benchmarks. Technical Report no. RNR-94-007, NASA Ames Research Center, USA.
Bailey D, Harris T, Saphir W, et al. (1995) The NAS parallel benchmarks 2.0. Technical Report no. NAS-95-020, NASA Ames Research Center, USA.
Byun JH, Lin R, Yelick KA, et al. (2012) Autotuning sparse matrix-vector multiplication for multicore. Technical Report no. UCB/EECS-2012-215, University of California, USA.
Chronopoulos A and Gear C (1989) s-Step iterative methods for symmetric linear systems. Journal of Computational and Applied Mathematics 25: 153–168.
D'Azevedo E, Eijkhout V and Romine C (1993) LAPACK working note 56: Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors. Technical Report no. CS-93-185, University of Tennessee, Knoxville, USA.
der Wijngaart RFV (2002) NAS parallel benchmarks version 2.4. Technical Report no. NAS-02-007, Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division, USA, October.
Dongarra J and Eijkhout V (2003) Finite-choice algorithm optimization in conjugate gradients. Technical Report no. 159, LAPACK Working Note, University of Tennessee, USA.
Dongarra J, Eijkhout V and van der Vorst H (2001) Iterative solver benchmark. Scientific Programming 9(4): 223–231.
Dongarra J and Heroux M (2013) Toward a new metric for ranking high performance computing systems. Technical Report no. SAND2013-4744, Sandia National Laboratories, USA.
Dongarra JJ, Luszczek P and Petitet A (2003) The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience 15(9): 803–820.
Eijkhout V (1992) LAPACK working note 51: Qualitative properties of the conjugate gradient and Lanczos methods in a matrix framework. Technical Report no. CS 92-170, University of Tennessee, USA.
ORNL Leadership Computing Facility (2013a) Annual Report 2012–2013. Available at: https://www.olcf.ornl.gov/wp-content/uploads/2014/03/2013_ARv2M.pdf (accessed 10 August 2015).
ORNL Leadership Computing Facility (2013b) Introducing Titan – the world's #1 open science supercomputer. Available at: http://www.olcf.ornl.gov/titan (accessed 29 May 2013).
Ghysels P and Vanroose W (2012) Hiding global synchronization latency in the preconditioned conjugate gradient algorithm. Technical Report no. 12.2012.1, Intel Labs Europe. Presented at PRECON13, 19–21 June 2013, Oxford, UK.
Heroux MA, Doerfler DW, Crozier PS, et al. (2009) Improving performance via mini-applications. Technical Report no. SAND2009-5574, Sandia National Laboratories, USA.
Hoefler T, Gottschling P, Lumsdaine A, et al. (2007) Optimizing a conjugate gradient solver with non-blocking collective operations. Parallel Computing 33(9): 624–633.
Im EJ, Yelick K and Vuduc R (2004) Sparsity: Optimization framework for sparse matrix kernels. International Journal of High Performance Computing Applications 18(1): 135–158.
Joubert W, Kothe D and Nam HA (2009) Preparing for exascale: ORNL leadership computing facility application requirements and strategy. Technical Report no. ORNL/TM-2009/308, Oak Ridge National Laboratory, USA, December.
Liu X, Smelyanskiy M, Chow E, et al. (2013) Efficient sparse matrix-vector multiplication on x86-based many-core processors. In: ICS'13, Eugene, OR, 10–14 June 2013.
Luszczek P and Dongarra J (2010) Analysis of various scalar, vector, and parallel implementations of RandomAccess. Technical Report no. ICL-UT-10-03, Innovative Computing Laboratory, USA.
Luszczek P, Dongarra J and Kepner J (2006) Design and implementation of the HPCC benchmark suite. CTWatch Quarterly 2(4): 18–23.
Mattheij RMM, Rienstra SW and ten Thije Boonkkamp JHM (2005) Partial Differential Equations: Modeling, Analysis, Computation. Philadelphia, PA: SIAM.
Meuer HW, Strohmaier E, Dongarra JJ, et al. (2013) TOP500 supercomputer sites, 42nd ed. Available at: http://www.netlib.org/benchmark/top500.html (accessed 10 August 2015).
Meurant G (1987) Multitasking the conjugate gradient method on the CRAY X-MP/48. Parallel Computing 5: 267–280.
Saad Y (2003) Iterative Methods for Sparse Linear Systems. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Smith BF, Bjørstad PE and Gropp WD (1996) Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge: Cambridge University Press.
Trottenberg U, Oosterlee CW and Schüller A (2001) Multigrid. London: Academic Press.
Vuduc R, Demmel J and Yelick K (2005) OSKI: A library of automatically tuned sparse matrix kernels. In: Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, 2005, pp. 521–530. Bristol, UK: IOPscience.


Authors' Biographies

Jack Dongarra holds appointments at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, the use of advanced computer architectures, programming methodology and tools for parallel computers. His contributions to the HPC field have received numerous recognitions, including the IEEE Sid Fernbach Award (2004), the first IEEE Medal of Excellence in Scalable Computing (2008), the first SIAM Special Interest Group on Supercomputing's award for Career Achievement (2010) and the IEEE IPDPS 2011 Charles Babbage Award. He is a fellow of the AAAS, ACM, IEEE and SIAM and a member of the National Academy of Engineering.

Michael A Heroux is a Distinguished Member of the Technical Staff at Sandia National Laboratories, working on new algorithm development and the robust parallel implementation of solver components for problems of interest to Sandia and the broader scientific and engineering community. He leads development of the Trilinos Project, an effort to provide state-of-the-art solution methods in a state-of-the-art software framework. Dr Heroux also works on the development of scalable parallel scientific and engineering applications and maintains his interest in the interaction of scientific/engineering applications and high-performance computer architectures. He leads the Mantevo project, which is focused on the development of open-source, portable mini-applications and mini-drivers for scientific and engineering applications. Dr Heroux is also the lead developer and architect of the HPCG benchmark, intended as an alternative ranking for the TOP500 computer systems.

Piotr Luszczek is a Research Director at the University of Tennessee. His research interests are in large-scale parallel algorithms, numerical analysis and high-performance computing. He has been involved in the development and maintenance of widely used software libraries for numerical linear algebra. In addition, he specializes in computer benchmarking of supercomputers using codes based on linear algebra, signal processing and PDE solvers.
