High Performance Computing: Models, Methods, & Means
Benchmarking
Prof. Thomas Sterling
Department of Computer Science, Louisiana State University
January 27, 2011
CSC 7600 Lecture 4: Benchmarking, Spring 2011

Topics
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Basic Performance Metrics
• Time related:
  – Execution time [seconds]: wall clock time; system and user time
  – Latency [seconds]
  – Response time [seconds]
• Rate related:
  – Rate of computation: floating point operations per second [flops], integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]: sustained performance / peak performance
  – Memory consumption [bytes]
  – Productivity [utility/($*second)]
• Performance measures:
  – Sustained performance
  – Peak performance
  – Benchmark sustained performance, e.g. HPL Rmax
(A short timing sketch after the "Purpose of Benchmarking" slide below shows how the time and rate metrics are obtained in practice.)

What Is a Benchmark?
Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster]
• The term "benchmark" also commonly applies to specially designed programs used in benchmarking
• A benchmark should:
  – be domain specific (the more general the benchmark, the less useful it is for anything in particular)
  – be a distillation of the essential attributes of a workload
  – avoid using a single metric to express the overall performance
• Kinds of computational benchmarks:
  – synthetic: specially created programs that impose a load on a specific component of the system
  – application: derived from a real-world application program

Purpose of Benchmarking
• Provide a tool enabling quantitative comparisons
  – comparison of variations within the same system
  – comparison of distinct systems
• Drive progress
  – enable better engineering by defining measurable and repeatable objectives
• Establish a performance agenda
  – measure release-to-release or version-to-version progress
  – set goals to meet
  – be understandable and useful also to people without expertise in the field (managers, etc.)
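As a concrete illustration of the time- and rate-related metrics defined earlier, the sketch below times a simple loop with a wall-clock timer and converts a known operation count into a sustained Mflops figure. This is a minimal, hypothetical example: the loop body, problem size, and 2-flops-per-iteration count are assumptions for illustration, not part of any standard benchmark.

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static double a[N], b[N], c[N];

    int main(void) {
        for (int i = 0; i < N; i++) { b[i] = 1.5; c[i] = 2.5; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);      /* wall-clock, not CPU, time */
        for (int i = 0; i < N; i++)
            a[i] = b[i] * c[i] + b[i];            /* 2 flops per iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double flops = 2.0 * N;                   /* known operation count */
        /* sustained rate = work done / wall-clock time */
        printf("time = %g s, rate = %g Mflops (a[0] = %g)\n",
               secs, flops / secs / 1e6, a[0]);
        return 0;
    }

Dividing the sustained rate obtained this way by the machine's theoretical peak gives the efficiency metric from the slide above.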
Properties of a Good Benchmark
• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architectures
• Coverage: does not over-constrain the typical environment (does not require any special conditions)
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation
Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD '97

Early Benchmarks
• Whetstone
  – floating point intensive
• Dhrystone
  – integer and character string oriented
• Livermore Fortran Kernels ("Livermore Loops")
  – collection of short kernels
• NAS kernel
  – 7 Fortran test kernels for aerospace computation
The sources of the benchmarks listed above are available from http://www.netlib.org/benchmark

Whetstone
• Originally written in Algol 60 in 1972 at the National Physical Laboratory (UK)
• Named after the Whetstone Algol translator-interpreter on the KDF9 computer
• Primarily measures floating point performance in WIPS: Whetstone Instructions Per Second
• Also raised the issue of the relative efficiency of different programming languages
• The original Algol code was translated to C and Fortran (with single and double precision support), PL/I, APL, Pascal, Basic, Simula and others

Dhrystone
• Synthetic benchmark developed in 1984 by Reinhold Weicker
• The name is a pun on "Whetstone"
• Measures integer and string operation performance, expressed in number of iterations (Dhrystones) per second
• Alternative unit: D-MIPS, normalized to the performance of a VAX 11/780
• Latest released version: 2.1, with implementations in C, Ada and Pascal
• Superseded by the SPECint suite
(Slide photo: Gordon Bell and the VAX 11/780)

Livermore Fortran Kernels (LFK)
• Developed at Lawrence Livermore National Laboratory in 1970; also known as the Livermore Loops
• Consists of 24 separate kernels:
  – hydrodynamic codes, Cholesky conjugate gradient, linear algebra, equation of state, integration, predictors, first sum and difference, particle in cell, Monte Carlo, linear recurrence, discrete ordinate transport, Planckian distribution and others
  – includes both careful and careless coding practices
• Produces 72 timing results, using 3 different DO-loop lengths for each of the 24 kernels
• Produces Megaflops values for each kernel, plus range statistics of the results
• Can be used as a performance test, a compiler accuracy test (checksums are stored in the code), or a hardware endurance test
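To give the flavor of these kernels, here is a short C rendition of an LFK-style measurement: one loop modeled on the hydro-fragment kernel, timed at three different loop lengths and reported in Mflops, as LFK does for each of its kernels. The specific loop lengths, repetition counts, and coefficients here are illustrative assumptions, not the actual LFK values.

    #include <stdio.h>
    #include <time.h>

    static double x[2048], y[2048], z[2048];

    int main(void) {
        const int lengths[3] = { 27, 101, 1001 };     /* assumed DO-loop lengths */
        const double q = 0.5, r = 1.5, t = 2.5;
        for (int i = 0; i < 2048; i++) { y[i] = 0.1 * i; z[i] = 0.2 * i; }

        for (int l = 0; l < 3; l++) {
            int n = lengths[l];
            long reps = 1000000L / n;                 /* keep total work comparable */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long rep = 0; rep < reps; rep++)
                for (int k = 0; k < n; k++)           /* hydro-fragment-style loop */
                    x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            /* 5 flops per inner iteration: 3 multiplies, 2 adds */
            printf("n = %4d: %.1f Mflops (x[0] = %g)\n",
                   n, 5.0 * n * reps / secs / 1e6, x[0]);
        }
        return 0;
    }

Running the same loop at several lengths exposes loop overhead and cache effects, which is why LFK reports 24 x 3 = 72 separate timings rather than a single number.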
NAS Kernel
• Developed at the Numerical Aerodynamic Simulation (NAS) Projects Office at NASA Ames
• Focuses on vector floating point performance
• Consists of 7 test kernels in Fortran (approx. 1000 lines of code):
  – matrix multiply
  – complex 2-D FFT
  – Cholesky decomposition
  – block tri-diagonal matrix solver
  – vortex method setup with Gaussian elimination
  – vortex creation with boundary conditions
  – parallel inverse of three matrix pentadiagonals
• Reports performance in Mflops (64-bit precision)

Topics
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Linpack Overview
• Introduced by Jack Dongarra in 1979
• Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and P. Stewart (now superseded by the LAPACK library)
• Solves a dense, regular system of linear equations, using matrices initialized with pseudo-random numbers
• Provides an estimate of a system's effective floating-point performance
• Does not reflect the overall performance of the machine!

Linpack Benchmark Variants
• Linpack Fortran (single processor)
  – N=100
  – N=1000, TPP (Toward Peak Performance), best effort
• Linpack's Highly Parallel Computing benchmark (HPL)
• Java Linpack

Fortran Linpack (I): the N=100 case
• Provides the results listed in Table 1 of the "Linpack Benchmark Report"
• Absolutely no changes to the code may be made (not even in comments!)
• The matrix generated by the program must be used to run this case
• An external timing function (SECOND) has to be supplied
• Only compiler-induced optimizations are allowed
• Measures the performance of two routines:
  – DGEFA: LU decomposition with partial pivoting
  – DGESL: solves a system of linear equations using the result from DGEFA
• Complexity: O(n³) for DGEFA, O(n²) for DGESL

Fortran Linpack (II): the N=1000 case, Toward Peak Performance (TPP), best effort
• Provides the results listed in Table 1 of the "Linpack Benchmark Report"
• The user can choose any linear equation solver
• Allows a complete replacement of the factorization/solver code by the user
• No restriction on the implementation language of the solver
• The solution must conform to the prescribed accuracy, and the matrix used must be the same as the matrix used by the netlib driver
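In all variants, the reported Mflops rating is computed from the nominal LU operation count of (2/3)n³ + 2n² floating-point operations divided by the measured solution time, regardless of how the solver is actually implemented. A small helper showing the arithmetic (the sample n and time are made-up values):

    #include <stdio.h>

    /* Linpack rating: nominal operation count (2/3)n^3 + 2n^2,
       divided by elapsed wall-clock seconds */
    static double linpack_mflops(int n, double seconds) {
        double ops = (2.0 / 3.0) * n * n * n + 2.0 * (double)n * n;
        return ops / seconds / 1e6;
    }

    int main(void) {
        /* made-up example: an N=1000 system solved in 0.1 s */
        printf("%.1f Mflops\n", linpack_mflops(1000, 0.1));   /* ~6686.7 */
        return 0;
    }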
Linpack Fortran Performance on Different Platforms

Computer | N=100 [Mflops] | N=1000, TPP [Mflops] | Theoretical peak [Mflops]
Intel Pentium Woodcrest (1 core, 3 GHz) | 3018 | 6542 | 12000
NEC SX-8/8 (8 proc., 2 GHz) | - | 75140 | 128000
NEC SX-8/8 (1 proc., 2 GHz) | 2177 | 14960 | 16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon) | - | 8185 | 14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon) | 1852 | 4851 | 7400
IBM eServer p5-575 (8 POWER5 proc., 1.9 GHz) | - | 34570 | 60800
IBM eServer p5-575 (1 POWER5 proc., 1.9 GHz) | 1776 | 5872 | 7600
SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz) | 1765 | 5953 | 6400
HP ProLiant BL45p (4 cores AMD Opteron 854, 2.8 GHz) | - | 12860 | 22400
HP ProLiant BL45p (1 core AMD Opteron 854, 2.8 GHz) | 1717 | 4191 | 5600
Fujitsu VPP5000/1 (1 proc., 3.33 ns) | 1156 | 8784 | 9600
Cray T932 (32 proc., 2.2 ns) | 1129 (1 proc.) | 29360 | 57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3 GHz) | - | 14260 | 20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3 GHz) | 1122 | 2132 | 2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000 MHz) | - | 14150 | 32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000 MHz) | 843 | 2905 | 4000

Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps

Fortran Linpack Demo

> ./linpack
Please send the results of this run to:
Jack J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, Tennessee 37996-1300
Fax: 865-974-8296

(The slide shows the program's output with callouts labeling its fields: the total time spent in dgefa+dgesl; the time spent in the solver, dgesl; the first element of the right hand side vector; the now-obsolete "timing" unit; the sustained performance; and a "Fraction" callout that is cut off in the source.)
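The demo output fields above map directly onto the structure of the driver: generate the system, time the factorization (dgefa), time the solve (dgesl), and report the totals and the Mflops rating. The following self-contained C sketch mimics that flow; it is a simplified stand-in for the Fortran driver (which, for the official N=100 run, must be used unmodified), not the benchmark itself.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 100

    static double now(void) {                 /* wall-clock timer, like SECOND */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        static double a[N][N], b[N];
        int piv[N];
        srand(1325);                          /* fixed seed: repeatable matrix */
        for (int i = 0; i < N; i++) {
            b[i] = 0.0;
            for (int j = 0; j < N; j++) {
                a[i][j] = (double)rand() / RAND_MAX - 0.5;
                b[i] += a[i][j];              /* right hand side: exact solution is all ones */
            }
        }

        double t0 = now();
        /* dgefa-like: LU factorization with partial pivoting */
        for (int k = 0; k < N - 1; k++) {
            int p = k;
            for (int i = k + 1; i < N; i++)
                if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
            piv[k] = p;
            for (int j = k; j < N; j++) {
                double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
            }
            for (int i = k + 1; i < N; i++) {
                a[i][k] /= a[k][k];           /* store multiplier */
                for (int j = k + 1; j < N; j++)
                    a[i][j] -= a[i][k] * a[k][j];
            }
        }
        double t1 = now();
        /* dgesl-like: apply pivots and forward-eliminate, then back-substitute */
        for (int k = 0; k < N - 1; k++) {
            double t = b[piv[k]]; b[piv[k]] = b[k]; b[k] = t;
            for (int i = k + 1; i < N; i++) b[i] -= a[i][k] * b[k];
        }
        for (int k = N - 1; k >= 0; k--) {
            b[k] /= a[k][k];
            for (int i = 0; i < k; i++) b[i] -= a[i][k] * b[k];
        }
        double t2 = now();

        double ops = (2.0 / 3.0) * N * N * N + 2.0 * (double)N * N;
        printf("dgefa: %.6f s  dgesl: %.6f s  total: %.6f s\n",
               t1 - t0, t2 - t1, t2 - t0);
        printf("x(1) = %.15f (should be ~1)  rating: %.1f Mflops\n",
               b[0], ops / (t2 - t0) / 1e6);
        return 0;
    }

A single N=100 solve completes in a very short interval on modern hardware, so a real measurement would repeat the run and average rather than rely on one pass as this sketch does.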