High Performance Computing: Concepts, Methods & Means
Performance I: Benchmarking

Prof. Thomas Sterling
Department of Computer Science, Louisiana State University
January 23rd, 2007

Topics (slide 2)

• Definitions, properties and applications
• Early benchmarks
• Everything you ever wanted to know about Linpack (but were afraid to ask)
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Basic Performance Metrics (slide 4)

• Time related:
  – Execution time [seconds]
    • wall-clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]
  – Memory consumption [bytes]
  – Productivity [utility/($*second)]
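The time and rate metrics above can be made concrete with a short sketch (plain Python, not from the lecture; the two-flop-per-iteration count is an assumption made purely for illustration):

```python
import time

def flop_rate(nflop, seconds):
    """Rate metric: floating point operations per second (flop/s)."""
    return nflop / seconds

# Wall-clock time covers everything that elapses (I/O, other processes);
# process_time() counts only CPU time charged to this process.
n = 200_000
t_wall = time.perf_counter()
t_cpu = time.process_time()
acc = 0.0
for i in range(1, n + 1):
    acc += 1.0 / i            # one divide and one add: ~2 flop per iteration
wall = time.perf_counter() - t_wall
cpu = time.process_time() - t_cpu

# A *sustained* rate for this kernel, in Mflops:
mflops = flop_rate(2 * n, wall) / 1e6
```

A theoretical peak, by contrast, would be computed from clock rate and functional units, not measured.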

• Modifiers:
  – Sustained
  – Peak
  – Theoretical peak

What Is a Benchmark? (slide 5)

Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster]
• The term "benchmark" is also commonly applied to the specially designed programs used in benchmarking
• A benchmark should:
  – be domain specific (the more general the benchmark, the less useful it is for anything in particular)
  – be a distillation of the essential attributes of a workload
  – avoid using a single metric to express the overall performance
• Kinds of computational benchmarks:
  – synthetic: specially created programs that impose a load on a specific component of the system
  – application: derived from a real-world application program

Purpose of Benchmarking (slide 6)

• To define the playing field
• To provide a tool enabling quantitative comparisons
• Acceleration of progress
  – enable better engineering by defining measurable and repeatable objectives
• Establishing a performance agenda
  – measure release-to-release or version-to-version progress
  – set goals to meet
  – be understandable and useful also to people without expertise in the field (managers, etc.)

Properties of a Good Benchmark (slide 7)

• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architectures
• Coverage: does not overconstrain the typical environment
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation
• Limited lifetime: there is a point when additional code modifications or optimizations become counterproductive

Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD '97

Early Benchmarks (slide 9)

• Whetstone
  – Floating point intensive
• Dhrystone
  – Integer and character string oriented
• Livermore Kernels
  – "Livermore Loops"
  – Collection of short kernels
• NAS kernel
  – 7 Fortran test kernels for aerospace computation
The sources of the benchmarks listed above are available from: http://www.netlib.org/benchmark

Whetstone (slide 10)

• Originally written in Algol 60 in 1972 at the National Physical Laboratory (UK)
• Named after the Whetstone Algol translator-interpreter on the KDF9 computer
• Measures primarily floating point performance in WIPS: Whetstone Instructions Per Second
• Also raised the issue of the efficiency of different programming languages
• The original Algol code was translated to C and Fortran (single and double precision support), PL/I, APL, Pascal, Basic, Simula and others

Dhrystone (slide 11)

• Synthetic benchmark developed in 1984 by Reinhold Weicker
• The name is a pun on "Whetstone"
• Measures integer and string operation performance, expressed in number of iterations, or Dhrystones, per second
• Alternative unit: D-MIPS, normalized to VAX 11/780 performance
• Latest version released: 2.1; includes implementations in C, Ada and Pascal
• Superseded by the SPECint suite
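The D-MIPS normalization can be written out explicitly; a sketch (the VAX 11/780 reference score of 1757 Dhrystones per second is the commonly cited constant):

```python
# The VAX 11/780 was marketed as a 1-MIPS machine and scores about
# 1757 Dhrystones per second, so that figure serves as the divisor.
VAX_11_780_DHRYSTONES_PER_SEC = 1757

def dmips(dhrystones_per_sec):
    """Convert a raw Dhrystone score into VAX-normalized D-MIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC
```

A machine scoring 17,570 Dhrystones/s would thus be quoted as roughly 10 D-MIPS.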

Gordon Bell and VAX 11/780

Livermore Fortran Kernels (LFK) (slide 12)

• Developed at Lawrence Livermore National Laboratory in 1970
  – also known as Livermore Loops
• Consists of 24 separate kernels:
  – hydrodynamic codes, Cholesky conjugate gradient, linear algebra, equation of state, integration, predictors, first sum and difference, particle in cell, Monte Carlo, linear recurrence, discrete ordinate transport, Planckian distribution and others
  – include both careful and careless coding practices
• Produces 72 timing results using 3 different DO-loop lengths for each kernel
• Produces Megaflops values for each kernel and range statistics of the results
• Can be used as a performance, compiler accuracy (checksums stored in code) or hardware endurance test

NAS Kernel (slide 13)

• Developed at the Numerical Aerodynamic Simulation Projects Office at NASA Ames
• Focuses on vector floating point performance
• Consists of 7 test kernels in Fortran (approx. 1000 lines of code):
  – matrix multiply
  – complex 2-D FFT
  – Cholesky decomposition
  – block tridiagonal matrix solver
  – vortex method setup with Gaussian elimination
  – vortex creation with boundary conditions
  – parallel inverse of three matrix pentadiagonals
• Reports performance in Mflops (64-bit precision)


Linpack Overview (slide 15)

• Introduced by Jack Dongarra in 1979
• Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and P. Stewart (now superseded by the LAPACK library)
• Solves a dense, regular system of linear equations, using matrices initialized with pseudorandom numbers
• Provides an estimate of the system's effective floating point performance
• Does not reflect the overall performance of the machine!

Linpack Benchmark Variants (slide 16)

• Linpack Fortran (single processor)
  – N=100
  – N=1000, TPP, best effort
• Linpack's Highly Parallel Computing benchmark (HPL)
• Java Linpack
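All of these variants time the solution of a dense linear system via LU factorization. A toy Python sketch of the factorize/solve pair, in the spirit of the DGEFA/DGESL routines described on the next slides (the real benchmarks use the Fortran LINPACK code, not anything like this):

```python
def lu_solve(a, b):
    """Gaussian elimination (LU) with partial pivoting, O(n^3),
    followed by forward/back substitution, O(n^2).
    Modifies a and b in place; returns the solution vector x."""
    n = len(a)
    for k in range(n):
        # pivot: bring the largest remaining entry in column k to the top
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            a[i][k] = m
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    for k in range(n):                     # forward substitution (Ly = b)
        for i in range(k + 1, n):
            b[i] -= a[i][k] * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):         # back substitution (Ux = y)
        x[k] = (b[k] - sum(a[k][j] * x[j] for j in range(k + 1, n))) / a[k][k]
    return x

# Benchmark drivers commonly build b = A * ones, so the exact solution
# is the all-ones vector and the answer is trivial to verify:
a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [sum(row) for row in a]
x = lu_solve([row[:] for row in a], b[:])
```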

Fortran Linpack (I) (slide 17)

N=100 case
• Provides the results listed in Table 1 of the "Linpack Benchmark Report"
• Absolutely no changes to the code can be made (not even in comments!)
• The matrix generated by the program must be used to run this case
• An external timing function (SECOND) has to be supplied
• Only compiler-induced optimizations allowed
• Measures the performance of two routines:
  – DGEFA: LU decomposition with partial pivoting
  – DGESL: solves a system of linear equations using the result from DGEFA
• Complexity: O(n^2) for DGESL, O(n^3) for DGEFA

Fortran Linpack (II) (slide 18)

N=1000 case, Toward Peak Performance (TPP), Best Effort
• Provides the results listed in Table 1 of the "Linpack Benchmark Report"
• The user can choose any method of solving the linear equations
• Allows a complete replacement of the factorization/solver code by the user
• No restriction on the implementation language for the solver
• The solution must conform to the prescribed accuracy, and the matrix used must be the same as the matrix used by the netlib driver

Linpack Fortran Performance on Different Platforms (slide 19)

Computer                                               | N=100 [MFlops] | N=1000, TPP [MFlops] | Theoretical Peak [MFlops]
Intel Pentium Woodcrest (1 core, 3 GHz)                | 3018           | 6542                 | 12000
NEC SX-8/8 (8 proc., 2 GHz)                            | -              | 75140                | 128000
NEC SX-8/8 (1 proc., 2 GHz)                            | 2177           | 14960                | 16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon)     | -              | 8185                 | 14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon)      | 1852           | 4851                 | 7400
IBM eServer p5 575 (8 POWER5 proc., 1.9 GHz)           | -              | 34570                | 60800
IBM eServer p5 575 (1 POWER5 proc., 1.9 GHz)           | 1776           | 5872                 | 7600
SGI Altix 3700 Bx2 (1 Itanium 2 proc., 1.6 GHz)        | 1765           | 5953                 | 6400
HP ProLiant BL45p (4 cores AMD Opteron 854, 2.8 GHz)   | -              | 12860                | 22400
HP ProLiant BL45p (1 core AMD Opteron 854, 2.8 GHz)    | 1717           | 4191                 | 5600
Fujitsu VPP5000/1 (1 proc., 3.33 ns)                   | 1156           | 8784                 | 9600
Cray T932 (32 proc., 2.2 ns)                           | 1129 (1 proc.) | 29360                | 57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3 GHz)  | -              | 14260                | 20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3 GHz)  | 1122           | 2132                 | 2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000 MHz)          | -              | 14150                | 32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000 MHz)          | 843            | 2905                 | 4000

Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps

Fortran Linpack Demo (slide 20)

> ./
Please send the results of this run to:

Jack J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: [email protected]

This is version 29.5.04.

     norm. resid      resid           machep         x(1)            x(n)
  1.25501937E+00  1.39332990E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

times are reported for matrices of order 100
      dgefa      dgesl      total     mflops       unit      ratio       b(1)
times for array with leading dimension of 201
  4.890E-04  2.003E-05  5.090E-04  1.349E+03  1.483E-03  9.090E-03  -9.159E-15
  4.860E-04  1.895E-05  5.050E-04  1.360E+03  1.471E-03  9.017E-03   1.000E+00
  4.850E-04  2.003E-05  5.050E-04  1.360E+03  1.471E-03  9.018E-03   1.000E+00
  4.856E-04  1.730E-05  5.029E-04  1.365E+03  1.465E-03  8.981E-03   5.298E+02
times for array with leading dimension of 200
  4.210E-04  1.800E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03   1.000E+00
  4.200E-04  1.901E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03   1.000E+00
  4.200E-04  1.699E-05  4.370E-04  1.571E+03  1.273E-03  7.804E-03   1.000E+00
  4.288E-04  1.640E-05  4.452E-04  1.542E+03  1.297E-03  7.950E-03   5.298E+02
end of tests -- this version dated 05/29/04

Annotations from the slide: "dgefa" is the time spent in the matrix factorization routine and "dgesl" the time spent in the solver; "total" is the total time (dgefa + dgesl); "mflops" is the sustained floating point rate; the "unit" ("timing" unit) and "ratio" (fraction of Cray-1S execution time) columns are obsolete; b(1) is the first element of the right-hand-side vector; two different leading dimensions (201 and 200) are used to test the effect of array placement in memory.
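The norm. resid figure in the output above is a scaled residual used as the correctness check; a sketch of that kind of computation (the exact scaling used by the netlib driver may differ in detail):

```python
def scaled_residual(a, x, b, eps=2.220446049250313e-16):
    """||Ax - b||_inf / (n * ||A||_inf * ||x||_inf * eps): a value of
    order one means the answer is as accurate as the floating point
    arithmetic (machine epsilon eps) allows."""
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return max(abs(v) for v in r) / (n * norm_a * norm_x * eps)
```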

Reference: http://www.netlib.org/utk/people/JackDongarra/faqlinpack.html

Linpack's Highly Parallel Computing Benchmark (HPL) (slide 21)

• Measures the performance of distributed memory machines
• Used in the "Linpack Benchmark Report" (Table 3) and to determine the order of machines on the Top500 list
• The portable version is written in C
• External dependencies:
  – MPI 1.1 functionality for internode communication
  – BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx + y) and inner dot product (DDOT: a = Σ x_i y_i)
• Ground rules:
  – allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound)
  – same matrix as in the driver program
  – no restrictions on problem size

HPL Algorithm (slide 22)

• Data distribution: 2-D block-cyclic
• Algorithm elements:
  – right-looking variant of LU factorization with row partial pivoting, featuring multiple look-ahead depths
  – recursive panel factorization with pivot search and column broadcast combined
  – various virtual panel broadcast topologies
  – bandwidth-reducing swap-broadcast algorithm
  – backward substitution with look-ahead depth of one
• Floating point operation count: 2/3·n^3 + 2·n^2
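Two of the ingredients above can be sketched in a few lines of illustrative Python (not HPL's actual C code): the BLAS-level kernels HPL depends on (DAXPY, DDOT) and the 2-D block-cyclic assignment of matrix blocks to the process grid:

```python
def daxpy(alpha, x, y):
    """Scaled vector addition, y <- alpha*x + y (BLAS DAXPY)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """Inner product, a = sum_i x_i * y_i (BLAS DDOT)."""
    return sum(xi * yi for xi, yi in zip(x, y))

def block_cyclic_owner(i, j, nb, p, q):
    """Which process in a P x Q grid owns global matrix entry (i, j)
    under a 2-D block-cyclic distribution with block size NB: NB x NB
    blocks are dealt round-robin along each grid dimension, which keeps
    the load balanced as the factorization shrinks the active
    submatrix."""
    return ((i // nb) % p, (j // nb) % q)
```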

HPL Algorithm Elements (slide 23)

Execution flow for a single parameter set (the matrix is distributed over a P×Q grid of processors; the right-looking variant of LU factorization is used, and in each iteration of the loop a panel of NB columns is factorized and the trailing submatrix is updated):

  Matrix Generation
  -> Panel Factorization
  -> Panel Broadcast
  -> Look-ahead
  -> Update
  -> all columns of A processed? if not (N), loop back to Panel Factorization
  -> if so (Y), Backward Substitution
  -> Solution Check

Six broadcast algorithms are available.
Reference: http://www.netlib.org/benchmark/hpl/algorithm.html

HPL Linpack Metrics (slide 24)

• The HPL implementation of the benchmark is run for different problem sizes N on the entire machine
• For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax
• Another metric obtainable from the benchmark is N1/2, the problem size at which half of the maximum performance (Rmax/2) is achieved
• The Rmax value is used to rank supercomputers on the Top500 list; listed along with this number are the theoretical peak double precision floating point performance Rpeak of the machine and N1/2
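Given a sweep of measured (N, performance) pairs, Rmax and a crude estimate of N1/2 fall out directly; a sketch (real reports would interpolate N1/2 from a denser sweep):

```python
def rmax_nhalf(samples):
    """samples: list of (N, rate) pairs from runs at growing problem
    sizes. Returns (Nmax, Rmax) plus the smallest measured N whose rate
    reaches at least Rmax/2, a rough stand-in for N_1/2."""
    nmax, rmax = max(samples, key=lambda s: s[1])
    n_half = min(n for n, r in samples if r >= rmax / 2)
    return nmax, rmax, n_half
```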

Machine Parameters Influencing Linpack Performance (slide 25)

Parameter                  | Linpack Fortran, N=100 | Linpack Fortran, N=1000 (TPP) | HPL
Processor speed            | Yes                    | Yes                           | Yes
Memory capacity            | No                     | No (modern system)            | Yes (for Rmax)
Network latency/bandwidth  | No                     | No                            | Yes
Compiler flags             | Yes                    | Yes                           | Yes

Ten Fastest Supercomputers on Current Top500 List (slide 26)

#  | Computer          | Site                                                | Processors | Rmax    | Rpeak
1  | IBM BlueGene/L    | DoE/NNSA/LLNL (USA)                                 | 131,072    | 280,600 | 367,000
2  | Cray Red Storm    | Sandia (USA)                                        | 26,544     | 101,400 | 127,411
3  | IBM BGW           | IBM T. Watson Research Center (USA)                 | 40,960     | 91,290  | 114,688
4  | IBM ASC Purple    | DoE/NNSA/LLNL (USA)                                 | 12,208     | 75,760  | 92,781
5  | IBM MareNostrum   | Barcelona Supercomputing Center (Spain)             | 10,240     | 62,630  | 94,208
6  | Dell Thunderbird  | NNSA/Sandia (USA)                                   | 9,024      | 53,000  | 64,973
7  | Bull Tera-10      | Commissariat à l'Energie Atomique (France)          | 9,968      | 52,840  | 63,795
8  | SGI Columbia      | NASA/Ames Research Center (USA)                     | 10,160     | 51,870  | 60,960
9  | NEC/Sun Tsubame   | GSIC Center, Tokyo Institute of Technology (Japan)  | 11,088     | 47,380  | 82,125
10 | Cray Jaguar       | Oak Ridge National Laboratory (USA)                 | 10,424     | 43,480  | 54,205

Source: http://www.top500.org/list/2006/11/100

Java Linpack (slide 27)

• Intended mostly to measure the efficiency of the Java implementation rather than hardware floating point performance
• Solves a dense 500x500 system of linear equations with one right-hand side, Ax = b
• Matrix A is generated randomly
• Vector b is constructed so that all components of the solution x are one
• Uses Gaussian elimination with partial pivoting
• Reports: Mflops, time to solution, NormRes (solution accuracy), relative machine precision

HPL Demo (slide 28)

> mpirun -np 4 xhpl
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK

An explanation of the input/output parameters follows:
T/V:    Wall time / encoded variant.
N:      The order of the coefficient matrix A.
NB:     The partitioning blocking factor.
P:      The number of process rows.
Q:      The number of process columns.
Time:   Time in seconds to solve the linear system.
Gflops: Rate of execution for solving the linear system.

The following parameter values will be used:
N: 5000   NB: 32   PMAP: Row-major process mapping
P: 2 1 4   Q: 2 4 1
PFACT: Left   NBMIN: 2   NDIV: 2   RFACT: Left
BCAST: 1ringM   DEPTH: 0   SWAP: Mix (threshold = 64)
L1: transposed form   U: transposed form   EQUIL: yes
ALIGN: 8 double precision words

The matrix A is randomly generated for each test.
The following scaled residual checks will be computed:
1) ||Ax-b||_oo / (eps * ||A||_1 * N)
2) ||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)
3) ||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo)
The relative machine precision (eps) is taken to be 1.110223e-16
Computational tests pass if scaled residuals are less than 16.0

T/V         N     NB   P   Q   Time   Gflops
WR01L2L2    5000  32   2   2   7.14   1.168e+01
||Ax-b||_oo / (eps * ||A||_1 * N)         = 0.0400275 ...... PASSED
||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)   = 0.0264242 ...... PASSED
||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo) = 0.0051580 ...... PASSED

T/V         N     NB   P   Q   Time   Gflops
WR01L2L2    5000  32   1   4   7.00   1.192e+01
||Ax-b||_oo / (eps * ||A||_1 * N)         = 0.0335428 ...... PASSED
||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)   = 0.0221433 ...... PASSED
||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo) = 0.0043224 ...... PASSED

T/V         N     NB   P   Q   Time   Gflops
WR01L2L2    5000  32   4   1   7.00   1.191e+01
||Ax-b||_oo / (eps * ||A||_1 * N)         = 0.0426255 ...... PASSED
||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)   = 0.0281393 ...... PASSED
||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo) = 0.0054928 ...... PASSED

Finished 3 tests with the following results:
  3 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
End of Tests.

For configuration issues, consult: http://www.netlib.org/benchmark/hpl/faqs.html

Other Parallel Benchmarks (slide 30)

• High Performance Computing Challenge (HPCC) benchmarks
  – Devised and sponsored to enrich the benchmarking parameter set
• NAS Parallel Benchmarks (NPB)
  – Powerful set of metrics
  – Reflects computational fluid dynamics
• NPB IO-MPI
  – Stresses the external I/O system

HPC Challenge Benchmark (slide 31)

Consists of 7 individual tests:
• HPL (Linpack TPP): floating point rate of execution of a solver of a linear system of equations
• DGEMM: floating point rate of execution of double precision matrix-matrix multiplication
• STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for simple vector kernels
• PTRANS (parallel matrix transpose): total capacity of the network using pairwise communicating processes
• RandomAccess: the rate of integer random updates of memory (in GUPS: Giga Updates Per Second)
• FFT: floating point rate of execution of double precision complex 1-D Discrete Fourier Transform
• b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns
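The STREAM entry can be illustrated with its triad kernel; a Python sketch of the metric only (an interpreted loop grossly understates real memory bandwidth, which is exactly why the tuned C benchmark is used instead):

```python
import time

def stream_triad(n=500_000, alpha=3.0):
    """STREAM-style triad, a[i] = b[i] + alpha*c[i]. The bandwidth
    figure counts the three arrays of 8-byte doubles touched per
    element (two reads plus one write), reported in GB/s."""
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    a = [bi + alpha * ci for bi, ci in zip(b, c)]
    dt = time.perf_counter() - t0
    return a, 3 * 8 * n / dt / 1e9
```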

Comparison of HPCC Results on Selected Supercomputers (slide 32)

[Bar chart: percentage of the maximum value attained by each system on each HPCC metric. Systems compared: "Red Storm" Cray XT3, Sandia (Opteron/Cray custom 3-D mesh); IBM p5 575, LLNL (POWER5/IBM HPS); IBM BlueGene/L, NNSA (PowerPC 440/IBM custom 3-D torus & tree); Cray X1E, ORNL (X1E/Cray modified 2-D torus); HP XC, Government (Itanium 2/Quadrics Elan4); "Columbia" SGI, NASA (Itanium 2/SGI NUMALINK); NEC SX-8, HLRS (SX-8/IXS crossbar); "Emerald" Rackable Systems, AMD (Opteron/Silverstorm Infiniband). Metrics and maximum values: G-HPL (max = 91 Tflops), G-PTRANS (max = 4666 GB/s), G-RandomAccess (max = 7.69 GUP/s), G-FFTE (max = 1763 Gflops), EP-STREAM system (max = 62890 GB/s), EP-DGEMM system (max = 161885 Gflops), RandomRing Bandwidth (max = 0.829 GB/s), RandomRing Latency (max = 118.6 µs).]

Notes:
• all metrics shown are "higher is better", except for the RandomRing Latency
• machine labels include: machine name (optional), manufacturer and system name, affiliation and (in parentheses) processor/network fabric type

NAS Parallel Benchmarks (slide 33)

• Derived from computational fluid dynamics (CFD) applications
• Consists of five kernels and three pseudo-applications
• Exists in several flavors:
  – NPB 1: original paper-and-pencil specification
    • generally proprietary implementations by hardware vendors
  – NPB 2: MPI-based sources distributed by NAS
    • supplements NPB 1
    • can be run with little or no tuning
  – NPB 3: implementations in OpenMP, HPF and Java
    • derived from the NPB serial version with improved serial code
    • a set of multi-zone benchmarks was added
    • tests the implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP with MPI)
  – GridNPB 3: new suite of benchmarks, designed to rate the performance of computational grids
    • includes only four benchmarks, derived from the original NPB
    • written in Fortran and Java
    • Globus as grid middleware

NPB 2 Overview (slide 34)

• Multiple problem classes (S, W, A, B, C, D)
• Tests written mainly in Fortran (IS in C):
  – BT (block tridiagonal solver with 5x5 block size)
  – CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix)
  – EP ("embarrassingly parallel"; evaluates an integral by means of pseudorandom trials)
  – FT (3-D PDE solver using Fast Fourier Transforms)
  – IS (large integer sort; tests both integer computation speed and network performance)
  – LU (a regular-sparse, 5x5 block lower and upper triangular system solver)
  – MG (simplified multigrid kernel; tests both short and long distance data communication)
  – SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations)
• Sources and reports available from: http://www.nas.nasa.gov/Resources/Software/npb.html
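The EP pattern is easy to picture: independent pseudorandom trials with a single reduction at the end. A toy Monte Carlo integration in that spirit (estimating pi, which is not EP's actual integrand):

```python
import random

def ep_style_pi(trials=200_000, seed=12345):
    """'Embarrassingly parallel' estimate of pi from pseudorandom
    trials: count points falling inside the unit quarter circle.
    Every trial is independent, so the loop splits across processors
    with no communication until the final sum."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / trials
```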

NPB IO-MPI (slide 35)

• Attempts to address the lack of I/O tests in NPB, focusing primarily on file output
• Based on the BTIO effort, which extended the BT benchmark with routines writing to storage five double precision numbers for every mesh point
  – runs for 200 iterations, writing every five iterations
  – after all time steps are finished, all data belonging to a single time step must be stored in the same file, sorted by vector components
  – timing must include all required data rearrangements to achieve the specified data layout
• Supported access scenarios:
  – simple: MPI-IO without collective buffering
  – full: MPI-IO with collective buffering
  – fortran: Fortran 77 file operations
  – epio: each process writes its part of the computational domain continuously to a separate file
• Number of processes must be a square
• Problem sizes: class A (64^3), class B (102^3), class C (162^3)
• Several possible results, depending on the benchmarking goal: effective flops, effective output bandwidth or output overhead
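The last bullet's derived results can be expressed directly; a sketch with hypothetical numbers (the benchmark's precise definitions live in the NPB-IO specification):

```python
def io_results(bytes_written, total_time, io_time):
    """Effective output bandwidth (MB/s) and output overhead, the
    fraction of the whole run spent producing the output files."""
    return bytes_written / io_time / 1e6, io_time / total_time
```

For example, 200 MB written during 2 s of I/O within a 10 s run gives 100 MB/s of effective bandwidth at 20% overhead.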

Sample NPB 2 Results (slide 36)

[Four plots of NPB 2 results (8/14/96, http://www.nas.nasa.gov/NAS/NPB/), each with number of processors on the x axis: per-node performance (Mflop/s/processor) of BT class A and total performance (Mflop/s) of LU class B for the IBM SP2, Cray T3D, SGI R8000 Array and Intel Paragon; per-node performance of the Cray T3D on LU classes A, B and C; and per-node performance of the IBM SP2 on LU, BT, SP and MG, class A.]

Reference: The NAS Parallel Benchmarks 2.1 Results by W. Saphir, A. Woo, and M. Yarrow, http://www.nas.nasa.gov/News/Techreports/1996/PDF/nas96010.pdf

Benchmarking Organizations (slide 38)

• SPEC
  – Created to satisfy the need for realistic, fair and standardized performance tests
  – Motto: "An ounce of honest data is worth more than a pound of marketing hype"
• TPC
  – Formed primarily due to the lack of reliable database benchmarks

SPEC Benchmark Suite Overview (slide 39)

• The Standard Performance Evaluation Corporation is a non-profit organization (financed by its members: over 60 leading computer and software manufacturers) founded in 1988
• SPEC benchmarks are written in a platform-neutral language (typically C or Fortran)
• The code may be compiled using arbitrary compilers, but the sources may not be modified
  – many manufacturers are known to optimize their compilers and/or systems to improve the SPEC results
• Benchmarks may be obtained by purchasing a license from SPEC; the results are published on the SPEC website
• Website: http://www.spec.org

SPEC Suite Components (slide 40)

• SPEC CPU2006: combined performance of CPU, memory and compiler
  – CINT2006 (aka SPECint): integer arithmetic test using compilers, interpreters, word processors, chess programs, etc.
  – CFP2006 (aka SPECfp): floating point test using physical simulations, 3-D graphics, image processing, computational chemistry, etc.
• SPECweb2005: PHP/JSP performance
• SPECviewperf: OpenGL 3-D graphics system performance
• SPECapc: several popular 3-D intensive applications
• SPEC HPC2002: high-end parallel computing tests using a quantum chemistry application, weather modeling, an industrial oil deposits locator
• SPEC OMP2001: OpenMP application performance
• SPECjvm98: performance of a Java client on a Java VM
• SPECjAppServer2004: multi-tier benchmark measuring the performance of J2EE application servers
• SPECjbb2005: server-side Java performance
• SPECMAIL2001: mail server performance (SMTP and POP)
• SPEC SFS97_R1: NFS server throughput and response time
• Planned: SPEC MPI2006, SPECimap, SPECpower, Virtualization
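SPEC's composite results (such as the CPU2006 numbers that follow) are geometric means of per-benchmark ratios against a reference machine, so no single test dominates. A simplified sketch of that reduction (SPEC's actual run rules add median-of-runs and other details):

```python
def spec_composite(ref_times, run_times):
    """Each benchmark scores (reference time / measured time); the
    suite result is the geometric mean of those scores."""
    ratios = [r / t for r, t in zip(ref_times, run_times)]
    prod = 1.0
    for v in ratios:
        prod *= v
    return prod ** (1.0 / len(ratios))
```

Note how one test run twice as fast and one run half as fast cancel exactly, which an arithmetic mean would not do.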

Sample Results: SPEC CPU2006 (slide 41)

System                                                     | CINT2006 Speed (base/peak) | CFP2006 Speed (base/peak) | CINT2006 Rate (base/peak) | CFP2006 Rate (base/peak)
Dell Precision 380 (Pentium EE 965 3.73 GHz, 2 cores)      | 11.6 / 12.4                | 23.1 / 21.7               | -                         | -
HP ProLiant DL380 G4 (Xeon 3.8 GHz, 2 cores)               | 11.4 / 11.7                | 20.9 / 18.8               | -                         | -
HP ProLiant DL585 (Opteron 854 2.8 GHz, 2 cores)           | 11.2 / 12.7                | 12.1 / 13.0               | 22.3 / 25.2               | 24.1 / 25.9
Sun Blade 2500 (1 UltraSPARC IIIi, 1280 MHz)               | 4.04                       | 4.04                      | -                         | -
Sun Fire E25K (UltraSPARC IV+ 1500 MHz, 144 cores)         | -                          | -                         | 759 / 904                 | -
HP Integrity rx6600 (Itanium 2 1.6 GHz/24 MB, 2 cores)     | 14.5 / 15.7                | 17.3 / 18.1               | -                         | -
HP Integrity rx6600 (Itanium 2 1.6 GHz/24 MB, 8 cores)     | -                          | -                         | 94.7 / 102                | 69.1 / 71.4
HP Integrity Superdome (Itanium 2 1.6 GHz/24 MB, 128 cores)| -                          | -                         | 1534 / 1648               | 1422 / 1479

Notes:
• the base metric requires that the same flags are used when compiling all instances of the benchmark (peak is less strict)
• the speed metric measures how fast a computer executes a single task, while rate determines throughput with multiple tasks

TPC (slide 42)

• Governed by the Transaction Processing Performance Council (http://www.tpc.org), founded in 1985
  – members include leading system and microprocessor manufacturers, and commercial database developers
  – the council appoints professional affiliates and auditors outside the member group to help fulfill the TPC's mission and validate benchmark results
• Current benchmark flavors:
  – TPC-C for transaction processing (de facto standard for On-Line Transaction Processing)
  – TPC-H for decision support systems
  – TPC-App for web services
• Obsolete benchmarks:
  – TPC-A (performance of update-intensive databases)
  – TPC-B (throughput of a system in transactions per second)
  – TPC-D (decision support applications with long-running queries against complex data structures)
  – TPC-R (business reporting, decision support)
  – TPC-W (transactional web e-Commerce benchmark)

Top Ten TPC-C Results (slide 43)

By Performance:
System (Database)                                   | tpmC      | Price per tpmC
IBM p5 595 (IBM DB2 9)                              | 4,033,378 | 2.97 USD
IBM eServer p5 595 (IBM DB2 UDB 8.2)                | 3,210,540 | 5.07 USD
IBM eServer p5 595 (Oracle 10g EE)                  | 1,601,784 | 5.05 USD
Fujitsu PRIMEQUEST 540 (Oracle 10g EE)              | 1,238,579 | 3.94 USD
HP Integrity Superdome (MS SQL Server 2005 EE SP1)  | 1,231,433 | 4.82 USD
HP Integrity rx5670 (Oracle 10g EE)                 | 1,184,893 | 5.52 USD
IBM eServer pSeries 690 (IBM DB2 UDB 8.1)           | 1,025,486 | 5.43 USD
IBM p5 570 (IBM DB2 UDB 8.2)                        | 1,025,169 | 4.42 USD
HP Integrity Superdome (Oracle 10g EE)              | 1,008,144 | 8.33 USD
IBM eServer p5 570 (IBM DB2 UDB 8.1)                | 809,144   | 4.95 USD

By Price/Performance:
System (Database)                                        | tpmC   | Price per tpmC
Dell PowerEdge 2900 (MS SQL Server 2005)                 | 65,833 | 0.98 USD
Dell PowerEdge 2800/2.8 GHz (MS SQL Server 2005 x64)     | 38,622 | 0.99 USD
Dell PowerEdge 2800/3.6 GHz (MS SQL Server 2005 WE)      | 28,244 | 1.29 USD
Dell PowerEdge 2800/3.4 GHz (MS SQL Server 2000 WE)      | 28,122 | 1.40 USD
Dell PowerEdge 2850/3.4 GHz (MS SQL Server 2000)         | 26,410 | 1.53 USD
HP ProLiant ML350 T03 (MS SQL Server 2000 SP3)           | 17,810 | 1.57 USD
HP ProLiant ML350 T03/3.06 (IBM DB2 UDB 8.1 Express)     | 18,661 | 1.61 USD
HP ProLiant ML350 T03/2.8 (IBM DB2 UDB 8.1 Express)      | 18,318 | 1.68 USD
HP ProLiant ML370 G4 1M/3.6 (MS SQL Server 2000 EE SP3)  | 68,010 | 1.80 USD
HP Integrity rx2600 (Oracle 10g)                         | 51,506 | 1.81 USD

Presentation of the Results (slide 45)

• Tables
• Graphs
  – Bar graphs
  – Scatter plots
  – Line plots
  – Pie charts
  – Gantt charts
  – Kiviat graphs
• Enhancements
  – Error bars, boxes or confidence intervals
  – Broken or offset scales (be careful!)
  – Multiple curves per graph (but avoid overloading)
  – Data labels, colors, etc.

Kiviat Graph Example (slide 46)

Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml

Mixed Graph Example (slide 47)

[Chart: for each of the applications WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC and PETSc_FUN3D, the computation and communication fractions of the run, and the breakdown into floating point operations, load/store operations and other operations.]

Characterization of NSF/CCT parallel applications on the POWER5 architecture (using data collected by IPM)

Graph Do's and Don'ts (slide 48)

• Good graphs:
  – Require minimum effort from the reader
  – Maximize information
  – Maximize the information-to-ink ratio
  – Use commonly accepted practices
  – Avoid ambiguity
• Poor graphs:
  – Have too many alternatives on a single chart
  – Display too many y variables on a single chart
  – Use vague symbols in place of text
  – Show extraneous information
  – Select scale ranges improperly
  – Use a line chart instead of a bar graph

Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10

Common Mistakes in Benchmarking (slide 49)

From Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain:
• Only average behavior represented in test workload
• Skewness of device demands ignored
• Loading level controlled inappropriately
• Caching effects ignored
• Buffering sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient performance
• Using device utilizations for performance comparisons
• Collecting too much data but doing very little analysis

Misrepresentation of Performance Results on Parallel Computers (slide 50)

• Quote only 32-bit performance results, not 64-bit results
• Present performance for an inner kernel, representing it as the performance of the entire application
• Quietly employ assembly code and other low-level constructs
• Scale problem size with the number of processors, but omit any mention of this fact
• Quote performance results projected to the full system
• Compare your results with scalar, unoptimized code run on another platform
• When direct run time comparisons are required, compare with an old code on an obsolete system
• If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation
• Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar
• Mutilate the algorithm used in the parallel implementation to match the architecture
• Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment
• If all else fails, show pretty pictures and animated videos, and don't talk about performance
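The operation-count rule is worth a worked example (with hypothetical numbers): a parallel code that performs redundant arithmetic can quote a higher MFLOPS figure than an honest count based on the best sequential algorithm, for exactly the same runtime and result.

```python
def mflops(op_count, seconds):
    """Rate from an operation count and a measured time."""
    return op_count / seconds / 1e6

t = 2.0                        # identical runtime in both quotes
best_sequential_ops = 1.0e9    # work the best serial algorithm needs
parallel_ops = 1.5e9           # work the parallel code actually performs

honest = mflops(best_sequential_ops, t)    # rate credited for useful work
inflated = mflops(parallel_ops, t)         # 50% higher, same computation
```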

Reference: David Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelveways.pdf

Knowledge Factors & Skills (slide 52)

• Knowledge factors:
  – benchmarking and metrics
  – performance factors
  – Top500 list
• Skill set:
  – determine the state of system resources and manipulate them
  – acquire, run and measure benchmark performance
  – launch user application codes

Material For Test (slide 53)

• Basic performance metrics (slide 4)
• Definition of a benchmark in your own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL (slides 21 and 24)
• Linpack compare and contrast (slide 25)
• General knowledge about the HPCC and NPB suites (slides 31 and 34)
• Benchmark result interpretation (slides 49, 50)

53