Center for Information Services and High Performance Computing (ZIH)

Performance Analysis of Computer Systems
Benchmarks: TOP 500, Stream, and HPCC

Nöthnitzer Straße 46, Room 1026
Tel. +49 351 - 463 - 35048
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])

Summary of Previous Lecture

Different workloads:
– Test workload
– Real workload
– Synthetic workload

Historical examples for test workloads:
– Addition instruction
– Instruction mixes
– Kernels
– Synthetic programs
– Application benchmarks

Holger Brunst, Matthias Müller: Leistungsanalyse

Excursion on Speedup and Efficiency Metrics

Comparison of sequential and parallel algorithms

Speedup: $S_n = T_1 / T_n$
– $n$ is the number of processors
– $T_1$ is the execution time of the sequential algorithm
– $T_n$ is the execution time of the parallel algorithm with $n$ processors

Efficiency: $E_p = S_p / p$
– Its value estimates how well $p$ processors are utilized in solving a given problem
– Usually between zero and one. Exception: superlinear speedup (later)

Amdahl's Law

Find the maximum expected improvement to an overall system when only part of the system is improved.

Serial execution time = $s + p$
Parallel execution time = $s + p/n$

$$S_n = \frac{s + p}{s + p/n}$$

– Normalizing with respect to serial time ($s + p = 1$) results in: $S_n = 1/(s + p/n)$
– Drops off rapidly as the serial fraction increases
– Maximum speedup possible = $1/s$, independent of $n$, the number of processors!

Bad news: If an application has only 1% serial work ($s = 0.01$), then you will never see a speedup greater than 100. So why do we build systems with more than 100 processors? What is wrong with this argument?
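
The speedup, efficiency, and Amdahl formulas above can be tried out numerically. The following is a minimal sketch, not part of the original slides; the function names are chosen here for illustration, and the normalization s + p = 1 is assumed throughout.

```python
def speedup(t1, tn):
    """Speedup S_n = T_1 / T_n from measured execution times."""
    return t1 / tn


def efficiency(s_p, p):
    """Efficiency E_p = S_p / p for a speedup S_p on p processors."""
    return s_p / p


def amdahl_speedup(s, n):
    """Amdahl speedup S_n = 1 / (s + p/n), assuming s + p = 1."""
    p = 1.0 - s
    return 1.0 / (s + p / n)


if __name__ == "__main__":
    s = 0.01  # 1% serial work, as in the slide's example
    for n in (10, 100, 1000, 10**6):
        sn = amdahl_speedup(s, n)
        # Speedup approaches 1/s = 100 while efficiency falls toward zero.
        print(f"n={n:>7}  speedup={sn:8.2f}  efficiency={efficiency(sn, n):.4f}")
```

Running the script shows the speedup saturating below 1/s while the efficiency keeps falling, which is the effect the "bad news" statement describes for a fixed problem size.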

Popular and Historic Benchmarks

Popular benchmarks:
– Sieve of Eratosthenes
– Ackermann's function
– Whetstone
– LINPACK
– Dhrystone
– Lawrence Livermore Loops
– TPC-C
– SPEC

Workload Description

Level of detail of the workload description. Examples:
– Most frequent request (e.g. addition)
– Frequency of request types (instruction mix)
– Time-stamped sequence of requests
– Average resource demand (e.g. 20 I/O requests per second)
– Distribution of resource demands (not only the average, but also the probability distribution)

Characterization of Benchmarks

There are many metrics; each one has its purpose. Ordered from computer hardware to applications:
– Raw machine performance: Tflops
– Microbenchmarks: Stream
– Algorithmic benchmarks: Linpack
– Compact apps/kernels: NAS benchmarks
– Application suites: SPEC
– User-specific applications: custom benchmarks

Comparison of Different Benchmark Classes

Class         Coverage   Relevance   Identify problems   Time evolution
Micro         0          0           ++                  +
Algorithmic   -          0           +                   ++
Kernels       0          0           +                   +
SPEC          +          +           +                   +
Apps          -          ++          0                   0

SPEC Benchmarks: CPU 2006

Application benchmarks
Different metrics:
– Integer, floating point
– Standard and rate
– Base, peak
Run rules

Stream

Stream Benchmark

Author: John McCalpin ("Mr Bandwidth")
John McCalpin, "Memory Bandwidth and Machine Balance in High Performance Computers", IEEE TCCA Newsletter, December 1995
http://www.cs.virginia.edu/stream

STREAM: measures memory bandwidth with the operations:
– Copy: a(i) = b(i)
– Scale: a(i) = s*b(i)
– Add: a(i) = b(i) + c(i)
– Triad: a(i) = b(i) + s*c(i)

STREAM2: measures memory hierarchy bandwidth with the operations:
– Fill: a(i) = 0
– Copy: a(i) = b(i)
– Daxpy: a(i) = a(i) + q*b(i)
– Sum: sum += a(i)

Stream 2 properties

Stream Results: TOP 10

STREAM Memory Bandwidth --- John D. McCalpin, [email protected]
Revised to Tue Jul 25 10:10:14 CST 2006
All results are in MB/s --- 1 MB=10^6 B, *not* 2^20 B
--------------------------------------------------------------------------------
Machine ID                   ncpus       COPY      SCALE        ADD      TRIAD
--------------------------------------------------------------------------------
SGI_Altix_4700                1024  3661963.0  3677482.0  4385585.0  4350166.0
SGI_Altix_3000                 512   906388.0   870211.0  1055179.0  1119913.0
NEC_SX-7                        32   876174.7   865144.1   869179.2   872259.1
NEC_SX-5-16A                    16   607492.0   590390.0   607412.0   583069.0
NEC_SX-4                        32   434784.0   432886.0   437358.0   436954.0
HP_AlphaServer_GS1280-1300      64   407351.0   400142.0   437010.0   431450.0
Cray_T932_321024-3E             32   310721.0   302182.0   359841.0   359270.0
NEC_SX-6                         8   202627.2   192306.2   190231.3   213024.3
IBM_System_p5_595               64   186137.0   179639.0   200410.0   206243.0
HP_Integrity_SuperDome         128   154504.0   152999.0   169468.0   170833.0

Stream 2 Results

[Figure: a(i) = b(i) + alpha*c(i); x-axis: log_2(loop length), 0 to 20; y-axis: 0 to 900; series: NEC_Azusa_Intel_Itanium_azusa_efc, Pentium4_1400MHz_loan1_ifc]
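
To make the MB/s figures in the STREAM table concrete, the sketch below shows how a bandwidth value follows from the kernel, the array length, and the elapsed time. It is an illustration only and not the STREAM reference code: the names `WORDS_PER_ITERATION`, `bandwidth_mb_per_s`, and `triad` are hypothetical, and it assumes 8-byte elements with each array read or written once per loop iteration. A pure-Python loop is dominated by interpreter overhead, so the printed number says nothing about the real memory bandwidth of the machine.

```python
import time
from array import array

BYTES_PER_WORD = 8  # assumption: 64-bit floating-point elements

# Arrays touched per loop iteration (reads plus writes) for each kernel.
WORDS_PER_ITERATION = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def bandwidth_mb_per_s(kernel, n, seconds):
    """Bandwidth in MB/s with 1 MB = 10^6 bytes, as in the results table."""
    bytes_moved = WORDS_PER_ITERATION[kernel] * BYTES_PER_WORD * n
    return bytes_moved / seconds / 1e6


def triad(a, b, c, s):
    """Triad kernel: a(i) = b(i) + s*c(i)."""
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]


if __name__ == "__main__":
    n = 1_000_000
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start
    rate = bandwidth_mb_per_s("triad", n, elapsed)
    print(f"triad: {rate:.1f} MB/s (interpreter-bound, not a real measurement)")
```

A compiled implementation of the same loop, run on arrays much larger than the caches, is what produces numbers comparable to the table.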

Linpack and TOP500
Slides courtesy of Jack Dongarra

What Is the LINPACK Benchmark?

The Linpack Benchmark is a measure of a computer's floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.

LINPACK Benchmark
– Dense linear system solve with LU factorization using partial pivoting
– Operation count is: $\frac{2}{3} n^3 + O(n^2)$
– Benchmark measure: MFlop/s
– Original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

Output From Linpack 100 Benchmark

When the Linpack Fortran n = 100 benchmark is run, it produces the following kind of results:

Please send the results of this run to:
Jack J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: [email protected]

   norm. resid       resid          machep           x(1)            x(n)
 1.67005097E+00  7.41628980E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

times are reported for matrices of order 100
    dgefa      dgesl      total     mflops       unit      ratio
times for array with leading dimension of 201
1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02
times for array with leading dimension of 200
1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02

Callouts on the slide: dgefa = factor time, dgesl = solve time, total = total time, mflops = Mflop/s rate.

Linpack Benchmark Over Time

In the beginning there was the Linpack 100 Benchmark (1977)
– n = 100 (80 KB); size that would fit in all the machines
– Fortran; 64-bit floating-point arithmetic
– No hand optimization (only compiler options)

Computer at the Top of the Linpack List Over Time for n = 100

Year  Computer                                         Number of Processors  Cycle time  Mflop/s
2006  Intel Pentium Woodcrest (3 GHz)                  1                     3 GHz       3018
2005  NEC SX-8/1 (1 proc)                              1                     2 GHz       2177
2004  Intel Pentium Nocona (1 proc 3.6 GHz)            1                     3.6 GHz     1803
2003  HP Integrity Server rx2600 (1 proc 1.5 GHz)      1                     1.5 GHz     1635
2002  Intel Pentium 4 (3.06 GHz)                       1                     3.06 GHz    1414
2001  Fujitsu VPP5000/1                                1                     3.33 nsec   1156
2000  Fujitsu VPP5000/1                                1                     3.33 nsec   1156
1999  CRAY T916                                        4                     2.2 nsec    1129
1995  CRAY T916                                        1                     2.2 nsec    522
1994  CRAY C90                                         16                    4.2 nsec    479
1993  CRAY C90                                         16                    4.2 nsec    479
1992  CRAY C90                                         16                    4.2 nsec    479
1991  CRAY C90                                         16                    4.2 nsec    403
1990  CRAY Y-MP                                        8                     6.0 nsec    275
1989  CRAY Y-MP                                        8                     6.0 nsec    275
1988  CRAY Y-MP                                        1                     6.0 nsec    74
1987  ETA 10-E                                         1                     10.5 nsec   52
1986  NEC SX-2                                         1                     6.0 nsec    46
1985  NEC SX-2                                         1                     6.0 nsec    46
1984  CRAY X-MP                                        1                     9.5 nsec    21
1983  CRAY 1                                           1                     12.5 nsec   12
1979  CRAY 1                                           1                     12.5 nsec   3.4

Linpack Benchmark Over Time

Linpack 1000 (1986)
– n = 1000 (8 MB); wanted to see higher performance levels
– Any language; 64-bit floating-point arithmetic
– Hand optimization OK

Linpack TPP (1991) (Top500; 1993)
– Any size (n as large as you can; n = 10^6; 8 TB; ~6 hours)
– Any language; 64-bit floating-point arithmetic
– Hand optimization OK
– Strassen's method not allowed (confuses the op count and rate)
– Reference implementation available

In all cases results are verified by looking at:

$$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} = O(1)$$

Operations count for factorization: $\frac{2}{3} n^3 - \frac{1}{2} n^2$; solve: $2 n^2$

What Is LINPACK NxN?

[Figure: rate versus problem size, marking Rmax, Nmax, and N1/2]

LINPACK NxN benchmark
– Solves a system of linear equations by some method
– Allows the vendors to choose the size of problem for the benchmark
– Measures execution time for each size of problem

LINPACK NxN report
– Nmax – the size of the chosen problem run on a machine
– Rmax – the performance in Gflop/s for the chosen size problem run on the machine
– N1/2 – the size where half the Rmax execution rate is achieved
– Rpeak – the theoretical peak performance in Gflop/s for the machine

LINPACK NxN is used to rank the TOP500 fastest computers in the world.
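
The operation counts and the verification criterion quoted for the Linpack benchmarks can be sketched in a few lines. This is an illustrative sketch only, not the benchmark's reference implementation: the function names are hypothetical, the solver is a plain Gaussian elimination with partial pivoting that assumes a nonsingular matrix, and the choice of infinity norms in the residual check is an assumption made here, since the slides do not say which norm is used.

```python
import sys


def factor_ops(n):
    """Operation count for the factorization: 2/3 n^3 - 1/2 n^2."""
    return 2.0 * n**3 / 3.0 - n**2 / 2.0


def solve_ops(n):
    """Operation count for the solve: 2 n^2."""
    return 2.0 * n**2


def mflops(n, seconds):
    """Execution rate in Mflop/s for one factor-and-solve of order n."""
    return (factor_ops(n) + solve_ops(n)) / seconds / 1e6


def solve(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies, keep the inputs intact
    x = b[:]
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k.
        piv = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[piv] = a[piv], a[k]
        x[k], x[piv] = x[piv], x[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    # Back substitution on the upper-triangular system.
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / a[i][i]
    return x


def normalized_residual(a, x, b):
    """||Ax - b|| / (||A|| ||x|| n eps), using infinity norms (assumption)."""
    n = len(b)
    eps = sys.float_info.epsilon
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return resid / (norm_a * norm_x * n * eps)
```

For a well-conditioned system, `normalized_residual(a, solve(a, b), b)` should come out as a small number of order one, which is the check the slide describes; `mflops(n, seconds)` turns a measured time into a rate using the counts above.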