Center for Information Services and High Performance Computing (ZIH)
Performance Analysis of Computer Systems

Benchmarks: TOP 500, Stream, and HPCC

Nöthnitzer Straße 46, Raum 1026, Tel. +49 351 - 463 - 35048
Holger Brunst ([email protected]), Matthias S. Mueller ([email protected])
Summary of Previous Lecture
Different workloads:
– Test workload
– Real workload
– Synthetic workload

Historical examples for test workloads:
– Addition instruction
– Instruction mixes
– Kernels
– Synthetic programs
– Application benchmarks
Holger Brunst, Matthias Müller: Leistungsanalyse

Excursion on Speedup and Efficiency Metrics
Comparison of sequential and parallel algorithms
Speedup: S_n = T_1 / T_n
– n is the number of processors
– T_1 is the execution time of the sequential algorithm
– T_n is the execution time of the parallel algorithm with n processors

Efficiency: E_p = S_p / p
– Its value estimates how well p processors are utilized in solving a given problem
– Usually between zero and one. Exception: superlinear speedup (later)
Amdahl’s Law
Find the maximum expected improvement to an overall system when only part of the system is improved.

Serial execution time: s + p
Parallel execution time on n processors: s + p/n

Speedup: S_n = (s + p) / (s + p/n)

– Normalizing with respect to the serial time, s + p = 1, results in: S_n = 1 / (s + p/n)
– Drops off rapidly as the serial fraction s increases
– Maximum speedup possible = 1/s, independent of n, the number of processors!

Bad news: if an application has only 1% serial work (s = 0.01), then you will never see a speedup greater than 100. So, why do we build systems with more than 100 processors? What is wrong with this argument?
Popular and historic benchmarks
Popular benchmarks:
– Eratosthenes sieve algorithm
– Ackermann’s Function
– Whetstone
– LINPACK
– Dhrystone
– Lawrence Livermore Loops
– TPC-C
– SPEC
Workload description
Level of detail of the workload description – examples:
– Most frequent request (e.g. addition)
– Frequency of request types (instruction mix)
– Time-stamped sequence of requests
– Average resource demand (e.g. 20 I/O requests per second)
– Distribution of resource demands (not only the average, but also the probability distribution)
Characterization of Benchmarks
There are many metrics; each one has its purpose.

From computer hardware to applications:
– Raw machine performance: Tflop/s
– Microbenchmarks: Stream
– Algorithmic benchmarks: Linpack
– Compact apps/kernels: NAS benchmarks
– Application suites: SPEC
– User-specific applications: custom benchmarks
Comparison of different benchmark classes
             coverage   relevance   identify problems   time evolution
Micro           0           0              ++                 +
Algorithmic     -           0              +                  ++
Kernels         0           0              +                  +
SPEC            +           +              +                  +
Apps            -           ++             0                  0
SPEC Benchmarks: CPU 2006
Application benchmarks

Different metrics:
– Integer, floating-point
– Standard and rate
– Base, peak

Run rules
Center for Information Services and High Performance Computing (ZIH)
Stream
Stream Benchmark
Author: John McCalpin (“Mr Bandwidth”)

John McCalpin: “Memory Bandwidth and Machine Balance in High Performance Computers”, IEEE TCCA Newsletter, December 1995, http://www.cs.virginia.edu/stream

STREAM: measures memory bandwidth with the operations:
– Copy: a(i) = b(i)
– Scale: a(i) = s*b(i)
– Add: a(i) = b(i) + c(i)
– Triad: a(i) = b(i) + s*c(i)

STREAM2: measures memory hierarchy bandwidth with the operations:
– Fill: a(i) = 0
– Copy: a(i) = b(i)
– Daxpy: a(i) = a(i) + q*b(i)
– Sum: sum += a(i)
Stream 2 properties
Stream Results: TOP 10
STREAM Memory Bandwidth --- John D. McCalpin, [email protected]
Revised to Tue Jul 25 10:10:14 CST 2006

All results are in MB/s --- 1 MB = 10^6 B, *not* 2^20 B

Machine ID                    ncpus       COPY        SCALE        ADD        TRIAD
SGI_Altix_4700                 1024   3661963.0   3677482.0   4385585.0   4350166.0
SGI_Altix_3000                  512    906388.0    870211.0   1055179.0   1119913.0
NEC_SX-7                         32    876174.7    865144.1    869179.2    872259.1
NEC_SX-5-16A                     16    607492.0    590390.0    607412.0    583069.0
NEC_SX-4                         32    434784.0    432886.0    437358.0    436954.0
HP_AlphaServer_GS1280-1300       64    407351.0    400142.0    437010.0    431450.0
Cray_T932_321024-3E              32    310721.0    302182.0    359841.0    359270.0
NEC_SX-6                          8    202627.2    192306.2    190231.3    213024.3
IBM_System_p5_595                64    186137.0    179639.0    200410.0    206243.0
HP_Integrity_SuperDome          128    154504.0    152999.0    169468.0    170833.0
Stream 2 Results
[Figure: STREAM2 triad a(i) = b(i) + alpha*c(i) – bandwidth (MB/s, 0 to 900) vs. log_2(loop length), for NEC_Azusa_Intel_Itanium_azusa_efc and Pentium4_1400MHz_loan1_ifc]
Center for Information Services and High Performance Computing (ZIH)
Linpack and TOP500
Slides courtesy Jack Dongarra
LINPACK Benchmark?
The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit; in fact, there are three benchmarks included in the Linpack Benchmark report.

LINPACK Benchmark:
– Dense linear system solve with LU factorization using partial pivoting
– Operation count is 2/3 n^3 + O(n^2)
– Benchmark measure: MFlop/s
– The original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100

Output From Linpack 100 Benchmark

When the Linpack Fortran n = 100 benchmark is run, it produces the following kind of results:
Please send the results of this run to:
Jack J. Dongarra Computer Science Department University of Tennessee Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: [email protected]
norm. resid       resid            machep           x(1)             x(n)
1.67005097E+00    7.41628980E-14   2.22044605E-16   1.00000000E+00   1.00000000E+00
times are reported for matrices of order 100

times for array with leading dimension of 201
   dgefa      dgesl      total      mflops     unit       ratio
 1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
 1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
 1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
 1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02

times for array with leading dimension of 200
   dgefa      dgesl      total      mflops     unit       ratio
 1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
 1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
 1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
 1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02
(columns: dgefa = factor time, dgesl = solve time, total = total time, mflops = Mflop/s rate)

Linpack Benchmark Over Time
In the beginning there was the Linpack 100 Benchmark (1977):
– n = 100 (80 KB); a size that would fit in all the machines
– Fortran; 64-bit floating-point arithmetic
– No hand optimization (only compiler options)

Linpack computer at the top of the list over time, for n = 100:

Year  Computer                                      #Proc  Cycle time  Mflop/s
2006  Intel Pentium Woodcrest (3 GHz)                   1  3 GHz          3018
2005  NEC SX-8/1 (1 proc)                               1  2 GHz          2177
2004  Intel Pentium Nocona (1 proc, 3.6 GHz)            1  3.6 GHz        1803
2003  HP Integrity Server rx2600 (1 proc, 1.5 GHz)      1  1.5 GHz        1635
2002  Intel Pentium 4 (3.06 GHz)                        1  3.06 GHz       1414
2001  Fujitsu VPP5000/1                                 1  3.33 nsec      1156
2000  Fujitsu VPP5000/1                                 1  3.33 nsec      1156
1999  CRAY T916                                         4  2.2 nsec       1129
1995  CRAY T916                                         1  2.2 nsec        522
1994  CRAY C90                                         16  4.2 nsec        479
1993  CRAY C90                                         16  4.2 nsec        479
1992  CRAY C90                                         16  4.2 nsec        479
1991  CRAY C90                                         16  4.2 nsec        403
1990  CRAY Y-MP                                         8  6.0 nsec        275
1989  CRAY Y-MP                                         8  6.0 nsec        275
1988  CRAY Y-MP                                         1  6.0 nsec         74
1987  ETA 10-E                                          1  10.5 nsec        52
1986  NEC SX-2                                          1  6.0 nsec         46
1985  NEC SX-2                                          1  6.0 nsec         46
1984  CRAY X-MP                                         1  9.5 nsec         21
1983  CRAY 1                                            1  12.5 nsec        12
1979  CRAY 1                                            1  12.5 nsec       3.4

Linpack Benchmark Over Time:

Linpack 100 (1977):
– n = 100 (80 KB); size that would fit in all the machines
– Fortran; 64-bit floating-point arithmetic
– No hand optimization (only compiler options)

Linpack 1000 (1986):
– n = 1000 (8 MB); wanted to see higher performance levels
– Any language; 64-bit floating-point arithmetic
– Hand optimization OK

Linpack TPP (1991) (Top500; 1993):
– Any size (n as large as you can; n = 10^6; 8 TB; ~6 hours)
– Any language; 64-bit floating-point arithmetic
– Hand optimization OK
– Strassen’s method not allowed (confuses the op count and rate)
– Reference implementation available

In all cases results are verified by checking that ||Ax - b|| / (||A|| ||x|| n eps) = O(1).
Operation count: 2/3 n^3 for the factorization; 2 n^2 for the solve.

What is LINPACK NxN?

LINPACK NxN benchmark:
– Solves a system of linear equations by some method
– Allows the vendors to choose the size of the problem for the benchmark
– Measures execution time for each size problem

LINPACK NxN report:
Nmax – the size of the chosen problem run on a machine
Rmax – the performance in Gflop/s for the chosen size problem run on the machine
N1/2 – the size where half the Rmax execution rate is achieved
Rpeak – the theoretical peak performance in Gflop/s for the machine

LINPACK NxN is used to rank the TOP500 fastest computers in the world.

TOP500 (H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra):
– Listing of the 500 most powerful computers in the world
– Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem, TPP performance)
– Updated twice a year: SC‘xy in the States in November; meeting in Mannheim, Germany in June
– All data available from www.top500.org
– (Entries for this table began in 1991.)

Linpack TPP (1991) (Top500; 1993):

Year       Computer                                     #Procs  Measured  Size of  Size of   Theoretical
                                                                Gflop/s   Problem  1/2 Perf  Peak Gflop/s
2005-2006  IBM Blue Gene/L                              131072   280600   1769471             367001
2002-2004  Earth Simulator Computer, NEC                  5104    35610   1041216   265408     40832
2001       ASCI White-Pacific, IBM SP Power 3             7424     7226    518096   179000     11136
2000       ASCI White-Pacific, IBM SP Power 3             7424     4938    430000              11136
1999       ASCI Red Intel Pentium II Xeon core            9632     2379    362880    75400      3207
1998       ASCI Blue-Pacific SST, IBM SP 604E             5808     2144    431344               3868
1997       Intel ASCI Option Red (200 MHz Pentium Pro)    9152     1338    235000    63000      1830
1996       Hitachi CP-PACS                                2048    368.2    103680    30720       614
1995       Intel Paragon XP/S MP                          6768    281.1    128600    25700       338
1994       Intel Paragon XP/S MP                          6768    281.1    128600    25700       338
1993       Fujitsu NWT                                     140    124.5     31920    11950       236
1992       NEC SX-3/44                                       4     20.0      6144      832        22
1991       Fujitsu VP2600/10                                 1      4.0      1000      200         5

26th List: The TOP10
Rank  Manufacturer  Computer                            Installation Site                Country      Year  Rmax [TF/s]  #Proc   Type
1     IBM           BlueGene/L eServer Blue Gene        DOE, Lawrence Livermore Nat Lab  USA          2005  280.6        131072  custom
2     IBM           BGW eServer Blue Gene               IBM Thomas Watson Research       USA          2005  91.29        40960   custom
3     IBM           ASC Purple Power5 p575              DOE, Lawrence Livermore Nat Lab  USA          2005  63.39        10240   custom
4     SGI           Columbia Altix, Itanium/Infiniband  NASA Ames                        USA          2004  51.87        10160   hybrid
5     Dell          Thunderbird Pentium/Infiniband      DOE, Sandia Nat Lab              USA          2005  38.27        8000    commod
6     Cray          Red Storm Cray XT3 AMD              DOE, Sandia Nat Lab              USA          2005  36.19        10880   hybrid
7     NEC           Earth-Simulator SX-6                Earth Simulator Center           Japan        2002  35.86        5120    custom
8     IBM           MareNostrum PPC 970/Myrinet         Barcelona Supercomputer Center   Spain        2005  27.91        4800    commod
9     IBM           eServer Blue Gene                   ASTRON, University Groningen     Netherlands  2005  27.45        12288   custom
10    Cray          Jaguar Cray XT3 AMD                 DOE, Oak Ridge Nat Lab           USA          2005  20.53        5200    hybrid

Top500 from November 2005
[Figure: TOP500 performance development, 1993 to 2005 (log scale, 100 Mflop/s to 1 Pflop/s). Annotations: Sum 2.3 PF/s; N=1: IBM BlueGene/L (LLNL) 280.6 TF/s, previously NEC Earth Simulator, IBM ASCI White (LLNL), Intel ASCI Red (Sandia), Fujitsu 'NWT' (NAL); N=500: 1.646 TF/s, 1.167 TF/s, 59.7 GF/s; My Laptop 0.4 GF/s]

Architectures / Systems
[Figure: TOP500 architecture classes over time, 1993 to 2005 (number of systems, 0 to 500): SIMD, Single Proc., SMP, MPP, Constellations, Cluster]
Cluster: Commodity processors & Commodity interconnect
Constellation: # of procs/node >= # of nodes in the system

Processor Types
Interconnects / Systems
Center for Information Services and High Performance Computing (ZIH)
HPCC Benchmark
Slides courtesy Jack Dongarra
Matthias Müller ([email protected]) MotivationMotivation forfor AdditionalAdditional BenchmarksBenchmarks
From the Linpack Benchmark and Top500: “no single number can reflect overall performance”. Clearly we need something more than Linpack.

The good:
– One number
– Simple to define & easy to rank
– Allows the problem size to change with the machine and over time

The bad:
– Emphasizes only “peak” CPU speed and number of CPUs
– Does not stress local bandwidth
– Does not stress the network
– Does not test gather/scatter
– Ignores Amdahl’s Law (only does weak scaling)

The ugly:
– Benchmarketeering hype

HPC Challenge Benchmark: the test suite stresses not only the processors, but the memory system and the interconnect. The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack.
Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest.
1. HPL (LINPACK) – MPI Global (Ax = b)
2. STREAM – Local, single CPU; *STREAM – Embarrassingly parallel
3. PTRANS (A <- A + B^T) – MPI Global
4. RandomAccess – Local, single CPU; *RandomAccess – Embarrassingly parallel; RandomAccess – MPI Global
5. Bandwidth and Latency – MPI
6. FFT – Global, single CPU, and EP
7. Matrix Multiply – single CPU and EP

HPCS Performance Targets
HPC Challenge performance targets (max / relative):
– HPL (linear system solve, Ax = b): 2 Pflop/s (8x)
– STREAM (vector operations, A = B + s * C): 6.5 Pbyte/s (40x)
– FFT (1-D Fast Fourier Transform, Z = fft(X)): 0.5 Pflop/s (200x)
– RandomAccess (integer update, T[i] = XOR(T[i], rand)): 64000 GUPS (2000x)

[Figure: memory hierarchy – registers, operands/instructions, cache(s), lines/blocks, local memory, messages, remote memory, pages, disk, tape]
HPCC was developed by HPCS to assist in testing new HEC systems
Each benchmark focuses on a different part of the memory hierarchy
HPCS performance targets attempt to:
– Flatten the memory hierarchy
– Improve real application performance
– Make programming easier

Tests on Single Processor and System
Local - only a single processor is performing computations.
Embarrassingly Parallel - each processor in the entire system is performing computations, but they do not communicate with each other explicitly.
Global - all processors in the system are performing computations and they explicitly communicate with each other.

Computational Resources and HPC Challenge Benchmarks
[Diagram: computational resources – CPU computational speed, node memory bandwidth, interconnect bandwidth, memory access patterns – and the HPCC benchmarks that stress them: HPL and Matrix Multiply (CPU computational speed), STREAM (node memory bandwidth), Random & Natural Ring Bandwidth & Latency (interconnect bandwidth)]

HPL Benchmark
TPP Linpack Benchmark, used for the Top500 ratings:
– Solve Ax = b, dense problem, matrix is random
– Uses LU decomposition with partial pivoting
– Based on the ScaLAPACK routines but optimized
– The algorithm is scalable in the sense that the parallel efficiency is maintained constant with respect to the per-processor memory usage
– In double precision (64-bit) arithmetic
– Run on all processors
– Problem size set by user; these settings used for the other tests
– Requires an implementation of MPI and an implementation of the Basic Linear Algebra Subprograms (BLAS)
– Reports total TFlop/s achieved for the set of processors
– Takes the most time; considering stopping the process after, say, 25%, but still check to see if it is correct

STREAM Benchmark
The STREAM Benchmark is a standard benchmark for the measurement of computer memory bandwidth. It measures the bandwidth sustainable from standard operations, not the theoretical "peak bandwidth" provided by most vendors.

Four operations:

name      kernel                   bytes/iter  FLOPS/iter
COPY:     a(i) = b(i)                  16          0
SCALE:    a(i) = q*b(i)                16          1
SUM:      a(i) = b(i) + c(i)           24          1
TRIAD:    a(i) = b(i) + q*c(i)         24          2

Measures machine balance: the relative cost of memory accesses vs. arithmetic.
– Vector lengths chosen to fill local memory
– Tested on a single processor
– Tested on all processors in the set in an “embarrassingly parallel” fashion
– Reports total GB/s achieved per processor

PTRANS
Implements a parallel matrix transpose: A = A + B^T
– The matrices A and B are distributed across the processors
– Two-dimensional block-cyclic storage (same storage as for HPL)
– Exercises the communication pattern where pairs of processors communicate with each other simultaneously
– Large (out-of-cache) data transfers across the network
– Stresses the global bisection bandwidth
– Reports total GB/s achieved for the set of processors

Random Access
Integer read-modify-write to random addresses; no spatial or temporal locality.
– Measures memory latency, or the ability to hide memory latency
– Architecture stresses: latency to cache and main memory
– Architectures which can generate enough outstanding memory operations to tolerate the latency turn this into a main-memory-bandwidth-constrained benchmark

Three forms:
– Tested on a single processor
– Tested on all processors in the set in an “embarrassingly parallel” fashion
– Tested with an MPI version across the set of processors: each processor caches updates, then all processors perform MPI all-to-all communication to perform the updates across processors

Reports Gup/s (Giga updates per second) per processor.

Bandwidth and Latency Tests
Ping-Pong test between pairs of processors
Send a message from proc_i to proc_k, then return the message from proc_k to proc_i:
  proc_i: MPI_Send()  -  proc_k: MPI_Recv()
  proc_i: MPI_Recv()  -  proc_k: MPI_Send()
  (other processors doing MPI_Waitall(); time += MPI_Wtime(); time /= 2)
– The test is performed between as many distinct pairs of processors as possible; there is an upper bound on the time for the test.
– Tries to find the weakest link amongst all pairs: minimum bandwidth, maximum latency. Not necessarily the same link will be the worst for bandwidth and latency.
– 8 B message used for the latency test; take the max time. 2 MB message used for the bandwidth test; take the min GB/s.

Bandwidth/Latency Ring Tests (All Procs)

Two types of rings:
– Naturally ordered (use MPI_COMM_WORLD): 0, 1, 2, ..., P-1
– Randomly ordered (30 rings tested), e.g.: 7, 2, 5, 0, 3, 1, 4, 6

Each node posts two sends (to its left and right neighbor) and two receives (from its left and right neighbor). Two types of communication routines are used, combined send/receive and non-blocking send/receive:
  MPI_Sendrecv( TO: right_neighbor, FROM: left_neighbor )
  MPI_Irecv( left_neighbor ); MPI_Irecv( right_neighbor ); MPI_Isend( right_neighbor ); MPI_Isend( left_neighbor )
The smaller (better) time for each is taken (which one is smaller depends on the MPI implementation).
8 B message used for the latency test; 2 MB message used for the bandwidth test.

FFT
Uses the FFTE software: Daisuke Takahashi’s code from the University of Tsukuba.
– 64-bit complex 1-D FFT
– Uses 64-bit addressing
– Global transpose with MPI_Alltoall()
– Three transposes (data is never scrambled)

How Does The Benchmarking Work?

– Single program to download and run
– Simple input file, similar to the HPL input
– Base run and optimization run:
  – A base run must be made; the user supplies MPI and the BLAS
  – An optimized run is allowed to replace certain routines; the user specifies what was done
– Results upload via website; html table and Excel spreadsheet generated with the performance results
– Intentionally we are not providing a single figure of merit (no overall ranking)
– Goal: no more than 2x the time to execute HPL

Official HPCC Submission Process

Prerequisites:
– C compiler
– BLAS
– MPI

1. Download
2. Install
3. Run (provide detailed installation and execution environment)
4. Upload results
5. Confirm via @email@
6. Tune (only some routines can be replaced; the data layout needs to be preserved; multiple languages can be used)
7. Run
8. Upload results
9. Confirm via @email@

Results are immediately available on the web site:
– Interactive HTML
– XML
– Optional MS Excel
– Kiviat charts (radar plots)
http://icl.cs.utk.edu/hpcc/ web
HPCC Kiviat Chart
Thank you!