
Center for Information Services and High Performance Computing (ZIH)

Performance Analysis of Computer Systems

Benchmarks: TOP 500, Stream, and HPCC


Nöthnitzer Straße 46, Room 1026, Tel. +49 351 463 35048

Holger Brunst ([email protected]) Matthias S. Mueller ([email protected]) Center for Information Services and High Performance Computing (ZIH)

Summary of Previous Lecture


Different workloads:
– Test workload
– Real workload
– Synthetic workload

Historical examples for test workloads:
– Addition instruction
– Instruction mixes
– Kernels
– Synthetic programs
– Application benchmarks

Holger Brunst, Matthias Müller: Leistungsanalyse

Excursion on Speedup and Efficiency Metrics

Comparison of sequential and parallel execution

Speedup: Sn = T1 / Tn

– n is the number of processors

– T1 is the execution time of the sequential algorithm

– Tn is the execution time of the parallel algorithm with n processors

Efficiency: E = Sp / p

– Its value estimates how well p processors are utilized in solving a given problem
– Usually between zero and one. Exception: superlinear speedup (later)
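In code, these two metrics look as follows (the timings are made-up illustrative values, not measurements from any machine in this lecture):

```python
def speedup(t1, tn):
    """S_n = T1 / Tn: sequential time over parallel time on n processors."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E = S_n / n: how well the n processors are utilized."""
    return speedup(t1, tn) / n

# Illustrative numbers: 100 s sequentially, 14 s on 8 processors.
s8 = speedup(100.0, 14.0)
e8 = efficiency(100.0, 14.0, 8)
print(f"S_8 = {s8:.2f}, E = {e8:.2f}")  # S_8 = 7.14, E = 0.89
```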

Amdahl's Law

Find the maximum expected improvement to an overall system when only part of the system is improved.

Serial execution time = s + p
Parallel execution time = s + p/n

Speedup: Sn = (s + p) / (s + p/n)

– Normalizing with respect to the serial time (s + p = 1) results in:

• Sn = 1 / (s + p/n)
– Drops off rapidly as the serial fraction increases
– Maximum speedup possible = 1/s, independent of n, the number of processors!

Bad news: if an application has only 1% serial work (s = 0.01), you will never see a speedup greater than 100. So why do we build systems with more than 100 processors? What is wrong with this argument?
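A small sketch of the law, using the 1% example from above:

```python
def amdahl_speedup(s, n):
    """Amdahl's law with s + p = 1: S_n = 1 / (s + (1 - s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

# With s = 0.01 the speedup saturates near 1/s = 100, regardless of n:
for n in (10, 100, 1000, 100000):
    print(n, round(amdahl_speedup(0.01, n), 1))
```

One common resolution of the puzzle: Amdahl's law fixes the problem size, but on larger machines users run larger problems, and the serial fraction s tends to shrink as the problem grows (weak scaling).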

Popular and historic benchmarks

Popular benchmarks:
– Eratosthenes sieve algorithm
– Ackermann's Function
– LINPACK
– Lawrence Livermore Loops
– TPC-
– SPEC

Workload description

Level of detail of the workload description. Examples:
– Most frequent request (e.g. addition)
– Frequency of request types (instruction mix)
– Time-stamped sequence of requests
– Average resource demand (e.g. 20 I/O requests per second)
– Distribution of resource demands (not only the average, but also the probability distribution)

Characterization of Benchmarks

There are many metrics; each one has its purpose:
– Raw machine performance: Tflops
– Microbenchmarks: Stream
– Algorithmic benchmarks: Linpack
– Compact apps/kernels: NAS benchmarks
– Application suites: SPEC
– User-specific applications: custom benchmarks


Comparison of different classes

              coverage   relevance   identify problems   time evolution
Micro         0          0           ++                  +
Algorithmic   -          0           +                   ++
Kernels       0          0           +                   +
SPEC          +          +           +                   +
Apps          -          ++          0                   0

SPEC Benchmarks: CPU 2006

Application benchmarks
Different metrics:
– Integer, floating point
– Standard and rate
– Base, peak
Run rules

Center for Information Services and High Performance Computing (ZIH)

Stream


Holger Brunst ([email protected]) Matthias S. Mueller ([email protected]) Stream Benchmark

Author: John McCalpin ("Mr. Bandwidth")

John McCalpin: "Memory Bandwidth and Machine Balance in High Performance", IEEE TCCA Newsletter, December 1995
http://www.cs.virginia.edu/stream

STREAM: measures memory bandwidth with the operations:
– Copy: a(i) = b(i)
– Scale: a(i) = s*b(i)
– Add: a(i) = b(i) + c(i)
– Triad: a(i) = b(i) + s*c(i)

STREAM2: measures memory hierarchy bandwidth with the operations:
– Fill: a(i) = 0
– Copy: a(i) = b(i)
– Daxpy: a(i) = a(i) + q*b(i)
– Sum: sum += a(i)
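The reference STREAM implementation is a C (and Fortran) program; the following is only a pure-Python sketch of the Triad kernel and of how bandwidth is derived from the byte count. The array size here is illustrative and far too small (and Python far too slow) for a real measurement:

```python
import time

N = 500_000            # illustrative; real STREAM sizes arrays to defeat the cache
s = 3.0
b = [1.0] * N
c = [2.0] * N

t0 = time.perf_counter()
a = [b[i] + s * c[i] for i in range(N)]   # Triad: a(i) = b(i) + s*c(i)
elapsed = time.perf_counter() - t0

# Triad touches 24 bytes per iteration (read b and c, write a; 8 bytes each)
mb_per_s = 24.0 * N / elapsed / 1e6
print(f"Triad: {mb_per_s:.1f} MB/s")
```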

Stream 2 properties

Stream Results: TOP 10

STREAM Memory Bandwidth --- John D. McCalpin, [email protected]
Revised to Tue Jul 25 10:10:14 CST 2006

All results are in MB/s --- 1 MB=10^6 B, *not* 2^20 B

Machine ID                   ncpus   COPY       SCALE      ADD        TRIAD
SGI_Altix_4700               1024    3661963.0  3677482.0  4385585.0  4350166.0
SGI_Altix_3000               512     906388.0   870211.0   1055179.0  1119913.0
NEC_SX-7                     32      876174.7   865144.1   869179.2   872259.1
NEC_SX-5-16A                 16      607492.0   590390.0   607412.0   583069.0
NEC_SX-4                     32      434784.0   432886.0   437358.0   436954.0
HP_AlphaServer_GS1280-1300   64      407351.0   400142.0   437010.0   431450.0
Cray_T932_321024-3E          32      310721.0   302182.0   359841.0   359270.0
NEC_SX-6                     8       202627.2   192306.2   190231.3   213024.3
IBM_System_p5_595            64      186137.0   179639.0   200410.0   206243.0
HP_Integrity_SuperDome       128     154504.0   152999.0   169468.0   170833.0

Stream 2 Results

[Figure: a(i) = b(i) + alpha*c(i): bandwidth in MB/s (0 to 900) vs. log_2(loop length) (0 to 20) for NEC_Azusa_Intel_Itanium_azusa_efc and Pentium4_1400MHz_loan1_ifc]
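The sweep in the figure can be mimicked by timing the kernel a(i) = b(i) + alpha*c(i) over power-of-two loop lengths. This pure-Python sketch is nowhere near compiled performance, but the same loop-length sweep is what exposes the memory hierarchy in the plot:

```python
import time

alpha = 2.0
for k in range(4, 16):                  # log2(loop length); truncated range to keep it quick
    n = 2 ** k
    b = [1.0] * n
    c = [0.5] * n
    a = [0.0] * n
    t0 = time.perf_counter()
    for i in range(n):                  # a(i) = b(i) + alpha*c(i)
        a[i] = b[i] + alpha * c[i]
    dt = time.perf_counter() - t0
    print(f"log2(n)={k:2d}  {24.0 * n / dt / 1e6:9.1f} MB/s")  # 24 bytes per iteration
```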

Center for Information Services and High Performance Computing (ZIH)

Linpack and TOP500

Slides courtesy of Jack Dongarra


Holger Brunst ([email protected]) Matthias S. Mueller ([email protected]) LINPACKLINPACK Benchmark?Benchmark?

– The Linpack Benchmark is a measure of a computer's floating-point rate of execution.
– It is determined by running a program that solves a dense system of linear equations.
– Over the years the characteristics of the benchmark have changed a bit.
– In fact, there are three benchmarks included in the Linpack Benchmark report.

LINPACK Benchmark:
– Dense linear system solved with LU factorization using partial pivoting
– Operation count: 2/3 n^3 + O(n^2)
– Benchmark measure: MFlop/s
– Original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

Output From Linpack 100 Benchmark

When the Linpack Fortran n = 100 benchmark is run it produces the following kind of results:

Please send the results of this run to:

Jack J. Dongarra Computer Science Department University of Tennessee Knoxville, Tennessee 37996-1300

Fax: 865-974-8296

Internet: [email protected]

norm. resid      resid            machep           x(1)             x(n)
1.67005097E+00   7.41628980E-14   2.22044605E-16   1.00000000E+00   1.00000000E+00

times are reported for matrices of order 100

   dgefa      dgesl      total      mflops     unit       ratio

times for array with leading dimension of 201
 1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
 1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
 1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
 1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02

times for array with leading dimension of 200
 1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
 1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
 1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
 1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02
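The mflops column follows from the operation count of the n = 100 run: 2/3 n^3 flops for the factorization plus 2 n^2 for the solves, divided by the total time. A sketch, using the first row of the leading-dimension-201 table above:

```python
n = 100
flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2   # dgefa + dgesl operation count
total = 1.609e-3                               # "total" column, first row (seconds)
mflops = flops / total / 1e6
print(f"{mflops:.1f} MFlop/s")                 # ~426.8, matching the 4.268E+02 entry
```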

Linpack Benchmark Over Time

– In the beginning there was the Linpack 100 Benchmark (1977)
  – n=100 (80 KB); size that would fit in all the machines
  – Fortran; 64-bit floating point arithmetic
  – No hand optimization (only compiler options)

Linpack Benchmark: Computer at the Top of the List Over Time for n=100

Year  Computer                                      Processors  Cycle time  Mflop/s
2006  Intel Pentium Woodcrest (3 GHz)               1           3 GHz       3018
2005  NEC SX-8/1 (1 proc)                           1           2 GHz       2177
2004  Intel Pentium Nocona (1 proc, 3.6 GHz)        1           3.6 GHz     1803
2003  HP Integrity Server rx2600 (1 proc, 1.5 GHz)  1           1.5 GHz     1635
2002  Intel (3.06 GHz)                              1           3.06 GHz    1414
2001  Fujitsu VPP5000/1                             1           3.33 nsec   1156
2000  Fujitsu VPP5000/1                             1           3.33 nsec   1156
1999  CRAY T916                                     4           2.2 nsec    1129
1995  CRAY T916                                     1           2.2 nsec    522
1994  CRAY C90                                      16          4.2 nsec    479
1993  CRAY C90                                      16          4.2 nsec    479
1992  CRAY C90                                      16          4.2 nsec    479
1991  CRAY C90                                      16          4.2 nsec    403
1990  CRAY Y-MP                                     8           6.0 nsec    275
1989  CRAY Y-MP                                     8           6.0 nsec    275
1988  CRAY Y-MP                                     1           6.0 nsec    74
1987  ETA 10-E                                      1           10.5 nsec   52
1986  NEC SX-2                                      1           6.0 nsec    46
1985  NEC SX-2                                      1           6.0 nsec    46
1984  CRAY X-MP                                     1           9.5 nsec    21
1983  CRAY 1                                        1           12.5 nsec   12
1979  CRAY 1                                        1           12.5 nsec   3.4

Linpack Benchmark Over Time

– In the beginning there was the Linpack 100 Benchmark (1977)
  – n=100 (80 KB); size that would fit in all the machines
  – Fortran; 64-bit floating point arithmetic
  – No hand optimization (only compiler options)
– Linpack 1000 (1986)
  – n=1000 (8 MB); wanted to see higher performance levels
  – Any language; 64-bit floating point arithmetic
  – Hand optimization OK
– Linpack TPP (1991) (Top500; 1993)
  – Any size (n as large as you can; n = 10^6; 8 TB; ~6 hours)
  – Any language; 64-bit floating point arithmetic
  – Hand optimization OK
  – Strassen's method not allowed (confuses the operation count and rate)
  – Reference implementation available
– In all cases results are verified by looking at: ||Ax - b|| / (||A|| ||x|| n) = O(1)
– Operations count for factorization: 2/3 n^3; solve: 2 n^2

What is LINPACK NxN
[Figure: performance rate vs. problem size N: the rate approaches Rmax as N approaches Nmax; N1/2 is the size where half of Rmax is reached]

– LINPACK NxN benchmark
– Solves a system of linear equations by some method
– Allows the vendors to choose the size of the problem for the benchmark
– Measures execution time for each problem size
– LINPACK NxN report:

– Nmax: the size of the chosen problem run on a machine
– Rmax: the performance in Gflop/s for the chosen size problem run on the machine
– N1/2: the size where half the Rmax execution rate is achieved

– Rpeak: the theoretical peak performance in Gflop/s for the machine
– LINPACK NxN is used to rank the TOP500 fastest computers in the world

H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra:
– Listing of the 500 most powerful computers in the world
– Yardstick: Rmax from LINPACK MPP

– Ax=b, dense problem; TPP performance
– Updated twice a year:
  – SC'xy in the States in November
  – Meeting in Mannheim, Germany in June
– All data available from www..org
(Entries for this table began in 1991.)

Linpack TPP (1991) (Top500; 1993)

Year       Computer                                     # of Procs  Measured Gflop/s  Size of Problem  Size of 1/2 Perf  Theoretical Peak Gflop/s
2005-2006  IBM Blue Gene/L                              131072      280600            1769471                            367001
2002-2004  Earth Simulator Computer, NEC                5104        35610             1041216          265408            40832
2001       ASCI White-Pacific, IBM SP Power 3           7424        7226              518096           179000            11136
2000       ASCI White-Pacific, IBM SP Power 3           7424        4938              430000                             11136
1999       ASCI Red Intel Pentium II Xeon core          9632        2379              362880           75400             3207
1998       ASCI Blue-Pacific SST, IBM SP 604E           5808        2144              431344                             3868
1997       Intel ASCI Option Red (200 MHz Pentium Pro)  9152        1338              235000           63000             1830
1996       Hitachi CP-PACS                              2048        368.2             103680           30720             614
1995       Intel Paragon XP/S MP                        6768        281.1             128600           25700             338
1994       Intel Paragon XP/S MP                        6768        281.1             128600           25700             338
1993       Fujitsu NWT                                  140         124.5             31920            11950             236
1992       NEC SX-3/44                                  4           20.0              6144             832               22
1991       Fujitsu VP2600/10                            1           4.0               1000             200               5

26th List: The TOP10

Rank  Manufacturer  Computer                            Rmax [TF/s]  Installation Site                Country      Year  #Proc   Type
1     IBM           BlueGene/L eServer Blue Gene        280.6        DOE, Lawrence Livermore Nat Lab  USA          2005  131072  custom
2     IBM           BGW eServer Blue Gene               91.29        IBM Thomas Watson Research       USA          2005  40960   custom
3     IBM           ASC Purple Power5 p575              63.39        DOE, Lawrence Livermore Nat Lab  USA          2005  10240   custom
4     SGI           Columbia Altix, Itanium/Infiniband  51.87        NASA Ames                        USA          2004  10160   hybrid
5     Dell          Thunderbird Pentium/Infiniband      38.27        DOE, Sandia Nat Lab              USA          2005  8000    commod
6     Cray          Cray XT3 AMD                        36.19        DOE, Sandia Nat Lab              USA          2005  10880   hybrid
7     NEC           Earth-Simulator SX-6                35.86        Earth Simulator Center           Japan        2002  5120    custom
8     IBM           MareNostrum PPC 970/Myrinet         27.91        Barcelona Center                 Spain        2005  4800    commod
9     IBM           eServer Blue Gene                   27.45        ASTRON, University Groningen     Netherlands  2005  12288   custom
10    Cray          Jaguar Cray XT3 AMD                 20.53        DOE, Oak Ridge Nat Lab           USA          2005  5200    hybrid

Top500 from November 2005

[Figure: TOP500 performance development, 1993-2005 (scale 100 Mflop/s to 1 Pflop/s): SUM reaches 2.3 PF/s; N=1 progresses from 59.7 GF/s (Fujitsu 'NWT', NAL) through Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator to IBM BlueGene/L at 280.6 TF/s; N=500 compared with 'My Laptop' at 0.4 GF/s]

Architectures / Systems

[Figure: architecture classes (SIMD, Single Proc., SMP, MPP, Constellations, Cluster) of the 500 systems, 1993-2005]

Cluster: Commodity processors & Commodity interconnect

Constellation: # of procs/node ≥ # of nodes in the system

Processor Types

[Figure: processor types used in TOP500 systems over time]

Interconnects / Systems

[Figure: interconnect families used in TOP500 systems over time]

Center for Information Services and High Performance Computing (ZIH)

HPCC Benchmark

Slides courtesy of Jack Dongarra

Matthias Müller ([email protected]) MotivationMotivation forfor AdditionalAdditional BenchmarksBenchmarks

– From the Linpack Benchmark and Top500: "no single number can reflect overall performance"
– Clearly need something more than Linpack
– HPC Challenge Benchmark: a test suite that stresses not only the processors, but the memory system and the interconnect
– The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack

The Linpack Benchmark:
– Good
  – One number
  – Simple to define & easy to rank
  – Allows problem size to change with machine and over time
– Bad
  – Emphasizes only "peak" CPU speed and number of CPUs
  – Does not stress local bandwidth
  – Does not stress the network
  – Does not test gather/scatter
  – Ignores Amdahl's Law (only does weak scaling)
  – …
– Ugly
  – Benchmarketeering hype

HPC Challenge Benchmark

– Consists of basically 7 benchmarks
– Think of it as a framework or harness for adding benchmarks of interest

1. HPL (LINPACK): MPI Global (Ax = b)

2. STREAM: Local (single CPU); *STREAM: Embarrassingly Parallel

3. PTRANS (A = A + B^T): MPI Global

4. RandomAccess: Local (single CPU); *RandomAccess: Embarrassingly Parallel; RandomAccess: MPI Global

5. BW and Latency: MPI

6. FFT: Global, single CPU, and EP

7. Matrix Multiply: single CPU and EP

HPCS Performance Targets

Performance targets (max, and relative improvement):

– HPL: linear system solve, Ax = b: 2 Pflop/s (8x)
– STREAM: vector operations, A = B + s * C: 6.5 Pbyte/s (40x)
– FFT: 1D Fast Fourier Transform, Z = fft(X): 0.5 Pflop/s (200x)
– RandomAccess: integer update, T[i] = XOR(T[i], rand): 64000 GUPS (2000x)

[Figure: memory hierarchy from registers (operands, instructions) through cache lines and blocks, local memory, remote memory (messages, pages), disk, and tape; the HPC Challenge targets span the whole hierarchy]
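The update rule T[i] = XOR(T[i], rand) can be sketched serially in a few lines. The toy table size and Python's generator here stand in for the benchmark's specific pseudo-random stream and its memory-filling table:

```python
import random
import time

TABLE_BITS = 16                    # toy size; the real table fills half of memory
T = list(range(2 ** TABLE_BITS))
mask = len(T) - 1                  # power-of-two table: mask instead of modulo

rng = random.Random(42)            # stand-in for the benchmark's random stream
updates = 100_000

t0 = time.perf_counter()
for _ in range(updates):
    r = rng.getrandbits(64)
    T[r & mask] ^= r               # T[i] = XOR(T[i], rand): no spatial/temporal locality
elapsed = time.perf_counter() - t0

gups = updates / elapsed / 1e9     # GUP/s: giga-updates per second
print(f"{gups:.6f} GUP/s")
```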

– HPCC was developed by HPCS to assist in testing new HEC systems

– Each benchmark focuses on a different part of the memory hierarchy

– HPCS performance targets attempt to:

  – Flatten the memory hierarchy
  – Improve real application performance
  – Make programming easier

Tests on Single Processor and System

– Local: only a single processor is performing computations.

– Embarrassingly Parallel: each processor in the entire system is performing computations, but they do not communicate with each other explicitly.

– Global: all processors in the system are performing computations and they explicitly communicate with each other.

Computational Resources and HPC Challenge Benchmarks

[Figure: the computational resources: CPU computational speed, node memory bandwidth, interconnect bandwidth]

Computational Resources and HPC Challenge Benchmarks

[Figure: HPC Challenge benchmarks mapped onto the resources: HPL and Matrix Multiply stress CPU computational speed; STREAM stresses node memory bandwidth; Random & Natural Ring Bandwidth & Latency stress interconnect bandwidth; memory access patterns link the resources]

HPL Benchmark

– TPP Linpack Benchmark
– Used for the Top500 ratings
– Solves Ax=b, dense problem, matrix is random
– Uses LU decomposition with partial pivoting
– Based on the ScaLAPACK routines but optimized
– The algorithm is scalable in the sense that the parallel efficiency is maintained constant with respect to the per-processor memory usage
– In double precision (64-bit) arithmetic
– Run on all processors
– Problem size set by user
– These settings are used for the other tests
– Requires:
  – An implementation of MPI
  – An implementation of the Basic Linear Algebra Subprograms (BLAS)
– Reports total TFlop/s achieved for the set of processors
– Takes the most time
  – Considering stopping the process after, say, 25%
  – Still check to see if the result is correct

STREAM Benchmark

– The STREAM Benchmark is a standard benchmark for the measurement of memory bandwidth
– Measures bandwidth sustainable from standard operations, not the theoretical "peak bandwidth" provided by most vendors
– Four operations: COPY, SCALE, ADD, TRIAD

name    kernel                   bytes/iter  FLOPS/iter
COPY:   a(i) = b(i)              16          0
SCALE:  a(i) = q*b(i)            16          1
SUM:    a(i) = b(i) + c(i)       24          1
TRIAD:  a(i) = b(i) + q*c(i)     24          2

– Measures machine balance: the relative cost of memory accesses vs. arithmetic
– Vector lengths chosen to fill local memory
– Tested on a single processor
– Tested on all processors in the set in an "embarrassingly parallel" fashion
– Reports total GB/s achieved per processor

PTRANS

– Implements parallel matrix transpose: A = A + B^T
– The matrices A and B are distributed across the processors
– Two-dimensional block-cyclic storage
– Same storage as for HPL
– Exercises the communication pattern where pairs of processors communicate with each other simultaneously
– Large (out-of-cache) data transfers across the network
– Stresses the global bisection bandwidth
– Reports total GB/s achieved for the set of processors

Random Access

– Integer read-modify-write to random addresses
– No spatial or temporal locality
– Measures memory latency, or the ability to hide memory latency
– Architecture stresses:
  – Latency to cache and main memory
  – Architectures which can generate enough outstanding memory operations to tolerate the latency change this into a main-memory bandwidth constrained benchmark
– Three forms:
  – Tested on a single processor
  – Tested on all processors in the set in an "embarrassingly parallel" fashion
  – Tested with an MPI version across the set of processors
    – Each processor caches updates, then all processors perform MPI all-to-all communication to perform the updates across processors
– Reports GUP/s (Giga-updates per second) per processor

Bandwidth and Latency Tests

– Ping-Pong test between pairs of processors
– Send a message from proc i to proc k, then return the message from proc k to proc i
  – proc i: MPI_Send() / proc k: MPI_Recv()
  – proc i: MPI_Recv() / proc k: MPI_Send()
  – Other processors doing MPI_Waitall()
  – time += MPI_Wtime(); time /= 2
– The test is performed between as many distinct pairs of processors as possible
  – There is an upper bound on the time for the test
– Tries to find the weakest link amongst all pairs
  – Minimum bandwidth
  – Maximum latency
  – Not necessarily the same link will be the worst for bandwidth and latency
– Message of 8 B used for the latency test; take max time
– Message of 2 MB used for the bandwidth test; take min GB/s

Bandwidth/Latency Ring Tests (All Procs)

– Two types of rings:
  – Naturally ordered (use MPI_COMM_WORLD): 0, 1, 2, ..., P-1
  – Randomly ordered (30 rings tested), e.g.: 7, 2, 5, 0, 3, 1, 4, 6
– Each node posts two sends (to its left and right neighbor) and two receives (from its left and right neighbor)
– Two types of communication routines are used: combined send/receive and non-blocking send/receive
  – MPI_Sendrecv(TO: right_neighbor, FROM: left_neighbor)
  – MPI_Irecv(left_neighbor), MPI_Irecv(right_neighbor) and MPI_Isend(right_neighbor), MPI_Isend(left_neighbor)
– The smaller (better) time for each is taken (which one is smaller depends on the MPI implementation)
– Message of 8 B used for the latency test
– Message of 2 MB used for the bandwidth test

FFT

– Using FFTE: Daisuke Takahashi's code from the University of Tsukuba
– 64-bit complex 1-D FFT

– Uses 64-bit addressing
– Global transpose with MPI_Alltoall()
– Three transposes (data is never scrambled)

How Does The Benchmarking Work?

– Single program to download and run
– Simple input file similar to the HPL input
– Base Run and Optimization Run
  – Base run must be made
  – User supplies MPI and the BLAS
  – Optimized run allowed to replace certain routines
  – User specifies what was done
– Results are uploaded via the website
  – HTML table and Excel spreadsheet generated with performance results
– Intentionally we are not providing a single figure of merit (no overall ranking)
– Goal: no more than 2x the time to execute HPL

Official HPCC Submission Process

Prerequisites:

– C compiler
– BLAS
– MPI

1. Download
2. Install (provide detailed installation and execution environment)
3. Run
4. Upload results
5. Confirm via @email@
6. Tune
   – Only some routines can be replaced
   – Data layout needs to be preserved
   – Multiple languages can be used
7. Run
8. Upload results
9. Confirm via @email@

Results are immediately available on the web site:
– Interactive HTML
– XML (optional)
– MS Excel (optional)
– Kiviat charts (radar plots) (optional)

http://icl.cs.utk.edu/hpcc/ web

HPCCHPCC KiviatKiviat ChartChart

Thank you!