HPC Challenge (HPCC) Benchmark Suite

Jack Dongarra Piotr Luszczek Allan Snavely

SC07, Reno, Nevada Sunday, November 11, 2007 Session S09 8:30AM-12:00PM, A1

Acknowledgements

• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA), the Department of Defense, the National Science Foundation (NSF), and the Department of Energy (DOE) through the DARPA High Productivity Computing Systems (HPCS) program under grant FA8750-04-1-0219 and under Army Contract W15P7T-05-C-D001
• Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government

Introduction

• HPC Challenge Benchmark Suite
  – To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
  – To augment the TOP500 list
  – To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics — e.g., spatial and temporal locality
  – To outlive HPCS
• HPC Challenge pushes spatial and temporal boundaries and defines performance bounds

TOP500 and HPCC

• TOP500
  – Performance is represented by only a single metric
  – Data is available for an extended time period (1993-2005)
  – Problem: there can only be one “winner”
  – Additional metrics and statistics:
    • Count (single) vendor systems
    • Perform trend analysis on each list
    • Count total flops on each list per vendor
    • Use external metrics: price, ownership cost, power, …
    • Select some numbers for comparison
    • Focus on growth trends over time
• HPCC
  – Performance is represented by multiple metrics
  – Benchmark is new — so data is available for a limited time period (2003-2007)
  – Problem: there cannot be one “winner”
  – We avoid “composite” benchmarks
  – HPCC can be used to show complicated kernel/architecture performance characterizations
  – Use of Kiviat charts
    • Best when showing the differences due to a single independent “variable”
    • Over time — also focus on growth trends

High Productivity Computing Systems (HPCS)

Goal: Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2010)

Impact:
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

HPCS program focus areas: industry R&D, analysis & assessment, performance characterization & prediction, programming models, hardware technology, system architecture, and software technology

Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology

Fill the critical technology and capability gap: today (late 80’s HPC technology) … to … future (quantum/bio computing)

HPCS Program Phases I-III

[Figure: HPCS program phases I-III on a CY 2002-2011 timeline: Phase I concept study (five vendors funded), Phase II industry R&D (three vendors funded), Phase III prototype development; industry milestones 1-7 with SCR, PDR, CDR, and DRR reviews plus early and final demos; HPLS plan and software releases 1-3; productivity assessment (MIT LL, DOE, DoD, NASA, NSF); mission partner peta-scale development-unit procurements, system and application commitments, MP language development, and delivery of peta-scale units.]

HPCS Benchmark and Application Spectrum

[Figure: the HPCS benchmark and application spectrum, spanning execution indicators to development indicators: the 8 HPCchallenge benchmarks (local: DGEMM, STREAM, RandomAccess, 1D FFT; global: Linpack, PTRANS, RandomAccess, 1D FFT) form a spanning set of kernels that set execution bounds; ~40 micro and kernel benchmarks and ~10 compact applications cover discrete math, graph analysis, pattern matching, signal processing, linear solvers, I/O, and simulation; ~9 simulation applications plus existing, emerging, and future codes and classroom experiments complete the spectrum, up to petascale HPCS applications in simulation, intelligence, and reconnaissance.]

The spectrum of benchmarks provides different views of the system:
• HPCchallenge pushes spatial and temporal boundaries; sets performance bounds
• Applications drive system issues; set legacy code performance bounds
• Kernels and compact apps allow deeper analysis of execution and development time

Motivation of the HPCC Design

[Figure: the HPCC tests placed on a plane of spatial locality (x axis, low to high) and temporal locality (y axis, low to high): RandomAccess low on both axes, STREAM high spatial but low temporal, FFT high temporal but lower spatial, DGEMM and HPL high on both, PTRANS in between, with the mission partner applications spanning the middle of the plane.]

Measuring Spatial and Temporal Locality (1/2)

Generated by PMaC @ SDSC

[Figure: scatter plot of temporal locality versus spatial locality (both 0.0-1.0) for the HPC Challenge benchmarks (HPL, FFT, RandomAccess, STREAM) and select applications (Gamess, Overflow, HYCOM, Test3D, OOCore, RFCTH2, AVUS, CG): HPL sits highest in temporal locality, RandomAccess near the origin, and STREAM at maximal spatial but minimal temporal locality.]

• Spatial and temporal data locality here is for one node/processor — i.e., locally or “in the small”

Measuring Spatial and Temporal Locality (2/2)

Generated by PMaC @ SDSC

[Figure: the same locality scatter plot annotated by region: high temporal locality gives good performance on cache-based systems (HPL); high spatial locality with sufficient temporal locality gives satisfactory performance on cache-based systems (the applications); spatial locality can also occur in registers; high spatial locality alone gives moderate performance (STREAM); no temporal or spatial locality gives poor performance on cache-based systems (RandomAccess).]

Supercomputing Architecture Issues

[Figure: the standard parallel architecture (CPUs, each with RAM and disk, joined by a network switch) beside the corresponding memory hierarchy: registers, cache (instruction operands), local memory (blocks), remote memory (messages), and disk (pages). Going up the hierarchy, bandwidth and programmability increase; going down, latency and capacity increase.]

• The standard architecture produces a “steep” multi-layered memory hierarchy
  – The programmer must manage this hierarchy to get good performance
• HPCS technical goal
  – Produce a system with a “flatter” memory hierarchy that is easier to program

HPCS Performance Targets

• HPL (Top500: solves a system Ax = b): 2 Petaflop/s (8x improvement)
• STREAM (vector operations A = B + s × C): 6.5 Petabyte/s (40x improvement)
• FFT (1D Fast Fourier Transform Z = FFT(X)): 0.5 Petaflop/s (200x improvement)
• RandomAccess (random updates T(i) = XOR(T(i), r)): 64,000 GUPS (2000x improvement)

Each benchmark corresponds to a different level of the memory hierarchy, from registers and cache through local memory bandwidth and latency to remote memory and disk.

• The HPCS program has developed a new suite of benchmarks (HPC Challenge)
• Each benchmark focuses on a different part of the memory hierarchy
• The HPCS program performance targets will flatten the memory hierarchy, improve real application performance, and make programming easier

Official HPCC Submission Process

Prerequisites: a C compiler, BLAS, and MPI.

1. Download
2. Install
3. Run
4. Upload results
5. Confirm via email

Optionally:

6. Tune
7. Run
8. Upload results
9. Confirm via email

Submitters provide a detailed description of the installation and execution environment. For tuned runs, only some routines can be replaced, the data layout needs to be preserved, and multiple languages can be used.

Results are immediately available on the web site: interactive HTML, XML, MS Excel, and Kiviat charts (radar plots).

HPCC as a Framework (1/2)

• Many of the component benchmarks were widely used before the HPC Challenge suite was assembled
  – HPC Challenge has been more than a packaging effort
  – Almost all component benchmarks were augmented from their original form to provide consistent verification and reporting
• We stress the importance of running these benchmarks on a single machine, with a single configuration and set of options
  – The benchmarks were already useful separately to the HPC community
  – The unified HPC Challenge framework creates an unprecedented view of the performance characterization of a system
• A comprehensive view with data captured under the same conditions allows for a variety of analyses depending on end user needs

HPCC as a Framework (2/2)

• HPCC unifies a number of existing (and well known) codes in one consistent framework
• A single executable is built to run all of the components
  – Easy interaction with batch queues
  – All codes are run under the same OS conditions, just as an application would be
    • No special mode (page size, etc.) for just one test (say the Linpack benchmark)
    • Each test may still have its own set of compiler flags
      – Changing compiler flags in the same executable may inhibit inter-procedural optimization
• Why not use a script and a separate executable for each test?
  – Lack of enforced integration between components
    • Ensure reasonable data sizes
    • Either all tests pass and produce meaningful results or failure is reported
  – Running a single component of HPCC for testing is easy enough

Base vs. Optimized Run

• HPC Challenge encourages users to develop optimized benchmark codes that use architecture-specific optimizations to demonstrate the best system performance
• Meanwhile, we are interested in both
  – The base run with the provided reference implementation
  – An optimized run
• The base run represents the behavior of legacy code because
  – It is conservatively written using only widely available programming languages and libraries
  – It reflects a commonly used approach to parallel processing, sometimes referred to as hierarchical parallelism, that combines
    • Message Passing Interface (MPI)
    • OpenMP threading (see the sketch below)
  – We recognize the limitations of the base run and hence we encourage optimized runs
• Optimizations may include alternative implementations in different programming languages using parallel environments available specifically on the tested system
• We require that information about the changes made to the original code be submitted together with the benchmark results
  – We understand that full disclosure of optimization techniques may sometimes be impossible
  – We request at a minimum some guidance for users who would like to use similar optimizations in their applications
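To make the hierarchical-parallelism point concrete, here is a minimal hybrid MPI + OpenMP sketch in the style the base run reflects; it is an illustration only, not part of the HPCC sources, and the loop body is an arbitrary stand-in computation:

/* Minimal hybrid MPI + OpenMP sketch: threads within a node, MPI across nodes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0;
    /* OpenMP exploits parallelism within a node ... */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (1.0 + i);

    /* ... while MPI combines results across nodes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f (from %d ranks)\n", global, size);
    MPI_Finalize();
    return 0;
}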

HPCC Base Submission

Requirements for a base submission:
1. Publicly available code is required: a C compiler, MPI 1.1, and BLAS
2. Source code cannot be changed for the submission run
3. Linked libraries have to be publicly available
4. The code contains optimizations for contemporary hardware systems
5. Conditional compilation and runtime algorithm selection allow performance tuning

This is to mimic legacy applications’ performance:
1. Reasonable software dependences
2. Code cannot be changed due to complexity and maintenance cost
3. Relies on publicly available software
4. Some optimization has been done on various platforms
5. Algorithmic variants provided for performance portability

The baseline code has over 10k SLOC — there must be a more productive way of coding

HPCC Optimized Submission

• Timed portions of the code may be replaced with optimized code
• Verification code still has to pass
  – Must use the same data layout or pay the cost of redistribution
  – Must use sufficient precision to pass the residual checks
• Allows use of new parallel programming technologies
  – New paradigms, e.g., one-sided communication in MPI-2: MPI_Win_create(…); MPI_Get(…); MPI_Put(…); MPI_Win_fence(…)
  – New languages, e.g., UPC: shared pointers, upc_memput()
• Code for the optimized portion may be proprietary but needs to use publicly available libraries
• Optimizations need to be described, though not necessarily in detail, for possible use in application tuning
• Attempting to capture: invested effort per flop rate gained
  – Hence the need for the baseline submission
• There can be more than one optimized submission for a single base submission (if a given architecture allows for many optimizations)
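As an illustration of the one-sided MPI-2 calls named above, here is a minimal sketch; the window size, neighbor targeting, and displacement are made up for the example and do not come from any actual submission:

/* One-sided MPI-2 sketch: each rank exposes a window and MPI_Puts into a neighbor. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long local[1024] = {0};        /* memory exposed to other ranks */
    MPI_Win win;
    MPI_Win_create(local, sizeof(local), sizeof(long long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open an access epoch */
    long long val = rank + 1;
    int target = (rank + 1) % size;     /* write into the neighbor's window */
    MPI_Put(&val, 1, MPI_LONG_LONG, target, rank % 1024, 1,
            MPI_LONG_LONG, win);
    MPI_Win_fence(0, win);              /* close epoch: puts are now visible */

    if (rank == 0) printf("local[%d] = %lld\n", (size - 1) % 1024, local[(size - 1) % 1024]);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}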

Base Submission Rules Summary

• Compile and load options
  – Compiler or loader flags which are supported and documented by the supplier are allowed
  – These include porting, optimization, and preprocessor invocation
• Libraries
  – Linking to optimized versions of the following libraries is allowed: BLAS, MPI, FFT
  – Acceptable use of such libraries is subject to the following rules:
    • All libraries used shall be disclosed with the results submission. Each library shall be identified by library name, revision, and source (supplier). Libraries which are not generally available are not permitted unless they are made available by the reporting organization within 6 months.
    • Calls to library subroutines should have equivalent functionality to that in the released benchmark code. Code modifications to accommodate various library call formats are not allowed.

Optimized Submission Rules Summary

• Only the computational (timed) portion of the code can be changed
  – Verification code cannot be changed
• Calculations must be performed in 64-bit precision or the equivalent
  – Codes with limited calculation accuracy are not permitted
• All algorithm modifications must be fully disclosed and are subject to review by the HPC Challenge Committee
  – Passing the verification test is a necessary condition for such an approval
  – The replacement algorithm must be as robust as the baseline algorithm
    • For example, the Strassen algorithm may not be used for the matrix multiply in the HPL benchmark, as it changes the operation count of the algorithm
• Any modification of the code or input data sets which utilizes knowledge of the solution or of the verification test is not permitted
• Any code modification to circumvent the actual computation is not permitted

HPCC Tests at a Glance

1. HPL
2. DGEMM
3. STREAM
4. PTRANS
5. RandomAccess
6. FFT
7. b_eff

• Scalable framework — a unified benchmark framework
  – By design, the HPC Challenge benchmarks are scalable, with the sizes of the data sets being a function of the largest HPL matrix for the tested system

HPCC Testing Scenarios

1. Local: only a single process computes
2. Embarrassingly parallel: all processes compute and do not communicate (explicitly)
3. Global: all processes compute and communicate
4. Network only

[Figure: each scenario drawn as processes (P) with memory (M) attached to a network.]

Naming Conventions

A result name combines a prefix, a test name, and (for some tests) a suffix:
• Prefix: S (single process), EP (embarrassingly parallel), or G (global)
• Test: HPL, STREAM, FFT, RandomAccess, …
• Suffix: system, CPU, or thread

Examples:
1. G-HPL
2. S-STREAM-system

HPCC Tests - HPL

• HPL = High Performance Linpack
• Objective: solve a system of linear equations Ax = b, A ∈ R^(n×n), x, b ∈ R^n
• Method: LU factorization with partial row pivoting
• Performance: (2/3 n³ + 3/2 n²) / t
• Verification: the scaled residual must be small: ||Ax − b|| / (ε ||A|| ||x|| n)
• Restrictions:
  – No complexity-reducing matrix multiply (Strassen, Winograd, etc.)
  – 64-bit precision arithmetic throughout (no mixed precision with iterative refinement)
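A minimal serial sketch of the kind of scaled-residual check described above; this is illustrative only, and the actual HPL verification code differs in details such as the exact norms used:

/* Scaled residual r = ||Ax - b||_inf / (eps * ||A||_inf * ||x||_inf * n); r should be O(1). */
#include <float.h>
#include <math.h>

double hpl_style_residual(int n, const double *A, const double *x, const double *b) {
    double rmax = 0.0, anorm = 0.0, xmax = 0.0;
    for (int i = 0; i < n; ++i) {
        double ax = 0.0, rowsum = 0.0;
        for (int j = 0; j < n; ++j) {
            ax += A[i * n + j] * x[j];          /* (Ax)_i */
            rowsum += fabs(A[i * n + j]);       /* row sum for ||A||_inf */
        }
        if (fabs(ax - b[i]) > rmax) rmax = fabs(ax - b[i]);
        if (rowsum > anorm) anorm = rowsum;
        if (fabs(x[i]) > xmax) xmax = fabs(x[i]);
    }
    return rmax / (DBL_EPSILON * anorm * xmax * (double)n);
}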

HPL: TOP500 and HPCC Implementation Comparison

|                               | TOP500       | HPCC base      | HPCC optimized |
| Disclose code                 | Not required | Reference code | Describe       |
| Floating-point in 64 bits     | Yes          | Yes            | Yes            |
| Numerics of solution reported | No           | Yes            | Yes            |
| N-half reported               | Yes          | No             | No             |
| Reporting frequency           | Bi-annual    | Continuous     | Continuous     |

HPCC HPL: Further Details

• High Performance Linpack (HPL) solves a system Ax = b
• The core operation is an LU factorization of a large M×M matrix
• Results are reported in floating point operations per second (flop/s)

[Figure: the parallel algorithm factors A into L and U over a 2D block-cyclic process grid (P0-P3); the block-cyclic distribution is used for load balancing, and the kernel exercises the hierarchy from registers down to remote memory.]

• Linear system solver (requires all-to-all communication)
• Stresses local matrix multiply performance
• DARPA HPCS goal: 2 Pflop/s (8x over current best)

HPCC Tests - DGEMM

• DGEMM = Double-precision General Matrix-matrix Multiply
• Objective: compute the matrix update C ← αAB + βC, A, B, C ∈ R^(n×n), α, β ∈ R
• Method: standard multiply (may be optimized)
• Performance: 2n³/t
• Verification: the scaled residual has to be small: ||x − y|| / (ε n ||y||), where x and y are the vectors resulting from multiplying the left- and right-hand sides of the objective expression by a random vector
• Restrictions:
  – No complexity-reducing matrix multiply (Strassen, Winograd, etc.)
  – Use only 64-bit precision arithmetic
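A hedged sketch of the timed operation through a CBLAS interface; the header name varies by BLAS vendor and is an assumption here, and any optimized BLAS can be linked in:

/* Timed operation C <- alpha*A*B + beta*C; reported rate is 2*n^3 / t. */
#include <cblas.h>   /* header name is vendor-dependent; an assumption */
#include <stdlib.h>

void run_dgemm(int n, double alpha, double beta) {
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    for (size_t i = 0; i < (size_t)n * n; ++i) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
        C[i] = (double)rand() / RAND_MAX;
    }
    /* the timed BLAS call */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);
    free(A); free(B); free(C);
}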

HPCC Tests - STREAM

• STREAM is a test that measures sustainable memory bandwidth (in GByte/s) and the corresponding computation rate for four simple vector kernels
• Objective: set a vector to a combination of other vectors
  – COPY: c = a
  – SCALE: b = α c
  – ADD: c = a + b
  – TRIAD: a = b + α c
• Method: a simple loop that preserves the above order of operations
• Performance: 2n/t or 3n/t
• Verification: the scaled residual of the computed and reference vectors needs to be small: ||x − y|| / (ε n ||y||)
• Restrictions:
  – Use only 64-bit precision arithmetic
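A minimal single-threaded sketch of the TRIAD kernel and the bandwidth arithmetic (three 8-byte arrays touched per iteration); the real STREAM repeats each kernel several times and reports the best time, which this sketch omits:

/* TRIAD kernel a = b + alpha*c with a simple bandwidth computation. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
           *c = malloc(N * sizeof *c);
    double alpha = 3.0;
    for (long i = 0; i < N; ++i) { b[i] = 2.0; c[i] = 1.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; ++i)       /* TRIAD */
        a[i] = b[i] + alpha * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    /* 3 arrays * 8 bytes per element moved */
    printf("TRIAD: %.2f GB/s\n", 3.0 * N * sizeof(double) / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}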

Original and HPCC STREAM Codes

|                      | Original STREAM                    | HPCC base                   | HPCC optimized              |
| Programming language | Fortran (or C)                     | C only                      | Any                         |
| Data alignment       | Implicit                           | OS-specific                 | OS-specific                 |
| Vectors’ size        | Exceeds caches, set at compile time | Whole memory, set at runtime | Whole memory, set at runtime |
| Reporting multiple of number of processors | No           | Possible                    | Possible                    |

HPCC STREAM: Further Details

• Performs a scalar multiply and add
• Results are reported in bytes/second

[Figure: the parallel algorithm computes A = B + s × C on vectors partitioned across the Np processes; the work is purely local.]

• Basic operations on large vectors (requires no communication)
• Stresses local processor-to-memory bandwidth
• DARPA HPCS goal: 6.5 Pbyte/s (40x over current best)

HPCC Tests - PTRANS

• PTRANS = Parallel TRANSpose
• Objective: update a matrix with the sum of its transpose and another matrix: A = Aᵀ + B, A, B ∈ R^(n×n)
• Method: standard distributed-memory algorithm
• Performance: n²/t
• Verification: the scaled residual between the computed and reference matrices needs to be small: ||A₀ − A|| / (ε n ||A₀||)
• Restrictions:
  – Use only 64-bit precision arithmetic
  – The same data distribution method as HPL

HPCC PTRANS: Further Details

[Figure: a 9×9 matrix distributed over two different process grids for the transpose.]

• 9×9 matrix on a 3×3 process grid: the communicating pairs are 2-4, 3-7, and 6-8
• 9×9 matrix on a 1×9 process grid: the communicating pairs are 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 1-8, 1-9, 2-3, …, 8-9: 36 pairs!

HPCC Tests - RandomAccess

• RandomAccess calculates a series of integer updates to random locations in memory
• Objective: perform the following computation on the table Table:
  Ran = 1;
  for (i = 0; i < 4 * N; ++i) {
    Ran = (Ran << 1) ^ (((int64_t)Ran < 0) ? 7 : 0);
    Table[Ran & (N - 1)] ^= Ran;
  }
• Method: loop iterations may be independent
• Performance: 4N/t
• Verification: up to 1% of updates can be incorrect
• Restrictions:
  – Use at least 64-bit integers
  – About half of memory used for ‘Table’
  – Parallel look-ahead limited to 1024 (limits locality)
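The update loop above made self-contained and runnable as a serial sketch; N must be a power of two so that Ran & (N − 1) is a valid table index:

/* Serial RandomAccess kernel; matches the update loop on this slide. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const uint64_t N = 1ULL << 20;           /* table size, a power of two */
    uint64_t *Table = malloc(N * sizeof *Table);
    for (uint64_t i = 0; i < N; ++i) Table[i] = i;

    uint64_t Ran = 1;
    for (uint64_t i = 0; i < 4 * N; ++i) {   /* 4N random updates */
        Ran = (Ran << 1) ^ (((int64_t)Ran < 0) ? 7 : 0);
        Table[Ran & (N - 1)] ^= Ran;
    }
    printf("Table[0] = %llu\n", (unsigned long long)Table[0]);
    free(Table);
    return 0;
}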

HPCC RandomAccess: Further Details

• Randomly updates an N element table of unsigned integers
• Each processor generates indices, sends them to all other processors, and performs the XOR updates
• Results are reported in Giga Updates Per Second (GUPS)

[Figure: the parallel algorithm: each of the Np processes generates random indices into the distributed table, then sends, XORs, and updates; the traffic reaches from registers to remote memory.]

• Randomly updates memory (requires all-to-all communication)
• Stresses interprocessor communication of small messages
• DARPA HPCS goal: 64,000 GUPS (2000x over current best)

HPCC Tests - FFT

• FFT = Fast Fourier Transform
• Objective: compute the discrete Fourier transform z_k = Σ_j x_j exp(−2π√−1 jk/n), x, z ∈ Cⁿ
• Method: any standard framework (may be optimized)
• Performance: 5n log₂ n / t
• Verification: the scaled residual for the inverse transform of the computed vector needs to be small: ||x − x⁽⁰⁾|| / (ε log₂ n)
• Restrictions:
  – Use only 64-bit precision arithmetic
  – The result needs to be in order (not bit-reversed)

HPCC FFT: Further Details

• 1D Fast Fourier Transform of an N element complex vector
• Typically done as a parallel 2D FFT: FFT the rows, perform a corner turn, FFT the columns
• Results are reported in floating point operations per second (flop/s)

[Figure: the parallel algorithm transposes the vector, viewed as a matrix, across the Np processes (the corner turn) between the row and column FFTs.]

• FFT of a large complex vector (requires all-to-all communication)
• Stresses interprocessor communication of large messages
• DARPA HPCS goal: 0.5 Pflop/s (200x over current best)

HPCC Tests – b_eff

• b_eff measures the effective bandwidth and latency of the interconnect
• Objective: exchange 8-byte (for latency) and 2,000,000-byte (for bandwidth) messages in ping-pong, natural ring, and random ring patterns
• Method: use standard MPI point-to-point routines
• Performance: n/t (for bandwidth)
• Verification: a simple checksum on the received bits
• Restrictions:
  – The messaging routines have to conform to the MPI standard

HPCC and Computational Resources

[Figure: computational resources and the tests that probe them: CPU computational speed by HPL (Jack Dongarra); node memory bandwidth by STREAM (John McCalpin); interconnect bandwidth and latency by the random and natural ring tests (Rolf Rabenseifner).]

Random and Natural Ring B/W and Latency

• Parallel communication pattern on all MPI processes

– Natural ring

– Random ring

• Bandwidth per process
  – Accumulated message size / wall-clock time / number of processes
  – On each connection, messages flow in both directions
  – Both 2×MPI_Sendrecv and MPI non-blocking are tried; the best result is used
  – Message size = 2,000,000 bytes

• Latency
  – Same patterns, message size = 8 bytes
  – Wall-clock time / (number of sendrecv operations per process)
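A hedged sketch of one natural-ring step with MPI_Sendrecv, the flavor of exchange these tests time; the real b_eff also tries non-blocking variants and random ring orderings, repeats the exchange, and keeps the best result, all of which this sketch omits:

/* One natural-ring exchange step, timed. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG 2000000   /* bytes, as in the bandwidth measurement above */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sbuf = malloc(MSG), *rbuf = malloc(MSG);
    memset(sbuf, 1, MSG);
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    double t0 = MPI_Wtime();
    /* Each process sends to its right neighbor and receives from the left. */
    MPI_Sendrecv(sbuf, MSG, MPI_BYTE, right, 0,
                 rbuf, MSG, MPI_BYTE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t = MPI_Wtime() - t0;
    if (rank == 0)   /* 2*MSG bytes moved per process (send + receive) */
        printf("ring step: %.3f GB/s per process\n", 2.0 * MSG / t / 1e9);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}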

Inter-node B/W on Clusters of SMP Nodes

• Random Ring
  – Reflects the other dimension of a Cartesian domain decomposition, and communication patterns in unstructured grids
  – Some connections are inside the nodes
  – Most connections are inter-node
  – Depends on #nodes and #MPI processes per node

Accumulated Inter-node B/W on Clusters of SMPs

• Accumulated bandwidth := bandwidth per process × #processes
• ≈ accumulated inter-node bandwidth / (1 − 1/#nodes)
• Similar to bisection bandwidth

HPCC Awards Overview

• Goals
  – Increase awareness of HPCC
  – Increase awareness of HPCS and its goals
  – Increase the number of HPCC submissions
    • Expanded view of the largest supercomputing installations
• Means
  – HPCwire sponsorships and press coverage
  – HPCS mission partners’ contribution
  – HPCS vendors’ contribution

HPCC Awards Rules for SC07

• Class 1: Best Performance
  – Figure of merit: raw system performance
  – Submission must be a valid HPCC database entry
  – Side effect: populate the HPCC database
  – 4 categories: HPL, STREAM-system, RandomAccess, FFT
  – Award certificates plus 4× $750 from IDC (was $500)
• Class 2: Most Productivity
  – Figure of merit: performance (50%) and elegance (50%)
    • Highly subjective
    • Based on a committee vote
  – Submission must implement at least 4 out of 4 Class 1 tests
    • The more tests the better
    • Performance numbers are a plus
  – The submission process: source code, a “marketing brochure,” and an SC07 BOF presentation
  – Award certificate plus $2,000 from IDC (was $1,500)

HPCC Awards Committee

• David Bailey, LBNL NERSC
• Jack Dongarra (Co-Chair), Univ. of Tennessee/ORNL
• Jeremy Kepner (Co-Chair), MIT Lincoln Lab
• Bob Lucas, USC/ISI
• Rusty Lusk, Argonne National Lab
• Piotr Luszczek, Univ. of Tennessee
• John McCalpin, AMD
• Rolf Rabenseifner, HLRS Stuttgart
• Daisuke Takahashi, Univ. of Tsukuba
• Jeffrey Vetter, ORNL

SC|05 HPCC Awards Class 1 - HPL

HPCS goal: 2000 Tflop/s (about 7x above the best submission)
TOP500 leader: 280 Tflop/s

1. IBM BG/L (LLNL): 259 Tflop/s
2. IBM BG/L (Watson): 67 Tflop/s
3. IBM Power5 (LLNL): 58 Tflop/s

[Figure: HPL submissions over time, from 110 Gflop/s in May 2003, through the initial HPCC awards announcement, to 259 Tflop/s by SC|05. TOP500 systems in the HPCC database: #1, #2, #3, #4, #10, #14, #17, #35, #37, #71, #80.]

SC|05 HPCC Awards Class 1 – STREAM-sys

HPCS goal: 6500 TB/s (about 40x above the best submission)

1. IBM BG/L (LLNL): 160 TB/s
2. IBM Power5 (LLNL): 55 TB/s
3. IBM BG/L (Watson): 40 TB/s

[Figure: STREAM-system submissions over time, from 27 GB/s in May 2003 to 160 TB/s by SC|05.]

SC|05 HPCC Awards Class 1 – FFT

HPCS goal: 500 Tflop/s (about 200x above the best submission)

1. IBM BG/L (LLNL): 2.3 Tflop/s (2311 Gflop/s)
2. IBM BG/L (Watson): 1.1 Tflop/s
3. IBM Power5 (LLNL): 1.0 Tflop/s

[Figure: FFT submissions over time, from 4 Gflop/s in May 2003 to 2311 Gflop/s by SC|05.]

SC|05 HPCC Awards Class 1 – RandomAccess

HPCS goal: 64,000 GUPS (about 1800x above the best submission)

1. IBM BG/L (LLNL): 35 GUPS
2. IBM BG/L (Watson): 17 GUPS
3. Cray X1E (ORNL): 8 GUPS

[Figure: RandomAccess submissions over time, from 0.01 GUPS in May 2003 to 35 GUPS by SC|05.]

SC|05 HPCC Awards: Class 2

Submissions by language, with the number of the four Class 1 tests (HPL, RandomAccess, STREAM, FFT) implemented:

• Sample submissions from committee members: Python+MPI (2 tests), pMatlab (all 4), Cray MTA C (2)
• Winners: UPC ×3 (3 tests each)
• Finalists: Cilk (all 4), Parallel Matlab (all 4)
• Further submissions, each covering 2 tests: MPT C, OpenMP/C++, StarP, HPF

SC06 HPCC Awards: Class 2

1. Best overall productivity: Cilk (Bradley Kuszmaul)
2. Best productivity in performance: UPC (Calin Cascaval)
3. Best productivity and elegance: Parallel MATLAB (Cleve Moler)
4. Best student paper: MC# (Vadim Guzev)
5. Honorable mention: Chapel (Brad Chamberlain) and X10 (Vivek Sarkar)

SC07 HPCC Awards BOF

Awards will be presented at the SC07 HPC Challenge BOF

Tuesday, November 13, 2007, 12:15-1:15, in Rooms C1/C2/C3

Reporting Results - Etiquette

• The HPC Challenge benchmark suite has been designed to permit academic-style usage for comparing
  – Technologies
  – Architectures
  – Programming models
• There is an overt attempt to keep HPC Challenge significantly different from “commercialized” benchmark suites
  – Vendors and users can submit results
  – System “cost/price” is intentionally not included
  – No “composite” benchmark metric
• Be cool about comparisons!
• While we cannot enforce any rule to limit comparisons, observe the rules of
  – Academic honesty
  – Good taste

Total Number of HPCC Submissions

[Figure: the cumulative number of HPCC submissions, rising from zero on 2003-05-09 to more than 110 by 2006-10-10.]

Base vs. Optimized Submission

• Optimized G-RandomAccess is a UPC code – ~125x improvement

| System | Speed | Count | Run | G-HPL (TFlop/s) | G-PTRANS (GB/s) | G-RandomAccess (Gup/s) | G-FFTE (GFlop/s) | G-STREAM Triad (GB/s) | EP-STREAM Triad (GB/s) | EP-DGEMM (GFlop/s) | Random Ring B/W (GB/s) | Random Ring Latency (usec) |
| Cray X1E mfeg8 | 1.13 GHz | 248 | opt | 3.3889 | 66.01 | 1.85475 | -1 | 3280.9 | 13.229 | 13.564 | 0.29886 | 14.58 |
| Cray X1E MSP | 1.13 GHz | 252 | base | 3.1941 | 85.204 | 0.014868 | 15.54 | 2440 | 9.682 | 14.185 | 0.36024 | 14.93 |

Exploiting Hybrid Programming Model

• The NEC SX-7 architecture permits choosing the mix of threads and processes, which can significantly enhance performance of the EP versions of the benchmark suite by allocating more powerful “nodes”
  – EP-STREAM
  – EP-DGEMM

| System | Speed | Count | Threads | Processes | G-HPL (TFlop/s) | G-PTRANS (GB/s) | G-RandomAccess (Gup/s) | G-FFTE (GFlop/s) | G-STREAM Triad (GB/s) | EP-STREAM Triad (GB/s) | EP-DGEMM (GFlop/s) | Random Ring B/W (GB/s) | Random Ring Latency (usec) |
| NEC SX-7 | 0.552 GHz | 32 | 16 | 2 | 0.2174 | 16.34 | 0.000178 | 1.34 | 984.3 | 492.161 | 140.636 | 8.14753 | 4.85 |
| NEC SX-7 | 0.552 GHz | 32 | 1 | 32 | 0.2553 | 20.546 | 0.000964 | 11.29 | 836.9 | 26.154 | 8.239 | 5.03934 | 14.21 |

Kiviat Charts: Comparing Interconnects

• AMD Opteron clusters
  – 2.2 GHz
  – 64-processor cluster
• Interconnects
  1. GigE
  2. Commodity
  3. Vendor
• The interconnects cannot be differentiated based on:
  – HPL
  – Matrix-matrix multiply
• Kiviat chart (radar plot) available on the HPCC website

Augmenting TOP500’s 26th Edition with HPCC

Computer Rmax(Tflop/s) HPL(Tflop/s) PTRANS(GB/s) STREAM(TB/s) FFT(Gflop/s) GUPS Latency(usec) B/W(GB/s)

1 BlueGene/L 281 259 374 160 2311 35.5 6 0.2

2 BGW 91 84 172 50 84 21.6 5 0.2

3 ASC Purple 63 58 576 44 967 0.2 5 3.2

4 Columbia 52 47 91 21 230 0.2 4 1.4

5 Thunderbird 38

6 Red Storm 36 33 1813 44 1118 1.0 8 1.2

7 Earth Simulator 36

8 MareNostrum 28

9 Stella 27

10 Jaguar 21 20 944 29 855 0.7 7 1.2

Augmenting TOP500’s 27th Edition with HPCC

Computer Rmax(Tflop/s) HPL(Tflop/s) PTRANS(GB/s) STREAM(TB/s) FFT(Gflop/s) GUPS Latency(usec) B/W(GB/s)

1 BlueGene/L 280.6 259.2 4665.9 160 2311 35.47 5.92 0.159

2 BGW (*) 91 83.9 171.55 50 1235 21.61 4.70 0.159

3 ASC Purple 75.8 57.9 553 55 842 1.03 5.1 3.184

4 Columbia (*) 51.87 46.78 91.31 20 229 0.25 4.23 0.896

5 Tera-10 42.9

6 Thunderbird 38.27

7 Fire x4600 38.18

8 eServer BlueGene 37.33

9 Red Storm 36.19 32.99 1813.06 43.58 1118 1.02 7.97 1.149

10 Earth Simulator 35.86

Normalize with Respect to Peak flop/s

Computer HPL RandomAccess STREAM FFT

Cray XT3 81.4% 0.031 1168.8 38.3

Cray X1E 67.3% 0.422 696.1 13.4

IBM POWER5 53.5% 0.003 703.5 15.5

IBM BG/L 70.6% 0.089 435.7 6.1

SGI Altix 71.9% 0.003 308.7 3.5

NEC SX-8 86.9% 0.002 2555.9 17.5

Normalizing with peak flop/s cancels out number of processors:

G-HPL / Rpeak = (local-HPL*P) / (local-Rpeak*P) = local-HPL / local-Rpeak

Good: one can compare systems with different numbers of processors. Bad: it is easy to scale peak and harder to scale a real calculation.

Correlation: HPL vs. Theoretical Peak

[Figure: scatter plot of theoretical peak (Tflop/s) versus HPL (Tflop/s); Cray XT3, NEC SX-8, and SGI Altix are labeled.]

• How well does HPL data correlate with theoretical peak performance?

Correlation: HPL vs. DGEMM

[Figure: scatter plot of aggregate DGEMM (Gflop/s) versus HPL (Tflop/s); Cray XT3, NEC SX-8, and SGI Altix are labeled.]

• Can I just run DGEMM instead of HPL?
• DGEMM alone overestimates HPL performance
• Note the 1,000x difference in scales! (Tera/Giga)
• Exercise: correlate HPL versus Processors×DGEMM

Correlation: HPL vs. STREAM-Triad

[Figure: scatter plot of STREAM Triad (GB/s) versus HPL (Tflop/s); Cray XT3, NEC SX-8, Cray X1E (optimized), and SGI Altix are labeled.]

• How well does HPL correlate with STREAM-Triad performance?

Correlation: HPL vs. RandomAccess

[Figure: scatter plot of G-RandomAccess (GUPS) versus HPL (Tflop/s); Cray X1E (optimized), Cray XT3, Rackable, SGI Altix, IBM BG/L, and NEC SX-8 are labeled.]

• How well does HPL correlate with G-RandomAccess performance?
• Note the 1,000x difference in scales! (Tera/Giga)

Correlation: HPL vs. FFT

[Figure: scatter plot of FFT (Gflop/s) versus HPL (Tflop/s); Cray XT3, NEC SX-8, and SGI Altix are labeled.]

• How well does HPL correlate with FFT performance?
• Note the 1,000x difference in scales! (Tera/Giga)

Correlation: STREAM vs. PTRANS

[Figure: scatter plot of PTRANS (GB/s) versus global STREAM (GB/s); Cray XT3 and NEC SX-8 are labeled.]

• How well does STREAM data correlate with PTRANS performance?

Correlation: RandomRing B/W vs. PTRANS

[Figure: scatter plot of PTRANS (GB/s) versus RandomRing bandwidth (GB/s); Cray XT3, NEC SX-8, and NEC SX-7 are labeled.]

• How well does RandomRing bandwidth data correlate with PTRANS performance?
• Possible bad data?

Correlation: #Processors vs. RandomAccess

[Figure: scatter plot of G-RandomAccess (GUPS) versus number of processors; Cray X1E (optimized), Cray XT3, Rackable, IBM BG/L, and SGI Altix are labeled.]

• Does G-RandomAccess scale with the number of processors?

Correlation: Random-Ring vs. RandomAccess

[Figure: scatter plot of G-RandomAccess (GUPS) versus RandomRing bandwidth (GB/s); Cray X1E (optimized), NEC SX-8, and NEC SX-7 are labeled.]

• Does G-RandomAccess scale with the RandomRing bandwidth?
• Possible bad data?

Correlation: Random-Ring vs. RandomAccess

[Figure: the same scatter plot restricted to RandomRing bandwidths of 0-2 GB/s; Cray X1E (optimized) is labeled.]

• Does G-RandomAccess scale with RandomRing bandwidth?
• Ignoring possible bad data…

Correlation: Random-Ring vs. RandomAccess

[Figure: scatter plot of G-RandomAccess (GUPS) versus RandomRing latency (usec); Cray X1E (optimized) and Rackable are labeled.]

• Does G-RandomAccess scale with RandomRing latency?

Correlation: RandomAccess Local vs. Global

[Figure: scatter plot of G-RandomAccess (GUPS) versus single-processor RandomAccess (GUPS), per system; Cray X1E (optimized), Cray XT3, and Rackable are labeled.]

• Does G-RandomAccess scale with single-processor RandomAccess performance (per system)?

Correlation: RandomAccess Local vs. Global

[Figure: the same comparison per processor; Cray X1E (optimized) and Rackable are labeled.]

• Does G-RandomAccess scale with single-processor RandomAccess performance?

Correlation: STREAM-Triad vs. RandomAccess

[Figure: scatter plot of G-RandomAccess (GUPS) versus STREAM Triad (GB/s); Cray X1E (optimized), Cray XT3, Rackable, IBM BG/L, SGI Altix, and NEC SX-8 are labeled.]

• Does G-RandomAccess scale with STREAM Triad?

RandomAccess Correlations Summary

• G-RandomAccess doesn’t correlate with the other tests
• The biggest improvement comes from code optimization
  – The limit on parallel look-ahead forces short messages
    • A typical MPI performance killer
    • Communication/computation overlap is wasted by message handling overhead
  – A UPC implementation can be integrated with an existing MPI code base to yield orders-of-magnitude speedup
  – Using the interconnect topology and a lower-level (less overhead) messaging layer is also a big win
    • Generalization from the 3D torus: a hyper-cube algorithm

Principal Component Analysis

• Correlates all the tests simultaneously
• Load vectors hint at the relationship between the original results and the principal components
• Initial analysis:
  – 29 tests: Rpeak, HPL, PTRANS (×1); DGEMM (×2); FFT, RandomAccess (×3); STREAM (×8); b_eff (×10)
  – 103 systems; each must have all 29 valid (up-to-date) values
• Principal components (eigenvalues of the covariance matrix of zero-mean, unscaled data): 4.57, 1.17, 0.41, 0.15, 0.0085, 0.0027
• Only 3 (or 4) significant components: redundancy, or an opportunity to cross-check?

Effective Bandwidth Analysis

[Figure: effective bandwidth (words/second, log scale from kilo to peta) for Top500/HPL, STREAM, FFT, and RandomAccess across systems ordered by Top500 rank, from GigE clusters (Dell GigE P1-P64, MITLL) through HPC systems (Opteron/AMD, Cray X1/X1E, SGI Altix, NEC SX-8, Cray XT3, IBM Power5, IBM BG/L, with and without optimization) to the DARPA HPCS goals.]

• All results are in words/second
• Highlights the memory hierarchy
• Clusters (spread ~10⁶): the hierarchy steepens
• HPC systems (spread ~10⁴): the hierarchy stays constant
• HPCS goals (spread ~10²): the hierarchy flattens, so the systems are easier to program

Public Web Resources

• Main HPCC website: http://icl.cs.utk.edu/hpcc/
• HPCC Awards: http://www.hpcchallenge.org/
• HPL: http://www.netlib.org/benchmark/hpl/
• STREAM: http://www.cs.virginia.edu/stream/
• PTRANS: http://www.netlib.org/parkbench/html/matrix-kernels.html
• FFTE: http://www.ffte.jp/

HPCC Makefile Structure (1/2)

• Sample Makefiles live in hpl/setup
• BLAS
  – LAdir: BLAS top directory for the other LA variables
  – LAinc: where BLAS headers live (if needed)
  – LAlib: where BLAS libraries live (libblas.a and friends)
  – F2CDEFS: resolves Fortran-C calling issues (BLAS is usually callable from Fortran)
    • -DAdd_, -DNoChange, -DUpCase, -Dadd__
    • -DStringSunStyle, -DStringStructPtr, -DStringStructVal, -DStringCrayStyle
• MPI
  – MPdir: MPI top directory for the other MP variables
  – MPinc: where MPI headers live (mpi.h and friends)
  – MPlib: where MPI libraries live (libmpi.a and friends)

HPCC Makefile Structure (2/2)

• Compiler
  – CC: C compiler
  – CCNOOPT: C flags without optimization (for optimization-sensitive code)
  – CCFLAGS: C flags with optimization
• Linker
  – LINKER: program that can link BLAS and MPI together
  – LINKFLAGS: flags required to link BLAS and MPI together
• Programs/commands
  – SHELL, CD, CP, LN_S, MKDIR, RM, TOUCH
  – ARCHIVER, ARFLAGS, RANLIB

MPI Implementations for HPCC

• Vendor
  – Cray (MPT), IBM (POE), SGI (MPT)
  – Dolphin, InfiniBand (Mellanox, Voltaire, …), Myricom (GM, MX), Quadrics, PathScale, Scali, …
• Open source
  – MPICH1, MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich/)
  – LAM/MPI (http://www.lam-mpi.org/)
  – Open MPI (http://www.open-mpi.org/)
  – LA-MPI (http://public.lanl.gov/lampi/)
• MPI implementation components
  – Compiler (adds MPI header directories)
  – Linker (needs to link in Fortran I/O)
  – Executable launcher (poe, mprun, mpirun, aprun, mpiexec, …)

Fast BLAS for HPCC

• Vendor implementations
  – AMD (AMD Core Math Library)
  – Cray (SciLib)
  – HP (MLIB)
  – IBM (ESSL)
  – Intel (Math Kernel Library)
  – SGI (SGI/Cray Scientific Library)
  – …
• Free implementations
  – ATLAS: http://www.netlib.org/atlas/
  – Goto BLAS: http://www.cs.utexas.edu/users/flame/goto and http://www.tacc.utexas.edu/resources/software
• Implementations that use threads
  – Some vendor BLAS
  – ATLAS
  – Goto BLAS
• You should never use the reference BLAS from Netlib
  – There are better alternatives for every system in existence

Tuning Process: Internal to HPCC Code

• Changes to the source code are not allowed for submission
• But just for tuning it’s best to change a few things
  – Switch off some tests temporarily
• Choosing the right parallelism levels
  – Processes (MPI)
  – Threads (OpenMP in the code, vendor threads in BLAS)
  – Processors
  – Cores
• Compile-time parameters
  – More details below
• Runtime input file
  – More details below

Tuning Process: External to HPCC Code

• MPI settings examples
  – Messaging modes
    • Eager polling is probably not a good idea
  – Buffer sizes
  – Consult the MPI implementation documentation
• OS settings
  – Page size
    • A large page size should be better on many systems
  – Pinning down the pages
    • Optimize affinity on DSM architectures
  – Priorities
  – Consult the OS documentation

Parallelism Examples with Comments

• Pseudo-threading helps, but be careful
  – Hyper-Threading
  – Simultaneous Multi-Threading
  – …
• Cores
  – Intel (x86-64, Itanium), AMD (x86)
  – Cray: SSP, MSP
  – IBM Power4, Power5, …
  – Sun SPARC
• SMP
  – BlueGene/L (single/double CPU usage per card)
  – SGI (NUMA, ccNUMA, DSM)
  – Cray, NEC
• Others
  – Cray MTA (no MPI!)

HPCC Input and Output Files

• Parameter file hpccinf.txt
  – HPL parameters (lines 5-31)
  – PTRANS parameters (lines 32-36)
  – Indirectly: the sizes of arrays for all HPCC components
  – Many HPL and PTRANS parameters might not be optimal
• Memory file hpccmemf.txt
  – Memory available per MPI process: Process=64
  – Memory available per thread: Thread=64
  – Total available memory: Total=64
• Output file hpccoutf.txt
  – Must be uploaded to the website
  – Easy to parse
  – More details later…
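For example, hpccmemf.txt contains a single line in one of the three forms above; to our reading of the HPCC documentation the value is in megabytes:

# hpccmemf.txt: one line, e.g. 64 MB available to every MPI process
Process=64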

Tuning HPL: Introduction

• Performance of HPL comes from
  – BLAS
  – The input file hpccinf.txt
• Essential parameters in the input file
  – N: matrix size of the system Ax = b
  – NB: blocking factor
    • Influences BLAS performance and load balance
  – PMAP: process mapping
    • Depends on network topology
  – P×Q: process grid

[Figure: the definitions of N and the process grid illustrated on the partitioned matrix.]
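A rule-of-thumb sketch, not from the HPCC sources, for a first guess at N: use roughly 80% of aggregate memory for the 8-byte matrix elements and round N down to a multiple of NB; the memory size, node count, and NB below are assumed example values:

/* First-guess HPL problem size from available memory (rule of thumb). */
#include <math.h>
#include <stdio.h>

int main(void) {
    double mem_per_node_gb = 4.0;   /* assumed */
    int nodes = 64, nb = 128;       /* assumed */
    double bytes = 0.80 * mem_per_node_gb * 1e9 * nodes;
    long n = (long)sqrt(bytes / 8.0);   /* N^2 doubles must fit */
    n -= n % nb;                        /* align N with the block size */
    printf("try N = %ld with NB = %d\n", n, nb);
    return 0;
}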

Tuning HPL: More Definitions

• Process grid parameters: P, Q, and PMAP

With P = 3 and Q = 4, the 12 processes can be arranged on the grid in row-major order (ranks 0, 1, 2, 3 across the first row; 4-7 across the second; 8-11 across the third) or in column-major order (ranks 0, 1, 2 down the first column; 3-5 down the second; and so on). The PMAP parameter selects between the two orderings.

Tuning HPL: Selecting Process Grid

[Figure: HPL performance for different process grid shapes; grids with P and Q roughly equal, Q slightly larger than P, typically perform best.]

Tuning HPL: Selecting Number of Processors

[Figure: HPL performance versus processor count, with the prime counts 37, 41, 43, 47, 53, 59, and 61 called out: a prime number of processors only admits the degenerate 1×N process grid, so such counts are best avoided.]

Tuning HPL: Selecting Matrix Size

[Figure: HPL performance versus matrix size N: too small a matrix leaves the machine underutilized, too large a matrix exceeds memory; the best performance is near the largest N that fits, and scattered lower points correspond to non-optimal parameters.]

Tuning HPL: Further Details

• Many more details are available from HPL’s author, Antoine Petitet: http://www.netlib.org/benchmark/hpl/

Tuning RandomAccess

• New RandomAccess algorithms were introduced in version 1.2
  – Contributed by the benchmarking group from Sandia National Laboratories
• Prior versions only had one algorithm
  – Each update required an MPI call
• The new algorithms are selected via C preprocessor macros
  – #define RA_SANDIA_NOPT
  – #define RA_SANDIA_OPT2

Tuning FFT

• Compile-time parameters
  – FFTE_NBLK: blocking factor
  – FFTE_NP: padding (to alleviate negative cache-line effects)
  – FFTE_L2SIZE: size of the level-2 cache
• Use FFTW instead of FFTE
  – Define the USING_FFTW symbol during compilation
  – Add the FFTW location and library to the linker flags

Tuning STREAM

• Intended to measure main memory bandwidth
• Requires many optimizations to run at full hardware speed
  – Software pipelining
  – Prefetching
  – Loop unrolling
  – Data alignment
  – Removal of array aliasing
• The original STREAM has advantages
  – Constant array sizes (known at compile time)
  – Static storage of arrays (under the compiler’s full control)

Tuning PTRANS

• Parameter file hpccinf.txt
  – Line 33: number of matrix sizes
  – Line 34: matrix sizes
    • Must not be too small (enforced in the code)
  – Line 35: number of blocking factors
  – Line 36: blocking factors
• No need to worry about BLAS
• Very influential for performance

Tuning b_eff

• The b_eff (effective bandwidth and latency) test can also be tuned
• Tuning must use only standard MPI calls
• Examples
  – Persistent communication
  – One-sided communication

HPCC Output File

• The output file has two parts
  – Verbose output (free format)
  – Summary section
    • Pairs of the form name=value
• The summary section names
  – MPI*: global results (example: MPIRandomAccess_GUPs)
  – Star*: embarrassingly parallel results (example: StarRandomAccess_GUPs)
  – Single*: single process results (example: SingleRandomAccess_GUPs)
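A small sketch of pulling selected name=value pairs out of the summary section; the file name and keys follow the examples above, and the selection logic is just for illustration:

/* Scan hpccoutf.txt for a few summary name=value lines. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("hpccoutf.txt", "r");
    if (!f) { perror("hpccoutf.txt"); return 1; }
    char line[512];
    while (fgets(line, sizeof line, f)) {
        if (!strchr(line, '=')) continue;       /* not a name=value line */
        if (strncmp(line, "MPIRandomAccess_GUPs", 20) == 0 ||
            strncmp(line, "StarRandomAccess_GUPs", 21) == 0 ||
            strncmp(line, "SingleRandomAccess_GUPs", 23) == 0)
            printf("%s", line);                 /* global / EP / single GUPS */
    }
    fclose(f);
    return 0;
}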

Optimized Submission Ideas

• For an optimized run, the same MPI harness has to be run on the same system
• Certain routines can be replaced: the timed regions
• The verification has to pass: this limits the data layout and the accuracy of optimization
• Variations of the reference implementation are allowed (within reason)
  – No Strassen algorithm for HPL, due to the different operation count
• Various non-portable C directives can significantly boost performance
  – Example: #pragma ivdep
• Various messaging substrates can be used
  – Removes MPI overhead
• Various languages can be used
  – Allows direct access to non-portable hardware features
  – UPC was used to increase RandomAccess performance by orders of magnitude
• Optimizations need to be explained upon results submission

Performance Modeling and its Uses with the HPC Challenge (HPCC) Benchmark Suite

Allan Snavely

Hierarchical systems

– Modern supercomputers are composed of a hierarchy of subsystems:
  • CPU and registers
  • Cache (multiple levels)
  • Main (local) memory
  • System interconnect to other CPUs and memory
  • Disk
  • Archival storage
– Latencies span 11 orders of magnitude, nanoseconds to minutes. This is a solar-system scale: it is 794 million miles to Saturn.
– Bandwidths taper
  • Cache bandwidths generally are at least four times larger than local memory bandwidth.

“Red Shift”

• While the absolute speeds of all computer subcomponents have been changing rapidly, they have not all been changing at the same rate. For example, the analysis in [FoSC] indicates that, while peak processor speeds increased at 59% per year between 1988 and 2004, memory speeds of commodity processors have been increasing at only 23% per year since 1995, and DRAM latency has been improving at a pitiful rate of 5.5% per year. This “red shift” in the latency distance between various components has produced a crisis of sorts for computer architects, and a variety of complex latency-hiding mechanisms currently drive the design of computers as a result.

Moore’s Law

• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
• Moore’s law has had a decidedly mixed impact, creating new opportunities to tap into exponentially increasing computing power while raising fundamental challenges as to how to harness it effectively.
• The Red Queen syndrome: “It takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!”
• Things Moore never said:
  – “computers double in speed every 18 months” ☺
  – “cost of computing is halved every 18 months” ☺
  – “cpu utilization is halved every 18 months”

Moore’s Law

[Figure: Moore’s law trend chart.]

Snavely’s Top500 Laptop?

– Among other startling implications is the fact that the peak performance of the typical laptop would have placed it as one of the 500 fastest computers in the world as recently as 1995.

– Shouldn’t we all just go find another job now?

– No, because Moore’s Law has several more subtle implications, and these have raised a series of challenges to utilizing the apparently ever-increasing availability of compute power; these implications must be understood to see why HPCC and modeling are needed

Real Work (the fly in the ointment)

• Scientific calculations involve operations upon large amounts of data, and it is in moving data around within the computer that the trouble begins. As a very simple pedagogical example, consider the expression:

A + B = C

• This generic expression takes two values (arguments) and performs an arithmetic operation to produce a third. Implicit in such a computation is that A and B must be loaded from where they are stored and the value C stored after it has been calculated. In this example, there are three memory operations and one floating-point operation (flop).

Performance model defn.

• a calculable expression that takes as parameters attributes of application software, input data, and target machine hardware (and possibly other factors) and computes, as output, expected performance

Uses of performance models

• As an independent way of vetting performance guarantees (as in support of procurement)
• To better understand what you get for your $
• To guide code tuning
• To inform future system design
• To make sense of and to generalize the results of benchmarks

“The increased publicity enjoyed by HPCC doesn't necessarily translate into deeper understanding of the performance metrics that HPCC measures. And so this tutorial will introduce attendees to HPCC, provide tools to examine differences in HPC architectures, and give hands-on training that will hopefully lead to better understanding of parallel environments.”

Performance models today are about 90% accurate

168 “blind” predictions for DoD applications (HYCOM, AVUS, Overflow2, LAMMPS): overall absolute relative error = 11.7%, overall relative error = -0.5%

M. Tikir, L. Carrington, E. Strohmaier, A. Snavely, “A Genetic Algorithms Approach to Modeling the Performance of Memory-bound Computations,” Proceedings of the International Conference on High Performance Computing, Networking, and Storage, Reno, November 2007.

How does it work? A review of the PMaC methodology:

• Machine profile: characterization of the rates at which a machine can (or is projected to) carry out fundamental operations, abstracted from any particular application.
• Application signature: detailed summary of the fundamental operations to be carried out by the application, independent of any particular machine.

Combine the machine profile and the application signature using:
• Convolution methods: algebraic mappings of the application signatures onto the machine profiles to arrive at a performance prediction.

Pieces of PMaC Framework

[Diagram: the PMaC framework pairs two models. In the single-processor model, a machine profile (the memory performance capabilities of Machine A) is combined with an application signature (the memory operations needed by Application B). In the communication model, a machine profile (the network performance capabilities of Machine A) is combined with an application signature (the network operations needed by Application B). A convolution method maps the application’s needs onto the machine’s capabilities, e.g., execution time = memory ops / memory rate + FP ops / FP rate, and likewise per communication operation type, producing a performance prediction of Application B on Machine A.]

Formally, let P(m,n) = A(m,p) · M(p,n)

• P is a matrix of runtimes: p_ij = Σ_{k=1}^{p} a_ik m_kj
• The i’s (rows) are applications and the j’s (columns) are machines; an entry of P is the (real or predicted) runtime of application i on machine j
• The rows of A are applications and its columns are operation counts; read a row to get an application signature
• The rows of M are benchmark-measured bandwidths and its columns are machines; read a column to get a machine profile

For example, with three applications, five operation classes, and three machines, A is 3×5, M is 5×3, and P is 3×3, so that

p_32 = a_31 m_12 “+” a_32 m_22 “+” a_33 m_32 “+” a_34 m_42 “+” a_35 m_52

(the quoted “+” indicating that the combination need not be a literal sum).
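A sketch of the convolution as a plain matrix product; note the simplification: for the product to yield runtimes, M here must hold seconds per operation (the reciprocal of a measured rate), and, as the quoted “+” above warns, the real combination need not be a plain sum:

/* Predicted runtimes P(m,n) = A(m,p) * M(p,n): applications x machines. */
void predict(int m, int n, int p,
             const double *A,   /* m x p application signatures (op counts) */
             const double *M,   /* p x n machine profiles (seconds per op)  */
             double *P) {       /* m x n predicted runtimes                 */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double t = 0.0;
            for (int k = 0; k < p; ++k)
                t += A[i * p + k] * M[k * n + j];
            P[i * n + j] = t;
        }
}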

Example: Convolving MEMbench data with Metasim Tracer Data

[Figure: memory bandwidth (MB/s) versus message size (8-byte words) for IBM p655, SGI Altix, and IBM Opteron, with stride-one access regimes marked for the L1 cache, L1/L2, L2/L3, and L3 cache/main memory transitions.]

• The memory bandwidth benchmark measures memory rates (MB/s) for the different levels of cache, and tracing reveals the different access patterns

PMaCinst binary instrumentation library

• PMaCinst works for
  – the PowerPC architecture family
  – the AIX operating system
  – the XCOFF object file format
• Fast compared to similar tools
  – Lower overhead compared to Dyninst on AIX: 2.2-3.4 times faster
• Slowdown
  – Under 2x for jbb instrumentation
  – Mostly under 10x for cache simulations (MetaSim)
• Simulation of 51 (almost all DoE, NSF, DoD) HPC systems
• Tested on full-scale HPC applications
  – S3D, AMR, WRF, Overflow, LAMMPS, HYCOM, AVUS, CTH

Other latencies also matter

[Figure: DataStar (IBM Power4) bandwidth versus working-set size for four access patterns: stride-one, random-stride, branch, and FP-dependent.]

More Machines

[Figure: the same four access-pattern bandwidth curves (stride-one, random-stride, branch, FP dep) for four more machines: lemieux, hurricane, O3900, and tempest.]

Convolutions put the two together for performance predictions: a MetaSim trace collected on a PETSc matrix-vector code on 4 CPUs, with user-supplied memory parameters for PSC’s TCSini

| Procedure name | % Mem. ref. | Ratio random | L1 hit rate | L2 hit rate | Data set location | Memory BW (MAPS) |
| dgemv_n | 0.9198 | 0.07 | 93.47 | 93.48 | L1/L2 cache | 4166.0 |
| dgemv_n | 0.0271 | 0.00 | 90.33 | 90.39 | Main memory | 1809.2 |
| dgemv_n | 0.0232 | 0.00 | 94.81 | 99.89 | L1 cache | 5561.3 |
| MatSet. | 0.0125 | 0.20 | 77.32 | 90.00 | L2 cache | 1522.6 |

Single-processor or per-processor performance combines:
• The machine profile for the processor (Machine A)
• The application signature for the application (App. #1)

The relative per-processor performance of App. #1 on Machine A is represented as the MetaSim number = Σ_{i=1}^{n} (MemOps_BBi / MemRate_BBi), where the sum runs over basic blocks and the rates are MAPS data for the predicted machine.
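A direct transcription of the MetaSim-number formula above; the per-basic-block arrays are hypothetical inputs (from tracing and MAPS measurements), not part of any published interface:

/* MetaSim number: sum over basic blocks of memory ops / achievable rate. */
double metasim_number(int nblocks, const double *mem_ops,
                      const double *mem_rate) {
    double t = 0.0;
    for (int i = 0; i < nblocks; ++i)
        t += mem_ops[i] / mem_rate[i];   /* time contributed by block i */
    return t;
}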

Extrapolation

• Least-squares minimization is performed on each set of data points (for each basic block)
• A synthetic profile is created based on the resulting curves
• The synthetic profile is sent to the convolver, and a computation/memory time is predicted

AVUS (Air Vehicle Unstructured Solver)

[Kiviat chart: predicted AVUS speedup (0-2.5 scale) when doubling each processor attribute: L1×2, L2×2, L3×2, MM×2.]

WRF (Weather Research and Forecasting)

[Kiviat chart: predicted WRF speedup (0-2 scale) when doubling the L1, L2, L3, and MM attributes.]

AMR interagency benchmark

[Kiviat chart: predicted speedup for the AMR interagency benchmark (0-2 scale) when doubling the L1, L2, L3, and MM attributes.]

Hycom (ocean model)

[Kiviat chart: predicted Hycom speedup (0-2 scale) when doubling the L1, L2, L3, and MM attributes.]

Reasoning about impact of multicore

[Figure: MAPS stride-one memory bandwidth (MB/s) versus size in 8-byte words for JVN3 Xeon EM64T 3.6 GHz, Kraken IBM PWR4+, and Eagle Itanium2 1.6 GHz, contrasting per-blade bandwidth with blade density.]

Choose two…

Fast. Accurate. Cheap.

What is Integrated Performance Monitoring?

IPM provides a high-level application performance profile on a per-batch-job basis:
• Fixed memory footprint, ~1-2 MB
• Minimal CPU overhead, ~5%
• Scalable to high concurrencies
• Easy to use: flip of a switch
• Portable

[Diagram: a batch job (input_123, job_123) runs under IPM and yields output_123 plus profile_123.]

Message Sizes: CAM, 336-way

[Figure: message-size breakdowns, per MPI call and per MPI call & buffer size.]

Load Balance Information

[Figure: load balance for a 32K-task AMR run on BG/L.]

Ordered list of MPI calls

Can we use the output from this relatively simple tool to predict the MPI performance of a given code on Machine X (an ORNL XT4), using an IPM profile of the code on Machine Y (a PWR4+) and some latency and bandwidth information?

WRF CONUS benchmark, 256 CPUs

[Figure: predicted versus measured WRF CONUS time per MPI rank (ranks 1-256) on the XT4; the prediction tracks the measurement with ~7% average absolute error.]

WRF CONUS Benchmark Summary

| MPI tasks | % error |
| 256 | 7 |
| 384 | 9 |
| 512 | 6 |
| 640 | 5 |

Prediction methodology (How did we do it?)

• Used a very simple model based on Random-Ring bandwidth and latency as measured by HPCC.
• Models for collectives taken from 'Performance Analysis of MPI Collective Operations', Cluster Computing 10, p. 127 (2007), Dongarra et al.
• Speedup of the compute part of MPI time treated using the predicted speedup from the PMaC convolver.
• If T_communication-only < T_measured on the base system:
– attribute the difference to the compute part of the MPI time on the sending processor, then scale it up using the convolver factor.
• If T_communication-only > T_measured on the base system:
– attribute the difference to the compute part of the MPI time on the receiving processor, then scale it using the speedup factor from the convolver. The mpi_wait time is then new_communication - new_compute.
A sketch of this bookkeeping follows.
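A minimal sketch of the point-to-point part, assuming a latency + size/bandwidth model fed by Random-Ring numbers; the sending/receiving-processor attribution is collapsed into one branch here, and all constants are invented placeholders:

```python
# Predict one MPI call's time on a target machine from its measured time on
# the base system, HPCC Random-Ring latency/bandwidth, and a convolver
# speedup for the compute hidden inside the call.

def predict_mpi_time(msg_bytes, t_measured_base, ring_lat_s, ring_bw_Bps,
                     compute_speedup):
    # Pure communication time under the simple latency + size/bw model.
    t_comm = ring_lat_s + msg_bytes / ring_bw_Bps
    if t_comm < t_measured_base:
        # Remainder of the measured MPI time is treated as compute hiding
        # inside the call; scale it by the convolver's predicted speedup.
        return t_comm + (t_measured_base - t_comm) / compute_speedup
    return t_comm   # the call was communication-bound already

# Example: a 1 MB message that took 2 ms on the base system.
print(predict_mpi_time(1_000_000, 2.0e-3,
                       ring_lat_s=8e-6, ring_bw_Bps=1.2e9,
                       compute_speedup=1.8))
```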

Model and Benchmark: more Kiviat diagrams with more axes

[Kiviat charts: predicted speedup signatures for GYRO_0016, AVUS_0064, and RFCTH2_0016 when attributes are scaled 2X/4X/8X (radial scale 0 to 8) and 1/2X, 1/4X, 1/8X (radial scale 0 to 1.5). Axes: FLOPS; L1; L1, L2; L1, L2, L3; L1, L2, L3, MM; MM plus on-node bw; nw bw; nw lt; I/O.]

66 A spoke gives performance gradient

• Predicted (and validated) speedup for doubling of processor attributes for:
– Overflow (CFD), WRF (Weather), AVUS (CFD), LAMMPS (MD), Hycom (Ocean)

2X | Overflow | WRF | AVUS | AMR | LAMMPS | HyCom
FLOPS | 1.04 | 1.29 | 1.2 | 1.07 | 1.08 | 1.14
L2 BW | 1.08 | 1.01 | 1.01 | 1.01 | 1.14 | 1.16
L3 BW | 1.1 | 1.01 | 1.01 | 1.01 | 1.31 | 1.20
MM BW | 1.72 | 1.71 | 1.7 | 1.93 | 1.49 | 1.60
Ratio MM BW/Flop | 16.80 | 2.44 | 3.5 | 13.3 | 6.41 | 4.35

Qualitative comparison (thanks to Steve Scott)

Dimensionless Iso-performance lines

[Figure: dimensionless iso-performance lines for Overflow and WRF; y-axis 2x memory BW, x-axis 2x flops.]

67 Qualitative comparison

• Overflow is a memory-bound code compared with WRF, a "flops-bound" code
– flops-bound curves are steeper in this plot
– memory-bound curves are flatter
[The dimensionless iso-performance figure above is repeated here.]

What if we just project local gradients?

[Figure: Overflow speedup surface from projecting local gradients over the 2x-flops and 2x-BW axes; contour bands from 0-50 up to 300-350.]

68 Quantitative comparison

• You can speed up Overflow 1.5x with a 1000x speedup of flops alone, or 220x with a 1000x speedup of BW alone
• (according to this admittedly simplistic performance-gradient extrapolation; the arithmetic is sketched below)
• You can get a 330x speedup with both
• More detailed simulation is needed to "verify" this
[The Overflow speedup surface is repeated, plus a zoomed view with contour bands from 0-0.5 up to 2.5-3.]
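The gradient projection amounts to treating each measured 2X factor as multiplicative per doubling. A small sketch, using the Overflow factors from the doubling table above, reproduces the quoted numbers:

```python
# Project speedup from local gradients: n doublings of an attribute
# multiply performance by gradient**n. Factors are the Overflow 2X
# sensitivities; the extrapolation itself is the simplistic part.
import math

g_flops, g_mmbw = 1.04, 1.72            # Overflow 2X factors (FLOPS, MM BW)
doublings = math.log2(1000)             # 1000x is ~10 doublings

print(f"1000x flops alone: {g_flops**doublings:6.1f}x")           # ~1.5x
print(f"1000x MM BW alone: {g_mmbw**doublings:6.1f}x")             # ~220x
print(f"both together:     {(g_flops * g_mmbw)**doublings:6.1f}x") # ~330x
```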

But what about $ ?

• Some people say flops are free
• Let's derive the slope of the lines cost/flops and cost/BW from (some) real data
• More exactly, estimate the slopes:
– normalized cost delta / normalized clock speed delta
– and normalized cost delta / normalized bus speed delta
– using clock speed as a proxy for flops and bus speed as a proxy for memory bandwidth

69 Families of Intel processors with list $

clock (GHz) | FSB (MHz) | list $ | delta $ | d clock | d bus | d cost/d clock | clock cost | bus cost | d cost/d bus

clovertown
1.6 | 1066 | 238 | 0 | 0 | 0 | | | |
1.86 | 1066 | 297 | 59 | 0.26 | 0 | 1.53 | | |
2 | 1333 | 355 | 117 | 0.4 | 267 | | 137.45 | 217.56 | 3.65
2.33 | 1333 | 500 | 262 | 0.73 | 267 | 2.48 | 250.84 | 249.16 | 4.18
2.66 | 1333 | 800 | 562 | 1.06 | 267 | 4.24 | 364.23 | 435.77 | 7.31

woodcrest
1.6 | 1066 | 217 | 0 | 0 | 0 | | | |
2 | 1333 | 346 | 129 | 0.4 | 267 | | 125.32 | 220.68 | 4.06
2.33 | 1333 | 400 | 183 | 0.73 | 267 | 0.95 | 228.70 | 171.30 | 3.15
2.66 | 1333 | 725 | 508 | 1.06 | 267 | 5.74 | 332.09 | 392.91 | 7.23
3 | 1333 | 883 | 666 | 1.4 | 267 | 1.70 | 438.61 | 444.39 | 8.18

conroe
1.86 | 1066 | 166 | 0 | 0 | 0 | | | |
2.13 | 1066 | 186 | 20 | 0.27 | 0 | 0.83 | | |
2.33 | 1333 | 179 | 13 | 0.47 | 267 | | 96.90 | 82.10 | 1.97
2.66 | 1333 | 205 | 39 | 0.8 | 267 | 1.03 | 164.93 | 40.07 | 0.96

average: d cost/d clock = 2.31, d cost/d bus = 4.52
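A small sketch of how such a normalized slope can be formed from the table, assuming each part is compared against the cheapest member of its family; the same function serves clock (proxy for flops) and FSB speed (proxy for memory bandwidth):

```python
# Normalized slope: (delta cost / base cost) / (delta speed / base speed).

def norm_slope(base_price, base_speed, price, speed):
    return ((price - base_price) / base_price) / \
           ((speed - base_speed) / base_speed)

# Clovertown 1.86 GHz vs. the 1.6 GHz base part (FSB fixed at 1066 MHz):
print(round(norm_slope(238, 1.6, 297, 1.86), 2))   # -> 1.53, as in the table
```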

Performance data for Overflow

2^BW \ 2^flops | 0 | 1 | 1.75 | 2 | 3 | 3.5 | 4 | 5 | 6 | 7 | 8 | 9 | 10
0 | 1 | 1.04 | 1.075 | 1.09 | 1.131366 | 1.1314 | 1.178883 | 1.228397 | 1.279989 | 1.333749 | 1.389766 | 1.448136 | 1.508958
0.1 | 1.055607 | 1.1 | 1.134 | 1.15 | 1.194278 | 1.1943 | 1.244438 | 1.296704 | 1.351166 | 1.407915 | 1.467047 | 1.528663 | 1.592867
0.2 | 1.114306 | 1.16 | 1.197 | 1.21 | 1.260688 | 1.2607 | 1.313637 | 1.36881 | 1.4263 | 1.486205 | 1.548625 | 1.613668 | 1.681442
0.3 | 1.17627 | 1.23 | 1.264 | 1.28 | 1.330792 | 1.3308 | 1.386685 | 1.444926 | 1.505613 | 1.568848 | 1.63474 | 1.703399 | 1.774942
0.4 | 1.241679 | 1.29 | 1.334 | 1.35 | 1.404793 | 1.4048 | 1.463795 | 1.525274 | 1.589335 | 1.656088 | 1.725643 | 1.79812 | 1.873641
0.5 | 1.310725 | 1.37 | 1.409 | 1.42 | 1.48291 | 1.4829 | 1.545192 | 1.61009 | 1.677714 | 1.748178 | 1.821601 | 1.898109 | 1.977829
0.6 | 1.383611 | 1.44 | 1.487 | 1.5 | 1.56537 | 1.5654 | 1.631116 | 1.699623 | 1.771007 | 1.845389 | 1.922895 | 2.003657 | 2.087811
0.7 | 1.460549 | 1.52 | 1.57 | 1.59 | 1.652416 | 1.6524 | 1.721817 | 1.794134 | 1.869487 | 1.948006 | 2.029822 | 2.115075 | 2.203908
0.8 | 1.541766 | 1.61 | 1.657 | 1.67 | 1.744302 | 1.7443 | 1.817563 | 1.8939 | 1.973444 | 2.056329 | 2.142695 | 2.232688 | 2.326461
0.9 | 1.627499 | 1.7 | 1.749 | 1.77 | 1.841298 | 1.8413 | 1.918632 | 1.999215 | 2.083182 | 2.170675 | 2.261844 | 2.356841 | 2.455828
1 | 1.718 | 1.79 | 1.846 | 1.87 | 1.943687 | 1.9437 | 2.025322 | 2.110385 | 2.199021 | 2.29138 | 2.387618 | 2.487898 | 2.59239

Columns are powers of 2 of flops; rows are powers of 2 of BW; cell values are predicted speedup.

70 Iso performance and cost: Overflow

[Figure: iso-performance contours (1.1x to 2.4x) for Overflow in the 2x-flops vs. 2x-bandwidth plane, overlaid with iso-cost lines at $5, $10, $150, $595, and $2,367. Say (0,0), i.e. 1 unit of flops and 1 unit of memory BW, costs $1.]

Iso performance and cost: WRF

[Figure: the same iso-performance and iso-cost plot for WRF, with contours from 2x to 16x.]

71 Comparison

• Recall you can speed up Overflow 1.5x with a 1000x speedup of flops alone, or 220x with a 1000x speedup of BW alone
• WRF can be sped up about 12x with 1000x flops alone, about 173x with 1000x BW alone, and more than 600x with both!
[The Overflow and WRF iso-performance/iso-cost plots are repeated here.]

How to pick a spanning set of benchmarks: Spatial vs. Temporal Locality

[Figure: applications in the spatial-vs.-temporal-locality plane (both axes 0.00 to 1.00): HPL and Gamess high in temporal locality; WRF, HYCOM, Overflow, CG, Test3D, OOCore, RFCTH2, and AVUS in between; RandomAccess and STREAM low in temporal locality.]

72 Tools for determining data motion patterns of applications

Locality matters for performance

73 Benchmarks for Global Data Motion

APEX-Maps

74 APEX-Map

APEX-Map: Global Data Motion

75 A variety of signatures

[Kiviat charts repeated from slide 66: speedup signatures for GYRO_0016, AVUS_0064, and RFCTH2_0016 under 2X/4X/8X and 1/2X, 1/4X, 1/8X attribute scaling.]

More abstract: data-moving capability is F(MAPS)

[Figure: communication-locality performance surface for a hypothetical petascale architecture. Cycles per data op (1.0 to 10000.0, log scale) over the spatial and temporal locality of the memory access pattern signature, with regions limited by main memory, shared cache, chip cache, and floating point.]

76 Mapping applications to cycles-per-data-op surface

Limited by:

Com LocalityTimeless Performance MAPS Surface projection Mem Shared cache 10000.0 Chip cache FP Mesa 1000.0

Overflow 100.0 mpsalsa Cycles per data op 10.0 equake GUPS Gammes 1.0

0.001 Stream 0.1 0.010 S3D

0.100 0.0001 0.001

Temporal 0.01 1.000 HPL1 Spatial

Genetic Algorithms

• Suitable when a certain form of the function is assumed and the search space contains many (and unknown) local minima

BW = F(cache hit rates, cache bandwidths & latencies, latency tolerance)

[Figure: memory bandwidth (MB/s, 0 to 12000) as a surface over L1 and L2 cache hit rates.]

77 Genetic Algorithms

[Diagram: the GA search loop across generations: island populations, mutation, and crossover, with an adaptation decision.]

Fitness = closeness of the measured and predicted bandwidth curves.
[Figure: MultiMAPS measured vs. predicted bandwidth (MB/s) against size in 8-byte words for two curves, line02 and line04.]
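A minimal sketch of this GA fitness, assuming a toy two-level bandwidth model standing in for the real memory simulator; only the fitness/mutation shape is the point, and every name and number below is illustrative:

```python
# A candidate parameter set (cache capacity, hit rate, bandwidths) induces
# a predicted MultiMAPS curve; fitness is the closeness of predicted to
# measured curves. Islands and crossover are omitted for brevity.
import random

def predicted_bw(p, size_words):
    # Toy model: inside the cache capacity, traffic runs at cache speed;
    # beyond it, the hit rate blends cache and main-memory bandwidth.
    if size_words <= p["capacity"]:
        return p["cache_bw"]
    return p["hit_rate"] * p["cache_bw"] + (1 - p["hit_rate"]) * p["mem_bw"]

def fitness(p, measured):
    # measured: list of (size_in_words, bandwidth_MB_s) points
    return -sum((predicted_bw(p, s) - bw) ** 2 for s, bw in measured)

def mutate(p, scale=0.05):
    child = dict(p)
    key = random.choice(list(child))
    child[key] *= 1 + random.uniform(-scale, scale)
    return child

# One GA step over a tiny population.
measured = [(1e3, 11800), (1e5, 9000), (1e7, 2400)]
pop = [{"capacity": 5e4, "hit_rate": 0.9, "cache_bw": 12000, "mem_bw": 2000}
       for _ in range(8)]
pop = sorted((mutate(p) for p in pop), key=lambda p: fitness(p, measured))
print(fitness(pop[-1], measured))   # best candidate of this generation
```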

AVUS 64 Sensitivity

[Kiviat chart: AVUS 64-CPU sensitivity across cases 1-9 (defined on the next slide) for AVUS_0064_2X, AVUS_0064_4X, and AVUS_0064_8X; radial scale 0 to 7.]

78 Performance Sensitivity

Case 1 | Network latency / 2
Case 2 | Network bandwidth * 2
Case 3 | FLOPS * 2
Case 4 | L1 BW * 2
Case 5 | L1, L2 BW * 2
Case 6 | L1, L2, L3 BW * 2
Case 7a | L1, L2, L3, MM BW * 2
Case 7b | L1, L2, L3, MM, on-node BW * 2
Case 8a | Just MM BW * 2
Case 8b | Just MM, on-node BW * 2
Case 9 | I/O

Choose your machine wisely ☺

[Kiviat charts: normalized machine profiles for BENCHM_Altix_1.5, BENCHM_5_Opt1_2.2, and NAVO_P655_FED. Axes: flops; L1, L2, L3, and MM bandwidth, each in (n) and (r) variants; NW latency; NW bandwidth.]

79 HPL

Some people say it can't predict real workloads, but others say it probably tracks relative performance well. Why not just test that hypothesis by predicting:

T’(X,Y) = R(X)/R(Xo) * T(Xo)

where T'(X,Y) is the predicted wall-clock time for application Y on system X, R(X) is the result of a specific simple benchmark for system

X, X0 denotes the base system, and T(X,Y) is the measured wall-clock time for application Y on system X. In other words, the performance for a specific application is assumed to faster or slower according to the ratio of the simple benchmark results for system X and the base

system (X0).
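A minimal harness for this test, treating R as a time-like benchmark result so the formula applies as written (a rate-like result such as Gflop/s would invert the ratio); all numbers below are invented:

```python
# Predict application runtimes from one simple benchmark ratio, then score
# the prediction by average absolute error across systems.

def predict_time(r_target, r_base, t_base):
    """T'(X,Y) = R(X)/R(X0) * T(X0,Y)."""
    return (r_target / r_base) * t_base

def avg_abs_error_pct(predicted, measured):
    return 100 * sum(abs(p - m) / m
                     for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical benchmark results and measured runtimes on two systems.
r = {"base": 10.0, "sysA": 4.0, "sysB": 7.0}   # benchmark time (s)
t_base = 1200.0                                 # app runtime on base (s)
measured = [520.0, 900.0]                       # app runtimes on A, B (s)
predicted = [predict_time(r["sysA"], r["base"], t_base),
             predict_time(r["sysB"], r["base"], t_base)]
print(predicted, f"{avg_abs_error_pct(predicted, measured):.0f}% error")
```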

Or if that doesn’t work too well what about using HPC Challenge in some combination?

# | Type | Name or Description
1 | Simple | HPL
2 | Simple | STREAM
3 | Simple | HPC Challenge RandomAccess (GUPS)
4 | Predictive | HPL for floating point work
5 | Predictive | HPL for floating point work; STREAM for memory access
6 | Predictive | HPL for floating point work; STREAM for stride-1 memory access; GUPS for random-stride memory access
7 | Predictive | HPL for floating point work; MEMBENCH MAPS for memory access
8 | Predictive | HPL for floating point work; MEMBENCH MAPS for memory access; NETBENCH for communications work
9 | Predictive | HPL for floating point work; ENHANCED MEMBENCH MAPS for memory access; NETBENCH for communications work
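A hedged sketch of the predictive combinations (#5/#6 above), assuming the application's operations can be split into buckets whose times each scale by the matching benchmark ratio; the bucket fractions and benchmark values are invented:

```python
# Bucket-weighted predictor: floating point, stride-1 memory, and random
# memory fractions each scale by HPL, STREAM, and GUPS ratios respectively.
# Total time scales as the sum of per-bucket times (harmonic combination).

def weighted_speedup(fractions, bench_target, bench_base):
    inv = sum(f * bench_base[k] / bench_target[k]
              for k, f in fractions.items())
    return 1.0 / inv

fractions = {"flops": 0.3, "stride1": 0.5, "random": 0.2}
base   = {"flops": 5.0,  "stride1": 2000.0, "random": 0.01}  # HPL, STREAM, GUPS
target = {"flops": 20.0, "stride1": 4000.0, "random": 0.02}

print(f"predicted speedup: {weighted_speedup(fractions, target, base):.2f}x")
```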

80 Systems used in this study

HPCMP Compute Site | Architecture | Processors
ERDC | SGI_O3800_400MHz_NUMA | 504
ARL | IBM_P3_375MHz_COL | 1024
MHPCC | IBM_P3_375MHz_COL | 736
NAVO | IBM_P3_375MHz_COL | 928
ASC | HP_SC45_1GHz_QUAD | 472
MHPCC | IBM_690_1.3GHz_COL | 320
NAVO | IBM_690_1.3GHz_COL | 1328
ARL | IBM_690_1.7GHz_FED | 128
ARL | LNX_Xeon_3.06GHz_MNET | 256
ARL | SGI_Altix_1.5GHz_NUMA | 256
NAVO | IBM_655_1.7GHz_FED | 2832
ARL | IBM_Opteron_2.2GHz_MNET | 2304

Architectures used in this study

Make | Model | Speed (GHz) | Interconnect
SGI | Origin 3800 | 0.400 | NUMALink
IBM | Power 3 | 0.375 | Colony
HP | SC45 | 1.000 | Quadrics
IBM | p690 | 1.300 | Colony
IBM | p690 | 1.700 | Federation
LNX | Xeon | 3.060 | Myrinet
SGI | Altix | 1.500 | NUMALink
IBM | p655 | 1.700 | Federation
IBM | Opteron | 2.200 | Myrinet

81 Scientific applications used in this study

• AVUS was developed by the Air Force Research Laboratory (AFRL) to determine the fluid flow and turbulence of projectiles and air vehicles. Its standard test case calculates 400 time-steps of fluid flow and turbulence for a wing, flap, and end plates using 7 million cells. Its large test case calculates 150 time-steps of fluid flow and turbulence for an unmanned aerial vehicle using 24 million cells.
• The Naval Research Laboratory (NRL), Los Alamos National Laboratory (LANL), and the University of Miami developed HYCOM as an upgrade to MICOM (both well-known ocean modeling codes) by enhancing the vertical layer definitions within the model to better capture the underlying science. HYCOM's standard test case models all of the world's oceans as one global body of water at a resolution of one-fourth of a degree when measured at the Equator.
• OVERFLOW-2 was developed by NASA Langley and NASA Ames to solve CFD equations on a set of overlapping, adaptive grids, such that the grid resolution near an obstacle is higher than that of other portions of the scene. This approach allows computation of both laminar and turbulent fluid flows over geometrically complex, non-stationary boundaries. The standard test case of OVERFLOW-2 models fluid flowing over five spheres of equal radius and calculates 600 time-steps using 30 million grid points.
• Sandia National Laboratories (SNL) developed CTH to model complex multidimensional, multiple-material scenarios involving large deformations or strong shock physics. RFCTH is a non-export-controlled version of CTH. The standard test case of RFCTH models a ten-material rod impacting an eight-material plate at an oblique angle, using adaptive mesh refinement with five levels of enhancement.

Test cases used in this study

• For each application, 2 production inputs
• For each input, 3 production CPU counts in the range 32 to 384
• 5 x 2 x 3 x 12 = 180 observed execution times (empirical data)
• For each datum, 10 different ways of predicting performance from simple benchmarks, in isolation or in simple combination (1,800 hypotheses)

82 Results

[Bar chart: average absolute error (%) for metrics #1 through #9. Bar values, left to right: 63, 63, 50, 43, 33, 24, 22, 22, 18.]

AVUS large test case
[Kiviat chart: per-case results for cases 1 through 8b at 128p, 256p, and 384p; radial scale 1.00 to 2.75.]

83 Is FLOPS/$ important?
[Kiviat chart: per-case results at 128p, 256p, and 384p; radial scale 1.10 to 1.16.]

Overflow 2 standard test case
[Kiviat chart: per-case results for cases 1 through 8b at 32p, 48p, and 64p; radial scale 1.00 to 2.75.]

84 Do computers double in speed every 18 months?
[Kiviat chart: per-case results at 32p, 48p, and 64p; radial scale 1.33 to 1.35.]

Summary analysis

• Computers don’t double in speed every 18 months (and Gordon Moore never said they did) • FLOPS/$ is not meaningful • TTS is meaningful • On many scientific codes this is a function of MM BW – Random memory BW is a better predictor than strided • A machine that looks 50% faster on the TOP500 list can be 20% slower

85 The real weighted order for these codes is 3, 2, 1, 5, 4

Conclusions

• HPL (TOP500) does not track relative performance well (63% absolute error)
• STREAM, a main-memory benchmark, does a little better (43%)
• GUPS, a random main-memory benchmark, does a little better still (why?)
• My Top500:
– Splitting your application's operations into buckets of floating point, strided memory ops, and random memory ops, and weighting each by HPL, STREAM, and GUPS, works much better (22%)
– If we add a network and dependency term, we do better yet
• It is possible to account for more than 90% of performance differences between machines via simple benchmarks combined with simple workload characterization based on HPCC benchmarks
