HPC Challenge (HPCC) Benchmark Suite
Jack Dongarra Piotr Luszczek Allan Snavely
SC07, Reno, Nevada Sunday, November 11, 2007 Session S09 8:30AM-12:00PM, A1
Acknowledgements
• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA), the Department of Defense, the National Science Foundation (NSF), and the Department of Energy (DOE) through the DARPA High Productivity Computing Systems (HPCS) program under grant FA8750-04-1-0219 and under Army Contract W15P7T-05-C-D001
• Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government
Introduction
• HPC Challenge Benchmark Suite
  – To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
  – To augment the TOP500 list
  – To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality
  – To outlive HPCS
• HPC Challenge pushes spatial and temporal boundaries and defines performance bounds
TOP500 and HPCC
• TOP500
  – Performance is represented by only a single metric
  – Data is available for an extended time period (1993-2005)
  – Problem: there can only be one “winner”
  – Additional metrics and statistics:
    • Count (single) vendor systems
    • Perform trend analysis on each list
    • Count total flops on each list per vendor
    • Use external metrics: price, ownership cost, power, …
    • Select some numbers for comparison
    • Focus on growth trends over time
• HPCC
  – Performance is represented by multiple metrics
  – Benchmark is new — so data is available for a limited time period (2003-2007)
  – Problem: there cannot be one “winner”
  – We avoid “composite” benchmarks
  – HPCC can be used to show complicated kernel/architecture performance characterizations
  – Use of Kiviat charts
    • Best when showing the differences due to a single independent “variable”
  – Over time, also focus on growth trends
High Productivity Computing Systems (HPCS)
Goal: Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2010)
Impact:
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

HPCS Program Focus Areas: Performance Characterization & Prediction, Programming Models, Hardware Technology, System Architecture, and Software Technology, surrounded by Industry R&D and Analysis & Assessment
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology

Fill the critical technology and capability gap: today (late 80’s HPC technology) … to … future (Quantum/Bio Computing)
HPCS Program Phases I-III
[Timeline diagram, CY02-CY11: productivity assessment (MIT LL, DOE, DoD, NASA, NSF) and technology assessment reviews run throughout the program. Phase I (concept study, five vendors funded) leads to Phase II (R&D, three vendors funded) and Phase III (prototype development), with industry review milestones 1-7: SCR, PDR, CDR, and DRR design reviews, software releases 1-3, early and final demos, and mission partner peta-scale development unit procurements, system commitments, application development, and MP language development.]
HPCS Benchmark and Application Spectrum

[Diagram: the spectrum runs from execution indicators to development indicators. Execution bounds are set by the 8 HPCchallenge benchmarks. Micro & kernel benchmarks (~40) form a spanning set of kernels — local: DGEMM, STREAM, RandomAccess, 1D FFT; global: Linpack, PTRANS, RandomAccess, 1D FFT — covering discrete math, graph analysis, pattern matching, signal processing, linear solvers, I/O, and simulation. Beyond these sit compact applications (~10), 9 simulation applications, existing applications, emerging applications, and future applications (petascale simulation, HPCS compact applications, classroom experiment codes).]

Spectrum of benchmarks provides different views of the system:
• HPCchallenge pushes spatial and temporal boundaries; sets performance bounds
• Applications drive system issues; set legacy code performance bounds
• Kernels and compact apps allow deeper analysis of execution and development time
Motivation of the HPCC Design
[Diagram: benchmarks placed on axes of spatial locality (x, low to high) and temporal locality (y, low to high). RandomAccess sits at low spatial/low temporal locality; STREAM at high spatial/low temporal; FFT and PTRANS at high temporal/lower spatial; HPL and DGEMM at high spatial/high temporal. Mission partner applications fall in the region bounded by these kernels.]
Measuring Spatial and Temporal Locality (1/2)

[Scatter plot, generated by PMaC @ SDSC: temporal locality (y, 0.0-1.0) versus spatial locality (x, 0.0-1.0) for the HPC Challenge benchmarks and select applications. HPL sits at high temporal locality; FFT and CG in the middle; STREAM at high spatial/low temporal locality; RandomAccess near the origin. Applications shown include Gamess, Overflow, HYCOM, Test3D, OOCore, RFCTH2, and AVUS.]

• Spatial and temporal data locality here is for one node/processor — i.e., locally or “in the small”
Measuring Spatial and Temporal Locality (2/2)

[The same scatter plot, generated by PMaC @ SDSC, annotated by region: high temporal locality gives good performance on cache-based systems; high spatial locality with sufficient temporal locality (locality in registers) gives satisfactory performance; high spatial locality alone gives moderate performance; no temporal or spatial locality gives poor performance on cache-based systems.]
Supercomputing Architecture Issues

[Diagram: a standard parallel computer architecture — CPUs, each with RAM and disk, connected by a network switch — corresponds to a memory hierarchy of registers, cache, local memory, remote memory, and disk, traversed by instructions/operands, blocks, messages, and pages. Moving down the hierarchy, latency and capacity increase while bandwidth and programmability decrease.]

• Standard architecture produces a “steep” multi-layered memory hierarchy
  – Programmer must manage this hierarchy to get good performance
• HPCS technical goal
  – Produce a system with a “flatter” memory hierarchy that is easier to program
HPCS Performance Targets

HPC Challenge benchmarks and the corresponding HPCS targets (improvement over current best):
• HPL (Top500): solves a system Ax = b — 2 Petaflop/s (8x)
• STREAM: vector operations A = B + s x C — 6.5 Petabyte/s (40x)
• FFT: 1D Fast Fourier Transform, Z = FFT(X) — 0.5 Petaflop/s (200x); stresses remote memory bandwidth
• RandomAccess: random updates, T(i) = XOR( T(i), r ) — 64,000 GUPS (2000x); stresses remote memory latency

• The HPCS program has developed a new suite of benchmarks (HPC Challenge)
• Each benchmark focuses on a different part of the memory hierarchy
• HPCS program performance targets will flatten the memory hierarchy, improve real application performance, and make programming easier
Official HPCC Submission Process

1. Download
2. Install
3. Run
4. Upload results
5. Confirm via e-mail
6. Tune
7. Run
8. Upload results
9. Confirm via e-mail

Steps 6-9 (tuned run) are optional.

Prerequisites:
• C compiler
• BLAS
• MPI

Provide the detailed installation and execution environment. For the tuned run:
• Only some routines can be replaced
• Data layout needs to be preserved
• Multiple languages can be used

Results are immediately available on the web site:
• Interactive HTML
• XML
• MS Excel
• Kiviat charts (radar plots)
HPCC as a Framework (1/2)
• Many of the component benchmarks were widely used before the HPC Challenge suite was assembled
  – HPC Challenge has been more than a packaging effort
  – Almost all component benchmarks were augmented from their original form to provide consistent verification and reporting
• We stress the importance of running these benchmarks on a single machine, with a single configuration and set of options
  – The benchmarks were already useful separately to the HPC community
  – The unified HPC Challenge framework creates an unprecedented view of a system’s performance characterization
• A comprehensive view, with data captured under the same conditions, allows for a variety of analyses depending on end-user needs
HPCC as a Framework (2/2)

• HPCC unifies a number of existing (and well-known) codes in one consistent framework
• A single executable is built to run all of the components
  – Easy interaction with batch queues
  – All codes are run under the same OS conditions, just as an application would be
    • No special mode (page size, etc.) for just one test (say, the Linpack benchmark)
  – Each test may still have its own set of compiler flags
    • Changing compiler flags in the same executable may inhibit interprocedural optimization
• Why not use a script and a separate executable for each test?
  – Lack of enforced integration between components
    • Ensure reasonable data sizes
    • Either all tests pass and produce meaningful results, or failure is reported
  – Running a single component of HPCC for testing is easy enough
Base vs. Optimized Run
• HPC Challenge encourages users to develop optimized benchmark codes that use architecture-specific optimizations to demonstrate the best system performance
• Meanwhile, we are interested in both
  – The base run with the provided reference implementation
  – An optimized run
• The base run represents the behavior of legacy code because
  – It is conservatively written using only widely available programming languages and libraries
  – It reflects a commonly used approach to parallel processing, sometimes referred to as hierarchical parallelism, that combines
    • Message Passing Interface (MPI)
    • OpenMP threading
  – We recognize the limitations of the base run and hence we encourage optimized runs
    • Optimizations may include alternative implementations in different programming languages using parallel environments available specifically on the tested system
• We require that information about the changes made to the original code be submitted together with the benchmark results
  – We understand that full disclosure of optimization techniques may sometimes be impossible
  – We request, at a minimum, some guidance for users who would like to apply similar optimizations in their applications
HPCC Base Submission

• Publicly available code is required for a base submission; this mimics legacy applications’ performance:
  1. Requires only a C compiler, MPI 1.1, and BLAS (reasonable software dependences)
  2. Source code cannot be changed for the submission run (legacy code cannot be changed due to complexity and maintenance cost)
  3. Linked libraries have to be publicly available (legacy code relies on publicly available software)
  4. The code contains optimizations for contemporary hardware systems (some optimization has been done on various platforms)
  5. Conditional compilation and runtime algorithm selection for performance tuning (algorithmic variants provided for performance portability)

The baseline code has over 10k SLOC — there must be a more productive way of coding
HPCC Optimized Submission

• Timed portions of the code may be replaced with optimized code
• Verification code still has to pass
  – Must use the same data layout or pay the cost of redistribution
  – Must use sufficient precision to pass residual checks
• Allows the use of new parallel programming technologies
  – New paradigms, e.g., one-sided communication in MPI-2: MPI_Win_create(…); MPI_Get(…); MPI_Put(…); MPI_Win_fence(…)
  – New languages, e.g., UPC: shared pointers, upc_memput()
• Code for the optimized portion may be proprietary but needs to use publicly available libraries
• Optimizations need to be described, though not necessarily in detail (possible use in application tuning)
• Attempting to capture: invested effort per flop rate gained
  – Hence the need for the baseline submission
• There can be more than one optimized submission for a single base submission (if a given architecture allows for many optimizations)
Base Submission Rules Summary

• Compile and load options
  – Compiler or loader flags which are supported and documented by the supplier are allowed
  – These include porting, optimization, and preprocessor invocation
• Libraries
  – Linking to optimized versions of the following libraries is allowed: BLAS, MPI, FFT
  – Acceptable use of such libraries is subject to the following rules:
    • All libraries used shall be disclosed with the results submission. Each library shall be identified by library name, revision, and source (supplier). Libraries which are not generally available are not permitted unless they are made available by the reporting organization within 6 months.
    • Calls to library subroutines should have equivalent functionality to that in the released benchmark code. Code modifications to accommodate various library call formats are not allowed.
Optimized Submission Rules Summary

• Only the computational (timed) portion of the code can be changed
  – Verification code cannot be changed
• Calculations must be performed in 64-bit precision or the equivalent
  – Codes with limited calculation accuracy are not permitted
• All algorithm modifications must be fully disclosed and are subject to review by the HPC Challenge Committee
  – Passing the verification test is a necessary condition for such an approval
  – The replacement algorithm must be as robust as the baseline algorithm
    • For example, the Strassen algorithm may not be used for the matrix multiply in the HPL benchmark, as it changes the operation count of the algorithm
• Any modification of the code or input data sets which utilizes knowledge of the solution or of the verification test is not permitted
• Any code modification to circumvent the actual computation is not permitted
HPCC Tests at a Glance

1. HPL
2. DGEMM
3. STREAM
4. PTRANS
5. RandomAccess
6. FFT
7. b_eff

• Scalable framework — unified benchmark framework
  – By design, the HPC Challenge benchmarks are scalable, with the size of the data sets being a function of the largest HPL matrix for the tested system
HPCC Testing Scenarios

1. Local: only a single process computes
2. Embarrassingly parallel: all processes compute and do not communicate (explicitly)
3. Global: all processes compute and communicate
4. Network only
Naming Conventions

Names combine a prefix — S (single), EP (embarrassingly parallel), or G (global) — with a test name (HPL, STREAM, FFT, RandomAccess, …) and, where applicable, a suffix (system, CPU, thread).

Examples:
1. G-HPL
2. S-STREAM-system
HPCC Tests - HPL
• HPL = High Performance Linpack
• Objective: solve a system of linear equations Ax=b, A∈R^(n×n), x,b∈R^n
• Method: LU factorization with partial row pivoting
• Performance: (2/3 n^3 + 3/2 n^2) / t
• Verification: scaled residual must be small: || Ax-b || / (ε ||A|| ||x|| n)
• Restrictions:
  – No complexity-reducing matrix multiply (Strassen, Winograd, etc.)
  – 64-bit precision arithmetic throughout (no mixed precision with iterative refinement)
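The method and verification check above can be sketched serially. This is illustrative code, not the HPL source: `lu_solve`, `scaled_residual`, and `hpl_sketch` are our names, and the pass threshold of 16 is the conventional HPL-style cutoff.

```c
#include <math.h>
#include <float.h>
#include <stdlib.h>

/* Solve A x = b via LU factorization with partial row pivoting
 * (serial sketch of what HPL does in parallel; A is n x n, row-major,
 * and is overwritten during factorization).  Returns 0 on success. */
static int lu_solve(int n, double *A, double *b, double *x) {
    for (int k = 0; k < n; ++k) {
        int p = k;                               /* pick largest pivot in column k */
        for (int i = k + 1; i < n; ++i)
            if (fabs(A[i*n+k]) > fabs(A[p*n+k])) p = i;
        if (A[p*n+k] == 0.0) return -1;          /* singular matrix */
        if (p != k) {                            /* swap rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = A[k*n+j]; A[k*n+j] = A[p*n+j]; A[p*n+j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        for (int i = k + 1; i < n; ++i) {        /* eliminate below the pivot */
            double m = A[i*n+k] / A[k*n+k];
            for (int j = k + 1; j < n; ++j) A[i*n+j] -= m * A[k*n+j];
            b[i] -= m * b[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {           /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n+j] * x[j];
        x[i] = s / A[i*n+i];
    }
    return 0;
}

/* HPL-style scaled residual ||Ax-b||_inf / (eps ||A||_inf ||x||_inf n),
 * computed from the ORIGINAL (unfactored) A and b. */
static double scaled_residual(int n, const double *A, const double *b,
                              const double *x) {
    double rmax = 0.0, anorm = 0.0, xnorm = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = -b[i], rowsum = 0.0;
        for (int j = 0; j < n; ++j) {
            r += A[i*n+j] * x[j];
            rowsum += fabs(A[i*n+j]);
        }
        if (fabs(r) > rmax) rmax = fabs(r);
        if (rowsum > anorm) anorm = rowsum;
        if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
    }
    return rmax / (DBL_EPSILON * anorm * xnorm * n);
}

/* Driver: build a random n x n system, solve it, return the residual. */
double hpl_sketch(int n) {
    double *A = malloc(n*n*sizeof *A), *A0 = malloc(n*n*sizeof *A0);
    double *b = malloc(n*sizeof *b), *b0 = malloc(n*sizeof *b0);
    double *x = malloc(n*sizeof *x);
    srand(42);
    for (int i = 0; i < n*n; ++i) A0[i] = A[i] = rand()/(double)RAND_MAX - 0.5;
    for (int i = 0; i < n; ++i)   b0[i] = b[i] = rand()/(double)RAND_MAX - 0.5;
    double res = lu_solve(n, A, b, x) ? 1e30 : scaled_residual(n, A0, b0, x);
    free(A); free(A0); free(b); free(b0); free(x);
    return res;
}
```

A run passes verification when the residual is of order one; a defective solver or insufficient precision makes it blow up.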
12 HPL:TOP500 and HPCC Implementation Comparison
HPCC TOP500 HPCC Base Optimized
Disclose code Not required Reference code Describe
Floating-point in 64- Yes Yes Yes bits
Numerics of solution No Yes Yes reported
N-half reported Yes No No
Reporting frequency Bi-annual Continuous Continuous
HPCC HPL: Further Details

• High Performance Linpack (HPL) solves a system Ax = b
• Core operation is an LU factorization of a large MxM matrix
• Results are reported in floating-point operations per second (flop/s)

[Diagram: the parallel algorithm factors A into L and U over a 2D block-cyclic distribution of the matrix across a process grid (P0-P3), which is used for load balancing.]

• Linear system solver (requires all-to-all communication)
• Stresses local matrix multiply performance
• DARPA HPCS goal: 2 Pflop/s (8x over current best)
HPCC Tests - DGEMM

• DGEMM = Double-precision General Matrix-matrix Multiply
• Objective: compute matrix C ← αAB + βC, A,B,C∈R^(n×n), α,β∈R
• Method: standard multiply (may be optimized)
• Performance: 2n^3/t
• Verification: scaled residual has to be small: || x – y || / (ε n || y ||), where x and y are the vectors resulting from multiplying the left- and right-hand sides of the objective expression by a random vector
• Restrictions:
  – No complexity-reducing matrix multiply (Strassen, Winograd, etc.)
  – Use only 64-bit precision arithmetic
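The update and its random-vector verification can be sketched as follows (a plain triple-loop stand-in for a tuned DGEMM; `dgemm_ref` and `dgemm_check` are illustrative names, not HPCC routines):

```c
#include <math.h>
#include <float.h>
#include <stdlib.h>

/* Reference C <- alpha*A*B + beta*C for n x n row-major matrices. */
void dgemm_ref(int n, double alpha, const double *A, const double *B,
               double beta, double *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) s += A[i*n+k] * B[k*n+j];
            C[i*n+j] = alpha * s + beta * C[i*n+j];
        }
}

/* Verification in the spirit of the HPCC check: for a random vector v,
 * compare x = C_computed * v against y = alpha*A*(B*v) + beta*C0*v and
 * return the scaled residual ||x-y||_inf / (eps * n * ||y||_inf). */
double dgemm_check(int n) {
    double *A = malloc(n*n*sizeof *A), *B = malloc(n*n*sizeof *B);
    double *C = malloc(n*n*sizeof *C), *C0 = malloc(n*n*sizeof *C0);
    double *v = malloc(n*sizeof *v), *t = calloc(n, sizeof *t);
    double *x = calloc(n, sizeof *x), *y = calloc(n, sizeof *y);
    double alpha = 1.5, beta = -0.5;
    srand(7);
    for (int i = 0; i < n*n; ++i) {
        A[i] = rand()/(double)RAND_MAX - 0.5;
        B[i] = rand()/(double)RAND_MAX - 0.5;
        C0[i] = C[i] = rand()/(double)RAND_MAX - 0.5;
    }
    for (int i = 0; i < n; ++i) v[i] = rand()/(double)RAND_MAX - 0.5;

    dgemm_ref(n, alpha, A, B, beta, C);        /* C now holds the result */

    for (int i = 0; i < n; ++i)                /* t = B*v */
        for (int j = 0; j < n; ++j) t[i] += B[i*n+j] * v[j];
    for (int i = 0; i < n; ++i)                /* x = C*v; y = alpha*A*t + beta*C0*v */
        for (int j = 0; j < n; ++j) {
            x[i] += C[i*n+j] * v[j];
            y[i] += alpha * A[i*n+j] * t[j] + beta * C0[i*n+j] * v[j];
        }
    double dmax = 0.0, ynorm = 0.0;
    for (int i = 0; i < n; ++i) {
        if (fabs(x[i]-y[i]) > dmax) dmax = fabs(x[i]-y[i]);
        if (fabs(y[i]) > ynorm)     ynorm = fabs(y[i]);
    }
    free(A); free(B); free(C); free(C0); free(v); free(t); free(x); free(y);
    return dmax / (DBL_EPSILON * n * ynorm);
}
```

Multiplying both sides by a random vector reduces an O(n^2) matrix comparison to an O(n) vector comparison while still catching any systematic error in the multiply.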
HPCC Tests - STREAM
• STREAM is a test that measures sustainable memory bandwidth (in Gbyte/s) and the corresponding computation rate for four simple vector kernels
• Objective: set a vector to a combination of other vectors
  – COPY:  c = a
  – SCALE: b = α c
  – ADD:   c = a + b
  – TRIAD: a = b + α c
• Method: simple loop that preserves the above order of operations
• Performance: 2n/t or 3n/t
• Verification: scaled residual of the computed and reference vectors needs to be small: || x – y || / (ε n || y ||)
• Restrictions:
  – Use only 64-bit precision arithmetic
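The four kernels, in the order the benchmark preserves, are one line of loop body each (a serial sketch; the benchmark times these loops over arrays far larger than cache and reports bytes moved per second — 2n words for COPY/SCALE, 3n for ADD/TRIAD):

```c
/* The four STREAM kernels over length-n double vectors. */
void stream_copy(int n, double *c, const double *a) {
    for (int i = 0; i < n; ++i) c[i] = a[i];
}
void stream_scale(int n, double *b, const double *c, double alpha) {
    for (int i = 0; i < n; ++i) b[i] = alpha * c[i];
}
void stream_add(int n, double *c, const double *a, const double *b) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}
void stream_triad(int n, double *a, const double *b, const double *c,
                  double alpha) {
    for (int i = 0; i < n; ++i) a[i] = b[i] + alpha * c[i];
}

/* Run all four kernels in order on small arrays; returns a[0] as a spot
 * check (with the initial values below the expected result is 15.0). */
double stream_sketch(void) {
    enum { N = 1000 };
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }
    double alpha = 3.0;
    stream_copy(N, c, a);            /* c = a        -> c[i] = 1  */
    stream_scale(N, b, c, alpha);    /* b = alpha*c  -> b[i] = 3  */
    stream_add(N, c, a, b);          /* c = a + b    -> c[i] = 4  */
    stream_triad(N, a, b, c, alpha); /* a = b + alpha*c -> a[i] = 15 */
    return a[0];
}
```

Because each kernel reads and writes every element exactly once, the reported rate is purely a memory-bandwidth figure, not a compute figure.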
Original and HPCC STREAM Codes

                          Original STREAM   HPCC base       Optimized
Programming language      Fortran (or C)    C only          Any
Data alignment            Implicit          OS-specific     OS-specific
Vectors’ size             Exceeds caches,   Whole memory,   Whole memory,
                          compile time      runtime         runtime
Reporting multiple of     No                Possible        Possible
number of processors
HPCC STREAM: Further Details

• Performs scalar multiply and add
• Results are reported in bytes/second

[Diagram: the parallel algorithm splits vectors A, B, and C into Np contiguous pieces, one per processor, and each processor computes A = B + s x C on its own piece.]

• Basic operations on large vectors (requires no communication)
• Stresses local processor-to-memory bandwidth
• DARPA HPCS goal: 6.5 Pbyte/s (40x over current best)
HPCC Tests - PTRANS

• PTRANS = Parallel TRANSpose
• Objective: update a matrix with the sum of its transpose and another matrix: A = A^T + B, A,B∈R^(n×n)
• Method: standard distributed-memory algorithm
• Performance: n^2/t
• Verification: scaled residual between the computed and reference matrices needs to be small: || A0 – A || / ( ε n || A0 || )
• Restrictions:
  – Use only 64-bit precision arithmetic
  – The same data distribution method as HPL
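The core update is simple to state serially (a sketch under the assumption of a single node; in the benchmark the matrix is distributed, and exchanging the blocks of A^T across the process grid is exactly what stresses the network):

```c
/* PTRANS core update A = A^T + B for an n x n row-major matrix
 * (illustrative serial version; ptrans_ref is our name, not HPCC's). */
void ptrans_ref(int n, double *A, const double *B) {
    /* transpose A in place by swapping each symmetric pair once */
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            double t = A[i*n+j]; A[i*n+j] = A[j*n+i]; A[j*n+i] = t;
        }
    /* add B elementwise */
    for (int i = 0; i < n*n; ++i) A[i] += B[i];
}
```

For example, with A = [1 2; 3 4] and B = [10 20; 30 40], the result is A^T + B = [11 23; 32 44].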
HPCC PTRANS: Further Details

[Diagram: a 9x9 matrix distributed two ways.]

• 9x9 matrix on a 3x3 process grid
  – Communicating pairs: 2-4, 3-7, 6-8
• 9x9 matrix on a 1x9 process grid
  – Communicating pairs: 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 1-8, 1-9, 2-3, …, 8-9
  – 36 pairs!
HPCC Tests - RandomAccess

• RandomAccess calculates a series of integer updates to random locations in memory
• Objective: perform computation on Table:
  Ran = 1;
  for (i=0; i<4*N; ++i) {
    Ran = (Ran<<1) ^ (((int64_t)Ran < 0) ? 7 : 0);
    Table[Ran & (N-1)] ^= Ran;
  }
• Method: loop iterations may be independent
• Performance: 4N/t
• Verification: up to 1% of updates can be incorrect
• Restrictions:
  – Use at least 64-bit integers
  – About half of memory used for ‘Table’
  – Parallel look-ahead limited to 1024 (limits locality)
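The update loop from the slide, written out as a runnable serial sketch. Because XOR is its own inverse, applying the identical update stream twice must restore the table, which gives a cheap self-check (the round-trip helper is our addition, not part of the benchmark):

```c
#include <stdint.h>
#include <stdlib.h>

/* Serial RandomAccess update loop; N must be a power of two so that
 * (Ran & (N-1)) is a valid table index. */
void random_access(uint64_t *Table, uint64_t N) {
    uint64_t Ran = 1;
    for (uint64_t i = 0; i < 4 * N; ++i) {
        /* step the pseudo-random stream: shift left, XOR 7 if the
         * top bit was set (an LFSR-style generator) */
        Ran = (Ran << 1) ^ (((int64_t)Ran < 0) ? 7 : 0);
        Table[Ran & (N - 1)] ^= Ran;   /* XOR into a random slot */
    }
}

/* Sanity check: the update stream is deterministic, so running it twice
 * XORs every touched slot with the same values twice, restoring the
 * initial table.  Returns 1 if the table is restored. */
int random_access_roundtrip(uint64_t N) {
    uint64_t *T = calloc(N, sizeof *T);
    for (uint64_t i = 0; i < N; ++i) T[i] = i;
    random_access(T, N);
    random_access(T, N);
    int ok = 1;
    for (uint64_t i = 0; i < N; ++i) if (T[i] != i) ok = 0;
    free(T);
    return ok;
}
```

In the real benchmark the table spans about half of all memory and updates land on remote processors, so each one costs a tiny message; that is what makes GUPS a latency-bound metric.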
HPCC RandomAccess: Further Details

• Randomly updates an N-element table of unsigned integers
• Each processor generates indices, sends them to all other processors, and performs the XOR updates
• Results are reported in Giga Updates Per Second (GUPS)

[Diagram: each of the Np processors generates random indices into a globally distributed Table, then sends, XORs, and updates the corresponding entries.]

• Randomly updates memory (requires all-to-all communication)
• Stresses interprocessor communication of small messages
• DARPA HPCS goal: 64,000 GUPS (2000x over current best)
HPCC Tests - FFT

• FFT = Fast Fourier Transform
• Objective: compute the discrete Fourier transform z_k = Σ_j x_j exp(-2π√-1 jk/n), x,z∈C^n
• Method: any standard framework (may be optimized)
• Performance: 5n log2(n) / t
• Verification: scaled residual for the inverse transform of the computed vector needs to be small: || x – x^(0) || / (ε log2(n))
• Restrictions:
  – Use only 64-bit precision arithmetic
  – Result needs to be in order (not bit-reversed)
HPCC FFT: Further Details

• 1D Fast Fourier Transform of an N-element complex vector
• Typically done as a parallel 2D FFT
• Results are reported in floating-point operations per second (flop/s)

[Diagram: the parallel algorithm FFTs the rows, performs a “corner turn” (transpose) across the Np processors, then FFTs the columns.]

• FFT of a large complex vector (requires all-to-all communication)
• Stresses interprocessor communication of large messages
• DARPA HPCS goal: 0.5 Pflop/s (200x over current best)
HPCC Tests – b_eff

• b_eff measures effective bandwidth and latency of the interconnect
• Objective: exchange 8-byte (for latency) and 2,000,000-byte (for bandwidth) messages in ping-pong, natural ring, and random ring patterns
• Method: use standard MPI point-to-point routines
• Performance: n/t (for bandwidth)
• Verification: simple checksum on received bits
• Restrictions:
  – The messaging routines have to conform to the MPI standard
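The bandwidth and latency figures reduce to simple bookkeeping over the measured wall-clock times (a sketch of the definitions, not the benchmark code; the function names are ours):

```c
/* Bandwidth per process: accumulated message volume on all connections
 * (both directions) divided by wall-clock time, divided by the number
 * of processes. */
double ring_bandwidth_per_process(double bytes_total, double seconds,
                                  int nprocs) {
    return bytes_total / seconds / nprocs;
}

/* Latency: wall-clock time divided by the number of sendrecv operations
 * per process, reported in microseconds. */
double ring_latency_usec(double seconds, int sendrecv_per_process) {
    return seconds / sendrecv_per_process * 1e6;
}
```

For instance, 8 GB moved in 2 s across 4 processes gives 1 GB/s per process, and 8 sendrecv operations taking 8 ms total gives a 1000-microsecond latency.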
HPCC and Computational Resources

[Diagram: each test targets one computational resource — HPL (Jack Dongarra) measures CPU computational speed; STREAM (John McCalpin) measures node memory bandwidth; random & natural ring bandwidth & latency (Rolf Rabenseifner) measure the interconnect.]
Random and Natural Ring B/W and Latency

• Parallel communication pattern on all MPI processes
– Natural ring
– Random ring
• Bandwidth per process
  – Accumulated message size / wall-clock time / number of processes
  – On each connection, messages travel in both directions
  – Measured with both 2x MPI_Sendrecv and MPI non-blocking communication; the best result is used
  – Message size = 2,000,000 bytes
• Latency
  – Same patterns, message size = 8 bytes
  – Wall-clock time / (number of sendrecv operations per process)
Inter-node B/W on Clusters of SMP Nodes
• Random ring
  – Reflects the other dimension of a Cartesian domain decomposition, and
  – Communication patterns in unstructured grids
  – Some connections are inside the nodes
  – Most connections are inter-node
  – Depends on #nodes and #MPI processes per node
Accumulated Inter-node B/W on Clusters of SMPs

• Accumulated bandwidth := bandwidth per process x #processes
  ≈ accumulated inter-node bandwidth / (1 – 1/#nodes)
• Similar to bisection bandwidth
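Under the slide's approximation — on a random ring, roughly (1 - 1/#nodes) of all connections cross the network — the accumulated figures are simple arithmetic (illustrative functions, names ours):

```c
/* Accumulated random-ring bandwidth across the whole machine. */
double accumulated_bandwidth(double bw_per_process, int nprocs) {
    return bw_per_process * nprocs;
}

/* Estimated accumulated inter-node bandwidth: with random neighbors,
 * about (1 - 1/nodes) of connections are inter-node, so that fraction
 * of the accumulated bandwidth crosses the network. */
double inter_node_bandwidth(double bw_per_process, int nprocs, int nodes) {
    return accumulated_bandwidth(bw_per_process, nprocs) * (1.0 - 1.0 / nodes);
}
```

For example, 64 processes at 1 GB/s each on 16 nodes give 64 GB/s accumulated, of which about 60 GB/s is inter-node traffic.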
HPCC Awards Overview
• Goals
  – Increase awareness of HPCC
  – Increase awareness of HPCS and its goals
  – Increase the number of HPCC submissions
    • Expanded view of the largest supercomputing installations
• Means
  – HPCwire sponsorships and press coverage
  – HPCS mission partners’ contribution
  – HPCS vendors’ contribution
HPCC Awards Rules for SC07

• Class 1: Best Performance
  – Figure of merit: raw system performance
  – Submission must be a valid HPCC database entry
    • Side effect: populate the HPCC database
  – 4 categories: HPL, STREAM-system, RandomAccess, FFT
  – Award certificates
    • 4x $750 from IDC (was $500)
• Class 2: Most Productivity
  – Figure of merit: performance (50%) and elegance (50%)
    • Highly subjective
    • Based on committee vote
  – Submission must implement at least 4 out of 4 Class 1 tests
    • The more tests the better
    • Performance numbers are a plus
  – The submission process:
    • Source code
    • “Marketing brochure”
    • SC07 BOF presentation
  – Award certificate
    • $2000 from IDC (was $1500)
HPCC Awards Committee
• David Bailey, LBNL NERSC
• Jack Dongarra (Co-Chair), Univ. of Tennessee/ORNL
• Jeremy Kepner (Co-Chair), MIT Lincoln Lab
• Bob Lucas, USC/ISI
• Rusty Lusk, Argonne National Lab
• Piotr Luszczek, Univ. of Tennessee
• John McCalpin, AMD
• Rolf Rabenseifner, HLRS Stuttgart
• Daisuke Takahashi, Univ. of Tsukuba
• Jeffrey Vetter, ORNL
SC|05 HPCC Awards Class 1 - HPL

[Chart: G-HPL results over time (May 2003 to Feb 2006), growing from 110 Gflop/s to 259 Tflop/s. HPCS goal: 2000 Tflop/s; TOP500 #1: 280 Tflop/s, a further 7x from the goal.]

Winners:
1. IBM BG/L (LLNL), 259 Tflop/s
2. IBM BG/L (Watson), 67 Tflop/s
3. IBM Power5 (LLNL), 58 Tflop/s

TOP500 systems in the HPCC database: #1, #2, #3, #4, #10, #14, #17, #35, #37, #71, #80
SC|05 HPCC Awards Class 1 – STREAM-sys

[Chart: G-STREAM Triad results over time (May 2003 to Feb 2006), growing from 27 GB/s to 160 TB/s. HPCS goal: 6500 TB/s, 40x over current best.]

Winners:
1. IBM BG/L (LLNL), 160 TB/s
2. IBM Power5 (LLNL), 55 TB/s
3. IBM BG/L (Watson), 40 TB/s
SC|05 HPCC Awards Class 1 – FFT

[Chart: G-FFT results over time (May 2003 to Feb 2006), growing from 4 Gflop/s to 2311 Gflop/s. HPCS goal: 500 Tflop/s, 200x over current best.]

Winners:
1. IBM BG/L (LLNL), 2.3 Tflop/s
2. IBM BG/L (Watson), 1.1 Tflop/s
3. IBM Power5 (LLNL), 1.0 Tflop/s
SC|05 HPCC Awards Class 1 – RandomAccess

[Chart: G-RandomAccess results over time (May 2003 to Feb 2006), growing from 0.01 GUPS to 35 GUPS. HPCS goal: 64,000 GUPS, about 1800x over current best.]

Winners:
1. IBM BG/L (LLNL), 35 GUPS
2. IBM BG/L (Watson), 17 GUPS
3. Cray X1E (ORNL), 8 GUPS
SC|05 HPCC Awards: Class 2

Languages, with the number of Class 1 tests (HPL, RandomAccess, STREAM, FFT) implemented:
• Sample submissions from committee members: Python+MPI (2), pMatlab (4)
• Winners: Cray MTA C (2), UPC x3 (3)
• Finalists: Cilk (4), Parallel Matlab (4), MPT C (2), OpenMP/C++ (2), StarP (2), HPF (2)
SC06 HPCC Awards: Class 2

1. Best overall productivity: Cilk (Bradley Kuszmaul)
2. Best productivity in performance: UPC (Calin Cascaval)
3. Best productivity and elegance: Parallel MATLAB (Cleve Moler)
4. Best student paper: MC# (Vadim Guzev)
5. Honorable mention: Chapel (Brad Chamberlain), X10 (Vivek Sarkar)
SC07 HPCC Awards BOF
Awards will be presented at the SC07 HPC Challenge BOF
Tuesday, November 13, 2007 In Rooms C1/C2/C3 12:15-1:15
Reporting Results - Etiquette
• The HPC Challenge benchmark suite has been designed to permit academic-style usage for comparing
  – Technologies
  – Architectures
  – Programming models
• There is an overt attempt to keep HPC Challenge significantly different from “commercialized” benchmark suites
  – Vendors and users can submit results
  – System “cost/price” is intentionally not included
  – No “composite” benchmark metric
• Be cool about comparisons!
  – While we cannot enforce any rule to limit comparisons, observe the rules of academic honesty and good taste
Total Number of HPCC Submissions

[Bar chart: cumulative number of HPCC submissions from 2003-05-09 to 2006-10-10, growing from 0 to about 110.]
Base vs. Optimized Submission

• Optimized G-RandomAccess is a UPC code
  – ~125x improvement

System                    Procs  Run   G-HPL     G-PTRANS  G-RandomAccess  G-FFTE     G-STREAM      EP-STREAM     EP-DGEMM   RandomRing  RandomRing
                                       (TFlop/s) (GB/s)    (GUPS)          (GFlop/s)  Triad (GB/s)  Triad (GB/s)  (GFlop/s)  B/W (GB/s)  latency (usec)
Cray X1E (mfeg8) 1.13GHz  248    opt   3.3889    66.01     1.85475         -1         3280.9        13.229        13.564     0.29886     14.58
Cray X1E MSP 1.13GHz      252    base  3.1941    85.204    0.014868        15.54      2440          9.682         14.185     0.36024     14.93
Exploiting the Hybrid Programming Model

• The NEC SX-7 architecture permits defining threads and processes so as to significantly enhance performance of the EP versions of the benchmark suite by allocating more powerful “nodes”
  – EP-STREAM
  – EP-DGEMM

System    Speed     Count  Threads  Processes  G-HPL     G-PTRANS  G-RandomAccess  G-FFTE     G-STREAM      EP-STREAM     EP-DGEMM   RandomRing  RandomRing
                                               (TFlop/s) (GB/s)    (GUPS)          (GFlop/s)  Triad (GB/s)  Triad (GB/s)  (GFlop/s)  B/W (GB/s)  latency (usec)
NEC SX-7  0.552GHz  32     16       2          0.2174    16.34     0.000178        1.34       984.3         492.161       140.636    8.14753     4.85
NEC SX-7  0.552GHz  32     1        32         0.2553    20.546    0.000964        11.29      836.9         26.154        8.239      5.03934     14.21
Kiviat Charts: Comparing Interconnects
• AMD Opteron clusters
  – 2.2 GHz
  – 64-processor cluster
• Interconnects
  1. GigE
  2. Commodity
  3. Vendor
• The interconnects cannot be differentiated based on:
  – HPL
  – Matrix-matrix multiply
• Kiviat charts (radar plots) are available on the HPCC website
Augmenting TOP500’s 26th Edition with HPCC
Computer Rmax HPL PTRANS STREAM FFT GUPS Latency B/W
1 BlueGene/L 281 259 374 160 2311 35.5 6 0.2
2 BGW 91 84 172 50 84 21.6 5 0.2
3 ASC Purple 63 58 576 44 967 0.2 5 3.2
4 Columbia 52 47 91 21 230 0.2 4 1.4
5 Thunderbird 38
6 Red Storm 36 33 1813 44 1118 1.0 8 1.2
7 Earth Simulator 36
8 MareNostrum 28
9 Stella 27
10 Jaguar 21 20 944 29 855 0.7 7 1.2
Augmenting TOP500’s 27th Edition with HPCC
Computer Rmax HPL PTRANS STREAM FFT GUPS Latency B/W
1 BlueGene/L 280.6 259.2 4665.9 160 2311 35.47 5.92 0.159
2 BGW (*) 91 83.9 171.55 50 1235 21.61 4.70 0.159
3 ASC Purple 75.8 57.9 553 55 842 1.03 5.1 3.184
4 Columbia (*) 51.87 46.78 91.31 20 229 0.25 4.23 0.896
5 Tera-10 42.9
6 Thunderbird 38.27
7 Fire x4600 38.18
8 eServer BlueGene 37.33
9 Red Storm 36.19 32.99 1813.06 43.58 1118 1.02 7.97 1.149
10 Earth Simulator 35.86
Normalize with Respect to Peak flop/s
Computer HPL RandomAccess STREAM FFT
Cray XT3 81.4% 0.031 1168.8 38.3
Cray X1E 67.3% 0.422 696.1 13.4
IBM POWER5 53.5% 0.003 703.5 15.5
IBM BG/L 70.6% 0.089 435.7 6.1
SGI Altix 71.9% 0.003 308.7 3.5
NEC SX-8 86.9% 0.002 2555.9 17.5
Normalizing with peak flop/s cancels out number of processors:
G-HPL / Rpeak = (local-HPL*P) / (local-Rpeak*P) = local-HPL / local-Rpeak
Good: can compare systems with different number of processors. Bad: it’s easy to scale peak and harder to scale real calculation.
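The cancellation in the formula above can be checked numerically (an illustrative function, assuming P identical processors):

```c
/* Normalizing G-HPL by Rpeak cancels the processor count: with P
 * identical processors, G-HPL ~= local_hpl * P and
 * Rpeak = local_rpeak * P, so the ratio is the per-processor
 * efficiency regardless of P. */
double hpl_efficiency(double local_hpl, double local_rpeak, int p) {
    double g_hpl = local_hpl  * (double)p;
    double rpeak = local_rpeak * (double)p;
    return g_hpl / rpeak;              /* == local_hpl / local_rpeak */
}
```

This is exactly why the normalized table allows comparing systems of very different sizes, and also why it hides how hard it is to scale the real calculation rather than the peak.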
Correlation: HPL vs. Theoretical Peak

[Scatter plot: theoretical peak (Tflop/s) versus HPL (Tflop/s) for systems in the HPCC database; Cray XT3, NEC SX-8, and SGI Altix are labeled.]

• How well does HPL data correlate with theoretical peak performance?
Correlation: HPL vs. DGEMM
[Scatter plot: HPL (Tflop/s, 0-25) vs. DGEMM (Gflop/s, 0-30000); labeled systems include Cray XT3, NEC SX-8, and SGI Altix]
• Can I just run DGEMM instead of HPL? • DGEMM alone overestimates HPL performance • Note the 1,000x difference in scales! (Tera/Giga) • Exercise: correlate HPL versus Processors x DGEMM
Correlation: HPL vs. STREAM-Triad
[Scatter plot: HPL (Tflop/s, 0-25) vs. STREAM Triad (GB/s, 0-30000); labeled systems include Cray XT3, NEC SX-8, Cray X1E/opt, and SGI Altix]
• How well does HPL correlate with STREAM-Triad performance?
Correlation: HPL vs. RandomAccess
[Scatter plot: HPL (Tflop/s, 0-25) vs. G-RandomAccess (GUPS, 0-2); labeled systems include Cray X1E/opt, Cray XT3, Rackable, SGI Altix, IBM BG/L, and NEC SX-8]
• How well does HPL correlate with G-RandomAccess performance? • Note the 1,000x difference in scales! (Tera/Giga)
Correlation: HPL vs. FFT
[Scatter plot: HPL (Tflop/s, 0-25) vs. FFT (Gflop/s, 0-1000); labeled systems include Cray XT3, NEC SX-8, and SGI Altix]
• How well does HPL correlate with FFT performance? • Note the 1,000x difference in scales! (Tera/Giga)
Correlation: STREAM vs. PTRANS
[Scatter plot: Global STREAM (GB/s, 0-30000) vs. PTRANS (GB/s, 0-1000); labeled systems include Cray XT3 and NEC SX-8]
• How well does STREAM data correlate with PTRANS performance?
Correlation: RandomRing B/W vs. PTRANS
[Scatter plot: RandomRing Bandwidth (GB/s, 0-16) vs. PTRANS (GB/s, 0-1000); labeled systems include Cray XT3, NEC SX-8, and NEC SX-7]
• How well does RandomRing Bandwidth data correlate with PTRANS performance? • Possible bad data?
Correlation: #Processors vs. RandomAccess
[Scatter plot: Number of Processors (0-6000) vs. G-RandomAccess (GUPS, 0-2); labeled systems include Cray X1E/opt, Cray XT3, Rackable, IBM BG/L, and SGI Altix]
• Does G-RandomAccess scale with the number of processors?
Correlation: Random-Ring vs. RandomAccess
[Scatter plot: RandomRing Bandwidth (GB/s, 0-16) vs. G-RandomAccess (GUPS, 0-2); labeled systems include Cray X1E/opt, NEC SX-8, and NEC SX-7]
• Does G-RandomAccess scale with the RandomRing Bandwidth? • Possible bad data?
Correlation: Random-Ring vs. RandomAccess
[Scatter plot: RandomRing Bandwidth (GB/s, 0-2) vs. G-RandomAccess (GUPS, 0-2); labeled system: Cray X1E/opt]
• Does G-RandomAccess scale with RandomRing Bandwidth? • Ignoring possible bad data...
Correlation: Random-Ring vs. RandomAccess
[Scatter plot: RandomRing Latency (usec, 0-60) vs. G-RandomAccess (GUPS, 0-2); labeled systems include Cray X1E/opt and Rackable]
• Does G-RandomAccess scale with RandomRing Latency?
Correlation: RandomAccess Local vs. Global
[Scatter plot: Single Processor RandomAccess (GUPS) vs. G-RandomAccess (GUPS, 0-2), per system; labeled systems include Cray X1E/opt, Cray XT3, and Rackable]
• Does G-RandomAccess scale with single-processor RandomAccess performance (per system)?
Correlation: RandomAccess Local vs. Global
[Scatter plot: Single Processor RandomAccess (GUPS, 0-0.3) vs. G-RandomAccess (GUPS, 0-2), per processor; labeled systems include Cray X1E/opt and Rackable]
• Does G-RandomAccess scale with single-processor RandomAccess performance?
Correlation: STREAM-Triad vs. RandomAccess
[Scatter plot: STREAM Triad (GB/s, 0-30000) vs. G-RandomAccess (GUPS, 0-2); labeled systems include Cray X1E/opt, Cray XT3, Rackable, IBM BG/L, SGI Altix, and NEC SX-8]
• Does G-RandomAccess scale with STREAM Triad?
RandomAccess Correlations Summary
• G-RandomAccess doesn’t correlate with other tests • Biggest improvement comes from code optimization – Limit on parallel look-ahead forces short messages • Typical MPI performance killer • Communication/computation overlap is wasted by message handling overhead – UPC implementation can be integrated with existing MPI code base to yield orders of magnitude speedup – Using interconnect topology and lower-level (less overhead) messaging layer is also a big win • Generalization from 3D torus: hyper-cube algorithm
Principal Component Analysis
• Correlates all the tests simultaneously
• Load vectors hint at the relationship between the original results and the principal components
• Initial analysis:
– 29 tests: Rpeak, HPL, PTRANS (x1); DGEMM (x2); FFT, RandomAccess (x3); STREAM (x8); b_eff (x10)
– 103 computers, each with all 29 valid (up-to-date) values
• Principal components (eigenvalues of the covariance matrix of zero-mean, unscaled data): 4.57, 1.17, 0.41, 0.15, 0.0085, 0.0027
• Only 3 (4) significant components – redundancy, or an opportunity to cross-check?
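The eigenvalue computation behind such an analysis can be sketched for the two-test case with only the standard library; the function name and the two-column restriction are illustrative, not part of the HPCC tools. A near-zero trailing eigenvalue is the redundancy signal discussed above.

```python
import math

def covariance_eigenvalues_2d(xs, ys):
    """Eigenvalues (largest first) of the 2x2 covariance matrix of two
    zero-meaned, unscaled result columns -- a toy version of the
    29-column analysis described on the slide."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)        # var(x)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)        # var(y)
    b = sum((x - mx) * (y - my)
            for x, y in zip(xs, ys)) / (n - 1)          # cov(x, y)
    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mid + rad, mid - rad
```

Two perfectly correlated columns give one dominant and one zero eigenvalue, i.e. one of the two "tests" is redundant.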
Effective Bandwidth Analysis
• All results in words/second
• Highlights the memory hierarchy
• Clusters – hierarchy steepens (~10^6 spread)
• HPC systems – hierarchy constant (~10^4 spread)
• DARPA HPCS goals – hierarchy flattens (~10^2 spread) – easier to program
[Chart: effective bandwidth (words/second, Kilo to Peta) for Top500/peak, STREAM, FFT, and RandomAccess across systems in Top500 order – Dell GigE P1-P64 (MITLL), Opteron (AMD), Cray X1E (AHPCRC), SGI Altix (NASA), NEC SX-8 (HLRS), Cray X1 and X1 Opt (ORNL), Cray XT3 (ERDC, ORNL), IBM Power5 (LLNL), IBM BG/L and BG/L Opt (LLNL) – plus the DARPA HPCS goals]
Public Web Resources
• Main HPCC website – http://icl.cs.utk.edu/hpcc/
• HPCC Awards – http://www.hpcchallenge.org/
• HPL – http://www.netlib.org/benchmark/hpl/
• STREAM – http://www.cs.virginia.edu/stream/
• PTRANS – http://www.netlib.org/parkbench/html/matrix-kernels.html
• FFTE – http://www.ffte.jp/
HPCC Makefile Structure (1/2)
• Sample Makefiles live in hpl/setup
• BLAS
– LAdir – BLAS top directory for the other LA-variables
– LAinc – where BLAS headers live (if needed)
– LAlib – where BLAS libraries live (libblas.a and friends)
– F2CDEFS – resolves Fortran/C calling issues (BLAS is usually callable from Fortran)
• -DAdd_, -DNoChange, -DUpCase, -Dadd__
• -DStringSunStyle, -DStringStructPtr, -DStringStructVal, -DStringCrayStyle
• MPI
– MPdir – MPI top directory for the other MP-variables
– MPinc – where MPI headers live (mpi.h and friends)
– MPlib – where MPI libraries live (libmpi.a and friends)
HPCC Makefile Structure (2/2)
• Compiler – CC – C compiler – CCNOOPT – C flags without optimization (for optimization-sensitive code) – CCFLAGS – C flags with optimization • Linker – LINKER – program that can link BLAS and MPI together – LINKFLAGS – flags required to link BLAS and MPI together • Programs/commands – SHELL, CD, CP, LN_S, MKDIR, RM, TOUCH – ARCHIVER, ARFLAGS, RANLIB
MPI Implementations for HPCC
• Vendor – Cray (MPT) – IBM (POE) – SGI (MPT) – Dolphin, Infiniband (Mellanox, Voltaire, ...), Myricom (GM, MX), Quadrics, PathScale, Scali, ...
• Open Source – MPICH1, MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich/) – LAM/MPI (http://www.lam-mpi.org/) – Open MPI (http://www.open-mpi.org/) – LA-MPI (http://public.lanl.gov/lampi/)
• MPI implementation components – Compiler wrapper (adds MPI header directories) – Linker (may need to link in Fortran I/O) – Launcher (poe, mprun, mpirun, aprun, mpiexec, ...)
Fast BLAS for HPCC
• Vendor
– AMD (AMD Core Math Library)
– Cray (SciLib)
– HP (MLIB)
– IBM (ESSL)
– Intel (Math Kernel Library)
– SGI (SGI/Cray Scientific Library)
– ...
• Free implementations
– ATLAS – http://www.netlib.org/atlas/
– Goto BLAS – http://www.cs.utexas.edu/users/flame/goto and http://www.tacc.utexas.edu/resources/software
• Implementations that use threads
– Some vendor BLAS
– ATLAS
– Goto BLAS
• You should never use the reference BLAS from Netlib
– There are better alternatives for every system in existence
Tuning Process: Internal to HPCC Code
• Changes to source code are not allowed for submission • But just for tuning it's best to change a few things – Switch off some tests temporarily • Choosing right parallelism levels – Processes (MPI) – Threads (OpenMP in code, vendor in BLAS) – Processors – Cores • Compile time parameters – More details below • Runtime input file – More details below
Tuning Process: External to HPCC Code
• MPI settings examples – Messaging modes • Eager polling is probably not a good idea – Buffer sizes – Consult MPI implementation documentation • OS settings – Page size • Large page size should be better on many systems – Pinning down the pages • Optimize affinity on DSM architectures – Priorities – Consult OS documentation
Parallelism Examples with Comments
• Pseudo-threading helps but be careful – Hyper-threading – Simultaneous Multi-Threading – ... • Cores – Intel (x86-64, Itanium), AMD (x86) – Cray: SSP, MSP – IBM Power4, Power5, ... – Sun SPARC • SMP – BlueGene/L (single/double CPU usage per card) – SGI (NUMA, ccNUMA, DSM) – Cray, NEC • Others – Cray MTA (no MPI !)
HPCC Input and Output Files
• Parameter file hpccinf.txt
– HPL parameters (lines 5-31)
– PTRANS parameters (lines 32-36)
– Indirectly: sizes of arrays for all HPCC components
– Many HPL and PTRANS parameters might not be optimal
• Memory file hpccmemf.txt
– Memory available per MPI process: Process=64
– Memory available per thread: Thread=64
– Total available memory: Total=64
• Output file hpccoutf.txt
– File name is hard-coded
– Must be uploaded to the website
– Easy to parse
– More details later...
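A minimal reader for the hpccmemf.txt convention above (a single Process=, Thread=, or Total= specification, value in megabytes) might look like this; the function name is made up for illustration.

```python
def parse_hpccmemf(text):
    """Return (kind, megabytes) from an hpccmemf.txt-style spec,
    where kind is one of 'Process', 'Thread', or 'Total'."""
    for line in text.splitlines():
        line = line.strip()
        if "=" in line and not line.startswith("#"):
            key, _, value = line.partition("=")
            return key.strip(), int(value)
    raise ValueError("no memory specification found")
```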
Tuning HPL: Introduction
• Performance of HPL comes from
– BLAS
– the input file hpccinf.txt
• Essential parameters in the input file
– N – matrix size
– NB – blocking factor (influences BLAS performance and load balance)
– PMAP – process mapping (depends on network topology)
– PxQ – process grid
[Diagram: the system A x = b with matrix A partitioned into NB x NB blocks]
Tuning HPL: More Definitions
• Process grid parameters: P, Q, and PMAP (example: P=3, Q=4)

PMAP=R (row-major):      PMAP=C (column-major):
P0  P1  P2  P3           P0  P3  P6  P9
P4  P5  P6  P7           P1  P4  P7  P10
P8  P9  P10 P11          P2  P5  P8  P11
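The two layouts above can be sketched with a small helper (illustrative only, not part of HPL): row-major assigns ranks along rows, column-major down columns.

```python
def process_grid(p, q, pmap):
    """Lay out p*q MPI ranks on a p-by-q grid.
    pmap='R': row-major (ranks increase along each row);
    pmap='C': column-major (ranks increase down each column)."""
    if pmap == "R":
        return [[r * q + c for c in range(q)] for r in range(p)]
    return [[c * p + r for c in range(q)] for r in range(p)]
```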
Tuning HPL: Selecting Process Grid
Tuning HPL: Selecting Number of Processors
[Figure: prime processor counts – 37, 41, 43, 47, 53, 59, 61 – highlighted; a prime count only factors into a degenerate 1xN process grid]
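Why primes are a poor choice can be seen from a small factorization sketch (illustrative, not an HPL utility): a prime count admits no grid other than 1 x n, while composite counts allow the near-square P x Q grids HPL generally prefers.

```python
def best_grid(n):
    """Most 'square' p*q = n factorization with p <= q.
    A prime n falls through to the degenerate (1, n) grid."""
    best = (1, n)
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            best = (p, n // p)   # larger p is closer to square
    return best
```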
Tuning HPL: Selecting Matrix Size
[Chart: HPL performance vs. matrix size N – "too small" underutilizes the machine, "too big" exhausts memory; best performance lies between, and non-optimal parameters lower the whole curve]
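A common rule of thumb for picking N (the 80% memory fraction below is a typical starting point, not an HPCC requirement) is to fill most of memory with the 8-byte-per-element matrix and round down to a multiple of NB:

```python
import math

def hpl_matrix_size(mem_bytes_total, nb, fraction=0.8):
    """Rule-of-thumb HPL problem size: N^2 doubles should occupy
    about `fraction` of total memory; round N down to a multiple
    of the blocking factor NB."""
    n = int(math.sqrt(fraction * mem_bytes_total / 8))
    return (n // nb) * nb
```

For example, 64 GB of aggregate memory with NB=192 yields N=82752.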
Tuning HPL: Further Details
• Many more details are available from HPL's author, Antoine Petitet:
• http://www.netlib.org/benchmark/hpl/
Tuning RandomAccess
• New RandomAccess algorithms introduced in version 1.2 – Contributed by benchmarking group from Sandia National Lab • Prior versions only had one algorithm – Each update required an MPI call • New algorithms are selected via C preprocessor macros – #define RA_SANDIA_NOPT – #define RA_SANDIA_OPT2
Tuning FFT
• Compile-time parameters – FFTE_NBLK • blocking factor – FFTE_NP • padding (to alleviate negative cache-line effects) – FFTE_L2SIZE • size of level 2 cache • Use FFTW instead of FFTE – Define USING_FFTW symbol during compilation – Add FFTW location and library to linker flags
Tuning STREAM
• Intended to measure main memory bandwidth • Requires many optimizations to run at full hardware speed – Software pipelining – Prefetching – Loop unrolling – Data alignment – Removal of array aliasing • Original STREAM has advantages – Constant array sizes (known at compile time) – Static storage of arrays (at full compiler's control)
Tuning PTRANS
• Parameter file hpccinf.txt
– Line 33 – number of matrix sizes
– Line 34 – matrix sizes (must not be too small – enforced in the code)
– Line 35 – number of blocking factors
– Line 36 – blocking factors (very influential for performance)
• No need to worry about BLAS
Tuning b_eff
• b_eff (Effective bandwidth and latency) test can also be tuned • Tuning must use only standard MPI calls • Examples – Persistent communication – One-sided communication
HPCC Output File
• The output file has two parts – Verbose output (free format) – Summary section • Pairs of the form: name=value • The summary section names – MPI* — global results • Example: MPIRandomAccess_GUPs – Star* — embarrassingly parallel results • Example: StarRandomAccess_GUPs – Single* — single process results • Example: SingleRandomAccess_GUPs
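Since the summary section is plain name=value pairs, a few lines of scripting suffice to collect it; this parser is a sketch, not an official HPCC tool.

```python
def parse_summary(output_text):
    """Collect name=value pairs from an hpccoutf.txt summary section.
    Numeric values become floats; anything else is kept as a string."""
    results = {}
    for line in output_text.splitlines():
        if "=" in line:
            name, _, value = line.partition("=")
            try:
                results[name.strip()] = float(value)
            except ValueError:
                results[name.strip()] = value.strip()
    return results
```

The MPI*/Star*/Single* prefixes then distinguish global, embarrassingly parallel, and single-process numbers when post-processing.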
Optimized Submission Ideas
• For an optimized run, the same MPI harness has to be run on the same system
• Certain routines – the timed regions – can be replaced
• The verification has to pass – this limits data layout and the accuracy of optimizations
• Variations of the reference implementation are allowed (within reason) – no Strassen algorithm for HPL, due to its different operation count
• Non-portable C directives can significantly boost performance – example: #pragma ivdep
• Alternative messaging substrates can be used – removes MPI overhead
• Alternative languages can be used – allows direct access to non-portable hardware features – UPC was used to increase RandomAccess performance by orders of magnitude
• Optimizations need to be explained upon results submission
Performance Modeling and its uses with HPC Challenge (HPCC) Benchmark Suite
Allan Snavely
Hierarchical systems
– Modern supercomputers are composed of a hierarchy of subsystems: • CPU and registers • Cache (multiple levels) • Main (local) memory • System interconnect to other CPUs and memory • Disk • Archival storage – Latencies span 11 orders of magnitude, nanoseconds to minutes: a solar-system scale (it is 794 million miles to Saturn) – Bandwidths taper – Cache bandwidths are generally at least four times larger than local memory bandwidth
“Red Shift”
• While the absolute speeds of all computer subcomponents have been changing rapidly, they have not all been changing at the same rate. For example, the analysis in [FoSC] indicates that, while peak processor speeds increased at 59% per year between 1988 and 2004, memory speeds of commodity processors have been increasing at only 23% per year since 1995, and DRAM latency has been improving at a pitiful 5.5% per year. This “red shift” in the latency distance between components has produced a crisis of sorts for computer architects, and a variety of complex latency-hiding mechanisms now drive the design of computers as a result.
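Compounding those per-year rates shows how quickly the gap opens; the decade horizon below is an illustrative choice.

```python
def growth(rate_percent, years):
    """Compound annual improvement factor for a given %/year rate."""
    return (1 + rate_percent / 100.0) ** years

# Peak flop/s at 59%/yr vs. DRAM latency at 5.5%/yr over one decade:
# the processor/memory "red shift" factor.
gap = growth(59, 10) / growth(5.5, 10)
```

After ten years the two curves differ by roughly a factor of 60, which is why latency hiding dominates machine design.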
Moore’s Law
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. • Moore’s law has had a decidedly mixed impact, creating new opportunities to tap into exponentially increasing computing power while raising fundamental challenges as to how to harness it effectively. • The Red Queen Syndrome – “It takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!” • Things Moore never said: » “computers double in speed every 18 months” ☺ » “cost of computing is halved every 18 months” ☺ » “cpu utilization is halved every 18 months”
Moore’s Law
Snavely’s Top500 Laptop?
– Among other startling implications is the fact that the peak performance of the typical laptop would have placed it as one of the 500 fastest computers in the world as recently as 1995.
– Shouldn’t we all just go find another job now?
– No, because Moore’s Law has several more subtle implications, and these have raised a series of challenges to utilizing the apparently ever-increasing availability of compute power; these implications must be understood to see why HPCC and modeling are needed
Real Work (the fly in the ointment)
• Scientific calculations involve operations upon large amounts of data, and it is in moving data around within the computer that the trouble begins. As a very simple pedagogical example consider the expression.
A + B = C
• This generic expression takes two values (arguments) and performs an arithmetic operation to produce a third. Implicit in such a computation is that A and B must be loaded from where they are stored, and the value C stored after it has been calculated. In this example, there are three memory operations and one floating-point operation (flop).
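That 3:1 memory-to-flop ratio translates directly into a bandwidth demand: with 8-byte doubles, each flop asks for 24 bytes of traffic. A small sketch (illustrative helper, not a benchmark):

```python
def required_bandwidth_gbs(gflops, bytes_per_flop=24):
    """GB/s of memory traffic needed to sustain `gflops` Gflop/s of
    A + B = C work: two 8-byte loads plus one 8-byte store per flop."""
    return gflops * bytes_per_flop
```

Sustaining even 1 Gflop/s of such code needs 24 GB/s of memory bandwidth, far beyond what commodity memory systems deliver per core.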
Performance model defn.
• a calculable expression that takes as parameters attributes of application software, input data, and target machine hardware (and possibly other factors) and computes, as output, expected performance
Uses of performance models
• As an independent way of vetting performance guarantees (as in support of a procurement)
• To better understand what you get for your $
• To guide code tuning
• To inform future system design
• To make sense of, and to generalize, the results of benchmarks
“The increased publicity enjoyed by HPCC doesn't necessarily translate into deeper understanding of the performance metrics that HPCC measures. And so this tutorial will introduce attendees to HPCC, provide tools to examine differences in HPC architectures, and give hands-on training that will hopefully lead to better understanding of parallel environments.”
Performance models today are about 90% accurate
168 “blind” predictions for DoD applications (HYCOM, AVUS, Overflow2, LAMMPS): overall absolute relative error = 11.7%, overall relative error = -0.5%
A Genetic Algorithms Approach to Modeling the Performance of Memory-bound Computations M. Tikir, L. Carrington, E. Strohmaier, A. Snavely, Proceedings of the International Conference High Performance Computing, Networking, and Storage, November 2007, Reno
How does it work? Review of the PMaC methodology
• Machine Profile - characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations abstract from the particular application. • Application Signature - detailed summaries of the fundamental operations to be carried out by the application independent of any particular machine.
Combine Machine Profile and Application Signature using: • Convolution Methods - algebraic mappings of the Application Signatures on to the Machine profiles to arrive at a performance prediction.
Pieces of PMaC Framework
Single-Processor Model
– Machine Profile (Machine A): characterization of the memory performance capabilities of Machine A
– Application Signature (Application B): characterization of the memory operations that need to be performed by Application B
– Convolution Method: maps the memory usage needs of Application B onto the capabilities of Machine A
  Exe. time = Memory ops / Mem. rate + FP ops / FP rate
Communication Model
– Machine Profile (Machine A): characterization of the network performance capabilities of Machine A
– Application Signature (Application B): characterization of the network operations that need to be performed by Application B
– Convolution Method: maps the network usage needs of Application B onto the capabilities of Machine A
  Exe. time = comm. op1 / op1 rate + comm. op2 / op2 rate
Result: performance prediction of Application B on Machine A
Formally, let P(m,n) = A(m,p) · M(p,n)
• P is a matrix of runtimes: p_ij = Σ_{k=1}^{p} a_ik m_kj
• The i's (rows) are applications and the j's (columns) are machines; an entry of P is the (real or predicted) runtime of application i on machine j
• The rows of A are applications and its columns are operation counts; read a row to get an application signature
• The rows of M are benchmark-measured bandwidths and its columns are machines; read a column to get a machine profile

⎡p11 p12 p13⎤   ⎡a11 a12 a13 a14 a15⎤   ⎡m11 m12 m13⎤
⎢p21 p22 p23⎥ = ⎢a21 a22 a23 a24 a25⎥ x ⎢m21 m22 m23⎥
⎣p31 p32 p33⎦   ⎣a31 a32 a33 a34 a35⎦   ⎢m31 m32 m33⎥
                                        ⎢m41 m42 m43⎥
                                        ⎣m51 m52 m53⎦

For example: p32 = a31 m12 "+" a32 m22 "+" a33 m32 "+" a34 m42 "+" a35 m52
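The matrix product can be sketched directly; here the combination is a plain inner product, whereas in the real PMaC convolution the quoted "+" signals that terms need not combine as a literal sum, and M is taken to hold per-operation times (reciprocals of measured rates) so the result has units of time.

```python
def predict_runtimes(A, M):
    """P = A * M: rows of A are application signatures (operation
    counts), columns of M are machine profiles (time per operation),
    and p[i][j] is the predicted runtime of application i on machine j."""
    return [[sum(a_ik * M[k][j] for k, a_ik in enumerate(row))
             for j in range(len(M[0]))] for row in A]
```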
Example: Convolving MEMbench data with Metasim Tracer Data
• A memory bandwidth benchmark measures memory rates (MB/s) for different levels of cache, and tracing reveals the different access patterns
[Chart: memory bandwidth (MB/s, up to ~1.4e4) vs. size (8-byte words, 1e3-1e9) for IBM p655, SGI Altix, and IBM Opteron; plateaus correspond to stride-one access in L1 cache, L1/L2 cache, L2/L3 cache, and L3 cache/main memory]
PMaCinst binary instrumentation library
• PMaCinst works for
– the PowerPC architecture family • AIX operating system • XCOFF object file format
• Fast compared to similar tools
– Lower overhead than Dyninst on AIX: 2.2-3.4 times faster
• Slowdown
– Under 2x for jbb instrumentation
– Mostly under 10x for cache simulations (MetaSim)
• Simulated 51 HPC systems (almost all DoE, NSF, and DoD systems)
• Tested on full-scale HPC applications
– S3D, Amr, Wrf, Overflow, Lammps, Hycom, Avus, Cth
Other latencies also matter
[Chart: DataStar IBM Power4 memory bandwidth vs. working-set size for four access patterns – stride-one, random-stride, branch, and FP dependence]
More Machines
[Charts: the same four access-pattern bandwidth curves (stride-one, random-stride, branch, FP dep) measured on lemieux, huricane, O3900, and tempest]
Convolutions put the two together for performance predictions: a MetaSim trace collected on a PETSc matrix-vector code on 4 CPUs, with user-supplied memory parameters for PSC's TCSini

Procedure  % Mem.  Ratio   L1 Hit  L2 Hit  Data Set Location  Memory BW
Name       Ref.    Random  Rate    Rate    in Memory          (MAPS)
dgemv_n    0.9198  0.07    93.47   93.48   L1/L2 Cache        4166.0
dgemv_n    0.0271  0.00    90.33   90.39   Main Memory        1809.2
dgemv_n    0.0232  0.00    94.81   99.89   L1 Cache           5561.3
MatSet.    0.0125  0.20    77.32   90.00   L2 Cache           1522.6
Single-processor or per-processor performance: •Machine profile for processor (Machine A) •Application Signature for application (App. #1)
The relative per-processor performance of App. #1 on Machine A is represented as the MetaSim Number = Σ_{i=1..n} (MemOpsBB_i / MemRateBB_i)
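That sum is straightforward to compute once each basic block's memory-operation count and MAPS rate are known; the helper below is a sketch, not the PMaC convolver.

```python
def metasim_number(basic_blocks):
    """Sum of MemOpsBB_i / MemRateBB_i over basic blocks: the relative
    memory time contributed by each traced block, given its operation
    count and the MAPS bandwidth matching its access pattern."""
    return sum(mem_ops / mem_rate for mem_ops, mem_rate in basic_blocks)
```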
MAPS data for PREDICTED MACHINE
Extrapolation
•Least Squares Minimization is performed on each set of data points (for each basic block)
•A synthetic profile is created based on the resulting curves
•The synthetic profile is sent to the convolver and a computation/memory time is predicted
AVUS (Air Vehicle Unstructured Solver)
[Kiviat chart: predicted AVUS speedup (0-2.5) for doubling (2X) L1, L2, L3, and main-memory (MM) bandwidth]
WRF (Weather Research and Forecasting model)
[Kiviat chart: predicted WRF speedup (0-2) for doubling L1, L2, L3, and main-memory (MM) bandwidth]
AMR interagency benchmark
[Kiviat chart: predicted AMR speedup (0-2) for doubling L1, L2, L3, and main-memory (MM) bandwidth]
Hycom (ocean model)
[Kiviat chart: predicted Hycom speedup (0-2) for doubling L1, L2, L3, and main-memory (MM) bandwidth]
Reasoning about impact of multicore
[Chart: MAPS stride-one memory bandwidth (MB/s, 0-6000) vs. size (8-byte words) for JVN3 Xeon EM64T 3.6 GHz, Kraken IBM PWR4+, and Eagle Itanium2 1.6 GHz; annotations contrast bandwidth with blade density]
Choose two...
Fast Accurate Cheap
What is Integrated Performance Monitoring?
IPM provides a high-level application performance profile on a per-batch-job basis
• Fixed memory footprint, ~1-2 MB
• Minimal CPU overhead, ~5%
• Scalable to high concurrencies
• Easy to use – flip of a switch
• Portable
[Diagram: IPM wraps a batch job (input_123, job_123) and produces output_123 plus profile_123]
Message Sizes: CAM, 336-way
[Charts: message-size distribution per MPI call, and per MPI call & buffer size]
Load Balance Information
32K tasks AMR on BG/L
Ordered list of MPI calls
Can we use the output from this relatively simple tool to predict the MPI performance of a given code on Machine X (ORNL XT4) using an IPM profile of the code on Machine Y (a PWR4+) and some latency and bandwidth information ?
WRF CONUS benchmark, 256 CPUs
[Bar chart: predicted vs. measured Cray XT4 time (0-2000 s) per MPI rank (0-255); ~7% average absolute error]
WRF CONUS Benchmark Summary
MPI Tasks % error
256 7
384 9
512 6
640 5
Prediction methodology (How did we do it?)
• Used a very simple model based on RandomRing bandwidth and latency as measured by HPCC
• Models for collectives taken from ‘Performance Analysis of MPI Collective Operations’, Cluster Computing 10, p. 127 (2007), Dongarra et al.
• The speedup of the compute part of MPI time is treated using the predicted speedup from the PMaC convolver
• If T_communication-only < T_measured on the base system – attribute the difference to the compute part of the MPI time on the sending processor, then scale it up using the convolver factor
• If T_communication-only > T_measured on the base system – attribute the difference to the compute part of the MPI time on the receiving processor, then scale it using the convolver speedup factor; the MPI wait time is then new_communication - new_compute
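The "very simple model" reduces each point-to-point message to a first-order latency-plus-transfer cost, with latency and bandwidth taken from HPCC's RandomRing numbers; this sketch uses illustrative names, not the authors' code.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order point-to-point cost: startup latency plus
    size/bandwidth transfer time (RandomRing-style inputs)."""
    return latency_s + size_bytes / bandwidth_bytes_per_s
```

For a 1 MB message over a 5 us / 1 GB/s link this gives about a millisecond, almost all of it transfer time.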
Model and Benchmark: more Kiviat diagrams with more axes
[Kiviat charts: predicted speedups for GYRO_0016, AVUS_0064, and RFCTH2_0016 when scaling FLOPS, L1, L2, L3, MM and on-chip bandwidth, network latency/bandwidth, and I/O by 2X, 4X, 8X (top row, scale 0-8) and by 1/2X, 1/4X, 1/8X (bottom row, scale 0-1.5)]
A spoke gives performance gradient
• Predicted (and validated) speedup for doubling of processor attributes for: – Overflow (CFD), WRF (Weather), AVUS (CFD), LAMMPS(MD), Hycom (Ocean)
2X Overflow WRF AVUS AMR LAMMPS HyCom
FLOPS 1.04 1.29 1.2 1.07 1.08 1.14
L2 BW 1.08 1.01 1.01 1.01 1.14 1.16
L3 BW 1.1 1.01 1.01 1.01 1.31 1.20
MM BW 1.72 1.71 1.7 1.93 1.49 1.60
Ratio MM BW/Flop 16.80 2.44 3.5 13.3 6.41 4.35
Qualitative comparison (thanks to Steve Scott)
[Chart: dimensionless iso-performance lines for Overflow and WRF in the 2x-flops vs. memory-BW plane]
Qualitative comparison
• Compare Overflow, a memory-bound code, to WRF, a “flops-bound” code
– Flops-bound curves are steeper in this plot
– Memory-bound curves are flatter
[Chart repeated: dimensionless iso-performance lines for Overflow and WRF]
What if we just project local gradients?
[3-D surface: predicted Overflow speedup (0-350) as a function of 2x-flops and 2x-BW multipliers]
Quantitative comparison
• You can speed up Overflow 1.5x with a 1000x speedup of flops alone, or 220x with a 1000x speedup of BW alone
• (According to this admittedly simplistic performance-gradient extrapolation)
• You can get a 330x speedup with both
• More detailed simulation is needed to “verify” this
[3-D surfaces: predicted Overflow speedup over the flops/BW plane]
But what about $ ?
• Some people say flops are free • Let’s derive the slope of the line cost/flops and cost/BW from (some) real data. • More exactly, estimate slope of the line: – normalized cost delta / normalized clock speed delta – And normalized cost delta / normalized bus speed delta – Using clock speed as a proxy for flops and bus speed as a proxy for memory bandwidth
Families of Intel processors with list $
[Flattened table: clock speed (GHz), FSB speed (MHz), and list price for Clovertown, Woodcrest, and Conroe parts, with price deltas normalized by clock-speed and bus-speed deltas; the two averaged normalized slopes are 2.31 and 4.52]
Performance data for Overflow
[Table: predicted Overflow speedup; values grow from 1.0 at the origin to ~2.59 at the far corner of the grid]
columns are powers of 2 of flops, rows powers of 2 of BW, values of cells are predicted speedup
Iso performance and cost: Overflow
[Chart: iso-performance contours (1.1x-2.4x) and iso-cost points ($5-$2,367) in the 2x-flops vs. 2x-bandwidth plane for Overflow; the origin – 1 unit of flops, 1 unit of memory BW – is taken to cost $1]
Iso performance and cost: WRF
[Chart: iso-performance contours (2x-16x) and iso-cost points ($5-$2,367) in the 2x-flops vs. 2x-bandwidth plane for WRF]
Comparison
• Recall you can speed up Overflow 1.5x with a 1000x speedup of flops alone, or 220x with a 1000x speedup of BW alone
• WRF can be sped up about 12x with 1000x flops alone, about 173x with 1000x BW alone, and more than 600x with both!
[Charts repeated: iso-performance and iso-cost contours for Overflow and WRF]
How to pick a spanning set of benchmarks: Spatial vs. Temporal Locality
[Scatter plot: temporal locality (0-1) vs. spatial locality (0-0.9); applications – Gamess, WRF, HYCOM, Overflow, CG, Test3D, OOCore, RFCTH2, AVUS – fall between the bounding benchmarks HPL (high temporal locality), STREAM (high spatial locality), and RandomAccess (low in both)]
Tools for determining data motion patterns of applications
Locality matters for performance
Benchmarks for Global Data Motion
APEX-Maps
APEX-Map
APEX-Map: Global Data Motion
A variety of signatures
[Kiviat charts repeated from the "more Kiviat diagrams" slide: GYRO_0016, AVUS_0064, and RFCTH2_0016 under 2X-8X and 1/2X-1/8X resource scalings]
More abstract: data moving capability is F(MAPS), where MAPS is the memory access pattern signature
[3D surface plot: performance surface for a hypothetical petascale architecture. Cycles per data op (1.0–10000.0, log scale) as a function of spatial locality (0.001–1) and temporal locality (0.0001–1) of the memory access pattern signature. Regions of the surface are limited by: main memory, shared cache, chip cache, and FP (flop bound).]
76 Mapping applications to cycles-per-data-op surface
[3D surface plot: the same timeless MAPS performance surface (cycles per data op vs. spatial and temporal locality) with applications projected onto it: Mesa, Overflow, mpsalsa, equake, GUPS, Gamess, Stream, S3D, HPL. Regions limited by: main memory, shared cache, chip cache, FP.]
Genetic Algorithms
• Suitable when a certain form of the function is assumed and the search space contains many (and unknown) local minima
BW = F(cache hit rates, cache bandwidths & latencies, latency tolerance)
[3D surface plot: memory bandwidth (0–12000 MB/s) as a function of L1 hit rate (50–100%) and L2 hit rate (20–100%).]
77 Genetic Algorithms
[Diagram: genetic-algorithm search over island populations — generation 1 mutates, generation 2 applies crossover, and surviving parameter sets adapt.]
Fitness = agreement between the measured and predicted MultiMAPS bandwidth curves
[Plot: predicted vs. measured MultiMAPS curves for two cases (line02, line04); memory bandwidth (0–14000 MB/s) vs. working-set size in 8-byte words (1.0E+03–1.0E+08).]
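The search described above can be sketched as follows. This is a toy illustration of the idea, not the actual PMaC/MultiMAPS tooling: a population of machine-parameter guesses evolves (elitist selection, crossover, mutation) until a simple bandwidth model reproduces a measured bandwidth-vs-working-set curve. The model form, parameter ranges, and synthetic "measured" data are all assumptions.

```python
# Toy GA: fit (L1 size, L2 size, L1 BW, L2 BW, MM BW) to a bandwidth curve.
import random

random.seed(0)  # reproducibility of this sketch
SIZES = [2**k for k in range(10, 24)]   # working-set sizes in 8-byte words

def model_bw(params, size):
    """Assumed step model: bandwidth of the smallest level the set fits in."""
    l1_words, l2_words, bw_l1, bw_l2, bw_mm = params
    if size <= l1_words:
        return bw_l1
    if size <= l2_words:
        return bw_l2
    return bw_mm

def fitness(params, measured):
    """Negative sum of squared errors between predicted and measured BW."""
    return -sum((model_bw(params, s) - bw) ** 2 for s, bw in measured)

def mutate(p):
    return [g * random.uniform(0.8, 1.25) for g in p]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

def evolve(measured, pop_size=40, gens=200):
    pop = [[random.uniform(1e3, 1e7), random.uniform(1e4, 1e8),
            random.uniform(1e3, 2e4), random.uniform(1e3, 2e4),
            random.uniform(1e2, 1e4)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: fitness(p, measured), reverse=True)
        elite = pop[: pop_size // 4]           # elitist selection
        pop = elite + [mutate(crossover(random.choice(elite),
                                        random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda p: fitness(p, measured))

# Synthetic "measured" curve: 12 GB/s in L1, 8 GB/s in L2, 2 GB/s in memory.
truth = [4096.0, 262144.0, 12000.0, 8000.0, 2000.0]
measured = [(s, model_bw(truth, s)) for s in SIZES]
best = evolve(measured)
```

In the real setting, `measured` would be the MultiMAPS curves and the fitness would compare predicted against measured bandwidth across all sampled sizes and strides; the GA is a good match here because the fitness landscape has many local minima.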
AVUS 64 Sensitivity
[Radar chart: AVUS 64-CPU predicted speedup (scale 0–7) across spokes case1–case9 (cases defined on the next slide), with one trace each for AVUS_0064_2X, AVUS_0064_4X, and AVUS_0064_8X.]
78 Performance Sensitivity
Case 1: Network latency / 2
Case 2: Network bandwidth * 2
Case 3: FLOPS * 2
Case 4: L1 BW * 2
Case 5: L1, L2 BW * 2
Case 6: L1, L2, L3 BW * 2
Case 7a: L1, L2, L3, MM BW * 2
Case 7b: L1, L2, L3, MM, on-node BW * 2
Case 8a: Just MM BW * 2
Case 8b: Just MM, on-node BW * 2
Case 9: I/O
Choose your machine wisely ☺
[Radar charts: normalized machine profiles (scale 0–1) for BENCHM_Altix_1.5, BENCHM_5_Opt1_2.2, and NAVO_P655_FED. Spokes: flops; L1 bw (n); L1 bw (r); L2 bw (n); L2 bw (r); L3 bw (n); L3 bw (r); MM bw (n); MM bw (r); NW bw; NW lat.]
79 HPL