2007 SPEC Workshop January 21, 2007 Radisson Hotel Austin North

The HPC Challenge Benchmark: A Candidate for Replacing LINPACK in the TOP500?

Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Outline - The HPC Challenge Benchmark: A Candidate for Replacing Linpack in the TOP500?

♦ Look at LINPACK

♦ Brief discussion of DARPA HPCS Program

♦ HPC Challenge Benchmark

♦ Answer the Question

What Is LINPACK?

♦ Most people think LINPACK is a benchmark.

♦ LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.

♦ The project had its origins in 1974

♦ LINPACK: “LINear algebra PACKage” ¾ Written in Fortran 66

Computing in 1974
♦ High Performance Computers: ¾ IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030

♦ Fortran 66

♦ Run efficiently

♦ BLAS (Level 1) ¾ Vector operations

♦ Trying to achieve software portability

♦ LINPACK package was released in 1979 ¾ About the time of the Cray 1

The Accidental Benchmarker

♦ Appendix B of the Linpack Users’ Guide ¾ Designed to help users extrapolate execution time for the Linpack software package
♦ First benchmark report from 1977 ¾ Cray 1 to DEC PDP-10

¾ LINPACK scope: dense matrices: linear systems, least squares problems, singular values

LINPACK Benchmark?

♦ The LINPACK Benchmark is a measure of a computer’s floating-point rate of execution for solving Ax = b. ¾ It is determined by running a computer program that solves a dense system of linear equations.
♦ Information is collected and available in the LINPACK Benchmark Report.
♦ Over the years the characteristics of the benchmark have changed a bit. ¾ In fact, there are three benchmarks included in the Linpack Benchmark report.
♦ LINPACK Benchmark since 1977 ¾ Dense linear system solved with LU factorization using partial pivoting ¾ Operation count is 2/3 n³ + O(n²) ¾ Benchmark measure: MFlop/s ¾ Original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

For Linpack with n = 100
♦ Not allowed to touch the code.
♦ Only set the optimization in the compiler and run.
♦ Provides a historical look at computing.
♦ Table 1 of the report (52 pages of the 95-page report) ¾ http://www.netlib.org/benchmark/performance.pdf
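To make the measure concrete, here is a minimal, self-contained C sketch of a Linpack-style run (illustrative only, not the official benchmark code): it solves Ax = b by Gaussian elimination with partial pivoting and converts the timed solve into MFlop/s using the 2/3 n³ + 2 n² operation count. n is set to 1000 here because an n = 100 solve finishes too quickly for clock() to resolve.

```c
/* Sketch of what the Linpack benchmark times: solve Ax = b by
   Gaussian elimination with partial pivoting (no singularity check),
   then report MFlop/s from the 2/3 n^3 + 2 n^2 flop count. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void) {
    int n = 1000;                             /* problem size */
    double *A = malloc(n * n * sizeof *A);    /* column-major n x n */
    double *b = malloc(n * sizeof *b);
    for (int i = 0; i < n * n; i++) A[i] = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX - 0.5;

    clock_t t0 = clock();
    for (int k = 0; k < n - 1; k++) {
        int p = k;                            /* find pivot row */
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i + k * n]) > fabs(A[p + k * n])) p = i;
        for (int j = k; j < n; j++) {         /* swap rows k and p */
            double t = A[k + j * n];
            A[k + j * n] = A[p + j * n];
            A[p + j * n] = t;
        }
        double tb = b[k]; b[k] = b[p]; b[p] = tb;
        for (int i = k + 1; i < n; i++) {     /* eliminate below pivot */
            double m = A[i + k * n] / A[k + k * n];
            for (int j = k + 1; j < n; j++) A[i + j * n] -= m * A[k + j * n];
            b[i] -= m * b[k];
        }
    }
    for (int i = n - 1; i >= 0; i--) {        /* back substitution; x in b */
        for (int j = i + 1; j < n; j++) b[i] -= A[i + j * n] * b[j];
        b[i] /= A[i + i * n];
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    double ops  = 2.0 / 3.0 * n * n * n + 2.0 * n * n;
    printf("n=%d  time=%.3f s  MFlop/s=%.1f\n", n, secs, ops / secs / 1e6);
    free(A); free(b);
    return 0;
}
```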

Linpack Benchmark Over Time

♦ In the beginning there was only the Linpack 100 Benchmark (1977) ¾ n=100 (80 KB); a size that would fit in all the machines ¾ Fortran; 64-bit floating point arithmetic ¾ No hand optimization (only compiler options); source code available
♦ Linpack 1000 (1986) ¾ n=1000 (8 MB); wanted to see higher performance levels ¾ Any language; 64-bit floating point arithmetic ¾ Hand optimization OK
♦ Linpack Table 3 (Highly Parallel Computing, 1991) (Top500; 1993) ¾ Any size (n as large as you can; n=10⁶ means 8 TB and ~6 hours) ¾ Any language; 64-bit floating point arithmetic ¾ Hand optimization OK ¾ Strassen’s method not allowed (confuses the operation count and rate) ¾ Reference implementation available
♦ In all cases results are verified by checking that ||Ax − b|| / (||A|| ||x|| n ε) = O(1)

♦ Operation count: 2/3 n³ − 1/2 n² for the factorization; 2 n² for the solve
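The verification check above reduces to a short function; here is a sketch using infinity norms (illustrative; the production benchmark applies essentially this test, with some additional details). A is n x n column-major; x and b have length n.

```c
/* Scaled residual ||Ax - b|| / (||A|| ||x|| n eps), which must be O(1)
   for a run to be accepted. Infinity norms throughout. */
#include <float.h>
#include <math.h>

double scaled_residual(int n, const double *A, const double *x, const double *b) {
    double rnorm = 0.0, xnorm = 0.0, anorm = 0.0;
    for (int i = 0; i < n; i++) {
        double ri = -b[i], rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            ri += A[i + j * n] * x[j];        /* accumulate (Ax - b)_i */
            rowsum += fabs(A[i + j * n]);     /* row sum for ||A||_inf */
        }
        if (fabs(ri)   > rnorm) rnorm = fabs(ri);
        if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
        if (rowsum     > anorm) anorm = rowsum;
    }
    return rnorm / (anorm * xnorm * n * DBL_EPSILON);
}
```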

Motivation for Additional Benchmarks
♦ From the Linpack Benchmark and Top500: “no single number can reflect overall performance”
♦ Clearly need something more than Linpack
♦ HPC Challenge Benchmark ¾ Test suite stresses not only the processors, but the memory system and the interconnect ¾ The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack
♦ Good ¾ One number ¾ Simple to define & easy to rank ¾ Allows problem size to change with machine and over time ¾ Stresses the system with a run of a few hours
♦ Bad ¾ Emphasizes only “peak” CPU speed and number of CPUs ¾ Does not stress local bandwidth ¾ Does not stress the network ¾ Does not test gather/scatter ¾ Ignores Amdahl’s Law (only does weak scaling)
♦ Ugly ¾ MachoFlops ¾ Benchmarketeering hype

At The Time The Linpack Benchmark Was Created …

♦ If we think about computing in the late 70’s, perhaps the LINPACK benchmark was a reasonable thing to use.
♦ Memory wall: not so much a wall as a step.
♦ In the 70’s, things were more in balance ¾ The memory kept pace with the CPU ¾ n cycles to execute an instruction, n cycles to bring in a word from memory

♦ Showed compiler optimization
♦ Today provides a historical base of data

Many Changes

♦ Many changes in our hardware over the past 30 years ¾ Superscalar, Vector, Distributed Memory, Shared Memory, Multicore, …

[Chart: Top500 Systems/Architectures, 1993-2006, counting systems in each class: Single Proc., SIMD, SMP, MPP, Cluster, Constellations]

♦ While there have been some changes to the Linpack Benchmark, not all of them reflect the advances made in the hardware.

♦ Today’s memory hierarchy is much more complicated.

High Productivity Computing Systems
Goal: Provide a generation of economically viable high productivity computing systems for the national security and industrial user community (2010; started in 2002)

Focus on:
¾ Real (not peak) performance of critical national security applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, biotechnology
¾ Programmability: reduce cost and time of developing applications
¾ Software portability and system robustness
[Diagram: HPCS Program Focus Areas: Performance Characterization & Prediction, Programming Models, Hardware Technology, Software Technology, System Architecture, surrounded by Industry R&D and Analysis & Assessment]

Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology

Fill the critical technology and capability gap: today (late 80's HPC technology) ... to ... future (quantum/bio computing)

HPCS Roadmap
¾ 5 vendors in phase 1; 3 vendors in phase 2; 1+ vendors in phase 3
¾ MIT Lincoln Laboratory leading measurement and evaluation team
[Roadmap diagram: Concept Study (today) → New Evaluation Framework → Advanced Design & Prototypes → Full Scale Development of petascale systems (TBD), with the evaluation team moving from a Test Evaluation Framework to a Validated Procurement Evaluation Methodology]

Phase 1: $20M (2002); Phase 2: $170M (2003-2005); Phase 3: ~$250M each (2006-2010)

Predicted Performance Levels for Top500

[Chart: Top500 performance, actual and predicted, Jun-2003 through Jun-2009: Linpack TFlop/s on a log scale (1 to 100,000) for Total, #1, #10, and #500, with predictions (Pred. #1, Pred. #10, Pred. #500) extending the trend lines]

A PetaFlop Computer by the End of the Decade

♦ At least 10 Companies developing a Petaflop system in the next decade.

¾ Cray, IBM, Sun: the HPCS vendors (targets: 2+ Pflop/s Linpack, 6.5 PB/s data streaming BW, 3.2 PB/s bisection BW, 64,000 GUPS)
¾ Dawning, Galactic, Lenovo: Chinese companies
¾ Hitachi, NEC, Fujitsu: Japanese Keisoku project (NEC “Life Simulator”, 10 Pflop/s), $1B over 7 years
¾ Bull

PetaFlop Computers in 2 Years!

♦ Oak Ridge National Lab ¾ Planned for 4th Quarter 2008 (1 Pflop/s peak) ¾ From Cray’s XT family ¾ Uses quad-core chips from AMD ¾ 23,936 chips ¾ Each chip is a quad-core processor (95,744 processors) ¾ Each processor does 4 flops/cycle ¾ Clock rate of 2.8 GHz ¾ Hypercube connectivity ¾ Interconnect based on Cray XT technology ¾ 6 MW, 136 cabinets
♦ Los Alamos National Lab ¾ Roadrunner (2.4 Pflop/s peak) ¾ Uses IBM Cell and AMD processors ¾ 75,000 cores

HPC Challenge Goals
♦ To examine the performance of HPC architectures using kernels with more challenging memory access patterns than the Linpack Benchmark ¾ The Linpack benchmark works well on all architectures, even cache-based, distributed memory multiprocessors, due to: 1. extensive memory reuse; 2. scalability with respect to the amount of computation; 3. scalability with respect to the communication volume; 4. extensive optimization of the software
♦ To complement the Top500 list
♦ Stress CPU, memory system, interconnect
♦ Allow for optimizations ¾ Record effort needed for tuning ¾ Base run requires MPI and BLAS
♦ Provide verification & archiving of results

Tests on Single Processor and System

● Local - only a single processor (core) is performing computations.

● Embarrassingly Parallel - each processor (core) in the entire system is performing computations but they do not communicate with each other explicitly.

● Global - all processors in the system are performing computations and they explicitly communicate with each other.
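A hedged MPI sketch of the three modes above (local_kernel() is a made-up stand-in, not part of HPCC's API): Local runs the kernel on one rank only, Embarrassingly Parallel runs it on every rank and only reduces the timings, and Global is any computation whose steps themselves communicate, represented here by an MPI_Allreduce.

```c
/* Sketch of the Local / EP / Global execution modes. */
#include <mpi.h>
#include <stdio.h>

static double local_kernel(void) {   /* dummy compute-only kernel */
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++) s += 1.0 / i;
    return s;
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local: only one rank computes; the others idle. */
    if (rank == 0) local_kernel();

    /* Embarrassingly parallel: every rank computes, no communication;
       only the timings are combined afterwards. */
    double t = MPI_Wtime();
    local_kernel();
    t = MPI_Wtime() - t;
    double tmax;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    /* Global: the computation itself communicates across all ranks. */
    double part = local_kernel(), total;
    MPI_Allreduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks, EP slowest: %.3f s, global sum: %g\n",
               size, tmax, total);
    MPI_Finalize();
    return 0;
}
```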

HPC Challenge Benchmark

Consists of basically 7 benchmarks ¾ Think of it as a framework or harness for adding benchmarks of interest.

1. LINPACK (HPL) ― MPI Global (Ax = b)

2. STREAM ― Local; single CPU *STREAM ― Embarrassingly parallel

3. PTRANS (A ← A + B^T) ― MPI Global

4. RandomAccess ― Local; single CPU. *RandomAccess ― Embarrassingly parallel. RandomAccess ― MPI Global (random integer read, update, & write)
5. Bandwidth and Latency ― MPI

6. FFT ― Global, single CPU, and EP
7. Matrix Multiply ― single CPU and EP
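For reference, the heart of the STREAM component (item 2 above) is the triad loop sketched below. This is a minimal illustration, not the official STREAM code, which adds the copy, scale, and add kernels plus repetition and timing rules; the bandwidth figure counts 3 arrays times 8 bytes moved per element.

```c
/* Minimal sketch of the STREAM triad: a[i] = b[i] + s * c[i]. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 20 * 1000 * 1000;          /* much larger than cache */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double s = 3.0;
    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];           /* the triad */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("Triad bandwidth: %.2f GB/s\n",
           3.0 * n * sizeof(double) / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```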


HPCS Performance Targets
● LINPACK: linear system solve, Ax = b. Target: 2 Pflop/s (8x)
● STREAM: vector operations, A = B + s * C. Target: 6.5 Pbyte/s (40x)
● FFT: 1D Fast Fourier Transform, Z = fft(X). Target: 0.5 Pflop/s (200x)
● RandomAccess: integer update, T[i] = XOR(T[i], rand). Target: 64,000 GUPS (2000x)
[Diagram: memory hierarchy from Registers (operands, instructions) through Cache(s) (lines, blocks), Local Memory (messages), Remote Memory (pages), Disk, and Tape, with each HPC Challenge benchmark and its performance target mapped to the level it stresses]
● HPCC was developed by HPCS to assist in testing new HEC systems
● Each benchmark focuses on a different part of the memory hierarchy
● HPCS performance targets attempt to: flatten the memory hierarchy, improve real application performance, and make programming easier

MakeMake programmingprogramming easiereasier Computational Resources and HPC Challenge Benchmarks

[Diagram: computational resources probed by the HPCC tests]
¾ CPU computational speed: HPL, Matrix Multiply
¾ Node memory bandwidth: STREAM
¾ Interconnect bandwidth & latency: PTRANS, FFT, RandomAccess, and Random & Natural Ring Bandwidth & Latency

How Does The Benchmarking Work?
♦ Single program to download and run ¾ Simple input file similar to HPL input
♦ Base Run and Optimization Run ¾ Base run must be made ¾ User supplies MPI and the BLAS (see the dgemm timing sketch below) ¾ Optimized run allowed to replace certain routines ¾ User specifies what was done
♦ Results uploaded via website (monitored)
♦ HTML table and Excel spreadsheet generated with performance results ¾ Intentionally we are not providing a single figure of merit (no overall ranking)

♦ Each run generates a record which contains 188 pieces of information from the benchmark run.

♦ Goal: no more than 2x the time to execute HPL.
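Since the base run links against whatever BLAS the user supplies, the matrix-multiply component essentially reduces to timing a dgemm call. A hedged sketch using the standard CBLAS interface (any conforming BLAS provides cblas_dgemm; an n x n multiply costs 2n³ flops). Note that clock() measures CPU time, so for a threaded BLAS a wall-clock timer would be the better choice.

```c
/* Sketch: timing a user-supplied BLAS dgemm (2 n^3 flops). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    int n = 1000;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    clock_t t0 = clock();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);   /* C = A * B */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("dgemm: %.2f GFlop/s\n", 2.0 * n * n * n / secs / 1e9);
    free(A); free(B); free(C);
    return 0;
}
```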

[Screenshots of the HPCC web site, http://icl.cs.utk.edu/hpcc/]

HPCC Kiviat Chart

[Kiviat charts comparing HPCC results across systems, from http://icl.cs.utk.edu/hpcc/]

Different Computers are Better at Different Things; No “Fastest” Computer for All Apps

HPCC Awards Info and Rules

Class 1 (Objective)
♦ Performance
1. G-HPL $500
2. G-RandomAccess $500
3. EP-STREAM (system) $500
4. G-FFT $500
♦ Must be full submissions through the HPCC database

Class 2 (Subjective)
♦ Productivity (Elegant Implementation) ¾ Implement at least two tests from Class 1 ¾ $1500 (may be split) ¾ Deadline: October 15, 2007 ¾ Select 3 as finalists
♦ This award is weighted ¾ 50% on performance and ¾ 50% on code elegance, clarity, and size.
♦ Submissions format flexible

Winners (in both classes) will be announced at the SC07 HPCC BOF. Sponsored by: [sponsor logo]

Class 2 Awards

♦ Subjective
♦ Productivity (Elegant Implementation) ¾ Implement at least two tests from Class 1 ¾ $1500 (may be split) ¾ Deadline: October 15, 2007 ¾ Select 5 as finalists
♦ Most “elegant” implementation, with special emphasis placed on: Global HPL, Global RandomAccess, EP STREAM (Triad) per system, and Global FFT.
♦ This award is weighted ¾ 50% on performance and ¾ 50% on code elegance, clarity, and size.

5 Finalists for Class 2 – November 2005

♦ Cleve Moler, Mathworks ¾ Environment: Parallel Matlab Prototype ¾ System: 4-processor Opteron with 16GiB of memory
♦ Bradley Kuszmaul, MIT ¾ Environment: Cilk ¾ System: 4-processor 1.4GHz AMD Opteron 840
♦ Calin Cascaval, C. Barton, G. Almasi, Y. Zheng, M. Farreras, P. Luk, and R. Mak, IBM ¾ Environment: UPC ¾ System: Blue Gene L
♦ Nathan Wichman, Cray ¾ Environment: UPC ¾ System: Cray X1E (ORNL)
♦ Petr Konecny, Simon Kahan, and John Feo, Cray ¾ Environment: C + MTA pragmas ¾ System: Cray MTA2

Winners!

2006 Competitors
♦ Some Notable Class 1 Competitors

Cray (DOD ERDC) XT3 4096 CPUs Sapphire

SGI (NASA) Columbia 10,000 CPUs
NEC (HLRS) SX-8 512 CPUs
Cray (Sandia) Red Storm XT3 11,648 CPUs
IBM (DOE LLNL) BG/L 131,072 CPUs and Purple 10,240 CPUs
Cray (DOE ORNL) X1 1008 CPUs and Jaguar XT3 5200 CPUs
DELL (MIT LL) LLGrid 300 CPUs

♦ Class 2: 6 Finalists
¾ Calin Cascaval (IBM) UPC on Blue Gene/L [Current Language]
¾ Bradley Kuszmaul (MIT CSAIL) Cilk on SGI Altix [Current Language]
¾ Cleve Moler (Mathworks) Parallel Matlab on a cluster [Current Language]
¾ Brad Chamberlain (Cray) Chapel [Research Language]
¾ Vivek Sarkar (IBM) X10 [Research Language]
¾ Vadim Gurev (St. Petersburg, Russia) MCSharp [Student Submission]

The Following are the Winners of the 2006 HPC Challenge Class 1 Awards

The Following are the Winners of the 2006 HPC Challenge Class 2 Awards

2006 Programmability

[Chart: Speedup vs. Relative Code Size on log-log axes (speedup 10⁻³ to 10³; relative code size 10⁻¹ to 10), with the C+MPI reference point and regions marked “Traditional HPC” and “All too often”]
♦ Class 2 Award ¾ 50% Performance ¾ 50% Elegance
♦ 21 Codes submitted by 6 teams (Matlab, Python, etc.)
♦ Speedup relative to serial C on a workstation
♦ Code size relative to serial C

2006 Programming Results Summary

♦ 21 of 21 smaller than C+MPI Ref; 20 smaller than serial
♦ 15 of 21 faster than serial; 19 in HPCS quadrant

Top500 and HPC Challenge Rankings

♦ It should be clear that the HPL (Linpack Benchmark - Top500) is a relatively poor predictor of overall machine performance.

♦ For a given set of applications such as: ¾ Calculations on unstructured grids ¾ Effects of strong shock waves ¾ Ab-initio quantum chemistry ¾ Ocean general circulation model ¾ CFD calculations w/multi-resolution grids ¾ Weather forecasting

♦ There should be a different mix of components used to help predict the system performance.

Will the Top500 List Go Away?

♦ The Top500 continues to serve a valuable role in high performance computing. ¾ Historical basis ¾ Presents statistics on deployment ¾ Projection on where things are going ¾ Impartial view ¾ It’s simple to understand ¾ It’s fun
♦ The Top500 will continue to play a role

No Single Number for HPCC?

♦ Of course everyone wants a single number.
♦ With the HPCC Benchmark you get 188 numbers per system run!
♦ Many have suggested weighting the seven tests in HPCC to come up with a single number. ¾ LINPACK, MatMul, FFT, Stream, RandomAccess, Ptranspose, bandwidth & latency

♦ But your application is different from mine, so the weights depend on the application.

♦ Score = W1*LINPACK + W2*MM + W3*FFT + W4*Stream + W5*RA + W6*Ptrans + W7*BW/Lat

♦ Problem is that the weights depend on your job mix.

♦ So it makes sense to have a set of weights for each user or site.
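A sketch of what such a per-site score could look like in C. The weights below are made-up placeholders, and the seven results would first have to be normalized to a common scale (they have different units) for the weighted sum to mean anything.

```c
/* Sketch of a per-site weighted HPCC score:
   Score = W1*HPL + W2*MM + W3*FFT + W4*Stream + W5*RA + W6*Ptrans + W7*BW/Lat */
#include <stdio.h>

#define NCOMP 7

int main(void) {
    const char *name[NCOMP] = { "HPL", "MatMul", "FFT", "STREAM",
                                "RandomAccess", "PTRANS", "BW/Lat" };
    double result[NCOMP] = { 0 };  /* normalized HPCC results go here */
    double weight[NCOMP] = { 0.30, 0.10, 0.15, 0.20, 0.10, 0.05, 0.10 };

    double score = 0.0;
    for (int i = 0; i < NCOMP; i++)
        score += weight[i] * result[i];     /* the weighted sum */

    for (int i = 0; i < NCOMP; i++)
        printf("%-12s weight %.2f\n", name[i], weight[i]);
    printf("Score = %.3f (meaningful only for one site's job mix)\n", score);
    return 0;
}
```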

Tools Needed to Help With Performance

♦ A tool that analyzes an application, perhaps statically and/or dynamically.
♦ Outputs a set of weights for various sections of the application

¾ [ W1, W2, W3, W4, W5, W6, W7, W8 ]
¾ The tool would also point to places where we were missing a benchmarking component for the mapping.
♦ Think of the benchmark components as a basis set for scientific applications
♦ A specific application has a set of "coefficients" of the basis set.

♦ Score = W1*HPL + W2*MM + W3*FFT + W4*Stream + W5*RA + W6*Ptrans + W7*BW/Lat + …

Future Directions

♦ Looking at reducing execution time
♦ Constructing a framework for benchmarks
♦ Developing machine signatures
♦ Plans are to expand the benchmark collection ¾ operations ¾ I/O ¾ Smith-Waterman (sequence alignment)

♦ Port to new systems
♦ Provide more implementations ¾ Languages (Fortran, UPC, Co-Array) ¾ Environments ¾ Paradigms

Collaborators
• HPC Challenge: Piotr Łuszczek, U of Tennessee; David Bailey, NERSC/LBL; Jeremy Kepner, MIT Lincoln Lab; David Koester, MITRE; Bob Lucas, ISI/USC; Rusty Lusk, ANL; John McCalpin, IBM, Austin; Rolf Rabenseifner, HLRS Stuttgart; Daisuke Takahashi, Tsukuba, Japan
• Top500: Hans Meuer, Prometeus; Erich Strohmaier, LBNL/NERSC; Horst Simon, LBNL/NERSC

http://icl.cs.utk.edu/hpcc/