High Performance Computing: Concepts, Methods & Means
Performance I: Benchmarking

Prof. Thomas Sterling
Department of Computer Science, Louisiana State University
January 23rd, 2007

Topics (slide 2)

• Definitions, properties and applications
• Early benchmarks
• Everything you ever wanted to know about Linpack (but were afraid to ask)
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Basic Performance Metrics (slide 4)

• Time related:
  – Execution time [seconds]
    • wall-clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]
  – Memory consumption [bytes]
  – Productivity [utility/($*second)]
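The time and rate metrics above can be made concrete with a short sketch (plain Python, not from the lecture; the two-flop-per-iteration count is an assumption made purely for illustration):

```python
import time

def flop_rate(nflop, seconds):
    """Rate metric: floating point operations per second (flop/s)."""
    return nflop / seconds

# Wall-clock time covers everything that elapses (I/O, other processes);
# process_time() counts only CPU time charged to this process.
n = 200_000
t_wall = time.perf_counter()
t_cpu = time.process_time()
acc = 0.0
for i in range(1, n + 1):
    acc += 1.0 / i            # one divide and one add: ~2 flop per iteration
wall = time.perf_counter() - t_wall
cpu = time.process_time() - t_cpu

# A *sustained* rate for this kernel, in Mflops:
mflops = flop_rate(2 * n, wall) / 1e6
```

A theoretical peak, by contrast, would be computed from clock rate and functional units, not measured.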

• Modifiers:
  – Sustained
  – Peak
  – Theoretical peak

What Is a Benchmark? (slide 5)

Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster]
• The term "benchmark" is also commonly applied to the specially designed programs used in benchmarking
• A benchmark should:
  – be domain specific (the more general the benchmark, the less useful it is for anything in particular)
  – be a distillation of the essential attributes of a workload
  – avoid using a single metric to express the overall performance
• Kinds of computational benchmarks:
  – synthetic: specially created programs that impose a load on a specific component of the system
  – application: derived from a real-world application program

Purpose of Benchmarking (slide 6)

• To define the playing field
• To provide a tool enabling quantitative comparisons
• Acceleration of progress
  – enable better engineering by defining measurable and repeatable objectives
• Establishing a performance agenda
  – measure release-to-release or version-to-version progress
  – set goals to meet
  – be understandable and useful also to people without expertise in the field (managers, etc.)

Properties of a Good Benchmark (slide 7)

• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architectures
• Coverage: does not overconstrain the typical environment
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation
• Limited lifetime: there is a point when additional code modifications or optimizations become counterproductive

Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD '97

Early Benchmarks (slide 9)

• Whetstone
  – Floating point intensive
• Dhrystone
  – Integer and character string oriented
• Livermore Kernels
  – "Livermore Loops"
  – Collection of short kernels
• NAS kernel
  – 7 Fortran test kernels for aerospace computation
The sources of the benchmarks listed above are available from: http://www.netlib.org/benchmark

Whetstone (slide 10)

• Originally written in Algol 60 in 1972 at the National Physical Laboratory (UK)
• Named after the Whetstone Algol translator-interpreter on the KDF9 computer
• Measures primarily floating point performance in WIPS: Whetstone Instructions Per Second
• Also raised the issue of the efficiency of different programming languages
• The original Algol code was translated to C and Fortran (single and double precision support), PL/I, APL, Pascal, Basic, Simula and others

Dhrystone (slide 11)

• Synthetic benchmark developed in 1984 by Reinhold Weicker
• The name is a pun on "Whetstone"
• Measures integer and string operation performance, expressed in number of iterations, or Dhrystones, per second
• Alternative unit: D-MIPS, normalized to VAX 11/780 performance
• Latest version released: 2.1; includes implementations in C, Ada and Pascal
• Superseded by the SPECint suite
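The D-MIPS normalization can be written out explicitly; a sketch (the VAX 11/780 reference score of 1757 Dhrystones per second is the commonly cited constant):

```python
# The VAX 11/780 was marketed as a 1-MIPS machine and scores about
# 1757 Dhrystones per second, so that figure serves as the divisor.
VAX_11_780_DHRYSTONES_PER_SEC = 1757

def dmips(dhrystones_per_sec):
    """Convert a raw Dhrystone score into VAX-normalized D-MIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC
```

A machine scoring 17,570 Dhrystones/s would thus be quoted as roughly 10 D-MIPS.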

Gordon Bell and VAX 11/780

Livermore Fortran Kernels (LFK) (slide 12)

• Developed at Lawrence Livermore National Laboratory in 1970
  – also known as Livermore Loops
• Consists of 24 separate kernels:
  – hydrodynamic codes, Cholesky conjugate gradient, linear algebra, equation of state, integration, predictors, first sum and difference, particle in cell, Monte Carlo, linear recurrence, discrete ordinate transport, Planckian distribution and others
  – include both careful and careless coding practices
• Produces 72 timing results using 3 different DO-loop lengths for each kernel
• Produces Megaflops values for each kernel and range statistics of the results
• Can be used as a performance, compiler accuracy (checksums stored in code) or hardware endurance test

NAS Kernel (slide 13)

• Developed at the Numerical Aerodynamic Simulation Projects Office at NASA Ames
• Focuses on vector floating point performance
• Consists of 7 test kernels in Fortran (approx. 1000 lines of code):
  – matrix multiply
  – complex 2-D FFT
  – Cholesky decomposition
  – block tridiagonal matrix solver
  – vortex method setup with Gaussian elimination
  – vortex creation with boundary conditions
  – parallel inverse of three matrix pentadiagonals
• Reports performance in Mflops (64-bit precision)


Linpack Overview (slide 15)

• Introduced by Jack Dongarra in 1979
• Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and P. Stewart (now superseded by the LAPACK library)
• Solves a dense, regular system of linear equations, using matrices initialized with pseudorandom numbers
• Provides an estimate of the system's effective floating point performance
• Does not reflect the overall performance of the machine!

Linpack Benchmark Variants (slide 16)

• Linpack Fortran (single processor)
  – N=100
  – N=1000, TPP, best effort
• Linpack's Highly Parallel Computing benchmark (HPL)
• Java Linpack
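All of these variants time the solution of a dense linear system via LU factorization. A toy Python sketch of the factorize/solve pair, in the spirit of the DGEFA/DGESL routines described on the next slides (the real benchmarks use the Fortran LINPACK code, not anything like this):

```python
def lu_solve(a, b):
    """Gaussian elimination (LU) with partial pivoting, O(n^3),
    followed by forward/back substitution, O(n^2).
    Modifies a and b in place; returns the solution vector x."""
    n = len(a)
    for k in range(n):
        # pivot: bring the largest remaining entry in column k to the top
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            a[i][k] = m
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    for k in range(n):                     # forward substitution (Ly = b)
        for i in range(k + 1, n):
            b[i] -= a[i][k] * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):         # back substitution (Ux = y)
        x[k] = (b[k] - sum(a[k][j] * x[j] for j in range(k + 1, n))) / a[k][k]
    return x

# Benchmark drivers commonly build b = A * ones, so the exact solution
# is the all-ones vector and the answer is trivial to verify:
a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [sum(row) for row in a]
x = lu_solve([row[:] for row in a], b[:])
```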

Fortran Linpack (I) (slide 17)

N=100 case
• Provides the results listed in Table 1 of the "Linpack Benchmark Report"
• Absolutely no changes to the code can be made (not even in comments!)
• The matrix generated by the program must be used to run this case
• An external timing function (SECOND) has to be supplied
• Only compiler-induced optimizations allowed
• Measures the performance of two routines:
  – DGEFA: LU decomposition with partial pivoting
  – DGESL: solves a system of linear equations using the result from DGEFA
• Complexity: O(n^2) for DGESL, O(n^3) for DGEFA

Fortran Linpack (II) (slide 18)

N=1000 case, Toward Peak Performance (TPP), Best Effort
• Provides the results listed in Table 1 of the "Linpack Benchmark Report"
• The user can choose any method of solving the linear equations
• Allows a complete replacement of the factorization/solver code by the user
• No restriction on the implementation language for the solver
• The solution must conform to the prescribed accuracy, and the matrix used must be the same as the matrix used by the netlib driver

Linpack Fortran Performance on Different Platforms (slide 19)

Computer                                               | N=100 [MFlops] | N=1000, TPP [MFlops] | Theoretical Peak [MFlops]
Intel Pentium Woodcrest (1 core, 3 GHz)                | 3018           | 6542                 | 12000
NEC SX-8/8 (8 proc., 2 GHz)                            | -              | 75140                | 128000
NEC SX-8/8 (1 proc., 2 GHz)                            | 2177           | 14960                | 16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon)     | -              | 8185                 | 14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon)      | 1852           | 4851                 | 7400
IBM eServer p5 575 (8 POWER5 proc., 1.9 GHz)           | -              | 34570                | 60800
IBM eServer p5 575 (1 POWER5 proc., 1.9 GHz)           | 1776           | 5872                 | 7600
SGI Altix 3700 Bx2 (1 Itanium 2 proc., 1.6 GHz)        | 1765           | 5953                 | 6400
HP ProLiant BL45p (4 cores AMD Opteron 854, 2.8 GHz)   | -              | 12860                | 22400
HP ProLiant BL45p (1 core AMD Opteron 854, 2.8 GHz)    | 1717           | 4191                 | 5600
Fujitsu VPP5000/1 (1 proc., 3.33 ns)                   | 1156           | 8784                 | 9600
Cray T932 (32 proc., 2.2 ns)                           | 1129 (1 proc.) | 29360                | 57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3 GHz)  | -              | 14260                | 20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3 GHz)  | 1122           | 2132                 | 2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000 MHz)          | -              | 14150                | 32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000 MHz)          | 843            | 2905                 | 4000

Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps

Fortran Linpack Demo (slide 20)

> ./
Please send the results of this run to:

Jack J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: [email protected]

This is version 29.5.04.

     norm. resid      resid           machep         x(1)            x(n)
  1.25501937E+00  1.39332990E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

times are reported for matrices of order 100
      dgefa      dgesl      total     mflops       unit      ratio       b(1)
times for array with leading dimension of 201
  4.890E-04  2.003E-05  5.090E-04  1.349E+03  1.483E-03  9.090E-03  -9.159E-15
  4.860E-04  1.895E-05  5.050E-04  1.360E+03  1.471E-03  9.017E-03   1.000E+00
  4.850E-04  2.003E-05  5.050E-04  1.360E+03  1.471E-03  9.018E-03   1.000E+00
  4.856E-04  1.730E-05  5.029E-04  1.365E+03  1.465E-03  8.981E-03   5.298E+02
times for array with leading dimension of 200
  4.210E-04  1.800E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03   1.000E+00
  4.200E-04  1.901E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03   1.000E+00
  4.200E-04  1.699E-05  4.370E-04  1.571E+03  1.273E-03  7.804E-03   1.000E+00
  4.288E-04  1.640E-05  4.452E-04  1.542E+03  1.297E-03  7.950E-03   5.298E+02
end of tests -- this version dated 05/29/04

Annotations from the slide: "dgefa" is the time spent in the matrix factorization routine and "dgesl" the time spent in the solver; "total" is the total time (dgefa + dgesl); "mflops" is the sustained floating point rate; the "unit" ("timing" unit) and "ratio" (fraction of Cray-1S execution time) columns are obsolete; b(1) is the first element of the right-hand-side vector; two different leading dimensions (201 and 200) are used to test the effect of array placement in memory.
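The norm. resid figure in the output above is a scaled residual used as the correctness check; a sketch of that kind of computation (the exact scaling used by the netlib driver may differ in detail):

```python
def scaled_residual(a, x, b, eps=2.220446049250313e-16):
    """||Ax - b||_inf / (n * ||A||_inf * ||x||_inf * eps): a value of
    order one means the answer is as accurate as the floating point
    arithmetic (machine epsilon eps) allows."""
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return max(abs(v) for v in r) / (n * norm_a * norm_x * eps)
```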

Reference: http://www.netlib.org/utk/people/JackDongarra/faqlinpack.html

Linpack's Highly Parallel Computing Benchmark (HPL) (slide 21)

• Measures the performance of distributed memory machines
• Used in the "Linpack Benchmark Report" (Table 3) and to determine the order of machines on the Top500 list
• The portable version is written in C
• External dependencies:
  – MPI 1.1 functionality for internode communication
  – BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx + y) and inner dot product (DDOT: a = Σ x_i y_i)
• Ground rules:
  – allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound)
  – same matrix as in the driver program
  – no restrictions on problem size

HPL Algorithm (slide 22)

• Data distribution: 2-D block-cyclic
• Algorithm elements:
  – right-looking variant of LU factorization with row partial pivoting, featuring multiple look-ahead depths
  – recursive panel factorization with pivot search and column broadcast combined
  – various virtual panel broadcast topologies
  – bandwidth-reducing swap-broadcast algorithm
  – backward substitution with look-ahead depth of one
• Floating point operation count: 2/3·n^3 + 2·n^2
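Two of the ingredients above can be sketched in a few lines of illustrative Python (not HPL's actual C code): the BLAS-level kernels HPL depends on (DAXPY, DDOT) and the 2-D block-cyclic assignment of matrix blocks to the process grid:

```python
def daxpy(alpha, x, y):
    """Scaled vector addition, y <- alpha*x + y (BLAS DAXPY)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """Inner product, a = sum_i x_i * y_i (BLAS DDOT)."""
    return sum(xi * yi for xi, yi in zip(x, y))

def block_cyclic_owner(i, j, nb, p, q):
    """Which process in a P x Q grid owns global matrix entry (i, j)
    under a 2-D block-cyclic distribution with block size NB: NB x NB
    blocks are dealt round-robin along each grid dimension, which keeps
    the load balanced as the factorization shrinks the active
    submatrix."""
    return ((i // nb) % p, (j // nb) % q)
```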

HPL Algorithm Elements (slide 23)

Execution flow for a single parameter set (the matrix is distributed over a P×Q grid of processors; the right-looking variant of LU factorization is used, and in each iteration of the loop a panel of NB columns is factorized and the trailing submatrix is updated):

  Matrix Generation
  -> Panel Factorization
  -> Panel Broadcast
  -> Look-ahead
  -> Update
  -> all columns of A processed? if not (N), loop back to Panel Factorization
  -> if so (Y), Backward Substitution
  -> Solution Check

Six broadcast algorithms are available.
Reference: http://www.netlib.org/benchmark/hpl/algorithm.html

HPL Linpack Metrics (slide 24)

• The HPL implementation of the benchmark is run for different problem sizes N on the entire machine
• For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax
• Another metric obtainable from the benchmark is N1/2, the problem size at which half of the maximum performance (Rmax/2) is achieved
• The Rmax value is used to rank supercomputers on the Top500 list; listed along with this number are the theoretical peak double precision floating point performance Rpeak of the machine and N1/2
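Given a sweep of measured (N, performance) pairs, Rmax and a crude estimate of N1/2 fall out directly; a sketch (real reports would interpolate N1/2 from a denser sweep):

```python
def rmax_nhalf(samples):
    """samples: list of (N, rate) pairs from runs at growing problem
    sizes. Returns (Nmax, Rmax) plus the smallest measured N whose rate
    reaches at least Rmax/2, a rough stand-in for N_1/2."""
    nmax, rmax = max(samples, key=lambda s: s[1])
    n_half = min(n for n, r in samples if r >= rmax / 2)
    return nmax, rmax, n_half
```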

Machine Parameters Influencing Linpack Performance (slide 25)

Parameter                  | Linpack Fortran, N=100 | Linpack Fortran, N=1000 (TPP) | HPL
Processor speed            | Yes                    | Yes                           | Yes
Memory capacity            | No                     | No (modern system)            | Yes (for Rmax)
Network latency/bandwidth  | No                     | No                            | Yes
Compiler flags             | Yes                    | Yes                           | Yes

Ten Fastest Supercomputers on Current Top500 List (slide 26)

#  | Computer          | Site                                                | Processors | Rmax    | Rpeak
1  | IBM BlueGene/L    | DoE/NNSA/LLNL (USA)                                 | 131,072    | 280,600 | 367,000
2  | Cray Red Storm    | Sandia (USA)                                        | 26,544     | 101,400 | 127,411
3  | IBM BGW           | IBM T. Watson Research Center (USA)                 | 40,960     | 91,290  | 114,688
4  | IBM ASC Purple    | DoE/NNSA/LLNL (USA)                                 | 12,208     | 75,760  | 92,781
5  | IBM MareNostrum   | Barcelona Supercomputing Center (Spain)             | 10,240     | 62,630  | 94,208
6  | Dell Thunderbird  | NNSA/Sandia (USA)                                   | 9,024      | 53,000  | 64,973
7  | Bull Tera-10      | Commissariat à l'Energie Atomique (France)          | 9,968      | 52,840  | 63,795
8  | SGI Columbia      | NASA/Ames Research Center (USA)                     | 10,160     | 51,870  | 60,960
9  | NEC/Sun Tsubame   | GSIC Center, Tokyo Institute of Technology (Japan)  | 11,088     | 47,380  | 82,125
10 | Cray Jaguar       | Oak Ridge National Laboratory (USA)                 | 10,424     | 43,480  | 54,205

Source: http://www.top500.org/list/2006/11/100

Java Linpack (slide 27)

• Intended mostly to measure the efficiency of the Java implementation rather than hardware floating point performance
• Solves a dense 500x500 system of linear equations with one right-hand side, Ax = b
• Matrix A is generated randomly
• Vector b is constructed so that all components of the solution x are one
• Uses Gaussian elimination with partial pivoting
• Reports: Mflops, time to solution, NormRes (solution accuracy), relative machine precision

HPL Demo (slide 28)

> mpirun -np 4 xhpl
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK

An explanation of the input/output parameters follows:
T/V:    Wall time / encoded variant.
N:      The order of the coefficient matrix A.
NB:     The partitioning blocking factor.
P:      The number of process rows.
Q:      The number of process columns.
Time:   Time in seconds to solve the linear system.
Gflops: Rate of execution for solving the linear system.

The following parameter values will be used:
N: 5000   NB: 32   PMAP: Row-major process mapping
P: 2 1 4   Q: 2 4 1
PFACT: Left   NBMIN: 2   NDIV: 2   RFACT: Left
BCAST: 1ringM   DEPTH: 0   SWAP: Mix (threshold = 64)
L1: transposed form   U: transposed form   EQUIL: yes
ALIGN: 8 double precision words

The matrix A is randomly generated for each test.
The following scaled residual checks will be computed:
1) ||Ax-b||_oo / (eps * ||A||_1 * N)
2) ||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)
3) ||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo)
The relative machine precision (eps) is taken to be 1.110223e-16
Computational tests pass if scaled residuals are less than 16.0

T/V         N     NB   P   Q   Time   Gflops
WR01L2L2    5000  32   2   2   7.14   1.168e+01
||Ax-b||_oo / (eps * ||A||_1 * N)         = 0.0400275 ...... PASSED
||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)   = 0.0264242 ...... PASSED
||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo) = 0.0051580 ...... PASSED

T/V         N     NB   P   Q   Time   Gflops
WR01L2L2    5000  32   1   4   7.00   1.192e+01
||Ax-b||_oo / (eps * ||A||_1 * N)         = 0.0335428 ...... PASSED
||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)   = 0.0221433 ...... PASSED
||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo) = 0.0043224 ...... PASSED

T/V         N     NB   P   Q   Time   Gflops
WR01L2L2    5000  32   4   1   7.00   1.191e+01
||Ax-b||_oo / (eps * ||A||_1 * N)         = 0.0426255 ...... PASSED
||Ax-b||_oo / (eps * ||A||_1 * ||x||_1)   = 0.0281393 ...... PASSED
||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo) = 0.0054928 ...... PASSED

Finished 3 tests with the following results:
  3 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
End of Tests.

For configuration issues, consult: http://www.netlib.org/benchmark/hpl/faqs.html

Other Parallel Benchmarks (slide 30)

• High Performance Computing Challenge (HPCC) benchmarks
  – Devised and sponsored to enrich the benchmarking parameter set
• NAS Parallel Benchmarks (NPB)
  – Powerful set of metrics
  – Reflects computational fluid dynamics
• NPB IO-MPI
  – Stresses the external I/O system

HPC Challenge Benchmark (slide 31)

Consists of 7 individual tests:
• HPL (Linpack TPP): floating point rate of execution of a solver of a linear system of equations
• DGEMM: floating point rate of execution of double precision matrix-matrix multiplication
• STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for simple vector kernels
• PTRANS (parallel matrix transpose): total capacity of the network using pairwise communicating processes
• RandomAccess: the rate of integer random updates of memory (in GUPS: Giga Updates Per Second)
• FFT: floating point rate of execution of double precision complex 1-D Discrete Fourier Transform
• b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns
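The STREAM entry can be illustrated with its triad kernel; a Python sketch of the metric only (an interpreted loop grossly understates real memory bandwidth, which is exactly why the tuned C benchmark is used instead):

```python
import time

def stream_triad(n=500_000, alpha=3.0):
    """STREAM-style triad, a[i] = b[i] + alpha*c[i]. The bandwidth
    figure counts the three arrays of 8-byte doubles touched per
    element (two reads plus one write), reported in GB/s."""
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    a = [bi + alpha * ci for bi, ci in zip(b, c)]
    dt = time.perf_counter() - t0
    return a, 3 * 8 * n / dt / 1e9
```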

Comparison of HPCC Results on Selected Supercomputers (slide 32)

[Bar chart: percentage of the maximum value attained by each system on each HPCC metric. Systems compared: "Red Storm" Cray XT3, Sandia (Opteron/Cray custom 3-D mesh); IBM p5 575, LLNL (POWER5/IBM HPS); IBM BlueGene/L, NNSA (PowerPC 440/IBM custom 3-D torus & tree); Cray X1E, ORNL (X1E/Cray modified 2-D torus); HP XC, Government (Itanium 2/Quadrics Elan4); "Columbia" SGI, NASA (Itanium 2/SGI NUMALINK); NEC SX-8, HLRS (SX-8/IXS crossbar); "Emerald" Rackable Systems, AMD (Opteron/Silverstorm Infiniband). Metrics and maximum values: G-HPL (max = 91 Tflops), G-PTRANS (max = 4666 GB/s), G-RandomAccess (max = 7.69 GUP/s), G-FFTE (max = 1763 Gflops), EP-STREAM system (max = 62890 GB/s), EP-DGEMM system (max = 161885 Gflops), RandomRing Bandwidth (max = 0.829 GB/s), RandomRing Latency (max = 118.6 µs).]

Notes:
• all metrics shown are "higher is better", except for the RandomRing Latency
• machine labels include: machine name (optional), manufacturer and system name, affiliation and (in parentheses) processor/network fabric type

NAS Parallel Benchmarks (slide 33)

• Derived from computational fluid dynamics (CFD) applications
• Consists of five kernels and three pseudo-applications
• Exists in several flavors:
  – NPB 1: original paper-and-pencil specification
    • generally proprietary implementations by hardware vendors
  – NPB 2: MPI-based sources distributed by NAS
    • supplements NPB 1
    • can be run with little or no tuning
  – NPB 3: implementations in OpenMP, HPF and Java
    • derived from the NPB serial version with improved serial code
    • a set of multi-zone benchmarks was added
    • tests the implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP with MPI)
  – GridNPB 3: new suite of benchmarks, designed to rate the performance of computational grids
    • includes only four benchmarks, derived from the original NPB
    • written in Fortran and Java
    • Globus as grid middleware

NPB 2 Overview (slide 34)

• Multiple problem classes (S, W, A, B, C, D)
• Tests written mainly in Fortran (IS in C):
  – BT (block tridiagonal solver with 5x5 block size)
  – CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix)
  – EP ("embarrassingly parallel"; evaluates an integral by means of pseudorandom trials)
  – FT (3-D PDE solver using Fast Fourier Transforms)
  – IS (large integer sort; tests both integer computation speed and network performance)
  – LU (a regular-sparse, 5x5 block lower and upper triangular system solver)
  – MG (simplified multigrid kernel; tests both short and long distance data communication)
  – SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations)
• Sources and reports available from: http://www.nas.nasa.gov/Resources/Software/npb.html
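The EP pattern is easy to picture: independent pseudorandom trials with a single reduction at the end. A toy Monte Carlo integration in that spirit (estimating pi, which is not EP's actual integrand):

```python
import random

def ep_style_pi(trials=200_000, seed=12345):
    """'Embarrassingly parallel' estimate of pi from pseudorandom
    trials: count points falling inside the unit quarter circle.
    Every trial is independent, so the loop splits across processors
    with no communication until the final sum."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / trials
```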

NPB IO-MPI (slide 35)

• Attempts to address the lack of I/O tests in NPB, focusing primarily on file output
• Based on the BTIO effort, which extended the BT benchmark with routines writing to storage five double precision numbers for every mesh point
  – runs for 200 iterations, writing every five iterations
  – after all time steps are finished, all data belonging to a single time step must be stored in the same file, sorted by vector components
  – timing must include all required data rearrangements to achieve the specified data layout
• Supported access scenarios:
  – simple: MPI-IO without collective buffering
  – full: MPI-IO with collective buffering
  – fortran: Fortran 77 file operations
  – epio: each process writes its part of the computational domain continuously to a separate file
• Number of processes must be a square
• Problem sizes: class A (64^3), class B (102^3), class C (162^3)
• Several possible results, depending on the benchmarking goal: effective flops, effective output bandwidth or output overhead
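The last bullet's derived results can be expressed directly; a sketch with hypothetical numbers (the benchmark's precise definitions live in the NPB-IO specification):

```python
def io_results(bytes_written, total_time, io_time):
    """Effective output bandwidth (MB/s) and output overhead, the
    fraction of the whole run spent producing the output files."""
    return bytes_written / io_time / 1e6, io_time / total_time
```

For example, 200 MB written during 2 s of I/O within a 10 s run gives 100 MB/s of effective bandwidth at 20% overhead.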

Sample NPB 2 Results (slide 36)

[Four plots of NPB 2 results (8/14/96, http://www.nas.nasa.gov/NAS/NPB/), each with number of processors on the x axis: per-node performance (Mflop/s/processor) of BT class A and total performance (Mflop/s) of LU class B for the IBM SP2, Cray T3D, SGI R8000 Array and Intel Paragon; per-node performance of the Cray T3D on LU classes A, B and C; and per-node performance of the IBM SP2 on LU, BT, SP and MG, class A.]

Reference: The NAS Parallel Benchmarks 2.1 Results by W. Saphir, A. Woo, and M. Yarrow, http://www.nas.nasa.gov/News/Techreports/1996/PDF/nas96010.pdf

Benchmarking Organizations (slide 38)

• SPEC
  – Created to satisfy the need for realistic, fair and standardized performance tests
  – Motto: "An ounce of honest data is worth more than a pound of marketing hype"
• TPC
  – Formed primarily due to the lack of reliable database benchmarks

SPEC Benchmark Suite Overview (slide 39)

• The Standard Performance Evaluation Corporation is a non-profit organization (financed by its members: over 60 leading computer and software manufacturers) founded in 1988
• SPEC benchmarks are written in a platform-neutral language (typically C or Fortran)
• The code may be compiled using arbitrary compilers, but the sources may not be modified
  – many manufacturers are known to optimize their compilers and/or systems to improve the SPEC results
• Benchmarks may be obtained by purchasing a license from SPEC; the results are published on the SPEC website
• Website: http://www.spec.org

SPEC Suite Components (slide 40)

• SPEC CPU2006: combined performance of CPU, memory and compiler
  – CINT2006 (aka SPECint): integer arithmetic test using compilers, interpreters, word processors, chess programs, etc.
  – CFP2006 (aka SPECfp): floating point test using physical simulations, 3-D graphics, image processing, computational chemistry, etc.
• SPECweb2005: PHP/JSP performance
• SPECviewperf: OpenGL 3-D graphics system performance
• SPECapc: several popular 3-D intensive applications
• SPEC HPC2002: high-end parallel computing tests using a quantum chemistry application, weather modeling, an industrial oil deposits locator
• SPEC OMP2001: OpenMP application performance
• SPECjvm98: performance of a Java client on a Java VM
• SPECjAppServer2004: multi-tier benchmark measuring the performance of J2EE application servers
• SPECjbb2005: server-side Java performance
• SPECMAIL2001: mail server performance (SMTP and POP)
• SPEC SFS97_R1: NFS server throughput and response time
• Planned: SPEC MPI2006, SPECimap, SPECpower, Virtualization
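SPEC's composite results (such as the CPU2006 numbers that follow) are geometric means of per-benchmark ratios against a reference machine, so no single test dominates. A simplified sketch of that reduction (SPEC's actual run rules add median-of-runs and other details):

```python
def spec_composite(ref_times, run_times):
    """Each benchmark scores (reference time / measured time); the
    suite result is the geometric mean of those scores."""
    ratios = [r / t for r, t in zip(ref_times, run_times)]
    prod = 1.0
    for v in ratios:
        prod *= v
    return prod ** (1.0 / len(ratios))
```

Note how one test run twice as fast and one run half as fast cancel exactly, which an arithmetic mean would not do.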

Sample Results: SPEC CPU2006 (slide 41)

System                                                     | CINT2006 Speed (base/peak) | CFP2006 Speed (base/peak) | CINT2006 Rate (base/peak) | CFP2006 Rate (base/peak)
Dell Precision 380 (Pentium EE 965 3.73 GHz, 2 cores)      | 11.6 / 12.4                | 23.1 / 21.7               | -                         | -
HP ProLiant DL380 G4 (Xeon 3.8 GHz, 2 cores)               | 11.4 / 11.7                | 20.9 / 18.8               | -                         | -
HP ProLiant DL585 (Opteron 854 2.8 GHz, 2 cores)           | 11.2 / 12.7                | 12.1 / 13.0               | 22.3 / 25.2               | 24.1 / 25.9
Sun Blade 2500 (1 UltraSPARC IIIi, 1280 MHz)               | 4.04                       | 4.04                      | -                         | -
Sun Fire E25K (UltraSPARC IV+ 1500 MHz, 144 cores)         | -                          | -                         | 759 / 904                 | -
HP Integrity rx6600 (Itanium 2 1.6 GHz/24 MB, 2 cores)     | 14.5 / 15.7                | 17.3 / 18.1               | -                         | -
HP Integrity rx6600 (Itanium 2 1.6 GHz/24 MB, 8 cores)     | -                          | -                         | 94.7 / 102                | 69.1 / 71.4
HP Integrity Superdome (Itanium 2 1.6 GHz/24 MB, 128 cores)| -                          | -                         | 1534 / 1648               | 1422 / 1479

Notes:
• the base metric requires that the same flags are used when compiling all instances of the benchmark (peak is less strict)
• the speed metric measures how fast a computer executes a single task, while rate determines throughput with multiple tasks

TPC (slide 42)

• Governed by the Transaction Processing Performance Council (http://www.tpc.org), founded in 1985
  – members include leading system and microprocessor manufacturers, and commercial database developers
  – the council appoints professional affiliates and auditors outside the member group to help fulfill the TPC's mission and validate benchmark results
• Current benchmark flavors:
  – TPC-C for transaction processing (de facto standard for On-Line Transaction Processing)
  – TPC-H for decision support systems
  – TPC-App for web services
• Obsolete benchmarks:
  – TPC-A (performance of update-intensive databases)
  – TPC-B (throughput of a system in transactions per second)
  – TPC-D (decision support applications with long-running queries against complex data structures)
  – TPC-R (business reporting, decision support)
  – TPC-W (transactional web e-Commerce benchmark)

Top Ten TPC-C Results (slide 43)

By Performance:
System (Database)                                   | tpmC      | Price per tpmC
IBM p5 595 (IBM DB2 9)                              | 4,033,378 | 2.97 USD
IBM eServer p5 595 (IBM DB2 UDB 8.2)                | 3,210,540 | 5.07 USD
IBM eServer p5 595 (Oracle 10g EE)                  | 1,601,784 | 5.05 USD
Fujitsu PRIMEQUEST 540 (Oracle 10g EE)              | 1,238,579 | 3.94 USD
HP Integrity Superdome (MS SQL Server 2005 EE SP1)  | 1,231,433 | 4.82 USD
HP Integrity rx5670 (Oracle 10g EE)                 | 1,184,893 | 5.52 USD
IBM eServer pSeries 690 (IBM DB2 UDB 8.1)           | 1,025,486 | 5.43 USD
IBM p5 570 (IBM DB2 UDB 8.2)                        | 1,025,169 | 4.42 USD
HP Integrity Superdome (Oracle 10g EE)              | 1,008,144 | 8.33 USD
IBM eServer p5 570 (IBM DB2 UDB 8.1)                | 809,144   | 4.95 USD

By Price/Performance:
System (Database)                                        | tpmC   | Price per tpmC
Dell PowerEdge 2900 (MS SQL Server 2005)                 | 65,833 | 0.98 USD
Dell PowerEdge 2800/2.8 GHz (MS SQL Server 2005 x64)     | 38,622 | 0.99 USD
Dell PowerEdge 2800/3.6 GHz (MS SQL Server 2005 WE)      | 28,244 | 1.29 USD
Dell PowerEdge 2800/3.4 GHz (MS SQL Server 2000 WE)      | 28,122 | 1.40 USD
Dell PowerEdge 2850/3.4 GHz (MS SQL Server 2000)         | 26,410 | 1.53 USD
HP ProLiant ML350 T03 (MS SQL Server 2000 SP3)           | 17,810 | 1.57 USD
HP ProLiant ML350 T03/3.06 (IBM DB2 UDB 8.1 Express)     | 18,661 | 1.61 USD
HP ProLiant ML350 T03/2.8 (IBM DB2 UDB 8.1 Express)      | 18,318 | 1.68 USD
HP ProLiant ML370 G4 1M/3.6 (MS SQL Server 2000 EE SP3)  | 68,010 | 1.80 USD
HP Integrity rx2600 (Oracle 10g)                         | 51,506 | 1.81 USD

Presentation of the Results (slide 45)

• Tables
• Graphs
  – Bar graphs
  – Scatter plots
  – Line plots
  – Pie charts
  – Gantt charts
  – Kiviat graphs
• Enhancements
  – Error bars, boxes or confidence intervals
  – Broken or offset scales (be careful!)
  – Multiple curves per graph (but avoid overloading)
  – Data labels, colors, etc.

Kiviat Graph Example (slide 46)

Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml

Mixed Graph Example (slide 47)

[Chart: for each of the applications WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC and PETSc_FUN3D, the computation and communication fractions of the run, and the breakdown into floating point operations, load/store operations and other operations.]

Characterization of NSF/CCT parallel applications on the POWER5 architecture (using data collected by IPM)

Graph Do's and Don'ts (slide 48)

• Good graphs:
  – Require minimum effort from the reader
  – Maximize information
  – Maximize the information-to-ink ratio
  – Use commonly accepted practices
  – Avoid ambiguity
• Poor graphs:
  – Have too many alternatives on a single chart
  – Display too many y variables on a single chart
  – Use vague symbols in place of text
  – Show extraneous information
  – Select scale ranges improperly
  – Use a line chart instead of a bar graph

Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10

Common Mistakes in Benchmarking (slide 49)

From Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain:
• Only average behavior represented in test workload
• Skewness of device demands ignored
• Loading level controlled inappropriately
• Caching effects ignored
• Buffering sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient performance
• Using device utilizations for performance comparisons
• Collecting too much data but doing very little analysis

Misrepresentation of Performance Results on Parallel Computers (slide 50)

• Quote only 32-bit performance results, not 64-bit results
• Present performance for an inner kernel, representing it as the performance of the entire application
• Quietly employ assembly code and other low-level constructs
• Scale problem size with the number of processors, but omit any mention of this fact
• Quote performance results projected to the full system
• Compare your results with scalar, unoptimized code run on another platform
• When direct run time comparisons are required, compare with an old code on an obsolete system
• If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation
• Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar
• Mutilate the algorithm used in the parallel implementation to match the architecture
• Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment
• If all else fails, show pretty pictures and animated videos, and don't talk about performance
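The operation-count rule is worth a worked example (with hypothetical numbers): a parallel code that performs redundant arithmetic can quote a higher MFLOPS figure than an honest count based on the best sequential algorithm, for exactly the same runtime and result.

```python
def mflops(op_count, seconds):
    """Rate from an operation count and a measured time."""
    return op_count / seconds / 1e6

t = 2.0                        # identical runtime in both quotes
best_sequential_ops = 1.0e9    # work the best serial algorithm needs
parallel_ops = 1.5e9           # work the parallel code actually performs

honest = mflops(best_sequential_ops, t)    # rate credited for useful work
inflated = mflops(parallel_ops, t)         # 50% higher, same computation
```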

Reference: David Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelveways.pdf

Knowledge Factors & Skills (slide 52)

• Knowledge factors:
  – benchmarking and metrics
  – performance factors
  – Top500 list
• Skill set:
  – determine the state of system resources and manipulate them
  – acquire, run and measure benchmark performance
  – launch user application codes

Material For Test (slide 53)

• Basic performance metrics (slide 4)
• Definition of a benchmark in your own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL (slides 21 and 24)
• Linpack compare and contrast (slide 25)
• General knowledge about the HPCC and NPB suites (slides 31 and 34)
• Benchmark result interpretation (slides 49, 50)

53