Computer Architecture 10

Measuring Performance

Made with OpenOffice.org

Performance Measurement

Factors influencing the performance of modern computers
● pure microelectronic & CPU architecture issues
● I/O performance
● interprocessor communication
● cache coherence
● memory hierarchy
● Amdahl's law

Field of computer application
● dedicated/specific tasks
● general-purpose

MIPS

Measure of a computer's processor speed in (Millions of) Instructions Per Second
● historically the oldest, straightforward and naïve
● rated against the VAX 11/780 of the 70's (nominally 1 MIPS, actually 0.5 MIPS)
● direct correlation with clock speed
● strong influence of instruction set (CISC/RISC) & cache
● misunderstandings:
    ● Meaningless Indication of Processor Speed
    ● Meaningless Information on Performance for Salespeople
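The instruction-set dependence noted above can be made concrete with a minimal sketch. The two machines and their instruction counts below are hypothetical values chosen for illustration, not measurements; only the 2 MHz / 640 kIPS pairing comes from the examples slide.

```python
def mips(instruction_count, seconds):
    """Millions of instructions executed per second."""
    return instruction_count / (seconds * 1e6)


def mips_from_clock(clock_hz, cycles_per_instruction):
    """Equivalent form: clock rate divided by average cycles per instruction."""
    return clock_hz / (cycles_per_instruction * 1e6)


# Hypothetical: the same task compiled for two different instruction sets.
# Machine X uses fewer, more complex instructions; machine Y uses more, simpler ones.
count_x, time_x = 4_000_000, 2.0   # instructions, seconds
count_y, time_y = 9_000_000, 1.5

print("X:", mips(count_x, time_x), "MIPS")            # 2.0
print("Y:", mips(count_y, time_y), "MIPS")            # 6.0
print("Y is really faster by:", time_x / time_y)      # about 1.33x, not 3x
```

Machine Y reports three times the MIPS rating while finishing the task only about a third faster, which is why MIPS figures are comparable only for the same instruction set, reference code and compiler.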

MIPS

Poor, but...
● gives some elementary performance overview
● very good for comparing CPUs with:
    ● the same instruction set
    ● the same reference code
    ● built with the same compiler
● important for microcontrollers, to estimate the resource-management abilities
● BogoMips ("bogus" MIPS) is a measurement of CPU speed made by the Linux kernel when it boots, to calibrate an internal busy-loop (cache effects included)

MIPS – Examples

Processor                                Performance               Year
Intel 8080                               640 kIPS at 2 MHz         1974
VAX 11/780                               500 kIPS                  1977
Motorola 68000                           1 MIPS at 8 MHz           1979
Intel 386DX                              8.5 MIPS at 25 MHz        1988
Intel 486DX                              54 MIPS at 66 MHz         1992
Intel Pentium III                        1354 MIPS at 500 MHz      1999
AMD                                      3561 MIPS at 1.2 GHz      2000
AMD Athlon XP 2400+                      5935 MIPS at 2.0 GHz      2002
Extreme Edition                          9726 MIPS at 3.2 GHz      2003
ARM Cortex A8                            2000 MIPS at 1.0 GHz      2005
Xbox360 IBM "Xenon" Triple Core          6400 MIPS at 3.2 GHz      2005
AMD Athlon FX-57                         12000 MIPS at 2.8 GHz     2005
AMD Athlon 64 3800+ X2 (Dual Core)       14564 MIPS at 2.0 GHz     2005
AMD Athlon FX-60 (Dual Core)             18938 MIPS at 2.6 GHz     2006
2 X6800                                  27079 MIPS at 2.93 GHz    2006
Intel Core 2 Extreme QX6700              57063 MIPS at 3.33 GHz    2006
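Dividing a MIPS rating by the clock rate in MHz gives the average number of instructions completed per clock cycle, which shows how much of the growth in these examples comes from architecture rather than clock speed. The sketch below uses a few rows of the examples; the helper name is an illustrative choice, and for multi-core parts the rating presumably sums all cores.

```python
# (processor, MIPS rating, clock in MHz), values taken from the examples
examples = [
    ("Intel 8080", 0.640, 2.0),
    ("Motorola 68000", 1.0, 8.0),
    ("Intel 486DX", 54.0, 66.0),
    ("Intel Pentium III", 1354.0, 500.0),
    ("Intel Core 2 Extreme QX6700", 57063.0, 3330.0),
]


def instructions_per_cycle(mips_rating, clock_mhz):
    """Average instructions completed per clock cycle (MIPS / MHz)."""
    return mips_rating / clock_mhz


for name, rating, clock_mhz in examples:
    ipc = instructions_per_cycle(rating, clock_mhz)
    print(f"{name:30s} {ipc:6.2f} instructions per cycle")
```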

FLOPS

Measure of a computer's CPU performance in FLoating point Operations Per Second
● fields of scientific calculations - heavy use of floating point calculations
● add, multiply, convert, sqrt, divide
● more precise than MIPS and frequently advertised
● shares some MIPS disadvantages
● software needed
● pocket calculator – up to 10 FLOPS
● PC CPUs - over 30 GFLOPS (2007)
● PC GPUs - over 500 GFLOPS (2007), but less flexible
● IBM Blue Gene/L: 360 TFLOPS (peak), 2007

Software Benchmarks

Software designed to mimic a particular type of workload on a computer

System vs Component Benchmarks

Synthetic vs Application Benchmarks
● "Synthetic" benchmarks impose the workload by specially-created programs (best for testing individual components)
● "Application" benchmarks run actual real-world programs on the system (best for system-wide testing)

Benchmark-marketing: misrepresenting the significance of some benchmarks

TPC Benchmarks

Transaction Processing Performance Council (www.tpc.org) – non-profit corporation founded to define and disseminate objective, verifiable performance data to the industry

Business, database & internet applications oriented benchmarks:
● TPC-C & TPC-E - on-line transaction processing
● TPC-H - ad-hoc, decision support
● TPC-App - application server and web services
● and others

Industry recognition

SPEC Benchmarks

The Standard Performance Evaluation Corporation (www.spec.org) – non-profit corporation formed to establish a standardized set of benchmarks for high-performance computers

SPEC benchmark suites:
● CPU, Graphics/Workstations, MPI/OMP, Client/Server, Mail Servers, Network File System, Power and Performance, SIP, Virtualization, Web Servers

Industry recognition

Frequent update

Computational Performance

Numerical computations have always constituted the fundamental applications for computers
● engineering - from household appliances to space exploration
● sciences - from particle physics to the structure of the Universe
● mathematical theories
● human-oriented – virtual reality, artificial intelligence, genomic engineering, etc.

Both CPU and memory performance count

Natural limits - computational complexity
● complexity classes, Θ & Ω notations, time & memory issues

Linpack

Software library (in Fortran) for performing numerical linear algebra – beginning of the 70's

Linpack benchmark measures the speed of solution of dense n×n systems of linear equations Ax=b, a common engineering task

Most widely used case was: 100×100

Benchmark calculations are centered around:
    for (i=0; i<N; i++) dy[i] = dy[i]+da*dx[i];
which measures both FP performance (mul. & add) and memory performance (2 reads, 1 write)

Result is given in MFLOPS
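As a rough illustration of how an MFLOPS figure is derived from this kernel, the sketch below times the same update in plain Python and counts two floating-point operations per element. The function names, vector length and repeat count are assumptions made for the example; interpreted Python will report far lower numbers than compiled Fortran or C, so only the method is meaningful here.

```python
import time


def daxpy(da, dx, dy):
    """In-place dy[i] = dy[i] + da*dx[i]: one multiply and one add per element."""
    for i in range(len(dx)):
        dy[i] = dy[i] + da * dx[i]
    return dy


def daxpy_mflops(n=100_000, repeats=10):
    """Time repeated daxpy calls and convert the operation count to MFLOPS."""
    dx = [1.0] * n
    dy = [2.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        daxpy(0.5, dx, dy)
    elapsed = time.perf_counter() - start
    flops = 2 * n * repeats  # 1 multiplication + 1 addition per element
    return flops / (elapsed * 1e6)


print(f"{daxpy_mflops():.1f} MFLOPS")
```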

Linpack

LINPACK library is superseded by LAPACK (www.netlib.org)

Original Linpack benchmark (Gaussian elimination) was not scalable over 100x100

Growth of cache memories eliminated the measurement of memory performance with 100x100 Linpack (only 320kB of data)

Modern Comp. Benchmarks

Better suited for modern computer architectures (Lapack library/benchmarks)
● vector
● optimal cache usage
● shared & distributed memory systems

Complex combinations of real numerical problems from many fields of science
● SPEC CPU2006 Benchmarks:
    ● CINT2006 - The Integer Benchmarks
    ● CFP2006 - The Floating Point Benchmarks

Execution time should be in hours, not minutes

CINT2006 - Integer Benchmarks

Language  Application Area                Brief Description
C         Programming Language            Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
C         Compression                     Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
C         C Compiler                      Based on gcc Version 3.2, generates code for Opteron.
C         Combinatorial Optimization      Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
C         Artificial Intelligence: Go     Plays the game of Go, a simply described but deeply complex game.
C         Search Gene Sequence            Protein sequence analysis using profile hidden Markov models (profile HMMs).
C         Artificial Intelligence: chess  A highly-ranked chess program that also plays several chess variants.
C         Physics / Quantum Computing     Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
C         Video Compression               A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
C++       Discrete Event Simulation       Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
C++       Path-finding Algorithms         Pathfinding library for 2D maps, including the well known A* algorithm.
C++       XML Processing                  A modified version of Xalan-C++, which transforms XML documents to other document types.

CFP2006 – FP Benchmarks

Benchmark     Language    Application Area                    Brief Description
410.bwaves    Fortran     Fluid Dynamics                      Computes 3D transonic transient laminar viscous flow.
416.gamess    Fortran     Quantum Chemistry                   Gamess implements a wide range of quantum chemical computations. For the SPEC workload, self-consistent field calculations are performed using the Restricted Hartree Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field.
433.milc      C           Physics / Quantum Chromodynamics    A gauge field generating program for lattice gauge theory programs with dynamical quarks.
434.zeusmp    Fortran     Physics / CFD                       ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for the simulation of astrophysical phenomena.
435.gromacs   C, Fortran  Biochemistry / Molecular Dynamics   Molecular dynamics, i.e. simulate Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM C, Fortran  Physics / General Relativity        Solves the Einstein evolution equations using a staggered-leapfrog numerical method.
437.leslie3d  Fortran     Fluid Dynamics                      Computational Fluid Dynamics (CFD) using Large-Eddy Simulations with Linear-Eddy Model in 3D. Uses the MacCormack Predictor-Corrector time integration scheme.
444.namd      C++         Biology / Molecular Dynamics        Simulates large biomolecular systems. The test case has 92,224 atoms of apolipoprotein A-I.
447.dealII    C++         Finite Element Analysis             deal.II is a C++ program library targeted at adaptive finite elements and error estimation. The testcase solves a Helmholtz-type equation with non-constant coefficients.
450.soplex    C++         Linear Programming, Optimization    Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.
453.povray    C++         Image Ray-tracing                   Image rendering. The testcase is a 1280x1024 anti-aliased image of a landscape with some abstract objects with textures using a Perlin noise function.
454.calculix  C, Fortran  Structural Mechanics                Finite element code for linear and nonlinear 3D structural applications. Uses the SPOOLES solver library.
459.GemsFDTD  Fortran     Computational Electromagnetics      Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method.
465.tonto     Fortran     Quantum Chemistry                   An open source quantum chemistry package, using an object-oriented design in Fortran 95. The test case places a constraint on a molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data.
470.lbm       C           Fluid Dynamics                      Implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D.
481.wrf       C, Fortran  Weather                             Weather modeling from scales of meters to thousands of kilometers. The test case is from a 30km area over 2 days.
482.sphinx3   C           Speech recognition                  A widely-known speech recognition system from Carnegie Mellon University.

Customizing Benchmarks

Realistic performance measurement often depends on many side-factors:
● specific features of CPU/IO architecture
● code optimization level
● operating system interference

Four approaches to benchmarks
● no source modification, strict set of compiler flags
● source modification allowed, but unfeasible
● source modification allowed and expected
● low-level coding, hand-crafting, etc.

Comparing Performance

Lots of misunderstandings due to lack of (or unclear) performance definitions

Performance can be compared only according to a precisely stated criterion

Providing identical testing conditions for different architectures may be difficult

There is no "overall / general / universal" performance criterion

For all their drawbacks, benchmarks tell a lot about real and relative performance

Total Execution Time

Benchmark as a mixture of programs

Performance reciprocal to total exec. time

             Computer A   Computer B   Computer C
Program 1             1           10           20
Program 2          1000          100           20
Total time         1001          110           40

More confusion than answers!

Is the mixture proper for the workload?
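The confusion can be reproduced directly from the table: the ranking by total time is the reverse of the ranking on Program 1 alone. A small sketch, with helper names chosen for illustration:

```python
# Execution times from the table, indexed by computer and program
times = {
    "A": {"Program 1": 1, "Program 2": 1000},
    "B": {"Program 1": 10, "Program 2": 100},
    "C": {"Program 1": 20, "Program 2": 20},
}


def total_time(machine):
    """Sum of execution times of all programs in the mixture."""
    return sum(times[machine].values())


def speedup(fast, slow, program=None):
    """How many times faster `fast` is than `slow`, overall or for one program."""
    if program is None:
        return total_time(slow) / total_time(fast)
    return times[slow][program] / times[fast][program]


print("C vs A, total time:", speedup("C", "A"))              # C wins overall
print("A vs C, Program 1: ", speedup("A", "C", "Program 1")) # A wins on Program 1
```

Which answer matters depends entirely on how often each program runs in the real workload.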

Weighted Mean Execution Time

Weights reflect the real workload proportions

Various weight sets can be available

Weights can normalize execution times (e.g. weights inversely proportional to the times measured on one machine)

             Computer A   Computer B   Computer C      W1      W2      W3
Program 1             1           10           20   0.500   0.909   0.999
Program 2          1000          100           20   0.500   0.091   0.001
Total time         1001          110           40

Mean W1          500.50        55.00        20.00
Mean W2           91.91        18.19        20.00
Mean W3            2.00        10.09        20.00
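The means in the table are weighted arithmetic means of the two program times. The sketch below recomputes them and also shows one way such weight sets can arise: making the weights inversely proportional to the times on a chosen machine reproduces W1 (from Computer C), W2 (from Computer B) and W3 (from Computer A). The function names are illustrative.

```python
# Times for (Program 1, Program 2) on each computer, and the three weight sets
times = {"A": (1, 1000), "B": (10, 100), "C": (20, 20)}
weights = {"W1": (0.500, 0.500), "W2": (0.909, 0.091), "W3": (0.999, 0.001)}


def weighted_mean(ts, ws):
    """Weighted arithmetic mean of execution times."""
    return sum(w * t for w, t in zip(ws, ts))


def equalizing_weights(ts):
    """Weights inversely proportional to the times, scaled to sum to 1."""
    inverse = [1.0 / t for t in ts]
    total = sum(inverse)
    return [x / total for x in inverse]


for name, ws in weights.items():
    row = "  ".join(f"{m}: {weighted_mean(ts, ws):7.2f}" for m, ts in times.items())
    print(name, row)

print([round(w, 3) for w in equalizing_weights(times["B"])])  # matches W2
```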

Normalized Execution Time

Normalize exec. times to a reference machine

Take the geometric average of normalized times
● as done in the SPEC CPU2006 Benchmarks

                 A       B       C
Program 1        1      10      20
Program 2     1000     100      20
Total time    1001     110      40

              Normalized to A          Normalized to B          Normalized to C
                 A       B       C       A       B       C       A       B       C
Program 1     1.00   10.00   20.00    0.10    1.00    2.00    0.05    0.50    1.00
Program 2     1.00    0.10    0.02   10.00    1.00    0.20   50.00    5.00    1.00
Total time    1.00    0.11    0.04    9.10    1.00    0.36   25.03    2.75    1.00
Arith. Mean   1.00    5.05   10.01    5.05    1.00    1.10   25.03    2.75    1.00
Geom. Mean    1.00    1.00    0.63    1.00    1.00    0.63    1.58    1.58    1.00
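The table shows why the geometric mean is preferred for normalized times: with the arithmetic mean the apparent winner changes with the reference machine, while the ratio between two machines' geometric means stays the same. A minimal sketch that checks both properties on the table's data (function names are illustrative):

```python
import math

# Times for (Program 1, Program 2) on each computer
times = {"A": (1, 1000), "B": (10, 100), "C": (20, 20)}


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))


for ref in times:
    best = min(times, key=lambda m: arithmetic_mean(normalized(m, ref)))
    ratio = geometric_mean(normalized("C", ref)) / geometric_mean(normalized("A", ref))
    print(f"reference {ref}: arithmetic-mean winner = {best}, "
          f"geometric C/A ratio = {ratio:.2f}")
```

With this data the arithmetic mean always favors whichever machine was picked as the reference, whereas the geometric C/A ratio is about 0.63 in all three cases.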

Normalization Drawbacks

No prediction of total execution time for various workloads

Smaller and simpler components have equal contribution as large and complex ones

Better results possible by improvement in the easiest components, not in the slowest
● e.g. improvement by 50% in the easiest component has the same effect as 50% improvement in the most difficult and time consuming benchmark module
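This drawback is easy to verify numerically. The sketch below takes two program times of 1 and 1000 (Computer A's times from the earlier tables, used here as an assumed baseline), halves one of them at a time, and compares the geometric mean of the normalized times with the total execution time.

```python
import math


def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))


baseline = [1.0, 1000.0]      # easy program, hard program
faster_easy = [0.5, 1000.0]   # 50% improvement in the easy program only
faster_hard = [1.0, 500.0]    # 50% improvement in the hard program only


def score(ts):
    """Geometric mean of times normalized to the baseline (lower is better)."""
    return geometric_mean([t / b for t, b in zip(ts, baseline)])


print("score, easy program improved:", score(faster_easy))
print("score, hard program improved:", score(faster_hard))
print("total time, easy improved:   ", sum(faster_easy))
print("total time, hard improved:   ", sum(faster_hard))
```

Both variants get the same normalized score, yet only the second one roughly halves the total execution time.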

Computer Design Principles

Hardware
● Make the common task fast
● Consult Amdahl's Law

Software
● Make programs local: spatially and temporally
● Increase parallelism

Amdahl's Law

Speedup obtained by a certain improvement is limited by the fraction of time the improvement is used

Effect of diminishing returns
● $F_{imp}$ - fraction of the total exec. time that can be affected by the improvement
● $S_{imp}$ - speedup in $F_{imp}$ due to the improvement

Overall speedup:

$$S_{all} = \frac{1}{(1 - F_{imp}) + \frac{F_{imp}}{S_{imp}}}$$
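The diminishing return can be seen by evaluating the formula for growing values of $S_{imp}$: the overall speedup approaches, but never exceeds, $1 / (1 - F_{imp})$. A minimal sketch, where the function names and the fraction 0.4 are illustrative choices:

```python
def amdahl_speedup(f_imp, s_imp):
    """Overall speedup when a fraction f_imp of the time is sped up s_imp times."""
    return 1.0 / ((1.0 - f_imp) + f_imp / s_imp)


def speedup_limit(f_imp):
    """Upper bound on the overall speedup as s_imp grows without limit."""
    return 1.0 / (1.0 - f_imp)


f_imp = 0.4
for s_imp in (2, 10, 100, 1000):
    print(f"S_imp = {s_imp:5d}  ->  S_all = {amdahl_speedup(f_imp, s_imp):.4f}")
print(f"limit            ->  S_all = {speedup_limit(f_imp):.4f}")
```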

Amdahl's Law – Example

CPU is busy 40% and I/O takes 60% of time

New CPU is 10 times faster!

What will be the total speedup?

$F_{imp} = 0.4$, $S_{imp} = 10$

$$S_{all} = \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$

Amdahl's Law – Example

FP SQRT takes 20% of total execution time and all FP instructions take 50% of total

What is better?
● speed up FP SQRT by a factor of 10
● speed up all FP instructions by a factor of 1.6

FP SQRT: $F_{imp} = 0.2$, $S_{imp} = 10$ → $S_{all} = 1.22$

all FP: $F_{imp} = 0.5$, $S_{imp} = 1.6$ → $S_{all} = 1.23$
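Spelled out with the overall-speedup formula, the two options work out as follows (intermediate values rounded only at the last step):

$$S_{all}^{SQRT} = \frac{1}{(1 - 0.2) + \frac{0.2}{10}} = \frac{1}{0.82} \approx 1.22$$

$$S_{all}^{FP} = \frac{1}{(1 - 0.5) + \frac{0.5}{1.6}} = \frac{1}{0.8125} \approx 1.23$$

The modest speedup of all FP instructions wins narrowly, because it applies to a larger fraction of the execution time.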

Final Notes

Price-to-Performance ratio really matters

Add software cost to Price-to-Performance ratio

Benchmarks do not remain valid indefinitely

Be careful about peak-performance data

Synthetic benchmarks cannot mimic real programs

Do not mistake optimization techniques or compiler efficiency for system performance

Improvement is almost always possible, but may be unfeasible (cost/time/difficulty)

Do not overestimate improvements (Amdahl's Law)