Measuring Performance

Computer Architecture 10: Measuring Performance
Made with OpenOffice.org

Performance Measurement

Factors influencing the performance of modern computers
  ● pure microelectronic & CPU architecture issues
  ● I/O performance
  ● interprocessor communication
  ● cache coherence
  ● memory hierarchy
  ● Amdahl's law
Field of computer application
  ● dedicated/specific tasks
  ● general-purpose

MIPS

Measure of a computer's processor speed in (Millions of) Instructions Per Second
  ● historically the oldest, straightforward and naïve
  ● rated against the VAX 11/780 of the 1970s as the 1 MIPS reference (0.5 MIPS actual)
  ● direct correlation with clock speed
  ● strong influence of the instruction set (CISC/RISC) and cache
  ● misunderstandings:
      ● Meaningless Indication of Processor Speed
      ● Meaningless Information on Performance for Salespeople

MIPS

Poor, but...
  ● gives some elementary performance overview
  ● very good for comparing CPUs with:
      ● the same instruction set
      ● the same reference code
      ● built with the same compiler
  ● important for microcontrollers, to estimate the resource-management abilities
  ● BogoMips ("bogus" MIPS) is a measurement of CPU speed made by the Linux kernel when it boots, to calibrate an internal busy-loop and cache effects

MIPS: Examples

Processor                                Performance               Year
Intel 8080                               640 kIPS at 2 MHz         1974
VAX 11/780                               500 kIPS                  1977
Motorola 68000                           1 MIPS at 8 MHz           1979
Intel 386DX                              8.5 MIPS at 25 MHz        1988
Intel 486DX                              54 MIPS at 66 MHz         1992
Intel Pentium III                        1354 MIPS at 500 MHz      1999
AMD Athlon                               3561 MIPS at 1.2 GHz      2000
AMD Athlon XP 2400+                      5935 MIPS at 2.0 GHz      2002
Pentium 4 Extreme Edition                9726 MIPS at 3.2 GHz      2003
ARM Cortex A8                            2000 MIPS at 1.0 GHz      2005
Xbox360 IBM "Xenon" Triple Core          6400 MIPS at 3.2 GHz      2005
AMD Athlon FX-57                         12000 MIPS at 2.8 GHz     2005
AMD Athlon 64 3800+ X2 (Dual Core)       14564 MIPS at 2.0 GHz     2005
AMD Athlon FX-60 (Dual Core)             18938 MIPS at 2.6 GHz     2006
Intel Core 2 X6800                       27079 MIPS at 2.93 GHz    2006
Intel Core 2 Extreme QX6700              57063 MIPS at 3.33 GHz    2006

FLOPS

Measure of a computer's CPU performance in FLoating point Operations Per Second
  ● used in fields of scientific calculation, which make heavy use of floating point
  ● counted operations: add, multiply, convert, sqrt, divide
  ● more precise than MIPS and frequently advertised
  ● shares some MIPS disadvantages
  ● benchmark software needed
  ● pocket calculator: up to 10 FLOPS
  ● PC CPUs: over 30 GFLOPS (2007)
  ● PC GPUs: over 500 GFLOPS (2007), but less flexible
  ● IBM Blue Gene/L supercomputer: 360 TFLOPS (peak), 2007

Software Benchmarks

Software designed to mimic a particular type of workload on a computer
System vs Component Benchmarks
Synthetic vs Application Benchmarks
  ● "Synthetic" benchmarks impose the workload by specially-created programs (best for testing individual components)
  ● "Application" benchmarks run actual real-world programs on the system (best for system-wide testing)
Benchmark-marketing: misrepresenting the significance of some benchmarks

TPC Benchmarks

Transaction Processing Performance Council (www.tpc.org): non-profit corporation founded to define and disseminate objective, verifiable performance data to the industry
Business, database & internet application oriented benchmarks:
  ● TPC-C: on-line transaction processing
  ● TPC-H: ad-hoc decision support
  ● TPC-App: application server and web services
  ● and others
Industry recognition

SPEC Benchmarks

The Standard Performance Evaluation Corporation (www.spec.org): non-profit corporation formed to establish a standardized set of benchmarks for high-performance computers
SPEC benchmark suites:
  ● CPU, Graphics/Workstations, MPI/OMP, Java Client/Server, Mail Servers, Network File System, Power and Performance, SIP, Virtualization, Web Servers
Industry recognition
Frequent updates

Computational Performance

Numerical computations have always constituted the fundamental applications for computers
  ● technology: from household appliances to space exploration
  ● sciences: from particle physics to the structure of the Universe
  ● mathematical theories
  ● human-oriented: virtual reality, artificial intelligence, genomic engineering, etc.
Both CPU and memory performance count
Natural limits: computational complexity
  ● complexity classes, Θ & Ω notations, time & memory issues

Linpack

Software library (in Fortran) for performing numerical linear algebra, dating from the beginning of the 1970s
The Linpack benchmark measures the speed of solving dense n×n systems of linear equations Ax=b, a common engineering task
The most widely used case was 100×100
Benchmark calculations are centered around:

    for (i = 0; i < N; i++) dy[i] = dy[i] + da*dx[i];

which measures both FP performance (multiply & add) and memory performance (2 reads, 1 write)
The result is given in MFLOPS

Linpack

The LINPACK library has been superseded by LAPACK (www.netlib.org)
The original Linpack benchmark (Gaussian elimination) was not scalable beyond 100×100
The growth of cache memories eliminated the measurement of memory performance with the 100×100 Linpack (only 320 kB of data)

Modern Comp. Benchmarks

Better suited for modern computer architectures (Lapack library/benchmarks)
  ● vector computers
  ● optimal cache usage
  ● shared & distributed memory systems
Complex combinations of real numerical problems from many fields of science
  ● SPEC CPU2006 Benchmarks:
      ● CINT2006 - The Integer Benchmarks
      ● CFP2006 - The Floating Point Benchmarks
Execution time should be in hours, not minutes

CINT2006 - Integer Benchmarks

Language  Application Area                Brief Description
C         Programming Language            Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
C         Compression                     Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
C         C Compiler                      Based on gcc Version 3.2, generates code for Opteron.
C         Combinatorial Optimization      Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
C         Artificial Intelligence: Go     Plays the game of Go, a simply described but deeply complex game.
C         Search Gene Sequence            Protein sequence analysis using profile hidden Markov models (profile HMMs).
C         Artificial Intelligence: chess  A highly-ranked chess program that also plays several chess variants.
C         Physics / Quantum Computing     Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
C         Video Compression               A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
C++       Discrete Event Simulation       Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
C++       Path-finding Algorithms         Pathfinding library for 2D maps, including the well known A* algorithm.
C++       XML Processing                  A modified version of Xalan-C++, which transforms XML documents to other document types.

CFP2006 - FP Benchmarks

Benchmark   Language  Application Area                  Brief Description
410.bwaves  Fortran   Fluid Dynamics                    Computes 3D transonic transient laminar viscous flow.
416.gamess  Fortran   Quantum Chemistry                 Gamess implements a wide range of quantum chemical computations. For the SPEC workload, self-consistent field calculations are performed using the Restricted Hartree Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration Self-Consistent Field.
433.milc    C         Physics / Quantum Chromodynamics  A gauge field generating program for lattice gauge theory programs with dynamical quarks.
434.zeusmp  Fortran   Physics / CFD                     ZEUS-MP is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for the simulation of astrophysical phenomena.
                                                        Molecular dynamics, i.e. simulate Newtonian equations of motion for
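
The entries in the MIPS examples table can be compared more directly by dividing the MIPS rating by the clock frequency in MHz, which gives the average number of instructions completed per clock cycle. The following is a minimal sketch in Python; the helper name mips_per_mhz is hypothetical, and the four rows are copied from the table as given on the slide.

```python
def mips_per_mhz(mips, clock_mhz):
    """Average instructions per clock cycle: (10^6 instr/s) / (10^6 cycles/s)."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return mips / clock_mhz


# (processor, MIPS rating, clock in MHz), values taken from the MIPS examples table
rows = [
    ("Intel 8080", 0.640, 2.0),
    ("Motorola 68000", 1.0, 8.0),
    ("Intel 486DX", 54.0, 66.0),
    ("Intel Core 2 Extreme QX6700", 57063.0, 3330.0),
]

for name, mips, mhz in rows:
    print(f"{name:30s} {mips_per_mhz(mips, mhz):6.2f} instructions per clock")
```

The early processors complete well under one instruction per clock, while the 2006 multi-core part completes many. The ratio only makes sense as a comparison between CPUs that share an instruction set and reference code, which is the restriction the MIPS slides already state.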
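
The loop at the centre of the Linpack benchmark performs one multiply and one add per element, so a pass over n elements counts as 2n floating-point operations. The sketch below shows how a timing of that loop could be turned into an MFLOPS figure. It is written in Python for brevity; the names daxpy and daxpy_mflops and the default sizes are assumptions made for illustration, and an interpreted loop will report far lower numbers than the Fortran or C benchmark would.

```python
import time


def daxpy(da, dx, dy):
    """In-place update dy[i] = dy[i] + da * dx[i] for every element."""
    for i in range(len(dx)):
        dy[i] = dy[i] + da * dx[i]
    return dy


def daxpy_mflops(n=100, repeats=1000):
    """Time `repeats` passes over vectors of length n and return MFLOPS."""
    dx = [1.0] * n
    dy = [2.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        daxpy(0.5, dx, dy)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    flops = 2 * n * repeats  # one multiply and one add per element
    return flops / elapsed / 1e6


if __name__ == "__main__":
    print(f"{daxpy_mflops():.1f} MFLOPS (interpreted Python, illustrative only)")
```

With n = 100 both vectors fit comfortably in cache, so such a run reflects arithmetic speed rather than memory speed, which is the limitation of the small Linpack case noted on the second Linpack slide.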
