10.5 PREHISTORIC PERFORMANCE RATINGS: MIPS, MOPS, AND MFLOPS

Lerd Kelvin might have been predicting precessnr perfermanee measurements when he said, “When yen can measure what yen are speaking abeut, and express it in numbers, yen knew semething abeut it; but when yen eannet measure it, when yen cannet express it in numbers, ynur knnwledge is efa meager and unsatisfactnry kind. It may,r be the beginning ef knnwledge. but gnu have scarcely, in yeur theughts, advanced “tn the stage ef science." [3] The need tn rate precesser periermance is an great that, at first, micrepreeesser "renders grabbed any and all numbers at hand in rate perfermance. The first figures ef merit used were likelyr eleck rate and memerr bandwidth. These prehisterie ratings measured precesser perfer- manee dieereed ef any cenneetien te running cede. Censequently, these perfermanee ratings are net benchmarks. Just as an engine's revelutiens per minute (1113M) reading is net sufficient te measure vehicle perfermanee [yen need engine terque plus transmissien gearing, differential gearing, and tire diameter te eempute speed), the prehisteric, eieekrrelated precesser ratings el" MIPS, MOPS, MFLOPS, and VAX MIPS [defined belew) tell yen almest nething abeut a preces- ser's true perfermanee petential. Befere it was the name ef a micreprecesser architecture, the term "MIPS" was an acrnnym fer mitiiens efinstrnetiens per serene. If all precessers had the same instructinnfiset architecture {ISA} and used the same cempiler te generate cede. then a MIPS rating might pessibly be used as a perfermance measure. Hewever, all precessers de net have the same ISA and they mnst defi- nitely de net use the same cen'ipiler, se titer;F are net equally efficient when it cemes te task esecu- tien speed versus clack rate. in fact, micreprecessers and precesser ceres have very different lSAs and cempilers fer these precessers are differently abled when generating cede. Censequently, seme preeessers can de mere with ene instructien than ether preeessers, just as large autemebile engines can de mere than smaller enes running at the same RPM. This preblem was already had in the days when enly cemplex~instructitin—set cemputer (CISC) precessers reamed the earth. The preblem went frem bad te werse when reducedkinstructienmset eemputer (RISC) preeessers arrived en the scene. One CISC instructien weuld eften tie the werk at several RISC instructiens (by design} an that a CISC micre precesser's MIPS rating did net cer- relate at all with a RlSC micreprecesser's MIPS rating because ef the werk differential between RISC’s simple instructiens and CISC’s mere cemplex instructiens. The nest prehisteric step in creating a usable precesser perferrnance rating was te switch frem MIPS te VAX MIPS. which was acce-rnplished by setting the entremel}r successful VAX 11i78i} minicemputer—intreduced in, 19?? by the new defunct Digital Equipment Cerperatien {DECl—as the against which all ether precessers are measured. Se, if a micreprecesr ser executed a set ef pregrarns twice as fast as a VAX HE’SU. it was said te be rated at 2 VAX MIPS. The eriginal term “MIPS" then became "native MIPS.” se as net te cenfuse the eriginal ratings with VAX MIPS. DEC referred te VAX MIPS as VAX units ef perfermance [VUPsl just te keep things interesting er cenfusing depending en seer peint ef view. Beth native and VAX MIPS are weefully inadequate measures ef precesser perfermance because they were usually previded witheut specifying the seftware (er even the pregram- ming language) used te create the rating. 
Because different pregrams have diflerent instrucw tien mines. different memer},r usage patterns, and different data-mevement patterns. the same precesser ceuld easily earn ene MIPS rating en ene set ef pregrams and quite a different rating en anether set. Because MIPS ratings are net linked in a specific benchmark pregram suite. the MIPS acrenrm new stands fer meaningless indieatien efperferrnanee fer these in the knew. A farther prehiern with the VAX MIPS measure efpreeesser pederrnanee is that the eeneept efnsing a VAX iix’730 rninieernpnter as the geld perfermance standard is an idea that is mere than a hit ieng in the teeth in the twenty-first century. There are an [eager many {er any) VAX iii/’780s avaitahfe fer running benchmark cede and DEC afiectivefy disappeared when Cempaq Cernpn ter Cerp. purchased what was ftft ef the eernpa ny in 1a nnary i 995'. feitewing the deetine ef the rninieernpnter market. Hewlett-Packard ahserhed Cenipaa in May 2002. reenterging DEC's identity even farther. VAX MIPS is new the precesser eqaivatent efafirrfengs-per-fertnight speed rating—weefntiy eatdated. Even mere tenueus than the MIPS perfermance rating is the cencept ef MOPS, an acrenym that stands fer "milliens ef eperatiens per secend." Every algorithmic task requires the cemple- tien ef a certain number ef fundamental eperatiens. which may er triager net have a ene-te-ene cerrespendence with machine instructiens. Ceunt these fundamental eperatiens in the milliens and thew,r beceme MOPS. if the}r are ileating-peint eperatiens. yeu get MFLOPS. One theusand MFLOPS equals ene GFLOPS. The MOPS. MFLOPS. and GFLOPS ratings suffer frem the same drawback as the MIPS rating: there is ne standard seftware te serve as the benchmark that preduces the ratings. In additien. the eenversien faster fer cementing hew man}r eperatiens a precesser perferms per cleck {er hew many precesser instructiens censtitute ene eperatien) is semewhat fluid as well. which means that the precesser vender is free te develep a cenversien facter en its ew n. Censequently, MOPS and MELOPS perferrnance ratings exist fer varieus pre- cesser ceres but they really de net help an IC design team pick a precesser cer-e because they are net true benchmarks.

10.6 CLASSIC. PROCESSOR BENCHMARKS (THE STONE AGE)

Like 155s. standardized precesser perfermance benchmarks predate the 19Tl intreductien ef 's 4004 micreprecesser, but just barely. The first benchmark suite te attain de facte standard status was a set ef pregrams knewn as the Livermere Kernels (alse pepularl},r called the Livermere Leeps).

10.6.1 LIVERMORE KERNELSILIUERMORE LOOPS BENCHMARK

The Livermere Kernels were first develeped in 1er and censist ef14 numerically intensive appli- catien kernels written in FORTRAN. Ten mere kernels were added during the earl}: 19805 and the final suite ef benchmarks was discussed in a paper published in 1986 by F. H. McMahen ef the Lawrence Livermere Natienal l..aberater}lr [LLNLL lecated in Livermere, Ch. [4]. '[he Livermere

Kernels actually constitute a benchmark, measuring a processor’s floating-point computational performance in terms of MFLOPS (millions of floatingepoint operations per sec» ond). Because of the somewhat frequent occurrence of floating-point errors in many , the Livermor-e Kernels test both the processor’s speed and the system's computational accu- racy. Today, the Livermore Kernels are called the Livermore FORTRAN Kernels (LFK) or the Livermore Loops. The Livermore Loops are real samples of floatingepoint computation taken from a diverse workload of scientific applications extracted from operational program code used at LLNL- The kernels were extracted from programs in use at LLNL because those programs were generally far too large to serve as useful benchmark programs; they included hardware-s specific subroutines for performing functions such as NO, memory management, and graphics that were not appropriate for benchmark testing of floating—point performance; and they were largely classified. due to the nature of the work done at LLN L. Some kernels represent widely used, generic computations such as dot and matrix (SAXPY) products, polynomials, first sum and differences, firsteorder recurs rences. matrix solvers, and array searches. Some kernels typify often—used FORTRAN constructs while others contain constructs that are difficult to compile into efficient machine code. These kernels were selected to represent both the best and worst cases of common FORTRAN pro»- gramming practice to produce results that measure a realistic floating-point performance range- by challenging the FORTRAN 's ability to produce optimized machine code. Table 10.1 lists the 24 Livermore Loops. A complete LFK run produces 72 timed results. produced by timing the execution of the 24 LFK kernels three times using difi'erent DO-Ioop lengths. The LFK kernels are a mixture of vectorizable and nonvectorizable loops and test the compu- tational capabilities of the processor hardware and the software tools’ ability to generate efficient machine code. The Livermore Loops also tests a processor’s vector abilities and the associated software tools’ abilities to vectorixe code.

TABLE 140.1 Twenty-Four Kernels in the Livermore Loops

LFK Kernel Number Kernel Description

Kernel 1 An excerpt from a hydrodynamic application Kernel 2 An excerpt from an incomplete Cholesky-Conjugate |[Eiradient program Kernel 3 [he standard inner~product function from linear algebra Kernel 4 An excerpt from a banded linear equation routine Kernel 5 An excerpt from a tridlagonal elimination routine Kernel a An example of a general linear recurrence equation Kernel i' Equation of state code fragment {as used in nuclear weapons researc h} Kernel 8 An excerpt of an alternating direction, implicit integration program Kernel 9 An integrate predictor program Kernel to A difference predictor program Kernel l l A first sum Kernel l? A first difference Kernel 13 An excerpt from a 2D particle—in- program Kernel in An excerpt from a ll) particle-incell program Kernel 15 A sample of casually written l URIHAN to produce suboptimal machine code Kernel 1a A search loop from a Monte lCarlo program Kernel 1? An example of an implicit conditional computation Kernel 13 An excerpt from a 2D explicit hydrodynamic program Kernel 19 A general linear recurrence equation Kernel 20 An excerpt from a discrete ordinate transport program Kernel 2] A matrix product calculation Kernel 22 A Planckian distribution procedure Kernel ?3 An excerpt from a Pl") implicit hydrodynamic program Kernel 24 A kernel that finds the location ofthe first minimum in it

At first glance, the Livermore Loops benchmark appears to be nearly useless for the benchmark- ing of embedded processor cores in le. It is a floating-point benchmark written in FORTRAN that looks for good vector abilities. FORTRAN for embedded processor cores are quite rare—essentially nonexistent. Today, very few real-world applications run tasks like those appear- ing in the 24 Livermore Loops, which are far more suited to research on the effects of very high* speed nuclear reactions than the development of commercial, industrial, or consumer products. It is unlikely that the processor cores in a mobile phone handset, a digital camera, or a tablet will ever need to perform 2D hydrodynamic calculations. Embedded-software developers working in FORTRAN are also quite rare. Consequently, processor core vendors are quite unlikely to tout Livermore Loops. benchmark results for their processor cores. The Livermore Loops benchmarks are far more suited to testing . However, as on- chip gate counts grow, as proces- sor cores gain single-instruction, multipleadata (31MB) and flraatingepoint execution units, and as [ designs increasingly tackle tougher number-crunching applications including the imple- mentation of audio and video codecs, the Livermore Loops benchmark could become valuable for testing more than just supercomputers.

10.6.2 LINPACK

LINPACK is a collection of FORTRAN subroutines that analyse and solve linear equations and linear least-squares problems. lack Dongarra assembled the LINPACK collection of lin- ear algebra routines at the Argonne National Laboratory in Argonne, IL. The first versions of LINPACK existed in 1976 but the first users’ guide was published in 19?”? {5,6}. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular. and tridiagonal square. In addition, the package computes the QR and singular value decompositions of rectangular matrices and applies them to least-squares problems. The LINPACK routines are not. strictly speaking, a benchmark but they exercise a 's floating-point capabilities. As of 2014, LINPACK benchmark has been largely supplanted by the Linear Algebra Package (LAPACK) benchmark, which is designed to run efficiently on sharedememory, vector supercomputers. The University of Tennesee maintains these programs at www.netlib.org. Originally, LIN PACK benchmarks performed computations on a 100 x 100 matrix, but with the rapid increase in computing performance that has occurred, the size of the arrays grew to 1000 x 1000. The original versions of LINPACK were written in FORTRAN but there are now versions written in C and as well. Performance is reported in single- and double-precision MFLOPS, which reflects the benchmark ’s (and the national labs') focus on "big iron" machines (supercomputers). LINPACK and LAPACK benchmarks are not commonly used to measure or processor core performance.

10.6.3 BENCHMARK

The Whetstone benchmark was written by Harold Curnow of the now defunct Central Computer and Telecommunications Agency (CCTA), which was tasked with computer procurement in the British government. The CCTA used the Whetstone benchmark to test the performance of com- puters being considered for purchase. The Whetstone benchmark is the first program to appear in print that was designed as a synthetic benchmark for testing processor performance. it was specifically developed to test the performance of only one computer: the hypothetical Whetstone machine. The Whetstone benchmark is based on application-program statistics gathered by Brian A. Wichmann at the National Physical Laboratory in England. 1i'tleiichmann was using an Algolv 60 compilation system that compiled Algol statements into instructions for the hypothetical Whetstone computer system, which was named after the small town of Whetstone located just outside the city of Leicester, England, where the compilation system was developed. Wich mann compiled statistics on instruction usage for a wide range of numerical computation programs then in use. An Algol—fitl version of the Whetstone benchmark was released in November 1972 and single— and doublesprecision FORTRAN versions appeared in April 1973. The FORTRAN versions became the first widely used. general-purpose. synthetic benchmarks. Information about the Whetstone benchmark was first published in 197’6 [7]. The Whetstone benchmark suite consists of several small loops that perform integer and floating-point arithmetic. array indexing. procedure calls. conditional jumps. and elementary function evaluations. These loops reside in three subprograms (called p3. p0. and pa) that are called from a main program. The subprograms call trigonometric (sine. cosine. and arctangent) functions and other math-library functions (exponentiation. log. and square root). The benchmark. authors empirically managed the Whetstone's instruction mix by manipulating loop counters within the modules to match Wichmann's instruction—mix. statistics. Empirical Whetstone loop weights range from 12 to 399. The 1lilihetstone benchmark produces speed ratings in terms of thousands of Whetstone (KWIPS). thus using the hypothetical Whetstone computer as the gold standard for this benchmark in much the same way that the actual VAX 1122730 was used as the golden reference for VAX MIPS and VUIls. In 1973. self-timed versions of the Whetstone bench- mark written by Roy Longbottom (also of CCTA) produced speed ratings in MOPS and MFL‘OPS and an overall rating in MWIPS. E As with the LFK and LINPACK benchmarks. the Whetstone benchmark focuses on floating- point performance. Consequently. it is not commonly applied to embedded processor cores that are destined to execute integer- oriented and control tasks on an [(3 because such a benchmark really would not tell you much about the performance of the processotts) in question. However. the Whetstone benchmark's appearance spurred the creation of a plethora of "stone" bench- marks. and one of those benchmarks. known as the “.” became the first widely used benchmark to rate processor performance.

10.6.4 DHRYSTONE BENCHMARK

Reinhold P. Weicker. working at Siemens-Nisdorf information Systems. wrote and published the first version of the Dhrystone benchmark in 1984 [8]. The benchmark’s name. Dhrystone. is a pun derived from the Whetstone benchmark. The Dhrystone benchmark is a synthetic benchmark that consists of 12 procedures in one measurement loop. It produces performance ratings in Ihrystones per second. Originally. the Dhrystone benchmark was written in a “Pascal subset of Ada." Subsequently. versions in Pascal and C have appeared and the C version is the one most used today. The Dhrystone benchmark differs significantly from the Whetstone benchmark and these dif- ferences made the Dhrystone far more suitable as a processor benchmark. First. the Dhrystone is strictly an integer program. It does not test a processor's floating—point abilities because most in the early 1930s (and even those in the early twenty-first century) had no native floating-point computational abilities. Floating-point capabilities are now far more com- mon in microprocessor cores. however. The Dhrystone devotes a lot of time executing string functions (copy and compare). which microprocessors often execute in real applications. The original version of the Dhrystone benchmark quickly became successful. One indica- tion of the Dhrystone’s success was the attempts by microprocessor vendors to unfairly inflate their Dhrystone ratings by “gaming" the benchmark (cheating). Weicker made no attempt to thwart compiler optimisations when he wrote the Dhrystone benchmark because he reason! ably viewed those optimisations as typical of real-world programming. However. version 1 of the Dhrystone benchmark did not print or use the results of its computations allowing properly written optimizing compilers using dead-code—removal algorithms to optimize away almost all of the benchmark by removing the benchmark program’s computations. The results were not used. which the optimizers detected. Clearly. such optimisations go far beyond the spirit of benchmarking. Weicker corrected this problem by publishing version 2.1 of the Dhrystone benchmark in 1988 [91. At the same time. he acknowledged some of the bench- mark's limitations and published his criteria for using the Dhrystone benchmarks to compare processors [10].

Curiously. DEC’s VAX also plays a role in the saga of the Dhrystene benchmarks. 'lhe VAX 11f780 minicemputer could run 1757 version 2.1 Dhrystenesisec. Because the VAX 11i78fl was (erroneously) considered a l MIPS computer. it became the Dhrystone standard machine. which resulted in the emergence of the DMIPS or [III-MIPS (Dhrystene MIPS) rating. By dividing a processor's Dhrystene 2.1 by 1757. processor vendors could produce an oili- cial-looking DMIPS rating for their products. Thus. DMIPS manages to fuse two questionable rating systems ( and VAX MIPS) to create a third. derivative. equally questionable microprocessor—rating system. The Dhrystene benchmarks early success as a marketing tool further encouraged abuse. which became rampant. For example. vendors quickly (though unevenly] added Dh rystone-optimized string routines written in machine code to some of their compiler libraries because accelerating these heavily used library routines boosted Dhrystone ratings. Actual processor performance did not change at all but the Dhrystone ratings suddenly improved. Some compilers started inslining these machine-coded string routines for even better performance ratings. As the technical marketing teams at the microprocessor vendors continued to study the benchmark. they found increasingly better ways of improving their products' ratings by intre-E ducing compiler optimizations that only applied to the benchmark. Some compilers even had- pattern recognizers that could recognise the Dhrystene benchmark source code. or there was a special Dh rystone command-line switch that caused the compiler to generate the entire program using a. previously prewritten. precempiled. hand-optimized version of the Dhrystone bench. mark program. it is not even necessary to alter a compiler to produce wildly varying Dhrystene results using the same microprocessor. Using different compiler optimization settings can drastically alter the outcome of a benchmark test. even if the compiler has not been Dhrystone optimized. For example. Bryan Fletcher of Memec taught a class titled “FPGA Embedded Processors: Revealing True System Performs rice" at the 2005 Embedded Systems Conference in San Fra ncisce where he showed that the Xilinx MicreBIaae soft core processor produced Dhrystene benchmark results that difl’er by almost 9:1 in terms of DMiPSiMHa depending on con figuration and compiler set- tings as shown in Table 10.2 [11]. As Fletcher's results show. compiler optimizations and processor con figuration can alter a pre— cessor’s Dhrystene performance results significantly (9:1 variation in this example alone}. These results clearly delineate a problem with using Dhrystene as a benchmark: there's no organization to enforce Weicker's published rules for using the Dhrystene benchmark. so there is no way to ensure that processor vendors benchmark fairly and there is no way to force vendors to fully dis- close the conditions that produced their benchmark results. if benchmarks are not conducted by objective third parties under controlled conditions that are disclosed along with the benchmark results. the results must always be at least somewhat suspect even if there are published rules for the benchmark's use because no one enforces the rules. Without enforcement, history teaches us that some vendors will not follow rules when the sales of a new microprocessor are at stake as discussed in the following.

TABLE 10.2 Performance "Differences in Kilimr MicroBlazo Dhrystono Benchmark PorfOrmonce Based on Processor Configuration and Compiler Settings

Conditions DMIPS Clock Frequency {MHz} DHIPSIHHI} Initializing optimised system. running from SRi‘tM {3.35 3’5 0.691 Increasing compiler optimisation level to I evel 3 r214 rs ones Adding instruction and data caches 2C).] 3 {it} uses Removing data cache and moving star; is: and smell riots set lions 3%.] E so H.553 to local memory Atlt‘ling remaining dale sections to local memory 4?.81 so 0??? Removing instruction cache. moving large data sections back to r1148 F5 1") 553 SRAM. and moving instructions to local memory Moving entire program to local memory sars rs i1, ror

Over time. ether technical weaknesses have appeared in the Dhrystene benchmark. Fer example. the actual Dhrvstene cede feetprint is quite small. As their instructien caches grew. seine precessers were able to run the Dhrvstene benchmark entireh.r frem their caches. which beested their perfermance en the Dhrvstene benchmark but did net represent real pcrfermance gains because mest applicatien pregrarns weuld net fit entire-1v inte a preces- ser's cache. Table 10.2 demenstrates the sensitivity ef the Dhrvstene benchmark to cache and memer}r size. Precessers with wider data buses and wide-«leadi’stere instructiens that meve manv string bytes per bus transactien alse earn better ,Dhrvstene perfermance rat- ings. Altheugh this instructien behavier alse benefits applicatien cede. it can mislead the unwary when think that the benchmark is cemparing the relative cemputatienal perfermance df precessers. One efthe wtirst abuses {if the Dhrvstene benchmark eccu rred when precesser venders ran benchmarks en their ewn evaluatien beards and en cempetiters’ evaluatien beards. and then published the results. The vender's ewn beard weuld have fast memerv and the cempetiters’ beards weuld have slew memerv. but this difference weuld net be revealed when the sceres were published. l’)iti"erences in memerv speed de affect Dhrvstene results, an publishing the Dhrvstene ; results ef ceinpeting processors witheut alse disclesing the use of dissimilar memerv systems in cempa re the precessers is clearly dishenest, er at least mean spirited. Other serts ef shenanigans were used te distert Dhrvstene cemparisens as well. One at the Dhrvstene benchmark's biggest weaknesses is therefere the absence ef a geverning bedy er an ebiective. third party that can review. appreve. and publish Dhrystene benchmark results. Despite all ef these weaknesses. micreprecesser and precesser cere venders centinue tn publish Dhrystene benchmark results—er at least they have Dhrvstene results handily,r available—because Dh rystene ratings are relatively easy te generate and the benchmark cede is very well understeed after three decades ef use. Even with its flaws. micreprecesser and precesser cere venders keep Dhrvstene ratings handy because the lack ef such ratings might raise- a red flag fer seme prespcctive custemcrs. (“Why wee": they tel! me? The numbers resist he bed.") Here is a stated epinien [rem ARM Applicatien Nete 273 [12], ”Dhrystene Benchmarking fer ARM Certes Precessers”:

Tedav. Dh rvstene is en lenger seen as a representative benchmark and ARM dees net recem mend using it. in general. it has been replaced bv mere cemples precesser benchmarks such as SPEC and (Enters-dark.

Dhrystene’s esplesive pepularity as a tent fer making precesser cemparisens ceupled with its technical weaknesses and the abuses heaped upen the benchmark pregram helped te spur the further e‘velutien ef precesser benchmarks in the 19905, preducing new benchn‘iark pre- grams including SPEC and CtJI‘EMHI‘k mentiened in the ARM applicatien nete queted earlier. Sectien 10.7 (if this chapter discusses several newer precesser benchmarks including SPEC and Embedded Micreprecesser Benchmark Censertium (EEMBC) CereMark.

10.6.5 EDN MICROPROCESSOR BENCHMARKS

EDN magaaine, a trade publicatien fer the electrenics industry published Frem 1956 threugh 2015, was a veryr early advecate ef micreprecesser use fer general electrenic system design. Editers at EDN wrete extensively abeut micreprecessers starting with their intreductien in the earl}.r 197th. EDN published the first cemprehensive micreprecesser benchmarking article in 1981. just after the first wave ef cemmercial 16-bit micreprecesser chips had been intreduced [13]. This article cempared the perferma nce efthe feur leading 16-bit micreprecessers ef the day: DEC’s LSHL’23, Intel’s 8086, Meterela's 68000. and Zileg's 2.8000. (The Meterela 68000 micre- precesser actuallyr empleved a 16.132-bit architecture.) The intel, Meterela. and Zileg precessers heavily,r cempeted fer system design wins during the earl}.r 19305 and this article previded design engineers with seme ef the vet}.r first micreprecesser benchmark ratings in be published by an ebjective third party.

This article was written by Jack, E. Hemenway. a consulting editor for EDN. and Robert D- Grappel of MIT's Lincoln Laboratory. Hemenway and Grappel summed up the reason that the industry needs objective benchmark results succinctiy in their article:

Why the need for a benchmark study at all? One sure way to start an argument among computer users is to compare each one's favorite machine with the others. Each machine has strong points and drawbacks, advantages and liabilities, but programmers can get used to one machine and see all the rest as inferior. Manufacturers sometimes don"t help: Advertising and press releases often imply that each new machine is the ultimate in computer technology. Therefore. only a careful. complete and unbiased comparison brings order out of the chaos.

Those sentiments are still true; nothing has changed after more than three decades. The EDN article continues with an excellent description of the difficulties associated with microprocessor benchmarking:

Benchmarking anything as complex as a 16-bit processor is a very difficult task to perform fairly. The choice of benchmark programs can stronglyIr affect the comparisons' outcome so the benchmarker must choose the test cases with care.

These words also remain accurate after more than 30 years. Hemenway and Grappel used a set of benchmark programs created in 1976 by a Carnegie— Meilon University {CMU} research group. These benchmarks were first published by the CMU group in 1977 in a paper at the National Computer Conference [14]. The benchmark tests— interrupt handlers. string searches. bit manipulation. and sorting—are very representative oi the tasks microprocessors and processor cores must perform in most embedded applications. The EDN authors excluded the CMU benchmarks that tested floating-point computational per. formance because none of the processors in the EDN study had native floating-point resources. The seven programs in the 1981 EDN benchmark study appear in Table 10.3. Significantly. the authors of this article allowed programmers from each microprocessor ven— dor to code each benchmark for their company's processor. Programmers were required to use assembly language to code the benchmarks. which eliminated compiler inefficiencies from the per formance results and allowed the processor architectures to compete directly. Considering the rela- tively immature state of microprocessor compilers in 1931. this was probably an excellent decision. The authors refereed the benchmark tests by reviewing each program to ensure that no corners were cut and that the programs faithfully executed each benchmark algorithm. The authors did not force the programmers to code in a specific way and allowed the use of special instructions {such as instructions designed to perform string manipulation) because using such instructions fairly represented actual processor use. The study's authors ran the programs and recorded the benchmark results to ensure fairness- They published the results including the program size and execution speed. Publishing the pro- gram siae recognized that memory was costly and limited. a situation that still holds true today in IC design. The authors published the scores of each of the seven benchmark tests separately. acknowledging that a combined score would tend to mask important information about a process sor's specific abilities.

TABLE 10.3 Programs in the 1931 EDN Benchmark Article by Hemenway and {Grappel

Component Benchmark Description

Benchmark A Simpie priority Interrupt handler Benchmark B FIFO interrupt handler Benchmark E Test string search Benchmark F Primitive bit manipulation Benchmark H Linked-list insertion (test addressing modes and 32—bit operations] Bent lirnaik I Quirksori algorithm {tests stack manipulation and addressing modes} Rentimiart if. Matrix transposition {tests 'oit manipuiation and looping)

TABLE 10.4 Twelve Programs in EDN's 1988 DSP Benchmark Article

Program Benchmark Description Benchmark 1 Eli-tap FIR filter Benchmark 2 Est-tap FIR filter Benchmark 3 elf-tap FER filter Benchmark 4 B-poie canonic IIR filter tax] {direct form Iii Benchmark 5 deoie canonic IIR filter {fir-t] {direct form ll] Benchmark 5 B—pole transpose llFi filter [direct form I] Benchmark F Dot product Benci‘irnark 8 Matrix multiplication, 2 a: 2 times 2 it 2 Bent hmark 9 Matrix. multiplication, 3 a: 3 times 3 it 'I Henchmark to Complex fire-point l l l (radical?) Benchmark ii Complex ESE-point FFl {radix-2) benchmark 12 Complex iflE4-point FFT (radix—2)

EDN published a similar benchmarking article in 1988 covering DSPs [15}. The article was writ— = ten by David Shear. one of EDN’s regional editors. Shear benchmarked 18 DSPs using 12 benchmark programs selected from 15 candidate programs. The DSP vendors participated in the selection of the 12 DSP benchmark programs used from the 15 candidates. The final dozen DSP benchmark programs included six DSP filters. three math benchmarks (a simple dot product and two matrix multiplications), and three fast Fourier transforms (FFTs). The benchmark programs that Shear used to compare DSPs differed substantially from the programs used by Grappel and Hemenway to compare general-purpose microprocessors because. as a class, DSPs are applied quite differently than general-purpose processors. Shear's benchmark set of 12 DSP programs appear in Table 10.4. Notably, Shear elected to present the benchmark performance results (both execution speed and performance) as bar charts instead of in tabular form. noting that the bar charts emphasized large performance difl'erences while masking differences of a few percent, which he dismissed as “not important.” Shear's EDN article also recognized the potential for cheating by repeating the "Th ree Not-So- golden Rules of Benchmarking" that had appeared in 1931. EDN Editor Walt Patstone wrote and published these rules as a follow-up to the Hemenway and Grappel article [16]:

l. Rule 1': All’s fair in love. war. and benchmarking. l Rule 2: Good code is the fastest possible code. I Rule 3: Conditions. cautions. relevant discussion. and even actual code never make it to the bottom line when results are summarized.

These two EDN microprocessor benchmark articles established and demonstrated all of the characteristics of an effective, modern microprocessor benchmark:

l Conduct a series of benchmark tests that exercise relevant processor features for a class oftasks. Use benchmark programs that are appropriate to the processor class being studied. Allow experts to code the benchmark programs. I Have an objective third-party check and run the benchmark code to ensure that the vendors and their experts do not cheat. l Publish benchmark results that maximize the amount of available information about the tested processor. l Publish both execution speed and memory use for each processor on each benchmark program because there is always a traderoff between a processor's execution speed and its memory consumption.

These characteristics shaped the benchmarking organizations and their benchmark programs during the 19905. 10.7 MODERN PROCESSOR PERFORMANCE BENCHMARKS

As the number of available microprocessors mushroomed in the 1930s, the need for good bench- marking standards became increasingly apparent. The Dhrystone benchmarking experiences showed how useful a standard benchmark could be and these experiences also demonstrated the lengths (both fair and unfair) to which precesser vendors would go to ea rn top benchmark scores. The EDN articles demonstrated that an objective third party could bring order out of bench- marking chaos. As a result, both private companies and industry consortia stepped forward with the goal of producing fair and balanced processor benchmarking standards.

10.7.1 SPEC: THE STANDARD PERFORMANCE EVALUATION CORPORATION

One of the first organizations to seriously tackle the need for good microprocessor benchmarking standards was SPEC {www.specorg). (Originally, the abbreviation SPEC stood fer the "System Performance Evaluation Cooperative." It new stands fer the “Standard Performance Evaluation Corporation"). SPEC was founded in 1988 by a group of workstation vendors including Apollo, Hewlett-Packard, MIPS Computer Systems, and Sun Microsystems working in conjunction with the trade publication Electronic Engineering Times. One year later, the cooperative produced its first processor benchmark, SPECBQ, which was a standardized measure of compute-intensive processor performance with the express purpose of replacing the existing, vague MIPS, and MFLOPS rating systems. Because high-performance microprocessors were primarily used in high-end workstations at the time and because SPEC was formed by workstation vendors as a cooperative, the SPEC89 bench mark consisted of source cede that was to be compiled for the UNIX. operating system [05). At the time, UNIX was strictly a workstation OS. The open-source version of UNIX named would net appear for another 2 years and it would net become a popular embedded 03 until the next millennium. The Dhrystone benchmark had already demonstrated that benchmark cede quickly rots over time due to rapid advances in processor architecture and compiler technology. (it is no small coincidence that Reinhold Weicker, Dhrystone’s creator and one of the people most familiar with processor benchmarking, became a key member of SPEC-) The likelihood of benchmark rot “was no different for SPECSQ. so the SPEC organization has regularly improved and expanded its benchmarks, producing SPECQZ, SPEC95 (with separate integer and floating-point compo- nents called CINT95 and CFPQS), SPEC CPU2000 {consisting of C INT2000 and (217132000), SPEC CPUin, and finally SPEC CPU2006 (consisting of CINT2006 and CFP2006). Here is what the SPEC website says about SPEC CPU2006 as of mid-2014:

CPUZDGfi is SPEC's next-generation, industry-standardized, CPU-intensive benchmark suite, stress- ing a system’s processor. memory subsystem, and compiler. SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hard- ware using workloads developed from real. user applications. "II-lose benchmarks are provided as source code and require the user to be comfortable using compiler commands as well as other com- mands via a command interpreter using a console or com mand prompt window in order to generate executable binaries- The current version of the benchmark suite is V1.2, released in September 2011; submissions using V1.1 are only accepted through December 19, 2011.

Tables 10.5 and 10.6 list the SPEC CINT2006 and CFP2006 benchmark component programs, respectively. As Tables 10.5 and 10.6 show, SPEC benchmarks are applicationabased bench marks, not syn: thetic bench marks. The SPEC benchmark components are excellent as workstationiserver bench- marks because they use actual applications that are likely to be assigned to these machines. SPEC publishes benchmark performance results for various computer systems on its website and sells its benchmark code. As of 2014, the CPU2006 benchmark suite costs $300 ($200 for nonprofit use). A price list for the CPU2006 benchmarks as well as a wide range of other computer bench- marks appears on the SPEC website. Because the highvperformance microprocessors used in workstations and servers are some; times used as embedded processors and some of them are available as processor cores for use

TABLE 1 0.5 SPEC CINTZWfi Benchmark Component Programs Component Language Category

400.perlbench (“’1 A set of work tools written in the Perl programming language 4U] .tirip? C Compression 403.gcc C; C programming language compiler 429mg C Combinatorial optimization. vehicle scheduling 445gobmit C Game playing: go 4Sohrnrnr C Protein sequence a‘nalvsis using hidden Markov models 453.sjeng C Artificial intelligence: chess 46?-iibquantum C Quantum computer simulation cesarean C Video compression using bled encoder art -omnetpp C++ Discrete event simulator that emulates an Ethernet network 4?3.astar C ++ Pathfinding library for 20 maps 433.:calancbrnk C++ XML document processing

TfiBLE 10.6 SPEC CFPZOOB Benchmark Component Programs

Component Language Category

41 {lbwaves Fortran Computes 3D transonic transient laminar viscous flow 41 ogamess Fortran Quantum chemical computations 433milc C Gauge field generator for lattice gauge theory dirtiaeusmp Fortran Computational fluid dynamics for astrophysical simulation 4 35.gromacs C. l-ortran Molecular dynamics simulation 436.cactusADM C. l ortran Linstein evolution equation solver using numerical methods 43?.leslie3d Fortran Computational fluid dynamics using large Edd}.r simulations 444.namd C++ Large biomolecular simulator 44?.clealll {2+ 4- Helmholtz—type equation solver with nonconstant coefficients 45o.soplerr C++ Linear program solver using simpler: algorithm 453.povrav C++ image rendering 454.calculia C. fortran Finite element solver for linear and nonlinear 3D structures 459.5emsFDTD Fortran 3D solver for Maxwell's equations 465.tonto Fortran IClluantum chemistry Helium C Incompressible fluid simulation 431.wrf C. Fortran Weather modeling eflzsphinfl C Speech recognition

on SoCs. microprocessor and processor core vendors sometimes quote SPEC bench mark scores for their products. You should use these performance ratings with, caution because the SPEC benchmarks are not necessarily measuring performance that is meaningful to embedded appli- cations. In addition, memoryr and storage subsystem performance willi significantly affect the results of these benchmarks. No mobile phone is likelyr to be required to simulate seismic-wave propagation. except for the possible exception of handsets sold in California, and the memory and storage subsystems in mobile phones do not resemble the ones used in workstations. A paper written by Jakob Engblom at Uppsala University compared the static properties of the SPECint95 benchmark programs with code from 13 embedded applications consisting of 334.600 lines of C source code {1?}. Engbiom's static analysis discovered several significant differ* ences between the static properties of the SPECint‘QS benchmark and the 13 embedded programs. Noted differences include the following:

I Variable sizes: Embedded programs carefully control variable size to minimize memory usage. Workstation-oriented software like SPECint95 does not limit variable size nearly as much because workstation memory?r is roomy by comparison. Unsigned data are more common in embedded code. I Logical (as opposed to arithmetic) operations occur more frequently in embedded code. l Many embedded functions only perform side effects (such as flipping an HO bit}. 'Ihey do not return values. I Embedded code employs global data more frequently for variables and for large constant data. I Embedded programs rarely use dynamic mentor}r allocation.

Although Engbiom's intent in writing this paper was to show that SPECint95 code was not suit able for benchmarking embedded development tools such as compilers. his observations abou the characteristic differences between SPECint95 code and typical embedded application codr are also significant for embedded processors and processor cores. Engblom's observations under score the maxim that the best benchmark code is always the actual target application code.

10.12 BERKELEY DESIGN TECHNOLOGY, INC.

Berkeley Design Technology. Inc. (BDTI) is a technical services company that has focused exclu- sivel}.r on the applications of DSP processors since 1991. BDTI helps companies select. develop- and use D’SP technology by providing expert advice and consulting, technology,r analysis. an: highly optimized software development services. As part of those services. BDTI has spent men than a decade developing and applying DSP benchmarking tools. As such. BDTI serves as a pri vate third party that develops and administers DSF' benchmarks. BDTI introduced its core suite of DSP benchmarks. formally called the "BDTI Benchmarks. in "1994. The BDTI Benchmarks consist of a suite of 12 algorithm kernels that represent kev D5] operations used in common DSP applications. BDTI revised. expanded. and published informa tion about the 12 DSP algorithm kernels in its BDTI Benchmark in 1999 [18]. The 12 DSP kernel: in the BDTI Benchmark appear in Table 10.1 BDTI’s DSP benchmark suite is an example of a specialized processor benchmark. DSPs are no general-purpose processors and therefore require Specialized benchmarks. BDTI Benchmark; are not full applications as are SPEC benchmark components. The BDTI Benchmark is a hybrit benchmark that uses code (kernels) extracted from actual DSP applications.

TABLE 10.7 BDTI Benchmark Kernels

Function Function Description Example Applications

Real block FIR FIR filter that operates on a block of real inot Speech processing (eg. (3323 speech coding complex} data Complex block FIR FIR filtev that operates on a block of complex data Modem channel equalization Real single—sample FIR FIR filter that operates on a. single sample of real Speech processing and general filtering data LMS adaptive FIR Least—mean—square adaptive filter that operates on Channel equalization, servo control. and [inea a single sample of real data predictive coding IIR IIR filter that operates on a single sample of real data Audio processing and general filtering Vector dot product Sum of the point-wise multiplication of two vectors Convolution. correlation, matrix multiplicatior and multidimensional signal processing Vector add Point-wise addition of two vectors. producing a Graphics. combining audio signals or images third vector Vector maximum Find the value and location of the maximum vaiue in Frror control coding and algorithms using a vector block floating point interhi decoder Decode a block of bits that have been Error control coding convolutionallv encoded Control a sequence of control operations ttest. branch. Virtuaiiv all [J'SF' applications including some push, pop. and bit manipulation} control code ESE-point. in-place H i that converts a time domain signal to the Radar. sonar. MPtG audio compression. and FFT frequency domain spectral analysis Bit unpac k Unpacks variable—length data from a bit stream Audio decompression and protocol handling

BD'l'l calls its benchmarking methodological approach “algorithm kernel benchmarking and application profiling." The algorithm kernels used in the BDTl Benchmarks are functions that constitute the building blocks used by most signal processing applications. BDTI extracted these kernels from DSP application code. The kernels are the most computationally intensive portions of the donor DSP applications. They are written in assembly code and therefore must be hand ported to each new DSP architecture. BDTI believes that the extracted kernel code is more rig— orously defined than the large DSP applications. such as a V90 modem or a Dolby A03 audio codec. which perform many functions in addition to the core DSP algorithms. The application-profiling portion of the BDTI Benchmark methodology refers to a set of techniques used by BDTI to either measure or estimate the amount of time. memory. and other resources that an application spends executing various code subsections. including subsections that correspond to the DSP kernels in the BDTI Benchmark. BDTI develops kernel-usage esti- mates from a variety of information sources including application-code inspections. instru- mented codesrun simulations. and flowvdiagram analysis. BDTI originally targeted its benchmark methodology at actual DSP processor chips. not pro- cessor cores. The key portion of the benchmark methodology that limited its use to processors and not cores was the need to conduct the benchmark tests at actual clock speeds because BDTI did not want processor manufacturers to test this year’s processors and then extrapolate the: results to next year’s clock speeds. However. ASIC and 30C designers must evaluate and com- pare processor core performance long before a chip has been fabricated. Processor core clock speeds will depend on the 1C fabrication technology and cell libraries used to create the 1C. Consequently. BDTI adapted its benchmarking methodology to evaluate processor cores. BDTI uses results from a cycle-accurate simulator and worst-case projected clock speeds (based on gate-level processor core simulation and BDTI's evaluation of the realism of that projection) to obtain benchmark results from processor cores. BDTI does not release the BDTI Benchmark code. Instead. it works with processor vendors (for a fee} to port the benchmark code to new processor architectures. BDTI acts as the objective third party; conducts the benchmark tests; and measures execution time. memory usage. and energy consumption for each benchmark kernel. BDTI rolls a processor's execution times on all 12 DSP kernels into one composite num- ber that it has dubbed the BDTImarkIZUflU to satisfy people who prefer using one number to rank candidate processors. To prevent the mixing of processor benchmark scores verified with hardware and simulated processor core scores. BDTI publishes the results of simulated bench: mark scores under a difierent name: the BDTIsimMarkZDDD. Both the BDTlmarkZUDO and the BDTIsimMarkZUOO scores are available without charge on. BDTI's website. BDTI publishes many of the results on its website (www.BDTl.com) and the company sold some of the results of its benchmark tests in reports on specific DSP processors and in a book called the “Buyers Guide to DSP Processors." As of mid 2014. the company is no longer actively selling industry reports. However. the company does still otter the BDTI Bench mark Information Service [www.bdti.comiServicesi/BenchmarkslBIS).

10.7.3 EMBEDDED MICROPHOCESSDR BENCHMARK CONSORTIUM

EDN magazine's legacy of microprocessor benchmark articles grew into full’blown realiza- tion when EDN editor Markus Levy founded the nonprofit. embedded-benchmarking organi- zation called EEMBC (www.cembcorg). EEMBC stands for the EDN Embedded Benchmark Consortium. which dropped the “EDN” from its name but not the corresponding “E" from its abbreviation in 1997. EEMBC's stated goal was to produce accurate and reliable metrics based on realsworld embedded applications for evaluating embedded processor performance. Levy drew remarkably broad industry support from microprocessor and DSP vendors for his con- cept of a benchmarking consortium. which had picked up 21 founding corporate members by the end of 1993: Advanced Micro Devices. . ARC. ARM. Hitachi. IBM. IDT. Lucent Technologies. Matsushita. MIPS. Mitsubishi Electric. Motorola. National Semiconductor. NEC. Philips. QED. Siemens. STMicroelectronics. Sun Microelectronics. . and 'I'oshiba [19]. EEMBC (pronounced ”embassy") spent nearly 3 years working on a suite of

benchmarks for testing embedded microprocessors and introduced its first benchmark suite at the Embedded Processor Forum in 1999. EEMBC released its first certified scores in 2000 and, during the same year, announced that it would start to certify simulation-based benchmark scores so that processor cores could be benchmarked. As of 2014, EEMBC has nearly 50 corpo- rate members, 100 commercial licensees, and more than 120 university licensees. The current EEMBC processor benchmarks are contained in 10 suites loosely grouped accord- ing to application:

I CoreMnrk: EEMBC CoreMark [20] is a simple, yet sophisticated, benchmark that is designed specifically to test the functionality of a processor core. Corel'viark consists of one integer workload with four small functions written in easyvtovunderstand ANSI C code with a realistic mixture of readfwrite operations, integer operations, and con- trol operations. CoreMark has a total binary size of no more than 16 k3 when com- piled with the gcc compiler on an x36 machine. This small size makes CoreMark more convenient to run using simulation tools. Unlike EEMBC’s primary benchmark suites, CoreMark is not based on any real application, but the workload consists of several commonly used algorithms that include matrix manipulation to exercise hardware multiplierr'accumulators (MACs) and common math operations, linked-list manipula- tion to exercise the common use of pointers. state machine operation to exercise data- dependent branch instructions, and CRC (cyclic redundancy check). CoreMark is not system dependent. It exhibits the same behavior regardless of the platform {bigflittle endian, high-end or low-end processor). Running CoreMark produces a single-number score for making quick comparisons between processors. The EEMBC CoreMark was specifically designed to prevent the kind of cheating that took place with the Dhrystone bench mark. The CoreMark work cannot be optimized away or there will be no results. Furthermore. Corelviark does not use special libraries that can be artificially manipu- lated and it makes no library calls from within the timed portion of the benchmark, so special benchmark-optimized compiler libraries will not artificially boost a processor's Corelvlark score. I CoreMerk—HPC: Released by EEMBC in 2014, the CoreMark-HPC is a more advanced, wider scope benchmark that includes five integer and four floating-point workloads. with data set sizes that range from 42 kB to more than 3 MB (per context when. used in a multicore implementation). I Ultrefow-Power (UL?) Microconrrollers: 'I'he EEMBC ULPBench-Coref’rofile performs a variety of functions commonly found in ULP applications; among them are memory and math operations. sorting, GPIO interaction. etc. Unlike the other EEMBC performance benchmarks, the goal of ULPBench is to measure energy efficiency. To go along with benchmark software of ULPBench, EEM BC developed its EnergyMonitor, an inexpensive, yet highly accurate energy measuring device. Using ULPBench, the microcontroller wakes up once per second and performs the given workload of approxi- mately 20,000 cycles. then goes back to sleep. Hence, ULPBench is a measure of the energy consumed over a 1 s duty cycle, allowing to take advantage of their low power modes. I Floating Point: Many embedded applications including audio, DSPi’math, graphics. auto- motive, and motor control employ floating-point arithmetic. in the same way that the EEMBC Corel'viark benchmark is a "better Dhrystone,” the EEMB‘C FPMark provides a better embedded floating-point benchmark than Whetstone and LINPACK, which are not really embedded benchmarks. Uniquely, the FPMark provides a combination of single- and. double-precision workloads, as well as a mixture of small, medium, and large data sets; this makes this benchmark useful for testing the range of low-end microcon- trollers to high-end processors. l Medicare: EEM BC MultiBench is a suite of embedded benchmarks that allows processor and system designers to analyze, test, and improve multicore architectures and, platforms. 
lt leverages EEMBC's proven library of application-focused benchmarks in hundreds of workload combinations using a -based API to establish a common programming model. 'lhe suite’s individual benchmarks target three forms of concurrency:

I Data decomposition: Allows multiple threads to cooperate on achieving a unified goal and demonstrates a processor's support for fine- grained parallelism I Processing ofmultiplewdata streams: Uses common code running on multiple threads to demonstrate how well a multicore solution scales with multiple-data inputs I Multiple workload processing: Shows the scalability of a solution for general-purpose processing and demonstrates concurrency over both code and data I Automotive industrial: EEM BC AutoBench is a suite of benchmarks that predict the per- formance of microprocessors and microcontrollers in automotive and industrial applica- tions. but because of the diverse algorithms contained in this suite. it is more commonly used as a general—purpose benchmark. Its 16 benchmark kernels include the following: I Generic Workload Tests: These tests include bit manipulation. matrix mapping. a specific floating-point tester. a cache buster. pointer chasing. pulse—width modula- tion. multiplication. and shift operations (typical of encryption algorithms}. I Basic Automotive Algorithms: These tests include controller area network (CAN). tooth to spark (locating the engine’s cog when the spark is ignited). angle-to-time conversion. road speed calculation. and table lookup and interpolation. I Signal Processing Algorithms: These tests include sensor algorithms used for engine- knock detection. vehicle stability control. and occupant safety systems. The algorithms in this group include FFTs and inverse FFTs (iFFT). finite impulse response (FIR) filter. inverse discrete cosine transform (iDCT). and infinite impulse response (1le filter. I Digital Entertainment: The EEMBC DENBench is a suite of benchmarks that approx- imates the performance of processor subsystems in multimedia tasks such as image. video. and audio file compression and decompression. Other benchmarks in the suite focus on encryption and decryption algorithms commonly used in digital rights man- agement and eCommerce applications. The DEN Bench components include the follow- ing algorithms and minisuites: I MPEG: Includes MP3 decode. MPEG-2 encode and decode. and MPEG-4 encode and decode. each of which are applied to five different data sets for a total of 25 results I Cryptography: A collection of four benchmark, tests for common cryptographic standards and algorithms including Advanced Encryption Standard (AE5). Data Encryption Standard. Rivest—Shamir—Adleman algorithm for public-key cryptogra-v phy. and Hufiman decoding for data decompression I Digital image Processing: IPEG compression and colorsspace-conversion tests includ- ing [PEG compress. IPEG decompress. R63 to YIQ. R63 to CMYK. and R63 to HPG. Seven difi’erent data sets are applied to each of these tests producing a total of 35 results I MPEG Encode Floating Point: A floating-point version of the M PEG-2 Encode bench-4 mark. using single-precision floating-point arithmetic instead of integer functions I Networking: A suite of benchmarks that allow users to approximate the performance of processors tasked with moving packets in networking applications. The suite’s seven benchmark kernels include the following: I JP Packet Check: Tests the correct processing of IP packets. The first step is to vali— date the lP header information of all packets. The RFC1312 standard defines the requirements for packet checks carried out by IP routers and EEMBC created its Packet Check benchmark to model a subset of the 1P header validation work speci- fied in that standard. 
  • IP Reassembly: Packets are often fragmented before being transmitted over the Internet from one part of the network to another and then reassembled upon arrival. The IP Reassembly benchmark measures processor performance when reconstructing these disjointed packets.
  • IP Network Address Translator (NAT): Follows NAT rules to rewrite the IP addresses and port numbers of packets. Because rewriting each packet modifies its source IP address and the port chosen by the algorithm, the NAT benchmark simulates an important part of network processing for many router designs.
  • Route Lookup: Receives and forwards IP packets using a mechanism commonly applied in commercial network routers employing a Patricia tree data structure, which is a compact binary tree that allows fast and efficient searches with long or unbounded-length strings. The benchmark monitors the processor's ability to check the tree for the presence of a valid route and measures the time needed to walk through the tree to find the destination node for packet forwarding.
  • Open Shortest Path First (OSPF): Implements the Dijkstra shortest-path-first algorithm, which is widely used in routers and other networking equipment.
  • Quality of Service (QoS): Transporting voice, video, and multimedia packets presents a greater challenge than transporting simple text and files because packet timing and order are critical for smooth multimedia playback. QoS processing tests measure data transfer and error rates to ensure that they are suitable to support such applications.
  • TCP: EEMBC's TCP benchmark is designed to reflect performance in three different network scenarios. The first component, Gigabit Ethernet (TCP Jumbo), represents the likely workload of Internet backbone equipment using large packet transfers. The second (TCP Bulk) concentrates on large transfers of packets using protocols such as FTP. The last component (TCP Mixed) focuses on the relay of mixed traffic types, including Telnet, FTP, and HTTP. The benchmark processes all of the packet queues through a server task, network channel, and client task. Simulating the data transfers through the connections reveals how the processor will realistically cope with various forms of TCP-based traffic.
• Text and Image Processing: EEMBC OABench approximates the performance of processors in printers, plotters, and other office automation systems that handle text and image processing tasks. Its five benchmark kernels include the following:
  • Bezier: Benchmarks the classic Bezier curve algorithm by interpolating a set of points defined by the four points of a Bezier curve.
  • Dithering: Uses the Floyd-Steinberg error diffusion dithering algorithm.
  • Ghostscript: Provides an indication of the potential performance of an embedded processor running a PostScript printer engine. OABench can be run with or without this benchmark.
  • Image Rotation: Uses a bitmap rotation algorithm to perform a clockwise 90° rotation on a binary image.
  • Text Parsing: Parses Boolean expressions made up of text strings and tests bit manipulation, comparison, and indirect reference capabilities.
• Telecomm (DSP): A suite of benchmarks that allows users to approximate the performance of processors in modem and related fixed-telecom applications. Its benchmark kernels include representative DSP algorithms:
  • Autocorrelation: A mathematical tool used frequently in signal processing for analyzing functions or series of values, such as time-domain signals. Produces scores from three different data sets: pulse, sine, and speech.
  • Convolutional Encoder: Supports a type of error-correcting code based on an algorithm often used to improve the performance of digital radio, mobile phones, satellite links, and Bluetooth implementations.
  • Bit Allocation: Tests the target processor's ability to spread a stream of data over a series of buffers (or frequency bins) and then modulate and transmit these buffered streams on a telephone line in a simulated ADSL application.
  • iFFT: Tests the target processor's ability to convert frequency-domain data into time-domain data.
  • FFT: Tests the target processor's ability to convert time-domain data into frequency-domain data.
  • Viterbi Decoder: Tests the processor's ability to recover an output data packet from an encoded input data packet in embedded IS-136 channel coding applications.
• Consumer: The EEMBC ConsumerBench is a suite of benchmarks that approximates the performance of processors in digital still cameras, printers, and other embedded systems that handle digital imaging tasks. Its four benchmark kernels include the following:
  • Image Compression and Decompression: Consists of industry-standard JPEG compression and decompression algorithms. The EEMBC benchmarks provide standardization of the JPEG options as well as the input data set to ensure level comparisons.
  • Color Filtering and Conversion: Tests include a high-pass grayscale filter as well as RGB-to-CMYK and RGB-to-YIQ color conversions.
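Returning to the Networking suite, the IP Packet Check kernel models the header-validation work that RFC 1812 requires of routers. The C sketch below shows what that kind of check looks like: it verifies an IPv4 header's version and length fields and recomputes the ones'-complement header checksum. It is only an illustration of the technique; the function names, the flat byte-array interface, and the particular checks shown are assumptions made for this example, not EEMBC benchmark code.

#include <stddef.h>
#include <stdint.h>

/* Minimal IPv4 header sanity check in the spirit of the RFC 1812 router
   requirements. Illustrative sketch only; not EEMBC benchmark code. */
static uint16_t ipv4_header_sum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)            /* add 16-bit words */
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                                   /* fold carries */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;                               /* ones'-complement sum */
}

int ipv4_header_ok(const uint8_t *hdr, size_t caplen)
{
    if (caplen < 20) return 0;                          /* shorter than a minimal header */
    if ((hdr[0] >> 4) != 4) return 0;                   /* version field must be 4 */
    size_t ihl = (size_t)(hdr[0] & 0x0Fu) * 4;          /* header length in bytes */
    if (ihl < 20 || ihl > caplen) return 0;
    size_t total_len = (size_t)((hdr[2] << 8) | hdr[3]);
    if (total_len < ihl) return 0;                      /* total length must cover the header */
    return ipv4_header_sum(hdr, ihl) == 0xFFFFu;        /* valid headers sum to all ones */
}

In the actual benchmark, a test harness streams many packets through checks of this kind and times the work.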

EEMBC's processor benchmark suites, with their individual benchmark programs, allow designers to select the benchmarks that are relevant to a specific design rather than lumping all of the benchmark results into one number [21,22]. However, many companies use all of the EEMBC processor benchmark suites combined to provide the most comprehensive level of testing for balancing instruction-type distribution, memory footprint, cache misses, branch prediction, and instruction-level parallelism (ILP). EEMBC's benchmark suites are developed by separate subcommittees, each working on one application segment. Each subcommittee selects candidate applications that represent the application segment and dissects each application for the key kernel code that performs the important work. This kernel code, coupled with a test harness, becomes the benchmark.

Each benchmark has published guidelines in an attempt to force the processor vendors to play fair. However, the industry's Dhrystone experience proved conclusively that some processor vendors would not play fair without a fair and impartial referee, so EEMBC created one: the EEMBC Technology Center (ETC). Just as major sports leagues hire referee organizations to conduct games and enforce the rules of play, EEMBC's ETC conducts benchmark tests and enforces the rules of EEMBC benchmarking to produce certified EEMBC scores. The ETC is also responsible for verifying the exact environment under which each test was run; this includes processor frequency, wait states, and compiler version and switches. The EEMBC website lists certified and uncertified scores, but EEMBC only guarantees the reliability of scores that have been officially certified by the ETC. During its certification process, the ETC reestablishes the manufacturer's benchmark environment, verifies all settings, rebuilds the executable, and runs CoreMark according to the specific run rules. EEMBC certification ensures that scores are repeatable, accurate, obtained fairly, and derived according to EEMBC's rules. Scores for devices that have been tested and certified can be searched from EEMBC's CoreMark Benchmark search page.

EEMBC publishes benchmark scores on its website (www.eembc.org) and the organization's work is ongoing. To prevent benchmark rot, EEMBC's subcommittees constantly evaluate revisions to the benchmark suites and have added benchmarks, including EEMBC's system-level benchmarks, which include the following:

• AndEBench and AndEBench-Pro: An industry-accepted method of evaluating Android hardware and platform performance. These benchmarks are a free download in the Google Play market. AndEBench is designed to be easy to run with just the push of a button.
• BrowsingBench: Provides a standardized, industry-accepted method of evaluating web browser performance. This benchmark
  • Targets smartphones, netbooks, portable gaming devices, navigation devices, and IP set-top boxes
  • Measures the complete user experience, from the click/touch on a URL to the final page rendered and scrolled on the screen
  • Factors in Internet content diversity as well as the various network profiles used to access the Internet
• DPIBench (in development as of mid-2014): Provides a standardized, industry-accepted method of evaluating systems and processors performing deep-packet inspection. This benchmark
  • Targets network security appliances
  • Measures throughput and latency and quantifies the number of flows
  • Provides consumers of firewall technologies with an objective means of selecting a solution from the myriad vendor offerings

Except for CoreMark and CoreMark-HPC, the EEMBC benchmark code can be licensed for nonmember evaluation, analysis, and criticism by industry analysts, journalists, and independent third parties. Such independent evaluations would be unnecessary if all of the EEMBC corporate members tested their processors and published their results. However, fewer than half of the EEMBC corporate members have published any benchmark results for their processors, and few have tested all of their processors [23].

10.7.4 MODERN PROCESSOR BENCHMARKS FROM ACADEMIC SOURCES

The EEMBC CoreMark benchmark has become the de facto industry standard for measuring embedded microprocessor performance, but its use is limited to EEMBC licensees, industry analysts, journalists, and independent third parties. Consequently, two university sources produced freely available benchmarks to give everyone access to nontrivial processor benchmarking code.

10.7.4.1 UCLA'S MEDIABENCH 1.0

The first academic processor benchmark was MediaBench, originally developed by the Computer Science and Electrical Engineering Departments at the University of California at Los Angeles (UCLA) [24]. The MediaBench 1.0 benchmark suite was created to explore compilers' abilities to exploit ILP in processors with very long instruction word (VLIW) and SIMD structures. The MediaBench benchmark suite is aimed at processors and compilers targeting new-media applications, consists of several applications culled from image processing, communications, and DSP applications, and includes a set of input test data files to be used with the benchmark code. Since then, a MediaBench II has appeared, supported by the MediaBench Consortium on a website at Saint Louis University (http://euler.slu.edu/~fritts/mediabench/); however, there does not seem to have been much activity on the site for a while. Applications used in MediaBench II include the following:

• Composite: The composite benchmark suite, MBcomp, will contain 1 or 2 of the most advanced applications from each of the individual media benchmarks (except kernels).
• Video and Image: This image- and video-centric media benchmark suite is composed of the encoders and decoders from six image/video compression standards: JPEG, JPEG-2000, H.263, H.264, MPEG-2, and MPEG-4.
• Audio and Speech: Listed as under development.
• Security: Listed as under development.
• Graphics: Listed as under development.
• Analysis: Listed as under development.
• Kernels: Listed as under development.

Benchmarks for the composite and the video and image suites appear in Tables 10.8 and 10.9.

TABLE 10.8 MediaBench II Composite Benchmark Programs

Benchmark     Application Description    Example Applications
H.264         Video codec                Video
JPEG-2000     Image codec                Still images

TABLE 10.9 MediaBench II Video and Image Benchmark Programs

MediaBench II Application    Application Description    Example Applications
H.264/MPEG-4 AVC             Video codec                Digital video cameras and HD and 4K2K displays
MPEG-4 ASP                   MPEG-4 video codec         Digital television and DVD
JPEG-2000                    Image codec                Printers
MPEG-2                       Video codec                Digital cameras and displays
H.263                        Video codec                Videoconferencing
JPEG                         Image codec                Digital cameras, displays, and printers

A comprehensive benchmark suite called MiBench was presented by its developers from the University of Michigan at the IEEE's 4th annual Workshop on Workload Characterization in December 2001 [25]. MiBench intentionally mimics EEMBC's benchmark suite. It includes a set of 35 embedded applications in six application-specific categories: automotive and industrial, consumer devices, office automation, networking, security, and telecommunications. Table 10.10 lists the 35 benchmark programs (plus two repeated benchmark tests) in the six categories. All of the benchmark programs in the MiBench suite are available in standard C source code at http://www.eecs.umich.edu/mibench. The MiBench automotive and industrial benchmark group tests a processor's control abilities. These routines include

• basicmath: Cubic function solving, integer square root, and angle conversions
• bitcount: Counts the number of bits in an integer array (see the sketch after this list)
• qsort: Quicksort algorithm applied to a large string array
• susan: Image recognition program developed to analyze MRI brain images
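To give a sense of how compact these control-oriented kernels are, here is a minimal bit-counting routine in the spirit of bitcount. It is an illustrative sketch only, not the MiBench source, and the function name count_bits is invented for this example.

#include <stddef.h>
#include <stdint.h>

/* Count the set bits in an array of 32-bit words; one of several counting
   strategies a kernel like bitcount exercises. Illustrative sketch only. */
size_t count_bits(const uint32_t *a, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t x = a[i];
        while (x) {          /* Kernighan's method: clear the lowest set bit */
            x &= x - 1;
            total++;
        }
    }
    return total;
}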

MiBench applications in the consumer device category include

• JPEG encode/decode: Still-image codec
• tiff2bw: Converts a color TIFF image into a black-and-white image (a simplified sketch of this kind of conversion appears after this list)
• tiff2rgba: Converts a color image into one formatted in red, green, and blue
• tiffdither: Dithers a black-and-white image to reduce its size and resolution
• tiffmedian: Reduces an image's color palette by taking several color-palette medians
• lame: The LAME MP3 music encoder
• mad: A high-quality audio decoder for the MPEG-1 and MPEG-2 formats
• typeset: A front-end processor for typesetting HTML files
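The tiff2bw-style color-to-monochrome conversion boils down to a weighted per-pixel sum. The sketch below converts an interleaved 8-bit RGB buffer to grayscale using the common BT.601 luma weights; it is a simplified stand-in for illustration, operating on a raw pixel buffer rather than a TIFF file, and the function name rgb_to_gray is invented here.

#include <stddef.h>
#include <stdint.h>

/* Convert interleaved 8-bit RGB pixels to 8-bit grayscale using integer-scaled
   BT.601 luma weights (0.299 R + 0.587 G + 0.114 B). Simplified sketch that
   works on a raw pixel buffer rather than a TIFF file. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t npixels)
{
    for (size_t i = 0; i < npixels; i++) {
        uint32_t r = rgb[3 * i + 0];
        uint32_t g = rgb[3 * i + 1];
        uint32_t b = rgb[3 * i + 2];
        gray[i] = (uint8_t)((299u * r + 587u * g + 114u * b) / 1000u);
    }
}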

Applications in the MiBench office automation category include

• Ghostscript: A PostScript language interpreter minus a graphical user interface
• stringsearch: Searches a string for a specific word or phrase (a minimal sketch of this kind of search appears after this list)
• ispell: A fast spelling checker
• rsynth: A text-to-speech synthesis program
• sphinx: A speech decoder
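As an illustration of the string-matching workload, the sketch below performs a deliberately simple substring scan. Treat the function name and interface as assumptions made for this example; the actual stringsearch kernel is organized differently.

#include <stddef.h>
#include <string.h>

/* Return the index of the first occurrence of word in text, or -1 if absent.
   A deliberately naive scan used only to illustrate the workload. */
long find_word(const char *text, const char *word)
{
    size_t n = strlen(text), m = strlen(word);
    if (m == 0) return 0;                     /* empty pattern matches at 0 */
    if (m > n) return -1;
    for (size_t i = 0; i + m <= n; i++)
        if (memcmp(text + i, word, m) == 0)
            return (long)i;
    return -1;
}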

The set of MiBench networking applications includes

• dijkstra: Computes the shortest path between nodes (a compact version of the algorithm is sketched after this list)
• patricia: A routing-table algorithm based on Patricia tries
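Dijkstra's algorithm itself fits in a few lines. The sketch below is a compact O(V^2) adjacency-matrix version for illustration; the fixed graph size, the function name, and the matrix representation are assumptions made to keep the example short and are not taken from the MiBench source.

#include <limits.h>

#define NNODES 8                /* small example graph size (assumed) */
#define INF    INT_MAX

/* Single-source shortest paths over an adjacency matrix (0 means no edge).
   O(V^2) formulation of Dijkstra's algorithm; illustrative sketch only. */
void shortest_paths(int w[NNODES][NNODES], int src, int dist[NNODES])
{
    int done[NNODES] = {0};

    for (int i = 0; i < NNODES; i++) dist[i] = INF;
    dist[src] = 0;

    for (int iter = 0; iter < NNODES; iter++) {
        int u = -1;
        for (int i = 0; i < NNODES; i++)      /* pick the closest unfinished node */
            if (!done[i] && dist[i] != INF && (u < 0 || dist[i] < dist[u]))
                u = i;
        if (u < 0) break;                     /* remaining nodes are unreachable */
        done[u] = 1;
        for (int v = 0; v < NNODES; v++)      /* relax the edges leaving u */
            if (w[u][v] > 0 && dist[u] + w[u][v] < dist[v])
                dist[v] = dist[u] + w[u][v];
    }
}

Production routing code would normally use adjacency lists and a priority queue for large graphs; the fixed-size matrix merely keeps the sketch short.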

The networking group also reuses the CRC32, sha, and Blowfish applications from the MiBench suite's security application group. The security group of MiBench applications includes

TABLE 10.10 MiBench Benchmark Programs

Auto/Industrial      Consumer Devices    Office Automation    Networking    Security            Telecom
basicmath            jpeg                ghostscript          dijkstra      Blowfish encoder    CRC32
bitcount             lame                ispell               patricia      Blowfish decoder    FFT
qsort                mad                 rsynth               (CRC32)       PGP sign            IFFT
susan (edges)        tiff2bw             sphinx               (sha)         PGP verify          ADPCM encode
susan (corners)      tiff2rgba           stringsearch         (Blowfish)    Rijndael encoder    ADPCM decode
susan (smoothing)    tiffdither                                             Rijndael decoder    GSM encoder
                     tiffmedian                                             sha                 GSM decoder
                     typeset

• sha: A secure hashing algorithm
• Rijndael encryption/decryption: The cryptographic algorithm used by the AES encryption standard
• PGP sign/verify: A public-key cryptographic system called "Pretty Good Privacy"

MiBench's telecommunications application suite includes the following applications:

• FFT/IFFT: The FFT and the inverse FFT
• GSM encode/decode: European voice codec for mobile telephony
• ADPCM encode/decode: Adaptive differential pulse-code modulation speech-compression algorithm
• CRC32: A 32-bit cyclic redundancy check (a bit-at-a-time sketch appears after this list)
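The CRC32 kernel is small enough to show essentially in full. The bit-at-a-time routine below uses the reflected IEEE 802.3 polynomial 0xEDB88320 with the conventional all-ones initial value and final inversion; it illustrates the technique and is not necessarily identical to the MiBench implementation, which may use a table-driven variant instead.

#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 using the reflected IEEE 802.3 polynomial 0xEDB88320,
   with the conventional all-ones initial value and final inversion.
   Illustrative sketch; table-driven versions trade memory for speed. */
uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}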

With its 35 component benchmark applications, MiBench certainly provides a thorough workout for any processor/compiler combination. Both MediaBench and MiBench solve the problem of proprietary code for anyone who wants to conduct a private set of benchmark tests by providing a standardized set of benchmark tests at no cost and with no use restrictions. Industry analysts, technical journalists, and researchers can publish results of independent tests conducted with these benchmarks, although none seem to have done so to date. The trade-off made with these processor benchmarks from academia is that there is no official body to enforce benchmarking rules, to provide result certification, or even to drive future development. Nothing seems to have happened on the MiBench website for more than a decade, yet the state of the art in algorithm development has certainly advanced over that time, as has processor capability. For someone conducting their own set of benchmark tests, self-certification may be sufficient. For anyone else, the entity publishing the results must be scrutinized for fairness in the conduct of the tests, for bias in test comparisons among competing processors, and for bias in any conclusions.