An Overview of Common Benchmarks
Reinhold P. Weicker, Siemens Nixdorf Information Systems

“Fair benchmarking” would be less of an oxymoron if those using benchmark results knew what tasks the benchmarks really perform and what they measure.

The main reason for using computers is to perform tasks faster. This is why computer customers take performance measurement so seriously. Even though performance measurement usually compares only one aspect of computers (speed), this aspect is often dominant. Normally, a mainframe customer can run typical applications on a new machine before buying it. With microprocessor-based systems, however, original equipment manufacturers must make decisions without detailed knowledge of the end user’s code, so performance measurements with standard benchmarks become more important.

Performance is a broad area, and traditional benchmarks cover only part of it. This article is restricted to benchmarks measuring hardware speed, including compiler code generation; it does not cover the more general area of system benchmarks (for example, operating system performance). Still, manufacturers use traditional benchmarks in their advertising, and customers use them in making decisions, so it is important to know as much as possible about them. This article characterizes the most often used benchmarks in detail and warns users about a number of pitfalls.

The ubiquitous MIPS numbers

For comparisons across different instruction-set architectures, the unit MIPS, in its literal meaning of millions of instructions per second (native MIPS), has lost nearly all its significance. This became obvious when reduced instruction-set computer (RISC) architectures appeared.¹ Operations that can be performed by one CISC (complex instruction-set computer) instruction sometimes require several RISC instructions. Consider the example of a high-level language statement

    A = B + C   /* Assume mem operands */

With a CISC architecture, this can be compiled into one instruction:

    add   mem (B), mem (C), mem (A)

On a typical RISC, this requires four instructions:

    load  mem (B), reg (B)
    load  mem (C), reg (C)
    add   reg (B), reg (C), reg (A)
    store reg (A), mem (A)

If both machines need the same time to execute (not unrealistic in some cases), should the RISC then be rated as a 4-MIPS machine if the CISC (for example, a VAX 11) operates at 1 MIPS? The MIPS number in its literal meaning is still interesting for computer architects (together with the CPI number, the average number of cycles necessary for an instruction), but it loses its significance for the end user.

Because of these problems, “MIPS” has often been redefined, implicitly or explicitly, as “VAX MIPS.” In this case MIPS is just a performance factor for a given machine relative to the performance of a VAX 11/780. If a machine runs some program or set of programs X times faster than a VAX 11/780, it is called an X-MIPS machine. This is based on computer folklore saying that for typical programs a VAX 11/780 performs one million instructions per second. Although this is not true,* the belief is widespread. When VAX MIPS are quoted, it is important to know what programs form the basis for the comparison and what compilers are used for the VAX 11/780. Older Berkeley Unix compilers produced code up to 30 percent slower than VMS compilers, thereby inflating the MIPS rating of other machines.

*Some time ago I ran the Dhrystone benchmark program on VAX 11/780s with different compilers. With Berkeley Unix (4.2) Pascal, the benchmark was translated into 483 instructions executed in 700 microseconds, yielding 0.69 (native) MIPS. With DEC VMS Pascal (V. 2.4), 226 instructions were executed in 543 microseconds, yielding 0.42 (native) MIPS. Interestingly, the version with the lower MIPS rating executed the program faster.

The MIPS numbers that manufacturers give for their products can be any of the following:

MIPS numbers with no derivation. These can mean anything, and flippant interpretations such as “meaningless indication of processor speed” are justified.

Native MIPS, or MIPS in the literal meaning. To interpret this you must know what program the computation was based on and how many instructions are generated per average high-level language statement.

Peak MIPS. This term sometimes appears in product announcements of new microprocessors. It is largely irrelevant, since it equals the clock frequency for most processors (most can execute at least one instruction in one clock cycle).

EDN MIPS, Dhrystone MIPS, or similar. This could mean native MIPS when a particular program is running. More often it means VAX MIPS (see below) with a specific program as the basis for comparison.

VAX MIPS. A factor relative to the VAX 11/780, which then raises the following questions: What language? What compiler (Unix or VMS) was used for the VAX? What programs have been measured? (Note that DEC uses the term VUP, for VAX unit of performance, in making comparisons relative to the VAX 11/780. These units are based on a set of DEC internal programs, including some floating-point programs.)

In short, Omri Serlin² is correct in saying, “There are no accepted industry standards for computing the value of MIPS.”

Benchmarks

Any attempt to make MIPS numbers meaningful (for example, VAX MIPS) comes down to running a representative program or set of programs. Therefore, we can drop the notion of MIPS and just compare the speed for these benchmark programs.

It has been said that the best benchmark is the user’s own application. [...] possible to run the application on each machine in question. There are other considerations, too: The program may have been tailored to run optimally on an older machine; original equipment manufacturers must choose a microprocessor for a whole range of applications; journalists want to characterize machine speed independent of a particular application program. Therefore, the next best benchmark (1) is written in a high-level language, making it portable across different machines, (2) is representative for some kind of programming style (for example, systems programming, numerical programming, or commercial programming), (3) can be measured easily, and (4) has wide distribution.

Obviously, some of these requirements are contradictory. The more representative the benchmark program (in terms of similarity to real programs), the more complicated it will be. Thus, measurement becomes more difficult, and results may be available for only a few machines. This explains the popularity of certain benchmark programs that are not complete application programs but still claim to be representative for a given area.

This article concentrates on the most common “stone age” benchmarks (CPU/memory/compiler benchmarks only), in particular the Whetstone, Dhrystone, and Linpack benchmarks. These are the benchmarks whose results are most often cited in manufacturers’ publications and in the trade press. They are better than meaningless MIPS numbers, but readers should know their properties, that is, what they do and don’t measure.

Whetstone and Dhrystone are synthetic benchmarks: They were written solely for benchmarking purposes and perform no useful computation. Linpack was distilled out of a real, purposeful program that is now used as a benchmark.

Tables A-D in the sidebar on pages 68-69 give detailed information about the high-level language features used by these benchmarks. Comparing this information with the characteristics of the user’s own programs shows how meaningful the results of a particular benchmark are for the user’s own applications. The tables contain comparable information for all three benchmarks, thereby revealing their differences and similarities.

All percentages in the tables are dynamic percentages, that is, percentages obtained by profiling or, for the language-feature-related statistics, [...] program with counters. Note that for all programs, even those normally used in the Fortran version, the language-feature-related statistics refer to the C version of the benchmarks; this was the version for which the modification was performed. However, since most features are similar in the different languages, numbers for other languages should not differ much. The profiling data has been obtained from the Fortran version (Whetstone, Linpack) or the C version (Dhrystone).

Whetstone

The Whetstone benchmark was the first program in the literature explicitly designed for benchmarking. Its authors are H.J. Curnow and B.A. Wichmann from the National Physical Laboratory in Great Britain. It was published in 1976, with Algol 60 as the publication language. Today it is used almost exclusively in its Fortran version, with either single precision or double precision for floating-point numbers.

The benchmark owes its name to the Whetstone Algol compiler system. This system was used to collect statistics about the distribution of “Whetstone instructions,” instructions of the intermediate language used by this compiler, for a large number of numerical programs. A synthetic program was then designed. It consisted of several modules, each containing statements of some particular type (integer arithmetic, floating-point arithmetic, “if” statements, calls, and so forth) and ending with a statement printing the results. Weights were attached to the different modules (realized as loop bounds for loops around the individual modules’ statements) such that the distribution of Whetstone instructions for the synthetic benchmark matched the distribution observed in the program sample. The weights were chosen in such a way that the program executes a multiple of one million of these Whetstone instructions; thus, benchmark results are given as KWIPS (kilo Whetstone instructions per second) or MWIPS (mega Whetstone instructions per second). This way the familiar term “instructions per second” was retained but given a machine-independent meaning.

A problem with Whetstone is that only one officially controlled version exists: the Pascal version issued with the Pascal Evaluation Suite by the British Standards Institution - Quality Assurance (BSI-