An Overview of Common Benchmarks

Reinhold P. Weicker Siemens Nixdorf Information Systems

he mdin redwn for using comput- ,*. load mem (BJ,reg (B) ers is to perform ta\ks taster Thir *, load mem (),reg (C) is why performance measurement add reg (E),reg (C),reg (A) is taken so seriously by custom- “Fair benchmarking” store reg (A), mem (A) ers. Even though performance measurement usually compares only one aspect of com- would be less of an If both machines need the same time to puters (speed), this aspect is often dominant. oxymoron if those execute (not unrealistic in some cases), Normally, a mainframe customer can run should the RISC then be rated as a 4-MIPS typical applications on anew machine before using machine if the CISC (for example, a VAX buying it. With microprocessor-based 11) operates at 1 MIPS? The MIPS number systems, however, original equipment man- results knew what in its literal meaning is still interesting for ufacturers must make decisions without tasks the benchmarks computer architects (together with the CPI detailed knowledge of the end user’s code, number - the average number of cycles so performance measurements with standard really perform and necessary for an instruction), but it loses its benchmarks become more important. what they measure. significance for the end user. Performance is a broad area, and tradi- Because of these problems, “MIPS” has tional benchmarks cover only part of it. often been redefined, implicitly or explic- This article is restricted to benchmarks itly, as “VAX MIPS.” In this case MIPS is measuring hardware speed, including just a performance factor for a given ma- compiler code generation; it does not cover chine relative to the performance of a VAX the more general area of system benchmarks nearly all its significance. This became 1 1/780. If a machine runs some program or (for example, operating system perfor- obvious when reduced instruction-set set of programs X times faster than a VAX mance). Still, manufacturers use traditional computer architectures appeared.’ Opera- 11/780, it is called an X-MIPS machine. benchmarks in their advertising, and cus- tions that can be performed by one CISC This is based on computer folklore saying tomers use them in making decisions, so it (complex instruction-set computer) in- that for typical programs a VAX 11/780 is important to know as much as possible struction sometimes require several RISC performs one million instructions per sec- about them. This article characterizes the instructions. Consider the example of a ond. Although this is not true,* the belief is most often used benchmarks in detail and high-level language statement warns users about a number of pitfalls. A = B + C /* Assume mem operands */

The ubiquitous MIPS With a CISC architecture, this can be *Some time ago I ran the benchmark pro- numbers compiled into one instruction: gram on VAX I1/780s with different compilers. With Berkeley Unix (4.2) Pascal, the benchmark was trans- lated into 483 instructions executed in 700 microsec- For comparisons across different in- add mem (B),mem (C), mem (A) onds, yielding 0.69 (native) MIPS. With DEC VMS struction-set architectures, the unit MIPS, Pascal (V. 2.4),226 instructions were executed in 543 in its literal meaning of millions of instruc- microseconds, yielding 0.42 (native) MIPS. Interest- On a typical RISC, this requires four in- ingly, the version with the lower MIPS rating executed tions per second (native MIPS), has lost structions: the program faster. widespread. When VAX MIPS are quoted, possible to run the application on each gram with counters. Note that for all pro- it is important to know what programs machine in question. There are other con- grams, even those normally used in the form the basis for the comparison and what siderations, too: The program may have version, the language-feature-re- compilers are used for the VAX 111780. been tailored to run optimally on an older lated statistics refer to the C version of the Older Berkeley Unix compilers produced machine; original equipment manufactur- benchmarks; this was the version for which code up to 30 percent slower than VMS ers must choose a microprocessor for a the modification was performed. Howev- compilers, thereby inflating the MIPS rat- whole range of applications; journalists er, since most features are similar in the ing of other machines. want to characterize machine speed inde- different languages, numbers for other The MIPS numbers that manufacturers pendent of a particular application program. languages should not differ much. The give for their products can be any of the Therefore, the next best benchmark (1) is profiling has been obtained from the following: written in a high-level language, making it Fortran version (, Linpack) or portable across different machines, (2) is the C version (Dhrystone). MlPS numbers with noderivation. This representative for some kind of program- can mean anything, and flippant interpre- ming style (for example, systems pro- tations such as “meaningless indication of gramming, numerical programming, or Whetstone processor speed’ are justified. commercial programming), (3) can be Native MIPS, or MIPS in the literal measured easily, and (4) has wide distri- The Whetstone benchmark was the first meaning. To interpret this you must know bution. program in the literature explicitly designed what program the computation was based Obviously, some of these requirements for benchmarking. Its authors are H.J. on and how many instructions are generated are contradictory. The more representative Curnow and B.A. Wichmann from the per average high-level language statement. the benchmark program - in terms of National Physical Laboratory in Great Peak MIPS. This term sometimes ap- similarity to real programs - the more Britain. It was published in 1976, with pears in product announcements of new complicated it will be. Thus, measurement Algol 60 as the publication language. To- microprocessors. It is largely irrelevant, becomes more difficult, and results may be day it is used almost exclusively in its since it equals the clock frequency for most available for only a few machines. This Fortran version, with either single precision processors (most can execute at least one explains the popularity of certain benchmark or double precision for floating-point instruction in one clock cycle). programs that are not complete application numbers. EDN MIPS, Dhrystone MIPS, or sim- programs but still claim to be representa- The benchmark owes its name to the ilar. This could mean native MIPS, when a tive for a given area. Whetstone Algol compiler system. This particular program is running. More often This article concentrates on the most system was used to collect statistics about it means VAX MIPS (see below) with a common “stone age” benchmarks (CPU/ the distribution of “Whetstone instructions,” specific program as the basis for compar- memorykompiler benchmarks only) - in instructions of the intermediate language ison. particular the Whetstone, Dhrystone, and used by this compiler, for a large number of VAX MIPS. A factor relative to the Linpack benchmarks. These are the numerical programs. A synthetic program VAX 11/780, which then raises the fol- benchmarks whose results are most often was then designed. It consisted of several lowing questions: What language? What cited in manufacturers’ publications and in modules, each containing statements of compiler (Unix or VMS) was used for the the trade press. They are better than some particular type (integer arithmetic, VAX? What programs have been measured? meaningless MIPS numbers, but readers floating-point arithmetic, “if‘ statements, (Note that DEC uses the term VUP, for should know their properties - that is, calls, and so forth) and ending with a VAX unit of performance, in making what they do and don’t measure. statement printing the results. Weights were comparisons relative to the VAX 111780. Whetstone and Dhrystone are synthetic attached to the different modules (realized These units are based on a set of DEC benchmarks: They were written solely for as loop bounds for loops around the indi- internal programs, including some floating- benchmarking purposes and perform no vidual modules’ statements) such that the point programs.) useful computation. Linpack was distilled distribution of Whetstone instructions for out of a real, purposeful program that is the synthetic benchmark matched the dis- In short, Omri Serlin2 is correct in say- now used as a benchmark. tribution observed in the program sample. ing, “There are no accepted industry stan- Tables A-D in the sidebar on pages 68- The weights were chosen in such a way that dards for computing the value of MIPS.” 69 give detailed information about the high- the program executes a multiple of one level language features used by these million of these Whetstone instructions; benchmarks. Comparing these advantages thus, benchmark results are given as KWIPS Benchmarks with the characteristics of the user’s own (kilo Whetstone ) programs shows how meaningful the results or MWIPS (mega Whetstone instructions Any attempt to make MIPS numbers of a particular benchmark are for the user’s per second). This way the familiar term meaningful (for example, VAX MIPS) own applications. The tables contain “instructions per second” was retained but comes down to running a representative comparable information for all three given a machine-independent meaning. program or set of programs. Therefore, we benchmarks, thereby revealing their dif- A problem with Whetstone is that only can drop the notion of MIPS and just ferences and similarities. one officially controlled version exists - compare the speed for these benchmark All percentages in the tables are dynam- the Pascal version issued with the Pascal programs. ic percentages, that is, percentages obtained Evaluation Suite by the British Standards It has been said that the best benchmark by profiling or, for the language-feature Institution - Quality Assurance (BSI- is the user’s own application. But this is distribution, by adding appropriate counters QAS). Versions in other languages can often unrealistic, since it is not always on the source level and executing the pro- be registered with BSA-QAS to ensure

66 COMPUTER comparability. Table 1. Procedure Drofile for Whetstone.* Many Whetstone versions copied infor- mally and used for benchmarking have the Procedure Percent What is done there print statements removed, apparently with the intention of achieving better timing Main program 18.9 accuracy. This is contrary to the authors’ P3 14.4 FP arithmetic intentions, since optimizing compilers may PO 11.6 Indexing then eliminate significant parts of the Pa -1.9 FP arithmetic program. If timing accuracy is a problem, User code 46.8 the loop bounds should be increased in such a way that the time spent in the extra Trigonometric functions 21.6 Sin, cos, atan statements becomes insignificant. Other math functions -31.7 Exp, log, sqrt Users should know that since 1988 there Library functions 53.3 has been a revised (Pascal) version of the - ben~hmark.~Changes were made to mod- Total 100 ules 6 and 8 to adjust the weights and to preclude unintended optimization by *Because of rounding, all percentages can add up to a number slightly below or above 100. compilers. The print statements have been replaced by statements checking the values of the variables used in the computation. According to Wichmann,’ performance figures for the two versions should be very similar; however, differences of up to 20 pages 68-69. Some properties of Whet- MIPS RISC compilers) make no differ- percent cannot be ruled out. The Fortran stone are probably typical for most numer- ence in Whetstone execution times. version has not undergone a similar revi- ic applications (for example, a high num- (4) Instead of local variables, Whetstone sion, since with the separate compilation ber of loop statements); other properties uses a handful of global data (several scalar model of Fortran the danger of unintended belong exclusively to Whetstone (for ex- variables and a four-element array of con- optimization is smaller (though it certainly ample, very few local variables). stant size) repeatedly. Therefore, a compiler exists if all parts are compiled in one unit). in which the most heavily used global All Whetstone data in this article is based Whetstone characteristics. Some im- variables are allocated in registers (an op- on the old version; the language-feature portant characteristics should be kept in timization usually considered of secondary statistics are almost identical for both mind when using Whetstone numbers for importance) will boost Whetstone perfor- versions. performance comparisons. mance. (5) Because of its construction principle Size, procedure profile, and language- (I) Whetstone has a high percentage of (nine small loops), Whetstone has an ex- feature distribution. The static length of floating-point data and floating-point op- tremely high code locality. A near 100 the Whetstone benchmark (C version) as erations. This is intentional, since the percent hit rate can be expected even for compiled by the VAX Unix 4.3 BSD C benchmark is meant to represent numeric fairly small instruction caches. For the same compiler* is 2,117 bytes (measurement programs. reason, a simple reordering of the source loops only). However, because of the pro- (2) As mentioned above, a high per- code can significantly alter the execution gram’s nature, the length of the individual centage of execution time is spent in time in some cases. For example, it has modules is more important. They are be- mathematical library functions. This been reported that for the MC68020 with tween40 and 527 bytes long; all except one property is derived from the statistical data its 256-byte instruction cache, reordering are less than 256 bytes long. The weights forming the basis of Whetstone; however, of the source code can boost performance (upper loop bounds) of the individual it may not be representative for most of up to 15 percent. modules number between 12 and 899. today’s numerical application programs. Table 1 shows the distribution ofexecu- Since the speed of these functions (realized tion time spent in the subprograms of as software subroutines or microcode) Linpack Whetstone (VAX 11/785, BSD 4.3 For- dominates Whetstone performance to a high tran, single precision). The most important, degree, manufacturers can be tempted to As explained by its author, Jack Don- and perhaps surprising, result is that Whet- manipulate the runtime library for Whet- garra4 from the University of Tennessee stone spends more than half its time in stone performance. (previously Argonne National Laboratory), library subroutines rather than in the com- (3) As evident from Table D in the side- Linpack didn’t originate as a benchmark. piled user code. bar, Whetstone uses very few local variables. When first published in 1976, it was just a The distribution of language features is When Whetstone was written, the issue of collection (a package, hence the name) of shown in Tables A-D in the sidebar on local versus global variables was hardly linear algebra subroutines often used in being discussed in software engineering, Fortran programs. Dongarra, who collects not to mention in computer architecture. and publishes Linpack results, has now Because of this unusual lack of local vari- distilled what was part of a “real life” ables, register windows (in the Sparc RISC, program into a benchmark that is distrib- *With the Unix 4.3 BSD language systems, it was for example) or good register allocation uted in various easier to determine the code size for the C version. The version^.^ numbers for the Fortran version should be similar. algorithms for local variables (say, in the The program operates on a large matrix

December 1990 67 (two-dimensional array); however, the in- declared with bounds 200), but versions ogy means that the nonfloating-point op- ner subroutines manipulate the matrix as a for larger arrays also exist. erations are neglected or, stated another one-dimensional array, an optimization The results are usually reported in mil- way, that their execution time is included customary for sophisticated Fortran pro- lions of floating-point operations per sec- in that of the floating-point operations. gramming. The matrix size in the version ond (Mflops); the number of floating-point When floating-point operations become distributed by standard mail servers is 100 operations the program executes can be increasingly faster relative to integer oper- x 100 (within a two-dimensional array derived from the array size. This terminol- ations, this terminology becomes some-

Tables covering more than one benchmark

Table A. Statement distribution in percentages. *

Statement Dhry stone Whetstone Linpackkaxpy

Assignment of a variable 20.4 14.4 Assignment of a constant 11.7 8.2 Assignment of an expression (one operator) 17.5 1.4 Assignment of an expression (two operators) 1.0 24.3 48.5 Assignment of an expression (three operators) 1 .0 1.6 Assignment of an expression (>three operators) 6.8

One-sided if statement, “then” part executed 2.9 0.5 One-sided if statement, “then” part not executed 3.9 0.1 2.2 Two-sided if statement, “then” part executed 4.9 4.0 Two-sided if statement, “else” part executed 1.9 4.0

For statement (evaluation) 6.8 17.3 49.3 Goto statement 0.5 Whilehepeat statement (evaluation) 4.9 Switch statement I .0 Break statement 1 .0

Return statement (with expression) 4.9

Call statement (user procedure) 9.7 11.9 Call statement (user function) 4.9 Call statement (system procedure) 1.0 Call statement (system function) 1.0 4.7 - 100 100 100

*Because of rounding, all percentages can add up to a number slightly below or above 100.

Table C. Operand data-type distribution in percentages.

Operand Dhry stone Whetstone LinpacWsaxpy

Integer 57.0 55.7 67.2 Char 19.6 Float/double 44.3 32.8 Enumeration 10.9 Boolean 4.2 Array 0.8 String 2.3 Pointer 5.3 ~ - ~ 100 100 100

68 COMPUTER what misleading. point data. + 3. This technique saves execution time For Linpack, it is important to know Rolled/unrolled --In the unrolled ver- for most machines and compilers; howev- what version is measured with respect to sion, loops are optimized at the source er, more sophisticated vector machines, the following attribute pairs: level by “loop unrolling”: The loop index where loop unrolling is done by the com- (say, i) is incremented in steps of four, and piler generating code for vector hardware, Single/double -Fortran single preci- the loop body contains four groups of usually execute the standard (rolled) ver- sion or double precision for the floating- statements, for indexes i, i + I, i + 2, and i sion faster. Coded BLASIFortran BLAS - Lin- pack relies heavily on a subpackage of linear algebra subroutines (BLAS). Coded BLAS (as opposed to Fortran BLAS) Table B. Operator distribution in percentages. means that these subroutines have been rewritten in assembly language. Dongarra Operator Dhrystone Whetstone Linpack/saxpy has stopped collecting and publishing re- sults for the coded BLAS version and re- + (intkhar) 21.0 11.9 14.1 quires that only the Fortran version of these - (int) 5.0 6.0 subroutines be used unchanged. However, * (int) 2.5 6.0 some results for coded BLAS versions are still cited elsewhere. Computing the exe- I (int) 0.8 ~ Integer arithmetic 29.3 23.9 14.1 cution-time ratio between coded BLAS and Fortran BLAS versions for the same ma- + (float/double) 14.9 14.1 chine offers insights about the Fortran - (floddouble) 2.1 compiler’s code optimization quality: For * (float/double) 9.3 14.1 some machines the ratio is 1.2 to I; for others it can be as high as 2 to 1. / (float/double) 4.6 ~ Floating-point arithmetic 30.9 28.2 Size, procedureprofile, and language- <, <= (incl. loop control) 10.1 10.7 14.5 feature distribution. The Linpack data Other relational operators 11.7 reported here is for the rolled version, single ~ 2.8 0.6 Relational 21.8 13.5 15.1 precision, with Fortran BLAS; code sizes have been measured with VAX Unix BSD Logical 3.3 0.2 4.3 Fortran. The static code length for all subprograms Indexing (one-dimensional) 5.9 24.5 42.3 is 4,537 bytes. The length for individual subprograms varies between 11 1 and 1,789 Indexing (two-dimensional) -3.4 Indexing 9.3 24.5 42.3 bytes; the most heavily used subprogram, saxpy, is 234 bytes long. Data size, in the Record selection 7.6 standard version, is dominated by an array Record selection via pointer 15.1 of 100 x 100 real numbers. For 32-bit Record selection 22.1 machines, this means that with single pre- cision, 40 Kbytes are used for data (80 Address operator (C) 5.0 3.6 Kbytes with double precision). Indirection operator (C) 8.4 3.6 Table 2 shows the distribution of execu- ~ C-specific operators 13.4 7.2 tion time in the various subroutines. The most important observation from the table Total 100 IO0 100 is that more than 75 percent of Linpack’s execution time is spent in a 15-line sub- routine (called saxpy in the single-preci- sion version and daxpy in the double- Tab%D. Operand locality distribution in percentages. precision version). Dongarra4 reports that

~ on most machines the percentage is even i Operand Locality Dhrystone Whetstone Linpacklsaxpy higher (90 percent). Because of this ex- treme concentration of execution time in Local 48.7 0.4 49.5 the saxpy subroutine, and because of the Global 8.3 56.3 time-consuming instrumentation method Parameter (value) 10.6 18.6 17.0 for obtaining the measurements, language- Parameter (reference) 6.8 1.9 24.6 feature distribution has been measured Function result 2.3 I .3 only for the saxpy subroutine (rolled Constant 23.4 -21.6 - 8.8 version). IO0 IO0 100 Table A in the sidebar shows that very few statement types (assignment with multiplication and addition, and “for”

December 1990 69

- __ Table 2. Procedure profile for Linpack. drops to 10 percent instead of 16 percent if the Pascal (or Ada) version is used (mea- Procedure Percent What is done there surement for Berkeley Unix 4.3 Pascal). On the other hand, the number is higher for Main program 0.0 newer RISC machines with optimizing matgen 13.8 compilers, mainly because they spend much sgefa 6.2 less time in procedure calls than the VAX. =xPY 77.1 y[il = y[i]+ a*x[i] Consistent with usage in system-type isamax 1.6 programming, arithmetic expressions are Miscellaneous 1.2 simpler than in the other benchmarks; there ~ User code 100 are more “if’ statements and fewer loops. Dhrystone was the first benchmark to Library functions 0.0 explicitly consider the locality of operands: Local variables and parameters are used more often than global variables. This is not only consistent with good software engineering practices but also important for modern CPU architectures (RISC ar- statements) make up the bulk of the Dhrystone chitectures). On older machines with few subroutine and, therefore, of Linpack it- registers, local variables and parameters self. The data is mostly reference pa- As the name indicates, Dhrystone was are allocated in memory in the same way as rameters (array values) or local variables developed much as Whetstone was; it is a global variables; on RISC machines they (indexes); there are hardly any global synthetic benchmark that I published in typically reside in registers. The resulting variables. 1984. The original language of publication difference in access time is one of the is Ada, although it uses only the Pascal most important advantages of RISC ar- Linpack characteristics. To interpret subset of Ada and was intended for easy chitectures. performance characterizations by Linpack translation to Pascal and C. It is used mainly Mflops, it helps to know the benchmark’s in the C version. Dhrystone characteristics. Familiarity main characteristics: The basis for Dhrystone is a literature with the benchmark’s main characteristics, survey on the distribution of source language described below, is important when inter- As expected for a numeric benchmark, features in nonnumeric, system-type pro- preting Dhrystone performance character- Linpack has a high percentage of floating- gramming (operating systems, compilers, izations. point operations, though only a few are editors, and so forth). In addition to the actually used. For example, the program obvious difference in data types (integral As intended, Dhrystone contains no has no floating-point divisions. In striking versus floating-point), numeric and system- floating-point operations in its measurement contrast to Whetstone, no mathematical type programs have other differences, too: loop. functions are used at all. System programs contain fewer loops, A considerable percentage of execution The execution time is spent almost simpler computational statements, and more time is spent in string functions; this number exclusively in one small function. This “if’ statements and procedure calls. should have been lower. In extreme cases means that even a small instruction cache Dhrystone consists of 12 procedures (MIPS architecture and C compiler), this will show a very high hit rate. included in one measurement loop with 94 number goes up to 40 percent. Contrary to the high locality for code, statements. During one loop (one Dhrys- Unlike Whetstone, Dhrystonecontains Linpack has a low locality for data. A tone), 101 statements (103 in the C version) hardly any loops within the main mea- larger size for the main matrix leads - are executed dynamically. The results are surement loop. Therefore, for micropro- depending on the cache size -to signifi- usually given in per second. cessors with small instruction caches (be- cantly more cache misses and therefore to The program (currently Version 2.1) has low 1,000 bytes), almost all instruction a lower Mflops rate. So, in making com- been distributed mainly through Usenet, accesses are cache misses. But as soon as parisons, it is important to know whether the Unix network; I also make it available the cache becomes larger than the mea- Linpack Mflops for different machines have on a floppy disk. Rick Richardson has surement loop, all instruction accesses are been computed using the same array di- collected and posted results for the Dhry- cache hits. mensions. Also, Linpack can be highly stone benchmark regularly to Usenet (the Only a small amount of global data is sensitive to the cache configuration: A latest list of results is dated April 29,1990). manipulated, and the data size cannot be different array alignment (201 x 200 in- scaled as in Linpack. stead of 200 x 200 for the global array Size, procedure profile, and language- No attempt has been made to thwart declaration) can lead to a different mapping feature distribution. The static length of optimizing compilers. The goal was for the of data to cache lines and therefore to a the Dhrystone measurement loop, as com- program to reflect typical programming considerably different execution time. The piled by the VAX Unix (BSD 4.3) C style; it should be just as optimizable as program, as distributed by the standard compiler, is 1,039 bytes. Table 3 shows the normal programs. An exception is the mail servers, delivers Mflops numbers for distribution of execution time spent in its optimization of dead-code removal. Since two choices of leading dimension, 200 and subprograms. in Version 1 the computation results were 201; we can assume that manufacturers The percentage of time spent in string not printed or used, optimizing compilers report the better number. operations is highly language dependent; it were able to recognize many statements as

70 COMPUTER dead code and suppress code generation Table 3. Dhrystone procedure profile. for these statements. In Version 2, this has been corrected. Procedure Percent What is done there

Ground rules for Dhrystone number Main program 18.3 comparisons. Because of Dhrystone’s User procedures -65.7 peculiarities, users should be sure to observe User code 84.0 certain ground rules when comparing Dhrystone results. First, the version used strcpy 8.0 String copy should be 2.1 ; the earlier version, 1.1, leaves (string constant) too much room for distortion of results by strcmp 8.1 String comparison dead-code elimination. (string variables) ~ Second, the two modules must be com- Library functions 16.1 piled separately, and procedure merging - (in-lining) is not allowed for user proce- I Total 100 dures. ANSI C, however, allows in-lining of library routines (relevant for string routines in the C version of Dhrystone). Third, when processors are compared, the same programming language must be usedon both. For compilers of equal quality, Other benchmarks The net Mflops rate of many Fortran programs Pascal and Ada numbers can be about 10 and work loads will be in the subrange between the equi-weighted harmonic and arithmetic percent better because of the string se- In addition to the most often quoted means, depending on the degree of code mantics. In c, the length of a string is benchmarks explained above, several other parallelism and optimization. The Mflops normally not known at compile time, and programs are used as benchmarks, including metric provides aquick measure of the average the compiler needs -at least for the string efficiency of a computer system, since its peak computing rate is well known. comparison statement in Dhrystone - to Livermore Fortran Kernels, generate code that checks each byte for the Stanford Small Programs Benchmark string terminator byte (null byte). With Set, Stanford Small Programs Benchmark Pascal and Ada the compiler can generate EDN benchmarks, Set. Concurrent with development of the word instructions (usually in-line code) for Sieve of Eratosthenes, first RISC systems at Stanford University the string operations. Rhealstone, and and the University of California, Berkeley, Therefore, for a meaningful comparison SPEC benchmarks. John Hennessy and Peter Nye at Stanford’s of C-version results, it helps to be able to Computer Systems Laboratory collected a answer certain questions: These range from small, randomly chosen set of small programs (one page or less of programs such as Sieve, to elaborate source code for each program). These (1) Are the string routines written in benchmark suites such as Livermore For- programs became popular mainly because machine code? tran Kernels and SPEC benchmarks. they were the basis for the first comparisons (2) Are the string routines implemented of RISC and CISC processors. They have as in-line code? Livermore Fortran Kernels. The Liv- now been packed into one C program (3) Does the compiler use the fact that ermore Fortran Kernels, also called the containing eight integer programs - Per- in the “strcpy” statement the source oper- Lawrence Livermore Loops, consist of 24 mutations, Towers of Hanoi, Eight Queens, and has a fixed length? If it does (legal kernels, or inner loops, of numeric com- Integer Matrix Multiplication, Puzzle, according to ANSI C), this statement can putations from different areas of the physical Quicksort, Bubble Sort, and Tree Sort - be compiled in the same way as a record sciences. The author, F.H. McMahon of and two floating-point programs - Float- assignment, which can result in consider- Lawrence Livermore National Laboratory, ing-point Matrix Multiplication and Fast able savings. has collected them into a benchmark suite Fourier Transformation. (4) Is a word alignment assumed for the and has added statements for time mea- Characteristics of the individual programs string routines? This is acceptable for the surement. The individual loops range from vary; most contain a high percentage of strcpy statement only, not for the “strcmp” a few lines to about one page of source array accesses. There seems to be no offi- statement. code. The program is self-measuring and cial publication of the source code. The computes Mflops rates for each kernel, for only place I have seen the C code in print is Language systems are allowed to opti- three different vector lengths. in a manufacturer’s performance report. mize for cases 1 through 3 above, just as As we might expect, these kernels con- There is no standardized method for they can for programs in general. For pro- tain many floating-point computations and generating an overall figure of merit from cessor comparisons, however, it is impor- a high percentage of array accesses. Sev- the individual execution times. In one tant that the compilers used apply the same eral kernels contain vectorizable code; some version, a driver program assigns weights amount of optimization; otherwise, opti- contain code that is vectorizable if rewritten. between 0.5 and 4.44 to the individual mization differences may overshadow CPU (Feo6 provides a detailed discussion of the execution times. Perhaps a better alterna- speed differences. This usually requires an Livermore Loops.) McMahon characterizes tive, used by Sun and MIPS, is to compute inspection of the generated machine code the representativity of the Livermore Loops the geometric mean of the individual pro- and the C library routines. as follows: grams’ execution times.

December 1990 71 Table 4. SPEC benchmark Droerams.

~~ ~~ ~~~~ ~ ~ ~ ~ ~~~~~~~ Acronym Short Characterization Language Main Data Types

gcc GNU C compiler C Integer espresso PLA simulator C Integer spice 2g6 Analog circuit simulation Fortran Floating point doduc Monte Carlo simulation Fortran Floating point nasa7 Collection of several numerical “kernels” Fortran Floating point li Lisp interpreter C Integer eqntott Switching-function minimization, mostly sorting C Integer matrix300 Various matrix multiplication algorithms Fortran Floating point fPPPP Maxwell equations Fortran Floating point tomcatv Mesh generation, highly vectorizable Fortran Floating point

EDN benchmarks. The program col- All operands are integer operands, and 58 subsequently, AT&T, Bull, CDC, Com- lection now known as the EDN bench- percent of them are local variables. paq, Data General, DEC, Dupont, Fujitsu, marks was developed by a group at Carn- The program is mentioned here not be- IBM, Intel, Intergraph, Motorola, NCR, egie Mellon University for the Military cause it can be considered a good bench- Siemens Nixdorf, Silicon Graphics, Sol- Computer Family project. EDN published mark but because, as one author put it, bourne, Stardent, and Unisys became it in 1981. Originally, the programs were “Sieve performance of one compiler over members. written in several assembly languages (LSI- another has probably sold more compilers In October 1989, SPEC released its first 11/23, 8086, 68000, and ZSOOO); the in- for some companies than any other set of 10 benchmark programs. Table 4 tention was to measure the speed of mi- benchmark in history.” contains only a rough characterization of croprocessors without also measuring the the programs; J. Uniejewski* provides a compiler’s quality. SPEC benchmarks. Probably the most more detailed discussion. Because a license A subset of the original benchmarks is important current benchmarking effort is must be signed, and because of its size often used in a C version: SPEC - the systems performance evalu- (lS0,OOO lines of source code), the SPEC ation cooperative effort. It started because benchmark suite is distributed via magnetic Benchmark E: String search benchmarking experts at various companies tape only. Benchmark F: Bit testhetheset felt that most previously existing bench- Results are given as performance relative Benchmark H: Linked list insertion marks (usually small programs) were in- to a VAX 1 U780 using VMS compilers. Benchmark I: Quicksort adequate. Small benchmarks can no longer Results for several of SPEC Benchmark K: Bit matrix transforma- be representative for real programs when it member companies are contained in the tion comes to testing the memory system, be- regular SPEC Newsletter (see Additional cause with the growing size of cache reading and address information). A com- This subset of the EDN benchmarks has memories and the introduction of on-chip prehensive number, the “SPECmark,” is been used in Bud Funk’s comparison of caches for high-end microprocessors, the defined as the geometric mean of the rel- RISC and CISC processor^.^ There seems cache hit ratio comes close to 100 percent ative performance of the 10 programs. to be no standard C version of the EDN for these benchmarks. Furthermore, once a However, SPEC requires a reporting form benchmarks; the programs are disseminated small program becomes popular as a that gives, in addition to the raw data, the informally. benchmark, compiler writers are inclined relative performance for each benchmark (or forced) to “tweak” their compilers into program separately. Thus, users can select Sieve of Eratosthenes. One of the most optimizations particularly beneficial to this the subset of performance numbers for popular programs for benchmarking small benchmark - for example, the string op- which the programming language and/or PCs is the Sieve of Eratosthenes, some- timizations for Dhrystone. the application area best matches their times called “Primes.” It computes all prime SPEC’S goal is to collect, standardize, applications. numbers up to a given limit (usually 8,192). and distribute large application programs The program has some unusual character- that can be used as benchmarks. This is a istics. For example, 33 percent of the dy- nontrivial task, since realistic programs Non-CPU influences in namically executed statements are assign- previously used in benchmarking (for ex- benchmark ments of a constant; only S percent are ample, the Unix utilities “yacc” or “nroff’) assignments with an expression at the right- often require a license and are therefore not performance hand side. There are no “while” statements freely distributable. and no procedure calls; SO percent of the The founding members of SPEC were In trade journals and advertisements, statements are loop control evaluations. Apollo, Hewlett-Packard, MIPS, and Sun; manufacturers usually credit good bench-

72 COMPUTER mark numbers only to the hardware sys- Table 5. Performance ratio for different languages (larger is better, C = 1): Stan- tem’s speed. With microprocessors, this is ford programs. reduced even more to the CPU speed. However, the preceding discussion makes I Proeram Bliss C Pascal Ada I it clear that other factors also have an influence -for example, the programming Search 1.24 1.o 0.70 language, the compiler, the runtime library Sieve 0.63 1.o 0.80 functions, and the memory and cache size. Puzzle 0.77 1.o 0.73 Ac kermann 1.20 1.o 0.80 Programming-language influence. Table 5 (numbers from Levy and Clark9 Dhrystone (1.1) 1 .o 1.32 1.02 and my own collection of Dhrystone results) shows the execution time of several pro- grams on the same machine (VAX, 1982 and 1985). Properties of the languages (calling sequence, pointer semantics, and string semantics) obviously influence ex- Table 6. Compiler optimization levels in Dhrystonedsec. ecution time even if the source programs look similar and produce the same results. I Optimization Level v. 1.1 v. 2.1 I Compiler influence. Table 6, taken from No opt., no “register” attribute 30,700 3 1,000 the MIPS Performance Brief,lo gives No opt., with “register” attribute 32,600 32,400 Dhrystone results (as of January 1990) for Optimization “0,”no “register” attribute 39,700 36,700 the MIPS M/2000 with the MIPS C com- Optimization “0,”with “register” attribute 39,700 36,700 piler cc2.0. The table shows how the dif- Optimization “03” 43,100 39,400 ferent levels of optimization influence ex- Optimization “04” 46,700 43,200 ecution time. Note that optimization “04” performs procedure in-lining, an optimization not consistent with the ground rules and in- cluded in the report for comparison only. On the other hand, the “strcpy” optimiza- tion for Dhrystone is not included in any of cess times for the cache and the main mem- Table 7. Size in bytes for some popular the optimization levels for the MIPS C ory, cache size can have a considerable benchmarks. compiler. If it is used, the Dhrystone rate effect. increases considerably. Table 7 summarizes the code sizes (size I Program Code Data I of the relevant procedureshnner loops) and Runtime library system. The role of data sizes (of the main array) for some Whetstone - 256 16 the runtime library system is often over- popular benchmarks. All sizes have been Dhrystone 1,039 looked when benchmark results are com- measured for the VAX 11 with the Unix Linpack (saxpy) 234 40,000 pared. As apparent fromTable 1, Whetstone BSD 4.3 C compiler, with optimization ( 1OOx 100 version) spends 40 to 50 percent of the execution “-0” (code compaction). Of course, the Sieve 160 8,192 time in functions of the mathematical sizes will differ for other architectures and (standard version) subroutines library. The C version of compilers. Typically, RISC architectures Quicksort 174 20,000 Dhrystone spends 16 percent of the exe- lead to larger code sizes, whereas the data (standard version) cution time in the string functions (VAX, size remains the same. Puzzle 1,642 511 Berkeley Unix 4.3 C); with other systems, If the cache is smaller than the relevant Ac kermann 52 the percentage can be higher. benchmark, reordering the code can, for Some systems have two flavors of the some benchmarks and cache configurations, mathematical floating-point library: The lead to considerable savings in execution first is guaranteed to comply with the IEEE time. Such savings have been reported for floating-point standard; the second is faster Whetstone on MC 68020 systems (reor- and may give less accurate results under dering the source program) as well as for Small, synthetic some circumstances. Customers who must Dhrystone on NS 32532, where just a dif- rely on the accuracy of floating-point ferent linkage order can lead to a difference benchmarks versus computations should know which library of up to 5 percent in execution time. It is real-life programs was used for benchmark measurements. debatable whether the “good case” or the “bad case” better represents the system’s It should be apparent by now that with Cache size. It is important to look for the true characteristics. In any event, custom- the advent of on-chip caches and sophisti- built-in performance boost when the cache ers should be aware of these effects and cated optimizing compilers, small bench- size reaches the relevant benchmark size. know when the standard order of the code marks gradually lose their predictive val- Depending on the difference between ac- hasbeenchanged. ue. This is why current efforts like SPEC’S

December 1990 73 References Obtaining benchmark sources via e-mail

Most of the benchmarks discussed in this article can be obtained via electronic 1. D.A. Patterson, “Reduced Instruction-Set Computers,” Comm. ACM, Vol. 28, No. I, mail from several mail servers established at large research institute^.^ The ma- Jan. 1985, pp. 8-21. jor mail servers and their electronic mail addresses are shown below. Users can get information about the use of these mail servers by sending electronic mail 2. 0. Serlin, “MIPS, Dhrystones, and Other consisting of the line “send index” to any of the mail servers. Tales,” Daramarion, June 1986, pp. 112- The SPEC benchmarks are available only via magnetic tape. 118.

North America 3. B.A. Wichmann, “Validation Code for the uucp: uunet!research!netlib Murray Hill, New Jersey Whetstone Benchmark,” Tech. Report NPL- Internet: [email protected] DITC 107/88, National Physical Laborato- ry, Teddington, UK, Mar. 1988. Internet: netlibOornl.gov Oak Ridge, Tennessee 4. J.J. Dongarra, “The Linpack Benchmark: Europe An Explanation,” in Evaluating Supercom- EUNEThucp: nac!netlib Oslo, Norway puters, Aad J. Van der Steen, ed., Chapman Internet: netlib @ nac.no and Hall, London, 1990, pp. 1-21. EARN/Bitnet: netlib%[email protected] X.400: s=netlib; o=nac; c=no; 5. J. Dongama and E. Grosse, “Distribution of Mathematical Software viaElectronic Mail,” Pacific Comm. ACM, Vol. 30, No. 5, May 1987, pp. 403-407. Internet: [email protected] Univ. of Wollongong, NSW. Australia 6. J.T. Feo, “An Analysis of the Computational and Parallel Complexity of the Livermore Loops,” Parallel Computing, Vol. 7, No. 2, June 1988, pp. 163-185.

7. B. Funk, “RISC and CISC Benchmarks and Insights,” Unisys World,Jan. 1989, pp. 11- 14.

activities concentrate on collecting large, example, it won’t be as easy to tweak 8. J. Uniejewski, “Characterizing System real-life programs. Why, then, should this compilers for the SPEC benchmarks as it is Performance Using Application-Level article bother to characterize in detail these for the small benchmarks; but if it happens, Benchmarks,”Proc. Buscon,Sept. 1989,pp. 159-167. Partial publication in SPEC it also will be harder to detect. “stone age” benchmarks? There are several Newsletter, Vol. 2, No. 3, Summer 1990, pp. reasons: 3-4.

(1) Manufacturers will continue to use dvice for users looking at bench- 9. H. Levy and D.W. Clark, “On the Use of Benchmarks for Measuring System Perfor- mark numbers to characterize them for some time, so the trade press will mance,” Computer Architecture News, Vol. keep quoting them. machine performance should be- 10, No. 6, Dec. 1982, pp. 5-8. (2) Manufacturers sometimes base their gin with a warning not to trust MIPS num- MIPS rating on them. An example is IBM’s bers unless their derivation is clearly ex- 10. MIPS Computer Systems, Inc., Performance Brief; CPUBenchmarks,Issue 3.9, Jan. 1990, (unfortunate) decision to base the pub- plained. Here are some other things to p. 35. lished (VAX-relative) MIPS numbers for watch for: the IBM 6000 workstation on the old 1 .I version of Dhrystone. Subsequently, DEC Check whether Mflops numbers relate and Motorola changed the MIPS compu- to a standard benchmark. Does this tation rules for their competing products, benchmark match your applications? also basing their MIPS numbers on Dhry- Know the properties of the benchmarks stone 1.1. whose results are advertised. (3) For investigating new architectural Be sure you know all the relevant facts designs - via simulations, for example - about your system and the manufacturer’s Additional reading and the benchmarks can provide a useful first benchmarking system. For hardware this approximation. includes clock frequency, memory laten- address information (4) For embedded microprocessors with cy, and cache size; for software it includes no standard system software (the SPEC programming language, code size, data size, Following are the main reference sources suite requires Unix or an equivalent oper- compiler version, compiler options, and for each of the benchmarks discussed in ating system), nothing else may be avail- runtime library. this article, together with a short charac- able. Check benchmark code listings to make terization. A contact person is identified (5) We can expect that larger bench- sure apples are compared with apples and for each of the major benchmarks so that marks will not be completely free of distor- that no illegal optimizations are applied. readers can get additional information. For tions from unforeseen effects either. Ex- Ask for a well-written performance information about access to the benchmark perience gained with smaller benchmarks report. Good companies provide all rele- sources via electronic mail, see the sidebar can help us be aware of such effects. For vant details. W “Obtaining benchmark sources via e-mail.”

74 COMPUTER Whetstone of Dhrystone are compared; reiteration of mea- gram listings in Pascal, C, Forth, Fortran IV, surement rules. Basic, Cobol, Ada, and Modula-2. Curnow, H.J., and B.A. Wichmann, “A Synthel- icBenchmark,”TheComputerJ.,Vol. 19,No. 1, Contact: Reinhold P. Weicker, Siemens Nixdorf 1976, pp. 43-49. Original publication, explana- Information Systems, STM OS 32, Otto-Hahn- SPEC benchmarks tion of the benchmark design, program (Algol Ring 6, W-8000 Munchen 83, Germany; phone 60) in the appendix. 49 (89) 636-42436, fax 49 (89) 636-48008, In- “Benchmark Results,”SPECNewsletrer, Vol. I, ternet: [email protected]; Eunet: No. 1, Fall 1989, pp. 1-15. First published list of Wichmann, B.A., “Validation Code for the weicker%ztivax.uucp @unido.uucp. results, in the report form required by SPEC. Whetstone Benchmark,” see Reference 3. Dis- cussion of comments made to the original pro- Collection of results: Rick Richardson, PC Re- Uniejewski, J., “Characterizing System Perfor- gram, explanation of the revised version. Paper search, Inc., 94 Apple Orchard Dr., Tinton Falls, mance Using Application-Level Benchmarks,” contains a program listing of the revised ver- NJ07724;phone(201) 389-8963.e-mail (UUCP) see Reference 8. This paper includes a short sion, in Pascal, including checks for correct ... !uunet!pcrat!rick. characterization of each SPEC benchmark pro- execution. gram.

Contact: Brian A. Wichmann, National Physical Contact: SPEC -System Performance Evalua- Laboratory, Teddington, Middlesex, England Livermore Fortran Kernels tion Cooperative (Kim Shanley, Secretary), c/o TWl1 OLW; phone 44 (81) 943-6976, fax 44 Waterside Associates, 39 150PaseoPadre Pkwy., (8 1) 977-7091, Internet [email protected]. Feo, J.T., “An Analysis of the Computational Suite 350, Fremont, CA 94538; phone (415) and Parallel Complexity of the Livermore Loops,” 792-2901, fax (41 5) 792-4748, Internet Registration of other versions: J.B. Souter, see Reference 6. Analysis of the Livermore shanley @cup.portal.com. BenchmarkRegistration, BSI-QAS, PO Box 375, Fortran Kernels with respect to the achievable Milton Keynes, Great Britain MK14 6LL. parallelism. McMahon, F.H., “The Livermore Fortran Ker- nels: A Computer Test of the Numerical Per- formance Range,” Tech. Report UCRL-53745, Linpack Lawrence Livermore National Laboratory, Livermore, Calif., Dec. 1986, p. 179. Original Dongarra, J.J.,etal., LinpackUsers’Guide, SIAM publication ofthe benchmark with sampleresults. Publications, Philadelphia, Pa., 1976. Original publication (not yet as a benchmark), contains McMahon, F.H., “The Livermore Fortran Ker- the benchmark program as an appendix. nels Test of the Numerical Performance Range,” in Performance Evaluarion of , Dongarra, J.J., “Performance of Various Com- J.L. Martin, ed., North Holland, Amsterdam, puters Using Standard Equations Software in a 1988, pp. 143.186. Reprint of main part of the Fortran Environment,” Computer Archirecrure original publication. News, Vol. 18, No. 1, Mar. 1990, pp. 17-31. Latest published version of the regularly main- Contact: Frank H. McMahon, Lawrence Liver- tained list of Linpack results, rules for Linpack more National Laboratory, L-35, PO Box 808, measurements. Livermore, CA 94550; phone (415) 422-1647, Reinhold P. Weicker is a senior staff engineer Internet [email protected] .Ilnl.gov, Dongarra, J.J., “The Linpack Benchmark: An with Siemens Nixdorf Information Systems AG Explanation,” see Reference 4. Explanation of in Munich, Germany. His research interests in- Linpack, guide to interpretation of Linpack re- clude performance evaluation with benchmarks sults. and its relation toCPU architecture and compiler Stanford Small Programs Benchmark Set code generation. He wrote the often-used Dhrystone benchmark while working on the Contact: Jack J. Dongarra, Computer Science Appendix 2 -Stanfordcomposite Sourcecode, Dept., Univ. of Tennessee, Knoxville, TN 37996- CPU architecture team for the i80960 micro- Appendix to “Performance Report 68020/68030 processor. Previously, he performed research in 1301; phone (615) 974-8295, fax (615) 974- 32-bit Microprocessors,” Motorola, Inc., BR705/ 8296, Internet [email protected]. theoretical computer science at the University D, 1988, pp. A2-1 - A2-IS. This is the only of Hamburg, Germany, and was a visiting as- place I have seen this benchmark in print; it is sistant professor at Pennsylvania State Univer- normally distributed via informal channels. sity. Weicker received a diploma degree in mathe- Dhrystone matics and a PhD in computer science from the University of Erlangen-Numberg. He is a member Weicker, R.P., “Dhrystone: A Synthetic Sys- EDN benchmarks of the IEEE Computer Society, the ACM, and tems Programming Benchmark,” Comm. ACM, the Gesellschaft fur Informatik. Vol. 27, No. IO, Oct. 1984, pp. 1,013-1,030. Grappel, R.D., and J.E. Hemenway, “A Tale of Original publication, literature survey on the Four pPs: Benchmarks Quantify Performance,” use of programming language features, base EDN, Apr. 1, 198 I, pp. 179-265. Original pub- statistics and benchmark program in Ada. lication with benchmarks described in assembler (code listings for LSI-11/23, 8086, 68000, and Weicker, R.P., “Dhrystone Benchmark: Ratio- 28000). nale for Version 2 and Measurement Rules,” SIGPlan Notices, Vol. 23, No. 8, Aug. 1988, pp. Patstone, W., “16-bit-yP Benchmarks - An 49-62. Version 2.0 of Dhrystone (in C), mea- Update withExplanations,”EDN, Sept. 16,1981, surement rules. For the Ada version, a similar pp. 169.203. Discussionof results, updatedcode article appeared in Ada Letters, Vol. 9, No. 5, listings (assembler). July 1989, pp. 60-82. The author can be contacted at Siemens Weicker, R.P., “Understanding Variations in Sieve Nixdorf Information Systems AG, Otto-Hahn- Dhrystone Performance,” Microprocessor Re- Ring 6, W-8000 Munchen 83, Germany; Inter- porr, Vol. 3, No. 5, May 1989, pp. 16-17. What Gilbreath, J., and G. Gilbreat’l, “Eratosthenes net: [email protected]; Eunet: customers should know when C-version results Revisited,” Byte, Jan. 1983, pp. 283-326. Pro- weicker%[email protected].

December 1990 15