View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by DigitalCommons@WPI

Worcester Polytechnic Institute DigitalCommons@WPI

Computer Science Faculty Publications Department of Science

7-1-1998 Touchstone - A Lightweight Processor Mark Claypool Worcester Polytechnic Institute, [email protected]

Follow this and additional works at: http://digitalcommons.wpi.edu/computerscience-pubs Part of the Computer Sciences Commons

Suggested Citation Claypool, Mark (1998). Touchstone - A Lightweight Processor Benchmark. . Retrieved from: http://digitalcommons.wpi.edu/computerscience-pubs/206

This Other is brought to you for free and open access by the Department of Computer Science at DigitalCommons@WPI. It has been accepted for inclusion in Computer Science Faculty Publications by an authorized administrator of DigitalCommons@WPI.

WPI-CS-TR-98-16 July 1998

Touchstone { A Lightweight Pro cessor Benchmark

by

Mark Claypool

Computer Science

Technical Rep ort

Series

WORCESTER POLYTECHNIC INSTITUTE

Computer Science Department

100 Institute Road, Worcester, Massachusetts 01609-2280

Touchstone { A Lightweight Pro cessor Benchmark

Mark Claypool

[email protected]

Worcester Polytechnic Institute

Computer Science Department

August 11, 1998

Abstract

Benchmarks are valuable for comparing pro cessor p erformance. However, p orting and run-

ning pro cessor b enchmarks on new platforms can b e dicult. Touchstone, a simple addition

b enchmark, is designed to overcome p ortability problems, while retaining p erformance measure-

ment accuracy. In this pap er, we present exp erimental results that showTouchstone correlates

strongly with pro cessor p erformance under other b enchmarks. Equally imp ortant, we presenta

measure of p ortability that demonstrates Touchstone is easier to p ort and run than other b ench-

marks. We conclude that while Touchstone is not a replacement for more extensive pro cessor

b enchmarks, it can b e used for quick, reasonably accurate estimations of pro cessor p erformance.

Keywords: b enchmark, pro cessor p erformance, measurement

1 Intro duction

benchmark v. trans - to sub ject a system to a series of tests in order to obtain prear-

ranged results not available on comp etitive systems. { S. Kel ly-Bootle, \The Devil's DP

Dictionary"

Standard measures of p erformance for systems provide a basis for comparison, which

can lead to system improvements and predictions of e ectiveness under other applications. If

computer users ran the same programs day in and day out, p erformance comparisons would b e

straight-forward. As this will never b e the case, systems must rely on other metho ds to predict

p erformance under estimated workloads. The pro cess of p erformance comparison for two or more

systems by measurements is called benchmarking, and the workloads used in the measurements are

called benchmarks. Broadly, there are four typ es of b enchmarks [10,11]:

Real programs. Real programs are applications that users will actually run on the system. Observ-

ing a system running such programs may b e a go o d indication of p erformance under normal

op erations. Examples are language compilers suchasgcc, acc and f77, text-pro cessing to ols

like TeX and Framemaker, and computer-aided design to ols like Spice. 1

Kernels. Kernels are small, key pieces from real programs. Unlike real programs, no user runs

kernel programs, for they exist only for p erformance measurements. They are b est used to

isolate p erformance of key system features. Examples are LINPACK see Section 2 for more

information and the Livermore Loops [14].

Synthetic benchmarks. Synthetic b enchmarks havecharacteristics similar to those of a set of real

programs and can b e applied rep eatedly in a controlled manner. They require no real-world

data les that may b e large and contain sensitive data, and can b e easily p orted without

a ecting their basic op eration. They often have built-in measurement capabilities. Examples

are Spec see Section 2 for more information, [7] and [15].

Toy benchmarks. Toy b enchmarks are small pieces of co de that pro duce results known b efore

hand. They are small, easy to implement, and can b e run on almost any computer. They may

b e used in some real programs, but often are not. Examples are the Sieve of Eratosthenes

and Quicksort see Section 2 for more information.

Although the ab ove b enchmarks may provide valuable system p erformance information, they are

often dicult to run prop erly. Determining the correct mix of real programs is dicult and in-

stalling real programs can b e time-consuming and resource intensive. Gcc, for example, required

more disk space than we had available and demanded a signifcant amount of compilation time see

Section 3.2 for more details. Running kernels can b e dicult without the necessary compilers

and often small changes are required to p ort kernels b etween platforms. LINPACK, for example,

requires a compiler and a user-supplied second function, that caused us p orting dif-

culties see Section 3.2 for more details. Co ding toy b enchmarks can b e surprisingly dicult.

Many computer professionals nd it dicult to accurately co de a Mergesort correctly, despite un-

derstanding the algorithm. Obtaining reputed synthetic programs can b e equally challenging. For

instance, although SPEC is non-pro t, they charge a supp ort and development fee for use of their

b enchmarks. Prices in 1995 were [6]:

Benchmark Typ e Price

CPU intensiveinteger b enchmarks $ 425

CPU intensive oating p oint b enchmarks $ 575

Integer and oating p oint b enchmarks $ 900

System level le server NFS workload $ 1200

UNIX software developmentworkloads $ 1450

Even the basic integer or oating p oint b enchmarks are far to o much for the average graduate

student wishing to simply explore workstation p erformance.

Toovercome some of the diculties in running standard b enchmarks, we present Touchstone,a

lightweight b enchmark for pro cessor p erformance. Touchstone is pro cess that incrementsavariable

in a tight lo op for a xed p erio d of time. The value that it rep orts represents the p ower of the

pro cessor; the higher the numb er, the more p owerful the pro cessor while the lower the numb er,

the less p owerful the pro cessor. Touchstone's b eauty is its simplicity.At its core, Touchstone uses

only 2 lines of high-level co de, a branch and an add. These commands are found on all systems

and in all languages. Its simplicity makes it easy to p ort while its basic functionality provides a

surprisingly accurate measure of system p erformance. 2

Increment instruction b enchmarks, suchasTouchstone, are not new. Historically, pro cessors were

the most exp ensive and the most used system comp onents. Initially, computers had very few

instructions, the most frequent of whichwas addition. Thus, the computer that added faster was

considered a b etter p erformer. As the numb er and complexity of instructions grew, the addition

b enchmark was no longer sucient as the only means of measuring pro cessor p erformance. This

remains true to day, and so Touchstone should not replace existing b enchmarks. However, rather

than b eing abandoned, Touchstone should b e used as a quick, easy way to give an approximate

estimate of pro cessor p erformance.

Wehavetwohyp otheses ab out the usefulness of Touchstone:

Hypothesis:Touchstone is a go o d indicator of pro cessor p erformance under other b ench-

marks.

Hypothesis:Touchstone is easier to p ort than other b enchmarks.

The rest of this pap er is laid out as follows: Section 2 describ es exp eriments we ran to test our

hyp otheses; Section 3 analyzes our exp erimental data and tests our hyp otheses; and Section 4

summarizes our conclusions and presents other p ossible uses for Touchstone.

2 Exp eriments

touchstone noun - a test or criterion for determining the quality or genuineness of a

thing. { Webster's 7th dictionary, on-line.

In order to test our rst hyp othesis, Touchstone is a good indicator of processor performance under

other benchmarks,we need to correlate Touchstone with one b enchmark from each of the four

categories presented in Section 1. This Section describ es our exp eriments to gather data for these

correlation computations.

The four b enchmarks we selected are: gcc, LINPACK, SPEC int and fp and quicksort. To

strengthen our results, we p erform correlations on nine di erent computer workstations. The

workstations are: SGI Crimson, SGI Indigo 2 Extreme, Sun Sparc 10, Sun IPX, Sun Sparc 2, Intel

80486-33, SGI Personal Iris, Sun Sparc 1 and Intel 80386-20. Table 1 summarizes the systems.

The next 5 subsections describ e the exp eriments involving Touchstone and each b enchmark.

2.1 Touchstone

We ran Touchstone on the 9 platforms listed ab ove. Since the countTouchstone rep orts is sensitive

to other pro cesses running concurrently,we p ostulate that lightly loaded machines a must for

accurate measurements. We p erformed the exp eriments on machines with a minimum of other

pro cesses. In order to facilitate nding when time-shared machines are little used, we develop ed

a network script to monitor pro cessor load. Figure 1 depicts the sample output from one of our

script runs. In this case, the script help ed us avoid running Touchstone during the 3 a.m. system 3

Workstation Op erating System Pro cessor Sp eed RAM

SGI Crimson Irix 5.2 100 Mhz 160 Mbytes

SGI Indigo 2 Extreme Irix 5.2 100 Mhz 32 Mbytes

Sun Sparc 10 SunOs 4.1.4 33 Mhz 64 Mbytes

SGI Personal Iris Irix 4.01 20 Mhz 32 Mbytes

Sun IPX SunOs 4.1.4 40 Mhz 40 Mbytes

Sun Sparc 2 SunOs 4.1.4 40 Mhz 30 Mbytes

Sun Sparc 1 SunOs 4.0.2 20 Mhz 20 Mbytes

Intel 80486 1.2.13 33 Mhz 8Mbytes

Intel 80386 Linux 1.2.13 20 Mhz 8Mbytes

Table 1: Exp erimental platforms. Nine di erent systems from three di erent manufacturers were used for the

exp eriments. The 386, 486 and Sparc 1 machines representlow-end workstations. The Iris, IPX and Sparc 2

represent mid-range workstations. The Crimson, Indigo and Sparc 10 represent high-end workstations.

backups. Verifying that the Unix system load after the Touchstone runs was 1 assured us that no

other pro cesses had b een running.

We ran Touchstone for 200 seconds to amortize start-up costs. To determine the numb er 200, we

ran Touchstone for increasing times and recorded the standard deviation of its counts. Figure 2

depicts the standard deviation of a series of Touchstone runs versus the length of the run. The

standard deviations level out just b efore 200 seconds. At this p oint, the standard deviation is

only 0.001 of the mean. Thus, wechose one 200 second Touchstone run as one iteration. Pilot

tests indicated that ve iterations for each machine were needed to achieve reasonable con dence

intervals.

Touchstone can increment a oating p ointvariable or an integer variable. We call Touchstone

with a oating p ointvariable TouchFp and Touchstone with an integer variable TouchInt. The

Touchstone values we measured are presented in Table 2.

2.2 Quicksort

We implemented a standard quicksort algorithm, as found in most typical computer algorithms

b o oks. To ensure that p o or choices for the partitioning pivot did not in uence sorting times, we

sorted the same sequence of randomly generated numb ers on each platform. We measured only the

sorting time, not the time to read in the numb ers, using the wall-clo ck. The pseudo-co de follows:

loadarray, array_size

start_time = gettime

quicksortarray, 1, array_size

end_time = gettime

Since wall-clo ck times are sensitive to other running pro cesses, we ran the quicksort exp eriments

on lightly loaded machines, as describ ed in Section 2.1.

The Quicksort times we measured are presented in Table 2. 4 3.5

3

2.5

2 Load 1.5

1

0.5

0 0 1 2 3 4 5 6 7 8

Hours Since Midnight

Figure 1: Pro cessor Load at Night. The horizontal axis is hours since midnight. The vertical axis is the number of

pro cesses in the ready queue. From this graph, it app ears that backups are sometime around 3 a.m. for this system.

This time should b e avoided for running Touchstone exp eriments.

2.3 GCC

The Gnu compiler gcc is a pro duct of the Free Software Foundation. For reasons not related to

its price, it is often preferred by many for system development. The latest gcc source co de is avail-

able from any of the gnu ftp sites, [email protected]:pub/gnu/gcc-*.tar.gz

and [email protected]:packages/gnu/gcc-*.tar.gz.

We ran exp eriments on lightly loaded machines as we describ ed in Section 2.1. The input for gcc

was the LaTeX source. The LaTeX do cument preparation system is a sp ecial version of Donald

Knuth's TeX program. LaTeX adds to TeX a collection of commands that simplify typ esetting by

letting the user concentrate on the structure of the text rather than on formatting commands [12].

LaTeX is available for most workstation platforms. Compilation time was measured using the Unix

time command.

The gcc times we measured are presented in Table 2.

2.4 LINPACK

LINPACK was develop ed by of Argonne National Lab oratory in 1983 [9]. It con-

sists of a numb er of programs that solve dense systems of linear equations. LINPACK attempts to

represent mechanical engineering applications on workstations. LINPACK programs can b e charac-

terized as having a high p ercentage of oating-p oint additions and multiplications. The b enchmark 5 Deviation of Counts for Counter Process 70000

60000

50000

40000

30000 Standard Count Deviation 20000

10000

0 0 50 100 150 200 250 300

Time Counter Ran (in seconds)

Figure 2: Standard Deviation of Touchstone Values. The horizontal axis is the numb er of seconds that Touchstone

ran. The vertical axis is the standard deviation of the runs. The standard deviation app ears to reach a steady state

just b efore 200 seconds. At this p oint, it is 0.001 of the mean.

results are rep orted in Millions of Floating Point Op erations Per Second MFLOPS [11].

The LINPACK b enchmark consists of three parts:

1. LINPACK Benchmark. The standard LINPACK b enchmark is written entirely in Fortran and

must b e run with no changes to the source co de. The problem size is 100x100. It uses an old

algorithm that has low compute intensity and is not optimized in terms of memory bandwidth.

2. Towards Peak Performance, Best E ort. The b est e ort b enchmark seeks to achieve the

maximum p erformance on the workstation by allowing mo di cations to the b enchmark co de.

The problem size is 1000x1000. The b est implementations currently use blo cked solvers that

make ecient use of memory with high compute intensity. However, there are limits on

algorithm selection and use of assembly language to improve p erformance. The LINPACK

do cumentation provides examples of this typ e of solution.

3. ALook at Paral lel Processing. The parallel b enchmark uses a problem size of NxN with N

selected by the user. The b est implementations use high compute intensity algorithms scaled

to a size where interpro cessor communication cost is minimal compared to the computation

cost.

The LINPACK b enchmark results as rep orted in [9] are presented in Table 2. 6

Machine TouchInt TouchFp M ops SPECint SPECfp Quicksort Gcc

Crimson 9008886 5479664 16 58.3 63.4 0.84 -

Indigo 8802542 5367132 15.0 57.6 60.3 0.84 50.5

Sparc10 7044849 3445895 8.9 40.0 41.1 1.87 209.1

IPX 3506958 1702699 4.1 21.8 21.5 3.78 279.8

Sparc2 3580454 1709879 4.0 21.8 18.2 4.48 278.3

486 4096334 1230640 3.0 17.5 15.5 5.17 695.6

Iris 3721732 1331571 2.1 22.4 24.4 3.51 199.8

Sparc1 1737108 663700 1.4 9.5 7.5 9.31 530.7

1

386 638547 1730 0.16 2.15 2.15 23.89 1590.1

Table 2: Exp eriment results. TouchInt and TouchFp are 200 second Touchstone values we measured when the

incrementvariable is an integer variable or a oating p ointvariable, resp ectively. M ops are Millions of Floating

Point Op erations Per Second, as recorded by LINPACK. SPECint and SPECfp are integer and oating p ointintensive

b enchmarks recorded by the SPEC b enchmark suite. Quicksort and Gcc are the run times in seconds for our Quicksort

and Gcc exp eriments.

2.5 SPEC

The Systems Performance Evaluation Co op erative SPEC is a nonpro t corp oration formed by

leading computer vendors to develop a standardized set of b enchmarks. Founders, including

Ap ollo/Hewlett-Packard, DEC, MIPS and Sun, have agreed on a set of real programs and in-

puts that all will run. The founders of this organization b elieve that the user community will

b ene t greatly from an ob jective series of application-oriented tests, which can serve as common

reference p oints and b e considered during the evaluation pro cess. The b enchmarks are meant for

comparing pro cessor sp eeds. The sp ec numb ers are the ratio of the time to run the b enchmarks on

a reference system and the system under test [6].

There are currently two suites of compute-intensive SPEC b enchmarks, measuring the p erformance

of pro cessor, memory system, and compiler co de generation. SPECint and SPECfp represent results

from integer intensive and oating p ointintensive b enchmarks, resp ectively. They normally use

UNIX as the p ortabilityvehicle, but they have b een p orted to other op erating systems as well. The

p ercentage of time sp ent in op erating system and I/O functions is generally negligible.

The SPEC b enchmark results as rep orted in [1] are presented in Table 2.

3 Analysis

In this section we analyze our exp erimental data from Section 2 and test the hyp otheses presented

in Section 1.

3.1 Touchstone as an Indicator of Pro cessor Performance

To test our rst hyp othesis, Touchstone is a good indicator of processor performance under other

benchmarks,we determine the correlation b etween Touchstone and each b enchmark. LINPACK 7

Con dence Con dence

2

Benchmark A Interval B Interval R

LINPACK 2.94e-06 [2.74e-06, 3.15e-06] -7.75e-01 [-1.39e+00, -1.55e-01] 0.99

SPECint 6.64e-06 [5.93e-06, 7.35e-06] -3.21e+00 [-7.07e+00, 6.56e-01] 0.98

SPECfp 1.10e-05 [9.91e-06, 1.20e-05] 2.73e+00 [-3.99e-01, 5.86e+00] 0.98

Quicksort 1.41e-07 [1.06e-07, 1.76e-07] -2.12e-01 [-4.02e-01, -2.22e-02] 0.89

Gcc 1.89e-09 [8.56e-10, 2.93e-09] -2.76e-03 [-7.77e-03, 2.25e-03] 0.67

Table 3: Touchstone linear regression parameters. Parameters A and B are from the equation y= Ax+B.Intervals

are 95 con dence intervals around the parameters A and B.

and SPECfp b oth primarily test oating p oint p erformance, so we compare them with TouchFp

Touchstone incrementing a oating p ointvariable. Similarly, SPECint stresses integer op erations,

so we compare it with TouchInt Touchstone incrementing a long integer variable. Since our

quicksort b enchmark sorted integers, we compare it with TouchInt. Gcc contains 99 integer

op erations [10]. Thus, we correlated gcc with TouchInt.For b enchmarks measured by time, such

as gcc and quicksort, the lower the numb er the more p owerful the pro cessor. Thus, Touchstone

values are inversely prop ortional to the time for gcc and quicksort.

To quantify our comparisons, we apply a simple linear regression mo del through a least-squares

line t for each b enchmark. We calculate the regression parameters A and B from the equation:

y=Ax+B. This gives us a line predicting the correlation b etween Touchstone and the other

b enchmarks. In addition, we calculate a 95 con dence interval surrounding each parameter. This

gives us a range for p erformance predictions for di erent pro cessors having di erentTouchstone

counts. The numeric results are given in Table 3. Figures 3 through 7 depict the results graphically.

Without linear regression, the mean of Touchstone and the other b enchmarks can b e used for

the predicted correlation. The quantity is called the total sum of squares. It can b e broken

into two parts: the variation not explained by the regression and the variation explained by the

regression. The fraction of the variation that is explained by the regression is called the coecient

2 2

of determination, R . The \go o dness" of a regression is measured by R . The higher the value of

2

R , the b etter the regression. If the regression is p erfect in the sense that all observed values are

2

equal to those predicted by the mo del, R will b e one. On the other hand, if the regression mo del

2 2 2

is completely bad, R will b e zero. R values ab ove .8 are said to have a strong correlation, R

2

values ab ovebetween .5 and .8 are considered having a medium correlation, and R values b elow

.5 havea weak correlation [8]. Table 3 lists the co ecient of determination for Touchstone and

each b enchmark and Figure 8 gives a graphical representation of the co ecient of determination

for each b enchmark.

Touchstone correlates strongly with all the b enchmarks with the exception of gcc. Most likely this

is b ecause Touchstone is a pro cessor intensive b enchmark while gcc re ects the p erformance of

more of the system, including Mbytes of RAM and disk sp eed. 8 16 Crimson Indigo Sparc10 14 IPX Sparc2 486 Iris 12 Sparc1 386

10

Mflops 8

6

4

2

0 0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06

Touchstone Fp Count

Figure 3: LINPACK M ops vs. TouchFp. The horizontal axis is the Touchstone count. The vertical axis is the

M ops score. The middle line is the least-squares line t. The upp er and lower curving lines represent the 95

con dence interval around the least-squares line t.

3.2 Touchstone Portability

To test our second hyp othesis, Touchstone is easier to port than other benchmarks,we create a

p ortability function that takes into account a level of e ort to run each b enchmark. Our level of

e ort is based on the numb er of user steps required to p ort the b enchmark, the amount of time it

to ok to compile, and the disk space required. Weweight each comp onent roughly based on a dollar

cost:

 Change is the numb er of line changes necessary to either the source co de or the Makefile in

order to ready the b enchmark for a new platform. Change also includes the numb er of steps

required to do an installation and run. We assume eachchange takes 5 human minutes. We

weighthuman time at $50/hour, so eachchange costs ab out $12.

 Compilation is the numb er of minutes to compile the b enchmark. Weweight machine time at

$1 p er hour, so each minute of compilation costs ab out $.16.

 Space is the number of Mbytes required to supp ort the b enchmark, including source co de,

compiled co de and output data. Weweight space at $1 p er Mbyte.

The total measure of p ortability is the weighted sum of each of the ab ove comp onents for eachof

the 5 op erating systems listed in Table 1 Irix 5.2, SunOs 4.1.4, Irix 4.01, SunOs 4.0.2 and Linux

1.2.13: 9 60 Crimson Indigo Sparc10 IPX 50 Sparc2 486 Iris Sparc1 386 40

30 SpecInt

20

10

0 0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06

Touchstone Int Count

Figure 4: SPECInt vs. TouchInt. The horizontal axis is the Touchstone count. The vertical axis is the SPECint

score. The middle line is the least-squares line t. The upp er and lower curving lines represent the 95 con dence

interval around the least-squares line t.

P or tabil ity = C hang e  12 + C ompil ation  :16 + S pace  1

The SPEC b enchmarks were unavailable to us b ecause of their prohibitive costs. Their p ortability

was the dollar cost for the integer and oating p oint b enchmarks see Section 1.

Table 4 summarizes the p ortability for each b enchmark and Figure 9 shows the total p ortability

graphically.

The installation of gcc was very resource intensive. The distribution alone required over 20 Mbytes

of disk space. Several of the systems we ran our exp eriments on had disk quotas in e ect. On these

Benchmark Change Compilation Space Portability

Touchstone 5 10 .2 62

Quicksort 5 20 1.5 65

LINPACK 21 30 2.8 260

Gcc 18 400 30 310

SPEC - - - 900

Table 4: A Measure of Portability. \Change" is the numb er of line changes necessary to ready the b enchmark for a

new platform. \Compilation " is the numb er of minutes to compile the b enchmark. \Space" is the numberofMbytes

required to supp ort the b enchmark. \Portability" is a weighted sum of Change, Compilation and Space. 10 Crimson 60 Indigo Sparc10 IPX Sparc2 486 50 Iris Sparc1 386

40 SpecFp 30

20

10

0 0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06

Touchstone Fp Count

Figure 5: SPECfp vs. TouchFp. The horizontal axis is the Touchstone count. The vertical axis is the SPECfp

score. The middle line is the least-squares line t. The upp er and lower curving lines represent the 95 con dence

interval around the least-squares line t.

systems, we had to receive assistance from the system administrators in running our b enchmarks.

From the ab ove data, we nd Touchstone is the most p ortable of the b enchmarks tested.

4 Conclusion

Benchmarks provide a standard measure of p erformance that demonstrate system improvements

and allow users to make informed system choices. Unfortunately, when b enchmark results are

unavailable for a system, obtaining and running typical b enchmarks can b e dicult or imp ossible.

Touchstone can overcome some of the diculties in running standard b enchmarks. Touchstone

is a pro cess that increments a variable in a tight lo op for a xed p erio d of time. This simplicity

makes Touchstone easier to p ort than other b enchmarks. The countvalue that Touchstone rep orts

represents the p ower of the pro cessor. Surprisingly, this simple measure of p erformance has a strong

correlation with pro cessor p erformance under other b enchmarks. Thus, Touchstone can b e used

as an quick, accurate easy-to-use indicator of pro cessor p erformance. Moreover, you can convert

Touchstone results into common b enchmark results by using the linear equations in Table 3. 11 1.2 Crimson Indigo Sparc10 IPX Sparc2 1 486 Iris Sparc1 386 0.8

0.6

0.4 1 / (Quicksort Time in Milliseconds)

0.2

0 0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06

Touchstone Int Count

Figure 6: Quicksort vs. TouchInt. The horizontal axis is the Touchstone count. The vertical axis is the 1 over the

quicksort score. The middle line is the least-squares line t. The upp er and lower curving lines represent the 95

con dence interval around the least-squares line t.

4.1 A Grain of Salt

All of the Touchstone measurements should b e taken with a grain of salt, for they are largely

dep endent up on the compiler used. For example, compiling Touchstone on the Sun IPX with gcc

resulted in a count 1/3 lower than the Touchstone count rep orted ab ove using acc. In addition,

compiler optimizations, such as lo op unrolling and instruction reordering to exploit instruction slots

available after delayed branch instructions, may increase Touchstone counts even more. According

to John Larson and David Bailey, such manipulative b enchmarking b ecomes \b enchmarketing"

when the normally ob jectiveevaluation pro cess b ecomes awed incomplete, irrelevant, or invalid

[13, 2]. Therefore, this weakness in Touchstone is to b e exp ected. Standard b enchmarks suchas

LINPACK and SPEC always list the compiler names and ags that the b enchmark was tested under

when rep orting b enchmark p erformance. Ultimately, it is up to the user to apply Touchstone

prop erly if a reasonably accurate indicator of pro cessor p erformance is to b e exp ected.

4.2 Other Applications of Touchstone

It is easy to extend Touchstone to study the e ects that other p eripherals and other pro cesses on

pro cessor p erformance. This use of Touchstone is depicted in Figure 10. The Touchstone counton

the bare machine gives us the pro cessor p otential for the machine, depicted by the single column

on the far left. We then run Touchstone with the other comp onentwe wish to test. The di erence

in the bare Touchstone count and the comp onentTouchstone count is the comp onent-induced load. 12 Gcc vs. Touchstone

0.02 Indigo Sparc10 IPX Sparc2 486 Iris Sparc1 0.015 386

0.01 1 / (Gcc Time in Milliseconds)

0.005

0 0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06

Touchstone Int Count

Figure 7: Gcc vs. TouchInt. The horizontal axis is the Touchstone count. The vertical axis is the 1 over the gcc

score. The middle line is the least-squares line t. The upp er and lower curving lines represent the 95 con dence

interval around the least-squares line t.

1

0.9 strong correlation 0.8

0.7 medium correlation 0.6

0.5 weak correlation 0.4 Coefficient of Determination 0.3

0.2

0.1

0

Mflops SpecInt SpecFp Quicksort Gcc

Figure 8: Correlation with Touchstone. The horizontal axis is the co ecient of determination. The vertical axis is

the b enchmark. The horizontal lines represent the weak, mo derate and strong correlation b oundaries as indicated. 13 1000

800

600 Portability 400

200

0

Touchstone Quicksort Linpack Gcc Spec

Figure 9: Total Portability. The weighted sum of Change, Compilation and Space is shown for each b enchmark.

For example, the middle two columns show the count obtained if twoTouchstone pro cesses were

run simultaneously. On a Sun IPX, eachTouchstone would count to ab out 0.85 million p er second.

In the last two columns, the pro cessor load of the send pro cess would b e the count obtained on a

bare machine less the count obtained by the Touchstone pro cess. If the Sun IPX were sending data

at a rate of a 1280 byte packet every 160 ms, the countwould b e ab out 1.69 million in one second.

The di erence b etween the 1.7 million barecount and the 1.69 million send count can b e converted

into milliseconds of pro cessor load, resulting in a send load of ab out 1 millisecond p er second. As

another example, a Touchstone count on a system with a SCSI disk having a faulty parity bit can

2

b e compared to a Touchstone run on the same system with a healthy SCSI disk. The di erence

in the values rep orted is the pro cessor degradation due to the failing disk.

Toverify that Touchstone do es indeed accurately rep ort load of pro cessor b ound pro cesses with

which it runs concurrently,we ran Touchstone with 1, 2, 3 and 4 other Touchstones. We exp ect the

countvalue to b e count on bare machine / N+1, where N is the numb er of other counters

running. Figure 11 depicts the result of this exp eriment. At all p oints, the predicted values

were within the con dence intervals for the measured values. In fact, wehave used Touchstone

successfully extensively to measure the pro cessor load of manylow-level systems [4,3, 5]. Note

that in instances where pro cessor degradation was very slight, single-user mo de was necessary to

determine exact p erformance.

In time sharing systems, Touchstone can also b e used to analyze computing p ower. Touchstone

2

In fact, we used Touchstone for just this purp ose. The ailing disk caused a p erformance degradation of 20

b ecause of the manyinterrupts it generated. 14 Barecount

Touchstone Touchstone Touchstone Touchstone Send

Figure 10: Touchstone to Measure Pro cess Load. Touchstone is run on a quiet machine to get the \barecount" as

depicted on the left. Touchstone is then run with the comp onent to b e tested, such as another Touchstone as depicted

in the middle or a pro cess that sends packets as depicted on the right. The pro cessor load of the comp onentisthe

barecount less the count obtained by the Touchstone pro cess.

IPX Loads With Multiple Touchstones 3.5e+08 predicted meaured 95% confidence interval 3e+08

2.5e+08

2e+08 Count

1.5e+08

1e+08

5e+07 0 1 2 3 4 5

Number of Extra Touchstones

Figure 11: Sun IPX loads with multiple Touchstone pro cesses. The p oints are from 5 samples each with 95

con dence intervals. The largest interval is 8 of the measured value. The curving line is the predicted value. At all

p oints, the predicted values were within the con dence intervals for the measured values. 15

can b e run several times during the day. The mean value rep orted indicates the mean available

pro cessor p ower. In this sense, the Touchstone value on a lightly loaded machine will give the

p eak pro cessor p ower. To determine pro cessor utilization, Touchstone can b e run as a low-priority

3

pro cess, through nice or some other means. The value rep orted represents the amount of \unused"

pro cessor and can b e used to estimate the numb er of additional users a time-sharing system may

supp ort b efore the pro cessor b ecomes the critical resource.

4.3 Future Work

There are several p ossible areas for future work:

 Other Workstations. We ran Touchstone on most of the workstation platforms that were

available to us. Future work mightinvolve seeking assistance from other researchers, p erhaps

even the Internet community at large, in running Touchstone on other workstations.

 Other Benchmarks.We correlated Touchstone with one b enchmark from each of the b ench-

mark categories presented in Section 1. Future work mightinvolve correlating Touchstone to

additional b enchmarks within those categories.

 Multi-Processors.Touchstone may not give a go o d indication of p erformance for machines

that have more than one pro cessor. For although a counting lo op can certainly b e pip elined,

splitting the work to multiple pro cessors may b e dicult since counting is inherently sequential.

Future work mayinvolve researchinto a counter that utilizes multiple pro cessors.

References

[1] The Performance Database Server. 1997. The URL for this do cument can b e found at

http://netlib2.cs.utk.edu/p erformance/html/PDStop.html.

[2] David H. Bailey.How to fo ol the masses when giving p erformance results on parallel computers.

Supercomputer, 85:4 { 7, Septemb er 1991.

[3] M. Clayp o ol, J. Riedl, J. Carlis, G. Wilcox, R. Elde, E. Retzel, A. Georgop oulos, J. Pardo,

K. Ugurbil, B. Miller, and C. Honda. Network requirements for 3D ying in a zo omable brain

database. IEEE JSACSpecial Issue on Gigabit Networking, 135, June 1995.

[4] Mark Clayp o ol and John Riedl. Silence is golden? The e ects of silence deletion on the CPU

load of an audio conference. In Proceedings of IEEE Multimedia, Boston, May 1994.

[5] Mark Clayp o ol and John Riedl. A quality planning mo del for distributed multimedia in the

virtual co ckpit. In Proceedings of ACM Multimedia, pages 253 { 264, Novemb er 1996.

[6] Standard Performance Evaluation Corp oration. SPEC faq. Decemb er 1995. The SPEC Primer

is frequently p osted to the newsgroup comp.b enchmarks. SPEC information can also b e found

at http://www.sp ecb ench.org/.

[7] H.J. Curnow and B.A Wichmann. A synthetic b enchmark. The Computer Journal, 191,

1976.

3

Pilot tests suggest that a maximally niced Touchstone running on a machine with a pro cessor b ound pro cess

will only count 4 as high as an unloaded Touchstone. 16

[8] Jay Devore and Roxy Peck. Statistics { The Exploration and Analysis of Data.Wadsworth,

Inc., second edition edition, 1993.

[9] Jack J. Dongarra. Performance of various computers using standard linear equations software.

Technical Rep ort CS-89-85, UniversityofTennessee, February 1994. To obtain a p ostscript

copy of the rep ort send mail to [email protected] and in the message typ e: send p erformance

from b enchmark.

[10] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach.

Morgan Kaufmann Publishers, Inc., 1990.

[11] Ra j Jain. The Art of Computer Systems PerformanceAnalysis. John Wiley and Sons, Inc.,

1991.

[12] Leslie Lamp ort. The Latex Document Preparation System/User's Guide and Reference Manual.

Addison-Wesley Publishing Company, 1985.

[13] John L. Larson. Benchmarks. Newsletter of IEEE Computer Society Technical Comitteeon

Supercomputer Applications, 61:3 { 4, June 1992.

[14] F.M McMahon. The livermore fortran kernels: A computer test of numerical p erformance

range. Technical Rep ort UCRL-55745, Lawrence Livermore National Lab oratory, University

of California, Decemb er 1986.

[15] R.P.Weicker. Dhrystone: A synthetic systems programming b enchmark. Communications of

the ACM, 2710:1013 { 1030, Octob er 1984. 17