THE NAS PARALLEL BENCHMARKS

D. H. Bailey[1], E. Barszcz[1], J. T. Barton[1], D. S. Browning[2], R. L. Carter, L. Dagum[2], R. A. Fatoohi[2], P. O. Frederickson[3], T. A. Lasinski[1], R. S. Schreiber[3], H. D. Simon[2], V. Venkatakrishnan[2] and S. K. Weeratunga[2]

NAS Applied Research Branch
NASA Ames Research Center, Mail Stop T045-1
Moffett Field, CA 94035

Ref: Intl. Journal of Supercomputer Applications, vol. 5, no. 3 (Fall 1991), pp. 66-73

Abstract

A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of five "parallel kernel" benchmarks and three "simulated application" benchmarks. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics applications.

The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

[1] This author is an employee of NASA Ames Research Center.

[2] This author is an employee of Computer Sciences Corporation. This work is supported through NASA Contract NAS 2-12961.

[3] This author is an employee of the Research Institute for Advanced Computer Science (RIACS). This work is supported by the NAS Systems Division via Cooperative Agreement NCC 2-387 between NASA and the Universities Space Research Association.

1 Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is a large scale effort to advance the state of computational aerodynamics. Specifically, the NAS organization aims "to provide the Nation's aerospace research and development community by the year 2000 a high-performance, operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours" ([4], page 3). The successful solution of this "grand challenge" problem will require the development of computer systems that can perform the required complex scientific computations at a sustained rate nearly one thousand times greater than current generation supercomputers can now achieve. The architecture of computer systems able to achieve this level of performance will likely be dissimilar to the shared memory multiprocessing supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the system will consist of at least 1,000 processors computing in parallel.

Highly parallel systems with computing power roughly equivalent to traditional shared memory multiprocessors exist today. Unfortunately, for various reasons, the performance evaluation of these systems on comparable types of scientific computations is very difficult. Little relevant data is available for the performance of algorithms of interest to the computational aerophysics community on many currently available parallel systems. Benchmarking and performance evaluation of such systems has not kept pace with advances in hardware, software and algorithms. In particular, there is as yet no generally accepted benchmark program or even a benchmark strategy for these systems.

The popular "kernel" benchmarks that have been used for traditional vector supercomputers, such as the Livermore Loops [12], the LINPACK benchmark [9, 10] and the original NAS Kernels [7], are clearly inappropriate for the performance evaluation of highly parallel machines. First of all, the tuning restrictions of these benchmarks rule out many widely used parallel extensions. More importantly, the computation and memory requirements of these programs do not do justice to the vastly increased capabilities of the new parallel machines, particularly those systems that will be available by the mid-1990s.

On the other hand, a full scale scientific application is similarly unsuitable. First of all, porting a large program to a new parallel computer architecture requires a major effort, and it is usually hard to justify a major research task simply to obtain a benchmark number. For that reason we believe that the otherwise very successful PERFECT Club benchmark [11] is not suitable for highly parallel systems. This is demonstrated by the very sparse performance results for parallel machines in the recent reports [13, 14, 8]. Alternatively, an application benchmark could assume the availability of automatic software tools for transforming "dusty deck" source into efficient parallel code on a variety of systems. However, such tools do not exist today, and many scientists doubt that they will ever exist across a wide range of architectures.

Some other considerations for the development of a meaningful benchmark for a highly parallel supercomputer are the following:

- Advanced parallel systems frequently require new algorithmic and software approaches, and these new methods are often quite different from the conventional methods implemented in source code for a sequential or vector machine.

- Benchmarks must be "generic" and should not favor any particular parallel architecture. This requirement precludes the usage of any architecture-specific code, such as message passing code.

- The correctness of results and performance figures must be easily verifiable. This requirement implies that both input and output data sets must be kept very small. It also implies that the nature of the computation and the expected results must be specified in great detail.

- The memory size and run time requirements must be easily adjustable to accommodate new systems with increased power.

- The benchmark must be readily distributable.

In our view, the only benchmarking approach that satisfies all of these constraints is a "paper and pencil" benchmark. The idea is to specify a set of problems only algorithmically. Even the input data must be specified only on paper. Naturally, the problem has to be specified in sufficient detail that a unique solution exists, and the required output has to be brief yet detailed enough to certify that the problem has been solved correctly. The person or persons implementing the benchmarks on a given system are expected to solve the various problems in the most appropriate way for the specific system. The choice of data structures, algorithms, processor allocation and memory usage are all (to the extent allowed by the specification) left open to the discretion of the implementer. Some extension of Fortran or C is required, and reasonable limits are placed on the usage of assembly code and the like, but otherwise programmers are free to utilize language constructs that give the best performance possible on the particular system being studied.

To this end, we have devised a number of relatively simple "kernels", which are specified completely in [6]. However, kernels alone are insufficient to completely assess the performance potential of a parallel machine on real scientific applications. The chief difficulty is that a certain data structure may be very efficient on a certain system for one of the isolated kernels, and yet this data structure would be inappropriate if incorporated into a larger application. In other words, the performance of a real computational fluid dynamics (CFD) application on a parallel system is critically dependent on data motion between computational kernels. Thus we consider the complete reproduction of this data movement to be of critical importance in a benchmark.

Our benchmark set therefore consists of two major components: five parallel kernel benchmarks and three simulated application benchmarks. The simulated application benchmarks combine several computations in a manner that resembles the actual order of execution in certain important CFD application codes. This is discussed in more detail in [6].

We feel that this benchmark set successfully addresses many of the problems associated with benchmarking parallel machines. Although we do not claim that this set is typical of all scientific computing, it is based on the key components of several large aeroscience applications used by scientists on supercomputers at NASA Ames Research Center. These benchmarks will be used by the Numerical Aerodynamic Simulation (NAS) Program to evaluate the performance of parallel computers.

2 Benchmark Rules

2.1 Definitions

In the following, the term "processor" is defined as a hardware unit capable of integer and floating point computation. The "local memory" of a processor refers to randomly accessible memory with an access time (latency) of less than one microsecond. The term "main memory" refers to the combined local memory of all processors. This includes any memory shared by all processors that can be accessed by each processor in less than one microsecond. The term "mass storage" refers to non-volatile randomly accessible storage media that can be accessed by at least one processor within forty milliseconds. A "processing node" is defined as a hardware unit consisting of one or more processors plus their local memory, which is logically a single unit on the network that connects the processors.

The term "computational nodes" refers to those processing nodes primarily devoted to high-speed floating point computation. The term "service nodes" refers to those processing nodes primarily devoted to system operations, including compilation, linking and communication with external computers over a network.

2.2 General Rules

Implementations of these benchmarks must be based on either Fortran-77 or C, although a wide variety of parallel extensions are allowed. This requirement stems from the observation that Fortran and C are the programming languages most commonly used by the scientific community at the present time. If in the future other languages gain wide acceptance in this community, they will be considered for inclusion in this group. Assembly language and other low-level languages and constructs may not be used, except that certain specific vendor-supported assembly-coded routines may be called (see section 2.3).

We are of the opinion that such language restrictions are necessary, because otherwise considerable effort would be made by benchmarkers in low-level or assembly-level coding. Then the benchmark results would tend to reflect the amount of programming resources available to the benchmarking organization, rather than the fundamental merits of the parallel system. Certainly the mainstream scientists that these parallel computers are intended to serve will be coding applications at the source level, almost certainly in Fortran or C, and thus these benchmarks are designed to measure the performance that can be expected from such code.

Accordingly, the following rules must be observed in any implementations of the NAS Parallel Benchmarks:

- All floating point operations must be performed using 64-bit floating point arithmetic.

- All benchmarks must be coded in either Fortran-77 [1] or C [3], with certain approved extensions.

- Implementations of the benchmarks may not mix Fortran-77 and C code; one or the other must be used.

- Any extension of Fortran-77 that is in the Fortran-90 draft dated June 1990 or later [2] is allowed.

- Any extension of Fortran-77 that is in the Parallel Computer Fortran (PCF) draft dated March 1990 or later [5] is allowed.

- Any language extension or library routine that is employed in any of the benchmarks must be supported by the vendor and available to all users.

- Subprograms and library routines not written in Fortran or C may only perform certain functions, as indicated in the next section.

- All rules apply equally to subroutine calls, language extensions and compiler directives (i.e. special comments).

2.3 Allowable Language Extensions and Library Routines

The following language extensions and library routines are permitted:

- Constructs that indicate sections of code that can be executed in parallel or loops that can be distributed among different computational nodes.

- Constructs that specify the allocation and organization of data among or within computational nodes.

- Constructs that communicate data between processing nodes.

- Constructs that communicate data between the computational nodes and service nodes.

- Constructs that rearrange data stored in multiple computational nodes, including constructs to perform indirect addressing and array transpositions.

- Constructs that synchronize the action of different computational nodes.

- Constructs that initialize for a data communication or synchronization operation that will be performed or completed later.

- Constructs that perform high-speed input or output operations between main memory and the mass storage system.

- Constructs that perform any of the following array reduction operations on an array either residing within a single computational node or distributed among multiple nodes: +, x, MAX, MIN, AND, OR, XOR.

- Constructs that combine communication between nodes with one of the operations listed in the previous item.

- Constructs that perform any of the following computational operations on arrays either residing within a single computational node or distributed among multiple nodes: dense matrix-matrix multiplication, dense matrix-vector multiplication and one-dimensional, two-dimensional or three-dimensional fast Fourier transforms. Such routines must be callable with general array dimensions.

3 The Benchmarks: A Condensed Overview

After an evaluation of a number of large scale CFD and computational aerosciences applications on the NAS supercomputers at NASA Ames, five medium-sized computational problems were selected as the "parallel kernels". In addition to these problems, three different implicit solution schemes were added to the benchmark set. These schemes are representative of CFD codes currently in use at NASA Ames Research Center in that they mimic the computational activities and data motions typical of real CFD applications. They do not include the typical pre- and postprocessing of real applications, nor do they include I/O. Boundary conditions are also handled in a greatly simplified manner. For a detailed discussion of the differences between the simulated application benchmarks and real CFD applications, see Chapter 3 of [6].

Even the five parallel kernel benchmarks involve substantially larger computations than many previous benchmarks, such as the Livermore Loops or LINPACK, and therefore they are more appropriate for the evaluation of parallel machines. They are sufficiently simple that they can be implemented on a new system without unreasonable effort and delay. The three simulated application benchmarks require somewhat more effort to implement, but constitute a rigorous test of the usability of a parallel system to perform state-of-the-art CFD computations.

3.1 The Eight Benchmark Problems

The following gives an overview of the benchmarks. The first five are the parallel kernel benchmarks, and the last three are the simulated application benchmarks. Space does not permit a complete description of all of these; a detailed description of these benchmark problems is given in [6].

EP: An "embarrassingly parallel" kernel, which evaluates an integral by means of pseudorandom trials. This kernel, in contrast to others in the list, requires virtually no interprocessor communication.

MG: A simplified multigrid kernel. This requires highly structured long distance communication and tests both short and long distance data communication.

CG: A conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long distance communication, employing unstructured matrix-vector multiplication.

FT: A 3-D partial differential equation solution using FFTs. This kernel performs the essence of many "spectral" codes. It is a rigorous test of long-distance communication performance.

IS: A large integer sort. This kernel performs a sorting operation that is important in "particle method" codes. It tests both integer computation speed and communication performance.

LU: A regular-sparse, block (5 x 5) lower and upper triangular system solution. This problem represents the computations associated with the implicit operator of a newer class of implicit CFD algorithms, typified at NASA Ames by the code "INS3D-LU". This problem exhibits a somewhat limited amount of parallelism compared to the next two.

SP: Solution of multiple, independent systems of non diagonally dominant, scalar, pentadiagonal equations. SP and the following problem BT are representative of computations associated with the implicit operators of CFD codes such as "ARC3D" at NASA Ames. SP and BT are similar in many respects, but there is a fundamental difference with respect to the communication to computation ratio.

BT: Solution of multiple, independent systems of non diagonally dominant, block tridiagonal equations with a (5 x 5) block size.

3.2 The Embarrassingly Parallel Benchmark

In order to give the reader a flavor of the problem descriptions in [6], a detailed definition will be given for the first problem, the "embarrassingly parallel" benchmark:

Set $n = 2^{28}$ and $s = 271828183$. Generate the pseudorandom floating point values $r_j$ in the interval $(0, 1)$ for $1 \le j \le 2n$ using the scheme described below. Then for $1 \le j \le n$ set $x_j = 2 r_{2j-1} - 1$ and $y_j = 2 r_{2j} - 1$. Thus $x_j$ and $y_j$ are uniformly distributed on the interval $(-1, 1)$.

Next set $k = 0$, and beginning with $j = 1$, test to see if $t_j = x_j^2 + y_j^2 \le 1$. If not, reject this pair and proceed to the next $j$. If this inequality holds, then set $k \leftarrow k + 1$, $X_k = x_j \sqrt{(-2 \log t_j)/t_j}$ and $Y_k = y_j \sqrt{(-2 \log t_j)/t_j}$, where $\log$ denotes the natural logarithm. Then $X_k$ and $Y_k$ are independent Gaussian deviates with mean zero and variance one. Approximately $n\pi/4$ pairs will be constructed in this manner.

Finally, for $0 \le l \le 9$ tabulate $Q_l$ as the count of the pairs $(X_k, Y_k)$ that lie in the square annulus $l \le \max(|X_k|, |Y_k|) < l + 1$, and output the ten $Q_l$ counts. Each of the ten $Q_l$ counts must agree exactly with reference values.

The $2n$ uniform pseudorandom numbers $r_j$ mentioned above are to be generated according to the following scheme: Set $a = 5^{13}$ and let $x_0 = s$ be the specified initial "seed". Generate the integers $x_k$ for $1 \le k \le 2n$ using the linear congruential recursion

$$x_{k+1} = a \, x_k \ (\mathrm{mod}\ 2^{46})$$

and return the numbers $r_k = 2^{-46} x_k$ as the results. Observe that $0 < r_k < 1$ and the $r_k$ are very nearly uniformly distributed on the unit interval.
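To make the above specification concrete, the following short C program (C being one of the two permitted base languages) sketches a sequential implementation of the complete kernel, combining the Gaussian-pair construction with the linear congruential generator just described. It is purely illustrative and not one of the official sample codes: the problem size is scaled down to n = 2^16 so that it runs in moments, the helper name lcg_next is our own, and the modular multiplication uses a 128-bit integer intermediate (a GCC/Clang extension) rather than the split-precision arithmetic a portable Fortran-77 implementation would require.

    /* Illustrative sequential sketch of the EP benchmark (not an official
       sample code). Scaled down to n = 2^16; the full problem uses n = 2^28. */
    #include <stdio.h>
    #include <math.h>
    #include <stdint.h>

    #define MOD46 ((uint64_t)1 << 46)

    /* One step of x_{k+1} = a * x_k (mod 2^46), with a = 5^13. The product
       can exceed 64 bits, so a 128-bit intermediate (GCC/Clang extension)
       is used to avoid overflow. */
    static uint64_t lcg_next(uint64_t x) {
        return (uint64_t)(((unsigned __int128)1220703125ULL * x) % MOD46);
    }

    int main(void) {
        const long n = 1L << 16;        /* spec: n = 2^28 */
        uint64_t x = 271828183ULL;      /* initial seed s */
        long q[10] = {0}, k = 0;

        for (long j = 1; j <= n; j++) {
            x = lcg_next(x);            /* r_{2j-1} = 2^-46 * x */
            double xj = 2.0 * ldexp((double)x, -46) - 1.0;
            x = lcg_next(x);            /* r_{2j} */
            double yj = 2.0 * ldexp((double)x, -46) - 1.0;

            double t = xj * xj + yj * yj;
            if (t > 1.0) continue;      /* reject pairs outside the unit disk */

            /* Accepted pair: two independent Gaussian deviates */
            double f = sqrt(-2.0 * log(t) / t);
            double X = xj * f, Y = yj * f;
            k++;

            int l = (int)fmax(fabs(X), fabs(Y));   /* square annulus index */
            if (l < 10) q[l]++;
        }

        printf("pairs accepted: %ld (expect about n*pi/4 = %.0f)\n",
               k, 0.25 * 3.141592653589793 * (double)n);
        for (int l = 0; l < 10; l++)
            printf("Q[%d] = %ld\n", l, q[l]);
        return 0;
    }

At the full problem size a real implementation would of course distribute disjoint segments of the pseudorandom sequence across processors, which is exactly what the property described next makes possible.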

An important feature of this pseudorandom number generator is that any particular value $x_k$ of the sequence can be computed directly from the initial seed $s$ by using the binary algorithm for exponentiation, taking remainders modulo $2^{46}$ after each multiplication. The importance of this property for parallel processing is that numerous separate segments of a single, reproducible sequence can be generated on separate processors of a multiprocessor system. Many other widely used schemes for pseudorandom number generation do not possess this important property.
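A minimal C sketch of this skip-ahead, again illustrative only (the names mulmod46 and lcg_skip are ours): since $x_k = a^k s \ (\mathrm{mod}\ 2^{46})$, the $k$-th element is reachable in $O(\log k)$ modular multiplications, so a processor assigned the segment beginning at index $k$ can seed its own generator directly.

    #include <stdio.h>
    #include <stdint.h>

    #define MOD46 ((uint64_t)1 << 46)

    /* (u * v) mod 2^46, using a 128-bit intermediate (GCC/Clang extension) */
    static uint64_t mulmod46(uint64_t u, uint64_t v) {
        return (uint64_t)(((unsigned __int128)u * v) % MOD46);
    }

    /* Compute x_k = a^k * s (mod 2^46) directly, via binary exponentiation */
    static uint64_t lcg_skip(uint64_t k) {
        uint64_t pow = 1, base = 1220703125ULL;   /* a = 5^13 */
        while (k > 0) {
            if (k & 1) pow = mulmod46(pow, base);
            base = mulmod46(base, base);          /* square and reduce */
            k >>= 1;
        }
        return mulmod46(pow, 271828183ULL);       /* multiply by seed s */
    }

    int main(void) {
        /* Cross-check the direct formula against plain iteration */
        uint64_t x = 271828183ULL;
        for (int i = 0; i < 1000; i++) x = mulmod46(1220703125ULL, x);
        printf("iterated:   %llu\n", (unsigned long long)x);
        printf("skip-ahead: %llu\n", (unsigned long long)lcg_skip(1000));
        return 0;
    }

The two printed values agree, confirming that independent processors seeded this way reproduce disjoint pieces of the single global sequence.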

Additional information and references for this benchmark problem are given in [6].

4 Sample Codes

The intent of the NAS Parallel Benchmarks report is to completely specify the computation to be carried out. Theoretically, a complete implementation, including the generation of the correct input data, could be produced from the information in this paper. However, the developers of these benchmarks are aware of the difficulty and time required to generate a correct implementation from scratch in this manner. Furthermore, despite several reviews, ambiguities in the technical paper may exist that could delay implementations.

In order to reduce these difficulties and to aid the benchmarking specialist, Fortran-77 computer programs implementing the benchmarks are available. These codes are to be considered examples of how the problems could be solved on a single processor system, rather than statements of how they should be solved on an advanced parallel system. The sample codes actually solve scaled down versions of the benchmarks that can be run on many current generation workstations. Instructions are supplied in comments in the source code on how to scale up the program parameters to the full size benchmark specifications.

These programs, as well as the benchmark document itself, are available from the following address: Applied Research Branch, NAS Systems Division, Mail Stop T045-1, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Codes. The sample codes are provided on Macintosh floppy disks and contain the Fortran source codes, "ReadMe" files, input data files, and reference output data files for correct implementations of the benchmark problems. These codes have been validated on a number of computer systems ranging from conventional workstations to supercomputers.

Table 1 lists approximate run times and memory requirements of the sample code problems, based on one-processor Cray Y-MP implementations. Table 2 contains similar information for the full-sized benchmark problems. The unit "Mw" in Tables 1 and 2 refers to one million 64-bit words. Note that performance in MFLOPS is meaningless for the integer sort (IS) benchmark and is therefore not given. An explanation of the entries in the problem size column can be found in the corresponding sections describing the benchmarks in [6].

5 Submission of Benchmark Results

It should be emphasized again that the sample codes described in section 4 are not the benchmark codes, but only implementation aids. For the actual benchmarks, the sample codes must be scaled to larger problem sizes. The sizes of the current benchmarks were chosen so that implementations are possible on currently available supercomputers. As parallel computer technology progresses, future releases of these benchmarks will specify larger problem sizes.

The authors and developers of these benchmarks encourage submission of performance results for the problems listed in Table 2.

Table 1: NAS Parallel Benchmarks Sample Codes. (Times and MFLOPS for one processor of the Cray Y-MP)

Benchmark code                  Problem size   Memory (Mw)   Time (sec)   MFLOPS
Embarrassingly parallel (EP)    2^24           0.1           11.6         120
Multigrid (MG)                  32^3           0.1           0.1          128
Conjugate gradient (CG)         ~10^5          0.6           1.2          63
3-D FFT PDE (FT)                64^3           2.0           1.2          160
Integer sort (IS)               2^16           0.3           0.2          NA
LU solver (LU)                  12^3           0.3           3.5          28
Pentadiagonal solver (SP)       12^3           0.2           7.2          24
Block tridiagonal solver (BT)   12^3           0.3           7.2          34

Table 2: NAS Parallel Benchmarks Problem Sizes. (Times and MFLOPS for one processor of the Cray Y-MP)

Benchmark code                  Problem size   Memory (Mw)   Time (sec)   MFLOPS
Embarrassingly parallel (EP)    2^28           1             151          147
Multigrid (MG)                  256^3          57            54           154
Conjugate gradient (CG)         ~2 x 10^6      12            22           70
3-D FFT PDE (FT)                256^2 x 128    59            39           192
Integer sort (IS)               2^23           26            21           NA
LU solver (LU)                  64^3           8             344          189
Pentadiagonal solver (SP)       64^3           6             806          175
Block tridiagonal solver (BT)   64^3           6             923          192

Periodic publication of the submitted results is planned. Benchmark results should be submitted to the Applied Research Branch, NAS Systems Division, Mail Stop T045-1, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Results. A complete submission of results should include the following:

- A detailed description of the hardware and software configuration used for the benchmark runs.

- A description of the implementation and algorithmic techniques used.

- Source listings of the benchmark codes.

- Output listings from the benchmarks.

6 Acknowledgments

The conception, planning, execution, programming and authorship of the NAS Parallel Benchmarks was truly a team effort, with significant contributions by a number of persons. Thomas Lasinski, chief of the NAS Applied Research Branch (RNR), and John Barton of the NAS System Development Branch (RND), provided overall direction and management of the project. David Bailey and Horst Simon edited the benchmark document and worked with others in the development and implementation of the benchmarks. Eric Barszcz of RNR assisted in the implementation of both the multigrid and the simulated application benchmarks. David Browning and Russell Carter of RND reviewed all problem definitions and sample codes, as well as contributed some text to this paper. Leonardo Dagum of RNR developed the integer sort benchmark. Rod Fatoohi of RNR assisted in the development and implementation of the simulated application benchmarks. Paul Frederickson of RIACS developed the multigrid benchmark and worked with Bailey on the embarrassingly parallel and 3-D FFT PDE benchmarks. Rob Schreiber of RIACS developed the conjugate gradient benchmark and worked with Simon on its implementation. V. Venkatakrishnan of RNR assisted in the implementation of the simulated application benchmarks. Finally, Sisira Weeratunga of RNR was responsible for the overall design of the simulated application benchmarks and also for a major portion of their implementation.

References

[1] American National Standard Programming Language Fortran X3.9-1978. American National Standards Institute, 1430 Broadway, New York, NY, 10018, 1990.

[2] Draft Proposed Fortran 90 ANSI Standard X3J3 / S8.115. American National Standards Institute, 1430 Broadway, New York, NY, 10018, 1990.

[3] Draft Proposed C ANSI Standard X3.159-1989. American National Standards Institute, 1430 Broadway, New York, NY, 10018, 1990.

[4] Numerical Aerodynamic Simulation Program Plan. NAS Systems Division, NASA Ames Research Center, October 1988.

[5] PCF Fortran Extensions - Draft Document, Revision 2.11. Parallel Computing Forum (PCF), c/o Kuck and Associates, 1906 Fox Drive, Champaign, Illinois 61820, March 1990.

[6] D. Bailey, J. Barton, T. Lasinski, and H. Simon, eds. The NAS Parallel Benchmarks. Technical Report RNR-91-02, NASA Ames Research Center, Moffett Field, CA 94035, January 1991.

[7] D. Bailey and J. Barton. The NAS Kernel Benchmark Program. Technical Report 86711, NASA Ames Research Center, Moffett Field, California, August 1985.

[8] G. Cybenko, L. Kipp, L. Pointer, and D. Kuck. Performance Evaluation and the Perfect Benchmarks. Technical Report 965, CSRD, Univ. of Illinois, Urbana, Illinois, March 1990.

[9] J. Dongarra. The LINPACK Benchmark: An Explanation. SuperComputing, pp. 10-14, Spring 1988.

[10] J. Dongarra. Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment. Technical Report MCSRD 23, Argonne National Laboratory, March 1988.

[11] M. Berry et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers. The International Journal of Supercomputer Applications, 3:5-40, 1989.

[12] F. McMahon. The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, California, December 1986.

[13] L. Pointer. PERFECT Report 1. Technical Report 896, CSRD, Univ. of Illinois, Urbana, Illinois, July 1989.

[14] L. Pointer. PERFECT Report 2: Performance Evaluation for Cost-Effective Transformations. Technical Report 964, CSRD, Univ. of Illinois, Urbana, Illinois, March 1990.