<<

J 1 January Cambridge University Press

Benchmarking Implementations of Functional Languages with

Pseudoknot a FloatIntensive Benchmark

Pieter H Hartel Marc Feeley Martin Alt

Lennart Augustsson Peter Baumann Marcel Beemster

Emmanuel Chailloux Christine H Flo o d Wolfgang Grieskamp

John H G van Groningen Kevin Hammond Bogumil Hausman

Melo dy Y Ivory Richard E Jones Jasp er Kamp erman

Peter Lee Xavier Leroy Rafael D Lins

Sandra Lo osemore Niklas Rojemo Manuel Serrano

JeanPierre Talpin Jon Thackray Stephen Thomas

Pum Walters Pierre Weis Peter Wentworth

Abstract

Over implementation s of dierent functional languages are b enchmarked using the same program a oating

p oint intensive application taken from molecular biology The principal asp ects studied are compile time and

Dept of Computer Systems Univ of Amsterdam Kruislaan SJ Amsterdam The Netherlands email

pieterfwiuvanl

Depart dinformatique et ro Univ de Montreal succursale centreville Montreal HC J Canada email

feeleyiroumontrealca

Informatik Universitat des Saarlandes Saarbruc ken Germany email altcsunisbde

Dept of Computer Systems Chalmers Univ of Technology Goteb org Sweden email

augustsscschalmersse

Dept of Computer Science Univ of Zurich Winterthurerstr Zurich Switzerland email

baumanniunizh ch

Dept of Computer Systems Univ of Amsterdam Kruislaan SJ Amsterdam The Netherlands email

b eemsterfwiuvanl

LIENS URA du CNRS Ecole Normale Superieure rue dUlm PARIS Cedex France email

EmmanuelChaillou xensfr

Lab oratory for Computer Science MIT Technology Square Cambridge Massachusetts USA email

chflcsmitedu

Berlin University of Technology Franklinstr Berlin Germany email wgcstub erlinde

Faculty of Mathematics and Computer Science Univ of Nijmegen To erno oiveld ED Nijmegen The

Netherlands email johnvgcskunnl

Dept of Computing Science Glasgow University Lilybank Gardens Glasgow G QQ UK email

khdcsglasgowacuk

Computer Science Lab Ellemtel Telecom Systems Labs Box S Alvsjo Sweden email

b ogdanerixericssonse

Computer Research Group Institute for Scientic Computer Research Lawrence Livermore National Lab ora

tory P O Box L Livermore CA email ivoryllnl gov

Dept of Computer Science Univ of Kent at Canterbury Canterbury Kent CT NF UK email

REJonesukcacuk

CWI Kruislaan SJ Amsterdam The Netherlands email jasp ercwinl

Dept of Computer Science Carnegie Mellon University Forb es Avenue Pittsburgh Pennsylvania

USA email p etelcscmuedu

INRIA Ro cquencourt pro jet Cristal BP Le Chesnay France email XavierLeroyinriafr

Departamento de Informatica Universidade Federal de Pernambuco Recife PE Brazil email rdldiufp ebr

Dept of Computer Science Yale Univ New haven Connecticut email lo osemoresandracsyal eedu

Dept of Computer Systems Chalmers Univ of Technology Goteb org Sweden email

ro jemocschalmersse

INRIA Ro cquencourt pro jet Icsla BP Le Chesnay France email ManuelSerranoin ria fr

Europ ean ComputerIndustry Research Centre Arab ella Strae D Munich Germany email

jpecrcde

Harlequin Ltd Barrington Hall Barrington Cambridge CB RG England email jontharlequincouk

Dept of Computer Science Univ of Nottingham Nottingham NG RD UK email sptcsnottacuk

CWI Kruislaan SJ Amsterdam The Netherlands email pumcwinl

INRIA Ro cquencourt pro jet Cristal BP Le Chesnay France email PierreWeisinriafr

Dept of Computer Science Rho des Univ Grahamstown South Africa email cspwcsruacza

Hartel Feeley et al

execution time for the various implementation s that were b enchmarked An imp ortant consideration is how the

program can b e mo died and tuned to obtain maximal p erformance on each language implementation

With few exceptions the take a signicant amount of time to compile this program though most

compilers were faster than the then current GNU C GCC version Compilers that generate C or

Lisp are often slower than those that generate native co de directly the cost of compiling the intermediate form

is normally a large fraction of the total compilation time

There is no clear distinction b etween the runtime p erformance of eager and lazy implementations when appro

priate annotations are used lazy implementations have clearly come of age when it comes to implementing largely

strict application s such as the Pseudoknot program The sp eed of C can b e approached by some implementa

tions but to achieve this p erformance sp ecial measures such as strictness annotations are required by nonstrict

implementations

The b enchmark results have to b e interpreted with care Firstly a b enchmark based on a single program cannot

cover a wide sp ectrum of typical application s Secondly the compilers vary in the and level of optimisation s

oered so the eort required to obtain an optimal version of the program is similarly varied

Intro duction

At the Dagstuhl Workshop on Applications of Functional Programming in the Real World in May

Giegerich and Hughes several interesting applications of functional languages were pre

sented One of these applications the Pseudoknot problem Feeley et al had b een written in

several languages including C Scheme Rees and Clinger Multilisp Halstead Jr and

Miranday Turner A numb er of workshop participants decided to test their compiler technology

using this particular program The rst p oint of comparison is the sp eed of compilation and the sp eed

of the compiled program The second p oint is how the program can b e mo died and tuned to obtain

maximal p erformance on each language implementation available

The initial b enchmarking eorts revealed imp ortant dierences b etween the various compilers The

rst impression was that compilation sp eed should generally b e improved After the workshop we have

continued to work on improving b oth the compilation and execution sp eed of the Pseudoknot program

Some researchers not present at Dagstuhl joined the team and we present the results as a record of a

small scale but exciting collab oration with contributions from many parts of the world

As is the case with any b enchmarking work our results should b e taken with a grain of salt We are

using a realistic program that p erforms a useful computation however it stresses particular language

features that are probably not representative of the applications for which the language implementations

were intended Implementations invariably tradeo the p erformance of some programming features for

others in the quest for the right blend of usability exibility and p erformance on typical applications

It is clear that a single b enchmark is not a go o d way to measure the overall quality of an implementation

Moreover the p erformance of an implementation usually but not always improves with new releases

as the implementors repair bugs add new features and mo dify the compiler We feel that our choice

of b enchmark can b e justied by the fact that it is a real world application that it had already b een

translated into C and several functional languages and that we wanted to compare a wide range of

languages and implementations The main results agree with those found in earlier studies Cann

Hartel and Langendo en

Section briey characterises the functional languages that have b een used The compilers and inter

preters for the functional languages are presented in Section The Pseudoknot application is intro duced

in Section Section describ es the translations of the program to the dierent programming languages

The b enchmark results are presented in Section The conclusions are given in the last section

y Miranda is a trademark of Research Software Ltd

Pseudoknot benchmark

language source ref typing evaluation order match

SML family

Caml INRIA Weis strong p oly eager higher pattern

SML Committee Milner et al strong p oly eager higher pattern

Haskell family

Clean Nijmegen Plasmeijer and strong p oly lazy higher pattern

van Eekelen

Gofer Yale Jones strong p oly lazy higher pattern

LML Chalmers Augustsson and strong p oly lazy higher pattern

Johnsson

Miranda Kent Turner strong p oly lazy higher pattern

Haskell Committee Hudak et al strong p oly lazy higher pattern

RUFL Rho des Wentworth strong p oly lazy higher pattern

Lisp family

Common Committee Steele Jr dynamic eager higher access

Lisp

Scheme Committee Rees and Clinger dynamic eager higher access

Parallel and concurrent languages

Erlang Ericsson Armstrong et al dynamic eager rst pattern

Facile ECRC Thomsen et al strong p oly eager higher pattern

ID MIT Nikhil strong p oly eager higher pattern

nonstrict

Sisal LLNL McGraw et al strong mono eager rst none

Intermediate languages

CMC Recife Lins and Lira strong p oly lazy higher access

Stoel Amsterdam Beemster strong p oly lazy higher case

Other functional languages

Epic CWI Walters and strong p oly eager rst pattern

Kamp erman

Opal TU Berlin Didrich et al strong p oly eager higher pattern

Trafola Saarbruc ken Alt et al strong p oly eager higher pattern

C

ANSI C Committee Kernighan and weak eager rst none

Ritchie

Table Language characteristics The source of each language is fol lowed by a key reference to the

language denition The remaining columns characterise the typing discipline the evaluation strategy

whether the language is rst or higherorder and the patternmatching facilities

Hartel Feeley et al

Languages

The Pseudoknot b enchmark takes into account a large numb er of languages and an even larger numb er

of compilers Our aim has b een to cover as comprehensively as p ossible the landscap e of functional

languages while emphasising typ ed languages

Of the general purp ose functional languages the most prominent are the eager dynamically typ ed

languages Lisp and Scheme the Lisp family the eager strongly typ ed languages SML and Caml the

SML family and the lazy strongly typ ed languages Haskell Clean Miranda and LML the Haskell

family These languages are suciently well known to obviate an intro duction There are also some

variants of these languages such as the Gofer and RUFL variants of Haskell The syntax and semantics

of these variants is suciently close to that of their parents that no intro duction is needed

Four of our functional languages were designed primarily for concurrentparallel applications These

are Erlang an eager concurrent language designed for prototyping and implementing reliable realtime

systems Facile which combines SML with a mo del of higherorder concurrent pro cesses based on CCS ID

an eager nonstrict mostly functional implicitly parallel language and Sisal an eager implicitly parallel

functional language designed to obtain high p erformance on commercial scalar and vector multipro cessors

The concurrentparallel capabilities of these four languages have not b een used in the Pseudoknot

b enchmark so a further discussion of these capabilities is not relevant here It should b e p ointed out

however that b ecause these languages were intended for parallel execution the sequential p erformance

of some may not b e optimal See Section

Two of the functional languages are intended to b e used only as intermediate languages and thus lack

certain features of fully edged programming languages such as pattern matching These languages are

CMC a Miranda based language intended for research on the categorical abstract machine Lins

and Stoel an intermediate language designed to study co de generation for high level languages on ne

grained parallel pro cessors The Stoel and CMC compilers have b een included b ecause these compilers

oer interesting implementation platforms not b ecause of the they implement

A further three functional languages were designed for a sp ecic purp ose Epic is a language for

equational programming which was primarily created to supp ort the algebraic sp ecication language

ASFSDF Bergstra et al Trafola is an eager language that was designed as a transformation

language in a compiler construction pro ject and Opal is an eager language that combines concepts from

algebraic sp ecication and functional programming in a uniform framework

Finally C is used as a reference language to allow comparison with an imp erative language

Table provides an overview of the languages that were b enchmarked The table is organised by

language family The rst column of the table gives the name of the language The second column gives

the source ie a University or a Company if a language has b een develop ed in one particular place

Some languages were designed by a committee which is also shown The third column of the table gives

a key reference to the language denition

The last four columns describ e some imp ortant prop erties of the languages The typing discipline may

b e strong and static dynamic or weak a strong typing discipline may b e monomorphic mono or

p olymorphic poly The evaluation strategy may b e eager lazy or eager with nonstrict evaluation The

language may b e rst order or higher order Accessing comp onents of data structures may b e supp orted

by either patternmatching on function arguments lo cal denitions andor as part of case expressions

pattern case by compiler generated access functions to destruct data access or not at all none The

reader should consult the references provided for further details of individual languages

Compilers

Having selected a large numb er of languages we wished to provide comprehensive coverage of compilers

for those languages For a numb er of languages we out to b enchmark more than one compiler so as

to provide direct comparisons b etween dierent implementations of some prominent languages as well as

b etween the languages themselves

For the Lisp family we use the CMU common Lisp native co de compiler and the Biglo o and Gambit

p ortable Scheme to C compilers

Pseudoknot benchmark

compiler version source ref FTP Email

Biglo o INRIA Serrano Ftp ftpinriafr

Ro cquencourt INRIAPro jectsicslaImplementation s

Caml Light INRIA Leroy Ftp ftpinriafr

Ro cquencourt langcamlli ght

Caml Gallium INRIA Leroy Email XavierLeroyinriafr

Ro cquencourt

Camlo o INRIA Serrano and Weis Ftp ftpinriafr

Ro cquencourt langcamlli ght

CeML LIENS Chailloux Email EmmanuelChaillo uxens fr

Clean b Nijmegen Smetsers et al Ftp ftpcskunnl

pubClean

CMU CL e Carnegie Mellon MacLachlan et al Ftp lispsunsl isp cscmuedu

pub

Epic CWI Walters and httpwwwcwinl gip eepichtml

Kamp erman

EpicC CWI Walters and httpwwwcwinl gip eepichtml

Kamp erman

Erlang Ellemtel AB Hausman commercial

Email erlangerixericsson se

Facile Antigua ECRC Thomsen et al Email facileecrcde

FAST Southampton Hartel et al Email pieterfwiuvanl

Amsterdam

Gambit Montreal Feeley and Miller Ftp ftpiroumontrealca

pubparallel ega mbi t

CMC Recife Lins and Lira Email rdldiufp ebr

Brazil

Gofer a Yale Jones Ftp nebulacsyaleedu

pubhaskell gofer

Haskell Chalmers Augustsson Ftp ftpcschalmersse

pubhaskell chal mers

Haskell Glasgow Peyton Jones et al Ftp ftpdcsglasgowacuk

pubhaskell gla sgow

Haskell Yale Yale Haskell group Ftp nebulacsyaleedu

pubhaskell yal e

ID TL MITBerkeley Nikhil Email chflcsmitedu

LML Chalmers Augustsson and Ftp ftpcschalmersse

Johnsson pubhaskell chal mers

LML Prerel Nottingham Thomas Email sptcsnottacuk

OPTIM Kent

MLWorks na Harlequin Ltd Harlequin Ltd commercial

Miranda Research Turner commercial

Software Ltd Email mirarequestukcacuk

Nearly Pre rel Chalmers Rojemo Ftp ftpcschalmersse

Haskell pubhaskell nhc

Opal c Berlin Schulte and Ftp ftpcstub erlinde

Grieskamp publo calue bbo cs

RUFL Rho des Wentworth Ftp csruacza

pubru

Sisal LLNL Cann Ftp sisallln lgov

pubsisal

SMLNJ ATT Bell Labs App el Ftp researchattcom

distml

Stoel Amsterdam Beemster Email b eemsterfwiuvanl

Trafola Saarbruc ken Alt et al Email altcsunisbde

Table Compiler details consisting of the name of the compiler andor language the University or

Company that built the compiler a key reference to the description of the implementation and the

address from which information about the compiler can be obtained

Hartel Feeley et al

compiler compiler options execution options collector oat

Biglo o unsafe O markscan double

Caml Light gen double

Caml Gallium gen double

Camlo o unsafe O markscan double

CeML O space single

Chalmers c YS HMg space single

YAk cpp

Clean nt s k h space double

markscan

CMU CL sp eed safety debug space single

compilationsp eed

Epic s markscan single

EpicC markscan single

Erlang BEAM fast h space double

Facile gen double

FAST fcg v h s K space single

Gambit h space double

CMC space double

Glasgow O fviaC OforC RTS HM gen single

Gofer DTIMER space single

ID strict mergepartitions none double

tlc opt

LML Chalmers H DSTR c space single

LMLOPTIM LMLC H DSTR c space single

fnoco de foutic

SPGC c i

a

MLWorks no details available space double

Miranda heap count markscan double

NHCHBC HM space single

NHCNHC hM space single

Opal optfull debugno refcount single

RUFL w m markscan double

RUFLI iw m r markscan double

Sisal cpp seq O refcount double

c atan ccO

SMLNJ gen double

Stoel O for C space double

Trafola TC INLINE HE space sgldbl

Yale see CMU CL space single

a

MLWorks is not yet available Compilation was for maximum optimisation no debugging or statistics collection

was taking place at runtime

Table Compilation and execution options The type of garbage col lector is one of space

nongenerational space copying col lector markscan gen generational with two or more spaces

space markscan one space compactor or reference counting Floatingpoint arithmetic used is

either single or doubleprecision

For the SML family we use SMLNJ an incremental interactive native co de compiler MLWorks

a commercial native co de compiler Caml Light a simple byteco de compiler Camlo o a Caml to C

compiler derived from the Biglo o Scheme compiler Caml Gallium an exp erimental nativeco de compiler

and CeML a compiler that has b een develop ed to study translations of Caml into C

For Haskell we use the Glasgow compiler which generates either C or native co de the Chalmers native

co de compiler and the Yale compiler which translates Haskell into Lisp A large subset of Haskell is

translated into byteco de by the NHC compiler The Haskell relatives RUFL and Gofer can b oth compile

either to native co de or to a byte co de The Clean native co de compiler from Nijmegen is used for Clean

For Miranda the Miranda interpreter from Research Software Ltd is used as well as the FAST compiler

which translates a subset of Miranda into C For LML the Chalmers LML native co de compiler is used

as well as a mo died version that translates into a lowlevel intermediate form based on FLIC After

extensive optimisations Thomas this LMLOPTIM backend generates native co de

Pseudoknot benchmark

For the four concurrentparallel languages we use the Erlang BEAM compiler a p ortable compiler

that generates C the Facile compiler which uses the SMLNJ compiler to translate the SML co de

emb edded in Facile programs into native co de the ID compiler which translates into an intermediate

data ow representation that is subsequently compiled into C by the Berkeley Tl backend and the

Sisal compiler which compiles via C with sp ecial provisions to up date data structures in place without

copying

Epic is supp orted by a socalled hybrid interpreter which allows the combination of interpreted and

compiled functions Initially all functions are translated to essentially byteco de Then individual func

tions can b e translated into C and can b e linked to the system This leads to a stratum of p ossibilities

with in one extreme all functions b eing interpreted and in the other all functions b eing compiled and

only the dispatch overhead b eing paid In this do cument the two extremes are b eing b enchmarked under

the names Epic and EpicC resp ectively

Of the four remaining languages Opal CMC and Stoel are translated into C whereas Trafola is

translated into an interpreted co de

An overview of the compilers that have b een used may b e found in Table Since this table is intended

for reference rather than comparisons the entries are listed in alphab etical order The rst column gives

the name of the language andor compiler the second shows the source of the compiler A key reference

that describ es the compiler is given in the third column The last column gives instructions for obtaining

the compiler by FTP or email

To make the b est p ossible use of each of the compilers compilation and runtime options have b een

selected that should give fast execution We have consistently tried to optimise for execution sp eed In

particular no debugging information run time checks or proling co de have b een generated Where a

O option or higher optimisation setting could b e used to generate faster co de we have done so The

precise option settings that were used for each compiler are shown in columns and of Table The

fourth column shows what typ e of garbage collection is used and the last column indicates whether

single or double oatingp oint precision was used Where alternatives were available we have chosen

singleprecision since this should yield b etter p erformance

Application

The Pseudoknot program is derived from a realworld molecular biology application Feeley et al

In the following sections the program is describ ed briey from the p oint of view of its functional structure

and its main op erational characteristics The level of detail provided should b e sucient to understand

the later sections that describ e the optimisations and p erformance analyses of the program For more

detail on the biological asp ects of the program the reader is referred to Feeley et al

Functional behaviour

The Pseudoknot program computes the threedimensional structure of part of a nucleic acid molecule

from its primary structure ie the nucleotide sequence and a set of constraints on the threedimensional

structure The program exhaustively searches a discrete space of shap es and returns the set of shap es

that resp ect the constraints

More formally the problem is to nd all p ossible assignments of the variables x x n

1 n

here that satisfy the structural constraints Each variable represents the D p osition orientation and

conformation ie the internal structure of a nucleotide in the nucleic acid molecule Collectively they

represent the D structure of the molecule There are four typ es of nucleotides A C G and U which

contain from to atoms each To reduce the search space the domains of the variables are discretized

to a small nite set ie x D These domains are dep endent on the lower numb ered variables ie D

i i i

is a function of x x to take into account the restricted ways in which nucleotides attach relatively

1 i1

to one another The constraints sp ecify a maximal distance b etween sp ecic atoms of the molecule

The heart of the program is a backtracking search For each p ossible assignment of x all p ossible

1

assignments of x are explored and so on until all variables are assigned a value A satisfactory set of 2

Hartel Feeley et al

oatingp oint op erations square ro ot and

trigonometric functions

p

> < arctan

cos

= sin

total total

Table Breakdown of the real work involved in the Pseudoknot problem as counted by the FAST

system The oatingpoint operations occurring in the trigonometric functions and the square root are

not counted separately

assignments is a solution As the search deep ens the constraints are checked to prune branches of the

search tree that do not lead to a solution If a constraint is violated the search backtracks to explore the

next p ossible assignment of the current variable When a leaf is reached the current set of assignments

is added to the list of solutions For the b enchmark there are p ossible solutions

The computation of the domains is a geometric problem which involves computations on D transfor

mation matrices and p oints D vectors Notable functions include tfocombine multiplication of D

matrices tfoalign creation of a D matrix from three D vectors and tfoapply multiplication

of a D matrix by a D vector Another imp ortant part of the program is the conformation database of

nucleotides This database contains the relative p osition of all atoms in all p ossible conformations of the

four nucleotides a total of conformations This data is used to align nucleotides with one another

and to compute the absolute p osition of atoms in the molecule

The program used in the present b enchmarking eort is slightly dierent from the original Feeley

et al The latter only computed the numb er of solutions found during the search However in

practice it is the lo cation of each atom in the solutions that is of real interest to a biologist since the

solutions typically need to b e screened manually by visualising them consecutively The program was thus

mo died to compute the lo cation of each atom in the structures that are found In order to minimise IO

overhead a single value is printed the distance from the origin to the farthest atom in any solution this

requires that the absolute p osition of each atom b e computed

Operational behaviour

The Pseudoknot program is heavily oriented towards oatingp oint computations and oatingp oint

calculations should thus form a signicant p ortion of the total execution time For the C version executed

on machine cf Table this p ercentage was found to b e at least

We also studied the extent to which the execution of the functional versions is dominated by oating

p oint calculations using stateoftheart compilers for the eager SML and the lazy FAST versions of

the program The time prole obtained for the MLWorks compiler for SML suggests that slightly over

of the run time is consumed by three functions tfocombine tfoalign and tfoapply which

do little more than oatingp oint arithmetic and trigonometric functions This means that half the time

little more than the oatingp oint capability of this implementation is b eing tested and some of that

functionality is actually provided by an op erating system library

Statistics from the lazy FAST compiler show that with lazy evaluation the most optimised version of the

program do es ab out million oatingp oint op erations excluding those p erformed by the thousand

trigonometric and square ro ot function calls A detailed breakdown of these statistics is shown in Table

Overall the program makes ab out million function calls and claims ab out Mbytes of space the

maximum live data is ab out Kbytes

Pseudoknot benchmark

Translations annotations and optimisations

The Pseudoknot program was handtranslated from either the Scheme or the C version to the various other

languages that were b enchmarked All versions were handtuned to achieve the b est p ossible p erformance

for each compiler The following set of guidelines were used to make the comparison as fair as p ossible

Algorithmic changes are forbidden but slight mo dications to the co de to allow b etter use of a

particular feature of the language or programming system are allowed

Only small changes to the data structures are p ermitted eg a tuple may b e turned into an array

or a list

Annotations are p ermitted for example strictness annotations or annotations for inlining and

sp ecialisation of co de

All changes and annotations should b e do cumented

Anyone should b e able to rep eat the exp eriments So all sources and measurement pro cedures should

b e made public by ftp somewhere

All programs must pro duce the same output the numb er to signicant gures

The optimisations and annotations made to obtain b est p erformance with each of the compilers are

discussed in the following subsections We will make frequent reference to particular parts of the program

text As the program is relatively large it would b e dicult to repro duce every version in full here The

reader is therefore invited to consult the archive that contains most of the versions of the program at

ftpfwiuvanl le pubcomputersystemsfunctionalpackagespseudoknottarZ

The guidelines ab ove were designed on the basis of the exp erience gained at the Dagstuhl workshop

with a small subset of the present set of implementations We tried to make the guidelines as clear and

concise as p ossible yet they were op en to dierent interpretations The problems we had were aggravated

by the use of dierent terminology particularly when one term means dierent things to p eople from

dierent backgrounds During the pro cess of interpreting and integrating the b enchmarking results in the

pap er we have made every eort to eradicate the dierences that we found There may b e some remaining

dierences that we are unaware of

In addition to these unintentional dierences there are intentional dierences some exp erimenters

sp ent more time and eort improving their version of Pseudoknot than others These eorts are do cu

mented but not quantied in the sections that follow

Sources used in the translations

The translation of the Pseudoknot program into so many dierent languages represented a signicant

amount of work Fortunately this work could b e shared amongst a large numb er of researchers The basic

translations were not particularly dicult or interesting so we will just trace the history of the various

versions The optimisations that were applied will then b e discussed in some detail in later sections

The Scheme version of the Pseudoknot b enchmark was used as the basis for the Biglo o Gambit CMU

Common Lisp and Miranda versions

The Miranda source was used to create the Clean FAST Erlang CMC Gofer Haskell Stoel and

SML sources The Haskell source was subsequently used to create the ID RUFL and LML sources and

together with the SML source to create the Opal source The SML version was subsequently used as the

basis for the translation to Caml Epic and Facile

The Sisal version is the only functional co de to have b een derived from the C version of the program

Some typ ed languages RUFL Opal require explicit typ e signatures to b e provided for all top level

functions For other languages SMLNJ it was found to b e helpful to add typ e signatures to improve

the readability of the program

All but two of the sources were translated by hand the Stoel source was translated by the FAST

compiler from the Miranda source and the Epic source was pro duced by a translator from a subset of

SML to Epic which was written in Epic

Hartel Feeley et al

Splitting the source

Most of the compilers that were used have diculty compiling the Pseudoknot program In particular the

C compilers and also most of the compilers that generate C take a long time to compile the program

For example GCC requires more than seconds on machine See table to compile the

program with the O optimisation enabled The bundled SUN CC compiler takes over seconds with

the same option setting and on the same machine

The reason it takes so long to compile Pseudoknot is b ecause the program contains four large functions

which in C comprise and lines of co de resp ectively which collectively build the

conformation database of nucleotides These functions contain mostly oatingp oint constants If the

b o dies of these four functions are removed leaving lines of C co de the C compilation time is

reduced to approximately seconds for b oth SUN CC and GCC Since the functional versions of the

program have the same structure as the C version the functional compilers are faced with the same

diculty

In a numb er of languages that supp ort separate compilation eg Haskell LML and C the program

has b een split into separate mo dules Each global data structure is placed in its own mo dule which is

then imp orted by each initialisation mo dule The main program imp orts all of these mo dules Splitting

the source reduced the compilation times by ab out for GCC and for SUN CC As the main

problem is the presence of large numb ers of oatingp oint constants in the source this is all we could

hop e for

The NHC compiler is designed sp ecically to compile large programs in a mo dest amount of space

There are two versions of this compiler NHCHBC which is the NHC compiler when compiled by

the Chalmers Haskell compiler HBC and NHCNHC which is a b o otstrapp ed version The monolithic

source of the Pseudoknot program could b e compiled using less than MB heap space by NHCNHC

whereas NHCHBC requires MB heap space HBC itself could not compile the monolithic source in

MB heap space even when a singlespace garbage collector was used

A numb er of the functional compilers that compile to C eg Opal and FAST generated such large

or so many C pro cedures that some C compilers had trouble compiling the generated co de For example

the C co de generated by EpicC consists of many functions o ccupying lines of co de It is generated

in minutes but had to b e split by hand in order for the gcc compiler to compile it successfully taking

hours b oth times on machine cf Table

As a result of the Pseudoknot exp erience the Opal compiler has b een mo died to cop e b etter with

extremely large functions such as those forming the nucleotide database

One way to dramatically reduce C compilation time at the exp ense of increased run time is to represent

each vector of oatingp oint constants in the conformation database as a single large string This string

is then converted into the appropriate numeric form at run time For the Biglo o compiler which uses this

technique to successfully reduce compilation time the run time p enalty amounted to of the total

execution time

Purity

To allow a fair comparison of the quality of co de generation for pure functions none of the functional

versions of Pseudoknot exploit sideeects where these are available in the source language

Typing

Most of the languages are statically typ ed in which case the compilers can use this typ e information to

help generate b etter co de Some of the compilers for dynamically typ ed languages can also exploit static

typ e information when this is provided

For example the Erlang version of Pseudoknot used guards to give some limited typ e information as

in the following function denition where X etc are typ ed as oatingp oint numb ers

ptsubXYZXYZ

when floatXfloatYfloatZ

Pseudoknot benchmark

floatXfloatYfloatZ XXYYZZ

Similarly for Common Lisp typ e declarations were added to all oatingp oint arithmetic op erations and

to all lo cal variables that hold oatingp oint numb ers This was unnecessary for the Scheme version

which already had calls to oatingp oint sp ecic arithmetic functions

Functions

Functional abstraction and application are the central notions of functional programming Each of the

various systems implements these notions in a dierent way This in turn aects the translation and

optimisation of the Pseudoknot program

There is considerable variety in the treatment of function arguments Firstly some languages use curried

arguments some use uncurried arguments and some make it relatively cheap to simulate uncurried

arguments through the use of tuples when the normal argument passing mechanism is curried Secondly

higherorder languages allow functions to b e passed as arguments to other functions whereas the rst

order languages restrict this capability Finally even though most languages supp ort patternmatching

some do not allow aspatterns which make it p ossible to refer to a pattern as a whole as well as to its

constituent parts

There are several other issues that aect the cost of function calls such as whether function b o dies can

b e expanded inline and whether recursive calls can b e transformed into tail recursion or lo ops These

issues will now b e discussed in relation to the Pseudoknot program with the eects that they have on

the p erformance where these are signicant

Curried arguments

The SMLNJ source of the Pseudoknot program is written in a curried style In SMLNJ version

this proved to have a relatively small eect on p erformance less than improvement compared with

an uncurried style For the older version of the SMLNJ compiler used for the Facile system however

version some of the standard compiler optimisations app ear to b e more eective on the uncurried

than on the curried version of the program In this case the dierence was still less than

Higherorder functions

The Pseudoknot program o ccasionally passes functions as arguments to other functions This is obviously

not supp orted by the three rst order languages Sisal Erlang and Epic The Sisal co de was therefore

derived from the C program where this problem had already b een solved Erlang to ok the alternative

approach of eliminating higherorder calls using an explicit application function papply For example

reference is called by

papplyreferenceArgArgArg referenceArgArgArg

where reference is a constant it is a static function name In Epic a similar mechanism was used

Higherorder functions are generally exp ensive to implement so many compilers will make attempts

to reduce how often such functions are used In most cases higherorder functions simply pass the name

of a statically known function to some other function These cases can b e optimised by sp ecialising the

higher order function Many compilers will sp ecialise automatically in some cases this has b een achieved

manually For example for Yale Haskell the functions atompos and search were inlined to avoid a higher

order function call

Patterns

Some functions in the Pseudoknot program rst destruct and then reconstruct a data item In CeML

Haskell LML and Clean aspatterns have b een used to avoid this As an example consider the following

Haskell fragment

Hartel Feeley et al

atompos atom vVar i t n absolutepos v atom n

Here the rebuilding of the constructor Var i t n is avoided by hanging on to the structure as a whole

via the variable v The Epic compiler automatically recognises patterns that o ccur b oth on the left and

the right hand side of a denition such patterns are never rebuilt

This optimisation has not b een applied universally b ecause aspatterns are not available in some lan

guages eg Miranda In FAST a similar eect has b een achieved using an auxiliary function

atompos atom v absolutepos v atom getnuc v

getnuc Var i t n n

The b enets of avoiding rebuilding a data structure do not always outweigh the disadvantage of the extra

function call so this change was not applied to the other languages

Neither of the two intermediate languages supp ort pattern matching To access comp onents of data

structures CMC uses access functions Stoel uses case expressions

Inlining

Functional programs normally contain many small functions and the Pseudoknot program is no exception

Each function call carries some overhead so it may b e advantageous to inline functions by expanding

the function b o dy at the places where the function is used Small functions and functions that are only

called from one site are normally go o d candidates for inlining Many compilers will automatically inline

functions on the basis of such heuristics and some compilers eg Opal Chalmers Haskell Glasgow

Haskell are even capable of inlining functions across mo dule b oundaries

For Clean FAST Trafola and Yale Haskell many small functions in particular the oatingp oint

op erator denitions and constants were inlined explicitly

Tail recursion and loops

Tail recursive functions can b e compiled into lo ops but some languages oer lo op constructs to allow

the programmer to express rep etitive b ehaviour directly In ID and Sisal the recursive function getvar

is implemented using a lo op construct In Epic this function coincides with a builtin p olymorphic

asso ciation table lo okup which was used instead In ID the backtracking search function search has also

b een changed to use a lo op instead of recursion

Data structures

The original functional versions of the Pseudoknot program use lists and algebraic data typ es as data

structures The preferred implementation of the data structures is language and compiler dep endent We

will describ e exp eriments where lists are replaced by arrays and where algebraic data typ es are replaced

by records tuples or lists

For the lazy languages strictness annotations on selected comp onents of data structures andor function

arguments also give signicant p erformance b enets

Avoiding lists

The b enchmark program computes solutions to the Pseudoknot constraint satisfaction problem Each

solution consists of variable bindings that is one for each of the nucleotides involved This creates

a total of records of atoms for which the distance from the origin to the furthest atom

must b e computed These records each contain b etween and atoms dep ending on the typ e

of the nucleotide for typ e A for typ e C for typ e G and for typ e U The sizes of these

records of atoms are determined statically so they are ideal candidates for b eing replaced by arrays The

advantage of using an array instead of a list of atoms is the amortised cost of allo catingreclaiming all

atoms at once A list of atoms is traversed linearly from the b eginning to the end so the unit access

Pseudoknot benchmark

cost of the array do es not give an extra advantage in this case This change from lists to arrays has b een

implemented in Caml ID and Scheme

In the Sisal co de the problem describ ed ab ove do es not arise instead of building the records a

double lo op traverses the records A further lo op computes the maximum distance to the origin

Consequently no intermediate lists or arrays are created

The lo cal function generate within p was replaced by an ID array comprehension in Sisal a lo op

construct was used

Avoiding algebraic data types

Some of the algebraic data typ e constructors in the Pseudoknot program are rather large with up to

comp onents This leads to distinctly unreadable co de when pattern matching on arguments of such typ es

and it may also cause ineciencies

For Sisal all algebraic data typ es were replaced by arrays since Sisal compilers are sp ecically optimised

towards the ecient handling of arrays

For SMLNJ the comp onent co ordinate transformation matrix TFO was changed to an array repre

sentation This was found not to make a signicant dierence

For the Caml Gallium compiler some of the algebraic data typ es have b een converted into records

to guide the data representation heuristics this transformation makes no dierence for the other Caml

compilers Caml light and Camlo o

Trafola Epic and RUFL implement algebraic data typ es as linked lists of cells which implies a signif

icant p erformance p enalty for the large constructors used by Pseudoknot

Strictness annotations

The Pseudoknot program do es not b enet in any way from lazy evaluation b ecause all computations

contained in the program are mandatory It is thus feasible to annotate the data structures ie lists

algebraic data typ es and tuples in the program as strict Those implementations which allowed strictness

annotations only had to annotate the comp onents of algebraic data typ es as strict to remove the bulk

of the lazy evaluations The Gofer Miranda NHC RUFL and Stoel compilers do not p ermit strictness

annotations but a variety of strictness annotations were tried with the other compilers

For Yale Chalmers Haskell and the two LML compilers all algebraic data typ es were annotated as

strict for CMC the comp onents of Pt and TFO were annotated as strict for Clean the comp onents

of Pt TFO and the integer comp onent of Var were annotated as strict for FAST all comp onents of

these three data typ es were annotated as strict For Yale Haskell the rst argument of getvar and the

arguments of makerelativenuc were also forced to b e strict This is p ermissible since the only cases

where these arguments are not used give rise to errors and are thus equivalent to demanding the value

of the arguments

Dep ending on the compiler strictness annotations caused the Pseudoknot execution times to b e reduced

by

Unboxing

The Pseudoknot program p erforms ab out million oatingp oint op erations Unless sp ecial precautions

are taken the resulting oatingp oint numb ers will b e stored as individual ob jects in the heap a b oxed

representation Representing these values as unb oxed ob jects that can b e held directly in registers on the

stack or even as literal comp onents of b oxed structures such as lists has a ma jor impact on p erformance

not only do es it reduce the space requirements of the program but the execution time is also reduced

since less garbage collection is required if less space is allo cated

There are a numb er of approaches that can b e used to avoid maintaining b oxed ob jects Caml Gallium

SMLNJ Biglo o and Gambit provide an analysis that will automatically unb ox certain ob jects CMU

common Lisp and Glasgow Haskell provide facilities to explicitly indicate where unb oxed ob jects can

Hartel Feeley et al

safely b e used Our exp erience with each of these techniques will now b e describ ed in some detail as it

provides useful insight into the prop erties of this relatively new technology

The Caml Gallium compiler employs a representation analysis Leroy which automatically

exploits an unb oxed representation for doubleprecision oatingp oint numb ers when these are used

monomorphically Since the Pseudoknot b enchmark do es not use p olymorphism all oatingp oint num

b ers are unb oxed This is the main reason why the Gallium compiler generates faster co de than most of

the other compilers

The latest version of the SMLNJ compiler version also supp orts automatic unb oxing through a

representation analysis Shao However unlike Caml Gallium it do es not directly exploit sp ecial

load and store instructions to transfer oatingp oint numb ers to and from the FPU Changing this should

improve the overall execution time for this compiler

In an attempt to nd b etter p erformance a large numb er of variations were tried with the SMLNJ

compiler The execution time was surprisingly stable under these changes and in fact no change made

any signicant dierence either go o d or bad to the execution sp eed In the end the original transcription

of the Scheme program with a typ e signature for the main function was used for the measurements A

similar result was found for the MLWorks compiler where a few optimisations were tried and found to

give only a marginal improvement of The MLWorks timings apply to essentially the same source

as the SMLNJ timings MLWorks generates slightly faster co de than SMLNJ for this program

The SMLNJ implementation of the Pseudoknot program actually p erforms b etter on the DECstation

than on the SPARC On the DECstation it runs at of the sp eed of C whereas on the

SPARC it runs at only of the sp eed of C We susp ect that this is mainly due to memory eects

Previous studies Diwan et al have shown that the intensive heap allo cation which is characteristic

of the SMLNJ implementation interacts badly with memory subsystems that use a writenoallo cate

cache p olicy as is the case of the SPARC in contrast the use of a writeallo cate p olicy coupled with

what amounts to subblo ck placement on the DECstation the cache blo ck size is four bytes supp orts

such intensive heap allo cation extremely well

The Biglo o compiler uses a twostep representation analysis The rst step is a control ow analysis

that identies monomorphic parts of the program The second step improves the representation of those

ob jects that are only used in these monomorphic parts Unfortunately it is not p ossible to avoid b oxing

entirely b ecause some data structures are used heterogeneously in the Scheme source eg oatingp oint

numb ers b o oleans and vectors are contained in the same vector Even so of the million oatingp oint

values that are created by the Pseudoknot program only thousand b ecome b oxed

The Gambit compiler uses two simple lo cal metho ds for reducing the numb er of oatingp oint numb ers

that are b oxed Firstly intermediate results for multiple argument arithmetic op erators such as when

more than two numb ers are added are never b oxed This means that only million of the million

oatingp oint results need to b e considered for b oxing Secondly Gambit uses a lazy b oxing strategy

whereby oatingp oint results b ound to lo cal variables are kept in an unb oxed form and only b oxed when

they are stored in a data structure passed as a function argument or when they are live at a branch ie

at a function call or return Of the million oatingp oint results that might need to b e b oxed only

million actually b ecome b oxed This optimisation decreases the run time by roughly

In the Epic implementation sp ecialised functions were dened for the two most common oating p oint

expressions two and threedimensional vector inpro duct leading to a reduction of function calls

and unb oxing Although the new functions were trivially written by hand their utilization was added

automatically by the addition of two rewrite rules to the otherwise unaltered SMLtoEpic translator

This is p ossible b ecause Epic unlike many functional languages do es not distinguish constructor symb ols

from dened function symb ols Consequently laws in the sense of Miranda Thompson in Epic

al l functions are dened by laws can b e intro duced which map sp ecic patterns such as x x x

1 2 3

x to semantically equivalent but more ecient patterns which use a newly intro duced function ie

4

inprod x x x x

1 2 3 4

In the Common Lisp version of the program the Pt and TFO data typ es were implemented as vectors

sp ecialised to hold untagged singlefloat ob jects rather than as general vectors of tagged ob jects This

is equivalent to unb oxing those oatingp oint numb ers

Pseudoknot benchmark

Cost Centre time allo c Cost Centre time allo c

tfocombine getvar

tfoapply tfocombine

po po

tfoalign pseudoknotconstraint

dgfbase search

getvar tfoalign

absolutepos ptphi

tfoapply

a Original prole by time c Maximum map by time

Cost Centre time allo c Cost Centre time allo c

tfoapply tfocombine

tfocombine po

search pseudoknotconstraint

pseudoknotconstraint mkvar

getvar tfoalign

varmostdistant tfoinvortho

tfoalign

b Strict typ es by time d Maximum map by allo cation

Table Time and al location prole of Pseudoknot from the Glasgow Haskel l system by function as a

percentage of total timeheap al locations

The Glasgow Haskell compiler has provisions for explicitly manipulating unb oxed ob jects using the

typ e system to dierentiate b oxed and unb oxed values Peyton Jones and Launchbury The pro cess

of engineering the Pseudoknot co de to reduce the numb er of b oxed oatingp oint numb ers is a go o d

illustration of how the Glasgow proling to ols can b e used We therefore present this asp ect of the

software engineering pro cess in detail b elow

The version of Pseudoknot which was used for Chalmers Haskell ran in seconds when compiled

with the Glasgow Haskell Compiler for machine cf Table The raw time proling information from

this program See Table a shows that a few functions account for a signicant p ercentage of the time

used and over of the total space usage Three of the top four functions by time tfocombine

tfoapply and tfoalign manipulate TFOs and Pts and the remainder are heavy users of the Var

structure Since these functions can b e safely made strict they are prime candidates to b e unb oxed as

was also done with the Common Lisp compiler

By unb oxing these data structures using a simple editor script and changing the pattern match in the

denition of varmostdistantatom so that it is strict rather than lazy an improvement of roughly a

factor of is obtained This is similar to the improvements which are p ossible by simply annotating the

relevant data structures to make them strict as with the Chalmers Haskell compiler However further

unb oxing optimisations are p ossible if the three uses of the function comp osition maximum map are

replaced by a new sp ecialised function maximummap as shown b elow This function maps a function

whose result is a oatingp oint numb er over a list of arguments and selects the maximum result It is not

p ossible to map a function directly over a list of unb oxed values using the normal Prelude map function

b ecause unb oxed values cannot b e passed to p olymorphic functions

maximummap aFloat aFloat

maximummap f ht

max f t f h

where max f xxs m max f xs let fx f x in

Hartel Feeley et al

Version seconds Mbytes Residency

Original K

Strict Typ es K

Maximum Map K

Table Time and Heap Usage of three Pseudoknot variants compiled for machine by the Glasgow

Haskel l compiler

if fx gtFloat m then fx else m

max f m m

max aFloat a Float Float

This optimisation is suggested indirectly by the time prole Table b which shows that the top function

by time is tfoapply This is called through absolutepos within mostdistantatom Merging the

three nested function calls that collectively pro duce the maximum value of a function applied to a list

of arguments allows the compiler to determine that the current maximum value can always b e held in a

register an extreme form of deforestation Wadler When this transformation is applied to the

Haskell source the total execution time is reduced to seconds user time still on machine An

automatic generalised version of this hand optimisation the foldrbuild transformation Gill and Peyton

Jones has now b een incorp orated into the Glasgow Haskell compiler

The nal time prole Table c shows getvar and po jointly using of the Haskell execution

time with tfocombine tfoalign and tfoapply accounting for a further The minor dierences

in p ercentage time for tfocombine in Tables c and d are probably explained by sampling error over

such a short run While the rst two functions could b e optimised to use a nonlist data structure it is

not easy to optimise the latter functions any further The total execution time is now close to that for

C with a large fraction of the total remaining time b eing sp ent in the mathematical library Since

the allo cation prole Table d suggests that there are no space gains which can b e obtained easily it

was decided not to attempt further optimisations The overall time and space results for Glasgow Haskell

are summarised in Table In each case the heap usage rep orted is the total numb er of bytes that were

allo cated with the maximum live data residency after a ma jor garbage collection shown in parentheses

The foldrbuild style deforestation of maximum map has also b een applied to the ID SML and Scheme

sources For SMLNJ this transformation and other similar deforestation transformations made no

measurable improvement though several led to minor slowdowns

Single threading

In a purely functional program a data structure cannot normally b e mo died once it has b een created

However if the compiler can detect that a data structure is mo died by only one op eration and that

this op eration executes after all other op erations on the data structure or can b e so delayed then the

compiler may generate co de to mo dify the data structure in place The Sisal compiler includes sp ecial

optimisations preallo cation Ranelletti and copy elimination Gopinath and Hennesy that

make safe destructive up dates of data structures p ossible In order to exploit this the Sisal version of

the Pseudoknot program was written so as to exp ose the single threaded use of some imp ortant data

structures An example is given b elow where the array stack is single threaded so that the new versions

stack and stack o ccupy the same storage as the original stack

Pseudoknot benchmark

C Pseudo co de Sisal revised

let

add new element to stack stack arrayaddhstackelement

increment stack counter

domains stack pseudoknotdomainsstack call pseudoknot

in

decrement stack counter arrayremhstack

end let

In principle this co de is identical to the C co de The Sisal compiler realises that there is only a single

consumer of each stack It tags the data structure as mutable and generates co de to p erform all up dates in

place Consequently the Sisal co de maintains a single stack structure similar to the C co de eliminating

excessive memory usage and copy op erations As in the C co de when a solution is found a copy of

the stack is made to preserve it The Sisal co de runs in approximately KB of memory and achieves

execution sp eeds comparable to the C co de

FloatingPoint Precision

When comparing our p erformance results there are several reasons why oatingp oint precision must

b e taken into account Firstly it is easier to generate fast co de if singleprecision oatingp oint numb ers

are used since these can b e unb oxed more easily Secondly b oth memory consumption and garbage

collection time are reduced b ecause singleprecision oatingp oint numb ers can b e represented more

compactly Thirdly singleprecision oatingp oint arithmetic op erations are often signicantly faster than

the corresp onding doubleprecision op erations

Traditionally functional languages and their implementations have tended to concentrate on symb olic

applications Floatingp oint p erformance has therefore b een largely ignored One notable exception is

Sisal which is intended more as a sp ecialpurp ose language for scientic computations than as a general

purp ose language

Since singleprecision gives sucient accuracy for the Pseudoknot program on our b enchmark machine

and since singleprecision op erations are faster than doubleprecision op erations on this architecture

compilers that can exploit singleprecision arithmetic are therefore at some advantage The advantage

is limited in practice by factors such as the dynamic instruction mix of the co de that is executed for

example for the GNU C version of Pseudoknot overall p erformance is improved by only when single

precision oatingp oint is used for the Trafola interpreter however p erformance was improved by

and for the Opal compiler p erformance was improved by a factor of

Results

Comparative time measurements are b est done using a single platform However many of the compilers

are exp erimental and in constant ux They are therefore dicult to install at another site in a consistent

and working state Therefore we have decided to collect the compiled binaries of the Pseudoknot program

so as to b e able to execute all binaries on the same platform The measured execution times of the

programs are thus directly comparable and accurate

The compile times are not directly comparable To make a reasonable comparison p ossible a relative

measure has b een dened This is somewhat inaccurate but we think that this is quite acceptable since

for compile times it is more the order of magnitude that counts than precise values The relative unit

pseudoknot relates compilation time to the execution time of the C version of the Pseudoknot program

where b oth are measured on the same platform The more obvious alternative of comparing to the C

compilation times of Pseudoknot was rejected b ecause not all architectures that are at stake here use the

same C compiler See Table The Pseudoknot is computed as

C execution time

relative sp eed compilation time

Hartel Feeley et al

no SUN machine mem cache op system pro cessor C compiler

M K SunOS standard gcc

M K SunOS standard gcc

M K SunOS standard gcc

MP M K SunOS SUNW gcc

M K SunOS standard gcc

MP M K SunOS TMSZ gcc

M K SunOS TI Sup ersparc gcc

M K SunOS Cypress CY gcc

MP M M SunOS SUNW system gcc

M K SunOS standard gcc

MP M K SunOS ROSS MHz Sup er cc

M M SunOS standard gcc

SPARC M M SunOS TMSZ gcc

SPARC M M SunOS standard gcc

SPARC M M SunOS standard gcc

SPARC M KKM SunOS TMSZ gcc

SPARC M KKM Solaris standard gcc

SPARCStat M KK SunOS standard gcc

SPARCStat M KKM SunOS Sup ersparc gcc

Table Details of the SUN machines and C compilers used to compile the Pseudoknot program The

type of the machine is fol lowed by the size of the memory in MB the size of the cache as a total or

as instructiondata secondary cache size the name and version and the type of

processor used The last column gives the C compilerversion that has been used on the machine

To compile at knots thus means to take the same amount of time to compile as it takes the C

version of Pseudoknot to execute

With so many compilers under scrutiny it is not surprising that a large numb er of machines are involved

in the dierent compilations The most imp ortant characteristics of the SUN machines may b e found in

Table The table is ordered by the typ e of the machine

To measure short times with a reasonable degree of accuracy the times rep orted are an average of

either or rep eated runs The resulting system and user times divided by are rep orted in

the Tables and

Compile time

Table shows the results of compiling the programs The rst column of the table shows the name of

the compiler cf Table The second column route indicates whether the compiler pro duces native

co de N co de for a sp ecial interpreter I or compiles to native co de through a p ortable C backend

C Lisp L or Scheme S or through a combination of these backends The third column gives a

reference to the particular machine used for compilation cf Table The next three columns give the

usersystem time and the space required to compile the Pseudoknot program Unless noted otherwise

the space is the largest amount of space required by the compiler as obtained from ps v under the

heading SIZE The column marked Cruntimes gives the usersystem time required to execute the C

version of the Pseudoknot program on the same machine as the one used for compilation The last two

columns pseudoknots show the relative p erformance of the compiler with resp ect to the Cruntimes

It is p ossible to distinguish broad groups within the compilers The faster compilers are unsurprisingly

those that compile to an intermediate co de for a byteco de or similar interpreter and which therefore

p erform few if any optimisations NHC is an outlier p erhaps b ecause unlike the other compilers in this

group it is a b o otstrapping compiler

With the exception of Biglo o and Camlo o which are faster than many native compilers implementa

Pseudoknot benchmark

compiler route mach times space Cruntimes pseudoknots

a

user sys Mb AH user sys user sys

Compiled via another high level language C Lisp or Scheme

Biglo o C A

Camlo o SC A

Sisal C A

Gambit C A

Yale L H

CMC C A

FAST C A

Opal C A

Glasgow C A

Erlang BEAM C > Hour A

CeML C > Hour A

ID C > Hour A

EpicC C > Hours A

Stoel C > Hours A

Compiled into native co de

Clean N A

RUFL N A

CMU CL N H

Caml Gallium N A

SMLNJ N A

LML Chalmers N A

LMLOPTIM N A

Facile N A

MLWorks N R

Chalmers N A

Interpreted

Gofer I A

RUFLI I A

Miranda I A

Caml Light I A

Trafola I A

Epic I A

NHCHBC I A

NHCNHC I A

C compilers

SUN CC O N A

GNU GCC O N A

a

A Mbytes allo cated space H Mbytes heap size R Mbytes maximum resident set size

Table Results giving the time usersystem time in seconds and space in Mbytes required for

compilation of the Pseudoknot program The pseudoknots give the relative speed with respect to the

execution not compilation of the C version

Hartel Feeley et al

tions that generate C or Lisp are the slowest compilers Not only do es it take extra time to pro duce and

parse the C but C compilers have particular diculty compiling co de that contains large numb ers of

oatingp oint constants The worst case example is the Stoel compiler It takes seconds to generate

the C co de and more than hours to compile the C on machine cf Table Most of this time is

sp ent compiling the function that initialises the data structures containing oatingp oint numb ers As

the b ottom two rows of the table show C compilers also have particular diculty compiling the hand

written C version of the Pseudoknot program due to this phenomenon

The faster compilers also generally allo cate less space This may b e b ecause the slower compilers

generally apply more sophisticated and therefore spaceintensive optimisations

Execution time

All programs have b een executed a numb er of times on machine cf Table with dierent heap sizes

to optimise for sp eed The results rep orted in Table show the b est execution time inclusive of garbage

collection time The rst column of the table shows the name of the compilerinterpreter cf Table

The second column route duplicates the route column from Table The third column states whether

oatingp oint numb ers are single or doubleprecision Columns and give the user and system time

required to execute the Pseudoknot program The last column shows the space required which unless

noted otherwise represents the largest amount of space required by the program as obtained from ps v

under the heading SIZE

The pro duct moment correlation co ecient calculated from all compilation sp eeds as rep orted in

pseudoknots in Table and execution times as rep orted in seconds in Table is This shows that

there is a strong correlation b etween compilation time and execution sp eed the longer it takes to compile

the faster the execution will b e Only the Clean implementation oers b oth fast compilation and fast

execution The set of Caml compilers oers a particularly interesting sp ectrum Caml Gallium is a slow

compiler which pro duces fast co de Caml Light compiles quickly but is relatively slow and Camlo o is

intermediate b etween the two

The EpicC co de generator was designed to allow selected individual functions to b e compiled thus

providing an almost continuous sp ectrum of p ossibilities from fully interpreted to fully compiled co de

The present facilities for compiling co de provide little improvement over interpreted co de at the cost of

huge compilation times The reason is that the C co de faithfully mimics each interpreter step without

optimizations such as the use of lo cal variables or lo ops This results in C functions which b ehave identical

to their interpreted counterparts As much as of the sp eedup of EpicC with resp ect to epic was

achieved by compiling of the functions o ccurring in the Epic version of Pseudoknot

For the compiled systems there is a very rough relationship b etween execution sp eed and heap usage

faster implementations use less heap There do es not however seem to b e any correlation b etween non

strictness and heap usage

Summary

Overall the eager Sisal compiler achieved the b est p erformance The next b est implementation is the

lazy Glasgow Haskell compiler for a heavily optimised version of the program The next group of

compilers are for Lisp Scheme and SML which generally yield very similar p erformance An outlier is

the Biglo o optimising Scheme compiler whose p erformance is more comparable to most of the nonstrict

implementations Chalmers FAST ID Stoel Yale and Glasgow Haskell on less optimised co de which

form the next obvious group

Unsurprisingly p erhaps the interpretive systems yield the worst p erformance The interpreters for

Caml Light Epic NHC and Trafola which compile to an intermediate byteco de which is then inter

preted are however signicantly faster than their conventional brethren Gofer RUFLI and Miranda

which interpret a representation that is closer to the program than a byteco de Interpreters for strict

languages Caml Light Epic do seem on the whole to b e faster than interpreters for nonstrict languages

NHC Gofer RUFLI Miranda

Pseudoknot benchmark

compiler route oat times space

a

user sys Mb AH

Compiled via another high level language C or Lisp

Glasgow C single A

Opal C single A

CeML C single A

FAST C single A

Yale L single H

EpicC C single A

Sisal C double A

Gambit C double A

Camlo o SC double A

ID C double A

Biglo o C double A

CMC C double A

Stoel C double A

Erlang BEAM C double A

Compiled into native co de

CMU CL N single H

LMLOPTIM N single A

Chalmers N single A

LML Chalmers N single A

Caml Gallium N double A

Clean N double A

MLWorks N double A

SMLNJ N double A

Facile N double A

RUFL N double A

Interpreted

Epic I single A

Trafola I single A

NHC I single A

Gofer I single A

Caml Light I double A

RUFLI I double A

Miranda I double A

C compilers

GNU GCC C single A

GNU GCC C double A

a

A Mbytes allo cated space H Mbytes heap size

Table The execution times usersystem time in seconds and space MB of Pseudoknot as

measured on platform

Hartel Feeley et al

Analysis of Performance Results

Apart from the issues already discussed such as oatingp oint precision many language and implemen

tation design issues clearly aect p erformance This section attempts to isolate the most imp ortant of

those issues

HigherOrderness

It is commonly b elieved that supp ort for higherorder functions imp oses some p erformance p enalty and

the fact that the fastest language system Sisal is rstorder may therefore b e signicant Unfortunately

the other rstorder implementations Erlang and Epic yield relatively p o or p erformance Sisal is also

the only monomorphic language studied and p olymorphism is known to exact some p erformance p enalty

so results here must b e inconclusive

NonStrictness

As the Glasgow Haskell compiler shows if the compiler can exploit strictness at the right p oints the

presence of lazy evaluation need not b e a hindrance to high p erformance This implementation is actually

faster than most of the strict implementations

Generally however nonstrict compilers do not achieve this level of p erformance typically oering

only around of the p erformance of eager implementations such as SMLNJ or Gambit or of the

p erformance of CMU Common Lisp and only after the exploitation of strictness through unb oxing and

similar optimisations Without these features on the basis of the Glasgow results p erformance can b e

estimated as just under a quarter of the typical p erformance of a compiler for an eager language For these

compilers and this application supp ort for laziness therefore costs directly a factor of with a further

probably attributable to the use of dierent implementation techniques for predened functions etc

which are needed to allow for the p ossibility of laziness

The dierence b etween the Yale Haskell and Common Lisp results is due partly to use of tagged versus

untagged arrays and partly to the overhead of lazy lists in Haskell These were the only signicant

dierences b etween the handwritten Common Lisp co de and the Lisp co de pro duced by the Yale Haskell

compiler The Haskell co de generator could b e extended to use untagged arrays for homogeneous oating

p oint tuple typ es as well but this has not yet b een implemented

The LMLOPTIM compiler generates faster co de than the corresp onding Chalmers LML compiler b e

cause if a case alternative unpacks strict arguments LMLOPTIM takes into account that the unpacked

values are evaluated in Weak Head Normal Form

ConcurrencyParal lelism Support

Several of the compilers b enchmarked here include supp ort for concurrency or parallelism In some cases

eg Facile Glasgow Haskell this supp ort do es not aect the normal sequential execution time In

other cases eg ID Gambit and Erlang BEAM it is not p ossible to entirely eliminate the overhead of

parallelism

The low p erformance recorded by the Erlang BEAM compiler reects the fact that Erlang is a pro

gramming language primarily intended for designing robust concurrent realtime systems Firstly a low

priority has b een placed on oatingp oint p erformance Secondly to supp ort concurrent execution re

quires the implementation of a scheduling mechanism and a notion of time Together these add some

appreciable overhead to the Erlang BEAM execution

Native Code Generation

It is interesting that several of the compilers that generate fast co de compile through C rather than b eing

native compilers Clearly it is p ossible to compile ecient co de without generating assembler directly

Space usage for these compilers is also generally low the compilers have clearly optimised b oth for time and space

Pseudoknot benchmark

Language Design

Of the languages studied Sisal is the only one that was sp ecically designed for numeric rather than

symb olic computations and clearly the design works well for this application Floatingp oint p erfor

mance has traditionally taken secondplace in functional language implementations so we may hop e that

these results spur other compiler writers to attempt to duplicate the Sisal results

Conclusions

Over compilers for b oth lazy and strict functional languages have b een b enchmarked using a single

oatingp oint intensive program The results given here compare compilation time and execution time

for each of the compilers against the same program implemented in C Compilation time is measured in

terms of pseudoknots which are dened in terms of the execution time of the b enchmark program The

execution times of all compiled programs are rep orted in seconds as measured on a single machine

Benchmarking a single program can lead to results which cannot easily b e generalised Sp ecial care has

b een taken to make the comparison as fair as p ossible the Pseudoknot program is not an essentially lazy

program the dierent implementations use the same algorithm all of the binaries were timed on one and

the same machine

The eort exp ended by individual teams translating the original Scheme program and subsequently

optimising their p erformance varied considerably This is the result of a delib erate choice on the part of

the teams carrying out the exp eriments Firstly this variability gives some compilers an advantage but

not an unfair advantage Secondly the aim of the Pseudoknot b enchmark is sp ecically to get the b est

p ossible p erformance from each of the implementations using the guidelines discussed in Section There

is a wide variability in the kind and level of optimisation oered by each compiler The programming

eorts required to make these optimisations eective will thus b e as varied as the oerings of the compilers

themselves This should b e kept in mind when interpreting our results

Turning to the b enchmark itself we observe that the C version of the program sp ends of its time

in the C library trigonometric and square ro ot routines This represents the core of the application the

remaining work is overhead that should b e minimised by a go o d implementation While this pattern

may not generally hold for scientic applications the program is still useful as a b enchmark since the

real work it do es the trigonometric and oatingp oint calculations is so clearly identiable Not all the

b enchmark implementations are capable of realising this but some implementations do extremely well

Because the b enchmark is so oatingp oint intensive implementations that used an unb oxed oating

p oint representation had a signicant advantage over those that did not Implementations that were

capable of exploiting singleprecision bit oatingp oint have some additional advantage though this

is signicant only for the faster implementations where a greater prop ortion of execution time is sp ent

on oatingp oint op erations

To achieve go o d p erformance from lazy implementations it proved necessary to apply strictness an

notations to certain commonlyused data structures When appropriate strictness annotations are used

there is no clear distinction b etween the runtime p erformance of eager and lazy implementations and in

some cases the p erformance approaches that of C

Inserting these strictness annotations correctly can b e a ne art as demonstrated by the eorts of

the Glasgow team While the b ehaviour of the Pseudoknot program is not sensitive to incorrectly placed

strictness annotations in general lazy functional programs are not so well b ehaved in this resp ect and

considerable eort might b e exp ended intro ducing annotations without changing the termination prop er

ties of a program To make lazy functional languages more useful than they are now clearly more eort

should go into providing users with simple to use and eective means of analysing and improving the

p erformance of their programs

The b enchmark proved to stress compilers more than exp ected the compilation times for most compiled

implementations including the two C compilers was surprisingly high Generating C as intermediate

co de however do es not necessarily make the compiler slow as demonstrated by the p erformance of the

Biglo o and Camlo o compilers However generating fast C do es often lead to high compilation times

The Pseudoknot b enchmark represents a collab orative eort of an unprecedented scale in the functional

Hartel Feeley et al

programming community This has had a p ositive inuence on the work that is taking place in that

community Firstly researchers learn of the techniques applied by their coauthors in a more direct way

than via the literature Secondly researchers are more strongly motivated to apply new techniques b ecause

of the comp etitive element Thirdly using a common b enchmark always p oints at weaknesses in systems

that were either known and put aside for later or uncovered by the b enchmarking eort The Pseudoknot

b enchmark has b een the trigger to improve a numb er of implementations Finally researchers working

on implementations of the dierent language families are brought closer together so that the functional

programming community as a whole may emerge stronger

Acknowledgements

The comments of the anonymous referees were very useful

Mark Jones pro duced the Gofer version of the Pseudoknot program

Will Partain and Jim Mattson p erformed many of the exp eriments rep orted here for the Glasgow

Haskell compiler The work at Glasgow is supp orted by an SOED Research Fellowship from the Royal

So ciety of Edinburgh and by the EPSRCAQUA and PARADE grants

The work at Nijmegen is supp orted by STW Stichting vo or de Technische Wetenschapp en The Nether

lands

Jon Mountjoy p erformed some of the exp eriments for the RUFL implementation

The ID version of the Pseudoknot program was the result of a group eort Credit go es to Jamey Hicks

R Paul Johnson Shail Aditya Yonald Chery and Andy Shaw

Zhong Shao made several imp ortant changes to the SMLNJ implementation and Andrew App el and

David MacQueen provided general supp ort in the use of this system David Tarditi p erformed several of

the SMLNJ exp eriments

John T Feo and Scott Denton of Lawrence Livermore National Lab oratory collab orated on the Sisal

version of the Pseudoknot program

The CMC version of Pseudoknot was the result of team work with Genesio Cruz Neto and Ri

cardo Lima The CMC group is supp orted by CNPq Brazilian Government grants

and

Marc Feeley was supp orted in part by a grant from the Natural Sciences and Engineering Research

Council of Canada

References

M Alt C Fecht C Ferdinand and R Wilhelm The TrafolaS subsystem In B Homann and B KriegBruc kner

editors Program development by specication and transformation LNCS pages SpringerVerlag

Berlin May

A W App el Compiling with Continuations Cambridge Univ Press Cambridge England

J Armstrong M Williams and R Virding Concurrent programming in Erlang Prentice Hall Englewo o d Clis

New Jersey

L Augustsson HBC users manual Programming Metho dology Group Distributed with the HBC compiler

Depart of Comp Sci Chalmers S Goteb org Sweden

L Augustsson and T Johnsson The Chalmers LazyML compiler The computer journal Apr

L Augustsson and T Johnsson Lazy ML users manual Programming metho dology group rep ort Dept of

Comp Sci Chalmers Univ of Technology Goteb org Sweden

M Beemster The lazy functional intermediate language Stoel Technical rep ort CS Dept of Comp Sys

Univ of Amsterdam Dec

M Beemster Optimizing transformations for a lazy functional language In WJ Withagen editor th Computer

systems pages Eindhoven The Netherlands Nov Eindhoven Univ of Technology

J A Bergstra J Heering and P Klint Algebraic Specication The ACM Press in coop eration with Addison

Wesley ACM Press Frontier Series

D C Cann The optimizing SISAL compiler version Manual UCRLMA Lawrence Livermore

National Lab oratory Livermore California Apr

D C Cann Retire FORTRAN a debate rekindled CACM Aug

Pseudoknot benchmark

E Chailloux An ecient way of compiling ML to C In P Lee editor ACM SIGPLAN Workshop on ML and its

Applications pages San Francisco California Jun Scho ol of Comp Sci Carnegie Mellon Univ

Pittsburg Pennsylvania Technical rep ort CMUCS

K Didrich A Fett C Gerke W Grieskamp and P Pepp er OPAL Design and implementation of an algebraic

programming language In J Gutknecht editor Programming Languages and System Architectures LNCS

pages Zurich Switzerland Mar SpringerVerlag Berlin

A Diwan D Tarditi and E Moss Memory subsystem p erformance of programs with copying garbage collection

In st Principles of programming languages pages Portland Oregon Jan ACM New York

M Feeley and J S Miller A parallel virtual machine for ecient Scheme compilation In Lisp and functional

programming pages Nice France Jul ACM New York

M Feeley M Turcotte and G Lapalme Using Multilisp for solving constraint satisfaction problems an appli

cation to nucleic acid D structure determination Lisp and symbolic computation

R Giegerich and R J M Hughes Functional programming in the real world Dagstuhl seminar rep ort IBFI

GmbH Schloss Dagstuhl D Wadern Germany May

A J Gill and S L Peyton Jones Cheap deforestation in practice An optimiser for Haskell In Proc IFIP Vol

pages Hamburg Germany Aug

K Gopinath and J L Hennesy Copy elimination in functional languages In th Principles of programming

languages pages Austin Texas Jan ACM New York

The Yale Haskell Group The Yale Haskel l Users Manual version Yb Dept of Comp Sci Yale Univ

Distributed with the Yale Haskell compiler Jul

R H Halstead Jr Multilisp A language for concurrent symb olic computation ACM transactions on programming

languages and systems Oct

Harlequin MLWorks draft documentation Harlequin Ltd Cambridge England

P H Hartel H W Glaser and J M Wild Compilation of functional languages using ow graph analysis

Softwarepractice and experience Feb

P H Hartel and K G Langendo en Benchmarking implementations of lazy functional languages In th Functional

programming languages and computer architecture pages Cop enhagen Denmark Jun ACM New

York

B Hausman Turb o erlang Approaching the sp eed of C In E Tick and G Succi editors Implementations

of Systems pages Kluwer Academic Publishers BostonDordrechtLond on Mar

P Hudak S L Peyton Jones and P L Wadler editors Rep ort on the programming language Haskell a

nonstrict purely functional language version ACM SIGPLAN notices RR May

M P Jones The implementation of the Gofer functional programming system Research Rep ort

YALEUDCSRR Dept of Comp Sci Yale Univ New haven Connecticut May

B W Kernighan and D W Ritchie The C programming language ANSI C Prentice Hall Englewo o d Clis

New Jersey second edition edition

X Leroy Unb oxed ob jects and p olymorphic typing In th Principles of Programming Languages pages

Albuquerque New Mexico Jan ACM New York

X Leroy The Caml Light system release Software and do cumentation distributed by anonymous FTP on

ftpinriafr

R D Lins Categorical MultiCombinators In G Kahn editor rd Functional programming languages and

computer architecture LNCS pages Portland Oregon Sep SpringerVerlag Berlin

R D Lins and B O Lira CMC A novel way of implementing functional languages J Programming Languages

Mar

R A MacLachlan CMU common Lisp users manual Technical rep ort CMUCS Scho ol of Comp Sci

Carnegie Mellon Univ Jul

J R McGraw S K Skedzielewski S Allan R Oldeho eft J R W Glauert C Kirkham B Noyce and

R Thomas Sisal Streams and iteration in a single assignment language Language reference manual version

M Rev Lawrence Livermore National Lab oratory Livermore California Mar

R Milner M Tofte and R Harp er The denition of Standard ML MIT Press Cambridge Massachusetts

R S Nikhil ID version reference manual Computation Structures Group Memo Lab oratory for

Comp Sci MIT Cambridge Massachusetts Jul

S L Peyton Jones C V Hall K Hammond W D Partain and P L Wadler The Glasgow Haskell compiler a

technical overview In Proc Joint Framework for Information Technology JFIT Technical Conference pages

Keele England Mar DTISERC

S L Peyton Jones and J Launchbury Unb oxed values as rst class citizens in a nonstrict functional language

In R J M Hughes editor th Functional programming languages and computer architecture LNCS pages

Cambridge Massachusetts Sep SpringerVerlag Berlin

M J Plasmeijer and M C J D van Eekelen Concurrent Clean version Language Reference Manual

draft version Dept of Comp Sci Univ of Nijmegen The Netherlands Jun

Hartel Feeley et al

J E Ranelletti Graph transformation algorithms for array memory memory optimization in applicative languages

PhD thesis Comp Sci Dept Univ of California at Davis California Nov

4

J A Rees and W Clinger Revised Report on the Algorithmic Language Scheme MIT Cambridge Mas

sachusetts Nov

N Rojemo Highlights from nhc a spaceecient Haskell compiler In th Functional programming languages

and computer architecture pages La Jolla California Jun ACM New York

W Schulte and W Grieskamp Generating ecient p ortable co de for a strict applicative language In J Dar

lington and R Dietrich editors Phoenix Seminar and Workshop on Declarative Programming pages

Sasbachwalden West Germany Nov SpringerVerlag Berlin

M Serrano Biglo o users manual Technical rep ort INRIARo cquencourt France Dec

M Serrano and P Weis an optimizing Caml compiler In ACMSIGPLAN Workshop on ML and its

applications pages Research rep ort INRIA Ro cquencourt France Jun

Z Shao Compiling Standard ML for Ecient Execution on Modern Machines PhD thesis Princeton Univ

Princeton New Jersey Nov

S Smetsers E G J M H Nocker J van Groningen and M J Plasmeijer Generating ecient co de for

lazy functional languages In R J M Hughes editor th Functional programming languages and computer

architecture LNCS pages Cambridge Massachusetts Sep SpringerVerlag Berlin

G L Steele Jr Common Lisp the Language Digital Press Bedford second edition

S Thomas The Pragmatics of Closure Reduction PhD thesis University of Kent at Canterbury Canterbury

UK

S Thomas The OPTIM a b etter PGTIM Technical rep ort NOTTCSTR Dept of Comp Sci Univ

of Nottingham England

S Thompson Laws in Miranda In Lisp and functional programming pages Cambridge Massachusetts

Aug ACM New York

B Thomsen L Leth S Prasad TS Kuo A Kramer F Knab e and A Giacalone Facile antigua release

programming guide Technical rep ort ECRC Europ ean ComputerIndustry Research Centre Munich

Germany The reference manual and license agreement are available by anonymous ftp from ftpecrcde

D A Turner Miranda A nonstrict functional language with p olymorphic typ es In JP Jouannaud editor

nd Functional programming languages and computer architecture LNCS pages Nancy France Sep

SpringerVerlag Berlin

D A Turner Miranda system manual Research Software Ltd St Augustines Road Canterbury Kent CT

XP England Apr

P L Wadler Deforestation transforming programs to eliminate trees Theoretical Computer Science

H R Walters and J F Th Kamp erman Epic Implementing language pro cessors by equational programs

Technical rep ort in preparation Centrum vo or Wiskunde en Informatica Amsterdam The Netherlands

P Weis and X Leroy Le langage Caml InterEditions

E P Wentworth Co de generation for a lazy functional language Technical rep ort Dept of Comp Sci

Rho des Univ Dec

E P Wentworth RUFL reference manual Technical rep ort Dept of Comp Sci Rho des Univ Jan