The Nofib Benchmark Suite of Haskell Programs
Will Partain
University of Glasgow

Abstract

This position paper describes the need for, makeup of, and "rules of the game" for a benchmark suite of Haskell programs. It does not include results from running the suite. Those of us working on the Glasgow Haskell compiler hope this suite will encourage sound quantitative assessment of lazy functional programming systems. This version of this paper reflects the state of play at the initial pre-release of the suite.

Towards lazy functional benchmarking

History of benchmarking: functional

The quantitative measurement of systems for lazy functional programming is a near-scandalous subject. Dancing behind a thin veil of disclaimers, researchers in the field can still be found quoting "nfibs/sec" or something equally egregious, as if this refers to anything remotely interesting.

The April Computer Journal special issue on lazy functional programming is a not-too-dated self-portrait of the community that promotes computing in this way, and it is one that non-specialists are likely to see. There are three papers under the heading "Efficiency of functional languages".

The Yale group, reporting on their ALFL compiler, cites results for the benchmarks queens, nfib, tak, mm, deriv, tfib, qsort, and init, noting that several are from the Gabriel suite of LISP benchmarks. They say that results from these tests "indicate that functional languages are indeed becoming competitive with conventional languages".

Augustsson and Johnsson have a section about performance in their paper on the LML compiler. They consider some of the usual suspects (queens, fib, prime, and kwic), comparing against implementations of these algorithms in C, Edinburgh and New Jersey SML, and Miranda [1]. To their credit, the wondrous Chalmers hackers are somewhat apologetic, conceding that measuring performance of the compiled code "is very difficult".

Finally, Wray and Fairbairn argue for programming techniques that make essential use of non-strictness, and for an implementation (TIM) that makes these techniques inexpensive. Though they delve into a substantial spreadsheet-like example program, they do not report any actual measurements. However, they astutely take issue with the usual toy benchmarks: "There was in the past a tendency for implementations to be judged on their performance for unusually strict benchmarks."

[1] Miranda is a trademark of Research Software Ltd.

History of benchmarking: imperative

Our imperative-programming colleagues are not far removed from our brutish benchmarking condition. Only a few years ago, MIPS ratings, Dhrystones, and friends were all the rage: marketeers bandied them about shamelessly, compiler writers tweaked their compilers to spot specific constructs in certain benchmarks, users were baffled, and no-one learned much that was worth knowing. The section on performance in Hennessy and Patterson's standard text on computer architecture is an admirable expose of these shenanigans, and is well worth reading.

Then came the Standard Performance Evaluation Corporation (SPEC) benchmarking suite. The initial version included source code and sample inputs (workloads) for four mostly-integer programs and six mostly-floating-point programs. These are all either real programs (e.g., the GNU C compiler) or the kernel of a real program (e.g., matrix, floating-point matrix multiplication). Computer vendors have since put immense effort into improving their SPECmarks, and this has delivered real benefit to the workstation user.

Towards lazy benchmarking

The SPEC suite is the most visible artifact of an important shift towards system benchmarking. A big reason for the shift lies in the benchmarked systems themselves. Fifteen years ago, a typical computer system (hardware and software) probably came from one manufacturer, sat in one room, and was a computing environment all on its own.

An excellent discussion about benchmarking from the self-contained-systems era is Gabriel and Masinter's paper about LISP systems: "The proper
role of benchmarking is to measure various dimensions of Lisp system performance and to order those systems along each of these dimensions." A toy benchmark of the type I have derided so far can focus on one of these dimensions, thus contributing to an overall picture.

Much early measurement work in functional programming was of this plot-along-dimensions style; however, the concern was usually to assess a particular implementation technique, not the system as a whole. For example, Hailpern, Huynh, and Revesz tried to compare systems that use strict versus lazy evaluation. They went to considerable effort to factor out irrelevant details, hoping to end up with pristine data points along interesting dimensions. Hartel's effort to characterise the relative costs of fixed-combinator versus program-derived-combinator implementations was even more elaborate, using non-toy SASL programs.

So, can SPEC be seen as a culmination of good practice in benchmarking-by-characteristics? No. SPEC makes no effort to pinpoint systems along interesting dimensions, except for the very simplest: elapsed wall-clock time. An underlying premise of SPEC is that systems are sufficiently complicated that we probably won't even be able to pick out the interesting dimensions to measure, much less characterise benchmarks in terms of them. SPEC represents a shift to "lazy" benchmarking of systems. Conte and Hwu's survey confirms that, at least in computer architecture, this shift towards lazy, system-oriented benchmarking is supported as a Good Thing. The trend can also be seen in some specialised areas of computing: the Perfect Benchmarks for supercomputers (Crays, etc.) running FORTRAN programs, and the Stanford benchmarks for parallel shared-memory systems, are two examples.

Some serious benchmarks: nofib

We, the Glasgow Haskell compiler group, wish to help develop and promote a freely-available benchmark suite for lazy functional programming systems, called the nofib suite, consisting of:

- Source code for real Haskell programs to compile and run;
- Sample inputs (workloads) to feed into the compiled programs, along with the expected outputs;
- Rules for compiling and running the benchmark programs and, more notably, for reporting your benchmarking results; and
- Sample reports, showing by example how results should be reported.

Our (non-)motivations in creating this suite

Benchmarking is a delicate art and science, and it is hard work to boot. We have quite limited goals for the nofib suite, are hoping for lots of help, and are prepared to overlook considerable shortcomings, especially at the beginning.

Motivations

Our main initial motivation is to give functional-language implementors (including ourselves) a common set of real Haskell programs to attack and study. We encourage implementors to tackle the problems that actually make Haskell programs large and slow, thus hastening solutions to those problems. And, of course, because the benchmark programs are shared, it will be possible to compare performance results between systems running on identical hardware (e.g., Chalmers HBC vs. Glasgow Haskell, both running on the same Sun). Racing is the fun part!

Our ultimate motivation for this benchmark suite is to provide end users of Haskell implementations with a useful predictor of how those systems will perform on their own programs. The initial nofib suite will have no value as a predictive tool. Perhaps those with greater expertise will help us correct this. If necessary, we will gladly hand over the token for the suite to a more disinterested party.

We are very keen on (some might say paranoid about) readily-accessible, reproducible results. That is the whole point of the reporting rules elsewhere in this paper. Good-but-irreproducible benchmarking results are very damaging, because they lull implementors into a false sense of security.

After the initial pre-release of the suite, which will be for (possibly major) debugging, we intend to keep the suite stable, so that sensible comparisons can be made over time.
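In practice, the suite's components boil down to a simple discipline: feed a compiled program its sample workload and accept a timing only if the output matches the expected output exactly. The sketch below is purely illustrative (the file names are invented, and `wc -w` merely stands in for a compiled benchmark binary); the suite defines its own harness and rules.

```shell
# Minimal sketch of one benchmark run (all names are illustrative).
set -e

printf 'one two three\n' > bench.stdin     # sample input (workload)
printf '3\n' > bench.expected              # expected output, shipped with the suite

# "Run" the benchmark; 'wc -w' is a stand-in for the compiled program.
# (tr strips the leading padding some wc implementations emit.)
wc -w < bench.stdin | tr -d ' ' > bench.actual

# A result only counts if the output is byte-for-byte what was expected.
if cmp -s bench.expected bench.actual; then
    echo "bench: output matches"
else
    echo "bench: WRONG OUTPUT" >&2
    exit 1
fi
```

Timing and reporting would wrap around such a check; the point of the expected-output comparison is that a fast run producing wrong output is not a benchmark result at all.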
Having said that, benchmarks must change over time, or they become stale. It is difficult to brim with confidence about the Gabriel benchmarks for LISP systems; they are more than a decade old. Being forced to change a benchmark suite can be a mark of success: the SPEC people made substantial changes to their suite because so much work had gone into compiler tricks that improved SPEC performance results that some tests were no longer useful, notably the matrix test mentioned earlier.

Non-motivations

We are profoundly uninterested in distilling a single figure of merit (e.g., "MIPS") to characterise a Haskell implementation's performance. Initially at least, we are also uninterested in any statistics derived from the raw nofib numbers (e.g., various means, standard deviations, etc.). You may calculate and report any such numbers (all honest efforts to understand these benchmarks are welcome), but the raw underlying numbers must be readily available.

An important issue we are not addressing with this suite is inter-language comparisons: how does program X, written in Haskell, fare against the same program written in language Y? Such comparisons raise a nest of issues all their own; for example, is it really the same program when written in the two compared languages? This disclaimer aside, we do provide the nofib program sources in other languages if we happen to have them.

The Real subset

The nofib programs are divided into three subsets: Real, Imaginary, and Spectral (somewhere between Real and Imaginary). The Real subset of the nofib suite is by far the most important. In fact, we insist