<<

Compiling Scheme to JVM bytecode: a performance study

Bernard Paul Serpette Manuel Serrano Inria Sophia-Antipolis Inria Sophia-Antipolis 2004 route des Lucioles – B.P. 93 2004 route des Lucioles – B.P. 93 F-06902 Sophia-Antipolis, Cedex F-06902 Sophia-Antipolis, Cedex [email protected] [email protected] http://www.inria.fr/oasis/Bernard.Serpette http://www.inria.fr/mimosa/Manuel.Serrano

ABSTRACT (henceforth FLs), several papers describing such systems We have added a Java virtual machine (henceforth JVM) have been published (Haskell [36], Scheme [7, 23, 9], and bytecode generator to the optimizing Scheme-to- SML [5]). . We named thisnew compiler BiglooJVM. We have JVM bytecode isan appealing target because: used this new compiler to evaluate how suitable the JVM • It ishighly portable. Producing JVM bytecode, it is bytecode isasa target for compiling strict functional lan- possible to “compile once and run everywhere”. For in- guagessuch asScheme. In thispaper, we focuson the perfor- stance, the BiglooJVM Scheme-to-JVM compiler that manceissue.Wehavemeasuredtheexecutiontimeofmany ispresented in thispaper isimplemented on but Scheme programswhen compiled to C and when compiled theverysamebootstrappedversionalsorunsonMi- to JVM. We found that for each benchmark, at least one of crosoft Windows. our hardware platformsran the BiglooJVM version in less than twice the time taken by the Bigloo version. In order • The standard runtime environment for Java contains to deliver fast programs the generated JVM bytecode must a large set of libraries and features: widget libraries, be carefully crafted in order to benefit from the speedup of database access, network connection, sound libraries, just-in-time . etc. A compiler which compilesa language to JVM bytecode could make these fancy features available to Categories and Subject Descriptors programswritten in that language. D.3.1 [Programming Languages]: Language Classifica- • A lot of programming environmentsand toolsare de- tions—applicative (functional) languages;D.3.4[Program- signedfortheJVM.Somehavenocounterpartfor ming Languages]: Processors—compilers; I.1.3 [Symbolic other languages(especially the onesthat rely on the and Algebraic Manipulation]: Languages and Systems— high-level featuresof Java such asitsautomatic mem- evaluation strategies ory management). Producing JVM bytecode allows these tools to be reused. For instance, we have al- General Terms ready customized the Jinsight system [8] for profiling Scheme compiled programs. Languages, Experimentation, Measurement, Performance • The JVM isdesigned for hosting high-level, object- Keywords oriented languages. It providesconstructionsand facil- ities such as dynamic type testing, garbage collection Functional languages, Scheme, Compilation, Java virtual and subtype polymorphism. These runtime features Machine are also frequently required for FLs. 1. INTRODUCTION 1.1 Performance of JVM executions Many implementorsof high-level languageshave already In spite of the previously quoted advantages, compiling ported their compilersand interpretersto the Java Virtual toJVM(ortoJava)raisesanimportantissue:whatabout Machine [21]. According to Tolksdorf’s web page [34], there performance? Is it possible to deliver fast applications using are more than 130 compilerstargeting JVM bytecode, for a FL-to-JVM compiler? Current Java implementationsare all kinds of languages. In the case of functional languages known not to be very fast and, in addition, because the JVM isdesigned and tuned for Java, one might wonder whether it ispossible to implement a correct mapping from a FL to JVM without an important performance penalty. The aim Permission to make digital or hard copies of all or part of this work for of thispaper isto provide answersto thesequestions. personal or classroom use is granted without fee provided that copies are We have added a new code generator to the existing op- not made or distributed for profit or commercial advantage and that copies timizing Scheme-to-C compiler Bigloo. We call thisnew bear this notice and the full citation on the first page. To copy otherwise, to compiler BiglooJVM. Because the only difference between republish, to post on servers or to redistribute to lists, requires prior specific Bigloo and BiglooJVM is the runtime system, BiglooJVM permission and/or a fee. ICFP’02, October 4-6, 2002, Pittsburgh, Pennsylvania, USA. is a precise tool to evaluate JVM performance. We have used Copyright 2002 ACM 1-58113-487-8/02/0010 ...$5.00. it asa test-bed for benchmarking. We have compared the

259

performance of a wide range of Scheme programswhen com- C piled to native code via C and when compiled to JVM. We runtime have used 21 Scheme benchmarks, implemented by differ- ent authors, in different programming styles, ranging from .o a.out 18 lineslong programsto 100,000 lineslong programs(see Bigloo C Appendix A for a short description). We have tested these generator programson three architectures(Linux/x86, Solaris/Sparc Scheme Parser Optimizer Scheme and Digital Unix/Alpha). We have found the ratiosBigloo- module runtime

JVM/Bigloo quite similar on Linux/x86 and Sparc. This BiglooJVM JVM isnot surprising asboth configurationsusethe same JVM generator (HotSpottm 2.0). We have thusdecided not to present the .class a.jar Sparc results in this paper.

It would be pointless to measure the performance penalty JVM imposed by the JVM if our compiler is a weak one. In order runtime to “validate” the BiglooJVM compiler we have compared it to other compilers. In a first step, we have compared it to two other Scheme compilers: i) Kawa [7] because it is the Figure 1: The architecture of the Bigloo compiler. only other Scheme compiler that producesJVM code and ii) Gambit because it is a popular, portable and largely used benchmark (see Appendix A), when ran on a Linux/x86 Scheme-to-C compiler. In a second step, we have compared architecture wasmore than 100 timesslower than the cor- BiglooJVM to other compilersproducing JVM code: iii) responding C version. Modifying the produced JVM class MLj [5] and, iv) Sun’sJava compiler. We present these files, we have been able to speed up this benchmark and it is comparisons in Section 2. now “only” 3 times slower than the C version. Many small problems were responsible for the initial poor performance. 1.2 The Bigloo and BiglooJVM compilers For instance, we found that some sequences of JVM instruc- Bigloo isan for the Scheme program- tionsprevent the just-in-time compiler (JIT) from compiling ming language [18]. It compiles modules into C files. Its methods. We also found that on some 32-bit architectures, runtime library uses the Boehm and Weiser garbage col- using 64-bit integersisprohibitive. We found that some lector [6]. Each module consists in a sequence of Scheme JIT (Sun HotSpot and others) compilers have a size limit definitions and top level expressions. After reading and perfunctionabovewhichtheystopcompiling.Finally,we expanding a source file, Bigloo builds an abstract syntax found that some JIT compilers refuse to compile functions tree (AST) that passes from stage to stage. Many stages that include functional loops, that is, loops that push values implement high-level optimizations. That is, optimizations on the stack. that are hardware independent, such as source-to-source [27] We present our experience with JVM bytecode generation transformations or data flow reductions[29]. Other stages in this paper. Most of the presented information is indepen- implement simplification of the AST, such as the one that dent of the source language to be compiled so it could be compiles Scheme closures [25, 30, 26]. Another stage han- beneficial to anyone wishing to implement a compiler pro- dlespolymorphism[20, 28]. That is,for each variable of the ducing JVM bytecode. source code, it selects a type that can be polymorphic, i.e., thetypedenotingany possible Scheme value, or monomor- 1.3 Overview phic which can denote the type of primitive hardware values, In Section 2 we present the performance evaluation fo- such as integers or floating point double precision.AttheC cusing on the difference between Bigloo and BiglooJVM. code generation point, the AST isentirely type-annotated In Section 3 we present the specific techniques deployed in and closuresare allocated asdata structures. Producing the the BiglooJVM code generator and the BiglooJVM runtime C code requires just one additional transformation: the ex- system. In Section 4 we present the important tunings we pressions of the AST are turned into C statements. In the have applied to the code generator and the runtime system BiglooJVM, the bytecode generator takesthe place of this to tame the JIT compilers. In Section 5, we compare the C statement generator (Figure 1). BiglooJVM compiler with other systems. Reusing all the Bigloo compilation stagesisbeneficial for the quality of the produced JVM bytecode and for the sim- 2. PERFORMANCE EVALUATIONS plicity of the new resulting compiler. The JVM bytecode We present in this section the performance evaluation of only represents 12% of the code of the whole compiler (7,000 BiglooJVM.Inafirststep,wecompareittoBigloo(that linesof Scheme code for the JVM generator to be compared isthe C back-end of Bigloo). Then, we compare it with to the 57,700 linesof Scheme code for the whole compiler). other Scheme implementations. Finally, we compare it with The JVM runtime system is only 5,000 lines of Java code non-Scheme implementationsthat produce JVM bytecode. whichhavetobecomparedwith30,000linesofSchemecode that are common to both JVM and C back-ends. This can 2.1 Settings also be compared with 33,000 lines of C code (including the To evaluate performance, we have used two different ar- Garbage Collector which is23,000 lineslong) of the C run- chitectures: time system. The first versions of BiglooJVM, a mere 2,000 lines of • Linux/x86: An AMD/Thunderbird 800Mhz, 256MB, Scheme code, were delivering extremely poor performance running Linux 2.2, Java HotSpottm of the JDK 1.3.0, applications. For instance, the JVM version of the Cgc used in “client” mode.

260

BiglooJVM vs Bigloo (Linux/x86) BiglooJVM vs Bigloo (Alpha) 12345 12345 Bague 1.6 Bague 3.2 Beval 1.4 Beval 1.4 Bigloo 3.3 Boyer 1.3 Boyer 1.9 Cgc 3.5 Cgc 3.2 Conform 1.9 Conform 4.6 Earley 1.5 Earley 3.0 Fft 3.1 Fft 2.0 1.2 0.6 Fib Fib 1.7 2.0 Leval Leval 2.2 3.2 Maze Maze 1.5 4.1 Mbrot Mbrot 2.0 Nucleic 1.3 Nucleic Peval 4.3 Peval 0.9 Puzzle 1.5 Puzzle 1.2 Queens 1.0 Queens 1.2 Qsort 1.0 Qsort 1.3 Rgc 10.4 Rgc 1.5 Sieve 1.4 Sieve 1.2 Slatex 4.4 Slatex 1.3 Traverse 2.7 Traverse 1.2

Figure 2: Scores are relative to Bigloo, which is the Figure 3: Scores are relative to Bigloo, which is the 1.0 mark. Lower is better. 1.0 mark. Lower is better.

• Alpha: An Alpha/21264 500Mhz, 512MB, running Dig- stack consumption of the produced code. On the other hand, ital Unix 4.0F, Classic VM version 1.2.2-3. we explain in Section 3.3.1 the reason why Cgc isslower whenranwiththeJVMruntime.Comparingthe“bestof Each program wasran three timesand the minimum sum all platforms” (that is, considering the time gathered on of cpu + system time, asreported by the Unix times com- the Linux platform when the ratio BiglooJVM/Bigloo isthe mand, wascollected. For JVM executions,the bytecode smallest, considering the Alpha platform otherwise) demon- verifier is disabled (see Section 4.3). stratesthat there isno technical impossibility for a JVM Inthispaperwepresentratiosforwhichthebasevalue implementation to deliver performance comparable to the isthe execution time of the program when compiled to C one of C. The Compaq implementation on Alpha almost by Bigloo. In consequence, when studying a configuration achievesthislevel of performance. K,wepresenttheratioofK/Bigloo. The mark 1 isthe execution time of Bigloo; lower isbetter. 2.2.1 Variations on the JVM implementation It existsmany implementationsof the JVM. Some are 2.2 BiglooJVM vs Bigloo based on pure interpretation techniques [12]. Some are based Figures2 and 3 present the ratio of BiglooJVM/Bigloo on static compilation of JVM bytecode to native code [17]. for the Linux/x86 and Alpha platforms. The efficiency of Finally, some are based on just-in-time compilers [19, 1]. JVM executionsvary from one benchmark to another and Amongst the last category some use adaptive compilation [32, from one platform to another. However, if for each bench- 31, 3] which waspioneered by the Self project [14, 15]. It mark, we consider the configuration with the smaller Bigloo- consists in an on-demand compilation of the “hot spots” of JVM/Bigloo ratio, we found that BiglooJVM measurements a program. Interpreterscannot compete with compilersfor fall roughly between once and twice the time taken by Bigloo. performance thuswe have not tested any. In addition, we The only three exceptionsto thisrule are: i) Fib, ii) Bigloo have not considered testing static JVM compilers because (the bootstrap of the Bigloo compiler) and, iii) Cgc (a toy one of the most important advantage of JVM over C is the compiler that producesMipscode for a C-like language). ability to compile once and to run everywhere. Thisprop- Fib isfaster with the JVM implementation but tackling erty doesnot hold for static native compilation. In addition, with C compilation options, Bigloo can reach the speed of previousexperimentsshow that thiscompilation technique BiglooJVM. More precisely if we prevent the C compiler hasa moderate impact on performance [16]. from using a register as frame pointer (the gcc -fomit- For the Linux alone, more than ten im- frame-pointer option) then Bigloo performsaswell asBigloo- plementations of the JVM exist. It is beyond the scope JVM. Bigloo and Cgc benchmarksare slower when com- of thispaper to evaluate all of them. According to the piled to JVM with a deficient ratio of 3.2. It would have been Java performance report [22], two implementationsperform interesting to study the performance of Bigloo on the Al- best: Sun’s HotSpot (JDK 1.3) [32] and IBM JDK 1.3 [31]. pha platform, unfortunately, the virtual machine crashes for HotSpot seems to be the most reliable one on Linux so it is that benchmark because of stack overflow and we have found our base reference. In Figure 4 we present the performance no mean to extent the stack on this platform. We are cur- evaluationsof BiglooJVM on Linux/x86 using the differ- rently testing a new version of BiglooJVM that reduces the ent JVM implementations. Only HotSpot/client is able to

261

JVM implementations Scheme implementations HS/Client HS/Server IBM BiglooJvm Kawa Gambit

12345 1.6 0246810 Bague 0.7 0.8 1.6 1.4 Beval 2.3 Bague 6.8 1.5 3.3 1.4 Bigloo Beval 178.2 5.0 2.1 1.9 1.9 Boyer 1.2 7.0 1.9 Boyer 2.5 3.2 Cgc 3.7 4.6 Conform 5.4 4.6 Conform 3.4 3.0 3.0 Earley 43.0 3.0 6.3 Earley 1.3 2.0 2.8 Fft 276.2 2.0 72.6 Fft 1.3 0.9 0.6 140.3 0.6 Fib 1.1 Fib 0.5 0.4 2.0 2.0 Leval 4.0 Leval 6.0 4.4 3.4 3.2 3.2 Maze 2.3 Maze 3.8 2.9 4.1 4.1 Mbrot 5.0 Mbrot 589.7 2.2 25.4 1.3 1.3 Nucleic 1.4 2.3 Nucleic 2.4 4.3 4.3 Peval 2.3 6.2 1.7 Peval 2.5 1.5 Puzzle 1.5 1.5 1.0 Puzzle 329.2 1.0 1.6 Queens 0.8 1.0 Queens 46.8 1.0 2.7 Qsort 1.0 0.7 1.0 341.2 10.4 Qsort Rgc 10.9 9.8 1.4 1.4 Sieve 14.6 Sieve 1.1 1.9 4.0 4.4 4.4 5.0 Slatex 5.1 Slatex 2.9 2.8 2.7 2.7 Traverse 2.5 Traverse 5.8 4.2 6.1

Figure 4: BiglooJVM with Sun’s JDK, with Sun’s Figure 5: BiglooJVM, Kawa and Gambit vs Bigloo JDK/server and with IBM’s JDK vs Bigloo on on Linux/x86 architecture. Lower is better. Linux/x86 architecture. Lower is better. run all the benchmarks. Missing values denote a JVM fail- ure. It seems that these three machines implement different turn generic arithmetic function callsto specialized arith- adaptive compilation strategies because they deliver differ- metic calls. So, Kawa programs always use generic arith- entperformance.Wecannotconcludethatonemachine metic. Thisisa huge handicap. always deliversthe bestperformance. We can thusonly use For all but two programs, BiglooJVM is faster than Gam- thistestto emphasize that: i) performance of JVM ishardly bit. Gambit isnot mainly designed for maximal efficiency. It implementsthe exact definition of Scheme when Bigloo predictable and highly dependent of the hardware/software 1 platforms, and ii) comparing all the “best” JVM measured doesnot . In particular, Gambit performance isaffected by execution timesto the C execution times,we found that the implementation of tail recursive calls. This speed test only one benchmark took more than twice aslong when run between Gambit and Bigloo thusonly demonstratesthat on the JVM. BiglooJVM deliver programswith “acceptable” performance since Gambit is considered by the Scheme community as an 2.3 BiglooJVM vs Scheme implementations efficient enough system. In consequence, we think that if Figure 5 presents a comparison of BiglooJVM, Kawa v1.6.7 speed isnot an implementor’shighestpriority, the JVM and Gambit v3.0 (unsafe mode and optimizations enabled) should be considered. on Linux/86. Kawa producesJVM bytecode, Gambit pro- 2.4 BiglooJVM vs other languages ducesC code. For all testedbenchmarksBiglooJVM is significantly faster than Kawa. One may notice that since It isalwaysdifficult to compare compilersfor different Kawa unfortunately doesnot provide specialized arithmetic 1Bigloo isnot fully Scheme compliant mainly because it is operations, slow executions are therefore to be expected. not properly tail-recursive and because it does not imple- The Kawa compiler doesnot seemable to automatically ment arithmetic bound checking.

262

programming languages. Because the programs cannot be code is even slower. To us, this demonstrates that Bigloo- the very same, we can fear that a small difference in the im- JVM makesefficient useof the JVM. plementation invalidatesthe test. For instance, Scheme and SML share a lot of features and constructions: they both 3. THE BiglooJVM BACK-END are functional languages, promoting first class closures and polymorphism, using garbage collection, etc. However the In this section we present the most important points of programming styles of these two languages differ. It is thus the compilation of Scheme, i.e. a to be expected that a compiler for Scheme and a compiler language using dynamic type checking, to JVM bytecode. for SML optimize different patterns of source expressions. We start, in Section 3.1, by presenting the compilation of It islikely that a Scheme program written using an SML functions. We then discuss in Section 3.2 the implementa- style would not be efficiently optimized by a Scheme com- tion of dynamic type checking. We conclude in Section 3.3 piler and vice-versa. In order to avoid these pitfalls we have with a description of data representation. only tested small and simple programs that we have tried to implement using the natural style of each programming lan- 3.1 Compiling functions guages. In addition, we have slightly modified some bench- Bigloo usesthree different frameworksfor compiling func- marksin order to prevent the Bigloo compiler from applying tions. We first present in Section 3.1.1 how local functions too aggressive optimizations. For instance, the original ver- canbecompiledtoloops.TheninSection3.1.2weshow sion of the Mbrot benchmark wasrunning 40 timesfaster the general framework applied to functions. In Section 3.1.3 when compiled by BiglooJVM than when compiled by Javac, we present how does Bigloo handle closures. Finally, in Sec- the Sun’sstandard Java compiler. Thiswasonly due to in- tion 3.1.4 we present limitations of our JVM back-end. lining and constant folding. Since we do not want, in this paper, to emphasize compiler techniques and optimizations 3.1.1 Compiling functions to loops but runtime system efficiency, we have found relevant to Closure analysis aims at compiling Scheme functions into write a “neutral” benchmark version that prevents Bigloo JVM loops. This framework applies when local Scheme optimizations. However, one should note that because of functions do not escape (that is, when functions are not used the Java object model and because of dynamic class load- as first-class values) and when these functions are always in- ing, a compiler for FLsto JVM bytecode islikely to have voked tail-recursively (see Section 3.1.4 for a discussion of more opportunitiesto optimize the source code than a Java generalized tail recursion). Here is an example of two mu- compiler. We present the comparison between BiglooJVM, tually recursive functions that map to JVM loops: MLj and Javac compilersFigure 6. (definedefine (odd? x) On the five small programs we have used, we have found (letrecletrec ((even? (λ (m) that BiglooJVM and MLj (version 0.2c) deliver programs (ifif (= m 0) #t (odd? (- m 1))))) with comparable performance. (odd? (λ (n) (ifif (= n 0) #f (even? (- n 1)))))) (odd? x))) JVM code generators iscompiled as: BiglooJvm MLj Javac (method odd? (x) (n m) (iload x) (istore n) 1234 L1 (iload n) ;; beginning of odd? 1.6 (iconst_0) Bague 1.3 (if_icmpne L2) ;; compare the argument n with 0 0.6 (iconst_0) ;; return (for odd?)false 1.4 Fib 0.9 (ireturn) 4.1 L2 (iload n) ;; prepare the actual value for Mbrot 3.3 (iconst_1) ;; the even? function 1.3 (isub) 1.3 Nucleic (istore m) 1.0 (iload m) ;; even? is inlined here 1.4 Queens (iconst_0) 1.0 (if_icmpne L3) ;; compare the argument m with 0 Qsort 1.9 1.0 (iconst_1) ;; return (for even?)true 1.4 (ireturn) Sieve 1.3 1.5 L3 (iload m) ;; otherwise prepare the call to odd? (iconst_1) (isub) Figure 6: BiglooJVM, MLj and Javac vs Bigloo on (istore n) (goto L1)) ;; goto odd? Linux/x86 architecture. Lower is better. Asillustrated by the Qsort or Sieve benchmarksthat use loops extensively, this compilation framework delivers good performance. We decided to perform these tests for Java too, even though we assumed that a Java compiler would produce 3.1.2 Compiling functions to JVM methods better code for the JVM than BiglooJVM. Surprisingly, our When the previousscheme doesnot apply, non-escaping programsperformed comparably to thoseproduced by Sun’s n-ary functionsmap to JVM n-ary static methods. Consider Javac (JDK 1.3). In fact, on the Sieve benchmark Sun’s the Scheme fib definition:

263

(definedefine (fib x) package bigloo; (ifif (< x 2) public abstract class procedure { 1 public int index, arity; (+ (fib (- x 1)) (fib (- x 2))))) public Object[] env; public procedure( int arity, int index ) { ... }; It iscompiled as: public abstract Object funcall1( Object a1 ); 1: (method fib (x) () 11: (isub) public abstract Object funcall2( Object a1, Object a2 ); 2: (iload x) 12: (invokestatic fib) } 3: (iconst_2) 13: (iload x) One should keep in mind that according to Sections 3.1.1 4: ;; test with numeral 2 14: ;; second recursive call 5: (if_icmpge 9) 15: (iconst_2) and 3.1.2, thisgeneric compilation framework isused only 6: (iconst_1) 16: (isub) when the compiler is unable to statically discover which 7: (ireturn) 17: (invokestatic fib) functionsare called at an application site. However, we 8: ;; first recursive call 18: (iadd) found that some JITs are unable to produce code with a 9: (iload x) 19: (ireturn)) constant complexity for the tableswitch instruction (see 10: (iconst_1) Section 4.4.3). The good performance of the Fib, Beval and other bench- An alternative compilation isusedby Kawa [7], where marks using non tail recursive calls shows the efficiency of each function used as value is compiled into a new JVM thiscompilation framework. The JVM’sstatic method in- class. This compilation framework gets rid of tableswitches. vocation isasfastasC’sfunction invocation. Our different solution is motivated by reducing the num- ber of generated classes. For instance, the Kawa technique 3.1.3 Higher-order and Heap-allocated closures would yield to more than 4,000 classes to compile the clo- Scheme isa higher-order language: functionsare first- sures of the Bigloo bootstrap, with each class been compiled class objects which can be passed as arguments, returned by intoaseparate.class file! functionsandstoredinmemory.Thefront-endofthecom- Moreover, in the current Bigloo version, each global that piler resolves the lexical visibility of variables inside func- is declared to be accessible from the Scheme interpreter, is tionsby making explicit closureson demand. The back-end associated with two Bigloo functions; one to get the value of still has to implement an abstract type pointer-to-function the global, one to set a new value. This would also increase (defined in a straightforward manner for the C back end). significantly the number of generated classes. Fortunately, thisabstract type wasdesigned to appear only in specific functions (closure creators) and is used indirectly 3.1.4 Call/cc and tail recursion by specific special forms (funcall and apply nodes). This BiglooJVM imposes one restriction to Bigloo: continu- allowsusto implement pointer-to-function asan index (in- ationscan only be invoked in the dynamic extent of the teger) into a switch tablesfor the JVM back-end. Thistech- call/cc expression from which they have been reified. This nique issimilar to the one deployed in Gambit to implement restriction comes from the design of the JVM instruction set tail recursive calls[11]. One unique index isallocated for and from the bytecode verifier. The JVM doesnot provide each static closure creation on a compilation unit (a Bigloo ameantosaveandrestoreexecutionstacks. module). Indexesmay be identical for 2 different modules The lack of general stack operatorsalsomakesvery dif- (since one switch table has been allocated per module). Con- ficult a “correct” JVM implementation of tail calls. In the sider the simple following Bigloo module example: current BiglooJVM version, only recursive calls to local func- tionsare correctly handled, i.e., without stack consumption (modulemodule test (exportexport app0 app1)) (definedefine (app0 f l) (f l)) (see Section 3.1.1 for an example and a previous article [29] (definedefine (app1 f l) (f (reverse l))) bytheauthorsformoredetails).Thesamerestrictionap- pliesto the C Bigloo runtime. In addition, we think that the The compiler generatesa classfile similar to: trampoline technique [4, 33, 11] isinfeasible for the current public class test extends bigloo.procedure { JVM implementation, as it would cause an excessive per- public static procedure app0 = new test( 2, 0 ); formance penalty (see Section 3.1.3). Performance of JVM public static procedure app1 = new test( 2, 1 ); executionshighly depend on the performance of embedded public test( int arity, int index ) { super( arity, index ); JIT compilers. Because executed at application runtime, } these compilers have to be extremely fast. They only have public Object funcall2( Object a1, Object a2 ) { the opportunity to deploy simple and local optimizations. switch( this.index ) { It isthuslikely that JIT compilerswill fail at optimizing case 0: return anonymous0( this, a1, a2 ); pure trampolined code, which isan unusual style of JVM case 1: return anonymous1( this, a1, a2 ); bytecode. On the other hand, a recent study shows a vari- default: funcall_error2( this, a1, a2 ); } ant of trampolined code that appearsto fit the requirement } of JIT compilersand that could be usedto implement tail private static Object anonymous0( recursion under the JVM [24]. procedure fun, Object a1, Object a2 ) { return ((procedure) a1).funcall1( a2 ); 3.2 Dynamic typing } private static Object anonymous1( TheSchemetypesystemmaybeviewedasahuge,un- procedure fun, Object a1, Object a2 ) { limited union type. Since the JVM doesnot provide union return ((procedure) a1).funcall1( reverse( a2 ) ); types or parametric types, the usual way to mimic union } types is to use subtyping. In order to specify a union type } T which isa sum of some Ti, it is usual to declare a class Where procedure isa runtime Bigloo classdefined as: T and implement all Ti as subclasses of T . Type-dependent

264

behaviors take the form of methods added to each subclass. Boxed integers and 3 words-large pairs Thisstrategy impactsearlieststagesof the compiler. This BiglooJVM Bigloo32 Bigloo32+ wouldhaveforcedwidespreadchangestotheexistingBigloo source code, and we decided not to use this design. In- 1234 stead, we have used the JVM instruction instanceof that 1.6 Bague 0.9 implementsdynamic type checking. Asin the caseof the 0.9 1.4 tableswitch instructions, we have found that some JITs Beval 1.1 1.1 fail to compile instanceof efficiently. For instance, Sun’s 3.3 Bigloo 1.1 JDK 1.3 on Linux/x86 is40 timesslower to answer “no” 1.1 1.9 than to answer “yes” to instanceof. Thisexplainssome Boyer 1.0 1.5 distortions on the benchmarks results. As for tableswitch, 3.2 Cgc 3.4 the Alpha implementation of the JDK doesnot seemto suf- 3.3 4.6 Conform 1.0 fer from thisproblem. 1.2 We plan to get rid of the instanceof instruction by adding 3.0 Earley 1.1 a specific tag to all Scheme objects. The drawback of this 1.6 2.0 Fft 1.2 implementation isthat all foreign valuesrequire wrapping 1.2 in a specific Scheme type. We hope that the SUA optimiza- 0.6 Fib 0.9 tion [28] will do the same nice job asfor boxed integers(see 0.9 2.0 Leval 1.0 Section 3.3.1). 1.2 3.2 Maze 1.3 3.3 Data representation 1.6 4.1 Mbrot 0.8 We present in this section the BiglooJVM data represen- 0.8 1.3 tations. We start showing in Section 3.3.1 the main differ- Nucleic 1.2 1.2 encesin the representation of data structuresbetween the C 4.3 Peval 1.0 runtime and the JVM runtime. Then, in Section 3.3.2, we 1.5 1.5 focuson the representation of integersin the JVM runtime Puzzle 1.1 1.0 system. 1.0 Queens 1.1 1.8 1.0 3.3.1 Tagging, boxing and pointers alignment Qsort 1.0 1.0 10.4 Some well-known C tricksenable faster implementations Rgc 1.0 of dynamic type identification [13]. These techniques rely 1.1 1.4 Sieve 1.8 on hardware alignment requirements. Because pointers are 2.9 4.4 aligned on four- or eight-byte boundaries, the two or three Slatex 1.0 least significant bits of values can be used to encode type 1.0 2.7 Traverse 1.9 information. In the Bigloo C back-end we use these bits for 3.1 two purposes: i) using tagged instead of boxed representa- tion for integers, and ii) using lightweight two-words repre- sentation for pairs. Because Java is a safe high-level lan- Figure 7: BiglooJVM, Bigloo C with boxed inte- guage it doesnot provide operationson pointersand thus, gers (Bigloo32) and Bigloo C with boxed integers these C tricks do not apply to the BiglooJVM runtime sys- and 3 words long pairs (Bigloo32+) vs Bigloo on tem. In order to estimate the impact of these techniques, Linux/x86 architecture. Lower is better. we have built two additional versions of the regular Scheme- to-C version of Bigloo: i) the first one, called Bigloo32, is identical to Bigloo but integersare 32 bitslong and boxed; ii) the second one, called Bigloo32+ is identical to Bigloo32 but pairsare implemented using3 memory words(one more bit-vector operations become expensive because new inte- than with Bigloo and Bigloo32). gershave to be allocated. That is,each time a setoperation Figure 7 comparesBiglooJVM and thesetwo new ver- iscomputed such asan union, a disjunction, etc., new inte- sions. We see that Cgc is highly sensitive to integer boxing. gersare allocated. One may argue that thisproblem isdue When compiled to C using boxed integers, this program is to the poor implementation of Cgc.ItisclearthatCgc no longer faster than its JVM counterpart. Cgc.Mostof could be easily improved. However, we wanted to use this itsexecution time isspent in a library of bit-vectors. These program “asis”becausethiscurrent versionpointsout one are used for data flow analyses such as liveness or reach- real problem of the JVM version. If integers cannot be un- ability property computations. In Cgc the bit-vectorsare boxed by a compiler, the JVM runtime system performance implemented using vectors of integers. Because of the pro- couldbemuchslowerthantheoneofC. gramming style of Cgc, the Bigloo compiler isunable to demonstrate that these vectors cannot contain non-integer 3.3.2 Integer arithmetic values. Thus, it cannot optimize these vectors. That is, it As presented Section 3.3.1, BiglooJVM uses heap-allocated cannot replace them with native integer vectors. With the C boxed integerssuch asJava Integer instances. Boxing in- tagged version of Bigloo this is not a performance problem. tegersisexpensive but preallocating small onesavoid mem- The main consequence of tagging isthat a couple of bitsis ory consumption and memory allocation in most cases (for wasted for each of the word of the bit-vectors, making the example all 256 Scheme charactersare preallocated). For vectors slightly larger. With the boxed version of integers, 32-bit integerswe have preallocated integersin the range

265

[-100...2048].FortheBigloo benchmark, only 3.8% of some JIT compilers. Since JIT compilers such as HotSpot integersare outside thisrange. Thanksto the SUA [28], usemethodsascompilation units,it isimportant to remove an optimization that enablesunboxing for polymorphic lan- such sequences; otherwise, the whole function that contains guages, fewer than 4% of the static arithmetic operations it isleft uncompiled, i.e., interpreted. (including loading constants) need boxing for this bench- The Bigloo compiler already implementssome data flow mark. analysis. However, we have added three new transformation rulesfor the JVM bytecode generator that are enabled when Impact of 64 bits arithmetic the compiler isable to prove that expressionsare side-effect free: JDK 1.3 JDK 1.2 JDK 1.1 1. Let bindingswith no bound variablesare removed:

0246810

5.0 .. .. bague 1.1 . . 1.3 R1 (let () expr) = expr 10.9 Beval 1.0 . . .. 1.0 . . 10.9 Cgc 1.4 1.3 3.1 2. Let-bound variablesthat are only used in a function Conform 1.0 1.0 called are reduced: 3.2 Earley 1.1 1.1 8.4 Fft 1.2 1.3 ...... 1.0 ... expr ...... Leval 1.0 R (let ( (x ) ) = (let ( ) 1.0 2 (f ... x ...)) (f ... expr ...)) 1.6 . . .. Nucleic 1.0 . . 0.9 4.9 Queens 0.9 0.8 1.1 3. Let-bound variables used only in a test position are re- Rgc 1.0 1.1 duced:

. . Figure 8: BiglooJVM-64 bits/BiglooJVM-32 bits on ...... expr ...... Linux/86. Lower is better. R (let ( (x ) ) = (let ( ) 3 (if x ...)) (if expr ...)) ...... Aside from boxing, some choices have to be made for the size of basic types. These choices are directed by the target languages. For C, sizes are unspecified but the long type is None of these rewriting rules are difficult to implement and usually large enough to contain any address. By contrast, they are well-known [2, 35]. However, they are important the JVM specifies the length of all basic types. In order to for our JVM back-end. For instance, disabling rule R2 slows choose between 32-bit integers and 64-bit arithmetic inte- down the Bigloo benchmark by more than 20%, because gerswe have testedthe two settingswith the Sun’sJDK on the pattern reduced by rule R2 isfrequently introduced by Linux/x86. Thistestispresented Figure 8. Because64-bit earlier Bigloo compilation stages. arithmetic in JDK1.3 hassuch a major performance penalty, we decided to use 32-bit arithmetic on all platforms. Using thisconfiguration, JDK 1.3 isthe fastestJVM we have found 4.2 Functional loops on Linux/x86 and Solaris/Sparc. On the one hand, functional languagesare expression ori- ented languages. On the other hand, Java, as C, is statement- 4. JVM BYTECODE TUNING oriented. Scheme loops( e.g., Scheme inner functions) are Because of the currently leading technology relying on dy- expressions, that is, there evaluations produces a value. This namic compilation for executing JVM programs, special at- feature isdepicted in thisnaive example: tentionmustbepaidtoproduceJVMbytecodethatpleases the JIT compilers. We report the most important facts we (definedefine (test a n b) ;; compute a + b * n have learned during our work on BiglooJVM. (definedefine (mul m r) (ifif(=m0)r(mul(-m1)(+br)))) 4.1 Data flow optimizations (+ a (mul n 0))) Data flow optimizationsturn out to be important. We In a first version of the back-end we had left the value of have found that certain JVM bytecode sequences disable a in the JVM stack. The generated bytecode was:

266

(method test (a n b) (r m) downward casts. Event if for instance, in order to comply (iload a) ;; Prepare a for addition (iload n) ;; Prepare arguments for mul:m to the JVM bytecode verifier the following Scheme function (iconst_0) ;; ...: r definition: muls (istore r) ;; stack entry for mul, store in reg r 1: (definedefine (first x) (istore m) ;; ... reg m 2: (condcond mul (iload m) ;; reg entry for mul 3: ((pair? x) (car x)) (iconst_0) 4: ((vector? x) (vector-ref x 0)) (if_icmpne l1) ;; m == 0 ? 5: (elseelse #f))) (iload r) ;; return loop with r mustbecompiledsuchas: (iadd) ;; continuation of the loop (ireturn) 1: (method first (x) () 10: (instanceof [jobject) l1 (iload m) ;; Prepare arg 1 of mul:m-1 2: (aload x) 11: (ifeq 17) (iconst_1) 3: (instanceof pair) 12: (aload x) (isub) 4: (ifeq 9) 13: (checkcast [jobject) (iload b) ;; Prepare arg 2 of mul:b+r 5: (aload x) 14: (iconst_0) (iload r) 6: (checkcast pair) 15: (aaload) (iadd) 7: (getfield car) 16: (areturn) (goto muls)) ;; back to stack entry of mul 8: (areturn) 17: (getstatic bfalse) 9: (aload x) 18: (areturn)) We have found that someJITsdo not compile thispro- gramming style and we now require that all loops (back- That is, if the compiler is able to prove that the type of x isa wardbranches)mustbedoneina“statement”style,with pair line 3 or a vector line 4 depending on the branch of the an empty stack. (One should note that such loops used as cond, it still has to enforce a dynamic type test when fetch- expressions cannot be produced by a Java compiler because ing valuesfrom the data structure,such aslines 6 and 13. Java is a statement-based language.) As a result, the current One may enable or disable the bytecode verifier when run- BiglooJVM version now produces: ning JVM programs. Because the bytecode generated pro- (method test (a n b) (reg1 r m save) duced by BiglooJVM isnot bytecode verifier compliant all (iload a) ;; Prepare a for addition the previously given time figures have been gathered with (iload n) ;; Prepare arguments for mul:n the bytecode verifier disabled. BiglooJVM provides an op- (iconst_0) ;; ...: r tion to generate code that complieswith the verifier. We muls (istore r) ;; stack entry for mul: store in reg r have not yet implemented the optimization which removes (istore m) ;; .. reg m unnecessary checkcast instructions. We have noticed that (istore save) ;; SAVE THE STACK mul (iload m) ;; reg entry for mul the current overhead of the bytecode verifier compliance (iconst_0) rangesfrom 10% to 55%. We think that thisslowdown is (if_icmpne l1) ;; m == 0 ? acceptable,sowehavepostponedtheworkonthistopic. (iload r) ;; return loop with r (istore reg1) ;; RESTORE THE STACK 4.4 JVM idiosyncrasies (iload save) ;; RESTORE THE STACK We have noticed that current JVM implementationsare (iload reg1) ;; RESTORE THE STACK (iadd) ;; continuation of the loop highly sensitive to the shape of the generated bytecode. In (ireturn) this section we present three JIT related problems we have l1 (iload m) ;; Prepare arg 1 of mul:m-1 encountered and the solution we have applied for two of (iconst_1) them. (isub) (iload b) ;; Prepare arg 2 of mul:b+r 4.4.1 Startup time (iload r) (iadd) In order to gather accurate time informations, we have (istore r) ;; prepare for reg entry point of mul considered long lasting benchmarks. All but Bigloo and (istore m) Cgc benchmarks, when compiled to C, last more than 10 (goto mul)) ;; back to reg entry of mul seconds on our AMD Athlon 800 Mhz. However, we must In order to perform thistransformation, the code genera- point out that JVM applicationshave long startup times. tor must know the type of stack elements at each loop entry. For instance, when Bigloo isinvoked with the -help op- The size of the stack is not enough since for each element we tion which displayscommand line optionsand exits,the have to know if we have to generate an istore or an astore C time isabout 0.02swhile the JVM time is2.52s. That instruction. Even if more complex, this second version runs is, the startup time of the JVM version is 126 slower than 18 timesfasteron SUN JDK1.3 or 1.2 than the former one! the C one! This extremely slow startup time explains the poor result of the overall Bigloo benchmark for the JVM 4.3 Bytecode verification version. Prohibitive startup times are a problem for short- Since all functions(or more generally methods)musthave lasting applications. For instance, it is currently impossible a prototype (declaration of argumentsand return type), the to use BiglooJVM to implement shell-like commands such JVM may be considered as a strongly typed language. These as ls, cat, etc. We have no solution to this problem yet. prototypes are used for two main reasons. First, it is possi- ble to define two different functionswith the samename and 4.4.2 Method size considerations use the prototype as an additional key for linking purposes. We have noticed that the current JIT compilersusea Thisisnot to be confused with Java overloading which is compilation threshold. In particular, we have experimen- resolved statically at compile time. Second, prototypes are tally found out that Sun’sHotSpot stopscompiling func- used by the bytecode verifier at runtime. Bytecode veri- tionsfor which the body islarger than 8000 JVM bytecode fier compliant programsmustexplicitly, at runtime, check instructions. Expectingly, some of our benchmarks suffer

267

from this strict limitation. The most significant one is Rgc cated which leadsto the following implementation of eq?: which performs miserably with HotSpot (see Figure 2). This boolean EQ( Object o1, Object o2 ) { benchmark uses Bigloo regular grammars. These lex-like ifif( o1 == o2 ) return true; grammarsare compiled to deterministic finite automata that returnreturn( ( o1 instanceof BINT ) && are, in turn, compiled into Scheme functions. Each grammar ( o2 instanceof BINT ) && iscompiled into one global Scheme function for which the ((BINT)o1.value == (BINT)o2.value) ); } states of the automaton are implemented using local func- tions. Because all function calls to these local functions are If equality isnot considered for integer, eq? could be: tail calls, the Bigloo compiler is able to inline or integrate boolean EQ( Object o1, Object o2 ) { them (see Section 3.1.1). In consequence, for one regular return o1 == o2; grammar, Bigloo producesexactly one JVM method. The } Rgc benchmark usesalarge grammar that implementsa When, thanksto the SUA, the Bigloo front-end succeeds complete Scheme reader (the actual Bigloo reader). When in demonstrating that either one of the eq? argumentsis compiled to JVM thisgrammar uses8252 JVM instructions. not an integer, thisfastimplementation ispreferred to the That is252 instructionsmore than the accepted number! In former one. In order to minimize the performance slow down consequence the JVM method resulting of the compilation of the compatibility between Bigloo and BiglooJVM we have of the regular grammar isleft uncompiled by HotSpot. All added a compilation flag that switches from compatible to the execution time of the benchmark isspent in the com- incompatible eq?. For all benchmarks, we have disable eq? piled regular grammar. Thisexplainswhy that benchmark compatibility. performssobadly with thisJava virtual machine. We hope that thisproblem will be fixed in the HotSpot version. 5.2 The MLj compiler A JVM flag enabling user customization of the compilation MLj isan optimizing compiler for Standard ML producing threshold would be a workaround to this problem. JVMbytecode.BiglooJVMandMLjshareamaingoal: efficiently compiling a strict functional language to JVM. 4.4.3 JVM compilation of tableswitch The paper [5] concentrateson the front-end of that compiler WefoundthatsomeJITsareunabletoproducecodewith and doesnot detail itsbytecode production. The present aconstantcomplexityforthetableswitch instruction. For paper focuses on the runtime system. The two papers are example, on Linux/x86, using Sun’s JIT, tableswitch im- thuscomplementary. plementation islinear in time. More precisely, we found that It seemsthat MLj usestype information to avoid boxing the execution time of tableswitch is17 .68 ∗ index +34.39 when Bigloo uses a global analysis. The MLj analysis is nanoseconds. On the same platform, iadd wasbenchmarked too briefly described. For this reason, we cannot compare around 14 nanoseconds. When index isnot in the specified the respective accuracy of the two approaches. The most range, the longesttime isrequired. Thispoor compilation of important difference between MLj and BiglooJVM seems to tableswitch explains some distortions on the benchmarks be that MLj does not support separate compilation while results. One should note that stable timing results were Bigloo does[5]. The authorsof MLj state that “...the rea- measured on the Alpha platform. The efficiency of this sonable [MLj] performance has only been achieved at the compilation framework isdemonstrated by the good per- price of high compile timesand a limitation on the size of formance of benchmarksusing higher order functions( e.g., programswhich may reasonably be compiled. Our deci- Queens, Leval). sion to do whole-program optimization is certainly contro- versial...”. As reported in Figure 6 it seems that Bigloo- 5. RELATED WORK JVM producescomparable performance without thislimi- tation. In addition, BiglooJVM supports separate compi- In thissection we compare BiglooJVM with two others lation. So, the BiglooJVM, which isa program of about Scheme to JVM compilers, Kawa and with a ML to JVM 100,000 linesof code (the compiler plusthe runtime system) compiler, MLj. isbootstrapped. On the other hand, the good performance 5.1 The Kawa Scheme compiler of BiglooJVM isachieved at the price of producing code that doesnot comply with the bytecode verifier while MLj does Kawa isanother Scheme compiler to JVM. Kawa imple- not suffer from this limitation. Using a compilation option, mentation techniquesdiffer from BiglooJVM in many re- BiglooJVM producesbytecode verifier compliant code but spects. For instance, Kawa closures are implemented by thiscurrently slowsdown executions. the means of classes (see Section 3.1.3). Kawa strings dif- fer from BiglooJVM strings. Kawa maps Scheme strings into Java strings, while BiglooJVM maps them into array of 6. CONCLUSION bytes. Java strings are not suitable for implementing Scheme We have added a new JVM bytecode generator to the strings because the former are immutable while the later are Bigloo Scheme-to-C optimizing compiler. In order to esti- mutable. mate, from a performance point of view, how suitable JVM The Kawa eq? predicate hasa different meaning than the isasa target for compiling strict functional languages,we BiglooJVM one’s. Kawa eq? is false for integers, which is compared the applicationswhen run on the C-basedand the a correct implementation, according to the Scheme Revised5 JVM-basedruntimesystems. Wehavetestedseveralimple- Report [18]. Bigloo implementsexact eq? for integers. Thus, mentationsof the JVM on several different hardware archi- forthesakeofcompatibilitybetweenthetworuntimesys- tectures. Given a code generator that is carefully written tems,wehavedecidedtoprovidethesamesemanticsforeq? and aware of the JVM’s idiosyncrasies, we found that most with BiglooJVM. Providing integers eq? impactsthe perfor- of the JVM compiled applicationsare no more than twice as mance of eq?. Asreported Section 3.3.2, integersare allo- slow as compiled C applications. The performance of JVM

268

has not ceased to increase significantly since the first imple- [17]C.-H.Hsieh,J.Gyllenhaal,andW.Hwu.Java bytecode mentation. If JVM implementationskeep improving at this to native code translation: The Caffeine prototype pace, we think that in a near future, JVM bytecode may and preliminary results.InIEEE/ACM Int’l become a true performance challenger. Symposium on Microarchitecture, 1996. [18] R. Kelsey, W. Clinger, and J. Rees. The Revised(5) Report on the Algorithmic Language Scheme. ACKNOWLEDGMENTS Higher-Order and Symbolic Computation, 11(1), September 1998. Many thanksto Christian Queinnec, Xavier Leroy, John [19] A. Krall. Efficient JavaVM Just-in-Time Clements, and to C´eline for their helpful feedbackson this Compilation.InProceedings PACT’98,Paris,France, work. October 1998. [20] X. Leroy. Unboxed objects and polymorphic typing. 7. REFERENCES In Symposium on Principles of Programming Languages, pages 177–188, Albuquerque, New Mexico, January 1992. [1]A.Adl-Tabatabai,M.Cierniak,G-Y.Lueh,V.Parikh,and [21] T. Lindholm and F. Yellin. The Java Virtual Machine. J. Stichnoth. Fast,EffectiveCodeGenerationina Addison-Wesley, 1996. Just-In-Time Java Compiler.InConference on [22] O. Pinali Doederlein. The Java Performance Report – Design and Implementation,pages Part IIA. http://www.javalobby.org/features/jpr, August 280–190, June 1998. 2000. [2]A.Aho,R.Sethi,andJ.Ullman.Compilers: Principles, [23] C. Queinnec. The influence of browsers on evaluators. Techniques and To ols. Addison-Wesley, 1986. In Int’l Conf. on Functional Programming, pages 23–33, [3]M.Arnold,D.Fink,M.Hind,andP.Sweeney.Adaptive Montr´eal, Canada, September 2000. OptimizationintheJalape˜no JVM.InConference on [24] M. Schinz and M. Odersky. Tail call elimination of the Object-Oriented Programming, Systems, Languages, and Java Virtual Machine.InProceedings of Babel’01, Applications,Minneapolis,USA,October2000. Florence, Italy, September 2001. [4] H. Baker. Should Not CONS Its Arguments, [25] N. S´eniak. Th´eorieetpratiquedeSqil:unlangage < > Sigplan Part II: Cheney on the M.T.A 1 . Notices, interm´ediaire pour la compilation des langages 30(9):17–20, September 1995. fonctionnels. PhD thesis, Universit´ePierreetMarieCurie [5] N. Benton and G. Kennedy, A. Russel. Compiling (Paris VI), November 1991. Standard ML to Java Bytecodes.InInt’l Conf. on [26] M. Serrano. Analysis: a Functional Functional Programming,1998. Languages Compilation Paradigm.In10th Symposium [6]H.J.BoehmandM.Weiser.GarbageCollectioninan on Applied Computing, pages 118–122, Nashville, Uncooperative Environment. Software — Practice and Tennessee, USA, February 1995. Experience, 18(9):807–820, September 1988. [27] M. Serrano. Inline expansion: when and how? In Int. [7] P. Botner. Kawa: Compiling Scheme to Java.InLisp Symp. on Programming Languages, Implementations, users conference,Berkeley,California,USA,November Logics, and Programs, pages 143–147, Southampton, UK, 1998. September 1997. [8] W. De Pauw and G. Sevitski. Visualizing Reference [28] M. Serrano and M. Feeley. Storage Use Analysis and Patterns for Solving Memory Leaks in Java.In its Applications.In1fst Int’l Conf. on Functional Proceedings ECOOP’99, pages 116–134, Lisbon, Portugal, Programming, pages 50–61, Philadelphia, Penn, USA, May June 1999. 1996. [9] M. DePristo. SINTL: A Strongly-Typed Generic [29] M. Serrano and P. Weis. Bigloo: a portable and Intermediate Language for Scheme.Northwestern optimizing compiler for strict functional languages. University, Computer Science Honors Thesis, 2000. In 2nd Static Analysis Symposium, Lecture Notes on [10] P. H. Hartel et al. Pseudoknot: a Float-Intensive Computer Science, pages 366–381, Glasgow, Scotland, Benchmark for Functional Compilers. Journal of September 1995. Functional Programming, 6(4):621–655, 1996. [30] O. Shivers. Control Flow Analysis in Scheme.In [11] M. Feeley, J. Miller, G. Rozas, and J. Wilson. Compiling Proceedings of the SIGPLAN ’88 Conference on Higher-Order Languages into Fully Tail-Recursive Programming Language Design and Implementation, Portable C. Technical Report Rapport technique 1078, Atlanta, Georgia, June 1988. Universit´edeMontr´eal, D´epartement d’informatique et [31] T. Suganama, T. Ogasawara, M. Takeuchi, T. Yasue, r.o., August 1997. M.Kawahito,K.Ishizaki,H.Komatsu,andT.Nakatani. [12] E. Gagnon and L. Hendren. SableVM: A Research Overview of the IBM Java Just-in-time compiler. Framework for the Efficient Execution of Java IBM Systems Journal, 39(1), 2000. Bytecode. Technical Report 2000-3, McGill University, [32] Sun Microsystems. The Java HotSpot Performance School of Computer Science, November 2000. Engine, April 1999. [13] D. Gudeman. Representing Type Information in [33] D. Tarditi, A. Acharya, and P. Lee. No assembly Dynamically Typed Languages. Technical report, required: Compiling Standard ML to C. ACM University of Arizona, Departement of Computer Science, Letters on Programming Languages and Systems, Gould-Simpson Building, The University of Arizona, 2(1):161–177, 1992. Tucson, AZ 85721, April 1993. [34] Tolksdorf. Compiling to JVM. [14] U. H¨olzle. Adaptive Optimization for Self: http://grunge.cs.tu-berlin.de/~tolk, 2000. Reconciling High Performance with Exploratory [35] T. VanDrunen, A. Hosking, and J. Palsberg. Reducing Programming.PhDthesis,StanfordUniversity,August loads and stores in stack architectures, 2000. 1994. [36] D. Wakeling. Compiling Lazy Functional Programs [15] U. H¨olzle and D. Ungar. Reconciling responsiveness for the Java Virtual Machine. Journal of Functional with performance in pure object-oriented Programming, 9(6):579–603, November 1999. languages. ACM Transactions on Programming Languages and Systems, 18(4):355–400, July 1996. [16] M. Honeyford. Weighing in on Java native compilation. Technical Report developerWorks, IBM, January 2002.

269

CPU+SYS seconds Bench Bigloo C BiglooJvm Gambit Kawa MLj Javac Bague 10.12 (1.00 δ) 15.88 (1.56 δ) 69.05 (6.82 δ) 13.39 (1.32 δ) Beval 10.61 (1.00 δ) 15.30 (1.44 δ) 22.18 (2.09 δ) 1890.9 (178.22 δ) Bigloo 3.89 (1.00 δ) 12.85 (3.30 δ) Boyer 12.73 (1.00 δ) 24.28 (1.90 δ) 31.82 (2.50 δ) 89.17 (7.00 δ) Cgc 2.35 (1.00 δ) 7.43 (3.16 δ) Conform 11.01 (1.00 δ) 51.18 (4.64 δ) 59.33 (5.38 δ) Earley 10.51 (1.00 δ) 31.81 (3.02 δ) 65.75 (6.25 δ) 452.08 (43.01 δ) Fft 8.96 (1.00 δ) 17.92 (2.00 δ) 650.46 (72.60 δ) 2474.9 (276.22 δ) Fib 13.58 (1.00 δ) 8.49 (0.62 δ) 14.59 (1.07 δ) 1905.4 (140.31 δ) 19.65 (1.44 δ) 12.03 (0.88 δ) Leval 9.26 (1.00 δ) 18.94 (2.04 δ) 40.73 (4.39 δ) 37.21 (4.01 δ) Maze 13.87 (1.00 δ) 43.74 (3.15 δ) 39.58 (2.85 δ) Mbrot 13.72 (1.00 δ) 55.85 (4.07 δ) 347.95 (25.36 δ) 8090.0 (589.65 δ) 44.98 (3.27 δ) Nucleic 9.12 (1.00 δ) 12.14 (1.33 δ) 21.82 (2.39 δ) 11.58 (1.27 δ) Peval 14.50 (1.00 δ) 62.77 (4.32 δ) 36.20 (2.49 δ) 89.50 (6.17 δ) Puzzle 13.13 (1.00 δ) 20.20 (1.53 δ) 20.62 (1.57 δ) 4322.4 (329.21 δ) Queens 12.22 (1.00 δ) 12.38 (1.01 δ) 33.13 (2.71 δ) 571.82 (46.79 δ) 17.09 (1.39 δ) Qsort 19.38 (1.00 δ) 19.74 (1.01 δ) 190.66 (9.83 δ) 6611.7 (341.17 δ) 36.15 (1.86 δ) 18.92 (0.97 δ) Rgc 10.64 (1.00 δ) 110.45 (10.38 δ) Sieve 11.88 (1.00 δ) 16.06 (1.35 δ) 22.05 (1.85 δ) 173.34 (14.59 δ) 15.76 (1.32 δ) 17.61 (1.48 δ) Slatex 13.91 (1.00 δ) 61.81 (4.44 δ) 38.44 (2.76 δ) 69.69 (5.01 δ) Traverse 18.68 (1.00 δ) 50.69 (2.71 δ) 113.06 (6.05 δ) 107.89 (5.77 δ)

Figure 9: Benchmarks timing on an AMD Tbird 800Mhz/256MB, running Linux 2.2.8

Appendix A: The benchmarks Here isa shortdescriptionof the benchmarkswe have been using so far. The numbers of lines are always given for the Bigloo version of the source files. Figure 9 presents all the numerical valueson Linux/x86. •Bague by P. Weis (105 lines). Tests fixnum arithmetic and vec- tors. •Beval (582 lines). The regular Bigloo Scheme evaluator. •Bigloo (99,376 lines). The bootstrap of the BiglooC compiler and the runtime library. •Boyer (626 lines) by B. Boyer and modified by B. Shaw and W. Clinger.Testssymbolsandcondi- tional expressions. •Cgc (8,128 lines). A simple compiler for a C like language that produces Mips assembly code. •Conform (596 lines). It uses lists, vectors and numerous small inner functions. •Earley by M. Feeley (672 lines). An implementation of the Ear- ley parser. •Fft (120 lines). Yet another Gabriel’s benchmark. Fast Fourier transform ported to Scheme by Harry Barrow. •Fib (18 lines). Fibonacci numbers. •Leval by M. Feeley (555 lines). A Scheme evaluator using lambda expressions. •Maze by O. Shivers (809 lines). Uses arrays fixnum operations and iterators. •Mbrot (47 lines). The Mandlebrot curve that tests floating point arithmetic. •Nucleic (3,507 lines). Described in [10], this benchmark measures the efficiency of numerical computations. •Peval by M. Feeley (639 lines). A partial evaluator that uses a lot of nested functions and allocates many lists and symbols. •Puzzle by F. Baskett (208 lines). Another Gabriel’s bench- mark. •Queens by L. Augustsson (131 lines). Ported from Lml to Scheme, tests list allocations. •Quicksort by P. Weis (124 lines). It tests arrays and fixnum arithmetic. •Rgc (348 lines). The Bigloo regular grammar that implements the Bigloo reader. •Sieve (53 lines) Fixnum arithmetic and list allocations. •Slatex by D. Sitaram (2,827 lines). This is a LATEX preprocessor imple- mented by Rice University that tests Input/Output capacities. •Traverse modified by J. Siskind (136 lines). It allocates and modifies lists.

270