Implementing a Portable Foreign Function Interface through Dynamic Code Generation

Abstract

Interpreters for various languages usually need to interface to arbitrary functions that have a C interface. Unfortunately, C does not directly allow constructing calls to functions with arbitrary parameter lists at run-time. Various libraries (in particular libffi and the ffcall libraries) have been developed to close this gap, but they have two disadvantages: each foreign function call is slow, and these libraries are not available on all platforms. In this paper we present a method for performing such foreign function calls by creating and compiling wrapper functions in C at run time, and then dynamically linking them. The technique is portable to all platforms that support dynamic linking. Moreover, the wrapper functions are faster than foreign function calls through earlier libraries, resulting in a speedup by a factor of 1.28 over libffi on the DaCapo Jython benchmark. One disadvantage is the higher startup cost from invoking the C compiler at run time. We present and evaluate several methods to alleviate this problem: caching, batching, and seeding.

1 Introduction

Many languages support calling libraries written in other programming languages; this facility is known as a foreign function interface. More concretely, many of the libraries of interest provide C interfaces, so the foreign language of interest is usually C. Foreign function interfaces are particularly popular in scripting languages such as Python [Bea98] and Ruby, but are also present in other languages, e.g., Java [Lia99], Common Lisp, Lua-ML [Ram03], Haskell [FLMJ98], and various Forth implementations.

Interpreters written in C are a popular implementation technique for programming languages, in particular scripting languages. They combine a number of advantages:

• simplicity
• portability across architectures
• high startup speed

One might think that calling C functions from an interpreter written in C should be trivial, but unfortunately in the general case this is not true: one can call specific C functions easily; one can also call C functions with a specific sequence of argument and return types indirectly; but calling a function with a statically unknown sequence of argument and return types is not supported by C. The ffcall libraries^1 and libffi^2 have been developed to fill that gap, but they introduce a significant call overhead and they are not available on all platforms.

In this paper we present and evaluate a foreign function interface based on generating wrapper functions in C at run-time, and then compiling and dynamically linking these functions. Some may consider this idea obvious, but few people have used it (and even fewer have published about it), people have invested significant effort in other approaches (e.g., libffi and ffcall), and there have been no empirical evaluations of this approach; the present paper fills this hole.

Our main contributions in this paper are:

• We present an implementation of this idea in a JVM interpreter (Section 3.2) and compare the calling performance with libffi and ffcall (Section 5.2).

• We also discuss the run-time cost of invoking the C compiler (Section 4) and present and evaluate several ways of alleviating this problem:

  - Using a faster compiler (Section 4.1, Section 5.4)
  - Caching of compiled wrapper functions between runs (Section 4.2, Section 5.3)
  - Compiling a batch of wrapper functions at once (Section 4.3, Section 5.4)
  - Seeding the cache with wrapper functions for popular foreign functions (Section 4.4)

We discuss the problem and its existing solutions in more depth in Section 2, and finally compare our approach with related work (Section 6).

^1 http://www.haible.de/bruno/packages-ffcall.html
^2 http://gcc.gnu.org/viewcvs/trunk/libffi/

2 Previous work

In this section we describe using the ffcall libraries and libffi for calling foreign functions, and the disadvantages of this approach.

You can use the ffcall interface for calling a foreign function (avcall) as follows: For each parameter, you call a type-specific macro (e.g., av_int), and pass the actual parameter. These macros construct an av_list data structure that you eventually use in the av_call macro to call the function.

The overhead of using this interface for a foreign function call in an interpreter consists of several components:

• The interpreter has to interpret a description of the foreign function, and use that to call the appropriate ffcall macros in the appropriate order.

• The macros construct a data structure.

• Finally, av_call has to interpret that data structure, move the parameters to the right locations according to the calling convention, perform the call, and finally store the result in the right location.

It is easy to see that this is quite a bit more expensive than a regular function call. However, we were still surprised when we saw significant speedups from our new method not just on microbenchmarks, but also on application benchmarks.

The libffi interface is similar, but divides the work into two stages: In the first stage you pass an array of parameter types to the macro ffi_prep_cif, and libffi creates a call interface object from that. In the second stage you create an array of pointers to the parameter values, and pass that to the calling macro ffi_call along with the call interface object. In the Cacao interpreter, we execute the first stage only once per foreign function, and the second stage on every call. In theory, this two-stage arrangement can be more efficient than the ffcall interface, but in practice the ffcall interface is faster on the platforms where we measured both.^3 But even if the libffi interface were implemented optimally, there would still be at least the overhead of constructing the array of parameter pointers.

Another issue with the approach taken by these libraries is that they require additional effort for every architecture and calling convention. While the maintainers of these libraries have done an admirable job at supporting a wide variety of architectures and calling conventions, they are not complete: e.g., at the time of writing the Win64 calling convention for the x86-64 (aka x64) architecture is supported by neither library. So, an approach that does not require extra work for every architecture and calling convention would be preferable for completeness as well as for reducing the total amount of effort.

^3 In our experience, libffi still has one significant practical advantage over ffcall for interpreter developers: gdb backtracing works across libffi calls, but not across ffcall calls.

3 Dynamically Generated C Wrapper Functions

In this section we describe the basic idea and two implementations of our approach. We present improvements and optimizations beyond the basic approach in Section 4, and timing results in Section 5.

3.1 Basic Idea

The C compiler already knows the architecture and the calling convention of the platform, so we should

make use of this knowledge instead of reimplementing it from scratch (as is done in ffcall and libffi). Unfortunately, the C language does not give us a way to use this knowledge directly, e.g., through a run-time code generation interface. But we can perform run-time code generation indirectly: generate C code at run-time, then invoke the C compiler at run-time, and finally link the resulting object file dynamically.

What is the C code that we need to generate dynamically if we want to call arbitrary functions? It needs to be a C function, and we will call this the wrapper function. It must be callable from ordinary C code, so it needs pre-determined parameter and return types, typically the same for all wrapper functions. The wrapper function needs to call the foreign function, and pass the parameters to it. It has to read these parameters from the locations where the interpreter has put them; in a stack-based virtual machine (VM), it will typically read the parameters from the VM stack. And the wrapper function has to store the result of the foreign function in a place where the interpreter expects it; again, this will usually be the stack in a stack-based VM. The wrapper may also update the VM state (in particular, the VM stack pointer). Figure 1 shows and Section 3.2 explains an example of such a wrapper function.

After dynamic linking, the wrapper function can be called arbitrarily often with an ordinary indirect call in C.

3.2 Implementation in a JVM

We have implemented this idea in the Cacao interpreter [EGKP02, ETK06], a JVM implementation. It is used for calling Java Native Interface (JNI) functions. The wrapper functions are generated, compiled, linked, and run in a just-in-time fashion, i.e., when the JNI function is first called.

Figure 1 shows an example of a wrapper function for calling the JNI function Java_java_io_VMFile_list. All wrapper functions take the JVM stack pointer as parameter and return the updated stack pointer as return value. The JNI function has three parameters:

1. As for all JNI functions, the first parameter is the JNIEnv pointer (declared as a void pointer to avoid having to include JNI header files).

2. The function is for a static class method, so the second parameter has to be the class pointer. We pass this pointer through a variable, and assign the class pointer to the variable when we dynamically link the wrapper function (and the variable). We have one such variable per static class function; sharing the variables for the same class would be possible but more complex. In an earlier version we passed the class pointer as a literal number, but we changed it to the current scheme in order to implement caching (Section 4.2), because different runs may require different literal addresses for the same class (e.g., due to address-space randomization).

3. The rest of the parameters are the parameters passed to the method at the JVM level. In this example the called function has one pointer parameter. The parameters reside on the JVM stack and are accessed through the stack pointer. The offsets from the stack pointer (0 in this case) are hard-coded into the wrapper function. The type of the parameter is also hard-coded, in the form of a cast of the pointer type before the access; in the example, the type of the parameter is a pointer, so there is a cast to void ** before the access. The types and offsets of the parameters are computed from the descriptor of the method.

The wrapper function stores the result of the JNI function back to the stack, again using a hard-coded type and offset derived from the method descriptor. Finally, the wrapper function returns an updated stack pointer, again with a hard-coded increment.

4 Improvements

This section discusses improvements over the basic approach that address the following problems: The main disadvantage of our approach is that invoking the C compiler is an expensive operation, and invoking it a lot increases the startup time of the programs significantly. Another disadvantage in some settings (in particular, embedded systems) is that our approach requires a C compiler and toolchain to be available at run-time.

    extern void *_Jv_env;
    void *ccjni_class_Java_java_io_VMFile_list;

    long *ccjni_Java_java_io_VMFile_list(long *sp)
    {
      extern void *Java_java_io_VMFile_list();
      *(void **)(sp+0) =
        Java_java_io_VMFile_list(_Jv_env,
                                 ccjni_class_Java_java_io_VMFile_list,
                                 *(void **)(sp+0));
      return sp+0;
    }

Figure 1: A wrapper function for a JNI function

4.1 Faster compiler

A straightforward improvement with respect to startup time is to use a faster compiler; a very fast one, possibly the fastest, is tcc^4. However, compilers often pay for faster compilation with slower execution. Moreover, even a very fast C compiler still requires a significant startup time: E.g., compiling 488 files, each with one wrapper function, with 488 invocations of tcc 0.9.22 requires 1.1s on a 2.26GHz Pentium 4. Finally, fast C compilers are not generally available: tcc supports only a few platforms, and is often not even installed on those platforms where it is available. Therefore additional ways to improve the compilation time would be useful.

^4 http://fabrice.bellard.free.fr/tcc/

4.2 Caching

Instead of generating and compiling the same wrapper function every time we invoke a given foreign function, we can just keep the shared object file for the wrapper function around and reuse it in the next run of any program that uses that foreign function.

We have implemented this approach in the Cacao interpreter as follows: the file name of the shared object is derived from the name of the JNI function, so we just need to check whether there is an appropriately named file; only if not do we generate and compile a wrapper function file. If the file already exists, the startup cost for this JNI function is reduced to almost nothing.

There are a number of issues that have to be dealt with when implementing caching:

Race conditions. If several instances of the interpreter execute at the same time, they have to avoid writing and compiling wrapper function files at the same time. This problem can be avoided with appropriate locking.

Security. An interpreter run by user Alice must not dynamically link shared object files belonging to a different (untrusted) user Bob, because (with a suitably constructed shared object file) that would give Bob all the privileges of Alice. This can be avoided by keeping per-user caches, which would typically reside in a subdirectory of the user's home directory. Also, the foreign function interface should check that that subdirectory and the cache files belong to the user.

Proper architecture. The shared object files must be compiled for the architecture and ABI that the interpreter is running on. The user's home directory can reside on a network file system and be mounted on machines with different architectures, so a particular shared object file might have been created by the interpreter on a different architecture. This problem can be avoided by having per-architecture directories for the shared-object files.

Caching works very well. However, to reduce the startup overhead on the first execution, one can use the following techniques.

4.3 Batching

We can collect several wrapper functions into one file, and compile (and dynamically link) all of them

in one batch. That eliminates much of the repeated startup overhead of the C compiler; other (minor) benefits of batching are that there are fewer shared object files that have to be linked dynamically, resulting in less load on the virtual memory system, and reduced memory consumption from internal fragmentation in the pages.

We have not implemented batching in the Cacao interpreter, because that would require substantial changes in the way Cacao deals with JNI functions; currently Cacao only deals with JNI functions in a JIT way, i.e., at the execution of the first call to the function; at that point the wrapper function has to be compiled immediately, leaving no time for accumulating more wrapper functions.

One way to achieve batching in a JVM would be to generate a file of wrapper functions when loading a class containing JNI functions. One might even accumulate a file containing wrapper functions for several classes, and only compile all of them when one of the functions is actually called.

Combining batching and caching complicates things a little: We can no longer use the file name to determine whether a wrapper function is cached, but need an index file or database that indicates which wrapper function is in which file.

4.4 Seeding

Wrapper functions for popular foreign functions can be generated and compiled when the interpreter is built, and later the system can use the resulting shared object files without having to generate them itself, reducing the first-execution slowdown that the user experiences. Another benefit of this approach is that these shared objects would not have to be replicated for each user.

This approach can also be useful to deal with platforms where no C compiler is available at run-time (e.g., embedded systems): the system is shipped with shared objects for all wrapper functions that the application needs, and it will never invoke the C compiler. While this approach will not work for every application, many applications on embedded systems are so well-characterized that all the needed foreign functions are known in advance.

4.5 Call Templates

We have considered, and others have suggested, solving the foreign function call problem as follows: Instead of having calls to concrete functions with their parameter and return types, provide a set of indirect calls to C functions with various combinations of parameter and return types; when you want to call a given C function, execute the appropriate indirect call with the function pointer as parameter.

The problem with this approach for foreign function calls is that it cannot be complete, because one cannot provide enough call templates to cover all possible parameter and return types. E.g., the ffcall library avcall supports 14 different parameter types (plus void for return types). With only three parameters or less, 44325 call templates would be needed. But these templates are not enough: one wants to be able to call functions with more parameters, and in some foreign function interfaces one also wants to deal with more types^5, resulting in an even larger number of call templates. That's why we did not implement this approach, and we are not aware of anyone who has used it for implementing a foreign function interface.

However, call templates could be used for reducing the amount of compilation needed, similar to seeding: Provide call templates for the type signatures of the expected foreign functions. Then most of the other foreign functions will also be callable by these templates, and only a few foreign functions need a newly-compiled wrapper function (or, alternatively, a newly-compiled call template).

If the call templates are implemented as wrapper functions, using them would be slightly slower than using the wrappers our approach generates: The call to the foreign function would be indirect, and the called function has to be passed as a parameter to the wrapper function.

Alternatively, for a limited number of call templates, one could implement the call directly in the interpreter, resulting in a small speedup compared to our wrapper function approach: one level of call overhead would be eliminated.

^5 E.g., the mapping of the C type off_t to the basic types provided by ffcall is platform-dependent, so for a more portable foreign function interface one may want to provide an off_t type.

We have not implemented either variant of call templates, but we expect the calling-performance difference from our direct-calling wrapper functions to be small (probably in the measurement noise), and the startup-time advantage to be similar to seeding or having a warm cache.

The downside of this approach is that it increases the complexity of the implementation: Generating wrapper functions is still needed to support the general case, but in addition one has to map the called functions to the call templates, and to support another way of calling foreign functions.

[Figure 2 (bar chart): time per call in CPU cycles (0–2000) for a static JNI call with 0, 1, 2, 4, and 8 arguments, using libffi, ffcall, and C wrappers.]

Figure 2: Calling cost of a static JNI call using different foreign function call implementations.

5 Empirical Results

5.1 Platform and Benchmarks

Unless otherwise noted, our empirical results were gathered on an Intel Xeon UP 3070 2.67GHz machine with 8 GB of main memory running 64-bit Debian GNU/Linux 4.0 with Linux kernel version 2.6.18, libffi version 4.1.1-21, ffcall version 1.10+2.41-3, and Cacao based on version 0.97 built with gcc version 4.1.1-21. The benchmarks we used are SPECjvm98 [SPEC99], the DaCapo benchmark suite [MMB+04], and some micro-benchmarks.

5.2 Calling speed

To compare the raw call performance of the different approaches, we wrote a microbenchmark that invokes an empty static JNI method with n arguments (n+2 arguments for the C function) a million times. Figure 2 shows the time for a single call in CPU cycles. While libffi with its two-stage interface could be faster than ffcall, it is quite a lot slower on our test platform. Using C wrapper functions beats both of them, with a speedup of 1.18–1.47 over ffcall and 2.68–7.17 over libffi.

However, these are best-case results: they do not take into account that the functions or the surrounding Java program take time for doing something useful, nor do they include the compilation or dynamic-linking overhead. So, how well do the different approaches work on application benchmarks?

5.3 Application benchmarks, caching and seeding

In our application benchmarks we show two results for our C wrapper approach:

no caching  This is the performance of C wrappers without caching, or with a cold cache (i.e., on the first execution of the benchmark after clearing the cache^6). Each wrapper function is generated, compiled, and dynamically linked separately; there is one wrapper function per JNI function.

warm cache  This is the performance of C wrappers with caching, when all the wrapper functions have already been generated, i.e., typically on the second and subsequent runs of the benchmark. There is no generation or compilation of the wrapper functions, only the dynamic-linking cost, once per JNI function. This is also the performance that we can expect on first execution from seeding if all JNI functions used in the application are seeded.

Figure 3 compares the performance of the foreign function interface implementations on the SPECjvm98 benchmarks, and Figure 4 on the DaCapo benchmarks.

^6 If the cache has not been cleared, an application typically benefits quite a lot from wrappers for JNI functions called by other applications.

[Figure 3 (bar chart): performance (%), 0–100, of libffi, ffcall, no caching, and warm cache on jess, javac, mtrt, compress, db, mpegaudio, and jack.]

Figure 3: SPECjvm98 results (128MB Java heap)

[Figure 4 (bar chart): performance (%), 0–100, of libffi, ffcall, no caching, and warm cache on bloat, eclipse, hsqldb, luindex, antlr, chart, fop, jython, and pmd.]

Figure 4: DaCapo benchmark results (512MB Java heap)

                   JNI      dynamic JNI
                   methods  invocations
    201 compress     66          4,863
    202 jess         79      2,708,609
    209 db           67        127,294
    213 javac        81      4,296,649
    222 mpegaudio    75      1,253,804
    227 mtrt         78        251,370
    228 jack         68      2,825,405
    antlr            91      4,311,262
    bloat           119     18,941,773
    chart           149     13,026,159
    eclipse         321     29,320,684
    fop              99      1,871,976
    hsqldb          116        449,259
    jython          136     49,891,342
    luindex         102      7,037,744
    pmd             105     12,908,311

Figure 5: Number of JNI methods and calls for the JVM98 and DaCapo benchmarks

                libffi  ffcall  warm cache  static seeding
    runtime     10ms    9ms     43ms        1ms

Figure 6: Dynamic linking overhead: CPU time for calling 500 static JNI methods

                               gcc-3.3.5  tcc-0.9.22
    1 compile/function           9108ms      1077ms
    1 compile/488 functions      1065ms        35ms
    functions for break-even          9          33

Figure 7: CPU (user+system) time for compiling 488 wrapper functions on a 2.26GHz Pentium 4 (32-bit)

The results exceeded our expectations by far. Not only did we see measurable speedups of warm cache over both libffi and ffcall, in several cases even no caching was faster than libffi. The maximum speedup we saw was a factor of 1.28 for warm cache over libffi on jython. These results are surprising because we did not expect our interpreter to spend that much time in JNI functions, much less in the function call overhead for these functions.

One contributing factor to these results is the different cost of the different calling mechanisms, as discussed in Section 5.2. The other contributing factor is the large number of JNI calls, shown in Fig. 5. E.g., for Jython on average each libffi call costs 220ns more than each C wrapper call, resulting in an overall execution time difference of 11s, exceeding the time to compile the 136 wrapper functions in the no caching case.

There are also a few unexpected results: e.g., in Jython ffcall is just as fast as warm cache, and in db libffi is faster than ffcall. We believe that these differences are due to factors like cache effects (e.g., the difference in allocations between the variants can lead to different conflict misses).

Figure 6 gives an idea of the dynamic linking cost for warm caches: It measures the CPU (user+system) time for calling 500 different static JNI functions with no arguments, calling each function just once. Here warm cache is quite a bit slower than libffi and ffcall, because it needs to dynamically load the 500 additional shared libraries and dynamically link the 500 wrapper functions (in addition to the 500 JNI functions, which are also linked by libffi and ffcall); however, the absolute cost of this dynamic linking overhead is small.

Figure 6 also contains a column static seeding where the wrapper functions have already been linked statically. This eliminates the dynamic linking overhead, resulting in a very low run-time for this micro-benchmark.

5.4 Batching

We cannot provide application benchmark results with batching, because we have not implemented batching in the Cacao interpreter.

However, Fig. 7 shows the potential of batching: gcc -O shows more than a factor of 8 speedup from compiling all the wrapper functions in one batch; the tcc speedup is a factor of 30, and the resulting compile time is not human-perceptible at 35ms.

The third line of Fig. 7 shows how many functions need to be compiled in one invocation to make the actual compilation time at least as large as the startup overhead of the compiler; this gives an idea of a kind of break-even point for the compiler.

For the runs with one compiler invocation per

function, the system time consumed a large part of the CPU (for tcc, the majority), probably because the startup overhead involves a lot of virtual memory work, in particular page table setup and zeroing or copying of pages on writing. The operating system for this experiment was Linux-2.6.13.

6 Related Work

The ffcall libraries and libffi achieve a similar goal as our approach does, but the approach is different: We use the C compiler's knowledge of the calling convention to achieve portability; these libraries contain hand-written code for each architecture and calling convention they support. A description of how they are used and their drawbacks is given in Section 2.

Generating wrapper functions to implement a foreign function interface was already introduced in an earlier paper [Ert07]. However, that paper differs from the present one in all other respects. In particular, it does not present any performance results, so it is unclear how practical this approach is; it does not cover optimisations such as caching, batching, or seeding. It also does not discuss Java or JNI at all. Instead, it deals with Forth and focusses on the design of the foreign function interface as seen by the Forth programmer. Using wrapper functions compiled by the C compiler provides additional advantages in this context, because one can make good use of the C compiler's knowledge about the type signature of the called function.

There is a lot of other work on wrapper and interface generation. The main difference from our work is that wrapper and interface generation is not done at run time of the program.

SWIG [Bea96, Bea03] is a tool for generating foreign function interfaces for scripting languages. It is highly configurable and can be used to generate C wrapper files similar to the ones we use; however, the typical use is quite far from what we have presented in this paper (but closer to our work on the Gforth foreign function interface [Ert07]); in the present paper we have hardly looked at dealing with the mismatch between two languages, because this has been defined away with JNI (for the JVM) and for Gforth has been covered in another paper [Ert07]. SWIG is normally not invoked at run-time, but at build time, whereas we focus on C code generation at run-time. SWIG uses C and C++ as interface definition language (IDL) and is one of the most portable systems. TIDE is another foreign function interface generator, for interfacing Tcl with full C++ [Win97].

Other tools include FIG (foreign interface generator), which takes C header files and a declarative script as input [RS06]. This allows the user to tailor the resulting interface to the application. The scripting language contains composable typemaps which describe the mapping between high-level and low-level types. The ml-nlffigen tool generates ML glue code from C declarations to enable the manipulation of C data structures from the ML application. Chasm is a toolkit which provides seamless language interoperability between Fortran 95 and C++ [RSSM06]. Chasm uses the intermediate representation generated by a compiler front-end for each supported language as its source of interface information instead of an IDL. In this regard it is similar to our work, as we also use the internal representation of the virtual machine for the generation of the wrapper code, but in our case dynamically, in the case of Chasm statically.

There is also other work on optimizing foreign function calls. Schärli and Achermann use partial evaluation of inter-language wrappers to optimize foreign function calls [SA01].

Persistent code caching has been used in a number of systems before, and is described in depth in the context of the Pin dynamic instrumentation system [RCCS07]. Fortunately, our code caching problem is much easier, and persistence is very easy to achieve in the absence of batching by using appropriate file names; batching makes persistent caching only a little harder.

7 Conclusion

In this work we present the implementation of a portable foreign function interface that uses dynamically-generated C wrapper files. We also present and evaluate a number of ways to reduce the startup overhead of this approach, and compare it to existing less-portable solutions like libffi and ffcall. Surprisingly, not only does our approach increase portability, it can also increase the performance: We saw speedups by up to a factor of 1.28 on application benchmarks.

References

[Bea96] David M. Beazley. SWIG: An easy to use tool for integrating scripting languages with C and C++. In Proceedings of the 4th USENIX Tcl/Tk Workshop, pages 129–139, 1996.

[Bea98] David M. Beazley. Interfacing C/C++ and Python with SWIG. In 7th International Python Conference, SWIG Tutorial, 1998.

[Bea03] David M. Beazley. Automated scientific software scripting with SWIG. Future Generation Computer Systems, 19(5):599–609, July 2003.

[EGKP02] M. Anton Ertl, David Gregg, Andreas Krall, and Bernd Paysan. vmgen - a generator of efficient virtual machine interpreters. Software: Practice and Experience, 32(3):265–294, 2002.

[Ert07] M. Anton Ertl. Gforth's libcc C function call interface. In 23rd EuroForth Conference, pages 7–11, 2007.

[ETK06] M. Anton Ertl, Christian Thalinger, and Andreas Krall. Superinstructions and replication in the Cacao JVM interpreter. Journal of .NET Technologies, 4:25–32, 2006. Journal papers from the .NET Technologies 2006 conference.

[FLMJ98] Sigbjorn Finne, Daan Leijen, Erik Meijer, and Simon Peyton Jones. H/Direct: a binary foreign language interface for Haskell. In ICFP '98: Proceedings of the Third ACM SIGPLAN International Conference on Functional Programming, pages 153–162, New York, NY, USA, 1998. ACM Press.

[Lia99] Sheng Liang. Java Native Interface: Programmer's Guide and Reference. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[MMB+04] J. E. B. Moss, K. S. McKinley, S. M. Blackburn, E. D. Berger, A. Diwan, A. Hosking, D. Stefanovic, and C. Weems. The DaCapo Project. http://ali-www.cs.umass.edu/DaCapo/, 2004.

[Ram03] Norman Ramsey. Embedding an interpreted language using higher-order functions and types. In IVME '03: Proceedings of the 2003 Workshop on Interpreters, Virtual Machines and Emulators, pages 6–14, New York, NY, USA, 2003. ACM Press.

[RCCS07] Vijay Janapa Reddi, Dan Connors, Robert Cohn, and Michael D. Smith. Persistent code caching: Exploiting code reuse across executions and applications. In Code Generation and Optimization (CGO '07), pages 74–88, 2007.

[RS06] John Reppy and Chunyan Song. Application-specific foreign-interface generation. In GPCE '06: Proceedings of the 5th International Conference on Generative Programming and Component Engineering, pages 49–58, New York, NY, USA, 2006. ACM Press.

[RSSM06] Craig Edward Rasmussen, Matthew J. Sottile, Sameer Shende, and Allen D. Malony. Bridging the language gap in scientific computing: the Chasm approach. Concurrency and Computation: Practice and Experience, 18(2):151–162, February 2006.

[SA01] Nathanael Schärli and Franz Achermann. Partial evaluation of inter-language wrappers. In Workshop on Composition Languages, WCL '01, September 2001.

[SPEC99] SPEC Standard Performance Evaluation Corporation. SPECjvm98 Documentation, Release 1.03 edition, 1999.

[Win97] H. Winroth. A scripting language interface to C++ libraries. In Technology of Object-Oriented Languages and Systems, TOOLS (23), pages 247–259. IEEE Computer Society, 1997.
