Safe and Efficient Hardware Specialization of Java Applications
Matt Welsh Computer Science Division University of California, Berkeley Berkeley, CA 94720-1776 USA [email protected]
Abstract take full advantage of these interfaces, instead fo- cusing on the other aspects of performance, such as Providing Java applications with access to low-level compilation [18], garbage collection [1], and thread system resources, including fast network and I/O inter- performance [2]. faces, requires functionality not provided by the Java Traditionally, Java applications make use of low- Virtual Machine instruction set. Currently, Java appli- level system functionality through the use of native cations obtain this functionality by executing code writ- ten in a lower-level language, such as C, through a native methods, which are written in a language such as C. method interface. However, the overhead of this inter- To bind native method code to the Java application, face can be very high, and executing arbitrary native a native method interface is used, which has been code raises serious protection and portability concerns. standardized across most JVMs as Sun Microsys- Jaguar [37] provides Java applications with efficient tems’ Java Native Interface [29]. However, the use of access to hardware resources through a bytecode spe- native methods raises two important concerns. The cialization technique which transforms Java bytecode first is performance: the cost of traversing the na- sequences to make use of inlined Jaguar bytecode which tive method interface can be quite high, especially implements low-level functionality. Jaguar bytecode is when a large amount of data must be transferred portable and type-exact, making it both safer and more across the Java-native code boundary. The second efficient than native methods. Jaguar requires that the is safety: invoking arbitrary native code from a Java target JVM or compiler recognizes Jaguar bytecode, application effectively negates the protection guar- which is a superset of the Java bytecode instruction set. antees of the Java Virtual Machine. These two prob- We describe two implementations of Jaguar: one based lems conflate, as programmers tend to write more on a static Java compiler, and the other which uses a application code in the native language to amortize standard JVM coupled with a Jaguar-enabled just-in- the cost of crossing the native interface. time compiler. The JIT compiler applies code patches We have developed a system called Jaguar [37] to the original Java bytecode, stored in the form of an- which provides applications with efficient and safe notations in the Java class file. We demonstrate that access to low-level system resources. This is accom- Jaguar-specialized Java applications perform as well as plished through a bytecode specialization technique C for a set of communication benchmarks, and that the in which certain Java bytecode sequences are trans- code patching technique involves minimal compile-time lated to low-level code which is capable of perform- overhead. ing functions not allowed by the JVM, such as direct memory access. Because this low-level code is in- lined into the Java application at compile time, the 1 Introduction overhead of the native interface is avoided. Also, aggressive optimizations can be performed on the Java [14] is becoming increasingly popular for combined application and inlined Jaguar code. large-scale server applications, including Internet Our original prototype of Jaguar, described services [27, 34], databases [16], and parallel pro- in [37], made use of a Java just-in-time (JIT) com- cessing [40, 24]. To obtain high performance, these piler which recognized certain bytecode sequences applications need to make use of specialized hard- and translated them to x86 machine code directly. ware and O/S interfaces, such as fast cluster net- This approach is both non-portable and error-prone, works [32], asynchronous I/O [23], and raw disk ac- requiring that the machine code sequences be tai- cess [17]. To date, few Java systems have striven to lored for the particular combination of JVM, JIT, and machine architecture. It also requires non- usually have special requirements for buffer alloca- trivial changes to the JIT compiler to recognize and tion. However, this requirement runs counter to the translate Java bytecode sequences. existing Java model in which all objects and arrays This paper presents a new design based on trans- are allocated from a single heap, managed by the lating Java bytecode to Jaguar bytecode, which is a JVM’s garbage collector. superset of the Java instruction set. Jaguar byte- Native methods are the traditional mechanism code is used to represent low-level operations oth- by which these operations are implemented in Java. erwise unsupported by the JVM. Jaguar bytecode Native methods permit the binding of arbitrary contains two additional instructions — peek and code written in a lower-level language (such as C) poke — which are used to perform direct memory to the Java application. Apart from raising serious access. Application programmers are not permit- protection and safety concerns, native methods are ted to make use of these instructions; rather, they not well-suited for providing fine-grained, efficient are restricted to use within Java-to-Jaguar trans- access to system resources. As we demonstrated lation rules. Because Jaguar bytecode is portable, in [37], the Java Native Interface itself imposes a type-exact, and inlined directly into the Java appli- large performance penalty, making them suitable cation, it is both safer and more efficient than using only for relatively long operations which pass only native methods for low-level operations. a small amount of data across the Java-native code We present two implementations of Jaguar based boundary. Furthermore, native methods limit ex- upon this design. The first makes use of a static pressiveness, as native code can only be bound to Java compiler to translate Jaguar bytecode to ma- method invocation. Field access, operators, and chine code. The second uses a modified JIT com- other Java operations cannot hook into the native piler which applies a “patch” to produce Jaguar code interface in any way. bytecode at compile time; this mechanism requires Several projects have attempted to bind fast Jaguar support only in the JIT compiler, and is communication layers into the Java environment fully compatible with unmodified JVMs. Both ap- through the use of native methods. Native method proaches relegate the (relatively expensive) trans- bindings to MPI [13] and PVM [33] have been de- lation from Java to Jaguar bytecode to a portable scribed, however, neither of these have considered front-end compiler, minimizing the complexity im- performance issues with respect to obtaining low la- pact on the back-end compilers. tency or high bandwidth. Little work has been done to support other types of I/O, although Sun Mi- crosystems is currently considering new APIs sup- 2 Motivation and Background porting asynchronous I/O interfaces [25]. Another approach is to extend the JVM or Much previous work has addressed the CPU- Java compiler to provide special support for certain related aspects of Java performance, including com- classes, such as communication buffers [9], bulk I/O pilation [28, 18, 11], thread synchronization [2], and interfaces [4], or JVM-internal data structures [8]. garbage collection [30]. However, Java I/O per- While this technique can provide a point solution formance remains largely uninvestigated. Imple- for particular communication and I/O interfaces, it menting high-performance communication and I/O lacks generality and requires that significant new in Java requires two classes of operations which functionality be incorporated into each JVM imple- the Java Virtual Machines does not directly sup- mentation. In addition, this functionality must it- port. The first is direct access to I/O device inter- self be trusted, much in the same that native meth- faces, including user-level network interfaces [35, 32] ods are. and raw disk I/O [17]. These mechanisms gener- The Jaguar approach is motivated by the obser- ally require the use of specialized system calls, or vation that the sort of low-level operations required even memory-mapped interfaces which circumvent for enabling high-performance communication and the O/S kernel entirely. The second is the use of I/O are generally short and easily expressed as a se- explicitly-managed memory regions. For example, quence of simple instructions (e.g., accessing a par- user-level network interfaces often require that com- ticular memory address, or invoking a system call). munication buffers be pinned to physical memory This suggests that such operations can be inlined for direct access by the NI hardware; these pages into the compiled Java bytecode stream for perfor- must be allocated from a special pool or pinned dy- mance, and that some form of static analysis could namically by the O/S or NI [36]. Memory-mapped be performed to guarantee safety or type-exactness. files are often used for I/O, and raw disk interfaces Jaguar is based on the concept of bytecode spe- Instruction Stack effect Java Machine
450 C Jaguar/GCJ Jaguar/OpenJIT 250 C Jaguar/GCJ 400 Jaguar/OpenJIT 350
200 300
250 JNI Simulated (10x) 150 200
Bandwidth, Mbits/sec 150 Round trip time, microseconds
100 100 JNI Simulated 50
50 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 0 20000 40000 60000 80000 100000 Message size, bytes Message size, bytes
Figure 4: JaguarVIA microbenchmark results. The graph on the left shows the round-trip latency of messages transmitted over Berkeley VIA as a function of the message size. The graph on the right shows the bandwidth as a function of message size for C and Jaguar, as well as an estimation of the bandwidth obtained if JNI were used to access VIA functions as native methods. Also shown is the bandwidth if JNI performance were improved tenfold.
recovery of the objects from the pre-serialized form exactly the same size after processing. by mapping a new PSO object list onto the data. The average classfile size increase was 698 bytes Recovering references to PSO objects requires de- using the diff patch method, and 738 bytes using marshalling a pointer and creating a new object to the flat method. The average increase for Jaguar map onto the pointer’s location. classfiles was 494 bytes. Note that using the diff ap- As the results show, GCJ has a significant per- proach does not have significant savings over simply formance advantage over OpenJIT for these prim- storing the fully-expanded Jaguar bytecodes; this is itive operations. In particular, GCJ has highly- because the number of instructions added by a patch optimized routines for accessing object fields and tends to be relatively large compared to the num- arrays; OpenJIT must deal with the Java object ber of instructions in the original method. (Most representation within the JVM, and does not op- Jaguar translation rules take a single Java bytecode timize code as heavily as the static compiler. Ac- instruction and expand it out to 5-20 new instruc- cessing PSOs is slower than accessing standard Java tions.) objects in both cases, because these primitives are The time to apply Jaguar code patches and com- constructed using multiple Java object field accesses pile methods using both patch techniques is shown (as well as Jaguar peek and poke instructions). in Figure 7. These results show that applying a diff patch takes about twice as long as applying a ‘flat 4.2 Code Patch Overhead patch. However, the number of methods patched is very small, less than 2% in most cases. Although To assess the overhead of Jaguar code patches, the time to apply a patch is about 10% of the av- we measured the size difference between original erage method compilation time, because the num- Java classfiles and those patched using both the ber of patched methods is small the total patch two patch mechanisms. We also measured the time for an application is negligible. Although these size of Jaguar classfiles generated for use with are small benchmark applications, we expect that the static Jaguar compiler (which contain only ex- the use of Jaguar specializations (and hence, code panded Jaguar bytecode, and no patch). The results patches) will be restricted to a small set of low-level are shown in Figure 6. 27 classes in the Jaguar code libraries in a given application. In the case of the tree were processed by the Jaguar front-end com- VIA benchmarks, for example, only the core VIA piler; only 7 of these classes had Jaguar translation library makes use of code patches. rules applied to them. As the figure shows, sev- eral classes decreased in size when processed by the front-end compiler; others increased in size even if 5 Issues and Future Work no translation rules were applied. This is because of the way in which the Jaguar front-end compiler Our approach to bytecode specialization raises a parses and writes out classfile data. Ideally, a class number of issues for future investigation. The first file with no translation rules applied to it would be is whether the class of operations required for im- £ Original classfile
¤ With diff patch
6000 ¤ With flat patch ¥ Jaguar classfile 5000
4000 ¡ 3000 Size in bytes 2000
1000
0
Class¢ files
Figure 6: Classfile sizes using Jaguar. The size in bytes of 27 Java classfiles, before and after processing by the Jaguar front-end compiler, are shown. Patched classfiles are used with the JIT-based Jaguar implementation, and Jaguar classfiles are used with the static compiler implementation. The decrease in size for some classfiles is attributed to removal of extraneous information during processing, and differing data layouts. plementing efficient communication and I/O in Java ing data between Java and the library or system call can be captured by just two additional bytecode in- in question. structions (peek and poke). Direct memory access is Another set of optimizations that Jaguar could a requirement for memory-mapped I/O interfaces, enable is fast floating-point operations and complex and for manipulation of other memory regions (such arithmetic using standard Java operators. These as I/O buffers) outside of the Java heap. However, optimizations have been investigated using other other communication layers, such as MPI [12] and techniques, including class-specific compiler opti- Active Messages [10], operate primarily through a mizations [39] and dialects of the Java language [6]. function-call interface. In addition, most operating While augmenting the Jaguar instruction set with system features can only be accessed through the support for these optimizations fits nicely into the use of system calls. goal of hardware specialization of Java applications, This suggests that the Jaguar instruction set it is not clear whether these instructions would should provide a binding to arbitrary library and be portable across many architectures. Our desire system calls; however, this is already provided (indi- has been to strike a balance between performance, rectly) through the use of Java native methods. The portability, and generality. main performance limitation of the Java Native In- The real opportunity presented by Jaguar is to terface is in the way that native code accesses Java build novel systems in Java. Apart from the bene- data, and vice versa. Jaguar External Objects (de- fits of increased portability, modularity, and robust- scribed in [37]) can be used to avoid this bottleneck, ness, there is the potential to leverage Java’s protec- as Jaguar specializations allow Java and native code tion model to achieve higher performance than can to directly share data external to the Java heap. be afforded by languages such as C. This is possi- Sharing data on the Java heap is more difficult, as ble because code that was once protected from the the garbage collector must be informed when ob- application by placing it in the kernel can now be jects are being accessed outside of Java code. The moved into a Java library, and co-optimized with remaining concern with native methods is portabil- application code at compile time. We have investi- ity and safety. One solution to these problems would gated the benefit of moving the protection code of be to introduce a restricted form of native methods the Berkeley VIA implementation from the slow net- which map only onto standardized APIs, such as work interface co-processor (which runs at 37 MHz) POSIX calls, and are responsible only for marshal- into a Java library on the Pentium-based host, re- Methods Average Average Average Benchmark Patched Total patch size patch time compile time JaguarVIA Pingpong 6 215 Diff patch 470 bytes 1.5 ms 11.92 ms Flat patch 438 bytes 1.16 ms 11.92 ms JaguarVIA Bandwidth 4 214 Diff patch 460 bytes 3.0 ms 11.93 ms Flat patch 505 bytes 1.5 ms 11.93 ms PSO Benchmark 3 221 Diff patch 341 bytes 2.66 ms 11.34 ms Flat patch 343 bytes 1.33 ms 11.34 ms PSO Test 3 194 Diff patch 442 bytes 2.66 ms 12.25 ms Flat patch 510 bytes 1.33 ms 12.25 ms
Figure 7: Patch application and compile time for various microbenchmarks. Also shown are the number of methods compiled in the application, and the number of methods which were patched. Note that the number of patches applied is very low compared to the total number of methods used in the application. sulting in a 15-20% increase in peak bandwidth and piler, and the other based on a JIT compiler. The a 25% reduction in round-trip latency. Using Java’s static compiler makes use of a Jaguar classfile con- protection model in place of the traditional system- taining specialized bytecodes; the JIT solution uses call boundary could improve other aspects of appli- a code patch embedded in a standard Java class- cation performance, including support for high I/O file. We have demonstrated both implementations concurrency and throughput [38, 15]. to perform well against a set of communication and I/O benchmarks, and that the overhead of the code patch technique is minimal. Our design represents 6 Conclusion an efficient, safe means of extending the Java envi- ronment to support large-scale server applications Java server applications must necessarily make that can be readily employed in existing Java im- use of functionality not directly provided by the plementations. JVM. These applications require both access to low- level system resources, such as fast communication and I/O interfaces, as well as management of mem- References ory external to the Java heap. While native meth- ods provide a general-purpose way to invoke code [1] Ole Agesen. GC Points in a Threaded Environ- written in a lower-level language from Java, they ment. Technical Report SMLI-TR-98-70, Sun Mi- are generally non-portable, do not perform well, and crosystems Laboratories, December 1998. raise important safety concerns. Jaguar takes an alternate approach, that of spe- [2] D. Bacon, R. Konuru, C. Murthy, and M. Ser- cializing Java bytecode at compile time to make rano. Thin Locks: Featherweight Synchronization for Java. In Proceedings of ACM Conference on use of low-level operations expressed as type-exact, Programming Language Design and Implementa- portable Jaguar bytecode. Jaguar bytecode is a su- tion, June 1998. perset of Java bytecode with just two additional in- structions enabling direct memory access for spe- [3] Hans Boehm. Space Efficient Conservative Garbage cializations. Adding Jaguar support to an exist- Collection. In Proceedings of the ACM SIGPLAN ing JVM is straightforward, and does not require ’93 Conference on Programming Language Design and Implementation, June 1993. the addition of significant new functionality. Be- cause Jaguar bytecode is inlined into the applica- [4] Dan Bonachea. Bulk File I/O Extensions to Java. tion, the compiler can also perform aggressive op- In Proceedings of the ACM 2000 Java Grande Con- timizations against the combined application and ference, June 2000. low-level code. [5] Per Bothner. A GCC-based Java Implementa- We have presented two implementations of the tion. In Proceddings of the 42nd IEEE International Jaguar design, one based on a static Java com- Compuer Conference, Spring 1997. [6] John Brophy. Design of the Java JIT Compiler. In Proc. of OOPSLA ’98, visual numerics complex class. Workshop on Reflective Programming in C++ and http://www.vni.com/corner/garage/grande/. Java. http://openjit.is.titech.ac.jp/. [7] P. Buonadonna, A. Geweke, and D. Culler. An im- [20] Sun Microsystems. The JIT plementation and analysis of the Virtual Interface Compiler Interface Specification. Architecture. In Proceedings of SC’98, November http://java.sun.com/docs/jit interface.html. 1998. [21] Eugene W. Myers. An O(ND) difference algorithm [8] Michael Burk, Jong-Deok Choi, Stephen Fink, and its variations. Algorithmica 1, 2:251–266, 1986. David Grove, Michael Hind, Vivek Sarkar, Mauri- [22] Christian Nester, Michael Philippsen, and Bern- cio Serrano, V.C. Sreedhar, and Harini Srinivasan. hard Haumacher. A more efficient RMI for Java. The Jalape˜noDynamic Optimizing Compiler for In ACM Java Grande Conference 1999, June 1999. Java. In Proceedings of the ACM 1999 Java Grande [23] Vivek S. Pai, Peter Druschel, and Willy Conference, June 1999. Zwaenepoel. Flash: An efficient and portable Web [9] Chi-Chao Chang and Thorsten von Eicken. Inter- server. In Proceedings of the 1999 Annual Usenix facing Java with the virtual interface architecture. Technical Conference, June 1999. In ACM Java Grande Conference 1999, June 1999. [24] Michael Philippsen and Matthias Zenger. Java- [10] B. Chun, A. Mainwaring, and D. Culler. Virtual Party - transparent remote objects in Java. In Con- network transport protocols for Myrinet. In Pro- currency: Practice and Experience, 9(11):1225- ceedings of Hot Interconncts V, August 1997. 1242, November 1997. [11] Robert Fitzgerald, Todd B. Knoblock, Erik Ruf, [25] Mark Reinhold. New I/O APIs for the Java Bjarne Steensgard, and David Tarditi. Marmot: An Platform. Sun Microsystems Java Com- Optimizing Compiler for Java. Technical Report munity Process Specification JSR-000051, MSR-TR-99-33, Microsoft Research, June 1999. http://java.sun.com/aboutJava/community- process/jsr/jsr 051 ioapis.html. [12] Message Passing Interface Forum. The MPI Mes- sage Passing Interface Standard. Technical Report [26] Cygnus Solutions. The Cygnus Native Interface for UT-CS-94-230, University of Tennessee, Knoxville, C++/Java Integration. http://sourceware.cyg- April 1994. nus.com/java/papers/cni/t1.html. [13] Vladimir Getov, Susan Flynn-Hummel, and Sava [27] Sun Microsystems Inc. Enterprise Java Beans Tech- Mintchev. High-performance parallel programming nology. http://java.sun.com/products/ejb/. in Java: Exploiting native libraries. In ACM 1998 [28] Sun Microsystems Inc. Java HotSpot Performance Workshop on Java for High-Performance Network Engine. http://java.sun.com/products/hot- Computing, 1998. spot/index.html. [14] J. Gosling, B. Joy, and G. Steele. The Java Lan- [29] Sun Microsystems Inc. Java Native Inter- guage Specification. Addison-Wesley, Reading, MA, face Specification. http://java.sun.com/pro- 1996. ducts/jdk/1.2/docs/guide/jni/index.html. [15] Steven Gribble. Simplifying Cluster-Based Inter- [30] Sun Microsystems Labs. The net Service Construction with Scalable Distributed Exact Virtual Machine (EVM). Data Structures. http://www.cs.berkeley.edu/- http://www.sunlabs.com/research/java-topics/. ~gribble/papers/quals/sdds-cluster.ppt. [31] The Blackdown Java Linux Porting Team. Java on [16] Joe Hellerstein, Eric Brewer, and Mike Franklin. linux. http://java.blackdown.org/. Telegraph: A Universal System for Information. [32] The VIA Consortium. The Virtual Interface Archi- http://db.cs.berkeley.edu/telegraph/. tecture. http://www.viarch.org. [17] Silicon Graphics Inc. SGI Releases Technology to [33] D.A. Thurman. jPVM: The Java to PVM interface. Form Foundation for Enterprise-Class Linux Ap- http://www.isye.gatech.edu/chmsr/JavaPVM. plications. http://www.sgi.com/newsroom/press- [34] UC Berkeley Ninja Project. The UC Berkeley Ninja releases/1999/december/linux oss.html. Project. http://ninja.cs.berkeley.edu. [18] Kazuaki Ishizaki, Motohiro Kawahito, Toshiaki Ya- [35] T. von Eicken, A. Basu, V. Buch, and W. Vogels. sue, Mikio Takeuchi, Takeshi Ogasawara, Toshio U-Net: A user-level network interface for parallel Suganuma, Tamiya Onodera, Hideaki Komatsu, and distributed computing. In Proceedings of the and Toshio Nakatani. Design, implementation, and 15th Annual Symposium on Operating System Prin- evaluation of optimizations in a just-in-time com- ciples, December 1995. piler. In Proceedings of the ACM 1999 Java Grande [36] M. Welsh, A. Basu, and T. von Eicken. Incorpo- Conference, June 1999. rating memory management into user-level network [19] S. Matsuoka, H. Ogawa, K. Shimura, Y. Kimura, interfaces. In Proceedings of Hot Interconnects V, K. Hotta, and H. Takagi. OpenJIT: A Reflective August 1997. [37] Matt Welsh and David Culler. Jaguar: Enabling efficient communication and I/O from Java. Con- currency: Practice and Experience, 2000. Spe- cial Issue on Java for High-Performance Network Computing, To appear, http://www.cs.berk- eley.edu/~mdw/papers/jaguar-journal.ps.gz. [38] Matt Welsh, Steven D. Gribble, Eric A. Brewer, and David Culler. A design framework for highly- concurrent systems. October 2000. [39] Peng Wu, Sam Midkiff, Jose Moreira, and Manish Gupta. Efficient support for complex numbers in java. In Proceedings of the ACM 1999 JavaGrande Conference, June 1999. [40] Yelick, Semenzato, Pike, Miyamoto, Liblit, Krish- namurthy, Hilfinger, Graham, Gay, Colella, and Aiken. Titanium: A high-performance Java di- alect. In ACM 1998 Workshop on Java for High- Performance Network Computing, February 1998.