The Jrpm System for Dynamically Parallelizing Java Programs
Michael K. Chen, Stanford University, [email protected]
Kunle Olukotun, Stanford University, [email protected]

© 2003 IEEE. Published in the Proceedings of ISCA-30, 9-11 June 2003, San Diego, CA, USA.

Abstract

We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communication costs relative to traditional multiprocessors, and TLS simplifies program parallelization by allowing us to parallelize optimistically without violating correct sequential program behavior. Using a Java virtual machine with dynamic compilation support coupled with a hardware profiler, speculative buffer requirements and inter-thread dependencies of prospective speculative thread loops (STLs) are analyzed in real time to identify the best loops to parallelize. Once sufficient data has been collected to make a reasonable decision, selected loops are dynamically recompiled to run in parallel.

Experimental results demonstrate that Jrpm can exploit thread-level parallelism with minimal effort from the programmer. On four processors, we achieved speedups of 3 to 4 for floating point applications, 2 to 3 on multimedia applications, and between 1.5 and 2.5 on integer applications. Performance was achieved by automatic selection of thread decompositions by the hardware profiler, intra-procedural optimizations on code compiled dynamically into speculative threads, and some minor programmer transformations for exposing parallelism that cannot be performed automatically.

1. Introduction and Overview

As the limits of instruction-level parallelism (1 – 10s of instructions) with a single thread of control are approached [44], we must look elsewhere for architectural improvements that can speed up sequential program execution. Coarser-grained parallelism, like fine-grained thread-level parallelism (10 – 1,000s of instructions), is a potential area for exploration. This type of parallelism is not exploited today due to limitations of current multiprocessor architectures and parallelizing compilers.

Traditional symmetric multiprocessors (SMPs) [13][22] have been most effective at exploiting coarse-grained parallelism (>1,000s of instructions). The high cost of inter-processor communication in these systems, generally >100s of cycles, eliminates any potential speedups for fine-grained parallel tasks that either have true dependencies or closely shared cache lines.

Modern compilers that perform array dependence analysis can parallelize Fortran-like numerical applications on traditional multiprocessors [2][5][20][38]. Unfortunately, numerous challenges have made automatic compiler parallelization of general integer programs difficult. Analyzing pointer aliasing, control flow, irregular array accesses, and dynamic loop limits, as well as handling inter-procedural analysis, complicates static dependence analysis [3]. These difficulties introduce imprecision into dependence relations, limit the accuracy of parallelism estimates, and force conservative synchronization to safely handle potential dependencies.
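As a concrete illustration (a hypothetical fragment of ours, not code from the paper or its benchmarks), consider an integer kernel that combines pointer-based traversal, a data-dependent array index, data-dependent control flow, and a loop bound unknown until run time; a static compiler must assume that any two iterations may conflict on counts[key]:

    // Hypothetical example: none of these dependencies can be resolved statically.
    final class HistogramKernel {
        static final class Node { int value; Node next; Node(int v) { value = v; } }

        static int[] countValues(Node head, int tableSize) {
            int[] counts = new int[tableSize];
            for (Node n = head; n != null; n = n.next) {      // dynamic loop limit: list length unknown
                int key = Math.floorMod(n.value, tableSize);  // irregular, data-dependent array index
                counts[key]++;                                // aliasing: iterations may hit the same slot
                if (n.value == 0) break;                      // data-dependent control flow
            }
            return counts;
        }

        public static void main(String[] args) {
            Node head = new Node(7);
            head.next = new Node(12);
            head.next.next = new Node(7);
            System.out.println(java.util.Arrays.toString(countValues(head, 8)));  // [0, 0, 0, 0, 1, 0, 0, 2]
        }
    }

At run time most iterations may touch different slots, but that cannot be proven at compile time; this is exactly the imprecision that forces conservative synchronization.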
Our paper describes the Java runtime parallelizing machine (Jrpm), a complete system that automatically parallelizes sequential programs using thread-level parallelism. This system takes advantage of recent developments that now enable a different approach to automatic parallelization for small multiprocessors (2 – 8 CPUs). The key components of this system are: a chip multiprocessor that provides low-latency inter-processor communication, thread speculation support that allows us to parallelize optimistically, a hardware profiler for identifying parallel loops from program execution, and a virtual machine environment where dynamic optimizations can be performed without modifying source binaries.

• Chip multiprocessor – Jrpm is based on the Hydra chip multiprocessor (CMP) [32]. Decreasing feature size and increasing transistor counts now allow chip multiprocessors to be a reality [6][14][24][42]. Chip multiprocessors combine several CPUs onto one die with a tightly coupled memory interface. In this configuration, inter-processor sharing and communication costs are significantly less than in traditional multiprocessors. The reduced communication costs make it possible to take advantage of fine-grained thread-level parallelism.

• Thread-level speculation – Hydra includes support for thread-level speculation (TLS) [10][21][29][40]. TLS allows a sequential program to be divided into arbitrary chunks of code, or threads, to be executed in parallel. TLS hardware ensures memory accesses between threads maintain the original sequential program order.

Traditional multiprocessors must synchronize conservatively to preserve correctness when static analysis cannot determine with complete certainty that a dependency does not exist. For non-floating-point applications, this makes it difficult to find regions that can be parallelized safely and still achieve good speedups. With TLS, it is possible to parallelize aggressively rather than conservatively. Since sequential ordering is guaranteed for parallel execution of any thread decomposition under TLS, the problem of program parallelization is reduced to finding regions of program execution with significant parallelism.
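To make the TLS execution model concrete, the following toy, single-threaded Java simulation (our sketch of the semantics, not Hydra's hardware mechanism or any Jrpm code) mimics what the speculation hardware guarantees: each iteration runs speculatively with buffered writes and a recorded read set, iterations commit in original program order, and an iteration that read a location written by a logically earlier iteration is squashed and re-executed.

    import java.util.*;

    // Toy illustration of TLS semantics: buffered speculative writes, in-order
    // commit, and squash/re-execute on a read-after-write violation.
    public class TlsLoopModel {
        interface Body { void run(int i, Access mem); }

        // Read set and buffered-write set of one speculative iteration.
        static class Access {
            final int[] memory;                           // committed state (read-only while speculating)
            final Map<Integer, Integer> writes = new HashMap<>();
            final Set<Integer> reads = new HashSet<>();
            Access(int[] memory) { this.memory = memory; }
            int load(int addr) {
                reads.add(addr);
                return writes.getOrDefault(addr, memory[addr]);   // forward this iteration's own writes
            }
            void store(int addr, int value) { writes.put(addr, value); }
        }

        static void speculate(int n, int[] memory, Body body) {
            List<Access> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {                 // every iteration runs speculatively up front
                Access a = new Access(memory);
                body.run(i, a);
                pending.add(a);
            }
            for (int i = 0; i < n; i++) {                 // commit in original program order
                Access a = pending.get(i);
                boolean violated = false;                 // did a logically earlier iteration write what we read?
                for (int j = 0; j < i && !violated; j++)
                    for (int addr : pending.get(j).writes.keySet())
                        if (a.reads.contains(addr)) { violated = true; break; }
                if (violated) {                           // squash and re-execute against committed state
                    a = new Access(memory);
                    body.run(i, a);
                    pending.set(i, a);
                }
                for (Map.Entry<Integer, Integer> w : a.writes.entrySet())
                    memory[w.getKey()] = w.getValue();    // commit buffered writes
            }
        }

        public static void main(String[] args) {
            int[] hist = new int[8];
            int[] data = {1, 3, 3, 5, 1, 7, 5, 3};
            // Data-dependent updates that a static compiler could not prove independent.
            speculate(data.length, hist, (i, mem) -> mem.store(data[i], mem.load(data[i]) + 1));
            System.out.println(Arrays.toString(hist));    // same result as the sequential loop
        }
    }

The sketch only illustrates the contract: any decomposition yields the sequential result, so the hard problem shifts from proving independence to finding decompositions that rarely violate and are large enough to be profitable.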
• Hardware profiler – Static parallelizing compilers often have insufficient information to analyze dynamic dependencies effectively. Dynamic analysis to find parallelism complements a TLS processor's ability to parallelize optimistically and use hardware to guarantee correctness. Tracer for Extracting Speculative Threads (TEST) [9] is hardware support in Jrpm that analyzes sequential program execution in real time to find the best regions to parallelize. This system provides accurate estimates of dynamic dependency behavior, thread size, and buffering requirements that are needed for selecting good decompositions and that would be difficult to derive statically. Estimates show the tracer requires minimal hardware additions to our CMP and causes only minor slowdowns to programs during analysis.

• Virtual machine – The Java virtual machine (JVM) [28] acts as an abstraction layer to hide dynamic analysis and thread-level speculation from the program. Virtual machines like the JVM and Microsoft's .NET VM have become commercially popular recently for supporting platform-independent applications. Binaries are distributed as portable, platform-independent bytecodes [28] that are dynamically compiled into native instructions of the underlying architecture and run within the protected environment of the VM. This virtualization allows us to seamlessly support a new execution model without modifying the source binaries.

A program compiled with instructions annotating local variables and thread decompositions is executed as a sequential program on a single Hydra processor. The trace hardware collects statistics in real time for the prospective decompositions. Once sufficient data has been collected by the trace hardware, regions predicted to have the largest speedup and most coverage are dynamically recompiled into speculative threads (a toy sketch of this selection step appears at the end of this section).

This dynamic parallelization system has other potential benefits:

• Reduced programmer effort – Explicit coding of parallelism is better suited for coarser grained and distinct tasks. Manually identifying fine-grained parallel decompositions can be a time-consuming effort for programs without obvious critical sections.

• Portability – The code maintains platform independence because the binaries are not modified explicitly for thread speculation.

• Retargetability – Decompositions can be tailored dynamically for specific hardware or data. Different decompositions may be chosen for future CMPs with more CPUs, larger speculative buffers, and different cache configurations. In some applications, decompositions can also be optimized for specific data set sizes.

• Simplified analysis – Compared to traditional parallelizing compilers, the Jrpm system relies on more hardware for TLS and profiling support, but reduces the complexity of the analysis required to extract exposed thread-level parallelism.

Simulations demonstrate Jrpm's ability to exploit thread-level parallelism automatically with minimal effort from the programmer. On our CMP with four CPUs, we achieve speedups of 3 to 4 on floating point applications, 2 to 3 on multimedia applications, and between 1.5 and 2.5 on integer applications. Selection of appropriate thread decompositions by the hardware profiler, low-level optimizations on code dynamically compiled into speculative threads, and some manual programmer transforms for exposing parallelism when it cannot be found automatically are all key to getting good performance.

[Figure: Jrpm system overview. Java bytecode from the application is compiled by the Java VM's JIT compiler (CFG/DFG, annotation) into annotated native code, profiled by the TEST profiler on the Hydra CMP with TLS support, analyzed by the profile analyzer, and recompiled into native code with TLS instructions.]
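As a rough sketch of that selection step (ours, for illustration only; the field names, thresholds, and loop identifiers below are invented, and Jrpm's actual analysis is described later in the paper), the decision reduces to filtering profiled loops by whether their speculative state fits the buffers and their predicted speedup is worthwhile, then preferring decompositions that cover the most execution time:

    import java.util.*;

    // Illustrative selection of which profiled loops to recompile speculatively.
    final class DecompositionSelector {
        static final class LoopProfile {
            final String id;
            final double predictedSpeedup;   // from measured inter-thread dependency behavior
            final double coverage;           // fraction of total execution time spent in the loop
            final int bufferBytesNeeded;     // speculative state per thread
            LoopProfile(String id, double s, double c, int b) {
                this.id = id; predictedSpeedup = s; coverage = c; bufferBytesNeeded = b;
            }
        }

        static List<LoopProfile> select(List<LoopProfile> candidates,
                                        int bufferCapacityBytes, double minSpeedup) {
            List<LoopProfile> chosen = new ArrayList<>();
            for (LoopProfile p : candidates)
                if (p.bufferBytesNeeded <= bufferCapacityBytes && p.predictedSpeedup >= minSpeedup)
                    chosen.add(p);
            // Recompile the highest-coverage profitable loops first.
            chosen.sort(Comparator.comparingDouble((LoopProfile p) -> p.coverage).reversed());
            return chosen;
        }

        public static void main(String[] args) {
            List<LoopProfile> profiles = List.of(
                    new LoopProfile("loop@Main.run:42", 2.8, 0.55, 12 * 1024),
                    new LoopProfile("loop@Util.scan:17", 1.1, 0.30, 2 * 1024),
                    new LoopProfile("loop@FFT.pass:88", 3.6, 0.10, 96 * 1024));
            for (LoopProfile p : select(profiles, 64 * 1024, 1.2))
                System.out.println("recompile speculatively: " + p.id);
        }
    }

The hard part in practice is producing trustworthy speedup and buffer estimates from a sequential run, which is what the TEST trace hardware provides.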