A Region-Based Compilation Technique for a Just-In-Time Toshio Suganuma Toshiaki Yasue Toshio Nakatani IBM Tokyo Research Laboratory 1623-14 Shimotsuruma, Yamato-shi, 242-8502 Japan {suganuma,yasue,nakatani}@jp.ibm.com

ABSTRACT Method inlining and data flow analysis are two major optimization 1. INTRODUCTION Dynamic compilation systems can exploit the profile information components for effective program transformations, however they from the current execution of a program. This allows dynamic often suffer from the existence of rarely or never executed code to find opportunities for better optimization, and thus is contained in the target method. One major problem lies in the as- a significant advantage over traditional static compilers. Many of sumption that the compilation unit is partitioned at method today’s Java Virtual Machines (JVMs) and Just-In-Time (JIT) boundaries. This paper describes the design and implementation of compilers indeed use some form of online profile information, not a region-based compilation technique in our dynamic compilation only for focusing the optimization efforts on programs’ hot spots, system, in which the compiled regions are selected as code por- but also for better optimizing programs by exploiting various tions without rarely executed code. The key part of this technique kinds of runtime information such as call edge frequency and bi- is the region selection, partial inlining, and region exit handling. ased branch behaviors [1][2][10][21][24][25]. For region selection, we employ both static heuristics and dynamic profiles to identify rare sections of code. The region selection Typically these systems have multiple execution modes. They be- process and method inlining decision are interwoven, so that gin the program execution using the interpreter or the base-line method inlining exposes other targets for region selection, while compiler as the first execution mode. When they detect the pro- the region selection in the inline target conserves the inlining gram’s frequently executed methods or critical hotspots, they in- budget, leading to more method inlining. Thus the inlining process voke the for those selected methods to run in can be performed for parts of a method, not for the entire body of the next execution mode for higher performance. Some systems the method. When the program attempts to exit from a region employ several different optimization levels to select from based boundary, we trigger recompilation and then rely on on-stack re- on the invocation frequency or relative importance of the target placement to continue the execution from the corresponding entry methods. point in the recompiled code. We have implemented these tech- niques in our Java JIT compiler, and conducted a comprehensive Method inlining and data flow analysis are two major components evaluation. The experimental results show that the approach of re- for effective program transformations in the higher optimization gion-based compilation achieves approximately 5% performance levels. However, methods often contain rarely or never executed improvement on average, while reducing the compilation over- paths, as shown in the previous studies [5][29], and this can cause head by 20 to 30%, in comparison to the traditional function- some adverse effects that reduce the effectiveness of these optimi- based compilation techniques. zations. For example, method inlining can be restricted due to the excessive code size caused by the rarely executed code in the tar- get method. 
Some methods may include a large portion of rarely Categories and Subject Descriptors executed code, and this may prevent them from being inlined at D.3.4 [Programming Languages]: Processors – incremental the corresponding call sites. Others can grow large from the cumu- compilers, optimization lative effects of sections of rare code after several stages of inlin- ing, and this may prevent other hot methods from being inlined. General Terms Also data flow analysis is often hindered by kill points existing in Performance, Design, Experimentation, Measurement, Languages those rarely executed paths, whose control flow may merge back to non-rare paths, and this can prevent the propagation of accurate data flow information on non-rare paths. Keywords Region-based compilation, dynamic compilers, partial inlining, The problem here is that we implicitly assume methods are the on-stack replacement units for compilation. Even if we perform , we ei- ther inline or do not inline the entire body of a target method, re- gardless of the structure and dynamic behavior of the target Permission to make digital or hard copies of all or part of this work for method. Method boundaries have been a convenient way to parti- personal or classroom use is granted without fee provided that copies are tion the process of compilation, but methods are not necessarily a not made or distributed for profit or commercial advantage and that desirable unit to perform optimizations. If we can eliminate from copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, the compilation target those portions that are rarely or never exe- requires prior specific permission and/or a fee. cuted, we can focus the optimization efforts only on non-rare PLDI’03, June 9-11, 2003, San Diego, California, USA. paths and this would make the optimization process both faster Copyright 2003 ACM 1-58113-662-5/03/0006…$5.00. and more effective. In this paper, we describe the design and implementation of a re- fundamental difference between these prior techniques and ours is gion-based compilation (RBC) technique in our dynamic compila- in the integration of the region selection process with the method tion system. In this framework, we no longer treat methods as the inlining decisions. That is, the previous techniques first perform unit of compilation, as in traditional method-based or function- inlining normally and then eliminate rarely executed code from the based compilation (FBC), and select only those portions that are resulting code. The major disadvantage with that approach is that identified as non-rare paths. The term region refers to a new com- possibly large portions of the body of a once inlined method can pilation unit, which results from collecting code from several be simply eliminated as rare in later phases, and it may miss some methods of the original program but excludes all rarely executed optimization effects that could be achieved with more method portions of these methods. Regions are inherently inter-procedural, inlining if those eliminated rare code sections were not included in rather than intra-procedural, and thus the region selection needs to the inlining process. be interwoven with the method inlining process. 
The following are the contributions of this paper: The notion of region-based compilation was first proposed in [14], z as a generalization of the profile-based trace selection approach Design and implementation of a region-based compilation [19], with some experimental evidence of the potential impact by technique for a dynamic compiler: We present a simple and allowing the compiler to repartition the program for desirable effective technique consisting of region selection, partial inlin- compilation units. An improved region formation algorithm was ing, region exit handling, and other RBC-related optimizations then proposed by combining region selection and the inlining (such as partial escape analysis and partial dead code elimina- process [26][27]. The goal of this prior work was to expose as tion), as implemented in our dynamic compilation system. many scheduling and other optimization opportunities as possible zDetailed experimental evaluation of the effectiveness of re- for an ILP static compiler without creating an excessively large gion-based compilation in a dynamic compilation environ- amount of code due to the aggressive inlining. In a dynamic com- ment: We present detailed evaluation results for the impact on pilation environment, however, this technique is especially useful both performance and compilation overhead by this technique, for several reasons. First, dynamic compilers can take advantage based on the actual implementation on a production-level Java of runtime profile information from currently executing code and JIT compiler. use this information for the region selection process. Second, they are very sensitive to the compilation overhead, and this technique The rest of this paper is organized as follows. The next section is can significantly improve the total compilation time and code size. an overview of our dynamic compilation system as used in this Third, they can avoid generating code outside the selected regions study, and describes our implementation of on-stack replacement until the code is actually executed at runtime. for safely executing the exit path from the selected region. Section 3 describes our design and implementation of the region-based At the same time, we have other kind of challenges that are differ- compilation technique, including intra-method region identifica- ent from those in static compilers. For example, dynamic online tion, partial inlining, and other optimization techniques that are profiles are not always available for all target methods, unlike a aware of region information. Section 4 presents the experimental static compilation model using offline profile results. Even if they results on both performance and compilation overhead by apply- are available, we cannot expect complete profile results that cover ing this technique. Section 5 discusses the results and possible fu- all basic blocks of the target method. The profiling is usually per- ture research. Section 6 summarizes related work, and finally Sec- formed only for selected methods and on a sampling basis in order tion 7 presents our conclusions. to reduce the runtime overhead. Thus we cannot rely solely on the profile information and need to come up with a different region selection strategy by combining some static heuristics. Also the 2. 
BACKGROUND cost of handling region exits can be high in a dynamic compilation In this section, we describe the basic infrastructure required for the system, with recompilation and possibly stack frame replacement, RBC approach in a dynamic compilation system. Section 2.1 gives so the region selection may need to be more conservative. the brief overview of the recompilation system with dynamic pro- filing mechanism1. Section 2.2 describes our implementation of The key components for the RBC approach are region selection, OSR as the basic support for handling region exit points. partial inlining, and the region exit handling. For region selection, we employ both static heuristics and dynamic profiles to identify seed blocks of rare code and then propagate them to form rare 2.1 System Overview Our system is a multilevel compilation system, with a mixed mode code sections. The region selection process and method inlining interpreter (MMI) and three compilation levels (level-0 to level-2). can affect each other, in the sense that method inlining exposes Initially all methods are interpreted by the MMI. A counter for another target for region selection, and the region selection proc- counting both method invocation frequencies and loop iterations ess in turn conserves the inlining budget. Thus the inlining process is provided for each method. When the counter exceeds a thresh- can be performed for parts of a method, not for the entire body of old, the method is considered as frequently invoked or computa- the method. When the program attempts to exit from a region tionally intensive, and the first compilation is triggered. boundary at runtime, we trigger recompilation and rely on on- stack replacement (OSR) [15], which is a technique to dynami- The dynamic compiler has a variety of optimization capabilities. cally replace a stack frame from one form to another, in order to The level-0 compilation employs only a very limited set of optimi- continue the execution from the corresponding entry point in the recompiled code. 1 Notions related to RBC include deferred compilation [7][13], un- The overall system architecture of our dynamic optimization common traps [15][21], and partial method compilation [29]. The framework is described in detail in [24][25]. zations. Method inlining is considered only for small target meth- this option in our framework means we lose the majority of the ods. The devirtualization of method calls is applied based on class benefit expected from the integration of inlining and region selec- hierarchy analysis and type flow analysis, and produces either tion. Recall that our motivation for the RBC approach is not only guarded or unguarded code [17]. Preexistence analysis is also per- to eliminate the control flow merges, but also to exploit mutual in- formed to safely remove guard code and backup paths without re- teraction between method inlining and region selection, and this is quiring OSR [12]. Most of the data flow optimizations are dis- possible by actually removing rare regions from the compilation abled except for very basic copy and constant propagation. The scope. Thus we eliminated this simple option from consideration. level-1 compilation enhances level-0 by adding additional optimi- Some other options are listed below, all of which assume on-stack zations, including more aggressive full-fledged method inlining, replacement as a pre-requisite technique. and a wider set of data flow optimizations. 
The level-2 compila- tion is augmented with all the remaining optimizations available in 1. Simply fall back to the interpreter. This is the option taken by our system, such as escape analysis and stack object allocation, the HotSpot server compiler [21] when class loading invali- code scheduling, and DAG-based optimizations. dates inlining or other optimization assumptions. This relies on the underlying recompilation system to promote the “de- The level-0 compilation is invoked from the MMI and is executed compiled” interpreted method once again. as an application , while level-1 and level-2 compilations are performed by a separate compilation thread in the background. 2. Drive recompilation with deoptimization. This is the policy The priority of the compilation thread is set equal to that of the used by the SELF-93 system [15] and Jikes RVM [13]. The application threads. The upgrade recompilation from level-0 com- deoptimized compiled code is used only for the current transi- piled code to level-1 or level-2 optimized code is triggered on the tion. If the optimization assumption turns out to be wrong and basis of the compiled method hotness level as detected by a timer- the uncommon cases happen frequently, the system reopti- based sampling profiler. The sampling profiler periodically moni- mizes the method under the new assumption, replacing the tors the program counters of application threads, and keeps track code for future invocations, by using the underlying recom- of methods in threads that are using the most CPU time. Depend- pilation system. Thus this option can produce two additional ing on the relative hotness level, the method can be promoted from versions of the code. level-0 compiled code to either level-1 or directly to level-2 opti- mized code. 3. Drive recompilation with the same optimization level. The re- compilation is performed with the same scope for method There is another profiler, the instrumenting profiler, used when inlining as in the original version. The recompiled version has detailed information needs to be collected from the target method. several entry points corresponding to region boundaries in the The instrumentation code is initially generated in level-0 compiled original version, and is used for both current transitions and code but disabled. When a method is identified as hot and is a future method invocations. candidate for promotion to a higher optimization level, the con- troller enables the instrumentation code in the target method to The information we need to keep for restoring the JVM state at the collect the required information. After a sufficient number of sam- region boundary should be basically the same for all three options ples are collected, it is disabled to minimize the performance im- above, but the way to reconstruct the new stack frame is different. pact. The compiler can then take advantage of this online profile In particular, the first two options need to create a set of stack information in the higher optimization levels. frames according to the inlined context at the transfer point, while the third option can simply create a single new stack frame. This instrumenting profile mechanism is used for collecting both virtual/interface call receiver type distributions and basic block There are both advantages and disadvantages for each of these execution frequencies. 
The receiver type profile drives inlining if three options, and which options we should choose depends on the it identifies the call site as dynamically monomorphic, and the ba- meaning of “rare code”. If we stay very conservative, saying that sic block frequency profile provides the runtime rare code infor- only extremely rare cases are rare, then the first option is probably mation. We apply RBC-based optimization on level-1 and level-2, the best choice, since the current version can still serve for future using the rare code profile information collected from level-0 invocations and it is only necessary to handle transitions that may compiled code. happen very infrequently. However, this means the compiler may miss some additional optimization opportunities for frequent cases due to the conservativeness. On the other hand, if we use an ag- 2.2 Region Exit Handling gressive strategy for optimizing away rare code, these cases would There are several options to handle region exit points. The sim- actually occur frequently at runtime, and for the sake of overall plest option is code splitting [7], which was originally designed to performance it may be better to replace the compiled code as reduce the overhead of message sends in the SELF programming quickly as possible to avoid too many expensive OSR operations. language. It eliminates control flow merges from rarely executed code blocks by performing tail-duplication of conditional control Ideally it might be better to choose from these options depending flows to improve data flow information. To alleviate the poten- on the predicted execution frequency for each rare code block tially exponential code expansion problem of this technique, sev- identified. For example, we could use the branch prediction from eral improved versions have been proposed, such as reluctant the dynamic profile information. In our current implementation, splitting [7] and feedback-directed splitting [2]. With this option, we use only the third option above, recompilation with the same we do not eliminate rarely executed paths, but keep that code in a optimization level. This is because we apply RBC-based compila- separate section from non-rare paths. The implementation at a re- tion only at the higher optimization levels (level-1 and level-2), gion boundary can be very simple, just a single jump instruction, and thus those methods where region exits occur are already without requiring a complex OSR mechanism. However, using known to be hot and performance critical, and we do not want to level-0 online profiler transition entry compiled code basic block partial inlining region exit basic block (RE-BB) intra-method inlining region selection opc_recompile level-1/2 when frames are the same shape compilation other optimizations

drive recompilation code generation if required level-1/2 opc_recompile OSR compiled code runtime increment Figure 2. The flow of region selection and inlining in level-1 OSR counter and level-2 RBC-based optimizations. Both static heuristics control of future and dynamic information provided by the online profiler are invocations RBC-optimized recompiled used in the intra-method region selection process. version version callee-saved registers. We also create a runtime structure indicat- Figure 1. An example of transitions from an RBC-optimized ing the inline context of the given method, and a pointer to the ap- version to recompiled version. After replacing the current propriate node in this structure is also recorded as part of the in- stack frame (OSR), it jumps to the corresponding entry formation for each exit point. point in the recompiled code. If the source and destination frames are of same shape, it directly jumps without an OSR All registers holding live variables are first spilled out into mem- operation. The OSR counter is used to determine when a fu- ory at region exit points. One small optimization here is that when ture invocation is redirected to the recompiled version. the source and destination stack frames turn out to be the same shape (frame size, callee saved registers, number and order of lo- cal variables, etc) in the first OSR after recompilation, we patch deoptimize nor decompile them. In the recompilation, to avoid the the instruction at the region exit points with an unconditional recursive recompilation we do not apply RBC optimization. Fig- jump instruction directly into the corresponding entry points, so ure 1 illustrates how a transition from a RBC-optimized version to that we can skip the expensive OSR operation from the next time a recompiled version occurs. The recompiled version prepares all (see the shortcut line in Figure 1). the entry points for future possible transitions, not only the entry point for the current transition. 3. REGION-BASED COMPILATION Since our region selection is speculative (based on static heuristics In this section, we provide a detailed description of our region- and dynamic profiles), it is possible that control frequently exits based compilation technique. Section 3.1 describes the intra- from certain region boundary points. However, if the frequency is procedural region selection process using both static heuristics and small enough, we want the original RBC version to still serve for dynamic profiles. Section 3.2 shows how the region identification future invocations since it is better optimized. In order to deter- and method inlining are interrelated to obtain more desirable tar- mine when we should switch from the RBC-optimized version to gets as compilation units. Section 3.3 gives some useful optimiza- the new recompiled version for future invocations, we provide a tions we implemented to take advantage of the target code in the counter for each RBC version to count the number of actual OSR selected region. events. Initially the recompiled version is used for the region exit transitions only, not for future invocations. When the OSR counter As shown in Figure 2, region-based compilation is performed only exceeds a certain threshold, we conclude that our speculative re- in level-1 and level-2 compilation, using the profile information gion selection did not work well for this method, and redirect the resulting from the instrumentation on level-0 compiled code. control of future invocations to the recompiled version. 
Since only level-0 compiled methods that are identified as hot and hence are candidates for promotion to the next optimization level As in the previous implementation of OSR [13][29], we provide a are instrumented, profile information is not always available for all first-class operator in our compiler’s intermediate representation, target methods considered for inlining at level-1 and level-2. called OPC_RECOMPILE, at each region exit point. We keep all live variables at that program point in the given bytecode sequence This is in contrast to the static compilation model using offline as operands of this special operator for both local and stack vari- profiles. Even if the information is available, we cannot expect a ables so as to be able to restore the JVM state. This is done in a complete profile that covers all the code within the target method. very early stage in the compilation, so that any optimizations in This is regarded as a common problem for dynamic compilers, later phases can rely on the use of those variables across each re- where online profiles need to be collected selectively and on a gion exit boundary. At the final stage of the compilation, we create sampling basis from the currently executing program to reduce the a map indicating the final locations of those variables as well as performance impact. Therefore we combine some static heuristics other information for frame conversion, such as frame size, and with incomplete profile information available for region selection. 3.1 Intra-Procedural Region Selection /* Rare-Static, Rare-Profile, NonRare-Static, and NonRare-Profile Unlike the region formation algorithm shown in [14], we identify are all defined as a bit vector representation. */ rarely executed portions that we want to remove from the original code, rather than choosing a preferred portion to be optimized to- /* initialization phase (set flags in seed blocks) */ gether. This is because we have to be more conservative for region for each basic block bb { selection in a dynamic compiler than in a static compiler, consid- /* set flags in Gen set based on heuristics */ ering the potentially high overhead of OSR that has to be per- if ( matched to static heuristics ) { formed when control exits from the selected region. set (Gen (bb), Rare-Static) or set (Gen (bb), NonRare-Static); } This is also a different strategy from that used in dynamic binary /* set flags in Gen set based on dynamic profile information */ translation systems [3][6][9], where frequently executed paths if ( profile_info_available (bb) ) { (traces) are extracted as a unit for optimization to form a single- if ( profile_count (bb) == 0 ) { entry multiple-exit contiguous sequence. These systems dynami- set (Gen (bb), Rare-Profile ); cally reoptimize code on top of already (statically) optimized ge- unset (Gen (bb), NonRare-Static); neric executable, which is quite a different compilation model } else if ( profile_count (bb) > threshold ) { from our system environment. We need to avoid escaping from the set (Gen (bb), NonRare-Profile ); selected regions as much as possible, and thus have to employ unset (Gen (bb), Rare-Static); wider regions that can contain arbitrary numbers of hot traces by } } } removing rarely executed code blocks.

The intra-procedural region selection process is shown in Figure 3. /* iteration phase (grow rare and non-rare regions) */ We assume each method is represented as a control flow graph do { (CFG) at this point with a single entry block and a single exit changed = false block. The algorithm begins by marking a Gen set for some seed for each basic block bb { basic blocks as either non-rare or rare by employing both heuris- /* compute Out set from all successors’ In set */ tics and dynamic profile results. We currently use the following Out (bb) = ∪ In (succ (bb)) for all successors of bb; heuristics. if ( is_set (Out (bb), NonRare-Static | NonRare-Profile) ) unset (Out (bb), Rare-Static | Rare-Profile); zA backup block generated by compiler versioning optimiza- 2 /* combine Out set and Gen set to update In set */ tion (such as devirtualization of method invocation ) is rare. new_set = Gen (bb) | Out (bb); zA block that ends with an exception throwing instruction if ( is_set (Gen (bb), NonRare-Profile) ) (OPC_ATHROW) is rare. unset (new_set, Rare-Static | Rare-Profile); else if ( is_set (Gen (bb), Rare-Static | Rare-Profile) ) z An exception handler block is rare. unset ( new_set, NonRare-Static | NonRare-Profile); ≠ zA block containing unresolved or uninitialized class references if ( new_set In (bb) ) { is rare. In (bb) = new_set changed = true zA block that ends with a normal return instruction is non-rare. } } If the dynamic profile information is available and it shows that a } while ( ! changed ) block is never executed, then we mark the block as rare. If the pro- file count value is above a predetermined threshold, the block is /* final phase (remove rarely executed block) */ marked as non-rare. The dynamic profile information is given pri- find boundary points from a non-rare block to a rare block ority when there are conflicts with the static heuristics. perform for each rare block entry bb { In the iteration phase, we propagate this information along back- create a new bb (RE-BB) ward data flow until it converges for all of the basic blocks. The create a special instruction in the RE-BB holding live variables basic block is marked non-rare in its Out set if any of the succes- redirect incoming edges to the RE-BB sor’s In sets is marked non-rare, otherwise it is marked rare when } any of the In sets is marked rare. If two flags conflict on the same path between Gen set and Out set, the flag in Gen set is selected to propagate further. Thus a region of rare code can grow backward Figure 3. Algorithm for intra-procedural region selection. until it encounters a non-rare path, or a statically identified rare region can be blocked from growing by a profile-based non-rare The final phase simply traverses the basic blocks to determine the block along its path. When converged, the rare regions should transitions from non-rare blocks to rare blocks, and marks those have reached the points where the branches are expected to be locations as rare block entry points. After performing live analysis rarely taken from the non-rare paths. to find the set of live variables at each rare block entry point, we generate a new region exit basic block (RE-BB) for each entry point and replace the original entry block by redirecting its incom- 2 Loop versioning is another versioning optimization, but we cur- ing control flow edge to the new block. This new RE-BB contains rently do not treat its backup block as rare due to our implemen- a single special instruction OPC_RECOMPILE, which holds all tation. 
See Section 5. live variables at the rare block entry point as its operands, in order to trigger recompilation and call a runtime routine that performs /*setup for root method */ OSR when it is executed. All the rare blocks that originally existed regionSelection ( root_method ) following rare block entry points are no longer reachable from the construct (a possibly large) call_graph G top of the method and thus eliminated in the succeeding control set total_budget B, set current_cost C = 0 flow cleanup phase. /* actual inlining pass */ 3.2 Partial Inlining do { Partial inlining begins by performing region selection for the root select a call edge E in call_graph G { method as shown in Figure 4. It then builds a large call tree of M = target method of E possible inlined scopes from this root method with allowable sizes if ( is_tiny (M) ) { and depths using optimistic assumptions. The actual inlining pass /* tiny methods are always inlined */ proceeds by checking each individual decision in the given call perform inlining for M tree against the total cost to come up with a pruned final tree. Spe- } else { cifically, it tries to greedily incorporate as many methods as possi- /* otherwise, decision is made after region selection */ ble using static heuristics until the predetermined budget is used M’ = regionSelection (M) up. Tiny methods are always inlined without qualification [25]. if ( inlinable (M’) ) { Otherwise the target method is first processed by region selection, perform inlining M’ and then it is determined whether or not it is inlinable based on update live information for each RE-BB in M’ this reduced code size. If inlinable, it performs inlining only for C += cost (M’) the non-rare part of the code, and the current cost is updated with } } the reduced size of the method. } while ( C < B )

Note that the devirtualization of the dynamically dispatched call sites is also performed during the inlining process based on class Figure 4. Algorithm for interacting intra-procedural region hierarchy analysis and the receiver type distribution profile col- selection and inlining process, leading to partial inlining. lected in the previous runs, and region selection can successfully remove the backup path which otherwise needs to be generated. tion process can be more effective with increased optimization One tricky part of this process is to update the live variable infor- scope. mation in each special instruction provided in the RE-BB for the method being inlined. There are two things that need to be done. One is to rename those live variables by reflecting the mapping 3.3 Optimizations into the caller’s context, just like other local variable conversions All of the analyses and optimizations that follow partial inlining when inlined. The other is to add the live variables at the call site can proceed normally. The operands in the special instruction to reflect the complete set of live variables at each point. This is OPC_RECOMPILE provided in the RE-BB for each region exit necessary to automatically hide the effect of all optimizations such point work as anchors to preserve the necessary variables, which as copy propagation, constant propagation, and any other program can be renamed by another variables or replaced by constants dur- transformations. ing the optimization phases. The RE-BB serves as a placeholder for any optimizations that require generating compensation code The direct advantage of the partial inlining in our framework is for an exiting path. As special optimizations that can take advan- twofold: tage of the selected region, we implemented partial dead code 1. Since we first remove the rarely executed paths before trying elimination and partial escape analysis, as described in [29]. to find the next inlining candidate, we never inline methods at Partial . We have to keep all live variables call sites within rare portions of code, since they are no longer (both local and stack) in the bytecode-level code at each RE-BB to included in the current scope. This avoids performing inlining be able to restore the JVM state correctly when a region exit oc- at performance-insensitive call sites and can conserve the curs at runtime. This means the live range of some of those vari- inlining budget. ables becomes longer than in the FBC approach. For example, 2. Methods being inlined are first processed through intra- common subexpression elimination can make some variables un- method region selection before actual inlining. Inlining is con- referenced, and they are eliminated as dead code in the FBC. But sidered and carried out against the reduced target code after those variables may have to be passed on region exit if they are removing the rare portions of the code, and this contributes to used later at the bytecode level. This problem is expressed in the conserving the inlining budget. final code as increased register pressure, extra instructions to spill into memory in the non-rare paths, and larger frame sizes. 
Assuming the current set of criteria for method inlining is reason- able and thus fixed, we can then use the saved budget from the Partial dead code elimination [18] can partly alleviate this prob- above steps and try to inline other methods in the call tree, which lem by pushing computations that are only live in region exit paths is expected to be more effective and can contribute to further im- into the RE-BBs. Our implementation of partial dead code elimi- proving performance. Other indirect benefits due to partial inlin- nation is a simple code motion followed by dead code elimination. ing include that 1) instruction cache locality can be improved We maintain two sets of live variables, one from the RE-BBs and since non-rare parts of the code tend to be better packed into the the other from the non-rare paths. Using a standard code motion same compilation unit, and 2) later optimizations in the compila- algorithm, we then move computations whose defined variables object operation has been eliminated in the non-rare paths but is neces- allocated lv1 = opc_new class on stack sary at the exit point.

4. EXPERIMENTAL RESULTS

generate This section presents some experimental results showing the effec- compensation tiveness of the RBC in our dynamic compilation system. We out- code line our experimental methodology first, describing the configura- lv1 = opc_stack2heap lv1 tions used in the evaluation, and then present and discuss our opc_recompile lv1, ... measurement results. region exit basic block (RE-BB) object opc_putfield lv1, lv2 captured 4.1 Benchmarking Methodology All the performance results presented in this section were obtained on an IBM IntelliStation M Pro 6850 (Pentium4 Xeon 2.8 GHz Figure 5. An example of partial escape analysis and the uni-processor with 1,024 MB memory), running Windows XP, compensation code generation. and using the JVM of the IBM Developer Kit for Windows, Java Technology Edition, version 1.3.1 prototype build. The bench- are included in the set from the RE-BBs but not included in the marks we chose are SPECjvm98-1.04 and SPECjbb2000-1.02 other set. The computations are copied once into both the appro- [23]. SPECjvm98 was run in interactive mode with the default priate RE-BB and the non-rare path in the other branch direction, large input size, and in autorun mode 10 times for each test, with but the copy in the non-rare path can then be eliminated in the fol- the initial and maximum heap sizes of 128 MB. Each distinct test lowing dead code elimination phase. was run with a separate JVM. SPECjbb2000 was run in the fully compliant mode with 1 to 8 warehouses, with the initial and Partial escape analysis. Escape analysis and its application to maximum heap sizes of 256 MB. stack object allocation, scalar replacement, and synchronization elimination is very effective for improving performance. However, The threshold in the MMI to initiate level-0 compilation was set to quite often it suffers from the fact that objects escape from only 500. The timer interval for the sampling profiler for detecting hot rarely executed code, especially from the backup path of a devir- methods was 3 milliseconds. The number of samples to be col- tualized method call, and thus its effectiveness with the FBC ap- lected in the instrumentation-based profiler was set to 10,000. The proach has been limited in practice. We modified the escape threshold of the numbers of OSRs to redirect the future method analysis to simply ignore region exit points, so that it can analyze invocations to recompiled code is set to 10. whether or not objects can be escaped only for non-rare paths in The five configurations are compared, one with the FBC approach, the target code. and the others are RBC approaches with variations in optimiza- Our escape analysis is based on the algorithm described in [28]. It tions and the region selection process as listed below. The baseline is a compositional analysis designed to analyze each method inde- of the comparison is with the current FBC approach. pendently and to produce a parameterized analysis summary result 1. RBC-noopt: This is the RBC approach, but disabling partial that can be used at all of the call sites that may invoke the method. escape analysis, partial dead code elimination, and partial Hence the analysis result can be more precise and complete as inlining. more of the invoked methods are analyzed. However, if the analy- sis result is summarized based on the optimistic assumption with 2. 
RBC-nopi: The same setting as RBC-noopt, but enabling the rare regions ignored, then some of its caller methods may con- partial escape analysis and partial dead code elimination only. clude that the object passed as a parameter is non-escaping. When This is to evaluate the effectiveness of partial inlining. In this the execution exits from a region boundary, we have to recompile and the above cases, the region selection process is performed not only the current method but those caller methods that used the for the resulting code after normal method inlining. summary information as well. Therefore, we check whether argu- ments are included in the list of live variables at any region exit 3. RBC-full: This is the RBC approach, with all three optimiza- point within the method, and suppress generating summary infor- tions enabled. mation if that is the case. 4. RBC-offline: This is the same as RBC-full, but using offline In the final stage of the analysis, we check each of the objects collected profile results instead of online results. This is to identified as stack allocatable or replaced by scalar variables as to evaluate the maximum potential of the RBC approach if we whether it is included in the list of live variables at each region could have complete profile information. There should be no exit point, and if it is, generate a special instruction recompilation or OSR events during execution. OPC_STACK2HEAP in the RE-BB as compensation code. Fig- In our current implementation of partial inlining, we perform re- ure 5 shows a graphic example of this. When executed, this in- gion selection for an inline target method and estimate the cost for struction calls a runtime that performs 1) allocation of the object the reduced target code, but temporarily inline the entire body of on the heap and its initialization, 2) copying of the object content the method. The rare portion of the code identified in the region from stack to heap or copying of the scalar-replaced variables selection is removed immediately after the whole inlining process. which are also included in the list of live variables at the current This should not greatly affect the compilation time or the code size, exit point, and 3) synchronization on the allocated object, if that but the compile-time peak work memory usage can look larger than it should be with the complete implementation. Benchmarks mtrt jess compress db mpeg jack javac SPECjbb

Total number of methods executed 455 734 325 321 495 561 1,085 2,818 Level-0 compiled methods 200 334 135 126 219 373 825 566 Level-1 / level-2 compiled methods 33 41 7 7 56 81 157 166 Region-based optimized methods 28 22 1 6 19 58 101 123 Profile identified rare path 3 12 1 7 2 7 14 35 Devirtualized backup path 483 36 0 19 16 136 644 1,244 Exception throwing path 19 22 0 7 33 114 169 171 Handler block exit points exit points 2 3 0 6 0 69 54 96

Number of region region of Number Uninitialized code path 0 0 0 0 0 1 0 4 Recompiled methods due to region exit 0 0 0 0 1 9 4 1 Number of on-stack replacement 0 0 0 0 6 52 34 10

Table 1. Statistics of the region-based compilation for benchmark runs with RBC-full. The top three rows show execution and compilation statistics, the middle six rows are numbers of RBC optimized methods and how those rare region are identified, and the bottom two show runtime behavior for recompilations and on-stack replacements.

14 24.7% 4.2 Statistics on Region-Based Compilation RBC-noopt RBC-nopi 12 Table 1 shows the statistics when running the benchmarks with RBC-full RBC-offline RBC-full. The second to fourth rows are execution and compila- 10 tion statistics, showing the total number of methods executed for each test, the total number of methods compiled with level-0, and 8 the total number of methods compiled with level-1 and level-2, re- spectively. The last number, level-1 and level-2 compiled methods, 6 are the target of RBC-based optimizations. Out of these methods, the fifth row shows the actual number of methods where rare re- 4 gions were identified and RBC optimizations were performed. The 2 next five rows show the breakdown of how those rare regions

were identified in the region selection process, that is, the number FBC over improvement Percent 0 of exit points for each rare type, based on profile results or on mtrt jess comp db mpeg jack javac jbb G.M. some heuristics. The bottom two rows show the number of meth- -2 ods recompilation has been triggered by region exits, and the total number of OSR events that occurred, respectively. -4 Figure 6. Performance improvement with four RBC ap- Except for compress, quite a few methods were RBC optimized. proaches over the FBC. Taller bars show better scores. The number was roughly 50% to 80% of the level-1 and level-2 compiled methods (except for mpegaudio, which was about ber of RBC-optimized methods. The number of OSRs is con- 30%), showing that many benchmarks contain rare regions even in strained, owing to the mechanism of dynamic OSR counting and hot methods and we can do some optimizations for these methods. the control of future invocations for the recompiled methods based The backup paths for the devirtualized method invocations is the on those counts. The other benchmarks show none or very small most common kind of rare region identified, followed by the ex- numbers of recompilations and OSRs. ception throwing path. The majority of rare regions are identified on the basis of static heuristics, and the numbers of profile identi- fied rare paths are relatively small. This is because profile results 4.3 Performance are not always available for RBC optimization target methods as Figure 6 shows the performance improvements of the four varia- described in Section 3, and because we remain conservative when tions of the RBC over the current FBC approach. We took the best working from the profile results, considering the fact that our pro- time from 10 repetitive autoruns for each test in SPECjvm98, and files are based on the samples collected for short intervals of pro- the best throughput from a series of successive executions from 1 gram execution. to 8 warehouses for SPECjbb2000. The figure shows that both RBC-full and RBC-offline perform significantly better than FBC The numbers of recompiled methods and OSRs is relatively large for some benchmarks, with a 25% improvement for mtrt, and 5 to in jack and javac compared to the other benchmarks. These 7% improvement on average. benchmarks are known to frequently raise exceptions during the program execution, and control frequently exited from certain re- The majority of the performance gain for mtrt comes from the gion boundary points. But the number of recompiled methods was elimination of backup paths for devirtualized method calls and its still kept within a reasonable level, around 15% of the total num- exploitation by the partial escape analysis. Mtrt has many virtual Compilation Time Ratio method invocations, most of which are devirtualized and the target 1 methods are inlined. These methods are never overridden and re- 0.9 taining the backup paths for dynamic class loading just prevents the escape analysis from working well. However, simply removing 0.8 the backup paths as a part of region selection does not solve this 0.7 problem. Without partial escape analysis, the performance actually 0.6 degrades as shown in RBC-noopt, because the escape analysis now has to treat region exit points as globally escaping points for 0.5 all live variables, not only for the variables passed as arguments as 0.4 in the original virtual invocation call sites. 
As a result, more ob- 0.3 jects are analyzed as escaping, and the number of stack allocated or scalar variable replaced objects is decreased. 0.2

Ratio of compilation time over FBC over time compilation Ratio of 0.1 For the other benchmarks, the performance difference with RBC- noopt and RBC-nopi seems not very significant, meaning that 0 simply eliminating data flow merge points from rare regions is not mtrt jess comp db mpeg jack javac jbb G.M. generally very effective by itself for a wide range of benchmarks. The partial inlining, on the other hand, can contribute to a signifi- Compiled Code Size Ratio cant improvement for some benchmarks, especially for jess and 1 jack, as the difference from RBC-nopi to RBC-full shows. C 0.9 The large improvement with partial inlining in jack, for example, 0.8 results from the additional inlining performed in the partial inlin- ing process which then allows the escape analysis to recognize 0.7 some frequently allocated objects as captured and makes those ob- 0.6 jects stack allocated in one of the core methods of the benchmark. 0.5 This is a good example of the indirect effect of partial inlining. 0.4 4.4 Compilation Overhead 0.3 Figure 7 shows the ratio of compilation overhead with the four 0.2 RBC variations over FBC using three metrics: the compilation 0.1 time, the compiled code size, and the compilation time peak work FB over code size compiled Ratio of memory usage. Smaller bars mean better scores in these figures. 0 We measured only level-1 and level-2 overhead, since level-0 is mtrt jess comp db mpeg jack javac jbb G.M. shared among all configurations. All these figures for the three RBC configurations using online profile include additional over- Peak Work Memory Usage Ratio head that results from the recompilations if region exits occur at 1.2 runtime. The peak memory usage is the maximum amount of C memory allocated for compiling methods. Since our memory man- 1 agement routine allocates and frees memory in 1 Mbyte blocks, this large granularity masks the minor differences in memory us- age and causes the results of the figure to form clusters. 0.8

Overall the RBC approaches show significant advantages over the 0.6 current FBC approach in most of the benchmarks. In particular, the reductions sometimes exceed 60% (such as code size for jess and db), and are between 20% and 30% on average, depending on 0.4 the benchmarks and metrics of the overhead. This significant re- duction in compilation overhead is not surprising for RBC-noopt 0.2 and RBC-nopi, since the rare regions identification and their re- moval from the target code is performed after completing the nor- 0 mal inlining process, and all the optimizations and code genera- FB over usage memory peak work Ratio of mtrt jess comp db mpeg jack javac jbb G.M. tion are executed for this smaller target code. The reduction of overhead for RBC-full is slightly less dramatic, but it still shows a RBC-noopt RBC-nopi RBC-full RBC-offline significant reduction from FBC overall. An increase in peak work memory usage with RBC-full and RBC-offline for some bench- marks is caused by a problem in our current implementation of Figure 7. Ratio of three metrics of compilation overhead with partial inlining, in temporarily inlining the entire body of target four RBC approaches compared to FBC. Smaller bars mean methods. We intend to fix this problem in the future. better scores. The top figure shows the compilation time ratio, the middle shows the ratio of compiled code size, and the bot- The middle graph of Figure 7 shows the code size reduction over tom shows the ratio of compilation time peak work memory FBC, but RBC actually requires more runtime memory space for usage. the map of each region’s exit points and for the runtime structure We can explore further opportunities for identifying rarely exe- holding the inline context of each method. The structure showing cuted code to increase the effectiveness of the RBC approach. For the inline context is shared for other purposes in our implementa- example, loop versioning is an optimization technique for hoisting tion, such as by the exception handling system, so the additional the array bound exception checking code for an individual array overhead specific to the RBC is only the map. The size of the map outside a loop by providing two copies of the loop: the safe loop, depends on the number of live local and stack variables for each where exception checking code is retained as in the original loop, region exit point, but it typically requires around 50 to 70 bytes and an unsafe loop, where all array exception checking code is per exit point. Even if we take this map space into consideration, eliminated. Guard code is provided to examine the whole range of the total size is still well under the FBC code size for most of the the index against the bound of the arrays accessed within the loop, benchmarks. and depending on the result of this test, either the safe or unsafe loop is selected at runtime. This is an effective optimization, but 5. DISCUSSION entails a significant code size increase. It is expected that the safe loop is rarely or never executed in most of the cases, and we could Overall, this study shows the advantages of the RBC approach in have a significant code size reduction if we can integrate this op- both performance and compilation overhead over the traditional portunity into the RBC strategy. We could even use on-the-fly safe FBC approach. It shows the potential for significantly reducing loop generation when the test in the entry guard code fails. 
the compilation overhead, measured in time, work memory, and code size, and for improving performance. In a dynamic compila- Method splitting, or procedure splitting [22] is a technique that tion environment, we have to be very careful in performing any can complement our RBC strategy. This is to place relatively in- optimizations that have significant impact on compilation over- frequent code apart from common code, typically in a separate head, so RBC is a promising strategy for dynamic compilers. page, in order to improve instruction cache locality. Our region se- lection process does not identify these relatively infrequent but As expected, RBC-offline shows better performance with smaller still executed portions of the code, since over-aggressive region compilation overhead over RBC-full for several benchmarks, but selection will lead to too many recompilations and can degrade overall our current online RBC strategy seems to compete well performance. In other word, the selected region still contains rela- against RBC using the offline profiles. The online profiling com- tively infrequent code. We can increase the code locality even bined with static heuristics currently provides useful information more by integrating the method splitting technique into our to identify rarely or never executed blocks of code, as compared to framework. the offline-collected profile results. Currently we disable exception directed optimization (EDO) [20]. 6. RELATED WORK This is a technique to monitor frequently raised exception paths As described earlier, Hank et al. [14] first described the problems and to optimize them by inlining and converting exception throw- of the conventional function-based compilation strategy and dem- ing instructions to simple jump instructions into their correspond- onstrated the potential of the region-based compilation technique ing handlers. This is complementary to our RBC approach, since it by showing several experimental results, including static code size effectively eliminates frequently excepting instructions from the savings. The proposed region formation was to perform an aggres- current control flow, before treating them as rare paths in our sive (possibly an over-aggressive) inlining pass first, followed by a static region identification heuristics, and we can ensure the re- partitioning phase that created new regions based on heuristics us- maining exception throwing instructions are truly rare. As shown ing offline profile results. Way et al. [26] improved this region in Section 4.2, most of the recompilation and OSR occurring in formation algorithm to make it scalable by combining region se- our current implementation is due to region exits from exception lection and the inlining process, and reduced the compilation time paths, so this optimization is expected to decrease the probability memory requirements considerably. They also evaluated region- of region exits without reducing the effectiveness of RBC ap- based partial inlining that was performed through partial cloning, proach. and observed small performance improvements [27]. All this work As described in Section 2.2, it may be useful to support several was done in an ILP static compiler environment, so it required two options for OSR and employ them depending on the characteris- additional steps, encapsulation and reintegration, to make regions tics of each region exit point—how the rare paths are eliminated. 
look like ordinary functions for optimizations and then to reinte- For example, the current strategy of recompilation with the same grate them into the containing function. optimization level works fine for the exit points of a devirtualized The SELF-91 system [7] uses a technique called deferred compila- call site backup path, because once the control escapes from one tion for uncommon branches where a skewed execution frequency of those exit points due to a dynamic class loading, it will most distribution can be expected. They use type information to defer likely escape from this exit point in subsequent executions. On the compilation for messages sent to receiver classes that are pre- other hand, when the region exit occurs from an exception throw- sumed to be rare. When the rare path is actually executed, the ing path, it may be sufficient to fall back to the interpreter, since compiler generates code for the uncommon branch extension, exception paths (especially after EDO) need not be optimized, which is a of the original compiled code from the considering the inherently high overhead of runtime exception point of failure to the end of the method. The extension is unopti- handling. We can still collect the dynamic counts of OSRs to iden- mized to avoid recursive uncommon branches, and reuses the tify from which region exit points the control is frequently escap- stack frame created for the original common-case version. It was ing, and thereby drive recompilation, rather than waiting for the demonstrated that the technique increases compilation speed sig- promotion to be performed by the underlying recompilation sys- nificantly, by nearly an order of magnitude, but there were both tem. This mechanism may allow more aggressive rare path elimi- performance and code size problems when the compiler’s uncom- nation than we currently use. mon code predictions were wrong. The SELF-93 system [15] This may not be a problem for these systems, since the statically fixed these problems by treating the occurrence of uncommon optimized generic code is already there and a specialized reopti- cases as another form of runtime feedback and replacing overly mized version is being generated. We need to be more conserva- specialized code with less specialized code rather than just extend- tive not to frequently exit from selected regions, and thus employ ing the specialized code with an unoptimized extension code. This more general regions to contain arbitrary numbers of hot traces by approach was made possible by introducing both an OSR tech- removing rarely executed code blocks. nique and an adaptive recompilation system. The OSR allowed the Trace scheduling [19] is a technique that predicts the outcome of dynamic deoptimization of the target code and the replacement of conditional branches and then optimizes the code assuming the the stack frame containing the uncommon trap with several unop- prediction is correct, but it can suffer from the complexity in- timized frames. When it was found that this unoptimized compiled volved in the compensation code generation. Superblock schedul- code was executed frequently, the recompilation system could op- ing [16] simplifies the complexity by using tail duplication to cre- timize the method again based on the feedback from the unopti- ate superblocks, single-entry multiple-exit regions. Both of these mized code. 
6. RELATED WORK
As described earlier, Hank et al. [14] first described the problems of the conventional function-based compilation strategy and demonstrated the potential of the region-based compilation technique by showing several experimental results, including static code size savings. Their proposed region formation was to perform an aggressive (possibly over-aggressive) inlining pass first, followed by a partitioning phase that created new regions based on heuristics using offline profile results. Way et al. [26] improved this region formation algorithm to make it scalable by combining region selection and the inlining process, and reduced the compilation time and memory requirements considerably. They also evaluated region-based partial inlining performed through partial cloning, and observed small performance improvements [27]. All this work was done in an ILP static compiler environment, so it required two additional steps, encapsulation and reintegration, to make regions look like ordinary functions for optimization and then to reintegrate them into the containing function.

The SELF-91 system [7] uses a technique called deferred compilation for uncommon branches where a skewed execution frequency distribution can be expected. It uses type information to defer compilation for messages sent to receiver classes that are presumed to be rare. When a rare path is actually executed, the compiler generates code for the uncommon branch extension, which is a continuation of the original compiled code from the point of failure to the end of the method. The extension is unoptimized to avoid recursive uncommon branches, and it reuses the stack frame created for the original common-case version. It was demonstrated that the technique increases compilation speed significantly, by nearly an order of magnitude, but there were both performance and code size problems when the compiler's uncommon code predictions were wrong. The SELF-93 system [15] fixed these problems by treating the occurrence of uncommon cases as another form of runtime feedback, and by replacing overly specialized code with less specialized code rather than just extending the specialized code with an unoptimized extension. This approach was made possible by introducing both an OSR technique and an adaptive recompilation system. The OSR allowed dynamic deoptimization of the target code and replacement of the stack frame containing the uncommon trap with several unoptimized frames. When this unoptimized compiled code was found to be executed frequently, the recompilation system could optimize the method again based on the feedback from the unoptimized code.

The HotSpot server compiler [21] is a JVM product implementing an adaptive optimization system with an interpreter to allow a mixed execution environment. As in the SELF-93 system, it also employs the uncommon trap mechanism to avoid generating code for uncommon cases. The system always falls back to the interpreter at a safe point, after converting the stack frame, when an uncommon path is actually taken. Both the SELF-93 and HotSpot systems have some similarities to ours, but the important differences are that our technique deals with a broader set of rare regions, using both static heuristics and profiling results rather than focusing only on uncommon virtual method targets and references to uninitialized classes, and that we perform partial inlining so that the region selection process affects inlining decisions.

Whaley [29] described a technique for performing partial method compilation using basic-block-level offline profile information. The technique allows most optimizations to completely ignore rare paths and to fully optimize the common cases. This system also assumes falling back to an interpreter when a rare path is taken. Since the technique was not fully implemented in a working system, he estimated its effectiveness by collecting basic-block-level profiles offline and then using this information to refactor the affected classes with a Bytecode Engineering Library tool. The interpreter transfer points at rare block entries are replaced with calls to synthetic methods that contain all the code separated from the transfer point, so that the compilation of those rarely executed blocks can be avoided. The result is thus an ideal case, since the compiler need not hold any information to restore the interpreter state.
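To make this refactoring concrete, the following is a hedged, source-level sketch of the kind of rewriting performed by the tool in [29]; the names are hypothetical, and the actual tool rewrites bytecode rather than source:

    class Account {
        private long balance;

        // Original method: the overdraft path is rarely executed.
        void withdraw(long amount) {
            if (amount > balance) {
                // rare block: everything from here to the end of the
                // path is moved into a synthetic method below
                System.err.println("overdraft by " + (amount - balance));
                balance = 0;
                return;
            }
            balance -= amount;
        }

        // Refactored method: the rare block entry is replaced with a
        // call to a synthetic method, so the compiler never needs to
        // compile, or hold recovery information for, the rare code.
        void withdrawRefactored(long amount) {
            if (amount > balance) {
                withdraw$rare0(amount);   // synthetic method call
                return;
            }
            balance -= amount;
        }

        private void withdraw$rare0(long amount) {
            System.err.println("overdraft by " + (amount - balance));
            balance = 0;
        }
    }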
Fink and Qian [13] described a new, relatively compiler-independent mechanism for implementing OSR, and applied this technique to integrate the deferred compilation strategy into the Jikes RVM adaptive optimization system. Since they do not yet implement optimizations that take advantage of the deferred compilation, the performance improvement is small, but the compilation time and code size show modest improvements.

There are several binary translation systems for profile-based native-to-native reoptimization, such as Dynamo [3], its descendant DynamoRIO [6], HCO [11], and Mojo [9]. These systems identify frequently executed paths (traces) and optimize them by exploiting code layout and other runtime optimization opportunities. Since a trace is a single-entry, multiple-exit contiguous sequence that can extend across static program boundaries, it has arbitrary sub-method and cross-method granularity as the unit of optimization, similar in effect to our partial inlining. However, these systems operate on only a single trace at a time, and the optimization can be less effective when the selected trace is hot but not a dominant one. This may not be a problem for these systems, since the statically optimized generic code is already there and a specialized reoptimized version is being generated. We need to be more conservative, so as not to exit frequently from selected regions, and thus we employ more general regions that can contain arbitrary numbers of hot traces, formed by removing rarely executed code blocks.

Trace scheduling [19] is a technique that predicts the outcomes of conditional branches and then optimizes the code assuming the predictions are correct, but it can suffer from the complexity involved in generating compensation code. Superblock scheduling [16] reduces this complexity by using tail duplication to create superblocks, which are single-entry, multiple-exit regions. Both of these techniques extend the scope of ILP scheduling beyond basic block boundaries to encompass larger units. Other classic optimizations have also been extended to exploit superblocks [8].

Bruening and Duesterwald [5] explored the issues of finding optimal compilation unit shapes for an embedded Java JIT compiler, and demonstrated that method boundaries are a poor choice for compilation units. They did not implement a working JIT compiler for evaluation, and only provided estimates of the code size benefits of trace-based and loop-based strategies for selecting compilation units. They found that a majority of the code in methods is rarely or never executed, and concluded that code size can be reduced drastically with only a negligible change in performance.

Ball and Larus [4] proposed a heuristic approach for static branch prediction based on the data types and the kind of comparison used in a branch and on the code in the target basic blocks. We also use program-based static heuristics for the region selection, and we propagate the rare/non-rare information backward along the data flow to identify rarely or never executed regions.
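As a hedged illustration of this style of analysis (the data structures are hypothetical and simplified, not our actual implementation), the following sketch seeds rare blocks from a throw-based heuristic and then propagates the rare flag backward to any predecessor all of whose successors are rare:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    class RarePathMarker {
        static class Block {
            final List<Block> preds = new ArrayList<Block>();
            final List<Block> succs = new ArrayList<Block>();
            boolean endsInThrow;  // seed heuristic: a throwing block is rare
            boolean rare;
        }

        // Mark seed blocks as rare, then propagate backward: a block
        // becomes rare when every one of its successors is rare.
        static void markRare(List<Block> blocks) {
            Deque<Block> worklist = new ArrayDeque<Block>();
            for (Block b : blocks) {
                if (b.endsInThrow) {
                    b.rare = true;
                    worklist.add(b);
                }
            }
            while (!worklist.isEmpty()) {
                Block b = worklist.poll();
                for (Block p : b.preds) {
                    if (p.rare || p.succs.isEmpty()) continue;
                    boolean allRare = true;
                    for (Block s : p.succs) {
                        if (!s.rare) { allRare = false; break; }
                    }
                    if (allRare) {  // all paths from p lead only to rare code
                        p.rare = true;
                        worklist.add(p);
                    }
                }
            }
        }
    }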
7. CONCLUSIONS
We have described the design and implementation of a region-based compilation technique in our dynamic compilation system for Java. We presented our design decisions for handling the region exit points, and for the intra-procedural region selection and its integration into the inlining process. We implemented this RBC framework and evaluated it using the industry-standard benchmarks. The experimental results show the potential to achieve better performance and reduced compilation overhead in comparison to the traditional FBC approach. In the future, we plan to further study the effectiveness of region-based compilation techniques, especially by exploring other optimization opportunities that can take advantage of selected regions and by providing several options for region exit handling.

8. ACKNOWLEDGEMENTS
We would like to thank all the members of the Network Computing Platform group at the IBM Tokyo Research Laboratory for helpful discussions and comments. In particular, we thank Motohiro Kawahito for implementing the partial dead code elimination routine. The anonymous reviewers also provided many valuable suggestions and comments that improved the presentation of the paper.

REFERENCES
[1] M. Arnold, S. Fink, D. Grove, M. Hind, and P.F. Sweeney. Adaptive Optimization in the Jalapeño JVM. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 47-65, Oct. 2000.
[2] M. Arnold, M. Hind, and B.G. Ryder. Online Feedback-Directed Optimization of Java. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 111-129, Nov. 2002.
[3] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A Transparent Dynamic Optimization System. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 1-12, Jun. 2000.
[4] T. Ball and J.R. Larus. Branch Prediction For Free. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 300-313, Jun. 1993.
[5] D. Bruening and E. Duesterwald. Exploring Optimal Compilation Unit Shapes for an Embedded Just-In-Time Compiler. In Proceedings of the ACM SIGPLAN Workshop on Feedback-Directed and Dynamic Optimization, 2000.
[6] D. Bruening, T. Garnett, and S. Amarasinghe. An Infrastructure for Adaptive Dynamic Optimization. In Proceedings of the ACM SIGPLAN Conference on Code Generation and Optimization, pp. 265-275, Mar. 2003.
[7] C. Chambers and D. Ungar. Making Pure Object-Oriented Languages Practical. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 1-15, Oct. 1991.
[8] P.P. Chang, S.A. Mahlke, and W.M. Hwu. Using Profile Information to Assist Classic Code Optimizations. Software Practice and Experience, 21(12), pp. 1301-1321, Dec. 1991.
[9] W.K. Chen, S. Lerner, R. Chaiken, and D.M. Gillies. Mojo: A Dynamic Optimization System. In Proceedings of the ACM SIGPLAN Workshop on Feedback-Directed and Dynamic Optimization, 2000.
[10] M. Cierniak, G.Y. Lueh, and J.M. Stichnoth. Practicing JUDO: Java Under Dynamic Optimizations. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 13-26, Jun. 2000.
[11] R. Cohn and P.G. Lowney. Hot Cold Optimization of Large Windows/NT Applications. In Proceedings of the 29th International Conference on Microarchitecture, MICRO-29, pp. 80-89, Dec. 1996.
[12] D. Detlefs and O. Agesen. Inlining of Virtual Methods. In the 13th European Conference on Object-Oriented Programming, ECOOP, LNCS 1628, pp. 258-277, 1999.
[13] S. Fink and F. Qian. Design, Implementation and Evaluation of Adaptive Recompilation with On-Stack Replacement. In Proceedings of the ACM SIGPLAN Conference on Code Generation and Optimization, pp. 241-252, Mar. 2003.
[14] R.E. Hank, W. Hwu, and B.R. Rau. Region-Based Compilation: An Introduction and Motivation. In Proceedings of the 28th International Conference on Microarchitecture, MICRO-28, pp. 158-168, Dec. 1995.
[15] U. Hölzle. Adaptive Optimization for SELF: Reconciling High Performance with Exploratory Programming. Ph.D. Thesis, Stanford University, CS-TR-94-1520, Aug. 1994.
[16] W.M. Hwu, S.A. Mahlke, W.Y. Chen, P.P. Chang, N.J. Warter, R.A. Bringmann, R.G. Ouellette, R.E. Hank, T. Kiyohara, G.E. Haab, J.G. Holm, and D.M. Lavery. The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7(1-2), pp. 229-248, May 1993.
[17] K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani. A Study of Devirtualization Techniques for a Java Just-In-Time Compiler. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 294-310, Oct. 2000.
[18] J. Knoop, O. Rüthing, and B. Steffen. Partial Dead Code Elimination. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 147-158, Jun. 1994.
[19] P.G. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein, R.P. Nix, J.S. O'Donnell, and J.C. Ruttenberg. The Multiflow Trace Scheduling Compiler. The Journal of Supercomputing, 7(1-2), pp. 51-142, Jan. 1993.
[20] T. Ogasawara, H. Komatsu, and T. Nakatani. A Study of Exception Handling and Its Dynamic Optimization in Java. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 83-95, Oct. 2001.
[21] M. Paleczny, C. Vick, and C. Click. The Java HotSpot Server Compiler. In Proceedings of the Java Virtual Machine Research and Technology Symposium (JVM '01), pp. 1-12, Apr. 2001.
[22] K. Pettis and R.C. Hansen. Profile Guided Code Positioning. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 16-27, Jun. 1990.
[23] Standard Performance Evaluation Corporation. SPECjvm98, available at http://www.spec.org/osg/jvm98, and SPECjbb2000, available at http://www.spec.org/osg/jbb2000.
[24] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani. A Dynamic Optimization Framework for a Java Just-In-Time Compiler. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 180-194, Oct. 2001.
[25] T. Suganuma, T. Yasue, and T. Nakatani. An Empirical Study of Method Inlining for a Java Just-In-Time Compiler. In Proceedings of the Java Virtual Machine Research and Technology Symposium (JVM '02), pp. 91-104, Aug. 2002.
[26] T. Way, B. Breech, and L. Pollock. Region Formation Analysis with Demand-driven Inlining for Region-based Optimization. In Proceedings of the Conference on Parallel Architectures and Compilation Techniques, pp. 24-36, Oct. 2000.
[27] T. Way and L. Pollock. Evaluation of a Region-based Partial Inlining Algorithm for an ILP Optimizing Compiler. In Proceedings of the Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2002), pp. 552-556, Jun. 2002.
[28] J. Whaley and M. Rinard. Compositional Pointer and Escape Analysis for Java Programs. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 187-206, Nov. 1999.
[29] J. Whaley. Partial Method Compilation using Dynamic Profile Information. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications, pp. 166-179, Oct. 2001.