Lecture 17 – Dynamic Code Optimization

Phillip B. Gibbons
Carnegie Mellon, 15-745

I. Motivation & Background
II. Overview
III. Partial Method Compilation
IV. Partial Dead Code Elimination
V. Partial Escape Analysis
VI. Results

“Partial Method Compilation Using Dynamic Profile Information”, John Whaley, OOPSLA 01


I. Beyond Static Compilation

1) Profile-based compiler: high-level → binary, static
   – Uses (dynamic = runtime) information collected in profiling passes

2) Interpreter: high-level, emulate, dynamic

3) Dynamic compilation / code optimization: high-level → binary, dynamic
   – interpreter/compiler hybrid
   – supports cross-module optimization
   – can specialize the program using runtime information
     • without separate profiling passes

1) Dynamic Profiling Can Improve Compile-time Optimizations

• Understanding common dynamic behaviors may help guide optimizations
  – e.g., control flow, data dependences, input values

  void foo(int A, int B) {     // What are typical values of A, B?
    …
    while (…) {
      if (A > B)               // How often is this condition true?
        *p = 0;                // How often does *p == val[i]?
      C = val[i] + D;
      E += C – B;              // Is this loop invariant?
      …
    }
  }

• Collecting control-flow profiles is relatively inexpensive
  – profiling data dependences, data values, etc., is more costly


Profile-Based Compile-time Optimization

1. Compile statically
2. Collect a profile (using typical inputs)
3. Re-compile, using the profile

  prog1.c → compiler → runme.exe (instrumented)
  runme.exe (instrumented) + input1 → execution → execution profile
  prog1.c + execution profile → compiler → runme_v2.exe
  (a hand-written analogue of such profiling instrumentation is sketched below)

• Profile-based compile-time optimizations
  – e.g., speculative scheduling, cache optimizations, code specialization
• Limitations of this approach?

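To make the “collect a profile” step concrete, here is a minimal, hypothetical Java sketch (not from the slides) of what compiler-inserted branch profiling would look like if written by hand: an interesting branch increments a counter, and the counts are dumped after the run to guide recompilation. The class, counters, and loop body are illustrative only, a Java analogue of the foo(A, B) example above.

  // Hypothetical hand-written analogue of compiler-inserted profiling
  // (illustrative only; a real compiler would insert the counters itself).
  public class EdgeProfileExample {
      static long takenCount = 0;      // how often a > b is true
      static long notTakenCount = 0;   // how often a > b is false

      static int foo(int a, int b, int[] val) {
          int c = 0, e = 0;
          for (int i = 0; i < val.length; i++) {
              if (a > b) {             // instrumented branch
                  takenCount++;
                  val[i] = 0;
              } else {
                  notTakenCount++;
              }
              c = val[i] + 1;
              e += c - b;
          }
          return e;
      }

      public static void main(String[] args) {
          foo(3, 1, new int[] {5, 7, 9});
          foo(0, 1, new int[] {2, 4});
          // The "profile": feed these counts back into the next compile.
          System.out.println("taken=" + takenCount + " notTaken=" + notTakenCount);
      }
  }

If the profile shows the branch is almost always taken, the recompilation step could, for example, lay out the taken path contiguously or specialize the loop body for that case.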

Instrumenting Executable Binaries

1. Compile statically
2. Collect a profile (using typical inputs)

  prog1.c → compiler → runme.exe
  runme.exe → instrumentation tool → runme.exe (instrumented)
  runme.exe (instrumented) + input1 → execution → execution profile

How to perform the instrumentation?
1. The compiler could insert it directly
2. A binary instrumentation tool could modify the executable directly
   – that way, we don’t need to modify the compiler
   – compilers that target the same architecture (e.g., x86) can use the same tool


Binary Instrumentation/Optimization Tools

• Unlike typical compilation, the input is a binary (not source code)
• One option: static binary-to-binary rewriting

  runme.exe → tool → runme_modified.exe

• Challenges (with the static approach):
  – what about dynamically-linked shared libraries?
  – if our goal is optimization, are we likely to make the code faster?
    • a compiler already tried its best, and it had source code (we don’t)
  – if we are adding instrumentation code, what about time/space overheads?
    • instrumented code might be slow & bloated if we aren’t careful
    • optimization may be needed just to keep these overheads under control
• Bottom line: the purely static approach to binary rewriting is rarely used


2) (Pure) Interpreter

• One approach to dynamic code execution/analysis is an interpreter
  – basic idea: a software loop that grabs, decodes, and emulates each instruction

  while (stillExecuting) {
    inst = readInst(PC);
    instInfo = decodeInst(inst);
    switch (instInfo.opType) {
      case binaryArithmetic: …
      case memoryLoad: …
      …
    }
    PC = nextPC(PC, instInfo);
  }

• Advantages:
  – also works for dynamic programming languages (e.g., Java)
  – easy to change the way we execute code on-the-fly (SW controls everything)
• Disadvantages:
  – runtime overhead!
    • each dynamic instruction is emulated individually by software


A Sweet Spot?

• Is there a way that we can combine:
  – the flexibility of an interpreter (analyzing and changing code dynamically); and
  – the performance of direct hardware execution?
• Key insights:
  – increase the granularity of interpretation
    • instructions → chunks of code (e.g., procedures, basic blocks)
  – dynamically compile these chunks into directly-executed optimized code
    • store these compiled chunks in a software code cache
    • jump in and out of these cached chunks when appropriate
    • these cached code chunks can be updated!
  – invest more time optimizing code chunks that are clearly hot/important
    • easy to instrument the code, since we are already rewriting it
    • must balance (dynamic) compilation time with likely benefits


3) Dynamic Compiler

  while (stillExecuting) {
    if (!codeCompiledAlready(PC)) {
      compileChunkAndInsertInCache(PC);
    }
    jumpIntoCodeCache(PC);
    // compiled chunk returns here when finished
    PC = getNextPC(…);
  }

• This general approach is widely used:
  – Java virtual machines
  – dynamic binary instrumentation tools (Valgrind, Pin, DynamoRIO)
  – hardware virtualization


Components in a Typical Just-In-Time (JIT) Compiler

  Components: Input Program; “Interpreter” / Control Loop; Compiled Code Cache (Chunks);
  Cache Manager (Eviction Policy); Profiling Information; Dynamic Code Optimizer

• Cached chunks of compiled code run at hardware speed
  – control returns to the “interpreter” loop when a chunk is finished
• The dynamic optimizer uses profiling information to guide code optimization
  – as code becomes hotter, more aggressive optimization is justified
    → replace the old compiled code chunk with a faster version
  (a toy Java model of this loop, cache, and hotness policy follows below)

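A minimal sketch, in Java, of how the control loop, code cache, and profiling information might fit together. All names (CodeCache contents, CompiledChunk, the re-optimization threshold) are assumptions for illustration, not the internals of any real JIT.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative-only model of a JIT's control loop and code cache.
  public class TinyJitModel {
      interface CompiledChunk { int run(int pc); }        // returns the next pc

      static final Map<Integer, CompiledChunk> codeCache = new HashMap<>();
      static final Map<Integer, Integer> execCounts = new HashMap<>();
      static final int REOPT_THRESHOLD = 1000;            // assumed policy knob

      static CompiledChunk compile(int pc, boolean optimize) {
          // Stand-in for real compilation (the 'optimize' flag is ignored in this toy):
          // a chunk here just advances the pc.
          return p -> p + 1;
      }

      static int executeChunk(int pc) {
          int count = execCounts.merge(pc, 1, Integer::sum);   // profiling information
          CompiledChunk chunk = codeCache.get(pc);
          if (chunk == null) {
              codeCache.put(pc, compile(pc, false));           // quick compile first
          } else if (count == REOPT_THRESHOLD) {
              codeCache.put(pc, compile(pc, true));            // replace with a faster version
          }
          return codeCache.get(pc).run(pc);                    // "jump into" cached code
      }

      public static void main(String[] args) {
          int pc = 0;
          while (pc < 5) {                                     // stand-in for stillExecuting
              pc = executeChunk(pc);                           // control returns here per chunk
          }
          System.out.println("finished at pc=" + pc);
      }
  }

A real system would also include the cache manager's eviction policy, which this sketch omits.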

II. Overview of Dynamic Compilation / Code Optimization

• Interpretation/compilation/optimization policy decisions
  – Choosing what and how to compile, and how much to optimize
• Collecting runtime information
  – Instrumentation
  – Sampling
• Optimizations exploiting runtime information
  – Focus on frequently-executed code paths


Dynamic Compilation Policy

• ΔT_total = T_compile − (n_executions × T_improvement)
  – If ΔT_total is negative, our compilation policy decision was effective
    (see the worked break-even example below)
• We can try to:
  – Reduce T_compile (faster compile times)
  – Increase T_improvement (generate better code, but at the cost of increasing T_compile)
  – Focus on code with large n_executions (compile/optimize hot spots)
• 80/20 rule (Pareto Principle): 20% of the work for 80% of the advantage

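As a worked example of the policy equation above (with made-up numbers): if compiling a method costs T_compile = 10 ms and the compiled code saves T_improvement = 5 µs per execution, ΔT_total only becomes negative after more than 10 ms / 5 µs = 2,000 executions. A tiny Java helper for that break-even arithmetic, purely illustrative:

  public class BreakEven {
      // Smallest n_executions for which deltaT_total = T_compile - n * T_improvement < 0.
      static long breakEvenExecutions(double compileCostMicros, double savingsPerExecMicros) {
          return (long) Math.floor(compileCostMicros / savingsPerExecMicros) + 1;
      }

      public static void main(String[] args) {
          // 10 ms compile cost, 5 us saved per execution -> worthwhile only beyond 2000 runs.
          System.out.println(breakEvenExecutions(10_000.0, 5.0));   // prints 2001
      }
  }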

Latency vs. Throughput

• Tradeoff: startup speed vs. execution performance

                          Startup speed    Execution performance
  Interpreter             Best             Poor
  ‘Quick’ compiler        Fair             Fair
  Optimizing compiler     Poor             Best


Multi-Stage Dynamic Compilation System

  Stage 1: interpreted code
     ↓  when execution count = t1 (e.g. 2000)
  Stage 2: compiled code
     ↓  when execution count = t2 (e.g. 25,000)
  Stage 3: fully optimized code

• Execution count is the sum of method invocations & back edges executed
  (a counter-based sketch of this promotion policy follows below)

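A minimal sketch, assuming a simple counter-based policy like the one pictured: each method carries a counter bumped on every invocation and back edge, and crossing the t1/t2 thresholds promotes it to the next stage. The class and field names are illustrative only; no actual compilation happens here.

  // Illustrative tier-promotion logic only.
  public class TieredCounters {
      enum Stage { INTERPRETED, COMPILED, FULLY_OPTIMIZED }

      static final int T1 = 2_000;     // promote to Stage 2 (compiled)
      static final int T2 = 25_000;    // promote to Stage 3 (fully optimized)

      static class MethodState {
          long executionCount = 0;     // method invocations + back edges executed
          Stage stage = Stage.INTERPRETED;

          // Called on every method invocation and every loop back edge.
          void tick() {
              executionCount++;
              if (stage == Stage.INTERPRETED && executionCount >= T1) {
                  stage = Stage.COMPILED;            // hand off to the quick compiler
              } else if (stage == Stage.COMPILED && executionCount >= T2) {
                  stage = Stage.FULLY_OPTIMIZED;     // hand off to the optimizing compiler
              }
          }
      }

      public static void main(String[] args) {
          MethodState readDb = new MethodState();
          for (int i = 0; i < 30_000; i++) {
              readDb.tick();
              if (i + 1 == T1 || i + 1 == T2) {
                  System.out.println("after " + (i + 1) + " ticks: " + readDb.stage);
              }
          }
      }
  }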

Granularity of Compilation: Per Method?

• Methods can be large, especially after inlining
  – Cutting/avoiding inlining too much hurts performance considerably
• Compilation time is proportional to the amount of code being compiled
  – Moreover, many optimizations are not linear
• Even “hot” methods typically contain some code that is rarely/never executed


Example: SpecJVM98 db

  void read_db(String fn) {
    int n = 0, act = 0;
    int b;
    byte buffer[] = null;
    try {
      FileInputStream sif = new FileInputStream(fn);
      n = sif.getContentLength();
      buffer = new byte[n];
      while ((b = sif.read(buffer, act, n-act)) > 0) {   // Hot loop
        act = act + b;
      }
      sif.close();
      if (act != n) {
        /* lots of error handling code, rare */
      }
    } catch (IOException ioe) {
      /* lots of error handling code, rare */
    }
  }

Example: SpecJVM98 db (continued)

• The same read_db method, revisited: the hot read loop is the only frequently
  executed code, while the “if (act != n)” block and the catch block are lots of
  rare error-handling code


Optimize hot “regions”, not methods

  Stage 1: interpreted code → Stage 2: compiled code → Stage 3: fully optimized code

• Optimize only the most frequently executed segments within a method
  – Simple technique:
    • Track execution counts of basic blocks in Stages 1 & 2
      (a per-block coverage sketch follows below)
    • Any basic block executing in Stage 2 is considered to be not rare
• Beneficial secondary effect of improving optimization opportunities on the common paths
• No need to profile any basic block executing in Stage 3
  – Already fully optimized
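A minimal sketch, assuming the policy above that any basic block observed to execute during Stages 1 & 2 is “not rare”: a per-method bit set records which blocks have executed, and the remaining blocks are treated as rare when the fully optimizing compile is triggered. The data structure is an illustration, not the paper's actual implementation.

  import java.util.BitSet;

  // Illustrative per-method basic-block coverage used to classify rare blocks.
  public class BlockCoverage {
      private final BitSet executed;   // bit i set => basic block i has executed

      BlockCoverage(int numBlocks) {
          this.executed = new BitSet(numBlocks);
      }

      // Called (conceptually) at each basic-block entry while in Stages 1 & 2.
      void recordExecution(int blockId) {
          executed.set(blockId);
      }

      // At Stage 3 compile time: blocks never seen so far are treated as rare.
      boolean isRare(int blockId) {
          return !executed.get(blockId);
      }

      public static void main(String[] args) {
          BlockCoverage readDb = new BlockCoverage(6);
          // Suppose only blocks 0, 1, and 2 (the hot read loop) ever ran so far.
          readDb.recordExecution(0);
          readDb.recordExecution(1);
          readDb.recordExecution(2);
          for (int b = 0; b < 6; b++) {
              System.out.println("block " + b + (readDb.isRare(b) ? ": rare" : ": not rare"));
          }
      }
  }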

[Two charts: “% of Basic Blocks in Methods that are Executed > Threshold Times
(hence would get compiled under a per-method strategy)” and “% of Basic Blocks that
are Executed > Threshold Times (hence get compiled under a per-basic-block strategy)”.
Each plots the percentage of basic blocks against an execution threshold of 1–5000
for Linpack, JavaCUP, JavaLex, SwingSet, check, compress, jess, db, javac, mpegaud,
mtrt, and jack.]

Dynamic Code Transformations

• Compiling partial methods
• Partial dead code elimination
• Partial escape analysis


III. Partial Method Compilation

1. Based on profile data, determine the set of rare blocks
   – Use code coverage information from the first compiled version

• Goal: the program runs correctly with the white (non-rare) blocks compiled and the
  blue (rare) blocks interpreted
• What are the challenges?
  – How to transition from white to blue
  – How to transition from blue to white
  – How to compile/optimize while ignoring blue


Partial Method Compilation (continued)

2. Perform live variable analysis
   – Determine the set of live variables at each rare block entry point
     (e.g., live: x, y, z)

3. Redirect the control flow edges that targeted rare blocks, and remove the rare blocks
   – The redirected edges now branch to the interpreter
   – Once we branch to the interpreter, we never come back to the compiled code
     (no blue-to-white transitions)
   (a schematic rendering of such a transfer follows below)

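To illustrate steps 2–3, here is a schematic Java rendering (purely illustrative; real systems do this at the machine-code level, not with Java calls) of what a redirected edge amounts to: when compiled code reaches a rare-block entry, it packages the live variables and transfers to the interpreter instead of falling into compiled rare code. The interface, the bytecode offset 42, and the variable names are assumptions.

  // Schematic only: a real JIT emits a branch to a runtime stub, not a Java call.
  public class TransferExample {
      interface Interpreter {
          // Resumes execution at 'bytecodeOffset' given the current live values.
          int resume(int bytecodeOffset, java.util.Map<String, Object> liveValues);
      }

      static int compiledBody(int x, int y, int z, boolean rarePathTaken, Interpreter interp) {
          int result = x + y;                       // hot, compiled code
          if (rarePathTaken) {
              // The edge that used to target the rare block now exits to the interpreter.
              java.util.Map<String, Object> live = new java.util.HashMap<>();
              live.put("x", x);
              live.put("y", y);
              live.put("z", z);
              return interp.resume(/* hypothetical offset of the rare block */ 42, live);
              // Control never returns to this compiled version
              // (no "blue-to-white" transition).
          }
          return result * z;                        // common case stays compiled
      }

      public static void main(String[] args) {
          Interpreter interp = (off, live) -> {
              System.out.println("interpreting from offset " + off + " with " + live);
              return -1;
          };
          System.out.println(compiledBody(1, 2, 3, false, interp));  // fast path
          System.out.println(compiledBody(1, 2, 3, true, interp));   // rare path
      }
  }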

Partial Method Compilation (continued)

4. Perform compilation normally
   – Analyses treat the interpreter transfer point as an unanalyzable method call

5. Record a map for each interpreter transfer point
   – In code generation, generate a map that specifies the location, in registers or
     memory, of each of the live variables
     (e.g., live x, y, z → x: sp − 4, y: r1, z: sp − 8; sketched below)
   – Maps are typically < 100 bytes

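A minimal sketch of the kind of per-transfer-point map step 5 describes, using Java records as a readable stand-in for the compact (< 100 byte) encoding a real code generator would emit. The type names, the encoding, and the bytecode offset 42 are assumptions for illustration; only the x/y/z location example comes from the slide.

  import java.util.List;

  // Illustrative stand-in for a compact interpreter-transfer-point map.
  public class TransferMapExample {
      enum Kind { REGISTER, STACK_SLOT }

      // Where one live variable can be found at the transfer point.
      record VarLocation(String name, Kind kind, String register, int stackOffset) {}

      // One map per interpreter transfer point in the compiled code.
      record TransferMap(int bytecodeOffset, List<VarLocation> liveVars) {}

      public static void main(String[] args) {
          // live: x, y, z  ->  x: sp - 4, y: r1, z: sp - 8 (example from the slides);
          // the offset 42 is hypothetical.
          TransferMap map = new TransferMap(42, List.of(
                  new VarLocation("x", Kind.STACK_SLOT, null, -4),
                  new VarLocation("y", Kind.REGISTER, "r1", 0),
                  new VarLocation("z", Kind.STACK_SLOT, null, -8)));
          System.out.println(map);
      }
  }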

IV. Partial Dead Code Elimination

• Move computation that is only live on a rare path into the rare block, saving the
  computation in the common case


Partial Dead Code Example

  Before:
    x = 0;
    if (rare branch 1) {
      ...
      z = x + y;
      ...
    }
    if (rare branch 2) {
      ...
      a = x + z;
      ...
    }

  After (x = 0 sunk into the rare blocks):
    if (rare branch 1) {
      x = 0;
      ...
      z = x + y;
      ...
    }
    if (rare branch 2) {
      x = 0;
      ...
      a = x + z;
      ...
    }


V. Escape Analysis

• Escape analysis finds objects that do not escape a method or a thread
  – “Captured” by a method: can be allocated on the stack or in registers
  – “Captured” by a thread: can avoid synchronization operations
• All Java objects are normally heap allocated, so this is a big win


Partial Escape Analysis

• Stack allocate objects that don’t escape in the common blocks
• Eliminate synchronization on objects that don’t escape the common blocks
• If a branch to a rare block is taken:
  – Copy stack-allocated objects to the heap and update pointers
  – Reapply eliminated synchronizations
  (an illustrative code shape follows below)

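An illustrative Java fragment (not from the paper) showing the shape of code partial escape analysis targets: the temporary array escapes only on the rare path, so on the common path it could be stack allocated (and, for a synchronized object, its locking elided), with the rare branch copying it to the heap and re-applying any eliminated synchronization. The optimization itself happens inside the JIT; nothing in this fragment performs it.

  // Illustrative shape of code that benefits from partial escape analysis.
  public class PartialEscapeExample {
      static java.util.List<int[]> escaped = new java.util.ArrayList<>();

      static int sumSquare(int a, int b, boolean rareCondition) {
          int[] point = new int[] { a, b };      // does not escape on the common path:
                                                 // could live on the stack / in registers
          if (rareCondition) {
              escaped.add(point);                // rare path: 'point' escapes, so the JIT
                                                 // must materialize it on the heap here
          }
          return point[0] * point[0] + point[1] * point[1];
      }

      public static void main(String[] args) {
          System.out.println(sumSquare(3, 4, false));   // common: allocation can be elided
          System.out.println(sumSquare(3, 4, true));    // rare: object copied/kept on the heap
      }
  }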

VI. Results: Run Time Improvement

[Bar chart: run time (normalized, 0–100%) for check, compress, jess, db, javac,
mpegaud, mtrt, jack, SwingSet, linpack, JLex, and JCup.
  First bar: original (whole-method optimization)
  Second bar: Partial Method Compilation (PMC)
  Third bar: PMC + optimizations
  Bottom bar: execution time if the code were compiled/optimized from the beginning]


Summary: Beyond Static Compilation

1) Profile-based compiler: high-level → binary, static
   – Uses (dynamic = runtime) information collected in profiling passes

2) Interpreter: high-level, emulate, dynamic

3) Dynamic compilation / code optimization: high-level → binary, dynamic
   – interpreter/compiler hybrid
   – supports cross-module optimization
   – can specialize the program using runtime information
     • without separate profiling passes
     • for what’s hot on this particular run


Looking Ahead

• Friday: No class

• Monday & Wednesday: “Recent Research on Optimization”
  – Student-led discussions, in groups of 2, with 20 minutes/group
  – Read 3 papers on a topic, and lead a discussion in class
  – See the “Discussion Leads” tab of the course web page for topics, sign-up sheet,
    and instructions

• Spring Break

• Monday March 14
  – Homework #3 due
  – Meetings to discuss project proposal ideas

