Lecture 17 – Dynamic Code Optimization

Phillip B. Gibbons
Carnegie Mellon, 15-745

I. Motivation & Background
II. Overview
III. Partial Method Compilation
IV. Partial Dead Code Elimination
V. Partial Escape Analysis
VI. Results

“Partial Method Compilation Using Dynamic Profile Information”, John Whaley, OOPSLA 01


I. Beyond Static Compilation

1) Profile-based compiler: high-level → binary, static
   – Uses (dynamic = runtime) information collected in profiling passes

2) Interpreter: high-level, emulate, dynamic

3) Dynamic compilation / code optimization: high-level → binary, dynamic
   – interpreter/compiler hybrid
   – supports cross-module optimization
   – can specialize the program using runtime information
     • without separate profiling passes

1) Dynamic Profiling Can Improve Compile-time Optimizations

• Understanding common dynamic behaviors may help guide optimizations
  – e.g., control flow, data dependences, input values

  void foo(int A, int B) {     // What are typical values of A, B?
    …
    while (…) {
      if (A > B)               // How often is this condition true?
        *p = 0;                // How often does *p == val[i]?
      C = val[i] + D;
      E += C – B;              // Is this loop invariant?
      …
    }
  }

• Collecting control-flow profiles is relatively inexpensive
  – profiling data dependences, data values, etc., is more costly


Profile-Based Compile-time Optimization

1. Compile statically
2. Collect a profile (using typical inputs)
3. Re-compile, using the profile

  prog1.c → compiler → runme.exe (instrumented)
  runme.exe (instrumented) + input1 → execution → execution profile
  prog1.c + execution profile → compiler → runme_v2.exe
  (a hand-written analogue of such profiling instrumentation is sketched below)

• Profile-based compile-time optimizations
  – e.g., speculative scheduling, cache optimizations, code specialization
• Limitations of this approach?

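To make the “collect a profile” step concrete, here is a minimal, hypothetical Java sketch (not from the slides) of what compiler-inserted branch profiling would look like if written by hand: an interesting branch increments a counter, and the counts are dumped after the run to guide recompilation. The class, counters, and loop body are illustrative only, a Java analogue of the foo(A, B) example above.

  // Hypothetical hand-written analogue of compiler-inserted profiling
  // (illustrative only; a real compiler would insert the counters itself).
  public class EdgeProfileExample {
      static long takenCount = 0;      // how often a > b is true
      static long notTakenCount = 0;   // how often a > b is false

      static int foo(int a, int b, int[] val) {
          int c = 0, e = 0;
          for (int i = 0; i < val.length; i++) {
              if (a > b) {             // instrumented branch
                  takenCount++;
                  val[i] = 0;
              } else {
                  notTakenCount++;
              }
              c = val[i] + 1;
              e += c - b;
          }
          return e;
      }

      public static void main(String[] args) {
          foo(3, 1, new int[] {5, 7, 9});
          foo(0, 1, new int[] {2, 4});
          // The "profile": feed these counts back into the next compile.
          System.out.println("taken=" + takenCount + " notTaken=" + notTakenCount);
      }
  }

If the profile shows the branch is almost always taken, the recompilation step could, for example, lay out the taken path contiguously or specialize the loop body for that case.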

Instrumenting Executable Binaries

1. Compile statically
2. Collect a profile (using typical inputs)

  prog1.c → compiler → runme.exe
  runme.exe → instrumentation tool → runme.exe (instrumented)
  runme.exe (instrumented) + input1 → execution → execution profile

How to perform the instrumentation?
1. The compiler could insert it directly
2. A binary instrumentation tool could modify the executable directly
   – that way, we don’t need to modify the compiler
   – compilers that target the same architecture (e.g., x86) can use the same tool


Binary Instrumentation/Optimization Tools

• Unlike typical compilation, the input is a binary (not source code)
• One option: static binary-to-binary rewriting

  runme.exe → tool → runme_modified.exe

• Challenges (with the static approach):
  – what about dynamically-linked shared libraries?
  – if our goal is optimization, are we likely to make the code faster?
    • a compiler already tried its best, and it had source code (we don’t)
  – if we are adding instrumentation code, what about time/space overheads?
    • instrumented code might be slow & bloated if we aren’t careful
    • optimization may be needed just to keep these overheads under control
• Bottom line: the purely static approach to binary rewriting is rarely used


2) (Pure) Interpreter

• One approach to dynamic code execution/analysis is an interpreter
  – basic idea: a software loop that grabs, decodes, and emulates each instruction

  while (stillExecuting) {
    inst = readInst(PC);
    instInfo = decodeInst(inst);
    switch (instInfo.opType) {
      case binaryArithmetic: …
      case memoryLoad: …
      …
    }
    PC = nextPC(PC, instInfo);
  }

• Advantages:
  – also works for dynamic programming languages (e.g., Java)
  – easy to change the way we execute code on-the-fly (SW controls everything)
• Disadvantages:
  – runtime overhead!
    • each dynamic instruction is emulated individually by software


A Sweet Spot?

• Is there a way that we can combine:
  – the flexibility of an interpreter (analyzing and changing code dynamically); and
  – the performance of direct hardware execution?
• Key insights:
  – increase the granularity of interpretation
    • instructions → chunks of code (e.g., procedures, basic blocks)
  – dynamically compile these chunks into directly-executed optimized code
    • store these compiled chunks in a software code cache
    • jump in and out of these cached chunks when appropriate
    • these cached code chunks can be updated!
  – invest more time optimizing code chunks that are clearly hot/important
    • easy to instrument the code, since we are already rewriting it
    • must balance (dynamic) compilation time with likely benefits


3) Dynamic Compiler

  while (stillExecuting) {
    if (!codeCompiledAlready(PC)) {
      compileChunkAndInsertInCache(PC);
    }
    jumpIntoCodeCache(PC);
    // compiled chunk returns here when finished
    PC = getNextPC(…);
  }

• This general approach is widely used:
  – Java virtual machines
  – dynamic binary instrumentation tools (Valgrind, Pin, DynamoRIO)
  – hardware virtualization


Components in a Typical Just-In-Time (JIT) Compiler

  Components: Input Program; “Interpreter” / Control Loop; Compiled Code Cache (Chunks);
  Cache Manager (Eviction Policy); Profiling Information; Dynamic Code Optimizer

• Cached chunks of compiled code run at hardware speed
  – control returns to the “interpreter” loop when a chunk is finished
• The dynamic optimizer uses profiling information to guide code optimization
  – as code becomes hotter, more aggressive optimization is justified
    → replace the old compiled code chunk with a faster version
  (a toy Java model of this loop, cache, and hotness policy follows below)

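A minimal sketch, in Java, of how the control loop, code cache, and profiling information might fit together. All names (CodeCache contents, CompiledChunk, the re-optimization threshold) are assumptions for illustration, not the internals of any real JIT.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative-only model of a JIT's control loop and code cache.
  public class TinyJitModel {
      interface CompiledChunk { int run(int pc); }        // returns the next pc

      static final Map<Integer, CompiledChunk> codeCache = new HashMap<>();
      static final Map<Integer, Integer> execCounts = new HashMap<>();
      static final int REOPT_THRESHOLD = 1000;            // assumed policy knob

      static CompiledChunk compile(int pc, boolean optimize) {
          // Stand-in for real compilation (the 'optimize' flag is ignored in this toy):
          // a chunk here just advances the pc.
          return p -> p + 1;
      }

      static int executeChunk(int pc) {
          int count = execCounts.merge(pc, 1, Integer::sum);   // profiling information
          CompiledChunk chunk = codeCache.get(pc);
          if (chunk == null) {
              codeCache.put(pc, compile(pc, false));           // quick compile first
          } else if (count == REOPT_THRESHOLD) {
              codeCache.put(pc, compile(pc, true));            // replace with a faster version
          }
          return codeCache.get(pc).run(pc);                    // "jump into" cached code
      }

      public static void main(String[] args) {
          int pc = 0;
          while (pc < 5) {                                     // stand-in for stillExecuting
              pc = executeChunk(pc);                           // control returns here per chunk
          }
          System.out.println("finished at pc=" + pc);
      }
  }

A real system would also include the cache manager's eviction policy, which this sketch omits.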

II. Overview of Dynamic Compilation / Code Optimization

• Interpretation/compilation/optimization policy decisions
  – Choosing what and how to compile, and how much to optimize
• Collecting runtime information
  – Instrumentation
  – Sampling
• Optimizations exploiting runtime information
  – Focus on frequently-executed code paths


Dynamic Compilation Policy

• ΔT_total = T_compile − (n_executions × T_improvement)
  – If ΔT_total is negative, our compilation policy decision was effective
    (see the worked break-even example below)
• We can try to:
  – Reduce T_compile (faster compile times)
  – Increase T_improvement (generate better code, but at the cost of increasing T_compile)
  – Focus on code with large n_executions (compile/optimize hot spots)
• 80/20 rule (Pareto Principle): 20% of the work for 80% of the advantage

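As a worked example of the policy equation above (with made-up numbers): if compiling a method costs T_compile = 10 ms and the compiled code saves T_improvement = 5 µs per execution, ΔT_total only becomes negative after more than 10 ms / 5 µs = 2,000 executions. A tiny Java helper for that break-even arithmetic, purely illustrative:

  public class BreakEven {
      // Smallest n_executions for which deltaT_total = T_compile - n * T_improvement < 0.
      static long breakEvenExecutions(double compileCostMicros, double savingsPerExecMicros) {
          return (long) Math.floor(compileCostMicros / savingsPerExecMicros) + 1;
      }

      public static void main(String[] args) {
          // 10 ms compile cost, 5 us saved per execution -> worthwhile only beyond 2000 runs.
          System.out.println(breakEvenExecutions(10_000.0, 5.0));   // prints 2001
      }
  }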

Latency vs. Throughput

• Tradeoff: startup speed vs. execution performance

                          Startup speed    Execution performance
  Interpreter             Best             Poor
  ‘Quick’ compiler        Fair             Fair
  Optimizing compiler     Poor             Best


Multi-Stage Dynamic Compilation System

  Stage 1: interpreted code
     ↓  when execution count = t1 (e.g. 2000)
  Stage 2: compiled code
     ↓  when execution count = t2 (e.g. 25,000)
  Stage 3: fully optimized code

• Execution count is the sum of method invocations & back edges executed
  (a counter-based sketch of this promotion policy follows below)

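A minimal sketch, assuming a simple counter-based policy like the one pictured: each method carries a counter bumped on every invocation and back edge, and crossing the t1/t2 thresholds promotes it to the next stage. The class and field names are illustrative only; no actual compilation happens here.

  // Illustrative tier-promotion logic only.
  public class TieredCounters {
      enum Stage { INTERPRETED, COMPILED, FULLY_OPTIMIZED }

      static final int T1 = 2_000;     // promote to Stage 2 (compiled)
      static final int T2 = 25_000;    // promote to Stage 3 (fully optimized)

      static class MethodState {
          long executionCount = 0;     // method invocations + back edges executed
          Stage stage = Stage.INTERPRETED;

          // Called on every method invocation and every loop back edge.
          void tick() {
              executionCount++;
              if (stage == Stage.INTERPRETED && executionCount >= T1) {
                  stage = Stage.COMPILED;            // hand off to the quick compiler
              } else if (stage == Stage.COMPILED && executionCount >= T2) {
                  stage = Stage.FULLY_OPTIMIZED;     // hand off to the optimizing compiler
              }
          }
      }

      public static void main(String[] args) {
          MethodState readDb = new MethodState();
          for (int i = 0; i < 30_000; i++) {
              readDb.tick();
              if (i + 1 == T1 || i + 1 == T2) {
                  System.out.println("after " + (i + 1) + " ticks: " + readDb.stage);
              }
          }
      }
  }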

Granularity of Compilation: Per Method?

• Methods can be large, especially after inlining
  – Cutting/avoiding inlining too much hurts performance considerably
• Compilation time is proportional to the amount of code being compiled
  – Moreover, many optimizations are not linear
• Even “hot” methods typically contain some code that is rarely/never executed


Example: SpecJVM98 db

  void read_db(String fn) {
    int n = 0, act = 0;
    int b;
    byte buffer[] = null;
    try {
      FileInputStream sif = new FileInputStream(fn);
      n = sif.getContentLength();
      buffer = new byte[n];
      while ((b = sif.read(buffer, act, n-act)) > 0) {   // Hot loop
        act = act + b;
      }
      sif.close();
      if (act != n) {
        /* lots of error handling code, rare */
      }
    } catch (IOException ioe) {
      /* lots of error handling code, rare */
    }
  }

Example: SpecJVM98 db (continued)

• The same read_db method, revisited: the hot read loop is the only frequently
  executed code, while the “if (act != n)” block and the catch block are lots of
  rare error-handling code


Optimize hot “regions”, not methods

  Stage 1: interpreted code → Stage 2: compiled code → Stage 3: fully optimized code

• Optimize only the most frequently executed segments within a method
  – Simple technique:
    • Track execution counts of basic blocks in Stages 1 & 2
      (a per-block coverage sketch follows below)
    • Any basic block executing in Stage 2 is considered to be not rare
• Beneficial secondary effect of improving optimization opportunities on the common paths
• No need to profile any basic block executing in Stage 3
  – Already fully optimized
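A minimal sketch, assuming the policy above that any basic block observed to execute during Stages 1 & 2 is “not rare”: a per-method bit set records which blocks have executed, and the remaining blocks are treated as rare when the fully optimizing compile is triggered. The data structure is an illustration, not the paper's actual implementation.

  import java.util.BitSet;

  // Illustrative per-method basic-block coverage used to classify rare blocks.
  public class BlockCoverage {
      private final BitSet executed;   // bit i set => basic block i has executed

      BlockCoverage(int numBlocks) {
          this.executed = new BitSet(numBlocks);
      }

      // Called (conceptually) at each basic-block entry while in Stages 1 & 2.
      void recordExecution(int blockId) {
          executed.set(blockId);
      }

      // At Stage 3 compile time: blocks never seen so far are treated as rare.
      boolean isRare(int blockId) {
          return !executed.get(blockId);
      }

      public static void main(String[] args) {
          BlockCoverage readDb = new BlockCoverage(6);
          // Suppose only blocks 0, 1, and 2 (the hot read loop) ever ran so far.
          readDb.recordExecution(0);
          readDb.recordExecution(1);
          readDb.recordExecution(2);
          for (int b = 0; b < 6; b++) {
              System.out.println("block " + b + (readDb.isRare(b) ? ": rare" : ": not rare"));
          }
      }
  }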

[Two charts: “% of Basic Blocks in Methods that are Executed > Threshold Times
(hence would get compiled under a per-method strategy)” and “% of Basic Blocks that
are Executed > Threshold Times (hence get compiled under a per-basic-block strategy)”.
Each plots the percentage of basic blocks against an execution threshold of 1–5000
for Linpack, JavaCUP, JavaLex, SwingSet, check, compress, jess, db, javac, mpegaud,
mtrt, and jack.]

Dynamic Code Transformations

• Compiling partial methods
• Partial dead code elimination
• Partial escape analysis


III. Partial Method Compilation

1. Based on profile data, determine the set of rare blocks
   – Use code coverage information from the first compiled version

• Goal: the program runs correctly with the white (non-rare) blocks compiled and the
  blue (rare) blocks interpreted
• What are the challenges?
  – How to transition from white to blue
  – How to transition from blue to white
  – How to compile/optimize while ignoring blue


Partial Method Compilation (continued)

2. Perform live variable analysis
   – Determine the set of live variables at each rare block entry point
     (e.g., live: x, y, z)

3. Redirect the control flow edges that targeted rare blocks, and remove the rare blocks
   – The redirected edges now branch to the interpreter
   – Once we branch to the interpreter, we never come back to the compiled code
     (no blue-to-white transitions)
   (a schematic rendering of such a transfer follows below)

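To illustrate steps 2–3, here is a schematic Java rendering (purely illustrative; real systems do this at the machine-code level, not with Java calls) of what a redirected edge amounts to: when compiled code reaches a rare-block entry, it packages the live variables and transfers to the interpreter instead of falling into compiled rare code. The interface, the bytecode offset 42, and the variable names are assumptions.

  // Schematic only: a real JIT emits a branch to a runtime stub, not a Java call.
  public class TransferExample {
      interface Interpreter {
          // Resumes execution at 'bytecodeOffset' given the current live values.
          int resume(int bytecodeOffset, java.util.Map<String, Object> liveValues);
      }

      static int compiledBody(int x, int y, int z, boolean rarePathTaken, Interpreter interp) {
          int result = x + y;                       // hot, compiled code
          if (rarePathTaken) {
              // The edge that used to target the rare block now exits to the interpreter.
              java.util.Map<String, Object> live = new java.util.HashMap<>();
              live.put("x", x);
              live.put("y", y);
              live.put("z", z);
              return interp.resume(/* hypothetical offset of the rare block */ 42, live);
              // Control never returns to this compiled version
              // (no "blue-to-white" transition).
          }
          return result * z;                        // common case stays compiled
      }

      public static void main(String[] args) {
          Interpreter interp = (off, live) -> {
              System.out.println("interpreting from offset " + off + " with " + live);
              return -1;
          };
          System.out.println(compiledBody(1, 2, 3, false, interp));  // fast path
          System.out.println(compiledBody(1, 2, 3, true, interp));   // rare path
      }
  }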

Partial Method Compilation (continued)

4. Perform compilation normally
   – Analyses treat the interpreter transfer point as an unanalyzable method call

5. Record a map for each interpreter transfer point
   – In code generation, generate a map that specifies the location, in registers or
     memory, of each of the live variables
     (e.g., live x, y, z → x: sp − 4, y: r1, z: sp − 8; sketched below)
   – Maps are typically < 100 bytes

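A minimal sketch of the kind of per-transfer-point map step 5 describes, using Java records as a readable stand-in for the compact (< 100 byte) encoding a real code generator would emit. The type names, the encoding, and the bytecode offset 42 are assumptions for illustration; only the x/y/z location example comes from the slide.

  import java.util.List;

  // Illustrative stand-in for a compact interpreter-transfer-point map.
  public class TransferMapExample {
      enum Kind { REGISTER, STACK_SLOT }

      // Where one live variable can be found at the transfer point.
      record VarLocation(String name, Kind kind, String register, int stackOffset) {}

      // One map per interpreter transfer point in the compiled code.
      record TransferMap(int bytecodeOffset, List<VarLocation> liveVars) {}

      public static void main(String[] args) {
          // live: x, y, z  ->  x: sp - 4, y: r1, z: sp - 8 (example from the slides);
          // the offset 42 is hypothetical.
          TransferMap map = new TransferMap(42, List.of(
                  new VarLocation("x", Kind.STACK_SLOT, null, -4),
                  new VarLocation("y", Kind.REGISTER, "r1", 0),
                  new VarLocation("z", Kind.STACK_SLOT, null, -8)));
          System.out.println(map);
      }
  }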

IV. Partial Dead Code Elimination

• Move computation that is only live on a rare path into the rare block, saving the
  computation in the common case


Partial Dead Code Example

  Before:
    x = 0;
    if (rare branch 1) {
      ...
      z = x + y;
      ...
    }
    if (rare branch 2) {
      ...
      a = x + z;
      ...
    }

  After (x = 0 sunk into the rare blocks):
    if (rare branch 1) {
      x = 0;
      ...
      z = x + y;
      ...
    }
    if (rare branch 2) {
      x = 0;
      ...
      a = x + z;
      ...
    }


V. Escape Analysis

• Escape analysis finds objects that do not escape a method or a thread
  – “Captured” by a method: can be allocated on the stack or in registers
  – “Captured” by a thread: can avoid synchronization operations
• All Java objects are normally heap allocated, so this is a big win


Partial Escape Analysis

• Stack allocate objects that don’t escape in the common blocks
• Eliminate synchronization on objects that don’t escape the common blocks
• If a branch to a rare block is taken:
  – Copy stack-allocated objects to the heap and update pointers
  – Reapply eliminated synchronizations
  (an illustrative code shape follows below)

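An illustrative Java fragment (not from the paper) showing the shape of code partial escape analysis targets: the temporary array escapes only on the rare path, so on the common path it could be stack allocated (and, for a synchronized object, its locking elided), with the rare branch copying it to the heap and re-applying any eliminated synchronization. The optimization itself happens inside the JIT; nothing in this fragment performs it.

  // Illustrative shape of code that benefits from partial escape analysis.
  public class PartialEscapeExample {
      static java.util.List<int[]> escaped = new java.util.ArrayList<>();

      static int sumSquare(int a, int b, boolean rareCondition) {
          int[] point = new int[] { a, b };      // does not escape on the common path:
                                                 // could live on the stack / in registers
          if (rareCondition) {
              escaped.add(point);                // rare path: 'point' escapes, so the JIT
                                                 // must materialize it on the heap here
          }
          return point[0] * point[0] + point[1] * point[1];
      }

      public static void main(String[] args) {
          System.out.println(sumSquare(3, 4, false));   // common: allocation can be elided
          System.out.println(sumSquare(3, 4, true));    // rare: object copied/kept on the heap
      }
  }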

VI. Results: Run Time Improvement

[Bar chart: run time (normalized, 0–100%) for check, compress, jess, db, javac,
mpegaud, mtrt, jack, SwingSet, linpack, JLex, and JCup.
  First bar: original (whole-method optimization)
  Second bar: Partial Method Compilation (PMC)
  Third bar: PMC + optimizations
  Bottom bar: execution time if the code were compiled/optimized from the beginning]


Summary: Beyond Static Compilation

1) Profile-based compiler: high-level → binary, static
   – Uses (dynamic = runtime) information collected in profiling passes

2) Interpreter: high-level, emulate, dynamic

3) Dynamic compilation / code optimization: high-level → binary, dynamic
   – interpreter/compiler hybrid
   – supports cross-module optimization
   – can specialize the program using runtime information
     • without separate profiling passes
     • for what’s hot on this particular run


Looking Ahead

• Friday: No class

• Monday & Wednesday: “Recent Research on Optimization”
  – Student-led discussions, in groups of 2, with 20 minutes/group
  – Read 3 papers on a topic, and lead a discussion in class
  – See the “Discussion Leads” tab of the course web page for topics, sign-up sheet,
    and instructions

• Spring Break

• Monday March 14
  – Homework #3 due
  – Meetings to discuss project proposal ideas

