Dynamic Binary Translation (Lecture 24)
Acknowledgement: E. Duesterwald (IBM), S. Amarasinghe (MIT)

Lecture Outline

• Binary Translation: Why, What, and When.
• Why: guarding against buffer overruns
• What, when: overview of two dynamic translators:
  – Dynamo-RIO (HP, MIT)
  – CodeMorph (Transmeta)
• Techniques used in dynamic translators
  – Path profiling


Motivation: preventing buffer overruns

Recall the typical buffer overrun attack:
1. the program calls a method foo()
2. foo() copies a string into an on-stack array:
   – the string is supplied by the user
   – the user's malicious code is copied into foo's array
   – foo's return address is overwritten to point to the user code
3. foo() returns, unknowingly jumping to the user code

Preventing buffer overrun attacks

Two general approaches:
• static (compile-time): analyze the program
  – find all array writes that may fall outside array bounds
  – the program is proven safe before you run it
• dynamic (run-time): analyze the execution
  – make sure no write outside an array happens
  – the execution is proven safe (enough to achieve security)


Dynamic buffer overrun prevention

The idea, again:
• prevent writes outside the intended array
  – as is done in Java
  – harder in C: must add a "size" to each array
• done in CCured, a Berkeley project

A different idea

Perhaps less safe, but easier to implement:
– goal: detect that the return address was overwritten.

Instrument the program so that it keeps an extra copy of the return address:
1. store aside the return address when the function is called (store it in an inaccessible shadow stack)
2. when returning, check that the return address in the AR matches the stored one;
3. if they mismatch, terminate the program
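To make the shadow-stack idea concrete, here is a minimal C sketch of the instrumentation the three steps above describe. The names (shadow_stack, push_return_address, check_return_address) and the use of GCC's __builtin_return_address are illustrative assumptions, not what CCured or any particular tool actually emits.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define SHADOW_MAX 1024

  /* the shadow stack lives apart from the ordinary call stack,
     so an overflowing buffer cannot reach it */
  static void *shadow_stack[SHADOW_MAX];
  static int   shadow_top;

  /* step 1: on entry, store the return address aside */
  static void push_return_address(void *ra) {
      shadow_stack[shadow_top++] = ra;
  }

  /* steps 2-3: before returning, compare against the stored copy
     and terminate the program on a mismatch */
  static void check_return_address(void *ra) {
      if (shadow_stack[--shadow_top] != ra) {
          fprintf(stderr, "return address overwritten -- terminating\n");
          abort();
      }
  }

  /* an instrumented version of the vulnerable foo() from the attack example */
  void foo(const char *user_string) {
      push_return_address(__builtin_return_address(0));  /* GCC/Clang builtin */
      char buf[16];
      strcpy(buf, user_string);   /* the unchecked copy the attack exploits */
      check_return_address(__builtin_return_address(0));
  }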


Commercially interesting

• A similar idea is behind the product by determina.com
• key problem:
  – reducing the overhead of instrumentation
• what's instrumentation, anyway?
  – adding statements to an existing program
  – in our case, to x86 binaries
• Determina uses binary translation

What is Binary Translation?

• Translating a program in one binary format to another, for example:
  – MIPS → x86 (to port programs across platforms)
• We can view "binary format" liberally:
  – Java → x86 (to avoid interpretation)
  – x86 → x86 (to optimize the executable)


When does the translation happen?

• Static (off-line): before the program is run
  – Pros: no serious translation-time constraints
• Dynamic (on-line): while the program is running
  – Pros:
    • access to the complete program (the program is fully linked)
    • access to program state (including the values of data structures)
    • can adapt to changes in program behavior
• Note: Pros(dynamic) = Cons(static)

Why? Translation Allows Program Modification

[Figure: the pipeline Program → Compiler → Linker → Loader → Runtime System, split into a static and a dynamic phase. Program-modification tools attach at every stage: instrumenters, load-time optimizers, the shared-library mechanism, debuggers, interpreters, just-in-time compilers, dynamic optimizers, profilers, dynamic checkers, etc.]

Applications, in more detail

• profilers:
  – add instrumentation instructions to count basic block execution counts (e.g., gprof)
• load-time optimizers:
  – remove caller/callee save instructions (callers/callees are known after DLLs are linked)
  – replace long jumps with short jumps (code positions are known after linking)
• dynamic checkers:
  – finding memory access bugs (e.g., Rational Purify)

Dynamic Program Modifiers

[Figure: the Dynamic Program Modifier sits between the Running Program and the Hardware Platform, observing and manipulating every instruction of the running program.]


In more detail

[Figure: three software stacks compared. Common setup: application / DLL / OS / CPU. CodeMorph (Transmeta): adds a code-morphing layer, running on a VLIW CPU (CPU=VLIW). Dynamo-RIO (HP, MIT): adds its translation layer on a standard x86 CPU (CPU=x86).]

Dynamic Program Modifiers

Requirements:
✓ Ability to intercept execution at arbitrary points
✓ Observe executing instructions
✓ Modify executing instructions
✓ Transparency
  – the modified program is not specially prepared
✓ Efficiency
  – amortize the overhead and achieve near-native performance
✓ Robustness
✓ Maintain full control and capture all code
  – sampling is not an option (there are security applications)


HP Dynamo-RIO

• Building a dynamic program modifier
• Trick I: adding a code cache
• Trick II: linking
• Trick III: efficient indirect branch handling
• Trick IV: picking traces
• Dynamo-RIO performance
• Run-time trace optimizations

System I: Basic Interpreter

[Figure: the Instruction Interpreter loop: fetch the next instruction at the VPC (virtual program counter), decode it, execute it, update the VPC; plus exception handling.]

✓ Intercept execution
✓ Observe & modify executing instructions
✓ Transparency
Efficiency? – up to several 100x slowdown
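A minimal C sketch of the interpreter loop just described (fetch, decode, execute, update the VPC). The types and the decode_and_execute() helper are illustrative stand-ins, not Dynamo-RIO's actual data structures; a real implementation would decode and emulate x86 instructions and route faults to the exception handler.

  #include <stddef.h>
  #include <stdint.h>

  typedef uint8_t *vpc_t;      /* virtual PC: an address in the application's code */

  typedef struct {
      int    taken_branch;     /* did the emulated instruction branch? */
      vpc_t  target;           /* the branch target, if it did */
      size_t length;           /* instruction length in bytes */
  } step_result_t;

  /* assumed helper: decode and emulate one instruction at vpc
     (the actual x86 decoding/emulation is beyond this sketch) */
  step_result_t decode_and_execute(vpc_t vpc);

  void interpret(vpc_t start) {
      vpc_t vpc = start;
      for (;;) {                                      /* the interpreter never gives up control */
          step_result_t r = decode_and_execute(vpc);  /* fetch + decode + execute */
          vpc = r.taken_branch ? r.target             /* update VPC */
                               : vpc + r.length;
      }
  }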


Trick I: Adding a Code Cache

[Figure: the dispatcher looks up the next VPC (with exception handling); on a miss it fetches the block at the VPC, emits it into the BASIC BLOCK CACHE (which holds the block's non-control-flow instructions), and executes it; exits from the cache context-switch back to the dispatcher.]

Example Basic Block Fragment

Original basic block:

  add %eax, %ecx
  cmp $4, %eax
  jle $0x40106f

The fragment emitted into the code cache; both exits go through exit stubs that return control to the dispatcher (the jle exits through stub1, the fall-through jmp through stub2):

  frag7:
    add %eax, %ecx
    cmp $4, %eax
    jle stub1
    jmp stub2
  stub1:
    mov %eax, eax-slot    # spill eax
    mov &dstub1, %eax     # store ptr to stub table
    jmp context_switch
  stub2:
    mov %eax, eax-slot    # spill eax
    mov &dstub2, %eax     # store ptr to stub table
    jmp context_switch
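The dispatch loop behind Trick I, sketched in C. The fragment table, build_and_emit(), and execute_fragment() are assumed names; in the real system the "execute" step jumps into the emitted code, and control comes back through an exit stub (like stub1/stub2 above) carrying the next application VPC.

  #include <stddef.h>
  #include <stdint.h>

  #define CACHE_BUCKETS 65536

  typedef struct fragment {
      void *app_pc;              /* the original VPC the fragment was built from */
      void *cache_pc;            /* its code in the basic block cache */
      struct fragment *next;     /* hash-chain link */
  } fragment_t;

  static fragment_t *table[CACHE_BUCKETS];

  static fragment_t *lookup(void *vpc) {
      size_t h = ((uintptr_t)vpc >> 2) % CACHE_BUCKETS;
      for (fragment_t *f = table[h]; f != NULL; f = f->next)
          if (f->app_pc == vpc)
              return f;
      return NULL;
  }

  /* assumed helpers: copy the block at vpc into the cache (appending exit stubs),
     and run a cached fragment until an exit stub hands back the next VPC */
  fragment_t *build_and_emit(void *vpc);
  void       *execute_fragment(fragment_t *f);

  void dispatch(void *start_vpc) {
      void *vpc = start_vpc;
      for (;;) {                            /* each iteration is one context switch */
          fragment_t *f = lookup(vpc);      /* "lookup VPC" */
          if (f == NULL)
              f = build_and_emit(vpc);      /* "fetch block at VPC", "emit block" */
          vpc = execute_fragment(f);        /* "execute block": returns the next VPC */
      }
  }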

Runtime System with Code Cache

[Figure: the dispatcher passes the next VPC to the basic block builder; new blocks (their non-control-flow instructions) are emitted into the BASIC BLOCK CACHE; execution continues in the cache until an exit context-switches back to the dispatcher.]

Improves performance:
• slowdown reduced from ~100x to 17-26x
• remaining bottleneck: frequent (costly) context switches

Linking a Basic Block Fragment

The example fragment (frag7) again, now linked: once the targets of its jle and fall-through jmp exits are themselves in the cache, those branches are patched to jump directly to the target fragments, bypassing stub1/stub2 and the context switch.
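A sketch of that linking step in C: when the application target of a fragment exit is already in the cache, the exit branch is patched to jump straight to that fragment. The fragment layout and the assumption that each exit ends with a 32-bit relative branch operand (x86 rel32, reachable within ±2 GB) are mine, not DynamoRIO's actual encoding.

  #include <stdint.h>

  typedef struct fragment {
      void    *app_target[2];    /* original targets of the block's two exits */
      int32_t *exit_rel32[2];    /* address of each exit branch's rel32 operand */
      uint8_t *cache_pc;         /* start of this fragment's code in the cache */
  } fragment_t;

  /* patch exit `i` of `from` so it jumps directly to `to` in the cache,
     bypassing the exit stub */
  static void link_exit(fragment_t *from, int i, const fragment_t *to) {
      /* a rel32 branch jumps to: (address just past the operand) + rel32 */
      uint8_t *next_insn = (uint8_t *)(from->exit_rel32[i] + 1);
      *from->exit_rel32[i] = (int32_t)(to->cache_pc - next_insn);
  }

When the dispatcher later finds (or builds) the fragment for an exit's app_target, it calls link_exit; unlinking (for example, when a fragment is evicted) would restore the branch so it targets the exit stub again.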


Performance Effect of Trick II: Linking

[Bar chart: slowdown over native execution for vpr (Spec2000), two data sets. Basic block cache alone: 26.03x and 17.45x; basic block cache with direct linking: 2.97x and 3.63x.]

Performance problem: mispredicted indirect branches.

Basic Block Cache with direct branch linking

[Figure: the dispatcher looks up the next VPC (with exception handling); on a miss it fetches the block at the VPC, links it, and emits it into the BASIC BLOCK CACHE (non-control-flow instructions); execution then stays in the cache until a cache miss forces a context switch back.]


Indirect Branch Linking

[Figure: a shared Indirect Branch Target (IBT) table maps original application targets (F, H, I, J, ...) to their linked fragments in the cache. An indirect branch looks up the IBT table; on a tag match it jumps to the tag's value (the linked target), and if (! tag-match) it goes back to the dispatcher.]

Indirect Branch Handling

Conditionally "inline" a preferred indirect branch target as the continuation of the trace: if the actual target equals the preferred one, go to (the inlined copy of) the original target. A ret, for example, is translated into:

  mov %edx, edx_slot      # save app's edx
  pop %edx                # load actual target
  cmp %edx, $0x77f44708   # compare to preferred target
  jne <lookup>
  mov edx_slot, %edx      # restore app's edx

(on a mismatch, the jne falls back to the shared IBT table lookup)
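A C sketch of the two-level check that the translated ret performs: first compare the actual target against the inlined preferred target, and only on a mismatch consult the shared IBT table, which maps application addresses to their code-cache entry points. The table layout and the function name are assumptions.

  #include <stddef.h>
  #include <stdint.h>

  #define IBT_SIZE 4096                      /* assumed power-of-two table size */

  typedef struct {
      void *app_pc;                          /* tag: original application target */
      void *cache_pc;                        /* value: its fragment in the code cache */
  } ibt_entry_t;

  static ibt_entry_t ibt[IBT_SIZE];

  /* returns where to continue in the code cache, or NULL to fall back to
     the dispatcher / basic block builder */
  static void *resolve_indirect(void *actual_target,
                                void *preferred_target,
                                void *preferred_cache_pc) {
      if (actual_target == preferred_target)     /* the inlined, predicted case */
          return preferred_cache_pc;             /* stay on the trace */
      size_t h = ((uintptr_t)actual_target >> 2) & (IBT_SIZE - 1);
      if (ibt[h].app_pc == actual_target)        /* tag match in the shared IBT table */
          return ibt[h].cache_pc;
      return NULL;                               /* miss: context switch back */
  }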


Trick III: Efficient Indirect Branch Handling

[Figure: the runtime system as before (next VPC → basic block builder → BASIC BLOCK CACHE of non-control-flow instructions, with context switches on a miss), now extended with an indirect branch lookup inside the code cache; only lookup misses context-switch back to the dispatcher.]

Performance Effect of indirect branch linking

[Bar chart: slowdown over native execution for vpr (Spec2000), two data sets. Block cache alone: 26.03x and 17.45x; with direct linking: 3.63x and 2.97x; with direct + indirect linking: 1.20x and 1.15x.]

Performance problem: poor code layout in the code cache.


Trick IV: Picking Traces

[Figure: the complete runtime system: START → basic block builder and trace selector → dispatch → context switch → BASIC BLOCK CACHE (non-control-flow instructions, indirect branch lookup) and TRACE CACHE (non-control-flow instructions).]

Picking Traces

The block cache has poor execution efficiency:
• increased branching, poor locality

Pick traces to:
• reduce branching & improve layout and locality
• open new optimization opportunities across block boundaries

[Figure: the Block Cache holds the individual blocks A through L; the Trace Cache holds hot traces, each a contiguous sequence of those blocks (e.g., a trace through A, G, B, K, J).]


Picking hot traces

• The goal: path profiling
  – find frequently executed control-flow paths
  – connect basic blocks along these paths into contiguous sequences, called traces.
• The problem: find a good trade-off between
  – profiling overhead (counting execution events), and
  – accuracy of the profile.

Alternative 1: Edge profiling

The algorithm:
• Edge profiling: measure the frequencies of all control-flow edges; then, after a while,
• Trace selection: select hot traces by following the highest-frequency branch outcome.

Disadvantages:
• Inaccurate: may select infeasible paths (due to branch correlation)
• Overhead: must profile all control-flow edges
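A small C sketch of this alternative: every control-flow edge carries a counter, and a trace is grown from a hot starting block by repeatedly following the higher-frequency outgoing edge. The block_t structure is illustrative; note that nothing here prevents stitching together a path no single execution ever followed, which is exactly the branch-correlation problem mentioned above.

  #include <stddef.h>

  typedef struct block {
      struct block  *succ[2];        /* taken / fall-through successors (NULL if absent) */
      unsigned long  edge_count[2];  /* edge-profile counters, one per outgoing edge */
  } block_t;

  /* grow a trace from `start` by always taking the hottest outgoing edge */
  static size_t select_trace(block_t *start, block_t *trace[], size_t max_len) {
      size_t n = 0;
      for (block_t *b = start; b != NULL && n < max_len; ) {
          trace[n++] = b;
          if (b->succ[0] == NULL && b->succ[1] == NULL)
              break;                                /* no successors: trace ends here */
          int hot = (b->edge_count[1] > b->edge_count[0]) ? 1 : 0;
          b = b->succ[hot];                         /* follow the higher-frequency edge */
      }
      return n;                                     /* number of blocks in the trace */
  }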


Alternative 2: Bit-tracing path profiling

The algorithm:
– collect path signatures and their frequencies
– path signature = start address + branch-outcome history (e.g., .0101101)
– must include the addresses of indirect branches

Advantages:
– accuracy

Disadvantages:
– overhead: need to monitor every branch
– overhead: counter storage (one counter per path!)

Alternative 3: Next Executing Tail (NET)

This is the algorithm of Dynamo:
– profiling: count only the frequencies of start-of-trace points (which are targets of original backedges)
– trace selection: when a start-of-trace point becomes sufficiently hot, select the sequence of basic blocks executed next.
– may select a rare (cold) path, but statistically selects a hot path!
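A C sketch of the NET idea: a counter only at each start-of-trace point (a target of a backward branch); once the counter crosses a threshold, the blocks executed next are recorded, and that "next executing tail" becomes the trace. The threshold, the buffer size, and the ends_trace condition (e.g., reaching a backward branch or another trace head) are assumptions.

  #include <stddef.h>

  #define HOT_THRESHOLD 50          /* assumed hot threshold */
  #define MAX_TRACE_LEN 64

  typedef struct {
      unsigned count;               /* executions of this start-of-trace point */
      int      recording;           /* currently capturing the next executing tail? */
  } trace_head_t;

  static void  *trace_buf[MAX_TRACE_LEN];
  static size_t trace_len;

  /* called by the dispatcher for each block reached at (or after) the
     start-of-trace point `head`; returns 1 when a complete trace has been
     captured in trace_buf[0..trace_len) */
  static int net_profile(trace_head_t *head, void *block_pc, int ends_trace) {
      if (!head->recording) {
          if (++head->count < HOT_THRESHOLD)
              return 0;             /* not hot yet: keep running cached blocks */
          head->recording = 1;      /* hot: start recording the very next path */
          trace_len = 0;
      }
      trace_buf[trace_len++] = block_pc;            /* append the block just reached */
      if (ends_trace || trace_len == MAX_TRACE_LEN) {
          head->recording = 0;
          return 1;                 /* caller stitches trace_buf into the trace cache */
      }
      return 0;
  }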


NET (continued)

Advantages of NET:
• very light-weight
  – #instrumentation points = #targets of backward branches
  – #counters = #targets of backward branches
• statistically likely to pick the hottest path
• picks only feasible paths
• easy to implement

[Figure: the example control-flow graph with blocks A through L.]

Spec2000 Performance on Windows (w/o trace optimizations)

[Bar chart: slowdown vs. native execution (y-axis up to 2.2x) for art, vpr, gcc, eon, gap, mcf, gzip, twolf, bzip2, crafty, mesa, vortex, parser, equake, perlbmk, and their harmonic mean (H_MEAN).]

Spec2000 Performance on … (w/o trace optimizations)

[Bar chart: slowdown vs. native execution for art, vpr, eon, gap, gcc, mcf, apsi, gzip, twolf, swim, applu, bzip2, mesa, mgrid, crafty, vortex, ammp, parser, equake, sixtrack, perlbmk, wupwise, and their harmonic mean (H_MEAN).]

Performance on Desktop Applications

[Bar chart: slowdown vs. native execution for Adobe Acrobat, Microsoft Excel, Microsoft PowerPoint, and Microsoft Word.]


Performance Breakdown

[Pie chart: where the time goes: code cache 86%, indirect branch lookup 11%, trace branch taken 2%, rest of system 1%.]

Trace optimizations

• Now that we have built the traces, let's optimize them
• But what's left to optimize in statically optimized code?
• Limitations of static compiler optimization:
  – cost of call-specific interprocedural optimization
  – cost of path-specific optimization in the presence of complex control flow
  – difficulty of predicting indirect branch targets
  – lack of access to shared libraries
  – sub-optimal register allocation decisions
  – register allocation for individual array elements or pointers

Maintaining Control (in the real world)

• Capture all code: execution only takes place out of the code cache
• Challenging for abnormal control flow
• The system must intercept all abnormal control-flow events:
  – exceptions
  – callbacks in Windows
  – asynchronous procedure calls
  – setjmp/longjmp
  – set thread context
