Dynamic Binary Translation (Lecture 24)
Acknowledgement: E. Duesterwald (IBM), S. Amarasinghe (MIT)

Lecture Outline

• Binary Translation: Why, What, and When.
• Why: guarding against buffer overruns
• What, when: overview of two dynamic translators:
  – Dynamo-RIO (HP, MIT)
  – CodeMorph (Transmeta)
• Techniques used in dynamic translators
  – Path profiling


Motivation: preventing buffer overruns

Recall the typical buffer overrun attack:
1. the program calls a method foo()
2. foo() copies a string into an on-stack array:
   – the string is supplied by the user
   – the user's malicious code is copied into foo's array
   – foo's return address is overwritten to point to the user code
3. foo() returns, unknowingly jumping to the user code

Preventing buffer overrun attacks

Two general approaches:
• static (compile-time): analyze the program
  – find all array writes that may fall outside array bounds
  – the program is proven safe before you run it
• dynamic (run-time): analyze the execution
  – make sure no write outside an array happens
  – the execution is proven safe (enough to achieve security)


Dynamic buffer overrun prevention

The idea, again:
• prevent writes outside the intended array
  – as is done in Java
  – harder in C: must add a "size" to each array
• done in CCured, a Berkeley project

A different idea

Perhaps less safe, but easier to implement:
– goal: detect that the return address was overwritten.

Instrument the program so that it keeps an extra copy of the return address:
1. store aside the return address when the function is called (store it in an inaccessible shadow stack)
2. when returning, check that the return address in the AR matches the stored one;
3. if they mismatch, terminate the program
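To make the shadow-stack idea concrete, here is a minimal C sketch of the instrumentation the three steps above describe. The names (shadow_stack, push_return_address, check_return_address) and the use of GCC's __builtin_return_address are illustrative assumptions, not what CCured or any particular tool actually emits.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define SHADOW_MAX 1024

  /* the shadow stack lives apart from the ordinary call stack,
     so an overflowing buffer cannot reach it */
  static void *shadow_stack[SHADOW_MAX];
  static int   shadow_top;

  /* step 1: on entry, store the return address aside */
  static void push_return_address(void *ra) {
      shadow_stack[shadow_top++] = ra;
  }

  /* steps 2-3: before returning, compare against the stored copy
     and terminate the program on a mismatch */
  static void check_return_address(void *ra) {
      if (shadow_stack[--shadow_top] != ra) {
          fprintf(stderr, "return address overwritten -- terminating\n");
          abort();
      }
  }

  /* an instrumented version of the vulnerable foo() from the attack example */
  void foo(const char *user_string) {
      push_return_address(__builtin_return_address(0));  /* GCC/Clang builtin */
      char buf[16];
      strcpy(buf, user_string);   /* the unchecked copy the attack exploits */
      check_return_address(__builtin_return_address(0));
  }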


Commercially interesting

• A similar idea is behind the product by determina.com
• key problem:
  – reducing the overhead of instrumentation
• what's instrumentation, anyway?
  – adding statements to an existing program
  – in our case, to x86 binaries
• Determina uses binary translation

What is Binary Translation?

• Translating a program in one binary format to another, for example:
  – MIPS → x86 (to port programs across platforms)
• We can view "binary format" liberally:
  – Java → x86 (to avoid interpretation)
  – x86 → x86 (to optimize the executable)


When does the translation happen?

• Static (off-line): before the program is run
  – Pros: no serious translation-time constraints
• Dynamic (on-line): while the program is running
  – Pros:
    • access to the complete program (the program is fully linked)
    • access to program state (including the values of data structures)
    • can adapt to changes in program behavior
• Note: Pros(dynamic) = Cons(static)

Why? Translation Allows Program Modification

[Figure: the pipeline Program → Compiler → Linker → Loader → Runtime System, split into a static and a dynamic phase. Program-modification tools attach at every stage: instrumenters, load-time optimizers, the shared-library mechanism, debuggers, interpreters, just-in-time compilers, dynamic optimizers, profilers, dynamic checkers, etc.]

Applications, in more detail

• profilers:
  – add instrumentation instructions to count basic block execution counts (e.g., gprof)
• load-time optimizers:
  – remove caller/callee save instructions (callers/callees are known after DLLs are linked)
  – replace long jumps with short jumps (code positions are known after linking)
• dynamic checkers:
  – finding memory access bugs (e.g., Rational Purify)

Dynamic Program Modifiers

[Figure: the Dynamic Program Modifier sits between the Running Program and the Hardware Platform, observing and manipulating every instruction of the running program.]


In more detail

[Figure: three software stacks compared. Common setup: application / DLL / OS / CPU. CodeMorph (Transmeta): adds a code-morphing layer, running on a VLIW CPU (CPU=VLIW). Dynamo-RIO (HP, MIT): adds its translation layer on a standard x86 CPU (CPU=x86).]

Dynamic Program Modifiers

Requirements:
✓ Ability to intercept execution at arbitrary points
✓ Observe executing instructions
✓ Modify executing instructions
✓ Transparency
  – the modified program is not specially prepared
✓ Efficiency
  – amortize the overhead and achieve near-native performance
✓ Robustness
✓ Maintain full control and capture all code
  – sampling is not an option (there are security applications)


HP Dynamo-RIO

• Building a dynamic program modifier
• Trick I: adding a code cache
• Trick II: linking
• Trick III: efficient indirect branch handling
• Trick IV: picking traces
• Dynamo-RIO performance
• Run-time trace optimizations

System I: Basic Interpreter

[Figure: the Instruction Interpreter loop: fetch the next instruction at the VPC (virtual program counter), decode it, execute it, update the VPC; plus exception handling.]

✓ Intercept execution
✓ Observe & modify executing instructions
✓ Transparency
Efficiency? – up to several 100x slowdown
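A minimal C sketch of the interpreter loop just described (fetch, decode, execute, update the VPC). The types and the decode_and_execute() helper are illustrative stand-ins, not Dynamo-RIO's actual data structures; a real implementation would decode and emulate x86 instructions and route faults to the exception handler.

  #include <stddef.h>
  #include <stdint.h>

  typedef uint8_t *vpc_t;      /* virtual PC: an address in the application's code */

  typedef struct {
      int    taken_branch;     /* did the emulated instruction branch? */
      vpc_t  target;           /* the branch target, if it did */
      size_t length;           /* instruction length in bytes */
  } step_result_t;

  /* assumed helper: decode and emulate one instruction at vpc
     (the actual x86 decoding/emulation is beyond this sketch) */
  step_result_t decode_and_execute(vpc_t vpc);

  void interpret(vpc_t start) {
      vpc_t vpc = start;
      for (;;) {                                      /* the interpreter never gives up control */
          step_result_t r = decode_and_execute(vpc);  /* fetch + decode + execute */
          vpc = r.taken_branch ? r.target             /* update VPC */
                               : vpc + r.length;
      }
  }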


Trick I: Adding a Code Cache

[Figure: the dispatcher looks up the next VPC (with exception handling); on a miss it fetches the block at the VPC, emits it into the BASIC BLOCK CACHE (which holds the block's non-control-flow instructions), and executes it; exits from the cache context-switch back to the dispatcher.]

Example Basic Block Fragment

Original basic block:

  add %eax, %ecx
  cmp $4, %eax
  jle $0x40106f

The fragment emitted into the code cache; both exits go through exit stubs that return control to the dispatcher (the jle exits through stub1, the fall-through jmp through stub2):

  frag7:
    add %eax, %ecx
    cmp $4, %eax
    jle stub1
    jmp stub2
  stub1:
    mov %eax, eax-slot    # spill eax
    mov &dstub1, %eax     # store ptr to stub table
    jmp context_switch
  stub2:
    mov %eax, eax-slot    # spill eax
    mov &dstub2, %eax     # store ptr to stub table
    jmp context_switch
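The dispatch loop behind Trick I, sketched in C. The fragment table, build_and_emit(), and execute_fragment() are assumed names; in the real system the "execute" step jumps into the emitted code, and control comes back through an exit stub (like stub1/stub2 above) carrying the next application VPC.

  #include <stddef.h>
  #include <stdint.h>

  #define CACHE_BUCKETS 65536

  typedef struct fragment {
      void *app_pc;              /* the original VPC the fragment was built from */
      void *cache_pc;            /* its code in the basic block cache */
      struct fragment *next;     /* hash-chain link */
  } fragment_t;

  static fragment_t *table[CACHE_BUCKETS];

  static fragment_t *lookup(void *vpc) {
      size_t h = ((uintptr_t)vpc >> 2) % CACHE_BUCKETS;
      for (fragment_t *f = table[h]; f != NULL; f = f->next)
          if (f->app_pc == vpc)
              return f;
      return NULL;
  }

  /* assumed helpers: copy the block at vpc into the cache (appending exit stubs),
     and run a cached fragment until an exit stub hands back the next VPC */
  fragment_t *build_and_emit(void *vpc);
  void       *execute_fragment(fragment_t *f);

  void dispatch(void *start_vpc) {
      void *vpc = start_vpc;
      for (;;) {                            /* each iteration is one context switch */
          fragment_t *f = lookup(vpc);      /* "lookup VPC" */
          if (f == NULL)
              f = build_and_emit(vpc);      /* "fetch block at VPC", "emit block" */
          vpc = execute_fragment(f);        /* "execute block": returns the next VPC */
      }
  }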

Runtime System with Code Cache

[Figure: the dispatcher passes the next VPC to the basic block builder; new blocks (their non-control-flow instructions) are emitted into the BASIC BLOCK CACHE; execution continues in the cache until an exit context-switches back to the dispatcher.]

Improves performance:
• slowdown reduced from ~100x to 17-26x
• remaining bottleneck: frequent (costly) context switches

Linking a Basic Block Fragment

The example fragment (frag7) again, now linked: once the targets of its jle and fall-through jmp exits are themselves in the cache, those branches are patched to jump directly to the target fragments, bypassing stub1/stub2 and the context switch.
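A sketch of that linking step in C: when the application target of a fragment exit is already in the cache, the exit branch is patched to jump straight to that fragment. The fragment layout and the assumption that each exit ends with a 32-bit relative branch operand (x86 rel32, reachable within ±2 GB) are mine, not DynamoRIO's actual encoding.

  #include <stdint.h>

  typedef struct fragment {
      void    *app_target[2];    /* original targets of the block's two exits */
      int32_t *exit_rel32[2];    /* address of each exit branch's rel32 operand */
      uint8_t *cache_pc;         /* start of this fragment's code in the cache */
  } fragment_t;

  /* patch exit `i` of `from` so it jumps directly to `to` in the cache,
     bypassing the exit stub */
  static void link_exit(fragment_t *from, int i, const fragment_t *to) {
      /* a rel32 branch jumps to: (address just past the operand) + rel32 */
      uint8_t *next_insn = (uint8_t *)(from->exit_rel32[i] + 1);
      *from->exit_rel32[i] = (int32_t)(to->cache_pc - next_insn);
  }

When the dispatcher later finds (or builds) the fragment for an exit's app_target, it calls link_exit; unlinking (for example, when a fragment is evicted) would restore the branch so it targets the exit stub again.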


Performance Effect of Trick II: Linking

[Bar chart: slowdown over native execution for vpr (Spec2000), two data sets. Basic block cache alone: 26.03x and 17.45x; basic block cache with direct linking: 2.97x and 3.63x.]

Performance problem: mispredicted indirect branches.

Basic Block Cache with direct branch linking

[Figure: the dispatcher looks up the next VPC (with exception handling); on a miss it fetches the block at the VPC, links it, and emits it into the BASIC BLOCK CACHE (non-control-flow instructions); execution then stays in the cache until a cache miss forces a context switch back.]


Indirect Branch Linking

[Figure: a shared Indirect Branch Target (IBT) table maps original application targets (F, H, I, J, ...) to their linked fragments in the cache. An indirect branch looks up the IBT table; on a tag match it jumps to the tag's value (the linked target), and if (! tag-match) it goes back to the dispatcher.]

Indirect Branch Handling

Conditionally "inline" a preferred indirect branch target as the continuation of the trace: if the actual target equals the preferred one, go to (the inlined copy of) the original target. A ret, for example, is translated into:

  mov %edx, edx_slot      # save app's edx
  pop %edx                # load actual target
  cmp %edx, $0x77f44708   # compare to preferred target
  jne <lookup>
  mov edx_slot, %edx      # restore app's edx

(on a mismatch, the jne falls back to the shared IBT table lookup)
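A C sketch of the two-level check that the translated ret performs: first compare the actual target against the inlined preferred target, and only on a mismatch consult the shared IBT table, which maps application addresses to their code-cache entry points. The table layout and the function name are assumptions.

  #include <stddef.h>
  #include <stdint.h>

  #define IBT_SIZE 4096                      /* assumed power-of-two table size */

  typedef struct {
      void *app_pc;                          /* tag: original application target */
      void *cache_pc;                        /* value: its fragment in the code cache */
  } ibt_entry_t;

  static ibt_entry_t ibt[IBT_SIZE];

  /* returns where to continue in the code cache, or NULL to fall back to
     the dispatcher / basic block builder */
  static void *resolve_indirect(void *actual_target,
                                void *preferred_target,
                                void *preferred_cache_pc) {
      if (actual_target == preferred_target)     /* the inlined, predicted case */
          return preferred_cache_pc;             /* stay on the trace */
      size_t h = ((uintptr_t)actual_target >> 2) & (IBT_SIZE - 1);
      if (ibt[h].app_pc == actual_target)        /* tag match in the shared IBT table */
          return ibt[h].cache_pc;
      return NULL;                               /* miss: context switch back */
  }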


Trick III: Efficient Indirect Branch Handling

[Figure: the runtime system as before (next VPC → basic block builder → BASIC BLOCK CACHE of non-control-flow instructions, with context switches on a miss), now extended with an indirect branch lookup inside the code cache; only lookup misses context-switch back to the dispatcher.]

Performance Effect of indirect branch linking

[Bar chart: slowdown over native execution for vpr (Spec2000), two data sets. Block cache alone: 26.03x and 17.45x; with direct linking: 3.63x and 2.97x; with direct + indirect linking: 1.20x and 1.15x.]

Performance problem: poor code layout in the code cache.


Trick IV: Picking Traces

[Figure: the complete runtime system: START → basic block builder and trace selector → dispatch → context switch → BASIC BLOCK CACHE (non-control-flow instructions, indirect branch lookup) and TRACE CACHE (non-control-flow instructions).]

Picking Traces

The block cache has poor execution efficiency:
• increased branching, poor locality

Pick traces to:
• reduce branching & improve layout and locality
• open new optimization opportunities across block boundaries

[Figure: the Block Cache holds the individual blocks A through L; the Trace Cache holds hot traces, each a contiguous sequence of those blocks (e.g., a trace through A, G, B, K, J).]


Picking hot traces

• The goal: path profiling
  – find frequently executed control-flow paths
  – connect basic blocks along these paths into contiguous sequences, called traces.
• The problem: find a good trade-off between
  – profiling overhead (counting execution events), and
  – accuracy of the profile.

Alternative 1: Edge profiling

The algorithm:
• Edge profiling: measure the frequencies of all control-flow edges; then, after a while,
• Trace selection: select hot traces by following the highest-frequency branch outcome.

Disadvantages:
• Inaccurate: may select infeasible paths (due to branch correlation)
• Overhead: must profile all control-flow edges
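A small C sketch of this alternative: every control-flow edge carries a counter, and a trace is grown from a hot starting block by repeatedly following the higher-frequency outgoing edge. The block_t structure is illustrative; note that nothing here prevents stitching together a path no single execution ever followed, which is exactly the branch-correlation problem mentioned above.

  #include <stddef.h>

  typedef struct block {
      struct block  *succ[2];        /* taken / fall-through successors (NULL if absent) */
      unsigned long  edge_count[2];  /* edge-profile counters, one per outgoing edge */
  } block_t;

  /* grow a trace from `start` by always taking the hottest outgoing edge */
  static size_t select_trace(block_t *start, block_t *trace[], size_t max_len) {
      size_t n = 0;
      for (block_t *b = start; b != NULL && n < max_len; ) {
          trace[n++] = b;
          if (b->succ[0] == NULL && b->succ[1] == NULL)
              break;                                /* no successors: trace ends here */
          int hot = (b->edge_count[1] > b->edge_count[0]) ? 1 : 0;
          b = b->succ[hot];                         /* follow the higher-frequency edge */
      }
      return n;                                     /* number of blocks in the trace */
  }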


Alternative 2: Bit-tracing path profiling

The algorithm:
– collect path signatures and their frequencies
– path signature = start address + branch-outcome history (e.g., .0101101)
– must include the addresses of indirect branches

Advantages:
– accuracy

Disadvantages:
– overhead: need to monitor every branch
– overhead: counter storage (one counter per path!)

Alternative 3: Next Executing Tail (NET)

This is the algorithm of Dynamo:
– profiling: count only the frequencies of start-of-trace points (which are targets of original backedges)
– trace selection: when a start-of-trace point becomes sufficiently hot, select the sequence of basic blocks executed next.
– may select a rare (cold) path, but statistically selects a hot path!
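A C sketch of the NET idea: a counter only at each start-of-trace point (a target of a backward branch); once the counter crosses a threshold, the blocks executed next are recorded, and that "next executing tail" becomes the trace. The threshold, the buffer size, and the ends_trace condition (e.g., reaching a backward branch or another trace head) are assumptions.

  #include <stddef.h>

  #define HOT_THRESHOLD 50          /* assumed hot threshold */
  #define MAX_TRACE_LEN 64

  typedef struct {
      unsigned count;               /* executions of this start-of-trace point */
      int      recording;           /* currently capturing the next executing tail? */
  } trace_head_t;

  static void  *trace_buf[MAX_TRACE_LEN];
  static size_t trace_len;

  /* called by the dispatcher for each block reached at (or after) the
     start-of-trace point `head`; returns 1 when a complete trace has been
     captured in trace_buf[0..trace_len) */
  static int net_profile(trace_head_t *head, void *block_pc, int ends_trace) {
      if (!head->recording) {
          if (++head->count < HOT_THRESHOLD)
              return 0;             /* not hot yet: keep running cached blocks */
          head->recording = 1;      /* hot: start recording the very next path */
          trace_len = 0;
      }
      trace_buf[trace_len++] = block_pc;            /* append the block just reached */
      if (ends_trace || trace_len == MAX_TRACE_LEN) {
          head->recording = 0;
          return 1;                 /* caller stitches trace_buf into the trace cache */
      }
      return 0;
  }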


NET (continued)

Advantages of NET:
• very light-weight
  – #instrumentation points = #targets of backward branches
  – #counters = #targets of backward branches
• statistically likely to pick the hottest path
• picks only feasible paths
• easy to implement

[Figure: the example control-flow graph with blocks A through L.]

Spec2000 Performance on Windows (w/o trace optimizations)

[Bar chart: slowdown vs. native execution (y-axis up to 2.2x) for art, vpr, gcc, eon, gap, mcf, gzip, twolf, bzip2, crafty, mesa, vortex, parser, equake, perlbmk, and their harmonic mean (H_MEAN).]

Spec2000 Performance on … (w/o trace optimizations)

[Bar chart: slowdown vs. native execution for art, vpr, eon, gap, gcc, mcf, apsi, gzip, twolf, swim, applu, bzip2, mesa, mgrid, crafty, vortex, ammp, parser, equake, sixtrack, perlbmk, wupwise, and their harmonic mean (H_MEAN).]

Performance on Desktop Applications

[Bar chart: slowdown vs. native execution for Adobe Acrobat, Microsoft Excel, Microsoft PowerPoint, and Microsoft Word.]


Performance Breakdown

[Pie chart: where the time goes: code cache 86%, indirect branch lookup 11%, trace branch taken 2%, rest of system 1%.]

Trace optimizations

• Now that we have built the traces, let's optimize them
• But what's left to optimize in statically optimized code?
• Limitations of static compiler optimization:
  – cost of call-specific interprocedural optimization
  – cost of path-specific optimization in the presence of complex control flow
  – difficulty of predicting indirect branch targets
  – lack of access to shared libraries
  – sub-optimal register allocation decisions
  – register allocation for individual array elements or pointers

Maintaining Control (in the real world)

• Capture all code: execution only takes place out of the code cache
• Challenging for abnormal control flow
• The system must intercept all abnormal control-flow events:
  – exceptions
  – callbacks in Windows
  – asynchronous procedure calls
  – setjmp/longjmp
  – set thread context
