Instrew: Fast LLVM-based Dynamic Binary Instrumentation and Translation

Alexis Engelke Martin Schulz

Chair of Computer Architecture and Parallel Systems Department of Informatics Technical University of Munich

LLVM Performance Workshop at CGO 2021, virtual Alexis Engelke 2021

Dynamic Binary Translation Dynamic Binary Instrumentation

I Run program on other architecture, I Modify program behavior, translate code for host CPU add additional code/checks I Use-cases: compatibility, I Use-cases: analysis, debugging, architecture research profiling, arch. research I Example: QEMU-user, Rosetta 2 I Examples: ,

x86-64 AArch64 x86-64 x86-64 +checks

mov rax, rcx mov rax, rcx mov rax, rcx add x0, x1, 4 add rax, 4 add rax, 4 add rax, 4 call check_bounds mov [rdx+rsi+16], rax add x16, x6, 16 mov [rdx+rsi+16], rax mov [rdx+rsi+16], rax str x0, [x2, x16] incq [num_stores]

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 2 Alexis Engelke 2021 Motivation: LLVM for DBT/DBI

I Many DBT/DBI systems focus on translation performance I QEMU, Valgrind, Pin, . . . I Instead, use LLVM code generation for run-time performance

Instrew: a fast LLVM-based DBT/DBI framework

Step 1 Step 2 Step 3 Lift machine Align LLVM code code → LLVM-IR gen. for DBI/DBT Performance

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 3 Alexis Engelke 2021 Overview: Process-level DBT and DBI

Execution Unit Translation Instrumentation

Guest Code Decode & Lift Decode & Lift

main Execution Arch-indep.Arch-indep. IR IR Archdep.IR loop Manager Instrument

Code Gen. Code Gen. Code Cache

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 4 Alexis Engelke 2021 Overview: Process-level DBT and DBI

Execution Unit Transformation Unit

Guest Code Decode & Lift

main Execution Arch-indep. IR: LLVM-IR loop Manager Instrument

Code Gen. Code Cache

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 5 Alexis Engelke 2021 Lifting Machine Code to LLVM-IR

I Actually widely researched... I McSema: excellent coverage of x86-64 and others, but slow I HQEMU/DBILL (TCG→LLVM): limited by TCG (FP, basic blocks) I MCTOLL: very low instruction coverage

I ... but not sufficient for good coverage and overall performance!

Rellume: performance-oriented lifting library

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 6 Alexis Engelke 2021 Rellume: Lifting Approach

I Target-independent & idiomatic LLVM-IR I Use LLVM constructs where possible, e.g. vectors, comparisons I Helper functions for syscalls and cpuid I Lifting performance: avoid heavy transformations (like mem2regs) I Lifting stages: 1. Decode instructions, recover control flow with basic blocks 2. Lift individual instructions/basic blocks 3. Fixup branches and PHI nodes

I Architectures: x86-64 (up to SSE2), RISC-V64 (imafdc)

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 7 Alexis Engelke 2021 Rellume: Example define void @func_40061e(i8* %cpu) { prologue: ; ...

bb_40061e: %rsp_2 = phi i64 [%rsp, %prologue] ; sub rsp, 176 Lift instruction %rsp_3 = sub i64 %rsp_2, 176 semantics ; ... compute flags ... br label %epilogue

epilogue: ; ... } Introduction Machine Code → LLVM Instrumentation Framework Performance Results 8 Alexis Engelke 2021 Rellume: Example define void @func_40061e(i8* %cpu) { Single LLVM function, prologue: single parameter: ; ... CPU state I Instruction pointer I Registers bb_40061e: ; ... I Status flags I TLS pointer I ...

epilogue: ; ... } Introduction Machine Code → LLVM Instrumentation Framework Performance Results 9 Alexis Engelke 2021 Rellume: Example define void @func_40061e(i8* %cpu) { prologue: %rip_p_i8 = getelementptr i8, i8* %cpu, i64 0 Construct ptrs. into CPU %rip_p = bitcast i8* %rip_p_i8 to i64* struct %rsp_p_i8 = getelementptr i8, i8* %cpu, i64 40 %rsp_p = bitcast i8* %rsp_p_i8 to i64* %rsp = load i64, i64* %rsp_p Load registers into SSA ; ... load other registers ... variables br label %bb_40061e

bb_40061e: ; ... epilogue: ; ... } Introduction Machine Code → LLVM Instrumentation Framework Performance Results 10 Alexis Engelke 2021 Rellume: Example define void @func_40061e(i8* %cpu) { prologue: ; ...

bb_40061e: ; ...

epilogue: %rsp_4 = phi i64 [%rsp_3, %bb_40061e] store i64 %rsp_4, i64* %rsp_p Store new values ; ... store flags ... store i64 0x400625, i64* %rip_p Store new RIP ret void } Introduction Machine Code → LLVM Instrumentation Framework Performance Results 11 Alexis Engelke 2021 The good

Functional! Fast!

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 12 Alexis Engelke 2021 The bad: LLVM’s limitations

I Better use of host registers possible I Keep more guest registers in host registers, less memory accesses I x86-64 has (slightly broken) HHVM calling convention I Need for general all-regs CC

I LLVM 11 is 40% slower than LLVM 9 I Not investigated yet

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 13 Alexis Engelke 2021 The ugly: hard problems

I Floating-point semantics I Rounding mode depends on register, impossible to model in LLVM-IR I Current state: ignore rounding, except for FP→int (emulate in software)

I Load-Locked/Store-Conditional atomics I Can’t be represented in general LLVM-IR I Ideas: hardware transactional memory, global mutex, stop-the-world

I Memory Consistency: multi-threaded x86-64 on hsomething elsei I Two approaches: fences everywhere or hardware support I Current state: single-threaded only :)

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 14 Alexis Engelke 2021 Instrew: Client-Server Architecture

Client Process Server Process

Guest Code I More flexible, options: Decode I Different server main Execution Lift: Rellume machine loop Manager (Instrument) I Permanent server I Transparent caching MCJIT ELF Code Cache I ... Linker

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 15 Alexis Engelke 2021 Instrumentation

I Instrumentation tool is a shared library on server-side I Loads initial library functions into client I Compile LLVM code for target architecture, send to client I Modify lifted code prior to compilation I Optionally, lifter adds instruction information to LLVM-IR

I Memory management currently rather simple I 48 bytes storage accessible in CPU state I Further memory custom allocation with mmap

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 16 Alexis Engelke 2021 Evaluation

I Run on SPEC CPU2017 benchmarks I Source architectures: x86-64, RISC-V64 I Target architectures: x86-64, AArch64 I LLVM 9 I Comparison with Valgrind, DynamoRIO, QEMU, HQEMU; base: host run

System x86-64: 2×Intel Xeon CPU E5-2697 v3 (Haswell) @ 2.6 GHz (3.6 GHz Turbo), 17 MiB L3 cache; 64 GiB main memory; SUSE 15.1 SP1; Linux kernel 4.12.14-95.32; 64-bit mode. Compiler: GCC 9.2.0 with -O3 -march=x86-64, implies SSE/SSE2 but no SSE3+/AVX. Libraries: glibc 2.32; LLVM 9.0. System AArch64: 2×Cavium ThunderX2 99xx @ 2.5 GHz, 32 MiB L3 cache; 512 GiB main memory; CentOS 8; Linux kernel 4.18.0-193.14.2; 64-bit mode. Compiler: GCC 10 with -O3. Libraries: glibc 2.32; LLVM 9.0. SPEC CPU2017 intspeed+fpspeed benchmarks, ref workload, single thread.

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 17 Alexis Engelke 2021 SPEC CPU2017int Results Results normalized to native execution on host Valgrind DynamoRIO QEMU HQEMU Instrew 6 5 4 3 2

Normalized run-time 1 0 X86toX86 RV64toX86 X86toA64 RV64toA64

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 18 Alexis Engelke 2021

SPEC CPU2017int+fp Results (x86-64→x86-64)

Native QEMU HQEMU Instrew 30 28 26 24 22 20 18 16 14 12 10 8

Normalized run-time 6 4 2 0

657.xz 602.gcc 605.mcf 619.lbm 621.wrf 644.nab geomean 625.x264 641.leela 627.cam4628.pop2 654.roms 603.bwaves 620.omnetpp 638.imagick 631.deepsjeng 649.fotonik3d 623.xalancbmk 648.exchange2 607.cactuBSSN Introduction Machine Code → LLVM Instrumentation Framework Performance Results 19 Alexis Engelke 2021

SPEC CPU2017int+fp Results (x86-64→x86-64)

Native QEMU HQEMU Instrew 30 28 26 Instrew Best Case 24 22 Instrew: 1.07x; HQEMU: 1.17x 20 18 16 14 12 10 8

Normalized run-time 6 4 2 0

657.xz 602.gcc 605.mcf 619.lbm 621.wrf 644.nab geomean 625.x264 641.leela 627.cam4628.pop2 654.roms 603.bwaves 620.omnetpp 638.imagick 631.deepsjeng 649.fotonik3d 623.xalancbmk 648.exchange2 607.cactuBSSN Introduction Machine Code → LLVM Instrumentation Framework Performance Results 20 Alexis Engelke 2021

SPEC CPU2017int+fp Results (x86-64→x86-64)

Native QEMU HQEMU Instrew 30 28 26 Rewriting Time Matters 24 22 Instrew: 4.2x; HQEMU: 4.0x 20 18 High rewriting time: 55% 16 14 12 10 8

Normalized run-time 6 4 2 0

657.xz 602.gcc 605.mcf 619.lbm 621.wrf 644.nab geomean 625.x264 641.leela 627.cam4628.pop2 654.roms 603.bwaves 620.omnetpp 638.imagick 631.deepsjeng 649.fotonik3d 623.xalancbmk 648.exchange2 607.cactuBSSN Introduction Machine Code → LLVM Instrumentation Framework Performance Results 21 Alexis Engelke 2021 Rewriting Overhead

I Mean Rewriting Time: <2% I Notable exception: 602.gcc with 55% I HHVM-CC on x86 hosts: codegen.-time increase 12%–78% I Mean Rewriting Time Breakdown: I Most time spent for machine code generation I SelectionDAG instruction selector known to be slow I Replacement GlobalISel not yet ready Lift (12%) Optimize (22%) Code Gen. (65%) Link (<1%)

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 22 Alexis Engelke 2021 Instrew: LLVM-based DBI/DBT

I Fast Dynamic Binary Instrumentation/Translation based on LLVM I Lift whole functions directly to LLVM-IR I Client-server approach enabling further optimizations I 50% less overhead compared to current LLVM-based DBT

Instrew is Free Software! https://github.com/aengelke/instrew

Introduction Machine Code → LLVM Instrumentation Framework Performance Results 23