King’s Research Portal

Document Version Peer reviewed version

Link to publication record in King's Research Portal

Citation for published version (APA): Cono D'Elia, D., Coppa, E., Palmaro, F., & Cavallaro, L. (Accepted/In press). SoK: Using Dynamic Binary Instrumentation for Security (And How You May Get Caught Red Handed). In ACM Asia Conference on Information, Computer and Communications Security (ASIACCS 2019)

Citing this paper Please note that where the full-text provided on King's Research Portal is the Author Accepted Manuscript or Post-Print version this may differ from the final Published version. If citing, it is advised that you check and use the publisher's definitive version for pagination, volume/issue, and date of publication details. And where the final published version is provided on the Research Portal, if citing you are again advised to check the publisher's website for any subsequent corrections. General rights Copyright and moral rights for the publications made accessible in the Research Portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognize and abide by the legal requirements associated with these rights.

•Users may download and print one copy of any publication from the Research Portal for the purpose of private study or research. •You may not further distribute the material or use it for any profit-making activity or commercial gain •You may freely distribute the URL identifying the publication in the Research Portal Take down policy If you believe that this document breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 02. Oct. 2021 1 SoK: Using Dynamic Binary Instrumentation for Security 59 2 60 3 (And How You May Get Caught Red Handed) 61 4 62 5 Anonymous Author(s) 63 6 64 7 ABSTRACT To accomplish this task users can typically resort to instrumen- 65 8 tation techniques, which in their simplest form consist in adding 66 9 Dynamic binary instrumentation (DBI) techniques allow for moni- 67 toring and possibly altering the execution of a running program up instructions to the code sections of the program under observation. 10 One can think of at least two aspects that impact the instrumenta- 68 11 to the instruction level granularity. The ease of use and flexibility of 69 DBI primitives has made them popular in a large body of research in tion strategy that researchers can choose to support their analyses: 12 the availability of the source code for the objects that undergo obser- 70 13 different domains, including software security. Lately, the suitabil- 71 ity of DBI for security has been questioned in light of transparency vation and the granularity of information that should be gathered. 14 Additionally, a researcher may be interested in accessing instru- 72 15 concerns from artifacts that popular frameworks introduce in the 73 execution: while they do not perturb benign programs, a dedicated mentation facilities that let them also alter the normal behavior of 16 the program when specific conditions are observed at run time. 74 17 adversary may detect their presence and defeat the analysis. 75 The contributions we provide are two-fold. We first present the A popular instrumentation paradigm is represented by dynamic 18 binary instrumentation. DBI techniques support the insertion of 76 19 abstraction and inner workings of DBI frameworks, how DBI as- 77 sisted prominent security research works, and alternative solutions. probes and user-supplied analysis routines in a running software 20 for the sake of monitoring and possibly altering its execution up 78 21 We then dive into the DBI evasion and escape problems, discussing 79 attack surfaces, transparency concerns, and possible mitigations. to the instruction level granularity, without requiring access to 22 its source code or modifications to the runtime. The ease ofuse 80 23 We make available to the community a library of detection pat- 81 terns and stopgap measures that could be of interest to DBI users. and flexibility that characterize DBI techniques has favored their 24 adoption in an impressive deal of programming languages, software 82 25 CCS CONCEPTS testing, and security research over the years. 83 26 Lately, the suitability of using DBI for security applications has 84 • Security and privacy → Systems security; Intrusion/anomaly 27 been questioned in light of artifacts that popular DBI frameworks 85 detection and malware mitigation; Software reverse engineering; Soft- 28 introduce in the execution, which may let an adversary detect their 86 ware security engineering. 29 presence and cripple an analysis that hinges on them. This trend 87 30 88 KEYWORDS of research originated in non-academic forums like REcon and 31 BlackHat where security researchers pointed out several attack 89 32 Dynamic binary instrumentation, dynamic binary translation, in- surfaces for DBI detection and escape (e.g., [18, 25, 32, 57]). 90 33 terposition, transparent monitoring, evasion, escape Among academic works, Polino et al. [47] proposed countermea- 91 34 ACM Reference Format: sures for anti-instrumentation strategies found in packers, some of 92 35 Anonymous Author(s). 2019. SoK: Using Dynamic Binary Instrumentation which are specific to DBI. One year later Kirsch29 etal.[ ] presented 93 36 for Security (And How You May Get Caught Red Handed). In Proceedings of a research that instead deems DBI unsuitable for security applica- 94 37 14th ACM ASIA Conference on Computer and Communications Security (ASIA tions, presenting a case study on its most popular framework. 95 38 CCS’19). ACM, New York, NY, USA, 13 pages. https://doi.org/10.475/123_4 96 39 Contributions. In this work we try to approach the problem of 97 40 1 INTRODUCTION using DBI in software security research from a neutral stance, in 98 hopes of providing our readers with insights on when the results 41 Even before the size and complexity of computer software reached 99 of an analysis built on top of DBI should not be trusted blindly. 42 the levels that recent years have witnessed, developing reliable and 100 We distill the DBI abstraction, discuss inner workings and primi- 43 efficient ways to monitor the behavior of code under execution 101 tives of frameworks implementing it, and present a quantitative 44 has been a major concern for the scientific community. Execution 102 overview of recent security literature that uses DBI to back a hetero- 45 monitoring can serve a great deal of purposes: to name but a few, 103 geneous plethora of analyses. We then tackle the DBI evasion and 46 consider performance analysis and optimization, vulnerability iden- 104 escape problems, discussing desired transparency properties and 47 tification, as well as the detection of suspicious actions, execution 105 architectural implications to support them. We categorize known 48 patterns, or data flows in a program. 106 49 adversarial patterns against DBI engines by attack surfaces, and 107 50 Permission to make digital or hard copies of all or part of this work for personal or discuss possible mitigations both at framework design level and 108 classroom use is granted without fee provided that copies are not made or distributed 51 for profit or commercial advantage and that copies bear this notice and the full citation as part of user-encoded analyses. We collect instances of known 109 52 on the first page. Copyrights for components of this work owned by others than ACM adversarial sequences and prototype a mitigation scheme in the 110 must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, 53 form of a high-level library that could be of interest to DBI users. 111 to post on servers or to redistribute to lists, requires prior specific permission and/or a 54 fee. Request permissions from [email protected]. 112 55 ASIA CCS’19, July 2019, Auckland, New Zealand 2 DYNAMIC BINARY INSTRUMENTATION 113 © 2019 Association for Computing Machinery. 56 114 ACM ISBN 123-4567-24-567/08/06...$15.00 DBI systems have significantly evolved since the advent of DynInst [6] 57 https://doi.org/10.475/123_4 as a post-compilation library to support tools that wanted to instru- 115 58 1 116 ASIA CCS’19, July 2019, Auckland, New Zealand Anon.

117 ment and modify programs during execution. In the following we have been patched with trampolines to analysis code. In this work 175 118 present characteristic traits of the DBI abstraction and its embodi- we deal with JIT-based engines only, as they can offer better perfor- 176 119 ments, discussing popular frameworks and alternative technologies. mance for fine-grained instrumentation and are intuitively more 177 120 transparent. Another choice is whether to operate on a native in- 178 121 2.1 The DBI Abstraction struction set or by lifting code to an intermediate representation: 179 180 122 We can think of a DBI system as an application virtual machine that the first choice typically can lead to faster compilation at the priceof 181 123 interprets the ISA of a specific platform (usually coinciding with an increased complexity for the backend, while the other generally 182 124 the one where the system runs) while offering instrumentation favors architectural portability. 125 183 capabilities to support monitoring and altering instructions and Execution Correctness. One of the most critical challenges in the 126 184 data from an analysis tool component written by the user: design of a DBI system is to prevent the native behavior of an appli- 127 185 cation under analysis from inadvertently changing when executing 128 Definition 2.1 (DBI System). A DBI system is an execution run- 186 inside the system. Real-world applications can exercise a good 129 time running in user space for process-level virtualization. An un- 187 deal of introspective operations: common instances are retrieving 130 modified compiled program executes within the runtime as would 188 instruction pointer values and return addresses for function invoca- 131 happen in a native execution, with introspection and execution 189 tions, and iterating over loaded code modules. When the execution 132 control capabilities accessible to its users in a programmatic way. 190 environment orchestrated by the DBI runtime does not meet the 133 191 expected characteristics, an application might exercise unexpected 134 The definition above is meant to capture distinctive traits of 192 behaviors or most likely crash. 135 most DBI embodiments used in research in the last two decades. 193 Bruening et al. [5] identify and discuss three broad categories 136 The components of a DBI runtime are laid out in the same address 194 of transparency requirements related to correctness: code, data, 137 space where program execution will take place, with the program’s 195 and concurrency. An example of code transparency when using JIT 138 semantics being carried out alongside user-supplied code for its 196 compilation is having every address manipulated by the applica- 139 analysis. Alternative designs recently proposed for moving the 197 tion match the one expected in the original code: the DBI system 140 runtime outside the process where the code under analysis executes 198 translates addresses for instance when the OS provides the con- 141 are discussed in Section 5.1 and Section 6. 199 text for a signal or exception handler. Data transparency requires, 142 Inner Workings of DBI Engines. A design goal for a DBI system 200 e.g., exposing the CPU state to analysis code as it would be in a 143 is to make it possible to observe—and possibly alter—the entire 201 native execution (leaving the application’s stack unhindered as the 144 202 architectural state of the program under analysis, including register program may examine it) and not interfering with its heap usage. 145 203 values, memory addresses and contents, and control flow transfers. Concurrency transparency prescribes, e.g., that the runtime does 146 204 To this end, the approach followed by most popular DBI em- not interfere by using additional locks or analysis threads. 147 205 bodiments is to recur to dynamic compilation: the original code Achieving these properties is difficult when handling generic 148 206 of the application is not executed directly, but rather analyzed, in- code, as a system should not make assumptions on how a program 149 207 strumented and compiled using a just-in-time (JIT) compiler. An has been compiled or how it manipulates registers, heap and stack: 150 208 instruction fetcher component reads the original instructions in the for instance, some versions of Microsoft Office read data or execute 151 209 program as they are executed for the first time, offering the engine code located beyond the top of the stack, while aggressive loop 152 the opportunity to instrument them before undergoing compilation. 210 optimizations may use the stack pointer to hold data [5]. 153 The compilation unit is typically a trace, defined as a straight-line 211 154 sequence of instructions ending in an unconditional transfer and Primitives for Analysis. DBI systems offer general-purpose 212 155 possibly with multiple side exits representing conditional branches. that can accommodate a wide range of program analyses, allowing 213 156 Compiled traces are placed in a code cache, while a dispatcher users to write clients (most commonly referred to as DBI tools) that 214 157 component coordinates transfers among compiled traces and new run interleaved with the code under analysis. One of the reasons 215 158 fetches from the original code. Similarly as in tracing JIT compilers behind the vast popularity of DBI frameworks is that DBI architects 216 159 for language VMs, a trace exit can be linked directly to its target to tried not to put too many restrictions on tool writers [5]: although 217 160 bypass the dispatcher when a compiled version is available, while sometimes users may miss the most effective way for the job, they 218 161 inline caching and code cloning strategies can be used to optimize are not required to be DBI experts to implement their program anal- 219 162 indirect control transfers. Special care is taken for instructions that yses. The runtime exposes APIs to observe the architectural state 220 163 should not execute directly, such as those for system call invoca- of a process (e.g., memory and register contents, control transfers) 221 164 tions, as they get handled by an emulator component in a similar from code written in traditional programming languages such as 222 165 way to how privileged instructions are dealt with in virtual machine C/C++, often supporting the invocation of external libraries (e.g., 223 166 monitors for whole-system virtualization [21]. for disassembly or constraint solving). A DBI system may also try 224 167 From the user’s perspective, analysis code builds on instrumen- to abstract away idiosyncrasies of the underlying instruction set, 225 168 tation facilities exposed through an API interface, with the DBI providing functions to intercept generic classes of operations such 226 169 backend taking care of program state switching between analysis as reading from memory or transferring control flow. 227 170 routines and the code under observation. From the client perspective, a generic DBI system provides prim- 228 171 The design space of a DBI engine accounts for different possibil- itives to handle at least the following elements and events: 229 172 ities. An alternative to JIT compilation is the probe-based approach • instructions in the original program to be instrumented; 230 173 where the original program instructions are executed once they • system call invocation, before and after a context switch; 231 174 2 232 SoK: Using Dynamic Binary Instrumentation for Security ASIA CCS’19, July 2019, Auckland, New Zealand

233 • library function invocation, intercepted at the call site and to the analysis tool: by exposing familiar abstractions such as the 291 234 also on return when possible (think of, e.g., tail calls); control-flow graph, functions, and loops, DynInst makes it easier 292 235 • creation and termination of threads and child processes; to implement even complex analyses. However, the intrinsic diffi- 293 236 • dynamic code loading and unloading; culties in the static analysis work required to back them may lead 294 237 • exceptional control flow; DynInst to generate incomplete representations in presence of, e.g., 295 238 • asynchronous control flow (e.g., callbacks, Windows APCs). indirect jumps or obfuscated code sequences. 296 239 Instructions can be exposed to the client when traces are built Frida [28] tries to ease DBI tool writing by letting users write anal- 297 240 in the code cache, allowing it to iterate over the basic blocks that ysis code in JavaScript, executing it within the native application 298 241 compose them, or when code images are first loaded to memory, en- by injecting an engine. The framework targets quick development 299 242 abling ahead-of-time (AOT) instrumentation. AOT instrumentation of analysis code, aiming in particular at supporting reverse engi- 300 243 is useful for instance when analyzing libraries (e.g., to place hooks neering tasks. Due to its flexible design, Frida can support several 301 244 at the beginning of some functions), but cannot access information platforms and architectures, including for instance mobile ones, but 302 245 like basic-block boundaries that is revealed only at run time. its intrusive footprint could be a source of concern. 303 246 The capabilities offered to a client are not limited to execution While for most DBI frameworks guest and host architectures 304 247 inspection, but include the possibility of altering the program behav- coincide, Strata [52] uses software dynamic translation to support 305 248 ior. Common examples are overwriting register contents, replacing different host and guest ISAs, requiring users to implement only 306 249 instructions, and modifying the arguments and return values for a few guest and host-specific components. could technically 307 250 function call or rewiring it to some user-defined function. support different ISAs, but its current implementation does not. 308 251 Sophisticated engines like and DynamoRIO assist users in libdetox [41] featured the first DBI framework design concerned 309 252 tool creation by providing facilities for memory manipulation, with transparency for security uses. Originally used as a founda- 310 253 thread local storage, creation of analysis threads, synchronization, tion for a user-space sandbox for software-based fault isolation, 311 254 and interaction with the OS (e.g., for file creation) that minimize libdetox randomizes the location of the code cache and other in- 312 255 the possibility of interfering with the execution of the application. ternal structures, posing particular attention on, e.g., preventing 313 256 The DBI abstraction can cope with sequences of code interleaved internal pointers overwriting and disabling write accesses to the 314 257 with data, overlapping instructions, statically unknown targets for code cache when the program executes. At least for its publicly 315 258 indirect branches, and JIT code generation on the application side. released codebase, the framework is however vulnerable to some 316 259 Another appealing feature is the possibility for some engines to of the attacks that we will describe throughout Section 4. 317 260 attach to a process and then release it just like a debugger; this Although more DBI frameworks [38, 49, 50] appeared recently, 318 261 might come in handy, e.g., for large, long-running applications [35]. their designs did not introduce architectural changes relevant for 319 262 security uses: for this reason we opted for not covering them. 320 263 321 264 2.2 Popular DBI Frameworks 2.3 Alternative Technologies 322 265 Pin [35], DynamoRIO [5] and Valgrind [40] are possibly the most Source code instrumentation. When the source code of a program 323 266 popularly known DBI frameworks, supporting different architec- is available and the analysis is meaningful regardless of its com- 324 267 tures and operating systems. They have been extensively used in piled form1, instrumentation can take place at compilation time. 325 268 countless academic and industrial projects, providing reliable foun- Analyses can thus be encoded using source-to-source transforma- 326 269 dations for building performant and accurate analysis tools. tion languages like CIL [39] or by resorting to compiler assistance. 327 270 Pin provides robust support for instrumenting binary code run- The most common instance of the latter approach are performance 328 271 ning on Intel architectures. Its instrumentation APIs allow analysis profilers, but in recent years a good deal of sanitizers have been 329 272 tools to register for specific statements (e.g., branch instructions) or built for instance on top of the LLVM compiler. 330 273 events (e.g., thread creation) callbacks to analysis routines that can Operating at source level offers a few notable benefits. Analysis 331 274 observe the architectural state of the program. A recurrent criticism code can be written in an architecture-independent manner, also 332 275 is related to its closed source nature as it limits possible extensions. allowing it to access high-level properties of an application (e.g., 333 276 DynamoRIO is an open source project and unlike Pin it exposes types) that could be lost during compilation. Also, instrumentation 334 277 the entire instruction stream to an analysis tool, allowing users to code undergoes compiler optimization, often leading to smaller 335 278 perform many low-level code transformations directly. The superior performance overheads. However, when the scope of the analysis 336 279 performance level it can offer compared to Pin is still a popular involves interactions with the OS or other software components 337 280 subject of discussion within the DBI community. the applicability of source code instrumentation may be affected. 338 281 Sometimes the analysis code might be coupled too tightly with 339 Static binary instrumentation. A different avenue to instrument 282 details of the low-level binary representation. Valgrind and DynInst 340 a program could be rewriting its compiled form statically. This 283 approach this problem in different ways. Valgrind lifts binary code 341 approach is commonly known as Static Binary Instrumentation 284 to an architecture-independent intermediate representation: this 342 (SBI), and sometimes referred to as binary rewriting. While tech- 285 allowed its developers to port it to many architectures and plat- 343 niques for specific tasks such as collecting instruction traces had 286 forms, to the advantages of analysis tools based on it. However, this 344 already been described three decades ago [17], it is only with the 287 comes with performance penalties that could make it inadequate in 345 288 346 several application scenarios. On the other hand, DynInst tries to 1There are cases where the source may not be informative enough (e.g., for side channel 289 provide high-level representations of the program under analysis detection) or Heisenberg effects appear if the source is altered (e.g., for memory errors). 347 290 3 348 ASIA CCS’19, July 2019, Auckland, New Zealand Anon.

349 ATOM [56] framework that a general SBI-based approach to tool venues looking for works that made use of them. Although this list 407 350 building was proposed. ATOM provided selective instrumentation may not be exhaustive, and a meticulous survey of the literature 408 351 capabilities, with information being passed from the application could be addressed in a separate work, we identified 95 papers and 409 352 program to analysis routines via procedure calls. SBI generally pro- articles from the following venues: 18 for CCS, 7 for NDSS, 6 for 410 353 vides better performance than DBI, but struggles in the presence S&P, 14 for USENIX Security, 10 for ACSAC, 4 for RAID, 4 for ASIA 411 354 of indirect branches, anti-disassembly sequences, dynamically gen- CCS, 2 for CODASPY, 9 for DIMVA, 7 for DSN, 3 for (S)PPREW, and 412 355 erated code (JIT compiled or self-modifying), and shared libraries. 11 among ESSoS, EuroSys, ICISS, ISSTA, MICRO, STC, and WOOT. 413 356 ATOM was shortly followed by other systems (e.g., [30, 51, 68]) 414 357 that gained popularity in the programming languages community Prominent applications. For the sake of presentation, we classify 415 358 especially for performance profiling tasks before the advent of DBI. these works in the following broad categories, reporting the most 416 359 Recent research [67] has shown how to achieve with SBI some of common types of analysis for each of them: 417 360 418 the practical benefits of DBI, such as instrumentation completeness • cryptoanalysis: identification of crypto functions and keys 361 419 along the software stack and non-bypassable instrumentation. Some in obfuscated code [31], obsolete functions replacement [3]; 362 420 obfuscation techniques and self-modifying code remain however • malicious software analysis: e.g., malware detection and clas- 363 problematic, causing execution to terminate when detected. 421 sification37 [ , 65], analysis of adversarial behavior [44], au- 364 422 Cooperation on the runtime side. Another possibility is to move tomatic unpacking [47]; 365 423 the analysis phase on the runtime side, like the virtual machine for • vulnerability analysis: e.g., memory errors and bugs [10, 60], 366 424 managed languages or the for executables. While side channels [64], fuzzing [9], prioritization of code regions 367 425 the former possibility has been explored especially in programming for manual inspection [23], debugging [62]; 368 426 languages research (e.g., using instrumentation facilities of Java • software plagiarism: detection of unique behaviors [61]; 369 427 VMs), the latter has seen several uses in security, for example in • reverse engineering: e.g., code deobfuscation [54], protocol 370 428 malware analysis to monitor API calls using a hooking component analysis and inference [7, 33], configuration retrieval [59]; 371 429 in kernel space. Although finer-grained analyses like instruction • information flow tracking: design of taint analysis engines 372 430 recording or information flow analysis are still possible with this and their optimization [26, 27]; 373 431 approach [16], the flexibility of an analysis component executing • software protection: e.g., control flow integrity [48], detection 374 432 in kernel mode is more limited compared to DBI and SBI. of ROP sequences [13], software-based fault isolation [42], 375 433 code randomization [24], application auditing [69]. 376 Virtual machine introspection. In recent years a great deal of 434 377 research has adopted Virtual Machine Introspection (VMI) tech- This choice left out 3 works that dealt with protocol replay, code 435 378 niques to perform dynamic analysis from outside the virtualized reuse paradigms, and hardware errors simulation, respectively. 436 379 full software stack in which the code under analysis runs. The Categories can have different prevalence in general conferences: 437 380 VMI approach has been proposed by Garfinkel and Rosenblum21 [ ] for instance, 6 out of 14 USENIX Security papers deal with vul- 438 381 to build intrusion detection systems that retain the visibility of nerability detection, but only 2 of the 18 CCS papers fall into it. 439 382 host-based solutions while approaching the degree of isolation of For a specific category, works are quite evenly distributed among 440 383 network-based ones, and became very popular ever since. Inspect- venues: for instance, works in malicious software analysis (14) have 441 384 ing a virtual machine from the outside enables scenarios such as appeared in CCS (5), DIMVA (4), and five other conferences. 442 385 code analysis in kernel space that are currently out of reach of The heterogeneity of analyses built on top of DBI engines is 443 386 DBI systems (an attempt is made in [66]). VMI is possible for both somehow indicative of the flexibility provided by the DBI abstrac- 444 387 emulation-based and hardware-assisted virtualization solutions, tion to researchers for prototyping their analyses and systems. The 445 388 allowing for different trade-offs in terms of execution speed and first works we surveyed date back to 2007, with 7 papers inthat 446 389 flexibility of the analysis. Unlike DBI, VMI incursa semantic gap year. The numbers for the past two years (9 works in 2017, 8 in 2018) 447 390 when trying to inspect high-level concepts of the guest system such are lined up with those from four years before that (9 in 2013, 8 for 448 391 as API calls or threads. Recent research has thus explored ways 2014), hinting that DBI is still very popular in security research. 449 392 to minimize the effort required to build VMI tools, e.g., with auto- 450 393 matic techniques [15] or by borrowing components from memory Usage of DBI primitives. For each work we then identify which 451 394 forensic frameworks [46]. While the ease of use of VMI techniques DBI system is used and what primitives are necessary to support the 452 395 has lately improved with the availability of scriptable execution proposed analysis. We grouped instrumentation actions required 453 396 frameworks [58], performing analyses that require deep inspection by analyses in the following types: 454 397 features or altering properties of the execution other than the out- • instructions, which we further divide in memory operations 455 398 come of CPU instructions (say replacing a function call) remains (for checking every register or memory operand’s content), 456 399 hard for a user, or at least arguably harder than in a DBI system. We call/ret (for monitoring classic calls to internal functions or 457 400 will return to this matter in Section 6 discussing also transparency. library code), branches ([in]direct and [un]conditional), and 458 401 other (for special instructions such as int and rdtsc); 459 402 3 DBI IN SECURITY RESEARCH • system calls, to detect low-level interactions with the OS; 460 403 To provide the reader with a tangible perspective on the ubiquity • library calls, when high-level function call interception fea- 461 404 of DBI techniques in security research over the years, we have tures of the DBI engine (like the routine instrumentation of 462 405 reviewed the proceedings of flagship conferences and other popular Pin) are used to identify function calls to known APIs; 463 406 4 464 SoK: Using Dynamic Binary Instrumentation for Security ASIA CCS’19, July 2019, Auckland, New Zealand

465 dbi primitives 523 application domain instructions system library threads & code exceptions works 466 524 memory r/w calls/rets branches other calls calls processes loading & signals 467 525 Cryptoanalysis ✓ ✓ ✓ ✓ 7 468 Malicious Software Analysis ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 14 526 469 Vulnerability Detection ✓ ✓ ✓ ✓ ✓ ✓ 22 527 Software Plagiarism ✓ ✓ 2 470 Reverse Engineering ✓ ✓ ✓ ✓ ✓ ✓ 9 528 471 Information Flow Tracking ✓ ✓ ✓ ✓ ✓ ✓ 8 529 472 Software Protection ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 30 530 473 Table 1: Application domains and uses of DBI primitives in related literature. 531 474 532 475 • threads and process, when the analysis is concerned with is compelling especially for VM architects and dynamic translator 533 476 intercepting their creation and termination; builders, implies that an application executes as it would in a non- 534 477 • code loading, to intercept dynamic code loading events; instrumented, native execution [5], and that interoperability with 535 478 • exceptions and signals, to deal with asynchronous flows. native applications works normally [12]. To this end, Bruening et 536 479 537 In Table 1 we present aggregate results for application domains, al. [5] identify three guidelines to achieve the execution correctness 480 538 where for each category we consider the union of the instrumen- properties outlined in Section 2.1 when writing a DBI system: 481 539 tation primitives used by the works falling into it. While such [G1] leave the code unchanged whenever possible; 482 540 information is clearly coarse-grained compared to a thorough anal- [G2] when change is unavoidable, ensure it is imperceivable to 483 541 ysis of each work, we observed important regularities within each the application; 484 542 category. For instance, in the cryptoanalysis domain nearly every [G3] avoid resource usage conflicts. 485 543 considered work resorts to all the primitives listed for the group. 486 The authors explain how transparency has been addressed on 544 Tracing all kinds of instructions and operands seems fundamental 487 an ad-hoc basis in the history of DBI systems, as applications were 545 in the analysis of malicious software, while depending on the goal 488 found to misbehave due to exotic implementation characteristics 546 of the specific technique it may be necessary to trace also context 489 with respect to code, data, concurrency, or OS interactions. 547 switches or asynchronous flows. Observe that some primitives are 490 DBI architects are aware that absolute transparency may be 548 intuitively essential in some domains: this is the case of memory 491 unfeasible to obtain for certain aspects of the execution, or that im- 549 accesses for information flow analysis, as well as of control transfer 492 plementing a general solution would cause a prohibitive overhead. 550 instructions in software protection works. “the further we push transparency, 493 In the words of Bruening et al. [5]: 551 the more difficult it is to implement, while at the same time fewer 494 Choices. We identify Pin as the most popular engine in the works 552 applications require it” 495 we survey with 57 uses, followed by Valgrind (19), libdetox (5), . In the end, the question boils down to see- 553 496 DynamoRIO (4) and Strata (3); in some cases the engine was not re- ing whether a presumably rare corner case may show up in code 554 497 ported. Unsurprisingly, Valgrind is very popular in the vulnerability analyzed by the users of the DBI system. 555 498 detection domain with 9 uses, just behind Pin with 12. A second connotation of transparency, which is compelling for 556 499 An important design choice that emerged from many works is software security research, implies the possibility of adversarial 557 500 related to when DBI should be used to back an online analysis or to sequences that look for the presence of a DBI system and thwart any 558 501 rather record relevant events and proceed with offline processing. analyses built on top of it. We defer the discussion of the DBI evasion 559 502 Other than obvious aspects related to the timeliness of the obtained and escape problems to the next section, as in the following we will 560 503 results (e.g., shepherding control flows vs. bug identification) and revisit from the DBI perspective general transparency properties 561 504 nondeterministic factors in the execution, also the complexity of that authors of seminal works sought in analysis runtimes they 562 505 the analysis carried out on top of the retrieved information may considered for implementing their approaches. 563 506 play a role—this seems at least the case with symbolic execution. For an IDS, Garfinkel and Rosenblum [21] identify three capabil- 564 507 A few research works [45, 69] aiming at mitigating defects in ities required to support good visibility into a monitored system 565 508 software devise their techniques in two variants: one for when the while providing significant resistance to both evasion and attack: 566 509 source code is available, and another based on DBI for software in • isolation: code running in the system cannot access or modify 567 510 binary form, hinting at higher implementation effort and possible code and data of the monitoring system; 568 511 technical issues in using SBI techniques in lieu of DBI. • inspection: the monitoring system has access to all the state 569 512 of the analyzed system; 570 513 4 ATTACK SURFACES • interposition: the monitoring system should be able to inter- 571 514 In light of the heterogenous analyses mentioned in the previous sec- pose on some operations, like uses of privileged instructions. 572 515 tion, it is legitimate to ask whether their results may be affected by The last two properties are simply met by the design goals behind 573 516 imperfections or lack of transparency of the underlying DBI engine. the DBI abstraction. Supporting the first property in the presence 574 517 We will deal with these issues throughout the present section. of a dedicated adversary seems instead hard for a DBI engine: as the 575 518 runtime shares the same address space of the code under analysis, 576 519 4.1 Desired Transparency Properties the possibility of covert channels is real. Actually, by subverting 577 520 Existing literature has discussed the transparency of runtime sys- isolation an attacker might in turn also foil the inspection and 578 521 tems for program analysis under two connotations. The first, which interposition capabilities of a DBI system [29]. 579 522 5 580 ASIA CCS’19, July 2019, Auckland, New Zealand Anon.

581 The authors of the Ether malware analyzer [14] present a sim- We opted for a broad definition of the evasion problem for the 639 582 ple abstract model of program execution where transparency is following reason: alongside techniques that actively fingerprint 640 583 achieved if the same trace of instructions is observed in an environ- known artifacts of a DBI engine and deviate the standard control 641 584 ment with and without an analyzer component present. As they use flow accordingly, DBI systems suffer from translation defects that 642 585 hardware virtualization, the model is later generalized to account are common in binary translation solutions and cause the analysis 643 586 for virtual memory, exception handling, and instruction privilege to follow unfeasible execution paths. The most prominent example 644 587 levels. The requirements identified for a system that wants to hide is the use of self-modifying code, which is used both in benign 645 588 memory and CPU changes caused by its own presence are: mainstream programs [5] (resulting in an unintended evasion, and 646 589 • higher privilege: the analyzer runs at a privilege level higher possibly in a program crash) and as part of implicit evasion mecha- 647 590 than the highest level a program can achieve; nisms to cripple dynamic analysis by yielding bogus control flows. 648 591 • privileged access for side effects: if any, side effects can be seen 649 Definition 4.2 (DBI Escape). A code is said to escape DBI-based 592 only at a privilege level that the program cannot achieve; 650 analysis when parts of its instruction and data flows get out of 593 • same basic instruction semantics: aside from exceptions, the 651 the DBI engine’s control and execute natively on the CPU, in the 594 semantics of an instruction is not involved in side effects; 652 address space of the analysis process or of a different one. 595 • transparent exception handling: when one occurs, the ana- 653 596 lyzer can reconstruct the expected context where needed; An attacker aware of the presence of a DBI engine may try to 654 597 • identical timing information: access to time sources is trapped, hijack the control transfers that take place under the DBI hood, 655 598 so that query results can be massaged consistently. triggering flows that may never return under its control. The typical 656 599 657 For an analysis runtime operating in user space, fulfilling all scenario involves leaking an address inside the code cache and 600 658 these requirements simultaneously is problematic, and oftentimes patching it with a hijacking sequence, but more complex schemes 601 659 unfeasible. While DBI systems preserve instruction semantics and have been proposed. As for the second part of the definition, DBI 602 660 can capture exceptions, current embodiments operate at the same frameworks can provide special primitives to follow control flows 603 661 privilege level of the code under analysis. Values retrieved from carried out in other processes on behalf of the code under analysis: 604 662 timing sources can be massaged as in [1, 47] to deal with specific for instance, Pin can handle child processes, injected remote threads, 605 663 detection patterns, but general strategies for manipulating the time and calls to external programs. Implementation gaps are often the 606 664 behavior of a process with realistic answers may be intrinsically main reason for which such attempts could go unnoticed. 607 665 difficult to conceive or computationally too expensive [20]. 608 666 For a fair comparison, we observe that similar problems affect 4.3 Artifacts in Current DBI Embodiments 609 667 to some extent also other other analysis approaches whose design 610 In Section 1 we have mentioned several scientific presentations 668 seems capable of accommodating such requirements in a robust 611 and academic research highlighting flaws in DBI systems. While 669 manner. Let us consider VMI techniques: Garfinkel in20 [ ] describes 612 some of them characterize the DBI approach in its generality, others 670 several structural weaknesses in virtualization technology that an 613 leverage implementation details of a specific engine, but can often 671 attacker may leverage to detect its presence, concluding that build- 614 be adapted to others that follow similar implementation strategies. 672 ing a transparent monitor “is fundamentally infeasible, as well as 615 In hopes of providing a useful overview of the evasion problem 673 impractical from a performance and engineering standpoint”. Attacks 616 to researchers that wish to use DBI techniques in their works, we 674 to VMI-based analyses are today realistic: for instance, performance 617 propose a categorization of the techniques that are known to date 675 differences can be observed due to TLB entry eviction4 [ ], and the 618 to detect the presence of a DBI system. We will refer to Pin on 676 falsification of timing information can be imperfect [43]. 619 Windows in many practical examples, as the two have received 677 In the next sections we will present the reader with practical 620 significantly more attention than any other engine/OS in theDBI 678 attack surfaces that a dedicated adversary may use to detect a DBI 621 evasion literature. We start by enumerating the attack surfaces we 679 system, and discuss how analyses running on one can be impacted. 622 identified in such research, and then move to in-depth individual 680 623 discussions: time overhead, leaked code pointers, memory contents 681 624 4.2 DBI Evasion and Escape and permissions, DBI engine internals, interactions with the OS, ex- 682 ception handling, and translation defects. 625 The imperfect transparency of DBI systems has led researchers to 683 626 design a plethora of detection mechanisms to reveal the presence Time overhead. The process of translating and instrumenting the 684 627 of a DBI framework from code that undergoes dynamic analysis. original instructions in traces to be placed in cache and eventually 685 628 Once an adversary succeeds, the code can either execute misleading executed introduces an inevitable slowdown in the execution. This 686 629 actions to deceive the analysis, or attempt to carry out execution slowdown grows with the granularity of the required analysis: for 687 630 flows that go unnoticed by the engine. These scenarios are popularly example, tracing memory accesses is significantly more expensive 688 631 known as the DBI evasion and escape problem, respectively. than monitoring function calls. An adversary may try generic time 689 632 measurement strategies for dynamic analyses that compare the 690 633 Definition 4.1 (DBI Evasion). A code is said to evade DBI-based execution of a code fragment to one in a reference system and 691 634 analysis when its instruction and data flows eventually start di- look for significant discrepancies. There are peculiarities of DBI 692 635 verging from those that would accompany it in a native execution, that could be exploited as well: for instance, the time required to 693 636 either as a result of a decision sequence that revealed the presence dynamically load a library from user code can be two orders of 694 637 of a DBI engine, or because of translation defects on the DBI side. magnitude larger [18] under DBI due to the image loading process. 695 638 6 696 SoK: Using Dynamic Binary Instrumentation for Security ASIA CCS’19, July 2019, Auckland, New Zealand

697 Similarly, effects of the trace compilation process can be exploited are userspace VMs their presence can be revealed by interacting 755 698 by observing fluctuations in the time required to take a branch in with the OS. A classic example is to check for the parent process [18, 756 699 the program from the first time it is observed in the execution [29]. 32] or the list of active processes to reveal Pin or DynamoRIO. 757 700 Handles can reveal the presence of a DBI engine too, for instance 758 Leaked code pointers. A pivotal element in the execution trans- 701 when fewer than expected are available for file manipulation32 [ ] or 759 parency of DBI is the decoupling between the real instruction 702 when their names give away the presence of, e.g., Pin [18]. We found 760 pointer and the virtual one exposed to the code. There are however 703 out that Pin may be revealed also by anti-debugging techniques 761 subtle ways for an adversary to leak the real IP and compare it 704 based on NtQueryInformationProcess and NtQueryObject. 762 against an expected value. One way inspired by shellcode writing 705 763 is to use special x87 instructions that are used to save the FPU state 706 Exception handling. DBI engines have capabilities for hooking 764 (e.g., for exception handling) in programs: an adversary executes 707 exceptional control flow: for instance, this is required to provide 765 some x87 instruction (like pushing a number to the FPU stack) and 708 SEH handlers with the same information that would accompany the 766 then uses fstenv, fsave, or one of their variants to write the FPU 709 exception in a native execution [53]. There are however cases that 767 state to memory. The materialized structure will contain the EIP 710 DBI embodiments may not deal with correctly. For instance, we 768 value for the last performed FPU instruction, which DBI engines 711 found out that Pin may not handle properly single-step exceptions 769 typically do not mask: a check on its range will thus expose DBI [18]. int 2d 712 and instructions used in evasive malware, with the sample 770 Another way in 32-bit Windows is the int 2e instruction normally 713 not seeing the expected exception [1]. 771 used to enter kernel mode on such systems: by clearing EAX and 714 772 EDX before invoking it, the real IP is leaked to EDX [47]. Translation defects. Analysis systems that base their working 715 on binary translation are subject to implementation defects and 773 716 Memory contents and permissions. A major source of transparency limitations: this is true for DBI but also for full system emulators 774 717 concerns is that a DBI engine shares the same address space of the like QEMU. A popular example is the enter instruction that is not 775 718 analyzed code without provisions for isolation. The inspection of implemented in Valgrind [29]. DBI architects may decide to not 776 719 the address space can reveal additional sections and unexpected support rarely used instructions; however, some instructions are 777 720 exported functions from the runtime [18]; the increased memory intrinsically challenging for DBI systems: consider for instance far 778 721 usage could be an indicator of the presence of DBI as well [32]. returns, which in Pin are allowed only when within the same code 779 722 An adversary may look for recurrent patterns that are present in segment. Similarly, Pin cannot run “Heaven’s gate code” to jump 780 723 the code components (the runtime and the user’s analysis code) of into a 64-bit segment from a 32-bit program by altering the CS 781 724 the DBI system and in their data, or in heap regions used for the selector. DynamoRIO does not detect the change, paving the way to 782 725 code caching mechanism [18]. Another issue could be the duplicate the Xmode code evasion [57] that uses special instructions having 783 726 presence of command-line arguments in memory [18]. the same encoding and disassembly on both architectures to yield 784 727 Also the memory layout orchestrated and exposed by the DBI different results due to the different stack operations size. 785 728 engine to the application under analysis can be stressed for consis- We put in this category also uses of self-modifying code (SMC): 786 729 tency by adversarial sequences. For instance, Pin and DynamoRIO intuitively, SMC should always lead to invalidation of cached trans- 787 730 miss permission violations when the virtual IP falls into code pages lated code, but implementations may miss its presence. In 2010 788 731 for which access has been disabled (PAGE_NOACCESS) or guarded Martignoni et al. reported: “the presence of aggressive self-modifying 789 732 (PAGE_GUARD) by the application [25]. Similarly, an engine may code prevents [...] from using efficient code emulations techniques, 790 733 erroneously process and execute code from pages that were not such as dynamic binary translation and software-based virtualiza- 791 734 granted execution permissions [29]. tion” [36]. Detecting SMC sequences affecting basic blocks other 792 735 793 DBI engine internals. While the CPU context exposed to the ap- than the current is nowadays supported by many DBI engines, while 736 794 plication under analysis is masked by the DBI abstraction, some SMC on the executing block complicates the picture: DynamoRIO 737 795 changes applied to the execution context to assist a DBI runtime handles it correctly, while Pin has caught up from its 3.0 release 738 796 are not. One instance is represented by Thread Local Storage slots: providing a strict SMC policy option to deal with such cases. 739 797 developers can use TLS to maintain thread-specific storage in con- 740 We assembled a collection of detection sequences that might 798 current applications, but for the sake of efficiency DBI engines 741 interest DBI users: we searched for PoCs made available by authors 799 may occupy slots with internal data [57]. Another attack surface is 742 of evasion research for the techniques described in this section, and 800 represented by DLL functions that are altered by the engine: while 743 created our own for those that were not accompanied by one. 801 the vast majority of library code goes unhindered through the DBI 744 802 engine, in special cases trampolines should be inserted at their head. 745 4.4 Escaping from Current DBI Embodiments 803 In the case of Pin on Windows systems ntdll is patched for [53]: 746 To the best of our knowledge, the first DBI escape attack has been 804 • KiUserExceptionDispatcher, to distinguish exceptions in 747 described in a 2014 blog post by Cosmin Gorgovan [22]: the au- 805 the running code from internal (engine/analysis code) ones; 748 thor investigated how weaknesses of current DBI implementations 806 • KiUserApcDispatcher and KiUserCallbackDispatcher, 749 highlighted in evasion-related works could pave the way to DBI 807 as the Windows kernel can deliver asynchronous events; 750 escape. Intuitively, an avenue is represented by having code cache 808 • LdrInitializeThunk, to intercept user thread creation. 751 locations readable and writable also by the code under analysis. 809 752 Interactions with the OS. DBI engines are concerned with the Instead of leveraging a leaked pointer artifact, the author sug- 810 753 transparency of the execution space of an application, but as they gests encoding a fairly unique pattern in a block that gets executed 811 754 7 812 ASIA CCS’19, July 2019, Auckland, New Zealand Anon.

813 and looking it up in the code cache once the latter’s randomized lo- efficacy is mostly tied to detection patterns observed in a specific 871 814 cation is determined by querying the OS memory map. The pattern domain. A general solution based on realistic simulations of the time 872 815 gets overwritten with a trampoline, and when execution reaches the elapsed in executing instructions as if the cost of DBI were evicted is 873 816 block again the code escapes. Gorgovan also explains how to make believed impractical for dynamic analyses [20]. Also, such schemes 874 817 execution gracefully return under the control of the DBI system. may be defeated by queries to other processes not running under 875 818 The code cache is not the only region that can be tampered with: DBI or to external attacker-controlled time sources. Compared to 876 819 escape attacks involving internal data, callbacks, and stack that VMI solutions where one may patch the time sources of the entire 877 820 are specific to a DBI engine are described in[57]. More recently, guest machine, DBI architects are thus left with (possibly much) 878 821 [29] describes a complex data-only attack for COTS applications less wiggle room to face timing attacks. 879 822 880 running in Pin that uses a known memory corruption bug to escape, Leaked code pointers. We have mentioned subtle ways to leak 823 881 leveraging relative distances between regions that host, e.g., the code cache addresses through execution artifacts for specific code 824 882 libc and the code cache that are predictable in some releases. patterns, namely FPU context saving instructions and context switch- 825 883 ing sequences for syscalls. While these leaks do not seem to bother 826 884 5 COUNTERMEASURES the execution of classic programs under DBI, at the price of an 827 885 In the following we investigate how transparency problems of DBI increased overhead a framework could be extended to patch them 828 886 can be mitigated by revisiting the design space of engines, and as soon as they become visible to user code2. 829 when stopgap measures shipped with user code could be helpful. 887 830 Memory contents and permissions. Presenting the code under 888 analysis with a faithful memory state as in a native execution is 831 5.1 Design Space of DBI Frameworks 889 832 inherently difficult for a DBI system, as it operates at thesame 890 833 The design of general-purpose embodiments of the DBI abstrac- privilege level and in the same address space of the program. We 891 834 tion has historically been driven by the necessity of preserving identify three aspects that matter in how memory gets presented 892 835 execution correctness while obtaining an acceptable performance to the code under analysis. 893 836 level. While general techniques for dynamic code generation and The first aspect is correctness. As we have seen, memory permis- 894 837 compilation have dramatically improved in the past two decades sions mirroring a native execution may not be enforced faithfully, 895 838 in response to the ever-growing popularity of languages for man- resulting in possibly unfeasible executions. For instance, a DBI sys- 896 839 aged runtimes, DBI architects have to face additional, compelling tem may not detect when the virtual instruction pointer reaches a 897 840 execution transparency problems. They thus strove to improve the region that is not executable, continuing the instruction fetching 898 841 designs and implementations of their systems as misbehaviors were process rather than triggering an exception. This may result in, e.g., 899 842 observed in the analysis of mainstream applications [5], backing reviving classic buffer overflow attacks as shown29 in[ ]. Similarly, 900 843 popular program analysis tools such as profilers, cache simulators, failing to reproduce guard page checks may be problematic when 901 844 and memory checkers. dealing with programs like JIT compilers that use them. Adding 902 845 The security research domain is however characterized by appli- these checks may introduce unnecessary overhead for many ana- 903 846 cation scenarios where the program under inspection may resort lyzed applications, so they could be made available on demand. 904 847 to a plethora of generic or DBI-specific techniques to elude the The second aspect is escaping. When an adversary gets a pointer 905 848 analysis or even tamper with the runtime. A researcher may thus inside the code cache, injection attacks become a possibility. This 906 849 wonder how the design of a general-purpose DBI framework could problem is shared with general-purpose JIT engines [55], such as 907 850 be adapted to deal with common categories of adversarial sequences. those used in browsers for running Javascript code efficiently. One 908 851 In some cases the required changes could not be easy to be accom- possibility could be to use read-only code caches as in the mobile 909 852 modated by a DBI framework with a wide audience of users from version of Safari, incurring a performance penalty by switching it 910 853 different domains, but could be sustainable for an engine thatis to writable non-executable when modifications are made by the 911 854 designed with specific security applications in mind: one prominent engine; during a transition however legit accesses from concurrent 912 855 example is SafeMachine [25], a proprietary DBI framework used threads could fail, and proper handling would be needed. 913 856 by the Avast security firm for fine-grained malware analysis. Keeping thread-specific code caches mitigates the problem atthe 914 857 We will now revisit the attack surfaces from Section 4.3 from a price of increased memory occupancy3 and is available as an option 915 858 DBI architect’s perspective, referencing research works that have for instance in DynamoRIO. This solution is however still subject 916 859 dealt with specific aspects and discussing other viable options. to race conditions attacks as shown in [55]. The authors propose 917 860 Time overhead. Dealing with the run-time overhead from a dy- a mechanism where a writable code cache is moved to another 918 861 namic analysis is an old problem in research. The overhead of a process, and its pages shared with the process under analysis with 919 862 DBI system is not easy to hide, and it may not only be revealed read-only privileges. Their approach incurs a small performance 920 863 by adversarial measurement sequences, but as discussed in [5] can overhead on Strata and on the V8 JavaScript engine. 921 864 also affect the correct execution of code sensitive to time changes. The third aspect is evasion. Due to the shared address space, an 922 865 Also, the time spent in analysis code might exceed the cost of mere adversary may simply register an exception handler and scan the 923 866 instrumentation depending on the type of analysis carried out. memory looking for unexpected contents, such as the presence of 924 867 Previous research in malware analysis has explored mechanisms 925 868 926 to alter the time behavior perceived by the process by faking the 2https://github.com/dynamorio/dynamorio/issues/698. 869 results of time queries from different sources [1, 47]. However, their 3This may however even let instrumentation code run faster when thread-specific. 927 870 8 928 SoK: Using Dynamic Binary Instrumentation for Security ASIA CCS’19, July 2019, Auckland, New Zealand

929 the code cache or internal structures and code from the runtime. could be problematic as well, but a kernel module could be used to 987 930 Shepherding all memory accesses would solve the evasion problem, capture them [5]. 988 931 but the overhead could increase dramatically [5]. 989 Exception handling. A DBI system has to present a faithful con- 932 One mitigation proposed in Strata is to use metamorphism to 990 text to the application in the presence of exceptions and signals. 933 alter the contents of the code cache, hindering evasion and escape 991 DBI architects are faced with different options in when (if an in- 934 attacks based on pattern recognition. This could be paired with 992 terruption can be delayed) and how the context translation has to 935 guarded pages and mechanisms to move the location of the code 993 take place; also, there are cases extreme enough for mainstream 936 cache (and possibly internal structures) periodically during the 994 applications that can be handled loosely [5]. Systems like Valgrind 937 execution, triggering however the invalidation of all compiled code 995 that work like emulators by updating the virtual application state at 938 as if the engine were detached and reattached to the process. 996 every executed instruction are not affected by this problem. How- 939 These countermeasures would still not be sufficient against at- 997 ever, they incur a higher runtime overhead compared to others (e.g., 940 tackers with code reading capabilities. Consider the case of a loop 998 Pin, DynamoRIO) that reconstruct the context only when needed. 941 trying to read the page containing its own instructions in the cache: 999 942 such an access cannot be denied on the x86 architecture. An answer Translation defects. Instruction errata and alike defects can be 1000 943 to the problem may come from recent research on non-readable ex- brought under two main categories: implementation gaps and de- 1001 944 ecutable memory (XnR) [2] to prevent disclosure exploits for code sign choices. Apart from challenging sequences such as 32-to-64 1002 945 reuse attacks. While the default XnR setting does not support our switches, errata from the first category can be addressed through 1003 946 “in-page” reading loop scenario, follow-up techniques can be used implementation effort for code domains where they are problematic. 1004 947 to handle it: in particular, [63] shows how to achieve an effective On the other hand, defects may arise due to design choices aimed 1005 948 separation between read and execution capabilities using Extended at supporting efficient execution of general programs: this isthe 1006 949 Page Tables on a thin hypervisor layer. Further hardware assis- case with self modifying code within the current basic block, which 1007 950 tance could ease both the implementation and deployment effort: as we said can be detected in Pin with an optional switch at the 1008 951 for instance, the ARMv8 processor provides facilities to support price of an increased overhead. 1009 952 this mechanism at kernel level [63], while the Intel Skylake x86 1010 To conclude our discussion, we would like to mention the possi- 953 architecture has introduced Memory Protection Keys to control 1011 bility of having user-supplied analysis code pollute the transparency 954 memory permissions in user space that could be used to achieve 1012 of the execution, for instance in its context reconstruction process. 955 executable-only memory as described in [34]. 1013 956 Frameworks like Pin that support registering analysis callbacks 1014 DBI engine internals. Changes to the execution context may be 957 might be slightly easier to use for analysis writers compared to oth- 1015 necessary for the sake of performance, for instance to keep internal 958 ers like DynamoRIO that let users manipulate statements directly, 1016 data structures of the runtime quickly accessible using TLS mech- 959 but DBI systems may hardly avoid leaving part of the responsibility 1017 anisms. An engine may attempt to randomize the TLS slot in use 960 for transparency in the hands of their users. 1018 and hide its presence to queries from the application, but when an 961 1019 adversarial sequence tries to allocate all the slots the engine can 962 5.2 Mitigations at Analysis Code Level 1020 either abort the execution (similarly as in what happens when the 963 While revisiting architectural and implementation choices behind 1021 memory occupied by the engine prevents further heap allocation 964 a DBI system can bring better transparency by design, some re- 1022 by the program) or resort to a less efficient storage mechanism. 965 search has explored how user-provided analysis code can mitigate 1023 The presence of trampolines on special DLL functions could be 966 artifacts of mainstream DBI systems and defeat evasive attempts 1024 hidden using the same techniques for protecting the code cache, 967 observed in some applications domains. This approach has been 1025 providing the original bytes expected in a read operation as in [47]. 968 proposed in PinShield [47] and adopted in the context of executable 1026 Write operations are instead more difficult, as the attacker may 969 unpacking in the presence of anti-instrumentation measures [47] 1027 install a custom trampoline that either returns eventually to the 970 and for malware dissection of highly evasive samples [1]. 1028 original function (which still needs to be intercepted by the engine) 971 We have designed a high-level DBI library that could run in prin- 1029 or simply alter the standard semantics for the call in exotic ways. 972 ciple in existing analysis systems to detect and possibly counter 1030 973 Interactions with the OS. DBI frameworks can massage the pa- DBI evasion and escape attempts. We revisited the design of Pin- 1031 974 rameters and output values of some library and system calls in Shield to achieve better performance when shepherding memory 1032 975 order to achieve the design guideline G2, that is, hiding unavoidable accesses, and introduced protective measures to cope with mem- 1033 976 changes from the program (Section 4.1). For instance, DynamoRIO ory permission consistency (e.g., to enforce NX policies and page 1034 977 intercepts memory query operations to its own regions to let the guards), pointer leaks using FPU instructions, inconsistencies in 1035 978 program think that such areas are free [5]. While allocations in such exception handling, and a number of detection queries to the OS 1036 979 regions could be allowed in principle by moving the runtime in (e.g., for when the DBI engine is revealed a debugger). 1037 980 other portions of the address space, resource exhaustion attacks are For a prototype implementation we chose the most challenging 1038 981 still possible on 32-bit architectures, and are not limited to memory setting for user-provided stopgap measures: we target the popular 1039 982 (for instance, file descriptors are another possibility). As system combination of Pin running on Windows. Unlike DynamoRIO, Pin 1040 983 call interposition can incur well-known traps and pitfalls [19], DBI does not rewrite the results of basic fingerprinting operations that 1041 984 architects implement such strategies very carefully. Observe that can give away its presence like OS queries about memory. As it 1042 985 remote memory modifications carried out from another process is closed source, it cannot be inspected or modified to ease the 1043 986 9 1044 ASIA CCS’19, July 2019, Auckland, New Zealand Anon.

100 1045 leak nx rw full paranoid libdft consider different protection levels: pointer leaks, NX and page 1103 23.84 21.73 17.40 1046 17.04 guard checks on indirect transfers, denying RW access to DBI re- 1104

7.02 6 6.71 1047 10 6.62 1105 5.92

4.95 gions, the three strategies together, and a paranoid variant . We 4.43 3.91 3.73 3.82 3.66 2.98 2.95 2.59 1048 2.61 1106 2.25 2.31 also consider a popular taint analysis library for byte-level tracking 1.15 1.11 1.09 1.0 1.06 1.05 1107 1.03 1.03

1049 1.02 1 1.01 granularity in its default configuration with 1-byte tags. Due to lack 1050 Slowdown w.r.t. Pin of space (a more complete discussion is provided in Appendix A) 1108 1051 we report figures for a subset of benchmarks in Figure 1. 1109 1052 0.1 1110 400.perlbench 401.bzip2 403.gcc 429.mcf 473.astar Tracking x87 instructions for leaks has a rather limited impact 1053 Figure 1: Performance impact of user-provided mitigations. on execution time. Enforcing NX and page guard protection on indi- 1111 1054 rect transfers seems cheap as well. Shepherding memory read/write 1112 1055 accesses incurs a high slowdown, but smaller than the one intro- 1113 implementation of mitigations, and using Windows makes imme- 1056 duced by the heavy-duty analysis of libdft. High overheads were 1114 diate cooperation on the OS side (e.g., using a kernel driver) more 1057 expected, but we do not find these figures demoralizing: while some 1115 difficult. 1058 performance can be squeezed by optimizing the integration with 1116 1059 Approach. We pursued a design for the library that could be the backend and with analyses meant to run on top of it, we be- 1117 1060 portable to other frameworks and Linux, avoiding uses of specific lieve a fraction of this gap can be reduced if countermeasures get 1118 1061 primitives or choices that could tie it to Pin’s underpinnings. implemented inside the engine. We used evasive packer programs 1119 1062 To shepherd memory accesses, we maintain a shadow page table to stress the implementation, and we were able to run code packed 1120 1063 as an array indexed by the page number for the address. We monitor with PELock that [47] based on PinShield could not handle properly. 1121 1064 basic RWX permissions and page guard options using 4 bits in a 1122 4 Discussion. Mitigating artifacts using user-level code is a slippery 1065 1-byte element per page . For 32-bit Windows, this yields a table of 1123 road. As we mentioned in Section 5.1, there are aspects in system 1066 512 kbytes for the 2-GB user address space, with recently accessed 1124 call interposition that if overlooked can lead security tools to easily 1067 fragments likely to be found in the CPU cache for fast retrieval. 1125 be circumvented: one of them is incorrectly replicating OS seman- 1068 We update the table in presence of code loading events and every 1126 tics. We follow the recommendations from [19] by querying the OS 1069 time the program (or Windows components on its behalf) allocates, 1127 to capture the effects of system calls that manipulate regions ofthe 1070 releases, or changes permissions for memory, hooking system calls 1128 address space: this helped us in dealing with occasional inconsisten- 1071 and events that may cause such changes to the address space. When 1129 5 cies between the arguments or output parameters for such calls and 1072 a violation is detected, we create an exception for the application . 1130 the effects observed on the address space. Our library pursues at 1073 Possible code pointer leaks from FPU instructions are intercepted 1131 analysis code level the DBI system design guideline G2 on making 1074 as revealing instructions get executed: we replace the address from 1132 discrepancies imperceivable to the monitored program (Section 4.1), 1075 the code cache with the one in the original code. For this operation 1133 and its performance impact could be attenuated if cooperation oc- 1076 one can either resort to APIs possibly offered by an engine to 1134 curs on the DBI runtime side, for instance by supporting automatic 1077 convert addresses, or monitor the x87 instructions that cause the 1135 (optimized) guard insertion during trace compilation. 1078 FPU instruction pointer to change with a shadow register. Although 1136 1079 more expensive, we opted for the second approach for generality. 6 WRAP-UP AND CLOSING REMARKS 1137 1080 Exception handling inconsistencies may be unrelated to memory: 1138 In the previous sections we have illustrated structural weaknesses 1081 this is the case with the single-step exception and int 2d attacks 1139 of the DBI abstraction and its implementations when analyzing code 1082 described in [1] and found in malware and executable protectors. 1140 in a software security setting, and discussed mitigations to make 1083 We intercept such sequences and forge exceptions where needed. 1141 DBI frameworks more transparent—at the price of performance 1084 Due to lack of space, we refer the reader to our source code 1142 penalties—in the form of adjustments to their design or stopgap 1085 for mitigations made of punctual countermeasures, such as pointer 1143 measures inside analysis tools. We conclude by discussing implica- 1086 leaks with int 2e attacks and detections based on debugger objects. 1144 tions for researchers that want to use DBI for security with respect 1087 1145 Overhead. The mitigations presented above can have a signifi- to instrumentation capabilities and to the relevance of the eva- 1088 1146 cant impact on the baseline performance level offered by an engine sion and escape problems, putting into the equation the attacker’s 1089 1147 running an empty analysis code. Shepherding memory accesses is capabilities and what is needed to counter adversarial sequences. 1090 a daunting prospect for DBI architects [5], let alone when imple- 1148 1091 mented on top of the engine. However, it may be affordable for a Choosing DBI in the first place. As we have seen in Section 3, the 1149 1092 user-defined analysis that already has to track such operations such flexibility of DBI primitives has supported researchers in developing 1150 1093 as taint analysis. Similarly, conformance checking on NX policy a great deal of analysis techniques over the years. Especially when 1151 1094 slows down the execution as it requires that target of branches the source code of a program is not available, there are essentially 1152 1095 be checked, but may be acceptable for code that already validates two options that could be explored other than DBI: SBI and VMI 1153 1096 transfers, for instance to support CFI or other ROP defenses. techniques. Although tempting to make a general statement on 1154 1097 We conducted a preliminary investigation on the SPEC CPU2006 when one approach should be preferred, we believe the picture 1155 1098 benchmarks commonly used to analyze DBI systems [5, 35]. We is not so simple, and thorough methodological and experimental 1156 1099 comparisons would be required for different application domains. 1157 4We leave 4 bits to encode more policies or host data from upper analysis layers. 1100 1158 5PinShield in such cases allows the access but rewires it to another region. We can 6For when ESP holds data or EIP flows into a page with inconsistent permissions via 1101 still use their approach for instance to protect ntdll trampolines. hard-wired jump offsets or branchless sequences that cross page boundaries [25]. 1159 1102 10 1160 SoK: Using Dynamic Binary Instrumentation for Security ASIA CCS’19, July 2019, Auckland, New Zealand

1161 We can however elaborate on three aspects that could affect a the execution of unfeasible flows from inconsistencies in enforcing 1219 1162 researcher’s choice. The first aspect is related to the instrumentation NX policies can be avoided by shepherding control transfers. 1220 1163 capabilities of each technology. SBI can instrument a good deal What really raises the bar for DBI engines is coping with an 1221 1164 of program behaviors as long as static inference of the necessary attacker that can register exception handlers and force memory 1222 1165 information is possible. On the other hand, VMI can capture generic operations also on regions marked as unallocated in massaged OS 1223 1166 events at whole-system level regardless of the structure of the code, queries. The runtime overhead of shepherding every memory access 1224 1167 using libraries that bridge the semantic gap to determine which is intrinsically high for current DBI designs, but technologies like 1225 1168 events belong to the code under analysis. However, current VMI executable-only memory may come to the rescue in the future. 1226 1169 systems cannot make queries or execute operations using the APIs One may also wonder whether the popularity of an analysis 1227 1170 provided by the OS, unlike DBI systems that naturally let their users technique can eventually lead to the diffusion of ad-hoc adversarial 1228 1171 to. A 2015 work proposed with PEMU [66] a new design to move sequences in the code it is meant to analyze. In the history of DBI 1229 1172 DBI instrumentation out of VM, providing a mechanism called we are aware of ad-hoc evasions against Pin in some executable 1230 1173 forwarded guest syscall execution to mimic the normal functioning packers [11], as simple DBI-based unpacking schemes from a few 1231 1174 of a DBI engine; however, the public codebase the authors made years ago were effective against older generations of packers. On 1232 1175 available seems no longer under active development. the contrary, there is a possibility that even when an attack surface 1233 1176 The second aspect is related to the deployability of the analysis is well known and not particularly hard to exploit, as in the case of 1234 1177 system runtime. In the case of SBI, the requirements are typically implicit flows against taint analysis8 [ ], adversaries may focus their 1235 1178 limited as the original binary gets rewritten. For the DBI abstrac- attention elsewhere for anti-analysis sequences: for the proposed 1236 1179 tion, the use of process virtualization paves the way for building example, malware authors in late years seem rather to have con- 1237 1180 tools that operate on a single application in a possibly lightweight centrated their efforts on escaping sandboxing technologies and 1238 1181 manner, enabling their use also in production systems, e.g., when hindering code analysis with obfuscation strategies such as opaque 1239 1182 legacy binary are involved [5]. For VMI technology, bringing up an predicates and virtualization. We thus find it hard to speculate on 1240 1183 emulated or virtualized environment may very well be a daunting an arms race that could involve DBI evasion in the near future. 1241 1184 prospect or a natural choice depending on the application scenario. Finally, as we mentioned in Section 4.1 also approaches like 1242 1185 The third aspect is related to whether the analysis should not VMI that are more transparent by design can become fragile in the 1243 1186 only monitor, but also alter the execution when needed. To this end, presence of a dedicated adversary. Every technology has its own 1244 1187 DBI and VMI are both capable of detecting and altering specific trade-offs between transparency and aspects like performance or 1245 1188 instructions also when generated at run time. VMI technology is instrumentation capabilities. In the case of SBI, the approach can 1246 1189 currently lackluster in aspects that involve replacing entire calls or offer better performance than DBI, but its transparency is more frag- 1247 1190 sequences in the text of a program; on the other hand, system call ile than DBI and VMI: for instance, even recent SBI embodiments 1248 1191 authorization can be dealt with as context switches occur. cannot offer protection against introspective sequences when these 1249 1192 are not identified in the preliminary static code analysis phase. 1250 1193 Dealing with adversarial code. A fourth aspect, possibly the most 1251 Closing remarks. In this work we attempted to systematize exist- 1194 appealing for our readers, can be added to the technological discus- 1252 ing knowledge and speculate on a problem that recently received a 1195 sion: adversarial sequences. We should distinguish between general 1253 good deal of attention in the security community, with opinions 1196 detection techniques, which may affect more technologies at once, 1254 sometimes at odds with one another. We think that using DBI for 1197 and ad-hoc detection patterns, sometimes for a specific runtime. 1255 security is not a black and white world, but is more about how the 1198 For example, code very sensitive to time variations is likely 1256 DBI abstraction is used once the user is aware of its characteristics 1199 to deviate from the normal behavior when executing under DBI 1257 in terms of execution transparency, understood not only as a se- 1200 or other in-guest dynamic analysis systems: consider for instance 1258 curity problem, but sometimes also as a correctness one. Hoping 1201 time-based anti-analysis strategies in evasive malware. On the other 1259 that other researchers may benefit from it, we make available our 1202 hand, code checksumming sequences give away several analysis 1260 library of detection patterns and mitigations for Pin at: 1203 systems, but DBI ones are normally not bothered by them. 1261 1204 A crucial issue is related to the characteristics of the code that un- 1262 1205 dergoes analysis, that is, if there is the possibility for an attacker to 1263 1206 embed adversarial patterns specific to DBI. This is particularly (but REFERENCES 1264 1207 not only) the case of research focused on malicious code analysis [1] [n. d.]. Blinded reference. 1265 [2] Michael Backes, Thorsten Holz, Benjamin Kollenda, Philipp Koppe, Stefan Nürn- 1208 1266 and reverse engineering activities. Such code can clearly challenge berger, and Jannik Pewny. 2014. You Can Run but You Can’t Read: Preventing 1209 massaged results and other mitigations for transparency issues put Disclosure Exploits in Executable Code (CCS ’14). ACM. 1267 1210 in place by the engine or the analysis tool. [3] Otto Brechelmacher, Willibald Krenn, and Thorsten Tarrach. 2018. A Vision for 1268 Enhancing Security of Cryptography in Executables (ESSoS ’18). Springer. 1211 When the adversary does not have arbitrary memory read ca- [4] Michael Brengel, Michael Backes, and Christian Rossow. 2016. Detecting 1269 1212 pabilities, mitigating leaked code pointers is already sufficient in Hardware-Assisted Virtualization (DIMVA ’16). Springer. 1270 1213 [5] Derek Bruening, Qin Zhao, and Saman Amarasinghe. 2012. Transparent Dynamic 1271 most cases to hide the presence of extra code regions, and different Instrumentation (VEE ’12). ACM. 1214 attack surfaces should be tried, such as exhaustion or prediction [6] Bryan Buck and Jeffrey K. Hollingsworth. 2000. An API for Runtime Code 1272 1215 attacks for memory allocations. When the adversary cannot write Patching. Int. J. High Perform. Comput. Appl. 14, 4 (Nov. 2000). 1273 [7] Juan Caballero, Pongsin Poosankam, Christian Kreibich, and Dawn Song. 2009. 1216 1274 to arbitrary memory locations and leaked pointers have already Dispatcher: Enabling Active Botnet Infiltration Using Automatic Protocol Reverse- 1217 been dealt with, escaping attempts are essentially contained, while engineering (CCS ’09). ACM. 1275 1218 11 1276 ASIA CCS’19, July 2019, Auckland, New Zealand Anon.

1277 [8] Lorenzo Cavallaro, Prateek Saxena, and R. Sekar. 2008. On the Limits of Infor- [38] Collin Mulliner, Jon Oberheide, William Robertson, and Engin Kirda. 2013. Patch- 1335 1278 mation Flow Techniques for Malware Analysis and Containment (DIMVA ’08). Droid: Scalable Third-party Security Patches for Android Devices (ACSAC ’13). 1336 Springer. ACM. 1279 [9] Sang Kil Cha, Maverick Woo, and David Brumley. 2015. Program-Adaptive [39] George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer. 2002. 1337 1280 Mutational Fuzzing (SP ’15). IEEE. CIL: Intermediate Language and Tools for Analysis and Transformation of C 1338 1281 [10] Yue Chen, Mustakimur Khandaker, and Zhi Wang. 2017. Pinpointing Vulnerabil- Programs (CC ’02). Springer. 1339 ities (ASIA CCS ’17). ACM. [40] Nicholas Nethercote and Julian Seward. 2007. Valgrind: A Framework for Heavy- 1282 [11] Binlin Cheng, Jiang Ming, Jianmin Fu, Guojun Peng, Ting Chen, Xiaosong Zhang, weight Dynamic Binary Instrumentation (PLDI ’07). ACM. 1340 1283 and Jean-Yves Marion. 2018. Towards Paving the Way for Large-Scale Win- [41] Mathias Payer and Thomas R. Gross. 2011. Fine-grained User-space Security 1341 dows Malware Analysis: Generic Binary Unpacking with Orders-of-Magnitude Through Virtualization (VEE ’11). ACM. 1284 1342 Performance Boost (CCS ’18). ACM. [42] M. Payer and T. R. Gross. 2013. Hot-patching a web server: A case study of ASAP 1285 [12] Anton Chernoff, Mark Herdeg, Ray Hookway, Chris Reeve, Norman Rubin, Tony code repair (ACSAC ’13). IEEE. 1343 1286 Tye, S. Bharadwaj Yadavalli, and John Yates. 1998. FX!32: A Profile-Directed [43] Gábor Pék, Boldizsár Bencsáth, and Levente Buttyán. 2011. nEther: In-guest 1344 Binary Translator. IEEE Micro 18, 2 (March 1998). Detection of Out-of-the-guest Malware Analyzers (EUROSEC ’11). ACM. 1287 [13] Lucas Davi, Ahmad-Reza Sadeghi, and Marcel Winandy. 2011. ROPdefender: A [44] Fei Peng, Zhui Deng, Xiangyu Zhang, Dongyan Xu, Zhiqiang Lin, and Zhendong 1345 1288 Detection Tool to Defend Against Return-oriented Programming Attacks (ASIA Su. 2014. X-force: Force-executing Binary Programs for Security Applications 1346 1289 CCS ’11). ACM. (SEC’14). USENIX Association. 1347 [14] Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee. 2008. Ether: Mal- [45] Theofilos Petsios, Vasileios P. Kemerlis, Michalis Polychronakis, and Angelos D. 1290 ware Analysis via Hardware Virtualization Extensions (CCS ’08). ACM. Keromytis. 2015. DynaGuard: Armoring Canary-based Protections Against Brute- 1348 1291 [15] Brendan Dolan-Gavitt, Tim Leek, Michael Zhivich, Jonathon Giffin, and Wenke force Attacks (ACSAC ’15). ACM. 1349 Lee. 2011. Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspec- [46] Jonas Pfoh and Sebastian Vogl. 2017. rVMI: A New Paradigm for Full System 1292 1350 tion (SP ’11). IEEE. Analysis. Black Hat USA (2017). 1293 [16] Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2008. A [47] Mario Polino, Andrea Continella, Sebastiano Mariani, Stefano D’Alessio, Lorenzo 1351 1294 Survey on Automated Dynamic Malware-analysis Techniques and Tools. ACM Fontana, Fabio Gritti, and Stefano Zanero. 2017. Measuring and Defeating Anti- 1352 Comput. Surv. 44, 2, Article 6 (March 2008). Instrumentation-Equipped Malware (DIMVA ’17). Springer. 1295 [17] S. J. Eggers, David R. Keppel, Eric J. Koldinger, and Henry M. Levy. 1990. Tech- [48] Aravind Prakash, Xunchao Hu, and Heng Yin. 2015. vfGuard: Strict Protection 1353 1296 niques for Efficient Inline Tracing on a Shared-memory Multiprocessor (SIGMET- for Virtual Function Calls in COTS C++ Binaries. (NDSS ’15). 1354 1297 RICS ’90). ACM. [49] Quarkslab. 2019. QBDI: QuarkslaB Dynamic binary Instrumentation. https: 1355 [18] Francisco Falcón and Nahuel Riva. 2012. Dynamic Binary Instrumentation Frame- //qbdi.quarkslab.com/. 1298 works: I know you’re there spying on me. Recon (2012). [50] Nguyen Anh Quynh. 2018. SKORPIO: Advanced Binary Instrumentation Frame- 1356 1299 [19] Tal Garfinkel. 2003. Traps and pitfalls: Practical problems in system call interpo- work. (OPCDE ’18). 1357 sition based security tools (NDSS ’03). [51] Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne Wong, Hank Levy, 1300 1358 [20] Tal Garfinkel, Keith Adams, Andrew Warfield, and Jason Franklin. 2007. Com- Brian Bershad, and Brad Chen. 1997. Instrumentation and Optimization of 1301 patibility is Not Transparency: VMM Detection Myths and Realities (HOTOS’07). Win32/Intel Executables Using Etch (NT’97). USENIX Association. 1359 1302 USENIX Association. [52] K. Scott, N. Kumar, S. Velusamy, B. Childers, J. W. Davidson, and M. L. Soffa. 2003. 1360 [21] Tal Garfinkel and Mendel Rosenblum. 2003. A Virtual Machine Introspection Retargetable and Reconfigurable Software Dynamic Translation (CGO ’03). IEEE. 1303 Based Architecture for Intrusion Detection (NDSS ’03). [53] A. Skaletsky, T. Devor, N. Chachmon, R. Cohn, K. Hazelwood, V. Vladimirov, and 1361 1304 [22] Cosmin Gorgovan. 2014. Escaping DynamoRIO and Pin. https://github.com/ M. Bach. 2010. of applications 1362 1305 lgeek/dynamorio_pin_escape. (ISPASS ’10). IEEE. 1363 [23] Sean Heelan and Agustin Gianni. 2012. Augmenting Vulnerability Analysis of [54] Asia Slowinska, Istvan Haller, Andrei Bacs, Silviu Baranga, and Herbert Bos. 2014. 1306 Binary Code (ACSAC ’12). ACM. Data Structure Archaeology: Scrape Away the Dirt and Glue Back the Pieces! 1364 1307 [24] Jason Hiser, Anh Nguyen-Tuong, Michele Co, Matthew Hall, and Jack W. David- (DIMVA ’14). Springer. 1365 son. 2012. ILR: Where’D My Gadgets Go? (SP ’12). IEEE. [55] Chengyu Song, Chao Zhang, Tielei Wang, Wenke Lee, and David Melski. 2003. 1308 1366 [25] Martin Hron and Jakub Jermář. 2014. SafeMachine: Malware Needs Love, Too. Exploiting and Protecting Dynamic Code Generation (NDSS ’03). 1309 Virus Bulletin (2014). [56] Amitabh Srivastava and Alan Eustace. 1994. ATOM: A System for Building 1367 1310 [26] Kangkook Jee, Vasileios P. Kemerlis, Angelos D. Keromytis, and Georgios Por- Customized Program Analysis Tools (PLDI ’94). ACM. 1368 tokalidis. 2013. ShadowReplica: Efficient Parallelization of Dynamic Data Flow [57] Ke Sun, Xiaoning Li, and Ya Ou. 2016. Break Out of the Truman Show: Active 1311 Tracking (CCS ’13). ACM. Detection and Escape of Dynamic Binary Instrumentation. Black Hat Asia (2016). 1369 1312 [27] Kangkook Jee, Georgios Portokalidis, Vasileios P Kemerlis, Soumyadeep Ghosh, [58] Cisco Talos. 2017. PyREBox: Python scriptable Reverse Engineering sandbox. 1370 1313 David I August, and Angelos D Keromytis. 2012. A General Approach for Effi- https://github.com/Cisco-Talos/pyrebox. 1371 ciently Accelerating Software-based Dynamic Data Flow Tracking on Commodity [59] Rui Wang, XiaoFeng Wang, Kehuan Zhang, and Zhuowei Li. 2008. Towards 1314 Hardware (NDSS ’12). Automatic Reverse Engineering of Software Security Configurations (CCS ’08). 1372 1315 [28] Karl Trygve Kalleberg and Ole André Vadla Ravnås. 2016. Testing interoperability ACM. 1373 with closed-source software through scriptable diplomacy. (FOSDEM ’16). [60] Tielei Wang, Chengyu Song, and Wenke Lee. 2014. Diagnosis and Emergency 1316 1374 [29] Julian Kirsch, Zhechko Zhechev, Bruno Bierbaumer, and Thomas Kittel. 2018. Patch Generation for Integer Overflow Exploits (DIMVA ’14). Springer. 1317 PwIN – Pwning Intel piN: Why DBI is Unsuitable for Security Applications [61] Xinran Wang, Yoon-Chan Jhi, Sencun Zhu, and Peng Liu. 2009. Behavior Based 1375 1318 (ESORICS ’18). Springer. Software Theft Detection (CCS ’09). ACM. 1376 [30] James R. Larus and Eric Schnarr. 1995. EEL: Machine-independent Executable [62] Z. Wang, X. Ding, C. Pang, J. Guo, J. Zhu, and B. Mao. 2018. To Detect Stack 1319 Editing (PLDI ’95). ACM. Buffer Overflow with Polymorphic Canaries (DSN ’18). IEEE. 1377 1320 [31] Juanru Li, Zhiqiang Lin, Juan Caballero, Yuanyuan Zhang, and Dawu Gu. 2018. [63] Jan Werner, George Baltas, Rob Dallara, Nathan Otterness, Kevin Z. Snow, Fabian 1378 1321 K-Hunt: Pinpointing Insecure Cryptographic Keys from Execution Traces (CCS Monrose, and Michalis Polychronakis. 2016. No-Execute-After-Read: Preventing 1379 ’18). ACM. Code Disclosure in Commodity Software (ASIA CCS ’16). ACM. 1322 [32] Xiaoning Li and Kang Li. 2014. Defeating the Transparency Features of Dynamic [64] Yuan Xiao, Mengyuan Li, Sanchuan Chen, and Yinqian Zhang. 2017. STACCO: 1380 1323 Binary Instrumentation. BlackHat USA (2014). Differentially Analyzing Side-Channel Traces for Detecting SSL/TLS Vulnerabili- 1381 [33] Zhiqiang Lin, Xuxian Jiang, Dongyan Xu, and Xiangyu Zhang. 2008. Auto- ties in Secure Enclaves (CCS ’17). ACM. 1324 1382 matic Protocol Format Reverse Engineering through Context-Aware Monitored [65] Guanhua Yan, Nathan Brown, and Deguang Kong. 2013. Exploring Discriminatory 1325 Execution (NDSS ’08). Features for Automated Malware Classification (DIMVA’13). Springer. 1383 1326 [34] Daiping Liu, Mingwei Zhang, and Ravi Sahita. 2018. XOM-switch: Hiding Your [66] Junyuan Zeng, Yangchun Fu, and Zhiqiang Lin. 2015. PEMU: A Pin Highly 1384 Code from Advanced Code Reuse Attacks In One Shot. BlackHat ASIA (2018). Compatible Out-of-VM Dynamic Binary Instrumentation Framework (VEE ’15). 1327 [35] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff ACM. 1385 1328 Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: [67] Mingwei Zhang, Rui Qiao, Niranjan Hasabnis, and R. Sekar. 2014. A Platform for 1386 1329 Building Customized Program Analysis Tools with Dynamic Instrumentation Secure Static Binary Instrumentation (VEE ’14). ACM. 1387 (PLDI ’05). ACM. [68] Xiaolan Zhang, Zheng Wang, Nicholas Gloy, J. Bradley Chen, and Michael D. 1330 [36] Lorenzo Martignoni, Roberto Paleari, and Danilo Bruschi. 2010. Conqueror: Smith. 1997. System Support for Automatic Profiling and Optimization (SOSP 1388 1331 Tamper-Proof Code Execution on Legacy Systems (DIMVA ’10). Springer. ’97). ACM. 1389 [37] Jiang Ming, Dongpeng Xu, Li Wang, and Dinghao Wu. 2015. LOOP: Logic- [69] Jingyu Zhou and Giovanni Vigna. 2004. Detecting Attacks That Exploit 1332 1390 Oriented Opaque Predicate Detection in Obfuscated Binary Code (CCS ’15). Application-Logic Errors Through Application-Level Auditing (ACSAC ’04). IEEE. 1333 ACM. 1391 1334 12 1392 SoK: Using Dynamic Binary Instrumentation for Security ASIA CCS’19, July 2019, Auckland, New Zealand

1393 A COMPLETE EXPERIMENTAL RESULTS Suite ID Name 1451 400 perlbench 1394 In the following we describe the experimental setup used for a 401 bzip2 1452 1395 preliminary assessment of the performance impact of the stopgap 403 gcc 1453 1396 429 mcf 1454 measures we implemented as a high-level library for analysis tools, 445 gobmk CINT 1397 and present complete results for the corpus of C/C++ benchmarks 456 hmmer 1455 1398 458 sjeng 1456 in the SPEC CPU2006 suite (Table 2) that we used to measure it. 464 h264ref 1399 1457 We use Pin 3.5 running on a Windows 7 SP1 32-bit machine 471 omnetpp 1400 with negligible background activity. For each experiment we use a 473 astar 1458 1401 433 milc 1459 clean virtual machine with 1 CPU core and 3 GB of RAM hosted on 444 namd 1402 450 soplex 1460 VirtualBox 5.2.6 running on a server with Debian 9.6 Stretch and two CFP 1403 Intel Xeon E5-4610 processors. Both the library and the benchmarks 453 povray 1461 1404 470 lbm 1462 are compiled using Visual Studio 2010 in Release configuration. We 482 sphinx3 1405 1463 consider the average value for 9 trials of each configuration. Table 2: Integer and floating-point C/C++ benchmarks. 1406 For integer benchmarks we can see from Figure 2 that the impact 1464 1407 1465 of the mitigation for pointer leaks is small, with a slowdown in SPEC CPU2006 CINT benchmarks 1408 10 1466 leak nx rw full paranoid the worst case as high as 1.09x compared to an execution under 9

1409 8.05 1467 Pin with no instrumentation and analysis code. The overhead orig- 8 7.25 1410 7.02 1468 7 6.62 inates from spilling the x87 instruction pointer to a tool register 6.42 5.92 5.8 1411 6 5.87 1469 5.0 5.02 at every non-control FPU instruction, with the JIT compiler of Pin 4.95 5 4.62 4.43 1412 4.39 1470 3.91 3.82 3.79 3.74 3.73 3.66 3.55 taking care of register reallocation. The impact of the NX miti- 4 3.52 2.98 2.95 2.7 2.7

2.61 1471 1413 3 2.59 Slowdown w.r.t. Pin 2.31 gation, which protects instruction fetching from non-executable 2.25 1414 2 1472

7 1.1 1.17 1.15 1.11 1.0 1.0 1.0 1.09 1.09 1.06 1.06 1.05 1.03 1.03 1.02 1.02 1.01 1.01 1.01 or guarded pages, is intuitively related to the number of indirect 1 1.00 1415 1473 transfers (call, ret, and jump operations) in the program, with a 0 400 401 403 429 445 456 458 464 471 473 1416 peak of 1.17x for benchmark omnetpp (417). The overhead from 1474 1417 Figure 2: Slowdown for integer benchmarks. 1475 shepherding every read/write memory access to protect the code 1418 1476 cache and other regions of the runtime ranges from 2.31x to 5.02x. SPEC CPU2006 CFP benchmarks 1419 100 1477 Combining the three mitigations in the full configuration incurs leak nx rw full paranoid 1420 an overhead very similar to the one from RW, which unsurprisingly 1478 20.16 16.45 16.33 16.07 1479 1421 12.52 12.23 11.47 8.05 is by far the most expensive on these benchmarks. However in 10 7.1 1422 6.58 1480 4.35 4.01 3.53 3.37

some cases the overhead of full might be slightly lower than having 3.08 2.83 2.1 1.99 1.9

1423 1.83 1481 1.43 1.24

the sole RW mitigation active. Once we ruled out noise in time 1.17 1.13 1.13 1.07 1.0 1.0 1.0 1.03 1424 1 1482

measurements, we believe the cause might lie in how the quality Slowdown w.r.t. Pin 1425 of the JIT-ted code is affected by the guards inserted for NXand 1483 1426 1484 RW as well as by the register re-allocation caused by spilling tool 0.1 1427 registers in the implementation of the leak and RW mitigations. 433 444 450 453 470 482 1485 1428 The paranoid mode is clearly more expensive as it shepherds Figure 3: Slowdown for floating-point benchmarks. 1486 1429 every instruction fetch to enforce NX and page guard protection on 1487 1430 code that crosses the border of two pages with different permissions faster in the tests we conducted: while the guards we insert incur a 1488 1431 with no intervening jumps, and verifies read/write permissions for O(1) lookup inside an array, PinShield scans a linked list of known 1489 1432 push and pop operations for when the stack pointer may be used memory ranges for every memory read or write access. As we 1490 1433 as a general-purpose register. mentioned in Section 5.2, pointer leaks via x87 instructions and 1491 1434 For floating-point benchmarks we can observe similar trends NX/page guard conformance checking are not handled by PinShield. 1492 1435 as for the integer ones with two notable exceptions: namd (444) 1493 1436 and sphinx3 (482). We can see that for these two benchmarks the 1494 1437 NX mitigation incurs a overhead considerably higher than in the 1495 1438 other programs from the CINT and CFP sets. A first inspection of 1496 1439 the execution profiles seems to suggest that both programs make 1497 1440 a high usage of indirect control transfers, and the insertion of 1498 1441 guards for such transfers may affect the trace linking process inside 1499 1442 Pin or other aspects of the JIT compilation process. Unfortunately 1500 1443 we could not verify this claim by monitoring the code generation 1501 1444 process due to the closed-source nature of Pin. 1502 1445 Compared to the original implementation of PinShield8, the RW 1503 1446 mitigation from our library was at least one order of magnitude 1504 1447 1505 1448 1506 7Pin by design already handles page guards for read and write operations correctly. 1449 8https://github.com/Phat3/PINdemonium/. 1507 1450 13 1508