Learning to Find Usages of Library Functions in Optimized Binaries

Toufique Ahmed, Premkumar Devanbu, and Anand Ashok Sawant
All authors are with the Department of Computer Science, University of California, Davis, CA 95616.
arXiv:2103.05221v2 [cs.SE] 17 Sep 2021. Manuscript in submission.

Abstract—Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools that "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompilers attempt but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns the statistics of decompiled code from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization.

Index Terms—Reverse engineering, Software modeling, Deep learning

1 INTRODUCTION

In their seminal work, Chikofsky and Cross [1] define Software Reverse Engineering as "the process of analyzing a subject system to (1) identify the system's components and their interrelationships and (2) create representations of the system in another form or at a higher level of abstraction". Understanding the behavior of software binaries generated by potentially untrusted parties has many motivations: such binaries may incorporate, e.g., stolen intellectual property (such as patented or protected algorithms or data), unauthorized access to system resources, or malicious behavior of various sorts. The ability to reverse engineer binaries would make it more difficult to conceal potentially bad behavior and thus act as a deterrent; this would enhance public confidence in, and the overall free exchange of, software technology.

Reverse engineering an optimized binary is quite challenging, since compilers substantially transform source-code structures when creating the binary. This is done primarily to improve run-time performance; however, some compilers also support deliberate obfuscation of source-code details, to protect IP or for security reasons. The resulting binary may be stripped of all information such as variable names, code comments, and user-defined types. Binaries compiled using gcc can be optimized in the interest of run-time performance (even if compilation per se takes longer). Optimizations include function inlining, array vectorization, and loop unrolling. These transformations dramatically alter the code at the assembly level, making it substantially more challenging to decompile the binary successfully.
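To make the effect of such optimizations concrete, here is a minimal, hypothetical example (the file and function names are ours, not from the paper's dataset) of a library call that disappears from the compiled output. With a short constant source string, gcc typically expands strcpy inline into a few immediate stores, so the optimized object file contains no call instruction for strcpy at all:

    /* inline_demo.c -- a sketch for illustration, not taken from the paper.
     * Build and inspect the result with, for example:
     *   gcc -O2 -c inline_demo.c && objdump -d inline_demo.o
     * At -O2 or -Os, gcc usually replaces this strcpy with direct stores of
     * the bytes of "/mgmt/", leaving no call instruction to match against. */
    #include <string.h>

    void set_prefix(char *dst)
    {
        strcpy(dst, "/mgmt/");
    }

Ghidra's output in Figure 1 (below), produced from real project code compiled with -Os, shows exactly this pattern.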
Two tools are the industry standard for reverse engineering: Hex-Rays IDA Pro [2] and Ghidra [3]. These tools incorporate two distinct functionalities, a disassembler (which converts the binary to assembly code) and a decompiler (which converts the assembly code to a higher-order representation, similar to C code), along with a variety of data-flow analysis tools. Both tools can handle binaries compiled for a variety of architectures (such as x86/64 or ARM64).

In the field of security, quite a bit of work has focused on understanding the behavior of malicious applications by examining their library API calls [4], [5], [6], [7], [8], [9], [10], [11], [12]. The intuition behind this is that calls made to library APIs (such as Windows DLLs) can capture the important underlying semantics of the malware's attacking behaviour [13]. However, uncovering API calls is particularly hard, as the compiler might have mangled the inlined body of the called function together with the code at the calling site in complex ways. Both Ghidra and Hex-Rays have specially engineered functionality for recovering calls to library functions. These functions are considered very important by the developers of these tools and are explicitly documented and advertised.

Both solutions use a database of assembly-level signatures for each potentially inlined library call. Assembler code recovered from the binary is matched against the database to identify library function bodies within the binary. These approaches, however, do not work as well at higher optimization levels, owing to extensive code restructuring [14], [15] and context-dependent inlining of the called method's body. Although both tools attempt to find library function invocations statically (see, e.g., [16], item 7), the problem is non-trivial, and more often than not these functions are not recovered. The problem is further complicated by varying compilers, compiler versions, optimization levels, and library versions.

This signature- or pattern-oriented approach relies on a pattern database that must be manually maintained to include a pattern (or patterns) for each possible inlined function. We therefore adopt a more data-driven approach and propose to learn to identify those functions that are most often used in the data. To that end, we develop our tool, FUNCRE, which finds inlined library function invocations even in binaries compiled at higher optimization levels. With sufficient data, a suitably designed training approach, and a powerful enough model, this approach offers an enhanced ability to recover inlined functions.

Ghidra decompiler output (compiled with -Os):

    undefined4 * component_name_wrap (char * param_1)
    {
      size_t sVar1;
      undefined4 * __dest;

      sVar1 = strlen(param_1);
      __dest = (undefined4 *) malloc_or_quit((long) ((int) sVar1 + 0x1e));
      *__dest = 0x6d676d2f;
      *(undefined2 *) (__dest + 1) = 0x2f74;
      *(undefined *) ((long) __dest + 6) = 0;
      strcat((char *) __dest, param_1);
      return __dest;
    }

Original C source:

    static char * component_name_wrap (const char * name)
    {
      int len = strlen(name) + 30;
      char * wrap = malloc_or_quit(len);
      strcpy(wrap, "/mgmt/");
      strcat(wrap, name);
      return wrap;
    }

Fig. 1: Finding inlined functions in real-world code: Ghidra's decompiled version (top) of the original C source code (bottom). FUNCRE recovers the missing inlined strcpy.

As an illustration, Figure 1 presents a sample of original source code (from GitHub) and its Ghidra output when compiled with -Os. From the decompiled version, it is evident that Ghidra can recover strlen, malloc_or_quit, and strcat, but the strcpy is inlined away by optimization: the copy is performed by direct stores through the destination pointer (the constants 0x6d676d2f and 0x2f74 are simply the little-endian bytes of the string "/mgmt/"). The pattern-database approach used by Ghidra works well for three of the four functions, but Ghidra fails to recover the last one (strcpy) because the more complex pattern required is not available in its database. Ghidra's precision is 1.0 for this example, but its recall is lower (0.75). Our tool FUNCRE builds on Ghidra and only tries to recover additional inlined functions beyond those Ghidra finds, using automatically learned patterns latent within its layered deep-learning architecture.

To create training data for FUNCRE, we develop a labeling approach that allows us to place labels (with the name of the function) in the binary at the locations where a function has been inlined. Our labels resist being removed by optimizing compilers; this also minimizes interference with the optimizing compiler's code generation.
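The mechanism behind these labels is detailed later in the paper; purely to illustrate the requirement (this is a hypothetical sketch, not FUNCRE's actual scheme, and funcre_label_strcpy is an invented name), one simple way to plant a marker that survives optimization is to call an opaque external function whose name encodes the library function being labeled. Because the compiler cannot prove such a call free of side effects, it can neither delete it nor move it past other side-effecting code, so a recognizable symbol remains at the call site in the optimized binary:

    /* label_sketch.c -- illustration only; the marker name and strategy are
     * assumptions, not the paper's implementation. */
    #include <string.h>

    /* Helper from the project in Figure 1; signature assumed here only so
     * the sketch is self-contained. */
    extern char *malloc_or_quit(size_t n);

    /* Defined in a separately compiled object file; the optimizer at this
     * call site must therefore assume side effects and keep the call. */
    extern void funcre_label_strcpy(void);

    char *component_name_wrap(const char *name)
    {
        char *wrap = malloc_or_quit(strlen(name) + 30);
        funcre_label_strcpy();   /* marker: a strcpy is inlined just below */
        strcpy(wrap, "/mgmt/");
        strcat(wrap, name);
        return wrap;
    }

A real marker has to be cheaper than this: the paper stresses that its labels minimize interference with the compiler's code generation, which an extra function call does not fully achieve, so the sketch should be read as conveying what the labels must accomplish, not how FUNCRE accomplishes it.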
This labeling approach allows us to successfully annotate 9,643 files and 459 open-source projects, creating a large training dataset.

Our approach uses a pre-training plus fine-tuning scheme [17], currently popular in natural language processing. We pre-train using a masked language model (MLM) on decompiled binaries. This pre-training step helps our model learn the statistics of a large corpus of decompiled source code; these statistics are represented in a low-dimensional positional embedding. We then fine-tune on the labeled dataset of inlined functions. FUNCRE achieves a precision of 0.64 and a recall of 0.46 (by the standard formula F = 2PR/(P + R), this corresponds to an f-score of roughly 0.54). When combined with the results of Ghidra (FUNCRE works on Ghidra's output, which may already contain some recovered functions), our approach provides the highest f-score across all optimization levels, while also recovering a larger number of unique functions. Recovering inlined functions, where assembly-level function prologues and epilogues are absent and the inlined body is reshaped and shuffled by optimization, is hard even for humans and existing tools; however, a high-capacity neural model (RoBERTa + Transformer) that uses pre-training and fine-tuning can still learn statistical features and associations between inlined functions and the mangled residual clues