
In-Memory Fuzzing for Binary Code Similarity Analysis

Shuai Wang and Dinghao Wu The Pennsylvania State University University Park, PA 16802, USA {szw175, dwu}@ist.psu.edu

Abstract—Detecting similar functions in binary executables serves as a foundation for many binary code analysis and reuse tasks. To date, recognizing similar components in binary code remains a challenge. Existing research employs either static or dynamic approaches to capture program syntax or semantics-level features for comparison. However, there exist multiple design limitations in previous work, which result in relatively high cost, low accuracy and scalability, and thus severely impede their practical use. In this paper, we present a novel method that leverages in-memory fuzzing for binary code similarity analysis. Our prototype tool IMF-SIM applies in-memory fuzzing to launch analysis towards every function and collect traces of different kinds of program behaviors. The similarity score of two behavior traces is computed according to their longest common subsequence. To compare two functions, a feature vector is generated, whose elements are the similarity scores of the behavior trace-level comparisons. We train a machine learning model through labeled feature vectors; later, for a given feature vector obtained by comparing two functions, the trained model gives a final score, representing the similarity score of the two functions. We evaluate IMF-SIM against binaries compiled by different compilers, optimizations, and commonly-used obfuscation methods, in total over one thousand binary executables. Our evaluation shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.

Index Terms—in-memory fuzzing; code similarity; reverse engineering; taint analysis

I. INTRODUCTION

Determining the similarity between two components of binary code is critical in many binary code analysis and reuse tasks. For example, binary code clone detection identifies potential code duplication or plagiarism by analyzing the

static methods can test any program component, but they may suffer from real-world challenging settings such as compiler optimizations or program obfuscations [20], [23], [57], [19]. To overcome such limitations, we propose IMF-SIM, which leverages in-memory fuzzing to solve the coverage issue and reveal similarities between two binary code components even in front of real-world challenging settings.

Fuzz testing, or fuzzing, is a widely-used software testing technique that exercises a program by providing invalid or random data as inputs. Compared with traditional testing techniques, where a single input is used to test one execution trace, fuzzing can largely improve the code coverage and increase the chances of exposing hidden bugs. Despite its conceptual simplicity, fuzz testing has proven robust and effective in real-world settings and is widely used for software testing [24], [51], [32], [31], [26], [62]. While most standard fuzz testing mutates the program inputs, we have noticed a special fuzzing technique that is designed to directly fuzz the content of assembly registers or memory cells, i.e., in-memory fuzzing (Chapters 19 and 20 in [61]). In-memory fuzzing can start one fuzzing execution at any program point. While most testing techniques suffer from generating proper inputs to reach certain program points, in-memory fuzzing can start at the beginning of any instruction. Hence, every program component becomes testable.

While the fuzzing technique was originally proposed for software testing, we observe that rich information regarding the program runtime behavior is indeed revealable during fuzzing without additional effort.
That means, semantics-based sim- similarities of two binary components [57], [50]. -based ilarity analysis shall be boosted through well-designed fuzz exploitation compares the pre-patch and post-patch binaries to testing. To this end, we propose IMF-SIM, a fuzzing based reveal hidden vulnerabilities fixed by the patch [11]. similarity analysis tool that overcomes multiple design limi- research analyzes similarities among different malware sam- tations of existing research (details are discussed in §II) with ples to reveal malware clusters or lineage relations [8], [36]. higher accuracy and broader application scopes. In particular, So far, a number of binary similarity analysis tools have IMF-SIM leverages in-memory fuzzing to launch dynamic been developed with different techniques. The de facto in- testing towards every function for multiple iterations and dustrial standard tool BINDIFF identifies similar functions records program runtime behaviors. We collect different kinds mostly through graph isomorphism comparison [20], [23]. of program behaviors (referred as behavior traces in this pa- This algorithm detects similar functions by comparing the per) for each function, and behavior traces from two functions control flow and call graphs. Moreover, recent research work are compared through their longest common subsequence. For proposes advanced techniques to identify the hidden similari- each comparison pair of two functions, we generate a vector ties regarding program semantics [50], [21], [56], [15]. including the Jaccard index rates (the Jaccard index rate of Given the critical role of similarity analysis in binary code, two behavior traces is derived from their longest common we observe several weaknesses in existing research. For ex- subsequence) of the behavior trace-level comparisons, and we ample, dynamic analysis-based methods usually have coverage then label sample vectors to train a machine learning model. issues [35], [54], [59], [64], [14], which naturally impedes Later, given a vector by comparing two functions, the trained their work from testing every function in binary code. Typical model provides a similarity score. The main contributions of this paper are as follows. test component, as it mutates the inputs for iterations and aims • We identify design limitations of existing research in sim- to exhaust execution paths in a best effort. ilarity analysis of binary components, and propose IMF- Program Syntax Information. To balance the cost and accu- SIM, a novel method that uses fuzz testing techniques for racy, many static analysis-based techniques capture relatively function-level similarity analysis in binary code. “light-weight” features, such as the number of instructions, • IMF-SIM employs the in-memory fuzzing technique, opcode sequences, and control flow graph information [34], which is originally designed for assembly-level testing. [22], [20], [23]. However, note that one major application of We propose several advanced methods to overcome the binary code similarity analysis is for malware study. Thus, unique challenges and reveal the hidden power of the to better understand the strength and weakness of similarity fuzzing technique in our new context. testing tools, we should consider commonly-used obfusca- • Benefit from its runtime behavior based comparison, tion techniques as well. 
Typical obfuscations are designed IMF-SIM is effectively resilient to challenges from dif- to change the program syntax and evade the similarity anal- ferent compilers, optimizations, and even commonly-used ysis [45]. As a result, syntax-based techniques can become program obfuscations. We evaluate IMF-SIM on over one inefficient in the presence of obfuscations. However, as IMF- thousand widely-used binaries produced by various com- SIM leverages dynamic analysis to collect program runtime pilation and obfuscation settings. Our evaluation shows behaviors, commonly-used obfuscations should not impede it. that IMF-SIM has promising performance regarding all We evaluate IMF-SIM on three widely-used program obfus- the settings and outperforms the state-of-the-art tools. cations, and it shows promising results in all the settings. III.IMF-SIM II.MOTIVATION We now outline the design of IMF-SIM. The overall work- In this section, we summarize the limitations of previous flow is shown in Fig. 1. To compare two functions in two binary code similarity analysis work and also highlight the binary executables, we first launch in-memory fuzzing to motivation of our research. We discuss the design choices of execute the functions for iterations, and record multiple kinds existing work in multiple aspects here. of behavior traces (§III-A). The central challenge at this step Dynamic Analysis. Many program runtime behaviors, such is the lack of data type information. For example, we have no as memory accesses, function calls, and program return val- pre-knowledge on whether a function parameter is of value or ues, are descriptors of program semantics to some extent. pointer type, and misuse of a non-pointer data as a pointer can Some existing work leverages dynamic analysis and software lead to memory access (pointer dereference) errors. To address birthmarks for (function-level) similarity analysis [54], [59], this issue, we propose to use backward taint analysis to recover [64], [38], [14], [37], [35]. However, an obvious issue for the “root” of a pointer data flow whenever a dereference existing dynamic analysis is the potential low coverage of error occurs on this pointer. Later, we re-execute the function test targets (e.g., functions). Theoretically, generating program and update the recovered dataflow root (e.g., an input of the inputs to guarantee the reachability of every code component is function) with a valid pointer value (§III-B). an undecidable problem. That means, those existing dynamic After collecting behavior traces for each fuzzing iteration tools are in general unable to analyze all the functions in (third column in Fig. 1), we then concatenate behavior traces binary executables, which drastically limits its application of the same kind (e.g., behavior traces representing heap scope. Although techniques can improve the memory accesses) to build the final behavior traces, each of coverage [13], [30]; however, most of these techniques suffer which records one type of runtime behavior through multiple from scalability issues on large programs. fuzzing iterations. To compare two functions, two behavior Static Analysis. Static similarity analysis can start from any traces of the same kind are used to calculate a Jaccard index program point and fully cover every test target, which is, rate (§III-); the Jaccard index rates of all the behavior traces on this aspect, better than dynamic techniques [53], [50], are gathered to form one numeric vector for two functions. 
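To make this trace-level comparison concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of computing a Jaccard index rate from the longest common subsequence of two behavior traces, as detailed in §III-C; the helper names and the toy traces are assumptions:

```python
def lcs_length(t1, t2):
    """Length of the longest common subsequence of two behavior traces."""
    # Classic dynamic-programming LCS over sequences of numeric values.
    dp = [[0] * (len(t2) + 1) for _ in range(len(t1) + 1)]
    for i, a in enumerate(t1, 1):
        for j, b in enumerate(t2, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(t1)][len(t2)]

def jaccard_index(t1, t2):
    """Jaccard index rate of two traces, using the LCS length as |T1 ∩ T2|."""
    inter = lcs_length(t1, t2)
    union = len(t1) + len(t2) - inter
    return inter / union if union else 1.0

# Example: one kind of behavior trace (e.g., heap-read offsets) collected
# for the same function under two different compilation settings.
trace_a = [0x12, 0x34, 0x38, 0x3c, 0x99]
trace_b = [0x12, 0x34, 0x3c, 0x99]
print(jaccard_index(trace_a, trace_b))  # -> 0.8

# A feature vector for two functions gathers one such score per monitored
# behavior kind (12 kinds in IMF-SIM).
```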
We [49], [22], [56], [15], [65], [52]. Indeed, many recent work label sample vectors according to the ground truth and train a in this field leverages techniques to re- machine learning model (§III-D). Later, for a given vector of trieve program semantics. By leveraging constraint solvers, two functions, the trained model can yield a similarity score. semantics equivalent program components are identifiable in For a given function f1 in binary bin1, to find its matched a rigorous way. However, static analysis can have limited function in bin2, we compare f1 against every function in capabilities for real-world programs, such as complex control bin2, and take the sorted top one matching as the final result. flows (e.g., opaque predicates [45]), libraries, or system calls. IMF-SIM consists of four modules, namely the fuzzing In addition, similarity analysis through constraint solving may module, taint analysis module, similarity analysis module, and have noticeable performance penalty [50]. machine learning module. The in-memory fuzzing module of Our proposed technique, IMF-SIM, is derived from typical IMF-SIM is built on top of Pin, a widely-used dynamic binary dynamic testing techniques with concrete inputs. Thus, com- instrumentation tool [48]. We develop a plugin of Pin (i.e., a mon challenges for static analysis should not be obstacles PinTool) for the in-memory fuzzing functionality. This module for our technique. Moreover, to solve the code coverage consists of over 1,900 lines of C++ code. The other three issue of previous dynamic tools, IMF-SIM leverages a unique modules are all written in Python, consisting of over 4,700 fuzzing paradigm, in-memory fuzzing [61], to launch the lines of code. execution at the beginning of every test target. Furthermore, Scope and Limitations. IMF-SIM is mainly designed for fuzz testing naturally improves the code coverage within each similarity testing among binary components. Although in this In-Memory Fuzzing Program Behaviors Behavior Traces LCS Comparison Feature Vector of 12 Elements

[Fig. 1 (figure content): an example target function is fuzzed in memory with mutated argument registers (rdi, rsi); the recorded program behaviors (e.g., HeapRead, HeapWrite, Return traces) are concatenated into per-kind behavior traces; traces of two functions are compared via LCS/Jaccard index into a feature vector of 12 elements, e.g., (0.83, 0.24, ..., 0.12); learning/prediction is performed by a vote over decision trees K1, K2, K3, ..., K.]

Fig. 1: The workflow of IMF-SIM. concat in the figure stands for “concatenation”. research we utilize IMF-SIM to analyze the function-level memory access map is used to reset the environment to its similarity, as in-memory fuzzing supports to execute any original status at the end of each fuzzing iteration. program component (the fuzzing inputs need to be deliberately In the absence of program type information on assembly selected regarding the context), it should be interesting to code, it is likely to encounter memory access errors during investigate whether our technique can be used to find similar execution. For each invalid memory access, we use backward code to an arbitrary trunk of code instead of a whole function. taint analysis to recognize the root of the memory address IMF-SIM captures the program runtime behaviors, thus, dataflow within the test function and update the root with stripped binaries which contain no debug or program relo- a valid memory address. To this end, we allocate a chunk cation information are supported by IMF-SIM. Furthermore, of (512 KB) memory (referred as v mem1) on the heap our evaluation shows that syntax changes due to different when setting up the environment. We also initialize every compilers, optimizations or even commonly-used obfuscations 32-bit long memory cell in v mem1 with a valid memory can also be analyzed without additional difficulties. address that refers to a random position within v mem1; this The employed dynamic analysis infrastructure, Pin, is de- method guarantees most memory accesses through address signed to instrument x86 binaries. In this work, we implement computation and (multi-level) pointer dereferences are still IMF-SIM to instrument binaries on 64-bit platforms valid. Since memory write could break valid pointers in (i.e., binaries with the ELF format). It is not difficult to port v mem1, we create another memory region (v mem2) with IMF-SIM to other platforms supported by Pin (e.g., Windows the same size and intercept memory write towards v mem1. or 32-bit Linux). The intercepted memory writes will be redirected to v mem2. Naturally, users need to provide the range information (i.e., Note that we use the same seed to initialize the random num- the starting and ending addresses) of the test function before ber generator before initializing the memory chunk v mem1. processing. Although the function information is indeed absent This approach guarantees the memory chunk is identical in most stripped binaries, recent work has provided promising among different testing. The pointer that refers to the middle results to precisely recover such information [7], [63]. of this chunk is used to fix the root of an invalid memory address data flow (§III-B). A. In-Memory Fuzzing At the program entry point, we update the value of the In this section, we introduce the design of the in-memory instruction pointer register (i.e., rip for the 64-bit x86 ar- fuzzing module. Our fuzzing module is designed following chitecture); the execution flow is then redirected to the first the common design of a fuzzer. In general, we first setup the instruction of the target function. We then launch the above environment for fuzzing (§III-A1). Before each test iteration, procedure to initialize the context and memory. we mutate one input and then launch the fuzzing execution 2) Input Mutation: As we are focusing on analyzing the (§III-A2). 
We collect the behavior traces during each fuzzing function-level similarity, function inputs naturally serve as the iteration (§III-A3). After one iteration of fuzzing, we resume mutation targets. According to the calling convention on the from the starting point, mutate one input and launch the 64-bit x86 architecture, caller functions use six predefined next iteration until we have iterated for the predefined times registers to pass the first six arguments to callees. Functions (§III-A4). We now elaborate on each step in details. with more than six parameters use the memory stack to pass 1) Environment Setup: To present a fair comparison regard- additional arguments. IMF-SIM mutates those six predefined ing program behaviors, we first create an identical environment registers to fuzz the first six (potential) arguments. For a before the test of each function. To this end, we set all general- function with equal or less than six parameters, all of its inputs purpose registers to zero and maintain a memory access map. are mutated by IMF-SIM. As for functions with more than six This map will be updated before every memory write opera- parameters, their first six arguments are mutated. tion; the key of each item is the memory address and the value Typically, the fuzzing process mutates program inputs ran- is the memory cell’s content before each fuzzing iteration. The domly or following certain predefined rules; both approaches are widely adopted by fuzzers. Our implementation follows put the same data at different places. Thus, we conservatively a common approach—step fuzzing—to mutate the inputs. employ the value instead of the offsets as features for f1 and Starting from zero, we increment one input with a predefined f2. On the other hand, as global data sections in the memory step while preserving the others each time; the step is set as essentially preserve the relative positions in original binary 0xff in our current prototype implementation. We mutate one executables, it is mostly safe to use the memory offsets as register for ten times while preserving the rest, which leads to features. sixty execution iterations in total for one function. By reading the symbol tables of the input executables 3) Function Execution: During the process, we keep track (symbol tables still exist in even stripped executables to of multiple kinds of program behaviors; behavior information provide “exported function” information for dynamic-linkage), is used for similarity comparison. In addition, to facilitate the memory addresses for accessing the .plt.got section can be backward taint analysis, we also record the already executed mapped to dynamic-linked function names. In other words, we instructions; the context information (including the value of have indeed recorded the invoked library functions. Besides, every general-purpose register) of every executed instruction is system calls are recorded as well (f9). Recall we create recorded as well. Whenever encountering a pointer dereference two special memory regions (v mem1 and v mem2) and error, this information will be dumped out as the input of our fix memory access errors with valid pointers towards those taint analysis module (§III-B). regions (§III-A1). We create two features to represent memory IMF-SIM captures various kinds of runtime behaviors. One read and write towards them (f10 and f11). important design concept of IMF-SIM is that it treats the Specified by the 64-bit x86 architecture calling convention, target function as a blackbox. 
That means, we only capture callees pass the return value to the call site through register the interaction between the monitored function and its runtime rax. To capture potential return value, we record the value of environment. This design choice should be more resilient rax at the end of every execution (f12). In total, we collect towards potential changes on the target program. For example, 12 features (as shown in the third column of Fig. 1, each of compiler-optimized programs tend to use registers instead of which is actually a sequence of numeric values) in this step. the memory stack to store function local variables. Hence, 4) Termination and Exceptions of One Fuzzing Iteration: monitoring of stack memory access could be impeded by During execution, we can encounter normal exits, execution this optimization. We present the captured features as follows exceptions, and even infinite loops. We rely on the following (featuren is abbreviated as fn): rules to terminate the current iteration of execution. 1) Value read from or write to the program heap (f1, f2). Rule1 Function execution normally ends at return or inter- 2) Memory address offsets for reading the .plt.got sec- procedural jump instructions. tion (f3). Rule2 A configured timeout has expired. 3) Memory address offsets for reading the .rodata section The first rule identifies the normal end of a function (f4). execution. Note that compiler may optimize return instructions 4) Memory address offsets for reading (f5) or writing (f6) into jump instructions, so in addition to return instructions, we the .data section. also check inter-procedural jump instructions. When the target 5) Memory address offsets for reading (f7) or writing (f8) function is inside a recursive call, we cannot immediately the .bss section. terminate the execution only by the first rule. Hence, we keep 6) System calls made during execution (f9). a counter to record the number of invoked target function; 7) Memory address offsets for accessing v mem1 (f10) or the counter will be incremented every time when the target is v mem2 (f11). called and decremented when it is returned. We terminate the 8) Function return value (f12). fuzzing iteration only when the counter is zero. As we intercept the memory accesses, we record the value The dynamic instrumentor we employed can register several read from or write to the heap region, respectively (f1 and handling routines to receive exceptions and signals. Handling f2). We also record memory address offsets used to access routines check the type of exceptions; errors due to invalid four global data sections (for each memory access, the offset memory accesses will be fixed through backward taint analysis is calculated regarding the base address of the target section). (§III-B). However, given the diverse semantics of test candi- .data and .rodata sections store global data objects, dates, it is still possible to encounter exceptions that are hard .bss stores uninitialized objects while .plt.got section to fix (e.g., some library functions require the parameter to be stores routines to invoke functions in the dynamically-linked a valid file descriptor, which is hard to provide in our context). modules. For .plt.got and rodata sections we keep track To circumvent such issue, we keep track of the most-recently of every memory read access as these two sections are read- executed instruction within the target function; execution will only (f3−4). 
We keep both read and write for other three be resumed from the next associated instruction in the target sections (f5−8). function when encountering such exceptions. Considering there could exist many zero bytes in the mem- 5) Fuzzing Stop: Recall our current implementation exe- ory, we assume that the memory address offset is a more in- cutes each function for sixty iterations, and at the end of each formative feature than the visited content in a memory access. execution, we check whether we have reached this threshold. Hence, while the concrete value read from or written to the If not, we reset the environment, mutate one input, and re- heap region is recorded as f1 and f2, we calculate the memory launch the execution at the beginning of the test function. address offsets of the visited memory cells for f3−8. Note that Since each function is fuzzed for sixty iterations, for each heap has multiple memory allocation strategies, which could type of monitored program behavior, we indeed collect sixty 1 mov %rdi,-0x8(%rbp) On the other hand, if we backtrack the pointer data flow to 2 mov $0x0,-0xc(%rbp) the root and update it with a valid pointer (line 1 in Fig. 2c), 3 jmp L2 4 L1: memory accesses originated from the root would not fail (line 5 mov -0x8(%rbp),%rax 8). More importantly, the typical memory access behavior of 6 lea 0x4(%rax),%rdx 7 mov %rdx,-0x8(%rbp) a loop—incremental memory visit with a fixed step—can be C&F C&F ... 8 mov (%rax),%eax V_Addr mostly revealed from the collected behavior trace (Fig. 2d). 9 add $0x1,-0xc(%rbp) V_Addr 10 cmp $0x4,-0xc(%rbp) V_Addr Considering such observation, we propose to leverage back- 11 L2: V_Addr ward taint analysis to identify the root cause of a memory 12 jle L1 V_Addr ... access failure. Given a pointer dereference error, we taint the pointer and propagate the taint backwards until reaching the (a) In place recovery. (b) Feature trace of memory root (e.g., a function input parameter). We then re-execute the access. function from the beginning, and when encountering the root, 1 mov %rdi,-0x8(%rbp) fix we update it with a valid pointer. We now elaborate on each 2 mov $0x0,-0xc(%rbp) 3 jmp L2 step in details. 4 L1: 5 mov -0x8(%rbp),%rax Given a memory access failure, the first step is to identify 6 lea 0x4(%rax),%rdx the register that is supposed to provide a valid pointer (i.e., the 7 mov %rdx,-0x8(%rbp) taint source). We leverage Pin API get base register 8 mov (%rax),%eax V_Addr 9 add $0x1,-0xc(%rbp) V_Addr+4 to get the base register of this memory access. We take the 10 cmp $0x4,-0xc(%rbp) V_Addr+8 base register as the starting point of our taint analysis. 11 L2: V_Addr+12 12 jle L1 V_Addr+16 We then dump the executed instructions. As mentioned ear- ... lier, we record the context information associated with every § (c) Root cause-oriented recovery. (d) Feature trace of memory executed instruction ( III-A3). Here the recorded contexts are access. also dumped out. The dumped instructions, contexts, and the Fig. 2: Comparison between the in-place method and the root cause- taint source register are the inputs of our taint analysis module. oriented method. The assembly instruction is in AT&T syntax. C&F The next step is to leverage taint analysis to identify the stands for “crash then fix in place”. V Addr is a valid memory taint sink (the “root cause”). Our taint analysis tool first parses address used to fix the memory access error. the dumped instructions into its internal representations. 
Then starting from the taint source, the taint engine iterates each traces. Before we launch the next step analysis, we concatenate instruction and propagates the taint label backwards. The taint behavior traces describing the same type of program behavior propagation will be stopped after we iterate all the instructions. into a long trace. That means, for each function, we finally We then take the most-recently tainted code (which can be get 12 behavior traces for usage (as we monitor 12 features). either a register or a memory cell) as the sink point. The taint module follows the common design of a taint anal- B. Backward Taint Analysis ysis tool. We create a 64-bit bit vector for each 64-bit register. As previously discussed, without program type information, As registers are represented by bit vectors, tainting is indeed it is likely to misuse data of value type as inputs for functions on the bit-level. This bit-level tainting guarantees operations with pointer parameters. If we terminate the execution when- on (small length) registers can also access and propagate the ever encountering pointer dereference errors, only incomplete taint. To record taint propagation through memory cells, our behaviors will be collected. taint engine keeps a lookup table; each element in the lookup An “in-place” strategy to solve the issue is to update the table is a memory address, indicating the referred memory cell memory access through an illegal address with a valid address; is tainted. Tainting operation will insert new elements into this the execution can be then resumed from the current instruction. table, while untainting will delete the corresponding entry. Despite the simplicity, this method does not solve the root As we perform static analysis on the execution traces cause of memory access errors, and the collected behavior (straight-line code), the taint analysis is actually quite efficient. traces can become uninformative. In addition, before analyzing each instruction, our taint engine Fig. 2a presents assembly code generated by compiling a updates registers with concrete values acquired from the con- function; this function contains a loop to access every element text information associated with the current instruction. That of an array. The base address of the array is passed in through means, even for memory read/write operations with composite the first argument of the function (i.e., register rdi) and stored addressing, we know exactly what the memory address is. onto the stack (line 1). Later, the memory address flows to Approximations made by most static analysis are not needed. rax (line 5). Inside the loop, the base address of the array is Table I shows the propagation rules adopted in our work. incremented (line 5–7) to access each memory cell (line 8). Six potential backward taint flows are summarized in Table I, Suppose the function argument rdi is incorrectly assigned and our current policy propagates taint for the first one. When with data of value type. When memory access error happens the left variable of an assignment instruction has taint (Policy (line 8), the “in-place” method discussed above would only 1 in Table I), we assume the taint is from the right variable. update rax to facilitate the current memory access at line 8. Thus, we propagate the taint to the right value (could be a Memory access fails again (as the pointer on the stack is not register or memory cell) and remove the taint on the left value. 
updated), and only memory cell “V Addr” will be accessed Besides this case, we conservatively keep the taint without any and recorded for such case (Fig. 2b). propagation. TABLE I: Taint flows under different conditions and their correspond- models (e.g., SVM [33]) or neural network based models, ing backward propagation rules. tree-based models do not assume the underlying distributions Is t Is u Policy Instruction Taint Policy (e.g., Gaussian) of the data set and are considered more un- Tainted? Tainted? derstandable for human beings. Tree-based model essentially 1 t = u X  ι(u); ζ(t) 2 t = u  X trains a set of decision trees on sub-samples and leverages 3 t = u X X a voting process on top of the decision trees for the final 4 t = t  u X  5 t = t  u  X results [58]. While the well-known tree-based model, i.e., 6 t = t  u X X Random-Forest [10], uses several optimizations in construct- Symbol t and u stand for register, memory access or constant data.  denotes arithmetic or logic operators. ι(b) taints b, while ζ(b) removes the existing taint label from b. ing decision trees, Extra-Trees makes them at random. Our tentative test shows that Random-Forest model can be trapped in over-fitting (e.g., good performance on training sets while After the taint analysis is complete, we return the discovered worse on test sets), and we consider the over-fitting is mostly sink to the fuzzing module. There could exist multiple memory due to its optimizations. However, without further information access errors during the execution of one function, each of to solve this issue, we choose Extra-Trees in our research which originates from a different taint sink. Thus, the fuzzing regarding its more generalized settings. Besides, without the module keeps a list of all the identified taint sinks, and fixes optimizations, Extra-Trees usually has a shorter training time. encountered sink points by checking the list. We leverage the Python machine learning library sklearn for the model training process [55]. Although the trained model C. Longest Common Subsequence is usually used for a binary-prediction problem, i.e., given a As presented earlier, we acquire multiple behavior traces by feature vector by comparing two functions, it yields whether fuzzing a function. To analyze the similarity of two functions, they are matched (1) or not (0), we use the trained model in a we compare each behavior trace of the same kind. In particular, slightly different way. That is, we configure the model to return we utilize the Longest Common Subsequence (LCS) algorithm a probability score (i.e., the similarity score) for each input to calculate the similarity among two traces. The longest com- vector, indicating the possibility to label this vector as one mon subsequence problem identifies the longest subsequence (i.e., two functions are matched). As for the parameters, we that appears in all sequences. We note that since LCS allows set the number of estimators (i.e., number of decision trees) as skipping non-matching elements, it naturally tolerates diverse 100 which leads a time/accuracy trade-off, and use the default or noise on the behavior traces due to program optimizations value for all the other parameters. or even obfuscations. Before the training process, we first label the feature vectors. To facilitate further “learning” step, we seek to create A feature vector is labeled as a positive sample when it is numeric comparison outputs. 
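As a rough illustration of this backward propagation, the following is a simplified sketch under our own assumptions about the trace record format; it implements only Policy 1 from Table I below (the paper's actual module tracks taint at the bit level and through a memory lookup table), and the names are ours:

```python
def backward_taint(trace, source):
    """Walk the recorded execution trace backwards from the faulting
    instruction and return the taint sink (the 'root' of the pointer
    data flow), e.g., an argument register such as 'rdi'.

    `trace` is assumed to be a list of ('mov', dst, src) records for
    plain assignments; `source` is the base register of the faulting
    memory access."""
    tainted = {source}          # start from the taint source
    sink = source
    for op, dst, src in reversed(trace):
        # Policy 1 (Table I): for an assignment t = u with t tainted,
        # move the taint from t to u; u becomes the new candidate sink.
        if op == 'mov' and dst in tainted:
            tainted.discard(dst)
            tainted.add(src)
            sink = src
        # Other cases: conservatively keep the existing taint untouched.
    return sink

# Example for the loop in Fig. 2: the faulting base register rax was loaded
# from the stack slot -0x8(%rbp), which in turn was loaded from rdi.
trace = [('mov', '-0x8(%rbp)', 'rdi'),
         ('mov', 'rax', '-0x8(%rbp)')]
print(backward_taint(trace, 'rax'))   # -> 'rdi' (the function argument)
```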
Thus, instead of directly using produced by comparing functions that are supposed to be the common subsequence, a Jaccard index is derived from the matched. Such ground truth is acquired by comparing the LCS of two traces. The calculated Jaccard index is a numeric function name; we assume two functions are similar only if value ranging from 0 to 1. Thus, to compare two functions, they have the same name. (we introduce how we get such we build a numeric vector including Jaccard indices of all the ground truth for our evaluation in §IV). We label the rest as recorded features. negative samples. We also balance the positive and negative Jaccard Index. Given two behavior traces (T1 and T2), the training samples before training. Further discussions on the Jaccard index is calculated in the following way: applicability of our machine learning-based approach will be

presented in §V.

J(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = |T1 ∩ T2| / (|T1| + |T2| − |T1 ∩ T2|)

IV. EVALUATION

The |T1 ∩ T2| is the length of LCS between T1 and T2, while We now present the evaluation in this research. There are |T1|, |T2| represent the length of T1 and T2, respectively. 12 features recorded along the analysis (§III-A3), and our The J(T1,T2) is a numeric value ranging from 0 to 1; two first evaluation studies the importance of each captured feature sequences are considered more similar when the Jaccard index regarding our predication task. We then launch a standard ten- is closer to 1. fold validation towards IMF-SIM; binaries produced by nine As numeric vectors built in this step describe the similarity different compilation settings are tested. The second step is of two functions regarding different program behaviors, each cross-compiler validation; we launch experiments regarding element in the vector is indeed a “feature” in the comparison. six cross-compiler settings. In the third step evaluation, we Thus, we denote these vectors as “feature vectors” following validate IMF-SIM by comparing normal binary code and their the common sense. obfuscated versions. Note that both second and third step evaluations are cross-model validations. That is, we train the D. Training model with data used in step one, and verify the trained After collecting feature vectors, we train a machine learning model with new data produced for experiments in step two and model for prediction. Later, for a feature vector from the three. This cross-model validation step is necessary as when comparison of two functions, the trained model yields a deploying in real world scenarios, IMF-SIM is supposed to probability score (range from 0 to 1) showing the possibility train with standard training sets (e.g. data used in step one) that two functions are matched. and do prediction towards binaries from various compilation In this research, we choose to train a tree-based model (i.e., and event obfuscation settings (typical cases are tested in step Extra-Trees [28]) for prediction. Comparing with kernel-based two and three). We present further discussion in §V. Similar with previous research [21], [35], [22], we measure TABLE II: Feature importance evaluation. The definition of each the performance of IMF-SIM regarding its accuracy, i.e., the feature is presented in §III-A3. 1 percentage of correctly matched functions. For a function f1 feature importance feature importance feature importance f 0.061 f 0.140 f 0.103 in binary bin1, we iteratively compare it with functions in 1 2 3 f4 0.036 f5 0.096 f6 0.015 binary bin2 and take the comparison pair with the highest sim- f7 0.082 f8 0.058 f9 0.137 ilarity score (i.e., rank-one) as the matched pair. In addition, f10 0.149 f11 0.053 f12 0.070 we also evaluate whether a function can find its correct match TABLE III: Ten-fold validation on the similarity score of when checking the top three and five function pairs (rank-three coreutils binaries in terms of different compilation settings. and rank-five). It is reasonable to assume experts can further Comparison results with three ranks are provided. We also evaluate identify the correct match with acceptable amount of efforts BINDIFF with same settings. given three or five function pair candidates. Moreover, there rank-one rank-three rank-five BINDIFF could exist several comparisons that have equivalent similar gcc -O0 vs. gcc -O3 0.704 0.813 0.846 0.192 gcc -O2 vs. gcc -O3 0.895 0.955 0.962 0.780 rates, e.g., the similar rate of rank-6 and rank-5 is equal to icc -O0 vs. icc -O3 0.702 0.763 0.796 0.216 0.720. 
However, we do not consider such possibility and only icc -O2 vs. icc -O3 0.971 0.984 0.985 1.00 take the first N (N is 3 or 5) matches for consideration. We now llvm -O0 vs. llvm -O3 0.775 0.842 0.864 0.285 llvm -O2 vs. llvm -O3 0.973 0.993 0.994 0.996 introduce the data set and the ground truth in the evaluation. average 0.837 0.892 0.908 0.578 Data Set. We employ a widely-used program set—GNU coreutils—as the data set in our research. coreutils (version 8.21) consists of 106 binaries which provide diverse to get the function information. Particularly, we disassemble functionalities such as textual processing and system manage- the binary code with a disassembler (objdump [1]), and ment. We rule out binaries with destructive semantics (e.g., then read the function name and range information from rm), which leads to 94 remaining programs. the disassembled outputs. The ground truth in our work is As previously mentioned, we compare normal binary acquired by comparing the function names. Two functions code with their obfuscated versions. We employ a widely- are considered matching only if they have the same name. used obfuscator (Obfuscator-LLVM [43]) in our evaluation Besides, the range (starting and ending memory addresses) of (Obfuscator-LLVM is referred as OLLVM later). OLLVM is a each function serves as the input of IMF-SIM. set of obfuscation passes implemented inside llvm compiler suite [46], which provides three obfuscation methods, i.e., A. Feature Importance instruction substitution [16], bogus control flow (i.e., opaque We first evaluate the importance of each feature we cap- control-flow predicate [17]), and control-flow flattening [45]. tured. That is, we try to answer the question that with respect All of these methods are broadly-used in program obfuscation. to different features, how much does each feature contribute to We leverage all the implemented methods to produce binaries the overall effectiveness of IMF-SIM. The machine learning with complex structures. In summary, our data set consists of model implementation (sklearn [55]) provides standard APIs three variables, leading to 1,140 unique binaries in total: to acquire such data [5]. Compiler. We use GNU gcc 4.7.2, Intel icc 14.0.1 and To this end, we fit our model with all the test samples, and llvm 3.6 to compile programs. present the importance of each feature. The results are shown Optimization Level. We test three commonly-used com- in Table II. In general, four features have the importance of piler optimization settings, i.e., -O0, -O2 and -O3. over 10%, and eleven features over 5%. The only outlier (f6) Obfuscation Methods. we test program binaries obfuscated is memory writes towards the global data section, which has an by three commonly-used methods, which are provided by importance of 1.5%. Overall, our evaluation shows that most Obfuscator-LLVM (version 3.6.1). of the features have noticeable importances, and we interpret Comparison with Existing Work. We compare IMF-SIM our feature selection step (§III-A3) as reasonable. with the state-of-the-art research work BinGo [15] and Blanket Execution (i.e., BLEX) [21] that deliver function-level sim- B. Ten-Fold Validation ilarity analysis. Both research work employ coreutils as In this section, we perform a standard approach, i.e., ten- the dataset in the evaluation. We compare IMF-SIM with them by directly using data provided in their paper (Table V). fold cross validation, to test the performance of IMF-SIM. 
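Before turning to the validation details, the training and scoring step described in §III-D can be illustrated with a short sklearn sketch (our own illustration; only the choice of Extra-Trees with 100 estimators comes from the paper, and the feature matrix X and labels y stand in for the labeled 12-element vectors):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# X: one row per function pair, with 12 Jaccard index rates per row.
# y: 1 if the pair is a ground-truth match (same function name), else 0.
X = np.array([[0.83, 0.24, 0.91, 0.12, 0.77, 0.40, 0.65, 0.50, 1.00, 0.70, 0.33, 0.88],
              [0.10, 0.05, 0.20, 0.02, 0.15, 0.08, 0.12, 0.07, 0.30, 0.09, 0.04, 0.11]])
y = np.array([1, 0])

model = ExtraTreesClassifier(n_estimators=100)  # 100 decision trees, as in §III-D
model.fit(X, y)
# model.feature_importances_ gives per-feature importances as in Table II.

# Instead of the 0/1 prediction, take the probability of the "matched"
# class as the similarity score of the two functions.
vec = np.array([[0.80, 0.22, 0.85, 0.10, 0.70, 0.35, 0.60, 0.45, 0.95, 0.66, 0.30, 0.84]])
score = model.predict_proba(vec)[0][list(model.classes_).index(1)]
print(score)
```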
In general, this validation divides the total data set into 10 subsets We also compare IMF-SIM with the industrial standard and tests each subset with the model trained by the remaining binary diffing tool BINDIFF [2]. BLEX and BINGO are also 9. In this step, binary code produced by nine compilation compared with BINDIFF in their paper. We use BINDIFF (v4.2.0) to evaluate all the experiment settings (Table III, settings are employed. Table III presents the evaluation results Table IV, Table VI). Benefit from well-designed graph iso- of three ranks, respectively. The accuracy rate represents the average of the 10 tests. Even through we train the model morphism comparisons [20], [23], BINDIFF is also stated as resilient towards different compilers and optimizations [3]. with data of six different comparison settings together, we separately verify the trained model regarding test data of each Ground Truth. Although our technique itself does not rely setting, leading to six categories. on the debug and symbol information inside the binary exe- cutables, we compile the test programs with debug symbols On average, IMF-SIM has over 83% rank-one similarity score, and the accuracy rate goes to over 90% regarding the 1Same with previous research, we use accuracy instead of precision/recall rank-five matches. The evaluation data shows that compiler rate for the evaluation. Note that “accuracy” is not equal to precision. optimizations indeed affect the performance. On the other TABLE IV: Cross-compiler validation. TABLE V: Compare IMF-SIM with the state-of-the-art research work (BLEX [21] and BINGO [15]) and de facto industrial-standard tool rank-one rank-three rank-five BINDIFF IN IFF  gcc -O0 vs. icc -O3 0.634 0.721 0.772 0.187 (B D ). means such data is not available. gcc -O0 vs. llvm -O3 0.660 0.761 0.800 0.199 IMF-SIM BINDIFF BLEX BINGO icc -O0 vs. gcc -O3 0.569 0.682 0.714 0.173 llvm llvm  icc llvm -O0 vs. -O3 0.775 0.285 0.305 -O0 vs. -O3 0.629 0.718 0.756 0.192 llvm llvm  llvm gcc -O2 vs. -O3 0.973 0.996 0.561 -O0 vs. -O3 0.601 0.706 0.734 0.229 gcc llvm  llvm -O0 vs. icc -O3 0.599 0.700 0.763 0.223 -O0 vs. -O3 0.660 0.199 0.307 llvm gcc  average 0.615 0.715 0.757 0.201 -O0 vs. -O3 0.601 0.229 0.265 gcc -O0 vs. gcc -O3 0.704 0.192 0.611 0.257 gcc -O2 vs. gcc -O3 0.895 0.780 0.771 0.470 gcc -O0 vs. icc -O3 0.634 0.187 0.541  hand, we observe the performance is notably increased regard- ing rank-three and rank-five. Hence, we interpret that users are very likely to identify the real matching pair by further tools show similar findings that heavy compiler optimizations analyzing the first three or five matches. bring in additional challenges, i.e., comparison between -O2 IMF-SIM shows consistent results comparing with BINDIFF and -O3 yields better results than -O0 vs. -O3. On the other regarding icc/llvm -O2 vs. -O3. Both IMF-SIM and BIN- hand, IMF-SIM can outperform all tools in all these settings. DIFF yield very high accuracies for these two comparison As previously discussed, BINDIFF, which leverages pro- settings. Our further study on the disassembled outputs shows gram control-flow as features to analyze the similarities, shows that assembly code produced by icc/llvm -O2 is indeed notably low resilience towards heavy compiler optimizations similar to their corresponding -O3 level. (comparisons between -O0 vs. -O3 are much worse than - In addition, while the comparisons between binaries com- O2 vs. -O3 for BINDIFF). On the other hand, the other three piled by llvm (-O0 vs. 
-O3) show better performance, the semantics-based tools are considered more “stable” regarding evaluation results for gcc and icc (-O0 vs. -O3) are similar. optimizations. BinGo captures semantics information mostly We consider binaries compiled by gcc and icc show similar through symbolic execution and I/O sampling, and it has rela- behaviors regarding features IMF-SIM captures, and our fur- tively poor performance comparing to the other two dynamic ther evaluations report consistent findings. In general, while execution-based tools. IMF-SIM and BLEX both acquire the de facto industrial tool BINDIFF suffers from heavy com- more precise information by indeed executing the test targets. piler optimizations, we interpret IMF-SIM shows promising However, different from BLEX which forces to execute every results regarding such challenging settings. instruction of the test target and breaks the original execution paradigm, IMF-SIM fixes pointer dereference error on demand C. Cross Compiler Validation and reveal the original program behavior in a more faithful In this section, we launch evaluations to compare binaries approach. In sum, we interpret the comparison results as produced by different compilers. To this end, we train a model promising. leveraging all six groups of data used in the first evaluation (§IV-B). The trained model is then used to verify comparisons E. Comparison with Obfuscation Code with cross compiler settings. Given the intuition that compiler optimization “obfuscates” programs (evaluation in Table III is In addition to the evaluations of normal binaries, we also consistent with this intuition), we compare binary code under launch comparisons between normal binaries and their cor- the most challenging setting, i.e., comparing un-optimized (- responding obfuscated code. We follow the same approach O0) binary code with the heavily-optimized (-O3) code. as §IV-C to train the model. We leverage all the obfuscation The evaluation results are presented in Table IV. Stating methods provided by OLLVM to obfuscate binary code opti- the challenge, IMF-SIM actually performs well in most of mized with -O3 (OLLVM employs compiler llvm to compile the settings. We observe the comparison between gcc/icc the code). Given three groups of obfuscated code, we then and llvm show similar results, especially for llvm -O0 compare each group with three sets of normal code. vs. gcc/icc -O3. We interpret the results are consistent The evaluation results are presented in Table VI. Instruction with our findings in §IV-B; binaries compiled by gcc and substitution (“sub.”) replaces simple operations (e.g., addi- icc compilers show similar behaviors in front of IMF-SIM. tion) into semantics-equivalent but syntax-level more complex Besides, while comparisons between llvm -O0 and gcc/icc formats. Despite its efficiency of misleading syntax-based -O3 show relatively low performance in terms of rank-one similarity analysis, such obfuscation can hardly defeat IMF- score, rank-three scores are improved to over 70%. On the SIM which captures program semantics (runtime behaviors). other hand, BINDIFF performs poorly in terms of all the Bogus control flow (“bcf.”) inserts opaque predicates to protect settings in this evaluation (on average 0.201 accuracy score). conditional branches; such predicate is usually difficult to be reasoned until runtime [17]. We observed that analysis results D. Comparison with Existing Work become worse regarding this obfuscation technique. 
Consid- As aforementioned, we compare IMF-SIM with two state- ering our fuzzing strategy as relatively simple, we interpret of-the-art research work [21], [15]. Table V presents the com- the results as reasonable; opaque predicate complicates the parison results (we also include BINDIFF’s data for reference). monitored execution behaviors, and also potentially impedes BLEX evaluates results of three different compilation/opti- our fuzzer from “drilling” into the guarded branch. mization settings (last three rows in Table V). BINGO reports We also observe that fewer functions are correctly matched more settings, and we use 6 settings that overlap with our when applying the control-flow flattening obfuscation (“fla.”). evaluation (row 2-7). In general, we interpret that all the four In general, control-flow flattening changes the structure of TABLE VI: Validation towards obfuscated code compiled by OLLVM model and the generated feature vectors, prediction is straight- with -O3 optimization level (OLLVM employs llvm compiler). The forward. We report that total prediction time is 15389.6 CPU name of the obfuscation methods are simplified due to the limited space. seconds for all the experiments in §IV-B, §IV-C, and §IV-E. Recall we undertake 21 groups of comparisons, each of which comparison obf. rank-one rank-three rank-five BINDIFF consists of 94 binary pairs. Thus, on average it takes 7.8 llvm -O0 bcf. 0.513 0.631 0.686 0.235 gcc -O0 bcf. 0.385 0.525 0.590 0.190 CPU seconds to compare one pair of binaries. In general, icc -O0 bcf. 0.362 0.489 0.549 0.183 we consider the overall model training and predication as llvm -O0 fla. 0.649 0.753 0.797 0.269 gcc -O0 fla. 0.576 0.700 0.757 0.199 effective. Furthermore, since we are using Extra-Trees as the icc -O0 fla. 0.561 0.685 0.732 0.174 learning model, it is actually suitable to speedup the real llvm -O0 sub. 0.779 0.845 0.870 0.285 gcc -O0 sub. 0.664 0.760 0.802 0.198 training time by employing multiple CPU cores. icc -O0 sub. 0.613 0.723 0.758 0.192 average 0.567 0.679 0.727 0.214 V. DISCUSSION the original control flow into a “flatten” structure; the exe- Obfuscations. We have evaluated IMF-SIM on both benign cution flow is chained by switch statements to iterate basic and obfuscated binary code. As discussed in §II, the dy- blocks [45]. Considering such design, it is not surprising that namic testing approach used in IMF-SIM is indeed quite the collected behavior traces of IMF-SIM are affected due robust even in the presence of code obfuscation techniques; to this structure-level changes. Nevertheless, comparing with program runtime behaviors are revealed and captured during BINDIFF which relies on program control-flow information the execution. On the other hand, previous static analysis for analysis, IMF-SIM still demonstrates the obfuscation re- based similarity testing may be impeded. According to our silience and outperforms this industrial strength tool. experiments, control-flow flattening obfuscation, which trans- forms the structure of the control-flow, affects the accuracy. F. Processing Time Moreover, our evaluation shows that IMF-SIM still preserves Our experiments are launched on a machine with Intel a quite promising result with diverse settings (§IV). Xeon(R) E5-2690 CPU, 2.90GHz and 128GB memory. In this Note that many binary reverse engineering and analysis section we report the processing time of IMF-SIM. tasks (e.g., indirect memory addressing) are theoretically unde- Feature Generation. 
In our evaluations, we test 21 groups cidable; results from computability theory suggest that it could of comparison, each of which consists of 94 program pairs. be very difficult to propose impeccable solutions. Hence it is We measure the total feature generation time, from starting challenging—if possible at all—to propose a technique that to execute the first test binary until finishing producing the can defeat all known or to-be-invented obfuscations. Certainly feature vector of the last comparison. We report that IMF- IMF-SIM is not designed to defeat all of them. Neverthe- SIM takes 1027.1 CPU hours to process all the comparisons less, IMF-SIM shows its practical value in the presence of (94 pairs * 21 groups). On average, it takes around 0.52 commonly-used obfuscations. CPU hour to compare two binaries. Note that the report Code Coverage. We have also performed some tentative tests time indeed underestimates IMF-SIM’s processing speed, as in on the code coverage of IMF-SIM before launching the exper- our prototype implementation, a binary re-generates behavior iments of function similarity. The results are promising, with traces for each comparison (a binary can participant in multiple an average 31.8% instruction coverage. In general, we consider comparisons). We also report the processing time of BINDIFF the experiments and the comparisons with state-of-the-art tools is 8830.8 CPU seconds. BINGO is reported as efficient as have shown promising results given the current coverage. well [15]. In general, our study shows that (lightweight) static Additionally, code coverage could be further improved through feature-based tools can take much less time to process (like advanced grey-box fuzzing technique [9]; we leave it as one BINDIFF and BINGO); however, they suffer from relatively future work to adopt such advanced techniques in IMF-SIM. worse comparison accuracies according to our evaluations On the other hand, we believe pushing the code coverage (§IV). On the other hand, BLEX [21] reports to take 57 to the extreme (as “blanket execution” does [21]) may not CPU days (1368 CPU hours) to process 1,140 binaries, while be the best practice in our context (i.e., similarity analysis). our work takes 1027.1 CPU hours for 1,974 comparison Improving the code coverage at a best effort while preserving pairs. Although test cases used by IMF-SIM and BLEX are the original execution paradigm shall be a better way to not exactly the same, and BLEX uses different hardware to collect features that can truly reveal the program behavior. Our measure the processing time, we can still see that dynamic- evaluation also reports consistent findings. We present further analysis based approaches, despite its much better comparison comparisons with BLEX in §VI. accuracies, would lead to relatively high performance penalties Applicability of the Machine Learning based Approach. in general. We do not undertake an explicit model tuning step in this Our study reveals one task that takes a large amount of pro- work. The key motivation is that by adopting the default cessing time: the calculation of longest common subsequence setting of the well-studied model (except setting the number (LCS). This algorithm itself has a relatively high computation of decision trees as 100), we have already observed promising complexity. However, since each comparison task performed results (§IV). 
Further tuning step can be launched by following by IMF-SIM is independent with each other, the real time can standard procedures, which is well documented by the machine be reduced by employing more CPU cores. learning library we use [4]. Model Training and Prediction. The training of the Extra- We note that our classifier is not over-specialized regarding Trees model takes 609.8 CPU seconds. Given the trained “obfuscation.” We train IMF-SIM with only normal program binaries compiled by three most-commonly used compilers on use statistical reasoning over small components to establish Linux platforms, and further test the learned model regarding the similarity. Their method is mostly proposed as searching a much broader setting (e.g., commonly used obfuscation vulnerable procedures in a candidate binary pool, and no methods). Our training set is considered standard, which is evaluation is reported in terms of obfuscated code. Some straightforward to produce and should be used together. This recent work also extends the application scope by searching is actually how IMF-SIM is supposed to use in practice; for semantics-equivalent components or similar bugs in binary training with a whole set of well-known binaries, and test code cross different architectures [56], [22], [15]. BINGO [15] with other (unknown) settings in the wild. For example, lifts instructions into intermediate representations and capture when performing malware similarity analysis for unknown the semantics-level similarity through I/O samples of function malware samples, malware researchers can employ samples model traces. However, our evaluation shows that BINGO has from several commonly-used malware families to train the relatively poor performance regarding different compilation model, and let this model to decide the similarity among and optimization settings. different unknown pieces. In general, learning-based approach Probably the most similar work with IMF-SIM is BLEX has been adopted by very related research on binary code (“blanket Execution”) proposed by Egele et al. [21]. They reverse engineering [7], [60]. We believe our work can shed propose a new execution model, i.e., “Blanket Execution” light on future research to employ machine learning techniques to execute a part of a binary in a controlled context for to analyze (binary) code similarities. similarity comparison. A typical blanket execution starts from Future Work. Although in this paper we adopt in-memory the beginning of the target function and exhaustively executes fuzzing for function-level similarity analysis, as previously every instruction in the function. A typical blanket execution mentioned, in-memory fuzzing is indeed capable of launching could iterate for multiple rounds. Once an error happens, such test towards any program component. as a memory access error, it restarts from the so-far unexecuted Besides, we also plan to further extend our technique by instruction to make sure each instruction is executed at least studying the challenges and solutions in detecting similar once. Program behaviors are collected during execution. Inde- binary components among large sets of program binaries. pendently, Godefroid [29] proposes a similar idea (referred as Conceptually, one potential approach is to put features of “Micro Execution” in his work) for testing and bug detection each binary component (e.g., a function or a basic block) in binary code. 
Future Work. Although in this paper we adopt in-memory fuzzing for function-level similarity analysis, as previously mentioned, in-memory fuzzing is capable of launching tests towards any program component.

Besides, we also plan to further extend our technique by studying the challenges and solutions in detecting similar binary components among large sets of program binaries. Conceptually, one potential approach is to put the features of each binary component (e.g., a function or a basic block) into a "pool" and, for a given binary component, query against the pool. To improve efficiency, coarse-grained syntax-based clustering of the pool can be performed before applying the semantics-based comparison of IMF-SIM. As for the processing time of function-level comparisons, we expect the feature generation time to stay the same as in our current experiments, since the number of functions in this setting does not change. The model training and prediction time would increase, but since we leverage a typical tree-structured model (i.e., Extra-Trees [28]), those procedures can be accelerated with multiple cores.
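The sketch below illustrates this pooled-search idea. It is purely conceptual: the coarse syntactic key (a quantized mnemonic histogram), the class and parameter names, and the pluggable semantic scoring function are assumptions for illustration, not components of the current IMF-SIM prototype.

from collections import defaultdict

def syntactic_key(mnemonics, step=8):
    # Coarse, cheap key: a quantized histogram of instruction mnemonics.
    hist = defaultdict(int)
    for m in mnemonics:
        hist[m] += 1
    return tuple(sorted((m, c // step) for m, c in hist.items()))

class FunctionPool:
    def __init__(self):
        # key -> list of (function name, behavior traces)
        self.clusters = defaultdict(list)

    def add(self, name, mnemonics, traces):
        self.clusters[syntactic_key(mnemonics)].append((name, traces))

    def query(self, mnemonics, traces, semantic_score):
        # Only candidates in the same coarse cluster pay the cost of the
        # semantics-based (trace/LCS plus model) comparison.
        candidates = self.clusters.get(syntactic_key(mnemonics), [])
        return sorted(((semantic_score(traces, t), n) for n, t in candidates),
                      reverse=True)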
VI. RELATED WORK

In this section, we review existing work on finding similar components in programs. In particular, we focus on similarity analysis of executable files. There also exist orthogonal approaches that aim at finding code similarity at the source code level [44], [47], [39], [40], [25], [41]. We do not discuss this type of work, since analysis on binary code has its own unique challenges and applications.

Naturally, syntactic features, such as opcodes (e.g., RADIFF2 [6]), instructions, control-flow structures (e.g., BINDIFF [20], [23]), and execution traces, are extracted from binary code and used for static comparison and clustering [57], [34], [12], [23], [19]. Further studies propose to leverage semantics information to compare binary components [27], [42]. CoP combines symbolic execution with longest common subsequence (LCS) based similarity analysis for software plagiarism detection; their work is considered robust against different compilers and optimizations [50]. Ming et al. [53] detect code similarity by equivalence checking of system call sliced segments. These semantic methods are not very scalable. David et al. [18] decompose components of an executable and use statistical reasoning over small components to establish the similarity. Their method is mostly proposed for searching vulnerable procedures in a candidate binary pool, and no evaluation is reported in terms of obfuscated code. Some recent work also extends the application scope by searching for semantics-equivalent components or similar bugs in binary code across different architectures [56], [22], [15]. BINGO [15] lifts instructions into intermediate representations and captures semantics-level similarity through I/O samples of function model traces. However, our evaluation shows that BINGO has relatively poor performance across different compilation and optimization settings.

Probably the work most similar to IMF-SIM is BLEX ("Blanket Execution") proposed by Egele et al. [21]. They propose a new execution model, i.e., "blanket execution," to execute a part of a binary in a controlled context for similarity comparison. A typical blanket execution starts from the beginning of the target function, exhaustively executes every instruction in the function, and could iterate for multiple rounds. Once an error happens, such as a memory access error, execution restarts from the so-far unexecuted instructions to make sure each instruction is executed at least once. Program behaviors are collected during execution. Independently, Godefroid [29] proposes a similar idea (referred to as "micro execution" in his work) for testing and bug detection in binary code.

The main difference between this previous work and IMF-SIM is that, while blanket execution and micro execution break the common paradigm of program execution to greatly improve instruction coverage, IMF-SIM leverages fuzzing and backward taint analysis to exercise a program's normal execution with better path coverage. In other words, the feature traces collected by IMF-SIM naturally represent true program behaviors, which further enables the longest common subsequence technique to reveal similarity even among noisy behaviors. In contrast, because the features collected by blanket or micro execution do not always truly reflect normal program behavior, they are coarsely maintained as a set for comparison [21]. In general, IMF-SIM can better reveal hidden similarity in challenging settings, as shown in our experimental results.

VII. CONCLUSION

In this paper, we present IMF-SIM, a tool that identifies similar functions in binary code. IMF-SIM uses a novel method that leverages in-memory fuzzing to execute all the functions inside a binary program and collect behavior traces, which are used to train a prediction model via machine learning. Our evaluation shows that IMF-SIM is resilient to the challenges posed by different compilation settings and commonly-used obfuscation methods. The experimental results also show that IMF-SIM outperforms state-of-the-art research and industrial tools such as BINGO, BLEX, and BINDIFF.

ACKNOWLEDGMENT

We thank the ASE anonymous reviewers and Tiffany Bao for their valuable feedback. This research was supported in part by the National Science Foundation (NSF) under grant CNS-1652790, and the Office of Naval Research (ONR) under grants N00014-16-1-2265, N00014-16-1-2912, and N00014-17-1-2894.

REFERENCES

[1] "objdump," https://sourceware.org/binutils/docs/binutils/objdump.html, 2012.
[2] "BinDiff," https://www.zynamics.com/bindiff.html, 2014.
[3] "BinDiff release note," https://blog.zynamics.com/category/bindiff/, 2016.
[4] "Comparing randomized search and grid search for hyperparameter estimation," http://scikit-learn.org/0.17/auto_examples/model_selection/randomized_search.html, 2016.
[5] "Feature importance," http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html, 2016.
[6] "Radiff2," https://github.com/radare/radare2/wiki/Radiff2, 2016.
[7] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley, "ByteWeight: Learning to recognize functions in binary code," in Proceedings of the 23rd USENIX Conference on Security Symposium. USENIX Association, 2014.
[8] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, "Scalable, behavior-based malware clustering," in Proceedings of the 2009 Network and Distributed System Security Symposium, ser. NDSS '09. Internet Society, 2009.
[9] M. Böhme, V.-T. Pham, and A. Roychoudhury, "Coverage-based greybox fuzzing as Markov chain," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '16. ACM, 2016, pp. 1032–1043.
[10] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
[11] D. Brumley, P. Poosankam, D. Song, and J. Zheng, "Automatic patch-based exploit generation is possible: Techniques and implications," in Proceedings of the 2008 IEEE Symposium on Security and Privacy, ser. S&P '08, 2008, pp. 143–157.
[12] D. Bruschi, L. Martignoni, and M. Monga, "Detecting self-mutating malware using control-flow graph matching," in Proceedings of the Third International Conference on Detection of Intrusions and Malware & Vulnerability Assessment, ser. DIMVA '06. Springer-Verlag, 2006, pp. 129–143.
[13] C. Cadar, D. Dunbar, and D. Engler, "KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '08. USENIX Association, 2008, pp. 209–224.
[14] S. Cesare and X. Yang, Software Similarity and Classification. Springer Science, 2012.
[15] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, and H. B. K. Tan, "BinGo: Cross-architecture cross-OS binary search," in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016. ACM, 2016, pp. 678–689.
[16] F. B. Cohen, "Operating system protection through program evolution," Comput. Secur., vol. 12, no. 6, pp. 565–584, Oct. 1993.
[17] C. Collberg, C. Thomborson, and D. Low, "Manufacturing cheap, resilient, and stealthy opaque constructs," in Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL '98, 1998, pp. 184–196.
[18] Y. David, N. Partush, and E. Yahav, "Statistical similarity of binaries," in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '16. ACM, 2016.
[19] Y. David and E. Yahav, "Tracelet-based code search in executables," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. ACM, 2014, pp. 349–360.
[20] T. Dullien and R. Rolles, "Graph-based comparison of executable objects," SSTIC, vol. 5, pp. 1–3, 2005.
[21] M. Egele, M. Woo, P. Chapman, and D. Brumley, "Blanket execution: Dynamic similarity testing for program binaries and components," in Proceedings of the 23rd USENIX Security Symposium. USENIX Association, Aug. 2014, pp. 303–317.
[22] S. Eschweiler, K. Yakdan, and E. Gerhards-Padilla, "discovRE: Efficient cross-architecture identification of bugs in binary code," in Proceedings of the 2016 Network and Distributed System Security Symposium, ser. NDSS '16. Internet Society, 2016.
[23] H. Flake, "Structural comparison of executable objects," in Proceedings of the First International Conference on Detection of Intrusions and Malware & Vulnerability Assessment, ser. DIMVA '04. Springer-Verlag, 2004, pp. 161–173.
[24] J. E. Forrester and B. P. Miller, "An empirical study of the robustness of Windows NT applications using random testing," in Proceedings of the 4th Conference on USENIX Windows Systems Symposium - Volume 4, ser. WSS '00. USENIX Association, 2000, pp. 6–6.
[25] M. Gabel, L. Jiang, and Z. Su, "Scalable detection of semantic clones," in Proceedings of the 30th International Conference on Software Engineering, ser. ICSE '08. ACM, 2008, pp. 321–330.
[26] V. Ganesh, T. Leek, and M. Rinard, "Taint-based directed whitebox fuzzing," in Proceedings of the 31st International Conference on Software Engineering, ser. ICSE '09. IEEE Computer Society, 2009, pp. 474–484.
[27] D. Gao, M. K. Reiter, and D. Song, "BinHunt: Automatically finding semantic differences in binary programs," in Proceedings of the 10th International Conference on Information and Communications Security, ser. ICICS '08. Springer-Verlag, 2008, pp. 238–255.
[28] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Machine Learning, vol. 63, no. 1, pp. 3–42, 2006.
[29] P. Godefroid, "Micro execution," in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. ACM, 2014, pp. 539–549.
[30] P. Godefroid, N. Klarlund, and K. Sen, "DART: Directed automated random testing," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05. ACM, 2005, pp. 213–223.
[31] P. Godefroid, M. Y. Levin, and D. A. Molnar, "Automated whitebox fuzz testing," in Proceedings of the Network and Distributed System Security Symposium, ser. NDSS '08. Internet Society, 2008.
[32] I. Haller, A. Slowinska, M. Neugschwandtner, and H. Bos, "Dowsing for overflows: A guided fuzzer to find buffer boundary violations," in Proceedings of the 22nd USENIX Conference on Security, ser. SEC '13. USENIX Association, 2013, pp. 49–64.
[33] M. A. Hearst, S. T. Dumais, E. Osman, J. Platt, and B. Schölkopf, "Support vector machines," IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
[34] X. Hu, T.-c. Chiueh, and K. G. Shin, "Large-scale malware indexing using function-call graphs," in Proceedings of the 16th ACM Conference on Computer and Communications Security, ser. CCS '09. ACM, 2009, pp. 611–620.
[35] Y. Hu, Y. Zhang, J. Li, and D. Gu, "Cross-architecture binary semantics understanding via similar code comparison," in Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering, ser. SANER 2016, 2016.
[36] J. Jang, M. Woo, and D. Brumley, "Towards automatic software lineage inference," in Proceedings of the 22nd USENIX Conference on Security, ser. USENIX Security '13. USENIX Association, 2013, pp. 81–96.
[37] Y. C. Jhi, X. Jia, X. Wang, S. Zhu, P. Liu, and D. Wu, "Program characterization using runtime values and its application to software plagiarism detection," IEEE Transactions on Software Engineering, vol. 41, no. 9, pp. 925–943, Sept 2015.
[38] Y.-C. Jhi, X. Wang, X. Jia, S. Zhu, P. Liu, and D. Wu, "Value-based program characterization and its application to software plagiarism detection," in Proceedings of the 33rd International Conference on Software Engineering, ser. ICSE '11. ACM, 2011, pp. 756–765.
[39] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, "DECKARD: Scalable and accurate tree-based detection of code clones," in Proceedings of the 29th International Conference on Software Engineering, ser. ICSE '07. IEEE Computer Society, 2007, pp. 96–105.
[40] L. Jiang and Z. Su, "Automatic mining of functionally equivalent code fragments via random testing," in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ser. ISSTA '09. ACM, 2009, pp. 81–92.
[41] L. Jiang, Z. Su, and E. Chiu, "Context-based detection of clone-related bugs," in Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC-FSE '07. ACM, 2007, pp. 55–64.
[42] W. Jin, S. Chaki, C. Cohen, A. Gurfinkel, J. Havrilla, C. Hines, and P. Narasimhan, "Binary function clustering using semantic hashes," in Proceedings of the 11th International Conference on Machine Learning and Applications, ser. ICMLA '12, Dec 2012, pp. 386–391.
[43] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin, "Obfuscator-LLVM: Software protection for the masses," in Proceedings of the 1st International Workshop on Software Protection, ser. SPRO '15. IEEE Press, 2015, pp. 3–9.
[44] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multilinguistic token-based code clone detection system for large scale source code," IEEE Trans. Softw. Eng., vol. 28, no. 7, pp. 654–670, Jul. 2002.
[45] P. Larsen, A. Homescu, S. Brunthaler, and M. Franz, "SoK: Automated software diversity," in Proceedings of the 2014 IEEE Symposium on Security and Privacy, ser. S&P '14, 2014, pp. 276–291.
[46] C. Lattner, "Macroscopic Data Structure Analysis and Optimization," Ph.D. dissertation, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, May 2005.
[47] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, "CP-Miner: A tool for finding copy-paste and related bugs in operating system code," in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, ser. OSDI '04. USENIX Association, 2004, pp. 20–20.
[48] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05. ACM, 2005, pp. 190–200.
[49] L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, "Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection," IEEE Transactions on Software Engineering, no. 99, pp. 1–1, January 2017.
[50] ——, "Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014. ACM, 2014, pp. 389–400.
[51] B. P. Miller, L. Fredriksen, and B. So, "An empirical study of the reliability of UNIX utilities," Commun. ACM, vol. 33, no. 12, pp. 32–44, Dec. 1990.
[52] J. Ming, F. Zhang, D. Wu, P. Liu, and S. Zhu, "Deviation-based obfuscation-resilient program equivalence checking with application to software plagiarism detection," IEEE Transactions on Reliability, vol. 65, no. 4, pp. 1647–1664, Dec 2016.
[53] J. Ming, D. Xu, Y. Jiang, and D. Wu, "BinSim: Trace-based semantic binary diffing via system call sliced segment equivalence checking," in Proceedings of the 26th USENIX Security Symposium. USENIX Association, 2017, pp. 253–270.
[54] G. Myles and C. Collberg, "Detecting software theft via whole program path birthmarks," in Proceedings of the 7th International Conference on Information Security, ser. ISC '04, 2004.
[55] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[56] J. Pewny, B. Garmany, R. Gawlik, C. Rossow, and T. Holz, "Cross-architecture bug search in binary executables," in Proceedings of the 2015 IEEE Symposium on Security and Privacy, ser. S&P '15. IEEE Computer Society, 2015, pp. 709–724.
[57] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su, "Detecting code clones in binary executables," in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ser. ISSTA '09. ACM, 2009, pp. 117–128.
[58] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660–674, 1991.
[59] D. Schuler, V. Dallmeier, and C. Lindig, "A dynamic birthmark for Java," in Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '07, 2007.
[60] E. C. R. Shin, D. Song, and R. Moazzezi, "Recognizing functions in binaries with neural networks," in Proceedings of the 24th USENIX Conference on Security Symposium, ser. SEC '15. USENIX Association, 2015, pp. 611–626.
[61] M. Sutton, A. Greene, and P. Amini, Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional, 2007.
[62] V.-T. Pham, M. Böhme, and A. Roychoudhury, "Model-based whitebox fuzzing for program binaries," in Proceedings of the 31st International Conference on Automated Software Engineering, ser. ASE '16. ACM, 2016.
[63] S. Wang, P. Wang, and D. Wu, "Semantics-aware machine learning for function recognition in binary code," in Proceedings of the 33rd IEEE International Conference on Software Maintenance and Evolution (ICSME '17), 2017.
[64] X. Wang, Y. C. Jhi, S. Zhu, and P. Liu, "Detecting software theft via system call based birthmarks," in Proceedings of the 2009 Annual Computer Security Applications Conference, 2009.
[65] F. Zhang, D. Wu, P. Liu, and S. Zhu, "Program logic based software plagiarism detection," in Proceedings of the 25th IEEE International Symposium on Software Reliability Engineering, 2014, pp. 66–77.