BinPro: A Tool for Binary Source Code Provenance

Dhaval Miyani, Zhen Huang, David Lie
University of Toronto
[email protected], [email protected], [email protected]

ABSTRACT
Enforcing licenses such as the GNU General Public License (GPL), analyzing a binary for possible vulnerabilities, and code maintenance are all situations where it is useful to be able to determine the source code provenance of a binary. While previous work has focused on either binary-to-binary similarity or source-to-source similarity, BinPro is the first work we are aware of to tackle the problem of source-to-binary similarity. BinPro can match binaries with their source code even without knowing which compiler was used to produce the binary, or what optimization level was used with the compiler. To do this, BinPro utilizes machine learning to compute optimal code features for determining binary-to-source similarity and a static analysis pipeline to extract and compute similarity based on those features. Our experiments show that on average BinPro computes a similarity of 81% for matching binaries and source code of the same applications, and an average similarity of 25% for binaries and source code of similar but different applications. This shows that BinPro's similarity score is useful for determining if a binary was derived from a particular source code.

1 INTRODUCTION
There are a number of situations where one wants to determine the source code provenance of a binary – that is, the source code from which a binary originates. This can help determine the authorship, intellectual property ownership rights and trustworthiness of a binary. For example, the GNU General Public License (GPL) [10] requires anyone who modifies and distributes an application protected by the license to make the source code modifications publicly available – a requirement that is often violated, as demonstrated by numerous GPL infringement lawsuits [24, 31, 35]. To determine if a distributed binary infringes, a plaintiff must show that the binary was derived from the source code of an application protected by the GPL, often without the cooperation of the producer of the binary. Since the GPL only requires source code to be shared if the binaries are released to the public, source code provenance can also be used by organizations and developers as an additional check to ensure that they do not mistakenly release binaries derived from GPL source code.

In other instances, a user or an organization may wish to know the source code provenance of a binary for security or maintenance purposes. Knowing the identity of the source code of a binary may help identify possible vulnerabilities. For maintenance purposes, it can help identify possible ways of updating and maintaining the binary if the original source code for the binary has been lost.

In an ideal world, the translation from source code to binary is a straightforward process, and determining whether a binary was derived from a particular source code is simply a measurement of the similarity of the source code and binary. For example, one might compute a call-graph of both binary and source code and compute the similarity of the graphs using one of a number of well-known methods [21, 33]. However, in reality, optimization applied during the compilation process can result in a binary that has a significantly different call-graph and subroutines than the original source code. These optimizations have traditionally been the bane of systems that attempt to compute binary-to-binary similarity [2, 7, 9, 11]. Unfortunately, the heuristics used to determine when and where optimizations are applied are opaque and difficult to predict, often being sensitive to a large number of factors in the source code.

Since determining the source code provenance of a binary requires access to the source code features during the matching process, we expect the accuracy of matching in this case to be better than binary-to-binary matching. In this work, we show this is indeed the case, especially when code optimizations such as function inlining are applied. Our most surprising result is that determining when and where a compiler will apply inlining can be done without analysis of the compiler code at all. Instead, we generate a training set of optimized and unoptimized binaries by simply compiling unrelated applications, and use this to train a machine learning model, which can then be used to predict when optimization will be applied by the compiler. As one would expect, this works fairly well when the compiler used for training the model is the same as the one used to produce the binary, but surprisingly, it works fairly well even when it is a different compiler, or the same compiler used with different optimization levels.

We demonstrate the utility of this approach with a tool called BinPro, which, given a binary and a source code, computes a similarity score for them. A high similarity score indicates that the binary is very likely to have been the result of compiling the source code, while a low similarity score indicates that the binary was likely compiled from some other source code. We evaluate BinPro on a corpus of applications and libraries and demonstrate that BinPro's similarity score correlates well with whether binaries and source code really match, even across different compilers and optimization levels.

In summary, we make the following contributions:

(1) We present BinPro, which is the first technique we are aware of that is able to match a program's source code and binary using a novel combination of machine learning and static analysis.
(2) We evaluate BinPro on a corpus of 10 applications and 8 libraries, and demonstrate that BinPro produces similarity scores ranging from 59% to 96% with an average of 81% for matching binaries and source code, and scores ranging from 10% to 43% with an average of 25% for non-matching binary and source code. This demonstrates that BinPro is able to determine whether a binary was derived from a particular source code or not.
(3) We show that BinPro is insensitive to different compilers or compiler optimization levels used to compile the binary. We further show that BinPro's machine learning models, when trained on one compiler, GCC, are able to predict when function inlining optimizations will be applied by a completely different compiler, the Intel ICC compiler.

The remainder of the paper is organized as follows. First, we review related work on computing binary-to-binary similarity and source-to-source similarity in Section 2. Then, we describe the design of BinPro in Section 3. Implementation details are given in Section 4 and we evaluate BinPro's effectiveness in Section 5. Finally, we conclude in Section 6.

2 RELATED WORK
Determining whether two programs are equivalent or not is reducible to the halting problem, and is thus undecidable. For the same reason, it is difficult to prove that compilers produce binary

the actual instructions or instruction sequences are different. The work in this area can be broadly classified into static and dynamic approaches.

Static approaches. Static approaches extract features from binaries, such as control flow graphs (CFGs), and then compare the graphs of the two binaries. Zynamics BinDiff [9] is an industry-standard, state-of-the-art binary diffing tool. BinDiff matches a pair of binaries using a variant of graph isomorphism: it extracts CFGs from two binaries and tries to match functions between each binary using heuristics. The major drawback of BinDiff is that it performs extremely poorly when comparing two binaries that are compiled with different optimization levels or with different compilers. Even though the programs may be functionally equivalent, many compiler optimizations affect program CFGs greatly and thus make graph matching ineffective.

Inspired by BinDiff, BinSlayer [2] and Pewny et al. [25] perform bipartite matching using the Hungarian algorithm.
This allows code that is equivalent to the input source code [18, 22]. However, them to be more resilient to CFG changes due to local compiler despite these difficulties, researchers have still made significant optimizations. DiscovRE [8] uses an even looser matching algo- progress with a number of proposals for practically measuring the rithm to match binaries using structural and numeric features of similarity of two programs. Previous work in computing program the CFG. However, despite the increased accuracy despite CFG similarity breaks down into three major sub-problems: a) measur- transformations, neither approach handles function inlining very ing the similarity of sections of binary with corpus of source code, well, which introduces code from a inlined callee function into the b) measuring the similarity of two binaries and ) measuring the caller function’s CFG. similarity of two sections of source code. We review related work While BinPro does rely on function-level code features, BinPro in the other two subproblems below. does not use features from a function’s CFG for matching. Our experiments empirically show that CFG features are unreliable 2.1 Measuring binary−source code similarity for program matching, and thus BinPro excludes them from its Hemel et al., developed the Binary Analysis Toolkit (BAT), a system matching strategy and relies on other features instead. In addition, for code clone detection in binaries to detect GPL violations [13]. It BinPro uses machine learning to predict when functions might recursively extracts strings from a binary, such as a firmware image. be inlined by the compiler, allowing BinPro to properly compute It attempts to detect cloning of code by matching strings with a similarity even for inlined functions. database of packages of GPL projects. It also attempts to detect Dynamic approaches. Rather than perform a similarity measure- similarity through and binary delta techniques. ment over statically extracted graphs, these methods measure simi- While BAT solves slightly different and harder problem in that it larity by comparing the of programs. Egele et al. propose determines whether a binary code contains statically linked Blanket Execution, BLEX [7], an engine to match functions in bina- code, it has high false negatives. This means that, unlike BinPro, ries, with the goal of either classifying malware or aiding automatic BAT finds an incorrectly matching source code for a given binary. exploit generation. BLEX uses dynamic equivalence testing, which BinSourcerer [1] and RESource [28] identify individual source func- executes the code of both binaries and compares the effects of the tions in a binary, rather than computing similarity between the executions to determine the similarity of the binaries. BLEX at- entire binary and source code. However, there are no measurements tempts to ensure that every basic block is executed at least once on real applications on how well their approaches performs. per function. However, for large functions, BLEX may not complete Di Penta et al. [6] describe a tool that identifies licensing of and in practice, analysis is halted when the execution time exceeds archives (JARs) by analyzing JAVA .class files and submit- a specified timeout value or a maximum of 10,000 instructions have ting detected package and class names to Google Code Search to been executed. identify the provenance of the code. 
However, it is much easier to BinHunt [11] generalizes individual executions by using sym- decompile JAVA byte-code and generate code almost similar to the bolic execution and a theorem prover to determine semantically original, rather than C/C++ , which is in the case of equivalent basic blocks. However, BinHunt suffers from perfor- BinPro. Davis et al. [5] also provides similar approach for finding mance bottlenecks due to its symbolic execution engine and it is provenance of JAVA applications. unclear whether it can scale to large, real-world applications. In general, while execution enables dynamic approaches to sidestep 2.2 Measuring binary similarity the unpredictability of compiler optimizations, they often come Much of the work on measuring binary code similarity has the goal with higher costs in terms of execution time and required compute of classifying malware into families by similarity, or to detect if two resources for large programs. In addition, reliable measurement binaries may be equivalent for vulnerability detection. As a result, the goal is to detect semantic similarity between binaries even if requires the execution of a reasonable fraction of the paths in a pro- match source code functions. The higher the similarity score, the gram, but the number of program paths tends to grow exponentially more likely that the binary was compiled from the given source code. with program size. BinPro is agnostic to superficial modifications, such as changing or BinPro’s analysis uses only statically extracted code features and removing non-compiled sections of source code (i.e. comments or does not consider individual instructions, making its analysis much white space) or even changing variable names, function names or closer to the static approaches outlined above. As a result, BinPro the order of declarations. However, more significant changes, such easily scales to program sizes in the hundred’s of thousands of lines as changing program structure, constant values or strings will affect of code with only modest resource requirements (10’s of gigabytes the similarity score computed by BinPro. In most cases, we expect of RAM). Despite this, BinPro’s innovations allow it to be resilient that the goal is to determine as closely as possible, whether the to compiler optimizations. given source code was used to compile the binary, as this is what is A strawman alternative to BinPro might be to compile the source usually needed for code maintenance or to detect GPL violations. code to binary and then perform one of the previously proposed We make several assumptions in this work. We assume that binary-to-binary comparisons. However, because BinPro has access binaries are stripped of symbols and have been compiled with a to features in source code, we show that these features can be used compiler that the customer does not have access to, which might to better predict compiler optimizations such as inlining, which arbitrarily apply a variety of compiler optimizations during the have frustrated previous static approaches. process of transforming source code into a binary. As a result, the major challenge for BinPro is differentiating between differences 2.3 Measuring source similarity in binaries and source code that are due to legitimate compiler The motivations for measuring source code similarity are usually to optimizations. 
To overcome this, BinPro identifies code features that detect bugs that result from copying code or to detect code plagia- are invariant under most compiler optimizations. For the remaining rism. CCFinder [17] and CP-Miner [19] analyze the token sequence optimizations that do alter these features, we use machine learning produced by a lexer and deduce equivalence when duplicate to- to train BinPro to predict when these optimizations are likely to ken sequences, which indicate potential code clones. Such systems be applied to allow BinPro to account for them. We further assume however, can only detect exact copies, otherwise known as code that the binary is not obfuscated that it can be reliably disassembled clones. using a tool such as IDAPro [14]. More sophisticated systems use abstractions of the source code Figure 1 shows the high-level work-flow of BinPro. First, a set of to gain resilience to syntactic differences in source code that don’t code features are extracted from both binary and source code. Then affect the semantic meaning of the code. For example, ReDeBug15 [ ] BinPro computes an optimal bipartite match between binary and normalizes tokens to remove semantically irrelevant information, source code functions using those features. Finally, based on their such as whitespace, comments and variable names. It then uses similarity, each pair of matched binary and source code functions normalized token sequences to find code clones with the goal of is labeled as uniquely matched, multi-matched (where a binary scalably detecting unpatched code clones in OS-distribution scale matches several source code functions equally well) or unmatched code bases. DECKARD [16] uses abstract syntax trees as an abstrac- (where a binary code function doesn’t match any source code func- tion for computing source code similarity. The trees are represented tion). The percentage of uniquely matched functions over the total as numerical vectors, which are clustered using Euclidean distance. number of functions in the binary is the similarity score for the Yamaguchi et al. [36] extend this idea by determining structural binary and source code. We now describe each of these stages in patterns in abstract syntax trees, such that each function in the more detail. code could be described as a mixture of these patterns. This repre- sentation enables identifying code similar to a known vulnerability 3.2 Feature extraction by finding functions with a similar mixture of structural patterns. To compute an optimal bipartite match, BinPro must compute a More recently, VulPecker applies machine learning to the problem weight that describes the similarity of each pair of binary and source of source code similarity for detecting vulnerabilities [20]. code functions. To do this, BinPro scans binary and source code While related, these systems solve a different problem than Bin- to obtain a list of functions in each. Each source code function Pro in that they look for semantic similarity in two code samples is uniquely identified by its name and each binary code function written in the same language. As a result, they can rely on sim- by its starting address. The code associated with each function is pler syntactic comparisons without having to perform any deep not recorded since binary code cannot be directly compared with semantic analysis. In contrast, BinPro seeks to detect similarity be- source code for similarity. 
Instead, BinPro abstracts the binary or tween source code and binary, which are syntactically different. As source code into a common set of matching features for each binary a result, BinPro uses semantic code features, such as string and in- and source code function. The weights for each function-pair can teger constants, function and library calls and function declaration then be computed by comparing their respective matching features. information to compute similarity. Bipartite matching implicitly assumes that there is a one-to-one 3 DESIGN correspondence between binary and source code functions – an assumption that is broken by function inlining optimization. To 3.1 Overview address function inlining, BinPro uses a different set of predictive BinPro takes as input a binary and source code, and computes a features in the source code to predict which function calls a compiler similarity score that is the percentage of binary code functions that is likely to inline. Based on these predictions, BinPro then creates Matching Features

pseudo-inlined functions that combine the matching features of the inlined callee function and the parent caller function, and adds them to the list of source code functions. The pseudo-inlined functions do not actually have any source code, but exist in the list of functions as candidate source functions that BinPro can match a binary code function with during the function matching stage. If the function inlining occurred during compilation, the resulting inlined binary code function will match the added pseudo-inlined function.

Figure 1: High-level work flow of BinPro. Matching and predictive features are extracted from the source code and the binary (input), functions are matched and labelled (computation of the similarity score), and the similarity score is produced (output).

We summarize the code features that BinPro uses in Table 1. Because compilers decide whether to inline functions based on the source code of the functions, the predictive features are extracted only from source code. In contrast, matching features are extracted from both source code and binaries. We now describe in detail how each of the features is extracted.

Table 1: Binary and source code features used by BinPro.

Feature             | Matching | Predictive
String constants    |    ✓     |
Integer constants   |    ✓     |
Library calls       |    ✓     |
FCG callers         |    ✓     |
FCG callees         |    ✓     |
# of function args  |    ✓     |
Static func         |          |     ✓
Extern func         |          |     ✓
Virtual func        |          |     ✓
Nested func         |          |     ✓
Variadic args       |          |     ✓
Recursion           |          |     ✓

Matching features. Features are extracted from binaries using a binary analysis tool and from source code by a compiler tool (details on the tools are given in Section 4). Source code features are extracted after pre-processing. Constant string literals can be extracted from the source code directly. In binaries, the addresses of constant string literals can be found in the constants section of the binary, and use-def analysis can be applied in both binaries and source code to find all uses of string literals in functions. Integer constants used in functions are extracted directly from program statements in the source and from assembly instructions in the binary. BinPro ignores the common constants 0, 1 and -1 and does not include those in the extracted set of features. To extract the number of function call arguments from binaries, BinPro looks for definitions of the registers used to pass arguments to callees before a function call, and uses those registers to determine the number of arguments passed. Compilers may optimize away some of these uses and definitions in certain cases, which will cause BinPro to extract the wrong number of arguments for binary code functions. The effect of this error is mitigated by the weight assigned to this feature in the machine learning phase described in Section 4.2. Previous work [3] can be used to accurately extract the number-of-arguments feature from binaries. The function call graph (FCG) features are the set of callers and callees for each function, represented as a set of function names in the source code and a set of instruction addresses in the binary.

Library calls can be identified and extracted as they are calls to functions outside of the binary or source code. In practice, some compilers perform system-specific library call substitutions (i.e. replacing a call to printf in the source code with a call to puts or vsprintf in the binary). BinPro is configured to ignore library calls that are substituted in this way. Compilers also occasionally insert functions into the binary that did not come from the source code, such as __stack_chk_fail, which GCC inserts to detect stack overflow. For our evaluation, BinPro is configured to ignore an empirically-determined set of 19 such functions.

Compilers may also insert string literals into the binary that are not present in the source code, such as pre-processor defined strings, i.e. __LINE__ or __FILE__. Another more obscure case is that format strings can be substituted at compilation time when the resulting string can be statically resolved (i.e. sprintf(str, "%d", 5)). However, these string substitutions are difficult to whitelist as the inserted string depends on both the source code and the environment where the binary was compiled (i.e. the compilation date or the compiler version). As a result, BinPro's results are negatively affected by these literals and they may lead to matching errors, but overall BinPro is able to accurately determine provenance despite the errors introduced by this artifact.

A feature often used by other code similarity work is the control flow graph (CFG) of each function [2, 9, 25]. However, we find that due to the large number of local optimizations compilers perform, there is little correlation between source code and binary CFGs, as we will elaborate further in Section 4.2. As a result, BinPro does not rely on CFG features for computing similarity.
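To make the matching features concrete, the sketch below shows one possible in-memory representation of the per-function feature record described above. It is illustrative only: the field names are our own and it does not reproduce BinPro's actual data structures, which are split across the ROSE extension, the IDA Pro scripts and the Java matcher described in Section 4.

from dataclasses import dataclass, field

@dataclass
class FunctionFeatures:
    """Matching features collected for one binary or source code function."""
    name: str                                     # function name (source) or start address (binary)
    strings: set = field(default_factory=set)     # constant string literals used in the function
    ints: set = field(default_factory=set)        # integer constants, excluding 0, 1 and -1
    lib_calls: set = field(default_factory=set)   # calls to functions outside the program
    callers: set = field(default_factory=set)     # FCG callers (names in source, addresses in binary)
    callees: set = field(default_factory=set)     # FCG callees (names in source, addresses in binary)
    num_args: int = 0                             # inferred number of call arguments

IGNORED_INTS = {0, 1, -1}                         # common constants carry no discriminating power

def add_int_constant(f, value):
    # Record an integer constant unless it is one of the ignored common values.
    if value not in IGNORED_INTS:
        f.ints.add(value)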
Predictive features. Because BinPro has access to source code, it is able to use the source code features listed in Table 1 to predict which functions are likely to be inlined. Access to source code is one of the reasons why BinPro can achieve much better results than code-similarity tools that only work with binaries [2, 9, 11]. Compilers use these features to determine which functions will be inlined. However, different compilers may make that determination differently, and even the same compiler may inline different functions depending on the level of optimization specified during compilation. Since BinPro does not assume access to the compiler or environment used to compile the binary, we cannot assume accurate information about which functions are inlined. Instead, BinPro makes an approximate prediction by training a classifier on a generic compiler used to compile a set of benchmark applications. Neither the applications nor the compiler need to be the same as the application and compiler on which BinPro is ultimately used to perform source code provenance of the binary application. The intuition is that all compilers follow similar principles of when to inline, which are applied independently of the application being compiled.

We build a corpus of inlined and non-inlined functions extracted from a variety of applications and then use this set of applications to train an Alternating Decision Tree (ADTree) classifier. We used a decision tree algorithm because we believe that it mirrors the logic that compilers tend to use to decide whether to inline a function or not. To verify this hypothesis, we evaluated different machine learning classifiers from the Weka toolkit [34] and found that ADTree was indeed the best classifier for this purpose. The trained classifier is then applied to the source code and a list of functions that are likely to be inlined is identified. As described above, additional pseudo-inlined functions are created and added to the list of source code functions. We note that BinPro does not remove the original functions, which are predicted to be inlined, from the set of source code functions. Instead, BinPro keeps both versions of the functions in the set – the pseudo-inlined functions and their corresponding original functions. In this way, if a binary code function is not inlined then it will be matched with its original source code function, while if it is inlined, then the resulting binary code function will be matched with the pseudo-inlined function.

As a result, it is important that inlining prediction is accurate. Under-predicting inlining will cause inlined binary code functions to not match any source code function even if the source code and binary do match in reality. Over-predicting inlining will cause a multitude of extra functions to be added to the source and increase the chances that a binary code function will falsely match one of these functions even if the binary and source code do not match in reality.
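As a rough illustration of this prediction step, the sketch below trains a decision tree on the predictive features of Table 1 and flags source functions that are likely to be inlined. It is a minimal stand-in that assumes scikit-learn in place of Weka's ADTree and uses hypothetical feature vectors; BinPro's actual training setup is described in Section 4.2.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One row per training function: the predictive features of Table 1.
# Columns: is_static, is_extern, is_virtual, is_nested, is_variadic, is_recursive
X_train = np.array([
    [1, 0, 0, 0, 0, 0],   # small static helper, inlined by the training compiler
    [0, 1, 0, 0, 1, 0],   # extern variadic function, not inlined
    # ... feature vectors for many functions from unrelated applications
])
y_train = np.array([1, 0])              # 1 = inlined, 0 = not inlined (from debug symbols)

clf = DecisionTreeClassifier(max_depth=5)   # stand-in for Weka's ADTree
clf.fit(X_train, y_train)

def likely_inlined(feature_vector):
    """Predict whether a compiler is likely to inline this source function."""
    return bool(clf.predict([feature_vector])[0])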
3.3 Function matching
BinPro performs function matching by computing the pairwise weight of each binary-source function pair and then performing an optimal bipartite match. By representing the similarity of the matching features between binary and source code functions as weights, BinPro can use a bipartite matching algorithm to find a pairing between binary and source code functions that minimizes the differences between the pairs. Representing function matching this way lets BinPro use known bipartite matching solutions, such as the Hungarian algorithm [21], which BinPro uses to find a sub-optimal solution in polynomial time, O(n^3).

Since BinPro extracts FCGs for both binary and source, a reasonable alternative to bipartite matching might have been to start at the entry points of both binaries (i.e., main()), and then perform a traversal of the call graph, checking each function as it is found [9]. However, this alternative will not work for three reasons. First, a function could have several callees, and there must be a way to disambiguate them. Second, edges in the call graph are often incomplete due to the use of computed function pointers, whose targets cannot be determined statically. Third, source code equivalence for shared (or dynamically linked) libraries cannot use this alternative because libraries do not necessarily have a single entry point for graph traversal. In summary, the graphs extracted from source and binary for entire programs pose problems that make graph-traversal based similarity impractical. Bipartite matching uses features that include properties other than graph properties, which allows it to bootstrap the matching.

This bootstrapping property is required because FCG features are not initially complete at the end of feature abstraction. Because instruction addresses are used to represent the callees and callers of binary code functions, while function names are used for source code functions, unless there is a mapping between instruction addresses and function names, there is no way to compare the caller and callee features for similarity. To overcome this challenge, BinPro performs function matching iteratively. Initially, FCG callers and callees are not used and BinPro relies on other features to compute the weights and perform bipartite matching. Once some functions are matched, BinPro is then able to infer a mapping between function names and instruction addresses, which BinPro uses to update the caller and callee features and recompute weights. Bipartite matching is then repeated iteratively until the matching reaches a steady state. This entire process for function matching is illustrated algorithmically in Algorithm 1.

Algorithm 1 Function matching
function FuncMatch(BinGraph, SrcGraph)
    repeat
        reGenerateCallerAndCalleeNodes(BinGraph)
        reGenerateCallerAndCalleeNodes(SrcGraph)
        ▷ Create weighted bipartite graph
        weights ← ∅
        pairs ← ∅
        for b ∈ BinGraph do
            for s ∈ SrcGraph do
                weights ← compWeights(b, s)
            end for
        end for
        ▷ Run Hungarian and assign labels
        pairs ← HungarianAlgorithm(weights)
        assignLabels(BinGraph, pairs)
    until steady_state = true
end function
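A compact way to picture this loop is the sketch below. It is a simplified illustration, not BinPro's Java implementation: it assumes SciPy's linear_sum_assignment as the Hungarian-algorithm solver, a hypothetical pair_weight() standing in for compWeights in Algorithm 1, and FunctionFeatures records like the ones sketched in Section 3.2.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_functions(bin_funcs, src_funcs, pair_weight, max_rounds=10):
    """Iteratively match binary functions to source functions (cf. Algorithm 1)."""
    addr_to_name = {}            # address-to-name mapping inferred from earlier rounds
    prev_pairs = None
    for _ in range(max_rounds):
        # Recompute the weight of every binary/source pair; the caller and callee
        # sets only contribute once addr_to_name allows them to be compared.
        weights = np.array([[pair_weight(b, s, addr_to_name) for s in src_funcs]
                            for b in bin_funcs])
        rows, cols = linear_sum_assignment(weights)    # minimum-weight assignment
        pairs = list(zip(rows, cols))
        if pairs == prev_pairs:                        # steady state reached
            break
        prev_pairs = pairs
        addr_to_name = {bin_funcs[r].name: src_funcs[c].name for r, c in pairs}
    return pairs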
Weights are calculated for each pair of binary and source code functions as follows:

    \sum_{i=1}^{N} w_i C_{f_i}        (1)

where N is the total number of features, w_i is the weight coefficient for a feature and C_{f_i} is the cost for a feature. The cost for a feature of the two sets is the same as the distance between them.

C_f is a value between 0 and 1, where smaller values indicate that the pair of functions is more similar. In the special case where the costs for all features are 1, we assign the weight for the pair to be a special amount MAX_WEIGHT, which indicates that the pair have nothing in common. MAX_WEIGHT is chosen so that it is larger than the largest possible weight that can be computed from equation 1, allowing BinPro to separate pairs that have no features in common from those with high cost during labeling.

The way a cost for a feature is computed depends on whether the feature is a set (such as for string constants, integer constants, library calls, or FCG callers and callees) or whether it is a scalar value (such as for the number of function arguments). For set features, BinPro uses a modified Jaccard index to compute their cost. Recall the standard Jaccard index:

    J(B, S) = |B \cap S| / |B \cup S|        (2)

where B and S represent a set feature from a node in the binary graph and source graph, respectively. The problem with the standard Jaccard index is that it computes the same cost regardless of whether there are extra elements in B or S. However, compiler optimizations tend to remove code features instead of adding them, so that a binary code function is more likely to have fewer features, and less likely to have extra features, than its corresponding source code function. In addition, compilers are likely to only be able to remove a few features during optimization since they have to preserve the functionality of the original code.

To take these effects into account, we modify the Jaccard index, as shown in equation 3, so that when B is a subset of S, the cost is based on the elements in S that would have to be removed to make B, and when B is not a subset of S, the cost is based on the features found in B that are not in S.

    C_f(B, S) = |S \cap B^C| / |S|    if B \subset S
    C_f(B, S) = |B \cap S^C| / |B|    otherwise        (3)

where B^C and S^C represent the set complements of B and S, respectively.
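The sketch below shows how equations 1–3 (together with the scalar argument-count cost described next) combine into a single pair weight, and could serve as the pair_weight() assumed in the earlier matching sketch. It is illustrative Python rather than BinPro's implementation; the coefficient values are placeholders standing in for the trained values in Table 2, and MAX_WEIGHT is an arbitrary large constant.

MAX_WEIGHT = 1000.0     # larger than any weight equation 1 can produce
COEFFS = {"strings": 1.0, "ints": 1.0, "libcalls": 1.0,
          "callers": 1.0, "callees": 1.0, "args": 1.0}   # placeholder weight coefficients

def set_cost(B, S):
    """Modified Jaccard cost of equation 3 (0 = identical, 1 = nothing shared)."""
    if not B and not S:
        return 0.0
    if B <= S:                        # B is a subset of S
        return len(S - B) / len(S)    # |S ∩ B^C| / |S|
    return len(B - S) / len(B)        # |B ∩ S^C| / |B|

def scalar_cost(a, b):
    """Cost for the number-of-arguments feature: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def pair_weight(bin_f, src_f, addr_to_name, coeffs=COEFFS):
    """Weighted sum of feature costs (equation 1) for one binary/source pair."""
    # Translate binary caller/callee addresses into names where a mapping is known.
    callers = {addr_to_name.get(a, a) for a in bin_f.callers}
    callees = {addr_to_name.get(a, a) for a in bin_f.callees}
    costs = {
        "strings":  set_cost(bin_f.strings, src_f.strings),
        "ints":     set_cost(bin_f.ints, src_f.ints),
        "libcalls": set_cost(bin_f.lib_calls, src_f.lib_calls),
        "callers":  set_cost(callers, src_f.callers),
        "callees":  set_cost(callees, src_f.callees),
        "args":     scalar_cost(bin_f.num_args, src_f.num_args),
    }
    if all(c == 1.0 for c in costs.values()):
        return MAX_WEIGHT             # the pair has nothing in common
    return sum(coeffs[name] * c for name, c in costs.items())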
For the number of function arguments, which is the only scalar feature, we assign a cost of 1 if the source code and binary code functions do not have the same number of arguments and 0 if they do. We treat variadic source code functions as if they have 6 arguments due to the number of registers used for passing arguments in x86-64, which is further discussed in Section 4.1.

To compute the weight coefficients, we use Sequential Minimal Optimization (SMO) [26] to train a Support Vector Machine (SVM) [4] machine learning classifier. Once the training is completed, we obtain the weight coefficient for each feature according to its importance as determined by the SVM. We discuss details on this training process in Section 4.2.

3.4 Labeling functions
Once weights are assigned to each edge in the bipartite graph, the Hungarian algorithm will determine an assignment that minimizes the total weights of all assigned function-pairs. From this, one might assume that examining the weights of the function-pairs will indicate how similar the binary and source code are overall. However, on closer inspection we find that comparing weights across different binary-source code pairs is not particularly meaningful. Using the applications evaluated in Section 5, we find that the average weight ranges from 0.6-3.8 with an average of 1.3 when the binary is compared with the source code from which it was compiled, while the average weight ranges from 2.2-3.9 with an average of 3.0 when the binary differs from the source code it is compared with. While, as expected, the average weight for the case where binary and source code differ is greater, the overlap in the ranges of average weight between these two cases motivated us to use the percentage of matched functions in the binary as a measure of similarity instead.

To compute this percentage, BinPro first labels each function-pair as matched, multi-matched or unmatched as follows. If the weight of the edge between a binary code function and a source code function is unique among all other edges to that binary code function, then it is labeled as matched. If it is not unique – i.e. the binary code function matches several source code functions equally well – then the binary code function is labeled as multi-matched. If the weight of the edge is MAX_WEIGHT, it means that there are no common matching features between the pair of functions. Such functions are labeled as unmatched.

Matched functions provide a mapping between the instruction address of binary code functions and the name of source code functions. As mentioned earlier, this mapping is then used to recompute the cost of the callee and caller set features, which in turn allows BinPro to update the weights between binary and source code functions and continue to iteratively match functions.

Finally, while functions are labeled as unmatched and multi-matched, only the percentage of matched functions is used to compute the similarity score.
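The labeling rules and the final similarity score can be summarized with a short sketch such as the one below; it assumes the weight matrix and assignment produced by the earlier matching sketch and is only meant to restate the rules above, not BinPro's implementation.

def label_pairs(weights, assignment, max_weight=1000.0, eps=1e-9):
    """Label each assigned binary function as 'matched', 'multi-matched' or 'unmatched'."""
    labels = {}
    for b, s in assignment:
        w = weights[b][s]
        if w >= max_weight:
            labels[b] = "unmatched"         # no matching features in common
        elif sum(abs(weights[b][j] - w) < eps for j in range(len(weights[b]))) > 1:
            labels[b] = "multi-matched"     # several source functions fit equally well
        else:
            labels[b] = "matched"           # unique best match
    return labels

def similarity_score(labels, num_binary_functions):
    """Percentage of binary functions with a unique match: BinPro's similarity score."""
    matched = sum(1 for lab in labels.values() if lab == "matched")
    return 100.0 * matched / num_binary_functions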
4 IMPLEMENTATION
4.1 Extracting and comparing features
Our implementation of BinPro has three major components. The first component extracts functions and features from source code, while the second component extracts functions and features from binaries. Finally, a third component performs function matching and labeling of the two sets of extracted functions and features from binary and source code. The source code feature extraction component of BinPro is implemented by extending the ROSE compiler framework [30] with 1907 lines of code (LOC). The source code extraction component processes each source file individually and outputs each function as a description in DOT format [12]. The functions are then linked together along with the other source features in Table 1 using a set of Python scripts (730 LOC) built on the PyDot [27] and NetworkX [23] libraries.

All binary features are extracted using IDA Pro [14] and a Python script. To extract the number of arguments, we wrote a Python script that uses IDA Pro's API to extract the necessary binary features to compute the number of arguments. On the x86-64 processor architecture, there are 6 dedicated registers used for passing arguments in a function call (rdi, rsi, rdx, rcx, r8, r9) [32]. However, compilers may optimize away definitions and uses of these registers, causing BinPro to only infer the number of arguments in the binary correctly 64% of the time. As a result, we found that machine learning tended to place a low weight factor on this feature, meaning that it only came into play when all the other features between two functions were very similar.

Function matching is implemented in 12,110 lines of Java code. We have based our implementation on the Graph Matching Toolkit framework by Riesen and Bunke [29], which provides an implementation of the Hungarian algorithm.
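As an illustration of how an argument count can be inferred from a disassembly, the sketch below scans backwards from a call site for writes to the six x86-64 argument registers. It is a deliberately simplified stand-in operating on plain text disassembly lines, not the IDA Pro script used by BinPro, and it shares the same limitation: if the compiler optimizes away a register definition, the inferred count is wrong.

ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]   # x86-64 argument registers, in order
ALIASES = {"edi": "rdi", "esi": "rsi", "edx": "rdx",  # 32-bit aliases also define the full register
           "ecx": "rcx", "r8d": "r8", "r9d": "r9"}

def count_call_arguments(disasm_lines, call_index, window=30):
    """Infer how many arguments a call passes by finding which argument
    registers are written shortly before the call instruction."""
    defined = set()
    for line in disasm_lines[max(0, call_index - window):call_index]:
        mnemonic, _, operands = line.strip().partition(" ")
        dest = operands.split(",")[0].strip()
        if mnemonic in {"mov", "lea", "xor", "movzx", "movsx"}:
            dest = ALIASES.get(dest, dest)
            if dest in ARG_REGS:
                defined.add(dest)
    count = 0
    for reg in ARG_REGS:          # arguments fill rdi, rsi, ... in order
        if reg not in defined:
            break
        count += 1
    return count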
4.2 Application of machine learning
We train machine learning classifiers for inlining prediction during feature extraction and to compute the optimal weights for edge weight calculation during the matching phase. The likelihood of inlining and the feature weight coefficients are dependent on how the compiler applies optimizations, which depends on the optimization level used by the compiler. For instance, there are four standard optimization levels in the GCC compiler. GCC flag -O0 will turn off all optimizations, whereas -O1 will turn on 31 different optimizations. The flag -O2 will turn on another 26 optimizations and -O3 an additional 9. BinPro is trained with a mix of the O2 and O3 optimization levels on GCC, which are the most commonly used optimization levels for production code.

To train the inlining prediction classifier, we perform a 5-fold cross evaluation across our 10 applications. The applications in the training set do not have to have any relation to the application for which BinPro is eventually used to find the source code provenance. Each application is compiled twice, once with -O2 and again with -O3 compiler optimization levels. We use the symbols within the compiled binaries to divide the functions into an inlined set and a non-inlined set. For each application, we then selected the smaller of the two sets (usually the set that was inlined) and randomly select an equal size set of functions from the other set, making a set of functions that is 50% inlined and 50% not-inlined, so that the training is not biased toward either set. For each function, we extract a feature vector containing the predictive features for the function. We then aggregate these functions and feature vectors across all applications to train our inlining predictor.

The feature weight coefficients for function matching are computed using a similar procedure. Again, we use a corpus of applications and compile each application using GCC, but only at the O2 optimization level so that only functions that are very likely to be inlined get inlined. We then exclude all the inlined functions and, using the ground truth for which source code function matches which binary code function, we create a set of function pairs for each application composed of 50% correct matches and 50% incorrect matches. Inlined functions are excluded because we cannot always combine the callee and caller features correctly, and these errors pollute the training set. We then train the weights so that the SVM classifier is able to maximally classify these two sets correctly across all applications in the training set. We perform a 5-fold cross evaluation across our 10 applications and tabulate the average computed weights from this 5-fold evaluation in Table 2, which shows that FCG callees and callers, as well as string constants, are the main features used for matching as they have the heaviest weights. String constants are present in 55% of functions and, combined with their high discriminating power (as indicated by the high weight coefficient), they greatly contribute to BinPro's effectiveness. The number of function arguments has some discriminating power, but is weighted lower because compilers may optimize away the uses and definitions of function arguments, causing BinPro to incorrectly extract the number of arguments from binaries 36% of the time. While still useful, integer constants and library calls have less discriminating power because they both tend to be fairly constant in many functions.

Table 2: Weight coefficients used in function matching.

Feature Name        | Weight
String constants    | 1.469
Integer constants   | 0.6315
Library calls       | 0.2828
FCG callers         | 2.9293
FCG callees         | 2.9293
# of function args  | 0.9296
CFG branches        | 0.0002

As mentioned earlier, we originally thought BinPro could also use CFG features, such as the number of conditional branches in a function, as a feature for matching functions. However, after training the classifier, we find that it assigns very little weight to the feature because the CFGs of functions change dramatically during compilation, and thus do not provide a good measure of similarity if standard compiler optimizations are applied. In practice, using this noisy feature even with a low weight leads to more false matches, so BinPro does not use CFG features in its matching algorithm.

5 EVALUATION
To see how effective BinPro is at computing similarity between a binary and source code, we evaluate BinPro on the following aspects:

• Matching code: How does BinPro score the similarity of a binary and its matching source code, i.e. the source code for the particular binary?
• Non-matching code: How does BinPro score the similarity of a binary and its non-matching source code, i.e. the source code for a different binary?
• Compiler sensitivity: How does BinPro's similarity score vary with binaries compiled with different compilers and different compiler optimization levels?
• Scalability: How does BinPro's performance scale with code size and complexity?
• Inlining prediction: How well does BinPro's inlining predictor work?

We evaluate BinPro on a set of 10 executable applications: Proftpd 1.3.4b, Busybox 1.23.2, Apache (HTTP) 2.4.12, Bind 9.10.2, Mongoose

Figure 2: BinPro matching accuracy on binaries and matching source code. For each application, the bars show the percentage of functions labeled Correct-Match, Incorrect-Match, UnMatch and Multi-Match.

Figure 3: BinPro's matching accuracy on four sets of three similar applications (Wolfssl/libgcrypt/Tomcrypt, Curl/libxml/libxmp, Proftpd/Httpd/Lighttpd, OpenSSH/OpenVPN/Transmission). Each bar represents BinPro's matching results (Match, UnMatch, Multi-Match) on non-matching binary and source code.

2.1.0, OpenSSH 7.1p2, Dropbear 2015.71, OpenVPN 2.3.10, Trans- code because our IDAPro scripts are implemented for x86-64 binary mission 2.84, Lighttpd 1.4.39, and 8 libraries: libxml1, Wolfssl 3.9.10, code only. libgcrypt 1.7.2, libxmp 4.4.02, Zlib 1.2.8, Curl 7.50.3, Tomcrypt, Im- ageMagick. For this study, we use virtual machines running 5.1 Matching code 14.04 on Intel Core i7-2600 CPUs (4 cores @ 3.4 GHz) with 32 GB We first evaluate BinPro’s accuracy on a binary and the source of memory. code it was compiled from. All applications in our dataset are open- Our BinPro prototype currently supports only code that are fully source and we use their default build configuration to compile compatible with GCC or the ICC Intel compiler because the ROSE binaries for this experiment. We then run BinPro with these binaries framework used by BinPro does not work reliably with code written and their source code. Figure 2 gives the percentage of functions for other C/C++ compilers. BinPro also supports only x86-64 binary that BinPro labels as match, un-match and multi-match for each application. All results are generated using a 5-fold cross-evaluation 1http://xmlsoft.org/no prob on the training sets used to build our machine learning models 2http://xmp.sourceforge.net/ so that the applications used for training are never in the set of libxml and libxmp are all applications that involve com- Unmatch plex inputs; the third set, ProFTPd, Httpd and Lighttpd are all web 2% servers; and the fourth set, OpenSSH, OpenVPN and Transmission Multi-Match are all network utilities. We run BinPro on the binary of each appli- 17% cation against the source code of the other 2 applications in each set to obtain 6 measurements for each set. Figure 3 shows the result of this experiment. When the binary is not derived from the source code, BinPro computes significantly lower similarity scores. BinPro produces scores from 10% to 43% with an average of 25% across all IC-Match 24 comparisons. Compared with the the lowest similarity score of 2% 59% in the correct match case above, this shows that BinPro has good discriminating ability to identify if a binary was compiled from a particular source code.

C-Match 5.3 Compiler sensitivity 79% We now evaluate how sensitive BinPro’s similarity scores are to dif- ferent compilers and optimization levels. We select 10 applications Figure 4: Breakdown of BinPro’s matching results. from our dataset. Each application is compiled into three binaries, one compiled with GCC 4.8.2 using optimization level O2, one also applications used for evaluation. BinPro produces similarity scores with GCC using optimization level O3, and one compiled with ICC between 59% and 96%, with an average of 81%. The reason BinPro Intel compiler 15.0.3 using the highest optimization level O3. We produces low similarity score of 59% for Zlib library is because Zlib believe compilers aggressively perform source code optimization is a very small application, consists of only 127 functions out of with level O3 than O2. As a result, optimization level O2 with ICC which 47 are library functions. A lot of these functions are very was not included in the evaluation. simple, with little-to-none features, causing BinPro to not find We follow the same procedure as in Section 5.1 to run BinPro unique matches to source code functions and instead, it finds large on the matching code for this set of applications and present the percentage of multi-matched functions. result in Figure 5. The largest change in similarity score due to By generating debug symbols, we obtain the ground truth map- changing compilers and optimization levels is 9.4%, which we note ping between binary code functions and their corresponding source is still smaller than the smallest difference between similarity scores code functions. Using this information, we compute how often func- between a binary and the source code it was compiled from and tion matching correctly assigns a binary code function to the correct a binary and a different source code. The average change dueto source code function. The number of function-pairs that BinPro different compilation is 4.0%. labels as matched that are actually matched to the wrong func- Higher levels of optimization lead to more incorrectly matched tions in the application is also given in Figure 2. Figure 4 gives functions, where BinPro labels a pair as matched, but the functions a breakdown of BinPro’s average results across the applications. do not actually match. However, even in the most adversarial case, “C-” means that BinPro correctly matched the binary code function where BinPro is trained on one compiler, GCC, and evaluated on while “IC-” means that BinPro matched the binary code function binaries compiled with a completely different compiler, ICC, the with the incorrect source code function. Across the applications, rate of incorrectly matched function pairs only increases by an BinPro correctly matches functions on average of 79.2% of the time, average of 2.5%. whereas it incorrectly matches 1.9% of functions. Finally, 16.6% of functions are marked as multi-matched and 1.9% as un-matched. 5.4 Scalability Upon closer inspection we find that the incorrectly matched We evaluate BinPro’s execution time against the code size and the functions tend to be simple and have few matching features, making number of functions. We find that BinPro execution time varies it easy for BinPro to confuse them. 
When examining the multi- considerably, from several seconds for small applications with a matched functions, we find that the correct match is inside theset few thousand lines of code and hundreds of functions, to almost 5 of candidates, but that all candidates just have very similar matching hours for Bind, which has more than half-a-million lines of code features. Unmatched functions are all simple functions that have and almost 5 thousand functions. BinPro’s execution time is roughly no matching features, and whose callers are multi-matched so the proportional to the number of functions as opposed to code size. We iterations during function matching never map them to a source believe this is due to BinPro’s algorithm, which extracts features and code function. performs matching at a function granularity. Details on application size and time it takes for BinPro to perform its analysis is given in 5.2 Non-matching code Table 3. Upon closer profiling of BinPro, we see that on average We evaluate how BinPro scores the similarity of a binary and its 85% of the execution of BinPro is used on loading the function call non-matching source code. We created 4 sets of 3 applications with graphs (FCG) into memory. Our virtual machines have a maximum similar functionality. The first set, Wolfssl, libgcrypt and Tomcrypt of 32GB of memory and as a result larger applications that consume are all libraries of cryptographic functions; the second set, Curl, a large amount of memory are affected most from this restriction, 100% 80% 60% 40% 20% 0% O2 O3 O2 O3 O2 O3 O2 O3 O2 O3 O2 O3 O2 O3 O2 O3 O2 O3 O2 O3 iO3 iO3 iO3 iO3 iO3 iO3 iO3 iO3 iO3 iO3 ------Bind Bind Bind Apache Apache Proftpd Proftpd Apache Proftpd Lighttpd Lighttpd Busybox Busybox Lighttpd Busybox OpenSSH OpenSSH Dropbear Dropbear OpenVPN OpenVPN OpenSSH Dropbear OpenVPN Mongoose Mongoose Mongoose Transmission Transmission Transmission Correct-Match Incorrect-Match UnMatch Multi-Match

Figure 5: BinPro’s matching accuracy across different compilers and optimization levels. Each application is compiled with GCC -O2 and -O3 compilation flag, and Intel’s ICC -O3 (iO3).

Table 3: BinPro execution time (in minutes) compared to ap- predicts inlining on 17% of non-inlined functions. This confirms plication LOC (in thousands) and number of functions. that the accuracy of the inline predictor across different compilers is one of the reasons that BinPro is able to achieve high accuracy Application LOC # of functions Time (min) even across different compilers. Zlib 25k 127 0.02 Mongoose 4k 319 0.04 6 CONCLUSION Lighttpd 101k 458 0.5 BinPro measures similarity between a binary and a source code, Dropbear 106k 663 0.6 and with this measurement, is able to determine if a binary was Wolfssl 207k 673 0.4 compiled from a source code it is compared with. When evaluated libxmp 46k 709 0.6 across 18 executable applications and libraries, BinPro computes an Tomcrypt 102k 900 1 average similarity of 81% when binary and source code match, and Curl 237k 1020 5 only 25% when they don’t match. Moreover, the highest similarity libgcrypt 252k 1303 5 for a non-matching pair of binary and source code is 43% while the OpenVPN 119k 1403 6 lowest similarity for a matching pair is 59% indicating that BinPro Apache 270k 1270 3 can accurately tell if a binary was compiled from a source code Proftpd 300k 1332 3 with no incorrect results. OpenSSH 143k 1574 5 BinPro’s most interesting result is that it is able to correctly match functions in binaries with their source code counterparts Transmission 175k 1574 5 79% of the time on average, regardless of the compiler or optimiza- libxml 532k 2472 9 tion level used. BinPro achieves this by using machine learning to Busybox 325k 2664 75 predict when compiler optimizations might be applied and uses ImageMagick 617k 3268 110 a set of matching code features combined with bipartite match- Bind 619k 4764 294 ing to find the best match between binary code and source code functions. By repeatedly performing the matching, BinPro is able as demonstrated by the rapid increase in execution time for large iteratively incorporate function call graph callers and callees into applications. the matching process. The careful selection of matching features, use of machine learning, use of bipartite matching and iterative 5.5 Inlining prediction matching all contribute to BinPro’s accuracy. Finally, we evaluate the inlining predictor in isolation to see how REFERENCES well it predicts when inline optimization will occur. We performed [1] BinSigma. 2017. (2017). https://github.com/BinSigma/BinSourcerer/, Last ac- a 5-fold cross evaluation across the same 10 applications as before. cessed: May 2017. The predictor is trained on GCC as described in Section 4.2 and then [2] Martial Bourquin, Andy King, and Edward Robbins. 2013. BinSlayer: Accurate Comparison of Binary . In Proceedings of the 2nd ACM SIGPLAN evaluated on the Intel ICC compiler. The inlining prediction cor- Program Protection and Workshop (PPREW ’13). ACM, New rectly predicts inlining on 73% of inlined functions and incorrectly York, NY, USA, Article 4, 10 pages. https://doi.org/10.1145/2430553.2430557 [3] Juan Caballero, Noah M. Johnson, Stephen McCamant, and Dawn Song. 2009. [28] Ashkan Rahimian, Philippe Charland, Stere Preda, and Mourad Debbabi. 2012. Binary Code Extraction and Interface Identification for Security Applications. Tech- RESource: a framework for online matching of assembly with open source code. nical Report UCB/EECS-2009-133. EECS Department, University of California, In International Symposium on Foundations and Practice of Security. Springer, Berkeley. 
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-133. 211–226. html [29] Kaspar Riesen, Michel Neuhaus, and Horst Bunke. 2007. Bipartite Graph [4] V. Vapnik C.Cortes. 1995. Support-vector networks. Machine Learning 20 (1995), Matching for Computing the Edit Distance of Graphs. In Proceedings of the 273. https://doi.org/110.1007/BF00994018 6th IAPR-TC-15 International Conference on Graph-based Representations in Pat- [5] Julius Davies, Daniel M German, Michael W Godfrey, and Abram Hindle. 2011. tern Recognition (GbRPR’07). Springer-Verlag, Berlin, Heidelberg, 1–12. http: Software bertillonage: finding the provenance of an entity. In Proceedings of the //dl.acm.org/citation.cfm?id=1769371.1769373 8th working conference on mining software repositories. ACM, 183–192. [30] ROSE compiler. 2017. (2017). http://rosecompiler.org/, Last accessed: Feb 2017. [6] Massimiliano Di Penta, Daniel M German, and Giuliano Antoniol. 2010. Identi- [31] John Saddington. 2013. The Story of BusyBox and the First GPL Lawsuit. (March fying licensing of jar archives using a code-search approach. In Mining Software 2013). https://torquemag.io/2013/03/busybox/ Last Accessed: Feb, 2017. Repositories (MSR), 2010 7th IEEE Working Conference on. IEEE, 151–160. [32] System V Application Binary Interface AMD64 Architecture Processor Supple- [7] Manuel Egele, Maverick Woo, Peter Chapman, and David Brumley. 2014. Blanket ment. 2017. (2017). https://en.wikipedia.org/wiki/X86_calling_conventions# Execution: Dynamic Similarity Testing for Program Binaries and Components. cite_note-AMD-14, Last accessed: Feb 2017. In 23rd USENIX Security Symposium (USENIX Security 14). USENIX Association, [33] Julian Ullmann. 1976. An algorithm for subgraph isomorphism. Journal of the San Diego, CA, 303–317. https://www.usenix.org/conference/usenixsecurity14/ ACM (JACM) 23, 1 (1976), 31–42. technical-sessions/presentation/egele [34] Weka 3: Data Mining Software in Java. 2017. (2017). http://www.cs.waikato.ac. [8] Eschweiler, Sebastian, Yakdan, Khaled, and Gerhards-Padilla, Elmar. 2016. Dis- nz/ml/weka/, Last accessed: Feb 2017. covRE: Efficient Cross-Architecture Identification of Bugs in Binary Code.In [35] Aaron Williamson. 2014. Lawsuit threatens to break new ground on the GPL Proceedings of the 23th Symposium on Network and Distributed System Security and software licensing issues. (July 2014). https://opensource.com/law/14/7/ (NDSS). lawsuit-threatens-break-new-ground-gpl-and-software-licensing-issues Last [9] Halvar Flake. 2004. Structural Comparison of Executable Objects. In In Proceed- Accessed: Feb, 2017. ings of the IEEE Conference on Detection of Intrusions and Malware & Vulnerability [36] Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. 2012. Generalized Assessment (DIMVA). 161–173. vulnerability extrapolation using abstract syntax trees. In Proceedings of the 28th [10] Foundation. 2007. GNU General Public License. (June 2007). Annual Computer Security Applications Conference. ACM, 359–368. https://www.gnu.org/copyleft/gpl.html Last accessed: February, 2017. [11] Debin Gao, Michael K Reiter, and Dawn Song. 2008. BinHunt: Automatically finding semantic differences in binary programs. In International Conference on Information and Communications Security. Springer, 238–255. [12] Graphviz. 2017. (2017). http://www.graphviz.org/, Last accessed: Feb 2017. [13] Armijn Hemel, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Dolstra. 2011. Finding violations through binary code clone detection. 
In Proceedings of the 8th Working Conference on Mining Software Repositories. ACM, 63–72. [14] IDA Pro. 2017. (2017). https://www.hex-rays.com/products/ida/, Last accessed: Feb 2017. [15] Jiyong Jang, Ankit Agrawal, and David Brumley. 2012. ReDeBug: finding un- patched code clones in entire os distributions. In Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 48–62. [16] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, 96–105. [17] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. 2002. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. Software Engineering, IEEE Transactions on 28, 7 (2002), 654–670. [18] Xavier Leroy. 2009. Formal verification of a realistic compiler. Commun. ACM 52, 7 (2009), 107–115. [19] Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. 2004. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, USA, 20–20. http://dl.acm.org/citation.cfm?id=1251254.1251274 [20] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Hanchao Qi, and Jie Hu. VulPecker: an automated vulnerability detection system based on code similarity analysis. In Proceedings of the 32nd Annual Conference on Computer Security Applications (2016). ACM, 201–213. [21] J. Munkres. 1957. Algorithms for the Assignment and Transportation Problems. Journal of the Society of Industrial and Applied Mathematics 5, 1 (March 1957), 32–38. [22] George C. Necula. 2000. Translation validation for an . In Proceedings of the 2000 ACM SIGPLAN conference on Design and Implementation (PLDI), Vol. 35. ACM, 83–94. [23] NetworkX. 2017. (2017). https://networkx.github.io/, Last accessed: Feb 2017. [24] Jennifer Buchanan O’Neill and Christopher J. Gaspar. What Can Decisions by European Courts Teach Us About the Future of Open-Source Litigation in the United States. 38 (????), 437. http://heinonline.org/hol-cgi-bin/get_pdf.cgi? handle=hein.journals/aiplaqj38§ion=21 [25] Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten Holz. 2015. Cross-Architecture Bug Search in Binary Executables. In Proceedings of the 2015 IEEE Symposium on Security and Privacy. IEEE, 709–724. https: //doi.org/10.1109/SP.2015.49 [26] John Platt. 1998. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report MSR-TR-98-14. Research. 21 pages. http://research.microsoft.com/apps/pubs/default.aspx?id=69644 [27] Pydot. 2017. (2017). https://pypi.python.org/pypi/pydot, Last accessed: Feb 2017.