
To appear at the 2018 Network and Distributed System Security Symposium (NDSS).

When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries

Aylin Caliskan∗, Fabian Yamaguchi†, Edwin Dauber‡, Richard Harang§, Konrad Rieck¶, Rachel Greenstadt‡ and Arvind Narayanan∗
∗Princeton University, {aylinc@, arvindn@cs}.princeton.edu
†Shiftleft Inc, [email protected]
‡Drexel University, {egd34, rachel.a.greenstadt}@drexel.edu
§Sophos, Data Science Team, [email protected]
¶TU Braunschweig, [email protected]

Abstract—The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. While recent work found that source code can be attributed to authors with high accuracy, attribution of executable binaries appears to be much more difficult. Many distinguishing features present in source code, e.g. variable names, are removed in the compilation process, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship. We examine programmer de-anonymization from the standpoint of machine learning, using a novel set of features that include ones obtained by decompiling the executable binary to source code. We adapt a powerful set of techniques from the domain of source code authorship attribution along with stylistic representations embedded in assembly, resulting in successful de-anonymization of a large set of programmers.

We evaluate our approach on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 100 and 83% with 600 candidate programmers. We present an executable binary authorship attribution approach, for the first time, that is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. We perform programmer de-anonymization using both obfuscated binaries, and real-world code found "in the wild" in single-author GitHub repositories and the recently leaked Nulled.IO hacker forum. We show that programmers who would like to remain anonymous need to take extreme countermeasures to protect their privacy.

I. INTRODUCTION

If we encounter an executable binary sample in the wild, what can we learn from it? In this work, we show that the programmer's stylistic fingerprint, or coding style, is preserved in the compilation process and can be extracted from the executable binary. This means that it may be possible to infer the programmer's identity if we have a set of known potential candidate programmers, along with executable binary samples (or source code) known to be authored by these candidates.

Programmer de-anonymization from executable binaries has implications for privacy and anonymity. Perhaps the creator of a censorship circumvention tool distributes it anonymously, fearing repression. Our work shows that such a programmer might be de-anonymized. Further, there are applications for software forensics, for example to help adjudicate cases of disputed authorship or copyright.

The White House Cyber R&D Plan states that "effective deterrence must raise the cost of malicious cyber activities, lower their gains, and convince adversaries that such activities can be attributed [42]." The DARPA Enhanced Attribution program calls for methods that can "consistently identify virtual personas and individual malicious cyber operators over time and across different endpoint devices and C2 infrastructures [25]." While the forensic applications are important, as attribution methods develop, they will threaten the anonymity of privacy-minded individuals at least as much as malicious actors.

We introduce the first part of our approach by significantly outperforming the previous attempt at de-anonymizing programmers by Rosenblum et al. [39]. We improve their accuracy of 51% in de-anonymizing 191 programmers to 92% and then we scale the results to 83% accuracy on 600 programmers.
First, whereas Rosenblum et al. extract structures such as control-flow graphs directly from the executable binaries, our work is the first to show that automated decompilation of executable binaries gives additional categories of useful features. Specifically, we generate abstract syntax trees of decompiled source code. Abstract syntax trees have been shown to greatly improve author attribution of source code [16]. We find that syntactical properties derived from these trees also improve the accuracy of executable binary attribution techniques.

Second, we demonstrate that using multiple tools for disassembly and decompilation in parallel increases the accuracy of de-anonymization by generating different representations of code that capture various aspects of the programmer's style. We present a robust machine learning framework based on entropy and correlation for dimensionality reduction, followed by random-forest classification, that allows us to effectively use disparate types of features in conjunction without overfitting.

These innovations allow us to de-anonymize a large set of real-world programmers with high accuracy. We perform experiments with a controlled dataset collected from Google Code Jam (GCJ), allowing a direct comparison to previous work that used samples from GCJ. The results of these experiments are discussed in detail in Section V. Specifically, we can distinguish between thirty times as many candidate programmers (600 vs. 20) with higher accuracy, while utilizing less training data and far fewer stylistic features (53) per programmer. The accuracy of our method degrades gracefully as the number of programmers increases, and we present experiments with as many as 600 programmers. Similarly, we are able to tolerate scarcity of training data: our accuracy for de-anonymizing sets of 20 candidate programmers with just a single training sample per programmer is over 75%.

Third, we find that traditional binary obfuscation, enabling compiler optimizations, or stripping debugging symbols in executable binaries results in only a modest decrease in de-anonymization accuracy. These results, described in Section VI, are an important step toward establishing the practical significance of the method.
The fact that coding style survives compilation is unintuitive, and may leave the reader wanting a "sanity check" or an explanation for why this is possible. In Section V-J, we present several experiments that help illuminate this mystery. First, we show that decompiled source code is not necessarily similar to the original source code in terms of the features that we use; rather, the feature vector obtained from disassembly and decompilation can be used to predict, using machine learning, the features in the original source code. Even if no individual feature is well preserved, there is enough information in the vector as a whole to enable this prediction. On average, the cosine similarity between the original feature vector and the reconstructed vector is over 80%. Further, we investigate factors that are correlated with coding style being well-preserved, and find that more skilled programmers are more fingerprintable. This suggests that programmers gradually acquire their own unique style as they gain experience.

All these experiments were carried out using the GCJ dataset; the availability of this dataset is a boon for research in this area since it allows us to develop and benchmark our results under controlled settings [39], [9]. Having done that, we present the first ever de-anonymization study on an uncontrolled real-world dataset collected from GitHub in Section VI-D. This data presents difficulties, particularly noise in ground truth because of library and code reuse. However, we show that we can handle a noisy dataset of 50 programmers found in the wild with 65% accuracy and further extend our method to tackle open world scenarios.
We also present a case study using code found via the recently leaked Nulled.IO hacker forum. We were able to find four forum members who, in private messages, linked to executables they had authored (one of which had only one sample). Our approach correctly attributed the three individuals who had enough data to build a model, and correctly rejected the fourth sample as belonging to none of the previous three.

We emphasize that research challenges remain before programmer de-anonymization from executable binaries is fully ready for practical use. For example, programs may be authored by multiple programmers and may have gone through encryption. We have not performed experiments that model these scenarios, which require different machine learning and segmentation techniques, and we mainly focus on the privacy implications. Nonetheless, we present a robust and principled programmer de-anonymization method with a new approach and for the first time explore various realistic scenarios. Accordingly, our effective framework raises immediate concerns for privacy and anonymity.

The remainder of this paper is structured as follows. We begin by formulating the research question investigated throughout this paper in Section II, and discuss closely related work on de-anonymization in Section III. We proceed to describe our novel approach for binary authorship attribution based on instruction information, control flow graphs, and decompiled code in Section IV. Our experimental results are described in Section V, followed by a discussion of results in Section VII. Finally, we shed light on the limitations of our method in Section VIII and conclude in Section IX.

II. PROBLEM STATEMENT

In this work, we consider an analyst interested in determining the author of an executable binary purely based on its style. Moreover, we assume that the analyst only has access to executable binary samples, each assigned to one of a set of candidate programmers.

Depending on the context, the analyst's goal might be defensive or offensive in nature. For example, the analyst might be trying to identify a misbehaving employee who violates the non-compete clause in his company by launching an application related to what he does at work. By contrast, the analyst might belong to a surveillance agency in an oppressive regime who tries to unmask anonymous programmers. The regime might have made it unlawful for its citizens to use certain types of programs, such as censorship-circumvention tools, and might want to punish the programmers of any such tools. If executable binary stylometry is possible, it means that compiled and cryptic code does not guarantee anonymity. Because of its potential dual use, executable binary stylometry is of interest to both security and privacy researchers.

In either (defensive or offensive) case, the analyst (or adversary) will seek to obtain labeled executable binary samples from each of these programmers who may have potentially authored the anonymous executable binary. The analyst proceeds by converting each labeled sample into a numerical feature vector, and subsequently deriving a classifier from these vectors using machine learning techniques. This classifier can then be used to attribute the anonymous executable binary to the most likely programmer.

Since we assume that a set of candidate programmers is known, we treat our main problem as a closed world, supervised machine learning task. It is a multi-class machine learning problem where the classifier calculates the most likely author for the anonymous executable binary sample among multiple authors. We also present experiments on an open-world scenario in Section VI-E.
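The closed-world formulation above maps directly onto a standard multi-class learning pipeline. The following minimal sketch (Python with scikit-learn, which is an assumption for illustration; the paper's own tooling is WEKA-based) shows the workflow: labeled binaries are turned into numeric feature vectors, a classifier is fit on them, and the anonymous sample is assigned to the most likely candidate. The function and variable names are hypothetical.

    # Hypothetical sketch of the closed-world attribution workflow described above.
    # Feature extraction is abstracted behind extract_features(); the real pipeline
    # builds vectors from disassembly, CFGs, and decompiled code (Section IV).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def extract_features(binary_path):
        """Placeholder: return a fixed-length numeric feature vector for one binary."""
        raise NotImplementedError

    def attribute(labeled_samples, anonymous_binary):
        # labeled_samples: list of (binary_path, author_name) pairs
        X = np.array([extract_features(p) for p, _ in labeled_samples])
        y = np.array([author for _, author in labeled_samples])
        clf = RandomForestClassifier(n_estimators=500, random_state=0)
        clf.fit(X, y)
        # The classifier returns the most likely candidate programmer.
        return clf.predict([extract_features(anonymous_binary)])[0]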

Stylistic Fingerprints. An analyst is interested in identifying stylistic fingerprints in binary code to show that compiling source code does not anonymize it. The analyst engineers the numeric representations of stylistic properties that can be derived from binary code. To do so, the analyst generates representations of the program from the binary code. First, she uses a disassembler to obtain the low level features in assembly code. Second, she uses a decompiler to generate the control flow graph to capture the flow of the program. Lastly, she utilizes a decompiler to convert the low level instructions to high level decompiled source code in order to obtain abstract syntax trees. The analyst uses these three data formats to numerically represent the stylistic properties embedded in binary code. Given a set of labeled binary code samples with known authors, the analyst constructs the numeric representation of each sample. The analyst determines the set of stylistic features by calculating how much entropy each numeric value has in correctly differentiating between authors. She can further analyze how programmers' stylistic properties are preserved in a transformed format after compilation. Consequently, the analyst is able to quantify the level of anonymization and the amount of preserved stylistic fingerprints in binary code that has gone through compilation.

Additional Assumptions. For our experiments, we assume that we know the compiler used for a given program binary. Previous work has shown that with only 20 executable binary samples per compiler as training data, it is possible to use a linear Conditional Random Field (CRF) to determine the compiler used with accuracy of 93% on average [41], [27]. Other work has shown that by using pattern matching, library functions can be identified with precision and recall between 0.98 and 1.00 based on each one of three criteria: compiler version, library version, and Linux distribution [23].

In addition to knowing the compiler, we assume to know the optimization level used for compilation of the binary. Past work has shown that toolchain provenance, including compiler family, version, optimization, and source language, can be identified with a linear CRF with accuracy of 99% for language, compiler family, and optimization and with 92% for compiler version [40]. Based on this success, we make the assumption that these techniques will be used to identify the toolchain provenance of the executable binaries of interest and that our method will be trained using the same toolchain.
III. RELATED WORK

Any domain of creative expression allows authors or creators to develop a unique style, and we might expect that there are algorithmic techniques to identify authors based on their style. This class of techniques is called stylometry. Natural-language stylometry, in particular, is well over a century old [31]. Other domains such as source code and music also have stylistic features, especially grammar. Therefore stylometry is applicable to these domains as well, often using strikingly similar techniques [45], [10].

Linguistic stylometry. The state of the art in linguistic stylometry is dominated by machine-learning techniques [6], [32], [7]. Linguistic stylometry has been applied successfully to security and privacy problems, for example Narayanan et al. used stylometry to identify anonymous bloggers in large datasets, exposing privacy issues [32]. On the other hand, stylometry has also been used for forensics in underground cyber forums. In these forums, the text consists of a mixture of languages and information about forum products, which makes it more challenging to identify personal writing style. Not only have the forum users been de-anonymized but also their multiple identities across and within forums have been linked through stylometric analysis [7].

Authors may deliberately try to obfuscate or anonymize their writing style [12], [6], [30]. Brennan et al. show how stylometric authorship attribution can be evaded with adversarial stylometry [12]. They present two ways for adversarial stylometry, namely obfuscating writing style and imitating someone else's writing style. Afroz et al. identify the stylistic changes in a piece of writing that has been obfuscated, while McDonald et al. present a method to make writing style modification recommendations to anonymize an undisputed document [6], [30].

Source code stylometry. Several authors have applied similar techniques to identify programmers based on source code [16], [34], [15]. Source code authorship attribution has applications in software forensics and plagiarism detection.¹ The features used for machine learning in source code authorship attribution range from simple byte-level [20] and word-level n-grams [13], [14] to more evolved structural features obtained from abstract syntax trees [16], [34]. In particular, Burrows et al. present an approach based on n-grams that reaches an accuracy of 77% in differentiating 10 different programmers [14].

Similarly, Kothari et al. combine n-grams with lexical markers such as the line length to build programmer profiles that allow them to identify 12 authors with an accuracy of 76% [26]. Lange et al. further show that metrics based on layout and lexical features along with a genetic algorithm reach an accuracy of 75% in de-anonymizing 20 authors [28]. Finally, Caliskan-Islam et al. incorporate abstract syntax tree based structural features to represent programmers' coding style [16]. They reach 94% accuracy in identifying 1,600 programmers of the GCJ data set.

Executable binary stylometry. In contrast, identifying programmers from compiled code is considerably more difficult and has received little attention to date. Code compilation results in a loss of information and obstructs stylistic features. We are aware of only two prior works, both of which perform their evaluation and experiments on controlled corpora that are not noisy, such as the GCJ dataset and student homework assignments [39], [9]. Our work significantly outperforms previous work by using different methods, and in addition we investigate noisy real-world datasets, an open-world setting, effects of optimizations, and obfuscations. Alrabaee et al. [9] present an onion approach for binary code authorship attribution, while Rosenblum et al. [39] identify authors of program binaries. Both utilize the GCJ corpus.

¹ Note that popular plagiarism-detection tools such as Moss are not based on stylometry; rather they detect code that may have been copied, possibly with modifications. This is an orthogonal problem [8].

Rosenblum et al. present two main machine learning tasks based on programmer de-anonymization. One is based on supervised classification with a support vector machine to identify the authors of compiled code [18]. The second machine learning approach they use is based on clustering to group together programs written by the same programmers. They incorporate a distance based similarity metric to differentiate between features related to programmer style to increase clustering accuracy. They use the Paradyn project's Parse API for parsing executable binaries to get the instruction sequences and control flow graphs, whereas we use four different resources to parse executable binaries to generate a richer representation. Their dataset consists of submissions from GCJ and homework assignments with skeleton code.

Malware attribution. While the analysis of malware is a well developed field, authorship attribution of malware has received much less attention. Stylometry may have a role in this application, and this is a ripe area for future work that requires automated packer and encryption detection along with binary segment and metadata analysis. The difficulty in obtaining ground truth labels for malware samples has led much work in this area to focus on clustering malware in some fashion, and the wide range of obfuscation techniques in common use has led many researchers to focus on dynamic analysis rather than the static features we consider. The work of [29] examines several static features intended to provide credible links between executable malware binaries produced by the same authors; however, many of these features are specific to malware, such as command and control infrastructure and data exfiltration methods, and the authors note that many must be extracted by hand. In dynamic analysis, the work of [35] examines information obtained via both static and dynamic analysis of malware samples to organize code samples into lineages that indicate the order in which samples are derived from each other.
[11] convert detailed execution traces from dynamic analysis into more general behavioral profiles, which are then used to cluster malware into groups with related functionality and activity. Supervised methods are used by [38] to match new instances of malware with previously observed families, again on the basis of dynamic analysis.

IV. APPROACH

Our ultimate goal is to automatically recognize programmers of compiled code. We approach this problem using supervised machine learning, that is, we generate a classifier from training data of sample executable binaries with known authors. The advantage of such learning-based methods over techniques based on manually specified rules is that the approach is easily retargetable to any set of programmers for which sample executable binaries exist. A drawback is that the method is inoperable if samples are not available or too short to represent authorial style. We study the amount of sample data necessary for successful classification in Section V.

Data representation is critical to the success of machine learning. Accordingly, we design a feature set for executable binary authorship attribution with the goal of faithfully representing properties of executable binaries relevant for programmer style. We obtain this feature set by augmenting lower-level features extractable from disassemblers with additional string and symbol information, and, most importantly, incorporating higher-level syntactical features obtained from decompilers.

In summary, such an approach results in a method consisting of the following four steps (see Figure 1); the code is available at https://github.com/calaylin/bda.

• Disassembly. We begin by disassembling the program to obtain features based on machine code instructions, referenced strings, symbol information, and control flow graphs (Section IV-A).

• Decompilation. We proceed to translate the program into C-like pseudo code via decompilation. By subsequently passing the code to a fuzzy parser for C, we thus obtain abstract syntax trees from which syntactical features and n-grams can be extracted (Section IV-B).

• Dimensionality reduction. With features from disassemblers and decompilers at hand, we select those among them that are particularly useful for classification by employing a standard feature selection technique based on information gain and correlation based feature selection (Section IV-C).

• Classification. Finally, a random-forest classifier is trained on the corresponding feature vectors to yield a program that can be used for automatic executable binary authorship attribution (Section IV-D).

In the following sections, we describe these steps in greater detail and provide background information on static code analysis and machine learning where necessary.

Fig. 1: Overview of our method. Instructions, symbols, and strings are extracted using disassemblers (1), abstract syntax tree and control-flow features are obtained from decompilers (2). Dimensionality reduction first by information gain criteria and then by correlation analysis is performed to obtain features that represent programmer style (3). Finally, a random forest classifier is trained to de-anonymize programmers (4).

A. Feature extraction via disassembly

As a first step, we disassemble the executable binary to extract low-level features that have been shown to be suitable for authorship attribution in previous work. In particular, we follow the basic example set by Rosenblum et al. and extract raw instruction traces from the executable binary [39]. In addition to this, disassemblers commonly make symbol information available, as well as strings referenced in the code, both of which greatly simplify manual reverse engineering. We augment the feature set accordingly. Finally, we can obtain control flow graphs of functions from disassemblers, providing features based on program basic blocks. The information necessary to construct our feature set is obtained from the following two disassemblers.

We use two disassemblers to generate two sets of instructions for each binary. We disassemble the binary with the Netwide Disassembler (ndisasm), which is a widely available x86 disassembler. We then use the open source radare2 disassembler to get more detailed and higher level instructions than ndisasm's disassembly.

• The netwide disassembler. We begin by exploring whether simple instruction decoding alone can already provide useful features for de-anonymization. To this end, we process each executable binary using the netwide disassembler (ndisasm), a rudimentary disassembler that is capable of decoding instructions but is unaware of the executable's file format [44]. Due to this limitation, it resorts to simply decoding the executable binary from start to end, skipping bytes when invalid instructions are encountered. A problem with this approach is that no distinction is made between bytes that represent data versus bytes that represent code. Nonetheless, we explore this simplistic approach as these inaccuracies may not degrade a classifier, given the statistical nature of machine learning.

• The radare2 disassembler. We proceed to apply radare2 [33], a state-of-the-art open-source disassembler based on the capstone disassembly framework [37]. In contrast to ndisasm, radare2 understands the executable binary format, allowing it to process relocation and symbol information in particular. This allows us to extract symbols from the dynamic (.dynsym) as well as the static symbol table (.symtab) where present, and any strings referenced in the code. Our approach thus gains knowledge over functions of dynamic libraries used in the code. Finally, radare2 attempts to identify functions in code and generates corresponding control flow graphs.

Firstly, we strip the hexadecimal numbers from assembly instructions and replace them with the uni-gram "number", to avoid overfitting that might be caused by unique hexadecimal numbers. Then, information provided by the two disassemblers is combined to obtain our disassembly feature set as follows: we tokenize the instruction traces of both disassemblers and extract token uni-grams, bi-grams, and tri-grams within a single line of assembly, and 6-grams, which span two consecutive lines of assembly. We cannot know exactly what each 6-gram corresponds to in assembly code, but for most assembly instructions, a meaningful construct is longer than a line of assembly code. In addition, we extract single basic blocks of radare2's control flow graphs, as well as pairs of basic blocks connected by control flow.
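To make the n-gram construction concrete, the following sketch (Python; the helper names are hypothetical and this is not the authors' released code) tokenizes disassembly lines, replaces hexadecimal constants with a generic "number" token, and collects uni-grams, bi-grams, and tri-grams within a line plus 6-grams spanning two consecutive lines.

    import re
    from collections import Counter

    HEX = re.compile(r'\b0x[0-9a-fA-F]+\b|\b[0-9a-fA-F]{4,}\b')

    def tokenize(line):
        # Normalize hexadecimal constants to a single "number" token.
        return HEX.sub('number', line).replace(',', ' ').split()

    def disasm_ngrams(lines):
        """Count n-gram features over a list of disassembly lines."""
        feats = Counter()
        toks = [tokenize(l) for l in lines]
        for t in toks:                                   # within a single line
            for n in (1, 2, 3):
                for i in range(len(t) - n + 1):
                    feats[' '.join(t[i:i + n])] += 1
        for a, b in zip(toks, toks[1:]):                 # across two consecutive lines
            pair = a + b
            for i in range(len(pair) - 5):
                feats[' '.join(pair[i:i + 6])] += 1
        return feats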
B. Feature extraction via decompilation

Decompilers are the second source of information that we consider for feature extraction in this work. In contrast to disassemblers, decompilers do not only uncover the program's machine code instructions, but additionally reconstruct higher level constructs in an attempt to translate an executable binary into equivalent source code. In particular, decompilers can reconstruct control structures such as different types of loops and branching constructs. We make use of these syntactical features of code as they have been shown to be valuable in the context of source code authorship attribution [16]. For decompilation, we employ the Hex-Rays decompiler [1].

Hex-Rays is a commercial state-of-the-art decompiler. It converts executable programs into a human readable C-like pseudo code to be read by human analysts. It is noteworthy that this code is typically significantly longer than the original source code. For example, decompiling an executable binary generated from 70 lines of source code with Hex-Rays produces on average 900 lines of decompiled code. We extract two types of features from this pseudo code: lexical features and syntactical features. Lexical features are simply the word unigrams, which capture the integer types used in a program, names of library functions, and names of internal functions when symbol information is available. Syntactical features are obtained by passing the C-pseudo code to joern, a fuzzy parser for C that is capable of producing fuzzy abstract syntax trees (ASTs) from Hex-Rays pseudo code output [47]. We derive syntactic features from the abstract syntax tree, which represent the grammatical structure of the program. Such features are (illustrated in Figure 2) AST node unigrams, labeled AST edges, AST node term frequency inverse document frequency, and AST node average depth. Previous work on source code authorship attribution [16], [46] shows that these features are highly effective in representing programming style.
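As an illustration of the syntactic features, the sketch below computes AST node unigram term frequencies and the average node depth from a generic tree of (node_type, children) tuples. The tree representation is an assumption made for brevity; the real pipeline derives these trees from joern's output.

    from collections import Counter

    def ast_features(node, depth=0, feats=None, depths=None):
        """Walk a (node_type, children) tree; count node-type unigrams and record depths."""
        if feats is None:
            feats, depths = Counter(), []
        node_type, children = node
        feats[node_type] += 1
        depths.append(depth)
        for child in children:
            ast_features(child, depth + 1, feats, depths)
        return feats, depths

    # Example tree for: if (x < MAX) f(x);
    tree = ('func', [('if', [('pred', [('<', [])]), ('call', [('arg', [])])])])
    unigram_tf, depths = ast_features(tree)
    avg_depth = sum(depths) / len(depths)   # AST node average depth feature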

Fig. 2: Feature extraction via decompilation and fuzzy parsing: C-like pseudo code produced by Hex-Rays is transformed into an abstract syntax tree and control-flow graph to obtain syntactic and control-flow features. (The figure shows an example AST with node unigram, AST bigram, and AST depth features, and an example CFG with basic-block unigram and bigram features.)

C. Dimensionality reduction

Feature extraction produces a large number of features, resulting in sparse feature vectors with thousands of elements. However, not all features are equally informative to express a programmer's style. This makes it desirable to perform feature selection to obtain a compact representation of the data to reduce the computational burden during classification as well as the chances of overfitting. Moreover, sparse vectors may result in a large number of zero-valued attributes being selected during random forest's random subsampling of the attributes to select a best split. Reducing the dimensions of the feature set is important for avoiding overfitting. One example of overfitting would be a rare assembly instruction uniquely identifying an author. For these reasons, we use information gain criteria followed by correlation based feature selection to identify the most informative attributes that represent each author as a class. This reduces vector size and sparsity while increasing accuracy and model training speed. For example, we get 705,000 features from the 900 executable binary samples of 100 programmers. If we use all of these features in classification, the resulting de-anonymization accuracy is slightly above 30%, because the random forest might be randomly selecting features with values of zero in the sparse feature vectors. Once the information gain criterion is applied, we get fewer than 2,000 features and the correct classification accuracy of 100 programmers increases from slightly above 30% to 90%. Then, we identify locally predictive features that are highly correlated with classes and have low intercorrelation. After this second dimensionality reduction method, we are left with 53 predictive features and no sparsity remains in the feature vectors. Extracting 53 features or training a machine learning model where each instance has 53 attributes is computationally efficient. Given such proper representation of instances, the correct classification accuracy of 100 programmers reaches 96%.

We applied the first dimensionality reduction step using WEKA's information gain attribute selection criterion [21], which evaluates the difference between the entropy of the distribution of classes and the Shannon entropy of the conditional distribution of classes given a particular feature [36].

The second dimensionality reduction step was based on correlation based feature selection, which generates a feature-class and feature-feature correlation matrix. The selection method then evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them [22]. Feature selection is performed iteratively with greedy hillclimbing and backtracking ability by adding attributes that have the highest correlation with the class to the list of selected features.
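A rough approximation of this two-step selection, assuming scikit-learn in place of WEKA (whose information gain and CFS attribute evaluators the paper actually uses), might look as follows; the correlation step here is a simplified greedy filter rather than full CFS with backtracking, and the threshold values are illustrative assumptions.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def information_gain_filter(X, y, threshold=0.0):
        # Step 1: keep features whose mutual information with the author label is non-zero.
        ig = mutual_info_classif(X, y, discrete_features=False)
        return np.where(ig > threshold)[0], ig

    def greedy_correlation_select(X, y, candidate_idx, max_features=53):
        # Step 2 (simplified CFS): greedily add features highly correlated with the class
        # but weakly correlated with the features already selected.
        selected = []
        y_num = np.unique(y, return_inverse=True)[1].astype(float)
        for idx in sorted(candidate_idx,
                          key=lambda i: -abs(np.corrcoef(X[:, i], y_num)[0, 1])):
            if all(abs(np.corrcoef(X[:, idx], X[:, j])[0, 1]) < 0.9 for j in selected):
                selected.append(idx)
            if len(selected) == max_features:
                break
        return selected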

D. Classification

We used random forests as our classifier, which are ensemble learners built from collections of decision trees, where each tree is trained on a subsample of the data obtained by random sampling with replacement. Random forests by nature are multi-class classifiers that avoid overfitting. To reduce correlation between trees, features are also subsampled; commonly (log M) + 1 features are selected at random (without replacement) out of M, and the best split on these (log M) + 1 features is used to split the tree nodes.

The number of selected features represents one of the few tuning parameters in random forests: increasing it increases the correlation between trees in the forest, which can harm the accuracy of the overall ensemble; however, increasing the number of features that can be chosen between at each split also increases the classification accuracy of each individual tree, making them stronger classifiers with low error rates. The optimal range of the number of features can be found using the out of bag error estimate, i.e., the error estimate derived from those samples not selected for training on a given tree.

During classification, each test example is classified via each of the trained decision trees by following the binary decisions made at each node until a leaf is reached, and the results are aggregated. The most populous class is selected as the output of the forest for simple classification, or classifications can be ranked according to the number of trees that 'voted' for the label in question when performing relaxed attribution for top-n classification.
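Assuming scikit-learn as a stand-in for the WEKA implementation the authors used, the forest configuration described here corresponds roughly to the following sketch; max_features='log2' is an approximation of the (log M) + 1 rule.

    from sklearn.ensemble import RandomForestClassifier

    # 500 trees, ~log(M)+1 features considered at each split, fully grown trees (no pruning),
    # information-gain (entropy) based splits, as described in Section IV-D.
    forest = RandomForestClassifier(
        n_estimators=500,
        max_features='log2',     # approximation of the (log M) + 1 rule
        criterion='entropy',
        max_depth=None,          # grow trees to the largest extent possible
        oob_score=True,          # out-of-bag error estimate for tuning
        random_state=0,
    )
    # forest.fit(X_train, y_train); forest.oob_score_ gives the OOB accuracy estimate.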
M denotes the total number of features) at each split of the Once information gain criteria is applied, we get less than decision trees was in fact optimal in all of the experiments 2,000 features and the correct classification accuracy of 100 listed in Section V, and was used throughout. Node splits were programmers increases from to 90%. Then, we identify locally selected based on the information gain criteria, and all trees predictive features that are highly correlated with classes and were grown to the largest extent possible, without pruning. have low intercorrelation. After this second dimensionality re- duction method, we are left with 53 predictive features and no The data was analyzed via k-fold cross-validation, where sparsity remains in the feature vectors. Extracting 53 features the data was split into training and test sets stratified by or training a machine learning model where each instance has author (ensuring that the number of code samples per author 53 attributes is computationally efficient. Given such proper in the training and test sets was identical across authors). representation of instances, the correct classification accuracy The parameter k varies according to datasets and is equal of 100 programmers reaches 96%. to the number of instances present from each author. The

V. GOOGLE CODE JAM EXPERIMENTS

In this section, we go over the details of the various experiments we performed to address the research question formulated in Section II.

A. Dataset

We evaluate our executable binary authorship attribution method on a controlled dataset based on the annual programming competition GCJ [5]. It is an annual contest that thousands of programmers take part in each year, including professionals, students, and hobbyists from all over the world. The contestants implement solutions to the same tasks in a limited amount of time in a programming language of their choice. Accordingly, all the correct solutions have the same algorithmic functionality. There are two main reasons for choosing GCJ competition solutions as an evaluation corpus. First, it enables us to directly compare our results to previous work on executable binary authorship attribution, as both [9] and [39] evaluate their approaches on data from GCJ. Second, it eliminates the potential confounding effect of identifying the programming task rather than the programmer, that is, of capturing functionality properties instead of stylistic properties. GCJ is a clean, less noisy dataset that is known to be single authored. GCJ solutions do not have significant dependencies outside of the standard library and contain few or no third party libraries.

We focus our analysis on compiled C++ code, the most popular programming language used in the competition. We collect the solutions from the years 2008 to 2014 along with author names and problem identifiers. In GCJ experiments we assume that the programmers are not deliberately trying to hide their identity. Accordingly, we show results without excluding symbol information.

B. Code Compilation

To create our experimental datasets, we first compiled the source code with GNU Compiler Collection's gcc or g++ without any optimization to Executable and Linkable Format (ELF) 32-bit, Intel 80386 Unix binaries. The training set needs to be compiled with the same compiler and settings, otherwise we might end up detecting the compiler instead of the author. Passing the training samples through the same encoder preserves mutual information between code style and labels, and accordingly we can successfully de-anonymize programmers.

Next, to measure the effect of different compilation options, such as compiler optimization flags, we additionally compiled the source code with level-1, level-2, and level-3 optimizations, namely the O1, O2, and O3 flags. O3 is a superset of O2 optimization flags and similarly O2 is a superset of O1 flags. The compiler attempts to improve the performance and/or code size when the compiler flags are turned on, but at the same time optimization comes at the expense of increased compilation time and more complicated program debugging.
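A minimal sketch of this compilation step, assuming a Linux host with a 32-bit-capable g++ toolchain (the exact compiler version and build script are not given in this excerpt), is:

    import subprocess

    # Compile each GCJ solution to a 32-bit x86 ELF binary at every optimization level.
    def compile_solution(src, out_prefix):
        for opt in ('-O0', '-O1', '-O2', '-O3'):
            subprocess.run(
                ['g++', '-m32', opt, src, '-o', f'{out_prefix}{opt}'],
                check=True)

    # Example: compile_solution('solution.cpp', 'solution')  ->  solution-O0 ... solution-O3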
C. 53 features represent programmer style.

We are interested in identifying features that represent coding style preserved in executable binaries. With the current approach, we extract 705,000 representations of code properties of 100 authors, but only a subset of these are the result of individual programming style. We are able to capture the features that represent each author's programming style that is preserved in executable binaries by applying information gain criteria to these 705,000 features. After applying information gain to effectively represent coding style, we reduce the feature set to contain approximately 1,600 features from all feature types. Furthermore, correlation based feature selection during cross validation eliminates features that have low class correlation and high intercorrelation, and preserves 53 of the highly distinguishing features, which can be seen in Table I along with their authorial style representation power.

Considering the fact that we are reaching such high accuracies on de-anonymizing 100 programmers with 900 executable binary samples (discussed below), these features provide a strong representation of style that survives compilation. The compact set of identifying stylistic features contains features of all types, namely disassembly, CFG, and syntactical decompiled code properties. To examine the potential for overfitting, we consider the ability of this feature set to generalize to a different set of programmers (see Section V-G), and show that it does so, further supporting our belief that these features effectively capture coding style. Features that are highly predictive of authorial fingerprints include file and stream operations along with the formats and initializations of variables from the domain of ASTs, whereas arithmetic, logic, and stack operations are the most distinguishing ones among the assembly instructions.

D. We can de-anonymize programmers from their executable binaries.

This is the main experiment that demonstrates how de-anonymizing programmers from their executable binaries is possible. After preprocessing the dataset to generate the executable binaries without optimization, we further process the executable binaries to obtain the disassembly, control flow graphs, and decompiled source code. We then extract all the possible features detailed in Section IV. We take a set of 100 programmers who all have 9 executable binary samples. With 9-fold cross-validation, the random forest correctly classifies 900 test instances with 95% accuracy, which is significantly higher than the accuracies reached in previous work.

There is an emphasis on the number of folds used in these experiments because each fold corresponds to the implementation of the same algorithmic function by all the programmers in the GCJ dataset (e.g. all samples in fold 1 may be attempts by the various authors to solve a list sorting problem). Since we know that each fold corresponds to the same Code Jam problem, by using stratified cross validation without randomization and preserving order, we ensure that all training and test samples contain the same algorithmic functions implemented by all of the programmers. The classifier uses the excluded fold in the testing phase, which contains executable binary samples that were generated from an algorithmic function that was not previously observed in the training set for that classifier. Consequently, the only distinction between the test instances is the coding style of the programmer, without the potentially confounding effect of identifying an algorithmic function.

TABLE I: Programming Style Features and Selected Features in Executable Binaries

Feature                        | Source              | Number of Possible Features | Selected Features / Information Gain
Word unigrams                  | decompiled code*    | 29,278                      | 6 / 5.75
AST node TF†                   | decompiled code*    | 5,278                       | 3 / 1.85
Labeled AST edge TF†           | decompiled code*    | 26,783                      | 0 / 0
AST node TFIDF‡                | decompiled code*    | 5,278                       | 1 / 0.75
AST node average depth         | decompiled code*    | 5,278                       | 0 / 0
C++ keywords                   | decompiled code*    | 73                          | 0 / 0
radare2 disassembly unigrams   | radare disassembly  | 21,206                      | 3 / 1.61
radare2 disassembly bigrams    | radare disassembly  | 39,506                      | 1 / 0.62
radare2 disassembly trigrams   | radare disassembly  | 112,913                     | 0 / 0
radare2 disassembly 6-grams    | radare disassembly  | 260,265                     | 0 / 0
radare2 CFG node unigrams      | radare disassembly  | 5,297                       | 3 / 1.98
radare2 CFG edges              | radare disassembly  | 10,246                      | 1 / 0.63
ndisasm disassembly unigrams   | ndisasm disassembly | 5,383                       | 2 / 1.79
ndisasm disassembly bigrams    | ndisasm disassembly | 14,305                      | 5 / 2.95
ndisasm disassembly trigrams   | ndisasm disassembly | 5,237                       | 4 / 1.44
ndisasm disassembly 6-grams    | ndisasm disassembly | 159,142                     | 24 / 16.08
Total                          |                     | 705,468                     | 53 / 35

* Hex-Rays decompiled code. † TF = term frequency. ‡ TFIDF = term frequency inverse document frequency.

E. Even a single training sample per programmer is sufficient for de-anonymization.

A drawback of supervised machine learning methods, which we employ, is that they require labeled examples to build a model. The ability of the model to accurately generalize is often strongly linked to the amount of data provided to it during the training phase, particularly for complex models. In domains such as executable binary authorship attribution, where samples may be rare and obtaining "ground truth" for labeling training samples may be costly or laborious, this can pose a significant challenge to the usefulness of the method.

We therefore devised an experiment to determine how much training data is required to reach a stable classification accuracy, as well as to explore the accuracy of our method with severely limited training data. As programmers produce a limited number of code samples per round of the GCJ competition, and programmers are eliminated in each successive round, the GCJ dataset has an upper bound on the number of code samples per author as well as a limited number of authors with a large number of samples. Accordingly, we identified a set of 100 programmers that had exactly 9 program samples each, and examined the ability of our method to correctly classify each author out of the candidate set of 100 authors when training on between 1 and 8 files per author.

As shown in Figure 3, the classifier is capable of correctly identifying the author of a code sample from a potential field of 100 with 65% accuracy on the basis of a single training sample. The classifier also reaches a point of dramatically diminishing returns with as few as three training samples, and obtains a stable accuracy by training on 6 samples. Given the complexity of the task, this combination of high accuracy with extremely low requirement on training data is remarkable, and suggests the robustness of our features and method. It should be noted, however, that this set of programmers with a large number of files corresponds to more skilled programmers, as they were able to remain in the competition for a longer period of time and thus produce this large number of samples.
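The training-data experiment can be approximated with a simple loop, sketched below under the assumption that samples_by_author maps each author to nine feature vectors (hypothetical variable names, not the authors' code):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def learning_curve(samples_by_author, max_train=8):
        """Accuracy vs. number of training samples per author (1..max_train)."""
        results = {}
        for n_train in range(1, max_train + 1):
            X_tr, y_tr, X_te, y_te = [], [], [], []
            for author, vecs in samples_by_author.items():
                X_tr += vecs[:n_train]; y_tr += [author] * n_train
                X_te += vecs[n_train:]; y_te += [author] * (len(vecs) - n_train)
            clf = RandomForestClassifier(n_estimators=500, random_state=0)
            clf.fit(np.array(X_tr), y_tr)
            results[n_train] = clf.score(np.array(X_te), y_te)
        return results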

Fig. 3: Amount of Training Data Required for De-anonymizing 100 Programmers. (Correct classification accuracy versus number of training samples per author: 65% with a single sample, rising to roughly 95% with eight.)

F. Relaxed Classification: In difficult scenarios, the classification task can be narrowed down to a small suspect set.

In Section V-A, the previously unseen anonymous executable binary sample is classified such that it belongs to the most likely author's class. In cases where we have too many classes or the classification accuracy is lower than expected, we can relax the classification to top-n classification. In top-n relaxed classification, if the test instance belongs to one of the most likely n classes, the classification is considered correct. This can be useful in cases when an analyst or adversary is interested in finding a suspect set of n authors, instead of a direct top-1 classification. Being able to scale down an authorship investigation for an executable binary sample of interest to a reasonably sized set of suspect authors among hundreds of authors greatly reduces the manual effort required by an analyst or adversary. Once the suspect set size is reduced, the analyst or adversary could adhere to content based dynamic approaches and reverse engineering to identify the author of the executable binary sample. Figure 4 shows how correct classification accuracies approach 100% as the classification is relaxed to top-10.
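Top-n relaxation only requires ranking the classes by the forest's vote counts. A sketch using scikit-learn's predict_proba follows; the use of scikit-learn is an assumption for illustration, as the paper's implementation is WEKA-based.

    import numpy as np

    def top_n_accuracy(forest, X_test, y_test, n=10):
        """Fraction of test samples whose true author is among the n highest-ranked classes."""
        proba = forest.predict_proba(X_test)            # per-class vote fractions
        ranked = np.argsort(proba, axis=1)[:, ::-1]     # classes sorted by confidence
        classes = forest.classes_
        hits = sum(y in classes[row[:n]] for row, y in zip(ranked, y_test))
        return hits / len(y_test)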


Fig. 4: Reducing Suspect Set Size: Top-n Relaxed Classification. (Correct classification accuracy as the prediction is relaxed from top-1 to top-10, for 100 and 600 programmers with 8 and with 1 training samples per author.)

It is important to note from Figure 3 that, by using only a single training sample in a 100-class classification task, the machine learning model can correctly classify new samples with 75.0% accuracy. This is of particular interest to an analyst or adversary who does not have a large amount of labeled samples in her suspect set. Figure 3 shows that an analyst or adversary can narrow down the suspect set size from 100 or 600 to a significantly smaller set.

G. The feature set selected via dimensionality reduction works and is validated across different sets of programmers.

In our earlier experiments, we trained the classifier on the same set of executable binaries that we used during feature selection. The high number of starting features from which we select our final feature set via dimensionality reduction does raise the potential concern of overfitting. To examine this, we applied this final feature set to a different set of programmers and executable binaries. If we reach accuracies similar to what we got earlier, we can conclude that these selected features do generalize to other programmers and problems, and therefore are not overfitting to the 100 programmers they were generated from. This also suggests that the final set of features in general captures programmer style.

Recall that analyzing 900 executable binary samples of the 100 programmers resulted in about 705,000 features, and after dimensionality reduction, we are left with 53 important features.
We picked a different (non-overlapping) set of 100 programmers and performed another de-anonymization experiment in which the feature selection step was omitted, using instead the information gain and correlation based features obtained from the original experiment. This resulted in very similar accuracies: we de-anonymized programmers in the validation set with 96% accuracy by using features selected via the main development set, compared to the 95% de-anonymization accuracy we achieve on the programmers of the main development set. The ability of the final reduced set of 53 features to generalize beyond the dataset which guided their selection strongly supports the assertion that these features obtained from the main set of 100 programmers are not overfitting, that they actually represent coding style in executable binaries, and that they can be used across different datasets.

H. Large Scale De-anonymization: We can de-anonymize 600 programmers from their executable binaries.

We would like to see how well our method scales up to 600 users. An analyst with a large set of labeled samples might be interested in performing large scale de-anonymization. For this experiment, we use 600 contestants from GCJ with 9 files. We only extract the reduced set of features from the 600 users. This decreases the amount of time required for feature extraction. On the other hand, this experiment shows how effectively overall programming style is represented after dimensionality reduction. The results of large scale programmer de-anonymization in Figure 5 show that our method can scale to larger datasets with the reduced set of features with a surprisingly small drop in accuracy.

Fig. 5: Large Scale Programmer De-anonymization. (Correct classification accuracy as the number of authors grows from 20 to 600.)

I. We advance the state of executable binary authorship attribution.

Rosenblum et al. presented the largest scale evaluation of executable binary authorship attribution on 191 programmers, each with at least 8 training samples [39]. We compare our results with Rosenblum et al.'s in Table II to show how we advance the state of the art both in accuracy and on larger datasets. Rosenblum et al. use 1,900 coding style features to represent coding style whereas we use 53 features, which might suggest that our features are more powerful in representing coding style that is preserved in executable binaries. On the other hand, we use fewer training samples than Rosenblum et al., which makes our experiments more challenging from a machine learning standpoint. Our accuracy in authorship attribution is significantly higher than Rosenblum et al.'s, even when we use an SVM as our classifier, showing that our different approach is more powerful and robust for de-anonymizing programmers. Rosenblum et al. suggest a linear SVM is the appropriate classifier for de-anonymizing programmers, but we show that our different set of techniques and choice of random forests leads to superior and larger scale de-anonymization.

TABLE II: Comparison to Previous Results

Related Work    | Number of Programmers | Number of Training Samples | Accuracy | Classifier
Rosenblum [39]  | 20                    | 8-16                       | 77%      | SVM
This work       | 20                    | 8                          | 90%      | SVM
This work       | 20                    | 8                          | 99%      | RF
Rosenblum [39]  | 100                   | 8-16                       | 61%      | SVM
This work       | 100                   | 8                          | 84%      | SVM
This work       | 100                   | 8                          | 96%      | RF
Rosenblum [39]  | 191                   | 8-16                       | 51%      | SVM
This work       | 191                   | 8                          | 81%      | SVM
This work       | 191                   | 8                          | 92%      | RF
This work       | 600                   | 8                          | 71%      | SVM
This work       | 600                   | 8                          | 83%      | RF

J. Programmer style is preserved in executable binaries.

We show throughout the results that it is possible to de-anonymize programmers from their executable binaries with a high accuracy. To quantify how stylistic features are preserved in executable binaries, we calculated the correlation of stylistic source code features and decompiled code features. We used the stylistic source code features from previous work on de-anonymizing programmers from their source code [16]. We took the most important 150 features in coding style that consist of AST node average depth, AST node TFIDF, and the frequencies of AST nodes, AST node bigrams, word unigrams, and C++ keywords. For each executable binary sample, we have the corresponding source code sample. We extract 150 information gain features from the original source code. We extract decompiled source code features from the decompiled executable binaries. For each executable binary instance, we set one corresponding information gain feature as the class to predict and then we calculate the correlation between the decompiled executable binary features and the class value. A random forest classifier with 500 trees predicts the class value of each instance, and then Pearson's correlation coefficient is calculated between the predicted and original values. The correlation has a mean of 0.32 and ranges from -0.12 to 0.69 for the most important 150 features.

To see how well we can reconstruct the original source code features from decompiled executable binary features, we reconstructed the 900 instances with 150 features that represent the highest information gain features by predicting the original features from decompiled code features. We calculated the cosine similarity between the original 900 instances and the reconstructed instances after normalizing the features to unit distance. The cosine similarity for these instances is shown in Figure 6, where a cosine similarity of 1 means the two feature vectors are identical. The high values (average of 0.81) in cosine similarity suggest that the reconstructed features are similar to the original features. When we calculate the cosine similarity between the feature vectors of the original source code and the corresponding decompiled code's feature vectors (no predictions), the average cosine similarity is 0.35. In summary, reconstructed features are much more similar to original code than the raw features extracted from decompiled code. 5% of the reconstructed features have less than 60% similarity based on the cosine similarity between original and decompiled source code features. At the same time, the de-anonymization accuracy of 900 executable binaries is 95% by using source code, assembly, CFG, and AST features. This might indicate that some operations or code sequences cannot be preserved after compilation followed by decompilation, due to the nature of transformations during each process.

Fig. 6: Feature Transformations: Each data point on the x-axis is a different executable binary sample. Each y-axis value is the cosine similarity between the feature vector extracted from the original source code and the feature vector that tries to predict the original features. The average value of these 900 cosine similarity measurements is 0.81, suggesting that decompiled code preserves transformed forms of the original source code features well enough to reconstruct the original source code features.
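A hedged sketch of this reconstruction experiment (hypothetical variable names; the paper does not publish this exact script, and a regressor is used here where the paper describes a 500-tree random forest predictor) predicts each original source-code feature from the decompiled-code feature vectors and compares the reconstruction to the original by cosine similarity.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def reconstruct_and_compare(X_decompiled, X_source):
        """Predict each of the 150 source-code features from decompiled-code features,
        then report per-sample cosine similarity between original and reconstruction."""
        n_samples, n_feats = X_source.shape
        X_rec = np.zeros(X_source.shape)
        for f in range(n_feats):
            model = RandomForestRegressor(n_estimators=500, random_state=0)
            model.fit(X_decompiled, X_source[:, f])        # in practice, use held-out folds
            X_rec[:, f] = model.predict(X_decompiled)
        a = X_source / np.linalg.norm(X_source, axis=1, keepdims=True)
        b = X_rec / np.linalg.norm(X_rec, axis=1, keepdims=True)
        return (a * b).sum(axis=1)                          # cosine similarity per sample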
VI. REAL-WORLD SCENARIOS

A. Programmers of optimized executable binaries can be de-anonymized.

In Section V, we discussed how we evaluated our approach on a controlled and clean real-world dataset. Section V shows how we advance over previous methods that were all evaluated with clean datasets such as GCJ or homework assignments. In this section, we investigate a more complicated dataset which has been optimized during compilation, where the executable binary samples have been normalized further during compilation.

Compiling with optimization tries to minimize or maximize some attributes of an executable program. The goal of optimization is to minimize execution time or the amount of memory a program occupies. The compiler applies optimizing transformations, which are algorithms that transform a program to a semantically equivalent program that uses fewer resources.

GCC has predefined optimization levels that turn on sets of optimization flags. Compilation with optimization level-1 tries to reduce code size and execution time, and takes more time and much more memory for large functions than compilation with no optimizations. Compilation with optimization level-2 optimizes more than level-1: it uses all level-1 optimization flags and more, and performs all optimizations that do not involve a space-speed tradeoff. Level-2 optimization increases compilation time and the performance of the generated code when compared to level-1 optimization. Level-3 optimization goes further and optimizes more than both level-1 and level-2.

So far, we have shown that programming style features survive compilation without any optimizations. As compilation with optimizations transforms code further, we investigate how much programming style is preserved in executable binaries that have gone through compilation with optimization. Our results summarized in Table III show that programming style is preserved to a great extent even in the most aggressive level-3 optimization. This shows that programmers of optimized executable binaries can be de-anonymized, and that optimization is not a highly effective code anonymization method.

TABLE III: Programmer De-anonymization with Compiler Optimization

Number of Programmers | Number of Training Samples | Compiler Optimization Level | Accuracy
100                   | 8                          | None                        | 96%
100                   | 8                          | 1                           | 93%
100                   | 8                          | 2                           | 89%
100                   | 8                          | 3                           | 89%

C. We can de-anonymize programmers from obfuscated binaries.

We are furthermore interested in finding out whether our method is capable of dealing with simple binary obfuscation techniques as implemented by tools such as Obfuscator-LLVM [24]. These obfuscators substitute instructions with other semantically equivalent instructions, introduce bogus control flow, and can even completely flatten control flow graphs.

For this experiment, we consider a set of 100 programmers from the GCJ dataset who all have 9 executable binary samples. This is the same dataset as in our main experiment (see Section V-D); however, we now apply all three obfuscation techniques implemented by Obfuscator-LLVM to the samples prior to learning and classification.

We proceed to train a classifier on obfuscated samples. This approach is feasible in practice, as an analyst who has only non-obfuscated samples available can easily obfuscate them to obtain the necessary obfuscated samples for classifier training. Using the same features as in Section V-D, we obtain an accuracy of 88% in correctly classifying authors.

D. De-anonymization in the Wild

We developed our method and evaluated it on the GCJ dataset, but collecting code from open-source projects is another option for constructing a dataset. Open-source projects do not guarantee ground truth on authorship, and the feature vectors might capture topics of the project instead of programming style. As a result, open-source code does not constitute ideal data for authorship analysis; however, it allows us to better assess the applicability of programmer de-anonymization in the wild. We therefore present results on a dataset collected from the hosting platform GitHub [4], which we obtain by spidering the platform to collect C and C++ repositories.

To better assess the applicability of our programmer de-anonymization approach in the wild, we extend our experiments to code collected from real open-source programs as opposed to solutions for programming competitions. Starting from a seed set of popular repositories, we traversed the platform to obtain C/C++ repositories that meet the following criteria: only one author has committed to the repository; the repository is popular, as indicated by the presence of at least 5 stars, a measure of popularity for repositories on GitHub; it is sufficiently large, containing at least 200 lines in total; and it is not a fork of another repository, nor is it named 'linux', 'kernel', 'osx', 'gcc', 'llvm', or 'next', as such repositories are typically copies of the so-named projects. A sketch of this repository filter is given below.

We cloned 439 repositories from 161 authors meeting these criteria and collected only those C/C++ files to which the main author has contributed at least 5 commits and whose commit messages do not contain the word 'signed-off', a message that typically indicates that the code was written by another person. An author and her files are included in the dataset only if she has written at least 10 different files. In a final step, we manually verified ground truth on authorship for the selected files to make sure that they do not show any clear signs of code reuse from other projects. The resulting dataset contains 2 to 344 files and 2 to 8 repositories from each author, with a total of 3,438 files.
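The repository selection just described reduces to a simple predicate over repository metadata. The sketch below is our own illustration: the metadata record, its field names, and the assumption that the data has already been fetched (for example via the GitHub API) are not part of the original pipeline.

```python
# Minimal sketch (our own illustration) of the repository filter described above.
from dataclasses import dataclass
from typing import List

EXCLUDED_NAMES = {"linux", "kernel", "osx", "gcc", "llvm", "next"}

@dataclass
class RepoInfo:
    name: str
    stars: int
    total_lines: int
    committers: List[str]   # login names of everyone who committed
    is_fork: bool

def is_candidate(repo: RepoInfo) -> bool:
    return (
        len(set(repo.committers)) == 1                 # single-author repository
        and repo.stars >= 5                            # popularity threshold
        and repo.total_lines >= 200                    # sufficiently large
        and not repo.is_fork                           # exclude forks
        and repo.name.lower() not in EXCLUDED_NAMES    # exclude copies of well-known projects
    )
```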
We subsequently compile the collected projects to obtain object files for each of the selected source files. We perform our experiment on object files, as opposed to entire binaries, since the object files are the binary representations of exactly those source files that clearly belong to the specified authors. For various reasons, compiling code may not be possible for a project: the code may not be in a compilable state, it may not be compilable for our target platform (32-bit Intel, Linux), or the files required to set up a working build environment can no longer be obtained. Despite these difficulties, we are able to generate 1,075 object files from 90 different authors, where the number of object files per author ranges from 2 to 24, with most authors having at least 9 samples. We used 50 of these authors, those with 6 to 15 files, to perform a machine learning experiment with more balanced class sizes.

We extract the information gain features that were selected on the GCJ data from this GitHub dataset. The GitHub dataset is noisy for two reasons: the executable binaries used in de-anonymization might contain properties of third-party libraries as well as third-party code. For these two reasons, it is more difficult to attribute authorship to anonymous executable binary samples from GitHub, but we nevertheless reach 65% accuracy in correctly classifying these programmers' executable binaries. Another difficulty in this particular dataset is that there is not much training data with which to train an accurate random forest classifier for each programmer. For example, we can de-anonymize the two programmers with the most samples, one with 11 samples and one with 7, with 100% accuracy.

Being able to de-anonymize programmers in the wild by using a small number of features obtained from our clean development dataset is a promising step towards attacking more challenging real-world de-anonymization problems.

E. Have I seen this programmer before?

While attempting to de-anonymize programmers in real-world settings, we cannot be certain that we have formerly encountered code samples from the programmers in the test set. As a mechanism to check whether an anonymous test file belongs to one of the candidate programmers in the training set, we extend our method to an open world setting by incorporating classification confidence thresholds. In random forests, the class probability or classification confidence P(B_i) that executable binary B is of class i is calculated by taking the percentage of trees in the random forest that voted for class i during classification:

    P(B_i) = \frac{\sum_j V_j(i)}{|T|_f}    (1)

where V_j(i) = 1 if the j-th tree voted for class i and 0 otherwise, and |T|_f denotes the total number of trees in forest f. Note that, by construction, \sum_i P(B_i) = 1 and P(B_i) >= 0 for all i, allowing us to treat P(B_i) as a probability measure.
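Equation (1) is the standard vote-averaging estimate that random forest implementations expose directly. The sketch below is our own illustration, not the paper's original tooling: with fully grown scikit-learn trees, the per-tree class probabilities are one-hot votes, so averaging them with predict_proba yields exactly the vote fraction of Equation (1).

```python
# Minimal sketch (our own illustration): P(B_i) is the fraction of trees voting
# for class i, which predict_proba exposes when the individual trees are fully
# grown (their per-tree probabilities are then one-hot votes).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def class_probabilities(forest: RandomForestClassifier, X: np.ndarray):
    proba = forest.predict_proba(X)                    # one row of P(B_i) per sample
    top = np.argmax(proba, axis=1)
    confidence = proba[np.arange(X.shape[0]), top]     # highest P(B_i) per sample
    return forest.classes_[top], confidence

# Usage with hypothetical feature matrices:
# forest = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
# predicted_author, confidence = class_probabilities(forest, X_test)
```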

Fig. 7: Confidence Thresholds for Verification. Accuracy, precision, and recall are plotted as a function of the classification confidence threshold (ranging from 0 to 1).

There are multiple ways to assess classifier confidence, and we devise a method that calculates the classification confidence using classification margins. In this setting, the classification margin of a single instance is the difference between the highest and the second-highest P(B_i). The first step towards attacking an open world classification task is identifying the confidence threshold of the classifier for classification verification. Once we determine a confidence threshold based on training data, we can calculate the probability that an instance belongs to one of the programmers in the training set and accordingly accept or reject the classification.

We performed 900 classifications in a 100-class problem to determine the confidence threshold based on the training data. The accuracy was 95%; there were 40 misclassifications, with an average classification confidence of 0.49. We then took another set of 100 programmers with 900 samples and classified these 900 samples with the closed world classifier that was trained in the first step on samples from a disjoint set of programmers. All of the 900 samples are attributed to some programmer by the closed world classifier, with a mean classification confidence of 0.40. We can pick a verification threshold and reject all classifications with confidence below the selected threshold. Accordingly, all rejected open world samples and misclassifications become true negatives, and rejected correct classifications end up as false negatives. Open world samples and misclassifications above the threshold are false positives, and correct classifications above the threshold are true positives. Based on this, we plot accuracy, precision, and recall for varying confidence threshold values in Figure 7. The figure shows that the optimal rejection threshold to guarantee 90% accuracy on these 1,800 samples and 100 classes is around a confidence of 0.72. Other confidence thresholds can be picked based on precision and recall trade-offs. These results are encouraging for extending our programmer de-anonymization method to open world settings, where an analyst deals with many uncertainties under varying fault tolerance levels.

The experiments in this section can be used in software forensics to find the programmer of a piece of malware. In software forensics, the analyst does not know whether the anonymous code belongs to one of the programmers in the candidate set. In such cases, we can classify the anonymous sample and, if the fraction of tree votes in the random forest is below a certain threshold, reject the classification, considering the possibility that the sample might not belong to any of the classes in the training data. By doing so, we can scale our approach to an open world scenario in which we might not have encountered the suspect before. As long as we determine a confidence threshold based on training data [43], we can calculate the probability that an instance belongs to one of the programmers in the set and accordingly accept or reject the classification. As an example, we performed 270 classifications in a 30-class problem using all the features to determine the confidence threshold based on the training data. The accuracy was 96.67%; there were 9 misclassifications, and all of them were classified with less than 15% confidence by the classifier.
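The accept/reject bookkeeping behind Figure 7 can be written down directly. The following is our own sketch of that procedure, not the paper's code: each test sample carries its top classification confidence, a flag indicating whether the closed world prediction was correct, and a flag indicating whether it is an open world sample; sweeping the threshold then yields the accuracy, precision, and recall curves.

```python
# Minimal sketch (our own illustration) of the verification-threshold sweep
# behind Figure 7. Rejected open-world samples and misclassifications count as
# true negatives, rejected correct classifications as false negatives, accepted
# open-world samples and misclassifications as false positives, and accepted
# correct classifications as true positives.
import numpy as np

def sweep_thresholds(confidence, correct, open_world, thresholds=None):
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    confidence = np.asarray(confidence)
    wrong = np.asarray(open_world) | ~np.asarray(correct)
    curves = []
    for t in thresholds:
        accepted = confidence >= t
        tp = np.sum(accepted & ~wrong)
        fp = np.sum(accepted & wrong)
        fn = np.sum(~accepted & ~wrong)
        tn = np.sum(~accepted & wrong)
        accuracy = (tp + tn) / confidence.size
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        curves.append((t, accuracy, precision, recall))
    return curves
```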
F. Case Study: Nulled.IO Hacker Forum

On May 6, 2016, the well-known 'hacker' forum Nulled.IO was compromised and its forum dump was leaked, along with the private messages of its 585,897 members. The members of such forums share, sell, and buy stolen credentials and cracking software. A high number of the forum members are active developers who write their own code and sell it, or share some of their code for free in public GitHub repositories along with tutorials on how to use it. The private messages of the sellers in the forum include links to their products and even screenshots of how the products work, intended for buyers. In the private messages, we were able to find declared authorship along with active links to members' software on sharing sites such as FileDropper2 and MediaFire3.

2 www.filedropper.com: 'Simplest File Hosting Website'
3 www.mediafire.com: 'All your media, anywhere you go'

For our case study, we created a dataset from four forum members with a total of thirteen Windows executables. One of the members had only one sample, which we used to test the open world setting described in Section VI-E. A challenge encountered in this case study is that the binary programs obtained from Nulled.IO do not contain native code, but bytecode for the Microsoft Common Language Infrastructure (CLI). Therefore, we cannot immediately analyze them using our existing toolchain. We address this problem by first translating the bytecode into corresponding native code using the Microsoft Native Image Generator (ngen.exe), and subsequently forcing the decompiler to treat the generated output files as regular native-code binaries. radare2, on the other hand, is not able to disassemble such output or the original executables. Consequently, we had access only to a subset of the information gain feature set obtained from GCJ. We extracted a total of 605 features, consisting of decompiled source code features and ndisasm disassembly features. Nevertheless, we are able to de-anonymize these programmers with 100% accuracy, while the one sample from the open world class is classified in all cases with the lowest confidence (around 0.4), which is below the verification threshold and is thus recognized by the classifier as a sample that does not belong to the rest of the programmers.

A larger de-anonymization attack could be carried out by collecting code from GitHub users with relevant repositories and identifying all the available executables mentioned in the public portions of hacker forums. GitHub code can be compiled with the necessary parameters and used with the approach described in Section VI-D. Incorporating verification thresholds from Section VI-E can help handle programmers with only one sample. Consequently, a large number of members can be linked, reduced to a cluster, or directly de-anonymized.
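The disassembly side of the reduced feature set used in this case study comes from ndisasm output. As a concrete but simplified illustration (our own, not the exact 605-feature set), instruction-mnemonic frequencies can be extracted as follows, assuming flat 32-bit disassembly and whitespace-delimited output columns.

```python
# Minimal sketch (our own simplification): count instruction-mnemonic unigrams
# from ndisasm output as one family of disassembly features.
import subprocess
from collections import Counter

def mnemonic_histogram(binary_path: str, bits: int = 32) -> Counter:
    result = subprocess.run(["ndisasm", "-b", str(bits), binary_path],
                            capture_output=True, text=True, check=True)
    counts: Counter = Counter()
    for line in result.stdout.splitlines():
        parts = line.split()
        # ndisasm lines look roughly like: "<offset>  <hex bytes>  <mnemonic> <operands>"
        if len(parts) >= 3:
            counts[parts[2].lower()] += 1
    return counts
```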

The countermeasure against real-world programmer de-anonymization attacks requires a combination of various precautions. Developers should not have any public repositories. A set of programs should not be released under the same online identity. Programmers should try to adopt a different coding style in each piece of software they write and also try to code in different programming languages. Software should utilize different optimizations and obfuscations to avoid deterministic patterns. A programmer who achieves randomness across all of these potentially identifying factors would be very difficult to de-anonymize. Nevertheless, even the most privacy-savvy developer might be willing to contribute to open source software or to build a reputation for her identity based on her set of products, which would make maintaining anonymity a challenge.

Some of these developers obfuscate their code with the primary goal of hiding the source code; consequently, they are experienced in writing or using obfuscators and deobfuscators. An additional challenge in this case study, as described in Section VI-F, is that the binaries obtained from Nulled.IO contain CLI bytecode rather than native code and had to be translated to native code with the Microsoft Native Image Generator (ngen.exe) before our toolchain could process them.
VII. DISCUSSION

Our experiments are devised for a setting where the programmer is not trying to hide her coding style, and therefore only basic obfuscation techniques are considered. Accordingly, we focus on the general case of executable binary authorship attribution, which is a serious threat to privacy but at the same time an aid for forensic analysis.

We consider two data sets: the GCJ dataset and a dataset based on GitHub repositories. Using the GitHub dataset, we show that we can perform programmer de-anonymization with executable binary authorship attribution in the wild. We de-anonymize GitHub programmers by using stylistic features obtained from the GCJ dataset. Using the same small set of features, we perform a case study on the leaked hacker forum Nulled.IO and de-anonymize four of its members. The successful de-anonymization of programmers from these different sources supports the supposition that, in addition to its other useful properties for scientific analysis of attribution tasks, the GCJ dataset is a valid and useful proxy for real-world authorship attribution tasks.

The advantage of using the GCJ dataset is that we can perform the experiments in a controlled environment where the most distinguishing difference between programmers' solutions is their programming style. Every contestant implements the same functionality, in a limited amount of time, while the problems become more difficult at each round. This provides the opportunity to control the difficulty level of the samples and the skill set of the programmers in the dataset. In source code authorship attribution, programmers who can implement more sophisticated functionality have a more distinct programming style [16]. We observe the same pattern in executable binary samples and gain some software engineering insights by analyzing the stylistic properties of executable binaries. In contrast to GCJ, GitHub and Nulled.IO offer noisy samples. Nevertheless, our results show that we can de-anonymize programmers with high accuracy as long as enough training data is available.

Previous work shows that coding style is quite prevalent in source code. We were surprised to find that it is also preserved to a great degree in compiled code. Coding style is not just the use of particular syntactic constructs but also the AST flows, AST combinations, and preferred types of operations. Consequently, these patterns manifest in the binary and form a coding fingerprint for each author. We can de-anonymize programmers from compiled source code with great accuracy, and furthermore, we can de-anonymize programmers from source code compiled with optimization or after obfuscation. In our experiments, we see that even though basic obfuscation, optimization, or stripping of symbols transforms executable binaries more than plain compilation does, stylistic features are still preserved to a large degree. Such methods are not sufficient on their own to protect programmers from de-anonymization attacks.

In scenarios where authorship attribution is challenging, an analyst or adversary could apply relaxed attribution to find a suspect set of n authors instead of a direct top-1 classification. In top-10 attribution, the chance of having the original author within the returned set of 10 authors approaches 100%. Once the suspect set is reduced from hundreds to 10, the analyst or adversary can turn to content-based dynamic approaches and reverse engineering to identify the author of the executable binary sample. However, our experiments in these cases are performed using the information gain features determined from the unoptimized case with symbol tables intact. Future work that customizes the dimensionality reduction step for these cases (for example, removing features from the trees that are no longer relevant) may be able to improve upon these numbers, especially since dimensionality reduction provided such a large boost in the unoptimized case.

Even though executable binaries look cryptic and difficult to analyze, we can still extract many useful features from them. We extract features from disassembly, from control flow graphs, and from decompiled code to identify features relevant only to programming style. After dimensionality reduction, we see that each of these feature spaces provides programmer style information. The initial development feature set contains a total of 705,000 features for the 900 executable binary samples of 100 authors. Approximately 50 features derived from abstract syntax trees and assembly instructions suffice to capture enough key information about coding style to enable robust authorship attribution. We see that this reduced set of features remains valid across different datasets with different programmers, including optimized or obfuscated binaries. The reduced feature set also helps in scaling up the programmer de-anonymization approach: while we can identify 100 programmers with 96% accuracy, we can de-anonymize 600 programmers with 83% accuracy using the same reduced set of features. 83% is a very high number for such a challenging task, where the random chance of correctly identifying an author is 0.17%.
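Relaxed attribution, as discussed above, simply returns the n candidate authors with the highest class probability P(B_i) instead of only the top one. A minimal sketch (our own illustration) follows.

```python
# Minimal sketch (our own illustration) of relaxed top-n attribution.
import numpy as np

def top_n_suspects(forest, sample_features, n=10):
    """Return the n most likely authors for one sample, with their P(B_i)."""
    proba = forest.predict_proba(sample_features.reshape(1, -1))[0]
    ranked = np.argsort(proba)[::-1][:n]
    return [(forest.classes_[i], float(proba[i])) for i in ranked]
```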

VIII. LIMITATIONS

Our experiments suggest that our method is able to assist in de-anonymizing a much larger set of programmers with significantly higher accuracy than state-of-the-art approaches. However, there are assumptions that underlie the validity of our experiments, as well as inherent limitations of our method, which we discuss in the following paragraphs.

First, we assume that our ground truth is correct, but in reality programs in GCJ or on GitHub might be written by programmers other than the stated programmer, or by multiple programmers. Such a ground truth problem would cause the classifier to train on noisy models, which would lead to lower de-anonymization accuracy and a noisy representation of programming style.

Second, many source code samples from GCJ contestants cannot be compiled. Consequently, we perform the evaluation only on the subset of samples that can be compiled. This has two effects. First, we are performing attribution with fewer executable binary samples than the number of available source code samples. This is a limitation for our experiments, but it is not a limitation for an attacker who obtains the executable binary rather than the source code; if the attacker obtains the source code instead, she could perform regular source code authorship attribution. Second, we must assume that whether or not a code sample can be compiled does not correlate with the ease of attribution for that sample.

Third, we mainly focus on C/C++ code compiled with the GNU compiler gcc (except for the Nulled.IO samples) and assume that the executable binary format is the Executable and Linking Format (ELF). This is important to note, as dynamic symbols are typically present in ELF binaries even after stripping of symbols, which may ease the attribution task relative to other executable binary formats that may not contain this information. We defer an in-depth investigation of the impact that other compilers, languages, and binary formats might have on the attribution task to future work.

Finally, while we show that our method is capable of dealing with simple binary obfuscation techniques, we do not consider binaries that are heavily obfuscated to hinder reverse engineering. While simple systems, such as packers [2] or encryption stubs that merely restore the original executable binary into memory during execution, may be analyzed by simply recovering the unpacked or decrypted executable binary from memory, more complex approaches are becoming increasingly commonplace. A wide range of anti-forensic techniques exist [19], including methods designed specifically to prevent easy access to the original bytecode in memory, for example by modifying the process environment block or by triggering decryption on the fly via guard pages. Other techniques such as virtualization [3] transform the original bytecode into emulated bytecode running on virtual machines, making decompilation both labor-intensive and error-prone. Finally, the use of specialized compilers that lack decompilers and produce nonstandard machine code (see [17] for an extreme but illustrative example) may likewise hinder our approach, particularly if the compiler is not available and cannot be fingerprinted. We leave the examination of these techniques, both with respect to their impact on authorship attribution and to possible mitigations, to future work.
IX. CONCLUSION

De-anonymizing programmers has direct implications for privacy and anonymity. The ability to attribute authorship to anonymous executable binaries has applications in software forensics and is an immediate concern for programmers who would like to remain anonymous. We show that coding style is preserved through compilation, contrary to the belief that compilation wipes away stylistic properties. We de-anonymize 100 programmers from their executable binaries with 96% accuracy, and 600 programmers with 83% accuracy. Moreover, we show that we can de-anonymize GitHub developers or hacker forum members with high accuracy. Our work, while significantly improving on the limited prior approaches to programmer de-anonymization, also presents new methods to de-anonymize programmers in the wild from challenging real-world samples. We discover a small set of features that effectively represents coding style in executable binaries. We obtain this precise representation of coding style via two different disassemblers, control flow graphs, and a decompiler. With this comprehensive representation, we are able to re-identify GitHub authors from their executable binary samples in the wild, where we reach an accuracy of 65% for 50 programmers, even though these samples are noisy and the products of collaborative efforts.

Programmer style is embedded in executable binaries to a surprising degree, even when the binaries are obfuscated, generated with aggressive compiler optimizations, or stripped of symbols. Compilation, binary obfuscation, optimization, and stripping of symbols reduce the accuracy of stylistic analysis but are not effective in anonymizing coding style.

In future work, we plan to investigate snippet- and function-level stylistic information to de-anonymize the multiple authors of collaboratively generated binaries. We also defer the analysis of highly sophisticated compilation and obfuscation methods to future work. Nevertheless, we show that identifying stylistic information is feasible in real-world settings, and accordingly developers cannot assume they are anonymous unless they take extreme precautions as a countermeasure. Examples of possible countermeasures include a combination of randomized coding style, the use of different programming languages, and the employment of an indeterministic set of obfuscation methods. Since incorporating different languages or obfuscation methods is not always practical, especially in open source software, our future work will focus on completely stripping stylistic information from binaries to render them anonymous.

We also plan to look at different real-world executable binary authorship attribution cases, such as identifying the authors of malware, which goes through a mixture of sophisticated obfuscation methods combining polymorphism and encryption. Our results so far suggest that while stylistic analysis is unlikely to provide a "smoking gun" in the malware case, it may contribute significantly to attribution efforts.

Moreover, we show that attribution is sometimes possible with only small amounts of training binaries; however, having more binaries for training helps significantly. In addition, we observe that advanced programmers (as measured by progression in the GCJ contest) can be attributed more easily than their less skilled peers. Our results present a privacy threat to people who would like to release binaries anonymously.

REFERENCES

[1] "Hex-Rays decompiler," https://www.hex-rays.com/, visited November 2015.
[2] "UPX: the Ultimate Packer for eXecutables," http://upx.sourceforge.net/, visited November 2015.
[3] "Oreans Technology: Code Virtualizer," http://www.oreans.com/codevirtualizer.php, visited November 2015.
[4] "The GitHub repository hosting service," http://www.github.com, visited November 2015.
[5] "Google Code Jam programming competition," code.google.com/codejam, visited November 2015.
[6] S. Afroz, M. Brennan, and R. Greenstadt, "Detecting hoaxes, frauds, and deception in writing style online," in Proc. of the IEEE Symposium on Security and Privacy. IEEE, 2012.
[7] S. Afroz, A. Caliskan-Islam, A. Stolerman, R. Greenstadt, and D. McCoy, "Doppelgänger finder: Taking stylometry to the underground," in Proc. of the IEEE Symposium on Security and Privacy, 2014.
[8] A. Aiken et al., "Moss: A system for detecting software plagiarism," University of California–Berkeley, www.cs.berkeley.edu/aiken/moss.html, vol. 9, 2005.
[9] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi, "OBA2: An onion approach to binary code authorship attribution," Digital Investigation, vol. 11, 2014.
[10] E. Backer and P. van Kranenburg, "On musical stylometry—a pattern recognition approach," Pattern Recognition Letters, vol. 26, no. 3, pp. 299–309, 2005.
[11] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, "Scalable, behavior-based malware clustering," in NDSS, vol. 9. Citeseer, 2009, pp. 8–11.
[12] M. Brennan, S. Afroz, and R. Greenstadt, "Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity," ACM TISSEC, vol. 15, no. 3, 2012.
[13] S. Burrows and S. M. Tahaghoghi, "Source code authorship attribution using n-grams," in Proc. of the Australasian Document Computing Symposium, 2007.
[14] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Application of information retrieval techniques for source code authorship attribution," in Database Systems for Advanced Applications, 2009.
[15] ——, "Comparing techniques for authorship attribution of source code," Software: Practice and Experience, vol. 44, no. 1, pp. 1–32, 2014.
[16] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, "De-anonymizing programmers via code stylometry," in Proc. of the USENIX Security Symposium, 2015.
[17] C. Domas, "M/o/vfuscator," https://github.com/xoreaxeaxeax/movfuscator, visited November 2015.
[18] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research (JMLR), vol. 9, 2008.
[19] P. Ferrie, "Anti-unpacker tricks—part one," Virus Bulletin, 2008.
[20] G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas, "Effective identification of source code authors using byte-level information," in Proc. of the International Conference on Software Engineering. ACM, 2006.
[21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations Newsletter, vol. 11, 2009.
[22] M. A. Hall, "Correlation-based feature selection for machine learning," Ph.D. dissertation, The University of Waikato, 1999.
[23] E. R. Jacobson, N. Rosenblum, and B. P. Miller, "Labeling library functions in stripped binaries," in Proc. of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools, 2011.
[24] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin, "Obfuscator-LLVM – software protection for the masses," in Proc. of the IEEE/ACM 1st International Workshop on Software Protection (SPRO'15), 2015.
[25] A. Keromytis, "Enhanced attribution," DARPA-BAA-16-34, 2016.
[26] J. Kothari, M. Shevertalov, E. Stehle, and S. Mancoridis, "A probabilistic approach to source code authorship identification," in Information Technology: New Generations (ITNG'07). IEEE, 2007.
[27] J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," 2001.
[28] R. C. Lange and S. Mancoridis, "Using code metric histograms and genetic algorithms to perform author identification for software forensics," in Proc. of the Annual Conference on Genetic and Evolutionary Computation. ACM, 2007.
[29] M. Marquis-Boire, M. Marschalek, and C. Guarnieri, "Big game hunting: The peculiarities in nation-state malware research," in Proc. of Black Hat USA, 2015.
[30] A. W. McDonald, S. Afroz, A. Caliskan, A. Stolerman, and R. Greenstadt, "Use fewer instances of the letter 'i': Toward writing style anonymization," in Privacy Enhancing Technologies. Springer Berlin Heidelberg, 2012, pp. 299–318.
[31] T. C. Mendenhall, "The characteristic curves of composition," Science, pp. 237–249, 1887.
[32] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, "On the feasibility of internet-scale author identification," in Proc. of the IEEE Symposium on Security and Privacy, 2012.
[33] pancake, "Radare," radare.org, visited October 2015.
[34] B. N. Pellin, "Using classification techniques to determine source code authorship," 2000.
[35] A. Pfeffer, C. Call, J. Chamberlain, L. Kellogg, J. Ouellette, T. Patten, G. Zacharias, A. Lakhotia, S. Golconda, J. Bay et al., "Malware analysis and attribution using genetic information," in Proc. of the 7th International Conference on Malicious and Unwanted Software. IEEE, 2012.
[36] J. Quinlan, "Induction of decision trees," Machine Learning, 1986.
[37] N. A. Quynh, "Capstone," capstone-engine.org, visited October 2015.
[38] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, "Learning and classification of malware behavior," in Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2008.
[39] N. Rosenblum, X. Zhu, and B. Miller, "Who wrote this code? Identifying the authors of program binaries," in ESORICS, 2011.
[40] N. Rosenblum, B. P. Miller, and X. Zhu, "Recovering the toolchain provenance of binary code," in Proc. of the International Symposium on Software Testing and Analysis. ACM, 2011.
[41] N. E. Rosenblum, B. P. Miller, and X. Zhu, "Extracting compiler provenance from program binaries," in Proc. of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering. ACM, 2010.
[42] National Science and Technology Council, "Federal cybersecurity research and development strategic plan," whitehouse.gov/files/documents, 2016.
[43] A. Stolerman, R. Overdorf, S. Afroz, and R. Greenstadt, "Classify, but verify: Breaking the closed-world assumption in stylometric authorship attribution," in Working Group 11.9 on Digital Forensics. IFIP, 2014.
[44] S. Tatham and J. Hall, "The Netwide Disassembler: NDISASM," http://www.nasm.us/doc/nasmdoca.html, visited October 2015.
[45] P. van Kranenburg, "Composer attribution by quantifying compositional strategies," in ISMIR, 2006.
[46] W. Wisse and C. Veenman, "Scripting DNA: Identifying the JavaScript programmer," Digital Investigation, 2015.
[47] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, "Modeling and discovering vulnerabilities with code property graphs," in Proc. of the IEEE Symposium on Security and Privacy, 2014.
