
Journal of Computer Security 0 (0) 1
IOS Press

Decoupling Coding Habits from Functionality for Effective Binary Authorship Attribution

Saed Alrabaee a,*, Paria Shirani b, Lingyu Wang b, Mourad Debbabi b and Aiman Hanna c

a Information Systems and Security, United Arab Emirates University, UAE
E-mail: [email protected]
b CIISE, Concordia University, Montreal, QC, Canada
E-mails: [email protected], [email protected], [email protected]
c Computer Science, Concordia University, Montreal, QC, Canada
E-mail: [email protected]

Abstract. Binary authorship attribution refers to the process of identifying the author of a given anonymous binary file based on stylistic characteristics. It aims to automate the laborious and error-prone reverse engineering task of discovering information related to the author(s) of a binary code. Existing works typically employ machine learning methods to extract features that are unique for each author and subsequently match them against a given binary to identify the author. However, most existing works share a common critical limitation: they cannot distinguish between features representing program functionality and those representing authorship (e.g., authors' coding habits). Such a distinction is crucial for effective authorship attribution, because what is unique in a particular binary may be attributable to the author, the compiler, or the functionality. In this study, we present BINAUTHOR, a system capable of decoupling program functionality from authors' coding habits in binary code. To capture coding habits, BINAUTHOR leverages a set of features that are based on collections of functionality-independent choices made by authors during coding. Our evaluation demonstrates that BINAUTHOR outperforms existing methods in several aspects.
First, it successfully attributes a larger number of authors with a significantly higher accuracy (around 90%) based on large datasets extracted from selected open-source C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and several programming projects. Second, BINAUTHOR is more robust than previous methods; there is no significant drop in accuracy when the code is subjected to refactoring techniques, simple obfuscation, or compilation with different compilers. Finally, decoupling authorship from functionality allows us to apply BINAUTHOR to real malware binaries (Citadel, Zeus, Stuxnet, Flame, Bunny, and Babar) to automatically generate evidence of similar coding habits.

* Corresponding author. E-mail: [email protected].

Keywords: Binary Authorship Attribution, Malware Analysis, Coding Habits, Assembly Instructions

1. Introduction

Binary authorship attribution aims to automate the arduous and error-prone reverse engineering task of extracting information related to the author(s) of a program binary and further attribute the author(s). The ability to conduct such analyses at the binary level is especially important for security applications because the source code for malware is not always available. Compared to source code analysis, automating binary authorship attribution faces two main challenges: the binary code lacks many abstractions (e.g., variable names) that are present in the source code, and the time and space complexities of analyzing binary code are greater than those of the corresponding source code. Although significant efforts have been made to develop automated approaches for source code authorship attribution [1], these often rely on features that will likely be lost in the strings of bytes representing binary code after the compilation process, such as variable and function naming, original control and data flow structures, comments, and space layout. To this end, a few approaches to binary authorship attribution have been proposed; typically, they employ machine learning methods to extract unique features for each author and then match a given binary against such features to identify the authors [2–4]. These approaches share the following limitations: (i) the chosen features are generally not related to the author's style but rather to functionality; (ii) significantly lower accuracy is observed in the case of multiple authors [5]; (iii) the approaches are easily defeated by refactoring techniques or simple obfuscation methods; and (iv) the approaches handle only one platform (e.g., x86). Dealing with the binary authorship attribution problem requires novel features that can capture the author's stylistic characteristics, such as a preference for specific compilers, keywords, reused packages, implementation frameworks, and binary timestamps. More recently, the feasibility of authorship attribution on malware binaries was discussed at the BlackHat conference [6], where a set of features was employed to group malware binaries according to authorship. However, the process is not automated and requires considerable human intervention.
Key idea: We present BINAUTHOR, a system designed to recognize authors' coding habits by decoupling them from program functionality in binary code. Instead of using generic features (e.g., n-grams or small subgraphs of a CFG [4]), which may or may not be related to authorship, BINAUTHOR captures coding habits based on a collection of functionality-independent choices frequently made by authors during coding (e.g., preferring to use either if or switch, and relying more on either object-oriented modularization or procedural programming). We first investigate a large collection of source code and its mapping to assembly instructions in order to determine which coding habits may be preserved in the binary, and consequently design and integrate features based on such habits into BINAUTHOR. Then, we apply it to a series of problems related to binary authorship attribution, namely, attributing the author of a given binary from a set of candidate authors inside a repository, identifying the presence of multiple authors of the same binary on the basis of function-level analysis, and automatically extracting authorship-related evidence from real malware binaries.

Summary of results: Our evaluation shows that BINAUTHOR outperforms existing approaches [2–4] in several aspects. Specifically, the system attributes a larger number of authors with a significantly higher accuracy (approximately 90%) based on large-scale datasets extracted from selected open-source C++ projects on GitHub [7], Google Code Jam events [8], Planet Source Code contests [9], and students' programming projects. Furthermore, BINAUTHOR is more robust than previous methods, in the sense that the system can still attribute authors with an acceptable accuracy after the code is subjected to refactoring and simple obfuscation techniques.
Finally, applying BINAUTHOR to real malware binaries (Zeus-Citadel, Stuxnet-Flame, and Bunny-Babar) automatically generates evidence of similar coding habits shared by each pair of malware. These types of evidence match the findings of the technical reports published by antivirus vendors [10, 11] and a reverse engineering team [6].

Contributions: The main contributions of this study are:

(1) BINAUTHOR yields a higher accuracy that survives refactoring techniques and simple obfuscation. This shows the potential of BINAUTHOR as a practical tool to assist reverse engineers in a number of security-related tasks, such as identifying the author of a malware sample and clustering malware samples based on common authors.
(2) BINAUTHOR is among the first approaches that perform automated authorship attribution on real-world malware binaries. When we applied it to the Zeus-Citadel, Stuxnet-Flame, and Bunny-Babar malware binaries, it automatically generated evidence of coding habits shared by each malware pair, matching the findings of antivirus vendors [10, 11] and reverse engineering teams [6].
(3) The filtration component of BINAUTHOR (which labels functions as compiler-related, library-related, or user-related) is a stand-alone component. In addition, it can be employed as a pre-processing step in other applications, such as clone detection, to allow for the reduction of false positives and the enhancement of performance. We made the code available to the community^1.

In this paper, we significantly extend our previous work by studying three models: BINAUTHOR; BINAUTHOR_1, which combines BINAUTHOR with the compiler identification tool BINCOMP [12]; and BINAUTHOR_2, which combines BINAUTHOR with GITZ [13], a tool for lifting assembly instructions to a canonical form. Specifically, our major extensions are as follows: i) we introduce a set of illustrative examples to answer the following question: Does each programmer have a distinctive coding style reflected in his/her collection of functionality-independent choices?
(in Section 2.1); ii) we present use cases to highlight the interest of BINAUTHOR (in Section 2.3); iii) since the filtration component of BINAUTHOR (which labels functions as compiler-related, library-related, or user-related) is a stand-alone component, we add a detailed algorithm to make it possible for researchers to leverage it with any binary analysis tool (in Section 3.1); iv) we propose a new choice called "variable choice", re-compute the significance of BINAUTHOR choices, and accordingly re-conduct the results comparison (in Sections 3.2.2, 3.4, and 4.3, respectively); v) we conduct new experiments to study the impact of training dataset and test size (in Section 4.7); vi) we investigate the effect of choices on large-scale author identification and further show the percentage of choices found in a large-scale dataset (in Section 4.9); vii) we provide a comparison between the three proposed models (in Section 4.10); viii) finally, we add more details to highlight the findings of BINAUTHOR when it is applied to real malware binaries (in Section 5), and we investigate a new verification method to confirm the correctness of BINAUTHOR's findings (in Section 5.4).

2. Preliminaries

In this section, we first provide a general overview of the main idea of choosing and extracting coding habit styles in BINAUTHOR, followed by an illustrative example. Then, we present the threat model. Finally, we introduce the use cases in which BINAUTHOR can be employed.

2.1. Stylistic Choices

Programmers may develop their own coding habits over time.
For instance, a programmer may prefer to write code that takes advantage of object-oriented modularization over procedural programming, or s/he may have a tendency to utilize more global variables than local variables, and more while loops rather than for loops, etc. Profiling such functionality-independent programming choices may help to capture a programmer's coding habits and stylistic characteristics. However, although many stylistic characteristics may exist in source code and can be used to distinguish authors' styles, source code is not always available, e.g., in the case of malware analysis. This leads to the following questions: How many stylistic characteristics found in the source code would survive the compilation process, and could consequently be extracted from the binary? To what extent would the extracted characteristics represent the author's coding habits?

1 https://github.com/g4hsean/BinAuthor

Table 1
Examples of stylistic choices that can be captured at binary level

Category  | Sample Choices                                | Captured?
Keyword   | printf vs. cout                               | Yes
Keyword   | printf vs. printf with new line               | Yes
Keyword   | cout vs. cout with new line                   | Yes
Keyword   | scanf vs. cin                                 | Yes
Keyword   | void vs. int                                  | No
General   | if/else vs. switch/cases                      | Yes
General   | for loop vs. while loop                       | No
General   | … vs. memory                                  | Yes
General   | API calls                                     | Yes
General   | Empty methods                                 | No
Unusual   | Flushing buffer unnecessarily and excessively | Yes
Unusual   | Excessive printing of system clock time       | Yes
Action    | How a function is ended                       | Yes
Action    | How parameters are provided                   | Yes
Action    | How exceptions are handled                    | Yes
Action    | How directives are defined                    | No
Action    | Use of recursion                              | Yes
Action    | Using system() calls                          | Yes
Action    | Using semaphores                              | Yes
Quality   | Handling standard library errors              | Yes
Quality   | Reopening of already opened files             | Yes
Quality   | Code reuse                                    | Yes
Quality   | Use of infinite loops                         | Yes
Quality   | Modifying constants through pointers          | Yes
Structure | Vectors vs. arrays                            | Yes
Structure | Global variables vs. local variables          | Yes
Structure | Procedural vs. object-oriented                | Yes

To answer the above questions, we investigate a large collection of source code and its mapping to assembly instructions to determine which stylistic characteristics may be preserved in the binary.
Additionally, we aim to isolate functionality-independent characteristics from those that are more likely driven by functionality requirements. We investigate various programs performing different tasks but written by the same author, for the purpose of averaging out the effect of functionality. To capture stylistic characteristics at different abstraction levels, BINAUTHOR employs many different types of features, such as statistical and structural features, opcode instructions, and specific templates of instructions.

Table 2
Examples of C++ keyword choices and options

CPP Keyword | Options                                                            | Alternative(s)
printf      | d or i for signed integer; f or F for decimal floating point;      | cout
            | e or E for scientific notation (lower or upper case);              |
            | a or A for hexadecimal floating point (lower or upper case);       |
            | with or without new line.                                          |
cout        | With or without new line; overloading <<...<<...;                  | printf
            | using it with other keywords, such as put(cout, "x = ");           |
cin         | Using the >> operator; using cin.get; using cin.getline;           | scanf
            | using cin.ignore.                                                  |
scanf       | d or u for decimal integer; f, e, a, or g for floating point;      | cin
            | number of parameters (%); scanning a set or an individual element. |
sscanf      | d, i, or n for integer; o, u, or x for unsigned integer;           | —
            | e, f, or g for long double.                                        |
fscanf      | e, E, f, g, G specifiers for floating point;                       | scanf
            | x, X specifiers for hexadecimal integer.                           |
put         | Insert character to stream.                                        | operator <<
getc        | Get character from stream.                                         | fgetc
fgetc       | Get character from stream.                                         | getc
ferror      | Check error indicator.                                             | perror
perror      | Print error message.                                               | ferror
strlen      | Get string length.                                                 | sizeof()
sizeof      | How many elements are in an array.                                 | strlen
Examples of stylistic choices. Table 1 provides examples of the functionality-independent choices and indicates whether they can be captured at the binary level. For clarity, these choices are categorized into basic groups. For instance, the Keyword group indicates the preference for using a particular keyword over another. The preference can also be based on one variation of a keyword over other variations of the same keyword; calling a particular method often allows different options that can be indicative of style. A printf, for example, allows various format specifiers, such as %e and %E for scientific notation. BINAUTHOR considers these different variations of keywords as defined by the C/C++ language reference [14]. Unusual coding styles, such as the implementation of empty methods or making excessive use of particular calls unnecessarily, are grouped in the Unusual group. Table 2 provides examples of keywords that can be captured at the binary level, their options, and their alternatives. For instance, scanf and cin are alternatives that can both be captured at the binary level. The output of this preference is a set of strings that is used by the detection component.
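As an illustration of how such keyword preferences can be recovered from a compiled binary, the following minimal Python sketch scans a binary's bytes for marker strings. The marker table, labels, and function name here are our own hypothetical choices, not BINAUTHOR's actual implementation; as described above, the real system restricts matching to strings annotated to call and mov instructions.

```python
# Hypothetical marker strings whose presence in a binary can indicate a
# keyword preference (cf. Tables 1 and 2). A naive whole-file scan is
# shown for brevity; note, e.g., that b"scanf" also matches "fscanf"
# and "sscanf", which a real implementation would disambiguate.
KEYWORD_MARKERS = {
    b"printf": "printf-style output",
    b"scanf": "scanf-style input",
    b"fflush": "explicit buffer flushing",
}

def keyword_preferences(binary_bytes):
    """Return the set of preference labels whose markers occur in the binary."""
    return {label for marker, label in KEYWORD_MARKERS.items()
            if marker in binary_bytes}
```

The resulting set of strings plays the role of the preference set passed to the detection component.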

2.2. Threat Model

BINAUTHOR is specifically designed to assist reverse engineers in discovering information from a binary that may be related to its author(s). As such, it is not intended for general-purpose reverse engineering tasks, such as unpacking or deobfuscating malware samples. We investigate refactoring and obfuscation techniques later in this paper in order to demonstrate possible evading countermeasures that may be used by future malware authors to circumvent detection. It is worth mentioning that since we do not employ machine learning techniques to identify the authors, adversarial machine learning models cannot be used by adversaries to evade detection. More specifically, the threat model and scope of this paper are further clarified in what follows.

In-Scope Threats. Since research on binary authorship attribution is still in its infancy, BINAUTHOR is certainly not meant as a bullet-proof solution. As such, strengthening it against possible countermeasures will be an ongoing and continuous battle. Nonetheless, we dedicated special care to evading techniques while designing and implementing BINAUTHOR, and we have taken several potential evading techniques into consideration. First, we assume that adversaries might apply refactoring techniques to evade detection. Second, adversaries might apply obfuscation techniques to source and binary files. Third, a program binary can be significantly changed by simply choosing a different compiler or by altering the compilation settings. Finally, adversaries may intentionally avoid or fake some of their coding habits.

We show how BINAUTHOR survives the first three aforementioned threats.
As for the last threat, the features of BINAUTHOR have been carefully designed to capture coding habits at multiple abstraction levels, which makes it harder for adversaries to evade detection even if they are aware of the habits being looked for [15]. In addition, an operational solution is to customize and enrich the list of features based on the actual use case and learning data, which will both improve accuracy and make it more difficult for adversaries to hide all their habits.

Out-of-Scope Threats. As previously mentioned, BINAUTHOR is not intended for general-purpose reverse engineering tasks, such as unpacking, de-obfuscation, or decryption. The evading techniques we investigate are not intended to be exhaustive. Also, if adversaries possess sufficient knowledge about the exact settings (e.g., the exact list of choices) of a particular BINAUTHOR user, they may potentially apply BINAUTHOR to their malware in exactly the same manner and consequently tweak the code to evade detection. Further hardening of BINAUTHOR against such threats is an interesting and challenging research direction for future work.

2.3. Use Cases

The purpose of BINAUTHOR is to determine the authorship of an anonymous piece of binary code based on coding style. Additionally, given code written by multiple authors, it is required to determine which part of the code is attributed to which author. It is assumed that a set of binary code samples is available, where the code is either labelled with the author(s) from a set of known candidate authors, or comes from other, unknown authors. Given an anonymous piece of binary code, BINAUTHOR converts the code into a set of coding habits that are consequently used to either attribute the author(s), or conclude that the code belongs to some author(s) not in the BINAUTHOR repository.
This is useful for the following two applications: software infringement and forensic investigation.

In software infringement, a set of candidate authors is assembled based on previously collected malware samples, online code repositories, etc. There is no guarantee that the anonymous author is one of the candidates, since the test sample may not belong to any known author. Finally, it may be suspected that a piece of code was not written by the claimed author, yet there are no leads as to who the actual author might be. This is the authorship verification problem.

In forensic investigation, FireEye discovered that malware binaries share the same digital infrastructure and code, for instance, the use of certificates, executable resources, and development tools. FireEye investigators eventually noticed that malware binaries built on the same previously discovered infrastructures are written by the same group of authors. In such cases, training on such binaries and some random authors' code may offer vital help to forensic investigators. In addition, testing recent pieces of malware binary code using some confidence metrics would verify whether a specific author is the actual author.

In each of these applications, the adversary may try to actively modify the coding style of the anonymous code. In the case of software forensics, the adversary may modify his/her own code to hide his/her style. In the copyright and authorship verification applications, the adversary attempts to modify code written by another author to match his/her own style. In forensic applications, two of the parties may collaborate to modify the style of code written by one to match the other's style. BINAUTHOR emphasizes coding habits that are robust to adversarial manipulation: since these habits are driven by the adversary's own coding style, they cannot easily or completely be modified.

3. BinAuthor

Our goal is to automatically identify the author(s) of binary code. We approach this problem using different distance metrics; that is, we generate a list of functionality-independent choices from training data of sample binaries with known authors.
Hence, we propose a system encompassing different components, each of which is meant to achieve a particular purpose. Figure 1 illustrates the architecture of BINAUTHOR. The first component (Filtration) isolates user functions from compiler functions, library functions, and free open-source software (FOSS) packages. The outcome of this component is itself considered a habit (e.g., the preference for a specific compiler or open-source software package), for instance, using the GCC compiler rather than Visual Studio, or utilizing SQLite rather than MongoDB. The second component (Feature Categorization) analyzes binary code to extract possible features that represent stylistic choices. The third component (Author Habits Profiling) constructs a repository of habits of known authors. The last component (Authorship Attribution) performs matching against BINAUTHOR's repository for author attribution. These components are explained in the remainder of this section.

3.1. Filtration Process

An important initial step in most reverse engineering tasks is to distinguish between user functions and library/compiler functions [2]; this saves considerable time and helps shift the focus to more relevant functions. As shown in Figure 1, the filtration process consists of three steps. First, FLIRT [16] (Fast Library Identification and Recognition Technology) is used to label library functions. Then, a set of signatures is created for specific FOSS libraries, such as SQLite3, libpng, zlib, and openssl, using Fast Library Acquisition for Identification and Recognition (FLAIR) [16]; this set is added to the existing signatures of the IDA FLIRT list. In the last step, compiler function filtration is performed.
The underlying hypothesis is that compiler/helper functions can be identified based on a collection of static signatures created in the training phase (e.g., opcode frequencies). We analyzed a number of programs with different functionalities, ranging from a simple "Hello World!" program to programs executing complex tasks.

Fig. 1. BINAUTHOR architecture. Accesses to the databases are indicated by dashed lines. (In the learning phase, known binaries pass through Filtration, Feature Categorization, and Author's Habits Profiling; in the testing phase, a target file is matched against the stored habits for author identification.)

Through the intersection of these functions combined with manual analysis, we collected 120 functions as compiler/helper functions. The opcode frequencies are extracted from these functions, and the mean and variance of each opcode are calculated.

After passing through FLIRT, each disassembled program P consists of n functions {f_1, ..., f_n}. Each function f_k is represented as a set of m opcodes associated with the (mean, variance) pair of their frequency of occurrence in f_k. More specifically, each opcode o_i ∈ O has a pair of values (µ_i, ν_i), which represent the mean and variance values for that opcode.

Example. We introduce an example in Table 3 to show how we compute (µ_i, ν_i) for each opcode. Suppose a compiler function has the following opcodes and corresponding frequencies: push (19), mov (31), lea (13), and pop (7). We compute µ_i as µ_i = x_i / Σ_{j=1}^{n} x_j, where x_i represents the frequency of opcode i and n is the number of opcodes. Similarly, ν_i is computed as ν_i = (x_i − x̄)² / n, where x̄ represents the average of the frequencies. As shown in the table, µ for push is 19/(19 + 31 + 13 + 7) = 0.271, and its ν is (19 − 17.5)²/4 = 0.563.

Table 3
An example of computing (µ_i, ν_i)

Opcode    | push  | mov    | lea   | pop
Frequency | 19    | 31     | 13    | 7
µ         | 0.271 | 0.442  | 0.186 | 0.1
ν         | 0.563 | 45.563 | 5.063 | 27.563
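The per-opcode statistics of Table 3, together with the opcode matching and labelling thresholds used in this filtration step, can be sketched as follows. This is a minimal Python sketch under our reading of the equations in this subsection; the function names and the signature format are our own, not BINAUTHOR's actual code.

```python
def opcode_stats(freqs):
    """Per-opcode (mean, variance) pairs as in Section 3.1:
    mu_i = x_i / sum_j x_j and nu_i = (x_i - xbar)^2 / n,
    where n is the number of distinct opcodes."""
    total = sum(freqs.values())
    n = len(freqs)
    xbar = total / n
    return {op: (x / total, (x - xbar) ** 2 / n) for op, x in freqs.items()}

def dissimilarity(mu_i, nu_i, mu_j, nu_j):
    """D_{i,j} = (mu_i - mu_j)^2 / (nu_i^2 + nu_j^2)."""
    return (mu_i - mu_j) ** 2 / (nu_i ** 2 + nu_j ** 2)

def label_function(target, compiler_signatures, alpha=0.005, gamma=0.75):
    """Label a target function's opcode stats as compiler- or user-related:
    an opcode matches if D < alpha against the same opcode in some training
    signature; the function is compiler-related when the fraction of
    matched opcodes exceeds gamma."""
    matched = 0
    for op, (mu_j, nu_j) in target.items():
        if any(op in sig and
               dissimilarity(sig[op][0], sig[op][1], mu_j, nu_j) < alpha
               for sig in compiler_signatures):
            matched += 1
    return "compiler" if matched / len(target) > gamma else "user"
```

Reproducing the example above, opcode_stats({'push': 19, 'mov': 31, 'lea': 13, 'pop': 7}) yields µ ≈ 0.271 and ν ≈ 0.563 for push, matching Table 3.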

Dissimilarity measurement D_{i,j} is based on the distance between the target function j and the training function i, as per the following equation [17]:

    D_{i,j} = (µ_i − µ_j)² / (ν_i² + ν_j²),

where (µ_i, ν_i) and (µ_j, ν_j) represent the opcode mean and variance of the training function i and the target function j, respectively.

Each opcode o_i in the target function is measured against the same opcode of all compiler functions in the training set. If the distance D_{i,j} is less than a predefined threshold value α = 0.005, the opcode is considered a match. This dissimilarity metric detects functions that are close to each other in terms of the types of opcodes they use; for instance, logical opcodes are not present in compiler-related functions. Finally, a function is labelled as compiler-related if the number of matched opcodes over the total number of opcodes (m) is greater than a predefined threshold value, learned from experiments to be γ = 0.75. Otherwise, the target function is labelled as user-related.

3.2. Feature Categorization

Fig. 2. Coding Habits Taxonomy. (Recovered categories and items: Skills — memory allocation, encryption algorithms, hashing methods, global variable usage, semaphores and locks usage, reliance on the main function; Expertise — advanced methods, choice of API calls, software engineering models, code organization; Preferences — compilers, keywords, certain data structures, function termination; Design-Oriented Model — long code block usage, code structure, code quality, security concerns, standard compliance.)

Determining a set of characteristics that remain constant for a significant portion of the programs written by an author is analogous to finding human characteristics that can later be used for the identification of a specific person.
To this end, our aim is to automate the finding of such program characteristics at a reasonable computational cost. To capture coding habits at different abstraction levels, we consider a spectrum of such habits, as illustrated in Figure 2. As shown in the figure, an author's habits can be reflected in the preference for certain keywords or compilers, the reliance on the main function, or the use of an object-oriented programming paradigm. In addition, the manner in which the author organizes the code may also reflect the author's habits. All possible choices are stored as a template in this step. Moreover, we introduce a novel taxonomy of coding habits in Figure 2. We provide details of each category of functionality-independent choices in the following subsections.

3.2.1. General Choices

General choices are designed to capture the author's general programming preferences, for example, preferences in organizing the code, terminating a function, using particular keywords, or using specific resources. In this subsection, we first introduce the three proposed general choices and then provide the details of computing the final similarity score.

1) Code Organization: We capture how code is organized by measuring the reliance on the main function, since it is considered the starting point for managing user functions. We define a set of ratios, as shown in Table 4, that measure the actions used in the main function. We thus capture the percentage of usage of keywords, local variables, API calls, and calls to user functions, as well as the ratio between the number of basic blocks of the main function and the number of basic blocks of other user functions. These percentages are computed in relation to the length of the main function, where the length signifies the number of instructions in the function. The results are represented as a vector of ratio values, which is used by the detection component.
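Under our reading of the ratios just described, the resulting vector might be assembled as follows. This is a hypothetical Python sketch: the count keys and helper name are our own, and the caller is assumed to have already counted instructions and basic blocks for the main function.

```python
def code_organization_vector(counts, length, n_user_funcs, bb_main, bb_total):
    """Build the main-function ratio vector (cf. Table 4).
    counts: occurrences of selected instructions in the main function;
    length: number of instructions in main (l); bb_main/bb_total: basic
    block counts of main and of all user functions, respectively."""
    return [
        counts["push"] / length,            # local variables vs. length
        counts["push"] / counts["lea"],     # local variables vs. memory addresses
        counts["lea"] / length,             # memory address locations vs. length
        counts["call"] / length,            # function calls vs. length
        counts["indirect_call"] / length,   # API calls vs. length
        bb_main / bb_total,                 # basic blocks of main vs. all user funcs
        counts["call"] / n_user_funcs,      # calls vs. number of user functions
    ]
```

Each entry corresponds to one row of Table 4, so the vector has a fixed length of seven regardless of the program analyzed.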
14 15 15 16 Table 4 16 17 Features extracted from the main function: length(l): Number of instructions in the main function 17 18 Ratio Equation Description 18 19 # of push / l Ratio of local variables to length 19 20 # of push / # of lea Ratio of local variables to memory address locations 20 21 # of lea / l Ratio of memory address locations to length 21 22 # of calls / l Ratio of function calls to length 22 23 # of indirect calls / l Ratio of API calls to length 23 24 # of BBs / total # of all BBs Ratio of the number of basic blocks of the main function to that of other user functions 24 # of calls / # of user functions Ratio of function calls to the number of user functions 25 25 26 26 27 27 Table 5 28 28 Examples of actions taken in terminating a function 29 29 30 Features 30 31 Printing results to memory Printing results to file 31 32 Using system ("pause") User action such as cin 32 33 Calling user functions Calling API functions 33 Closing files Closing resources 34 34 Freeing memory Flushing buffer 35 35 Using network communication Printing clock time 36 36 Releasing semaphores or locks Printing errors 37 37 38 38 39 2) Function Termination: BINAUTHOR captures how an author terminates a function. This could 39 40 help identify an author since programmers may be used to specific ways of terminating a function. BIN- 40 41 AUTHOR does not only consider the last statement of a function as the terminating instruction; rather, 41 42 it considers the last basic block of the function with its predecessor as the terminating part. This is a 42 43 realistic consideration since various actions may be required before a function terminates. To this end, 43 44 BINAUTHOR not only considers the usual terminating instructions, such as return and exit, but 44 45 also captures other related actions that are taken prior to termination. For instance, a function may be 45 46 46 S. Alrabaee et al. / Authorship Attribution 11

terminated with a display of results, a call to another function, the release of some resources, communication over a network, etc. Table 5 shows examples of what is captured in relation to the termination of a function. Each feature is set to 1 if it is used to terminate a function; otherwise, it is set to 0. The output of this component is a binary vector that is used by the detection component.

3) Keyword and resource preferences: BINAUTHOR captures the author's preference for using different keywords or resources. We consider only groups of such preferences with equivalent or similar functionality, in order to avoid functionality-dependent features. These include keyword-type preferences for inputs (e.g., using cin or scanf), preferences for particular resources, a specific compiler (we identify the compiler using PEiD2), operating system (e.g., Linux), or CPU architecture (e.g., ARM), and the manner in which certain keywords are used, which can serve as a further indication of an author's habits. Some of these features are identified through binary string matching, which tracks the strings annotated to call and mov instructions. For instance, excessive use of fflush will cause the string fflush to appear frequently in the resulting binary.

General Choice Computation: We compute a set of vectors (vgi), where g represents the general category and i represents the sub-category number. To consider the reliance on the main function, a vector vg1, representing the related features, is constructed according to the equations shown in Table 4. These equations indicate the author's reliance on the main function as well as the actions performed by the author. Function termination is represented as a binary vector, vg2, which is determined by the absence or presence of a set of features for function termination.
Keyword and resource preferences are identified through binary string matching, which tracks the annotations to call and mov instructions. For instance, excessive use of fflush will cause the string "_imp_fflush" to appear frequently in the resulting binary. We extract a collection of strings from all user functions of a particular author, then intersect these strings in order to derive a persistent vector (vg3) for that author. Consequently, for each author, a set of vectors representing the author's signature is stored in our repository. Given a target binary, BINAUTHOR constructs the corresponding vectors from the target and measures the distance/similarity between these vectors and those in our repository. The vg1 vector is compared using the Euclidean distance, whereas the vg2 vector is compared using the Jaccard distance. For vg3, the similarity is computed through string matching. Finally, the three derived similarity values are averaged in order to obtain λg, which is later used for the purpose of author classification.

3.2.2. Variable Choices

Developers often have their own habits for defining local and global variables, which may originate from the author's experience or skills. The variable chain has been shown to greatly improve author attribution of source code [18]; it is defined as the variable usage among different functions. Inspired by this work, we introduce a register chain to capture authors' habits in using variables. We define a register chain as the states of a particular register's use through all the basic blocks in a user function. To avoid compilation-setting effects, we normalize the registers to general names, such as Reg1, Reg2, etc., and keep their occurrence order.
Useful characteristics of such chains include the longest chain, the shortest chain, the number of existing chains, the liveness of registers among basic blocks, etc.

2 https://www.aldeid.com/wiki/PEiD
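The register-chain construction described above can be sketched as follows (illustrative only; real input would be basic blocks recovered from a CFG by a disassembler):

```python
# Sketch of register-chain extraction: normalize register names and record,
# per normalized register, the basic blocks in which it is alive.

def register_chains(basic_blocks):
    """basic_blocks: list of basic blocks, each a list of register names
    in order of first use within the block."""
    mapping, liveness = {}, {}
    for bb_index, regs in enumerate(basic_blocks, start=1):
        for reg in regs:
            # Normalize to Reg1, Reg2, ... in first-occurrence order, so that
            # compiler register allocation does not leak into the feature.
            name = mapping.setdefault(reg, f"Reg{len(mapping) + 1}")
            liveness.setdefault(name, []).append(f"BB{bb_index}")
    # Keep each basic block once, preserving order of appearance.
    return {r: sorted(set(bbs), key=bbs.index) for r, bbs in liveness.items()}

# The three blocks of the RC4 example below: BB1 uses ecx/ebp, BB2 uses
# esi/ebx, and BB3 uses ebp/edx/al.
cfg = [["ecx", "ebp"], ["esi", "ebx"], ["ebp", "edx", "al"]]
chains = register_chains(cfg)
print(chains["Reg2"])   # ebp is alive in BB1 and BB3: ['BB1', 'BB3']
```

Running this on the Citadel RC4 fragment reproduces the liveness sets summarized in Table 6.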

Example: In what follows, we illustrate how a register chain is extracted. Part of the Control Flow Graph (CFG) of the RC4 function in Citadel is shown in Figure 3(a). Figure 3(b) shows the construction of the register liveness [19] for the indicated registers. As illustrated in Figure 3(b), the used registers ecx, ebp, esi, ebx, edx, and al are normalized to Reg1, ..., Reg6. The first, second, and third basic blocks manipulate the ⟨Reg1, Reg2⟩, ⟨Reg3, Reg4⟩, and ⟨Reg2, Reg5, Reg6⟩ registers, respectively. BINAUTHOR captures the register liveness by storing the set of basic blocks where the register is alive. For instance, the Reg2 register appears in the first and third basic blocks and does not appear in the second basic block, so we represent the liveness of the Reg2 register as {BB1, BB3}. A summary of the register liveness is given in Table 6.

[Figure 3 omitted: an assembly listing of the CFG fragment and the corresponding register-chain diagram.]

Fig. 3. (a) Part of the CFG of RC4 (b) Register chain

Table 6
Register liveness (✓ indicates that the register is alive in a BB)

Register    BB1    BB2    BB3
Reg1        ✓      -      -
Reg2        ✓      -      ✓
Reg3        -      ✓      -
Reg4        -      ✓      -
Reg5        -      -      ✓
Reg6        -      -      ✓

Variable Choice Computation: Since each function may have a large number of register chains, BINAUTHOR employs locality-sensitive hashing (LSH) for feature reduction. The hash is calculated over all sets of chains, and only those with similar hash values are clustered (hashed) to the same bucket; in particular, similar register chains will be hashed to the same bucket. Once the register chains have been hashed to their corresponding buckets, any bucket containing more than one similar hash value is identified, and a list of candidate register chains is extracted. Finally, similarity analysis is performed to rank the candidate pairs obtained from the previous steps. The similarity score obtained from this choice is λv.
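The bucketing step can be sketched with a MinHash-style signature over chain shingles. This is an assumption for illustration; the paper states only that locality-sensitive hashing is used, not which hash family.

```python
# Sketch of LSH bucketing of register chains (illustrative hash family).
from collections import defaultdict

def shingles(chain, k=2):
    sh = {tuple(chain[i:i + k]) for i in range(len(chain) - k + 1)}
    return sh or {tuple(chain)}          # guard for chains shorter than k

def signature(chain, num_hashes=4):
    # A cheap salted-hash family; adequate for a sketch.
    return tuple(min(hash((salt, s)) for s in shingles(chain))
                 for salt in range(num_hashes))

def candidate_buckets(chains):
    buckets = defaultdict(list)
    for chain in chains:
        buckets[signature(chain)].append(chain)
    # Buckets holding more than one chain yield the candidate pairs that
    # are later ranked by exact similarity analysis.
    return [group for group in buckets.values() if len(group) > 1]

chains = [("BB1", "BB2", "BB3"),
          ("BB1", "BB2", "BB3"),     # an identical chain from another function
          ("BB4", "BB5")]
print(candidate_buckets(chains))
```

Only the chains that collide in a bucket proceed to the exact (and more expensive) similarity ranking, which is what makes the reduction worthwhile.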

3.2.3. Quality-Related Choices

We investigate code quality in terms of compliance with C/C++ coding standards and security concerns. In the literature, code quality can be measured with different indicators, such as testability, flexibility, adaptability [20], etc. BINAUTHOR defines rules for capturing code that exhibits either relatively high or low quality. Any code that cannot be matched using such rules is labeled as regular quality, which indicates that the code quality feature is not applicable. Such rules are extracted by defining a set of signatures (sequences of instructions) for each choice.

Rules: Examples of low-quality coding styles include reopening already opened files, leaving files open when they are no longer in use, attempting to modify constants (e.g., through pointers), using float variables as loop counters, and declaring variables inside a switch statement, which can result in unexpected/undefined behavior due to jumped-over instructions. Examples of high-quality coding styles include handling errors generated by library calls (e.g., examining the value returned by fclose()), avoiding reliance on side effects (e.g., the ++ operator) within particular calls such as sizeof or _Alignof, avoiding particular calls in some environments or using them with protective measures (e.g., the use of system() in Linux may lead to shell command injection or privilege escalation; hence, using execve() instead is indicative of high-quality coding), and the use of locks and semaphores around critical sections.

Quality-Related Choice Computation: We build a set of idiom templates to describe high- or low-quality habits. Idioms are sequences of instructions with wild-card possibilities [21]. We employ the idiom templates of [21] according to our quality-related choices. In addition, such templates carry a meaningful connection to the quality-related choices, and our experiments demonstrate that they may effectively capture quality-related habits. BINAUTHOR uses the Levenshtein distance [22] for this computation due to its efficiency. The similarity is represented by λq, which is used for the author classification purpose:

    λq = 1 − L(Ci, Cj) / max(|Ci|, |Cj|)

where L(Ci, Cj) is the Levenshtein distance between the quality-related choices Ci and Cj (sequences of instructions), and max(|Ci|, |Cj|) returns the maximum length of the two choices Ci and Cj in terms of characters.

3.2.4. Embedded Choices

We define embedded choices as actions related to coding habits present in the source code that are not easily captured at the binary level by traditional features such as strings or graphs. For instance, initializing member variables in constructors and dynamically deleting allocated resources in destructors are examples of embedded choices. As it is not feasible to list all possible features, BINAUTHOR relies on the fact that opcodes reveal actions, expertise, habits, knowledge, and other author characteristics, and analyzes the distribution of opcode frequencies. Our experiments show that such a distribution can effectively capture the manner in which the author manages the code. As every single action in source code can affect the frequency of opcodes, BINAUTHOR targets embedded choices by capturing the distribution of opcode frequencies.
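Such a distribution, and its comparison with the Mahalanobis distance used in the embedded-choice computation below, can be sketched as follows. This is not the authors' code: the opcode vocabulary is a toy example, and the covariance is estimated here from the author's own functions.

```python
# Sketch of the embedded-choice features: per-function opcode-frequency
# histograms compared with the Mahalanobis distance.
import numpy as np

OPCODES = ["mov", "push", "pop", "call", "cmp", "jmp", "add"]  # toy vocabulary

def opcode_histogram(instructions):
    counts = np.array([instructions.count(op) for op in OPCODES], dtype=float)
    return counts / max(counts.sum(), 1.0)     # relative frequencies

def mahalanobis(x, y, cov):
    # The pseudo-inverse keeps the sketch stable when the covariance is
    # singular (few sample functions, many opcodes).
    vi = np.linalg.pinv(cov)
    d = x - y
    return float(np.sqrt(max(float(d @ vi @ d), 0.0)))

funcs = [["mov", "mov", "push", "call", "cmp"],
         ["mov", "push", "pop", "call", "jmp"],
         ["mov", "mov", "add", "call", "cmp"]]
hists = np.array([opcode_histogram(f) for f in funcs])
cov = np.cov(hists, rowvar=False)              # estimated over the author's functions
print(mahalanobis(hists[0], hists[1], cov))
```

Unlike the Euclidean distance, the covariance term lets the comparison account for correlations between opcode frequencies, which is the property the embedded choice relies on.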

Embedded Choice Computation: To measure the distance between distributions of opcode frequencies, the Mahalanobis distance [23] is used to measure the similarity of opcode distributions among different user functions. The Mahalanobis distance is chosen because it can capture the correlation between opcode frequency distributions, and this correlation represents the embedded choices. The similarity returned is represented by λe.

3.2.5. Structural Choices

Programmers usually develop their own habits in terms of the structural design of an application. They may prefer a fully object-oriented design, or they may be more accustomed to procedural programming. Such structural choices can serve as features for author identification. To avoid functionality, we consider the common subgraphs and the longest path for each user function, and then intersect them among different user functions. These subgraphs are defined as k-graphs, where k is the number of nodes. The common k-graphs form author signatures, since these graphs always appear regardless of the program functionality. In addition, we consider the longest path, since it reflects how an author tends to use deep or nested loops.

Example: An author may organize different classes in an ad hoc manner, or organize them hierarchically by designing a driver class to contain several manager classes, where each manager is responsible for a different process (a collection of threads running in parallel). Both ad hoc and hierarchical organizations will create a common structure in the author's programs.

Structural Choice Computation: BINAUTHOR uses subgraphs of size k in order to capture structural choices (k = 4, 5, and 6 in our experiments). Given a k-graph, the graph is transformed into strings using the Bliss open-source toolkit [24]. Then, a similarity measurement is performed over these strings using the normalized compression distance (NCD) [25], which enhances search performance. We chose NCD since it allows us to concatenate all the common subgraphs that appear in an author's programs; additionally, it allows for inexact matching between the target subgraphs and the training subgraphs. BINAUTHOR forms a signature based on these strings. The similarity obtained from this choice is represented by λs.

3.3. Discussion

Our main intuition is based on the hypothesis that the coding characteristics of a programmer/developer result from their educational background, level of expertise, and coding traits and preferences. Accordingly, we defined a set of coding habits based on the following intuitions for each category.

General Choice: Developers may use different keywords according to their preferences, experience, and knowledge. For instance, the data types operated on by an author could reveal the data access patterns used by that author. Moreover, the use of specific resources (e.g., files over memory) or the manner of terminating a function could reveal authors' habits.

Structural Choice: This choice captures the author's habits in organizing and structuring the code: for instance, creating deeply nested code, unusually long functions, or long chains of assignments; some authors prefer an object-oriented style, while others produce spaghetti code. Moreover, it captures the use of API calls and recursion. We believe this habit derives from the author's experience.

Variable Choice: This choice captures the method by which variables are manipulated, such as identifying aliases of the stack pointer. Further, developers often have their own habits for defining local and

global variables, which may originate from the author's experience or skills. For example, PC-relative addressing often represents access to a global variable, while scaled indexing often represents access to an array. Moreover, the variable chain has been shown to greatly improve author attribution of source code [1].

Quality-Related Choice: One of the characteristics that can reflect an author's expertise and knowledge is code quality. Therefore, we investigate code quality in terms of compliance with C/C++ coding standards and security concerns.

Embedded Choice: We define embedded choices to capture the actions related to coding habits (present in the source code) that are not easily captured at the binary level. This choice can reveal several characteristics, such as expertise, habits, and knowledge. As every single action in source code can affect the frequency of opcodes, BINAUTHOR targets embedded choices by capturing the distribution of opcode frequencies.

3.4. Classification

As previously described, BinAuthor extracts different types of choices to characterize different aspects of an author's coding habits. These choices do not contribute equally to the attribution process, since the significance of the indicators is not identical. Consequently, a weight is assigned to each choice by applying logistic regression to each choice individually in order to predict class probabilities (e.g., the probability of identifying an author). For this purpose, we use the dataset introduced in Section 4.2. To prevent overfitting, we test each dataset individually and then compute the average of the weights. The probability outcomes of the logistic regression prediction are shown in Table 7. We calculate the weights as follows:

    wi = rnd( (pi/ps) / Σ(i=1..4) (pi/ps) )

where ps is the smallest probability value (e.g., 0.39 in Table 7), pi is the probability outcome from the logistic regression of each choice, and the rnd function rounds the final value to two decimal places. For instance, the weight for the general choice is calculated as 2.128205/6.076923 ≈ 0.35.

Table 7
Logistic regression weights for choices

Choice        Probability (pi)   pi/(ps = 0.39)   Weight wi
General       0.83               2.128205         0.35
Qualitative   0.63               1.615385         0.27
Structural    0.52               1.333333         0.22
Embedded      0.39               1.000000         0.16
              Σ(i=1..4) (pi/ps) = 6.076923

After extracting the features, we define a probability value P based on the obtained weights. The author attribution (A) probability is defined as follows:

    P(A) = Σ(i=1..4) wi · λi

where wi represents the weight assigned to each choice, as shown in Table 7, and λi is the distance-metric value obtained for each choice (λg, λq, λe, and λs), as described in Section 3.2. We normalize the probabilities of all authors, and if P > ζ, where ζ represents a predefined threshold, then the author is labeled as a matched author. Through our experiments, we find that the best value of ζ is 0.87. If more than one author has a probability larger than the threshold value, then BinAuthor returns the set of those authors.

4. Evaluation

4.1. Implementation Setup

The described stylistic choices are implemented as separate Python scripts for modularity, which together form our analytical system. A subset of the Python scripts in the BINAUTHOR system is used in tandem with the IDA Pro disassembler. The final set of framework scripts performs the bulk of the choice-analysis functions that compute and display critical information about an author's coding style. On top of the analysis framework, a graph database is utilized to perform complex graph operations such as k-graph extraction; the graph database chosen for this task is Neo4j [26]. Gephi [27] is employed for the graph analysis functions that are not provided by Neo4j. A MongoDB database is used to store our features, for efficiency and scalability.

4.2. Dataset

Our dataset consists of several C/C++ applications from different sources, as described below: (i) GitHub [7], where a considerable number of real open-source projects are available; (ii) Google Code Jam [8], an international programming competition, for which solutions to difficult algorithmic puzzles are available; (iii) Planet Source Code [9], a web-based service that offers a large amount of source code written in different programming languages; and (iv) Graduate Student Projects at our institution. Statistics about the dataset are provided in Table 8.

Table 8
Statistics about the dataset used in the evaluation of BINAUTHOR

Source                      # of authors   # of programs   # of functions   Average # of code lines   # of files
GitHub                      5              10              400              5000                      754
Google Code Jam             50             250             650              80                        250
Planet Source Code          44             168             580              250                       400
Graduate Student Projects   25             125             875              250                       450

We compile the source code with different compilers and compilation settings to measure the effects of such variations. We use the GNU Compiler Collection (version

4.8.5) with different optimization levels, as well as Microsoft Visual Studio (VS) 2010. We study the impact of the Clang and ICC compilers in Section 4.11.

We perform two sets of experiments. In the first phase, we use a dataset of different authors implementing the same functionality. The purpose of this experiment is to ensure that we are not attributing functionality: it shows that our tool separates binaries according to authorship rather than functionality. In the second phase, we verify that our tool attributes a binary to its author: we collect binaries for each author across different versions and variant functionalities. In total, we test 800 authors from the different sets, where each author has two to five software applications, for a total of 3150 programs.

4.3. Evaluation Metrics

In our experimental setup, we split the collected program binaries into ten sets, reserving one as the testing set and using the remaining nine as the training set. To evaluate BINAUTHOR and to compare it with existing methods, the precision (P) and recall (R) measures are applied:

    Precision = TP / (TP + FP),    Recall = TP / (TP + FN)

where true positive (TP) indicates the number of relevant authors that are correctly retrieved; true negative (TN) the number of irrelevant authors that are not detected; false positive (FP) the number of irrelevant authors that are incorrectly detected; and false negative (FN) the number of relevant authors that are not detected. We choose F0.5 in order to identify the author precisely: this metric is more sensitive to false positives than to false negatives, so precision is given a higher priority than recall. We employ the F-measure as follows:

    F0.5 = 1.25 · (P · R) / (0.25 · P + R)

4.4. Accuracy

We compare BINAUTHOR with the existing authorship attribution methods [2–4]. The source code and dataset of our previous work, OBA2 [2], are available; it performs authorship attribution on a small scale of 5 authors with 10 programs each. The source code of the two other approaches, presented by Caliskan-Islam et al. [3] and Rosenblum et al. [4], is available at [28] and [29], respectively. Caliskan-Islam et al. and Rosenblum et al. present the largest-scale evaluations of binary authorship attribution, covering 600 authors with 8 training programs per author, and 190 authors with at least 8 training programs, respectively. However, since the corresponding datasets are not available, we compare BinAuthor with these methods using the datasets listed in Table 8.

Figure 4 details the results of comparing the accuracy of BINAUTHOR with that of all the other existing methods. It shows the relationship between the accuracy (F0.5) and the number of authors present in all datasets, where the accuracy decreases as the author population increases. The results show that BINAUTHOR achieves better accuracy in determining the author of binaries. Taking all four approaches into consideration, the highest accuracy of authorship attribution is close to 96%, on Google Code Jam with fewer than 50 authors, while the lowest accuracy is 22%, when 150 authors are involved across all datasets together. We believe that the reason behind the superiority of the Caliskan-Islam et al. approach on Google Code Jam is that this dataset is simple and can easily be decompiled to source code. BINAUTHOR also

identifies the author of the GitHub dataset with an accuracy of 90%. The main reason for this is the fact that the authors of GitHub projects have no restrictions when developing projects; in addition, the advanced programmers of such projects usually design their own classes or templates to be used across projects. The lowest accuracy obtained by BINAUTHOR is approximately 75%, on the Graduate Student Projects dataset with 25 authors. This is explained by the fact that programs in the Graduate Student Projects share common choices among different students due to assignment rules, which force students to change or restrict their habits accordingly. When the number of authors is 140 on the mixed dataset, the accuracy of the Rosenblum et al., Caliskan-Islam et al., and OBA2 approaches drops rapidly to 30% on all datasets, whereas our system's accuracy remains greater than 75%. This provides evidence for the stability of using coding habits in identifying authors. In total, the different categories of choices achieve an average accuracy of 84% for ten distinct authors and 75% when discriminating among 152 authors. These results show that author habits may survive the compilation process.

[Figure 4 omitted: F0.5 accuracy curves versus the number of authors for each method on each dataset.]

Fig. 4. Accuracy results of authorship attribution obtained by BINAUTHOR, Caliskan-Islam et al. [3], Rosenblum et al. [4], and OBA2 [2], on (a) GitHub, (b) Google Code Jam, (c) Planet Source Code, (d) Graduate Student Projects, and (e) all datasets

Accuracy Interpretation.
In addition to superior accuracy, our results demonstrate the following advantages of BINAUTHOR:

1) Feature Pre-processing: We have found that, in the existing methods, the top-ranked features are related to the compiler (e.g., stack-frame setup operations). It is thus necessary to filter irrelevant functions (e.g., compiler functions) in order to better identify author-related portions of code. To this end, we utilize a more elaborate filtration method to eliminate compiler effects and to label library, compiler, and open-source-software-related functions. Successfully distinguishing between these functions leads to considerable time savings and helps shift the focus of analysis to more relevant functions.
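As an illustration of such a filtration step, the sketch below labels functions by name patterns and size. The prefixes and the size threshold are assumptions for illustration only, not BINAUTHOR's actual filter.

```python
# Illustrative filtration sketch: drop compiler- and library-related
# functions before stylistic analysis. Patterns below are hypothetical.

COMPILER_PREFIXES = ("__security", "_RTC_", "__scrt", "_CRT")
LIBRARY_PREFIXES = ("_imp_", "std::", "operator")

def is_author_function(name, instructions):
    if name.startswith(COMPILER_PREFIXES + LIBRARY_PREFIXES):
        return False
    # Tiny thunks (e.g., jump stubs) carry little stylistic signal either.
    return len(instructions) > 3

funcs = {"main": ["push"] * 10,
         "__security_init_cookie": ["mov"] * 8,
         "sub_401200": ["mov", "call", "cmp", "jne", "ret"]}
user = [name for name, ins in funcs.items() if is_author_function(name, ins)]
print(user)   # ['main', 'sub_401200']
```

Only the surviving functions would be passed on to the choice-extraction components.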

2) Application Type: We find that the accuracy of the existing methods [2, 4] depends heavily on the application's domain. For example, in Figure 4, a good level of accuracy is observed for the Graduate Student Projects dataset, where the average accuracy is 75%. One explanation is that Rosenblum et al.'s approach extracts LibCalls, which are more useful for code written in academia than in other contexts; this can be explained by students' choice to rely systematically on external libraries and to use MFC APIs, for example. The results also show that OBA2 [2] is application-dependent, as this approach extracts the manner in which the author handles branches. For instance, its accuracy drops from 92% to 57% when Google Code Jam is used. After investigating the source code, we noticed that the number of branches is not high, which makes attribution even more difficult. Moreover, the manner in which branches are handled becomes less distinct with larger numbers of authors.

3) Source of Features: Existing methods use disassemblers and decompilers to extract features from binaries. Caliskan-Islam et al. [3] use a decompiler to translate the program into C-like pseudo-code via Hex-Rays [30]. They pass the code to a fuzzy parser for C, thus obtaining an abstract syntax tree from which features can be extracted. In addition to the limitations of Hex-Rays [30], the C-like pseudo-code differs from the original code to the extent that the variables, branches, and keywords are different. For instance, we found a function in the source code consisting of the following keywords: (1-do, 1-switch, 3-case, 3-break, 2-while, 1-if), with 2 variables. When we checked the same function after decompiling its binary, we found that it consists of the following keywords: (1-do, 1-else/if, 2-goto, 2-while, 4-if), with 4 variables.
This will evidently lead to misleading features, thus increasing the rate of false positives.

4.5. False Positives

We investigate the false positives in order to understand the situations in which BINAUTHOR is likely to make incorrect attribution decisions. Figure 5 shows the relationship between the false positives and the number of authors in the repository. For this experiment, we consider 4 programs for each author. For instance, when we have 50 authors (4 × 50 = 200 programs), BINAUTHOR misclassifies 16 programs; when the number of authors is 1100 (4 × 1100 = 4400 programs), the number of false positives is 402. These false positives are shown in Figure 5(a). We also show the false positives for each dataset in Figure 5(b). The false positive rate for the student dataset is clearly the highest; we believe the reason is that each student must follow standard coding instructions, which restricts their individual habits.

4.6. Identifying the Presence of Multiple Authors

In the previous section, we showed how our system identifies an author among a set of authors. In this section, we are interested in determining whether our system is capable of identifying the presence of multiple authors. We therefore extend our system to an open-world setting by clustering functions that are written by the same author.

We choose a common measure of cluster agreement known as Adjusted Mutual Information (AMI) [31] due to its stability across different numbers of clusters, which facilitates the comparison of different datasets [31]. For this measure, we use values within the range [0, 1], where higher scores indicate better cluster agreement. We manually collect a set of GitHub projects and check the contributors who have written the code of these projects.
We limit our system to programs written in C/C++, and

we ignore authors whose contribution is merely adding lines.

[Figure 5 omitted: the number of false positives versus (a) the number of authors and (b) the dataset type.]

Fig. 5. False positive analysis

For this purpose, we collect 65 projects to which 50 authors contributed. In what follows, we briefly describe the authorship clustering algorithm. We randomly select N authors from the GitHub projects and use large-margin nearest neighbors to learn a distance metric over the feature space [32]. We then randomly select 10, 20, 30, 40, and 50 different authors, and cluster their programs using k-means, with and without transforming the data with the learned metric. Due to the randomness in our experiment (in both the dataset and the k-means clustering algorithm), we repeat the experiment 10 times and compute the average AMI. Figure 6 depicts the clustering results.

[Figure 6 omitted: the correct cluster rate versus the number of (a) test and (b) training authors.]

Fig. 6. Clustering with metric learning. (a) The correct cluster rate (b) The effect of training size on the correct cluster rate

Figure 6(a) shows the correct cluster rate, and Figure 6(b) shows the effect of the training size on the clustering results: using more training authors leads to greater improvement. We conclude that stylistic information derived from one set of authors can be transferred to improve the clustering of programs written by a different set of authors.

4.7. Impact of Training and Testing Set Sizes

Figure 7 shows the impact of the training set size. The number of files per author is a factor distinct from the number of functions, although the two are highly correlated; the former appears to be a better predictor of performance in our experiments. One possible reason involves splitting files into functions, which allows us to capture the variance in an author's habits. Two points are evident from the graph.
First, accuracy is poor with fewer than 8 training files for the unusual choice. This is because this choice appears very seldom in programs written by the same author. This choice serves as a unique

S. Alrabaee et al. / Authorship Attribution 21

Fig. 7. Effect of training set size: precision vs. number of files per author, for all choices together and for the embedded, general, variable, quality, and structural choices individually.

signature for that author if it exists. Second, all choices together yield high accuracy even with a low number of training files. The results suggest that about 10 files are needed to attribute an author with an accuracy of 82%. This number was found through our experiments: we notice that 10 files allow our system to identify the consistent and common habits in an author's files, and we consider 10 files a minimum requirement to attribute an author with an accuracy of at least 80%. As the number of files increases, so does the accuracy our tool can achieve.

Figure 8 displays the results of identifying a target author when the number of his/her related functions varies. All choices together require 6 functions of the target author in order to identify him/her with an accuracy of 88%, whereas the unusual choice identifies the author with an accuracy of only 38% even when the number of functions is 10. These results are indicative of the ability of BINAUTHOR to match an author based on only a few related functions. Some functions do not have enough information to capture author-related habits and may thus be ignored.

Fig. 8. Effect of test set size: precision vs. number of functions per target author, for all choices together and for the embedded, general, variable, quality, and structural choices individually.

Our results indicate that while the identifiability of small numbers of functions is slightly reduced, we still achieve 75% accuracy on average with only two functions. Furthermore, the curves appear to reach an asymptote at around 6 functions per target file, with an accuracy of roughly 80%. We believe that this accuracy is a lower bound on the performance that could be achieved in practice, where more functions are usually available.
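As a toy illustration of why more files help, the attribution step can be sketched as a nearest-profile match, where an author's profile averages habit features over his/her training files. All author names and feature values below are hypothetical, not BINAUTHOR's actual features:

```python
def profile(files):
    """Average per-file habit feature vectors into one author profile."""
    return tuple(sum(d) / len(files) for d in zip(*files))

def attribute(target_files, profiles):
    """Attribute the target to the author whose profile is nearest
    (squared Euclidean distance) to the averaged target features."""
    target = profile(target_files)
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(target, p))
    return min(profiles, key=lambda name: dist(profiles[name]))

# Hypothetical per-file habit features (e.g., rates of two stylistic choices).
profiles = {
    "author1": profile([(0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]),
    "author2": profile([(0.1, 0.9), (0.2, 0.8)]),
}
# A single noisy file can mislead the match...
print(attribute([(0.4, 0.6)], profiles))
# ...while averaging over several files recovers the author's habits.
print(attribute([(0.4, 0.6), (0.9, 0.1), (0.95, 0.05)], profiles))
```

Averaging over files smooths out per-file noise, which matches the observation that accuracy grows with the number of training files.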

4.8. Scalability

Security analysts or reverse engineers may be interested in performing large-scale author identification, and in the case of malware, an analyst may have to deal with an extremely large number of new samples on a daily basis. With this in mind, we evaluate how well BINAUTHOR scales. To prepare a large dataset for the purpose of large-scale authorship attribution, we obtained programs from three sources: Google Code Jam, GitHub, and Planet Source Code. We eliminated from the experiment programs that could not be compiled because they contain bugs, as well as those written by authors who contributed only one or two programs. The resulting dataset comprised 103,800 programs by 23,000 authors: 60% from Google Code Jam, 25% from Planet Source Code, and 15% from GitHub. We modified the script³ used in [3] to download all the code submitted to the Google Code Jam competition; the programs from the other two sources were downloaded manually. The programs were compiled with the Visual Studio and gcc compilers, using the same settings as in our previous investigations. The experiment evaluated how well the top-weighted choices represent author habits. The results of the large-scale author identification are shown in Figure 9.
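The idea of scoring candidate authors by their top-weighted choices can be sketched as follows. The choice weights and repository contents below are hypothetical; BINAUTHOR adjusts its actual weights per dataset:

```python
# Hypothetical weights for the functionality-independent choice categories.
WEIGHTS = {"general": 0.25, "qualitative": 0.20, "embedded": 0.30,
           "variable": 0.15, "structural": 0.10}

def score(author_choices, target_choices):
    """Sum the weights of the choice categories shared with the target."""
    return sum(WEIGHTS[c] for c in author_choices & target_choices)

def identify(repository, target_choices):
    """Return the repository author with the highest weighted score."""
    return max(repository, key=lambda a: score(repository[a], target_choices))

repository = {
    "alice": {"general", "embedded", "variable"},
    "bob": {"qualitative", "structural"},
}
print(identify(repository, {"embedded", "variable", "structural"}))
```

Weighting lets rarer, more author-specific categories dominate the score, which is why precision degrades gracefully as the candidate pool grows.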

Fig. 9. Large-scale author attribution: precision vs. number of authors (×10⁴).

The figure shows the precision with which BINAUTHOR identifies the author; its scaling behavior as the number of authors increases is satisfactory. An author is identified among almost 4,000 authors with 72% precision. When the number of authors is doubled to 8,000, the precision is close to 65%, and it remains nearly constant (49%) after the number of authors reaches 19,000. Additionally, we tested BINAUTHOR on the programs from each of the sources. We found high precision (88%) for samples from the GitHub dataset, 82% precision for samples from the Planet dataset, and low precision (51%) for samples from Google Code Jam. After analyzing these results, we found that authors who wrote code for difficult tasks are easier to attribute than those who wrote code for easier tasks.

³ https://github.com/calaylin/CodeStylometry/tree/master/Corpus

4.9. Impact of Choices

We study the impact of each choice on precision, as depicted in Figure 10. For example, when the number of authors is 15,000, BINAUTHOR achieves a precision of 70% based on the use of variables, while with structural considerations it achieves a precision of 50%, the lowest of all the choices. When the number of authors reaches 21,000, the precision for embedded, general, quality, variable, and structural

choices is 56%, 45%, 49%, 60%, and 40%, respectively. From Figure 10, it seems reasonable to expect that when the number of authors exceeds 20,000, there will be little additional change in the precision. According to this experiment, the precision of the attributions made based on the choices extracted and stored in the BINAUTHOR repository will differ depending on the dataset, as presented in Figure 11.

Fig. 10. The effect of choices on large-scale author identification: precision vs. number of authors (10³), for the embedded, general, quality, variable, and structural choices.

As mentioned earlier, BINAUTHOR weights a set of functionality-independent choices, and it can adjust the weights according to the dataset type. Hence, we show the percentage of choices that BINAUTHOR finds in each dataset, as illustrated in Figure 11. For instance, the percentage of quality choices found in the Google Code dataset is 6%.

Fig. 11. The percentage of choices (general, quality, variable, structural, embedded) found in the large-scale datasets: (a) Google dataset, (b) Planet dataset, (c) GitHub dataset.

4.10. Impact of Compilers

To investigate the impact of different compilers, we studied three models: BINAUTHOR; BINAUTHOR1, which combines BINAUTHOR with BINCOMP [12], a tool for compiler identification; and BINAUTHOR2, which combines BINAUTHOR with the method proposed in GITZ [13] for lifting assembly instructions to canonical form.

Model I: BINAUTHOR. Specifically, we are interested in studying the impact of different compilers and compilation settings on the features used by BINAUTHOR and, consequently, on its accuracy. To prepare the required datasets, we collected 5 programs from each of 50 authors to use as training sets. We then compiled the source code with the VS, GCC, ICC, and Clang compilers and various optimization

31 (a) (b) (c) 31 32 32 Fig. 11. The percentage of choices found in large-scale datasets 33 33 34 34 35 4.10. Impact of Compilers 35 36 36 37 37 To investigate the impact of different compilers, we studied three models: BINAUTHOR;BINAUTHOR1, 38 38 which combines BINAUTHOR with BINCOMP [12], a tool for compiler identification; and BINAUTHOR2, 39 which combines BINAUTHOR with the adopted proposed method in GITZ [13] for lifting assembly in- 39 40 structions to canonical form. 40 41 41 42 Model I: BINAUTHOR. Specifically, we are interested in studying the impact of different compilers 42 43 and compilation settings on the features used by BINAUTHOR and consequently on its accuracy. To pre- 43 44 pare the required datasets, we collected 5 programs from each of 50 authors to use as training sets. We 44 45 then compiled the source code with the VS, GCC, ICC, and Clang compilers and various optimization 45 46 46 24 S. Alrabaee et al. / Authorship Attribution

settings (O0, O2, and Ox flags). To test the effects of different compilers, we kept the binaries compiled with VS as the training set and tried to identify the authors of binaries compiled with the other compilers. The results show that for most optimization levels, coding habits are largely preserved. However, the accuracy drops significantly (from 0.86 to 0.43) when the Clang or the ICC compiler is used, since BINAUTHOR is not trained on ICC and Clang binaries, whereas there is only a slight drop in accuracy (from 0.86 to 0.83) when the GCC compiler is used. We then classified the effect of each compiler as full, partial, or no effect, with a full effect corresponding to a significant difference between choices (the choices extracted from VS binaries are the baseline).

As shown in Table 9, we used the choices extracted from binaries generated by the VS compiler as a reference for comparison with the same choices extracted from binaries generated by the other compilers (GCC, ICC, Clang) to judge the degree of the effect. Some of the choices are robust to compiler effects because compilers change the syntax of code but not its semantics. Since some of the choices used in BINAUTHOR (embedded and qualitative) are based on semantic features, they are not substantially affected. However, we observed that the choices involving variables are affected more significantly, since they rely on the flow of instructions.

Table 9
Effects of different compilers on choices: (●) full effect, (◐) partial effect, (○) no effect.

Choice        VS    GCC   ICC   Clang
General       ○     ○     ◐     ◐
Structural    ○     ○     ◐     ◐
Qualitative   ○     ○     ○     ◐
Embedded      ○     ○     ◐     ◐
Variable      ○     ◐     ●     ●

Model II: BINAUTHOR1.
In light of the impact of compilers on the choices (Table 9), we made minor modifications to the architecture presented in Figure 1; the modified architecture is illustrated in Figure 12.

Fig. 12. The BINAUTHOR1 architecture: BINCOMP identifies the compiler (VS, GCC, ICC, or Clang) of the target file, and BINAUTHOR then consults the corresponding author-habit repository for author identification.

We extract choices from binaries compiled by the four compilers (GCC, VS, Clang, and ICC) and then store them in four corresponding repositories. Given a target binary, we apply our existing tool BINCOMP to identify the source compiler so that BINAUTHOR can use the corresponding repository

to identify the author of the binary. It should be pointed out that BINCOMP is designed to extract the elements comprising binary provenance, including the compiler, its version, optimization levels, and compiler-related functions. The accuracy of this model is 0.86, 0.85, 0.89, and 0.81 for VS, GCC, ICC, and Clang, respectively. This shows that BINAUTHOR can achieve high accuracy regardless of the source compiler when it is combined with a compiler identification tool.
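The BINAUTHOR1 pipeline can be sketched as a simple dispatch: a stand-in for BINCOMP identifies the compiler, and attribution then runs against the matching repository. All repository contents below are made up; in reality BINCOMP infers the compiler from the binary itself:

```python
# Hypothetical per-compiler repositories of author choices.
REPOSITORIES = {
    "vs":  {"alice": {"general", "embedded"}},
    "gcc": {"bob": {"variable", "structural"}},
}

def identify_compiler(binary):
    """Stand-in for BINCOMP: here the provenance is just a tag."""
    return binary["compiler"]

def attribute_author(binary):
    # Consult only the repository built from the identified compiler.
    repo = REPOSITORIES[identify_compiler(binary)]
    return max(repo, key=lambda a: len(repo[a] & binary["choices"]))

target = {"compiler": "gcc", "choices": {"variable", "general"}}
print(attribute_author(target))
```

Dispatching on the compiler first means each repository only ever contains choices extracted under one compiler, avoiding cross-compiler feature drift.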

Fig. 13. The BINAUTHOR2 architecture: binaries are disassembled, lifted by GITZ to LLVM-IR, and canonicalized; choices are then extracted from the canonical instructions, stored in a repository, and checked against target binaries for authorship detection.

Model III: BINAUTHOR2. Because Model II deals only with a limited number of compilers, we designed a new model that can handle the effect of any compiler, as presented in Figure 13. We adapted a recent technique used in GITZ [13], which converts assembly instructions into a semantic representation called the canonical form. Canonicalization is the process of translating binary or assembly code into a unique representation; one of the main benefits of this representation is that it is extremely robust to heavy forms of obfuscation, and the process also preserves the side effects of a function (e.g., flags). The following steps are performed. First, GITZ is used to lift binaries to the intermediate representation LLVM-IR and to canonicalize the instructions; at the same time, this step reoptimizes code that might not have been optimized previously. Next, we apply BINAUTHOR to the canonical instructions to extract the choices; however, the structural choices cannot be extracted from the canonical instructions, which is a limitation of this model. Finally, we build a repository of the extracted choices, which is used for authorship detection. This model, which is compiler-agnostic, achieves an accuracy of 0.88, as shown in Figure 14.
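In the same spirit as GITZ's canonical form (though not its actual algorithm), a toy canonicalizer can rename registers in order of first use and rewrite a known semantically equivalent pair to one form, so that syntactic differences introduced by compilers or obfuscators collapse:

```python
import re

# x86 register names we rename (a small, illustrative subset).
REG = re.compile(r"\b(e[a-d]x|e[sd]i|[re][sb]p|r\d+)\b")

def canonicalize(lines):
    names = {}
    out = []
    for line in lines:
        # `xor r, r` zeroes a register: rewrite it as `mov r, 0`.
        m = re.fullmatch(r"xor (\S+), \1", line)
        if m:
            line = f"mov {m.group(1)}, 0"
        # Rename registers to reg0, reg1, ... in order of first appearance.
        line = REG.sub(lambda r: names.setdefault(r.group(0), f"reg{len(names)}"), line)
        out.append(line)
    return out

# Two syntactically different but semantically identical snippets collapse
# to the same canonical form.
a = canonicalize(["xor eax, eax", "mov ebx, eax"])
b = canonicalize(["mov ecx, 0", "mov edx, ecx"])
print(a == b)
```

A real canonicalizer works on lifted IR and handles far more equivalences, but the effect is the same: features extracted from the canonical form no longer depend on the particular compiler's register allocation or instruction selection.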
However, efficiency is a problem: the time complexity is high, since all the target functions must be converted into canonical form. Model I achieves the highest accuracy since it does not incorporate leveraged tools such as BINCOMP and GITZ, which may generate some false positives; it is also more efficient in terms of the time required for choice extraction and author detection. We compare the three models in terms of processing time (including preprocessing, extraction, and detection), as shown in Figure 15.

4.11. Impact of Evading Techniques

Refactoring Techniques. We consider a random set of 50 files from our dataset, which we use for the C++ refactoring process [33, 34]. We consider the techniques of i) renaming a variable, ii) moving a

Fig. 14. The precision of the three different models (BinAuthor, BinAuthor*, BinAuthor**) for binaries compiled with VS, GCC, Clang, ICC, and LLVM.

Fig. 15. The processing time (s) for the three models (BinAuthor, BinAuthor*, BinAuthor**) when the number of authors is 1000.

Table 10
The evading techniques: methods used, tools used, and their effect on BINAUTHOR choices. (A) means the accuracy before applying the obfuscation method, while (A*) means the accuracy after applying it. (NA) means not applicable.

Tools                     Method                     Input     Output    A → A*
Refactoring [33, 34]      Rename variable            Binary    Binary    85 → 84
                          Move method                Binary    Binary    86 → 82
                          New method                 Binary    Binary    86 → 83
Obfuscator-LLVM [35]      CFG flattening             Binary    Binary    86 → 83
                          Instruction substitution   Binary    Binary    86 → 80
                          CFG bogus                  Binary    Binary    86 → 81
DaLin [36]                Instruction reordering     Assembly  Assembly  86 → 86
                          Dead code insertion        Assembly  Assembly  86 → 86
                          Register renaming          Assembly  Assembly  86 → 86
                          Instruction replacement    Assembly  Assembly  86 → 80
Tigress [37]              Virtualization             Source    Source    86 → 20
                          Jitting                    Source    Source    86 → 0
                          Dynamic                    Source    Source    86 → 10
PElock [38]               Hide procedure call        Assembly  Assembly  86 → 85
                          Insert fake instruction    Assembly  Assembly  86 → 86
                          Prefix junk opcode         Assembly  Assembly  86 → 86
                          Insert junk handlers       Assembly  Assembly  86 → 86
Nynaeve [39]              Frame pointer omission     Source    Binary    86 → 83
                          Function inlining          Source    Binary    86 → 78
OREANS [40]               Encrypt binary             Source    Binary    NA
Gas Obfuscator [41]       Junk byte                  Assembly  Assembly  86 → 86
Designed script           Loop unrolling             Source    Source    86 → 75

method from a superclass to its subclasses, and iii) extracting a few statements and placing them into a new method. We obtain an accuracy of 83.5% in correctly classifying authors, which is only a mild drop compared to the 85% accuracy observed without applying refactoring techniques.

Impact of Obfuscation. We are interested in determining how BINAUTHOR handles simple binary obfuscation techniques intended to evade detection, as implemented by tools such as Obfuscator-LLVM [42]. These obfuscators replace instructions with other, semantically equivalent instructions, introduce spurious control flow, and can even completely flatten control flow graphs. The obfuscation techniques implemented by Obfuscator-LLVM are applied to the samples prior to classifying the authors, and we then extract functionality-independent choices from the obfuscated samples. We obtain an accuracy of 82.9% in correctly classifying authors, which is only a slight drop compared to the 85% accuracy observed without obfuscation.

5. Applying BINAUTHOR to Real Malware Binaries

In this section, we apply BINAUTHOR to different sets of real malware: i) Bunny and Babar; ii) Stuxnet and Flame; and iii) Zeus and Citadel. These malware are selected based on researchers' claims that each of these pairs originates from the same set of authors [6, 10, 11]. Due to the lack of ground truth, we compare BINAUTHOR outputs to the findings in those technical reports. Details about the malware dataset are shown in Table 11. Packed malware binaries are manually unpacked using OllyDbg [43]; for obfuscated binaries, we manually decrypt encrypted functions using a suitable decryption algorithm.
Table 11
Characteristics of malware datasets

Malware   Type   # of binary functions   Source of malware sample
Zeus      PE     557                     Our security laboratory
Citadel   PE     794                     Our security laboratory
Flame     PE     1434                    Contagio [44]
Stuxnet   PE     2154                    Contagio [44]
Bunny     ELF    854                     VirusSign [45]
Babar     ELF    1025                    VirusSign [45]

5.1. Applying BINAUTHOR to Bunny and Babar

Findings. BINAUTHOR is able to find the following coding habits automatically: i) writing config keys in all capital letters in XML, and a preference for using Visual Studio 2008 (general choices); ii) using one variable over a long chain (variable choice); iii) the methods of accessing freed memory, deallocating dynamically allocated resources, and opening the same process in the same function (quality choices); and iv) the common manner in which functions are managed (structural choice).

Statistics. BINAUTHOR finds common functions between the Bunny and Babar malware that share the aforementioned coding habits as follows: 494 functions share qualitative choices; 450 functions share embedded choices; 372 functions share general choices; 278 functions share variable choices; and 127

functions share structural choices. Furthermore, BINAUTHOR finds 340 functions which share 4 choices, 478 functions which share 3 choices, 50 functions which share 2 choices, and 230 functions which share 1 choice.

Summary. We consider clusters based on 3 or more common choices to be written by one author, and those based on 1 or 2 choices to be written by multiple authors. BINAUTHOR can cluster the functions written by one author, in which case the percentage is 44% ((340 + 478) / (854 + 1025)), where 854 and 1025 are the total numbers of binary functions for Bunny and Babar, as shown in Table 11. On the other hand, the percentage of functions written by multiple authors is 23% ((150 + 290) / (854 + 1025)). For the remaining 33% of functions, no common choices are found. Based on our findings, we conclude that Bunny and Babar are written by a set of authors. Our results support the technical report published by Citizen Lab [6], which shows that both malware are written by a set of authors according to common implementation habits (general and qualitative choices) and infrastructure usage (general choice).

5.2. Applying BINAUTHOR to Stuxnet and Flame

Findings. BINAUTHOR automatically finds the following coding habits: i) using global variables, Lua scripts, and the specific open-source package SQLite, and choosing heap sort over other sorting methods (general choices); ii) opening and terminating processes (qualitative choice); iii) recursion patterns and using the POSIX socket API over the BSD socket API (structural choices); and iv) functions that are close in terms of Mahalanobis distance (distances around 0.1), and passing primitive types by value while passing objects by reference (embedded choices).

Statistics.
BINAUTHOR finds common functions that share the aforementioned coding habits as follows: 725 functions share general choices; 528 functions share qualitative choices; 300 functions share embedded choices; and 189 functions share structural choices. Furthermore, BINAUTHOR finds 180 functions which share 4 choices, 294 functions which share 3 choices, 515 functions which share 2 choices, and 689 functions which share 1 choice.

Summary. BINAUTHOR can cluster the functions written by one author, where the percentage is 13% ((180 + 294) / (1434 + 2154)), while 34% ((515 + 689) / (1434 + 2154)) of the functions are written by multiple authors. For the remaining 53% of functions, no common choices are found. Based on our findings, we conclude that Stuxnet and Flame are written by an organization, since they follow specific rules and targets. Our results support the technical report published by Kaspersky [10], which indicates that both malware use a particular infrastructure (e.g., Lua) and are related to an organization.

5.3. Applying BINAUTHOR to Zeus and Citadel

Findings. We apply BINAUTHOR to both binaries and cluster the functions based on functionality-independent choices. BINAUTHOR automatically finds the following coding habits: i) using network resources over file resources, creating configuration mostly via config files, and using specific packages such as webph and UltraVNC (general choices); ii) using switch statements over if statements (structural choice); iii) using semaphores and locks (qualitative choice); and iv) functions that are close in terms of Mahalanobis distance (distance = 0.0004) (embedded choice).
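The embedded-choice comparison above relies on Mahalanobis distance between function feature vectors. A minimal sketch, assuming a diagonal covariance and hypothetical feature values, is:

```python
import math

def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance under a diagonal covariance assumption."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

f1 = [0.50, 1.20, 3.10]   # hypothetical features of a function in one binary
f2 = [0.52, 1.18, 3.05]   # features of a candidate match in the other binary
var = [0.10, 0.20, 0.50]  # per-feature variances estimated over the corpus
d = mahalanobis_diag(f1, f2, var)
print(round(d, 4))        # small distances indicate similar embedded habits
```

Dividing by per-feature variance makes the distance scale-free, so a feature that naturally varies a lot across the corpus does not dominate the comparison.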

Statistics. BINAUTHOR finds common functions that share the aforementioned coding habits as follows: 655 functions share general choices; 452 functions share qualitative choices; 370 functions share embedded choices; and 289 functions share structural choices. Furthermore, BINAUTHOR finds that 258 functions share 4 choices, 194 functions share 3 choices, 588 functions share 2 choices, and 600 functions share 1 choice.

Summary. BINAUTHOR clusters 33% of the functions as written by a single author, while 53% of the functions are written by multiple authors. For the remaining 14% of functions, no common choices are found. Based on our findings, we conclude that Zeus and Citadel are written by a team of authors, as the results show that a large percentage of functions are written by more than one author. Our results match the findings of the technical report published by McAfee [11].

5.4. Verifying the correctness of BINAUTHOR findings

Due to the lack of ground truth, we verify the correctness of the BINAUTHOR findings using the following methods: comparing BINAUTHOR outputs to the findings of human experts in available technical reports [6, 10, 11]; and measuring the distance between the choices in one cluster and the choices in another to calculate the degree of similarity, in order to provide a clear indication of whether the choices are closely related to specific malware packages.

Comparison with technical reports. We compare the BINAUTHOR findings with those made by human experts in technical reports.

• For Bunny and Babar, our results match the technical report published by the Citizen Lab [6], which demonstrates that both malware packages were written by a set of authors according to common implementation traits (general and qualitative choices) and infrastructure usage (general choices).
The correspondence between the BINAUTHOR findings and those in the technical report is as follows: 60% of the choices matched those mentioned in the report, and 40% did not; 10% of the choices found in the technical report were not flagged by BINAUTHOR, as they require dynamic extraction of features, while BINAUTHOR uses a static process.
• For Stuxnet and Flame, our results corroborate the technical report published by Kaspersky [10], which shows that both malware packages use similar infrastructure (e.g., Lua) and are associated with an organization. In addition, the BINAUTHOR findings suggest that both malware packages originated from the same organization. The frequent use of particular qualitative choices, such as the way the code is secured, indicates the use of certain programming standards and strict adherence to the same rules. Moreover, the BINAUTHOR findings provide much more information concerning the authorship of these malware packages. The correspondence between the BINAUTHOR findings and those in the technical report is as follows: all the choices found in the report [10] were found by BINAUTHOR, but they represent only 10% of our findings; the remaining 90% of the BINAUTHOR findings were not flagged by the report.
• For Zeus and Citadel, our results match the findings of the technical report published by McAfee [11], indicating that Zeus and Citadel were written by the same team of authors. The correspondence between the findings of BINAUTHOR and those of McAfee is as follows: 45% of the choices matched those in the report, while 55% did not, and 8% of the technical report findings were not flagged by BINAUTHOR.
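The cluster percentages reported throughout this section are simple ratios of function counts over the two binaries' totals; for Bunny and Babar (function totals from Table 11), the computation is:

```python
def pct(shared, totals):
    """Percentage of all binary functions that fall in the cluster."""
    return round(100 * shared / sum(totals), 1)

bunny_babar = (854, 1025)  # total binary functions (Table 11)
single_author = pct(340 + 478, bunny_babar)  # functions sharing 3-4 choices
multi_author = pct(150 + 290, bunny_babar)   # functions sharing 1-2 choices
print(single_author, multi_author)           # ~44% and ~23% after rounding
```

The same ratio applies to the Stuxnet/Flame and Zeus/Citadel clusters, with the corresponding function totals substituted.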

Table 12
Number of common choices found between malware binaries

          Bunny   Babar   Stuxnet   Flame   Zeus   Citadel
Bunny     -       500     10        2       4      12
Babar     500     -       4         9       0      5
Stuxnet   10      4       -         750     14     3
Flame     2       9       750       -       17     6
Zeus      4       0       14        17      -      670
Citadel   12      5       3         6       670    -

Measuring similarity between choices in malware binaries. In this subsection, the goal is to assess the similarity between malware binaries by reporting statistics about their common choices (Table 12). We observed that there are only ten choices common to Bunny and Stuxnet, which clearly indicates that these malware packages were written by different authors. These choices are found in seven functions, which amounts to (7 / (854 + 2154)) = 0.2% shared author habits. In comparison, there are seventeen choices common to Flame and Zeus, found in thirty-eight functions, so the percentage of shared author habits is (38 / (1434 + 557)) = 2%. The results in Table 12 may provide clues about the validity of the BINAUTHOR findings.

5.5. More examples of findings

In this subsection, we show some examples of the common choices that we find in our malware binaries. One of the good habits that we find is the termination of a process once it is no longer in use, as shown in Listing 1. Moreover, we find an advanced feature (using a mutex), as shown in Listing 2. We also notice a bad habit in a set of functions where the author does not close the process; Listing 3 shows an example.

Listing 1: Qualitative choice: opening a process and terminating it

    call    ds:OpenProcess
    mov     esi, eax
    test    esi, esi
    jz      short loc_41DB97
    push    104h
    lea     eax, [ebp+String1]
    push    eax
    push    esi
    call    sub_42F93F
    cmp     al, 1
    jnz     short loc_41DB94
    push    [ebp+lpExistingFileName]
    lea     eax, [ebp+String1]
    call    sub_42DE63
    test    eax, eax
    jz      short loc_41DB94
    push    0FFFFFFFFh
    push    esi
    call    ds:TerminateProcess
26 27 27 28 28 29 Listing 1: Qualitative choice: opening process and terminating it 29 30 30 call ds:OpenProcess 31 mov esi, eax 31 32 test esi, esi 32 33 jz short loc_41DB97 33 push 104h 34 lea eax, [ebp+String1] 34 35 push eax; 35 36 push esi 36 call sub_42F93F 37 cmp al, 1 37 38 jnz short loc_41DB94 38 39 push [ebp+lpExistingFileName] 39 lea eax, [ebp+String1] 40 40 call sub_42DE63 41 test eax, eax 41 42 jz short loc_41DB94 42 43 push 0FFFFFFFFh 43 push esi 44 call ds:TerminateProcess 44 45 45 46 46 S. Alrabaee et al. / Authorship Attribution 31

Listing 2: Qualitative choice: using a mutex

    push    1
    push    offset MutexAttributes
    call    ds:CreateMutexW
    test    eax, eax
    jz      short loc_4229D

Listing 3: Qualitative choice: opening a process without terminating it

    push    esi             ; hModule
    mov     esi, ds:GetProcAddress
    mov     ds:dword_436A84, eax
    call    esi             ; GetProcAddress
    push    offset aPr_seterror
    push    ds:hModule
    mov     ds:dword_436A80, eax
    call    esi             ; GetProcAddress
    push    offset aPr_geterror
    push    ds:hModule
    mov     ds:dword_436A90, eax
    call    esi             ; GetProcAddress
    mov     ds:dword_436A7C, eax
    pop     esi
    retn    10h

Moreover, we find common sequences of instructions that are repeated among different functions. These sequences are due neither to functionality nor to the compiler. Listing 4 shows an example of such common instructions; we classify this kind of sequence as a general choice. Other examples of general choices are illustrated in Listing 5. We notice that some functions show a preference for using files over memory locations, and for using the command line over a config file for configuration, as shown in Listings 6 and 7, respectively. We also observe in some functions a preference for specific file resources, for instance, using a bitmap, as shown in Listing 8.

Listing 4: General choice: sequence of instructions

    inc     [ebp+var_2]
    mov     al, [ebp+var_2]
    cmp     al, [edi+1]
    jb      loc_430509
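A toy sketch of how such repeated sequences can be surfaced (not BINAUTHOR's actual extraction): count opcode n-grams across disassembled functions and keep those occurring in more than one function. The function bodies below are illustrative fragments:

```python
from collections import Counter

def opcode_ngrams(functions, n=3):
    """Count opcode n-grams over a set of disassembled function bodies."""
    counts = Counter()
    for body in functions:
        ops = [line.split()[0] for line in body]  # keep the mnemonic only
        for i in range(len(ops) - n + 1):
            counts[tuple(ops[i:i + n])] += 1
    return counts

funcs = [
    ["inc [ebp+var_2]", "mov al, [ebp+var_2]", "cmp al, [edi+1]", "jb loc_1"],
    ["push esi", "inc [ebp+var_8]", "mov al, [ebp+var_8]",
     "cmp al, [edi+1]", "jb loc_2"],
]
# Sequences seen in more than one function are candidate general choices.
common = [g for g, c in opcode_ngrams(funcs).items() if c > 1]
print(common)
```

Stripping operands keeps the sequences operand-independent, so the same habit is detected even when it touches different variables or labels.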

Listing 5: General choice: sequence of instructions

    push    offset aChrome_dll
    push    eax
    call    ds:lstrcmpiW
    test    eax, eax
    jnz     short loc_41A766
    movzx   eax, byte ptr [esp+14h+arg_0]
    push    eax
    mov     eax, esi
    call    CRT_sub_415C3B
    jmp     short loc_41A768

Listing 6: General choice: preference for using files over memory

    call    MTX_sub_431FEC
    mov     esi, ds:GetFileAttributesExW
    test    al, al
    jz      short loc_422AC9
    xor     edi, edi
    jmp     short loc_422ABD

Listing 7: General choice: preference for using the command line over a config file

    mov     [ebp+var_1], bl
    call    ds:SetErrorMode
    lea     eax, [ebp+pNumArgs]
    push    eax             ; pNumArgs
    call    ds:GetCommandLineW
    push    eax             ; lpCmdLine
    call    ds:CommandLineToArgvW
    xor     esi, esi
    cmp     eax, esi
    jz      loc_422D33
    xor     edx, edx
    cmp     [ebp+pNumArgs], esi
    jle     short loc_422CF2

Listing 8: General choice: preference for specific file resources

    call    edi             ; GetDeviceCaps
    push    0Ah
    push    [esp+704h+hdc]  ; hdc
    mov     [esi+2Ch], eax
    call    edi             ; GetDeviceCaps
    mov     ecx, [esi+2Ch]
    mov     edi, ds:CreateCompatibleBitmap
    push    eax
    push    ecx             ; cx
    push    [esp+708h+hdc]  ; hdc
    mov     [esi+30h], eax
    call    edi             ; CreateCompatibleBitmap
    cmp     eax, ebx
    jz      short loc_40F528

Regarding the usage of API calls, we find that the Windows shutdown API is used in different functions, as shown in Listing 9. The Win32 API ExitWindowsEx makes an RPC call to CSRSS.EXE, and CSRSS synchronously sends a message to all Windows applications.

Listing 9: Structural choice: using a specific API

    push    80000000h
    push    14h
    call    ds:ExitWindowsEx

6. Related Work

Source Code Authorship. Most previous studies of authorship analysis for general-purpose software have focused on source code [1, 46, 47]. These techniques are typically based on programming styles (e.g., naming of variables and functions, comments, and code organization) and the development environment (e.g., OS, programming language, compiler, and text editor). The features selected by these techniques are highly dependent on the availability of the source code, which is seldom available when dealing with malware binaries.

Binary Code Authorship. Binary code has drawn significantly less attention with respect to authorship attribution. This is mainly because many salient features that may identify an author's style are lost during the compilation process. In [2-4], the authors show that certain stylistic features can indeed survive the compilation process and remain intact in binary code, thus showing that authorship attribution for binary code should be feasible. The methodology developed by Rosenblum et al.
[4] is the first attempt to automatically identify authors of software binaries. The main concept employed by this method is to extract syntax-based features using predefined templates such as idioms, n-grams, and graphlets. A subsequent approach to automatically identify the authorship of software binaries is proposed by Alrabaee et al. [2]. The main concept employed by this method is to extract a sequence of instructions with specific semantics and to construct a graph based on register manipulation. A more recent approach to automatically identify the authorship of software binaries is proposed by Caliskan-Islam et al. [3]. The authors extract syntactical features present in source code from decompiled executable binaries. Most recently, Meng et al. [5] introduce new fine-grained techniques to address the problem of identifying the multiple authors of binary code by determining the author of each basic block. The authors extract syntactic and semantic features at the basic-block level, such as constant values in instructions, backward slices of variables, and the width and depth of a function's control flow graph (CFG).

Table 13 compares our approach with the aforementioned approaches. Please note that the results in the code transformation (CT) section are based on our experiments: when the accuracy dropped by 1–3%, we considered the approach "Not affected"; a drop of 4–14% gives "Partially affected"; and a drop of 15% or more is considered "Affected".

Table 13
Comparing the features, methods, applicability to binaries from other compilers, and resistance to evasion techniques of different existing solutions. (CT) stands for code transformation. (DCI) stands for dead code insertion. (IR) stands for instruction replacement. (IRO) stands for instruction reordering. (RT) stands for refactoring techniques. The symbol (X) indicates that the proposed solution offers the corresponding feature. (●): Not affected by the code transformation method. (○): Affected by the code transformation method. (◐): Partially affected by the code transformation method. (SA) stands for single author.
(MA) stands for multiple authors.

                  Features                                   Compiler              CT                    Binaries
                  Syntax  Semantic  Structural  Statistical  VS  GCC  Clang  ICC   DCI  IR  IRO  RT     ELF  PE    Application
    OBA2 [2]              X                                      X                                      X          SA
    Caliskan [3]          X         X           X                X                 ◐    ◐   ○           X          SA
    Rosenblum [4] X       X         X                            X                 ○    ○   ○           X          SA
    Meng [5]              X         X           X                                                       X          MA
    BINAUTHOR     X       X         X           X            X   X    X      X     ◐    ◐   ◐    ●      X    X     SA/MA

Malware Authorship Attribution. Most existing work on malware authorship attribution relies on manual analysis. In 2013, a technical report published by FireEye [48] discovered that malware binaries share the same digital infrastructure and code; for instance, the use of certificates, executable resources, and development tools. More recently, the team at Citizen Lab (https://citizenlab.org/) attributed malware authors based on a manual analysis of the exploit types found in binaries and the manner in which actions are performed, such as connecting to a command and control server. The authors in [6] presented a novel approach to creating credible links between binaries originating from the same group of authors. Their goal was to add transparency to attribution and to supply analysts with a tool that supports or refutes vendor statements. The technique is based on features derived from different domains, such as implementation details, applied evasion techniques, classical malware traits, or infrastructure attributes, which are leveraged to compare the handwriting among binaries.

7. Limitations and Concluding Remarks

Our work has a few important limitations. First, we assume that the binary code under analysis is unpacked. While this assumption may be reasonable for much general-purpose software, it implies the need for a pre-processing step involving unpacking before applying the method to malware.
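A common coarse pre-check for such an unpacking step is byte-level Shannon entropy, since packed or encrypted sections tend toward near-maximal entropy. The sketch below is illustrative only; the 7.0 threshold and the function names are our own assumptions, not part of BINAUTHOR.

```python
# Minimal sketch of a packed-binary heuristic: compute the Shannon
# entropy of a byte string (0..8 bits per byte) and flag inputs whose
# entropy exceeds a threshold as likely packed. The cut-off of 7.0 is
# an illustrative assumption, not a value from the paper.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    return shannon_entropy(data) > threshold

# Uniformly distributed bytes (all 256 values equally often) reach the
# maximal entropy of 8.0, while repetitive code-like data scores low.
print(looks_packed(bytes(range(256)) * 16))  # prints True
print(looks_packed(b"\x90" * 4096))          # prints False
```

In practice such a heuristic only flags candidates; actual unpacking would still rely on dedicated tools such as the generic unpackers cited in [90, 91].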
Second, our tool cannot handle most advanced obfuscation techniques, such as virtualization and jitting, since our system does not deal with bytecode. Third, although we have tested our work on various datasets, our dataset is still limited in terms of scale and scope. Fourth, our work yields lower accuracy when dealing with code compiled by ICC or Clang. Fifth, the features used by BINAUTHOR are based on static analysis; as such, BINAUTHOR does not detect habits that require dynamic features. Finally, BINAUTHOR currently supports the x86 ISA; other ISAs, such as ARM, should also be considered.

Our future work aims to extend BINAUTHOR to tackle these limitations. We point out four main avenues for future research in terms of improving our system. First, we suggest investigating more principled and robust cognition models for program understanding and for extending binary choices. Second, we will seek to extend our system to include visualizations of functionality-independent choices rather than presenting numeric results. Third, while we have demonstrated the viability of our system in identifying the author or clustering functions based on author habits, a more thorough investigation of different domains is necessary, along with an analysis of the impact of training and test set size. Finally, regarding the privacy threat: recent advances in binary authorship attribution can pose a serious threat to the privacy of anonymous coders. There are cases in which contributors to software projects may wish to hide their identities, for example, developers who do not want their employers to know about their independent activities [3]. While we believe that the ability to perform authorship attribution at the binary level has important security applications, we are also concerned about its implications for privacy. In response, we plan to design a tool that permits developers to anonymize themselves.
In addition, we will investigate existing methods that address the privacy threat through the analysis of code writing style [49] and study their applicability to binary code. Another interesting future direction will involve developing rules that identify characteristics that make a binary code harder to attribute.

To conclude, we have presented the first known effort on decoupling coding habits from functionality. Previous research has applied machine learning techniques to extract stylometric features and can distinguish between 5–50 authors, whereas we can handle up to 150 authors. In addition, existing works have only employed artificial datasets, whereas we included more realistic datasets. We also applied our system to known malware (e.g., Zeus and Citadel) as a case study. Our findings indicated that the accuracy of these existing techniques drops dramatically to approximately 45% at a scale of more than 50 authors. Authors with advanced expertise are easier to attribute than authors with less expertise, and authors of realistic datasets are easier to attribute than authors of artificial datasets. Specifically, in the GitHub dataset, the authors of a sample can be identified with greater than 90% accuracy. In summary, our system demonstrates superior results on more realistic datasets.

References

[1] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, "De-anonymizing programmers via code stylometry," in 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 255–270.
[2] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi, "OBA2: An onion approach to binary code authorship attribution," Digital Investigation, vol. 11, pp. S94–S103, 2014.
[3] A. Caliskan-Islam, F. Yamaguchi, E. Dauber, R. Harang, K. Rieck, R. Greenstadt, and A. Narayanan, "When coding style survives compilation: De-anonymizing programmers from executable binaries," arXiv preprint arXiv:1512.08546, 2015.
[4] N. Rosenblum, X. Zhu, and B. P. Miller, "Who wrote this code? Identifying the authors of program binaries," in Computer Security–ESORICS 2011. Springer, 2011, pp. 172–189.
[5] X. Meng, B. P. Miller, and K.-S. Jun, "Identifying multiple authors in a binary program," in European Symposium on Research in Computer Security. Springer, 2017, pp. 286–304.
[6] "Big Game Hunting: Nation-state malware research, BlackHat," 2015, https://www.blackhat.com/docs/us-15/materials/us-15-MarquisBoire-Big-Game-Hunting-The-Peculiarities-Of-Nation-State-Malware-Research.pdf.
[7] "GitHub – Build software better," https://github.com/trending?l=cpp, 2011.

[8] "The Google Code Jam," http://code.google.com/codejam/, 2008–2015.
[9] "The Planet Source Code," http://www.planet-source-code.com/vb/default.asp?lngWId=3#ContentWinners, 2015.
[10] "Technical report, Resource 207: Kaspersky Lab Research proves that Stuxnet and Flame developers are connected," http://www.kaspersky.com/about/news/virus/2012/.
[11] "Technical report, McAfee," 2011, www.mcafee.com/ca/resources/wp-citadel-trojan-summary.pdf.
[12] A. Rahimian, P. Shirani, S. Alrabaee, L. Wang, and M. Debbabi, "BinComp: A stratified approach to compiler provenance attribution," Digital Investigation, vol. 14, pp. S146–S155, 2015.
[13] Y. David, N. Partush, and E. Yahav, "Similarity of binaries through re-optimization," in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2017, pp. 79–94.
[14] "The C Language Library, Cplusplus website," http://www.cplusplus.com/reference/clibrary/, 2011.
[15] "Big Game Hunting: Nation-state malware research, BlackHat," 2015.
[16] "IDA Pro Fast Library Identification and Recognition Technology," https://www.hex-rays.com/products/ida/tech/, 2011.
[17] J. T.-L. Wang, Q. Ma, D. Shasha, and C. H. Wu, "New techniques for extracting features from protein sequences," IBM Systems Journal, vol. 40, no. 2, pp. 426–441, 2001.
[18] B. S. Elenbogen and N. Seliya, "Detecting outsourced student programming assignments," Journal of Computing Sciences in Colleges, vol. 23, no. 3, pp. 50–57, 2008.
[19] R. Muth, "Register liveness analysis of executable code," Manuscript, Dept. of Computer Science, The University of Arizona, Dec. 1998.
[20] V. Rajlich, "Software evolution and maintenance," in Proceedings of the Future of Software Engineering. ACM, 2014, pp. 133–144.
[21] D. E. Knuth, "Backus normal form vs. Backus Naur form," Communications of the ACM, vol. 7, no. 12, pp. 735–736, 1964.
[22] L. Yujian and L. Bo, "A normalized Levenshtein distance metric," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 6, pp. 1091–1095, 2007.
[23] P. C. Mahalanobis, "On the generalized distance in statistics," Proceedings of the National Institute of Sciences (Calcutta), vol. 2, pp. 49–55, 1936.
[24] T. A. Junttila and P. Kaski, "Engineering an efficient canonical labeling tool for large and sparse graphs," in ALENEX, vol. 7. SIAM, 2007, pp. 135–149.
[25] R. Cilibrasi and P. Vitanyi, "Clustering by compression," Information Theory, IEEE Transactions on, vol. 51, no. 4, pp. 1523–1545, 2005.
[26] "The Scalable Native Graph Database," http://neo4j.com/, 2015.
[27] "The Gephi plugin for Neo4j," https://marketplace.gephi.org/plugin/neo4j-graph-database-support/, 2015.
[28] "Programmer De-anonymization from Binary Executables," https://github.com/calaylin/bda, 2015.
[29] "The materials supplement for the paper 'Who Wrote This Code? Identifying the Authors of Program Binaries'," http://pages.cs.wisc.edu/~nater/esorics-supp/, 2011.
[30] "Hex-Rays decompiler," https://www.hex-rays.com/products/decompiler/, 2015.
[31] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance," The Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.
[32] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," The Journal of Machine Learning Research, vol. 10, pp. 207–244, 2009.
[33] "C++ refactoring tools for Visual Studio," http://www.wholetomato.com/, 2016.
[34] "Refactoring tool," https://www.devexpress.com/Products/CodeRush/.
[35] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin, "Obfuscator-LLVM – software protection for the masses," in Proceedings of the IEEE/ACM 1st International Workshop on Software Protection, SPRO'15, Firenze, Italy, May 19th, 2015, B. Wyseur, Ed. IEEE, 2015, pp. 3–9.
[36] D. Lin and M. Stamp, "Hunting for undetectable metamorphic viruses," Journal in Computer Virology, vol. 7, no. 3, pp. 201–214, 2011.
[37] "Tigress is a diversifying virtualizer/obfuscator for the C language," 2016, http://tigress.cs.arizona.edu/.
[38] "PELock is a software security solution designed for protection of any 32-bit Windows applications," 2016, https://www.pelock.com/.
[39] "Adventures in Windows debugging and reverse engineering," 2016, http://www.nynaeve.net/.
[40] "Advanced Windows software protection system," 2016, http://www.oreans.com/themida.php.
[41] "Script modifies GNU assembly files (.s) to confuse linear sweep disassemblers like objdump," 2016, https://github.com/defuse/gas-obfuscation.
[42] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin, "Obfuscator-LLVM: software protection for the masses," in Proceedings of the 1st International Workshop on Software Protection. IEEE Press, 2015, pp. 3–9.

[43] "OllyDbg, an assembler-level analysing debugger," http://www.ollydbg.de/.
[44] "Contagio: malware dump," http://contagiodump.blogspot.ca, May 2016.
[45] "VirusSign: Malware Research & Data Center, Virus Free," http://www.virussign.com/, May 2016.
[46] I. Krsul and E. H. Spafford, "Authorship analysis: Identifying the author of a program," Computers & Security, vol. 16, no. 3, pp. 233–257, 1997.
[47] E. H. Spafford and S. A. Weeber, "Software forensics: Can we track code to its authors?" Computers & Security, vol. 12, no. 6, pp. 585–595, 1993.
[48] N. Moran and J. Bennett, Supply Chain Analysis: From Quartermaster to Sun-shop. FireEye Labs, 2013, vol. 11.
[49] M. Brennan, S. Afroz, and R. Greenstadt, "Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity," ACM Transactions on Information and System Security (TISSEC), vol. 15, no. 3, p. 12, 2012.
[50] P. Shirani, L. Wang, and M. Debbabi, "BinShape: Scalable and robust binary library function identification using function shape," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2017, pp. 301–324.
[51] S. Alrabaee, P. Shirani, L. Wang, and M. Debbabi, "FOSSIL: a resilient and efficient system for identifying FOSS functions in malware binaries," ACM Transactions on Privacy and Security (TOPS), vol. 21, no. 2, p. 8, 2018.
[52] A. Aiken et al., "Moss: A system for detecting software plagiarism," University of California–Berkeley, www.cs.berkeley.edu/aiken/moss.html, vol. 9, 2005.
[53] S. Alrabaee, P. Shirani, M. Debbabi, and L. Wang, "On the feasibility of malware authorship attribution," in International Symposium on Foundations and Practice of Security. Springer, 2016, pp. 256–272.
[54] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 459–468.
[55] G. Palmer et al., "A road map for digital forensic research," in First Digital Forensic Research Workshop, Utica, New York, 2001, pp. 27–30.
[56] S. Schleimer, D. S. Wilkerson, and A. Aiken, "Winnowing: local algorithms for document fingerprinting," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, 2003, pp. 76–85.
[57] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Application of information retrieval techniques for source code authorship attribution," in Database Systems for Advanced Applications. Springer, 2009, pp. 699–713.
[58] R. Kilgour, A. Gray, P. Sallis, and S. MacDonell, "A fuzzy logic approach to computer software source code authorship analysis," 1998.
[59] P. Juola, "Authorship attribution," Foundations and Trends in Information Retrieval, vol. 1, no. 3, pp. 233–334, 2006.
[60] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, "On the feasibility of internet-scale author identification," in Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012, pp. 300–314.
[61] Z. Zhao, J. Wang, and J. Bai, "Malware detection method based on the control-flow construct feature of software," Information Security, IET, vol. 8, no. 1, pp. 18–24, 2014.
[62] R. Chen, L. Hong, C. Lü, and W. Deng, "Author identification of software source code with program dependence graphs," in Computer Software and Applications Conference Workshops (COMPSACW), 2010 IEEE 34th Annual. IEEE, 2010, pp. 281–286.
[63] G. Balakrishnan and T. Reps, "WYSINWYX: What you see is not what you execute," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 32, no. 6, p. 23, 2010.
[64] Y. David and E. Yahav, "Tracelet-based code search in executables," in ACM SIGPLAN Notices, vol. 49, no. 6. ACM, 2014, pp. 349–360.
[65] "EXEINFO PE," http://exeinfo.atwebpages.com/.
[66] "SEI CERT C++ Coding Standard," https://www.securecoding.cert.org/, 2016.
[67] "The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler," 2011, http://www.amazon.ca/The-IDA-Pro-Book-Disassembler/dp/1593272898.
[68] S.-S. Choi, S.-H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
[69] H. Gascon, F. Yamaguchi, D. Arp, and K. Rieck, "Structural detection of Android malware using embedded call graphs," in Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security. ACM, 2013, pp. 45–54.
[70] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, "Polymorphic worm detection using structural information of executables," in Recent Advances in Intrusion Detection. Springer, 2005, pp. 207–226.
[71] S. Cesare, Y. Xiang, and W. Zhou, "Control flow-based malware variant detection," Dependable and Secure Computing, IEEE Transactions on, vol. 11, no. 4, pp. 307–317, 2014.
[72] A. Desnos and G. Gueguen, "Android: From reversing to decompilation," Proc. of Black Hat Abu Dhabi, pp. 77–101, 2011.
[73] A. Dasgupta, R. Kumar, and T. Sarlós, "Fast locality-sensitive hashing," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 1073–1081.

[74] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 8, pp. 1226–1238, 2005.
[75] "Technical report," www.seifreed.es/docs/Citadel/Trojan/Report-eng.pdf, 2011.
[76] I. Santos, F. Brezo, J. Nieves, Y. K. Penya, B. Sanz, C. Laorden, and P. G. Bringas, "Idea: Opcode-sequence-based malware detection," in Engineering Secure Software and Systems. Springer, 2010, pp. 35–43.
[77] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, "Detecting unknown malicious code by applying classification techniques on opcode patterns," Security Informatics, vol. 1, no. 1, pp. 1–22, 2012.
[78] X. Wang and R. Karri, "Detecting kernel control-flow modifying rootkits," in Network Science and Cybersecurity. Springer, 2014, pp. 177–187.
[79] K. Kim and B.-R. Moon, "Malware detection based on dependency graph using hybrid genetic algorithm," in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation. ACM, 2010, pp. 1211–1218.
[80] "Citizen Lab," https://citizenlab.org/, 2015, University of Toronto, Canada.
[81] P. W. Oman and C. R. Cook, "A programming style taxonomy," Journal of Systems and Software, vol. 15, no. 3, pp. 287–301, 1991.
[82] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models. Benjamin-Cummings Publishing Co., Inc., 1986.
[83] T. Hoff, "C++ coding standard," http://www.possibility.com/Cpp/CppCodingStandard.html [accessed Feb. 15, 2006], 2000.
[84] N. Krislock and H. Wolkowicz, Euclidean Distance Matrices and Applications. Springer, 2012.
[85] M. Fowler, Refactoring: Improving the Design of Existing Code. Pearson Education India, 1999.
[86] "Malware channel: Channel for malware samples to download and analyze," https://twitter.com/MalwareChannel, July 2016.
[87] "Microsoft Malware Classification Challenge (BIG 2015)," https://www.kaggle.com/c/malware-classification, July 2016.
[88] "Blinded for submission."
[89] N. Rosenblum, B. P. Miller, and X. Zhu, "Recovering the toolchain provenance of binary code," in Proceedings of the 2011 International Symposium on Software Testing and Analysis. ACM, 2011, pp. 100–110.
[90] L. Martignoni, M. Christodorescu, and S. Jha, "OmniUnpack: Fast, generic, and safe unpacking of malware," in Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual. IEEE, 2007, pp. 431–441.
[91] M. G. Kang, P. Poosankam, and H. Yin, "Renovo: A hidden code extractor for packed executables," in Proceedings of the 2007 ACM Workshop on Recurring Malcode. ACM, 2007, pp. 46–53.
[92] S. Alrabaee, P. Shirani, L. Wang, M. Debbabi, and A. Hanna, "On leveraging coding habits for effective binary authorship attribution," in European Symposium on Research in Computer Security. Springer, 2018, pp. 26–47.