
Journal of Computer Security 0 (0) 1
IOS Press

Decoupling Coding Habits from Functionality for Effective Binary Authorship Attribution

Saed Alrabaee a,*, Paria Shirani b, Lingyu Wang b, Mourad Debbabi b and Aiman Hanna c

a Information Systems and Security, United Arab Emirates University, UAE
E-mail: [email protected]
b CIISE, Concordia University, Montreal, QC, Canada
E-mails: [email protected], [email protected], [email protected]
c Computer Science, Concordia University, Montreal, QC, Canada
E-mail: [email protected]

Abstract. Binary authorship attribution refers to the process of identifying the author of a given anonymous binary file based on stylistic characteristics. It aims to automate the laborious and error-prone reverse engineering task of discovering information related to the author(s) of a binary code. Existing works typically employ machine learning methods to extract features that are unique for each author and subsequently match them against a given binary to identify the author. However, most existing works share a common critical limitation: they cannot distinguish between features representing program functionality and those representing authorship (e.g., authors' coding habits). Such a distinction is crucial for effective authorship attribution, because what is unique in a particular binary may be attributable to the author, the compiler, or the functionality. In this study, we present BINAUTHOR, a system capable of decoupling program functionality from authors' coding habits in binary code. To capture coding habits, BINAUTHOR leverages a set of features that are based on collections of functionality-independent choices made by authors during coding. Our evaluation demonstrates that BINAUTHOR outperforms existing methods in several aspects.
First, it successfully attributes a larger number of authors with a significantly higher accuracy (around 90%) based on large datasets extracted from selected open-source C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and several programming projects. Second, BINAUTHOR is more robust than previous methods; there is no significant drop in accuracy when the code is subjected to refactoring techniques, simple obfuscation, or compilation with different compilers. Finally, decoupling authorship from functionality allows us to apply BINAUTHOR to real malware binaries (Citadel, Zeus, Stuxnet, Flame, Bunny, and Babar) to automatically generate evidence of similar coding habits.

* Corresponding author. E-mail: [email protected].

Keywords: Binary Authorship Attribution, Malware Analysis, Coding Habits, Assembly Instructions

1. Introduction

Binary authorship attribution aims to automate the arduous and error-prone reverse engineering task of extracting information related to the author(s) of a program binary and further attribute the author(s). The ability to conduct such analyses at the binary level is especially important for security applications because the source code for malware is not always available. Compared to source code analysis, automating binary authorship attribution faces two main challenges: the binary code lacks many abstractions (e.g., variable names) that are present in the source code, and the time and space complexities of analyzing binary code are greater than those of the corresponding source code. Although significant efforts have been made to develop automated approaches for source code authorship attribution [1], these often rely on features that will likely be lost in the strings of bytes representing binary code after the compilation process, such as variable and function naming, original control and data flow structures, comments, and space layout. To this end, a few approaches to binary authorship attribution have been proposed; typically, they employ machine learning methods to extract unique features for each author and then match a given binary against such features to identify the authors [2–4]. These approaches share the following limitations: (i) the chosen features are generally not related to the author's style but rather to functionality; (ii) significantly lower accuracy is observed in the case of multiple authors [5]; (iii) the approaches are easily defeated by refactoring techniques or simple obfuscation methods; and (iv) the approaches handle only one platform (e.g., x86). Dealing with the binary authorship attribution problem requires novel features that can capture the author's stylistic characteristics, such as a preference for specific compilers, keywords, reused packages, implementation frameworks, and binary timestamps. More recently, the feasibility of authorship attribution on malware binaries was discussed at the BlackHat conference [6], where a set of features was employed to group malware binaries according to authorship. However, the process is not automated and requires considerable human intervention.
Key idea: We present BINAUTHOR, a system designed to recognize authors' coding habits by decoupling them from program functionality in binary code. Instead of using generic features (e.g., n-grams or small subgraphs of a CFG [4]), which may or may not be related to authorship, BINAUTHOR captures coding habits based on a collection of functionality-independent choices frequently made by authors during coding (e.g., preferring to use either if or switch, and relying more on either object-oriented modularization or procedural programming). We first investigate a large collection of source code and its mapping to assembly instructions in order to determine which coding habits may be preserved in the binary, and consequently design and integrate features based on such habits into BINAUTHOR. Then, we apply it to a series of problems related to binary authorship attribution, namely, attributing the author of a given binary from a set of candidate authors inside a repository, identifying the presence of multiple authors of the same binary on the basis of function-level analysis, and automatically extracting authorship-related evidence from real malware binaries.

Summary of results: Our evaluation shows that BINAUTHOR outperforms existing approaches [2–4] in several aspects. Specifically, the system attributes a larger number of authors with a significantly higher accuracy (approximately 90%) based on large-scale datasets extracted from selected open-source C++ projects on GitHub [7], Google Code Jam events [8], Planet Source Code contests [9], and students' programming projects. Furthermore, BINAUTHOR is more robust than previous methods, in the sense that the system can still attribute authors with an acceptable accuracy after the code is subjected to refactoring and simple obfuscation techniques.
Finally, applying BINAUTHOR to real malware binaries (Zeus-Citadel, Stuxnet-Flame, and Bunny-Babar) automatically generates evidence of similar coding habits shared by each pair of malware. These types of evidence match the findings of the technical reports published by antivirus vendors [10, 11] and a reverse engineering team [6].

Contributions: The main contributions of this study are:

(1) BINAUTHOR yields a higher accuracy that survives refactoring techniques and simple obfuscation. This shows the potential of BINAUTHOR as a practical tool to assist reverse engineers in a number of security-related tasks, such as identifying the author of a malware sample and clustering malware samples based on common authors.
(2) BINAUTHOR is among the first approaches that perform automated authorship attribution on real-world malware binaries. When we applied it to the Zeus-Citadel, Stuxnet-Flame, and Bunny-Babar malware binaries, it automatically generated evidence of coding habits shared by each malware pair, matching the findings of antivirus vendors [10, 11] and reverse engineering teams [6].
(3) The filtration component of BINAUTHOR (which labels functions as compiler-related, library-related, or user-related) is a stand-alone component. In addition, it can be employed as a pre-processing step in other applications, such as clone detection, to allow for the reduction of false positives and the enhancement of performance. We made the code available to the community^1.

In this paper, we significantly extend our previous work by studying three models: BINAUTHOR; BINAUTHOR_1, which combines BINAUTHOR with the compiler identification tool BINCOMP [12]; and BINAUTHOR_2, which combines BINAUTHOR with GITZ [13], a tool for lifting assembly instructions to a canonical form. Specifically, our major extensions are as follows: i) we introduce a set of illustrative examples to answer the following question: Does each programmer have a distinctive coding style reflected in his/her collection of functionality-independent choices?
(in Section 2.1); ii) we present use cases to highlight the interest of BINAUTHOR (in Section 2.3); iii) since the filtration component of BINAUTHOR (which labels functions as compiler-related, library-related, or user-related) is a stand-alone component, we add a detailed algorithm to make it possible for researchers to leverage it with any binary analysis tool (in Section 3.1); iv) we propose a new choice called "variable choice", re-compute the significance of BINAUTHOR choices, and accordingly re-conduct the results comparison (in Sections 3.2.2, 3.4, and 4.3, respectively); v) we conduct new experiments to study the impact of training dataset and test size (in Section 4.7); vi) we investigate the effect of choices on large-scale author identification and further show the percentage of choices found in a large-scale dataset (in Section 4.9); vii) we provide a comparison between the three proposed models (in Section 4.10); viii) finally, we add more details to highlight the findings of BINAUTHOR when it is applied to real malware binaries (in Section 5), and we investigate a new verification method to confirm the correctness of BINAUTHOR's findings (in Section 5.4).

2. Preliminaries

In this section, we first provide a general overview of the main idea of choosing and extracting coding habit styles in BINAUTHOR, followed by an illustrative example. Then, we present the threat model. Finally, we introduce the use cases in which BINAUTHOR can be employed.

2.1. Stylistic Choices

Programmers may develop their own coding habits over time.
For instance, a programmer may prefer to write code that takes advantage of object-oriented modularization over procedural programming, or s/he may have a tendency to utilize more global variables than local variables, and more while loops rather than for loops, etc. Profiling such functionality-independent programming choices may help to capture a programmer's coding habits and stylistic characteristics. However, although many stylistic characteristics may exist in source code and can be used to distinguish authors' styles, source code is not always available, e.g., in the case of malware analysis. This leads to the following questions: How many stylistic characteristics found in the source code would survive the compilation process, and could consequently be extracted from the binary? To what extent would the extracted characteristics represent the author's coding habits?

1 https://github.com/g4hsean/BinAuthor

Table 1
Examples of stylistic choices that can be captured at binary level

Category  | Sample Choices                                | Captured?
Keyword   | printf vs. cout                               | Yes
Keyword   | printf vs. printf with new line               | Yes
Keyword   | cout vs. cout with new line                   | Yes
Keyword   | scanf vs. cin                                 | Yes
Keyword   | void vs. int                                  | No
General   | if/else vs. switch/cases                      | Yes
General   | for loop vs. while loop                       | No
General   | … vs. memory                                  | Yes
General   | API calls                                     | Yes
General   | Empty methods                                 | No
Unusual   | Flushing buffer unnecessarily and excessively | Yes
Unusual   | Excessive printing of system clock time       | Yes
Action    | How a function is ended                       | Yes
Action    | How parameters are provided                   | Yes
Action    | How exceptions are handled                    | Yes
Action    | How directives are defined                    | No
Action    | Use of recursion                              | Yes
Action    | Using system() calls                          | Yes
Action    | Using semaphores                              | Yes
Quality   | Handling standard library errors              | Yes
Quality   | Reopening of already opened files             | Yes
Quality   | Code reuse                                    | Yes
Quality   | Use of infinite loops                         | Yes
Quality   | Modifying constants through pointers          | Yes
Structure | Vectors vs. arrays                            | Yes
Structure | Global variables vs. local variables          | Yes
Structure | Procedural vs. object-oriented                | Yes

To answer the above questions, we investigate a large collection of source code and its mapping to assembly instructions to determine which stylistic characteristics may be preserved in the binary.
Additionally, we aim to isolate functionality-independent characteristics from those that are more likely driven by functionality requirements. We investigate various programs performing different tasks but written by the same author, for the purpose of averaging out the effect of functionality. To capture stylistic characteristics at different abstraction levels, BINAUTHOR employs many different types of features, such as statistical and structural features, opcode instructions, and specific templates of instructions.

Table 2
Examples of C++ keyword choices and options

CPP Keyword | Options                                                            | Alternative(s)
printf      | d or i for signed integer; f or F for decimal floating point;      | cout
            | e or E for scientific notation (lower or upper case);              |
            | a or A for hexadecimal floating point (lower or upper case);       |
            | with or without new line.                                          |
cout        | With or without new line; overloading <<...<<...;                  | printf
            | using it with other keywords, such as put(cout, "x = ");           |
cin         | Using the >> operator; using cin.get; using cin.getline;           | scanf
            | using cin.ignore.                                                  |
scanf       | d or u for decimal integer; f, e, a, or g for floating point;      | cin
            | number of parameters (%); scanning a set or an individual element. |
sscanf      | d, i, or n for integer; o, u, or x for unsigned integer;           | —
            | e, f, or g for long double.                                        |
fscanf      | e, E, f, g, G specifiers for floating point;                       | scanf
            | x, X specifiers for hexadecimal integer.                           |
put         | Insert character to stream.                                        | operator <<
getc        | Get character from stream.                                         | fgetc
fgetc       | Get character from stream.                                         | getc
ferror      | Check error indicator.                                             | perror
perror      | Print error message.                                               | ferror
strlen      | Get string length.                                                 | sizeof()
sizeof      | How many elements are in an array.                                 | strlen
Examples of stylistic choices. Table 1 provides examples of the functionality-independent choices and indicates whether they can be captured at the binary level. For clarity, these choices are categorized into basic groups. For instance, the Keyword group indicates the preference for using a particular keyword over another. The preference can also be based on one variation of a keyword over other variations of the same keyword; calling a particular method often allows different options that can be indicative of style. A printf, for example, allows various format specifiers, such as %e and %E for scientific notation. BINAUTHOR considers these different variations of keywords as defined by the C/C++ language reference [14]. Unusual coding styles, such as the implementation of empty methods or making excessive use of particular calls unnecessarily, are grouped in the Unusual group. Table 2 provides examples of keywords that can be captured at the binary level, their options, and their alternatives. For instance, scanf and cin are alternatives that can both be captured at the binary level. The output of this preference is a set of strings that is used by the detection component.
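As an illustration of how such keyword preferences can be recovered from a compiled binary, the following minimal Python sketch scans a binary's bytes for marker strings. The marker table, labels, and function name here are our own hypothetical choices, not BINAUTHOR's actual implementation; as described above, the real system restricts matching to strings annotated to call and mov instructions.

```python
# Hypothetical marker strings whose presence in a binary can indicate a
# keyword preference (cf. Tables 1 and 2). A naive whole-file scan is
# shown for brevity; note, e.g., that b"scanf" also matches "fscanf"
# and "sscanf", which a real implementation would disambiguate.
KEYWORD_MARKERS = {
    b"printf": "printf-style output",
    b"scanf": "scanf-style input",
    b"fflush": "explicit buffer flushing",
}

def keyword_preferences(binary_bytes):
    """Return the set of preference labels whose markers occur in the binary."""
    return {label for marker, label in KEYWORD_MARKERS.items()
            if marker in binary_bytes}
```

The resulting set of strings plays the role of the preference set passed to the detection component.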

2.2. Threat Model

BINAUTHOR is specifically designed to assist reverse engineers in discovering information from a binary that may be related to its author(s). As such, it is not intended for general-purpose reverse engineering tasks, such as unpacking or deobfuscating malware samples. We investigate refactoring and obfuscation techniques later in this paper in order to demonstrate possible evading countermeasures that may be used by future malware authors to circumvent detection. It is worth mentioning that since we do not employ machine learning techniques to identify the authors, adversarial machine learning models cannot be used by adversaries to evade detection. More specifically, the threat model and scope of this paper are further clarified in what follows.

In-Scope Threats. Since research on binary authorship attribution is still in its infancy, BINAUTHOR is certainly not meant as a bullet-proof solution. As such, strengthening it against possible countermeasures will be an ongoing and continuous battle. Nonetheless, we dedicated special care to evading techniques while designing and implementing BINAUTHOR, and we have taken several potential evading techniques into consideration. First, we assume that adversaries might apply refactoring techniques to evade detection. Second, adversaries might apply obfuscation techniques to source and binary files. Third, a program binary can be significantly changed by simply choosing a different compiler or by altering the compilation settings. Finally, adversaries may intentionally avoid or fake some of their coding habits.

We show how BINAUTHOR survives the first three aforementioned threats.
As for the last threat, the features of BINAUTHOR have been carefully designed to capture coding habits at multiple abstraction levels, which makes it harder for adversaries to evade detection even if they are aware of the habits being looked for [15]. In addition, an operational solution is to customize and enrich the list of features based on the actual use case and learning data, which will both improve accuracy and make it more difficult for adversaries to hide all their habits.

Out-of-Scope Threats. As previously mentioned, BINAUTHOR is not intended for general-purpose reverse engineering tasks, such as unpacking, de-obfuscation, or decryption. The evading techniques we investigate are not intended to be exhaustive. Also, if adversaries possess sufficient knowledge about the exact settings (e.g., the exact list of choices) of a particular BINAUTHOR user, they may potentially apply BINAUTHOR to their malware in exactly the same manner and consequently tweak the code to evade detection. Further hardening of BINAUTHOR against such threats is an interesting and challenging research direction for future work.

2.3. Use Cases

The purpose of BINAUTHOR is to determine the authorship of an anonymous piece of binary code based on coding style. Additionally, given code written by multiple authors, it is required to determine which part of the code is attributed to which author. It is assumed that a set of binary code samples is available, where the code is either labelled with the author(s) from a set of known candidate authors, or comes from other, unknown authors. Given an anonymous piece of binary code, BINAUTHOR converts the code into a set of coding habits that are consequently used to either attribute the author(s), or conclude that the code belongs to some author(s) not in the BINAUTHOR repository.
This is useful for the following two applications: software infringement and forensic investigation.

In software infringement, a set of candidate authors is assembled based on previously collected malware samples, online code repositories, etc. There is no guarantee that the anonymous author is one of the candidates, since the test sample may not belong to any known author. Finally, it may be suspected that a piece of code was not written by the claimed author, yet there are no leads as to who the actual author might be. This is the authorship verification problem.

In forensic investigation, FireEye discovered that malware binaries share the same digital infrastructure and code, for instance, the use of certificates, executable resources, and development tools. FireEye investigators eventually noticed that malware binaries built on the same previously discovered infrastructures are written by the same group of authors. In such cases, training on such binaries and some random authors' code may offer vital help to forensic investigators. In addition, testing recent pieces of malware binary code using some confidence metrics would verify whether a specific author is the actual author.

In each of these applications, the adversary may try to actively modify the coding style of the anonymous code. In the case of software forensics, the adversary may modify his/her own code to hide his/her style. In the copyright and authorship verification applications, the adversary attempts to modify code written by another author to match his/her own style. In forensic applications, two of the parties may collaborate to modify the style of code written by one to match the other's style. BINAUTHOR emphasizes coding habits that are robust to adversarial manipulation: since these habits are driven by the adversary's own coding style, they cannot easily or completely be modified.

3. BinAuthor

Our goal is to automatically identify the author(s) of binary code. We approach this problem using different distance metrics; that is, we generate a list of functionality-independent choices from training data of sample binaries with known authors.
Hence, we propose a system encompassing different components, each of which is meant to achieve a particular purpose. Figure 1 illustrates the architecture of BINAUTHOR. The first component (Filtration) isolates user functions from compiler functions, library functions, and free open-source software (FOSS) packages. The outcome of this component is itself considered a habit (e.g., the preference for a specific compiler or open-source software package), for instance, using the GCC compiler rather than Visual Studio, or utilizing SQLite rather than MongoDB. The second component (Feature Categorization) analyzes binary code to extract possible features that represent stylistic choices. The third component (Author Habits Profiling) constructs a repository of habits of known authors. The last component (Authorship Attribution) performs matching against BINAUTHOR's repository for author attribution. These components are explained in the remainder of this section.

3.1. Filtration Process

An important initial step in most reverse engineering tasks is to distinguish between user functions and library/compiler functions [2]; this saves considerable time and helps shift the focus to more relevant functions. As shown in Figure 1, the filtration process consists of three steps. First, FLIRT [16] (Fast Library Identification and Recognition Technology) is used to label library functions. Then, a set of signatures is created for specific FOSS libraries, such as SQLite3, libpng, zlib, and openssl, using Fast Library Acquisition for Identification and Recognition (FLAIR) [16]; this set is added to the existing signatures of the IDA FLIRT list. In the last step, compiler function filtration is performed.
The underlying hypothesis is that compiler/helper functions can be identified based on a collection of static signatures created in the training phase (e.g., opcode frequencies). We analyzed a number of programs with different functionalities, ranging from a simple "Hello World!" program to programs executing complex tasks.

Fig. 1. BINAUTHOR architecture. Accesses to the databases are indicated by dashed lines. (In the learning phase, known binaries pass through Filtration, Feature Categorization, and Author's Habits Profiling; in the testing phase, a target file is matched against the stored habits for author identification.)

Through the intersection of these functions combined with manual analysis, we collected 120 functions as compiler/helper functions. The opcode frequencies are extracted from these functions, and the mean and variance of each opcode are calculated.

After passing through FLIRT, each disassembled program P consists of n functions {f_1, ..., f_n}. Each function f_k is represented as a set of m opcodes associated with the (mean, variance) pair of their frequency of occurrence in f_k. More specifically, each opcode o_i ∈ O has a pair of values (µ_i, ν_i), which represent the mean and variance values for that opcode.

Example. We introduce an example in Table 3 to show how we compute (µ_i, ν_i) for each opcode. Suppose a compiler function has the following opcodes and corresponding frequencies: push (19), mov (31), lea (13), and pop (7). We compute µ_i as µ_i = x_i / Σ_{j=1}^{n} x_j, where x_i represents the frequency of opcode i and n is the number of opcodes. Similarly, ν_i is computed as ν_i = (x_i − x̄)² / n, where x̄ represents the average of the frequencies. As shown in the table, µ for push is 19/(19 + 31 + 13 + 7) = 0.271, and its ν is (19 − 17.5)²/4 = 0.563.

Table 3
An example of computing (µ_i, ν_i)

Opcode    | push  | mov    | lea   | pop
Frequency | 19    | 31     | 13    | 7
µ         | 0.271 | 0.442  | 0.186 | 0.1
ν         | 0.563 | 45.563 | 5.063 | 27.563
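The per-opcode statistics of Table 3, together with the opcode matching and labelling thresholds used in this filtration step, can be sketched as follows. This is a minimal Python sketch under our reading of the equations in this subsection; the function names and the signature format are our own, not BINAUTHOR's actual code.

```python
def opcode_stats(freqs):
    """Per-opcode (mean, variance) pairs as in Section 3.1:
    mu_i = x_i / sum_j x_j and nu_i = (x_i - xbar)^2 / n,
    where n is the number of distinct opcodes."""
    total = sum(freqs.values())
    n = len(freqs)
    xbar = total / n
    return {op: (x / total, (x - xbar) ** 2 / n) for op, x in freqs.items()}

def dissimilarity(mu_i, nu_i, mu_j, nu_j):
    """D_{i,j} = (mu_i - mu_j)^2 / (nu_i^2 + nu_j^2)."""
    return (mu_i - mu_j) ** 2 / (nu_i ** 2 + nu_j ** 2)

def label_function(target, compiler_signatures, alpha=0.005, gamma=0.75):
    """Label a target function's opcode stats as compiler- or user-related:
    an opcode matches if D < alpha against the same opcode in some training
    signature; the function is compiler-related when the fraction of
    matched opcodes exceeds gamma."""
    matched = 0
    for op, (mu_j, nu_j) in target.items():
        if any(op in sig and
               dissimilarity(sig[op][0], sig[op][1], mu_j, nu_j) < alpha
               for sig in compiler_signatures):
            matched += 1
    return "compiler" if matched / len(target) > gamma else "user"
```

Reproducing the example above, opcode_stats({'push': 19, 'mov': 31, 'lea': 13, 'pop': 7}) yields µ ≈ 0.271 and ν ≈ 0.563 for push, matching Table 3.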

Dissimilarity measurement D_{i,j} is based on the distance between the target function j and the training function i, as per the following equation [17]:

    D_{i,j} = (µ_i − µ_j)² / (ν_i² + ν_j²),

where (µ_i, ν_i) and (µ_j, ν_j) represent the opcode mean and variance of the training function i and the target function j, respectively.

Each opcode o_i in the target function is measured against the same opcode of all compiler functions in the training set. If the distance D_{i,j} is less than a predefined threshold value α = 0.005, the opcode is considered a match. This dissimilarity metric detects functions that are close to each other in terms of the types of opcodes they use; for instance, logical opcodes are not present in compiler-related functions. Finally, a function is labelled as compiler-related if the number of matched opcodes over the total number of opcodes (m) is greater than a predefined threshold value, learned from experiments to be γ = 0.75. Otherwise, the target function is labelled as user-related.

3.2. Feature Categorization

Fig. 2. Coding Habits Taxonomy. (Recovered categories and items: Skills — memory allocation, encryption algorithms, hashing methods, global variable usage, semaphores and locks usage, reliance on the main function; Expertise — advanced methods, choice of API calls, software engineering models, code organization; Preferences — compilers, keywords, certain data structures, function termination; Design-Oriented Model — long code block usage, code structure, code quality, security concerns, standard compliance.)

Determining a set of characteristics that remain constant for a significant portion of the programs written by an author is analogous to finding human characteristics that can later be used for the identification of a specific person.
To this end, our aim is to automate the finding of such program characteristics at a reasonable computational cost. To capture coding habits at different abstraction levels, we consider a spectrum of such habits, as illustrated in Figure 2. As shown in the figure, an author's habits can be reflected in the preference for certain keywords or compilers, the reliance on the main function, or the use of an object-oriented programming paradigm. In addition, the manner in which the author organizes the code may also reflect the author's habits. All possible choices are stored as a template in this step. Moreover, we introduce a novel taxonomy of coding habits in Figure 2. We provide details of each category of functionality-independent choices in the following subsections.

3.2.1. General Choices

General choices are designed to capture the author's general programming preferences, for example, preferences in organizing the code, terminating a function, using particular keywords, or using specific resources. In this subsection, we first introduce the three proposed general choices and then provide the details of computing the final similarity score.

1) Code Organization: We capture how code is organized by measuring the reliance on the main function, since it is considered the starting point for managing user functions. We define a set of ratios, as shown in Table 4, that measure the actions used in the main function. We thus capture the percentage of usage of keywords, local variables, API calls, and calls to user functions, as well as the ratio between the number of basic blocks of the main function and the number of basic blocks of other user functions. These percentages are computed in relation to the length of the main function, where the length signifies the number of instructions in the function. The results are represented as a vector of ratio values, which is used by the detection component.
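Under our reading of the ratios just described, the resulting vector might be assembled as follows. This is a hypothetical Python sketch: the count keys and helper name are our own, and the caller is assumed to have already counted instructions and basic blocks for the main function.

```python
def code_organization_vector(counts, length, n_user_funcs, bb_main, bb_total):
    """Build the main-function ratio vector (cf. Table 4).
    counts: occurrences of selected instructions in the main function;
    length: number of instructions in main (l); bb_main/bb_total: basic
    block counts of main and of all user functions, respectively."""
    return [
        counts["push"] / length,            # local variables vs. length
        counts["push"] / counts["lea"],     # local variables vs. memory addresses
        counts["lea"] / length,             # memory address locations vs. length
        counts["call"] / length,            # function calls vs. length
        counts["indirect_call"] / length,   # API calls vs. length
        bb_main / bb_total,                 # basic blocks of main vs. all user funcs
        counts["call"] / n_user_funcs,      # calls vs. number of user functions
    ]
```

Each entry corresponds to one row of Table 4, so the vector has a fixed length of seven regardless of the program analyzed.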
14 15 15 16 Table 4 16 17 Features extracted from the main function: length(l): Number of instructions in the main function 17 18 Ratio Equation Description 18 19 # of push / l Ratio of local variables to length 19 20 # of push / # of lea Ratio of local variables to memory address locations 20 21 # of lea / l Ratio of memory address locations to length 21 22 # of calls / l Ratio of function calls to length 22 23 # of indirect calls / l Ratio of API calls to length 23 24 # of BBs / total # of all BBs Ratio of the number of basic blocks of the main function to that of other user functions 24 # of calls / # of user functions Ratio of function calls to the number of user functions 25 25 26 26 27 27 Table 5 28 28 Examples of actions taken in terminating a function 29 29 30 Features 30 31 Printing results to memory Printing results to file 31 32 Using system ("pause") User action such as cin 32 33 Calling user functions Calling API functions 33 Closing files Closing resources 34 34 Freeing memory Flushing buffer 35 35 Using network communication Printing clock time 36 36 Releasing semaphores or locks Printing errors 37 37 38 38 39 2) Function Termination: BINAUTHOR captures how an author terminates a function. This could 39 40 help identify an author since programmers may be used to specific ways of terminating a function. BIN- 40 41 AUTHOR does not only consider the last statement of a function as the terminating instruction; rather, 41 42 it considers the last basic block of the function with its predecessor as the terminating part. This is a 42 43 realistic consideration since various actions may be required before a function terminates. To this end, 43 44 BINAUTHOR not only considers the usual terminating instructions, such as return and exit, but 44 45 also captures other related actions that are taken prior to termination. For instance, a function may be 45 46 46 S. Alrabaee et al. / Authorship Attribution 11

terminated with a display of results, a call to another function, the release of some resources, communication over a network, etc. Table 5 shows examples of what is captured in relation to the termination of a function. Each feature is set to 1 if it is used to terminate a function; otherwise, it is set to 0. The output of this component is a binary vector that is used by the detection component.

3) Keyword and resource preferences: BINAUTHOR captures the author's preference for using different keywords or resources. We consider only groups of such preferences with equivalent or similar functionality, in order to avoid functionality-dependent features. These include keyword-type preferences for inputs (e.g., using cin or scanf), preferences for particular resources, a specific compiler (we identify the compiler using PEiD2), operating system (e.g., Linux), or CPU architecture (e.g., ARM), and the manner in which certain keywords are used, which can serve as a further indication of an author's habits. Some of these features are identified through binary string matching, which tracks the strings annotated to call and mov instructions. For instance, excessive use of fflush will cause the string fflush to appear frequently in the resulting binary.

General Choice Computation: We compute a set of vectors (vgi), where g represents the general category and i represents the sub-category number. To consider the reliance on the main function, a vector vg1, representing the related features, is constructed according to the equations shown in Table 4. These equations indicate the author's reliance on the main function as well as the actions performed by the author. Function termination is represented as a binary vector, vg2, which is determined by the absence or presence of a set of features for function termination.
Keyword and resource preferences are identified through binary string matching, which tracks the annotations to call and mov instructions. For instance, excessive use of fflush will cause the string "_imp_fflush" to appear frequently in the resulting binary. We extract a collection of strings from all user functions of a particular author, then intersect these strings in order to derive a persistent vector (vg3) for that author. Consequently, for each author, a set of vectors representing the author's signature is stored in our repository. Given a target binary, BINAUTHOR constructs the corresponding vectors from the target and measures the distance/similarity between these vectors and those in our repository. The vg1 vector is compared using the Euclidean distance, whereas the vg2 vector is compared using the Jaccard distance. For vg3, the similarity is computed through string matching. Finally, the three derived similarity values are averaged in order to obtain λg, which is later used for the purpose of author classification.

3.2.2. Variable Choices

Developers often have their own habits for defining local and global variables, which may originate from the author's experience or skills. The variable chain has been shown to greatly improve author attribution of source code [18]; it is defined as the variable usage among different functions. Inspired by this work, we introduce a register chain to capture authors' habits in using variables. We define a register chain as the states of a particular register's use through all the basic blocks in a user function. To avoid compilation-setting effects, we normalize the registers to general names, such as Reg1, Reg2, etc., and keep their occurrence order.
Useful characteristics of such chains include the longest chain, the shortest chain, the number of existing chains, the liveness of registers among basic blocks, etc.

2 https://www.aldeid.com/wiki/PEiD
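The register-chain construction described above can be sketched as follows (illustrative only; real input would be basic blocks recovered from a CFG by a disassembler):

```python
# Sketch of register-chain extraction: normalize register names and record,
# per normalized register, the basic blocks in which it is alive.

def register_chains(basic_blocks):
    """basic_blocks: list of basic blocks, each a list of register names
    in order of first use within the block."""
    mapping, liveness = {}, {}
    for bb_index, regs in enumerate(basic_blocks, start=1):
        for reg in regs:
            # Normalize to Reg1, Reg2, ... in first-occurrence order, so that
            # compiler register allocation does not leak into the feature.
            name = mapping.setdefault(reg, f"Reg{len(mapping) + 1}")
            liveness.setdefault(name, []).append(f"BB{bb_index}")
    # Keep each basic block once, preserving order of appearance.
    return {r: sorted(set(bbs), key=bbs.index) for r, bbs in liveness.items()}

# The three blocks of the RC4 example below: BB1 uses ecx/ebp, BB2 uses
# esi/ebx, and BB3 uses ebp/edx/al.
cfg = [["ecx", "ebp"], ["esi", "ebx"], ["ebp", "edx", "al"]]
chains = register_chains(cfg)
print(chains["Reg2"])   # ebp is alive in BB1 and BB3: ['BB1', 'BB3']
```

Running this on the Citadel RC4 fragment reproduces the liveness sets summarized in Table 6.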

Example: In what follows, we illustrate how a register chain is extracted. Part of the Control Flow Graph (CFG) of the RC4 function in Citadel is shown in Figure 3(a). Figure 3(b) shows the construction of the register liveness [19] for the indicated registers. As illustrated in Figure 3(b), the used registers ecx, ebp, esi, ebx, edx, and al are normalized to Reg1, ..., Reg6. The first, second, and third basic blocks manipulate the ⟨Reg1, Reg2⟩, ⟨Reg3, Reg4⟩, and ⟨Reg2, Reg5, Reg6⟩ registers, respectively. BINAUTHOR captures the register liveness by storing the set of basic blocks where the register is alive. For instance, the Reg2 register appears in the first and third basic blocks and does not appear in the second basic block, so we represent the liveness of the Reg2 register as {BB1, BB3}. A summary of the register liveness is given in Table 6.

[Figure 3 omitted: an assembly listing of the CFG fragment and the corresponding register-chain diagram.]

Fig. 3. (a) Part of the CFG of RC4 (b) Register chain

Table 6
Register liveness (✓ indicates that the register is alive in a BB)

Register    BB1    BB2    BB3
Reg1        ✓      -      -
Reg2        ✓      -      ✓
Reg3        -      ✓      -
Reg4        -      ✓      -
Reg5        -      -      ✓
Reg6        -      -      ✓

Variable Choice Computation: Since each function may have a large number of register chains, BINAUTHOR employs locality-sensitive hashing (LSH) for feature reduction. The hash is calculated over all sets of chains, and only those with similar hash values are clustered (hashed) to the same bucket; in particular, similar register chains will be hashed to the same bucket. Once the register chains have been hashed to their corresponding buckets, any bucket containing more than one similar hash value is identified, and a list of candidate register chains is extracted. Finally, similarity analysis is performed to rank the candidate pairs obtained from the previous steps. The similarity score obtained from this choice is λv.
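The bucketing step can be sketched with a MinHash-style signature over chain shingles. This is an assumption for illustration; the paper states only that locality-sensitive hashing is used, not which hash family.

```python
# Sketch of LSH bucketing of register chains (illustrative hash family).
from collections import defaultdict

def shingles(chain, k=2):
    sh = {tuple(chain[i:i + k]) for i in range(len(chain) - k + 1)}
    return sh or {tuple(chain)}          # guard for chains shorter than k

def signature(chain, num_hashes=4):
    # A cheap salted-hash family; adequate for a sketch.
    return tuple(min(hash((salt, s)) for s in shingles(chain))
                 for salt in range(num_hashes))

def candidate_buckets(chains):
    buckets = defaultdict(list)
    for chain in chains:
        buckets[signature(chain)].append(chain)
    # Buckets holding more than one chain yield the candidate pairs that
    # are later ranked by exact similarity analysis.
    return [group for group in buckets.values() if len(group) > 1]

chains = [("BB1", "BB2", "BB3"),
          ("BB1", "BB2", "BB3"),     # an identical chain from another function
          ("BB4", "BB5")]
print(candidate_buckets(chains))
```

Only the chains that collide in a bucket proceed to the exact (and more expensive) similarity ranking, which is what makes the reduction worthwhile.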

3.2.3. Quality-Related Choices

We investigate code quality in terms of compliance with C/C++ coding standards and security concerns. In the literature, code quality can be measured with different indicators, such as testability, flexibility, adaptability [20], etc. BINAUTHOR defines rules for capturing code that exhibits either relatively high or low quality. Any code that cannot be matched using such rules is labeled as regular quality, which indicates that the code quality feature is not applicable. Such rules are extracted by defining a set of signatures (sequences of instructions) for each choice.

Rules: Examples of low-quality coding styles include reopening already opened files, leaving files open when they are no longer in use, attempting to modify constants (e.g., through pointers), using float variables as loop counters, and declaring variables inside a switch statement, which can result in unexpected/undefined behavior due to jumped-over instructions. Examples of high-quality coding styles include handling errors generated by library calls (e.g., examining the value returned by fclose()), avoiding reliance on side effects (e.g., the ++ operator) within particular calls such as sizeof or _Alignof, avoiding particular calls in some environments or using them with protective measures (e.g., the use of system() in Linux may lead to shell command injection or privilege escalation; hence, using execve() instead is indicative of high-quality coding), and the use of locks and semaphores around critical sections.

Quality-Related Choice Computation: We build a set of idiom templates to describe high- or low-quality habits. Idioms are sequences of instructions with wild-card possibilities [21]. We employ the idiom templates of [21] according to our quality-related choices. In addition, such templates carry a meaningful connection to the quality-related choices, and our experiments demonstrate that they may effectively capture quality-related habits. BINAUTHOR uses the Levenshtein distance [22] for this computation due to its efficiency. The similarity is represented by λq, which is used for the author classification purpose:

    λq = 1 − L(Ci, Cj) / max(|Ci|, |Cj|)

where L(Ci, Cj) is the Levenshtein distance between the quality-related choices Ci and Cj (sequences of instructions), and max(|Ci|, |Cj|) returns the maximum length of the two choices Ci and Cj in terms of characters.

3.2.4. Embedded Choices

We define embedded choices as actions related to coding habits present in the source code that are not easily captured at the binary level by traditional features such as strings or graphs. For instance, initializing member variables in constructors and dynamically deleting allocated resources in destructors are examples of embedded choices. As it is not feasible to list all possible features, BINAUTHOR relies on the fact that opcodes reveal actions, expertise, habits, knowledge, and other author characteristics, and analyzes the distribution of opcode frequencies. Our experiments show that such a distribution can effectively capture the manner in which the author manages the code. As every single action in source code can affect the frequency of opcodes, BINAUTHOR targets embedded choices by capturing the distribution of opcode frequencies.
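Such a distribution, and its comparison with the Mahalanobis distance used in the embedded-choice computation below, can be sketched as follows. This is not the authors' code: the opcode vocabulary is a toy example, and the covariance is estimated here from the author's own functions.

```python
# Sketch of the embedded-choice features: per-function opcode-frequency
# histograms compared with the Mahalanobis distance.
import numpy as np

OPCODES = ["mov", "push", "pop", "call", "cmp", "jmp", "add"]  # toy vocabulary

def opcode_histogram(instructions):
    counts = np.array([instructions.count(op) for op in OPCODES], dtype=float)
    return counts / max(counts.sum(), 1.0)     # relative frequencies

def mahalanobis(x, y, cov):
    # The pseudo-inverse keeps the sketch stable when the covariance is
    # singular (few sample functions, many opcodes).
    vi = np.linalg.pinv(cov)
    d = x - y
    return float(np.sqrt(max(float(d @ vi @ d), 0.0)))

funcs = [["mov", "mov", "push", "call", "cmp"],
         ["mov", "push", "pop", "call", "jmp"],
         ["mov", "mov", "add", "call", "cmp"]]
hists = np.array([opcode_histogram(f) for f in funcs])
cov = np.cov(hists, rowvar=False)              # estimated over the author's functions
print(mahalanobis(hists[0], hists[1], cov))
```

Unlike the Euclidean distance, the covariance term lets the comparison account for correlations between opcode frequencies, which is the property the embedded choice relies on.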

Embedded Choice Computation: To measure the distance between distributions of opcode frequencies, the Mahalanobis distance [23] is used to measure the similarity of opcode distributions among different user functions. The Mahalanobis distance is chosen because it can capture the correlation between opcode frequency distributions, and this correlation represents the embedded choices. The similarity returned is represented by λe.

3.2.5. Structural Choices

Programmers usually develop their own habits in terms of the structural design of an application. They may prefer a fully object-oriented design, or they may be more accustomed to procedural programming. Such structural choices can serve as features for author identification. To avoid functionality, we consider the common subgraphs and the longest path for each user function, and then intersect them among different user functions. These subgraphs are defined as k-graphs, where k is the number of nodes. The common k-graphs form author signatures, since these graphs always appear regardless of the program functionality. In addition, we consider the longest path, since it reflects how an author tends to use deep or nested loops.

Example: An author may organize different classes in an ad hoc manner, or organize them hierarchically by designing a driver class to contain several manager classes, where each manager is responsible for a different process (a collection of threads running in parallel). Both ad hoc and hierarchical organizations will create a common structure in the author's programs.

Structural Choice Computation: BINAUTHOR uses subgraphs of size k in order to capture structural choices (k = 4, 5, and 6 in our experiments). Given a k-graph, the graph is transformed into strings using the Bliss open-source toolkit [24]. Then, a similarity measurement is performed over these strings using the normalized compression distance (NCD) [25], which enhances search performance. We chose NCD since it allows us to concatenate all the common subgraphs that appear in an author's programs; additionally, it allows for inexact matching between the target subgraphs and the training subgraphs. BINAUTHOR forms a signature based on these strings. The similarity obtained from this choice is represented by λs.

3.3. Discussion

Our main intuition is based on the hypothesis that the coding characteristics of a programmer/developer result from their educational background, level of expertise, and coding traits and preferences. Accordingly, we defined a set of coding habits based on the following intuitions for each category.

General Choice: Developers may use different keywords according to their preferences, experience, and knowledge. For instance, the data types operated on by an author could reveal the data access patterns used by that author. Moreover, the use of specific resources (e.g., files over memory) or the manner of terminating a function could reveal authors' habits.

Structural Choice: This choice captures the author's habits in organizing and structuring the code: for instance, creating deeply nested code, unusually long functions, or long chains of assignments; some authors prefer an object-oriented style, while others produce spaghetti code. Moreover, it captures the use of API calls and recursion. We believe this habit derives from the author's experience.

Variable Choice: This choice captures the method by which variables are manipulated, such as identifying aliases of the stack pointer. Further, developers often have their own habits for defining local and

global variables, which may originate from the author's experience or skills. For example, PC-relative addressing often represents access to a global variable, while scaled indexing often represents access to an array. Moreover, the variable chain has been shown to greatly improve author attribution of source code [1].

Quality-Related Choice: One of the characteristics that can reflect an author's expertise and knowledge is code quality. Therefore, we investigate code quality in terms of compliance with C/C++ coding standards and security concerns.

Embedded Choice: We define embedded choices to capture the actions related to coding habits (present in the source code) that are not easily captured at the binary level. This choice can reveal several characteristics, such as expertise, habits, and knowledge. As every single action in source code can affect the frequency of opcodes, BINAUTHOR targets embedded choices by capturing the distribution of opcode frequencies.

3.4. Classification

As previously described, BinAuthor extracts different types of choices to characterize different aspects of an author's coding habits. These choices do not contribute equally to the attribution process, since the significance of the indicators is not identical. Consequently, a weight is assigned to each choice by applying logistic regression to each choice individually in order to predict class probabilities (e.g., the probability of identifying an author). For this purpose, we use the dataset introduced in Section 4.2. To prevent overfitting, we test each dataset individually and then compute the average of the weights. The probability outcomes of the logistic regression prediction are shown in Table 7. We calculate the weights as follows:

    wi = rnd( (pi/ps) / Σ(i=1..4) (pi/ps) )

where ps is the smallest probability value (e.g., 0.39 in Table 7), pi is the probability outcome from the logistic regression of each choice, and the rnd function rounds the final value to two decimal places. For instance, the weight for the general choice is calculated as 2.128205/6.076923 ≈ 0.35.

Table 7
Logistic regression weights for choices

Choice        Probability (pi)   pi/(ps = 0.39)   Weight wi
General       0.83               2.128205         0.35
Qualitative   0.63               1.615385         0.27
Structural    0.52               1.333333         0.22
Embedded      0.39               1.000000         0.16
              Σ(i=1..4) (pi/ps) = 6.076923

After extracting the features, we define a probability value P based on the obtained weights. The author attribution (A) probability is defined as follows:

    P(A) = Σ(i=1..4) wi · λi

where wi represents the weight assigned to each choice, as shown in Table 7, and λi is the distance-metric value obtained for each choice (λg, λq, λe, and λs), as described in Section 3.2. We normalize the probabilities of all authors, and if P > ζ, where ζ represents a predefined threshold, then the author is labeled as a matched author. Through our experiments, we find that the best value of ζ is 0.87. If more than one author has a probability larger than the threshold value, then BinAuthor returns the set of those authors.

4. Evaluation

4.1. Implementation Setup

The described stylistic choices are implemented as separate Python scripts for modularity, which together form our analytical system. A subset of the Python scripts in the BINAUTHOR system is used in tandem with the IDA Pro disassembler. The final set of framework scripts performs the bulk of the choice-analysis functions that compute and display critical information about an author's coding style. On top of the analysis framework, a graph database is utilized to perform complex graph operations such as k-graph extraction; the graph database chosen for this task is Neo4j [26]. Gephi [27] is employed for the graph analysis functions that are not provided by Neo4j. A MongoDB database is used to store our features, for efficiency and scalability.

4.2. Dataset

Our dataset consists of several C/C++ applications from different sources, as described below: (i) GitHub [7], where a considerable number of real open-source projects are available; (ii) Google Code Jam [8], an international programming competition, for which solutions to difficult algorithmic puzzles are available; (iii) Planet Source Code [9], a web-based service that offers a large amount of source code written in different programming languages; and (iv) Graduate Student Projects at our institution. Statistics about the dataset are provided in Table 8.

Table 8
Statistics about the dataset used in the evaluation of BINAUTHOR

Source                      # of authors   # of programs   # of functions   Average # of code lines   # of files
GitHub                      5              10              400              5000                      754
Google Code Jam             50             250             650              80                        250
Planet Source Code          44             168             580              250                       400
Graduate Student Projects   25             125             875              250                       450

We compile the source code with different compilers and compilation settings to measure the effects of such variations. We use the GNU Compiler Collection (version

4.8.5) with different optimization levels, as well as Microsoft Visual Studio (VS) 2010. We study the impact of the Clang and ICC compilers in Section 4.11.

We perform two sets of experiments. In the first phase, we use a dataset of different authors implementing the same functionality. The purpose of this experiment is to ensure that we are not attributing functionality: it shows that our tool separates binaries according to authorship rather than functionality. In the second phase, we verify that our tool attributes a binary to its author: we collect binaries for each author across different versions and variant functionalities. In total, we test 800 authors from the different sets, where each author has two to five software applications, for a total of 3150 programs.

4.3. Evaluation Metrics

In our experimental setup, we split the collected program binaries into ten sets, reserving one as the testing set and using the remaining nine as the training set. To evaluate BINAUTHOR and to compare it with existing methods, the precision (P) and recall (R) measures are applied:

    Precision = TP / (TP + FP),    Recall = TP / (TP + FN)

where true positive (TP) indicates the number of relevant authors that are correctly retrieved; true negative (TN) the number of irrelevant authors that are not detected; false positive (FP) the number of irrelevant authors that are incorrectly detected; and false negative (FN) the number of relevant authors that are not detected. We choose F0.5 in order to identify the author precisely: this metric is more sensitive to false positives than to false negatives, so precision is given a higher priority than recall. We employ the F-measure as follows:

    F0.5 = 1.25 · (P · R) / (0.25 · P + R)

4.4. Accuracy

We compare BINAUTHOR with the existing authorship attribution methods [2–4]. The source code and dataset of our previous work, OBA2 [2], are available; it performs authorship attribution on a small scale of 5 authors with 10 programs each. The source code of the two other approaches, presented by Caliskan-Islam et al. [3] and Rosenblum et al. [4], is available at [28] and [29], respectively. Caliskan-Islam et al. and Rosenblum et al. present the largest-scale evaluations of binary authorship attribution, covering 600 authors with 8 training programs per author, and 190 authors with at least 8 training programs, respectively. However, since the corresponding datasets are not available, we compare BinAuthor with these methods using the datasets listed in Table 8.

Figure 4 details the results of comparing the accuracy of BINAUTHOR with that of all the other existing methods. It shows the relationship between the accuracy (F0.5) and the number of authors present in all datasets, where the accuracy decreases as the author population increases. The results show that BINAUTHOR achieves better accuracy in determining the author of binaries. Taking all four approaches into consideration, the highest accuracy of authorship attribution is close to 96%, on Google Code Jam with fewer than 50 authors, while the lowest accuracy is 22%, when 150 authors are involved across all datasets together. We believe that the reason behind the superiority of the Caliskan-Islam et al. approach on Google Code Jam is that this dataset is simple and can easily be decompiled to source code. BINAUTHOR also

identifies the author of the GitHub dataset with an accuracy of 90%. The main reason for this is the fact that the authors of GitHub projects have no restrictions when developing projects; in addition, the advanced programmers of such projects usually design their own classes or templates to be used across projects. The lowest accuracy obtained by BINAUTHOR is approximately 75%, on the Graduate Student Projects dataset with 25 authors. This is explained by the fact that programs in the Graduate Student Projects share common choices among different students due to assignment rules, which force students to change or restrict their habits accordingly. When the number of authors is 140 on the mixed dataset, the accuracy of the Rosenblum et al., Caliskan-Islam et al., and OBA2 approaches drops rapidly to 30% on all datasets, whereas our system's accuracy remains greater than 75%. This provides evidence for the stability of using coding habits in identifying authors. In total, the different categories of choices achieve an average accuracy of 84% for ten distinct authors and 75% when discriminating among 152 authors. These results show that author habits may survive the compilation process.

[Figure 4 omitted: F0.5 accuracy curves versus the number of authors for each method on each dataset.]

Fig. 4. Accuracy results of authorship attribution obtained by BINAUTHOR, Caliskan-Islam et al. [3], Rosenblum et al. [4], and OBA2 [2], on (a) GitHub, (b) Google Code Jam, (c) Planet Source Code, (d) Graduate Student Projects, and (e) all datasets

Accuracy Interpretation.
In addition to superior accuracy, our results demonstrate the following advantages of BINAUTHOR:

1) Feature Pre-processing: We have found that, in the existing methods, the top-ranked features are related to the compiler (e.g., stack-frame setup operations). It is thus necessary to filter irrelevant functions (e.g., compiler functions) in order to better identify author-related portions of code. To this end, we utilize a more elaborate filtration method to eliminate compiler effects and to label library, compiler, and open-source-software-related functions. Successfully distinguishing between these functions leads to considerable time savings and helps shift the focus of analysis to more relevant functions.
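As an illustration of such a filtration step, the sketch below labels functions by name patterns and size. The prefixes and the size threshold are assumptions for illustration only, not BINAUTHOR's actual filter.

```python
# Illustrative filtration sketch: drop compiler- and library-related
# functions before stylistic analysis. Patterns below are hypothetical.

COMPILER_PREFIXES = ("__security", "_RTC_", "__scrt", "_CRT")
LIBRARY_PREFIXES = ("_imp_", "std::", "operator")

def is_author_function(name, instructions):
    if name.startswith(COMPILER_PREFIXES + LIBRARY_PREFIXES):
        return False
    # Tiny thunks (e.g., jump stubs) carry little stylistic signal either.
    return len(instructions) > 3

funcs = {"main": ["push"] * 10,
         "__security_init_cookie": ["mov"] * 8,
         "sub_401200": ["mov", "call", "cmp", "jne", "ret"]}
user = [name for name, ins in funcs.items() if is_author_function(name, ins)]
print(user)   # ['main', 'sub_401200']
```

Only the surviving functions would be passed on to the choice-extraction components.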

2) Application Type: We find that the accuracy of the existing methods [2, 4] depends heavily on the application's domain. For example, in Figure 4, a good level of accuracy is observed for the Graduate Student Projects dataset, where the average accuracy is 75%. One explanation is that Rosenblum et al.'s approach extracts LibCalls, which are more useful for code written in academia than in other contexts; this can be explained by students' choice to rely systematically on external libraries and to use MFC APIs, for example. The results also show that OBA2 [2] is application-dependent, as this approach extracts the manner in which the author handles branches. For instance, its accuracy drops from 92% to 57% when Google Code Jam is used. After investigating the source code, we noticed that the number of branches is not high, which makes attribution even more difficult. Moreover, the manner in which branches are handled becomes less distinct with larger numbers of authors.

3) Source of Features: Existing methods use disassemblers and decompilers to extract features from binaries. Caliskan-Islam et al. [3] use a decompiler to translate the program into C-like pseudo-code via Hex-Rays [30]. They pass the code to a fuzzy parser for C, thus obtaining an abstract syntax tree from which features can be extracted. In addition to the limitations of Hex-Rays [30], the C-like pseudo-code differs from the original code to the extent that the variables, branches, and keywords are different. For instance, we found a function in the source code consisting of the following keywords: (1-do, 1-switch, 3-case, 3-break, 2-while, 1-if), with 2 variables. When we checked the same function after decompiling its binary, we found that it consists of the following keywords: (1-do, 1-else/if, 2-goto, 2-while, 4-if), with 4 variables.
This will evidently lead to misleading features, thus increasing the rate of false positives.

4.5. False Positives

We investigate the false positives in order to understand the situations in which BINAUTHOR is likely to make incorrect attribution decisions. Figure 5 shows the relationship between the false positives and the number of authors in the repository. For this experiment, we consider 4 programs for each author. For instance, when we have 50 authors (4 × 50 = 200 programs), BINAUTHOR misclassifies 16 programs; when the number of authors is 1100 (4 × 1100 = 4400 programs), the number of false positives is 402. These false positives are shown in Figure 5(a). We also show the false positives for each dataset in Figure 5(b). The false positive rate for the student dataset is clearly the highest; we believe the reason is that each student must follow standard coding instructions, which restricts their individual habits.

4.6. Identifying the Presence of Multiple Authors

In the previous section, we showed how our system identifies an author among a set of authors. In this section, we are interested in determining whether our system is capable of identifying the presence of multiple authors. We therefore extend our system to an open-world setting by clustering functions that are written by the same author.

We choose a common measure of cluster agreement known as Adjusted Mutual Information (AMI) [31] due to its stability across different numbers of clusters, which facilitates the comparison of different datasets [31]. For this measure, we use values within the range [0, 1], where higher scores indicate better cluster agreement. We manually collect a set of GitHub projects and check the contributors who have written the code of these projects.
We limit our system to programs written in C/C++, and

we ignore authors whose contribution is merely adding lines.

[Figure 5 omitted: the number of false positives versus (a) the number of authors and (b) the dataset type.]

Fig. 5. False positive analysis

For this purpose, we collect 65 projects to which 50 authors contributed. In what follows, we briefly describe the authorship clustering algorithm. We randomly select N authors from the GitHub projects and use large-margin nearest neighbors to learn a distance metric over the feature space [32]. We then randomly select 10, 20, 30, 40, and 50 different authors, and cluster their programs using k-means, with and without transforming the data with the learned metric. Due to the randomness in our experiment (in both the dataset and the k-means clustering algorithm), we repeat the experiment 10 times and compute the average AMI. Figure 6 depicts the clustering results.

[Figure 6 omitted: the correct cluster rate versus the number of (a) test and (b) training authors.]

Fig. 6. Clustering with metric learning. (a) The correct cluster rate (b) The effect of training size on the correct cluster rate

Figure 6(a) shows the correct cluster rate, and Figure 6(b) shows the effect of the training size on the clustering results: using more training authors leads to greater improvement. We conclude that stylistic information derived from one set of authors can be transferred to improve the clustering of programs written by a different set of authors.

4.7. Impact of Training and Testing Set Sizes

Figure 7 shows the impact of the training set size. The number of files per author is a factor distinct from the number of functions, although the two are highly correlated; the former appears to be a better predictor of performance in our experiments. One possible reason involves splitting files into functions, which allows us to capture the variance in an author's habits. Two points are evident from the graph.
First, accuracy is poor with fewer than 8 training files for the unusual choice. This is because this choice appears very seldom in programs written by the same author. This choice serves as a unique

S. Alrabaee et al. / Authorship Attribution 21

Fig. 7. Effect of training set size: precision vs. number of files per author, for all choices together and for the embedded, general, variable, quality, and structural choices individually.

signature for that author if it exists. Second, all choices together yield high accuracy even with a low number of training files. The results suggest that about 10 files are needed to attribute an author with an accuracy of 82%. This number was found through our experiments: we notice that 10 files allow our system to identify the consistent and common habits in an author's files, and we consider 10 files a minimum requirement to attribute an author with an accuracy of at least 80%. As the number of files increases, so does the accuracy our tool can achieve.

Figure 8 displays the results of identifying a target author when the number of his/her related functions varies. All choices together require 6 functions of the target author in order to identify him/her with an accuracy of 88%, whereas the unusual choice identifies the author with an accuracy of only 38% even when the number of functions is 10. These results are indicative of the ability of BINAUTHOR to match an author based on only a few related functions. Some functions do not have enough information to capture author-related habits and may thus be ignored.

Fig. 8. Effect of test set size: precision vs. number of functions per target author, for all choices together and for the embedded, general, variable, quality, and structural choices individually.

Our results indicate that while the identifiability of small numbers of functions is slightly reduced, we still achieve 75% accuracy on average with only two functions. Furthermore, the curves appear to reach an asymptote at around 6 functions per target file, with an accuracy of roughly 80%. We believe that this accuracy is a lower bound on the performance that could be achieved in practice, where more functions are usually available.
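As a toy illustration of why more files help, the attribution step can be sketched as a nearest-profile match, where an author's profile averages habit features over his/her training files. All author names and feature values below are hypothetical, not BINAUTHOR's actual features:

```python
def profile(files):
    """Average per-file habit feature vectors into one author profile."""
    return tuple(sum(d) / len(files) for d in zip(*files))

def attribute(target_files, profiles):
    """Attribute the target to the author whose profile is nearest
    (squared Euclidean distance) to the averaged target features."""
    target = profile(target_files)
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(target, p))
    return min(profiles, key=lambda name: dist(profiles[name]))

# Hypothetical per-file habit features (e.g., rates of two stylistic choices).
profiles = {
    "author1": profile([(0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]),
    "author2": profile([(0.1, 0.9), (0.2, 0.8)]),
}
# A single noisy file can mislead the match...
print(attribute([(0.4, 0.6)], profiles))
# ...while averaging over several files recovers the author's habits.
print(attribute([(0.4, 0.6), (0.9, 0.1), (0.95, 0.05)], profiles))
```

Averaging over files smooths out per-file noise, which matches the observation that accuracy grows with the number of training files.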

4.8. Scalability

Security analysts or reverse engineers may be interested in performing large-scale author identification, and in the case of malware, an analyst may have to deal with an extremely large number of new samples on a daily basis. With this in mind, we evaluate how well BINAUTHOR scales. To prepare a large dataset for the purpose of large-scale authorship attribution, we obtained programs from three sources: Google Code Jam, GitHub, and Planet Source Code. We eliminated from the experiment programs that could not be compiled because they contain bugs, as well as those written by authors who contributed only one or two programs. The resulting dataset comprised 103,800 programs by 23,000 authors: 60% from Google Code Jam, 25% from Planet Source Code, and 15% from GitHub. We modified the script³ used in [3] to download all the code submitted to the Google Code Jam competition; the programs from the other two sources were downloaded manually. The programs were compiled with the Visual Studio and gcc compilers, using the same settings as in our previous investigations. The experiment evaluated how well the top-weighted choices represent author habits. The results of the large-scale author identification are shown in Figure 9.
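The idea of scoring candidate authors by their top-weighted choices can be sketched as follows. The choice weights and repository contents below are hypothetical; BINAUTHOR adjusts its actual weights per dataset:

```python
# Hypothetical weights for the functionality-independent choice categories.
WEIGHTS = {"general": 0.25, "qualitative": 0.20, "embedded": 0.30,
           "variable": 0.15, "structural": 0.10}

def score(author_choices, target_choices):
    """Sum the weights of the choice categories shared with the target."""
    return sum(WEIGHTS[c] for c in author_choices & target_choices)

def identify(repository, target_choices):
    """Return the repository author with the highest weighted score."""
    return max(repository, key=lambda a: score(repository[a], target_choices))

repository = {
    "alice": {"general", "embedded", "variable"},
    "bob": {"qualitative", "structural"},
}
print(identify(repository, {"embedded", "variable", "structural"}))
```

Weighting lets rarer, more author-specific categories dominate the score, which is why precision degrades gracefully as the candidate pool grows.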

Fig. 9. Large-scale author attribution: precision vs. number of authors (×10⁴).

The figure shows the precision with which BINAUTHOR identifies the author; its scaling behavior as the number of authors increases is satisfactory. An author is identified among almost 4,000 authors with 72% precision. When the number of authors is doubled to 8,000, the precision is close to 65%, and it remains nearly constant (49%) after the number of authors reaches 19,000. Additionally, we tested BINAUTHOR on the programs from each of the sources. We found high precision (88%) for samples from the GitHub dataset, 82% precision for samples from the Planet dataset, and low precision (51%) for samples from Google Code Jam. After analyzing these results, we found that authors who wrote code for difficult tasks are easier to attribute than those who wrote code for easier tasks.

³ https://github.com/calaylin/CodeStylometry/tree/master/Corpus

4.9. Impact of Choices

We study the impact of each choice on precision, as depicted in Figure 10. For example, when the number of authors is 15,000, BINAUTHOR achieves a precision of 70% based on the use of variables, while with structural considerations it achieves a precision of 50%, the lowest of all the choices. When the number of authors reaches 21,000, the precision for embedded, general, quality, variable, and structural

choices is 56%, 45%, 49%, 60%, and 40%, respectively. From Figure 10, it seems reasonable to expect that when the number of authors exceeds 20,000, there will be little additional change in the precision. According to this experiment, the precision of the attributions made based on the choices extracted and stored in the BINAUTHOR repository will differ depending on the dataset, as presented in Figure 11.

Fig. 10. The effect of choices on large-scale author identification: precision vs. number of authors (10³), for the embedded, general, quality, variable, and structural choices.

As mentioned earlier, BINAUTHOR weights a set of functionality-independent choices, and it can adjust the weights according to the dataset type. Hence, we show the percentage of choices that BINAUTHOR finds in each dataset, as illustrated in Figure 11. For instance, the percentage of quality choices found in the Google Code dataset is 6%.

Fig. 11. The percentage of choices (general, quality, variable, structural, embedded) found in the large-scale datasets: (a) Google dataset, (b) Planet dataset, (c) GitHub dataset.

4.10. Impact of Compilers

To investigate the impact of different compilers, we studied three models: BINAUTHOR; BINAUTHOR1, which combines BINAUTHOR with BINCOMP [12], a tool for compiler identification; and BINAUTHOR2, which combines BINAUTHOR with the method proposed in GITZ [13] for lifting assembly instructions to canonical form.

Model I: BINAUTHOR. Specifically, we are interested in studying the impact of different compilers and compilation settings on the features used by BINAUTHOR and, consequently, on its accuracy. To prepare the required datasets, we collected 5 programs from each of 50 authors to use as training sets. We then compiled the source code with the VS, GCC, ICC, and Clang compilers and various optimization

31 (a) (b) (c) 31 32 32 Fig. 11. The percentage of choices found in large-scale datasets 33 33 34 34 35 4.10. Impact of Compilers 35 36 36 37 37 To investigate the impact of different compilers, we studied three models: BINAUTHOR;BINAUTHOR1, 38 38 which combines BINAUTHOR with BINCOMP [12], a tool for compiler identification; and BINAUTHOR2, 39 which combines BINAUTHOR with the adopted proposed method in GITZ [13] for lifting assembly in- 39 40 structions to canonical form. 40 41 41 42 Model I: BINAUTHOR. Specifically, we are interested in studying the impact of different compilers 42 43 and compilation settings on the features used by BINAUTHOR and consequently on its accuracy. To pre- 43 44 pare the required datasets, we collected 5 programs from each of 50 authors to use as training sets. We 44 45 then compiled the source code with the VS, GCC, ICC, and Clang compilers and various optimization 45 46 46 24 S. Alrabaee et al. / Authorship Attribution

settings (O0, O2, and Ox flags). To test the effects of different compilers, we kept the binaries compiled with VS as the training set and tried to identify the authors of binaries compiled with the other compilers. The results show that for most optimization levels, coding habits are largely preserved. However, the accuracy drops significantly (from 0.86 to 0.43) when the Clang or the ICC compiler is used, since BINAUTHOR is not trained on ICC and Clang binaries, whereas there is only a slight drop in accuracy (from 0.86 to 0.83) when the GCC compiler is used. We then classified the effect of each compiler as full, partial, or no effect, with a full effect corresponding to a significant difference between choices (the choices extracted from VS binaries are the baseline).

As shown in Table 9, we used the choices extracted from binaries generated by the VS compiler as a reference for comparison with the same choices extracted from binaries generated by the other compilers (GCC, ICC, Clang) to judge the degree of the effect. Some of the choices are robust to compiler effects because compilers change the syntax of code but not its semantics. Since some of the choices used in BINAUTHOR (embedded and qualitative) are based on semantic features, they are not substantially affected. However, we observed that the choices involving variables are affected more significantly, since they rely on the flow of instructions.

Table 9
Effects of different compilers on choices: (●) full effect, (◐) partial effect, (○) no effect.

Choice        VS    GCC   ICC   Clang
General       ○     ○     ◐     ◐
Structural    ○     ○     ◐     ◐
Qualitative   ○     ○     ○     ◐
Embedded      ○     ○     ◐     ◐
Variable      ○     ◐     ●     ●

Model II: BINAUTHOR1.
In light of the impact of compilers on the choices (Table 9), we made minor modifications to the architecture presented in Figure 1; the modified architecture is illustrated in Figure 12.

Fig. 12. The BINAUTHOR1 architecture: BINCOMP identifies the compiler (VS, GCC, ICC, or Clang) of the target file, and BINAUTHOR then consults the corresponding author-habit repository for author identification.

We extract choices from binaries compiled by the four compilers (GCC, VS, Clang, and ICC) and then store them in four corresponding repositories. Given a target binary, we apply our existing tool BINCOMP to identify the source compiler so that BINAUTHOR can use the corresponding repository

to identify the author of the binary. It should be pointed out that BINCOMP is designed to extract the elements comprising binary provenance, including the compiler, its version, optimization levels, and compiler-related functions. The accuracy of this model is 0.86, 0.85, 0.89, and 0.81 for VS, GCC, ICC, and Clang, respectively. This shows that BINAUTHOR can achieve high accuracy regardless of the source compiler when it is combined with a compiler identification tool.
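The BINAUTHOR1 pipeline can be sketched as a simple dispatch: a stand-in for BINCOMP identifies the compiler, and attribution then runs against the matching repository. All repository contents below are made up; in reality BINCOMP infers the compiler from the binary itself:

```python
# Hypothetical per-compiler repositories of author choices.
REPOSITORIES = {
    "vs":  {"alice": {"general", "embedded"}},
    "gcc": {"bob": {"variable", "structural"}},
}

def identify_compiler(binary):
    """Stand-in for BINCOMP: here the provenance is just a tag."""
    return binary["compiler"]

def attribute_author(binary):
    # Consult only the repository built from the identified compiler.
    repo = REPOSITORIES[identify_compiler(binary)]
    return max(repo, key=lambda a: len(repo[a] & binary["choices"]))

target = {"compiler": "gcc", "choices": {"variable", "general"}}
print(attribute_author(target))
```

Dispatching on the compiler first means each repository only ever contains choices extracted under one compiler, avoiding cross-compiler feature drift.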

Fig. 13. The BINAUTHOR2 architecture: binaries are disassembled, lifted by GITZ to LLVM-IR, and canonicalized; choices are then extracted from the canonical instructions, stored in a repository, and checked against target binaries for authorship detection.

Model III: BINAUTHOR2. Because Model II deals only with a limited number of compilers, we designed a new model that can handle the effect of any compiler, as presented in Figure 13. We adapted a recent technique used in GITZ [13], which converts assembly instructions into a semantic representation called the canonical form. Canonicalization is the process of translating binary or assembly code into a unique representation; one of the main benefits of this representation is that it is extremely robust to heavy forms of obfuscation, and the process also preserves the side effects of a function (e.g., flags). The following steps are performed. First, GITZ is used to lift binaries to the intermediate representation LLVM-IR and to canonicalize the instructions; at the same time, this step reoptimizes code that might not have been optimized previously. Next, we apply BINAUTHOR to the canonical instructions to extract the choices; however, the structural choices cannot be extracted from the canonical instructions, which is a limitation of this model. Finally, we build a repository of the extracted choices, which is used for authorship detection. This model, which is compiler-agnostic, achieves an accuracy of 0.88, as shown in Figure 14.
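In the same spirit as GITZ's canonical form (though not its actual algorithm), a toy canonicalizer can rename registers in order of first use and rewrite a known semantically equivalent pair to one form, so that syntactic differences introduced by compilers or obfuscators collapse:

```python
import re

# x86 register names we rename (a small, illustrative subset).
REG = re.compile(r"\b(e[a-d]x|e[sd]i|[re][sb]p|r\d+)\b")

def canonicalize(lines):
    names = {}
    out = []
    for line in lines:
        # `xor r, r` zeroes a register: rewrite it as `mov r, 0`.
        m = re.fullmatch(r"xor (\S+), \1", line)
        if m:
            line = f"mov {m.group(1)}, 0"
        # Rename registers to reg0, reg1, ... in order of first appearance.
        line = REG.sub(lambda r: names.setdefault(r.group(0), f"reg{len(names)}"), line)
        out.append(line)
    return out

# Two syntactically different but semantically identical snippets collapse
# to the same canonical form.
a = canonicalize(["xor eax, eax", "mov ebx, eax"])
b = canonicalize(["mov ecx, 0", "mov edx, ecx"])
print(a == b)
```

A real canonicalizer works on lifted IR and handles far more equivalences, but the effect is the same: features extracted from the canonical form no longer depend on the particular compiler's register allocation or instruction selection.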
However, efficiency is a problem: the time complexity is high, since all the target functions must be converted into canonical form. Model I achieves the highest accuracy since it does not incorporate leveraged tools such as BINCOMP and GITZ, which may generate some false positives; it is also more efficient in terms of the time required for choice extraction and author detection. We compare the three models in terms of processing time (including preprocessing, extraction, and detection), as shown in Figure 15.

4.11. Impact of Evading Techniques

Refactoring Techniques. We consider a random set of 50 files from our dataset, which we use for the C++ refactoring process [33, 34]. We consider the techniques of i) renaming a variable, ii) moving a

Fig. 14. The precision of the three different models (BinAuthor, BinAuthor*, BinAuthor**) for binaries compiled with VS, GCC, Clang, ICC, and LLVM.

Fig. 15. The processing time (s) for the three models (BinAuthor, BinAuthor*, BinAuthor**) when the number of authors is 1000.

Table 10
The evading techniques: methods used, tools used, and their effect on BINAUTHOR choices. (A) means the accuracy before applying the obfuscation method, while (A*) means the accuracy after applying it. (NA) means not applicable.

Tools                     Method                     Input     Output    A → A*
Refactoring [33, 34]      Rename variable            Binary    Binary    85 → 84
                          Move method                Binary    Binary    86 → 82
                          New method                 Binary    Binary    86 → 83
Obfuscator-LLVM [35]      CFG flattening             Binary    Binary    86 → 83
                          Instruction substitution   Binary    Binary    86 → 80
                          CFG bogus                  Binary    Binary    86 → 81
DaLin [36]                Instruction reordering     Assembly  Assembly  86 → 86
                          Dead code insertion        Assembly  Assembly  86 → 86
                          Register renaming          Assembly  Assembly  86 → 86
                          Instruction replacement    Assembly  Assembly  86 → 80
Tigress [37]              Virtualization             Source    Source    86 → 20
                          Jitting                    Source    Source    86 → 0
                          Dynamic                    Source    Source    86 → 10
PElock [38]               Hide procedure call        Assembly  Assembly  86 → 85
                          Insert fake instruction    Assembly  Assembly  86 → 86
                          Prefix junk opcode         Assembly  Assembly  86 → 86
                          Insert junk handlers       Assembly  Assembly  86 → 86
Nynaeve [39]              Frame pointer omission     Source    Binary    86 → 83
                          Function inlining          Source    Binary    86 → 78
OREANS [40]               Encrypt binary             Source    Binary    NA
Gas Obfuscator [41]       Junk byte                  Assembly  Assembly  86 → 86
Designed script           Loop unrolling             Source    Source    86 → 75

method from a superclass to its subclasses, and iii) extracting a few statements and placing them into a new method. We obtain an accuracy of 83.5% in correctly classifying authors, which is only a mild drop compared to the 85% accuracy observed without applying refactoring techniques.

Impact of Obfuscation. We are interested in determining how BINAUTHOR handles simple binary obfuscation techniques intended to evade detection, as implemented by tools such as Obfuscator-LLVM [42]. These obfuscators replace instructions with other, semantically equivalent instructions, introduce spurious control flow, and can even completely flatten control flow graphs. The obfuscation techniques implemented by Obfuscator-LLVM are applied to the samples prior to classifying the authors, and we then extract functionality-independent choices from the obfuscated samples. We obtain an accuracy of 82.9% in correctly classifying authors, which is only a slight drop compared to the 85% accuracy observed without obfuscation.

5. Applying BINAUTHOR to Real Malware Binaries

In this section, we apply BINAUTHOR to different sets of real malware: i) Bunny and Babar; ii) Stuxnet and Flame; and iii) Zeus and Citadel. These malware are selected based on researchers' claims that each of these pairs originates from the same set of authors [6, 10, 11]. Due to the lack of ground truth, we compare BINAUTHOR outputs to the findings in those technical reports. Details about the malware dataset are shown in Table 11. Packed malware binaries are manually unpacked using OllyDbg [43]; for obfuscated binaries, we manually decrypt encrypted functions using a suitable decryption algorithm.
Table 11
Characteristics of malware datasets

Malware   Type   # of binary functions   Source of malware sample
Zeus      PE     557                     Our security laboratory
Citadel   PE     794                     Our security laboratory
Flame     PE     1434                    Contagio [44]
Stuxnet   PE     2154                    Contagio [44]
Bunny     ELF    854                     VirusSign [45]
Babar     ELF    1025                    VirusSign [45]

5.1. Applying BINAUTHOR to Bunny and Babar

Findings. BINAUTHOR is able to find the following coding habits automatically: i) writing config keys in all capital letters in XML, and a preference for using Visual Studio 2008 (general choices); ii) using one variable over a long chain (variable choice); iii) the methods of accessing freed memory, deallocating dynamically allocated resources, and opening the same process in the same function (quality choices); and iv) the common manner in which functions are managed (structural choice).

Statistics. BINAUTHOR finds common functions between the Bunny and Babar malware that share the aforementioned coding habits as follows: 494 functions share qualitative choices; 450 functions share embedded choices; 372 functions share general choices; 278 functions share variable choices; and 127

functions share structural choices. Furthermore, BINAUTHOR finds 340 functions which share 4 choices, 478 functions which share 3 choices, 50 functions which share 2 choices, and 230 functions which share 1 choice.

Summary. We consider clusters based on 3 or more common choices to be written by one author, and those based on 1 or 2 choices to be written by multiple authors. BINAUTHOR can cluster the functions written by one author, in which case the percentage is 44% ((340 + 478) / (854 + 1025)), where 854 and 1025 are the total numbers of binary functions for Bunny and Babar, as shown in Table 11. On the other hand, the percentage of functions written by multiple authors is 23% ((150 + 290) / (854 + 1025)). For the remaining 33% of functions, no common choices are found. Based on our findings, we conclude that Bunny and Babar are written by a set of authors. Our results support the technical report published by Citizen Lab [6], which shows that both malware are written by a set of authors according to common implementation habits (general and qualitative choices) and infrastructure usage (general choice).

5.2. Applying BINAUTHOR to Stuxnet and Flame

Findings. BINAUTHOR automatically finds the following coding habits: i) using global variables, Lua scripts, and the specific open-source package SQLite, and choosing heap sort over other sorting methods (general choices); ii) opening and terminating processes (qualitative choice); iii) recursion patterns and using the POSIX socket API over the BSD socket API (structural choices); and iv) functions that are close in terms of Mahalanobis distance (distances around 0.1), and passing primitive types by value while passing objects by reference (embedded choices).

Statistics.
BINAUTHOR finds common functions that share the aforementioned coding habits as follows: 725 functions share general choices; 528 functions share qualitative choices; 300 functions share embedded choices; and 189 functions share structural choices. Furthermore, BINAUTHOR finds 180 functions which share 4 choices, 294 functions which share 3 choices, 515 functions which share 2 choices, and 689 functions which share 1 choice.

Summary. BINAUTHOR can cluster the functions written by one author, where the percentage is 13% ((180 + 294) / (1434 + 2154)), while 34% ((515 + 689) / (1434 + 2154)) of the functions are written by multiple authors. For the remaining 53% of functions, no common choices are found. Based on our findings, we conclude that Stuxnet and Flame are written by an organization, since they follow specific rules and targets. Our results support the technical report published by Kaspersky [10], which indicates that both malware use a particular infrastructure (e.g., Lua) and are related to an organization.

5.3. Applying BINAUTHOR to Zeus and Citadel

Findings. We apply BINAUTHOR to both binaries and cluster the functions based on functionality-independent choices. BINAUTHOR automatically finds the following coding habits: i) using network resources over file resources, creating configuration mostly via config files, and using specific packages such as webph and UltraVNC (general choices); ii) using switch statements over if statements (structural choice); iii) using semaphores and locks (qualitative choice); and iv) functions that are close in terms of Mahalanobis distance (distance = 0.0004) (embedded choice).
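The embedded-choice comparison above relies on Mahalanobis distance between function feature vectors. A minimal sketch, assuming a diagonal covariance and hypothetical feature values, is:

```python
import math

def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance under a diagonal covariance assumption."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

f1 = [0.50, 1.20, 3.10]   # hypothetical features of a function in one binary
f2 = [0.52, 1.18, 3.05]   # features of a candidate match in the other binary
var = [0.10, 0.20, 0.50]  # per-feature variances estimated over the corpus
d = mahalanobis_diag(f1, f2, var)
print(round(d, 4))        # small distances indicate similar embedded habits
```

Dividing by per-feature variance makes the distance scale-free, so a feature that naturally varies a lot across the corpus does not dominate the comparison.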

Statistics. BINAUTHOR finds common functions that share the aforementioned coding habits as follows: 655 functions share general choices; 452 functions share qualitative choices; 370 functions share embedded choices; and 289 functions share structural choices. Furthermore, BINAUTHOR finds that 258 functions share 4 choices, 194 functions share 3 choices, 588 functions share 2 choices, and 600 functions share 1 choice.

Summary. BINAUTHOR clusters 33% of the functions as written by a single author, while 53% of the functions are written by multiple authors. For the remaining 14% of functions, no common choices are found. Based on our findings, we conclude that Zeus and Citadel are written by a team of authors, as the results show that a large percentage of functions are written by more than one author. Our results match the findings of the technical report published by McAfee [11].

5.4. Verifying the correctness of BINAUTHOR findings

Due to the lack of ground truth, we verify the correctness of the BINAUTHOR findings using the following methods: comparing BINAUTHOR outputs to the findings of human experts in available technical reports [6, 10, 11]; and measuring the distance between the choices in one cluster and the choices in another to calculate the degree of similarity, in order to provide a clear indication of whether the choices are closely related to specific malware packages.

Comparison with technical reports. We compare the BINAUTHOR findings with those made by human experts in technical reports.

• For Bunny and Babar, our results match the technical report published by the Citizen Lab [6], which demonstrates that both malware packages were written by a set of authors according to common implementation traits (general and qualitative choices) and infrastructure usage (general choices).
The correspondence between the BINAUTHOR findings and those in the technical report is as follows: 60% of the choices matched those mentioned in the report, and 40% did not; 10% of the choices found in the technical report were not flagged by BINAUTHOR, as they require dynamic extraction of features, while BINAUTHOR uses a static process.
• For Stuxnet and Flame, our results corroborate the technical report published by Kaspersky [10], which shows that both malware packages use similar infrastructure (e.g., Lua) and are associated with an organization. In addition, the BINAUTHOR findings suggest that both malware packages originated from the same organization. The frequent use of particular qualitative choices, such as the way the code is secured, indicates the use of certain programming standards and strict adherence to the same rules. Moreover, the BINAUTHOR findings provide much more information concerning the authorship of these malware packages. The correspondence between the BINAUTHOR findings and those in the technical report is as follows: all the choices found in the report [10] were found by BINAUTHOR, but they represent only 10% of our findings; the remaining 90% of the BINAUTHOR findings were not flagged by the report.
• For Zeus and Citadel, our results match the findings of the technical report published by McAfee [11], indicating that Zeus and Citadel were written by the same team of authors. The correspondence between the findings of BINAUTHOR and those of McAfee is as follows: 45% of the choices matched those in the report, while 55% did not, and 8% of the technical report findings were not flagged by BINAUTHOR.
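The cluster percentages reported throughout this section are simple ratios of function counts over the two binaries' totals; for Bunny and Babar (function totals from Table 11), the computation is:

```python
def pct(shared, totals):
    """Percentage of all binary functions that fall in the cluster."""
    return round(100 * shared / sum(totals), 1)

bunny_babar = (854, 1025)  # total binary functions (Table 11)
single_author = pct(340 + 478, bunny_babar)  # functions sharing 3-4 choices
multi_author = pct(150 + 290, bunny_babar)   # functions sharing 1-2 choices
print(single_author, multi_author)           # ~44% and ~23% after rounding
```

The same ratio applies to the Stuxnet/Flame and Zeus/Citadel clusters, with the corresponding function totals substituted.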

Table 12
Number of common choices found between malware binaries

          Bunny   Babar   Stuxnet   Flame   Zeus   Citadel
Bunny     -       500     10        2       4      12
Babar     500     -       4         9       0      5
Stuxnet   10      4       -         750     14     3
Flame     2       9       750       -       17     6
Zeus      4       0       14        17      -      670
Citadel   12      5       3         6       670    -

Measuring similarity between choices in malware binaries. In this subsection, the goal is to assess the similarity between malware binaries by reporting statistics about their common choices (Table 12). We observed that there are only ten choices common to Bunny and Stuxnet, which clearly indicates that these malware packages were written by different authors. These choices are found in seven functions, which amounts to (7 / (854 + 2154)) = 0.2% shared author habits. In comparison, there are seventeen choices common to Flame and Zeus, found in thirty-eight functions, so the percentage of shared author habits is (38 / (1434 + 557)) = 2%. The results in Table 12 may provide clues about the validity of the BINAUTHOR findings.

5.5. More examples of findings

In this subsection, we show some examples of the common choices that we find in our malware binaries. One of the good habits that we find is the termination of a process once it is no longer in use, as shown in Listing 1. Moreover, we find an advanced feature (using a mutex), as shown in Listing 2. We also notice a bad habit in a set of functions where the author does not close the process; Listing 3 shows an example.

Listing 1: Qualitative choice: opening a process and terminating it

    call    ds:OpenProcess
    mov     esi, eax
    test    esi, esi
    jz      short loc_41DB97
    push    104h
    lea     eax, [ebp+String1]
    push    eax
    push    esi
    call    sub_42F93F
    cmp     al, 1
    jnz     short loc_41DB94
    push    [ebp+lpExistingFileName]
    lea     eax, [ebp+String1]
    call    sub_42DE63
    test    eax, eax
    jz      short loc_41DB94
    push    0FFFFFFFFh
    push    esi
    call    ds:TerminateProcess
26 27 27 28 28 29 Listing 1: Qualitative choice: opening process and terminating it 29 30 30 call ds:OpenProcess 31 mov esi, eax 31 32 test esi, esi 32 33 jz short loc_41DB97 33 push 104h 34 lea eax, [ebp+String1] 34 35 push eax; 35 36 push esi 36 call sub_42F93F 37 cmp al, 1 37 38 jnz short loc_41DB94 38 39 push [ebp+lpExistingFileName] 39 lea eax, [ebp+String1] 40 40 call sub_42DE63 41 test eax, eax 41 42 jz short loc_41DB94 42 43 push 0FFFFFFFFh 43 push esi 44 call ds:TerminateProcess 44 45 45 46 46 S. Alrabaee et al. / Authorship Attribution 31

Listing 2: Qualitative choice: using a mutex

    push    1
    push    offset MutexAttributes
    call    ds:CreateMutexW
    test    eax, eax
    jz      short loc_4229D

Listing 3: Qualitative choice: opening a process without terminating it

    push    esi             ; hModule
    mov     esi, ds:GetProcAddress
    mov     ds:dword_436A84, eax
    call    esi             ; GetProcAddress
    push    offset aPr_seterror
    push    ds:hModule
    mov     ds:dword_436A80, eax
    call    esi             ; GetProcAddress
    push    offset aPr_geterror
    push    ds:hModule
    mov     ds:dword_436A90, eax
    call    esi             ; GetProcAddress
    mov     ds:dword_436A7C, eax
    pop     esi
    retn    10h

Moreover, we find common sequences of instructions that are repeated among different functions. These sequences are due neither to functionality nor to the compiler. Listing 4 shows an example of such common instructions; we classify this kind of sequence as a general choice. Other examples of general choices are illustrated in Listing 5. We notice that some functions show a preference for using files over memory locations, and for using the command line over a config file for configuration, as shown in Listings 6 and 7, respectively. We also observe in some functions a preference for specific file resources, for instance, using a bitmap, as shown in Listing 8.

Listing 4: General choice: sequence of instructions

    inc     [ebp+var_2]
    mov     al, [ebp+var_2]
    cmp     al, [edi+1]
    jb      loc_430509
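A toy sketch of how such repeated sequences can be surfaced (not BINAUTHOR's actual extraction): count opcode n-grams across disassembled functions and keep those occurring in more than one function. The function bodies below are illustrative fragments:

```python
from collections import Counter

def opcode_ngrams(functions, n=3):
    """Count opcode n-grams over a set of disassembled function bodies."""
    counts = Counter()
    for body in functions:
        ops = [line.split()[0] for line in body]  # keep the mnemonic only
        for i in range(len(ops) - n + 1):
            counts[tuple(ops[i:i + n])] += 1
    return counts

funcs = [
    ["inc [ebp+var_2]", "mov al, [ebp+var_2]", "cmp al, [edi+1]", "jb loc_1"],
    ["push esi", "inc [ebp+var_8]", "mov al, [ebp+var_8]",
     "cmp al, [edi+1]", "jb loc_2"],
]
# Sequences seen in more than one function are candidate general choices.
common = [g for g, c in opcode_ngrams(funcs).items() if c > 1]
print(common)
```

Stripping operands keeps the sequences operand-independent, so the same habit is detected even when it touches different variables or labels.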

Listing 5: General choice: sequence of instructions

    push    offset aChrome_dll
    push    eax
    call    ds:lstrcmpiW
    test    eax, eax
    jnz     short loc_41A766
    movzx   eax, byte ptr [esp+14h+arg_0]
    push    eax
    mov     eax, esi
    call    CRT_sub_415C3B
    jmp     short loc_41A768

Listing 6: General choice: preference for using files over memory

    call    MTX_sub_431FEC
    mov     esi, ds:GetFileAttributesExW
    test    al, al
    jz      short loc_422AC9
    xor     edi, edi
    jmp     short loc_422ABD

Listing 7: General choice: preference for using the command line over a config file

    mov     [ebp+var_1], bl
    call    ds:SetErrorMode
    lea     eax, [ebp+pNumArgs]
    push    eax             ; pNumArgs
    call    ds:GetCommandLineW
    push    eax             ; lpCmdLine
    call    ds:CommandLineToArgvW
    xor     esi, esi
    cmp     eax, esi
    jz      loc_422D33
    xor     edx, edx
    cmp     [ebp+pNumArgs], esi
    jle     short loc_422CF2

Listing 8: General choice: preference for specific file resources

    call    edi             ; GetDeviceCaps
    push    0Ah
    push    [esp+704h+hdc]  ; hdc
    mov     [esi+2Ch], eax
    call    edi             ; GetDeviceCaps
    mov     ecx, [esi+2Ch]
    mov     edi, ds:CreateCompatibleBitmap
    push    eax
    push    ecx             ; cx
    push    [esp+708h+hdc]  ; hdc
    mov     [esi+30h], eax
    call    edi             ; CreateCompatibleBitmap
    cmp     eax, ebx
    jz      short loc_40F528

Regarding the usage of API calls, we find that the Windows shutdown API is used in different functions, as shown in Listing 9. The Win32 API ExitWindowsEx makes an RPC call to CSRSS.EXE, and CSRSS synchronously sends a message to all Windows applications.

Listing 9: Structural choice: using a specific API

    push    80000000h
    push    14h
    call    ds:ExitWindowsEx

6. Related Work

Source Code Authorship. Most previous studies of authorship analysis for general-purpose software have focused on source code [1, 46, 47]. These techniques are typically based on programming styles (e.g., naming of variables and functions, comments, and code organization) and the development environment (e.g., OS, programming language, compiler, and text editor). The features selected by these techniques are highly dependent on the availability of the source code, which is seldom available when dealing with malware binaries.

Binary Code Authorship. Binary code has drawn significantly less attention with respect to authorship attribution. This is mainly because many salient features that may identify an author's style are lost during the compilation process. In [2-4], the authors show that certain stylistic features can indeed survive the compilation process and remain intact in binary code, thus showing that authorship attribution for binary code should be feasible. The methodology developed by Rosenblum et al.
[4] is the first attempt to automatically identify authors of software binaries. The main concept employed by this method is to extract syntax-based features using predefined templates such as idioms, n-grams, and graphlets. A subsequent approach to automatically identify the authorship of software binaries is proposed by Alrabaee et al. [2]. The main concept employed by this method is to extract a sequence of instructions with specific semantics and to construct a graph based on register manipulation. A more recent approach to automatically identify the authorship of software binaries is proposed by Caliskan-Islam et al. [3]. The authors extract syntactical features present in source code from decompiled executable binaries. Most recently, Meng et al. [5] introduce new fine-grained techniques to address the problem of identifying the multiple authors of binary code by determining the author of each basic block. The authors extract syntactic and semantic features at the basic-block level, such as constant values in instructions, backward slices of variables, and the width and depth of a function's control flow graph (CFG).

Table 13 compares our approach with the aforementioned approaches. Please note that the results in the code transformation (CT) section are based on our experiments: when the accuracy dropped by 1–3%, we considered the approach "Not affected"; a drop of 4–14% gives "Partially affected"; and a drop of 15% or more is considered "Affected".

Table 13
Comparing the features, methods, applicability to binaries from other compilers, and resistance to evasion techniques of different existing solutions. (CT) stands for code transformation. (DCI) stands for dead code insertion. (IR) stands for instruction replacement. (IRO) stands for instruction reordering. (RT) stands for refactoring techniques. The symbol (X) indicates that the proposed solution offers the corresponding feature. (●): Not affected by the code transformation method. (○): Affected by the code transformation method. (◐): Partially affected by the code transformation method. (SA) stands for single author.
(MA) stands for multiple authors.

                  Features                                   Compiler              CT                    Binaries
                  Syntax  Semantic  Structural  Statistical  VS  GCC  Clang  ICC   DCI  IR  IRO  RT     ELF  PE    Application
    OBA2 [2]              X                                      X                                      X          SA
    Caliskan [3]          X         X           X                X                 ◐    ◐   ○           X          SA
    Rosenblum [4] X       X         X                            X                 ○    ○   ○           X          SA
    Meng [5]              X         X           X                                                       X          MA
    BINAUTHOR     X       X         X           X            X   X    X      X     ◐    ◐   ◐    ●      X    X     SA/MA

Malware Authorship Attribution. Most existing work on malware authorship attribution relies on manual analysis. In 2013, a technical report published by FireEye [48] discovered that malware binaries share the same digital infrastructure and code; for instance, the use of certificates, executable resources, and development tools. More recently, the team at Citizen Lab (https://citizenlab.org/) attributed malware authors based on a manual analysis of the exploit types found in binaries and the manner in which actions are performed, such as connecting to a command and control server. The authors in [6] presented a novel approach to creating credible links between binaries originating from the same group of authors. Their goal was to add transparency to attribution and to supply analysts with a tool that supports or refutes vendor statements. The technique is based on features derived from different domains, such as implementation details, applied evasion techniques, classical malware traits, or infrastructure attributes, which are leveraged to compare the handwriting among binaries.

7. Limitations and Concluding Remarks

Our work has a few important limitations. First, we assume that the binary code under analysis is unpacked. While this assumption may be reasonable for much general-purpose software, it implies the need for a pre-processing step involving unpacking before applying the method to malware.
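A common coarse pre-check for such an unpacking step is byte-level Shannon entropy, since packed or encrypted sections tend toward near-maximal entropy. The sketch below is illustrative only; the 7.0 threshold and the function names are our own assumptions, not part of BINAUTHOR.

```python
# Minimal sketch of a packed-binary heuristic: compute the Shannon
# entropy of a byte string (0..8 bits per byte) and flag inputs whose
# entropy exceeds a threshold as likely packed. The cut-off of 7.0 is
# an illustrative assumption, not a value from the paper.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    return shannon_entropy(data) > threshold

# Uniformly distributed bytes (all 256 values equally often) reach the
# maximal entropy of 8.0, while repetitive code-like data scores low.
print(looks_packed(bytes(range(256)) * 16))  # prints True
print(looks_packed(b"\x90" * 4096))          # prints False
```

In practice such a heuristic only flags candidates; actual unpacking would still rely on dedicated tools such as the generic unpackers cited in [90, 91].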
Second, our tool cannot handle most advanced obfuscation techniques, such as virtualization and jitting, since our system does not deal with bytecode. Third, although we have tested our work on various datasets, our dataset is still limited in terms of scale and scope. Fourth, our work yields lower accuracy when dealing with code compiled by ICC or Clang. Fifth, the features used by BINAUTHOR are based on static analysis; as such, BINAUTHOR does not detect habits that require dynamic features. Finally, BINAUTHOR currently supports the x86 ISA; other ISAs, such as ARM, should also be considered.

Our future work aims to extend BINAUTHOR to tackle these limitations. We point out four main avenues for future research in terms of improving our system. First, we suggest investigating more principled and robust cognition models for program understanding and for extending binary choices. Second, we will seek to extend our system to include visualizations of functionality-independent choices rather than presenting numeric results. Third, while we have demonstrated the viability of our system in identifying the author or clustering functions based on author habits, a more thorough investigation of different domains is necessary, along with an analysis of the impact of training and test set size. Finally, regarding the privacy threat: recent advances in binary authorship attribution can pose a serious threat to the privacy of anonymous coders. There are cases in which contributors to software projects may wish to hide their identities, for example, developers who do not want their employers to know about their independent activities [3]. While we believe that the ability to perform authorship attribution at the binary level has important security applications, we are also concerned about its implications for privacy. In response, we plan to design a tool that permits developers to anonymize themselves.
In addition, we will investigate existing methods that address the privacy threat through the analysis of code writing style [49] and study their applicability to binary code. Another interesting future direction will involve developing rules that identify characteristics that make a binary code harder to attribute.

To conclude, we have presented the first known effort on decoupling coding habits from functionality. Previous research has applied machine learning techniques to extract stylometric features and can distinguish between 5–50 authors, whereas we can handle up to 150 authors. In addition, existing works have only employed artificial datasets, whereas we included more realistic datasets. We also applied our system to known malware (e.g., Zeus and Citadel) as a case study. Our findings indicated that the accuracy of these existing techniques drops dramatically to approximately 45% at a scale of more than 50 authors. Authors with advanced expertise are easier to attribute than authors with less expertise, and authors of realistic datasets are easier to attribute than authors of artificial datasets. Specifically, in the GitHub dataset, the authors of a sample can be identified with greater than 90% accuracy. In summary, our system demonstrates superior results on more realistic datasets.

References

[1] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, "De-anonymizing programmers via code stylometry," in 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 255–270.
[2] S. Alrabaee, N. Saleem, S. Preda, L. Wang, and M. Debbabi, "OBA2: An onion approach to binary code authorship attribution," Digital Investigation, vol. 11, pp. S94–S103, 2014.
[3] A. Caliskan-Islam, F. Yamaguchi, E. Dauber, R. Harang, K. Rieck, R. Greenstadt, and A. Narayanan, "When coding style survives compilation: De-anonymizing programmers from executable binaries," arXiv preprint arXiv:1512.08546, 2015.
[4] N. Rosenblum, X. Zhu, and B. P. Miller, "Who wrote this code? Identifying the authors of program binaries," in Computer Security–ESORICS 2011. Springer, 2011, pp. 172–189.
[5] X. Meng, B. P. Miller, and K.-S. Jun, "Identifying multiple authors in a binary program," in European Symposium on Research in Computer Security. Springer, 2017, pp. 286–304.
[6] "Big Game Hunting: Nation-state malware research, BlackHat," 2015, https://www.blackhat.com/docs/us-15/materials/us-15-MarquisBoire-Big-Game-Hunting-The-Peculiarities-Of-Nation-State-Malware-Research.pdf.
[7] "GitHub – Build software better," https://github.com/trending?l=cpp, 2011.

[8] "The Google Code Jam," http://code.google.com/codejam/, 2008–2015.
[9] "The Planet Source Code," http://www.planet-source-code.com/vb/default.asp?lngWId=3#ContentWinners, 2015.
[10] "Technical report, Resource 207: Kaspersky Lab Research proves that Stuxnet and Flame developers are connected," http://www.kaspersky.com/about/news/virus/2012/.
[11] "Technical report, McAfee," 2011, www.mcafee.com/ca/resources/wp-citadel-trojan-summary.pdf.
[12] A. Rahimian, P. Shirani, S. Alrabaee, L. Wang, and M. Debbabi, "BinComp: A stratified approach to compiler provenance attribution," Digital Investigation, vol. 14, pp. S146–S155, 2015.
[13] Y. David, N. Partush, and E. Yahav, "Similarity of binaries through re-optimization," in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2017, pp. 79–94.
[14] "The C Language Library, Cplusplus website," http://www.cplusplus.com/reference/clibrary/, 2011.
[15] "Big Game Hunting: Nation-state malware research, BlackHat," 2015.
[16] "IDA Pro Fast Library Identification and Recognition Technology," https://www.hex-rays.com/products/ida/tech/, 2011.
[17] J. T.-L. Wang, Q. Ma, D. Shasha, and C. H. Wu, "New techniques for extracting features from protein sequences," IBM Systems Journal, vol. 40, no. 2, pp. 426–441, 2001.
[18] B. S. Elenbogen and N. Seliya, "Detecting outsourced student programming assignments," Journal of Computing Sciences in Colleges, vol. 23, no. 3, pp. 50–57, 2008.
[19] R. Muth, "Register liveness analysis of executable code," Manuscript, Dept. of Computer Science, The University of Arizona, Dec. 1998.
[20] V. Rajlich, "Software evolution and maintenance," in Proceedings of the Future of Software Engineering. ACM, 2014, pp. 133–144.
[21] D. E. Knuth, "Backus normal form vs. Backus Naur form," Communications of the ACM, vol. 7, no. 12, pp. 735–736, 1964.
[22] L. Yujian and L. Bo, "A normalized Levenshtein distance metric," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 6, pp. 1091–1095, 2007.
[23] P. C. Mahalanobis, "On the generalized distance in statistics," Proceedings of the National Institute of Sciences (Calcutta), vol. 2, pp. 49–55, 1936.
[24] T. A. Junttila and P. Kaski, "Engineering an efficient canonical labeling tool for large and sparse graphs," in ALENEX, vol. 7. SIAM, 2007, pp. 135–149.
[25] R. Cilibrasi and P. Vitanyi, "Clustering by compression," Information Theory, IEEE Transactions on, vol. 51, no. 4, pp. 1523–1545, 2005.
[26] "The Scalable Native Graph Database," http://neo4j.com/, 2015.
[27] "The Gephi plugin for Neo4j," https://marketplace.gephi.org/plugin/neo4j-graph-database-support/, 2015.
[28] "Programmer De-anonymization from Binary Executables," https://github.com/calaylin/bda, 2015.
[29] "The materials supplement for the paper 'Who Wrote This Code? Identifying the Authors of Program Binaries'," http://pages.cs.wisc.edu/~nater/esorics-supp/, 2011.
[30] "Hex-Rays decompiler," https://www.hex-rays.com/products/decompiler/, 2015.
[31] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance," The Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.
[32] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," The Journal of Machine Learning Research, vol. 10, pp. 207–244, 2009.
[33] "C++ refactoring tools for Visual Studio," http://www.wholetomato.com/, 2016.
[34] "Refactoring tool," https://www.devexpress.com/Products/CodeRush/.
[35] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin, "Obfuscator-LLVM – software protection for the masses," in Proceedings of the IEEE/ACM 1st International Workshop on Software Protection, SPRO'15, Firenze, Italy, May 19th, 2015, B. Wyseur, Ed. IEEE, 2015, pp. 3–9.
[36] D. Lin and M. Stamp, "Hunting for undetectable metamorphic viruses," Journal in Computer Virology, vol. 7, no. 3, pp. 201–214, 2011.
[37] "Tigress is a diversifying virtualizer/obfuscator for the C language," 2016, http://tigress.cs.arizona.edu/.
[38] "PELock is a software security solution designed for protection of any 32-bit Windows applications," 2016, https://www.pelock.com/.
[39] "Adventures in Windows debugging and reverse engineering," 2016, http://www.nynaeve.net/.
[40] "Advanced Windows software protection system," 2016, http://www.oreans.com/themida.php.
[41] "Script modifies GNU assembly files (.s) to confuse linear sweep disassemblers like objdump," 2016, https://github.com/defuse/gas-obfuscation.
[42] P. Junod, J. Rinaldini, J. Wehrli, and J. Michielin, "Obfuscator-LLVM: software protection for the masses," in Proceedings of the 1st International Workshop on Software Protection. IEEE Press, 2015, pp. 3–9.

[43] "OllyDbg, an assembler-level analysing debugger," http://www.ollydbg.de/.
[44] "Contagio: malware dump," http://contagiodump.blogspot.ca, May 2016.
[45] "VirusSign: Malware Research & Data Center, Virus Free," http://www.virussign.com/, May 2016.
[46] I. Krsul and E. H. Spafford, "Authorship analysis: Identifying the author of a program," Computers & Security, vol. 16, no. 3, pp. 233–257, 1997.
[47] E. H. Spafford and S. A. Weeber, "Software forensics: Can we track code to its authors?" Computers & Security, vol. 12, no. 6, pp. 585–595, 1993.
[48] N. Moran and J. Bennett, Supply Chain Analysis: From Quartermaster to Sun-shop. FireEye Labs, 2013, vol. 11.
[49] M. Brennan, S. Afroz, and R. Greenstadt, "Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity," ACM Transactions on Information and System Security (TISSEC), vol. 15, no. 3, p. 12, 2012.
[50] P. Shirani, L. Wang, and M. Debbabi, "BinShape: Scalable and robust binary library function identification using function shape," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2017, pp. 301–324.
[51] S. Alrabaee, P. Shirani, L. Wang, and M. Debbabi, "FOSSIL: a resilient and efficient system for identifying FOSS functions in malware binaries," ACM Transactions on Privacy and Security (TOPS), vol. 21, no. 2, p. 8, 2018.
[52] A. Aiken et al., "Moss: A system for detecting software plagiarism," University of California–Berkeley, www.cs.berkeley.edu/aiken/moss.html, vol. 9, 2005.
[53] S. Alrabaee, P. Shirani, M. Debbabi, and L. Wang, "On the feasibility of malware authorship attribution," in International Symposium on Foundations and Practice of Security. Springer, 2016, pp. 256–272.
[54] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 459–468.
[55] G. Palmer et al., "A road map for digital forensic research," in First Digital Forensic Research Workshop, Utica, New York, 2001, pp. 27–30.
[56] S. Schleimer, D. S. Wilkerson, and A. Aiken, "Winnowing: local algorithms for document fingerprinting," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, 2003, pp. 76–85.
[57] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, "Application of information retrieval techniques for source code authorship attribution," in Database Systems for Advanced Applications. Springer, 2009, pp. 699–713.
[58] R. Kilgour, A. Gray, P. Sallis, and S. MacDonell, "A fuzzy logic approach to computer software source code authorship analysis," 1998.
[59] P. Juola, "Authorship attribution," Foundations and Trends in Information Retrieval, vol. 1, no. 3, pp. 233–334, 2006.
[60] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, "On the feasibility of internet-scale author identification," in Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012, pp. 300–314.
[61] Z. Zhao, J. Wang, and J. Bai, "Malware detection method based on the control-flow construct feature of software," Information Security, IET, vol. 8, no. 1, pp. 18–24, 2014.
[62] R. Chen, L. Hong, C. Lü, and W. Deng, "Author identification of software source code with program dependence graphs," in Computer Software and Applications Conference Workshops (COMPSACW), 2010 IEEE 34th Annual. IEEE, 2010, pp. 281–286.
[63] G. Balakrishnan and T. Reps, "WYSINWYX: What you see is not what you execute," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 32, no. 6, p. 23, 2010.
[64] Y. David and E. Yahav, "Tracelet-based code search in executables," in ACM SIGPLAN Notices, vol. 49, no. 6. ACM, 2014, pp. 349–360.
[65] "EXEINFO PE," http://exeinfo.atwebpages.com/.
[66] "SEI CERT C++ Coding Standard," https://www.securecoding.cert.org/, 2016.
[67] "The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler," 2011, http://www.amazon.ca/The-IDA-Pro-Book-Disassembler/dp/1593272898.
[68] S.-S. Choi, S.-H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
[69] H. Gascon, F. Yamaguchi, D. Arp, and K. Rieck, "Structural detection of Android malware using embedded call graphs," in Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security. ACM, 2013, pp. 45–54.
[70] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, "Polymorphic worm detection using structural information of executables," in Recent Advances in Intrusion Detection. Springer, 2005, pp. 207–226.
[71] S. Cesare, Y. Xiang, and W. Zhou, "Control flow-based malware variant detection," Dependable and Secure Computing, IEEE Transactions on, vol. 11, no. 4, pp. 307–317, 2014.
[72] A. Desnos and G. Gueguen, "Android: From reversing to decompilation," Proc. of Black Hat Abu Dhabi, pp. 77–101, 2011.
[73] A. Dasgupta, R. Kumar, and T. Sarlós, "Fast locality-sensitive hashing," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 1073–1081.

[74] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 8, pp. 1226–1238, 2005.
[75] "Technical report," www.seifreed.es/docs/Citadel/Trojan/Report-eng.pdf, 2011.
[76] I. Santos, F. Brezo, J. Nieves, Y. K. Penya, B. Sanz, C. Laorden, and P. G. Bringas, "Idea: Opcode-sequence-based malware detection," in Engineering Secure Software and Systems. Springer, 2010, pp. 35–43.
[77] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, "Detecting unknown malicious code by applying classification techniques on opcode patterns," Security Informatics, vol. 1, no. 1, pp. 1–22, 2012.
[78] X. Wang and R. Karri, "Detecting kernel control-flow modifying rootkits," in Network Science and Cybersecurity. Springer, 2014, pp. 177–187.
[79] K. Kim and B.-R. Moon, "Malware detection based on dependency graph using hybrid genetic algorithm," in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation. ACM, 2010, pp. 1211–1218.
[80] "Citizen Lab," https://citizenlab.org/, 2015, University of Toronto, Canada.
[81] P. W. Oman and C. R. Cook, "A programming style taxonomy," Journal of Systems and Software, vol. 15, no. 3, pp. 287–301, 1991.
[82] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models. Benjamin-Cummings Publishing Co., Inc., 1986.
[83] T. Hoff, "C++ coding standard," http://www.possibility.com/Cpp/CppCodingStandard.html [accessed Feb. 15, 2006], 2000.
[84] N. Krislock and H. Wolkowicz, Euclidean Distance Matrices and Applications. Springer, 2012.
[85] M. Fowler, Refactoring: Improving the Design of Existing Code. Pearson Education India, 1999.
[86] "Malware channel: Channel for malware samples to download and analyze," https://twitter.com/MalwareChannel, July 2016.
[87] "Microsoft Malware Classification Challenge (BIG 2015)," https://www.kaggle.com/c/malware-classification, July 2016.
[88] "Blinded for submission."
[89] N. Rosenblum, B. P. Miller, and X. Zhu, "Recovering the toolchain provenance of binary code," in Proceedings of the 2011 International Symposium on Software Testing and Analysis. ACM, 2011, pp. 100–110.
[90] L. Martignoni, M. Christodorescu, and S. Jha, "OmniUnpack: Fast, generic, and safe unpacking of malware," in Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual. IEEE, 2007, pp. 431–441.
[91] M. G. Kang, P. Poosankam, and H. Yin, "Renovo: A hidden code extractor for packed executables," in Proceedings of the 2007 ACM Workshop on Recurring Malcode. ACM, 2007, pp. 46–53.
[92] S. Alrabaee, P. Shirani, L. Wang, M. Debbabi, and A. Hanna, "On leveraging coding habits for effective binary authorship attribution," in European Symposium on Research in Computer Security. Springer, 2018, pp. 26–47.