Similarity in Binary Executables

Yaniv David


Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Yaniv David

Submitted to the Senate of the Technion — Israel Institute of Technology
Hadar 5780, Haifa, March 2020

This research was carried out under the supervision of Prof. Eran Yahav, in the Faculty of Computer Science.

Some results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author's doctoral research period, the most up-to-date versions of which are:

Yaniv David, Nimrod Partush, and Eran Yahav. Statistical similarity of binaries. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '16, pages 266–280, New York, NY, USA, 2016. ACM.

Yaniv David, Nimrod Partush, and Eran Yahav. Similarity of binaries through re-optimization. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, pages 79–94, New York, NY, USA, 2017. ACM.

Yaniv David, Nimrod Partush, and Eran Yahav. FirmUp: Precise static detection of common vulnerabilities in firmware. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 392–404, New York, NY, USA, 2018. ACM.

Yaniv David and Eran Yahav. Tracelet-based code search in executables. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 349–360, New York, NY, USA, 2014. ACM.

ACKNOWLEDGEMENTS

To Vered and Yael.

The Technion’s funding of this research is hereby acknowledged.

Contents

List of Figures

Abstract

1 Introduction
  1.1 The Importance of Analyzing Binary Code
  1.2 Binary Code Analysis Using Search
  1.3 Binary Code Search as An Information Retrieval Problem
  1.4 Evaluating Similarity Measures
  1.5 This Thesis
    1.5.1 Tracelet-Based Code Search: Searching for Patched Code
    1.5.2 Statistical Similarity of Binaries: Tool-Chains and Configuration Oblivious Search
    1.5.3 Similarity through re-Optimization: Cross-Architecture Accelerated Search
    1.5.4 Applying Our Method To a Real-World Problem: Finding Common Vulnerabilities in Firmware
    1.5.5 Leaving the Source Code Behind: Similarity by Neural Name Prediction

2 Tracelet-Based Code Search in Executables
  2.1 Overview
    2.1.1 Motivating Example
    2.1.2 Similarity using Tracelets
  2.2 Preliminaries
  2.3 Tracelet-Based Matching
    2.3.1 Preprocessing: Dealing with Compilation Side-Effects
    2.3.2 Comparing Binary Code Procedures
    2.3.3 Calculating Tracelet Match Score
    2.3.4 Using the Rewrite Engine to Bridge the Gap
  2.4 Evaluation
    2.4.1 Test-bed Structure
    2.4.2 Prototype Implementation
    2.4.3 Test Configuration
  2.5 Results
    2.5.1 Comparing Different Configurations
    2.5.2 Using the Rewrite Engine to Improve Tracelet Match Rate
    2.5.3 Performance
  2.6 Limitations

3 Statistical Similarity of Binaries
  3.1 Background: Similarity By Composition
  3.2 Overview
  3.3 Strand-Based Similarity by Composition
    3.3.1 Similarity By Composition
    3.3.2 Procedure Decomposition to Strands
    3.3.3 Statistical Evidence
    3.3.4 Local and Global Evidence Scores
  3.4 Semantic Strand Similarity
    3.4.1 Similarity Semantics
    3.4.2 Encoding Similarity as a Program Verifier Query
  3.5 Evaluation
    3.5.1 Test-Bed Creation and Prototype Implementation
    3.5.2 Using Vulnerable Code as Queries
    3.5.3 Testing Different Aspects of the Problem Separately
    3.5.4 Enabling the Use of Powerful Semantic Tools
  3.6 Results
    3.6.1 Finding Heartbleed
    3.6.2 Decomposing Our Method into Sub-methods
    3.6.3 Comparison of TRACY and Esh
    3.6.4 Evaluating BinDiff
    3.6.5 Pairwise Comparison
    3.6.6 Limitations

4 Similarity of Binaries through re-Optimization
  4.1 Overview
    4.1.1 The Challenge in Establishing Binary Similarity
    4.1.2 Representing Procedures in a Canonicalized & Normalized Form
  4.2 Algorithm
    4.2.1 Representing Binary Procedures with Canonicalized Normalized Strands
    4.2.2 Scalable Binary Similarity Search
  4.3 Evaluation
    4.3.1 Creating a Corpus to Evaluate the Different Problem Vectors
    4.3.2 Evaluation Setup
    4.3.3 GitZ as a Scalable Vulnerability Search Tool
    4.3.4 All vs. All Comparison
    4.3.5 Comparing Components of the Solution
    4.3.6 Understanding the Global Context Effect
    4.3.7 Limitations

5 FirmUp: Precise Static Detection of Common Vulnerabilities in Firmware
  5.1 Overview
    5.1.1 Pairwise Procedure Similarity: The Syntactic Gap
    5.1.2 Efficient Partial Executable Matching
  5.2 Representing Firmware Executables
  5.3 Binary Similarity as a Back-and-Forth Game
    5.3.1 Game Motivation
    5.3.2 Game Algorithm
    5.3.3 Graph-based Approaches
  5.4 Evaluation
    5.4.1 Corpus Creation
    5.4.2 Finding Undiscovered Common Vulnerabilities and Exposures (CVEs) in Firmware
    5.4.3 Evaluating Precision Against Previous Work

6 Similarity by Neural Name Prediction
  6.1 Overview
  6.2 Background - Seq2seq Models
    6.2.1 LSTM Encoder-Decoder Models
    6.2.2 Attention Models
    6.2.3 Transformers
  6.3 Representing Binary Procedures for Name Prediction
  6.4 Model
    6.4.1 Call Site Encoder
    6.4.2 Encoding Call Site Sequences with LSTMs
    6.4.3 Encoding Call Site Sequences with Transformers
    6.4.4 Encoding Call Site Sequences with Graph Neural Networks
  6.5 Evaluation
    6.5.1 Experimental Setup
    6.5.2 Results
    6.5.3 Ablation Study
    6.5.4 Qualitative Evaluation

7 Related Work

8 Conclusion

Additional Examples of Procedure Name Predictions

Hebrew Abstract

List of Figures

1.1 A ROC plot ([Tow18]) showing how the area under the curve (AUC) is measured.
2.1 doCommand1 and its corresponding control flow graph (CFG) G1.
2.2 doCommand2 and its corresponding CFG G2.
2.3 A pair of 3-tracelets: (a) based on blocks (1,3,5) from Figure 2.1(b), and (b) on blocks (1′,3′,5′) from Figure 2.2(b). Note that jump instructions are omitted from the tracelet, and that empty lines were added to the original tracelet to align similar instructions.
2.4 A successful rewriting attempt with cost calculation.
2.5 A full alignment and rewrite for the basic blocks 3 and 3′ of fig. 2.1(b) and fig. 2.2(b).
2.6 Simplified grammar for x86 assembly.
2.7 A's members are dissimilar to B's. We will measure precision and recall on =A.
2.8 Matching a procedure across different contexts and after code changes.
3.1 Image vs. Code Similarity by Composition. Query image iq2 is similar to the target image it. Query code q2 is similar to the target code t. (Images courtesy of Irani et al. [BI06].)
3.2 Heartbleed vulnerability code snippets.
3.3 Semantically similar strands.
3.4 Syntactically similar but semantically different strands.
3.5 Experiment #1 – Successfully finding 12 procedure variants of Heartbleed.
3.6 All vs. all experiment result in heat map form (pixel intensity = similarity GES measure).
3.7 Wrapper exit_cleanup procedure from sort.c of Coreutils 8.23.
3.8 DEFINE_SORT_FUNCTIONS macro for creating "template" procedures in ls.c of Coreutils 8.23.
4.1 Applying canonicalization and normalization on strands taken from our motivating example (OpenSSL's dtls1_buffer_message procedure in d1_both.c). Only the application of both techniques results in equivalence.
4.2 Three examples of binary to intermediate representation (IR) lifting outputs containing redundant operations.
4.3 Similarity score (as height) for the All vs. All experiment. Average concentrated ROC (CROC): .978, false positive (FP) rate: .03, runtime: 1.1h.
4.4 add_ref_tag() compiled to different architectures.
4.5 The average accuracy (CROC) of the All vs. All experiment, when applying different parts of our solution.
4.6 Examining the importance and effect of using the global context by measuring the number of false positives as P grows.
4.7 Source code and assembly for different compilations of ftp_syst().
5.1 Wget ftp_retrieve_glob() vulnerability snippets.
5.2 The search process for Wget's vulnerable ftp_retrieve_glob() procedure, searched against a found-in-the-wild NETGEAR device firmware, in (a) procedure-centric search (on the left) and (b) executable-centric search (on the right).
5.3 Assembly strand computing the branch destination result (top), its lifted LLVM-IR strand (middle), and final canonical LLVM-IR form (bottom).
5.4 Two approaches for finding a similar procedure to q1 ∈ Q, within the executable T. The procedure-centric approach on the left leads to a less accurate matching. Strands {s1, s2, s3, s4, s5} are spread over all procedures.
5.5 The call-graphs surrounding Wget's ftp_retrieve_glob() procedure in the MIPS query (left) and a target from a NETGEAR firmware (right). The variance in call-graph structure prevents graph-based techniques (e.g., BinDiff) from succeeding.
5.6 Labeled experiment comparing BinDiff with FirmUp, showing that BinDiff encountered over 69.3% false results overall compared with 6% for FirmUp.
5.7 BinDiff's disproportionate focus on the procedure's internal structure leads to false matches for vsf_filename_filter().
5.8 Labeled experiment comparing GitZ with FirmUp, showing that GitZ encountered 34% false positives overall compared with 9.88% for FirmUp.
5.9 Number of procedures correctly matched as a factor of the number of steps in the back-and-forth game. 180 procedures required more than 1 step for the matching, up to 32 steps.
6.1 (a) Reconstructing calls to external procedures and creating pointer-aware slices; (b) representing the procedure as a set of augmented call sites; (c) predicting procedure names by training neural sequence-to-sequence (seq2seq) models on the set of augmented call site sequences.
6.2 Typical call instructions for starting a server: on the left, the call instructions are randomly placed in the code by the compiler; on the right, the call instructions are placed in their correct call order as they appear in the CFG.
6.3 An example of slice information sets created by three x64 instructions: the V_read|write sets show values read and written, and the P_read|write sets show pointer dereferences for reading from and writing to memory.
6.4 Creating pointer-aware slices for extracting concrete values or abstract values: all possible computations that create the value of rdi are statically analyzed, and finally, the occurrence of rdi in the call to getaddrinfo is replaced with the ARG abstract value originating from (2), which is defined as more informative than ∅ (1) and STK (4).

Abstract

We address the problem of binary code search in stripped executables (with no debug information). The main challenge is establishing binary code similarity even when the binary code has been compiled using different compilers, optimization levels, or target architectures. Moreover, the source code being compiled might be from another version of the software package or another implementation altogether. Overcoming this challenge, while avoiding false positives, is invaluable for guiding other, more costly, tasks in the field of binary code analysis, such as reverse engineering or automated vulnerability detection. We present an iterative process of analyzing and presenting the different parts of the binary similarity problem. At each step we further refine our similarity method. Towards this end, we incorporate several representations for binary code, each created by statically analyzing the binary code to decompose it into smaller parts carrying semantic meaning. These representations are matched with different concepts and tools from other fields to create a measure of binary similarity between procedures. These fields include model theory, statistical frameworks, SMT solvers, and deep neural networks. We tested our developed methods in real-world scenarios by employing them to find vulnerabilities by search and to predict names for binary procedures. We discovered 373 vulnerabilities affecting publicly available firmware, 147 of them in the latest available firmware version for the device, and successfully predicted procedure names, improving on the state of the art by 20% and by 84% over state-of-the-art neural models that do not use any static analysis.



Chapter 1

Introduction

1.1 The Importance of Analyzing Binary Code

The family of compiled programming languages, with its prominent members C and C++, is still heavily used almost 50 years after their initial development. Interpreted and managed languages, such as Python and Java, are safer and easier to program with, yet core applications, important services, and operating systems are still written in compiled languages. This is mainly because these languages have better run-time performance and lower memory consumption. Moreover, recent years have shown a significant growth in demand for IoT devices. These devices employ slow CPUs and limited RAM due to cost and size requirements, forcing vendors to once again develop their whole software stack using only compiled code. In fact, the TIOBE index, a service measuring the use of different programming languages by the community since 2001, shows that C has always held the first or second position in the overall ranking. For example, in October 2019 [TIO19] it ranked second with 15.20%, trailing Java by only 1.4%. Generally, software written in compiled languages can be distributed in:

• Source code form: allowing the user to compile the code and target their requested system.

• Binary code form: the output of the compilation process in an executable format, henceforth executable (we will use binary code and executable interchangeably).

These are dubbed open-source and closed-source software, respectively. It is important to note that executables can only run on the specific combination of hardware and operating system (OS) they were built for, also called the compilation target. For example, an executable targeting the Windows OS and the Intel x86 CPU architecture cannot run on Intel x86_64 machines or on any Linux machine. While this presents a clear practical advantage for open-source software, compiling legacy and even current source code is cumbersome and time-consuming, leading many open-source projects, such as Linux distributions, to release software in binary code form as well. Compilers can output a mapping from different offsets in the binary code to names of procedures and other structures from the source code and include it in the executable.


While this information is very useful for debugging, it takes a lot of space on disk. As a result, software is usually distributed without this information. In fact, in most cases executables are intentionally stripped of any information which is not strictly required by the OS for running the executable. Stripped executables contain no information regarding the internal structure of the binary code: the division into compilation units, procedures and classes, type information, and memory structure layout are all removed. As a result, fully recreating the original source code from a stripped executable is considered a very hard task. Moreover, even given a source code package and an executable, one cannot guarantee that this source code package was used to compile it simply by re-compiling it and comparing the executables. This is due to the ever-changing nature of software build environments and the advanced optimizations developed by compiler implementation teams. Effectively, the compilation process is a many-to-many translation process: several different source codes can be compiled into the same binary code, e.g., due to dead code elimination, and the same source code can be compiled into different binary codes, e.g., due to the use of different compilers or compiler configurations. Many examples of how changes in source code and compilation tool-chain affect the compiled binary code are shown and explained in the next chapters. The problem landscape created by the combination of closed-source software with open-source software distributed as stripped executables motivates the development of methods for reasoning about and analyzing binary code. Analogous to (source-code-based) program analysis, the field of binary code analysis addresses tasks involving binary code.

1.2 Binary Code Analysis Using Search

A lot of progress has been made in the field of binary code analysis, yet this growing problem landscape needs solutions that can scale. Some existing approaches achieve good results but require substantial computational resources and long run-times to provide meaningful results. For example, Driller [SGS+16], an automated dynamic tool for discovering errors and vulnerabilities in executables, might run for several hours to test a single executable. Applying it to a large set of executables thus becomes infeasible. This thesis suggests a complementary approach: instead of directly examining each executable in the search space, we first prune the space by performing a preliminary search. We suggest using one executable, for which we know some useful property holds, to search for others. For example, after finding several vulnerabilities in a PDF reader, searching for other PDF readers can lead to finding many similar vulnerabilities in them. While many PDF readers exist, they still form a small group, making a thorough examination of each of them achievable. We call this process binary code search. As we hinted above, using binary code search is especially suitable for the security and intellectual property (IP) protection fields: one malware instance can be used to detect future and past variants (versions of the malware), and finding parts of a patent-protected software product elsewhere can uncover IP theft. In the same way, a software vulnerability can be leveraged to find other software containing the same or similar vulnerabilities.


One of the interesting and challenging examples of such a search is the Shellshock [She14] vulnerability. The code containing this vulnerability, which remained undiscovered for 20 years, was ported into various operating systems, including Apple's OS X and the Ubuntu Linux distribution. This vulnerability even affects jail-broken iOS devices. The discovery required security researchers in multiple organizations to meticulously search through all of the executables in all of their products – an insurmountable task which can be greatly aided by binary code search. This task also serves as the example for one of the hardest variations of the binary code search problem:

1. During these 20 years, many versions of Bash were created, each producing a slightly to extensively different executable.

2. Code for OS X and iOS is compiled using the Clang compiler, while Linux is compiled using gcc; different compilers usually generate different binary code even when given the exact same source code.

3. Compilers evolve over time, improving and changing code generation and optimizations.

4. OS X used to run on PowerPC CPUs, iOS runs on ARM, and desktop Linux computers run on Intel CPUs. Binary code for different architectures exhibits vast syntactic differences.

Each of these avenues of change in the source code or build tool-chain has vast effects on the created binary code. Examples of these changes are shown and explained in the next chapters. In Chapter 5 we showcase the usefulness of this cross-{architecture, optimization, optimization-configuration, software-version} search scenario by discovering 373 vulnerabilities affecting publicly available firmware.

1.3 Binary Code Search as An Information Retrieval Problem

To clearly define and perform binary code search operations, it is useful to frame this search process as an information retrieval problem. Doing so requires: (i) a definition for the search query, (ii) a representation for binary code used to build a database, and (iii) a process to compare a search query against a database of targets. To measure the success of our search process, we need a way to evaluate the relevance of its output. We strive to use the binary code itself as the search query. On the one hand, using the entire binary code block from an executable seems too coarse; a malware is not malicious from beginning to end (e.g., opening a socket), any IP-protected software also contains some standard code (e.g., a linked-list implementation), and a vulnerability is usually bound to a specific area in the code. On the other hand, working with specific parts from the binary, also called signatures, seems to put too much of the problem area out of scope by assigning it to the user. Modern disassemblers can use their knowledge about specific instructions used for calling and returning from procedures to deduce a division of the executable's binary code block into procedures. We henceforth address these as binary procedures.


This process is not perfect, as shown by [BBW+14], but it is stable enough to serve as the foundation for many binary code analysis works. While more information, such as classes and data structures, can also be uncovered, these require deeper semantics-driven analysis and thus are less accurate. As a result, we decided to use binary procedures as our basic comparison object.

(Object1, Object2) → [0, 1], where the resulting number depicts similarity, from not similar at all (0) to identical (1). In NLP, the similarity of two documents can be measured using their Levenshtein-distance, commonly known as edit-distance: the number of edits required to convert one document into the other ([Lev66]). For pictures and videos, [BI06] has defined the similarity measure as a composition process, determining how much of one signal can be composed from the other, allowing scaling and changes in background. In all these cases, the same pattern emerges: some entry criteria is presented to allow for a similarity measure. e.g., matrices needs to be n ∗ n, and documents are not compared to pictures. Then, specific differences between objects were considered tolerable, and the comparison process between two objects was relaxed w.r.t these tolerable changes. The properties which can be relaxed in the search are derived from what kind of similarity we wish to establish. For matrix similarity we allow for the base to change but not the operator represented by the matrix. While changing one word can alter the meaning of a sentence, many changes are required to change the message delivered by a whole document. In the same way, for pictures, we allow scaling objects in the picture and not allow changing these objects altogether. To create a good binary code similarity measure, we will want to relax all properties regarding the syntax of the binary code, and focus on the semantics being represented by it. We can ignore the specific registers or memory offsets used by the machine the code runs on, but we must consider specific steps performed to generate the program’s output. For example, we want the similarity measure for the following two assembly snippets to be 1 (meaning a perfect match):

• mov x0,1 - Assigning the value of 1 into the register numbered 0 in ARM64 assembly.

• xor eax, eax; inc eax; - Assigning the value of 1 into the register named rax in Intel x86_64 assembly, by xor-ing its initial value with itself (putting the result, 0, in rax) and then increasing its value, resulting in rax containing the value 1.
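Such equivalences can be checked mechanically with an SMT solver. The following is a minimal sketch (assuming the z3-solver Python package, which is not part of this thesis's tooling; both snippets are modeled over 32-bit values for comparison) that asks for a counterexample and finds none:

    from z3 import BitVec, BitVecVal, Solver, unsat

    eax0 = BitVec('eax0', 32)   # arbitrary initial value of the register
    eax1 = eax0 ^ eax0          # xor eax, eax  -> always 0
    eax2 = eax1 + 1             # inc eax       -> always 1
    x0   = BitVecVal(1, 32)     # mov x0, 1     -> the constant 1

    s = Solver()
    s.add(eax2 != x0)           # search for an input where the results differ
    assert s.check() == unsat   # no counterexample exists: a perfect match

Chapter 3 uses a program verifier in a similar spirit; this toy query only illustrates why the two syntactically unrelated snippets deserve a similarity score of 1.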

Moreover, we allow the code's control flow to change; e.g., the choice between a nested if structure and a switch-case has no importance to the semantics expressed. In the next chapters we show examples of how different control flows affect binary code similarity.
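As a concrete reference point for the edit-distance notion mentioned above, here is the textbook dynamic-programming formulation of the Levenshtein distance (a generic illustration, not code from this thesis):

    def levenshtein(a: str, b: str) -> int:
        # prev[j] holds the distance between the current prefix of a and b[:j]
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # delete ca
                                curr[j - 1] + 1,            # insert cb
                                prev[j - 1] + (ca != cb)))  # substitute ca -> cb
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3 edits

The same delete/insert/substitute structure reappears, lifted from characters to assembly instructions, in the tracelet rewrite rules of Chapter 2.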


1.4 Evaluating Similarity Measures

Going back to our search scenario, we can use a similarity measure to rank outputs by relevance. By setting a threshold, deeming all results below it as non-relevant, i.e., non-similar, we can evaluate our measure as a classifier. Combining the classifications with our knowledge about the data set, i.e., a label for each target (assuming a fully controlled data set), results in the following sets:

• True positives (TP): the ones we know are positive, i.e., similar

• True Negatives (TN): the ones we know are not positive, i.e., not similar

• Positives (P): the ones classified as positive

• Negatives (N): the ones classified as negative

• False positives (FP): the ones mistakenly marked as positive while being part of the true negatives

• False negatives (FN): the ones mistakenly marked as negative while being part of the true positives

We use the size of these sets, with the threshold, to calculate the classifier’s accuracy:

Accuracy = (TP + TN) / (P + N)

The challenge is thus shifted to selecting the right cross-experiment threshold, as it has a considerable impact on the accuracy. We note that there is no clear way to find a strict global threshold over multiple experiments, as similarity scores vary according to query and corpus size. To side-step this problem we use the receiver operating characteristic (ROC) & CROC methods. ROC is a standard method for evaluating threshold-based classifiers. Instead of selecting one threshold, we evaluate the classifier by testing all of the possible thresholds consecutively. For each threshold we measure the TPR (true positive rate, also called recall or sensitivity) and the FPR (false positive rate) of the classifier. Plotting the results for the different thresholds on the same graph yields a curve; the AUC is regarded as the accuracy of the proposed classifier. The definitions for TPR and FPR, accompanied by an example of a ROC plot, are shown in fig. 1.1. CROC is an improvement over ROC addressing the problem of "early retrieval", where the corpus size is huge and the number of true positives is low. CROC gives a stronger grade to methods that provide a low number of candidate matches for a query (i.e., it penalizes false positives more aggressively than ROC). This is appropriate in our setting, as manually verifying a match is a costly operation for a human expert. Moreover, software development is inherently based on re-use, so similar procedures should not appear in the same executable (so each executable will contain at most one TP). Further discussion of the benefits and implementation of this widely used method is provided by [SADB10].


TPR = TP / (TP + FN)

FPR = FP / (TN + FP)

Figure 1.1: A ROC plot ([Tow18]) showing how the AUC is measured.

It is important to note that both ROC and CROC measure the rate of FPs encountered as the threshold increases, and not just their number or percentage, and thus reflect the accuracy across all thresholds. To generate AUC scores for the different similarity measures developed throughout this work, we used the yard-plot tool [Nep], which uses (by default) a magnification factor of a = 7, which transforms an FPR of 0.1 to 0.5.
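For illustration, the following minimal sketch (assuming the scikit-learn Python package; the thesis itself used yard-plot, and the labels and scores here are hypothetical) sweeps all thresholds over a toy labeled corpus and computes the ROC AUC:

    from sklearn.metrics import roc_auc_score, roc_curve

    labels = [1, 0, 0, 1, 0, 0]                # hypothetical ground truth (1 = similar)
    scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.2]   # hypothetical similarity scores

    fpr, tpr, thresholds = roc_curve(labels, scores)  # TPR/FPR at every threshold
    print("AUC =", roc_auc_score(labels, scores))     # 1.0: all positives ranked first

Here the measure ranks every similar pair above every dissimilar one, so the AUC is perfect regardless of which single threshold one would have picked.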

1.5 This Thesis

In this thesis we show an iterative process of addressing binary code search problems by performing a static analysis of the binary code. We start by establishing similarity between binary procedures originating from the same code package, across different versions of the package. Then, we expand our similarity method to support comparing binary procedures created by different build tool-chains and targeting different CPU architectures. In our final step we establish similarity even for binary procedures not originating from the same source code. This is performed by using deep learning to predict names for binary procedures and comparing these names to establish similarity between the underlying binary procedures. As shown throughout this thesis, harder search scenarios demand trade-off adjustments, incorporating new techniques from other fields, and performing deeper analysis of the binary code.

1.5.1 Tracelet-Based Code Search: Searching for Patched Code

In the first step, detailed in Chapter 2, we address the most common source of change in binary code: a patch in the source code it was compiled from. For now, we assume the compilation tool-chain is fixed and focus on source code changes, which occur far more frequently. Inspired by the way edit distance is used to establish similarity between text documents, we build a framework to rewrite one binary procedure into the other, while relaxing the comparison w.r.t. syntactic differences. This framework is based on two key steps.

Tracelet decomposition We break down the CFG of a procedure into tracelets: continuous, short, partial traces of an execution.


We use tracelets of bounded length k (the number of basic blocks), which we refer to as k-tracelets. We bound the length of tracelets in accordance with the number of basic blocks, but a tracelet itself is comprised of individual instructions. The idea is that a tracelet begins and ends at control-flow instructions, and is otherwise a continuous trace of an execution. Intuitively, tracelet decomposition captures (partial) flow.

Tracelet similarity by rewriting To measure similarity between two tracelets, we define a set of simple rewrite rules and measure how many rewrite steps are required to rewrite one tracelet into the other. Rather than exhaustively searching the space of possible rewrite sequences, we encode the problem as a constraint-solving problem and measure distance using the number of constraints that have to be violated to reach a match. Tracelet rewriting captures distance between tracelets in terms of transformations that effectively "undo" low-level compiler decisions such as memory layout and register allocation. In a sense, some of our rewrite rules can be thought of as "register deallocation" and "memory reallocation". We implemented this approach and tested it using a dataset containing over a million procedures, achieving very high precision and recall scores: 1.0 and 0.99 ROC and CROC AUC, respectively.

1.5.2 Statistical Similarity of Binaries: Tool-Chains and Configuration Oblivious Search

In this step, detailed in Chapter 3, we expand our search to include the next class of differences between binary procedures: the use of different tool-chains and different configurations to compile them. Essentially, the only criterion left for binary search is that the CPU architecture is the same. We found that the initial code layout (the result of using the O0 optimization level) is substantially different when compilers from different vendors are used. Obviously, applying different optimization steps brings them further apart. Even routine changes in build environments, e.g., upgrading a compiler to a new version, cause substantial changes. The same goes for changes in the optimization level. Noteworthy changes include different decisions regarding code in-lining, and calculations moving from one location in the code to another, becoming intertwined with other calculations, and being cloned or united. These complex changes in the binary code rendered our previous approach unsuitable. Instead, inspired by the way similarity by composition ([BI06]) was used to compare pictures and videos, we construct a much better method based on three key aspects.

Decomposing the procedure into strands We use data-flow information to decompose the CFG's basic blocks into strands. These strands represent a consecutive part of one calculation, making them more appropriate to compare.

Comparing strands We use a program verifier to check whether two strands are semantically equivalent by assuming input equivalence and checking if intermediate and output values are the same. When they are not equivalent, we define a quantitative notion of strand similarity based on the proportion of matching values to the total number of values in the strand.


Statistical reasoning over strands We present a statistical framework for reasoning about the similarity of whole procedures using strand similarity. We compute a global similarity evidence score between procedures using the sum of the local evidence scores (LES) between strands. A key aspect of our framework is that we amplify the similarity scores of unique strands, expressed by a high LES, and diminish the significance of "common" strands (with respect to the target database examined) as they are less indicative of similarity. We implemented this approach and thoroughly evaluated it in a cross-{compiler, compiler-version, compiler-configuration} binary code search scenario: searching for eight prominent vulnerabilities, representing different vulnerability types, achieving high ROC and CROC AUC scores, with few false positives.
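The following is a loose sketch of the intuition behind these scores (function names and numbers are hypothetical; the precise formulation appears in Chapter 3): each matched strand contributes a log-likelihood ratio, so a rare strand contributes far more evidence than a common one:

    import math

    def local_evidence(match_prob, corpus_freq):
        # log-likelihood ratio: how much likelier the match is than chance;
        # rare strands (low corpus_freq) contribute much more than common ones
        return math.log(match_prob / corpus_freq)

    def global_evidence(matched_strands):
        # matched_strands: (match_prob, corpus_freq) per matched query strand
        return sum(local_evidence(p, f) for p, f in matched_strands)

    # a rare strand (corpus frequency 1%) dominates a common one (50%)
    print(global_evidence([(0.9, 0.01), (0.8, 0.5)]))  # ~4.97, mostly from the rare strand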

1.5.3 Similarity through re-Optimization: Cross-Architecture Accelerated Search

In this step, detailed in Chapter 4, we set two goals: (i) improve search performance, and (ii) support cross-architecture search. We preserve our strand-based representation for binary procedures and the statistical framework to reason about strand significance (both introduced in Chapter 3), focusing our efforts on improving strand comparison. In Chapters 2 and 3, most of the search run-time is spent on heavyweight, SMT-based, approaches for comparing pairs of tracelets or strands. Allowing cross-architecture search further complicates the comparison task, due to the different instruction set architectures (ISAs) involved. A simple example showcases these differences: the lea r15, [r14+1] Intel x86-64 instruction requires two instructions to express in ARM64: mov x15,x14; add x15,x15,1. This example is taken from, and further explained in, Chapter 4.

Finding strand equality through re-optimization We relinquish the need for pairwise comparison, and instead employ an "out-of-context re-optimization" technique. This technique lifts binary code from all supported architectures into one IR, and then transforms strands into a canonical normalized form (canonical with respect to the optimizer). This technique allows us to efficiently and accurately overcome syntactic differences caused by the use of different architectures or other compiler actions. It allows us to capture similarity through the (re-)application of the compiler optimizer on a procedure's strands, aiding us in identifying syntactically different yet semantically equivalent strands. We implemented our technique and performed an extensive evaluation. We show that we are able to perform millions of comparisons efficiently, and that the combined use of strand re-optimization with the global context establishes similarity with high accuracy, and with an order-of-magnitude speed-up over our previous method presented in Chapter 3.
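As a toy illustration of the gap being closed (plain Python standing in for the LLVM-IR the actual technique operates on), both snippets denote the same function of their input once reduced to a canonical meaning:

    import random

    def intel_lea(r14):            # lea r15, [r14+1]
        return (r14 + 1) % 2**64

    def arm64_seq(x14):            # mov x15, x14 ; add x15, x15, 1
        x15 = x14
        x15 = x15 + 1
        return x15 % 2**64

    v = random.getrandbits(64)
    assert intel_lea(v) == arm64_seq(v)   # both canonicalize to: out = in + 1

The re-optimization technique reaches the same conclusion syntactically, by letting the optimizer rewrite both lifted strands into one canonical form.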

1.5.4 Applying Our Method To a Real-World Problem: Finding Common Vulnerabilities in Firmware

In this step, detailed in Chapter 5, we show the usefulness of our binary-code-search-based vulnerability detection by applying it to a real-world test.


To do this, we gathered information regarding known vulnerabilities in open-source software, and used it to find previously unknown vulnerable internet of things (IoT) devices by searching their firmware (the device software). To achieve this goal we employ the representation for binary code developed in Chapters 3 and 4 and focus our efforts on improving the comparison engine.

Using the information in the surrounding executable We broaden the scope of procedure similarity beyond the procedure itself, and observe the neighboring procedures (all other procedures in the executable). This is guided by the observation that procedures always operate within an executable, and will therefore almost always appear with some subset of their neighboring procedures. Using information from neighboring procedures greatly improves accuracy.

Efficiently matching procedures in the context of executables To find procedure similarity within the context of an executable, we require not only matching the procedure qv ∈ Q itself to some procedure t ∈ T, but also the surrounding procedures in Q and T (or at least some of them). Producing a full matching over executables containing thousands of procedures is unnecessary, and finding an optimal full matching is computationally costly. Instead, we employ an algorithm for producing a partial matching, focused on the query procedure qv (i.e., the matching need not match all procedures, but must contain qv). This matching algorithm is inspired by model theory's back-and-forth games [Ehr61], which provide a more efficient replacement for full-matching algorithms when the sets being matched are very large (but not infinite). We implemented this technique and performed an extensive evaluation over 40 million procedures, over 4 different prevalent architectures, crawled from public vendor firmware images. We discovered 373 vulnerabilities affecting publicly available firmware, 147 of them in the latest available firmware version for the device. In fact, we show that this last refinement translates into a 45% increase (on average) in our detection rate.
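The sketch below conveys the core idea in a heavily simplified form (the names and the one-round structure are hypothetical; the full game algorithm appears in Chapter 5): a tentative match for the query is kept only if the best match in the opposite direction agrees:

    def back_and_forth(q, Q, T, score):
        """One round of the game for query q (simplified illustration)."""
        t = max(T, key=lambda cand: score(q, cand))       # "forth": best target for q
        q_back = max(Q, key=lambda cand: score(cand, t))  # "back": best query for t
        return t if q_back == q else None                 # keep only mutual matches

    # toy score table: q1 and t1 prefer each other, so the match is kept
    S = {("q1", "t1"): 0.9, ("q1", "t2"): 0.3,
         ("q2", "t1"): 0.4, ("q2", "t2"): 0.8}
    print(back_and_forth("q1", ["q1", "q2"], ["t1", "t2"],
                         lambda a, b: S[(a, b)]))  # -> t1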

1.5.5 Leaving the Source Code Behind: Similarity by Neural Name Prediction

In recent years, great strides were made in the analysis of source code using learned models: from automatic inference of variables and types [RVK15, BRV16, AZLY18, BPS18, ABK18] to bug detection [PS18, RAJ+17], code summarization [APS16, AZLY19, ABLY19], code retrieval [SLL+18, ATGW15], and even code generation [MCJ17, BAGP19, LCJM17, ASLY19]. Inspired by these works, we offer a representation for binary procedures that allows a neural model to generate a descriptive name for a given stripped assembly procedure. Using this framework, instead of comparing binary procedures we can compare their predicted names. This representation provides an interesting and powerful balance between the program-analysis effort required to obtain the representation from assembly procedures, and the effectiveness of the learning model. We perform an extensive evaluation of our framework to predict names within real-world Linux executables from GNU software packages. Our framework achieves an F1 score of 40.04 in predicting procedure names within GNU packages.



Chapter 2

Tracelet-Based Code Search in Executables

This chapter presents our first stab at using binary code similarity for searching for vulnerable software. In this stage, we focus our efforts on establishing similarity for patched code.

Problem definition Given a binary procedure q, and a large dataset of target binary procedures T, we require a measure that accurately captures binary code similarity. This measure should provide a high similarity score for a procedure t ∈ T when q and t originated from the same source code, or from slightly patched versions of the same source code. As we show in the following example, the biggest challenge in establishing similarity in this scenario is the ripple effect a patch has on the procedure's control structures and on the final syntactic representation of the code.

2.1 Overview

In this section, we give an informal overview of our approach.

2.1.1 Motivating Example

Consider the source code of fig. 2.1(a) and the patched version of this code shown in fig. 2.2(a). Note that both procedures have the same format string vulnerability (printf(optionalMsg), where optionalMsg is unsanitized and used as the format string argument). In our work, we would like to consider doCommand1 and doCommand2 as similar. Generally, this will allow us to use one vulnerable procedure to find other similar procedures which might be vulnerable too. Despite the similarity at the source-code level, the assembly code produced for these two versions (compiled with gcc using default optimization level O2) is quite different. Figure 2.1(b) shows the assembly code for doCommand1 as a CFG G1, and fig. 2.2(b) shows the code for doCommand2 as a CFG G2. In these CFGs, we numbered the basic blocks for presentation purposes. We used the numbering n for a basic block in the original program, n′ for a matching basic block in the patched program, and m∗ for a new block in the patched program.


    int doCommand1(int cmd, char *optionalMsg, char *logPath) {
        int counter = 1;
        FILE *f = fopen(logPath, "w");
        if (cmd == 1) {
            printf("(%d) HELLO", counter);
        } else if (cmd == 2) {
            printf(optionalMsg);
        }
        fprintf(f, "Cmd %d DONE", counter);
        return counter;
    }

(a) source code; (b) CFG G1 [graph omitted]

Figure 2.1: doCommand1 and its corresponding CFG G1.

Trying to establish similarity between these two procedures at the binary level, we face (at least) two challenges: the code is structurally different and syntactically different. In particular:
(i) The control flow graph structure is different. For example, block 6∗ in G2 does not have a corresponding block in G1.
(ii) The offsets of local variables on the stack are different, and in fact all variables have different offsets between versions. For example, var_18 in G1 becomes var_28 in G2 (this also holds for immediate values).
(iii) Different registers are used for the same operations. For example, block 3 of G1 uses ecx to pass a parameter to the printf call, while the corresponding block 3′ in G2 uses ebx.
(iv) All of the jump addresses have changed, causing jump instructions to differ syntactically. For example, the target of the jump at the end of block 1 is loc_401358, and in the corresponding jump in block 1′, the jump address is loc_401370.
Furthermore, the code shown in the figures assumes that the procedures were compiled in the same context (surrounding code). Had the procedures been compiled under different contexts, the generated assembly would differ even more (in any location-based symbol, address or offset).
Why Tracelets? A tracelet is a partial execution trace that represents a single partial flow through the program. Using tracelets has the following benefits:
• Stability with respect to jump addresses: Generally, jump addresses can vary dramatically between different versions of the same code (adding an operation can create an avalanche of address changes). Comparing tracelets allows us to side-step this problem, as a tracelet represents a single flow that implicitly and precisely captures the effect of the jump.


    int doCommand2(int cmd, char *optionalMsg, char *logPath) {
        int counter = 1;
        int bytes = 0; // New variable
        FILE *f = fopen(logPath, "w");
        if (cmd == 1) {
            printf("(%d) HELLO", counter);
            bytes += 4;
        } else if (cmd == 2) {
            printf(optionalMsg);
            bytes += strlen(optionalMsg);
        /* This option is new: */
        } else if (cmd == 3) {
            printf("(%d) BYE", counter);
            bytes += 3;
        }
        fprintf(f, "Cmd %d\\%d DONE", counter, bytes);
        return counter;
    }

(a) source code; (b) CFG G2 [graph omitted]

Figure 2.2: doCommand2 and its corresponding CFG G2.

• Stability to changes: When a piece of code is patched locally, many of its basic blocks might change and affect the linear layout of the binary code. This makes approaches based on structural and syntactic matching extremely sensitive to changes. In contrast, under a local patch, many of the tracelets of the original code remain similar. Furthermore, tracelets allow for alignment (also considering insertions and deletions of instructions), as explained in section 2.3.3.
• Semantic comparison: Since a tracelet is a partial execution trace, it is feasible to check semantic equivalence between tracelets (e.g., using a SAT/SMT solver as in [SSA13]). In general, this can be very expensive and we present a practical approach for checking equivalence that is based on data-dependence analysis and rewriting (see section 2.3.4).
The idea of tracelets is natural from a semantic perspective. Taking tracelet length to be the maximal length of acyclic paths through a procedure is similar in spirit to path profiling [BL96, RBDL97]. The problem of similarity between tracelets (loop-free sequences of assembly instructions) also arises in the context of super-optimization [BA06].

2.1.2 Similarity using Tracelets

Consider the CFGs G1 and G2 of fig. 2.1 and fig. 2.2. In the following, we denote 3-tracelets using triplets of block numbers. Note that in a tracelet all control-flow instructions (jumps) are removed (as we show in fig. 2.3). Decomposing G1 into 3-tracelets results in the following sequences of blocks: (1, 2, 4), (1, 2, 5), (1, 3, 5), (2, 4, 5),


Technion - Computer Science Department - Ph.D. Thesis PHD-2020-04 - 2020 push ebp push ebp mov ebp, esp mov ebp,esp sub esp, 18h sub esp,28h mov [ebp+var_4], esi mov [ebp+var_C],ebx mov eax,[ebp+arg_8] mov eax,[ebp+arg_8] mov esi,[ebp+arg_0] mov ebx,[ebp+arg_0] mov [ebp+var_8], ebx mov [ebp+var_4],edi mov ebx, offset unk_404000 mov edi,offset unk_404000 mov [ebp+var_8],esi xor esi,esi mov [esp+18h+var_14], ebx mov [esp+28h+var_24],edi mov [esp+18h+var_18], eax mov [esp+28h+var_28],eax call _fopen call _fopen cmp esi, 1 cmp ebx,1 mov ebx, eax mov edi,eax jz short loc_401358 jz short loc_401370 mov [esp+18h+var_18],... mov [esp+28h+var_28],... mov ecx, 1 mov ebx,1 mov esi,4 mov [esp+18h+var_14], ecx mov [esp+28h+var_24],ebx call _printf call _printf jmp short loc_40132F jmp short loc_401339 mov [esp+18h+var_18], ebx mov [esp+28h+var_1C],esi mov edx, 1 mov edx,1 mov eax, offset aCmdDDone mov eax,offset aCmdDDDone mov [esp+18h+var_10], edx mov [esp+28h+var_28],edi mov [esp+18h+var_14], eax mov [esp+28h+var_20],edx mov [esp+28h+var_24],eax call _fprintf call _fprintf mov ebx,[ebp+var_8] mov ebx,[ebp+var_C] mov eax, 1 mov eax,1 mov esi,[ebp+var_4] mov esi,[ebp+var_8] mov edi,[ebp+var_4] mov esp, ebp mov esp,ebp pop ebp pop ebp retn retn

(a) (b)

Figure 2.3: A pair of 3-tracelets: (a) based on blocks (1,3,5) from Figure 2.1(b), and (b) on blocks (1’,3’,5’) from Figure 2.2(b). Note that jump instructions are omitted from the tracelet, and that empty lines were added to the original tracelet to align similar instructions.

and doing the same for G2 results in the following:

(1′, 2′, 4′), (1′, 2′, 6∗), (1′, 3′, 5′), (2′, 4′, 5′), (2′, 6∗, 7∗), (2′, 6∗, 5′), (6∗, 7∗, 5′).

To establish similarity between the two procedures, we measure similarity between the sets of their tracelets.
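A minimal sketch of this decomposition (the successor map and helper below are simplifications for illustration, not the thesis's implementation) reproduces the 3-tracelets of G1 listed above:

    from itertools import chain

    def k_tracelets(cfg, k):
        """Enumerate all paths of k basic blocks; cfg maps block -> successors."""
        def extend(path):
            if len(path) == k:
                yield tuple(path)
                return
            for succ in cfg[path[-1]]:
                yield from extend(path + [succ])
        return list(chain.from_iterable(extend([b]) for b in cfg))

    # Successor map of G1 from Figure 2.1(b)
    G1 = {1: [2, 3], 2: [4, 5], 3: [5], 4: [5], 5: []}
    print(k_tracelets(G1, 3))  # [(1, 2, 4), (1, 2, 5), (1, 3, 5), (2, 4, 5)]

Only paths that reach the full length k are kept, which is why dead ends such as (3, 5) do not appear in the output.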

Tracelet comparison as a rewriting problem As an example of tracelet comparison, consider the tracelets based on (1, 3, 5) in G1 and (1′, 3′, 5′) in G2. In the following, we use the term instruction to refer to an opcode (mnemonic) accompanied by its required operands. The two tracelets are shown in fig. 2.3 with similar instructions aligned (blanks added for missing instructions). Note that in both tracelets, all control-flow instructions (jumps) have been omitted. For example, in the original tracelet, je short loc_401358 and jmp short loc_40132F were omitted, as the flow of execution has already been determined. A trivial observation about any pair of k-tracelets is that syntactic equality results in semantic equivalence and should be considered a perfect match, but even after alignment this is not true for these two tracelets.



Figure 2.4: A successful rewriting attempt with cost calculation.

Figure 2.5: A full alignment and rewrite for the basic blocks 3 and 3′ of fig. 2.1(b) and fig. 2.2(b).

To tackle this problem, we need to be able to measure and grade the similarity of two k-tracelets that are not equal. Our approach is to measure the distance between two k-tracelets by the number of rewrite operations required to edit one of them into the other (an edit distance). Note that a lower grade means greater similarity, where 0 is a perfect match. To do this we define the following rewrite rules:
• Delete an instruction [instDelete].
• Add an instruction [instAdd].
• Substitute one operand for another operand of the same type [Opr-for-Opr] (the possible types are register, immediate and memory offset). Substituting a register with another register is one example of a replace rule.
• Substitute one operand for another operand of a different type [Opr-for-DiffOpr].
Given a pair of k-tracelets, finding the best match using these rules is a rewriting problem requiring an exhaustive search solution. Following code patches or compilation in a different context, some syntactic deviations in the procedure are expected, and we want our method to accommodate them. For example, a register used in an operation can be swapped for another depending on the compiler's register allocation. We therefore redefine the cost in our rewriting rules such that if an operand is swapped with another operand throughout the tracelet, this is counted at most once. Figure 2.4 shows an example of a score calculation performed during rewriting. In this example, the patched block on the left is rewritten into the original block on the right, using the aforementioned rewrite rules and assuming some cost procedure Cost used to calculate the cost for each rewrite operation. We present a way to approximate the results of this process, using two co-dependent heuristic steps: tracelet alignment and constraint-based rewriting.

Tracelet alignment To compare tracelets we use a tracelet alignment algorithm, a variation on the longest common subsequence (LCS) algorithm adapted for our purpose. The alignment algorithm matches similar operations and ignores operations added as a result of a patch. In our example, most of the original paths exist in the patched code (excluding (1, 2, 5), which was "lost" in the code change), and so most of the tracelets could also be found using patch-tolerant methods to compare basic blocks.
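The sketch below illustrates the alignment step on instructions from blocks 3 and 3′ (a classic LCS dynamic program, with a crude "same mnemonic and operand count" test standing in for the real instruction-similarity criterion; illustrative only):

    def rough_match(inst_a, inst_b):
        """Crude stand-in for instruction similarity: same mnemonic and arity."""
        (ma, _, ra), (mb, _, rb) = inst_a.partition(" "), inst_b.partition(" ")
        return ma == mb and ra.count(",") == rb.count(",")

    def lcs_align(xs, ys):
        """Length of the longest common subsequence under rough_match."""
        dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
        for i, x in enumerate(xs):
            for j, y in enumerate(ys):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if rough_match(x, y)
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        return dp[-1][-1]

    orig    = ["mov ecx, 1", "mov [esp+18h+var_14], ecx", "call _printf"]
    patched = ["mov ebx, 1", "mov esi, 4",
               "mov [esp+28h+var_24], ebx", "call _printf"]
    print(lcs_align(orig, patched))  # 3: one added patched instruction is skipped

As in the figure, an instruction added by the patch is simply left unaligned rather than breaking the match.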

Constraint-based matching To overcome differences in symbols (registers and memory locations), we use a tracelet from the original code as a "template" and try to rewrite a tracelet from the patched code to match it. This is done by addressing this rewrite problem as a constraint satisfaction problem (CSP), using the tracelets' symbols as variables and their internal dataflow and the matched instruction alignment as constraints. Figure 2.5 shows an example of the full process, alignment and rewriting, performed on the basic blocks 3 and 3′. It is important to note that this process is meant to be used on whole tracelets; we use only one basic block for readability. As we can see, the system correctly identifies the added instruction (mov esi,4) and ignores it. In the next step, the patched tracelet's symbols are abstracted according to the symbol type, and later successfully assigned to the correct value for a perfect match with the original tracelet.


Note that procedure calls alone (or any other group of special symbols) could not have been used instead of this full matching process. For example, "printf" is very common in our code example, but it is only visible by name because it is imported; if it were an internal procedure, its name would have been stripped away.

2.2 Preliminaries

In this section, we provide basic definitions and background that we use throughout this chapter.

⟨instr⟩ ::= ⟨nullary⟩ | ⟨unary⟩ ⟨op⟩ | ⟨binary⟩ ⟨op⟩ ⟨op⟩ | ⟨ternary⟩ ⟨op⟩ ⟨op⟩ ⟨op⟩

⟨op⟩ ::= [ ⟨OffsetCalc⟩ ] | ⟨arg⟩

⟨arg⟩ ::= ⟨reg⟩ | ⟨imm⟩

⟨OffsetCalc⟩ ::= ⟨arg⟩ | ⟨arg⟩ ⟨aop⟩ ⟨OffsetCalc⟩

⟨aop⟩ ::= + | - | *

⟨reg⟩ ::= eax | ebx | ecx | edx | ...

⟨nullary⟩ ::= aad | aam | aas | cbw | cdq | ...

⟨unary⟩ ::= dec | inc | neg | not | ...

⟨binary⟩ ::= adc | add | and | cmp | mov | ...

⟨ternary⟩ ::= imul | ...

Figure 2.6: Simplified grammar for x86 assembly.

Assembly Instructions An assembly instruction consists of a mnemonic and up to 3 operands. Each operand can be one argument (register or immediate value) or a group of arguments used to address a certain memory offset. Figure 2.6 provides a simplified grammar of x86 instructions. Note that in the case of OffsetCalc, a group of arguments is used to determine a single offset, and the value at this memory offset serves as the operand for the instruction (see examples in the following table). We denote the set of all assembly instructions by Instr. We define a set Reg of general-purpose registers, allowing explicit read and write, a set Imm of all immediate values, and a set Arg = Reg ∪ Imm containing both sets. Given an assembly instruction, we define the following:
• read(inst) : Instr → 2^Reg, the set of registers being read by the instruction inst. A register r is considered as being read in an instruction inst when it appears as the right-hand-side argument, or if it appears as a component in the computation of a memory address offset (in this case, its value is read as part of the computation).


• write(inst) : Instr → 2^Reg, the set of registers being written by the instruction inst.
• args(inst) : Instr → 2^Arg, given an instruction, returns the set of arguments that appear in the instruction.
• SameKind(inst,inst) : Instr × Instr → Boolean, given two instructions, returns true if both instructions have the same structure, meaning that they have the same mnemonic, the same number of arguments, and all arguments are the same up to renaming of arguments of the same type.
Following are some examples of running these functions:

#       Instruction          Args             Read         Write
inst1   add eax, ebx         {eax,ebx}        {eax,ebx}    {eax}
inst2   mov eax,[ebp+4]      {eax,ebp,4}      {ebp}        {eax}
inst3   mov ebx,[esp+8]      {ebx,esp,8}      {esp}        {ebx}
inst4   mov eax,[ebp+ecx]    {eax,ebp,ecx}    {ebp,ecx}    {eax}

Note that SameKind(inst2, inst3) = true, but SameKind(inst3, inst4) = false, as the last argument used in the offset calculation has a different type in the two instructions (immediate and register, respectively).

Control Flow Graphs and Basic Blocks We use the standard notions of basic blocks and control-flow graphs. A basic block is a sequence of instructions with a single entry point, and at most one exit point (jump) at the end of the sequence. A control flow graph (CFG) is a graph whose nodes are basic blocks and whose directed edges represent program control flow.
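For illustration, the following minimal Python sketch models these definitions. The Arg/Instr representation and the READS_DEST table are simplifying assumptions of this sketch (real x86 operand semantics are richer), not part of the TRACY implementation:

from dataclasses import dataclass

@dataclass(frozen=True)
class Arg:
    kind: str                 # "reg" or "imm"
    value: str                # e.g. "eax" or "4"
    in_offset: bool = False   # True if the argument is part of a [..] offset computation

@dataclass(frozen=True)
class Instr:
    mnemonic: str
    args: tuple               # Arg components, in operand order

# mnemonics whose destination register is also read (assumed, partial list)
READS_DEST = {"add", "adc", "and", "cmp", "inc", "dec", "imul"}

def args_of(inst):
    return {a.value for a in inst.args}

def read(inst):
    # read = right-hand-side registers, offset components, and read-modify-write destinations
    return {a.value for i, a in enumerate(inst.args)
            if a.kind == "reg" and (a.in_offset or i > 0 or inst.mnemonic in READS_DEST)}

def write(inst):
    # the destination register is written, unless it only feeds an offset or a compare
    if inst.args and inst.args[0].kind == "reg" and not inst.args[0].in_offset \
            and inst.mnemonic not in {"cmp", "test"}:
        return {inst.args[0].value}
    return set()

def same_kind(a, b):
    return (a.mnemonic == b.mnemonic and len(a.args) == len(b.args)
            and all(x.kind == y.kind and x.in_offset == y.in_offset
                    for x, y in zip(a.args, b.args)))

# the table's examples: inst2 = mov eax,[ebp+4], inst3 = mov ebx,[esp+8], inst4 = mov eax,[ebp+ecx]
inst2 = Instr("mov", (Arg("reg", "eax"), Arg("reg", "ebp", True), Arg("imm", "4", True)))
inst3 = Instr("mov", (Arg("reg", "ebx"), Arg("reg", "esp", True), Arg("imm", "8", True)))
inst4 = Instr("mov", (Arg("reg", "eax"), Arg("reg", "ebp", True), Arg("reg", "ecx", True)))
assert same_kind(inst2, inst3) and not same_kind(inst3, inst4)
assert read(inst2) == {"ebp"} and write(inst2) == {"eax"}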

2.3 Tracelet-Based Matching

In this section, we describe the technical details of tracelet-based procedure similarity. In section 2.3.1 we present a preprocessing phase applied before our tracelet-based similarity algorithm; the goal of this phase is to reverse some compilation side-effects. In section 2.3.2 we present the main algorithm for comparing procedures. The main algorithm uses tracelet similarity, the details of which are presented in section 2.3.3, and a rewrite engine, which is presented in section 2.3.4.

2.3.1 Preprocessing: Dealing with Compilation Side-Effects

When source code is compiled to assembly, much information is lost or intentionally stripped, especially memory addresses. One of the fundamental choices made in the process of constructing our similarity measure was the use of disassembly “symbols” (registers, memory offsets, etc.) to represent the code. This approach coincides with our need for flexibility, as we want to be able to detect the expected deviations that might be caused by optimizations and patching. Some data can be easily and safely recovered by abstracting any location-dependent argument values, while making sure that these changes will not introduce false matches.


Specifically, we perform the following substitutions (the first two are also sketched in code after this list):

• Replace offsets of imported procedures with the procedure name. For example, call 0x00401FF0 is replaced with call _printf.

• Replace each offset pointing to initialized global memory with a designated token denoting its content. For example, the address 0x00404002, containing the string "DONE", is replaced with aCmdDDDone.

• If the called procedure can be detected using the import table, replace variable offsets (stack or global memory) with the variable’s name, retrieved from the procedure’s documentation. For example, var_24 should be replaced with format (this was not done in this case, to demonstrate the common case in which the called procedure is not imported and as such its arguments are not known).
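As an illustration only, the first two substitutions can be sketched as a simple pass over operand text; the maps below (import_table, data_tokens) are hypothetical stand-ins for information recovered by the disassembler:

def normalize_operand(op, import_table, data_tokens):
    # import_table: address -> imported procedure name, e.g. {0x00401FF0: "_printf"}
    # data_tokens:  address -> token for initialized global data, e.g. {0x00404002: "aCmdDDDone"}
    if op.startswith("0x"):
        addr = int(op, 16)
        if addr in import_table:
            return import_table[addr]   # call 0x00401FF0  ->  call _printf
        if addr in data_tokens:
            return data_tokens[addr]    # 0x00404002 ("DONE")  ->  aCmdDDDone
    return op   # registers, plain immediates and unknown offsets are left untouched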

All of our code examples (e.g., fig. 2.2(b)) are presented after these substitutions are made, to ease readability. This leaves us with two challenges:

1. Inter-procedural jumps (procedure calls): As our executables are stripped of procedure names and any other debug information, unless the procedure was imported (e.g., fprintf), we have no way of identifying and marking identical procedures in different executables. In section 2.3.4 we show how our rewrite engine addresses this problem.

2. Intra-procedural jumps: The use of the real address will cause a mismatch (e.g., loc_40132f in fig. 2.1(b) and loc_401339 in fig. 2.2(b) point to corresponding basic blocks). In section 2.3.2 we show how we extract tracelets that follow intra-procedural jumps.

2.3.2 Comparing Binary Code Procedures

Using the CFG to measure similarity When comparing procedures in binary form, it is important to realize that the linear layout of the binary code is arbitrary and as such provides a poor basis for comparison. In the simple case of an if-then-else statement (splitting execution for the true and false clauses), the choice made by the compiler as to which clause will come first in the binary is determined by parameters such as basic block size or the likelihood of a block being used. Another common example is the layout of switch-case statements, which, aside from affecting a larger portion of the code, also allows for multiple code structures to be employed: a balanced tree structure can be used to perform a binary search on the cases, or a different structure such as a jump (lookup) table can be used instead. To obtain a reasonable similarity metric, one must, at least, use the procedure’s control flow (alternatively, one can follow data and control dependencies as captured in a program dependence graph [Hor90]). Following this notion, we find it natural to define the “procedure similarity” between two procedures by extracting tracelets from the CFG of each procedure and measuring the coverage rate of the matching tracelets. This coincides with our goal to create


an accountable measure of similarity for assembly procedures, in which the matched tracelets retrieved during the process will be clearly presented, allowing further analysis or verification.

Procedure Similarity by Decomposition Algorithm 1 provides the high-level structure of our comparison algorithm. We first describe this algorithm at a high level and then present the details for each of its steps.

Algorithm 1: Calculating the similarity score for two procedures.
Input: T, Q - target and query procedures, NM - normalizing method (ratio or containment), k - tracelet size, α, β - threshold values
Output: IsMatch - true if the procedures are a match, SimilarityScore - the normalized similarity score
1  Algorithm ProceduresMatchScore(Q,T,NM,k,α,β)
2      QueryTracelets = ExtractTracelets(Q,k)
3      TargetTracelets = ExtractTracelets(T,k)
4      MatchCount = 0
5      foreach q ∈ QueryTracelets do
6          foreach t ∈ TargetTracelets do
7              AlignedInsts = AlignTracelets(q,t)
8              t′ = RewriteTracelet(AlignedInsts,q,t)
9              S = CalcScore(q,t′)
10             QIdent = CalcScore(q,q)
11             TIdent = CalcScore(t′,t′)
12             if Norm(S,QIdent,TIdent,NM) > β then
13                 MatchCount++
14         end
15     end
16     SimilarityScore = MatchCount / |QueryTracelets|
17     IsMatch = SimilarityScore > α

The algorithm starts by decomposing each of the two procedures into a set of k-tracelets (lines 2 and 3). We define the exact notion of a tracelet later in this section; for now it suffices to view each tracelet as a bounded sequence of instructions without intermediate jumps. The ExtractTracelets operation is explained below (algorithm 2). After each procedure has been decomposed into a set of tracelets, the algorithm proceeds to perform a pairwise comparison of tracelets. This is done by first aligning each pair (line 7, see section 2.3.3), and then trying to rewrite the target tracelet using the query tracelet (line 8, see section 2.3.4). We will later see that CalcScore and AlignTracelets actually perform the same operation and are separated for readability (see section 2.3.3). The tracelet similarity score is calculated using the identity similarity scores (the similarity score of a tracelet with itself) for the target and query tracelets (computed in lines 10 and 11), and one of two normalization methods (ratio or containment, line 12). Two tracelets are considered similar, or a “match” for each other, if their similarity score is above the threshold β. After all tracelets have been processed, the resulting number of similar tracelets is used to calculate the cover rate, which acts as the similarity score for the two procedures.


Finally, two procedures are considered similar if their similarity score is above the threshold α.

Extracting tracelets

We now formally define our working unit, the k-tracelet, and show how k-tracelets are extracted from the CFG. A k-tracelet is an ordered tuple of k sequences, each representing one of the basic blocks in a directed acyclic sub-path in the CFG, and containing all of the basic block’s assembly instructions excluding the jump instruction. Note that as all control-flow information is stripped, the out-degree of the last basic block (or any other basic block) in the tracelet is not used in the comparison process. This choice was made to allow greater robustness with regard to code changes and optimization side-effects, as exit nodes can be added and removed. Also note that the same k basic blocks can appear in a different order (in a different tracelet), if such sub-paths exist in the CFG. We only use small values of k (1 to 4), and because most basic blocks have 1 or 2 out-edges, a few of which are back-edges, such duplicate tracelets (up to order) are not very common. Algorithm 2 shows the algorithm for extracting k-tracelets from a CFG. The tracelets are extracted recursively from the nodes in the CFG. To extract all the k-tracelets from a certain node in the CFG, we compute all (k-1)-tracelets from any of its “sons” (successors), and use a Cartesian product (×) between the node and the collected tracelets. Algorithm 2 uses a helper function, StripJumps. This function takes a sequence of assembly instructions in which a jump may appear as the last instruction, and returns the sequence without the jump instruction.

Algorithm 2: Extract all k-tracelets from a CFG.
Input: G = ⟨B, E⟩ - control flow graph, k - tracelet size
Output: T - a list of all the tracelets in the CFG
Algorithm ExtractTracelets(G,k)
    result = ∅
    foreach b ∈ B do
        result ∪= Extract(b,k)
    end
    return result
Function Extract(b,k)
    bCode = {StripJumps(b)}
    if k = 1 then
        return bCode
    else
        return ⋃_{(b,b′)∈E} bCode × Extract(b′, k-1)
    end
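A direct Python transcription of algorithm 2 might look as follows. The cfg object with blocks and successors(), and instruction objects with a mnemonic field, are assumed interfaces for this sketch rather than the actual TRACY code:

def strip_jumps(block):
    # drop a trailing jump/branch so control-flow layout does not affect matching
    insts = list(block.instructions)
    if insts and insts[-1].mnemonic.startswith("j"):   # jmp, jz, jl, ...
        insts.pop()
    return tuple(insts)

def extract_tracelets(cfg, k):
    result = set()
    for b in cfg.blocks:
        result |= extract(b, cfg, k)
    return result

def extract(b, cfg, k):
    head = (strip_jumps(b),)
    if k == 1:
        return {head}
    # Cartesian product of this block with every (k-1)-tracelet of its successors;
    # paths shorter than k contribute nothing, matching algorithm 2
    return {head + rest for s in cfg.successors(b) for rest in extract(s, cfg, k - 1)}

Note that the recursion depth is bounded by k, so back-edges in the CFG are harmless here.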

2.3.3 Calculating Tracelet Match Score

Aligning tracelets with an LCS variation As assembly instructions can be viewed as text strings, one way to compare them is using a textual diff. This simulates the way a human


might detect similarities between tracelets. Comparing them in this way might work to some extent. However, a textual diff might decompose an assembly instruction and match each decomposed part to a different instruction. This may cause very undesirable matches such as “rorx edx,esi” with “inc rdi” (rorx is a rotate instruction, and rdi is a 64-bit register).

A common variation of LCS is the edit distance calculation algorithm, which allows additions and deletions by using dynamic programming and a “match value table” to determine the “best” match between two strings. The match value table is required, for example, when replacing a space, because doing so will be costlier than other replacement operations due to the introduction of a new “word” into the sentence. By declaring one of the strings as a source and the other as the target, the output of the algorithm can be extended to include matched, deleted, and inserted letters. The matched letters will be consecutive substrings from each string, whereas the deleted and inserted letters must be deleted from or inserted into the target in order to match it to the source string. A thorough explanation of edit distance is presented in [WF74].

We present a specially crafted variation on the edit distance. We treat each instruction as a letter (e.g., pop eax; is a letter) and use a specialized match value table, which is really a similarity measure between assembly instructions, to perform the tracelet alignment. It is important to note that in our method we give a high grade to perfect matches (instead of a 0 edit distance in the original method). For example, the score of comparing push ebp; with itself is 3, whereas the score of add ebp,eax; with add esp,ebx; is only 2. We compute the similarity between assembly instructions as follows:

$$\mathrm{Sim}(c, c') = \begin{cases} 2 + \left|\{\, i \mid \mathrm{args}(c)[i] = \mathrm{args}(c')[i] \,\}\right| & \text{if } \mathrm{SameKind}(c, c') \\ -1 & \text{otherwise.} \end{cases}$$

Here, c and c′ are the assembly instructions we would like to compare. As defined in section 2.2, we say that both instructions (our letters) are of the same “kind” if their mnemonic is the same and all of their arguments are of the same kind (this will prove even more valuable in our rewrite process). We then calculate the grade by counting the number of matching arguments, also giving 2 “points” for the fact that the “kinds” match. If the instructions are not the same kind, the match is given a negative score and won’t be favored for selection. This process enables us to accommodate code changes; for example, we can “jump over” instructions which were heavily changed or added, and match the instructions that stayed mostly similar except that one register was substituted for another. Note that this metric gives a high value to instructions with many arguments in common, and that some of these arguments were pre-processed to allow for better matches (section 2.3.1), while others, such as registers, are architecture dependent.

Using LCS-based alignment Given a query and a target tracelet, an algorithm for calculating the similarity score for the two tracelets is described in algorithm 3.


Algorithm 3: Calculating the similarity score for two tracelets.
Input: T, Q - target and query tracelets
Output: Score - tracelet similarity score
1  Algorithm CalcScore(T,Q)
2      A = InitMatrix(|T|, |Q|)
3      // Access outside the array returns a large negative value
4      for i = |T|; i > 0; i-- do
5          for j = |Q|; j > 0; j-- do
6              A[i, j] = Max(
7                  Sim(T[i], Q[j]) + A[i+1, j+1],   // match
8                  A[i+1, j],                       // insert
9                  A[i, j+1]                        // delete
10             )
11         end
12     end
13     Score = A[0, 0]
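In Python, the Sim measure and the dynamic program of algorithm 3 can be sketched as follows (0-based indexing; same_kind is the hypothetical helper sketched in section 2.2, and the zero border cells are a simplification of the pseudo-code’s out-of-range handling):

def sim(c1, c2):
    # 2 points for a matching "kind", plus 1 point per argument that is identical in place
    if not same_kind(c1, c2):
        return -1
    return 2 + sum(a.value == b.value for a, b in zip(c1.args, c2.args))

def calc_score(t, q):
    # bottom-up dynamic program over the flattened instruction sequences of two tracelets
    A = [[0] * (len(q) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) - 1, -1, -1):
        for j in range(len(q) - 1, -1, -1):
            A[i][j] = max(sim(t[i], q[j]) + A[i + 1][j + 1],  # match t[i] with q[j]
                          A[i + 1][j],                        # skip t[i]
                          A[i][j + 1])                        # skip q[j]
    return A[0][0]

def norm(s, q_ident, t_ident, method):
    # the two normalizations described below
    if method == "containment":
        return s / min(q_ident, t_ident)
    return (s * 2) / (q_ident + t_ident)   # "ratio"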

This specialized edit distance calculation process results in:

1. A set of aligned instruction pairs, one instruction from the query tracelet and one from the target tracelet.

2. The similarity score for the two tracelets, which is the sum of Sim(c, c′) for every aligned instruction pair c and c′.

3. A list of deleted and inserted instructions.

Note that the first two outputs were denoted AlignTracelets and CalcScore, respectively, in algorithm 1. Despite not being used directly in the algorithm, the inserted and deleted data (for a “successful” match) give us important information. Examining the inserted instructions will uncover changes made from the query to the target, such as new variables or changes to data structures. On the other hand, deleted instructions show us what was removed from the target, such as support for legacy protocols. Identifying these changes might prove very valuable for a human researcher, assisting with the identification of global data structure changes, or with determining that an underlying library was changed from one provider to another.

Normalizing tracelet similarity scores We used two different ways to normalize the tracelet similarity score (“Norm” in algorithm 1):

1. Containment: requiring that one of the tracelets be contained in the other. Calculation: S/min(QIdent, TIdent).

2. Ratio: taking into consideration the proportional size of the unmatched instructions in both tracelets. Calculation: (S ∗ 2)/(QIdent + TIdent).

Note that in both methods the normalized score is calculated using the similarity score for the query and target tracelets, and the identity scores for the two tracelets. Each method is better suited for different scenarios, but, as our experiments show, for the common cases they provide the same accuracy. Next we present a supplementary step that improves our matching process even further by performing argument “de-allocation” and rewrite.


2.3.4 Using the Rewrite Engine to Bridge the Gap

We first provide an intuitive description of the rewrite process, and then follow with a full algorithm accompanied by explanations.

Using aligned tracelets for argument de-allocation To further improve our matching process, and in particular to tackle the ripple effects of compiler optimizations, we employ an “argument de-allocation” and rewrite technique. First, we abstract away each argument of the target tracelet to an unknown variable. These variables are divided into three groups: registers, memory locations, and procedure names. Next, we introduce the constraints representing the relations between the variables, using data flow analysis, and the relations between the variables and the matched arguments. Arguments are matched using their linear position in matched instructions. Because query and target tracelets were extracted from real code, we use them as “hints” for the assignment by generating two sets of constraints: in-tracelet constraints and cross-tracelet constraints. The in-tracelet constraints preserve the relation between the data and arguments inside the target tracelet. The second stage introduces the cross-tracelet constraints, inducing argument equality for every argument in the aligned instructions. Finally, combining these constraints and attempting to solve them by finding an assignment with minimum conflicts is the equivalent of attempting to rewrite the target tracelet into the query tracelet, while respecting the target’s data flow and memory layout. Note that if all constraints are broken, the assignment is useless, as the score will not improve; but this means that the tracelets are not similar, and reducing their similarity score improves our method’s precision. Returning to the example of section 2.1, the full process of tracelet alignment and rewrite is shown in fig. 2.5. As we can see, in this example the target tracelet achieved a near perfect score for the new match (only “losing points” for the missing instruction if the ratio calculation was used).

Algorithm details Our rewriting method is shown in algorithm 4. First, we iterate over the aligned instruction pairs from T and Q. For every pair of instructions, |args(q)| = |args(t)| holds (this is required for aligned instructions; see section 2.3.3). For every pair of aligned arguments (aligned with regard to their linear position, e.g., eax ←→ ebx and ecx ←→ edx in add eax,ecx; add ebx,edx), we abstract the argument in the target tracelet. This is done using newVarName, which generates unique temporary variable names according to the type of the argument (such as the ones shown in fig. 2.5). Then, a cross-tracelet constraint between the new variable and the value of the argument from the query instruction is created and added to the constraint ϕ (line 7).

Then, read(t) is used to determine whether the argument st is read in t. Next, the algorithm uses the helper data structure lastWrite(st) to determine the last temporary variable name into which the argument st was written, and creates a dataflow constraint from the last write to the read (line 9). Otherwise, the algorithm checks whether t is a write instruction on st, using write(t), and updates lastWrite(st) accordingly. Finally, after creating all variables and constraints, we call the constraint solver (using solve, line 14), obtain a minimum-conflict solution, and use it to rewrite the target tracelet.

Algorithm 4: Rewrite target code to match query.
Input: AlignedInsts - a set of tuples with aligned query and target assembly instructions (note that aligned instructions have the same number of arguments), T, Q - target and query tracelets
Output: T′ - rewritten target code
1  Algorithm RewriteTracelet(AlignedInsts,T,Q)
2      foreach (t, q) ∈ AlignedInsts do
3          for i = 1; i ≤ |args(t)|; i++ do
4              st = args(t)[i]
5              sq = args(q)[i]
6              nv = newVarName(st)
7              ϕ = ϕ ∧ (nv = sq)
8              if st ∈ read(t) and lastWrite(st) ≠ ⊥ then
9                  ϕ = ϕ ∧ (nv = lastWrite(st))
10             else if st ∈ write(t) then
11                 lastWrite(st) = nv
12             end
13         end
14     end
15     vmap = solve(ϕ, symbols(Q))
16     T′ = ∅
17     foreach t ∈ T do
18         t′ = t
19         foreach st ∈ t do
20             if st ∈ vmap then
21                 t′ = t′[st ↦ vmap(st)]
22             end
23         end
24         T′ = T′ ∪ t′
25     end

Constraint solver details In our representation, we identify registers and allow them to be replaced by other registers. Our domain for the register assignment contains only registers found in the query tracelet. Generally, two operations that involve memory access may be considered similar when they perform the same access to the same variable. However, the problem of identifying variables in stripped binaries is known to be extremely challenging [BR07]. We therefore choose to consider operations as similar under the stricter condition that all the components involved in the computation of the memory address are syntactically identical. Our rewrite engine therefore adds constraints that try to match the components of memory-address computation across tracelets. The domains for the global offsets in memory and procedure call arguments again contain exactly the arguments of the query tracelet. Note that each constraint is a conjunction (lines 7 and 9), and when solving the constraint we are willing to drop conjuncts if the full constraint is not satisfiable. The constraint solving process uses a backtracking solver. This process is


Figure 2.7: A’s members are dissimilar to B’s. We will measure precision and recall on R=A.

bounded to 1000 backtracking attempts and returns the best assignment found, with minimal conflicts. The assignment is then used to rewrite the abstracted arguments in all of the matched instructions and, for completeness, a cache of the argument swaps is used to replace the values in the deleted instructions. When this process is complete, the tracelet’s match score is recalculated (using the same algorithm) to get the final match score. Arguably, after this process, the alignment of instructions might change, and repeating this process iteratively might help improve the score, but experiments show that subsequent attempts after the first run are not cost-effective.
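As a rough illustration of algorithm 4’s constraint generation, the sketch below reuses the hypothetical instruction model from the section 2.2 sketch; the greedy “solver” is only a toy stand-in for the bounded backtracking search for a minimum-conflict assignment described above:

def build_constraints(aligned_insts):
    # lines 2-13 of algorithm 4: abstract each target argument into a fresh variable and
    # record cross-tracelet (nv = sq) and in-tracelet dataflow (nv = lastWrite) conjuncts
    phi, last_write, counter = [], {}, 0
    for t, q in aligned_insts:               # aligned pairs have equally many arguments
        for st, sq in zip(t.args, q.args):
            counter += 1
            nv = "v%d" % counter             # stands in for newVarName(st)
            phi.append((nv, sq.value))       # cross-tracelet constraint
            if st.value in read(t) and st.value in last_write:
                phi.append((nv, last_write[st.value]))   # dataflow constraint
            elif st.value in write(t):
                last_write[st.value] = nv
    return phi

def solve_greedy(phi, query_symbols):
    # toy solver: keep each conjunct unless it conflicts with an earlier choice;
    # the real engine backtracks (up to 1000 attempts) to minimize broken conjuncts
    vmap = {}
    for var, val in phi:
        val = vmap.get(val, val)             # a value may itself be an earlier variable
        if val in query_symbols and var not in vmap:
            vmap[var] = val                  # conflicting conjuncts are simply dropped here
    return vmap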

2.4 Evaluation

2.4.1 Test-bed Structure

Test-bed design We built our testbed to cover all of the previously discussed cases in which we attempt to detect similarity: when code is compiled in a different context, or after a patch. Accordingly, we divided the executables into two groups:

• Context group: This group is composed of several executables importing and using the same library. We used executables from the Coreutils package, focusing on libraries involved with parsing command line parameters.

• Code Change group: This group includes procedures originating from several executables which are different versions of the same application. For example, we used wget versions 1.10, 1.12, and 1.14.

In the first step of our evaluation process we performed controlled tests: we built all the samples from the source code, used the same machines, and the same default configurations. This allowed us to perform extensive testing using different test configurations (i.e., different values of k and β). During this stage, as we had access to the source code, we could perform tests on edge conditions, such as deliberately switching the similar procedures between the executables (“implants”), and could measure code change at the source level to make sure the changes are reasonable.


K     #Tracelets   #Compares      #Tracelets per Procedure   #Instructions per Tracelet
k=1   229,250      1.586 × 10^8   12.839 [39.622]            5.738 [21.926]
k=2   211,395      1.456 × 10^8   11.839 [39.622]            11.091 [23.768]
k=3   188,133      1.284 × 10^8   10.536 [39.445]            16.518 [39.445]
k=4   166,867      1.139 × 10^8   9.345 [39.096]             22.072 [29.965]
k=5   147,634      1.008 × 10^8   8.268 [38.484]             27.618 [31.021]

Table 2.1: Test-bed statistics. For average values, the standard deviation is presented in square brackets.

We should note that when building the initial test set, the only requirement from our procedures was that they be “big” enough, that is, that they have more than 100 basic blocks. This was done to avoid over-matching of too-small procedures in the containment calculation.

Our general testing process is shown in fig. 2.7. From each group of similar procedures we chose a random representative and then tested it against the entire testbed. We required our similarity classifier to find all similar procedures (the other procedures from its original similarity group) with high precision and recall.

In the second stage we expanded each test group with executables downloaded from different internet repositories (known to contain true matches, i.e., procedures similar to the ones in the group), and added a new “noise” group containing randomly downloaded procedures from the internet. At this stage we also implicitly added executables containing procedures that are both in a different context and have undergone code change (different versions). An example of such samples is shown in section 2.5.2.

Test-bed statistics In the final composition of our testbed (at the beginning of the second stage described above), we had a total of over a million procedures in our database. At that point we gathered some statistics about the collected procedures. The gathered information gives an interesting insight into the way modern compilers perform code layout, and into the number of comparisons that have to be performed. These statistics are shown in table 2.1. A slightly confusing statistic in this table is the number of tracelets as a function of k. The number of tracelets is expected to grow as a function of their length (the value of k) in “ordinary” graph structures such as trees, where nodes might have a high out-degree. This is not the case for a structure such as a CFG: the average in-degree of a node in a CFG is 0.9221 (STD = 0.2679), and the average out-degree is 0.9221 (STD = 1.156). This, in addition to our omitting paths shorter than k when extracting k-tracelets, caused the number of tracelets to decline for higher values of k. The number of instructions in a tracelet naturally rises, however. Note the very high number of compare operations required to test similarity against a reasonable database, and that the average size of a basic block (or a 1-tracelet) is 6 instructions. Fortunately, this process can be easily parallelized (section 2.4.2).


2.4.2 Prototype Implementation

We have implemented a prototype of our approach in a tool called TRACY. In addition to tracelet-based comparison, we have also implemented the other aforementioned comparison methods, based on a sliding window of mnemonics (n-grams) and on graphlets. This was done in an attempt to test the different techniques on the same data, measuring precision and recall (and ignoring performance, as our implementation of the other techniques is non-optimized). Our prototype obtains a large number of executables from the web and stores them in a database. A tracelet extraction component then disassembles and extracts tracelets from each procedure in the executables stored in the database, creating a search index. Finally, a search engine allows us to use different methods to compare a query (a procedure given in binary form, as a part of an executable) to the disassembled binaries in the index. Our prototype was implemented in Python, using IDA Pro [HR] for disassembly. The full source code can be found at https://github.com/Yanivmd/TRACY. The prototype was deployed on a server with four quad-core Intel Xeon CPU E5-2670 (2.60GHz) processors and 188 GiB of RAM, running Ubuntu 12.04.2 LTS.

Optimizations and parallel execution As shown in table 2.1, the number of tracelet compare operations is huge, but as similarity score calculations on pairs of tracelets are independent, they can be done in parallel. One important optimization is first performing the instruction alignment (using our LCS algorithm) with respect to the basic-block boundary. This reduces the calculation’s granularity and allows for even more parallelism. Namely, when aligning two tracelets, (1, 2, 3) and (1′, 2′, 3′), instructions from 1 can only be matched and aligned with instructions from 1′. This allowed us to use 1-tracelets (namely, basic blocks) to cache scores and matches, and then use these alignments in the k-tracelet’s rewriting process. This optimization doesn’t affect precision, because the rewrite process uses all of the k nodes in the tracelet, followed by a full recalculation of the score (which doesn’t use any previous data). Furthermore, to better adapt our method to our server’s NUMA architecture, we statically split the tracelets of the representative procedure (which is compared with the rest of the procedures in the database) across the server’s nodes. This allowed for better use of the cores’ caches and avoided memory overheads.
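Because the pairwise scores are independent, the comparison loop parallelizes trivially. A sketch using Python’s standard library (not the actual TRACY implementation; align_tracelets, rewrite_tracelet, and calc_score stand for the routines of sections 2.3.3 and 2.3.4):

from concurrent.futures import ProcessPoolExecutor
from itertools import product

def score_pair(pair):
    q, t = pair
    aligned = align_tracelets(q, t)        # assumed: the LCS-based alignment of section 2.3.3
    t2 = rewrite_tracelet(aligned, q, t)   # assumed: the rewrite engine of section 2.3.4
    return calc_score(q, t2)

def all_pair_scores(query_tracelets, target_tracelets, workers=16):
    pairs = list(product(query_tracelets, target_tracelets))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pairs are independent, so they can be scored on all cores simultaneously
        return list(pool.map(score_pair, pairs, chunksize=256))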

2.4.3 Test Configuration

Similarity calculation configuration The following parameters require configuration:

• k, the number of nodes in a tracelet.

• The “tracelet match barrier” (β in algorithm 1), a threshold parameter above which a tracelet will count as a match for another.

• The “procedure coverage rate match barrier” (α in algorithm 1), a threshold parameter above which a procedure will be considered similar to another one.

Because the ROC (and CROC) methods attempt all values for α, for every reasonable value of k (1 to 4) we ran 10 experiments testing all values of β (we checked values from 10 to 100 percent match, in 10-percent intervals) to discover the best values to use.


β value      10-20   30     40     50     60     70-90   100
AUC [CROC]   0.15    0.23   0.45   0.78   0.95   0.99    0.91

Table 2.2: CROC-AUC scores for the tracelet-based matching process with different values of β.

The best results for each parameter are reported in the next section. For our implementation of the other methods, we used the best parameters as reported in [KMA13]: n-gram windows of 5 instructions with a 1-instruction delta, and k=5 for graphlets.

2.5 Results

2.5.1 Comparing Different Configurations

Testing different thresholds for tracelet-to-tracelet matches Table 2.2 shows the results of using different thresholds for 3-tracelet to 3-tracelet matches. Higher threshold values (excluding 100) lead to better accuracy. This makes sense, as requiring a higher similarity score between the tracelets means that we only match truly similar tracelets. All threshold values between 70 and 90 percent give the same accuracy, suggesting that similar tracelets score above 90 and that dissimilar tracelets score below 70. Our method thus allows for a big “safety buffer” between the similar and dissimilar ranges, and is therefore very robust. Finally, we see that for 100 (requiring a perfect syntactic match), we get a lower score. This makes sense, as we set out to compare “similar” tracelets knowing (even after the rewrite) that they still contain code changes which cannot (and should not) be matched.

Testing different values of k The last parameter in our tests is k (the number of basic blocks in a tracelet). For each value of k, we attempted every threshold for tracelet-to-tracelet matching (as done in the previous paragraph). Using k = 1, we achieved a relatively low CROC AUC of 0.83. Trying instead k = 2 yielded 0.91, while k = 3 yielded 0.99. The results for k = 4 are similar to the results of k = 3 and are only slightly more expensive to compute (the number of 4-tracelets is similar to that of 3-tracelets; see table 2.1). In summary, using a higher number for k is crucial for accuracy. The less accurate results for lower k values are due to their leading to shorter tracelets, which have fewer instructions to match and fewer constraints (especially in-tracelet), resulting in lower precision.

Using ratio and containment calculations Our experiments show that this parameter does not affect accuracy. This might be because containment and ratio are traits of whole procedures, and so are marginal in the scope of a single tracelet.

Detecting vulnerable procedures After our system’s configuration was calibrated, we set out to employ it for its main purpose: finding vulnerable executables. To make this test interesting, we looked for vulnerabilities in libraries, because such vulnerabilities affect multiple applications. One example is CVE-2010-0624 [Hea10]; this vulnerability is an exploitable heap-based buffer overflow affecting GNU tar (up to 1.22) and GNU cpio (up to 2.10).


             N-grams (size 5, delta 1)   Graphlets (k=5)   Tracelets (k=3, ratio)   Tracelets (k=3, containment)
AUC [ROC]    0.7217                      0.5913            1                        1
AUC [CROC]   0.2451                      0.1218            0.99                     0.99

Table 2.3: Accuracy for tracelet-based procedure similarity vs. graphlets and n-grams, using ROC and CROC. These results are based on 6 different experiments.

We compiled a vulnerable version of tar on our machine and scanned package repositories, looking for vulnerable versions of cpio and tar. This resulted in our system successfully pinpointing the vulnerable procedure in the tar 1.22 packages, the older tar 1.21 packages, and packages of cpio 2.10. (Older packages of cpio and tar were not found at all and so could not be tested.) Table 2.3 summarizes the results of 6 different, carefully designed experiments, using the following representatives:

• The quotearg_buffer_restyled procedure from wc v6.12 (a library procedure used by multiple Coreutils applications).

• The same procedure from wc v7.6 “implanted” in wc v8.19.

• getftp from wget v1.10.

• The vulnerable procedure from tar (described above).

• Two random procedures selected from the “noise group”.

Each experiment considers a single query (procedure) against the DB of over a million examples drawn from standard Linux utilities. For each executables group, the true positives (the similar procedures) were manually located and classified as such, while the rest were assumed to be true negatives. Although the same procedure should not appear twice in the same executable, every procedure which yielded a high similarity grade was manually checked to confirm it is a false positive. The challenge is in picking a single threshold to be used in all experiments, as changing the threshold between queries is not a realistic search scenario. Each entry in the table is the area under the curve, which represents the best-case match score of the approach. ROC curves (and their improvement, CROC) allow us to compare different classifiers in a “fair” way, by allowing the classifier to be tested with all possible thresholds and plotting them on a curve. For example, the value 0.7217 for ROC with n-grams (size=5, delta=1) is the best accuracy (see definition in section 1.4) obtained for any choice of threshold for n-grams. In other words, the best accuracy achievable with n-grams with these parameters is 0.7217, in contrast to 0.99 accuracy in our approach. This is because n-grams and graphlets use a coarse matching criterion not suited for code changes. This could also be attributed to the fact that they were targeted at a different classification scenario, where the goal is to find only one matching procedure, whereas we strive to find multiple similar procedures. (The former scenario requires that the matched procedure be the top match or in the top 10 matches.) When running an experiment in


Figure 2.8: Matching a procedure across different contexts and after code changes.

that scenario, the other methods do get better, though still inferior, results (∼ 90% ROC, ∼ 50% CROC).

2.5.2 Using the Rewrite Engine to Improve Tracelet Match Rate

Figure 2.8 shows executables containing a procedure which was successfully classified as similar to quotearg_buffer_restyled (from the Coreutils library), compiled locally inside wc during the first stage of our experiments. Following is a breakdown of the procedures detected:

• The exact procedure compiled locally into different executables (in the controlled test stage), such as ls, rm, and chmod.

• The exact procedure “implanted” in another version of the executable. For example, wc_from_7.6 means that the version of quotearg_buffer_restyled from the coreutils 7.6 library was implanted in another version (in this case 8.19).

• Different versions of the procedure found in executables downloaded from internet repositories, such as deb wc 8.5 (the wc executable, Debian repository, coreutils v8.5).

To present the advantages of the rewrite engine, we separated the percentage of tracelets that were matched before the rewrite from the ones that could only be matched after the rewrite. Note that both match scores are obtained using our LCS-derived match scoring algorithm, run once before the rewrite and once after. An average of 25% of the total number of tracelets were matched as a result of the rewriting process. It should be noted that true positive and negative information was gathered using manual similarity checks and/or using debugging information (comparing procedure names, as we assume that if a procedure changed its name, it will probably have changed its purpose, since we are dealing with well-maintained code).


Item        Operation    Average   STD      Median    Min       Max
Tracelet    Align        0.0015    0.0022   0.00071   0.00024   0.092
Tracelet    Align & RW   0.0035    0.0057   0.0025    0.00052   0.23
Procedure   Align        3.8       12.2     1.6       0.076     627
Procedure   Align & RW   8.6       25.78    2.8       0.087     1222

Table 2.4: Runtimes of tracelet-to-tracelet and procedure-to-procedure comparisons, with and without the rewrite engine. All times are in seconds.


2.5.3 Performance

Our proof-of-concept implementation is written in Python and does not employ optimizations (not even caching, clustering, or other common search-engine optimization techniques). As a result, the following reported running times should only be used to get a grasp on the amount of work performed in each step of the process, to show an upper limit, and to achieve a better understanding of where optimizations should be employed. Analyzing complete matching operations (which may take from milliseconds to a couple of hours) shows that the fundamental operation in our process, comparing two tracelets, greatly depends on the composition of the tracelets being compared. If the tracelets are an exact match or are not remotely similar, the process is very fast because, during alignment, we do not need to check multiple options, and the rewrite engine (which relies on the alignment and as such cannot be measured independently) does not need to do much work. When the tracelets are similar but are not a perfect match, the rewrite engine has a chance to really assist in the matching process and as such requires some runtime. Table 2.4 summarizes tracelet comparison, with and without the rewrite operation, showing that almost half of the time is spent in the rewrite engine (on average). Also presented in the table is the runtime of a complete procedure-to-procedure comparison. It should be noted that the runtimes greatly depend on the size of the procedures being compared; the shown runtimes were measured on procedures containing ∼200 basic blocks. Postmortem analysis of the results of the experiments described earlier shows that tracelets with a matching score below 50%, which comprised 85% of the cases, will not be improved by the rewrite engine, and so a simple yet effective optimization would be to skip the rewriting attempts in such cases.

2.6 Limitations

During our experiments, we observed some limitations of our method, which we now describe.

Different optimization levels We found that when compiling source code using the O1 optimization level, the resulting binary can be used to find the O1, O2, and O3 versions. However, O0


and Os are very different and are not found. Nonetheless, when we have the source code for the instance we want to find, we can compile it with all the optimization levels and search them one by one.

Cross-domain assignments A common optimization is replacing an immediate value with a register already containing that value. Our method was designed so that each symbol can only be replaced with another in the same domain. Our system could also search cross-domain assignments, but this would notably increase the cost of performing a search. Further, our experiments show very good precision even when this option is disabled.

Mnemonic substitution Because our approach requires compared instructions’ mnemonics to be the same in the alignment stage, if a compiler were to select a different mnemonic, the matching process would suffer. Our rewrite engine could be extended to allow common substitutions; however, handling the full range of instruction-selection transformations might require a different approach.

Matching small procedures Our experiments use procedures with a minimum of 100 basic blocks. Attempting to match smaller procedures will often produce bad results. This is because we require a percentage of the tracelets to be covered. Some tracelets are very common (leading to false positives), while slight changes to others might result in major differences that cannot be evened out by the other tracelets. Furthermore, small procedures are sometimes inlined.

Dealing with inlined procedures Inlining is a problem in two cases: when the target procedure is inlined, and when procedures called inside the query procedure are inlined into it. Some of these situations can be handled, but only to a certain extent, by the containment normalization method.

Optimizations that duplicate code Compilers duplicate code to avoid jumps, for example when unrolling loops. Similarly to inlined procedures, our method can manage these optimizations when the containment normalization method is used.


Chapter 3

Statistical Similarity of Binaries

In this chapter we improve upon our previous approach (chapter 2), handle some of its limitations, and expand the supported search scenarios.

Problem definition Improve the measure developed in chapter 2 by providing a high similarity score for a procedure t ∈ T when q and t originated from compiling the code in any combination of the following scenarios: (i) different compiler versions, (ii) different compiler vendors, and (iii) the same or a different version of the same source code.

In this chapter we introduce two important building blocks: (i) a strand-based representation for binary procedures, and (ii) a statistical framework used to establish similarity between strands.

3.1 Background: Similarity By Composition

We draw inspiration from Boiman and Irani’s [BI06] work on image similarity, where the key idea is that one image is similar to another if it can be composed using regions of the other image, and that this similarity can be quantified using statistical reasoning. Figure 3.1(a) and (b) illustrate the idea of similarity by composition for images. Looking at the three images in fig. 3.1(a), we wish to determine which of the two query images, iq1 and iq2, is more similar to the target image, it. Our intuition is that iq2 is probably more similar to it. Figure 3.1(b) provides some explanation for this intuition. Looking at this figure, we identify similar regions in iq2 and it (marked by the numbered outlined segments). Although the images are not identical, their similarity stems from the fact that query image iq2 can be composed using significant (and maybe transformed) regions from target image it. In contrast, image iq1 shares only two small regions with it, and indeed we intuitively consider it to be different. These vague notions are made precise within a statistical framework that lifts similarity between significant regions into similarity between images.

In this thesis, we show that the general framework of “similarity by composition” can also be applied to code. Specifically, we consider assembly code extracted from stripped binaries (with no debug information). Figure 3.1(c), (d), and (e) show partial assembly code of three procedures. Snippets (c) and (e) are taken from the OpenSSL procedure vulnerable to “Heartbleed”, and were compiled using different compilers: gcc 4.9 and Clang 3.5, respectively. Snippet (d) is taken from an unrelated procedure in Coreutils and compiled using gcc 4.9. For simplicity and brevity, we only present a small part of the code.

[Figure 3.1 originally shows the target image it with its similar regions (panels (a) and (b)), the query images iq1 and iq2 with their numbered similar regions, and three assembly panels: (c) target code t, (d) query code q1, and (e) query code q2. The graphics are not reproduced here.]

Figure 3.1: Image vs. Code Similarity by Composition. Query image iq2 is similar to the target image it. Query code q2 is similar to the target code t. (Images courtesy of Irani et al. [BI06].)

Finding similarity between the procedures using syntactic techniques is challenging, as different compilers can produce significantly different assembly code. Instead, we decompose each procedure into small code segments we name strands (a strand is a basic-block slice, as explained in section 3.3.2), semantically compare the strands to uncover similarity, and lift the results to whole procedures.

In fig. 3.1(c) & (e), the query code, q2, and the target code, t, share three matching strands, numbered in the figure as 1, 2, and 3. Each strand is a sequence of instructions, and strands are considered matches when they perform an equivalent computation. In the figure, we mark matching strands using the same circled number. Two syntactically different strands can be equivalent. For example, the strands numbered 2 in q2 and t differ syntactically but are equivalent (up to renaming and ignoring the change of r9). Furthermore, strands need not be syntactically contiguous. This is because they are based on data-flow dependencies rather than on syntactic properties. For example, strand 3 in the query procedure q2 (mov r12, rbx; lea rdi, [r12+3]) matches the strand mov r13, rbx; lea rcx, [r13+3] in the target procedure t. In contrast, the code in fig. 3.1(d) only matches the single strand 1 in the target.


(a) gcc v4.9 -O3:            (b) icc v15.0.1 -O3:
mov r9, 13h                  lea r14d, [r12+13h]
mov r12, rax                 mov r13, rax
mov eax, ebx                 mov eax, r12d
add rbp, 3                   lea rcx, [r13+3]
mov rsi, rbp                 shr eax, 8
lea rdi, [r12+3]             lea rsi, [rbx+3]
mov [r12+2], bl              mov [r13+1], al
lea r13d, [rbx+r9]           mov [r13+2], r12b
shr eax, 8                   mov rdi, rcx
mov [r12+1], al              call memcpy
call _intel_memcpy           mov ecx, r14d
add r9, 5h                   mov esi, 18h
mov esi, r9d                 mov eax, ecx
mov ecx, r13d                add eax, esi
mov eax, ecx                 call write_bytes
add eax, esi                 test eax, eax
call write_bytes             js short loc_2A38
mov ebx, eax
test ebx, ebx
jl short loc_342E

Figure 3.2: Heartbleed vulnerability code snippets.

3.2 Overview

In this section, we illustrate our approach informally using an example. Given a query procedure from a stripped binary, our goal is to find similar procedures in other stripped binaries. To simplify presentation, we illustrate our approach on two code snippets instead of full procedures. Consider the assembly code of fig. 3.2(a). This snippet is taken from a version of OpenSSL which is vulnerable to the Heartbleed bug [Hea14]. Our goal is to find similar vulnerable code snippets in our code-base. We would like to define a notion of similarity that can find matches even when the code has been modified, or compiled with different compiler vendors and versions. For example, the code of fig. 3.2(a) was compiled using gcc v4.9, and the code of fig. 3.2(b), originating from the same source, was compiled using icc v15.0.1. We would like our approach to find the similarity of these two snippets despite their noticeable syntactic difference. We focus on similarity rather than equivalence, as we would like our approach to apply to code that may have been patched. Towards that end, we compute a similarity score that captures similarity between (small) partial computations performed by each procedure. The main idea of our approach is to decompose the code into smaller fragments, for which similarity is easy to compute, and to use statistical reasoning over fragment similarity to establish the global similarity between code snippets. Towards that end, we have to answer three design questions:

• What is the best way to decompose the code snippets?

• How should the decomposed fragments be compared?

• How can fragment similarity be lifted into snippet similarity?

Decomposition into strands We decompose a procedure into fragments which are feasible to compare. In this work, we use strands (partial dependence chains) as the basic unit. Figure 3.3 shows two strands obtained from the code snippets of fig. 3.2.


assume r12q == rbxt

    query strand sq               target strand st
                                  v1t = 13h
                                  r9t = v1t
    v1q = r12q                    v2t = rbxt
    v2q = 13h + v1q               v3t = v2t + v1t
    v3q = int_to_ptr(v2q)         v4t = int_to_ptr(v3t)
    r14q = v3q                    r13t = v4t
    v4q = 18h                     v5t = v1t + 5
    rsiq = v4q                    rsit = v5t
    v5q = v4q + v3q               v6t = v5t + v4t
    raxq = v5q                    raxt = v6t

assert v1q == v2t, v2q == v3t, v3q == v4t, r14q == r13t,
       v4q == v5t, rsiq == rsit, v5q == v6t, raxq == raxt

Figure 3.3: Semantically similar strands.

For now, ignore the assume and assert operations added around the strands. The strands in the figure have been transformed to an intermediate verification language (IVL), by employing tools from [BJAS11, SMA]. The IVL abstracts away from specific assembly instructions, while maintaining the semantics of the assembly code. A fresh temporary variable is created for every intermediate value computed throughout the execution. The flow of data between registers is always through these temporaries. Furthermore, the IVL always uses the full 64-bit representation of registers (e.g., rax and not eax) and represents operations on part of a register using temporaries and truncation (e.g., mov rbx, al will be v1 = truncate(rax,8); rbx = v1;). The strands in the figure have been aligned such that similar instructions of the two strands appear side by side. This alignment is only for the purpose of presentation; our comparison is based on semantically comparing the strands. We added q and t suffixes to the strands’ variables to separate the name spaces of variables in the two strands and to specify one as the query and the other as the target. This allows us to create a joint program that combines the variables from the two strands and makes assumptions and assertions about their equality.

Comparing a pair of strands To compare a pair of strands, we create a joint program that combines them, but has a separate name space for the variables of each strand. We then explore the space of equality assumptions on the different inputs, and check how these assumptions affect the equality assertions on the outputs. For example, one choice for assumptions and assertions is shown in fig. 3.3. Technically, given a pair of strands, we perform the following steps: (i) add equality assumptions over inputs of the two strands, (ii) add assertions that check the equality of all output variables (where output variables also include temporaries), and (iii) check the assertions using a program verifier and count how many variables are equivalent.

    query strand                  target strand
    v1q = r14q                    v1t = r14t
    v2q = v1q + 1                 v2t = v1t + 16
    v3q = xor(v2q, v1q)           v3t = xor(v2t, v1t)
    v4q = and(v3q, v2q)           v4t = and(v3t, v2t)
    v5q = (v4q < 0)               v5t = (v4t < 0)
    FLAGS[OF]q = v5q              FLAGS[OF]t = v5t

Figure 3.4: Syntactically similar but semantically different strands.


Choosing which variables to pair when assuming and asserting equality is solved by searching the space of possible pairs. The choice of the strand as a small unit of comparison (with a relatively small number of variables), along with verifier-based optimizations (described in section 3.5.4), greatly reduces the search space, making the use of a verifier feasible.
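For illustration, such a joint program can be emitted mechanically. The sketch below is a hypothetical text generator in a Boogie-like syntax, not the actual tool pipeline; the strand bodies are assumed to be already renamed apart:

def joint_program(query_body, target_body, input_pairs, output_pairs):
    # assume input equalities, inline both renamed-apart strand bodies,
    # then assert output equalities; the verifier reports which assertions hold
    lines = ["assume %s == %s;" % p for p in input_pairs]
    lines += query_body + target_body
    lines += ["assert %s == %s;" % p for p in output_pairs]
    return "\n".join(lines)

# a prefix of fig. 3.3, for illustration:
print(joint_program(
    query_body=["v1q := r12q;", "v2q := 13h + v1q;"],
    target_body=["v1t := 13h;", "v2t := rbxt;", "v3t := v2t + v1t;"],
    input_pairs=[("r12q", "rbxt")],
    output_pairs=[("v1q", "v2t"), ("v2q", "v3t")],
))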

Match Probability: We define an asymmetric similarity measure between a query strand, sq, and a target strand, st, as the percentage of variables from sq that have an equivalent counterpart in st. We denote this measure by VCP(sq, st). We later use this measure as a basis for computing the probability Pr(sq|st) that a strand sq is input-output equivalent to a strand st (see section 3.3.3). For example, taking sq to be the strand on the left-hand side of fig. 3.3, and st to be the strand on the right-hand side, VCP(sq, st) = 1, because all 8 variables from the left-hand side have an equivalent variable on the right-hand side. However, in the other direction, VCP(st, sq) = 8/10. We note that no previous approach is able to produce such a matching, as the equivalent values are computed using different instructions.

In contrast to fig. 3.3, the strands of fig. 3.4 (taken from our corpus as specified in section 3.5.1) are very similar syntactically (they differ in just one character), but differ greatly semantically. Syntactic approaches would typically classify such pairs as matching, leading to a high rate of false positives. Our semantic comparison identifies that these two strands are different, despite their significant syntactic overlap, and yields VCP = 1/6, expressing the vast difference.
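Given the set of equivalence assertions that the verifier proved, VCP is a simple ratio. A minimal sketch, using the variable names of fig. 3.3:

def vcp(query_vars, proved_pairs):
    # VCP(sq, st): fraction of sq's variables with a proven equivalent counterpart in st
    matched = {q for q, _ in proved_pairs}
    return len(matched & set(query_vars)) / len(query_vars)

# fig. 3.3: every one of the 8 query variables is matched, so VCP(sq, st) = 1
q_vars = ["v1q", "v2q", "v3q", "r14q", "v4q", "rsiq", "v5q", "raxq"]
pairs = [("v1q", "v2t"), ("v2q", "v3t"), ("v3q", "v4t"), ("r14q", "r13t"),
         ("v4q", "v5t"), ("rsiq", "rsit"), ("v5q", "v6t"), ("raxq", "raxt")]
assert vcp(q_vars, pairs) == 1.0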

Local Evidence of Similarity: Modeling strand similarity using probability allows us to express other notions of similarity in a natural manner. For example, given a target procedure t and a query strand sq, we can capture how well sq can be matched in t by computing the maximal probability of Pr(sq|st) over any possible strand st in t. We further define the probability, Pr(sq|H0), of finding a matching strand for sq at random (where H0 represents all possible strands). The significance of finding a match for sq in t can then be defined as:

$$\mathrm{LES}(s_q \mid t) = \log \frac{\max_{s_t \in t} \Pr(s_q \mid s_t)}{\Pr(s_q \mid H_0)}.$$

LES provides a measure of the significance of the matching of sq with t by comparing it to the matching of sq with the random source H0. It is important to measure the significance of a match, because many strand matches may be due to common strands introduced by the compiler (e.g., prolog/epilog), and are therefore not significant in determining a match between procedures.

Lifting strand similarity into procedure similarity We say that two procedures are similar if one can be composed using (significantly similar) parts of the other. Given a query procedure q and a target procedure t, the LES(sq|t) measure indicates how significant the match of sq is in t. We can therefore define global evidence of similarity (GES) between the procedures q and t by summing LES(sq|t) over all strands sq in q (as further explained in section 3.3.1). The local and global similarity evidence allow us to lift semantic similarity computed between individual strands into a statistical notion of similarity between procedures.
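Assuming Pr(sq|st) and Pr(sq|H0) are already estimated (section 3.3.3), LES and GES reduce to a few lines; pr and pr_h0 below are placeholders for those estimators:

import math

def les(s_q, target_strands, pr, pr_h0):
    # LES(sq|t): log of the best match probability in t, relative to the random source H0
    best = max(pr(s_q, s_t) for s_t in target_strands)
    return math.log(best / pr_h0(s_q))

def ges(query_strands, target_strands, pr, pr_h0):
    # GES(q|t): eq. (3.1), summing the local evidence over all query strands
    return sum(les(s_q, target_strands, pr, pr_h0) for s_q in query_strands)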


Key Aspects of our approach

• Similarity by composition: we decompose procedures into strands, semantically compare strands, and lift strand similarity into procedure similarity using statistical techniques.

• Using strand similarity allows us to establish similarity between procedures even when the procedures are not equivalent, but still contain statistically significant similar strands.

• By performing semantic comparison between strands, we are able to find similarity across different compiler versions and vendors (without knowledge of the specifics of the compilers). In section 3.5 we compare our approach to previous work and show the importance of semantic comparison.

3.3 Strand-Based Similarity by Composition

In this section, we describe the technical details of our approach. In section 3.3.1, we describe the steps for generating statistical procedure similarity. In section 3.3.2 we illustrate how we decompose procedures into the single-path units of execution we call strands (step 1). We defer the exact details of computing similarity between strands (step 2) to section 3.4, and assume that such a similarity measure is given. In section 3.3.3 we define the likelihood and statistical significance of a strand, allowing us to quantify similarity using these probabilistic tools (step 3). In section 3.3.4, we describe how whole procedure similarity is computed from statistical similarity of strands, using global and local evidence scores (step 4).

3.3.1 Similarity By Composition

We determine that two procedures are likely to be similar if non-trivial strands from one can be used to compose the other, allowing for some (compiler or patch related) transformation. We decompose the procedure to basic blocks, and then further split each block into strands using a slicing technique, as described in section 3.3.2. We check for strand similarity with the aid of a program verifier, as further shown in section 3.4. This gives us the flexibility to determine similarity between two strands, i.e., that they are (partially) input-output equivalent, even if they produce their result using different instructions, register allocation or program ordering. We then define a Global Evidence Score, GES(q|t), representing the likelihood that a query procedure q is similar to a target procedure t. This global score is based on the composition of

multiple Local Evidence Scores (denoted LES(sq|t)) representing the sum of likelihoods that

each strand sq ∈ q has a similar strand st ∈ t:

GES(q \mid t) = \sum_{s_q \in q} LES(s_q \mid t) = \sum_{s_q \in q} \log \frac{\max_{s_t \in t} \Pr(s_q \mid s_t)}{\Pr(s_q \mid H_0)}. \quad (3.1)

The right-hand side of eq. (3.1) shows that the LES for strand sq is calculated using the

ratio between two factors: (i) the probability, Pr(sq|st), that sq is semantically similar to one

of the strands st ∈ t, where st is the strand that produces the highest probability (i.e., is most


similar to sq), and (ii) Pr(sq|H0), the probability that this strand matches a random source. In our context, this means that the strand is commonly found in the corpus binaries and does not uniquely identify the query procedure. This is further described in section 3.3.3.

3.3.2 Procedure Decomposition to Strands

We use a standard CFG representation for procedures. We decompose the procedure by applying slicing [Wei81] on the basic-block level. A strand is the set of instructions from a block that are required to compute a certain variable’s value (backward slice from the variable). Each block is sliced until all variables are covered. As we handle each basic block separately, the inputs for a block are variables (registers and memory locations) used before they are defined in the block. Figure 3.3 is an example of a pair of strands extracted from the blocks in fig. 3.2.

Strands as partial program dependence graphs (PDGs) Strands are a practical compromise over enumerating all paths in the PDG [FOW87]. All strands are obtained by decomposing the PDG at block boundaries. This means that strands only contain data dependencies, as control dependencies exist only over block boundaries, while severed data dependencies (e.g., values created outside the block) are marked as such and used in the comparison process (these are the inputs, as explained later in algorithm 5). This approach of decomposing graphs while making use of loose edges to improve performance is similar to the "extended graphlets" of [KMA13], yet there the edge carries less information, as it is not connected to a certain variable. Decomposing at the block level yielded precise results for most of our benchmarks. However, using longer paths can be advantageous when handling small procedures, where breaking at the block level results in a small number of short blocks that can be easily matched with various targets. Section 3.6.6 further discusses benchmarks for which the small number of blocks yields low accuracy for Esh.

Algorithm 5 uses standard machinery to extract strands from a basic block, and uses the standard notions of Def and Ref for the sets of variables defined and referenced (respectively) in a given instruction. The process starts by putting all instructions in unusedInsts, and ends only when this list is empty, i.e., when every instruction is marked as having been used in at least one extracted strand. The creation of a new strand begins by taking the last unused instruction and initializing the list of variables referenced in the strand – varsRefed – and the variables defined in the strand – varsDefed. Next, all of the previous instructions in the basic block are iterated backwards, adding any instruction that defines a variable referenced in the strand so far (i.e., is in varsRefed), and updating varsRefed and varsDefed with every instruction added. When the for loop finishes, the new strand is complete, as every instruction needed to calculate all of the variables defined inside the basic block is present. This does not include the inputs, which are any variables used in the calculation and not defined in the basic block. Note that the backward iteration is crucial for minimizing the number of strands.


Algorithm 5: Extract Strands from a Basic Block.
Input: b – an array of instructions for a basic block
Output: strands – b's strands, along with their inputs

unusedInsts ← {1, 2, ..., |b|}; strands ← [];
while unusedInsts ≠ ∅ do
    maxUsed ← max(unusedInsts); unusedInsts \= maxUsed;
    newStrand ← [b[maxUsed]];
    varsRefed ← Ref(b[maxUsed]); varsDefed ← Def(b[maxUsed]);
    for i ← (maxUsed − 1) .. 0 do
        needed ← Def(b[i]) ∩ varsRefed;
        if needed ≠ ∅ then
            newStrand += b[i];
            varsRefed ∪= Ref(b[i]); varsDefed ∪= needed;
            unusedInsts \= i;
    inputs ← varsRefed \ varsDefed;
    strands += (newStrand, inputs);
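For concreteness, the following Python sketch (ours, not part of the thesis's artifact) implements algorithm 5, assuming per-instruction Ref/Def oracles are supplied by the caller:

import typing

def extract_strands(block: list, ref: typing.Callable, deff: typing.Callable):
    """Decompose a basic block into strands (algorithm 5).

    block: list of instructions; ref(i)/deff(i) return the sets of
    variables referenced/defined by instruction i. Returns a list of
    (strand, inputs) pairs.
    """
    unused = set(range(len(block)))
    strands = []
    while unused:
        max_used = max(unused)            # last not-yet-covered instruction
        unused.remove(max_used)
        strand = [block[max_used]]
        vars_refed = set(ref(block[max_used]))
        vars_defed = set(deff(block[max_used]))
        for i in range(max_used - 1, -1, -1):      # iterate backwards
            needed = set(deff(block[i])) & vars_refed
            if needed:
                strand.insert(0, block[i])         # prepend: keep program order
                vars_refed |= set(ref(block[i]))
                vars_defed |= needed
                unused.discard(i)
        # inputs: variables referenced but never defined inside the block
        strands.append((strand, vars_refed - vars_defed))
    return strands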

3.3.3 Statistical Evidence

Given two sets of strands obtained from two procedures for comparison, we assume a mechanism for computing a similarity measure between two strands based on the proportion of output variables they agree on when given equivalent inputs. We denote this similarity metric by VCP, and defer its formal definition and computation to section 3.4. In this section, we assume that

the VCP between two strands, sq and st, is given, and transform it into a probabilistic measure

for strand similarity Pr(sq|st). We then describe how to compute Pr(sq|H0), the probability of each strand to match a random process. Using these values we compute the likelihood-ratio,

LR(sq), from which the GES value is composed.

Strand Similarity as a Probability Measure

We denote Pr(sq|t) as the likelihood that a strand, sq, from the query procedure, q, can be

“found” in the target procedure t, i.e., that we can find an equivalent strand st ∈ t.

\Pr(s_q \mid t) \triangleq \max_{s_t \in t} \Pr(s_q \mid s_t). \quad (3.2)

The likelihood Pr(sq|st) that two strands are input-output equivalent is estimated by applying a sigmoid function (denoted g()) over the VCP of the two strands (we set the sigmoid midpoint to be x0 = 0.5, as VCP(sq, st) ∈ [0, 1]):

\Pr(s_q \mid s_t) \triangleq g(VCP(s_q, s_t)) = \frac{1}{1 + e^{-k\,(VCP(s_q, s_t) - 0.5)}}. \quad (3.3)


The use of the logistic function allows us to produce a probabilistic measure of similarity, where Pr(sq|t) is approximately 1 when VCP(sq, st) = 1 and nears 0 when VCP(sq, st) = 0. We experimented with different values for the steepness parameter k of the sigmoid curve and found k = 10.0 to work well. Our use of the sigmoid function is similar to its application in logistic regression algorithms for classification problems [KK10]. The hypothesis h_θ(x) is set to be the sigmoid function of the original hypothesis θ^T x, resulting in the hypothesis being a probability distribution, h_θ(x) ≜ Pr(y = 1 | x; θ) = g(θ^T x), which reflects the likelihood of a positive classification (y = 1) given a sample x. This correlates to Pr(sq|st) representing the likelihood that sq and st are a positive match performing the same calculation.
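As a concrete illustration of eq. (3.3), here is a small Python sketch (our own) using the thesis's midpoint of 0.5 and k = 10:

import math

def strand_match_probability(vcp, k=10.0, midpoint=0.5):
    """Turn a VCP in [0, 1] into a match likelihood via the logistic curve."""
    return 1.0 / (1.0 + math.exp(-k * (vcp - midpoint)))

# VCP = 1 maps to ~0.993, VCP = 0.5 to exactly 0.5, VCP = 0 to ~0.007
print(strand_match_probability(1.0), strand_match_probability(0.0))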

The Statistical Significance of a Strand

In order to find procedure similarity in binaries, we require that non-trivial strands of code be matched across these binaries. This stands in contrast to eq. (3.3), where smaller pieces of code receive a high likelihood score, since they perform trivial functionality that can be matched with many strands. Thus a query strand must not only be similar to the target, but must also have a low probability of occurring at random. For this we introduce the Likelihood Ratio measure:

LR(sq|t) = Pr(sq|t)/Pr(sq|H0). (3.4)

This measure represents the ratio between the probability of finding a semantic equivalent

of sq in st vs. the probability of finding a semantic equivalent at random (from the random

process H0). Pr(sq|H0) in fact measures the statistical insignificance of a strand, where a higher

probability means low significance. We estimate the random hypothesis H0 by averaging the

value of Pr(sq|st) over all targets (as it needs to be computed either way), i.e.,

\Pr(s_q \mid H_0) = \frac{\sum_{s_t \in T} \Pr(s_q \mid s_t)}{|T|},

where T is the set of all target strands for all targets in the corpus.

3.3.4 Local and Global Evidence Scores

After presenting the decomposition of procedures q and t to strands, and defining the likelihood-

ratio for each strand sq and target t, we can define the Local Evidence Score as the log of the likelihood-ratio:

LES(sq|t) = log LR(sq|t) = log Pr(sq|t) − log Pr(sq|H0). (3.5)

This local score reflects the level of confidence for sq to have a non-trivial semantic equivalent

strand in t. The global score GES(q|t) is simply a summation (eq. (3.1)) of all LES(sq|t) values after decomposing q to strands, which reflects the level of confidence that q can be composed from non-trivial parts from t and is in fact similar to it.
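To make the flow of eqs. (3.1)–(3.5) concrete, the following Python sketch (ours; it assumes a vcp oracle from section 3.4 and a precomputed pool of corpus strands for estimating H0) computes GES(q|t):

import math

def ges(query_strands, target_strands, corpus_strands, vcp, k=10.0):
    """Global Evidence Score of eq. (3.1), given a vcp(s_q, s_t) oracle.

    corpus_strands: all target strands in the corpus, used to estimate
    Pr(s_q | H0) as the average match likelihood (section 3.3.3).
    """
    def pr(sq, st):                                            # eq. (3.3)
        return 1.0 / (1.0 + math.exp(-k * (vcp(sq, st) - 0.5)))

    score = 0.0
    for sq in query_strands:
        pr_t = max(pr(sq, st) for st in target_strands)        # eq. (3.2)
        pr_h0 = sum(pr(sq, st) for st in corpus_strands) / len(corpus_strands)
        score += math.log(pr_t) - math.log(pr_h0)              # eq. (3.5)
    return score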


3.4 Semantic Strand Similarity

In the previous section, we assumed a procedure that computes semantic strand similarity. In this section, we provide such a procedure using a program verifier. Given two strands to compare, the challenge is to define a quantitative measure of similarity when the strands are not equivalent. We first provide the formal definitions on which the semantics of strand similarity are based, and then show how we compute strand similarity using a program verifier.

3.4.1 Similarity Semantics

Preliminaries We use standard semantics definitions: A program state σ is a pair (l, values), mapping the set of program variables to their concrete value values : Var → Val, at a certain

program location l ∈ Loc. The set of all possible states of a program P is denoted by Σ_P. A program trace π ∈ Σ_P^* is a sequence of states ⟨σ0, ..., σn⟩ describing a single execution of the program. The set of all possible traces for a program is denoted by [[P]]. We also define first : Σ_P^* → Σ_P and last : Σ_P^* → Σ_P, which return the first and last state in a trace, respectively. A strand s ∈ P is therefore a set of traces, s ⊆ [[P]] – the set of all traces generated by all possible runs of s, considering all possible assignments to s's inputs. We will use this abstraction to further define strand equivalence and VCP.

Variable correspondence A variable correspondence between two states, σ1 and σ2, denoted γ : Var1 ⇀ Var2, is a (partial) function from the variables in σ1 to the variables in σ2. Note that several variables can be mapped to a single variable in Var2. Γ(P1, P2) denotes the set of all

variable correspondences for the pair of programs (P1, P2). This matching marks the variables as candidates for input-output equivalence to be proven by the verifier.

State, trace equivalence Given two states and a correspondence γ, if ∀(v1, v2) ∈ γ : σ1(v1) =

σ2(v2), then we say that these states are equivalent with respect to γ, and denote them σ1 ≡γ σ2.

Given two traces and a correspondence γ between their last states, if last(π1) ≡γ last(π2),

then we say that these traces are equivalent with respect to γ, and denote them π1 ≡γ π2.

Definition 3.4.1 (Strand equivalence). Given two strands (remembering that each strand has inputs as defined in section 3.3.2 and denoted inputs(s)) and a correspondence γ, we say

that these strands are equivalent with respect to γ, denoted s1 ≡γ s2 if: (i) every input from

s1 is matched with some input from s2 under γ, and (ii) every pair of traces (π1, π2) ∈

(s1, s2) that agree on inputs (∀(i1, i2) ∈ (γ ∩ (inputs(s1) × inputs(s2))) : first(π1)(i1) = first(π2)(i2)) is equivalent, π1 ≡γ π2. This expresses input-output equivalence.

Definition 3.4.2 (State, trace variable containment proportion). We define the VCP between

a query state σq and a target state σt as the proportion of matched values in σq, denoted

VCP(\sigma_q, \sigma_t) \triangleq \frac{|\gamma_{max}|}{|\sigma_q|},

where γmax is the maximal variable correspondence (in size) for which the two states are equivalent, i.e., σq ≡γmax σt, considering all possible gammas. We define the VCP between two traces, VCP(πq, πt), as VCP(last(πq), last(πt)).


For instance, given values_q = {x ↦ 3, y ↦ 4} and values_t = {a ↦ 4}, the maximal correspondence is γmax = {y ↦ a}, as it matches the most possible variables. Therefore VCP(σq, σt) = 1/2. We note that it is possible for several maximal correspondences to exist, and in these cases we simply pick one of the said candidates.

Definition 3.4.3 (Strand VCP). We define the VCP between two strands as the proportion of matched variables in the γ that induces the maximal containment proportion over all pairs of traces, as follows:

VCP(s_q, s_t) \triangleq \frac{\max\{\,|\gamma| : \forall (\pi_q, \pi_t) \in (s_q, s_t),\ \pi_q \equiv_\gamma \pi_t\,\}}{|Var(s_q)|}.

An important observation regarding the VCP is that it can produce a high matching score for potentially unrelated pieces of code, for instance if two strands perform the same calculation but one ends by assigning 0 to all outputs, or if the result of the computation is used for different purposes. We did not observe this to be the case, as (i) compiler optimizations will eliminate such cases where a computation is not used, and (ii) even if the code is used for different purposes – it may still suggest similarity, if for example a portion of the query procedure was embedded in the target.

3.4.2 Encoding Similarity as a Program Verifier Query

Next we show how we compute a strand's VCP (definition 3.4.3) by encoding input-output equivalence, along with procedure semantics, as a program verifier query. The query consists of three parts: (i) assuming input equivalence over the inputs in the variable correspondence (γ), (ii) expressing query and target strand semantics by sequentially composing their instructions, and (iii) checking for variable equivalence, over all possible traces, by adding equality assertions to be checked by the program verifier. Program verifiers For the purposes of this work, we describe a program verifier simply as a function denoted Solve : (Proc, Assertion) → (Assertion → {True, False}), that given a

procedure p ∈ Proc with inputs i1, ...in and a set of assertion statements Φ ⊆ Assertion, is able to determine which of the assertions in Φ hold, for any execution of p, under all possible

values for i1, ..., in. The assertions in Φ are mapped to specific locations in p and specify properties (formulas in first-order logic (FOL)) over p's variables that evaluate to True or False according to the variables' values. Solve labels an assertion as True if it holds for all variable values under all input values. Verifiers usually extend the program syntax with an assume statement, which allows the user to specify a formula at desired program locations. The purpose of this formula is to instruct the verifier to assume the formula always holds at that location, and to try to prove the assertions encountered using all the assumptions gathered in the verification pass. In this work, we used the Boogie program verifier ([BCD+05] provides the inner workings of the verifier). To allow the use of the verifier, we lifted assembly code into a non-branching subset of the (C-like) BoogieIVL (translation details are described in section 3.5.1). The details of BoogieIVL are thoroughly described in [Lei].


Procedure calls We treat procedure calls as uninterpreted functions while computing similarity because: (i) an inter-procedural approach would considerably limit scalability, as the verifier would need to reason over the entire call tree of the procedure (this could be unbounded for recursive calls), and (ii) we observed that the semantics of calling a procedure is sufficiently captured in the code leading up to the call, where arguments are prepared, and in the trailing code, where the return value is used. Calls to two different procedures that have the exact same argument preparation process and use return values in the same way would be deemed similar by our approach. We note, however, that our approach does not rely on knowing call targets, as most are omitted in stripped binaries.

Algorithm 6: Compute Strand VCP.
Input: Query pq, Target pt in BoogieIVL
Output: VCP(pq, pt)

maxVCP ← 0;
for γ ∈ Γ(pq, pt) do
    p ← NewProcedure(Inputs(pq) ∪ Inputs(pt));
    for (iq, it) ∈ (γ ∩ (Inputs(pq) × Inputs(pt))) do
        p.body.Append(assume iq == it);
    p.body.Append(pq.body; pt.body);
    for (vq, vt) ∈ ((Vars(pq) × Vars(pt)) ∩ γ) do
        p.body.Append(assert vq == vt);
    Solve(p);
    if pq ≡γ pt then
        maxVCP ← max(|γ|/|Vars(pq)|, maxVCP);

Calculating strand VCP using a program verifier Next, we describe how we encode strand similarity as a Boogie procedure. We present a simplified version of the algorithm for clarity and brevity, and further describe optimizations in section 3.5.4. As our compositional approach alleviates the need to reason over branches, we can define the encoding assuming single-path programs. For the rest of this chapter we separate procedure variables into Vars(p), denoting all non-input variables in the procedure, and Inputs(p), denoting only inputs. Algorithm 6 receives a pair of Boogie procedures pq, pt representing the strands q and t, after renaming of variables to avoid naming collisions. It then proceeds to enumerate over all possible variable correspondences γ ∈ Γ(pq, pt), where all of pq's inputs are matched in compliance with definition 3.4.1. For each correspondence, a new Boogie procedure p is created. We start building the procedure body by adding assumptions of equivalence for every pair of inputs in γ. This is crucial for checking input-output equivalence. Next, we append the bodies of the query and target procedures sequentially, capturing both strands' semantics. Lastly, a series of assertion statements is added, whose goal is to assert exit-state equivalence by adding an assertion for every variable pair matched by γ. The resulting procedure p is then given to the Solve() function, which uses the program verifier to check assertion correctness. If all the assertions were proven, the current VCP is calculated and compared against the best VCP


computed so far, denoted maxVCP. The higher value is picked, resulting in the maximal VCP at the end of the loop's run.
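To illustrate steps (i)–(iii) of the encoding, the following Python sketch (ours; Boogie variable declarations and other boilerplate are elided, and all names are hypothetical) assembles the text of one verifier query for a given correspondence γ:

def build_boogie_query(query_body, target_body, gamma_inputs, gamma_vars):
    """Assemble a BoogieIVL procedure checking input-output equivalence.

    query_body/target_body: lists of BoogieIVL statements (variables
    already renamed apart); gamma_inputs/gamma_vars: (query, target)
    name pairs from the correspondence gamma.
    """
    lines = ["procedure p() {"]       # (declarations of locals omitted)
    # (i) assume equivalence of matched inputs
    for iq, it in gamma_inputs:
        lines.append(f"  assume {iq} == {it};")
    # (ii) sequentially compose both strands' semantics
    lines += [f"  {stmt}" for stmt in query_body + target_body]
    # (iii) assert equivalence of matched non-input variables
    for vq, vt in gamma_vars:
        lines.append(f"  assert {vq} == {vt};")
    lines.append("}")
    return "\n".join(lines)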

3.5 Evaluation

Our method's main advantage is its ability to perform cross-compiler (and cross-compiler-version) code search on binary procedures, even if the original source code was slightly patched. We harness this ability to find vulnerable code: we demonstrate our approach by using a known-to-be-vulnerable binary procedure as a query, and trying to locate other similar procedures in our target database that differ only in that they result from compiling slightly patched code or from using different compilation tools. These similar procedures become suspects for being vulnerable as well, and can then be checked more thoroughly. We designed our experiments to carefully test every major aspect of the problem at hand. Finally, we also show how our method measures up against other prominent methods in this field. In section 3.5.1 we describe the details of our prototype's implementation. In section 3.5.2 we detail the vulnerable and non-vulnerable code packages that compose our test-bed. In section 3.5.3 we show how our test-bed was built and how this structure enabled us to isolate the different aspects of the problem domain. In section 3.5.4 we detail the different heuristics we employed in our prototype to improve performance without undermining our results.

3.5.1 Test-Bed Creation and Prototype Implementation

We implemented a prototype of our approach in a tool called Esh. Our prototype accepts as input, in binary form, a query procedure and a database of procedures (the targets), residing in executable files. Before the procedures can be compared using our method, we need to divide them (from whole executables into single procedures) and "lift" them into BoogieIVL to enable the use of its verifier.

Lifting Assembly Code into BoogieIVL

Each binary executable was divided into procedures using a custom IDA Pro ([HR]) Python script, which outputs a single file for every procedure. BAP (Binary Analysis Platform) [BJAS11] "lifts" the binary procedure into LLVM-IR ([LA04]) code, which manipulates a machine state represented by global variables. An important observation, shown in fig. 3.3 and fig. 3.4, is that the translated code is in static single assignment (SSA) form, which is crucial for an effective calculation of the VCP (definition 3.4.3). The SMACK (Bounded Software Verifier, [SMA]) translator is used to translate the LLVM-IR into BoogieIVL. Finally, strands are extracted from the blocks of the procedure's CFG. In our evaluation we compared Esh with TRACY (chapter 2). Section 3.6.3 shows a detailed analysis of these experiments. Our prototype was implemented with a mixture of C# (for Boogie framework interaction) and Python. The source code for the comparison engine, accompanied by its installation guide, is available


here: github.com/tech-srl/esh, along with a demonstration of the engine. For our evaluation, we deployed the prototype on a server with four Intel Xeon E5-2670 (2.60GHz) processors and 377 GiB of RAM, running Ubuntu 14.04.2 LTS.

3.5.2 Using Vulnerable Code as Queries

To subject our method to a real-world test scenario, we incorporated eight real vulnerable code packages in the test-bed. The specific CVEs are detailed in table 3.1. The rest of the target database was composed of randomly selected open-source software from the Coreutils ([GNUb]) package. All code packages were compiled using the default settings, resulting in most of them being optimized at the -O2 optimization level, while a few, like OpenSSL, default to -O3. All executables were compiled for the x86_64 (64-bit) architecture by default; accordingly, we focused our system's implementation efforts on this architecture. Note that our system can easily be expanded to support x86 (32-bit) and even other chip-sets (assuming our tool-chain provides, or is extended to provide, support for them). After compilation we removed all debug information from the executables. After compiling the source code into binaries, our target corpus contained 1500 different procedures. We made this dataset available to the public at https://github.com/nimrodpar/esh-dataset-1523.

3.5.3 Testing Different Aspects of the Problem Separately

Many previous techniques suffered from a high rate of false positives, especially as the code corpus grew. We will show one cause for such incidents using a general example. [RMZ11] shows that the compiler produces a large amount of compiler-specific code, such as the code for its control structures, so much so that the compiler can be identified using this code. This is an example of a common pitfall for methods in the field of binary code search: if the comparison process is not precise enough when comparing procedures compiled by a different compiler, a low similarity score can be wrongly assigned to these procedures on the basis of the generating compiler alone. This is due to the compiler-centric code taking precedence over the real semantics of the code, instead of being identified as common code and having its importance reduced in the score calculation. In our context, the problem can be divided into three vectors: (i) different compiler versions, (ii) different compiler vendors, and (iii) source-code patches. Different versions of the same compiler Before attempting cross-compiler matching, we first evaluated our method on binaries compiled using different versions of the same compiler. We therefore compiled the last vulnerable version of each package mentioned in section 3.5.2 using the gcc compiler versions 4.{6,8,9}. Next we performed the same process with the CLang compiler versions 3.{4,5}, and again with the icc compiler versions 15.0.1 and 14.0.4. Cross-compiler search Next, we evaluated our approach by searching for procedures compiled with different compilers. An important aspect of this evaluation process was alternating the query used, each time selecting the query from a different compiler. This also ensured that our method is not biased towards a certain compiler. As explained in section 3.3, our matching method is not symmetric, and so examining these different scenarios provides evidence for the validity of this asymmetric approach. Patched source-code The last vector we wish to explore is patching. We define a patch as any modification of source code that changes the semantics of the procedure. The common case for this is when the procedure's code is altered, yet changes to other procedures or data structures can affect the semantics


as well. We predict that precision will decline as the size of the patch grows and the procedures exhibit greater semantic difference.

3.5.4 Enabling the Use of Powerful Semantic Tools

Esh Performance Our initial experiments showed the naive use of program verifiers to be infeasible, resulting in many hours of computation for each pair of procedures. In the following subsection we describe various optimizations to our algorithm which reduced the time required for comparing a pair of procedures to roughly 3 minutes on average on an 8-core Ubuntu machine. We emphasize that our approach is embarrassingly parallelizable, as verifier queries can be performed independently, allowing performance to improve linearly with the number of computation cores.

Algorithm 6 optimizations We first presented the VCP computation algorithm in a simplified form with no optimizations. To avoid enumerating over all variable correspondences in Γ(pq, pt), our optimized implementation enumerates over inputs only (Inputs(pq) × Inputs(pt)). Furthermore, we did not allow multiple inputs in pq to be matched with a single input in pt (γ was one-to-one) and only allowed correspondences that matched all of pq's inputs. This reduced the number of outer loop iterations to

max(|Iq|!, |It|!). We further reduced this by maintaining typing in matches. For each matching of the inputs, the non-input variable matching part of γ starts out simply as Var(pq) × Var(pt), while maintaining types. We further perform a data-flow analysis to remove variable pairs that have no chance of being matched, as their calculation uses inputs that were not matched with an initial assumption (were not in γ). Allowing for all possible matchings (up to typing and dataflow analysis) means that we check all possible correspondences for non-inputs at once. We do this by parsing the output of the Boogie verifier that specifies which of the equality assertions hold and which fail. Unmatched variables are removed, leaving only equivalent pairs in γ. Finally, multiple matchings for variables are removed (γ must be a function over q’s variables according to definition 3.4.3) and the VCP is calculated.
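A sketch of this reduced enumeration (ours; the typing and data-flow pruning described above are omitted for brevity):

from itertools import permutations

def input_correspondences(query_inputs, target_inputs):
    """Yield one-to-one correspondences matching *all* query inputs.

    Bounded by max(|Iq|!, |It|!) candidates, as noted in section 3.5.4.
    """
    if len(query_inputs) > len(target_inputs):
        return  # gamma must cover every query input
    for chosen in permutations(target_inputs, len(query_inputs)):
        yield list(zip(query_inputs, chosen))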

Discarding trivial strands and impractical matches As mentioned in section 3.3.3, trivial strands receive low LES similarity scores. Thus smaller strands are less likely to generate evidence for global similarity GES. We therefore established a minimal threshold for the number of variables required in a strand for it to be considered in the similarity computation (5 in our experiments), and did not produce verifier queries for strands smaller than the said threshold. We further avoided trying to match pairs of strands whose sizes vary greatly. We used a ratio threshold, set to 0.5 in our experiments, meaning that we attempt to match a query strand only with target strands that have at least half its number of variables (i.e., a minimal value of VCP = 0.5 is required) and at most twice its number of variables (avoiding matching with "giant" strands, which are likely to be matched with many strands).
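In code, the filtering step might look as follows (our sketch, using the thresholds quoted above):

MIN_STRAND_VARS = 5     # thresholds used in the thesis's experiments
SIZE_RATIO = 0.5

def worth_comparing(query_vars: int, target_vars: int) -> bool:
    """Skip trivial strands and size-mismatched pairs before querying."""
    if query_vars < MIN_STRAND_VARS:
        return False           # trivial strand: no verifier query
    ratio = target_vars / query_vars
    return SIZE_RATIO <= ratio <= 1.0 / SIZE_RATIO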

Batching verifier queries To further save on the cost of performing a separate verifier query for each input matching (which usually contains a small number of assertions), we batch together multiple queries. As we wanted to allow procedure knowledge gathered by the verifier to be re-used by subsequent queries, we embedded several possible γ correspondences in the same procedure by using the non-deterministic branch mechanism in Boogie (further explained in [Lei]) to make the verifier consider assumptions and check assertions over several paths. We also batched together different strand procedures up to a threshold of 50,000 assertions per query.


#  CVE Alias         CVE        #BB  #Strands | S-VCP: FP / ROC / CROC | S-LOG: FP / ROC / CROC | Esh: FP / ROC / CROC
1  Heartbleed        2014-0160   15     92    | 107 / 0.967  / 0.814   |  0 / 1.000  / 1.000    |  0 / 1.000 / 1.000
2  Shellshock        2014-6271  136    430    | 246 / 0.866  / 0.452   |  3 / 0.999  / 0.995    |  3 / 0.999 / 0.996
3  Venom             2015-3456   13     67    |   0 / 1.000  / 1.000   |  0 / 1.000  / 1.000    |  0 / 1.000 / 1.000
4  Clobberin' Time   2014-9295   41    233    | 351 / 0.797  / 0.343   | 65 / 0.987  / 0.924    | 19 / 0.993 / 0.956
5  Shellshock #2     2014-7169   88    294    | 175 / 0.889  / 0.541   | 40 / 0.987  / 0.920    |  0 / 1.000 / 1.000
6  ws-snmp           2011-0444    6     98    |  42 / 0.981  / 0.879   |  5 / 0.999  / 0.990    |  1 / 1.000 / 0.997
7  wget              2014-4877   94    181    | 332 / 0.885  / 0.600   | 11 / 0.998  / 0.988    |  0 / 1.000 / 1.000
8  ffmpeg            2015-6826   11     87    | 222 / 0.9212 / 0.6589  | 97 / 0.9808 / 0.8954   |  0 / 1.000 / 1.000

Table 3.1: The ROC, CROC and FPs for our query search experiments.

3.6 Results

3.6.1 Finding Heartbleed

[Figure: bar chart of normalized GES scores, one bar per target procedure; the X axis groups bars by compiler vendor (CLang, gcc, icc) and version, with the source package (openssl, bash 4.3, coreutils 8.23) and code version marked above; the lowest true positive scores 0.419 and the highest unrelated procedure 0.333.]

Figure 3.5: Experiment #1 – Successfully finding 12 procedure variants of Heartbleed.

To better understand our evaluation process and test-bed design, we start by walking through experiment #1 from table 3.1. The query for this experiment was the "Heartbleed" vulnerable procedure from openssl-1.0.1f, compiled with CLang 3.5. Each bar in fig. 3.5 represents a single target procedure, and the height of the bar represents the GES similarity score (normalized) against the query. The specific compiler vendor and version are noted below the graph (on the X axis), and the source package and version above it. Bars filled green represent procedures originating from the same code as the query (i.e. "Heartbleed") but varying in compilation or source-code version (the exact code version, openssl-1.0.1{e,f,g}, is specified over the bar itself). All unrelated procedures were left blank. As we can see from the results in fig. 3.5, our method gives high scores to all other similar versions of the "Heartbleed" procedure, despite them being compiled using different compilers, different compiler versions, or from a patched source code. A gap of 0.08 in the GES score separates the true positives from the rest of the procedures (0.419 for the icc 15 compiled procedure of openssl-1.0.1g "Heartbleed" vs. 0.333 for the bash 4.3 "ShellShock" procedure compiled with CLang 3.5). It is important to note that we will not try to establish a fixed threshold to evaluate the quality of these results. As mentioned, this clean separation between the true positives and the false positives is not always possible. Instead, this result and others, as shown in the following sections, are evaluated according to the produced ranking. The result in fig. 3.5 receives a ROC = CROC = 1.0 score as it puts all of the true positives at the top of the ranking.

3.6.2 Decomposing Our Method into Sub-methods

When examining our method bottom up, we can divide it into three layers:

• S-VCP: The first layer of our method is the way we calculate the VCP between strands. Without the


use of the statistical processing, we still define a similarity score as \sum_{s_t \in T} \max_{s_q \in Q} VCP(s_t, s_q). This approach attempts to generalize the VCP from a pair of strands to a pair of procedures by counting the maximal number of matched variables in the entire procedure.

• S-LOG: The next layer of our approach incorporates the statistical significance of every query

strand, by using local and global significance. By alternatively defining Pr(sq|st) = VCP(sq, st) and applying it to the LES and GES equations, we can see how our method looks without applying the sigmoid function to the VCP.

• Esh: Adding the use of the sigmoid function results in our full method, as described in section 3.3.

To fully justify and explain each layer of our method, table 3.1 shows the results of each sub-method compared with our full method, in terms of (i) FPs, and (ii) ROC and CROC (for details about these measures see section 1.4). Note that the number of false positives is determined by a human examiner who receives the list of procedures sorted by similarity scores: it is the number of non-matching procedures the examiner has to test until all the truly similar procedures are found. The effectiveness of a method can be measured more precisely and quickly by using CROC. We also included additional information about the number of basic blocks and the number of strands extracted from them, and the CVE for every vulnerable procedure we attempted to find. These results clearly show that each layer of our method increases its accuracy. Comparison of the different experiments shows that CROC & ROC scores do more than simply count the number of false positives for every threshold; they compare the rate at which the false-positive rate grows. (Informally, this may be regarded as a prediction of the number of attempts after which the human researcher will give up.) An important point is that the size of the query, in terms of the number of basic blocks or strands, does not directly correlate with easier matching. Experiment #3 shows an interesting event, where even S-VCP gets a perfect score. Upon examination of the query procedure, we discovered that this occurs because the said procedure contains several distinct numeric values which are only matched against similar procedures. (These are used to explore a data structure used to communicate with the QEMU floppy device.) Examining the results as a whole, we see that in more than half of the experiments the use of S-VCP, which doesn't employ the statistical amplification of strands, results in a high number of false positives. To understand this better, we performed a thorough analysis of experiment #5. When we examine the

values of Pr(sq|H0), which express the frequency of appearance of strand sq, we see that several strands get an unusually high score (appear more frequently). One of these strands was found to be a sequence of push REG instructions, which are commonplace for a procedure prologue.

3.6.3 Comparison of TRACY and Esh

Following our explanation of the different problem aspects in section 3.5.3, we tested both tools, with a focus on experiment number one in table 3.1. Each line in table 3.2 represents a single experiment; a checkmark (✓) in one of the columns specifies that the specific problem aspect is applied: (i) Compiler version, from the same vendor, (ii) Cross-compiler, meaning different compiler vendors, and (iii) applying patches to the code. For example, when (i) and (ii) are checked together, this means that all variations of queries created using all compiler vendors and compiler versions were put in the target database. As we can see, because TRACY was designed to handle patches, it achieves a perfect grade when dealing with them. Moreover, TRACY can successfully handle the different compiler versions on their own. However, when it is used in a cross-compiler search, its accuracy begins to plummet. Furthermore, when


Versions  Cross  Patches | TRACY (Ratio-70) |  Esh
   ✓                     |      1.0000      | 1.0000
           ✓             |      0.6764      | 1.0000
                   ✓     |      1.0000      | 1.0000
   ✓       ✓             |      0.5147      | 1.0000
           ✓       ✓     |      0.4117      | 1.0000
   ✓               ✓     |      0.8230      | 1.0000
   ✓       ✓       ✓     |      0.3529      | 1.0000

Table 3.2: Comparing TRACY and Esh on different problem aspects.

any two problem aspects are combined, and especially when all three are addressed, this method becomes practically unusable.

3.6.4 Evaluating BinDiff

BinDiff [zyna] is a tool for comparing whole executables/libraries by matching all the procedures within these libraries. It works by performing syntactic and structural matching, relying mostly on heuristics. The heuristic features over procedures which form the basis for similarity include: the number of jumps, the place of a given procedure in a call-chain, the number of basic blocks, and the name of the procedure (which is unavailable in stripped binaries). Detailed information about how this tool works can be found in [zynb], along with a clear statement that BinDiff ignores the semantics of concrete assembly-level instructions. Table 3.3 shows the results of running BinDiff on each of the procedures in table 3.1. As BinDiff operates on whole executables/libraries, the query/target was a whole library containing the original vulnerability. We only compared one target for each query: the same library compiled with a different vendor's compiler, and patched (for the queries where patching was evaluated). BinDiff failed to find the correct match in all experiments but two. The experiments in which BinDiff found the correct match are those where the number of blocks and branches remained the same and was relatively small, which is consistent with [zynb].

Alias            Matched?  Similarity  Confidence
Heartbleed          No         -           -
Shellshock          No         -           -
Venom               No         -           -
Clobberin' Time     No         -           -
Shellshock #2       No         -           -
ws-snmp             Yes       0.89        0.91
wget                No         -           -
ffmpeg              Yes       0.72        0.79

Table 3.3: Evaluating BinDiff.

3.6.5 Pairwise Comparison

Fig. 3.6 shows the similarity measures produced in an all-vs-all experiment, where 40 queries were chosen at random from our corpus and compared. The result is shown in heat-map form, where the axes


represent individual queries (X and Y are the same list of queries, in the same order) and each pixel's intensity represents the similarity score GES value (eq. (3.1)) for the query-target pair. Queries that originate from the same procedure (but are compiled with different compilers, or patched) are coalesced. Different procedures are separated by ticks. We included at least two different compilations for each procedure. The first procedure (leftmost on the X, bottom on Y axis) is ftp_syst() from wget 1.8, queried in 6 different compilations. The second is taken from ffmpeg 2.4.6, queried in 7 different compilations. The rest are taken from Coreutils 8.23. The average ROC and CROC values for the experiment were 0.986 and 0.959 respectively.

[Figure: heat map; the axes list the queried procedures, including parse_integer(), dev_ino_compare(), default_format(), print_stat(), cached_umask(), create_hard_link(), i_write(), compare_nodes(), ffmpeg-2.4.6's ff_rv34_decode_init_thread_copy(), and wget-1.8's ftp_syst().]

Figure 3.6: All vs. all experiment result in heat map form (pixel intensity = similarity GES measure).

Several observations can be made from fig. 3.6:

1. The graph’s diagonal represents the “ground truth” i.e., each query is matched perfectly with itself.

2. Our GES measure is not symmetrical (as it is based on the asymmetrical VCP metric).

3. Esh provides a very clear distinction for ffmpeg-2.4.6's ff_rv34_decode_init_thread_copy(), marked with a dashed region numbered 1, where all compiled queries of the procedure receive high GES scores when compared with each other and low ones when compared with random code. In general, Esh correctly matches procedures compiled with different compilers, as the pattern of "boxes" along the diagonal shows.

4. Groups of queries that originate from the same procedure generally receive similar GES values (i.e. similar shade) when compared to groups of targets that originate from the same (but different from the queries’) code.

Although some procedures, such as default_format(), seem to perform poorly, as they have high GES matches with wrong targets (marked with a dashed region numbered 2), the evaluation of the matching should be relative to the correct matching. For default_format(), the GES values of the correct matching are high (specified by dark pixels around the diagonal), so relative to that, the set of wrong matchings (somewhat dark pixels in the middle) becomes less significant, which is reflected by ROC = .993 and CROC = .960 AUC scores.


3.6.6 Limitations

In this subsection we discuss the limitations of our approach using case studies of procedures where Esh yields low recall and precision. "Trivial" procedures and Wrappers Our approach relies on matching semantically similar, non-trivial fragments of execution. Thus, when matching a very small query (i.e. a "trivial" procedure), we are forced to rely on the statistical significance of a very small number of relatively short strands, which yields poor results. Wrappers are procedures that mostly contain calls to other procedures and hold very little logic of their own, e.g., the exit_cleanup() procedure in fig. 3.7. As we operate on stripped binaries, we cannot rely on identifiers like procedure names to try and match strands based on procedure calls.

static void exit_cleanup (void)
{
  if (temphead)
    {
      struct cs_status cs = cs_enter ();
      cleanup ();
      cs_leave (cs);
    }
  close_stdout ();
}

Figure 3.7: Wrapper exit_cleanup procedure from sort.c of Coreutils 8.23.

Generic procedures The C concatenation preprocessor directive (##) is sometimes used as a mechanism for generics, where “template” procedures are created, that have similar structure but vary in type, or use a function pointer to provide different functionality as shown in fig. 3.8, where different string comparison procedures are created to compare creation, modification and access time of a file. As these hold the same general structure (and are specifically small), Esh deems the whole set of procedures similar. We do not consider these matches to be strict false positives, as the procedures are in fact similar in structure. Note that this problem only arises when these types of procedures are used as queries, as it is not clear whether the different variants should be considered as true positives (they perform similar semantics) or false positives (as they might operate on different data structures).

#define DEFINE_SORT_FUNCTIONS(key_name, key_cmp_func)   \
  static int strcmp_##key_name (V a, V b)               \
  { return key_cmp_func (a, b, strcmp); }

...

DEFINE_SORT_FUNCTIONS (ctime, cmp_ctime)
DEFINE_SORT_FUNCTIONS (mtime, cmp_mtime)
DEFINE_SORT_FUNCTIONS (atime, cmp_atime)

Figure 3.8: DEFINE_SORT_FUNCTIONS macro for creating “template” procedures in ls.c of Coreutils 8.23.



Chapter 4

Similarity of Binaries through re-Optimization

In this chapter we make an important improvement to our similarity method, overcoming several of the limitations reported in chapter 3, first of which is speeding up the comparison process, while removing the remaining restrictions on the search. Problem definition Improve the measure developed in chapter 3 by providing a high similarity score for a procedure t ∈ T when q and t originated from compiling the code in any combination of the following scenarios: (i) different optimization flags, (ii) different target architectures, (iii) different compiler versions, (iv) different compiler vendors, and (v) the same or a different version of the same source code. The biggest challenge we face is comparing vastly different binary code stemming from the use of the different ISAs employed by each central processing unit (CPU) architecture. In this chapter we introduce an important concept: re-optimizing binary code into one IR, moving to a canonicalized, normalized, strand-based representation.

4.1 Overview

In this section we informally describe our approach using an example found during our evaluation.

4.1.1 The Challenge in Establishing Binary Similarity

Figure 4.1 shows our similarity method at work when applied to two similar computations, taken from the dtls1_buffer_message procedure in d1_both.c, which is a part of the OpenSSL code package. These computations were created by compiling the said procedure in two different setups, using:

• gcc version 4.8, targeting the AArch64 (64-bit ARM) architecture, and optimizing at -O0 optimization level. The processing of this computation is presented in the upper half of fig. 4.1, and marked (A).

• icc version 15.0.3, targeting the x86_64 Intel architecture, and optimizing at -O3 optimization level. The processing of this computation is presented in the lower half of fig. 4.1 and marked (B).

Figure 4.1 is divided into four columns. The first column (left-hand side), marked (i), shows snippets from the binaries created by the compilation setups above, and the other three (ii, iii and iv) show the effects of our method’s processing steps on them.


[Figure: four columns per compilation setup (A: ARM-64, gcc 4.8 -O0; B: x86-64, icc 15.0.3 -O3): (i) assembly, (ii) lifted LLVM-IR, (iii) canonicalized strand, (iv) canonicalized and normalized strand; only in column (iv) do the two strands become syntactically identical.]

Figure 4.1: Applying canonicalization and normalization on strands taken from our motivating example (OpenSSL's dtls1_buffer_message procedure in d1_both.c). Only the application of both techniques results in equivalence.

Syntactically different yet semantically equivalent code The assembly snippets in fig. 4.1(i) perform exactly the same computation, yet understanding this requires some familiarity with both architectures. The computations use two values that were generated beforehand and stored in two registers, with an additional, temporary register: the constant 1 is added to the value stored in one of the registers, and the result is stored in the temporary register. The sum is then subtracted from the third register's value. Finally, the computation sets a boolean flag value according to the result of comparing the subtraction's result with −2. The two computations are performed in different ways, making them hard to compare syntactically. This distinction stems from the different compilation processes, which introduce variance due to many factors, including but not limited to: Arbitrary register use: The inputs for the computation are stored in three different sets of registers: x0, x20 and x21 in (A) and rax, r15 and r13 in (B). The register selection process of the compiler is driven by various heuristics and intricate code-pass specifics, which results in a fairly arbitrary selection. Under certain scenarios, even well-known conventions like using rbp for the stack frame head are not adhered to by the compiler. Cross-optimization variance: Another source of syntactic difference in fig. 4.1 is the variance in the way code is produced for different optimization purposes. The -O0 code generated by gcc in (A) contains a move and an addition operation, which could easily have been united into one instruction: ADD x1,x20,#1. On the other hand, the icc-generated code (B) demonstrates the use of the lea instruction to perform a binary arithmetic operation and put its result in a third register without causing other side-effects (as the Intel ISA does not support three-address-code instructions like ARM). Different instruction selection: The snippets from fig. 4.1 check whether the input values are equal using different instructions: cmn in (A) and cmp in (B). cmn has similar semantics to the better-known cmp instruction, yet the former uses addition instead of subtraction to check for equality. This change causes the comparison to be performed against the constant 2 instead of −2. These variations, found in a short and simple computation, demonstrate the challenge in establishing similarity in binary code. Many more variations can be introduced by the application of different optimization levels, using different compilers, or by targeting different machine architectures. In this work, we propose a new approach for overcoming these variations, while finding a new sweet spot between scalability and accuracy.


4.1.2 Representing Procedures in a Canonicalized & Normalized Form

Lifting binaries to intermediate representation We adapt and use existing techniques to implement a lifter from binary procedures into LLVM-IR. The standard LLVM-IR representation is beneficial in that: (1) basing our similarity method on LLVM-IR allows it to be architecture-agnostic, and (2) the LLVM-IR format and accompanying tool-suite is well documented, well maintained, and has a plethora of tools for creation, translation and manipulation. VEX-IR is a well-used and robust IR for representing binary code from many architectures. Since there are no tools which offer support for translating binary code from multiple architectures to LLVM-IR, we implemented a translator from VEX-IR ([NS07]) to LLVM-IR. Figure 4.1(ii) shows the result of lifting the assembly snippets in fig. 4.1(i) into LLVM-IR.

Decomposing procedures to strands We build upon our previous strand-based decomposition process (section 3.3.2) to extract strands from the binary procedures. This time, as we are working on a lifted procedure, the strands are composed of LLVM-IR instructions.

Canonicalizing strands Figure 4.1(iii) shows the benefits of canonicalizing the strands: (1) all of the lifter-imposed changes (the complex move-operation modeling and redundant loads) are reverted, (2) the expression is canonicalized to perform the addition with the constant first, putting the result in the register before the subtraction, and (3) the comparison is canonicalized to a simple addition with a positive constant (instead of a subtraction with a negative one). Note that this canonicalization step also acts as a re-optimization for code which might not have been optimized before; in our case the mov,add -> add optimization was performed. It is also worth noting that the Intel assembly code was changed to use an instruction which the original architecture does not offer (cmn). Canonicalization is an important step in finding the equivalence of the snippets from fig. 4.1, yet further differences in register selection cause the optimized strands not to match.

Normalized form Figure 4.1(iv) shows the last step of our process, and the benefits of the normalization step: the specific names of the registers become immaterial, and the temporary values are renumbered to offset the fact that they were initially part of a bigger basic block and that some of the temporary values were removed as part of a redundant operation. In this final form the two strands are syntactically equivalent. Transforming strands to a representation where semantic equivalence is captured by syntactic equality, instead of using heavyweight theorem provers or dynamic analysis, is crucial to allowing our approach to scale.

Scalable search using hashed canonical strands Canonicalization and normalization of procedure strands are performed on each procedure independently, followed by hashing of the textual representation of the strands. The set of strand hashes (i.e., a set of numbers), denoted R(p), is then used to represent the procedure p in the comparison stage. Comparing strand hashes allows our comparison engine to achieve superior performance and to minimize memory usage.

Determining the statistical significance of a strand Another important component of our approach is the use of a statistical framework that determines the relevance of each strand in establishing similarity.
The goal is to distinguish relevant strands, attributed to the procedure's semantics as it appears in the source code, from other strands which are artifacts of targeting a specific architecture or using a specific compiler pass. The statistical reasoning has a twofold effect, as it also offsets some overmatching which may occur due to canonicalization and normalization. The similarity score of two procedures, q and t, is based on strands which appear in both q and t, i.e., R(q) ∩ R(t). The statistical framework factors in the common appearance rate of each strand, denoted


by Pr(s), into the similarity score, in an inverse manner. The significance of a strand, and its subsequent contribution to the similarity score, is the inverse of its probability, s.t. rare strands (Pr(s) = 0) contribute greatly to similarity while common strands (Pr(s) = 1) contribute very little. We determine the significance of a strand s using its common appearance rate in the (theoretical) set of all binary procedures in existence, W̃, as follows:

\Pr_{\widetilde{W}}(s) = \frac{|\{p \in \widetilde{W} \mid s \in R(p)\}|}{|\widetilde{W}|} \quad (4.1)

Estimating a global context through crawling binaries "in the wild" As computing W̃ is intractable, our statistical framework is based on sampling a limited number of binaries "in the wild", thus creating an estimation of the global context, which we denote by P. We use P to approximate the global context W̃. P can be gathered offline (regardless of the query and target procedures being compared), and we show that a relatively small number of procedures (1K) is sufficient for achieving high accuracy in our approach. Given P, we compute the similarity of a query q and a target t, denoted S(q, t), as follows:

S_P(q, t) = \sum_{s \in (R(q) \cap R(t))} \frac{|P|}{f(s)} \quad (4.2)

The similarity score between q and t is the sum of the inverse appearance frequency, f(s), of every strand shared between q and t, normalized by the size of P. The composition of eq. (4.2) is further explained in section 4.2.2 by eq. (4.3) and eq. (4.4).
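A minimal sketch of the whole hashing-and-scoring scheme (ours; the choice of MD5 as the strand hash is an assumption, f(s) counts the procedures of P containing s, and strands unseen in P – which the framework treats as maximally significant – are skipped here for simplicity):

import hashlib
from collections import Counter

def strand_hash(canonical_text):
    """Hash the canonicalized, normalized textual form of a strand."""
    return hashlib.md5(canonical_text.encode()).hexdigest()

def representation(strand_texts):
    """R(p): the set of strand hashes representing a procedure."""
    return {strand_hash(s) for s in strand_texts}

def similarity(r_q, r_t, global_context):
    """S_P(q, t): sum of inverse appearance frequencies of shared strands.

    global_context: list of R(p) sets sampled "in the wild" (the set P).
    """
    freq = Counter(h for r_p in global_context for h in r_p)   # f(s)
    shared = r_q & r_t                                         # R(q) ∩ R(t)
    return sum(len(global_context) / freq[h] for h in shared if h in freq)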

4.2 Algorithm

4.2.1 Representing Binary Procedures with Canonicalized Normalized Strands

Lifting opaque binary procedures to IR The first step used by most tools dealing with binary executables is to lift the procedures into some IR. This move allows the tool builder to focus on the semantics expressed in the binary instructions and not on the way they were emitted by the compiler or arranged by the linker. Prominent frameworks for performing these steps are Mcsema [tra], Binary Analysis Platform (BAP)[BJAS11], and Valgrind [NS07]. The first uses LLVM-IR, while the other two each created an IR (BIL for BAP and VEX-IR for Valgrind) for their own purpose. We note that none of these frameworks (nor any other frameworks we encountered) attempt decompile of the binary code, but rather represent the binary instructions’ semantics: The machine state is represented by variables and instructions translate to operations on these variables, according to the machine specification. This lifting process works by translating each assembly instruction in the procedure into the IR, which explicitly specifies how it affects the machine’s memory and registers, including the flags. Lifted IR inadvertently inhibit similarity Implementing the lifting process requires an immense amount of work, as there are many machine instructions, some of which cause intricate side-effects and are dependent on other elements of the machine state. To assuage some of these difficulties, the lifting process does not concern itself with being minimal or optimal in any way, and instead focuses on capturing the precise semantics. Figure 4.2 shows three simple examples demonstrating this behavior. In section 4.2.1 we see a simple move instruction in 64-bit ARM, and fig. 4.2(d) shows the lifted VEX-IR for it. The figure illustrates how a fairly trivial instruction is modeled using several (relatively complex) arithmetic operations, and using three redundant temporary values (t14,t15 and t16). In section 4.2.1 we see a simple add



(a) ARM-AArch64:        (b) Intel-x86_64:      (c) Intel-x86_64:
    mov x1, x21             add rax, 1             sub eax, 2

    ↓ Valgrind lifting      ↓ BAP lifting          ↓ Mcsema lifting

(d) VEX-IR:
    t14 = GET:I64(x21)
    t15 = Shl64(t14, 0x00)
    t16 = Or64(t13, t15)
    PUT(x1) = t16

(e) BIL:
    T1:u64 = R_RAX:u64
    T2:u64 = 1:u64
    R_RAX:u64 = R_RAX:u64 + T2:u64

(f) LLVM-IR:
    %usub = tail call { i32, i1 } @llvm.usub.with.overflow.i32(i32 %61, i32 2)
    %62 = extractvalue { i32, i1 } %usub, 0

Figure 4.2: Three examples of binary-to-IR lifting outputs containing redundant operations.

instruction in 64-bit x86 assembly, and fig. 4.2(e) shows the lifted BIL code for it. This lift operation does not use the fact that one of the arguments is a constant, and creates a redundant temporary value (T2) for it. This value is then used in the subsequent addition operation, even though BIL allows an addition between a temporary and a constant. Figure 4.2(c) displays another simple instruction in 64-bit x86 assembly, this time a 32-bit subtraction, and fig. 4.2(f) shows the lifted LLVM-IR code created by Mcsema. The created IR again models a more complex computation, this time an unsigned subtraction with overflow, which returns a struct containing the subtraction's result and the overflow indication. This is done even though the subsequent assembly instructions do not check the overflow flag; accordingly, the lifted LLVM-IR will never extract the second part of this struct. For some use cases, the variance and redundancy of the produced IR are immaterial. However, in the context of program similarity, as shown in fig. 4.1(ii), they are damaging, and become disastrous when combined with the other challenges of finding similarity between different architectures and optimization levels.

From lifted binaries to LLVM-IR strands  The aforementioned lifting tools were created with different design goals in mind: Mcsema focused on lifting in a way that allows transformations on the intermediate code, such as obfuscation and adding security mechanisms; BAP is geared towards performing binary analysis; and VEX-IR was built for instrumentation in Valgrind dynamic analyses. Furthermore, each tool became specialized at handling specific combinations of target OS and architecture: Mcsema works well on Windows binaries, BAP contains a very accurate representation of the x86 flag computation, and VEX is adept at generalizing over multiple architectures. We used VEX-IR in our prototype, by virtue of its ability to lift binaries from both the Intel and ARM architectures and its support for floating-point instructions. Regardless of this prototype implementation choice, we wanted our method to be lifter-agnostic, and so we based our representation on LLVM-IR and implemented a translator from VEX to LLVM-IR. We selected LLVM-IR in view of its stability, its extensive set of auxiliary tools and its support for multiple platforms. Following the lifting and translation process to LLVM-IR, our main goal is establishing a procedure representation that will help us find similar procedures. As mentioned before, we start this process by adapting the strand creation process described in section 3.3.2, yet this time our method does not require the use of (costly) SMT solvers, allowing us to use an LLVM-IR representation without the need to translate to BoogieIVL [Lei].

Canonicalizing LLVM-IR strands  Our notion of procedure similarity depends on our method's ability to detect similar strands across different procedures. This requires our method to overcome three major difficulties: (i) the compiler-imposed changes to the way the strand's semantics are represented, e.g., optimizations for size or runtime and control-flow manipulation; (ii) the machine-imposed constraints, e.g., the number of general-purpose registers available, as well as the expressiveness of the instruction set; and (iii) lifter-related changes similar to those detailed earlier in this section.
The basis for establishing similarity using strands is finding semantically equivalent strands, where this equivalence accommodates register renaming and other compiler or architecture artifacts; e.g., r12 + (rax * rbx) will be equivalent to x2 + (x4 * x7). One way to find semantically equivalent strands while avoiding performance-heavy solvers is to move to a canonical form. This transformation should funnel all semantically equivalent strands into the same syntactic representation. Fortunately, the problem of canonicalizing expressions is well researched, and is implemented in modern compilers to enable common-subexpression elimination and other optimizations at different levels (procedure-wide or at the basic-block level) as part of compilation-time optimization. As we are using LLVM-IR to represent our strands, an obvious choice for a canonicalizer is the


CLang optimizer (opt). Internally we represent each strand as an LLVM procedure, which accepts as input all the registers partaking in the computation. To harness the optimizer to our goal, we perform two simple transformations on the LLVM procedure: (i) we change the machine registers' representation to global variables, and (ii) we add an instruction returning the strand's value. Note that these changes are required only because the strand was extracted from its context in the binary procedure. The important optimizer phases we employ are 'common subexpression elimination' and 'combine redundant instructions' (activated by -early-cse and -instcombine respectively). These passes perform canonicalization of the expression under specific pre-defined rules, which include re-associating binary operations, grouping subsequent additions, converting multiplications with power-of-two constants to shifts, and many more. One example of this process is shown in the move from (ii) to (iii) in fig. 4.1. For more information on these optimization phases, we refer the reader to the LLVM documentation [GNUc], with a friendly warning that a full understanding of this process is possible only by closely examining opt's source code.

Out-of-context re-optimization is crucial to establishing similarity  Refining the representation of our strands by moving to a canonical form is an important step in our similarity search method. It is important to note that optimizing entire procedures or basic blocks has little effect in most scenarios, because multiple computation paths might be intertwined. This is a major challenge in establishing binary similarity (as demonstrated in fig. 4.1). Re-optimizing the code allows us to find similarity between unoptimized (e.g., -O0) assembly code generated for one architecture and heavily optimized code for a completely different machine architecture. Furthermore, the fact that we are targeting the code to the same 'virtual' machine (the LLVM abstraction) helps close the gap between the different architectures for comparison purposes.

Moving to a normalized strand representation  During the canonicalization step performed by the compiler, the strand (in our case a procedure with a single basic block) is represented by a directed acyclic graph (DAG) which stores the expression. Even though comparing DAGs is possible, we wanted to simplify our representation to a textual form, allowing for fast and simple comparison. This is accomplished by using opt to output a linearized version of the computation's DAG. To finalize the transformation, which eliminates the origin and compilation choices made in the creation of the binary code, the final refinement to our representation is normalizing the strands. This is done by renaming all symbols in the strand, i.e., its registers and temporary values, into sequentially named symbols. This step is crucial for cross-architecture comparison, as the names of the specific registers used in a given computation have nothing to do with its actual semantics, and are completely different between architectures.
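As an illustration of the canonicalization and normalization steps, the sketch below (assuming each strand was already wrapped in a standalone LLVM procedure and written to a .ll file; the helper names and the regex are ours, and the opt flags use the legacy pass syntax of older LLVM releases) runs the two passes named above and then renames symbols sequentially:

import re
import subprocess

def canonicalize(strand_ll_path):
    # run 'common subexpression elimination' and 'combine redundant instructions'
    out = subprocess.run(
        ["opt", "-S", "-early-cse", "-instcombine", strand_ll_path],
        capture_output=True, text=True, check=True)
    return out.stdout

def normalize(strand_ir):
    # rename temporaries to sequential symbols, in order of appearance;
    # a real pass must also fold the machine-register globals, and must
    # leave intrinsic and label names intact
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "%v{}".format(len(mapping))
        return mapping[name]
    return re.sub(r"%[\w.]+", rename, strand_ir)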

4.2.2 Scalable Binary Similarity Search

Using hashed strands to detect similarity  The steps performed to canonicalize and normalize the strands allow for efficient comparison, as they replace complex and heavy semantic matching with syntactic equality when establishing procedure similarity. Another important advantage of this transition is that we can index a procedure as a set of MD5 hashes and efficiently compare procedures in this representation. This allows for fast comparison and a reduced memory footprint. Thus, given a procedure p, we denote its representation, R(p), as the set of MD5 hash values over the canonicalized and normalized strands of the procedure:

$$R(p) = \{\mathrm{MD5}(\mathrm{Canonicalize\&Normalize}(s_p)) \mid s_p \in p\}.$$

A basic notion of the similarity between a given query and target, q and t, can be achieved based


on the intersection of their (hashed) representations, denoted M(q, t) = R(q) ∩ R(t). Yet this similarity metric does not perform well in practice, as shown in section 4.3.5. One reason is that some of the matched strands are not relevant: they are compilation artifacts and have nothing to do with the semantics expressed by q (or t).

Accounting for the binary creation process  As our evaluation will show, using a certain compiler, or targeting a specific architecture in the procedure compilation and assembly generation process, adds non-semantic artifacts to the generated assembly, beyond the original content of the source code. These artifacts might be a side effect of accessing the machine's memory through a certain construct (e.g., a stack), or be related to a specific optimization pattern implemented by the compiler. Previous work, [RMZ11] for example, shows that these artifacts can even be used to detect the tool-chain used to create the binary. Knowing the exact origins of every query and target procedure in the corpus might have allowed us to cancel out these artifacts, but retrieving this information is complex and imprecise when working on stripped executables. We thus need to find some other way to separate the semantic strands, which originate from the source code of the query, from the unrelated artifact strands, which can be attributed to the binary creation process. One relatively simple way to separate the strands is to apply the assumption that a common strand, which appears in many procedures, carries less importance than a rare strand. In general, we can define Pr(s) as the probability that a strand s will appear "at random". A more appropriate approach is to limit our probability space by two factors: (i) strands that are lifted from real binaries, and specifically from the group of binaries targeting one of the architectures our process can lift, denoted W and $W_e$ respectively; and (ii) strands that are canonicalized and normalized. As defined in eq. (4.1), we determine $\Pr_{W_e}(s)$ for a given strand s by dividing the number of different procedures in which it appears by the total number of procedures in $W_e$. We use this probability to define our similarity metric:

$$S_{W_e}(q, t) = \sum_{s \in M(q,t)} \frac{1}{\Pr_{W_e}(s)} = \sum_{s \in M(q,t)} \frac{|W_e|}{|\{p \in W_e \mid s \in R(p)\}|} \tag{4.3}$$

Using random sampling to approximate strand frequency  Calculating $\Pr_{W_e}$ is not feasible, due to its dependence on possessing $W_e$ (although $W_e$ is finite at any given time, it constantly grows). Instead we sample a large subset, denoted P, and use it to approximate $\Pr_{W_e}$:

$$\Pr_{W_e}(s) \simeq \frac{f(s)}{|P|} \tag{4.4}$$

where:

$$f(s) = \begin{cases} |\{p \in P \mid s \in R(p)\}| & s \in P \\ 1 & \text{otherwise} \end{cases}$$

P is randomly collected from $W_e$ using a crawling process that gathers binary procedures "in the wild", and as such contains equal representation of all supported compilers, optimization levels and architectures. Moreover, smaller sizes for P decrease the memory footprint of our method. We further discuss the composition and size of P in section 4.3.6. Note that all the calculations regarding the frequency of every strand s ∈ P and the (constant) |P| can be done offline, and using this data allows our similarity score calculation to be performed in a global context yet still scale. Applying eq. (4.4) to


eq. (4.3) results in eq. (4.2), shown before. We note that $\Pr_P$ is an approximation of $\Pr_{W_e}$, which is a probability. Since it is not feasible to compute $W_e$, we approximate f(s) = 1 for strands not collected in P. This prevents f(s)/|P| from being a true probability; however, it remains a reasonable approximation, as these are mostly rare strands that seldom appear in $W_e$. Some important properties of our similarity score are: (i) it is symmetric, allowing us to cache similarity scores for any pair for future matchings; (ii) the results of two similarity searches with the same query, each of which outputs a ranked list, can be joined.

Embarrassingly parallelizable approach  Our technique is based on pairwise strand hash comparison operations and the application of the (offline) global context to produce a similarity score. Thus we were able to deploy our tool on a powerful many-core setup, where each query-target pair comparison is assigned to a core. The global context is pre-computed and loaded by each of the cores independently. A pair's similarity score is thus produced by a single core in under a second, on average, independently of other comparisons. The procedure indexing process is parallelizable as well, where each procedure is indexed by a different core.
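Because each pair is scored independently, the comparison loop parallelizes trivially; a minimal sketch (reusing the hypothetical similarity helper from the earlier sketch; the multiprocessing layout is illustrative, not the deployed system):

from functools import partial
from multiprocessing import Pool

def score_pair(pair, f, corpus_size):
    R_q, R_t = pair                  # two strand-hash sets
    return similarity(R_q, R_t, f, corpus_size)

def score_all(pairs, f, corpus_size, cores=72):
    # the pre-computed global context (f, |P|) is loaded by every worker
    with Pool(cores) as pool:
        return pool.map(partial(score_pair, f=f, corpus_size=corpus_size),
                        pairs)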

4.3 Evaluation

We implemented a prototype of our approach in a tool called GitZ. Our evaluation aims to answer the following questions about it:

• How useful is GitZ in the vulnerability search scenario (section 4.3.3)?
• How well does GitZ scale (section 4.3.3)?
• How accurate is GitZ, and how does it perform (independently and altogether) in each of the problem vectors: different architectures, different compilers and different optimization levels (section 4.3.1)?
• How does each component of our solution affect the accuracy of GitZ (section 4.3.5)?
• How does GitZ compare to previous work (section 4.3.3)?

But first, we explain our corpus creation process and design goals, and give some details about our evaluation setup.

4.3.1 Creating a Corpus to Evaluate the Different Problem Vectors

To perform a thorough evaluation, we need to compare binaries for which the ground truth is known. To this end we create binaries according to our three problem vectors:

Different architectures  As illustrated by our motivating example (fig. 4.1), the same source code compiled to different architectures is inherently different. The instruction set is different, and even after lifting to an intermediate representation, the code remains different, due to different paradigms implemented by the architecture. For instance, Intel's x86_64 architecture allows instructions to operate over the higher 8-bit part of certain registers ([abcd]h), while ARM's AArch64 does not allow accessing data at that resolution, and requires further computation to extract the same value (see another example in fig. 4.4). To measure the accuracy of our technique in the cross-arch setting, our corpus included binaries from two widely used architectures: Intel x86_64 and ARM AArch64.

Different compilers  Different compilers produce binaries which differ immensely in syntax ([DR10]). Different compilers may use different registers, employ different instruction selection, order instructions differently, structure the code differently, etc. To evaluate our ability to overcome these differences, our


test corpus was compiled using prominent compilers from 3 different vendors, with several versions for each compiler.

Different optimization levels  Modern compilers apply various optimization methods. For instance, with the -O1 and -O2 optimization levels of the gcc compiler, as many as 40 different optimization passes are performed ([GNUa]). To see whether GitZ is able to identify similarity across optimizations, each binary was compiled using each of the optimization flags.

                                (a) GitZ-500K:           (b) GitZ-1500:      Esh-1500:
                                Cross-{Comp,Arch,Opt}    Cross-Comp          Cross-Comp         #VEX     #Canon Norm
#  CVE ID     Alias             #FPs  CROC  Time         #FPs  CROC  Time    #FPs  CROC  Time   Strands  Strands
1  2014-0160  Heartbleed        52    .999  15m          0     1     1s      0     1     19h    89       45
2  2014-6271  Shellshock        0     1     17m          0     1     3s      3     .996  15h    233      71
3  2015-3456  Venom             0     1     16m          0     1     1s      0     1     16h    47       20
4  2014-9295  Clobberin' Time   0     1     16m          0     1     2s      19    .956  16h    153      58
5  2014-7169  Shellshock #2     0     1     12m          0     1     2s      0     1     11h    359      104
6  2011-0444  WS-snmp           0     1     14m          0     1     1s      1     .997  10h    65       43
7  2014-4877  wget              0     1     10m          0     1     2s      0     1     15h    214      47
8  2015-6826  ffmpeg            0     1     17m          0     1     1s      0     1     20h    74       30
9  2014-8710  WS-statx          0     1     18m          0     1     2s      -     -     -      104      55

Table 4.1: CROC and #FP for the vulnerability search experiments: (a) over 500K procedures, cross-{arch, compiler, optimization}, and (b) a comparison of GitZ to Esh.

Compilation setups  In our evaluation we will use:

• Cx64 – The set of compilers targeting the Intel x86_64 architecture containing CLang 3.{4,5}, gcc 4.{6,8,9} and icc {14,15}.

• CARM – The set of compilers targeting the ARM AArch64 architecture containing aarch64-gcc 4.8 and aarch64-CLang 4.0.

• O – A set of optimization levels, -O{0,1,2,3,s}.

To this end we have created a utility named Compilator, which receives a code package as

input and compiles it with each of the configurations from {Cx64 ∪ CARM} × O, resulting in 44 binary versions of each procedure. We used Compilator to create ∼500K binary procedures from prominent open-source software packages, including OpenSSL, git, Coreutils, VideoLAN, bash, Wireshark, QEMU, wget and ffmpeg. Some packages were intentionally chosen because they contained a vulnerability (in a specific version), which was used in the experiments in table 4.1. We crawled the corpus to randomly select 1K procedures, and used their strands to build our global context P. Further details regarding the global context's composition and size can be found in section 4.3.6.
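For illustration, the configuration matrix fed to Compilator is the plain cross product of compiler setups and optimization levels (the identifier strings below are hypothetical labels, not actual tool invocations):

from itertools import product

C_X64 = ["CLang-3.4", "CLang-3.5", "gcc-4.6", "gcc-4.8", "gcc-4.9",
         "icc-14", "icc-15"]
C_ARM = ["aarch64-gcc-4.8", "aarch64-CLang-4.0"]
O = ["-O0", "-O1", "-O2", "-O3", "-Os"]

# each source package is built once per (compiler, optimization level) pair
configurations = list(product(C_X64 + C_ARM, O))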

4.3.2 Evaluation Setup

We performed all of our experiments on a machine with four Intel Xeon E5-2640 (2.90GHz) processors (72 cores) and 368 GiB of RAM, running Ubuntu 14.04.2 LTS. Each process uses at most 300 MiB during its run; overall, when all 72 cores are used, GitZ uses 23 GiB.

4.3.3 GitZ as a Scalable Vulnerability Search Tool

To evaluate GitZ in the vulnerable code search scenario, we used real-world vulnerabilities and searched for them in our corpus of procedures. The experiment shows how GitZ can be used by a security-savvy organization that wishes to discover whether it might be vulnerable to a newly found 0-day exploit. Table 4.1(a) details our main experiment, where 9 real-world vulnerable procedures from open-source projects are used as queries and searched against our full 500K-procedure corpus. The corpus contained 44 true positives (i.e., similar procedures originating from the same source code) for each query. For each procedure, the number of false positives, the overall accuracy expressed using the CROC measure, and the overall (rounded) runtime are specified. In all experiments but one, GitZ was able to rank all true positives above all unrelated procedures. For experiment #1 (Heartbleed), 52 false positives were ranked above a true positive (a 0.0001 FP rate). Table 4.1(b) further shows a comparison of GitZ to Esh (chapter 3). We recreated the experiment from section 3.5, which includes searching for the vulnerabilities in a corpus of 1500 procedures, across compilers (Esh did not apply to the cross-architecture or cross-optimization-level scenarios). The results reported in section 3.6 are presented here again and marked as Esh-1500, and the results of running the same experiment with GitZ appear alongside, marked GitZ-1500. Table 4.1(b) shows that GitZ is able to produce more accurate results, with 0 false positives, for the same scenario. Furthermore, since


Figure 4.3: Similarity score (as height) for the All vs. All experiment. Average CROC: .978, FP rate: .03, runtime: 1.1h.

Esh relies on a program verifier, its average runtime is 15.3 hours. GitZ, on the other hand, provides a speedup of 4 orders of magnitude, from tens of hours to an average runtime of 1.8 seconds. Finally, when we added a new vulnerability (CVE 2014-8710), which did not appear in the Esh experiment, GitZ was able to find all true positives with 0 false positives. There is a slight loss in precision for vulnerability #1 (Heartbleed) in the GitZ-500K experiment (compared to GitZ-1500), which is attributed to the more challenging cross-{arch, opt} scenario and the vastly larger corpus. The last two columns in table 4.1 show the number of (unique) raw VEX strands and canonicalized normalized strands extracted from each procedure. These numbers show the first positive effect of our normalization and canonicalization passes, as they help reduce the size of our procedure representation. These passes further affect accuracy, as explained later and shown in table 4.2.

4.3.4 All vs. All Comparison

To better evaluate GitZ's accuracy and performance over these challenging scenarios, we performed an experiment using a subset of 1K procedures, selected at random from our 500K corpus. The experiment was executed in an "All vs. All" setting, where every one of the thousand procedures picked is searched against all others (approximately 1 million comparisons). The experiment's goal was to evaluate our approach when the query and target binaries span all of the varying architectures, compilers and optimization levels. In this setting, GitZ reports an average CROC accuracy of .978, with an average FP rate of .03. The overall runtime for the experiment was 1.1 hours. Due to the size of the experiment, we cannot include a table showing the result for each of the procedures. Instead, we present the results in the form of a surface (fig. 4.3) showing the normalized similarity for 100 randomly selected queries. Figure 4.3 shows the normalized similarity result as surface height. Both axes have the same dataset: the list of experiments ordered by name and grouped together according to source procedure, i.e., all


compilations of the same procedures are coalesced. Important observations are: (i) The diagonal shows the ground truth (all procedures are matched with themselves) along with other compilations of the same procedure. These can be seen as "ridges" along the diagonal "wall". The similarity score for the same procedure compiled differently is, as expected, lower than that of the ground truth, and depends (for the most part) on how different the architectures, compilers and optimization flags are. (ii) The surface is symmetrical w.r.t. the diagonal. This is expected, as our similarity metric is symmetrical, i.e., S(q, t) = S(t, q), due to the use of an offline global context. (iii) Some "spikes" seem out of place. For instance, the invalidate_cache() procedure from dd.c seems to match an unrelated procedure and create a false positive. Upon closer examination, we learn that the matched procedure is iwrite(), also from dd.c, of which invalidate_cache() is in fact a callee. The matching occurs because in that specific compilation of iwrite(), the callee invalidate_cache() was inlined, and the entire procedure body resides inside the binary, thus explaining the match and asserting it as a true positive.

Evaluating all vectors, together and separately  Table 4.2 shows the accuracy of GitZ for several combinations of compilers, optimization levels and architectures in the aforementioned All vs. All experiment. These experiments are aimed at answering questions like: (i) "How well does our technique find similarity between less and more optimized code, produced by the same compiler?" (ii) "Which of the search settings (opt, comp, arch) is most challenging?" (iii) "Does the optimization level factor in when searching across compilers and architectures?" Table 4.2 presents the results from 28 different settings of the experiment, where: (i) the "Scenario" column groups similar experiments, (ii) the "Queries" and "Targets" columns specify the subset of compiler setups used to generate the queries and targets, and (iii) the CROC and FPr columns specify accuracy and false positive rate.

Similarity within compiler, architecture boundaries: Lines 4-8 of table 4.2 show that GitZ is adept at finding similarity whenever the code is compiled with the same compiler but a different optimization level, regardless of compiler vendor or even target architecture. Lines 9-18 show a slight loss of precision for the cross-compiler (but same architecture and optimization level) scenario. No particular optimization level proved to be more challenging than the others.

Challenging cross-architecture scenarios: Lines 2-3 and 19-28 in table 4.2 describe the cross-architecture scenario, where compiled queries targeting the AArch64 architecture are searched for in a corpus of x86_64 targets, and vice versa. Each experiment is represented using two lines in the table to emphasize the distinction made between architectures, i.e., when searching for an AArch64 query, the target corpus included only x86_64, and vice versa. The cross-architecture scenario presents the greatest challenge for GitZ. This challenge stems from different implementation paradigms for various operations and instructions, which result in strand mismatch even after canonicalization and normalization. Figure 4.4 depicts basic blocks taken (with some changes made for brevity of presentation) from the add_ref_tag() procedure in pack-objects.c, which is a part of version 2.10.0 of the git project.
The two blocks perform the same operation – preparing arguments for the procedure call packlist_find(&to_pack, peeled.hash, NULL). Argument preparation is as follows: (i) the address of the to_pack argument is stored into the x0 register in the AArch64 case (fig. 4.4(a), lines 1-2), and the edi register in the x86_64 case (fig. 4.4(b), line 3); (ii) NULL is stored into the x2 register in the AArch64 case (fig. 4.4(a), line 4), and the edx register in the x86_64 case (fig. 4.4(b), line 1); and (iii) the peeled.hash field, which belongs to the locally allocated peeled struct, is computed at offset 0x30 of the stack pointer and assigned into x1 in the AArch64 case (fig. 4.4(a), line 3), but assigned directly to the rsi register in the x86_64 case (fig. 4.4(b), line


Scenario             #    Queries               Targets               CROC   FPr
Cross-*              1    *                     *                     .977   .03
Cross-Arch,Opt       2    CARM -O*              Cx64 -O*              .963   .01
                     3    Cx64 -O*              CARM -O*
Cross-Opt,Version    4    gccx64 4.{6,8,9} -O*  gccx64 4.{6,8,9} -O*  .999   .001
                     5    iccx64 {14,15} -O*    iccx64 {14,15} -O*    .999   .001
                     6    CLangx64 3.{4,5} -O*  CLangx64 3.{4,5} -O*  1      0
                     7    gccARM 4.8 -O*        gccARM 4.8 -O*        1      0
                     8    CLangARM 4.0 -O*      CLangARM 4.0 -O*      1      0
Cross-Comp x86_64    9    Cx64 -Os              Cx64 -Os              .992   .001
                     10   Cx64 -O0              Cx64 -O0              .992   .001
                     11   Cx64 -O1              Cx64 -O1              .986   .002
                     12   Cx64 -O2              Cx64 -O2              .992   .001
                     13   Cx64 -O3              Cx64 -O3              .992   .001
Cross-Comp AArch64   14   CARM -Os              CARM -Os              .988   .002
                     15   CARM -O0              CARM -O0              .995   .001
                     16   CARM -O1              CARM -O1              .999   .001
                     17   CARM -O2              CARM -O2              .995   .001
                     18   CARM -O3              CARM -O3              .998   .001
Cross-Arch           19   Cx64 -Os              CARM -Os              .969   .006
                     20   CARM -Os              Cx64 -Os
                     21   Cx64 -O0              CARM -O0              .977   .004
                     22   CARM -O0              Cx64 -O0
                     23   Cx64 -O1              CARM -O1              .960   .006
                     24   CARM -O1              Cx64 -O1
                     25   Cx64 -O2              CARM -O2              .965   .005
                     26   CARM -O2              Cx64 -O2
                     27   Cx64 -O3              CARM -O3              .975   .004
                     28   CARM -O3              Cx64 -O3

Table 4.2: Accuracy and FP rate for different derivatives of the All vs. All experiment.


2), due to a different memory layout. Note that we intentionally left in the callee's name for clarity; however, procedure names are in no way used to establish similarity in GitZ, as they do not exist in the stripped binary setting.

(a) gcc 4.8 AArch64 -O1:
1  adrp x0, #to_pack
2  add x0, x0, #to_pack
3  add x1, sp, #0x30
4  mov x2, #0
5  bl packlist_find

(b) gcc 4.9 x86_64 -O1:
1  xor edx, edx
2  mov rsi, rsp
3  mov edi, offset to_pack
4  call packlist_find

Figure 4.4: add_ref_tag() compiled to different architectures.

Although most of the differences ((i) and (ii)) are bridged by GitZ, the different memory layouts across architectures hinder our approach's matching ability and result in some loss of precision.

Setting                    CROC (without GC)   CROC (with GC)
VEX                        .848                .879
LLVM                       .847                .878
Norm LLVM                  .905                .943
Canon LLVM                 .908                .954
Canon Norm VEX             .851                .875
Canon Norm LLVM            .927                .978

Figure 4.5: The average accuracy (CROC) of the All v. All experiment, when applying different parts of our solution.

Templating, Vectorization and Inlining: Some language, machine or compiler mechanisms may introduce vast syntactic changes to a procedure, which may also result in different semantics. Generic template procedures are written as a means to provide functionality over several types, and thus can be compiled to operate over different types. The same template procedure compiled by different compilers will differ even more when it is compiled for different types. We encountered template procedures in our evaluation; however, as the general algorithm performed by the template procedure is the same, GitZ was able to identify enough shared strands over the differently typed compilations to detect similarity. The same was observed for vectorized vs. non-vectorized assembly code of the same procedure. Inlining poses a challenge in establishing similarity, as it introduces the code of a different procedure into the body of the query or target. However, in our experiment, it resulted in relatively few wrong matches. Since the core functionality of the caller is preserved in its own strands, these strands were sufficient to capture the similarity of true positives, even when the call-graphs vary. In other cases, caller procedures were matched with callees that were inlined, which initially seemed like false positives. To correctly label these results we added a pass flagging caller-callee similarity as a true positive.


4.3.5 Comparing Components of the Solution

The effect of canonicalization and/or normalization  Figure 4.5 presents the average accuracy for the All vs. All experiment, in terms of the CROC metric (y-axis and below each bar), for different settings (the x-axis) of GitZ. The graph shows accuracy results when (incrementally) applying the different components of our technique, where the leftmost bar represents the accuracy of computing similarity by simply counting the number of shared strands over the procedures' VEX-IR strands, without any canonicalization or normalization, and without a global context. The rightmost bar shows accuracy when applying our complete approach, which is also the result reported in all previous experiments. Figure 4.5 is separated into blank bars and full bars, representing results without and with the application of a global context P. The following observations can be made from fig. 4.5: (i) The use of a global context uniformly increases precision for all settings. (ii) Normalization is vital in achieving syntactic equality, which is to be expected due to the high variance in register (and temporary) names originating from the architecture. (iii) The canonicalized, normalized scenario (2 rightmost bars) is highly affected by the global context, with a precision gain of 0.051 in CROC, which translates to a substantial false-positive-rate drop from 0.16 to 0.03. This emphasizes the beneficial dependence of this setting (which is the de facto setting for GitZ) on the global context, as normalization and canonicalization group together more strands, thus reducing their significance in $\Pr_{W_e}(s)$.

4.3.6 Understanding the Global Context Effect

Evaluating different compositions of P  To thoroughly understand how the global context affects our method's success at finding similar procedures, we experimented with running GitZ using different variations of P – the approximation of the global context.

Figure 4.6: Examining the importance and effect of using the global context by measuring the number of false positives as P grows.

Figure 4.6 depicts the average number of false positives as a function of the size of P, across 5 randomly selected experiments. In each experiment we selected one query from the All vs. All corpus


and searched for it within the 1K-sized corpus. Each experiment consisted of multiple runs, where a different (gradually growing) set of procedures was selected for P and used to perform the similarity score calculation. Figure 4.6 shows how, at the beginning of the experiment, each increase in the size of P is reflected in a major decrease in the number of false positives, until eventually a fixed point is reached at around 400 procedures. Although not expressed in the figure (for clarity and proportion), any further attempts to increase the size of P, even up to the size of our bigger 500K corpus, did not change the number of false positives in our experiments. Following these results we set our P size to 1K (2.5 times the largest effective size), and composed it by randomly picking a subset of this size from our 500K corpus. We proceeded to perform all of the experiments in this section with the said subset.

Delving into the Composition of the Global Context  After building our global context, P, we examined its contents more deeply and found two dominating subgroups of strands in it.

Common strands appearing in all architectures: One type of strand which occurs frequently, and in all examined architectures, is the stack-frame setup and destruction in the prologue and epilogue of some (mostly large) procedures. The frame setup is performed by adding to and subtracting from rsp and xsp in the Intel and ARM architectures respectively. Another common strand type is the saving and restoring of the callee-saved registers. We encountered this type of strand in different variations: some include partial use of the callee-saved registers, some use different stack offsets. Despite these variations, the strands were successfully detected as low-significance strands within the global context, thanks to the optimization and normalization stages. The transformation converted multiple strands of the said type into one syntactic form, across compilers and architectures, marking them as common.

Common strands unique to specific architectures: One reason for GitZ's lower precision in the cross-arch scenario (shown in table 4.2 and explained above) resides in the inherent differences between the architectures, differences that affect the representation of the global context. For example, one very common instruction encountered in Intel assembly code is xor some-register,same-register. This instruction simply puts the value of zero in a register, instead of using mov some-register, 0. This is used as a code-size optimization, as the xor operation on any register is represented using two bytes, while moving the immediate zero requires between three and eight bytes (for the instruction itself and the zero immediate). The ARM architecture aligns all instructions to a size of 4 bytes, and therefore such instruction-size maneuvers are not performed. Moreover, ARM contains a special 'virtual' register called zr (or r31), which always holds the value of zero and alleviates the need for a zero immediate altogether. Furthermore, the ARM architecture supports some very useful instructions not present in the Intel architecture. One example is the 'Compare Branch Zero' (cbz) instruction, which jumps to a specified offset if the operand register is equal to zero. Several Intel instructions can be used to represent this operation, i.e., cmp reg, 0; jz offset.
However, the flags register will be affected by the cmp instruction, essentially creating a different computation. In this scenario the Intel computation will contain a new flag storing the comparison's result, breaking the equivalence, and in turn causing the computations' representations in the strands to diverge and not match.

4.3.7 Limitations

GitZ relies on the premise that the same source code compiled differently will contain the same chains of execution (strands), and these can be extracted and transformed to an equivalent form through normalized


canonical strands (fig. 4.1), while reducing the importance of common strands using statistical reasoning (fig. 4.5). Our evaluation shows that equivalent strands will almost always be found across different compilations. However, similarity indeed cannot be established in some cases, and these become more frequent as the two compilations diverge.

Broken basic blocks: As compilations diverge, the resulting control flow graphs of the procedures differ from one another. Figure 4.7 depicts the source code (marked (a)) and the starting basic blocks of three compilations of wget's ftp_syst() procedure from the file ftp-basic.c (marked (b)-(d)). Two of the compilations (marked (b) and (c)) are performed by the same compiler (CLang) with different versions (3.5 and 3.4 respectively), while the third (marked (d)) is compiled using gcc 4.6. Figure 4.7 shows how the more recent versions of CLang operate similarly, as both versions 3.4 and 3.5 include the call to free() as part of the first basic block. The CLang compiler (in both versions) recognizes that both branch destinations of the conditional in fig. 4.7(a) line 8 arrive at a call to free(). gcc 4.6 is less optimized and does not use the fact that free() is called in both branches. This difference (among others) results in a lower similarity score (19 mutual strands for gcc 4.6 vs. CLang 3.5, as opposed to 40 mutual strands for CLang 3.4 vs. CLang 3.5). This illustrates a limitation of GitZ, as it operates at the boundaries of a basic block. The decision to break procedures at basic blocks was made due to performance and scalability concerns. Our previous representation, based on "tracelets" (chapter 2), which employs longer chains of execution, can be merged with this representation to overcome the limitation, at the price of reduced performance.

uerr_t ftp_syst (int csock, ...)
{
  ...
  /* Send SYST request. */
  request = ftp_request ("SYST", NULL);
  nwritten = fd_write (csock, request,
                       strlen (request), -1);
  if (nwritten < 0)
    {
      free (request);
      return WRITEFAILED;
    }
  free (request);
  ...
}

(a) Source code

(b) CLang 3.5 -O2:        (c) CLang 3.4 -O2:        (d) gcc 4.6 -O2:
mov r14, rsi              mov r14, rsi              mov rbp, rdi
mov r15, rdi              mov r15, rdi              mov edi, 0F93h
mov edi, 0D9Eh            mov edi, 0D8Eh            sub rsp, 28h
xor esi, esi              xor esi, esi              mov r12, rsi
call ftp_request          call ftp_request          xor esi, esi
mov rbp, rax              mov rbp, rax              call ftp_request
mov ebx, [r15]            mov ebx, [r15]            mov rdi, rax
mov rdi, rbp              mov rdi, rbp              mov rbx, rax
call strlen               call strlen               call strlen
mov edi, ebx              mov edi, ebx              mov edi, [rbp+0]
mov rsi, rbp              mov rsi, rbp              mov edx, eax
mov edx, eax              mov edx, eax              mov rsi, rbx
call iwrite               call iwrite               call iwrite
mov ebx, eax              mov ebx, eax              test eax, eax
mov rdi, rbp              mov rdi, rbp              mov rdi, rbx
call free                 call free                 js loc_C18
test ebx, ebx             test ebx, ebx             call free
mov ebp, 37h              mov ebp, 37h              ...
js loc_B20                js loc_B0D                call free
...                       ...                       ...

Figure 4.7: Source code and assembly for different compilations of ftp_syst().


Chapter 5

FirmUp: Precise Static Detection of Common Vulnerabilities in Firmware

In this chapter we show a real-world application of the binary code search method developed in chapter 3 and further improved in chapter 4: searching for previously unknown vulnerabilities in IoT firmware (device software).

Firmware  Embedded devices are closed systems, booting into a consolidated software package known as firmware. Vendors do not provide information about the composition of their firmware, and often the exact composition of the firmware image is not well known even to them [com16]. Furthermore, firmware is often customized and optimized to contain only parts of a library, to match the particular device on which it runs. Unpacking the firmware images and exploring the device's file-system usually provides little insight, as the executable files are often stripped of debug information for space considerations.

5.1 Overview

In this section, we illustrate our approach informally using an example.

5.1.1 Pairwise Procedure Similarity: The Syntactic Gap

(a) gcc v.5.2 -O2:
1  jalr t9
2  move s2, a0
3  move s5, v0
4  li v0, 0x1F
5  lw gp, 0x28+sp
6  bne s5, v0, 0x40E744
7  move v0, s5

(b) NETGEAR device firmware:
1  addiu a2, sp, 0x20
2  move s4, a1
3  jal 0x40B2AC
4  move s5, a0
5  li v1, 0x1F
6  beq v0, v1, 0x40B518
7  lui s6, 0x47

Figure 5.1: Wget ftp_retrieve_glob() vulnerability snippets.

Consider the MIPS assembly code snippets in fig. 5.1. Although syntactically very different and sharing no lines of code whatsoever, the snippets belong to the first basic block (BB) of the same ftp_retrieve_glob() procedure from Wget. The syntactic difference stems from having compiled the procedures in different settings and with different tool-chains. Figure 5.1(a) was compiled from


Technion - Computer Science Department - Ph.D. Thesis PHD-2020-04 - 2020 (a) Procedure–Centric Search (b) Executable–Centric Search Matched (a.0) Procedure url_ ftp_ ftp_ url_ parse() skey_ ftp_ retrieve_ get_ retrieve_ parse() get_ skey_ (a.1) resp() (b.1) retrieve_ glob() resp() ftp() glob() Unmatched glob() ftp() Procedure 푆푖푚 

= ? 푆푖푚 = 53 푆푖푚 = 71 푆푖푚 = 44  푆푖푚 = 51 푆푖푚 = 71 53  Similarity 78 Matching sub_ sub_ (with score) 491b00() 491b00() sub_ sub_ (a.2) sub_ (b.2)

443ee2() 443ee2() sub_ 443ee2() sub_ 4ea884() 4ea884()

Executable Figure 5.2: The search process for Wget’s vulnerable ftp_glob_retrieve() procedure, searched against a found-in-the-wild NETGEAR device firmware, in (a) procedure-centric search (on the left) and (b) executable-centric search (on the right). Technion - Computer Science Department - Ph.D. Thesis PHD-2020-04 - 2020 Technion - Computer Science Department - Ph.D. Thesis PHD-2020-04 Wget 1.15, using the gcc 5.2 compiler at -O2 (optimization level 2), while the compilation setting of fig. 5.1(b) is unknown as it belongs to a stripped firmware image of a NETGEAR device, which was crawled from the vendor’s public support site. Despite their syntactic variation, the snippets share much of the same semantics:

• Both snippets retrieve a value from the stack (line 5 in (a), line 1 in (b)).

• Both load the value 0x1F and use it in a jump comparison operation (lines 4 and 6 in (a), lines 5 and 6 in (b)).

• Both snippets call a procedure (line 1 in (a), line 3 in (b)).

Although not semantically equivalent, the procedures share great similarity, but finding this similarity is made difficult by the variance in instruction selection, ordering and register usage, as well as by different code and data layouts (offsets). We build upon the procedure representation detailed in chapter 4, and further refine the strand-comparison process to better suit the task at hand.

5.1.2 Efficient Partial Executable Matching

Similarity in the scope of a single procedure is inaccurate Although procedure-level similarity may achieve fair precision, in some scenarios using the additional information provided by the containing executable can significantly reduce the number of false matches.

Figure 5.2 illustrates our approach through the actual search process performed for the qv = ftp_retrieve_glob() query found in the Q = Wget executable, from fig. 5.1(a), searched for in a found-in-the-wild firmware image F of a NETGEAR device, partially featured in fig. 5.1(b). Initially, in fig. 5.2(a.0), an attempt is made to match the query procedure with a procedure from a target executable originating from the NETGEAR firmware image. Using a naive pairwise-similarity-based approach, the query procedure will be matched with the procedure in the target executable T with which it shares the highest similarity score Sim, that being the sub_443ee2() procedure. When a broader, executable-level similarity measure is used, as in fig. 5.2(a.1-2), this selection is found to be a mismatch, occurring only due to the larger size of sub_443ee2(), and because, having undergone different compilations, ftp_retrieve_glob() and its true positive sub_4ea884() share fewer strands and thus have a lower Sim score. We note that simply normalizing the score by the size of the procedure would not resolve this issue, as the size of a procedure is affected by many factors, most having little to do with the actual semantics. For example, a low optimization level can lead to the creation of a huge amount of code, a size-oriented optimization will shrink it, but a very high optimization level can again inflate the code size by loop unrolling. This mismatch demonstrates the limitation of procedure-centric approaches as, for example, GitZ (chapter 4) would falsely match ftp_retrieve_glob() with the sub_443ee2() procedure. (We elaborate further in our comparison to GitZ in section 5.4.3.) This mismatch can be identified by performing the reverse search, that is, searching for the most similar procedure to sub_443ee2() in the query executable (a.1). Doing so results in matching a different procedure, get_ftp(), showing that the original match is inconsistent, as one would expect to get the same result regardless of the search direction. Another approach to addressing this problem is to match all the procedures and thus establish a full matching between the executables. While a step in the right direction, this approach is severely limited by the assumption that the whole structure of the executable is similar, which is not always the case. Major differences in executable structure are often caused by the selected build configuration. For example, in our case the query executable was compiled using default settings, leading to skey_resp(),


a procedure handling the OPIE authentication scheme for sftp, being compiled into it, as seen in fig. 5.2(a.1). For reasons unknown to us, in this particular instance the vendor compiled Wget with the --disable-opie option, leading to the omission of this procedure from the target executable. This change can create a "domino effect", causing several procedures to be mismatched. Our approach focuses instead on the query procedure, in an attempt to avoid such inconsistent and inaccurate results.

Procedure similarity in the scope of an executable using back-and-forth games  Figure 5.2(b) illustrates the transition from a procedure-level similarity metric to an executable-level similarity metric, where the scope is broadened by the additional information in the query (b.1) and target (b.2) executables. This result is reached by using an algorithm implementing a back-and-forth game, which establishes and extends a more appropriate partial matching, restricted only by the requirement that it must contain the query procedure.

Outlining a matching from a two-player game  The matching in fig. 5.2(b) adheres to the moves of two participants, a player and a rival, in a back-and-forth game. The game starts with the player picking a

matching t1 = sub_443ee2() ∈ T for the query qv = ftp_retrieve_glob(), passing the turn to the rival, who then tries to pick a better matching for the target procedure t1. The rival selects q1 = get_ftp() ∈ Q as an alternative and preferable match for t1, since:

Sim(sub_443ee2(), get_ftp()) > Sim(sub_443ee2(), ftp_glob_retrieve())

, forcing the player to reiterate. The game proceeds as described in table 5.1, until the rival is left with no moves, as there are no higher-similarity picks for the procedures in the matching. At this point the query

procedure qv is matched with its true positive, t3 = sub_4ea884(). Back-and-forth games, along with an algorithm for producing partial matchings from games, are detailed in section 5.3.

Actor    Step                                                     Matching

player   Matches qv = ftp_retrieve_glob() with t1 = sub_443ee2()  {(qv, t1)}

rival    Matches t1 with q1 = get_ftp(),
         as Sim(q1, t1) = 71 > 53 = Sim(qv, t1)                   {(qv, t1)}

player   Counters by matching q1 with t1,
         and matching qv with t2 = sub_491b00()                   {(qv, t2), (q1, t1)}

rival    Matches t2 with q2 = parse_url(),
         as Sim(q2, t2) = 51 > 47 = Sim(qv, t2)                   {(qv, t2), (q1, t1)}

player   Counters by matching q2 with t2,
         and matching qv with t3 = sub_4ea884()                   {(qv, t3), (q2, t2), (q1, t1)}

rival    Left with no valid moves.                                Game over

Table 5.1: Game course for fig. 5.2.

5.2 Representing Firmware Executables

To represent firmware executables we build heavily on the representation developed in chapter 3 and chapter 4.


Building on the Angr.io Framework  Angr.io [SWS+16] is a well-maintained and robust framework, containing a wrapper for Valgrind's VEX-IR. As we were already using the Valgrind-based VEX representation [NS07] in our lifting tool-chain, we opted to use more parts of the Angr.io framework. Angr.io offers lifting from various architectures, which enabled us to operate over the most common architectures found throughout our firmware crawling process (section 5.4.1). We used IDA Pro [HR] for the parsing and extraction of procedures and BBs from executables, as we noticed that overall it is more accurate when tasked with finding all procedures and blocks in an executable.

Translating embedded architectures  Handling four different target architectures, MIPS32, ARM32, PPC32 and Intel-x86, and specifically executables originating from real firmware images, holds some caveats. First, many of the executables either had a corrupt executable and linkable format (ELF) header, or were distributed with the wrong ELFCLASS. Specifically, we found that MIPS64 executables (8-byte aligned instructions) with an ELFCLASS32 header are common in firmware. Another caveat, in MIPS executables, is the use of a branch delay slot, which requires an additional instruction to follow any branch instruction. This additional instruction is executed while the branch destination is being resolved. As a result, the first instruction of the subsequent block is omitted from it and placed as part of the preceding block, which leads to strand discrepancy. Finally, as mentioned, binary lifting tools may still fail to identify several blocks in some procedures, or even omit entire procedures altogether. This is exacerbated in the stripped scenario. We implemented means for overcoming these caveats and for further corroborating the results of the lifter, to avoid erroneous results. We added checks for CFG connectivity and coverage of unaccounted-for areas in the .text section of the ELF file. In an attempt to improve the underlying tools, we provided their creators with our implementations and findings, which were often adopted into the tools as patches or features.

Improving Canonicalized Normalized Strands Creation

We further improve the process described in section 4.2 for bringing semantically equivalent strands to the same syntactic form.

Offset elimination  We remove offset values that pertain to the concrete structure of the binary file. This includes jump addresses and addresses pointing into static sections (e.g., .data) of the executable. We do not remove offsets which pertain to the manipulation of stack and memory substructures, as they are more relevant to the semantics of the procedure, serving as a descriptor of the type of data the procedure handles.

Register folding  Registers which are read before being written to are translated as the arguments to the procedure, and the last value computed by the strand is the return value. We note that even if the value returned from a procedure is stored in a register, that value must first be generated in the procedure, and so will be captured in the strand.

Variable name normalization  To further advance a strand towards canonical form, we rename the variables appearing in the optimized strand according to the order in which they appear. This technique is inspired by previous work aimed at finding equivalence over computations [BCS97]. The output of this final stage is the canonical strands we use for our pairwise procedure similarity metric. We denote by Strands(p) the set of canonical strands (each strand represented as a string of its instructions) for a procedure p. Any reference to strands from this point onwards is to the canonical strands produced by this stage.

Example  Figure 5.3 shows an example of a (single) strand extracted from the BB in fig. 5.1(a), and


move s5, v0
li v0, 0x1F
bne s5, v0, 0x40E744

↓ Lift

v0 = external global i32
s5 = external global i32
pc = external global i32
define i32 strand() {
entry:
  t0 = v0
  store i32 s5, t0
  t1 = 0x1F
  store i32 v0, t1
  t0 = icmp ne i32 t0, t1
  br i1 t0, label if.true, label if.false
if.true:
  ret i32 0x40E744
if.false:
  ret i32 pc
}

↓ Optimize, Normalize

define i32 strand(i32 reg0, i32 offset0, i32 pc) {
entry:
  t0 = icmp ne i32 reg0, 0x1F
  br i1 t0, label if.true, label if.false
if.true:
  ret i32 offset0
if.false:
  ret i32 pc
}

Figure 5.3: An assembly strand computing the branch destination (top), its lifted LLVM-IR strand (middle), and its final canonical LLVM-IR form (bottom).

the corresponding lifted strand and canonical strand it was transformed to. The strand is responsible for deciding the branch destination. The lifted strand (in the middle) preserves the assembly operations verbatim, and adds a temporary variable for each value operation. Explicit register names and offsets are also kept. After optimizing and normalizing names and offsets (resulting in the bottom strand), the entire operation of the assembly strand (top) is reduced to the first instruction of the LLVM-IR canonical strand (bottom), comparing the normalized register reg0 (previously s5) with the 0x1F constant, which was folded out of the v0 register. The following instructions reflect the branch operation and return either the normalized offset offset0 or the next program counter value held in pc.
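As a toy illustration of the offset-elimination pass over strand text (the regular expression below is our own heuristic, not the thesis implementation: long hexadecimal constants are treated as code or section addresses, short ones as semantic constants):

import re

def eliminate_offsets(strand_ir):
    seen = {}
    def sub(match):
        addr = match.group(0)
        # name addresses offset0, offset1, ... in order of appearance
        seen.setdefault(addr, "offset{}".format(len(seen)))
        return seen[addr]
    return re.sub(r"0x[0-9A-Fa-f]{5,}", sub, strand_ir)

# e.g., "ret i32 0x40E744" becomes "ret i32 offset0", matching the canonical
# strand at the bottom of fig. 5.3, while the small constant 0x1F survives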

Pairwise Procedure Similarity After generating the set of strands for each procedure, we compute

pairwise similarity. We denote a query procedure, i.e., the procedure being searched for, as qv ∈ Q, where Q is the set of all procedures in the containing query executable. The target procedure, i.e., the candidate procedure that q is being compared to, is denoted by t ∈ T (similarly, T being the containing target executable). Given a pair (q, t), we define procedure similarity as follows:

Sim(q, t) = |Strands(q) ∩ Strands(t)|,

i.e., Sim(q, t) is simply the number of unique canonical strands shared by the two representations. To calculate Sim faster, we keep the procedure representation as a set of hashed strands (without consideration of hash counts).
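A minimal sketch of this hashed representation and of Sim, assuming Strands(p) yields each canonical strand as a string (helper names are hypothetical):

import hashlib

def hashed_representation(canonical_strands):
    # a set: duplicate strands are not counted
    return {hashlib.md5(s.encode()).hexdigest() for s in canonical_strands}

def sim(R_q, R_t):
    # Sim(q, t): the number of unique canonical strands shared by q and t
    return len(R_q & R_t)

# R_q = hashed_representation(Strands(q))
# R_t = hashed_representation(Strands(t))
# score = sim(R_q, R_t)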


5.3 Binary Similarity as a Back-and-Forth Game

5.3.1 Game Motivation

Procedure-centric matching is insufficient  We first show the intuition behind using a game algorithm as a means to find procedure similarity while accounting for the surrounding context. The goal is to transition from a local match, which relies on a local maximum of the similarity score, to a global one. This is done by composing a matching that is expected to maximize the overall similarity across the binaries.

Figure 5.4 illustrates a conceptual example of matching a procedure q1 ∈ Q with a procedure in an executable T. The notation is similar to fig. 5.2, where squares represent executables, which are sets of procedures, and circles are procedures, which are sets of strands denoted s_i.

[Figure 5.4: two matching scenarios between procedures q1, q2 ∈ Q and t1, t2 ∈ T, with similarity scores Sim = 3, Sim = 4 and Sim = 2; panel (a) shows procedure-centric matching, panel (b) shows executable-centric matching.]

Figure 5.4: Two approaches for finding a similar procedure to q1 ∈ Q, within the executable T. The procedure-centric approach on the left leads to a less accurate matching. Strands {s1, s2, s3, s4, s5} are spread over all procedures.

The procedure-centric approach in fig. 5.4(a) immediately matches q1 with t1, as it has the most shared strands. We denote by q ∼^{PC}_{Q,T} t the matching of the procedure q with the procedure t that has the maximal similarity score Sim(q, t) in T. The q1 ∼^{PC}_{Q,T} t1 matching ignores the neighboring procedures in the executables, resulting in a sub-optimal match. The existence of q2 and t2, and the higher similarity score of q2 and t2, indicate that t2 is a more correct matching for q1 in a global view. (This is also the case in fig. 5.2(a), resulting in a false matching for a real-world example.)

Knowledge of the surrounding executable leads to better matching To induce a better matching, we constrain the matching process with the rules of a two-player game, inspired by model theory's notion of back-and-forth games, also called Ehrenfeucht-Fraïssé games [Ehr61]. The main premise of the game is that the player trying to match a procedure qv is countered by a rival. Applying these notions to our example creates the game scenario depicted in fig. 5.4(b):

• The player performs its first selection, t1 ∈ T, as it is the best match for q1. The rival tries "to show the player it was wrong" by selecting a different procedure q2 ∈ Q with a higher similarity score Sim(q2, t1) > Sim(q1, t1).



• The player corrects its selection by adopting (q2, t1) as part of the matching and choosing another procedure t2 ∈ T as a pairwise matching for q1.

• Following this selection, the rival cannot find a better match for t2 and “forfeits the game”.

Formalizing the intuition behind the game Writing ζ as shorthand for the score Sim(q, t), we define q ∼_{Q,T} t in eq. (5.1):

q ∼_{Q,T} t  ⟺
    ∀qi ∈ Q, qi ≠ q ⇒ [ Sim(qi, t) < ζ  ∨  ( ∃ti′ ∈ T, ti′ ≠ t, Sim(qi, ti′) ≥ Sim(qi, t)  ⇒  qi ∼_{Q,T} ti′ ) ]
  ∧ ∀tj ∈ T, tj ≠ t ⇒ [ Sim(q, tj) < ζ  ∨  ( ∃qj′ ∈ Q, qj′ ≠ q, Sim(qj′, tj) ≥ Sim(q, tj)  ⇒  qj′ ∼_{Q,T} tj ) ]        (5.1)

q ∼_{Q,T} t expresses the back-and-forth selection process, in which the best match for q will be ignored if there exists a better match between other procedures in Q and T. Under this definition, each matched procedure pair (q, t) is the best match among all of the procedures in T and Q, while other procedures are explicitly allowed to be a better match if and only if these other procedures have a different, better match themselves. The first big parentheses express the possibility that another procedure, qi ≠ q, is a better local match for t, meaning that Sim(q, t) < Sim(qi, t). This is allowed as long as qi has another, better global match ti′ ∈ T, so that qi ∼_{Q,T} ti′. The second big parentheses express the same possibility in the other direction: there can be tj ≠ t for which Sim(q, t) < Sim(q, tj), as long as tj has another better match qj′ ∈ Q, so that qj′ ∼_{Q,T} tj.

It is important to note that these requirements do not force a full matching between the executables; they only create the additional matches between the executable procedures needed to allow for a consistent match (and for the player to win). Implementing the matching algorithm according to this game, where the player aims to win and thus needs to account for the context of the whole executable, leads to better matchings.
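The definition can be read as a consistency check over a proposed partial matching. The following Python sketch expresses eq. (5.1) declaratively, under the assumption that sim is a precomputed score table over (query, target) pairs and matches stores the matching in both directions; it illustrates the predicate, not our search procedure:

def is_consistent_match(q, t, Q, T, sim, matches):
    zeta = sim[q, t]  # zeta is shorthand for Sim(q, t), as in the text
    for qi in Q:      # first conjunct: rival queries competing over t
        if qi == q or sim[qi, t] < zeta:
            continue
        ti = matches.get(qi)  # qi needs its own at-least-as-good match
        if ti is None or ti == t or sim[qi, ti] < sim[qi, t]:
            return False
    for tj in T:      # second conjunct: rival targets competing over q
        if tj == t or sim[q, tj] < zeta:
            continue
        qj = matches.get(tj)
        if qj is None or qj == q or sim[qj, tj] < sim[q, tj]:
            return False
    return True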

5.3.2 Game Algorithm

Algorithm 7 details our similarity game algorithm. Because the algorithm implements the winning strategy for the player trying to match a query, it is not an explicit two-step algorithm performed by two players. The algorithm accepts as input the query executable, the query procedure, and the target executable. Each executable is represented as a set of procedures, and each procedure is represented by the set of its strands, extracted as described in section 3.3.2.


Algorithm 7: Similarity Game Algorithm.

Input: T – the target executable procedures; Q – the query executable containing qv; qv – the vulnerable procedure in Q
Output: Matches – the resulting partial matching, containing (at least) qv

1   Matches ← {}
2   ToMatch ← Stack(); ToMatch.Push(qv)
3   while GameDidntEnd() do
4       M ← ToMatch.Peek()
5       if M ∈ Q then
6           My, Other ← Q, T
7       else
8           My, Other ← T, Q
9       Forward ← GetBestMatch(M, Other, Matches)
10      Back ← GetBestMatch(Forward, My, Matches)
11      if M == Back then
12          Matches += M ↔ Forward
13          ToMatch.Pop()
14      else
15          ToMatch.PushIfNotExists([Forward, Back])

In line 1 we create an empty dictionary, Matches, which will accumulate all the matched procedures, i.e., procedures q ∈ Q, t ∈ T for which q ∼_{Q,T} t. In line 2 we initialize a stack, ToMatch, to hold all the procedures we are trying to match, and push qv, as this is the primary procedure we need to match. Following the initializations we start the matching game, expressed as a while loop, which ends when the game ends. Game-ending conditions are checked by the GameDidntEnd() procedure, described later. In each loop iteration, or step in the game, we try to match the procedure at the head of the ToMatch stack (retrieved by Peek() in line 4). We resolve the direction in which we will perform the match by checking whether it is part of Q or T, and setting My and Other accordingly. We then perform a forward match, i.e., we search for the best match for M in Other, while ignoring all previously matched procedures. This is done by calling GetBestMatch() (line 9). Using the best match for M, Forward, we perform the backwards match, and store the result in Back (line 10). Collecting the best matches in both directions gives us two possibilities:

• Back is M, meaning that M ∼_{Q,T} Forward (checked in line 11). In that case this match is added to Matches, and M is popped from the stack (lines 12-13).

• Otherwise, we need to find a match for Forward or Back before we can claim a match for M. To do that, in PushIfNotExists(), we check whether Forward and Back are already in the stack, and push the missing procedures if they are not.

The implementation of GameDidntEnd() checks for one of the following conditions:

• A match was found for qv

• ToMatch has arrived at a fixed state. This happens when no new procedures are pushed to ToMatch in PushIfNotExists() (line 15). In this case the game would otherwise never end, meaning that the matching process has not succeeded.


• As a heuristic, the game can also be stopped if too many matches were found or ToMatch contains too many procedures.
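To make the algorithm concrete, the following Python sketch implements the loop of Algorithm 7 under simplifying assumptions: sim is a precomputed score table over (query, target) pairs, a step budget stands in for the "too many matches" heuristic, and the function names are ours:

def game_match(Q, T, qv, sim, max_steps=1000):
    def best_match(m, others, matches):
        free = [o for o in others if o not in matches]
        score = (lambda o: sim[m, o]) if m in Q else (lambda o: sim[o, m])
        return max(free, key=score, default=None)

    matches, to_match = {}, [qv]          # Matches, and the ToMatch stack
    for _ in range(max_steps):            # step budget: stopping heuristic
        if qv in matches or not to_match:
            break                         # a match for qv was found
        m = to_match[-1]                  # Peek()
        my, other = (Q, T) if m in Q else (T, Q)
        forward = best_match(m, other, matches)
        if forward is None:
            break
        back = best_match(forward, my, matches)
        if m == back:                     # M == Back: adopt the match
            matches[m], matches[forward] = forward, m
            to_match.pop()
        else:                             # PushIfNotExists([Forward, Back])
            pushed = [p for p in (forward, back)
                      if p is not None and p not in to_match]
            if not pushed:
                break                     # fixed state: matching failed
            to_match.extend(pushed)
    return matches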

5.3.3 Graph-based Approaches

[Figure 5.5: (a) query call-graph and (b) target call-graph, both centered on ftp_retrieve_glob().]

Figure 5.5: The call-graphs surrounding Wget's ftp_retrieve_glob() procedure in the MIPS query (left) and a target from a NETGEAR firmware (right). The variance in call-graph structure prevents graph-based techniques (e.g., BinDiff) from succeeding.

Limitations of the graph-based approaches Using a graph-based technique, which tries to leverage relationships over the procedures in the form of a call-graph, is problematic in our scenario. Figure 5.5 depicts two samples of call-graphs, restricted to one level below and two levels above the ftp_retrieve_glob() procedure from Wget. The sample on the left belongs to the MIPS query executable, and the one on the right originated from a NETGEAR firmware. Even in this limited scope, the variance in call-graph structure is vast. This can be attributed to the customized nature of firmware, along with compiler inlining and dynamic call targets, which produce high variance over the target call-graph, reducing the accuracy of graph-based matching. This is further demonstrated in our evaluation (section 5.4.3), showing that BinDiff's graph-based technique results in poor precision.

5.4 Evaluation

We evaluate our approach using FirmUp in the scenario of identifying vulnerable procedures in real-world firmware images, comparing it to prominent existing work in the same scenario. The evaluation is thus composed of (i) a description of the queries and the corpus creation process (section 5.4.1), (ii) our main experiment, in which vulnerable procedures were located in publicly available firmware images (section 5.4.2), and (iii) a controlled experiment evaluating FirmUp's accuracy alone and in comparison to BinDiff and GitZ (chapter 4).

5.4.1 Corpus Creation

Into the wild To simulate a real-world vulnerability search scenario, we created a crawler to collect, unpack, organize and index executables in firmware. Our crawler downloads firmware images from publicly available repositories of prominent device vendors, including NETGEAR [NET], D-Link [DLI] and ASUS [ASU]. We used binwalk [ReF] for unpacking firmware images. Out of ∼5000 firmware


images collected, only ∼2000 produced usable executables, while the others failed to unpack or consisted only of content or configuration. After unpacking, the crawled images amounted to ∼200,000 executables. Each executable contained ∼200 procedures on average, resulting in a dataset containing ∼40,000,000 procedures. Finally, the procedures were indexed as sets of strands (as described in section 3.3.2).

Evaluation setup Experiments were performed on an Ubuntu 16.04 machine with two Xeon E5-2699V3 CPUs (36 cores, 72 threads), 368 GiB of RAM, and a 14 TB HDD. The overall memory consumption of each thread did not exceed 560 MiB.

Using prominent open-source CVEs as queries To evaluate FirmUp's ability to locate vulnerable procedures, we gathered query procedures known to be vulnerable to some attack according to a CVE report. We put an emphasis on widely distributed software as: (i) we wanted to adhere to a real-world scenario, and (ii) this made a procedure more likely to be used by some device and thus be found in our corpus of crawled firmware. To produce each search query, we used the latest vulnerable version of the software package as reported by the CVE and compiled it with gcc 5.2 at the default optimization level (usually -O2). We intentionally chose a range of diverse procedures, affected by different vulnerability types, including DoS due to a crafted message (table 5.2, lines 1 and 5), buffer overflow (lines 2 and 7), input validation (line 3), information disclosure (line 4), and path traversal (line 6).

5.4.2 Finding Undiscovered CVEs in Firmware

Table 5.2 shows the results of our experiments with found-in-the-wild firmware images. 373 vulnerable procedures were found in publicly available firmware images, 147 of them in the latest available firmware version for the device. We note that these experiments included stripped procedures only, to adhere to a real-world scenario. The use of stripped firmware also led to new findings, pointing to the exact location of the vulnerable procedure.

Each line in table 5.2 depicts an experiment in which a vulnerable procedure was searched for in our corpus of firmware. Line 1 of the table, for instance, depicts the results of the experiment designed to search for the vulnerable procedure in CVE-2011-0762. This vulnerability was discovered in the "Very Secure FTP Daemon" (vsftpd) package. The relevant executable from the package, the FTP server (aptly named vsftpd), was found by FirmUp in 76 firmware images. In this experiment our engine correctly found the vulnerable vsf_filename_passes_filter() procedure in all of the suspect executables. The FPs column counts the procedures wrongly matched by FirmUp to the vulnerable query, which is 0 for this experiment. The next column lists the vendors of the devices in which the CVE was discovered, and the column to its right gives the number of devices for which the latest available firmware version is vulnerable. The rightmost column shows the average overall (user+kernel mode) run time, across three runs, for that experiment.

Confirming findings Procedures matched by FirmUp to a vulnerable query procedure were confirmed in a semi-manual fashion. We used a combination of markers such as string constants, use of global memory, structure accesses, and others more specific to the vulnerable procedure. Additional matched procedures (excluding the query) were used to further assist the validation process. We then grouped procedures according to their effective address in the executable (the exact same executable sometimes appeared in several images) and manually verified the result. We found that subsequent versions of firmware images targeting the same device sometimes do not re-compile executables that are not part of the software updated in that release.

We note that immense manual effort would be required to rule out vulnerable procedures not found by FirmUp in this wild-objects scenario. Instead, to measure the precision of our tool, we performed a controlled experiment, described in section 5.4.3.


Table 5.2: Confirmed vulnerable procedures found by FirmUp in publicly available, stripped firmware images.

#  CVE        Package  Procedure                   Confirmed  FPs  Affected Vendors       Latest  ~Time
1  2011-0762  vsftpd   vsf_filename_passes_filter  76         0    ASUS, NETGEAR          29      2m
2  2009-4593  bftpd    bftpdutmp_log               63         0    NETGEAR                15      4m
3  2012-0036  libcurl  curl_easy_unescape          1          0    NETGEAR                0       12s
4  2013-1944  libcurl  tailmatch                   5          0    ASUS, D-Link           2       1m
5  2013-2168  dbus     printf_string_upper_bound   10         0    D-Link, NETGEAR        5       7m
6  2014-4877  wget     ftp_retrieve_glob           69         14   ASUS, NETGEAR          35      18m
7  2016-8618  libcurl  alloc_addbyter              149        0    ASUS, D-Link, NETGEAR  61      25m

Noteworthy findings False positives: Line 6 of table 5.2 shows the experiment with the highest false positive rate overall (and the only experiment to report false positives). We used the wget version 1.15 executable as the query sample. Analysis of the results showed that the 14 FPs were a result of discrepancies between versions, e.g., the target executable was a previous (1.12) version of wget. The notable semantic (and subsequently syntactic) differences between the versions of the tool were the main reason that the query procedure was falsely matched with an unrelated target procedure.

Deprecated procedures: Line 3 of table 5.2 details the result for the curl_easy_unescape() procedure in libcurl. FirmUp claimed to locate one supposedly non-stripped sample of this procedure. This was an unexpected result, as curl_easy_unescape() is exported in libcurl. In-depth examination uncovered that this is not a stripping error but an example of a vendor using an old version of curl containing a predecessor of the query procedure, curl_unescape(), which was long since deprecated. The deprecated procedure in this experiment matched our query. We were surprised to learn that vendors sometimes use very outdated versions of software in their firmware, despite many CVEs having been issued for them (more than 39! [cur]). The firmware in question was released in March 2014; however, the curl_unescape() procedure was deprecated in June 2006 (curl version 7.15.4).

5.4.3 Evaluating Precision Against Previous Work

Refining Wild Data for a Controlled Experiment We designed a controlled experiment to accurately evaluate our approach and enable a fair comparison to previous work. We decided to avoid the use of synthesized targets, as it was infeasible to create a collection of tool chains diversified enough to accurately approximate the wild data. Instead, we used a subset of a corpus containing executables where the vulnerable procedures were labeled. These were split into two groups:

1. Non-stripped executables - these were not very common, as most firmware software is stripped to save space. Some non-stripped versions were found in early releases, probably left there to allow for debugging. 2. Exported procedures - when the query is an exported procedure, it can be easily located in the binary even when the executable is stripped. This case is very common for libraries where many procedures are exported, for example the procedure exif_entry_get_value() from libexif, which was affected by CVE-2012-2841 but does not appear in table 5.2 as the table includes results over stripped procedures only.

Limiting our experiment to these types of procedures reduced our overall number of targets, but allowed us to accurately report precision for each tool. To increase the scope of the experiment, we added two more CVEs belonging to the second group, from the libexif and net-snmp code packages.

Comparison to BinDiff First we experimented with BinDiff, which is the de facto industry standard for finding similarity between procedures in whole binaries [zyna]. It operates over two binaries at a time (a query and a target), scanning the procedures of both input files and producing a mapping between the two binaries. This mapping matches a subset of the procedures in one binary with a subset of the other, using the control structure of the procedures, the call graph, and syntactic matching of instructions (further information can be found in [zynb]). Note that for this scenario we could only run a subset of our experiment, from the first group of labeled data described above (non-stripped, non-exported). The other group could not be used because BinDiff, in its similarity score calculation process, attributes great importance to the procedure name when it exists, and as we found no visible way of configuring BinDiff to ignore this information, we could not objectively assess its accuracy for these stripped scenarios.


[Figure 5.6: per-query P/FN/FP counts for BinDiff (left) and FirmUp (right) over the queries tailmatch(), dbus_printf(), alloc_addbyter(), ftp_retrieve_glob() and vsf_filename_filter().]

Figure 5.6: Labeled experiment comparing BinDiff with FirmUp, showing that BinDiff encountered over 69.3% false results overall compared with 6% for FirmUp.

Figure 5.6 shows the results of the labeled experiments, using FirmUp (right) and BinDiff (left). Each line represents an aggregation of all of the results reported by each tool for a given query against all the relevant targets, split into (true) positives, false negatives and false positives, marked respectively as (P), (FN) and (FP). Note that for BinDiff we consider an unmatched procedure to be a false negative (because we know it is there), and a matched procedure to be a positive or a false positive depending on the correctness of the match. Aggregating the results, we see that BinDiff suffers from a very high FP rate, mismatching 85 out of 150 targets (FPr = 56.6%). It further suffers from a false negative rate of 12%, and only 32% positive matches. For the same dataset, FirmUp successfully matches 141 targets, achieving a 96% success rate, with only 4 (2.6%) false positives and 5 (3.33%) false negatives. The biggest difference in accuracy was measured in the vsf_filename_filter() experiment (bottom line), in which BinDiff succeeded in matching only 26% of the targets, while FirmUp achieved a perfect result. In the next experiment, ftp_retrieve_glob() (one line above), BinDiff detects 5 procedures, yet falsely matches all of the rest, 37 out of 42; FirmUp successfully identifies 38 targets. To better understand why BinDiff performed so poorly in the vsf_filename_filter() experiment, we more closely examined a false positive it reported.

Figure 5.7(b) shows the CFG of the vulnerable procedure, qv. In this graph the nodes are the basic blocks, shown without their contents (the code) for readability. The edges are colored green and red for true and false branches respectively, and blue for unconditional control-flow. The CFG in fig. 5.7(c) closely resembles that of fig. 5.7(b), causing BinDiff to report it as a match. This is a false match, as this CFG

represents an unrelated procedure in T, which we denote by tj. FirmUp, on the other hand, performs a semantic comparison, relying on shared canonical strands in the blocks of qv and tv, and correctly reports the procedure in fig. 5.7(a) as the best match.

Comparison to GitZ Next, we experimented with GitZ (chapter 4). We note that GitZ was not designed for our testing scenario, as it compares procedures while disregarding the origin executable. Moreover, there is no notion of a positive or negative match; instead, GitZ accepts a single query and a set of targets and returns a list of target procedures ordered by decreasing similarity.


[Figure 5.7 panels: (a) FirmUp's match, tv (correct); (b) the query, qv; (c) BinDiff's match, tj (incorrect).]

Figure 5.7: BinDiff’s disproportionate focus on the procedure’s internal structure leads to false matches for vsf_filename_filter()


[Figure 5.8: per-query P/FP counts for GitZ (left) and FirmUp (right) over the queries tailmatch(), dbus_printf(), alloc_addbyter(), ftp_retrieve_glob(), vsf_filename_filter(), snmp_pdu_parse(), bftpdutmp_log_0(), curl_easy_unescape() and exif_entry_get_value().]

Figure 5.8: Labeled experiment comparing GitZ with FirmUp, showing that GitZ encountered 34% false positives overall compared with 9.88% for FirmUp

In the comparison experiment, we ran each query against all the procedures in each target executable, and considered the first result (top-1) to conclude whether the match was a positive or a negative. This result is presented in fig. 5.8. We further explored accuracy for the top-k results produced by GitZ; however, we note that this scenario is much more time consuming for a human researcher, so we did not include top-k results for FirmUp. The top-k comparison is reflected in fig. 5.9, as explained below. To maximize GitZ's precision and allow for a fair comparison, we trained a global context (a set of randomly sampled procedures in the wild, used to statistically estimate the significance of a strand; see section 3.3.3 for more details) for each architecture separately, using more than a thousand procedures for each one, and using the right context for each comparison. Note that for this experiment we used queries from both groups mentioned above.

Figure 5.8 shows the results of the labeled experiments for GitZ (left) and FirmUp (right), similarly to fig. 5.6. Note that in this comparison we treated false positives and false negatives together, marking both as false positives in the figure. GitZ suffers from relatively poor precision, reporting 274 false positives, compared with 78 false positives and false negatives combined reported by FirmUp. For example, in the ftp_retrieve_glob experiment in line 4, GitZ does not report any true positives, while FirmUp successfully detects 38 of the 42 targets. Furthermore, in the curl_easy_unescape() experiment in line 8, GitZ only managed to detect 104 targets out of 264, while FirmUp detected 171.

Next, we examine the contribution of our back-and-forth approach. Figure 5.9 shows the number of game steps required to perform matches. While 493 of the queries could be matched in one iteration of the game algorithm, 115 procedures (or 19% of the 608 matched targets) required between 1 and 32 additional steps to arrive at the positive match. When a single iteration was sufficient, strands were shown to be effective in bridging the syntactic gap over the procedures. When more steps were needed, a matching process (as in table 5.1) took place to account for surrounding procedures, correcting matches accordingly. These cases stem from a range of scenarios, including:

• Very large procedures that are mistakenly matched with the query due to their size.

• The number of shared strands between the query and the true positive was very small.

• The true positive and several false positives share an equal number of strands with the query.


Figure 5.9: Number of procedures correctly matched as a function of the number of steps in the back-and-forth game (115 procedures required more than one step, up to 32 steps):

# game steps needed   1    2   3-4  5-8  9-16  17-32
# correct matches     493  50  31   12   15    7

These cases are resolved by our algorithm, which effectively searches for replacement matches for all candidates, thus eliminating wrong matches. Without this iterative matching process, the overall precision drops from 90.11% to 67.3%. Note that fig. 5.9 also allows us to evaluate FirmUp vs. GitZ while considering the top-k results. Since both tools rely on shared strands (with further optimizations performed in FirmUp as described in section 5.2), the top-k accuracy of GitZ can be approximated by aggregating the columns of fig. 5.9. For instance, by observing the 2-game-steps column in fig. 5.9, we can see that considering the top-2 results from GitZ would reduce the number of false positives by approximately 50.



Chapter 6

Similarity by Neural Name Prediction

In this chapter we perform a major leap in our efforts to establish binary code similarity: we no longer assume that the binary procedures originated from the same source code.

Problem definition Given a nameless assembly procedure X residing in a stripped (containing no debug information) executable, our goal is to predict a likely and descriptive name Y = y1, ..., ym, where y1, ..., ym are the subtokens composing Y. Thus, our goal is to model P(Y | X). For example, for the name Y = create_server_socket, the subtokens y1, ..., ym that we aim to predict are create, server and socket, respectively. The problem of predicting a meaningful name for a given procedure can be viewed as a translation task: translating from assembly code to natural language. While this high-level view of the problem is shared with previous work (e.g., [APS16, AZLY19, ABLY19]), the technical challenges are vastly different due to the different characteristics of binaries.

Challenge 1: Little syntactic information Assembly code from a stripped procedure is a sequence of instructions lacking any information regarding variable types or names. Learning descriptive names for such a low-level stripped representation is a challenging task. A naïve approach, where the flat sequence of assembly instructions is fed into a sequence-to-sequence (seq2seq) architecture [LPM15, VSP+17], yields imprecise results (21.72 F1 score), as we show in Section 6.5.

Challenge 2: Long procedure names Procedure names in compiled C code are often long, as they encode information that would be part of a typed function signature in a higher-level language (e.g., AccessCheckByTypeResultListAndAuditAlarmByHandleA in the Win32 API); methods which attempt to directly predict a full label as a single word from a vocabulary, such as [HIT+18], will be inherently imprecise.

Predicting meaningful names for binary procedures translates the binary code search problem into a simple text search problem: given a textual query, one can search for it in a dataset of named binary procedures.

6.1 Overview

In this section, we illustrate our approach informally and further explain the challenges in binary procedure name prediction using an example.

Using Imported Procedures Given an unknown binary procedure, P, stripped of debug symbols, our goal is to suggest a meaningful and likely name that describes its purpose. We focus our example on Intel 64-bit executables running on the Linux OS, but the same process can be applied to other architectures and OSs.


[Figure 6.1 content: (a) an assembly listing with pointer-aware slices feeding the reconstructed call sites getaddrinfo(rdi,rsi,rdx,rcx) and setsockopt(rdi,rsi,rdx,rcx,r8); (b) the augmented call site sequence memset(STK,0,48), getaddrinfo(ARG,ARG,STK,STK), socket(STK,STK,STK), setsockopt(RET,1,2,STK,4); (c) the predicted name create_server_socket.]

Figure 6.1: (a) Reconstructing calls to external procedures and creating pointer-aware slices; (b) representing the procedure as a set of augmented call sites; (c) predicting procedure names by training neural sequence-to-sequence (seq2seq) models on the set of augmented call site sequences.

After disassembling P's code, the procedure is transformed into a sequence of assembly instructions. We draw our initial intuition from the way a human reverse engineer skims this sequence of instructions. To understand what the code does, the most informative pieces of information are calls to procedures that were dynamically linked to the examined executable, e.g., call _getaddrinfo (line 8) and call _setsockopt (line 17) in Figure 6.1(a). The API names _getaddrinfo and _setsockopt can be statically resolved using the information put in the ELF header by the linker to help the loader find and dynamically load libraries. While some malware may obfuscate the API names, the values of the arguments passed when calling these external procedures must remain intact. In Section 6.5 we show how API name obfuscation affects prediction results when different representations for binary procedures are used.

Reconstructing Call Sites Using Pointer-Aware Slicing We retrieve the number of arguments passed to each procedure by examining debug symbols for the libraries used by the executable. Following the calling convention used in the executable, we map each argument to the register used to pass it. Finally, we construct a call site that is very similar to a call site in higher-level languages, as shown in fig. 6.1(a). This reconstructed call site contains the API name and the registers used as arguments for the API call. To obtain additional information from these call sites, we examine how the value of each argument was computed. To gather this information, we compute a static slice of each register at the call location in the procedure. As some arguments are pointers, we perform a pointer-aware static program slice. This process is explained in section 6.3. Each register of a call site in fig. 6.1(a) is connected by an arrow to the slice of P used to compute its value.

Augmenting Call Sites Using Abstract Values Several informative slices can be seen in fig. 6.1(a). In (1), the constant value 1 is assigned directly to rsi. This means that we can determine the concrete value of the register used as an argument to setsockopt at the call location. In (2), rsi gets its value from an argument passed into P: (i) rsi is coincidentally used to pass this argument to P, (ii) this value is placed in a stack variable at rbp-70h, and finally (iii) rsi is assigned this value back from the stack.


[Figure 6.2 content: (a) the calls memset, getaddrinfo, socket, setsockopt, bind and listen in their arbitrary layout order, interleaved with error-handling calls (gai_strerror, freeaddrinfo, errno_location, strerror, close); (b) the same calls in their correct call order, numbered 1-6.]

Figure 6.2: Typical call instructions for starting a server: on the left, the call instructions are randomly placed in the code by the compiler; on the right, the call instructions are placed in their correct call order as they appear in the CFG.

In (3), rdi is assigned the return value of the call to socket: (i) in accordance with the calling convention, the return value of socket is placed in rax before returning from the call; (ii) rax's value is assigned to a local variable on the stack at location rbp-58h; (iii) rax reads this value back from the same stack location;¹ and finally, (iv) this value is assigned from rax to rdi. In slices (2) and (3), the concrete values of the registers used as arguments are unknown. Using static analysis, we augment the reconstructed call sites by replacing the register names (which carry no meaning) with: (i) the concrete value, if one can be determined; or (ii) a broader category which we call argument abstract values. An abstract value is extracted by classifying the slice into one of the following classes: argument (ARG), global value (GLOBAL), unknown return value from calling a procedure (RET), and local variable stored on the stack (STK). Complete details regarding this representation are presented in Section 6.3. The results of augmenting the two reconstructed call sites from fig. 6.1(a) are the augmented call sites getaddrinfo(ARG,ARG,STK,STK) and setsockopt(RET,1,2,STK,4). As shown in section 6.5.3, augmenting call sites using abstract and concrete values improves the results of our model by a relative 50% over the alternative of using only the name of the external API. Moreover, as shown in table 6.1, abstract and concrete values allow the model to make predictions even when API names are obfuscated.

Putting Calls in the Right Order After examining the API calls of the procedure, a human reverser will try to understand the order in which they are called at runtime. Consider the call instructions of Figure 6.2(a). These calls are listed in the arbitrary order in which they were laid out by the compiler, which does not reflect any logical or chronological order. The standard order of creating and using a socket is shown in the following path: π = memset → getaddrinfo → socket → setsockopt → bind → listen. However, in the code: (i) the call to socket, which is part of the setup, is interleaved with strerror, used for error handling; and (ii) the calls in this sequence are themselves shuffled in the assembly, i.e., listen appears before bind. A human reverse-engineer can follow the jumps in the code and figure out the order in which these calls will be made. This results in the order shown in Figure 6.2(b). After thoroughly examining the

¹ Seemingly redundant computation steps, e.g., placing a value on the stack and then reading it back into the same register, are necessary to comply with calling conventions and to handle register overflow at the procedure scope.


procedure and understanding its expected behavior, the human reverse-engineer can try to summarize her understanding by giving the procedure a name.

To order the augmented call sites correctly, we construct a CFG for P. A CFG is a directed graph representing the control flow of the procedure, as determined by jump instructions. Exploring the CFG paths, we extract only possible runtime paths, thus approximating all potential call sequences. Finally, we represent P using a set of sequences of augmented call sites, where each sequence represents one path.

Predicting Procedure Names Using Augmented Call Site Sequences Figure 6.1(b) depicts our representation of P, focusing on the path π shown in Figure 6.2(b). Figure 6.1(b) shows that: (i) memset(STK,0,48) initializes a 48-byte memory space with zeroes; (ii) getaddrinfo(ARG,ARG,STK,STK) uses two of P's arguments to search for a specific interface to be used later; (iii) socket(STK,STK,STK) and setsockopt(RET,1,2,STK,4) create a socket and configure it to be a TCP socket by passing the value 1; and (iv) bind and listen determine that this procedure is part of a server listening for incoming connections. This example shows how correctly ordered augmented call sites create a powerful building block for representing binary procedures. As shown in fig. 6.1(c), using this representation, our model is able to correctly predict the name create server socket for P. In section 6.4 we explain how we build a model using long short-term memory networks (LSTMs) or Transformers to perform this prediction, and in section 6.5 we show how both achieve similar performance in terms of precision, recall and F1 score.

We note that call order does not determine data-flow, i.e., if memory was allocated in the first call, it will not necessarily be used in the second call. Nevertheless, abstract values do capture information regarding their surrounding procedure. For example, the abstract value RET in setsockopt marks that the socket was created inside this procedure; the abstract value ARG marks that the parameters for the interface search were arguments of the procedure. This contributes to the model choosing the subtoken "create" in the first decoding step.

Key aspects The illustrated example highlights several key aspects of our approach:

• Using static analysis, augmented API call sites can be reconstructed from the shallow assembly code.

• Analyzing pointer-aware slices for arguments allows call site augmentation by replacing register names with concrete or abstracted values.

• By analyzing the CFG, we string augmented call sites together in their approximated runtime order.

• A set of augmented call site sequences is an efficient and informative representation of a binary procedure.

• Various neural network architectures trained on these sequences can accurately predict complex procedure names such as create_server_socket.

6.2 Background - Seq2seq Models

In this section we provide the necessary background. In Section 6.2.1 we describe the encoder-decoder paradigm, which is the basis for most seq2seq models; in Section 6.2.2 we describe the mechanism of attention; and in Section 6.2.3 we describe Transformer models.

Preliminaries Contemporary seq2seq models are usually based on the encoder-decoder paradigm [CVMG+14, SVL14], where the encoder maps a sequence of input symbols x = (x1, ..., xn) to a sequence of latent vector representations z = (z1, ..., zn).


Given z, the decoder predicts a sequence of target symbols y = (y1, ..., ym), thus modeling the conditional probability p(y1, ..., ym | x1, ..., xn). At each decoding time step, the model predicts the next symbol conditioned on the previously predicted symbols, hence the probability of the target sequence is factorized as:

p(y1, ..., ym | x1, ..., xn) = ∏_{t=1}^{m} p(yt | y_{<t}, x1, ..., xn)
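This factorization is what allows decoding one symbol at a time. As an illustration, a minimal greedy decoder under this factorization might look as follows in Python; step_prob is an assumed model callback returning p(yt | y_{<t}, x) as a dictionary of next-symbol probabilities:

def greedy_decode(step_prob, bos="<s>", eos="</s>", max_len=20):
    y, p = [bos], 1.0
    for _ in range(max_len):
        dist = step_prob(y)           # p(y_t | y_<t, x) as {symbol: prob}
        y_t = max(dist, key=dist.get) # greedily take the likeliest symbol
        y.append(y_t)
        p *= dist[y_t]                # accumulate the product over steps
        if y_t == eos:
            break
    return y[1:], p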

6.2.1 LSTM Encoder-Decoder Models

Traditionally [CVMG+14, SVL14], seq2seq models are based on recurrent neural networks (RNNs), typically LSTMs [HS97]. LSTMs are trainable neural network components that operate on a sequence of input vectors and return a sequence of output vectors, based on internal learned weight matrices. While processing the input sequence, the LSTM keeps an internal state vector that is updated after reading each input vector.

The encoder embeds each input symbol into a vector using a learned embedding matrix E^{in}. The encoder then maps the input symbol embeddings to a sequence of latent vector representations (z1, ..., zn) using an encoder LSTM:

z1, ..., zn = LSTM_encoder(embed(E^{in}, x1, ..., xn))

The decoder is an additional LSTM with separately learned weights. The decoder uses an aggregation of the encoder LSTM states as its initial state; traditionally, the final hidden state of the encoder LSTM is used:

h_1^{dec}, ..., h_m^{dec} = LSTM_decoder(zn)

At each decoding step, the decoder reads a target symbol and outputs its own state vector h_t^{dec}, given the currently fed target symbol and its previous state vector. The decoder then computes a dot product between its new state vector h_t^{dec} and a learned embedding vector E_i^{out} for each possible output symbol y_i ∈ Y in the vocabulary, and normalizes the resulting scores to get a distribution over all possible symbols:

p(yt | y_{<t}, x1, ..., xn) = softmax(E^{out} · h_t^{dec})        (6.1)

where softmax is a function that takes a vector of scalars and normalizes it into a probability distribution. That is, each dot product E_i^{out} · h_t^{dec} produces a scalar score for the output symbol y_i, and these scores are normalized by exponentiation and division by their sum. This results in a probability distribution over the output symbols y ∈ Y at time step t.

The target symbol that the decoder reads at time step t differs between training and test time: at test time, the decoder reads the estimated target symbol that the decoder itself predicted in the previous step, ŷ_{t−1}; at training time, the decoder reads the ground-truth target symbol of the previous step, y_{t−1}. This training setting is known as "teacher forcing": even if the decoder makes an incorrect prediction at time t − 1, the true y_{t−1} will be used to predict y_t. At test time, the true y_{t−1} is unknown.


6.2.2 Attention Models

In attention-based models, at each step the decoder can compute a different weighted average of all latent vectors z = (z1, ..., zn) [LPM15, BCB14], rather than using only the last state of the encoder as in traditional seq2seq models [SVL14, CVMG+14]. The weight that each z_i receives in this weighted average can be thought of as the attention that the input symbol x_i is given at a certain time step. This weighted average is produced by computing a score for each z_i conditioned on the current decoder state h_t^{dec}. These scores are normalized such that they sum to 1:

α_t = softmax(z · W_a · h_t^{dec})

That is, α_t is a vector of positive numbers that sum to 1, where every element α_{ti} in the vector is the normalized score for z_i at decoding step t. W_a is a learned weight matrix that projects h_t^{dec} to the same size as each z_i, such that the dot product can be performed.

Each z_i is multiplied by its normalized score to produce c_t, the context vector for decoding step t:

c_t = ∑_{i=1}^{n} α_{ti} · z_i

That is, c_t is a weighted average of z = (z1, ..., zn), such that the weights are conditioned on the current decoding state vector h_t^{dec}. The dynamic weights α_t can be thought of as the attention that the model has given to each z_i vector at decoding step t.

The context vector c_t and the decoding state h_t^{dec} are then combined to predict the next target token y_t. A standard approach [LPM15] is to concatenate c_t and h_t^{dec} and pass them through another learned linear layer to predict the next symbol, i.e., multiply the concatenated vector by another learned matrix W_c:

h̃_t = W_c · [c_t ; h_t^{dec}]

p(yt | y_{<t}, x1, ..., xn) = softmax(E^{out} · h̃_t)        (6.2)

Note that target symbol prediction in non-attentive models (Equation (6.1)) is very similar to the prediction in attention models (Equation (6.2)), except that non-attentive models use the decoder's state h_t^{dec} to predict the output symbol y_t, while attention models use h̃_t, which is the decoder's state combined with the dynamic context vector c_t. This dynamic context vector c_t allows the decoder to focus its attention on the most relevant input vectors at different decoding steps.
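For concreteness, one decoding step of this attention mechanism can be sketched in a few lines of NumPy; the shapes and randomly initialized weight matrices below are illustrative assumptions:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_step(z, h_dec, W_a, W_c):
    # z: (n, d_z) encoder outputs; h_dec: (d_h,) current decoder state.
    alpha = softmax(z @ W_a @ h_dec)            # attention weights alpha_t
    c = alpha @ z                               # context vector c_t
    h_tilde = W_c @ np.concatenate([c, h_dec])  # combined state h~_t
    return h_tilde, alpha

rng = np.random.default_rng(0)
z, h_dec = rng.normal(size=(5, 8)), rng.normal(size=4)
h_tilde, alpha = attention_step(z, h_dec,
                                rng.normal(size=(8, 4)),   # W_a
                                rng.normal(size=(8, 12)))  # W_c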

6.2.3 Transformers

Transformer models were introduced by [VSP+17] and were shown to outperform LSTM models on many seq2seq tasks. Transformer models are the basis for the vast majority of recent sequential models [DCLT19, RWC+18, CSW+18]. The main idea in Transformers is to rely on the mechanism of self-attention without any recurrent components such as LSTMs. Instead of reading the input vectors x = (x1, ..., xn) sequentially as in RNNs, the input vectors are passed through several layers, which include multi-head self-attention and a fully-connected feed-forward network.

In a self-attention layer, each xi is mapped to a new value that is based on attending to all other xj’s in the sequence. First, each xi in the sequence is projected to query, key, and value vectors using learned


linear layers:

Q = W_q · x,   K = W_k · x,   V = W_v · x

where Q ∈ R^{n×d_k} holds the queries, K ∈ R^{n×d_k} holds the keys, and V ∈ R^{n×d_v} holds the values. The output of self-attention for each x_i is a weighted average of the other value vectors, where the weight assigned to each value is computed by a compatibility function of the query vector Q_i of the element with each of the key vectors K_j of the other elements, scaled by 1/√d_k. This can be performed for all keys and queries in parallel using the following quadratic computation:

α(Q, K) = softmax(QK^⊤ / √d_k)

That is, α ∈ R^{n×n} is a matrix in which every entry α_{ij} contains the scaled score between the query of x_i and the key of x_j. These scores are then used as the factors in the weighted average of the value vectors:

Attention (Q, K, V) = α (Q, K) · V

In fact, there are multiple separate (typically 8) attention mechanisms, or "heads", that are computed in parallel. Their resulting vectors are concatenated and passed through another linear layer. There are also multiple (typically 6) layers containing separate multi-head self-attention components, stacked on top of each other. The decoder has similar layers, except that one self-attention mechanism attends to the previously predicted symbols (i.e., y_{<t}), and an additional attention mechanism attends to the encoder's output vectors.
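A single self-attention head, as described above, can be sketched directly from the formulas; the square projection shapes below are a simplifying assumption (d_k = d_v = d):

import numpy as np

def self_attention(x, W_q, W_k, W_v):
    # x: (n, d) input vectors; one head, with d_k = d_v = d for simplicity.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) scaled scores
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)  # row-wise softmax
    return alpha @ V                            # weighted value vectors

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 16))
out = self_attention(x, *(rng.normal(size=(16, 16)) for _ in range(3)))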

6.3 Representing Binary Procedures for Name Prediction

Our approach is based on predicting procedure names from sequences of augmented call sites. In Section 6.4 we will show how to adapt the neural models of Section 6.2 to our setting, but first, in this section, we describe the program analysis that extracts sequences of augmented call sites from a given binary procedure and builds our representation.

Pre-Processing the Procedure's CFG Given a binary procedure, P, we construct its CFG, G_P: a directed graph composed of nodes corresponding to P's BBs, connected by edges according to control-flow instructions, i.e., jumps between basic blocks. For clarity, in the following we: (i) add an artificial entry node denoted Entry and connect it to the original entry BB, and (ii) connect all exit nodes, in our case any BB ending with a return instruction (exiting the procedure), to another artificial sink node, Sink.

For every simple path p in G_P, we use instructions(p) to denote the sequence of instructions in the path's BBs. We use [[P]] to denote the set of sequences of instructions along simple paths from Entry to Sink, that is: [[P]] = {instructions(p) | p ∈ simplePaths(Entry, Sink)}.
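A minimal sketch of computing [[P]], assuming the CFG is held in a networkx DiGraph with the artificial Entry and Sink nodes already added; the function and variable names are ours:

import networkx as nx

def instruction_sequences(cfg, instructions):
    # cfg: DiGraph over basic-block ids, with "Entry" and "Sink" added;
    # instructions: maps a block id to its list of instructions.
    paths = nx.all_simple_paths(cfg, "Entry", "Sink")
    return {tuple(i for bb in path for i in instructions.get(bb, ()))
            for path in paths}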


Instruction            Vread  Vwrite  Pread  Pwrite
i1: mov rax,5          5      rax     ∅      ∅
i2: mov rax,[rbx+5]    rbx    rax     rbx+5  ∅
i3: call rcx           rcx    rax     rcx    ∅

Figure 6.3: An example of slice information sets created by three x64 instructions: the Vread|write sets show values read and written, and the Pread|write sets show pointer dereferences for reading from and writing to memory.

From each sequence of instructions s ∈ [[P]], we extract a list of call instructions, like the one in fig. 6.2(b); these call sites are reconstructed by inferring the number of arguments passed to the call target. The number of arguments is recovered by: (i) using library debug information for external procedures (explained next); and (ii) analyzing the called procedures themselves for internal procedures. For indirect procedure calls, call sites cannot be reconstructed.

Reconstructing Imported Procedures Using Debug Information Linux executables declare external dependencies in a special section of the ELF header, causing the OS loader to import them at load time and allowing the executable to call into their procedures. These procedures are also called imported procedures. The external dependencies are usually software libraries, such as LIBC, which provides the API for most OS-related functionality (e.g., opening a file using open()). Our analysis needs to ascertain (i) the addresses of these imported procedures, and (ii) the number of arguments passed to them. These enable us to detect call instructions targeting imported procedures and to reconstruct them into call sites.

To gather imported procedure addresses (i), we simulate the OS loader's actions using the structures in the analyzed executable alongside the ELF header of each imported executable. Note that for position independent code (PIC) this includes parsing the PLT (Procedure Linkage Table), which is used by the executable's procedures to call external procedures, and the global offset table (GOT), containing the addresses in use. For a brief summary of how the GOT and PLT are used to implement position independent code, see [Ben]. Using the addresses of the imported procedures, we extract the number of arguments and their types (ii) from the debug symbols of the imported libraries. These symbols are represented in the DWARF format [Com].

Using all this information, we reconstruct a prototype for every imported procedure used by the executable according to the application binary interface (ABI) [Int], which dictates the calling convention. In our case (System-V-AMD64-ABI), the first six procedure arguments which fit in the native 64-bit registers are passed in the following order: rdi, rsi, rdx, rcx, r8, r9. Return values which fit in the native register are returned using rax. For example, libc's open() is defined by the following C prototype: int open(const char *pathname, int flags, mode_t mode). As all arguments are pointers or integers, they can be passed using native registers, and the reconstruction process results in the following prototype: rax = open(rdi, rsi, rdx). We direct the reader to the ABI for further details on how floats and bigger arguments are handled. Two other examples of the output of this process are shown in fig. 6.1(a).

Creating Pointer-Aware Slice-Trees To create augmented call sites, we produce a value or an abstract value to replace each register used as a call argument. We create a pointer-aware slice for each register value at the call site location. This slice contains the calculation steps performed to generate the register value. We adapt the definitions of [LB93] to assembly instructions.
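The prototype reconstruction described above reduces, for integer and pointer arguments, to a mapping from an argument count to the ordered System-V-AMD64 registers. A minimal sketch of that step; reconstruct_prototype is a hypothetical helper, and the full ABI rules for floats and larger arguments are deliberately omitted:

# System-V-AMD64 integer/pointer argument registers, in order.
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def reconstruct_prototype(name, n_args, returns=True):
    # Assumes every argument fits in a native 64-bit register; floats and
    # larger arguments would require the full ABI rules.
    call = "%s(%s)" % (name, ", ".join(ARG_REGS[:n_args]))
    return ("rax = " + call) if returns else call

print(reconstruct_prototype("open", 3))  # rax = open(rdi, rsi, rdx)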


Technion - Computer Science Department - Ph.D. Thesis PHD-2020-04 - 2020 ∅ ∅ ARG ∅ 2 1 V P 4 3 ∅ mov [rbp-68h], rdi ��� ARG | ∅

V(rbp) P(ebp-68h)

6 5 mov rax, [rbp-68h] ∅ STK | ARG ∅ V P 7 mov rdi, rax ARG |∅ V getaddrinfo(rdi,rsi,rdx,rcx) getaddrinfo(ARG,rsi,rdx,rcx)

(a) (b)

Figure 6.4: Creating pointer-aware slices for extracting concrete or abstract values: all possible computations that create the value of rdi are statically analyzed, and finally, the occurrence of rdi in the call to getaddrinfo is replaced with the ARG abstract value originating from node (2), which is defined as more informative than ∅ (node (1)) and STK (node (4)).

For an instruction i, we define the following sets:

• Vwrite(i): all registers that are written into in i.

• Vread(i): all values used or registers read from in i.

• Pwrite(i): all memory addresses written to in i.

• Pread(i): all memory addresses read from in i.

The first two sets, Vread(i) and Vwrite(i), capture the values read and written into. The last two sets, Pread(i) and Pwrite(i), capture the pointers addressed for writing and reading operations. Figure 6.3 shows three examples for assembly instructions and their slice information sets.

Instruction i1 shows a simple assignment of a constant value into the rax register. Accordingly, Vread(i1) contains the constant value 5 that is read in the assignment, and Vwrite(i1) contains the assignment's target register rax.

Instruction i2 shows an assignment into the same register, but this time from the memory value stored at offset rbx+5. The full expression rbx+5, used to compute the memory address to read from, is put in Pread, and the register used to compute this address is put in Vread. Note that Vread(i1) contains the value 5, due to its use as a constant value, while Vread(i2) does not contain the value 5, because in i2 it is used as a memory offset. Instruction i3 shows an indirect call instruction. This time, as the callee address resides in rcx, it is put in Vread. Vwrite(i3) contains rax, which does not explicitly appear in the assembly instruction, because the calling convention states that return values are placed in rax. Given an instruction sequence s ∈ [[P]], we can define:

• Vlast−write(i, r): the maximal instruction i′ < i in which, for the register r, r ∈ Vwrite(i′).

• Plast−write(i, p): the maximal instruction i′ < i in which, for the pointer p, p ∈ Pwrite(i′).

Finally, to generate a pointer-aware slice for a specific register r at a call site in an instruction i, we

repeat the following calculations: Vlast−write → Vread and Plast−write → Pread, until we reach empty sets. Although this slice represents a sequential computation performed in the code, it is better visualized


as a tree. Figure 6.4(a) shows the pointer-aware slice for the rdi register used as an argument of the getaddrinfo call in fig. 6.1(a), represented as a tree.

Analyzing Slice Trees We analyze slice trees to select a concrete value or an abstract value to replace the register name at the sink of the tree.² This allows us to encode more information in the augmented call site. To do this, we need to assign abstract values to some of the slice tree leaves and propagate the values through the tree. We begin by performing the following initial tag assignments in the tree:

• We tag all of P’s arguments as “ARG”.

• When a call instruction is encountered, we tag the location into which rax is read after the call instruction with "RET", as it holds the return value of the called procedure.

• We tag by "STK" the rbp register at the beginning of P. During the propagation process, this tag might be passed to rsp or other registers.³

• When a pointer to a global value cannot be resolved to a concrete value, e.g., a number or a string, we set its tag to "GLOBAL".⁴

After the initial tagging, we propagate values down the tree towards the register at the tree sink. We prefer values according to the following order: (1) concrete value, (2) ARG, (3) GLOBAL, (4) RET, (5) STK. This hierarchy represents the certainty level we have regarding a specific value: if the value was created by another procedure, there is no information about it; if the value was created using one of P's arguments, we know something about it; and if the specific value used as an argument is known, it is used instead of an abstract value. To make this order complete we add (6) ∅ (the empty set), but this tag is never propagated forward and in no case will reach the sink. Figure 6.4(b) shows an example of this propagation process. The assigned tags "ARG" and "STK", in nodes (2) and (4) respectively, are marked in bold and underlined. Two tags, "ARG" and "∅", are propagated to node (3), and "ARG" is propagated to node (6). In node (6), as "ARG" is preferred over "STK", "ARG" is propagated to the sink and thus selected to replace the rdi register name in the getaddrinfo call site.

Dealing with aliasing and complex constant expressions Using the lessons learned in chapter 4, we re-optimize the entire procedure before extracting simple paths, and we re-optimize every slice before analyzing it to deduce argument values. Complex calculations over constants are thus eliminated by the re-optimization process; for example, the instructions xor rax,rax; inc rax; are simply replaced by mov rax,1. This saves us the trouble of performing the costly process of expression simplification and the need to expand our analysis to support it. Furthermore, while very common in (human-written) source code, memory aliasing is seldom found in optimized binary code, as it contradicts the compiler's aspiration to reduce memory consumption and avoid redundant computations. An example of how aliasing is avoided is depicted in fig. 6.1(a) (2): an offset to the stack storing a local variable can always be addressed in the same way using rbp (in this case rbp-70h). The re-optimization thus spares us the need to expand our analysis to address memory aliasing.
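The preference order itself is a one-liner. The Python sketch below illustrates the propagation choice at a single slice-tree node; the tag names follow the text, and CONCRETE/EMPTY are our stand-ins for a resolved concrete value and the ∅ tag:

# Most- to least-informative; CONCRETE stands for a resolved concrete value
# and EMPTY for the ∅ tag, which is never propagated when alternatives exist.
PRECEDENCE = ["CONCRETE", "ARG", "GLOBAL", "RET", "STK", "EMPTY"]

def propagate(incoming_tags):
    return min(incoming_tags, key=PRECEDENCE.index, default="EMPTY")

# The situation of fig. 6.4(b): "ARG" beats "STK" on the way to the sink,
# so rdi in the getaddrinfo call site is replaced with ARG.
print(propagate(["STK", "ARG"]))  # ARG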

² Formally, the root of the tree.
³ While it was a convention to keep rbp and rsp set to positions on the stack, modern compilers often use these registers for other purposes.
⁴ One example of such a case is a pointer to a global structure.

6.4 Model

The key idea in this work is to represent a binary procedure as a set of call site sequences, and the novel synergy between program analysis of binaries and neural models, rather than the specific neural architecture. In Section 6.4.1, we describe how a call site is represented in the model; this encoding is performed similarly across architectures. We use the same encoding in two popular architectures: an LSTM-based attention encoder-decoder (Section 6.4.2) and a Transformer (Section 6.4.3).5 We follow the general encoder-decoder paradigm [CVMG+14, SVL14] for sequence-to-sequence (seq2seq) models, with the difference that the input is not the standard single sequence of symbols, but a set of call site sequences. Thus, our models can be described as “set-of-sequences-to-sequence”. We learn sequences of encoded call sites; finally, we decode the target procedure name word-by-word while considering a dynamic weighted average of the encoded call site sequence vectors at each step.

6.4.1 Call Site Encoder

We define a vocabulary of learned embeddings Enames. This vocabulary assigns a vector to every subtoken of an API name observed in the training corpus. For example, if the training corpus contains a call to open_file, each of open and file is assigned a vector in Enames. Additionally, we define a learned embedding for each abstract value, e.g., ARG, STK, RET or GLOBAL (Section 6.3), and for every actual value (e.g., the number “1”) that occurred in the training data. We denote the matrix containing these vectors by Evalues. We represent a call site by summing the

embeddings of its API subtokens $w_1 ... w_{k_s}$, and concatenating with up to $k_{args}$ argument abstract value embeddings:

$$\mathit{encode\_callsite}\left(w_1 ... w_{k_s},\ \mathit{value}_1, ..., \mathit{value}_{k_{args}}\right) = \left[\left(\sum_i E^{names}_{w_i}\right);\ E^{values}_{\mathit{value}_1};\ ...;\ E^{values}_{\mathit{value}_{k_{args}}}\right]$$

We pad the remaining argument slots with an additional no-arg symbol.
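As a rough illustration, the following PyTorch sketch mirrors this encoding; the vocabulary sizes, token ids, NO_ARG index, and kargs = 4 are placeholder assumptions (the actual configuration appears in Section 6.5.1):

import torch
import torch.nn as nn

EMB_DIM, K_ARGS, NO_ARG = 128, 4, 0        # assumed sizes; 128-dim embeddings as in Section 6.5.1
E_names = nn.Embedding(10_000, EMB_DIM)    # learned API-name subtoken embeddings (Enames)
E_values = nn.Embedding(5_000, EMB_DIM)    # learned abstract/concrete value embeddings (Evalues)

def encode_callsite(name_subtokens: torch.Tensor, arg_values: torch.Tensor) -> torch.Tensor:
    """Sum the API-subtoken embeddings and concatenate up to K_ARGS argument
    value embeddings, padding the remaining slots with the no-arg symbol."""
    name_vec = E_names(name_subtokens).sum(dim=0)             # (EMB_DIM,)
    padded = torch.full((K_ARGS,), NO_ARG, dtype=torch.long)  # no-arg padding
    n = min(arg_values.numel(), K_ARGS)
    padded[:n] = arg_values[:n]
    return torch.cat([name_vec, E_values(padded).flatten()])  # (EMB_DIM * (1 + K_ARGS),)

# e.g., a call site open_file(ARG, 1): ids 17, 42 for {open, file}; 3, 9 for {ARG, "1"}
vec = encode_callsite(torch.tensor([17, 42]), torch.tensor([3, 9]))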

6.4.2 Encoding Call Site Sequences with LSTMs

In the LSTM-based model, we learn a call site sequence using a bidirectional LSTM. We represent each call site sequence by concatenating the last states of the forward and backward LSTMs:

$$h_1, ..., h_l = \mathrm{LSTM}_{encoder}\left(\mathit{callsite}_1, ..., \mathit{callsite}_l\right)$$

$$z = \left[\overrightarrow{h_l}\ ;\ \overleftarrow{h_l}\right]$$

where l is the maximal length of a call site sequence; in our experiments, we used l = 60. Finally, we represent the entire procedure as the set of its encoded call site sequences {z1, z2, ..., zn}, which is passed to the LSTM decoder as described in Section 6.2.2.
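A minimal PyTorch sketch of this encoder, assuming call sites are already embedded as vectors (the dimensions are placeholders; the LSTM has 128 units per direction as in Section 6.5.1):

import torch
import torch.nn as nn

INPUT_DIM, HIDDEN = 640, 128                # assumed input size; 128 LSTM units per direction
encoder = nn.LSTM(INPUT_DIM, HIDDEN, bidirectional=True, batch_first=True)

def encode_sequence(callsites: torch.Tensor) -> torch.Tensor:
    """callsites: (1, seq_len, INPUT_DIM) -> z = [forward h_l ; backward h_l]."""
    _, (h_n, _) = encoder(callsites)        # h_n: (2, 1, HIDDEN), one state per direction
    return torch.cat([h_n[0, 0], h_n[1, 0]])

z = encode_sequence(torch.randn(1, 60, INPUT_DIM))   # l = 60, as in our experiments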

5We do not share any learnable parameters or weights across the two architectures; each is trained from scratch.

6.4.3 Encoding Call Site Sequences with Transformers

In the Transformer model, we encode each call site sequence separately using N Transformer-encoder layers, followed by attention pooling to represent a call site sequence as a single vector:

$$z = \mathrm{Transformer}_{encoder}\left(\mathit{callsite}_1, ..., \mathit{callsite}_l\right)$$

Finally, we represent the entire procedure as a set of its encoded call site sequences {z1, z2, ..., zn} which is passed to the Transformer-decoder as described in Section 6.2.3.
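A corresponding sketch for the Transformer encoder; the single-learned-query attention pooling shown here is one plausible realization and, like the model width, is an assumption (N = 6 layers and a feed-forward size of 128 follow Section 6.5.1):

import torch
import torch.nn as nn

D_MODEL, N_LAYERS = 640, 6                  # assumed model width; N = 6 encoder layers
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, dim_feedforward=128,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
pool_query = nn.Parameter(torch.randn(D_MODEL))   # learned query for attention pooling

def encode_sequence(callsites: torch.Tensor) -> torch.Tensor:
    """callsites: (1, seq_len, D_MODEL) -> a single pooled vector z of size D_MODEL."""
    h = encoder(callsites)[0]                          # (seq_len, D_MODEL)
    weights = torch.softmax(h @ pool_query, dim=0)     # attention weights over positions
    return (weights.unsqueeze(1) * h).sum(dim=0)       # weighted average of encodings

z = encode_sequence(torch.randn(1, 60, D_MODEL))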

6.4.4 Encoding Call Site Sequences with Graph Neural Networks

Another hypothetical neural architecture that could leverage our call site representation is the Graph Neural Network (GNN) [SGT+08, GSR+17]. GNNs would potentially not require extracting sequences in advance, and could instead take the CFG directly as the network’s input. However, there are two main limitations to using GNNs in these settings:

1. The abstract values (Section 6.3) at a certain call site in our analysis depend on the CFG path that “arrives” at the call site. In other words, the same call site argument can be assigned two different abstract values on two different paths. Thus, fixing the abstract value to the CFG node independently of the path would hurt the abstract-value analysis, which we found to be critical to the results in Section 6.5.3.

2. GNNs are practically limited in the number of message propagation steps, which is equivalent to the call site sequence length in our LSTM and Transformer models. GNNs are usually unrolled for 2 [VCC+18, SKB+18, HYL17, KW17], 3 [ZL17], 8 [ABK18] or 10 [Bro19] propagation steps. In our dataset, more than 33.4% of the sequences are longer than 10 call sites; the average length of sequences longer than 10 is 17.02; and we limit the maximal length to l = 60. Thus, limiting the number of propagation steps to 2-10, as in prior work, would not capture long control-flow sequences. While using GNNs is computationally difficult for a large number of propagation steps, encoding long pre-extracted sequences in parallel, as in our LSTM and Transformer models, is not a computational problem.

We leave the use of our representation in GNNs for future work; in Section 6.5 we experiment with plugging in our representation into LSTMs and Transformers.

6.5 Evaluation

We implemented our approach in a framework called Nero, for NEural Reverse engineering Of stripped binaries.

6.5.1 Experimental Setup

Creating a Dataset We focus our evaluation on Intel 64-bit executables running on Linux, but the same process can be applied to other architectures and operating systems. We collected a dataset of software packages from the GNU code repository containing a variety of applications such as networking tools, administration tools, and libraries. To avoid dealing with mixed naming schemes, we removed all packages containing a mix of programming languages, e.g., a Python package containing partial C implementations. We also filtered out wrapper procedures, because they are usually very easy to both reverse-engineer and predict, and thus falsely improve the scores.

                           Stripped                  Stripped & Obfuscated API calls
Model                      Prec    Rec     F1        Prec    Rec     F1
Debin (He et al., 2018)    34.86   32.54   33.66     32.10   28.76   30.09
LSTM-text                  22.32   21.16   21.72     15.46   14.00   14.70
Transformer-text           25.45   15.97   19.64     18.41   12.24   14.70
Nero-LSTM                  39.94   38.89   39.40     39.12   31.40   34.83
Nero-Transformer           41.54   38.64   40.04     36.50   32.25   34.24

Table 6.1: Our model outperforms previous work (Debin) by a relative improvement of almost 20%; learning from the flat assembly code (LSTM-text, Transformer-text) yields much lower results. Obfuscating API calls hurts all models, but thanks to the use of abstract values, our model still performs better than the baselines.

Avoiding Duplicates Following [LMM+17] and [All18], who pointed out the existence of code duplication in open-source datasets and its adverse effects, we created the train, validation, and test sets from completely separate projects and packages. Additionally, we put substantial effort, both manual and automatic, into filtering duplicates from our dataset. Specifically, we filtered out the following:

1. Different versions of the same package – for example, “wget-1.7” and “wget-1.20”.

2. C++ code – C++ code regularly contains overloaded procedures; further, class methods start with the class name as a prefix. To avoid duplication and name leakage, we filtered out all C++ executables entirely.

3. Tests – all executables suspected as being tests or examples were filtered out.

4. Static linking – we took only packages that could compile without static linking. This ensures that dependencies are not compiled into the dependent executable.

Dataset Processing and Statistics After all filtering, the dataset contains 67,246 examples. We extract procedure names to use as labels, and then create two datasets by: (i) stripping, and (ii) stripping and obfuscating API names. Stripping is performed to conform to the way executables are usually distributed, and obfuscation of API calls is sometimes done in malware. We obfuscated API calls by manipulating the .symtab and .dynsym sections: replacing strings with dummy strings and zeroing the information used to dynamically load external code. We verified that after these manipulations Nero and Debin are still able to statically analyze the executables. We split both datasets into the same training-validation-test sets using an 8:1:1 ratio, resulting in 60,403/2,034/4,809 training/validation/test examples. Each dataset contains 2.42 (±0.04) target symbols per example. There are 489.63 (±2.82) assembly code tokens per procedure, which our analysis reduces to 9.05 (±0.18) paths per procedure; the average path is 9.15 (±0.01) call sites long. We make these datasets publicly available.

Metrics At test and validation time, we adopted the measure used by previous work [APS16, ABLY19, FAB19]: we measured precision, recall, and F1 score over the target subtokens, case-, order-, and duplication-insensitive, and ignoring non-alphabetical characters. For example, for a true reference of open file: a prediction of open is given full precision and 50% recall; and a prediction of file open input file is given 67% precision and full recall.
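For concreteness, here is a small Python sketch of this metric, assuming a set-based, case-insensitive comparison of alphabetic subtokens (the tokenization regex is an assumption):

import re

def subtokens(name: str) -> set:
    """Lowercased alphabetic subtokens, e.g. 'open_file' -> {'open', 'file'}."""
    return {t.lower() for t in re.findall(r"[A-Za-z]+", name)}

def prf1(prediction: str, reference: str):
    pred, ref = subtokens(prediction), subtokens(reference)
    tp = len(pred & ref)                                  # correctly predicted subtokens
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The examples from the text:
assert prf1("open", "open file")[:2] == (1.0, 0.5)                    # full precision, 50% recall
assert prf1("file open input file", "open file")[:2] == (2 / 3, 1.0)  # 67% precision, full recall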


Model              | Prediction
Ground truth       | read file        | check new watcher | get user groups  | install signal handlers
Debin ([HIT+18])   | bt open          | read index        | display          | signal setup
LSTM-text          | –                | check opt         | close stdin      | –
Transformer-text   | ipmi             | disable coredump  | config file ipmi | regfree
Nero-LSTM          | vfs read file    | check file        | get ip groups    | install handlers
Nero-Transformer   | read file system | list check state  | get user groups  | install signal

Table 6.2: Examples from our test set and predictions made by the different models. Even when a prediction is not an “exact match” with the ground truth, it usually captures more subtokens of the ground truth than the baselines do. More examples can be found in Table A.1 in the appendix.

Baselines We compare our model to the state-of-the-art model, Debin [HIT+18], by training and testing their model on our dataset.6 This is a non-neural baseline based on Conditional Random Fields (CRFs). As far as we are aware, this is the only other work attempting a task similar to ours. We note that Debin was designed for the slightly different task of predicting names for both local variables and procedures; nevertheless, we focus on the task of predicting procedure names and use only these to compute their score. Other straightforward baselines are Transformer-text and LSTM-text, in which we do not perform any program analysis and instead apply standard NMT architectures directly to the assembly code: one is the Transformer [VSP+17], and the other has two layers of bidirectional LSTMs as the encoder, two LSTM decoder layers, and attention [LPM15]. To demonstrate how our representation can be plugged into various architectures, we implemented our approach in two models: Nero-LSTM encodes the control-flow sequences using bidirectional LSTMs and decodes with another LSTM; Nero-Transformer encodes these sequences and decodes using a Transformer [VSP+17]. To evaluate the main novelty of our approach – learning from augmented call site sequences extracted from CFG paths – we perform a thorough ablation study, as detailed in Section 6.5.3.

Training We trained our models using a single Tesla V100 GPU. For both our models (Nero-LSTM and Nero-Transformer) we used embeddings of size 128 for target subtokens and API subtokens, and the same size for embedding argument abstract values. In our LSTM model, call site sequences are encoded with bidirectional LSTMs of 128 units each; the decoder LSTM has 256 units. We used dropout [SHK+14] of 0.5 on the embeddings and the LSTMs. For our Transformer model we used N = 6 encoder layers and the same number of decoder layers, keys and values of size dk = dv = 32, and a feed-forward network of size 128. For both models we used the Adam [KB14] optimization algorithm with an initial learning rate of 10−5, decayed by a factor of 0.95 every epoch. We trained each network end-to-end using the cross-entropy loss. We tuned hyperparameters on the validation set and evaluated the final model on the test set.
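Sketched in PyTorch, this optimization setup reads roughly as follows; the stand-in model and synthetic batches are placeholders for a Nero model and real call site sequences:

import torch
import torch.nn as nn

model = nn.Linear(640, 1000)   # stand-in for a Nero model (illustrative only)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # 0.95 decay per epoch
loss_fn = nn.CrossEntropyLoss()                                            # cross-entropy loss

for epoch in range(3):                     # a short run for illustration
    for _ in range(10):                    # stand-in batches of encoded procedures
        x = torch.randn(32, 640)
        y = torch.randint(0, 1000, (32,))
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                       # decay the learning rate once per epoch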

6.5.2 Results

The left side of Table 6.1 shows the results of the comparison to [HIT+18], LSTM-text, and Transformer-text on the stripped dataset. Overall, our models show almost 20% relative improvement over the model of [HIT+18] and over 84% relative gain over LSTM-text and Transformer-text.

6The dataset of [HIT+18] is not publicly available. We make our dataset public.

Model                   Prec    Rec     F1
BiLSTM calls            23.45   24.56   24.04
BiLSTM call sites       36.05   31.77   33.77
Nero-LSTM no-values     29.84   24.08   26.65
Nero Transformer→LSTM   41.30   37.50   39.32
Nero-LSTM               39.94   38.89   39.40
Nero-Transformer        41.54   38.64   40.04

Table 6.3: Variations on our approach that ablate its different components. Removing abstract values (Nero-LSTM no-values), ignoring the control-flow order (BiLSTM call sites), or both (BiLSTM calls) hurts the results drastically.

Nero-Transformer performs similarly to Nero-LSTM, with Nero-Transformer achieving an F1 score of 40.04 vs. 39.40 for Nero-LSTM; Nero-LSTM outperforms Nero-Transformer in recall but trails in precision and F1 score. This demonstrates the usefulness of our representation across different learning architectures. The right side of Table 6.1 shows the same models on the stripped and API-obfuscated dataset. Obfuscation degrades the results of all models, yet our models still perform significantly better than the model of [HIT+18] and the textual baselines. This result demonstrates the importance of abstract values in our representation, which recover sufficient information even in the absence of API names. We note that overall, on both datasets, our models perform best on both precision and recall.

Comparison to [HIT+18] Conceptually, our model is much more powerful than the model of [HIT+18] because it is able to decode out-of-vocabulary procedure names (“neologisms”) from subtokens, while the CRF of [HIT+18] uses a closed vocabulary that can only predict already-seen procedure names. On the binary code side, since our model is neural, at test time it can utilize unseen call site sequences, while their CRF can only use observed relationships between elements. For a detailed discussion of the advantages of neural models over CRFs, see Section 5 of [AZLY19]. Furthermore, their representation performs a shallow translation from binary instructions to connections between symbols, while our representation is based on a deeper data-flow-based analysis to find the values of register arguments of imported procedures.

Comparison to LSTM-text and Transformer-text The comparison to the NMT baselines shows that learning directly from the assembly code performs significantly worse than leveraging semantic knowledge and static analysis of binaries. We hypothesize that the reason is the high variability of the assembly data, which results in a low signal-to-noise ratio. This comparison demonstrates the need for an informative static analysis to represent and learn from executables.

Examples Table 6.2 shows a few examples of predictions made by the different models. Additional examples can be found in Table A.1 in the appendix.

6.5.3 Ablation Study

To evaluate the contribution of the different components of our representation, we compare the following configurations: Nero-LSTM no-values - uses only the CFG analysis with the called API names, without abstract or concrete values. Nero Transformer→LSTM - uses a Transformer to encode the sets of control-flow sequences and an LSTM to decode the prediction.


Error Type                        Package     Ground Truth         Predicted Name
Programmers VS English Language   wget        i18n_initialize      i18n_init
                                  direvent    split_cfg_path       split_config_path
                                  gzip        add_env_opt          add_option
Data Structure Name Missing       gtypist     get_best_speed       get_list_item
                                  wget        ftp_parse_winnt_ls   parse_tree
                                  gzip        abort_gzip_signal    fatal_signal_handler
Verb Replacement                  units       read_units           parse
                                  findutils   share_file_fopen     add_file
                                  mcsim       display_help         show_help

Table 6.4: Examination of common and interesting model mistakes.

BiLSTM call sites - uses the same enriched call site representation as our model, including abstract values, with the main difference that the order of the call sites is their order in the assembly code: there is no analysis of the CFG. BiLSTM calls - uses neither CFG analysis nor abstract values. Instead, it uses two layers of bidirectional LSTMs with attention to encode call instructions with only the name of the called procedure, in the order they appear in the executable.

Results Table 6.3 shows the performance of the different configurations. Nero-LSTM achieves a relative 50% higher score than Nero-LSTM no-values. This shows the contribution of augmenting call sites with concrete and abstract values, and its importance to prediction. Nero Transformer→LSTM achieves results comparable to Nero-Transformer and Nero-LSTM, as expected. As we discuss in Section 6.3, our data-flow-based analysis helps filter and reorder calls into their approximate chronological runtime order, rather than the arbitrary order in which calls appear in the assembly code. BiLSTM call sites performs slightly better than BiLSTM calls due to the use of abstract values instead of plain call instructions. Nero-LSTM improves over BiLSTM call sites by 16%, showing the importance of learning call sites in their correct chronological order and the importance of our data-flow-based view of executables.

6.5.4 Qualitative Evaluation

Taking a closer look at partially-correct predictions reveals some common patterns. We divide these partially-correct predictions into major groups; Table 6.4 shows a few examples from the more interesting and common groups. The first group, “Programmers VS English Language”, shows prediction errors caused by the programmer’s naming habits or errors. The first two lines show an example of a relatively common practice among programmers – the use of shorthands. In the first example, the ground truth is i18n_initialize7 yet the model predicts the shorthand “init”; in the second, the ground truth subtoken is the shorthand “cfg” but the model predicts “config”. We note that “cfg”, “config” and “init” appear 81, 496 and 6808 times, respectively, in our vocabulary of target subtokens, while “initialize” does not appear at all, justifying the model’s predictions.

7Internationalization (i18n) is the process of developing products in such a way that they can be localized for languages and cultures easily.


In the second group, “Data Structure Name Missing”, the model, not privy to application-specific labels which exist only in the test set, resorts to predicting generic data structure names (e.g., a list). In the first example, from “gtypist”, a typing tutor application for Ubuntu, “speed records” are predicted as list items. In the next example, the parsing of (Windows) directory listings, performed by the wget package, is predicted as parsing a tree. In the last example, the subtoken “gzip” is not even in the target vocabulary, causing “abort_gzip” to be predicted instead as “fatal”. We note that the model successfully predicts the generic data structure thanks to the memory and API use patterns seen in training. The last group, “Verb Replacement”, shows how much information is captured by the verb in the procedure name. In the first example, “share” is replaced with “add”, losing some information about the work performed in the procedure. In the last example, “display” is replaced with “show”, which are almost synonymous.



Chapter 7

Related Work

Over the years, a large body of work has been devoted to binary code similarity, for search and other applications. In this chapter we provide a brief summary of related work in this field and in adjacent fields of research.

Similarity in binaries Ng and Prakash [NP13] use symbolic execution and theorem provers to find similar binary code. They lack a cross-compiler evaluation, and use some syntactic heuristics (e.g., the number of arguments) which affect accuracy. The authors of [PSB+14] use expression trees as procedure signatures, but this approach is susceptible to the problem of varying compilation tool-chains, as different compilers perform the same calculation differently. Pewny et al. [PGG+15] use sampling of SMT-solver-simplified basic blocks, lifted using VEX. They report false positives in their results, which can be linked to the sampling-based method, as it is not countered by a statistical framework that accounts for spurious matches from commonly found computations. Furthermore, they require a complex BHB process to be performed online on each pair of compared CFGs (after some pruning). [FZX+16a] works by extracting syntactic and structural features from the code, which hinders its applicability to the stripped binary search scenario. The authors also report a high false-positive rate (even for the bug-search scenario). The authors of [EYG16] leverage the (static) CFG of the procedure for similarity, while applying a dynamic filtering beforehand to scale. Their results show that the CFG is not a precise enough feature for similarity, leading to false positives.

n-gram-based methods The authors of [MC05] introduce the concept of using n-grams (which they call k-grams, or birthmarks) from the world of document searching to the field of detecting similarities in binary code. Their similarity method works by counting mnemonics in a sliding window over the program text. This approach is very sensitive to the linear code layout, and produces poor results in practice. Jang et al. [JBV11] speed up n-gram-based similarity, in a technique that is complementary to our hashed search (chapter 4); however, it targets a different scenario of malware detection using user-supplied features. The authors of [SWP+09] propose a method for determining even complex lineage for executables by performing normalization steps (in this case also normalizing jumps). The authors of [JWB13] further expand the similarity framework to families of executables to establish a lineage. While providing an important stepping stone for this research field, all these works are based on the n-gram model, which, as we show in sections 2.3.2, 2.5 and 5.4.3, is inherently flawed due to its reliance on the compiler making the same layout choices, resulting in a weak representation for binary similarity tasks.

Structural-based methods A method for comparing binary control flow graphs using graphlet-coloring was presented in [KKM+06]. This approach has been used to identify malware, and for identifying


compilers and even authors [RMZ10, RZM11]. This method was designed to match whole binaries in large groups, and as such it employs a coarse equivalence criterion between graphlets. The authors of [BMM06] attempt to detect virus infections inside executables. This is done by randomly choosing a starting point in the executable, parsing it as code, and attempting to construct an arbitrary-length CFG from there. This work is focused on cleverly detecting the virus entry point, and presents interesting ideas for analyzing binary code that we may use in the future. Jacobson et al. [JRM11] attempt to fingerprint binary procedures using the sequence of system calls used in the procedure. This approach is sensitive to patching, and is only applicable to procedures which contain no indirect calls or use system calls directly. [EYG16] targets multiple architectures and relies on the (static) control-flow structure of the procedure for similarity, along with a numeric filtering pre-stage for scalability. This reliance on syntactic features leads to sub-optimal accuracy and limitations, making it inapplicable in some cases (such as procedures with a small number of branches).

Dynamic methods There are several dynamic methods targeting malware. For example, [CSK+10] uses run-time information to model executable behavior and detect anomalies which could be attributed to malware and used to identify it. Employing dynamic measures in this case enables bypassing malware defences such as packing. We employ a static approach, and dealing with executable obfuscation is outside the scope of our work. Egele et al. [EWCB14] present a dynamic approach which executes the procedure in a randomized environment and compares the side effects on the machine for similarity. As they base the similarity on a single randomized run, similarity may occur by chance, especially since they coerce execution of certain paths to achieve full coverage. Their evaluation indeed outperforms [zyna]; however, it is still lacking, as they rank similar procedures in the top 10 in only 77% of the cases, and do not evaluate over patched versions. We note that dynamic methods are highly dependent on obtaining inputs that drive meaningful coverage of a binary, which is a well-known hard problem. Moreover, the executable may not be easily run externally (i.e., it may target a different architecture than the researcher’s machine) and may thus require non-trivial execution access to devices (e.g., IoT devices). Also, running (any part of) the possibly vulnerable library or malware may expose the user to risk.

Vulnerability search in firmware The authors of [CZFB14] were the first to direct attention to the security of embedded firmware, in the context of locating CVEs. The main goal of their work was to obtain a quantitative result, as they did not perform any sophisticated static analysis. Firmware security was further explored in [FZX+16b], which used statistical and structural features of the procedure in order to locate vulnerabilities in firmware. Their heuristic-based, procedure-centric approach performs a rather shallow analysis, which is demonstrated by the low detection rate (38 findings for a corpus of 420 million procedures and 154 vulnerabilities).

Similarity measures developed for code synthesis testing The authors of [SSA13] propose an interesting way to compare x86 loop-free snippets to perform transformation correctness tests.
The similarity is calculated by running the code on selected inputs, and quantifying similarity by comparing outputs and states of the machine at the end of the execution (for example, counting equal bits in the output). This distance metric does offer a way to compare two code fragments and possibly compute similarity, but it requires dynamic execution on multiple inputs, which makes it infeasible for our setting.

Similarity measures for source code There has been a lot of work on detecting similarities in source code. Moss (Measure Of Software Similarity) [SWA03, Aik] is an automatic system for determining the similarity of programs. To date, the main application of Moss has been detecting plagiarism in programming classes. Moss implements some basic ideas that resemble our approach: (i) it decomposes programs and checks for fragment similarity, and (ii) it provides the ability to ignore common code.


Using program dependence graphs was shown to be effective in comparing procedures using their source code [HRB88], but applying this approach to assembly code is difficult: assembly code has no type information, variables are not easily identified [BR07], and attempting to create a PDG proves costly and imprecise. Smith and Horwitz [SH09] recognized the importance of statistical significance for similarity problems, employing the technique on source code.

Machine learning for source code Several works have investigated machine learning approaches for predicting names in high-level languages. Most works focused on variable names [AZLY18, BPS18], method names [APS16, AZLY19, ABBS15], or general properties of code [RBVK16, RVY14]. Another interesting application is measuring the likelihood of existing names to detect naming bugs [PS18, RAJ+17]. Most work in this field used either syntax only [BRV16, RBV16, MT14], semantic analysis [ABK18], or both [RVK15, IKCZ18]. Leveraging syntax only may be useful in languages such as Java and JavaScript that have a rich syntax, which is not available in our difficult scenario of reverse engineering (RE) of binaries. In contrast with syntactic-only work such as [ABLY19, AZLY19], working with binaries requires a deeper semantic analysis in the spirit of [ABK18], which recovers sufficient information for training the model using semantic analysis. [ABK18], [BAGP19] and [FAB19] further leveraged semantic analysis with Graph Neural Networks, where edges in the graph were relations found using syntactic and semantic analysis. Another work [DTRG18] learned embeddings for C functions based on the CFG. We also use the CFG, but in the more difficult domain of stripped compiled binaries rather than C code.

Static analysis models for RE [HIT+18] used static analysis with CRFs to predict various properties in binaries. As we show in Section 6.5, our model achieves 20% higher scores, owing to our deeper data-flow analysis compared to their sparser model. A concurrent work with ours is DIRE [LYS+19], which addressed name prediction in binaries. However, they focused on predicting variable names, which is more of a local prediction, while we focus on predicting procedure names, which requires a more global view of the given binary. The main difference between our approach and theirs is our use of slicing to reconstruct abstract values (Section 6.3) and to construct augmented call sites. Their approach combines an LSTM applied to the flat assembly code with a graph neural network applied to the AST of the decompiled C code, without a deep static analysis such as ours and without leveraging the CFG. Their approach is thus similar to the LSTM-text baseline combined with the Nero-LSTM no-values ablation. [KRY18] showed an approach to infer subclass-superclass relations in stripped binaries. [LAB11] used static and dynamic analysis to recover high-level types; in contrast, our approach is purely static. [RBLT05] presented CodeSurfer, a binary executable framework built for analyzing x86 executables, focusing on detecting and manipulating variables in the assembly code. [SSM15] used RNNs to identify procedure boundaries inside a stripped binary.

Equivalence checking and semantic differencing Ramos and Engler [RE11] present a symbolic execution approach for proving procedure equivalence and finding bugs in different implementations of the same procedures.
They work at the source code level, and their method is limited to non-looping code. SymDiff by Lahiri et al. [LHKR12] leverages a program verifier to prove the equivalence of whole procedures by translating them to BoogieIVL, or to produce a counterexample. It has limited handling of loops, and would require translating binary code into full procedures. Furthermore, the use of provers prevents the approach from scaling. The authors of [PY13, PY14] offer some handling of loops, but apply computationally expensive abstract-domain libraries, which do not scale in our setting. Sharma et al. [SSCA13] present a technique for proving equivalence between high-level code and machine code, with loops, by trying to find a simulation relation between the two. However, the search is computationally demanding and will


not scale. Lahiri et al. [LSH15] narrow the scope to snippets of binary code that are assumed to be, but cannot be proved to be, equivalent, and attempt to find a set of changes (additions or deletions) that completes the proof. While having interesting implications for our problem, the assumption of a variable mapping hinders the transition to our setting.

Compiler bug-finding Hawblitzel et al. [HLP+13] present a technique that handles compiled versions of the same procedure from different compilers. Their goal is to identify root causes of compiler bugs, and their approach cannot be directly applied to our setting, as they: (i) require strict equivalence, and thus even a slight change would be deemed a bug; (ii) know that the IL originated from the same code, allowing them to easily match inputs and outputs (i.e., these are labeled) for solving, which is not the case for machine code; (iii) offer no method for segmenting procedures, and are thus limited in handling loops (they use loop unrolling up to 2 iterations); and (iv) use semantic tools for proving equivalence, making it hard to scale to the binary code search scenario.


Chapter 8

Conclusion

In this thesis we present our process of establishing a precise yet scalable method for performing binary code search. At each step we demonstrate the main challenges in establishing similarity in binary code, and measure our progress by comparing our different techniques against each other, against the de-facto industry standard for binary similarity, BinDiff [zyna], and against the only other name prediction method for binary code, Debin [HIT+18].

In chapter 2 we presented a new approach for measuring similarity between binary procedures created from patched versions of the same source code. To compute similarity between procedures, we decompose them into tracelets: continuous, short, partial traces of an execution. To efficiently compare tracelets (an operation that has to be applied frequently during search), we encode their matching as a constraint-solving problem. The constraints capture alignment constraints and data dependencies to match registers and memory addresses between tracelets. We implemented our approach and applied it to find matches in over a million binary procedures. We compared tracelet matching to approaches based on n-grams and graphlets, and showed that tracelet matching obtains dramatically better results in terms of precision and recall.

In chapter 3 we presented a new statistical technique for measuring similarity between procedures. The main idea is to decompose procedures into smaller comparable fragments, called strands, define semantic similarity between them, and use statistical reasoning to lift strand similarity into similarity between procedures. We implemented our technique, and applied it to find various prominent vulnerabilities, including Heartbleed, Shellshock and Venom, in a challenging cross-compiler-vendor, cross-version, and cross-configuration scenario.

Building on the components presented in chapter 3, in chapter 4 we improved the scalability of our approach for establishing similarity, and expanded it to work across multiple architectures. The main idea is to perform out-of-context “strand re-optimization”, using the compiler’s optimizer to transform each strand into a canonical normalized form. After this transformation, strands can be compared efficiently simply by hashing them. We implemented our technique and used it to perform millions of comparisons efficiently, showing that the combined use of strand re-optimization with the global context establishes similarity with high accuracy, and with an order-of-magnitude speed-up over a competing method.

In chapter 5, we showcased the method presented in chapters 3 and 4 by finding vulnerabilities in stripped firmware images. We demonstrated that the pairwise similarity of procedures cannot directly provide precise vulnerability discovery. Instead, we introduced a back-and-forth game that lifts the pairwise similarity between procedures to similarity between sets of procedures, pin-pointing the vulnerable procedure inside the set. Using our refined technique, searching in millions of procedures from public


vendor firmware images, we discovered 373 vulnerabilities, 147 of which appear in the latest available firmware version for the device.

In chapter 6 we presented a novel approach for predicting procedure names in executables. The core idea is to leverage static analysis of binaries to encode rich representations of API call sites; traverse the CFG to approximate the chronological runtime order of the call sites; and encode these sequences using two different set-of-sequences-to-sequence architectures (LSTM-based and Transformer-based). We evaluated our method on real-world stripped procedures. Our method achieves a 20% relative gain over existing non-neural approaches, and over 84% relative gain over the naïve textual baselines (“LSTM-text” and “Transformer-text”).

Overall, at each step we have shown improvements in performance and run-time. We believe that the principles presented in this thesis can serve as a basis for a wide range of tasks that involve binary code, among them malware identification, decompilation, vulnerability detection, and more advanced and precise binary code similarity.


Appendix

Additional Examples of Procedure Name Predictions

Table A.1 contains additional examples from our test set, along with the predictions made by our model and each of the baselines.

Ground Truth [HIT+18] LSTM-text Transformer-text BiLSTM call-sites Nero-LSTM (this work)

mktime from utc nettle pss ... get boundary str file mktime read buffer concat fopen safer mh print fmtspec net read filter read get widech get byte user mh decode rcpt flag do tolower get user groups display close stdin config file ipmi get user groups get user groups ftp parse winnt ls uuconf iv ... mktime print status send to file parse tree write init pos allocate pic buf open int print type cfg init wait for proc wait subprocess start open mh print fmtspec strip read string cmp error check command process io read find env find env pos proper name utf close stream read token find env write calc jacob usage msg update pattern print one paragraph write write calc outputs fsquery show debug section cwd advance fd write get script line get line make dir hier read ps line jconfig get getuser readline stdin read readline rushdb print mh decode rcpt flag write line readline read set max db age do link set owner make dir hier sparse copy set write calc deriv orthodox hdy ds symbol close stream fprint entry write type read file bt open ... disable coredump vfs read file parse options parse options finish mh print fmtspec get options parse args url free hash rehash hostname destroy setupvariables hol free free dfa content check new watcher read index check opt open source check file open input file get options query in ck rename set delete input install signal handlers signal setup regfree install handlers write calc jacob put in fp table save game var hostname destroy write filename pattern free add char segment free dfa content hostname destroy glob cleanup free exclude segment read line tartime init all close stdout parse args read locate unset var is unset url get arg regerror var is var is unset ftp parse unix ls serv select fn canonicalize parse syntax option free netrc gea compile hostname destroy hostname destroy free ent hol free string to bool string to bool setnonblock mh decode rcpt flag string to bool string to bool

Table A.1: Examples from our test set and predictions made by the different models.



Bibliography

[ABBS15] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 38–49, New York, NY, USA, 2015. ACM.

[ABK18] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. In ICLR, 2018.

[ABLY19] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations, 2019.

[Aik] Alex Aiken. Moss. https://theory.stanford.edu/~aiken/moss/.

[Ala] Jay Alammar. The illustrated transformer. http://jalammar.github.io/illustrated-transformer/.

[All18] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. arXiv preprint arXiv:1812.06469, 2018.

[APS16] Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. A convolutional attention network for extreme summarization of source code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2091–2100, 2016.

[ASLY19] Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models for any-code generation. arXiv preprint arXiv:1910.00577, 2019.

[ASU] ASUS. Asus firmware resources. https://www.asus.com/support/.

[ATGW15] Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. Bimodal modelling of source code and natural language. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 2123–2132. JMLR.org, 2015.

[AZLY18] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 404–419, New York, NY, USA, 2018. ACM.

[AZLY19] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. Code2vec: Learning distributed representations of code. Proc. ACM Program. Lang., 3(POPL):40:1–40:29, 2019.


[BA06] Sorav Bansal and Alex Aiken. Automatic generation of peephole superoptimizers. In ASPLOS XII, 2006.

[BAGP19] Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, and Oleksandr Polozov. Generative code modeling with graphs. In International Conference on Learning Representations, 2019.

[BBW+14] Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley. Byteweight: Learning to recognize functions in binary code. In Proceedings of the 23rd USENIX Conference on Security Symposium, SEC’14, pages 845–860, Berkeley, CA, USA, 2014. USENIX Association.

[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[BCD+05] Michael Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K. Rustan M. Leino. Boogie: A modular reusable verifier for object-oriented programs. In Formal Methods for Components and Objects, 4th International Symposium, FMCO 2005, Amsterdam, The Netherlands, November 1-4, 2005, Revised Lectures, pages 364–387, 2005.

[BCS97] Preston Briggs, Keith D. Cooper, and L. Taylor Simpson. Value Numbering. Software: Practice and Experience, 27(6):701–724, June 1997.

[Ben] Eli Bendersky. Position independent code (PIC) in shared libraries. https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/.

[BI06] Oren Boiman and Michal Irani. Similarity by composition. In NIPS, pages 177–184. MIT Press, 2006.

[BJAS11] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. Bap: A binary analysis platform. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, pages 463–469, Berlin, Heidelberg, 2011. Springer-Verlag.

[BL96] Thomas Ball and James R. Larus. Efficient path profiling. In Proceedings of the 29th Int. Symp. on Microarchitecture, MICRO 29, 1996.

[BMM06] Danilo Bruschi, Lorenzo Martignoni, and Mattia Monga. Detecting self-mutating malware using control-flow graph matching. In Proceedings of the Third International Conference on Detection of Intrusions and Malware & Vulnerability Assessment, DIMVA’06, pages 129–143, Berlin, Heidelberg, 2006. Springer-Verlag.

[BPS18] Rohan Bavishi, Michael Pradel, and Koushik Sen. Context2name: A deep learning-based approach to infer natural variable names from usage contexts. arXiv preprint arXiv:1809.05193, 2018.

[BR07] Gogul Balakrishnan and Thomas Reps. Divine: discovering variables in executables. In VMCAI’07, pages 1–28, 2007.

[Bro19] Marc Brockschmidt. Gnn-film: Graph neural networks with feature-wise linear modulation. arXiv preprint arXiv:1906.12192, 2019.


[BRV16] Pavol Bielik, Veselin Raychev, and Martin T. Vechev. PHOG: probabilistic model for code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2933–2942, 2016.

[Com] DWARF Standards Committee. The DWARF debugging standard. http://dwarfstd.org.

[com16] Computerworld. Updates and more on the Netgear router vulnerability. 2016.

[CSK+10] P.M. Comparetti, G. Salvaneschi, E. Kirda, C. Kolbitsch, C. Kruegel, and S. Zanero. Identifying dormant functionality in malware programs. In IEEE Symp. on Security and Privacy, 2010.

[CSW+18] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE, 2018.

[cur] curl. curl releases. https://curl.haxx.se/docs/releases.html.

[CVMG+14] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[CZFB14] Andrei Costin, Jonas Zaddach, Aurélien Francillon, and Davide Balzarotti. A large-scale analysis of the security of embedded firmwares. In Proceedings of the 23rd USENIX Conference on Security Symposium, SEC’14, pages 95–110, Berkeley, CA, USA, 2014. USENIX Association.

[DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[DLI] DLINK. D-link firmware resources. http://support.dlink.com/.

[DR10] Jianjun Duan and John Regehr. Correctness proofs for device drivers in embedded systems. In 5th International Workshop on Systems Software Verification, SSV’10, Vancouver, BC, Canada, October 6-7, 2010, 2010.

[DTRG18] Daniel DeFreez, Aditya V. Thakur, and Cindy Rubio-González. Path-based function embedding and its application to error-handling specification mining. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, pages 423–433, New York, NY, USA, 2018. ACM.

[Ehr61] Andrzej Ehrenfeucht. An application of games to the completeness problem for formalized theories. Fundamenta Mathematicae, 49(2):129–141, 1961.


[EWCB14] Manuel Egele, Maverick Woo, Peter Chapman, and David Brumley. Blanket execution: Dynamic similarity testing for program binaries and components. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, August 20-22, pages 303–317, 2014.

[EYG16] Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. discovre: Efficient cross-architecture identification of bugs in binary code. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016, 2016.

[FAB19] Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. Structured neural summarization. In International Conference on Learning Representations, 2019.

[FOW87] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, 1987.

[FZX+16a] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016, pages 480–491, 2016.

[FZX+16b] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable Graph-based Bug Search for Firmware Images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 480–491, New York, NY, USA, 2016. ACM.

[GNUa] GNU. GCC optimization options. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.

[GNUb] GNU. Gnu coreutils. http://www.gnu.org/software/coreutils.

[GNUc] GNU. LLVM’s analysis and transform passes. http://llvm.org/docs/Passes.html.

[GSR+17] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 1263–1272. JMLR.org, 2017.

[Hea10] A heap based vulnerability in GNU’s rtapelib.c, 2010. http://www.cvedetails.com/cve/CVE-2010-0624/.

[Hea14] Heartbleed vulnerability CVE information, 2014. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0160.

[HIT+18] Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, and Martin Vechev. Debin: Predicting debug information in stripped binaries. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 1667–1680, New York, NY, USA, 2018. ACM.

[HLP+13] Chris Hawblitzel, Shuvendu K. Lahiri, Kshama Pawar, Hammad Hashmi, Sedar Gokbulut, Lakshan Fernando, Dave Detlefs, and Scott Wadsworth. Will you still compile me tomorrow? static cross-version compiler validation. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE’13, Saint Petersburg, Russian Federation, August 18-26, 2013, pages 191–201, 2013.

[Hor90] Susan Horwitz. Identifying the semantic and textual differences between two versions of a program. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI ’90, pages 234–245, New York, NY, USA, 1990. ACM.

[HR] Hex-Rays. IDA Pro. http://www.hex-rays.com.

[HRB88] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, PLDI ’88, pages 35–46, New York, NY, USA, 1988. ACM.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.

[HYL17] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 1025–1035, USA, 2017. Curran Associates Inc.

[IKCZ18] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1643–1652, 2018.

[Int] Intel. Linux 64-bit ABI. https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf.

[JBV11] Jiyong Jang, David Brumley, and Shobha Venkataraman. BitShred: Feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security, pages 309–320, 2011.

[JRM11] Emily R. Jacobson, Nathan Rosenblum, and Barton P. Miller. Labeling library functions in stripped binaries. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools, PASTE ’11, pages 1–8, New York, NY, USA, 2011. ACM.

[JWB13] Jiyong Jang, Maverick Woo, and David Brumley. Towards automatic software lineage inference. In USENIX Security, 2013.

[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KK10] David G Kleinbaum and Mitchel Klein. Analysis of Matched Data Using Logistic Regression. Springer, 2010.

[KKM+06] Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Giovanni Vigna. Polymorphic worm detection using structural information of executables. In Proceedings of the 8th International Conference on Recent Advances in Intrusion Detection, RAID’05, pages 207–226, Berlin, Heidelberg, 2006. Springer-Verlag.


[KMA13] Wei Ming Khoo, Alan Mycroft, and Ross Anderson. Rendezvous: A search engine for binary code. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 329–338, Piscataway, NJ, USA, 2013. IEEE Press.

[KRY18] Omer Katz, Noam Rinetzky, and Eran Yahav. Statistical reconstruction of class hierarchies in binaries. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’18, pages 363–376, New York, NY, USA, 2018. ACM.

[KW17] Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[LA04] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.

[LAB11] JongHyup Lee, Thanassis Avgerinos, and David Brumley. Tie: Principled reverse engineering of types in binary programs. In NDSS, 2011.

[LB93] James R Lyle and David Binkley. Program slicing in the presence of pointers. In Proceedings of the 1993 Software Engineering Research Forum, pages 255–260. Citeseer, 1993.

[LCJM17] Yanxin Lu, Swarat Chaudhuri, Chris Jermaine, and David Melski. Data-driven program completion. CoRR, abs/1705.09042, 2017.

[Lei] K. Rustan M. Leino. This is Boogie 2. http://research.microsoft.com/en-us/um/people/leino/papers/krml178.pdf.

[Lev66] VI Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.

[LHKR12] Shuvendu K. Lahiri, Chris Hawblitzel, Ming Kawaguchi, and Henrique Rebêlo. Symdiff: A language-agnostic semantic diff tool for imperative programs. In CAV, pages 712–717, 2012.

[LMM+17] Cristina V Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. Déjàvu: a map of code duplicates on github. Proceedings of the ACM on Programming Languages, 1(OOPSLA):84, 2017.

[LPM15] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412–1421, 2015.

[LSH15] Shuvendu K. Lahiri, Rohit Sinha, and Chris Hawblitzel. Automatic rootcausing for program equivalence failures in binaries. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I, pages 362–379, 2015.

[LYS+19] Jeremy Lacomis, Pengcheng Yin, Edward J Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. Dire: A neural approach to decompiled identifier naming. arXiv preprint arXiv:1909.09029, 2019.


[MC05] Ginger Myles and Christian Collberg. K-gram based software birthmarks. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, pages 314–318, New York, NY, USA, 2005. ACM.

[MCJ17] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Bayesian sketch learning for program synthesis. CoRR, abs/1703.05698, 2017.

[MT14] Chris Maddison and Daniel Tarlow. Structured generative models of natural source code. In International Conference on Machine Learning, pages 649–657, 2014.

[Nep] Tamas Nepusz. YARD - yet another ROC drawer. http://github.com/ntamas/yard.

[NET] NETGEAR. Netgear firmware resources. downloadcenter.netgear.com/.

[NP13] Beng Heng Ng and A. Prakash. Expose: Discovering potential binary code re-use. In Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual, pages 492–501, July 2013.

[NS07] Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, pages 89–100, 2007.

[PGG+15] Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten Holz. Cross-architecture bug search in binary executables. In 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pages 709–724, 2015.

[PS18] Michael Pradel and Koushik Sen. DeepBugs: A learning approach to name-based bug detection. Proc. ACM Program. Lang., 2(OOPSLA):147:1–147:25, October 2018.

[PSB+14] Jannik Pewny, Felix Schuster, Lukas Bernhard, Thorsten Holz, and Christian Rossow. Leveraging semantic signatures for bug search in binary programs. In Proceedings of the 30th Annual Computer Security Applications Conference, ACSAC ’14, pages 406–415, New York, NY, USA, 2014. ACM.

[PY13] Nimrod Partush and Eran Yahav. Abstract semantic differencing for numerical programs. In Static Analysis: 20th International Symposium, SAS 2013, Seattle, WA, USA, June 20-22, 2013. Proceedings, pages 238–258. Springer, 2013.

[PY14] Nimrod Partush and Eran Yahav. Abstract semantic differencing via speculative correlation. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2014, part of SPLASH 2014, Portland, OR, USA, October 20-24, 2014, pages 811–828, 2014.

[RAJ+17] Andrew Rice, Edward Aftandilian, Ciera Jaspan, Emily Johnston, Michael Pradel, and Yulissa Arroyo-Paredes. Detecting argument selection defects. Proceedings of the ACM on Programming Languages, 1(OOPSLA):104, 2017.

[RBDL97] Thomas Reps, Thomas Ball, Manuvir Das, and James Larus. The use of program profiling for software maintenance with applications to the year 2000 problem. In Proceedings of the 6th European Software Engineering Conference Held Jointly with the 5th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC ’97/FSE-5, pages 432–449, New York, NY, USA, 1997. Springer-Verlag New York, Inc.

[RBLT05] T. Reps, G. Balakrishnan, J. Lim, and T. Teitelbaum. A next-generation platform for analyzing executables. In Proceedings of the Third Asian Conference on Programming Languages and Systems, APLAS’05, pages 212–229, Berlin, Heidelberg, 2005. Springer-Verlag.

[RBV16] Veselin Raychev, Pavol Bielik, and Martin Vechev. Probabilistic model for code with decision trees. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2016, pages 731–747, New York, NY, USA, 2016. ACM.

[RBVK16] Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’16, pages 761–774, New York, NY, USA, 2016. ACM.

[RE11] David A. Ramos and Dawson R. Engler. Practical, low-effort equivalence verification of real code. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, pages 669–685, Berlin, Heidelberg, 2011. Springer-Verlag.

[ReF] ReFirm Labs. Binwalk: A fast, easy-to-use tool for analyzing, reverse engineering, and extracting firmware images. https://github.com/ReFirmLabs/binwalk.

[RMZ10] Nathan E. Rosenblum, Barton P. Miller, and Xiaojin Zhu. Extracting compiler provenance from program binaries. In Proceedings of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE ’10, pages 21–28, New York, NY, USA, 2010. ACM.

[RMZ11] Nathan Rosenblum, Barton P. Miller, and Xiaojin Zhu. Recovering the toolchain provenance of binary code. In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA ’11, pages 100–110, New York, NY, USA, 2011. ACM.

[RVK15] Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting program properties from "big code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pages 111–124, New York, NY, USA, 2015. ACM.

[RVY14] Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 419–428, New York, NY, USA, 2014. ACM.

[RWC+18] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2018.

[RZM11] Nathan Rosenblum, Xiaojin Zhu, and Barton P. Miller. Who wrote this code? Identifying the authors of program binaries. In Proceedings of the 16th European Conference on Research in Computer Security, ESORICS’11, pages 172–189, Berlin, Heidelberg, 2011. Springer-Verlag.

[SADB10] S. Joshua Swamidass, Chloé-Agathe Azencott, Kenny Daily, and Pierre Baldi. A CROC stronger than ROC. Bioinformatics, 26(10), 2010.

[SGS+16] Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016.

[SGT+08] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

[SH09] Randy Smith and Susan Horwitz. Detecting and measuring similarity in code clones. In Proceedings of the International Workshop on Software Clones (IWSC), 2009.

[She14] Shellshock vulnerability CVE information, 2014. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-6271.

[SHK+14] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[SKB+18] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.

[SLL+18] Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. Retrieval on source code: A neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, pages 31–41, 2018.

[SMA] SMACK. SMACK: A bounded software verifier for C programs. https://github.com/smackers/smack.

[SSA13] Eric Schkufza, Rahul Sharma, and Alex Aiken. Stochastic superoptimization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 305–316, New York, NY, USA, 2013. ACM.

[SSCA13] Rahul Sharma, Eric Schkufza, Berkeley Churchill, and Alex Aiken. Data-driven equivalence checking. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 391–406, New York, NY, USA, 2013. ACM.

[SSM15] Eui Chul Richard Shin, Dawn Song, and Reza Moazzezi. Recognizing functions in binaries with neural networks. In USENIX Security Symposium, pages 611–626, 2015.

[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[SWA03] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 76–85, New York, NY, USA, 2003. ACM.

[SWP+09] Andreas Sæbjørnsen, Jeremiah Willcock, Thomas Panas, Daniel Quinlan, and Zhendong Su. Detecting code clones in binary executables. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA ’09, pages 117–128, New York, NY, USA, 2009. ACM.

[SWS+16] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. SoK: (State of) The art of war: Offensive techniques in binary analysis. In IEEE Symposium on Security and Privacy (SP), 2016.

[TIO19] TIOBE. TIOBE index for the C programming language. 2019. https://webcache.googleusercontent.com/search?q=cache:13SyxILk7UMJ:https://www.tiobe.com/tiobe-index/c/.

[Tow18] Towards Data Science. Understanding AUC - ROC curve. 2018. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5.

[tra] Trail of Bits. McSema: Framework for lifting x86, amd64, and aarch64 program binaries to LLVM bitcode. https://github.com/trailofbits/mcsema.

[VCC+18] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[Wei81] Mark Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering, San Diego, California, USA, March 9-12, 1981, pages 439–449, 1981.

[WF74] Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.

[ZL17] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.

[zyna] zynamics. zynamics BinDiff. http://www.zynamics.com/bindiff.html.

[zynb] zynamics. zynamics BinDiff manual - Understanding BinDiff. www.zynamics.com/bindiff/manual/index.html#chapUnderstanding.

Abstract

This work addresses the problem of similarity between executables. Executables are created by a compiler that translates source code into machine code, also known as binary code, from which executables are assembled. The family of high-level programming languages requiring this compilation process is more than fifty years old; its leading and best-known members are C and C++. Despite the time that has passed, and despite the many languages developed since that are more secure and more convenient for development, compiled languages have kept their important place in the development of software packages. Packages such as operating systems, as well as the software behind the current wave of wearable computing and Internet-of-Things products, are developed in these languages. The reason lies in the constraint of limited resources and the demand for fast execution, which are precisely the characteristics these languages provide.

Software written in these languages is distributed to its customers in two forms: open source and closed source. On the one hand, open source allows developers to fix the code themselves and compile it for whatever runtime environment (a combination of processor type, operating system, and libraries) they choose, but it creates substantial difficulties in compiling the code as time passes and compilation toolchains evolve and change. On the other hand, closed-source distribution guarantees that the code will run at any point in time, but only in the runtime environment for which it was originally compiled. In either case, the compilation process discards a great deal of information, which complicates any analysis of the executable. Moreover, to save space, all information not strictly required for execution is stripped from the executable. This stripping removes debug information, the mapping between source-code procedures and the corresponding parts of the binary code, and much more.

This work therefore focuses on similarity between executables from which debug information has been stripped. The central challenge in establishing similarity is that the binary code in these executables is produced by compilers of different kinds, each configurable in different ways, most notably the compiler's optimization level, and may even target different processors. In addition, we want to account for the possibility that the source code from which the executables were created originates in a different version of the software package, or even in an entirely different package. Meeting these challenges while avoiding false positives is a vital step toward assisting other, more expensive tasks in executable analysis, such as automated reverse engineering and the detection of security vulnerabilities.

We present a multi-stage process in which each stage explains and tackles, by means of static analysis, a different part of the executable-similarity problem. At each stage we refine and improve the components of our similarity method: we improve our representation of executables while incorporating techniques from other fields of knowledge, thereby constructing a scale for measuring the degree of similarity between executables. These fields include model theory, statistical systems, solvers for satisfiability problems, and deep learning with artificial neural networks.

We evaluated all of the methods we developed in scenarios that simulate real-world conditions as closely as possible. In doing so, we challenged the methods to find security vulnerabilities through search, and to predict meaningful names for machine-code procedures. In this process we discovered three hundred and seventy-three security vulnerabilities affecting firmware images available for download on the Internet. Of these, one hundred and forty-seven affected firmware images are the newest version available for download for their device. We successfully predicted procedure names, improving accuracy by twenty percent over the state-of-the-art method for name prediction in executables, and by eighty-four percent over the most recent neural-network-based methods that make no use of static analysis.

The methods for establishing similarity between executables and for predicting the names of the procedures within them, together with the techniques employed to build them, constitute a significant contribution to the field of executable analysis.
