
Master of Science in Software Engineering May 2020

To Force a Bug: Extending Hybrid

Johan Näslund Henrik Nero

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Software Engineering. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Authors: Johan Näslund E-mail: [email protected]

Henrik Nero E-mail: [email protected]

University advisor: Dr. Dragos Ilie, Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

One of the more promising approaches to automated binary testing today is hybrid fuzzing, a combination of two acknowledged techniques for detecting errors in code: fuzzing and symbolic execution. Hybrid fuzzing was pioneered by the authors of Angr and Driller, opening up the possibility for more specialized tools such as QSYM to come forth. These hybrid fuzzers are coverage guided, meaning they measure their success by how much code they have covered. This is a typical approach but, as with many, it is not flawless. Just because a region of code has been covered does not mean it has been fully tested. Some flaws depend on the context in which the code is being executed, such as double-free vulnerabilities. Even if the free routine has been invoked twice, it does not mean that a double-free bug has occurred. To cause such a vulnerability, one has to free the same memory chunk twice (without it being reallocated between the two invocations of free). In this research, we extend one of the current state-of-the-art hybrid fuzzers, QSYM, which is an open source project. We make this extension, adding double-free detection, in a tool we call QSIMP. We then investigate our hypothesis, which states that it is possible to implement such functionality without losing so much performance that the tool becomes impractical. To test our hypothesis we have designed two experiments. One experiment tests the ability of our tool to find double-free bugs (the type of context-sensitive bug that we have chosen to test with). In our second experiment, we explore the scalability of the tool when this functionality is executed. Our experiments showed that we were able to implement context-sensitive bug detection within QSYM. We can find most double-free vulnerabilities we have tested it on, although not all, because of some optimizations that we were unable to work around. According to our tests, this has been done with only a small effect on scalability. Our tool can find the same bugs as the original QSYM while adding functionality to find double-free vulnerabilities.

Keywords: Symbolic execution, fuzzing, context-sensitive, bug


Sammanfattning

One of the more promising solutions for automated binary testing today is hybrid fuzzing, a combination of two established approaches: fuzzing and symbolic execution. The researchers who developed Angr and Driller are often considered among the first to test this approach, which in turn has opened up for more specialized tools such as QSYM. These hybrid fuzzers usually measure their success in terms of how much code is reached during testing. This is a typical approach, but as with many methods it is not flawless. Code that has been executed without triggering a bug is not necessarily free of errors. Some bugs depend on the context in which machine instructions are executed; double-free vulnerabilities are one example. That memory has been freed several times does not necessarily mean that a double-free vulnerability has occurred. For such a vulnerability to arise, the same memory must be freed several times (without that memory being reallocated between the calls to free). In this project we extend one of the leading hybrid fuzzers, QSYM, an open source project. What we add is double-free detection, in a tool we call QSIMP. We then investigate our hypothesis, which states that it is possible to implement such functionality without losing so much performance that the tool becomes impractical. To test the hypothesis we have designed two experiments. One experiment tests the tool's ability to detect double-free vulnerabilities (the kind of context-sensitive vulnerability we have chosen to focus on). In the second experiment we explore whether the tool is scalable when the new functionality is used. Our experiments show that we have enabled detection of context-sensitive bugs by extending the tool QSYM. QSIMP finds double-free bugs, though not all of them, because of optimizations that we have not managed to work around. This has been achieved without major effects on the scalability of the tool according to the results from our experiments. Our tool finds the same bugs as the original QSYM tool, while adding functionality for finding double-free vulnerabilities.

Keywords: symbolic execution, fuzzing, context-sensitive, bug


Acknowledgments

We would like to greatly thank Dr. Dragos Ilie for his assistance, counsel, and proofreading throughout our research. We also want to express our gratitude to Erik Bergenholtz for, alongside his own research, actively participating in our meetings and discussions. Additionally, we would like to thank Martin Strand for his supervision, engagement, and excellent guidance on the practical matters of binary analysis. Also, we are very grateful for the help of Dr. Carl Löndahl with getting started, both theoretically and academically, proofreading, and opening up the possibility to work with TrueSec and Säkerhetskontoret. Finally, we want to thank Mikael Lagström and TrueSec for giving us an inspiring workplace for our thesis, and for lending us the hardware needed to run the experiments.


Contents

Abstract

Sammanfattning

Acknowledgments

1 Introduction
  1.1 Motivation
  1.2 Problem Formulation
    1.2.1 Research Questions
    1.2.2 Hypothesis
    1.2.3 Scope
  1.3 Thesis Outline

2 Background
  2.1 Bugs
  2.2 Fuzzing
  2.3 Dynamic Symbolic Execution
    2.3.1 Soundness
    2.3.2 Concretization
  2.4 Hybrid Fuzzing
  2.5 Context-Sensitive Bugs

3 Related Work
  3.1 Satisfiability Modulo Theory Solvers
  3.2 Automated Vulnerability Discovery
  3.3 Hybrid Fuzzing
  3.4 Symbolic Memory

4 Method
  4.1 Current State of the Art
  4.2 Implementation
    4.2.1 DF Detection Algorithm
    4.2.2 Libdislocator
    4.2.3 Symbolic Load
  4.3 Dataset
  4.4 Evaluation
    4.4.1 Context-Sensitive Binaries Experiment
    4.4.2 Scalability Experiment
  4.5 Equipment

5 Results and Analysis
  5.1 Context-Sensitive Bugs Experiment
  5.2 Scalability Experiment
  5.3 Analysis
    5.3.1 Context-Sensitive Bugs Experiment
    5.3.2 Scalability and Efficiency Experiment
    5.3.3 Combined Result Analysis

6 Discussion
  6.1 Conducted Research
    6.1.1 RQ1: DF Triggering Using SMT
    6.1.2 RQ2: Scalability
  6.2 General Discussion
    6.2.1 Further Applicability
    6.2.2 Implementation Weaknesses
  6.3 Limitations
    6.3.1 Equipment around QSYM
    6.3.2 Symbolic Memory
    6.3.3 Datasets
    6.3.4 Libdislocator
  6.4 Validity Threats

7 Conclusions and Future Work
  7.1 Conclusion
  7.2 Future Work
    7.2.1 Further optimizations
    7.2.2 Other interesting ideas

A Graphs of path coverage

Chapter 1 Introduction

Software runs everywhere today. Almost every large company has code running, and security breaches commonly involve exploitation of software. Everyday objects, previously lacking networking capabilities, are being connected to the Internet, a growing concept commonly referred to as the Internet of Things. Day by day this digital revolution increases the risk and impact of software vulnerabilities. Because the amount of code written is constantly increasing, and because such vulnerabilities are difficult to find, automation of vulnerability detection has become a necessity. Manually testing binaries can be time consuming, which is why effort is still being put into automation. Tool-assisted detection today is done via fuzzing, static analysis such as symbolic execution, and taint analysis.

One of the most utilized approaches for automated testing fits very well with the “A good programmer is a lazy programmer” paradigm, namely fuzzing: letting the computer generate “random” input and observing how the program reacts. State-of-the-art fuzzers today use evolutionary algorithms and minor instrumentation of the program to create input that is more likely to cause the program to crash [25]. The strength of fuzzers lies in their near-native execution performance. They are good at generating general input but inefficient when encountering specific constraints in binaries.

Another method employed is symbolic execution (usually dynamic), simulating all possible paths of the binary while collecting variables and constraints that depend on user input. By using satisfiability modulo theory, mathematical constraint solving in computing, a symbolic execution engine can use the collected variables to see what input is needed to reach a certain branch of the program. This technique is good at handling fine-grained constraints but suffers from the path explosion problem caused by having to explore every possible path in the binary [30, 23].

The two aforementioned approaches both have problems of their own, but in recent years they have been combined into something often called hybrid fuzzing. The fuzzer does the heavy lifting, generating much of the coverage in the binary. The symbolic executor then “piggybacks” on the traces generated by the fuzzer, solving constraints along them to find new possible inputs, and feeds those back to the fuzzer.

1.1 Motivation

Driller and QSYM [32, 36], which we argue are the current state of the art in hybrid fuzzing, use code coverage to measure progress. This is a good metric for finding many bugs, but it does not mean that they find all bugs. There are situations where, by invoking the constraint solver, we could find bugs that would otherwise go unnoticed because of the nature of the coverage-based metric. Just like there are very specific conditions to be met in order to enter certain code regions (requiring a constraint solver to pass efficiently), the same goes for bugs. Some bugs are not necessarily dependent on the control flow of the program, meaning that current hybrid fuzzers, being coverage-guided, may not react to changes in input data if they do not directly affect the order of executed machine instructions. In this paper, we refer to such bugs as context-sensitive bugs. If we are aware of which instructions are likely to generate certain context-sensitive bugs, we can instrument them to invoke the constraint solver and try to forcefully trigger them. A similar idea appears in research by Eckert et al. [10]. In their tool HeapHopper they create models for interacting with the heap and can correspondingly generate bug-inducing input. Their research was a big inspiration in the early stages of the project.

Heap-related bugs such as double-free (DF) are not necessarily dependent on the control flow of the program, meaning they can be categorized as context-sensitive bugs. We believe that our way of forcing DF bugs can act as a proof of concept for methodically finding other types of bugs that could result in access to memory regions which should not be directly affected by user data. In the worst case, these types of vulnerabilities could give a malicious user arbitrary reads or writes to sensitive areas of memory (e.g. virtual function tables) to achieve code execution. We argue that it should be possible to implement such functionality without impeding the hybrid fuzzer's performance too much. In this project, we will implement this functionality specifically for DF vulnerabilities into an already present tool (QSYM), and evaluate the performance to assess our thesis.

1.2 Problem Formulation

While automated bug discovery is a flourishing field, it is well known that there is no panacea to defeating bugs as a whole. The purpose of this research is to improve automated detection of context-sensitive bugs that are located deep inside binaries and are difficult to trigger. We seek to investigate whether it is possible to force DFs using additional constraint solving without taking too big a performance hit compared to current state-of-the-art hybrid fuzzers. We will build a tool, which we call QSIMP, by extending the open source hybrid fuzzer QSYM, adding additional features to find context-sensitive bugs using constraint solving. This will be done by adding instrumentation to important functions such as free and malloc, tracking interesting data to be able to build constraints that test the possibility of forcing DFs. Furthermore, we will extend QSYM's symbolic loading, as it uses a very strict concretization strategy. As this implementation risks impeding the efficiency and performance of the tool, we will strive towards being as non-invasive as possible. The performance impact from the additional overhead caused by our implementation will be analysed by comparing it to the original implementation of QSYM. To measure the performance of QSIMP we perform two experiments, using two datasets. The first dataset is built by the authors to test the efficacy of the two versions in finding context-sensitive bugs. We call this the DF dataset and it is publicly available [24]. The second experiment is meant to compare the scalability of QSYM and QSIMP and makes use of the LAVA-M dataset [8]. When we say scalability we mean the tool's ability to maintain its practicality as the program analysed becomes bigger and more complex.

1.2.1 Research Questions

RQ1 How can constraint solving be used to trigger context-sensitive double-free bugs?

RQ2 How can coverage-guided hybrid fuzzing be utilized to trigger context-sensitive bugs in an efficient and scalable manner?

1.2.2 Hypothesis

Eckert et al. have used constraint solving to trigger several common heap-related bugs during their analysis of heap algorithms [10]. Because hybrid fuzzers handle symbolic data, we believe it should be possible to add functionality for force-triggering context-sensitive DF bugs in a similar manner, with a slight performance cost due to the added instrumentation. Even though performance is impaired, we still believe that the tool should maintain its practicality. Thus, we hope to end up with a tool that can find more types of bugs, at a lower (but still acceptable) execution speed.

1.2.3 Scope

Automated binary analysis is a broad field with many use cases. It can be approached in a variety of ways, which is why we have limited our research to DF vulnerabilities and the hybrid fuzzing method. We have chosen to focus on 32-bit Linux x86 binaries (i386) and/or 64-bit Linux x86 binaries (amd64). We have chosen x86 because it is widely used (it runs on a majority of personal computers) and well documented. Using Linux, which is open source, makes development easier than on a closed source platform. We will also need to build on already present tools, limiting us to the capabilities of previous implementations.

1.3 Thesis Outline

The rest of the thesis is structured as follows:

• Chapter 2 introduces all the relevant theory needed to understand this thesis work. However, the reader is assumed to have a good understanding of software development.

• Chapter 3 presents other work that this research is inspired by and based on.

• Chapter 4 discusses implementation details and how the evaluation is performed for the algorithms developed in this thesis.

• Chapter 5 presents the results from the two conducted experiments and investigates their implications.

• Chapter 6 further discusses limitations and the retrieved results, and puts some key concepts in a wider context. It also answers the two research questions.

• Chapter 7 presents the conclusions that can be drawn from our work. It also discusses possible future development and research strategies.

Chapter 2 Background

2.1 Bugs

In lower-level Linux programs, there are several known types of vulnerabilities, such as buffer overflow and format-string vulnerabilities. Aside from logical bugs, there are memory-corruption vulnerabilities, which are commonly stack based or heap based (or a combination of the two).

Dynamic memory in lower-level programming languages like C is handled by the developers themselves: they can have their program ask the system to allocate memory for it during runtime, and release (free) the memory when the program no longer needs it. Which underlying algorithm is used for serving these memory allocation calls depends on which libraries are used. The GNU C library, for example, uses the ptmalloc [33] algorithm, while FreeBSD and NetBSD use jemalloc [14]. Because allocation and releasing of dynamic memory are handled by programmers themselves, calls to these algorithms are prone to human-introduced logical bugs. Common bugs include Use-After-Free (UAF) and DF vulnerabilities. These vulnerabilities mean, respectively, that memory is accidentally used after it has already been released, or that the same memory is accidentally released twice (or more). Such memory-related bugs remain, according to the Microsoft Security Response Center, the dominant cause of vulnerabilities in software [22]. As shown in Figure 2.1, heap corruption and UAF have been two of the largest issues since 2016.

In the worst case, a memory corruption bug can lead to arbitrary code execution. A fairly recent example of this is when an exploitable DF bug was found in WhatsApp for Android (WhatsApp versions before 2.19.244). Using this vulnerability, an adversary could gain remote code execution on a victim's phone by sending a specially crafted Graphics Interchange Format (GIF) file [13].

Locating these types of bugs can be tedious and difficult. This is especially the case when one does not have access to the source code, as is usually the case when working with commercial off-the-shelf (COTS) binary programs; as Dinesh et al. state, even widely used Linux applications are sometimes closed source [7]. Fuzzing, as a black-box approach, has shown very good results throughout the years but can have trouble reaching deep code regions [25]. Furthermore, even if a bug happens, it may go undetected if the bug does not crash the program [19]. When instead symbolically executing, the problem of reaching deep code remains, as memory can quickly be exhausted due to path explosion [30].
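To make these two bug classes concrete, the following minimal snippet (our own illustration, not taken from any of the programs discussed here) contains both a use-after-free and a double free:

    #include <cstdlib>
    #include <cstring>

    int main() {
        char *buf = static_cast<char *>(malloc(16));

        free(buf);            // the chunk is returned to the allocator

        strcpy(buf, "stale"); // use-after-free: writing through a dangling pointer
        free(buf);            // double free: the same chunk is released a second time

        return 0;
    }

Depending on the allocator, neither of the last two lines necessarily crashes the program, which is exactly why such bugs are easy to miss.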


Figure 2.1: Microsoft vulnerability and exploitation trends [22]

2.2 Fuzzing

Manes et al. [21] describe fuzzing as repeatedly running a program or function with fuzz-input while monitoring the program for undefined or unexpected behaviour, fuzz-input being defined as input that the program may not expect. This is usually done using some kind of helper tool called a fuzzer. Fuzzers can be divided into three categories depending on how much information they are given about the program: black-box (no information), grey-box (limited information), and white-box (full information). Black-box fuzzing has the advantage of not needing, or having to analyse, information about the binary, making it fast and often simpler in its implementation. Grey-box fuzzing involves lightly instrumenting what is fuzzed; for example, one might use coverage as a metric. White-box fuzzers use the full semantics of a binary, which in turn often makes them heavier to run; many white-box fuzzers, for example, use dynamic symbolic execution to gain knowledge about the binary.

A modern, popular fuzzer that is heavily used both in research and commercially is American Fuzzy Lop (AFL). It has an impressive suite of bug findings that can be found on its webpage [38]. One of the major strengths of AFL is, according to Rawat et al. [25], its feedback loop, assessing the success of previous mutations and generating new input based on them.

Although fuzzers are fast at generating and executing test cases, the major downside is that they merely generate user inputs based on some shallow metrics, without utilizing any form of intelligent binary analysis. This makes it hard for fuzzing to detect behaviours triggered by very specific inputs.
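As a rough illustration of the loop described by Manes et al., the following toy black-box fuzzer (our own sketch, not AFL; the target name ./target and the seed are made up for the example) mutates a seed at random and reports inputs that make the target terminate abnormally:

    #include <cstdio>
    #include <cstdlib>
    #include <fstream>
    #include <random>
    #include <string>
    #include <sys/wait.h>

    int main() {
        std::string seed = "hello world";   // assumed initial input
        std::mt19937 rng(0);

        for (int i = 0; i < 1000; i++) {
            // Mutate one random bit of the seed.
            std::string input = seed;
            input[rng() % input.size()] ^= static_cast<char>(1 << (rng() % 8));

            // Write the test case and run the target on it.
            {
                std::ofstream out("case.bin", std::ios::binary);
                out << input;
            }
            int status = std::system("./target case.bin");

            // A crash shows up either as a delivered signal or, depending on
            // the shell, as an exit code above 128.
            if (status != -1 &&
                (WIFSIGNALED(status) ||
                 (WIFEXITED(status) && WEXITSTATUS(status) > 128)))
                std::printf("crashing input found on iteration %d\n", i);
        }
        return 0;
    }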

2.3 Dynamic Symbolic Execution

In this paper, when we talk about symbolic execution, we are referring to dynamic symbolic execution. This is an optimization that executes the binary concretely, with real values, in parallel with the symbolic execution. Symbolic execution, as opposed to fuzzing, simulates execution to explore all possible execution paths that a program can take. Symbolic execution utilizes Satisfiability Modulo Theories (SMT) to generate a Proof of Vulnerability (PoV) and can, therefore, be a powerful way of analysing binaries. SMT uses predicate logic to test the satisfiability of different logical formulas by combining theories such as bit-vector, array, and arithmetic theories [27]. A formula in our case is an equation that defines what input, if any, is needed to traverse a certain path in the binary. An example is solving simple arithmetical problems such as 0 ≤ x < 300. Z3, the SMT solver used in this project, is developed by Microsoft Research [37]. It is focused on solving problems that can arise in development, such as software verification during testing. An essential functionality of SMT solvers is that they can generate models of formulas, which in turn can be used to generate input that satisfies the constraints of the formula [3]. A model of a formula is a description holding one single possible value assignment that fulfills the conditions of the given formula. In other words, SMT solvers can produce inputs satisfying long chains of constraints stemming from conditional branches, to have execution reach a specific code region.

The way most current state-of-the-art symbolic engines execute binaries is by simulating real execution. The symbolic executor tracks the flow of data throughout the simulation, pinning down variables that are dependent on user input. The executor simultaneously generates expressions representing these variables, connecting them to conditional branches, successively building formulas. Expressions are often represented as an Abstract Syntax Tree (AST) composed of constants, input variables, and operators. In Figure 2.2 we show an AST containing an expression for validating 0 ≤ x < 300. With each conditional branch that depends on user-generated input, the simulation splits into multiple states of execution. A state holds the context of a single execution, including values in registers and memory, and open files [30]. Each state is stored as a snapshot; snapshotting is an optimization to avoid re-execution at the cost of memory. By continuously simulating every branch, the execution engine generates every possible outcome of the binary. In larger and more complex programs, however, symbolic execution suffers from path explosion because every time an execution path forks, the set of potential paths doubles, causing the number of paths to grow exponentially. Several different techniques have been implemented to try and mitigate this problem, e.g. veritesting [1], concolic execution [28, 5], and prioritizing promising branches [39, 4], but they all suffer from either losing coverage, losing precision, or simply not managing to remove the path explosion, only postponing it.
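As a small illustration of how such a formula is handed to a solver, the sketch below uses Z3's C++ API (the same API QSYM builds on, see Chapter 3) to ask for a model of 0 ≤ x < 300:

    #include <iostream>
    #include <z3++.h>

    int main() {
        z3::context ctx;
        z3::expr x = ctx.int_const("x");

        z3::solver s(ctx);
        s.add(x >= 0 && x < 300);        // the formula from the text

        if (s.check() == z3::sat) {      // is any satisfying input possible?
            z3::model m = s.get_model(); // one concrete assignment (a model)
            std::cout << "x = " << m.eval(x) << std::endl;
        }
        return 0;
    }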

Figure 2.2: Example of how a formula could look when represented as an AST.

2.3.1 Soundness

Soundness is a way of measuring how well an analysis represents all possible solutions to a problem [30]. Because symbolic execution is a demanding task, owing to its semantically high insight and the path explosion problem, many optimizations have been made to it throughout the years [30]. These optimizations often sacrifice some soundness to gain performance.

2.3.2 Concretization

Symbolic memory is a way of simulating symbolic addresses and symbolic values in volatile memory. Memory is hard to model because one often has to represent all possible solutions of something that could be of arbitrary length or stored at an arbitrary position. Even when achieving a fully sound analysis, modeling all possibilities can quickly exhaust the memory. To combat this memory exhaustion, symbolic expressions are sometimes concretized (i.e. reduced) to one or a number of constant values during execution. This means that there is a trade-off between soundness and scalable performance [2]. Symbolic loads and stores, dereferencing instructions containing symbolic addresses or values, are because of this a point where some soundness is often lost.
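A minimal sketch of what concretization can look like in an engine built on Z3's C++ API (the helper below and the engine around it are hypothetical): the solver is asked for one concrete value of a symbolic address, and that value is then pinned with an extra constraint, discarding every other possibility.

    #include <cstdint>
    #include <stdexcept>
    #include <z3++.h>

    // Hypothetical helper: reduce a symbolic address expression to one concrete
    // value under the path constraints already held by 'solver'.
    uint64_t concretize(z3::solver &solver, const z3::expr &addr) {
        if (solver.check() != z3::sat)
            throw std::runtime_error("path constraints are unsatisfiable");

        z3::model m = solver.get_model();
        z3::expr pick = m.eval(addr, true);   // one value the address can take

        // Pin the address to that value; later reasoning only sees this single
        // location, which is where soundness is traded for performance.
        solver.add(addr == pick);
        return pick.get_numeral_uint64();
    }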

2.4 Hybrid Fuzzing

In recent years fuzzing and symbolic execution have been combined to utilize the advantages of both techniques, two of the more successful hybrid fuzzers being Driller and QSYM. Fuzzing is efficient in gaining code coverage where constraints are loose, and symbolic execution can assist with its strength in solving narrow or precise constraints. Hybrid fuzzing is generally done by letting the fuzzer output intermediate seeds that can be fed to the symbolic engine. The symbolic executor then uses the input seed as a guide, following the path or trace of the input through the binary. At conditional branches, the SMT solver can be invoked to generate inputs that lead to executing the branch not taken when following the trace. In this way, the fuzzer is helped into new parts of the binary. To further lighten the burden of the symbolic engine, QSYM uses a bitmap to keep track of which parts of the binary have been explored or not. The bitmap functionality used is similar to how AFL does it.

AFL, by default, uses 64kB of shared memory to measure coverage [17]. On transitions between basic blocks in the Control Flow Graph (CFG) of the binary, AFL generates hashes. The input for the hash is the addresses of the two blocks the transition happened between, and the hash is a value between 0 and 64kB that is stored as a bit in the bitmap. If a new position in the bitmap is filled, the path is considered new. A problem with this approach is that there will be collisions, meaning that sometimes AFL will not know that it has made progress.
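A simplified sketch of this kind of edge-coverage bookkeeping (our own illustration; AFL's actual instrumentation differs in its details) could look as follows:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t MAP_SIZE = 64 * 1024;   // 64kB coverage map
    static uint8_t bitmap[MAP_SIZE];

    // Called on every transition between two basic blocks. Returns true if the
    // edge has not been observed before, i.e. the input reached something new.
    bool record_edge(uint64_t prev_block, uint64_t cur_block) {
        std::size_t index = (prev_block ^ (cur_block >> 1)) % MAP_SIZE; // cheap hash
        bool is_new = (bitmap[index] == 0);
        bitmap[index] = 1;      // mark the position; collisions are possible
        return is_new;
    }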

1   void challenge(){
2       int x;
3       char buf[32];
4
5       read_int_from_user(&x);
6       read_buf_from_user(&buf, 32);
7
8       // Challenge for fuzzing
9       if (x == 0x8badf00d){
10          printf("Step 1 passed\n");
11
12          // Challenge for symbolic execution
13          int count = 0;
14          for (int i = 0; i < 32; i++){
15              if (buf[i] >= 'a')
16                  count++;
17          }
18
19          if (count >= 8){
20              printf("Step 2 passed\n");
21          }
22      }
23  }

Listing 1: Example showing the different difficulties for fuzzing and symbolic execution

In Listing 1 there is a code snippet containing challenges for both symbolic execution and fuzzing:

• The if statement on line 9 puts a strict constraint on the variable x, which is a good example of where symbolic execution shines and fuzzing has trouble finding the correct input. The reason this would take a fuzzer a long time is that it effectively has to guess the exact value that is compared in the if statement.

• On the contrary, on line 15 the challenge is for the symbolic execution engine, which will suffer from path explosion caused by the 2^32 possible paths of the for loop.

So by running the two techniques in combination, both of these problems can be mitigated. When one technique flounders, the other can take over, maximizing coverage and the number of bugs being discovered.

2.5 Context-Sensitive Bugs

Driller, QSYM, and Digfuzz [39] are hybrid fuzzers with a broad focus on errors in general, as they mainly search for bugs that can crash a program. Although the triggering of a bug may not crash a process, a program may be instrumented to crash whenever a bug is detected. There are bug detectors in which this is the default behaviour, such as the widely used Address Sanitizer (ASan) [29] developed by Google. This instrumentation requires having source code because it happens at compile time. Successful research has also been done in instrumenting address sanitation directly in binaries (i.e. after they have already been compiled), referred to as binary address sanitation (BASan) [7].

Something to note is that using the tools and techniques mentioned above requires that a bug actually occurs for it to be detected. In hybrid fuzzers, a bug may be triggered by fuzzing but it can also be completely missed. The symbolic execution engine in a hybrid fuzzer mainly serves to increase code coverage by solving conditions that are hard for a fuzzer to pass (e.g. a specific switch case). Modern hybrid fuzzers do not seem to integrate the constraint solver to forcefully trigger bugs. Put differently, in addition to having the symbolic engine solve conditions for new branches to be explored, it could also be used to fulfill conditions for certain bugs to forcibly occur, by utilizing the constraint solver to test whether symbolic data, fed to functions or specific instructions, can take on dangerous values leading to undefined behaviour. This is under the presumption that different executions of the same execution path may or may not result in a bug, depending on the data used within the instructions that are executed in that path.

An example bug is shown in Listing 2. What can be observed here is that any input value of x between zero and 767 will result in the same instructions being executed. However, a DF bug will only occur if x is equal to 423.

#define NROF_THINGS 0x300
#define VULN_INDEX 0x1a7
// ...

// Allocate 0x300 chunks for storing integers.
for (int i = 0; i < NROF_THINGS; i++){
    things[i] = malloc(sizeof(int));
}

// Read a user-controlled integer
int x = read_user_int();

// Make sure it is between 0 and 0x300
if (x >= NROF_THINGS || x < 0)
    exit(1);

// Free a specific chunk
free(things[VULN_INDEX]);

// Free a chunk with the user-specified index.
// If the user selects index 0x1a7, a double free
// bug will occur, even though the same code is
// executed in any of the cases. Thus, a coverage-
// guided hybrid fuzzer will not attempt to actively
// explore this case.
free(things[x]);

// ...

Listing 2: Example of a context-sensitive bug. Even though the same code will be executed provided the first if statement is passed, the source of the bug lies within the user-controlled array access.

Chapter 3 Related Work

3.1 Satisfiability Modulo Theory Solvers

SMT solvers are part of what makes symbolic execution so powerful. They are, among other things, used to analyse logical formulas and solve equations, while being able to handle data structures common in computer science such as lists, arrays, and bit vectors. They can be used to assess the satisfiability of an equation, given a number of constraints, and to generate a solution if there is one, which is their main purpose in binary analysis. According to Rönn [26], the solvers most commonly found in symbolic execution engines are STP [12] and Z3 [23]. QSYM and our code make use of Z3's C++ API.

3.2 Automated Vulnerability Discovery

Over the last few years, much research has been done on automatic vulnerability detection, exploitation, and patching, thanks to events such as the DARPA CGC 2016 [11]. In 2012 Cha et al. presented Mayhem [5], which primarily introduced two novel techniques. Firstly, they present hybrid symbolic execution, a more efficient algorithm for snapshotting and forking executions compared to earlier works. The authors of Mayhem make a distinction between an online and an offline symbolic executor, where the former forks at each branch point in a binary while the latter reasons about a single execution path at a time. Hybrid symbolic execution is an attempt to combine the speed of online execution with the lower memory use of offline execution to more efficiently explore the input space. Secondly, they introduce a way to handle symbolic memory in real-world applications called index-based memory modeling. This technique allows Mayhem to reason about symbolic indices, which has shown importance in automated binary analysis; one of the experiments in their research showed that more than 40% of the analysed programs were deemed exploitable only if symbolic indices were used during analysis.

DoubleTake [20] uses evidence-based analysis during runtime, meaning it detects that a bug has occurred and tries to reproduce it afterwards. HeapHopper [10] focuses on finding new exploitation primitives inside an actual heap algorithm. Pangr [19], Driller [32], QSYM [36], and Digfuzz [39] work similarly to one another, in that they combine symbolic execution with fuzzing (i.e. hybrid fuzzing) to find bugs, exploit them, and patch them. Pangr had trouble finding and exploiting heap vulnerabilities and was mostly tested on binaries that had stack- and format-string-based vulnerabilities.


Angr is a framework for dynamic symbolic execution created by Shoshitaishvili et al. [30] at UC Santa Barbara. The project is meant to summarize the current state of automated binary analysis while creating an open source framework for future research. Angr is written in Python, has an easy-to-use high-level API, and is built so that more well-versed users can implement their own binary analysis strategies. Angr is one of the first big open source frameworks for symbolic execution.

In the research done by Eckert et al. [10] they build a tool, HeapHopper, with which they test the malloc implementation in glibc. The tool is built using the Angr framework, and similarly to our research they utilize the dynamic symbolic execution (DSE) engine to generate bug-inducing input. Their implementation is focused on finding possible logic bugs that can be introduced into a binary through the use of glibc. They create models of how to interact with the heap, i.e. instructions that tell HeapHopper how the heap can be manipulated in an attempt to generate a bug. HeapHopper generates small binaries on which these models can be run to find bugs. In our research we instead “model” possible ways to generate a DF bug and try to trigger it while following the traces through the binary.

Driller is built upon Angr and seems to have really kickstarted the competition for creating better symbolic executors and hybrid fuzzers. Initially, when we started our project, we were intending to work with Driller as the basis for our thesis. Later we found QSYM, which superseded Driller, making it our preferred choice.

3.3 Hybrid Fuzzing

Driller [32], a hybrid fuzzer developed by the creators of Angr, showed great results and was used in the DARPA CGC 2016 [11]. It uses Angr for symbolic execution and AFL as its fuzzing component. The main focus of the research was to evaluate the ability of a symbolic execution engine like Angr to offload a state-of-the-art fuzzer, while not caring as much about when to use which component. After the fuzzing component has not made progress for a certain period of time, the system considers the fuzzer as “stuck” and switches to Angr, its symbolic execution engine.

QSYM is the hybrid fuzzer that we are using as the basis for our tool. In their research, the authors have built a symbolic engine that uses Intel PIN [16], a dynamic binary instrumentation tool. They use libdft [15], a data flow tracking library, to track the data flow of the binary being analysed. One of the main takeaways from their paper is that most current emulators are not built for hybrid fuzzing. They argue that the emulation in frameworks such as Angr, which uses VEX-IR [31], makes the process slow. VEX-IR is a well-established intermediate representation (IR) created for the binary instrumentation framework Valgrind [35]. An IR is mainly used to simplify the emulation and also enables compatibility across multiple architectures, but at the cost of performance. By using a tool like PIN they get instruction-level instrumentation and much better performance, at the cost of having to put more effort into integration. As their fuzzing component, they use AFL, which likely was the best choice at the time of their research. In their results, they show that QSYM performs on par with or better than Driller in the majority of tests. They also show that their tool scales well on bigger binaries.

Digfuzz [39] is focused on a subproblem for hybrid fuzzers, namely the heuristics for when to switch from the fuzzing component to the symbolic executor. Zhao et al. [39] argue that Driller makes assumptions that are oversimplified. Instead of regarding the fuzzer as “stuck” after a certain period of time and switching to symbolic execution, the makers of Digfuzz use probabilistic path prioritization. They designed a novel Monte Carlo-based model that quantifies each possible execution path's difficulty based on metrics collected during fuzzing and then prioritizes paths based on those values. The main focus is to create a lightweight instrumentation during fuzzing, making it possible to weigh the difficulty of the constraints found in the binary. This information is then used to feed the symbolic engine with seeds accordingly. To weigh the constraints they measure how often each branch is reached, isolated from the rest of the program. By multiplying the probabilities of the nodes in the chain of constraints they get the probability of a path. After a tree of branches has been constructed, they prioritize missed branches based on the probability of reaching that state in the code and feed them to the symbolic engine. In their experiments, they show great performance compared to the hybrid fuzzers that were state of the art at the time.
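As a hypothetical illustration of this weighting: if the three branches along a path were estimated during fuzzing to be taken with probabilities 0.5, 0.1, and 0.01, the path as a whole would be assigned probability 0.5 · 0.1 · 0.01 = 0.0005, and branches missed along such hard-to-reach, low-probability paths are the ones handed to the symbolic engine first.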

3.4 Symbolic Memory

Modeling symbolic memory has always been a problem when working with symbolic execution. Arbitrary input of arbitrary size, stored at an arbitrary memory location, quickly discards the possibility of modeling memory as a linear array of addresses. For example, how do you model the memory when the user is allowed to store a string of arbitrary size in memory? Or how do you model multiple symbolic loads efficiently? Cha et al. [5] propose an indexed-memory model that allows for storing and loading symbolic values in memory. Angr implements this method, allowing for both symbolic loading and storing in their framework.

In QSYM the authors have optimized the memory model by concretizing symbolic pointers when they are dereferenced. Different symbolic executors have different heuristics for deciding when and what to concretize. Like Angr, QSYM uses the indexed-memory model proposed by Mayhem, although Yun et al. have opted for a more restrained modeling than Angr, with stricter concretization.

In our project, we did investigate the possibility of full symbolic memory modeling similar to what is proposed by Coppa et al. [6]. In their paper, they propose a different algorithm for modeling the memory during symbolic execution. They add a layer above the indexed memory, which they call symbolic memory. This memory contains lists of memory objects, where each object maps a symbolic address expression to an object stored at that location in memory, constant or symbolic. When building symbolic expressions during store and load operations, their algorithm generates an if-then-else (ITE) chain. Each statement of the chain contains a condition based on either a symbolic or a constant address. If the condition is met, the object mapped to it is returned; if not, the solver continues through the ITE chain until a condition is satisfied. This is further explained under Section 4.2.3. Their implementation, called MemSight, is implemented as a plugin to the Angr framework. In their experiment, they show great improvement compared to Angr's implementation of fully symbolic memory [6]. Because of QSYM's closer integration with the native execution of the binary and its strict concretization, we would have needed to do a major restructuring of QSYM's implementation to be able to test this memory model.

Chapter 4 Method

The initial step in addressing the research questions was a literature review of the current state of the art in automated binary analysis. This led us towards the idea that we can improve the detection of DF bugs by instrumenting free and malloc and forcing bugs based on exploit models. To do this we also needed to extend QSYM's capability to handle symbolic loads. To test our hypothesis, we implemented a QSYM module that was later evaluated against vanilla QSYM on two different datasets.

4.1 Current State of the Art

A literature review has been done to understand the current state of the art of automated binary analysis. This has enabled us to verify that the research gap identified in the early stage of the project indeed exists. It also builds a good ground for the choices made in the implementation done in this project. As we encountered a lot of different terminology, it is hard to use one or more search strings to get a good overview of the current state of hybrid fuzzing. We have therefore chosen to use a snowballing approach for gathering resources. As a basis, we have used some of the more important papers, i.e. papers that have been cited by many, and papers that have significance to our work:

• SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis [30]

• Unleashing Mayhem on Binary Code [5]

• Driller: Augmenting Fuzzing Through Selective Symbolic Execution [32]

We have primarily used BTH Summon and Google Scholar as our search engines for finding papers. But we have also utilized general web search, as there are products out in the wild that do not originate from scientific papers but are utilized by the community.

We opted to use QSYM, a tool built by Yun et al. We chose this tool because, according to their results, QSYM works more efficiently than other similar tools. The tool is focused on x86 binaries, which is the architecture we have chosen to work with. During the literature review, we found a plethora of different tools that were good candidates to build our thesis on. The primary candidates were Driller and QSYM, but it is also worth mentioning DigFuzz and MemSight [6] as they are similar or relevant. As is mentioned under Section 3.4, we did investigate the possibility of using a memory model similar to what is proposed by Coppa et al. [6]. This model was not used but has been inspirational when implementing the model used for our experiments.

Driller is built by the authors of Angr and is therefore built upon that framework. Angr uses the VEX-IR as an abstraction to be able to handle all underlying architectures, which makes the tool flexible. A side effect of this is that it makes the tool rather coarse-grained in its instrumentation, which in turn gives much overhead. QSYM uses a different approach, which makes its implementation less flexible. Instead of using an intermediate representation, they analyse assembly instructions directly, which means that instrumentation is done at instruction level instead of basic block level. To do this, the authors use a dynamic binary instrumentation tool called PIN that is developed by Intel. They instrument each assembly instruction they deem interesting to their analysis. By doing this more fine-grained instrumentation they only run extra instructions when needed, thus making their symbolic engine faster.

We chose to use QSYM because of its much faster symbolic execution engine and its focus on hybrid fuzzing. Because our tool tries to force bugs through symbolic execution, it needs to run the symbolic execution engine relatively often in binaries with much dynamic memory allocation. Because of their faster execution, the QSYM authors have also chosen not to take snapshots of the program state. Optimization through snapshots makes the execution run faster when it is possible to reuse already run paths, but when running hybrid fuzzing you always follow the program trace created by the fuzzer, which can make snapshotting superfluous. Because we chose to focus on Linux x86 binaries, the strength of the VEX IR used by Driller would not give any real advantage, except that we would have fewer instructions to handle. DigFuzz is excluded as a possible hybrid fuzzer for the same reason as Driller, as they are both built using Angr, meaning they share similar weaknesses.

4.2 Implementation

In this section, we describe how we implemented the functionality necessary to track parts of dynamic memory with the QSYM stack.

4.2.1 DF Detection Algorithm

The primary part of the implementation hooks the functions malloc and free that are located in glibc. To do this we built a dynamic binary instrumentation tool using Intel's PIN. The tool monitors freed and allocated chunks of memory, while simultaneously using constraint solving to try and force the binary into freeing already deallocated memory whenever a pointer that is influenced by user input is passed to free. The fundamentals of this are shown in Algorithm 1. The algorithm attempts to force an equality between a pointer passed to free and any pointer that is present in the list of already freed pointers. It also utilizes Z3's push and pop functionality (which can be used to save and restore a formula when extending it) between adding the strict equality constraints.

Algorithm 1: Fundamentals of the free hook function that attempts to force a DF bug. Surrounding details, such as adding a new pointer to the free list, have been omitted.

F := free list, containing expressions of pointer values addressed to chunks that previously have been freed, but not yet reallocated (empty before the first free has been invoked)
S := Z3 solver
p := pointer passed to free, as an expression. May be symbolic or constant.

function BEFORE_FREE(p):
    for f ∈ F do
        S.push_state()
        S.add_constraint(f == p)
        if S.satisfiable() then
            save_input_case()
        end
        S.pop_state()
    end
    return
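A minimal C++ sketch of this hook, using the push and pop calls of Z3's C++ API in the same way Algorithm 1 uses push_state and pop_state (save_input_case is a hypothetical stand-in for QSYM's seed generation):

    #include <vector>
    #include <z3++.h>

    // Hypothetical helper that turns a satisfying model into an input seed.
    void save_input_case(const z3::model &m);

    // Before every call to free(p): try to force p to alias a chunk that is
    // already on the free list, under the path constraints held by the solver.
    void before_free(z3::solver &solver,
                     const std::vector<z3::expr> &free_list,
                     const z3::expr &p) {
        for (const z3::expr &f : free_list) {
            solver.push();                    // save the current formula
            solver.add(f == p);               // demand a double free
            if (solver.check() == z3::sat)
                save_input_case(solver.get_model());
            solver.pop();                     // restore the formula
        }
    }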

Optimistic Solving

Optimistic solving is a pragmatic approach to help ease the burden of the constraint solver when working with complex binaries. It involves relaxing over-constrained expressions by removing older constraints, and is a way to lower the number of false negatives. This brought QSYM good results, and the reason for its success is explained in the next two paragraphs.

A powerful concept from the QSYM research is opting for re-execution rather than snapshotting. QSYM is fed a trace from AFL and follows it while generating new input seeds for each branch transition that is not present in the bitmap. When QSYM reaches the end of the binary it terminates and starts over fresh with a new trace. Their reason for doing this is primarily to not have the symbolic execution component bottleneck their fast fuzzing component. This way of re-executing is efficient much thanks to QSYM's strict concretization strategy.

Concretizing expressions too strictly, however, will have the constraint solver render many expressions as unsatisfiable when they should, in fact, be theoretically satisfiable. QSYM combats these potential false negatives with a pragmatic approach, namely optimistic solving. As the name suggests, optimistic solving involves temporarily dropping all older constraints, in the hope of removing harshly introduced concretizations. This, in turn, introduces many false positives rather than false negatives, but the fast fuzzing component can quickly validate which cases are false positives. This approach has been adopted by our DF detection algorithm as well. The constraint-dropping step was added to extend what is shown in Algorithm 1 and is done if satisfiability cannot be achieved right away.
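A sketch of the idea, again against Z3's C++ API (the function and the data it receives are hypothetical simplifications): if the full formula cannot be satisfied, the older path constraints are dropped and only the new condition is checked, leaving the fuzzer to weed out the resulting false positives.

    #include <z3++.h>

    // Check 'condition' together with the accumulated path constraints; if that
    // fails, retry optimistically with the condition alone.
    bool check_optimistically(z3::context &ctx,
                              const z3::expr_vector &path_constraints,
                              const z3::expr &condition) {
        z3::solver full(ctx);
        for (unsigned i = 0; i < path_constraints.size(); i++)
            full.add(path_constraints[i]);
        full.add(condition);
        if (full.check() == z3::sat)
            return true;                       // solvable without relaxing anything

        z3::solver optimistic(ctx);            // drop all older constraints
        optimistic.add(condition);
        return optimistic.check() == z3::sat;  // may well be a false positive
    }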

4.2.2 Libdislocator

Because DF bugs are not always detected by the libc library, we used a library called libdislocator, a drop-in replacement for the libc allocator that is more sensitive to errors. With libdislocator it is impossible to trigger a DF without getting an error. Because the tool tries to find bugs that do not always trigger exceptions, using libdislocator was a crucial part of the implementation. The library can be loaded together with AFL, and by doing so the tool can verify inputs generated by the optimistic DF solver. Libdislocator is used during the context-sensitivity experiment that is explained under Section 4.4.1 but is not used for the scalability experiment that is explained under Section 4.4.2.

4.2.3 Symbolic Load

To maintain a lean symbolic execution engine, QSYM optimizes dereferencing instructions through a rather coarse simplification: if the address of the instruction is symbolic, the tool reduces the expression to one single possible value (also called concretization). Because of this, we have implemented symbolic load functionality in the tool. We present a small example followed by how symbolic loading is implemented.

Let's say we have a binary where the user can interact with an array of objects. The objects are stored in dynamic memory and referenced through the array. The user can create and destroy objects by selecting an index in the array. In Figure 4.2 we show how this could look in memory. When encountering a symbolic load, the tool generates an ITE expression similar to what is shown in Figure 4.1. For each possible index in the array, a condition is built mapping the index to an address; if the condition is met, the dereferenced object is returned, if not, the next condition in the chain is tested. In this example, QSYM would simplify the expression to the lowest possible value in the array, meaning it would always return 0xfdaf6010. For QSYM this is a reasonable trade-off because its main purpose is to gain good coverage, and solving all possible values can be a demanding task if there are many possible solutions. In our implementation, however, a full or partial simulation of dereferencing was necessary. We have taken inspiration from Angr and how they handle symbolic loads, as they use a well-established approach that was first proposed by Cha et al. [5]. This structure is not very efficient, but as the order of dereferenced objects is not necessarily linear (in memory), it is difficult to specify in another way.
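The sketch below shows how such an ITE chain can be assembled with Z3's C++ API (the slots container and the 64-bit widths are assumptions made for the example, not QSIMP's actual data structures):

    #include <cstdint>
    #include <utility>
    #include <vector>
    #include <z3++.h>

    // Build an ITE chain mapping a symbolic 64-bit index to the value stored in
    // the corresponding array slot, as in Figure 4.1. 'slots' pairs each concrete
    // index with the (possibly symbolic) 64-bit expression stored at that position.
    z3::expr symbolic_load(z3::context &ctx,
                           const z3::expr &index,
                           const std::vector<std::pair<uint64_t, z3::expr>> &slots) {
        z3::expr result = ctx.bv_val(0, 64);   // fallback if no index matches
        for (const auto &slot : slots) {
            z3::expr cond = (index == ctx.bv_val(slot.first, 64));
            result = z3::ite(cond, slot.second, result);
        }
        return result;
    }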

Figure 4.1: Example of how an ITE chain could look.

Figure 4.2: Showing the difficulty of efficiently modelling dereferencing.

4.3 Dataset

For the experiments, we chose to use two datasets. The first measures QSIMP's ability to solve different types of DFs, and the second is used to measure the scalability of the tool. Most of the results are logged by AFL: statistics such as crashes, number of paths, and hangs. Logging of solving time was also added to be able to see how much the results from our tool differ from those of the original QSYM tool.

The first dataset consists of ten binaries: two baseline examples, five examples containing context-sensitive problems, and three examples containing problems that are either hard or impossible to trigger for both tools. Every example program contains one bug, except for Example 3, which has zero bugs. The source code of the example files can also be found on GitHub [24].

• Examples 0 and 1 are sanity checks, testing that the solver engine and the fuzzer can employ their individual strengths. Their code follows the structure shown in Listing 1.

• Example 3 does not contain a bug. Instead, it has what could be a false positive.

• Examples 2, 4, 6, 7, and 9 are tasks involving context-sensitive bugs. These are designed to be difficult for a symbolic solver without proper symbolic memory modeling.

• Examples 5 and 8 are tasks, also containing context-sensitive bugs, that should be unsolvable for both QSYM and QSIMP; however, they are solvable through fuzzing.

• Examples 2 through 9 roughly follow the structure shown in Listing 2, with some minor changes between them. For example, Example 6 reads two input variables instead of one and triggers a bug if both variables have a specific value.

The first baseline example is meant to test only the symbolic engine, which has to solve a problem that would take the fuzzer a long time to solve. The second baseline example contains elements difficult for both the fuzzing component and the symbolic engine, testing the synergy between the two.

The second dataset is the LAVA-M dataset, which is used to test how well the tool handles scaling up to larger, more complex binaries. The dataset contains four Linux binaries - base64, md5sum, who, and uniq - with several purposely injected bugs at different depths. To clarify, the binaries from this dataset do not necessarily contain DF bugs. Experimentation on this dataset is used to measure how a slightly more fine-grained handling of symbolic pointers can slow down QSYM's original performance. The LAVA-M dataset is created using a tool built by Dolan-Gavitt et al. [8], by injecting bugs at so-called Dead, Uncomplicated, and Available data (DUAs). The tool uses data flow tracking to find paths of the code that contain “dead” data, meaning that changing the input of such data does not interfere with the control flow of the program.

4.4 Evaluation

To evaluate how well the tool performs, we ran it on the two datasets for a certain amount of time. During the experiments we measured how fast bugs are found and the number of bugs found by QSYM compared to QSIMP. We also measured how long each emulation takes and how much of the time was spent solving constraints.

The experiments were run with the simplest full stack possible. QSYM is run through a Python script polling the AFL output directory for seeds, as described by the authors of QSYM. AFL is run with two instances, one master and one slave. The only difference between running AFL in master or slave mode is that the former applies some more advanced input-seed modifications than the latter. This is the way that the authors of QSYM ran it, and it is likely going to be run similarly in a production environment, the only difference being more slave instances. However, our primary goal was to test the symbolic engines of QSYM and QSIMP, rather than the fuzzing component, so the specific configuration of AFL was not given as much thought. Only two instances of AFL are needed to quickly verify the results generated by QSYM, but we still wanted to run both a slave and a master as this is a more realistic setup than running a single instance.

4.4.1 Context-Sensitive Binaries Experiment

First, we ran the experiment that targets the 10 example files (the DF dataset) and tests how well the tool handles different types of DFs. This experiment should give a fast way to verify whether the tool can identify various origins of potential DF vulnerabilities and test how consistently it can do so. In this dataset, we also included some binaries that our implementation cannot handle the way we would like, for example when a symbolic pointer is concretized due to a symbolic write. We included these to see if there is any significant difference in run times and time spent by the SMT solver compared to the simpler cases. Because AFL requires a valid input, we supply a file containing the binary representation of the integer 0. The same input file is supplied to all example files. When using AFL in the real world, the contents of the input seed largely depend on what is being fuzzed.

Each example was run until the planted bug was found. This was done five times per binary for both QSIMP and QSYM. The reason for testing every binary multiple times is that a fuzzer can, largely depending on chance, detect a bug almost instantly or take a long time. By running each binary five times, until the bug is found, we can see how big the difference in average time until finding a bug is, and how much the time deviates between runs.

4.4.2 Scalability Experiment

Secondly, we ran the experiment testing scalability. By scalability we mean the tool's ability to maintain its practicality as the program analysed becomes bigger and more complex. Each of the four binaries from the LAVA-M dataset [8] was tested for five hours, as the authors of QSYM did in their experiment. The number of errors found was measured during the tests, as well as how fast the tools find the vulnerabilities. QSYM was run alongside QSIMP both to compare the two tools and to verify that our results are plausible compared to those given by Yun et al. [36]. Because we wanted to be able to verify our results against the results of the original QSYM paper, we opted to run the LAVA-M binaries with compile-time instrumentation, as they do. The main target of this project is being able to analyse COTS binaries, but as mentioned on AFL's homepage [38], the main difference between running instrumented binaries and running them through an emulator such as QEMU is a performance penalty of 3-5 times when running in QEMU. We argue that, as we are comparing instrumented to instrumented, we should be getting similar results, only with faster exploration.

Table 4.1: Specs for Virtual Machine

Component          Specs
CPU                4 cores, 3.4 GHz
RAM                16 GB
Disk               40 GB
Operating System   Ubuntu 16.04

Table 4.2: Specs for server hosting virtual machine

Component          Specs
CPU                AMD Ryzen Threadripper 1950X, 16 cores, 3.4 GHz
RAM                Corsair Vengeance 64 GB, 2666 MHz
Disk               4x Kingston SSDNow A400 in RAID 5
Operating System   Debian (Proxmox)

Similarly to the context-sensitivity experiment, we need to supply a valid input seed for AFL to start mutating from. For the four LAVA-M binaries we supplied the following:

• base64: The string “test” encoded with base64

• md5sum: The output from hashing the string “test”

• uniq: A single line of text

• who: A copy of the file /var/run/utmp

We wanted to give the simplest possible inputs that are still valid, because we want to let the symbolic engine do the work of gaining coverage. As long as both tools use the same input seed there should not be any relevant difference. A sketch of how these seeds could be produced is shown below.
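The following snippet is a minimal sketch of how such seeds could be produced. The output file names are illustrative, and the exact md5sum seed layout (digest followed by a file name) is an assumption; only the seed contents described above matter.

```python
import base64
import hashlib
import shutil

# Illustrative seed-directory layout.
with open("seeds/base64_seed", "wb") as f:
    f.write(base64.b64encode(b"test"))                   # "dGVzdA=="

with open("seeds/md5sum_seed", "w") as f:
    # Assumed format: mimic md5sum's "<digest>  <filename>" output line.
    f.write(hashlib.md5(b"test").hexdigest() + "  test\n")

with open("seeds/uniq_seed", "w") as f:
    f.write("a single line of text\n")

# who reads the binary utmp database; reuse the one on the host.
shutil.copy("/var/run/utmp", "seeds/who_seed")
```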

4.5 Equipment

The experiments were run using a virtual machine from TrueSec [34]. The specifications for the virtual machine can be found in Table 4.1 and the specifications for the host server can be seen in Table 4.2. The reason for using Ubuntu 16.04 is that it runs an older kernel version; the Pin version used in QSYM and QSIMP does not support newer kernels. We could have tried porting the tool to a newer version, but as this was not relevant to our thesis we opted not to. The rig used in the original QSYM experiment was equipped with 256GB RAM, but we did not have access to a machine with the same amount of memory. To mitigate the risk of this invalidating our results, we ran both QSYM and QSIMP on our rig and compared the results to those from the original QSYM paper to check whether ours are reasonable. The test suite, similarly to QSYM, consists of three components – the AFL master node, the AFL slave node, and QSYM. Each of the three parts had its own processor core to work with.

Chapter 5 Results and Analysis

In this chapter, we present and analyse the results from our two experiments.

5.1 Context-Sensitive Bugs Experiment

A total of nine programs containing one crash-inducing bug each were tested in this experiment. Each program was tested five times until the bug was found, and the average time spent finding each bug is presented in Table 5.1. The time distribution over test runs is plotted in Figure 5.1, with the dots in the plots marking outliers. Some randomness is involved due to the nature of fuzzing, which means that for some programs the time to find a bug deviates considerably between runs. When a bug can be forced by our implementation, it is found quickly compared to vanilla QSYM, as shown by most of the rows where the solver has been used to force context-sensitive bugs (i.e. where the solver timeshare is above 0%). This is because vanilla QSYM simply keeps fuzzing until it randomly hits the magic number, while our extension uses the SMT solver to calculate it, which can be observed in the increased solving time shown in the rightmost column of Table 5.1. To clarify, this rightmost column shows the percentage of time spent utilizing the Z3 solver in relation to the total emulation time of the symbolic execution component. Looking at the examples containing bugs that neither QSYM nor QSIMP can solve, namely Examples 5 and 8, we see a small increase in solving time for QSIMP. This is a reasonable effect of the added constraint solving introduced by the symbolic loading and the attempted forcing of a DF bug.

5.2 Scalability Experiment

As explained in the method section, the four binaries - base64, md5sum, uniq, and who - have been tested with the QSYM-AFL stack and our implementation, the QSIMP-AFL stack. The four binaries were selected because they were also tested by the authors of QSYM, and we wanted to be able to verify the plausibility of our results against the results presented by Yun et al. [36]. The experiments were run with three processes in total: the symbolic engine QSYM, AFL-master, and AFL-slave. In Figure 5.2 we show the cumulative number of errors found for each binary during the experiment. The errors are verified and logged by AFL, which leads to the results being split between the master and the slave, because the master and slave AFL instances have separate output directories by design.


Figure 5.1: Box plots showing the time spent finding each bug (five runs per example binary) by (a) QSYM and (b) QSIMP, respectively.

Table 5.1: Average time spent triggering each bug by QSYM and QSIMP, respectively. The rightmost column shows the percentage of time spent utilizing the Z3 solver in relation to the total emulation time of the symbolic execution component, for QSIMP. Note that this solving timeshare has been rounded. For QSYM, the solving timeshare was rounded to 0% in each case (not shown in the table).

Crash       QSYM (s)   QSIMP (s)   QSIMP Solving Timeshare
example0    19         21          0%
example1    38         39          0%
example2    3067       11          27%
example4    2006       24          13%
example5    4023       5049        5%
example6    10468      48          30%
example7    1311       14          25%
example8    3315       2623        6%
example9    3079       25          56%

To make the graphs clearer, we have summed the results from the two instances. In each graph, we see the number of bugs found at any given point in time during the five-hour run. The errors shown in the graphs are unique in the sense that AFL identifies them as unique. According to AFL's README, a crash is defined as an execution that receives a fatal signal (e.g. SIGSEGV, SIGABRT, SIGILL). A unique crash is defined as an execution that reaches the crash through a combination of states unique to that crash. This means that multiple crashes can end at the same instruction while having reached it through different execution paths. This experiment was run without libdislocator because it made the bugs contained in the LAVA-M binaries behave oddly; this is further discussed in Section 6.3.4. Every time the symbolic engine is invoked, the total emulation time and the solving time are logged. These have been summed up in Table 5.2, and the percentage of time spent solving in relation to the total emulation time is shown in Table 5.3. Each LAVA-M bug exits the program by sending a signal with a number over 127, and this is how the authors of the LAVA-M dataset verify that such a bug has been triggered. The signal does not identify the specific bug, however, so to count the unique bugs found (Table 5.4) we parse the output from each crash, which prints the number of the triggered bug. For some reason, the who binary would not print these crash outputs. Even though we were still able to verify that the bugs triggered are from the LAVA-M dataset, we were not able to verify that each bug was a unique LAVA-M bug. This is why we do not show this data in Table 5.4.
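As an illustration of this verification step, the sketch below re-runs a crashing input and extracts a bug identifier from the program output. The message format and the way input is fed to the target are assumptions made for the sketch; the LAVA-M binaries each take their input slightly differently, and who, as noted above, prints no identifier at all.

```python
import re
import subprocess

def classify_crash(target_argv, crash_file):
    """Re-run one AFL crash file and report (crashed?, LAVA bug id or None)."""
    with open(crash_file, "rb") as f:
        proc = subprocess.run(target_argv, stdin=f, capture_output=True)
    # A process killed by a signal shows up as a negative return code here,
    # which corresponds to a shell exit status of 128 + signal number.
    crashed = proc.returncode < 0
    # Assumed output format for an injected LAVA bug; adjust to the actual
    # message printed by the instrumented binaries.
    match = re.search(rb"triggered bug (\d+)", proc.stdout + proc.stderr)
    bug_id = int(match.group(1)) if match else None
    return crashed, bug_id

# Illustrative usage (paths are placeholders): collect unique bug ids for base64.
# crashes = glob.glob("/tmp/afl-sync/afl-master/crashes/id:*")
# unique_ids = {classify_crash(["./base64", "-d"], c)[1] for c in crashes}
```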

5.3 Analysis

In this section, we comment on how the results from each experiment can be interpreted, and mention some details that stand out. Finally, we briefly discuss what may be concluded from the results of the two experiments in combination.

Figure 5.2: Errors found by QSYM versus QSIMP for (a) base64, (b) md5sum, (c) uniq, and (d) who.

Table 5.2: Emulation time (E) and solve time (S) for QSIMP and QSYM, in seconds. * When testing md5sum, QSYM ran out of RAM after 1 h 20 min, which is why we only show the times up until then; the results should be read with this in mind.

Binary     E QSIMP (s)   S QSIMP (s)   E QSYM (s)   S QSYM (s)
base64     17116         5896          12390        225
md5sum*    4663          673           4603         207
uniq       1993          280           2053         97
who        17786         10198         17629        2734

Table 5.3: Percentage of time spent solving in relation to emulation time. * QSYM ran out of RAM after 1 h 20 min, which is why we only show the times up until then; the results should be read with this in mind.

Binary     QSIMP S/E   QSYM S/E
base64     34.45%      1.82%
md5sum*    14.43%      4.50%
uniq       14.05%      4.72%
who        57.34%      15.51%

Table 5.4: Unique LAVA-M bugs found in each binary by QSYM and QSIMP, and the number of bugs listed by the authors of LAVA-M. * There are also some unlisted bugs injected into the binaries; they are unlisted because the authors of LAVA-M were not able to trigger them themselves [18].

Binary     QSYM   QSIMP   Listed
base64*    48     46      44
md5sum     47     45      57
uniq       28     28      28

5.3.1 Context-Sensitive Bugs Experiment

The results from the context-sensitive binaries experiment show that, in most of the examples containing context-sensitive variables, QSIMP can consistently force the DF bugs planted in the binaries. This consistency can especially be seen in Figure 5.1, as the times spent by QSIMP to find the majority of the bugs barely deviate compared to QSYM. By the same reasoning, the randomness of fuzzing can be seen when looking at most examples under QSYM in the same figure – the times spent by QSYM to find most bugs deviate noticeably, and a clear outlier can even be seen in Example 6 (QSYM box plot). The results are expected, as QSIMP is designed to be able to solve most of the examples using constraint solving. In Examples 5 and 8 we also see expected behaviour: the two binaries contain context-sensitive elements, but because of QSYM's concretization heuristics, neither QSYM nor QSIMP can solve the task at hand. A similar situation is presented in Example 7, where the symbolic engine makes parts of the expression concrete to simplify expression solving. Here, however, we get help from utilizing a similar approach to QSYM's, optimistic solving, which makes it possible to solve the example. The side effect is that the tool sometimes generates false positives, which is mitigated by AFL verifying each case through fast native execution. Looking at Table 5.1 it is clear which binaries need fuzzing to solve the problem, as these take noticeably longer than those where the bug can be forced using the constraint solver. None of the binaries in this experiment are big enough to slow down the symbolic engine in any noticeable manner. In Figure 5.1a, Example 6 shows the binary with the most deviation. This is because, unlike the other example files, its bug depends on two input variables instead of one; a DF vulnerability only occurs if the two variables point at the same address. This example stresses how much uncertainty can be involved in fuzzing, with results spanning from 32 seconds to 29283 seconds.

5.3.2 Scalability and Efficiency Experiment

From the LAVA-M binary results (Figure 5.2), we see that QSIMP is a little slower than QSYM in detecting errors. Not by much, but it can clearly be seen when looking at base64 and who in Figure 5.2. By comparing the solver usage between QSYM and QSIMP (Table 5.3), we see that there may be a correlation between solver usage and the decreased performance seen in Figure 5.2. It is most distinct in the base64 case, where the solver is used roughly 26 times more, as can be seen by comparing the solver times between QSYM and QSIMP in Table 5.2. Both the increase in solving time and the performance impact are expected results, as the essence of our implementation involves utilizing the solver to further analyse user input. In Table 5.2 we see that both base64 and uniq finish emulating well under the 5-hour limit. This is because the tool has finished exploring the binary, and therefore AFL is no longer feeding QSYM or QSIMP with more paths to explore. For who the time is 200-400 seconds under the 5-hour limit, which is only an effect of minor downtimes caused by communication between the different components. For md5sum, however, this is not the case, as there was a memory error in QSYM, leading us to only compare the two for the first 1.5 h of the test. Another interesting statistic can be seen in Table 5.2. Looking at uniq, we see that the increased solver time does not always add to the total emulation time in equal proportion (e.g. an increase of 300 seconds in solving time does not necessarily imply an increase of 300 seconds in total emulation time). In the uniq case, QSIMP has a roughly 3 times higher solving time, but a lower total emulation time. This could mean that the more verbose expressions generated by QSIMP can sometimes assist the symbolic executor in gaining code coverage.

5.3.3 Combined Result Analysis

Original QSYM is slightly faster at exploring a binary, as shown by the scalability experiment results in Figure 5.2. This is expected, as our implementation adds some new instrumentation steps to force bugs, and the slowdown is most likely a direct effect of the symbolic load. However, the results in Table 5.1 give an idea of how much time could potentially be saved when utilizing an SMT solver not only to increase code coverage but also to look for context-sensitive bugs. The coverage-guided nature of QSYM can easily let this type of bug go undetected for a long time, as e.g. simply fuzzing a 32-bit integer can mean testing 2^32 possible values. Apart from triggering certain bug scenarios faster, QSIMP shows time consistency in these cases. We argue that these results show how more bugs may be found when the solver is used for more than coverage, without significantly throttling the performance of the original implementation in most cases. The exceptions are base64 and who, where we can see a distinct difference in performance between QSYM and QSIMP. The exploration performance in these two binaries is impaired, but not excessively. In the base64 case, there seems to be a correlation between the increased solver time and the number of bugs found per second. We suspect the significant increase in solving may be due to a high number of user-input-based table lookups (due to the nature of decoding base64).

Chapter 6 Discussion

In this chapter we summarize what has been done throughout the project and discuss the results from the experiments, first separately and then combined. Finally, we discuss the limitations of the experiments.

6.1 Conducted Research

The purpose of this project was to fill a research gap that we saw in the current state-of-the-art hybrid fuzzers: today they utilize constraint solving only for gaining coverage in the binary. Although coverage is a big part of finding many bugs, this leaves much of the job to the fuzzer, which is by nature somewhat random, meaning that a lot of bug finding is left to coincidence. Our thesis states that by using knowledge about how certain bugs are triggered (DF vulnerabilities in our case) and with more fine-grained instrumentation, it is possible to force-trigger a bug by dynamically calculating a bug-inducing value during execution. During the project, we further developed the hybrid fuzzer system QSYM to assess whether it is possible to utilize constraint solving to trigger context-sensitive bugs, and to do so in an efficient and scalable manner, as stated in the research questions. Doing so required slightly extending the binary instrumentation to handle symbolic pointers such as symbolic array indices.

6.1.1 RQ1: DF Triggering Using SMT

Our results showed that, by utilizing an SMT solver, DF bugs that are not affected by QSYM's optimizations (e.g. pointer concretizations during a symbolic store) can be forced with great speed compared to finding them with fuzzing. In the context-sensitive binaries experiment, we can see a significant improvement in both speed and accuracy of bug detection. It is important to note that this experiment was designed to test how well QSIMP handles context-sensitive binaries, which is something QSYM is not designed to handle by means other than fuzzing. Looking at Table 5.1 we can see that, by utilizing constraint solving to trigger bugs with known behaviour, we can sometimes trigger them over 100 times faster. It can be argued that this type of comparison cannot be made because fuzzing involves a lot of randomness, but what sticks out here is the low variability, which becomes apparent in Figure 5.1. We believe that the DF detection algorithm described in Section 4.2.1 and the aforementioned results answer the first research question.
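To make the idea behind the forcing step concrete, the z3py sketch below shows the kind of query involved. It is an illustration only, not QSIMP's actual code (which works against Z3 through its native API), and all names and constants are placeholders.

```python
from z3 import BitVec, BitVecVal, Solver, sat

def try_force_double_free(ptr, path_constraints, freed_chunks):
    """Ask whether the pointer handed to free() can alias an already-freed chunk."""
    for chunk in freed_chunks:
        s = Solver()
        s.add(path_constraints)   # constraints collected along the current path
        s.add(ptr == chunk)       # can the input steer ptr onto this freed chunk?
        if s.check() == sat:
            return s.model()      # a model yields a concrete bug-triggering input
    return None

# Placeholder state: a symbolic pointer, one path constraint, one freed chunk.
ptr = BitVec("ptr", 64)
path_constraints = [ptr >= 0x1000]
freed_chunks = [BitVecVal(0xdead0000, 64)]
print(try_force_double_free(ptr, path_constraints, freed_chunks))
```

Iterating over freed chunks with one query each is the simplest formulation; Section 7.2.1 discusses batching such checks into a single disjunction.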


The symbolic load functionality may have performance-decreasing effects that ripple throughout the execution, meaning that instructions following a symbolic load also have to handle expressions resulting from previous load instructions. The part that tries to force DF vulnerabilities, however, is not very invasive, as it does not have these ripple effects on successive program instructions: DF forcing is an event isolated to the malloc and free routines in the analysis. Looking at the last column in Table 5.1, we see that to offload the fuzzer we invoke the constraint solver; by comparison, in QSYM the solver is never invoked for more than 1% of the total emulation time. So instead of fuzzing for hours, QSIMP tries to force the bug using the symbolic engine. This looks very good, but the functionality is useful only if the solution scales well, which we go into further in the next section.

6.1.2 RQ2: Scalability

Testing the scalability of our extension to QSYM, we see that a hybrid fuzzer that is able to force certain bugs could be used practically and pragmatically, as can be seen from the results in Chapter 5. In the experiments we ran, our implementation is only slightly impaired compared to QSYM, with the added detection of DF bugs. The results fit well with what we were expecting. Increasing the soundness of the symbolic execution by adding symbolic loading has a significant effect on the efficiency of the emulation. Angr, for example, handles this by making the expression concrete when it reaches a certain limit of symbolic addresses¹. On the other hand, Coppa et al. [6] show that with a more sound memory model it is possible to find paths otherwise unreachable with more restricted memory modeling. This means that by adding symbolic loading we could in certain cases help further the analysis rather than impede it, although we do not have data backing up this statement. An interesting thing we noted during further analysis of the results is that, when comparing the number of paths explored instead of errors found, QSIMP finds a similar number of unique paths to QSYM in base64, while in the who binary QSYM gains a further lead. This could indicate that in base64 we gain coverage thanks to the symbolic loads, while in who we are only slowed down by the increased solving time. The graphs can be seen in Appendix A.

¹ In Angr's default configuration, expressions are concretized when they reach the range limit of 1024.

6.2 General Discussion

Taking into account the results from the two experiments, we show that while adding functionality to force DF bugs, we still maintain performance that is close to or on par with QSYM. This is done while increasing the soundness of the symbolic execution by adding symbolic load functionality. The symbolic load has performance-impairing effects that ripple throughout the binary, which is likely what causes the slowdown. The magnitude of the slowdown depends on how much the user input interacts with parts of the program that involve symbolic memory and symbolic memory addressing. While the slowdown is noticeable, it does not interfere to the extent we were expecting. This could possibly be explained by the more correct memory model used, which helps further the analysis.
This fits in well with what Cha et al. [5] state in their research: in their experiments they show that 40% of exploits rely on being able to reason about symbolic memory – more exploitability likely means more coverage. It also fits well with what Coppa et al. [6] show in their experiments, namely that in some binaries they can gain further coverage with a more sound model. A counter-argument to the previous paragraph could be that Yun et al. [36] show in their experiments that QSYM outperforms Driller, and Driller does provide a much more sound model of memory, in the same way that Angr does. This comparison is not fair, though, as there are too many outside variables to draw conclusions about the effects of different memory models. For example, Driller uses Angr, which is built for pure symbolic execution and includes features like snapshots and veritesting that provide little or no gain when used in a hybrid fuzzer. Another factor is that Angr uses VEX IR, making the analysis able to handle multiple architectures, which makes it inefficient compared to QSYM when working on x86_64. Because of this, it is reasonable to assume that the improved soundness, as an effect of the symbolic loading, helps to somewhat mitigate the increased overhead caused by it. The symbolic loading leads to ripple effects throughout the binary; however, the DF-triggering part is an isolated event that should slow down the symbolic execution linearly with the number of invocations of free and malloc during execution. The specific bug-forcing part of the analysis (the free function instrumentation) could therefore be seen as relatively cheap.

6.2.1 Further Applicability

An exploitable DF bug could potentially lead to arbitrary code execution within a process [9]. This could, in turn, lead to privilege escalation and sometimes a full system takeover. Moreover, one could also choose to further instrument some machine instructions that involve moving memory around, to investigate the possibilities of performing an out-of-bounds read or write. While concretization heuristics can heavily limit the possibilities of a symbolic expression, more pragmatic approaches, such as optimistic solving, can be used to find previously undiscovered holes in an application, as has been shown in this research and by Yun et al. [36]. We believe that being able to forcefully trigger some DF bugs can serve as a proof of concept for utilizing limited symbolic memory in a hybrid fuzzer to find context-sensitive bugs. An analyst could, for example, specify memory regions of interest in a process and, once a potential vulnerability is found, generate possible addresses in memory that are susceptible to a read or write, to further analyse exploitation capabilities. An out-of-bounds write could overwrite the contents of other data structures, which could lead to new, overlooked program states containing more potential bugs. In the worst case, a malicious user could overwrite a function pointer to achieve code execution. Invoking the solver on every read or write could, however, slow down performance significantly, which is why it is also interesting to discuss when to check for arbitrary reads or writes. Eckert et al. look for arbitrary write primitives during malloc and free operations, as their research involves specifically testing the heap algorithm [10]. The functionality described above is mainly speculation and would, of course, have to be implemented and tested for scalability as well. What we have presented in this research serves only as evidence that a hybrid fuzzer can be properly extended as described in earlier parts of the report.
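In the same speculative spirit, the sketch below shows what an out-of-bounds check on a symbolic write address could look like in z3py. The region bounds, the stand-in path constraints, and the use of unsigned comparisons are illustrative assumptions, not part of QSIMP.

```python
from z3 import BitVec, BitVecVal, Solver, Or, ULT, UGE, ULE, sat

# Analyst-specified region of interest (illustrative addresses).
region_start = BitVecVal(0x601000, 64)
region_end = BitVecVal(0x602000, 64)

addr = BitVec("write_addr", 64)                           # symbolic write address
path_constraints = [UGE(addr, BitVecVal(0x600f00, 64)),   # stand-in path constraints
                    ULE(addr, BitVecVal(0x602100, 64))]

s = Solver()
s.add(path_constraints)
# Out-of-bounds condition: can the write land before or after the region?
s.add(Or(ULT(addr, region_start), UGE(addr, region_end)))
if s.check() == sat:
    print("possible out-of-bounds write at", s.model()[addr])
```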

6.2.2 Implementation Weaknesses

In Example 9, shown in Table 5.1, we see one weakness of our implementation: we do not have any limit on how many addresses can be added to an expression when doing symbolic loads. For comparison, the default range limit on symbolic loads implemented in Angr is 1024 [30]. In Example 9 this shows up as a high solving time compared to the total emulation time. Generally, there are constraints that keep this within reason (e.g. the analysed program makes sure that the user input specifying an array index is limited to the bounds of the array), but if the constraints are very relaxed we risk building very big expressions. Multiple dereferences risk making the memory explode, similarly to what is described by Stephens et al. [32] when encountering interleaved loads and stores. Adding optimizations such as knowledge about which addresses are allocated by mmap², and/or implementing range limitations similar to Angr's, could further improve performance while maintaining the soundness or lowering it only a little (a sketch of such a range-limited load is shown below). This unlimited dereferencing is likely what causes the solver time to explode in the two binaries base64 and who from the LAVA-M dataset as well; the “explosion” in solve time can be seen in Tables 5.2 and 5.3. As our implementation only serves as a proof of concept, we have left this to future work.

² This is only applicable when we are working with heap-related vulnerabilities.
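A minimal z3py sketch of the range-limited load idea follows. It is neither our implementation nor Angr's; the limit, the way candidate addresses are obtained, and the fallback to concretization are all simplified assumptions.

```python
from z3 import Array, BitVec, BitVecVal, BitVecSort, Solver, Select, If, Or

LIMIT = 1024  # range limit in the spirit of Angr's default

def bounded_symbolic_load(solver, memory, addr, candidates):
    """Model a load from a symbolic address over at most LIMIT candidate addresses."""
    if len(candidates) > LIMIT:
        candidates = candidates[:1]                       # too wide: fall back to concretizing addr
    solver.add(Or([addr == a for a in candidates]))       # addr must be one of the candidates
    value = Select(memory, candidates[0])
    for a in candidates[1:]:
        value = If(addr == a, Select(memory, a), value)
    return value

# Illustrative usage: byte-addressable memory and four feasible addresses.
mem = Array("mem", BitVecSort(64), BitVecSort(8))
addr = BitVec("addr", 64)
s = Solver()
loaded = bounded_symbolic_load(s, mem, addr, [BitVecVal(0x1000 + i, 64) for i in range(4)])
```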

6.3 Limitations

In this section, we discuss both physical and implementation-related limitations of our experiments. We also mention some delimitations that were made regarding both the datasets and the use of libdislocator.

6.3.1 Equipment around QSYM

The equipment around QSYM, such as the fuzzer and the operating system we run, has an impact on the results of the analysis, which we explain below. When constructing our experiments we made a few choices and had to accept some limitations related to the equipment and software around the symbolic engine. Although multiple fuzzers have been built since AFL (e.g. VUzzer and AFLFast), most of them build upon AFL. Because we are comparing against QSYM, and AFL is used as its fuzzer, we have chosen to use the same. This limitation is reasonable, as AFL is still a good enough fuzzer, we use it for all experiments, and the main focus is how well the symbolic engine works, not how well the entire stack works. The computer we use is also limited in the amount of volatile memory. We did, however, only hit this limitation for one of the binaries in the scalability experiment (md5sum). We have clearly stated where this is the case and we take it into consideration when analysing the data.


6.3.2 Symbolic Memory

Because the authors of QSYM implement their dereferencing instrumentation with a strict concretization strategy, we had to make some quite invasive changes to the instrumentation of load operations: we are essentially adding symbolic loads to QSYM. Because we have implemented a simplified version of the symbolic load employed in Angr, it is likely not as efficient, but since our goal is to trigger DF vulnerabilities we had no choice but to implement it. Coppa et al. [6] experiment by comparing their implementation, MemSight, with memory models in Angr of different soundness. In their experiments, they see a significant improvement in path exploration for some of the binaries. They only compare symbolic engines, but their results show effects similar to what we have obtained. The way we implement our load operation is a mix of the default behaviour of Angr and Angr with fully symbolic memory. Implementing a fully symbolic memory model using indexed memory in QSYM would not make sense, as it would be very slow, and the system builds upon cooperation between the symbolic executor and the fuzzer; hence, one component should not bottleneck the other too significantly. As Shoshitaishvili et al. [30] explain, fully symbolic loads and stores would lead to a quadratic increase in constraints with repeated stores and loads.

6.3.3 Datasets

We were unable to find a dataset containing known context-sensitive bugs while simultaneously being big enough to draw conclusions about the scalability of the solution. Because of this, we instead chose to split the evaluation into two separate experiments. The LAVA-M dataset was chosen because of its relatively simple setup and reasonable size (four common Linux binaries); it has also been used in much similar research. While we are aware that this dataset does not contain very complex binaries, we argue that with many hard-to-reach bugs planted evenly over the binaries it is a good enough set for validating how well the tool scales compared to QSYM. When choosing a dataset we considered multiple other options. In QSYM, the authors use libpng to measure the code coverage effectiveness of their implementation. We reached out to the authors, as they do not specify which binary they execute this through in their paper, but we did not get an answer. Another idea was to run the binaries used in the DARPA Cyber Grand Challenge, which have been widely used in other publications. This dataset contains a grand total of 131 binaries spanning a range of programs from very simple to complex [36]. We were not able to find any good sources on how they are supposed to be compiled and run on our system; because of this, and time limitations, we decided not to use this dataset. The DF dataset for testing context-sensitive bugs is built by us and is focused on DF bugs. The binaries in the dataset are small, because we are only trying to test whether we are able to trigger these specific bugs. Because the binaries are so small, there is not much going on in between the operation allowing the bug to happen and the actual bug. This is something that could cause the concretization to behave differently, but because of time limitations we chose not to create more complex examples.

6.3.4 Libdislocator

As QSYM and QSIMP use optimistic solving to alleviate their strict concretization strategy, we rely on AFL to verify that the results from the symbolic engine are true errors. Because libc does not trigger an error on all bugs related to DF vulnerabilities, we have used libdislocator, a library that ships with AFL. The purpose of libdislocator is to be much more sensitive to errors than the more speed-focused libc (an example of how it can be enabled is sketched below). During the experimentation, we noticed that libdislocator sometimes makes the programs too fragile: some of the binaries from the LAVA-M dataset crashed on almost any input for no known reason. Because of this, we instead decided to run these binaries with the standard libc library. We argue that this has little to no effect on the correctness of the results, and it gives us more data to look at. If we were to use libdislocator for the LAVA-M experiment, we would not be able to use the who binary; as who contains the most bugs in the dataset, we did not want to exclude it, which is why we did not use libdislocator for the LAVA-M experiment.
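For reference, the sketch below shows one way libdislocator is typically enabled for an AFL run, via the AFL_PRELOAD environment variable; the library path and the afl-fuzz arguments are illustrative.

```python
import os
import subprocess

# Illustrative paths; AFL_PRELOAD asks afl-fuzz to LD_PRELOAD the given library
# into the target process, which is how libdislocator is normally enabled.
env = dict(os.environ, AFL_PRELOAD="/opt/afl/libdislocator/libdislocator.so")
subprocess.run(
    ["afl-fuzz", "-i", "input", "-o", "/tmp/afl-sync",
     "-M", "afl-master", "--", "./example0", "@@"],
    env=env,
)
```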

6.4 Validity Threats

Because the experiment evaluating QSIMP's ability to solve context-sensitive DF bugs is built by the authors themselves, there is a risk that our results from this experiment are biased. We may be missing bugs of certain types that were not thought of during construction. Due to time limitations, we were not able to find more fitting, real-world binaries containing DF bugs to evaluate our implementation on. The fact still stands, however, that QSIMP is able to consistently find some bugs that QSYM has trouble detecting, without losing excessive amounts of performance. The risk of false positives or negatives in the experiment concerning the author-constructed DF dataset is largely mitigated by the fact that we introduced the bugs ourselves; hence, every bug triggered could easily be validated by us. The LAVA-M dataset binaries are, as mentioned before, not the biggest and most complex binaries. We are aware that this might not give the full picture regarding scalability, but because of time limitations we have not been able to do further testing using more complex binaries.

Chapter 7 Conclusions and Future Work

7.1 Conclusion

In this project, we researched the possibilities of extending the use of the SMT solver in a hybrid fuzzer. Instead of using it only for increasing code coverage, the solver is also invoked whenever the analysed program utilizes dynamic memory through the malloc API. To enable the solver's ability to force a DF bug, symbolic dereferencing of memory also had to be implemented. The latter slows down the original performance of QSYM, but not to an inoperable extent according to the experiments conducted in this research. The goal of being able to force context-sensitive DF bugs was partly met, as the tool's outcome depends on which pointers have been concretized (see Section 5.3.1). In binaries with context-sensitive bugs, QSIMP is able to consistently find some bugs that the symbolic execution engine of the original QSYM is unable to detect. QSIMP is less efficient at exploring binaries compared to QSYM; according to our results, there seems to be a correlation between how much of the emulation time is used for constraint solving and the amount of coverage gained. The correlation is not linear, however, indicating that a more sound symbolic engine could make up for some of the time lost to the heavier solver usage (see Section 5.3.2). While DF is just one of many bugs in the vast and complex wilderness of software bugs, we aim to show the strength of feeding the model of a potentially vulnerable scenario to an SMT solver. Further discussion on the applicability of this is found in Sections 6.2.1 and 7.2. Finally, according to the results, QSIMP can find more bugs than QSYM, at a slightly higher cost in terms of time and memory.

7.2 Future Work

During the implementation and experimentation of QSIMP, we made some interesting observations and had some ideas for possible optimizations. Here we present what we would like to investigate in the future.

7.2.1 Further Optimizations

During discussions about possible solutions for big memory indexes, we played with the idea of instrumenting the function mmap. By recording which virtual address
spaces are allocated to the heap, it may be possible to narrow down the possible values for symbolic addresses. The way that Algorithm 1 is implemented could possibly be optimized by utilizing Z3's “Or” expressions, to avoid iteratively calling the solver; a sketch of this idea is shown below. This is something we realized after experimentation and therefore, because of time limitations, it has been left as future work.
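The snippet below illustrates the difference in z3py terms. It is an illustration of the idea rather than Algorithm 1 itself, and the symbolic pointer and freed-chunk addresses are placeholders.

```python
from z3 import BitVec, BitVecVal, Solver, Or, sat

ptr = BitVec("ptr", 64)                                   # symbolic pointer passed to free()
freed_chunks = [BitVecVal(a, 64) for a in (0x1000, 0x2000, 0x3000)]

# One query with a disjunction, rather than one check() call per freed chunk:
# can ptr alias *any* freed chunk under the current constraints?
s = Solver()
s.add(Or([ptr == chunk for chunk in freed_chunks]))
if s.check() == sat:
    print("double free reachable, e.g. with ptr =", s.model()[ptr])
```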

7.2.2 Other Interesting Ideas

The algorithm and memory model (MemSight) that Coppa et al. explain and test in their publication [6] have some interesting optimizations that, compared to Angr, show promising results in both effectiveness and path exploration. Because QSYM utilizes the real memory of the process being analysed, which is an ingenious move, implementing MemSight in its symbolic engine becomes a hefty task; MemSight relies on the indexed memory model being the lowermost layer of the modeling, which is not the case for QSYM. We also implement the functionality for forcing DF bugs where it is possible. This functionality would be interesting to implement for other similar bugs, such as trying to overwrite memory in data structures or, worse, function table pointers. Since the Z3 constraint solver is relatively fast in our experience, it should be possible to implement multiple checks at certain points in the code to try to force bugs such as the aforementioned.

Bibliography

[1] T. Avgerinos, A. Rebert, S. K. Cha, and D. Brumley. “Enhancing symbolic execution with veritesting”. In: Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). Hyderabad, India: Association for Computing Machinery, May 31, 2014, pp. 1083–1094. isbn: 978-1-4503-2756-5. doi: 10.1145/2568225.2568293. url: http://doi.org/10.1145/2568225.2568293 (visited on 04/09/2020).
[2] R. Baldoni, E. Coppa, D. C. D’elia, C. Demetrescu, and I. Finocchi. “A Survey of Symbolic Execution Techniques”. In: ACM Comput. Surv. 51.3 (May 2018). New York, NY, USA: Association for Computing Machinery. issn: 0360-0300. doi: 10.1145/3182657. url: https://doi-org.miman.bib.bth.se/10.1145/3182657.
[3] N. Bjørner and L. de Moura. “Z310: Applications, Enablers, Challenges and Directions”. In: CFV 2009.
[4] P. Boonstoppel, C. Cadar, and D. Engler. “RWset: Attacking Path Explosion in Constraint-Based Test Generation”. In: Tools and Algorithms for the Construction and Analysis of Systems. Ed. by C. R. Ramakrishnan and J. Rehof. Vol. 4963. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 351–366. isbn: 978-3-540-78799-0, 978-3-540-78800-3. doi: 10.1007/978-3-540-78800-3_27. url: http://link.springer.com/10.1007/978-3-540-78800-3_27 (visited on 01/22/2020).
[5] S. K. Cha, T. Avgerinos, A. Rebert, and D. Brumley. “Unleashing Mayhem on Binary Code”. In: 2012 IEEE Symposium on Security and Privacy (SP). San Francisco, CA, USA: IEEE, May 2012, pp. 380–394. isbn: 978-1-4673-1244-8, 978-0-7695-4681-0. doi: 10.1109/SP.2012.31. url: http://ieeexplore.ieee.org/document/6234425/ (visited on 01/22/2020).
[6] E. Coppa, D. C. D’Elia, and C. Demetrescu. “Rethinking pointer reasoning in symbolic execution”. In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). Urbana, IL: IEEE, Oct. 2017, pp. 613–618. isbn: 978-1-5386-2684-9. doi: 10.1109/ASE.2017.8115671. url: http://ieeexplore.ieee.org/document/8115671/ (visited on 04/01/2020).
[7] S. D. S. Dinesh, N. Burow, D. Xu, and M. Payer. “RetroWrite: Statically Instrumenting COTS Binaries for Fuzzing and Sanitization”. In: IEEE S&P 2020. 2020.


[8] B. Dolan-Gavitt, P. Hulin, E. Kirda, T. Leek, A. Mambretti, W. Robertson, F. Ulrich, and R. Whelan. “LAVA: Large-Scale Automated Vulnerability Addition”. In: 2016 IEEE Symposium on Security and Privacy (SP). ISSN: 2375-1207. May 2016, pp. 110–121. doi: 10.1109/SP.2016.15.
[9] Doubly freeing memory | OWASP. url: https://owasp.org/www-community/vulnerabilities/Doubly_freeing_memory (visited on 05/04/2020).
[10] M. Eckert, A. Bianchi, R. Wang, Y. Shoshitaishvili, C. Kruegel, and G. Vigna. “HeapHopper: Bringing Bounded Model Checking to Heap Implementation Security”. In: 27th USENIX Security Symposium (USENIX Security 18). Baltimore, MD: USENIX Association, Aug. 2018, pp. 99–116. isbn: 978-1-931971-46-1. url: https://www.usenix.org/conference/usenixsecurity18/presentation/eckert.
[11] D. Fraze. Cyber Grand Challenge (CGC). url: https://www.darpa.mil/program/cyber-grand-challenge (visited on 01/22/2020).
[12] V. Ganesh and D. Dill. “A Decision Procedure for Bit-Vectors and Arrays”. In: Lecture Notes in Computer Science. Vol. 4590. Jan. 1, 2007, pp. 519–531. doi: 10.1007/978-3-540-73368-3_52.
[13] How a double-free bug in WhatsApp turns to RCE. Awakened security blog. Oct. 2, 2019. url: https://awakened1712.github.io/hacking/hacking-whatsapp-gif-rce/ (visited on 06/10/2020).
[14] jemalloc. jemalloc - memory allocator. url: http://jemalloc.net/ (visited on 01/22/2020).
[15] V. P. Kemerlis, G. Portokalidis, K. Jee, and A. D. Keromytis. “libdft: practical dynamic data flow tracking for commodity systems”. In: Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments (VEE ’12). London, England, UK: Association for Computing Machinery, Mar. 3, 2012, pp. 121–132. isbn: 978-1-4503-1176-2. doi: 10.1145/2151024.2151042. url: https://doi.org/10.1145/2151024.2151042 (visited on 05/13/2020).
[16] O. Levi. Pin - A Dynamic Binary Instrumentation Tool. Intel Software - Developer Zone. url: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool (visited on 01/23/2020).
[17] K. Li. “AFL’s Blindspot and How to Resist AFL Fuzzing for Arbitrary ELF Binaries”. Black Hat 2018 USA. Sept. 8, 2018. url: https://www.blackhat.com/us-18/briefings/schedule/#afls-blindspot-and-how-to-resist-afl-fuzzing-for-arbitrary-elf-binaries-11048 (visited on 05/10/2020).
[18] Y. Li, S. Ji, C. Lv, Y. Chen, J. Chen, Q. Gu, and C. Wu. V-Fuzz: Vulnerability-Oriented Evolutionary Fuzzing. eprint: 1901.01142. 2019.

[19] D. Liu, J. Wang, Z. Rong, X. Mi, F. Gai, Y. Tang, and B. Wang. “Pangr: A Behavior-Based Automatic Vulnerability Detection and Exploitation Framework”. In: 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). ISSN: 2324-9013. Aug. 2018, pp. 705–712. doi: 10.1109/TrustCom/BigDataSE.2018.00103.
[20] T. Liu, C. Curtsinger, and E. D. Berger. “DoubleTake: fast and precise error detection via evidence-based dynamic analysis”. In: Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). Austin, Texas: ACM Press, 2016, pp. 911–922. isbn: 978-1-4503-3900-1. doi: 10.1145/2884781.2884784. url: http://dl.acm.org/citation.cfm?doid=2884781.2884784 (visited on 01/22/2020).
[21] V. J. M. Manès, H. Han, C. Han, S. K. Cha, M. Egele, E. J. Schwartz, and M. Woo. “The Art, Science, and Engineering of Fuzzing: A Survey”. In: IEEE Transactions on Software Engineering (2019), pp. 1–1. issn: 1939-3520. doi: 10.1109/TSE.2019.2946563.
[22] M. Miller. microsoft/MSRC-Security-Research. GitHub. url: https://github.com/microsoft/MSRC-Security-Research (visited on 01/22/2020).
[23] L. de Moura and N. Bjørner. “Z3: An Efficient SMT Solver”. In: Tools and Algorithms for the Construction and Analysis of Systems. Ed. by C. R. Ramakrishnan and J. Rehof. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2008, pp. 337–340. isbn: 978-3-540-78800-3. doi: 10.1007/978-3-540-78800-3_24.
[24] H. Nero. henriknero/double-free-dataset. GitHub. url: https://github.com/henriknero/double-free-dataset (visited on 06/10/2020).
[25] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos. “VUzzer: Application-aware Evolutionary Fuzzing”. In: Feb. 26, 2017. doi: 10.14722/ndss.2017.23404.
[26] M. Rönn. Symvex: A Symbolic Execution System for Machine Code. 2016. url: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-124181 (visited on 04/28/2020).
[27] Satisfiability modulo theories. In: Wikipedia. Page Version ID: 937374865. Jan. 24, 2020. url: https://en.wikipedia.org/w/index.php?title=Satisfiability_modulo_theories&oldid=937374865 (visited on 01/31/2020).
[28] K. Sen, D. Marinov, and G. Agha. “CUTE: a concolic unit testing engine for C”. In: Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering (ESEC/FSE-13). Lisbon, Portugal: Association for Computing Machinery, Sept. 1, 2005, pp. 263–272. isbn: 978-1-59593-014-9. doi: 10.1145/1081706.1081750. url: http://doi.org/10.1145/1081706.1081750 (visited on 04/09/2020).

[29] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov. “AddressSanitizer: A Fast Address Sanity Checker”. In: USENIX ATC 2012. 2012. url: https://www.usenix.org/conference/usenixfederatedconferencesweek/addresssanitizer-fast-address-sanity-checker (visited on 01/31/2020).
[30] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna. “SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis”. In: 2016 IEEE Symposium on Security and Privacy (SP). ISSN: 2375-1207. May 2016, pp. 138–157. doi: 10.1109/SP.2016.17.
[31] smparkes/valgrind-vex. GitHub. url: https://github.com/smparkes/valgrind-vex (visited on 05/13/2020).
[32] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshitaishvili, C. Kruegel, and G. Vigna. “Driller: Augmenting Fuzzing Through Selective Symbolic Execution”. In: Proceedings 2016 Network and Distributed System Security Symposium. San Diego, CA: Internet Society, 2016. isbn: 978-1-891562-41-9. doi: 10.14722/ndss.2016.23368. url: https://www.ndss-symposium.org/wp-content/uploads/2017/09/driller-augmenting-fuzzing-through-selective-symbolic-execution.pdf (visited on 01/22/2020).
[33] The GNU Allocator (The GNU C Library). GNU Operating System. url: https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html (visited on 05/06/2020).
[34] TRUESEC: Experts in Cybersecurity, Secure Infrastructure & Development. TRUESEC. url: https://www.truesec.com/ (visited on 05/14/2020).
[35] Valgrind Home. url: https://valgrind.org/ (visited on 05/13/2020).
[36] I. Yun, S. Lee, M. Xu, Y. Jang, and T. Kim. “QSYM: A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing”. In: 27th USENIX Security Symposium (USENIX Security 18). Baltimore, MD: USENIX Association, Aug. 2018, pp. 745–761. isbn: 978-1-939133-04-5. url: https://www.usenix.org/conference/usenixsecurity18/presentation/yun.
[37] Z3Prover/z3. original-date: 2015-03-26T18:16:07Z. Jan. 21, 2020. url: https://github.com/Z3Prover/z3 (visited on 01/22/2020).
[38] M. Zalewski. american fuzzy lop (2.52b). Jan. 22, 2020. url: http://lcamtuf.coredump.cx/afl/ (visited on 01/22/2020).

[39] L. Zhao, Y. Duan, H. Yin, and J. Xuan. “Send Hardest Problems My Way: Probabilistic Path Prioritization for Hybrid Fuzzing”. In: Proceedings 2019 Network and Distributed System Security Symposium. San Diego, CA: Internet Society, 2019. isbn: 978-1-891562-55-6. doi: 10.14722/ndss.2019.23504. url: https://www.ndss-symposium.org/wp-content/uploads/2019/02/ndss2019_04A-5_Zhao_paper.pdf (visited on 01/22/2020).

Appendix A Graphs of path coverage


Figure A.1: Paths found by QSYM versus QSIMP for (a) base64, (b) md5sum, (c) uniq, and (d) who.
