Università degli Studi di Padova

Dipartimento di Matematica “Tullio Levi-Civita”

Corso di Laurea in Informatica

Coverage-guided fuzzing of embedded firmware with avatar2

Supervisor: Prof. Mauro Conti, Università di Padova
Co-supervisor: Marius Muench, EURECOM
Tutor: Prof. Claudio Enrico Palazzi, Università di Padova
Candidate: Andrea Biondo

To my family, who always supports me (and bears with the 3am keyboard noise)

Abstract

Since the eighties, fuzz testing has been used to stress applications and find problems in them. The basic idea is to feed malformed inputs to the program, with the goal of stimulating buggy code paths that produce incorrect behavior. Early fuzzers used primitive methods to generate such inputs, mostly based on random generation and mutation. Modern fuzzers have reached high levels of efficiency, often by leveraging feedback information about how a program's control flow is influenced by the inputs, and have uncovered a large number of bugs, many with security implications. However, those fuzzers target normal, general-purpose systems. The state of fuzzing on embedded devices is not as developed. Due to distinctive traits of this kind of device, such as atypical operating systems, low-level peripheral access and limited memory protection, fuzzing tools do not reach the same efficiency, and fuzzing jobs are more expensive to set up.

In this work, we present AFLtar, a coverage-guided fuzzer for embedded firmware. AFLtar leverages avatar2, an orchestration framework for dynamic analysis, along with the American Fuzzy Lop coverage-guided fuzzer and the AFL-Unicorn CPU emulator. The goal of AFLtar is to reduce the cost of embedded fuzzing by providing a platform that can be used to quickly set up a firmware fuzzing job, while reaping the benefits of modern, feedback-driven fuzzing strategies.

Sommario

Fin dagli anni ottanta, il fuzzing è stato usato per stressare applicazioni e trovare problemi all'interno di esse. L'idea di base consiste nel fornire input malformati al programma, con l'obiettivo di stimolare percorsi di codice problematici che producono comportamenti non corretti. I primi fuzzer utilizzavano metodi rudimentali per generare questi input, perlopiù basati su generazioni e mutazioni casuali. I fuzzer moderni hanno raggiunto alti livelli di efficienza, spesso sfruttando informazioni di feedback sul modo in cui l'input influenza il flusso di controllo del programma, e hanno scoperto grandi quantità di bug, buona parte dei quali con implicazioni di sicurezza. Questi fuzzer, tuttavia, sono costruiti per normali sistemi general-purpose. Nell'ambito embedded, lo stato del fuzzing non è così sviluppato. A causa di alcuni tratti distintivi di questi dispositivi, come sistemi operativi atipici, accesso a basso livello alle periferiche, e limitata protezione della memoria, gli strumenti di fuzzing non raggiungono la stessa efficienza, e rendono il processo più costoso.

In questo lavoro, presentiamo AFLtar, un fuzzer coverage-guided per firmware embedded. AFLtar sfrutta avatar2, un framework di orchestrazione per l'analisi dinamica, assieme al fuzzer coverage-guided American Fuzzy Lop e all'emulatore di CPU AFL-Unicorn. L'obiettivo di AFLtar è ridurre il costo del fuzzing su firmware embedded fornendo una piattaforma usabile per avviare velocemente processi di fuzzing, cogliendo allo stesso tempo i benefici delle moderne strategie di fuzzing guidate da feedback.

Contents

Abstract

List of figures

List of tables

1 Introduction
   1.1 Contribution
   1.2 Organization

2 Background
   2.1 Program analysis
      2.1.1 Control flow analysis
      2.1.2 Coverage
      2.1.3 Static analysis
      2.1.4 Dynamic analysis
   2.2 Fuzzing
   2.3 Embedded devices

3 Technologies
   3.1 American Fuzzy Lop
      3.1.1 Coverage measurement
      3.1.2 Test case evolution
      3.1.3 Culling and trimming
      3.1.4 Mutation strategies
      3.1.5 Crash reporting
      3.1.6 The forkserver
      3.1.7 QEMU mode
   3.2 Unicorn
      3.2.1 Overview
      3.2.2 Instrumentation
   3.3 AFL-Unicorn
      3.3.1 Overview
      3.3.2 Driver Workflow
   3.4 Avatar2

      3.4.1 Architecture

4 AFLtar
   4.1 avatar2 API
      4.1.1 Top-level API
      4.1.2 Execution protocol
      4.1.3 Memory protocol
      4.1.4 Register protocol
      4.1.5 Remote memory protocol
      4.1.6 Target
   4.2 Unicorn API
      4.2.1 Execution, registers and memory
      4.2.2 Hooks
   4.3 Unicorn bugs
      4.3.1 Issue A: wrong PC after stopping from hook
      4.3.2 Issue B: cannot stop from different thread while in hook
      4.3.3 Issue C: crash when stopping from different thread while in hook
   4.4 Design
      4.4.1 Message passing
      4.4.2 Hooks
      4.4.3 Breakpoint and watchpoint handling
      4.4.4 Emulation
      4.4.5 Memory forwarding
      4.4.6 Additions to the standard API
      4.4.7 Fuzzing driver

5 Evaluation
   5.1 Experiment design
   5.2 Results
   5.3 Discussion
   5.4 Future work

6 Conclusion

References

Listing of figures

2.1 The line_length function's control flow graph (compiled for x86_64).
2.2 The line_length control flow graph without instructions.
2.3 Verified property sets in sound and complete static analysis, compared to real program behavior.

3.1 AFL architecture.
3.2 Example CFG fragment.
3.3 Example traces used in the text.
3.4 Workflow of AFL's QEMU mode.
3.5 Workflow of AFL-Unicorn's Unicorn mode.
3.6 Workflow of a typical AFL-Unicorn driver.
3.7 avatar2 architecture.

4.1 General AFLtar architecture.
4.2 Sequence diagram for Unicorn protocol message passing.
4.3 Flow chart for Unicorn hook handling.
4.4 Flow chart for the breakpoint hook.
4.5 Flow chart for emulation start (protocol side).
4.6 Flow chart for emulation start (endpoint side).

5.1 Experimental hardware. On the right, the STMicroelectronics NUCLEO-L152RE board, which integrates an STM32L152RE microcontroller (bottom) and an ST-LINK/V2-1 programming and debugging interface (top). On the left, an RS232-USB converter based on the FTDI FT232RL chip, connected to the microcontroller's UART interface.
5.2 Total executions vs total paths in our experiments.

Listing of tables

5.1 Experimental results.

There are three principal means of acquiring knowledge available to us: observation of nature, reflection, and experimentation. Observation collects facts; reflection combines them; experimentation verifies the result of that combination.
Denis Diderot

1 Introduction

Fall of 1988. It was a dark and stormy night over the city of Madison, Wisconsin. The weather was particularly bothersome for Prof. Barton Miller, who was trying to connect to his office's Unix system from his apartment. The rain was causing constant noise on his 1200-baud line, which made it hard to type commands into the shell. He was not surprised by the noise itself, but by how the corrupted input made common Unix utilities crash. Could noisy, garbled or random input be used as a testing tool to find bugs in software? Miller decided to study this phenomenon, and assigned it as a project for students in his Advanced Operating Systems class: he dubbed it The Fuzz Generator. One group wrote a fuzzer that crashed a quarter to a third of Unix utilities over seven different Unix variants [1]. Later, Miller discovered that this idea was not new: in 1983, Steve Capps at Apple had written The Monkey, a testing tool that generated random GUI events to test Macintosh applications [2].

Nowadays, fuzzing is a commonly employed testing technique that has seen extensive improvement over the years. In particular, fuzzing has had a significant impact on security testing. Smart fuzzers that build feedback by observing the program's internal behavior in response to inputs have identified a large number of vulnerabilities in complex, high-value software [3, 4]. However, the situation is not so encouraging in the embedded world, whose security is becoming increasingly important with the rise of the Internet of Things. While there are fuzzing tools available for embedded software, they are not as sophisticated as state-of-the-art fuzzers, and adopt black-box approaches instead of feedback-driven ones. These devices also pose technical difficulties in developing fuzzers, due to their frequent lack of

a fully-fledged OS, the absence of sophisticated memory protection mechanisms, and the limited I/O.

In this thesis, I report the work done during my internship at the Department of Mathematics, University of Padova, in the SPRITZ Security and Privacy Research Group, under the supervision of Prof. Mauro Conti. During this project, we also interacted closely with Marius Muench, the main developer of avatar2 at EURECOM. This internship, which took up 320 hours, is part of the requirements for obtaining my Bachelor's Degree in Computer Science. We developed AFLtar, a fuzzer for embedded devices that rests upon avatar2, an open source orchestration framework for dynamic analysis developed by EURECOM. We use a coverage-guided fuzzing strategy, which takes into account how inputs affect a program's control flow to optimize the generation of fuzzed test cases. This is achieved by combining avatar2 with other open source components: the American Fuzzy Lop [5] fuzzer and the AFL-Unicorn [6] CPU emulation framework. During the internship, we made great progress on AFLtar. However, there are still challenges to solve, and real-world testing is still in progress. Therefore, I will continue working on this project in the future.

1.1 Contribution

We make the following contributions:

• We study the domain of fuzzing, with specific references to the embedded world, and identify technologies useful for implementing a coverage-guided embedded fuzzer.

• We design and implement AFLtar, a coverage-guided fuzzer for embedded firmware, on top of the avatar2 framework, the American Fuzzy Lop fuzzer and the AFL-Unicorn CPU emulator.

• We evaluate AFLtar's performance by fuzzing the Expat [7] XML parser in an embedded firmware.

1.2 Organization

The work is organized as follows. In Chapter 1, we introduced the problem space. In Chapter 2, we study it in more detail, covering program analysis, fuzzing and embedded devices. In Chapter 3, we describe the technologies and frameworks used in AFLtar, which we design and implement in Chapter 4. We evaluate the fuzzer in Chapter 5, and draw some concluding remarks in Chapter 6.

Beware of bugs in the above code; I have only proved it correct, not tried it.
Donald Ervin Knuth

2 Background

In this chapter we examine some background information about concepts used in the rest of the work. In Section 2.1, we discuss basic notions of program analysis. Then, in Section 2.2, we cover fuzzing, the main theme of this work. Finally, we provide an overview of the embedded world and the challenges it poses in Section 2.3.

2.1 Program analysis

The term program analysis refers to automatically analyzing a program, with the goal of understanding its behavior with respect to certain properties. In particular, we focus on security, which is part of correctness. For example, we might be interested in knowing whether properties such as "the program never causes a buffer overflow" hold. In this section we cover some basic notions of program analysis that will either be used later in this work, or help put the work into context.

2.1.1 Control flow analysis

Control flow analysis studies the control flow of a program, i.e., the order in which instructions are executed, and how control flow is transferred across different places in the program. This is one of the most fundamental kinds of analysis, and often acts as a foundation for more complex analyses. During the rest of this work, we will often refer to two important concepts in control flow analysis: basic blocks and control flow graphs.

Basic blocks are sequences of instructions that satisfy two properties:

1. The block is entered only through its first instruction, i.e., there are no external jumps that target block instructions other than the first one.

2. The block is exited only through its last instruction, i.e., there are no jump instructions within the block, except possibly for its last instruction.

Basic blocks are often used as a fundamental unit in program analysis, because each block executes fully from top to bottom (if it is on the execution path). Therefore, a block can be analyzed as an atomic unit. Breaking a program down into basic blocks makes it possible to perform many small local analyses and then combine them, instead of trying to reason about a complex program all at once.

A Control Flow Graph (CFG) is a directed graph that defines all the possible execution paths a procedure or program can take during execution. Each node in the graph corresponds to a basic block, and control flow transfers between blocks are represented by directed edges from the transfer source to the destination. A CFG can be intraprocedural, when it describes a single procedure (function), or interprocedural, when it represents a whole program. Intraprocedural CFGs are more common, with the program being represented as a collection of procedures, and thus of intraprocedural CFGs.

As an example, we consider the C function in Listing 2.1. This is a simple procedure which reads a line from the user and returns its length (newline excluded). After initializing the len variable to zero, it enters a while loop that increments len every time the user inputs a character, stopping at the first newline. In Figure 2.1 we show the CFG obtained with a reverse engineering tool after compiling line_length for the x86_64 architecture. The function's entry point is the block at 0x4006a0, which initializes len (assigned to the register rbx) to zero and jumps to the block at 0x4006ac, which is the first block (header) of the while loop. This block calls getchar (a compiler optimization transformed it into

size_t line_length() {
    size_t len = 0;
    /* EOF not handled for simplicity */
    while (getchar() != '\n')
        len++;
    return len;
}

Listing 2.1: The line_length example C function.

Figure 2.1: The line_length function's control flow graph (compiled for x86_64).

_IO_getc(stdin)) and compares the character with a newline. Depending on the comparison, execution can take two paths, as shown by the presence of two outgoing edges. If the character is not a newline, execution jumps to the block at 0x4006a8, which increments len and falls through into the next loop iteration. If the character is a newline, the loop breaks out by continuing to the block at 0x4006bd, which returns len to the caller, thus exiting the function.

When reasoning about properties of the CFG and paths in it (as we will do when discussing fuzzing), we are often not interested in the actual instructions inside the blocks. Therefore, we represent the CFG as a simpler graph. Figure 2.2 shows this representation for the line_length function. With respect to Figure 2.1, block A is 0x4006a0 (the function's entry point), block B is 0x4006ac (loop header), block C is 0x4006a8 (increment), and block D is 0x4006bd (return). Such a graph makes it possible to quickly observe properties of the function. For example, any call to line_length must enter through A and exit through D. Moreover, the presence of two successors to B suggests a conditional statement. Finally, we can also note at a glance that B and C are part of a loop (formally, B and C form a natural loop region, with C→B being a back edge, as C is dominated by B).

Figure 2.2: The line_length control flow graph without instructions.

2.1.2 Coverage

We use the term coverage to define to what extent a program has been analyzed. In its most elementary form, code coverage measures how much of the program code has been subject to analysis, relative to the total amount of code present in the program. Other types of coverage that will be used in this work are:

• Block coverage: the number of basic blocks analyzed over the total number of blocks in the program.

• Edge coverage: the number of CFG edges traversed during analysis over the total number of edges in the CFG.

• Path coverage: the number of execution paths traversed during analysis over the total number of possible execution paths in the program.

It is worth mentioning that some coverage measures are correlated. For example, full code coverage and full block coverage are equivalent. Also, full path coverage implies full code, block and edge coverage. However, the converse does not necessarily hold: for example, full edge coverage does not imply full path coverage (because an edge can belong to multiple paths).

2.1.3 Static analysis

Figure 2.3: Verified property sets in sound and complete static analysis, compared to real program behavior.

Static analysis is the process of reasoning on code without executing it. It is attractive because it can analyze every single path in the program (i.e., it achieves perfect path coverage). However, Rice's theorem [8] states that any nontrivial property of a recursively enumerable language (i.e., a property that holds for some but not all such languages) is undecidable. Reformulating this theorem in the context of program analysis, any nontrivial property of a program is undecidable*. To make static analysis decidable, we need to introduce approximations. In Figure 2.3, the middle set represents the real program properties. One way to go is sound analysis, which over-approximates those properties. From a security standpoint, this means that the analysis will

*Technically, real computers are finite machines, therefore Rice's theorem does not hold and the problem is decidable. However, the number of states is so huge that it can be considered infinite for practical purposes.

find all vulnerabilities, but can also produce false positives. Therefore, it can ensure the absence of bugs, but gives no guarantee that all the problems found are real. On the other end of the spectrum we have complete analysis, which under-approximates the set of properties. Any vulnerability found by a complete analysis is guaranteed to be real; however, the analysis can miss bugs and produce false negatives. Complete analyses can be used to find issues, but do not ensure the absence of them. Real-world static analysis tools can be sound, complete, or neither, but never both sound and complete.

One common static technique is abstract interpretation, where program semantics are approximated within an appropriate abstract domain depending on what characteristics we are interested in analyzing. For example, Value-Set Analysis (VSA) is a form of abstract interpretation that maps each variable to an abstract value-set, which is a strided interval representing the over-approximated set of possible values that the variable can take.

Another popular static technique is symbolic execution, which represents program input as a symbolic variable. Then, operations are tracked to build symbolic formulas for variables. Furthermore, path conditions from conditional statements are added as constraints to the model. Once a symbolic model of a program is built, a constraint solver can provide answers to questions such as whether a certain property can be satisfied at a certain point in the program, or what input has to be provided to the program to reach a certain state. Some common symbolic execution frameworks are angr [9] and KLEE [10].

Symbolic execution has two main limiting factors: path explosion and the complexity of constraint solving. Symbolic execution has to follow all possible paths of a program, the number of which grows exponentially with the program's size. As an example, take the line_length function from Listing 2.1.
Each iteration generates a new path, because the loop can either exit if the while condition is not satisfied, or execute the loop body and continue to the next iteration if it is. Let us assume that n is the maximum length of a line (newline excluded). Then, the loop can produce n + 1 paths, and therefore any call to line_length multiplies the number of possible execution paths in the program by n + 1. This clearly does not scale well for large applications, and quickly becomes unmanageable.

The other issue is constraint solving, i.e., finding a set of values for logic variables that satisfies a model. This is known as the boolean satisfiability problem (SAT). Symbolic execution engines usually employ a SAT generalization, the satisfiability modulo theories (SMT) problem. SAT is NP-complete, and SMT is at least as hard; therefore, solving complex constraint systems is computationally expensive. This impacts the performance of symbolic execution, as constraint solving is a fundamental part of it. A popular SMT solver is z3 [11]; SMT-LIB [12] defines a standard input language and benchmark library for such solvers.

7 2.1.4 Dynamic analysis

Dynamic analysis relies on observation of a program during its concrete execution. Typically, the program or its execution environment are instrumented, i.e., extra code is inserted to provide feedback for the analyzer with respect to certain properties one wishes to observe. Since dynamic analysis relies on concrete execution, it can only observe paths that are effectively executed. Hence, its coverage heavily depends on the quality of the test cases, i.e., the inputs given to the program under analysis. While providing worse coverage guarantees when compared to static analysis, dynamic analysis is fast, does not rely on undecidable or hard problems, and is not subject to path explosion.

Dynamic analysis is used in many real-world analyzers to find bugs during testing. For example, if one wanted to check for the presence of heap corruption, the analyzer could instrument the heap routines with additional checks. If heap corruption happened during an instrumented execution, the analyzer would detect it thanks to the instrumentation. Clearly, this verification is limited to the states encountered during concrete execution, and it is therefore important to maximize coverage by building a high-quality corpus of test cases. One of the most widely known dynamic analyzers is Valgrind [13]. Another such tool is AddressSanitizer [14], included in the Clang compiler. It inserts checks to detect most kinds of memory errors. Some common dynamic instrumentation frameworks, upon which dynamic analyzers can be built, are Pin [15] and DynamoRIO [16].

2.2 Fuzzing

Fuzzing [1] is the process of feeding malformed inputs to a program, the target, with the goal of causing a malfunction (e.g., a hang or crash). It is a dynamic technique that, in its more advanced forms, has proven to be very effective in finding real-world bugs. This can be attributed to its speed, combined with strategies to maximize coverage, which allow it to quickly stress deep, less-tested parts of a codebase.

In order to be effective, a fuzzer has to produce test cases that exercise interesting code paths in the program. Those test cases have to strike a delicate balance in order to solicit unexpected behavior. If they are too malformed, they will quickly be rejected by the program, thus providing very shallow exploration. If they are too close to valid input, they will likely not trigger bugs. There are two strategies for test case generation: generation-based and mutation-based.

In generation-based fuzzing, the fuzzer builds test cases from scratch. Given a nontrivial

input format, the probability that a random stream of bits will pass even the most basic sanity checks is extremely low. Therefore, generation-based fuzzing requires detailed knowledge of the format in order to generate almost-valid test cases. This raises the setup cost of a fuzzing job, as the analyst has to build such a specification. Generation-based fuzzing does a very good job at passing sanity checks, and can achieve very high coverage with a complete specification, since it knows about all the functionality of the format. However, it also runs the risk of following the format too closely, thus triggering few bugs. Examples of generation-based fuzzers are Peach [17] and Sulley [18].

In mutation-based fuzzing, the test cases are obtained by mutating some seed inputs provided by the user or by previous fuzzing rounds. In its basic form, the mutations are oblivious to the specific format (more advanced mutation-based fuzzing can take a format specification into account to generate better results). The main advantage of mutation-based fuzzing is that the setup cost is very low, since no format specification is needed, thus allowing an analyst to begin fuzzing a new target very quickly. However, these black-box mutations (especially random ones) can easily break the integrity of the format and fail shallow checks. For example, consider a file format with a checksum. A generation-based fuzzer will know how to generate a valid checksum, and thus will easily pass the integrity check. A mutation-based fuzzer has no knowledge of the checksum, and random mutations are extremely unlikely to produce a valid file†. Moreover, without more sophisticated improvements, mutation-based fuzzing may not achieve good coverage. The mutated test cases that pass verification will likely be heavily biased by the seed inputs, thus triggering only a limited portion of all code paths. Examples of mutation-based fuzzers are Radamsa [19] and zzuf [20].
A very popular and effective way to improve coverage is coverage-guided fuzzing. With this technique, the target is instrumented to provide coverage feedback to the fuzzer, which uses this information to optimize test case selection, generation or mutation. Often this kind of fuzzer is mutation-based and evolutionary, i.e., it continually produces new mutated test cases from previous ones through an evolutionary (usually genetic) algorithm targeted towards coverage maximization. Such fuzzers have gained great popularity because they have the low setup cost of a mutation-based fuzzer, and at the same time are capable of achieving very high coverage. Examples of coverage-guided fuzzers are American Fuzzy Lop [5], honggfuzz [21], libFuzzer [22], and syzkaller [23].

†A common strategy, especially with open source targets, is to patch out checksum checks, as doing so is not likely to introduce new bugs. Once a bug is found, the checksums can be manually fixed by the analyst to produce a test case for an unpatched target. This, however, raises setup cost. Checksums are a simple example (assuming the input format is known), and more complex cases are not as easy to handle.

Basic coverage-guided fuzzers do not actually understand the relationship between the input and the path taken by the program: they are only able to determine what the effect of a mutation is on the control flow after execution. While keeping this black-box approach, one avenue to improve fuzzing performance is figuring out exactly which input bytes affect a conditional branch and restricting mutations to these bytes. A common technique for this is taint tracking. When a value is tainted, all computations on that value will propagate the taint to the results. This makes it possible to track which values in the program are influenced by another value, e.g., the input. An example of such a fuzzer is VUzzer [24].

Another avenue for optimization is gaining a deeper understanding of the relationship between input and control flow by calculating and solving path constraints. In hybrid fuzzing, a symbolic execution engine re-traces the program flow while building constraints for conditional branches. Then, these constraints can be solved to generate test cases that trigger new paths. An example of hybrid fuzzing is Driller [25]. While not subject to path explosion, this technique suffers from the significant performance overhead imposed by symbolic execution. Newer hybrid fuzzers, such as QSYM [26], try to reduce this overhead. Symbolic execution, however, is not the only strategy to solve path constraints. For example, Angora [27] treats conditionals as black-box functions and solves their constraints through a gradient descent algorithm (with taint tracking to reduce the vector size), thus only performing concrete instrumented execution.

2.3 Embedded devices

Embedded devices distinguish themselves from the computers we are used to because they perform very specific tasks, often interacting with the physical world in a real-time fashion. They are also usually much more limited, in terms of available resources, than usual computers. For example, embedded devices are used to control all kinds of electronics, from industrial production lines, to medical devices, to everyday consumer appliances. Often, embedded devices are tightly integrated with a larger system, and can only work as part of it.

Due to their limited resources, most embedded devices do not run a fully-fledged operating system as normal computers do. Sometimes, they run embedded operating systems designed specifically for this kind of scenario. The specific program runs as an application on top of the embedded OS. Those OSes do not always offer the same security properties and ease of debugging as traditional OSes, but they are significantly lighter. Many devices, however, do not even run an OS: they are known as bare metal. In this case, the application could be considered to be the OS: the processor runs a monolithic firmware. Firmware can be written from the ground up, or based on top of a library OS or a Hardware Abstraction Layer, both of which abstract the underlying hardware to facilitate development. Nonetheless, the final product is still a monolithic image.

Manufacturers of embedded processors often include various kinds of peripherals inside their chips, such as timers, clock generators, communication interfaces (e.g., I/O lines, UART, I2C, SPI, USB, Ethernet), hardware for the treatment of analog signals (e.g., analog-to-digital and digital-to-analog converters), and many other kinds of specialized hardware. Those peripherals can be accessed by means specified by the Instruction Set Architecture of the processor. For example, some architectures use special CPU instructions (e.g., in/out on x86), while others, such as ARM, use memory-mapped I/O, which allows the CPU to communicate with peripherals through the memory bus.

One of the motivations for this work is that fuzzing embedded devices is hard. Many common fuzzers only work on software built for normal operating systems. Moreover, advanced fuzzers use compile-time instrumentation techniques to optimize their fuzzing strategies. Many firmware images are distributed as binary only, rendering such instrumentation useless. Even the fuzzers that are able to perform such instrumentation at runtime only work on the major operating systems. Furthermore, certain devices do not support memory protection, making it hard for a fuzzer to detect whether a crash happened. In such conditions, vulnerabilities are difficult to discover and might be missed. A solution could be emulating the embedded device on a normal system, as the emulator can allow deeper inspection and instrumentation of the firmware. However, this also requires emulating all peripherals and their interaction with the whole system. Full emulation can therefore be very expensive in terms of time and effort [28].

Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke

3 Technologies

In this chapter, we introduce the main technologies used as building blocks in this project. In Section 3.1, we describe in significant depth American Fuzzy Lop, our fuzzer of choice. We then proceed to introduce Unicorn (Section 3.2), a CPU emulator, and AFL-Unicorn (Section 3.3), the bridge between AFL and Unicorn. Finally, in Section 3.4, we begin looking at avatar2, the orchestration framework upon which we rely for combining emulators and physical devices.

3.1 American Fuzzy Lop

American Fuzzy Lop (AFL) [5] is a mutation-based coverage-guided evolutionary fuzzer written by Michal Zalewski. It tries to evolve seed inputs into new test cases that increase code coverage, while performing input mutations to try to induce crashes in the fuzzed target. By leveraging coverage feedback from a lightweight compile-time (or run-time, if no source code is available) instrumentation, it is able to deeply fuzz complex formats without any knowledge of the format's specifics. For example, it was able to generate complex JPEG and XML documents from scratch [29][30]. This is AFL's main strength: setting up a fuzzing job requires very little effort and time, as AFL is completely agnostic with respect to the target's input format. AFL works on Linux, *BSD, Mac OS X, and (reportedly) Solaris. There are also third-party ports for other platforms, such as Android and Windows. AFL is actually a collection of tools: the main fuzzer is afl-fuzz and, unless otherwise spec-

ified, we will use the term "AFL" loosely to refer to it. An overview of AFL's architecture is depicted in Figure 3.1. In this section, we describe in detail the most important aspects and components of AFL, along with the techniques and algorithms employed during the fuzzing process. The information reported below is based on the official technical whitepaper for AFL [31], which also provides further details for the interested reader.

3.1.1 Coverage measurement

Measuring the code coverage produced by a given test case is central to AFL's operation. Target programs are built with a modified compiler, which injects a lightweight instrumentation to provide coverage information. AFL offers two options for this: a traditional GCC-based compiler, with assembly instrumentation, or a more recent LLVM-based solution, which instruments the LLVM bitcode before optimizations. The LLVM compiler typically produces faster binaries, as the injected bitcode goes through the same optimization passes as the target's code. If source code is not available, coverage measurements can be provided by an emulator that runs the target binary. See Section 3.1.7 for more information about this mode of operation. AFL works on edge coverage, i.e., it observes which control flow edges are traversed by the program while executing a test case. Edge coverage provides more detailed information about the program's control flow when compared with block coverage. While less precise than path coverage, edge coverage has the advantage of not being exposed to path explosion issues. Cov-

Figure 3.1: AFL architecture.

erage measurements are reported to AFL via shared memory IPC. The shared memory region is a 64 kB table which stores hit counts for control flow edges. Each (source, destination) tuple maps to a byte in this heat map, containing the hit count for that edge. To understand how the map is populated, it is helpful to look at the code injected by the instrumentation at branch destinations:

cur_location = <COMPILE_TIME_RANDOM>;
shared_mem[cur_location ^ prev_location]++;
prev_location = cur_location >> 1;

Block identifiers (such as cur_location) are random, to keep the XOR distribution uniform. The identifiers for source (prev_location) and destination (cur_location) blocks are XORed together, and the result is used as index into the map to increment the count. A point worth noting is that the destination is right-shifted by one when it becomes the new source: this makes it possible to distinguish (A, B) from (B, A). The resulting instrumentation is very simple and therefore lightweight. This comes at the price of possible collisions, especially considering the limited size of the map. However, in real programs, the collisions are sufficiently rare. The small map size (64 kB) ensures it can comfortably fit into the L2 cache, which significantly increases performance. Therefore, this is a pragmatic and acceptable trade-off. Another possible issue is that the 8-bit hit counter can wrap around. Since x86 does not provide saturating arithmetic instructions and the overflow is a rare occurrence, AFL chooses to let it happen for the sake of simplicity and speed.
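As a toy illustration of why the right shift matters, consider the following sketch (the block identifiers and function name are ours, chosen for the example, not taken from AFL):

```python
# A toy model of AFL's edge indexing: the destination ID is XORed with
# the right-shifted source ID, so (A, B) and (B, A) hit different bytes.
MAP_SIZE = 64 * 1024

def edge_index(src_id, dst_id):
    # AFL stores prev_location = cur_location >> 1 before the next XOR
    return (dst_id ^ (src_id >> 1)) % MAP_SIZE

A, B = 0x1111, 0x2222  # example block IDs (ours, not AFL's)

# Without the shift, XOR is symmetric and both directions collide:
print(hex(A ^ B))             # 0x3333 either way
# With the shift, the direction is preserved:
print(hex(edge_index(A, B)))  # 0x2aaa
print(hex(edge_index(B, A)))  # 0x0
```

With plain `A ^ B`, both traversal directions of the same branch would map to index 0x3333; the shift breaks this symmetry.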

3.1.2 Test case evolution

During the whole fuzzing process, AFL keeps a global heat map of all the edges seen during previous executions. After executing a test case, AFL can compare the resulting map with the global map to determine whether the program behaved in a significantly different way. If it did, the test case is kept for further mutations. A run can qualify as significantly different in two ways: either new edges are discovered, or the hit count for one or more edges significantly differs from what was previously seen. As the basis for examples to explain those cases, we take the CFG fragment shown in Figure 3.2.

Figure 3.2: Example CFG fragment.

From this CFG, several possible execution traces have been extracted into Figure 3.3. Each trace enters in A, exits in E, and shows the actual edges traversed during the execution. Let Figure 3.3a be the trace for a seed test case. For a first example, we assume AFL mutated the seed test case, and the mutated test case produced the trace shown in Figure 3.3b. Those two executions clearly follow a different path. The seed traverses the (A, B) branch first, then loops on (E, A), and finally takes (A, C) without looping until it reaches E. The mutated test case, instead, first follows (A, C), then loops on (E, A), then heads down (A, B) up to E. While the traces are different, AFL will not see any difference between them. For both, the set of seen edges is exactly the same: {(A, B), (A, C), (B, E), (C, D), (D, E), (E, A)} (all hit exactly one time). Therefore, the second trace does not qualify as new. A trace with new tuples is shown in Figure 3.3c. Here, the program performs two iterations of the CD region, by taking the (D, C) back edge once. This edge is also the only new one. Since AFL encounters new tuples, it will keep this test case as a seed for future mutations. As previously stated, the other condition for considering a test case as interesting is when the hit count for an edge changes significantly.
AFL divides hit counts in buckets of increasing size: 1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128-255. A significant increase in hit count is marked by a transition to a different bucket. For an example, take Figure 3.3d. Here, the cyclic CD region is executed n times. If previous runs executed n = 20 iterations, AFL

A B E A C D E

(a) (A, B) then (A, C).

A C D E A B E

(b) (A, C) then (A, B).

A C D C D E

(c) Two iterations of the CD loop.

A C D E

n times

(d) Arbitrary iterations of the CD loop.

Figure 3.3: Example traces used in the text.

would mark a new test case as interesting if it produced fewer than 16 or more than 31 iterations, because that would bring the hit count for the (C, D) edge outside of the 16-31 bucket. Once a test case is found to be interesting, AFL adds it to the input queue. Unlike more greedy genetic algorithms, previous generations are not removed from the queue. This allows the tool to explore different features of the input format in parallel.
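The bucketing logic just described can be sketched as follows (function names are ours; this is a simplified model, not AFL's C implementation):

```python
# Sketch of AFL's hit-count bucketing: a hit-count change is only
# "significant" when the count moves into a different bucket.
BUCKETS = [(1, 1), (2, 2), (3, 3), (4, 7), (8, 15),
           (16, 31), (32, 127), (128, 255)]

def bucket(hits):
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= hits <= hi:
            return i
    return None  # zero hits: edge not taken

def is_interesting(old_hits, new_hits):
    # Interesting only if the new count lands in a different bucket
    return bucket(new_hits) != bucket(old_hits)

# 20 loop iterations put the (C, D) edge in the 16-31 bucket:
print(is_interesting(20, 25))  # False: still in 16-31
print(is_interesting(20, 15))  # True: moved to 8-15
print(is_interesting(20, 32))  # True: moved to 32-127
```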

3.1.3 Culling and trimming

The test case corpus grows during the fuzzing process, since old test cases are not removed from it. At the same time, certain mutations add data to inputs, resulting in an increase of the input size over time. Both those phenomena are undesirable. A large corpus has a high probability of containing redundant inputs (i.e., new inputs whose edge coverage is a superset of old inputs' coverage). Mutating those inputs would be a waste of time, as it likely will not lead to new edges. Large test cases have a similar issue: they reduce the probability that a random mutation will alter important parts of the input on which the control flow depends, therefore reducing the fuzzer's efficiency in finding new edges. To solve those issues, AFL employs two techniques: culling, which reduces the corpus size, and trimming, which reduces the size of single test cases.

To support culling, each test case is scored in terms of execution time and size when it is executed. We wish to minimize these features, as that will maximize the fuzzer's efficiency. To cull the corpus, AFL selects a minimal subset of test cases which (1) still covers all the edges seen so far, and (2) minimizes the sum of scores. Internally, AFL uses a greedy algorithm to perform this. For each edge in the map, it keeps track of a top rated test case, which is the lowest-scoring test case that covers that edge. A set of favored test cases is calculated as follows. Let W be a working set initially containing all seen edges, F be the initially empty set of favored test cases, Φ(E) be the top rated test case for edge E, and ∆(T) be the set of edges covered by test case T. Iteratively, pick an edge E ∈ W, let T = Φ(E), and set F = F ∪ {T}, W = W \ ∆(T). Continue until W = ∅.
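The greedy pass just described can be sketched in a few lines of Python (the data layout and names are ours, chosen for illustration):

```python
# Sketch of the greedy culling pass. `coverage` maps a test case name
# to its edge set Δ(T); `top_rated` realizes Φ(E) by picking the
# lowest-scoring case covering each edge.
def cull(coverage, score):
    all_edges = set().union(*coverage.values())
    top_rated = {e: min((t for t in coverage if e in coverage[t]),
                        key=score)
                 for e in all_edges}
    work, favored = set(all_edges), set()
    while work:                      # until W = ∅
        edge = next(iter(work))      # pick E ∈ W
        t = top_rated[edge]          # T = Φ(E)
        favored.add(t)               # F = F ∪ {T}
        work -= coverage[t]          # W = W \ Δ(T)
    return favored

cov = {"t1": {1, 2}, "t2": {2, 3}, "t3": {1, 2, 3}}
scores = {"t1": 5, "t2": 5, "t3": 20}
fav = cull(cov, scores.get)
# Property (1) holds: the favored set still covers every seen edge.
assert set().union(*(cov[t] for t in fav)) == {1, 2, 3}
```

Note how the expensive t3 is never selected, because a cheaper top-rated case exists for each of its edges, illustrating the counterexample discussion below: the result covers all edges but is not guaranteed to be the global optimum.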
This algorithm does not guarantee the optimality of F as a minimal subset of the corpus satisfying (1) and (2): for a trivial counterexample, consider having two test cases T and U, with ∆(T) ⊂ ∆(U), and take the case where an edge E ∈ ∆(T) is picked first. However, it ensures that F still covers all seen edges, it is fast, and it produces acceptable results on a big corpus in practice. Non-favored test cases are not removed from the corpus, but they are skipped with a probability between 75% and 99% in favor of favored ones, depending on whether new favored entries are present

(99%), or on whether a non-favored entry was fuzzed before (95% if it was, 75% if it was not). The afl-cmin tool, distributed with AFL, performs more accurate (but slower) culling, and permanently discards redundant test cases. The trimming process integrated in AFL is fairly simple: it attempts to remove blocks of data from the input, varying the block size and the distance between the blocks, while checking that the edge map for the trimmed test case is identical to the original test case's map. This technique is not particularly thorough but, once again, the main focus is for it to be fast and have acceptable precision in practice. AFL provides another tool, afl-tmin, which performs more thorough trimming using an exhaustive algorithm.
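A block-removal trimmer in this spirit can be sketched as follows (the `fingerprint` callback stands in for the edge map; the toy coverage function and names are ours):

```python
# Sketch of block-removal trimming: try removing chunks with
# decreasing block sizes, keeping each removal only when the coverage
# fingerprint is unchanged.
def trim(data, fingerprint):
    base = fingerprint(data)
    block = max(len(data) // 2, 1)
    while block >= 1:
        pos = 0
        while pos < len(data):
            candidate = data[:pos] + data[pos + block:]
            if fingerprint(candidate) == base:
                data = candidate  # removal preserved coverage: keep it
            else:
                pos += block      # this block matters: skip past it
        block //= 2
    return data

# Toy "coverage": only the bytes 'a' and 'b' influence control flow.
fp = lambda d: tuple(sorted(set(d) & {ord("a"), ord("b")}))
print(trim(b"xxaxxbxx", fp))  # b'ab'
```

The padding bytes are stripped while the control-flow-relevant bytes survive; as in AFL, the result is not guaranteed to be minimal, only coverage-preserving.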

3.1.4 Mutation strategies

At the heart of every fuzzer there is a mutation engine, which modifies existing test cases to produce new ones, in the hope of crashing the target program. AFL, too, needs a good mutation engine: maximizing fuzzing coverage is not very helpful if the mutations do not cause crashes. AFL divides fuzzing in two phases: a deterministic phase followed by a random phase. While the former has a finite bound on the number of mutation attempts, the latter can go on endlessly until the user terminates the fuzzing process. As the name suggests, in the deterministic phase AFL applies deterministic mutation strategies. Those are: walking bit flips (1/2/4 bits, 1-bit stepover), walking byte flips (8/16/32 bits, 1-byte stepover), simple add/subtract arithmetics (±35, 8/16/32 bits, 1-byte stepover, little/big-endian), and substitutions with known integers (-1, 256, INT_MAX, …). Over a substantial and diverse input corpus, this phase yields approximately 130-140 new paths per million mutated test cases [32]. In the random phase, AFL applies stacked random tweaks such as bit flips, byte sets, arithmetics, and block deletion/duplication/fills. The number of tweaks to be applied is chosen as a random power of two between 1 and 64. The block size for block tweaks is also chosen at random, with a dynamic upper cap. Another less used strategy is splicing, which takes two input files (with at least two different locations) and splices them at a random point in the middle. The spliced input is then sent through a (short) run of random tweaks.
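The stacked random phase can be sketched as below; note that this is a deliberately small subset of AFL's tweak set (no block operations), with names of our choosing:

```python
import random

# Sketch of stacked random tweaks: apply 2^k in-place tweaks, with k
# chosen at random, to a single test case.
def havoc(data, rng):
    data = bytearray(data)
    for _ in range(2 ** rng.randint(0, 6)):  # 1..64 stacked tweaks
        choice = rng.randrange(3)
        pos = rng.randrange(len(data))
        if choice == 0:                      # flip one random bit
            data[pos] ^= 1 << rng.randrange(8)
        elif choice == 1:                    # set a random byte value
            data[pos] = rng.randrange(256)
        else:                                # small 8-bit add/subtract
            data[pos] = (data[pos] + rng.randint(-35, 35)) & 0xFF
    return bytes(data)

rng = random.Random(0)
out = havoc(b"hello world", rng)
print(len(out))  # 11: these particular tweaks preserve input length
```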

3.1.5 Crash reporting

AFL observes the exit code of the target process to determine whether it crashed. The target is considered crashed when it terminated because of a signal, e.g., a segmentation fault

(SIGSEGV) or an assertion failure (SIGABRT). The only exception is forceful termination (SIGKILL), which AFL itself uses to enforce execution timeouts. Crashing test cases are copied to a specific directory for further manual analysis. One important consideration is that, given a specific bug, there is usually a large number of possible test case variants that trigger the bug. Therefore, naive crash reporting runs the risk of overwhelming the human analyst with many duplicate crashes. To avoid this, AFL performs crash deduplication by not reporting a test case if it triggers the same bug as an already reported one. A crash is considered unique if its trace either includes an edge not seen in previous crashes, or does not include an edge always seen in previous crashes. To further aid in manual analysis, AFL includes an exploration mode [33], in which AFL mutates a crashing test case while trying to keep the program in a crashing state. This produces a corpus of inputs that trigger the same bug, but are slightly different from each other. The analyst can then gain a better understanding of how the input influences the program's state at the moment of the crash. For example, for a stack overflow vulnerability, this could be used to pinpoint which part of the input overwrites the return address, and thus determine the attacker's degree of control over the instruction pointer.
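The uniqueness rule can be expressed with two sets, the union and the intersection of the edge traces of all crashes reported so far (the class below is our own illustration, not AFL's data structure):

```python
# Sketch of crash deduplication: a crash is unique if its edge trace
# contains an edge no earlier crash had, or misses an edge that every
# earlier crash had.
class CrashDeduper:
    def __init__(self):
        self.seen_any = None  # union of edges over reported crashes
        self.seen_all = None  # intersection over reported crashes

    def is_unique(self, edges):
        edges = set(edges)
        if self.seen_any is None:  # very first crash is always unique
            self.seen_any, self.seen_all = set(edges), set(edges)
            return True
        new = (not edges <= self.seen_any) or (not self.seen_all <= edges)
        if new:
            self.seen_any |= edges
            self.seen_all &= edges
        return new

d = CrashDeduper()
print(d.is_unique({"A", "B"}))  # True: first crash
print(d.is_unique({"A", "B"}))  # False: same trace as before
print(d.is_unique({"A"}))       # True: misses edge B, seen in all crashes
```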

3.1.6 The forkserver

Fuzzing throughput heavily depends on execution speed, i.e., the time required to run a test case. On top of the time spent actually processing the input, there is significant overhead coming from process creation and initialization: loading the binary, fixing relocations, resolving symbols, initializing the language runtime and so forth. To reduce this overhead, AFL uses a forkserver strategy so that those operations are only performed once [34]. The idea is to execute the target, let the linker work (with lazy linking disabled), let libraries initialize, and stop before the program accesses the input (for example, at main). Then, a copy of the initialized target program is forked for each test case. This optimization can result in speed-ups of up to 2 times. In practice, this is done by injecting the forkserver into the target as part of the instrumentation. The forkserver takes control at the stop point and notifies AFL it is ready. Then, it waits for AFL to send an execution request. When received, the forkserver forks a child which continues to execute past the stop point, thus processing the current test case. The forkserver then waits on the child, and once finished sends the exit code to AFL. Finally, it loops back to waiting for the next execution request. All communication between AFL and the forkserver happens over two pipes (on fixed file descriptors 198 and 199) that are created

by AFL and inherited by the forkserver (as it is a child of AFL).
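The request/fork/report loop can be sketched in miniature as follows. This is a heavily simplified model of the protocol: real AFL uses the fixed descriptors 198/199 and a richer handshake, while here we pass freshly created pipes explicitly.

```python
import os
import struct

# Toy forkserver: for each 4-byte request on ctrl_r, fork a child that
# processes one test case, then report its wait status on status_w.
def forkserver(run_target, requests, ctrl_r, status_w):
    for _ in range(requests):
        os.read(ctrl_r, 4)          # block until an execution request
        pid = os.fork()
        if pid == 0:
            os._exit(run_target())  # child: process the test case
        _, status = os.waitpid(pid, 0)
        os.write(status_w, struct.pack("<I", status))

# Demo: play the role of AFL for a single execution request.
ctrl_r, ctrl_w = os.pipe()
status_r, status_w = os.pipe()
server = os.fork()
if server == 0:
    forkserver(lambda: 42, 1, ctrl_r, status_w)
    os._exit(0)
os.write(ctrl_w, b"\x00" * 4)       # request one execution
status, = struct.unpack("<I", os.read(status_r, 4))
os.waitpid(server, 0)
print(os.WEXITSTATUS(status))       # 42: the child's exit code
```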

3.1.7 QEMU mode

While instrumentation is the most efficient way to collect coverage information and inject the forkserver, it is only applicable to programs for which the source code is available for recompilation. However, it is possible to run AFL against a binary-only target through the QEMU mode. QEMU [35] is a CPU and system emulator. AFL includes a patched QEMU version, which can essentially instrument the program at runtime. The patches hook into QEMU's execution engine and implement reporting of edge hits to the shared map, and the forkserver. The QEMU forkserver loop is shown in Figure 3.4. The main disadvantage of QEMU mode is that it is significantly slower than native execution. One partial mitigation is that the forkserver forks QEMU itself just before it starts running the target. Therefore, QEMU's initialization code is executed only once. QEMU also employs a Just-In-Time (JIT) compiler to translate basic blocks on-the-fly to the host architecture, allowing fast execution of targets compiled for a different architecture. Since writing translators for every target-host pair would be unfeasible, QEMU adopts an extra indirection layer, the Tiny Code Generator (TCG). The TCG front-end lifts target instructions into an intermediate language, which is then lowered by the TCG back-end to native host instructions. This translation is expensive; therefore, the TCG maintains a cache of translated blocks. When QEMU executes a block that is not present in the TCG cache, the block will be translated and stored in the cache, from which it can be looked up when it is executed again. The TCG cache poses an issue when forking: the parent initially has an empty cache, which will be populated in the child. Since the child's memory space is isolated from the parent's after forking, the cache will be local to the child. Therefore, every test case will

Figure 3.4: Workflow of AFL’s QEMU mode.

run with an initially empty cache. To avoid those unnecessary translations, the AFL patches establish a pipe between parent and child. Every time the child translates a block, it informs the parent, which will also translate the same block in its own cache. Therefore, future children will already have that block cached. Thanks to these optimizations, QEMU mode has a limited time overhead of two to five times.

3.2 Unicorn

Unicorn [36] is a lightweight CPU emulator framework based on QEMU [35]. It reuses the CPU emulation component from QEMU, and is thus able to emulate all CPUs and instructions QEMU can. Unlike QEMU, which is a standalone program, Unicorn exposes this functionality as a framework, with bindings for many languages, such as C, Python, Java, Go and more. In this work, we will use its Python bindings. Unicorn offers the possibility of executing raw binary code, while QEMU only supports running executable files or whole system images. Another important feature of Unicorn is support for hooks and instrumentation: the client application can register handlers for various kinds of CPU and memory events.

3.2.1 Overview

While based on QEMU, Unicorn makes significant modifications and refactoring to its codebase. All system emulation code from QEMU has been removed, as Unicorn only focuses on CPU emulation. QEMU's CPU emulation code, however, is kept as close to the original as possible, to make synchronizing with upstream easier. Core QEMU features, such as the TCG cache (discussed in Section 3.1.7), are included in Unicorn. QEMU is distributed as a set of individually compiled emulators, one for each architecture, which share a common core. Unicorn, instead, is a single library for independent emulation of multiple architectures. The QEMU codebase was considerably refactored to support this requirement: Unicorn can emulate multiple architectures on multiple threads within the same process. The result is a framework that leverages the huge amount of work behind QEMU's CPU emulator completeness and performance to provide an API to build emulation tools on top of it. The API does not stop at straightforward emulation, but makes it possible to observe and manipulate the program's state, such as memory and CPU registers, and to respond to various kinds of instrumentation events (see Section 3.2.2). Unicorn is written in pure C, as QEMU is, which makes it easy to write bindings for many common languages.

3.2.2 Instrumentation

One of the main features of Unicorn is hooks, which allow fine-grained instrumentation of the emulated code. The programmer can register callbacks for certain events and, from within the callback, inspect the program's state and manipulate it. Modifying QEMU's codebase for instrumentation is not easy (mainly because of JIT optimizations); therefore, Unicorn is an attractive option. Hooks in the memory subsystem allow interception of memory reads, writes, and instruction fetches. They can distinguish between valid and invalid accesses, such as unmapped accesses and violations of memory protection. Other hooks allow single stepping through code, intercepting execution of code at specific addresses, being notified at the beginning of basic blocks, and handling system calls and interrupts.

3.3 AFL-Unicorn

AFL-Unicorn [6] is a fusion between AFL and Unicorn. It makes it possible to use AFL to fuzz anything that can be emulated by Unicorn. Moreover, all of Unicorn's features are accessible to the user during the fuzzing process. AFL-Unicorn can be used to emulate any piece of raw binary code while, at the same time, having powerful inspection and instrumentation capabilities.

3.3.1 Overview

AFL-Unicorn is based on AFL’s QEMU mode. The AFL forkserver and instrumentation patches for QEMU have been ported to Unicorn, and a new Unicorn mode has been added to AFL. The workflow of this mode is shown in Figure 3.5. In Unicorn mode, a driver program

Figure 3.5: Workflow of AFL-Unicorn’s Unicorn mode.

is executed by AFL. The patched Unicorn, running inside the driver, spins up the forkserver. Since Unicorn runs inside the driver process, the whole driver will be forked. When a new test case is ready, the forkserver will fork a driver child, which will read the input and place it where the target code expects it to be (e.g., by storing it in memory, or by simulating a memory-mapped I/O line). The driver starts the target emulation, listening for events such as invalid memory accesses or other crashing conditions. Once the target terminates or crashes, the forked driver returns a suitable exit code to the forkserver, which reports it to AFL, closing the loop.

3.3.2 Driver Workflow

A particularly important implementation artifact of AFL's QEMU mode, inherited by Unicorn mode, is that the forkserver is started when the first target instruction is executed, with the fork point being immediately before actually executing the instruction. In QEMU mode, this works seamlessly: each child will start from the beginning of the target, which will read the input from a file and process it. In Unicorn mode, however, the workflow is complicated by the fact that the target usually does not read input from a file, which would require manual system call emulation, given that Unicorn is strictly a CPU emulator. Instead, a typical target might read its input directly from memory. The driver is tasked with reading AFL's test case file, writing it into Unicorn memory and starting the target. Since Unicorn emulation is blocking, and the fork point is inside Unicorn, the driver has to go through a two-step process. A typical driver is illustrated in Figure 3.6. After setting up memory maps, loading the target code into Unicorn, and performing other initialization duties, the driver starts a single-instruction emulation. This "dummy" emulation has the sole purpose of starting the

Figure 3.6: Workflow of a typical AFL-Unicorn driver.

forkserver. For each test case, a driver child returns from the single-step emulation, reads the input from AFL and performs emulation of the target.

3.4 Avatar2

avatar2 [37] is an orchestration framework for dynamic analysis, with a focus on embedded firmware, written in Python. It is maintained by EURECOM and inspired by their previous Avatar framework [38]. In other words, avatar2 makes it possible to connect different tools and devices (e.g., analysis frameworks, debuggers, emulators, physical devices) and dynamically analyze a program through their combination. The main mechanism used to achieve this is state transfer, i.e., moving the execution of a program from one tool or device to another by transferring its complete state. Moreover, avatar2's capabilities include memory forwarding, which makes it possible to service I/O or peripheral memory accesses from a different device. All these features make avatar2 a useful tool for the dynamic analysis of embedded firmware. For example, it is possible to execute a firmware on a physical device up to a certain point and then transfer execution to an emulator, but have accesses to memory-mapped hardware peripherals redirected to the physical device.

3.4.1 Architecture

avatar2 is a highly extensible and modular framework, which is reflected in its design. The high-level architecture is shown in Figure 3.7. Every supported tool or device is dubbed an endpoint, and represented at a higher level by a target. Targets expose a common interface, which allows avatar2 and the client code to perform manipulations on the target context, its memory, its execution and so forth. A target does not directly talk to its endpoint. Instead, communication is handled by protocols. This way, responsibilities are clearly separated, thus improving code reusability. More specifically, there are three main protocols: execution, register, and memory. It is also possible to extend avatar2 with additional protocols, such as the remote memory protocol (omitted from Figure 3.7) to handle memory forwarding. The execution protocol handles starting and stopping code execution, performing single stepping, and insertion and deletion of breakpoints and watchpoints. The register protocol allows reading and writing CPU registers. Similarly, the memory protocol exposes functionality to read and write the endpoint's memory. Finally, the remote memory protocol is necessary to handle memory forwarding. The whole avatar2 core and API is asynchronous and event-driven. This is particularly

important when using multiple targets, as a synchronous API would be inconvenient in that case. Users can tap into the event-driven model by subscribing to and hooking events. For example, if one wanted to emulate a memory-mapped serial port, they could forward it and be notified of the remote memory read and write events, which could be handled by emulating the behavior of said port. Internally, avatar2 uses a simple event queue approach. The producer of an event pushes it into a specific queue. Each queue has a listener thread, which pulls events from the queue and dispatches them to subscribers.
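The queue-plus-listener-thread pattern can be modeled in a few lines of standard-library Python (this is our simplified model, not avatar2's actual implementation):

```python
import queue
import threading

# Toy model of queue-based event dispatch: a producer pushes events;
# a listener thread pulls them and calls every subscribed callback.
class EventQueue:
    def __init__(self):
        self.q = queue.Queue()
        self.subscribers = []
        threading.Thread(target=self._listen, daemon=True).start()

    def _listen(self):
        while True:
            event = self.q.get()
            if event is None:          # shutdown sentinel
                break
            for callback in self.subscribers:
                callback(event)
            self.q.task_done()

    def publish(self, event):
        self.q.put(event)

received = []
eq = EventQueue()
eq.subscribers.append(received.append)
eq.publish("RemoteMemoryReadMessage")  # e.g., a forwarded read request
eq.q.join()                            # wait until dispatch completes
print(received)                        # ['RemoteMemoryReadMessage']
```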

Figure 3.7: avatar2 architecture.

Theory is when "Unicorn is thread-safe by design". Practice is when "/unicorn/qemu/tcg/tcg.c:1728: tcg fatal error".

Me, after a day-long GDB session

4 AFLtar

In this work, we design and implement AFLtar, a solution for performing coverage-guided fuzzing of embedded firmware, making use of the avatar2 orchestration framework as a key component. With AFLtar, the analyst can easily set up a coverage-guided fuzzing job, combining both peripheral emulation and forwarding of I/O accesses to physical devices. Moreover, the extensive instrumentation capabilities of Unicorn can be used, for example, to quickly insert new checks, mock or hook code, and experiment with writing emulated peripherals. Traditionally, fuzzing of embedded devices is associated with a high setup cost, due to the fact that the firmware is heavily intertwined with the hardware, which can be custom and hard to emulate in an instrumented environment. AFLtar aims to tackle those issues, in order to ease fuzzing of embedded systems. We start by implementing an AFL-Unicorn target for avatar2, along with its underlying protocol and endpoint. Then, we build a coverage-guided fuzzing harness on top of the extended avatar2. The language used is Python, as it is the language avatar2 is written in. We start by describing in detail the avatar2 API (Section 4.1) and the Unicorn API (Section 4.2), as these are the foundation upon which we build this work. During development we found several bugs in Unicorn, which we reported upstream. Those issues have an influence on the design and are detailed in Section 4.3. Finally, we explain and motivate the design of AFLtar in Section 4.4.

4.1 avatar2 API

The first point of interaction for clients is the top-level avatar2 API, which we show in Section 4.1.1. In order to add an endpoint to avatar2, we need to build a protocol for it, and then a target on top of the protocol. To do so, our protocol must implement the protocol interfaces in Sections 4.1.2, 4.1.3, 4.1.4, and 4.1.5. Moreover, the target has to implement the interface described in Section 4.1.6. The target interface is also used by clients. For each method, we describe its functionality, its return value (if any), and list its parameters along with their types and default values, if any.

4.1.1 Top-level API

The top-level avatar2 API is used by clients to configure the orchestration by adding targets and memory ranges, and to perform state transfers from one target to another.

add_target Instantiates a new target and adds it to the orchestration. Returns the cre- ated target instance.

• backend: Target class to instantiate.

• *args, **kwargs: extra arguments to pass to the target’s constructor.

add_memory_range Creates a new memory region. Returns a MemoryRange object that represents the new region.

• address: integer. Base address of the region.

• size: integer. Size of the region, in bytes.

• name: string, default None. Name of the region.

• permissions: string, default “rwx”. Access permissions of the mapping (r for read, w for write, x for executable).

• file: string, default None. If specified, path to a file backing this region.

• forwarded: boolean, default False. Whether this region should be forwarded (True) or not (False).

• forwarded_to: Target, default None. Only if forwarded is True: target to forward accesses to.

transfer_state Transfer one target’s state (registers and memory) to another.

• from_target: instance of Target. Source target.

• to_target: instance of Target. Destination target.

• sync_regs: boolean, default True. Whether to synchronize registers (True) or not (False).

• synced_ranges: list of MemoryRange instances, default []. Memory ranges to syn- chronize.

Since it is an asynchronous framework, avatar2 uses a queue upon which targets and protocols place messages to be handled by the avatar2 core. Moreover, clients can listen on queue messages through a component called watchdog. The supported messages, along with any data field they carry, are:

UpdateStateMessage Request to update a target’s state.

• origin: a Target instance. The target whose state should be changed.

• new_state: a TargetState. New state to set.

BreakpointHitMessage A breakpoint was hit.

• origin: a Target instance. The target that hit the breakpoint.

• breakpoint_number: integer. Number that identifies the triggered breakpoint.

• address: integer. The breakpoint’s memory address.

RemoteMemoryReadMessage Request for a forwarded memory read. The result is passed to the send_response method of the origin's remote memory protocol (see Section 4.1.5).

• origin: a Target instance. The target that is making the request.

• id: integer. An identifier that will be passed to the remote memory protocol's send_response.

• pc: integer. Address of the instruction that caused the read.

• address: integer. Read address.

• size: integer. Size of the word to read, in bytes.

RemoteMemoryWriteMessage Request for a forwarded memory write. The result is passed to the send_response method of the origin’s remote memory protocol (see Section 4.1.5).

• origin: a Target instance. The target who is making the request.

• id: integer. An identifier that will be passed to the remote memory protocol’s send_response.

• pc: integer. Address of the instruction that caused the write.

• address: integer. Write address.

• value: integer. Word to write.

• size: integer. Size of the word to write, in bytes.

In addition to the normal queue, avatar2 also exposes a fast queue. While the normal queue can be used for any message, the fast queue is a priority lane for UpdateStateMessage only.
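The two-queue scheme can be illustrated with a small sketch (this is our own toy model, not avatar2’s actual implementation): a dispatch loop simply drains the fast queue before touching the normal queue.

```python
import queue

class Dispatcher:
    """Toy model of a dual-queue dispatcher (illustrative only):
    messages on the fast queue are always handled first."""
    def __init__(self):
        self.queue = queue.Queue()       # any message type
        self.fast_queue = queue.Queue()  # priority lane (UpdateStateMessage)

    def next_message(self):
        # Drain the fast queue before falling back to the normal queue.
        try:
            return self.fast_queue.get_nowait()
        except queue.Empty:
            return self.queue.get()
```

Under this model, an UpdateStateMessage placed on the fast queue is delivered before any message already waiting on the normal queue.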

4.1.2 Execution protocol

The execution protocol handles manipulation of the program’s control flow, such as start and stop operations and breakpoints.

continue Starts or continues execution.

stop Stops execution.

step Steps execution by one instruction, i.e., executes one instruction and stops.

set_breakpoint Inserts a breakpoint. Returns the breakpoint number as an integer.

• location: integer or string. Address or line at which to insert the breakpoint.

• hardware: boolean, default True. Whether the breakpoint should be hardware (True) or software (False).

• temporary: boolean, default False. Whether the breakpoint should be temporary, i.e., one shot (True), or persistent (False).

• regex: string, default None. If set, inserts a breakpoint matching the specified regular expression.

• condition: string, default None. If set, the breakpoint will only trigger when the condition is satisfied.

• ignore_count: integer, default 0. Number of breakpoint hits to ignore before beginning to trigger.

• thread: integer, default 0. Number of the thread on which the breakpoint should be inserted.

set_watchpoint Inserts a watchpoint. Returns the watchpoint number as an integer.

• location: integer or string. Address or variable to watch.

• write: boolean, default True. Whether to trigger the watchpoint on memory writes (True) or not (False).

• read: boolean, default False. Whether to trigger the watchpoint on memory reads (True) or not (False).

remove_breakpoint Removes a breakpoint or watchpoint.

• bkptno: integer. Number of the breakpoint or watchpoint to remove.

4.1.3 Memory protocol

The memory protocol handles manipulation of the endpoint’s memory.

read_memory Reads wordsize*num_words bytes from memory. If raw is True, they will be returned as a raw string of bytes. Otherwise, they will be returned as a list of integer words (as specified by wordsize and num_words), decoded according to the endpoint’s endianness. As a special case, if num_words is 1 the return value will be the integer word itself, instead of a one-element list.

• address: integer. Address to read from.

• wordsize: integer. Size of a read word, in bytes.

• num_words: integer, default 1. Number of words to read.

• raw: boolean, default False. Whether to read the memory as raw bytes (True), or decode it as words (False).

write_memory Writes wordsize*num_words bytes to memory. If raw is True, value will be interpreted as a raw string of bytes. Otherwise, value is interpreted as a list of integer words (as specified by wordsize and num_words). The words will be encoded according to the endpoint’s endianness before writing. As a special case, if num_words is 1 then value will be interpreted directly as the integer word, instead of a one-element list.

• address: integer. Address to write to.

• wordsize: integer. Size of a written word, in bytes.

• value: integer, list or string. Value to write.

• num_words: integer, default 1. Number of words to write.

• raw: boolean, default False. Whether to write the memory as raw bytes (True), or as words (False).
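The word decoding and encoding semantics described above can be sketched in plain Python with the struct module (the helper names and the endianness parameter are ours, not part of avatar2’s API):

```python
import struct

# Map a word size in bytes to the corresponding struct format character.
_FMT = {1: 'B', 2: 'H', 4: 'I', 8: 'Q'}

def decode_words(data, wordsize, num_words, endianness='<'):
    """Decode raw bytes into integer words ('<' little, '>' big endian)."""
    words = list(struct.unpack(endianness + _FMT[wordsize] * num_words, data))
    # Special case: a single word is returned directly, not as a list.
    return words[0] if num_words == 1 else words

def encode_words(value, wordsize, num_words, endianness='<'):
    """Encode an integer word, or a list of words, into raw bytes."""
    words = [value] if num_words == 1 else value
    return struct.pack(endianness + _FMT[wordsize] * num_words, *words)
```

For example, decoding the little-endian bytes 78 56 34 12 as one 4-byte word yields the integer 0x12345678, mirroring the num_words == 1 special case above.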

4.1.4 Register protocol

The register protocol handles manipulation of the endpoint’s CPU registers.

read_register Reads a CPU register and returns its value as an integer.

• register: string. Name of the register to read.

write_register Writes a CPU register.

• register: string. Name of the register to write.

• value: integer. Value to write.

4.1.5 Remote memory protocol

The remote memory protocol handles responses from remote (forwarded) memory operations. Whenever the endpoint makes a memory access to a forwarded region, the protocol will enqueue a RemoteMemoryReadMessage or RemoteMemoryWriteMessage in the avatar2 queue. Then, avatar2 will dispatch the remote memory request and pass the response to the only method exposed by the remote memory protocol attached to the requesting endpoint:

send_response Handles a response from a remote memory request.

• id: integer. ID specified in the request.

• value: integer. Only for read requests: read value.

• success: boolean. Whether the operation was successful (True) or not (False).

4.1.6 Target

The target interface exposes proxy methods for the execution protocol (Section 4.1.2), the memory protocol (Section 4.1.3), and the register protocol (Section 4.1.4), with exactly the same prototypes. Moreover, it also exposes a state, of type TargetState, which describes the current state of the target and can take the values CREATED, INITIALIZED, STOPPED, RUNNING, SYNCING, or EXITED. The state API is as follows:

update_state Sets the target state.

• state: TargetState. New state to set.

wait Blocks until a certain state is reached.

• state: TargetState, default STOPPED. State to wait for.

The proxies for protocol methods are decorated to ensure they can only be called while in certain states (an exception will be thrown otherwise). More specifically:

• cont, step, read_memory, write_memory, read_register, write_register, and remove_breakpoint can only be called in the STOPPED state.

• stop can only be called in the RUNNING state.

• add_breakpoint and add_watchpoint can only be called in the INITIALIZED or STOPPED states.
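The state-gating behavior can be sketched with a decorator; the names below (requires_state, TargetStateError) are illustrative and do not match avatar2’s actual decorator.

```python
import functools

class TargetStateError(Exception):
    """Raised when a proxy method is called in a disallowed state."""

def requires_state(*allowed):
    """Decorator: only allow the call while self.state is in `allowed`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            if self.state not in allowed:
                raise TargetStateError(
                    f"{func.__name__} not allowed in state {self.state}")
            return func(self, *args, **kwargs)
        return wrapper
    return decorator

class Target:
    def __init__(self):
        self.state = 'STOPPED'

    @requires_state('STOPPED')
    def read_register(self, name):
        return 0  # in the real API, this proxies to the register protocol

    @requires_state('RUNNING')
    def stop(self):
        self.state = 'STOPPED'
```

Calling read_register while the (simulated) target is RUNNING raises the exception, matching the rule that memory and register accesses are only valid in the STOPPED state.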

4.2 Unicorn API

In this section we examine the API of the Python bindings for Unicorn. Some functionality not used by our endpoint has been omitted for brevity. All API methods raise the UcError exception on error. In Section 4.2.1, we describe the interface to control execution and manipulate memory and registers. Then, in Section 4.2.2, we examine the hook API. For each method, we describe its functionality, its return value (if any), and list its parameters along with their types and default values, if any.

4.2.1 Execution, registers and memory

emu_start Starts emulation. Returns once emulation is complete.

• begin: integer. Address at which to begin emulation (initial program counter / instruction pointer).

• until: integer. Address at which to stop emulation.

• timeout: integer, default 0. If not zero, execution timeout in microseconds.

• count: integer, default 0. If not zero, number of instructions to execute.

emu_stop Stops emulation.

reg_read Reads a CPU register and returns its value as an integer.

• reg_id: integer. Architecture-specific register number (UC_*_REG_*).

reg_write Writes a CPU register.

• reg_id: integer. Architecture-specific register number (UC_*_REG_*).

• value: integer. Value to write.

mem_read Reads bytes from memory and returns them as a raw byte string.

• address: integer. Address to read from.

• size: integer. Number of bytes to read.

mem_write Writes bytes to memory.

• address: integer. Address to write to.

• data: string. Bytes to write.

mem_map Creates a new mapped region in memory.

• address: integer. Address of the mapping.

• size: integer. Size of the mapping, in bytes.

• perms: integer, default UC_PROT_ALL. Access permissions of the mapping, as an OR combination of UC_PROT_READ (read), UC_PROT_WRITE (write), and UC_PROT_EXEC (execute). For convenience, UC_PROT_NONE (no access) and UC_PROT_ALL (read, write and execute) are available.

4.2.2 Hooks

The hooks API allows the client to register callbacks for specific events. While the hook is executing, emulation is temporarily paused, and the hook is free to observe and manipulate the emulator state.

hook_add Registers a hook. Returns a handle for the hook.

• htype: integer. Hook type (one of UC_HOOK_*).

• callback: callable. Hook callback function. Prototype depends on hook type.

• user_data: any type, default None. Auxiliary data that will be passed to the callback.

• begin: integer, default 1. Meaning depends on hook type.

• end: integer, default 0. Meaning depends on hook type.

• arg1: integer, default 0. Meaning depends on hook type.

hook_del Removes a hook.

• h: handle of the hook to remove, as returned by hook_add.

We now describe the types of hooks supported by Unicorn, and the meanings of the type-dependent hook_add parameters for each one.

• Interrupt hooks: type constant UC_HOOK_INTR. Triggers on software-generated interrupt instructions between the addresses begin and end. The arg1 parameter is unused.

• Instruction hooks: type constant UC_HOOK_INSN. Triggers when the instruction specified by arg1 (UC_*_INS_*) is executed at addresses between begin and end.

• Code hooks: type constant UC_HOOK_CODE. Triggers when any instruction is executed at addresses between begin and end. The arg1 parameter is unused.

• Block hooks: type constant UC_HOOK_BLOCK. Triggers when execution reaches the beginning of any basic block between the addresses begin and end. The arg1 parameter is unused.

• Memory hooks: have several subtypes, grouped under the constants UC_HOOK_MEM_*. Triggers when a memory access (depending on the subtype) is made to an address between begin and end. The arg1 parameter is unused.

As stated above, memory hooks actually have several subtypes that define what kind of memory access should trigger the hook. The subtypes can be combined by OR, producing a hook type that will trigger when the trigger conditions for any of the ORed subtypes are satisfied. There are a number of predefined OR combinations for convenience, which we omit for brevity. The subtypes are as follows:

• Valid access hooks: type constant UC_HOOK_MEM_X, with X being one of READ, WRITE, FETCH, or READ_AFTER. Triggers on, respectively, read, write, or instruction fetch accesses to mapped memory that do not violate the memory’s protection flags, or after read accesses that satisfy such conditions.

• Unmapped access hooks: type constant UC_HOOK_MEM_X_UNMAPPED, with X being one of READ, WRITE, or FETCH. Triggers on, respectively, read, write, or instruction fetch accesses to unmapped memory.

• Protection violation hooks: type constant UC_HOOK_MEM_X_PROT, with X being one of READ, WRITE, or FETCH. Triggers on, respectively, read, write, or instruction fetch accesses to mapped memory that violate the memory’s protection flags.

To complete the hook API, we describe the prototypes of callbacks, which depend on the hook type. The only hooks used in our work are code and memory hooks. Therefore, for brevity, we omit prototypes for other kinds of hooks.

Code hook callback prototype

• uc: instance of the Unicorn object that triggered the hook.

• address: integer. Address of the instruction that triggered the hook.

• size: integer. Size of the instruction.

• user_data: any type. Auxiliary data specified when registering the hook.

Memory hook callback prototype For invalid access hooks (unmapped, protection violation), returns a boolean indicating whether emulation should be resumed (True) or stopped (False). No return value for valid access hooks.

• uc: instance of the Unicorn object that triggered the hook.

• access: integer. Type of memory access that triggered the hook (one of UC_MEM_*).

• address: integer. Address of the memory access.

• size: integer. Size of the memory access.

• value: integer. Only valid for memory writes: value that is being written into memory.

• user_data: any type. Auxiliary data specified when registering the hook.

4.3 Unicorn bugs

During the development of AFLtar, we found three significant bugs in Unicorn. While we reported them upstream, it would have taken too long to wait for them to get fixed. Moreover, they are hard to fix without extensive and detailed knowledge of Unicorn’s and QEMU’s internals. Therefore, a fix on our end would have required significant investment

in terms of time and effort. However, we found workarounds for these bugs that could be fitted into our design. We decided to use these workarounds until upstream fixes become available. As such, the workarounds have to be taken into account when designing the software.

4.3.1 Issue A: wrong PC after stopping from hook

Description On the ARM architecture, Unicorn does not correctly update the PC (program counter) register after stopping from a hook. The PC should be set to the address of the instruction that triggered the hook. Instead, it is set to the beginning of the basic block containing that instruction.

Impact Clients that, after stopping, rely on the PC to identify where the program stopped will behave incorrectly.

Cause This is likely caused by an interaction between Unicorn and QEMU’s basic block translation mechanism. Other architectures (such as x86) where the program counter is updated correctly have additional code to handle PC recovery within a basic block. Certain contextual information that would allow a quick fix seems to be missing in the ARM code.

Workaround For code hooks, the PC register has the correct value within the hook. Therefore, the hook can save the PC to a variable shared with the code that started emulation, which can then fix the PC value after returning from emulation. For memory hooks, the PC is incorrect even within the hook. There are no known workarounds for this case.
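The workaround can be illustrated with a pure-Python stand-in for the buggy emulator (all names and addresses here are hypothetical; the Emulator class merely simulates Unicorn’s incorrect PC rollback on ARM):

```python
class Emulator:
    """Stand-in for buggy Unicorn/ARM: when stopped from a hook, it
    rolls the PC back to the start of the current basic block."""
    def __init__(self, block_start):
        self.pc = block_start
        self.block_start = block_start

    def run(self, hook):
        # Walk four 4-byte instructions of one basic block.
        for insn_addr in range(self.block_start, self.block_start + 16, 4):
            self.pc = insn_addr
            if not hook(insn_addr):          # hook asked to stop
                self.pc = self.block_start   # the bug: PC rolled back
                return

stop_pc = None  # shared between the hook and the caller (the workaround)

def code_hook(address):
    global stop_pc
    if address == 0x1008:    # pretend a breakpoint sits here
        stop_pc = address    # save the correct PC before stopping
        return False
    return True

emu = Emulator(0x1000)
emu.run(code_hook)
if stop_pc is not None:      # fix up the PC after emulation returns
    emu.pc = stop_pc
```

After the fixup, emu.pc holds the breakpoint address 0x1008 rather than the block start 0x1000, which is exactly what the real workaround achieves for code hooks.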

4.3.2 Issue B: cannot stop from different thread while in hook

Description Calls to emu_stop are ignored while a hook is executing if they are not made from the thread that is executing the hook.

Impact Clients that rely on such mechanics will not be able to stop emulation.

Cause Unknown.

Workaround Avoid the scenario that triggers the bug. While executing a hook, only stop emulation from within the hook.

4.3.3 Issue C: crash when stopping from different thread while in hook

Description Occasionally, when calling emu_stop while a hook is executing in a different thread, QEMU will crash or report a fatal error.

Impact Clients that rely on such mechanics will crash or fail with an error.

Cause Unknown. Given that the issue is random and hard to reproduce, we suspect a race condition. Specifically, the errors suggest that the TCG code is involved. This is likely not a bug in the QEMU core itself, but in the way Unicorn uses it.

Workaround Avoid the scenario that triggers the bug. While executing a hook, only stop emulation from within the hook.

4.4 Design

There are four fundamental parts in this system: the target, the protocol, the endpoint, and the fuzzer. We use the singular term “protocol” because we implement it as the composition of multiple protocols, similarly to avatar2’s GDB protocol, which combines the execution, memory, and register protocols*. The general architecture is shown in Figure 4.1. The fuzzing driver rests on top of avatar2, which communicates with a separate AFL-Unicorn process. The AFL-Unicorn process hosts the forkserver, and communicates with an AFL instance. The most complex component of the design is the protocol itself. Its purpose is managing communication between avatar2 and the endpoint by translating avatar2’s generic interface into actual operations on the endpoint. Our end goal is communicating with AFL-Unicorn, whose interface is exactly the same as Unicorn’s. Therefore, we decided to design a generic Unicorn protocol that is compatible both with Unicorn and AFL-Unicorn, in case an analyst would like to use Unicorn for non-fuzzing purposes. We already contributed a preliminary version of the Unicorn protocol and target to mainline avatar2†. In some cases, e.g., when interfacing to a debugger, a protocol is merely an I/O layer. In the Unicorn case, it contains more logic because the avatar2 and Unicorn abstractions are significantly different. An example is breakpoint handling: while avatar2 exposes classic

*https://github.com/avatartwo/avatar2/blob/master/avatar2/protocols/gdb.py
†https://github.com/avatartwo/avatar2/blob/master/avatar2/protocols/unicorn_protocol.py

Figure 4.1: General AFLtar architecture.

“debugger-style” insert/delete operations, Unicorn offers generic hooks, upon which one has to implement actual breakpoints. We decided that such logic should be implemented on the avatar2 protocol side, instead of in the Unicorn endpoint. One important point is that AFL-Unicorn must be run in a separate process. From Section 3.3, we recall that AFL-Unicorn implements the forkserver mechanism of AFL. Therefore, the process that contains AFL-Unicorn must be able to fork. Unfortunately, avatar2 is heavily multithreaded because of its asynchronous nature. Forking a multithreaded process is problematic, therefore we decided to isolate AFL-Unicorn in its own process, which will be spawned by the protocol. Another important consideration is the necessity of an event mechanism that the endpoint can use to send notifications to avatar2 (e.g., emulation start/stop, breakpoints, faults). Such events would normally be asynchronous. However, since we decided to keep all the hook logic on the avatar2 side, and because of Issues B and C (Sections 4.3.2, 4.3.3), we need the avatar2 side to also be able to send a response back to the endpoint (e.g., deciding whether to stop or not after a hook). Therefore, the event system must also support synchronous events with responses. To avoid ambiguity, we define the following terms:

• Unicorn endpoint: a process which runs the Unicorn library and acts as the avatar2 endpoint.

• Asynchronous event: an event message, sent in a non-blocking way from the Unicorn endpoint to the avatar2 process.

• Synchronous event: an event message, sent from the Unicorn endpoint to the avatar2 process, which blocks the Unicorn endpoint until the avatar2 process sends back a response.

The requirements on the endpoint and protocol are formalized as follows:

1. The Unicorn endpoint must run in a process isolated from avatar2.

2. The Unicorn endpoint must be able to fork cleanly.

3. The protocol must implement avatar2’s protocol interface.

4. The protocol must be able to communicate with a Unicorn endpoint.

5. The protocol must be able to interact with the Unicorn API on the Unicorn endpoint.

6. The protocol must handle delivery of asynchronous events from the Unicorn endpoint to the avatar2 process.

7. The protocol must handle delivery of synchronous events from the Unicorn endpoint to the avatar2 process, and of corresponding responses from the avatar2 process to the Unicorn endpoint.

8. The protocol must not trigger the issues detailed in Section 4.3.

4.4.1 Message passing

For handling both remote API calls and events, we designed a simple message passing system, shown as a sequence diagram in Figure 4.2. It supports both synchronous and asynchronous messages. The two parties are named sender and recipient, and they reside in different processes. To deliver a message to the recipient, the sender passes it to a channel abstraction in its process. The channel takes care of serializing the message and sending it over a pipe between the two processes, thus hiding the channel logic from the sender. On the recipient side, a listener thread waits for data on the pipe. When a message arrives, it is deserialized and finally passed on to the recipient’s message handler. What happens next depends on whether the message is asynchronous or synchronous. If it is asynchronous, the delivery sequence is complete. If the message is synchronous, the listener waits for a response from the recipient. Then, it serializes and sends the response back to the sender process over the pipe. Meanwhile, on the other side, the channel is waiting for the response. Once it arrives, it is deserialized and returned to the sender.

Figure 4.2: Sequence diagram for Unicorn protocol message passing.

Remote Unicorn API calls from the protocol to the endpoint are synchronous messages, with the API’s return value (or None, if there is no return value) as response. While there is an almost one-to-one correspondence between the messages and the Unicorn API, some changes were made to simplify the protocol. The message types, along with any additional parameters, are as follows:

UnicornContinueMessage Starts or continues execution from the current program counter.

• single_step: boolean. Whether to perform a single step (True) or continued execution (False).

• sync: boolean. Whether to perform synchronous (True) or asynchronous (False) execution. Forkserver start for AFL-Unicorn must be done in synchronous mode. See Section 4.4.4 for more details.

UnicornStopMessage One-to-one proxy for emu_stop.

UnicornMemReadMessage One-to-one proxy for mem_read.

UnicornMemWriteMessage One-to-one proxy for mem_write.

UnicornRegReadMessage One-to-one proxy for reg_read.

UnicornRegWriteMessage One-to-one proxy for reg_write.

UnicornMemMapMessage One-to-one proxy for mem_map.

UnicornHookAddMessage Registers a hook. Returns a handle for the hook.

• htype: same as htype for hook_add.

• user_data: any type. Auxiliary data that will be passed in the hook event.

• begin: same as begin for hook_add.

• end: same as end for hook_add.

UnicornHookDelMessage Removes a hook.

• h: handle of the hook to remove, as returned by UnicornHookAddMessage.

Synchronous and asynchronous events are delivered, respectively, as synchronous or asyn- chronous messages. The following list describes the events, along with their parameters (if any):

UnicornStartEvent Asynchronous. Emulation has started.

UnicornStopEvent Asynchronous. Emulation has stopped.

UnicornCodeHookEvent Synchronous. A code hook has been triggered. Same parameters as the code hook callback prototype in Section 4.2.2, except for uc. Expects a boolean response: True to continue execution, False to stop it.

UnicornMemHookEvent Synchronous. A memory hook has been triggered. Same parameters as the memory hook callback prototype in Section 4.2.2, except for uc. Expects a boolean response: True to continue execution, False to stop it.
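The sender/channel/listener/recipient scheme described above can be sketched as follows. This is a simplified single-process model (the two pipe ends live on separate threads here, whereas in AFLtar they are in separate processes), and all names are illustrative:

```python
import threading
from multiprocessing import Pipe

class Channel:
    """Sender-side abstraction: serializes messages over a pipe and,
    for synchronous messages, blocks until the response comes back."""
    def __init__(self, conn):
        self.conn = conn

    def send(self, message, synchronous=False):
        self.conn.send((message, synchronous))
        if synchronous:
            return self.conn.recv()  # block until the handler responds

def listener(conn, handler):
    """Recipient-side thread: deserialize messages, invoke the handler,
    and send a response back for synchronous messages."""
    while True:
        message, synchronous = conn.recv()
        if message is None:          # shutdown sentinel
            break
        result = handler(message)
        if synchronous:
            conn.send(result)

sender_end, recipient_end = Pipe()
channel = Channel(sender_end)
t = threading.Thread(target=listener,
                     args=(recipient_end, lambda m: m.upper()))
t.start()

channel.send("async event")                          # fire and forget
reply = channel.send("hook event", synchronous=True)  # blocks for reply
channel.send(None)                                    # stop the listener
t.join()
```

The synchronous path is what carries remote Unicorn API calls and hook events in AFLtar; the asynchronous path carries notifications such as start/stop events.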

4.4.2 Hooks

As explained in Section 3.2.2 and further detailed in Section 4.2.2, Unicorn allows clients to register hooks to instrument the emulator. Hooks are important for the avatar2 Unicorn protocol, as they are used to provide functionality such as breakpoints. Hooks are triggered by Unicorn within the endpoint, but are handled by the protocol over the event mechanism. This process is shown in Figure 4.3. When the protocol wishes to register a hook, it sends a UnicornHookAddMessage to the endpoint, specifying the hook’s details, such as type, address, and auxiliary attached data. Normally, a callback is passed to Unicorn when registering a hook, so that Unicorn will invoke the callback when the hook is triggered. Since the protocol (which contains the callback) and the endpoint (where the hook is triggered) are in different processes, this is replaced by an event-based mechanism. When the hook is triggered, the endpoint sends a synchronous event (UnicornCodeHookEvent or UnicornMemHookEvent) to the protocol. The message includes all the hook information normally provided by Unicorn, along with the auxiliary data provided by the protocol at registration time. The protocol will handle the hook, and send back a boolean response which indicates whether the endpoint should stop emulation after the hook, or let it continue. Note that while the decision to stop is made by the hook, running inside the protocol, the actual API call to stop emulation is made from the endpoint. The decision is passed as a boolean response to the event sent by the endpoint. This is needed to work around Issues B and C (Sections 4.3.2, 4.3.3).

Figure 4.3: Flow chart for Unicorn hook handling.

45 4.4.3 Breakpoint and watchpoint handling

The main use of hooks in our Unicorn protocol is breakpoint handling, shown in Figure 4.4. When the protocol receives a hook event corresponding to a breakpoint (the auxiliary data is used to track the breakpoint number), it first checks whether that breakpoint is in the pending set, i.e., the set of breakpoints that have already been handled since the last stop. If the breakpoint is pending, execution will be resumed. This is necessary to ensure that, when resuming from a breakpoint, the same breakpoint is not triggered again. Furthermore, breakpoints can have an ignore count, i.e., a number of times they should be ignored before triggering. Therefore, if the ignore count is greater than zero, it is decremented and execution is resumed. If the breakpoint is not pending or ignored, then it triggers: it is added to the pending set, and a breakpoint event is sent into avatar2’s fast queue. If the breakpoint is temporary, it is removed. As part of the workaround for Issue A (Section 4.3.1), we also save the current PC for later access. Finally, the hook stops emulation. Ideally, a very similar mechanism could be used to implement watchpoints, with a memory hook in place of the code hook. Unfortunately, Issue A (Section 4.3.1) has no known workaround for memory hooks. Therefore, we would not be able to resume execution at the right place after a watchpoint triggers. For this reason, watchpoints are not supported at the moment.

Figure 4.4: Flow chart for the breakpoint hook.
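The decision logic of the breakpoint hook can be sketched in plain Python (a sketch with illustrative names, not the actual AFLtar code; returning True means continue emulation, False means stop):

```python
class BreakpointState:
    """Bookkeeping shared between breakpoint hook invocations."""
    def __init__(self):
        self.pending = set()       # breakpoints handled since the last stop
        self.ignore_counts = {}    # breakpoint number -> remaining ignores
        self.temporary = set()     # one-shot breakpoints
        self.stop_pc = None        # saved PC (Issue A workaround)

def breakpoint_hook(state, bkptno, pc, notify):
    """Mirror of Figure 4.4: decide whether to stop at this breakpoint."""
    if bkptno in state.pending:            # already handled: skip it
        return True
    if state.ignore_counts.get(bkptno, 0) > 0:
        state.ignore_counts[bkptno] -= 1   # ignored this time
        return True
    state.pending.add(bkptno)
    notify(bkptno, pc)                     # breakpoint event, fast queue
    if bkptno in state.temporary:
        state.temporary.discard(bkptno)    # one-shot: remove it
    state.stop_pc = pc                     # save PC for the Issue A fixup
    return False                           # stop emulation
```

With an ignore count of one, the first hit is skipped, the second triggers (notifying avatar2 and saving the PC), and a third hit while still pending is skipped again.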

4.4.4 Emulation

The emulation start operation, initiated by the UnicornContinueMessage from the protocol to the endpoint, is the most complex of all. It has two modes: synchronous and asynchronous (chosen through the sync boolean parameter in the message). Normally, emulation will be performed in asynchronous mode, i.e., on a separate thread within the endpoint. However, when using AFL-Unicorn, the first execution will cause the forkserver to start. In this specific case, the endpoint must be able to fork cleanly, which would not be possible when the endpoint is running multiple threads. Therefore, synchronous mode runs the emulation on the listener thread itself, which is also the main endpoint thread. The emulation process starts with the protocol, which can be asked by the target to begin emulation using the cont and step methods. Both methods accept an extra optional parameter sync from the target, which specifies whether emulation should be synchronous or asynchronous. Therefore, all emulation requests from the target can be described by two properties: whether they are single stepping or continuing execution, and whether they are asynchronous or synchronous. This is the starting point of Figure 4.5, which describes the protocol side of the emulation process. First, the protocol checks the pending breakpoint set. If there are no pending breakpoints

Figure 4.5: Flow chart for emulation start (protocol side).

(i.e., no breakpoints have been handled since the last stop), then the request can be immediately dispatched into a UnicornContinueMessage to the endpoint. Otherwise, the protocol is resuming execution after one or more breakpoints, which have to be skipped as they would re-trigger when continuing. From Section 4.4.3, we recall that the breakpoint hook will skip pending breakpoints, and will add a breakpoint to the pending set when triggered. Therefore, the protocol makes a copy of the pending set and performs a synchronous single step. If the pending set changed, it means another breakpoint was triggered. In this case, emulation is done, as the endpoint has stopped again. If the pending set did not change, then the single step did not trigger any additional breakpoints, and emulation is free to proceed forward. Since the emulator has already been stepped, a single-step request is complete at this point. If the request was for continued execution, the protocol sends a proper UnicornContinueMessage to the endpoint.
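This protocol-side resume algorithm can be sketched as follows. FakeEndpoint stands in for the real message-based endpoint interface, and all names are illustrative:

```python
def start_emulation(endpoint, pending, single_step=False, sync=False):
    """Resume logic from the protocol side (illustrative sketch)."""
    if not pending:
        endpoint.emulate(single_step, sync)   # no breakpoints to skip
        return
    before = set(pending)
    endpoint.emulate(True, True)              # synchronous single step
    if pending != before:
        return                                # a new breakpoint triggered
    pending.clear()
    if not single_step:                       # step requests are done now
        endpoint.emulate(False, sync)         # proceed with continued run

class FakeEndpoint:
    """Records emulate() calls; stands in for the message channel."""
    def __init__(self):
        self.calls = []
    def emulate(self, single_step, sync):
        self.calls.append(('step' if single_step else 'run', sync))

pending = {7}                # breakpoint 7 was handled since the last stop
ep = FakeEndpoint()
start_emulation(ep, pending, single_step=False, sync=False)
```

In this run, the protocol first issues a synchronous single step to get past breakpoint 7, sees that no new breakpoint triggered, clears the pending set, and only then sends the continued-execution request.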

On the endpoint side, the process begins when a UnicornContinueMessage is received, as shown in Figure 4.6. First, the endpoint reads the program counter from the underlying Unicorn instance. On the ARM architecture, there are two modes of execution: classic ARM and Thumb. Since instructions have a fixed 2-byte or 4-byte size and their addresses are aligned to the instruction size, the LSB of the instruction address is unused. Therefore, ARM uses the LSB of the PC register to signal the current execution mode. Since emu_start requires a start address, which will overwrite the PC, the current PC needs to be fixed up with the correct mode bit before being passed to emu_start. However, the PC read from the Unicorn state does not indicate the current mode (its LSB is always cleared). As such, the endpoint reads the CPSR (Current Program Status Register) to obtain the current execution mode, and fixes up the PC accordingly. Now that the current PC is correct

Figure 4.6: Flow chart for emulation start (endpoint side).

for all architectures, the endpoint sends a UnicornStartEvent to the protocol and begins the real emulation process. In the case of asynchronous emulation, this runs in a new thread. For synchronous emulation, it runs on the listener thread to support forking. To perform emulation, the endpoint calls emu_start. Once emulation is done and emu_start returns, the endpoint checks whether a stop PC was set by a breakpoint hook (see Section 4.4.3). We recall that the stop PC is a workaround for Issue A (Section 4.3.1). If the stop PC is set, the endpoint sets the current PC (which could be incorrect due to Issue A) to the stop PC, and clears the latter. Finally, in any case, the endpoint sends a UnicornStopEvent to the protocol, concluding the emulation process.
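The Thumb-bit fixup can be sketched as a small helper, assuming the standard ARM CPSR layout in which bit 5 is the T (Thumb state) flag; the helper name is ours:

```python
CPSR_T_BIT = 1 << 5  # Thumb state bit in the ARM CPSR

def fixup_start_pc(pc, cpsr):
    """Return the PC to pass to emu_start: when the CPSR indicates
    Thumb mode, the (otherwise unused) LSB of the PC must be set."""
    if cpsr & CPSR_T_BIT:
        return pc | 1
    return pc
```

For a PC of 0x1000, the helper returns 0x1001 when the T bit is set and 0x1000 otherwise, which is exactly the mode encoding emu_start expects.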

4.4.5 Memory forwarding

The avatar2 API offers support for memory forwarding between targets. When mapping a memory region during target initialization, the client can specify that accesses to the region must be handled by a specific target. This allows, for example, running code that performs memory-mapped I/O in an emulator while redirecting accesses to peripherals to a physical device, thus removing the need to implement an emulated peripheral. The generic memory forwarding process in avatar2 has been described in Sections 4.1.1 and 4.1.5. During initialization, the Unicorn protocol scans the memory ranges defined by the client. For forwarded memory, the protocol registers a memory hook to intercept accesses to the range. Moreover, also during initialization, the protocol creates an internal queue, the RMP queue, which is used to deliver memory forwarding responses. The memory hook handler, depending on the access type, wraps the access into either a RemoteMemoryReadMessage or a RemoteMemoryWriteMessage, which is then put into the main avatar2 message queue. Then, the hook blocks while waiting for the result to be pushed into the RMP queue. If the access was a read, the result will include the value read from the remote target, and the protocol will write this value to the emulator memory, so that the read (which happens after the hook) will get the remote value. In any case, the result also includes a boolean indicating whether the forwarding was successful or not. Once avatar2 handles the remote memory request by dispatching it to the correct target, it responds to our protocol by calling its send_response method (part of the remote memory protocol interface). This method simply pushes the result into the RMP queue, thus unblocking the memory hook.
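The interplay between the blocking memory hook and the RMP queue can be sketched with Python queues, with a worker thread standing in for avatar2 dispatching the request to the remote target (an illustrative model, not the actual implementation; the forwarded value 0x42 is made up):

```python
import queue
import threading

avatar_queue = queue.Queue()  # stand-in for the main avatar2 queue
rmp_queue = queue.Queue()     # internal queue for forwarding responses

def send_response(req_id, value, success):
    """Remote memory protocol interface: unblocks the memory hook."""
    rmp_queue.put((req_id, value, success))

def memory_hook_read(pc, address, size):
    """Forwarded read: enqueue the request, then block on the RMP queue."""
    avatar_queue.put(('RemoteMemoryReadMessage', 1, pc, address, size))
    _, value, success = rmp_queue.get()   # the hook blocks here
    return value if success else None

def avatar_dispatcher():
    """Stand-in for avatar2: forward the request to the remote target."""
    _, req_id, pc, address, size = avatar_queue.get()
    send_response(req_id, 0x42, True)     # pretend the device returned 0x42

t = threading.Thread(target=avatar_dispatcher)
t.start()
value = memory_hook_read(pc=0x1000, address=0x40000000, size=4)
t.join()
```

The hook thread blocks inside rmp_queue.get() until send_response is called, which mirrors how the real memory hook waits for avatar2 to complete the forwarded access.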

4.4.6 Additions to the standard API

The Unicorn protocol and target adhere to the standard avatar2 API. This interface must be the common denominator. However, protocols and targets are allowed to add functionality as either new methods, or additional optional parameters to existing methods. We added a boolean sync parameter, with default value False, to the start method in the target and protocol. This parameter specifies whether emulation should be synchronous (True) or asynchronous (False). We also added a new terminate_endpoint method to the target and protocol, which terminates the endpoint process with a status code specified as the only parameter. This method is used in the fuzzing driver (Section 4.4.7). These additions are proxied by the target to the protocol.
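The two additions and the target-to-protocol proxying could look roughly like this (the classes are illustrative stand-ins, not the real avatar2 target and protocol):

```python
class UnicornProtocolSketch:
    """Stand-in for the protocol; records the calls it receives."""
    def __init__(self):
        self.calls = []

    def start(self, sync=False):
        self.calls.append(("start", sync))

    def terminate_endpoint(self, status):
        self.calls.append(("terminate", status))

class UnicornTargetSketch:
    """Stand-in for the target; proxies the additions to the protocol."""
    def __init__(self, protocol):
        self.protocol = protocol

    def start(self, sync=False):
        # sync=False keeps the standard asynchronous behavior by default,
        # so existing clients of the avatar2 API are unaffected.
        self.protocol.start(sync=sync)

    def terminate_endpoint(self, status):
        self.protocol.terminate_endpoint(status)
```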

4.4.7 Fuzzing driver

The fuzzing driver is a script that actually brings together all the new functionality added to avatar2 and executes fuzzing. What the driver does depends on the specific firmware that is being fuzzed. However, a basic guideline is:

1. Create targets and memory regions.

2. Optionally, execute the firmware initialization on a physical device and then transfer the state to an AFL-Unicorn target.

3. Start the AFL-Unicorn forkserver by executing one instruction in synchronous mode.

4. Feed AFL inputs to the forkserver children.

Every time AFL produces a new input and asks the forkserver to fork, the child will return from the synchronous emulation at point 3. Since all communication between protocol and endpoint happens over pipes, the child will inherit the communication channel from the forkserver. In this situation, the actual avatar2 endpoint will be the child, as the parent is blocked inside the synchronous emulation. Since the child finished emulation, it will produce a stop event, which the driver can detect. Therefore, the driver knows when a new child is ready, and can read AFL's input file and place the data where the firmware expects it (e.g., memory, or UART bus) before continuing emulation. When execution ends or causes a crash, the driver uses the terminate_endpoint method to make the child exit with a specified status code, which signals the test case result (crash or not) to AFL.
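The per-test-case loop described above can be skeletonized as follows. The target object and its wait_for_stop_event/continue_until_end methods are hypothetical stand-ins for the real avatar2 target interactions, as are the two status codes; terminate_endpoint comes from Section 4.4.6.

```python
CRASH_STATUS, OK_STATUS = 1, 0  # hypothetical exit codes reported to AFL

def fuzz_loop(target, read_afl_input, place_input, iterations):
    """Driver loop: one iteration per forkserver child."""
    statuses = []
    for _ in range(iterations):
        # The parent is blocked in synchronous emulation; every fork
        # resumes here in a fresh child that inherited the pipes, and
        # its stop event tells the driver the child is ready for input.
        target.wait_for_stop_event()
        place_input(read_afl_input())   # e.g. into memory or a UART buffer
        crashed = target.continue_until_end()
        status = CRASH_STATUS if crashed else OK_STATUS
        target.terminate_endpoint(status)  # signal the result to AFL
        statuses.append(status)
    return statuses
```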

Since AFL-Unicorn runs inside a child of avatar2, the avatar2 process itself must be a child of AFL, in order for AFL-Unicorn to be able to access the AFL pipes via file descriptor inheritance. While starting the fuzzing driver from AFL works, we would like to launch AFL from within the driver. Therefore, we use a small helper script, which shares the AFL pipe descriptors with the avatar2 process. Now we can run AFL from within the driver, with the helper as AFL's child. Since the forkserver shares the child PID with AFL via the pipes, AFL is then able to monitor the child's exit status.
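The text does not detail the sharing mechanism; one generic way to hand file descriptors to an already-running process is SCM_RIGHTS passing over a Unix socket, sketched here with Python's socket.send_fds/recv_fds (available since Python 3.9). In stock AFL, the forkserver descriptors are 198 (control) and 199 (status).

```python
import socket

def share_fds(sock, fds):
    """Helper side: pass the AFL pipe descriptors to the driver process."""
    socket.send_fds(sock, [b"afl-fds"], fds)

def receive_fds(sock, maxfds):
    """Driver side: receive duplicates of the descriptors from the helper."""
    _, fds, _, _ = socket.recv_fds(sock, 32, maxfds)
    return list(fds)
```

The receiving process gets its own duplicates of the descriptors, so both the helper (AFL's child) and the avatar2 process end up with working handles on the same pipes.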

“Regression testing”? What’s that? If it compiles, it is good; if it boots up, it is perfect.

Linus Torvalds, 1998

5 Evaluation

In this chapter we evaluate AFLtar’s performance. Due to time constraints, we were unable to perform real-world testing of the fuzzer. Therefore, the experiment consists of verifying that the fuzzer works on a test firmware and measuring its performance. In Section 5.1 we describe the experimental setup and methodology used to obtain the results presented in Section 5.2. Then, in Section 5.3, we analyze, motivate and discuss the results. Finally, we present starting points for future work in Section 5.4.

5.1 Experiment design

The physical experimental setup is shown in Figure 5.1. For the embedded device, we use an STMicroelectronics NUCLEO-L152RE development board. It is built around an STM32L152RE microcontroller, which contains a 32-bit ARM Cortex-M3 core with 512kB of flash and 80kB of RAM, along with many peripherals. The board also provides an ST-LINK/V2-1 interface for programming and debugging. For the firmware, we reuse a simple program from WYCINWYC [28] that receives an XML document via the UART interface and parses it with the Expat [7] XML parsing library. This simulates an embedded device handling untrusted data. Expat is patched as in [28] to introduce artificial bugs. While Expat is available for normal computers and can be fuzzed in traditional ways, it is also often integrated into embedded firmware, and thus an interesting and realistic target.

Figure 5.1: Experimental hardware. On the right, the STMicroelectronics NUCLEO-L152RE board, which integrates an STM32L152RE microcontroller (bottom) and an ST-LINK/V2-1 programming and debugging interface (top). On the left, an RS232-USB converter based on the FTDI FT232RL chip, connected to the microcontroller’s UART interface.

To communicate with the UART interface of the STM32 microcontroller, we use an RS232-USB adapter based on the FTDI FT232RL chip. This adapter makes the UART interface appear as a serial port on the computer. Moreover, we connect a host-controllable RS232 line (DTR) to the microcontroller’s reset line, in order to programmatically reset it. On the computer side, we use a driver that follows the general guideline of Section 4.4.7. The driver sets up two avatar2 targets: an OpenOCD target which connects to the physical board via the ST-LINK/V2-1 interface, and an AFL-Unicorn target. Initially, the driver resets the microcontroller and sets a breakpoint to stop just before the firmware begins reading the input. Then, it transfers the physical device’s state to the Unicorn target. The driver includes support for three ways to handle UART communication, which are typical strategies for handling peripheral interaction:

• Forwarded UART: the UART peripheral I/O memory is forwarded to the physical device, and the driver talks to the device via the RS232-USB interface.

• Emulated UART: the UART peripheral is emulated by hooking accesses to its I/O memory with Unicorn.

• Hooked UART: the driver hooks I/O functions in the firmware via Unicorn to communicate without any access to the UART peripheral I/O memory.

Moreover, the driver implements a simple memory allocator that can be used in place of the firmware-supplied one by hooking the allocation functions (indicated as hooked heap). The driver’s allocator supports the OOB check mode, which instruments memory accesses to detect out-of-bounds reads and writes to heap chunks. Such violations are reported to the fuzzer as crashes. We test AFLtar in the following configurations:

• Forwarded UART.

• Emulated UART.

• Hooked UART, hooked heap.

• Hooked UART, hooked heap, OOB check.
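The driver's OOB check described above could be sketched along these lines: each live chunk records its bounds, and a Unicorn-style memory-access hook flags heap accesses that fall outside every chunk. The allocator, the red-zone size, and the hook interface are illustrative, not the actual driver code.

```python
class OOBError(Exception):
    """Raised on an out-of-bounds heap access; reported to AFL as a crash."""

class CheckedHeap:
    def __init__(self, base):
        self.next = base
        self.chunks = {}  # base address -> chunk size

    def malloc(self, size):
        addr = self.next
        self.chunks[addr] = size
        self.next += size + 16  # gap between chunks (hypothetical red zone)
        return addr

    def free(self, addr):
        del self.chunks[addr]

    def check_access(self, addr, size):
        """Called from a memory hook for accesses inside the heap region."""
        for base, length in self.chunks.items():
            if base <= addr and addr + size <= base + length:
                return  # fully inside a live chunk: in bounds
        raise OOBError(f"out-of-bounds heap access at {addr:#x}")
```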

The test machine runs Fedora Core 28 on an Intel i7-6700HQ CPU clocked at 3.0 GHz, with 16GB of DDR4 RAM running at 2133 MT/s. All tests use the CPython interpreter, a single AFL thread, and run for 30 minutes. For each experiment, we measure the number of total executions (test cases tried by the fuzzer) and the total number of discovered paths.

5.2 Results

Table 5.1 shows, for each experiment, the total number of test cases executed by the fuzzer, the average execution speed, and the total paths discovered. The average execution speed is calculated from the total number of executions and the duration of the experiment (30 minutes). No crashes were found. Additionally, in Figure 5.2 we compare the number of discovered paths with the total executions.

Experiment                           Total executions   Executions/s (avg.)   Total paths
Forwarded UART                       5.6k               3                     53
Emulated UART                        18.1k              10                    202
Hooked UART, hooked heap             60.1k              33                    463
Hooked UART, hooked heap, OOB check  51.1k              28                    445

Table 5.1: Experimental results.

5.3 Discussion

The results in Table 5.1 show that AFLtar can achieve 3 executions/s with forwarded UART, 10 executions/s with emulated UART, and 33 executions/s with hooked UART and heap. The OOB check introduces an overhead of approximately 18% because of the additional memory hooking and checks, bringing the performance of hooked UART and heap down to 28 executions/s. We think it is important to note that the OOB check is only possible thanks to the flexibility of AFLtar and the power of Unicorn’s instrumentation. In traditional fuzzing, this kind of check is often inserted at compile time, if the source is available, or through dynamic binary instrumentation. These numbers are very low when compared to AFL on a normal computer, which achieves hundreds to thousands of executions/s on this kind of program. The low speed is likely caused by the overhead introduced by the Unicorn hooks; coupled with the short duration of each experiment (30 minutes), it also explains the absence of crashes. The overhead is, for the most part, due to the hooks being written in Python. Since it is an interpreted language, it is significantly slower than compiled languages such as C. We included Figure 5.2 as it gives an interesting glimpse into the fuzzing process. As one would expect, the number of paths discovered by the fuzzer increases with the number of executions. However, the slope of the curve (i.e., the fuzzer’s efficiency in finding new paths) decreases as more test cases are executed. This can be justified by considering that deeper paths often have fewer successors. Moreover, path constraints become more complex and harder to satisfy, and test cases become larger. Those factors cause the fuzzer’s efficiency to reduce over time, resulting in diminishing returns.
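As a sanity check, the averages in Table 5.1 follow directly from the execution totals over the 30-minute (1800 s) runs, and the 18% OOB overhead from the ratio of the two hooked configurations:

```python
DURATION = 30 * 60  # seconds per experiment

# Total executions from Table 5.1.
totals = {
    "forwarded": 5_600,
    "emulated": 18_100,
    "hooked": 60_100,
    "hooked_oob": 51_100,
}

# Average executions per second for each configuration.
speeds = {name: execs / DURATION for name, execs in totals.items()}

# Slowdown of the hooked configuration once the OOB check is enabled.
oob_overhead = speeds["hooked"] / speeds["hooked_oob"] - 1  # ~0.18
```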

Figure 5.2: Total executions vs. total paths in our experiments.

5.4 Future work

Fuzzers do not just need to be smart, but also as fast as possible. Therefore, we want to put significant effort into improving AFLtar’s performance. For example, one area of improvement is interpreter performance. The tests were done with the standard CPython interpreter, which only pre-compiles to bytecode and is quite slow. A better solution would be PyPy [39], a Python interpreter with just-in-time compilation. PyPy is significantly faster than CPython, and is usually a drop-in replacement for it. Unfortunately, the Python bindings for Unicorn are not compatible with PyPy yet because of incompatibilities with a foreign function interface module. Porting the bindings to PyPy would be valuable future work. From a design perspective, many commonly used components are still implemented as part of the fuzzing driver: an example is the instrumented heap allocator. Furthermore, the driver is exposed to the forkserver logic. Refactoring these aspects as more reusable modules would decrease the driver’s complexity and facilitate the analyst’s work. Finally, we wish to perform real-world testing of AFLtar on actual devices.

A conclusion is the place where you got tired thinking.

Martin H. Fischer

6 Conclusion

The modern world is increasing its reliance on embedded devices: many systems whose malfunction could cause economic or human damage are more and more connected to the external world, and thus their attack surface widens. While there are plenty of different analysis techniques that can be used to improve a software’s security (and safety) properties, fuzzing has proven to be a solid choice when it comes to minimizing the cost-to-benefit ratio, because it typically has a very low setup cost. We need tools that can bring down the entry barrier and required effort for fuzzing in the embedded world. In this work we built and evaluated AFLtar, a coverage-guided fuzzer for embedded firmware. AFLtar combines the avatar2 orchestration framework, the Unicorn CPU emulator, and the American Fuzzy Lop fuzzer in a single tool. Our fuzzer empowers the analyst with flexible orchestration and instrumentation capabilities, which allow one to quickly build a tailored coverage-guided fuzzer for specific scenarios. While initial results are encouraging, they are still far from what we have come to expect of fuzzers on common, general purpose systems. As future work, we plan to abstract common parts of the fuzzing driver, in order to lower the setup cost even more. Moreover, we wish to improve the fuzzer’s performance significantly. On the evaluation side, we plan to start running AFLtar on real devices, where it will be able to prove its true effectiveness.

References

[1] B. P. Miller, L. Fredriksen, and B. So, “An empirical study of the reliability of unix utilities,” Commun. ACM, vol. 33, no. 12, pp. 32–44, Dec. 1990.

[2] “Monkey lives,” accessed September, 2018. [Online]. Available: http://www.folklore.org/StoryView.py?story=Monkey_Lives.txt

[3] M. Zalewski, “American Fuzzy Lop trophies,” accessed September, 2018. [Online]. Available: http://lcamtuf.coredump.cx/afl/#bugs

[4] “libFuzzer trophies,” accessed September, 2018. [Online]. Available: https://llvm.org/docs/LibFuzzer.html#trophies

[5] M. Zalewski, “American Fuzzy Lop,” accessed September, 2018. [Online]. Available: http://lcamtuf.coredump.cx/afl/

[6] “AFL-Unicorn,” accessed September, 2018. [Online]. Available: https://github.com/Battelle/afl-unicorn

[7] “Expat XML parser,” accessed September, 2018. [Online]. Available: https://libexpat.github.io/

[8] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation (3rd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2006.

[9] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna, “SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis,” in IEEE Symposium on Security and Privacy, 2016.

[10] C. Cadar, D. Dunbar, and D. Engler, “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2008.

[11] Microsoft Research, “Z3 theorem prover,” accessed September, 2018. [Online]. Available: https://github.com/Z3Prover/z3

[12] “SMT-LIB,” accessed September, 2018. [Online]. Available: http://smtlib.cs.uiowa.edu/

[13] N. Nethercote and J. Seward, “Valgrind: A framework for heavyweight dynamic binary instrumentation,” in Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).

[14] “AddressSanitizer,” accessed September, 2018. [Online]. Available: https://clang.llvm.org/docs/AddressSanitizer.html

[15] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).

[16] D. L. Bruening, “Efficient, transparent, and comprehensive runtime code manipula- tion,” Ph.D. dissertation, Cambridge, MA, USA, 2004, AAI0807735.

[17] “Peach Fuzzer,” accessed September, 2018. [Online]. Available: https://www.peach.tech/products/peach-fuzzer/

[18] “Sulley,” accessed September, 2018. [Online]. Available: https://github.com/OpenRCE/sulley

[19] “Radamsa,” accessed September, 2018. [Online]. Available: https://gitlab.com/akihe/radamsa

[20] “zzuf,” accessed September, 2018. [Online]. Available: http://caca.zoy.org/wiki/zzuf

[21] “honggfuzz,” accessed September, 2018. [Online]. Available: http://honggfuzz.com/

[22] “libFuzzer,” accessed September, 2018. [Online]. Available: https://llvm.org/docs/LibFuzzer.html

[23] “syzkaller,” accessed September, 2018. [Online]. Available: https://github.com/google/syzkaller

[24] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos, “VUzzer: Application-aware Evolutionary Fuzzing,” in Proceedings of the 24th Network and Distributed System Security Symposium (NDSS), 2017.

[25] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshitaishvili, C. Kruegel, and G. Vigna, “Driller: Augmenting fuzzing through selective symbolic execution,” in Proceedings of the 23rd Network and Distributed System Security Symposium (NDSS), 2016.

[26] I. Yun, S. Lee, M. Xu, Y. Jang, and T. Kim, “QSYM: A practical concolic execution engine tailored for hybrid fuzzing,” in Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), 2018.

[27] P. Chen and H. Chen, “Angora: Efficient fuzzing by principled search,” ArXiv e-prints, 2018. [Online]. Available: https://arxiv.org/abs/1803.01307

[28] M. Muench, J. Stijohann, F. Kargl, A. Francillon, and D. Balzarotti, “What you corrupt is not what you crash: Challenges in fuzzing embedded devices,” in Proceedings of the 25th Network and Distributed System Security Symposium (NDSS), 2018.

[29] M. Zalewski, “Pulling JPEGs out of thin air,” accessed September, 2018. [Online]. Available: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html

[30] ——, “afl-fuzz: nobody expects CDATA sections in XML,” accessed September, 2018. [Online]. Available: https://lcamtuf.blogspot.com/2014/11/afl-fuzz-nobody-expects-cdata-sections.html

[31] ——, “Technical “whitepaper” for afl-fuzz,” accessed September, 2018. [Online]. Available: http://lcamtuf.coredump.cx/afl/technical_details.txt

[32] ——, “Binary fuzzing strategies: what works, what doesn’t,” accessed September, 2018. [Online]. Available: https://lcamtuf.blogspot.com/2014/08/binary-fuzzing-strategies-what-works.html

[33] ——, “afl-fuzz: crash exploration mode,” accessed September, 2018. [Online]. Available: https://lcamtuf.blogspot.com/2014/11/afl-fuzz-crash-exploration-mode.html

[34] ——, “Fuzzing random programs without execve(),” accessed September, 2018. [Online]. Available: https://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html

[35] “QEMU,” accessed September, 2018. [Online]. Available: https://www.qemu.org/

[36] “Unicorn,” accessed September, 2018. [Online]. Available: https://www.unicorn-engine.org/

[37] M. Muench, D. Nisi, A. Francillon, and D. Balzarotti, “Avatar2: A multi-target orchestration platform,” in Workshop on Binary Analysis Research (BAR) 2018, colocated with the 25th Network and Distributed System Security Symposium (NDSS), 2018.

[38] J. Zaddach, L. Bruno, A. Francillon, and D. Balzarotti, “Avatar: A Framework to Support Dynamic Security Analysis of Embedded Systems’ Firmwares,” in Proceedings of the 21st Network and Distributed System Security Symposium (NDSS), 2014.

[39] “PyPy,” accessed September, 2018. [Online]. Available: https://pypy.org/
