Defending In-Process Memory Abuse with Mitigation and Testing

A Dissertation Presented by

Yaohui Chen

to

The Khoury College of Computer Sciences

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

Northeastern University Boston, Massachusetts

October 2019
Version Dated: October 21, 2019

To my parents, who gave me life like a river flows,
&
To Boyu, my best friend, who accompanies me through the rapids and undertows.

Contents

List of Figures v

List of Tables viii

Acknowledgments x

Abstract of the Dissertation xii

1 Introduction 1
1.1 Problem Statement ...... 1
1.2 Thesis Statement ...... 3
1.3 Contributions ...... 3
1.3.1 A Hybrid Approach for Practical Fine-grained Software Randomization ...... 3
1.3.2 Leave No Program Behind: Execute-only Memory Protection For COTS Binaries ...... 4
1.3.3 Keep My Secrets: In-process Private Memory ...... 4
1.3.4 Focus on bugs: Bug-driven Hybrid Fuzzing ...... 5
1.3.5 Learning On Experience: Smart Seed Scheduling for Hybrid Fuzzing ...... 6
1.4 Roadmap ...... 7

2 Related Works 8
2.1 Perpetual War On Memory Corruption Attacks ...... 8
2.2 In-Process Memory Isolation ...... 10
2.3 Automatic Software Tests Generation ...... 12

I Runtime Protections Against In-Process Abuse 15

3 Code Reuse Exploit Mitigations 16
3.1 Compiler-assisted Code Randomization ...... 16
3.1.1 Background ...... 16
3.1.2 Overall Approach ...... 19
3.1.3 Compiler-level Metadata ...... 21
3.1.4 Link-time Metadata Consolidation ...... 25

3.1.5 Code Randomization ...... 28
3.1.6 Experimental Evaluation ...... 28
3.2 Enabling Execute-Only Memory for COTS Binaries On AArch64 ...... 33
3.2.1 Overview ...... 33
3.2.2 Background ...... 34
3.2.3 Design ...... 38
3.2.4 Evaluation ...... 47
3.3 Limitations ...... 51

4 In-process Memory Isolation 52
4.1 Overview ...... 53
4.2 Design ...... 55
4.3 Implementation ...... 64
4.4 Evaluation ...... 66
4.5 Limitations and Discussion ...... 71

II Offline Software Testing To Find Memory Corruption Bugs 72

5 Bug-driven Hybrid Testing 74
5.1 Background and Motivation ...... 74
5.1.1 Inefficiency of Existing Coverage-guided Hybrid Testing ...... 74
5.1.2 Motivation ...... 75
5.2 Design ...... 77
5.2.1 Core Techniques ...... 77
5.2.2 System Design ...... 80
5.3 Implementation ...... 85
5.4 Evaluation ...... 87
5.4.1 Evaluation with LAVA-M ...... 88
5.4.2 Evaluation with Real-world Programs ...... 90
5.4.3 Vulnerability Triage ...... 93

6 Learning-based Hybrid Fuzzing 98
6.1 Introduction ...... 98
6.2 Background ...... 100
6.2.1 Hybrid Fuzzing ...... 100
6.2.2 Supervised Machine Learning ...... 102
6.3 System Design ...... 103
6.3.1 System Overview ...... 103
6.3.2 System Requirements ...... 103
6.3.3 Feature Engineering ...... 105
6.3.4 Seed Label Inference ...... 107
6.3.5 Model Construction and Prediction ...... 108
6.3.6 Updating Model ...... 109
6.4 Evaluation and Analysis ...... 110

6.4.1 Evaluation setup ...... 110
6.4.2 Learning Effectiveness ...... 111
6.4.3 Insights and Analyses ...... 112
6.4.4 Model Reusability ...... 113
6.4.5 Model Transferability ...... 114
6.4.6 Discovered Bugs ...... 115
6.5 Discussions ...... 117
6.5.1 Applicability of different machine learning models ...... 117
6.5.2 Applicability of MEUZZ on grey-box fuzzing ...... 118

7 Conclusion 123

Bibliography 126

List of Figures

3.1 Example of the fixup and relocation information that is involved during the compilation and linking process. ...... 18
3.2 Overview of the proposed approach. A modified compiler collects metadata for each object file (1), which is further updated and consolidated at link time into a single extra section in the final executable (2). At the client side, a binary rewriter leverages the embedded metadata to rapidly generate randomized variants of the executable (3). ...... 21
3.3 An example of the ELF layout generated by Clang (left), with the code of a particular function expanded (center and right). The leftmost and rightmost columns in the code listing ("BBL" and "Fragment") illustrate the relationships between basic blocks and LLVM's various kinds of fragments: data (DF), relaxable (RF), and alignment (AF). Data fragments are emitted by default, and may span consecutive basic blocks (e.g., BBL #1 and #2). The relaxable fragment #1 is required for the branch instruction, as it may be expanded during the relaxation phase. The padding bytes at the bottom correspond to a separate fragment, although they do not belong to any basic block. ...... 22
3.4 Example of jump table code generated for non-PIC and PIC binaries. ...... 25
3.5 Overview of the linking process. Per-object metadata is consolidated into a single section. ...... 27
3.6 Performance overhead of fine-grained (function vs. basic block reordering) randomization for the SPEC CPU2006 benchmark tests. ...... 29
3.7 NORAX System Overview: the offline tools (left) analyze the input binary, locate all the executable data and their references (when available), and then statically patch the metadata to the raw ELF; the runtime components (right) create separated mappings for the executable data sections and update the recorded references as well as those generated at runtime. ...... 39
3.8 The layout of an ELF transformed by NORAX. The shaded parts at the end are the generated NORAX-related metadata. ...... 44
3.9 Bionic Linker's binary loading flow. NLoader operates in different binary preparation stages, including module loading, relocation and symbol resolution. ...... 44
3.10 Unixbench performance overhead for unixbench binaries, including runtime, peak resident memory and file size overhead (left: user tests, right: system tests). ...... 50

4.1 Shreds, threads, and a process ...... 52
4.2 Developers create shreds in their programs via the intuitive APIs and build the programs using S-compiler, which automatically verifies and instruments the executables (left); during runtime (right), S-driver handles shred entrances and exits on each CPU/thread while efficiently granting or revoking each CPU's access to the s-pools. ...... 54
4.3 The DACR setup for a quad-core system, where k = 4. The first 3 domains (Dom0–Dom2) are reserved by the OS. Each core has a designated domain (Dom3–Dom6) that it may access when executing a shred. No CPU can access Dom7. ...... 61
4.4 A shred's transition of states ...... 61
4.5 The time and space overhead incurred by S-compiler during the offline compilation and instrumentation phase ...... 67
4.6 The time needed for a context switch when: (1) a shred-active thread is switched off, (2) a regular thread is switched off but no process or address space change occurs, and (3) a regular thread is switched off and a thread from a different process is scheduled on. ...... 67
4.7 Invocation time of shred APIs and reference system calls (the right-most two bars are on log scale). It shows that shred entry is faster than thread creation, and s-pool allocation is slightly slower than basic memory mapping. ...... 69
4.8 Five SPEC2000 benchmark programs tested when: (1) no shred is used, (2) shreds are used but without the lazy domain adjustment turned on in S-driver, and (3) shreds are used with the lazy domain adjustment. ...... 69

5.1 A demonstrative example of hybrid testing. Figure 5.1a presents the code under test. Figures 5.1b and 5.1c are the paths followed by two seeds from the fuzzer. Their execution follows the red line and visits the grey boxes. Note that the white boxes connected by dotted lines are non-covered code. ...... 75
5.2 A demonstrative example of the limitation of existing hybrid testing in finding defects. This defect comes from objdump-2.29 [33]. ...... 76
5.3 An example showing how to estimate the bug-detecting potential of a seed. In this example, the seed follows the path b1->b2->b3->b4. Basic blocks b5 and b7 are unexplored and they can reach L1 and L2 UBSan labels, respectively. They have been attempted by constraint solving for S1 and S2 times. The final score for this seed is (e^(-0.05*S1) * L1 + e^(-0.05*S2) * L2) / 2. ...... 78
5.4 Solving the integer overflow in Figure 5.2. This shows the case on a 32-bit system, but it applies to 64-bit as well. ...... 79
5.5 System architecture of SAVIOR. ...... 80
5.6 A demonstrative example of reachability analysis. The target BB can "reach" 3 UBSan labels. ...... 82
5.7 Fork server mode in KLEE. In this mode, KLEE only performs initialization once and reuses the same executor for all the received seeds. ...... 84
5.8 Evaluation results with LAVA-M. The left column shows the number of bugs reached by different fuzzers and the right column shows the number of bugs triggered by the fuzzers. ...... 96

5.9 Evaluation results with real-world programs over 24 hours. p1 and p2 are the p-values for the Mann-Whitney U-test of SAVIOR vs. DRILLER and SAVIOR vs. QSYM, respectively. ...... 97

6.1 General hybrid fuzzing workflow. ...... 101
6.2 System overview of MEUZZ. The coordinator is extended with an ML engine, which consists of 4 modules: feature extraction, label inference, prediction and training. During fuzzing, utility prediction and model training are carried out consecutively. After extracting features for inputs in the fuzzer's queue, the ML engine can predict their utilities based on the current model. Then, with the seed labels inferred from previously selected seeds, the model is trained iteratively with the new data. ...... 104
6.3 Examples that show how bug-triggering and coverage features are computed. ...... 106
6.4 Branch coverage fuzzing with valid seeds (higher is better). p1, p2 and p3 are p-values in the Mann-Whitney U test comparing QSYM with MEUZZ-OL, MEUZZ-RF and MEUZZ-EN. ...... 119
6.5 The box plots show the importance of the features on nine programs. The importance is extracted by training an offline random forest model, and the features are ranked by the median of their importance. Queue Size and New Cov are the most and the least important ones, respectively. ...... 120
6.6 Branch coverage fuzzing with naive seeds (higher is better). p1, p2 and p3 are p-values in the Mann-Whitney U test comparing QSYM with MEUZZ-OL, MEUZZ-RF and MEUZZ-EN, respectively. ...... 121
6.7 This heat map shows the coverage improvement of MEUZZ-OL with model initialization over vanilla MEUZZ-OL. The Y-axis is the tested programs; the X-axis is the models used for initialization. Each cell shows the relative coverage comparison (%). The diagonal values show the coverage improvement on each program after initializing MEUZZ with the model learned from the same program (reusability). Model transferability is shown in 7 out of the 8 programs. ...... 122
6.8 Off-by-one heap read overflow in tiff2ps. ...... 122

List of Tables

3.1 Collected randomization-assisting metadata ...... 24
3.2 Experimental evaluation dataset and results (* indicates programs written in C++) ...... 32
3.3 Access permissions for stage 1 EL0 and EL1 ...... 35
3.4 ELF sections that comprise the code segment of the example program; the highlighted ones are located in the same page. ...... 37
3.5 Android Marshmallow system binaries that have embedded data on Nexus 5X. ...... 38
3.6 Sections in the executable code page that are handled by NORAX ...... 38
3.7 ELF section reference types ...... 38
3.8 Rewritten program functionality tests. ...... 48
3.9 System compatibility evaluation: the converted zygote, qseecomd, installd, rild, logd, surfaceflinger, libc++, and libstagefright were selected randomly to participate in the test, to see whether they can run transparently with other unmodified system components. ...... 48
3.10 Binary transformation correctness test. ...... 49
3.11 Embedded data identification correctness. An empirical experiment shows our analysis works well on AArch64 COTS ELFs, with a zero false negative rate and a very low false positive rate in terms of finding embedded data. The last column shows the negligible number of leftover gadgets in the duplicated embedded data set. ...... 49

4.1 The 5 open-source programs used in evaluation ...... 66
4.2 End-to-end overhead observed while the tested programs perform a complete task: the left side of the table shows the execution time and the right side shows the memory footprint. ...... 70

5.1 Families of potential bugs that SAVIOR enables UBSan to label. Here, x, y are n-bit integers; array is an array, the size of which is specified as size(array); op_s and op_u refer to binary operators +, -, *, /, % over signed and unsigned integers, respectively. ...... 82
5.2 Fuzzer-specific settings in the evaluation with LAVA-M. ...... 85
5.3 LAVA-M bugs triggered by different fuzzers (before bug-guided verification). "X%" indicates that X% of the listed LAVA bugs are triggered. ...... 90
5.4 LAVA-M bugs triggered by different fuzzers (after bug-guided verification). "X%" indicates that X% of the listed LAVA bugs are triggered. ...... 90

5.5 Real-world benchmark programs and evaluation settings. In the Seeds column, AFL indicates we reuse the test cases provided with AFL, and built-in indicates that we reuse the test cases shipped with the program. ...... 91
5.6 Number of unique UBSan labels reached by different fuzzers in 24 hours. On average SAVIOR reaches 19.68% and 15.18% more labels than DRILLER and QSYM. ...... 93
5.7 New UBSan violations triggered with bug-guided verification in the evaluation with real-world programs. "+X/Y%" means "X" new violations are triggered, increasing the total number by "Y%". ...... 94
5.8 Triage of UBSan violations triggered by SAVIOR in 24 hours. ...... 95

6.1 Evaluation settings ...... 110
6.2 Execution time spent on different learning stages ...... 112
6.3 Bugs discovered by MEUZZ. UB, ME, DoS, and ML refer to Undefined Behavior, Memory Error, Denial of Service, and Memory Leak, respectively. ...... 116

Acknowledgments

I would like to extend my greatest gratitude to my Ph.D. advisor, Prof. Long Lu. Not only is he as great a research advisor as I could ever ask for; in life, he is also like a big brother to me. As a research advisor, he was always supportive, encouraging me to do research that I am passionate about. Along the way, he also guided me in applying critical thinking to distill and crystallize fuzzy ideas. As a big brother, he listened to my distress and grief about life in a foreign country. We also shared a lot of joys together; I will never forget the rejoicing when our first S&P paper got accepted after a year of hard work. These invaluable and unforgettable experiences helped me grow into an independent researcher and conquer obstacles in life.

I would also like to thank my thesis committee members, Prof. Engin Kirda, Prof. Wil Robertson and Dr. Weidong Cui. Their constructive feedback and helpful suggestions helped me shape this thesis into its better form.

If my Ph.D. student life were a painting, the internship experiences would be among its most colorful strokes. I was fortunate to work with my mentors Dr. Weidong Cui, Dr. Xinyang Ge and Dr. Ben Niu at Microsoft Research; Dr. László Szekeres, Dr. Stefan Bucur and Dr. Franjo Ivancic at Google; Dr. Hayawardh Vijayakumar and Dr. Mike Grace at Samsung Research; and Dr. Peng Li and Dr. Tao Wei at Baidu X-Lab. During my internships, they provided the best working environments one could ask for. They also showed me the importance of great teamwork, and how to cultivate research ideas and land them through solid engineering. I carried everything I learned from them into my research after the internships.

I am also grateful to have met many friends during my internships. I interacted and collaborated with them, directly and indirectly, in work, in research, and in life. They made the whole journey much more fun and unforgettable: Prof. Jun Xu, Dr. Nan Zhang, Prof. Wenbo Shen, Prof. Dave Jing Tian, Dr. Yuru Shao, Dr. Yueh-Hsun Lin, Dr. Yuping Li, Dr. Ruowen Wang, Dr. Xun Chen, Rohan Padhye, Dr. Rundong Zhou, Dr. Qian Feng, Dr. Shengjian Guo, Dr. Haining Chen, Yulong Zhang, Dr. Mingshen Sun, Dr. Yu Ding, Dr. Yizheng Chen, Dr. Yiming Gong, Dr. An Liu, Dr. Yueqiang Cheng, Zhaofeng Chen, Hangchen Yu, Willy Vasquez, Meng Xu and Dr. Markus Kusano. These collaborative experiences had a very positive influence not only on my research but also on my communication and social networking.

Of course, my Ph.D. journey would not be complete without my friends from Stony Brook University and Northeastern University: Zhichuang, Bo, Suwen, Farhan, Mingwei, Rui, Hyungjoon, Nahid, Meng, Shachee, Tapti, Andrea Possemato, Andrea Mambretti, Fangfan, Ahmad, Sajjad, Ahmin, Conor, Shuwen, Jingjing, Matthew, Eyza, Desheng, Mansour, Reza, Ruimin, Ryan, Alejandro, Tomasso, Omin and Jeremiah. We had a lot of fun times together, and I am grateful for their company.

This thesis is built upon foundational knowledge of computing systems and security, most of which I acquired during my study at Stony Brook University. I want to thank Prof. Donald Porter, Prof. Nima Hornamand, Prof. Michalis Polychronakis, Prof. Nick Nikiforakis and Prof. R. Sekar for passing on their knowledge to me in and out of the classroom. This systems and security knowledge greatly benefited my Ph.D. career later on.

Lastly, I want to take this opportunity to specially thank Kelwin, Fish, DeAdCaT, zTrix, MaskRay and Flanker (the old Blue Lotus members) for being my inspiration to pursue the path of computer system security.

Without all these aforementioned people, my life pursuing the Ph.D. would not have been the same, and I hold the utmost gratitude for them showing up in this fantastic journey.

Abstract of the Dissertation

Defending In-Process Memory Abuse with Mitigation and Testing

by

Yaohui Chen

Doctor of Philosophy in Computer Science

Northeastern University, October 2019
Version Dated: October 21, 2019

Dr. Long Lu, Advisor

Modern software often includes large code bases from different origins with different trust levels. This creates a large attack surface and raises the security concern that sensitive information of one component is directly accessible by other (malicious or manipulated) components in memory. In this thesis, I refer to this problem as in-process memory abuse. Despite the prevalence of in-process abuses, defense mechanisms are not well studied, due to the complex root causes and attack surfaces of such attacks. First of all, a large amount of existing software is written in type-unsafe languages such as C and C++. Such languages are notorious for being error-prone. These programming errors have incurred countless high-severity security bugs that lead to in-process memory attacks. Secondly, contemporary defenses such as data execution prevention (DEP) and address space layout randomization (ASLR) have little effect on preventing in-process memory attacks. Last but not least, developers are often helpless when trying to protect their sensitive data, due to the lack of support for creating boundaries within the same process context. As a result, as long as one of the many components is successfully exploited, the whole program's sensitive data and code are subject to abuse. A common belief is that in-process abuse cannot be defended against without high overhead or loss of backward compatibility. To reduce memory corruption bugs, options such as formally verifying all software or rewriting the whole software stack in type-safe languages are impractical, due to the poor scalability of formal verification methods and the immense engineering cost required to rebuild all existing software infrastructure. To prevent exploitation of memory corruption bugs, one may suggest adopting full memory safety by bounds-checking all pointers and tracking the liveness of every allocated memory object. However, this comes with intolerable overheads.

Lastly, existing work proposes rewriting established operating system design paradigms to create sub-process isolation; this creates incompatibility and reduces the practicality of the solution. Challenging these common beliefs, this thesis presents a series of practical defenses against in-process memory abuse. It includes runtime protections [80, 82, 130] and offline bug detection [53, 77, 78]. Collectively, these new techniques improve the state-of-the-art defense against in-process memory abuse without sacrificing practicality and compatibility. First, I present CCR [130], a compiler-binary rewriter toolchain that enables fine-grained software randomization. CCR solves the incompatibility of existing fine-grained randomization approaches by aligning its defense implementation with established software deployment and bug report paradigms. However, fine-grained randomization alone is still vulnerable to just-in-time info-leak aided code reuse attacks. To tighten this loose end, I introduce NORAX [82], a binary rewriting framework to retrofit execute-only memory (XOM) protection into source-unavailable programs. Then, I design shreds [80], fine-grained execution units with private memory, as an extra line of defense against in-process abuse. Shreds enable sub-process isolation without relying on nested paging, virtualization or even modified hardware. They incur negligible overheads and are highly compatible with the existing operating system design paradigm of process/thread-based execution units. Lastly, for offline software test generation, I present SAVIOR [78] and MEUZZ [77], two advanced hybrid fuzzing frameworks: SAVIOR uses a bug-driven oracle to quickly find more bugs, while MEUZZ learns from past fuzzing statistics to tune its seed scheduling strategies.
By designing and conducting large-scale experiments with these proposed defenses on real-world software, I demonstrate that in-process memory abuse can be reasonably well defended against and prevented. The insights and knowledge gained during the development of this thesis have raised the community's awareness of in-process abuse and advanced the state-of-the-art defense against such attacks. Each of the included works has yielded at least one practical defense or automated software testing system. Many of them have also been adopted by industry, blocking malicious in-process abuse attempts and uncovering highly severe security bugs in critical software infrastructure on a daily basis, which highlights the broad impact of this thesis.

Chapter 1

Introduction

1.1 Problem Statement

Many attacks on software aim at accessing sensitive content in victim programs' memory, including secret data (e.g., crypto keys and user passwords) and critical code (e.g., private APIs and privileged functions). To achieve this goal, such attacks normally start with remote exploitation or injected malicious libraries. For instance, the HeartBleed attack on OpenSSL-equipped software reads private keys by exploiting a memory disclosure vulnerability [98]; the malicious libraries found in mobile apps covertly invoke private framework APIs to steal user data [95]. We generally refer to this class of attacks as in-process abuse. Obviously, such attacks would not succeed if we were able to (i) defend the victim program against exploitation attempts; or (ii) isolate the sensitive data and code from hostile code running in the same process. Despite decades of research, existing defense techniques still cannot meet the demand for practical and effective mitigations against in-process abuse, mainly for the following reasons.

The Pervasiveness of Memory Corruption Bugs: Memory-unsafe languages such as C and C++ allow developers to directly access memory with raw pointers. This great flexibility also imposes the burden on developers to make sure that no memory access violates spatial (e.g., out-of-bounds access) or temporal (e.g., use-after-free) memory safety. This process, unfortunately, is very error-prone. As a result, software inevitably contains defects [32, 181]. A large number of these defects are security vulnerabilities that can be exploited for malicious purposes [149]. This type of vulnerable code has become a fundamental threat to software security.


The Arms Race of Memory Corruption Attacks: Once a memory corruption bug is found, an attacker may use it to gain illegal memory access to hijack the control flow, alter the program's execution logic, or read memory that can later facilitate exploitation. To execute arbitrary code, attacks used to inject shellcode along with the deployed payload into the victim program's address space. Modern commodity operating systems employ code integrity protection techniques, such as data execution prevention (DEP), to prevent traditional code injection attacks. Consequently, recent attacks [61, 182] increasingly leverage code-reuse techniques to gain control of vulnerable programs. In code reuse attacks, a target application's control flow is manipulated in a way that snippets of existing code (called gadgets) are chained to carry out malicious activities. Knowledge of the process memory layout is a key prerequisite for code-reuse attacks to succeed. Attackers need to know the exact binary instruction locations in memory to assemble the chain of gadgets. Commodity operating systems widely adopt address space layout randomization (ASLR), which loads code binaries at random memory locations unpredictable to attackers. Without knowing the locations of needed code or gadgets, attackers cannot build code-reuse chains. However, memory disclosure attacks can use information leaks in programs to determine code locations, thus defeating ASLR. Such attacks either read the program code (direct de-randomization) or read code pointers (indirect de-randomization). Although deployed ASLR techniques randomize the load address of a large chunk of data or code, leaking a single code pointer or a small sequence of code allows attackers to identify the corresponding chunk, infer its base address, and calculate the addresses of gadgets contained in the chunk.
After learning the location of existing code, attackers can then launch return-oriented programming (ROP) style code reuse attacks to execute arbitrary code in the victim process.

The Insufficient Support for In-process Memory Isolation: Developers are virtually helpless when it comes to preventing in-process abuse in their programs, due to a lack of support from the underlying operating system (OS): the memory isolation mechanisms provided by modern OSes operate merely at the process level and cannot be used to establish security boundaries inside a process. As a result, protecting sensitive memory content against malicious code inside the same process remains an open issue, which has been increasingly exploited by attackers. To address this open issue, some recent work proposed thread-level memory isolation [62]. Taking distinct approaches, these works allow developers to limit the sharing of a thread's memory space with other threads in the same process. However, this line of work faces three major limitations. First, thread-level memory isolation is still too coarse to stop in-process abuse, because exploitable or malicious code often runs in the same thread as the legitimate code that needs to access sensitive memory content. Second, adopting these solutions requires significant effort from developers. Separating application components into different threads (i.e., scheduling units) demands major design changes, as opposed to regional code patches, to deal with the added concurrency. Third, threads with private memory tend to incur much higher overhead than normal threads, due to the additional page table switches, TLB flushes, or nested page table management upon context switches.

1.2 Thesis Statement

In-process memory abuse has become a dominating problem in software security, yet little research has studied defense mechanisms against it. A common belief is that in-process abuse cannot be defended against without high overhead or loss of backward compatibility. This thesis challenges that assumption. I propose a holistic defense including runtime mitigations and offline detection techniques. Collectively, these tools improve the state-of-the-art defense against in-process memory abuse without sacrificing practicality and compatibility.

1.3 Contributions

Next, I give an overview of the solutions I propose to address the problems discussed in § 1.1. Our goal is to create a comprehensive defense against the in-process abuse attacks that plague the current software ecosystem. To this end, this thesis makes the following contributions.

1.3.1 A Hybrid Approach for Practical Fine-grained Software Randomization

Despite decades of research on software diversification, only address space layout randomization has seen widespread adoption. Code randomization, an effective defense against return-oriented programming exploits, has failed to gain wide adoption in practice mainly due to i) the lack of a transparent and streamlined deployment process that does not disrupt existing software distribution norms, and ii) the inherent incompatibility of program variants with error reporting, whitelisting, patching, and other operations that rely on code uniformity. To this end, we present compiler-assisted code randomization (CCR), a hybrid approach that relies on compiler–rewriter cooperation to enable fast and robust fine-grained code randomization on end-user systems, while maintaining compatibility with existing software distribution models. The main concept behind CCR is to augment binaries with a minimal set of transformation-assisting metadata, which i) facilitate rapid fine-grained code transformation at installation or load time, and ii) form the basis for reversing any applied code transformation when needed, to maintain compatibility with existing mechanisms that rely on referencing the original code. We have implemented a prototype of this approach by extending the LLVM compiler toolchain, and developing a simple binary rewriter that leverages the embedded metadata to generate randomized variants using basic block reordering. The results of our experimental evaluation demonstrate the feasibility and practicality of CCR, as on average it incurs a modest file size increase of 11.46% and a negligible runtime overhead of 0.28%, while it is compatible with link-time optimization and control flow integrity.

1.3.2 Leave No Program Behind: Execute-only Memory Protection For COTS Binaries
Code reuse attacks exploiting memory disclosure vulnerabilities can bypass all deployed mitigations. One promising defense against this class of attacks is to enable execute-only memory (XOM) protection on top of fine-grained address space layout randomization (ASLR). However, recent works implementing XOM, despite their efficacy, only protect programs that have been (re)built with new compiler support, leaving commercial-off-the-shelf (COTS) binaries and source-unavailable programs unprotected. We present the design and implementation of NORAX, a practical system that retrofits XOM into stripped COTS binaries on AArch64 platforms. Unlike previous techniques, NORAX requires neither source code nor debugging symbols. NORAX statically transforms existing binaries so that during runtime their code sections can be loaded into XOM memory pages with embedded data relocated and data references properly updated. NORAX allows transformed binaries to leverage the new hardware-based XOM support, a feature widely available on AArch64 platforms (e.g., recent mobile devices) yet virtually unused due to the incompatibility of existing binaries. Furthermore, NORAX is designed to co-exist with other COTS binary hardening techniques, such as in-place randomization (IPR). We apply NORAX to commonly used Android system binaries running on SAMSUNG Galaxy S6 and LG Nexus 5X devices. The results show that NORAX on average slows down the execution of transformed binaries by 1.18% and increases their memory footprint by 2.21%, suggesting NORAX is practical for real-world adoption.

1.3.3 Keep My Secrets: In-process Private Memory

Once attackers manage to execute code in a victim program’s address space (i.e., after bypassing the code reuse mitigations), or find a memory disclosure vulnerability, all sensitive data and code inside that address space are subject to theft or manipulation. Unfortunately, this broad type of attack is hard to prevent, even if software developers wish to cooperate, mostly because conventional memory protection only works at the process level and previously proposed in-process memory isolation methods are not practical for wide adoption. We propose shreds, a set of OS-backed programming primitives that address developers’ currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries. A shred can be viewed as a flexibly defined segment of a thread execution (hence the name). Each shred is associated with a protected memory pool, which is accessible only to code running in the shred. Unlike previous works, shreds offer in-process private memory without relying on separate page tables, nested paging, or even modified hardware. Plus, shreds provide the essential data flow and control flow guarantees for running sensitive code. We have built the compiler toolchain and the OS module that together enable shreds on Linux. We demonstrated the usage of shreds and evaluated their performance using 7 non-trivial open-source programs, including OpenSSH and Lighttpd. The results show that shreds are fairly easy to use and incur low runtime overhead (4.67%).

1.3.4 Focus on bugs: Bug-driven Hybrid Fuzzing

A popular trend in the fuzzing research community is to augment grey-box fuzz testing with symbolic execution, generally referred to as hybrid testing. It leverages fuzz testing to cover easy-to-reach code regions and uses concolic execution to explore code blocks guarded by complex branch conditions. As a result, hybrid testing is able to reach deeper into program state space than fuzz testing or concolic execution alone. Recently, hybrid testing has seen significant advancement. However, its code-coverage-centric design is inefficient for vulnerability detection. First, it blindly selects seeds for concolic execution and aims to explore new code continuously. However, as statistics show, a large portion of the explored code is often bug-free. Therefore, giving equal attention to every part of the code during hybrid testing is a non-optimal strategy; it slows down the detection of real vulnerabilities by over 43%. Second, classic hybrid testing quickly moves on after reaching a chunk of code, rather than examining the hidden defects inside. It may frequently miss subtle vulnerabilities even though it has already explored the vulnerable code paths. I introduce SAVIOR, a new hybrid testing framework pioneering a bug-driven principle. Unlike existing hybrid testing tools, SAVIOR prioritizes the concolic execution of the seeds that are likely to uncover more vulnerabilities. Moreover, SAVIOR verifies all vulnerable program locations along the executing program path. By modeling faulty situations using SMT constraints, SAVIOR reasons about the feasibility of vulnerabilities and generates concrete test cases as proofs. Our evaluation shows that the bug-driven approach outperforms mainstream automated testing techniques, including state-of-the-art hybrid testing systems driven by code coverage. On average, SAVIOR detects vulnerabilities 43.4% faster than DRILLER and 44.3% faster than QSYM, leading to the discovery of 88 and 76 more unique bugs, respectively. According to the evaluation on 11 well-fuzzed benchmark programs, within the first 24 hours, SAVIOR triggers 485 UBSAN violations, among which 243 are real bugs.
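The idea of modeling a faulty situation as a solvable constraint can be illustrated without a real SMT solver. The Python sketch below is an illustration, not SAVIOR's implementation: it treats an unsigned 8-bit addition as potentially overflowing and exhaustively searches for a concrete witness, standing in for the SMT query a bug-driven engine would issue.

```python
def overflows_u8(a: int, b: int) -> bool:
    """Faulty-situation predicate: does a + b wrap in 8-bit arithmetic?"""
    return a + b > 0xFF

# A bug-driven engine asks: is the faulty condition feasible on this path?
# SAVIOR encodes such predicates as SMT constraints; exhaustive search over
# the two 8-bit operands stands in for the solver in this sketch.
witness = next(
    ((a, b) for a in range(256) for b in range(256) if overflows_u8(a, b)),
    None,
)
# `witness` is a concrete test case proving the overflow is reachable.
```

If no operand pair satisfied the predicate, the search would return None, i.e., the faulty location would be proven infeasible on this path.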

1.3.5 Learning On Experience: Smart Seed Scheduling for Hybrid Fuzzing

Seed scheduling is a prominent factor in determining the yields of hybrid fuzzing. Existing hybrid fuzzers schedule seeds based on fixed heuristics that predict input utilities with best effort. However, such heuristics are not generalizable, as there is no one-size-fits-all rule that applies to different kinds of situations; they may work well on one program but be detrimental when fuzzing others. To overcome this problem, we design a Machine learning-Enhanced hybrid fUZZing system (MEUZZ), which employs supervised machine learning to devise generalizable seed scheduling. MEUZZ determines which new seeds are likely to produce better fuzzing yields based on the knowledge learned from past seed scheduling results. MEUZZ integrates machine learning techniques without interrupting the fuzzing workflow: it draws a series of lightweight but informative features from reachability and dynamic analyses, and extracting these features incurs very little overhead (in microseconds). Moreover, MEUZZ automatically infers the data labels by constantly evaluating the fuzzing performance of each selected seed. As a result, MEUZZ achieves substantial efficacy as well as generalizability. The experimental results show that MEUZZ significantly outperforms state-of-the-art grey-box and hybrid fuzzers, achieving as much as 27.1% more code coverage than QSYM. More importantly, the models are extensively reusable and transferable: the reused models boost coverage performance by 7.1% on average, and the transplanted models improve 67.9% of the 56 cross-program fuzzing configurations. Also, MEUZZ can uncover 50 deeply hidden bugs (19 of which were confirmed and fixed by the maintainers) when fuzzing 8 well-tested programs with the same configurations used in previous work.
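The learned scheduling can be pictured as a utility model over per-seed features. The sketch below is purely illustrative: the feature names and weights are invented for this example, not MEUZZ's actual features or fitted model, which come from its reachability and dynamic analyses and are trained on past scheduling results.

```python
# Illustrative only: feature names and weights are invented, not MEUZZ's.
def seed_utility(features, weights):
    """Linear utility model: score how promising a seed is for concolic execution."""
    return sum(w * f for w, f in zip(weights, features))

def schedule(seeds, weights):
    """Pick the seed predicted to produce the best fuzzing yield."""
    return max(seeds, key=lambda s: seed_utility(s["features"], weights))

# Hypothetical weights fit from past scheduling results (the "labels" a
# learning-based scheduler infers by observing each selected seed's yield).
weights = (0.6, 0.3, -0.05, -0.1)  # (reachable bugs, path depth, size, age)
seeds = [
    {"name": "seed_a", "features": (2, 10, 40, 1)},
    {"name": "seed_b", "features": (9, 4, 12, 3)},
]
best = schedule(seeds, weights)
```

The point of learning the weights, rather than fixing them by hand, is precisely the generalizability argument above: the same feature extraction can yield different models on different programs.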


1.4 Roadmap

The remainder of this dissertation is organized as follows. In chapter 2, I discuss works related to the online and offline defenses introduced in this thesis. The rest of the thesis is then divided into two parts. In part one, I first show how runtime mitigations can break the in-process abuse exploit chain in chapter 3. Following the mitigations, in chapter 4 I introduce a fine-grained memory isolation technique as a last resort for developers to protect their secret data or code from untrusted components running in the same process. In part two, I explore automated test generation to identify software bugs that facilitate in-process memory abuse. Specifically, I present fuzzing works in two directions. In ?? I show how an existing code base can be used to automate fuzz driver generation, to improve the adoption rate of fuzz testing. Then, in chapter 5 and chapter 6, I show how bug-driven and learning-based hybrid testing can be used to detect bugs hidden in deep program paths. Finally, I conclude by discussing the findings of this dissertation in chapter 7.

Chapter 2

Related Works

In this chapter, I discuss related works concerning defenses against in-process abuse. I start by describing the progression of memory corruption attacks and defenses. Then I discuss past efforts to provide an extra isolation layer beyond process-level isolation. Finally, I give an overview of software test generation techniques and describe state-of-the-art testing techniques that facilitate exposing deep memory corruption bugs.

2.1 Perpetual War On Memory Corruption Attacks

Over the years, there has been an ongoing race between code reuse attacks and corresponding defense countermeasures. Such code reuse attacks keep evolving into new forms with more complex attack steps (e.g., Blind-ROP [61], JIT-ROP [182]). To defend against them, two categories of countermeasures (e.g., ASLR + XOM, CFI) have been proposed from different perspectives. Here we briefly review these defenses, especially execute-only memory, which is the category this work belongs to.

Address Space Layout Randomization (ASLR): ASLR is a practical and popular defense deployed in modern operating systems to thwart code reuse attacks [191]. It randomizes memory addresses and makes the locations of ROP gadgets unpredictable. However, the de-facto ASLR only randomizes the base address of code pages. It becomes ineffective when facing recent memory-disclosure-based code reuse attacks [61, 182]. Such attacks explore the address space on-the-fly to find ROP gadgets via a memory disclosure vulnerability. Although fine-grained ASLR increases the entropy of randomization, such as compile-time code randomization [59] and load-time randomization [92, 119, 125, 199], the memory disclosure attack is not directly addressed, since code pages can still be read by attackers [182]. Runtime randomization [60, 81, 90] has thus been proposed to introduce more uncertainty into the program’s address space. Its effectiveness depends on who acts faster, the attacker or the re-randomization mechanism. Due to the need to track all code and data objects and correct their references, these solutions either require compiler assistance or rely on runtime translation, which limits their applicability and incurs non-trivial overhead.

eXecute-only Memory (XOM): To address memory disclosure attacks, researchers proposed execute-only but non-readable memory pages to hinder the possibility of locating reusable code (or ROP gadgets). However, one fundamental challenge in achieving this defense is that it is non-trivial to identify and separate legitimate data read operations in code pages. When source code is available, existing works like Readactor [88, 89] and LR2 [67] rely on compilers to separate data reads from code pages and then enforce XOM via either hardware-based virtualization or software-based address masking. On the other hand, for COTS binaries, which are more common in real-world scenarios, XnR [54] blocks direct memory disclosure by modifying the page fault handler in the operating system to check whether a memory read falls inside a code or data region of a process. However, it cannot handle embedded data mixed into code regions. HideM [109] utilizes the split-TLB feature in AMD processors to direct code and data accesses to different physical pages to prevent reading code. Unfortunately, recent processors no longer support split TLBs.

Control Flow Integrity (CFI): Enforcing CFI is another general defense against attacks that hijack control flow, including code reuse attacks. Proposed a decade ago by Abadi et al. [46], CFI has been refined by researchers over the years [141, 147, 152, 153, 192, 193], from its early coarse-grained form to its current mature appearance as fine-grained CFI. The fundamental difference is that coarse-grained CFI allows forward edges in the control flow graph (CFG) to point at any node in the graph and backward edges to return to any call-preceded destination, whilst fine-grained CFI has a more precise set of destinations for both forward and backward edges. bin-CFI [210] and CCFIR [209] enforce coarse-grained CFI policies on Linux and Windows COTS binaries, respectively. Unfortunately, enforcing fine-grained CFI requires a more precise CFG to be built as the ground truth, which is difficult to obtain in practice based on static analysis, even when source code is available. In addition, researchers found that it is still possible to launch code reuse attacks even when a fine-grained CFI solution is in place, due to the difficulty of extracting a perfect CFG in practice [72, 91, 101, 112].


2.2 In-Process Memory Isolation

Program module isolation: Previous works have studied the problem of isolating the executions of mutually distrusting modules, ranging from libraries in user-space programs to drivers in the OS. SFI [197] and its variants [73, 100] establish strict boundaries in memory space to isolate potentially faulty modules and therefore contain the impact resulting from crashes or malfunctions of such modules. SFI has also been extended to build sandboxes for untrusted plugins and libraries on both x86 [105, 206] and ARM [11, 213]. Extending module isolation into kernel space, some previous works [100, 185] contain faulty drivers as well as user-space modules. Unlike these works, which focus on fault isolation or sandboxing, our work aims to prevent in-process memory abuse launched by either vulnerable or malicious code. Our work allows developers to run sensitive code in flexibly-defined and lightweight execution units (i.e., shreds), where the code has exclusive access to private memory pools, in addition to the regular memory regions, and the execution is protected from other code running (concurrently) in the same address space. The aforementioned works require verification and instrumentation of all untrusted code modules, whereas our work only needs to analyze and harden trusted in-shred code. We repurpose the ARM memory domain feature to efficiently realize the design of shreds and the protection against in-process abuse. Furthermore, SFI and similar techniques assume that isolated modules are logically independent and do not interact closely, whereas shreds neither impose such restrictions nor incur additional overhead when accessing regular memory, invoking third-party library functions, or making system calls.

Process- and thread-level isolation: Arranging program components into different processes has long been advocated as a practical approach to achieving privilege and memory separation [69, 126, 161]. Many widely used programs, such as OpenSSH and Chrome, have adopted this approach. Separated components run in their own address spaces and are immune from memory abuse by other components. However, process separation faces three major limitations when used to defend against memory abuse. First, due to the coarse granularity of a process, memory abuse may still happen inside a component process as a result of a library call or a code injection, as shown in several real attacks on Chrome. Second, using process separation usually requires major software design changes due to the added concurrency and restrictions, which prevents wide adoption. Third, process separation can cause high overhead, particularly when separated components frequently interact. Some recent works [62, 160] proposed thread-level isolation. While incurring slightly lower overhead than process-level isolation, they still suffer from the fixed granularity and require major software changes to be adopted. In comparison, shreds are flexibly grained and easy to adopt. Shreds are also more efficient because, unlike the aforementioned works, our design does not rely on heavy paging-based memory access control.

Protected execution environments: A number of systems have been proposed for securely executing sensitive code or performing privileged tasks. Flicker [144] allows for trusted code execution in full isolation from the OS or even the BIOS, and provides remote attestation. TrustVisor [143] improves on performance and granularity with a special-purpose hypervisor. SeCage [136] runs sensitive code in a secure VM. SICE [52] protects sensitive workloads purely at the hardware level and supports concurrent execution on multicore platforms. SGX [145], a feature in recent Intel CPUs, allows user-space programs to create so-called enclaves where sensitive code can run securely but has little access to system resources or application context. In general, these systems are designed for self-contained code that can run independently in isolated or constrained environments. They are neither suitable nor practical for preventing memory abuse, which can target data or code that cannot be jailed in these isolated environments. In addition, these systems do not need to consider the case where the protected execution itself can be exploited, whereas our design does, and enforces security checks on in-shred executions.

Memory encryption and protection: Several memory protection mechanisms have been proposed before. Overshadow [76] uses virtualization to render encrypted views of application memory to the untrusted OS and, in turn, protects application data. Mondrian [132] is a hardware-level memory protection scheme that enables permission control at word granularity and allows memory sharing among multiple protection domains. Another scheme [186] provides memory encryption and integrity verification for secure processors. While offering strong protection, these schemes all require hardware modifications and have not been adopted in the real world. In fact, this work was partly motivated by the lack of a practical, software-based memory protection mechanism. Recently, protecting cryptographic keys in memory has become a popular research topic. Proposed solutions range from minimizing key exposure in memory [48, 118, 148] to avoiding key presence in RAM by confining key operations to CPUs [113, 150], GPUs [196], and hardware transactional memory [114]. Although effective at preventing key theft, a major common type of memory abuse, these works can hardly protect other types of sensitive data or code in memory.


2.3 Automatic Software Tests Generation

Software tests can expose unexpected programming errors such as memory corruptions. To uncover a software bug, one must first trigger the functionality that contains the bug. As a result, software testing techniques seek to maximize functionality coverage. This strategy can also be modeled as an optimization problem, where testing tools search the potentially infinite input space for inputs that trigger new program behavior. In general, there are two ways of searching for interesting inputs, namely random testing and systematic analysis. Fuzzing [2] is a representative method for random testing: the fuzzer randomly generates new inputs in the hope that they will trigger unexpected program errors. Systematic analysis, such as symbolic execution [71], instead collects path constraints and utilizes SMT solvers to generate inputs that satisfy those constraints. In this thesis, I focus on three categories of test generation techniques, namely fuzzing, concolic execution, and hybrid fuzzing.
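The random-testing side of this search can be captured in a few lines. The Python sketch below is a minimal illustration, not any tool from this thesis: it deterministically flips each bit of a seed input and reports the first mutant that triggers the (made-up) crash condition in `buggy_parse`.

```python
def buggy_parse(data: bytes) -> None:
    """Hypothetical target: 'crashes' when the first byte has its top bit set."""
    if data and data[0] & 0x80:
        raise RuntimeError("simulated memory corruption")

def bitflip_fuzz(seed: bytes, max_trials: int = 10_000):
    """Deterministic bit-flip mutation: a minimal stand-in for random testing."""
    trials = 0
    for byte_idx in range(len(seed)):
        for bit in range(8):
            if trials >= max_trials:
                return None
            mutated = bytearray(seed)
            mutated[byte_idx] ^= 1 << bit  # flip one bit of the seed
            trials += 1
            try:
                buggy_parse(bytes(mutated))
            except RuntimeError:
                return bytes(mutated)  # report the crashing input
    return None

crasher = bitflip_fuzz(b"\x00\x00\x00\x00")
```

A fuzzer like this excels at shallow conditions; the discussion of concolic execution below covers the complementary case of branches that mutation alone rarely satisfies.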

Advanced Grey-Box Fuzzing: Many recent works focus on improving the capability of code exploration in fuzzing. CollAFL [106] aims to reduce hash collisions in coverage feedback to decrease false negatives. PTrix [79] enables path-sensitive fuzzing based on efficient hardware tracing. TFUZZ [159] transforms tested programs to bypass complex conditions and improve code coverage, and later uses a validator to reproduce the inputs that work for the original program. To generate high-quality seeds, ProFuzzer [207] infers the structural information of the inputs. Along the line of seed generation, Angora [75] assumes a black-box function at each conditional statement and applies gradient descent to find satisfying input bytes. This method is later improved by NEUZZ [177] with a smooth surrogate function to approximate the behavior of the tested program.

Concolic Execution: Symbolic execution, a systematic approach introduced in the 1970s [121, 127] for program testing, has attracted new attention due to advances in satisfiability modulo theories [93, 94, 107]. However, classic symbolic execution suffers from high computation cost and path explosion. To tackle these issues, Sen proposed concolic execution [172], which combines the constraint solving of symbolic execution with the fast execution of concrete testing. Concolic execution increases the coverage of random testing [110, 111] while also scaling to large software. Hence, it has been adopted in various frameworks [70, 83, 173, 174]. Recently, concolic execution has also been widely applied in automated vulnerability detection and exploitation, in which the concolic component provides critical inputs by incorporating security-related predicates [51, 74]. However, concolic execution operates via emulation or heavy instrumentation, incurring tremendous execution overhead. Purely relying on concolic execution for code exploration is thus less practical for large software that involves large amounts of operations. In contrast, hybrid testing runs fuzzing for code exploration and invokes concolic execution only on hard-to-solve branches. This takes advantage of both the fuzzer’s efficiency and the concolic executor’s constraint-solving power.
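The division of labor can be made concrete with a toy example. The sketch below is not DRILLER or QSYM: it hand-rolls one "concolic" step by taking the branch predicate observed on a concrete run and searching a small domain for an input that flips it, with brute force standing in for an SMT solver.

```python
def target(x: int) -> str:
    # A branch that random mutation is unlikely to satisfy by chance.
    if x * 7 + 3 == 59384:
        return "deep path"
    return "common path"

# 1. Fuzzing: a concrete run on a cheap input takes only the common path.
concrete_result = target(0)

# 2. "Concolic" step: negate the recorded branch predicate and solve it.
#    A real system hands `x * 7 + 3 == 59384` to an SMT solver; brute-force
#    search over a small domain stands in for the solver here.
solution = next(x for x in range(100_000) if x * 7 + 3 == 59384)
deep_result = target(solution)
```

The fuzzer keeps the loop fast; the solver is invoked only for the branch the fuzzer could not get past, which is the whole premise of hybrid testing.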

Hybrid Testing: Majumdar et al. [139] introduced the idea of hybrid concolic testing a decade ago. This idea offsets the deficiencies of both random testing and concolic execution. Specifically, their approach interleaves random testing and concolic execution to deeply explore a wide program state space. Subsequent development reinforced hybrid testing by replacing random testing with guided fuzzing [154], which rapidly contributes more high-quality seeds to concolic execution. More recently, DRILLER [184] engineered a pioneering hybrid testing system. It more coherently combines fuzzing and concolic execution and can seamlessly test various software systems. Despite this advancement, DRILLER still achieves unsound vulnerability detection. DigFuzz [211] is a more recent work that tries to better coordinate the fuzzing and concolic execution components. Using a Monte Carlo algorithm, DigFuzz predicts the difficulty for a fuzzer to explore a path and prioritizes seeds with a higher difficulty score for exploration. Moreover, motivated by the growing demands of software testing, researchers have been reasoning about the performance of hybrid testing. As commonly understood, hybrid testing is largely restricted by the slow concolic execution. To this end, QSYM [208] implements a concolic executor that trims the heavy but unnecessary computations in symbolic interpretation and constraint solving, leading to a multi-fold acceleration.

Universal Fuzzing Adoption: As of today, two major hindrances stand in the way of the wide adoption of fuzzing. First, to fuzz a program or library, a fuzzer requires a fuzz driver through which it can pass inputs to exercise the library code of interest. Unfortunately, writing fuzz drivers remains a primarily manual exercise, a major hindrance to the widespread adoption of fuzzing. Second, state-of-the-art fuzzers suffer a 2-5x slowdown if the source code of the target program is unavailable, due to the high overhead introduced by dynamic emulation. As a result, when fuzzing binary-only software, grey-box fuzzers do not expose bugs as efficiently as the compiler-based instrumentation approach.

To address the first problem, Babić et al. [53] built the Fudge system for automated fuzz driver generation. Fudge automatically generates fuzz driver candidates for libraries based on existing client code. Instead of relying on manual effort to compose the fuzz target, Fudge operates on the key insight that fuzz drivers can be automatically learned from client code in the existing code base. Fudge extracts interesting function usage via static analysis and dynamic tracing, and uses a function synthesis module to generate compilable code that uses libFuzzer to fuzz the target function. To mitigate the slowdown introduced by dynamic binary instrumentation (DBI), Chen et al. [79] built the PTrix system to replace DBI with Intel Processor Trace (PT). PTrix fully unleashes the benefits of PT with three novel designs. First, PTrix introduces a scheme that highly parallelizes the processing of the PT trace and the target program execution. Second, it directly takes the decoded PT trace as feedback for fuzzing, avoiding the expensive reconstruction of code coverage information. Third, PTrix maintains a new type of feedback that is stronger than edge-based code coverage, which helps reach new code and defects that existing fuzzers may miss.

Part I

Runtime Protections Against In-Process Abuse

Chapter 3

Code Reuse Exploit Mitigations

3.1 Compiler-assisted Code Randomization

3.1.1 Background

To fulfill our goal of generic, transparent, and fast fine-grained code randomization at the client side, there is a range of possible solutions that one may consider. In this section, we discuss why existing solutions are not adequate, and provide some details about the compiler toolchain we used.

3.1.1.1 The Need for Additional Metadata

Static binary rewriting techniques [55, 199, 209] face significant challenges due to indirect control flow transfers, jump tables, callbacks, and other code constructs that result in incomplete or inaccurate control flow graph extraction [120, 163, 200]. More generally applicable techniques, such as in-place code randomization [131, 156], can be performed even with partial disassembly coverage, but can only apply narrow-scoped code transformations, thereby leaving parts of the code non-randomized (e.g., complete basic block reordering is not possible). On the other hand, approaches that rely on dynamic binary rewriting to alleviate the inaccuracies of static binary rewriting [92, 119, 179, 209] suffer from increased runtime overhead.

A relaxation that could be made is to ensure programs are compiled with debug symbols and relocation information, which can be leveraged at the client side to perform code randomization. Symbolic information facilitates runtime debugging by providing details about the layout of objects, types, addresses, and lines of source code. On the other hand, it does not include lower-level information about complex code constructs, such as jump tables and callback routines, nor does it contain metadata about (handwritten) assembly code [137]. To make matters worse, modern compilers attempt to generate cache-friendly code by inserting alignment and padding bytes between basic blocks, functions, objects, and even between jump tables and read-only data [194]. Various performance optimizations, such as profile-guided [35] and link-time [123] optimization, complicate code extraction even further—Bao et al. [56], Rui and Sekar [162], and others [50, 99, 117] have repeatedly demonstrated that accurately identifying functions (and their boundaries) in binary code is a challenging task.

In the same vein, Williams-King et al. [203] implemented Shuffler, a system that relies on symbolic and relocation information (provided by the compiler and linker) to disassemble code and identify all code pointers, with the goal of performing live code re-randomization. Despite the impressive engineering effort, its authors admit that they “encountered myriad special cases” related to inaccurate or missing metadata, special types of symbols and relocations, and jump table entries and invocations. Considering that these numerous special cases occurred just for a particular compiler (GCC), platform (x86-64 Linux), and set of (open-source) programs, it is reasonable to expect that similar issues will arise again when moving to different platforms and more complex applications.

Based on the above, we argue that relying on existing compiler-provided metadata is not a viable approach for building a generic code transformation solution. More importantly, the complexity involved in the transformation process performed by the aforementioned schemes (e.g., static code disassembly, control flow graph extraction, runtime analysis, heuristics) is far from what could be considered reasonable for a fast and robust client-side rewriter, as discussed in Section ??.
Consequently, we opt for augmenting binaries with just the necessary domain-specific metadata needed to facilitate safe and generic client-side code transformation (and hardening) without any further binary code analysis.

3.1.1.2 Fixups and Relocations

When performing code randomization, machine instructions with register or immediate operands do not require any modification after they are moved to a new (random) location. In contrast, if an operand contains a (relative or absolute) reference to a memory location, then it has to be adjusted according to the instruction’s new location, the target’s new location, or both. (Note that a similar process takes place during the late stages of compilation.)

Focusing on LLVM, whenever a value that is not yet concrete (e.g., a memory location or an external symbol) is encountered during the instruction encoding phase, it is represented by a placeholder value, and a corresponding fixup is emitted. Each fixup contains information on how the placeholder value should be rewritten by the assembler when the relevant information becomes available. During the relaxation phase [57, 135], the assembler modifies the placeholder values according to their fixups, as they become known to it. Once relaxation completes, any unresolved fixups become relocations, stored in the resulting object file.

[Figure 3.1 (figure omitted): Example of the fixup and relocation information that is involved during the compilation and linking process. The figure juxtaposes an object file (code at 0x5A78-0x5A9D, containing four compiler-emitted fixups, including the call someFunc operand at 0x5A7F and a jmp short) with the final executable (the same code relocated to 0x412D58-0x412D7D). The object file’s relocation table for the .text section holds a single entry: OFFSET 0x5a7f, TYPE R_X86_64_PC32, VALUE someFunc-0x4.]

Figure 3.1 shows a code snippet that contains several fixups and one relocation. The left part corresponds to an object file after compilation, whereas the right one depicts the final executable after linking. Initially, there are four fixups (underlined bytes) emitted by the compiler. As the relocation table shows, however, only a single relocation (which corresponds to fixup 1) exists for address 0x5a7f, because the other three fixups were resolved by the assembler. Henceforth, we explicitly refer to relocations in object files as link-time relocations—i.e., fixups that are left unresolved after the assembly process (to be handled by the linker). Similarly, we refer to relocations in executable files (or dynamic shared objects) as load-time relocations—i.e., relocations that are left unresolved after linking (to be handled by the dynamic linker/loader). Note that in this particular example, the final executable does not contain any load-time relocations, as relocation 1 was resolved during linking (0x4349 → 0x6308d).
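The mechanics of resolving such a call fixup can be sketched in a few lines. The helper below (an illustration with invented toy offsets, not LLVM's or CCR's code) patches the placeholder operand of an x86-64 `call` (opcode E8 with a 32-bit PC-relative operand) the way an assembler or rewriter would once the final addresses are known.

```python
import struct

def resolve_call_fixup(code: bytearray, insn_off: int, target_off: int) -> None:
    """Patch the rel32 operand of an E8 call once its target is known.

    The displacement is measured from the end of the 5-byte instruction,
    mirroring how the assembler rewrites placeholder values during relaxation.
    """
    assert code[insn_off] == 0xE8, "expected a call rel32 instruction"
    disp = target_off - (insn_off + 5)
    struct.pack_into("<i", code, insn_off + 1, disp)

# A call whose operand is still a zeroed placeholder, followed by padding;
# the (toy) callee starts 16 bytes into the buffer.
code = bytearray(b"\xE8\x00\x00\x00\x00" + b"\x90" * 11)
resolve_call_fixup(code, insn_off=0, target_off=16)
# The operand becomes 16 - 5 = 11, encoded little-endian.
```

Exactly the same arithmetic, applied with the final load addresses, is what turns the object-file operand into the one seen in the linked executable.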


In summary, load-time relocations are a subset of link-time relocations, which are a subset of all fixups. Unfortunately, even if link-time relocations are completely preserved by the linker, they are not sufficient for performing fine-grained code randomization. For instance, fixup 2 is resolved early by the assembler, but is essential for basic block reordering, as the respective single-byte jmp instruction may have to be replaced by a four-byte one—if the target basic block is moved more than 127 bytes forward or 126 bytes backwards from the jmp instruction itself. Evidently, comprehensive fixups are pivotal pieces of information for fine-grained code shuffling, and should be promoted to first-class metadata by modern toolchains in order to provide support for generic, transparent, and compatible code diversification.
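The widening decision can be stated precisely in code. The sketch below (illustrative, not CCR's rewriter) re-encodes a direct jmp after its basic block has moved: it keeps the two-byte short form (opcode EB plus a single-byte displacement) when the new target still fits, and otherwise emits the five-byte near form (opcode E9 plus a four-byte displacement).

```python
import struct

def encode_jmp(insn_off: int, target_off: int) -> bytes:
    """Re-encode a direct jmp after its basic block has been moved.

    rel8 is measured from the end of the 2-byte short form, so a target
    farther than +127/-128 bytes from that point no longer fits and the
    5-byte near form (E9 rel32) must be emitted instead.
    """
    disp8 = target_off - (insn_off + 2)
    if -128 <= disp8 <= 127:
        return bytes([0xEB, disp8 & 0xFF])      # jmp short rel8
    disp32 = target_off - (insn_off + 5)
    return b"\xE9" + struct.pack("<i", disp32)  # jmp near rel32

short_form = encode_jmp(0x100, 0x10F)  # nearby target: short form suffices
wide_form = encode_jmp(0x100, 0x300)   # block moved too far: widen to 5 bytes
```

Because widening an instruction shifts everything after it, the rewriter must iterate this computation until the displacements stabilize, which is precisely the kind of relaxation the assembler performs with full fixup information.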

3.1.2 Overall Approach

The design of CCR is driven by the following two main goals, which so far have been limiting factors for the actual deployment of code diversification in real-world environments:

Practicality: From a deployment perspective, a practical code diversification scheme should not disrupt existing features and software distribution models. Requiring software vendors to generate a diversified copy per user, or users to recompile applications from source code or transform them using complex binary analysis tools, has proven to be an unattractive model for the deployment of code diversification.

Compatibility: Code randomization is a highly disruptive operation that should be safely applicable even to complex programs and code constructs. At the same time, code randomization inherently clashes with well-established operations that rely on software uniformity. These include security and quality monitoring mechanisms commonly found in enterprise settings (e.g., code integrity checking and whitelisting), as well as crash reporting, diagnostics, and self-updating mechanisms.

Augmenting compiled binaries with metadata that enables their subsequent randomization at installation or load time is an approach fully compatible with existing software distribution norms. The vast majority of software is distributed in the form of compiled binaries, which are carefully generated, tested, signed, and released through official channels by software vendors. On each endpoint, at installation time, the distributed software typically undergoes some post-processing and customization, e.g., its components are decompressed and installed in appropriate locations according to the system’s configuration, and sometimes they are even further optimized according to the client’s architecture, as is the case with Android’s ahead-of-time compilation [188] or the

19 CHAPTER 3. CODE REUSE EXPLOIT MITIGATIONS

Linux kernel’s architecture-specific optimizations [86]. Under this model, code randomization can fittingly take place as an additional post-processing task during installation. As an alternative, randomization can take place at load time, as part of the modifications that the loader makes to code and data sections for processing relocations [158]. However, to avoid extensive user-perceived delays due to the longer rewriting time required for code randomization, a more viable approach would be to maintain a supply of pre-randomized variants (e.g., an OS service can generate them in the background), which can then be instantly picked by the loader.

Note that this distribution model is followed even for open-source software, as installing binary executables through package management systems (e.g., apt-get) offers unparalleled convenience compared to having to compile each new or updated version of a program from scratch. More importantly, under such a scheme, each endpoint can choose among different levels of diversification (hardening vs. performance), by taking into consideration the anticipated exposure to certain threats [108], and the security properties of the operating environment (e.g., private intranet vs. Internet-accessible setting).

The embedded metadata serves two main purposes. First, it allows the safe randomization of even complex software without relying on imprecise methods and incomplete symbolic or debug information. Second, it forms the basis for reversing any applied code transformation when needed, to maintain compatibility with existing mechanisms that rely on referencing the original code that was initially distributed.

Figure 3.2 presents a high-level view of the overall approach. The compilation process remains essentially the same, with just the addition of metadata collection and processing steps during the compilation of each object file and the linking of the final master executable.
The executable can then be provided to users and endpoints through existing distribution channels and mechanisms, without requiring any changes. As part of the installation process on each endpoint, a binary rewriter generates a randomized version of the executable by leveraging the embedded metadata. In contrast to existing code diversification techniques, this transformation does not involve any complex and potentially imprecise operations, such as code disassembly, symbolic information parsing, reconstruction of relocation information, introduction of pointer indirection, and so on. Instead, the rewriter performs simple transposition and replacement operations based on the provided metadata, treating all code sections as raw binary data. Our prototype implementation, discussed in detail in Section ??, currently supports fine-grained randomization at the granularity of functions and basic blocks, is oblivious to any applied compiler optimizations, and supports static executables, shared objects, PIC, partial/full


Figure 3.2: Overview of the proposed approach. A modified compiler collects metadata for each object file (1), which is further updated and consolidated at link time into a single extra section in the final executable (2). At the client side, a binary rewriter leverages the embedded metadata to rapidly generate randomized variants of the executable (3).

RELRO [129], exception handling, LTO, and even CFI.

3.1.3 Compiler-level Metadata

Our work is based on LLVM [42], which is widely used in both academia and industry, and we picked the ELF format and the x86-64 architecture as our initial target platform. Figure 3.3 illustrates an example of the ELF layout generated by Clang (LLVM’s native C/C++/Objective-C compiler).

3.1.3.1 Layout Information

Initially, the range of the transformable area is identified, as shown in the left side of Figure 3.3. This area begins at the offset of the first object in the .text section and comprises all user-defined objects that can be shuffled. We modified LLVM to append a new section named .rand in every compiled object file so that the linker can be aware of which objects have embedded metadata. In our current prototype, we assume that all user-defined code is consecutive. Although it is possible to have intermixed code and data in the same section, we have ignored this case for now, as by default LLVM does not mix code and data when emitting x86 code. This is the case for other modern compilers too—Andriesse et al. [49] could identify 100% of the instructions when disassembling GCC and Clang binaries (but CFG reconstruction still remains challenging).


Figure 3.3: An example of the ELF layout generated by Clang (left), with the code of a particular function expanded (center and right). The leftmost and rightmost columns in the code listing (“BBL” and “Fragment”) illustrate the relationships between basic blocks and LLVM’s various kinds of fragments: data (DF), relaxable (RF), and alignment (AF). Data fragments are emitted by default, and may span consecutive basic blocks (e.g., BBL #1 and #2). The relaxable fragment #1 is required for the branch instruction, as it may be expanded during the relaxation phase. The padding bytes at the bottom correspond to a separate fragment, although they do not belong to any basic block.

When loading a program, a sequence of startup routines assists in bootstrap operations, such as setting up environment variables and reaching the first user-defined function (e.g., main()). As shown in Figure 3.3, the linker appends several object files from libc into the executable for this purpose (crt1.o, crti.o, crtbegin.o). Additional object files include process termination operations (crtn.o, crtend.o). Currently, these automatically-inserted objects are excluded from transformation; this is an implementation issue that can be easily addressed by ensuring that a set of augmented versions of these objects is made available to the compiler. At program startup, the function _start() in crt1.o passes five parameters to __libc_start_main(), which in turn invokes the program’s main() function. One of the parameters corresponds to a pointer to main(), which we need to adjust after main() has been displaced. The metadata we have discussed so far is updated at link time, according to the final layout of all objects. The upper part of Table 3.1 summarizes the collected layout-related metadata.


3.1.3.2 Basic Block Information

The bulk of the collected metadata is related to the size and location of objects, functions, basic blocks (BBL), and fixups, as well as their relationships. For example, a fixup inherently belongs to a basic block, a basic block is a member of a function, and a function is included in an object. The LLVM backend goes through a very complex code generation process which involves all scheduled module and function passes for emitting globals, alignments, symbols, constant pools, jump tables, and so on. This process is performed according to an internal hierarchical structure of machine functions, machine basic blocks, and machine instructions. The machine code (MC) framework of the LLVM backend operates on these structures and converts machine instructions into the corresponding target-specific binary code. This involves the EmitInstruction() routine, which emits one chunk of code at a time, called a fragment. As a final step, the assembler (MCAssembler) assembles those fragments in a target-specific manner, decoupled from any logical hierarchical structure; that is, the unit of the assembly process is the fragment. We internally label each instruction with the corresponding parent basic block and function. The collection process continues until instruction relaxation has completed, to capture the emitted bytes that will be written into the final binary. These labels are not essential as part of the final metadata, however, and can be discarded. As shown in Table 3.1, we only keep information about the lower boundary of each basic block, which can be the end of an object (OBJ), the end of a function (FUN), or the beginning of the next basic block (BBL). Going back to the example of Figure 3.3, we identify the three types of fragments (data, relaxable, and alignment), shown at the right side of the figure.
The center of the figure shows the emitted bytes as generated by Clang, and their corresponding code as extracted by the IDA Pro disassembler, for the j-th function of the i-th object in the code section. The function consists of five basic blocks, eight fragments, and contains eleven fixups (underlined bytes). Note that relaxable fragments are generated only for branch instructions and contain just a single instruction. Alignment fragments correspond to padding bytes. In this example, there are two alignment fragments (#3 and #7): one between basic blocks #2 and #3, and one between function j and the following function. For metadata compactness, alignment fragments are recorded as part of the metadata for their preceding basic blocks. The rest of the instructions are emitted as part of data fragments.

Another consideration is fall-through basic blocks. A basic block terminated with a conditional branch implicitly falls through to its successor depending on the evaluation of the condition. In Figure 3.3, the last instruction of BBL #0 jumps to BBL #2 when the zero flag is set, or control falls through to BBL #1. Such fall-through basic blocks must be marked so that they can be treated appropriately during reordering.

Table 3.1: Collected randomization-assisting metadata

Metadata      Collected Information                   Collection time
Layout        Section offset to first object          Linking
              Section offset to main()                Linking
              Total code size for randomization       Linking
Basic Block   BBL size (in bytes)                     Linking
(BBL)         BBL boundary type (BBL, FUN, OBJ)       Compilation
              Fall-through or not                     Compilation
              Section name that BBL belongs to        Compilation
Fixup         Offset from section base                Linking
              Dereference size                        Compilation
              Absolute or relative                    Compilation
              Type (c2c, c2d, d2c, d2d)               Linking
              Section name that fixup belongs to      Compilation
Jump Table    Size of each jump table entry           Compilation
              Number of jump table entries            Compilation
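The per-block records of Table 3.1 could be represented along the following lines. This is an illustrative sketch; the field names and example values are ours, not CCR’s on-disk metadata format.

```python
# Sketch of the hierarchical metadata of Table 3.1 (illustrative names).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Fixup:
    offset: int              # offset from section base
    deref_size: int          # dereference size: 1, 2, 4, or 8 bytes
    is_relative: bool        # relative vs. absolute
    kind: str                # "c2c", "c2d", "d2c", or "d2d"
    section: str             # section name (dropped after link-time dedup)

@dataclass
class BasicBlock:
    size: int                # in bytes, incl. any trailing alignment fragment
    boundary: str            # lower-boundary type: "BBL", "FUN", or "OBJ"
    falls_through: bool      # ends in a conditional branch w/ implicit successor
    section: str
    fixups: List[Fixup] = field(default_factory=list)

# BBL #0 of Figure 3.3 ends with a conditional branch (jz), so it falls
# through to BBL #1; its lower boundary is simply the next basic block.
bbl0 = BasicBlock(size=13, boundary="BBL", falls_through=True, section=".text")
bbl0.fixups.append(Fixup(offset=0xABD4, deref_size=4, is_relative=True,
                         kind="c2d", section=".text"))  # offset illustrative
assert bbl0.falls_through and bbl0.fixups[0].kind == "c2d"
```

Because only lower boundaries are recorded, function and object sizes can be derived by summing basic block sizes up to the next FUN or OBJ boundary.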

3.1.3.3 Fixup Information

Evaluating fixups and generating relocation entries are part of the last processing stage during layout finalization, right before emitting the actual code bytes. Note that this phase is orthogonal to the optimization level used, as it takes place after all LLVM optimizations and passes are done. Each fixup is represented by its offset from the section’s base address, the size of the target (1, 2, 4, or 8 bytes), and whether it represents a relative or absolute value. As shown in Table 3.1, we categorize fixups into four groups, similar to the scheme proposed by Wang et al. [198], depending on their location (source) and the location of their target (destination): code-to-code (c2c), code-to-data (c2d), data-to-code (d2c), and data-to-data (d2d). We define data as a universal region that includes all other sections except the .text section. This classification helps in increasing the speed of binary rewriting when patching fixups after randomization.
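As a toy illustration of how this classification short-circuits work at rewriting time (the dictionary layout and values are ours, not CCR’s metadata format):

```python
# Sketch: after randomization, only fixups whose source or destination
# lies in the code region need patching; d2d fixups can be skipped
# wholesale, which is where the four-way classification pays off.

fixups = [
    {"kind": "c2c", "offset": 0x40ABDE},  # e.g., a call into another function
    {"kind": "c2d", "offset": 0x40ABD4},  # code referencing .rodata
    {"kind": "d2c", "offset": 0x4A39A0},  # e.g., a jump table entry
    {"kind": "d2d", "offset": 0x4CA340},  # data-to-data: layout unchanged
]

needs_patch = [f for f in fixups if f["kind"] != "d2d"]
assert len(needs_patch) == 3
```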

3.1.3.4 Jump Table Information

Due to the complexity of some jump table code fragments, extra metadata needs to be kept for their correct handling during randomization. For non-PIC/PIE (position-independent code/executable) binaries, the compiler generates jump table entries that point to targets using their absolute


address. In such cases, it is trivial to update these destination addresses based on their corresponding fixups that already exist in the data section. In PIC executables, however, jump table entries correspond to relative offsets, which remain the same irrespective of the executable’s load address. Figure 3.4 shows the code generated for a jump table when compiled without and with the PIC/PIE option. In the non-PIC case, the jmp instruction directly jumps to the target location (1) by dereferencing the value of an 8-byte absolute address (2) according to the index register rdx, as the address of the jump table is known at link time (0x4A39A0). On the other hand, the PIC-enabled code needs to compute the target with a series of arithmetic instructions. It first loads the base address of the jump table into rax (3), then reads from the table the target’s relative offset and stores it in rcx, and finally computes the target’s absolute address (4) by adding the relative offset to the table’s base address.

Figure 3.4: Example of jump table code generated for non-PIC and PIC binaries.

To appropriately patch such jump table constructs, for which no additional information is emitted by the compiler, the only extra information we must keep is the number of entries in the table, and the size of each entry. This information is kept along with the rest of the fixup metadata, as shown in Table 3.1, because the relative offsets in the jump table entries should be updated after randomization according to the new locations of the corresponding targets.

3.1.4 Link-time Metadata Consolidation

The main task of the linker is to merge multiple object files into a single executable. The linking process consists of three main tasks: constructing the final layout, resolving symbols, and updating relocation information. First, the linker maps the sections of each object into their corresponding locations in the final sections of the executable. During this process, alignments are adjusted and the

size of extra padding for each section is decided. Then, the linker populates the symbol table with the final location of each symbol after the layout is finalized. Finally, it updates all relocations created by the assembler according to the final locations of those resolved symbols. These operations influence the final layout, and consequently affect the metadata that has already been collected at this point. It is thus crucial to update the metadata according to the final layout that is decided at link time.

Our CCR prototype is based on the GNU gold ELF linker that is part of binutils. Gold aims to achieve faster linking times compared to the GNU linker (ld), as it does not rely on the standard binary file descriptor (BFD) library. Additional advantages include lower memory requirements and parallel processing of multiple object files [190]. Figure 3.5 provides an overview of the linking process and the corresponding necessary updates to the collected metadata. Initially, the individual sections of each object are merged into a single one, according to the naming convention (1). For example, the two code sections .text.obj1 and .text.obj2 of the two object files are combined into a single .text section. Similarly, the metadata from each object is extracted and incorporated into a single section, and all addresses are

updated according to the final layout (2). As part of the section merging process, the linker introduces padding bytes between objects in the same section (3). At this point, the size of the basic block at the end of each object file has to be adjusted by increasing it according to the padding size. This is similar to the treatment of alignment bytes within an object file, which are considered part of the preceding basic block. Note that we do not need to update anything related to whole functions or objects, as our representation of the layout relies solely on basic blocks. Updating the size of the basic blocks that are adjacent to padding bytes is enough for deriving the final size of functions and objects. Once the layout is finalized and symbols are resolved, the linker updates the relocations recorded by the assembler (4). Any fixups that were already resolved at compilation time are not available in this phase, and thus the corresponding metadata remains unchanged, while the rest is updated

accordingly. Finally, the aggregation of metadata is completed (5) by updating the binary-level metadata discussed in Section 3.1.3, including the offset to the first object, the total code size for transformation, and the offset to the main function (if any).

A special case that must be considered is that a single object file may contain multiple .text, .rodata, .data or .data.rel.ro sections. For instance, C++ binaries often have several code and data sections according to a name mangling scheme, which enables the use of the same identifier in different namespaces. The compiler blindly constructs these sections without considering any possible redundancy, as it can only process the code of a single object file at a time. In


turn, when the linker observes redundant sections, it nondeterministically keeps one of them and discards the rest [124]. This deduplication process can cause discrepancies in the layout and fixup information kept as part of our metadata, and thus the corresponding information about all removed sections is discarded at this stage. This process is facilitated by the section name information that is kept for basic blocks and fixups during compilation. Note that section names are optional attributes required only at link time. Consequently, after deduplication has completed, any remaining section name information about basic blocks and fixups is discarded, further reducing the size of the final metadata.

Figure 3.5: Overview of the linking process. Per-object metadata is consolidated into a single section.
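The padding adjustment in step (3) can be sketched as follows; the data structures and helper are illustrative (CCR stores this information in its consolidated metadata section, not as Python lists).

```python
# Sketch: when the linker inserts alignment padding between merged
# objects, the last basic block of each object absorbs it, so function
# and object sizes can still be derived from basic block sizes alone.

def absorb_padding(objects, align=16):
    """objects: list of per-object BBL size lists; returns merged sizes + total."""
    merged, offset = [], 0
    for bbls in objects:
        bbls = list(bbls)
        end = offset + sum(bbls)
        pad = (-end) % align          # padding the linker would insert here
        bbls[-1] += pad               # credit it to the object's last BBL
        merged.extend(bbls)
        offset = end + pad
    return merged, offset

# Two objects with blocks of [7, 9] and [5] bytes, 16-byte alignment:
# the second object's single block grows from 5 to 16 bytes.
merged, total = absorb_padding([[7, 9], [5]], align=16)
assert merged == [7, 9, 16] and total == 32
```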


3.1.5 Code Randomization

To strike a balance between performance and randomization entropy, we have opted to maintain some of the constraints imposed by the code layout decided at link time, due to short fixup sizes and fall-through basic blocks. As mentioned earlier, these constraints can be relaxed by modifying the width of short branches and adding new branches when needed. However, our current choice has the simplicity and performance benefit of keeping the total size of the code the same, which helps in maintaining caching characteristics due to spatial locality. To this end, we prioritize basic block reordering at the intra-function level, and then proceed with function-level reordering. Distance constraints due to fixup size may occur in both function and basic block reordering. For instance, it is typical for functions to contain a short fixup that refers to a different function, as part of a jump instruction used for tail-call optimization. In the rewriting phase, basic block reordering proceeds without any constraints if: (a) the parent function of a basic block does not have any distance-limiting fixup, or (b) the size of the function allows reaching all targets of any contained short fixups. Note that the case of multiple functions sharing basic blocks, which is a common compiler optimization, is fully supported. From an implementation perspective, the simplest solution for fall-through basic blocks is to assume that both child blocks will be displaced away, in which case an additional jump instruction must be inserted for the previously fall-through block. From a performance perspective, however, a better solution is to avoid adding any extra instructions and keep either of the two child basic blocks adjacent to its parent; this can be safely done by inverting the condition of the branch when needed. In our current implementation we have opted for this second approach, but have left branch inversion as part of our future work.
As shown in Section 3.1.6.5, this decision does not impact the achieved randomization entropy. After the new layout is available, it is essential to ensure fixups are updated accordingly. We have classified fixups into four categories: c2c, c2d, d2c, and d2d. In the case of d2d fixups, no update is needed because we diversify only the code region, but we still include them as part of the metadata in case they are needed in the future. The dynamic linking process relies on c2d (relative) fixups to adjust pointers to shared libraries at runtime.
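A simplified model of this reordering policy, gluing each fall-through block to its parent rather than performing branch inversion (structures and names are ours, and chained fall-throughs are not modeled):

```python
# Sketch: shuffle basic blocks within each function first, then shuffle
# whole functions, while keeping every fall-through block adjacent to
# its parent so no extra jump instruction is needed.
import random

def reorder(functions, rng):
    """functions: list of lists of (name, falls_through) blocks."""
    shuffled_funcs = []
    for blocks in functions:
        # Merge each fall-through parent with its successor into one unit.
        units, i = [], 0
        while i < len(blocks):
            if blocks[i][1] and i + 1 < len(blocks):
                units.append([blocks[i], blocks[i + 1]]); i += 2
            else:
                units.append([blocks[i]]); i += 1
        rng.shuffle(units)                    # intra-function BBL reordering
        shuffled_funcs.append([b for u in units for b in u])
    rng.shuffle(shuffled_funcs)               # then function-level reordering
    return shuffled_funcs

rng = random.Random(1)
layout = reorder([[("A", True), ("B", False), ("C", False)]], rng)
names = [n for n, _ in layout[0]]
# "A" falls through to "B", so "B" must still directly follow "A".
assert names.index("B") == names.index("A") + 1
```

A full implementation would additionally reject permutations that put a short fixup’s target out of its reachable range, per constraints (a) and (b) above.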

3.1.6 Experimental Evaluation

We evaluated our CCR prototype in terms of runtime overhead, file size increase, randomization entropy, and other characteristics. Our experiments were performed on a system equipped with an


Figure 3.6: Performance overhead of fine-grained (function vs. basic block reordering) randomization for the SPEC CPU2006 benchmark tests.

Intel i7-7700 3.6GHz CPU, 32GB RAM, running the 64-bit version of Ubuntu 16.04.

3.1.6.1 Randomization Overhead

We started by compiling the entire SPEC CPU2006 benchmark suite (20 C and C++ programs) with our modified LLVM and gold linker, using the -O2 optimization level and without the PIC option. Next, we generated 20 different variants of each program: 10 using function reordering, and 10 more using both function and basic block reordering. Each run was performed 10 times for the original programs, and a single time for each of the 20 variants. Figure 3.6 shows a boxplot of the runtime overhead for function reordering and basic block reordering. The dark horizontal line in each box corresponds to the median overhead value, which mostly ranges between zero and one percent across all programs. The top and bottom of each box correspond to the upper and lower quartile, while the whiskers correspond to the highest and lowest value, excluding outliers, which are denoted by small circles (there were 14 such cases out of the total 400 variants, exhibiting up to 7% overhead). Overall, the average performance overhead is negligible at 0.28%, with a standard deviation of 1.37. The average overhead per benchmark is reported in Table 3.2, which also includes further information about the layout and fixups of each program.

Interesting cases are mcf and milc, the variants of which consistently exhibit a slight performance improvement, presumably due to better cache locality (we performed an extra round of experiments to verify it). In contrast, xalancbmk exhibited a distinguishably high average overhead of 4.9%. Upon further investigation, we observed a significant increase in the number of L1 instruction cache misses for its randomized instances. Given that xalancbmk is one of the most complex benchmarks, with a large number of functions and heavy use of indirect control transfers, it seems that the disruption of cache locality due to randomization has a much more pronounced effect. For such cases, it may be worth exploring profile-guided randomization approaches that will preserve the code locality characteristics of the application.

3.1.6.2 ELF File Size Increase

Augmenting binaries with additional metadata entails the risk of increasing their size to levels that may become problematic. As discussed earlier, this was an issue that we took into consideration when deciding what information to keep, and we optimized the final metadata to include only the minimum amount of information necessary for code diversification. As shown in Table 3.2, the file size increase ranges from 1.68% to 20.86%, with an average of 11.46% (13.3% for the SPEC benchmarks only). We consider this a rather modest increase, and do not expect it to have any substantial impact on existing software distribution workflows. The Layout columns (Objs, Funcs, BBLs) show the number of object files, functions, and basic blocks in each program. As expected, the metadata size is proportional to the size of the original code. Note that the generated randomized variants do not include any of the metadata, so their size is the same as that of the original binary.

3.1.6.3 Binary Rewriting Time

We measured the rewriting time of our CCR prototype by generating 100 variants of each program and reporting the average processing time. We repeated the experiment twice, using function and basic block reordering, respectively. As shown in Table 3.2 (Rewriting columns), the rewriting process is very quick for small binaries, and the processing time increases linearly with the size of the binary. The longest processing time was observed for xalancbmk, which is the largest and most complex (in terms of number of basic blocks and fixups) among the tested binaries. All but four programs were randomized in under 9s, and more than half of them in under 1s. The reported numbers include the process of updating the debug symbols present in the .symtab section. As this is not needed for production (stripped) binaries, the rewriting time in practice will

be shorter; indicatively, for xalancbmk, it is 30% faster when compiled without symbols. Note that our rewriter is just a proof of concept, and further optimizations are possible. Currently, the rewriting process involves parsing the raw metadata, building it into a tree representation, resolving any constraints in the randomized layout, and generating the final binary. We believe that the rewriting speed can be further optimized by improving the logic of our rewriter’s randomization engine. Moving from Python to C/C++ is also expected to increase speed even further.

3.1.6.4 Correctness

To ensure that our code transformations do not affect in any way the correctness of the resulting executables, in addition to the SPEC benchmarks, we compiled and tested augmented versions of ten real-world applications. For example, we parsed the entire LLVM source code tree with a randomized version of ctags using the -R (recursive) option. The MD5 hash of the resulting index file, which was 54MB in size, was identical to the one generated using the original executable. Another experiment involved the command-line audio encoding tool oggenc, a large and quite complex program (58,413 lines of code) written in C [142], to convert a 44MB WAV file to the OGG format, which we then verified was correctly processed. Furthermore, we successfully compiled popular server applications (web, FTP, and SSH daemons), confirming that their variants did not malfunction when using their default configurations.

3.1.6.5 Randomization Entropy

We briefly explore the randomization entropy that can be achieved using function and basic block reordering, when considering the current constraints of our implementation. Let F_{ij} be the j-th function in the i-th object, f_i the number of functions in that object, and b_{ij} the number of basic blocks in the function F_{ij}. Suppose there are p object files comprising a given binary executable. The total number of functions q and basic blocks r in the binary can be written as

    q = \sum_{i=0}^{p-1} f_i    and    r = \sum_{i=0}^{p-1} \sum_{j=0}^{f_i - 1} b_{ij}.

Then, the number of possible variants with function reordering is q!, and with basic block reordering it is r!. Due to the large number of variants, let the randomization entropy E be the base-10 logarithm of the number of variants. In our case, we perform basic block randomization at the intra-function level first, followed by function reordering. Therefore, the entropy can be computed as follows:

    E = \log_{10}\Big( \Big( \prod_{i=0}^{p-1} \prod_{j=0}^{f_i - 1} b_{ij}! \Big) \cdot \Big( \sum_{i=0}^{p-1} f_i \Big)! \Big)
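Since these factorials are astronomically large, the entropy is more conveniently computed with log-factorials. A small Python sketch with toy inputs, using lgamma(n + 1) = ln n! and flattening the per-object structure into one list of per-function block counts:

```python
# Sketch: computing E = log10((prod_ij b_ij!) * (sum_i f_i)!) without
# materializing huge factorials, via the log-gamma function.
from math import lgamma, log

def log10_fact(n: int) -> float:
    return lgamma(n + 1) / log(10)        # log10(n!)

def entropy(bbl_counts_per_func) -> float:
    """bbl_counts_per_func: b_ij values flattened over all objects."""
    e = sum(log10_fact(b) for b in bbl_counts_per_func)  # intra-function orderings
    e += log10_fact(len(bbl_counts_per_func))            # function orderings
    return e

# Two functions with 3 and 4 basic blocks: (3! * 4!) * 2! = 288 variants.
e = entropy([3, 4])
assert abs(e - log(288) / log(10)) < 1e-9
```

The constrained formula below is obtained the same way, by substituting b_{ij} - x_{ij} and f_i - y_i for the unconstrained counts.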


Table 3.2: Experimental evaluation dataset and results (* indicates programs written in C++)

                   -------- Layout --------  ----------- Fixups -----------  ----- Size (KB) -----  Rewriting (sec)  --- Overhead ---  Entropy (log10)
Program            Objs   Funcs      BBLs    .text  .rodata  .data  .init_ar.  Orig.  Augm.  Increase   Func    BBL      Func     BBL     Func     BBL

400.perlbench        50   1,660    46,732   70,653   7,872   1,765      0     1,198  1,447   20.86%    7.69   8.05    -0.07%   0.32%    4,530   5,011
401.bzip2             7      71     2,407    2,421      75       0      0        90    101   12.80%    0.19   0.21    -0.23%   0.16%      100     157
403.gcc             143   4,326   118,397  189,543  84,357     367      0     3,735  4,465   19.54%   52.30  53.89     0.82%   0.91%   13,657  16,483
429.mcf              11      24       375      410       0       0      0        22     25   12.02%    0.08   0.09    -1.27%  -0.98%       23      44
433.milc             68     235     2,613    5,980      50      36      0       148    170   14.94%    0.48   0.50    -1.53%  -1.50%      456     600
444.namd*            23      95     7,480    8,170      24       0      0       312    345   10.49%    0.50   0.56     0.06%   0.07%      148     187
445.gobmk            62   2,476    25,069   44,136   1,377  21,400      0     3,949  4,116    4.23%   21.28  20.43     0.05%   0.35%    7,272   8,271
447.dealII*       6,295   6,788   100,185  103,641   7,954       1     45     4,217  4,581    8.65%   38.08  39.18     0.60%   0.52%   23,064  25,601
450.soplex*         299     889    13,741   15,586   1,561       0     61       467    531   13.76%    1.90   1.99     0.60%   0.28%    2,234   2,983
453.povray*         110   1,537    28,378   47,694  10,398     617      1     1,223  1,406   14.92%    5.67   5.88    -0.08%   0.50%    4,130   4,939
456.hmmer            56     470    10,247   14,265     798     156      0       343    400   16.53%    1.14   1.19     0.00%  -0.11%    1,042   1,313
458.sjeng           119     132     4,469    8,978     431       0      0       155    186   19.93%    0.50   0.53    -0.55%  -0.38%      221     334
462.libquantum       16      95     1,023    1,373     319       0      0        55     62   13.57%    0.19   0.19     0.40%  -0.24%      148     207
464.h264ref          42     518    14,476   23,180     320     321      0       698    782   12.01%    1.97   2.06     0.17%   0.00%    1,180   1,468
470.lbm               2      17       133      227       0       0      0        22     24    8.15%    0.06   0.06     0.25%   0.25%       14      24
471.omnetpp*        366   1,963    22,118   34,212   3,411     240     75       843    952   12.95%    4.73   4.94     0.03%   0.25%    5,560   6,983
473.astar*           14      88     1,116    1,369       6       1      0        56     62   12.03%    0.17   0.17     0.78%   1.08%      134     169
482.sphinx3          44     318     5,557    9,046      26     207      0       213    249   16.54%    0.68   0.72     0.02%   0.23%      656     815
483.xalancbmk*    3,710  13,295   130,691  142,128  19,936     323      0     6,217  6,836    9.95%   88.09  89.94     4.92%   4.89%   48,863  61,045
999.specrand          2       3        11       32       0       0      0         8      9   11.07%    0.03   0.03    -0.32%  -0.15%      0.8     1.6

ctags                50     423     8,550   13,618   3,733     507      0       795    851    7.03%    1.17   1.21      -       -         915   1,095
gzip                 34     103     2,895    5,466     466      21      0       267    289    8.13%    0.40   0.41      -       -         164     194
lighttpd             50     351     5,817    9,169     818      98      0       866    903    4.23%    0.96   0.99      -       -         732     891
miniweb               7      67     1,322    1,681      65      74      0        56     64   14.54%    0.19   0.19      -       -          94     113
oggenc                1     428     7,035    7,746     183   3,869      0     2,120  2,156    1.68%    2.79   2.74      -       -         942   2,285
openssh             122   1,135    18,262   29,815   2,442      90      0     2,144  2,248    4.83%    4.04   4.17      -       -       3,398   3,856
putty                79   1,288    20,796   31,423   3,126     118      0     1,069  1,184   10.78%    3.71   3.82      -       -       2,927   3,610
vsftpd               39     516     3,793    7,148      74       0      0       138    163   18.48%    0.65   0.67      -       -       1,147   1,227
libcapstone          42     402    21,454   47,299  13,002       5      0     2,777  2,931    5.69%   10.64  11.31      -       -         863   1,040
dosbox*             630   3,127    66,522  124,814  14,906   2,585     18    11,729 12,145    3.54%   37.59  38.12      -       -       9,503  10,941

However, as discussed in Section 3.1.5, our current implementation has some constraints regarding the placement of functions and basic blocks. Let the number of such function constraints in the ith object be y_i. Likewise, fall-through blocks are currently displaced together with their previous block, and, as with functions, in some cases the size of a fixup also constrains the maximum distance to the referred basic block. Let the number of such basic block constraints in function F_ij be x_ij. Given the above, the entropy in our case can be calculated as:

E = \log_{10} \Bigg( \bigg( \prod_{i=0}^{p-1} \prod_{j=0}^{f_i - 1} (b_{ij} - x_{ij})! \bigg) \cdot \bigg( \sum_{i=0}^{p-1} (f_i - y_i) \bigg)! \Bigg)

where p is the number of objects, f_i the number of functions in the ith object, and b_{ij} the number of basic blocks in function F_{ij}. Using the above formula, we report the randomization entropy for function and basic block

level randomization in Table 3.2. We observe that even for small executables like lbm, the number of variants exceeds 300 trillion. Consequently, our current prototype achieves more than enough entropy, which can be further improved by relaxing the above constraints (e.g., by separating fall-through basic blocks from their parent blocks, and adding a relaxation-like phase in the rewriter to alleviate existing fixup size constraints).
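To make the formula concrete, the sketch below evaluates E for arbitrary constraint counts. This is a toy illustration, not the CCR tool itself; `log10_factorial` and `randomization_entropy` are names chosen here for clarity, and the lgamma trick simply avoids materializing huge factorials.

```python
import math

def log10_factorial(n: int) -> float:
    # log10(n!) via lgamma, so the (possibly astronomical) factorial
    # never has to be computed as an integer
    return math.lgamma(n + 1) / math.log(10)

def randomization_entropy(blocks, func_constraints):
    """Entropy E (in log10) of the randomization space.

    blocks[i][j]     -- pair (b_ij, x_ij): basic blocks in function j of
                        object i, and how many of them are constrained.
    func_constraints -- y_i per object: placement-constrained functions.
    E = log10( (prod_i prod_j (b_ij - x_ij)!) * (sum_i (f_i - y_i))! )
    """
    e = 0.0
    movable_funcs = 0
    for obj, y in zip(blocks, func_constraints):
        movable_funcs += len(obj) - y       # f_i - y_i functions can move
        for b, x in obj:
            e += log10_factorial(b - x)     # block permutations per function
    return e + log10_factorial(movable_funcs)
```

For one object with two unconstrained functions of 5 (one block constrained) and 4 basic blocks, the space is 4! * 4! * 2! = 1,152 variants, i.e. E ≈ 3.06.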

3.2 Enabling Execute-Only Memory for COTS Binaries On AArch64

3.2.1 Overview

In the previous section, I showed that CCR makes fine-grained diversification accessible for industrial use. However, numerous studies [138, 178] have shown that attackers are still able to exploit highly randomized programs [101, 166, 182]. Such attacks increasingly leverage code-reuse techniques [63, 166, 204] to gain control of vulnerable programs, since contemporary software widely employs code integrity protections, such as data execution prevention (DEP) [195], that defeat traditional code injection attacks. In a code reuse attack, a target application’s control flow is manipulated so that snippets of existing code (called gadgets) are chained together and run to carry out malicious activities. Knowledge of the process memory layout is a key prerequisite for code-reuse attacks to succeed: attackers need to know the exact locations of binary instructions in memory to assemble the chain of gadgets. Commodity operating systems widely adopt address space layout randomization (ASLR), which loads code binaries at random memory locations unpredictable to attackers. Without knowing the locations of the needed code or gadgets, attackers cannot build code-reuse chains. However, memory disclosure attacks can use information leaks in programs to de-randomize code locations, thus defeating ASLR. Such attacks either read the program code (direct de-randomization) or read code pointers (indirect de-randomization). Given that deployed ASLR techniques only randomize the load address of a large chunk of data or code, leaking a single code pointer or a small sequence of code allows attackers to identify the corresponding chunk, infer its base address, and calculate the addresses of gadgets contained in the chunk. More sophisticated fine-grained ASLR techniques [92, 119, 125, 157, 199] aim at shuffling code blocks within the same module to make it more difficult for attackers to guess the location of binary instructions. Nevertheless, Snow et al. [182] have shown that memory disclosure vulnerabilities can bypass even the most sophisticated ASLR techniques.


Therefore, a robust and effective defense against code-reuse attacks should combine fine-grained ASLR with memory disclosure prevention. Some recent works proposed to prevent memory disclosures using compile-time techniques [67, 88, 89]. Despite their effectiveness, these solutions cannot cover COTS binaries, which cannot be easily recompiled and redeployed yet constitute a significant portion of the real-world applications that need protection. In this section, I present NORAX¹, which protects COTS binaries from code memory disclosure attacks. The goal of NORAX is to allow COTS binaries to take advantage of execute-only memory (XOM), a new security feature that recent AArch64 CPUs provide and that is widely available on today’s mobile devices. While useful for preventing memory disclosure-based code reuse [61, 182], XOM remains barely used by user and system binaries due to its requirement for recompilation. NORAX removes this requirement by automatically patching COTS binaries and loading their code into XOM. As a result, when used together with ASLR, NORAX enables robust mitigation against code reuse attacks for COTS binaries. NORAX consists of four major components: NDisassembler, NPatcher, NLoader, and NMonitor. The first two perform offline binary analysis and transformation: they convert COTS binaries built for AArch64 without XOM support into ones whose code can be protected by XOM at runtime. The other two components provide support for loading and monitoring the patched, XOM-enabled binaries at runtime. The design of NORAX tackles a fundamentally difficult problem: identifying data embedded in code segments, which is common in ARM binaries, and relocating such data elsewhere so that at runtime code memory pages can be made execute-only while all embedded data remains readable. We apply NORAX to Android system binaries running on Samsung Galaxy S6 and LG Nexus 5X devices.
The results show that NORAX on average slows down the transformed binaries by 1.18% and increases their memory footprint by 2.21%, suggesting that NORAX is practical for real-world adoption.

3.2.2 Background

NORAX makes use of the modern MMU support in the AArch64 architecture to create execute-only memory, a hardware feature now widely available yet virtually unused due to compatibility issues. To bridge this gap, NORAX reconstructs COTS binaries running on commodity Android

¹NORAX stands for NO Read And eXecute.


smartphones to enforce the R⊕X policy. In the rest of this section, we explain the necessary technical background and the challenges we face when building the system.

Table 3.3: Access permissions for stage 1 EL0 and EL1

AP[2:1]   EL0 Permission                    EL1 Permission
00        Executable-only                   Read/Write
01        Read/Write, Config-Executable     Read/Write
10        Executable-only                   Read-only
11        Read, Executable                  Read-only

AArch64 eXecute-Only Memory (XOM) Support: AArch64 defines four Exception Levels, EL0 to EL3. EL0 has the lowest execution privilege and usually runs normal user applications; EL1 usually hosts privileged systems, such as the operating system kernel; EL2 is designed for the hypervisor, while EL3 is for the secure monitor. To enforce instruction access permissions across Exception Levels, AArch64 leverages the Unprivileged eXecute-Never (UXN) bit, the Privileged eXecute-Never (PXN) bit, and two AP (Access Permission) bits defined in each page table entry [19]. For a user space program’s code page, the UXN bit is set to “0”, which allows code execution at EL0, while PXN is set to “1”, which disables execution at EL1. With these UXN and PXN settings, the instruction access permissions defined by the AP bits are shown in Table 3.3. It is easy to see that by setting the AP bits in a page table entry to “10”, the kernel running at EL1 will enforce execute-only permission for the user space program running at EL0. In other words, the corresponding memory page will only permit instruction fetches from user space, while all read/write data accesses will be denied. The kernel, however, still has read permission for that page, which means it can help the user space program read the intended memory area if necessary, after performing security checks beforehand.
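The encodings of Table 3.3 can be captured in a small lookup. This is an illustrative sketch, not kernel code; the tuples simply transcribe the table rows under the UXN=0, PXN=1 setting described above, and the function name is chosen here for clarity.

```python
# Stage-1 access permissions from Table 3.3 (UXN=0, PXN=1), keyed by AP[2:1].
# Each entry is (EL0 permission, EL1 permission).
AP_PERMISSIONS = {
    0b00: ("Executable-only", "Read/Write"),
    0b01: ("Read/Write, Config-Executable", "Read/Write"),
    0b10: ("Executable-only", "Read-only"),
    0b11: ("Read, Executable", "Read-only"),
}

def xom_for_el0(ap: int) -> bool:
    """True if this AP encoding makes the page execute-only at EL0."""
    el0_perm, _ = AP_PERMISSIONS[ap]
    return el0_perm == "Executable-only"
```

NORAX relies on AP = 10: the page is execute-only for user space, while the kernel keeps read access and can mediate legitimate reads after security checks.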

Position-Independent Binaries in Android: Position-independent code (PIC) is code a compiler generates for a module without assuming any absolute addresses; no matter where the module is loaded, it will function correctly. The mechanism works by replacing all memory accesses that use hard-coded addresses with PC-relative addressing instructions. Position-independent executables (PIE) are executables built from PIC. In Android, ever since version 5 (codename Lollipop), all executables are required to be compiled as PIE in order to fully enjoy the benefits of ASLR. To enforce this, Google removed support for non-PIE loading from the


Bionic Linker [6]. Nowadays, smartphones equipped with AArch64 CPUs are most likely running Android versions after Lollipop, meaning that the majority of their binaries, including both executables and shared libraries, are compiled to be position independent.

Code-Data Separation: To convert a stripped binary to be XOM-compatible, one fundamental problem must be solved: code-data separation. Note that separating data from code in COTS binaries is, in general, undecidable, as it is equivalent to the famous Halting Problem [200]. However, we found that in the scope of ARM64 position-independent binaries, which are prevalent on modern Android and iOS [10] phones, a practical solution is possible. A feasible solution must address the two following challenges.

3.2.2.1 Locating Data In Code Pages

We generally refer to data residing in executable code regions as executable data. There are two types of executable data allowed in ELF binaries.

• Executable sections: The first kind of data are ELF sections consisting of pure read-only data that may reside in executable memory. As defined by the contemporary ELF standard, a typical ELF file has two views: the linking view and the loading view, used by the linker and the loader respectively. The linking view consists of ELF sections (such as .text and .rodata). During linking, the static linker bundles sections with compatible access permissions to form a segment; in this case, executable implies readable. The segments then comprise the loading view. When an ELF is loaded, the loader simply loads each segment as a whole into memory and grants the corresponding access permissions. A standard ELF has two loadable segments. One is readable and executable, normally referred to as the “code segment”; it contains all the sections with instructions (.plt and .text, etc.) as well as read-only data (.gnu.hash, .dynsym, etc.). The other segment is readable and writable, referred to as the “data segment”; it contains the program data as well as other read/writable sections. For our goal of non-readable code, we mainly focus on the code segment. In this segment, generally only .plt and .text contain instructions used for program execution, but as explained before, they are mixed with other sections that only need to be read-only; we therefore cannot simply map the memory pages execute-only, as these sections often reside within the same page. For instance, Table 3.4 shows the code segment layout of an example program; all except the last two sections in this code segment are placed within the same page. To make


things more complex, the segment layout varies for different ELFs.
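The segment-level permission grouping described above can be sketched by decoding ELF program-header flag bits. The PF_X/PF_W/PF_R constants are fixed by the ELF specification; `perm_string` is an illustrative helper, not part of any toolchain.

```python
# ELF program-header flag bits, as defined by the ELF specification
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

def perm_string(p_flags: int) -> str:
    """Render p_flags the way loaded segments are usually shown (e.g. 'r-x')."""
    return (("r" if p_flags & PF_R else "-") +
            ("w" if p_flags & PF_W else "-") +
            ("x" if p_flags & PF_X else "-"))

# The classic two-segment layout described above:
code_segment = perm_string(PF_R | PF_X)   # .plt, .text, plus read-only sections
data_segment = perm_string(PF_R | PF_W)   # program data and writable sections
```

Because the loader grants one permission set per segment, every read-only section bundled into the "r-x" code segment becomes readable and executable, which is exactly what NORAX must undo.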

• Embedded data: The second kind of data in code pages is data embedded in the .text section. For optimization purposes, such as exploiting spatial locality, compilers emit data close to the code that accesses it. Note that although a recent study [49] shows that on modern x86 Linux compilers no longer generate binaries with code interleaved with data, we found this is not the case for ARM. We examined the system binaries extracted from a Nexus 5X smartphone running the factory image MMB29P; Table 3.5 reveals that code-data interleaving still prevails in modern ARM64 Linux binaries, indicating that this is a real-world problem to be solved.

Table 3.4: ELF sections that comprise the code segment of the example program; the highlighted ones are located in the same page.

Section Name          Address           Type
.interp               0000000000000238  PROGBITS
.note.android.ident   0000000000000250  NOTE
.note.gnu.build-id    0000000000000268  NOTE
.gnu.hash             0000000000000288  GNU_HASH
.dynsym               00000000000002c8  DYNSYM
.dynstr               00000000000005b0  STRTAB
.gnu.version          00000000000006e2  VERSYM
.gnu.version_r        0000000000000720  VERNEED
.rela.dyn             0000000000000740  RELA
.rela.plt             0000000000000830  RELA
.plt                  00000000000009a0  PROGBITS
.text                 0000000000000ab0  PROGBITS
.rodata               0000000000000f08  PROGBITS
.eh_frame_hdr         00000000000010d0  PROGBITS
.eh_frame             0000000000001110  PROGBITS
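Using the section start addresses from Table 3.4, a few lines of arithmetic confirm why the page holding .text cannot simply be mapped execute-only. This is a sketch assuming a 4 KiB page size; the variable names are illustrative.

```python
PAGE = 4096  # assumed 4 KiB translation granule

# Section start addresses from Table 3.4 (the example program's code segment)
sections = {
    ".interp": 0x238, ".note.android.ident": 0x250, ".note.gnu.build-id": 0x268,
    ".gnu.hash": 0x288, ".dynsym": 0x2c8, ".dynstr": 0x5b0,
    ".gnu.version": 0x6e2, ".gnu.version_r": 0x720, ".rela.dyn": 0x740,
    ".rela.plt": 0x830, ".plt": 0x9a0, ".text": 0xab0, ".rodata": 0xf08,
    ".eh_frame_hdr": 0x10d0, ".eh_frame": 0x1110,
}

text_page = sections[".text"] // PAGE
same_page_as_text = [name for name, addr in sections.items()
                     if addr // PAGE == text_page and name != ".text"]
# Every section except .eh_frame_hdr and .eh_frame starts in the same page
# as .text, so making that page execute-only would also hide read-only data.
```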

3.2.2.2 Updating Data References

In addition to finding the locations of executable data, we also need to relocate them and update their references. It turns out that reference updating is also non-trivial. In our system, as shown in Table 3.6, the majority of the ELF sections inside the code segment are expected to be relocated to a


Module                  #. of Real Inline   #. of Norax Inline   #. of Gadgets found in
                        Data (Byte)         Data (Byte)          extracted inline Data
vold                    0x0                 0x0                  0
                        0x8                 0x10                 0
toolbox                 0x14                0x30                 0
dhcpcd                  0x28                0x58                 4
Logd                    0x0                 0x0                  0
installd                0x0                 0x0                  0
app_process64 (zygote)  0x0                 0x0                  0
qseecomd                N/A                 0x0                  0
surfaceflinger          0x0                 0x0                  0
rild                    0x0                 0x0                  0
libart.so               0x4534              0x4654               8
libstagefright.so       0x128               0x148                5
libcrypto.so            0x9a8               0xa3c                25
libmedia.so             0xf60               0x10b2               0
libc.so                 0x12e4              0x13b4               5
libc++.so               0xc                 0xc                  0
libsqlite.so            0x3a4               0x57c                13
libbinder.so            0x0                 0x0                  0
libm.so                 0x4f3b              0x51bc               48
libandroid.so           0x0                 0x0                  0
Total                   0xc577              0xce1a               108

Table 3.5: Android Marshmallow system binaries that have embedded data in Nexus 5X.

                 # of binaries   # of binaries w/ embedded data   Percentage
/system/bin      237             167                              70.46%
/system/lib64    255             101                              39.61%
/vendor/lib64    111             39                               35.14%
/vendor/bin      4               2                                50.00%

different memory location so that appropriate permissions can be enforced. The sections that are left out, such as .interp and .note.*, are either accessed only by the OS or not used for program execution, so we can leave them untouched. The sections listed in Table 3.6 have complex interconnections, both internal and external. As shown in Table 3.7, various types of references exist in a given ELF. Due to this complexity, reference collection is conducted across the whole NORAX system by different components at different stages, both offline and at load time.

Table 3.6: Sections in the executable code page that are handled by NORAX

(.gnu).hash  .dynsym  .dynstr  .gnu.version  .rela.dyn  .rela.plt  .text (embedded data)  .rodata  .eh_frame  .eh_frame_hdr

Table 3.7: ELF section reference types

Reference Type                 Example
Intra-section references       .text refers to .text (embedded data)
Inter-section references       .text refers to .rodata
External references            dynamic linker refers to .dynsym, .rela.*
Multiple external references   C++ runtime/debugger refer to .eh_frame

3.2.3 Design

NORAX Workflow: NORAX consists of four major components: NDisassembler, NPatcher, NLoader, and NMonitor, as shown in Figure 3.7. The first two components perform offline binary analysis and transformation, and the last two provide runtime support for loading and monitoring the patched, XOM-compatible executables and libraries. In addition to disassembling machine code, NDisassembler scans for all executable code that needs to be protected by XOM. A major challenge it solves is identifying the various types of data that ARM compilers often embed in the code section, including jump tables, literals, and padding. Unlike typical disassemblers, NDisassembler has to


Figure 3.7: NORAX System Overview: the offline tools (left) analyze the input binary, locate all the executable data and their references (when available), and then statically patch the metadata to the raw ELF; the runtime components (right) create separate mappings for the executable data sections and update the recorded references as well as those generated at runtime.

precisely differentiate embedded data from code. Taking input from NDisassembler, NPatcher transforms the binary so that its embedded data are moved out of the code sections and their references are collected for later adjustment. After the transformation, NPatcher inserts a unique magic number into the binary so that it can be recognized by NLoader at load time. NPatcher also stores NORAX metadata in the binary, which will be used by NLoader and NMonitor. When a patched binary is loaded, NLoader takes over the loading process to (i) load the NORAX metadata into memory, (ii) adjust the NPatcher-collected references as well as the dynamically created references to the linker-related sections (e.g., .hash, .rela.*), and (iii) map all memory pages that contain code as execute-only. During runtime, NMonitor, an OS extension, handles read accesses to XOM. While such accesses are rare and may indicate attacks, they can also be legitimate, because NPatcher may not be able to completely recognize dynamic references to the relocated embedded data (e.g., those generated at runtime). When a data reference is missed, the access triggers an XOM violation, which NMonitor verifies and, if legitimate, facilitates the access to the corresponding data.
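The magic-number recognition step can be sketched as follows. Note that the marker value and the trailing record layout here are hypothetical; this section does not specify NORAX's actual on-disk metadata format, and the function names are invented for illustration.

```python
import struct

# Hypothetical marker: a 4-byte metadata offset followed by a 4-byte tag,
# appended to the end of the patched binary. NORAX's real format may differ.
NORAX_MAGIC = b"NRAX"  # assumed tag, not taken from the NORAX implementation

def append_marker(image: bytes, meta_offset: int) -> bytes:
    """Append a (hypothetical) metadata-offset record plus the magic tag."""
    return image + struct.pack("<I", meta_offset) + NORAX_MAGIC

def is_norax_patched(image: bytes) -> bool:
    """NLoader-style check: does the image end with the assumed tag?"""
    return len(image) >= 4 and image[-4:] == NORAX_MAGIC
```

A loader that finds the tag would then read the preceding offset to locate the metadata; an unpatched binary falls through to the normal loading path.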

3.2.3.1 NDisassembler: Static Binary Analyzer

NDisassembler first converts an input binary from machine code to assembly code and then performs the analysis needed to convert the binary into an XOM-compatible form. It disassembles the binary in a linear sweep fashion, which yields larger code coverage than recursive disassembling [49]. However, the larger code coverage comes at the cost of potentially mis-detecting embedded data as code (e.g., when such data happen to decode as syntactically correct instructions). NDisassembler addresses this problem via an iterative data recognition technique. Along with this process, it also finds instructions that reference embedded data. The data recognition technique is inspired by the following observations:

• Although it is difficult to find all instructions referencing some embedded data at a later point in the running program, it is relatively easy to locate the code that computes these references in the first place.

• To generate position-independent binaries, compilers can only use PC-relative addressing when emitting instructions that need to reference data inside binaries.

• The AArch64 ISA provides only two classes of instructions for obtaining PC-relative values, namely the ldr (literal) instructions and the adr(p) instructions.

NDisassembler uses Algorithm 1 to construct an initial set of embedded data (IS) and a set of reference sites (RS). For embedded data whose size cannot be precisely bounded, NDisassembler collects their seed addresses (AS) for further processing. As shown in Lines 5–9 of Algorithm 1, since the load size of an ldr-literal instruction is known, the identified embedded data are added to IS. The handling of adr instructions is more involved, as shown in Lines 10–27. NDisassembler first performs forward slicing on xn — the register that holds the embedded data address. All instructions that have data dependencies on xn are sliced, and xn is considered escaped if any of its data-dependent registers is either (i) stored to memory or (ii) passed to another function before being killed. In either case, the slicing also stops. If not all memory dereferences based on xn can be identified due to reference escaping, the size of the embedded data cannot be determined; NDisassembler then only adds the initial value of xn to AS, as a seed address (Lines 24–26). Lines 10–23 of Algorithm 1 deal with the sliced instructions. If a memory load based on xn is found, RS is updated with the location of the original address-taking instruction. Moreover, NDisassembler analyzes the address range of each memory load. Oftentimes the address range is bounded, because embedded data are mostly integer/floating-point constants or jump tables. In the former case, the start address of the memory load is typically xn plus some constant offset, while the load size is explicit from the memory load instruction. In the latter, well-known techniques for determining jump table sizes [84] are utilized. In both cases, the identified embedded data are

added into IS. However, if there is a single memory load whose address range cannot be bounded, NDisassembler adds the seed address to AS.

Algorithm 1: Initial embedded data and references collection

INPUT:
  code[] – an array of disassembly output
OUTPUT:
  IS – initial set of embedded data
  AS – the set of seed addresses for embedded data
  RS – the set of reference sites to embedded data

 1: procedure INITIALSETCOLLECTION
 2:   IS = {}
 3:   AS = {}
 4:   RS = {}
 5:   for each (ldr-literal addr) ∈ code[] at curr do
 6:     size = MemLoadSize(ldr)
 7:     IS = IS ∪ {addr, addr+1, ..., addr+size-1}
 8:     RS = RS ∪ {curr}
 9:   end for
10:   for each (adr xn, addr) ∈ code[] at curr do
11:     escaped, depInsts = ForwardSlicing(xn)
12:     unbounded = False
13:     for each inst ∈ depInsts do
14:       if inst is MemoryLoad then
15:         RS = RS ∪ {curr}
16:         addr_expr = MemLoadAddrExpr(inst)
17:         if IsBounded(addr_expr) then
18:           IS = IS ∪ AddrRange(addr_expr)
19:         else
20:           unbounded = True
21:         end if
22:       end if
23:     end for
24:     if escaped or unbounded then
25:       AS = AS ∪ {addr}
26:     end if
27:   end for
28: end procedure

Algorithm 2: Embedded data set expansion

INPUT:
  AS – the set of seed addresses for embedded data
  IS – initial set of embedded data
OUTPUT:
  DS – conservative set of embedded data

 1: procedure SETEXPANSION
 2:   DS = IS
 3:   for addr in AS do
 4:     c1 = BackwardExpand(addr, DS)
 5:     c2 = ForwardExpand(addr, DS)
 6:     DS = DS ∪ c1 ∪ c2
 7:   end for
 8: end procedure

If Algorithm 1 is not able to determine the sizes of all embedded data, the initial set (IS) is not complete. In this case, the seed addresses in AS are expanded using Algorithm 2 to construct an over-approximated set of embedded data (DS). The core functions are BackwardExpand (Line 4) and ForwardExpand (Line 5). The backward expansion starts from a seed address and walks backward from that address until it encounters a valid control-flow transfer instruction, i.e., an instruction that is either a direct control-flow transfer to a 4-byte aligned address in the address space or an indirect control-flow transfer. All bytes walked through are marked as data and added to DS. The forward expansion, on the other hand, walks forward from the seed address. It proceeds aggressively for a conservative inclusion of all embedded data, stopping only when it has strong indication that it has identified a valid code instruction. These indicators are one of the following: (i) a valid control-flow transfer instruction is encountered, (ii) a direct control-flow transfer target (originating from other locations) is reached, and (iii) an instruction is confirmed as the start of a function [162]. In the last case, comprehensive control-flow and data-flow properties such as parameter passing and callee saves are checked before validating an instruction as the start of a function.

Finally, DS contains nearly all embedded data that exists in the binary. Although we could further leverage heuristics to include undecodable instructions as embedded data, it is not necessary, because our conservative algorithms already cover the vast majority (if not all) of them, and the rest are mostly padding bytes which are never referenced. Theoretically, failure to include certain referenced embedded data could still happen if a chunk of data can be coincidentally decoded as a sequence of instructions that satisfies many code properties, but in our evaluation of over 300 stripped Android system binaries, we never encountered such a case. RS contains a large subset of reference sites to the embedded data. Since statically identifying all indirect or dynamic data references may not always be possible, NDisassembler leaves such cases to be handled by NMonitor.
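As a rough illustration of the ldr-literal pass of Algorithm 1 (Lines 5–9), the sketch below scans toy textual disassembly. The instruction format, the fixed load sizes, and the `collect_initial_sets` name are simplifications made here; the real NDisassembler works on full AArch64 semantics and also performs the adr slicing of Lines 10–27, which is omitted.

```python
import re

# Matches toy "ldr w<n>, 0x<addr>" / "ldr x<n>, 0x<addr>" literal loads.
LDR_LITERAL = re.compile(r"ldr\s+([wx])\d+,\s*0x([0-9a-f]+)$")

def collect_initial_sets(code):
    """Simplified Lines 5-9 of Algorithm 1 over (address, text) pairs."""
    IS, RS = set(), set()            # embedded-data byte addresses, ref sites
    for addr, text in code:
        m = LDR_LITERAL.match(text)
        if m:
            size = 4 if m.group(1) == "w" else 8   # w: 4-byte load, x: 8-byte
            target = int(m.group(2), 16)
            IS.update(range(target, target + size))  # mark loaded bytes as data
            RS.add(addr)                             # remember the reference site
    return IS, RS

code = [
    (0x400, "ldr x0, 0x500"),   # literal load -> 8 data bytes at 0x500
    (0x404, "add x1, x0, #1"),  # ordinary instruction, ignored
    (0x408, "ldr w2, 0x520"),   # literal load -> 4 data bytes at 0x520
]
IS, RS = collect_initial_sets(code)
```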

3.2.3.2 NPatcher: XOM Binary Patcher

With the input from NDisassembler, NPatcher transforms the binary in two steps: it first relocates data out of the code sections, and then collects the data references that must be adjusted at load time.

Data Relocation: An intuitive design choice is to move the executable data out of the code segment. But doing so affects backward compatibility, as the layout of the ELF and the offsets of its sections would change significantly. Another approach is to duplicate the executable data, but this would increase binary sizes and memory footprint significantly. Instead, NPatcher uses two different strategies to relocate executable data without modifying code sections or duplicating all read-only data sections. For data located in the code segment but separated from the code text (i.e., read-only data), NPatcher does not duplicate them in binaries but only records their offsets as metadata, which will be used by NLoader to map such data into read-only memory pages. For data mixed with code (i.e., embedded data), NPatcher copies them into a newly created data section at the end of the binary. The rationale behind the two strategies is that read-only data usually accounts for a large portion of the binary size, so duplicating it in the binary is wasteful and unnecessary. On the other hand, embedded data is usually small, and duplicating it in binaries does not cost much space. More importantly, this duplication is necessary for security reasons: without it, code surrounding data would have to be made readable, which reduces the effectiveness of XOM.
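The two strategies can be summarized in a small decision routine. This is illustrative only; the region classification and the metadata layout are assumptions for the sketch, not NPatcher's real on-disk format.

```python
def plan_relocation(region):
    """Decide how an NPatcher-style rewriter would treat a region.

    Read-only data separated from code text is only recorded as
    metadata (the loader later maps it read-only in place), while
    embedded data mixed with code is duplicated into a new section
    appended to the binary.
    """
    if region["kind"] == "rodata":          # separate read-only data
        return {"action": "record_offset", "offset": region["offset"]}
    if region["kind"] == "embedded":        # data mixed with code
        return {"action": "duplicate", "bytes": region["bytes"]}
    raise ValueError("unknown region kind")

assert plan_relocation({"kind": "rodata", "offset": 0x1000})["action"] == "record_offset"
assert plan_relocation({"kind": "embedded", "bytes": b"\x00" * 8})["action"] == "duplicate"
```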

42 CHAPTER 3. CODE REUSE EXPLOIT MITIGATIONS

Data Reference Collections: NPatcher only collects the references from .text to .text (embedded data) and to .rodata, because these can be statically recognized and resolved. Other types of references either come from outside the module or are statically unavailable; these are handled by NLoader. For references to embedded data, NPatcher can directly include them based on NDisassembler's analysis results. But there is one caveat: the instructions used to reference embedded data (i.e., adr and ldr-literal) have a short addressing range. Therefore, when we map their target data to different memory pages, it is possible that these instructions can no longer reach the relocated data. To solve this issue without breaking backward compatibility, NPatcher generates stub code to facilitate access to out-of-range data. The short-range instructions are replaced with an unconditional branch instruction2, which points to the corresponding stub entry. The stub code contains only unconditional load and branch instructions pointing to fixed immediate offsets. This design ensures that the stub entries cannot be used as ROP gadgets. For references to .rodata, there is no addressing-range problem, because adrp is used instead of adr. However, a different issue arises: such references can come from multiple sources. We identify five sources in our empirical study covering all Android system executables and libraries. NPatcher can only prepare the locations of the first three offline, leaving the last two to be handled by NLoader after relocation and symbol resolution are finished.

• References from code (.text): these are usually caused by access to constant values and strings.

• References from the symbol table (.dynsym): when a symbol is located in .rodata, there will be an entry in the symbol table whose value field contains the address of the exposed symbol.

• References from the relocation table (.rela.dyn): for a relocatable symbol located in .rodata, the relocation table entry's r_addend field will point to the symbol's address.

• References from the global offset table (.got): when a variable in .rodata cannot be addressed due to the addressing limit (e.g., adrp can only address +/- 4GB), an entry in the global offset table is used to address that far-away variable.

• References from read-only global data (.data.rel.ro): most binaries in Android disable lazy binding. The .data.rel.ro section contains the addresses of global constant data that need to be relocatable. After the dynamic linker finishes relocating them, this table is marked read-only, as opposed to the traditional .data section.

2ADR can address +/- 1MB, while B(ranch) can access +/- 128MB, which is far enough for regular binaries.
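The range problem can be illustrated with a small sketch: an adr (±1MB reach) whose target has been moved out of range is replaced by a branch (±128MB reach) to a stub entry that performs the actual load. This is a symbolic model of the rewrite rule; the real NPatcher operates on AArch64 instruction encodings, and the addresses below are invented for illustration.

```python
ADR_RANGE = 1 << 20        # adr / ldr-literal reach: +/- 1MB
B_RANGE   = 128 << 20      # b (branch) reach: +/- 128MB

def rewrite_if_out_of_range(insn_addr, data_addr, stub_addr):
    """Return the (possibly rewritten) instruction as a symbolic tuple."""
    if abs(data_addr - insn_addr) <= ADR_RANGE:
        return ("adr", data_addr)          # still reachable, keep as-is
    # Out of range: redirect through a stub entry; the stub itself only
    # contains fixed-offset load/branch instructions and sits within
    # branch range of the rewritten site.
    assert abs(stub_addr - insn_addr) <= B_RANGE
    return ("b", stub_addr)

# Relocated data 16MB away can no longer be addressed by adr ...
assert rewrite_if_out_of_range(0x400000, 0x400000 + (16 << 20), 0x500000) == ("b", 0x500000)
# ... while nearby data is left untouched.
assert rewrite_if_out_of_range(0x400000, 0x410000, 0x500000) == ("adr", 0x410000)
```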


Figure 3.8: The layout of an ELF transformed by NORAX. The shaded parts at the end are the generated NORAX-related metadata.

Figure 3.9: Bionic linker's binary loading flow. NLoader operates in different binary-preparing stages, including module loading, relocation, and symbol resolution.

Finally, the metadata (duplicates and references), the data-accessing stub code, and the NORAX header are appended to the end of the original binary, as shown in Figure 3.8. Note that by appending the NORAX-related data to the end of the binary, we allow patched binaries to be backward-compatible. This is because the ELF standard ignores anything that comes after the section header table. As a result, binaries transformed by NPatcher can run on devices without NORAX support installed. They can also be parsed and disassembled by standard ELF utilities such as readelf and objdump. Moreover, NORAX-patched binaries are compatible with other binary-level security enhancement techniques.
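Backward compatibility here rests on a property of the ELF format: standard consumers stop at the section header table, so trailing bytes are ignored. A minimal sketch of the idea follows; the `NRAX` magic and the [magic][length][payload] layout are invented for illustration, as the text does not specify the actual NORAX header format.

```python
import struct

NORAX_MAGIC = b"NRAX"  # hypothetical magic, for illustration only

def append_norax_blob(elf_bytes, metadata):
    """Append a [magic][length][payload] blob after the ELF image.

    The original bytes are untouched, so the file still loads and
    parses normally with stock tooling; a NORAX-aware loader can seek
    to the end and recognize the magic.
    """
    blob = NORAX_MAGIC + struct.pack("<I", len(metadata)) + metadata
    return elf_bytes + blob

patched = append_norax_blob(b"\x7fELF" + b"\x00" * 60, b"dup-data")
assert patched.startswith(b"\x7fELF")      # original image intact
assert patched.endswith(b"dup-data")       # metadata appended at the end
```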

3.2.3.3 NLoader: Plugin for Stock Loader and Linker

Binaries rewritten by NPatcher remain recognizable by and compatible with the stock loader and linker. They can still function, albeit without the XOM protection. New data sections added by NORAX, however, are transparent to the toolchain. They require NLoader's support to complete the binary loading and reference updating process before their code can be mapped in XOM. Other than the ones prepared by NPatcher, there are several types of references to executable data which are related to the linker and only available at runtime. Built as a linker/loader plugin, NLoader adjusts these references in the following steps:


• Ld-1: It parses and loads the NORAX header into memory, including information about the embedded data in .text and the stub code accessing embedded data. Then, it creates duplicated mappings for .rodata and the linker-referencing sections3, which have been loaded by the stock linker/loader.

• Ld-2: It updates the .dynamic section to redirect the linker to use the read-only copy of those relocated data sections.

• Ld-3: It collects the .rodata references from .got and .data.rel.ro, which are only populated after the relocation is done. It then adjusts all the collected data references in one pass. Eventually, the memory access level of the loaded module is adjusted to enforce the R⊕X policy.

The overall workflow of NLoader is shown in Figure 3.9. It starts with the executable loading, which is done by the OS ELF loader (Step 1). Then, the OS loader transfers control to the dynamic linker, which in turn creates a book-keeping object for the just-loaded module. Meanwhile, Ld-1 is performed to complete the binary loading. Next, the binary's corresponding book-keeping object is populated with references to those ELF sections used by the linker to carry out relocation and symbol resolution in a later stage. Ld-2 is then invoked to update these populated references. At this point, the preparation for the executable is done. The linker then starts preparing all the libraries (Step 2). This process is similar to the preparation of the executable, so Ld-1 and Ld-2 are called accordingly. When all the modules have been loaded in the previous steps with their book-keeping objects populated, the linker walks through the book-keeping objects to perform relocation and symbol resolution (Step 3). In this step, Ld-3 is called for each of the relocated modules to update all the collected references, including the ones from .got and .data.rel.ro to .rodata. This is feasible because the .got entries which reference .rodata are populated upfront, the same as those in .data.rel.ro. During runtime, the program may dynamically load or unload new libraries (Step 4), as shown in Figure 3.9, which is also naturally handled by NLoader. To boost performance, once NLoader finishes updating the offline-updatable references, it caches the patched binary so that next time it can directly load the cached version without going through the whole reference adjustment process again.
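The three stages can be walked through symbolically. This is a toy model, not the Bionic linker's real book-keeping: modules are plain dicts, and the stage effects are reduced to flags.

```python
# A symbolic walk-through of the three NLoader stages (illustrative).
def ld1(module):
    # Parse the NORAX header; create read-only duplicate mappings.
    module["norax_header_loaded"] = True
    module["ro_duplicates"] = {".rodata": "ro-mapping"}

def ld2(module):
    # Redirect the linker to the read-only copies via .dynamic.
    module["dynamic_redirected"] = True

def ld3(module):
    # Runs only after relocation: fix .got / .data.rel.ro references,
    # then flip the module's code pages to execute-only (R xor X).
    assert module["relocated"]
    module["refs_adjusted"] = True
    module["code_perms"] = "x-only"

m = {"relocated": False}
ld1(m); ld2(m)          # during module loading (Steps 1-2)
m["relocated"] = True   # the linker performs relocation (Step 3)
ld3(m)                  # final reference fix-up and XOM enforcement
assert m["code_perms"] == "x-only"
```

The ordering constraint in `ld3` mirrors the text: .got and .data.rel.ro entries can only be adjusted once the linker has populated them.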

3The linker-referencing sections include .(gnu).hash, .dynsym, .dynstr, .gnu.version, .gnu.version_r, .rela.dyn, .rela.plt, etc.


3.2.3.4 NMonitor: Runtime Enforcement and Safety-net

After being processed by the first three NORAX components, a patched binary that follows the R⊕X policy is ready to run with the assistance of NMonitor. At runtime, the converted program may still carry some unadjusted references to executable data, which fall into the following two categories.

• Missed references to embedded data: NDisassembler may fail to discover some data references due to the limitations of static analysis, and these missed references trigger access violations at runtime. Although in our evaluation we rarely saw cases where an access violation was triggered by a missed embedded data reference, such a situation, if mishandled, will crash the program. Note that references from .text to .rodata do not have this problem, because whenever a computed address happens to point into the .rodata section, NDisassembler marks it as a valid reference regardless of whether a corresponding memory load instruction is detected.

• References to .eh_frame_hdr and .eh_frame: These sections provide auxiliary information such as the address ranges of functions and the stack contents when a C++ exception is triggered. The previous components are unable to update them because they are used neither by the converted module itself nor by the dynamic linker. Instead, we found that the C++ runtime and debuggers such as gdb reference and read into these two sections for exception handling or stack unwinding.

NMonitor dynamically handles both categories of unadjusted references. It responds to memory violations caused by any attempted read access to XOM, checking the context and the data being accessed. If the context matches the two cases discussed above and the address being accessed does belong to the relocated data, NMonitor permits and facilitates the access; otherwise, it terminates the program. Specifically, NMonitor whitelists these two kinds of data and ensures that legitimate accesses to them can go through while potential abuses by attackers cannot. For instance, NMonitor only allows the C++ runtime module to access the .eh_frame sections (updatable through sysctl). For the .text embedded data, NMonitor only allows code from the over-approximated hosting function to read them. Note that while this design helps our system cope with those corner cases, the security of our system is barely undermined, for two reasons: (i) the majority of the whitelisted data are indeed real data, which are not even decodable or are surrounded by non-decodable data; (ii) different data require code from different regions to access them, so attackers cannot simply exploit one memory leak bug to read across all these embedded data.
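The fault-handling decision can be sketched as a whitelist lookup. This is a simplified model, not NMonitor's kernel code: the whitelist maps a relocated-data range to the code range allowed to read it (e.g., .eh_frame to the C++ runtime, an embedded-data chunk to its over-approximated hosting function), and all addresses are hypothetical.

```python
def handle_read_fault(fault_addr, reader_pc, whitelist):
    """Decide whether a read fault on an XOM page is legitimate.

    `whitelist` maps (data_lo, data_hi) ranges to the (code_lo, code_hi)
    range permitted to read them.  Any access that does not match both
    the data range and the reader's code range terminates the program.
    """
    for (d_lo, d_hi), (c_lo, c_hi) in whitelist.items():
        if d_lo <= fault_addr < d_hi and c_lo <= reader_pc < c_hi:
            return "permit"
    return "terminate"

wl = {(0x1000, 0x1100): (0x4000, 0x5000)}   # hypothetical ranges
assert handle_read_fault(0x1010, 0x4100, wl) == "permit"
assert handle_read_fault(0x1010, 0x9000, wl) == "terminate"  # wrong reader
assert handle_read_fault(0x2000, 0x4100, wl) == "terminate"  # not whitelisted
```

Tying each data range to a specific code range is what prevents a single memory-leak bug from reading across all whitelisted data, as argued above.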

3.2.4 Evaluation

We evaluate three aspects of NORAX: (i) whether it breaks the functioning of patched binaries; (ii) how accurate its data analysis is; and (iii) how much overhead it incurs.

Functioning of Transformed Binaries: For this test, we selected 20 core system binaries to transform, including both programs and libraries (Table 3.10). These binaries support the basic functionality of an Android phone, such as making a phone call, installing apps, and playing videos. We obtained these binaries from a Nexus 5X phone running Android OS v6.0.1 (Marshmallow). These stock binaries are compiled with compiler optimization and without debugging metadata. We tested the functionality of the transformed binaries using our own test cases as well as the Android Compatibility Test Suite (CTS) [5]. We modified the system bootstrapping scripts (*.rc files) to direct Android to load the system binaries patched by NORAX. Table 3.8 shows the specific tests we designed for each system executable and library. For example, surfaceflinger is the UI composer, which depends on two libraries: libmedia.so and libstagefright.so. Zygote (app_process64) is the template process from which all app processes are forked; it uses all of the patched binaries. While running our functionality tests, we observed an attempt by the linker to read the ELF header, which is located in pages marked execute-only. While this attempt was allowed and facilitated by NMonitor, our system can be optimized to handle this case during the patching stage instead. We also ran the Android Compatibility Test Suite (CTS) on a system with our transformed binaries installed. The suite contains around 127,000 test packages and is a mandatory test performed by OEM vendors to assess the compatibility of their modified Android systems. The test results are shown in Table 3.9. NORAX did not introduce any additional failures beyond those generated by the vendor customization on the testing devices. The results from both tests show that the functioning of patched binaries is not interrupted or broken by NORAX.

Correctness of Data Analysis: To thoroughly test the correctness of our embedded data identification algorithm, we ran the data analysis module of NDisassembler against a large test set consisting of all 313 Android system binaries, whose sizes span from 5.6KB (libjnigraphics.so) to 16.5MB


(liblog.so), totaling 102MB. For these binaries, we compare the data identified by NDisassembler with the real embedded data. Our ground truth is obtained by compiling debugging sections (.debug_*) [18] into the binaries. We use an automatic script to collect bytes at file offsets that fall outside any function range and compare them with the analysis results from NDisassembler. For the bytes that are not used by any of the functions, we found that some of them are NOP instructions used purely for padding, whilst some are just “easter eggs”; for instance, in the function gcm_ghash_v8 of libcrypto.so, the developers left the string “GHASH for ARMv8, CRYPTOGAMS by ”. These kinds of data were not collected by NORAX. Since there are no references to them, making them non-readable will not break any function.
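The ground-truth comparison described above can be sketched as follows. This is a simplified stand-in for the actual script: function ranges are assumed to come from the .debug_* sections, and offsets are small integers for illustration.

```python
def bytes_outside_functions(file_size, func_ranges):
    """Collect offsets not covered by any function range.

    These offsets serve as the ground truth for non-code bytes, as
    derived from debugging sections compiled into the binary.
    """
    covered = set()
    for lo, hi in func_ranges:
        covered.update(range(lo, hi))
    return sorted(set(range(file_size)) - covered)

def compare(ground_truth, flagged):
    """Compare NDisassembler's flagged data against the ground truth."""
    gt, fl = set(ground_truth), set(flagged)
    return {"missed": sorted(gt - fl),       # false negatives
            "false_pos": sorted(fl - gt)}    # code flagged as data

truth = bytes_outside_functions(16, [(0, 4), (8, 16)])
result = compare(truth, flagged=[4, 5, 6, 7, 9])
assert result["missed"] == []                # zero false negatives
assert result["false_pos"] == [9]            # benign over-approximation
```

A zero `missed` set with a small `false_pos` set corresponds to the reported result: no embedded data missed, with rare over-approximations left for NMonitor to handle.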

For the tested binaries, NDisassembler correctly identified all the embedded data. Only for 28 out of the 313 binaries did NDisassembler report false positives (i.e., code mistakenly identified as embedded data), due to the over-approximate approach we use. These rare false positive cases are expected by our design and are handled by NMonitor during runtime. Table 3.11 shows a subset of the results4.

Table 3.8: Rewritten program functionality tests.

Module                   Description                             Experiment                        Success
vold                     Volume daemon                           mount SDCard; umount              Yes
toybox                   115 *nix utilities                      try all commands                  Yes
toolbox                  22 core *nix utilities                  try all commands                  Yes
dhcpcd                   DHCP daemon                             obtain dynamic IP address         Yes
logd                     Logging daemon                          collect system log for 1 hour     Yes
installd                 APK install daemon                      install 10 APKs                   Yes
app_process64 (zygote)   Parent process for all applications     open 20 apps; close               Yes
qseecomd                 Qualcomm's proprietary driver           boot up the phone                 Yes
surfaceflinger           Compositing frame buffers for display   Take 5 photos; play 30 min movie  Yes
rild                     Baseband service daemon                 Have 10 min phone call            Yes

Table 3.9: System compatibility evaluation. The converted zygote, qseecomd, installd, rild, logd, surfaceflinger, libc++, and libstagefright are selected randomly to participate in the test, to see whether they can run transparently with other unmodified system components.

Plan Name    Pass     Fail  Not Executed
normal CTS   126,457  552   0
NORAX CTS    126,457  552   0

Size Overhead: In our functionality test, the sizes of our selected binaries range from ≈14K to ≈7M, as shown in Table 3.10. After transformation, the binary sizes increased by an average of ≈3.91%. Note that libm.so is an outlier, as its file size increased much more than the others. After manual inspection, we found that this math library has a lot of constant values hardcoded in various

mathematical functions such as casinh() and cacos(). As an optimization, the compiler embeds this

4This subset was chosen to be consistent with the binaries used in the other tests in this section. The complete set of results for all 313 Android system binaries, which can be easily obtained, is not shown here due to the space limit.


Table 3.10: Binary transformation correctness test.

Module                   Size (Stock)   Size (NORAX)   File Size   # of Rewrite
                         (Byte)         (Byte)         Overhead    Errors
vold                     486,032        512,736        5.49%       0
toybox                   310,800        322,888        3.89%       0
toolbox                  148,184        154,632        4.35%       0
dhcpcd                   112,736        116,120        3.00%       0
logd                     83,904         86,256         2.80%       0
installd                 72,152         76,896         6.58%       0
app_process64 (zygote)   22,456         23,016         2.49%       0
qseecomd                 14,584         15,032         3.07%       0
surfaceflinger           14,208         14,448         1.69%       0
rild                     14,216         14,784         4.00%       0
libart.so                7,512,272      7,772,520      3.46%       0
libstagefright.so        1,883,288      1,946,328      3.35%       0
libcrypto.so             1,137,280      1,157,816      1.81%       0
libmedia.so              1,058,616      1,071,712      1.24%       0
libc.so                  1,032,392      1,051,312      1.83%       0
libc++.so                944,056        951,632        0.80%       0
libsqlite.so             791,176        805,784        1.85%       0
libbinder.so             325,416        327,072        0.51%       0
libm.so                  235,544        293,744        24.71%      0
libandroid.so            96,032         97,208         1.22%       0
AVG.                                                   3.91%       0

Table 3.11: Embedded data identification correctness. This empirical experiment shows our analysis works well on AArch64 COTS ELFs, with a zero false negative rate and a very low false positive rate in terms of finding embedded data. The last column shows the negligible number of leftover gadgets in the duplicated embedded data set.

Module                   # of Real     # of Inline Data   # of Gadgets found in
                         Inline Data   Flagged by Norax   extracted inline Data
vold                     0             0                  0
toybox                   8             8                  0
toolbox                  20            20                 0
dhcpcd                   40            40                 4
logd                     0             0                  0
installd                 0             0                  0
app_process64 (zygote)   0             0                  0
qseecomd                 N/A           0                  0
surfaceflinger           0             0                  0
rild                     0             0                  0
libart.so                17716         17716              8
libstagefright.so        296           296                5
libcrypto.so             2472          2512               25
libmedia.so              3936          3936               0
libc.so                  4836          4836               5
libc++.so                12            12                 0
libsqlite.so             932           1004               13
libbinder.so             0             0                  0
libm.so                  20283         20291              48
libandroid.so            0             0                  0
Total                    50551         50671              108

NORAX implementation components:

System Components              Norax Modifications        SLoC   Language
Linux Kernel                   NLoader, NMonitor          1947   C
Bionic Linker                  NLoader                    289    C++
Analysis & Rewriting Modules   NDisassembler, NPatcher    3580   Python & Bash Shell Script

large set of constant data into the code section to fully exploit spatial locality, which translates to more metadata generated by NORAX during the patching stage. The following breakdown shows how many binaries in each system directory contain embedded data:

Directory       # of binaries   # w/ embedded data   Percentage
/system/bin     237             167                  70.46%
/system/lib64   255             101                  39.61%
/vendor/lib64   111             39                   35.14%
/vendor/bin     4               2                    50.00%

Performance Overhead: We used Unixbench [151] to measure the performance of our system. The benchmark consists of two types of testing programs: (i)

user-level CPU-bound programs; (ii) system benchmark programs that evaluate I/O, process creation, system calls, etc. We ran the benchmark on both the stock and the patched binaries, repeating each round three times. We then derived the average runtime and space overhead, which are given in Figure 3.10. For the runtime overhead, the average slowdown introduced by NORAX is 1.18%. The overhead mainly comes from the system benchmark programs, among which Execl shows the maximum slowdown. Investigating its source code, we found that this benchmark program keeps invoking the exec system call from the same process to execute itself over and over again, causing NLoader to repeatedly prepare new book-keeping structures and destroy old ones (§ 3.2.3.3). This, in turn, leads to multiple locking and unlocking operations, hence the relatively higher overhead. Fortunately, we do not find this behavior common in normal programs. In addition, some simple optimizations are possible: (i) employing a more fine-grained locking mechanism; (ii) reusing the book-keeping structures when exec loads the same image.


Figure 3.10: Unixbench performance overhead, including runtime, peak resident memory, and file size overhead (left: user tests; right: system tests).

Security Impact: Since NORAX duplicates the embedded data found in code sections, that data is in theory still reusable by adversaries. We therefore conducted a gadget-searching experiment over the duplicated embedded data appended at the end of the converted binaries. Table 3.11 shows the number of available gadgets we found in those data. As the results show, available gadgets are actually very rare, even in binaries that contain a lot of embedded data such as libm.so; we believe this is because the majority of the duplicated bytes are by themselves not decodable. Also note that the shown numbers are upper bounds on the available gadgets, because in the executable code section, where the original embedded data reside, the bytes that form these gadgets may not be placed next to each other.


3.3 Limitations

Unforeseeable Code: Neither CCR nor NORAX supports self-modifying code. CCR relies on compile-time analysis to extract metadata that is assumed to be immutable after compilation, and NORAX relies on static binary analysis and rewriting that are assumed to be completed before runtime. As a result, the current implementations cannot handle dynamically generated code (JIT compilation) or self-modifying code. In addition, NORAX cannot patch customized ELF files consisting of unrecognizable sections that may contain code and data. For instance, the .ARM.exidx and .ARM.extab sections contained in the dex2oat program5 are not recognized by the current implementation of NORAX. Nevertheless, these limitations are shared by most works relying on compile-time analysis or offline binary rewriting.

Indirect Memory Disclosure: CCR and NORAX block exploitation by preventing attackers from directly reading the code loaded in memory to search for gadgets. However, code pointers residing in data areas such as the stack and heap are still vulnerable to indirect memory disclosure attacks, which can lead to whole-function reuse or call-preceded gadget reuse attacks [85, 170]. This limitation, however, is shared by all related solutions using binary rewriting [54, 109, 189, 202]. In addition, a recent study [167] shows that even the most advanced source-code-based techniques [88, 89] are subject to attacks of this kind. As we can anticipate, defenses against indirect memory disclosure warrant further research in this eternal war on memory corruption attacks.

5An optimization tool to convert applications’ byte code to native code.

Chapter 4

In-process Memory Isolation

As pointed out in § 3.3, advanced attacks may still manage to bypass the mitigations. Similarly, when a rogue library is injected into the victim program, the attacker is able to execute arbitrary code and has unrestricted access to other components' memory. When it comes to preventing in-process memory abuse, developers are virtually helpless due to a lack of support from the underlying operating system (OS): the memory isolation mechanisms provided by modern OSes operate merely at the process level and cannot be used to establish security boundaries inside a process. As a result, protecting sensitive memory content against malicious code inside the same process remains an open issue, which has been increasingly exploited by attackers.


Figure 4.1: Shreds, threads, and a process


4.1 Overview

In this section, we present a new execution unit for userspace code, namely shred, which represents an arbitrarily sized segment of a thread (hence the name) and is granted exclusive access to a protected memory pool, namely the shred-private pool (or s-pool). Figure 4.1 depicts shreds in relation to the conventional execution units. Upon its creation, a shred is associated with an s-pool, which can be shared among multiple shreds. Shreds address developers' currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries. To prevent sensitive content in memory from in-process abuse, a developer includes into a shred the code that needs access to the sensitive content and stores the content in the shred's s-pool. For instance, an encryption function can run in a shred with the secret keys stored in the s-pool; a routine allowed to call a private API can run in a shred whose s-pool contains the API code. We design shreds under a realistically adversarial threat model. We assume attackers may have successfully compromised a victim program, via either remote exploitation or malicious local libraries. The attackers' goal is to access the sensitive content, including both data and code, in the victim program's virtual memory space. Further, we expect unknown vulnerabilities to exist inside shreds (e.g., control-flow hijacks and data leaks are possible). On the other hand, we assume a clean OS, which serves as the TCB for shreds. This assumption is reasonable because the attacks that shreds aim to prevent, in-process abuse, would become unnecessary had attackers already subverted the OS. In fact, we advocate that future OSes should support shreds, or more generally, enable private memory for execution units of smaller granularity than the scheduling units.
We realize the concept of shreds by designing and building: (i) a set of easy-to-use APIs for developers to use shreds and s-pools; (ii) a compilation toolchain, called S-compiler, automatically verifying, instrumenting, and building programs using shreds; (iii) a loadable kernel extension, called S-driver, enabling the support and protection of shreds on commodity OS. Figure 4.2 shows an overview of the entire system and the workflow. A developer creates a shred and associates it with a selected s-pool by calling the shred enter API and supplying the s-pool descriptor as the argument. Code inside a shred may access content in the associated s-pool as if it were a normal region in the virtual memory space. But the s-pool is inaccessible outside of the associated shred(s). S-pools are managed and protected by S-driver in a way oblivious to developers or applications. With the help of use-define chain analysis on labeled sensitive variables, shreds can also be created automatically at compile time. As shown in Figure 4.2, while compiling programs that use shreds, S-compiler automatically


Figure 4.2: Developers create shreds in their programs via the intuitive APIs and build the programs using S-compiler, which automatically verifies and instruments the executables (left); during runtime (right), S-driver handles shred entrances and exits on each CPU/thread while efficiently granting or revoking each CPU's access to the s-pools.

verifies the safe usage of shreds and instruments in-shred code with inline checks. The verification and instrumentation regulate sensitive data propagation and control flows inside shreds so that unknown vulnerabilities inside shreds cannot lead to secret leaks or shred hijacking. During runtime, S-driver serves as the manager for s-pools and the security monitor for executing shreds. It creates and resizes s-pools on demand. It enables a per-CPU locking mechanism on s-pools and ensures that only authorized shreds may access s-pools despite concurrent threads. S-driver leverages an under-exploited CPU feature, namely ARM memory domains [31], to efficiently realize s-pools and enforce shred-based access control. Unlike previously proposed thread-level memory isolation [62], our approach neither requires separate page tables nor causes additional page table switches or full TLB flushes. Our approach also avoids the need for a hypervisor or additional levels of address translation (e.g., nested paging). Although our reference design and implementation of s-pools are based on ARM CPUs, they are compatible with future x86 architectures, which will be equipped with a feature similar to memory domains [87, 116]. We implement S-compiler based on LLVM [133] and S-driver as a kernel module for Linux. We evaluate shreds' compatibility and ease of adoption by manually retrofitting shreds into 5 non-trivial open source software projects, including OpenSSL and Lighttpd. We show that developers can easily adopt shreds in their code without design-level changes or sacrificing functionality. Our evaluation shows that shreds incur an average end-to-end overhead of 4.67%. We also conduct a security analysis on shreds, confirming that the possible attacks allowed in our threat model are prevented. Overall, our results indicate that shreds can be easily adopted in real software for fine-grained protection of sensitive memory content while incurring very low overhead.


4.2 Design

4.2.0.1 Shred APIs and Usages

Application developers use shreds and s-pools via the following intuitive APIs:

err_t shred_enter(int pool_desc); err_t shred_exit(); void * spool_alloc(size_t size); void spool_free(void *ptr);

These APIs internally make requests to S-driver via ioctl for managing shreds and s-pools. To explain the API usage, we use the lightweight open-source web server Lighttpd as an example, where we employ shreds to protect the HTTP authentication password in Lighttpd's virtual memory. By wrapping the code that receives and checks the password in two shreds and storing the password in an s-pool, the modified Lighttpd prevents out-shred code, including third-party and injected code, from accessing the password in memory. Listings 4.1-4.3 show the code snippets that contain the modifications (lines marked with “+”). A successful call to shred_enter starts a shred execution on the current thread. It also causes a switch to a secure execution stack allocated in the s-pool, which prevents potential secret leaks via local variables after the shred exits. The thread is then given exclusive access to the associated s-pool, which is specified by the developer using the pool_desc parameter of shred_enter. Our design allows developers to associate an s-pool with multiple shreds by using the same descriptor at shred creation (e.g., an encryption shred and a decryption shred may need to share the same s-pool storing keys). The two shreds in Lighttpd, created on Line 9 in Listing 4.1 and Line 3 in Listing 4.3, share the same s-pool. However, as a security restriction, shreds in different compilation units cannot share s-pools. Therefore, even if shreds from different origins happen to use the same descriptor value, their s-pools are kept separate. The shred_exit API stops the calling shred, revokes the current thread's access to the s-pool, and restores the original execution stack. It is called immediately after a self-contained operation or computation on the s-pool finishes, as shown on Line 22 in Listing 4.1 and Line 8 in Listing 4.3. The shred_enter and shred_exit APIs must be used in pairs without nesting.
To facilitate verification, an enter-exit pair must be called inside the same function. In principle, a shred should contain a minimal body of code that corresponds to a single indivisible task requiring access to an s-pool. In the example, since Lighttpd separates the parsing and processing of HTTP requests, we naturally used two small shreds, rather than one big shred, to respectively read the password from the network and check whether the hash of the password matches the local hash.

To allocate memory from its associated s-pool, in-shred code calls spool_alloc, in the same way as it would use libc's malloc. Similar to regular heap-backed memory regions, buffers allocated in s-pools are persistent and do not change as code execution enters or exits shreds. They are erased and reclaimed by S-driver when in-shred code calls spool_free. In the Lighttpd example, an s-pool named AUTH_PASSWD_POOL is used for storing the password that the server receives via HTTP authentication requests. The password enters the s-pool immediately after being read from the network stream and stays there until it is erased at the end of its lifecycle.

 1 int http_request_parse(server *srv,
 2                        connection *con) {
 3     ...
 4     /* inside the request parsing loop */
 5     char *cur; /* current parsing offset */
 6 +   char auth_str[] = "Authorization";
 7 +   int auth_str_len = strlen(auth_str);
 8 +   if (strncmp(cur, auth_str, auth_str_len) == 0) {
 9 +     shred_enter(AUTH_PASSWD_POOL);
10 +     /* object holding passwd in spool */
11 +     data_string *ds = s_ds_init();
12 +     int pw_len = get_passwd_length(cur);
13 +     cur += auth_str_len + 1;
14 +     buffer_copy_string_len(ds->key, auth_str, auth_str_len);
15 +     buffer_copy_string_len(ds->value, cur, pw_len);
16 +     /* add ds to header pointer array */
17 +     array_insert_unique(parsed_headers, ds);
18 +     /* only related shreds can deref ds */
19 +     /* wipe out passwd from input stream */
20 +     memset(cur, 0, pw_len);
21 +     cur += pw_len;
22 +     shred_exit();
23 +   }
24     ...
25 }

Listing 4.1: lighttpd/src/request.c – The HTTP request parser specially handles the AUTH request inside a shred: it allocates a data_string object in the s-pool (Line 11), copies the input password from the network stream to the object (Lines 12-15), saves the object pointer to the array of parsed headers (Line 17), and finally erases the password from the input buffer before exiting the shred.


 1 ...
 2 /* inside HTTP auth module */
 3 + shred_enter(AUTH_PASSWD_POOL);
 4   /* ds points passwd obj in spool */
 5   http_authorization = ds->value->ptr;
 6   ... // hash passwd and compare with local copy
 7 + s_ds_free(ds);
 8 + shred_exit();
 9   ...

Listing 4.2: lighttpd/src/mod_auth.c – When the authentication module receives the parsed headers, it enters a shred associated to the same s-pool as the parser shred. It retrieves the password by dereferencing ds, as if the password resided in a regular memory region (Line 5).

 1 /* called inside a shred */
 2 data_string *s_ds_init(void) {
 3   data_string *ds;
 4 + ds = spool_alloc(sizeof(*ds));
 5 + ds->key = spool_alloc(sizeof(buffer));
 6 + ds->value = spool_alloc(sizeof(buffer));
 7   ...
 8   return ds;
 9 }
10
11 /* called inside a shred */
12 void s_ds_free(data_string *ds) {
13   ...
14 + spool_free(ds->key);
15 + spool_free(ds->value);
16 + spool_free(ds);
17   return;
18 }

Listing 4.3: lighttpd/src/data_string.c – We added s-pool support to the data_string type in Lighttpd, which allows the HTTP parser to save the AUTH password, among other things, in s-pools and erase it when needed.

4.2.0.2 Security Properties

Shreds’ security is guaranteed by three properties:

• P1 - Exclusive access to s-pool: An s-pool is solely accessible to its associated shreds. Other shreds or threads, even when running concurrently with the associated shreds, cannot access the s-pool.

• P2 - Non-leaky entry and exit: Data loaded into s-pools cannot have copies elsewhere in memory or be exported without sanitization.

• P3 - Untampered execution: Shred execution cannot be altered or diverted outside of the shred.

P1 provides the core protection of a shred's sensitive memory against other unrelated shreds or out-shred code running in the same address space. P2 avoids secret leaks when data are being loaded into or exported out of s-pools (e.g., ensuring that no secret is buffered in unprotected memory as a result of standard I/O). P3 prevents in-process malicious code from manipulating shreds' control flow. Such manipulation could, for instance, mount a ROP attack that forces a shred to execute out-shred code and expose its s-pool. Next, we explain how we design S-compiler and S-driver together to ensure these properties.

4.2.0.3 S-compiler: automatic toolchain for shred verification and instrumentation

Developers use S-compiler to build programs that use shreds. In addition to regular compilation, S-compiler performs a series of analyses and instrumentations to verify a program's use of shreds and to prepare the executable so that S-driver can enforce the security properties (P1-P3) at runtime. In addition, S-compiler checks that code included in a shred follows two rules. First, it cannot copy data from an s-pool to unprotected memory without applying a transformation (e.g., encryption). This rule prevents unexpected secret leaks from s-pools and is needed for achieving P2. Second, in-shred code can only use libraries built with S-compiler. This rule allows all code inside shreds to be checked and instrumented for P3. Unlike general-purpose program analysis, S-compiler's analysis is mostly scoped within the code involved in shred executions, and can therefore afford to favor accuracy over scalability. Prior to the analysis and transformation, S-compiler translates the input program into an intermediate representation (IR) in static single assignment (SSA) form.

Checking shred usage: To verify that all shreds in the program are properly closed, S-compiler first identifies all shred creation sites (i.e., calls to shred_enter), uses them as analysis entry points, and constructs a context-sensitive control flow graph for each shred. S-compiler then performs a code path exploration on each graph in search of any unclosed shred (or unpaired use of shred_enter and shred_exit), which developers are asked to fix. This check is sound because the analysis is intra-procedural (a pair of shred enter and exit APIs must be called inside the same function) and it conservatively models indirect jumps.
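The pairing rule can be illustrated with a toy intra-procedural checker over a function's sequence of shred API calls (an illustration only, not S-compiler's actual implementation; the type and function names are ours):

```c
#include <stdbool.h>

/* Toy illustration of the pairing rule S-compiler verifies: within one
 * function, shred_enter/shred_exit must form non-nested pairs, and every
 * shred must be closed before the function ends. */
typedef enum { ENTER, EXIT } shred_event;

static bool shreds_properly_paired(const shred_event *ev, int n)
{
    bool in_shred = false;
    for (int i = 0; i < n; i++) {
        if (ev[i] == ENTER) {
            if (in_shred) return false;   /* nested shred_enter: reject */
            in_shred = true;
        } else {
            if (!in_shred) return false;  /* unmatched shred_exit: reject */
            in_shred = false;
        }
    }
    return !in_shred;                     /* an unclosed shred is an error */
}
```

The real check explores code paths in a control flow graph rather than a linear event sequence, but the accept/reject criterion is the same.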


To prevent potential secret leaks, S-compiler performs an inter-procedural data-flow analysis in each shred. Potential leaks happen when sensitive data in the s-pool are propagated to unprotected memory. To detect them, the data-flow analysis checks for any unsanitized data propagation from an s-pool object to a regular heap destination. Thanks to the explicit memory allocations and aliasing in s-pools, the data-flow analysis needs neither manually defined sources and sinks nor heuristic points-to analysis. In addition, this analysis strikes a balance between security and usability: it captures the common forms of secret leaks (e.g., those resulting from bugs) while permitting intentional data exports (e.g., saving encrypted secrets). Buffered I/O, when used for loading or storing s-pool data, may implicitly leak the data to pre-allocated buffers outside of s-pools, which the data-flow analysis can hardly detect. Therefore, S-compiler replaces any buffered I/O (e.g., fopen) with direct I/O (e.g., open) in shreds.
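A minimal sketch of this propagation model (all helper names are hypothetical, not S-compiler's): a value loaded from an s-pool carries a SECRET tag, a value-conserving copy keeps the tag, a sanitizing transform such as encryption clears it, and only PUBLIC values may reach unprotected memory:

```c
#include <stdbool.h>

/* Toy model of the leak check's transfer functions. Tags stand in for
 * the abstract values the data-flow analysis tracks. */
typedef enum { PUBLIC, SECRET } taint;

static taint load_from_spool(void)      { return SECRET; }   /* source */
static taint copy(taint v)              { return v; }         /* value-conserving */
static taint encrypt(taint v)           { (void)v; return PUBLIC; } /* sanitizer */
static bool  may_store_outside(taint v) { return v == PUBLIC; }     /* sink rule */
```

A flow such as `may_store_outside(copy(load_from_spool()))` is rejected, while `may_store_outside(encrypt(load_from_spool()))` is permitted, mirroring the bug-vs-intentional-export distinction above.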

Hardening in-shred control flows: We adopt a customized form of control-flow integrity (CFI) to ensure that in-process malicious code cannot hijack any shred execution. To that end, S-compiler hardens in-shred code during compilation. Based on the control flow graphs constructed in the previous step, S-compiler identifies all dynamic control flow transfers inside each shred, including indirect jumps and calls as well as returns. It then instruments these control flow transfers so that they can only target basic block entrances within the containing shred. This slightly coarse-grained CFI does not incur overhead as high as fine-grained CFI, and at the same time is sufficiently secure for our use: it prevents shred execution from being diverted to out-shred code. Furthermore, since shreds are usually small in code size (i.e., they contain few ROP gadgets) and our CFI only allows basic-block-aligned control transfers, the chance of in-shred ROP is practically negligible. The control flow hardening only applies to in-shred code. If a function is called both inside and outside of a shred, S-compiler duplicates the function and instruments the duplicate for in-shred use while keeping the original unchanged for out-shred use. S-compiler creates new symbols for such duplicates and replaces the in-shred call targets with the new symbols. As a result, a function can be used inside shreds and instrumented without affecting out-shred invocations. Using function duplicates also allows S-compiler to arrange the code reachable in a shred in adjacent memory pages, which facilitates the enforcement of the control flow instrumentation and improves code cache locality.
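The check inserted before each dynamic transfer can be illustrated with a toy version (a sketch; real instrumentation operates on code addresses and an efficient encoding of block entries, and cfi_check is our own name):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy in-shred CFI check: an indirect branch may only target a basic
 * block entry that lies inside the containing shred's code range. */
static bool cfi_check(uintptr_t target, const uintptr_t *bb_entries,
                      size_t n, uintptr_t shred_lo, uintptr_t shred_hi)
{
    if (target < shred_lo || target >= shred_hi)
        return false;                   /* out-shred target: deny */
    for (size_t i = 0; i < n; i++)
        if (bb_entries[i] == target)    /* must be a block entrance */
            return true;
    return false;                       /* mid-block target: deny */
}
```

Both rejection cases matter: the range check stops diversion to out-shred code, and the entry check stops unaligned, gadget-style jumps within the shred.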

Binding shreds and s-pools: Developers define a constant integer as the pool descriptor for each s-pool they need. To associate an s-pool with a shred, they use the constant descriptor as the pool_desc parameter when calling shred_enter. This simple way of creating the association is

intuitive and allows explicit sharing of an s-pool among multiple shreds. However, if not protected, it may be abused by in-process malicious code (e.g., creating a shred with an association to an arbitrary s-pool). S-compiler prevents such abuse by statically binding shreds to their s-pools. It first infers the pool-shred association by performing constant folding on the pool_desc used in each shred_enter invocation. It then records the associations in a special section (.shred) in the resulting executable, to which S-driver refers at runtime when deciding whether a shred (identified by its relative offset in memory) indeed has access to a requested s-pool. Thanks to the static binding, dynamically forged pool-shred associations are prevented, as is s-pool sharing across different compilation units.
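One possible, simplified shape of a .shred record and the runtime lookup (the struct fields and function name are our illustration, not the actual section format):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record in the .shred section: the static binding between
 * a shred (identified by its relative code offset) and the s-pool it may
 * access. Pools are never shared across compilation units. */
struct shred_binding {
    unsigned long shred_offset; /* shred's relative offset in the image */
    int pool_desc;              /* descriptor passed to shred_enter     */
    int comp_unit;              /* compilation unit that owns the pool  */
};

/* Check S-driver could perform when a shred_enter request arrives. */
static bool binding_allows(const struct shred_binding *tab, size_t n,
                           unsigned long offset, int desc, int unit)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].shred_offset == offset &&
            tab[i].pool_desc == desc && tab[i].comp_unit == unit)
            return true;
    return false; /* forged or cross-unit association: deny */
}
```

Because the table is emitted at compile time and write-protected at runtime, malicious code cannot add a binding for an arbitrary s-pool.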

4.2.0.4 S-driver: OS-level manager for shreds and s-pools

S-driver is a dynamically loadable kernel extension. It can be easily installed on a system as a regular driver. S-driver provides the OS-level support and protection for shreds and s-pools.

ARM memory domains: S-driver leverages a widely available yet rarely used ARM CPU feature, namely the memory domain mechanism, to realize s-pools, i.e., to create specially protected memory regions inside a single virtual memory space. At the same time, our design is not specific to ARM and can realize s-pools using a mechanism similar to memory domains in future Intel CPUs [87, 116]. On ARM platforms, domains are a primary yet lesser-known memory access control mechanism, independent of the widely used paging-based access control. A memory domain represents a collection of virtual memory regions. By setting a 4-bit flag in a Page Directory Entry (PDE), the OS assigns the memory region described by the PDE to one of the 16 (2^4) domains supported by the CPU. Since each PDE has its own domain flag, the regions constituting a domain do not have to be adjacent. Upon each memory access, the hardware Memory Management Unit (MMU) determines the domain to which the requested memory address belongs and then decides if the access should be allowed, based on the current access level for that domain. The access level for each domain is recorded in the per-core Domain Access Control Register (DACR) [15], and therefore can be individually configured for each CPU core.
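The DACR encoding can be sketched with a small helper (based on the ARM architecture's 2-bits-per-domain layout; dacr_set is our own name, not a kernel API):

```c
#include <stdint.h>

/* Sketch of the ARM DACR encoding: a 32-bit register holding a 2-bit
 * access field for each of the 16 domains. Per the architecture:
 * 0b00 = No access, 0b01 = Client (page permissions still apply),
 * 0b11 = Manager (page permissions bypassed). */
#define DACR_NOACCESS 0u
#define DACR_CLIENT   1u
#define DACR_MANAGER  3u

static uint32_t dacr_set(uint32_t dacr, unsigned domain, uint32_t access)
{
    dacr &= ~(3u << (2 * domain));          /* clear the domain's 2-bit field */
    return dacr | (access << (2 * domain)); /* install the new access level  */
}
```

The per-CPU setup in Figure 4.3 corresponds to giving each core client access to exactly one shred domain and no access to the others.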

Creation and management of s-pools: Although memory domains are ideal building blocks for s-pools thanks to their efficient hardware-enforced access control, memory domains were not originally designed for this purpose and cannot directly enable s-pools, due to two limitations. First, only a total of 16 memory domains is available. If one domain were naively dedicated to each s-pool,


the limited domains will soon run out as the number of s-pools used in a program increases. Second, the access control on memory domains is very basic and does not concern the subject of an access (i.e., who initiates the access). However, access control for s-pools must recognize subjects at the granularity of shreds. S-driver overcomes both limitations of memory domains by multiplexing the limited domains and introducing shred identities into the access control logic.

S-driver uses the limited domains to support as many s-pools as an application may need. Rather than permanently assigning an s-pool to a domain, S-driver uses domains as temporary and rotating security identities for s-pools in an on-demand fashion. Specifically, it uses a total of k = min(N_dom - 1, N_cpu) domains, where N_dom is the number of available domains and N_cpu is the number of CPUs (or cores) on the system. The first k domains are reserved for the first k CPUs.

Figure 4.3: The DACR setup for a quad-core system, where k = 4. The first 3 domains (Dom0-Dom2) are reserved by Linux. Each core has a designated domain (Dom3-Dom6) that it may access when executing a shred. No CPU can access Dom7.

Figure 4.4: A shred's transition of states.

S-driver sets the per-CPU DACR such that Dom_i is only accessible to shreds running on CPU_i, for the first k CPUs, while Dom_{k+1} is inaccessible to any CPU in user mode. Figure 4.3 shows an example DACR setup. S-driver uses the k CPUs and the k+1 domains for executing shreds and protecting s-pools.

When a shred starts or resumes its execution on CPU_i, S-driver assigns its associated s-pool to Dom_i; the shred can therefore freely access its s-pool while other concurrent threads, if any, cannot. When the shred terminates or is preempted, S-driver assigns its s-pool to Dom_{k+1}, which prevents any access to the pool from that moment on. As a result, S-driver allows or denies access to s-pools on a per-CPU basis, depending on whether an associated shred occupies the CPU. Even if malicious code manages to run concurrently alongside the shred inside the same process on another CPU, it cannot access the shred's s-pool without triggering domain faults. Thus, P1 is achieved.

Switching s-pools to different domains upon shred entries and exits is reasonably efficient. These operations do not involve the heavy page table switches that process- or VM-based solutions require; they only need a shallow walk through the first-level page table and updates to the PDEs pointing to the s-pools in question. Besides, they do not trigger full TLB flushes, because our design uses the per-address TLB eviction interface (flush_tlb_page) and only invalidates the TLB entries related to the updated PDEs. To further reduce the overhead, we invented a technique called lazy domain adjustment: when a shred is leaving CPU_i, without adjusting any domain assignment, S-driver quickly changes the DACR to revoke the CPU's access to Dom_i and lets the CPU's execution continue. It does not assign the s-pool used by the previous shred to Dom_{k+1} until a domain fault happens (i.e., another shred comes to the CPU and accesses its s-pool). The lazy domain adjustment avoids unnecessary domain changes and halves the already small overhead in some test cases. Figure 4.4 shows how S-driver orchestrates the transitions of a shred's states in response to the API calls, context switches, and domain faults. Each state is defined by a combination of four properties:

• Shred = {In-shred | Out-shred}: whether the shred has started or exited.

• DACR = {Allow | Deny}: whether the DACR allows or denies the current CPU access to its domain.

• SPOOL = {Lock | Unlock}: whether the associated s-pool is locked.

• CPU = {On-CPU | Off-CPU}: whether the shred is running on a CPU.

The transition starts from the top-left circle, when the shred has not started and its s-pool is locked. After shred_enter is called, S-driver starts the shred, but it will not adjust the DACR or the s-pool access until a domain fault or a spool_alloc call occurs, due to the lazy domain adjustment in effect. When a context switch happens in the middle of the shred execution with unlocked DACR and s-pool, S-driver instantly sets the DACR to Deny but (safely) leaves the s-pool open. Later on, if a domain fault occurs, S-driver locks the previous s-pool, because the fault means that the code currently running on the CPU is in-shred and is trying to access its own s-pool. If a domain fault never occurs before the shred regains the CPU, S-driver does not need to change any domain or s-pool settings, in which case the lazy domain adjustment saves two relatively heavy s-pool locking and unlocking operations.
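The transitions described above can be modeled as a toy state machine (a simplified, single-shred, single-CPU view; the event and field names are ours, and real state also spans other CPUs' s-pools):

```c
#include <stdbool.h>

/* Toy model of the lazy-domain-adjustment state machine. */
typedef struct {
    bool in_shred;   /* Shred = In-shred / Out-shred */
    bool dacr_allow; /* DACR  = Allow / Deny         */
    bool unlocked;   /* SPOOL = Unlock / Lock        */
    bool on_cpu;     /* CPU   = On-CPU / Off-CPU     */
} shred_state;

typedef enum { EV_ENTER, EV_FAULT, EV_SWITCH_OUT, EV_SWITCH_IN, EV_EXIT } event;

static void step(shred_state *s, event e)
{
    switch (e) {
    case EV_ENTER:      /* start the shred; DACR/s-pool untouched (lazy) */
        s->in_shred = true; break;
    case EV_FAULT:      /* first in-shred access: open the s-pool */
        if (s->in_shred && s->on_cpu) { s->dacr_allow = true; s->unlocked = true; }
        break;
    case EV_SWITCH_OUT: /* deny DACR instantly, leave the s-pool open */
        s->on_cpu = false; s->dacr_allow = false; break;
    case EV_SWITCH_IN:  /* nothing changes until a fault, if any */
        s->on_cpu = true; break;
    case EV_EXIT:       /* close the shred and revoke all access */
        s->in_shred = false; s->dacr_allow = false; s->unlocked = false; break;
    }
}
```

Running the sequence enter, fault, switch-out, switch-in, fault, exit reproduces the path through Figure 4.4 in which a preemption only costs a DACR reset and a later fault-driven restore.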

Secure stacks for shreds: Although S-compiler forbids unsanitized data flows from s-pools to unprotected memory regions, it has to allow in-shred code to copy s-pool data to local variables, which would be located in the regular stack and potentially accessible to in-process malicious code. To prevent secret leaks via stacks, S-driver creates a secure stack for each shred, allocated from its associated s-pool. When code execution enters a shred, S-driver transparently switches the stack without the application’s knowledge: it copies the current stack frame to the secure stack and then overwrites the stack pointer. When the shred exits or encounters a signal to be handled outside of the shred, S-driver restores the regular stack. As a result, local variables used by shreds never exist in regular stacks, and therefore cannot leak secrets.

Runtime protection of shreds: In addition to enabling and securing shreds and s-pools, S-driver also protects the inline reference monitor (IRM) that S-compiler plants in shred code. S-driver write-protects the memory pages containing the instrumented code and the associated data in memory. It also pins s-pool pages in memory to prevent leaks via memory swap. Given that our threat model assumes the existence of in-process adversaries, S-driver also mediates the system calls that malicious code in user space may use to overwrite the page protection, dump physical memory via /dev/*mem, disturb shreds via ptrace, or load untrusted kernel modules. For each program using shreds, S-driver starts this mediation before loading the program code, preempting any pre-existing malicious code.

S-driver's system call mediation also mitigates attacks that steal secret data, not directly from s-pools, but from the I/O media where the secret data are loaded or stored. For instance, instead of targeting the private key loaded in an s-pool, an in-process attacker may read the key file on disk. S-driver monitors file-open operations inside shreds. The first time a file F is accessed by a shred S, S-driver marks F as a shred-private file and only allows shreds that share the same s-pool with S to access F. This restriction is persistent and survives program and system reboots. As a result, an attacker can read F only if she manages to compromise the program during its first run and access F before a shred does. Although not completely preventing such attacks, S-driver makes them very difficult to succeed in practice. For a complete remedy, we envision a new primitive for in-shred code to encrypt and decrypt secret data with a persistent key assigned to each s-pool and automatically managed by S-driver; our current prototype, however, does not support this primitive.

It is worth noting that, although the system call mediation can stop user-space malicious code that tries to break shreds via the system interfaces, it is a more intrusive and less configurable design choice than well-known access control and capability frameworks such as SELinux, AppArmor, and Capsicum [201]. We leave the integration with those systems as future work, because the system call mediation is easy to implement and sufficient for prototyping purposes.

4.3 Implementation

We built S-compiler based on LLVM [133] and its C front-end Clang [13]. We built S-driver with Linux as the reference OS. The implemented system was deployed and evaluated on a quad-core ARM Cortex-A7 computer (Raspberry Pi 2 Model B running Linux 4.1.15).

S-compiler: The modular, pass-based architecture of LLVM allows us to take advantage of the existing analyzers and easily extend the compilation pipeline. S-compiler adds two new passes to LLVM: the shred analysis pass and the security instrumentation pass. Both operate on LLVM bitcode as the IR. The analysis pass carries out the checks on the usage and security properties of shreds, as described in § 4.2. We did not use LLVM's built-in data flow analysis for those checks due to its overly heuristic points-to analysis and its unnecessarily conservative transfer functions. Instead, we implemented our specialized data flow analysis based on the basic round-robin iterative algorithm, with weak context sensitivity and a straightforward propagation model (i.e., only tracking value-conserving propagators). We also had to extend LLVM's compilation pipeline, because it by default only supports intra-module passes while S-compiler needs to perform inter-module analysis. We employed link-time optimization (LTO), via its linker plugin, to cross-link the IR of all compilation modules and feed the linked IR to our analyzers. The instrumentation pass uses the LLVM IRBuilder interfaces to insert into the analyzed IR the security checks that are necessary for enforcing the in-shred control flow regulations and preventing dynamic data leaks.

S-driver: We built S-driver as a Loadable Kernel Module (LKM) for Linux. S-driver creates a virtual device file (/dev/shreds) to handle the ioctl requests made internally by the shred APIs. It uses 13 out of the 16 memory domains to protect s-pools, because recent versions of the Linux kernel for ARM already occupy 3 domains (for isolating device, kernel, and user-space memory). S-driver uses the available domains to protect an unlimited number of s-pools and controls each CPU's access to the domains as described in § 4.2. Since Linux does not provide callback interfaces for drivers to react to scheduling events, in order to safely handle context switches and signal dispatches in shreds, S-driver dynamically patches the OS scheduler so that, during every context switch, the DACR of the current CPU is reset, which locks the open s-pool, if any. The overhead of this operation is negligible because resetting the DACR takes a single lightweight instruction. To capture illegal access to s-pools and lazily adjust domain assignments, S-driver registers itself as the only handler of domain faults and is triggered whenever a domain violation happens. Algorithm 3 shows how S-driver handles a domain fault. Implementing S-driver purely as an LKM allows shreds to be introduced on a host without installing a custom-built kernel image.

Algorithm 3: Domain Fault Handler
  input : the faulting virtual address fault_addr
  result: recover from the domain fault, or kill the faulting thread

  /* Identity check */
  s_pool <- FindSpool(fault_addr);
  s_owner <- GetOwner(s_pool);
  if fault_thread is NOT in shred then goto bad_area;
  if fault_thread is NOT s_owner then goto bad_area;
  /* Recover from domain fault */
  cpu_domain <- GetCPUDomain();
  s_pool_domain <- GetSpoolDomain(s_pool);
  if s_pool is unlocked then
      if cpu_domain == s_pool_domain then
          /* no need to change domain for s_pool */
          RestoreDACR();
      else
          AdjustSPool(cpu_domain);
  else
      UnlockSPool(cpu_domain);
  LockOtherActiveSPools(s_pool);

Table 4.1: Five open-source applications used in the evaluation

  Program    Executable     Size (bytes)   Category               Protected Data   Size (KLOC)
  curl       curl           227071         http client            password         177
  minizip    miniunz        80572          file compression tool  password         7
             minizip        97749
  openssh    ssh            2207588        remote login tool      credential       130
  openssl    libcrypto.so   3093920        crypto library         crypto key       526
  lighttpd   mod_auth.so    85135          web server             credential       56

4.4 Evaluation

Our evaluation sought to answer the following questions:

• How compatible and useful are shreds to real-world programs?

• How do shreds affect the application's and the system's performance?

• How do shreds help mitigate in-process memory abuse?

Choice of Applications: We selected five popular open-source applications to evaluate our prototype system. The applications, shown in Table 4.1, range from the small HTTP server lighttpd to the complex cryptography library OpenSSL. They were chosen because each has at least one piece of sensitive data that is subject to in-process abuse and therefore warrants shreds' protection. Moreover, they represent a good variety of software functionalities and codebase sizes.


Figure 4.5: The time and space overhead incurred by S-compiler during the offline compilation and instrumentation phase.

Figure 4.6: The time needed for a context switch when: (1) a shred-active thread is switched off, (2) a regular thread is switched off with no process or address space change, and (3) a regular thread is switched off and a thread from a different process is scheduled on.

Compilation Tests: To test the performance and compatibility of our offline analysis and compilation methods, we instrumented S-compiler to measure the overhead, and to log potential errors, while building the five applications with shreds. Figure 4.5 shows the time and space overhead introduced by S-compiler, relative to a vanilla LLVM Clang compiling the unchanged applications. On average, S-compiler slows the build process by 24.58% and increases executable sizes by 7.37%. The seemingly significant compilation delays are in fact on par with those of static analysis and program instrumentation tools of similar scale. They are generally tolerable because compilation takes place offline in the background and is usually not time-critical. The executable size increases result mainly from the in-shred instrumentation and are below 2% except for the outliers. We encountered no errors when building these applications with S-compiler, and the built applications ran without issues during the course of the tests.

Performance Tests: This group of tests examines the runtime performance of shreds and s-pools. We performed both micro-benchmarks and end-to-end tests, which respectively reveal the performance cost of shreds' critical operations and the overhead exhibited in the applications retrofitted with shreds.


In the micro-benchmarking tests, we developed unit test programs that force shreds to go through the critical operations and state changes, including shred entry, exit, and context switch. We measured the duration of these operations and state changes, and compared them with the durations of equivalent or related operations without shreds. Figure 4.6 shows the absolute time needed for a context switch that preempts a shred-active thread, a regular thread, and a regular process, respectively. Switching shred-active threads is marginally more expensive than switching regular threads (about 100 µs slower), and much faster than a process context switch. This is because when a shred is preempted, S-driver does not need to change page tables or the TLB; it only performs a single, very lightweight DACR reset.

We also compared the time needed to complete the shred API calls (which invoke ioctl internally) with several reference system calls, as shown in Figure 4.7. getpid, one of the fastest system calls, serves as the baseline. The shred_enter API is compared with the clone system call (without address space change) and is slightly faster, which means creating a shred takes less time than creating a thread. The s-pool allocation API is mildly slower than mmap due to the additional domain configuration, but the overhead is low enough to blend into typical program performance fluctuations.

Furthermore, we measured the performance improvement enabled by the lazy domain adjustment optimization. We applied shreds to five SPEC CINT2006 benchmark programs written in C (Figure 4.8), where a number of shreds were created to perform intensive access to s-pools. We note that this test is designed only for performance evaluation; these benchmark programs do not need shreds' protection.
The results show that in all but one case the optimization brings the overhead under 1%, whereas the non-optimized implementation of shreds incurs an average overhead of 2.5%. Together, these micro-benchmarks indicate that the shred primitives are lightweight and that the performance impact of shred state changes and s-pool operations on the application or the system is very mild.

In the end-to-end tests, we let each of the five open-source applications perform a self-contained task twice, with and without using shreds to protect its secret data (e.g., Lighttpd fully handling an HTTP auth login, and OpenSSL carrying out a complete RSA key verification). We instrumented the applications with timers. For each application, we manually drove it to perform the task, which fully exercises the added shreds. We measured both the time and space costs associated with using shreds in these tests. The absolute costs and the relative increases are shown in Table 4.2.


Figure 4.7: Invocation time of shred APIs and reference system calls (the right-most two bars are on log scale). It shows that shred entry is faster than thread creation, and s-pool allocation is slightly slower than basic memory mapping.

Figure 4.8: Five SPEC CINT2006 benchmark programs tested when: (1) no shred is used, (2) shreds are used but without the lazy domain adjustment turned on in S-driver, and (3) shreds are used with the lazy domain adjustment.


Table 4.2: End-to-end overhead observed while the tested programs perform a complete task: the left part of the table shows the execution time and the right part shows the memory footprint (max RSS).

              End-to-end time                     Memory footprint (max RSS)
              w/o shred   w/ shred    time        w/o shred   w/ shred    size
              (ms)        (ms)        increase    (KB)        (KB)        increase
  curl        154         163         5.80%       4520        5104        12.90%
  minizip     23770       25650       7.90%       3004        3064        1.90%
  openssh     158.1       163.3       3.20%       3908        4644        18.80%
  openssl     2502        2546        1.75%       3892        3908        0.40%
  lighttpd    501         525         4.70%       3364        3440        2.30%
  Avg.                                4.67%                               7.26%

On average, the per-task slowdown among the applications is 4.67% and the memory footprint increase is 7.26%. The results show that shreds are practical for real applications of various sizes and functionalities. The overhead is hardly noticeable to the end users of the applications.

Security Coverage Test: Finally, we tested the coverage of shred protection in the modified applications. These tests not only check whether the shred adoption is correct and complete in these applications, but also demonstrate the security benefits uniquely enabled by shreds. We conducted these tests using a simple memory scraper that scans each application's virtual memory in search of the known secrets. The tests simulate the most powerful in-process abuse, where an adversary has full visibility into the user-space virtual memory of the application and can perform a brute-force search for secrets. For each application, our memory scraper runs as an independent thread inside the application and verifies whether any instance of the secret data can be found in memory via a value-based exhaustive search. We ran this test in two rounds, one on a vanilla version of the application and the other on the shred-enabled version. In the first round, where shreds are not used, the memory scraper found at least one instance of the secret values in memory for all the applications, which means that these secrets are subject to in-process abuse. In the second round, where shreds are used, the memory scraper failed to detect any secret matches in the applications' memory, which means that the secrets are well contained inside the s-pools and protected from in-process abuse. The results show that the applications have correctly adopted shreds for processing the secret data in memory and stored such data only in s-pools. Moreover, the tests show that, without significant design changes, applying the shred primitives in these real applications creates needed protection for the otherwise vulnerable passwords, crypto keys, and user credentials.


4.5 Limitations and Discussion

Coarse-grained Control Flow Integrity: To make it harder for attackers to exploit in-shred code in case there is a vulnerability, the s-compiler mounts a coarse-grained CFI. The rationale is that in-shred code is typically much smaller than the whole program and thus fewer gadgets will be available. However, this assumption might not hold if a developer creates a shred that happens to contain enough gadgets. Moreover, numerous studies [72,101,103] have shown that coarse-grained and even type-based CFI is bypassable. This creates a trade-off between overhead and security.

Incompatible ABI of Stack Switch: To protect stack variables containing sensitive information, a shred scratches volatile registers and swaps the stack to a secure stack allocated in the corresponding s-pool when entering the shred. However, this can create confusion if the developer does not pay attention to variable usage inside and outside of the shred (e.g., modifying out-of-shred stack variables). In the following code snippet, the variable greeting is assigned inside the shred although it is allocated on the non-secure stack. After exiting the shred, its content stays uninitialized, which is counter-intuitive, although it does not hurt security.

bool CheckGreeting() {
    char* greeting;
    CHECK_OK(shred_enter(pool));
    greeting = "hello"; // modifying a local variable declared outside the shred
    // other operations...
    CHECK_OK(shred_exit());
    return strcmp("hello", greeting);
}

A cleaner way to handle this issue is to change the interface for using shreds, namely, to allow entering a shred only via a procedure call, passing a function pointer to the shred_enter API. However, this would limit the granularity of shreds to the function level.

Part II

Offline Software Testing To Find Memory Corruption Bugs

Fuzzing has emerged as one of the most effective testing techniques for discovering security vulnerabilities and reliability issues in software. The idea behind fuzzing is simple: the fuzzer executes programs with randomly generated inputs and monitors their behavior for invalid operations, such as memory corruption issues. Recent advancements in fuzzing technologies, such as coverage-guided fuzzing [134,140], have enabled fuzzing to reach even deeper program paths and uncover significantly more bugs. The success of fuzzing has led to significant adoption in industry, and to the emergence of services providing continuous fuzzing for open-source and commercial software. For example, Google has developed continuous fuzzing infrastructures to test the security of C/C++ libraries, both for its internal software and externally for open-source code. Google's ClusterFuzz project, through its OSS-Fuzz [34,47] instance, has alone reported tens of thousands of bugs to developers by fuzzing over 200 open-source projects. C/C++ code is a primary target for fuzzing due to unsafe language features, such as explicit memory management, that make it prone to bugs and vulnerabilities. To detect such bugs, the fuzzed programs are usually instrumented with checks (e.g., ASAN [175]) that can expose memory corruption issues and other undefined behavior in C/C++ code. Several classes of these bugs, such as buffer overflows, use-after-frees, integer overflows, and uses of uninitialized memory, are often exploitable security vulnerabilities. Another widely used test-generation technique is concolic execution [139,172]. There has been a trend to combine fuzzing and concolic execution to get the best of both worlds [122, 184, 208] (a.k.a. hybrid fuzzing). On one hand, fuzz testing quickly tests a program, but it hardly explores code regions guarded by complex conditions.
On the other hand, concolic execution excels at solving path conditions, but it frequently directs the execution into code branches containing a large number of execution paths (e.g., loops). Due to these shortcomings, using fuzz testing or concolic execution alone often ends with large amounts of untested code after exhausting the time budget. The goal of hybrid testing is to utilize fuzzing for path exploration and leverage concolic execution to solve hard-to-resolve conditions. A hybrid approach typically lets fuzz testing run as much as possible. When the fuzzer barely makes any progress, the hybrid controller switches to the concolic executor, which re-runs the seeds generated by fuzzing. During the run, the concolic executor checks each conditional branch to see whether its sibling branches remain untouched. If so, the concolic executor solves the constraints of the new branch and contributes a new seed for fuzzing. In general, this hybrid approach guides the fuzzer to new regions for deeper program-space exploration. In the second part of this thesis, I introduce two novel techniques that improve the state-of-the-art of hybrid testing to uncover deep security bugs.

Chapter 5

Bug-driven Hybrid Testing

5.1 Background and Motivation

This work is motivated by the limitations of hybrid testing in vulnerability detection. In this section, we first introduce the background of hybrid testing and then demonstrate the limitations by two examples.

5.1.1 Inefficiency of Existing Coverage-guided Hybrid Testing

Existing hybrid testing combines fuzz testing and concolic execution to achieve high code coverage. For ease of understanding, we use the example in Figure 5.1 to explain how it works. The explanation is based on Driller [184], since it has been the de facto implementation of hybrid testing. The example in Figure 5.1 is taken from tcpdump-4.9.2. Figure 5.1a shows the code — it first uses the link-layer type from the input to select a pcap handler and then uses the handler to dissect packets. Our objective is to test the entry function parse_pcap and reach the vulnerable function pcap_handler2. In the test, we assume hybrid testing starts with a seed that executes the path shown in Figure 5.1b. After that, the fuzzer mutates the seed to run a second path, shown in Figure 5.1c. It then, however, fails to synthesize inputs that match the packet type at line 20 and the link-layer type at line 10, due to the huge mutation space (2^32 possibilities). This situation prevents the fuzzer from testing the remaining code and makes hybrid testing switch to concolic execution. After executing the seed that covers the path in Figure 5.1b, the concolic executor backtracks to the branch statement at line 20. Solving the input packet_type to PACKET1 with an SMT solver, the

Listing: LLVM IR of the function bug() after SAVIOR's UBSan instrumentation; the !saviorBugNum metadata labels the potential integer overflow.

define void @bug(%struct.msg_ds* %msg) #0 {
entry:
  %msg.addr = alloca %struct.msg_ds*, align 8
  %sz = alloca i64, align 8
  %buf = alloca i8*, align 8
  store %struct.msg_ds* %msg, %struct.msg_ds** %msg.addr, align 8
  %0 = load %struct.msg_ds*, %struct.msg_ds** %msg.addr, align 8
  %msg_sz = getelementptr inbounds %struct.msg_ds, %struct.msg_ds* %0, i32 0, i32 4
  %1 = load i64, i64* %msg_sz, align 8
  %2 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 24, i64 %1)
  %3 = extractvalue { i64, i1 } %2, 0
  %4 = extractvalue { i64, i1 } %2, 1
  %5 = xor i1 %4, true, !saviorBugNum !1        ; label of integer overflow
  br i1 %5, label %cont, label %handler.add_overflow, !prof !2, !saviorBugNum !1

handler.add_overflow:                           ; preds = %entry
  call void @__ubsan_handle_add_overflow(i8* bitcast ({ ... }* @1 to i8*), i64 24, i64 %1) #7, !saviorBugNum !1
  br label %cont, !saviorBugNum !1

cont:                                           ; preds = %handler.add_overflow, %entry
  store i64 %3, i64* %sz, align 8
  %6 = load i64, i64* %sz, align 8
  %call = call noalias i8* @malloc(i64 %6) #7
  store i8* %call, i8** %buf, align 8
  %7 = load i64, i64* %sz, align 8
  call void @llvm.memset.p0i8.i64(i8* %call, i8 0, i64 %7, i32 4, i1 false)
  ret void
}

CHAPTER 5. BUG-DRIVEN HYBRID TESTING

1  int parse_pcap(){
2      int link_type;
3      /* read link-layer type from input */
4      read(input_fd, &link_type, sizeof(int));
5      /* select a handler based on link_type */
6      if(link_type == LINKTYPE1){
7          pcap_handler1();
8          return 0;
9      }
10     if(link_type == LINKTYPE2){
11         pcap_handler2();
12         return 0;
13     }
14     ...
15     return -1;
16 }
17 int pcap_handler1(){
18     int packet_type;
19     read(input_fd, &packet_type, sizeof(int));
20     if(packet_type == PACKET1){
21         packet_handler1();
22         return 0;
23     }
24     ...
25     return -1;
26 }

executor generates a new seed to cover that branch. Then, the hybrid controller suspends the concolic execution and resumes the fuzzer. Guided by the new seed, the fuzzer tests packet_handler1 and switches back to concolic execution after that. This time, the concolic executor runs the seed following the path in Figure 5.1c. After solving the branch condition at line 10, it generates a seed for the flow from line 10 to line 11. Further fuzz testing can finally reach the vulnerable code in pcap_handler2. Note that the testing processes of different hybrid tools may vary from the above description. For instance, QSYM [208] keeps running concolic execution instead of invoking it in an interleaved manner. Despite those implementation differences, existing tools share a similar philosophy on scheduling the seeds to concolic execution. That is, they treat the seeds indiscriminately [184,208], presumably assuming that these seeds have equal potential in contributing to new coverage.

Figure 5.1: A demonstrative example of hybrid testing. (a) A simplified version of the packet-parsing code in tcpdump-4.9.2, in which pcap_handler2 contains vulnerabilities; (b) the path followed by a seed that matches LINKTYPE1 but mismatches PACKET1; (c) the path followed by a seed that matches neither LINKTYPE1 nor LINKTYPE2. The executions in (b) and (c) follow the red line and visit the grey boxes; the white boxes connected by dotted lines are non-covered code.

5.1.2 Motivation

Inefficiency in Covering Vulnerable Code: Although hybrid testing specializes in coverage-driven testing, it still needs substantial time to saturate hard-to-reach code compartments, which often overspends the time budget. To discover more vulnerabilities in a limited time frame, an intuitive

Listing: ahcp_print from tcpdump, where two blocking conditions guard ahcp1_body_print, which dominates many basic blocks.

/* len is propagated from a field in the input */
void ahcp_print( ... u_char *cp, const u_int len){
    uint8_t version;
    ...
    version = EXTRACT_U_1(cp);
    cp += 1;
    switch (version) {
    ...
    /* blocking condition 1 */
    case AHCP_VERSION_1: {
        /* blocking condition 2 */
        if (len < AHCP1_HEADER_FIX_LEN)
            goto invalid;

        /* dominate XXX basic blocks */
        ahcp1_body_print(ndo, cp, ep);
        ...
        break;
    }
    default:
        ND_PRINT(...);
        break;
    }
    return;
invalid:
    ND_PRINT(...);
    ...
    return;
    ...
}

1  static bfd_boolean load_specific_debug_section(enum dwarf_section_display_enum debug, asection *sec, void *file){
2
3      dwarf_section *section = &debug_displays[debug].section;
4
5      if (section->start != NULL){
6          if (streq (...))
7              return TRUE;
8          free (section->start);
9      }
10     ...
11     /* section->size is copied from input */
12     section->size = bfd_get_section_size (sec);
13
14     /* setting section->size to 0xffffffffffffffff on 64-bit systems or 0xffffffff on
        32-bit systems, malloc will return a zero-byte buffer, leading to out-of-bound access */
15     section->start = malloc(section->size + 1);
16     ...
17 }

Figure 5.2: A demonstrative example of the limitation in finding defects by existing hybrid testing. This defect comes from objdump-2.29 [33].

Listing: the msg_ds example program whose function bug() appears, UBSan-instrumented, in the LLVM IR listing earlier in this chapter.

typedef struct msg_ds {
    char str[2];
    int magic, ver, secret_num;
} msg_t;

int main(){
    msg_t msg;
    read(STDIN, &msg, sizeof(msg_t));

    if(msg.ver == 0xFFFFAB)
        log_msg(msg);
    if(msg.str[0] != 'h')
        return -1;
    if(msg.str[1] != 'i')
        return -1;
    if(msg.magic != 0x12FF6EF)
        return -1;

    bug();
    return 0;
}

way is to prioritize the testing of vulnerable code. However, the current hybrid testing method introduced in Section 5.1.1 does not meet this requirement.

Consider the example in Figure 5.1, where concolic execution chronologically runs the seeds to explore the paths shown in Figure 5.1b and Figure 5.1c. This sequence indeed postpones the testing of the vulnerable function pcap_handler2. The delay can be significant, because concolic execution runs slowly and the fuzz testing on packet_handler1 may last a long time. In our experiments¹, DRILLER spends minutes on reaching pcap_handler2 with the aforementioned schedule; however, if it performs concolic execution first on the path in Figure 5.1c, the time reduces to seconds. Not surprisingly, such delays frequently happen in practice. As we will show in Section 6.4, on average this delays DRILLER and QSYM in covering vulnerabilities by 43.4% and 44.3%, respectively, leading to reduced efficiency in vulnerability finding.

Deficiency in Vulnerability Detection: Hybrid testing often fails to identify a vulnerability even if it approaches the vulnerable location along the right path. Figure 5.2 demonstrates an integer overflow in objdump-2.29. At line 12, the program copies a value from sec to section->size. Next, this value is used as the size of a memory allocation request at line 15. By carefully handcrafting the input, an adversary can make section->size take the value 2^32 − 1 on 32-bit systems or 2^64 − 1 on 64-bit systems. This wraps section->size + 1 around to 0 and makes malloc return a zero-byte buffer. When the buffer is further used, a segfault or a memory leak would occur. In this example, hybrid testing can quickly generate a seed to hit line 15. However, it could

¹ SAVIOR is customized to do this test since DRILLER cannot run on tcpdump. More details can be found in Section 6.4.

barely trigger the integer overflow. As the program enforces no constraints on the input bytes that propagate to section->size, hybrid testing can only do random mutation to synthesize the extreme value(s). Taking into account the tremendous possibility space (2^32 or 2^64), the mutation is unlikely to succeed.

5.2 Design

5.2.1 Core Techniques

The design of SAVIOR is bug-driven, aiming to find bugs faster and more thoroughly. We propose two techniques to achieve the goal: bug-driven prioritization and bug-guided verification. Below we present an overview of our techniques.

Bug-driven prioritization: Recall that classic hybrid testing blindly schedules the seeds for concolic execution, without weighing their bug-detecting potential. This can greatly defer the discovery of vulnerabilities. To remedy this limitation, SAVIOR collects information from the target source code to prioritize seeds that have a higher potential to trigger vulnerabilities. This approach, however, needs to predict the number of vulnerabilities that running concolic execution on a seed could expose. The prediction essentially depends on two prerequisites: R1 – a method to assess the code regions that become reachable after the concolic execution on a seed, and R2 – a metric to quantify the number of vulnerabilities in a chunk of code. SAVIOR fulfills them as follows. To meet R1, SAVIOR approximates the newly explorable code regions based on a combination of static and dynamic analysis. During compilation, SAVIOR statically computes the set of reachable basic blocks from each branch. At run-time, SAVIOR identifies the unexplored branches on the execution path of a seed and calculates the basic blocks that are reachable from those branches. We deem that these blocks become explorable code regions once the concolic executor runs that seed. To meet R2, SAVIOR utilizes UBSan [43] to annotate three types of potential bugs (as shown in Table 5.1) in the program under testing. It then counts the UBSan labels in each code region as the quantitative metric for R2. As UBSan's conservative instrumentation may generate dummy labels, SAVIOR incorporates a static filter to safely remove useless labels. We discuss the details of this method in Section 5.2.2.1. The above two solutions together ensure a sound analysis for identifying potential bugs. First, our static reachability analysis, as described in Section 5.2.2.1, is built upon a sound algorithm. It

77 CHAPTER 5. BUG-DRIVEN HYBRID TESTING

over-approximates all the code regions that may be reached from a branch. Moreover, UBSan adopts a conservative design, which counts all the operations that may lead to the undefined behavior issues listed in Table 5.1 [43, 96]. Facilitated by these two aspects of soundness, we can avoid mistakenly underrating the bug-detecting potential of a seed.

Following the two solutions, SAVIOR computes the importance score for each seed as follows. Given a seed with n unexplored branches {e1, e2, ..., en}, SAVIOR counts the UBSan labels in the code that is reachable from these branches, respectively denoted {L1, L2, ..., Ln}. Also note that, in the course of testing, SAVIOR has made {S1, S2, ..., Sn} attempts to solve those branches. With these pieces of information, SAVIOR evaluates the importance score of this seed as the weighted average (1/n) * Σ_{i=1..n} e^(-0.05*S_i) * L_i, where L_i represents the potential of the i-th unexplored branch. We penalize L_i with e^(-0.05*S_i) to monotonically decrease its weight as the attempts to solve this branch grow. The rationale is that more failed attempts (usually from multiple paths) indicate a low success probability of resolving the branch. Hence, we decrease its potential so that SAVIOR can gradually de-prioritize hard-to-solve branches. Lastly, SAVIOR takes the average score over the candidate branches in order to maximize the bug-detection gain per unit of time. To better understand this scoring method, we show an example and explain the score calculation in Figure 5.3.

Figure 5.3: An example showing how to estimate the bug-detecting potential of a seed. In this example, the seed follows the path b1 -> b2 -> b3 -> b4. Basic blocks b5 and b7 are unexplored; they can reach L1 and L2 UBSan labels and have been attempted by constraint solving S1 and S2 times, respectively. The final score for this seed is (e^(-0.05*S1) * L1 + e^(-0.05*S2) * L2) / 2.
This scoring method ensures that SAVIOR always prioritizes seeds leading to more unverified bugs, while in the long run it does not get trapped by seeds with hard-to-solve branch conditions. First, it conservatively assesses a given seed using the results of the sound reachability and bug-labeling analyses: a seed that leads to more unexplored branches, from which more unverified bugs are reachable, earns a higher score. Second, it takes into account runtime information to continuously improve the precision of the assessment. This online refinement is important because statically SAVIOR can hardly know whether a branch condition is satisfiable or not. Utilizing the history of constraint-solving attempts, SAVIOR can decide whether a seemingly high-score branch is worth more resources in the future. As shown by our evaluation in Section 6.4, this scoring scheme significantly accelerates the detection of UBSan violations, which empirically supports the effectiveness of our design. Referring to our motivating example in Figure 5.1, the function packet_handler1 has few UBSan labels while pcap_handler2 contains hundreds of labels. Hence, the seed following Figure 5.1b has a lower score than the seed which runs the path in Figure 5.1c. This guides SAVIOR to prioritize the latter seed, which can significantly expedite the exploration of vulnerable code.

Figure 5.4: Solving the integer overflow in Figure 5.2. The concolic executor tracks that section->size depends on four input bytes (e.g., \xfb\xfb\xf4\xf1) and solves the overflow condition section->size + 1 > 0xffffffff, yielding the witness bytes \xff\xff\xff\xff. This shows the case on a 32-bit system, but it applies to 64-bit as well.

Bug-guided verification: This technique ensures sound vulnerability detection on the explored paths that reach vulnerable sites. Given a seed from fuzz testing, SAVIOR executes it and extracts the label of each vulnerability along the execution path. After that, SAVIOR verifies the predicate implanted in each label by checking its satisfiability under the current path condition — if the predicate is satisfiable, then its corresponding vulnerability is valid. This enables SAVIOR to generate a proof of either the vulnerability or its non-existence along a specific program path. Note that in concolic execution, many new states with new branch constraints are created; SAVIOR prioritizes constraint solving for states that require bug-guided verification. Going back to the example in Figure 5.2, classic hybrid testing misses the integer overflow at line 15. In contrast, SAVIOR is able to identify it with bug-guided verification. Aided by the Clang sanitizer [43], SAVIOR instruments the potential overflows in a solver-friendly way (i.e., the predicate triggering this overflow is section->size + 1 > 0xffffffff). As demonstrated in Figure 5.4, following a seed to reach the integer overflow location, SAVIOR tracks that the value


Figure 5.5: System architecture of SAVIOR. The Clang/LLVM tool-chain performs the analysis and instrumentation and produces the AFL binary, the SAVIOR binary, and the KLEE bitcode (with label information); the fuzzer (AFL) performs coverage tests; the coordinator performs bug-driven prioritization; and the concolic executor (KLEE) performs constraint solving and bug-guided verification, exchanging seeds and test cases with the fuzzer.

of section->size relies on a four-byte field in the input. By solving the vulnerability predicate, SAVIOR generates a witness value 0xffffffff and triggers the vulnerability.

5.2.2 System Design

Figure 5.5 depicts the overall architecture of SAVIOR. It consists of a compiling tool-chain built upon Clang and LLVM, a fuzzer derived from AFL, a concolic executor ported from KLEE, and a hybrid coordinator responsible for the orchestration. We explain these components in detail in the following sections.

5.2.2.1 The Compilation Tool-chain

SAVIOR's compilation tool-chain serves multiple purposes, including vulnerability labeling, control-flow reachability analysis, and building the targets of the different components.

Sound Vulnerability Labeling: In our design, we use Clang's Undefined Behavior Sanitizer (UBSan) [43] to label different families of potential bugs². Table 5.1 summarizes the families used in SAVIOR and the operations pertaining to them. We ignore the other bug types listed in UBSan (e.g., misaligned reference) since they are less likely to cause security issues. For each inserted label, we patch the Clang front-end to attach a !saviorBugNum metadata node, aiding the reachability analysis that we will shortly discuss. As explained in Section 5.2.1, UBSan over-approximates the potential vulnerabilities. This approximation ensures soundness since it never misses true bugs. UBSan also models the conditional triggers of the labeled bugs, as shown in Table 5.1; e.g., an out-of-bound (OOB) array access happens when the index x is not between zero and the array size minus one. At the time of bug-guided verification,

² Clang supports enabling checks on each individual bug family.


SAVIOR solves each triggering condition to produce a witness of the bug or to prove that, because the condition is unsatisfiable, the bug can never happen on the current path. SAVIOR uses UBSan by default, but other labeling methods may also apply if they meet the following two properties. First, they can comprehensively annotate the potential vulnerabilities. Second, they can synthesize the triggering condition of each labeled vulnerability. Note that such a condition must have a data dependency on the program input; otherwise, our concolic execution cannot correlate the input with the vulnerable conditions and hence has no guidance for bug-guided verification. For instance, AddressSanitizer [175] builds checks upon the status of its own redzones, which is not applicable to SAVIOR at the moment. UBSan's conservative approximation inevitably introduces false positives and might mislead SAVIOR's prioritization. In practice, we incorporate a static countermeasure to reduce fake labels. Specifically, we trim a label when all of the following requirements hold: 1) the label's parent (basic block) is its immediate dominator [183]; 2) the IR variables involved in the vulnerability conditions are not re-defined between the label and its parent; 3) the parent basic block has constraints that conflict with the vulnerability conditions, and these constraints are enforced by constant values. The first two points ensure that the constraints added by the parent persist upon reaching the label, and the third point indicates that the conflict always arises, regardless of the input and the execution path. Therefore, we can safely remove such a label.

1 char array[MAX]; // 0 < MAX < INT_MAX
2 for(int i = 0; i < MAX;){
3     array[i] = getchar(); // LABEL: OOB access
4     i++;                  // LABEL: integer-overflow
5 }

For instance, the code above has two labels that meet the three requirements. In this example, the variable i ranges from 0 to MAX, meaning that the array access at line 3 can never be out of bounds, nor can the self-increment at line 4 cause an integer overflow. SAVIOR hence removes the two labels. On average, we can conservatively remove 5.36% of the labels.

Reachability Analysis: This analysis counts the number of vulnerability labels that can be reached forward from each basic block in the program control flow graph (CFG). It proceeds in two phases. The first step constructs an inter-procedure CFG. The construction algorithm is close to the method implemented in SVF [187]. It individually builds intra-procedure CFGs for each function and then bridges the function-level CFGs by the caller-callee relation. To resolve indirect calls,


our algorithm iteratively performs Andersen's point-to analysis and expands the targets of the calls. This prevents SAVIOR from discarding aliasing information of indirect calls and, therefore, our prioritization does not miscount the number of vulnerability labels. By examining the CFGs, we also extract the edge relations between a basic block and its children for further use in the hybrid coordinator.

The second step is to calculate the UBSan labels that are reachable from each basic block in the constructed inter-procedure CFG. Specifically, we identify the regions of code that a basic block can reach and count the number of UBSan labels in those regions. In SAVIOR, we deem this number as the importance metric of that basic block and use it for bug-driven prioritization. For example, in Figure 5.6 the basic block BB can reach 8 other basic blocks, while 3 of them have UBSan labels. Thereby we output 3 as the number of reachable UBSan labels for BB. Note that each basic block has at most one label after Clang's compilation.

Figure 5.6: A demonstrative example of reachability analysis. The target BB can "reach" 3 UBSan labels.

Table 5.1: Families of potential bugs that SAVIOR enables UBSan to label. Here, x and y are n-bit integers; array is an array whose size is size(array); op_s and op_u refer to the binary operators +, -, *, /, % over signed and unsigned integers, respectively.

  UB Family                   Operation         Condition
  Out-of-bound array access   array[x]          x < 0 or x >= size(array)
  Oversized shift             x << y, x >> y    y < 0 or y >= n
  Signed integer overflow     x op_s y          x op_s y not in [-2^(n-1), 2^(n-1) - 1]
  Unsigned integer overflow   x op_u y          x op_u y > 2^n - 1

Target Building: After the labeling and the reachability analysis, SAVIOR's compiling tool-chain begins its building process. It compiles three targets from the source code — a fuzzing-binary for the fuzzer, a SAVIOR-binary for the coordinator, and an LLVM bitcode file for the concolic executor. In particular, the SAVIOR-binary is instrumented to print the unique IDs of the executed basic blocks. With this design, SAVIOR completely decouples the fuzzer, the concolic executor, and the coordinator, so it supports quick replacement of any component.


5.2.2.2 The Coordinator

The coordinator bridges the fuzzer and the concolic executor. It keeps polling seeds from the fuzzer’s queue and prioritizes those with higher importance for concolic execution. We explain the details as follows.

Bug-driven Prioritization: In a polling round, the coordinator processes the new seeds that have entered the fuzzer's queue since the last round. Each seed is fed to the SAVIOR-binary, and the coordinator updates two pieces of information based on the execution result. First, it updates the global coverage information. The coverage computation here follows AFL's original approach; that is, different hit counts of an edge count as different coverage (similar to AFL). Second, the coordinator records the sequence of basic blocks visited by each seed. Using the updated coverage information, the coordinator assigns a score to each seed following the scheme presented in Section 5.2.1. Here, we re-score all the seeds except those already tested by our concolic executor, since the coverage information is dynamically adjusted. Finally, the coordinator selects the top-ranked seeds and feeds them into the input queue of the concolic executor. If two seeds have the same score, the coordinator prefers the seed with the +cov property, which indicates that the seed brings new code coverage.

Post-processing of Concolic Execution: Going beyond seed scheduling for concolic execution, the coordinator also needs to triage the new seeds generated by the concolic executor for the fuzzer. First, it re-runs the new seeds and retains those that provide new coverage or reach uncovered bug labels. As a result, SAVIOR transfers the valuable test cases from the concolic executor to the fuzzer. Second, the coordinator updates the number of solving attempts upon uncovered branches: if a branch remains uncovered, its solving-attempt count is increased by 1. As such, a branch with a much higher solving-attempt value is de-prioritized.
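The two bookkeeping steps can be sketched together; `run_seed` below is a hypothetical stand-in for re-executing a seed on the SAVIOR-binary:

```python
def postprocess(new_seeds, run_seed, solving_attempts):
    """Triage concolic-execution outputs for the fuzzer (sketch).

    run_seed(seed) -> (gained_new_coverage, reached_uncovered_label,
                       uncovered_branches_hit)
    """
    kept = []
    for seed in new_seeds:
        new_cov, new_label, uncovered = run_seed(seed)
        if new_cov or new_label:
            kept.append(seed)  # forward valuable seeds to the fuzzer
        for branch in uncovered:
            # branches that stay uncovered accumulate attempts and
            # are de-prioritized in later rounds
            solving_attempts[branch] = solving_attempts.get(branch, 0) + 1
    return kept

attempts = {}
results = {"a": (True, False, ["br1"]), "b": (False, False, ["br1", "br2"])}
kept = postprocess(["a", "b"], lambda s: results[s], attempts)
print(kept, attempts)  # -> ['a'] {'br1': 2, 'br2': 1}
```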

5.2.2.3 The Concolic Executor

The concolic executor replays the seeds scheduled by the coordinator and chooses to solve branch conditions based on coverage information. In addition, it also performs bug-guided verification.

[Figure 5.7: Fork server mode in KLEE. In this mode, KLEE performs initialization only once and reuses the same executor for all received seeds.]

Independent Coverage Scheme: When encountering a branch instruction, the concolic executor needs to decide whether to solve that branch's condition. An intuitive design is to reuse the coverage information from the coordinator. However, our coverage scheme is ID based, and KLEE invokes a group of transformations on the target bitcode, which leads to numerous mismatches between the edge IDs in the SAVIOR-binary and the KLEE bitcode. To tackle this problem, we opt to use KLEE's internal coverage information, which better decouples the concolic executor from the other components.

Fork Server Mode: Before running a seed, KLEE needs to perform a series of initialization steps, including bitcode loading, library bitcode linking, and global data preparation, to place the program under test into the virtual machine. This initialization, however, typically takes a long time on large bitcode files. For instance, the initialization time for tcpdump is usually several times longer than the actual concolic execution time. To address this issue, we introduce an optimization named fork server mode for the KLEE concolic executor (as shown in Figure 5.7). Technical details are explained in Section 5.3.
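The idea is to amortize one expensive initialization over many seeds. The sketch below uses a Python stand-in for KLEE's C++ Executor, with illustrative method names, to show the loop structure only:

```python
class Executor:
    """Stand-in for KLEE's Executor; construction is the expensive part
    (bitcode loading, library linking, globals preparation)."""
    def __init__(self):
        self.initialized = True

    def run(self, seed):
        return f"explored:{seed}"

    def reset(self):
        # clear per-run state: memory manager, global data objects,
        # any leftover execution states
        pass

def fork_server_loop(seed_queue):
    executor = Executor()          # one-time setup, reused thereafter
    results = []
    for seed in seed_queue:
        results.append(executor.run(seed))
        executor.reset()           # undo stateful changes, keep executor
    return results

print(fork_server_loop(["s0", "s1", "s2"]))
# -> ['explored:s0', 'explored:s1', 'explored:s2']
```

Stock KLEE would instead construct and destruct an Executor per seed, repeating the lengthy environment setup each time.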

Bug-guided Verification: Our concolic executor also performs bug-guided verification. Once a non-covered vulnerability label is reached, we endeavor to solve the triggering constraint along the current path. If the solving succeeds, KLEE generates a seed as a proof of the vulnerability. In certain cases, the path constraints may conflict with the vulnerability-triggering conditions, even though the vulnerability can indeed be triggered along the same path (with fewer constraints). QSYM summarizes this issue as the over-constraint problem. We adopt QSYM's optimistic solving strategy only when solving the vulnerability conditions. However, the relaxed constraints may also produce false positives, so we do not count a vulnerability label as covered through relaxed-constraint solving.

Fuzzers    Source          Note
AFL        [4]             N/A
AFLGO      [3]             Use in-lined lava_get as target locations of guided fuzzing
TFUZZ      [39]            Use the docker environment prepared at [39] for evaluation
ANGORA     [7]             Patch Lava to support Angora, as suggested by the developers [38]
DRILLER    Self-developed  Follow the original Driller in scheduling concolic execution [17]
QSYM       [37]            N/A
SAVIOR     Self-developed  Use in-lined lava_get as labels of vulnerabilities

Table 5.2: Fuzzer-specific settings in the evaluation with LAVA-M.

Timeout on Concolic Execution: To prevent the concolic execution from hanging on localized code regions (e.g., deep loops and blocking I/O), the concolic executor needs a time threshold for running a seed. QSYM adjusts this time budget by watching AFL's status: if the number of hanging seeds increases, QSYM increases the timeout (up to 10 minutes). We instead set the timeout to be proportional to the number of uncovered branches that a seed can reach. The rationale is that such seeds need more time for constraint solving, and this setting benefits bug coverage.
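A minimal sketch of such a proportional timeout policy; the base value, per-branch increment, and cap below are illustrative constants, not SAVIOR's actual parameters:

```python
def concolic_timeout(uncovered_branches_reached,
                     base_seconds=30, per_branch=1.5, cap=600):
    """Timeout grows linearly with the number of uncovered branches a
    seed can reach, bounded by a hard cap. Constants are hypothetical."""
    return min(cap, base_seconds + per_branch * uncovered_branches_reached)

print(concolic_timeout(0))     # -> 30.0
print(concolic_timeout(100))   # -> 180.0
print(concolic_timeout(1000))  # -> 600 (capped)
```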

5.3 Implementation

We have implemented SAVIOR, which can be applied to software as sophisticated as Baidu's Apollo Autonomous Driving System [9, 102]. SAVIOR consists of four major components: a compiling tool-chain built on top of Clang and LLVM-4.0, a fuzzing component based on AFL-2.5b [4], a concolic executor built atop KLEE [71] (with LLVM-3.6), and a Python middleware that coordinates the fuzzing component and the concolic executor. In total, our implementation has about 3.3K lines of Python code and 4K lines of C/C++ code. SAVIOR runs on both 32-bit and 64-bit systems, and it supports both 32-bit and 64-bit targets. In the following, we discuss the important implementation details.

Concolic Executor: We develop our concolic executor based on KLEE (with LLVM-3.6). The original KLEE aims at full symbolic execution and does not support concolic execution, so we port a concolic executor from KLEE's symbolic executor. Specifically, the concolic executor attaches the concrete input as the assignment property in the initial state. It then symbolically interprets each instruction

as KLEE originally does. On reaching a conditional statement, it always follows the branch that matches the concrete input. For the other branch, if not yet covered, the concolic executor solves the branch conditions and generates a corresponding test case; the state following that branch is then immediately terminated. When generating the seed, our concolic executor copies the unconstrained bytes from the input instead of padding them with random values.

Another limitation of KLEE is that the initialization phase is notoriously time-consuming. To overcome this, we introduce a fork server mode. In a run, KLEE first sets up the environment with bitcode loading, library linking, and preparation of globals and constants, followed by the initialization of an Executor. By default, the Executor executes one seed and then destructs itself. In our implementation, after the execution of one seed, we clean up any stateful changes introduced during that execution (including destructing the memory manager, clearing the global data objects, and erasing all the remaining states), and then reuse the Executor to run a new seed from the input queue. In this mode, we avoid repeating the lengthy environment setup.

Recall that we invoke UBSan to label potentially vulnerable operations. At the IR level, UBSan replaces those operations with LLVM intrinsic functions, which KLEE cannot interpret. We replace those intrinsic functions with general LLVM IR so that KLEE can execute them without exceptions; the replacements follow those that KLEE already enforces [22].

By default, KLEE redirects un-modeled external functions (e.g., system calls) to the native code. This causes two issues. First, KLEE is unaware of their effects on the symbolic address space, which can break memory operations. For instance, the function strdup allocates a new buffer and copies data from the source to this buffer.
However, KLEE cannot capture this allocation due to the lack of modeling, so on future accesses to this buffer, KLEE throws an out-of-bound access error. There are many similar cases, such as getenv. We extend KLEE's environment model to include symbolic versions of those functions. Second, KLEE concretizes the data passed to external functions and adds constant constraints on such data for future execution. However, this may over-constrain the concretized variables. For instance, KLEE concretizes the data written to standard output or files; when the concretized data is later used in constraint solving, KLEE is then unable to find a satisfying solution. To address this issue, we prevent KLEE from adding constraints on concretization. This scheme, following the design of S2E [83] and QSYM [208], ensures that we never miss solutions for non-covered branches. Last but not least, stock KLEE provides limited support for software written in C++: many C++ programs rely on the standard C++ library (e.g., libstdc++ on Linux), but KLEE neither models this library nor supports the semantics of calls into it. Therefore, KLEE frequently

aborts the execution in the early stage of running a C++ program. We customize the GNU libstdc++ library to make it compilable and linkable with KLEE. Considering that many libstdc++ functions also access nonexistent devices (e.g., random devices), we also build models of those devices.

5.4 Evaluation

SAVIOR approaches bug-driven hybrid testing with two key techniques: bug-driven prioritization and bug-guided verification. In this section, we evaluate these techniques, centering on two questions:

• With bug-driven prioritization, can hybrid testing find vulnerabilities more quickly?
• With bug-guided verification, can hybrid testing find vulnerabilities more thoroughly?

To support our evaluation goals, we prepare two groups of widely used benchmarks. The first group is the LAVA-M data-set [97]. This data-set comes with artificial vulnerabilities, and the ground truth is provided. The second group includes a set of 8 real-world programs; details about these programs are summarized in Table 6.1. All these programs have been extensively tested in both industry [34] and academia [164, 184, 208], and they represent a higher level of diversity in functionality and complexity.

Using the two benchmarks, we compare SAVIOR with the most effective tools from related families. To be specific, we take AFL [4] as the baseline of coverage-based testing. As SAVIOR performs testing in a directed manner, we also include the state-of-the-art directed fuzzer, AFLGO [64]. To handle complex conditions, recent fuzzing research introduces a group of new techniques to improve code coverage; from this category, we cover TFUZZ [159] and ANGORA [75], because they are open-sourced and representative of the state of the art. Finally, we also consider the existing implementations of hybrid testing, DRILLER [184] and QSYM [208]. Note that the original DRILLER has problems running many of our benchmarks, due to the lack of system-call modeling or failures to generate test cases (even with the patch [16] to support input from files). This aligns with the observations in [208]. In the evaluation, we therefore re-implement DRILLER on top of SAVIOR. More specifically, it runs AFL as the fuzzing component and invokes the concolic executor once the pending_favs attribute in AFL drops to 0. This implementation strictly follows the original DRILLER [17].
Similar to the Angr-based concolic executor in DRILLER, our KLEE-based concolic executor focuses on generating new seeds to cover untouched branches.
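The trigger condition for this DRILLER re-implementation can be sketched as follows; pending_favs is the AFL statistic named above, while the dictionary-based interface is a simplification of reading AFL's fuzzer_stats:

```python
def driller_should_invoke_concolic(afl_stats):
    """Invoke the concolic executor once AFL's pending_favs drops to 0,
    i.e., when the fuzzer has exhausted its favored, not-yet-fuzzed
    seeds. afl_stats is a simplified view of AFL's fuzzer_stats."""
    return afl_stats.get("pending_favs", 0) == 0

print(driller_should_invoke_concolic({"pending_favs": 3}))  # -> False
print(driller_should_invoke_concolic({"pending_favs": 0}))  # -> True
```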


In addition, we keep the relaxed constraint solving and the fork-server mode. These two features increase the effectiveness and efficiency of DRILLER without introducing algorithmic changes. In the following, we explain the experimental setups and evaluation results for the two groups of benchmarks.

5.4.1 Evaluation with LAVA-M

5.4.1.1 Experimental Setup

In this evaluation, we run each of the fuzzers in Table 5.2 on the four LAVA-M programs, using the seeds shipped with the benchmark. For consistency, we conduct all the experiments on Amazon EC2 instances (Intel Xeon E5 Broadwell with 64 cores, 256 GB RAM, running Ubuntu 16.04 LTS), and we run all the experiments sequentially to avoid interference. In addition, we assign each fuzzer 3 free CPU cores to ensure fairness in terms of computation resources. Each test is run for 24 hours. To minimize the effect of randomness in fuzzing, we repeat each test 5 times and report the average results. In Table 5.2, we also summarize the settings specific to each fuzzer, including how we distribute the 3 CPU cores and the actions we take to accommodate those fuzzers. In LAVA-M, each artificial vulnerability is enclosed and checked in a call to lava_get (in-lined in our evaluation). We use these calls as the targets to guide AFLGO, and we mark them as vulnerability labels to enable bug-driven prioritization in SAVIOR. In addition, as the vulnerability condition is hard-coded in the lava_get function, we naturally have support for bug-guided verification. Finally, for ANGORA, we adopt the patches suggested by the developers [38].

5.4.1.2 Evaluation Results

In the left column of Figure 5.8, we show how many vulnerabilities are reached over time by different fuzzers. The results demonstrate that all the fuzzers can instantly cover the code containing LAVA vulnerabilities. However, as presented in the right column of Figure 5.8, TFUZZ, ANGORA, DRILLER, QSYM, and SAVIOR are able to trigger most (or all) of the vulnerabilities, while AFL and AFLGO trigger only a few. The reason is that the triggering conditions of LAVA vulnerabilities all take the form of 32-bit magic-number matching. Mutation-based fuzzers, including AFL and AFLGO, can hardly satisfy those conditions, while the other fuzzers all feature techniques to solve them.


Vulnerability Finding Efficiency: Although TFUZZ, ANGORA, DRILLER, QSYM, and SAVIOR all trigger large numbers of LAVA vulnerabilities, they differ in efficiency. TFUZZ quickly covers the listed vulnerabilities in base64 and uniq. This is attributable to the facts that (1) TFUZZ can reach all the vulnerabilities with several initial seeds and (2) TFUZZ can transform the program to immediately trigger the encountered vulnerabilities. Note that we do not show the results of TFUZZ on md5sum and who, because TFUZZ gets interrupted by a broken dependency.3 In all the cases, ANGORA triggers the vulnerabilities immediately after its start. The main reason is that the "black-box function" pertaining to all LAVA vulnerabilities is f(x) = x and the triggering conditions have the form f(x) == CONSTANT. ANGORA always starts evaluating such functions with x = CONSTANT; hence, it can instantly generate seeds that satisfy the vulnerability conditions. In the case of who, ANGORA does not find all the vulnerabilities because of its incomplete dynamic taint analysis. The three hybrid tools trigger every vulnerability that their concolic executors encounter. In the cases of base64, uniq, and md5sum, their concolic executors can reach all the vulnerabilities with the initial seeds, which explains why they all quickly trigger the listed vulnerabilities, regardless of their seed scheduling. In the case of who, even though the fuzzing component quickly generates seeds to cover the vulnerable code, the concolic executor takes much longer to run those seeds. For instance, while executing the inputs from AFL, QSYM needs over 72 hours of continuous concolic execution to reach all the LAVA bugs in who. Differing from DRILLER and QSYM, SAVIOR prioritizes seeds that have a higher potential of leading to LAVA bugs. As demonstrated by the results for who in Table 5.3, our technique of bug-driven prioritization indeed advances the exploration of code with more vulnerabilities. Note that DRILLER (with random seed scheduling) moves faster than QSYM, because QSYM prioritizes concolic execution on small seeds, while reaching the vulnerabilities in who requires larger seeds.

Vulnerability Finding Thoroughness: We further evaluate our bug-guided verification design. Specifically, we run the seeds generated by all the fuzzers with our concolic executor; in this experiment, we only perform constraint solving when a vulnerability condition is encountered. As shown in Table 5.4, bug-guided verification enables all the fuzzers not only to cover the listed LAVA bugs but also to disclose an extra group of LAVA bugs. Due to limited space, those additionally identified bugs are summarized in Table ?? in the Appendix. Such results strongly demonstrate the promising potential of bug-guided verification to benefit fuzzing tools in vulnerability finding.

3The broken component is the QEMU-based tracer in Angr [8]. This has been confirmed with the developers.

Fuzzers   base64      uniq        md5sum      who
AFL       0 (0%)      0 (0%)      0 (0%)      0 (0%)
AFLGO     2 (5%)      1 (4%)      0 (0%)      0 (0%)
TFUZZ     47 (100%)   29 (100%)   N/A         N/A
ANGORA    47 (100%)   28 (100%)   54 (95%)    1743 (79%)
DRILLER   48 (100%)   28 (100%)   58 (100%)   1827 (78%)
QSYM      47 (100%)   29 (100%)   58 (100%)   1244 (53%)
SAVIOR    48 (100%)   29 (100%)   59 (100%)   2213 (92%)
Listed    44          28          57          2136

Table 5.3: LAVA-M bugs triggered by different fuzzers (before bug-guided verification). "X%" indicates that X% of the listed LAVA bugs are triggered.

Fuzzers   base64      uniq        md5sum      who
AFL       48 (100%)   29 (100%)   59 (100%)   2357 (96.3%)
AFLGO     48 (100%)   29 (100%)   59 (100%)   2357 (96.3%)
TFUZZ     47 (100%)   29 (100%)   N/A         N/A
ANGORA    48 (100%)   29 (100%)   59 (100%)   2357 (96.3%)
DRILLER   48 (100%)   29 (100%)   59 (100%)   2357 (96.3%)
QSYM      48 (100%)   29 (100%)   59 (100%)   2357 (96.3%)
SAVIOR    48 (100%)   29 (100%)   59 (100%)   2357 (96.3%)
Listed    44          28          57          2136

Table 5.4: LAVA-M bugs triggered by different fuzzers (after bug-guided verification). "X%" indicates that X% of the listed LAVA bugs are triggered.

5.4.2 Evaluation with Real-world Programs

5.4.2.1 Experimental Setup

In this evaluation, we prepare 8 programs; details about these programs and the test settings are summarized in Table 6.1. All the programs have been extensively tested by both industry [34] and academic research [164, 184, 208]. Since different seed inputs and execution options can lead to varying fuzzing results [128, 165], we follow existing works in reusing the seeds shipped with AFL or the vendors, and in configuring the fuzzing options. As in our evaluation with LAVA-M, we conduct all the experiments on Amazon EC2 instances. To reduce randomness during testing,


Name      Version      Driver    Source  Seeds     Options
libpcap   4.9.2/1.9.0  tcpdump   [40]    built-in  -r @@
libtiff   4.0.10       tiff2ps   [27]    AFL       @@
libtiff   4.0.10       tiff2pdf  [27]    AFL       @@
binutils  2.31         objdump   [20]    AFL       -D @@
binutils  2.31         readelf   [20]    AFL       -A @@
libxml2   2.9.7        xmllint   [30]    AFL       @@
libjpeg   9c           djpeg     [25]    AFL
jasper    master       jasper    [21]    AFL       -f @@ -T pnm

Table 5.5: Real-world benchmark programs and evaluation settings. In the Seeds column, AFL indicates that we reuse the test cases provided with AFL, and built-in indicates that we reuse the test cases shipped with the program.

we run each test 5 times and report the average results. In addition, we leverage the Mann-Whitney U-test [146] to measure the significance of our improvements, following the suggestion by George et al. [128]. In this evaluation, we also prepare setups that are specific to each fuzzing tool. These setups mostly follow Table 5.2, except for the following. First, we use UBSan labels as the target locations for AFLGO and as the guidance for bug-driven prioritization in SAVIOR. Second, to prevent ANGORA from terminating the fuzzing process once it encounters un-instrumented library functions, we follow suggestions from the developers and add the list of un-instrumented functions to ANGORA's dfsan_abilist.txt configuration file. Third, we do not include TFUZZ, because it does not function correctly on our benchmark programs due to issues in the aforementioned third-party component. Furthermore, we instrument all the benchmark programs with UBSan for all fuzzers to ensure a fair comparison. This also means that bug-guided verification is enabled by default in DRILLER, QSYM, and SAVIOR.
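For reference, the Mann-Whitney U-test used above can be computed without external dependencies. This is a minimal two-sided version with the normal approximation and no tie correction, a stand-in for a full statistics package, applied here to hypothetical per-run measurements:

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation
    (no tie correction). Suitable for comparing small samples of
    repeated fuzzing-run measurements."""
    n1, n2 = len(a), len(b)
    # U counts, over all pairs, how often a beats b (ties count half)
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p

# e.g., bugs found across 5 repeated runs by two fuzzers (made-up data)
u, p = mann_whitney_u([59, 58, 59, 57, 59], [48, 50, 47, 49, 48])
print(u, p < 0.05)  # -> 25.0 True
```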

5.4.2.2 Evaluation Results

In Figure 5.9, we summarize the results of our second experiment. It shows two metrics over time: the number of triggered UBSan bugs and basic block coverage. In addition, we calculate the p-values of the Mann-Whitney U-test for SAVIOR vs. DRILLER and SAVIOR vs. QSYM. Note that we use the IDs of UBSan labels for de-duplication while counting the UBSan bugs, as each UBSan label is associated with a unique potential defect. In the following,

we delve into the details and explain how these results support our design hypotheses.

Vulnerability Finding Efficiency: As shown in Figure 5.9 (the left column for each program), SAVIOR triggers UBSan violations at a pace generally faster than all the other fuzzers. In particular, it outperforms DRILLER and QSYM in all the cases except djpeg. On average, SAVIOR discovers vulnerabilities 43.4% faster than DRILLER and 44.3% faster than QSYM. The low p-values (< 0.05)4 of the Mann-Whitney U-test support that these improvements are statistically significant. Since the three hybrid tools differ only in their seed scheduling, these results strongly demonstrate that the scheduling scheme in SAVIOR, bug-driven prioritization, accelerates vulnerability finding. In the case of djpeg, all six fuzzers trigger the same group of UBSan violations. This is because djpeg has a tiny code base, on which these fuzzers quickly saturate in code exploration; in addition, the conditions of those UBSan violations are simple enough that even mutation-based approaches can solve them.

Going beyond, we examine the number of labels reached by different fuzzers. In Table 5.6, we list the average results from our 24-hour tests. Not surprisingly, the hybrid tools cover higher volumes of UBSan labels than the ordinary fuzzers, likely because a hybrid tool can solve complex conditions, enabling coverage of the code and labels behind them. Among the hybrid tools, SAVIOR reaches 19.68% and 15.18% more labels than DRILLER and QSYM, respectively. Such results are consistent with the numbers of triggered UBSan violations, and they signify that our bug-driven prioritization guides SAVIOR to spend more resources on code with richer UBSan labels. In the case of djpeg, SAVIOR nearly ties with the other tools, for a similar reason as explained above. We further find that the efficiency boost of SAVIOR in vulnerability finding is not due to high code coverage.
As shown in Figure 5.9 (the right column for each program), we compare the code coverage of the six fuzzers. As demonstrated by the results, the efficiency of code coverage and that of UBSan violation discovery are not positively correlated. In particular, in the cases of tcpdump, libxml, tiff2pdf, objdump, and jasper, SAVIOR covers code at a similar or even slower pace than DRILLER and QSYM (the high p-values also support that SAVIOR is not quicker). However, SAVIOR triggers UBSan violations significantly more quickly in these cases. Such results validate the above hypothesis with high confidence.

Vulnerability Finding Thoroughness: In this experiment, we also measure the performance of

4The p-values of readelf and objdump are larger than 0.05 but they are at the level of quasi-significance. In the two programs, the variances are mainly due to randomness.


Prog.     AFL   AFLGO  ANGORA  DRILLER  QSYM  SAVIOR
tcpdump   2029  1235   1333    1906     2509  2582
tiff2ps   748   927    770     931      852   970
readelf   91    79     102     104      106   183
xmllint   588   580    456     567      568   597
djpeg     2746  2588   2546    2713     2707  2746
tiff2pdf  1488  1467   919     1448     1369  1478
jasper    649   660    679     691      731   752
objdump   780   715    844     835      906   1039
Avg.      1139  1031   956     1149     1218  1289

Table 5.6: Number of unique UBSan labels reached by different fuzzers in 24 hours. On average, SAVIOR reaches 19.68% and 15.18% more labels than DRILLER and QSYM, respectively.

bug-guided verification in enhancing the thoroughness of vulnerability finding. Specifically, we re-run the seeds from all the fuzzers with our concolic executor. In this test, we enable SAVIOR to do constraint solving only when encountering unsolved UBSan labels. In Table 5.7, we summarize the comparison results. For all 8 programs, bug-guided verification enables the different fuzzers to trigger new violations. The average increase ranges from 4.5% (SAVIOR) to 61.2% (ANGORA); in particular, it aids ANGORA in triggering 82 new UBSan bugs in total. In the case of djpeg, bug-guided verification does not help much, because djpeg has a relatively small code base and contains fewer vulnerability labels, leaving bug-guided verification less utilized. These results are further evidence that bug-guided verification can truly improve the thoroughness of vulnerability finding for fuzzing.

5.4.3 Vulnerability Triage

The UBSan violations triggered by SAVIOR can lead to various consequences, and some of them might be harmless. Therefore, we manually examine all the UBSan violations triggered by SAVIOR. These violations include those triggered in the 8 programs in Table 6.1 and also those from mjs, catdoc, and c++filt. We do not include the results of mjs, catdoc, and c++filt in the evaluation above, as all fuzzers trigger fewer than 10 UBSan violations on them, and a small difference would result in a big variance in comparison.

Triage Result: In total, we collect 481 UBSan violations, which we manually classify based on their consequences; the results are presented in Table 5.8. Specifically, 102 of them lead to OOB


Prog.     AFL       AFLGO      ANGORA      DRILLER    QSYM       SAVIOR
tcpdump   +10/11%   +22/41.5%  +29/76.3%   +9/9.9%    +4/4%      +8/7%
tiff2ps   +4/133%   +0/0%      +3/42.9%    +0/0%      +0/0%      +0/0%
readelf   +10/82%   +9/72.2%   +16/107%    +9/68.4%   +8/63.2%   +7/29.2%
libxml    +4/33.3%  +4/33.3%   +5/166.7%   +4/33.3%   +4/33.3%   +0/0%
tiff2pdf  +5/50%    +1/7.7%    +4/44.4%    +3/27.2%   +5/62.5%   +0/0%
djpeg     +0/0%     +7/5.2%    +7/5.2%     +0/0%      +0/0%      +0/0%
objdump   +7/10.9%  +7/11.7%   +11/17.2%   +7/11.7%   +6/9.5%    +0/0%
jasper    +0/0%     +0/0%      +7/30.4%    +7/26.9%   +7/26.9%   +0/0%
Avg.      +5/40.1%  +6/21.5%   +10/61.2%   +5/22.2%   +4.3/25%   +1.8/4.5%

Table 5.7: New UBSan violations triggered with bug-guided verification in the evaluation with real-world programs. "+X/Y%" means "X" new violations are triggered, increasing the total number by "Y%".

reads/writes and 141 of them result in logic errors. Those logic errors fall into different categories, such as incorrect computation, wrong outputs, and polluted conditional variables. Among the 243 OOB and logic errors, 16 have been confirmed by the developers, and our analysis so far reveals that at least 25 of them are exploitable for goals such as information leakage and control-flow manipulation. The remaining 238 cases are likely harmless according to our triage. They mainly fall into the following categories: (1) the variables triggering the UBSan violations are used as storage (e.g., an int used as char[4]) rather than as computation-related objects; (2) the affected variables expire immediately after the violations; (3) the program already anticipates the UBSan violations and handles them.

static bool process_notes_at(...) {
  // readelf.c:18303
  if (inote.namedata[inote.namesz - 1] != '\0')
    ...
}
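The defect in the readelf snippet above hinges on unsigned wraparound in the index arithmetic. A quick simulation, assuming a 64-bit size_t (the width is an assumption for illustration):

```python
WORD = 2**64  # assuming a 64-bit size_t

def unsigned_sub(a, b):
    """Simulate C unsigned subtraction: the result wraps modulo 2^64."""
    return (a - b) % WORD

# With namesz == 0, namesz - 1 wraps to the maximal unsigned value,
# so inote.namedata[namesz - 1] indexes far out of bounds.
print(unsigned_sub(0, 1))  # -> 18446744073709551615 (i.e., 2**64 - 1)
```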

Case Studies: From each of the three categories (OOB, logic errors, and those without harm), we pick a case and explain the details here. All the cases have been fixed. The first case is an OOB in readelf; the code is shown above. The variable inote.namesz is copied from the input. By making it equal to 0, (inote.namesz - 1) underflows to the maximal unsigned value, causing an OOB access to inote.namedata. The second case is a logic error in libtiff. Variable twobitdeltas[delta] is controlled by


Program   OOB  Logic Error  Exploitable*  Confirmed
tcpdump   6    102          6+            7
libjpeg   8    23           0+            N/A
objdump   41   4            4+            N/A
readelf   1    9            10+           3
libtiff   20   0            0+            N/A
jasper    21   2            2+            2
mjs       1    0            0+            1
catdoc    3    0            3+            1
c++filt   1    1            0             2
Total     102  141          25+           16

Table 5.8: Triage of UBSan violations triggered by SAVIOR in 24 hours.

the user. With a specially crafted input, one can cause an overflow in the result of lastpixel + twobitdeltas[delta], making SETPIXEL set the wrong pixel value in the decoded image.

static int ThunderDecode(...) {
  // tif_thunder.c:125
  if ((delta = ((n >> 4) & 3)) != DELTA2_SKIP)
    SETPIXEL(op, lastpixel + twobitdeltas[delta]);
  ...
}

The last case is harmless, as the program already accounts for the overflow. This case is located in libxml. As shown below, with a special input, the variable okey can overflow. However, the program takes okey modulo dict->size before using it, rendering the overflow harmless.

static int xmlDictGrow(...) {
  // dict.c:417
  okey = xmlDictComputeQKey(...);
  key = okey % dict->size;
  ...
}


[Figure 5.8: Evaluation results with LAVA-M (base64, uniq, md5sum, who). The left column shows the number of bugs reached by different fuzzers and the right column shows the number of bugs triggered by the fuzzers.]

[Figure 5.9: Evaluation results with real-world programs over 24 hours. For each program (tcpdump, tiff2ps, readelf, libxml, djpeg, tiff2pdf, jasper, objdump), one panel shows the number of UBSan violations triggered and the other shows the number of basic blocks reached. p1 and p2 are the p-values for the Mann-Whitney U-test of SAVIOR vs. DRILLER and SAVIOR vs. QSYM, respectively.]

Chapter 6

Learning-based Hybrid Fuzzing

6.1 Introduction

Hybrid testing has attracted much attention recently because of its dramatic improvements in bug discovery. For instance, the top winning teams in the DARPA Cyber Grand Challenge [14] benefited from hybrid testing [45]. Compared with plain fuzzing, hybrid testing introduces an extra component that uses concolic execution to revisit paths explored by the fuzzer and tries to solve hard-to-satisfy branch conditions that the fuzzer previously failed to solve. One key challenge in hybrid testing is how to intelligently select high-utility seeds for the concolic engine and the fuzzer so as to increase code coverage and, consequently, discover more bugs in a given timeframe. Essentially, the concolic execution engine cannot explore the majority of the seeds due to time constraints, and the system must make a trade-off between selecting inputs that have a known level of utility and inputs whose performance is completely unknown, yet might have better utility than the known ones. Existing work [24, 65, 78, 134, 184, 208, 212] uses heuristic-based guidance, such as selecting seeds with smaller sizes or seeds exercising less-explored paths, attempting to select the inputs with the highest utility. However, these heuristics, despite their simplicity, are not generalizable to a wide range of programs because the rules are fixed throughout fuzzing, so it is hard to justify whether they will remain effective across different benchmark programs. In contrast to heuristics, Machine Learning (ML) infers from data without the need to define explicit rules, and it automatically develops predictive models that generalize well. It then keeps adapting the model for every new sample it sees; the longer it is used, the more reliable

it gets. Hence, ML can be a more reliable alternative to heuristics for making sound seed-selection decisions in different circumstances. In this chapter, we introduce MEUZZ, an ML-enhanced hybrid fuzzing system. Unlike existing fuzzing systems that schedule seeds based on simple heuristics, MEUZZ compiles a set of static and dynamic features computed from the seeds and the corresponding program, and then uses these features to predict the utility of each seed. MEUZZ also has a built-in evaluation module that measures the quality of the predictions and feeds these data back for continuous learning, so that the prediction quality improves over time. To the best of our knowledge, MEUZZ is the first work [168] that applies ML to systematically prioritize seeds based on patterns automatically learned from the seeds' attributes. To integrate ML into hybrid fuzzing without blocking the fuzzing workflow, MEUZZ must carefully address two tasks: feature engineering and data labeling. While these are the essential steps to bootstrap learning, they can be time-consuming and thus too costly to integrate into the fuzzing loop. For instance, feature extraction can be very slow if the features require substantial analyses. Moreover, it is not straightforward to directly and precisely quantify seed utility, which is essential for labeling. To tackle these challenges, we first engineer a set of lightweight features based on reachability and dynamic analysis. Second, we propose a labeling approach that uses the input descendant tree to pinpoint the utility of a seed. Our evaluation shows that it takes only 5µs on average for MEUZZ to extract an individual feature, and that the descendant tree of a seed accurately reflects the seed's utility.
In addition to ML integration, MEUZZ investigates the feasibility of model reuse and transfer, since ML-based fuzzers inevitably face the question: "does the model transfer well to different fuzzing configurations or programs?" Collecting data and training a distinct model for every new program is usually time-consuming and uneconomical, and these constraints may limit the wide adoption of ML-based fuzzing. However, because the features extracted by MEUZZ are agnostic to programs and focus on nonexclusive characteristics, such as control-flow information and the likelihood of triggering bugs, the models generated by MEUZZ are highly adaptable. MEUZZ can collect many data points to refine its prediction model iteratively during fuzzing. At the end of a MEUZZ fuzzing session, the trained model can then be reused as part of the initialization when launching a new fuzzing job against the same or a different program. We compare MEUZZ with state-of-the-art fuzzing approaches [65, 75, 134] as well as hybrid testing systems [78, 208] on a set of commonly evaluated real-world benchmark programs. Our

results show that (i) MEUZZ achieves higher code coverage than all the tested fuzzers that use simple seed selection heuristics; in particular, MEUZZ boosts coverage by as much as 27.1% compared with QSYM, the state-of-the-art hybrid fuzzing system; and (ii) the MEUZZ prediction models have good reusability and transferability: the reused models boost coverage by 7.1% on average, and the transplanted models improve fuzzing performance in 38 out of 56 tests (67.9%), among which 10 out of the 38 (26.3%) cases see more than 10% improvement. This work makes the following contributions.

• Effective and generalizable approach. We design the MEUZZ system, which is the first system that applies machine learning to the seed selection stage of fuzzing. Our system shows that ML-based fuzzing is practically more effective and generalizable than heuristic-based fuzzing.

• Practical feature and label engineering. We address two major challenges, namely feature engineering and label inference, when integrating ML into the MEUZZ hybrid fuzzing framework. Considering that the approach should be effective, online friendly, and program agnostic, we engineer a series of features based on reachability analysis and dynamic instrumentation, and we propose a novel approach for automatic label inference based on the seed descendant tree.

• Reusable and transferable ML models. Our seed selection models demonstrate strong reusability and transferability, even though they are trained only on data collected during 24 hours of fuzzing. Thanks to these two properties, MEUZZ enables users to reuse a well-trained model across different programs or fuzzing configurations.

6.2 Background

6.2.1 Hybrid Fuzzing

Hybrid fuzzing [78, 184, 208] combines fuzzing and concolic execution to address the deficiencies of both approaches. Figure 6.1 shows the overview of a general hybrid fuzzing framework. The whole system consists of three major components: the fuzzer, concolic testing, and the coordinator. For the sake of brevity, we refer interested readers to [2, 24, 71, 111] for the technical details of fuzzing and concolic execution. Note that in the rest of the chapter we use the terms "seed" and "test case" interchangeably to refer to inputs fed to the fuzzer and the concolic engine.

[Figure 6.1 shows the coordinator (fuzzing monitor, seed selection, job launcher) mediating seeds and new test cases between the fuzzer (AFL) and the concolic testing engine (KLEE/QSYM/Angr), both of which execute the target program.]

Figure 6.1: General hybrid fuzzing workflow.

We dissect the coordinator component, as it is less discussed in the literature and is the focus of this work. The coordinator is a middleware that regulates the other two components. Its major tasks are (i) monitoring the fuzzer to decide when to launch the concolic execution engine; (ii) preparing the running environments for concolic testing; and (iii) selecting and filtering the inputs that flow between the fuzzer and the concolic executor. The seed selection module in the coordinator decides which seeds in the fuzzer's queue should be passed to concolic testing first (i.e., the seed utility prediction phase). Later, it filters the new test cases generated by the concolic executor before adding them back to the fuzzer's queue (i.e., the post-filtering phase).
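The coordinator's two phases can be sketched as a small simulation. The `Fuzzer` class, `predict_utility`, and `concolic_run` below are hypothetical stand-ins for illustration only, not MEUZZ's actual interfaces:

```python
import heapq

# Minimal stand-in for the fuzzer: a seed queue plus a set of covered branches.
class Fuzzer:
    def __init__(self, seeds):
        self.queue = list(seeds)
        self.coverage = set()

    def adds_new_coverage(self, case):
        # A test case is useful only if it covers at least one new branch.
        new = set(case["branches"]) - self.coverage
        self.coverage |= set(case["branches"])
        return bool(new)

def predict_utility(seed):
    # Placeholder utility: prefer seeds with many undiscovered neighbor branches.
    return seed.get("undiscovered_neighbors", 0)

def concolic_run(seeds):
    # Pretend each selected seed yields one test case solving a new branch.
    return [{"branches": [s["id"] * 100]} for s in seeds]

def coordinator_step(fuzzer, k=2):
    # Seed utility prediction phase: rank the queue and pick the top-k seeds.
    ranked = heapq.nlargest(k, fuzzer.queue, key=predict_utility)
    # Post-filtering phase: only import cases that add new coverage.
    imported = [c for c in concolic_run(ranked) if fuzzer.adds_new_coverage(c)]
    fuzzer.queue.extend(imported)
    return imported
```

Running `coordinator_step` repeatedly mimics the loop: the first round imports the concolic outputs of the two highest-utility seeds, while a second round on the same seeds imports nothing, since their branches are already covered.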

6.2.1.1 Seed utility prediction

Before launching concolic execution, the coordinator needs to rank all inputs in the fuzzer's queue by their utility. The utility of a seed should estimate its power to produce additional coverage if it is selected for fuzzing. As mentioned in Section 5.1.2, current methods use various heuristics to achieve this prioritization goal.

6.2.1.2 Post filtering

After concolic execution generates new inputs, the coordinator filters them, keeping the useful ones, before introducing them back to the fuzzer. Post-filtering plays an important role in guiding the exploration pace of the fuzzer. If a new test case that leads to code with large loops or little bug potential [78] is introduced to the fuzzer, it may reduce the "air time" of other seeds in the queue that could have higher utility. In our study, all three hybrid testing systems [78, 184, 208] use the synchronization mode of AFL [2]. In this mode, the fuzzer periodically scans the output of the concolic execution engine and adds the new test cases as seeds into its queue.

6.2.2 Supervised Machine Learning

Supervised machine learning enables learning from labeled data and applying the knowledge to unknown data. Algorithms in this category fall into two foremost subcategories, namely classification and regression. While classification is used for predicting categorical responses, regression predicts a numerical value for new data based on previously observed data. Regardless of the type of learning algorithm, the learning process itself can be either offline or online. In other words, learning can be done over the entire dataset (offline) or incrementally by considering one instance at a time (online). Since time matters during fuzzing, the right choice of machine learning algorithm, as well as its learning manner, affects both the efficiency and the effectiveness of fuzzing. We discuss online vs. offline learning in the following subsections.

6.2.2.1 Online learning

Some learning environments change from second to second, and a model needs to be learned (or updated) as soon as it sees a new sample. This is where online learning plays an important role: it considers only the new data to update the model, and its efficiency is a key advantage under time constraints. Learning algorithms that are compatible with standard optimization algorithms like stochastic gradient descent (SGD) can learn incrementally.
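As a toy illustration of such an incremental update (generic single-sample SGD on a linear model, not tied to any particular fuzzer):

```python
# One online SGD step for a linear model: each new sample (x, y) adjusts the
# weights once in the direction that reduces the squared error; no past data
# is stored.
def sgd_update(w, x, y, lr=0.01):
    pred = sum(wi * xi for wi, xi in zip(w, x))  # current prediction
    err = y - pred                               # prediction error
    return [wi + lr * err * xi for wi, xi in zip(w, x)]
```

Feeding a stream of samples through `sgd_update` gradually drives the weights toward the underlying relationship, one observation at a time.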

6.2.2.2 Offline learning

Opposite to online learning, which updates the model weights incrementally based on new observations, in offline learning the model needs to be trained on the whole dataset. Hence, the training process needs to be redone as new data appears. While offline learning might seem less efficient than online learning, because the entire dataset needs to be retrained, we will show later how offline learning can work efficiently alongside online learning if it is tuned properly. Random forest is a supervised technique with a high likelihood of producing good performance, sometimes even better than neural networks [104]. Random forest is mostly used offline, but it can also be implemented online [132, 169]. In general, however, online random forests demand more training data than their batch equivalents to attain comparable predictive performance. In addition to random forest, deep learning has shown success in different domains; however, deep models tend to perform better on unstructured data such as images, and they require a relatively large amount of data to perform well [44]. Moreover, such techniques need high computational power to be trained in a reasonable time.

6.3 System Design

6.3.1 System Overview

MEUZZ is the first machine learning-based hybrid fuzzer that learns from previously observed seeds and identifies which kinds of seeds have the potential to better explore the program under test. Figure 6.2 shows an overview of MEUZZ. MEUZZ starts fuzzing (1) a program with predefined or empty seeds. It then extracts features (2) from the program as well as the seeds (§6.3.3) to model coverage gains. These features are used to predict (3) the coverage potential of unknown seeds (§6.3.5). The concolic engine (4) then receives the highly influential seeds predicted in the prior step and produces mutated seeds. Next, MEUZZ guides the fuzzer to use these seeds and their generated mutants (by the evolutionary algorithms) to continually test the program. At the beginning, the prediction model is randomly initialized, so the prediction quality is uncertain. But as fuzzing continues, the model improves and provides more reliable predictions. MEUZZ updates the seed selection model in three steps. First, it grows the descendant trees (5) of the seeds generated by the concolic engine in step (4); it then derives a label (6) based on the descendant trees of the previously selected seeds (§6.3.4); finally, it updates or retrains the model (7) depending on the type of learning process (§6.3.5, §6.3.6).

6.3.2 System Requirements

MEUZZ aims to predict seed utility in a more accurate and generalizable fashion than existing heuristic-based approaches, while keeping the fuzzing efficiency intact. One of the steps that contributes the most to these goals is feature extraction. MEUZZ can potentially derive various semantic features because it has access to complex program structures, such as the Control Flow Graph (CFG) with sanitizer instrumentations. However, there are some challenges

[Figure 6.2 depicts the hybrid fuzzing loop (fuzzer and concolic engine) extended with an ML engine: feature extraction feeds hybrid features (e.g., number of reachable bugs, indirect calls, comparisons) into offline, online, or ensemble models that predict and select top-potential seeds, while seed descendant trees feed label inference and model training to update the seed selection model.]

Figure 6.2: System overview of MEUZZ. The coordinator is extended with an ML engine, which consists of four modules: feature extraction, label inference, prediction, and training. During fuzzing, utility prediction and model training are carried out consecutively. After extracting features for the inputs in the fuzzer's queue, the ML engine predicts their utilities based on the current model. Then, with the seed labels inferred from previously selected seeds, the model is trained iteratively on the new data.

that MEUZZ may encounter during feature extraction, because it requires adapting the ML engine to the online-style fuzzing workflow. To cope with such challenges, we impose the following requirements (R1–R3) to guide the feature engineering stage.

R1 - Utility Relevant: The ultimate goal of fuzzing is more code coverage as well as discovering more hidden bugs. The features should reflect characteristics that may improve these measures, for instance, how likely a seed is to trigger more potential bugs, or how much unexplored code a mutated seed will reach during its execution. Obviously, a seed is only meaningful in context, i.e., relative to the program it is executed upon. Accordingly, it is essential for feature extraction to consider the seed and the program as a bundle.

R2 - Seed-/Program-Agnostic: To achieve good generalizability, the features should be seed- and program-agnostic. As a counterexample, one could engineer a boolean feature that indicates whether a seed is genuine based on a magic number, so as to ignore invalid seeds when fuzzing a specific program. However, such a feature is target-dependent and needs to be customized for each program. In contrast, "meta properties" like the execution path triggered by the input are preferable, as they are universally usable characteristics.

R3 - Online Friendly: In order not to sacrifice efficiency compared to heuristic-based approaches, what matters is not only how efficiently the features can be extracted, but also how many features enter model construction. Lightweight features and a rich feature set ensure that the coordinator is not blocked from launching the concolic executor while still being able to construct meaningful models to predict seed utility. As a result, suitable features should strike a balance between the analysis goal (i.e., how informative the analysis result is) and the computation complexity (i.e., the time complexity of the analysis).

6.3.3 Feature Engineering

Following the requirements (R1–R3), we engineer the following list of features in MEUZZ and discuss them in four categories. The feature set is easily extensible, as long as the new features comply with the requirements (§6.3.2).

Bug-triggering: Inspired by existing research [78], we use the number of reachable sanitizer instrumentations as guidance for measuring how likely bugs can be triggered. As sanitizer instrumentations are based on sound analysis (i.e., no missed bugs), they provide a good over-approximation when quantifying the number of bugs that can be found. Hence, we extract these two features:

1. Count of reachable sanitizer instrumentations: For all branches along the path triggered by a given seed, the numbers of reachable sanitizer instrumentations are computed and then summed up. For instance, there are two branches in the left example of Figure 6.3; six potential bugs are reachable by following the branches, so the value of this feature is six.

2. Count of reached sanitizer instrumentations: For all branches along the path triggered by a given seed, we sum up the number of sanitizer instrumentations reached by the fuzzer. The major difference from the prior feature is that this one reflects the expectation of immediately solvable sanitizer bugs, while the former is an indirect reflection. For instance, the value of this feature in the right example of Figure 6.3 is two, because the bugs can be directly reached by negating the constraints of b1 and b2.

[Figure 6.3 shows two example seed execution paths annotated with branch conditions and sanitizer labels.]

Figure 6.3: Examples showing how the bug-triggering and coverage features are computed.

Coverage: Concolic execution is good at solving complex branch conditions. Hence, if the concolic executor encounters many unsolved branches when executing the given input, it can significantly improve code coverage. The most common situation where concolic execution helps is a conditional statement (i.e., if-then-else or switch-case). As the given input follows only one of the branches, we call the branches stemming from the same conditional statement neighbor branches. We extract the following feature to estimate potential new coverage.

1. Count of undiscovered neighbor branches: For all branches along the path triggered by the given seed, this feature sums the undiscovered neighbors of each branch, regardless of sanitizer instrumentation. For instance, the value of this feature in the right example of Figure 6.3 is two if the seed follows the path with the continue labels.

Constraint Solving: We also devise a set of features that impact the solving capability of the concolic execution engine. The incentive behind selecting these features is that the performance of the concolic executor significantly influences the entire hybrid fuzzing system.

1. Count of external calls: Existing concolic executors either rely on a simulated procedure or simply terminate path execution when encountering an external function. As a result, external function calls may negatively impact the concolic executor, for example by misleading the path or causing it to fail to generate correct seeds. This feature records the count of external function calls along the path executed by the given seed.

2. Count of comparison instructions: This feature records the count of cmp instructions along the path executed by the given seed. Comparison instructions pose the constraints on the execution path, which will later be solved by the SMT solver. However, constraint solving is very time-consuming, and it is often the reason for timeouts.

3. Count of indirect calls: This is the number of indirect call instructions along the path executed by the given seed. Indirect calls may cause state explosion: when the concolic executor encounters an indirect call with a symbolic pointer, it simply forks a state for each possible value the symbolic pointer can resolve to [171]. In large programs, there can be many possible values for a symbolic function pointer.

4. Length of path: This feature records the number of executed branches (not deduplicated) for the given input. It helps identify the existence of large loops, another common cause of state explosion and solver timeouts.

Empirical: This set of features is devised based on empirical observations by existing works. They might indirectly affect the fuzzing performance.

1. Input size: The size of the input is often employed by existing tools as a scheduling heuristic. On the one hand, smaller inputs often finish executing more quickly, leaving more time for the fuzzer or concolic executor to explore other inputs [36, 207]. On the other hand, larger inputs have a better chance of triggering more functionality [78]. Therefore, we consider the input size as one of the potential features for our approach.

2. First seed with new coverage: This is a boolean value indicating whether the given seed is the first to discover some new branches. It is based on the intuition that such seeds are more likely to trigger further new coverage. This feature is used in many popular fuzzers [24, 134].

3. Queue size: This feature records how many inputs are saved in the fuzzing queue at the time of the query. The longer the queue, the less likely it is to see more coverage. Since MEUZZ needs to predict the utility of each seed at runtime, namely how much new coverage fuzzing with the given input can discover, the prediction should take the current status of fuzzing into account.
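Putting the four categories together, a feature vector for one seed might be assembled as follows. The per-branch record fields (`reachable_san`, `cmps`, etc.) are hypothetical names for the quantities described above, not MEUZZ's actual data format:

```python
# Assemble one seed's feature vector from its execution path plus fuzzing
# status. `path` is a list of per-branch records (one per executed branch).
def extract_features(path, queue_size, is_first_new_cov, input_size):
    feats = {
        "reachable_sanitizer_labels": 0,  # bug-triggering (over-approximate)
        "reached_sanitizer_labels": 0,    # bug-triggering (directly solvable)
        "undiscovered_neighbors": 0,      # coverage potential
        "external_calls": 0,              # constraint-solving difficulty
        "cmp_instructions": 0,            # constraint-solving difficulty
        "indirect_calls": 0,              # constraint-solving difficulty
        "path_length": len(path),         # executed branches, not deduplicated
        "input_size": input_size,         # empirical
        "new_cov": int(is_first_new_cov), # empirical
        "queue_size": queue_size,         # empirical (current fuzzing status)
    }
    for br in path:  # sum the per-branch counts over the whole path
        feats["reachable_sanitizer_labels"] += br["reachable_san"]
        feats["reached_sanitizer_labels"] += br["reached_san"]
        feats["undiscovered_neighbors"] += br["undiscovered_neighbors"]
        feats["external_calls"] += br["external_calls"]
        feats["cmp_instructions"] += br["cmps"]
        feats["indirect_calls"] += br["indirect_calls"]
    return feats
```

On a two-branch path like the Figure 6.3 examples (six reachable labels, two reached labels, two undiscovered neighbors in total), the sketch yields exactly those sums.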

6.3.4 Seed Label Inference

Labeling is an indispensable stage of data preprocessing in supervised learning, and well-defined labels make prediction easier and more reliable. As we aim to predict the utility of a selected seed and there is no direct indication of whether the selected seed is definitely useful, we need to derive a label that reflects the degree of the seed's utility. To understand a seed's utility, we need to fuzz the program with that seed and check the outcome. Fuzzers that use genetic algorithms (GAs) for seed generation represent this outcome as a forest of input descendant trees, which depict the parent-child relationships of the seeds in the fuzzer's queue. Each node of a tree represents a seed, and each edge connects a seed to one of its mutants. In plain fuzzing, the root nodes are the original seeds provided by the user; in hybrid testing, the root nodes also include seeds generated by the concolic engine and imported by the fuzzer. A larger descendant tree indicates that the seed contributed more to the fuzzer's code coverage. Hence, to derive the label, we measure the size of a seed's input descendant tree (i.e., the number of nodes in its subtree) and use it as the label. In reality, it is not feasible to compute the complete descendant tree, since it could grow indefinitely if the user never terminates the fuzzing process. As a result, we limit the tree analysis to a time window to make label inference possible. Specifically, after the fuzzer imports a seed from the concolic executor, we wait a certain number of fuzzing epochs for the fuzzer to explore the imported seed and then compute the size of its descendant tree.
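The label inference above can be sketched as a generic subtree-size computation. The parent-child map below is an assumed representation of the fuzzer's descendant forest, not MEUZZ's actual bookkeeping:

```python
from collections import defaultdict

# A seed's label is the size of its descendant tree (number of nodes in its
# subtree) within the measurement window. `parent_of` maps each mutant to the
# seed it was derived from; `roots` are the imported (or initial) seeds.
def descendant_tree_sizes(parent_of, roots):
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    def subtree_size(node):
        # Count the node itself plus all of its (transitive) mutants.
        return 1 + sum(subtree_size(c) for c in children[node])

    return {root: subtree_size(root) for root in roots}
```

For example, a seed with two direct mutants, one of which itself spawned a mutant, gets label 4, while a seed with a single mutant gets label 2.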

6.3.5 Model Construction and Prediction

The next step after preparing the data is to predict the seed prominence (i.e., the label). As the labels are continuous values (i.e., numbers of nodes), the model behind the prediction should be a regression model. Hence, we embed a regression model in MEUZZ such that when new seeds are generated by the fuzzer, the model predicts the utility of the seeds and transfers the most promising seeds to the concolic engine. By doing so, the concolic engine runs only on the particular seeds that are most likely capable of discovering unexplored paths, saving it a great deal of time and effort. As seeds are mutated continuously while fuzzing a program in real time, prediction and model updates need to be done within a limited time window. This limitation makes online learning approaches desirable candidates for model construction. In online learning, the model can be incrementally updated by considering only new data. It does not need to store all previous data and learn a model from scratch every time. Instead, the model can be updated iteratively based on the incoming input, the previous model, and the historical fuzzing yields. Such an update is very fast and requires little storage, which fits our use case very well. Thus we adopt online learning as one of the techniques for model construction. In addition to online learning approaches, we include offline learning techniques as well. Although offline learning is known to be inefficient for real-time streaming data, its main bottleneck on such data is updating the model, not making predictions. We discuss updating the models further in Section 6.3.6.

6.3.6 Updating Model

To ensure the model stays up to date with the prevailing seeds, ideally we need to dynamically update or retrain the model, depending on the learning type (i.e., online vs. offline). By doing so, we not only predict in real time but also learn in real time. For online learning, we use the Recursive Least Squares (RLS) algorithm [66, 176] to update our linear model. Suppose at time t the input data and the label are x_t and y_t respectively, where x_t is a vector of dimension d. The following formula shows how the weight of the model at time t (i.e., w_t) is updated based on the weight of the previous model (i.e., w_{t-1}):

w_t = w_{t-1} + C_t^{-1} x_t [ y_t - x_t^T w_{t-1} ]

where C_t^{-1} is the inverse of C_t, and C_t is defined as:

C_t = \sum_{i=1}^{t} x_i x_i^T + I

Note that to calculate C_t^{-1}, we do not need to store all previous data and compute the inverse. Based on the Woodbury formula, C_t^{-1} can also be updated recursively as follows:

C_t^{-1} = C_{t-1}^{-1} - \frac{C_{t-1}^{-1} x_t x_t^T C_{t-1}^{-1}}{1 + x_t^T C_{t-1}^{-1} x_t}

The complexity of such an update is O(d^2). To update the offline learning algorithms, the model needs to be retrained every time new data appears. Although retraining the model on the whole dataset every time a new seed arrives seems time-consuming, we show in our evaluation that it is still practical in our case (§6.4.3), because the seeds are not of very high dimension and the number of seeds that need to be retrained stays within an acceptable order of magnitude. When the model has seen only a few samples, it predicts naively or almost randomly at the beginning, like other existing approaches; however, the predictions become more reliable as more seeds are generated.
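The RLS update can be sketched in a few lines (a generic implementation of the formulas above, assuming numpy; this is an illustration, not MEUZZ's code):

```python
import numpy as np

class RecursiveLeastSquares:
    """Online linear regression via RLS: maintains C_t^{-1} incrementally
    through the Woodbury identity, so each update costs O(d^2) and no past
    data needs to be stored."""

    def __init__(self, d, init_scale=1e3):
        self.w = np.zeros(d)
        # C_inv approximates C_0^{-1}; a large initial scale means weak
        # regularization from the identity term.
        self.C_inv = np.eye(d) * init_scale

    def update(self, x, y):
        # Woodbury update: C_t^{-1} = C_{t-1}^{-1}
        #   - (C_{t-1}^{-1} x x^T C_{t-1}^{-1}) / (1 + x^T C_{t-1}^{-1} x)
        Cx = self.C_inv @ x
        self.C_inv -= np.outer(Cx, Cx) / (1.0 + x @ Cx)
        # Weight update: w_t = w_{t-1} + C_t^{-1} x (y - x^T w_{t-1})
        self.w += self.C_inv @ x * (y - x @ self.w)

    def predict(self, x):
        return x @ self.w
```

Feeding a stream of (feature vector, label) pairs through `update` converges to the least-squares weights without ever revisiting old samples.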


Table 6.1: Evaluation settings

Name      Version  Driver    Initial Seeds  Options
tcpdump   4.10.0   tcpdump   [41]           -r @@
binutils  2.32     objdump   [12]           -D @@
binutils  2.32     readelf   [12]           -A @@
libxml    2.9.9    xmllint   [29]           stdin
libtiff   4.0.10   tiff2pdf  [28]           @@
libtiff   4.0.10   tiff2ps   [28]           @@
jasper    2.0.16   jasper    [26]           -f @@ -T pnm
libjpeg   jpeg9c   djpeg     [26]           stdin

6.4 Evaluation and Analysis

We conduct a comprehensive set of experiments to answer the following research questions:

• RQ1: Can ML-based seed scheduling outperform heuristic-based approaches (§6.4.2 and §6.4.6)?

• RQ2: Which features are more important in predicting seed utility, and which learning mode is more effective (§6.4.3)?

• RQ3: Does the learned model adapt well to different fuzzing configurations (§6.4.4)?

• RQ4: Is it feasible to transfer the learned model from one program to others to improve fuzzing yields (§6.4.5)?

6.4.1 Evaluation setup

Following the general fuzzing evaluation guideline [128], we use 8 real-world benchmark programs that are widely used in existing works [65, 75, 78, 208, 212]. Table 6.1 shows the configurations used for fuzzing each program. All experiments are conducted on AWS c5.18xlarge servers running Ubuntu 16.04 with 72 cores and 281 GB RAM. Unless explicitly mentioned otherwise, all tests run for 24 hours each and are repeated at least 5 times; we report the average result with the Mann-Whitney U-test. To evaluate MEUZZ, we compare it with several state-of-the-art grey-box fuzzers (AFL [134], AFLFast [65], and Angora [75]) as well as heuristic-based hybrid testing systems such as QSYM [208] and SAVIOR [78].¹ For MEUZZ, we also consider three different configurations according to the learning process, namely MEUZZ-OL, MEUZZ-RF, and MEUZZ-EN, which refer to the online-learning linear model, the offline-learning random forest model, and the ensemble of the previous two models, respectively. Hence, in total, we evaluate 8 fuzzing configurations on 8 benchmark programs, for 24 hours and repeated at least 5 times, in each of the following evaluations (learning effectiveness, model reusability, and model transferability). For fair comparison, each fuzzer is strictly assigned three CPU cores. We launch one master and two slaves for the grey-box fuzzers, and one master, one slave, and one concolic execution engine for the hybrid fuzzers. Because SAVIOR requires instrumenting the tested program with UBSAN [43], we also apply this sanitizer to all other fuzzers.

¹We did not include Driller [184] and DigFuzz [211] mainly because they are powered by Angr [180], which does not support real-world programs well due to incomplete environment modeling.

6.4.2 Learning Effectiveness

The most straightforward metric for measuring the effectiveness of MEUZZ is code coverage, which is also a widely accepted evaluation metric. Figure 6.4 shows the branch coverage achieved by the different fuzzers over fuzzing time. Based on the coverage results, we have several interesting findings. Firstly, MEUZZ covers more code than the other fuzzers in most programs after 24 hours of fuzzing. Among the non-ML fuzzers, QSYM performs best in terms of code coverage, thanks to its efficient concolic execution engine tailored specifically for hybrid fuzzing. Compared with QSYM, the MEUZZ variants achieve various levels of coverage improvement. In tcpdump, objdump, readelf, and libxml, MEUZZ improves code coverage over QSYM by more than 10%, and by as much as 27.1% in the case of MEUZZ-RF on readelf. In tiff2pdf and tiff2ps, MEUZZ also shows moderate coverage improvements. However, in jasper and djpeg, there is not much difference between MEUZZ and QSYM; we speculate this is because all fuzzers saturate and hit a plateau after 16 and 6 hours, respectively. Secondly, MEUZZ covers less code in the beginning but gradually surpasses the other fuzzers as time progresses. For example, on objdump, MEUZZ-OL and MEUZZ-RF did not cross QSYM and SAVIOR until after 9.6 hours of fuzzing, but eventually achieved 14% higher code coverage. Similar situations can be observed in libxml, readelf, and tiff2ps. This observation complies with our design: MEUZZ starts seed scheduling with random parameters, so the performance of seed selection is unpredictable at the beginning; but as time passes, fuzzing data are increasingly collected and used to refine the prediction model, making the predictions more accurate. Lastly, we observe that the effectiveness of MEUZZ-OL and MEUZZ-RF is comparable: MEUZZ-OL outperforms MEUZZ-RF in 3 out of the 8 programs. On the other hand, MEUZZ-RF achieves a much higher fuzzing yield on programs like libxml and readelf. This is also the reason why MEUZZ-EN does not always yield better coverage. According to the No Free Lunch (NFL) theorem [205] for supervised machine learning, no individual algorithm performs best on every problem, and the underlying reason for good or bad performance is data correlation. If the data cannot be modeled linearly, techniques using ensembles of decision trees may properly capture the non-linear relations in the data; moreover, they are more robust to outliers. As discussed in §6.3, offline learning needs to retrain on the whole dataset every time new data appears, which might degrade efficiency. However, this evaluation suggests that the additional delay does not impact the fuzzing output, which we discuss in detail in the next section.

6.4.3 Insights and Analyses

Table 6.2: Execution time spent on different learning stages

Model Update (s)      Prediction (s)        Feature Extraction (s)
Online     Offline    Online     Offline
0.000636   0.326139   0.000016   0.003168   5e-6

Online vs. offline learning: As mentioned in the previous section, offline learning with the random forest model sometimes beats online learning with the linear model; however, the main concern with offline learning is time delay, especially during the model updating stage. To further analyze the delays caused by offline learning, we profile each learning stage during the 24 hours of fuzzing and report the average time spent on the different learning steps. As shown in Table 6.2, although offline learning spent 512x and 198x more time than online learning on updating the model and making predictions respectively, the absolute time lapse is negligible (i.e., milliseconds). Hence, offline learning is not a critical hindrance in the hybrid fuzzing loop, which endorses the offline learning effectiveness discussed in Section 6.4.2. Having said that, if fuzzing continues for a longer time and the number of seeds increases significantly, offline learning can become an obstacle. In conclusion, we recommend that users choose online learning when the target program has simpler program structures and the model needs to be learned from an immense number of seeds. On the other hand, if the fuzzing time budget is limited and the seeds are fewer, offline learning is the more desirable candidate.


Feature Analysis: Figure 6.5 shows the importance of the features. The importance score is computed as the mean decrease in impurity from the offline random forest models [68]. Apart from Queue Size and New Cov, most features contribute similarly to the model. Queue Size contributes the most, mainly because it directly correlates with the label: having found more coverage in the past indicates that it is harder to find new coverage in the future. New Cov contributes the least among the rest of the features. While it is difficult to entirely disregard the minor contribution of New Cov, this suggests that putting much effort into following the seeds that bring new coverage might jeopardize the chance to explore unknown seeds. This is the well-known Multi-Armed Bandit (MAB) problem [58]. This finding might shed some light on the scheduling algorithms implemented in popular fuzzers like AFL [134] that rely heavily on the New Cov feature. The figure also shows that some features, such as Path Length and New Cov, are less program-dependent, while others, such as Reachable Label, are more tied to a particular program. This is understandable: Path Length is extracted dynamically, so it is precise, and program loops, a common trait among all programs, are the main factor that affects it. Similarly, New Cov is set at runtime when a seed is the first to trigger new program behavior (e.g., coverage), so this feature applies uniformly to a wide variety of programs. On the other hand, the Reachable Label feature is extracted with a hybrid approach, namely compile-time analysis and runtime instrumentation; thus the sensitivity of the static analysis (i.e., flow/context/field sensitivity) affects its precision, and some programs can be analyzed more easily than others.
It is worth mentioning that, thanks to our light-weight feature engineering, the average time to extract each feature is only 5µs (as shown in Table 6.2), which indicates that the online-friendly requirement is satisfied in MEUZZ.
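As an illustration of how mean-decrease-impurity scores arise, the sketch below ranks features by the best variance reduction a single split achieves. This is a depth-1 simplification of what a random forest averages over many trees and nodes; the feature names and data are hypothetical, not MEUZZ's training set.

```python
import random

def variance(ys):
    """Impurity of a node for regression: variance of the labels."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def impurity_decrease(xs, ys):
    """Best weighted variance reduction achievable by one split on this
    feature -- the per-node ingredient of mean-decrease-impurity scores."""
    n = len(ys)
    base = variance(ys)
    best = 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        dec = base - (len(left) / n) * variance(left) - (len(right) / n) * variance(right)
        best = max(best, dec)
    return best

random.seed(1)
# Hypothetical seed features [f0, f1, f2]: the label (utility) depends strongly
# on f0, weakly on f1, and not at all on f2.
rows = [[random.random() for _ in range(3)] for _ in range(200)]
ys = [5 * r[0] + 0.5 * r[1] for r in rows]
scores = [impurity_decrease([r[i] for r in rows], ys) for i in range(3)]
total = sum(scores)
importance = [s / total for s in scores]  # normalized so scores sum to 1
print(importance)
```

The dominant feature receives the bulk of the normalized importance, mirroring how Queue Size dominates in Figure 6.5; a real forest additionally averages these decreases over many randomized trees, which dampens the spurious gains a noise feature can collect.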

6.4.4 Model Reusability

Building machine learning models is a valuable but time-consuming task, so it is reasonable to build and reuse models where possible. By reusing a model, one can improve generalization, speed up training, and improve model accuracy. Reusability is also good evidence that our model correctly captures which inputs have higher utility when testing the target programs. Hence, we test the reusability of the models learned in the previous fuzzing experiments. We conduct an experiment in which we use a pre-trained model for fuzzing the same target program and compare the coverage difference. We make the following two changes to the experiment performed in §6.4.2: (i) the initial seeds are replaced by a naive input that consists of only 4 whitespaces; and (ii) all MEUZZ variants are initialized with the models they learned in the effectiveness test (with valid initial seeds). Figure 6.6 shows the coverage result with the Mann-Whitney U Test. There are several interesting observations. The most important one is that the MEUZZ variants perform well even at the beginning of fuzzing, compared with runs without model initialization. We believe this improvement is brought by the initial models. Additionally, "pure-AFL" fuzzers do not perform well with this naive initial seed; for instance, in tcpdump, AFL and AFLFast only generate 6 inputs in total after 24 hours of fuzzing (see Figure 6.6a). On the contrary, systems augmented with other input generation techniques, such as concolic execution and taint analysis, can generate more inputs and consequently explore significantly more code. Lastly, MEUZZ-RF outperforms its peers in djpeg, and its p-value indicates the improvement is significant (< 0.05), suggesting the non-linear model works better on djpeg.
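The significance test used throughout these comparisons is easy to reproduce. The sketch below implements the two-sided Mann-Whitney U test with the normal approximation and applies it to hypothetical branch-coverage numbers (the figures are illustrative, not our measured data).

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation
    (adequate for sample sizes around 20+; tied values receive midranks)."""
    combined = sorted((v, 0 if i < len(a) else 1)
                      for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):                  # assign midranks to tied values
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    n1, n2 = len(a), len(b)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu, sigma = n1 * n2 / 2, math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return u1, p

# Hypothetical branch-coverage counts from 10 repeated runs of each fuzzer:
meuzz = [1260, 1280, 1290, 1310, 1330, 1350, 1370, 1400, 1420, 1440]
qsym = [1080, 1090, 1100, 1130, 1140, 1150, 1160, 1170, 1200, 1210]
u, p = mann_whitney_u(meuzz, qsym)
print(u, p)
```

Because the test is rank-based, it makes no normality assumption about coverage distributions, which is why it is the standard choice for comparing fuzzer runs.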

6.4.5 Model Transferability

To further evaluate the model reusability explained in the previous section, we conduct a cross-program experiment to determine whether a model trained on one program transfers well to fuzzing a new program. This is known as transfer learning in the ML field [155]. As far as we know, no prior research has shown this property in fuzzing [168]. In this experiment, we augment MEUZZ with a pre-trained model from one program and compare the fuzzer's results on different programs with a baseline. Our baseline is the coverage result from the learning effectiveness experiment (§6.4.2), in which we use valid seeds to bootstrap fuzzing without model initialization. We choose MEUZZ-OL to measure transferability since it is less affected by time delays. We then fuzz each program using MEUZZ-OL initialized with the 8 pre-learned models. Figure 6.7 visualizes the relative coverage improvements (i.e., percentages) produced by each fuzzing configuration. The Y-axis shows the tested program and the X-axis shows the programs on which the models were built. This result shows three interesting findings. Firstly, MEUZZ-OL observes 7.1% more coverage on average when it is tested on the same program it is initialized with. The improvement for each program is shown on the diagonal of Figure 6.7, from top left to bottom right. Note that these models are only learned in 24 hours

from previous experiments; we expect to see much more improvement with the help of continuous fuzzing services (e.g., [34]). This again confirms that the previously learned models are reusable. Secondly, MEUZZ-OL observes improvement in 38 out of 56 cross-testing cases, a 67.9% success rate when the model is transferred from one program to another. Among them, 10 cases see more than 10% coverage improvement. Such improvement also indicates that the program-agnostic requirement is satisfied in MEUZZ. Last but not least, we notice that different programs have different "sensitivity" towards the transferred models. For instance, almost all the transferred models can strengthen fuzzing of the readelf, tiff2pdf, tiff2ps and djpeg programs, among which readelf sees the highest improvement. Interestingly, readelf achieves even higher improvement when using the tcpdump model than when using its own. However, other programs only partially accept foreign models. For instance, the model of tcpdump improves almost all other programs, while no external model improves tcpdump's own fuzzing yield. Apart from the learning algorithms, the number of seeds and the different data distributions of the feature values are the two underlying reasons behind the ineffectiveness of some transferred models. When there is more data, the model generalizes better [115]. For instance, the tcpdump model is trained on a larger number of seeds than the others (see Figure ??), which explains the effectiveness of the model built from tcpdump. Moreover, we compared the feature importance of each program. The combination of Size, Indirect call and Path Length plays a more important role for tcpdump, suggesting that tcpdump makes heavier use of function handlers (for different types of network packets) and recursive loops (for parsing packet fields). Other models, in contrast, have different feature value distributions as well as fewer data points, which explains why they fail to improve fuzzing tcpdump.
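The quantities behind Figure 6.7 reduce to simple arithmetic. The sketch below computes the relative-improvement matrix and the cross-program success rate from hypothetical coverage numbers (the program names are real, the numbers are illustrative only).

```python
# Hypothetical branch coverage: baseline[p] is coverage with valid seeds and no
# model; initialized[p][m] is coverage when fuzzing p with a model trained on m.
baseline = {"readelf": 5000, "tcpdump": 9000, "djpeg": 3000}
initialized = {
    "readelf": {"readelf": 5400, "tcpdump": 5600, "djpeg": 5100},
    "tcpdump": {"readelf": 8800, "tcpdump": 9700, "djpeg": 8900},
    "djpeg":   {"readelf": 3100, "tcpdump": 3200, "djpeg": 3250},
}
# Each heat-map cell: relative coverage improvement over the baseline, in %.
improvement = {p: {m: round(100 * (c - baseline[p]) / baseline[p], 1)
                   for m, c in models.items()}
               for p, models in initialized.items()}
# Off-diagonal cells are transfer cases; count how many of them help.
transfers = [improvement[p][m] for p in improvement
             for m in improvement[p] if p != m]
success_rate = 100 * sum(1 for v in transfers if v > 0) / len(transfers)
print(improvement, success_rate)
```

In this toy data, readelf benefits more from the tcpdump model than from its own, while nothing improves tcpdump, echoing the asymmetry described above.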

6.4.6 Discovered Bugs

To demonstrate the effectiveness of our system in discovering new bugs, we performed several analyses. In addition to analyzing the crashes reported during fuzzing, we manually analyzed the undefined behaviors reported by MEUZZ. UBSan reports a large number of undefined behaviors; however, the majority of them are deemed benign after our triage process. We also compile the programs with AddressSanitizer [1] as well as LeakSanitizer [23] and run the instrumented programs on the inputs generated by MEUZZ during fuzzing. In total, we found 33 undefined behaviors, 6 memory errors such as heap overflows, 7 Denial of Service (DoS) bugs, and 4 memory leaks, among which


Table 6.3: The table shows the discovered bugs by MEUZZ. UB, ME, DoS, and ML refer to Undefined Behavior, Memory Error, Denial of Service, and Memory Leak, respectively.

Program     Potential UB    ME    DoS    ML    Confirmed
tcpdump          10          -     -     -         2
objdump           4          -     1     -         -
readelf           2          2     3     -         1
tiff2pdf          1          -     2     -         2
tiff2ps           1          4     1     -         4
jasper            4          -     -     4         4
djpeg            11          -     -     -         6
Total            33          6     7     4        19

19 have been confirmed or fixed so far by the developers and the rest are pending (see Table 6.3). In particular, we observed 2 and 4 unique heap overflow bugs in readelf and tiff2ps, respectively. One of the heap overflow vulnerabilities in tiff2ps was discovered only by MEUZZ among the hybrid fuzzers. Figure 6.8 shows the vulnerable code snippet. This bug has been confirmed and fixed by the developers. It is an out-of-bound read vulnerability that leads to information disclosure. The vulnerability lies in the PSDataColorContig function, where the cp buffer, 4 bytes in size, is allocated on the heap and the 5th element of the buffer is then accessed as cp[4], which leads to the out-of-bound read. To trigger this bug, the loop needs to execute without early breaks, and to control the buffer size, the input needs to satisfy many constraints in the TIFFScanlineSize function so that it returns the value 4. Looking at the feature importance for tiff2ps, Cmp and Path Length play more important roles in its model; we believe this is why MEUZZ is able to guide the fuzzer to explore and trigger this bug. Although we conducted a variety of experiments when triaging UB bugs, they are not sufficient to evaluate the consequences of the UB cases for the functionality of the programs. Therefore, we discussed them with those developers of these projects who responded to us. For instance, in the libjpeg library, it turns out some undefined behaviors are used intentionally. Here is an example where UBSan complains that a left shift by 8 places cannot be represented in type long:

get_buffer = (get_buffer << 8) | c;

According to the developer: "the get_buffer variable is deliberately meant to be running through the compressed data continuously, inserting new data from right, shifting processed data out to the left, and reading data for processing in the middle. The latter operation is performed by macros GET_BITS and PEEK_BITS which involve a bit mask operation to ignore the bits in the leftmost position. So these leftmost bits which are eventually shifted out have no effect on the processing of the data". This statement shows that, in some cases, programming techniques used by developers constitute undefined behavior from UBSan's perspective. Thus, it is challenging to label a UB case as a bug or as benign without sufficient knowledge of the structure and logic of the program.
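To see why the masked reads are unaffected by the flagged shift, one can emulate the bit buffer with unbounded Python integers and an explicit width mask. This is a simplification for illustration, not libjpeg's actual macros.

```python
BUFFER_BITS = 32
MASK = (1 << BUFFER_BITS) - 1

def refill(get_buffer, byte):
    """Emulates the (get_buffer << 8) | c refill with an explicit width mask;
    in C, the left shift of an already-full signed buffer is what UBSan flags."""
    return ((get_buffer << 8) | byte) & MASK

def peek_bits(get_buffer, bits_left, n):
    """PEEK_BITS-style read of n bits from the middle of the buffer; the mask
    discards the stale leftmost bits, so they can never leak into a result."""
    return (get_buffer >> (bits_left - n)) & ((1 << n) - 1)

buf = 0
for c in [0xAB, 0xCD, 0xEF, 0x12, 0x34, 0x56]:  # old bytes shift out on the left
    buf = refill(buf, c)
print(hex(buf))  # -> 0xef123456
```

After six refills only the most recent four bytes remain, and `peek_bits(buf, 32, 8)` yields `0xEF`: exactly the developer's point that the bits shifted past the top never influence decoding.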

6.5 Discussions

6.5.1 Applicability of different machine learning models

6.5.1.1 Reinforcement learning

The basic goal of reinforcement learning (RL) is to decide what action to take in each given state such that the final reward is maximized. Although RL might seem a suitable candidate for the seed selection task in fuzzing, there are some caveats in applying it. RL requires a predefined, low-dimensional action space (e.g., robot arm movements), while in fuzzing the actions of selecting seeds form a much higher-dimensional space (e.g., taken or not taken, take one or take two, etc.). It is hard to define the action for selecting which seed, or which group of seeds, to fuzz.

6.5.1.2 Rank-based learning

Rank-based learning learns how to rank a set of data based on predefined preferences. What distinguishes it from traditional learning is that instead of predicting an absolute ground-truth score for each data point, rank-based learning predicts the relationship between data points; the input to learning is a set of features and their ranks. Although this aligns best with the goal of the seed selection task in fuzzing, the approach is not efficient enough: its complexity is quadratic, because rank-based learning needs to learn from its data in a pairwise manner. However, it is possible to take the model training step out of the hybrid fuzzing loop to offset the impact of the training delay. We leave this exploration to future work.
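The quadratic blow-up comes from the pairwise transform itself, which a short sketch makes concrete (the features and ranks here are hypothetical):

```python
from itertools import combinations

def to_pairwise(features, ranks):
    """Convert n ranked seeds into O(n^2) pairwise training samples:
    (x_i - x_j, +1) if seed i should be scheduled before seed j, else -1.
    This pairwise expansion is the quadratic cost discussed above."""
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        if ranks[i] == ranks[j]:
            continue  # no preference between equally ranked seeds
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if ranks[i] < ranks[j] else -1  # rank 1 = highest priority
        pairs.append((diff, label))
    return pairs

# Hypothetical 2-feature seeds with a known preference order:
feats = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
ranks = [1, 2, 3]
pairs = to_pairwise(feats, ranks)
print(len(pairs))  # n*(n-1)/2 = 3 pairs for 3 seeds
```

A linear model trained on these difference vectors (RankSVM-style) would recover the ordering, but a queue of thousands of seeds would expand to millions of pairs, which is why we keep such training out of the critical path.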


6.5.2 Applicability of MEUZZ on grey-box fuzzing

In this work, we propose using machine learning to learn, from a set of features, which seeds have better utility. Although this methodology is applicable to plain grey-box fuzzing as well, one must consider the delay caused by the realtime computation of features and labels. We envision that it is possible to pull the learning process, including data collection and model updating, out of the critical path, namely extracting features and updating the model in the background, so that the delay introduced by these extra computations will not block the fuzzer.
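One way to realize this is a background trainer thread fed through a queue, so the fuzzer only performs O(1) hand-offs and reads the latest model snapshot. The following is a minimal sketch with a toy linear model, not MEUZZ's implementation.

```python
import queue
import threading

class BackgroundLearner:
    """Keeps model updates off the fuzzer's critical path: the fuzzing loop
    only enqueues (features, label) pairs and reads the current snapshot."""

    def __init__(self):
        self.updates = queue.Queue()
        self.model = [0.0, 0.0, 0.0]          # linear weights snapshot
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._train, daemon=True)
        self._worker.start()

    def submit(self, x, y):
        """Called by the fuzzer: O(1), never blocks on training."""
        self.updates.put((x, y))

    def predict(self, x):
        """Scores a seed with whatever model snapshot is current."""
        with self._lock:
            w = list(self.model)
        return sum(wi * xi for wi, xi in zip(w, x))

    def _train(self):
        """Runs in the background: one SGD step per queued sample."""
        while True:
            x, y = self.updates.get()
            with self._lock:
                err = sum(wi * xi for wi, xi in zip(self.model, x)) - y
                self.model = [wi - 0.05 * err * xi
                              for wi, xi in zip(self.model, x)]
            self.updates.task_done()

learner = BackgroundLearner()
for _ in range(200):
    learner.submit([1.0, 0.0, 0.0], 2.0)  # observed utility for this pattern
learner.updates.join()                    # demo only; the fuzzer never waits
print(round(learner.predict([1.0, 0.0, 0.0]), 2))  # -> 2.0
```

The `join()` here exists only to make the demo deterministic; in the envisioned design the fuzzer keeps scheduling with a slightly stale model while the worker drains the queue.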


Figure 6.4: Branch coverage fuzzing with valid seeds (higher is better). p1, p2 and p3 are p-values of the Mann-Whitney U Test comparing QSYM with MEUZZ-OL, MEUZZ-RF and MEUZZ-EN, respectively. Panels: (a) tcpdump (p1=0.071, p2=0.005, p3=0.082); (b) objdump (p1=0.044, p2=0.056, p3=8.2e-4); (c) libxml (p1=0.035, p2=0.059, p3=0.054); (d) tiff2pdf (p1=8.2e-4, p2=5.6e-4, p3=6.2e-5); (e) tiff2ps (p1=0.035, p2=0.091, p3=0.017); (f) jasper (p1=0.037, p2=0.192, p3=0.015); (g) readelf (p1=0.012, p2=0.093, p3=8.2e-4); (h) djpeg (p1=0.072, p2=0.021, p3=0.093).


Figure 6.5: The box plots show the importance of the features on nine programs. The importance is extracted by training an offline random forest model and they are ranked by the median of their importance. Queue Size and New Cov are the most and the least important ones, respectively.


Figure 6.6: Branch coverage fuzzing with naive seeds (higher is better). p1, p2 and p3 are p-values of the Mann-Whitney U Test comparing QSYM with MEUZZ-OL, MEUZZ-RF and MEUZZ-EN, respectively. Panels: (a) tcpdump (p1=0.047, p2=0.018, p3=0.026); (b) objdump (p1=0.051, p2=0.002, p3=0.005); (c) libxml (p1=0.072, p2=0.032, p3=0.026); (d) tiff2pdf (p1=0.02, p2=0.03754, p3=5.7e-3); (e) tiff2ps (p1=6.04e-4, p2=0.012, p3=5.6e-3); (f) jasper (p1=0.264, p2=0.0268, p3=1.3e-3); (g) readelf (p1=0.03, p2=0.072, p3=0.037); (h) djpeg (p1=6.04e-3, p2=0.012, p3=3.68e-3).


Figure 6.7: This heat map shows the coverage improvement of model-initialized MEUZZ-OL over vanilla MEUZZ-OL. The Y-axis is the tested programs; the X-axis is the models used for initialization. Each cell shows the relative coverage comparison (%). The diagonal values show the coverage improvement on each program after initializing MEUZZ with the model learned from the same program (reusability). Model transferability is shown in 7 out of the 8 programs.

for (; cc < tf_bytesperrow; cc += samplesperpixel) {
    adjust = 255 - cp[nc];
    switch (nc) {
    case 4: c = *cp++ + adjust; PUTHEX(c,fd);
    case 3: c = *cp++ + adjust; PUTHEX(c,fd);
    case 2: c = *cp++ + adjust; PUTHEX(c,fd);
    case 1: c = *cp++ + adjust; PUTHEX(c,fd);
    }
}

Figure 6.8: Off-by-one heap read overflow in tiff2ps.


Chapter 7

Conclusion

In-process memory abuse has become a major threat to system and software security, as secret data and code present in memory are often unprotected and subject to exfiltration or illegitimate access. One major vector of in-process abuse is the exploitation of memory corruption bugs, and the pervasiveness of such bugs in type-unsafe languages like C and C++ further complicates defense. Moreover, as exploitation techniques grow more and more advanced, contemporary software mitigation techniques such as stack canaries, DEP, ASLR and CFI are no longer sufficient to safely guard the victim program. Last but not least, the status quo is that once an attacker bypasses the deployed exploit mitigations, the game is over: there is not enough support from the language runtime or the operating system to protect sensitive information, even when developers are willing to cooperate. Aiming to solve these problems, this thesis proposes both runtime protection and offline detection techniques to jointly thwart in-process abuse.

Runtime mitigation and isolation: In the first part of the thesis, we present both advanced software mitigations and in-process memory isolation techniques. Contemporary software exploits rely on info-leak bugs to aid code reuse attacks, owing to the prevalent deployment of DEP and ASLR. To further raise the bar for exploitation, we introduced a compiler-assisted code randomization (CCR) technique that enables fine-grained software randomization. We designed this hybrid approach to be compatible with the current software deployment model and, more importantly, to ensure the correctness of the rewritten software on the client side. Our evaluation of CCR on the SPEC CPU2006 benchmarks shows its feasibility and practicality, as it incurs a negligible average runtime overhead (0.28%). However, CCR alone is not sufficient, as just-in-time code reuse attacks can disclose gadgets

on the fly and bypass even the finest-grained randomization (i.e., instruction-level randomization). I then present NORAX, a framework that retrofits execute-only memory (XOM) protection into source-unavailable ARM64 programs. By making code pages non-readable, these programs can benefit from fine-grained randomization techniques (e.g., CCR). NORAX makes use of a new hardware feature in ARMv8 chips to enable XOM and incurs very low runtime (1.18%) and memory (2.21%) overheads, based on our evaluation on UnixBench and a set of system programs and libraries. For the case when mitigation techniques are bypassed, we propose as a last line of defense a fine-grained in-process memory isolation technique, namely shreds: a set of OS-backed programming primitives that addresses developers' currently unmet needs for fine-grained, convenient, and efficient protection of sensitive memory content against in-process adversaries. A shred can be viewed as a flexibly defined segment of a thread execution (hence the name). Each shred is associated with a protected memory pool, which is accessible only to code running in the shred. Unlike previous work, shreds offer in-process private memory without relying on separate page tables, nested paging, or modified hardware. In addition, shreds provide the essential data flow and control flow guarantees for running sensitive code. Thanks to the hardware-based mechanism and light-weight in-shred mitigations, and based on our experiments on 5 real-world programs, shreds are fairly easy to use and incur low runtime overhead.

Offline software bug detection: When runtime protections are not feasible for compatibility or performance reasons, we need to employ offline bug detection techniques to find memory corruptions and fix them before they are exploited. In the second part of the thesis, I present automated tools that enable both broad and in-depth fuzz testing. Firstly, we introduce SAVIOR, a bug-driven hybrid testing approach. Unlike mainstream hybrid testing tools, which follow a coverage-driven design, SAVIOR moves towards being bug-driven. We accordingly propose in SAVIOR two novel techniques, named bug-driven prioritization and bug-guided verification, respectively. On the one hand, SAVIOR prioritizes the concolic execution of seeds with higher potential of leading to vulnerabilities. On the other hand, SAVIOR examines all vulnerable candidates along the running program path in concolic execution. By modeling the unsafe conditions as SMT constraints, it either solves for proofs of valid vulnerabilities or proves that the corresponding vulnerabilities do not exist. SAVIOR significantly outperforms the existing coverage-driven tools. On average, it detects vulnerabilities 43.4% faster than DRILLER and 44.3% faster than QSYM, resulting in the discovery of 88 and 76 more security violations in 24 hours. Secondly, we present MEUZZ, a learning-based hybrid fuzzing system whose seed scheduling

process is guided by supervised machine learning. Theoretically, MEUZZ is more generalizable than systems using fixed seed selection heuristics. For effective integration of machine learning workloads into the online hybrid fuzzing loop, MEUZZ follows the requirements of being utility-relevant, online-friendly and program-agnostic in its feature engineering and label inference. Our evaluation shows that MEUZZ outperforms state-of-the-art fuzzers in coverage. In addition, the learned models demonstrate good reusability and transferability, making it more practical to apply machine learning to hybrid fuzzing. More importantly, MEUZZ found a heap-overflow bug in libtiff, a well-tested library, under the same fuzzing configurations; this bug was not found by any other hybrid fuzzing system.

Bibliography

[1] Addresssanitizer. https://clang.llvm.org/docs/AddressSanitizer.html.

[2] Afl technical details. http://lcamtuf.coredump.cx/afl/technical_details.txt.

[3] Aflgo source code. https://github.com/aflgo/aflgo.

[4] American fuzzy lop. http://lcamtuf.coredump.cx/afl.

[5] Android compatibility test suite. https://source.android.com/compatibility/cts/index.html.

[6] Android executables mandatorily need to be pie. https://source.android.com/security/enhancements/enhancements50.html.

[7] Angora source code. https://github.com/AngoraFuzzer/Angora.

[8] angr/tracer: Utilities for generating dynamic traces. https://github.com/angr/tracer.

[9] Apollo: an open autonomous driving platform. https://github.com/ApolloAuto/apollo.

[10] Apple warns developers when binaries are not compiled as position-independent. https://developer.apple.com/library/content/qa/qa1788/_index.html.

[11] Arm 32-bit sandbox. https://developer.chrome.com/native-client/reference/sandbox_internals/arm-32-bit-sandbox.

[12] Binutils test cases. https://github.com/mirrorer/afl/tree/master/testcases/others/elf.


[13] clang: a c language family frontend for llvm. http://clang.llvm.org/.

[14] Darpa cyber grand challenge. http://archive.darpa.mil/cybergrandchallenge/.

[15] Domain access control register. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0434b/CIHBCBFE.html.

[16] Driller reading file patch. https://github.com/shellphish/driller/issues/48.

[17] Driller stuck heuristic. https://github.com/shellphish/driller#the-stuck-heuristic.

[18] Dwarf standards. http://www.dwarfstd.org.

[19] EL0 execute-only memory configuration. https://armv8-ref.codingbelief.com/en/chapter_d4/d44_1_memory_access_control.html.

[20] Index of /gnu/binutils. https://ftp.gnu.org/gnu/binutils/.

[21] jasper source code. https://github.com/mdadams/jasper/archive/master.zip.

[22] Klee intrinsiccleaner. https://github.com/klee/klee/blob/master/lib/Module/IntrinsicCleaner.cpp.

[23] Leaksanitizer. https://clang.llvm.org/docs/LeakSanitizer.html.

[24] libfuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html.

[25] libjpeg source code. https://www.ijg.org/files/jpegsrc.v9c.tar.gz.

[26] Libjpeg test cases. https://github.com/mirrorer/afl/tree/master/testcases/images/jpeg.

[27] Libtiff source code. https://download.osgeo.org/libtiff/.

[28] Libtiff test cases. https://github.com/mirrorer/afl/tree/master/testcases/images/tiff.


[29] Libxml test cases. https://github.com/mirrorer/afl/tree/master/testcases/others/xml.

[30] libxml2 source code. http://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz.

[31] Memory domains. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211k/Babjdffh.html.

[32] The myth of "bug free" software. https://www.betabreakers.com/the-myth-of-bug-free-software/.

[33] Objdump overflow patch. https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commitdiff;h=f2023ce7.

[34] Oss-fuzz - continuous fuzzing for open source software. https://github.com/google/oss-fuzz.

[35] Profile guided optimization. https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization.

[36] Qsym: A practical concolic execution engine tailored for hybrid fuzzing. https://github.com/sslab-gatech/qsym.

[37] Qsym source code. https://github.com/sslab-gatech/qsym.

[38] Run angora on lava dataset. https://github.com/AngoraFuzzer/Angora/blob/master/docs/lava.md.

[39] T-fuzz source code. https://github.com/HexHive/T-Fuzz.

[40] Tcpdump source code. http://www.tcpdump.org/release/.

[41] Tcpdump test cases. https://github.com/the-tcpdump-group/tcpdump/ tree/master/tests.

[42] The LLVM Compiler Infrastructure. http://llvm.org.

[43] Undefined behavior sanitizer - clang 9 documentation. http://clang.llvm.org/ docs/UndefinedBehaviorSanitizer.html#ubsan-checks.


[44] When does deep learning work better than svms or random forests? https://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html, 04 2016.

[45] Cyber grand shellphish. http://www.phrack.org/papers/cyber_grand_shellphish.html, 2017.

[46] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti. Control-Flow Integrity. In Proceedings of the 12th ACM Conference on Computer and Communications Security, pages 340–353. ACM, 2005.

[47] A. Arya, O. Chang, M. Moroz, M. Barbella, and J. Metzman. Open sourcing ClusterFuzz. https://opensource.googleblog.com/2019/02/open-sourcing-clusterfuzz.html, 2019.

[48] Akamai Technologies. Secure storage of private (rsa) keys. https://lwn.net/Articles/594923/.

[49] D. Andriesse, X. Chen, V. van der Veen, A. Slowinska, and H. Bos. An in-depth analysis of disassembly on full-scale x86/x64 binaries. In Proceedings of the 25th USENIX Security Symposium, pages 583–600, 2016.

[50] D. Andriesse, A. Slowinska, and H. Bos. Compiler-agnostic function detection in binaries. In Proceedings of the 2nd IEEE European Symposium on Security & Privacy (EuroS&P), pages 177–189, April 2017.

[51] T. Avgerinos, S. K. Cha, B. L. T. Hao, and D. Brumley. AEG: automatic exploit generation. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA, 6th February - 9th February 2011, 2011.

[52] A. M. Azab, P. Ning, and X. Zhang. Sice: a hardware-level strongly isolated computing environment for x86 multi-core platforms. In Proceedings of the 18th ACM conference on Computer and communications security, pages 375–388. ACM, 2011.

[53] D. Babić, S. Bucur, Y. Chen, F. Ivančić, T. King, M. Kusano, C. Lemieux, L. Szekeres, and W. Wang. Fudge: fuzz driver generation at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 975–985. ACM, 2019.


[54] M. Backes, T. Holz, B. Kollenda, P. Koppe, S. Nürnberger, and J. Pewny. You can run but you can’t read: Preventing disclosure exploits in executable code. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1342–1353. ACM, 2014.

[55] M. Backes and S. Nürnberger. Oxymoron: Making fine-grained memory randomization practical by allowing code sharing. In Proceedings of the 23rd USENIX Security Symposium, 2014.

[56] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley. BYTEWEIGHT: Learning to Recognize Functions in Binary Code. In Proceedings of the 23rd USENIX Security Symposium, pages 845–860, 2014.

[57] E. Bendersky. Assembler relaxation. http://eli.thegreenplace.net/2013/01/03/assembler-relaxation, 2013.

[58] D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5:71–87, 1985.

[59] S. Bhatkar, D. C. DuVarney, and R. Sekar. Efficient techniques for comprehensive protection from memory error exploits. In Usenix Security, 2005.

[60] D. Bigelow, T. Hobson, R. Rudd, W. Streilein, and H. Okhravi. Timely rerandomization for mitigating memory disclosures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 268–279. ACM, 2015.

[61] A. Bittau, A. Belay, A. Mashtizadeh, D. Mazières, and D. Boneh. Hacking blind. In 2014 IEEE Symposium on Security and Privacy, pages 227–242. IEEE, 2014.

[62] A. Bittau, P. Marchenko, M. Handley, and B. Karp. Wedge: Splitting applications into reduced-privilege compartments. In NSDI, volume 8, pages 309–322, 2008.

[63] T. Bletsch, X. Jiang, V. W. Freeh, and Z. Liang. Jump-oriented programming: a new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, pages 30–40. ACM, 2011.

[64] M. Böhme, V.-T. Pham, M.-D. Nguyen, and A. Roychoudhury. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 2329–2344. ACM, 2017.


[65] M. Böhme, V.-T. Pham, and A. Roychoudhury. Coverage-based greybox fuzzing as markov chain. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1032–1043. ACM, 2016.

[66] L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142.

[67] K. Braden, S. Crane, L. Davi, M. Franz, P. Larsen, C. Liebchen, and A.-R. Sadeghi. Leakage- resilient layout randomization for mobile devices. In Proceedings of the 2016 Network and Distributed System Security (NDSS) Symposium, 2016.

[68] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and regression trees. 1984.

[69] D. Brumley and D. Song. Privtrans: Automatically partitioning programs for privilege separation. In USENIX Security Symposium, pages 57–72, 2004.

[70] J. Burnim and K. Sen. Heuristics for scalable dynamic test generation. In 23rd IEEE/ACM International Conference on Automated Software Engineering, 15-19 September 2008, L’Aquila, Italy, pages 443–446, 2008.

[71] C. Cadar, D. Dunbar, and D. Engler. Klee: Unassisted and automatic generation of high- coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pages 209–224. USENIX Association, 2008.

[72] N. Carlini, A. Barresi, M. Payer, D. Wagner, and T. R. Gross. Control-flow bending: On the effectiveness of control-flow integrity. In 24th USENIX Security Symposium (USENIX Security 15), pages 161–176, 2015.

[73] M. Castro, M. Costa, J.-P. Martin, M. Peinado, P. Akritidis, A. Donnelly, P. Barham, and R. Black. Fast byte-granularity software fault isolation. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 45–58. ACM, 2009.

[74] S. K. Cha, T. Avgerinos, A. Rebert, and D. Brumley. Unleashing mayhem on binary code. In IEEE Symposium on Security and Privacy, SP 2012, 21-23 May 2012, San Francisco, California, USA, pages 380–394, 2012.

[75] P. Chen and H. Chen. Angora: Efficient fuzzing by principled search. In 2018 IEEE Sympo- sium on Security and Privacy (SP), pages 711–725. IEEE, 2018.


[76] X. Chen, T. Garfinkel, E. C. Lewis, P. Subrahmanyam, C. A. Waldspurger, D. Boneh, J. Dwoskin, and D. R. Ports. Overshadow: a virtualization-based approach to retrofitting protection in commodity operating systems. In ACM SIGOPS Operating Systems Review, volume 42, pages 2–13. ACM, 2008.

[77] Y. Chen, M. Ahmadi, R. Mirzazade Farkhani, B. Wang, L. Lu, et al. Meuzz: Smart seed scheduling for hybrid fuzzing. arXiv preprint arXiv:1906.07327, 2019.

[78] Y. Chen, P. Li, J. Xu, S. Guo, R. Zhou, Y. Zhang, L. Lu, et al. Savior: Towards bug-driven hybrid testing. arXiv preprint arXiv:1906.07327, 2019.

[79] Y. Chen, D. Mu, J. Xu, Z. Sun, W. Shen, X. Xing, L. Lu, and B. Mao. Ptrix: Efficient hardware-assisted fuzzing for COTS binary. In Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security (AsiaCCS), 2019.

[80] Y. Chen, S. Reymondjohnson, Z. Sun, and L. Lu. Shreds: Fine-grained execution units with private memory. In 2016 IEEE Symposium on Security and Privacy (SP), pages 56–71. IEEE, 2016.

[81] Y. Chen, Z. Wang, D. Whalley, and L. Lu. Remix: On-demand live randomization. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, pages 50–61. ACM, 2016.

[82] Y. Chen, D. Zhang, R. Wang, R. Qiao, A. M. Azab, L. Lu, H. Vijayakumar, and W. Shen. Norax: Enabling execute-only memory for COTS binaries on AArch64. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 304–319. IEEE, 2017.

[83] V. Chipounov, V. Kuznetsov, and G. Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In ACM SIGARCH Computer Architecture News, volume 39, pages 265–278. ACM, 2011.

[84] C. Cifuentes and M. Van Emmerik. Recovery of jump table case statements from binary code. In IEEE International Workshop on Program Comprehension, 1999.

[85] M. Conti, S. Crane, L. Davi, M. Franz, P. Larsen, M. Negro, C. Liebchen, M. Qunaibit, and A.-R. Sadeghi. Losing control: On the effectiveness of control-flow integrity under stack attacks. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 952–963. ACM, 2015.

[86] J. Corbet. SMP alternatives. https://lwn.net/Articles/164121/, 2005.

[87] J. Corbet. Memory protection keys. https://lwn.net/Articles/643797/, May 2015.

[88] S. Crane, C. Liebchen, A. Homescu, L. Davi, P. Larsen, A.-R. Sadeghi, S. Brunthaler, and M. Franz. Readactor: Practical code randomization resilient to memory disclosure. In 2015 IEEE Symposium on Security and Privacy, pages 763–780. IEEE, 2015.

[89] S. J. Crane, S. Volckaert, F. Schuster, C. Liebchen, P. Larsen, L. Davi, A.-R. Sadeghi, T. Holz, B. De Sutter, and M. Franz. It’s a trap: Table randomization and protection against function-reuse attacks. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 243–255. ACM, 2015.

[90] L. Davi, C. Liebchen, A.-R. Sadeghi, K. Z. Snow, and F. Monrose. Isomeron: Code randomization resilient to (just-in-time) return-oriented programming. In NDSS, 2015.

[91] L. Davi, A.-R. Sadeghi, D. Lehmann, and F. Monrose. Stitching the gadgets: On the ineffectiveness of coarse-grained control-flow integrity protection. In 23rd USENIX Security Symposium (USENIX Security 14), pages 401–416, 2014.

[92] L. V. Davi, A. Dmitrienko, S. Nürnberger, and A.-R. Sadeghi. Gadge me if you can: secure and efficient ad-hoc instruction-level randomization for x86 and arm. In Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, pages 299–310. ACM, 2013.

[93] L. M. de Moura and N. Bjørner. Satisfiability modulo theories: introduction and applications. Commun. ACM, 54(9):69–77, 2011.

[94] L. M. de Moura, B. Dutertre, and N. Shankar. A tutorial on satisfiability modulo theories. volume 4590 of Lecture Notes in Computer Science, pages 20–36, 2007.

[95] Z. Deng, B. Saltaformaggio, X. Zhang, and D. Xu. iRiS: Vetting private API abuse in iOS applications. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 44–56, New York, NY, USA, 2015. ACM.

[96] W. Dietz, P. Li, J. Regehr, and V. Adve. Understanding integer overflow in C/C++. In Proceedings of the 34th International Conference on Software Engineering, ICSE’12, pages 760–770, 2012.

[97] B. Dolan-Gavitt, P. Hulin, E. Kirda, T. Leek, A. Mambretti, W. Robertson, F. Ulrich, and R. Whelan. LAVA: Large-scale automated vulnerability addition. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), pages 110–121. IEEE, 2016.

[98] Z. Durumeric, J. Kasten, D. Adrian, J. A. Halderman, M. Bailey, F. Li, N. Weaver, J. Amann, J. Beekman, M. Payer, et al. The matter of heartbleed. In Proceedings of the 2014 Conference on Internet Measurement Conference, pages 475–488. ACM, 2014.

[99] K. ElWazeer. Deep Analysis of Binary Code to Recover Program Structure. PhD dissertation, University of Maryland, 2014.

[100] U. Erlingsson, M. Abadi, M. Vrable, M. Budiu, and G. C. Necula. XFI: Software guards for system address spaces. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 75–88. USENIX Association, 2006.

[101] I. Evans, F. Long, U. Otgonbaatar, H. Shrobe, M. Rinard, H. Okhravi, and S. Sidiroglou-Douskos. Control jujutsu: On the weaknesses of fine-grained control flow integrity. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 901–913. ACM, 2015.

[102] H. Fan, F. Zhu, C. Liu, L. Zhang, L. Zhuang, D. Li, W. Zhu, J. Hu, H. Li, and Q. Kong. Baidu apollo em motion planner. arXiv preprint arXiv:1807.08048, 2018.

[103] R. M. Farkhani, S. Jafari, S. Arshad, W. Robertson, E. Kirda, and H. Okhravi. On the effectiveness of type-based control flow integrity. In Proceedings of the 34th Annual Computer Security Applications Conference, pages 28–39. ACM, 2018.

[104] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 15(1):3133–3181, Jan. 2014.

[105] B. Ford and R. Cox. Vx32: Lightweight user-level sandboxing on the x86. In USENIX Annual Technical Conference, pages 293–306. Boston, MA, 2008.

[106] S. Gan, C. Zhang, X. Qin, X. Tu, K. Li, Z. Pei, and Z. Chen. CollAFL: Path sensitive fuzzing. In 2018 IEEE Symposium on Security and Privacy (SP), pages 660–677. IEEE, 2018.

[107] V. Ganesh and D. L. Dill. A decision procedure for bit-vectors and arrays. volume 4590 of Lecture Notes in Computer Science, pages 519–531, 2007.

[108] D. Geneiatakis, G. Portokalidis, V. P. Kemerlis, and A. D. Keromytis. Adaptive Defenses for Commodity Software Through Virtual Application Partitioning. In Proceedings of the 19th ACM conference on Computer and communications security (CCS), pages 133–144, 2012.

[109] J. Gionta, W. Enck, and P. Ning. HideM: Protecting the contents of userspace memory in the face of disclosure vulnerabilities. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy, pages 325–336. ACM, 2015.

[110] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, pages 213–223, 2005.

[111] P. Godefroid, M. Y. Levin, D. A. Molnar, et al. Automated whitebox fuzz testing. In NDSS, volume 8, pages 151–166, 2008.

[112] E. Göktas, E. Athanasopoulos, H. Bos, and G. Portokalidis. Out of control: Overcoming control-flow integrity. In 2014 IEEE Symposium on Security and Privacy, pages 575–589. IEEE, 2014.

[113] L. Guan, J. Lin, B. Luo, and J. Jing. Copker: Computing with private keys without RAM. In 21st ISOC Network and Distributed System Security Symposium (NDSS), 2014.

[114] L. Guan, J. Lin, B. Luo, J. Jing, and J. Wang. Protecting private keys against memory disclosure attacks using hardware transactional memory. In Security and Privacy (SP), 2015 IEEE Symposium on, pages 3–19, May 2015.

[115] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, Mar. 2009.

[116] D. Hansen. [RFC] x86: Memory protection keys. https://lwn.net/Articles/643617/, May 2015.

[117] L. C. Harris and B. P. Miller. Practical analysis of stripped binary code. SIGARCH Comput. Archit. News, 33(5):63–68, Dec. 2005.

[118] K. Harrison and S. Xu. Protecting cryptographic keys from memory disclosure attacks. In Dependable Systems and Networks, 2007. DSN’07. 37th Annual IEEE/IFIP International Conference on, pages 137–143. IEEE, 2007.

[119] J. Hiser, A. Nguyen-Tuong, M. Co, M. Hall, and J. W. Davidson. ILR: Where’d my gadgets go? In 2012 IEEE Symposium on Security and Privacy, pages 571–585. IEEE, 2012.

[120] R. N. Horspool and N. Marovac. An approach to the problem of detranslation of computer programs. Computer Journal, 23(3):223–229, 1980.

[121] W. E. Howden. Symbolic testing and the DISSECT symbolic evaluation system. IEEE Trans. Software Eng., 3(4):266–278, 1977.

[122] K. Jayaraman, D. Harvison, V. Ganesh, and A. Kiezun. jFuzz: A concolic whitebox fuzzer for Java. In First NASA Formal Methods Symposium - NFM 2009, Moffett Field, California, USA, April 6-8, 2009, pages 121–125, 2009.

[123] T. Johnson. ThinLTO: Scalable and Incremental LTO. http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html, 2016.

[124] S. Kell, D. P. Mulligan, and P. Sewell. The missing link: Explaining ELF static linking, semantically. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 607–623, 2016.

[125] C. Kil, J. Jun, C. Bookholt, J. Xu, and P. Ning. Address space layout permutation (ASLP): Towards fine-grained randomization of commodity software. In ACSAC, volume 6, pages 339–348, 2006.

[126] D. Kilpatrick. Privman: A library for partitioning applications. In USENIX Annual Technical Conference, FREENIX Track, pages 273–284, 2003.

[127] J. C. King. Symbolic execution and program testing. Commun. ACM, 19(7):385–394, 1976.

[128] G. Klees, A. Ruef, B. Cooper, S. Wei, and M. Hicks. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 2123–2138. ACM, 2018.

[129] T. Klein. RELRO - a (not so well known) memory corruption mitigation technique. http://tk-blog.blogspot.com/2009/02/relro-not-so-well-known-memory.html, 2009.

[130] H. Koo, Y. Chen, L. Lu, V. P. Kemerlis, and M. Polychronakis. Compiler-assisted code randomization. In 2018 IEEE Symposium on Security and Privacy (SP), pages 461–477. IEEE, 2018.

[131] H. Koo and M. Polychronakis. Juggling the gadgets: Binary-level code randomization using instruction displacement. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 23–34. ACM, 2016.

[132] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mondrian forests: Efficient online random forests. In Advances in neural information processing systems, pages 3140–3148, 2014.

[133] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.

[134] lcamtuf. american fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2015.

[135] Y. Li. Target independent code generation. http://people.cs.pitt.edu/~yongli/notes/llvm3/LLVM3.html, 2012.

[136] Y. Liu, T. Zhou, K. Chen, H. Chen, and Y. Xia. Thwarting memory disclosure with efficient hypervisor-enforced intra-domain isolation. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, 2015.

[137] M. Ludvig. CFI support for GNU assembler (GAS). http://www.logix.cz/michal/devel/gas-cfi/, 2003.

[138] A. Machiry, E. Gustafson, C. Spensky, C. Salls, N. Stephens, R. Wang, A. Bianchi, Y. R. Choe, C. Kruegel, and G. Vigna. Boomerang: Exploiting the semantic gap in trusted execution environments. In Proceedings of the 2017 Network and Distributed System Security Symposium (NDSS), 2017.

[139] R. Majumdar and K. Sen. Hybrid concolic testing. In Software Engineering, 2007. ICSE 2007. 29th International Conference on, pages 416–426. IEEE, 2007.

[140] V. J. Manes, H. Han, C. Han, S. K. Cha, M. Egele, E. J. Schwartz, and M. Woo. Fuzzing: Art, science, and engineering. arXiv preprint arXiv:1812.00140, 2018.

[141] A. J. Mashtizadeh, A. Bittau, D. Boneh, and D. Mazières. CCFI: Cryptographically enforced control flow integrity. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 941–951. ACM, 2015.

[142] S. McCamant. Large single compilation-unit C programs. http://people.csail.mit.edu/smcc/projects/single-file-programs/, 2006.

[143] J. M. McCune, Y. Li, N. Qu, Z. Zhou, A. Datta, V. Gligor, and A. Perrig. TrustVisor: Efficient TCB reduction and attestation. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 143–158. IEEE, 2010.

[144] J. M. McCune, B. J. Parno, A. Perrig, M. K. Reiter, and H. Isozaki. Flicker: An execution infrastructure for TCB minimization. In ACM SIGOPS Operating Systems Review, volume 42, pages 315–328. ACM, 2008.

[145] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi, V. Shanbhogue, and U. R. Savagaonkar. Innovative instructions and software model for isolated execution. In Proceed- ings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy, pages 1–1. ACM, 2013.

[146] P. E. McKnight and J. Najab. Mann-whitney u test. The Corsini encyclopedia of psychology, pages 1–1, 2010.

[147] V. Mohan, P. Larsen, S. Brunthaler, K. W. Hamlen, and M. Franz. Opaque control-flow integrity. In NDSS, 2015.

[148] MSDN. SecureString class. https://msdn.microsoft.com/en-us/library/system.security.securestring.aspx.

[149] D. Mu, A. Cuevas, L. Yang, H. Hu, X. Xing, B. Mao, and G. Wang. Understanding the reproducibility of crowd-reported security vulnerabilities. In Proceedings of the 27th USENIX Conference on Security Symposium, pages 919–936. USENIX Association, 2018.

[150] T. Müller, F. C. Freiling, and A. Dewald. TRESOR runs encryption securely outside RAM. In USENIX Security Symposium, pages 17–17, 2011.

[151] D. Niemi. UnixBench 4.1.0.

[152] B. Niu and G. Tan. RockJIT: Securing just-in-time compilation using modular control-flow integrity. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1317–1328. ACM, 2014.

[153] B. Niu and G. Tan. Per-input control-flow integrity. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 914–926. ACM, 2015.

[154] B. S. Pak. Hybrid fuzz testing: Discovering software bugs via fuzzing and symbolic execution. Master’s thesis, School of Computer Science, Carnegie Mellon University, 2012.

[155] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, Oct 2010.

[156] V. Pappas, M. Polychronakis, and A. D. Keromytis. Smashing the gadgets: Hindering return- oriented programming using in-place code randomization. In Proceedings of the 33rd IEEE Symposium on Security & Privacy (S&P), pages 601–615, May 2012.

[157] V. Pappas, M. Polychronakis, and A. D. Keromytis. Smashing the gadgets: Hindering return- oriented programming using in-place code randomization. In 2012 IEEE Symposium on Security and Privacy, pages 601–615. IEEE, 2012.

[158] V. Pappas, M. Polychronakis, and A. D. Keromytis. Dynamic reconstruction of relocation information for stripped binaries. In Proceedings of the 17th International Symposium on Research in Attacks, Intrusions and Defenses (RAID), pages 68–87, September 2014.

[159] H. Peng, Y. Shoshitaishvili, and M. Payer. T-fuzz: fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy (SP), pages 697–710. IEEE, 2018.

[160] T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Fractured processes: adaptive, fine-grained process abstractions. In Proceedings of the 2014 International Conference on Timely Results in Operating Systems, pages 4–4. USENIX Association, 2014.

[161] N. Provos, M. Friedl, and P. Honeyman. Preventing privilege escalation. In USENIX Security, volume 3, 2003.

[162] R. Qiao and R. Sekar. Function interface analysis: A principled approach for function recognition in COTS binaries. In The 47th IEEE/IFIP International Conference on Dependable Systems and Networks, 2017.

[163] G. Ramalingam. The Undecidability of Aliasing. ACM Trans. Program. Lang. Syst., 16(5):1467–1471, September 1994.

[164] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos. Vuzzer: Application- aware evolutionary fuzzing. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2017.

[165] A. Rebert, S. K. Cha, T. Avgerinos, J. Foote, D. Warren, G. Grieco, and D. Brumley. Optimiz- ing seed selection for fuzzing. In Proceedings of the 23rd USENIX Conference on Security Symposium, pages 861–875. USENIX Association, 2014.

[166] R. Roemer, E. Buchanan, H. Shacham, and S. Savage. Return-oriented programming: Sys- tems, languages, and applications. ACM Transactions on Information and System Security (TISSEC), 15(1):2, 2012.

[167] R. Rudd, R. Skowyra, D. Bigelow, V. Dedhia, T. Hobson, C. Liebchen, S. Crane, P. Larsen, L. Davi, M. Franz, A.-R. Sadeghi, and H. Okhravi. Address-Oblivious Code Reuse: On the Effectiveness of Leakage Resilient Diversity. In Proceedings of the Network and Distributed System Security Symposium (NDSS’17), Feb 2017.

[168] G. J. Saavedra, K. N. Rodhouse, D. M. Dunlavy, and P. W. Kegelmeyer. A review of machine learning applications in fuzzing. arXiv preprint arXiv:1906.11133, 2019.

[169] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1393–1400. IEEE, 2009.

[170] F. Schuster, T. Tendyck, C. Liebchen, L. Davi, A.-R. Sadeghi, and T. Holz. Counterfeit object-oriented programming: On the difficulty of preventing code reuse attacks in C++ applications. In 2015 IEEE Symposium on Security and Privacy, pages 745–762. IEEE, 2015.

[171] E. J. Schwartz, T. Avgerinos, and D. Brumley. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In 2010 IEEE symposium on Security and privacy, pages 317–331. IEEE, 2010.

[172] K. Sen. Concolic testing. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, pages 571–572. ACM, 2007.

[173] K. Sen and G. Agha. CUTE and jCUTE: Concolic unit testing and explicit path model-checking tools. In Computer Aided Verification, 18th International Conference, Seattle, WA, USA, August 17-20, 2006, Proceedings, pages 419–423, 2006.

[174] K. Sen, D. Marinov, and G. Agha. CUTE: a concolic unit testing engine for C. In Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2005, Lisbon, Portugal, September 5-9, 2005, pages 263–272, 2005.

[175] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov. AddressSanitizer: A fast address sanity checker. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, pages 28–28. USENIX Association, 2012.

[176] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[177] D. She, K. Pei, D. Epstein, J. Yang, B. Ray, and S. Jana. Neuzz: Efficient fuzzing with neural program smoothing. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019.

[178] M.-W. Shih, S. Lee, T. Kim, and M. Peinado. T-SGX: Eradicating controlled-channel attacks against enclave programs. In Proceedings of the 2017 Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2017.

[179] E. Shioji, Y. Kawakoya, M. Iwamura, and T. Hariu. Code shredding: Byte-granular randomization of program layout for detecting code-reuse attacks. In Proceedings of the 28th Annual Computer Security Applications Conference (ACSAC), pages 309–318, 2012.

[180] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), pages 138–157. IEEE, 2016.

[181] R. M. Smullyan and R. Smullyan. Gödel’s incompleteness theorems. Oxford University Press on Demand, 1992.

[182] K. Z. Snow, F. Monrose, L. Davi, A. Dmitrienko, C. Liebchen, and A.-R. Sadeghi. Just-in- time code reuse: On the effectiveness of fine-grained address space layout randomization. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 574–588. IEEE, 2013.

[183] V. C. Sreedhar, G. R. Gao, and Y.-f. Lee. Incremental computation of dominator trees. In ACM SIGPLAN Notices, volume 30, pages 1–12. ACM, 1995.

[184] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshitaishvili, C. Kruegel, and G. Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2016.

[185] R. Strackx and F. Piessens. Fides: Selectively hardening software application components against kernel-level or process-level malware. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 2–13. ACM, 2012.

[186] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas. Efficient memory integrity verification and encryption for secure processors. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 339. IEEE Computer Society, 2003.

[187] Y. Sui and J. Xue. SVF: Interprocedural static value-flow analysis in LLVM. In Proceedings of the 25th International Conference on Compiler Construction, pages 265–266. ACM, 2016.

[188] M. Sun, T. Wei, and J. C. Lui. TaintART: A Practical Multi-level Information-Flow Tracking System for Android RunTime. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS), pages 331–342, 2016.

[189] A. Tang, S. Sethumadhavan, and S. Stolfo. Heisenbyte: Thwarting memory disclosure attacks using destructive code reads. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 256–267. ACM, 2015.

[190] I. L. Taylor. Introduction to gold. http://www.airs.com/blog/archives/38, 2007.

[191] PaX Team. PaX address space layout randomization (ASLR), 2003.

[192] PaX Team. grsecurity: RAP is here, 2016.

[193] C. Tice, T. Roeder, P. Collingbourne, S. Checkoway, Ú. Erlingsson, L. Lozano, and G. Pike. Enforcing forward-edge control-flow integrity in gcc & llvm. In 23rd USENIX Security Symposium (USENIX Security 14), pages 941–955, 2014.

[194] Using the GNU Compiler Collection (GCC). Common Function Attributes. https: //gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html, 2017.

[195] A. van de Ven and I. Molnar. Exec Shield, 2004. Retrieved March 1, 2017.

[196] G. Vasiliadis, E. Athanasopoulos, M. Polychronakis, and S. Ioannidis. Pixelvault: Using gpus for securing cryptographic operations. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1131–1142. ACM, 2014.

[197] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Efficient software-based fault isola- tion. In ACM SIGOPS Operating Systems Review, volume 27, pages 203–216. ACM, 1994.

[198] R. Wang, Y. Shoshitaishvili, A. Bianchi, A. Machiry, J. Grosen, P. Grosen, C. Kruegel, and G. Vigna. Ramblr: Making Reassembly Great Again. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2017.

[199] R. Wartell, V. Mohan, K. W. Hamlen, and Z. Lin. Binary stirring: Self-randomizing instruc- tion addresses of legacy x86 binary code. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 157–168. ACM, 2012.

[200] R. Wartell, Y. Zhou, K. W. Hamlen, M. Kantarcioglu, and B. Thuraisingham. Differentiating code from data in x86 binaries. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 522–536. Springer, 2011.

[201] R. N. Watson, J. Anderson, B. Laurie, and K. Kennaway. Capsicum: Practical capabilities for unix. In USENIX Security Symposium, pages 29–46, 2010.

[202] J. Werner, G. Baltas, R. Dallara, N. Otterness, K. Z. Snow, F. Monrose, and M. Polychronakis. No-execute-after-read: Preventing code disclosure in commodity software. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 35–46. ACM, 2016.

[203] D. Williams-King, G. Gobieski, K. Williams-King, J. P. Blake, X. Yuan, P. Colp, M. Zheng, V. P. Kemerlis, J. Yang, and W. Aiello. Shuffler: Fast and deployable continuous code re-randomization. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 2016.

[204] R. Wojtczuk. The advanced return-into-lib (c) exploits: Pax case study. Phrack Magazine, Volume 0x0b, Issue 0x3a, Phile# 0x04 of 0x0e, 2001.

[205] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Comput., 8(7):1341–1390, Oct. 1996.

[206] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy, S. Okasaka, N. Narula, and N. Fullagar. Native client: A sandbox for portable, untrusted x86 native code. In Security and Privacy, 2009 30th IEEE Symposium on, pages 79–93. IEEE, 2009.

[207] W. You, X. Wang, S. Ma, J. Huang, X. Zhang, X. Wang, and B. Liang. ProFuzzer: On-the-fly input type probing for better zero-day vulnerability discovery. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019.

[208] I. Yun, S. Lee, M. Xu, Y. Jang, and T. Kim. QSYM: A practical concolic execution engine tailored for hybrid fuzzing. In Proceedings of the 27th USENIX Conference on Security Symposium, pages 745–761. USENIX Association, 2018.

[209] C. Zhang, T. Wei, Z. Chen, L. Duan, L. Szekeres, S. McCamant, D. Song, and W. Zou. Practical Control Flow Integrity and Randomization for Binary Executables. In Proceedings of the 2013 IEEE Symposium on Security and Privacy, SP ’13, pages 559–573, Washington, DC, USA, 2013. IEEE Computer Society.

[210] M. Zhang and R. Sekar. Control flow integrity for cots binaries. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), pages 337–352, 2013.

[211] L. Zhao, Y. Duan, H. Yin, and J. Xuan. Send hardest problems my way: Probabilistic path prioritization for hybrid fuzzing. In Network and Distributed Systems Security (NDSS) Symposium, February 2019.

[212] L. Zhao, Y. Duan, H. Yin, and J. Xuan. Send hardest problems my way: Probabilistic path prioritization for hybrid fuzzing. In NDSS, 2019.

[213] Y. Zhou, X. Wang, Y. Chen, and Z. Wang. ARMlock: Hardware-based fault isolation for ARM. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 558–569. ACM, 2014.
