Automatic Synthesis of Instruction Set Semantics and its Applications

A Dissertation presented

by

Niranjan Sudhir Hasabnis

to

The Graduate School

in Partial Fulfillment of the

Requirements

for the Degree of

Doctor of Philosophy

in

Computer Science

Stony Brook University

August 2015

Copyright by
Niranjan Sudhir Hasabnis
2015

Stony Brook University

The Graduate School

Niranjan Sudhir Hasabnis

We, the dissertation committee for the above candidate for the

Doctor of Philosophy degree, hereby recommend

acceptance of this dissertation

Dr. R. Sekar - Dissertation Advisor
Professor, Computer Science

Dr. Mike Ferdman - Chairperson of Defense
Assistant Professor, Computer Science

Dr. Michalis Polychronakis
Assistant Professor, Computer Science

Dr. Suresh Srinivas
Principal Engineer, Intel Corporation

This dissertation is accepted by the Graduate School

Charles Taber
Dean of the Graduate School

Abstract of the Dissertation

Automatic Synthesis of Instruction Set Semantics and its Applications

by

Niranjan Sudhir Hasabnis

Doctor of Philosophy

in

Computer Science

Stony Brook University

2015

Binary analysis, translation, and instrumentation tools play an important role in software security. To support binaries for different processors, it is necessary to incorporate the semantics of every processor's instruction set into the tool. Unfortunately, the complexity of modern instruction sets makes the common approach of manual semantics modeling cumbersome and error-prone. Furthermore, it limits the number of processors, as well as the fraction of each instruction set, that is supported. In this dissertation, we propose novel architecture-neutral techniques for automatically synthesizing the semantics of instruction sets. Our approach relies on the observation that modern compilers such as GCC and LLVM already contain detailed knowledge about the semantics of many instruction sets. We therefore develop two techniques for extracting this knowledge. Our first technique relies on a learning process: observing examples of translation between a compiler's architecture-neutral internal representation and machine instructions, and inferring the mapping from these examples. We then develop a second (and complementary) method that uses symbolic execution techniques to extract this mapping from the code generator source. Unlike previous symbolic execution systems that specialize in generating a single solution to a set of constraints, our problem requires a compact representation of all possible solutions. We describe the development of such a system, based on source-to-source transformation of C code and a runtime system that is implemented in C and Prolog with a finite-domain constraint solver (CLP-FD). To demonstrate the applicability of synthesized instruction-set semantics, we develop two applications. In the first application, we use synthesized semantics to test the correctness of code generators. Specifically, we develop a new testing approach that generates and executes test cases based on the derived semantic model for each instruction. We uncovered nontrivial bugs in the GCC code generator using this technique. As a second application, we have used these models to lift binaries for the x86, ARM, and AVR (used in Arduino and other microcontrollers) architectures to intermediate code, which can then be analyzed or instrumented in an architecture-independent manner.

Dedicated to,

Aai, Baba, and Nilesh (my parents and my brother)
Without their support I would not have started this journey.

&

Anuja (my wife and love of my life)
Without her support I would not have been able to complete it.

Table of Contents

1 INTRODUCTION
  1.1 Overview of Approaches and Dissertation Organization
  1.2 High-Level Overview of Contributions

2 OVERVIEW OF APPROACHES FOR AUTOMATIC SYNTHESIS OF INSTRUCTION SET SEMANTICS
  2.1 Level of Abstractions Required In the Extracted Semantics
  2.2 Possible Approaches
    2.2.1 Extraction from instruction-set manuals
    2.2.2 Extraction from the hardware
    2.2.3 Extraction by exhaustive testing of CPU
    2.2.4 Extraction by symbolic execution of CPU specification
  2.3 Our Approach
    2.3.1 Background on code generation in modern compilers
    2.3.2 Possible challenges for our approaches
    2.3.3 Approach details
  2.4 Summary

3 LISC: LEARNING INSTRUCTION SEMANTICS FROM CODE GENERATORS
  3.1 LISC Overall Approach
  3.2 Extracting IR and Assembly Pairs
  3.3 Parameterization
  3.4 Transducer Construction
    3.4.1 Background
    3.4.2 Maximal common prefix (mcp)
    3.4.3 Residue
    3.4.4 Discriminating tests
    3.4.5 Algorithm for constructing automaton
    3.4.6 Error detection
  3.5 Implementation
  3.6 Evaluation
    3.6.1 Completeness of the model
    3.6.2 Soundness
    3.6.3 Compiler independence
    3.6.4 Sizes of the models and performance of LISC
  3.7 Related Work
    3.7.1 Relation to machine learning
    3.7.2 Decision tree learning
    3.7.3 Finite state transducers and grammatical inference
    3.7.4 Learning using assemblers and compilers
  3.8 Summary

4 EISSEC: EXTRACTING INSTRUCTION SEMANTICS BY SYMBOLIC EXECUTION OF CODE GENERATORS
  4.1 Background: Symbolic Execution
    4.1.1 A simple code generator and its symbolic execution
    4.1.2 Concolic execution of the simple code generator
  4.2 Design
    4.2.1 Input program
    4.2.2 Overall flow
    4.2.3 Source-to-source transformation for concolic execution
    4.2.4 C language features and challenges for EISSEC
  4.3 Implementation
    4.3.1 Source-to-source transformation
    4.3.2 Dynamic single assignment
    4.3.3 Undo records
    4.3.4 Constraint solver
  4.4 Optimizations
    4.4.1 Using range and set constraints
    4.4.2 …
    4.4.3 Exploiting hardware-level parallelism
  4.5 Evaluation
    4.5.1 Model extraction performance
    4.5.2 Soundness, completeness, and model size
    4.5.3 Model extraction for AVR
  4.6 Related Work
    4.6.1 Path explosion problem
    4.6.2 Constraint solver inefficiencies
    4.6.3 Symbolic execution for function extraction
  4.7 Summary

5 ArCheck: CHECKING CORRECTNESS OF EXTRACTED SEMANTICS MODELS
  5.1 Models and Inconsistencies
    5.1.1 Types of inconsistencies in semantic models
    5.1.2 How to detect inconsistencies in semantic models
    5.1.3 Detecting inconsistencies in compilers' semantic models
  5.2 Our Approach
    5.2.1 Start state generation
    5.2.2 Obtaining "interesting" test inputs: constraint generation, propagation and solving
    5.2.3 Test execution and result comparison
  5.3 Implementation
    5.3.1 Obtaining concrete rules from abstract mapping rules
    5.3.2 Obtaining start states for a mapping rule
    5.3.3 Obtaining RTL semantics
    5.3.4 Obtaining assembly semantics
    5.3.5 Handling memory
    5.3.6 Test execution and result comparison
  5.4 Evaluation
    5.4.1 Detecting known soundness issues
  5.5 Related Work
  5.6 Summary

6 BUILDING ASSEMBLY-TO-IR TRANSLATORS AUTOMATICALLY
  6.1 Our Approach to Assembly to IR Translation
    6.1.1 Challenge
    6.1.2 Algorithm
    6.1.3 Other challenges for Assembly to IR translation
  6.2 Implementation
  6.3 Evaluation
  6.4 Related Work
  6.5 Summary

7 CONCLUSION AND FUTURE WORK
  7.1 Future Work

REFERENCES

List of Figures

1  Semantics of x86 add instruction taken from Intel manual
2  Key steps in GCC's translation of a source program
3  An MD entry for x86 div instruction
4  Sample compilation log produced by GCC's x86 code generator during compilation of fibonacci's C code
5  Concrete pair for x86 add assembly instruction and its parameterized version
6  LISC block diagram
7  Grammar for concrete and parameterized pairs
8  Examples of concrete and parameterized pairs for a subset of x86
9  Pseudo code for identifying mapping functions between values of input and output parameters
10 A tree transducer for the example
11 Lattice over a set of parameterized x86 assembly instructions
12 An algorithm for constructing automaton
13 Example of transducer construction by LISC
14 Breakdown of LISC implementation effort
15 Detailed completeness evaluation results for x86
16 Detailed completeness evaluation results for ARM
17 Completeness results of LLVM compiled binaries for x86
18 Extracted model details
19 Decision tree for decision to play tennis based on weather data
20 Instruction set and mapping table for a simple code generator
21 OCaml code for a simple code generator
22 Symbolic execution tree of the simple code generator
23 Modified OCaml code for the simple code generator
24 Concolic execution tree of the simple code generator
25 Transformation logic for V_O = S(V_I1, ..., V_In)
26 Transformation function for different program statements
27 Source-to-source transformation of the simple code generator
28 C source code with use of rtx pointers and constraints generated for it
29 Sample C source code for concolic execution
30 Sample C code and its decision tree
31 EISSEC performance (without optimizations) to extract complete x86 model
32 Relation between number of paths, coverage and virtual memory
33 EISSEC performance with optimizations to extract complete x86 model
34 EISSEC performance (with optimizations) to extract a complete AVR model
35 Design of ArCheck
36 Constraint generation and propagation for start state generation
37 An algorithm for obtaining "interesting" inputs from an IR instruction of a mapping rule
38 Analysis of mapping rules obtained from GCC's x86 code generator logs and test cases generated for them
39 Statistics of evaluating mapping rules obtained from GCC's x86 code generator
40 Example of movzwl instruction and its RTL
41 Example of shrdl instruction and its RTL
42 Example of mull instruction and its RTL
43 Process of architecture-independent binary analysis, binary translation, and binary instrumentation using IR
44 Tree of various possible partitioning schemes of sequence {a1, a2, a3, a4}
45 Algorithm for assembly-to-IR translation
46 Decision tree for assembly to IR translation
47 Details of Assembly to IR translator performance

Acknowledgements

I would like to take this opportunity to express my deepest appreciation and sincere gratitude to my advisor, Dr. R. Sekar, for his guidance, motivation, and support during my dissertation research. I would also like to sincerely thank him for providing financial support during my research and for his help with a number of personal issues. His suggestions, comments, and insightful discussions have not only taught me how to be a good researcher, but have also enriched me for the rest of my life.

I would also like to thank my committee members, Dr. Mike Ferdman, Dr. Michalis Polychronakis, and Dr. Suresh Srinivas, for dedicating their precious time to be a part of my committee. I am grateful for their constant support and constructive advice during my research. I would also like to thank all the faculty and staff in the Department of Computer Science here at Stony Brook for their help during my research. I would especially like to thank Dr. Scott Stoller and Dr. Robert Johnson for giving me an opportunity to work with them during my Masters. Discussions with them were always very encouraging and enjoyable, and they helped me learn a lot. I would also like to thank Brian Tria for our conversations; talking to him was always fun and much needed after intense work.

I would also like to thank all the people in the Secure Systems Lab with whom I got a chance to work. Special thanks go to Alireza, Tung, Riccardo, Peter, Mingwei, Rui, Laszlo, Ashish, and Jeet. The time that we spent together solving problems, discussing research papers, or having fun will always be remembered. They all made my stay in Stony Brook very pleasurable. I would sincerely like to thank Tung and my close friend, Saurabh, who helped me a great deal with a number of personal problems.

Last but not least, I am so grateful to my parents, my brother Nilesh, and my wife Anuja for their love and support throughout my entire life. I am especially grateful to my father, without whose encouragement I would not have started my PhD studies. I am also eternally grateful to my wife, who sacrificed many things for my studies. Without her love and support, I would not have been able to complete my dissertation. Lastly, I would also like to thank my in-laws for encouraging and supporting me during my studies. Much of this work reflects the values that they all have instilled in me. I am forever indebted to all of them, and I dedicate my dissertation to all of them.


1 INTRODUCTION

Binary analysis, translation, and instrumentation techniques have been instrumental in solving a number of important problems in software engineering. Binary translation is the fundamental technique behind the dynamic malware analysis tools that the anti-malware industry uses daily. Dynamic malware analysis requires executing malware in a controlled environment, so that the effects of the execution are confined and do not spread into the underlying system. System emulators (such as QEMU [15]) and virtual machines (such as VMware [93]), which provide such a controlled environment, use binary translation as the underlying technology. In the past, binary translators have also been used to solve the problem of cross-platform portability and rapid deployment of software [22]. Binary instrumentation has been a popular technique with a number of important applications such as program debugging and program analysis. Dynamic binary instrumentation systems such as Pin [55], Valgrind [67], and DynamoRIO [17] are popular among programmers for debugging and analyzing their programs.

A number of popular software systems in use today need to deal with architecture-level (also called processor-level or low-level) details routinely. Technically, we can say that these systems need to model the semantics of the target architectures that they want to support. Compilers are a good and well-known example of such systems: many modern compilers support a number of target architectures, and their backends contain architecture-specific components such as code generators. Binary analysis, translation, and instrumentation systems are another example of such software systems.

Modeling low-level architecture details in software is a very tedious task, and it is made more complicated by the fact that modern instruction sets are complex. For instance, Intel's instruction set manual (v52) [47] consists of around 2000 pages split across 3 volumes. To add to the complexity, new specialized instruction set extensions are proposed each year. To complicate the problem further, the systems described earlier usually rely on manual modeling of architecture semantics. Specifically, these systems often achieve architecture neutrality, one of their desired properties, by translating low-level machine instructions from input binaries into higher-level intermediate representation (IR) instructions. Working at the IR level makes analysis and instrumentation architecture-neutral and hence applicable to all supported architectures. Unfortunately, most of these systems build their assembly-to-IR translators manually. Manual development of assembly-to-IR translators leads to a number of limitations. First, considerable effort is needed to port these systems to new architectures, and as a result, most of them support a limited number of architectures. For instance, QEMU, a popular system emulator, supports 17 target architectures. Second, most of these systems do not support many instructions from recently added instruction set extensions, even for popular architectures such as x86. For instance, Valgrind (v3.10.0, as of 2014), one of the most popular dynamic binary instrumentation tools, still lacks support for several classes of x86 instructions, such as AVX, FMA4, and SSE4.1, even after being in development for 10 years.

1.1 Overview of Approaches and Dissertation Organization

In this dissertation, we address the problem of modeling the semantics of modern and complex architectures by developing two novel architecture-neutral approaches for automatic semantics extraction. Our approaches start with the goal of eliminating manual modeling effort completely; where this objective cannot be achieved, they instead answer the question of how much manual effort can be saved.

Automatic extraction of instruction set semantics is a very challenging problem, given that the common approach for extraction is to rely on actual hardware. The tendency to rely on hardware stems from (a) the desire to make the models correct and accurate, and (b) the fact that hardware is the best source of accurate and correct semantic information. (Though hardware manuals may also serve as a good source of correct and accurate information, it is generally easier to automate the extraction process with hardware than with the manuals.) Unfortunately, synthesizing instruction semantics using hardware demands execution of every machine instruction under every possible machine state, which requires one to explore the huge input space of every target instruction. This challenge has prevented the development of systems for automatic semantics extraction for complex architectures such as the x86. Nonetheless, research efforts have attempted to address this problem by using manual effort to make traversal of the input space practically feasible. We discuss these research efforts in Section 2.

Given that automatic extraction using hardware is infeasible without manual effort, in this dissertation we do not use hardware. Instead, we hypothesize that we can use modern compilers for such extraction. Modern compilers such as GCC [3] and LLVM [8] already contain detailed knowledge about the semantics of many architectures. Specifically, the architecture specifications used by the code generators of these compilers contain a mapping between target assembly instructions and their semantics in IR. These mapping tables are used by the code generators to translate IR instructions into target assembly instructions. We hypothesize that the knowledge encoded in compilers can be extracted automatically, and we develop a white-box and a black-box approach for the extraction. We chose compilers for such extraction because they support a number of architectures, and their extensive testing minimizes the chances of semantic modeling errors in them. For instance, GCC and LLVM both support more than 40 target architectures, considerably more than those supported by QEMU. Moreover, being actively developed, these compilers are tested by large test suites.

One approach to extracting instruction-set semantics from code generators is to use their architecture specifications directly. As we will see in Section 2, the way these specifications are written makes their direct use complicated: the compilers encode a lot of architecture-specific knowledge in their source code in addition to their specifications. Since direct use of the specifications is complicated, a feasible alternative is to use the code generators in their common usage scenario of compiling source programs. Our black-box approach, called LISC (Section 3), relies on the observation that, while compiling source programs, modern compilers can output the IR-to-assembly mapping rules used by their code generators. The IR instructions in these rules specify the semantics of the corresponding assembly instructions. LISC learns the instruction-set semantic model encoded in a code generator by referring to its compilation logs.

LISC faces two interesting challenges in using compilation logs for model extraction. First, modern compilers may not support all of the target assembly instructions. For example, privileged instructions are generally not supported by compilers, and they may not be the only unsupported instructions. Because of this, models extracted from compilers will be incomplete. Second, even if compilers supported all of the target assembly instructions, the extraction process would need to ensure that all of the code generator's mapping pairs are covered. It is likely that, in a typical compilation, a code generator does not exercise all of the target assembly instructions. For instance, GCC does not use instructions from most of the advanced x86 instruction set extensions unless a programmer directs it otherwise. Both of these coverage challenges need to be addressed, because the extracted semantic models should contain all of the target assembly instructions. Unsupported instructions may be covered by using other compilers for extraction, but achieving complete coverage of complex software such as code generators has been one of the challenging problems in software testing. We address these challenges in Section 3.

As an alternative to LISC, we develop a white-box approach, called EISSEC, to extract the semantic model through symbolic execution of the code generators. Symbolic execution is a technique from the software testing domain for improving program coverage. Unfortunately, it suffers from the path "explosion" problem, where the exponentially increasing number of program paths makes symbolic execution of complex programs infeasible. Recent research [19, 24, 21] addresses this challenge, but we have not come across cases of symbolic execution being applied to source programs as complex as code generators. In EISSEC, we assess how symbolic execution can be scaled to code generators. We believe that the solutions we propose in EISSEC make important contributions towards improving the applicability of symbolic execution to complex programs. We describe EISSEC in Section 4.

Correctness and completeness are the two desired properties of the models extracted using LISC and EISSEC. Correctness matters because the applications built using these models produce correct results only if the models themselves are correct. Completeness, on the other hand, is a nice-to-have but not a necessary property, because its absence merely results in missing assembly instructions in the models. We develop two approaches to ensure correctness and completeness of the models. First, we empirically evaluate these properties in Section 3. Second, we develop an approach, called ArCheck, to check the correctness of extracted models. ArCheck formalizes the correctness checking problem as checking the semantic equivalence of every assembly instruction and its corresponding IR instruction in the extracted models. Since the semantics of an assembly instruction can only be obtained by concretely evaluating that instruction (unless we want to model the semantics manually, which we want to avoid), approaches such as symbolic equivalence checking (a fundamental approach used in software verification) cannot be used for our purpose. Our semantic equivalence test thus compares the IR and assembly semantics obtained by concrete evaluation. For this concrete evaluation, instead of employing random testing, ArCheck uses a systematic coverage testing strategy to produce test cases and maximize confidence in the correctness of the results. The solutions proposed by ArCheck make two important contributions to the compiler testing problem. First, ArCheck addresses the problem of checking the correctness of code generators (for which, to the best of our knowledge, no verification work exists). Second, it provides a practical approach that can be applied to the code generator of any compiler. We describe ArCheck in Section 5.

Once the extracted semantic models are checked for correctness, we demonstrate their applicability by developing assembly-to-IR translators. By building these translators automatically, we demonstrate that such systems can quickly support new architectures, as well as new instructions from already supported architectures.

This dissertation is organized as follows. Section 2 discusses various approaches for automatic synthesis of instruction-set semantics and compares them with our proposed approach. Section 2.3.1 provides background on modern compilers relevant to this dissertation. Section 3 and Section 4 discuss our approaches for automatic instruction-set semantics extraction. Section 5 discusses our approach to check the correctness of extracted semantic models and code generators. Section 6 discusses an application of the extracted models in building assembly-to-IR translators. Lastly, Section 7 concludes this dissertation and discusses future extensions of this work.

1.2 High-Level Overview of Contributions

One of the important contributions of this dissertation is to approach the problem of automatic instruction-set semantics extraction by developing two novel architecture-neutral approaches. The solutions proposed in this dissertation have a number of important applications, such as building binary analysis, instrumentation, and translation systems, code generators, and CPU emulators. Furthermore, since these systems are expected to model low-level semantics faithfully, the extracted semantic models can be used to test these systems and check whether this expectation holds.

In addition to developing approaches for automatic extraction of semantics, in this dissertation we demonstrate two applications of the extracted models. In the first application, we use the models to check the correctness of compiler code generators. Although compilers are tested extensively, they are known to contain bugs; we observed that the code generators of modern compilers contained semantic modeling bugs. Given the critical role played by compilers, it is necessary that such modeling bugs are caught during development. Our solution thus contributes to the important problem of testing compilers.

In the second application, we use the extracted models to automatically build the assembly-to-IR translators used by architecture-neutral binary analysis, translation, and instrumentation systems. A common approach to building assembly-to-IR translators is to build architecture-neutral translators and drive them using manually-written target descriptions.

For such systems, automatically extracted semantic models can replace manually-written architecture specifications. Our solution helps improve the effectiveness of these systems by allowing them to target more architectures and to support existing architectures completely.

Although we described the problem of manually developing assembly-to-IR translators using examples of existing systems such as Valgrind and QEMU, this dissertation does not consider the problem of building a complete end-to-end binary translation, instrumentation, and analysis system. It instead demonstrates the applicability of the extracted models in building assembly-to-IR translators (which are one of the building blocks of these systems). Moreover, it is not within the scope of this dissertation to address the limitations of a specific system (e.g., Valgrind). By translating the compiler's IR to a specific tool's IR, one may use the assembly-to-IR translators built using our models to support that system.

2 OVERVIEW OF APPROACHES FOR AUTOMATIC SYNTHESIS OF INSTRUCTION SET SEMANTICS

In this section, we will discuss various approaches to extract instruction set semantics models automatically, and then discuss the limitations of these approaches before moving on to our own. But before we discuss these approaches, we need to specify precisely the level of detail that we want in the extracted semantics.

2.1 Level of Abstractions Required In the Extracted Semantics

The semantics of an instruction can be specified and modeled at different levels of abstraction. Overly detailed semantics may be undesirable for applications that actually expect somewhat abstracted semantics. Thus, it is important to establish the level of detail to which we want the extracted semantics to conform. The desired level of detail is dictated by the applications which will consume the semantic models. Thus, to establish the level of abstraction, we need to understand the kinds of applications that will be consumers of our extracted models.

In this dissertation, we target applications which work at the level of binaries but which want to target multiple architectures without much porting effort. An example of such an application is a binary analysis framework. Typically, such a framework operates on binaries, but the kinds of analyses performed are mostly architecture-neutral, so such frameworks operate on an architecture-neutral representation of the input binary. Such a representation essentially makes the framework applicable to binaries from any architecture. Our extracted model would be used by such a framework to translate an input binary into an architecture-neutral representation. Let us now see what level of detail these applications expect from the architecture-neutral representation.

The semantics of an assembly instruction can be represented at various levels of detail. For instance, the semantics of addl %eax, %ebx could be described in the following ways:

D1: add the value of eax to ebx and store the result in ebx; or

D2: one could simply enumerate the possible values of eax and ebx, and specify the outcome of executing the instruction for each (treating instruction semantics as a function from possible input values to output values); or

D3: add the value of the first input operand to the value of the second input operand, and store the result in the output operand.

In this dissertation, we are interested in representing the semantics at the level of detail captured by the first option. In fact, that is the level of detail that applications like binary analysis are interested in. The semantics specified by the second option are too detailed, and a binary analysis system would not really process that level of detail. (This is because systems which perform binary analysis operate on IR instructions, which contain operators like plus to model the semantics of addition; they do not need to know what addition itself means.) The third option, on the other hand, captures the semantics at a very abstract level, and thus is not desirable for binary analysis systems either. The various approaches for semantics extraction that we describe next extract the semantics at different levels. In addition to this mismatch, they have some limitations which make them unsuitable for our purposes.

Notations and terms. Before we describe our approaches, we define the notations and terms used in the rest of the dissertation. We use the notation A to represent an assembly instruction. An assembly instruction is abstract when its operands are abstract. For instance, "add %1, %2" is an abstract assembly instruction for the x86 add instruction. A concrete assembly instruction, on the other hand, has concrete operands. For instance, "add %eax, %ebx" is a concrete x86 add instruction. We use A_c to represent a concrete assembly instruction and A_a to represent an abstract assembly instruction; A by itself represents either. We use the notation σ to represent the semantics of an instruction. The semantics of a concrete assembly instruction can be defined concretely in terms of the processor state changes caused by its execution. The semantics of an abstract assembly instruction, on the other hand, cannot be defined concretely. For instance, the semantics of the x86 add instruction as given in the Intel manual is shown in Figure 1. We use the notation R to represent a mapping between an assembly instruction A and its semantics σ, written R = ⟨A, σ⟩. We use the notation M to represent a semantic model. A semantic model is a collection of mapping rules between all the assembly instructions of some target architecture and their semantics. Thus, a semantic model can be formally defined as M = {⟨A_i, σ_i⟩ | i ∈ T}, where T is a target architecture.
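To make this notation concrete, here is a minimal OCaml sketch of these objects (illustrative only: the type and constructor names are ours and do not come from any tool described in this dissertation):

    (* An operand is concrete (a named register or an immediate) or
       abstract (a numbered placeholder such as %1). *)
    type operand =
      | Reg of string    (* e.g., Reg "eax" *)
      | Imm of int
      | Param of int     (* e.g., Param 1 stands for %1 *)

    (* An assembly instruction A: concrete (A_c) when no operand is a
       placeholder, abstract (A_a) otherwise. *)
    type asm = { opcode : string; operands : operand list }

    type semantics = IR of string      (* sigma, kept opaque here *)

    type rule = asm * semantics        (* R = <A, sigma> *)
    type model = rule list             (* M = the rules for a target T *)

    let is_concrete (a : asm) =
      List.for_all (function Param _ -> false | _ -> true) a.operands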


8 Description: Adds the destination operand (first operand) and the source operand (second operand) and then stores the result in the destination operand. The destination operand can be a register or a memory location; the source operand can be an immediate, a register, or a memory location. (However, two memory operands cannot be used in one instruction.) When an immediate value is used as an operand, it is sign-extended to the length of the destination operand format.

The ADD instruction performs integer addition. It evaluates the result for both signed and unsigned integer operands and sets the OF and CF flags to indicate a carry (overflow) in the signed or unsigned result, respectively. The SF flag indicates the sign of the signed result.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. In 64-bit mode, the instruction's default operation size is 32 bits. Using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See the summary chart at the beginning of this section for encoding data and limits.

Operation: DEST ← DEST + SRC;

Flags Affected: The OF, SF, ZF, AF, CF, and PF flags are set according to the result.

Figure 1: Semantics of x86 add instruction taken from Intel manual

2.2 Possible Approaches

The high-level idea behind these approaches is to rely on authentic sources of instruction semantics. There exist two sources of such information: the manuals supplied by the processor vendors (e.g., Intel's instruction set manual [5]), and the CPU itself. These approaches rely on one or the other, with a tendency to prefer the CPU over the manuals. Recently, there have also been approaches which rely on a "software" form of a CPU, by which we mean software that mimics the functionality of an actual CPU (e.g., a CPU emulator). A "software" CPU is used because an actual CPU imposes some restrictions on its usage. We will now look at what these restrictions are, along with the other approaches.

9 2.2.1 Extraction from instruction-set manuals

Vendor-supplied instruction-set manuals are one of the authentic sources of information about instruction semantics; in fact, supplying such information is one of the main purposes of such manuals. So it is perfectly logical to think of the manuals as a first stop for instruction semantics. We saw an example of how these manuals describe instruction semantics in the entry for the x86 addl instruction from the Intel manual. Most manuals contain pseudo code for instruction semantics in addition to an English description. Unfortunately, we believe that extracting semantics from such pseudo code is not a trivial task. A more important issue is that such an approach would not be architecture-neutral, as it would have to be tailor-made for each and every architecture. Given these limitations, we decided not to pursue this approach at all.

2.2.2 Extraction from the hardware

Given that a semantics extraction approach based on one CPU can be systematically applied to any other CPU, this approach is generally preferred over the use of instruction-set manuals. Such approaches rely on the observation that the CPU can be treated as a hardware oracle for instruction semantics. Intuitively, to obtain the semantics of an instruction from the CPU, one can simply execute that instruction on the CPU and compare the pre- and post-states of the CPU to learn its effect. Although this approach is simple and intuitive, it is unfortunately infeasible. Specifically, in order to learn the semantics of an instruction completely, one needs to execute that instruction with all possible input values (and additionally under all possible operand combinations) and under all possible pre-states of the CPU. Moreover, this option captures the semantics at level D2, which makes it unsuitable for the applications that we are interested in.

To give an idea of the infeasibility, let us consider the example of the x86 add instruction. It takes two operands, one of which is a register or memory location, and the other either a register, a memory location, or an immediate. The total number of possible operand combinations and their values for add is (8 + 2^32) × (6 + 2^32 + 2^32): 8 for the number of general-purpose registers on x86, 2^32 for the number of memory locations that can appear in the instruction (considering a 4GB virtual address space), and 2^32 for the number of 32-bit immediates that can appear. The idea here is that if we execute the add instruction on an x86 processor with all of these operand combinations in all possible states of the processor, then we can essentially learn its complete semantics.
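As a quick sanity check on the magnitude of this count, the following OCaml fragment evaluates the product above in floating point (the counts, including the 6 in the second factor, are taken as given from the text):

    let () =
      let reg_dst = 8.0          (* general-purpose registers *)
      and reg_src = 6.0          (* register choices in the second factor, as above *)
      and mem = 2.0 ** 32.0      (* memory locations in a 4GB address space *)
      and imm = 2.0 ** 32.0 in   (* 32-bit immediates *)
      Printf.printf "about %.2e operand combinations\n"
        ((reg_dst +. mem) *. (reg_src +. mem +. imm))

This prints roughly 3.69e19 combinations, i.e., tens of billions of billions of executions for a single instruction.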

It is easy to see that the number of operand combinations needed for the exhaustive enumeration approach is far too large to be feasible in practice.

2.2.3 Extraction by exhaustive testing of CPU

The drawback of exhaustive enumeration is that it needs to cover the huge space of possible inputs of an instruction in order to obtain its semantics. A natural question to ask, then, is: can we somehow reduce this space? Some research efforts, such as [41], try to answer this question by relying on manual effort. Specifically, they manually define instruction semantics templates in such a way that they are (1) generic enough that a few such templates suffice for the whole instruction set, but (2) at the same time specific enough that they reduce the input space considerably. The work described in [41] uses 6 such templates to model the semantics of 580 x86 instructions (8-bit, 16-bit, and 32-bit). In this work, the semantics of an instruction is thought of as a function which maps the inputs of the instruction to its outputs. Instead of searching for the function implemented by an instruction in the huge space of possible functions, they restrict their search to the manually-defined function templates. The contribution of this work is an efficient approach which makes function synthesis from I/O samples feasible for 32-bit instructions. In order to discriminate among the multiple candidate functions which may satisfy a given subset of all possible I/O samples of an instruction, they develop a notion of discriminating inputs (called smart inputs).

Although their approach is effective in synthesizing functions for 580 x86 instructions, it suffers from the following limitations. First of all, the level of detail captured in the extracted semantics is still at level D2, which makes this approach unsuitable for our purpose. While their goal is to encode the extracted semantics in symbolic formulas which can be used by symbolic execution systems, the applications that we are interested in do not need that level of detail. Second, we believe that designing the templates and the smart inputs demands considerable understanding of the target assembly instructions. In their work, they had to study the instruction-set manuals and manually develop the templates and the inputs. It is unclear how easy or difficult it would be to come up with such templates and inputs for complex instruction set extensions such as SSE and AVX for x86, and how much time it would take them to support a new architecture. As we will see later in this dissertation, compared to this approach, our approaches require very minimal (if any) architecture knowledge to extract the semantic model. Third, designing templates to model the semantics of 64-bit instructions, floating-point instructions, and advanced floating-point instructions is very complex and challenging (they acknowledge this complexity themselves). The approaches that we propose in this dissertation are not limited by the size of an instruction's input space, and can thus easily scale to complex instruction sets.
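The template-and-smart-inputs idea can be illustrated on a toy scale (this OCaml sketch is our own illustration of the idea, not the actual system of [41]): candidate templates for a 32-bit binary operation are filtered against observed I/O samples, and a discriminating ("smart") input is one on which two surviving candidates disagree, so that executing it on the CPU rules at least one of them out.

    let templates =
      [ ("add", (fun x y -> Int32.add x y));
        ("or",  (fun x y -> Int32.logor x y));
        ("xor", (fun x y -> Int32.logxor x y)) ]

    (* Keep only the templates consistent with the observed I/O samples. *)
    let consistent samples =
      List.filter
        (fun (_, f) -> List.for_all (fun (x, y, out) -> f x y = out) samples)
        templates

    (* A "smart" input: one on which two surviving candidates disagree. *)
    let discriminating (_, f) (_, g) inputs =
      List.find_opt (fun (x, y) -> f x y <> g x y) inputs

    let () =
      (* On the samples (0,0)->0 and (1,0)->1, add, or and xor all agree... *)
      match consistent [ (0l, 0l, 0l); (1l, 0l, 1l) ] with
      | a :: b :: _ ->
          (* ...but (1,1) discriminates: add gives 2, or gives 1, xor gives 0. *)
          (match discriminating a b [ (0l, 1l); (1l, 1l) ] with
           | Some (x, y) -> Printf.printf "smart input: (%ld, %ld)\n" x y
           | None -> print_endline "candidates agree on all tried inputs")
      | _ -> print_endline "at most one candidate survives"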

2.2.4 Extraction by symbolic execution of CPU specification

Symbolic execution [49] is a very powerful technique proposed in the literature to solve exactly the kind of problems for which exhaustive enumeration is practically infeasible. It exploits the fact that many program inputs follow the same program path. Thus, instead of considering each such input separately, one can place them all in one partition of the input space and consider only one representative input for the whole partition. By defining a few partitions that cover the complete input space and considering only one input per partition, symbolic execution can drastically reduce the number of inputs that we need to consider to obtain complete semantics. Using symbolic execution, it is thus possible to avoid exhaustive enumeration and still extract instruction semantics. We will now see its limitations when applied to the semantics extraction problem.

To extract a complete and accurate instruction set semantics model, one could simply perform symbolic execution of the CPU hardware. Theoretically, this idea works fine; practically, it is infeasible for the following reasons. First, to perform symbolic execution of a program, one needs access to its source code or binary; in other words, the program must be available in a software form, and the CPU is hardware. (CPU manufacturers most likely do have software forms of the CPU, used to validate its functionality before it is manufactured, and it is conceivable that one could symbolically execute such a software form. Unfortunately, such software forms are invariably the intellectual property of the respective manufacturers and are not available to the general public.) Second, symbolic execution involves setting the program inputs symbolically rather than concretely, whereas the CPU requires concrete inputs. Symbolic execution of CPU hardware is thus practically infeasible.
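The partitioning idea at the heart of symbolic execution can be seen on a one-line function: for f(x) = if x < 0 then -x else x, there are 2^32 possible 32-bit inputs but only two program paths, so two representative inputs suffice. The following OCaml sketch hand-enumerates the paths of this toy function (a real symbolic executor would derive the path conditions from the code itself):

    type path = { condition : string; representative : int32; result : string }

    (* The two path partitions of:  let f x = if x < 0l then Int32.neg x else x *)
    let paths_of_f =
      [ { condition = "x < 0";  representative = -1l; result = "-x" };
        { condition = "x >= 0"; representative = 0l;  result = "x" } ]

    let () =
      List.iter
        (fun p ->
           Printf.printf "path [%s]: representative x = %ld, output = %s\n"
             p.condition p.representative p.result)
        paths_of_f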

Symbolic execution of a publicly-available high-fidelity CPU emulator. A practically feasible idea, which can achieve the same effect as symbolic execution of the CPU hardware, is to perform symbolic execution of a publicly-available high-fidelity CPU emulator. After all, CPU emulators are expected to emulate the instruction semantics of the underlying CPU faithfully and accurately. PokeEmu [57] proposes this idea and demonstrates how one can use popular emulators such as Bochs [85] for this purpose.


Although such an approach can extract semantics at a level suitable for our purpose, it unfortunately inherits the limitations of the emulators. First and foremost, most emulators support a very limited number of architectures, so an approach based on them would naturally be applicable to only a few architectures. Second, even though high-fidelity emulators exist, they have also been shown to contain a number of bugs [57], so the accuracy of the semantic models extracted from such emulators is questionable.

2.3 Our Approach

In the last section, we saw how one can use instruction-set manuals, a hardware CPU, and a software CPU for semantics extraction, and we discussed the limitations of each in the semantics extraction process. Given that discussion, we will first specify a number of requirements for the source of semantic information that we want to use for extraction; such requirements make a comparative study with our approach easy. Then we will describe our approach. The requirements for the source of semantic information are as follows.

R1: Accurate and faithful representation of CPU semantics. This is the most basic and important requirement on the system which we want to use to extract instruction set semantics.

R2: Completeness: representation of the semantics of all CPU instructions. It is desirable that the system we use for extraction supports all the instructions of the target architecture. The frequent addition of instruction set extensions to modern architectures underlines the importance of this requirement.

R3: Support for diverse architectures. Since our objective is to extract instruction set semantics automatically, we would like to apply the solution to as many architectures as we can. This requirement also ensures that the applications built using the extracted semantics will be applicable to many architectures; for instance, system emulators will be able to support architectures that are not generally supported. This requirement thus translates into portability of the applications built using the extracted instruction set semantics.

Given the requirements above, we found that high-fidelity CPU emulators satisfy most of them except R3. Their lack of support for diverse architectures limits the applicability of semantics extraction to very few architectures, thus limiting the overall effectiveness of applications built using the extracted semantic models. Modern compilers, such as GCC [3] and LLVM [8], on the other hand, satisfy all the requirements, and do so better than the emulators. Specifically, modern compilers offer the following advantages:

• Modern compilers support a number of architectures (requirement R3): many more than those supported by most commonly used emulators.

• They are tested extensively through rigorous test suites, which minimizes the chances of bugs in them (requirement R1).

• These compilers are usually very quick to provide support for newly added instruction set extensions (requirement R2).

• Infrastructure developed by these compilers, such as optimization passes, can potentially be reused in other architecture-independent systems. For instance, a quick way to build a retargetable binary translator using extracted instruction set semantics could be to directly reuse existing compiler infrastructure such as the optimizer, register allocator, and instruction scheduler. As a result, we would not need to build the components of the binary translator from scratch. Such an approach to building a binary translator has a number of advantages:

  – The process of developing an assembly-to-IR translator would be greatly simplified, because the mapping from assembly to IR can be viewed as an inverse of the mapping from IR to assembly.

  – The binary translator is inherently retargetable, as it inherits retargetability from the compiler. When the compiler is retargeted to a new platform, the new backend can be used to facilitate retargeting of the translator.

  – As the binary translator and the compiler share the same IR, we can simply reuse compiler optimizations and other passes (such as instruction selection). Binary instrumentation frameworks, such as Valgrind, build such components manually from scratch.

  – Compiler IRs are well designed. The ability of these compilers to compile source programs written in a number of high-level programming languages and to generate binaries for a number of platforms proves that their IRs are comprehensive enough to represent features from all languages and architectures. So, instead of designing a new IR, the binary translator can simply rely on the compiler IR for its purpose.

Given the advantages offered by modern compilers, we decided to use the GCC compiler for extracting instruction set semantics. Even though we have used GCC for our research, the ideas and techniques discussed later are equally applicable to LLVM. To understand how these compilers offer a better solution, we need to discuss some of their internal details. A reader familiar with such details can skip this section.

2.3.1 Background on code generation in modern compilers

Structure of a Modern Compiler. Compiler researchers have long worked to develop architecture-independent code generators [33, 32]. These code generators start with an intermediate representation (IR) and emit assembly code. Ideally, the code generator is driven by an instruction set specification called a machine description (MD). This MD may take the form of a set of rules, each mapping a snippet of IR to an assembly instruction. Such a rule-based approach can generate inefficient code, since it fails to take into account the context in which translation takes place; e.g., it may generate many redundant loads and stores. These inefficiencies can be mitigated by performing several optimizing transformations on the IR, and by driving instruction selection using cost metrics. Such an approach moves the complexity into IR optimization passes that are shared across different architectures, while simplifying the architecture-specific MDs.

Code generators in contemporary compilers such as GCC and LLVM follow this general outline. Due to GCC's maturity and support for many instruction sets, we chose GCC for our implementation. But an important feature of our approach is that it does not depend on GCC specifics, and hence it can be applied to other retargetable compiler backends.

Figure 2 shows the key steps in the compilation of a high-level language (a C program, in this case) by GCC into assembly. These steps are illustrated using a simple example: an assignment statement in C that eventually gets translated into two assembly instructions. The front-end of the compiler translates the different source languages into a high-level intermediate form called GIMPLE. Several optimizations are performed on GIMPLE, and then it is translated into GCC's IR, which is called RTL (register transfer language). (These details of the front-end are not relevant for the purposes of this dissertation, but are mentioned simply to provide more context.) Then the back-end uses MDs to translate RTL snippets into assembly.

Figure 2: Key steps in GCC's translation of a source program. (The figure shows the pipeline: a .C file goes through the Parser to an AST, through the Gimplifier to GIMPLE, and through the RTL Generator to RTL; the Code Generator (backend), driven by machine descriptions such as the x86 MD and Sparc MD, emits assembly. The running example d = c * (a + b) is lowered to t1 = a + b; d = c * t1, then to the RTL (set (reg 0) (plus (reg 1) (reg 2))) and (set (reg 4) (mult (reg 3) (reg 0))), and finally to the assembly add r1, r2, r0; mult r3, r0, r4.)

As noted earlier, the back-end in GCC incorporates over 40 optimization passes over RTL, and also performs related tasks such as register allocation.

Machine Descriptions. A machine description (MD) specifies all aspects of a target machine architecture, including the number of registers, the endianness, the parameter passing conventions, and a mapping from RTL snippets to equivalent assembly instruction(s). This mapping is the most important part of an MD, and poses most of the interesting research challenges, so we focus on it below.

As mentioned earlier, code generation is a pattern-driven process: it is specified using rules of the form RPat −→ Asm, where RPat represents an RTL pattern and Asm represents equivalent assembly code. Typically, Asm is a single instruction, but there are occasional cases where it may contain several instructions. An RTL pattern includes an RTL snippet, together with conditions on when the rule can be applied. The code generator starts with the RTL code generated by the compiler front-end. MD rules are matched against snippets of this RTL, and any matching snippet is replaced by the corresponding assembly code. This rewriting process continues until no RTL is left. The order in which MD rules are tried may be governed by a cost metric associated with each rule. In addition, the code generator may modify the RTL under translation in order to find a match, e.g., moving a value from one register (or memory) to another register in order to satisfy the conditions specified in the RTL patterns.
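This rewriting process can be sketched in a few lines of OCaml (a deliberately simplified toy, in the spirit of the simple code generator used later in Section 4; the rule set and the IR constructors below are ours, not GCC's):

    (* IR in the style of the RTL snippets of Figure 2. *)
    type ir =
      | Set of string * ir           (* (set dest src) *)
      | Plus of string * string      (* (plus reg reg) *)
      | Mult of string * string      (* (mult reg reg) *)

    (* MD-style rules: an IR pattern on the left, an assembly template on
       the right.  A real MD also attaches predicates, constraints, costs. *)
    let apply_rule = function
      | Set (d, Plus (a, b)) -> Some (Printf.sprintf "add %s, %s, %s" a b d)
      | Set (d, Mult (a, b)) -> Some (Printf.sprintf "mult %s, %s, %s" a b d)
      | _ -> None

    (* Rewrite every IR snippet into assembly; fail if no rule matches. *)
    let codegen irs =
      List.map
        (fun i ->
           match apply_rule i with
           | Some asm -> asm
           | None -> failwith "no matching rule")
        irs

    let () =
      (* d = c * (a + b), lowered as in Figure 2: t1 = a + b; d = c * t1 *)
      codegen [ Set ("r0", Plus ("r1", "r2")); Set ("r4", Mult ("r3", "r0")) ]
      |> List.iter print_endline

Running this prints add r1, r2, r0 and mult r3, r0, r4, matching the output shown in Figure 2.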

(set (match_operand:0 "reg_opnd" "a")
     (div (match_operand:1 "reg_opnd" "0")
          (match_operand:2 "nonimm_opnd" "qm")))
(clobber (reg:FLAGS_REG))
    −→ "div %2"

Figure 3: An MD entry for x86 div instruction

An example MD rule for the x86 instruction set is shown in Figure 3. To reduce clutter, we have abstracted away some details that are not critical for the purpose of explaining our approach. This entry corresponds to the x86 divide instruction. The RTL pattern specifies the semantics of this instruction in RTL, namely, that it sets operand 0 to the result of dividing operand 1 by operand 2. It also states that the flags register is modified. Since the compiler does not require the flag values after a divide operation, it is simpler to state that the flags are modified than to detail the exact change. Nonetheless, when an RTL instruction checks a particular bit of the flag values, its preceding instruction sets that particular bit precisely. This pattern is applicable when the input RTL matches the match_operand conditions, which are broken into a predicate and a constraint. It is not necessary for our discussion to understand the full semantics of predicates and constraints, but we will explain one of them for illustration. The predicate on operand 1 states that the operand should correspond to a register, while the constraint indicates that operand 1 should be the same as operand 0. Moreover, the constraint on operand 0 states that it should be the EAX register. Thus it is quite easy to see that the above example captures the semantics of the x86 div instruction correctly.

Given the above structure of MD rules, one might simply think that the MD files already represent extracted instruction set semantics. Consequently, one could simply refer to the MD files for the extracted semantics and avoid synthesizing the instruction set semantics altogether. For instance, the semantics of the div instruction can be learned from the RTL contained in the rule. Unfortunately, this simplistic approach runs into many difficulties:

• Some aspects of the RTL corresponding to an assembly instruction are defined by virtue of predicates and constraints, e.g., the condition that the destination operand is the EAX register, or that the two operands of the RTL instruction should be identical. While this could potentially be handled by incorporating the semantics of constraints and predicates into a reversing algorithm, such an approach suffers from the drawback that the semantics of many of the predicates and constraints differ across architectures.

• For many rules, the Asm part does not directly specify assembly code, but is instead a snippet of C code that, when executed, generates a string representation of the assembly code. Clearly, it is difficult to statically determine the string(s) that may be generated by a snippet of C code.

To overcome the above problems, we develop approaches that avoid relying on the specifics of MD rules. Such approaches eliminate dependencies on architecture specifics from the implementation. More importantly, such approaches can be general, and applicable to compilers other than GCC that may use more programmatic MD rules. Even though direct use of the MD files is a complex task, the key observation stands: the MD files already contain the instruction set semantics. We combine this observation with a few other important observations about modern compilers to tackle the problem of automatically extracting instruction set semantics.

2.3.2 Possible challenges for our approaches

The way compilers operate leads to some interesting challenges when they are used for model extraction. Specifically, semantic models extracted using code generators may be incomplete and/or incorrect. Incompleteness of the models comes from two different sources.

The first source of incompleteness is the possibility that the compilers may not support all of the target architecture's instructions. Such unsupported instructions would consequently be missing from the extracted models. We expect this to be the case for newly added architecture instructions. Nonetheless, we expect the number of unsupported instructions to be small: actively-developed modern compilers such as GCC and LLVM are generally quick to support newly added instructions (including those from advanced instruction sets).

The second source of incompleteness relates to the common observation that compilers may not model the semantics of every instruction in full detail. Unfortunately, the missing details would then also be lacking in the models extracted by our approaches. For instance, GCC's machine description may not model how the CPU flags are modified as a result of some operation, but only the fact that they are modified: GCC's MD entry for x86's div instruction, discussed in Section 2.3.1, contains (clobber (reg:FLAGS_REG)) instead of detailing which bits of EFLAGS are modified and how. Such loose semantic modeling is sound for compilers, because most compiler static analyses only need to know whether a particular instruction modifies the flags. That is why compilers may capture an over-approximation of instruction semantics. Compilers do not over-approximate all the time, though; whenever required, they capture precise semantics. For instance, in x86, when the div instruction is followed by a conditional jump instruction, the IR for the div instruction will precisely update the bits of EFLAGS that are needed by the jump instruction.

Incompleteness of the extracted models can be addressed in multiple ways. First, we can choose a compiler that uses most (if not all) of the target machine instructions. Moreover, the models extracted by our approaches can be augmented with manual specifications: by cutting down manual specification to a small minority of instructions, we can achieve significant savings over existing approaches. On the other hand, manual specification of the details may not be needed at all, as most applications of the extracted models, such as binary analysis and instrumentation systems, do not need that level of detail. Nonetheless, in those rare cases where additional precision is desired, the models extracted by our approach can be manually augmented with the missing details.

Incorrectness of the models stems from the fact that, even though modern compilers are tested rigorously, given the limitations of software testing they are not completely bug-free. Consequently, the semantics obtained from GCC's code generator may have some correctness issues. We will discuss the different types of possible correctness issues, and how to rectify them, in the appropriate sections of this dissertation. But before getting into the challenges, we describe our overall approach.

2.3.3 Approach details

Our overall approach to synthesize instruction set semantics automatically using GCC consists of two phases.

• Phase 1: Extract the instruction set semantics using GCC's code generator.

For this phase, we have developed two approaches, namely LISC and EISSEC, to extract instruction set semantics from GCC's code generator. Both approaches are discussed in depth in Sections 3 and 4. EISSEC uses symbolic execution of the code generator to extract the semantics model; at a high level, this approach is similar to the symbolic execution of a CPU emulator performed by PokeEmu. LISC, on the other hand, employs a completely different technique: it treats the code generator as a black box and learns the semantics model from it. Since code generators generate semantically-equivalent assembly instruction(s) for the input IR instruction(s), the outcome of this phase is a mapping between semantically equivalent IR-to-assembly pairs for all compiler-supported target assembly instructions. Such a mapping precisely defines the instruction set semantics model followed by the code generator.

• Phase 2: Check correctness of the extracted instruction set semantics. Even though we are able to extract instruction set semantics from code generators automatically, there is one important challenge that we need to tackle before such semantics can be used in some applications. Specifically, to address correctness issues, we have developed an approach named ArCheck (Section 5). Since we assume that the IR-to-assembly instruction pairs present in the model should be semantically equivalent (because compilers are expected to preserve semantic equivalence during compilation), we use a semantic equivalence test as the correctness check. The outcome of this phase is a new model in which most of the correctness issues of the input model are resolved.

2.4 Summary

In this section, we described various approaches for automatic extraction of instruction-set semantics. We also described the various levels of semantic detail that one can extract, and how and why some of them may not be relevant for our purpose in this dissertation. We then described the advantages offered by modern compilers, and why we believe they are appropriate for extracting instruction-set semantics. Finally, we described our overall approach of semantics extraction using the code generators of modern compilers. In the next couple of sections, we describe our semantics extraction approaches in detail.

3 LISC: Learning Instruction Semantics from Code Generators

As we saw in the last section, the direct use of architecture specifications for extracting an instruction set semantics model is hard. Moreover, it is undesirable, since it would tie the model extraction approach to a specific compiler. A much better approach for model extraction is to treat compilers (and code generators) as a black box and to decouple from the specifics of their architecture specifications. Fortunately, the common practice of dumping the inputs and outputs of intermediate stages of modern compilers makes such an approach feasible. To be precise, code generators of modern compilers provide options to log IR instructions and the corresponding assembly instructions generated during compilation of source programs. As a result, it seems possible to extract the semantics models used by code generators by referring to their logs. Our approach for semantics model extraction, called LISC (Learning Instruction Semantics from Code generators), is based exactly on these observations. A sample log obtained from GCC's x86 code generator during the compilation of a typical C program (computing the Fibonacci series) is shown in Figure 4. The log contains x86 assembly instructions and their corresponding RTL (GCC's IR) instructions. (It is not necessary at this stage to understand the details of RTL instructions, so we will not discuss them here.)

(a) C program:

    int fib(int n) {
      if (n <= 1)
        return n;
      else
        return fib(n-1) + fib(n-2);
    }

(b) Compilation log (assembly and the corresponding RTL instruction):

    No  Assembly          RTL instruction
    1   pushl %ebp        (set (mem:SI (pre_dec:SI (reg:SI esp))) (reg:SI ebp))
    2   movl %esp,%ebp    (set (reg:SI ebp) (reg:SI esp))
    3   pushl %ebx        (set (mem:SI (pre_dec:SI (reg:SI esp))) (reg:SI ebx))
    4   subl $20,%esp     (parallel [(set (reg:SI esp) (plus (reg:SI esp) (const_int -20)))
                                     (clobber (reg:CC eflags))])
    5   jg .L2            (set (pc) (if_then_else (gt (reg:CCGC eflags) (const_int 0))
                                                  (label_ref .L2) (pc)))

Figure 4: Sample compilation log produced by GCC's x86 code generator during compilation of C code for the Fibonacci series

Given these details of code generator logs, it is logical to ask: why can the logs themselves not be treated as semantics models? After all, the logs contain the semantics of assembly instructions in terms of compiler IRs. Unfortunately, there is one important problem with using logs as models: the logs contain concrete instructions, and to use the logs as a model we would need a huge number of them. By concrete instructions, we mean instructions whose operands are concrete. For instance, mov %eax, %ebx is a concrete instruction, because eax and ebx are concrete operands. The opposite of a concrete instruction is a parameterized instruction. For instance, mov %1, %2 is a parameterized instruction, because its operands are now parameters⁴. It is desirable that the extracted models contain parameterized pairs; otherwise, we would need a pair for every possible operand value and every combination of operands, and we would not know the semantics of target instructions missing from the log file. For instance, Figure 4 contains a pair with the "subl $20, %esp" instruction. If we use this pair to generate a model, we will not know the semantics of "subl $10, %esp". To solve this problem, we would need to collect 2^32 similar pairs for the different immediate values that can appear in place of 20. While the storage space taken by such a huge number of pairs might not be an issue, being able to generate all such pairs is a serious challenge. In contrast, an approach that uses parameterized pairs to generate models avoids this problem: parameterized pairs allow us to infer the semantics of "subl $10, %esp" even though it is missing from the log. LISC thus uses parameterized pairs to represent semantics models. The parameterized pairs are produced by learning the semantics of target instructions; we will see the details of LISC soon. Unfortunately, LISC also faces two challenges. The first relates to the completeness of the model, the second to its correctness.

• Completeness: Given the high-level description of our approach, it is easy to see that for the extracted models to contain all of the target instructions, the compilation logs need to contain all those instructions.

⁴ Intuitively, a parameterized instruction can also be thought of as an instruction template which, when instantiated with concrete operands, yields a concrete instruction.

Unfortunately, the compilation logs may not contain all of the target instructions, for the following reasons:

1. A source program may not need some of the machine instructions. The compilation log in Figure 4 demonstrates this point. Compiling a variety of source packages may address this issue.

2. Compilers might produce some instructions only when specific compilation options are used. For instance, GCC generates x86 SSE instructions only when the "-msse" option is used; otherwise, it generates i387 instructions for floating point computations. It seems that by compiling a variety of source packages with different compilation options, we can increase the number of supported instructions, but we need to validate this hypothesis experimentally.

• Soundness: Unfortunately, the approach of using parameterized pairs in extracted models comes with its own challenge. In particular, the process of generating parameterized pairs from concrete pairs is sound only if the semantics represented by a parameterized pair is exactly the semantics of the corresponding concrete pairs. For instance, if we generate the parameterized pair shown in Figure 5 from the concrete pair shown in the same figure, then it is necessary that we have seen all 2^32 values⁵ that can appear in place of 10, because the parameterized rule represents all such cases and we need to verify that it is indeed correct. If we do not have all 2^32 concrete pairs, then the process of obtaining parameterized pairs from the available pairs amounts to speculative generalization: we are speculating about the semantics of the missing pairs. Such speculative generalization can be unsound. For instance, it might be possible that, for "add $-10, %eax", a compiler produces an IR which subtracts the value 10 instead of adding -10. Intuitively, the solution we are looking for should keep the scope of generalization narrow enough to avoid unsound mappings, yet broad enough to cover pairs with all possible operand combinations. Moreover, at this stage it is important to mention that generalization is a fundamental property of all learning-based approaches.

⁵ Assuming 32-bit integers.

Concrete assembly instruction:      add $10,%eax
Concrete IR instruction:            [(set (reg:SI ax) (plus (reg:SI ax) (const_int 10)))
                                     (clobber (reg:FLAGS))]

Parameterized assembly instruction: add $P,%eax
Parameterized IR instruction:       [(set (reg:SI ax) (plus (reg:SI ax) (const_int P)))
                                     (clobber (reg:FLAGS))]

Figure 5: Concrete pair for the x86 add assembly instruction and its parameterized version
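To make the template reading of Figure 5 concrete, the following is a minimal OCaml sketch of our own (the function name and string representations are illustrative, not LISC's): a parameterized pair behaves like a function of its operands, so a single entry covers subl with any immediate.

    (* A hypothetical rendering of the template intuition of footnote 4:
       a parameterized pair, instantiated with a concrete operand, yields
       a concrete assembly/IR pair. *)
    let sub_imm_esp (imm : int) : string * string =
      (Printf.sprintf "subl $%d,%%esp" imm,
       Printf.sprintf
         "(parallel [(set (reg:SI esp) (plus (reg:SI esp) (const_int -%d))) (clobber (reg:CC eflags))])"
         imm)

    let () =
      let asm, rtl = sub_imm_esp 10 in   (* covers "subl $10,%esp" as well as $20 *)
      Printf.printf "%s  <->  %s\n" asm rtl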

3.1 LISC Overall Approach

Given the challenges faced by LISC, we now discuss how we address them. LISC extracts the semantics model from GCC's code generator logs in the three steps described below. The block diagram of LISC is shown in Figure 6. Note that our current implementation targets GCC, but it should be possible to extend it to other compilers too.

1. Extracting IR and assembly pairs. The goal of this step is to extract pairs of IR instructions and the corresponding assembly instructions from the code generator log files.

2. Parameterization. Once concrete pairs are obtained, the goal of this step is to generate parameterized pairs from them. This step learns how to identify parameters in assembly and RTL instructions. Generally, the operands and operators of assembly and IR instructions are parameters, but we do not encode architecture-specific ways to identify parameters in LISC at all. If one imagines the parse trees of an assembly and an RTL instruction, then LISC simply treats their leaves as operands. This approach helps make LISC architecture-neutral. Once parameters are identified, they are replaced by variables to produce parameterized pairs. For instance, in the assembly instruction add %eax, %ebx, if LISC learns that eax and ebx are parameters, and if it assigns variable X for eax and Y for ebx, then the parameterized assembly instruction would be "add %X, %Y".


[Block diagram: source packages are compiled by the compiler; (1) log collection produces the code generator log; (2) parameterization produces parameterized pairs; (3) transducer construction merges them into the model M.]

Figure 6: LISC block diagram

Correspondingly, substituting eax and ebx in the RTL (set (reg ebx) (plus (reg ebx) (reg eax))), we would get (set (reg Y) (plus (reg Y) (reg X))) as the parameterized RTL. Parameterized pairs (a parameterized assembly instruction and its RTL) provide a way to represent multiple concrete pairs compactly. These pairs then form the extracted semantics model.

3. Transducer construction. A set of parameterized pairs can itself be considered an extracted model. But we observed that many parameterized pairs have similar syntactic structures or sub-structures. For instance, both "add %eax, %ebx" and "add %ecx, %edx" produce the same parameterized assembly ("add %X, %Y"), and thus have the same syntactic structure. To eliminate such duplication, one can factor out the common parts across pairs and only consider the uncommon parts. This optimization helps generate compact semantics models. We call this notion merging, and we implement it by developing an approach to construct a transducer (an automaton with outputs) that merges parameterized pairs together. The output of this step is the final extracted semantics model.

We will now describe each of these steps in detail.

3.2 Extracting IR and Assembly Pairs

We first use GCC to compile many source code packages and collect concrete pairs of IR instructions and the corresponding assembly instructions from its code generator log files. Earlier we gave an example of pairs obtained from a compilation log of C code for the Fibonacci series. All captured IR-to-assembly pairs are then fed to the next step in the process. A simplified sketch of this pairing step follows.
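The OCaml sketch below is our own illustration of the mechanics. It assumes, as a simplification, that each RTL dump appears as a single '#'-prefixed comment line immediately before the assembly line it annotates, which approximates the layout of the -dP output; the real log parsers are architecture-specific (see Section 3.5).

    (* Sketch: pair '#'-prefixed RTL comment lines with the assembly line
       that follows them in a dump file. Assumes one comment per instruction. *)
    let collect_pairs (path : string) : (string * string) list =
      let ic = open_in path in
      let rec loop pending acc =
        match input_line ic with
        | exception End_of_file -> close_in ic; List.rev acc
        | line ->
            let line = String.trim line in
            if String.length line > 0 && line.[0] = '#' then
              (* remember the RTL comment; it annotates the next instruction *)
              loop (Some (String.trim (String.sub line 1 (String.length line - 1)))) acc
            else
              (match pending with
               | Some rtl when String.length line > 0 ->
                   loop None ((rtl, line) :: acc)   (* an (RTL, assembly) pair *)
               | _ -> loop None acc)
      in
      loop None []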

3.3 Parameterization

The goal of parameterization is to generate a parameterized pair of IR and assembly instructions from their concrete pair. Generating parameterized pairs demands that parameters in the IR and assembly be identified, and also that the mapping functions between parameters be identified. For instance, to produce the parameterized pair ⟨"add $X, %Y", (set (reg Y) (plus (reg Y) (const_int X)))⟩ from the concrete pair ⟨"add $10, %eax", (set (reg eax) (plus (reg eax) (const_int 10)))⟩, we first need to identify that eax and 10 are parameters. Moreover, we need to identify that the immediate value in the assembly instruction (10) is the same as that in the RTL (10); in other words, the mapping function between the immediates is the identity function (f(x) = x). Identifying the mapping functions between parameters is necessary so that a parameterized pair can be used to generate a concrete IR or assembly when the extracted semantics model is used in some application. For instance, when we use the above parameterized pair to translate the "add $20, %ebx" assembly instruction (from a disassembled binary) into RTL, we simply bind the variables X and Y from the parameterized pair to 20 and ebx respectively, and apply the mapping functions for both of them to produce (set (reg ebx) (plus (reg ebx) (const_int 20))) as the concrete RTL. The parameterization step thus has two objectives: (1) identify the parameters in the instructions, and (2) find out how input parameters are mapped to output parameters. To achieve these objectives, LISC operates on the syntactic structures of the concrete assembly and RTL instructions. We use parse trees to represent syntactic structures. The tree structure

allows grouping of related information together, and it also simplifies the task of finding common nodes and subtrees across multiple trees.

Construction of first-order terms (parse trees). Instead of detailing how we build parse trees for assembly and IR instructions, in Figure 7a we specify the grammar for representing these trees as first-order terms. In the grammar, we have considered assembly instructions with a limited set of operand types: registers, memory, and immediates⁶. Additionally, for memory, we consider only two types of operands: indirect reference via a register, and indirect reference via a register plus some constant. To simplify our discussion here, we have restricted the grammar to this basic set of operands. RTL syntax is slightly different from assembly syntax, but semantically the two are close. Note the use of *1 and *2 in the memory operands of assembly instructions. These functors encode the arity of the terms, and are used to distinguish uses of the same operator with different numbers of operands. Subterms with zero arity are leaves in the parse tree, so we call such terms leaf terms. Examples of concrete pairs for x86 which follow this grammar are shown in Figure 8a.
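For concreteness, the grammar of Figure 7a can be mirrored by an ordinary algebraic data type. The following OCaml sketch is our own rendering, not LISC's actual definitions; the Var constructor anticipates the parameterization step.

    type term =
      | Leaf of string               (* leaf term: a register name, an integer, ... *)
      | Var of string                (* a variable, named by a position such as "@1.1" *)
      | Node of string * term list   (* a functor applied to subterms *)

    (* The assembly "add (%ebx),%eax", whose term is written (add [*1 ebx] eax)
       in the grammar's notation: *)
    let asm_example =
      Node ("add", [Node ("*1", [Leaf "ebx"]); Leaf "eax"])

    (* Its RTL counterpart, (set (reg eax) (plus (reg eax) (mem (reg ebx)))): *)
    let rtl_example =
      Node ("set",
            [Node ("reg", [Leaf "eax"]);
             Node ("plus", [Node ("reg", [Leaf "eax"]);
                            Node ("mem", [Node ("reg", [Leaf "ebx"])])])])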

Parameter identification. Once IR and assembly instructions are represented as first-order terms, the next step is to identify the parameters in these terms. Note that any field of a term can be a parameter. For the purpose of such identification, one could simply encode the knowledge of what constitutes a parameter into the system. Specifically, one could encode the names of the architecture's registers and use them to identify which fields of a term are registers (e.g., eax is a parameter). Unfortunately, such an approach is architecture-specific. LISC thus avoids it and instead relies on the simple approach of treating leaf terms as potential parameters. This approach does not treat non-leaf terms such as "(*1 (reg eax))" (which stands for the assembly operand "(%eax)") as parameters. We found this treatment of parameters sufficient for the purpose of our problem.

Identifying the mapping functions. Once the parameters in the IR and assembly terms are identified, the next step is to learn how parameters in the assembly map to those in the RTL. Intuitively, the objective of this step is to find out how the semantics of an assembly instruction is mapped by its corresponding RTL instruction. Concretely, mappings between assembly and RTL instructions are mapping functions between their terms.

⁶ Note that the LISC implementation supports all operand types of RTL and assembly instructions, such as floating point values, strings, etc.

Pair ::= ⟨Assembly, RTL⟩

Assembly ::= (Id AsmOpList)
AsmOpList ::= /* empty */ | AsmOp AsmOpList
AsmOp ::= Reg | Mem | ConstInt
Reg ::= STRING
Mem ::= (*1 Reg) | (*2 Reg ConstInt)
ConstInt ::= INT
Id ::= STRING

RTL ::= (Id RTLOpList)
RTLOpList ::= /* empty */ | RTLOp RTLOpList
RTLOp ::= Reg' | Mem' | ConstInt'
Reg' ::= (reg STRING)
Mem' ::= (mem MemOp')
MemOp' ::= (Reg') | (plus Reg' ConstInt')
ConstInt' ::= (const_int INT)

(a) Grammar for concrete pairs

Reg' ::= (reg STRING MapFuncs)
ConstInt' ::= (const_int INT MapFuncs)
MapFuncs ::= [] | [MapFunc]
MapFunc ::= Eq | Plus ConstInt' | Minus ConstInt' | Mult ConstInt' | Div ConstInt'

(b) Grammar for parameterized pairs (updated with support for representing mapping functions)

Figure 7: Grammar for concrete and parameterized pairs

(a) Concrete pairs:

Pair 1: ⟨(add 20 eax), (set (reg eax) (plus (reg eax) (const_int 20)))⟩
Pair 2: ⟨(add 10 eax), (set (reg eax) (plus (reg eax) (const_int 10)))⟩
Pair 3: ⟨(add (*1 ebx) eax), (set (reg eax) (plus (reg eax) (mem (reg ebx))))⟩
Pair 4: ⟨(add (*2 ebx 40) eax), (set (reg eax) (plus (reg eax) (mem (plus (reg ebx) (const_int 30)))))⟩
Pair 5: ⟨(mov 10 eax), (set (reg eax) (const_int 10))⟩
Pair 6: ⟨(mov ebx eax), (set (reg eax) (reg ebx))⟩
Pair 7: ⟨(mov (*1 ebx) eax), (set (reg eax) (mem (reg ebx)))⟩

(b) Parameterized pairs:

Pair 1: ⟨(add X Y), (set (reg (Eq Y)) (plus (reg (Eq Y)) (const_int (Eq X))))⟩
Pair 2: ⟨(add X Y), (set (reg (Eq Y)) (plus (reg (Eq Y)) (const_int (Eq X))))⟩
Pair 3: ⟨(add (*1 X) Y), (set (reg (Eq Y)) (plus (reg (Eq Y)) (mem (reg (Eq X)))))⟩
Pair 4: ⟨(add (*2 X Y) Z), (set (reg (Eq Z)) (plus (reg (Eq Z)) (mem (plus (reg (Eq X)) (const_int (Minus Y 10))))))⟩
Pair 5: ⟨(mov X Y), (set (reg (Eq Y)) (const_int (Eq X)))⟩
Pair 6: ⟨(mov X Y), (set (reg (Eq Y)) (reg (Eq X)))⟩
Pair 7: ⟨(mov (*1 X) Y), (set (reg (Eq Y)) (mem (reg (Eq X))))⟩

Figure 8: Examples of concrete and parameterized pairs for a subset of x86 assembly and RTL instructions

For instance, we can see that the term for the register eax in the assembly maps to the term (reg eax) in the RTL. In general, two values can be related by any number of mapping functions, but fortunately we do not need to find all possible mapping functions. Instead, we can rely on common observations about assembly and IR instructions to restrict them to

a few. Specifically, different types of values used in assembly/IR instructions follow different mapping functions. Most parameter types, such as register names, floating point values, immediates, strings, etc., almost always follow the equality relation (which we designate as =). Registers sometimes follow a subtype relation, wherein the output register is a subregister of the input register (e.g., if ax is the output register and eax is the input register, then we have a subtype relation). In addition to the equality relation, numeric values such as integers and floats might follow a few others, such as addition, subtraction, multiplication, or division of the input by some constant value. We need to consider these additional mapping functions for numeric values because, even though constants mostly appear unchanged between IR and assembly, this is not always true. Given this description, the problem of finding a mapping function f between two values x and y can be formulated as { f | f(x) = y }, and in our implementation we restrict the relations between x and y to y = x, or y = x op C, where op is one of +, −, ×, and ÷, and C is some constant value. Note that multiple mapping functions are possible between two values. For instance, if an assembly instruction in a pair contains 10 (represented by variable x) and the corresponding IR contains 20 (represented by variable y), then there are two possible mapping functions: y = x + 10 and y = x × 2. Our goal is thus to find all of the possible mapping functions; by processing just one pair, we do not know which one is the exact mapping function for that pair. When we come across more instances of similar pairs, we can find out which mapping function actually holds. For instance, if we come across the same pair with the assembly containing 30 (represented by x) and the IR containing 40 (represented by y), then it is more likely that y = x + 10 holds over y = x × 2. So, to begin with, we find all possible mapping functions between two values, and later prune out those that do not hold. This approach ensures that we do not lose our ability to generalize, by being able to find common mapping functions between pairs.

Algorithm and its description. Once we can represent all subterms of a term by variables, we systematically visit all of the assembly's leaf terms and check whether they can be mapped to any of the RTL's leaf terms. If such mappings are found, the RTL's leaf terms are updated with the found mapping functions. Specifically, if y is one of the RTL's leaf terms, x is one of the assembly's leaf terms, and y = f(x), where f is some mapping function, then y is updated with f(X), where X is the variable that represents any value appearing in the position of x. Since the leaf terms of an RTL's parse tree might be updated to capture mapping functions, we need to update our grammar to support these functions. The grammar productions updated to capture mapping functions are shown in Figure 7b; the remaining productions are unchanged from the grammar for concrete pairs. Also note that mapping functions are added to the RTL's leaf terms, while the assembly's leaf terms remain the same. Lastly, in the RTL we have updated the definitions of registers and constant integers only; mapping functions are not added to the mem node because memory is not a leaf term.

To illustrate mapping functions: for the assembly instruction "add $10, %eax", when the RTL instruction contains 20 instead of 10, it can be learned that the RTL either adds 10 to the value that appears in place of 10 (in the assembly instruction) or multiplies it by 2. If we represent the parameter 20 in the RTL by Y and the parameter 10 in the assembly by X, then both these mapping functions for Y are captured as [Plus X 10; Mult X 2], where the semicolon indicates that multiple mapping functions are possible. Our general policy for finding mapping functions also handles tricky cases, such as different numbers of input and output parameters, an input parameter appearing at multiple output positions, etc.

Given a list of variables and the input and output leaf terms represented by those variables, the algorithm to identify the mapping functions between them is shown in Figure 9. For each output term, the algorithm checks whether it is some function of any input term. If such a function is found, it is recorded as the mapping between that output term and the position of the input term. One notation needs a bit of clarification: the + operator represents list concatenation, and to' is a variable representing a list of terms. Thus, + allows us to combine multiple mapping functions in a term. This is needed when a term already contains "Eq X" and we want to add "Div Y 2" to make "[Eq X ; Div Y 2]". When the algorithm is applied to the concrete pairs from Figure 8a, we get the parameterized pairs of Figure 8b.
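The following OCaml sketch is our own rendering of this search for the special case of integer-valued leaf terms, with the bounded constant ranges of Figure 9; the prune function shows how candidate sets narrow as more observations of the same pair arrive. Degenerate inputs (such as an input value of 0) are not handled specially in this sketch.

    type mapfn = Eq | Plus of int | Minus of int | Mult of int | Div of int

    (* All candidate mapping functions from input value ti to output value
       tout: y = x, or y = x op c for a bounded constant c, as in Figure 9. *)
    let find_mappings (ti : int) (tout : int) : mapfn list =
      let cands = ref [] in
      if tout = ti then cands := Eq :: !cands;
      for c = -16 to 16 do
        if tout = ti + c then cands := Plus c :: !cands;
        if tout = ti - c then cands := Minus c :: !cands
      done;
      for c = -65536 to 65536 do
        if c <> 0 && tout = ti * c then cands := Mult c :: !cands
      done;
      for c = 1 to 65536 do
        if tout * c = ti then cands := Div c :: !cands   (* exact division only *)
      done;
      !cands

    (* Pruning across pairs: keep only the functions consistent with every
       observed (input, output) pair. *)
    let prune (obs : (int * int) list) : mapfn list =
      match obs with
      | [] -> []
      | (x0, y0) :: rest ->
          List.filter
            (fun f -> List.for_all (fun (x, y) -> List.mem f (find_mappings x y)) rest)
            (find_mappings x0 y0)

    (* prune [(10, 20); (30, 40)] keeps y = x + 10 and discards y = x * 2. *)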

3.4 Transducer Construction

The output of parameterization is a set of parameterized pairs for the input concrete pairs. But since parameterization produces a single parameterized pair for every concrete pair, we essentially get as many parameterized pairs as concrete pairs. This is undesirable, since it means a huge number of parameterized pairs. A more serious issue is the matching time of a model constructed from these pairs, which would be linear in the number of pairs.

// I is a set of pairs of a variable and a leaf term of an input term.
// O is a set of pairs of a variable and a leaf term of an output term.
// The algorithm returns a set of all mapping functions between terms of I and O.

procedure find_mappings(I, O):
  M = {}                         // set of mapping functions
  CAS = {-16, .., 16}            // constants for the addition and subtraction functions
  CM = {-65536, .., 65536}       // constants for the multiplication function
  CD = {1, .., 65536}            // constants for the division function
  foreach ⟨vo, to⟩ ∈ O do
    to' = [to]                   // to' is a local variable holding a list of terms
    foreach ⟨vi, ti⟩ ∈ I do
      if to = ti then
        to' = to' + [Eq vi]      // + is list concatenation
      if to = ti + ca for some ca ∈ CAS then
        to' = to' + [Plus vi ca]
      if to = ti - cs for some cs ∈ CAS then
        to' = to' + [Minus vi cs]
      if to = ti * cm for some cm ∈ CM then
        to' = to' + [Mult vi cm]
      if to = ti / cd for some cd ∈ CD then
        to' = to' + [Div vi cd]
      // if none of the above conditions match, the mapping function is unsupported
    done
    M = M ∪ {⟨vo, to'⟩}
  done
  return M

Figure 9: Pseudocode for identifying mapping functions between values of input and output parameters

A more efficient way to construct a model whose matching time is independent of the number of pairs is to construct a DFA (deterministic finite-state automaton). Since many parameterized pairs have the same syntactic structure, the automaton construction can factor out common parts across pairs. For instance, note that pairs 5 and 6 in Figure 8b share the syntactic structure (set (reg (Eq Y)) Z), where Z represents the parameter values in which they differ. So, intuitively, we can factor out the common parts between pairs and generate checks to identify the uncommon parts. For instance, we can generate a check such as if X = ebx then Z = (reg ebx) else Z = (const_int 10) to distinguish pairs 5 and 6. Factoring out the common parts across pairs helps in two ways: (1) it avoids duplication, and it allows us to emit parts of the output even without looking at all parts of the input (for instance, in the above example, we can output (set (reg (Eq Y)) without looking at the input); (2) the reduction in the size of the automaton reduces the size of the extracted semantics model.

A tree automaton (an automaton which operates on trees) is a DFA which accepts trees as input. A tree automaton which produces output trees in addition to accepting input trees is called a tree transducer. Since we generate IR parse trees as output by accepting assembly parse trees as input, our problem can be classified as the problem of constructing a tree transducer from the set of parameterized pairs. A tree transducer constructed for the example discussed above is shown in Figure 10. Circular nodes in the graph indicate the input variables that are inspected, while the edge labels represent the tests being performed on those input variables. For instance, the edge labelled = 10 checks whether the value of variable X is 10. Box nodes, on the other hand, represent parts of the output that will be emitted by the transducer when the input has matched all the tests from that node back to the start node. Double-bordered box nodes represent end states of the transducer, in which all the input variables have matched some assembly term and the transducer has thus produced the corresponding IR term as output. Intuitively, what we have done here is to emit the common part of the output between both pairs (i.e., (set (reg eax) Z)), and we emit it immediately when we know that the assembly instruction is mov and its parameter Y is eax.

Tree transducers offer two important advantages for our problem. First, being able to factor out such common parts across pairs helps reduce input matching time. If we were to match the input term against the assembly terms of the above two pairs one by one (sequential matching), the matching time complexity would be O(n), where n is the number of pairs.

[Figure 10 shows the transducer as a tree: the root inspects variable Y; on the edge = eax it emits the output part (set (reg eax) Z) and proceeds to inspect X; on the edge = ebx it reaches the final output (set (reg eax) (reg ebx)), and on the edge = 10 it reaches (set (reg eax) (const_int 10)).]

Figure 10: A tree transducer for the example

The matching complexity of the tree transducer in the figure, on the other hand, is independent of the number of pairs and is proportional to the length of the input. Second, by building a transducer we can match the input in one scan; with sequential matching, we may need to revisit the input whenever the match fails for a pair.
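For illustration, the transducer of Figure 10 could be rendered as the hand-written OCaml function below (our sketch; the real transducer is a constructed data structure, not hand-written code). The nesting mirrors eager emission: the (set (reg eax) ...) part of the output is fixed as soon as Y = eax is observed.

    (* Hand-coded rendering of Figure 10: inspect Y (the destination), emit
       the common output part, then inspect X (the source) to fill the hole Z. *)
    let lift_mov ~x ~y : string =
      match y with
      | "eax" ->
          let z =
            match x with
            | "ebx" -> "(reg ebx)"
            | "10"  -> "(const_int 10)"
            | _     -> failwith "no transition for X"
          in
          "(set (reg eax) " ^ z ^ ")"
      | _ -> failwith "no transition for Y"

    let () = print_endline (lift_mov ~x:"10" ~y:"eax")
    (* prints: (set (reg eax) (const_int 10)) *)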

3.4.1 Background

Before we discuss our algorithm to construct a tree transducer from a set of parameterized pairs, we need to develop the notation and concepts used in the algorithm. We assume familiarity with the basic concept of a term. The symbols in a term are drawn from a nonempty alphabet Σ and a countable set of variables X. We will use w, x, y, and z (with or without subscripts) to denote variables and f to denote non-variable symbols (also called function symbols). Additionally, we use t and u to denote terms. A term t over Σ can then be formed according to the following syntactic rule:

t ::= x | f(t1, t2, ..., tn), where x ∈ X and f ∈ Σ

Parameter positions. In order to refer to the subterms of assembly or IR terms, we develop the concept of a position. For instance, for (add 2 eax), we need a way to designate 2 uniquely.

Only then can we say that the value v appearing at the position of 2 in the assembly follows some mapping function, say f, and maps to f(v) in the RTL. Fortunately, the inductive nature of term structure provides an easy way to uniquely identify subterms. We call the unique identifier of each subterm its position [80].

Definition 1 (Position) A position of a subterm t in a term T, denoted Pt, is a string of the form p.i, where p is a position and i is an integer. Positions satisfy all of the following criteria: (1) if T is a term for an assembly instruction, then its root symbol has position 1; (2) if T is a term for an RTL instruction, then its root symbol has position 2; (3) otherwise, if T is of the form f(t1, t2, ..., tn), where n > 0, and p is the position of T, then p.1 is the position of t1, p.2 is the position of t2, ..., and p.n is the position of tn. The string representation of Pt is obtained by prefixing '@' to Pt.
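Positions can be enumerated mechanically. The following OCaml sketch is our own illustration (not LISC code), using the simple term type sketched earlier:

    type term = Leaf of string | Var of string | Node of string * term list

    (* Enumerate the leaf terms of t together with their positions, where p is
       the position of t itself ("@1" for an assembly term, "@2" for an RTL term). *)
    let rec leaf_positions (p : string) (t : term) : (string * string) list =
      match t with
      | Leaf s | Var s -> [ (p, s) ]
      | Node (_, subs) ->
          List.concat
            (List.mapi (fun i u -> leaf_positions (p ^ "." ^ string_of_int (i + 1)) u) subs)

    let () =
      leaf_positions "@1" (Node ("add", [Leaf "2"; Leaf "eax"]))
      |> List.iter (fun (p, s) -> Printf.printf "%s : %s\n" p s)
    (* prints  @1.1 : 2  and  @1.2 : eax *)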

Note that in the context of our problem, the set X of variables is a set of strings representing positions. For instance, one value of X could be {@1.1, @1.2, @1.3}. A ground term is a term with no occurrences of variables. The set of all ground terms of t over Σ is represented as TΣ(t). Lastly, ⊤ and ⊥ are special terms such that TΣ(⊤) is the universal set of terms over Σ and TΣ(⊥) is the empty set of terms over Σ. To illustrate terms and ground terms:

- (mov eax ebx) is a term representing an assembly instruction, while (set (reg eax (Eq @1.1)) (reg ebx (Eq @1.2))) is a term representing the RTL instruction for that assembly instruction. Note that the term for the assembly instruction is a ground term, since it does not contain any variables, while the term for the RTL instruction is not a ground term.

Now that we have defined how to represent parameterized pairs as terms, we need to define how to merge two terms together. Intuitively, the result of merging should be a term that is more general than the terms being merged. In order to find such a general term, we need the notion of substitution. A substitution is a mapping from variables to terms. Given a substitution δ, we denote by δ(t) the term obtained by replacing every variable x in t by δ(x). We say that t is an instance of u if δ(u) = t for some substitution δ; if t is an instance of u, we denote it by t <= u. Intuitively, if δ(u) = t, then u is more general than t, or t is more specific than u. On the other hand, if t and u differ only in their variable names, then they are said to be

equal to each other, denoted t = u. Also, ⊤ and ⊥ are special terms such that ⊤ is more general than any other term, and ⊥ is more specific than any other term. Substitutions can also be written as [x1 ↦ t1, x2 ↦ t2, ..., xn ↦ tn], with the xi pairwise distinct; the mapping function δ is then:

[x1 ↦ t1, x2 ↦ t2, ..., xn ↦ tn](y) = ti, if y = xi
[x1 ↦ t1, x2 ↦ t2, ..., xn ↦ tn](y) = y, otherwise

In some instances, we write xδ for δ(x). To illustrate substitution and instances of a term:

- [x ↦ eax, y ↦ ebx](set (reg x) (reg y)) = (set (reg eax) (reg ebx))
- [x ↦ eax](set (reg x) (reg y)) = (set (reg eax) (reg y))
- [z ↦ eax](set (reg x) (reg y)) = (set (reg x) (reg y))

The notion of substitution now allows us to define a partial ordering among terms, and leads to the formation of a lattice over terms. We will denote such a lattice by L. A sample lattice for a set of parameterized assembly instructions is shown in Figure 11. Now that we have described how to construct a lattice over a set of terms for parameterized rules, we can describe how to compute the mcp and the residue.
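Substitution has a direct transcription over the term type sketched earlier; the following OCaml is ours, for illustration:

    type term = Leaf of string | Var of string | Node of string * term list

    (* delta(t): replace every variable x in t by delta(x); variables not
       bound by the substitution are left in place. *)
    let rec subst (delta : (string * term) list) (t : term) : term =
      match t with
      | Var x -> (try List.assoc x delta with Not_found -> t)
      | Leaf _ -> t
      | Node (f, subs) -> Node (f, List.map (subst delta) subs)

    (* [x |-> eax, y |-> ebx](set (reg x) (reg y)) = (set (reg eax) (reg ebx)) *)
    let () =
      let t = Node ("set", [Node ("reg", [Var "x"]); Node ("reg", [Var "y"])]) in
      ignore (subst [ ("x", Leaf "eax"); ("y", Leaf "ebx") ] t)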

3.4.2 Maximal common prefix (mcp)

Intuitively, computing the mcp of parameterized pairs is analogous to the union operation on sets: just as union merges multiple sets together, mcp computation merges multiple parameterized pairs together. Note that mcp computation is sound only if the pair obtained by combining multiple pairs captures the complete semantics of all those pairs. The union operation satisfies this criterion.

Definition 2 (mcp) Conceptually, the mcp of terms t1 and t2, denoted mcp(t1, t2), is the least upper bound of t1 and t2 in the lattice L. Formally, mcp(t1, t2) is a term t such that both of the following conditions are satisfied:

t1 <= t and t2 <= t, and for all t' such that t1 <= t' and t2 <= t', we have t <= t'

[Figure 11 shows a lattice over terms: ⊤ at the top; below it, (mov x y) and (add x y); below (mov x y), the more specific terms (mov (reg z) (reg w)) and (mov (mem (reg x)) (const_int y)); below those, the ground terms (mov (reg eax) (reg ebx)) and (mov (mem (reg ecx)) (const_int 10)); and ⊥ at the bottom.]

Figure 11: Lattice over a set of parameterized x86 assembly instructions

3.4.3 Residue

Since the residue is the uncommon part of a term with respect to another term, we can define it as follows.

Definition 3 (Residue) The residue of a term t1 with respect to another term t2, denoted residue(t1, t2), is defined as:

residue(t1, t2) = δ ⟺ δ(t1) = t2

The residue of a set of terms T with respect to some term t' can be defined as { residue(t, t') | t ∈ T }. To illustrate mcp and residue, consider the following set of terms:

- t1: (set (reg eax) x)
- t2: (set (reg eax) (reg ecx))
- t3: (set (reg eax) (const_int 10))
- t4: (mem (reg eax) (const_int 10))

Then:

• mcp(t1, t2) = t1, and residue(t1, t2) = [x ↦ (reg ecx)]
• mcp(t1, t3) = t1, and residue(t1, t3) = [x ↦ (const_int 10)]
• mcp(t1, t4) = ⊤, and residue(t1, t4) = []
• mcp(t2, t1) = t1, and residue(t2, t1) = []
• mcp(t2, t2) = t2, and residue(t2, t2) = []
• mcp(t2, t3) = (set (reg eax) x), and residue(t2, t3) = []
• mcp(t3, t1) = (set (reg eax) x), and residue(t3, t1) = []
• mcp(t3, t4) = ⊤, and residue(t3, t4) = []
• mcp(t4, t1) = ⊤, and residue(t4, t1) = []
• mcp(t4, t3) = ⊤, and residue(t4, t3) = []

Note that mcp(t3, t4) could have been (x (reg eax) (const_int 10)), but since our grammar allows variables (positions) to appear only at the leaves of parse trees, mcp(t3, t4) is ⊤.
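Both operations have compact implementations over the term type used earlier. The OCaml sketch below is our own illustration; None plays the role of ⊤, and, per the grammar, variables are introduced only below the root.

    type term = Leaf of string | Var of string | Node of string * term list

    (* mcp t u: the least upper bound of t and u. Fresh variables are
       introduced where the two terms disagree below the root. *)
    let mcp (t : term) (u : term) : term option =
      let n = ref 0 in
      let fresh () = incr n; Var (Printf.sprintf "x%d" !n) in
      let rec go t u =
        match t, u with
        | Node (f, ts), Node (g, us) when f = g && List.length ts = List.length us ->
            Node (f, List.map2 go ts us)
        | Leaf a, Leaf b when a = b -> Leaf a
        | _ -> fresh ()                 (* disagreement: generalize to a variable *)
      in
      match t, u with
      | Node (f, ts), Node (g, us) when f = g && List.length ts = List.length us ->
          Some (go t u)
      | _ -> None                       (* root structures differ: the mcp is T *)

    (* residue g t: the substitution delta with delta(g) = t, read off by
       walking g (the more general term) and t in parallel. Assumes g is an
       mcp of t with some other term, so the structures agree. *)
    let residue (g : term) (t : term) : (string * term) list =
      let rec go g t acc =
        match g, t with
        | Var x, _ -> (x, t) :: acc
        | Leaf _, _ -> acc
        | Node (_, gs), Node (_, ts) ->
            List.fold_left2 (fun a p q -> go p q a) acc gs ts
        | _ -> acc
      in
      go g t []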

Speculative generalization. The notion of mcp essentially introduces generalization into the extracted semantics model, and generalization can compromise the soundness of the extracted models. For instance, mcp((set (reg eax) (reg ebx)), (set (reg eax) (reg ecx))) is (set (reg eax) (reg x)). By having the variable x represent registers beyond the set {ebx, ecx}, we are generalizing the rule. However, we are also losing the information that we have seen the IR-to-assembly pair only for ebx and ecx, so claiming that the RTL is applicable to edx (which we have not seen) can lead to unsound RTL. Note that if we want to ensure that generalization never compromises soundness, then we need to remember the operand values for which every pair holds. In the above example, we would need to remember that the mcp holds only for {ebx, ecx}. Unfortunately, maintaining such sets of values can lead to situations where the sets contain a great many elements; for immediates, for instance, the sets could be huge. More importantly, this leads to the problem of over-fitting, as it is called in the machine learning domain: if we simply remember the operand values for which a pair holds, then we do not allow generalization at all. Such pairs will be applicable only to the pairs we have already seen (training data) and will not generalize to unseen pairs (test data). A solution for keeping the set sizes restricted without losing all of the information is to use intervals to represent sets. Specifically, after ordering the elements of a set, we can take the first and last elements to form an interval. For instance, {3, 1, 4, 10, 5, 12, 100} can be represented compactly as [1, 100]. Intuitively, intervals introduce a restricted form of generalization. That is why using intervals is much better (in terms of soundness) than declaring a pair applicable to any integer value, which would be the same as saying we have seen values in the range [-∞, ∞] (or, for a 32-bit integer, [INT32_MIN, INT32_MAX] to be precise). Additionally, we rely on some observations about operand usage across architectures to maintain soundness even with speculative generalization. To be precise, one can classify the operands of assembly instructions into various functionally-similar categories; operands belonging to these categories behave similarly when used in instructions. For instance, in add $20,%eax and add $1,%eax, both 1 and 20 play the similar role of immediate operands; in many x86 instructions, the registers eax, ebx, ecx and edx can be used interchangeably. We put this observation to use by keeping only operands of the same type in an interval.

Additionally, to keep the scope of generalization restricted, we define the mcp as the least upper bound of two terms, and not any other upper bound. Lastly, we discuss our soundness test in the evaluation section.
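The interval idea is straightforward to encode; this OCaml sketch of ours reproduces the example above:

    type interval = { lo : int; hi : int }

    (* Represent the set of observed operand values compactly by its extremes. *)
    let interval_of (vs : int list) : interval =
      match vs with
      | [] -> invalid_arg "interval_of: empty"
      | v :: rest ->
          List.fold_left (fun iv w -> { lo = min iv.lo w; hi = max iv.hi w })
            { lo = v; hi = v } rest

    let covers (iv : interval) (v : int) : bool = iv.lo <= v && v <= iv.hi

    let () =
      let iv = interval_of [3; 1; 4; 10; 5; 12; 100] in
      Printf.printf "[%d, %d] covers 7: %b\n" iv.lo iv.hi (covers iv 7)
      (* prints "[1, 100] covers 7: true": a restricted generalization beyond
         the exact values seen, but far narrower than [INT32_MIN, INT32_MAX] *)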

3.4.4 Discriminating tests

Intuitively, discriminating tests help us distinguish the uncommon parts of mapping pairs from one another. There can exist many discriminating tests for a given set of pairs. For instance, for a pair of RTLs (set (reg eax) (reg ebx)) and (set (reg eax) (plus (reg eax) (reg ecx))), one could distinguish them on the basis of the structure of the pairs (note that the plus subtree does not exist in the first RTL), the presence of the ecx register in the second pair, etc. We therefore need to decide which tests to prefer over others. One more important point to note is that discriminating tests need not always partition a set of uncommon parts in two; they can partition a set into more than two subsets. In other words, discriminating tests need not be binary tests like "does the rule contain ecx", but can instead be N-ary tests like "partition the pairs into those having mnemonic mov, add, and sub". The important requirement on discriminating tests is that they must create disjoint subsets of a set of uncommon parts. This requirement is needed to keep a polynomial bound on the size of the extracted model in the worst case. Specifically, for a set of N uncommon parts, if the tests do not create disjoint subsets (such tests would actually be called non-discriminating tests), then in the worst case they can lead to the creation of 2^N different subsets (corresponding to the number of possible overlapping and disjoint subsets) of the set of N uncommon parts. In that case, the size of the extracted model would have non-polynomial worst-case complexity. That is why we need discriminating tests that produce a maximum of N disjoint subsets for a set of N uncommon parts. Given this requirement, we will now see how we satisfy it. We follow a specific order of preference for choosing discriminating tests:

1. Mnemonic of an assembly instruction. If the residues of pairs differ in their assembly mnemonic, then we discriminate them based on the assembly mnemonic.

2. Number of operands of an assembly instruction. Instructions having the same mnemonic are then partitioned on the basis of their number of operands.

3. Operand arity (rule structure). Once pairs are partitioned on the basis of assembly mnemonic and the number of operands, we then look at the arities of the operands. The arity of an operand is the arity of the functor term for that operand. For instance, 8(%eax) has two operand fields, while the immediate 8 has only one.

4. Operand type. Once operands are split based on arity, operands of the same arity are then split based on their type (whether the operand is a string or an integer, etc.). For operands of arity more than one, we apply the operand type test to their sub-fields recursively in left-to-right order until we find the first discriminating test. Employing the operand type test also falls in line with the notion of functionally-similar operands.

5. Operand value. Lastly, all pairs whose operands are indistinguishable after applying the above four tests are split based on the values of their operands.

This particular order of choosing discriminating tests arises from the observation that we do not want to merge pairs having different syntactic structures. That is why the first three tests look for syntactic differences. Even the fourth test can be considered to follow the same principle, because in RTL, operands of different types have different syntactic structures: integers are represented as (const_int 10), while strings are represented as (reg eax) or (stringref "printf"). Once we have exploited all syntactic differences to discriminate pairs, we are left with no choice but to look at operand values. Also note that when we discriminate based on operand values, there might exist multiple operand positions where the values differ. When such a case arises, we have multiple choices, and a question arises about which operand position to select. Our criterion in such cases is guided by our requirement to keep the size of the extracted semantics model minimal. To satisfy this requirement, we follow a simple greedy strategy of choosing an operand position which leads to the minimum number of discriminating tests. Such a strategy may not always produce the minimal-size semantics model, but given the simplicity of the approach we prefer it over others. Also note that choices 1–4 are always discriminating tests, because all of these tests divide the input set into disjoint subsets. The last choice, splitting based on operand value, can lead to a non-discriminating test if the value-based split is not chosen properly. To ensure that such a case does not arise, we follow a simple approach of ordering the operand values by defining an ordering operator (such as <), and splitting the ordered set such that the first half forms one set and the second half forms another. Generating two disjoint subsets ensures that such a test is always a discriminating test. The preference order itself is easy to encode, as the sketch below illustrates.
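The test representation below is our own OCaml sketch, not LISC's internal one; it encodes only the ranking among candidate tests.

    (* Candidate discriminating tests, ranked by the preference order above. *)
    type test =
      | Mnemonic of string                  (* e.g., @1 = add_2 *)
      | OperandCount of int
      | OperandArity of string * int        (* position, required arity *)
      | OperandType of string * string      (* position, type such as StringType *)
      | OperandValue of string * string     (* position, concrete value *)

    let rank = function
      | Mnemonic _ -> 1
      | OperandCount _ -> 2
      | OperandArity _ -> 3
      | OperandType _ -> 4
      | OperandValue _ -> 5

    (* Pick the most-preferred candidate test, if any. *)
    let select (candidates : test list) : test option =
      match List.sort (fun a b -> compare (rank a) (rank b)) candidates with
      | [] -> None
      | t :: _ -> Some t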

procedure Build(sc, m, R):
 1.  if R is empty then
 2.    final[sc] = m
 3.  else
 4.    T = select(R)
 5.    for each Ti ∈ T do
 6.      Ri = { r ∈ R | Ti(r) = true }     // partition of R that satisfies Ti
 7.      R = { r ∈ R | Ti(r) = false }     // consider remaining residues for the next test
 8.      mi = mcp(Ri)
 9.      Rsi = residue(Ri, mi)
10.      if a state si corresponding to (mi, Rsi) is not present then
11.        construct state si
12.        Build(si, mi, Rsi)
13.      fi
14.      construct an edge from sc to si labeled with Ti
15.    done
16.  fi
17.  return sc

Figure 12: An algorithm for constructing the automaton

3.4.5 Algorithm for constructing automaton

The algorithm Build for constructing a tree transducer is shown in Figure 12. Note that a tree transducer is a generalization of a decision tree, which would have been a straightforward representation of merged pairs (edges in the tree would represent the discriminating tests selected). We specifically chose an automaton over a decision tree because an automaton opens up the possibility of sharing nodes. States in our transducer represent the mcp of input pairs, with final states representing a complete pair learned by LISC. An important property of this transducer is that it emits output eagerly: as soon as some part of the output is known, it is emitted, without waiting to reach a final state.

Build starts with a dummy start state of the transducer (sc), the mcp (m) of the set of parameterized pairs to be merged, and their residue (R). Specifically, the first call to Build is Build(sc, mcp(I), residue(I, nil)), where I is the set of parameterized pairs to be merged. Eventually, Build returns the start state sc as its output. Build is a recursive procedure which constructs the sub-transducer rooted at sc. When R is empty, we construct a final state of the transducer, marked by the part of the output (the mcp) accepted by that state. When R is not empty, some parts of the pairs remain to be processed. In that case, we first call select to obtain a set T of discriminating tests to distinguish the "uncommon" parts in R. For every test Ti from T, we partition R into two sets by applying the test. Elements of R which do not satisfy the test are considered for further partitioning by the remaining tests of T in subsequent iterations. We obtain the mcp and residue of Ri, the partition which contains all elements of R that passed the test Ti. Since we construct an automaton instead of a decision tree, we can share common states of the automaton. We do so by checking whether a state corresponding to the newly obtained mcp and residue has already been generated; if it has, we simply obtain that state and generate a transition to it from the current state, labeled by the test Ti. If such a state is not already present, we create it and call Build again to explore the sub-automaton rooted at the newly created state.

To illustrate the algorithm, consider the set of parameterized pairs shown in Figure 13a. The transducer constructed for this set of pairs is shown in Figure 13b. Notice that in all the pairs the output part (set (reg (eax & @1.2)) _) is common, which is why the algorithm emits it as the mcp before any input is seen. The only subterm of the output that the transducer still needs to find is the one at position @2.2. Since discriminating tests prioritize the mnemonic of the assembly instruction over the others, the algorithm chooses position @1 of the input for branching (at line 4). Specifically, it first selects all pairs that have @1 = add_2; the pairs which match this test are pairs 1–4. These four pairs again have common sub-structures, specifically eax at @1.2 and (plus (reg (eax & @1.2)) _) at @2.2. The algorithm emits this mcp (line 8). Pairs which do not satisfy the @1 = add_2 test (line 7) are considered in the next iteration of the for loop, in which the algorithm selects the discriminating test @1 = mov_2. It should now be easy to see how the algorithm constructs the rest of the transducer. One thing, though, needs an explanation. Discriminating tests such as <> *2 or = *2 are arity-based discriminating tests, which check the arity of the input at a given position. For instance, the test @1.1 = *2 checks that the input term at position @1.1 has arity 2. Similarly, tests of the form = StringType or <> StringType are type-based discriminating tests, which check the type of the input at a given position. Notice that the inputs and outputs which satisfy these tests (e.g., @1.1 : 10, @2.2 : (const_int 10)) are syntactically different from those which do not (e.g.,

@1.1 : ebx, @2.2 : (reg ebx)).

3.4.6 Error detection

In the final step, we flag any inconsistencies identified in the parameterized pairs. Two types of inconsistencies are possible. The first involves multiple distinct parameterized assembly instructions producing the same IR. This is not a cause for concern, as there are typically multiple assembly instructions that have the same effect, such as xor %eax, %eax and mov $0, %eax. The second inconsistency involves mapping the same assembly instruction to distinct IRs. Unless the two IRs are semantically equivalent, this inconsistency likely indicates an error in the mapping. Our system flags such errors if they are found. In our experiments, we have not had to deal with this inconsistency.
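A sketch of the second check, in OCaml (ours): group pairs by their assembly term and flag any assembly term mapped to more than one distinct IR. Comparing IRs structurally is a simplification on our part, since semantically equivalent but syntactically different IRs would also be flagged.

    (* Flag assembly terms that map to more than one distinct IR.
       Pairs are (assembly, ir), both given here as strings. *)
    let flag_inconsistencies (pairs : (string * string) list)
        : (string * string list) list =
      let tbl : (string, string list) Hashtbl.t = Hashtbl.create 64 in
      List.iter
        (fun (asm, ir) ->
           let irs = try Hashtbl.find tbl asm with Not_found -> [] in
           if not (List.mem ir irs) then Hashtbl.replace tbl asm (ir :: irs))
        pairs;
      Hashtbl.fold
        (fun asm irs acc -> if List.length irs > 1 then (asm, irs) :: acc else acc)
        tbl []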

3.5 Implementation

The LISC prototype implementation works on Linux, and we have used it to synthesize instruction set semantics for the x86, ARM and AVR⁷ architectures. We chose these architectures because x86 CPUs are found in many desktops, laptops, notebooks, and servers, while ARM and AVR are used in embedded systems, with ARM being especially popular in mobile phones and smartphones. The breakdown of the total implementation effort involved is shown in Figure 14. The implementation of the log collection phase is architecture-neutral and is done using a GCC plugin implemented in approximately 70 lines of C code. To collect the IR-to-assembly mapping for foo.c in log.dump, one would use the command:

gcc -dP -fplugin=rule_collection.so -fplugin-arg-out-file=log.dump foo.c

Here -dP is a standard GCC option that tells GCC to dump the IR corresponding to each assembly instruction as a comment. Thus, the log collection phase easily integrates with the configure- and make-based package compilation process commonly used on Linux. The implementation of parameterization and transducer construction is independent of GCC and is done using 2400 lines of OCaml code. All of the OCaml code is architecture-independent, but we do require minimal architecture-specific code to parse the log files and extract the RTL and assembly pairs from them.

⁷ AVR is a modified-Harvard-architecture 8-bit RISC single-chip microcontroller developed by Atmel in 1996.

Pair 1: ⟨(add X Y), (set (reg (Eq Y)) (plus (reg (Eq Y)) (const_int (Eq X))))⟩
Pair 2: ⟨(add X Y), (set (reg (Eq Y)) (plus (reg (Eq Y)) (const_int (Eq X))))⟩
Pair 3: ⟨(add (*1 X) Y), (set (reg (Eq Y)) (plus (reg (Eq Y)) (mem (reg (Eq X)))))⟩
Pair 4: ⟨(add (*2 X Y) Z), (set (reg (Eq Z)) (plus (reg (Eq Z)) (mem (plus (reg (Eq X)) (const_int (Minus Y 10))))))⟩
Pair 5: ⟨(mov X Y), (set (reg (Eq Y)) (const_int (Eq X)))⟩
Pair 6: ⟨(mov X Y), (set (reg (Eq Y)) (reg (Eq X)))⟩
Pair 7: ⟨(mov (*1 X) Y), (set (reg (Eq Y)) (mem (reg (Eq X))))⟩

(a) Example of parameterized pairs

[Figure 13b shows the constructed transducer: from the start state, the common output part {@2 : (set (reg (eax & @1.2)) _)} is emitted immediately; the root then branches on the mnemonic at @1 (add_2 vs. mov_2), emitting the mcp of the matching pairs (e.g., {@2.2 : (plus (reg (eax & @1.2)) _)} for add_2); subsequent states branch on the arity of the operand at @1.1 (<> *2 vs. = *2, <> *0 vs. = *0), on its type (= StringType vs. <> StringType), and finally on operand values or intervals (e.g., {@1.1 : [10, 20]}), with final states emitting the remaining output subterms, such as (const_int (@1.1)) or (reg (ebx & @1.1)).]

(b) Transducer constructed by LISC

Figure 13: Example of transducer construction by LISC

Component                                              Architecture-neutral   x86   ARM   AVR
Log collection (C)                                                       70     -     -     -
Parameterization and transducer construction (OCaml)                   2400     -     -     -
Assembly lexer (OCamllex)                                                74    10    12     7
Assembly parser (OCamlyacc)                                              76   117   103    89
Utility code (scripts)                                                  500    50    16    15

Figure 14: Breakdown of LISC implementation effort

Moreover, we have written architecture-specific parsers to parse assembly instructions and build first-order terms from them. The parsers for all three architectures together constitute around 350 lines of OCamllex and OCamlyacc code. Once the RTL and assembly instructions are parsed and converted into first-order terms, the rest of the processing is completely architecture-neutral.

One complication that arises when we apply the extracted model for assembly-to-IR translation is that the assembly output produced by a compiler may be slightly different from that produced by a disassembler. One such case is that of concrete memory addresses in the assembly instructions produced by a disassembler (e.g., movl 0x80000000,%eax). Such memory addresses would originally contain a label or a symbol reference in the assembly output of a compiler. As a result, the code generator logs alone are not enough to translate such assembly instructions. To solve this issue, we simply treat concrete memory addresses as if they were label references; this is done by treating concrete memory addresses specially in the architecture-specific parsers.

The rest of the utility code, which disassembles binaries into the correct format and applies the extracted models to the assembly instructions, is written in bash scripts; these scripts are around 500 lines in total. We also added a number of features to LISC, such as the ability to save the generated transducer to a file and to load it back into memory. These features avoid the effort of rebuilding the transducer repeatedly, and thus make it easier to use a generated transducer multiple times.

3.6 Evaluation

To evaluate the effectiveness of LISC, we used it to extract semantic models for x86, ARM, and AVR. We used multiple packages to obtain compilation logs for model extraction. All experiments were performed on a quad-core Intel i7 processor running the 32-bit Ubuntu-14.04 OS. Compilers and other tools needed for ARM and AVR were obtained as cross-compilers for these architectures8.

3.6.1 Completeness of the model

A systematic evaluation of completeness requires knowledge of the target architecture, or else we cannot be sure whether all possible instructions have been exercised. While it is relatively easy to enumerate all possible opcodes, it is nontrivial to identify all possible operands, especially for complex instruction sets. Moreover, we would need to identify a collection of applications that uses all of these instructions, which is another nontrivial task. For this reason, we used an indirect approach that is common for evaluating learning-based techniques: we use one set of programs (Ptrain) to generate the model, and test the model on another set of programs (Ptest). In particular, we determine what fraction of the instructions in the binaries for Ptest can be lifted by the model learned from Ptrain. For our experiments, Ptrain consists of multiple packages; we discuss our package selection policy shortly. Ptest, on the other hand, consists of all executable programs (not scripts) in a standard desktop OS installation. Specifically, we disassembled all binaries found on standard OS distributions (such as Ubuntu and Debian) using objdump and attempted to translate the disassembled instructions to RTL. Though such a Ptest may not cover all possible target instructions and their operand combinations, we believe that it covers the most commonly used instructions and operand combinations.

As a comparison point, we implemented a naive exact recall (ER) approach, which looks for an exact match of an assembly instruction from the Ptest binaries among the pairs obtained from Ptrain. If a match is found, the corresponding RTL instruction in the pair is emitted as the translation of the input assembly instruction. Exact recall also provides a reference point for assessing the correctness of translations performed using the training data, and the amount of generalization performed during such translations: intuitively, translations performed using exact recall are correct but involve no generalization. A sketch of this baseline follows.
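A minimal OCaml sketch of the exact-recall baseline, assuming the training pairs are available as (assembly, RTL) strings; the function names are ours, not LISC's.

(* Build a lookup table from the exact assembly text seen in training
   to the RTL the compiler produced for it. *)
let build_er_table (pairs : (string * string) list) : (string, string) Hashtbl.t =
  let tbl = Hashtbl.create (List.length pairs) in
  List.iter
    (fun (asm, rtl) ->
       if not (Hashtbl.mem tbl asm) then Hashtbl.add tbl asm rtl)
    pairs;
  tbl

(* ER succeeds only if the exact assembly string was seen in training. *)
let er_translate tbl asm =
  try Some (Hashtbl.find tbl asm) with Not_found -> None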

8On Ubuntu-14.04, one can simply install the binutils-arm-linux-gnueabi package for ARM, and the binutils-avr and gcc-avr packages for AVR, to obtain them.

Package selection policy. The package selection policy for Ptrain might seem uninteresting, given the abundance of packages available for open-source OSes these days. Though a random selection of packages could serve the purpose, we devised an iterative selection policy that uses the results of the completed tests to make better choices than random selection would. The package selection policy directly affects the completeness numbers, and the ultimate goal of the LISC evaluation is to demonstrate that all assembly instructions can be translated to RTL. To reach a 100% completeness number as quickly as possible, our policy is to select, for compilation in the next iteration, the package with the lowest completeness number in the current iteration. This policy attempts to maximize the improvement in coverage per iteration; because of hand-written assembly instructions and the presence of concrete memory addresses, the lowest completeness number may not always be a faithful indicator, but we found the policy to work as expected in our experiments. A sketch of this loop follows.
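An OCaml sketch of the iterative selection loop, not LISC's actual scripts: completeness_of is a hypothetical stand-in that runs the current model over a package's binaries and returns the fraction lifted, and extend_model compiles the package and adds its log pairs to the model.

let rec grow model packages ~completeness_of ~extend_model =
  match packages with
  | [] -> model
  | p0 :: rest ->
    (* Select the package whose binaries the current model lifts worst. *)
    let worst =
      List.fold_left
        (fun best p ->
           if completeness_of model p < completeness_of model best then p
           else best)
        p0 rest
    in
    let remaining = List.filter (fun p -> p <> worst) packages in
    grow (extend_model model worst) remaining ~completeness_of ~extend_model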

                                % of insns lifted    LISC (%)                    Missing mnemonics (absolute numbers)
Ptrain                          ER       LISC        Missing     Missing         Basic  i387  SSE  AVX  Other  Total (%)
                                                     mnemonics   operands
openssl-1.0.1f + binutils-2.22  63.72    98.46       1.05        0.49            126    94    147  50   47     464 (39.09%)
+ ffmpeg-2.3.3                  68.21    98.74       1.03        0.23            101    87    102  45   42     377 (31.76%)
+ glibc-2.21                    68.74    98.80       1.01        0.19            92     70    100  43   41     346 (29.14%)
+ ffmpeg-2.3.3*                 69.07    98.89       0.88        0.23            85     52    88   40   38     303 (25.52%)
+ gstreamer-1.4.5               71.07    99.10       0.79        0.11            66     37    57   25   36     221 (18.61%)
+ qt-5.4.1                      72.45    99.21       0.69        0.09            47     29    35   16   34     161 (13.56%)
+ linux kern-3.19               73.97    99.49       0.44        0.07            17     5     1    2    24     49 (4.12%)
+ Manual                        74.04    100.00      0.00        0.00            0      0     0    0    0      0 (0%)

(a) Completeness results of all x86 binaries (GCC compiled) found in the Ubuntu-14.04 distribution
*This time we compiled ffmpeg with SSE, i387, and AVX enabled one by one.

Type              Number
Executables       1796
Shared libraries  3260
Object files      87
Kernel modules    4089
Static archive    5
Total             9237

(b) Breakdown of all Ubuntu-14.04 x86 binaries

Type                          Number
Missing mnemonics             49
Missing operand combinations  42
Total                         91

(c) Breakdown of manually-added rules

Type    Total mnemonics in x86   Mnemonics not supported by gcc-4.6.4
Basic   395                      17
i387    141                      5
SSE     239                      1
AVX     276                      2
Others  136                      24
Total   1187                     49

(d) Breakdown of x86 mnemonics and those not supported by gcc-4.6.4

Figure 15: Detailed completeness evaluation results for x86

Results for x86. The results for x86 are shown in Figure 15a. Rows in the figure list the packages that we added to the training data set in each round; +ffmpeg in the second experiment thus means that ffmpeg was added in the second iteration to the base training data of openssl and binutils. We used gcc-4.6.4 for 32-bit x86 to compile these packages. For testing, we used all 9237 x86 binaries (see Figure 15b for details) found on a standard Ubuntu-14.04 desktop installation. These binaries were disassembled using objdump-v2.24 for x86. To the best of our knowledge, Ubuntu uses GCC (exact version not known) to produce these binaries. The columns in the figure list, in order: the percentage of assembly instructions that could be translated to RTL by exact recall; the percentage that could be translated by LISC; the percentage breakdown of the missing instructions (calculated with respect to mnemonics from the Intel manuals [5]) by cause; and the classification of mnemonics of instructions that could not be lifted because those mnemonics were not in the training data.

After training LISC with the combination of openssl and binutils as a base, we could lift 98.46% of the instructions in all x86 binaries on Ubuntu-14.04. Of the remaining 1.54%, 1.05% could not be lifted to RTL because their mnemonics were not present in Ptrain, while 0.49% could not be lifted because their operand combinations were not in Ptrain. By a missing operand combination, we mean that the training data did not contain some instruction with a specific type of operand, so we cannot translate such an assembly instruction from the test data. For instance, the push-with-immediate-data instruction (i.e., push $imm) was not found in the training data, and hence could not be translated to IR. Of the 1.05% missing mnemonics, many belonged to the basic (126) and SSE (147) instruction sets. In these counts, we count a single mnemonic with different modes (byte vs. word) separately; thus, movl is a different mnemonic than movb. Counted this way, there are a total of 1187 mnemonics in the Intel manual (details in Figure 15d), and the percentage of total missing mnemonics is computed with reference to these 1187 x86 mnemonics. We found that the Ubuntu-14.04 binaries cover all 1187 mnemonics; therefore, the number of missing mnemonics obtained from Ptest is the same as the number obtained with reference to the Intel manual.

By associating missing instructions with the packages that they belonged to, we found that ffmpeg was the package with the highest percentage of missing instructions. That is why we selected ffmpeg to add to Ptrain in the next round. Notice that with the addition of ffmpeg and the other packages in the subsequent rounds, the completeness number converges towards 100%.

An interesting observation about the improvements in the completeness number is the change in the breakdown of missing-operand-combination percentages. In the fourth experiment, the percentage of instructions with missing operand combinations went up from 0.19% to 0.23%. This is because, as new packages are added, new mnemonics are discovered; these new mnemonics reduce the number of missing mnemonics, but they might cover only a few operand combinations. So the percentage of missing mnemonics can only decrease when new packages are added, but the percentage of instructions with missing operand combinations can increase.

We continued this iterative package addition until adding new packages no longer improved the completeness number. At that point the completeness number was 99.49%, with 0.51% of the instructions still unlifted. Our analysis revealed that the instructions in the missing-mnemonics category were either hand-written or not supported by GCC. The instructions not generated by GCC are: instructions (such as nop and enter) that are generated by the assembler, some arithmetic instructions (aaa, aad, aam, and aas), some instructions that set/clear bits of EFLAGS (cld, cli, cmc, clac), and low-level instructions (such as cpuid, invpcid, which invalidates entries in the TLB, and rdtsc)9. In all, we manually modeled 91 pairs of assembly instructions and their RTLs: 49 pairs modeled the semantics of instructions with missing mnemonics, and 42 pairs modeled the semantics of instructions with missing operand combinations. The breakdown of these 91 pairs is shown in Figure 15c. After this manual modeling, we were able to translate 100% of the assembly instructions in Ptest.

That we had to manually model only 49 of the 1187 mnemonics underlines our intuition that compilers cover most of the target instructions, if not all.

Results for ARM. After performing the completeness experiment for x86, we repeated the same experiment for ARM, following the same package selection policy with a base Ptrain containing openssl and binutils. Ptest contained all ARM assembly instructions obtained by disassembling all binaries (7.3K in total) on a Debian-7.8.0 desktop installation. We used the gcc-4.7.3 cross-compiler for 32-bit ARM cortex-v7A [12] and arm-linux-gnueabi-objdump-v2.24 as the ARM disassembler on x86 Linux.

9Note that the list of assembly instructions that are not supported by gcc-4.6.4 is based on our analysis of GCC's source code and its x86 architecture specifications. To the best of our knowledge, there is no official GCC documentation to confirm our findings. The same applies to other results, such as those for ARM and LLVM presented next.

The results are shown in Figure 16. Most of the numbers can be interpreted in the same manner as those for x86. The instructions not supported by gcc-4.7.3 fall into: a few miscellaneous instructions (such as dbg and dmb), a few low-level instructions (such as isb, mcr2, mrs, and wfi), and a few other advanced instructions (such as vtbl, vtbx, and vstm). The total time, including the time to port the architecture-specific part of our implementation, was around 9.5 man-hours.

                    % of insns lifted    LISC (%)                    Missing mnemonics (absolute numbers)
Ptrain              ER       LISC        Missing     Missing         Basic  VFP  SIMD  Misc  Total (%)
                                         mnemonics   operands
openssl + binutils  34.58    88.21       6.86        4.94            673    48   57    36    814 (71.52%)
+ libfftw           44.60    92.77       6.01        2.22            436    37   54    26    553 (48.59%)
+ swig              51.43    95.67       3.72        0.61            321    32   47    22    422 (37.08%)
+ gcc               62.54    96.26       1.73        2.01            275    28   43    20    366 (32.16%)
+ libc              65.41    97.87       1.37        0.76            231    25   40    18    314 (27.59%)
+ gs                67.69    98.92       1.02        0.04            203    22   37    14    276 (24.25%)
+ slapd             69.54    98.95       0.85        0.20            183    19   31    13    246 (21.61%)
+ busybox           71.45    99.61       0.30        0.09            141    17   26    12    196 (17.22%)
+ libpoppler.so     71.90    99.66       0.30        0.04            85     13   19    9     126 (11.07%)
+ lib7z.so          72.10    99.78       0.21        0.01            41     7    9     19    76 (6.67%)
+ manual            72.23    100.00      0.00        0.00            0      0    0     0     0 (0%)

(a) Completeness results of all ARM binaries found in the Debian-7.8.0 distribution

Type              Number
Executables       4349
Shared libraries  1844
Object files      631
Kernel modules    452
Static archive    50
Total             7326

(b) Breakdown of all Debian-7.8.0 ARM binaries

Type                          Number
Missing mnemonics             76
Missing operand combinations  21
Total                         97

(c) Breakdown of manually-added rules

Type   Total mnemonics in ARM   Mnemonics not supported by gcc-4.7.3
basic  951                      41
VFP    66                       7
SIMD   75                       9
Misc   46                       19
Total  1138                     76

(d) Breakdown of ARM mnemonics and those not supported by gcc-4.7.3

Figure 16: Detailed completeness evaluation results for ARM

Results for AVR. We then selected the AVR [31] processor. We trained LISC using the same base packages (training data) as in the earlier experiments, compiled with the avr-gcc-v4.8.2 cross-compiler. We then tested the extracted model on a few of the coreutils binaries (ls, cp, cat, echo, and head) built for AVR, disassembled with avr-objdump-v2.23.1. Unfortunately, we could not cross-compile the complete coreutils package with GCC's cross-compiler. The extracted model had around 700 abstract pairs for the 4.3K concrete pairs from the compilation log, and it covered 72 of the total 76 mnemonics. Our system was able to translate all of the assembly instructions from the input binaries. Our manual modeling effort was restricted to 4 missing mnemonics (break, nop, wdr, and sleep, which we found are not supported by GCC) and 3 operand combinations. The total time taken, including the time to port the architecture-specific part of our implementation, was barely 3 man-hours.

3.6.2 Soundness

Soundness of the A → I translation is one of the fundamental properties on which applications built using an assembly-to-IR translator rely, so it is necessary to ensure it. Given the number of generalizations performed by LISC, an interesting question is how many translations performed via generalization are incorrect; intuitively, generalization increases the chances of producing incorrect translations. We employ several ways to ensure soundness, ranging from simple and straightforward checks to ones with more formal guarantees.

Cross-testing and self-testing. LISC generalizes A → I translations from the ⟨I, A⟩ pairs found in the code generator logs. A simple soundness test is thus to check whether the A′ → I′ translation holds for some ⟨I′, A′⟩ pair. This is where cross-testing and self-testing come into the picture.

If the model built using Ptrain is denoted by M, then these testing methods use the following definition of soundness:

∀⟨I, A⟩ ∈ Ptest : M(A) = I

In other words, we translate assembly A to a new IR I′ using the learned model M (this is represented by M(A)), and then check whether the new IR I′ is the same as the original IR I. Conceptually, the above definition checks the following: given that a code generator has mapped IR I into assembly A (i.e., I → A holds), does A → I hold as per model M?

Cross-testing is one of the most basic methods of testing a learned model (hypothesis) in machine learning. It uses separate Ptrain and Ptest sets, and the model learned from Ptrain is tested on Ptest. Self-testing, on the other hand, uses Ptrain as Ptest. A sketch of the check follows.
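An OCaml sketch of the self-/cross-testing check under the definition above, assuming IR and assembly instructions are compared as strings; translate is a hypothetical stand-in for running the transducer M.

(* For every <IR, assembly> pair in Ptest, the model's translation of the
   assembly must equal the IR the compiler originally produced; return the
   fraction of pairs that fail this check. *)
let testing_error (translate : string -> string option)
                  (ptest : (string * string) list) : float =
  let failures =
    List.fold_left
      (fun n (ir, asm) ->
         match translate asm with
         | Some ir' when ir' = ir -> n
         | _ -> n + 1)
      0 ptest
  in
  float_of_int failures /. float_of_int (List.length ptest)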

Semantic equivalence. Self-testing and cross-testing are good techniques for evaluating the effect of generalization and for ensuring that the model learned by LISC does not over-fit or under-fit the training data. Unfortunately, these techniques suffer from an important limitation: if a single assembly instruction A maps to multiple IR instructions I1 and I2, then the model can map A to only one of them, and such a translation may not satisfy the above definition of soundness.

To address this limitation, we need to modify our definition of soundness. Compilers are supposed to perform semantics-preserving IR-to-assembly translations; if we ensure that the assembly-to-IR translations are also semantics-preserving, then both the A → I1 and A → I2 translations are sound. Thus, a formal way to ensure the soundness of assembly-to-IR translation is to ensure that every assembly instruction and its translated IR are semantically equivalent. If the semantics of an instruction is denoted by the function Sem, then we can define soundness with semantic equivalence as:

∀⟨I, A⟩ ∈ Ptest : Sem(M(A)) = Sem(I)

We define the Sem function and how we check semantic equivalence in the next section, but to make this section easier to follow, we give an intuitive explanation here. Intuitively, IR I is considered semantically equivalent to assembly instruction A if the semantics represented by I equals the semantics represented by A. The semantics of an instruction can roughly be thought of as the effect of executing that instruction on the CPU state. Since the above definition demands equality of semantics, it is the strongest definition of semantic equivalence: if it holds for some A → I translation, then that translation is sound. We call this the equivalence definition of soundness.

Unfortunately, compilers do not model the semantics of assembly instructions precisely in their intermediate representations, so the above definition of semantic equivalence often fails. To handle such situations, we relax the notion of semantic equivalence as follows.

IR I is considered semantically equivalent to assembly instruction A if the semantics represented by I is either equal to, or an approximation of, the semantics represented by A. An A → I translation is sound if this definition of semantic equivalence applies to A and I. We call this the abstraction definition of soundness. Compared to the equivalence definition, the abstraction definition simply loses some precision in terms of semantics during translation; otherwise, both definitions guarantee soundness.

Some applications built using an assembly-to-IR translator can still work with the abstraction definition of soundness10 (e.g., static analysis). Other applications, however, demand the equivalence definition (e.g., IR instrumentation). To satisfy these demands, we first check whether the equivalence definition holds for an A → I translation, and only if it does not, check for the abstraction definition. It is perfectly fine to provide an A → I translation that satisfies the equivalence definition to applications that can work with the abstraction definition, but the opposite can lead to problems. For instance, if an IR instrumentation system uses parts of the IR semantics which are an over-approximation of the semantics of an assembly instruction, the results are undefined. In such cases, it is necessary that such over-approximate parts of the IR semantics are not used by these applications.

Note that semantic equivalence gives stronger soundness guarantees than any of the testing techniques mentioned earlier. The testing techniques rely on the assumption that the compiler's I → A translation preserves semantics. Semantic equivalence checking, on the other hand, makes no such assumption; instead, it checks whether the compiler preserves semantics during translation. As a side note, because of this property, semantic equivalence checking can be used to test code generators; we demonstrate this in the next section.

Handling one-to-many assembly-to-IR translation. The techniques mentioned above work in the context of one-to-one or many-to-one assembly-to-IR translation. One-to-many assembly-to-IR translations, however, present an additional challenge for LISC: deciding which IR instruction to choose as the translation. Intuitively, when multiple IRs map to a single assembly instruction, we want to choose an IR that is both sound and the most precise among those IRs.

10These applications are generally conservative in deriving results, which is how they can cope with the abstraction definition of soundness.

When a single assembly instruction maps to multiple IR instructions, the following cases are possible:

• All IRs are semantically equivalent to the assembly instruction as per the equivalence definition of soundness, or some IRs are semantically equivalent as per the equivalence definition while others are semantically equivalent as per the abstraction definition. In this case, we can choose any of the IRs that satisfy the equivalence definition and still guarantee soundness of the A → I translation. We could also choose an IR that satisfies only the abstraction definition, but because we want an IR that is also more precise, we select one that satisfies the equivalence definition.

• All IRs are semantically equivalent to the assembly instruction as per the abstraction definition of soundness. In this case, instead of selecting one randomly, we choose the IR that is the most precise among the possible IRs. Specifically, if assembly A maps to multiple IRs I1, ..., In, represented as A → I1 & A → I2 & ... & A → In, and all IRs are semantically equivalent, then we choose I such that s(I) = s(I1) ∩ s(I2) ∩ ... ∩ s(In), where s is the semantics function that we define in the next section.

• Some IRs do not satisfy the semantic-equivalence test. Our abstraction definition of soundness is general enough that this case does not arise in practice. But when it does, it indicates that the compiler did not preserve semantics during translation, and thus it can indicate the presence of a bug in the compiler.

In our completeness experiment for x86, we found that of the total 1187 mnemonics in x86, around 24% map to multiple possible RTLs; for the remaining 76%, self- and cross-testing suffice as a soundness check. The percentage of mnemonics with semantically equivalent (as per the equivalence definition) RTLs was 22%. For the remaining 2%, we cannot simply choose an RTL at random; instead, we have to choose an IR by defining the ∩ operation on two IRs. One of the most common cases where we use the ∩ operation is when one RTL says (clobber EFLAGS) and another precisely says which bits of EFLAGS are modified; the ∩ operation in such cases selects the RTL that says which bits of EFLAGS are modified. After defining this operation for these 2% of the mnemonics, 1.8% of the mnemonics passed the semantic-equivalence check (under the abstraction definition). The remaining 0.2% (2 mnemonics in absolute numbers) failed even under the abstraction definition: our analysis revealed that the IR for such assembly instructions captures semantics that are not modeled explicitly in the assembly.

Calls and indirect jumps that invoke a variable-argument function account for these two mnemonics. In such cases, the assembly instruction looks like any other call instruction, e.g., call func, but the RTL contains an explicit reference to the number of parameters passed in the call, for example, (call (symbol_ref func) (const_int 10)). The ∩ of multiple distinct (const_int)s is nil, which is why the soundness check fails for such pairs.

Similarly for ARM, of around 1200 assembly mnemonics, 180 had multiple possible RTLs. Of these 180, 171 (95%) had semantically equivalent RTLs; 5% of the mnemonics mapped to RTLs that were semantically inequivalent (as per the equivalence definition). After using the ∩ operation on multiple RTLs, we were able to check the equivalence of 7 mnemonics. The remaining 2 mnemonics (bl and blx) follow the same story as call and jump on x86.

Although we cannot check semantic equivalence in the case of an indirect call, applications utilizing the translated IRs can mitigate the concern by relying on analysis to uncover the operands of the call. Recovering the number of operands of a call from a binary is a well-studied research problem [26, 36, 37]. A sketch of the ∩ operation on EFLAGS effects follows.
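The following is a minimal OCaml sketch, under simplified hypothetical types, of how the ∩ operation can select the most precise description of an EFLAGS effect; LISC's actual representation of RTL semantics is richer than this.

(* An effect on EFLAGS is either an imprecise Clobber or a precise set of
   bits; the meet keeps the more precise description when one abstracts the
   other, and is undefined (None, i.e., "nil") when the two conflict. *)
type flags_effect =
  | Clobber                    (* (clobber EFLAGS): some bits change *)
  | Bits of string list        (* precisely these flag bits change  *)

let meet (a : flags_effect) (b : flags_effect) : flags_effect option =
  match a, b with
  | Clobber, x | x, Clobber -> Some x     (* Clobber abstracts anything *)
  | Bits xs, Bits ys -> if xs = ys then Some (Bits xs) else None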

3.6.3 Compiler independence

To find out how many of the instructions produced by compilers other than GCC can be lifted by LISC, we used the LLVM compiler (clang-3.3) to build coreutils-2.23 and SPEC 2000 binaries. We lifted them to IR using the GCC x86 model that achieved 100% completeness on Ubuntu-14.04 binaries. The results of this evaluation are summarized in Figure 17. Our system was able to lift an average of around 96.03% of the instructions. The 3.97% of instructions that we could not lift used 7 mnemonics (such as nopw) and 12 operand combinations (such as nopl 0(%eax) produced by LLVM vs. only nopl produced by GCC) that were not covered in the training data11. This finding is also reflected in the exact recall result, which is around 57% for LLVM-produced binaries versus around 74% for GCC-produced binaries.

                                % of insns lifted
Binary    # of insns    ER           LISC
chown     9939          59.01        97.79
chgrp     9630          59.45        95.17
chmod     9018          58.94        95.73
cp        18043         57.41        97.16
cat       6676          57.94        97.35
cut       4986          55.45        95.76
dir       20489         57.54        97.65
echo      3800          56.10        95.16
head      6919          63.94        96.76
ln        8121          58.35        95.46
ls        20489         57.37        97.74
mkdir     8021          55.65        97.45
mv        18375         57.87        97.08
pwd       4136          58.44        95.16
rm        9348          59.76        95.42
sort      18510         59.54        94.56
tail      11868         58.55        95.04
uname     3733          56.77        94.34
ammp      26444         51.36        94.01
art       3262          47.76        92.96
bzip2     7832          56.15        95.65
crafty    36028         57.76        97.88
equake    4233          59.53        96.54
gap       100598        57.47        96.88
gzip      7916          59.65        96.51
mcf       2334          48.53        96.27
mesa      99805         51.47        95.67
parser    25260         56.56        95.53
vpr       25310         59.78        96.06
Total     521184        57.03 (avg)  96.03 (avg)

Figure 17: Completeness results of LLVM compiled binaries for x86

3.6.4 Sizes of the models and performance of LISC

Figure 18 captures the effect of adding different packages to Ptrain on the number of concrete and parameterized pairs, the model size, and the number of mnemonics. We used five different metrics for comparison: the number of concrete pairs in the log files, the number of parameterized pairs derived by LISC from the concrete pairs, the number of mnemonics covered by the derived models, the size of the transducer in terms of the number of edges, and the time taken by LISC to build the transducer.

11We found that LLVM-generated x86 binaries use multi-byte nop instructions, while GCC-generated binaries contain a sequence of single-byte nops to represent a multi-byte nop.

[Two charts omitted in this text rendering ("x86 Model Details" and "ARM Model Details"). For each package added to Ptrain (x86: openssl+binutils, ffmpeg, glibc, ffmpeg, gstreamer, qt, linux kernel, manual; ARM: openssl+binutils, libfftw, swig, gcc, libc, gs, slapd, busybox, libpoppler.so, lib7z.so, manual), they plot on a log scale: the number of concrete pairs (in K), the number of parameterized pairs (in K), the number of mnemonics, the number of edges in the transducer (in K), and the time to build the transducer (in sec).]

Figure 18: Extracted model details

Interesting things to note about the figure: LISC obtains a 3X–10X reduction in the number of pairs by merging pairs together, and the time taken by LISC is linear in the number of parameterized pairs. The increase in the number of concrete pairs indicates that many new operand combinations were covered; these combinations also increased the number of parameterized pairs as the logs obtained from new packages were added to the existing list.

3.7 Related Work

Although no prior approach extracts instruction-set semantics from code generator logs automatically, LISC relates to a number of well-known topics. Because it is a learning-based approach that tries to extract the mapping function between IR and assembly, it relates to machine learning [62, 64] in general. The transducer constructed by LISC is very close to the concept of decision trees [62] from data mining [76]. In the natural language processing (NLP) domain, a transducer that translates input to output is called a Finite State Transducer (FST) [63]. FSTs are very popular in NLP, and LISC's objective of extracting the mapping function can be thought of as an attempt to build an FST automatically. We discuss each of these related areas in detail below.

3.7.1 Relation to machine learning

Learning hypothesis. Although we say that LISC learns the translation of RTL instructions into corresponding assembly instructions, in terms of the notion of learning used in the standard machine learning (ML) domain, we are not learning the translation. LISC employs learning only when it needs to understand how the leaves (operand values) in an RTL and the corresponding assembly instruction map to each other. This mapping is what is called a hypothesis in the ML domain. Because the operand values are continuous, our learning problem can be thought of as a regression problem: given input operand i and output operand o, we want to find f such that f(i) = o. Relying on the observation that operand values between RTL and assembly mostly undergo a linear relation, the problem of finding f can be classified as a linear regression problem, and one could employ the gradient descent algorithm to find f. In contrast with these standard ML practices, LISC does not even employ gradient descent.

Instead, relying on the observation that operand values in RTL usually follow the identity function (i.e., f(x) = x), we simply enumerated five different mapping functions (one identity function and four arithmetic functions for addition, subtraction, multiplication, and division), and LISC searches for the hypothesis function among these five. In other words, the problem addressed by LISC is such that it does not really need standard ML machinery. A sketch of this search follows.
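A minimal OCaml sketch of the hypothesis search over integer operands; it is our simplification, not LISC's code, and it recovers the constant for each arithmetic candidate from the first observed pair.

(* Candidate mapping functions, instantiated from the first pair (i0, o0). *)
let candidates (i0 : int) (o0 : int) : (string * (int -> int)) list =
  [ "identity", (fun x -> x);
    "add",      (let c = o0 - i0 in fun x -> x + c);
    "sub",      (let c = i0 - o0 in fun x -> x - c);
    "mul",      (if i0 <> 0 && o0 mod i0 = 0
                 then (let c = o0 / i0 in fun x -> x * c)
                 else (fun x -> x));
    "div",      (if o0 <> 0 && i0 mod o0 = 0
                 then (let c = i0 / o0 in fun x -> x / c)
                 else (fun x -> x)) ]

(* Keep the first candidate (identity is preferred) that fits every
   observed (rtl_operand, asm_operand) pair. *)
let find_hypothesis (pairs : (int * int) list) : string option =
  match pairs with
  | [] -> None
  | (i0, o0) :: _ ->
    let fits (_, f) = List.for_all (fun (i, o) -> f i = o) pairs in
    (match List.filter fits (candidates i0 o0) with
     | (name, _) :: _ -> Some name
     | [] -> None)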

Evaluating hypothesis. Although we do not use standard machine learning algorithms for solving our problem, we do employ standard ML evaluation criteria such as self-testing and cross-testing. A standard practice in the ML domain for evaluating the quality of a learned hypothesis is to split the available data into three sets: training, test, and cross-validation. The training set is used to learn the hypothesis; the test set is used to evaluate the hypothesis; and the cross-validation set is used to learn parameters in the model selection process (such as the degree of the hypothesis polynomial, the regularization parameter, etc.). Because model selection is not a problem for LISC, we do not use a separate validation set in our evaluation.

One of the simplest ways to evaluate a hypothesis function is self-testing, in which the training set (used to derive the hypothesis) and the testing set (used to test the hypothesis) are the same. Self-testing helps one evaluate how well the hypothesis "fits" the training data. If a hypothesis fits the training data poorly (called under-fitting), one might need to change the model selection parameters to derive a new hypothesis that fits better. Self-testing is thus a good way to detect under-fitting. On the other hand, self-testing cannot detect over-fitting, which occurs when the hypothesis fits the training data so strictly that it does not fit "unseen" data (data not in the training set). For instance, if a 2nd-degree polynomial is enough to fit the training data, but the hypothesis is a 4th-degree polynomial that fits the training data well, over-fitting might occur. The standard machine learning practice for detecting over-fitting is to evaluate the hypothesis against unseen data; this is called cross-testing. Specifically, the hypothesis function derived from a training set is tested against a test set that is separate from the training set. This is also called generalization in the ML domain.

Self-testing helps us evaluate the training error, while cross-testing helps us evaluate the testing (or generalization) error. In ML problems, it is very common for one or both of these errors to be high (when both are high, it is called high bias; when the training error is low but the testing error is high, it is called high variance).

In the context of our problem, note that we require both training and testing error to be zero! This is because the extracted semantic models will be employed in applications such as binary analysis, instrumentation, and translation, where the assembly-to-IR translator needs to guarantee correct translations. In other words, in terms of errors in learning, our problem is different from the one faced by standard ML problems. Fortunately, the hypothesis function learned by LISC is simple enough that errors do not creep in, as demonstrated by our evaluation of LISC.

Availability of learning data. It is not unusual for standard ML problems to have to work with limited training data, which can lead to the training/testing errors discussed above. Fortunately, LISC does not face a limited-data problem: we can easily find new packages to compile and generate new data sets. The question of how many packages to add to the training data is addressed by evaluating against test data; in our evaluation, we discussed how many packages had to be added to the training data to lift all binaries on the Ubuntu-14.04 distribution. Note that, as with most learning systems, we may encounter new test data for which we cannot translate 100% of the instructions. In such cases, we can simply add more packages to the training data; no fixed training set is sufficient to handle arbitrary test data. In most machine learning problems, adding new training data helps improve the accuracy of the learning outcome, and this is also true for LISC.

3.7.2 Decision tree learning

Decision tree learning [62, 2] is a method used in data mining and machine learning for inferring high-level rules from gathered (training) data. The gathered data needs to be in the form of a collection of records, where the individual entries in a record are attribute-value pairs. For instance, the data below

[Outlook = Sunny, Temperature = High, Playtennis = Yes]
[Outlook = Sunny, Temperature = Low, Playtennis = No]
[Outlook = Overcast, Temperature = High, Playtennis = No]
[Outlook = Overcast, Temperature = Low, Playtennis = No]

is a set of records about weather conditions and the decision to play tennis. Here Outlook is an attribute which takes values from the set {Sunny, Overcast}. Similarly,

Temperature and Playtennis are other attributes. If we treat Playtennis as an output attribute (that we want to infer from the input attributes Outlook and Temperature), then the data can be seen as the following rule:

(Outlook = Overcast → Playtennis = No) ∧
(Outlook = Sunny ∧ Temperature = Low → Playtennis = No) ∧
(Outlook = Sunny ∧ Temperature = High → Playtennis = Yes)

A decision tree is a tree that semantically represents rules that infer the output attribute from the values of the input attributes. Interior nodes in such a tree are input attributes from the data, and the edges represent the possible values of these attributes. The leaves of a decision tree represent the value of the output attribute that can be inferred from the conditions met along the path from the root to that leaf. A decision tree for the above data set is shown in Figure 19. Notice that every clause in the above rule is represented by a single path from the root to a leaf in the tree, and the complete tree represents the complete rule.

Outlook
├── Sunny ────→ Temperature
│                 ├── High ──→ Yes
│                 └── Low ───→ No
└── Overcast ──→ No

Figure 19: Decision tree for decision to play tennis based on weather data

More generally, rules in decision trees take several attributes (features, in machine learning) of the data as input and classify the value of the output attribute based on the input attributes. Decision trees can thus also be thought of as a way to approach the classification problem in machine learning (decision tree learning can be viewed as a method for approximating a discrete-valued function, in which the hypothesis is represented as a decision tree). Generally, decision tree learning problems work with data sets where both input and output attributes have discrete values.

Nonetheless, extensions have been proposed whereby decision trees can be learned from continuous-valued input attributes as well. Inferring such rules from data can be useful in understanding the data. Additionally, it allows us to predict whether one is likely to play tennis under weather conditions not in the data set. The overall idea of learning rules from a data set is also called association rule learning.

This section is not intended to be a comprehensive discussion of decision trees (for a detailed discussion, see [62]). It is meant instead to compare decision tree construction algorithms and their properties with the algorithm used by LISC. At a high level, LISC also builds a decision tree, one that maps input RTL instructions and their operands to output assembly instructions and their operands. Attributes in such a tree are the position variables in the term for an input RTL instruction. Nonetheless, there are a number of differences between LISC decision trees and the decision trees discussed in the literature, which we discuss below.

Decision tree construction. One of the earliest algorithms for decision tree construction is the ID3 algorithm [73] proposed by Quinlan12. The decision tree is constructed by repeatedly selecting an input attribute to branch on and partitioning the data based on the different values of the selected attribute. The question ID3 must answer is: which attribute should be selected for partitioning? ID3 defines a heuristic called information gain, and the attribute with the highest information gain is selected. Conceptually, the information gain of an attribute specifies the reduction in the entropy of the data set after partitioning it based on the values of that attribute (the notion of entropy, proposed by Shannon, is taken from information theory). The rationale behind using information gain was to prefer shorter trees over longer ones: shorter trees reduce the matching time when test data is matched against the decision tree, and are thus more efficient. Notice that in the tree from Figure 19, we did not branch on Temperature when Outlook is Overcast: once we know that Outlook is Overcast, the entropy of the data set is 0, because, as per the training data, the value of the output attribute is No in all such cases. A sketch of the information-gain computation follows.
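An OCaml sketch of textbook ID3 information gain, not LISC code; a record is assumed to be an association list of (attribute, value) string pairs, and "target" names the output attribute.

let log2 x = log x /. log 2.0

(* H(S): entropy of the target attribute over the records. *)
let entropy target records =
  let n = float_of_int (List.length records) in
  let groups = Hashtbl.create 7 in
  List.iter
    (fun r ->
       let v = List.assoc target r in
       Hashtbl.replace groups v
         (1 + (try Hashtbl.find groups v with Not_found -> 0)))
    records;
  Hashtbl.fold
    (fun _ c acc ->
       let p = float_of_int c /. n in acc -. p *. log2 p)
    groups 0.0

(* Gain(A) = H(S) - sum over values v of A of |S_v|/|S| * H(S_v). *)
let information_gain target attr records =
  let n = float_of_int (List.length records) in
  let values =
    List.sort_uniq compare (List.map (fun r -> List.assoc attr r) records) in
  let remainder =
    List.fold_left
      (fun acc v ->
         let sv = List.filter (fun r -> List.assoc attr r = v) records in
         acc +. float_of_int (List.length sv) /. n *. entropy target sv)
      0.0 values
  in
  entropy target records -. remainder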

12Quinlan later proposed C4.5 [74] as an extension of ID3.

At a high level, ID3 employs a greedy strategy (based on the information gain heuristic) to build shorter decision trees rather than longer ones. Unfortunately, this means it makes locally-optimal decisions rather than globally-optimal ones; consequently, trees built by ID3 can be sub-optimal.

The LISC decision tree construction algorithm is more general than the ID3 algorithm. First, ID3 operates on data sets where attributes are discrete-valued (an extension was later proposed to fit continuous-valued attributes into ID3 by converting them into discrete attributes [74]). Our algorithm, on the other hand, addresses a problem more general than classification: LISC attempts to learn a function that, given an input, produces the corresponding output. The question of test selection overlaps with decision trees, but other aspects, such as parameters and the transformation functions on those parameters, have no analog there. Second, to the best of our knowledge, no decision tree construction algorithm operates on structured input. Lastly, unlike decision trees, LISC constructs a finite state automaton by sharing nodes of the decision tree, which reduces the size of the automaton compared to the equivalent decision tree. A sketch of the first-order-term view on which LISC operates follows.
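A minimal OCaml sketch of the structured (first-order-term) input just mentioned; the constructor names are illustrative, not LISC's. An "attribute" here is a position, i.e., a path of child indices into such a term.

type term =
  | Sym of string              (* fixed symbol, e.g., set, plus, reg *)
  | Param of string            (* a parameter position, e.g., (Eq X) *)
  | App of string * term list  (* an operator applied to subterms *)

type position = int list       (* path from the root, e.g., [0; 1] *)

(* Fetch the subterm at a given position, if it exists. *)
let rec subterm (t : term) (p : position) : term option =
  match p, t with
  | [], _ -> Some t
  | i :: rest, App (_, args) ->
    (try subterm (List.nth args i) rest with _ -> None)
  | _ -> None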

Branching criteria: information gain vs. discriminating tests. ID3 uses information gain to decide which attribute to branch on. LISC, on the other hand, uses discriminating tests that preserve the structural integrity of the output. The difference in branching policies arises from the difference in the objectives and contexts of the two algorithms. LISC is used in a context where only certain types of parameter values can appear in the output terms; this is what we call structural integrity. To ensure structural integrity, LISC branches based on the values of non-parameter positions. Once all non-parameter positions are explored, it then branches based on parameter positions having the same types of values.

LISC cannot use ID3's information gain policy as a branching factor, for multiple reasons. First, none of the existing decision tree construction algorithms, such as ID3 and C4.5, operate on data sets where values are terms or variables; LISC operates in exactly such a context. Second, the syntactic integrity check can be seen as dictating the selection criteria for branching; if ID3's choices conflict with those required for the syntactic integrity check, the result could be ill-formed terms. Third, it has been shown that the information gain heuristic does not work well when building decision trees for terms [89, 81]: previous work that used traditional decision trees experienced a massive increase in size in many cases when operating on terms [81].

Over-fitting and inductive bias. Over-fitting, a well-known problem in machine learning, arises when the learned hypothesis does not generalize well to future test data because it fits the training data more closely than it should. Inductive bias in machine learning is the set of assumptions which, along with the training data, justify the policy used by the learner to classify test data. In other words, inductive bias specifies how a particular algorithm generalizes so that it is applicable to future data (i.e., how it avoids over-fitting). In all decision tree building algorithms, some form of inductive bias is required to avoid over-fitting. ID3's inductive bias can be stated as: shorter trees are preferred over longer trees13, and trees that place high-information-gain attributes closer to the root are preferred over those that do not [62]. This policy is illustrated in the tree we discussed earlier: the tree outputs Playtennis = No immediately when Outlook = Overcast and does not explore any further branching under Outlook = Overcast. By avoiding such exploration, ID3 makes the decision tree generalize to future data; for instance, the tree from the example generalizes to Temperature = Normal when Outlook = Overcast.

LISC's inductive bias, on the other hand, can be stated as: trees that preserve the syntactic integrity of the output are preferred over those that do not, and among those, trees that place non-parameter attributes (non-parameter positions) closer to the root are preferred. There is a clear difference between the policies of LISC and ID3, stemming from the fact that LISC addresses the problem of translating an input instruction into an output instruction (by preserving syntactic integrity); building shorter trees is thus not the branching objective for LISC. Because LISC needs to preserve syntactic integrity, it cannot avoid over-fitting when branching on non-parameter values. But when it outputs the translation, it avoids over-fitting by using interval constraints (or variables) instead of set constraints. For instance, suppose the training data contains a rule that applies for an immediate in {1, 3}. LISC will generalize this rule to the interval [1, 3], thereby making it applicable to the immediate 2 as well. LISC can also simply consider this rule applicable to any immediate value in [−∞, ∞], relying on the observation that, usually, the mapping function (hypothesis) for immediate values is the identity function.

13Interestingly, the soundness of ID3's preference for shorter trees over longer ones is an unsolved debate; it goes back to 1320, when William of Occam first discussed the question, and it is known as Occam's razor in machine learning.

The effectiveness of LISC's policy with respect to over-fitting has been demonstrated experimentally. A sketch of the interval generalization just described follows.
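A minimal OCaml sketch of generalizing the immediate values seen in training into an interval constraint; the types are our simplification, with None standing for an unbounded (±∞) endpoint.

type interval = { lo : int option; hi : int option }

(* Generalize a set of seen immediates, e.g., [1; 3], to the interval
   [1, 3]; an empty observation set generalizes to [-inf, +inf]. *)
let generalize (seen : int list) : interval =
  match seen with
  | [] -> { lo = None; hi = None }
  | x :: xs ->
    let lo = List.fold_left min x xs and hi = List.fold_left max x xs in
    { lo = Some lo; hi = Some hi }

(* Does the interval admit a given immediate, e.g., 2 in [1, 3]? *)
let admits (iv : interval) (v : int) : bool =
  (match iv.lo with None -> true | Some l -> l <= v) &&
  (match iv.hi with None -> true | Some h -> v <= h)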

3.7.3 Finite state transducers and grammatical inference

Finite state transducers (FSTs) [63] are one of the fundamental concepts in computer science. Conceptually, an FST is an extension of an FSA (finite-state automaton) that produces output strings in addition to accepting input. Formally, a finite state transducer is a finite state machine with two tapes, an input tape and an output tape, in contrast with an FSA, which has a single tape. FSTs have been used in a number of areas such as natural language processing, speech processing, image processing, and machine translation. The initial definition of an FST operates on strings, but extensions were later proposed that accept and produce trees. Transducers that accept trees are called tree transducers [77, 86, 30]. Tree transducers are more general than FSTs because they allow input and output to be structured; consequently, they have found a number of important applications that need to process structured data (e.g., syntax-directed translation, tree rewriting, etc.).

68 restrict which parts of an input are used for deciding the branching conditions in case of discriminating tests. While there is a notion of subsequential transducers to eliminate this restriction, we believe that LISC can be thought of as a generalized subsequential trans- ducer which considers multiple criteria for branching. Some of the techniques used in solving grammatical inference problem [51, 78] have interesting correlations with the techniques developed in LISC. Grammatical inference problem is the problem of learning a context-free grammar for the unknown language from the finite set of positive (and possibly negative) examples in the language. A typical ap- proach used in solving this problem is to first build a prefix-tree acceptor which exactly accepts a set of given positive examples, and then to successively merge the states in the tree to produce an automata that accepts more general language than the one accepted by the tree. The problem of learning the grammar is often defined in terms of the problem of finding which states to merge. One of the common approaches used in state merging is to rely on negative examples if they are available. Specifically, if the merging of some states leads to acceptance of any negative example by the automata then such a merging is rejected. While the overall idea of building a prefix-tree and then merging its states can be thought of as similar to employed by LISC, the approach used by LISC is more gen- eral. Specifically, LISC builds a tree transducer which is much more general than the tree automata required in the grammatical inference problem. Moreover, prefix-tree acceptor naturally restricts the order of inspected symbols from the input to the sequential order. LISC, on the other hand, does not restrict itself to the sequential order but rather uses non- sequential order. The non-sequential order allows LISC to produce smaller automata (and to emit output much faster) than those produced by using sequential order.

3.7.4 Learning using assemblers and compilers

Although research work [28, 44] do not extract instruction-set semantic models using code generators, their overall approaches have many interesting correlations with LISC. Derive [44] is an approach to learn machine encodings of assembly instructions by us- ing an assembler as a black-box. Specifically, instead of writing the assembly to machine instruction encodings manually, Derive relies on the observation that such encodings are already specified in the system assembler, and that one can learn them from the assembler by feeding different operand permutations of assembly instructions and analyzing the gen- erated machine instruction bytes. In comparison to LISC, Derive can be thought of as an approach to extract assembly-to-machine instruction encoding models which can be later

69 used in other tools like disassemblers, debuggers, binary rewriters, etc. Instead of obtaining the training date as in LISC, Derive demands the user to supply it with a list of assembly instruction names and operand types. Using such a description, it then generates all pos- sible permutations of register names, few combinations of immediate and other types of operands. Training data used by LISC can thus be thought of as similar to the permuta- tions produced by Derive. In order to learn an encoding of a particular operand, Derive relies on a simple idea of keeping all other operands constant and generating all (or few in case of immediates) permutations of the operand of interest. In terms of learning the map- ping functions between input and output operands, it follows an approach similar to LISC: the encoding of a particular operand does not depend on other operands, and they mostly follow an identity function (with immediates undergoing few additional simple transfor- mations such as multiply or divide by a constant). Given that Derive attempts to extract the model of an assembler, it needs to deal with issues such as operand size, endianness, etc, which LISC does not need to deal with. Although, Derive has some interesting correla- tions, it has one important difference with LISC. Specifically, LISC builds a tree transducer to represent the extracted model, while Derive simply generates a list of C code snippets which implement the encoding for different assembly instructions. While Derive learns the assembly-to-machine instruction encoding models, work by Collberg [28] attempts to extract target architecture specifications used by the code gen- erators automatically. Collberg’s hypothesis is that by using a native C code compiler to compile snippets of C code and by observing the assembly and machine code outputs, one can learn enough of architecture details to construct architecture specifications for the code generators. Such an approach can avoid manual efforts needed in modeling architecture specifications for new architectures and thus enable easy porting of compiler backends. In order to extract the instruction semantics from a set of assembly instructions, Collberg’s high-level idea is to use “reverse interpretation”. In particular, “reverse interpretation” tries to synthesize instruction semantics by using a set of basic primitives such as operations for arithmetic, logical, shift, load and store. By using very simple basic C programs (such as for a = b+c), reverse interpretation can synthesize how to generate assembly code for any other addition operation automatically. The set of primitives used to extract the instruction semantics can be thought of as the mapping function in LISC. Since Collberg needs to ex- tract the instruction semantics, primitives for load and store needs to be provided. Besides these correlations, most other details are different from LISC.

3.8 Summary

In this section, we described a learning-based approach (named LISC) to extract instruction-set semantic models by treating a compiler as a black box. Being a learning-based approach which generalizes from the available training data, LISC needs to evaluate the correctness of the extracted models. We discussed how LISC makes conservative generalizations to reduce the chances of an incorrect learning outcome, and we evaluated the accuracy of the extracted models. Finally, we demonstrated that LISC can learn the semantic models of new architectures (such as AVR) in a very short amount of time (3 hours), and can considerably reduce the amount of manual effort needed to support new architectures. Nonetheless, being a learning-based approach, LISC theoretically suffers from limitations in the applicability of the extracted models to unseen test data. To address this limitation, and to theoretically ensure the completeness of the model, in the next section we look at a complementary technique for extracting semantic models.

4 EISSEC: EXTRACTING INSTRUCTION SEMANTICS BY SYMBOLIC EXECUTION OF CODE GENERATORS

The key limitation of the LISC approach is that, as a learning-based approach, it relies on the completeness of the training data to achieve completeness of the extracted semantic models. The models produced by LISC could suffer from three limitations:

1. Not all target instructions are covered in the extracted models. This case may arise because the training data used to extract the model does not contain some target instructions. For instance, a compiler may never generate some assembly instruction A simply because there is a more efficient alternative target instruction that is semantically equivalent to A. It is also possible that A is actually generated by the compiler, but our test inputs did not exercise the part of the compiler that generates A.

2. Not all possible operand combinations of the target instructions are covered. This case may arise because of choices that a compiler makes. For instance, even though instruction A allows an immediate value as an operand, a compiler may decide to store such an immediate value in a register and generate instruction A with that register as the operand. Consequently, binaries produced by such a compiler will never contain A with an immediate operand.

3. Not all possible operand values are covered. This case arises because some operands have too many possible values, making it almost infeasible to cover them all. For instance, it is infeasible to produce the 2³² possible instances (on a 32-bit processor) of an assembly instruction containing an immediate value, and the number of possible instances increases drastically when an instruction contains multiple immediate values.

In terms of implications, the first two limitations lead to incomplete semantic models. Although the LISC completeness evaluation demonstrated that its extracted models can handle the assembly instructions from all the binaries on desktop OSes, the completeness number of a learning-based approach is always relative to a test data set. In other words, it is possible that the binaries present on typical desktop OSes do not cover all of the target architecture's instructions and their possible operand combinations.

The last limitation, on the other hand, can compromise the correctness of the extracted models. Specifically, incomplete training data requires LISC to generalize the extracted models to support mapping pairs that are missing from the data. Such generalizations may introduce correctness issues in the extracted models.

A better approach to evaluate the correctness and completeness of LISC-extracted models is to generate a test data set that contains all possible target instructions and their operand combinations. This is the goal of our approach called EISSEC (which stands for Extracting Instruction Semantics by Symbolic Execution of Code generators, and is pronounced as Isaac). Put simply, EISSEC is a technique to find all target instructions and their operand combinations that are supported by a code generator. In addition, it also precisely yet generically captures how the various operand values of instructions are treated by the target architectures. By generically, we mean that EISSEC does not run into the problem of generating 2^32 instances of an immediate operand; instead, it generalizes the effect of all 2^32 instances. It thus helps in addressing the last limitation of LISC. In terms of the first two limitations, EISSEC extracts all instructions supported by a code generator and, as we saw with LISC, it is possible that a code generator does not support some target instructions. Thus, the list of assembly instructions and their operands obtained by EISSEC may still be incomplete. The key advantage of EISSEC is that it can obtain all assembly instructions and operand combinations which are supported by a code generator but were not present in the training and test data of LISC because the parts of the code generator which produce them were not covered by LISC. In other words, EISSEC enables us to achieve complete coverage of a code generator.

Because our goal is to cover all paths inside a code generator and produce its input (IR) to output (assembly) mapping, the problem that we are trying to solve can be classified as code generator function extraction. More formally, the problem of function extraction can be defined as follows: given a system S, find the function f : i → o such that if i is an input to S, then S produces o = f(i) as output. In other words, the problem of function extraction is about extracting the function (or input-to-output relation) implemented by system S. In the case of our problem, S is a code generator. In this section, we describe how EISSEC extracts the mapping function supported by a code generator. It does this by relying on a technique called symbolic execution. Before getting to the details of EISSEC, we first describe the required background on symbolic execution.

4.1 Background: Symbolic Execution

Software testing plays a central role in software development and maintenance. One of the important challenges in software testing is to generate test cases that maximize program coverage. Given the complexity of modern software (such as browsers, file editing tools, etc.), there is a general consensus in the software community that manual testing is unlikely to achieve adequate program coverage. Symbolic execution is one of the many techniques that can be used for generating test cases that achieve higher coverage of programs.

Symbolic execution [49] is a program analysis technique that executes programs with symbolic rather than concrete inputs. Because programs are executed with symbolic inputs, program data flow now propagates symbolic values from inputs to outputs. Program operations also operate on symbolic values. Program variables thus contain symbolic expressions. The effect of executing a program symbolically is that the output of the program is expressed as a function of its input. In other words, symbolic execution helps us uncover the function implemented by the program. During execution, symbolic execution also maintains a path condition which encodes constraints on the inputs that reach a particular program point. These path conditions are updated at every branch point in the program. Test cases that cover all program paths can then be generated by solving the collected path conditions (constraints) using constraint solvers.

Symbolic execution maintains a symbolic state δ, which maps variables to symbolic expressions, and a symbolic path constraint PC over symbolic expressions. PC accumulates constraints on the inputs that cause the execution to follow the associated path. At every conditional statement if (e) S1 else S2, PC is updated with conditions on the inputs to choose between the alternative paths. A new path condition PC′ is created and initialized to PC ∧ ¬δ(e) for the "else" branch, and PC is updated to PC ∧ δ(e) for the "then" branch, where δ(e) denotes the symbolic predicate obtained by evaluating e in symbolic state δ. Note that, unlike concrete execution, both branches can be taken, resulting in two execution paths. If a path condition becomes unsatisfiable, symbolic execution along that path is terminated. Satisfiability is checked with a constraint solver. Once we have traversed all paths inside a program and have collected the PC for every path, these PCs can be solved using a constraint solver. Solving the path condition of a path produces the mapping between input and output represented by that path. Thus, by solving the path conditions of all paths, we can extract the input-to-output function implemented by the program.

As mentioned earlier, in the software testing domain, symbolic execution is used to

generate high-coverage test inputs for a program [19, 21, 82, 39]. This is done by querying a constraint solver upon path termination and asking it for an input value that satisfies the path condition of that path. Note that, though there may be more than one set of input values which follow the same program path, obtaining one input value is enough to ensure program coverage. Function extraction, on the other hand, needs to know all the input values that satisfy a path condition, and the corresponding output produced by each of those values. Thus, the problem of generating high-coverage test inputs for a program is a sub-class of the function extraction problem addressed in this dissertation.
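To make the path-condition bookkeeping concrete, the following C sketch shows how a symbolic executor might process a conditional if (e) S1 else S2. The names (constraint, conjoin, negate, is_satisfiable) are hypothetical and stand in for whatever constraint representation and solver interface an engine provides; this illustrates the scheme above, and is not EISSEC's actual code.

#include <stdbool.h>

typedef struct constraint constraint;        /* a symbolic predicate */
extern constraint *conjoin(constraint *a, constraint *b);
extern constraint *negate(constraint *c);
extern bool is_satisfiable(constraint *pc);  /* query the constraint solver */

/* Explore "if (e) S1 else S2" under path condition pc; delta_e is the
   symbolic predicate obtained by evaluating e in the symbolic state. */
void explore_branch(constraint *pc, constraint *delta_e,
                    void (*s1)(constraint *), void (*s2)(constraint *))
{
    constraint *pc_then = conjoin(pc, delta_e);          /* PC  := PC && d(e)  */
    constraint *pc_else = conjoin(pc, negate(delta_e));  /* PC' := PC && !d(e) */
    if (is_satisfiable(pc_then)) s1(pc_then);  /* take the "then" path */
    if (is_satisfiable(pc_else)) s2(pc_else);  /* take the "else" path */
}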

4.1.1 A simple code generator and its symbolic execution

We now illustrate how symbolic execution of a code generator can be performed in order to extract its function (or semantic model). As a simple code generator, consider a simple hypothetical instruction set for a processor which supports (a) only four registers r1 to r4, (b) only mov and add instructions, and (c) the constraints that a valid source operand can be a register, a memory address (with memory accessed indirectly through registers only), or an immediate integer; a valid destination operand can only be a register or a memory location; and, for mov, the source and destination operands cannot both be memory at the same time. The grammar for this instruction set and the mapping table used by a code generator for such a processor are shown in Figure 20. The mapping table uses RTL as the IR. Note that the mapping process enumerates some aspects of the instruction set (e.g., the name of the instruction and the operand type). Other aspects are not enumerated (e.g., the actual registers or integer constants). Instead, the mapping uses expressions to refer to components of the RTL, denoted in the example using the notation %N.

A code generator implementing this mapping table is shown in Figure 21. To simplify the illustration of our technique, we use an implementation in a declarative language. Note that the code generator takes an RTL instruction as input and produces an assembly instruction as output. For the simple code generator, assembly instructions are represented as strings. Because an RTL has a number of subfields, it can be represented using nested expressions. To simplify understanding, we represent RTLs as first-order terms (as in LISC). Input rtl can thus be initially bound to the term (set D S). Note that the code generator only accepts RTLs with their operator (first subfield) set to set. D and S are symbolic variables for the destination and source of rtl, respectively (set is an assignment from S to D). Once we bind D and S, the symbolic state τ will be updated with the bindings. When symbolic

Instruction ::= mov S, D | add S, D
S ::= rN | (*rN) | $Int
D ::= rN | (*rN)
N ::= 1..4

(set (reg %2) (reg %1)) −→ mov r%1, r%2
(set (reg %2) (mem (reg %1))) −→ mov (*r%1), r%2
(set (reg %2) (const_int %1)) −→ mov $%1, r%2
(set (mem (reg %2)) (reg %1)) −→ mov r%1, (*r%2)
(set (reg %2) (plus (reg %2) (reg %1))) −→ add r%1, r%2
(set (reg %2) (plus (reg %2) (mem (reg %1)))) −→ add (*r%1), r%2
(set (reg %2) (plus (reg %2) (const_int %1))) −→ add $%1, r%2
(set (mem (reg %2)) (plus (mem (reg %2)) (reg %1))) −→ add r%1, (*r%2)

Figure 20: Instruction set and mapping table for a simple code generator

execution reaches the match statement on line 5, there are two possibilities: S is a plus rtl expression or it is not. It might happen that S is bound to some rtl expression in the symbolic state τ, and in such a case either the "if" or the "else" branch would be taken. Let us say that S is bound to (O, _) in τ. To decide which branch to take, the symbolic execution system queries a constraint solver with the constraint O == plus. If the constraint solver returns true, then O is bound to the value plus before line 5, and in that case only the "if" branch can be taken. A false return value means that O is bound to some value other than plus before line 5, and in that case only the "else" branch will be taken. A return value of satisfiable means that O is not bound to any value — in other words, it is unbound — and hence the condition O == plus can be satisfied by assigning O the value plus in the "if" branch and by setting O ≠ plus in the "else" branch. Thus, in such a case, both the "if" and the "else" branches would be taken. When the symbolic execution system needs to follow both the "if" and the "else" branches, it updates state τ with the conditions on O which hold true in each branch: O = plus in the if branch, and O ≠ plus in the else branch. Program execution is effectively divided into two at such a branch point. Also, PC is updated with O = plus in the if branch and O ≠ plus in the else branch. With this explanation, it is easier to see how the tree presented in Figure 22 represents a

1  let recog (rtl) : asm =
2    match rtl with
3      (SET D S) ->
4        (match S with
5           (PLUS _) ->              (* add *)
6             let (PLUS _ S1) = S in
7             let (O1 D1) = D in
8             let (O2 S11) = S1 in
9             if O1 = reg then
10              if O2 = reg then
11                "add rS11, rD1"
12              else if O2 = mem then
13                let (reg S111) = S11 in
14                "add (*S111), rD1"
15              else if O2 = const_int then
16                "add $S11, rD1"
17            else if O1 = mem && O2 = reg then
18              let (reg D11) = D1 in
19              "add rS11, (*rD11)"
20         | _ ->                    (* mov *)
21             let (O S1) = S in
22             let (O1 D1) = D in
23             if O1 = reg then
24               if O = reg then
25                 "mov rS1, rD1"
26               else if O = mem then
27                 let (reg S11) = S1 in
28                 "mov (*rS11), rD1"
29               else if O = const_int then
30                 "mov $S1, rD1"
31             else if O1 = mem && O = reg then
32               let (reg D11) = D1 in
33               "mov rS1, (*rD11)"
34        )
35    | _ -> raise error

Figure 21: OCaml code for a simple code generator

[Figure: tree diagram. The root, labeled start with PC = { }, branches on O == plus / O != plus; internal nodes branch on conditions such as O1/O2 == reg, mem, or const_int, with the PC accumulated along each path; the leaves are the generated assembly instructions: add rS11, rD1; add (*rS111), rD1; add $S11, rD1; add rS11, (*rD11); mov rS1, rD1; mov (*S11), rD1; mov $S1, rD1; mov rS1, (*rD11).]

Figure 22: Symbolic execution tree of the simple code generator

symbolic execution of the simple code generator's OCaml code. The root node of such a tree is the entry point of the code generator, and the leaves correspond to assembly instructions supported by the code generator. A path from the root to a particular leaf node represents an RTL-to-assembly mapping pair generated by the code generator. Non-leaf nodes, on the other hand, correspond to branch points in the program and represent the conditions on the input that must be satisfied in order to generate the assembly instructions at the leaves. For instance, in order to generate the assembly instruction "add r%1, r%2", the input RTL must be such that O = plus && O1 = reg && O2 = reg. The combination of all these constraints is exactly the PC for the path with "add rS11, rD1" as its leaf node. Satisfying all the constraints in the PC yields an input rtl R which produces "add rS11, rD1" as the assembly instruction.

Note that this is where the difference between function extraction and test case gener-

ation shows up clearly. To be precise, there could be more than one RTL (i.e., R1, R2, ...) that produces "add rS11, rD1". For instance, there are 16 RTLs (corresponding to the 4 possible values of rS11 and the 4 possible values of rD1) that map to "add rS11, rD1". For the problem of test case generation, it is sufficient to produce one RTL (not necessarily all) that produces the assembly instruction. This is because all 16 RTLs represent the same path in the code generator, and thus even a single RTL is enough to cover that path. The problem

of function extraction, on the other hand, by definition demands that the relation between all 16 RTLs and the "add rS11, rD1" instruction is obtained. More specifically, note that if, as in test case generation, we select only a single RTL, then we would extract an incomplete mapping function (an incomplete semantic model). That is why, for function extraction, the constraint solver needs to return all the possible values that satisfy a PC.

Once the program execution tree is generated, the PCs of the different paths are solved using a constraint solver. All the solutions provided by the constraint solver for a particular PC are the input assignments which map to the output represented by the path of that PC. Solving all such PCs and obtaining all of their solutions gives us the mapping table used by the given code generator.
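To see how many solutions a single PC can have, consider the path ending in "add rS11, rD1", whose PC is O = plus && O1 = reg && O2 = reg. The short C program below, a sketch assuming the four-register instruction set of Figure 20, enumerates all 16 satisfying RTLs; a test-case generator would stop after printing the first one.

#include <stdio.h>

int main(void) {
    /* Every assignment of rD1 and rS11 to r1..r4 satisfies the same PC,
       so function extraction must report all 16 pairs. */
    for (int d = 1; d <= 4; d++)
        for (int s = 1; s <= 4; s++)
            printf("(set (reg %d) (plus (reg %d) (reg %d))) -> add r%d, r%d\n",
                   d, d, s, s, d);
    return 0;
}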

4.1.2 Concolic execution of the simple code generator

The symbolic execution technique described above is called classical symbolic execution, and it was introduced by King in his 1976 paper [49]. Unfortunately, applying classical symbolic execution to extract a function from modern complex software is a very challenging problem. The first challenge is the exponential increase in the number of program paths: notice that the number of paths in the program tree of Figure 22 grows exponentially as more branch points are visited. The second challenge is that all code reachable from the program whose function we want to extract needs to be executed symbolically. In the context of code generators, the second limitation proves critical. Code generators are complex pieces of software with thousands of lines of code and, more importantly, they may interact with other components of a compiler (e.g., the instruction selector, register allocator, etc.). Because we are only interested in the function implemented by the code generator, we only want to perform symbolic execution of a part of the compiler. This is where the idea of concolic execution [82] comes into the picture.

Concolic execution is an optimization over symbolic execution that allows us to execute only a part of a program symbolically. To achieve this, it follows a simple idea: restrict symbolic execution to only the interesting paths and visit the remaining paths concretely. Concolic execution stands for a mix of concrete and symbolic execution. Conceptually, it operates at a middle ground between symbolic and concrete execution. Specifically, instead of making all program inputs symbolic, only the program inputs of interest are made symbolic, while the remaining inputs are made concrete (i.e., concrete values are assigned to them). The reason concolic execution reduces the number of program paths visited symbolically is that whenever all the inputs involved in a branch

1 let recog (rtl, rtl_id) : asm =
2   if (isRTLInCache(rtl_id) == 1)
3   then getRTL[rtl_id]
4   else
5     match rtl with
6       (SET D S) ->
7         (match S with
8          ...

Figure 23: Modified OCaml code for the simple code generator

condition are concrete, the condition is evaluated concretely, and only one of the two possible branches is taken. On the other hand, whenever at least one of the inputs involved in a branch condition is symbolic, the PC is updated appropriately as explained earlier, and both branches are explored. Note that with concolic execution we can effectively prune one of the two possible subtrees at a branch point. This pruning controls the explosion in the number of program paths.

Concolic execution of the simple code generator can be performed as follows. To demonstrate the need for concolic execution, let us modify the simple code generator. A snippet of the modified code is shown in Figure 23. The modified code takes one more argument rtl_id (which represents a unique id for every rtl) as input. The code uses the input rtl_id to check whether we have already translated the input RTL by calling a function isRTLInCache. Assume that isRTLInCache uses a hash table implementation from some library to implement the cache, and that an rtl_id of -1 is never found in the cache. It returns 1 if the rtl_id is found in the cache (in which case the code calls getRTL to return the cached RTL), and 0 otherwise (in which case it explores the decision tree to translate the input RTL). The code to explore the decision tree is the same as that of the original simple code generator (hence not shown).

Given this code snippet, it is clear that we are not interested in exploring the code of isRTLInCache and getRTL symbolically. So the question is how to handle this piece of code. What we do in this case is simply assign the concrete value -1 to rtl_id, but assign the symbolic value (O, X, (C, A, B)) to rtl. This way, we always execute isRTLInCache concretely. Because isRTLInCache(-1) always returns 0, the execution will always go into the else branch at line 4, which is the only code that we are interested in executing symbolically. In other words, we are pruning the if branch at line 2 in the program execution tree, thus reducing the number of program paths. The program execution tree

[Figure: tree diagram. The root is labeled start; the isRTLInCache branch is resolved concretely (only the else side is explored; the pruned if subtree is drawn with dotted lines). The remaining nodes branch on conditions such as rc2 == plus / rc2 != plus and rc1/rc22 == reg, mem, or const_int, with the PC accumulated along each path; the leaves are the generated assembly instructions: add r%1, r%2; add r%1, (*r%2); add r%1, $%2; add (*r%1), r%2; mov r%1, r%2; mov r%1, (*r%2); mov r%1, $%2; mov (*r%1), r%2.]

Figure 24: Concolic execution tree of the simple code generator

of the simple code generator’s concolic execution will look like Figure 24. The pruned subtree is shown with dotted lines.
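The pruning rule at the heart of concolic execution can be summarized by the following C sketch. The helpers is_symbolic, solver_try, fork_exec, and the addCons variants are assumed names: a branch whose condition involves only concrete values is evaluated natively (no fork), while a branch over a symbolic value consults the solver and, if feasible both ways, forks the symbolic process.

#include <stdbool.h>

extern bool is_symbolic(long operand);     /* runtime type check          */
extern bool solver_try(long cond);         /* satisfiable together with PC? */
extern int  fork_exec(void);               /* fork the symbolic process   */
extern void addCons_true(long cond);       /* PC := PC && cond            */
extern void addCons_false(long cond);      /* PC := PC && !cond           */

bool concolic_branch(long cond, long lhs, long rhs, bool concrete_result)
{
    if (!is_symbolic(lhs) && !is_symbolic(rhs))
        return concrete_result;            /* concrete: one subtree pruned */
    if (!solver_try(cond)) {               /* condition infeasible:        */
        addCons_false(cond);               /* only the else side survives  */
        return false;
    }
    if (fork_exec() == 0) {                /* child explores the if side   */
        addCons_true(cond);
        return true;
    }
    addCons_false(cond);                   /* parent explores the else side */
    return false;
}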

4.2 Design

EISSEC is a concolic execution system for GCC's code generator. Given the limitations of classical symbolic execution discussed earlier, the complexity of GCC,¹⁴ and the fact that we are only interested in extracting the function of the code generator, we decided to employ concolic execution instead of symbolic execution. By executing a code generator concolically, EISSEC can obtain all the RTL-to-assembly mapping pairs (the semantic model) supported by the code generator.

One of the significant complications introduced by concolic execution is the existence of both symbolic and concrete state, and the scenarios where there are assignments from one state to the other. To give a simple example, a symbolic variable could be assigned a

¹⁴There are plenty of components in the code generator which perform functionality other than mapping RTL to assembly, and extracting RTL-to-assembly pairs does not require executing these components symbolically.

concrete value, and a concrete variable could be assigned a symbolic value. The former is a much easier scenario to deal with, but the latter is a case of "explosion" — by explosion, we mean that in such a case one has to effectively enumerate all possible values that can be taken by the symbolic variable and assign each of those values to the concrete variable. Moreover, we need to ensure that natively executed code will not access symbolic variable state. In this section, we discuss the issues produced by concolic execution of GCC's code generator and how EISSEC's design handles them. In general, our strategy is to keep as much of GCC's state concrete as possible, and to mark as symbolic only the global variables, functions, etc., that are needed to extract the code generator function.

Another important aspect of EISSEC is how it addresses the function extraction problem. There exist a number of popular symbolic execution systems such as KLEE [19], DART [39], EXE [21], CUTE [82], etc. But these systems are used in the context of test case generation, not in the context of function extraction. Although these systems do not target the function extraction problem, there are many parallels between them and EISSEC in terms of the design of concolic execution systems.

4.2.1 Input program

Symbolic execution systems can take input programs in different forms — some take programs in compiled binary form, while others take program source code. There are systems like KLEE [19] that take programs in byte-code form (the program source code is compiled to byte-code using a compiler). Systems that perform symbolic execution of a program executable have the advantage of being able to symbolically execute any program (even COTS binaries). Since in our case the source code of GCC's code generators is available, EISSEC operates on the source code. EISSEC transforms the C source of the code generator using a source-to-source transformation approach, and the transformed source code is then compiled and executed to perform concolic execution of the code generator.

4.2.2 Overall flow

Our concolic execution engine is implemented as a source-to-source transformation of GCC's code¹⁵ (which is written in C) using CIL-1.4.0 [66]. Because we can use native execution as well, not all of GCC's code needs to be transformed. Specifically, the

¹⁵We used GCC-4.6.4 — the most recent version of GCC at the time we began our project — for our purpose.

transformation starts at the function recog that generates target assembly instructions from the input RTL instructions by matching an RTL snippet against an MD rule. Any function called by recog with possibly-symbolic arguments will also be transformed to support concolic execution. This may not always be desired; to override the default choices in terms of transforming (or not transforming) certain functions, we rely on annotations.

Before describing the transformation, it is necessary to understand how and when concolic execution comes into play. After performing all the necessary initializations, we introduce a call to recog, marking the RTL code that needs to be translated as symbolic. All other objects in GCC's memory are concrete at this point. Concolic execution will explore all possible paths in recog and the functions invoked from it. Each such path will ultimately terminate with a successful return from recog or with a failure. In the former case, assembly code is printed into a string-valued variable returned by recog.

Recording the mapping requires obtaining the symbolic values of the argument and output of recog corresponding to a single execution path, and storing the pair. Most previous implementations of symbolic and concolic execution have used SAT/SMT solvers as their constraint solvers. Unfortunately, these solvers are optimized to find a single instantiation that satisfies a formula. In our case, we cannot use a single instantiation because it would capture only one of the possible translations that may be performed on a program path. Constraint solvers, such as those included in constraint logic programming languages, are better engineered to produce all possible instantiations. That is why we use constraint-logic-programming-based solvers for our purpose.
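The overall flow can be pictured as the following hypothetical driver. mk_symbolic_rtl, gcc_initialize, and recog_trans are assumed names (the latter anticipating the transformed recog described next); each path explored inside the transformed code ends by recording one RTL-to-assembly pair and backtracking, so a single call suffices to enumerate the whole mapping.

extern void *mk_symbolic_rtl(void);          /* fresh solver-side RTL term  */
extern void  gcc_initialize(void);           /* GCC's usual initializations */
extern void  recog_trans(void *rtl, void *rtl_meta);

void extract_code_generator_function(void)
{
    gcc_initialize();                 /* all of GCC's state starts concrete */
    void *rtl = mk_symbolic_rtl();    /* only the input RTL is symbolic     */
    recog_trans(rtl, /*rtl_meta=*/0); /* explores every path; each path end
                                         records its mapping and backtracks */
}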

4.2.3 Source-to-source transformation for concolic execution.

Because concolic execution systems allow some of the program inputs to be symbolic, the first design challenge for EISSEC is how to represent symbolic values in the system. Program variables, in addition to concrete values, can now contain symbolic values. Furthermore, because variables can now have symbolic values, program statements which operate on symbolic values need to be handled. Symbolic execution systems transform such statements so that the effect of executing a statement with symbolic inputs captures the effect of executing the statement with concrete inputs. For example, for x = y+1, if the symbolic value of y is Y, then the transformation of the statement would be such that the symbolic value of x becomes Y + 1. EISSEC transforms all of the statement types supported in C, such as assignments, conditionals, function calls, etc.

We describe our source-to-source transformation using an abstract language that

captures the essential features of C:

P ::= S
S ::= A | I(V,S,S) | W(V,S) | S;S
A ::= R | V = R | *V = V
R ::= O | O op O | op O | f(V,...,V)
O ::= V | C

Here, P stands for a program, S for a statement, A for an assignment statement, I for if-then-else, W for a while statement, L and R for the lhs and rhs of assignments, V for a variable (not a memory location), and C for a constant.

The transformation is specified using a function T that takes two input parameters: a C-program statement S and an abstract environment E. While the abstract environment could hold any type of information about variables, in our case E says whether a variable is definitely symbolic (Vsym = true), possibly symbolic (represented as Vsym = ⊤), or definitely not symbolic (Vsym = false). The output of T is a pair (S′, E′), where S′ is the transformed version of the input statement S, and E′ is the new abstract environment.

Rules governing symbolic vs non-symbolic variables. Note that we rely both on a compile-time approximation of whether a variable contains a symbolic value and on accurate runtime information. The former is used to avoid transforming large sections of code that may never manipulate symbolic state, while the latter is used to maximize concrete execution at runtime. The compile-time symbolic type of a variable with respect to its abstract

environment E is represented using the function cType_E, while the runtime type is represented using the function rType_E. The possible values for compile-time types are {conc, sym, ⊤}, whereas the possible values for the runtime type are {conc, sym}.

The rules governing variable types need to ensure the integrity of the symbolic and concrete states. Because during concolic execution both symbolic and concrete states exist simultaneously, it is necessary that these states do not interfere with each other. To be precise, it is necessary to ensure that accesses of all symbolic variables in the program source code are identified correctly and are modified to access symbolic state. Additionally, it is also necessary that concretely executed program statements do not modify symbolic state. The fact that the GCC-4.6.4 code generator is written in C makes this challenging. Specifically, pointers in C make this issue complex, as they make reasoning about the locations

that are accessed difficult. Fortunately, features of GCC's code generator source code make addressing this challenge a bit easier.

In order to ensure the integrity of symbolic state, we need to ensure that all accesses that clobber symbolic state are detected by our system. First, pointer dereferences pose a challenge for this requirement because pointer arithmetic and aliases make reasoning about pointer dereferences hard. Symbolic pointers pose an even more significant issue. In particular, dereferencing a symbolic pointer demands enumerating all concrete addresses that could be taken by that pointer at the time of dereference and dereferencing all of those memory locations (KLEE [19] follows this approach). This makes the system very inefficient and introduces complications (ensuring correct and accurate updates to memory without corrupting symbolic state).¹⁶ That is why we allow symbolic pointers in very limited contexts. Second, because global objects are visible from "everywhere" in the program, and because we are performing concolic execution, the requirement of ensuring integrity also means that we need to limit which objects are considered symbolic. GCC has many global objects, and most of them are not accessed in the code generator. Such objects can be safely considered concrete. We thus annotate the global objects which are accessed by the code generator and consider only them as symbolic.

Based on the above discussion, the following rules govern if and when a variable can be symbolic or concrete: (1) if a variable is annotated as being symbolic (which we do for selected global variables only), or (2) if a variable's address is not taken, then it is considered symbolic; (3) variables of pointer type are considered non-symbolic unless they are annotated otherwise (in our case, only pointers of type RTL were annotated as symbolic); (4) variables of array, structure, and union type are considered symbolic if they satisfy either of the above requirements and they are accessed using an offset which is a compile-time constant (e.g., a[2] is fine, but not p = &a[0]+2; *p=..) — in other words, accessing them using pointer arithmetic is not allowed.¹⁷ Fortunately, in the case of GCC, most of the code uses pointers in accordance with these rules, but there were a few places where symbolic arrays were accessed using pointer arithmetic, and we rewrote such cases to use constant offsets. In particular, the case of

¹⁶While an alternative approach of treating memory objects symbolically [82] can work in such a case, given the large number of GCC's in-memory objects, we do not want to treat them symbolically.
¹⁷This does not mean that we refuse programs which use pointer arithmetic; in such cases, we treat the pointer arithmetic concretely.

an array of rtx accessed through a pointer needed rewriting. The operands of an RTL instruction are stored in an array (declared as rtx operands[]) of rtx pointers. GCC's code was accessing the elements of operands using pointer arithmetic, such as rtx* o = operands; ..; *(o+2) = ... As per rule (4) above, this is not allowed, so we rewrote this access as operands[2] = ... Note that, although operands[2] can also be thought of as syntactic sugar over pointer arithmetic and dereference syntax, the expression operands[2] allows reasoning about which element of the array is being accessed. This allows generating appropriate constraints easily. There were 12 places in the code generator where this change was necessary.

Once program variables are marked as sym or conc as per the above rules, we use standard data-flow analysis to compute the symbolic types of the other program variables. Specifically, the following rules were used.

• First, if the output (VO) of a program statement is marked as definitely symbolic or definitely non-symbolic (i.e., either cType_E(VO) = conc or cType_E(VO) = sym), then we do not need to propagate any symbolic type information from input to output.

• Otherwise, if all input variables of a program statement are definitely concrete, then the output is marked as concrete (cType_E(VO) = conc).

• On the other hand, if all input variables of a program statement are definitely symbolic, then the output is marked as symbolic (cType_E(VO) = sym).

• Otherwise, the cType_E(VO) of the output is set to ⊤.

These data-flow analysis rules are applied at function granularity by starting with the types of the formal parameters and using them as input for the type propagation rules. E for the function is updated whenever the type of a new variable is derived, and it is used to transform program statements.

Before we discuss the transformation of each program statement of the abstract language, we will look at the high-level idea behind the transformation, shown in Figure 25. The figure shows how we transform a statement S depending on the compile-time or runtime types of its inputs (VI) and output (VO). T is the transformation function which takes S and E as input, and emits the transformation of S and the statement to update the abstract environment E as output.

• cType(VO) = conc: the high-level idea captured in this case is that if the compile-time symbolic type of the output is conc (i.e., the variable cannot hold a symbolic

Conditions on VO   | Conditions on VI        | T
cType_E(VO) = conc | ∀VI, cType_E(VI) = conc | S; E[rType_E(VO) ← conc]
cType_E(VO) = conc | ∃VI, cType_E(VI) ≠ conc | if ∀VI, rType_E(VI) = conc then S else enumerate; E[rType_E(VO) ← conc]
cType_E(VO) = sym  | ∀VI, cType_E(VI) = ⊤    | addCons(VO = S(VI1, .., VIn)); E[rType_E(VO) ← sym]
cType_E(VO) = ⊤    | ∀VI, cType_E(VI) = ⊤    | if ∀VI, rType_E(VI) = conc then S; E[rType_E(VO) ← conc] else addCons(VO = S(VI1, .., VIn)); E[rType_E(VO) ← sym]

Figure 25: Transformation logic for VO = S(VI1, .., VIn)

value), then we cannot execute S symbolically, because there is a constraint on the symbolic type of the output. In this case, if all inputs are concrete, then we can execute S concretely. Note that the runtime check (if) has to be emitted before statement S to ensure that all inputs are concrete at runtime. If this runtime check fails, then we enumerate all possible concrete values for the symbolic inputs and execute S for each of them. We specify how to perform such enumeration for each of the program statements of our abstract language shortly; an intuitive description of enumeration is given after discussing all the cases.

• cType(VO) = sym: when the compile-time type of the output is symbolic, then irrespective of the compile-time types of the inputs, we generate a constraint which captures the relation between inputs and output. The generated constraint is then sent to the constraint solver. addCons is thus a function used to send a constraint to the constraint solver.

• cType(VO) = ⊤: when we do not know the compile-time symbolic type of the output, then depending on the runtime symbolic types of all inputs, we either execute the statement concretely (and set the runtime symbolic type of the output to conc), or generate a constraint (and set the runtime symbolic type of the output to sym). Note that the case of the compile-time symbolic types of all inputs being conc will not arise here, as

our data-flow analysis in such a case would mark the output as conc at compile time.
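The three cases above can be condensed into the following C sketch of the decision logic of Figure 25; the enum values and helper names are assumptions used only for illustration.

typedef enum { CONC, SYM, TOP } ctype;       /* compile-time type lattice */
typedef enum { R_CONC, R_SYM } rtype;        /* runtime type tag          */

extern void exec_concretely(int stmt);       /* run S natively                  */
extern void addCons_stmt(int stmt);          /* emit V_O = S(V_I1,..,V_In)      */
extern void enumerate_and_exec(int stmt);    /* symbolic-to-concrete conversion */

/* out: compile-time type of V_O; ins: R_SYM if any input is symbolic. */
void transform_case(int stmt, ctype out, rtype ins)
{
    if (out == CONC) {                       /* output cannot be symbolic */
        if (ins == R_CONC) exec_concretely(stmt);
        else               enumerate_and_exec(stmt);
    } else if (out == SYM) {
        addCons_stmt(stmt);                  /* always constrain          */
    } else {                                 /* TOP: decide at runtime    */
        if (ins == R_CONC) exec_concretely(stmt);
        else               addCons_stmt(stmt);
    }
}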

Enumeration (symbolic-to-concrete conversion). Whenever there is an operation involving at least one symbolic argument, we must either have a symbolic version of the operation, or we must enumerate and try all possible values of the symbolic argument. Additionally, whenever a symbolic value is assigned to a concrete variable, we must enumerate all possible values that could be taken by that symbolic variable at that program point, and assign each value to the concrete variable one by one. This leads to a significant increase in the number of program paths, but we do not anticipate such cases arising frequently. We make concrete variables symbolic as often as possible. Additionally, to reduce the number of program paths that we generate by enumeration, we have devised a number of optimizations, which we discuss in Section 4.4. Optimizations which reduce the number of program paths are crucial to completing the concolic execution of the code generator.

Figure 26 shows the program statements that are transformed (in column S), the logic behind their transformation (in column T), and the updated value of the runtime variable type. Column T contains the actual transformation of S, while the conditions checked at compile time to produce the transformation are shown in column C. Following the high-level logic for statement transformation discussed earlier, we do not show cases in which we execute the statement concretely (the case of S itself being produced in column T). Instead, the figure only shows the cases where the statements are transformed.
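A sketch of the enumeration step itself, in the spirit of the fork_and_continue entries of Figure 26 (domain_of, fork_and_continue, and terminate_path are assumed names): each concrete value in the symbolic variable's current domain is explored on its own forked path.

#include <stddef.h>

extern size_t domain_of(int symvar, const int **values); /* solve PC for symvar */
extern int  fork_and_continue(void);  /* fork a symbolic process; 0 in the child */
extern void terminate_path(void);     /* this flow ends; backtrack               */

void assign_sym_to_conc(int *conc_var, int symvar)
{
    const int *vals;
    size_t n = domain_of(symvar, &vals);
    for (size_t i = 0; i < n; i++) {
        if (fork_and_continue() == 0) {  /* child: pick one concrete value */
            *conc_var = vals[i];
            return;                      /* and continue executing         */
        }
    }
    terminate_path();                    /* parent: all values dispatched  */
}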

Transformations.

• V = O1 op O2: this case applies to statements containing binary expressions. It can also be thought of as a general version of unary expressions (so we do not show the transformation of unary statements). The idea behind the transformation of this statement is the same as the high-level idea; only the case of enumeration deserves some explanation. Specifically, if the compile-time type of the output is conc, and the runtime type of any of the inputs is sym, then we have to enumerate all possible concrete values that can be taken by the symbolic value and use those concrete values to execute the statement concretely. This is the idea captured by the runtime check inserted for this case. For a symbolic execution system, this is the most undesirable situation because enumeration leads to the creation of numerous symbolic processes.

S: V = O1 op O2
   C: cType(V) = sym ‖ cType(V) = ⊤
   T: addCons(V = O1 op O2);
      if ∀Oi, rType_E(Oi) = conc then E[rType(V) ← conc] else E[rType(V) ← sym]

   C: cType(V) = conc
   T: if rType_E(O1) = sym ∧ rType_E(O2) = sym then
        foreach o1 ∈ conc(O1): foreach o2 ∈ conc(O2): (V = o1 op o2)
      elif rType_E(O1) = sym then
        foreach o1 ∈ conc(O1): (V = o1 op O2)
      elif rType_E(O2) = sym then
        foreach o2 ∈ conc(O2): addCons(V = O1 op o2)
      else addCons(V = O1 op O2);
      E[rType(V) ← conc]

S: *V1 = V2
   C: (no condition checked, as V1 must be concrete)
   T: if rType_E(V2) = sym then
        (foreach v2 ∈ conc(V2): fork_and_continue(); *V1 = v2;)
      else (*V1 = V2);
      E[rType(V2) ← conc]

S: f(V1,...,Vn)
   C: f_cons is defined
   T: if ∃i, rType_E(Vi) = sym then f_cons(V1,...,Vn, V)

   C: f_trans exists
   T: if ∃i, rType_E(Vi) = sym & rType_E(VPi) = conc then
        (foreach vi: fork(); Vi = vi; f(v1, .., vn))
      else (f_trans(V1, ..., Vn, V1meta, ..., Vnmeta))

   C: otherwise
   T: f(V1,...,Vn)

S: S1; S2
   T: (S1n, E1) = T(S1, E); (S2n, E2) = T(S2, E1); yielding (S1n; S2n, E2), with E = E2

S: I(V, S1, S2)
   C: rType_E(V) = sym
   T: let ret = addCons(V = true);
      if ret = SAT then
        if fork() = child then { addCons(V = true); T(S1, E) }
        else { addCons(V = false); T(S2, E) }
      else { addCons(V = false); T(S2, E) }

   C: otherwise
   T: I(V, S1, S2)

Figure 26: Transformation function for different program statements

• *V1 = V2: this case applies to pointer dereferences, and since the pointer being dereferenced is concrete at compile time, we simply generate a runtime check for enumeration (as in the previous case). In the case of enumeration, since V2 was symbolic, we need to update its runtime type to concrete.

Dereferences of symbolic pointers introduce an interesting challenge: since the pointer value is symbolic, one must concretize the value in order to dereference it. This leads to the enumeration issue discussed earlier. Fortunately, in GCC's code generator, there are no direct pointer dereferences; instead, pointers to compound data types are dereferenced to access their fields. For symbolic rtx pointers, rtx being an object, all dereferences access the fields of the object. GCC uses special functions, called accessor functions, to access the fields of an rtx. For instance, assuming that rtx were to have a field named foo, GCC would have a function named GET_FOO(x) defined as GET_FOO(x) = x->foo. In other words, accessor functions define an interface to access the fields of an RTL instruction, and thus hide the representation of the RTL object. We exploit this observation to handle field references of symbolic rtx. Specifically, for each accessor function (around 50 in total), we manually write a new symbolic accessor function which models the semantics of GCC's accessor function on a symbolic RTL instruction. For instance, for GET_FOO, we would write GET_FOO_CONS(x), which would generate a constraint that x is of type rtx and return the symbolic value of x->foo (a sketch of such a symbolic accessor appears after this list). The code for the symbolic accessor functions constitutes around 2000 lines of C code.

We need to manually write symbolic versions of the accessor functions because we want to represent the rtx type specially in our constraint system. Specifically, we treat rtx as a compound type and hide its details (such as its fields, their ordering, etc.) from the rest of the code. If we transformed accessor functions like any other functions, then we would not be able to control the symbolic representation of the rtx type.

Note that the approach of allowing accesses to compound objects only via compile-time constant offsets allows us to handle references to elements of compound symbolic types like arrays, structures, etc. For instance, the field dereferenced in the case of an array would be the element of the array being accessed. For these compound types, in addition to the GET function, we also define a SET function. Note that this also allows us to handle non-rtx pointer dereferences. By encapsulating accesses to compound symbolic types, we can easily change the underlying symbolic representations of these types.

• f(V1,...,Vn): since we expect most (if not all) global variables to be concrete, symbolic values will flow into local variables primarily from function parameters (the same rules applied to local variables are applied to formal parameters to determine which of them are symbolic). In particular, a call f(x,y) will need to be transformed into

f_trans(x, y, xmeta, ymeta). xmeta and ymeta allow us to pass meta-information about symbolic actual arguments from the caller to the callee, and they are used to assign the symbolic actual arguments in the function prologue. We also need to transform f's body to produce f_trans's body; f_trans is produced by transforming all the statements in its

body. Note that in the figure VPi represents the formal parameter in the function prototype corresponding to actual argument Vi. Also note that f_trans is different from f_cons: f_cons denotes the transformed version of an accessor function f, while f_trans denotes the transformed version of any other function f defined in the code generator.

The original function f continues to exist, in case it is called from untransformed code. But we do not call f from transformed code even if the arguments are all concrete. This allows us to support cases where a global variable (accessed within f) contains a symbolic value. This is important because if we were to call the untransformed f when all of its arguments are concrete, we would update the symbolic global variable concretely and would miss constraints on that variable.

Whenever a function call is to be transformed, we check whether the function being called is a standard library function (e.g., malloc from libc). If it is, we do not transform the call and keep the original call as it is. If the function is not a standard library function and we have already transformed its definition, then transformed(f) returns true, and in that case we emit the code shown next to it in the figure. In short, we do not transform any functions outside of GCC's code generator source code.

The transformed names for GCC's accessor functions are obtained by appending _CONS to the accessor's original name. That is why we first check whether the function being transformed is an accessor function for which we have a symbolic version. If it is, then we invoke the symbolic version of the function in place of the original call.

Another type of function which we manually transform is the one that relates to printing the symbolic operands of assembly instructions. Note that the code for

printing assembly instructions mostly consists of calls to libc functions such as write or fputs. If we wanted to print symbolic operands using these functions, we would have to transform these functions as well, which adds unnecessary complexity to the system. Instead, we write wrappers (similar to the accessor functions for rtx) and handle the printing of symbolic operands in them. The wrappers and accessor functions together constitute around 1.2K lines of code.

As a last option in the transformation of function calls, if the function is neither transformed nor an accessor function, then we execute that function call concretely. This is captured in the last case.

• S1;S2: this case represents a sequence of two statements. We transform the sequence by first transforming S1 and then feeding the output E1 of S1's transformation into the transformation of S2. The output E2 is then the output E of the sequence.

• I(V,S1,S2): the if-then-else statement represents an important case in statement transformation because it is where symbolic processes fork their execution into two. First of all, if the condition (of the if statement) is concrete, then we can simply execute the statement concretely; in that case, the symbolic process does not fork its execution. On the other hand, if the condition is a symbolic expression, then we may need to fork the execution. To determine whether we need to fork, we first check whether the condition is satisfiable (SAT) by sending the constraint to the constraint solver. If the condition is unsatisfiable, then we can only take the else branch. An example of an unsatisfiable constraint is V = 10 when V is already bound to 5 by the statement V = 5. In the else branch, we add the constraint that V is false, as the semantics demands. On the other hand, if the condition is satisfiable (e.g., the condition V < 5 is satisfiable if V is not bound to any value), then we need to fork the execution. Conceptually, this is similar to the Unix fork() semantics: after the fork, the child continues with the if branch, while the parent continues with the else branch. The treatment of the if statement can also be thought of as generating one or two concrete values from V. Specifically, similar to symbolic-to-concrete conversion, we enumerate V by assigning it true, false, both, or neither.
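As promised above, here is a sketch of what a symbolic accessor might look like, mirroring the GET_FOO / GET_FOO_CONS pattern. symterm, addCons_is_rtx, and cons_field_of are assumed names for the solver-side term handle and constraint-emitting helpers; the real accessor wrappers in EISSEC are hand-written against its Prolog constraint system.

typedef struct symterm symterm;              /* handle to a solver-side term  */

extern void     addCons_is_rtx(symterm *x);  /* constrain x to be an rtx term */
extern symterm *cons_field_of(symterm *x, const char *field);

/* Concrete accessor, as in GCC:   #define GET_FOO(x) ((x)->foo)        */
/* Symbolic counterpart: constrain x to be an rtx and return a symbolic */
/* value standing for its foo field.                                    */
symterm *GET_FOO_CONS(symterm *x)
{
    addCons_is_rtx(x);
    return cons_field_of(x, "foo");
}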

asm recog_trans(rtl, rtl_meta) {
  RTL_CHILD_CONS(rtl, 1, rc1);  // RTL accessor function
  RTL_CHILD_CONS(rtl, 2, rc2);
  // ask constraint solver if the constraint is satisfiable
  if (try(rc2 = (plus,_,_))) {  // add
    if (try(rc1 = (reg,_))) {
      RTL_CHILD_CONS(rc2, 2, rc22);
      if (try(rc22 = (reg,_))) {
        // End of program path. Record the I/O mapping and
        // terminate execution here.
        recordMapping(rtl, "add r%1, r%2");
      } else if (try(rc22 = (mem,_))) {
        // In else, we add negation of if condition.
        addCons(rc22 != (reg,_));
        recordMapping(rtl, "add r%1, (*r%2)");
      } else if (try(rc22 = (const_int,_))) {
        addCons(rc22 != (mem,_));
        recordMapping(rtl, "add r%1, $%2");
      } else {
        addCons(rc22 != (const_int,_));
        recordUnreachablePath(rtl);
      }
      backTrack();
    } else if (try(rc1 = (mem,_))) {
      if (try(rc22 = (reg,_))) {
        addCons(rc1 != (reg,_));
        recordMapping(rtl, "add (*r%1), r%2");
        backTrack();
      } else addCons(rc1 != (mem,_));
    }
    // Inputs lead program execution to failure cases.
    recordUnreachablePath(rtl);
  } else {  // mov
    addCons(rc2 != (plus,_,_));
    // transformation for mov
    ...
  }
}
// To save space, the original function is not shown here,
// but our transformation preserves the original function.

Figure 27: Source-to-source transformation of the simple code generator

Example. The source-to-source transformation of the simple code generator from the previous section, performed using T, is shown in Figure 27. Notice that the function translate_rtl is transformed into translate_rtl_trans with rtl_meta being passed as an additional parameter. rtl_meta is the part of the runtime metadata that helps in identifying whether a given variable is symbolic or concrete. Calls to RTL accessor functions such as RTL_CHILD are transformed so that appropriate constraints are generated in them. For RTL_CHILD, the generated constraint would capture the relationship that rc1 is the first child of rtl. Conditions of if statements are transformed into calls to the constraint solver, with the conditions transformed into constraints. try() is a way to ask the constraint solver whether the constraints passed to it are satisfiable. try() behaves exactly like fork(): whenever the constraints are satisfiable, it creates two program execution flows, one going into the if branch and the other going into the else branch. In the else branch, we first add the negation of the if condition as a constraint that must hold. addCons is a way to add constraints to the constraint system and also to ensure that the constraint addition does not fail. If addCons fails, it terminates the execution flow along that path and returns to the latest branch point visited in the flow. If addCons succeeds, as should happen in most cases, the execution flow continues to the next statement.

Return statements from functions are transformed into calls to recordMapping to record the input-to-output mapping. This is because the return statements of translate_rtl_trans are the ends of program paths. That is why we invoke backTrack after calling recordMapping. For other functions, return is a way to return to their callers; in such cases, we do not call recordMapping and backTrack, but instead set the metadata of the return value and return the actual return value. In the example, translate_rtl_trans is a top-level function, and the transformation generates recordMapping and backTrack for it. All unreachable program points are transformed into a call to recordUnreachablePath, which records the information that the particular program point is unreachable, and then terminates the current execution flow. Execution of this transformed program, which amounts to concolic execution of the simple code generator, yields the mapping table in Figure 20.

4.2.4 C language features and challenges for EISSEC

C is well known for the power it gives to programmers through features like pointers, typecasting, etc. All such features pose challenges for EISSEC. In this section, we discuss the challenges of performing symbolic execution of GCC's code generator.

Pointers. Pointers in C introduce some complications for our symbolic execution system. This is because, unlike C, constraint programming languages have no concept of a reference to an object. Moreover, treating program variables of pointer type as variables of basic types and assigning them symbolic variables leads to further complications. C allows pointers to be used in dereferencing as well as in arithmetic. To support both semantics of pointer types, the type system of our constraint solver would have to allow all the unsafe type casts that are allowed in C, which can lead to runtime inconsistencies. Moreover, dereferencing a symbolic pointer demands assigning all possible valid memory addresses to the symbolic pointer (i.e., symbolic-to-concrete conversion) and then performing the dereference on each concrete pointer. Given the large number of concrete memory addresses, enumeration of symbolic pointers is too inefficient and introduces considerable complications in the system.

To avoid runtime inconsistencies altogether, our system supports pointers in very limited contexts. Specifically, pointers of rtx type are allowed to be symbolic in our constraint system. Additionally, pointers to compound types, such as arrays or structures, that are used for accessing field offsets which are compile-time constant expressions are allowed to be symbolic. Pointers of all other types are left untransformed, and hence treated as concrete. Fortunately, GCC's code generator hardly uses pointers, and where it does, it uses pointers of type rtx. Additionally, rtx pointers are not used in pointer arithmetic and are used in a type-safe manner. As a result, representing rtx pointers by a symbolic object, and modeling type-safe dereferences of such pointers, is quite straightforward for EISSEC. As mentioned earlier, our source-to-source transformation pass handles pointer dereferences via accessor functions which capture the semantics of the operation and generate the corresponding constraints.

Note that representing NULL rtx pointers and allowing comparisons with other rtx pointers demands special treatment. This is because NULL is generally defined as 0, and because rtx is a term type in the constraint system, a comparison of an integer with a term would fail in the constraint solver. To solve this problem, we define a special rtx object (with an rtx code of -1; valid rtx objects have non-negative codes) in our constraint system, and treat that object as semantically equivalent to a NULL rtx pointer. To give an example, Figure 28 shows a piece of C code, and the comments after the statements show the corresponding constraints that are generated for such code.

/* Definition of rtx type */
struct rtl {
    enum rtx_code code;
    enum machine_mode mode;
    rtx childs[];  // variable size array
};
typedef struct rtl *rtx;
..

/* actual code */
rtx r = new_rtx();  // R = rtx(C, M, CH)  /* Prolog functor */
if (r == NULL)      // C == -1?  /* Valid RTL has C >= 0,  */
  ..                //           /* so NULL rtx has C = -1 */

if (r->code == REGISTER)     // C == REGISTER?
  ..
else if (r->code == MEMORY)  // C == MEMORY?
  ..

r->mode = 0;  // M = 0
..

Figure 28: C source code with use of rtx pointers and the constraints generated for it

Type casting. Type casting introduces runtime inconsistencies for our constraint solver; the comparison of NULL with an rtx pointer is one such example. The fundamental problem is that the constraint solver assigns types to symbolic variables by faithfully following the source types. If the source type is changed abruptly, the constraint solver throws a runtime exception about the type change of a symbolic variable. Our constraint solver is written in Prolog and supports type-safe casting, and, fortunately, we found that GCC's code generator conforms to type-safe use of C types as well. We therefore do not have to handle type casting specially.

Unions. GCC’s code generator frequently uses unions. To model C unions in our type system, we treat them as tagged unions (i.e., same as structures in C). In other words, the semantics of sharing of C’s unions is not modeled in our constraint system. This is because we expect the code generator to use unions in a type-safe manner. In our experiments, we found that GCC’s code generator meets our expectation.

Other C features. Converting C's bounded-range arithmetic into our constraint system's pure integer constraints is sound only when there are no overflows. If an overflow reflects a programming bug or, more commonly, an assumption about legitimate inputs, its consequences are not serious — our transformation function simply does not capture the input ranges that lead to overflow. But if the overflow is intended and expected, then it poses a problem. We found that this is rare in GCC's code generator; nonetheless, we found one case where the overflow was intended, and our constraint system generated incorrect constraints for it. To solve this problem, we transform the code automatically so that our constraint system generates correct constraints. The code below shows how GCC defines the macro IN_RANGE, which returns true if VALUE is within the range [LOWER, UPPER].

IN_RANGE(VALUE, LOWER, UPPER) = ((unsigned int) (VALUE) - (unsigned int) (LOWER) <= (unsigned int) (UPPER) - (unsigned int) (LOWER))

GCC relies on C's unsigned wraparound converting a negative result into a large positive one, and thus it does not need to check explicitly for the case when VALUE lies in [0, LOWER). Unfortunately, when an IN_RANGE check appears as the condition of an if statement, our system's negation of this check in the else branch is incorrect. Specifically, the negation is defined as follows:

((unsigned int) (VALUE) - (unsigned int) (LOWER) > (unsigned int) (UPPER) - (unsigned int) (LOWER))

Over the unbounded integers of our constraint system, VALUE - LOWER can be negative, so this check ignores the possible values in [0, LOWER), which also satisfy the negation. To solve this problem, arithmetic expressions in C which could potentially trigger overflows are transformed so that C's wraparound semantics is modeled explicitly in the constraint system's unbounded-precision arithmetic. Specifically, for the source statement a = b + c, where a, b and c are unsigned 32-bit integers, we transform it to:

a = (b + c) > MAXUNSIGNED ? b + c - MAXUNSIGNED - 1 : b + c

where MAXUNSIGNED is (unsigned)(-1). Modeling the wraparound explicitly in this way addresses the challenge for EISSEC. A concrete demonstration of the problem follows.
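The following self-contained C program (a hedged illustration; the specific values are ours) picks VALUE = 2, LOWER = 4, UPPER = 8. The bounded-arithmetic IN_RANGE check is false, but so is its naive unbounded-arithmetic negation, so a pure-integer constraint system would wrongly exclude VALUE = 2 from the else branch:

    #include <stdio.h>

    #define IN_RANGE(VALUE, LOWER, UPPER)                      \
        ((unsigned int) (VALUE) - (unsigned int) (LOWER) <=    \
         (unsigned int) (UPPER) - (unsigned int) (LOWER))

    int main(void) {
        int value = 2, lower = 4, upper = 8;
        /* 2 - 4 wraps to 4294967294, which is not <= 4: correctly false. */
        printf("IN_RANGE:       %d\n", IN_RANGE(value, lower, upper));
        /* The negation over unbounded integers: -2 > 4 is also false,
         * so neither constraint admits value = 2. */
        printf("naive negation: %d\n", (value - lower) > (upper - lower));
        return 0;
    }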

4.3 Implementation

EISSEC’s implementation is divided into multiple components.

4.3.1 Source-to-source transformation

Source-to-source transformation is implemented as a plugin to CIL [66], a popular open-source source-to-source transformation system. The plugin consists of around 2600 lines of OCaml code. It utilizes CIL features such as the simplification pass, which generates 3-address code from C code and thereby simplifies the transformation. The plugin also handles other challenges, such as avoiding the transformation of system code included from header files: we obtain the list of directories on the compiler's include search path, treat files included from any of those directories as header files, and avoid transforming them. The transformer needs support code (written in C) to implement various functions such as addOpCons. This support code is approximately 7200 lines of C code; it also includes approximately 3000 lines of C code for the RTL accessor functions. A sketch of what the transformed code looks like is shown below.
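To give a flavor of the transformation, here is a hedged sketch: addOpCons is the support-function name mentioned above, while every other name, signature, and the printing behavior are hypothetical stand-ins, not EISSEC's actual interface:

    #include <stdio.h>

    typedef int sym_t;                  /* handle to a symbolic variable */
    static int gen;                     /* global generation counter     */

    /* Fresh symbolic name for a newly assigned variable (DSA). */
    static sym_t freshSym(const char *name) {
        printf("fresh    %s_%d\n", name, ++gen);
        return gen;
    }

    /* Emit the constraint dst = src1 <op> src2 to the solver. */
    static void addOpCons(const char *op, sym_t dst, sym_t s1, sym_t s2) {
        printf("solver <- v%d = v%d %s v%d\n", dst, s1, op, s2);
    }

    /* Original statement:   x = y + z;
     * Transformed form (conceptually): */
    int main(void) {
        sym_t y = freshSym("y"), z = freshSym("z");
        sym_t x1 = freshSym("x");       /* new generation of x           */
        addOpCons("+", x1, y, z);       /* X1 = Y + Z as a constraint    */
        return 0;
    }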

4.3.2 Dynamic single assignment

One challenge to consider while designing a symbolic execution system is how to provide the value semantics demanded by symbolic execution. To be precise, value semantics means that objects are immutable (which is the case for symbolic variables, since they are immutable). The C language, on the other hand, uses reference semantics, where the value of a variable can be updated multiple times. Note that concrete execution updates variables in memory, whereas symbolic execution generates constraints on the values in memory; in particular, a constraint on the value of a variable holds only until the variable is assigned again. To capture value semantics, the symbolic execution system treats each assignment to a variable as an assignment to a distinct symbolic variable; the alternative of using a variable's runtime address to generate a unique symbolic name does not work under value semantics. We address this challenge by relying on the notion of dynamic single assignment [92]. Specifically, we map every variable to a unique integer (we call it the generation count, and it is used to generate the symbolic name) and assign the variable a new generation count on each subsequent assignment. Similar to SSA (static single assignment), all new variable references use the most recent generation count. A stack of generation counts is used to keep track

of all generation counts. The use of a stack also integrates well with the variable scoping rules of C. Additionally, the map between locations and generation counts for all local variables of a function is built at function entry and destroyed at function exit. A minimal sketch of this bookkeeping follows.
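The sketch below shows the generation-count idea; the data structures and names are ours, and the per-scope stack of generation counts and the per-function location map are omitted for brevity:

    #include <stdio.h>
    #include <string.h>

    #define MAXVARS 16                        /* sketch-sized table       */
    static char names[MAXVARS][32];
    static int  gens[MAXVARS];
    static int  nvars;

    static int slot(const char *n) {          /* find or create a slot    */
        for (int i = 0; i < nvars; i++)
            if (strcmp(names[i], n) == 0) return i;
        strcpy(names[nvars], n);
        return nvars++;
    }

    /* Every write bumps the generation, yielding a fresh symbolic name;
     * constraints on the old name remain valid. */
    static void assign(const char *n) { printf("%s_%d = ...\n", n, ++gens[slot(n)]); }

    /* Every read uses the most recent generation, as in SSA. */
    static void use(const char *n)    { printf("... %s_%d ...\n", n, gens[slot(n)]); }

    int main(void) {
        assign("x");   /* x = 1;  -> constraint on x_1                    */
        use("x");      /* y = x;  -> reads x_1                            */
        assign("x");   /* x = 2;  -> constraint on x_2; x_1 is untouched  */
        use("x");      /* z = x;  -> reads x_2                            */
        return 0;
    }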

4.3.3 Undo records

Whenever a symbolic process reaches a branch point and the branch condition is satisfiable, the symbolic process needs to fork its execution into two processes and follow both paths. The ability to follow both paths is the key difference between symbolic and concrete execution, and it is what lets symbolic execution achieve better coverage than concrete execution. Two key criteria must be satisfied in such forked symbolic execution: (1) after forking, the symbolic states of the two processes should differ only in the branch condition (in the process that follows the if branch the condition must be true, while in the one that follows the else branch it must be false); the rest of the symbolic state must be the same; and (2) the symbolic states of the two processes must be completely isolated from each other (with copy-on-write, they could share storage): changes made by one process to its symbolic state must not be visible in the other symbolic process.

Naive fork() based solution. The two criteria for forked symbolic execution fit perfectly with Unix-style fork(), so using fork() to implement forked symbolic execution is the natural solution; the EXE [21] symbolic execution system follows this intuition. Unfortunately, using fork() for this purpose leads to serious performance degradation. Our experiments revealed that it is not an efficient solution (in line with the view presented in KLEE [19], the follow-up work to EXE). Specifically, the overhead of fork() is considerable relative to the time spent by a forked symbolic process before it terminates: most symbolic processes correspond to error conditions in the code generator, and such paths are terminated almost immediately.

Undo records. Given that a fork()-based solution proved inefficient, we developed our own solution, based on logging and rollback, to satisfy the criteria of forked symbolic execution. Intuitively, the solution schedules the symbolic process for the if branch first and records all the changes that process makes to the symbolic state before it terminates. Once the symbolic process for the if branch has terminated, the solution undoes all the changes made

by that process (to the symbolic and concrete states) and passes the resulting state as the starting state for the symbolic process of the else branch. The overall semantics of our solution is the same as that of fork(), with one important difference: the solution inherently prohibits parallelism, since the processes for the two branches cannot be scheduled in parallel. Given the performance improvement it brings, we found this loss of parallelism tolerable. We call our solution undo records because it maintains a record of the changes made and eventually undoes them.

The undo record solution must satisfy two criteria: (1) keep track of all the changes made to both the symbolic and the concrete state, and roll those changes back when the program path terminates (i.e., the symbolic process exits); and (2) transfer control back to the latest branch point whose paths are still being explored. The first criterion is easy to understand; the second needs some explanation. A program path will most likely contain multiple branch points. When such a path terminates, we need to revisit those branch points in the reverse of the order in which they were visited; these semantics naturally call for a stack of branch points.

A naive way to implement undo records is to snapshot the symbolic process at each branch point and restore that snapshot when the program path terminates. Because the symbolic metadata is maintained as part of the symbolic state, such an approach can work. Unfortunately, it is too inefficient: the snapshot can be large, especially on a 64-bit machine. This approach ignores the fact that most program memory is not writable (specifically, program text and read-only data) and thus need not be snapshotted. Additionally, because branch points are visited frequently, only a few changes occur between consecutive ones, so tracking the changes themselves is cheap. This observation inspired our undo record solution, which combines snapshotting and change-recording techniques. A minimal sketch of the recording half follows.
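The sketch below shows an undo log for concrete memory writes; the structure and names are ours, and EISSEC's records additionally carry the change type and cover symbolic state as well:

    #include <string.h>

    #define MAXREC 1024
    struct undo_rec {
        void  *addr;                   /* location that was changed       */
        size_t size;                   /* size of the access (<= 8 here)  */
        unsigned char old[8];          /* contents before the change      */
    };
    static struct undo_rec log_[MAXREC];
    static int top;                    /* the log grows like a stack      */

    /* Called before every instrumented write. */
    static void record_write(void *addr, size_t size) {
        log_[top].addr = addr;
        log_[top].size = size;
        memcpy(log_[top].old, addr, size);
        top++;
    }

    /* Replay the log backwards down to the mark saved at the latest
     * branch point, restoring the pre-branch concrete state. */
    static void rollback_to(int mark) {
        while (top > mark) {
            top--;
            memcpy(log_[top].addr, log_[top].old, log_[top].size);
        }
    }

    int main(void) {
        int x = 1;
        int mark = top;                      /* branch point reached      */
        record_write(&x, sizeof x); x = 2;   /* if-branch mutates x       */
        rollback_to(mark);                   /* path ended: undo          */
        return x == 1 ? 0 : 1;               /* x is 1 again              */
    }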

Recording symbolic state changes. All changes made to the symbolic state, such as the addition, deletion, and modification of symbolic variables, are recorded before the change is applied. Every symbolic state change record consists of the symbolic variable being changed, the type of the change, and the value of the symbolic variable before the modification. All changes made to the symbolic metadata are also tracked; thus, the effects of dynamic single assignment are also undone in the end. Changes are undone by performing the inverse of the action logged in the record: for instance, if a record says a symbolic variable was deleted, we recreate that variable. In

order to undo constraints sent to the constraint solver, we do not need to send it the inverse of the constraints. This is because the constraint solver itself supports the branching logic of symbolic processes and recovers its state to the last branch point whenever a path terminates (backtracking in Prolog).

Recording concrete state changes. In addition to symbolic state changes, all changes made to the concrete state of the symbolic process need to be tracked as well. Concrete state changes include changes made to program registers and memory. Specifically, all changes made to writable memory, such as the stack, the heap, and writable global data, need to be tracked and rolled back. Our solutions are as follows:

• Program registers. To handle registers, we use a snapshot technique based on setjmp and longjmp: we call setjmp at each branch point and store the resulting jmp_buf with that branch point. When a path terminates, we call longjmp with the jmp_buf of the latest branch point on the stack. This transfers control back to that branch point and recovers the register state correctly.

• Heap and global memory. Unfortunately, setjmp and longjmp do not recover the state of program memory, so we developed an additional solution for memory. Our high-level approach is to track all changes made to memory contents: we instrument every program statement and record the memory locations modified by the statement, the type of the change, the size of the access, and the contents of the memory locations before the change. This works in most cases, but the free() function causes an interesting complication: we cannot use malloc() to undo the effect of free(), because there is no guarantee that the allocator will place the new allocation at the address it occupied before the free(). To solve this problem, we override free() and define it as an empty function that does nothing. Fortunately, GCC's code generator does not perform heap allocations at all, so this approach causes no issues from memory leaks.

• Stack. Unfortunately, the instrumentation-based approach used for the heap and global memory does not work for the stack, because stack memory can be updated in ways that are not visible in the source code: function calls and returns modify activation records, and at the source level these changes are invisible. Additionally, stack updates are far more frequent than heap or global memory updates, so an instrumentation-based solution would be inefficient. We therefore use a snapshot-based solution for the stack. (Accesses to the stack are distinguished from other memory accesses using the stack's top and bottom addresses.) In more detail, at each branch point we take a snapshot of the full program stack (from stack top to stack bottom) and store it with the branch point. We need a full snapshot because a path can terminate in the bottom-most activation record; when we return to such a branch point, all of the intervening activation records would have been destroyed and must be recreated. When the program path terminates, we use the saved stack of the latest branch point to overwrite the current stack. The fact that the program stack is generally small (less than 8KB on Linux in our setting) makes this approach time-efficient at the cost of some space. We opted for this space-inefficient but simple solution because it works well in our context, whereas a more space-efficient approach that we designed led to many complications and made our system unstable. That approach tried to reduce the amount of stack space saved at each branch point: snapshotting only the portion of the stack modified since the previous snapshot can reduce memory consumption considerably, but it complicates the undoing of stack changes, because visits to branch points are not synchronized with the creation and deletion of activation records. These complications made our system buggy, so we chose the simple approach instead. (A minimal sketch of the snapshot-and-longjmp scheme appears after this list.)
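A minimal runnable sketch of the snapshot scheme (all names are ours): setjmp/longjmp recovers the registers, and an explicit copy recovers memory. To stay safe and self-contained, a static array stands in for the live stack region; a faithful implementation must restore the real stack from code that is not itself executing on the bytes being overwritten:

    #include <setjmp.h>
    #include <stdio.h>
    #include <string.h>

    static char fake_stack[64];          /* stand-in for the program stack */

    struct branch_point {
        jmp_buf regs;                    /* register snapshot via setjmp   */
        char    snap[sizeof fake_stack]; /* full stack snapshot            */
    };

    int main(void) {
        struct branch_point bp;
        strcpy(fake_stack, "state at branch point");
        if (setjmp(bp.regs) == 0) {                          /* branch point */
            memcpy(bp.snap, fake_stack, sizeof fake_stack);  /* snapshot     */
            strcpy(fake_stack, "if-branch scribbles");       /* explore if   */
            printf("if branch:   %s\n", fake_stack);
            memcpy(fake_stack, bp.snap, sizeof fake_stack);  /* restore      */
            longjmp(bp.regs, 1);                             /* jump back    */
        }
        printf("else branch: %s\n", fake_stack);             /* state intact */
        return 0;
    }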

Our undo record approach is implemented in around 500 lines of C++ code. The space complexity of the approach is proportional to the length of the program path (in terms of branch points), not to the number of program paths; thus, our system can scale to millions of paths.

4.3.4 Constraint solver

The constraint solver plays an important role in the efficiency of a symbolic execution system. Because constraint solving time dominates the overall runtime of symbolic execution systems, popular systems such as KLEE [19] and CUTE [82] implement optimizations to simplify constraints before sending them to a constraint solver, or to eliminate them altogether. These systems use SAT/SMT solvers such as STP [38] or Z3 [34] as their constraint solvers. SAT solvers solve the SAT problem, whereas SMT solvers are SAT solvers with support for higher-order theories, such as the theory of bit-vectors and the theory of arrays. With these theories, SMT solvers can model high-level language constructs efficiently; for instance, constraints generated from C code that uses arrays can be modeled efficiently using an SMT solver with the theory of arrays. That is why SMT solvers are popular among symbolic execution systems.

Even though SMT solvers are the commonly used constraint solvers in symbolic execution systems, EISSEC uses a finite-domain constraint solver [90] with constraint logic programming (CLP) to model the constraints. The reason we do not use SMT solvers is that they are not efficient at producing multiple answers to the same query18; specifically, they are not engineered to provide all solutions [42, 95]. CLP with a finite-domain constraint solver, on the other hand, is built around the idea that all solutions can be enumerated. In EISSEC, we need all answers to a query whenever we convert a symbolic variable to a concrete one (we called this concretization or "explosion").

The implementation of our constraint solver uses Swiprolog [96], a popular Prolog engine, with its support for CLPFD [90]. GCC's code generator generates only finite-domain constraints, so CLPFD suffices for our purpose. The solver is implemented in around 700 lines of Prolog, supported by 1000 lines of C code. It uses the well-known Prolog concept of backtracking (using fail) and other Swiprolog features such as association lists and the enumeration of all solutions to a query using labeling.

Although we could map most of our requirements onto the predicates of CLPFD and Swiprolog, one problem demanded special treatment: neither CLPFD nor Swiprolog provides predicates to access the set of constraints imposed on the variables. The output mapping rules generated by EISSEC are of the form A → I | C, where A is an assembly instruction that maps to an IR instruction

18It is thus quite logical that symbolic execution systems used in the context of bug finding (where generating even one answer to a query is enough) can use SMT solvers.

I under a set of conditions C; C represents conditions on the variables of A and I. To give an example, push %X → (set (mem (reg esp)) (reg X)) | X = eax, ebx, ... could be a mapping rule for the push instruction with a register operand. Note that X is a variable, and since it can only hold valid register names, there are constraints on it. These constraints are imposed by GCC's code generator and are present in the Swiprolog constraint store. Unfortunately, we did not find any predicates (to the best of our knowledge) to access the constraint store and retrieve the constraints on all the variables of a mapping rule. To solve this problem, we access the constraints through clpfd_attr and their propagators via fd_props19, since Swiprolog stores constraints using attributed variables [91]. To capture the complete set of constraints on all the input and output variables of a mapping rule, we traverse the dependence graph of the variables, starting from the output variables and reaching all the input variables. The dependence graph is an undirected graph whose nodes are variables and whose edges are constraints between variables; the goal of the traversal is to emit the constraints appearing on all paths between the output and the input variables.

The constraint solver runs as a process separate from the transformed C code and acts as a proxy between the transformed C code (which generates constraints) and the Swiprolog engine (which solves them). The C support code provides APIs for the transformed C code to generate constraints (addCons), sends them to the constraint solver, and maintains statistics for diagnostic purposes. It also provides functions that deal with symbolic processes, such as handling a symbolic fork (try()), enumerating all solutions to a query (getNext()), and recording input/output mappings (recordMapping()). A sketch of how the transformed code drives this interface is shown below.
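In the hedged sketch below, addCons, try and getNext are the API names from the text, but their signatures and the stub behavior are our guesses, not EISSEC's actual interface (try is spelled try_ only so the sketch remains plain C):

    #include <stdio.h>

    /* Printing stubs standing in for the real proxy. */
    static void addCons(const char *c) { printf("solver <- %s\n", c); }
    static int  try_(const char *cond) {              /* symbolic fork    */
        printf("branch point: %s satisfiable, exploring\n", cond);
        return 1;
    }
    static int  getNext(const char *var, int *out) {  /* all solutions    */
        static int vals[] = { 0 /* REG */, 2 /* CONST_INT */, 3 /* LABEL */ };
        static int i;
        if (i == (int)(sizeof vals / sizeof vals[0])) return 0;
        printf("solution: %s = %d\n", var, vals[i]);
        *out = vals[i++];
        return 1;
    }

    int main(void) {
        addCons("C >= 0");                 /* r is a valid rtx            */
        if (try_("C = 0"))                 /* e.g., if (r->code == REG)   */
            addCons("M = 1");              /* constraints of that branch  */
        int c;
        while (getNext("C", &c))           /* concretization: enumerate   */
            printf("concrete process with C = %d\n", c);
        return 0;
    }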

4.4 Optimizations

Symbolic-to-concrete conversion ("enumeration", as we described it earlier) leads to a larger number of program paths than we would get if there were only one symbolic process per program path. As the total time taken to extract a code generator function is proportional to the number of program paths, enumeration increases the time needed for function extraction. We therefore developed a few optimizations that reduce the effect of enumeration on the number of paths. To understand how these optimizations help in reducing the number of program paths, we

19These are not standard Prolog predicates, but internal ones.

    1  /* Some definitions */
    2  enum rtx_code {REG = 0, MEM, CONST_INT, LABEL};
    3  int num_ops[] = {1, 2, 1, 1};
    4  #define HAS_SINGLE_OPERAND(code) (num_ops[code] == 1)
    5
    6  /* actual code */
    7  int getType(enum rtx_code code) {
    8    if (HAS_SINGLE_OPERAND(code)) {
    9      ...
    10     if (code == REG)
    11       ...
    12     else if (code == MEM)
    13       ...
    14   } else {
    15     ...
    16   }
    17 }

Figure 29: Sample C source code for concolic execution

first need to understand the problem in more detail. Suppose we want to execute the code in Figure 29 concolically. The code implements a simple function that accepts an rtx code and processes it depending on whether that code corresponds to an RTL term with a single operand or with multiple operands (this is captured in num_ops, an array indexed by code). To perform concolic execution of this code, we make the input code symbolic. Thus, when we start at line 7, we have only one symbolic process. Assuming we decided to treat the num_ops array concretely, the index used to access the array must also be concrete, so we face a symbolic-to-concrete conversion in the condition of the if statement at line 8: instead of one symbolic process, we must enumerate all possible values of code when indexing into num_ops. Given that num_ops has four elements, we generate four concrete processes, one per element of rtx_code. To be precise, in the first concrete process the value of code is REG; in the second it is MEM; and so on. Each of these four processes then continues executing the remaining statements until its program path ends. Given that the condition of the if statement at line 8 is satisfied by three of the four processes, those three enter the block beginning at line 9, while the remaining process enters the else block beginning at line 14. Given that every if statement

can produce two processes for a single input process, the total number of processes that execute the above code is (3*2*2)+1 = 13, and there are thus 13 input-to-output mapping pairs for this code. Had there been no symbolic-to-concrete conversion, we would instead have (1*2*2)+1 = 5 symbolic processes. The example shows that if a symbolic state never needs to be converted into a concrete state, then we do not need to enumerate. Unfortunately, in concolic execution this ideal case is uncommon. Nonetheless, one can make several useful observations about the process of enumeration and reduce its cost. Specifically, if we can reduce the number of concrete processes at line 8 from 3 to 2, or even to 1, then we will have fewer than 13 concrete processes and will finish the concolic execution of the C code more quickly. Such observations form the basis of our optimizations, which aim at approaching the ideal case either by deferring the conversion or by reducing its effect, and thus reduce the number of explored program paths.

4.4.1 Using range and set constraints

Notice that the enumeration above produced four concrete processes because the num_ops array contains four elements; the four processes have code set to REG, MEM, CONST_INT, and LABEL, respectively. Also notice that the condition num_ops[code] == 1 is satisfied by three of them. So instead of using three different processes to represent the case where the number of operands is one, we can use a single process with code set to REG | CONST_INT | LABEL. The idea is that, instead of setting code to a single value, we set it to a set of values; this optimization is sound because the single process captures exactly the semantics of the three processes that set code individually.

Fortunately, our constraint solver can set a symbolic variable to a set of values by relying on swipl's support for domain constraints. Specifically, in swipl, we represent the above set as code = REG | CONST_INT | LABEL. Representing a set constraint this way, however, still requires enumerating all elements of the set. Range constraints make this easier: they represent a set of elements that form a contiguous range. For instance, the set {1, 2, 3, 4} can be represented as the range [1, 4]. swipl supports this notion, and we exploit it to represent sets compactly: the set {REG, CONST_INT, LABEL} can be represented compactly as {REG} and [CONST_INT, LABEL]. In this particular example, because the range covers only two elements, there is little reduction in the

number of processes. However, a set with hundreds of elements that form a range can lead to a significant reduction in the number of processes. By using a combination of set and range constraints, we reduce the number of program paths in the above example from 13 to 7 (7 = 2*2*2+1). The example also represents a common case in GCC's code generator, so by relying on set and range constraints we are able to reduce the number of program paths considerably. We implemented this optimization in the array accessor functions; in GCC, some of the arrays contain hundreds of elements, and with range constraints such arrays can be represented using a few ranges. A minimal sketch of the idea is shown below.
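The sketch below is our own illustration (the printed form is illustrative, not actual swipl syntax): for the concrete num_ops array of Figure 29, it computes the set of codes satisfying num_ops[code] == 1 and emits it as compressed ranges instead of one concrete process per value:

    #include <stdio.h>

    /* num_ops from Figure 29: REG, MEM, CONST_INT, LABEL. */
    static const int num_ops[] = { 1, 2, 1, 1 };
    #define N ((int)(sizeof num_ops / sizeof num_ops[0]))

    int main(void) {
        printf("code in ");
        for (int i = 0; i < N; i++) {
            if (num_ops[i] != 1) continue;        /* not satisfying       */
            int j = i;                            /* extend the run       */
            while (j + 1 < N && num_ops[j + 1] == 1) j++;
            if (i == j) printf("{%d} ", i);       /* singleton            */
            else        printf("[%d,%d] ", i, j); /* contiguous range     */
            i = j;
        }
        printf("\n");                             /* -> code in {0} [2,3] */
        return 0;
    }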

4.4.2 Strength reduction

Another way to avoid symbolic-to-concrete conversion is to avoid operations that index into concrete arrays using symbolic values: if we can express the semantics of such operations in a way that avoids enumeration, we gain further speedup. Strength reduction thus aims at replacing an expensive operation with a semantically equivalent, less expensive one (the term is used in the same sense as in standard compiler optimization terminology). In the context of our example, symbolic-to-concrete conversion is required because we index into a concrete array symbolically. We replace the C check num_ops[code] == 1 with a less expensive check that requires no enumeration; in this case, we rewrite it as code == REG || code == CONST_INT || code == LABEL, avoiding the enumeration. Note that the set constraints discussed earlier achieve the same effect, but with strength reduction the generation of the set constraint is more straightforward. The savings offered by this optimization are directly proportional to the number of elements that would otherwise be enumerated; in our example, we would have only 5 (i.e., 1*2*2+1) paths instead of 13. We implemented this optimization by hand-rewriting the expensive operations into semantically equivalent, less expensive ones, as in the sketch below.
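A sketch of the hand-rewrite, reusing the definitions of Figure 29, with a check that the strength-reduced form agrees with the original on every rtx code:

    #include <assert.h>

    enum rtx_code { REG = 0, MEM, CONST_INT, LABEL };
    static const int num_ops[] = { 1, 2, 1, 1 };

    /* Original check: a symbolic index into a concrete array, which
     * forces symbolic-to-concrete conversion. */
    #define HAS_SINGLE_OPERAND_ARR(c) (num_ops[(c)] == 1)

    /* Strength-reduced check: a disjunction over codes, which needs
     * no enumeration. */
    #define HAS_SINGLE_OPERAND_SR(c) \
        ((c) == REG || (c) == CONST_INT || (c) == LABEL)

    int main(void) {
        for (int c = REG; c <= LABEL; c++)
            assert(HAS_SINGLE_OPERAND_ARR(c) == HAS_SINGLE_OPERAND_SR(c));
        return 0;
    }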

4.4.3 Exploiting hardware-level parallelism

Recall that our undo record solution gives up the inherent parallelism of exploring the if and else branches of a symbolic conditional in parallel. We found that this restriction was necessary so that EISSEC does not overwhelm the CPU with too many live

symbolic processes. But we also found that EISSEC, by default, does not exploit the multi-core parallelism commonly found in contemporary desktop and server CPUs. To put this observation to use, we developed a parallel execution facility within EISSEC. It relies on the following observations: (1) the code generator's RTL-to-assembly translation operates on a decision tree, and (2) the subtrees of the root node represent considerable (and roughly equal) portions of the code generator source code. Thus, if we schedule N different subtrees of the root node in parallel, we ideally get a speedup of N; because all subtrees represent a similar amount of symbolic execution work, parallelism should yield maximum benefit. This optimization is a restricted form of parallelism: we parallelize only across the top-level subtrees, and within each subtree there is no parallelism. Because N is not much larger than the number of CPU cores, this optimization does not overwhelm the CPU. In the context of our example, we would schedule the if and else branches of the top-level if statement (at line 8) in parallel; if both branches do a similar amount of work, this yields a parallelism of two. A sketch of the per-subtree scheduling follows.
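The sketch below is our own rendering of the scheduling (the process structure is ours; constraint propagation is represented by print statements). Worker i first asserts the negations of the root conditions of subtrees 0 through i-1, mirroring the sequential semantics described in Section 4.5.1:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NSUBTREES 6

    static void explore_subtree(int i) {
        for (int j = 0; j < i; j++)
            printf("subtree %d: assert NOT(root_cond_%d)\n", i, j);
        printf("subtree %d: assert root_cond_%d, explore subtree\n", i, i);
    }

    int main(void) {
        for (int i = 0; i < NSUBTREES; i++) {
            if (fork() == 0) {             /* one worker per subtree   */
                explore_subtree(i);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)             /* join all workers         */
            ;
        return 0;
    }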

4.5 Evaluation

The evaluation of EISSEC is divided into three parts. In the first part, we evaluate the performance of EISSEC in extracting the x86 semantic model, including the impact of the different optimizations. In the second part, we analyze the extracted model in terms of soundness, completeness, and the ability to support binaries produced by other compilers; more importantly, we compare the x86 model extracted by EISSEC with that of LISC. Because EISSEC covers all possible assembly instructions produced by the code generator, testing the LISC model using the EISSEC model is an interesting experiment. In the last part, we analyze the effort involved in applying EISSEC to extract a model for another architecture (AVR). The evaluations in this section were performed on an Intel Core i7-3517U ultrabook running at 1.90GHz, equipped with 4GB of RAM and 4GB of swap space, and running Linux kernel 3.13.0-53 with the 32-bit Ubuntu-14.04 desktop OS. We performed symbolic execution of the x86 code generator from gcc-4.6.4 and the AVR code generator from avr-gcc-v4.8.2.

4.5.1 Model extraction performance

GCC's decision tree. We first discuss the RTL-to-assembly decision tree, which is the focus of our symbolic execution; with reference to this tree, we also define the terms used in our evaluation. We performed symbolic execution of the GCC-4.6.4 x86 code generator, which contains 2244 RTL-to-assembly mapping rules for x86. These rules cover all of the advanced x86 instruction sets, such as SSE and AVX. GCC's decision tree, which is one large C function, implements them in approximately 120,000 lines of C code. The decision tree accepts an RTL instruction as input and outputs assembly instruction(s). The nodes of the decision tree represent specific constraints on the fields of the RTL, and the leaves represent assembly instructions; thus, the nodes along the path from the root to a leaf represent the set of constraints satisfied by the input RTL that maps to the assembly instruction at that leaf. (As a side note, GCC generates this decision tree from an architecture-specific machine description file when GCC is compiled for that architecture.) Sample C code for the code generator and the decision tree produced by GCC for it are shown in Figure 30. Two things should be mentioned: instead of showing the conditions at the nodes, we show them on the edges, and gray nodes represent error conditions. With reference to this decision tree, we define our terms below.

• Positive paths and failure paths. A path in the decision tree is a path from the root of the tree to any leaf node. Because the input to the tree is an RTL instruction and the output is the corresponding assembly instruction, a path in the decision tree corresponds to a single RTL-to-assembly mapping. A positive path corresponds to a valid RTL-to-assembly mapping; by valid RTL, we mean RTL that the decision tree accepts as correct and that maps to a valid assembly instruction supported by the target architecture. Failure paths represent exactly the opposite: cases where the input RTL is invalid, so the code generator produces no assembly instruction for it. In other words, failure paths never end with the production of a valid assembly instruction; they end by producing an error code.

    assembly translate_rtl(rtl) {
      if (CODE(rtl) == 1) {
        /* push */
        rtlch = CHILD(rtl);
        if (CODE(rtlch) == REG)
          return "push %VAL(rtlch)";
        elif (CODE(rtlch) == MEM)
          return "push (VAL(rtlch))";
        elif (CODE(rtlch) == IMM)
          return "push $VAL(rtlch)";
        else
          // Unsupported operand
          return error;
      } else if (CODE(rtl) == 2) {
        /* mov */
        /* operands of mov */
      } else
        // Unsupported instruction
        return error;
    }

(a) Sample C code for the decision tree

    translate_rtl
    ├─ CODE(rtl) == 1 → push
    │   ├─ CODE(rtlch) == REG → push %VAL(rtlch)
    │   ├─ CODE(rtlch) != REG && CODE(rtlch) == MEM → push (VAL(rtlch))
    │   ├─ CODE(rtlch) != REG && CODE(rtlch) != MEM && CODE(rtlch) == IMM → push $VAL(rtlch)
    │   └─ CODE(rtlch) != REG && CODE(rtlch) != MEM && CODE(rtlch) != IMM → return error
    ├─ CODE(rtl) != 1 && CODE(rtl) == 2 → mov
    └─ CODE(rtl) != 1 && CODE(rtl) != 2 → return error

(b) Decision tree

Figure 30: Sample C code and its decision tree

In the example figure, "translate_rtl -> push -> return error" is a failure path, while "translate_rtl -> push -> push %VAL(rtlch)" is a positive path.

• Length of a path. We measure the length of a path in the decision tree by counting the number of nodes along the path. A node in our decision tree represents a branching point where a condition decides where execution continues; thus, the length of a path is the number of conditionals along it. Because the conjunction of the conditions along a path (i.e., the path condition PC, as it is called in symbolic execution systems) represents the condition on the input, the length of a path equals the number of clauses in PC. For instance, the length of "translate_rtl -> push -> push %VAL(rtlch)" is 2. (It is not 3 because we count only the interior nodes on the path, not the leaves.)

• Coverage. Coverage is defined as the percentage of leaf nodes visited by EISSEC (at least once) out of all the leaf nodes in the decision tree. Other coverage criteria exist, such as the number of source lines covered or the number of branches covered. Because our goal is to obtain the complete semantic model for the target architecture, we found our definition of coverage effective for evaluating EISSEC; moreover, unlike test case generation or bug finding, where most symbolic execution systems are used, our problem demands covering all program paths, so coverage criteria based on source lines do not serve our purpose. For instance, in the example tree, if we produced "push %VAL(rtlch)" then we have covered 20% of the leaves (note that the "mov" node does not have a child).

Performance of extracting the complete x86 model. We evaluate the performance of EISSEC in traversing all paths and extracting the complete x86 model. Detailed results for the symbolic execution of the complete x86 code generator are shown in Figure 31. These numbers represent the baseline for further comparison: we did not enable any optimizations during this experiment. Thus, in the worst case, we take 13 CPU days to extract the complete semantic model. We make the following observations:

    Total Time  Number of  Failure    Positive   Path    Number of    Coverage  Virtual
    (in Days)   Paths      Paths      Paths      Length  Constraints  (in %)    Memory
                (in M)     (in M)     (in M)             (in B)                 (in MB)
    0.00        0          0          0          0       0.00         0         17
    0.51        23         18.1       4.5        123     0.26         13        20
    0.89        34         28.4       5.8        119     0.48         15        19.7
    1.27        46         38.6       7.1        135     0.70         17        20.4
    1.65        57         49.9       7.3        146     1.00         17        21
    2.04        69         61.2       7.6        131     1.35         19        20.2
    3.17        103        94.9       8.5        26      1.58         21        17.6
    4.88        137        127.6      9.0        82      1.95         37        18.9
    6.09        167        156.3      11.2       92      2.46         49        19.2
    7.31        241        223.3      17.4       37      3.16         66        18.1
    8.53        266        248.0      17.6       68      3.45         69        19
    9.75        287        268.5      18.1       71      3.65         70        18.7
    10.42       307        287.5      19.2       74      3.83         79        18.7
    11.55       340        316.9      23.5       35      3.91         87        17.8
    12.22       360        336.0      24.3       42      4.49         91        18
    13.11       386        360.2      25.6       1       5.96         100       17.3

Figure 31: EISSEC performance (without optimizations) to extract complete x86 model

• Total paths, positive paths, and failure paths. The total number of paths is large throughout the experiment because there are many failure paths: our analysis revealed that between 80-94% (on different days) of the total paths traversed by the system ended as failure paths. The reason for the large number of failure paths is that a positive path is a conjunction of the constraints imposed by the nodes along that path, and the negation of such a conjunction leads to failure paths. The negation of a conjunction with N conditions can be represented in N different ways. (It is not 2^N - 1 because, in a decision tree, the N conditions are checked in a particular sequence, so the conjunction is negated by negating each condition one by one.) In other words, for a single positive path with N internal nodes, there can be at most N failure paths. In the experiment, the highest path length we observed was 146, meaning that in the worst case there would be 146 failure paths for a single positive path; in practice, there were around 4X to 15X failure paths per positive path.

Paths per assembly instruction. That is only half the answer; the question of why there are so many positive paths remains. There are two important reasons: (1) a mapping rule produced by EISSEC contains an assembly instruction together with a particular operand combination, and (2) the effect of symbolic-to-concrete conversion. For the first reason, realize that the output of a code generator is a complete assembly instruction along with its operands, for instance push %eax rather than just the mnemonic push. Because GCC prints operands in different forms, it is natural that there are multiple mapping rules for a single assembly instruction. For instance, for the push instruction, which accepts operands of different types (register, memory, immediate, etc.), there are multiple rules corresponding to printing operands in the following syntactic notations (in AT&T syntax): %reg, $immediate, (%reg), (%reg,%reg), (%reg,%reg,const), and const(%reg,%reg,const), etc. Thus, for the single mnemonic push, there are multiple mapping rules (positive paths), each corresponding to a particular operand combination. For mnemonics with multiple operands (e.g., add), the number of possible operand combinations is even greater.

While analyzing the number of positive paths, we also came across the effects of symbolic-to-concrete conversion. To explain with an example: for the single mapping rule push %reg, we found multiple mapping rules corresponding to the different ways in which x86 registers are mapped in RTL. For instance, all 16-bit parts of 32-bit general purpose registers correspond to RTL with its mode field set to HImode, whereas all 32-bit registers correspond to RTL with that field set to SImode. By generating multiple mapping rules for push %reg, our symbolic execution system thus enumerates the different values of the mode field in RTL. Moreover, because of symbolic-to-concrete conversion of RTL fields other than mode that are unrelated to register size, our system can produce duplicate mapping rules; for instance, for 32-bit registers it may produce eight different mapping rules (corresponding to the eight 32-bit registers of x86) instead of a single rule covering all 32-bit registers.

In relation to the number of assembly instructions, the number of positive paths is a

large number. We address this issue in two ways. First, we implement an optimization that reduces the impact of symbolic-to-concrete conversion. Second, instead of keeping the mapping rules (positive paths) as a list, we build an automaton (a transducer) from them; the automaton finds the common parts across terms and thus reduces the model size considerably.

• Length of a path, percentage coverage, and memory consumed. As expected, the coverage of the x86 code generator reaches 100%. It is interesting, however, that our system gained coverage faster during some parts of the experiment than during others. Specifically, within the initial 2 days, EISSEC covered 20% of the total paths, while in the next 3 days it covered only around 17% more. This is because of the difference between the maximum path lengths in the parts of the code generator being explored at those two times: in the first 2 days the maximum path length was 146, while for the next 3 days it was 82, and a longer path most likely leads to a subtree with more leaves than a shorter one. The effect of path length on coverage is also visible towards the end of the experiment, where the improvement in coverage slowed considerably as our system explored a relatively small subtree of the decision tree.

It is also interesting to note that the maximum path length dropped drastically (from 131 to 26) on the third day. We found that this was because our system finished exploring a subtree of the decision tree around that time and was about to enter the next one. Recall that we execute both branches of a conditional sequentially; consequently, all children of the root node are visited in DFS fashion. Such sudden dips in path length are also visible in Figure 32a, and we found that they correspond to finishing the exploration of a large subtree and moving to the next one. This is also visible in Figure 32b, in which the slope of the lines corresponding to paths is high during roughly the first half of the experiment and decreases towards the end. Although we do not parallelize the branches of a conditional, we do exploit hardware-level parallelism by exploring subtrees of the root node in parallel.

It is also important to note that the amount of virtual memory consumed by our system (including the constraint solver) is small; specifically, the memory consumed is proportional to the maximum path length. This is exactly the benefit of the undo record feature in our system, because at every node (branch point) along a path we

[Plot (a): maximum path length, coverage (in %), and virtual memory (in MB) versus time (in days).]
(a) Relation between maximum path length, coverage and virtual memory

[Plot (b): number of paths (in millions) and maximum path length versus time (in days).]
(b) Relation between number of paths and the maximum length of the path

Figure 32: Relation between number of paths, coverage and virtual memory

need to store the machine state (CPU and memory) in order to explore the other child of that node. Given that the complete exploration of a code generator is an all-path problem, we believe that our decision to use sequential execution kept the demand on memory low; as a consequence, we could complete the exploration of the full code generator on a machine with 4GB of RAM.

    Parameters                      Without          + Range        + Strength    + Top-level
                                    Optimizations    Constraints    Reduction     Parallelism
    CPU time (in Days)              13.11            9.20           9.04          6.64
    Number of Paths (in M)          386.00           270.00         263.88        263.88
    Failure Paths (in M)            360.20           259.10         257.56        257.56
    Positive Paths (in M)           25.60            10.90          6.32          6.32
    Number of Constraints (in B)    5.96             3.71           3.45          3.45

Figure 33: EISSEC performance with optimizations to extract complete x86 model

EISSEC performance with optimizations. To understand the effectiveness of the different optimizations, we systematically enabled them one by one and measured the key performance parameters. The results are summarized in Figure 33. The best time to extract the complete x86 model was around 6 days and 14 hours. In the figure, optimizations are enabled cumulatively from left to right: for any column, all optimizations in the columns to its left are also enabled. Note that we do not include other parameters such as coverage, maximum path length, and virtual memory, because they are the same irrespective of the optimizations. We make the following observations from the figure.

• Range constraints. Among the three optimizations, range constraints yield the largest reduction in CPU time (around 42% relative to the base case). We had expected an even greater reduction from this optimization, because the cases where symbolic-to-concrete conversion occurs operate on sets of 120 to 140 elements. To be precise, one such case used an RTL code (which is symbolic) to index into an array of 128 elements and checked whether the value in the array satisfied a condition; in the worst case, symbolic-to-concrete conversion would produce 128 concrete processes from a single symbolic process. By relying on the condition, we were able to represent the semantics using 24 ranges, reducing the symbolic-to-concrete conversion factor from 128 to 24. With a factor of 5 reduction there, we expected a similar benefit from the optimization overall. However, our analysis pointed to the other symbolic-to-concrete conversion sites as the culprit: to realize the full effect of this optimization, every site along a path that performs symbolic-to-concrete conversion must benefit from it. We could not obtain a factor of 5 reduction at some of those sites, and consequently we do not realize the full effect of the optimization.

• Strength reduction. Strength reduction follows ideas similar to range constraints, so with range constraints already enabled we did not expect significant additional savings from it. This proved true: we obtained around a 1% further reduction in CPU time. We still see some reduction in the number of positive paths, because some syntactic structures defeat the range constraint optimization; in such cases, our system cannot see that it could use range constraints instead of symbolic-to-concrete conversion. By rewriting these syntactic constructs, strength reduction in turn enables range constraints.

• Parallelism. We explored all 6 subtrees of the root of the decision tree in parallel, creating one process per subtree and adding the appropriate constraints from the preceding subtrees to each succeeding one. To be precise, the roots of these subtrees represent different conditions on the input, and under sequential execution, reaching the second condition always means that the first condition was not satisfied. To capture these semantics correctly, we must negate the conditions at the roots of the earlier subtrees and add them to the constraint set of the next subtree. With a factor of 6 parallelism, we observed only a 36% improvement in CPU time. The reason is the significant difference in the heights of the subtrees: four of the six subtrees have the same height, which is around one third of the height of the remaining two. The four shorter subtrees were therefore explored very quickly, and the exploration of the last two dominated the runtime. Note that, for this optimization, all other parameters have the same values as for strength reduction, because those parameters are independent of sequential or parallel execution; parallel execution only utilizes more cores to complete the exploration of the decision tree.

A better approach to parallelism could be explored in the future. Specifically, instead of restricting parallelism to the top-level subtrees, we could employ parallelism at every branch point depending on CPU utilization: at every branch point, if the CPU is not fully utilized, we explore both branches of the conditional in parallel; otherwise, we explore them sequentially. We believe that such an approach could yield the maximum improvement under

the constraint of a given CPU limit.

4.5.2 Soundness, completeness, and model size

Soundness. To check the soundness of the extracted x86 model, we followed the semantic equivalence test from LISC. This soundness test did not find any violations other than the two (for call and jmp) already reported in the LISC evaluation.

How we obtained assembly and IR pairs for this test needs explanation. Recall that, in LISC, we used binaries from the Ubuntu-14.04 distribution as the test data. For EISSEC, we followed a different approach: we used the EISSEC mapping rules themselves to produce concrete assembly and IR pairs. We could do this because EISSEC's mapping rules precisely capture the constraints on the input operands that satisfy each rule; by using the constraint solver to find satisfying assignments for the input operands, we can concretize assembly and IR pairs. Being able to produce the set of all valid operand values is one of the advantages of symbolic execution; in contrast, compiled binaries may contain only a commonly used subset of all operand combinations.

Next, we used the x86 model obtained with EISSEC to test the soundness of the x86 model obtained with LISC. This is interesting to evaluate because the generalizations in LISC can lead to soundness issues. Our approach consisted of three steps: (1) generate a set of concrete assembly instructions using the EISSEC model, (2) translate those instructions to IR using LISC, and (3) run the semantic equivalence test on the original assembly instruction and the translated IR. With this approach, we did not find any soundness issues in the LISC model.

Because LISC cannot infer the constraints on the input and output operands precisely, it can accept operands that it should not. For instance, relying only on the code generator logs, LISC cannot infer that xmm registers cannot be used as the base register in memory operands; thus, it would accept an instruction such as "mov %eax,(%xmm0)". LISC accepts this instruction because it has seen instructions such as "mov %eax,(%ebx)", which it generalizes to allow any register combination. To find out how many such instances LISC accepts, we used the EISSEC-extracted model to generate invalid assembly instructions, by negating the constraints on the input and output operands and generating assembly instructions that satisfy the negated constraints. We found a number of cases where LISC accepted invalid assembly instructions. Note, however, that we did not come across such cases while evaluating LISC on Ubuntu-14.04 binaries, because GCC would never produce these invalid assembly

instructions. That is why the acceptance of invalid assembly instructions does not represent a serious practical problem for LISC.

Completeness. Measuring the completeness of a model extracted by EISSEC is a challenge in itself. Symbolic execution is expected to produce the "complete" model used by the code generator; to measure the completeness of such a model, we would have to compare it against the architecture manual. But such a comparison is practically difficult to automate (recall that this is one of the reasons we decided to use the code generator for model extraction in the first place!), so we followed the same completeness tests as in LISC. Specifically, we performed two tests: (1) comparison against all binaries in Ubuntu-14.04, and (2) comparison against LLVM-produced binaries. The first test tells us whether we covered all instructions that the code generator could produce, while the second tells us exactly which instructions are not supported by this particular version of GCC.

On the Ubuntu binary test, we were able to translate 99.66% of the assembly instructions (without any manual effort). In comparison, the LISC x86 model was able to translate 99.49% of the instructions. EISSEC could translate 0.17% more instructions because it did not face the issue of missing operand combinations. We also found that, of the 49 missing instructions reported for LISC, 2 were not actually missing (rcl and rcr); the remaining 47 instructions were also absent from the model extracted by EISSEC.

We then evaluated the EISSEC model against the LLVM-produced binaries from the LISC evaluation. We were able to translate 97.45% of the instructions in those binaries (vs. 96.03% with LISC). The slight improvement was again contributed by better coverage of operand combinations. Unfortunately, we could not translate the assembly instructions whose mnemonics were missing in the LISC LLVM test; those mnemonics were also missing from the EISSEC model.

We then used the EISSEC x86 model to test the completeness of the LISC x86 model. This evaluation is interesting because Ubuntu binaries may not cover all possible operand combinations of the assembly instructions, whereas the EISSEC model can produce such combinations. To perform this evaluation, we used the EISSEC model to generate a list of assembly instructions along with all possible operand combinations known to GCC. Out of around 7000 operand combinations of assembly instructions, we found that around 260 were not accepted by LISC. One example is push $immediate: the GCC-produced code generator logs did not contain such instructions. Another case was a memory operand written in the form of base, index and scale (e.g., 8(%eax,%ebx)): GCC logs did not contain this operand form for many instructions.

Model size. We applied the LISC transducer construction algorithm to the set of mapping rules obtained by EISSEC. Note that the number of mapping rules for EISSEC (6.32M) is greater than the number of parameterized rules for LISC (370K). This is due to the effect of symbolic-to-concrete conversion and because EISSEC mapping rules contain the enumerated mode field: in EISSEC, the mode of an RTL is explicitly encoded in the rule. Because the modes of various registers differ (e.g., the mode of eax differs from that of ax), there are more mapping rules in EISSEC than in LISC. LISC does not treat the RTL mode as a parameter; it instead infers the mode of a register from its name.

Even though the mode field for the same assembly instruction may have different values across EISSEC mapping rules, the rest of the fields are the same, so the transducer construction algorithm exploited the common parts effectively. Nevertheless, to handle the different values of the mode field, the constructed transducer has more states and edges than LISC's: compared to the LISC transducer with 133K edges, the EISSEC transducer has 797K edges. Construction of the transducer took around 80 minutes.

We believe these results deliver on the promise of EISSEC: by covering all paths inside the code generator, we expected to obtain a model more complete than LISC's, and the results match that expectation. On the other hand, it is also interesting that the model extracted by LISC was not much less complete than that of EISSEC. This implies that, by compiling a number of packages with various compilation options, one can explore almost all of the code generator.

4.5.3 Model extraction for AVR

After extracting an x86 model using symbolic execution, we followed the same steps to extract an AVR model from the code generator of avr-gcc-v4.8.2.

Complexity of the AVR code generator. The AVR architecture has 76 mnemonics, and the code generator used for the symbolic execution contains 98 RTL-to-assembly mapping pairs. The decision tree used by the code generator takes approximately 12K lines of C code.

    Total Time  Number of  Failure    Positive   Path    Number of    Coverage  Virtual
    (in mins)   Paths      Paths      Paths      Length  Constraints  (in %)    Memory
                (in K)     (in K)     (in K)             (in M)                 (in MB)
    0           0          0          0          0       0.00         0         3.2
    20          11.2       9.1        2.1        17      2.8          13        6
    40          36.2       29.2       7          29      5.6          49        8.3
    60          56.3       48.5       7.8        21      8.3          71        7.3
    87          73.1       63.4       9.7        7       11.2         100       5.8

Figure 34: EISSEC performance (with optimizations) to extract a complete AVR model

Time to apply EISSEC to AVR. To enable symbolic execution of a code generator, EISSEC relies on manual modifications to parts of the decision tree. Mostly, these modifications are required where the C code of the decision tree does not conform to the restrictions enforced by our source-to-source transformation system; at several other places, we rely on manual modifications to enable the optimizations. Although the manual modifications are scattered across several places, the resulting porting effort is modest: applying EISSEC to AVR took approximately 7 hours and 45 minutes of manual modification effort. Note that this time does not include the time to extract the AVR model by executing the decision tree symbolically.

The model extraction performance. After the manual modifications to the code generator decision tree, we applied EISSEC to transform the modified decision tree and execute it symbolically. We enabled all the optimizations that we enabled for the x86 model extraction. The AVR decision tree has 2 subtrees under the root, so we explored both of them in parallel. The detailed performance results are shown in Figure 34; the columns have the same meaning as in the x86 evaluation (but different units, e.g., thousands instead of millions). In total, EISSEC took 1 hour and 27 minutes to extract the complete model from the AVR code generator. This time is considerably less than that for the complete exploration of the x86 code generator, because AVR is an embedded-system architecture with only 76 instructions and only 3 modes (8-bit, 16-bit, and 32-bit); consequently, the number of program paths explored by EISSEC is also much smaller than in the x86 exploration.

Completeness of the extracted model. To evaluate the completeness of the extracted AVR model, we applied it to translate the same set of coreutils binaries (ls, cp, cat, echo, and head) used in the LISC test. The model was able to translate all of the assembly instructions from these binaries. Moreover, in terms of the number of mnemonics, it contained 72 out of the total 76 mnemonics from the AVR architecture manual. Similar to the results reported in the LISC evaluation, we found that the remaining 4 mnemonics (break, nop, wdr, and sleep) are not supported by GCC's code generator.

4.6 Related Work

The idea of symbolic execution was proposed in the 1976 paper by King [49]. The classical symbolic execution introduced in this paper has two key limitations. First, the limitations of the constraint solver in solving symbolic formulas can limit the applicability of symbolic execution. For instance, if a branch condition generates a non-linear constraint, then a linear constraint solver cannot solve it, and the symbolic execution cannot proceed [39, 82]. Second, all the program parts that could be called from the program under symbolic execution need to be symbolically executed. Specifically, that means all the libraries used by the program need to be symbolically executed. In fact, if a library used by the program invokes system calls with symbolic arguments, then even the system call handlers need to be symbolically executed. Given the complexity of modern software (such as its interactions with the environment, the number of libraries used, etc.) and the path explosion problem (the fact that the number of program paths grows exponentially in the number of static program branches), classical symbolic execution of modern software is practically infeasible. That is why, in the last decade or so, a great deal of research [82, 39, 19, 21, 20, 40, 72, 24] has focused on improving classical symbolic execution so that it can be applied to modern and complex software. Significant improvements in the computational power of SAT/SMT solvers have also fueled this interest considerably. These symbolic execution techniques are generally called modern symbolic execution techniques. Given the improvements brought by modern symbolic execution systems, some of these systems are now actively used in the software industry for software testing [40, 87, 13].

Modern symbolic execution systems. Modern symbolic execution systems address the limitations of classical symbolic execution by mixing concrete execution with symbolic execution. (Mixed concrete and symbolic execution is called concolic execution [82].) In other words, these systems perform symbolic execution as long as they can, but if they reach a point where symbolic execution is not possible (such as when the constraint solver cannot solve the formula, or a function call or a system call is made and we cannot symbolically execute the callee code), then they concretize the symbolic variables. Concretization essentially consists of assigning a concrete value to a symbolic variable. This is done by asking the constraint solver for a possible value that could be taken by the variable; one can then simply select any one of the possible values randomly [39]. The drawback of concretization is that once we concretize a symbolic variable, the execution will follow only one path from the program tree rooted at the point of concretization; had we continued symbolically, we would have traversed all of the program paths rooted at that point. This is generally referred to as the incompleteness of concolic execution. Although concretization reduces the completeness of symbolic execution, it helps us get past points where classical symbolic execution would be stuck. Moreover, concolic execution helps in executing large and uninteresting parts of complex software concretely, and allows us to perform symbolic execution of selected parts of a program. This is one of the crucial advantages of concolic execution, which we also exploit in our work. Specifically, a compiler code generator might refer to other parts of the compiler, but we are interested in symbolic execution of the code generator only. Concolic execution, in such a case, allows us to treat the other “uninteresting” parts of the compiler concretely. But instead of selecting one value randomly at the time of concretization, we choose all possible instantiations of a symbolic variable. We will now discuss common challenges faced by modern symbolic execution systems, and contrast existing solutions to these challenges with ours.

4.6.1 Path explosion problem

The path explosion problem is one of the key reasons why classical symbolic execution cannot be applied to modern complex software. Concolic execution tries to address this challenge by pruning some program paths, treating some parts of the program concretely. Fundamentally, the path explosion problem is unsolvable: the number of program paths grows exponentially, but machines have limited physical resources (such as memory). So all of the existing approaches try to address this problem in the context of limited physical resources. These approaches can be classified into the following two categories: (1) path prioritization, and (2) path pruning.

Path prioritization tries to prioritize paths which are likely to yield “interesting” results over others. Techniques which follow this approach [19] use various heuristics for decision making. An example of such a heuristic is coverage-based selection: select a path which is likely to explore unexplored parts of a program. Intuitively, the approach of these techniques can be thought of as making a best effort to fulfill the analysis objective (bug finding, etc.) without worrying about covering all program paths.

Path pruning, on the other hand, tries to reduce the number of program paths either by merging them [50, 84, 29] or by pruning redundant paths [16]. Intuitively, multiple concrete executions can be combined by representing them using a single symbolic execution; merging multiple such concrete executions can thus reduce the number of program paths considerably. Techniques such as RWset [16], on the other hand, try to prune redundant program paths. One of their criteria for finding a redundant path relies on the insight that if a program execution reaches some program point which has been visited before, and both times the symbolic state at that point is the same, then the execution will follow the same program path from that point onwards. Such redundant program executions can then be killed, which also reduces the number of program paths.

EISSEC also relies on techniques similar to those discussed above to address the path explosion problem. EISSEC, unfortunately, does not really have a choice in terms of path selection or prioritization: to obtain a complete model of the instruction set semantics, we need to cover all program paths. So EISSEC does not use any path selection heuristic or algorithm. Its path selection can thus be considered a systematic exploration; we visit all program paths one by one by visiting the if branch of a condition first and then going into the else branch. Nevertheless, EISSEC uses optimizations such as range and set constraints and strength reduction to merge multiple paths together. Intuitively, the whole idea behind these optimizations is to defer symbolic-to-concrete conversion as much as possible and thus avoid splitting program paths as much as possible.

4.6.2 Constraint solver inefficiencies

The constraint solver plays a key role in symbolic execution systems because, in many programs, constraint solving time dominates the overall runtime. Moreover, some programs which cannot be executed symbolically might produce constraints which bog down constraint solvers. Consequently, a lot of attention has been paid to improving constraint solver abilities and performance. This has led to a number of efficient SMT solvers, such as STP [38] and Z3 [34], used in popular systems such as KLEE and Microsoft's SAGE, respectively. Given that SMT solvers support various theories, such as arrays, that match closely with the semantics of the C language, most systems use SMT solvers as their decision procedures. Furthermore, given that constraint solving dominates overall runtime, most systems employ optimizations for query processing. Specifically, they try to eliminate queries as much as possible [19], employ a counter-example-guided cache [19], or perform simple static optimizations on queries [82, 19] before sending them to solvers. In our case, we did not employ any such query optimizations.

Unfortunately, SAT/SMT solvers are designed to efficiently find only one solution to a given formula. In other words, existing solvers are not efficient for the all-SAT problem (finding all solutions to a SAT problem), necessitating modifications to the core of these solvers [42, 95]. Fortunately, in the context of bug finding or software testing, where these solvers are mostly used, producing one solution (or a few solutions) to a formula is enough; there is rarely a need to solve the all-SAT problem. Unfortunately, the model extraction problem addressed by EISSEC is an all-SAT problem: in order to ensure a complete semantic model, we need to find all inputs which reach a particular program path. Given that SAT/SMT solvers are not efficient for the all-SAT problem, we decided to use solvers based on constraint logic programming (CLP). CLP languages and their solvers have dealt with the problem of producing all solutions to a formula (given that the problem space is not too large) much more efficiently than SAT/SMT solvers. We discussed the details of our solver in the Implementation section, so we will not repeat that discussion here.

4.6.3 Symbolic execution for function extraction

Given that symbolic execution allows exploration of all program paths, it can be used for problems related to verifying program properties. For instance, the work presented in [10] uses symbolic execution to extract and verify cryptographic protocol models from their C implementations. Although the high-level idea of using symbolic execution to extract a function from C code is similar to EISSEC, there are a number of differences. Specifically, the complexity of the cryptographic code handled in this work is several orders of magnitude less than that of GCC's code generator. For instance, the highest number of C lines symbolically executed by them is around 1000, while GCC's code generator is approximately 120K lines of C code. Moreover, their system confines symbolic execution to the main path in the code and can only handle protocol implementations with no significant branching. In order to reduce the complexity of symbolic execution, they manually build semantic models for commonly used cryptographic functions; EISSEC, on the other hand, would use concrete execution in such cases. The approach described in [94] extracts models of GUI programs from hand-held devices and compares the extracted models with the expected models (obtained from specifications). Because of their context, their notion of a model is different from ours. To be precise, their work is mainly concerned with extracting models that capture how the system responds to user inputs. Consequently, their model is a state machine which captures system transitions on various event inputs. EISSEC's definition of a model, on the other hand, is simply the mapping between input (RTL) and output (assembly instructions).

Another application of the symbolic execution of a complete program is in program comprehension or software maintenance [70, 43]. To be precise, reverse engineering using symbolic execution can help maintainers in understanding code which is poorly documented. Preliminary work by Pichler et al. [70] describes an approach similar to ours for extracting specifications from programs. Although the overall ideas are similar to ours, there is one important difference. Specifically, this work uses dynamic symbolic execution to perform concrete program execution along with symbolic execution: whenever symbolic execution reaches its limits, results from concrete execution are used to get past them. EISSEC, on the other hand, starts with symbolic execution only; whenever symbolic-to-concrete conversion occurs, all possible concrete values of a symbolic variable are enumerated systematically. Another difference is that this work, being preliminary, does not present any evaluation results, so it is hard to compare their system with EISSEC.

4.7 Summary

In this section, we described our symbolic execution system (called EISSEC) to explore GCC's code generator for model extraction. We had two questions in performing symbolic execution of the code generator: (1) how does the model extracted using symbolic execution compare with that obtained using LISC, and (2) whether we can address the all-path problem in the context of GCC's code generator. As compared to most existing symbolic execution systems, which limit their exploration to a few paths, our problem demands covering all the paths of a code generator. In order to solve this problem, we described several design choices (such as the undo record), and we believe that the evaluation results demonstrated the effectiveness of our choices. In terms of comparison with the model extracted using LISC, we found interesting results. Specifically, we were able to cover more operand combinations and instructions than LISC. We believe that these results speak to the effectiveness of our symbolic execution system. Moreover, they also meet the expectations with which we started EISSEC.

5 ArCheck: CHECKING CORRECTNESS OF EXTRACTED SEMANTIC MODELS

In the previous sections, we discussed our approach for extracting instruction set semantics from compiler code generators. But before we can use the extracted semantic models in applications (e.g., to build a binary translation system), we need to ensure that they are both correct and complete. The models may have correctness issues because compilers, given their complexity, may contain bugs. Bugs in code generators may result in generating incorrect assembly instruction(s) for some IR instructions. Such inconsistencies can introduce correctness issues in the applications built using these models. That is why it is necessary to remove these inconsistencies before the models can be used in any application. To address this problem, in this section, we develop an approach, called ArCheck, to check the correctness of the models extracted from code generators. Because completeness of the models has been discussed and evaluated earlier (in LISC and EISSEC), we will not discuss it here.

ArCheck is applicable to a number of problems. Although we demonstrate our approach on the models extracted from compiler code generators, the techniques that we develop in this section can be used to test semantic models extracted from any source. Thus, ArCheck has very general applicability. Specifically, below are two of the possible applications of ArCheck.

• Semantic models are used by a number of software systems such as binary translators, virtual machines, system emulators, etc. Most of these systems play vital roles in software security. That is why it is important to ensure that these systems are tested thoroughly. ArCheck can be used to address this problem and to test the semantic models used by these systems.

• Another interesting application of ArCheck is to test compiler code generators themselves. Because the models are extracted from compiler code generators, testing these models also tests the code generators. Recent research [97] has demonstrated that modern compilers such as GCC and LLVM contain many bugs, and these bugs highlight the limitations of existing compiler testing techniques. ArCheck thus makes an important contribution to the state of the art in compiler testing.

5.1 Models and Inconsistencies

Given this general applicability, we will first discuss our technique for testing models extracted from any source, and then apply it to the models extracted from code generators.

5.1.1 Types of inconsistencies in semantic models

Recall that a semantic model captures the mapping between target assembly instructions and their semantics. Given this, the following are the possible inconsistencies in semantic models.

• I1: Abstraction of semantics. This kind of inconsistency occurs when the semantics of an assembly instruction is captured correctly but loosely. For instance, for the x86 add instruction, which modifies particular bits of EFLAGS, such as the carry flag (CF) and the overflow flag (OF), the semantics of add might say only that EFLAGS is modified, without detailing which bits are modified and how. This occurs frequently in compilers.

• I2: Omission of semantics. This kind of inconsistency occurs when some of the semantics of an assembly instruction are omitted while modeling that instruction. This is a correctness issue in the model because the instruction semantics are not modeled correctly. An example of omission of semantics is forgetting to mention that the x86 add instruction may modify the EFLAGS register.

• I3: Over-provisioning of semantics. This kind of inconsistency occurs when the captured semantics of an instruction is more than the actual semantics, for instance, when the semantics for the x86 mov instruction contains a clobber of EFLAGS even though mov does not clobber it.

Note that the above definition of inconsistencies is with respect to modeling the semantics of assembly instructions in terms of IR. In this section, we talk about inconsistencies in modeling the semantics of assembly instructions in the compiler IR. We take this view because this is how architecture specifications are usually written: specification writers encode the semantics of the target machine instructions in terms of compiler IR instructions. Although we check the modeling of assembly instructions in IR for inconsistencies, the approach developed in this section is generic and can be applied to check IR-to-assembly modeling inconsistencies as well.

5.1.2 How to detect inconsistencies in semantic models

Conceptually, to detect inconsistencies we check that, for every instruction, its semantics are “correct”. Let us assume the existence of an oracle which tells us the “correct” semantics of every assembly instruction. To detect inconsistencies, for every assembly instruction in the model, we obtain its semantics from the oracle and check if it is the same as the semantics present in the model. Architecture manuals or actual hardware CPUs can function as such an oracle. (For this work, we assume that the hardware manuals and the processors are “correct.”) Unfortunately, architecture manuals are not in an easily parsable form, so it is hard to obtain semantics from them automatically. CPUs, on the other hand, offer a way to automate the process of semantics extraction. We rely on CPUs to function as the oracle for this study.

Can we perform semantic equivalence checking of the abstract rules themselves? Use of hardware to obtain the semantics of assembly instructions requires concretization of the abstract mapping rules from the semantic models, because we cannot execute abstract assembly instructions on the hardware CPU.

Symbolic equivalence checking [75] (or symbolic simulation) is a commonly used approach to prove the equivalence of two programs. It is a popular technique in the electronic design verification field for checking the equivalence of RTL or gate-level descriptions in HDL. The idea behind symbolic equivalence checking is to obtain symbolic formulas that capture the semantics of the programs to be checked, and then to prove that those formulas are equivalent by using constraint solvers or theorem provers. Symbolic formulas are obtained by executing programs symbolically (i.e., by making program inputs symbolic, instead of executing concretely). Symbolic input values propagate during program execution from source to destination. Furthermore, whenever control flow branches in a program, all branches are explored unless one of them is a provably infeasible path. Symbolic traversal of program paths ends when program execution ends. The equivalence of the formulas for each program path of the two programs is then checked using a constraint solver.

Unfortunately, in our case, symbolic equivalence checking cannot be used to prove the equivalence of abstract IR-to-assembly mapping rules. This is because equivalence checking demands representations of both IR and assembly instructions that are amenable to symbolic execution. The representation of IR instructions satisfies this criterion, but that of assembly instructions does not. Consequently, if we want to apply symbolic equivalence checking to assembly instructions, then we must manually encode the semantics of each assembly instruction in some traversal-friendly representation. As discussed in Section 1, this is exactly what we want to avoid, given that such manual encoding is often the root of many modeling errors.

Had it been the case that there were two different semantic models M and M′, such that M = {⟨A, II1⟩}, where II1 is an IR instruction used to represent the semantics of an assembly instruction A in some intermediate language, and M′ = {⟨A, II2⟩}, where II2 is an IR instruction for the same assembly instruction in some other intermediate language, then we could apply symbolic equivalence checking to both IRs for the same assembly instruction. But, unfortunately, when we have only one model, we cannot perform symbolic equivalence checking on abstract rules.

How to convert an abstract mapping rule into a concrete one? Recall that the difference between abstract and concrete mapping rules is that the operands of concrete mapping rules are concrete. So, conceptually, to convert abstract mapping rules into concrete rules, we must produce concrete operands in place of abstract operands. The process of concretization will generate multiple concrete rules for a single abstract rule. The question then is whether we should generate all of the possible concrete rules, or only a few. This question translates into an even more fundamental question: how many concrete mapping rules do we need to test using semantic equivalence checking to obtain the same level of confidence that we would have obtained had we tested the abstract rules directly? Note that, in the worst case, we would need to test all possible concrete rules to test a single abstract rule. This is undesirable, because the number of concrete rules grows combinatorially in the number of operands. For instance, when an instruction has a single operand which can be a register, and the target architecture has 6 registers, then there would be 6 concrete rules; if there are 2 operands of type register, then there would be a maximum of 36 concrete rules. We can, however, systematically eliminate some concrete rules. Intuitively, ways to eliminate rules arise from understanding how we can constrain the operands of abstract mapping rules.

• Constraints defining invalid operands and their combinations. One of the easiest ways to eliminate concrete rules is to rely on constraints that restrict the types of operands the instructions can have. For instance, the x86 mov instruction accepts two operands, both of which cannot be memory operands simultaneously. Such instruction-specific operand constraints are encoded in the applications that use semantic models. For instance, the architecture specifications used by the code generators of modern compilers contain such constraints for assembly instructions. One way to enforce such constraints is to generate a concrete rule and use these applications to check whether the rule is valid. This approach eliminates the need to enforce such constraints ourselves.

• Constraints specifying functionally similar operands and their combinations. Once rules containing invalid operands are eliminated, the remaining rules are all valid. To eliminate some of the valid rules, we rely on the functional similarity of assembly instructions containing the same mnemonic but different operand combinations. By functional similarity we mean that most of the operand combinations of an assembly mnemonic generally perform similar functions. For instance, the x86 instructions mov %eax,%ebx and mov %eax,%ecx perform the same function of moving values between two registers. We can select one representative for all of the functionally similar combinations; for instance, if we generate a concrete rule for the eax and ebx operand combination, then we can eliminate the eax and ecx combination. Relying on this observation, for every architecture, we define operand combinations which perform the same function for abstract rules.

A generic approach to generate concrete rules from abstract rules could be: (1) for every abstract rule, find out how many operands it takes; let us denote this number by K; (2) using N different sets of functionally similar operands, generate N^K different operand combinations (note that N^K is reasonably low in practice); (3) feed the N^K different instances of concrete rules to the code generator and check which of them are valid.

5.1.3 Detecting inconsistencies in compilers’ semantic models

To detect inconsistencies in the compiler semantic models, we first formalize the semantics of assembly and IR instructions. Both IR and assembly instructions operate on processor state. Hence their semantics can be formalized in terms of the changes they make to the processor state. Note that an IR's view of processor state can be more limited than the assembly-level view. For instance, the processor state may capture detailed information about condition codes and other details of a processor's internal state, which may not be of interest to the code generator. A natural approach, therefore, is to expose only a subset of the assembly-level state at the IR level. This discussion leads to the following definition that formalizes a notion of correspondence between processor states at the IR and assembly levels.

Definition 4 (State at assembly- and IR-level) The state SA at the assembly level is given by an assignment of values to a set of variables VA representing the processor's user-space accessible registers and memory. Each variable v ∈ VA takes values from an appropriate domain.

The state SI at the IR level is given by an assignment of values to a set of variables VI representing the processor's state as viewed by the code generator. We assume that VI ⊆ VA. The set of permitted values for a variable v at the IR level includes all of the values permissible at the assembly level, plus a special value ⊤ that captures the idea that the value is unknown or undefined. SI is said to be valid for an IR snippet I if for every variable v read by I (excluding those variables that are first updated and then read by I), SI(v) ≠ ⊤.

Definition 5 (Semantics of an assembly and IR instruction) The semantics of an instruction at the IR (or assembly) level can be understood in terms of how it modifies the processor state. We use the notation I : S′ −→ S″ to denote that the execution of I in state S′ leads to a new state S″.

Use of actual hardware as the oracle introduces a complication in the problem. Semantics of an assembly instruction obtained from hardware would be in the form of a hardware state, and we would need to produce IR semantics in terms of hardware state for comparison. An easier approach to obtain IR semantics in terms of hardware state is to develop a “hardware” for the IR on which we can “execute” the IR and obtain its semantics. Once both semantics — the semantics of an assembly instruction and that of its IR — are obtained in terms of hardware state, comparing them is easy.

Definition 6 (Processor state correspondence) States SA and SI are said to correspond, denoted SA ∼ SI, if:

∀v ∈ VA : (SI(v) = SA(v)) ∨ (SI(v) = ⊤)
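To make Definition 6 concrete, the correspondence check can be written as a simple loop over the assembly-level state variables. The following is a minimal sketch, not ArCheck's actual implementation; the State type, NUM_VARS, and the TOP sentinel are hypothetical names introduced here for illustration (a real implementation would track ⊤ out of band rather than reserving a value for it).

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_VARS 16            /* assumed: registers plus tracked memory slots */
    #define TOP      UINT64_MAX    /* assumed sentinel for the unknown value (⊤) */

    typedef struct {
        uint64_t var[NUM_VARS];    /* assignment of values to state variables */
    } State;

    /* Returns true iff SA ~ SI: for every variable, the IR-level state either
       agrees with the assembly-level state or leaves the variable unspecified. */
    bool states_correspond(const State *sa, const State *si) {
        for (int v = 0; v < NUM_VARS; v++) {
            if (si->var[v] != sa->var[v] && si->var[v] != TOP)
                return false;      /* conflicting concrete values */
        }
        return true;
    }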

Alternatively, one can say that SI is a conservative approximation of SA: either they agree on the value of a state variable, or SI leaves it unspecified. The latter choice is made by the developers of machine descriptions, who choose not to model the exact value of a state variable, but simply state that an instruction “clobbers” it. (This imprecision is deliberate, as it makes it possible to develop machine descriptions that work across variants of a processor. It also provides a way out when the instruction set description is ambiguous or unclear about the final value of a state variable.)

While precision of the model can be an important property for some applications of semantic models, correctness is a fundamental property without which the applications of the models would need to worry about soundness themselves. Moreover, if we required precision for state correspondence, then too many inconsistencies would be reported. That is why, for compiler semantic models, we do not consider I2 as an inconsistency. Nonetheless, by tweaking the definition of state correspondence, it is quite easy to change this decision. I1 and I3, on the other hand, must be considered for state correspondence because they represent cases in which the semantic equivalence of an assembly instruction and its modeling in IR — one of the important guarantees given by code generators, as per our assumptions — is violated.

Definition 7 (Soundness of IR) I is said to be a sound abstraction of the semantics of A, denoted I ∼ A, if the following condition holds for every state SI that is valid for I and all states SA such that SA ∼ SI:

(I : SI −→ S′I) ∧ (A : SA −→ S′A) ⇒ S′A ∼ S′I

A mapping rule ⟨I, A⟩ is said to be sound if I ∼ A.

From the definition of state correspondence, it is easy to see that if I is not sound for A, then there will be at least one variable v that has conflicting values in S′A and S′I. This means that a subsequent instruction I′ that relies on v will likely diverge from the behavior of A′ even when I′ and A′ are equivalent in every way. In other words, a code generator that emits A for I will generate incorrect code unless it ensures that none of the following instructions rely on v's value. Doing so complicates the code generator logic, because the mappings from IR to assembly then become context-specific. More importantly, such an approach seems to defeat the purpose of the architecture specifications used in compilers such as GCC and LLVM: the purpose of these specifications is to relieve the compiler from reasoning about the semantic equivalence of IR and assembly, yet this step requires that very same task to be performed! For this reason, we believe that code generators will translate I into A when I is sound for A.

Definition 8 (Testing extracted model for inconsistencies) Let I be an IR snippet and A be the corresponding assembly code present in the extracted model. The problem of testing the extracted model for inconsistencies is to test every such ⟨I, A⟩ pair to determine whether I ∼ A.

The problem of testing an extracted model for inconsistencies is also the same as the problem of testing code generators for correctness. Formally, it can be defined as below.

Definition 9 (Checking correctness of compiler code generators) Let I be an IR snippet, and A be the assembly instruction produced by a code generator G for I. The problem of checking the correctness of G is the same as the problem of testing the model extracted from G for inconsistencies.

Unfortunately, given that a state is defined by the contents of registers and memory locations (Definition 4), the number of possible start states is enormous. Thus, verifying the semantic equivalence of mapping rules by exhaustive enumeration of start states is practically infeasible. Instead of verifying the semantic equivalence of the mapping rules, we therefore test their equivalence under some specific start states, with the aim of uncovering as many semantic differences as possible. An important challenge faced by our approach is the design of the start state generation strategy: the strategy must uncover the maximum number of semantic differences with the smallest number of start states.

Checking the correctness of mapping rules also helps in finding semantic equivalence bugs in the code generator architecture specifications. That is why we call our approach ArCheck (Architecture Specification Checking). In other words, we check whether the architecture specifications used by the code generators conform to the specifications implemented by the actual hardware. An important contribution of our approach is that it is architecture-neutral and very practical: given the challenging problem of verifying modern compilers, we develop an approach which can easily be applied by code generator developers at development time to detect correctness issues in the architecture specifications. We believe that such checking will in turn help developers discover many compiler bugs ahead of time which could otherwise lead to compiler crashes or, even worse, generation of wrong code.

Any software testing technique must address two fundamental problems: test generation and result verification. For result verification, as mentioned before, we use the hardware CPU as the test oracle. Test case generation is one of the fundamental challenges that any software testing approach must address. To handle this challenge, we develop a coverage testing strategy to expose most of the semantic differences in the extracted mapping rules. Generating test cases to check the semantics of assembly instructions is a very hard problem because it requires one to know the exact semantics of each and every instruction. Consequently, developing an automated test generation strategy, and one that considers assembly instructions of multiple architectures at that, is very challenging. We exploit an important observation to tackle this problem: IR instructions often expose the semantics of assembly instructions in detail. Consequently, if we develop a test case generation strategy by leveraging instructions in architecture-neutral languages, then such a strategy can be effectively automated and can also be applied to multiple architectures.

5.2 Our Approach

Our high-level approach, called ArCheck, to check the correctness of semantic models extracted from GCC's code generator is depicted in Figure 35. We describe the main steps of ArCheck below.

• Instantiation. The first step of ArCheck is to instantiate the abstract rules obtained from the extracted semantic model. Since we have already described a generic approach to obtain concrete rules from abstract rules, we will not repeat it here. Instead, we describe how one can apply that approach to x86. An important challenge in instantiation is how to define functionally similar operands and their combinations. This requires familiarity with the target architecture of the extracted model and the functions performed by different operand combinations. For x86-32, we have defined the following functionally similar categories:

– general-purpose registers: eax, ebx, ecx, edx, esi, and edi
– stack-related registers: esp and ebp
– constant integers: [INT32_MIN, INT32_MAX]
– constant memory addresses: [0, 4GB]
– memory operands: ConstInt(Reg), ConstInt(Reg,Reg,ConstInt), etc.

Note that we have not specified the details of some other categories, such as constant floats. Nonetheless, the above description captures the essential notion of functionally similar categories. In all, there are around 6 different categories of functionally similar x86 operands. For an abstract rule containing mov %1,%2, we would generate 26 different concrete rules by systematically selecting both operands from each of the 6 categories such that duplicate operand combinations are avoided. Note that not all of these 26 combinations are valid; e.g., mov $100,$200 is an invalid combination. We realize this after feeding such a mov instruction to the code generator. Also note that we have made a single category of memory operands, because putting them in one category eliminates a possible source of explosion in the number of operand combinations.

Figure 35: Design of ArCheck. (The figure shows the pipeline: (1) instantiation converts abstract rules of model M into concrete IR-to-assembly mapping rules via abstract-to-concrete conversion; (2) a state generator produces start states S1, S2, ..., Sn; (3) test execution and result comparison obtains IR semantics from an IR interpreter and assembly semantics from the CPU, and checks whether they are equal.)

Another challenge is how many instances of an abstract rule we should test. Intuitively, the higher the ratio of concrete rules tested per abstract rule, the higher the level of confidence in the soundness of the extracted models. We therefore considered three different selection points: one instance, multiple instances, and all possible instances. The all-possible-instances option is the same as exhaustive enumeration of all operand combinations of an abstract rule. To reflect these three options, we have developed three different modes for generating different numbers of concrete rules from abstract rules:

– single-instance mode (SI): in this mode, one valid operand combination out of the N^K possibilities is selected while instantiating an abstract rule.

– multiple-instances mode (MI): in this mode, multiple valid operand combinations (such as 8 — the idea is to pick a number between 1 and N^K for the target architecture) are selected while instantiating an abstract rule.

– restricted exhaustive enumeration mode (REE): since testing all N^K possible operand combinations of an abstract rule may not be practically feasible, we developed a restricted exhaustive enumeration mode, where the number of concrete rules generated per abstract rule is restricted to a sufficiently high number (closer to N^K, but still such that testing is practically feasible).

A minimal sketch of this enumeration follows.
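The sketch below enumerates operand combinations for a two-operand abstract rule over the N functionally similar categories, with a limit that models the three modes. The category representatives and the helper itself are hypothetical illustrations, not ArCheck's code; the validity check (feeding each concrete rule to the code generator) is elided, and a general-K implementation would recurse instead of nesting loops.

    #include <stdio.h>

    #define N 6   /* number of functionally similar categories (x86-32) */

    /* One hypothetical representative per category. */
    static const char *category_rep[N] = {
        "%eax",      /* general-purpose registers */
        "%esp",      /* stack-related registers   */
        "$1",        /* constant integers         */
        "$0x1000",   /* constant memory addresses */
        "4(%eax)",   /* memory operands           */
        "$1.0"       /* constant floats (assumed) */
    };

    /* Emit up to `limit` of the N^K combinations for a K = 2 operand rule:
       limit = 1 models SI mode, a small limit models MI mode, and a limit
       close to N^K models REE mode. */
    void instantiate(const char *mnemonic, int limit) {
        int emitted = 0;
        for (int i = 0; i < N && emitted < limit; i++) {
            for (int j = 0; j < N && emitted < limit; j++) {
                /* a real implementation would now ask the code generator
                   whether this concrete rule is valid, rejecting, e.g.,
                   "mov $100,$200" */
                printf("%s %s,%s\n", mnemonic, category_rep[i], category_rep[j]);
                emitted++;
            }
        }
    }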

• Generating start states. Once the abstract rules are instantiated, the next step in the process consists of generating the start states to be used for the semantic equivalence test. ArCheck relies on white-box analysis of IR instructions to generate start states for a given mapping rule.

• Test execution and result comparison. The last step in ArCheck consists of using the generated start states to obtain the semantics of an IR instruction and the corresponding assembly instruction from a given concrete mapping rule. To obtain the semantics of an IR instruction, we have developed an IR interpreter which accepts the IR instruction and the start states generated by the previous step as input. It then interprets that IR instruction after setting the interpreter in the given start state. The difference between the end state of the interpreter and its start state is the semantics of that IR instruction. To obtain the semantics of an assembly instruction, we have developed a hardware-based oracle. Once the semantics of both the IR and the assembly instruction are obtained, we compare them for equivalence, as sketched below.
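Put together, the three steps amount to the following test loop. This is an illustrative sketch only: interpret_ir, execute_on_cpu, and the State layout are hypothetical stand-ins for the components described in this section, and states_correspond is the correspondence check sketched after Definition 6.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_VARS 16
    typedef struct { uint64_t var[NUM_VARS]; } State;   /* as sketched earlier */

    /* Assumed components: the IR interpreter, the hardware-based oracle,
       and the state correspondence check of Definition 6. */
    extern void interpret_ir(const char *ir, const State *start, State *end);
    extern void execute_on_cpu(const char *insn, const State *start, State *end);
    extern bool states_correspond(const State *sa, const State *si);

    /* Test one concrete mapping rule <ir, insn> against n start states. */
    bool test_mapping_rule(const char *ir, const char *insn,
                           const State *starts, int n) {
        for (int i = 0; i < n; i++) {
            State si, sa;
            interpret_ir(ir, &starts[i], &si);       /* IR-level semantics  */
            execute_on_cpu(insn, &starts[i], &sa);   /* CPU-level semantics */
            if (!states_correspond(&sa, &si)) {
                printf("semantic difference for start state %d\n", i);
                return false;                        /* rule is not sound   */
            }
        }
        return true;
    }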

We will now describe the state generation and the test execution and result comparison steps in detail.

5.2.1 Start state generation

Generating start states for checking the correctness of mapping rules from the extracted model is one of the important challenges for our approach. This is because the goal of ArCheck is to achieve all of the following objectives: (1) maximize the possibility of finding semantic differences in the mapping rules, (2) minimize the number of start states generated for a given mapping rule, and (3) develop a strategy that can be easily applied to test models of other architectures. Note that these objectives actually outline the requirements of a test generation strategy for the problem of checking the correctness of mapping rules: given the challenges of verifying modern compilers and the seemingly infeasible approach of exhaustive enumeration, how can one develop a practical approach for achieving maximum confidence in the correctness of extracted models (and in code generators in general)?

What are “interesting” outcomes and “interesting” inputs. We believe that the first two objectives mentioned above naturally lead to the following requirements: (a) the generated test inputs must “represent” most of the remaining inputs from the input space, and (b) the number of generated test inputs must be a reasonably small subset of the input space. By “represent” we mean that the semantics of an instruction for the representative inputs should be the same as that for the non-representative inputs. Requirement (b) says that such representative inputs must be a reasonably small subset so that testing of a single mapping rule is practically feasible.

An ideal test strategy that satisfies both requirements would partition the input space according to the inputs for which an IR or assembly instruction exhibits the same semantics, and then select one representative test input from each partition. We believe that such a strategy maximizes the confidence in the soundness of mapping rules, because the selected test inputs explore the semantics of an instruction thoroughly. That is why we call them “interesting” test inputs.

Unfortunately, generating “interesting” test inputs for each and every mapping pair is practically infeasible. Note that the notion of “interesting” inputs assumes that we already know the semantics of an instruction for each and every input value. Obtaining such knowledge is practically infeasible because (1) it would require us to exhaustively explore the whole input space and obtain the instruction semantics for each input, and (2) the semantics of an instruction depends on the values of its operands. For instance, for the assembly instruction from Figure 5, if eax contains -2, then the output of add is 0 and, on x86, the zero flag will be set. On the other hand, if the instruction were add $1,%eax and eax contained -2, then the output would have been -1. The point here is that the same inputs to different assembly instructions produce different semantics. Thus, if we wanted to generate “interesting” inputs, then we would need to generate them for each and every mapping pair. That is why we do not follow this approach.

Instead, we make an important observation: one can enumerate the different output values corresponding to the different semantics of an instruction, and then obtain inputs which produce those output values. Since the number of different semantics for an instruction is small, such enumeration is feasible. Additionally, such an approach avoids the problems in generating “interesting” inputs and, at the same time, satisfies all of our requirements. For instance, for both add $1,%eax and add $2,%eax, one of the desired outputs will be 0, because the value 0 is treated specially by most architectures. Once we fix our desired outputs, we can find out which inputs lead to those outputs. That way we do not need to perform exhaustive enumeration at all, and we still obtain “interesting” inputs. A set of desired outputs is what we call “interesting” outcomes. Specifying such “interesting” outcomes and deriving “interesting” inputs from them is the core of ArCheck.

We will now concretize our discussion of “interesting” outcomes. Taking Figure 5 as an example, the plus operator in the IR performs a signed 32-bit addition, and the various possible outcomes are: (1) a positive value in the range [1, INT32_MAX], (2) 0, and (3) a negative value in the range [INT32_MIN, -1]. The outcomes are divided into such categories to capture the different behaviors the CPU might have for different outcomes. For example, for the x86 add instruction, when the sum is 0, the zero flag is set, but it is not set in the other cases. In addition, the signedness and the size are also factors in choosing the categories, since they are relevant for possible overflows during addition. Therefore, “interesting” inputs to test the soundness of a mapping pair are those which generate outcomes belonging to every category at least once. In this example, one can simply select the boundary values of the above two ranges and 0, i.e., {INT32_MIN, -1, 0, INT32_MAX}, as the interesting outcomes.

With the above discussion, it is quite easy to see why random testing would not be a good strategy for our purpose: random testing has no notion of partitioning the space of possible inputs and outcomes. Furthermore, as the number of partitions of the input space increases, its confidence in the soundness of mapping rules decreases.
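As a small worked illustration of this idea (a sketch, not ArCheck's code), the following derives an “interesting” start value of eax for each interesting outcome of add $2,%eax by solving eax + 2 = outcome under 32-bit wraparound. The computation is done in unsigned arithmetic because signed overflow is undefined behavior in C.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* one representative outcome per partition, plus boundary values */
        const int32_t outcomes[] = { INT32_MIN, -1, 0, INT32_MAX };
        const int32_t imm = 2;                     /* the constant operand */

        for (int i = 0; i < 4; i++) {
            /* solve eax + imm == outcome (mod 2^32) */
            int32_t eax = (int32_t)((uint32_t)outcomes[i] - (uint32_t)imm);
            printf("start eax = %d produces outcome %d\n",
                   (int)eax, (int)outcomes[i]);
        }
        return 0;
    }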

Figure 36: Constraint generation and propagation for start state generation. (The figure shows the IR tree for the add example, annotated at each node with the expected-output set O, the input set I, the constraint set C, and the state set S; e.g., at the root, O = {0} and I = {(0, 0), (1, -1), (2, -2), ..., (-1, 1), (-2, 2)}. Constraints propagate top-down, candidate assignments at each child are marked Accept or Reject, and the state sets compose bottom-up, e.g., S = S1 ∪ S2.)

5.2.2 Obtaining “interesting” test inputs: constraint generation, propagation and solving

Unfortunately, designing a generic strategy for generating “interesting” test inputs for a mapping rule is a challenging problem. This is because such inputs depend on a number of factors, such as the types of source and destination (e.g., 1-byte vs. 4-byte), the way in which input data is used (e.g., is the value in the operand used as a number or as a memory address for an indirect reference), etc. Additionally, since we are going to use the obtained test inputs to obtain the semantics of an assembly instruction, it is necessary that the constraints imposed by the physical machine state are considered while generating start states. For instance, it is better if esp and ebp point to R+W pages rather than inaccessible pages. While it is not absolutely necessary that such constraints be satisfied, we consider them because invalid memory access errors do not represent interesting semantics for us: compiler IRs do not have a notion of physical memory layout and, consequently, no notion of invalid memory accesses.

Given the discussion about the possible constraints on the inputs, it is easy to see that the test case generation problem that we are trying to solve can be formulated as a constraint satisfaction problem. “Interesting” test inputs are then simply solutions of this constraint satisfaction problem. A way to formulate the constraint satisfaction problem from the white-box analysis of the IR instruction for add $2,%eax is shown in Figure 36. Notice that in the figure IR instructions are represented as trees. We obtain start states from the IR tree in the figure as follows. Set O represents the set of outputs of the IR instruction. In the figure, we set it to {0}; i.e., we are trying to solve the following constraint satisfaction problem: “find a 4-byte signed integer value for eax which, when added to 2, produces the output 0.” S is a set representing the state initializations which satisfy the constraints. For the example in the figure, S contains an assignment of eax.

Once we decide set O for a given mapping pair, the analysis starts by propagating the set O of the root of the IR tree to the set O of its first right child. This is done because the operation performed by the RTL is actually the root of the right subtree; that is why the O of plus is the same as that of set. The plus node then looks at the constraints C that can be enforced on I, the set of inputs which produce the expected output. For this example, it enforces the constraint that the addition must be signed and 32-bit. Solving this constraint produces, in set I, all pairs of signed 32-bit integers which add to 0. The plus node then takes every pair of I, and propagates its first element to its left child and its second element to its right child. This is done until all children of a node accept the assignment of values. For (1, -1), the reg:eax node accepts the assignment because it satisfies its constraints, but the const_int 2 node does not accept the assignment of -1 because it does not satisfy its constraint. The analysis then continues this process until it reaches (-2, 2), for which both children of plus accept the assignment. Once a satisfying assignment is found, the analysis stops exploring further input pairs. This is because we are interested in only one representative input for any given partition of the input space; note that all pairs of I which generate the same element of O are in one partition. Once reg:eax accepts the assignment, S is updated with the value for eax. Since const_int 2 does not correspond to any part of the state, it does not update S. The set S for plus is then obtained by unifying the S sets of all of its children for a satisfying assignment. Set S is thus formed in a bottom-up fashion, while the constraints propagate in a top-down fashion. S for set thus contains the element eax : -2.

An algorithm to generate “interesting” inputs for IR-to-assembly mapping rules is given in Figure 37. The algorithm starts with a call to the generate_start_states function with ir, the IR instruction of a mapping rule, and it ends by returning the set S of start states to test the given mapping pair. Using the function get_outputs, passing it the top-level IR operator op and its type, it obtains a set of expected outputs for the given mapping pair. For instance, for the RTL 32-bit signed plus operator, it would obtain {INT32_MIN, 0, 1, INT32_MAX} as the expected outputs. Note that get_outputs is manually specified, and its implementation will typically be different for each IR operator. For this reason, we have not shown it in the figure. Since the number of operators in the IR is relatively small, the effort involved in this step is small as well. Despite being manual, our approach does provide a systematic way to divide outputs into equivalence classes, while enabling us to leverage human knowledge

generate_start_states(ir):
    S = {}
    op = TopOp(ir)
    O = get_outputs(op, Type(op))
    foreach oi in O do
        // get the start state which would produce output oi
        // and add it to S, the set of all start states
        S.add(get_inputs_for_output(ir, oi))
    done
    return S

get_inputs_for_output(ir, o):
    op = TopOp(ir)
    I = get_inputs(op, o, Type(op))
    H = Children(ir)
    s = {}
    if H = ∅ then                      // leaf node
        if check_cons(o, ir) then      // does the input match the constraints?
            s.add(o)
            return s
        fi
    fi
    foreach (i1, ..., in) in I do
        foreach ir_j in H do
            // propagate constraints to children
            s_j = get_inputs_for_output(ir_j, i_j)
        done
        if none of s1, ..., sn are empty then
            // all children have valid input assignments;
            // now combine them
            s = merge({s1, ..., sn})
            return s
        fi
    done
    return {}                          // no satisfying assignment found

Figure 37: An algorithm for obtaining “interesting” inputs from an IR instruction of a mapping rule

and insight to minimize the number of such classes. Once a set of expected outputs is obtained, the function get_inputs_for_output is called to obtain a set of inputs for each of the expected outputs. In order to find inputs which produce the specified output, get_inputs_for_output first calls get_inputs, passing it the operator, the expected output value, and the type. For the “add” example, one of the calls would be get_inputs(plus, 0, Int32). Note that the set C discussed in Figure 36 can be seen as the collection {plus, Int32} here. get_inputs returns a set I of all pairs of 32-bit integers which, when added, produce the expected output. I is a set of tuples whose arity matches that of the IR operator: for plus, tuples have 2 elements, while for unary negation, tuples have only 1 element. Note that get_inputs might return an empty set in case it cannot find any inputs which satisfy the combination of op, o, and Type(op). In that case, constraint propagation cannot proceed, and we return immediately. On the other hand, if I is non-empty, then constraint propagation starts, as an element of set I now becomes the expected output of the children of ir. In order to propagate constraints, the algorithm obtains the set H of all children of the given IR instruction. H is empty for leaf nodes of the IR, in which case the algorithm simply checks whether the constraints imposed by that IR node are satisfied by the input to that node. For instance, for the (mem (reg eax)) RTL expression, it would check whether the input value is a valid memory address which points to an accessible page. Note that the input of a leaf node is the same as o; that is why we simply check using the value of o. Given the small number of IR expressions which impose constraints at leaf nodes, we have simply enumerated these constraints manually. If a constraint is matched even for one input value, the algorithm updates state s and returns it, because we are only interested in finding a single representative. If the constraints are not satisfied by the input value, then it continues with the next input value. When H is non-empty, the input values from a tuple (i1, ..., in) are passed to the respective children of the input IR. If all children accept the assignment, then the algorithm returns the obtained state s as the output; otherwise it continues with the next tuple. If no satisfying assignment is found for all children, then the empty set is returned.
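For concreteness, here is a minimal sketch of what a get_inputs-style helper for the 32-bit signed plus operator might look like. In ArCheck this role is played by the Prolog constraint solver; the C version below is only illustrative and enumerates candidate pairs lazily (one pair per call) rather than materializing the full set I.

    #include <stdbool.h>
    #include <stdint.h>

    /* Fill (*a, *b) with the idx-th pair whose 32-bit sum equals o, in the
       enumeration order suggested by Figure 36 for o = 0:
       (0, 0), (1, -1), (2, -2), ...  Wraparound is modeled with unsigned
       arithmetic, since signed overflow is undefined behavior in C. */
    bool get_inputs_plus32(int32_t o, uint64_t idx, int32_t *a, int32_t *b) {
        if (idx > UINT32_MAX)
            return false;                      /* domain exhausted */
        *a = (int32_t)(uint32_t)idx;
        *b = (int32_t)((uint32_t)o - (uint32_t)idx);
        return true;
    }

The children of plus then check each candidate pair against their own constraints: reg:eax accepts any 32-bit value, while const_int 2 accepts only a second element of 2, so for an expected output of 0 the enumeration continues until the pair (-2, 2) is found.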

5.2.3 Test execution and result comparison

For the IR execution system, ArCheck relies on an interpreter for that IR. Compilers such as LLVM already provide an interpreter for their IR, but others such as GCC do not. Since the semantics of compiler IRs is defined precisely, it is straightforward to develop an interpreter

if one does not exist already.

Our high-level approach to obtain the semantics of an assembly instruction is to execute the test instruction under a user-level process monitoring framework. Such a framework satisfies all the requirements of an assembly execution system for this problem. First, it can execute the test instruction in a separate, isolated environment and monitor the execution. Second, by relying on process tracing features offered by most OSes, it can inspect and modify register and memory contents for test execution. Third, it can gracefully handle exceptions or signals generated by the test instruction. The framework first assembles a test program containing the test instruction, creates an isolated environment for the execution of the assembled code, and initializes the environment with the specified start state. The isolated environment ensures that all effects of a test execution are confined. The difference between the start and end states of the environment at the end of the execution is the semantics of the test instruction.

For memory, the semantics we are interested in is the old and new contents of all the memory locations whose contents have been modified as a result of the test execution. A simple approach to obtain this would be to snapshot the memory of the isolated environment just before and after the execution of a test instruction and to compare the snapshots. This approach may seem too inefficient because, in the worst case, it could require us to compare snapshots of the whole virtual address space (4GB on a 32-bit machine) of a process. But we observed that the virtual memory layouts of our assembly execution system and the isolated environments are very sparse. Given this, we decided to use this simple approach for our purpose.

Nonetheless, we made an important observation which helped us optimize the simple approach considerably. Specifically, we observed that the difference between the start and end states of memory needs to be calculated only if memory is updated by the test instruction. In other words, if memory is only read by the test instruction, then we do not need to compare memory snapshots at all. To put this observation to use, ArCheck marks memory as read-only. With this setup, the framework can receive a memory access fault in two cases: when the test instruction accesses an invalid memory location, and when the test instruction performs a write access. We distinguish between these two cases by making memory read-write and re-executing the test instruction. If the framework receives one more fault, then it is because of the first case, and it is handled as an exception. If the framework does not receive any fault, then it is the second case, and the framework makes a note that memory snapshots must be compared. Note that, as an additional optimization, the

framework also notes the pages that will be modified and only considers those pages when comparing memory snapshots. Furthermore, if the framework does not receive a memory fault in the first place, then either the instruction does not access memory at all, or it performs only read accesses. In both situations, we do not need to compare snapshots at all; consequently, our system does not even snapshot memory at the end of the execution. Once the end states of the assembly and IR executions are obtained, we follow Definition 6 and compare them.
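The following is a minimal, in-process sketch of this write-detection trick (ArCheck itself applies it to the isolated child process): the test memory is mapped read-only, the first fault flips the page to read-write and records that snapshots must be compared, and the kernel then re-executes the faulting instruction. The single-page setup and the names are illustrative, not our implementation.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;          /* test memory, initially read-only   */
    static volatile int wrote;  /* set when a write fault is observed */

    /* On the first fault against our read-only page, assume it may be a
     * write: make the page writable, note it for later snapshot comparison,
     * and let the kernel re-execute the faulting instruction.  A second
     * fault on the same address would indicate a genuinely invalid access
     * (not shown here). */
    static void handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        if ((char *)si->si_addr >= page && (char *)si->si_addr < page + 4096) {
            wrote = 1;
            mprotect(page, 4096, PROT_READ | PROT_WRITE);
        } else {
            _exit(1);  /* fault outside the test page: treat as exception */
        }
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        page[0] = 42;  /* the "test instruction": faults, handler fixes, retries */
        printf("write detected: %d, page[0] = %d\n", wrote, page[0]);
        return 0;
    }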

5.3 Implementation

Our prototype implementation targets GCC's x86 code generator and runs on 32-bit Linux. The approach, however, is more general, and can easily be applied to other compilers and architectures.

5.3.1 Obtaining concrete rules from abstract mapping rules

The implementation of the instantiation step needs some architecture-specific knowledge, such as the different types of operands and their combinations. But once such knowledge is encoded in the system, the rest of the instantiation step (feeding concrete rules to GCC and checking the validity of concrete rules) is architecture-neutral. While the SI and MI modes rely on architecture-specific knowledge, the REE mode is implemented by randomly selecting a sufficiently high number (architecture-specific and configurable) of operand combinations per mapping rule.

5.3.2 Obtaining start states for a mapping rule

The algorithm for start state generation is implemented in 1000 lines of C code, and it uses a constraint solver written in 500 lines of Prolog. The expressiveness of the constraint language essentially comes from the expressiveness of the IR: we simply map the semantics of IR operators to a sequence of constraints in Prolog. Most of the components of state generation are architecture-neutral, but we do need an architecture-specific component to impose architecture-specific constraints such as those related to memory layouts.

5.3.3 Obtaining RTL semantics

Since there is no existing interpreter for GCC's RTL, as part of our prototype, we developed an RTL interpreter by referring to GCC's RTL specification. The implementation of the interpreter took around 3K lines and is mostly architecture-neutral. The only architecture-specific part is in initializing the interpreter's state from the input start state. This initialization code for x86 is about 50 lines of C++ code. Errors in the IR interpreter can result in false positives or false negatives in ArCheck. To address this challenge, we tested our RTL interpreter by using it to interpret RTL programs obtained from the source code of the coreutils package. Furthermore, whenever ArCheck reports a semantic difference, we manually verify that the RTL semantics does not match that of the assembly instruction. So far in our testing, we have not encountered any false positives or negatives.

5.3.4 Obtaining assembly semantics

The implementation of our user-level process monitoring framework relies on the parent-child relationship and the ptrace() interface of Linux. Test execution starts with the framework assembling a test program containing the test instruction and then creating a child process (using fork()) for executing the test program. The test program consists of the test instruction wrapped between two trap instructions. The trap instructions allow the parent (i.e., the framework) to intercept the child's execution (by writing a trap handler) just before and after the execution of the test instruction, to set the start state, and to capture the end state of the execution. After fork(), the child calls mmap() to map the binary encoding of the assembled test program and jumps to its beginning, while the parent calls wait() and blocks. Once the trap handler performs the necessary actions, such as setting or capturing the state, the parent continues the child's execution by sending it a PTRACE_CONT request. Along with traps, the parent also handles other signals that might be raised by the child by registering signal handlers for them. Although cases leading to such signals are not interesting for our purpose, the parent handles them to exit gracefully. To access the child's execution state, the monitoring framework relies on the ptrace() interface provided by the Linux kernel. In particular, using the PTRACE_SETREGS request the parent process can set the registers of the child process, while using PTRACE_GETREGS it can read their values. This is needed to initialize the hardware state from the input start state and also to save the end state of the child's execution.

We will discuss the way in which the framework handles memory shortly. The overall flow of obtaining the semantics of an assembly instruction using our framework is as follows (a minimal code sketch follows the list):

1. To start, the parent takes both an assembly instruction to be executed and a start state as input. The start state is obtained from the algorithm discussed earlier in this section.

2. The parent then composes a test program by simply wrapping the test assembly instruction between two trap instructions. Test programs thus consist of a sequence of three assembly instructions. We will discuss the role of the trap instructions shortly. The parent then assembles the test program using the as assembler and produces a binary file containing the shell code (byte encoding) for the test program using objcopy. To handle exceptional cases during the child's execution, the parent then sets signal handlers for most of the common signals, such as SIGSEGV, SIGFPE, SIGTRAP, etc. Finally, the parent calls fork() to create a child process and passes the file containing the shell code of the test program to the child. The parent then calls wait() and blocks until the child executes the test program.

3. Upon starting its execution, the child process first calls mmap() to map the shell code of the test program as an executable page, and then jumps to the first instruction of the test program.

4. The first instruction of the test program is a trap instruction. When the child executes this instruction, the OS halts the child process and notifies the parent immediately via SIGTRAP. Upon receiving the first SIGTRAP, the parent notes that the child is about to execute the test instruction, and so it initializes the child's register state using PTRACE_SETREGS. The use of traps thus allows the parent to initialize the child's hardware state just before the execution of the test instruction. After the hardware state is initialized successfully, the parent resumes the child's execution (via PTRACE_CONT) and again blocks by calling wait().

5. The child then executes the test instruction. Execution of the test instruction can lead to exceptional situations, which are all handled by the parent (if such situations lead to the generation of a signal). In most of the exceptional cases, such as the execution of an illegal instruction or an unaligned memory access, it is not safe to continue the child's execution. Consequently, the parent simply records diagnostic information about the situation and then kills the child process by sending the SIGKILL signal. If the child's execution does not lead to any exceptional situations, then the parent is not interrupted at all.

6. If the execution of the test instruction does not lead to any exceptional situations, the child continues with the execution of the subsequent instruction, which is our inserted trap instruction. When it is executed by the child, the OS again sends SIGTRAP to the parent. This time, the parent reads the register values via PTRACE_GETREGS and saves them as the end state. After reading the register values, it kills the child process, as we do not need to continue the child's execution.
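A stripped-down sketch of the parent side of this flow is shown below, assuming the child stands in for the trap/test/trap program by raising SIGTRAP itself; the real framework additionally assembles and maps the test program, imposes the start state, and handles memory. The function names are ours, not a standard API.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Parent-side monitoring loop.  The child executes "trap; test insn; trap".
     * If `start` is non-NULL it is imposed as the register start state. */
    static int monitor(pid_t pid, struct user_regs_struct *start,
                       struct user_regs_struct *end) {
        int status;

        waitpid(pid, &status, 0);                    /* first trap: before test insn */
        if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP) return -1;
        if (start) ptrace(PTRACE_SETREGS, pid, NULL, start);
        ptrace(PTRACE_CONT, pid, NULL, 0);           /* resume the child */

        waitpid(pid, &status, 0);                    /* second stop: trap or fault */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP) {
            ptrace(PTRACE_GETREGS, pid, NULL, end);  /* capture the end state */
            kill(pid, SIGKILL);                      /* child no longer needed */
            return 0;
        }
        kill(pid, SIGKILL);                          /* SIGSEGV/SIGFPE/...: give up */
        return -1;
    }

    int main(void) {
        struct user_regs_struct end;
        pid_t pid = fork();
        if (pid == 0) {                    /* child: stand-in for the test program */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            raise(SIGTRAP);                /* leading trap */
            /* ... the test instruction would execute here ... */
            raise(SIGTRAP);                /* trailing trap */
            _exit(0);
        }
        if (monitor(pid, NULL, &end) == 0)
            puts("end state captured");
        return 0;
    }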

Our user-level process monitoring framework is implemented in 2000 lines of C code and 50 lines of shell script. The C code has some architecture-specific features, such as the use of the trap instruction, but is mostly architecture-neutral. The shell script uses the GNU assembler gas, objdump, and objcopy to encode a test program containing a given assembly instruction wrapped in two traps and to obtain its binary encoding.

5.3.5 Handling memory

Since taking snapshots of a process's memory using ptrace() is too expensive, we rely on accessing /proc/<pid>/mem, a file representing the virtual memory of a process. Our snapshots have a format very similar to that of core files. Start states of memory also use the same format as snapshots. According to the start state, the desired values are written to the specified memory locations before executing a test program.

As compared to the assembly execution environment, our RTL interpreter deals with memory differently. Specifically, all load and store operations performed by the test RTL instruction are transformed into file I/O operations on the file representing the start state of the memory. This is done to isolate the memory of the RTL instruction from that of the interpreter, which has its own memory layout that is different from that of the RTL instruction. We could have also represented memory for an RTL instruction as a byte array, but such an approach would demand a mapping between the virtual memory addresses used by the RTL instruction and the corresponding addresses inside the byte array. This would unnecessarily complicate the implementation. On the other hand, since we already had a file representing the start state of the memory, transforming the RTL instruction's memory accesses into file I/O was the easier approach. Lastly, the details of all load and store operations (the exact address of the access and the contents) performed by the IR are recorded and are eventually used to obtain the memory semantics of an IR instruction.
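A minimal sketch of such access is shown below: helper functions (ours, not a standard API) that read and write a stopped, ptrace-attached child's memory by seeking to the virtual address in /proc/<pid>/mem. This is much faster for snapshots than word-at-a-time PTRACE_PEEKDATA/POKEDATA.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Read from a stopped child's memory: the file offset in /proc/<pid>/mem
     * is simply the child's virtual address. */
    static ssize_t read_child_mem(pid_t pid, uintptr_t addr, void *buf, size_t len) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        ssize_t n = pread(fd, buf, len, (off_t)addr);
        close(fd);
        return n;
    }

    /* Write the start-state values into the child's memory the same way. */
    static ssize_t write_child_mem(pid_t pid, uintptr_t addr,
                                   const void *buf, size_t len) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        int fd = open(path, O_WRONLY);
        if (fd < 0) return -1;
        ssize_t n = pwrite(fd, buf, len, (off_t)addr);
        close(fd);
        return n;
    }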

                                              SI             MI            REE
# of Mapping Rules                           140           1132           150K
# of Test Cases                             1056           5762        421,090
% of Useful Test Cases in ArCheck             92             85             79
% of Useful Test Cases in Random testing      64             59             52
Time To Run Test Cases in ArCheck       5 min 7 sec       31 min    1 day 6 hrs
Time To Run Test Cases in Random testing  4 min 10 sec    24 min    1 day 1 hr

Figure 38: Analysis of mapping rules obtained from GCC’s x86 code generator logs and test cases generated for them

                                       ArCheck             Random Testing
                                    SI    MI   REE         SI    MI   REE
Classification of        D1         25    26    28          3    10    11
Semantic Differences     D2          4     4     6          2     3     1
                         D3          1     1     1          0     0     1
                         D4          1     3     4          0     1     1
                         Total      31    34    39          5    14    14
# of New Bugs Found                  1     1     1          0     0     1
# of Existing Bugs Found                15                       7

Figure 39: Statistics of evaluating mapping rules obtained from GCC's x86 code generator

5.3.6 Test execution and result comparison

Based on the observation that there are no dependencies between the logged mapping rules, we built a test execution system which can run multiple test cases in parallel. The degree of parallelism can be changed using a command-line parameter. Test execution and result comparison are both implemented in C, in about 700 lines of code.

5.4 Evaluation

We evaluated the effectiveness of our approach by testing models extracted from the x86 code generator of GCC-4.5.1. Our current prototype supports only general-purpose and SSE x86 instructions. (Supporting the rest of the instruction set requires further engineering work on the RTL interpreter.)

We conducted our experiments on 32-bit Linux running on a quad-core Intel Core i7 processor. For comparison purposes, we implemented a random start state generation strategy. Note that a random strategy can be used to generate any number of start states, but to simplify comparisons, we generated the same number of start states as ArCheck. The extracted model that we tested had 140 abstract rules corresponding to 140 unique mnemonics. These 140 mnemonics covered roughly 23% of the total 32-bit x86 instructions. Breaking down these 140 mnemonics, we found that they covered 80 of around 200 general-purpose and 60 of around 70 SSE instructions. Of the remaining 120 general-purpose instructions, around 78 were system-related and I/O instructions, 12 were control-transfer instructions^20, and 30 were not covered in the extracted model. Using the 140 abstract rules, ArCheck generated 140 concrete rules in single-instance mode (SI), 1132 concrete rules in multiple-instances mode (MI), and 150K concrete rules in restricted exhaustive enumeration mode (REE). The abstract-to-concrete rule multiplier is thus 1, around 8, and around 1000 for these three modes, respectively. The following observations can be made from these results:

• The average number of generated test cases differs across the three modes. ArCheck generated an average of 8, 5, and 3 test cases per instruction in the SI, MI, and REE modes, respectively. The average for the REE mode is smaller because some instructions, such as x86's mov, generate a higher number of operand combinations but relatively fewer test cases than many other instructions.

• The number of useful test cases generated by ArCheck is significantly higher than the number generated by the random state generation approach. We call a test case "useful" if ArCheck completes a test run for it without raising exceptions. For instance, when eax is 0, mov 0(%eax),%esp leads to a null-pointer dereference exception. Using randomly generated start states, almost every memory-related instruction ended up performing an invalid memory access. ArCheck also produced a few useless test cases for some of the SSE instructions, because the constraints involving floating point instructions are more complex than those on integer operations.

^20 We do not handle control-transfer instructions because they present a complication in restricting control flow and returning control to the test framework. Nonetheless, we plan to consider them in future work.

Figure 39 compares ArCheck with random testing in terms of the different categories of instances in which the semantics of IR and assembly instructions do not match. Note that for this evaluation, we checked for both soundness and semantic equivalence. Unlike the soundness check, semantic equivalence is a strict check: it demands that the semantics of an assembly instruction is strictly modeled by an IR instruction. We enforced both checks because we wanted to understand the type and the number of instances belonging to each category. Note that the numbers reported in Figure 39 are for the semantic equivalence check. We will discuss which of these differences are soundness violations shortly. The following observations can be made from these results:

• The semantic differences found by all three modes of ArCheck are considerably more numerous than those found by random testing. Note that we have grouped semantic differences by abstract rule, and have counted a maximum of 1 difference per abstract rule. We did this in order to eliminate duplicate counting of the same semantic difference for different instances of the same abstract rule. So the 31 differences that we found using the SI mode of ArCheck mean that 31 x86 abstract rules from the model were found to have at least one semantic difference.

• Comparing the results from the SI and MI modes with those from the REE mode, we can see that the first two modes found a fairly comparable number of semantic differences in significantly less time. This suggests that most of the semantic differences manifest for most operand combinations. The results are quite different for random testing, where detection is a lot less reliable. Though the SI and MI modes did well in terms of finding semantic differences, the REE mode found the most in all tests. This suggests that testing a mapping pair with multiple operand combinations might help in finding more differences. On the other hand, the REE mode also took considerably longer to finish its run. We found that one of the major reasons for the amount of time taken is the explosion caused by immediate values. To eliminate this source of explosion, but still exploit the advantage of testing multiple operand combinations, in the future we could devise a mode where the multiplier between abstract and concrete rules is restricted to a value less than, but close to, 100. We speculate that such a mode should be able to finish its test run a lot faster than the REE mode, but at the same time perform nearly as well.

movzwl 8(%esp), %eax
(set (reg:HI ax)
     (mem:HI (plus:SI (reg:SI 7) (const_int 8))))

Figure 40: Example of movzwl instruction and its RTL

shrdl $16, %ebx, %eax
[(set (reg:SI ax)
      (ior:SI (ashiftrt:SI (reg:SI ax) (const_int 16))
              (ashift:SI (reg:SI bx)
                         (minus:QI (const_int 32) (const_int 16)))))
 (clobber (reg EFLAGS))]

Figure 41: Example of shrdl instruction and its RTL

• Unlike ArCheck, which detected the new bug in all three modes, only the REE mode of random testing detected this bug. We will discuss this point shortly.

Figure 39 shows that the REE mode of ArCheck found the largest number (39) of semantic differences. We have classified these differences into 4 categories:

• D1: Imprecise modeling of EFLAGS. This kind of difference arises frequently in GCC's x86 code generator when an RTL instruction does not capture the precise bits of EFLAGS that are modified by the corresponding assembly instruction. For instance, the example of add $2,%eax discussed earlier falls into this category. This type of difference does not represent a soundness issue in the code generator because (clobber (reg EFLAGS)) is actually an over-approximation of the instruction semantics.

• D2: Incorrect value in the destination operand. This kind of difference arises when an RTL instruction does not perform some operation, such as zeroing or sign-extending a value, that is performed by the assembly instruction. For instance, for the movzwl instruction in Figure 40, the RTL simply moves the lower 2 bytes of the source into the destination, but fails to zero out the upper 2 bytes of the destination. This kind of difference is a soundness violation.

• D3: Incorrect operation in RTL. This kind of difference arises when an RTL instruction performs a different operation than the corresponding assembly instruction. This kind of difference represents a soundness violation. For instance, for the shrdl

mull %ebx
[(set (reg:SI dx)
      (truncate:SI
        (lshiftrt:DI
          (mult:DI (zero_extend:DI (reg:SI ax))
                   (zero_extend:DI (reg:SI bx)))
          (const_int 32))))
 (clobber (reg:SI ax))
 (clobber (reg EFLAGS))]

Figure 42: Example of mull instruction and its RTL

instruction in Figure 41, the RTL uses the arithmetic shift operator (ashiftrt), whereas the assembly instruction performs a logical shift (lshiftrt). More specifically, the semantics of the shrdl instruction as per the Intel manual is: "The instruction shifts the first operand (eax) to the right the number of bits specified by the third operand (count operand). The second operand (ebx) provides bits to shift in from the left (starting with the most significant bit of the destination operand)." The RTL models this as the inclusive-or of the arithmetically right-shifted destination and the left-shifted source operand. The soundness issue shows up when the destination contains a negative value: the arithmetically right-shifted destination will have its top bits set to 1, and inclusive-or with such a value generates a result with its top bits set to 1, instead of moving the contents of the source into the top bits of the destination. We detected this issue when we set eax = 0xb72f60d0 and ebx = 0xbfcbd2c8. The above shrdl instruction in that case produced 0xd2c8b72f in eax, but the corresponding RTL produced 0xffffb72f in eax. We reported this difference to GCC's bug reporting system. GCC developers have acknowledged that this is a bug, and have fixed it promptly. Details of our bug report can be found at [6].

• D4: Update to a destination not specified. This kind of difference arises when GCC uses implicit assumptions not mentioned in the RTL specification. For instance, the mull %ebx instruction shown in Figure 42 modifies the register pair [edx:eax], where the top 4 bytes of the product are stored in edx, and the lower 4 bytes of the product are stored in eax. But the RTL for mull specifies only the edx update (the upper 4 bytes of the product, via the truncate of the right-shifted double-length product) and merely says that eax is clobbered.

We found that GCC uses this instruction only when it is computing a modulo of a number (in other words, there is an implicit assumption that this instruction is only used when the product cannot be more than 4 bytes long). Since the RTL for the mull instruction captures an over-approximation of the semantics, it does not represent a soundness violation. Nonetheless, the use of implicit assumptions can possibly lead to soundness violations. Moreover, we believe that programming practices that rely on implicit assumptions lead to a number of latent bugs which are not uncovered easily. To minimize the number of bugs in compilers, one should strictly avoid such programming practices.

5.4.1 Detecting known soundness issues

As further evidence of the ability of ArCheck to identify code generator bugs, we used it on older versions of GCC with known bugs in their machine descriptions. (These were bugs that had previously been reported to the GCC team and fixed.) Overall, we obtained a list of 15 soundness issues reported against the x86 code generator in the last 4 years (July 2011 to December 2014). We consider an issue soundness-related if the semantics of the assembly and IR instructions in the issue do not match. (This requires human knowledge about the target architecture, so we had to analyze the bug reports manually.) We specifically tested the mapping pairs involved in the issues, and verified that ArCheck is able to detect all of them. Some of these issues are serious, e.g., a missing update to EFLAGS, a changed order of source and destination, not following the RTL specification accurately, etc. We will now discuss some of these bugs and the way we detected^21 them.

• The bug reported in [7] is about failing to model possible updates to EFLAGS by an execution of the sbbl instruction. ArCheck detected this bug because the start state generator initialized start states with the relevant bits of EFLAGS unset (because, as per the RTL semantics, EFLAGS neither affected the end result nor were clobbered), but the result produced by the assembly execution system had those bits of EFLAGS set. This bug belongs to category D2. Out of the 15 bugs we collected, 3 were of this type.

• One older bug [4] (more than 4 years old) detected by ArCheck was an incorrect modeling of the movsd SSE2 instruction. This instruction operates on two XMM registers

^21 Though our implementation does not support all x86 instructions, to detect these bugs, we added support for the assembly and RTL instructions from the buggy mapping rules. Specifically, we had to add support to detect 6 of the 15 issues.

and moves the lower 64 bits (a double-precision floating point number) of the source register to the lower 64 bits of the destination, preserving the upper 64 bits of the destination. The issue was an incorrect use of RTL's vec_merge operator, because of which exactly the opposite semantics (assign the upper 64 bits of the source to the upper 64 bits of the destination, and preserve the lower 64 bits of the destination) was modeled. We detected this bug because ArCheck initialized the source and the destination XMM registers with different values (instructions which move values from a source to a destination have no "interesting outputs", so the state generator initializes the source and the destination with different byte patterns, such as all bits set to 0 in the source and all bits set to 1 in the destination, or vice versa^22), and the result produced by the RTL interpreter was different from the one produced by the assembly execution system. This bug is representative of category D3. The list of 15 bugs had a couple of this type.

• An interesting type of bug detected by ArCheck is syntactic bugs and syntactic errors which lead to soundness violations. For instance, the bug reported in [1] is about an incorrect order of operands in the bextr assembly instruction generated by GCC's x86 code generator. Specifically, the bextr instruction takes 3 operands, of which the first and third can only be registers, while the middle one can be a register or memory operand. The modeling of this instruction in the x86 machine description was incorrect: it allowed the first operand to be either a register or memory. We detected this bug when we attempted to assemble the assembly instruction using as. Although this particular bug is not a soundness violation, and ArCheck is not designed to detect this type of bug, some syntactic errors can lead to soundness violations. For instance, the bug reported in [9] is about missed square brackets around an immediate constant, which changed the semantics of an instruction from moving a value from a memory location to moving an immediate value. Although the particular bug in [9] is an x86-64 bug, it is easy to see that it is indeed a soundness violation. The list of 15 bugs had 1 bug of this type.

Finally, we believe that the presence of these bugs in the bug reports indicates that the errors were not caught by the end-to-end testing used in GCC, thus establishing the need for dedicated frameworks such as ArCheck.

^22 The rationale behind such an assignment is being able to detect an incorrect move of even a single bit.

Given the experimental results of ArCheck, an important question is: what should the ratio of abstract-to-concrete mapping rules be so that we achieve very high confidence in the soundness of the extracted semantics model? Note that, theoretically, to achieve very high confidence, one should test every abstract rule from the model with all possible operand combinations. Experimentally, however, we found that the REE mode took considerably more time but found few additional differences beyond those found by the MI mode. We therefore think that a ratio close to that of the MI mode is good enough to achieve a high level of confidence in the soundness of the extracted models.

5.5 Related Work

Compiler testing, verification, and bug finding. Compiler testing has been an active area of research for several decades. One of the earlier works describing the use of machine descriptions for compiler testing is [79]. It describes an interesting approach to compiler testing: checking the equivalence of a source program and its object code by translating both to a common intermediate language. Translation of object code to the intermediate language is done by relying on a set of procedures which encode the semantics of low-level instructions in terms of intermediate-language instructions. More recently, by randomly generating valid C programs, the CSmith system [97] performed systematic end-to-end testing of modern compilers such as GCC and LLVM and found a number of bugs in them. Along with CSmith, other works such as [54, 83, 59, 99] have applied the technique of automatically generating valid C/C++ programs to test compilers. A fundamental difference between ArCheck and all these works is the targeted testing of code generators performed by ArCheck. In contrast, systems such as CSmith are aimed at testing all components of the compiler. Unfortunately, end-to-end testing of compilers (a very common practice used by modern compilers) does not help in testing individual components of the compiler thoroughly. This is corroborated by the findings of CSmith as well. Tools such as ArCheck are thus complementary, and serve to provide much more in-depth testing of individual components of the compiler. Compiler verification has also been a prominent area of compiler research, with techniques such as certified compilers [53, 52] and translation validation [65, ?, 71] being developed. CompCert [53, 52] has been a popular compiler verification work with promising results. However, scaling formal verification to production compilers such as GCC remains a challenge. While recent work has made significant progress in tackling components of

production compilers, e.g., the mem2reg optimization in LLVM [100], scaling these to components of the size of a code generator represents a significant challenge. Testing-based approaches such as ArCheck thus represent an important complementary approach that can work on industry-strength compilers. Given that verification of a complete compiler is difficult, translation validation tries to check the correctness of each of the compiler passes. The correctness of a pass is checked by comparing the semantics of the program being compiled with its semantics after each pass. Such checking also helps in pinpointing compilation errors and thus helps with compiler debugging and testing. A common approach in translation validation is to first check the equivalence of the control flow graph of the input program with that after each pass. Once both graphs are checked for equality, symbolic evaluation along with a constraint solver is used to check the correctness of the input program and the output of each pass. Unfortunately, translation validation of a code generator is infeasible, since the output of a code generator is a sequence of assembly instructions: to prove the equivalence of the input source program with the list of assembly instructions, one needs to convert the assembly instructions into some intermediate language, and this demands manual modeling, which is exactly the source of semantic bugs in code generators. Whereas ArCheck is focused on the correctness of IR-to-assembly translations, the work of Fernández and Ramsey [?] targets the correctness of assembly-to-machine-code translations. They utilize a language called SLED (Specification Language for Encoding and Decoding) to specify instruction encodings at a high level, and to associate assembly and machine code instructions. These specifications can then be used to generate machine code from assembly, or vice versa. The main focus of their work is correctness checking of the mappings specified in SLED. This is accomplished using a combination of static checks and by comparing SLED-based translations with the translations produced by another well-tested independent tool such as the system assembler.

Approaches for test case generation. Generating better test inputs by improving test case generation strategies has been an area of active research. One can broadly classify existing testing strategies into black-box and white-box. Black-box testing strategies, such as fuzz testing (random testing) [61] and grammar-based fuzz testing [58, 56], can also be applied to testing code generators. A drawback, however, is that it is difficult to ensure that all "relevant" and/or "interesting" input values have been tested. In contrast, ArCheck has been explicitly designed to leverage the semantics of the IR, and the intuition and insight of human

experts, to generate relevant/interesting test cases. White-box testing strategies, such as symbolic execution [49, 19, 39], on the other hand, generate more "interesting" inputs because they treat the system-under-test as a white box. Symbolic execution, in particular, seems best suited for our problem, since it is a coverage testing technique, and checking the soundness of code generator mapping rules is a coverage testing problem. Moreover, symbolic execution has recently received a lot of research attention, which has led to improvements in symbolic execution systems. Unfortunately, it still suffers from an explosion in the number of branches while handling complex code. Fortunately, the IR instructions that we want to analyze are simple enough that such problems do not arise. Nonetheless, given the simplicity of white-box static analysis of IR instructions, we preferred it over symbolic execution for building robust tools that can operate on production compilers.

5.6 Summary

In this section, we developed a new, efficient and practical approach for the systematic testing of semantic models extracted from the code generators of modern compilers. Our approach can also be considered a way to check the correctness of compiler code generators. One of the key contributions of our work is the development of an architecture-neutral testing technique. Another important benefit of our approach is that it treats the compiler/code generator as a black box, and hence can be easily applied to any compiler. A third major benefit is that it not only detects bugs, but also makes it easy to locate and diagnose them. Our evaluation showed that ArCheck can identify a significant number of bugs and inconsistencies in the architecture specifications used by GCC's code generator. Specifically, we used ArCheck on approximately 140 unique x86 instructions and identified potential inconsistencies in 39 of them. Although a majority of these are not soundness-related, we believe that approximately 7 are, including one that has already been fixed by GCC developers. Moreover, we verified that ArCheck is able to detect 15 other known bugs (soundness issues) in previous versions of GCC. Some of these bugs are serious, and their presence in the bug reports indicates that the errors were not caught by the end-to-end testing used in GCC, thus establishing the need for dedicated frameworks such as ArCheck. These results demonstrate the utility of tools such as ArCheck, and suggest that similar tools should be integrated into the test cycle of today's compilers.

6 BUILDING ASSEMBLY-TO-IR TRANSLATORS AUTOMATICALLY

Recall from our discussion in Section 1 that existing binary analysis, translation, and instrumentation systems such as Valgrind and QEMU build their assembly-to-IR translators manually. Another common approach is to develop architecture-neutral translators and drive them using manually-written architecture specifications. Because LISC and EISSEC help us extract semantic models automatically, in this section we demonstrate an application of such models by using them to build assembly-to-IR translators automatically. Before we describe how to use the extracted models for building the translators, we will describe a typical design of binary analysis, translation, and instrumentation systems.

[Figure: input binary -> Disassembler -> assembly -> Assembly-to-IR translator -> IR -> Analysis and Transformation -> IR -> IR-to-Assembly translator -> assembly -> Assembler -> new binary]

Figure 43: Process of architecture-independent binary analysis, binary translation, and binary instrumentation using IR

Figure 43 shows a typical design of binary translation and binary instrumentation systems. The process of binary translation or instrumentation starts with accepting an input binary and ends with generating a new binary. (In binary translation, a new binary may not be generated; instead, the generated IR may be interpreted.) Binary analysis systems, on the other hand, do not generate new binaries. They instead analyze the binaries (by analyzing the IR), produce appropriate analysis results, and terminate the process at that point.

These systems, in other words, do not have the second part of the framework, which translates IR into a new binary. Below, we describe the various components of these systems.

1. Disassembler. Program binaries, before they are analyzed or transformed, need to be disassembled. Disassemblers handle the low-level architectural details of the input binaries, such as object file formats (e.g., COFF [60], ELF [88], etc.) and the encoding of machine instructions, and effectively hide all such details from the higher-level components in the system. The output of a disassembler is a list of assembly instructions corresponding to the machine instructions in the input binaries^23.

2. Assembly-to-IR translator. Once the binaries are disassembled, the next step is to translate the list of assembly instructions into a list of IR instructions. The rationale behind such translation is architecture-neutrality, which enables the analysis systems to support binaries from different architectures^24. A naive approach to implementing an assembly-to-IR translator is to manually encode the assembly-to-IR translation rules in the translator code directly. Unfortunately, this approach makes the translator architecture-specific. A better approach is to develop an architecture-neutral translator which refers to architecture-specific translation rules. The architecture-specific rules, in such a case, can be written in the form of an architecture specification. Such a design of the translator, which refers to manually-written architecture specifications, is common among the developers of binary analysis, binary translation, and binary instrumentation systems [25, 67, 15, 35, 11, 27, 14, 48]. Instead of relying on manually-written specifications, we develop an approach that uses the extracted models as specifications.

^23 An accurate disassembly of program binaries is in itself a research problem, with a number of solutions such as BinCFI [98] having been proposed. We will not discuss the disassembly problem and its solutions here, but will assume that there is a disassembler which provides us with an accurate disassembly of binaries.
^24 The IR instructions produced by an assembly-to-IR translator may not be in a form which can be directly used for analysis and instrumentation. The generated IR instructions will most likely be semantically close to the assembly instructions. Because the IR instructions are semantically close to the assembly instructions and do not capture high-level program semantics (such as variables and types), they may be of less value to architecture-neutral binary analysis and instrumentation tools. To address this issue, low-level analysis passes (not shown in the figure) are run on the IR instructions. Because program analysis to extract high-level program semantics is a separate research topic, we do not discuss these analyses in this dissertation.

3. Analysis and transformation. In this step, any desired program analysis and transformation passes work on the semantically-rich program representation (IR) of the input binary. An essential advantage of building an architecture-neutral system is that the program analyses or transformations can be architecture-neutral and thus applicable to any architecture. This saves the effort of implementing the analyses or transformations for every architecture.

4. IR-to-assembly translator + assembler. This is the last step, in which the list of IR instructions is translated into a list of assembly instructions, and a new binary is generated by assembling those assembly instructions. Commonly this is done by relying on the specifications used by the assembly-to-IR translators [27]. The functionality of this stage is similar to that of a combination of a compiler backend and an assembler.

We now describe our approach to using the extracted models for building an assembly-to-IR translator automatically.

6.1 Our Approach to Assembly to IR Translation

An input for the assembly-to-IR translation is the list of assembly instructions obtained by disassembling the input binary. One can obtain such a list using a disassembler (e.g., objdump).

6.1.1 Challenge

To translate a list of assembly instructions into a list of IR instructions, an assembly-to-IR translator needs to solve an interesting algorithmic challenge. The challenge stems from the fact that a sequence of assembly instructions may be translated to a single IR instruction. This leads to the question of how many assembly instructions should be grouped together and translated into IR. For instance, if the input assembly list contains four assembly instructions,

{a1, a2, a3, a4}, then there exist the following ways of partitioning them, where each partition is translated into a single IR instruction:

1. All partitions of size 1:

   (1) {a1}, {a2}, {a3}, {a4}

2. Partitionings with at least one partition of size 2:

   (2) {a1, a2}, {a3}, {a4}
   (3) {a1}, {a2, a3}, {a4}
   (4) {a1}, {a2}, {a3, a4}
   (5) {a1, a2}, {a3, a4}

3. Partitionings with at least one partition of size 3:

   (6) {a1, a2, a3}, {a4}
   (7) {a1}, {a2, a3, a4}

4. Partitionings with at least one partition of size 4:

   (8) {a1, a2, a3, a4}

Thus there are 8 different ways in which one can partition the list of four assembly instructions for translation to IR. Which of these should one use for translation? Note that not all of the partitioning schemes may be valid for a given list of assembly instructions. To be precise, for all 8 ways mentioned above to be valid for translation, all assembly sequences used by them must be present in the extracted semantics model. For instance, for the 7th partitioning to be valid, the extracted model must have two rules: one for the single assembly instruction a1 and another for the three assembly instructions {a2, a3, a4}. If one of these rules does not exist, then we can safely discard that partitioning. This observation can be applied to eliminate invalid partitionings. Unfortunately, given the exponential number of possible partitioning schemes (valid and invalid), a simple algorithm could run into worst-case exponential time complexity. Fortunately, given that the different partitioning schemes of a given list of assembly instructions contain overlapping partitions, we can develop an efficient dynamic programming algorithm. The overlapping partitions of the above example are shown in Figure 44. Note that a path in the tree corresponds to either a valid or an invalid partitioning scheme. There could be multiple paths corresponding to valid schemes, and we just want to find any one path which corresponds to a valid scheme. In the figure, partitions that overlap are shown in the same color. For instance, the partitions {a3, a4}, {a3} and {a4} recur across multiple paths. If we observe carefully, all of these overlapping partitions correspond to possible partitioning schemes of the assembly sequence {a3, a4}.

[Figure: tree of possible partitioning schemes rooted at Start, branching on the first partition chosen ({a1}, {a1, a2}, {a1, a2, a3}, or {a1, a2, a3, a4}) and recursing on the remaining suffix; overlapping partitions are shown in the same color]

Figure 44: Tree of various possible partitioning schemes of sequence {a1, a2, a3, a4}

Once we solve the problem of finding a valid partitioning scheme of {a3, a4}, we do not need to solve it again. We will exploit this overlapping sub-problem property of the assembly-to-IR translation problem in order to develop an efficient dynamic programming algorithm. In order to develop the algorithm, we will introduce a few notations and then develop a recursive solution to the problem. The recursive solution will essentially capture the overlapping sub-problem nature. We will denote the list of assembly instructions by A, and the i-th element of A by A_i. Thus, A_1 = a1 in the above example. To represent a partition of A consisting of all elements from i to j, with both A_i and A_j included, we will use the notation P_{i,j}. Thus, in the above example, P_{2,4} = {a2, a3, a4}. When i and j are equal, P_{i,j} represents a single element. We will denote the semantic model used for assembly-to-IR translation by M. Using the model to obtain a list of IR instructions for a list of assembly instructions A is represented as M[A]. Note that A here could be a single assembly instruction or a sequence of them. Also, the model may not have IR instructions corresponding to some value of A; in that case M[A] = φ. Finally, the existence of some valid partitioning scheme for a partition P_{i,j} is represented by the solution S_{i,j}. Intuitively, in the tree of possible partitioning schemes, S_{i,j} corresponds to one of all the possible valid paths from node A_i to node A_j. If there is no valid partitioning scheme for P_{i,j}, then S_{i,j} = []. Concretely, we can define S_{i,j} as:

\[
S_{i,j} =
\begin{cases}
M[P_{i,j}] & \text{if } i = j \\
M[P_{i,i}] \;\&\; S_{i+1,j} \;\mid\; M[P_{i,i+1}] \;\&\; S_{i+2,j} \;\mid\; \cdots \;\mid\; M[P_{i,j-1}] \;\&\; S_{j,j} \;\mid\; M[P_{i,j}] & \text{otherwise}
\end{cases}
\]

Intuitively, the above equation captures the observation that the solution for a list of instructions consists of any of the following: a translation for its first element and the solution for the remaining elements of the list; or a translation for the partition formed by the first two elements and the solution for the remaining elements; and so on, until we get the translation for the whole sequence directly by treating the list as a single partition. Note that the disjunction above represents the fact that there could be multiple paths in the tree of possible partitioning schemes. Also, S_{i,j} when i equals j is the same as M[P_{i,i}]. To illustrate the above equation, for {a, b, c, d}, S_{1,4} is:

\[
S_{1,4} = M[\{a\}] \;\&\; S_{2,4} \;\mid\; M[\{a,b\}] \;\&\; S_{3,4} \;\mid\; M[\{a,b,c\}] \;\&\; S_{4,4} \;\mid\; M[\{a,b,c,d\}]
\]

This equation exactly captures the overlapping sub-problem nature of the assembly-to-IR translation problem: in order to find S_{1,4} we need to find S_{2,4}, which requires finding S_{3,4}. Specifically, for the above example,

\[
S_{2,4} = M[\{b\}] \;\&\; S_{3,4} \;\mid\; M[\{b,c\}] \;\&\; S_{4,4} \;\mid\; M[\{b,c,d\}]
\]

The dynamic programming algorithm to find the translation of an input assembly list thus solves these equations recursively, storing the solutions to sub-problems to avoid solving them again.

6.1.2 Algorithm

The dynamic programming algorithm for assembly-to-IR translation which incorporates the approach discussed above is shown in Figure 45. Given the list of disassembled assembly instructions (represented by A), the partition P_{i,j} of A identified by the start (i) and end (j) indices in A, and the semantic model M, the algorithm produces a list of IR instructions (I) for A. S_{i,j} from the equations above is thus a list of IR instructions I corresponding to P_{i,j}. Typically, the first invocation of Translate would look like Translate(A, 1, |A|, M), where |A| represents the total number of assembly instructions in A. The algorithm follows our earlier discussion and solves the problem recursively. Solutions to sub-problems are memoized in a table T, where T_{i,j} represents a solution (a list of IR instructions) corresponding to P_{i,j}. Note that the algorithm also stores the information that some sequence (or sub-sequence) of assembly instructions has no solution; this is done on line 19. Storing this information helps the algorithm avoid exploring sub-partitions of an already-visited partition. Lines 5–8 check whether the solution to the problem already exists; if it does, it is returned directly without proceeding any further. If the solution does not exist, then the algorithm explores all partitions of increasing size until a solution is found. If it finds a solution for any of the partition sizes, that solution is returned immediately. The @ operator on line 16 simply represents the concatenation of the IR list for P_{i,k} and I′. To demonstrate the partitioning decisions made, let us go back to our initial example, where

A = {a1, a2, a3, a4}. Let us say that the solution for P_{1,4} is to partition A such that {a1, a2, a3} is one partition and {a4} is another. The tree of invocations of Translate before arriving at the solution is shown in Figure 46. A node in the tree captures the value of

1.  // A is the list of all assembly instructions to be translated to IR.
2.  // i and j define the partition Pi,j of A which we need to translate.
3.  // The algorithm returns I, the list of IR instructions for Pi,j; an empty
    //   suffix (i > j) is treated as trivially translated, which is distinct
    //   from the failure marker [].
4.  algorithm: Translate(A, i, j, M):
5.    if Ti,j ≠ φ   // T is a table used to store solutions to sub-problems.
6.    then
7.      return Ti,j
8.    fi
9.    for k = i to j
10.   do
11.     if M[Pi,k] ≠ φ   // If a mapping pair exists in the model.
12.     then
13.       I′ = Translate(A, k + 1, j, M)
14.       if I′ ≠ []
15.       then
16.         Ti,j = M[Pi,k] @ I′
17.         return Ti,j
18.       else
19.         Tk+1,j = I′   // Memoize the failure for the suffix Pk+1,j.
20.       fi
21.     fi
22.     // Maybe the current way of splitting the assembly list is not correct;
23.     // the loop continues with the next partition size
24.     // (k advances to k + 1 at the next iteration).
25.   done
26.   // A mapping rule for the assembly instructions of Pi,j is not in M at all.
27.   // We cannot lift this assembly sequence.
28.   return []

Figure 45: Algorithm for assembly-to-IR translation

P_{i,k} accepted in the various invocations of Translate, and the values of k + 1 and j used in the corresponding invocation. Notice that siblings represent successive iterations of the for loop, whereas a path (towards a leaf) represents the recursive invocations made to explore a solution. In the figure, the solution nodes are shown in gray, whereas the nodes which are not explored (because of memoization) are shown in dark gray. Also, notice that a sub-partition is visited only once, which allows us to determine the worst-case complexity of the algorithm easily.

[Figure: decision tree of Translate invocations for {a1, a2, a3, a4}, rooted at (1,4); solution nodes shown in gray, memoized (unexplored) nodes in dark gray]

Figure 46: Decision tree for assembly to IR translation

Since the algorithm avoids exploring subpartitions of an already visited partition, the worst case complexity of the algorithm is bounded by the total number of partitions of a given list of assembly instructions. For a list of size N, given that partition size can vary from 1 to N, the total number of partitions can be obtained by the formula:

\[
1 + 2 + \cdots + N = \frac{N \times (N + 1)}{2}
\]

The worst-case time complexity of the algorithm is thus O(N^2). While the algorithm described above is a generic one, we found that in practice we do not require full generality. Specifically, instead of allowing the maximum partition size to be equal to |A|, we restrict it to 4 for the x86, ARM and AVR architectures. This is because we found that on these architectures, a single IR instruction can map to at most 4 assembly instructions. This is implemented by simply modifying the condition of the for loop to: k = i to min(i + 4, j). Because our implementation uses a constant maximum partition size, its worst-case time complexity is O(N).
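To make the algorithm concrete, the following is a compact C sketch of the memoized translation with the partition size capped at 4. Model lookup is abstracted behind a hypothetical model_lookup function, IR lists are simplified to strings, and, because every sub-problem here is a suffix of A, the memo table is indexed by the suffix start alone. This is an illustration under those assumptions, not our OCaml implementation.

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_N    1024
    #define MAX_PART 4     /* max assembly instructions per IR instruction */

    /* model_lookup(A, i, k) returns the IR for the exact sequence A[i..k],
     * or NULL if no mapping rule exists.  It stands in for M[P_{i,k}]; a
     * real implementation would consult the extracted semantic model. */
    extern const char *model_lookup(char *const *A, int i, int k);

    static char *memo[MAX_N + 1];  /* memo[i]: translation of suffix A[i..n-1] */
    static char  seen[MAX_N + 1];  /* has suffix i been attempted?             */

    /* Translate the suffix A[i..n-1]; returns a malloc'd IR list, or NULL if
     * the suffix cannot be lifted.  The empty suffix is trivially translated. */
    static char *translate(char *const *A, int i, int n) {
        if (i == n) return strdup("");
        if (seen[i]) return memo[i] ? strdup(memo[i]) : NULL;
        seen[i] = 1;
        for (int k = i; k < n && k < i + MAX_PART; k++) {
            const char *ir = model_lookup(A, i, k);  /* try partition A[i..k] */
            if (!ir) continue;
            char *rest = translate(A, k + 1, n);
            if (rest) {                               /* found a valid scheme */
                size_t len = strlen(ir) + strlen(rest) + 2;
                char *sol = malloc(len);
                snprintf(sol, len, "%s\n%s", ir, rest);
                free(rest);
                memo[i] = sol;
                return strdup(sol);
            }
        }
        memo[i] = NULL;   /* remember that this suffix has no solution */
        return NULL;
    }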

6.1.3 Other challenges for assembly-to-IR translation

Besides the algorithmic challenge discussed earlier, there exist a few other challenges for assembly-to-IR translation.

Soundness of using the extracted semantics model for assembly-to-IR translation. The first challenge relates to the soundness of our approach: why is it sound to use the extracted models for assembly-to-IR translation? Code generators use the extracted models in the forward direction to translate IR to assembly, so a natural question is whether it is sound to use them backwards. There are several reasons to believe that they are sound for translating assembly to IR. The first concerns the way in which code generators (architecture specifications) are developed. The developer starts by enumerating the instructions of the target architecture and specifying the semantics of those instructions in IR. The developer's view thus corresponds to an assembly-to-IR mapping. Second, note that the IR-to-assembly mapping must be sound, or else the code generator would likely generate incorrect code. Therefore, an assembly instruction must perform all of the actions that are included in the IR. This leaves the possibility that the assembly instruction could do more than what was asked for by the IR, e.g., change an additional register; if so, the assembly-to-IR translation would not capture these extra actions. However, consider the fact that the IR optimizer performs several optimizations, such as the removal of redundant computations and the reordering of code snippets. These would be unsound if an IR snippet could be replaced by an assembly instruction that modified CPU state in ways beyond what was stated in the IR. In addition to the above reasoning, we checked the correctness of the models using ArCheck's semantic equivalence test. Because every assembly-to-IR pair in such models is semantically equivalent, the assembly-to-IR translation will always be sound.

Syntactic differences between the output of a compiler and a disassembler. Another challenge faced by the translator is the different ways of printing instruction mnemonics. For instance, the GCC compiler does not print the suffix of instruction mnemonics, such as the 'l' suffix for the 32-bit x86 mov instruction. The objdump disassembler, on the other hand, prints the suffix. These minute differences create challenges for the translator because the assembly-to-IR translation rules will not match in these cases. To handle such issues, we introduce a post-disassembly step to process the assembly instructions and rectify all such differences. We have not shown this step in the figure for the architecture-independent binary analysis and transformation system, because it is not a standard step in these systems.
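The sketch below illustrates the kind of normalization this post-disassembly step performs; the mnemonic table and function name are hypothetical, and our actual scripts handle many more cases.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical normalizer: rewrite a disassembled mnemonic to the
     * spelling used in the extracted model (e.g., drop the operand-size
     * suffix that objdump prints but the compiler's output omits). */
    static const char *normalize_mnemonic(const char *mnemonic) {
        static const char *map[][2] = { { "movl", "mov" }, { "addl", "add" } };
        for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++)
            if (strcmp(mnemonic, map[i][0]) == 0)
                return map[i][1];
        return mnemonic;   /* already in canonical form */
    }

    int main(void) {
        printf("%s\n", normalize_mnemonic("movl"));  /* prints: mov */
        return 0;
    }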

Handling control-flow transfers. A control-flow graph (CFG) is one of the program properties central to most program analyses. Thus, to allow a user of our translator to perform sophisticated static analyses and instrumentations on the IR, we must reconstruct the program's CFG from the list of IR instructions produced by our translator. We construct the CFG by processing the list of IR instructions produced by the translator. Our approach to linking control-flow sources and their targets is as follows. First, we use the runtime address of the assembly instructions (or their offset, in the case of PIC code) produced by a disassembler to generate unique code labels. To be specific, an instruction's address is prefixed by "L_" to produce a unique string label for that instruction. Once this is done, we update the places where these instruction addresses are used, and use labels instead of concrete memory addresses. This is done by identifying these addresses using pattern matching and replacing them with their labels. This takes care of fixing the targets of jumps and linking their sources to the targets. Unfortunately, function calls introduce an additional complication. On Unix-based systems, external function calls are performed using the GOT and PLT tables. As a result, the address of the call operand printed by a disassembler is the address of the PLT stub for that function. Fortunately, disassemblers also print the name of the function being called in the disassembly. We exploit this observation to replace the target of a call instruction with the actual function being called. Internal function calls do not need special treatment, because the address of the callee is present in the assembly instruction, and the approach for handling jump targets can be applied directly.
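As an illustration of the labeling scheme, the sketch below turns a disassembled branch target into the corresponding "L_"-prefixed label; the actual pattern matching over operand text is more involved.

    #include <stdio.h>

    /* Produce the unique label for an instruction address, e.g., the target
     * of "jmp 8048a10" becomes "L_8048a10", and a matching label is emitted
     * at that instruction so jump sources and targets can be linked. */
    static void make_label(const char *addr, char *out, size_t outsz) {
        snprintf(out, outsz, "L_%s", addr);
    }

    int main(void) {
        char label[32];
        make_label("8048a10", label, sizeof(label));
        printf("jmp %s\n", label);   /* was: jmp 8048a10 */
        return 0;
    }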

6.2 Implementation

The implementation of our assembly-to-IR translator comprises approximately 400 lines of OCaml code and 140 lines of shell scripts. The OCaml code implements the algorithm and translates the list of assembly instructions. The scripts disassemble binaries and implement the post-disassembly steps, such as processing the assembly instructions and fixing the instruction labels. In addition to the above code to translate a binary into a list of IR instructions, we implemented a parallel translation framework (around 110 lines of shell script) to perform batch translations. The framework divides the list of input binaries among the available workers and keeps track of progress. We have utilized this parallel translation framework to translate all x86 and ARM binaries on standard desktop OSes.

Parameter                                     x86             ARM            AVR
Total number of binaries                     9237            7326             12
Avg # of insns per binary                     74K            117K            30K
Number of binaries translated in parallel      24              24             12
Total translation time                  7 hrs 10 mins   8 hrs 30 mins    34 secs
(including disassembly)
Avg translation time per binary          1 min 8 secs   1 min 36 secs    34 secs
Avg disassembly time per binary               16 secs         23 secs     7 secs
Total avg virtual memory                   245 MB          181 MB          32 MB
consumed by the translator

Figure 47: Details of assembly-to-IR translator performance

6.3 Evaluation

Translation time and memory consumption. The translation algorithm given in this section is general and does not put any restriction on the length of the longest sequence of consecutive assembly instructions that is translated to a single IR instruction. However, as mentioned earlier, for x86, ARM and AVR we fixed the maximum sequence size at 4. With this setting, we translated all the x86 binaries on the Ubuntu-14.04 desktop OS, all the ARM binaries on debian-7.8.0, and the coreutils-2.22 binaries for AVR. The translations were performed on a 32-bit Intel i7 CPU running Ubuntu-14.04. The disassemblers used were objdump-2.24 for x86, arm-linux-gnueabi-objdump-2.24 for ARM, and avr-objdump-2.23.1 for AVR^25. Figure 47 summarizes the translation time and memory consumption for the x86, ARM, and AVR translations performed using the LISC-generated semantic models. We exploited our parallel translation framework to translate multiple binaries simultaneously. As we found, our translator translates approximately one binary per minute. The total translation time includes the disassembly time and the time to translate the list of disassembled assembly instructions. Although disassembly is fast, model loading (the extracted semantics model is the same as the LISC transducer) takes time. For instance, it takes around 2.76 seconds to load the x86

25 On Ubuntu-14.04, one can simply install the binutils-arm-linux-gnueabi package for ARM, and the binutils-avr and gcc-avr packages for AVR.

To avoid loading the model for the translation of every binary, our parallel execution framework loads the model into memory only once, and then forks its execution for the parallel translations. The time to load the automaton into memory is thus offset by the parallelization. The amount of memory consumed by the translator is also reasonably small compared to the RAM available on modern desktop systems.
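
The following OCaml sketch illustrates this load-once-then-fork strategy; load_model and translate are illustrative stand-ins for our translator's entry points, and for brevity the sketch waits for each child before forking the next, whereas the real framework runs the workers in parallel:

    (* Load the (large) semantic model exactly once; each translation then
       runs in a forked child that inherits the already-loaded model. *)
    let translate_all (load_model : string -> 'm)
                      (translate : 'm -> string -> unit)
                      (model_file : string) (binaries : string list) : unit =
      let model = load_model model_file in      (* pay the ~2.76s cost once *)
      List.iter
        (fun binary ->
          match Unix.fork () with
          | 0 -> translate model binary; exit 0 (* child: shares loaded model *)
          | _ -> ignore (Unix.wait ()))         (* parent: wait for the child *)
        binaries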

Correctness. An important evaluation criterion for an assembly-to-IR translator is the correctness of the translation: how many of the assembly-to-IR translations performed by the translator are correct? Since we used the translation algorithm given in this section for the evaluation of LISC and EISSEC, we refer the interested reader to the evaluation sections for those systems. In addition to the correctness test based on semantic equivalence, we use a loop-back test to check the correctness of translating a list of assembly instructions into a list of IR instructions. The loop-back test consists of three steps: (1) applying the extracted models to translate a list of assembly instructions L to IR, (2) running the compiler (GCC in our case) with the IR as input, and (3) verifying that the compiler produces exactly the same list of assembly instructions as L. If GCC cannot translate an IR instruction (possibly because of the manually modeled mapping rules), then we perform such a translation ourselves and emit the corresponding assembly instruction(s) modeled in the matching rule. All the binaries used for experiments in this section passed the loop-back test.
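
A driver for the loop-back test can be sketched in a few lines of OCaml; the command names and flags below (lift, ir-to-asm, x86.model) are purely illustrative, not the exact ones we use:

    (* Loop-back test: lift assembly to IR, regenerate assembly from the
       IR, and check that the regenerated assembly is identical. *)
    let loopback_test (asm_file : string) : bool =
      let run cmd =
        if Sys.command cmd <> 0 then failwith ("command failed: " ^ cmd) in
      (* Step 1: translate the assembly listing L to IR using our models. *)
      run (Printf.sprintf "./lift --model x86.model %s -o out.ir" asm_file);
      (* Step 2: run the compiler on the IR to produce assembly again. *)
      run "./ir-to-asm out.ir -o regenerated.s";
      (* Step 3: pass iff the regenerated assembly matches L exactly. *)
      Sys.command (Printf.sprintf "diff -q %s regenerated.s" asm_file) = 0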

6.4 Related Work

All of the previous approaches to translating low-level instructions into high-level IRs are based on manually building the components needed for such translation. These approaches can be further divided into two categories. Approaches in the first category require a hand-written specification of the target instruction set to drive the translator; these include [25, 67, 15, 35, 11, 27, 14, 48]. For instance, SecondWrite [11] requires an XML specification of the architecture's instructions, whereas UQBT [27] designed its own format for such specifications. Approaches in the second category make clever use of existing systems such as QEMU and Valgrind. QEMU translates a source-architecture instruction into the target architecture. Each architecture that QEMU supports, such as x86 or ARM, has its own backend, which takes instructions written in QEMU IR and translates them into target instructions.

Approaches [23, 45] have written an LLVM backend for QEMU, i.e., a backend to convert QEMU IR into LLVM IR, while other approaches [18] directly use Valgrind's assembly-to-IR translator as is. One important drawback of all existing approaches is that they require manual effort to build the translators. Approaches in the second category additionally suffer from their dependence on QEMU and Valgrind: because QEMU and Valgrind lack support for some architectures, such as AVR, these approaches cannot be applied to those architectures either. DisIRer [46], on the other hand, is an approach similar to ours in that it relies on GCC to build assembly-to-IR translators automatically. Unfortunately, their approach relies on architecture specifications (machine description files in GCC) to build an inverse mapping table for assembly-to-IR translation. Their approach is thus specific to GCC, and may even be specific to a particular version of GCC. LISC, in contrast, is a compiler-neutral approach. We believe that compiler-neutrality helps LISC avoid issues such as changes between different versions of a compiler, or differing internal details across multiple compilers.

6.5 Summary

In this section, we described our approach to using the extracted semantic models to build assembly-to-IR translators automatically. Assembly-to-IR translators must address the challenge of deciding how to divide the input list of assembly instructions for translation; this challenge arises because an IR instruction can map to a single assembly instruction or to a sequence of them. We described our algorithm to address this challenge. In addition to the challenge of selecting assembly instructions for translation, we also discussed various other challenges in lifting input binaries to compiler IR. We believe that our evaluation, covering models for multiple architectures and an input set of all binaries found on desktop OSes, validates the effectiveness of our approach to building assembly-to-IR translators automatically. More importantly, we believe it demonstrates the effectiveness of our approach in reducing the manual effort needed to build these translators.

7 CONCLUSION AND FUTURE WORK

Extraction of instruction-set semantic models is a problem common to a number of software systems (such as system emulators, virtual machine monitors, etc.). Unfortunately, the usual approach of manual model extraction reduces the applicability and effectiveness of these systems. This dissertation proposes two approaches (LISC and EISSEC) that address this problem by automatically extracting instruction-set semantic models from the code generators of modern compilers. To the best of our knowledge, no prior research work attempts automatic semantic extraction using modern compilers. Automatically extracted models enable these systems to support new architectures quickly, saving the time, money, and human effort required to port them. The approach of automatic extraction thus leads to better software engineering practices.

In terms of the contributions of the two approaches, LISC demonstrates that the code generators of modern compilers contain architecture-specific knowledge, and that one can extract such knowledge by learning it from code generator logs. More importantly, it demonstrates that one can learn the model for a new architecture quickly, and that such an approach can reduce the manual effort required to support new architectures. By evaluating against all the binaries (for x86 and ARM) found on commodity OS distributions, LISC demonstrates that the approach is generic and can deal with the complexities of modern architectures. EISSEC, on the other hand, demonstrates that one can symbolically execute compiler code generators and extract their function (thus extracting complete instruction-set semantic models). Symbolic execution is a popular technique that is mostly applied in the context of test-case generation and bug finding. By extracting the function of a code generator, we believe that EISSEC demonstrates how one can address the path-explosion problem for software as complex as code generators. We believe that some of the techniques developed in EISSEC can be applied in the general context of extracting the function of other software.

Implementing program analysis and instrumentation passes at the compiler IR level is one approach to source-to-source transformation and static analysis. This approach has been popularized by GCC and LLVM, which introduced mechanisms for writing plug-ins that can inspect or modify intermediate representations. To exploit such IR-level program analysis, compiler-specific assembly-to-IR translators have been developed manually. In comparison to these compiler-specific tools, the approaches proposed in this dissertation are more general and applicable to a wide variety of compilers. LISC, in particular, is

applicable to any compiler, and it does not need to worry about modifications to the internals of compilers. A systematic, architecture- and compiler-neutral approach to developing assembly-to-IR translators is thus yet another important contribution of this dissertation.

To demonstrate the applicability of the extracted semantic models, we developed an approach to test the correctness of code generators. Testing compilers and ensuring that they are bug-free is one of the important problems in software engineering, and by developing an approach to check the correctness of code generators, this dissertation makes an important contribution towards solving it. Moreover, by developing a compiler-neutral approach and demonstrating its effectiveness in finding modeling bugs in a real-world compiler (GCC), ArCheck shows that the approach is practical and can be applied to test real-world compilers.

Broadly speaking, developing assembly-to-IR translators and testing code generators are only two of the possible applications of the semantic models. This dissertation abstracts out the problem of semantic model extraction that is common across many applications, and addresses it by developing approaches for automatic extraction. In other words, the applicability of this dissertation goes beyond the development of assembly-to-IR translators.

7.1 Future Work

One of the important contributions of this dissertation is a framework for extracting semantic models using compilers. There are many interesting applications of this framework. One such application is to use the extracted semantic models to build binary translation, instrumentation, and analysis systems. Although we demonstrated that the extracted models can be used to build assembly-to-IR translators quickly, building binary analysis, translation, and instrumentation systems on top of these translators would be a useful application of the extracted models. Many binary analyses and instrumentations are platform-independent and can be applied to binaries from many architectures; unfortunately, limited architecture support restricts their effectiveness. Our approach can address this problem and increase the effectiveness of existing systems.

Given that LISC is a black-box approach to learning the models followed by a code generator, we believe it can be extended to learn models from other systems such as assemblers, virtual machines, etc. Moreover, the idea of testing the extracted models could be extended to test these systems by defining appropriate correctness criteria. Checking the correctness of these systems can help eliminate modeling bugs, and can

also reduce the programmer effort needed in bug detection, analysis, and fixing.

We believe that the models extracted using our techniques have several applications beyond testing and building other systems. For instance, the models can be used to check the conformance of a system with a standard or reference model: by extracting the assembly-to-machine-code model embedded inside an assembler, we can check the assembler's conformance to a standard specification for instruction encoding. More broadly, these models can be checked using techniques from the model-checking domain, and one can approach the verification problem from this direction. Furthermore, the models of multiple systems, or of multiple implementations of the same specification, can be compared to find issues in them using techniques from n-version testing.
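
As a simple illustration of this n-version idea, two extracted models, viewed abstractly as maps from assembly syntax to IR, can be compared to find the instructions on which they disagree; the following OCaml sketch is purely illustrative:

    (* Report the instructions on which two extracted models disagree;
       each model is abstracted as an association list from assembly
       syntax to the IR it maps to. *)
    let disagreements (model_a : (string * string) list)
                      (model_b : (string * string) list)
        : (string * string * string) list =
      List.filter_map
        (fun (asm, ir_a) ->
          match List.assoc_opt asm model_b with
          | Some ir_b when ir_b <> ir_a -> Some (asm, ir_a, ir_b)
          | _ -> None)
        model_a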

References

[1] BEXTR intrinsic has memory operands switched around. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57623.

[2] Decision tree learning. http://en.wikipedia.org/wiki/Decision_tree_learning.

[3] GCC, the GNU Compiler Collection. http://gcc.gnu.org/.

[4] i386.md strangeness in sse2_movsd. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14941.

[5] Intel 64 and IA-32 Instruction Set Reference, A-Z. http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.

[6] RTL representation of i386 shrdl instruction is incorrect? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61503.

[7] Spaceship operator miscompiled. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53138.

[8] The LLVM Compiler Infrastructure. http://llvm.org/.

[9] x86_64-linux-gnu-gcc generate wrong asm instruction MOVABS for intel syntax. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56114.

[10] Mihhail Aizatulin, Andrew D. Gordon, and Jan Jürjens. Extracting and verifying cryptographic models from C protocol code by symbolic execution. In Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS '11, pages 331–340, 2011.

[11] Kapil Anand, Matthew Smithson, Aparna Kotha, Khaled Elwazeer, and Rajeev Barua. Decompilation to Compiler High IR in a binary rewriter. Technical report, University of Maryland, November 2010. http://www.ece.umd.edu/barua/high-IR-technical-report10.pdf.

[12] ARM. ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/index.html, 2014.

[13] Shay Artzi, Adam Kiezun, Julian Dolby, Frank Tip, Danny Dig, Amit Paradkar, and Michael D. Ernst. Finding Bugs in Dynamic Web Applications. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, ISSTA ’08, pages 261–272, 2008.

[14] Gogul Balakrishnan, Radu Gruian, Thomas Reps, and Tim Teitelbaum. CodeSurfer/x86 — A Platform for Analyzing x86 Executables. In Proceedings of the 14th International Conference on Compiler Construction, CC'05, pages 250–254, 2005.

[15] Fabrice Bellard. QEMU, a Fast and Portable Dynamic Translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 41–41, 2005.

[16] Peter Boonstoppel, Cristian Cadar, and Dawson Engler. RWset: attacking path explosion in constraint-based test generation. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, pages 351–366, 2008.

[17] Derek L. Bruening. Efficient, Transparent, and Comprehensive Runtime Code Manipulation. PhD thesis, Cambridge, MA, USA, 2004.

[18] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. BAP: A Binary Analysis Platform. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, 2011.

[19] Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI'08, pages 209–224, Berkeley, CA, USA, 2008.

[20] Cristian Cadar and Dawson Engler. Execution generated test cases: How to make systems code crash itself. In Proceedings of the 12th International Conference on Model Checking Software, SPIN’05, pages 2–23, 2005.

[21] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. EXE: automatically generating inputs of death. In Proceedings of the 13th ACM conference on Computer and communications security, CCS '06, pages 322–335, 2006.

[22] Anton Chernoff and Ray Hookway. DIGITAL FX!32: Running 32-bit x86 applications on Alpha NT. In Proceedings of the USENIX Windows NT Workshop, pages 2–2, 1997.

[23] Vitaly Chipounov and George Candea. Dynamically Translating x86 to LLVM using QEMU. Technical report, 2010.

[24] Vitaly Chipounov, Vlad Georgescu, Cristian Zamfir, and George Candea. Selective symbolic execution. In Workshop on Hot Topics in Dependable Systems, 2009.

[25] Cristina Cifuentes, Brian Lewis, and David Ung. Walkabout - A Retargetable Dynamic Binary Translation Framework. In Proceedings of the 2002 Workshop on Binary Translation, 2002.

[26] Cristina Cifuentes and Doug Simon. Procedure Abstraction Recovery from Binary Code. In 4th European Conference on Software Maintenance and Reengineering, CSMR 2000, Zurich, Switzerland, February 29 - March 3, 2000, pages 55–64, 2000.

[27] Cristina Cifuentes, Mike Van Emmerik, and Norman Ramsey. The design of a resourceable and retargetable binary translator. In Reverse Engineering, 1999. Proceedings. Sixth Working Conference on, pages 280–291, 1999.

[28] Christian S. Collberg. Reverse Interpretation + Mutation Analysis = Automatic Retargeting. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation, PLDI '97, pages 57–70, 1997.

[29] Peter Collingbourne, Cristian Cadar, and Paul H.J. Kelly. Symbolic crosschecking of floating-point and SIMD code. In Proceedings of the sixth conference on Computer systems, EuroSys ’11, pages 315–328, 2011.

[30] H. Comon, M. Dauchet, R. Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree Automata Techniques and Applications. Available at http://www.grappa.univ-lille3.fr/tata, 2007. Released October 12, 2007.

[31] Atmel Corp. Atmel AVR 8-bit Instruction Set. www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf, 2014.

[32] Jack W. Davidson and Christopher W. Fraser. Automatic generation of peephole optimizations. In Proceedings of the 1984 SIGPLAN Symposium on Compiler Construction, SIGPLAN '84, 1984.

[33] Jack W. Davidson and Christopher W. Fraser. Code Selection Through Object Code Optimization. ACM Trans. Program. Lang. Syst., 1984.

[34] Leonardo De Moura and Nikolaj Bjørner. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, pages 337–340, 2008.

[35] Thomas Dullien and Sebastian Porst. REIL: A platform-independent intermediate representation of disassembled code for static code analysis. In Proceedings of CanSecWest, 2009.

[36] Khaled ElWazeer, Kapil Anand, Aparna Kotha, Matthew Smithson, and Rajeev Barua. Scalable Variable and Data Type Detection in a Binary Rewriter. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 51–60, 2013.

[37] Wen Fu, Rongcai Zhao, Jianmin Pang, and Jingbo Zhang. Recovering Variable-Argument Functions from Binary Executables. In Computer and Information Science, 2008. ICIS 08. Seventh IEEE/ACIS International Conference on, pages 545–550, May 2008.

[38] Vijay Ganesh and David L. Dill. A Decision Procedure for Bit-vectors and Arrays. In Proceedings of the 19th International Conference on Computer Aided Verification, CAV’07, pages 519–531, 2007.

[39] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, PLDI '05, pages 213–223, 2005.

[40] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. Automated Whitebox Fuzz Testing. In Network and Distributed System Security Symposium (NDSS), volume 8, pages 151–166, 2008.

[41] Patrice Godefroid and Ankur Taly. Automated Synthesis of Symbolic Instruction Encodings from I/O Samples. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 441–452, 2012.

[42] J.N. Hooker. Solving the incremental satisfiability problem. The Journal of Logic Programming, 15(1–2):177–186, 1993.

[43] William E. Howden and Suehee Pak. The Derivation of Functional Specifications from Source Code. In Proceedings of the Third Asia-Pacific Software Engineering Conference, APSEC ’96, pages 166–, 1996.

[44] Wilson C. Hsieh, Dawson R. Engler, and Godmar Back. Reverse-Engineering Instruction Encodings. In Proceedings of the General Track: 2001 USENIX Annual Technical Conference, pages 133–145, 2001.

[45] Chun-Chen Hsu, Pangfeng Liu, Chien-Min Wang, Jan-Jan Wu, Ding-Yong Hong, Pen-Chung Yew, and Wei-Chung Hsu. LnQ: Building High Performance Dynamic Binary Translators with Existing Compiler Backends. In Proceedings of the 2011 International Conference on Parallel Processing, ICPP ’11, pages 226–234, 2011.

[46] Yuan-Shin Hwang, Tzong-Yen Lin, and Rong-Guey Chang. DisIRer: Converting a retargetable compiler into a multiplatform binary translator. ACM Trans. Archit. Code Optim., 7(4):18:1–18:36, December 2010.

[47] Intel. Intel 64 and IA-32 Architectures Software Developer Manuals. http://www.intel.com.

[48] Johannes Kinder and Helmut Veith. Jakstab: A Static Analysis Platform for Binaries. In Proceedings of the 20th international conference on Computer Aided Verification, CAV ’08, pages 423–427, 2008.

[49] James C. King. Symbolic execution and program testing. Commun. ACM, 19(7):385–394, July 1976.

[50] Volodymyr Kuznetsov, Johannes Kinder, Stefan Bucur, and George Candea. Efficient State Merging in Symbolic Execution. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 193–204, 2012.

[51] Lillian Lee. Learning of Context-Free Languages: A Survey of the Literature. Technical Report TR-12-96, Harvard University, 1996.

[52] Xavier Leroy. Formal Certification of a Compiler Back-end or: Programming a Compiler with a Proof Assistant. POPL '06, 2006.

[53] Xavier Leroy. A Formally Verified Compiler Back-end. J. Autom. Reason., December 2009.

[54] Christian Lindig. Random Testing of C Calling Conventions. In Proceedings of the Sixth International Symposium on Automated Analysis-driven Debugging, AADEBUG'05, 2005.

[55] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, PLDI '05, pages 190–200, 2005.

[56] Rupak Majumdar and Ru-Gang Xu. Directed test generation using symbolic grammars. In Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ASE '07, pages 134–143, 2007.

[57] Lorenzo Martignoni, Roberto Paleari, Giampaolo Fresi Roglia, and Danilo Bruschi. Testing CPU Emulators. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA ’09, 2009.

[58] Peter M. Maurer. Generating Test Data with Enhanced Context-Free Grammars. IEEE Softw., July 1990.

[59] William M. McKeeman. Differential Testing for Software. DIGITAL TECHNICAL JOURNAL, 1998.

[60] Microsoft. Microsoft common object file format. http://www.skyfree.org/linux/references/coff.pdf.

[61] Barton P. Miller, Louis Fredriksen, and Bryan So. An Empirical Study of the Reliability of UNIX Utilities. Commun. ACM, December 1990.

[62] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997.

[63] Mehryar Mohri. Finite-state Transducers in Language and Speech Processing. Comput. Linguist., 23(2):269–311, June 1997.

[64] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

[65] George C. Necula. Translation Validation for an Optimizing Compiler. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI '00, 2000.

[66] George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer. CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs. In Proceedings of the 11th International Conference on Compiler Construction, CC ’02, pages 213–228, 2002.

[67] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI '07, pages 89–100, 2007.

[68] J. Oncina, P. García, and E. Vidal. Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Trans. Pattern Anal. Mach. Intell., 15(5):448–458, May 1993.

[69] J. Oncina and Miguel Angel Varó. Using Domain Information During the Learning of a Subsequential Transducer. In Proceedings of the 3rd International Colloquium on Grammatical Inference: Learning Syntax from Sentences, ICGI '96, pages 301–312, 1996.

[70] J. Pichler. Specification extraction by symbolic execution. In Reverse Engineering (WCRE), 2013 20th Working Conference on, pages 462–466, Oct 2013.

[71] Amir Pnueli, Michael Siegel, and Eli Singerman. Translation Validation. In Proceedings of the 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems, TACAS '98, pages 151–166, 1998.

[72] Corina S. Păsăreanu and Neha Rungta. Symbolic PathFinder: Symbolic Execution of Java Bytecode. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE '10, pages 179–180, 2010.

[73] J. R. Quinlan. Induction of Decision Trees. Mach. Learn., 1(1):81–106, March 1986.

[74] J. Ross Quinlan. C4.5: Programs for Machine Learning. 1993.

[75] Gerd Ritter. Formal Sequential Equivalence Checking of Digital Systems by Symbolic Simulation. PhD thesis, TU Darmstadt, March 2001.

[76] Lior Rokach and Oded Maimon. Data Mining with Decision Trees: Theory and Applications. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2008.

[77] William C. Rounds. Mappings and grammars on trees. Mathematical systems theory, 4(3):257–287, 1970.

[78] Yasubumi Sakakibara. Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76(2–3):223–242, 1990.

[79] H. Samet. A Machine Description Facility for Compiler Testing. Software Engineering, IEEE Transactions on, 1977.

[80] R. Sekar, I. V. Ramakrishnan, and Andrei Voronkov. Handbook of Automated Reasoning, chapter Term Indexing, pages 1853–1964. 2001.

[81] R. C. Sekar, R. Ramesh, and I. V. Ramakrishnan. Adaptive Pattern Matching. In Proceedings of the 19th International Colloquium on Automata, Languages and Programming, ICALP '92, pages 247–260, 1992.

[82] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: a concolic unit testing engine for C. In Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, ESEC/FSE-13, pages 263–272, 2005.

[83] Flash Sheridan. Practical Testing of a C99 Compiler Using Output Comparison. Softw. Pract. Exper., November 2007.

[84] Junaid Haroon Siddiqui and Sarfraz Khurshid. Scaling symbolic execution using ranged analysis. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 523–536, 2012.

[85] Bochs Team. Bochs IA-32 emulator. http://bochs.sourceforge.net/.

[86] James W. Thatcher. Generalized sequential machine maps. Journal of Computer and System Sciences, 4(4):339–367, 1970.

[87] Nikolai Tillmann and Jonathan De Halleux. Pex: White Box Test Generation for .NET. In Proceedings of the 2nd International Conference on Tests and Proofs, TAP'08, pages 134–153, 2008.

[88] Tool Interface Standard (TIS). Executable and Linking Format (ELF) Specification. http://www.uclibc.org/docs/elf.pdf, 1995.

[89] Alok Tongaonkar, R. Sekar, and Sreenaath Vasudevan. Fast Packet Classification Using Condition Factorization. In Proceedings of the 7th International Conference on Applied Cryptography and Network Security, ACNS ’09, pages 417–436, 2009.

[90] Markus Triska. The finite domain constraint solver of SWI-Prolog. In FLOPS, volume 7294 of LNCS, pages 307–316, 2012.

[91] Markus Triska. Correctness Considerations in CLP(FD) Systems. PhD thesis, Vienna University of Technology, 2014.

[92] Peter Vanbroekhoven, Gerda Janssens, Maurice Bruynooghe, and Francky Catthoor. A Practical Dynamic Single Assignment Transformation. ACM Trans. Des. Autom. Electron. Syst., 12(4), September 2007.

[93] VMware. VMware. http://www.vmware.com/.

[94] Shaohui Wang, Srinivasan Dwarakanathan, Oleg Sokolsky, and Insup Lee. High-level model extraction via symbolic execution. Technical Report MS-CIS-12-04, Department of Computer and Information Science, University of Pennsylvania, 2012.

[95] J. Whittemore, Joonyoung Kim, and K. Sakallah. SATIRE: A new incremental satisfiability engine. In Design Automation Conference, 2001. Proceedings, pages 542–545, 2001.

[96] Jan Wielemaker, Tom Schrijvers, Markus Triska, and Torbjörn Lager. SWI-Prolog. Theory and Practice of Logic Programming, 12(1-2):67–96, 2012.

[97] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and Understanding Bugs in C Compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, 2011.

[98] Mingwei Zhang and R. Sekar. Control Flow Integrity for COTS Binaries. In Proceedings of the 22nd USENIX Conference on Security, SEC'13, pages 337–352, Berkeley, CA, USA, 2013.

[99] Chen Zhao, Yunzhi Xue, Qiuming Tao, Liang Guo, and Zhaohui Wang. Automated test program generation for an industrial optimizing compiler. In Automation of Software Test, 2009. AST '09. ICSE Workshop on, May 2009.

[100] Jianzhou Zhao, Santosh Nagarakatte, Milo M.K. Martin, and Steve Zdancewic. Formal Verification of SSA-based Optimizations for LLVM. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, 2013.
