Retargetable Analysis of Machine Code
Total Page:16
File Type:pdf, Size:1020Kb
VYSOKE´ UCˇ ENI´ TECHNICKE´ V BRNEˇ BRNO UNIVERSITY OF TECHNOLOGY FAKULTA INFORMACˇ NI´CH TECHNOLOGII´ U´ STAV INFORMACˇ NI´CH SYSTE´ MU˚ FACULTY OF INFORMATION TECHNOLOGY DEPARTMENT OF INFORMATION SYSTEMS REKONFIGUROVATELNA´ ANALY´ ZA STROJOVE´ HO KO´ DU RETARGETABLE ANALYSIS OF MACHINE CODE DISERTACˇ NI´ PRA´ CE PHD THESIS AUTOR PRA´ CE Ing. JAKUB KRˇ OUSTEK AUTHOR VEDOUCI´ PRA´ CE Doc. Dr. Ing. DUSˇ AN KOLA´ Rˇ SUPERVISOR BRNO 2014 Abstrakt Analýza softwaru je metodologie, jejímž účelem je analyzovat chování daného programu. Jednotlivé metody této analýzy je možné využít i v dalších oborech, jako je zpětné in- ženýrství, migrace kódu apod. V této práci se zaměříme na analýzu strojového kódu, na zjištění nedostatků existujících metod a na návrh metod nových, které umožní rychlou a přesnou rekonfigurovatelnou analýzu kódu (tj. budou nezávislé na konkrétní cílové plat- formě). Zkoumány budou dva typy analýz – dynamická (tj. analýza za běhu aplikace) a statická (tj. analýza aplikace bez jejího spuštění). Přínos této práce v rámci dynam- ické analýzy je realizován jako rekonfigurovatelný ladicí nástroj a dále jako dva typy tzv. rekonfigurovatelného translátovaného simulátoru. Přínos v rámci statické analýzy spočívá v navržení a implementování rekonfigurovatelného zpětného překladače, který slouží pro transformaci strojového kódu zpět do vysokoúrovňové reprezentace. Všechny tyto nástroje jsou založeny na nových metodách navržených autorem této práce. Na základě experi- mentálních výsledků a ohlasů od uživatelů je možné usuzovat, že tyto nástroje jsou plně srovnatelné s existujícími (komerčními) nástroji a nezřídka dosahují i lepších výsledků. Abstract Program analysis is a computer-science methodology whose task is to analyse the behavior of a given program. The methods of program analysis can also be used in other methodolo- gies such as reverse engineering, re-engineering, code migration, etc. In this thesis, we focus on program analysis of a machine-code and we address the limitations of a nowadays ap- proaches by proposing novel methods of a fast and accurate retargetable analysis (i.e. they are designed to be independent of a particular target platform). We focus on two types of analysis — dynamic analysis (i.e. run-time analysis) and static analysis (i.e. analysing application without its execution). The contribution of this thesis within the dynamic anal- ysis lays in the extension and enhancement of existing methods and their implementation as a retargetable debugger and two types of a retargetable translated simulator. Within the static analysis, we present a concept and implementation of a retargetable decompiler that performs a program transformation from a machine code into a human-readable form of representation. All of these tools are based on several novel methods defined by the author. According to our experimental results and users feed-back, all of the proposed tools are at least fully competitive to existing solutions, while outperforming these solutions in several ways. Klíčová slova strojový kód, analýza kódu, reverzní inženýrství, zpětný překladač, ladicí nástroj, simulátor, zpětný assembler, gramatiky s rozptýleným kontextem, Lissom, jazyky pro popis architek- tur, ISAC, škodlivý kód. Keywords machine code, code analysis, reverse engineering, decompiler, debugger, simulator, disas- sembler, scattered context grammars, Lissom, architecture description languages, ISAC, malware. Citation Jakub Křoustek: Retargetable Analysis of Machine Code, PhD thesis, Brno, FIT BUT, 2014 Retargetable Analysis of Machine Code Statement of Originality I hereby declare that this thesis is my own work that has been created under the supervision of doc. Dušan Kolář. Several results were achieved in co-operation with Lukáš Ďurfina, Petr Zemek, Peter Matula, Zdeněk Přikryl, Stanislav Židek, my students, and other mem- bers of the Lissom project. Where other sources of information have been used, they have been duly acknowledged. ....................... Jakub Křoustek November 17, 2014 Acknowledgements I wish to thank Dušan Kolář for the support during his supervision of this work. I also wish to thank all members of the Lissom research project for their help and ceaseless support of my research. Last, but certainly not least, I wish to thank my family and my girlfriend Lenka for their support and patience during my work on this thesis. c Jakub Křoustek, 2014. Tato práce vznikla jako školní dílo na Vysokém učení technickém v Brně, Fakultě in- formačních technologií. Práce je chráněna autorským zákonem a její užití bez udělení oprávnění autorem je nezákonné, s výjimkou zákonem definovaných případů. Contents List of Figures5 List of Tables9 1 Introduction 11 1.1 Author’s Contribution.............................. 13 1.2 Chapter Survey.................................. 14 2 Preliminaries 15 2.1 Theory of Formal Languages.......................... 15 2.2 Theory of Compiler Construction........................ 17 3 Focus of the Thesis 23 3.1 CPU Architectures Overview.......................... 23 3.1.1 CISC Architecture............................ 24 3.1.2 RISC Architecture............................ 24 3.1.3 VLIW Architecture............................ 25 3.2 Executable-File Formats Overview....................... 26 3.2.1 UNIX Executable and Linking Format................. 26 3.2.2 Microsoft Portable Executable and Common Object File Format.. 27 3.2.3 Other Common Formats......................... 28 3.2.4 Comparison of EFFs........................... 28 3.3 Debugging-Information Formats Overview................... 29 3.3.1 DWARF.................................. 29 3.3.2 PDB.................................... 30 3.3.3 Other Formats.............................. 30 4 Machine-Code Analysis: State of the Art 31 4.1 Compiler and Packer Detectors......................... 31 4.2 EFF Parsers and Converters........................... 32 4.3 Disassemblers................................... 34 4.4 Machine-Code Decompilers........................... 35 4.5 Debuggers..................................... 39 4.6 Tools for Simulation, Emulation, and Instrumentation............ 42 4.7 Lissom Project.................................. 44 4.7.1 ISAC Language.............................. 46 4.7.2 Lissom Object File Format....................... 48 4.7.3 Toolchain................................. 48 1 2 Contents 5 Retargetable Dynamic Analysis 53 5.1 Obtaining and Processing Debugging Information............... 54 5.1.1 DWARF Parsing............................. 54 5.1.2 PDB Parsing............................... 58 5.2 Translated Simulator............................... 60 5.2.1 Translated Simulation Overview.................... 62 5.2.2 Static Translated Simulation...................... 62 5.2.3 Just-In-Time Translated Simulation (JIT)............... 67 5.3 Source-Level Debugger.............................. 69 5.3.1 Lissom Three-Layer Architecture.................... 71 5.3.2 Obtaining Debugging Information................... 72 5.3.3 Implementation.............................. 74 6 Retargetable Static Analysis 79 6.1 Motivation.................................... 80 6.1.1 Retargetable Decompilation....................... 81 6.1.2 Application of Decompilation...................... 81 6.2 Overview of the Retargetable Decompiler................... 83 6.3 Preprocessing Phase............................... 87 6.3.1 Compiler and Packer Detection..................... 89 6.3.2 Unpacking................................. 93 6.3.3 Executable-File-Format Conversion................... 95 6.3.4 Context-Sensitive Description of Executable-File-Formats...... 97 6.4 Front-End Phase................................. 105 6.4.1 Debugging and Symbolic Information Exploitation.......... 106 6.4.2 Statically-Linked-Code Recognition................... 107 6.4.3 Instruction Decoding........................... 108 6.4.4 Control-Flow Analysis (CFA)...................... 111 6.4.5 Static Code Interpreter......................... 112 6.4.6 Function Detection............................ 113 6.4.7 Data-Flow Analysis (DFA)....................... 115 6.4.8 Other Analyses.............................. 117 6.4.9 Outputs Generation........................... 118 6.5 Middle-End Phase................................ 118 6.5.1 Built-in Optimization.......................... 120 6.5.2 Instruction-Idiom Analysis....................... 122 6.6 Back-End Phase................................. 130 6.6.1 Back-End Intermediate Representation................. 131 6.6.2 Conversion into BIR........................... 132 6.6.3 BIR Optimizations............................ 133 6.7 Output Generation................................ 135 6.7.1 Generation of Control-Flow Graphs.................. 135 6.7.2 Generation of Call Graphs........................ 136 6.7.3 Generation of the Output HLL..................... 136 Contents 3 7 Experimental Results 139 7.1 Translated Simulation.............................. 140 7.2 Source-Level Debugging............................. 142 7.3 Decompilation.................................. 144 7.3.1 Compiler and Packer Detection..................... 145 7.3.2 Function, Argument, and Return Value Reconstruction....... 148 7.3.3 Instruction-Idiom Detection....................... 152 8 Conclusion 155 8.1 Future Work................................... 156 Bibliography 159 Author’s Publications 167 List of Abbreviations 171 A Decompilation Results 173 A.1 Decompilation of Simple Programs....................... 173 A.2 Decompilation of a Real-World Malware...................