TECHNISCHE UNIVERSITÄT MÜNCHEN

Institut für Informatik
Lehrstuhl Informatik II

Interprocedural Analysis of Low-Level Code

Andrea Flexeder

Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines

Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation.

Vorsitzender: Univ.-Prof. Dr. H. M. Gerndt
Prüfer der Dissertation:
1. Univ.-Prof. Dr. H. Seidl
2. Dr. A. King, University of Kent at Canterbury / UK

Die Dissertation wurde am 14.12.2010 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 9.6.2011 angenommen.

Contents

1 Analysis of Low-Level Code
   1.1 Source versus Binary
   1.2 Application Areas
   1.3 Executable and Linkable Format (ELF)
   1.4 Application Binary Interface (ABI)
   1.5 Assumptions
   1.6 Contributions

2 Control Flow Reconstruction
   2.1 The Concrete Semantics
   2.2 Interprocedural Control Flow Reconstruction
   2.3 Practical Issues
   2.4 Implementation
   2.5 Programming Model

3 Classification of Memory Locations
   3.1 Semantics
   3.2 Interprocedural Variable Differences
   3.3 Application to Assembly Analysis

4 Reasoning about Array Index Expressions
   4.1 Linear Two-Variable Equalities
   4.2 Application to Assembly Analysis
   4.3 Register Coalescing and Locking

5 Tools
   5.1 Combination of Abstract Domains
   5.2 VoTUM

6 Side-Effect Analysis
   6.1 Semantics
   6.2 Analysis of Side-Effects
   6.3 Enhancements
   6.4 Experimental Results

7 Exploiting Alignment for WCET and Data Structures
   7.1 Alignment Analysis
   7.2 Memory Operand Alignment
   7.3 Cache Bursting
   7.4 Identifying Data Structures
   7.5 Experimental Results

8 Range Information
   8.1 Semantics
   8.2 Convex Abstraction
   8.3 Linear Inequality Guards
   8.4 Representing Convex Sets
   8.5 Experimental Results

9 Conclusion
   9.1 Contributions
   9.2 Perspectives

Bibliography

Abstract

Static analysis of executables is employed for reverse engineering, automatic detection of low-level errors such as memory violations, malware detection, and many other application areas. Only at the level of executables can all errors introduced by programmers or even by compilers be identified. Analysis of machine code comes at a price: high-level language features such as local variables and procedures are no longer visible and need to be recovered. In particular, it is necessary to first reconstruct the control flow graph (CFG).

This thesis tackles these challenges by presenting sound static analyses of executables, expressed in the abstract interpretation framework. Our aim is to present reasonably fast and sound analyses that scale well to industrial-sized applications. The two key contributions are as follows:

First, we introduce a fully automatic and sound interprocedural analysis framework for executables. To this end, we argue for an analysis that intertwines disassembling and abstract interpretation-based analysis to provide a sound overapproximation of the CFG, and which additionally handles procedure calls precisely. In order to handle indirect jumps it is essential to reason about data; hence, the location and size of variables in memory have to be inferred. We therefore propose an analysis of differences and equalities between register contents which allows inference of potential local and global variables. In order to discharge certain assumptions made during control flow reconstruction, we add a side-effect analysis that reasons about the modifying potential of procedures.

Second, we present two novel domains: the domain of fast linear two-variable equalities and the domain of simplices, a special class of convex polyhedra. While the former allows a precise analysis of non-optimised assembly by inferring equalities between registers and memory locations, the latter infers precise information about memory accesses by providing linear inequality relations between loop iteration variables and memory accesses.

We have implemented these analyses and experimentally confirmed that our techniques scale well to industrial-sized executables and provide very good results concerning control flow reconstruction, inference of local variables, and alignment information for improving worst-case execution time (WCET) estimation.

Zusammenfassung

This thesis describes static analyses of assembly programs, which are employed in the areas of reverse engineering, automatic detection of errors such as stack overflows, and malware detection. Only at the level of executables can all errors introduced by programmers or compilers be detected. However, several difficulties arise, since at the assembler level abstract language concepts, such as local variables or procedures, are no longer explicitly present and must first be reconstructed. This requires an analysis that recovers the control flow graph (CFG) of the executable under analysis.

This thesis addresses the challenges posed by the analysis of assembly programs. To this end, a static program analysis framework based on the principle of abstract interpretation is introduced, which enables sound statements about the executable under analysis. Our goal is to develop fast analyses that deliver sound results and scale well to large industrial programs. The two main contributions of this thesis to the research area of executable analysis are the following:

First, we present a fully automatic interprocedural analysis framework for executables. In this context we advocate an analysis that intertwines disassembling and static program analysis. This analysis yields a sound overapproximation of the control flow graph of an executable, with procedure calls being handled precisely. To resolve indirect jumps, the size and address of the program variables in memory must be known. For this purpose we develop an analysis that infers constant differences and equalities between the values of registers in order to infer local and global program variables. We additionally present a side-effect analysis that describes the modifying potential of individual procedures and thereby improves the quality of the control flow reconstruction.

Second, we describe two new domains: the domain of fast two-variable equalities and the geometric domain of simplices, a subclass of polyhedra. Two-variable equalities enable a precise analysis of unoptimised code by inferring equalities between registers and memory locations. The simplex domain infers linear inequalities between registers and memory locations and thus allows precise statements about the relations between loop variables and memory accesses.

Our implementation of these analyses and a first experimental evaluation show that our techniques scale well to large example programs (more than 300,000 lines of assembly) and deliver very good results for control flow reconstruction, the inference of local and global variables, and information on whether data is always stored at the beginning of address blocks (alignment).

Extended Abstract

Much work has been done in the research area of analysing machine code, which is employed for reverse engineering, automatic detection of low-level errors like memory violations, and malware detection. This thesis focuses on semantics-based static analysis approaches, since obfuscation can be applied to complicate or even defeat pattern-based techniques. In contrast to a source code analysis, which can exploit the structure of the program to improve its precision, high-level language features such as local variables and procedures are no longer visible at the machine level and need to be recovered first.

In this context we work on disassembling executables. There are two prevalent disassembly techniques, recursive traversal and linear sweep. Both rely on unsound heuristics, and neither is able to produce a correct disassembly in the presence of indirect calls and jumps. As a consequence, standard tools like IDAPro deal well with compiler-generated code but yield unsatisfactory results for hand-written assembly¹, which is often found in malware. Our approach closely couples disassembling with a static analysis that computes information about registers and memory locations. This helps to correctly resolve the targets of indirect calls and indirect jumps. For instance, the high-level construct of switch statements is often compiled into an array of jump targets in combination with an indirect jump to a value read from that array. Consequently, the control flow reconstruction and the analysis of variable ranges have to be intertwined to arrive at a sound overapproximation of the control flow graph.

Identifying memory locations is crucial when all data is kept in memory and stack locations are only temporarily cached in registers, as is the case for zero-optimised² assembly. For this purpose, we have developed a fast interprocedural linear two-variable equality analysis, which allows inference of potential local variables as well as analysis of array index expressions. At the assembler level, local variables arise as constant stack pointer offsets, while global variables are managed in a global data section and thus appear as absolute addresses. Due to the use of indirect addressing, no syntactic patterns can be applied to identify explicit memory addresses. We track linear equality relations between registers in order to identify stack pointer offsets. In contrast to an approach based on full linear algebra, our fast analysis improves on the worst-case complexity by a factor of k^4, resulting in a worst-case complexity of O(n · k^4), where k is the number of program variables and n the program size.

Moreover, the binary may be given in stripped form, i.e. the symbol table and debugging information are missing. In that case, procedure boundaries have to be identified. Although most processor architectures provide dedicated assembler instructions for procedure calls and returns, these may be emulated by stack operations, which primarily occurs in malicious code. Here, one might be interested in analysing the stack in order to detect and reconstruct possible malicious behaviour, for example modifying the return address.

¹ Throughout this work we use the term assembly to refer to the piece of code that is emitted by a disassembler. The term assembler refers to the mnemonics as provided by the instruction manual of a processor.
² With the term zero-optimised we denote assembly which was compiled with gcc at the default optimisation level, i.e. O0.
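To make the jump-table translation of switch statements mentioned above concrete, consider the following hedged C sketch; the comment describes the typical compiled shape, although the exact layout depends on compiler and target.

#include <stdio.h>

/* A dense switch like this is typically compiled into a jump table:
 * the selector is bounds-checked, scaled into an index into a table of
 * code addresses stored in the data section, and an indirect jump
 * transfers control to the loaded address. A sound CFG reconstruction
 * therefore needs the value range of cmd (here 0..3) in order to
 * enumerate all possible jump targets. */
int dispatch(int cmd) {
    switch (cmd) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    case 3: return 40;
    default: return -1;
    }
}

int main(void) {
    printf("%d\n", dispatch(2)); /* prints 30 */
    return 0;
}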

Another challenge is to determine the arguments of a procedure. Although some architectures provide dedicated registers for parameter passing, arguments may be passed on the stack due to register pressure. An analysis has to identify those parts of the calling procedure's stack frame that are used for parameter passing and for caching the old value of the stack pointer in case of a procedure call, in order to verify that these organisational stack locations are not overwritten, or to flag a warning if they are. In order to discharge certain assumptions made during control flow reconstruction, we add a side-effect analysis. This analysis infers the modifying potential of each procedure by tracking all the parameter register-relative write accesses. When embedding this procedure effect into any intraprocedural analysis, it can be verified that the organisational stack locations are not overwritten and that the return address is correctly restored at procedure exit. This also allows classification of procedures as side-effect free. Furthermore, this effect description provides a soundness check verifying adherence to the ABI (Application Binary Interface); one such convention specified by the ABI is that the values of certain registers must be restored at procedure exit.

Inferring a sound overapproximation of the control flow graph of a given executable is crucial, as the control flow representation provides the elementary structure of any fixpoint-based analysis. Based on this control flow structure we present two additional assembly analyses. The first tackles the problem of rendering worst-case execution time (WCET) analysis more precise and was developed within the SuReal project [129]. The idea is to investigate alignment properties of memory accesses in order to evaluate cache hit and miss rates. For instance, loops cause a contiguous memory block to be accessed within the loop body. It is often difficult for a static analysis to find a precise bound for the loop iteration variable, or even a bound for the accessed memory block, and the precision of the cache analysis suffers from these missing bounds. Our approach is based on modular arithmetic and takes into account the cache layout as well as the relations between the loop iteration variable and the accessed memory blocks. It allows prediction of cache hit rates under the assumption that the first memory access within the loop results in a cache miss. First experimental results underpin the usefulness of this approach in significantly improving WCET estimates for industrial embedded systems code.

Furthermore, we make use of these modular equality relations in order to recover high-level structures like arrays. Existing approaches to low-level code analysis commence with one contiguous local memory block for each procedure and thereafter try to add structure according to the memory access patterns in the program. In contrast, our method identifies single local memory locations on the fly; when tracking all these memory locations becomes too costly, we summarise local variables into larger memory blocks. Modular equality relations also contribute to distinguishing the different field accesses to aggregate data structures such as records in C.
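To picture the access patterns that such modular relations capture, here is a small illustrative C program; the 8-byte record layout is an assumption (4-byte int fields, no extra padding), not a statement about any particular ABI.

#include <stddef.h>
#include <stdio.h>

struct rec { int key; int val; };  /* 8-byte records, assuming 4-byte int */

/* Accesses to r[i].key hit addresses  base + 8*i      (address ≡ 0 mod 8),
 * accesses to r[i].val hit addresses  base + 8*i + 4  (address ≡ 4 mod 8).
 * Modular equality relations of this shape let an analysis separate the
 * two fields, even though both accesses use the same base register. */
int sum_keys(struct rec *r, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += r[i].key;
    return s;
}

int main(void) {
    struct rec a[3] = {{1, 9}, {2, 9}, {3, 9}};
    printf("%d\n", sum_keys(a, 3));                  /* prints 6 */
    printf("offsets: key=%zu val=%zu\n",
           offsetof(struct rec, key), offsetof(struct rec, val));
    return 0;
}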
A second analysis infers linear inequality relations between the program variables. In this context we work on geometric domains such as polyhedra, which are widely used in the analysis of C programs. As the convex abstraction through polyhedra tends to be complex and operations on polyhedra are very expensive, we suggest a new domain: simplices. A k-dimensional simplex is a polyhedron whose frame representation is restricted to k + 1 frame elements only. This finite description allows overcoming the exponential size of the frame representation of polyhedra.

Furthermore, we present an analysis that works on the frame representation of polyhedra and simplices only. For some of the analyses introduced in this thesis we present experimental results and their impact on the analysis of assembly. First experimental results on real-world applications demonstrate the usability of our approach and yield quite promising results.

Ausführliche Zusammenfassung

There are many approaches in the area of automatic analysis of machine code, employed above all for reverse engineering, for the automatic detection of errors such as memory access violations, and for the detection of malware. This thesis focuses on semantics-based static analyses in order to address the problem of obfuscation, which attackers employ to complicate or even prevent the use of automatic analysis techniques. In contrast to analyses of source code, which can exploit the program structure to deliver more precise results, abstract language constructs such as procedures or local variables are not explicitly present at the level of executables. To arrive at meaningful analysis results, these abstract language elements must first be reconstructed. For this purpose we deal with the disassembling of executables. The two prevalent disassembly techniques are recursive traversal and linear sweep, both of which rest on unsound heuristics and patterns. Indirect jumps and procedure calls at the executable level complicate correct disassembly. The consequence is that standard tools such as IDAPro cope well with compiler-generated code but deliver insufficient or even incorrect results for hand-written assembler code, which is frequently found in malware.

Our approach combines disassembling and static program analysis. We infer information about registers and memory locations in order to correctly resolve the target addresses of indirect jumps and indirect procedure calls. In particular, the C construct of switch statements is often translated into an array of jump targets combined with an indirect jump to an address read from that array. It follows that control flow reconstruction and an analysis of the value ranges of the variables must be closely intertwined in order to guarantee a sound overapproximation of the control flow graph. For assembler code generated with a low optimisation level (e.g. O0), data is kept in memory. This requires identifying memory locations in order to arrive at a meaningful control flow graph. To this end we have developed a fast interprocedural analysis of two-variable equalities, with which we infer candidates for local and global variables as well as array index expressions. At the assembler level, local variables are translated into constant stack pointer offsets, whereas global variables are kept in the global data section of the executable and thus are translated into absolute addresses. Since indirect addressing is used at the assembler level, no syntactic patterns can be applied to recognise explicit memory addresses. Instead, we develop an analysis that infers linear equalities between registers. Our fast equality analysis improves the complexity, compared to an analysis exploiting full linear algebra, by a factor of k^4. This results in a complexity of O(n · k^4), where k is the number of program variables and n describes the program size.

A further complication at the assembler level is that both the symbol table and the debugging information may be missing from the executable. Thus an assembly analysis must first infer procedure boundaries. Although many processor architectures provide dedicated assembler instructions for procedure calls and returns, these are often simulated by stack operations, in particular in malware. A precise analysis of the stack is therefore required in order to detect possibly malicious behaviour, such as overwriting the return address. A further challenge of an assembly analysis is to identify the arguments of procedures. Although some architectures provide dedicated registers here as well, arguments are passed via the stack, for example when there are too few registers to hold all procedure arguments. This requires an analysis that determines the stack areas of the calling procedure which are used for parameter passing and for caching the old value of the stack pointer before a procedure call. With such an analysis it can be verified that these organisational stack cells (which contain, among other things, the old value of the stack pointer or the return address) are not overwritten by the callee; if they are, the offending memory write access can be localised. This side-effect information of a procedure contributes to further refining a sound overapproximation of the control flow graph. Our side-effect analysis describes the modifying potential of each procedure by recording all memory accesses performed relative to the parameter registers. When such a procedure effect is embedded into an arbitrary intraprocedural analysis, it can be verified that the organisational stack cells are not overwritten and that the return address is correctly restored at procedure exit. This also allows a classification of procedures into side-effect-free and side-effecting ones. Through this check it can additionally be verified whether the executable under analysis adheres to the ABI (Application Binary Interface).

Control flow graphs form the basis of fixpoint-based analyses. A sound overapproximation of the control flow graph of an executable is therefore indispensable for meaningful analysis results. As examples, we present two further assembly analyses. The first analysis determines, with the help of modular arithmetic, whether data is always stored at the beginning of address blocks (alignment) and attempts to sharpen the estimates of the worst-case execution time (WCET). Loops, for example, cause a contiguous memory block to be accessed within the loop body. Static analyses often have difficulty finding exact value ranges for the loop variable, and thus precise bounds for the accessed memory areas. This adversely affects the precision of cache analyses. Our approach is based on modular arithmetic and additionally considers the cache layout and the relation between the loop variable and the accessed memory blocks. Under the assumption that the first memory access in a loop results in a cache miss, cache hit rates can be predicted. First experiments with our implementation underline the significance of our approach: the WCET estimates for industrial embedded systems code are improved significantly. Furthermore, we attempt to recover high-level constructs such as arrays and records from the inferred modular equality relations. Related approaches in this area start with one contiguous local memory area per procedure and attempt, based on the memory access patterns of the program, to partition this memory area into smaller areas. In contrast, our analysis identifies local and global variables for each memory access in the program. Whenever we track too many variables per memory access, we summarise these variables into larger memory areas and attempt, based on the structure of the memory access and the modular equality relations inferred for this program point, to impose structure on these larger memory areas. In this way the individual accesses to the components of a record can be distinguished. The second analysis, which builds on the reconstructed control flow graph, infers linear inequality relations between the program variables. Here we direct our attention to geometric domains such as polyhedra, which enjoy great popularity in the analysis of C programs. Since, however, the convex abstraction through polyhedra is very expensive and the operations on polyhedra are very complex, we use the geometric domain of simplices. A k-dimensional simplex is a polyhedron whose frame representation consists of at most k + 1 elements. This efficient representation of simplices avoids the exponential size of the frame representation of polyhedra. Moreover, we present an analysis that relies only on the frame representation of polyhedra and simplices.

In this thesis we additionally describe, for some of the presented analyses, an experimental evaluation using our implementation, and we demonstrate the significance of the results for assembly analyses of real (industrial) programs.

Acknowledgement

First of all I would like to thank all the colleagues and friends who encouraged and helped me. Especially, I would like to thank Joerg Kreiker, Michael Petter, Jens Schneider, Axel Simon and Vesal Vojdani for proof-reading parts of this thesis. I am very grateful for their valuable comments and for their suggestions that helped me to improve the present work.

Many thanks to Michael Petter for all the interesting discussions we had on different sub- jects, and for coping with my moods at times. It has always been a joy working with him.

During my thesis I had the opportunity to work on the SuReal project. Taking part in this project was a key element to get introduced to the topic of analysing machine code. I want to express my gratitude to all the guys from AbsInt, especially, Christian Ferdinand, Florian Martin, Michael Schmitt and Reinhold Heckmann. I enjoyed all the discussions and interactions I had with them.

Special thanks go to all the students who worked on the VoTUM software, particularly for taking over the implementation of the graphical design and some parts of the analysis algorithms.

Furthermore, I would like to deeply thank my family (Fam. Flexeder, Fam. Leitmann, Fam. Kaiser) for having always supported me.

Finally, I express all my gratitude to my Professor and thesis advisor, Helmut Seidl, for giving me the opportunity to do research under his guidance. This thesis would not have been possible without him.

I would like to thank Andy King, who agreed to be the reviewer of my thesis: I am honoured by that.

Outline

Here, we review and summarise the parts of the thesis. The results of Chapter 2, Chapter 6 and Chapter 8 have been published in the proceedings of international conferences [48, 49, 121].

This thesis is made up of two parts: the first part (Chapter 1 up to Chapter 6) describes a framework for reconstructing a sound overapproximation of the control flow graph (CFG) of an executable; the second part (Chapter 7 and Chapter 8) introduces two additional assembly analyses based on this CFG representation. Below we sketch the content of each chapter.

Chapter 1 presents some background on analysing low-level code. As the translation of program structures depends on the architecture and the Application Binary Interface (ABI) the code is compiled for, we briefly introduce the Executable and Linkable Format (ELF) as well as the Power PC (PPC) architecture. This concrete instance of the underlying architecture and binary format is used in the running example programs throughout this thesis to illustrate the results of our analyses; likewise, our implementation is based on this concrete instance. Note, however, that the analyses presented throughout this thesis are general and not restricted to the PPC and the ELF. The chapter concludes by presenting the assumptions that our assembly analyses rely on and the main contributions of this thesis.

Chapter 2 presents our control flow reconstruction algorithm, which intertwines a static analysis step and a disassembly step. An experimental evaluation underpins the usefulness of our approach. We summarise the formalisation of the representation of an executable as a set of disjoint control flow graphs, which serves as the basis for defining the collecting semantics of the other interprocedural low-level code analyses presented in this thesis. Furthermore, we review the different concrete semantics introduced within this thesis.

In Chapter 3 we present a novel interprocedural analysis of constant variable differences. Applied to assembly, it allows classification of memory locations in the executable, i.e. inference of potential local and global variables in the program. Since, in contrast to an analysis relying on full linear algebra, our algorithm has an improved worst-case complexity, it might be worthwhile to apply this approach to other areas as well, such as register coalescing. We extend this approach in Chapter 4 to general linear two-variable equalities while keeping the same worst-case complexity, O(n · k^4), where k is the number of program variables and n is the program size. This extended analysis is applied to infer array index expressions.

In Chapter 5 we discuss how to combine different abstract domains and present our analyser. We conclude that all the analyses, i.e. control flow reconstruction as well as identifying memory locations and verifying the assumptions for procedure calls, have to be run simultaneously to arrive at a practical framework for analysing real-world executables. Information about memory locations allows more precise control flow reconstruction, primarily when dealing with zero-optimised assembly. Moreover, to arrive at a sound overapproximation of the control flow graph we pursue the strategy of making some assumptions about the executable under analysis. One such assumption is that procedures are side-effect free in the sense that they do not modify stack locations outside their own stack frame. We then verify whether the executable under analysis adheres to these assumptions: only if this is the case are the results of our analyses sound and meaningful. Therefore, in Chapter 6, we present a side-effect analysis inferring the modifying potential of every procedure. We describe the modifying potential of a procedure as the set of all parameter register-relative write accesses occurring within it.
This approach enables us to verify that the executable adheres to the calling conventions, such as decrementing the stack pointer at procedure entry by the same amount by which it is incremented at procedure exit, or correctly saving and restoring the return address before using it to return from a procedure.

The second part of the thesis consists of Chapter 7 and Chapter 8, where we present two further analyses which provide alignment information as well as linear inequality relations between program variables.

Chapter 7 explains the effect of an analysis of modular arithmetic when applied to assembly. In addition to stating alignment properties, it allows, although in a quite restricted manner, the inference of structural information of the program, such as arrays.

Chapter 8 introduces the new numerical abstract domain of simplices, a subclass of convex polyhedra. This domain allows inferring inequality relations between program variables efficiently, by computations on the frame representation of simplices alone. For assembly this means that more precise bounds on memory accesses can be computed. Moreover, we discuss the analysis working on the frame representation of simplices and polyhedra only.

Finally, in Chapter 9 we conclude and discuss perspectives for future work in the area of low-level code analysis.

Chapter 1

Analysis of Low-Level Code

In this chapter we discuss the motivation for low-level code analysis and contrast source code and binary analysis. We present related work organised by the application areas of low-level code analysis. As the translation of program structures depends on the architecture and the Application Binary Interface (ABI) the code is compiled for, we introduce the Executable and Linkable Format (ELF) and the PPC architecture. Observe that the techniques in this thesis are general; this concrete instantiation of the binary format and processor architecture is used by our analyser as well as in the running example programs throughout this thesis. In conclusion we present the assumptions our analyses are based on and the contributions of this thesis.

Overview

In Section 1.1 we discuss the benefits as well as the drawbacks of analysing executables. In Section 1.2 we present related approaches in the area of binary analysis, classifying them by application area; in this context we distinguish the following application areas: detecting malicious code, reverse engineering, and reasoning about safety-critical software. Then we provide a basis for our low-level code analyses by presenting the executable file format ELF in Section 1.3. An executable analysis has to take the underlying architecture into account; we base our analysis on the RISC architecture of the Power PC, which is sketched in Section 1.4. Furthermore, in this section we illustrate some conventions as typically specified in the ABI by processor vendors. The assumptions our analyses are based on are highlighted in Section 1.5, while Section 1.6 summarises the contributions of this thesis.

1.1 Source versus Binary

Conventionally, static analysis techniques focus on analysing the source code. While a machine code analysis is hardware-dependent, a source code analysis provides a more uniform view by abstracting away all the machine-dependent details and thus allows reasoning about a program at a higher level of abstraction. Additionally, the program structure is available to the source code analyser, which can exploit it to yield more accurate results.


In contrast, at the assembler level complex assignments are reduced to three-address form, which eases the specification of program analyses at this level. Furthermore, at the machine level high-level structures such as procedures and local variables are missing; they therefore have to be recovered from the code first. An assembly analyser has to deal with fewer semantic constructs than a source code analyser: at the machine level all the complicated features are handled by the compiler and thus emerge in a very simple and pure form. On the other hand, a source code analysis can take advantage of certain semantic restrictions which are guaranteed by the programming language itself. For instance, a points-to analysis for Java programs does not have to reason about pointer arithmetic, as the type system provides only references to objects, on which no arithmetic is defined. Moreover, several error cases arise that are only visible at the machine code level, such as hardware-dependent errors like stack overflows. A source code analysis makes several assumptions about the underlying code and does not consider all the possible program behaviours allowed by the compiler. For a C code analysis one such assumption is, for instance, that pointer arithmetic produces only legal addresses, which point to the beginning of a procedure or denote the start addresses of data. For these reasons, in the context of auditing code for security vulnerabilities or reasoning about safety-critical embedded systems software, the examination of machine code is commonplace, due to the following constraints:

Compiler Bugs

The examination of machine code is customary when auditing code for security vulnerabilities. Such security vulnerabilities may arise from platform-specific features and idiosyncrasies of the optimiser and the compiler, e.g. memory layout or register usage. The term WYSINWYX (What You See Is Not What You eXecute) was coined by Balakrishnan et al. [10] to refer to the fact that the execution behaviour of a machine program may differ substantially from that of its corresponding source code due to compilation, link-time transformations, and optimisation steps. It is only at the level of executables that all errors introduced by programmers or even by compilers can be identified. It is not uncommon for compilers to emit incorrect code when inserting instrumentation code or performing optimisations. Although an analysis of machine code cannot detect such compiler bugs in general, it can be used to prove (safety) properties of the executable. Consider the following example, where the compiler introduces a security vulnerability during optimisation.

Example 1.1 (Linux Kernel Vulnerability). Only recently a Linux kernel vulnerability was detected whose code looks as follows:

1  static unsigned int tun_chr_poll(struct file *file, poll_table *wait){
2      struct tun_file *tfile = file->private_data;
3      struct tun_struct *tun = __tun_get(tfile);
4      struct sock *sk = tun->sk;
5      unsigned int mask = 0;
6
7      if (!tun)
8          return POLLERR;
9
10     ....
11 }

In this code the pointer dereference of variable tun (cf. line 4) takes place prior to the check for NULL (cf. line 7). Usually, a NULL-pointer dereference causes a kernel oops, i.e. a deviation from the correct behaviour of the Linux kernel, which is supposed to kill the driver module while the core kernel continues to run. However, in this scenario NULL can be a valid pointer address (i.e. the bottom of the virtual address space). The problem is that during optimisation the compiler (gcc) removes the if-block (cf. line 7) which checks variable tun for NULL: the compiler notices that tun has already been dereferenced (cf. line 4) and consequently assumes that, if tun were NULL at line 7, the earlier access at line 4 would already have failed. This optimisation introduces a security problem, since the mandatory fall-back test for the NULL tun-pointer (cf. line 7) is removed. Hence, a user-space process can map the page at address zero and thus prevent the kernel oops. The consequence is that the kernel reads data from address 0x00000000, which an attacker may fill with his own code, thereby gaining control over the kernel. Auditing this code at the source level, where the check of tun against NULL is still present, will fail to find this vulnerability. This bug affects Linux kernel 2.6.30 and can be found in drivers/net/tun.c [127]. A thorough description of this kernel exploit can be found in [30].
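The pattern can be reproduced in a self-contained toy program; the following sketch is illustrative and is neither the actual kernel code nor actual gcc output.

#include <stdio.h>

struct dev { int status; };

/* Toy reproduction of the pattern: the field read on the line marked (A)
 * happens before the NULL check on the line marked (B). An optimising
 * compiler may conclude from (A) that d cannot be NULL and delete (B),
 * so the "safe" source-level check never reaches the executable. */
int poll_status(struct dev *d) {
    int s = d->status;   /* (A) dereference before the check */
    if (!d)              /* (B) check a compiler may remove  */
        return -1;
    return s;
}

int main(void) {
    struct dev d = { 42 };
    printf("%d\n", poll_status(&d)); /* prints 42 */
    return 0;
}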

Closed Source

Another aspect of analysing low-level code concerns safety. Because customers often do not fully trust the persons designing the program analysis, they frequently provide only the executable. For instance, avionics companies only provide small fragments of their whole safety-critical software to the program analyser, e.g. for validating the timing behaviour of their software. Moreover, to provide as little information as possible, so that copy protection mechanisms or particular algorithms cannot be dissected, the binary is provided in stripped form. This means that the symbol table and debugging information, which would provide additional information about the executable such as function boundaries, are missing. Also, out-sourced software development as well as the use of commercial binary libraries precludes an analysis of the source code. In all these cases the source code may not be available, and consequently a source code analysis has to rely on possibly unsound models of external code.


Mixed Languages

An assembly analysis allows verifying arbitrary machine code written in different high-level programming languages and produced by any compiler. Typically, larger software projects are written in different programming languages, and most source code analyses cannot cope with code written in multiple languages or even in different versions of the same language. The .NET framework [99], however, supports several languages and their interoperability. It produces executables containing CIL (Common Intermediate Language) code rather than platform-specific machine code; CIL is a CPU- and platform-independent instruction set which can be executed in any environment conforming to the CLI (Common Language Infrastructure). Native code is generated only at runtime by the JIT compiler, and hence this machine code cannot be analysed at compile time. In contrast to code analyses for interpreted languages running on a virtual machine, such as Java or .NET, we focus on binary analysis of compiled executables, e.g. running on PPC or x86. For the latter, the binary provides the only source for an analysis of code based on multi-paradigm programming; it is only at this level that an analysis which verifies the correctness of data marshalling across language boundaries can be implemented. Binary analysis also simplifies the analysis of programs written in different languages, by breaking down complex expressions and structures into very simple forms, such as SSA form. However, this might be a drawback for concurrency analysis of low-level code, where the high-level concepts of synchronisation and locking have to be recovered first.
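To picture this decomposition, consider a hedged sketch: the function compound_tac mimics, in C itself, the shape of the three-address code a compiler would produce for compound_src (the temporary names t0, t1, t2 are illustrative, not compiler output).

#include <stdio.h>

/* One complex assignment at the source level. */
int compound_src(int a, int b, int c, int d) {
    return a * b + c * d;
}

/* The same computation in three-address style: each statement has at
 * most one operator, mirroring what individual machine instructions
 * compute. SSA form additionally guarantees that each temporary is
 * assigned exactly once, as here. */
int compound_tac(int a, int b, int c, int d) {
    int t0 = a * b;
    int t1 = c * d;
    int t2 = t0 + t1;
    return t2;
}

int main(void) {
    printf("%d %d\n", compound_src(1, 2, 3, 4),
                      compound_tac(1, 2, 3, 4)); /* prints 14 14 */
    return 0;
}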

Complete Semantics

In contrast to some high-level languages, such as C, where some language constructs are compiler-dependent or even undefined according to the language standard, the semantics of a program is fully specified at the assembler level: the effect of every assembler instruction is formally given by the instruction manual of the processor vendor. Consequently, an analysis of low-level code only has to reason about the well-specified behaviour of the program instructions as generated by the compiler. Coping with a whole source language, as e.g. in the case of C++, might be a problem for a source analyser, whereas at the machine level one can exploit the semantics produced by the compiler itself. At the level of machine code the analysis writer has the advantage that the semantics is fixed, whereas language specifications change at regular intervals: a machine code analyser has to deal neither with extensions to the language specification nor with changes to the semantics of some previously undefined behaviour. Therefore, at this stage the implementation is very close to what is actually executed on the machine. Additionally, the analysis of assembly is hardware-dependent, so that one is even able to handle hardware-dependent constructs. In conclusion, an analysis of executables may be able to provide more reliable and accurate information about program behaviour or safety properties than a source code analysis.
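As a concrete illustration of this contrast, consider the following C fragment; the point is that its source-level semantics is undefined, while any particular executable compiled from it performs one fixed, analysable instruction sequence.

#include <limits.h>
#include <stdio.h>

int main(void) {
    int x = INT_MAX;
    /* Signed overflow is undefined behaviour in C: a source-level
     * analysis must treat the result as arbitrary (or reject the
     * program). The compiled executable, in contrast, executes one
     * concrete instruction sequence whose effect is fully specified
     * by the processor manual (typically two's-complement
     * wrap-around on common hardware). */
    int y = x + 1;
    printf("%d\n", y);
    return 0;
}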

Challenges

Even though an analysis of machine code has many advantages, many new challenges arise. The representation of a program as a control flow graph (CFG) provides the basis of many analyses. Therefore, for a given binary the CFG has to be reconstructed first, in order to apply analyses which infer ranges for registers as well as memory locations. In this process the targets of indirect calls and indirect jumps have to be resolved. The identification of procedure boundaries is another challenge in this context: although most architectures provide dedicated assembler instructions for procedure calls and returns, these may be emulated by stack operations, either in malicious code or due to aggressive optimisation by the compiler. Moreover, identifying the arguments of a procedure is another problem that has to be dealt with; in case of register spilling, for instance, some of the procedure arguments are passed via the stack. When reasoning about the effect of procedures, it is important to classify which part of the stack frame belongs to the calling procedure and which part is only used for managing parameters or caching the return address.

Considering zero-optimised assembly, most data is kept in memory rather than in registers. Consequently, it is essential for an assembly analysis to take the stack into account in order to provide accurate results. In order to reason about data, the location as well as the size of variables in memory has to be inferred. Hence, an analysis has to track all the possible addresses a memory access instruction may refer to. This is especially important when dealing with architectures that intermix code and data. Furthermore, indirect addressing is often used, which complicates the task of identifying explicit memory addresses. Another problem one faces at the assembler level is that there is no clear distinction between an integer and an address. Accordingly, types such as arrays in high-level languages have to be recovered first. Data alignment complicates this task: in order to increase the performance of the system, data is often kept in memory at offsets which are multiples of the word size, so padding might be introduced to fill the gap between two data elements. Symbol tables provide information about the locations, sizes and layout of data in memory; stripped binaries, however, lack this information. Hence, analyses have to be developed that infer data alignment.

It seems to be agreed that assumptions have to be made when analysing assembly in order to arrive at meaningful analysis results at all. Some executable analysis frameworks (e.g. aiT [1]) restrict their program class to safety-critical code which adheres to special standards, such as the DO-178B software standard [102], which excludes, for instance, the use of dynamic memory. When analysing arbitrary executables, even malicious ones, adherence to these conventions cannot be relied on. Therefore, analyses must be able to verify adherence to the ABI of the underlying architecture, as well as to the assumptions upon which a low-level code analysis is based, or to detect possible violations. An example of such a convention is that in case of a procedure call the return address is saved on top of the stack at procedure entry, that this address is not modified across procedure calls, and that it is finally used as the return address when the procedure terminates its execution.
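The padding mentioned above can be made explicit at the source level. The following is a hedged illustration: the concrete sizes assume 4-byte int and natural alignment, as is typical for a 32-bit ABI, and may differ on other platforms.

#include <stddef.h>
#include <stdio.h>

struct gap {
    char tag;    /* 1 byte at offset 0                         */
                 /* typically 3 bytes of padding follow, so    */
                 /* that count is 4-byte aligned               */
    int  count;  /* 4 bytes, usually at offset 4               */
};

int main(void) {
    /* Without symbol-table information, an analysis only observes
     * accesses at offsets 0 and 4 and must infer the padding in
     * between from alignment reasoning. */
    printf("sizeof=%zu offset(count)=%zu\n",
           sizeof(struct gap), offsetof(struct gap, count));
    return 0;
}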

In summary, new and technically difficult problems arise at the assembler level which a low-level code analysis has to deal with. Hence, there is demand for new analyses that cannot be performed at the source code level:

• inference of the memory (stack and heap) layout,
• inference of data alignment, and
• inference of the worst-case execution time,

as well as for analyses that need not be performed at the source code level:

• control flow reconstruction and
• reconstruction of data structures.

The main approaches in the area of statically analysing executables that we are aware of are presented in the following section. We divide these approaches into different application areas: disassembling, reverse engineering, detecting malicious code and analysing safety-critical code. Although some of these areas overlap, such as detecting malicious code and reverse engineering, in each field of interest different novel approaches have been invented.

1.2 Application Areas

Disassembling

As a basis for a static analysis of assembly, the machine code instructions have to be disassembled first. Debray et al. [120] discuss the two main disassembly techniques, linear sweep and recursive traversal. In short, a linear sweep disassembler decodes the byte sequence sequentially, starting at the entry point of the executable; here there is the danger of misinterpreting data embedded in the code, which is erroneously disassembled into instructions. A recursive traversal disassembler follows the control flow in the executable and thus continues at the successor addresses of jump and procedure call instructions; in this case there is the difficulty of identifying the targets of indirect call and jump instructions. Since common disassembly approaches may produce incorrect results due to indirect jumps and indirect calls, the mixture of executable code and data, and variable-length instruction sets, Debray et al. propose a hybrid approach which interleaves linear sweep and recursive traversal [120] in order to indicate when the produced disassembly may be incorrect. Their hybrid disassembly algorithm is based on the observation that if both linear sweep and recursive traversal agree, the disassembly is likely to be correct.
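For a fixed-width instruction set, a linear sweep is essentially the following loop. This is a minimal sketch over a hypothetical 4-byte big-endian encoding; decode() is a stand-in for a real instruction decoder, not an actual disassembler API.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoder: returns a printable mnemonic for a 4-byte word.
 * A real disassembler would consult the processor's instruction tables;
 * 0x4E800020 happens to be the PPC return instruction blr. */
static const char *decode(uint32_t word) {
    return (word == 0x4E800020u) ? "blr" : "insn";
}

/* Linear sweep over a fixed-width (4-byte) instruction stream: decode
 * every word in order, with no notion of control flow. Data embedded
 * between instructions is silently misread as code. */
static void linear_sweep(const uint8_t *buf, size_t len, uint32_t base) {
    for (size_t off = 0; off + 4 <= len; off += 4) {
        uint32_t w = (uint32_t)buf[off] << 24 | (uint32_t)buf[off + 1] << 16
                   | (uint32_t)buf[off + 2] << 8 | (uint32_t)buf[off + 3];
        printf("0x%08x: %s\n", (unsigned)(base + off), decode(w));
    }
}

int main(void) {
    const uint8_t text[] = { 0x60, 0x00, 0x00, 0x00,   /* some instruction */
                             0x4E, 0x80, 0x00, 0x20 }; /* blr              */
    linear_sweep(text, sizeof text, 0x10000000u);
    return 0;
}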

In practice these heuristics-based approaches suffice to derive a correct disassembly only as long as no techniques undermining their assumptions are applied. For instance, it is a common technique to fool a recursive traversal as well as a linear sweep disassembler by control flow obfuscation. This technique causes the disassembler to assume that control is transferred to certain locations to which, at runtime, control will in fact never be transferred. This can be achieved by inserting junk bytes at unreachable locations or by inserting another level of indirection for call instructions. To overcome these problems in the disassembly process, statistical techniques or speculative disassembly, e.g. [25, 98], are often applied to increase the probability of producing a correct disassembly.

Debray et al. [78] address the topic of binary rewriting and in this context attempt to analyse x86 executables. Most notably, the way in which programs manipulate the runtime stack is investigated. In their analysis they keep track of the operations manipulating both the stack pointer and the frame pointer to estimate the effect of procedure calls on the runtime stack. They rely on usage patterns, i.e. instruction sequences that constitute function prologue and epilogue code, in order to identify well-behaved functions with respect to stack usage. A function is classified as well-behaved iff its stack height at function exit is the same as it was on function entry.
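This well-behavedness criterion can be pictured as an abstract interpretation that tracks only the net stack-pointer delta. The following is a minimal sketch over a toy instruction representation; the two-kind instruction model is an assumption for illustration only, not the analysis of [78].

#include <stdio.h>

/* Toy model: each instruction either adjusts the stack pointer by a
 * known constant, or leaves it unchanged. */
enum kind { SP_ADD, OTHER };
struct insn { enum kind k; int delta; };

/* Track the net stack-pointer offset across a straight-line body.
 * A function is "well-behaved" if the offset at exit is 0, i.e. the
 * epilogue releases exactly what the prologue allocated. */
static int well_behaved(const struct insn *body, size_t n) {
    int sp_offset = 0;
    for (size_t i = 0; i < n; i++)
        if (body[i].k == SP_ADD)
            sp_offset += body[i].delta;
    return sp_offset == 0;
}

int main(void) {
    struct insn f[] = { { SP_ADD, -32 },   /* prologue: allocate frame */
                        { OTHER,    0 },
                        { SP_ADD, +32 } }; /* epilogue: release frame  */
    printf("%s\n", well_behaved(f, 3) ? "well-behaved" : "suspicious");
    return 0;
}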

Reverse Engineering

Analysing executable code is of interest in the area of reverse engineering, to understand the structure and behaviour of a given piece of machine code. Such information provides the basis for binary translation, profiling, debugging, and security auditing. For instance, a programmer might use Commercial Off-The-Shelf (COTS) components or binary libraries in the software development cycle; in this scenario, information about the program behaviour and its content is essential in order to ensure that the given code does not perform malicious actions. Usually, in all these cases the source code is not available, lost, or not accessible. Even more, reverse engineering may aim at decompilation, i.e. reconstructing the source code from the compiled machine code.

Cifuentes et al. [26] propose a method to recover high-level control flow structures and simple data types (e.g. loops and C records) from assembly. They apply an intraprocedural slicing algorithm to executable code, where they determine all those instructions affecting indirect accesses to registers or indexed jumps. Since the stack is completely ignored in their approach, they recover useful flow information only when instructions operate on registers alone; for memory access instructions they use heuristics to at least determine whether memory operands are aliases of one another, rendering their approach unsound. In [24] they address the recovery of jump tables and their target addresses. Primarily, they focus on decompilation, i.e. loading the executable into memory, disassembling it, analysing the resulting assembly in order to construct a low-level intermediate representation, and finally generating the target high-level language program. However, their decompilation approach is only semi-automatic, relying on heuristics as well as compiler and library signatures to ease disassembling, in order to get rid of library routines and compiler start-up code. Cifuentes and Van Emmerik [23] use decompilation techniques with the goal of translating binaries in order to migrate software from one architecture to another. This approach may also be used for debugging and for verifying the correctness of a given piece of executable code.

Christodorescu et al. [22] suggest a static analysis to recover possible string values in x86 binaries. When identifying string variables they draw upon the assumptions that strings are represented using null-terminated encoding and that all operations performed on

strings are accomplished by special string functions. Their approach is extended by an alias analysis between registers and string variables to determine the set of possible string variables modified by a string operation.

Reps et al. [7] describe a static analysis approach to recover good approximations of the variables and heap-allocated objects of an x86 executable. As stated in [9], their primary objective is to develop bug-detection and security vulnerability analyses working on stripped executables [10]. In their tool CodeSurfer/x86 [110, 54] they implement several static analysis algorithms whose combination allows the recovery of information about the contents of memory locations and the way they are manipulated by the executable. The approach of Reps et al. is based on a symbolic memory representation: they use intervals in combination with congruences to identify memory locations and keep track of the values of registers and memory locations. A unification-based flow-insensitive algorithm [107] recovers information about the types and structure of variables. These two analyses are run in an iterative strategy to identify more precise abstract locations. They analyse device driver executables to verify whether the stripped executable conforms to a standard compilation model, i.e. has procedures, maintains a stack frame, etc. Their tool is probably the most sophisticated approach in the area of executable analysis.

In the area of control flow reconstruction, Veith and Kinder [66] take the approach of interweaving disassembly and static data flow analysis. Only recently they extended their framework to verify whether the Windows device driver executable under analysis conforms to the API specification [67]. Besides using harness criteria, they make several assumptions about the executable under analysis which, however, are neither clearly specified nor verified. Moreover, their analyses can only be applied to small executables consisting of at most 3000 assembler instructions, and their approach consequently cannot deal with common desktop software. Although providing standard analyses in their tool, they leave open which basic analyses have to be performed to achieve meaningful analysis results.

Mycroft presents a semantics-based approach to decompilation [93]. Using type inference techniques, he suggests a unification-based algorithm to reconstruct C structure declarations from their usage in assembly. Whenever multiple type reconstructions arise from a single usage pattern, disjunctive constraints are used to resolve the ambiguity, and heuristics are applied to select between equally suitable C code. Similar to the type-based decompilation work of Christodorescu et al. [22], which only focuses on the high-level types of strings and pointers to strings, Mycroft associates bit-vector types with the operands of machine instructions. Katsumata and Ohori [65] sketch a proof-directed decompilation technique for Java byte code based on the Curry-Howard isomorphism for low-level code: they present a decompilation algorithm for recursive functions based on ideas from type theory. In [94] the authors compare type-based and proof-directed decompilation.

In the context of program proving, Myreen presents in his thesis [95] a semantics-based control flow reconstruction via a translation into tail-recursive functions. This approach reduces proving properties of a program to proving properties of recursive functions.
Then, the Hoare proof rules allow local reasoning about a given piece of machine code. Myreen and Gordon [96] describe an approach to transform programs into a representation as sets of mutually recursive HOL functions. Moreover, they describe a technique for machine-code verification and proof reuse between different languages, based on proof-producing decompilation [97].

Detecting Malicious Code

Malicious code detection is another application of the static analysis of low-level code. Common systems to detect malicious code are based on syntactic signatures specifying a malicious instruction sequence. However, these pattern-based approaches can easily be circumvented by simple code transformations. Recently, more powerful techniques based on semantic signatures and static analysis have been employed. Generally, such techniques work with abstract models that describe the malicious behaviour and check a given piece of binary against these templates. In the field of malicious code detection the analysis is mostly complicated by code obfuscation techniques, which are employed to hinder reverse engineering or the detection of malicious behaviour.

Giacobazzi et al. [104] address the problem of automatically detecting whether a given code fragment is an evolved variant of some self-modifying malware. They present a semantics-based approach that overapproximates all possible code evolutions of a given piece of metamorphic code; the defining property of metamorphic code is that it is able to change its appearance without changing its functionality. Their technique does not rely on knowledge of a specific obfuscation technique, but analyses the code to produce a compact and sound representation of all possible mutations of a given piece of metamorphic code.

Lakhotia and Kumar [72] present an abstraction of the stack, based on stack-related instructions, to statically detect obfuscated calls in binaries. For monitoring stack activity they introduce the notion of abstract stack graphs, which associate each element on the stack with the instruction pushing it. A procedure call is flagged as malicious whenever a return statement is detected for which the address at the top of the stack was not pushed by the corresponding call instruction. Venable et al. [134] combine this technique with the value-set analysis of Reps et al. [6] to precisely track stack manipulations in order to identify obfuscated calls in binaries. This approach is extended in [71] to a context-sensitive analysis which tracks the state of the stack at any instruction in order to detect obfuscated call and return instructions in binaries.
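The flagging rule can be sketched with a toy shadow stack in the spirit of abstract stack graphs; the three operations below model call, push and return abstractly and are illustrative only, not the analysis of [72].

#include <stdio.h>

/* Toy shadow stack: each pushed cell remembers the kind of instruction
 * that pushed it. */
enum pusher { BY_CALL, BY_PUSH };

#define MAX 64
static enum pusher shadow[MAX];
static int top = 0;

static void do_call(void) { shadow[top++] = BY_CALL; } /* pushes return addr */
static void do_push(void) { shadow[top++] = BY_PUSH; } /* ordinary push      */

/* A return is suspicious if the cell it consumes was not pushed by a
 * call instruction, i.e. the return address was crafted manually. */
static void do_ret(void) {
    enum pusher p = shadow[--top];
    printf(p == BY_CALL ? "ret: ok\n" : "ret: obfuscated call?\n");
}

int main(void) {
    do_call(); do_ret();  /* normal call/return pairing      */
    do_push(); do_ret();  /* return to a hand-pushed address */
    return 0;
}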

Via unification between program variables and malware variables they verify whether a given program implements the template behaviour. The binary analysis framework BitBlaze [126] also targets malicious code. Song et al. [126] propose a combination of static and dynamic analysis techniques. While the static part is similar to the CodeSurfer/x86 architecture [10], for the dynamic part they propose several dynamic techniques to handle the different forms of malware [100]. BitBlaze is primarily applied to trace the path of, e.g., a memory address and to track every instruction the program executes along this path, in order to provide hints for potential security flaws.

Analysing Safety-Critical Code

Nowadays, embedded systems are increasingly applied in daily life, e.g. in aeroplanes and home appliances, and more and more functionality is implemented in software rather than hardware. Hence, especially embedded systems software has to be validated extensively to exclude bugs and verify correct functionality, a crucial requirement for the reliability of safety- and security-critical systems. The famous example of the Ariane 5 rocket illustrates that software faults may cause severe damage to human lives as well as high financial loss. In order to identify and exclude such situations, full information about the behaviour of an assembly is required. When developing embedded systems applications it is necessary to check whether the output of a compiler complies with some predetermined safety standard. As an example, consider the software standard DO-178B [102], which is pervasive in the development of aviation and defense software and serves to ensure a certain safety level. The requirements this standard imposes on the produced C code imply the following characteristics:

• no heap usage
• no pointers/pointer arithmetic
• no recursion or unbounded loops
• no local arrays

For instance, Venkitaraman et al. [135] perform static analysis of assembly from embedded systems software in order to automatically check whether certain coding standards have been followed. The aim of the approach of Ferdinand et al. [45] is to obtain reliable bounds on the maximal time it takes for a certain real-time application to execute on a specific system (WCET). To determine tight upper bounds for the WCET of a program, Ferdinand [60] and Ferdinand and Wilhelm [46] describe an approach to predict the cache behaviour of executables for real-time systems software. Such a timing analysis is key to verifying compliance with the timing constraints of real-time systems. The authors perform an interval analysis on the assembly to compute value ranges for processor registers and address ranges for memory-accessing instructions for every program point and execution context. However, in their approach the user has to provide details about the underlying program, e.g. the initial value of the stack pointer or address ranges for those accesses which cannot be determined exactly, in order to obtain a result at all or to improve the precision of the obtained results [60]. They take

the call-string approach [122] and perform unrolling of loops and recursive functions. Additionally, a cache analysis is performed which classifies memory references as cache hits or misses [46]. This information is then used in the pipeline analysis [118] in order to infer bounds on the execution times of instruction sequences. These safe estimates of the single WCETs can be used to develop a suitable scheduling scheme for the individual tasks. Kruegel et al. [70] statically detect kernel-level root-kits by deriving instruction sequence patterns that have the same execution semantics. According to [70], a root-kit is a piece of software that intruders use in order to hide their presence from legitimate users of a compromised computer. Initially, the undesirable behaviour is specified in order to construct a behavioural specification for a whole class of malicious actions. Then, a static analysis performing symbolic execution is run on the binaries to identify the presence of root-kit functionality. Schlich [117] addresses the subject of formally verifying micro-controller software used in embedded systems in order to guarantee flawless functionality. His model checker [mc]square allows verifying invariants in micro-controller assembly. Verifying the correctness of micro-controller software necessitates considering bit-wise instructions. Hence, Brauer et al. [18] present a relational binary-code semantics computing program invariants in terms of bit-level congruences. They are interested in inferring bit-level invariants determining ranges for indirect memory accesses. King and Søndergaard [69] focus on deriving invariants from binaries to support security engineers. Bit-level operations are modelled with matrices of linear congruence equations, whose composition allows reasoning about program properties in support of the work of a security auditor. Brauer and King [17] describe the semantics of a sequence of assembler instructions as a CNF formula, which can then be abstracted to relate the output ranges to the input ranges. This enables the extraction of a transfer function for such an instruction sequence. The application of such a transfer function can then be evaluated by solving linear programming problems.

Since an executable analysis has to take the underlying architecture into account, we rely on the ELF binary format [133] as produced by gcc on Linux for the PowerPC (PPC) architecture [63, 50, 12]. Throughout this thesis all the running example programs as well as the concrete semantics of our analyses refer to the PPC architecture and ELF binaries. Observe, however, that all the techniques in this thesis are described in a general form such that they can easily be adapted to other architectures as well.

1.3 Executable and Linkable Format (ELF)

Usually, an ELF binary can be viewed as a set of sections, interpreted by the (runtime) linker, as well as a set of segments, interpreted by the loader [76]. There are two main object code models available, whose characteristics are important for a precise assembly analysis and thus are described in the following.

1. Absolute Code (static executables): In this kind of code all instructions have absolute addresses and thus the program must be loaded at a specific address (however, independent of the mapping into the virtual address space) in order to be executed properly. In a static executable all the library functions are put into the executable by the linker after compilation, such that the loader does not have to load them anymore. This means that all relocation has already been done and all symbols have already been resolved. (Relocation refers to the process of replacing symbolic references to (library) functions with actual usable addresses in memory before running the program.) Consequently, a static executable does not need a symbol table (which lists all the symbols of the executable, their names and addresses). An executable is called stripped when it is stripped of its symbol table and debugging information (i.e. information about the memory locations, sizes and layout of functions). Typically the executable to analyse is given in stripped form: system libraries are stripped in order to reduce disk space requirements, and commercial vendors do not want to expose the internals of their software.

2. Position-independent Code (PIC) (shared libraries): In PIC, instructions only hold relative addresses and thus the code is not tied to a specific load address and can be executed properly at various memory positions. Typically PIC is used in shared libraries, where all the library symbols are only resolvable at runtime. We distinguish between static shared libraries and dynamically linked libraries. In the case of a static shared library, symbols referenced in the program are bound to specific addresses at link time, while the library code is only bound to the executable at runtime. In contrast, in the case of a dynamically linked library, both symbols and library code are not bound to actual addresses before the program that uses the symbols starts running. Resolving the addresses of the called library procedures is delayed until the first time they are called.

Figure 1.1 illustrates the particular sections of an ELF binary which are relevant for our assembly analyses. Note that some of the section names occur only in statically linked executables, while others only appear in dynamically linked executables. Typically, an ELF binary is divided into a code segment and a data segment. The code (or text) segment contains the executable code, i.e. the actual instructions, while the data segment contains variables and data structures.

[Figure 1.1: Executable and Linkable Format (ELF) — the sections of an ELF binary, divided into the read-only code segment (header, .init, .text, .fini, .rodata) and the data segment (.ctors, .dtors, .dynsym, .dynamic, .got, .plt, .data, .bss).]

Here, we give a short description of the relevant sections in the executable.

• header: The ELF header is decodable on any machine and contains general information about the executable, e.g. the file type, the byte order and word size of the file, and, for executable files, the entry point.

• .dynsym section: The .dynsym section contains the dynamic symbol table which is provided when loading the executable. Additionally, it contains information about external symbols that are imported to or exported from the executable. The runtime dynamic linker makes use of symbol table information in order to look up and resolve dynamically linked symbols.

• .plt section: The .plt section contains the Procedure Linkage Table (PLT) and is located in the read-write data segment of an ELF binary. The rationale behind the .plt section is to connect external (library) functions with an executable, i.e. to support dynamic linking. The PLT redirects position-independent function calls to absolute locations.

• .text section: The .text section is located in the read-execute code segment of the executable and contains the program code (instructions). Initially, the whole section is mapped into the virtual address space of the process. Then, after the initialisation process is done and the start routines have been performed, execution jumps to the first address of this section and the actual program starts executing.

• .rodata section: This section is used for read-only data.

• .data section: The .data section contains all the writable program data, such as static or global variables in a C program. All the initialised variables are grouped together in this section.

• .got section: The .got section is located in the read-write data segment of an executable and contains a table of absolute addresses, the Global Offset Table (GOT), whose entries may be changed. It is used to hold absolute addresses of static or global data referenced in the executable. The GOT is used to redirect position-independent address calculations to absolute locations.

• .dynamic section: This section contains all the dynamic header information which is used for the dynamic linking process. The .dynamic section only occurs in dynamically linked executables and shared objects, and its dynamic linking information is only loaded at runtime.

• .bss section: This section is used for uninitialised data. Uninitialised data take up no space in the file; instead, this section specifies how much space is needed for uninitialised variables. It is filled with zeros at load time, before the user code starts executing. A short C example after this list illustrates the typical placement of program variables in these sections.
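To make the placement concrete, the following C fragment sketches where such program objects typically end up when compiled with gcc; the placement indicated in the comments is the usual one, but the exact layout may vary with compiler version and options.

#include <stddef.h>

int initialised_global = 42;    /* .data: initialised, writable data           */
int uninitialised_global;       /* .bss: occupies no file space, zero-filled   */
                                /* at load time                                */
const int limits[2] = {0, 5};   /* .rodata: read-only data                     */

int main(void) {
    int local = 7;              /* stack: addressed relative to r1 (the SP)    */
    return local + initialised_global + limits[1];
}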

The .bss section is followed by the heap, which is used for allocating and de-allocating dynamic memory (via malloc and free in C). The entry point of a binary is given by the start address of the .text section, where the function _start is called first. Function _start acts as a wrapper for calling the libc function __libc_start_main, which itself runs the main function of the program and finally the clean-up code. Note that an ELF binary also contains .init and .fini sections located in the read-only text segment. The linker inserts code at the binary entry point to call the code in the .init section before the actual main program is called, and a call to the code in the .fini section, which has to be executed after the main program terminates. _start contains a call into the PLT which is in charge of initialising the process for execution, for instance by calling the general initialisation function _init (for initialising (static) global data) and registering the clean-up function of the dynamic linker, _fini, e.g. for calling any destructors, via a call to atexit. Then, the __do_global_ctors_aux function dispatches any constructors found in the .ctors section; analogously, the __do_global_dtors_aux function dispatches any destructors given in the .dtors section. gcc maintains lists of pointers to the initialisation as well as the termination functions of a program in the .ctors (constructors) and .dtors (destructors) sections. When a program is executed, the constructors registered in the .ctors section are called first, then the main program is run, and finally the destructors registered in the .dtors section are called. These sections always appear in the executable (and are thus mapped into memory), regardless of whether constructor or destructor code (cf. global and static variables in C++ code) is specified by the user.
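As an illustration of user code that populates these lists, gcc offers constructor and destructor attributes; with the kind of tool chain considered here the corresponding function pointers end up in the .ctors and .dtors sections (more recent tool chains may use .init_array and .fini_array instead).

#include <stdio.h>

__attribute__((constructor))
static void before_main(void) { puts("called before main"); }

__attribute__((destructor))
static void after_main(void) { puts("called after main"); }

int main(void) { puts("main"); return 0; }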

Next, we consider how dynamically linked code shows up in ELF binaries.

ELF Dynamic Linking

Dynamically linked code and shared libraries in ELF use lazy binding of procedure addresses, i.e. the address of a procedure is not bound until the first time the procedure is called. In ELF this kind of code is realised using PIC. The rationale behind PIC is to have code that can be executed independently of the actual address it is loaded at and thus does not contain absolute addresses. Consequently, the code works regardless of its concrete memory location at runtime. Addresses of procedures and references to global data cannot be resolved at static link time and consequently have to be resolved by the dynamic loader at runtime. The linker constructs the .got and .plt sections and a stub for each externally called procedure. Whenever an external function is called, control is passed to the .plt section and from there to the corresponding external function. It is up to the PLT to resolve the corresponding function symbol. This is realised via indirect jumps within the trampoline code of the PLT [76]. The term trampoline code refers to a piece of code which program execution jumps into and then leaves by continuing somewhere else. The absolute address of such a dynamically loaded procedure is loaded from a dedicated memory location in the .plt section and is then branched to. If this location is not yet initialised, the trampoline branches to the runtime linker. The first entry in the .plt section serves to start the runtime dynamic linker (RTDL, i.e. ld.so). The RTDL determines the absolute address of the called procedure, thus provides its dynamic address, and updates the corresponding PLT entry. However, the address of this runtime linker is not present in the binary; it is only provided after loading the binary. This means that the first time a (library) function is called there is no resolved entry in the PLT. As soon as this function call is resolved, the entry is generated. After that, whenever the program uses this (library) function, the executable can directly access it via the .plt section, which jumps to the actual address provided by the PLT. Thus, after the first call the cost of using the PLT is only a single extra indirect jump per function call. Observe that the dynamic linker is only called once for each unresolved function call. Example 1.2. PIC By means of concrete PIC assembly we examine how the dynamically resolvable function printf from the GNU C library is referenced (see instruction 0x100004e8 in Figure 1.2). During the linking process the linker generates the PLT call code stub for printf and places it at the end of the .text section. The main program calls printf, which transfers control to the corresponding .plt stub code (printf@plt: 0x10000680 up to 0x1000068c). This code stub printf@plt fetches the address from the .plt entry for function printf (register r11 is loaded with the corresponding PLT entry, i.e. address 0x10000698). The following branch to this PLT entry redirects control to the __glink code stub. __glink serves as an executable trampoline for branching to a procedure whose absolute address is dynamically resolved. All those .plt entries for functions which have not yet been dynamically resolved by the loader are set up by default to redirect into the PLT symbol resolver stub. This allows the dynamic linker to gain control at the first execution of each PLT entry. The symbol resolver stub __glink_PLTresolve calls the _dl_runtime_resolve function of the loader; this is specified by entry GOT[1] (cf. the value at address 0x10010ff8 in the assembly from Figure 1.3 after loading the executable).

100004bc <main>:
...
100004e8: bl 10000680 <printf@plt>
...

10000680 <printf@plt>:
10000680: lis r11,4097
10000684: lwz r11,4104(r11) //M[0x10011008]
10000688: mtctr r11         //ctr := 0x10000698
1000068c: bctr

10000690 <__glink>:
10000690: b 100006a0 <__glink_PLTresolve>
10000694: b 100006a0 <__glink_PLTresolve>
10000698: b 100006a0 <__glink_PLTresolve>
1000069c: nop

100006a0 <__glink_PLTresolve>:
100006a0: lis r12,4097       //r12 := 0x10010000
100006a4: addis r11,r11,-4096 //r11 := 0x1000f000
100006a8: lwz r0,4088(r12)   //M[0x10010ff8]
100006ac: addi r11,r11,-1680 //r11 := 0x1000e970
100006b0: mtctr r0
100006b4: add r0,r11,r11
100006b8: lwz r12,4092(r12)  //r12 := 0x10010ffc
100006bc: add r11,r0,r11
100006c0: bctr
100006c4: nop
100006c8: nop
100006cc: nop
100006d0: nop
100006d4: nop
100006d8: nop
100006dc: nop

Disassembly of section .got:
10010ff0 <_GLOBAL_OFFSET_TABLE_-0x4>:
10010ff0: 00 00 00 00
10010ff4 <_GLOBAL_OFFSET_TABLE_>:
10010ff4: 10 01 0f 18 //.dynamic[0]
10010ff8: 00 00 00 00
10010ffc: 00 00 00 00

Disassembly of section .plt:
10011000 <.plt>:
10011000: 10 00 06 90
10011004: 10 00 06 94
10011008: 10 00 06 98

Contents of section .dynamic:
10010f18: 00000001 00000010 0000000c 10000300 ...

Figure 1.2: PIC before Linking

10000680 <printf@plt>:
10000680: lis r11,4097
10000684: lwz r11,4104(r11)
10000688: mtctr r11
1000068c: bctr

100006a0 <__glink_PLTresolve>:
100006a0: lis r12,4097       //r12 := 0x10010000
100006a4: addis r11,r11,-4096 //r11 := 0x1000f000
100006a8: lwz r0,4088(r12)   //M[0x10010ff8]
100006ac: addi r11,r11,-1680 //r11 := 0x1000e970
100006b0: mtctr r0           //ctr := 0x4801521c
100006b4: add r0,r11,r11     //r0 := 0x58023b8c
100006b8: lwz r12,4092(r12)  //M[0x10010ffc]
100006bc: add r11,r0,r11     //r11 := 0x680324fc
100006c0: bctr
100006c4: nop
100006c8: nop
100006cc: nop
100006d0: nop
100006d4: nop
100006d8: nop
100006dc: nop

Disassembly of section .got:
10010ff0 <_GLOBAL_OFFSET_TABLE_-0x4>:
10010ff0: 00 00 00 00
10010ff4 <_GLOBAL_OFFSET_TABLE_>:
10010ff4: 10 01 0f 18 //.dynamic[0]
10010ff8: 48 01 52 1c
10010ffc: 48 03 06 60

Disassembly of section .plt:
10011000 <.plt>:
10011000: 10 00 06 90 //start of __glink
10011004: 0f e8 d6 80
10011008: 0f ec 66 90

Figure 1.3: PIC after Linking

The entry GOT[0] is initialised by the linker to the link-time address of the .dynamic section of the executable. At load time the dynamic linker/loader (ld.so) automatically places two values into the GOT, which were zero-initialised before. Entry GOT[1] is initialised to the address of _dl_runtime_resolve, while entry GOT[2] contains the address of the symbol resolution function of the dynamic linker. In the example, register r11 is set to the .plt relocation offset, while register r12 is set to the value of entry GOT[2]. The _dl_runtime_resolve function of the loader resolves the function symbol for printf

and modifies the referenced PLT entry for printf (i.e. 0x10011008) with the resolved absolute symbol address of printf (i.e. 0x0fec6690). Then all future invocations of the .plt call code stub for printf transfer control directly to the procedure without invoking the dynamic linker again.
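The lazy-binding mechanism can be mimicked in plain C by a function pointer that initially refers to a resolver stub. This is only an analogy to the PLT behaviour described above, not actual linker code; all names are ours.

#include <stdio.h>

typedef void (*slot_t)(const char *);

static void resolve_then_call(const char *s);   /* forward declaration    */
static slot_t printf_slot = resolve_then_call;  /* unresolved "PLT entry" */

static void real_impl(const char *s) { puts(s); }

static void resolve_then_call(const char *s) {
    printf_slot = real_impl;  /* patch the slot, as _dl_runtime_resolve */
                              /* patches the PLT entry on the first call */
    printf_slot(s);           /* complete the intercepted call          */
}

int main(void) {
    printf_slot("first call: resolved lazily");
    printf_slot("second call: branches directly");
    return 0;
}

As with the real PLT, only the first invocation pays the resolution cost; every later call goes through a single extra indirection.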

This concludes our description of the Executable and Linkable Format. We shall now describe the PPC processor architecture and its Application Binary Interface (ABI).

1.4 Application Binary Interface (ABI)

In this section we present the 32-bit PowerPC (PPC) architecture [63, 50, 12] and the conventions, such as register usage, memory layout, etc., specified in its Application Binary Interface (ABI).

The PowerPC Architecture

The PPC is widely used in embedded systems, especially in controllers in cars, aeronautics or robotics, and in network appliances like (Cisco) routers. It uses a RISC instruction set where memory is accessed only through explicit store and load instructions and all calculations are performed on registers only. Since it is a three-address machine, each instruction may contain at most three operands. Each instruction has a fixed size, namely exactly 32 bits, and thus is word-aligned. The PPC architecture offers 32 general purpose registers, 32 floating point registers and several additional special purpose registers (e.g. the link register lr or the count register ctr). For the rest of this thesis we treat neither floating point registers nor the instructions operating on them. Next, we consider memory access instructions. In our notation a memory access is given by Mm[e], where e is a linear expression and m is the number of bytes of the memory operand. In particular, e is of the form:

e ::= ri | ri + c | ri + rj

with ri, rj being processor registers and c an arbitrary integer constant. The instructions addressing memory can be applied to bytes, half-words, or words (4 bytes). Information about the number of bytes accessed by each instruction can be deduced from the name suffix of the instruction (b, h, w; e.g. stb, sth, stw for writing data to memory, or lbz, lhz, lwz for loading data from memory). Of particular interest are the different addressing modes for memory access instructions in the PPC architecture, which are listed in the following table:

addr. mode   description                              example            our formalisation
(rj)         register indirect                        lwz ri, 0(rj)      ri := M4[rj]
c(rj)        register indirect with immediate index   lwz ri, c(rj)      ri := M4[rj + c]
rj, rk       register indirect with index             lwzux ri, rj, rk   ri := M4[rj + rk]
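For illustration, the three addressing modes and the formalisation Mm[e] could be encoded in an analyser as sketched below; the type and helper names are ours and not part of any existing tool.

#include <stdint.h>

typedef enum { REG, REG_IMM, REG_REG } AddrKind;

typedef struct {
    AddrKind kind;   /* which of the three addressing modes      */
    int      base;   /* register index i of ri                   */
    int      index;  /* register index j of rj (REG_REG only)    */
    int32_t  disp;   /* constant c (REG_IMM only)                */
    int      width;  /* m: number of bytes accessed (1, 2 or 4)  */
} MemOperand;

/* Evaluate the effective address of Mm[e] under a register valuation. */
uint32_t effective_address(const MemOperand *op, const uint32_t regs[32]) {
    switch (op->kind) {
    case REG:     return regs[op->base];
    case REG_IMM: return regs[op->base] + (uint32_t)op->disp;
    case REG_REG: return regs[op->base] + regs[op->index];
    }
    return 0; /* unreachable */
}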

Next, we consider some of the conventions provided by the Application Binary Interface (ABI). The ABI specifies conventions about register usage, stack frame organisation, parameter passing, etc., by describing a low-level interface. This interface specification ensures that code produced by different compilers supporting the ABI is compatible and can thus be combined.

Conventions

According to the PPC ABI [125], at procedure entry local stack space for the procedure is allocated by decrementing the value of the stack pointer by the required number of bytes. By convention register r1 serves as the stack pointer (SP), which always points to the current top of the stack. In the PPC architecture the stack pointer is aligned on a 16-byte boundary in order to allow moving register values to and from the stack more efficiently. The PPC architecture follows the convention that the stack grows downward from numerically higher memory addresses to numerically lower memory addresses. The registers have a global scope valid for all the procedures within the running program (i.e. registers used for parameter passing, for the return value of a function, or for handling local variables, etc.). Registers r3 up to r12 are volatile registers. This means that any procedure can modify them freely without any need to restore their previous values. Consequently, they are assumed to be destroyed across a procedure call, as they need not be saved by the callee. In PPC, registers r14 up to r31 are declared non-volatile (thus they belong to the calling function) and may be used for handling local variables. Consequently, a procedure has to save their values before it changes them and has to restore them again before it returns. The non-volatile registers are thus guaranteed to retain their values across procedure calls. This is only a convention as specified in the ABI and may be violated, e.g., by malware. Parameters are passed between procedures via pre-determined registers (i.e. registers r3 up to r10 for PPC). Depending on the optimisation level, or if there are too many locals or procedure arguments to fit into the available registers, locals and procedure arguments are allocated on the stack. For this, the function must allocate space in its stack frame. Typically, the parameters reside at the top of the stack frame of the caller such that the callee can access them. They are accessible via positive offsets with respect to the stack pointer. In the case of a procedure call, the caller is responsible for pushing its parameters onto the stack and removing them after the callee returns. It is a widely used convention that the values of callee-save registers (stack pointer, link register, etc.) must be saved before and restored after a procedure call. Additionally, in the case of a procedure call, the return address of the caller is saved on top of the stack such that control can be transferred back to the caller when the callee finishes its execution. For an illustration consider Figure 1.4. Saving and restoring local registers, as well as allocation and de-allocation of stack space, is realised by procedure prologue and epilogue code, which is generated by the compiler. The following PPC assembly snippet illustrates such prologue and epilogue code.

[Figure 1.4: Stack Layout. From the SP of the caller downward: the pointer to the previous stack frame, the save area for all non-volatile registers used by the callee, the local stack area, the input and output parameter area, and, at the SP of the callee, the saved link register and the pointer to the previous stack frame; the stack grows downward.]

Example 1.3. Stack Frame

//prologue code:
00: stwu r1,-48(r1)
04: mflr r0
08: stw r0,52(r1)
0C: stw r14,8(r1)
10: stw r15,12(r1)
...//body of procedure f
//epilogue code:
40: or r3,r0,r0
44: lwz r14,8(r1)
48: lwz r15,12(r1)
4C: addi r1,r1,48
50: blr

Instruction 0x00 allocates 48 bytes of memory for executing the function f by decrementing the value of the stack pointer r1. Generally, in the function prologue a stack frame is established and, if necessary, any non-volatile registers used by the function are saved. The stack frame has a designated location that holds a pointer to the previous stack frame: the pointer to the previous stack frame, i.e. the initial value of the stack pointer, is saved on top of the stack (cf. instruction 0x00). According to the processor ABI [125], the first word of the stack frame shall always point to the previously allocated stack frame. If the function calls another function or uses the link register, the link register is saved by the mflr-instruction (move from link register) followed by a store into the link register save area of the caller (cf. instructions 0x04, 0x08). Furthermore, in the function prologue all the non-volatile registers used may be saved (cf. instructions 0x0C, 0x10).

After the function body has been executed, the complementary procedure epilogue writes the return value into register r3; additionally, the registers saved in the prologue code have to be restored (cf. instructions 0x44, 0x48). Then the current stack frame is deallocated (cf. instruction 0x4C), thereby restoring the previous stack frame. Deallocation of a stack frame is accomplished either by loading the initial value of the stack pointer register, which is kept on the top of stack, or by incrementing the stack pointer by the same amount (cf. instruction 0x4C) by which it was decremented in the function prologue (cf. instruction 0x00). The return value is passed to the caller in a designated register, i.e. register r3 for the PPC architecture. Finally the callee returns by transferring control back to the caller via the blr-instruction (branch to link register, instruction 0x50).

If the assembly adheres to the conventions for the stack pointer register according to the processor ABI, the stack level is unchanged after a procedure call. However, it turns out that this assumption is invalid for some compiler schemes that systematically change the current stack level via auxiliary functions. For instance, the PPC ABI [125] suggests non-standard calling conventions for register saving and restoring functions. In this case it is intended that after the execution of these saving and restoring functions the height of the stack frame of the caller is modified.

Example 1.4. Calling Convention Consider the following example of PPC assembly produced by the Greenhills compiler [55]:

_restgpr_14:
00: lwz r14, -72(r11)
04: call _restgpr_15
_restgpr_15:
08: lwz r15, -68(r11)
0C: call _restgpr_16
 .
 .
_restgpr_30:
28: lwz r30, -8(r11)
2C: call _restgpr_31
_restgpr_31:
30: lwz r0, 4(r11)
34: lwz r31, -4(r11)
38: mtspr lr, r0
3C: mr r1, r11
40: blr

_savegpr_14:
60: stw r14, -72(r11)
64: call _savegpr_15
_savegpr_15:
68: stw r15, -68(r11)
6C: call _savegpr_16
 .
 .
_savegpr_30:
78: stw r30, -8(r11)
7C: call _savegpr_31
_savegpr_31:
80: lwz r0, 4(r11)
84: stw r31, -4(r11)
88: mtspr lr, r0
8C: mr r1, r11
90: blr

The corresponding instructions for saving and restoring the callee-save registers r14 up to r31 are organised into chains of system-level routines _savegpr_X and _restgpr_X, which save and restore the registers rX up to r31. In our example, the auxiliary register r11 holds the stack level before saving the locals, extended by the memory required for saving the local registers. It is only in the procedure _savegpr_31 that the stack pointer register r1 is set to the value of r11. Similarly, for restoring the local registers the stack pointer r1 is not updated until the call of

procedure _restgpr_31. The additional memory required is reserved directly before the corresponding auxiliary functions are called. Accordingly, the calling conventions are violated both by the procedures _savegpr_X and by the procedures _restgpr_X.

A concrete example of the use of these saving and restoring routines is provided by the following assembly fragment. Procedure q saves the values of the three registers r29, r30 and r31 via a call to procedure _savegpr_29 and finally restores their values via a call to procedure _restgpr_29.

q:
100: addi r11,r1,-8
104: call _save_29
....
110: addi r11,r1,8
114: call _rest_29

Here we consider dynamic stack modifications. A dynamic modification may be provoked, e.g., by the C function alloca. The stack level is increased by a dynamic reservation of memory within the stack frame of the caller of alloca. The function alloca returns a pointer to the allocated memory region, which is automatically freed when the caller returns. Consider the following example:

Example 1.5. Dynamic Stack Modification

int main(){
  int x,y;
  alloca(x);
  y=5;
}

//int main(){
00: stwu r1,-32(r1)
04: mflr r0
08: stw r31,28(r1)
0C: stw r0,36(r1)
10: mr r31,r1
//alloca(x);
14: lwz r9,12(r31)
18: addi r9,r9,15
1C: addi r0,r9,15
20: rlwinm r0,r0,28,4,31
24: rlwinm r0,r0,4,0,27
28: lwz r9,0(r1)
2C: neg r0,r0
30: stwux r9,r1,r0
//y=5;
34: li r0,5
38: stw r0,8(r31)
//}
3C: lwz r11,0(r1)
40: lwz r0,4(r11)
44: mtlr r0
48: lwz r31,-4(r11)
4C: mr r1,r11
50: blr

Consider instruction stwux r9,r1,r0 (store word with update indexed) at address 0x30, where the value of register r9 is written into the word in memory addressed by the sum of the values of registers r1 and r0. Subsequently, this sum is placed into register r1. Here, the stack pointer is decreased (by the number of bytes given by the value of register r0) and the address of the previous stack frame is now stored at the word addressed by the new value of the stack pointer, i.e. on the top of stack. In this context register r31 serves as a frame pointer, which is established upon entry of the function (cf. instructions 0x08, 0x10) and whose value does not change throughout the whole execution of a procedure. The frame pointer points to the base of the current stack frame. At the time a procedure starts, the old value of the frame pointer of the caller is pushed onto the stack (cf. instruction 0x08) and the frame pointer is loaded from the stack pointer (cf. instruction 0x10). This allows changing the value of the stack pointer while still allowing uniquely addressing the

locals of a procedure at fixed offsets relative to the frame pointer. After a modification of the stack level all local variables are accessed via offsets to the frame pointer (cf. instruction 0x38). The frame pointer is established to locate the locals consistently throughout the stack frame of the function. 

After a dynamic stack modification the initial stack pointer value is saved on the new top of stack (cf. instruction 0x30). At procedure exit, besides deallocating any space reserved for local variables, the old value of the frame pointer r31 has to be restored from a designated stack location (cf. instruction 0x48), thereby removing all the dynamically allocated stack space and the rest of the local stack frame. Finally, we discuss the appearance of local and global variables in assembly:

Example 1.6. Local and Global Variables

int g;
int main(){
  int l;
  g=3;
  l=5;
}

04: lis r9,10
08: li r0,3
0C: stw r0,10244(r9)
10: li r0,5
14: stw r0,8(r1)

In the given PPC assembly the address of the global variable g (of the corresponding C program) is computed as 10 · 2^16 + 10244 and is used in the memory access at instruction 0x0C. Globals are stored in a dedicated section of memory such that they are accessible to all parts of the program. The local variable l from the C program is addressed by r1 + 8 in the corresponding assembly, i.e. relative to the stack pointer r1 with constant offset 8 (instruction 0x14). Locals are stored on the stack and thus are only accessible via the current value of the stack pointer. It is also possible that the address of a local or global variable is computed via several consecutive instructions. Consider, e.g., the memory write at instruction 0x38 in Example 1.5. There, memory is addressed via register r31 with offset 8. Consequently this memory access denotes a local variable with offset −24 to the initial value of the stack pointer.
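Based on this example, a first rough classification of memory accesses, given the inferred base register and offset of an access, might be sketched in C as follows; the types and the constant SP are illustrative only.

#include <stdbool.h>
#include <stdint.h>

typedef enum { LOCAL_VAR, GLOBAL_VAR, UNKNOWN_VAR } VarClass;

enum { SP = 1 };  /* register r1 serves as the stack pointer */

/* Classify an access Mm[base + offset]: stack-pointer-relative accesses
   denote locals, plain constant addresses denote globals; has_base is
   false if the address expression is a constant. */
VarClass classify_access(bool has_base, int base_reg, int32_t offset) {
    (void)offset;
    if (has_base && base_reg == SP) return LOCAL_VAR;
    if (!has_base)                  return GLOBAL_VAR;
    return UNKNOWN_VAR;  /* e.g. frame-pointer relative or computed
                            addresses require further value analysis */
}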

Concerning the appearance of local and global variables in assembly, we note: the addresses of local variables arise as constant offsets to the stack pointer, while global variables are managed in a global data section, i.e. at fixed physical memory addresses. Hence the addresses of globals stay constant for all functions they are used in and appear as absolute addresses. In our formalisation, accesses to local variables arise indirectly as Mm[r1 + c1], and accesses to global variables directly via an absolute address as Mm[c2], where c1 and c2 denote constants. Again m denotes the number of bytes accessed by the instruction. Since addresses may also be computed indirectly over several consecutive instructions, purely syntactic patterns do not suffice to identify global variables (absolute addresses) or local variables (stack pointer offsets). This concludes the description of the PPC architecture and some of the conventions stated in the PPC ABI.

1.5 Assumptions

Before presenting our assembly analyses in detail, we fix the conventions and assumptions our analyses rely on.

• For the moment, we focus on compiler-generated executables. First, we want to get an understanding of and a feeling for the problems encountered when analysing automatically generated code before we approach manually created assembly, as it often occurs in malware. Moreover, to exclude reasoning about the runtime linker, we consider fully statically linked executables. Thus all the information is already provided by the executable, such that no additional assumptions about the correct functionality of the linker or the well-behavedness of dynamically loaded functions are required.

• We exclude the treatment of dynamic memory, i.e. the heap. Our main goal is to determine to what extent assumptions about the executable under analysis have to be made to provide meaningful and preferably sound analysis results, and how much information can be inferred despite neglecting the heap. Our strategy is therefore to provide static analyses which verify and thus (hopefully) discharge our initial assumptions. Only then are our analysis results sound and meaningful. Otherwise, possible violations of conventions can be detected, e.g. that the return address is overwritten.

• For the analyses we assume the existence of a stack pointer register and the notion of procedures, and thus implicitly a memory model that distinguishes between procedure-local and global memory.

1.6 Contributions

We briefly summarise the main contributions of this thesis:

• We present precise and sound interprocedural analyses of executable code, which are practically relevant.
• In contrast to all the existing approaches to executable analysis, we rely on the functional approach only.
• We accurately handle reference parameters, in contrast to all other approaches we are aware of.
• The few assumptions we need are discharged by further analyses.
• We present a fully automatic practical framework for analysing executables that is not restricted to a specific architecture.
• We intertwine static analysis and disassembling for control flow reconstruction, while also precisely handling procedure calls.

• We present an analysis exploiting alignment to improve on WCET estimation and to reconstruct aggregate data structures.
• We introduce new abstract domains, such as fast linear two-variable equalities or the domain of simplices.

Summary

In this chapter we presented the drawbacks as well as the benefits of analysing binaries. We sketched related work organised by application area: reverse engineering, detecting malicious code, and reasoning about safety-critical software. Furthermore, we highlighted the concrete architecture and binary format used for the running examples throughout the thesis. We concluded this chapter by summarising the contributions of this work and by defining the assumptions our analysis framework relies on.

Chapter 2

Control Flow Reconstruction

In this chapter we present an interprocedural algorithm for reconstructing the control flow graph of an executable in the presence of indirect jumps, call and return instructions. In doing so, we do not rely on unsound heuristics but use static analysis techniques. For compiler-generated assembly, indirect jumps primarily originate from high-level switch statements. For these, our methods succeed in resolving indirect jumps with high accuracy. By explicitly handling procedure calls, additional precision is gained at calls to procedures that exit the program.

Introduction

One of the challenges in machine code analysis is reconstructing the control flow graph of each procedure in order to perform static analyses on assembly. This means that function boundaries have to be identified, and indirect branch and indirect call instructions, which take their destination address from a register, have to be resolved. This requires an analysis of the values of registers as well as of memory locations. Many architectures have specific instructions for both local jumps and procedure calls. However, not all occurrences of call instructions semantically denote procedure calls in the sense of a temporary transfer of control to a subroutine. For instance, consider the call to procedure exit in Example 2.1, which never returns control back to the caller. An analysis that assumes that every function returns loses precision by erroneously propagating data flow information to the return point. The program point immediately following such a call should not be influenced by a call to a non-returning function. It is thus essential to deal with procedure calls when reconstructing meaningful control flow graphs. We require a safe approximation of the control flow graph of an executable, i.e., no feasible paths of the actual program are missed in our reconstruction. Thus, our control flow approximation is an upper bound on the set of nodes and edges of the actual control flow graph. Moreover, the high-level construct of switch statements is often translated to indirect jumps [24] at the assembler level, as demonstrated by Example 2.1. These jumps are controlled by a jump table containing relative jump targets for each case statement. In our setting this table is located in the read-only data segment of the executable and thus


its entries are never changed. The target address of such an indirect jump is computed via the following instruction sequence: First, the value range of the index register is restricted via a comparison instruction (cf. instruction 0x08). In Example 2.1, register r0 is restricted to 0 ≤ r0 ≤ 5. The unsigned comparison instruction cmplwi treats its operands as unsigned integers [125], i.e. all the negative numbers are rejected because their two's complement representation is larger than that of any positive number. If the value of register r0 is not inside these bounds, then the default case is executed, which results in a call to the exit function (cf. instruction 0x14). Otherwise, the instructions starting at 0x18 are executed. Here, register r0 serves as an index into the jump table. Before the indirect jump is performed (cf. instruction 0x30), the address offset read from the jump table is added to the table base address (cf. instruction 0x2C).

Example 2.1. Switch Statement in PPC Assembler

int i = read();
switch(i)
{
  case 1: i += 11; break;
  case 2: i += 22; break;
  case 3: f(i); break;
  case 4: i += 44; break;
  case 5: i += 55; break;
  default: exit(1); break;
}

//i = read();
00: call 0x70
04: mr r0,r3
//switch(i){
08: cmplwi cr7,r0,5
0C: jle cr7,0x18
10: li r3,1
14: call 0x80
18: mulli r2,r0,4
1C: lis r9,5
20: addi r10,r9,7264
24: add r9,r2,r10
28: lwz r11,0(r9)
2C: add r11,r11,r10
30: jump r11
//case 1: i += 11; break;
34: addi r0,r0,11
38: jump 0x64
//case 2: i += 22; break;
3C: addi r0,r0,22
40: jump 0x64
...

The jump table is given by:

.ro_data
51c60: ff fa e3 d4 // 51c60-34
51c64: ff fa e3 dc // 51c60-3c
51c68: ff fa e3 e4 // 51c60-44
51c6C: ff fa e3 ec // 51c60-4c
51c70: ff fa e3 f4 // 51c60-54

//exit:
80: li r10,99
84: halt
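To check the encoding of the jump table, the target of the first entry can be recomputed explicitly. The following C sketch mirrors the computation performed by instructions 0x18 up to 0x30; the function name is ours.

#include <stdint.h>

/* r10 = (5 << 16) + 7264 = 0x51c60 is the table base; each table entry
   holds the (negative) difference between the case target and the base. */
uint32_t switch_target(uint32_t index, const uint32_t table[]) {
    uint32_t base = (5u << 16) + 7264u;   /* 0x51c60, cf. 0x1C and 0x20 */
    return base + table[index];           /* wraps around modulo 2^32   */
}

For the first entry, 0xfffae3d4 = -(0x51c2c), we obtain 0x51c60 - 0x51c2c = 0x34, i.e. the start address of the code for case 1.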



Related Work

Several tools tackle the problem of reconstructing the control flow from executables. Using a simple linear-sweep disassembler, e.g. objdump from GNU binutils, is not sufficient for identifying the code sections of an executable [120]. Even applying recursive traversal disassembly algorithms may fail to yield correct disassembly results for

executables containing indirect calls and indirect jumps. It is often a challenging task to produce a correct disassembly because of variable-length instruction sets, as is the case for the x86 architecture, or the intermixing of code and data. Therefore, modern control flow reconstruction additionally relies on extra information, either through code patterns used by compilers or through static program analysis. The first category of tools, using compiler patterns for control flow reconstruction, includes exec2crl by AbsInt [131, 130], dcc by Cifuentes et al. [40, 26] and IDAPro [64]. exec2crl is a tool which extracts the control flow from well-formed compiler-generated assembly of time-critical embedded systems. There, special coding conventions must be adhered to, which prevent the use of function pointers and dynamic data structures. Additionally, precise knowledge about the compiler and the target architecture is given. Under these restrictions a complete and sound control flow reconstruction is possible. The drawback of such a compiler-pattern driven approach is that for every compiler and every change in the code generation schemes the set of patterns has to be adjusted. Cifuentes et al. [26] propose slicing and substitution of expressions for obtaining normal forms for indirect jumps and calls. This normal form is matched against their repository of compiler patterns to recover high-level data flow information from executables (decompilation). They resort to heuristics only if local memory is used to compute the address expression of an indirect jump or call. These heuristics make their tool unsound. At the moment, IDAPro [64] is the disassembly industry standard. IDAPro generally assumes that the code has been generated from a program written in a high-level language by specific compilers and thus mainly relies on structural patterns and heuristics. It tries to determine procedure start addresses and to identify a control flow graph for each procedure. This is done by a depth-first traversal of the call graph or by pattern-based recognition of procedure prologue and epilogue code. As it does not provide information about the contents of memory, IDAPro may fail to identify procedure ranges in the presence of indirect calls and indirect jumps. It provides a sound handling neither for switch constructs nor for function pointers. Since IDAPro makes use of compiler patterns to resolve, e.g., those indirect jumps that represent switch statements, two problems may arise: the executable might not be clearly attributable to a specific compiler version, and the patterns for the latest compiler version might not yet be included in IDAPro. Miller and Harris [59] apply breadth-first traversal of the static call graph and the control flow graph, combined with recursive disassembly techniques, in order to distinguish between code and data in stripped executables. Basically they use pattern matching to identify function boundaries. In contrast to other approaches, they allow functions to have multiple entry points and to be distributed over non-contiguous areas of the executable. The second category relies on static analyses. These approaches are used by tools such as CodeSurfer/x86 by Reps et al. [109, 10], Jakstab by Kinder and Veith [66, 68] and the work of Myreen [95]. Kinder and Veith [68] present an analysis framework based on partial control flow graphs to resolve indirect jumps. They present a generic worklist algorithm to dynamically extend the control flow of a program. In [68] they claim that

their approach yields "the most precise overapproximation of the control flow graph w.r.t. the precision of the provided abstract domain". Our experiments with Jakstab [66] (version 0.7.4) on some test programs, however, resulted in unsound underapproximations. Moreover, they rely on an intraprocedural framework only, inlining newly detected procedures. Recursive procedures may thus lead to assembly which is not manageable by their framework. They also seem to rely on the implicit (unsound) assumption that procedures are side-effect free. CodeSurfer/x86 builds upon the control flow reconstruction of IDAPro. Reps et al. state in [10] that they are aware of the fact that IDAPro yields an unsafe as well as incomplete control flow graph in the presence of indirect jumps and calls. Thus, they attempt to augment and correct the information provided by IDAPro [10]. Their practical tool CodeSurfer/x86, however, is not available to us. They leave it open to what extent their value-set analysis contributes to correcting the erroneous disassembly results provided by IDAPro. In the context of program proving, Myreen [95] has presented a semantics-based control flow reconstruction via a translation into tail-recursive functions.
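To make the distinction between the disassembly strategies explicit, plain recursive traversal can be sketched as the following worklist algorithm in C; succs is a hypothetical decoder callback, and it is precisely at indirect jumps and calls that it cannot be implemented without a value analysis.

#include <stdbool.h>
#include <stdint.h>

#define MAX_CODE 0x100000
#define MAX_WORK 4096

/* Decoder callback: writes the statically known successor addresses of
   the instruction at u (fall-through, direct jump and call targets) into
   out and returns their number. */
typedef int (*succs_fn)(uint32_t u, uint32_t out[], int max);

static bool is_code[MAX_CODE];

void recursive_traversal(uint32_t entry, succs_fn succs) {
    uint32_t worklist[MAX_WORK];
    int top = 0;
    worklist[top++] = entry;
    while (top > 0) {
        uint32_t u = worklist[--top];
        if (u >= MAX_CODE || is_code[u]) continue;
        is_code[u] = true;                 /* mark address u as code */
        uint32_t next[8];
        int n = succs(u, next, 8);
        for (int i = 0; i < n && top < MAX_WORK; i++)
            worklist[top++] = next[i];
    }
}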

Contributions

We present an analysis that safely reconstructs an overapproximation of the control flow and call graph for compiler-generated assembly by carefully examining call and jump instructions. For the control flow reconstruction we follow the approach of Kinder and Veith [68] for dealing with indirect jumps and extend it with a treatment of procedure calls. One particular problem we deal with is abort and exit functions, which do not return to the corresponding call site, but terminate the whole program whenever they are called.

Overview

The structure of this chapter is as follows: In Section 2.1 we describe the concrete semantics our control flow reconstruction analysis builds on. Section 2.2 introduces a general interprocedural framework to overapproximate the control flow and call graph of an executable. Additionally we present a concrete instantiation of this framework in order to resolve common indirect jump instructions. In Section 2.3 we discuss several practical issues when analysing real-world code. Experimental results underpin the usability of this approach in Section 2.4. Finally, Section 2.5 summarises the program class and previews the various concrete semantics of the particular analyses introduced in this thesis.

2.1 The Concrete Semantics

Here, we present an instrumented concrete semantics w.r.t. which the control flow graph is defined. Let X = {x1, ..., xk} denote the set of registers of the processor. The instructions of the program are stored at the set of addresses N ⊆ ℕ. For every executable we assume that we are given a unique start address start ∈ N where program execution starts. The mapping I : N → Instr provides the processor instruction for a given program address from N. Depending on the architecture, the width of an instruction may vary. For the PPC architecture, however, all instructions have the same width 4. Before we describe the set of (PPC) processor instructions our control flow reconstruction is based on, we survey existing approaches in the area of static assembly analysis that provide a low-level intermediate representation in order to abstract the assembly of a disassembled input program. To name only some of them: the intermediate language REIL [42] is probably most closely related to our representation. However, REIL is more general, since it additionally models bit operations and all the processor flags, which we partially omit for the moment. Other low-level intermediate representations are, e.g., CRL2, the intermediate format of aiT [2] by AbsInt, the Transformer Specification Language TSL used in CodeSurfer/x86 [77], the Register Transfer Language RTL used in gcc [51], the RISC-like intermediate language Vine which is applied in BitBlaze [126], and the SSA-based assembler language LLVM [79]. The benefit of such an intermediate representation language is, on the one hand, to be maximally platform-independent by supporting only basic statements and, on the other hand, to simplify static analysis. Some processor instructions may have side-effects, modifying additional registers or, in the case of arithmetic instructions, setting fields of the condition register. For instance, the PPC bclr-instruction (branch conditional and provide a return address) may alter the link register (lr) for saving the effective address of the instruction following the branch, and may possibly also alter the count register (ctr), whose value may be decremented. In our framework, such instructions are modelled by consecutive assignment statements which together represent the whole effect of a single processor instruction. Thus, in our framework every instruction has at most one effect on the global program state. In the following we consider the set Instr of (PPC) processor instructions consisting of:

• stm: assignment statements xi := e, i.e. the value of expression e is assigned to register xi ∈ X; memory read instructions xi := M[e], where the content of the memory location specified by e is assigned to register xi; and memory write instructions M[e1] := e2, where the value of e2 is assigned to the memory location specified by e1;
• call xi: procedure calls, where the value of register xi denotes the start address of a procedure;
• jump e xi: jump instructions, which transfer control to the address specified by xi iff e evaluates to 0;
• return: the return-instruction transfers control back to the caller;
• halt: the program exit instruction, which terminates execution of the whole program and transfers control back to the operating system.

Here, e, e1, e2 denote expressions as provided by the syntax of assembler instructions.
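For concreteness, the statement set Instr could be represented as follows; this encoding is purely illustrative and not the representation used by our implementation or any of the cited tools.

typedef enum {
    ASSIGN,   /* xi := e      */
    LOAD,     /* xi := M[e]   */
    STORE,    /* M[e1] := e2  */
    CALL,     /* call xi      */
    JUMP,     /* jump e xi    */
    RETURN,   /* return       */
    HALT      /* halt         */
} InstrKind;

typedef struct Expr Expr;  /* expressions as in the assembler syntax */

typedef struct {
    InstrKind kind;
    int   reg;   /* xi for ASSIGN, LOAD, CALL, JUMP        */
    Expr *e1;    /* e for ASSIGN, LOAD, JUMP; e1 for STORE */
    Expr *e2;    /* e2 for STORE                           */
} Instruction;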

A procedure call call xi transfers control to the callee, whose address is given by the value of xi. We consider every address which is jumped to by a call instruction as the start address of a procedure. The address of the instruction directly following the procedure call is saved in a dedicated register, the link register (lr) of the processor. For instance, instruction 0x00 from Example 2.1 sets the link register to address 0x04. The instruction return, on the other hand, is nothing but an indirect jump to the address currently stored in the link register. For our control flow reconstruction, we only consider programs where return-statements transfer control back to the caller. This means that it is up to the callee to save the content of the link register (if necessary) and to restore it before executing the return. We leave it to supplementary analyses (cf. Chapter 6) to verify that the link register is handled correctly, i.e. that it is either never overwritten or correctly restored when temporarily stored on the stack. For convenience, we combine a comparison instruction and the succeeding branch instruction into a single guarded jump instruction jump e xi. In concrete machine architectures, these instructions need not follow each other directly (cf. Section 2.3). In the concrete semantics, we consider states σ assigning values to the registers X and to memory locations from some address space N0 which is disjoint from N. Let V denote the set of all possible values. The set of all such states is then given by Σ = (X ∪ N0) → V. Additionally, the instrumented operational semantics maintains a pair (c, f) where c is the address of the last call and f is the start address of the current procedure, with c, f ∈ N. Processor instructions from Instr modify the current state σ ∈ Σ. The semantics of a single processor instruction s on a given program state σ is defined via the semantic function [[s]] : Σ → Σ. Besides modifying the state, an instruction transfers control to another instruction (if it is an assignment or a jump), to another procedure (if it is a call) or to the environment (if it is a halt instruction). The transfer of control is provided by the partial function nextI : (N × Σ) → N which computes for every program point u with state σ the next program point according to the semantics of the processor instruction I(u):

nextI(u, σ) =
    σ(xi)    if I(u) = jump e xi and [[e]]σ = 0
    u + 4    if I(u) = jump e xi and [[e]]σ ≠ 0
    u + 4    otherwise

Here, the function [[e]]σ evaluates an expression e and returns a value which is interpreted as an integer. In the case of a jump-instruction the successor node is either the immediately following program point, if the condition e does not evaluate to 0, or the value of the jump target register xi otherwise. For procedure calls, the successor node is the immediately following program point (given that the called procedure returns). Due to the presence of procedures, the small-step operational semantics is based on the two transition relations ⊢S and ⊢R, denoting one step of intra-procedural and

inter-procedural execution, respectively. These relations are defined by:

(u, σ, (c, f)) ⊢S (nextI(u, σ), σ′, (c, f))        if I(u) = call xi ∧ f′ = σ(xi) ∧
                                                   (f′, σ, (u, f′)) ⊢S* (r, σ′, (u, f′)), I(r) = return
(u, σ, (c, f)) ⊢S (nextI(u, σ), [[I(u)]]σ, (c, f))  if I(u) is not a call
(u, σ, (c, f)) ⊢R (f′, σ, (u, f′))                  if I(u) = call xi ∧ f′ = σ(xi)
(u, σ, (c, f)) ⊢R (u′, σ′, (c, f))                  if (u, σ, (c, f)) ⊢S (u′, σ′, (c, f))

An initial program state is given by (start, σ, (start, start)) for suitable σ ∈ Σ. Given this operational semantics, an approximation of the control flow of the program is a pair (N, nextI♯) where N ⊆ ℕ and nextI♯ : N → 2^N is a mapping such that for every initial configuration conf = (start, σ0, (start, start)) the following holds:

• If conf ⊢R* (u, σ, (c, f)), then u ∈ N;

• If conf ⊢R* (u, σ, (c, f)) ⊢S (u′, σ′, (c, f)), then u′ ∈ nextI♯♯(u).

Let calls ⊆ N denote the subset of program points u where I(u) is a call instruction. Then an approximation of the call graph of the program is a pair (F, funI♯♯) where F ⊆ ℕ and funI♯♯ : calls → 2^F is a mapping such that for every initial configuration conf = (start, σ0, (start, start)), conf ⊢R* (u, σ, (c, f)) for some I(u) = call xi implies that σ(xi) ∈ funI♯♯(u).

2.2 Interprocedural Control Flow Reconstruction

Our goal is to construct sufficiently small pairs (N, nextI♯♯) and (F, funI♯♯). For that, we must determine tight approximations to the values of registers xi occurring in indirect jump instructions jump e xi and indirect call instructions call xi. In the following, we abstract from the concrete contents of the main memory and concentrate on the values of registers only. In order to be as precise as possible with the values of registers, we directly use the powerset domain 2^V, ordered by subset inclusion, as our abstract domain. Consequently, an abstract state is described by a mapping σ♯ : X → 2^V from registers to sets of values. Only when sets of values grow large do we widen them to an enclosing interval [32, 33, 86]. The interval domain then additionally requires a widening operation [36] to ensure termination of the fixpoint iteration, since loops and recursive functions may lead to infinitely ascending chains. In our analysis framework, we therefore insert widening operators at back-edges and at procedure entries.

The general framework relies on an arbitrary complete lattice Σ♯ of abstract states together with a concretisation γ : Σ♯ → 2^Σ where γ(σ♯) returns the set of concrete states described by σ♯. Additionally, we require for every instruction s the corresponding abstract transformer [[s]]♯ : Σ♯ → Σ♯ which safely approximates the concrete semantics of s, i.e. which satisfies:

[[s]]σ ∈ γ([[s]]♯σ♯) whenever σ ∈ γ(σ♯)
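For the powerset instantiation σ♯ : X → 2^V used in this chapter, the soundness condition can be illustrated by a minimal, hypothetical sketch in Python; the register names, the value universe and the transformer for xi := xj + c below are our own illustrative assumptions, not part of the framework:

# Hypothetical sketch: an abstract state maps registers to finite sets of
# values (the powerset domain 2^V); gamma enumerates the described concrete
# states, and assign_add_const is a sound abstract transformer for xi := xj + c.
from itertools import product

def gamma(abs_state):
    """Concretisation: all register valuations in which every register
    takes one of its abstractly recorded values."""
    regs = sorted(abs_state)
    return [dict(zip(regs, vals))
            for vals in product(*(abs_state[r] for r in regs))]

def assign_add_const(abs_state, xi, xj, c):
    """Abstract transformer for xi := xj + c on the powerset domain."""
    post = dict(abs_state)
    post[xi] = {v + c for v in abs_state[xj]}
    return post

# Soundness check on a tiny example: every concrete post-state is described.
pre = {"x3": {0, 4}, "x9": {8}}
post = assign_add_const(pre, "x9", "x3", 4)
assert all({**s, "x9": s["x3"] + 4} in gamma(post) for s in gamma(pre))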

Given the abstract lattice Σ♯ and the concretisation γ, we define the abstract next function nextI♯ : ℕ × Σ♯ → 2^ℕ by:

nextI♯(u, σ♯) =  γ(σ♯(xi)) ∩ ℕ                if I(u) = jump e xi ∧ [[e]]♯σ♯ = {0}
                 {u + 4}                       if I(u) = jump e xi ∧ 0 ∉ [[e]]♯σ♯
                 {u + 4} ∪ (γ(σ♯(xi)) ∩ ℕ)     if I(u) = jump e xi, otherwise
                 {u + 4}                       otherwise

Here, the abstract evaluation function [[e]]♯ takes an expression and returns the set of possible values of e. For guarded jump instructions the set of successor program points is given by the value of register xi in the current register valuation if condition e is fulfilled, by the immediately following program point if e is not fulfilled, and by the union of both sets otherwise. For all other processor instructions nextI♯ yields the immediately following program point.

Our analysis determines for each possible procedure entry node f a pair µ(f) = (σ♯, C) where σ♯ ∈ Σ♯ describes all possible concrete states at return points reachable from f on the same level (i.e. through ⊢S), and C ⊆ ℕ is a superset of all possible call sites for f. Additionally, the analysis determines for every program point u a pair η(u) = (σ♯, R) where σ♯ describes the set of all states attained at u when reaching u from an initial state, and R ⊆ ℕ × ℕ is a set of pairs (c, f) of call sites c for procedure entry points f such that the current program point is reachable from f on the same level (i.e. w.r.t. ⊢S). Assume that η(u) = (σ♯, R). Then we refer to the i-th component of the pair η(u) by ηi(u). The components of the pair µ(f) are accessed analogously. The values µ(f) and η(u) can be characterised as a solution of the following constraint system:

(1) µ(f) ⊒ ((c, f) ∈ η2(u)); (η1(u), {c})                               if I(u) = return
(2) η(start) ⊒ (⊤, {(start, start)})
(3) η(v) ⊒ (f ∈ γ(η1(u)(xi)) ∧ u ∈ µ2(f)); (H♯(η1(u), µ1(f)), η2(u))    if I(u) = call xi ∧ v = u + 4
(4) η(f) ⊒ (f ∈ γ(η1(u)(xi))); (E♯(η1(u)), {(u, f)})                    if I(u) = call xi
(5) η(v) ⊒ (v ∈ nextI♯(u, η1(u))); ([[s]]♯(η1(u)), η2(u))               if I(u) = s ∈ stm

Here, the operator ";" is defined by:

(x ∈ A); B = B if x ∈ A, and ⊥ otherwise

Constraint (1) describes the effect of a possibly called procedure f which may reach a return point. For constraint system η, initially at the start point start no information about possible variable valuations is known. Additionally, we mark the start point as reachable by recording the relation (start, start), as constraint (2) specifies.

Constraint (3) treats the case of a procedure call call xi. There, the set of successor nodes is given on the one hand by the set of possible values of register xi, i.e. the set of entry points of the callees, and on the other hand by the immediately following program point, provided that any of the possibly called procedures returns. The value after the procedure call at u + 4 consists of the set of call site–callee relations valid before the call to procedure f, together with the combination of the data flow value before the procedure call with the procedure summary µ1(f). This combination is computed by the function H♯. The function E♯ computes the contribution of the abstract state of the current call site to the start point f of the callee. Additionally, we relate the current call site u to the entry point of procedure f. This is defined by constraint (4). Constraint (5) treats all other forms of statements, which have no influence on the call site–callee relations. The successor node is computed by the abstract next function. Note that for a procedure f which does not return, µ(f) yields ⊥. Thus, in case of a call instruction at program point u the directly following program point u + 4 will not be reached. A safe approximation of E♯ and H♯, independent of the abstraction Σ♯, is:

E♯(σ♯) = σ♯
H♯(σ♯c, σ♯) = σ♯

Assume we are given a (not necessarily least) solution (µ, η) of the constraint system. Then we can extract both an approximate control flow (N, nextI♯♯) and an approximate call graph (F, funI♯♯) by:

N = {u | η(u) ≠ (⊥, ∅)}
nextI♯♯(u) = nextI♯(u, η1(u))
F = {f | (c, f) ∈ ⋃u η2(u) for some c}
funI♯♯(u) = γ(η1(u)(xi)) ∩ N    if I(u) = call xi

F captures all possible procedure entry points, both of functions that may return to the caller and of functions that definitely do not return. The following theorem relates the least solution of our constraint system with the (instrumented) operational semantics of the program, as specified through the relations ⊢S and ⊢R.

Theorem 1. Let (µ, η) denote the least solution of the constraint system. Then the following holds:

1. Assume that η(u) = (σ♯, R) and (start, σ0, (start, start)) ⊢R* (u, σ, (c, f)). Then (c, f) ∈ R and σ ∈ γ(σ♯).

2. Assume that µ(f) = (σ♯, C) and (start, σ0, (start, start)) ⊢R* (f, σ, (c, f)) ⊢S* (u, σu, (c, f)) where I(u) = return. Then c ∈ C and σu ∈ γ(σ♯).

The proof of Theorem 1 is by induction on the length of the execution sequences w.r.t. ⊢S and ⊢R, respectively. As an immediate corollary, we obtain:

Corollary 1. The pairs (N, nextI♯♯) and (F, funI♯♯) are approximations of the control flow and the call graph of the input program.

Instead of abstracting the state at a program point u only, we may also abstract the transformer along the path to program point u. The abstract domain Σ♯ can be enhanced by additionally accumulating an abstraction of the state transformer, from a domain T, corresponding to the current procedure. Thus, we consider the abstract domain Σ♮ = Σ♯ × T. Accordingly, we have to adjust the abstract semantic function [[s]]♮ : Σ♮ → Σ♮ to elements from Σ♮:

[[s]]♮(σ♯, τ) = ([[s]]♯σ♯, [[s]]T ∘♮ τ)

where ∘♮ denotes the composition of transformers from T and [[s]]T ∈ T denotes the abstract semantic effect of the processor instruction s, i.e. its description as a state transformer from T. This enhancement by abstracted state transformers enables a more precise definition of the function H♯ w.r.t. the domain Σ♮:

H♯((σ♯1, τ1), (σ♯2, τ2)) = (ι(τ2, ι(τ1, σ♯1)), τ2 ∘♮ τ1)

with ι : T → Σ♯ → Σ♯. Here, ι(τ, σ♯) transforms an abstract state σ♯ by means of the state transformer τ ∈ T, interpreted in the context of the abstract domain Σ♯. Additionally, the function E♯ is given by:

E♯((σ♯, _)) = (σ♯, Id♯)

with Id♯ the identity mapping. One specific instance of this abstraction T records, for example, the set of registers which have definitely not been modified since procedure entry. For that, we choose Σ♮ = (X → 2^V) × 2^X. Then H♯ can be refined to:

H♯((σ♯c, X), (σ♯, X′)) = (σ̃♯, X′ ∩ X)    where    σ̃♯(x) = σ♯c(x) if x ∈ X′, and σ♯(x) otherwise

where, for this instantiation of H♯, ∘♮ is the intersection of both register sets. Combining the effect of a called procedure with the state of the call site results in a register valuation σ̃♯ which takes its values from the register valuation before the call for all registers which are not modified by the called procedure, and from the values at procedure return for the remaining ones. Additionally, the set of definitely unmodified registers for the caller after the call is given by the intersection of the respective sets of the caller before the call and of the callee.

The value for the start point of a procedure is given by the register valuation σ♯ at the call site of the procedure, together with the set of all registers X:

E♯((σ♯, _)) = (σ♯, X)
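In, e.g., Python, this instance might be sketched as follows; the register set, the dict representation and the function names enter and combine are our own assumptions, not fixed by the framework:

# Sketch of the "definitely unmodified registers" instance of E# and H#.
# An abstract value is a pair (sigma, X): a register valuation sigma
# (register -> set of possible values) and the set X of registers that
# have definitely not been modified since procedure entry.

ALL_REGS = {f"x{i}" for i in range(32)}    # assumed register set X

def enter(call_site_value):
    """E#: at procedure entry, keep the caller's valuation; nothing has
    been modified yet, so the second component is all of X."""
    sigma, _ = call_site_value
    return (dict(sigma), set(ALL_REGS))

def combine(call_site_value, summary_value):
    """H#: combine the caller's state at the call site with the callee
    summary. Registers the callee left untouched keep their call-site
    value; the others take their value at the callee's return point."""
    sigma_c, X = call_site_value
    sigma_r, X_prime = summary_value
    sigma_tilde = {r: sigma_c[r] if r in X_prime else sigma_r[r]
                   for r in ALL_REGS}
    return (sigma_tilde, X_prime & X)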

In this instance of the domain T, Id♯ is given by the set of all registers X. The constraint system as specified above is not directly tractable. In particular, the set of program locations is not known beforehand. In order to overcome this obstacle, we extend the approach of [68] and explore the reachable program locations as they are encountered during fixpoint computation. Besides indirect jumps, our extension also handles calls and returns. In case of a return-instruction at program point r, we rely on the fixpoint algorithm for updating the summaries µ(f) of procedure entries f from which r is (intraprocedurally) reachable, and let it re-consider the call sites of f if the summary µ(f) has changed. This results in the worklist-based fixpoint algorithm 2.1.

1: W ← {(start, (⊤, {(start, start)}))};
2: while (W ≠ ∅) do
3:   (u, s) = extract(W);
4:   if (s ⋢ η(u)) then
5:     η(u) ← η(u) ⊔ s;
6:     (σ♯, R) = η(u);
7:     if (I(u) = jump e xi ∧ σ♯(xi) = ⊤) then
8:       abort();
9:     else if (I(u) = call xi ∧ σ♯(xi) = ⊤) then
10:      W ← W ∪ {(u + 4, (⊤, R))};
11:    else if (I(u) = call xi ∧ σ♯(xi) ≠ ⊤) then
12:      W ← W ∪ {(f, (E♯(σ♯), {(u, f)})) | f ∈ γ(σ♯(xi))};
13:    else if (I(u) = return) then
14:      for all ((_, f) ∈ R) do
15:        if ((σ♯, {c | (c, f) ∈ R}) ⋢ µ(f)) then
16:          µ(f) ← µ(f) ⊔ (σ♯, {c | (c, f) ∈ R});
17:          (σ♯f, Rf) = µ(f);
18:          W ← W ∪ {(c + 4, (H♯(η1(c), σ♯f), η2(c))) | (c, _) ∈ Rf};
19:    else
20:      W ← W ∪ {(v, ([[I(u)]]♯σ♯, R)) | v ∈ nextI♯(u, σ♯)};

Algorithm 2.1: Control Flow Reconstruction

Initially, we assume that η(u) is (implicitly) initialised with the least possible value (⊥, ∅) for all possible values of u. Likewise, we assume that µ assigns (⊥, ∅) to all possible entry points of procedures. Algorithm 2.1 maintains a worklist W consisting of pairs (u, s) of program points together with a potential update s for the value η(u). The algorithm terminates when all these updates have been processed.
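Transcribed into, e.g., Python, the main loop of Algorithm 2.1 might look as follows; the abstract domain is passed in as an object dom, and all helper names (leq, join, gamma, E_sharp, H_sharp, abs_next, abs_step, instr) are our own placeholders for the operations described above, not part of the thesis's implementation:

# Sketch of the worklist loop of Algorithm 2.1.
from collections import defaultdict

class AbortReconstruction(Exception):
    """Raised when an indirect jump target cannot be bounded (line 8)."""

def reconstruct(start, dom, instr):
    bot = (dom.bottom, frozenset())
    eta = defaultdict(lambda: bot)     # program point -> (sigma#, R)
    mu = defaultdict(lambda: bot)      # procedure entry -> (sigma#, C)
    W = [(start, (dom.top, {(start, start)}))]
    while W:
        u, s = W.pop()
        if dom.leq(s, eta[u]):                       # line 4: subsumed
            continue
        eta[u] = dom.join(eta[u], s)                 # line 5
        sigma, R = eta[u]
        kind, xi = instr(u)
        val = (sigma[xi] if kind in ("jump", "call") and sigma is not dom.top
               else dom.top)
        if kind == "jump" and val is dom.top:
            raise AbortReconstruction(u)             # line 8
        elif kind == "call" and val is dom.top:
            W.append((u + 4, (dom.top, R)))          # line 10: assume return
        elif kind == "call":
            for f in dom.gamma(val):                 # line 12
                W.append((f, (dom.E_sharp(sigma), {(u, f)})))
        elif kind == "return":
            for _, f in set(R):                      # lines 13-18
                upd = (sigma, frozenset(c for (c, g) in R if g == f))
                if not dom.leq(upd, mu[f]):
                    mu[f] = dom.join(mu[f], upd)
                    sigma_f, R_f = mu[f]
                    for c, _ in R_f:                 # re-examine call sites
                        W.append((c + 4, (dom.H_sharp(eta[c][0], sigma_f),
                                          eta[c][1])))
        else:
            for v in dom.abs_next(u, sigma):         # line 20
                W.append((v, (dom.abs_step(u, sigma), R)))
    return eta, mu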

For processing one pair (u, s), the algorithm first checks whether s is already subsumed by the current value of η(u). If this is not the case, s is added to η(u), and this change is propagated to all consumers of the value η(u). Here, a case distinction on the instruction at program point u is performed. In case the target addresses of a call-instruction are not known (cf. line 9 of Algorithm 2.1), we at least assume that the called function returns and overapproximate the return state with ⊤. Otherwise, we extend the worklist by pairs consisting of all the targets f that may be called and their corresponding states (cf. line 12 of Algorithm 2.1). In case of a return-instruction in procedure f we propagate the effect of f to all its call sites (cf. line 18 of Algorithm 2.1). For all other kinds of program instructions the worklist is extended by pairs consisting of all the successor nodes (computed via the abstract next function) and the corresponding state update (computed via the abstract semantic evaluation function).

With our current instantiation of Σ♯, which only keeps track of the values of registers, we are only able to resolve static procedure calls. A more sophisticated instantiation, however, which additionally analyses the memory in greater detail (cf. Chapter 3), would also allow computing a safe approximation of the control flow of a larger class of programs.

An assembly can be either stripped, i.e. the symbol table and debugging information are missing, or unstripped. The symbol table contains all the start addresses of the procedures F provided by the executable. In case we have a symbol table, we start our analysis from all procedure start points. Furthermore, for a call-instruction whose target addresses are unknown, we can make the assumption that only those procedures may be called which are listed in the symbol table. In case of analysing a stripped executable, procedure start addresses are uncovered on the fly. Every executable is provided with a unique start address, specified in the header of the executable. Typically, the entry point of an executable is the start address of the .text section, cf. Section 1.3. If the target address of a call instruction call xi is unknown, we must assume that an unknown procedure is called, which may call any other procedure in any state. Thus, a safe approximation of E♯ and H♯ is then only given by:

E♯(σ♯) = ⊤  and  H♯(σ♯c, σ♯) = ⊤

The function abort (cf. line 8 of Algorithm 2.1) indicates that the reconstruction of the control flow graph has failed. For unknown target addresses of jump-instructions (cf. line 7 of Algorithm 2.1) we abort control flow reconstruction. Section 2.3 shows that even our simplistic instantiation of the framework is able to resolve all indirect jumps (resolvable by a static analysis) on all our benchmark programs.

On regular termination, let (µ, η) be the variable valuations computed by Algorithm 2.1, and let F = {f | (c, f) ∈ ⋃u η2(u) for some c} and N = {u ∈ ℕ | η(u) ≠ (⊥, ∅)}. Then the pair (µ|F, η|N) is a solution of our constraint system when restricted to procedure entries from F and program points from N. In particular, this means that the control flow graph (N, nextI♯♯) and the call graph (F, funI♯♯) constructed from (η, µ) are indeed approximations of the control flow and the call graph of the program.

Our experiments show that in case of switch-statements which are realised by jump table look-ups, we have to take memory into account. The jump table can either

contain absolute addresses or address offsets, as is the case in Example 2.1. Jump tables T : ℕ″ → V are located in the read-only memory ℕ″ ⊂ ℕ′ of an executable. Thus, in our instantiation of the framework we handle only those memory read accesses xi := M[xj] that refer to the read-only memory section. The abstract semantic function on memory access expressions is then defined by:

[[M[xj]]]♯σ♯ = {T[c] | c ∈ σ♯(xj)}   if σ♯(xj) \ ℕ″ = ∅
               V                     otherwise

In compiler-generated switch statements, typically no procedure calls are involved in the address computation for the jump target. Nevertheless, our experiments with real-world applications reveal that procedure calls may occur in-between this address computation, as Example 2.1 illustrates. The compiler omits a jump to the end of the switch statement if an exit-procedure is called within the default branch of the switch-statement. Only a sufficiently precise treatment of procedure calls can avoid the loss of essential information for resolving the jump instruction at address 0x30 in Example 2.1.
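The jump-table treatment above can be sketched in Python; the parameters table, readonly and the value universe V are assumed inputs, and the function name is ours:

# Sketch: abstract evaluation of a memory read xi := M[xj] which is only
# precise when every possible address lies in read-only memory (N'').
def abs_mem_read(sigma, xj, table, readonly, V):
    """sigma: register -> set of values; table: address -> stored entry;
    readonly: the set N'' of read-only addresses; V: all values."""
    addrs = sigma[xj]
    if addrs <= readonly:                  # sigma#(xj) \ N'' is empty
        return {table[a] for a in addrs}   # precise jump-table look-up
    return set(V)                          # unknown read: any value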

2.3 Practical Issues

In the previous sections, we presented an idealised framework. Now we show how it copes with real-world issues.

Position-independent Code

Recall from Section 1.3 that PIC accesses all constant addresses through the GOT, which is located in the read-write data section of the program. Consequently, in PIC jump tables are also position-independent. In this case a jump first loads a nearby address, which is then used to index the jump table and yield a value. This value is then used as an offset to the previously loaded address in order to yield the destination of an indirect jump instruction.

Example 2.2. Jump Tables in PIC

04: call 0x08
08: mflr r30
0C: lwz r0,-24(r30)
10: add r30,r0,r30
....
2C: lwz r0,24(r1)
30: cmplwi cr7,r0,5
34: bgt cr7,0x70
38: mulli r9,r0,4
3C: lwz r0,-32764(r30)
40: add r9,r9,r0
44: lwz r9,0(r9)
48: add r9,r9,r0
4C: jump r9

After instruction 0x10 is executed, register r30 contains the address of the GOT (cf. instructions 0x04–0x10). Typically, in order to obtain the address of the GOT, instruction pointer relative addressing is used. This is realised via a call instruction to the immediately following location. The effect of this local jump is that the continuation address is saved in the link register. This continuation address serves as a fixed anchor in the code section, and via a constant difference the GOT can be addressed, although its absolute address is not known until runtime.

After instruction 0x38 is executed, register r9 holds the value of the switch index variable. The base address of the switch table is computed via a look-up in the GOT, as instruction 0x3C illustrates. Finally, an access into the jump table is performed at instruction 0x44. Under the assumption that the location with offset -32764 to the GOT (cf. instruction 0x3C) is definitely not overwritten, we can safely infer the base address of the jump table. Since our approach explicitly handles procedure calls, jump tables in PIC can be precisely dealt with in our framework. □

Control Flow Splitting

For our semantics we assumed that the compare- and branch-instructions either directly follow each other (cf. instructions 0x08, 0x0C in Example 2.1) or that the processor instructions in-between the compare- and the branch-instruction do not modify the register the compare is based on. This assumption, however, need not always be satisfied. Consider the following example.

Example 2.3. Branching

Here the compare-instruction at address 0x04 compares register r3 with some constant. Subsequently, register r3 is assigned a new value at instruction 0x0C.

int f(int v) {
  int r = 1234;
  switch(v) {
    case 1:     r = 11;  break;
    case 20:    r = 22;  break;
    case 300:   r = 33;  break;
    case 4000:  r = 44;  break;
    case 50000: r = 55;  break;
    default:    r = 666;
  }
  return r;
}

04: cmpwi cr7,r3,300
08: mr r9,r3
0C: li r3,33
10: beqlr cr7
14: ble cr7,0x40
18: cmpwi cr7,r9,4000
1C: li r3,44
20: beqlr cr7
24: li r0,0
28: li r3,55
2C: ori r0,r0,50000
30: cmpw cr7,r9,r0
34: beqlr cr7
38: li r3,666
3C: blr
40: cmpwi cr7,r9,1
44: li r3,11
48: beqlr cr7
4C: cmpwi cr7,r9,20
50: li r3,22
54: beqlr cr7
58: li r3,666
5C: blr

Then, the branch-instruction is performed at address 0x10, referring to the old value of register r3 (from instruction 0x04). □

In order to deal with this case we propose the technique of control flow splitting. For instance, Simon [123] describes a technique of partitioning a set of execution traces by adding Boolean flags to the numeric domain the analysis is based on. The drawback, however, is the exponential explosion, since at every conditional branch the abstract states are doubled. The PPC architecture saves the result of the condition evaluation in one of the 8 fields of the condition register cr. Hence, a nesting depth of at most 8 can occur, resulting in altogether 2^8 ways of splitting up the control flow.

Function Pointers

At the assembler level, function pointers are realised via indirect calls.

Example 2.4. Function Pointers

Consider the following example code motivated by a Linux kernel driver, as for instance linux-2.6.33/drivers/md/md.c, where a collection of initialisation functions is managed in a global array. Procedure global_init sequentially calls all the initialisation functions.

//for(i=0;i

□

There are compilers which arrange all constant global data in the read-only memory. However, if this is not the case, we either have to enhance our framework with a memory analysis (cf. Chapter 3) or rely on a may-analysis of modified memory locations (cf. Chapter 6). Assume we are given such a set of possibly modified memory locations, which we denote by B. Then, in our analysis framework we have to adjust the abstract effect function for memory read accesses:

[[M[xj]]]♯σ♯ = {M[c] | c ∈ σ♯(xj)}   if (σ♯(xj) \ ℕ′) ∩ B = ∅
               V                     otherwise
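A sketch analogous to the read-only case, under the assumption that precision requires every possible address to be a known memory location that is not contained in B (all names below are ours):

# Sketch: a memory read that is only precise if no possible address may
# have been modified; B is the may-modified set from Chapter 6.
def abs_mem_read_checked(sigma, xj, memory, known_addrs, B, V):
    addrs = sigma[xj]
    if addrs <= known_addrs and not (addrs & B):
        return {memory[a] for a in addrs}    # initial values still valid
    return set(V)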

In order to precisely reconstruct the control flow in the presence of indirect calls, abstract domains are required which capture side-effects of procedures (cf. Chapter 6), and which possibly also track code addresses stored in the heap [8].

Optimisation Levels

Our idealised framework speaks about register valuations only. Thus, the control flow reconstruction yields precise results as long as values are kept in registers only. This is the case for assembly generated by compilers at a higher optimisation level. However, in case of zero-optimised code or under register pressure, compilers store values on the stack. Compilation at optimisation level zero has the effect that the values of local variables are always freshly loaded from memory. Consider the zero-optimised assembly in Example 2.5, where two local variables i and j of the C code are addressed.

Example 2.5. Zero-Optimised Code

int f(int i){
  int j;
  if(i>=0)
    j=i;
  else
    j=-i;
  return j;
}

// int f(int i){
04: stwu r1,-32(r1)
08: stw r3,24(r1)
// if(i>=0)
10: lwz r0,24(r1)
1C: cmpwi cr7,r0,0
20: blt 0x30
// j=i;
24: lwz r0,24(r1)
28: stw r0,8(r1)
2C: b 0x3C
// else j=-i;
30: lwz r0,24(r1)
34: neg r0,r0
38: stw r0,8(r1)
// return j;
3C: lwz r0,8(r1)
...

Instruction 0x10 loads the value of the local variable with address r1+24 (corresponding to local variable i in the C code) into register r0. The load at instruction 0x24 is unnecessary, because register r0 still contains the value of this local variable at that program point; it is nevertheless performed, regardless of available expressions. The same applies to the load at instruction 0x3C. □

In order to analyse zero-optimised assembly, we have to extend our value analysis by a stack analysis, such as the one presented in Section 3.3. Via the approach of inferring constant differences between register contents we are able to detect local and global memory locations. Since in such code the values of stack locations are temporarily cached in registers, an analysis inferring equality relations between registers and memory locations is also mandatory in order to precisely track the values of both registers and memory locations.

2.4 Implementation

Based on our theoretical approach, we implemented the control flow reconstruction in our analyser VoTUM [136] (cf. Chapter 5) in order to assess the quality of the resulting control flow graph and to identify the next challenges by means of real-world programs. Our current implementation tracks the values of registers and memory locations but completely neglects the heap. We conducted our experiments on a 2.2 GHz quad-core machine equipped with 16 GB physical memory. All our benchmark programs have been compiled with gcc version 4.4.3 at optimisation levels O0 and O2, without debug information, for the PPC architecture. For the moment we only inspect fully statically linked and stripped executable programs; hence our benchmark programs contain the whole GNU C library code. Our implementation of the control flow reconstruction intertwines one step of disassembling, i.e. parsing the bytes in the executable via objdump to arrive at an assembler instruction, with one step of our value analysis (interval analysis). The following two tables present the performance of our analyser on the benchmark programs.

Table 2.1: Benchmark Suite for Programs at Optimisation Level O0

Program    Size    Procs       Instr    bctr  ures   ureac  bctrl  res  ureac  bl     M(GB)  T(s)
openSSL    3.8MB   6708(375)   769511   163   0(4)   129    1352   20   1219   35709  4      203
thttpd     884kB   1197(464)   196493   77    0(5)   42     321    21   189    6092   1.2    67
switches   636kB   825(364)    138178   82    0(4)   42     302    20   184    3680   0.8    45
control    633kB   817(354)    139917   83    0(4)   49     302    16   184    3670   0.8    42
coreutils  3.9MB   5671(2371)  852322   431   0(26)  219    1648   101  1004   24159  1.3    527
gzip       0.7MB   1076(472)   166213   79    0(4)   44     310    20   188    4634   1.1    132

Table 2.2: Benchmark Suite for Programs at Optimisation Level O2

Program    Size    Procs       Instr    bctr  ures   ureac  bctrl  res  ureac  bl     M(GB)  T(s)
openSSL    2.9MB   6232(380)   613882   150   0(4)   116    1355   20   1217   34405  3      156
thttpd     852kB   1147(469)   189034   77    0(5)   42     320    17   190    5890   1      60
switches   625kB   826(358)    137833   77    0(4)   41     302    17   184    3673   0.8    44
control    629kB   817(354)    138589   81    0(4)   47     302    20   184    3670   0.8    40
coreutils  3.8MB   5372(2534)  830407   424   0(28)  202    1634   104  959    23504  1.3    459
gzip       0.7MB   1026(384)   162380   83    0(5)   44     309    20   190    4587   1      117

In these tables we specify: the binary file size Size; the number of procedure entries Procs (as provided by the symbol table of the corresponding unstripped version of the binary), with the number of procedures identified by our analyser in parentheses; the number of assembler instructions Instr; the number of indirect jumps bctr and indirect calls bctrl; the number of unresolved indirect jumps ures, with the number of indirect jumps that are statically not resolvable due to runtime linkage in parentheses; the number of resolved indirect calls res; the number ureac of unreachable indirect jump and call instructions which the analyser did not reach when starting from the entry point of the stripped binary; the number of static call instructions bl; and the memory consumption M(GB) in GB and the time consumption T(s) in seconds of our analyser.

For our benchmark suite, on the one hand we concentrate on applications from embedded systems, e.g. the communication protocol implementation openSSL, the lightweight HTTP server thttpd and a SCADE-generated vehicle control program control from [129]. On the other hand we took a home-made example program switches with several characteristics of switches: nested switches, switches in loops, etc. coreutils consists of five selected programs (ls, basename, vdir, chmod, chgrp) taken from the GNU Coreutils package in order to demonstrate the applicability of our approach to ordinary desktop software, and gzip makes our results comparable to other tools which refer to SPECint.

Some of our benchmark programs use lazy binding of procedure addresses via indirect jumps within the trampoline code to the PLT [76]; recall Section 1.3. The absolute address of such a dynamically loaded procedure is loaded from a constant memory location in the PLT section and then branched to via a bctr-instruction. If this location is not yet initialised, the trampoline branches to the runtime linker, which provides the dynamic address of the corresponding procedure. However, the address of this runtime linker is not present in the binary; it is only provided when the binary is loaded. Consequently, no static value for the target of such a bctr-instruction can be determined. We therefore list this kind of unresolvable bctr-instructions in parentheses in our benchmark tables.

Summarising, our instantiation of the framework is able to provide tight bounds for all of the statically resolvable indirect jumps within the benchmark programs. We fail to resolve some of the indirect call instructions because we have not yet modelled bit operations in our semantics and do not take the heap into account.

2.5 Programming Model

In this section we present the syntax of our programming model, summarising the structure of control flow graphs, which serves as the basis for defining the collecting semantics [35] for the analyses presented in this thesis. The collecting semantics forms the basis for judging soundness and precision of our analyses.

Having reconstructed the control flow representation of a given executable, we can now define a program to be a finite set of disjoint control flow graphs (CFGs) G. Each graph Gq ∈ G corresponds to a procedure q from a finite set F ⊆ ℕ of procedures in the program. We assume that there is a designated procedure main where program execution always starts. Each control flow graph Gq consists of:

• a finite set Nq of program points of procedure q,
• a unique entry point sq ∈ Nq for procedure q,
• a unique exit point rq ∈ Nq for procedure q, and
• a finite set Eq ⊆ (Nq × Instr × Nq) of labelled edges.

Labels at control flow edges are either basic statements (stm), guards (guards) or procedure calls (q()). We assume that after control flow reconstruction all procedure call instructions call xi are overapproximated by a set of possible call targets. Henceforth, in our representation procedure calls are given by q(), where q ∈ F denotes the start address of a procedure. In particular, if the values of register xi from instruction call xi are given by the set of

values V ⊆ F, the control flow reconstruction introduces a call edge vi() into the CFG for every vi ∈ V. Accordingly, we assume that all jump instructions jump e xi are overapproximated by a set of possible jump targets after control flow reconstruction. This means that they are implicitly represented by the structure of the control flow graph, while the edge emulating such a jump instruction is annotated with the corresponding condition e. We model the concrete effect of every processor instruction in our analyses by one of the following constructs Instr:

1. calls: calls q() to procedures q ∈ F;

2. stm:

• variable assignments of the form xi := t (xi a register, t an expression). We restrict ourselves to linear expressions in assignments, while all other expressions result in unknown values; i.e. non-deterministic assignments xi := ? represent instructions possibly modifying register xi in an unknown way. In this case any value may be assigned to the left-hand side xi. Moreover, we only treat the arithmetic operations of addition, subtraction, and multiplication with a constant precisely.

• memory access instructions of the form xi := Mm[t] (denoting memory reads) or Mm[t] := xi (denoting memory writes), where xi is a register, t an expression and m the access width. Thus, we exclude instructions operating on multiple registers simultaneously. The formalism can be extended, though, to deal with variable-length memory access instructions and multi-register arithmetic as well. Sometimes we omit the subscript and only write M[t] instead of M4[t].

3. guards: guards of one of the forms xj ⋈ xk, comparing the values of two registers, or xj ⋈ c, comparing the value of a register with a constant, for every comparison operator ⋈, where c is a constant and xj, xk are registers. Guards derive from those PPC instructions which handle the branching mechanism: there, the conditional expression is evaluated and its result is stored in one of the eight fields of the 32-bit condition register cr (CR0 up to CR7); subsequently, the branch is performed according to the corresponding value of that field of the condition register.

In conclusion, this format subsumes reconstructed CFGs of assemblies in three-address form, as provided by some processors like the PPC or by a transformation into a low-level intermediate representation. A sketch of this edge-labelled representation as data types is given below; Example 2.6 then illustrates the control flow representation of a concrete assembly. For improved readability we switch from the names of machine registers (r0, r1, ...), which are used in examples only, to the variable names (x0, x1, ...) for the analysis descriptions as well as for the control flow representation. In the following, the terms registers and variables are used synonymously for the set X = {x0, x1, ..., xk}.
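One possible rendering of this programming model as data types, sketched in Python (all names are our own choices, not part of the thesis's tooling):

# Minimal sketch of the programming model: a program is a set of CFGs,
# one per procedure; edges are labelled with calls, statements or guards.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Call:            # q()
    q: int             # start address of the callee

@dataclass(frozen=True)
class Assign:          # xi := t  (t linear, or None for xi := ?)
    lhs: str
    rhs: object

@dataclass(frozen=True)
class MemRead:         # xi := M_m[t]
    lhs: str
    addr: object
    width: int = 4

@dataclass(frozen=True)
class MemWrite:        # M_m[t] := xi
    addr: object
    rhs: str
    width: int = 4

@dataclass(frozen=True)
class Guard:           # xj <op> xk  or  xj <op> c
    left: str
    op: str
    right: object

@dataclass
class CFG:
    entry: int
    exit: int
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)   # (u, label, v) triples

# e.g. the first two edges of Example 2.6 below:
g = CFG(entry=1, exit=14)
g.edges.append((1, Assign("x0", 0), 2))
g.edges.append((2, MemWrite(addr=("x1", 8), rhs="x0"), 3))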

Example 2.6. Assembly and its corresponding Control Flow Representation

int i, a[100];
for(i=0; i<100; i++)
  a[i] = 3;

04: li r0,0
08: stw r0,8(r1)
0C: b 0x38
10: lwz r0,8(r1)
14: mulli r9,r0,4
18: addi r0,r1,8
1C: add r9,r9,r0
20: addi r9,r9,4
24: li r0,3
28: stw r0,0(r9)
2C: lwz r9,8(r1)
30: addi r0,r9,1
34: stw r0,8(r1)
38: lwz r0,8(r1)
3C: cmpwi cr7,r0,99
40: ble cr7,0x10

The corresponding control flow representation consists of the program points 1 to 14 with the following labelled edges:

1 → 2:   x0 := 0;
2 → 3:   M4[x1 + 8] := x0;
3 → 4:   x0 ≤ 99;      3 → 14:  x0 > 99;
4 → 5:   x0 := M4[x1 + 8];
5 → 6:   x9 := 4 · x0;
6 → 7:   x0 := x1 + 8;
7 → 8:   x9 := x9 + x0;
8 → 9:   x9 := x9 + 4;
9 → 10:  x0 := 3;
10 → 11: M4[x9] := x0;
11 → 12: x9 := M4[x1 + 8];
12 → 13: x0 := x9 + 1;
13 → 3:  M4[x1 + 8] := x0;   □

Each analysis introduced in this thesis has its own view on this programming model, i.e. it precisely tracks only a subset of the processor instructions Instr. Here, we introduce the basics of the concrete semantics of our programming model which all the analyses introduced in this thesis share. We therefore use a general concrete domain V describing the variable valuations and keep the interpretation [[s]] of program instructions s ∈ Instr generic. The collecting semantics assigns to each program point the set of states that can occur at this program point in some execution of the program. The domain T represents sets of program states; in particular, T = 2^V. We use the following notational conventions: we characterise the sets of program states reaching program points by the least solution of the constraint system R, while transformations summarising the effect of procedure calls are characterised by the least solution of constraint system S. Following the approach of, e.g., [88, 91], we characterise the effect of a procedure q by means of a constraint system S over transformers operating on (some concrete) program states T:

[S1] S(sq) ⊇ Id               sq the start point of procedure q
[S2] S(v) ⊇ S(rf) ∘ S(u)      (u, f(), v) a call edge
[S3] S(v) ⊇ [[s]] ∘ S(u)      (u, s, v) with s ∈ stm ∪ guards

with rf the return point of procedure f and sq the start point of procedure q. Here, Id denotes the identity mapping defined by Id(y) = {y} and ∘ denotes the composition of

transformations of type V → T. This composition is defined by:

(f ∘ g)(y) = ⋃ {f(y′) | y′ ∈ g(y)}.
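In Python, this composition and the element-wise application used further below can be sketched in a few lines; a state is assumed to be hashable, e.g. a tuple:

# Sketch: transformers map one state to a set of states; composition and
# element-wise application to a set of states, as defined in the text.
def compose(f, g):
    """(f . g)(y) = union of f(y') for y' in g(y)."""
    return lambda y: {z for y_prime in g(y) for z in f(y_prime)}

def apply_to_set(T, Y):
    """T(Y) = union of T(y) for y in Y."""
    return {z for y in Y for z in T(y)}

Id = lambda y: {y}    # the identity transformer Id(y) = {y}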

The effect of a procedure q is given by the effect accumulated at its return point, S(rq). Constraint [S1] expresses that no initialisation of the program variables is performed at the start point sq of any procedure q in the program; this results in the identity mapping Id. Constraint [S2] describes the handling of edges which are labelled with a procedure call: there, we compose the transformation of the called procedure with the transformation accumulated before the procedure call. Finally, constraint [S3] describes the accumulation of the effect of basic statements and guards.

The set of attained program states when reaching program point u is given by the least solution of the following system of inequations R. The effect of a call to procedure q is given by the application of S(rq) to the set of program states valid immediately before the procedure call.

[R1] R(smain) ⊇ T0            smain the start point of procedure main
[R2] R(sq) ⊇ R(u)             (u, q(), _) a call edge
[R3] R(v) ⊇ S(rq)(R(u))       (u, q(), v) a call edge
[R4] R(v) ⊇ [[s]](R(u))       (u, s, v) with s ∈ stm ∪ guards

Here, T0 denotes the set of start states of the program. Moreover, the application of a function T : V → T to a set Y ∈ T is defined by:

T(Y) = ⋃ {T(y) | y ∈ Y}

This definition allows transformers to be inductively defined on sets of program states. The first constraint [R1] expresses that program execution starts with a call to the specific procedure main. Constraint [R2] describes that the start point of a procedure q is dependent on all the program points where q is called. For a procedure call q(), the effect of q is given by the transformation S(rq), provided by the least solution of the constraint system S. This effect is applied element-wise to the sets of program states, as formalised in constraint [R3]. Basic statements and guards result in applying the transformation functions corresponding to the edges element-wise to the set of program states reaching the start point of the edge, as defined in [R4]. If the semantic brackets [[·]] denote a monotonic function, then both constraint system S and constraint system R have unique least solutions (according to the fixpoint theorem of Knaster-Tarski). Now, for each analysis we preview the particular instantiation of the concrete semantics (S, R) and describe the instructions it operates on.

• (S0, R0) describes the concrete semantics of the analyses inferring constant variable differences and linear two-variable equalities, from Chapter 3 and Chapter 4, respectively. The program class is restricted to instructions operating on registers only. Hence, all memory access instructions are represented by non-deterministic assignments. Likewise, guards are abstracted by non-deterministic

branching. Moreover, only assignments of the form xi := t where t is of the form t ::= b | a·xj + b are handled precisely, where a, b are constants, while all other forms of assignments are abstracted by non-deterministic assignments. We assume that the variables take values from the integral domain Z. Hence, R0 is instantiated by 2^(Z^k), while S0 speaks about transformers 2^(Z^k) → 2^(Z^k).

• (S1, R1) specifies the concrete semantics of the side-effect analysis in Chapter 6. In this analysis all the processor instructions Instr, as listed in this section, are precisely tracked. The existence of special instructions for allocating and deallocating local stack space is assumed, i.e. x1 := x1 − c and x1 := x1 + c, which form the first, respectively last, edge in the control flow representation. In this analysis S1 is instantiated with Γ, where a program state from Γ is given as a triple ⟨ρ, cq, λ⟩, speaking about the values of registers, the stack frame size of procedure q and the values of local memory locations.

• (S2, R2) characterises the concrete semantics of our alignment analysis from Chapter 7, where we consider assignments of arbitrary affine terms t to program variables xi. Here, we also omit tracking guards precisely and abstract them by non-deterministic branching. S2 is defined over the complete lattice 2^(Z_{2^w}^((k+1)²)), while R2 is defined on elements from 2^(Z_{2^w}^(k+1)).

• (S3, R3) describes the concrete semantics of our analysis of linear inequality relations from Chapter 8. Besides omitting the precise tracking of memory access instructions, guards are also abstracted by non-deterministic branching. Here, S3 is defined over the complete lattice 2^(Z^((k+1)²)) and R3 is defined on elements from 2^(Z^(k+1)).

For the concrete semantics, the memory M of the program is divided into the disjoint address spaces ML for the stack or local memory, and MG for global memory. The set of global addresses is given by MG = {(G, z) | z ∈ Z}, and the set of stack addresses is given by ML = {(L, z) | z ∈ Z}. Note that the labels L and G allow inherently distinguishing between the local and the global address space.

Summary

In conclusion, we presented a framework for static analysis which jointly approximates the control flow and the call graph in the presence of indirect jumps and calls. We tackle this problem by applying standard value analyses (i.e. strided intervals in our setting) in a straightforward manner. Such an approach is less restrictive than approaches relying on compiler patterns only. Furthermore, we discussed the challenges and possible solutions for code generated via different optimisation levels. Finally, we described the programming model as well as the concrete semantics on which the other analyses presented in this thesis are based.

Chapter 3

Classification of Memory Locations

In this chapter we present a novel interprocedural analysis of constant variable differences. In contrast to the corresponding approach based on full linear algebra, our algorithm saves a factor of k^4 in the worst-case complexity, i.e. it has a worst-case complexity of O(n · k^4) (k the number of program variables, n the program size). We also indicate how the practical runtime can be reduced significantly further. In the context of assembly analysis, we apply this approach to identify local memory locations and thus to interprocedurally observe stack pointer modifications.

Introduction

Static analyses of source code are able to exploit the structure of the program to improve their precision. At the assembler level, however, high-level concepts such as local variables are no longer available and therefore have to be recovered from the code first in order to obtain meaningful results about the behaviour and structure of the underlying assembly. As introduced in Section 1.4, at the assembler level local variables arise as constant stack pointer offsets. Accordingly, an analysis tracking linear register equalities allows us to identify accesses to local memory locations. Moreover, relational information about the values of registers and memory locations supports an analysis of zero-optimised assembly, since there the values of stack locations are temporarily cached in registers.

Furthermore, we are interested in checking whether the underlying assembly conforms to the processor ABI. To this end we have to verify for a subset of processor registers that their value after a procedure call equals their value before the procedure call. In particular, this holds for the stack pointer x1 and the non-volatile registers (x14 up to x31), whose values should be preserved across procedure calls (cf. Section 1.4). Moreover, such an analysis contributes to verifying calling conventions for the stack pointer. It is natural to assume that the stack levels before and after a procedure call are the same. This invariant, however, may be violated for code generated by common compilers such as the Greenhills compiler, which may rely on calls to auxiliary procedures for saving and restoring local registers. For Example 1.4 we can nevertheless verify


the assumption for non-auxiliary functions when using a constant variable difference analysis.

In order to tackle all of these issues for analysing assembly, we present an interprocedural analysis of variable differences. Next, we review prior research in the area of intra- and interprocedural equality analysis.

Related Work

Certain variable equalities can be determined as a particular case of a generalised analysis of availability of expressions called value numbering [3]. Originally, this analysis tracks, for basic blocks, the symbolic expressions representing the values of the variables assigned to. Unlike in our analysis, operator symbols are left uninterpreted in value numbering. The inferred equalities between variables and terms therefore are Herbrand equalities. Later, the idea of inferring Herbrand equalities was generalised to arbitrary control flow graphs by Steffen, Knoop, and Rüthing [128]. Only recently, this problem has attracted fresh attention. Gulwani and Necula [56] show that the original algorithm of Steffen, Knoop and Rüthing can be turned into a polynomial-time algorithm if one is interested in polynomially sized equalities between variables and terms only. Progress in a different direction was made in [87] and [92], where an analysis for Herbrand equalities is presented that deals with negative guards and side-effect free functions. It is still open whether full interprocedural analysis of Herbrand equalities is possible. However, when only assignments of variables and constants are tracked, the abstract domain can be chosen finite and valid equalities can be computed by an exponential-time algorithm. A less naive approach may interpret (or encode) the constants as numbers. The problem then consists in inferring specific affine equalities between variables, e.g. x1 ≐ x2 + 8. Therefore, in principle, the precise interprocedural analyses of [88, 90] are applicable. These analyses use linear algebra for computing vector spaces of affine equalities and have a worst-case runtime of O(n · k^8) where n is the program size and k the number of program variables. The latter problem is known to be interprocedurally decidable in polynomial time (given that each required arithmetic operation counts for O(1)). In [91] an analysis was proposed that determines variable-variable together with variable-constant equalities, i.e. equalities of the form xi ≐ xj and xi ≐ b. This algorithm improves the complexity bound by reducing the exponent to 4 in the worst case. Here we extend the approach from [91] to infer all interprocedurally valid variable differences while maintaining the same complexity bound of O(n · k^4), with n the program size and k the number of program variables. Drawing a comparison to the best known upper bound for interprocedural copy-constant propagation [61], where no equalities between variables are tracked, and to the interprocedural linear constant propagation introduced by Reps et al. [112], the resulting bound is worse by only one factor of k. In [112] the authors consider programs with instructions of the form xi := a · xj + b and xi := c. Similar to copy-constant propagation, their analysis runs in time O(n · k^3), n the number of control flow edges, but only determines constant values of variables.

Contributions

The approach presented here is an extension of the fast interprocedural analysis of variable equalities by Müller-Olm and Seidl [91]. There, the authors only consider special forms of assignments, namely xi := xj and xi := b, in order to infer variable-variable together with variable-constant equalities. Our extended analysis has essentially the same complexity, while providing much stronger program invariants. The extended algorithm infers all valid variable differences, i.e. all valid equalities of the form xi ≐ c or xi ≐ xj + c with c ∈ Z. Furthermore, we present the concept of introducing logical variables to allow for an effective description of procedure effects, and we also introduce a method to further reduce the complexity of the representation of procedure effects.

Overview

Section 3.1 is devoted to the definition of the collecting semantics for our analysis of variable equalities. Additionally, we introduce the complete lattice of consistent equivalence relations that is central to our approach. We discuss basic operations on this lattice and their complexity. We then describe the interprocedural approach in Section 3.2 by specifying the weakest precondition transformers. Furthermore, we present the concept of introducing logical variables to allow for an effective description of procedure effects, and we specify the concept of handling local variables. We introduce a method to further reduce the complexity of the representation of procedure effects. Finally, in Section 3.3 we use this approach to interprocedurally track stack pointer modifications and thus identify potential local variables in assembly.

3.1 Semantics

The collecting semantics (S0, R0) is the basis for the analysis of variable differences as well as for its extension to interprocedural linear two-variable equalities (Chapter 4). For this analysis we find it convenient to accumulate the effect of a procedure backwards, starting from its return point. The effect system is denoted by S0. For the reachability analysis R from the generic concrete semantics of Section 2.5 we present an instantiation in this section, which we denote by R0. Here, we consider simplified programs only:

• We assume that conditional branching is abstracted by non-deterministic branching.

• Temporarily, we model memory access instructions via non-deterministic assignments.

• We consider procedure calls q() and variable assignments s of one of the forms xi := a·xj + b, xi := b, or xi := ? for xi, xj ∈ X and a, b ∈ Z. This means that we only treat those assignments precisely where a variable receives a linear term of the form a·xj + b.

In the following, we assume that the program variables X take values from the integral domain Z. Here, a program state is represented as a vector x = (x1, ..., xk) ∈ Z^k where xi ∈ Z denotes the value of variable xi. Hence, the instantiation of constraint system R works over the complete lattice 2^(Z^k) of sets of program states. The set of start states T0 in constraint system R0 is instantiated with Z^k. Next we define the effect [[s]] of an assignment s on a set X ⊆ Z^k of states:

[[xi := ?]] X         = {x′ | ∃ x ∈ X : ∀ l ≠ i : x′l = xl}
[[xi := b]] X         = {x′ | ∃ x ∈ X : x′i = b ∧ ∀ l ≠ i : x′l = xl}
[[xi := a·xj + b]] X  = {x′ | ∃ x ∈ X : x′i = a·xj + b ∧ ∀ l ≠ i : x′l = xl}
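These effects translate directly into, e.g., Python; since xi := ? ranges over all of Z, the sketch below takes a finite value universe as an explicit (assumed) parameter:

# Sketch of the three concrete assignment effects on a set X of states;
# a state is a tuple over Z^k, indexed 0-based here.
def havoc(X, i, universe):
    """[[xi := ?]]: xi may take any value from the given universe."""
    return {x[:i] + (v,) + x[i+1:] for x in X for v in universe}

def assign_const(X, i, b):
    """[[xi := b]]"""
    return {x[:i] + (b,) + x[i+1:] for x in X}

def assign_linear(X, i, a, j, b):
    """[[xi := a*xj + b]] (xj is read from the old state)"""
    return {x[:i] + (a * x[j] + b,) + x[i+1:] for x in X}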

S0[rq] ⊇ Id S0[u] ⊇ S0[sq] ◦ S0[v](u, q(), v) a call edge S0[u] ⊇ [[s]] ◦ S0[v](u, s, v) an assignment edge

again with rq the return point and sq the start point of procedure q. The effect of procedure q is given by the effect accumulated at the entry point S0[sq] of q. After having presented the collecting semantics for our analysis we move on to the description of our abstraction of variable equalities. First we consider the case of plain variable differences. This means that we restrict ourselves to invariants consisting of . . equalities of the forms xi = b or xi = xj + b for variables xi, xj and constants b ∈ Z. Within our analysis for this case we only consider deterministic assignments of the form xi := b or xi := xj + b with b ∈ Z while all other assignments are abstracted to non-deterministic assignments. In Chapter 4, we generalise our methods also to programs with assignments xi := axj + b and arbitrary linear two-variable equalities.

Lattice of Conjunctions of Equalities . Our goal is to infer for every program point u, valid equalities of the form x = b or . i x = x + b with x , x ∈ X and b ∈ . We denote the set of all equalities of this form i j i j . Z by (X). Equalities of the form x = x are called trivial. A vector x ∈ k satisfies the P . i i . Z equality xi = b iff xi = b. Likewise, x satisfies the equality xi = xj + b iff xi = xj + b. The vector x satisfies a conjunction E of equalities iff x satisfies every equality in E. In this case, we write x |= E. Accordingly, the set X ⊆ Zk satisfies E (written: X |= E) iff x |= E for all x ∈ X. Finally, we call a finite conjunction of equalities E satisfiable iff x |= E for some vector x. A conjunction which is not satisfiable is equivalent to false. Assume E is a satisfiable conjunction of equalities or false. Then, we say, E implies an equality e iff x |= e whenever x |= E. Likewise, E implies a conjunction E0 iff E implies every equality in E0 and we write E =⇒ E0. In particular, E is equivalent to E0 iff E =⇒ E0 and E0 =⇒ E. Note that in particular false =⇒ E0 for every finite conjunction E0 of equalities. Using these conventions, we define the lattice E(X) as the set of equivalence classes of all satisfiable finite conjunctions of equalities from P(X) 3.1. SEMANTICS 53

together with the value false. The ordering v on E(X) is given by implication. The top element > w.r.t. this ordering is the empty conjunction which corresponds to true, while the least element is false which we therefore denote by ⊥. Any finite satisfiable conjunction E of equalities together with all (non-trivial) equalities e implied by E can be represented by a weighted directed graph G(E) on the set X ∪ {x0} of program variables and a particular program variable x0 representing the hard-wired value 0. An edge from x to x with weight b represents the equality . i 0 xi = b. For convenience, G(E) then also has an edge from x0 to xi with weight −b. Furthermore, an edge from x to x for i, j > 0 with weight b represents the equality . i j xi = xj + b. By construction, the graph G(E) is symmetric, i.e. for every edge from u to v with edge weight b there exists an edge from v to u with weight −b. This graph is also transitive, i.e. for every pair of edges (u, v) and (v, w) with edge weights b1 and b2, respectively, there exists an edge from u to w with weight b1 + b2. We conclude that G(E) consists of a disjoint union of complete digraphs. Each maximal connected component Q within G(E) can be identified by a single reference node for which we choose x0 if Q contains x0 and otherwise that variable xi in Q with least index i. For every other variable x in Q, it then suffices to record the . j . equality xj = b if b is the weight of edge (xj, x0) or the equality xj = xi + b if b is the weight of the edge (xj, xi) and xi is the reference node of the maximal connected component of xj. The conjunction of all these equalities is still equivalent to E. It has the following syntactical properties: . (1) If the conjunction has an equality xj = b, then xj does not occur elsewhere in the conjunction. . (2) If the conjunction has an equality xj = xi + b, then j > i and xj does not occur elsewhere in the conjunction. A conjunction with these properties is called normalised. Note that trivial equalities are omitted in normalised conjunctions. Example 3.1. Graphical Representation of a Conjunction of Equalities x 1 3 2 −1 1 x2 x0 x1 −1 −2 2 −2 x4

The figure above illustrates the graphical representation of the normalised conjunction . . . E = (x2 = 2) ∧ (x3 = x1 + 1) ∧ (x4 = x1 + 2). 

A normalised conjunction E of equalities consists of at most k equalities. Technically, such a conjunction can be represented by an array of size k. This array is indexed with . the variables x1,..., xk where the entry for xi is given by t, if xi = t occurs in E or with xi, if xi does not occur on the left-hand side of an equality in E. By abuse of notation, we will denote this array by E as well and write in algorithms E(x ) for the . i right-hand side t of the equality xi = t in E or xi if there is no such equality in E. In 54 CHAPTER 3. CLASSIFICATION OF MEMORY LOCATIONS

order to specify an update of the right-hand side of a variable xi in E, we also write E(xi) ← t where t denotes the new right-hand side. W.r.t. this array representation, two variables xi, xj belong to the same connected component of G(E) iff either both E(xi) and E(xj) yield constants, or E(xi) = xh + bi and E(xj) = xh + bj for the same variable xh. Lemma 1. Assume E is a finite conjunction of r equalities. Then, the following holds: 1. If E is satisfiable, then there exists a normalised conjunction E0 equivalent to E. 2. There is an algorithm running in time O(r + k2) which returns the array represen- tation of a normalised conjunction E0 equivalent to E if it exists—or false if E is unsatisfiable.

Proof. Assume that E = e1 ∧ ... ∧ er for equalities ei. For computing the array representation of the normalised conjunction E, we start with an array E0 for the empty conjunction and then for i = 1, . . . , r determine a representation for Ei−1 ∧ ei. In order to implement the inductive step, assume that we are given an array representation for a normalised conjunction E0 together with an equality e. We distinguish two cases. . 0 0 Case 1. e is of the form xi = b for some b ∈ Z. First assume that E (xi) = b for some constant b0. Then e is implied by E0 iff b = b0. If, on the other hand, b0 6= b, then E0 ∧e is 0 0 unsatisfiable and we return false. Now assume, E (xi) = xh + b . Then the conjunction 0 0 . 0 0 E ∧ e is equivalent to E ∧ (xh = b − b ), and we obtain the normalised form for E ∧ e 0 0 by substituting b − b for all occurrences of xh in the array corresponding to E . . 0 Case 2. e is of the form xi = xj + b. First assume that E (xi) = b1. Then the 0 0 . conjunction E ∧ e is equivalent to E ∧ (xj = b1 − b) and we may proceed as in case 1. 0 0 0 . Likewise if E (xj) = b2, then the conjunction E ∧ e is equivalent to E ∧ (xi = b2 + b) 0 and we again may proceed as in case 1. Now assume that E (xi) = xh1 + b1 and 0 0 E (xj) = xh2 + b2. If h1 = h2 and b1 = b2 + b, then e is implied by E . Otherwise, if 0 h1 = h2 and b1 6= b2 + b, then the conjunction E ∧ e is unsatisfiable, and we return false. Finally, if h1 6= h2, then xi and xj belong to different connected components of the graph G(E0). The new equality will therefore join the maximal connected components of G(E0) corresponding to h and h . If h < h , then x becomes the reference 1 2 . 1 2 h1 node of the new component where xh1 = xh2 + b + (b2 − b1). This means that we 0 0 obtain the new array for E ∧ e by substituting in E , xh1 + b1 − (b + b2) for every

occurrence of xh2. If, on the other hand, h1 > h2, then xh2 becomes the reference variable of the new component, implying that we then obtain the array for E′ ∧ e by substituting xh2 + b + (b2 − b1) for every occurrence of xh1. Overall, we remark that it can be decided in time O(1) whether E′ ∧ e is unsatisfiable. Otherwise, the array representation of the normalised conjunction E′ ∧ e can be computed from the array representation of E′ in time O(1) if e is implied by E′, and in time O(k) if e is not implied. Since, of all r equalities, at most k may not be implied, we obtain the complexity bound O(r + k²).
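To make the array representation concrete, the following Python sketch implements the inductive step of Lemma 1 under one simplifying convention: a conjunction over x1, ..., xk is a dict mapping each index i to a pair (h, b) that encodes xi ≐ xh + b, where h = 0 stands for the hard-wired zero x0 (so (0, b) means xi ≐ b) and (i, 0) means that xi is unconstrained. The helper names and the encoding are ours, not the thesis's.

    def empty(k):
        """Array representation of the empty conjunction (true) over x1..xk."""
        return {i: (i, 0) for i in range(1, k + 1)}

    def conjoin(E, i, h, b):
        """Normalised representation of E /\ (xi = xh + b), or False if
        unsatisfiable; h == 0 encodes the constant equality xi = b."""
        if E is False:
            return False
        E = dict(E)
        hi, bi = E[i]                        # xi = x_hi + bi
        hj, bj = (0, 0) if h == 0 else E[h]  # xh = x_hj + bj
        # The new equality, rewritten over reference nodes:
        #   x_hi + bi  =  x_hj + bj + b
        if hi == hj:                         # same component: implied or clash
            return E if bi == bj + b else False
        # Distinct components are merged; the smaller index (x0 preferred)
        # becomes the reference node of the joint component.
        if hi < hj:
            src, dst, off = hj, hi, bi - bj - b   # x_src = x_dst + off
        else:
            src, dst, off = hi, hj, bj + b - bi
        for v, (hv, bv) in E.items():        # substitute x_dst + off for x_src
            if hv == src:
                E[v] = (dst, bv + off)
        return E

    # Example 3.1:  (x2 = 2) /\ (x3 = x1 + 1) /\ (x4 = x1 + 2)
    E = empty(4)
    for i, h, b in [(2, 0, 2), (3, 1, 1), (4, 1, 2)]:
        E = conjoin(E, i, h, b)
    assert E == {1: (1, 0), 2: (0, 2), 3: (1, 1), 4: (1, 2)}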

Every element E ∈ E(X) can be considered as a description of the set γ(E) of concrete states x ∈ Z^k with x |= E. Likewise, the best description α(X) of a set X ⊆ Z^k of states is the conjunction of all equalities e with x |= e for all x ∈ X. Together, α and γ form a Galois connection between (2^{Z^k}, ⊆), the powerset of the set of states ordered by inclusion, and (E(X), ⇒).

Lattice Operations

The greatest lower bound E1 ⊓ E2 is the conjunction of all equalities of E1 and E2. Applying the algorithm from Lemma 1, we obtain:

Lemma 2. The greatest lower bound E1 ⊓ ... ⊓ En of n normalised conjunctions of equalities can be computed in time O((n + k) · k).

Proof. For computing the greatest lower bound E = E1 ⊓ ... ⊓ En, we start with E ← E1 and then successively compute E ← E ∧ e for all k · (n − 1) equalities e from the remaining conjunctions (applying the algorithm from Lemma 1). All required checks and substitutions for a single equality can be performed in time O(k). Thus, altogether, the computation of the greatest lower bound amounts to time O(n · k) for the array look-ups and time O(m · k) for the joins of connected components, where the number m of possibly occurring join operations is bounded by k. We thus arrive at the overall runtime O((n + k) · k).

We conclude that we can restrict ourselves to computing with normalised conjunctions (or false). We find:

Lemma 3. The length h of every strictly increasing chain false ⊏ E1 ⊏ ... ⊏ Eh of conjunctions of equalities Ei ∈ E(X) is bounded by k + 1.

Proof. Let n(i) denote the number of connected components of the weighted directed graph G(Ei) corresponding to Ei. Since Ei strictly implies Ei+1, G(Ei+1) has more connected components than G(Ei), i.e. n(i) < n(i + 1). Since, moreover, 1 ≤ n(h) ≤ k + 1, we conclude that h ≤ k + 1, and the assertion follows.
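On the dict encoding introduced above, the meet of Lemma 2 is just a fold of the conjoin step over the equalities of the second argument. A minimal sketch, reusing the empty/conjoin helpers from the previous listing:

    def meet(E1, E2):
        """Greatest lower bound of two normalised conjunctions; False is bottom.
        Reuses `conjoin` from the previous sketch."""
        E = E1
        for i, (h, b) in E2.items():
            if (h, b) != (i, 0):          # skip trivial entries xi = xi
                E = conjoin(E, i, h, b)
                if E is False:
                    return False
        return E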

Thus, the lattice E(X) satisfies the ascending chain condition and therefore forms a complete lattice. By definition, the least upper bound of two conjunctions of equalities E1, E2 is given by the conjunction of all equalities implied by both E1 and E2. The next lemma provides an efficient algorithm for computing least upper bounds.

Lemma 4. The least upper bound E1 ⊔ E2 of two normalised conjunctions of equalities E1, E2 can be computed in time O(k · log(k)).

Proof. Let E = E1 ⊔ E2. We first determine the connected components of the graph G(E). To this end, we define the difference δ(xi) of a variable xi w.r.t. the conjunctions E1 and E2 as δ(xi) = b1 − b2, where bi is the constant offset obtained via Ei(xi). Then, we observe:

Two variables xj and xj′ are in the same component of G(E) iff the following two properties hold:

• xj and xj′ are in the same component of G(E1) and of G(E2);

• δ(xj) = δ(xj′).

This claim can be verified directly on the graphical representation. The second condition can also be reformulated as b1 − b2 = c1 − c2, where bi and ci denote the constant offsets obtained via Ei(xj) and Ei(xj′), respectively. Using this observation, we proceed as follows. We first partition the set of variables X into classes of variables which agree both in the reference nodes w.r.t. G(E1) and G(E2). Each such class Q is again partitioned into subclasses of variables xj which additionally agree in their differences δ(xj). Let Π be the resulting partition of the variables X. We define E for one equivalence class Q′ ∈ Π after the other. Assume that Q′ consists of a single member xi only. If E1(xi) = E2(xi) = b for some constant b, we set E(xi) = b. Otherwise, we set E(xi) = xi. Now assume that the equivalence class Q′ consists of more than one variable. Then we distinguish two cases. Again, E1 and E2 are given in array representation.

Case 1. For every variable xj ∈ Q′, both E1(xj) and E2(xj) yield constants. If δ(xj) = 0, these constants are equal, and we set E(xj) ← E1(xj) for all xj ∈ Q′. If, on the other hand, δ(xj) ≠ 0, then we choose the variable xh ∈ Q′ with least index h as the new reference node. Let bh denote the constant right-hand side of the equality for xh, and let E1(xj) = bj for xj ∈ Q′. Then we define E(xj) ← xh − bh + bj for all xj ∈ Q′.

Case 2. For some i ∈ {1, 2}, Ei(xj) = xhi + bj for all xj ∈ Q′. Let xh denote the variable in Q′ with least index h, and let again bh denote the constant right-hand side of the equality for xh. Then xh becomes the new reference node of Q′. Since xhi ≐ xh − bh, we then define E(xj) ← xh − bh + bj.

This algorithm can be implemented by stably sorting the variables in X first according to the variables occurring in the right-hand sides of E1 and E2, respectively, and then according to the differences δ(xi). We thus conclude that the overall runtime is at most O(k · log(k)).

Example 3.2. Least Upper Bound Computation
Consider the array representations of the normalised conjunctions E1, E2 as given in the left table. Then stably sorting w.r.t. the reference nodes and δ yields the partitioning as

specified in the right table:

E1 and E2 in normal form:

X  | E1     | E2
x1 | x1     | x1
x2 | x2     | x2
x3 | x1     | x2 − 5
x4 | x2 + 5 | x2 + 5
x5 | x1 + 5 | x2
x6 | x1 + 3 | x2 + 1
x7 | x1 + 2 | x2

... and after partitioning:

X  | E1     | E2     | δ
x1 | x1     | x1     | 0
x2 | x2     | x2     | 0
x4 | x2 + 5 | x2 + 5 | 0
x6 | x1 + 3 | x2 + 1 | 2
x7 | x1 + 2 | x2     | 2
x3 | x1     | x2 − 5 | 5
x5 | x1 + 5 | x2     | 5

In order to compute E = E1 ⊔ E2, we successively consider each subclass. Since for the variables x1, x2, x4 the equalities in E1 and E2 are syntactically equal, we obtain E(x1) ← x1, E(x2) ← x2 and E(x4) ← x2 + 5. For the remaining two subclasses we choose new reference variables, namely x3 and x6, respectively, and thus: E(x3) ← x3, E(x5) ← x3 + 5, E(x6) ← x6 and E(x7) ← x6 − 1. The normalised conjunction for E1 ⊔ E2 therefore is given by E = (x4 ≐ x2 + 5) ∧ (x5 ≐ x3 + 5) ∧ (x7 ≐ x6 − 1). □
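The join can be sketched in the same dict encoding ({i: (h, b)} for xi ≐ xh + b, with h = 0 the hard-wired zero): group the variables by their reference nodes in both arguments together with the difference δ, and let each group form one component of the result. Hash-based grouping replaces the stable sort of the proof; the helper and its names are ours.

    def join(E1, E2, k):
        """Least upper bound of two normalised conjunctions over x1..xk,
        following the partitioning argument of Lemma 4."""
        groups = {}
        for i in range(1, k + 1):
            (h1, b1), (h2, b2) = E1[i], E2[i]
            # same class iff same reference nodes in E1 and E2 and same delta
            groups.setdefault((h1, h2, b1 - b2), []).append(i)
        E = {}
        for (h1, h2, d), members in groups.items():
            if h1 == 0 and h2 == 0 and d == 0:
                for i in members:        # identical constants survive the join
                    E[i] = E1[i]
                continue
            ref = min(members)           # least index becomes the reference
            for i in members:            # relative offsets agree in E1 and E2
                E[i] = (ref, E1[i][1] - E1[ref][1])
        return E

    # Example 3.2 (x3 = x1 is encoded as (1, 0), x4 = x2 + 5 as (2, 5), ...):
    E1 = {1: (1, 0), 2: (2, 0), 3: (1, 0), 4: (2, 5), 5: (1, 5), 6: (1, 3), 7: (1, 2)}
    E2 = {1: (1, 0), 2: (2, 0), 3: (2, -5), 4: (2, 5), 5: (2, 0), 6: (2, 1), 7: (2, 0)}
    assert join(E1, E2, 7) == \
        {1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (2, 5), 5: (3, 5), 6: (6, 0), 7: (6, -1)}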

Summarising, we have constructed a complete lattice E(X) of height k + 1 and provided efficient implementations of the basic lattice operations ⊔, ⊓ on normalised representations of conjunctions of equalities.

Abstract Effect of Program Statements

For our analysis, we define the abstract effect [[s]]♯ for every assignment s by:

[[xi :=?]]♯ E       = ∃♯xi. E
[[xi := xi + c]]♯ E = E[xi − c / xi]
[[xi := c]]♯ E      = (∃♯xi. E) ∧ (xi ≐ c)
[[xi := xj + c]]♯ E = (∃♯xi. E) ∧ (xi ≐ xj + c)    if i ≠ j

Here, ∃♯xi. E denotes abstract existential quantification, i.e. ∃♯xi. E = ⊥ if E = ⊥, and otherwise it is the conjunction of all equalities implied by E which do not contain the variable xi. The array representation of the normalised conjunction for E′ = ∃♯xi. E can be computed as follows. Let X denote the connected component containing xi in the graph G(E). All entries of E′ for variables xj ∉ X equal the corresponding entries in E. If X = {xi}, E′ equals E. Otherwise, we remove the variable xi from X. This means that we set E′(xi) ← xi. If xi is not the reference variable of X, then also E′(xj) = E(xj) for all remaining xj ∈ X, xj ≠ xi. If xi is the reference variable of X, then we determine the variable xh ≠ xi in X with least index. Assume that E(xh) = xi + b. Then we set E′(xj) ← xh − b + bj if E(xj) = xi + bj. Note that ∃♯xi. E preserves ⊥ and commutes with least upper bounds. Furthermore, the given algorithm runs in time O(k).

For an assignment xi := xi + c, we observe that the value of xi before the assignment can be recovered from its value after the assignment. Therefore, the conjunction after the assignment can be obtained from the conjunction E before the assignment by substituting xi − c for xi. If xi occurs on the right-hand side of equalities in E, this substitution is implemented by preserving the equality xi ≐ xi and replacing every other equality xj ≐ xi + bj with xj ≐ xi − c + bj. If xi only occurs on the left-hand side, i.e. in an equality xi ≐ xh + bi or xi ≐ bi, then we replace this equality with xi ≐ xh + c + bi or xi ≐ c + bi, respectively. Again, this operation is distributive and can be executed in time O(k).

For the remaining forms of assignments, we first remove the variable xi from all equalities in E by means of abstract existential quantification, and then add the equality xi ≐ c or xi ≐ xj + c, respectively. Since both abstract existential quantification and conjunction with a single equality can be executed in time O(k), this transformation can be executed in time O(k) as well. Since it is composed of distributive transformations, it is also distributive. Summarising, we have found that for every assignment s the transformation [[s]]♯ is distributive, where [[s]]♯E for a normalised conjunction E can be computed in time O(k). Moreover, we find:

Lemma 5. For a set of program states X ⊆ Z^k and an arbitrary assignment s: α([[s]]X) = [[s]]♯(α(X)).

Proof. The proof of Lemma 5 follows from the construction of α.
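The abstract assignment transformers can be sketched on the same dict encoding as before. Here, `exists` implements ∃♯xi, `shift` the substitution for xi := xi + c, and the two remaining assignment forms are `exists` followed by one `conjoin` step (reused from the first sketch). Names and encoding are ours.

    def exists(E, i):
        """Abstract existential quantification: forget all knowledge about xi."""
        E = dict(E)
        refs = [j for j, (h, _) in E.items() if h == i and j != i]
        if refs:                          # xi is the reference of its component
            new_ref = min(refs)           # promote the least remaining index
            bh = E[new_ref][1]            # x_new_ref = xi + bh
            for j in refs:                # xj = xi + bj = x_new_ref + (bj - bh)
                E[j] = (new_ref, E[j][1] - bh)
        E[i] = (i, 0)
        return E

    def shift(E, i, c):
        """[[xi := xi + c]]#: substitute xi - c for xi; runs in O(k)."""
        E = dict(E)
        if E[i] == (i, 0):                # xi may be a reference node
            for j, (h, b) in E.items():
                if h == i and j != i:
                    E[j] = (i, b - c)     # xj = xi_old + b = xi_new + b - c
        else:
            h, b = E[i]
            E[i] = (h, b + c)             # xi_new = xi_old + c = xh + b + c
        return E

    def assign(E, i, h, c):
        """[[xi := xh + c]]# (or [[xi := c]]# for h == 0): exists + conjoin."""
        return conjoin(exists(E, i), i, h, c)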

3.2 Interprocedural Variable Differences

In order to provide an interprocedural analysis of variable differences, we must provide an effective and, if possible, succinct representation of the effects of procedures. An obvious approach would be to tabulate the abstract effect of a procedure on its inputs. Here, we follow the approach of [91] and rely on weakest precondition transformers. The advantage of weakest precondition transformers is that they are completely distributive, i.e. commute with arbitrary conjunctions. This implies that weakest precondition transformers need only be specified for single equalities, which frees us from having to specify a mapping on whole conjunctions of equalities. (We might instead specify a forward analysis for effect computation relying on strongest postcondition transformers. However, then we would have to track mappings for conjunctions of equalities.) We are interested in equalities of the form xi ≐ c or xi ≐ xj + c for global variables xi, xj and constants c. For computing the representation of procedures, however, we take a broader perspective and consider weakest preconditions for a slightly larger class of equalities. In particular, we introduce one extra logical variable • and thus consider equalities of one of the following forms:

(1) a·• + b ≐ 0

(2) xi ≐ a·• + b

(3) xi ≐ xj + a·• + b

for global variables xi, xj ∈ X and constants a, b ∈ Z. Let us call such equalities parametric. Note that a·• + b ≐ 0 is satisfiable over Z iff a = b = 0 or a divides −b. In the latter case it has the unique solution • = −b/a. Satisfiability of single equalities e of the above forms, as well as of conjunctions E of such equalities, is again denoted by Z |= e and Z |= E, respectively, where now Z ⊆ Z^{k+1} is a set of vectors, each consisting of values for the variables xi together with one value for • as component k + 1. Such a vector z ∈ Z^{k+1} is called an extended state, which we also write as a pair (x, c) for a vector x ∈ Z^k of values for the variables xi together with a value c for •. For a satisfiable conjunction E of parametric equalities without equalities of form (1), we define a normalised form analogously to the normal form of finite conjunctions of ordinary equalities. If, however, there is a satisfiable equality of form (1), then we determine the unique value v for • and remove • from all other equalities. In this case, the normal form is (• ≐ v) ∧ E′ where E′ is the normal form which we have defined for conjunctions without •. Let E•(X) denote the complete lattice of equivalence classes of finite conjunctions of parametric equalities over variables from X. The concrete semantics operates on sets X ⊆ Z^k and does not affect the value of the logical variable. Accordingly, we extend any completely distributive transformation f : 2^{Z^k} → 2^{Z^k} on concrete sets of states to a completely distributive transformation ext f : 2^{Z^{k+1}} → 2^{Z^{k+1}} on sets of extended states by defining:

ext f{(x, c)} = {(x′, c) | x′ ∈ f{x}}

For an assignment s, the WP transformer [[s]]⊤ applied to a single non-trivial equality e is given by:

[[xi :=?]]⊤ e = ∀xi. e = { ⊥  if e contains xi
                         { e  otherwise
[[xi := c]]⊤ e      = e[c/xi]
[[xi := xj + c]]⊤ e = e[xj + c/xi]

for xi, xj ∈ X and c ∈ Z. The weakest precondition of a non-deterministic assignment xi :=? applied to a non-trivial equality e is ⊥ if the variable xi occurs in e, because e cannot hold for all possible values of xi. In order to compute the weakest precondition of an assignment xi := t, we substitute t for every occurrence of the variable xi in e. If xi occurs on the left-hand side of e, this may violate the format we have fixed for equalities. This format, though, can be restored straightforwardly by algebraic simplification. Thus, e.g.,

[[x4 := x1 + 5]]⊤ (x4 ≐ x3 + 2) = (x1 + 5 ≐ x3 + 2) = (x3 ≐ x1 + 3).
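Such substitutions are mechanical. A tiny Python sketch (our own helper, not from the thesis) on equalities stored as a coefficient map plus constant, i.e. Σ a_v·x_v + b ≐ 0:

    def wp_assign(eq, i, j, c):
        """WP of the assignment xi := xj + c on one equality: substitute
        xj + c for xi.  An equality sum(a_v * x_v) + b = 0 is a pair
        (coefficient dict, constant)."""
        coeffs, b = dict(eq[0]), eq[1]
        a = coeffs.pop(i, 0)                  # coefficient of xi
        if a:
            coeffs[j] = coeffs.get(j, 0) + a  # a * (xj + c) replaces a * xi
        return ({v: k for v, k in coeffs.items() if k}, b + a * c)

    # [[x4 := x1 + 5]] applied to (x4 - x3 - 2 = 0) yields x1 - x3 + 3 = 0,
    # i.e. the equality x3 = x1 + 3 from the example above.
    assert wp_assign(({4: 1, 3: -1}, -2), 4, 1, 5) == ({1: 1, 3: -1}, 3)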

Finally, we obtain:

Lemma 6. For a set of extended program states Z ⊆ Z^{k+1}, E ∈ E•(X) and an arbitrary assignment s: ext [[s]](Z) |= E iff Z |= [[s]]⊤E.

By this lemma, the WP transformers provide an exact abstraction of the extended concrete transformers of the collecting semantics, i.e. the extended concrete effect function applied to a set of extended states Z satisfies the conjunction E iff Z satisfies the conjunction returned by the WP transformer for E. In order to describe the abstract effects of whole procedures, we set up the following constraint system S0⊤:

S0⊤[rq] ⊑ Id
S0⊤[u] ⊑ S0⊤[sq] ◦ S0⊤[v]    (u, q(), v) a call edge
S0⊤[u] ⊑ [[s]]⊤ ◦ S0⊤[v]     (u, s, v) an assignment edge

with rq the exit point and sq the entry point of procedure q. Again, Id denotes the identity mapping that maps E to itself for every E ∈ E•(X). Here, S0⊤[u] specifies the weakest precondition transformer for a program point u of a procedure q, i.e. for executions starting from u and reaching the procedure exit of q. All operations in this constraint system are monotonic. Therefore, it has a greatest solution. Since all occurring functions are ⊓-distributive, composition is ⊓-distributive as well. We obtain:

Theorem 2. Assume Z ⊆ Z^{k+1} is a set of extended states and E ∈ E•(X). Then, for every program point u: ext S0[u](Z) |= E iff Z |= S0⊤[u](E).

The proof of Theorem 2 proceeds by induction on the ith approximation of the least fixpoint of S0 and the greatest fixpoint of the constraint system S0⊤. For computing a solution of the constraint system S0⊤, an effective representation of transformers is required. As weakest precondition transformers distribute over conjunctions, it suffices to determine the results of a transformer for single equalities only. However, since Z is infinite, the number of possible equalities is infinite as well, such that we cannot simply tabulate the results for all equalities. In the next section we show how to circumvent this problem.

Effective Representation of WP Transformers

The key observation for obtaining an effective representation of WP transformers is that the WP transformers are completely determined by their values for postconditions of the forms xi ≐ • or xi ≐ xj + • with i > j. We call postconditions of this form generic. Note that generic postconditions are particular parametric postconditions (choose a = 1 and b = 0 in forms (2) and (3) of parametric postconditions). The set P•(X) of all generic postconditions is finite and contains only O(k²) many elements. Any other equality involving globals is obtained from a generic postcondition by substituting for the logical variable • a term a·• + b which consists of constants a, b and • only. Note

that a, b can be 0. Consequently, the right-hand side of such a WP transformer mapping consists of at most k many equalities (according to our normal form of conjunctions of equalities). In order to recover the full WP transformer from its values for generic postconditions, we use an operator ext⊤, which takes the representation of a weakest precondition transformer f⊤ and transforms it into the representation of the effect of a call, where the latter now may as well speak about equalities between globals and •. Thus, the operator ext⊤ takes a function f⊤ : P•(X) → E•(X) and transforms it into a full WP transformer of type E•(X) → E•(X). Since WP transformers distribute over conjunctions, it suffices to specify ext⊤(f⊤) for single equalities only. For a single equality e involving globals and •, this transformer is defined by:

ext⊤(f⊤)(xi ≐ t)      = f⊤(xi ≐ •)[t/•]
ext⊤(f⊤)(xi ≐ xj + t) = f⊤(xi ≐ xj + •)[t/•]

for globals xi, xj ∈ X and a term t = a·• + b for constants a, b. For equalities e only containing •, we define:

ext⊤(f⊤)(e) = { ⊤  if f⊤(x1 ≐ •) = ⊤
              { e  otherwise

Here, we assume that f⊤ corresponds to a definitely non-terminating computation if f⊤(x1 ≐ •) = ⊤. In this case, the precondition of any equality should be ⊤. Otherwise, the precondition of the equality e should be e itself. Finally, for arbitrary conjunctions E = e1 ∧ ... ∧ em, we set

ext⊤(f⊤)(E) = ext⊤(f⊤)(e1) ∧ ... ∧ ext⊤(f⊤)(em)

Let f : 2^{Z^k} → 2^{Z^k} be completely distributive. We call f uniform if f({x}) = ∅ for some vector x ∈ Z^k implies that f({x′}) = ∅ for all x′ ∈ Z^k. Note that all concrete transformers which occur in this context are completely distributive and uniform. We have:

Lemma 7. Let f : 2^{Z^k} → 2^{Z^k} denote a concrete transformer which is completely distributive and uniform. Furthermore, let f⊤ : P•(X) → E•(X) denote a function where, for all Z ⊆ Z^{k+1} and e ∈ P•(X), ext f(Z) |= e iff Z |= f⊤(e). Then also

ext f(Z) |= E iff Z |= ext⊤(f⊤)(E)

for all Z ⊆ Z^{k+1} and E ∈ E•(X).

Proof. Since f and ext⊤(f⊤) are completely distributive, it suffices to consider single equalities e. We perform a case distinction on the different forms of e. First consider an equality e which contains a single global xi, i.e. is of the form xi ≐ t for a term

t = a·• + b. Consider the set Z′ = {(x, ac + b) | (x, c) ∈ Z}. Then

ext f(Z) |= e iff ext f(Z′) |= (xi ≐ •)
              iff Z′ |= f⊤(xi ≐ •)
              iff Z |= f⊤(xi ≐ •)[a·• + b/•]
              iff Z |= ext⊤(f⊤)(xi ≐ a·• + b).

The proof for an equality e of the form xi ≐ xj + t for a second global xj is analogous. Finally, consider an equality e which does not contain any global xi, i.e. which may only contain constants or •. We rely on the following claim:

Claim: Under the assumptions of the lemma for f and f⊤, one of the following statements is true:

• f({x}) = ∅ for some x ∈ Z^k. Then f({x}) = ∅ for all x ∈ Z^k, and f⊤(x1 ≐ •) = ⊤.

• f({x}) ≠ ∅ for all x ∈ Z^k, and f⊤(x1 ≐ •) ≠ ⊤.

Before proving the claim, we first show that the assertion of the lemma for equalities e without globals follows from the claim. First assume case 1 of the claim, i.e. f({x}) = ∅ for all x. Then also ext f({(x, c)}) = ∅ for all x and c. Since ∅ |= e, the left-hand side of the assertion is true for all Z. Now, by the first case of the claim, f⊤(x1 ≐ •) = ⊤. Hence, by definition, also ext⊤(f⊤)(e) = ⊤, and the right-hand side of the assertion also evaluates to true for all Z. Now assume case 2 of the claim, i.e. f({x}) ≠ ∅ for all x. Then ext f(Z) |= e iff Z |= e iff Z |= ext⊤(f⊤)(e), and the assertion follows. It therefore remains to prove the claim. First assume that f({x}) = ∅ for some x. Then, by uniformity of f, also ext f({(x, c)}) = ∅ for all (x, c), i.e. ext f(Z^{k+1}) = ∅. Since then ext f(Z^{k+1}) |= (x1 ≐ •), we conclude by the assumption on f and f⊤ that Z^{k+1} |= f⊤(x1 ≐ •), and therefore f⊤(x1 ≐ •) = ⊤. Now assume that f({x}) ≠ ∅ for all x. For a contradiction, assume that f⊤(x1 ≐ •) = ⊤. For some x ∈ Z^k and x′ ∈ f({x}), consider the sets Z = {(x, c) | c ∈ Z} and Z′ = {(x′, c) | c ∈ Z}. Then Z |= f⊤(x1 ≐ •), and hence, by the assumption on f and f⊤, ext f(Z) |= (x1 ≐ •). Since Z′ ⊆ ext f(Z), also Z′ |= (x1 ≐ •). This means that x′1 = c for all c, which yields a contradiction. We conclude that f⊤(x1 ≐ •) cannot be equal to ⊤, and the second statement of the claim follows. This completes the proof.

Using the new operator ext⊤, we obtain the following modified constraint system for the weakest precondition transformers of procedures—as represented by their values on the generic postconditions only:

S0•[rq] ⊑ Id
S0•[u] ⊑ ext⊤(S0•[sq]) ◦ S0•[v]    (u, q(), v) a call edge
S0•[u] ⊑ [[s]]⊤ ◦ S0•[v]           (u, s, v) an assignment edge

again with rq the exit point and sq the entry point of procedure q. For a distinction, let us call this constraint system S0•. The construction of a representation for the composition of transformers, as required for the constraints in the second and third lines, must take into account that we compute with mappings from P•(X) → E•(X) only. For the constraints in the second line, this means that we must extend the transformer of the called procedure by means of ext⊤ before the composition can be performed. In general, consider a composition h = ext⊤(f⊤) ◦ g⊤ for completely distributive functions f⊤, g⊤. Let e denote a generic postcondition, and assume that e1[t1/•] ∧ ... ∧ er[tr/•] is a normalised conjunction for g⊤(e) where ei ∈ P•(X). Then the value (ext⊤(f⊤) ◦ g⊤)(e) is the normalised conjunction for:

f⊤(e1)[t1/•] ∧ ... ∧ f⊤(er)[tr/•]

i.e. it amounts to normalising a conjunction of O(k²) equalities. According to Lemma 1, this can be done in time O(k²). Since there are at most O(k²) generic postconditions, a representation for the composition h can be computed in time O(k⁴).

Lemma 8.
1. Assume S0⊤[u] (u a program point) is the greatest solution of S0⊤. Then a solution of S0• is obtained by restricting each transformer S0⊤[u] to generic postconditions.
2. Assume S0•[u] (u a program point) is the greatest solution of S0•. Then a solution of S0⊤ is obtained by defining S0⊤[u] = ext⊤(S0•[u]).

For the first statement of the lemma, we rely on Theorem 2 and Lemma 7. Since both restriction and extension of transformers are monotonic operations, we conclude:

Theorem 3. Assume that Z ⊆ Z^{k+1} and e ∈ P•(X). Then, for every program point u: ext S0[u](Z) |= e iff Z |= S0•[u](e).

The proof of this theorem is by fixpoint induction.

Example 3.3. Combining WP Transformers
Assume we are given the following two WP transformers:

g⊤ ≡ (x2 ≐ •) ↦ (x2 ≐ x1)    and    f⊤ ≡ (x2 ≐ •) ↦ (• ≐ 3)

Computing the composition f⊤ ◦ g⊤ of the two transformers, we proceed as follows: first we examine the right-hand side of the WP transformer g⊤ and apply the corresponding mapping of transformer f⊤. For the composition of the two transformers, we then obtain the weakest precondition transformer h with:

h ≡ (x2 ≐ •) ↦ (x1 ≐ 3)

□
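A sketch of this composition scheme in Python, under our own tuple encoding of parametric equalities: ('bullet', a, b) stands for a·• + b ≐ 0, ('const', i, a, b) for xi ≐ a·• + b, and ('diff', i, j, a, b) for xi ≐ xj + a·• + b. A transformer is a dict from generic postcondition keys (('const', i) for xi ≐ •, ('diff', i, j) for xi ≐ xj + •) to lists of precondition equalities. The entry of f⊤ for x2 ≐ x1 + • below is an assumption we derive from x1 being untouched by f⊤'s procedure.

    def subst(eq, t):
        """Substitute the term t = (a', b'), i.e. a'*BULLET + b', for BULLET."""
        a, b = eq[-2], eq[-1]
        return eq[:-2] + (a * t[0], a * t[1] + b)

    def compose(f, g):
        """Values of ext(f) o g on generic postconditions."""
        h = {}
        for post, pre in g.items():
            out = []
            for eq in pre:                    # eq is generic_part[t/BULLET]
                if eq[0] == 'bullet':         # only BULLET: passes through
                    out.append(eq)            # (assuming f terminates)
                    continue
                key, t = eq[:-2], (eq[-2], eq[-1])
                out.extend(subst(p, t) for p in f[key])
            h[post] = out
        return h

    # g: (x2 = BULLET) -> (x2 = x1), encoded with a = 0, b = 0;
    # f: (x2 = BULLET) -> (BULLET = 3), and, since x1 is untouched by f,
    #    (x2 = x1 + BULLET) -> (x1 = -BULLET + 3).
    g = {('const', 2): [('diff', 2, 1, 0, 0)]}
    f = {('const', 2): [('bullet', 1, -3)],
         ('diff', 2, 1): [('const', 1, -1, 3)]}
    assert compose(f, g) == {('const', 2): [('const', 1, 0, 3)]}   # x1 = 3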

Our goal is to determine, for every program point u, the conjunction of all equalities which are valid when reaching u. Note that these equalities may involve global variables only (no • is needed here). For that, we require a transformation ext♯ which takes a weakest precondition transformer f⊤ for the body of a procedure and returns the corresponding forward transformation ext♯(f⊤) of valid equalities. For a weakest precondition transformer f⊤ : P•(X) → E•(X), we define ext♯(f⊤) : E(X) → E(X) as follows. Let E ∈ E(X). Then ext♯(f⊤)(E) is the conjunction of all equalities e′ = e[c/•] for which E =⇒ (ext⊤(f⊤)(e′)) or, equivalently, E =⇒ (f⊤(e)[c/•]). The following lemma states that the operator ext♯ allows us to determine all the equalities that are valid after a procedure call from the conjunction of valid equalities before the call and the weakest precondition transformer of the called procedure.

Lemma 9. Let f : 2^{Z^k} → 2^{Z^k} be completely distributive and uniform, and let ext f : 2^{Z^{k+1}} → 2^{Z^{k+1}} be the corresponding extended transformer. Let f⊤ : P•(X) → E•(X) be a weakest precondition transformer. Assume, as in Lemma 7, that ext f(Z) |= e iff Z |= f⊤(e) for all subsets Z ⊆ Z^{k+1} and elements e ∈ P•(X). Assume X ⊆ Z^k and E is the conjunction of all equalities e over X with X |= e. Then, for every E′ ∈ E(X), f(X) |= E′ iff ext♯(f⊤)(E) ⊑ E′.

Proof. It suffices to consider the case where E′ is a single equality e′ involving global variables xi. Then e′ = e[c/•] for a generic postcondition e and some constant c, and we have

f(X) |= e′ iff ext f(Xc) |= e

where Xc = {(x, c) | x ∈ X}. Furthermore,

ext f(Xc) |= e iff Xc |= f⊤(e)

by Lemma 7. Then we deduce:

Xc |= f⊤(e) iff X |= f⊤(e)[c/•]
            iff E ⊑ f⊤(e)[c/•]
            iff ext♯(f⊤)(E) ⊑ e′

and the statement of the lemma follows.

Using the operator ext♯, we put up the following system of constraints over E(X):

R0♯[smain] ⊒ ⊤
R0♯[sq] ⊒ R0♯[u]                  (u, q(), _) a call edge
R0♯[v] ⊒ ext♯(S0•[sq])(R0♯[u])    (u, q(), v) a call edge
R0♯[v] ⊒ [[s]]♯(R0♯[u])           (u, s, v) an assignment edge

The least solution of this constraint system precisely characterises, for every program point u, the conjunction of all equalities from E(X) which are valid when program execution reaches u. Summarising, we obtain:

Theorem 4. The set of all valid equalities for an interprocedural program can be computed in time O(n · k⁴), where n is the program size and k the number of global variables. Note that throughout this chapter the program size is defined as the sum of the number of nodes and the number of edges in the control flow representation of the program.

Proof. We use semi-naive fixpoint iteration as in [44] in order to compute the least solution of the system of inequations S0•. Informally, semi-naive iteration means that only individual increments of the handled values are propagated instead of whole values. For our computation of summary functions this means that only single equalities instead of whole conjunctions are propagated and thus have to be added to the computed precondition. Note that the overall costs caused by a single constraint in a semi-naive iteration are at most as big as the cost of propagating a single value of maximal size in a standard fixpoint iteration. Thus, the running time for computing summary functions is dominated by the most costly operation in a right-hand side of a single inequation of S0•, which is function composition. As stated before, function composition can be done in time O(k⁴). Thus, the fixpoint of S0• can be computed in time O(n · k⁴). In contrast, we use an ordinary worklist-based least fixpoint computation for the constraint system R0♯. Here, it is not clear how to propagate only single equalities. Since the height of the lattice of conjunctions of equalities E(X) is k + 1 (Lemma 3), each right-hand side of the constraint system R0♯ may be evaluated at most O(k) times. In the system R0♯, the most expensive operation is the application of the summary function of a procedure call. This operation takes time O(k³). Hence, the least solution of the constraint system R0♯ can also be computed in time O(n · k⁴). Consequently, for a whole program of size n, the estimate O(n · k⁴) of the total running time follows.
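For illustration, a generic worklist solver of the kind used for R0♯ might look as follows in Python. This is a minimal sketch of plain worklist iteration, not the semi-naive variant and not the thesis's VoTUM implementation; all names are ours.

    def solve(unknowns, constraints, bottom, join, leq):
        """Least solution of constraints of the shape  x >= rhs(deps).
        `constraints` maps an unknown to a list of (deps, rhs) pairs."""
        val = {u: bottom for u in unknowns}
        influenced = {u: set() for u in unknowns}   # dep -> unknowns using it
        for u, cs in constraints.items():
            for deps, _ in cs:
                for d in deps:
                    influenced[d].add(u)
        work = list(unknowns)
        while work:
            u = work.pop()
            new = val[u]
            for deps, rhs in constraints.get(u, []):
                new = join(new, rhs(*(val[d] for d in deps)))
            if not leq(new, val[u]):                # value grew: repropagate
                val[u] = new
                work.extend(influenced[u])
        return val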

Local Variables

In the following, we extend the interprocedural analysis to procedures with local variables. The concept of local variables is not only provided by high-level programming languages; many modern processor architectures also support local registers. Recall from Section 1.4 that the calling convention of the PPC architecture treats the registers r14 up to r31 as local variables. For simplicity, we assume that every procedure q has l local variables Y = {y1, ..., yl}, while the set of global variables is still X = {x1, ..., xk}. Thus, a state is now described by a vector (x1, ..., xk, y1, ..., yl) ∈ Z^{k+l}, which we identify with the pair (x, y) of vectors x = (x1, ..., xk) ∈ Z^k and y = (y1, ..., yl) ∈ Z^l of values for the global and local variables, respectively. Accordingly, the transformations S0[u] are completely distributive functions from the set T = 2^{Z^{k+l}} → 2^{Z^{k+l}}. In order to avoid confusion between the values of the local variables of caller and callee, the rules for call edges must be modified. For this purpose we introduce two auxiliary transformations. The transformation enter ∈ T captures how a set of states propagates

from the call to the start edge of the called procedure:

enter(X) = {(x, y) | y ∈ Z^l, ∃y′ : (x, y′) ∈ X}

Here, we assume that local variables receive an arbitrary value at the beginning of their scope, but other conventions can be described similarly. The second transformation H : T → T adjusts the transformation computed for a called procedure to the caller:

H(g)(X) = {(x′, y) | ∃x, y′ : (x, y) ∈ X ∧ (x′, y′) ∈ g(enter{(x, y)})}

It ensures that the local variables of the caller are left untouched by the call. The modified rules for call edges in the systems of inequations for S0 and R0 are as follows:

S0[u] ⊇ S0[v] ◦ H(S0[sq])    (u, q(), v) a call edge
R0[sq] ⊇ enter(R0[u])        (u, q(), _) a call edge
R0[v] ⊇ H(S0[sq])(R0[u])     (u, q(), v) a call edge

In addition, Z^k is replaced with enter(Z^{k+l}) = Z^{k+l} in the inequation for R0[smain]. As in Section 3.2 for the case of global variables only, we determine the weakest precondition transformers of procedures. When representing such transformers, we rely on the special logical variable • to avoid treating each constant in postconditions separately. Recall that the local variables of the caller are not visible to the called procedure and thus are also not modified during the execution of the call. This means that the weakest precondition of the call for a postcondition e which involves local variables of the caller and • only equals just e (provided that the call may terminate). Every other postcondition consisting of a single equality e can refer to at most one local variable of the caller. The weakest preconditions for such equalities can be deduced from the weakest preconditions of equalities involving globals and • only. Thus, for the sake of determining weakest preconditions, it suffices to consider weakest preconditions for equalities containing globals together with the auxiliary variable •. Due to our backward accumulation of weakest precondition transformers for procedures, we therefore consider the same set P•(X) of generic postconditions as in the last section, i.e. only postconditions of the two forms xi ≐ • or xi ≐ xj + • for globals xi, xj ∈ X. For these postconditions, we now obtain preconditions from E•(X ∪ Y) which mention constants, local and global variables, as well as •. For representing such preconditions as normalised conjunctions, we adhere to the convention that we prefer local variables over globals as reference variables of connected components whenever possible. Following the approach of [91], we use an operator H⊤ which takes the representation of a weakest precondition transformer f⊤ : P•(X) → E•(X ∪ Y) for the body of a procedure and transforms it into the representation of the effect of a call, where the latter now may provide preconditions also for equalities between locals and between locals and globals. One ingredient of this operator is the weakest precondition transformer corresponding to the transformation enter from the concrete semantics. We define enter⊤ : E•(X ∪ Y) → E•(X) by

enter⊤(E) = ∀y1, ..., yl. E

For a weakest precondition transformer f⊤ : P•(X) → E•(X ∪ Y) for the body of a called procedure, the composition enter⊤ ◦ f⊤ thus is a transformation from P•(X) → E•(X) where the logical variable • represents the subexpression of the postcondition which only depends on data not modified during the call. The second ingredient for obtaining the effect of a call therefore is again a (suitably adapted) operator ext⊤ which now extends a transformer g⊤ : P•(X) → E•(X) to a transformer from E•(X ∪ Y) → E•(X ∪ Y). This operator is defined as follows. For a single equality e involving globals, we define

ext⊤(g⊤)(xi ≐ t)               = g⊤(xi ≐ •)[t/•]
ext⊤(g⊤)(xi ≐ xj + a·• + b)    = g⊤(xi ≐ xj + •)[a·• + b/•]

for globals xi, xj ∈ X and terms t of the form:

t ::= a·• + b | yj + a·• + b

for constants a, b and local variables yj. For an equality e which only contains locals and •, we define:

ext⊤(g⊤)(e) = { ⊤  if g⊤(x1 ≐ •) = ⊤
              { e  otherwise

This definition reflects that equalities between locals of the caller are left unchanged by the called procedure—provided the called procedure terminates. (Recall that a procedure definitely does not terminate iff its weakest precondition transformer transforms x1 ≐ • into ⊤.) Finally, for a conjunction E = e1 ∧ ... ∧ er, we define:

ext⊤(g⊤)(E) = ext⊤(g⊤)(e1) ∧ ... ∧ ext⊤(g⊤)(er)

The operator H⊤, which determines the weakest precondition transformer of type E•(X ∪ Y) → E•(X ∪ Y) for a call from a weakest precondition transformer f⊤ : P•(X) → E•(X ∪ Y) for the body of the procedure, is then defined by:

H⊤(f⊤) = ext⊤(enter⊤ ◦ f⊤)

In order to relate the weakest precondition transformers to the concrete semantics, we extend the notions of uniformity and extended concrete transformers from Section 3.2 to deal with locals as well. Now, a completely distributive transformer f : 2^{Z^{k+l}} → 2^{Z^{k+l}} is called uniform if f{(x, y)} = ∅ for some pair (x, y) implies f{(x′, y′)} = ∅ for all pairs (x′, y′) ∈ Z^{k+l}. The corresponding extended transformer ext(f) : 2^{Z^{k+l+1}} → 2^{Z^{k+l+1}} for a completely distributive uniform transformer is defined by ext(f)({(x, y, c)}) = {(x′, y′, c) | (x′, y′) ∈ f({(x, y)})}. Analogously to Lemma 7, we obtain:

Lemma 10. Let f : 2^{Z^{k+l}} → 2^{Z^{k+l}} denote a concrete transformer which is completely distributive and uniform. Furthermore, let f⊤ : P•(X) → E•(X ∪ Y) denote a weakest precondition

transformer where, for all Z ⊆ Z^{k+l+1} and e ∈ P•(X), ext(f)(Z) |= e iff Z |= f⊤(e). Then also

ext(H(f))(Z) |= E iff Z |= H⊤(f⊤)(E)

for all Z ⊆ Z^{k+l+1} and E ∈ E•(X ∪ Y).

Proof. Since H(f) and H⊤(f⊤) are completely distributive, it suffices to consider single equalities e. We perform a case distinction on the different forms of e. First consider an equality e which contains a global xi, e.g. of the form xi ≐ t for a term t = yj + a·• + b. Consider the set Z′ = {(x, y′, yj + ac + b) | ∃y. (x, y, c) ∈ Z}. Then:

ext H(f)(Z) |= e iff ext f(Z′) |= (xi ≐ •)
                 iff Z′ |= f⊤(xi ≐ •)
                 iff Z″ |= ∀y1, ..., yl. f⊤(xi ≐ •)
                        for Z″ = {(x, y, yj + ac + b) | (x, y, c) ∈ Z}
                 iff Z |= (H⊤(f⊤)(xi ≐ •))[yj + a·• + b/•]
                 iff Z |= H⊤(f⊤)(xi ≐ yj + a·• + b)

The proof for an equality e of the form xi ≐ xj + t for a second global xj is analogous. Finally, consider an equality e which does not contain any globals, i.e. which may only contain locals, constants or •. We rely on the following claim:

Claim: Under the assumptions of the lemma for f and f⊤, one of the following statements is true:

• f({(x, y)}) = ∅ for some (x, y) ∈ Z^{k+l}. Then f({(x, y)}) = ∅ for all (x, y) ∈ Z^{k+l}, and f⊤(x1 ≐ •) = ⊤.

• f({(x, y)}) ≠ ∅ for all (x, y) ∈ Z^{k+l}, and f⊤(x1 ≐ •) ≠ ⊤.

Before proving the claim, let us first show that the assertion of the lemma follows from the claim for equalities e without globals. First assume case 1 of the claim, i.e. f({(x, y)}) = ∅ for all (x, y) ∈ Z^{k+l}. Then also H(f)({(x, y)}) and ext H(f)({(x, y, c)}) are empty for all (x, y) and c. Since ∅ |= e, the left-hand side of the assertion is true for all Z. Now, by the first case of the claim, f⊤(x1 ≐ •) = ⊤. Hence, by definition, H⊤(f⊤)(e) = ⊤, and the right-hand side of the assertion also evaluates to true for all Z. Now assume case 2 of the claim, i.e. f({(x, y)}) ≠ ∅ for all (x, y). Then, by definition of H(f),

{(y, c) | ∃x : (x, y, c) ∈ ext H(f)(Z)} = {(y, c) | ∃x : (x, y, c) ∈ Z}

Therefore, ext H(f)(Z) |= e iff Z |= e iff Z |= H⊤(f⊤)(e), and the assertion follows. It therefore remains to prove the claim. First assume that f({(x, y)}) = ∅ for some (x, y). Then, by uniformity of f, also f({(x, y)}) = ∅ for all (x, y). Hence

also ext f(Z^{k+l+1}) = ∅. Since then ext f(Z^{k+l+1}) |= (x1 ≐ •), we conclude by the assumption on f and f⊤ that Z^{k+l+1} |= f⊤(x1 ≐ •), and therefore f⊤(x1 ≐ •) = ⊤. Now assume that f({(x, y)}) ≠ ∅ for all (x, y). For a contradiction, assume that f⊤(x1 ≐ •) = ⊤. For some (x, y) ∈ Z^{k+l} and (x′, y′) ∈ f({(x, y)}), consider the sets Z = {(x, y, c) | c ∈ Z} and Z′ = {(x′, y′, c) | c ∈ Z}. Then Z |= f⊤(x1 ≐ •), and hence, by the assumption on f and f⊤, ext f(Z) |= (x1 ≐ •). Since Z′ ⊆ ext f(Z), also Z′ |= (x1 ≐ •). This means that x′1 = c for all c, which is a contradiction. We conclude that f⊤(x1 ≐ •) cannot be equal to ⊤, and the second statement of the claim follows. This completes the proof.

Consider the modified constraint system S0⊤ for the weakest precondition transformers of procedures which uses the operator H⊤ to deal with procedure calls.

S0⊤[rq] ⊑ Id
S0⊤[u] ⊑ H⊤(S0⊤[sq]) ◦ S0⊤[v]    (u, q(), v) a call edge
S0⊤[u] ⊑ [[s]]⊤ ◦ S0⊤[v]         (u, s, v) an assignment edge

The construction of a representation for the composition of transformers, as required for the constraints in the second and third lines, is analogous to the construction we used for global variables only. We only have to take into account that the weakest precondition computed by the rightmost transformer may be a normalised conjunction over both global and local variables (and •) and thus contains O(k + l) many equalities. Accordingly, the composition H⊤(f⊤) ◦ g⊤ for completely distributive transformers f⊤, g⊤ takes time O(k² · (k + l)²), since the weakest precondition of a generic postcondition is represented by a normalised conjunction, possibly containing globals, locals, and •, and consequently contains O(k + l) many equalities. Extending Theorem 3 to programs involving local variables, we obtain:

Theorem 5. Assume that Z ⊆ Z^{k+l+1} and e ∈ P•(X). Then, for every program point u: ext S0[u](Z) |= e iff Z |= S0⊤[u](e).

Proof. The proof is by fixpoint induction. For i ≥ 0, let S0_i[u] and S0⊤_i[u] denote the ith approximations to the least and greatest solutions of the constraint systems S0 and S0⊤, respectively.

Claim. For every i ≥ 0 and every program point u, the following holds:

1. S0_i[u] is uniform;

2. ext S0_i[u](Z) |= e iff Z |= S0⊤_i[u](e).

We only prove the second assertion of this claim. First assume i = 0. Then ext S0_i[u](Z) = ∅. Therefore, for every e, ext S0_i[u](Z) |= e. Since also S0⊤_i[u](e) = true, the assertion holds. Now assume i > 0. Then, in the concrete semantics, ext S0_i[u](Z) is the union of the sets ext S0_{i−1}[v](ext [[s]]_{i−1} Z) for edges (u, s, v), where [[s]]_{i−1} is the transformer corresponding to the label s according to the values of the unknowns from the (i − 1)th

iteration. Likewise, for every e, S0⊤_i[u](e) is the conjunction of the preconditions [[s]]⊤_{i−1}(S0⊤_{i−1}[v](e)) for every edge (u, s, v), where [[s]]⊤_{i−1} is the weakest precondition transformer corresponding to the label s according to the values of the unknowns from the (i − 1)th iteration. Therefore, it suffices to prove for every edge (u, s, v) that

ext S0_{i−1}[v](ext [[s]]_{i−1} Z) |= e iff Z |= [[s]]⊤_{i−1}(S0⊤_{i−1}[v](e)).

Consider the case of a procedure call s ≡ q(). Then

S0_{i−1}[v]([[s]]_{i−1} Z) = S0_{i−1}[v](H(S0_{i−1}[sq]) Z)
and [[s]]⊤_{i−1}(S0⊤_{i−1}[v](e)) = H⊤(S0⊤_{i−1}[sq])(S0⊤_{i−1}[v](e)).

We have:

ext S0_{i−1}[v](ext H(S0_{i−1}[sq]) Z) |= e iff ext H(S0_{i−1}[sq]) Z |= S0⊤_{i−1}[v](e)

by the induction hypothesis. Furthermore,

ext H(S0_{i−1}[sq]) Z |= S0⊤_{i−1}[v](e) iff Z |= H⊤(S0⊤_{i−1}[sq])(S0⊤_{i−1}[v](e))

by the induction hypothesis for the transformers for sq and Lemma 10. Note that for the application of this lemma we rely on the complete distributivity of all transformers S0_{i−1}[u] and S0⊤_{i−1}[u]. This completes the proof of the claim for the case of a procedure call. We omit the remaining cases of assignments and non-deterministic assignments. Using our claim, we argue that

ext S0[u](Z) |= e iff ∀i ≥ 0. ext S0_i[u](Z) |= e
                  iff ∀i ≥ 0. Z |= S0⊤_i[u](e)
                  iff Z |= S0⊤[u](e)

for all sets Z ⊆ Z^{k+l+1} and equalities e ∈ P•(X).

In order to determine, for every program point u, the conjunction of all equalities (now referring to local as well as global variables) which are valid when reaching u, we modify the constraint system R0♯. For that, we require a transformation enter♯ which determines the conjunction of equalities at procedure entry from the conjunction of equalities before the procedure call. Since all locals are uninitialised at procedure entry, this transformation removes all equalities involving locals. Accordingly, the transformation enter♯ is defined by:

enter♯(E) = ∃♯y1. ... ∃♯yl. E
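On the dict encoding of the earlier sketches, enter♯ is just repeated abstract existential quantification over the local indices. A minimal sketch, reusing the `exists` helper from the abstract-effects listing:

    def enter_sharp(E, local_indices):
        """enter#: forget all equalities involving local variables."""
        if E is False:
            return False
        for y in local_indices:
            E = exists(E, y)
        return E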

For a weakest precondition transformer f⊤ : P•(X) → E•(X ∪ Y) for the body of a procedure, we define H♯(f⊤) : E(X ∪ Y) → E(X ∪ Y) as follows. Let E ∈ E(X ∪ Y). First assume that f⊤(x1 ≐ •) = ⊤. Then H♯(f⊤)(E) = ⊥. Otherwise, H♯(f⊤)(E) is the conjunction of all equalities which only involve locals and are implied by E, together with all equalities e[t/•] for generic postconditions e and terms t of the form c or yj + c

(c ∈ Z, yj ∈ Y) for which E =⇒ (H⊤(f⊤)(e))[t/•]. In particular, for E = ⊥, we have H♯(f⊤)(E) = ⊥ as well.

Example 3.4. Embedding WP Transformers
Assume we are given the following WP transformers

(x3 ≐ •) ↦ (• ≐ 5)    and    (x4 ≐ •) ↦ (• ≐ 5)

which modify the values of the variables x3 and x4, while the variables x1 and x2 stay unchanged. Now we apply these WP transformers to the following conjunction of equalities:

E ≡ (x2 ≐ x1 + 1) ∧ (x3 ≐ x1) ∧ (x4 ≐ 1)

Then we check every WP transformer for appropriate substitutions to be applied to the conjunction of equalities E. For the WP transformer (x4 ≐ •) ↦ (• ≐ 5), the logical variable • has to be substituted by the constant 5; similarly for the WP transformer for variable x3. Since the conjunction of equalities E has to imply the substituted precondition of the WP transformer, we obtain the following conjunction of equalities E′ after embedding the effect:

E′ ≡ (x1 ≐ 5) ∧ (x3 ≐ x2 − 1) ∧ (x4 ≐ 5)

□

The following lemma states that the operator H♯ allows us to determine all valid equalities after a call from the conjunction of valid equalities before the call and the weakest precondition transformer for the procedure body.

Lemma 11. Assume Z ⊆ Z^{k+l} and E = α(Z) is the conjunction of all equalities e over X ∪ Y with Z |= e. Let f : 2^{Z^{k+l}} → 2^{Z^{k+l}} be completely distributive and uniform, and let f⊤ be a weakest precondition transformer. Assume, as in Lemma 10, that ext(f)(Z′) |= e iff Z′ |= f⊤(e) for all subsets Z′ ⊆ Z^{k+l+1} and all elements e ∈ P•(X). Then, for every E′ ∈ E(X ∪ Y), H(f)(Z) |= E′ iff H♯(f⊤)(E) ⊑ E′.

Proof. The proof is the same as for Lemma 9. It suffices to consider the case where E′ is a single equality e′ also involving global variables xi. Then e′ = e″[t/•] for a generic postcondition e″ and some term t which is either c or yj + c for some constant c and some local variable yj. Then,

H(f)(Z) |= e′ iff ext H(f)(Zt) |= e″

where Zt = {(x, y, t(y)) | (x, y) ∈ Z}. Furthermore,

ext H(f)(Zt) |= e″ iff Zt |= H⊤(f⊤)(e″)

by Lemma 10. Then we deduce:

Zt |= H⊤(f⊤)(e″) iff Z |= (H⊤(f⊤)(e″))[t/•]
                 iff E ⊑ (H⊤(f⊤)(e″))[t/•]
                 iff H♯(f⊤)(E) ⊑ e′

and the statement of the lemma follows.

We put up the following constraint system over E(X ∪ Y), using the operators enter♯ and H♯:

R0♯[smain] ⊒ enter♯(⊤)
R0♯[sq] ⊒ enter♯(R0♯[u])          (u, q(), _) a call edge
R0♯[v] ⊒ H♯(S0⊤[sq])(R0♯[u])      (u, q(), v) a call edge
R0♯[v] ⊒ [[s]]♯(R0♯[u])           (u, s, v) an assignment edge

The least solution of this constraint system precisely characterises, for every program point u, the conjunction of all equalities from E(X ∪ Y) which are valid when program execution reaches u. Summarising, we obtain:

Theorem 6. The set of all valid equalities for an interprocedural program of size n with k global variables and l local variables can be computed in time O(n · k² · (k + l)²).

Proof. The complexity of computing the fixpoint of the modified constraint system S0⊤ is in O(n · k² · (k + l)²). In order to infer all variable differences, taking local variables into account, we consider the lattice E(X ∪ Y), whose height is k + l. The most expensive operation in this context is function application. Since we have to consider at most k² many parametric postconditions, function application can be performed in time O(k² · (k + l)). Thus, for inferring all variable differences we arrive at an overall complexity of O(n · k² · (k + l)²).

Sparse Effect Representation

Typically, a procedure accesses only few global variables. Therefore, our goal is to reduce the size of the representation of the abstract effects of procedures by tracking variable differences only for those variables which are actually modified. For this purpose, we define a function B which determines, for a weakest precondition transformer f, the set of variables whose values definitely stay unmodified when applying the transformer:

B(f) = {xi ∈ X | f(xi ≐ •) = (xi ≐ •)}

Assume xj ∈ B(f). Then f(xi ≐ xj + •) can be recovered from f(xi ≐ •) by replacing • with xj + • in the right-hand sides of equalities. Therefore, we may proceed as follows. First, we tabulate the values of f for generic postconditions of the form xi ≐ •. From these values we can determine the set B(f). Then it suffices to tabulate, in addition, only the preconditions for xi ≐ xj + • for

xi, xj ∉ B(f). We observe that, in the process of weakest precondition computation, the transformer f yields non-trivial equalities only for relations between variables from X \ B(f). This means that we can reduce the size of the table for the effect of a procedure essentially to those variables which are actually touched by this procedure. Given such a reduced table for f, we can also evaluate the application f(E) to a normalised conjunction of equalities more quickly. In order to also reduce the number of generic postconditions of the form xi ≐ •, we can determine a sufficiently large subset B′(f) of B(f) by a straightforward syntactic analysis in advance. Then it suffices to determine preconditions for xi ≐ • only for the variables xi ∉ B′(f).
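A sketch of how such a reduced table could be obtained, using the tuple encoding of the composition sketch above. The convention that an absent entry means the identity precondition is our own:

    def unmodified(f, k):
        """B(f): the variables xi with f(xi = BULLET) = (xi = BULLET).
        An absent entry is read as the identity precondition."""
        identity = lambda i: [('const', i, 1, 0)]
        return {i for i in range(1, k + 1)
                if f.get(('const', i), identity(i)) == identity(i)}

    def sparse(f, k):
        """Keep only the entries that mention a variable outside B(f)."""
        B = unmodified(f, k)
        return {key: pre for key, pre in f.items()
                if any(v not in B for v in key[1:])}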

3.3 Application to Assembly Analysis

In this section we present applications of our variable difference analysis in the area of assembly analysis. Variable differences contribute to classifying memory locations, i.e. identifying potential local and global variables, and to tracking stack pointer modifications. First we contrast our implementation of variable differences with the approach based on full linear algebra.

Implementation

We have implemented the interprocedural analysis of variable differences in our assembly analyser VoTUM [136] (cf. Chapter 5), where the 32 hardware registers of the PowerPC are modelled by means of global variables. For comparison, we have also implemented the full interprocedural analysis of linear equalities from [90], based on vector spaces of matrices. Our benchmark suite is the same as presented in Section 2.4. Since our main contribution consists in an interprocedural equality analysis, we compare the time and memory consumption of the effect computation of the two approaches. In order to reduce memory consumption, the implementation of the linear algebra approach from [90] relies on sparse matrices. Similarly, the implementation of the variable differences represents the precondition of a generic equality not by a matrix but by a (possibly sparse) map. Table 3.1 summarises our results.

Table 3.1: Benchmark Results

Program  | Size  | Instr  | Procs | T(LA)   | M(LA)  | T(VD)    | M(VD)  | T(oVD)  | M(oVD)
openSSL  | 2.9MB | 613882 | 6232  | —       | —      | 1239sec. | 3865MB | 563sec. | 2678MB
thttpd   | 0.8MB | 189034 | 1197  | 297sec. | 3348MB | 323sec.  | 1512MB | 229sec. | 1365MB
basename | 0.6MB | 139271 | 907   | 356sec. | 922MB  | 223sec.  | 894MB  | 123sec. | 622MB
chgrp    | 0.7MB | 164420 | 1082  | 324sec. | 1032MB | 265sec.  | 1027MB | 147sec. | 645MB
chmod    | 0.6MB | 148702 | 990   | 408sec. | 913MB  | 247sec.  | 1091MB | 169sec. | 651MB
ls       | 0.8MB | 177022 | 1203  | 339sec. | 1060MB | 298sec.  | 1321MB | 277sec. | 953MB
vdir     | 0.8MB | 177022 | 1202  | 411sec. | 1048MB | 299sec.  | 1335MB | 216sec. | 925MB
gzip     | 0.7MB | 162380 | 1026  | 363sec. | 2825MB | 284sec.  | 1472MB | 187sec. | 955MB

Size: the binary file size in MB; Instr: the number of assembler instructions; Procs: the number of procedures; T: the absolute running time in seconds; M: the peak memory consumption in MB. 74 CHAPTER 3. CLASSIFICATION OF MEMORY LOCATIONS

For each benchmark from our suite, we compared the interprocedural analysis of linear equalities (LA) with the interprocedural analysis of variable differences (VD) as well as with the interprocedural analysis of variable differences where the optimisation from Section 3.2 is taken into account (oVD). For each of these analyses we track the following parameters:

T: the absolute running time in seconds;

M: the peak memory consumption in MB.

Evaluating our experimental results, we observe:

• The linear algebra analysis consumes considerably more memory than the (optimised) variable difference analysis. Moreover, our biggest example program, openSSL, could not be analysed with the linear algebra approach (LA), as its memory consumption exceeded the heap limit of 16GB. For (almost) all example programs the variable difference analysis terminated faster. Compared to the variable difference analysis (VD), the running time of the optimised variable difference analysis (oVD) improved by between 17% and 48%, whereas the memory consumption decreased by around 20%. Summarising, the optimised variable difference analysis (oVD) outperforms the linear algebra approach (LA) by a factor of 2 while using only 75% of the memory.

• The standard code generation scheme for PowerPC code (which consists in three-address code) does not heavily rely on the addition of two registers. In practice, the variable difference analysis yields precise results for identifying local variables on the stack—approximately 85% of all stack accesses could be identified—as well as for checking the stack pointer invariant: for approximately 95% of all procedures the stack pointer invariant could be verified, and for the remaining 5% the analysis indicates where the stack pointer invariant may be violated.

Tracking Stack Pointer Modifications

In order to track stack pointer modifications interprocedurally, a linear equality relation analysis is required. We want to verify that, according to the convention of the processor ABI, on function exit each function correctly deallocates the stack frame it established on function entry.

Example 3.5. Convention for Saving and Restoring the Registers x3 and x4
As an example application for our analysis, consider the code generated for a procedure q where two registers are saved and restored by means of auxiliary functions as described in Example 1.4. In accordance with the naming conventions of this chapter, the registers corresponding to the global variables of q are denoted by x1, x2, x3, x4.

[Figure: the control flow graphs of q and of the auxiliary procedures, given here as edge lists:]

q:       1 —[x2 := x1 − 8]→ 2 —[save_3()]→ 3 —[···]→ 4 —[x2 := x1 + 8]→ 5 —[rest_3()]→ 6
save_3:  7 —[M[x2 + 8] := x3]→ 8 —[save_4()]→ 9
save_4: 10 —[M[x2 + 4] := x4]→ 11 —[x1 := x2]→ 12
rest_3: 13 —[x3 := M[x2]]→ 14 —[rest_4()]→ 15
rest_4: 16 —[x4 := M[x2 − 4]]→ 17 —[x1 := x2]→ 18

The resulting control flow graph is given above. For the analysis, memory writes are ignored and reads from memory into a variable are abstracted by corresponding non-deterministic assignments. Again, the global variable x1 represents the stack pointer while the global x2 serves as an auxiliary register. We assume that the instructions of the function body of q (indicated via dots at program point 3) do not modify the stack pointer x1. In the prologue of procedure q, 8 bytes of memory are reserved for the callee-save registers, i.e. x3 and x4, which are saved via a call to save_3. Additionally, the stack pointer x1 is modified at program point 11 in order to account for the stack growth for saving x3 and x4. Then, after executing the body of q, its callee-save registers are restored (in the call to rest_3) and additionally the memory for saving these registers is freed (program point 17). Our goal is to verify the invariant that any execution of procedure q leaves the stack pointer unchanged, i.e. that after any call to q the stack pointer has the same value as before the call. Note that this property is violated by the procedures save_3, save_4, rest_3 and rest_4. First we compute the weakest precondition of the generic postcondition x2 ≐ x1 + • for the procedures save_3 and rest_3 (program points 7 and 13), respectively. Note that

the memory write instructions in the procedures save_4 and save_3 do not affect register equalities and thus can be ignored by the analysis. Thus, for procedure save_3 only the assignment to x1 (in procedure save_4, at program point 12) is relevant. Therefore, we obtain S0⊤[7](x2 ≐ x1 + •) = (• ≐ 0) as the weakest precondition of the above generic postcondition at program point 7. For procedure rest_3, i.e. at program point 13, the same precondition is computed, namely S0⊤[13](x2 ≐ x1 + •) = (• ≐ 0). We may infer from these preconditions that the equality x1 ≐ x2 is valid upon termination of both procedures save_3 and rest_3. Since it is our aim to verify that the stack pointer before and after executing the procedure body of q (i.e. at program points 1 and 6) is the same, we now consider, for every program point i of procedure q, on the one hand the conjunction of all valid equalities and on the other hand the weakest precondition of the generic equality x1 ≐ •. The WP transformers and the valid conjunctions for every program point i are given in the following table:

i | S0⊤[i](x1 ≐ •) | R0♯[i]
1 | x1 ≐ •         | ⊤
2 | x2 ≐ • − 8     | x2 ≐ x1 − 8
3 | x1 ≐ • − 8     | x2 ≐ x1
4 | x1 ≐ • − 8     | x2 ≐ x1
5 | x2 ≐ •         | x2 ≐ x1 + 8
6 | x1 ≐ •         | x2 ≐ x1

The valid conjunctions of equalities for the program points of q (cf. R0♯[i]) provide us with information about the relationship between the variables x1 and x2, but do not allow us to deduce a statement about the relation of the stack pointer value at procedure start to its value at procedure exit. One way to obtain such a statement would be to instrument procedure q with a new local variable x1′ that stores the initial value of x1. Then our analysis would yield the equality x1 ≐ x1′ at program point 6, which witnesses the preservation of the stack pointer rather directly. We note, however, that the above weakest precondition computation for procedure q (cf. S0⊤[i](x1 ≐ •)) already allows us to infer the stack pointer invariant indirectly: from the fact that the WP transformer for x1 ≐ • yields the same equality x1 ≐ • for the start point of procedure q, we conclude that the value of x1 stays the same in any execution of the procedure, whatever the initial value of x1 may be. Note that it is essential that our analysis interprets assignments of the form xi := xj + b in scenarios like this one. Otherwise it would be impossible to establish the stack invariant because of the statements that shift the value of the stack pointer. □
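In other words, the invariant check is one table lookup on the computed transformer: under the tuple encoding of the earlier sketches, the entry of S0⊤ at the entry point of q for the postcondition x1 ≐ • must again be exactly x1 ≐ •. A sketch with a hypothetical result table:

    def stack_invariant_holds(S_top, entry_point):
        """True iff the WP of x1 = BULLET at the procedure entry is again
        x1 = BULLET, i.e. the procedure provably preserves x1."""
        return S_top[entry_point].get(('const', 1)) == [('const', 1, 1, 0)]

    # For procedure q of Example 3.5 one would expect, at entry point 1:
    S_top = {1: {('const', 1): [('const', 1, 1, 0)]}}   # hypothetical table
    assert stack_invariant_holds(S_top, 1)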

Classification of Memory Locations

In order to analyse zero-optimised assembly, a stack analysis is required. For this purpose reconsider Example 2.5 and its control flow representation which is provided below.

Example 3.6. Assembly for Optimisation Level O0
In zero-optimised assembly the values of stack locations are temporarily cached in registers. In order to precisely track the values of both registers and memory locations, an analysis which infers equality relations between registers and memory locations is mandatory. Therefore, we first classify each memory access as referring to a local or global memory location. Recall from Section 2.5 that the memory M of the program is divided into the stack or local memory ML and the global memory MG. To have a fixed reference point for the local memory locations of each procedure q, we instrument each procedure q with a new local variable x1′ which stores the initial value of the stack pointer x1 at procedure entry. This variable acts as a frame pointer and thus allows dealing with dynamic stack modifications. At the start node of every procedure, the assignment x1′ := x1 is additionally set up. Hence, we arrive at the following modified constraints for procedure starts:

R0♯[smain] ⊒ enter♯([[x′1 := x1]]♯ ⊤)
R0♯[sq] ⊒ enter♯([[x′1 := x1]]♯ R0♯[u])    for (u, q(), _) a call edge

In particular, given an edge (u, xi := M[xj], v), the memory access M[xj] possibly denotes an access to a local variable if there exists some constant c ∈ Z such that the relation xj ≐ x′1 + c holds w.r.t. the conjunction of equalities R0♯[u] valid at program point u. If possibly a global memory location with address c is accessed by M[xj], we have to verify that the conjunction of equality relations R0♯[u] valid at program point u implies the relation xj ≐ c for some constant c ∈ N.

In order to infer sets of potential local and global variables, we extend the program class for our equality relation analysis by a precise handling of memory access instructions, i.e. we take the edges (u, xi := M[t], v) and (u, M[t] := xi, v), respectively, into account.

The control flow representation of procedure f consists of the program points 0 to 12 and the following edges:

f:  (0, x1 := x1 − 32, 1)
    (1, M[x1 + 24] := x3, 2)
    (2, x0 := M[x1 + 24], 3)
    (3, x0 < 0, 4)
    (3, x0 ≥ 0, 6)
    (4, x0 := M[x1 + 24], 5)
    (5, x0 := −x0, 8)
    (6, x0 := M[x1 + 24], 7)
    (7, M[x1 + 8] := x0, 9)
    (8, M[x1 + 8] := x0, 9)
    (9, x0 := M[x1 + 8], 10)
    (10, x3 := x0, 11)
    (11, x1 := x1 + 32, 12)

The set of possibly accessed global memory locations S♯G ⊆ N is then given by:

S♯G = {c ∈ N | ∃(u, xi := M[t], _) ∈ E : R0♯[u](t) = c}
    ∪ {c ∈ N | ∃(u, M[t] := xi, _) ∈ E : R0♯[u](t) = c}

while the set of potential q-local memory locations S♯L,q ⊆ Z for a procedure q can be computed by:

S♯L,q = {c ∈ Z | ∃(u, xi := M[t], _) ∈ Eq : R0♯[u](t) = x′1 + c}
      ∪ {c ∈ Z | ∃(u, M[t] := xi, _) ∈ Eq : R0♯[u](t) = x′1 + c}

Let e be the result of the evaluation of the address expression t w.r.t. R0♯[u]. Because of our normalised representation of equalities in R0♯[u], e is also a normalised

representation of the address expression t and thus can only be of one of the following forms:

e ::= c | xk + c | x′1 + c    with k ≠ 1.

Consequently, if a local variable is addressed, the address expression can be reduced to a constant stack pointer offset, while the address of a global can be reduced to a single constant. □
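To make the set computations above concrete, the following Scala sketch collects the two sets from the already evaluated address expressions of all memory-access edges. All names are illustrative assumptions and do not reflect the actual VoTUM implementation:

object AccessedLocations {
  // normalised address expressions as produced by the equality analysis
  sealed trait NAddr
  case class Const(c: Int) extends NAddr       // evaluates to a constant c
  case class FramePlus(c: Int) extends NAddr   // evaluates to x'1 + c
  case object Unresolved extends NAddr         // of the form xk + c with k != 1

  // global locations: all accesses whose address is a plain constant
  def globals(accesses: Seq[NAddr]): Set[Int] =
    accesses.collect { case Const(c) => c }.toSet

  // q-local locations: all accesses in q whose address is a frame offset
  def localsOfQ(accessesInQ: Seq[NAddr]): Set[Int] =
    accessesInQ.collect { case FramePlus(c) => c }.toSet
}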

Equality Relations between Registers and Memory Locations
In order to provide information about the coherence of memory locations and registers, equality relations that comprise not only registers but also local and global variables are required. Therefore we rely on an analysis over the constraint system R0♯ over global variables X and local variables Y. To provide a unique naming, henceforth the classified locals with address x′1 + c are denoted by yc, while globals with address c are denoted by x−c. The effect of a memory read access instruction on a given conjunction of equalities E is given by:

[[xi := M[t]]]♯ E = case E(t) of
      c        then [[xi := x−c]]♯ E
    | x′1 + c  then [[xi := yc]]♯ E
    | _        then [[xi := ?]]♯ E

For a memory write access instruction in procedure q the abstract effect function is defined by:

[[M[t] := xi]]♯ E = case E(t) of
      c        then [[x−c := xi]]♯ E
    | x′1 + c  then [[yc := xi]]♯ E
    | _        then destroy_memoryq(E)

In case of an access to a known memory location, we perform an assignment between the register and the variable corresponding to the memory location. An unknown read access causes a non-deterministic assignment to the destination register. In case of an unknown write access, we must invalidate all information about the local and global memory locations to provide safe analysis results. Here, we assume that each procedure only modifies the values of local memory locations belonging to its own stack frame. In Chapter 6 we present an analysis that infers whether a procedure is side-effect free, i.e. does not write to memory locations which do not belong to its actual stack frame. Consequently, here we rely on the assumption that procedures are side-effect free and thus invalidate only the values of the locals belonging to the current procedure q:

destroy_memoryq(E) = let E′ = [[yc := ? | yc ∈ S♯L,q]]♯ E
                     in [[x−c := ? | x−c ∈ S♯G]]♯ E′
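A compact functional rendering of these abstract effects may look as follows. This is a simplified sketch under assumed names and a deliberately crude placeholder representation of equality conjunctions, not the actual implementation:

object MemEffects {
  type Var = String
  type Conj = Map[Var, Var]  // placeholder for a normalised conjunction E

  // abstract assignment and non-deterministic assignment of the domain
  def assign(lhs: Var, rhs: Var, E: Conj): Conj = E + (lhs -> rhs)
  def havoc(v: Var, E: Conj): Conj = E - v

  sealed trait NAddr
  case class Const(c: Int) extends NAddr      // global with address c
  case class FramePlus(c: Int) extends NAddr  // local with frame offset c
  case object Unresolved extends NAddr

  def load(xi: Var, t: NAddr, E: Conj): Conj = t match {
    case Const(c)     => assign(xi, s"x-$c", E)
    case FramePlus(c) => assign(xi, s"y$c", E)
    case Unresolved   => havoc(xi, E)           // xi := ?
  }

  def store(t: NAddr, xi: Var, localsQ: Set[Var], globals: Set[Var], E: Conj): Conj =
    t match {
      case Const(c)     => assign(s"x-$c", xi, E)
      case FramePlus(c) => assign(s"y$c", xi, E)
      case Unresolved   => // destroy_memory: forget all locals of q and all globals
        (localsQ ++ globals).foldLeft(E)((acc, v) => havoc(v, acc))
    }
}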

Then, our analysis provides for every program point i of the control flow graph in Example 3.6 the following conjunctions of equalities:

i  | R0♯[i]
0  | x1 ≐ x′1
1  | x1 ≐ x′1 − 32
2  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8
3  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ y−8
4  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ y−8
5  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ y−8
6  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ y−8
7  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ y−8
8  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ −y−8
9  | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8
10 | x1 ≐ x′1 − 32 ∧ x3 ≐ y−8 ∧ x0 ≐ y−24
11 | x1 ≐ x′1 − 32 ∧ x3 ≐ y−24
12 | x1 ≐ x′1 ∧ x3 ≐ y−24

At program point 2 we identified a memory write access to a local variable with stack pointer offset −8 and thus obtain the equality relation x3 ≐ y−8. Finally, the conjunction of equality relations valid at program point 12 states that the stack frame is deallocated correctly (x1 ≐ x′1) and that the value of the return register x3 equals the value of the local memory location y−24 with address x′1 − 24.

Summary
We presented an interprocedural analysis of constant values of variables and constant differences of variables. In order to represent procedure effects, we used parametric weakest precondition transformers. The presented techniques can be seen as non-trivial generalisations of the analysis of variable equalities proposed in [91]. While inferring much stronger invariants, the algorithms still have the same worst-case time complexity O(n·k⁴), where n is the program size and k the number of variables. We also indicated how the running time of the analysis can be improved by taking into account that many procedures may access only very few variables. For the analysis of low-level code, variable differences contribute to verifying whether the program adheres to the calling conventions for the stack pointer, or to identifying more advanced concepts such as local variables. Chapter 4

Reasoning about Array Index Expressions

In this chapter we extend the variable difference analysis from the last chapter in order to deal with general linear two-variable equalities. This extended analysis has the same worst-case complexity. In the context of assembly analysis we will apply it to infer array index expressions.

Overview
Section 4.1 presents the extension of our framework to deal with linear two-variable equalities. Thereby, we will indicate how the analysis from the last chapter can be generalised further to an algorithm which infers all interprocedurally valid linear two-variable equalities, i.e. all valid equalities of the form xi ≐ b or xi ≐ a·xj + b for constants a, b ∈ Q. In Section 4.2 we apply this extended approach to assembly in order to infer array index expressions. Finally, in Section 4.3 we suggest additional applications of our approach, such as inferring relationships between iteration and pointer variables.

4.1 Linear Two-Variable Equalities

Here, we extend our variable difference analysis from Chapter 3 in order to deal with general linear two-variable equalities, i.e. equalities of the form xi ≐ b or xi ≐ a·xj + b where a, b ∈ Q. Let P(X) denote the set of all equalities of this form. The more general form of invariants allows us to extend the analysis to handle precisely all linear assignments where right-hand sides contain at most one variable, i.e. which are of the form xi := b or xi := a·xj + b. For simplicity, we consider global variables only, but the methods presented here can also be extended to local and global variables along the lines of Section 3.2. Any satisfiable conjunction E of equalities from P(X) can be brought into a normal form with the following properties:

• If E has an equality xi ≐ b, then xi does not occur in any other equality from E.


• If E has an equality xi ≐ a·xj + b for a ≠ 0, then i > j and E has no other equality containing xi.

In particular, this means that E consists of at most k equalities. Note that this normal form corresponds to the reduced row-echelon form (RREF) known from linear algebra. As in the case of variable equalities alone, time O(k² + r) suffices for an arbitrary conjunction of r equalities to prove that it is unsatisfiable or, in case it is satisfiable, to compute its equivalent normal form. An algorithm for computing the least upper bound of two conjunctions E1, E2 in normal form can be obtained as a generalisation of the corresponding algorithm for variable differences.
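One way to picture such a normal form is as a finite map from variable indices to right-hand sides, as in the following Scala sketch. The representation is an assumption for illustration only; in particular, exact rational arithmetic is elided in favour of Double:

object TwoVarEq {
  sealed trait Rhs
  case class IsConst(b: Double) extends Rhs                  // xi =. b
  case class Lin(a: Double, j: Int, b: Double) extends Rhs   // xi =. a*xj + b with i > j

  // at most one entry per variable index; an absent index is unconstrained
  type NormalForm = Map[Int, Rhs]
}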

Efficient Algorithm for Least Upper Bound Computation
Here, we present an algorithm for computing the least upper bound of satisfiable normalised conjunctions E1, E2 of two-variable equalities which runs in time O(k·log(k)). In the first step, we extend the conjunctions E1, E2 by trivial equalities xi ≐ xi such that in both E1 and E2 every variable occurs exactly once as a left-hand side. As in the case of variable differences alone, we successively partition the set of variables.
Let X0 denote the set of variables xi where the right-hand sides in E1 and E2 coincide. Then the least upper bound of E1 and E2 will have the conjunction E′0 = ∧_{xi ∈ X0, ti ≠ xi} (xi ≐ ti) if ti is the right-hand side for xi in E1.
Let X1 denote the set of variables xi where the right-hand sides in E1 and E2 are just constants c_i^(1), c_i^(2) but do not coincide, i.e. c_i^(1) ≠ c_i^(2). Assume that h is the least variable index with this property. This means that for every such variable xi ∈ X1 with i ≠ h, the system of equations:

c_i^(1) = a·c_h^(1) + b        c_i^(2) = a·c_h^(2) + b

has a unique solution a = ai, b = bi. Then, the least upper bound of E1 and E2 will have the conjunction E′1 = ∧_{xh ≠ xi ∈ X1} (xi ≐ ai·xh + bi).
Let X2 denote the set of variables xi where the right-hand sides in E1 are constants ci but the right-hand sides in E2 contain occurrences of variables, i.e. the equalities in E2 are of the form xi ≐ ai·x_{hi} + bi. Then we define a partition Π2 on the set X2 where variables xi, xj are in the same equivalence class iff the following two conditions are satisfied:

• the variables on the right-hand side in E2 coincide, i.e. hi = hj;

• (ci − bi)/ai = (cj − bj)/aj.

For one such class Q ∈ Π2, let xh denote the variable with least index. Then, we obtain for every other variable xi ≠ xh in Q the equality xi ≐ (ai/ah)·xh + (bi − ai·bh/ah). Let E′_{2,Q} denote the conjunction ∧_{xh ≠ xi ∈ Q} (xi ≐ (ai/ah)·xh + (bi − ai·bh/ah)). Then the least upper bound of E1 and E2 has the conjunction E′2 = ∧_{Q ∈ Π2} E′_{2,Q}.

Let E′3 denote the conjunction analogous to E′2 but with the roles of E1 and E2 exchanged.
Now let X4 denote the set of variables xi occurring as left-hand sides both in E1 and E2 where the corresponding right-hand sides both contain variables, i.e. E1 and E2 have the equalities xi ≐ ai·x_{hi} + bi and xi ≐ a′i·x_{h′i} + b′i, respectively. On the set X4 we define a partition Π4 where variables xi, xj are put into the same equivalence class iff the following three conditions are satisfied:

1. the variables occurring in the right-hand sides w.r.t. E1 and E2 coincide, i.e. hi = hj and h′i = h′j;

2. a′i/ai = a′j/aj;

3. (bi − b′i)/ai = (bj − b′j)/aj.

For one such class Q ∈ Π4, let again xh denote the variable with least index. As for the equivalence classes of the set X2, we obtain for every other variable xi ≠ xh in Q the equality xi ≐ (ai/ah)·xh + (bi − ai·bh/ah). Let E′_{4,Q} denote the conjunction ∧_{xh ≠ xi ∈ Q} (xi ≐ (ai/ah)·xh + (bi − ai·bh/ah)). Then the least upper bound of E1 and E2 has the conjunction E′4 = ∧_{Q ∈ Π4} E′_{4,Q}.
This completes the enumeration of equalities implied both by E1 and E2. The remaining variables are not in the same equivalence class w.r.t. E1 and E2 and thus would only account for trivial equalities xi ≐ xi in E1 ⊔ E2. Accordingly, we define the least upper bound E1 ⊔ E2 as the conjunction

E1 ⊔ E2 = E′0 ∧ E′1 ∧ E′2 ∧ E′3 ∧ E′4

Note that the sets X0, ..., X4 are disjoint. The required equivalence classes can again be computed in time O(k·log(k)). Since the remaining computation can be done in time O(k), we conclude that the least upper bound of two reduced conjunctions can be computed in time O(k·log(k)). By definition, the least upper bound of two conjunctions of equalities E1 ⊔ E2 is given by the conjunction of all equalities implied both by E1 and E2. Therefore the proposed algorithm indeed computes the least upper bound of two conjunctions of equalities. For this purpose consider the following example:

Example 4.1. Least Upper Bound Computation
Assume given the conjunctions E1 and E2 in normal form enhanced with trivial equalities:

E1 = (x1 ≐ x1) ∧ (x2 ≐ x2) ∧ (x3 ≐ x1) ∧ (x4 ≐ 3x2 + 5) ∧ (x5 ≐ 3x1 + 15)
     ∧ (x6 ≐ x1 + 3) ∧ (x7 ≐ x1 + 2) ∧ (x8 ≐ 7x1 + 15) ∧ (x9 ≐ 0)
     ∧ (x10 ≐ 2) ∧ (x11 ≐ 1) ∧ (x12 ≐ 3)

E2 = (x1 ≐ x1) ∧ (x2 ≐ x2) ∧ (x3 ≐ x2 − 5) ∧ (x4 ≐ 3x2 + 5) ∧ (x5 ≐ 3x2)
     ∧ (x6 ≐ x2 + 1) ∧ (x7 ≐ x2) ∧ (x8 ≐ 21x2 − 20) ∧ (x9 ≐ 1)
     ∧ (x10 ≐ 4) ∧ (x11 ≐ 2x1 − 3) ∧ (x12 ≐ 4x1 − 5)

In order to compute E1 ⊔ E2, we successively consider the sets X0, ..., X4 of variables as constructed by our algorithm. The first class X0 consists of the variables x1, x2, x4, since for these the right-hand sides are syntactically equal w.r.t. E1 and E2. By eliminating the trivial equalities, we obtain for E′0 the equality x4 ≐ 3x2 + 5.
The second set X1, speaking about variable–constant equalities only, is given by the set {x9, x10}. According to our algorithm, the conjunction E′1 therefore only consists of the equality x10 ≐ 2x9 + 2.
Considering the set X2 = {x11, x12}, we obtain E′2 = (x12 ≐ 2x11 + 1).
The last set X4 consists of the variables x3, x5, x6, x7, x8. According to our algorithm, this set is further partitioned into the equivalence classes Q1 = {x3, x5}, Q2 = {x6, x7}, and the singleton class Q3 = {x8}. For the first two of these classes, we choose the new reference variables x3 and x6, respectively, and thus obtain: E′4 = (x5 ≐ 3x3 + 15) ∧ (x7 ≐ x6 − 1). Since the right-hand sides for variable x8 in E1 and E2 do not coincide with those of any other variable, x8 forms a class of its own. Thus, this class does not contribute any non-trivial equality.
Summarising, the least upper bound E1 ⊔ E2 is given by the normalised conjunction:

(x4 ≐ 3x2 + 5) ∧ (x5 ≐ 3x3 + 15) ∧ (x7 ≐ x6 − 1) ∧ (x10 ≐ 2x9 + 2) ∧ (x12 ≐ 2x11 + 1). □
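The X1 case boils down to fitting the unique line through two value pairs. The following Scala sketch (illustrative only; a real implementation would use exact rational arithmetic instead of Double) shows this step:

object LubX1 {
  // values of the reference variable xh and of xi in E1 and E2, with ch1 != ch2;
  // returns (a, b) such that xi =. a*xh + b is implied by both conjunctions
  def relate(ch1: Double, ch2: Double, ci1: Double, ci2: Double): (Double, Double) = {
    val a = (ci2 - ci1) / (ch2 - ch1)   // unique solution of the 2x2 system
    val b = ci1 - a * ch1
    (a, b)
  }
}

For Example 4.1 with xh = x9, the call relate(0, 1, 2, 4) yields (2, 2), i.e. exactly the equality x10 ≐ 2x9 + 2 of E′1.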

Representation of Procedure Effects
We are interested in program invariants of the form xi ≐ b or xi ≐ a·xj + b for rational numbers a, b ∈ Q. Again, we represent the effect of a procedure by means of its weakest precondition for postconditions of this form. In order to deal with all postconditions simultaneously which only differ in the constants a, b, we introduce parameters a1, a2, b and consider generic postconditions of the forms

a1·xi ≐ b    or    a1·xi ≐ a2·xj + b

Note that there are at most O(k²) postconditions of these forms. Recall that we precisely deal only with assignments xj := t where the right-hand sides t contain at most one variable, i.e. are of the form t1·xj + t0 for t0, t1 ∈ Q. Then, the precondition which we may obtain for a procedure for a given generic postcondition is a conjunction of equalities each of which is of one of the following types:

(1) a1·x_{i′} ≐ c·a2·x_{j′} + t
(2) a2·x_{i′} ≐ c·a2·x_{j′} + t
(3) b ≐ c1·a1 + c2·a2
(4) a1 ≐ c·a2
(5) a2 ≐ 0

for variables x_{i′}, x_{j′} ∈ X and expressions t of the form c0·b + c1·a1 + c2·a2 where c, c0, c1, c2 ∈ Q, and where for type 2, i′ > j′ whenever c ≠ 0. Note that every conjunction of such equalities is satisfiable by choosing the value 0 for b, a1 and a2.

Any conjunction can be brought into this normal form w.r.t. the monomials a1·x_{i′}, a2·x_{j′}, b, a1 and a2 (in this order), which again uses only equalities of types 1 through 4. Due to the RREF, an equality of type 3 implies that no other equality has occurrences of b. Moreover, we make the following two additional assumptions on the normal form:

• If an equality of type 4 is present, then also all occurrences of a1 in monomials a1·x_{i′} are removed.

• If an equality of type 5 is present, then all monomials a2·x_{j′} are removed.

For every conjunction E in this normal form, we have:

• If a1·xi ≐ t occurs in E for some term t, then no other equality in E contains a1·xi.

• If a2·xi ≐ c·a2·xj + t occurs in E for some term t, then no other equality in E contains a2·xi. Moreover, if c ≠ 0, then i > j.

These properties imply that E has at most k equalities of type 1 and at most k equalities of type 2. Since there is at most one equality of each of the types 3, 4, and 5, E has at most 2k + 3 equalities. Overall, we conclude that every satisfiable conjunction can be uniquely represented by a conjunction of at most O(k) equalities each of which is of size O(1). Moreover, the standard algorithm for reduction to reduced row-echelon form can be modified to compute for a satisfiable conjunction of r equalities a reduced conjunction of equalities in time O(k² + r).

Example 4.2. Normal Form for Conjunctions of Linear Two-Variable Equalities
Consider the following conjunction E of parametric equalities in normal form:

(a1·x5 ≐ 2a2·x3 + 3b + 7a1 + 2a2) ∧
(a2·x4 ≐ 3a2·x3 + 2b − 3a1 − a2) ∧
(a2·x2 ≐ 2b + 5a1 + 3a2)

together with the equality e ≡ (a1·x5 ≐ 4a2·x2 + 7b + 9a1 + 4a2). Subtracting from e the first equality of E, we obtain:

0 ≐ −2a2·x3 + 4a2·x2 + 4b + 2a1 + 2a2    or    a2·x3 ≐ 2a2·x2 + 2b + a1 + a2.

Using the last equality from E, the occurrence of a2·x2 in the right-hand side can be removed, which gives us:

a2·x3 ≐ 6b + 11a1 + 7a2.

The resulting equality cannot be reduced further. Instead it can be used to remove occurrences of a2·x3 in the right-hand sides of the equalities in E. As the normal form for e ∧ E we therefore obtain:

(a1·x5 ≐ 15b + 29a1 + 16a2) ∧
(a2·x4 ≐ 20b + 30a1 + 20a2) ∧
(a2·x3 ≐ 6b + 11a1 + 7a2) ∧
(a2·x2 ≐ 2b + 5a1 + 3a2).



The required space for the representation of summary functions is O(k³): for each of the O(k²) generic postconditions, we provide a parametric precondition of size O(k). As we show next, the time for computing the composition operation in our representation of weakest precondition transformers is O(k⁴). Consider the composition of two weakest precondition transformers f and g. For that, consider a single generic equality e. In order to compute the weakest precondition f(g(e)), we first apply g to e, giving us a conjunction E = g(e) of O(k) equalities. Applying f to each of the equalities in E results in O(k) conjunctions, each consisting of O(k) equalities. The conjunction of these conjunctions thus has O(k²) equalities, which can be normalised in time O(k² + k²) = O(k²). By repeating this for all O(k²) generic postconditions we thus succeed in computing a representation for the composition of f and g in time O(k⁴).
It remains to compute the strongest postcondition of a procedure call given the WP transformer f⊤ for the procedure. Observe first that, for any generic postcondition e, f⊤e is the empty conjunction (representing true) if and only if the procedure in question has no terminating execution. In this case the strongest postcondition is always false, irrespective of the equations valid at the call, because the procedure never returns. Also a non-satisfiable conjunction (i.e. a conjunction equivalent to false) at the call site gives rise to the postcondition false. So we assume that f⊤e is non-empty for all generic postconditions e and that at the call to the procedure a satisfiable conjunction E is valid. Then we determine the strongest postcondition of E as the conjunction of all equalities whose weakest precondition w.r.t. f⊤ is implied by E.

First, we determine the equalities of the form xi ≐ b valid upon return from the procedure given precondition E, i.e. the variables that are constant. Let E′ = f⊤e denote the weakest precondition for e ≡ (a1·xi ≐ b) as provided by the transformer f⊤. Note that E′ does not contain the parameter a2. As we are specifically interested in postconditions of the form xi ≐ b for some b ∈ Q, we fix the parameter a1 to 1 and observe that a postcondition of this form can be valid for at most one value b (given that the procedure may terminate), as otherwise two contradictory equations would hold simultaneously. We determine this value (if any) as follows. First we check if there is an equality of the form a1·xj ≐ t in E′ such that the conjunction E does not contain an equality xj ≐ c. In this case E′ is not implied by E for a1 = 1 and any value of b and, consequently, xi cannot be constant after the call. Otherwise, for every equality e′ of the form a1·xj ≐ t occurring in E′, E contains an equality xj ≐ c. This gives rise to the equation 0 ≐ t[1/a1] − c. Let E1 denote the conjunction of all these equations. Furthermore, let E0 denote the (possibly empty) conjunction of equalities in E′ not containing program variables x_{j′}. Then the equality xi ≐ b holds after the call iff b satisfies the conjunction E0[1/a1] ∧ E1. This is a system of linear equations with at least one equation over the single variable b. Thus, there is either a single solution b ∈ Q or no solution at all. In the first case, the equation xi ≐ b is valid after the call; in the second case xi is not constant after the call.
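The final step, deciding whether the system admits a unique value for b, is a one-line check once each equality has been reduced to a linear equation in b. A Scala sketch under an assumed encoding (exact rationals again elided):

object ConstAfterCall {
  // each entry (c0, c1, c) encodes an equation c0*b + c1 = c with c0 != 0,
  // obtained from an equality of E' after substituting a1 = 1 and xj = c
  def solve(eqs: Seq[(Double, Double, Double)]): Option[Double] = {
    val bs = eqs.map { case (c0, c1, c) => (c - c1) / c0 }.distinct
    if (bs.size == 1) Some(bs.head) else None  // unique solution b, or none
  }
}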

Having determined in this way all equalities of the form xi ≐ b, it remains to determine the valid two-variable equalities of the form xi ≐ a·xj + b. As there can be no equalities between a constant and a non-constant variable, and the two-variable equalities valid between the constant variables are already implied by the equations expressing constancy, it suffices to consider two-variable equalities for non-constant variables xi and xj. So assume that e is a generic postcondition of the form a1·xi ≐ a2·xj + b where neither xi nor xj could be proven to be constant after the call. Let again E′ denote the conjunction E′ = f⊤e. As in the case of constants, we observe that there can be at most one pair of values (ā2, b̄) ∈ Q² such that xi ≐ ā2·xj + b̄ is valid, as otherwise xj would be constant after the call. In order to determine this pair (if any) we look for (ā2, b̄) ∈ Q² such that E′[1/a1][ā2/a2][b̄/b] is implied by E. For this we first check whether for every equality e′ in E′ the following conditions hold:

• If e′ contains exactly one program variable xj, then E contains an equality xj ≐ c for some constant c;

• If e′ contains the two program variables xj, x_{j′}, then there is a program variable xk such that E contains both an equality xj ≐ c1·xk + c2 and an equality x_{j′} ≐ c′1·xk + c′2.

If one of these conditions is violated, E′[1/a1][ā2/a2][b̄/b] is not implied by E for any pair (ā2, b̄) ∈ Q², and no non-trivial two-variable equality between xi and xj is valid after the call. Otherwise, we again put up a system of linear equations. For each equality satisfying the first condition, we extract the equation e′[1/a1][c/xj]. For each equality satisfying the second condition, we first consider the equation e′′ ≡ e′[1/a1][c1·xk + c2/xj][c′1·xk + c′2/x_{j′}] obtained from e′ by substituting the occurrences of xj and x_{j′} with the right-hand sides of the corresponding equalities in E. The equation e′′ now can be written as 0 ≐ t0 + t1·xk where the terms t0, t1 are affine combinations of the parameters a2, b. From that, we extract the two equations 0 ≐ t0 and 0 ≐ t1. Let E1 denote the conjunction of all these equations. Let E0 again denote the conjunction of all equalities in E′ which do not contain program variables. Then the set of all pairs (ā2, b̄) for which E ⟹ E′[1/a1][ā2/a2][b̄/b] is given by the set of solutions of the conjunction E0[1/a1] ∧ E1. The set of solutions to this system of equations is either empty or consists of a single vector (ā2, b̄). In the latter case, we add the equality e[1/a1][ā2/a2][b̄/b] to the postcondition; in the former case no non-trivial equality is valid between xi and xj. Overall, the strongest postcondition can be computed in time O(k³).
Summarising, we obtain the following generalisation of Theorem 4.

Theorem 7. The set of all equalities of the form xi ≐ a·xj + b with a, b ∈ Q which are valid for an interprocedural program of size n with k global variables can be computed in time O(n·k⁴).

4.2 Application to Assembly Analysis

Such an algorithm, which infers all interprocedurally valid linear two-variable equalities of the form xi ≐ b or xi ≐ a·xj + b, contributes to analysing array index expressions in assembly. This is illustrated by the following example. In the C program given below, the single elements of a global array a are accessed in a for-loop. Additionally we give the PPC assembler instructions for the array access a[i] in the for-loop together with their control flow representation. In particular, at the assembler level an access to an array element is given by the following computation:

elem_address = index · elem_size + base_address

with base_address the address of the first array element, elem_size the size in bytes of an individual array element, and index the array index variable.

int a[5] = {1,2,3,4,5};
int main(){
  int i,j;
  for(i=0; i<5; i++)
    j += a[i];
}

The PPC assembly for the loop body j += a[i] reads:

//j += a[i];
10: lwz r0,12(r1)
14: lis r9,10
18: addi r9,r9,6548
1C: mulli r0,r0,4
20: add r9,r0,r9
24: lwz r9,0(r9)
28: lwz r0,8(r1)
2C: add r0,r0,r9
30: stw r0,8(r1)
34: lwz r9,12(r1)
38: addi r0,r9,1
3C: stw r0,12(r1)

Its control flow representation consists of the edges:

(5, x0 := M[x1 + 12], 6)
(6, x9 := 661908, 7)
(7, x0 := 4 · x0, 8)
(8, x9 := x9 + x0, 9)
(9, x9 := M[x9], 10)

The C variable i corresponds to the local memory location x1 + 12 in the corresponding assembly (cf. instruction 0x10), which we accordingly denote by y12 in our formalisation. We are interested in ranging the memory access M[x9] at the edge from program point 9 to program point 10 in the control flow representation. By our analysis of linear two-variable equalities we obtain the address expression x9 ≐ 4·y12 + 661908 for the memory access M[x9] at program point 9 in the control flow graph. In order to precisely range this memory access, we require some value information, for instance strided intervals. Then, the value information for the memory location y12 allows for a precise bound of the memory access M[x9].
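For instance, assuming the value analysis yields y12 ∈ [0, 4] with stride 1 for the loop counter, the equality x9 ≐ 4·y12 + 661908 bounds the access M[x9] to the addresses 661908 + 4·{0, ..., 4} = {661908, 661912, 661916, 661920, 661924}, i.e. exactly the five 4-byte elements of the global array a.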

4.3 Register Coalescing and Locking

Concluding, we present two further applications for an analysis of definite variable equalities and an analysis of linear two-variable equalities.

Register Coalescing
Müller-Olm and Seidl [91] present a fast interprocedural analysis for variable equalities, i.e. they consider special forms of assignments, namely xi := xj and xi := b, in order to infer variable–variable together with variable–constant equalities. Information on definite equalities between variables can be used, e.g., during register allocation for coalescing of registers [52]. To see this, consider the following example:

The three control flow graphs below show the original program (left), the program after coalescing x2 and x3 into a variable y (middle), and the program after removal of the trivial assignment y := y (right); the nodes are the program points 1 to 4:

1                1              1
x2 := x1 + 1     y := x1 + 1    y := x1 + 1
2                2              2
x3 := x2         y := y         (removed)
3                3              3
M[x2] := x3      M[y] := y      M[y] := y
4                4              4

Figure 4.1: A Program with Variable-Variable Assignments.

In the program from Figure 4.1, the variables x2 and x3 are both live at program point 3. Since x2 ≐ x3 definitely holds at this program point, we can coalesce x2, x3 into a single variable y. By doing so, the assignment x3 := x2 becomes y := y and thus can be removed.

Locking
Another application of the two-variable equalities is reasoning about programs with locking mechanisms. Assume that we are given two arrays x_arr and y_arr whose data are accessed via a call to procedure access by using the global pointers x_ptr and y_ptr. Here we are interested in invariants for the memory addresses the variables x_ptr and y_ptr point to in every step of the for-loop. Assume that x_arr represents an array of locks and y_arr an array of data, where the lock x_arr[i] should protect accesses to the data at y_arr[i]. Our analysis of linear two-variable equalities should verify that every access to the lock array protects exactly the corresponding data in the data array. Therefore consider the following C program and its corresponding control flow representation.

dataX_t x_arr[100];
dataY_t y_arr[100];
dataX_t *x_ptr;
dataY_t *y_ptr;

int main(){
  int i;
  x_ptr = &x_arr[0];
  y_ptr = &y_arr[0];
  for(i=0; i<100; i++){
    access();
    x_ptr++;
    y_ptr++;
  }
}

The control flow representation of main consists of the edges:

main: (1, ···, 2)
(2, i := 0, 3)
(3, i ≥ 100, 8)
(3, i < 100, 4)
(4, access(), 5)
(5, x_ptr := x_ptr + 8, 6)
(6, y_ptr := y_ptr + 4, 7)
(7, i := i + 1, 3)

In this example the size of the data type dataX_t is 8 bytes, while the size of the data type dataY_t is 4 bytes. Additionally, x_arr0 and y_arr0 denote the start addresses of the arrays x_arr and y_arr, respectively. At program point 3, after the first iteration of the fixpoint algorithm, the least upper bound of the two conjunctions of equalities

E0 = (i ≐ 0) ∧ (x_ptr ≐ x_arr0) ∧ (y_ptr ≐ y_arr0)
E1 = (i ≐ 1) ∧ (x_ptr ≐ 8 + x_arr0) ∧ (y_ptr ≐ 4 + y_arr0)

is computed. All equalities which are implied both by E0 and E1 are represented by the conjunction

(x_ptr ≐ 8·i + x_arr0) ∧ (y_ptr ≐ 4·i + y_arr0).

In the next round, the fixpoint iteration terminates and returns this conjunction as the invariant for program point 3. Here, the two-variable equalities contribute to verifying that each lock of the array of locks x_arr protects accesses to exactly the corresponding data y_arr[i] in the array of data y_arr.

Summary
We extended our interprocedural analysis of constant values of variables and constant differences of variables from Chapter 3 to deal with arbitrary linear two-variable equalities, while preserving the same worst-case complexity of O(n·k⁴). Various applications of our analysis can be envisioned. Information on definite equalities between variables can be used, e.g., during register allocation for coalescing of registers [91]. For the analysis of low-level code, general two-variable equalities may be useful to infer array index expressions. Chapter 5

Tools

In this chapter we overview some approaches to combining individual analyses. First, we provide a case in point that independent analyses are doomed to fail for our challenges. We survey literature on the combination of different analysis domains and discuss tools for binary analysis, focusing on their combination strategies. Finally, we describe our own implementation VoTUM [136].

Contributions
We present our flexible and sound framework VoTUM for analysing binaries. Furthermore, we introduce novel domain combination facilities in an off-the-shelf specification language. The experimental evaluation is quite promising (cf. Section 2.4, Section 6.4).

5.1 Combination of Abstract Domains

Typically, analyses over combined domains provide more precise information than one would obtain by combining the results of independent analysis runs [28]. In this section we discuss some approaches to combining program analyses.

Combining Relational and Non-Relational Data Flow Information
When analysing zero-optimised assembly, a combination of relational and non-relational program analyses is essential to obtain meaningful analysis results at all, as illustrated below.

Example 5.1. Combining Analyses
The following C fragment addresses two local variables i and j, which are represented by the locals with addresses x1 + 12 and x1 + 8 in the corresponding PPC assembly fragment. As usual in our formalisation, these two locals are represented by the variables y12 (which corresponds to the stack address x1 + 12) and y8 (which corresponds to the stack address x1 + 8), respectively. Below we give the corresponding control flow representation.


int i,j;
if(i>3)
  j=i;

04: lwz r5,12(r1)
08: cmpwi cr7,r5,3
0C: bleq cr7,0x18
10: lwz r5,12(r1)
14: stw r5,8(r1)

The control flow representation consists of the edges:

(0, x5 := M[x1 + 12], 1)
(1, x5 ≤ 3, 2)
(1, x5 > 3, 3)
(3, x5 := M[x1 + 12], 4)
(4, M[x1 + 8] := x5, 5)

For this program fragment we infer both non-relational and relational information. The value of each variable is abstracted by intervals, while the linear equality relations are determined by the analysis from Chapter 3. The following table shows possible approximations provided by independent runs of the interval and the equality relation analysis for every program point i:

i | Intervals                  | Equalities
1 | x5 ↦ ⊤; y12 ↦ ⊤            | (x5 ≐ y12)
2 | x5 ↦ [−∞, 3]; y12 ↦ ⊤      | (x5 ≐ y12)
3 | x5 ↦ [4, +∞]; y12 ↦ ⊤      | (x5 ≐ y12)
4 | x5 ↦ ⊤; y12 ↦ ⊤            | (x5 ≐ y12)
5 | x5 ↦ ⊤; y12 ↦ ⊤; y8 ↦ ⊤    | (x5 ≐ y12) ∧ (y8 ≐ y12)

At program point 3 in the control flow representation, the interval analysis holds the interval [4, +∞] for the value of register x5, while the equality analysis infers an equality between the variables x5 and y12. However, the load instruction at address 0x10 (leading to program point 4 in the control flow representation) causes a loss of information about the value of register x5, since no interval is known for y12, although the equality between the variables x5 and y12 still holds. When performing the interval and the equality analysis separately, the interval inferred for x5 is never transferred back to the local y12 before instruction 0x10, i.e. the edge (3, x5 := M[x1 + 12], 4) in the control flow representation, is passed. As a consequence, no precise intervals for the locals y8 and y12 can be inferred for program point 4. Making use of relational information at the transition from program point 1 to program point 3, we are able to propagate the value of register x5 back to the local memory location y12 whose value register x5 caches at this program point. The following table illustrates how the interaction of the interval analysis and the relational analysis can sharpen the precision of the analysis results (again, i denotes a program point).

i | Combination of Intervals and Equalities
1 | ⟨x5 ↦ [−∞, +∞]; y12 ↦ ⊤, (x5 ≐ y12)⟩
2 | ⟨x5 ↦ [−∞, 3]; y12 ↦ [−∞, 3], (x5 ≐ y12)⟩
3 | ⟨x5 ↦ [4, +∞]; y12 ↦ [4, +∞], (x5 ≐ y12)⟩
4 | ⟨x5 ↦ [4, +∞]; y12 ↦ [4, +∞], (x5 ≐ y12)⟩
5 | ⟨x5 ↦ [4, +∞]; y12 ↦ [4, +∞]; y8 ↦ [4, +∞], (x5 ≐ y12) ∧ (y8 ≐ y12)⟩

When considering the reduced product [34] of the abstract domains of intervals and equality relations, at program point 3 we are provided with more precise information for the values of x5 as well as y12 compared to the approach of independent analysis runs. Now, at program point 4 we still have precise information about the values of x5 and y12 after the memory read instruction leading to program point 4 has been performed. □

In case of zero-optimised assembly, the values of local variables are always freshly loaded from memory. Consequently, applying a naive interval analysis mostly leads to useless results, as almost no values for memory locations can be inferred. A combination of relational and non-relational analyses is essential to arrive at precise analysis results when analysing zero-optimised code.
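A reduction step between the two domains can be as simple as intersecting the intervals of variables that are known to be equal. The following Scala sketch illustrates this idea; names and representation are ours, for illustration, and the domains in VoTUM are considerably richer:

object Reduce {
  case class Interval(lo: Double, hi: Double) {
    def meet(o: Interval): Interval =
      Interval(math.max(lo, o.lo), math.min(hi, o.hi))
  }
  type Env = Map[String, Interval]

  // equalities given as pairs (x, y) standing for x =. y
  def reduce(env: Env, eqs: Set[(String, String)]): Env =
    eqs.foldLeft(env) { case (e, (x, y)) =>
      (e.get(x), e.get(y)) match {
        case (Some(ix), Some(iy)) =>
          val m = ix.meet(iy)          // both sides get the common interval
          e + (x -> m) + (y -> m)
        case _ => e
      }
    }
}

In Example 5.1, such a step propagates the interval [4, +∞] of register x5 to the local y12 at program point 3, which is exactly the refinement shown in the second table.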

Related Work
Cousot and Cousot [34] describe a general approach to systematically combine different program analysis frameworks, e.g. by building the direct product or the reduced product of abstract domains. While in the direct product approach the lattice operations are performed component-wise, in the reduced product approach the lattice operations take both lattice components into account simultaneously. In order to build the reduced product, one has to define a new operation specifying how the abstract domains interact, i.e. how to determine the smallest representative of the reduced product domain. Typically this reduction operator has to be specified manually, as it depends on the abstract domains being combined. The reduced product of two abstract domains is the most precise combination of the two abstract domains [34].
Cortesi et al. [31] introduce the notion of open products where each abstract domain may improve its accuracy w.r.t. information from other abstract domains. The general idea behind their approach is the automatic derivation of the product operation as well as the establishment of communication between the abstract domains. In this context, open means that some kind of interaction is taking place between the abstract domains: each abstract domain is able to provide information about its properties to the other abstract domains as well as to receive information by querying the other abstract domains. Additionally, Cortesi et al. present a concrete instance of this framework for reasoning about Prolog programs: the generic pattern domain, which allows enhancing an arbitrary domain with structural information.
Gulwani and Tiwari present another approach to domain combinators, the logical product [57]. As building the reduction operation for the reduced product of abstract domains depends on the concretisation functions of the single abstract domains, there is

no automatic way to construct the operations for the reduced product lattice. In contrast Gulwani and Tiwari restrict the domains being combined to convex domains only. In this case the transfer functions for the reduced products can be generated automatically. Beyer et al. [13] propose a configurable program analysis setting which allows combining symbolic analysis with data flow analysis in order to achieve a good balance between efficiency and precision. Basically they take advantage of the characteristics of both traditional program analysis, which aims at being efficient by providing a merge operation for join points, as well as of software model checking which aims at being precise by excluding a join operation and thus tracking reachability trees to separate different program paths. Their framework [13] is based on the idea of running multiple configurable program analyses where each infers a different kind of information (in a path-sensitive or a path-insensitive manner) about the underlying program. (In their paper they combine shape analysis, pointer analysis and predicate abstraction.) By combining several analyses various results regarding efficiency and precision can be achieved. Moreover they present a framework of how to combine several configurable program analyses into one composite analysis by allowing the user to configure the execution and interaction of the single analyses. Beyer et al. [14] present an extension of this framework that allows for adjusting the precision of the single analyses dynamically during the analysis run based on the accumulated results at each program point. For instance an analysis that explicitly tracks the values of some program variables is run in parallel with an analysis tracking the values of a set of predicates. Now if a fixed threshold for the different values of variables is exceeded, the precision of the value analysis is decreased while the precision of the predicate analysis increases: this is done by switching from tracking the values for this variable to tracking a predicate involving this variable. The authors call this technique of combining static and dynamic analysis while allowing communication between the analyses, dynamic precision adjustment.

Tools
After overviewing some of the theoretical ideas of how to combine abstract domains, we now survey static analysis tools based on the theory of abstract interpretation and their implementation of analysis communication. In this context we discuss the tools Astrée, aiT, CodeSurfer/x86 and Jakstab.
The analyser framework Astrée proves the absence of runtime errors, like buffer overruns, arithmetic overflows or pointer misuse, in automatically generated safety-critical embedded C programs. In this tool the concept of an approximate reduced product of abstract domains is implemented [15, 37]. Thereby reduction is instantiated by means of a network of communication channels, such that all the abstract domains can communicate their abstract properties to each other. A precise description of the hierarchical structure of the abstract domains, their parametrisation mechanism as well as their cooperation and the use of widening and narrowing techniques in this framework is given in [38].
The program analyser generator PAG [82] is the basis of the WCET analysis tool aiT [1], which consists of several analysis phases: first an executable is parsed and a

control flow representation is constructed. This CFG is then used by the value analysis (an interval analysis) which statically determines the values of registers and address ranges for memory accesses using a context-based analysis approach. Subsequently a pipeline analysis is performed which computes an upper bound for the WCET of each basic block. Finally, in the path analysis the overall WCET is derived. The goal of PAG is to provide a framework that allows the user to easily generate interprocedural analysers. Therefore, besides an efficient garbage collection module, two functional languages are provided to ease the process of writing abstract domains as well as specifying the data flow problems. PAG approaches the problem of the cooperation of abstract domains by means of attributes. Only via this attribute mechanism is an analysis able to access the values from other PAG-generated analyses at each node in the control flow representation. Additionally there is the possibility to exchange analysis results between different analysers by means of so-called ERD-files. This is a special file format which, when read into the analyser, is converted to the internal attribute mechanism. Concluding, in the PAG framework an appropriate analysis has to be designed for every combination of abstract domains. Once done, attributes provide the mechanism to forward analysis results to subsequent analysis phases.
CodeSurfer/x86 [110] analyses x86 executables with the aim of recovering an intermediate representation of the executable in order to support e.g. a security analyst in inspecting and investigating the behaviour of executables [10]. The design of this framework is the following: initially the disassembler IDAPro [64] is applied to provide a first approximation of the control flow of the binary, as well as information about library functions and statically known memory addresses. Based thereon they apply two static analyses: VSA (value-set analysis) in order to infer a set of possible values for each register and memory location. (In order to deal with zero-optimised assembly, they also make use of semantic reduction between numerical and relational information in the VSA phase.) Then ASI (aggregate structure identification) is run in order to identify aggregate structures. CodeSurfer/x86 applies an iterative strategy, where the separate analysis phases (of VSA and ASI) are run in a refinement loop [110], which allows the results of one analysis to improve the results of the other analysis. This cycle of alternately running these two analyses is repeated until either the whole system stabilises or an integer threshold is reached which has to be provided by the user. Their practical tool CodeSurfer/x86, however, is not available to us and consequently we do not know about implementation details of the analysis communication.
The binary analyser Jakstab [66] by Veith and Kinder basically relies on the same approach as we do: interweaving disassembly and static data flow analysis. Accordingly, the authors also aim at providing a sound overapproximation of the control flow graph for a given executable without relying on unsound heuristics. While neglecting abort- and exit-functions, they try to resolve the targets of indirect calls by inlining, which prevents them from dealing precisely with recursive procedures. Moreover, in [67] they extended their framework by an analysis that verifies whether the Windows device driver executable under analysis conforms to the API specification.
To do so, they rely on the framework of [14] as well as on an abstract specification of the execution environment (including the operating system as well as system calls) by means of a so-called harness module.

They leave the assumptions they rely on unspecified. It is also left open how to identify library functions in stripped executables. We tested the latest version of their tool (Jakstab version 0.7.4). For instance, invoking Jakstab with parameters --cpa xb on the provided jumptable example program crashes their analyser with a Java runtime exception (java.lang.NullPointerException), while the parameter --cpa i results in an unsound analysis result and fails to resolve the indirect jump for the switch-statement. (Parameter i denotes the strided interval analysis, parameter x bounded address tracking and parameter b based constant propagation.) Only the combination of their strided interval analysis and the forward expression substitution succeeds in resolving the indirect jump in at least this example. They claim that they yield more precise results than the CodeSurfer/x86 device driver analysis module. Our conjecture is that this is only due to a more detailed description of the system library for these example programs compared to the library identification mechanism of IDAPro.

5.2 VoTUM

The current implementation of our binary analyser VoTUM [136] supports analysing PPC binaries. VoTUM, which is written in plain Java, was developed to serve as a program analysis tool for educational purposes (used in lectures on program analysis) as well as for rapid prototyping in science. Hence, initially a simple high-level intermediate language, the CMA [138], was provided. The outcome of our work within the aiT [1] framework during the SuReal project [129] was a PPC plugin as well as different fixpoint engines (for instance supporting the functional as well as the call-string approach [122]) in our analyser VoTUM.

Domain Specification
In order to ease the specification of domains and data flow problems, we recognised that a functional programming language is more convenient as it provides pattern matching and type inference. Through our work on aiT we encountered the drawback of a self-designed functional language, which, however, has the advantage of (easily) specifying abstract domains as well as transfer functions: the simple data structures in the self-designed functional language DATLA in aiT are not sufficient to implement more complex domains, such as the simplices from Chapter 8. One has to fall back to the C programming language to do this job. Hence, the advantage of a simple domain and analysis specification in a functional programming language is lost. In contrast, we use the Scala programming language [101], a combination of the functional and the object-oriented paradigm, which interoperates with Java and also runs on the JVM. Concluding, there is neither the drawback of supporting and extending a self-designed language nor the problem of communication between different languages. Furthermore, in contrast to the aiT framework, we focus on a broader program family: we aim at analysing not only safety-critical software (adhering to special restrictions), as aiT does, but also ordinary desktop software. Therefore we provide a fully automatic framework

and (currently) do not provide an interface to the programmer in order to specify the exact bounds for loop iterations.

Example 5.2. Scala Analysis Specification

class LiveVars extends FixptAlgo[SetLE[RegExpr]](Direction.Back) {
  import ExprChooser.{ vars }

  // carrier: the may-set lattice over all register expressions of the program
  def initCarrier = {
    var v = Set[RegExpr]()
    for (expr <- ExprExtract.all(all)) v ++= vars(expr)
    MaySetLattice[RegExpr](v)
  }

  def initEdge(edge: CfgEdge) = bot
  def initStart = null

  // backward transfer function: first kill the assigned register,
  // then add all registers read by the edge
  def evalEdge(edge: CfgEdge, inData: LE) = {
    var alives = inData
    edge match {
      case ASSIGN(loc: RegExpr, _) => alives -= loc  // kill
      case _ =>
    }
    edge match {
      case ct @ IF(_, left, right) =>
        alives ++= vars(left) ++ vars(right)         // guards read their operands
      case at @ ASSIGN(loc: RegExpr, expr) if isAlive(loc, getSourceNode(at)) =>
        alives ++= vars(expr)
      case ASSIGN(loc: MemAccessExpr, expr) =>
        alives ++= vars(loc) ++ vars(expr)           // stores read address and value
      case _ =>
    }
    alives
  }
}

This example illustrates an implementation of an analysis inferring live variables in Scala in our VoTUM framework. □

Domain Combination
In contrast to the implementation of the combination and communication mechanism of different abstract domains in aiT, we provide a more flexible technique in our analyser. A thorough description of the architecture of our analyser can be found on the VoTUM web page [136]. Without controversy, the value analysis and the equality relation analysis have to be intertwined to precisely analyse assembly. Consequently, we also support the semantic reduction of these two abstract domains in our framework. In order to integrate additional analyses into our framework, such as the analysis of modular arithmetic from Chapter 7 or the analysis of linear inequality relations from Chapter 8, we implemented the domain collaboration by means of attributes (corresponding to the

PAG approach) in our tool. Initially we pursued the iterative strategy as proposed by Reps et al. However, we quickly recognised that an iterative strategy wastes too much time and memory and is not always able to yield meaningful analysis results. Hence, we consider such an iterative strategy to be impracticable for analysing assembly consisting of several hundred thousand assembler instructions.
Our framework allows for specifying a hierarchy of analyses. For instance, an analysis inferring linear inequalities like that from Chapter 8 is based on the control flow graph, and consequently its precondition is a control flow reconstruction analysis. In contrast to the implementation of Kinder (who leaves it open which analyses have to be combined to obtain meaningful results, and whose tool often crashes when combining analyses), our framework demands that the hierarchical analysis structure be specified at the top level when combining several analyses. The individual analyses can access information provided by other analyses via attributes. Therefore we have to specify those points in the analysis where an analysis may access information from another analysis. This specification ensures that in case of interdependent analysis runs, each analysis invokes the re-computation of the analyses whose information it uses: assume we are given two analyses a1 and a2 where analysis a2 queries information from analysis a1 at a certain program point p. Our dependence specification then causes that, whenever a new data value is provided for p in analysis a1, a2 will be notified and supplied with this new piece of information. This triggers a re-computation of the analysis results of a2, starting at program point p.
Moreover, we use shared data structures to reduce memory consumption. To this end we resort to the PCollections framework [103], which provides efficient, immutable and persistent versions of the data structures provided by the Java Collections Framework. The program state inferred by a program analysis is immutable. Consequently, an immutable data structure is suitable for describing the data flow value at each program point. Mostly the transformers change only a small part of the data flow value, so that we profit from the fact that large parts of the data can be shared.
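The notification scheme just described can be sketched as a small observer pattern over per-point attribute values. The following Scala code is an illustrative simplification with hypothetical names, not VoTUM's actual interface:

// an analysis depending on this attribute registers a listener and is
// re-scheduled from program point p whenever the value at p changes
trait Listener { def invalidateFrom(p: Int): Unit }

class Attribute[D] {
  private var values = Map[Int, D]()
  private var listeners = List[Listener]()

  def register(l: Listener): Unit = listeners ::= l

  def publish(p: Int, d: D): Unit = {
    val changed = !values.get(p).contains(d)
    values += (p -> d)
    if (changed) listeners.foreach(_.invalidateFrom(p))  // trigger re-computation
  }

  def query(p: Int): Option[D] = values.get(p)
}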

Future Work
However, in order to provide a more modular design of our analyser, we decided to follow the ideas of implementing a communication entity as realised in the Astrée framework. The idea is to have a central communication entity which manages the cooperation between the different abstract domains. Therefore we have in mind to provide a query interface where numerical and relational information can be obtained. This is independent of the actual implementation of the concrete domains. This would allow us to experiment with different domains and evaluate their precision and contribution to analysing assembly.
Following the approach of [85], we also plan to enrich our analyser with parallelisation techniques in order to speed up the runtime of the analyser. The Astrée framework allows a parallel execution of the analyser when encountering guards and indirect procedure calls. There, the analyses of the true- and the false-branch of a guard are conducted in parallel, while the different call targets of indirect procedure calls can also be analysed in parallel.

Summary
In case of zero-optimised assembly, where the values of locals are always freshly loaded from memory into registers, a combination of relational and non-relational analyses is essential to provide meaningful analysis results. Therefore we overviewed some approaches on how to combine different abstract domains and described their realisation in concrete program analysis tools. In conclusion, we presented how our analyser VoTUM tackles the problem of specifying abstract domains and their combination. Finally, we gave an outlook on future extensions of our tool. Chapter 6

Side-Effect Analysis

In this chapter we present an interprocedural side-effect analysis of assembly. We represent the modifying potential of a procedure f by classifying all write accesses occurring within f relative to the parameter registers. We show how this side-effect information for procedures can be applied to improve the precision of many common intraprocedural analyses. In particular, our approach is the first to accurately handle reference parameters.

Introduction

Control flow reconstruction of assembly typically makes strong assumptions on the code [130, 68] in order to obtain some results at all (cf. Chapter 2). Techniques for sound static analysis, for instance, often assume that there is a concept of procedures where passing of parameters and returning of results adheres to the ABI of the processor architecture. This also means that the locals (memory locations as well as registers) of a procedure are correctly saved before and restored after a call. Still, a sound analysis may lose all information about locals if a reference to any local potentially escapes to some unknown execution context. This may occur if such a reference is stored in global memory or is passed to some procedure.

Example 6.1. Global Variable Access

int a;
g(){ a++; }
main(){
  int b = 13;
  g();
}

//g(){ a++;
00: lis r9,10
04: lwz r9,10260(r9)
08: addi r2,r9,1
0C: lis r9,10
10: stw r2,10260(r9)
//}
14: blr

//main(){
44: stwu r1,-24(r1)
48: mflr r2
4C: stw r2,28(r1)
//int b = 13;
50: li r2,13
54: stw r2,8(r1)
//g();
58: bl 0x00
//}...

The called procedure g reads the value of the global variable a and increments it by one. The corresponding PPC assembly is given above. The access to the global a


is realised by the memory write access at instruction 0x10. After the call to g within main, a sound intraprocedural analysis of main must make a worst-case assumption about the effect of the call to g by assuming that all local variables of main, i.e. b, may have been modified by the procedure call. The addresses referring to the underlying stack frames, i.e. the stack frame of main, can be calculated from the stack pointer register r1 within g. Thus, all information about the values of the locals after the procedure call g() is lost. In order to refine the analysis, we assume that a procedure will access a local of a stack frame s different from its own stack frame only if a reference (other than the stack pointer) into s is accessible. Then, the values of the local variables within s remain unchanged after the procedure call. A sound analysis is then possible by over-approximating these references. In our example this approach means that the local variable b of main will not be modified by the call to g(). □

In many realistic programs, however, addresses of locals may escape to called procedures. Therefore we consider the following example:

Example 6.2. Reference Parameter

f(int a[]){
  int j;
  for(j=0;j<4;j++)
    a[j] = 0;
}

main(){
  int c[4];
  int b = 13;
  f(c);
}

//f(int a[]){
00: stwu r1,-32(r1)
04: stw r3,24(r1)
//...
//a[j] = 0;
14: lwz r2,8(r1)
18: mulli r2,r2,4
1C: mr r9,r2
20: lwz r2,24(r1)
24: add r9,r9,r2
28: li r2,0
2C: stw r2,0(r9)
//...

//main(){...
//int b = 13;
50: li r2,13
54: stw r2,8(r1)
//f(c);
58: addi r2,r1,12
5C: mr r3,r2
60: bl 0x00
//}...

Within the C code fragment, a pointer to the array c is passed as a parameter to procedure f. Within f, the first four elements of the array are set to zero. In the corresponding PPC assembly, this is realised by passing a stack address, i.e. the address r1+12, in register r3 to procedure f (cf. instruction 0x5C). After the call to f within main, an intraprocedural analysis of main therefore must assume that all local variables of main may have been modified by the procedure call. Thus, all information about the values of local variables after the procedure call f() is lost. □

The goal of our side-effect analysis of procedures therefore is to obtain an analysis which is almost as fast as an intraprocedural analysis but decreases the loss of information at procedure calls.

Example 6.3. Modifying Potential
For the program in Example 6.2, we are interested in determining all stack locations whose value may be modified through the procedure call and thus must be invalidated after instruction 0x60. Accordingly, we inspect the memory write accesses occurring in procedure f. The only memory write access in f happens at instruction 0x2C and modifies the memory cells addressed by the expressions {r3, r3+4, r3+8, r3+12}. This set describes the modifying potential of procedure f. By matching actual to formal parameters, we conclude that within procedure main after the call to f, the memory locations {r1+12, r1+16, r1+20, r1+24} of main are modified, while local b (i.e. the memory cell with address r1+8) remains untouched by f().
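The matching step of Example 6.3 can be sketched in a few lines of C; the data layout and the register numbering are illustrative assumptions, not the implementation of our analyser:

#include <stdio.h>

/* One register-relative write access (xi, off): "the procedure may
 * write to the cell addressed by the entry value of xi plus off". */
struct access { int reg; int off; };

int main(void) {
    /* Modifying potential of f: writes to r3+0, r3+4, r3+8, r3+12. */
    struct access pot_f[] = { {3, 0}, {3, 4}, {3, 8}, {3, 12} };
    /* Register valuation at the call site in main: r3 = r1 + 12.   */
    int base_reg = 3, base_off = 12;

    /* Translate each r3-relative write into an r1-relative cell.   */
    for (unsigned i = 0; i < sizeof pot_f / sizeof pot_f[0]; i++)
        if (pot_f[i].reg == base_reg)
            printf("invalidate caller cell r1+%d\n",
                   base_off + pot_f[i].off);
    return 0;
}

Run on this input, the sketch reports exactly the four cells r1+12, r1+16, r1+20 and r1+24, while local b at r1+8 is left untouched.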

Before elaborating our approach in detail, we present related approaches in the area of side-effect analysis.

Related Work

Side-effect analysis has been studied intensively for high-level languages, e.g. the purity analysis for Java programs [113], the side-effect analysis for C/C++ programs [20, 73], or the context-sensitive pointer analysis for C programs [139, 43]. The approach of Cooper et al. [29] relies on graph algorithms for solving the alias-free side-effect analysis problem introduced by Banning [11] in a flow-insensitive manner. All these techniques, however, are not directly applicable to low-level code, where there is no clear distinction between an integer and an address.

For the analysis of low-level code, Debray et al. present an interprocedural, flow-sensitive pointer alias analysis of x86 executables, which, however, is context-insensitive [41]. They abstract the value of each register by an address descriptor, i.e. a set of possible congruence values with respect to a program instruction, and make no distinction between two addresses which have the same lower-order k bits. Since they do not track any memory content, they suffer from information loss whenever a register is assigned a value from memory. Moreover, if different definitions of the same register reach the same join point, then this register is assumed to take any value.

A context-sensitive low-level points-to analysis is presented in [58]. This approach is only partially flow-sensitive: the values of registers are handled flow-sensitively using SSA form, while memory locations are treated flow-insensitively (tracking only a single points-to set for each memory location). The notion of UIVs (unknown initial values) is introduced in order to represent all those memory locations that are accessible by the procedure but do not belong to the current stack frame of the procedure or to the stack frames of its callees. Their aim is to improve on compiler optimisations, such as load- and store-reorderings, and thus a crude treatment of local memory is justified. In contrast, we track local memory locations context-sensitively as well.

The Codesurfer/x86 framework relies on the call-string approach (CSA) [122] in order to deal with procedures. In order to overcome the context-insensitivity of CSA-0 and the impracticability of increasing the length of call-strings, they claim to apply techniques from [29] in order to determine the set of memory locations that may be modified by each procedure. This kind of information is used in order to improve the precision when combining the information from different call sites. In contrast, our algorithm adheres to the functional approach of program analysis [122] and thus does not suffer from the limitations of call-strings of bounded length.

A stack analysis for x86 executables is addressed in [78]. There, the authors aim at verifying that a function leaves the stack in its original state after the function returns. In order to identify such well-behaving functions, use-depth and kill-depth analyses are introduced. By means of use-depth information they estimate the maximal stack height from which a function may read a value, while kill-depth information tells the maximal height of the runtime stack to which the function may write a value. While addressing a related problem, the applied techniques are different. In particular, our approach does not suffer from a loss of precision when dealing with recursive procedures.

Contributions

We present a reasonably fast interprocedural side-effect analysis of assembly. In our framework, the side-effect information of a procedure is represented by all possible parameter-register-relative write accesses. This allows us to verify that the assumptions for procedure calls are met on which many common intraprocedural analyses rely. These assumptions are that:

• the stack pointer is initially decremented by the same amount as it is incremented after the call;
• the return address is correctly saved and used for the return;
• arithmetic on stack addresses is not used for overwriting organisational cells or accessing different stack frames.

In contrast to existing approaches we rely on a light-weight form of the functional approach to interprocedural analysis which still provides useful results. Using the call-string approach or even a full instance of the functional approach (where the full stack frame of the caller has to be passed to the callee) would be prohibitively expensive for our benchmark programs, requiring an enormous amount of memory.

Overview

The structure of this chapter is as follows: Section 6.1 presents the concrete semantics for our side-effect analysis. In Section 6.2 we first describe how to embed the side-effect of a procedure into an intraprocedural analysis and then present how to compute the modifying potential of a procedure. Section 6.3 introduces some enhancements of our framework. Finally, in Section 6.4 we present experimental results of our analyser.

6.1 Semantics

In this section we present the collecting semantics of our side-effect analysis. Therefore let us first reason about the memory of the program. When dealing with procedure calls, we distinguish procedure-global from procedure-local memory locations. Accordingly, the single edge starting at the entry point of the control flow graph for a procedure q is assumed to be annotated with an instruction x1 := x1 − cq for some constant cq. This instruction allocates the local stack space for the current invocation of q. Likewise, the single edge reaching the exit point of the control flow graph for q is annotated with the instruction x1 := x1 + cq for the same constant cq. This instruction deallocates the local stack space. For simplicity, we rule out intermediate increments or decrements of the stack pointer, and thus do not consider dynamic stack allocation here.

For the concrete semantics, the memory M of the program is divided into the disjoint address spaces ML for the stack or local memory, and MG for global memory, as described in Section 2.5. The set of global addresses is given by MG = {(G, z) | z ∈ Z}, and the set of stack addresses is given by ML = {(L, z) | z ∈ Z}. A program state γ then is given by a triple γ = ⟨ρ, f, λ⟩ where

• ρ : X → M provides the contents of registers.

• f = (xr, xr−1, . . . , x0) for xr < xr−1 < . . . < x0 is the frame structure of the current stack, where x0 = Max + 1 and xr equals the least address on the current stack. Here Max is the maximal height to which the stack can grow. Thus, in particular, xr = ρ(x1). Recall from Section 1.4 that the stack grows downward from numerically higher addresses towards zero.

• λ provides the contents of memory locations. Thus, λ assigns values to the set MG of global addresses as well as to the current set of local addresses. The values of global addresses are restricted to be global addresses only. The set of allowed local addresses is restricted to the set ML(xr) = {(L, z) | xr ≤ z < x0}. Their values are restricted to elements from the set MG ∪ ML(xr) only.

In order to obtain a uniform representation, concrete integer values are embedded into the global address space, i.e. the plain value 5 as well as the global address 5 are represented by the pair (G, 5).

We consider address arithmetic w.r.t. a given frame structure f = (xr, xr−1, . . . , x0). By an address (Li, c), we denote that the local address c refers to the ith stack frame, i.e. xi ≤ c < xi−1. We restrict the arithmetic operations on local addresses such that, starting from a local address within the stack frame [xi, xi−1 − 1], i.e. one of the local addresses between xi and xi−1 − 1, arithmetic is only legal if it produces a local address within the range [xi, x0]. Accordingly, the arithmetic operations of addition, subtraction, multiplication and boolean comparisons on elements from {L, G} × Z w.r.t. a given frame structure f = (xr, xr−1, . . . , x0) are defined by:

(L, c1) +f (L, c2) = undefined
(G, c1) +f (G, c2) = (G, c1 + c2)
(L, c1) +f (G, c2) = (G, c2) +f (L, c1) = (L, c1 + c2)  if xr ≤ c1 ⟹ xr ≤ c1 + c2, and undefined otherwise
(L, c1) −f (L, c2) = (G, c1 − c2)
(G, c1) −f (G, c2) = (G, c1 − c2)
(L, c1) −f (G, c2) = (L, c1 − c2)  if xr ≤ c1 ⟹ xr ≤ c1 − c2, and undefined otherwise
(G, c1) −f (L, c2) = undefined
(G, c1) ·f (G, c2) = (G, c1 · c2)
(L, c1) ·f (G, c2) = (G, c2) ·f (L, c1) = undefined
(L, c1) ·f (L, c2) = undefined
(L, c1) ⋈ (L, c2) = true  if c1 ⋈ c2, and false otherwise
(G, c1) ⋈ (G, c2) = true  if c1 ⋈ c2, and false otherwise
(G, c1) ⋈ (L, c2) = undefined
(L, c2) ⋈ (G, c1) = undefined

for every comparison operator ⋈. If an undefined value occurs, we assume that an exception is thrown and the program execution is aborted. For the analysis, we only consider non-aborting program executions and flag warnings if an abortion cannot be excluded.
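This tagged arithmetic can be made concrete in a small C sketch; the representation (an enum tag plus an offset) and the test values are illustrative assumptions, not the analyser's implementation. Addition and subtraction follow the definitions above; undefined combinations abort, mirroring the exception behaviour:

#include <stdio.h>
#include <stdlib.h>

typedef enum { G, L } Tag;               /* global value / local address */
typedef struct { Tag tag; long c; } Val; /* tagged value (Tag, c)        */

static long xr;   /* least address of the current stack, i.e. rho(x1)    */

static void undefined(void) {            /* abort on an undefined result */
    fprintf(stderr, "warning: undefined address arithmetic\n");
    exit(1);
}

/* (t1,c1) +_f (t2,c2) as defined above */
static Val add_f(Val a, Val b) {
    if (a.tag == G && b.tag == G) return (Val){G, a.c + b.c};
    if (a.tag == L && b.tag == L) undefined();
    Val loc = a.tag == L ? a : b, glob = a.tag == L ? b : a;
    if (xr <= loc.c && !(xr <= loc.c + glob.c)) undefined();
    return (Val){L, loc.c + glob.c};
}

/* (t1,c1) -_f (t2,c2) as defined above */
static Val sub_f(Val a, Val b) {
    if (a.tag == L && b.tag == L) return (Val){G, a.c - b.c};
    if (a.tag == G && b.tag == G) return (Val){G, a.c - b.c};
    if (a.tag == G && b.tag == L) undefined();
    if (xr <= a.c && !(xr <= a.c - b.c)) undefined();
    return (Val){L, a.c - b.c};
}

int main(void) {
    xr = 1000;                      /* current stack frame starts here   */
    Val sp = {L, 1000}, off = {G, 8};
    Val cell = add_f(sp, off);      /* legal: stays inside the stack     */
    printf("(%c,%ld)\n", cell.tag == L ? 'L' : 'G', cell.c);
    sub_f(sp, (Val){G, 24});        /* aborts: leaves the legal range    */
    return 0;
}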

This analysis relies on the full programming model as specified in Section 2.5. This means that we precisely deal with procedure calls, variable assignments (of any linear term t as well as non-deterministic assignments), memory access instructions and guards. The concrete semantics of this analysis is denoted by (S1, R1) and instantiates the generic semantics (S, R) from Section 2.5 as follows. In the reachability analysis a program state is given by the triple ⟨ρ, f, λ⟩. A single processor instruction s on a given program state ⟨ρ, f, λ⟩ returns a set of program states, defined by:

[[xi := t]]⟨ρ, f, λ⟩ = {⟨ρ ⊕ {xi ↦ [[t]](ρ, f)}, f, λ⟩}
[[xi := ?]]⟨ρ, f, λ⟩ = {⟨ρ ⊕ {xi ↦ a}, f, λ⟩ | a ∈ MG}
[[xi := M[t]]]⟨ρ, f, λ⟩ = {⟨ρ ⊕ {xi ↦ λ([[t]](ρ, f))}, f, λ⟩}
[[M[t] := xi]]⟨ρ, f, λ⟩ = {⟨ρ, f, λ ⊕ {[[t]](ρ, f) ↦ ρ(xi)}⟩}
[[xj ⋈ xk]]⟨ρ, f, λ⟩ = {⟨ρ, f, λ⟩ | ρ(xj) ⋈ ρ(xk) = true}
[[xj ⋈ c]]⟨ρ, f, λ⟩ = {⟨ρ, f, λ⟩ | ρ(xj) ⋈ (G, c) = true}

with xi, xj, xk ∈ X where i ≠ 1, t an expression and arbitrary c ∈ Z. Here, the operator ⊕ adds new argument/value pairs to a function. Moreover, the evaluation function [[t]](ρ, f) takes an expression t and returns the value of t in the context of register valuation ρ w.r.t. a given frame structure f:

[[t1 □ t2]](ρ, f) = [[t1]](ρ, f) □f [[t2]](ρ, f)
[[xi]](ρ, f) = ρ(xi)
[[c]](ρ, f) = (G, c)

with □ ∈ {+, −, ·} and □f ∈ {+f, −f, ·f}, respectively. As usual, R1 speaks about sets of program states, i.e. Γ here. Therefore, we define the application of a transformer [[s]] for a given processor instruction s to a set of program states Γ by:

[[s]]Γ = ⋃{[[s]]γ | γ ∈ Γ}

For instance, consider a memory access xi := M[t]. Then, for every program state ⟨ρ, f, λ⟩ ∈ Γ the value a of expression t is determined. Given that a is a valid memory address of λ, the content λ(a) of the memory location a is assigned to register xi.

Next, we describe the effect of a procedure call q(). According to our convention, a stack pointer decrement instruction reserves a stack region for the local variables of the procedure. For the concrete semantics, this means that for a procedure call q() the stack is extended by the stack frame of q. According to our assumptions, the only assignments to variable x1 are of the form x1 := x1 + c for some c ∈ Z, and they occur at the first and last control flow edges of control flow graphs only. There, these assignments allocate or deallocate the local stack frame for the current instance of the procedure. The newly allocated memory cells are uninitialised and therefore have arbitrary values. Accordingly, we define for c > 0 and f = (xr, . . . , x0):

[[x1 := x1 − c]]⟨ρ, f, λ⟩ = {⟨ρ ⊕ {x1 ↦ ρ(x1) −f (G, c)}, (xr − c, xr, . . . , x0), λ ⊕ {(L, z) ↦ a | z ∈ [xr − c, xr); a ∈ MG}⟩ | xr ≥ c}

After execution of the body of the procedure, the current stack frame is deallocated. Assume that f = (xr+1, xr, . . . , x0) and c > 0. Then the effect of the last instruction of the procedure epilogue, which increments the stack pointer again, for state ⟨ρ, f, λ⟩ is given by:

[[x1 := x1 + c]]⟨ρ, f, λ⟩ = {⟨ρ ⊕ {x1 ↦ ρ(x1) +f (G, c)}, (xr, . . . , x0), λ|_{MG ∪ ML(xr)}⟩ | xr = xr+1 + c}

After executing the last instruction of procedure q, the stack frame of q has been popped. In particular, the stack pointer again points to the top of the stack. With λ|_M we denote the restriction of λ to the domain M. For the instantiation of constraint system R, i.e. the set of start states of the program T0, we have:

T0 = {⟨ρ, (Max + 1), λ⟩ | ρ(x1) = (L, Max + 1), ρ(xi) ∈ MG for xi ≠ x1, λ : MG → MG}

At program start, the stack is empty. This means that the stack pointer x1 has the value (L, Max + 1). In conclusion, constraint system S1 speaks about transformers operating on program states, while constraint system R1 speaks about sets of program states Γ.
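To illustrate the frame discipline of this semantics, the following C sketch maintains the frame structure f as a stack of boundaries and implements the two stack-pointer transformers; allocation pushes xr − c, deallocation pops and checks xr = xr+1 + c. All constants are illustrative assumptions:

#include <assert.h>
#include <stdio.h>

#define MAXFRAMES 16
static long frame[MAXFRAMES]; /* boundaries x_r < ... < x_0          */
static int  r;                /* index of the topmost boundary x_r   */

/* [[x1 := x1 - c]]: allocate a stack frame of size c */
static void alloc(long c) {
    assert(frame[r] >= c);    /* the stack must not grow below zero  */
    frame[r + 1] = frame[r] - c;
    r++;
}

/* [[x1 := x1 + c]]: deallocate the topmost stack frame again */
static void dealloc(long c) {
    assert(r > 0 && frame[r] + c == frame[r - 1]); /* x_r = x_{r+1}+c */
    r--;
}

int main(void) {
    frame[0] = 1001; r = 0;   /* empty stack: x_0 = Max + 1 = 1001   */
    alloc(40);                /* enter main: frame [961, 1001)       */
    alloc(32);                /* call f:     frame [929, 961)        */
    printf("stack pointer x1 = (L,%ld)\n", frame[r]);
    dealloc(32);
    dealloc(40);
    return 0;
}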

6.2 Analysis of Side-Effects

This section consists of three parts: Firstly, we describe how to embed the side-effect information of a procedure into any intraprocedural value analysis. Secondly, we present an interprocedural analysis which computes the modifying potential of each procedure. And, thirdly, we prove the correctness of our analysis.

Embedding Side-Effects

Our goal is to determine for every procedure q its modifying potential, i.e. the set of local memory cells whose contents are possibly modified during an invocation of q. For that, we consider the register values at a procedure call as the arguments of the procedure. The modifying potential [[q]]♭ therefore is represented by two components (X, M) where:

• X ⊆ X is the subset of registers whose values after the call are equal to their values before the call. This set should always contain x1.

• M : X → 2^Z is a mapping which for each register xi provides a (super)set of all xi-relative write accesses to the local memory. In a concrete implementation, the analysis may compute with particular sets of offsets only, such as intervals [86, 36] or strided intervals [7].

Example 6.4. Instantiating the Modifying Potential
Recall the program from Example 6.2 and its corresponding control flow representation given by Figure 6.1. According to our effect description, the modifying potential of procedure f at program point 23 is given by [[f]]♭ = ({x1}, {(x3, 0), (x3, 4), (x3, 8), (x3, 12)}). At program point 6, directly before the procedure call, we have the following valuation: (x1, 8) ↦ {(x0, 13)}; x2 ↦ {(x1, 12)}; x3 ↦ {(x1, 12)}. Embedding the side-effect information of f into the intraprocedural analysis of procedure main, directly after the procedure call f() the value of the stack location (x1, 8) remains unchanged. Thus, we arrive at the following valuation for program point 7: (x1, 8) ↦ {(x0, 13)}.

Given the modifying potential [[q]]♭ of all procedures q in the program, a value analysis can be constructed which determines for every program point, for all registers and local memory locations, a superset of their respective values. For simplification of presentation we omit the treatment of global memory. For that, we are interested in two kinds of values:

• absolute values, i.e. potential addresses in the global memory. These are represented as (x0, z). Register x0 is used for accessing the segment of global memory. In our framework, we assume x0 to be hard-wired to the value (G, 0).

[Figure 6.1 shows the control flow graphs of procedures main (program points 0 to 9) and f (program points 10 to 23) for Example 6.2, with edges labelled by the instructions, e.g. x1 := x1 − 40 for the stack frame allocation of main and f() for the call edge.]

Figure 6.1: Control Flow Representation of Example 6.2.

• stack pointer offsets, i.e. local memory locations which are addressed relative to x1. These are represented as (x1, z).

Hence, we consider the value domain V = 2^{{x1,x0}×Z} where the greatest element ⊤ is given by the set {x1, x0} × Z. Let S♯q := [0, cq] denote the set of all procedure-local memory cells of procedure q, where the constant cq is provided by the initial control flow edge of procedure q. An abstract program state w.r.t. procedure q is then given by the triple ⟨ρ♯, cq, λ♯⟩, with:

• ρ♯ : X → V which assigns the value {(x1, 0)} to the stack pointer x1, and to each register xi ∈ X different from x1 a set of its possible values.

• cq denotes the size of the stack frame of procedure q. This constant is provided by the single edge starting at the entry point of q.

• λ♯ : S♯q → V which assigns to every local memory location from the stack frame of q a set of its possible values.

Hence, our analysis determines for every program point u of a procedure q a set of abstract program states Γ♯ of the form ⟨ρ♯, cq, λ♯⟩. Again, R♯1(u) is defined as the least solution of the following constraint system:

[R♯1 0]  R♯1(sq) ⊒ Γ♯0                  sq the start point of procedure q
[R♯1 1]  R♯1(v) ⊒ [[s]]♯(R♯1(u))        for edges (u, s, v) with s ∈ stm ∪ guards
[R♯1 2]  R♯1(v) ⊒ [[p]]♭ @♯ (R♯1(u))    for call edges (u, p(), v)

where Γ♯0 sets x1 to the set {(x1, 0)} and all other registers and local memory locations to the full set {x0, x1} × Z of values. Thus:

Γ♯0 = {⟨ρ♯, 0, ⊥⟩ | ρ♯(x1) = {(x1, 0)}, ρ♯(xi) = ⊤ for xi ≠ x1}

where ⊥ denotes the empty mapping. The transformers [[s]]♯ are the abstract effects of edge labels, [[p]]♭ represents the modifying potential of procedure p, and @♯ is the application of a modifying potential to a given valuation of registers and local addresses. For registers xi, we have:

((X, M) @♯ ⟨ρ♯, cq, λ♯⟩)(xi) = ⊤        if xi ∉ X
((X, M) @♯ ⟨ρ♯, cq, λ♯⟩)(xi) = ρ♯(xi)   if xi ∈ X

For a local offset a,

((X, M) @♯ ⟨ρ♯, cq, λ♯⟩)(a) = ⊤

if there exists some xi ∈ X, d ∈ M(xi) and (x1, b) ∈ ρ♯(xi) such that a = b + d. Otherwise,

((X, M) @♯ ⟨ρ♯, cq, λ♯⟩)(a) = λ♯(a),

i.e. the value remains unchanged. In case that for some xi ∈ X, d ∈ M(xi) and (x1, b) ∈ ρ♯(xi) we have b < −d, i.e. the offset b + d is negative, a potential access to a local memory location outside the stack is detected. In this case, we again flag a warning and abort the analysis. The abstract transformers [[s]]♯ are identical to the abstract transformers which we use for an auxiliary intraprocedural value analysis when computing the modifying potential of procedures; this effect computation is described next.
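Before that, the application @♯ for local offsets can be sketched in a few lines of C; the dense bitset representation of offset sets and the register and frame sizes are illustrative assumptions, not the analyser's data structures:

#include <stdbool.h>
#include <stdio.h>

#define NREGS  8
#define FRAME 64      /* caller frame cells 0..FRAME-1                  */

/* abstract value of register xi: offs[xi][b] means (x1, b) in rho(xi)  */
static bool offs[NREGS][FRAME];

/* modifying potential M: mod[xi][d] means an xi-relative write at d    */
static bool mod[NREGS][FRAME];

/* does ((X,M) @# state)(a) = TOP hold for caller cell a?               */
static bool invalidated(int a) {
    for (int xi = 0; xi < NREGS; xi++)
        for (int b = 0; b < FRAME; b++)
            if (offs[xi][b] && a >= b && a - b < FRAME && mod[xi][a - b])
                return true;   /* some xi with (x1,b) and d = a-b       */
    return false;
}

int main(void) {
    offs[3][12] = true;                  /* r3 may hold r1+12           */
    mod[3][0] = mod[3][4] = mod[3][8] = mod[3][12] = true;
    for (int a = 0; a < 32; a += 4)      /* Example 6.4 revisited       */
        printf("cell r1+%-2d %s\n", a,
               invalidated(a) ? "-> TOP" : "unchanged");
    return 0;
}

For the valuation of Example 6.4 this reports exactly the cells r1+12 through r1+24 as invalidated.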

Effect Computation

Assume that we are given a mapping µ which assigns to each procedure g an approximation of its modifying potential, i.e. µ(g) = (Xg, Mg) where Xg ⊆ X and Mg : X → 2^Z. Relative to µ, we perform for a given procedure g an intraprocedural value analysis which determines for every program point u of g, for all registers xi ∈ X as well as all g-local memory cells a ∈ [0, cg], sets ρ♭(u)(xi), λ♭(u)(a) ⊆ (X ∪ {x0}) × Z of values relative to the values of registers at procedure entry. Again cg denotes the stack frame size of procedure g. Relative to the mappings ⟨ρ♭(u), λ♭(u)⟩, u a program point of g, the modifying potential of g is given by effg(µ) = (X′, M′) where

X′ = {xi ∈ X | ρ♭(rg)(xi) = {(xi, 0)}}
M′ = (X × Z) ∩ ( ⋃{[[t]]♭(ρ♭(u)) | (u, M[t] := xj, ·) ∈ Eg}
               ∪ {(xk, z + z′) | (u, f(), ·) ∈ Eg, (xk, z) ∈ ρ♭(u)(xk′), (xk′, z′) ∈ Mf} )

Note that (X′, M′) monotonically depends on µ only if the mappings ρ♭(u) as well as λ♭(u) monotonically depend on µ. Hence, given such a monotonic value analysis of the functions ρ♭, λ♭, the modifying potential can be determined as the least (or some) solution of the constraint system

[[g]]♭ ⊒ effg(µ)   for all procedures g

where the fixpoint iteration starts with µ(g) = (∅, ∅). It remains to construct the constraint system whose least solution characterises the mappings ρ♭(u) : X → V̄ and λ♭(u) : [0, cg] → V̄ w.r.t. procedure g with stack frame size cg. Now, V̄ is the complete lattice 2^{(X∪{x0})×Z}. Note that the greatest element ⊤ of V̄ is given by ⊤ = (X ∪ {x0}) × Z. The abstract arithmetic operations are obtained by first considering single elements (xi, c). We define:

(xi, c1) □ (x0, c2) = {(xi, c1 □ c2)}
(x0, c2) □ (xi, c1) = {(xi, c1 □ c2)}
(xi, c1) □ (xj, c2) = ⊤   for i, j ≠ 0

for □ ∈ {+, ·}, while for subtraction we define:

(xi, c1) − (xi, c2) = {(x0, c1 − c2)}
(xi, c1) − (x0, c2) = {(xi, c1 − c2)}
(xi, c1) − (xj, c2) = ⊤   for i ≠ j

These definitions then are lifted to abstract operators □♭ on sets of such elements, i.e. to V̄, by:

S1 □♭ S2 = ⋃{s1 □ s2 | s1 ∈ S1, s2 ∈ S2}

The definition of these abstract operators gives rise to an abstract evaluation [[t]]♭ of expressions t w.r.t. a given register valuation ρ♭:

[[t]]♭(ρ♭) = [[t1]]♭(ρ♭) □♭ [[t2]]♭(ρ♭)   if t = t1 □ t2
[[t]]♭(ρ♭) = ρ♭(xi)                        if t = xi
[[t]]♭(ρ♭) = {(x0, c)}                     if t = c
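To illustrate, here is a compact C sketch of [[t]]♭ for + and − (multiplication is handled analogously); the bounded set representation with an explicit ⊤ flag is an illustrative assumption:

#include <stdbool.h>
#include <stdio.h>

#define MAXP 16

typedef struct { int base; long off; } Pair;   /* (xi, c); base 0 is x0 */
typedef struct { bool top; int n; Pair p[MAXP]; } AVal;

static void put(AVal *s, int base, long off) {
    if (!s->top && s->n < MAXP) s->p[s->n++] = (Pair){base, off};
    else s->top = true;                        /* overflow goes to TOP  */
}

/* pointwise lifting of the single-element rules for + and -            */
static AVal op(char o, AVal a, AVal b) {
    AVal r = {0};
    if (a.top || b.top) { r.top = true; return r; }
    for (int i = 0; i < a.n; i++)
        for (int j = 0; j < b.n; j++) {
            Pair x = a.p[i], y = b.p[j];
            if (o == '+' && y.base == 0)      put(&r, x.base, x.off + y.off);
            else if (o == '+' && x.base == 0) put(&r, y.base, x.off + y.off);
            else if (o == '-' && y.base == 0) put(&r, x.base, x.off - y.off);
            else if (o == '-' && x.base == y.base)
                                              put(&r, 0, x.off - y.off);
            else r.top = true;                /* unrelated bases: TOP   */
        }
    return r;
}

int main(void) {
    AVal r1 = {0}, c12 = {0};
    put(&r1, 1, 0);            /* x1 holds its symbolic value (x1, 0)   */
    put(&c12, 0, 12);          /* the constant 12 is (x0, 12)           */
    AVal t = op('+', r1, c12); /* evaluates to {(x1, 12)}               */
    printf("top=%d (x%d,%ld)\n", t.top, t.p[0].base, t.p[0].off);
    return 0;
}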

The mappings ρ♭(u), λ♭(u) for a procedure g then can be characterised by the least solution of the following constraint system:

[R♭1µ 0]  R♭1µ(sg) ⊒ ⟨{xi ↦ {(xi, 0)} | xi ∈ X ∪ {x0}}, 0, ⊥⟩
[R♭1µ 1]  R♭1µ(v) ⊒ [[s]]♭(R♭1µ(u))            for edges (u, s, v) with s ∈ stm ∪ guards
[R♭1µ 2]  R♭1µ(v) ⊒ (R♭1µ(rf)) @♭ (R♭1µ(u))    for call edges (u, f(), v)

At procedure start, all registers xi ∈ X are mapped to their symbolic values {(xi, 0)} (constraint [R♭1µ 0]). The effect of the instruction decrementing the stack pointer is that a new stack frame, i.e. the set S♯g of local memory locations of procedure g, is allocated. Thus, we have:

[[x1 := x1 − c]]♭(⟨ρ♭, c′, λ♭⟩) = ⟨ρ♭, c, {a ↦ ⊤ | a ∈ [0, c]}⟩

The effect of the instruction incrementing the stack pointer is that the stack frame S♯g of procedure g is deallocated. Thus, the set of variables is restricted to registers only:

[[x1 := x1 + c]]♭(⟨ρ♭, c, λ♭⟩) = ⟨ρ♭, 0, ⊥⟩

The second constraint [R♭1µ 1] handles assignments to other registers, memory accesses and guards. For assignments, we have:

[[xi := ?]]♭(⟨ρ♭, c, λ♭⟩) = ⟨ρ♭ ⊕ {xi ↦ ⊤}, c, λ♭⟩
[[xi := t]]♭(⟨ρ♭, c, λ♭⟩) = ⟨ρ♭ ⊕ {xi ↦ [[t]]♭(ρ♭)}, c, λ♭⟩

with i ≠ 1. The effect of a memory read access instruction in procedure g on a state ⟨ρ♭, c, λ♭⟩ is given by:

[[xi := M[t]]]♭(⟨ρ♭, c, λ♭⟩) =
  ⟨ρ♭ ⊕ {xi ↦ ⊤G}, c, λ♭⟩                                  if [[t]]♭(ρ♭) ⊆ {x0} × Z
  ⟨ρ♭ ⊕ {xi ↦ ⋃{λ♭(c′) | (x1, c′) ∈ [[t]]♭(ρ♭)}}, c, λ♭⟩   if [[t]]♭(ρ♭) ⊆ {x1} × [0, c]
  ⟨ρ♭ ⊕ {xi ↦ ⊤}, c, λ♭⟩                                   otherwise

with i ≠ 1. The global top ⊤G describes the set of all possible global addresses: ⊤G = {x0} × Z. Since we do not track the values of global variables, in the case of a memory access to a global memory location register xi may be assigned every possible global value. If the evaluation of the memory access expression yields that a local variable (x1, c′) is addressed which belongs to the stack frame of the current procedure, its value is assigned to register xi. In all other cases the value of xi is over-approximated by the top element. For a memory write instruction the abstract effect function is defined by:

[[M[t] := xj]]♭(⟨ρ♭, c, λ♭⟩) =
  ⟨ρ♭, c, λ♭ ⊕ {c′ ↦ ρ♭(xj)}⟩                                               if {(x1, c′)} = [[t]]♭(ρ♭) and c′ ∈ [0, c]
  ⟨ρ♭, c, λ♭ ⊕ {c′ ↦ (λ♭(c′) ∪ ρ♭(xj)) | (x1, c′) ∈ [[t]]♭(ρ♭), c′ ∈ [0, c]}⟩  otherwise

with j ≠ 1. If the accessed memory location denotes a single local variable of the current stack frame, the value of register xj is assigned to the corresponding local memory location (a strong update). If the evaluation of the memory access expression yields a set of possibly accessed local memory locations, all their values are extended by the value of register xj (a weak update). In all the other cases, none of the local memory locations may receive new values. If an element (x1, c′) is found in [[t]]♭(ρ♭) with c′ ∉ [0, c], we issue a warning and abort the analysis.
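The case split between strong and weak updates can be sketched as follows; the dense representation of λ♭ and the toy value universe are assumptions for illustration:

#include <stdbool.h>
#include <stdio.h>

#define FRAME 16   /* local cells 0..FRAME-1 of the current frame */
#define VALS  32   /* toy universe of abstract single values      */

/* lambda: per local cell, a set of abstract values (as a bitset) */
static bool lam[FRAME][VALS];

/* [[M[t] := xj]]: addrs = the n possible (x1,c') targets of t;
 * val = the abstract value of xj (bitset)                        */
static void write_cell(const int *addrs, int n, const bool val[VALS]) {
    for (int i = 0; i < n; i++)
        if (addrs[i] < 0 || addrs[i] >= FRAME) {
            printf("warning: write outside the frame -> abort\n");
            return;
        }
    if (n == 1) {                       /* strong update: overwrite */
        for (int v = 0; v < VALS; v++) lam[addrs[0]][v] = val[v];
    } else {                            /* weak update: accumulate  */
        for (int i = 0; i < n; i++)
            for (int v = 0; v < VALS; v++) lam[addrs[i]][v] |= val[v];
    }
}

int main(void) {
    bool val[VALS] = {0}; val[13] = true;       /* xj = {13}        */
    int one[] = {8};         write_cell(one, 1, val);  /* strong    */
    int many[] = {4, 8, 12}; write_cell(many, 3, val); /* weak      */
    printf("cell 8 may hold 13: %d\n", lam[8][13]);
    return 0;
}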

Guards are used for restricting the sets of possible values of registers. Sets of possible values, however, can only be compared if they refer to the same base register. First we consider the guard xi ⋈ c. If ρ♭(xi) = {x0} × S and c ∈ Z, let S′ = {s ∈ S | s ⋈ c}. Then we set:

[[xi ⋈ c]]♭(⟨ρ♭, c′, λ♭⟩) = ⟨ρ♭ ⊕ {xi ↦ {x0} × S′}, c′, λ♭⟩

Likewise, for a guard xi ⋈ xj, if ρ♭(xi) = {xk} × S1 and ρ♭(xj) = {xk} × S2, then we set

S1′ = {s ∈ S1 | ∃ s2 ∈ S2 : s ⋈ s2}
S2′ = {s ∈ S2 | ∃ s1 ∈ S1 : s1 ⋈ s}

and define

[[xi ⋈ xj]]♭(⟨ρ♭, c′, λ♭⟩) = ⟨ρ♭ ⊕ {xi ↦ {xk} × S1′, xj ↦ {xk} × S2′}, c′, λ♭⟩

In all other cases, guards have no effect on the register valuation.
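A sketch of the two guard transformers; the representation of an abstract value as one base register with a small set of offsets is an illustrative assumption:

#include <stdbool.h>
#include <stdio.h>

#define N 8

/* abstract value {xk} x S: one base register plus a set of offsets */
typedef struct { int base; int n; long s[N]; } AVal;

static bool lt(long a, long b) { return a < b; }   /* the operator */

/* [[xi < c]]: keep S' = { s in S | s < c }, only if the base is x0 */
static void guard_const(AVal *v, long c) {
    if (v->base != 0) return;       /* not comparable to a constant */
    int m = 0;
    for (int i = 0; i < v->n; i++)
        if (lt(v->s[i], c)) v->s[m++] = v->s[i];
    v->n = m;
}

/* [[xi < xj]]: filter S1 against S2 if both refer to base xk       */
static void guard_regs(AVal *a, const AVal *b) {
    if (a->base != b->base) return; /* different bases: no effect   */
    int m = 0;
    for (int i = 0; i < a->n; i++) {
        bool keep = false;
        for (int j = 0; j < b->n; j++) keep |= lt(a->s[i], b->s[j]);
        if (keep) a->s[m++] = a->s[i];
    }
    a->n = m;                       /* filtering S2 is symmetric    */
}

int main(void) {
    AVal x2 = { .base = 0, .n = 3, .s = {2, 9, 40} };
    guard_const(&x2, 10);           /* guard x2 < 10: keeps {2, 9}  */
    for (int i = 0; i < x2.n; i++) printf("%ld ", x2.s[i]);
    printf("\n");
    AVal x3 = { .base = 1, .n = 2, .s = {4, 8} };
    AVal x4 = { .base = 1, .n = 1, .s = {8} };
    guard_regs(&x3, &x4);           /* guard x3 < x4: keeps {4}     */
    printf("remaining offsets of x3: %d\n", x3.n);
    return 0;
}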

Correctness

The correctness proof is based on a description relation ∆ ⊆ Γ × Γ♯ between concrete and abstract states. A description relation is a relation with the following property: s ∆ s♯1 ∧ s♯1 ⊑ s♯2 ⇒ s ∆ s♯2 for s ∈ Γ and s♯1, s♯2 ∈ Γ♯. Here, the concrete state ⟨ρ, f, λ⟩ is described by the abstract state ⟨ρ♯, c, λ♯⟩ ∈ Γ♯ if

• f = (xr, xr−1, . . . , x0) with xr−1 − xr = c;

• ρ(xi) ∈ γxr(ρ♯(xi)) for all xi;

• λ(L, a) ∈ γxr(λ♯(a − xr)).

Here, the concretisation γx replaces the tagging register x0 with G, while symbolic local addresses (x1, b) are translated into (L, x + b). Additionally, we require a description relation between the transformations induced by same-level concrete computations and abstract states from R♭1µ. Assume that T : Γ → 2^Γ is a transformation which preserves frame structures, i.e. f = f′ for all ⟨ρ′, f′, λ′⟩ ∈ T(⟨ρ, f, λ⟩). Then T is described by ⟨ρ♭, c, λ♭, X, M⟩ if for all ⟨ρ′, f′, λ′⟩ ∈ T(⟨ρ, f, λ⟩),

• f = (xr, xr−1, . . . , x0) where xr−1 − xr = c;

• ρ′(xi) ∈ {ρ(xj) + (G, a) | (xj, a) ∈ ρ♭(xi)} for all xi;

• λ′(L, a) ∈ {ρ(xj) + (G, b) | (xj, b) ∈ λ♭(a − xr)} for all a ∈ [xr, xr−1);

• ρ(xi) = ρ′(xi) for all xi ∈ X;

• if λ(L, b) ≠ λ′(L, b) for b ≥ xr−1, then (L, b) = ρ(xi) + (G, a) for some (xi, a) ∈ M.

Here, we have assumed that ρ(x0) always returns (G, 0). The description relation for transformers is preserved by function composition, execution of individual statements as well as least upper bounds. Therefore, we obtain by induction on the fixpoint iterates:

Theorem 8.

1. Let R1, R♯1 denote the least solutions of the constraint systems for the collecting semantics and abstract reachability, respectively. Then for every program point u, R1(u) ∆ R♯1(u).

2. Assume that S1, R♭1µ denote the least solutions of the constraint systems for the concrete effects of procedures and their side-effects, respectively. Then for every program point u, the transformer S1(u) is frame-preserving, and S1(u) ∆ R♭1µ(u).

Proof. We prove the first statement of the theorem by fixpoint induction on j. For j = 0, we have

R1j(u) = {⟨ρ, (Max + 1), λ⟩ | ρ(x1) = (L, Max + 1), ρ(xi) ∈ MG for xi ≠ x1, λ : MG → MG}

which sets the values of registers and global memory locations to arbitrary global values; initially the stack is empty and the stack pointer holds the maximal address (L, Max + 1). The abstract state R♯1j(u) = {⟨ρ♯, 0, ⊥⟩ | ρ♯(x1) = {(x1, 0)}, ρ♯(xi) = ⊤ otherwise} initialises the registers with arbitrary values, while the stack height is zero and no local memory locations are known. Thus the base case is satisfied, and we consider the inductive step. Now, we assume j > 0. Then, in the concrete semantics,

R1j(v) Γ is the union of the sets [[s]]j−1(R1j−1(u) Γ) for edges (u, s, v). Here, [[s]]j−1 is the transformer corresponding to edge label s from the (j − 1)th iterate. Accordingly, for the abstraction we have: R♯1j(v) Γ♯ is the union of the sets [[s]]♯j−1(R♯1j−1(u) Γ♯) for edges (u, s, v). Likewise, [[s]]♯j−1 is the transformer corresponding to edge label s from the (j − 1)th iterate. Thus, we have to prove for every edge (u, s, v) that ([[s]]j−1(R1j−1(u) Γ)) ∆ ([[s]]♯j−1(R♯1j−1(u) Γ♯)). Here, we omit the proof for the edge labels of assignments and guards, but consider the case of a procedure call s ≡ q(). Then, we have:

[[q()]]j−1(R1j−1(u) Γ) = S1j−1(rq)(R1j−1(u) Γ)  and  [[q()]]♯j−1(R♯1j−1(u) Γ♯) = R♯1j−1(rq) @♯ (R♯1j−1(u) Γ♯). Then, we have:

(S1j−1(rq))(R1j−1(u) Γ) ∆ R♯1j−1(rq) @♯ (R♯1j−1(u) Γ♯) by the induction hypothesis. Additionally, we have by the induction hypothesis for the transformers for rq:

S1j−1(rq) Γ ∆ R♯1j−1(rq) Γ♯. This completes the proof of the statement for the case of a procedure call, and by induction the first statement of the theorem holds. For proving the second statement of the theorem, we first have to show that the transformer S1(u) is frame-preserving. Since the only modifications that change the second component, i.e. f, of a state ⟨ρ, f, λ⟩ ∈ Γ are induced by the transformers for [[x1 := x1 − c]] and [[x1 := x1 + c]], respectively, the stack frame structure is preserved by definition. Likewise as for the first statement, we proceed by induction on the fixpoint iterates in order to prove the statement S1(u) ∆ R♭1µ(u). □

6.3 Enhancements

In this section we present some enhancements of our side-effect analysis in order to precisely handle a larger class of assembly programs.

Local Addresses Escaping to the Heap

Although the description of the collecting semantics excludes those programs where, e.g., local addresses may be written into global memory, our abstract approach is able to detect such situations. Therefore, the modifying potential (X, M) of a procedure can be enriched by an additional component η, where

η : (X ∪ {x0}) → 2^X

η is a mapping which assigns to each register xi the subset of registers xj such that xj-relative addresses may escape to the global memory if xi happens to hold a global address.

Example 6.5. Escaping Local Addresses
Assume we are given the effect description η ≡ x2 ↦ {x3, x5}, which is embedded at a call site with the following register valuation:

x1 ↦ {(x1, 8)}; x2 ↦ {(x0, 64)}; x3 ↦ {(x1, 4)}; x4 ↦ {(x0, 5)}; x5 ↦ {(x0, 3)}

Now, examining the value of register x2 at the call site, we obtain that this register refers to a global memory address, i.e. (x0, 64). This may allow us to conclude that a local address is written to global memory whenever one of the registers x3, x5 refers to a local address at this call site. In this example x3 refers to a local memory location, while register x5 refers to a global value. Consequently, embedding this effect at the given call site, we obtain that in this context a local address (x1, 4) may be saved at heap address (x0, 64). 

Provided an abstract value analysis which maps registers to some abstract values (ρ♭) and local memory locations to some abstract values (λ♭), the effect computation for η is given by:

η′(xi) = {xk ∈ X | ∃ (u, M[t] := xj, ·) ∈ Eg, ∃ z, z′ ∈ Z : (xk, z) ∈ Vµ(u)(xj) ∧ (xi, z′) ∈ [[t]]♭(Vµ(u))}
       ∪ {xj ∈ X | (u, f(), ·) ∈ Eg, (xi, z) ∈ Vµ(u)(xk), xk′ ∈ ηf(xk), (xj, z′) ∈ Vµ(u)(xk′)}

Now the modifying potential of a procedure g can be determined as the least solution of the constraint system:

[[g]]♭ ⊒ effg(µ)   for all procedures g, where the fixpoint iteration now starts with µ(g) = (∅, ⊥, ∅).

The component η serves as another soundness check. If a register xi may hold a global address, i.e. ρ♭(xi) contains an element (x0, z), and some xj ∈ η(xi) is found such that ρ♭(xj) contains a stack address, i.e. an element (x1, z′), then some reference to the stack may have escaped to the global memory. In this case, we flag a warning and again abort the analysis.
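This check can be sketched as follows, with Example 6.5 as the test input; the register indices and the boolean data model are illustrative assumptions:

#include <stdbool.h>
#include <stdio.h>

#define NREGS 8

static bool may_global[NREGS];   /* rho(xi) contains some (x0, z)  */
static bool may_stack[NREGS];    /* rho(xi) contains some (x1, z') */
static bool eta[NREGS][NREGS];   /* eta[xi][xj]: xj in eta(xi)     */

/* flag a warning if a stack reference may escape to global memory */
static bool escape_check(void) {
    for (int xi = 0; xi < NREGS; xi++)
        if (may_global[xi])
            for (int xj = 0; xj < NREGS; xj++)
                if (eta[xi][xj] && may_stack[xj]) {
                    printf("warning: a stack address may escape via "
                           "x%d to the address held in x%d\n", xj, xi);
                    return true;
                }
    return false;
}

int main(void) {
    /* Example 6.5: eta = x2 -> {x3, x5}; x2, x5 global, x3 local   */
    eta[2][3] = eta[2][5] = true;
    may_global[2] = may_global[5] = true;
    may_stack[3] = true;
    if (escape_check()) return 1;    /* the analysis would abort    */
    return 0;
}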

Range Information

Consider the following procedure, which solves a quite frequent problem: it allows initialising arbitrarily long arrays.

Example 6.6. Reference Parameter

f(int a[], int i){
  int j;
  for(j = 0; j < i; j++)
    a[j] = 0;
}

// f(int a[], int i){
00: stwu  r1,-32(r1)
04: stw   r3,24(r1)
08: stw   r4,28(r1)
...
// a[j] = 0;
18: lwz   r2,8(r1)
...

Here, the upper bound for iterating over the array is provided by the second parameter, i.e. register r4. Hence, the set of r3-relative write accesses of f cannot be bounded with the value analysis from the last section. More useful information can only be derived via relational value analyses, based on relational domains, such as polyhedra [39], simplices (cf. Chapter 8) or symbolic intervals [115]. Such domains are able to infer that the range of the modified memory is bounded by the parameter registers: r3 + [0, r4].

Unresolved Procedure Calls

So far we have assumed that all the targets of indirect calls are initially known. However, this might not be the case, e.g. for dynamically linked code (cf. Section 1.3). This complicates the computation of side-effects of procedures, as our approach relies on composing the side-effects of callees into the side-effect of the caller. Hence, for handling unknown procedure calls, we propose the following techniques such that our analysis still provides sound results.

One way of dealing with unresolved procedure calls is to make safe assumptions on the targets that are possibly called. In particular this is possible when we are given a fully statically linked executable. Assuming that call instructions only target addresses which are flagged as procedure starts, we can take the merged side-effect of all procedures as the side-effect for unresolved procedure call instructions. If we are given dynamically linked executables or assume that arbitrary addresses may serve as call targets, we have to either rely on the user providing appropriate side-effects for unknown call targets or assume the worst-case destruction potential of an unknown procedure call in order to still provide a sound analysis.

The results from Section 2.4 suggest that most of the unresolved procedure calls in our framework result from heap-related data flow. Therefore, e.g. the concept of recency abstraction of Reps et al. [8] might be promising in order to resolve most of the unknown procedure calls. (Recency abstraction for heap-allocated memory allows for sound points-to analyses with sufficient precision to resolve most of the virtual function calls. This technique distinguishes pointer addresses which stem from the same allocation site in order to allow for strong updates.) Still, the results of such value analyses rely on a side-effect analysis to survive procedure calls. The idea is to start a combined value and side-effect analysis which initially presumes that unresolved procedure calls do not have any destruction potential. This yields an optimistic under-approximation of the actual modifying potential of procedures. With this under-approximation we can conduct the value analysis, yielding an under-approximation of the program states. Examining these program states at yet unresolved procedure calls of any procedure f, we may obtain a set of targets of the call whose modifying potential can then be composed with that of f. This updated modifying potential may cause a re-computation of the values at the call sites of f. This mechanism leads to a recursive refinement of the side-effect and value analysis.

Summarising this section, we have integrated the following enhancements into our implementation:

• We track the values of global memory locations with static addresses.

• Our implementation provides support for unresolved procedure calls by assuming that they destroy all locals but not the organisational stack cells of the caller.

• We also track locals which may escape to the heap.

6.4 Experimental Results

Next we explore the quality of our side-effect analysis implementation, based on the same benchmark suite as presented in Section 2.4. Our aim is to assess how the side-effect analysis improves the precision of the control flow reconstruction (CFR). Therefore we compare a pessimistic control flow reconstruction (columns P CFR), where the information about the values of both local memory locations and local registers is invalidated after a procedure call, with an optimistic control flow reconstruction (columns O CFR) where we assume that procedure calls are well-behaving and do not influence the values of the caller's locals. Observe that in the pessimistic scenario we still assume that the organisational stack cells, the return address as well as the stack pointer are correctly handled. The results of our experiments are shown in the following tables:

Table 6.1: Benchmark Suite for CFR at Optimisation Levels O0 and O2

                                        |        Jumps        |        Calls         |
Program       Size   Instr    Procs    | bctr  P CFR  O CFR  | bctrl  P CFR  O CFR  | O Time
openSSL O0    3.8MB  769511   6708     |  163    129    129  |  1352      3     20  |   1283
thttpd O0     884kB  196493   1197     |   77     45     45  |   321      1     21  |    254
coreutils O0  3.9MB  852322   5671     |  431    271    271  |  1648      5    101  |   1281
gzip O0       0.7MB  166213   1076     |   79     48     48  |   310      1     20  |   1815
openSSL O2    2.9MB  613882   6232     |  150    106    106  |  1355      4     20  |   1489
thttpd O2     852kB  189034   1147     |   77     45     45  |   320      1     17  |    373
coreutils O2  629kB  830407   5372     |  424    265    265  |  1634      5    104  |   1159
gzip O2       0.7MB  162380   1026     |   83     51     51  |   309      1     20  |    272

Within this table we specify: the binary file size Size; the number of assembler instructions Instr; the number of procedures Procs; and the number of indirect jumps bctr and indirect calls bctrl. Columns P CFR and O CFR present the number of resolved indirect jumps and indirect calls, respectively, in the pessimistic and the optimistic scenario. In this context, resolved means that we found a set of addresses (not equal to ⊤) for the targets of the indirect jumps and calls, respectively. Column O Time lists the running time in seconds of our optimistic CFR.

The number of resolved indirect jumps (column bctr) in our pessimistic analysis setting (P CFR) is equal to the number of resolved indirect jumps in our optimistic analysis setting (O CFR). Therefore we conclude that the computation of the targets of indirect jump instructions does not span procedure calls. However, the target resolution of indirect call instructions for P CFR and O CFR differs dramatically. We conclude that the address computation for call targets involves procedure calls and local memory. In the next step we try to verify the following assumptions we made for CFR:

• the stack frame is correctly deallocated at procedure exit;
• at procedure start the value of the link register is stored on the stack and correctly restored at procedure exit;

• the values of the local registers are saved and restored when used within the callee.

Furthermore we identify those locals which are possibly written to the heap or returned via reference parameters. Therefore we extend the optimistic control flow reconstruction with our side-effect analysis. Our implementation of the value domain of the side-effect analysis uses strided intervals. The side-effect analysis serves as a soundness check: if it can verify that a procedure has no modifying potential, the optimistic control flow reconstruction analysis is sound. Table 6.2 presents the results of the side-effect analysis augmented with CFR:

Table 6.2: Benchmark Suite for CFR with Side-Effect Analysis

Program       Time   #escapes   #locals inval.   #array   #non verified regs   #SE free
openSSL O0    5826         75             2371       46                   61   5721 (85%)
thttpd O0     1502         17             1406       45                   53    713 (59%)
coreutils O0  5498         50             2225       94                  106   4137 (72%)
gzip O0       1223         31              403       17                   20    730 (67%)
openSSL O2    3777         36             2166       43                   59   5301 (85%)
thttpd O2     1380          7             1097       37                   41    675 (58%)
coreutils O2  5453         52             1877       87                  103   3912 (72%)
gzip O2       1014          5              367       15                   19    725 (70%)

Within this table we specify: the runtime in seconds of our analyser Time; the number of possibly escaping stack addresses found during side-effect analysis #escapes; and the number of local variables that had to be invalidated when embedding the effect of a procedure #locals inval. Since for the moment our analyser does not deal with relational information, it cannot precisely handle the situation where the bound for iterating over an array is provided as a parameter of a procedure. In that situation the analyser cannot exclude that organisational memory cells or the return address may be overwritten; we then issue a warning, but optimistically assume that only the locals of the calling procedure are influenced. Column #array denotes the number of procedures which have this form. Column #non verified regs lists the number of procedures for which our analyser could not verify that the stack pointer as well as the link register is correctly handled. It is not surprising that the number of procedures in column #array is correlated with #non verified regs.

The columns #array, #escapes and #non verified regs of this table list the number of places where unsoundness may be introduced: stack addresses escaping to the heap could be used to destroy the structure of the stack; equally, iteration over arrays with unknown upper bounds could overwrite organisational stack locations. Finally, unrestored local registers would violate the conventions of the ABI. However, the correct handling of the stack pointer and the link register could, with the exception of occurrences of potential array overruns (column #array), always be verified by our analysis. Column #locals inval. reports on the destruction potential of procedures, while column #SE free proves procedures to be side-effect free: these are around 50 to 85% of all the procedures in our benchmark programs.

An analysis which manages all the local memory locations of a procedure might seem too costly and thus impractical. Our benchmark programs, however, used stack frames of size at most 10080 bytes and thus require that at most 2520 locals are tracked by our analysis. Typically, however, the size of the stack frame ranges between 16 and 64 bytes. Observe that the runtime of the CFR enriched with our side-effect analysis is at most 4 times that of the CFR alone. Our side-effect analysis can be built on top of the optimistic CFR, where it can be used to discharge some assumptions.

An important task is to check a given executable for conformance to some technical standard. Such a check is required when certifying safety-critical software. Typically, such code is very restrictive in the use of pointers and does not use dynamic memory (cf. Section 1.1). The vehicle control program control [129] is such an example. This program is generated from SCADE and thus conforms to the DO-178B standard. Checking if the underlying executable adheres to such conventions may also contribute to identifying malicious code, where, e.g. through a stack-based buffer overflow, the return address on the stack frame is overwritten and consequently program execution resumes at the address specified by the attack. This behaviour is exhibited by our second benchmark program malicious.

Example 6.7. Overwriting Organisational Stack Locations
The C fragment describes a procedure f with two local variables, i.e. i and a. The for-loop iterates over array a and thereby overruns the array bound. In the corresponding PPC assembly, the memory write access in the loop, i.e. instruction 0x28, causes the organisational stack locations to be overwritten.

void f(){
  int i;
  int a[0];
  for(i = 0; i < 10; i++){
    a[i] = 1;
  }
}

00: stwu  r1,-24(r1)
// for(i = 0;
04: li    r0,0
08: stw   r0,8(r1)
0C: b     0x38
// a[i] = 1;
10: lwz   r0,8(r1)
14: mulli r9,r0,4
18: addi  r0,r1,8
1C: add   r9,r9,r0
20: addi  r9,r9,4
24: li    r0,1
28: stw   r0,0(r9)
// i++){
2C: lwz   r9,8(r1)
30: addi  r0,r9,1
34: stw   r0,8(r1)
// i < 10;
38: lwz   r0,8(r1)
3C: cmpwi cr7,r0,9
40: ble   cr7,0x10
// }}
44: addi  r1,r1,24
48: blr

 The following table presents the results when checking the assembly under analysis for conformance with the ABI.

Table 6.3: Benchmark Suite for Checking Adherence to the ABI

Program     Size    Instr    Procs   lr    sp    regs
control O0  633kB   139917     817   817   817   817
control O2  629kB   138589     817   817   817   817
malicious   104kB     2742     100    97    97    97

Within this table we additionally specify: the number of procedures for which the analyser could verify that the stack pointer is initially decremented by the same amount as it is incremented after the call (column sp); and the number of procedures for which the return address is correctly saved and used for the return (column lr). Additionally, the PPC ABI requires that a procedure saves the values of the local registers before it changes them and restores them before it returns (cf. Section 1.4); the number of procedures for which this convention could be verified is provided in column regs. As the results demonstrate, our side-effect analysis contributes to verifying certain security constraints as well as to detecting that some assumptions are violated. For instance, our analyser was able to detect that program malicious suffers from a buffer overrun.

Summary

We have presented a light-weight interprocedural analysis of assembly for inferring the side-effects of procedure calls onto the runtime stack. Such an analysis contributes to refining arbitrary intraprocedural analyses by identifying possible modifications in the stack frame of the caller through a procedure call. Our side-effect analysis also allows for enhancing control flow reconstruction of low-level code by retaining more information about the local variables of the caller, since smaller memory regions possibly written to by a procedure call must be invalidated. Finally, we are able to verify to what extent basic assumptions about the assembly under analysis are met, e.g. that the return address or the organisational stack locations are definitely not overwritten. By means of our benchmark programs we show that our approach scales well, dealing with more than 300,000 assembler instructions.

Chapter 7

Exploiting Alignment for WCET and Data Structures

In this chapter we present a static analysis of alignment properties. We tackle three different aspects of alignment information. Firstly, we infer for every memory access whether it is aligned on its natural boundary as specified by the length of the memory operand. Secondly, we speculate about cache hit and miss rates; this can be achieved by inspecting the memory access instructions in loops. And thirdly, we show how modulus information contributes to reconstructing complex data structures, such as arrays, from an executable. For each of the three application areas, related approaches are presented accordingly.

Introduction

According to the PPC ABI [125] each operand of a memory access instruction has a natural alignment boundary which is equal to the length of the operand. This means that the natural address of an operand is an integral multiple of the operand length. Thus, we call a memory operand aligned if it is aligned at its natural boundary; otherwise it is called misaligned. The alignment property means that an access to an operand of size m bytes at memory address a is aligned iff a mod m = 0. Moreover, aggregates assume the alignment of that component with the largest alignment factor. An array has the same alignment as its elements, while records may require padding in order to meet both size and alignment constraints. The size of such an aggregate is always a multiple of the alignment of the aggregate.
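Stated as code, the alignment criterion is a single modulo test; the sample accesses below are arbitrary illustrations:

#include <stdio.h>
#include <stdint.h>

/* an access of size m bytes at address a is aligned iff a mod m == 0 */
static int aligned(uint32_t a, uint32_t m) { return a % m == 0; }

int main(void) {
    struct { uint32_t addr, size; } acc[] = {
        { 0x1000, 4 },   /* word load, 4-byte aligned   */
        { 0x1002, 4 },   /* word load, misaligned       */
        { 0x1006, 2 },   /* halfword load, aligned      */
    };
    for (unsigned i = 0; i < sizeof acc / sizeof acc[0]; i++)
        printf("access of %u bytes at 0x%x: %s\n",
               (unsigned)acc[i].size, (unsigned)acc[i].addr,
               aligned(acc[i].addr, acc[i].size) ? "aligned" : "misaligned");
    return 0;
}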


Example 7.1. Pointers

int main(){
  int *p;
  int arr1[16], arr2[16], arr3[16];
  int i, j;
  if (j > 100)
    p = arr1;
  else if (j > 0)
    p = arr2;
  else
    p = arr3;
  i = 0;
  for(j = 0; j < 16; j++)
    i = i + *(p++);
  if (*p < 0)
    return 1;
  return 0;
}

....
// i = i + *(p++);
50: lwz   r9,16(r1)
54: lwz   r9,0(r9)
58: lwz   r0,12(r1)
5C: add   r0,r0,r9
60: stw   r0,12(r1)
64: lwz   r9,16(r1)
68: addi  r0,r9,4
6C: stw   r0,16(r1)
70: lwz   r9,8(r1)
74: addi  r0,r9,1
78: stw   r0,8(r1)
7C: lwz   r0,8(r1)
80: cmpwi cr7,r0,15
84: ble   cr7,0x50
....

In this example the pointer p is assigned one of 3 different arrays depending on the value of some local variable j. Finally the elements of the array which p points to are accessed.

Considering Example 7.1 the following three questions may arise:

Are all the memory accesses in the assembly aligned w.r.t. their natural boundaries?

Alignment information is useful in the case of SIMD memory architectures, where there are memory access instructions that require their memory operands to have certain alignments. In particular, research in the area of automatic vectorisation for SIMD units has to address this issue. SIMD memory architectures provide access only to contiguous memory items with additional alignment restrictions, while computations may access memory in any order which is neither adequately aligned nor contiguous; thus static and dynamic alignment checks are required. Simdization of loops [140] is only possible if all the memory references in the loop are aligned; therefore, in case of misaligned references, the loop is peeled until all references become aligned [74].

Larsen et al. [74] propose a congruence analysis to increase the number of aligned accesses, to analyse the predictability of memory references and thus to avoid misaligned memory accesses. This analysis assigns the memory reference instructions to clusters; then the loops are unrolled by the number of corresponding clusters. Pryanishnikov et al. [106] propose a pointer alignment analysis for C programs in order to avoid misaligned pointers, using techniques such as loop peeling or the insertion of data reorganisation operations to achieve proper alignment. Their context-sensitive interprocedural pointer alignment analysis associates a set of possible values modulo some fixed value k with each pointer variable at every program point. They consider a C pointer as aligned if its value is zero modulo the access size. Thus, in their analysis they track the possible values of pointers, i.e. the least significant bits of pointer values. For instance, the two least significant bits determine the values of the addresses modulo 2 and 4. Only if they do not succeed in statically computing alignment information do they make use of dynamic alignment checks in their framework. In conclusion, the vectorisation for SIMD architectures may profit from an alignment analysis which marks every memory access instruction as aligned or misaligned. In Section 7.2 we show how to tackle this particular alignment problem.

Which memory access will result in a cache hit or a cache miss?

In Example 7.1, we might be interested in properties of the particular memory accesses for the C construct *(p++) in the for-loop, i.e. instruction 0x54 of the corresponding PPC assembly. Hence, we will investigate whether the single memory accesses in the loop might cause a cache miss or a cache hit, respectively. According to, e.g., the PPC ABI [125], the alignment of memory accesses also has a crucial impact on their performance; consequently, WCET estimation may be influenced [4]. Only if memory operands are aligned can the best performance of a memory access instruction be obtained; otherwise one may have to cope with a heavy performance penalty.

Our alignment analysis has been introduced in the context of the SuReal project [129] with the goal of improving WCET analysis. An implementation of this approach is provided in the tool aiT [1]. In this WCET analysis framework, memory areas are determined first, i.e. the set of memory locations a memory access instruction may reference. By means of this kind of information all possible cache and pipeline states are computed for any program point in the control flow representation. Then a may-analysis provides information about all memory blocks that may be in the cache at a given program point, while a must-analysis determines all those memory blocks that must be in the cache at this program point for all possible program executions. Based on this information, taking pipeline effects and cache behaviour into account, the WCET is computed by solving an ILP for the given executable. For a deeper insight into this tool chain, refer to [132]. Finally, in order to obtain tighter bounds on the WCET estimates, a reliable prediction of cache hit rates and thus a precise alignment analysis is necessary. In Section 7.3 we demonstrate how such alignment information contributes to refining the cache analysis and consequently the WCET estimation.

Can we reconstruct aggregate data structures such as C records or arrays?

In the context of reverse engineering, reconstructing aggregate data structures from the code is an interesting problem. Such a coarsening of, e.g., the single array cells to the whole array may also sharpen our analysis results. For instance, in Example 7.1 our analysis for classifying local memory locations will yield a large set of local memory locations for the memory access at instruction 0x54. However, an analysis over all these single memory locations may be too costly, or we might not be interested in the values of the single array elements. Hence, at this program point we summarise the single memory locations to an aggregate structure, such that the rest of the analysis only takes into account that the memory access at instruction 0x54 references an aggregate structure starting at stack address r1 + 16 and consisting of 16 elements of 4 bytes each. We are interested in having a reasonably precise description of the memory access without keeping track of all the single memory locations referenced by it. Section 7.4 is dedicated to the problem of reconstructing aggregate data structures.

Contributions

In this chapter we apply the analysis of modular arithmetic from [90] to assembly and describe its range of use by improving WCET estimation and identifying aggregate data structures. In WCET computation, alignment properties play a role in two different aspects: memory operand alignment and burst access. We present how to check memory operand alignment as well as how to reason about cache bursting by means of our analysis of modular arithmetic. Furthermore we show how to exploit this alignment information in order to infer aggregate data structures like arrays from the assembly under analysis.

Overview

The structure of this chapter is as follows: Section 7.1 introduces the background of an analysis of modular arithmetic, which is the basis for inferring alignment information. In Section 7.2 we describe how memory operand alignment can be checked by means of an analysis of modular arithmetic. Section 7.3 addresses cache bursting, while Section 7.4 proposes a technique for inferring aggregate data structures like arrays from the assembly under analysis. This chapter concludes with Section 7.5 presenting experimental results. We have implemented this approach in the aiT framework [1] in order to improve on WCET estimation.

7.1 Alignment Analysis

The theoretical basis for our alignment analysis is provided in [90], where Müller-Olm and Seidl present an analysis of modular arithmetic. The authors introduce an interprocedural analysis on the residue class ring Z_{2^w} in order to infer all affine equalities between the program variables that are valid modulo powers of 2. We use information about divisibility modulo 2^w in order to state alignment properties of memory accesses.

Semantics

Here, we present the concrete semantics, i.e. (S2, R2), for our alignment analysis. The program class for this analysis is the following:

• We consider linear assignments and procedure calls.
• Guards are abstracted by non-deterministic branching.
• Temporarily we model memory access instructions via non-deterministic assignments.

Here, a program state is modelled by a (k + 2 + l)-dimensional column vector x = (1, x1, . . . , xk+1+l)^T ∈ {1} × Z_{2^w}^{k+1+l}. Each component xi, i > 0, of the vector x represents the value assigned to variable xi, while the extra 0th component equals 1. Assume that the variables take values from Z_{2^w}. The dimension of the vector space is k + 2 + l, with k the number of globals (volatile registers and global memory locations) and l the number of locals (non-volatile registers and local memory locations). The two additional dimensions result from the following facts:

• Corresponding to Section 3.3, we instrument every procedure q with a new local variable x1⁰ that stores the initial value of the stack pointer x1. This auxiliary variable provides a fixed reference point in order to uniquely address the locals of each procedure in the case of dynamic stack modifications.

• Moreover, we require an additional dimension in order to model the effects of affine assignments through linear transformations according to [88].

Every assignment xi := t of a linear term t = t0 + Σ_{j=1}^{k+1+l} tj · xj causes a linear transformation [[xi := t]] : 2^{Z_{2^w}^{k+2+l}} → 2^{Z_{2^w}^{k+2+l}} on the underlying set of program states. Its effect onto a single program state x can be described by multiplication of x with the following matrix:

                ( Ii                 0          )
[[xi := t]] =   ( t0  . . .  tk+1+l             )   (row i)
                ( 0                  Ik+1+l−i   )

with Ii the (i × i)-dimensional identity matrix over Z_{2^w}.
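To make the matrix view concrete, the following C sketch applies the transformation matrix of an affine assignment to a state vector; for w = 32 the wraparound of uint32_t arithmetic implements the ring Z_{2^32}. The dimensions and the chosen assignment are illustrative assumptions:

#include <stdio.h>
#include <stdint.h>

#define DIM 4   /* toy state (1, x1, x2, x3); components from Z_{2^32} */

/* multiply the (DIM x DIM) assignment matrix with state vector x;
 * uint32_t overflow implements arithmetic in the ring Z_{2^w}, w = 32 */
static void apply(uint32_t m[DIM][DIM], uint32_t x[DIM]) {
    uint32_t y[DIM] = {0};
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            y[i] += m[i][j] * x[j];
    for (int i = 0; i < DIM; i++) x[i] = y[i];
}

int main(void) {
    /* [[x2 := 8 + 4*x1]]: identity except for row 2 = (8, 4, 0, 0)    */
    uint32_t m[DIM][DIM] = {
        {1, 0, 0, 0},
        {0, 1, 0, 0},
        {8, 4, 0, 0},
        {0, 0, 0, 1},
    };
    uint32_t x[DIM] = {1, 12, 5, 7};  /* x1 = 12, x2 = 5, x3 = 7       */
    apply(m, x);
    printf("x1=%u x2=%u x3=%u\n",     /* x2 becomes 8 + 4*12 = 56      */
           (unsigned)x[1], (unsigned)x[2], (unsigned)x[3]);
    return 0;
}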

The matrix of this definition is from $\mathbb{Z}_{2^w}^{(k+2+l)^2}$. We only consider matrices where the entry at position $(0,0)$ is equal to 1 and the remaining entries in the 0th row are 0. Accordingly, we represent the effect of a non-deterministic assignment $[\![x_i := ?]\!]$ by the set $\{[\![x_i := c]\!] \mid c \in \mathbb{Z}_{2^w}\}$. Since linear transformations are closed under composition, the effects of procedures can be represented by sets of linear transformations of such program states. Hence, the instantiation of constraint system S, i.e. $S_2$, is now defined over the complete lattice $2^{\mathbb{Z}_{2^w}^{(k+2+l)^2}}$.

Constraint system $R_2$ is defined over the complete lattice $2^{\{1\} \times \mathbb{Z}_{2^w}^{k+1+l}}$. Here, the start valuation $T_0$ is instantiated with $\{1\} \times \mathbb{Z}_{2^w}^{k+1+l}$, i.e. we start with the full state space. It remains to define the application of a set of transformers $T$ to a single program state $x$:

$$T\,x = \{A\,x \mid A \in T\}$$

Next, we recall the basics of the analysis of modular arithmetic from [90] in order to present our abstraction of the concrete semantics.

Modular Arithmetic

Sets of program states are abstracted by generator sets for submodules of vectors from $\mathbb{Z}_{2^w}^{k+2+l}$ which consist of maximally independent generating vectors. In the following we write $\langle G \rangle$ for the submodule spanned by the vector set $G \subseteq \mathbb{Z}_{2^w}^{k+2+l}$. Since generator sets for a $\mathbb{Z}_{2^w}$-module $\subseteq \mathbb{Z}_{2^w}^{k+2+l}$ cannot contain more than $k+2+l$ linearly independent vectors, they provide an effective representation of modules of vectors. From such a generator set all valid affine equalities can be obtained by considering its dual basis, i.e. all those vectors which are orthogonal to the vectors from the submodule. The set of all valid affine equalities between the program variables can then be computed as the set of solutions of the homogeneous system of linear equations $g \cdot x = 0$ for $g \in G$, with $x$ the column vector of program variables. Henceforth we denote the $\mathbb{Z}_{2^w}$-module of solutions by $E_{align}$. In summary, $(\mathbb{Z}_{2^w}^{k+2+l}, \subseteq, \sqcup)$ forms a complete lattice, where submodules are ordered by subset inclusion and the least upper bound is given by $G_1 \sqcup G_2 = \langle G_1 \cup G_2 \rangle = \{g_1 + g_2 \mid g_i \in \langle G_i \rangle\}$.

Now we sketch the abstract semantics for our alignment analysis of assembly and show how this information contributes both to stating alignment properties and to reasoning about aggregate data structures.

Constraint system $S_2^\sharp$ (from [90]) describes the abstract effect computation of procedures q. The summary of each procedure is provided in the form of a generator set of transformation matrices from $\mathbb{Z}_{2^w}^{(k+2+l)^2}$:

$[S_2^\sharp 0]\quad S_2^\sharp(s_q) \sqsupseteq \langle\{I_{k+2+l}\}\rangle$
$[S_2^\sharp 1]\quad S_2^\sharp(v) \sqsupseteq \langle\{[\![x_i := 0]\!], [\![x_i := 1]\!]\}\rangle \cdot S_2^\sharp(u)$  for edge $(u, x_i := ?, v)$
$[S_2^\sharp 2]\quad S_2^\sharp(v) \sqsupseteq \langle\{[\![x_i := t]\!]\}\rangle \cdot S_2^\sharp(u)$  for edge $(u, x_i := t, v)$
$[S_2^\sharp 3]\quad S_2^\sharp(v) \sqsupseteq \mathsf{call}^\sharp(S_2^\sharp(r_f)) \cdot S_2^\sharp(u)$  for edge $(u, f(), v)$

with $I_{k+2+l}$ the identity matrix in $\mathbb{Z}_{2^w}^{(k+2+l)^2}$, and "·" denoting the multiplication of two sets of matrices. The abstract transformer $\mathsf{call}^\sharp$ passes the values of the globals to the execution of the called procedure f and initialises its local variables accordingly. Since by convention local variables are uninitialised at procedure entry, this effect can be modelled by additional non-deterministic assignments to the set of locals. Furthermore, the values of the globals have to be returned to the calling context and the values of the locals of the caller have to be restored. A formal definition of the transformer $\mathsf{call}^\sharp$ is provided in [88].

Now we set up an intraprocedural constraint system $R_2^\sharp$ (from [90]), whose values are generator sets of vectors from $\mathbb{Z}_{2^w}^{k+2+l}$:

$[R_2^\sharp 0]\quad R_2^\sharp(s_{main}) \sqsupseteq [\![x_1 := 2^4 x_1;\ x_1' := x_1]\!]\ \mathbb{Z}_{2^w}^{k+2+l}$
$[R_2^\sharp 1]\quad R_2^\sharp(v) \sqsupseteq ([\![x_i := 0]\!]\ R_2^\sharp(u)) \sqcup ([\![x_i := 1]\!]\ R_2^\sharp(u))$  for edge $(u, x_i := ?, v)$
$[R_2^\sharp 2]\quad R_2^\sharp(v) \sqsupseteq [\![x_i := t]\!]\ R_2^\sharp(u)$  for edge $(u, x_i := t, v)$
$[R_2^\sharp 3]\quad R_2^\sharp(s_q) \sqsupseteq [\![x_1' := x_1]\!]\ R_2^\sharp(u)$  for edge $(u, q(), \cdot)$
$[R_2^\sharp 4]\quad R_2^\sharp(v) \sqsupseteq \mathsf{enter}^\sharp(S_2^\sharp(r_q))\ R_2^\sharp(u)$  for edge $(u, q(), v)$

$[\![s; s']\!]$ denotes the sequential application of the abstract effects of the statements s and s'. Every transition to another program state induces a transformation of the module of vectors according to the edge label; this is realised by multiplication with the corresponding set of transformation matrices. The abstract transformer $\mathsf{enter}^\sharp$ passes the values of the globals to the caller and sets the locals to unknown values; its formal definition is provided in [88]. After applying the effect of a called procedure q, we thus arrive at a generator set of vectors describing the valuation of the globals of procedure q and the valuation of the locals of the caller.

The first constraint $[R_2^\sharp 0]$ expresses that at the entry point of main no information about the values of variables is provided. At the start node of every procedure q, we additionally state an equality between the stack pointer $x_1$ and the auxiliary reference point $x_1'$. In the PPC architecture [125], the stack pointer $x_1$ is initially 16-byte aligned, i.e. $x_1 \bmod 16 \equiv 0$. Therefore, we add the initial precondition $x_1 \bmod 2^4 \equiv 0$ at the start point $s_{main}$ of the program. In the next sections, we show how the analysis of modular arithmetic contributes to stating alignment properties.

7.2 Memory Operand Alignment

Now, let us consider memory access instructions $(u, x_i := M_m[t], v)$ and $(u, M_m[t] := x_i, v)$, respectively. From modular arithmetic we can use the fact that the validity of an equality relation modulo $2^w$ implies its validity for the moduli $2^1, \ldots, 2^{w-1}$. Hence the equalities for the lower moduli can be inferred from the equalities for the highest modulus $2^w$, and according to the techniques proposed in [90] it is not necessary to perform additional fixpoint computations for inferring affine equalities that are valid modulo smaller powers of two. Instead, an efficient subsumption test is presented in [90]. In summary, we have: a linear equality relation $e = 0$ is valid modulo $2^{w'}$ for $w' < w$ iff $2^{w-w'} \cdot e = 0$ is valid modulo $2^w$.

For checking memory operand alignment we want to identify for every memory access instruction whether it is aligned or not. The address expression of every instruction accessing $2^x$ bytes in memory has to be a multiple of $2^x$. This alignment property cannot be determined directly from the syntactic structure of the address expression t itself. Given a memory access instruction $(u, x_i := M_m[t], v)$ or $(u, M_m[t] := x_i, v)$, respectively, with $m = 2^x$, we have to test whether the equality relation $2^{32-x} \cdot t \equiv 0$ follows from the set of affine equalities $E_{align}$ valid at program point u. Therefore we define the function align which checks a memory access instruction for memory operand alignment. As parameters, align takes an affine address expression t whose alignment is to be checked, the number of accessed bytes $m = 2^x$, and the set $E_{align}$ of affine equalities valid at program point u. Recall that the parameter x is obtained from the name suffix of the instruction itself; for instance, $x = 2$ in case of a stw- or lwz-instruction.

$$\mathsf{align}(t, E_{align}, x) = (2^{32-x} \cdot t \in E_{align})$$
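As an illustration, the following sketch checks the $2^x$-alignment of an affine address expression, given a guaranteed power-of-two alignment for each variable occurring in it. It is a hypothetical, simplified stand-in for the subsumption test of [90]: it declares t aligned if the constant and every term are.

    def aligned(t0, terms, x):
        # t = t0 + sum of tj*xj; terms is a list of (tj, aj) pairs where
        # aj is a known alignment exponent, i.e. xj = 0 mod 2^aj.
        m = 1 << x
        return t0 % m == 0 and all((tj << aj) % m == 0 for (tj, aj) in terms)

    # Example 7.2: x9 = x1' + 4*x0 - 32 with x1' 16-byte aligned (a = 4)
    # and x0 unknown (a = 0); check 4-byte alignment (x = 2):
    print(aligned(-32, [(4, 0), (1, 4)], x=2))   # -> True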

The function align returns whether the memory access is aligned or not. To this end we check whether the $2^x$-byte alignment of the address expression t is already subsumed by the set of affine equalities $E_{align}$ valid at program point u. An efficient algorithm for this subsumption test is given in [90]. As a concrete application of checking memory operand alignment, consider the following example.

Example 7.2. Memory Operand Alignment
The piece of C code below assigns the integer value 3 to each element of an integer array a consisting of 100 integer elements. The corresponding PPC assembly is shown next to it. Assume we infer the valid affine relations in $\mathbb{Z}_{2^{32}}$.

    int i, a[100];
    for (i = 0; i < 100; i++)
        a[i] = 3;

    04: li r0,0
    08: stw r0,8(r1)
    0C: b 0x38
    10: lwz r0,8(r1)
    14: mulli r9,r0,4
    18: addi r0,r1,8
    1C: add r9,r9,r0
    20: addi r9,r9,4
    24: li r0,3
    28: stw r0,0(r9)
    2C: lwz r9,8(r1)
    30: addi r0,r9,1
    34: stw r0,8(r1)
    38: lwz r0,8(r1)
    3C: cmpwi cr7,r0,99
    40: ble cr7,0x10

Taking the store-instruction at address 0x28 as an example, we examine the memory operand alignment check. The instruction at address 0x28 corresponds to the memory access instruction $M_4[x_9] := x_0$ in our program representation. At program point 0x28 the affine register relations

$$E_{align} = (-16 - 4 \cdot x_0 - x_1 + x_9 \equiv 0) \wedge (-48 - x_1 + x_1' \equiv 0) \wedge (2^4 \cdot x_1' \equiv 0)$$

hold modulo $2^{32}$. Recall that initially, at the start point of the program (the entry point of main), we assume the 16-byte alignment of the stack pointer. Hence, initially the equality $2^4 \cdot x_1 = 0$ is introduced. For this example we assume that it is propagated up to the program point which corresponds to the address 0x28. Checking the memory access $M_4[x_9]$ at address 0x28 for memory operand alignment is achieved by the call $\mathsf{align}(x_9, E_{align}, 2)$. This means that for the memory access $M_4[x_9]$ at address 0x28, the relation $x_9 = x_1' + 4 \cdot x_0 - 32$ is checked for 4-byte alignment. We obtain: $(x_1' + 4 \cdot x_0 - 32) \bmod 4 \equiv (x_1' + 0 - 0) \bmod 4$. At this program point the auxiliary register $x_1'$ is 16-byte aligned (as the relation $2^4 \cdot x_1' \bmod 2^{32} \equiv 0$ states). Hence, this memory access is 4-byte aligned. Consequently, the memory access at address 0x28 is aligned at its natural boundary. □

In summary, each memory access instruction in the assembly analysis is furnished with an alignment check via the function align in order to identify misaligned memory operands. In the next section we describe the second form of alignment, which is related to the cache.

7.3 Cache Bursting

Due to loops, the same variable may be accessed several times within a short period of time. Cache memories provide faster access to recently referenced memory regions and thus speed up program execution. Despite this performance benefit at runtime, caches are rarely used in real-time applications, as they complicate a reliable prediction of the cache performance and thus a tight prediction of the WCET. In order to avoid the coarse assumption that every memory access results in a cache miss, techniques are required that provide tighter approximations of possible cache hit and miss rates. To demonstrate the importance of this topic we survey some approaches in this area; a good bibliographic survey is provided by [137]. Measurement-based approaches cannot provide absolute guarantees, since they are restricted to a specific test suite or rely on statistical averaging; static techniques such as cache miss equations [53] were introduced to overcome this limitation. Wilhelm et al. [4] propose an abstract interpretation-based methodology to predict the cache behaviour. Their approach determines for every program point what might be in the cache, and classifies which memory access instructions result in definite cache hits or potential cache misses. Additionally, they check whether possible cache conflicts may arise in the program. Their approach is implemented in the aiT framework. Another interesting approach to cache analysis is based on graph colouring: analogously to the graph colouring algorithms for register allocation, in the approach of Rawat et al. [108] each cache line is allocated to a memory variable such that different memory locations do not compete with one another for the same cache line.

Furthermore, consider the example where a loop iterates over the elements of an aggregate data structure, such as an array. Then adjacent memory cells, i.e. the individual array elements, are accessed in every iteration of the loop. Therefore, to achieve a good cache hit rate, not only the accessed memory cell is loaded into the cache but also its adjacent memory cells. For this reason caches are divided into cache lines, and the accessed memory cell and its adjacent memory cells are stored in the same cache line. Our goal is to refine the cache analysis from [137] by providing alignment information for memory accesses. In PPC, each cache line contains 32 bytes. Such a cache line is loaded into the cache as a contiguous bundle of eight 4-byte data blocks.

[Figure: a 32-byte cache line consisting of eight 4-byte blocks at offsets 0, 4, 8, 12, 16, 20, 24 and 28; addresses with mod 32 = 0 mark the cache line boundaries.]

In case of a memory access, if data is transferred from consecutive memory cells, all but the first memory access are extremely fast: most of the overhead incurred by the first access need not be repeated for the rest of the cache line. This performance technique is called cache bursting. The typical technique for cache analysis is to distinguish between the first access and all remaining accesses [132], which is primarily applied to separate both loop and call contexts.

Example 7.3. Burst Access
Here we revisit Example 7.1, where the single array cells are accessed through pointer arithmetic in a loop. We are interested in the alignment properties of the particular memory accesses for the C construct *(p++) in the for-loop, i.e. instruction 0x54 of the PPC assembly. Since we examine a 4-byte memory access, the loop has to be unrolled at most 8 times when checking for $2^5$-aligned memory accesses. Then we try to find alignment properties for every memory access $M_4[x_9]$ at address 0x54 after unrolling, as Table 7.1 illustrates:

Table 7.1: Burst Accesses

    Unrolling | Address Equalities mod 2^5              | Address Difference mod 2^5
    0         | 2^27 · x9 = 2^27 · x1 + 5 · 2^29        | –
    1         | 2^27 · x9 = 2^27 · x1 + 3 · 2^30        | 4
    2         | 2^27 · x9 = 2^27 · x1 + 7 · 2^29        | 8
    3         | 2^27 · x9 = 2^27 · x1                   | 12
    4         | 2^27 · x9 = 2^27 · x1 − 7 · 2^29        | 16
    5         | 2^27 · x9 = 2^27 · x1 − 3 · 2^30        | 20
    6         | 2^27 · x9 = 2^27 · x1 − 5 · 2^29        | 24
    7         | 2^27 · x9 = 2^27 · x1 − 2^31            | 28
    8         | 2^27 · x9 = 2^27 · x1 + 5 · 2^29        | –

For this purpose, we take the address equality in the 0th step of unrolling as the reference address equality. Next, we compute the difference by subtracting each address equality obtained in every step of unrolling (8 steps altogether) from this reference address equality. The constant differences we obtain (third column of Table 7.1) indicate that memory is accessed in steps of 4 bytes. A difference c means that this address differs from the reference address by c modulo $2^5$. Since we evaluate the equality relations with respect to the modulus $2^5$, we identify that every 8th memory access in this for-loop results in a cache miss. Additionally, we are able to infer the alignment factor with respect to a given cache line: since a cache line is 32 bytes long and assuming that the first access in the loop is a miss, we can infer that only every 8th memory access in this particular loop results in a cache miss. □
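The address differences of Table 7.1 can be recomputed with a few lines of code. The following sketch (hypothetical, hard-wired to the access size and cache line size of our PPC setting) marks those unrolled accesses that start a new cache line:

    CACHE_LINE = 32   # bytes per cache line on PPC
    ACCESS = 4        # bytes per stw/lwz access

    def burst_differences(unrollings=8):
        # Difference of the k-th unrolled access address w.r.t. the 0th one,
        # modulo the cache line size (cf. the third column of Table 7.1).
        for k in range(1, unrollings + 1):
            diff = (k * ACCESS) % CACHE_LINE
            status = "new cache line -> miss" if diff == 0 else "same line -> fast"
            print(f"unrolling {k}: difference {diff} ({status})")

    burst_differences()   # every 8th access wraps around to difference 0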

Since each memory operand missing the cache causes a time penalty, it is necessary to localise 32-byte alignment in our analysis. When loops and recursive functions are unrolled and the concrete values of registers are available, it becomes apparent which memory accesses are $2^5$-byte aligned and thus cause a cache miss. For the precise handling of burst accesses it is sufficient to perform partial unrolling of loops. In case of Example 7.1, where we examine a 4-byte memory access in a loop, the loop has to be unrolled at most 8 times, because for a word-aligned memory access 7 memory accesses in the loop are extremely fast, while the 8th access results in a cache miss and thus causes the loading of a new cache line. Inferring alignment properties is an important aspect of cache analysis, because this information contributes much to classifying memory accesses as cache misses or hits and consequently improves cache behaviour prediction [45].

7.4 Identifying Data Structures

Records are frequently used in C. In order to obtain precise static analysis results it is essential to distinguish the different field accesses to such an aggregate data structure [107]. However, it is difficult to recover information about program variables and especially their types. The most notable approaches in this area are, on the one hand, the Aggregate Structure Identification algorithm (ASI) by Ramalingam et al. [107] and, on the other hand, the Type-Based Decompilation approach by Mycroft [93]. Mycroft presents a unification-based algorithm for decompiling assembly to the high-level language C. His approach allows reconstructing the types of functions and of complex data types, such as C arrays and records, from assembly. Each machine instruction is associated with a set of type constraints resulting from the usage patterns given in the program. Then, by type inference, the overall type of procedures or aggregate data structures is determined. Ramalingam et al. [107] aim to decompose an aggregate data value into its non-overlapping components, depending on how they are accessed by the program. Their technique, called ASI, is applied for reverse engineering of COBOL or PL/1 programs in order to identify aggregate data structures. From the data access patterns given by the program, constraints are generated. These constraints are then solved by unification. The result of this process is a so-called equivalence DAG [107], representing both the structure of the whole aggregate and the relationships among the different components of this aggregate. In their tool CodeSurfer/x86 [10], Reps et al. apply this technique to reconstruct aggregate data types from x86 assembly. There they combine their value set analysis (VSA) with ASI to identify both variables and, in the case of aggregates, their individual components. VSA is necessary to generate suitable data access constraints from the program. Moreover, their value set analysis is based on the disassembly and the information provided by IDAPro, which only tracks statically known memory addresses and stack pointer offsets. Hence, initially at analysis start, all the local memory locations of a procedure are summarised into a single memory location. Then an aggregate decomposing algorithm is used to distinguish the different components of an aggregate structure and to improve the precision of the analysis.

In contrast to the approach of Reps et al., we do not start with a single memory block abstracting all the variables of a procedure. Based on the address expression, we try to combine the single memory locations identified during analysis into a larger data structure. Single components are only combined into more complex data structures when the number of identified memory locations in a procedure exceeds a certain threshold. For instance, when iterating over the single elements of an array of length 100, we are not interested in tracking the array cells separately, but consider the whole array as one abstract memory location during the rest of the analysis. In order to infer such aggregate structures, we use the information provided by the variable difference analysis from Chapter 3 and the value analysis introduced in Chapter 6 to overapproximate the address expression and the values of the registers and memory locations involved in the memory access expression under consideration.

The following examples illustrate the issues and our strategy in recovering aggregate structures from a given assembly.

Example 7.4. Aggregate Structures
The C program initialises the two components of a local array of records of type struct x, where the a-component is initialised with value 1 and the b-component with value 2. The corresponding PPC assembly is given below; instructions 0x10 through 0x44 correspond to the two array accesses in the C program. Instruction 0x28 and instruction 0x44 are responsible for updating the a-component and the b-component, respectively, of the single array elements.

    struct x {
      int a;
      int b;
    };

    int main(){
      struct x ar[4];
      int i;
      for(i=0; i<4; i++){
        ar[i].a=1;
        ar[i].b=2;
      }
    }

    // ar[i].a=1;            // ar[i].b=2;
    10: lwz r0,8(r1)         2C: lwz r0,8(r1)
    14: mulli r9,r0,8        30: mulli r9,r0,8
    18: addi r0,r1,8         34: addi r0,r1,8
    1C: add r9,r9,r0         38: add r9,r9,r0
    20: addi r9,r9,4         3C: addi r9,r9,8
    24: li r0,1              40: li r0,2
    28: stw r0,0(r9)         44: stw r0,0(r9)

An (equality-relation) analysis of the memory write access at address 0x28 yields that memory is accessed at position $8 \cdot y_8 + x_1 + 12$; at instruction 0x44 memory is accessed at $8 \cdot y_8 + x_1 + 16$. The local variable $y_8$ corresponds to the local i in the corresponding C program. An interval analysis over the assembly infers the interval [0, 3] for variable $y_8$. When evaluating the memory access expressions and factoring in the alignment factor (of 4 bytes) of these two store-instructions (addresses 0x28 and 0x44, respectively), we arrive at the address expression $x_1 + 8[12, 44]$ for instruction 0x28 and the address expression $x_1 + 8[16, 48]$ for instruction 0x44. Here, the stack pointer offsets of the memory addresses $x_1 + 8[12, 44]$ (denoting the memory addresses $\{x_1+12, x_1+20, x_1+28, x_1+36, x_1+44\}$) and $x_1 + 8[16, 48]$ (denoting the memory addresses $\{x_1+16, x_1+24, x_1+32, x_1+40, x_1+48\}$), respectively, are represented by strided intervals [7]; both address expressions have a stride of 8. The set of stack pointer offsets suggests that these memory accesses denote a local data structure. Next, we investigate the stack pointer offsets to provide information about the layout of these local addresses. Therefore, we first consider the intersection of the two stack pointer offsets. The intersection of the two strided intervals 8[12, 44] and 8[16, 48] is empty, and thus we conclude that these two memory accesses denote disjoint memory locations. Moreover, we observe that these two strided intervals are intertwined. So, we try to combine these two stack address expressions in order to reconstruct the layout of an aggregate data structure consisting of two components. We determine the greatest alignment factor m with respect to which one of these address expressions $t_i$ ($i \in \{1, 2\}$) is aligned, i.e. the greatest m such that $\deg(2^m \cdot t_i) < 1$. In case of our example, we obtain an alignment factor of 8. As we are given a data structure kept on the stack, we

only further investigate the stack pointer offsets. Thus, we have:

$$t_1:\ (8 \cdot y_8 + 12) \bmod 2^3 \equiv 4$$
$$t_2:\ (8 \cdot y_8 + 16) \bmod 2^3 \equiv 0$$

Finally, we arrive at: at instruction 0x28 memory is accessed in steps of 8 bytes with an offset of 4 (given as the residue of $t_i \bmod 2^m$) with respect to the accessed address space, i.e. $x_1 + \{12, 20, 28, 36, 44\}$. At instruction 0x44 memory is accessed in steps of 8 bytes with an offset of 0 with respect to the accessed address space, i.e. $x_1 + \{16, 24, 32, 40, 48\}$. This information suggests a record of length 8 bytes consisting of two components: the first component is accessed at offset 0, while the second component starts at offset 4. The size of each component can be deduced from the name suffix of the instruction; in case of the stw-instruction its size is 4 bytes. □

Just like [107] and [93], we are unable to distinguish int[4][2] from int[8] (cf. Example 7.6) or struct x with two components a and b from two local integer variables int a; int b; which are allocated as adjacent memory locations.

Example 7.5. Records

    struct x {
      int a;
      int b;
    };

    int main(){
      struct x y;
      y.a=1;
      y.b=2;
    }

    ...
    // struct x y;
    // y.a=1;
    0C: li r0,1
    10: stw r0,8(r1)
    // y.b=2;
    14: li r0,2
    18: stw r0,12(r1)
    ...

Our (variable difference) analysis is able to classify the memory write instructions at addresses 0x10 and 0x18 as accessing the local variables $y_8$ and $y_{12}$, respectively. However, we do not try to relate these two memory locations as the components of an aggregate data structure, i.e. as the a-component and the b-component of struct x in the corresponding C program. Only for memory accesses in loops may we succeed in reconstructing aggregate data structures. □

Example 7.6. Arrays
The following C program fragment shows the initialisation of some elements of a 2-dimensional array. The corresponding PPC assembly, compiled with gcc at optimisation level O1, is provided below. In this example the compiler optimises the inner loop such that the whole array initialisation is performed by only 3 loop iterations (cf. instruction 0x2C), by accessing the array elements accordingly. The memory write instructions at addresses 0x14, 0x1C and 0x24 exhibit this addressing mode.

    int arr[3][6];
    // arr[i][j]=i+j;
    for(i=0; i<3; i++){
      for(j=0; j<6; j=j+2){
        arr[i][j] = i+j;
      }
    }

    0C: li r9,0
    10: addi r11,r1,8
    14: stw r9,0(r11)
    18: addi r0,r9,2
    1C: stw r0,8(r11)
    20: addi r0,r9,4
    24: stw r0,16(r11)
    28: addi r11,r11,24
    2C: cmpwi cr7,r9,2
    30: addi r9,r9,1
    34: bne 0x14

Our analysis provides the following information for the single memory accesses:

    0x14 (M[x11])      : x11      = x1 + 24[8, 56]
    0x1C (M[x11 + 8])  : x11 + 8  = x1 + 24[16, 64]
    0x24 (M[x11 + 16]) : x11 + 16 = x1 + 24[24, 72]

When we apply our algorithm for reconstructing an adequate layout for this data structure, we obtain that the loop addresses a data structure of length 64 bytes. Considering all these strided intervals (for the stack pointer offset), we can infer that this 64-byte data structure is accessed in steps of 8 bytes. Although we are able to determine precisely which memory locations are accessed, we fail to reconstruct the 2-dimensional structure, due to the memory access patterns generated by the compiler. □

Summarising Section 7.4, we present our reconstruction algorithm. We only try to relate memory accesses occurring in loops. We are given a set of (affine) address expressions T which either denote stack pointer offsets, i.e. are of the form $x_1 + c$, or global addresses, i.e. are of the form c, for some constant $c \in \mathbb{Z}$. In order to determine which memory addresses should be related to each other, we search for those address expressions $t_i, t_j \in T$ whose memory locations do not overlap but are intertwined. Therefore we factor the value information (e.g. provided by a strided interval analysis) into the address expressions $t_i, t_j$ and require the following two constraints (cf. the sketch after this definition):

$$si(t_i) \cap si(t_j) = \emptyset \qquad\qquad i(t_i) \cap i(t_j) \neq \emptyset$$

Here, the two functions si and i provide the representation of the value set information as strided intervals (function si) and as plain intervals with stride 1 (function i), respectively. Let $T'$ denote the set of all those address expressions $t_i \in T$ which pairwise satisfy the two constraints above. Then we determine the greatest alignment factor m with respect to which one of the address expressions $t' \in T'$ is aligned, i.e. the greatest m such that $\deg(2^m \cdot t') < 1$, and assume an aggregate of length m bytes. All other $t_i' \in T'$ with $t_i' \neq t'$ and $n = \deg(2^m \cdot t_i')$, $n \geq 1$, denote the components of this aggregate: the residue n of this modulo operation for an address expression $t_i'$ indicates a component which starts at offset n with respect to the beginning of the aggregate.
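The following sketch is a hypothetical, simplified rendering of this algorithm (stack offsets only, with the alignment factor taken as the common stride); it recovers the record layout of Example 7.4 from the strided stack-pointer offsets:

    def si_values(stride, lo, hi):
        # Concretise a strided interval stride[lo, hi] (cf. [7]).
        return set(range(lo, hi + 1, stride))

    def reconstruct(offsets):
        # offsets: list of (stride, lo, hi) stack-pointer offsets of accesses
        # in a loop. They must be pairwise disjoint as value sets (si) but
        # overlapping as plain intervals (i). Returns the assumed record
        # size m and the component offsets within one record.
        sets = [si_values(*o) for o in offsets]
        for a in range(len(sets)):
            for b in range(a + 1, len(sets)):
                assert not (sets[a] & sets[b]), "value sets overlap"
                assert min(sets[a]) <= max(sets[b]) and \
                       min(sets[b]) <= max(sets[a]), "not intertwined"
        m = max(o[0] for o in offsets)           # greatest alignment factor
        return m, sorted(o[1] % m for o in offsets)

    # Example 7.4: accesses x1 + 8[12,44] and x1 + 8[16,48]
    print(reconstruct([(8, 12, 44), (8, 16, 48)]))   # -> (8, [0, 4])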

7.5 Experimental Results

The analysis of modular arithmetic has been implemented and integrated into the WCET analysis tool aiT. The improvement consists in enriching the value analysis (an interval analysis) of aiT with modulus information. In this section we first present our results on some test programs. Then we contrast the value analysis without modulus information with the one enriched with modulus information on some real-world example programs, i.e. program fragments from avionics and automotive control software.

Proprietary Test Programs
We conducted a test series on a 2.2 GHz Opteron machine equipped with 16 GB of physical memory. Our analysis terminates quickly, within a few seconds. Only the program edn takes considerably more time and memory than the other example programs; it makes extensive use of local arrays, reference parameters and pointer arithmetic. All the example programs are compiled with gcc at optimisation level O0 for the PPC architecture. Table 7.2 shows some practical results of our implementation.

Table 7.2: Test Suite for Programs at Optimisation Level O0

    Program  | Procs | Instr | MemAcc | Unknown | Locals | Globals | PosUnal | T(s)    | M(MB)
    prime    | 11    | 250   | 48     | 34      | 14     | –       | 1       | 5.27    | 358
    edn      | 16    | 1153  | 442    | 68      | 374    | –       | 62      | 2024.53 | 3276
    standard | 7     | 116   | 48     | –       | 17     | –       | –       | 0.91    | 93
    switch   | 1     | 88    | 25     | –       | 25     | –       | –       | 1.66    | 74
    mod      | 1     | 33    | 16     | 2       | 14     | –       | 2       | 0.97    | 70
    brake    | 6     | 212   | 112    | 12      | 44     | 17      | –       | 4.54    | 382
    top      | 14    | 1822  | 827    | 4       | 17     | –       | –       | 0.96    | 0.06

In this table we specify: the number of procedures (Procs) and instructions (Instr); the number of memory access instructions (MemAcc); the number of memory accesses our analysis was not able to classify as referring to local or global memory (Unknown); the number of potential local and global variables (Locals and Globals, respectively); the number of memory accesses which may be misaligned (PosUnal); and the time consumption in seconds (T(s)) and memory requirements in MB (M(MB)) of the analyser. The first two programs, prime and edn, are example programs from the WCET suite of the aiT framework. The program standard is generated from C code which conforms to the DO-178B standard. The programs mod and switch are home-made example programs used to test alignment properties, and finally the two programs brake and top are generated via SCADE for our SuReal demonstrator [129]. When we only consider assembly conforming to the DO-178B standard, all the memory accesses to local and global variables are precisely identified, as e.g. the table entry for program standard shows. As a consequence, the alignment property of these memory accesses can be precisely classified as aligned or misaligned. For the programs mod and switch, we also take local arrays and pointers into account. In case

of mod, two memory accesses are marked as possibly misaligned. We obtain this imprecise result due to a store-instruction for whose memory access expression the value analysis failed to infer a precise bound; in this case we discard all information about the local variables encountered during the analysis.

Real-world Test Programs
Next we present some results where the value analysis from aiT enriched with modulus information was applied to real-world example programs from automobile and avionics supplier companies. The tests were conducted by AbsInt. The testing environment was the same as described in [132]: a 2.6 GHz Core2Duo machine with 8 GB of RAM. The given binaries were compiled with gcc at a very conservative optimisation level (even below O0). Note that the example programs are quite small, since complex programs were split in order to obtain WCET results for single program fragments. The runtime of the single WCET analyses ranged from a few seconds up to at most one hour. Tables 7.3 and 7.4 draw a comparison between the old value analysis by AbsInt [45] without modulus information (WCETold) and our implementation of the value analysis enriched with modulus information (WCETnew), by means of the computed WCET measured in clock cycles.

The provided programs of Avionics Supplier 1 are built up of many sub-tasks which are called extensively by the main program. The modulus information of our enriched value analysis allows some memory accesses to be identified exactly instead of yielding an interval of possible accesses; thus, in the pipeline analysis, more accesses to the same cache line can be combined. Additionally, these test programs are characterised by containing many loops. For the second test suite of the White+Yellow Products Manufacturer, which contains an infinite loop in which different sub-tasks are initiated in every pass via global variables and switch tables, we conclude: modulus information helps to identify cache misses, since our enriched value analysis is able to classify pointers into larger data structures exactly. Thus, the cache analysis improves in classifying the accesses within a cache line as cache misses or hits. The two test cases from Table 7.4 show i386 code in the setting of control software for avionics.

Table 7.4 shows that the analysis WCETnew, compared to the analysis WCETold, contributes much to improving the WCET computation. Performing these two value analyses on our example programs is quite fast, yielding analysis results in at most 3 minutes. Especially in the case of loops, precision is lost when nothing is known about the values of the loop counter variable: the subsequent cache and pipeline analyses then have no information about the memory accesses in the loop. By providing at least relational and

Table 7.3: Influence on WCET Estimation (I)

    Program            | WCETold    | WCETnew
    1. Avionics Supplier 1
    Init STM           | 290        | 251
    Alpha + Beta       | 1980       | 1846
    Init GRP           | 1274       | 1143
    Kinematic UPD      | 6734       | 6414
    Fuel ESTM          | 8757       | 8331
    Navigation UPD     | 9680       | 9662
    Omega              | 26845      | 25485
    U-Switching        | 33880      | 33225
    V-Monitor          | 88766      | 85324
    Navigation Cons    | 312083     | 300566
    v116               | 144414     | 99125
    v152               | 435039     | 428510
    Calculation PA     | 69325      | 66171
    2. White+Yellow Products Manufacturer
    Modus 0            | 2971972    | 2950245
    Modus 2            | 742600     | 738825
    Modus 3            | 308528     | 306583
    Modus 4            | 1288638    | 1283103
    Modus 8            | 1964928    | 1957582
    3. Engine Manufacturer
    Task 1300          | 1820885    | 1640018
    Task 1312          | 2000988    | 1755024
    4. Avionics Supplier 2
    Safety T1          | 5019785    | 4096592
    Safety T2          | 5090933    | 4154095
    Safety T3          | 13390593   | 10436261
    Safety T4          | 17342887   | 13818679
    Safety T5          | 13326107   | 10376288
    Safety T6          | 15193027   | 11812494
    Safety T7          | 5015843    | 4093736
    Safety T8          | 6819351    | 5483274
    Safety T9          | 7747796051 | 5491645936
    5. Automotive
    Task 5ms           | 201456     | 156000
    6. Avionics Supplier 3
    5D                 | 254910     | 130267

Table 7.4: Influence on WCET Estimation (II)

    Program | WCETold | WCETnew        Program | WCETold | WCETnew
    7. 2000 com                        8. 2001 mon
    T1      | 149499  | 146835         T1      | 148496  | 148029
    T2      | 167020  | 163392         T2      | 146709  | 146129
    T3      | 162275  | 159304         T3      | 155778  | 155354
    T4      | 160402  | 156648         T4      | 149753  | 148752
    T5      | 152935  | 150148         T5      | 150716  | 150150
    T6      | 161960  | 158380         T6      | 148961  | 148393
    T7      | 165990  | 162483         T7      | 156558  | 156159
    T8      | 160767  | 156078         T8      | 153410  | 152415
    T9      | 151697  | 148229         T9      | 150885  | 150466
    T10     | 164247  | 160363         T10     | 149718  | 149153
    T11     | 165008  | 161566         T11     | 156475  | 156076
    T12     | 160936  | 156464         T12     | 153858  | 152753

modulus information, more precise statements about cache hits and misses become possible. As all the experimental results show, the prediction of the cache behaviour and of the WCET has improved. Furthermore, the practical experiments indicate that the approach scales well for industrial-sized applications.

Summary

Approaching the area of WCET computation, we have proposed an alignment analysis. The alignment of memory accesses has a crucial impact on their performance. Information about the contents of memory locations and processor registers, in combination with modular linear equality relations, improves the cache analysis in the aiT framework [4]: more precise information can be inferred on whether a memory access will result in a cache miss or hit. Furthermore, with our approach it is not necessary to deal with contexts and a call-string length greater than zero. We showed how alignment information contributes to reconstructing aggregate data structures such as arrays. Finally, we presented the improvements of our alignment analysis for the WCET computation.

Chapter 8

Range Information

In this chapter we present an alternative approach to interprocedurally inferring linear in- equality relations. We propose an abstraction of the effects of procedures through convex sets of transition matrices. In the absence of conditional branching, this abstraction can be characterised precisely by means of the least solution of a constraint system. In order to handle conditionals, we introduce auxiliary variables and postpone checking them until after the procedure calls. In order to obtain an effective analysis, we approximate convex sets by means of polyhedra. Since our implementation of function composition uses the frame representation of polyhedra only, we rely on the subclass of simplices to obtain an efficient implementation.

Introduction

The analyses presented so far only allow for precisely dealing with programs containing affine equality relations between the program variables. For this kind of programs our analyses infer precise information about the values of registers as well as memory locations and their equality relations (even enriched with modulus information). Now, we extend our program class to programs that contain linear inequality relations.

Example 8.1. Array Accesses
This example allows initialising arbitrarily long arrays by providing the bound for iterating over array a as an additional parameter of procedure f, i.e. parameter i.

    // f(int a[], int i){
    f(int[] a, int i){          00: stwu r1,-32(r1)
      int j;                    ...
      for(j=0; j<i; j++)        // a[j]=0;
        a[j]=0;                 18: lwz r2,8(r1)
    }                           1C: mulli r2,r2,4
                                20: mr r9,r2
                                ...

The range of the memory access M[r9] at instruction 0x30 cannot be determined with a simple value analysis such as those from Section 6.2.


More useful information about this memory access can only be derived via relational domains such as polyhedra [39]. By means of polyhedra we can infer the range of the memory access M[r9] at instruction 0x30 to be 0 ≤ r9 ≤ r4. □

In order to deal precisely with this program class, we extend our framework by an analysis of linear inequality relations. For instance, the domain of polyhedra provides a convex abstraction of linear inequalities. However, since polyhedra tend to be complex and operations on polyhedra are therefore very expensive, we present a polyhedral analysis relying on the frame representation only. Additionally, we suggest a new domain: simplices. Before presenting our approaches in detail, we survey related approaches in the area of geometric domains for inferring certain kinds of equality and inequality relations between the program variables.

Related Work Cousot and Halbwachs [39] present an intraprocedural analysis of linear inequalities based on an abstraction of the collecting semantics by means of convex polyhedra. They draw upon both the frame and the constraint representation of polyhedra to perform the subsumption test and widening on polyhedra. More precise widening strategies on convex polyhedra are provided in [5]. Based on this approach an interprocedural analysis can be obtained by relating input and output states of a procedure call by means of linear inequalities. This leads to convex transition invariants on program variables before and after the procedure call.

[Figure 8.1 shows procedure main (program points 0, 1, 2) with the edges x2 := 2 from 0 to 1 and f() from 1 to 2, and procedure f (program points 3, 4, 5) with an edge x2 := x2 from 3 to 5, as well as the path x2 := 2 ∗ x2 from 3 to 4 followed by x2 := x2 − 2 from 4 to 5.]

Figure 8.1: Example Program for Transition Invariants

In Figure 8.1, the procedure call f() reaching program point 2 can be described by the transition invariant $x_2' = x_2 \vee x_2' = 2 \cdot x_2 - 2$. The approximation of this invariant by polyhedra leads to a complete loss of information. Although transition invariants work in several practical cases [81], they seem too restrictive for a precise interprocedural analysis. On the other hand, polyhedra tend to be complex, and consequently operations on polyhedra are expensive [114]. This induces the demand for more efficient representations of convex sets. For instance, Voronkov et al. [111] avoid conversions between the two representations to speed up their analysis. In contrast to our approach,

they operate on the constraint representation and carry along the frame representation in parallel, which is built up incrementally. In [84], octagons, an efficient subclass of polyhedra, are introduced. Within this approach at most two program variables per inequality are allowed, with restrictions on the coefficients permitting only inequalities of the form ±x ± y ≤ c. Another approach [83] only permits inequalities of the form x − y ≤ c and ±x ≤ c, respectively, which represent polyhedra as difference-bound matrices. Simon et al. [124] abandon all restrictions on the coefficients of the considered pair of program variables. In contrast, Clarisó et al. [27] propose to approximate polyhedra with octahedra: they allow any number of program variables, but the coefficients of the inequalities are restricted to ±1 or 0. A quite general approach is introduced by Sankaranarayanan et al., who try to avoid widening as done in [39]; this is achieved by introducing generic inequality templates and solving systems of inequalities on the coefficients of the templates [116]. Only recently there has been much interest in efficiently combining abstract domains of numerical and relational information. Examples are interval polyhedra [19], a generalisation of convex polyhedra using interval linear inequalities, and logahedra [62], which restrict coefficients to powers of two, so that only the exponent of a coefficient needs to be represented; with respect to expressiveness, logahedra fall in between the domains of octagons [84] and TVPI [124]. Logozzo and Fähndrich [80] introduce a weakly relational numerical domain they call pentagons, which with respect to cost and precision lies between the abstract domains of intervals [39] and octagons [84]. Within the pentagons domain, properties of the form x ∈ [a, b] ∧ x ≤ y can be expressed, where x, y denote program variables and a, b rationals. A reduction between intervals and affine equalities is provided by the abstract domain of subpolyhedra [75] in order to provide a convex abstraction of complex linear inequalities. Although subpolyhedra provide a real alternative to polyhedra w.r.t. computational complexity, they rely on additional information at join points to achieve the same expressiveness as polyhedra. In order to analyse programs modelling digital filters, Feret [47, 16] proposes the abstract domain of ellipsoids. By means of ellipsoidal constraints, restricted polynomial relations can be expressed which allow for abstractly relating filter input and output behaviour; however, this domain only yields meaningful analysis results if interval information is taken into account.

All the existing approaches are based on restricted classes of constraint systems, though their frame representation can easily become exponential. This does not hold for simplices. Simplices are convex polyhedra whose frames are restricted to at most k linearly independent elements and a base vertex, where k is the dimension of the underlying vector space. Thus, simplices form a subclass of polyhedra, and their frame representation has approximately the same size as the constraint representation. This is the reason why we started to experiment with approximations of convex sets by means of simplices. Based on this approximation, subsumption testing reduces to solving k + 1 systems of at most k linear equations; thus, our subsumption test can be performed in polynomial time.
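To illustrate why subsumption on simplices is cheap, consider testing whether a single frame element lies inside a simplex: this amounts to solving one linear system for barycentric coordinates, and a subsumption test repeats this for each of the k + 1 frame elements. The sketch below is a hypothetical illustration for bounded simplices only (ignoring rays and lines):

    import numpy as np

    def in_simplex(p, vertices):
        # Does point p in Q^k lie in the simplex spanned by k+1 vertices?
        # Solve for barycentric coordinates l: V l = p with sum(l) = 1,
        # then check that all coordinates are non-negative.
        V = np.array(vertices, dtype=float).T           # columns are vertices
        A = np.vstack([np.ones(V.shape[1]), V])         # first row: sum(l) = 1
        b = np.concatenate([[1.0], np.asarray(p, dtype=float)])
        l = np.linalg.solve(A, b)                       # one (k+1)x(k+1) system
        return bool(np.all(l >= -1e-9))

    # the triangle with vertices (0,0), (4,0), (0,4):
    print(in_simplex([1, 1], [(0, 0), (4, 0), (0, 4)]))   # -> True
    print(in_simplex([5, 5], [(0, 0), (4, 0), (0, 4)]))   # -> False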

Additionally, as an interprocedural alternative to transition invariants, we propose an approach based on convex sets of transition matrices to capture the effects of procedures. For our intraprocedural reachability analysis, the program states are abstracted by convex sets of vectors describing the values of the program variables. The transformation of program states is described by linear transition matrices, similar to [88]. Within our approach we compute a finite representation of the effect of procedures [89], i.e. convex sets of transition matrices, which can be embedded into the reachability analysis. In the absence of conditionals, this abstraction can be characterised precisely by means of the least solution of a suitable constraint system. Since conditional branching cannot be represented by linear transformations, conditionals obviously cannot be evaluated on convex sets of transition matrices. Therefore, we introduce an auxiliary variable for each condition and postpone checking the conditions until after the procedure call. In order to obtain an effective analysis, we follow the standard approach of approximating convex sets by means of convex polyhedra [39]. Our composition operation for polyhedra relies on the frame representation of polyhedra, i.e. sets of vertices, rays and lines. In order to avoid the expensive continual conversion between the two polyhedral representation forms [39], we resort to the frame representation alone. Testing for subsumption as well as computing the union of two convex polyhedra is reduced to linear programming problems [39]. In order to infer the linear inequalities for a program point, the conversion to the constraint representation is deferred to the end of the computation of the procedure effects, or to the very end of the analysis.

Contributions We make the following contributions:

• We present an analysis of linear inequalities working on the frame representation of polyhedra only.
• We introduce the new domain of simplices, a subclass of polyhedra.
• We suggest an alternative way to deal with guards.

Overview The structure of this chapter is as follows: Section 8.1 introduces the collecting semantics of our analysis, before we describe its abstraction based on convex sets, i.e. the convex abstraction, in Section 8.2. Section 8.3 shows how this approach is extended to deal with conditionals interprocedurally by introducing auxiliary variables. For an effective analysis we approximate convex sets by means of polyhedra or simplices, as described in Section 8.4. The chapter concludes with Section 8.5, presenting experimental results.

8.1 Semantics

In this section we present an instantiation of the generic concrete semantics (S, R) from Section 2.5. Here, we consider simplified programs only:

• We assume that conditional branching is abstracted by non-deterministic branching.
• We model memory access instructions via non-deterministic assignments.

• We consider procedure calls q() and assignments $x_i := t$ to variables, where t is a linear term.

In the concrete semantics for the analysis of linear inequalities, a program state is modelled by a $(k+1)$-dimensional column vector $x = (1, x_1, \ldots, x_k)^T \in \{1\} \times \mathbb{Q}^k$. In the following we assume that the variables take values from the field $\mathbb{Q}$. Each component $x_i$, $i > 0$, of the vector $x$ represents the value assigned to variable $x_i$. Analogously to the concrete semantics presented in Chapter 7, the extra 0th component, which is always equal to 1, allows modelling the semantic effects of affine assignments through linear transformations. Also analogously to Chapter 7, we consider linear transformers from $2^{\mathbb{Q}^{k+1}} \to 2^{\mathbb{Q}^{k+1}}$; however, now the domain of variable valuations is given by $\mathbb{Q}^{k+1}$. Hence, the instantiation of constraint system S, which we denote by $S_3$, now works over the complete lattice $2^{\mathbb{Q}^{(k+1)^2}}$, while the instantiation of constraint system R, which we denote by $R_3$, is defined on elements from $\{1\} \times \mathbb{Q}^k$. Additionally, we instantiate the set of start states $T_0$ in constraint system $R_3$ with $\{1\} \times \mathbb{Q}^k$. Again, applying a set of transformers T from this domain to a program state x is defined by:

$$T\,x = \{A\,x \mid A \in T\}$$

Next, we present the abstraction of the concrete semantics. We propose an approach based on convex sets of transition matrices to capture the effects of procedures. For our intraprocedural reachability analysis, the program states are abstracted by convex sets of vectors describing the values of the program variables.

8.2 Convex Abstraction

In order to infer linear inequality relations interprocedurally, we want to construct a precise abstraction of our collecting semantics. The abstraction should provide for every program point u (hopefully all) linear inequalities which are valid for all program states reaching u. Geometrically, a linear inequality specifies a half-space; the conjunctive combination of these half-spaces results in a convex set of vectors. This may serve as a justification for abstracting the concrete semantics by means of convex sets, the convex abstraction. Formally, let $C(\mathbb{Q}^{k+1})$ denote the set of all convex subsets of vectors over $\mathbb{Q}^{k+1}$. On convex sets, the greatest lower bound ⊓ is given by the set-theoretical intersection, while the least upper bound ⊔ is given by the convex hull of the set-theoretical union:

$$\langle X_1 \rangle \sqcup \langle X_2 \rangle = \langle X_1 \cup X_2 \rangle \quad \text{with } X_i \subseteq \{1\} \times \mathbb{Q}^k,\ i = 1, 2.$$

The set $C(\mathbb{Q}^{k+1})$ of all convex subsets of vectors together with the subset relation ⊆ as partial ordering (denoted by ⊑ here) forms a complete lattice. Now we define the abstraction $\alpha : 2^{\mathbb{Q}^{k+1}} \to C(\mathbb{Q}^{k+1})$ by $\alpha(X) = \langle X \rangle$, where $\langle X \rangle$ denotes the least convex set containing $X \subseteq \{1\} \times \mathbb{Q}^k$. The convex set $\langle X \rangle$ can be obtained from X by applying the convex hull operation to X:

$$\langle X \rangle = \left\{ \sum_{i=1}^{n} \lambda_i x_i \;\middle|\; n \in \mathbb{N} \wedge 0 \leq \lambda_i \wedge \sum_{i=1}^{n} \lambda_i = 1 \wedge x_i \in X \right\}$$

Clearly, α commutes with arbitrary unions and therefore is an abstraction. Within our interprocedural approach, the effect of assignments is modelled by a set of linear transformations. Each of these transformations can be represented by a matrix, similar to [88]. As a matrix corresponds to a $(k+1)^2$-dimensional vector, the abstraction α is also applicable to sets of matrices. Thus, the abstract effect $[\![x_i := t]\!]^\sharp$ of a linear assignment $x_i := t$ results in the convex hull of the single $(k+1)^2$-dimensional vector obtained from $[\![x_i := t]\!]$. In the case of a non-deterministic assignment, every possible constant value of $\mathbb{Q}$ could be assigned to the program variable. This effect is described by the following convex set of transition matrices:

$$[\![x_i := ?]\!]^\sharp = \left\langle \left\{ \begin{pmatrix} I_i & 0 \\ \lambda\ 0 \cdots 0 \\ 0 & I_{k-i} \end{pmatrix} \;\middle|\; \lambda \in \mathbb{Q} \right\} \right\rangle$$

In order to approximate the convex abstraction of the effect of procedure calls, we apply the abstraction to the constraint system $S_3$. The resulting system $S_3^\sharp$ is given by:

$[S_3^\sharp 0]\quad S_3^\sharp(s_f) \sqsupseteq \{I\}$  for $s_f$ the start point of procedure f
$[S_3^\sharp 1]\quad S_3^\sharp(v) \sqsupseteq [\![x_i := t]\!]^\sharp \circ^\sharp S_3^\sharp(u)$  for edge $(u, x_i := t, v)$
$[S_3^\sharp 2]\quad S_3^\sharp(v) \sqsupseteq [\![x_i := ?]\!]^\sharp \circ^\sharp S_3^\sharp(u)$  for edge $(u, x_i := ?, v)$
$[S_3^\sharp 3]\quad S_3^\sharp(v) \sqsupseteq S_3^\sharp(r_q) \circ^\sharp S_3^\sharp(u)$  for edge $(u, q(), v)$

with $I \in \mathbb{Q}^{(k+1)^2}$ the identity matrix. In the abstraction, we have used the convex composition $\circ^\sharp$ on convex sets of linear transformations, which is defined by element-wise matrix multiplication composed with the convex hull operation:

$$\langle \mathcal{C}_1 \rangle \circ^\sharp \langle \mathcal{C}_2 \rangle = \langle \{C_1 C_2 \mid C_i \in \mathcal{C}_i\} \rangle \quad \text{with } \mathcal{C}_i \subseteq \mathbb{Q}^{(k+1)^2}$$

The least solution of $S_3^\sharp$ provides the abstract effect of a procedure, represented as a convex set of transformation matrices. We only consider those matrices where the entry at position (0, 0) is equal to 1 and the remaining entries in the 0th row are all 0. We denote the components of this least solution by $S_3^\sharp(u)$ (u a program point).
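On frame representations, the convex composition reduces to the pairwise products of the generators, since matrix multiplication is bilinear: the hull of all products of elements of two hulls is spanned by the products of the generators (cf. Proposition 1 below). A minimal sketch, assuming both arguments are given by finite generator lists:

    import numpy as np

    def compose_frames(C1, C2):
        # Convex composition on frames: the generators of <C1> o# <C2> are
        # exactly the pairwise products of the generators of C1 and C2,
        # because matrix multiplication is bilinear in its two arguments.
        return [A @ B for A in C1 for B in C2]

    I = np.eye(3)
    A = np.array([[1, 0, 0], [2, 1, 0], [0, 0, 1]], dtype=float)
    print(len(compose_frames([I, A], [I, A])))   # -> 4 generator matrices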

Accordingly, we can describe the reachability analysis in the convex abstraction by the constraint system $R_3^\sharp$, obtained from the concrete constraint system $R_3$ by applying the abstraction α:

$[R_3^\sharp 0]\quad R_3^\sharp(s_{main}) \sqsupseteq \{1\} \times \mathbb{Q}^k$  for $s_{main}$ the start point of procedure main
$[R_3^\sharp 1]\quad R_3^\sharp(s_q) \sqsupseteq R_3^\sharp(u)$  for $(u, q(), \cdot)$ a procedure call
$[R_3^\sharp 2]\quad R_3^\sharp(v) \sqsupseteq [\![x_i := t]\!]^\sharp \cdot^\sharp R_3^\sharp(u)$  for edge $(u, x_i := t, v)$
$[R_3^\sharp 3]\quad R_3^\sharp(v) \sqsupseteq [\![x_i := ?]\!]^\sharp \cdot^\sharp R_3^\sharp(u)$  for edge $(u, x_i := ?, v)$
$[R_3^\sharp 4]\quad R_3^\sharp(v) \sqsupseteq S_3^\sharp(r_q) \cdot^\sharp R_3^\sharp(u)$  for $(u, q(), v)$ a call edge

Analogously to the abstract composition operator $\circ^\sharp$, the abstract application operator $\cdot^\sharp$ is defined by element-wise application composed with the convex hull operation. The least solution of the system $R_3^\sharp$ again exists and provides us with a convex set of vectors for every program point u. For convenience, we denote the components of this least solution by $R_3^\sharp(u)$ (u a program point). First, we show the safety and precision of the convex abstraction. For this purpose we verify that the abstraction commutes with function application and with composition of the linear transformations.

Proposition 1. For every set of vectors $X \subseteq \{1\} \times \mathbb{Q}^k$ and all sets of transformation matrices $\mathcal{C}, \mathcal{C}_1, \mathcal{C}_2 \subseteq \mathbb{Q}^{(k+1)^2}$, the following equalities hold:

1. $\langle \{Cx \mid x \in X,\ C \in \mathcal{C}\} \rangle = \langle \{Cx \mid x \in \langle X \rangle,\ C \in \langle \mathcal{C} \rangle\} \rangle$

2. $\langle \{C_1 C_2 \mid C_i \in \mathcal{C}_i\} \rangle = \langle \{C_1 C_2 \mid C_i \in \langle \mathcal{C}_i \rangle\} \rangle$

For the constraint systems $R_3^\sharp$ and $S_3^\sharp$ we therefore obtain from Proposition 1, together with the fixpoint transfer lemma:

Theorem 9. For every program point u and every procedure f of the program with return point $r_f$, the following holds:

1. $R_3^\sharp(u) = \alpha(R_3(u)) = \langle R_3(u) \rangle$
2. $S_3^\sharp(r_f) = \alpha(S_3(r_f)) = \langle S_3(r_f) \rangle$

This theorem means that the least fixpoints of the constraint systems $S_3^\sharp$ and $R_3^\sharp$ precisely characterise the convex abstraction α applied to the least fixpoints of the constraint systems $S_3$ and $R_3$ of the collecting semantics.

In general, the least solutions of the abstract constraint systems will not be reached after finitely many fixpoint iterations. In order to arrive at practical algorithms for computing safe (over-)approximations of the least solutions of these constraint systems, we must therefore rely on effective representations of convex sets together with effective abstract composition and application operations, as well as effective implementations

of subsumption and union. In order to speed up the fixpoint iteration, a widening operator must be provided. By now we have specified the convex abstraction and verified its correctness and precision. However, our abstraction of the effects of procedures only works for non-deterministic branching, i.e. in the absence of inequality guards, and a linear inequality analysis is not very significant without the handling of conditionals. Thus, in the following we present a technique that enhances the base framework to handle linear inequality guards.

8.3 Linear Inequality Guards

Clearly, the reachability analysis can be enhanced to deal with linear inequality guards $(b \geq 0)$. As in [39], the effect of such a guard is interpreted as the intersection with the corresponding half-space of state vectors which satisfy the guard:

$$[\![b \geq 0]\!]^\sharp X = \{x \in X \mid b\,x \geq 0\}$$

where, for the linear term $b = b_0 + b_1 x_1 + \ldots + b_k x_k$ and the state $x = (1, x_1, \ldots, x_k)^T$, $b\,x$ denotes the value $b_0 + b_1 x_1 + \ldots + b_k x_k$ of b in x.

When analysing programs with conditional branching, the effects of procedures can no longer be described by sets of linear transformations. Since the constraint system $S_3$ only speaks about linear transformations, conditionals cannot easily be integrated into our concrete semantics. (Since the convex abstraction approximates the effect of a procedure by a convex set of transition matrices, it is not clear how the intersection with a half-space could be performed on a matrix.) The constraint system $R_3$ for the reachability analysis, however, can be extended to conditionals by introducing the following constraint:

[R35] R3(v) ⊇ {x ∈ R3(u) | bx ≥ 0} for a guard (u, (b ≥ 0), v)

In the convex abstraction, the idea for handling conditionals interprocedurally therefore is to postpone their evaluation during the computation of procedure effects until the reachability analysis. Until then, we store the value of each condition in an auxiliary variable, which can later be checked for non-negativity. Thus, we extend the original semantics by introducing fresh variables, one for each guard. Assuming that the guards are numbered $k+1, \ldots, k+g$, the auxiliary variables are denoted by $x_{k+1}, \ldots, x_{k+g}$. This leads to an extension of every program state by g extra components. All the auxiliary variables are initially set to 0. During the effect computation we replace the j-th conditional $(b \geq 0)$ with the assignment $x_{k+j} := b$. The condition $x_{k+j} \geq 0$ can then be checked during the reachability analysis. In this reachability phase, however, we no longer compute with convex sets of transformations, but just with convex sets of program states.
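The program transformation itself is purely syntactic. The following sketch uses a hypothetical CFG encoding (it is not the thesis implementation) and replaces every guard edge by an assignment to a fresh auxiliary variable:

    def postpone_guards(edges, k):
        # edges: (u, label, v) triples; a guard is ('guard', b) with b the
        # coefficient list of the linear term, an assignment is
        # ('assign', i, b). The j-th guard becomes x_{k+j} := b.
        out, j = [], 0
        for (u, label, v) in edges:
            if label[0] == 'guard':
                j += 1
                out.append((u, ('assign', k + j, label[1]), v))
            else:
                out.append((u, label, v))
        return out, j   # transformed edges, number g of auxiliary variables

    # x1 := 2 on edge (0,1); guard x1 - 1 >= 0 on edge (1,2); k = 1 variable:
    edges = [(0, ('assign', 1, [2, 0]), 1), (1, ('guard', [-1, 1]), 2)]
    print(postpone_guards(edges, k=1))
    # -> ([(0, ('assign', 1, [2, 0]), 1), (1, ('assign', 2, [-1, 1]), 2)], 1)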

As the value of each condition is just stored in an auxiliary variable, conditionals can now be treated within our effect computation. Accordingly, we modify the constraint system $S_3^\sharp$ as follows:

$[S_3^\sharp 4]\quad S_3^\sharp(v) \sqsupseteq [\![x_{k+j} := b]\!]^\sharp \circ^\sharp S_3^\sharp(u)$  for a guard $(u, (b \geq 0), v)$

where $(b \geq 0)$ denotes the j-th conditional. Clearly, every feasible program execution path of the original program is also a feasible execution path of the transformed program, but not necessarily vice versa. Thus, our postponed evaluation of guards introduces a safe overapproximation of the concrete semantics. Due to the extension of every program state, constraint $[R_3^\sharp 0]$ must be adapted to handle conditionals. We obtain:

$[R_3^\sharp 0]\quad R_3^\sharp(s_{main}) \sqsupseteq \{1\} \times \mathbb{Q}^k \times \{0\}^g$

This constraint shows that at the start point of procedure main every program state is possible in which all auxiliary variables are equal to 0. There are two natural choices for scheduling the evaluation of the postponed guards (xk+j ≥ 0) during the reachability analysis. The first alternative is to schedule their evaluation directly after each procedure call. Then the constraint system R3♯ is modified as follows:

[R3₄♯]  R3♯(v) ⊒ (S3♯(rq) ∘ R3♯(u)) ∩ {(1, x1, . . . , xk+g) | ∀j : xk+j ≥ 0}   for (u, q(), v)

[R3₅♯]  R3♯(v) ⊒ {x ∈ R3♯(u) | bx ≥ 0}   for a guard (u, (b ≥ 0), v)

The modified constraint [R3₄♯] describes the postponed evaluation of guards after each procedure call, whereas the additional constraint [R3₅♯] expresses the direct evaluation when a conditional is visited. As a second alternative, we may postpone the evaluation of guards even further during the reachability analysis, in order to perform a single check for every program point u just before the valid linear inequalities for u are read off. To this end, we use the original constraint [R3₄♯]. Furthermore, we replace constraint [R3₅♯] with corresponding assignments to the auxiliary variables:

[R3₅♯]  R3♯(v) ⊒ [[xk+j := b]]♯ (R3♯(u))   for a guard (u, (b ≥ 0), v)

where (b ≥ 0) denotes the jth conditional. Finally, we introduce extra unknowns R3′♯(u) for each program point u which are meant to receive the final analysis results. For these, we have the extra constraint:

[R3′♯]  R3′♯(u) ⊒ R3♯(u) ∩ {(1, x1, . . . , xk+g) | ∀j : xk+j ≥ 0}   for every program point u

The latter alternative may lose precision in comparison to an analysis based on an immediate evaluation of guards, because more execution paths are admitted. A first comparison between the two alternatives is given in Section 8.5. In the case of an analysis over integer variables, however, the whole second analysis can be performed within the field Q, up to the final condition evaluation. Thus, we already obtain a tight integer solution if the final round of intersections is performed by an ILP solver.

8.4 Representing Convex Sets

So far, we have introduced a framework for an interprocedural analysis for inferring linear inequalities. In order to arrive at practical analysis algorithms, it remains to choose suitable effective representations for convex sets, which support the necessary operations as well as a widening operation to enforce termination of the fixpoint iteration.

Convex Polyhedra

For this purpose we focus on the subset of C(Qᵏ⁺¹) of convex polyhedra [39], denoted by P. For our approach, we find it convenient to use the frame representation of polyhedra. This means that a polyhedron F is represented as a triple F = ⟨V, R, L⟩ where V denotes a finite set of vertices, R is a finite set of rays and L is a finite set of lines.

[Figure: a polyhedron in Q², consisting of the vertex set {V0, V1, V2, V3} and the ray set {R0, R1}, i.e. the polyhedron ⟨{V0, V1, V2, V3}, {R0, R1}, ∅⟩.]

As mentioned in Section 8.1, we use projective space within our vectors. Thus, the extra 0th component of a vector is always 1 for vertices and 0 for rays or lines. The set of points represented by ⟨V, R, L⟩ is given by:

[[⟨V, R, L⟩]]♯ = { Σ_{i=0}^q λi Vi + Σ_{i=0}^r µi Ri + Σ_{i=0}^s ηi Li | q, r, s ≥ 0 ∧ λi, µi ≥ 0 ∧ Σ_i λi = 1 }

with V = {V0, . . . , Vq}, R = {R0, . . . , Rr} and L = {L0, . . . , Ls}. In order to use polyhedra as an effective representation of convex sets of transition matrices in the constraint system S3♯, we must provide algorithms for composition, union and widening as well as an effective test for subsumption on polyhedra. We introduce the polyhedral composition ∘ᴾ as an abstraction of ∘♯, in order to easily express the composition on the frame representation of polyhedra.

Composition.

Let Fi = ⟨Vi, Ri, Li⟩, i = 1, 2, denote the frame representations of two polyhedra of transition matrices. The polyhedral composition F = F1 ∘ᴾ F2 results in the frame F defined by the triple ⟨V, R, L⟩, where

V = V1 ∘ V2
R = (V1 ∘ R2) ∪ (R1 ∘ R2) ∪ (R1 ∘ V2)
L = (L1 ∘ V2) ∪ (L1 ∘ R2) ∪ (L1 ∘ L2) ∪ (V1 ∘ L2) ∪ (R1 ∘ L2)

Here, ∘ denotes the element-wise multiplication of two sets of matrices. By construction we obtain:

Proposition 2. The result of the polyhedral composition is a superset of the convex composition:

[[F1 ∘ᴾ F2]]♯ ⊒ [[F1]]♯ ∘♯ [[F2]]♯

The other direction ⊑ is not necessarily valid in the presence of rays and lines. If the frame consists of vertices only, the polyhedral composition ∘ᴾ is equivalent to the convex composition ∘♯.
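To make the polyhedral composition concrete, consider the following Scala sketch (our own illustration, not the VoTUM implementation; the names Mat, Frame, matMul, mulSets and composeP are ours, and Double stands in for the exact rational arithmetic an actual implementation requires). It computes F1 ∘ᴾ F2 on the frame representation:

type Mat = Vector[Vector[Double]]          // a (k+1) x (k+1) transition matrix

// frame representation <V, R, L> of a polyhedron of transition matrices
final case class Frame(vertices: Set[Mat], rays: Set[Mat], lines: Set[Mat])

// ordinary matrix product
def matMul(a: Mat, b: Mat): Mat =
  Vector.tabulate(a.length, b.head.length) { (i, j) =>
    a(i).indices.map(l => a(i)(l) * b(l)(j)).sum
  }

// element-wise multiplication of two sets of matrices (the operation "∘" above)
def mulSets(a: Set[Mat], b: Set[Mat]): Set[Mat] =
  for (x <- a; y <- b) yield matMul(x, y)

// F = F1 ∘P F2, following the defining equations for V, R and L above
def composeP(f1: Frame, f2: Frame): Frame = Frame(
  vertices = mulSets(f1.vertices, f2.vertices),
  rays     = mulSets(f1.vertices, f2.rays) ++ mulSets(f1.rays, f2.rays) ++
             mulSets(f1.rays, f2.vertices),
  lines    = mulSets(f1.lines, f2.vertices) ++ mulSets(f1.lines, f2.rays) ++
             mulSets(f1.lines, f2.lines) ++ mulSets(f1.vertices, f2.lines) ++
             mulSets(f1.rays, f2.lines)
)

Since the composition is applied element-wise, the size of the resulting frame may grow with the product of the sizes of its arguments, which is one motivation for the subsequent subsumption-based pruning and, eventually, for the move to simplices.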

Widening.

In order to effectively compute some (hopefully non-trivial) solution of the constraint system S3♯ by means of convex polyhedra, we must avoid infinite ascending chains during fixpoint iteration. This can be achieved by the use of widening for polyhedra, e.g. the standard widening introduced by Cousot and Halbwachs [39]. Here, we rely on the more precise widening strategies of Bagnara et al. [5], which are restricted to the frame representation of a convex polyhedron.

Union and Subsumption.

In every step of the fixpoint iteration we must check whether the next polyhedron F for a constraint variable is already subsumed by the old value F′, i.e. whether [[F]]♯ ⊑ [[F′]]♯. This subsumption test can be implemented by successively testing for all frame elements of the polyhedron F whether they can be represented by the elements of the polyhedron F′. Thus, subsumption testing reduces to checking the feasibility of a linear program [119]. Union for two polyhedra (in the following referred to as polyhedral union), on the other hand, is readily implemented using set-theoretical union on each of the three components of the frame representation. Subsequent subsumption testing may be used to remove redundant elements from the result.
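Reusing the Frame type from the sketch above, the polyhedral union then is a one-liner (again our own illustration); the pruning of redundant frame elements via subsumption tests is omitted:

// componentwise set union; redundant frame elements may remain and can
// be removed afterwards by subsumption tests (LP feasibility checks)
def unionP(f1: Frame, f2: Frame): Frame =
  Frame(f1.vertices ++ f2.vertices, f1.rays ++ f2.rays, f1.lines ++ f2.lines)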

Linear Guards.

According to the extended constraint system for the reachability analysis, as presented in Section 8.3, both alternatives for evaluating conditionals can be applied to convex polyhedra. For performing intersections on polyhedra we apply the techniques from [39]. In practice, program analysis using polyhedra is quite expensive [114]. Thus, in recent approaches special subclasses of polyhedra have been proposed, e.g. octagons [84] or octahedra [27]. These subclasses rely on restricted forms of constraint systems to specify polyhedra, which can then be handled efficiently. Since the frame representation of these polyhedra can easily be exponential in the number of constraints, they cannot be applied here. This is the reason why we turn our attention to simplices, a particular subclass of polyhedra whose frame representation has almost the same size as the constraint representation. We introduce simplices as an approximation of convex sets since they provide a very efficient subsumption test and union (cubic in time).

Simplices

The idea is to restrict the frame representation ⟨V, R, L⟩ of a non-empty polyhedron to k frame elements and a base vertex V0 ∈ V, where the differences V − V0 for V0 ≠ V ∈ V together with the rays and lines are all linearly independent. This means that the frame representation of a k-dimensional simplex S = ⟨V, R, L⟩ consists of k + 1 linearly independent frame elements. In the following this fact is referred to as the linear independence of frame elements.

[Figure: the simplex ⟨{V0, V1}, {R0}, ∅⟩ ⊆ Q².]

Two-dimensional simplices may consist of at most three frame elements. Obviously, in this example the difference V1 − V0 is linearly independent from the ray R0. For simplices, we again need an appropriate subsumption test and union as well as an effective composition. Furthermore, widening on simplices must be introduced to ensure the linear independence of frame elements. Union and composition for simplices can readily be implemented by using the corresponding polyhedral operations and subsequently determining a preferably small simplex (referred to as enclosing simplex) which encloses the resulting polyhedron.

Enclosing Simplex.

Given a polyhedron F, a simplex S is called an enclosing simplex for F iff [[F]]♯ ⊑ [[S]]♯. The enclosing simplex is built up successively: starting with the empty simplex, we widen it with all the frame elements of the polyhedron F.
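A minimal sketch of this construction (our own; the types Simplex and FrameElem as well as the function widen are assumed, where widen stands for the widening of Algorithms 8.1–8.3 below):

sealed trait FrameElem                     // a vertex, a ray or a line
final case class Simplex(vertices: Set[FrameElem],
                         rays: Set[FrameElem],
                         lines: Set[FrameElem])

// widening of a simplex with a single frame element (Algorithms 8.1-8.3);
// assumed here, not implemented
def widen(s: Simplex, e: FrameElem): Simplex = ???

// enclosing simplex of a polyhedron given by its frame elements:
// start from the empty simplex and widen with every frame element
def enclose(frameElems: Set[FrameElem]): Simplex =
  frameElems.foldLeft(Simplex(Set.empty, Set.empty, Set.empty))(widen)

The simplicial union described below is realised by exactly the same fold, starting from the first simplex instead of the empty one.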

Subsumption.

As for polyhedra, the subsumption test for simplices [[S]]♯ ⊑ [[S′]]♯ is performed by successively checking whether the vertices, rays and lines of S can be expressed through the vertices, rays and lines of S′. However, for simplices each such test can be performed by solving an appropriate system of linear equations. Because of the linear independence of frame elements, this system has a unique solution. In order to determine whether a vertex V, a ray R or a line L is subsumed by the simplex ⟨V, R, L⟩, the corresponding system of linear equations has to be solved:

V = V0 + Σ_{i=1}^q λi (Vi − V0) + Σ_{i=0}^r µi Ri + Σ_{i=0}^s ηi Li   (8.1)
R = Σ_{i=0}^r µi Ri + Σ_{i=0}^s ηi Li   (8.2)
L = Σ_{i=0}^s ηi Li   (8.3)

where Σ_{i=1}^q λi ≤ 1 ∧ λi, µi ≥ 0 holds, with Vi ∈ V, Ri ∈ R, Li ∈ L and V0 the base vertex. The complexity of solving such a system of linear equations is cubic in the number of frame elements. If the system of linear equations is feasible and the restrictions on the coefficients λi, µi hold, the vertex V, the ray R or the line L is considered subsumed.
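To make the reduction to linear equation solving concrete, the following Scala sketch (our own illustration; the names Vec, solveUnique and vertexSubsumed are hypothetical, and Double with an epsilon tolerance stands in for the exact rational arithmetic a real implementation requires) decides equation (8.1) for a candidate vertex V; rays (8.2) and lines (8.3) are handled analogously by dropping the convexity restrictions:

type Vec = Vector[Double]
val eps = 1e-9

def sub(a: Vec, b: Vec): Vec = (a zip b).map { case (x, y) => x - y }

// Solve  sum_c x_c * cols(c) = b  by Gaussian elimination, exploiting that
// the columns are linearly independent; returns None if infeasible.
def solveUnique(cols: Vector[Vec], b: Vec): Option[Vec] = {
  val m = b.length                          // dimension of the vectors
  val n = cols.length                       // number of frame elements
  val a = Array.tabulate(m, n + 1)((i, j) => if (j < n) cols(j)(i) else b(i))
  var ok = true
  for (c <- 0 until n if ok) {
    val p = (c until m).maxBy(i => math.abs(a(i)(c)))   // partial pivoting
    if (math.abs(a(p)(c)) < eps) ok = false // would violate independence
    else {
      val t = a(c); a(c) = a(p); a(p) = t   // swap rows c and p
      val f = a(c)(c)
      for (j <- c to n) a(c)(j) /= f        // normalise the pivot row
      for (i <- 0 until m if i != c) {
        val g = a(i)(c)
        if (g != 0.0) for (j <- c to n) a(i)(j) -= g * a(c)(j)
      }
    }
  }
  // feasible iff all rows below the pivots have a zero right-hand side
  if (ok && (n until m).forall(i => math.abs(a(i)(n)) < eps))
    Some(Vector.tabulate(n)(c => a(c)(n)))
  else None
}

// Is the vertex v subsumed by the simplex with base vertex v0, vertex
// differences ds = [Vi - V0], rays rs and lines ls?  (equation (8.1))
def vertexSubsumed(v: Vec, v0: Vec,
                   ds: Vector[Vec], rs: Vector[Vec], ls: Vector[Vec]): Boolean =
  solveUnique(ds ++ rs ++ ls, sub(v, v0)) match {
    case Some(x) =>
      val lams = x.take(ds.length)          // coefficients of the Vi - V0
      val mus  = x.slice(ds.length, ds.length + rs.length)  // ray coefficients
      lams.forall(_ >= -eps) && mus.forall(_ >= -eps) && lams.sum <= 1 + eps
    case None => false
  }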

Composition.

The composition of two simplices (referred to as simplicial composition) is reduced to the polyhedral composition ∘ᴾ and subsequently determining the enclosing simplex.

Union.

Union for two simplices S1, S2 (simplicial union) is implemented using the polyhedral union of the simplices and subsequently determining the enclosing simplex for this polyhedron. This can be efficiently realised by successively widening simplex S1 with all the frame elements of S2.

Widening.

Widening of a simplex S with a frame element E distinguishes three cases: First, if the frame element E is linearly independent of all the frame elements of S, then E can directly be added to the corresponding element set of S. Secondly, if E is already subsumed, S does not have to be widened. In the third case, those frame elements of S on which E linearly depends (i.e. whose linear combination represents E) are widened according to one of the following Algorithms 8.1, 8.2 and 8.3.

widen(⟨V, R, L⟩, V)
 1: choose some base vertex V0 ∈ V;
 2: determine λi, µj with 1 ≤ i ≤ q, 0 ≤ j ≤ r, such that
    V = V0 + Σ_{i=1}^q λi (Vi − V0) + Σ_{j=0}^r µj Rj + Σ_{i=0}^s ηi Li
 3: for all j s.t. µj < 0 do
 4:   L ← L ∪ Rj;
 5:   R ← R \ Rj;
 6: for all i s.t. λi ≠ 0 do
 7:   if (λi < 0) then
 8:     L ← L ∪ (V0 − Vi);
 9:     V ← V \ Vi;
10:   if (λi > 1) then
11:     R ← R ∪ (Vi − V0);
12:     V ← V \ Vi;
13: while (Σ_{j=1}^q λj > 1) do
14:   R ← R ∪ (Vq − V0);
15:   V ← V \ Vq;
16:   q ← q − 1;
17: return ⟨V, R, L⟩;

Algorithm 8.1: Widening of a Simplex with a Vertex V

When widening the simplex with a vertex V, the system of equations (8.1) has to be solved to determine the coefficients of the frame elements which contribute to the linear combination of V (cf. line 2 of Algorithm 8.1). The frame elements, more precisely the vertices and rays of the simplex whose coefficient restrictions do not hold, have to be widened.

If the restriction on the coefficient of a ray Rj does not hold, i.e. µj < 0, the ray Rj is removed from the ray set of the simplex and added to its line set, as lines 3–5 of Algorithm 8.1 demonstrate. Furthermore, if the restriction on the coefficient of a vertex Vi does not hold, there are two cases: if λi < 0 then Vi is removed from the vertex set and the difference V0 − Vi is added to the line set, whereas if λi > 1 the difference Vi − V0 is added to the ray set. Additionally, the restriction on the sum of the vertices' coefficients, Σ_{i=1}^q λi ≤ 1, must be preserved. As long as this restriction does not hold, the ray set is augmented with the differences Vi − V0 (cf. lines 13–16 of Algorithm 8.1). The resulting simplex subsumes V and consists of linearly independent frame elements only. Note that the precision of the widening presented here strongly depends on the choice of the base vertex V0, but it can be implemented in such a way that the choice of V0 becomes irrelevant for the precision of the resulting simplex.

widen(⟨V, R, L⟩, R)
 1: choose some base vertex V0 ∈ V;
 2: determine λi, µj with 1 ≤ i ≤ q, 0 ≤ j ≤ r, such that
    R = Σ_{i=1}^q λi (Vi − V0) + Σ_{j=0}^r µj Rj + Σ_{i=0}^s ηi Li
 3: for all j s.t. µj < 0 do
 4:   L ← L ∪ Rj;
 5:   R ← R \ Rj;
 6: for all i s.t. λi ≠ 0 do
 7:   if (λi < 0) then
 8:     L ← L ∪ (V0 − Vi);
 9:     V ← V \ Vi;
10:   if (λi > 0) then
11:     R ← R ∪ (Vi − V0);
12:     V ← V \ Vi;
13: return ⟨V, R, L⟩;

Algorithm 8.2: Widening of a Simplex with a Ray R

In the case of widening a simplex with a ray R, we determine the coefficients for the differences Vi − V0, V0 ≠ Vi ∈ V, and for the rays (cf. line 2 of Algorithm 8.2). Analogously to the algorithm for widening with a vertex (cf. Algorithm 8.1), all the vertices and rays whose coefficient restrictions do not hold are widened, i.e. they are added to the ray set or line set, respectively (cf. lines 3–12 of Algorithm 8.2). Again the resulting simplex consists only of linearly independent elements and subsumes R.

widen(⟨V, R, L⟩, L)
 1: choose some base vertex V0 ∈ V;
 2: determine λi, µj with 1 ≤ i ≤ q, 0 ≤ j ≤ r, such that
    L = Σ_{i=1}^q λi (Vi − V0) + Σ_{j=0}^r µj Rj + Σ_{i=0}^s ηi Li
 3: for all j s.t. µj ≠ 0 do
 4:   L ← L ∪ Rj;
 5:   R ← R \ Rj;
 6: for all i s.t. λi ≠ 0 do
 7:   L ← L ∪ (V0 − Vi);
 8:   V ← V \ Vi;
 9: return ⟨V, R, L⟩;

Algorithm 8.3: Widening of a Simplex with a Line L

When widening a simplex with a line L, all the rays and vertex differences with non-zero coefficients, i.e. those contributing to the representation of L, are widened to new lines, as described in detail in Algorithm 8.3.

When using simplices, termination of the fixpoint iteration over the constraint system S3♯ need not be ensured by introducing an additional widening. Since in Qⁿ, n = O(k²), a non-empty simplex can be enlarged at most 3n times, no infinite ascending chains can occur during fixpoint iteration. Altogether a simplex can be widened at most 3n times with its elements until we eventually arrive at ⊤: first we consider the vertex set, then the ray set and finally the line set. This means that every frame element type may be converted into another frame element type, however only once at a time: a vertex may be widened to a ray, and the ray may then be widened to a line. Consequently, our widening algorithm is run for each frame element, which may pass through all three sets in the worst case, i.e. for each constraint a widening is performed up to three times. However, note that due to the frequent computation of the enclosing simplex, the fixpoint iteration over the constraint system S3♯ based on simplices leads to a less precise approximation of convex sets than one based on convex polyhedra.

Linear Guards.

Since the class of simplices has been introduced in order to efficiently approximate convex polyhedra when computing the effects of procedures, linear guards need not be evaluated on simplices within our approach. Our reachability analysis relies on polyhedra, on which the conditions can be evaluated directly, cf. Section 8.3. When using simplices for the reachability analysis, the evaluation of conditionals cannot be performed directly after each procedure call or when a condition is passed, because the result of an intersection is not necessarily a simplex again. Since the creation of an enclosing simplex after the condition evaluation would cause too much imprecision, checking the condition must be postponed until the end of the analysis. Thus, it is preferable to transform the simplex into a convex polyhedron and then perform the condition evaluation. Moreover, operations on simplices have a better runtime complexity than operations on polyhedra:

Theorem 10. All the simplicial operations (subsumption, union, widening and composition) can be performed in time polynomial in the number of variables k.

Proof. Assume that the simplices considered here describe subsets of Qⁿ, where n = O(k²). Assume that the basic operations of addition and scalar multiplication of the frame elements are performed in O(1). The simplicial operations of inclusion testing and widening reduce to solving a system of at most n + 1 linear equations, which can be performed in O(n³) for a simplex with n + 1 frame elements. Union reduces to (n+1)-fold successive widening, subsumption to (n+1)-fold inclusion testing. Thus, each of these operations can be performed in time O(n⁴). The simplicial composition is given by the element-wise composition of the frame elements (i.e. O(n²) matrix multiplications) and subsequently determining the enclosing simplex, leading to a total complexity of O(n⁵).

Indeed, the operations on simplices can be performed efficiently according to Theorem 10. Another possibility is to use simplices only for efficiently representing the effects of procedures. When resorting to convex polyhedra as the representation of convex sets in the reachability analysis, the conditions can still be evaluated after each procedure call without leading to overly imprecise results.

8.5 Experimental Results

So far, we have introduced two different representations for convex sets: polyhedra and simplices. Moreover, we have presented two alternatives for evaluating conditionals within the reachability analysis: directly after each procedure call, or once at the end of the analysis. To get a general idea of the performance of these different options in practical applications, we have examined the behaviour of our interprocedural approach on a collection of example programs. Here, we concentrate on three characteristic example programs: recursive add, array bounds and nested loops. The example program recursive add contains a procedure that recursively calls itself, computing the addition of two numbers. The program array bounds emulates array bound checking as performed by Java programs. Finally, we consider the iteration variables in the program nested loops, which contains four nested for-loops; this program also covers the case that a loop is bounded by the iteration variable of an outer loop. The analysis set-up consists of approximating convex sets either by polyhedra or by simplices, and of trying either direct condition evaluation or a single evaluation at the end of the analysis. Our implementation is more elaborate than the theoretical analysis described in this chapter, as it deals with local variables, the passing of parameters and the return values of procedures. This approach is implemented in the C front-end of VoTUM [136]. The following table compares the effect analysis by means of convex polyhedra with the one by means of simplices for each example program:

Table 8.1: Benchmark Suite Comparing Simplex Analysis with Polyhedra Analysis

Program         LOCs   Procs   Efficiency Inc   Precision   Polyhedra(s)   Simplex(s)
recursive add     26       4             62 %       100 %              3            1
array bounds      25       2             97 %       100 %             16            2
nested loops      28       2             98 %        75 %            187            4

Within this table we specify: the number of lines of code of the C program (LOCs); the number of procedures (Procs); the column Efficiency Inc, which compares the runtime of the polyhedra analysis with the runtime of the simplex analysis; Precision, which compares the precision of both approaches; and the columns Polyhedra(s) and Simplex(s), which denote the time consumption of our analyser in seconds. The runtime of the reachability analysis by means of convex polyhedra does not differ significantly from the reachability analysis by means of simplices. However, the effect analysis by means of simplices is dramatically faster than the effect analysis by means of convex polyhedra, as the column Efficiency Inc of Table 8.1 illustrates. The effect analysis with simplices terminated within a few seconds for all our benchmark programs. Concerning the precision of the inferred inequalities, we have found that both the approach via simplices and that via convex polyhedra are able to infer the exact result for the recursive function in the case of recursive add, and the dependence of the iteration variable on the variable upper bound for array bounds. For the example program nested loops both approaches returned quite precise results; in this case, however, the analysis by means of simplices missed some lower loop bounds and thus did not reach the full precision of the analysis with polyhedra. Since the analysis using simplices is rather fast and the quality of the inferred inequalities is not much worse, we conclude that it may be a good compromise to rely on simplices for the effect analysis and to resort to convex polyhedra or other approximations of convex polyhedra (e.g. octahedra [27]) for the reachability analysis. Contrary to our theoretical expectations from Section 8.3, no advantage of immediate condition evaluation over a single evaluation at the very end of the analysis could be observed; this may, however, be due to the perhaps not very representative selection of benchmark programs. If the complexity for larger programs should prevent a practical application of our approach, clustering, as introduced in Astrée [38], could be included. First practical experiments indicate that our approach is quite efficient and provides reasonably precise results. It remains for future work to embed this analysis into our PPC assembly analyser in order to precisely deal with programs such as those from Example 8.1.

Summary

We have introduced a general framework for interprocedurally identifying linear inequality relations between the variables of a program for each program point. This is achieved by representing the effects of procedures with convex sets of transition matrices. Our approach accumulates the single edge effects in order to describe the effect of a whole procedure. These procedure effects can then simply be embedded into a reachability analysis by means of arbitrary approximations of convex polyhedra. In the absence of conditional branching the convex abstraction can be characterised precisely by the least solution of a constraint system. In order to handle conditional branching in our framework, we propose to store the value of each conditional in an auxiliary variable during the effect analysis and to postpone its evaluation to the reachability analysis. This postponement is safe, merely leading to an over-approximation. In order to finitely represent and compute with convex sets, we approximate them by means of convex polyhedra. We resort to the frame representation of polyhedra, thus avoiding the expensive continual conversion between the two representations. The frame representation of convex polyhedra, on the other hand, can be exponentially larger than their constraint representation. For this reason, we propose the subclass of simplices as an abstract domain. Since for simplices the number of frame elements is restricted, we obtain a small representation for convex sets. Moreover, the basic operations on simplices can be performed in polynomial time. Thus, our effect analysis by means of simplices runs in polynomial time; more precisely, the analysis is linear in the program size and polynomial in the number of program variables and guards.

Chapter 9

Conclusion

This chapter concludes the thesis by summarising the major contributions of our work. Finally, we discuss future research directions in the area of analysing low-level code.

9.1 Contributions

We presented a fully automatic framework for the analysis of low-level code and developed various program analyses for reconstructing the control flow graph of the program under analysis as well as for reasoning about the values of and the relations between registers and memory locations. Our analyses rely only on some initial assumptions (cf. Section 1.4), which we discharge by static analysis. In more detail:

Control Flow Reconstruction

Standard disassembly tools like IDAPro often yield unsatisfactory results even for compiler-generated code, due to the presence of indirect calls, indirect jumps and functions that do not return. In contrast to the two prevalent disassembly techniques, recursive traversal and linear sweep, which are based on compiler patterns only, we coupled disassembling with static analysis. This allows us to produce a correct disassembly as well as a sound overapproximation of the control flow structure and the call graph of a given executable in the presence of indirect calls and indirect jumps, without relying on unsound heuristics. So far, ours is the only control flow reconstruction framework that precisely deals with procedure calls, especially with aborting procedures, i.e. procedures which do not return to the corresponding call site but terminate the whole program whenever they are called.

Memory Locations

Identifying memory locations is important for assembly analysis when all data is kept in memory and the values of locals are always freshly loaded from memory, as is the case for zero-optimised assembly. At the assembler level, global variables materialise as absolute addresses while local variables materialise as constant stack pointer offsets. Due to the use of indirect addressing, no syntactic patterns can be applied in order to identify explicit addresses. In order to infer potential local and global variables we designed a fast interprocedural analysis of variable differences. Additionally, we applied this approach in order to interprocedurally observe stack pointer modifications. We also extended the variable difference analysis to deal with arbitrary linear two-variable equalities, which help to infer relationships between iteration variables and pointer variables and to reason about array index expressions in assembly. Our novel algorithm improves on the complexity bound of the corresponding approach based on full linear algebra by saving a factor of k⁴, resulting in a worst-case complexity of O(n · k⁴) (with k the number of program variables and n the program size).

Side-Effects

In the context of control flow reconstruction, a side-effect analysis prevents the use of unsound assumptions, such as that each procedure only modifies memory locations that belong to its own stack frame. Furthermore, a side-effect analysis contributes to checking the executable under analysis for conformance with the processor ABI. One such convention, as stated in the ABI, is that the values of non-volatile registers must be saved at procedure entry and restored again at procedure exit. Our side-effect analysis tames the modifying potential of procedures: the side-effect of every procedure is represented by all the parameter-register-relative write accesses occurring in the procedure body. Another challenge is determining the arguments of procedures. For instance, in the case of register spilling, the parameters of register-rich architectures may also be passed via the stack. A precise stack analysis is crucial to determine those stack locations that belong to the stack frame of the caller as well as those that belong to the callee. Embedding this side-effect information into arbitrary intraprocedural analyses also helps when inspecting malicious code: in this context we can determine whether the return value or the organisational stack locations are overwritten. In conclusion, for an accurate control flow reconstruction the modifying potential of a procedure must be taken into account.

Alignment Information

We designed an analysis of modular arithmetic which is based on the structure of the previously reconstructed control flow graph of the program. It investigates the alignment of memory accesses in order to improve the prediction of cache hit and miss rates and consequently renders the worst-case execution time estimates for industrial embedded systems code more precise. Quite often static analysis is unable to infer precise bounds for the loop iteration variables and thus fails to infer a bound for the memory block accessed within a loop. This degrades the information about cache hit and miss rates. Under the assumption that the first memory access within the loop body results in a cache miss, our approach allows inferring the cache miss rates for the accesses to a contiguous memory block. Moreover, we exploited the modular equality information to distinguish the different field accesses to an aggregate data structure and to infer high-level types such as arrays.

Simplices

As the convex abstraction through polyhedra tends to be complex and the operations on polyhedra are therefore very expensive, we introduced a new domain: simplices. A k-dimensional simplex is a polyhedron whose frame representation is restricted to k + 1 frame elements only. This finite description overcomes the potentially exponential size of the frame representation of polyhedra. Additionally, we presented a polyhedral analysis computing on the frame representation only. Our experimental results showed that this geometrical domain provides a good alternative to the expensive polyhedral domain.

9.2 Perspectives

Three major challenges remain:

1. heap-allocated data
2. indirect calls
3. undisciplined code

Analysing real-world examples raised the need for extending our framework to cope with the heap. So far, we considered local memory only, while the heap must be taken into account for a larger class of assembly programs. To that end, data structures like linked lists must be handled at the level of executables. Typically in C programming a linked list is implemented using records, whose components contain a variable holding some piece of information and a pointer to a record of the same type. Via a call to malloc a pointer to a new record is created. The procedure malloc dynamically allocates storage on the heap and returns a pointer to the block it allocates.

In order to precisely reconstruct the control flow in the presence of indirect calls, abstract domains are required which also track code addresses stored in the heap. Moreover, in order to deal with assembly generated from, e.g., object-oriented programs, it also seems inevitable to combine our methods with simple forms of heap analysis such as [8].

For our implementation we assumed that the assembly under analysis adheres to the conventions (as stated in the processor ABI) for calls to and returns from procedures. It is an open problem how to extend these techniques to deal with code which deliberately violates these conventions. There are several areas in which such code offers interesting challenges: self-modifying code, self-extracting executables and hand-written assembly, e.g. malicious code or optimised library code.

Bibliography

[1] AbsInt, Angewandte Informatik GmbH. aiT Worst-Case Execution Time Analyzers, 2010. http://www.absint.com/ait/.

[2] AbsInt, Angewandte Informatik GmbH. CRL Version 2, 2010. http://www.absint.com/artist2/doc/crl2/.

[3] B. Alpern, M. Wegman, and F. K. Zadeck. Detecting Equality of Variables in Programs. In 15th ACM Symp. on Principles of Programming Languages (POPL), pages 1–11, 1988.

[4] M. Alt, C. Ferdinand, F. Martin, and R. Wilhelm. Cache Behavior Prediction by Abstract Interpretation. In SAS ’96: Proceedings of the Third International Symposium on Static Analysis, pages 52–66. Springer-Verlag, 1996.

[5] R. Bagnara, E. Zaffanella, P. M. Hill, and E. Ricci. Precise Widening Operators for Convex Polyhedra. In 10th International Static Analysis Symposium (SAS), pages 337–354, 2003.

[6] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In Compiler Construction, pages 5–23. Springer, LNCS, 2004.

[7] G. Balakrishnan and T. Reps. Recovery of Variables and Heap Structure in x86 Executables. Technical report, University of Wisconsin, Madison, 2005.

[8] G. Balakrishnan and T. Reps. Recency-Abstraction for Heap-Allocated Storage. In 13th International Static Analysis Symposium, SAS 2006, volume 4134 of Lecture Notes in Computer Science, pages 221–239. Springer, 2006.

[9] G. Balakrishnan and T. Reps. DIVINE: DIscovering Variables IN Executables. In VMCAI, pages 1–28, 2007.

[10] G. Balakrishnan and T. W. Reps. WYSINWYX: What You See Is Not What You eXecute. ACM Trans. Program. Lang. Syst., 32(6), 2010.

[11] J. P. Banning. An efficient way to find the side effects of procedure calls and the aliases of variables. In POPL ’79: Proceedings of the 6th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pages 29–41, New York, NY, USA, 1979. ACM.


[12] J. Bartlett. Assembly language for Power Architecture, October 2006. http://www.ibm.com/developerworks/linux/library/l-powasm1.html.

[13] D. Beyer, T. A. Henzinger, and G. Théoduloz. Configurable software verification: concretizing the convergence of model checking and program analysis. In CAV’07: Proceedings of the 19th international conference on Computer aided verification, pages 504–518. Springer-Verlag, 2007.

[14] D. Beyer, T. A. Henzinger, and G. Théoduloz. Program Analysis with Dynamic Precision Adjustment. In ASE ’08: Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, pages 29–38, Washington, DC, USA, 2008. IEEE Computer Society.

[15] B. Blanchet, P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival. Design and implementation of a special-purpose static program analyzer for safety-critical real-time embedded software. In The essence of computation: complexity, analysis, transformation, pages 85–108, New York, NY, USA, 2002. Springer-Verlag New York, Inc.

[16] B. Blanchet, P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival. A static analyzer for large safety-critical software. In PLDI ’03: Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 196–207, New York, NY, USA, 2003. ACM.

[17] J. Brauer and A. King. Automatic Abstraction for Intervals using Boolean Formulae. In Static Analysis Symposium (SAS 2010), Perpignan, France, Lecture Notes in Computer Science. Springer, 2010.

[18] J. Brauer, A. King, and S. Kowalewski. Range Analysis of Microcontroller Code using Bit-Level Congruences. In Formal Methods for Industrial Critical Systems (FMICS 2010), Antwerp, Belgium, Lecture Notes in Computer Science. Springer, 2010.

[19] L. Chen, A. Miné, J. Wang, and P. Cousot. Interval Polyhedra: An Abstract Domain to Infer Interval Linear Relationships. In SAS ’09: Proceedings of the 16th International Symposium on Static Analysis, pages 309–325, Berlin, Heidelberg, 2009. Springer-Verlag.

[20] J.-D. Choi, M. Burke, and P. Carini. Efficient flow-sensitive interprocedural computation of pointer-induced aliases and side effects. In POPL ’93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 232–245, New York, USA, 1993. ACM.

[21] M. Christodorescu, S. Jha, and C. Kruegel. Mining specifications of malicious behavior. In ESEC-FSE ’07: Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pages 5–14, New York, NY, USA, 2007. ACM.

[22] M. Christodorescu, N. Kidd, and W.-H. Goh. String analysis for x86 binaries. In PASTE, Program Analysis For Software Tools and Engineering, pages 88–95, 2005.

[23] C. Cifuentes and M. V. Emmerik. UQBT: Adaptable Binary Translation at Low Cost. Computer, 33(3):60–66, 2000.

[24] C. Cifuentes and M. V. Emmerik. Recovery of jump table case statements from binary code. Science of Computer Programming, 40:171–188, 2001.

[25] C. Cifuentes, M. V. Emmerik, D. Ung, D. Simon, and T. Waddington. Preliminary Experiences with the Use of the UQBT Binary Translation Framework. In Proceedings of the Workshop on Binary Translation, pages 12–22, Los Alamitos, Calif., 1999. Technical Committee on Computer Architecture News, IEEE CS Press.

[26] C. Cifuentes, D. Simon, and A. Fraboulet. Assembly to High-Level Language Translation. In ICSM, pages 228–237, 1998.

[27] R. Clarisó and J. Cortadella. The octahedron abstract domain. Sci. Comput. Program., 64(1):115–139, 2007.

[28] M. Codish, A. Mulkers, M. Bruynooghe, M. G. de la Banda, and M. Hermenegildo. Improving abstract interpretations by combining domains. In PEPM ’93: Proceedings of the 1993 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation, pages 194–205, New York, NY, USA, 1993. ACM.

[29] K. D. Cooper and K. Kennedy. Interprocedural side-effect analysis in linear time. In PLDI ’88: Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation, pages 57–66, New York, NY, USA, 1988. ACM.

[30] J. Corbet. Fun with NULL pointers, part 1, 2009. http://lwn.net/Articles/341620/.

[31] A. Cortesi, B. L. Charlier, and P. V. Hentenryck. Combinations of abstract domains for logic programming. In POPL ’94: Proceedings of the 21st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 227–239, New York, USA, 1994. ACM.

[32] P. Cousot and R. Cousot. Static determination of dynamic properties of programs. In Proceedings of the Second International Symposium on Programming, pages 106–130. Dunod, Paris, France, 1976.

[33] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Conference Record of the Fourth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 238–252, Los Angeles, California, 1977. ACM Press, New York, NY.

[34] P. Cousot and R. Cousot. Systematic design of program analysis frameworks. In POPL ’79: Proceedings of the 6th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pages 269–282, New York, NY, USA, 1979. ACM.

[35] P. Cousot and R. Cousot. Abstract Interpretation Frameworks. Journal of Logic and Computation, 2(4):511–547, 1992.

[36] P. Cousot and R. Cousot. Comparing the Galois Connection and Widening/Narrowing Approaches to Abstract Interpretation. In M. Bruynooghe and M. Wirsing, editors, Proceedings of the International Workshop Programming Language Implementation and Logic Programming, PLILP ’92, pages 269–295. Springer-Verlag, Germany, 1992.

[37] P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival. The ASTRÉE Analyzer. In European Symposium on Programming, Edinburgh, Scotland, pages 21–30. Springer, 2005.

[38] P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival. Combination of abstractions in the ASTRÉE static analyzer. In ASIAN’06: Proceedings of the 11th Asian computing science conference on Advances in computer science, pages 272–300, Berlin, Heidelberg, 2007. Springer-Verlag.

[39] P. Cousot and N. Halbwachs. Automatic Discovery of Linear Restraints among Variables of a Program. In 5th Ann. ACM Symposium on Principles of Programming Languages (POPL), pages 84–97, 1978.

[40] DCC decompiler, 2010. http://www.itee.uq.edu.au/~cristina/dcc.html.

[41] S. Debray, R. Muth, and M. Weippert. Alias analysis of executable code. In POPL ’98: Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 12–24, New York, NY, USA, 1998. ACM.

[42] T. Dullien and S. Porst. REIL: A platform-independent intermediate representation of disassembled code for static code analysis, 2009. http://www.zynamics.com/downloads/csw09.pdf.

[43] M. Emami, R. Ghiya, and L. J. Hendren. Context-Sensitive Interprocedural Points-to Analysis in the Presence of Function Pointers. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, PLDI ’94, pages 242–256, New York, NY, USA, 1994. ACM.

[44] C. Fecht and H. Seidl. Propagating differences: An efficient new fixpoint algorithm for distributive constraint systems. European Symposium on Programming (ESOP), 1381:90–104, 1998.

[45] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable and Precise WCET Determination for a Real-Life Processor. In EMSOFT ’01: Proceedings of the First International Workshop on Embedded Software, pages 469–485. Springer-Verlag, 2001.

[46] C. Ferdinand and R. Wilhelm. Efficient and Precise Cache Behavior Prediction for Real-Time Systems. Real-Time Syst., 17(2-3), 1999.

[47] J. Feret. Static Analysis of Digital Filters. In European Symposium on Programming (ESOP’04), number 2986 in LNCS. Springer-Verlag, 2004.

[48] A. Flexeder, B. Mihaila, M. Petter, and H. Seidl. Interprocedural control flow reconstruction. In The Eighth ASIAN Symposium on Programming Languages and Systems (APLAS 2010), pages 188–203. Springer Verlag, 2010.

[49] A. Flexeder, M. Petter, and H. Seidl. Side-Effect Analysis of Assembly Code. In The 18th International Static Analysis Symposium (SAS), Venice, Italy, 2011.

[50] B. Frey. PowerPC Architecture Book, Version 2.02, November 2005. http://www.ibm.com/developerworks/systems/library/es-archguide-v2.html.

[51] GCC. RTL Representation, 2010. http://gcc.gnu.org/onlinedocs/gccint/RTL.html.

[52] L. George and A. W. Appel. Iterated Register Coalescing. ACM Transactions on Programming Languages and Systems (TOPLAS), 18(3):300–324, 1996.

[53] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: an analytical representation of cache misses. In ICS ’97: Proceedings of the 11th international conference on Supercomputing, pages 317–324, New York, NY, USA, 1997. ACM.

[54] GrammaTech, Inc. CodeSurfer, 2010. http://www.grammatech.com/products/codesurfer/.

[55] Green Hills Software Inc. Green Hills Optimizing Compilers, 2010. http://www.ghs.com.

[56] S. Gulwani and G. C. Necula. A polynomial-time algorithm for global value numbering. In 11th Int. Static Analysis Symposium (SAS), pages 212–227. Springer, LNCS 3148, 2004.

[57] S. Gulwani and A. Tiwari. Combining abstract interpreters. In PLDI ’06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 376–386, New York, NY, USA, 2006. ACM.

[58] B. Guo, M. J. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. I. August. Practical and Accurate Low-Level Pointer Analysis. In CGO ’05: Proceedings of the international symposium on Code generation and optimization, pages 291–302, Washington, DC, USA, 2005. IEEE Computer Society.

[59] L. C. Harris and B. P. Miller. Practical analysis of stripped binary code. SIGARCH Comput. Archit. News, 33(5):63–68, 2005.

[60] R. Heckmann and C. Ferdinand. Worst-Case Execution Time Prediction by Static Program Analysis. In 18th International Parallel and Distributed Processing Symposium (IPDPS), pages 26–30. IEEE Computer Society, 2004.

[61] S. Horwitz, T. W. Reps, and M. Sagiv. Precise Interprocedural Dataflow Analysis via Graph Reachability. In 22nd ACM Symp. on Principles of Programming Languages (POPL), pages 49–61, 1995.

[62] J. M. Howe and A. King. Logahedra: A New Weakly Relational Domain. In ATVA ’09: Proceedings of the 7th International Symposium on Automated Technology for Verification and Analysis, pages 306–320, Berlin, Heidelberg, 2009. Springer-Verlag.

[63] IBM. PowerPC Architecture - The official manual for the PowerPC architecture. Three parts: instruction set architecture, virtual environment architecture, and operating environment architecture, IBM book number SR28-5124-00, 1993.

[64] IDAPro disassembler, 2010. http://www.hex-rays.com/idapro/.

[65] S. Katsumata and A. Ohori. Proof-Directed De-compilation of Low-Level Code. In ESOP ’01: Proceedings of the 10th European Symposium on Programming Languages and Systems, pages 352–366, London, UK, 2001. Springer-Verlag.

[66] J. Kinder and H. Veith. Jakstab: A Static Analysis Platform for Binaries. In Proceedings of the 20th International Conference on Computer Aided Verification (CAV 2008), volume 5123 of LNCS, pages 423–427. Springer, 2008.

[67] J. Kinder and H. Veith. Precise Static Analysis of Untrusted Driver Binaries. In Proceedings of the 10th International Conference on Formal Methods in Computer-Aided Design (FMCAD 2010), 2010.

[68] J. Kinder, H. Veith, and F. Zuleger. An Abstract Interpretation-Based Framework for Control Flow Reconstruction from binaries. In Proceedings of the 10th International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI 2009), volume 5403 of Lecture Notes in Computer Science, pages 214–228. Springer, Jan 2009.

[69] A. King and H. Søndergaard. Automatic Abstraction for Congruences. In G. Barthe and M. Hermenegildo, editors, Verification, Model Checking and Abstract Interpretation, volume 5944 of Lecture Notes in Computer Science, pages 197–213. Springer, 2010.

[70] C. Kruegel, W. Robertson, and G. Vigna. Detecting Kernel-Level Rootkits Through Binary Analysis. In ACSAC ’04: Proceedings of the 20th Annual Computer Security Applications Conference, pages 91–100, Washington, DC, USA, 2004. IEEE Computer Society.

[71] A. Lakhotia, D. R. Boccardo, A. Singh, and A. Manacero Jr. Context-sensitive analysis of obfuscated x86 executables. In PEPM ’10: Proceedings of the 2010 ACM SIGPLAN workshop on Partial evaluation and program manipulation, pages 131–140, New York, NY, USA, 2010. ACM.

[72] A. Lakhotia and E. U. Kumar. Abstracting Stack to Detect Obfuscated Calls in Binaries. In SCAM ’04: Proceedings of the Source Code Analysis and Manipulation, Fourth IEEE International Workshop, pages 17–26, Washington, DC, USA, 2004. IEEE Computer Society.

[73] W. Landi, B. G. Ryder, and S. Zhang. Interprocedural Modification Side Effect Analysis With Pointer Aliasing. In Proceedings of the SIGPLAN ’93 Conference on Programming Language Design and Implementation, pages 56–67, 1993.

[74] S. Larsen, E. Witchel, and S. P. Amarasinghe. Increasing and Detecting Memory Address Congruence. In PACT ’02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 18–29, Washington, DC, USA, 2002. IEEE Computer Society.

[75] V. Laviron and F. Logozzo. SubPolyhedra: A (More) Scalable Approach to Infer Linear Inequalities. In VMCAI ’09: Proceedings of the 10th International Conference on Verification, Model Checking, and Abstract Interpretation, pages 229–244, Berlin, Heidelberg, 2009. Springer-Verlag.

[76] J. R. Levine. Linkers and Loaders. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.

[77] J. Lim and T. Reps. A system for generating static analyzers for machine instructions. In Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction, pages 36–52. Springer-Verlag, 2008.

[78] C. Linn, S. Debray, G. Andrews, and B. Schwarz. Stack Analysis of x86 Executables, 2004. www.cs.arizona.edu/~debray/Publications/stack-analysis.pdf.

[79] LLVM. LLVM Language Reference Manual, 2010. http://llvm.org/docs/LangRef.html.

[80] F. Logozzo and M. Fähndrich. Pentagons: A weakly relational domain for the efficient validation of array accesses. In Proceedings of the 23th ACM Symposium on Applied Computing (SAC), 2008.

[81] Z. Manna and J. McCarthy. Properties of Programs and Partial Function Logic. In Machine Intelligence, 5:27–37, 1970.

[82] F. Martin. PAG – an efficient program analyzer generator. International Journal on Software Tools for Technology Transfer, 2(1):46–67, 1998.

[83] A. Miné. A new numerical abstract domain based on difference-bound matrices. In 2nd Symposium on Programs as Data Objects (PADO II), pages 155–172. Springer-Verlag, 2001.

[84] A. Miné. The Octagon abstract domain. In Analysis, Slicing, and Transformation (AST), pages 310–319. IEEE CS Press, 2001.

[85] D. Monniaux. The parallel implementation of the ASTRÉE static analyzer. In APLAS, Volume 3780 of LNCS, pages 86–96. Springer, 2005.

[86] R. E. Moore and F. Bierbaum. Methods and Applications of Interval Analysis (SIAM Studies in Applied and Numerical Mathematics). Soc for Industrial & Applied Math, 1979.

[87] M. Müller-Olm, O. Rüthing, and H. Seidl. Checking Herbrand Equalities and Beyond. In Verification, Model Checking and Abstract Interpretation (VMCAI), pages 79–96, 2005.

[88] M. Müller-Olm and H. Seidl. Precise Interprocedural Analysis through Linear Algebra. In 31st ACM Symp. on Principles of Programming Languages (POPL), pages 330–341, 2004.

[89] M. Müller-Olm and H. Seidl. A Generic Framework for Interprocedural Analysis of Numerical Properties. In 12th Static Analysis Symposium (SAS), pages 235–250, 2005.

[90] M. Müller-Olm and H. Seidl. Analysis of Modular Arithmetic. ACM Transactions on Programming Languages and Systems (TOPLAS), 29(5), 2007.

[91] M. Müller-Olm and H. Seidl. Upper Adjoints for Fast Inter-procedural Variable Equalities. In 17th European Symposium on Programming (ESOP), 2008.

[92] M. Müller-Olm, H. Seidl, and B. Steffen. Interprocedural Herbrand Equalities. In 14th European Symposium on Programming (ESOP), pages 31–45, 2005.

[93] A. Mycroft. Type-Based Decompilation (or Program Reconstruction via Type Reconstruction). In ESOP ’99: Proceedings of the 8th European Symposium on Programming Languages and Systems, pages 208–223, London, UK, 1999. Springer-Verlag.

[94] A. Mycroft, A. Ohori, and S. Katsumata. Comparing Type-Based and Proof-Directed Decompilation. In Proceedings of the Working Conference on Reverse Engineering (WCRE), pages 362–367, 2001.

[95] M. O. Myreen. Formal verification of machine-code programs. PhD thesis, University of Cambridge, 2008.

[96] M. O. Myreen and M. J. C. Gordon. Transforming Programs into Recursive Functions. Electron. Notes Theor. Comput. Sci., 240:185–200, 2009.

[97] M. O. Myreen, M. J. C. Gordon, and K. Slind. Machine-code verification for multiple architectures: an application of decompilation into logic. In FMCAD ’08: Proceedings of the 2008 International Conference on Formal Methods in Computer-Aided Design, pages 1–8, Piscataway, NJ, USA, 2008. IEEE Press.

[98] S. Nanda, W. Li, L.-C. Lam, and T. Chiueh. BIRD: Binary Interpretation using Runtime Disassembly. In CGO ’06: Proceedings of the International Symposium on Code Generation and Optimization, pages 358–370, Washington, DC, USA, 2006. IEEE Computer Society.

[99] The .NET Framework, 2010. http://www.microsoft.com/net/.

[100] J. Newsome and D. X. Song. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In NDSS, Proceedings of the Network and Distributed System Security Symposium, San Diego, California, USA. The Internet Society, 2005.

[101] M. Odersky, L. Spoon, and B. Venners. Programming in Scala: A Comprehensive Step-by-step Guide, 2008.

[102] Radio Technical Commission for Aeronautics (RTCA). DO-178B: Software Considerations in Airborne Systems and Equipment Certification. Technical report, 1999.

[103] PCollections, December 2010. http://code.google.com/p/pcollections/.

[104] M. D. Preda, R. Giacobazzi, S. Debray, K. Coogan, and G. M. Townsend. Modelling Metamorphism by Abstract Interpretation. Static Analysis Symposium (SAS), 2010.

[105] M. D. Preda, S. Jha, and S. Debray. A Semantics-Based Approach to Malware Detection. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’07), pages 377–388, New York, NY, USA, 2007. ACM Press.

[106] I. Pryanishnikov, A. Krall, and N. Horspool. Pointer Alignment Analysis for Processors with SIMD Instructions. In Proceedings of the 5th Workshop on Media and Streaming Processors, pages 50–57, 2003.

[107] G. Ramalingam, J. Field, and F. Tip. Aggregate Structure Identification and Its Application to Program Analysis. In Symposium on Principles of Programming Languages, pages 119–132, 1999.

[108] J. Rawat and K. Nilsen. Static analysis of cache performance for real-time programming. PhD thesis, Iowa State University of Science and Technology, November 1993.

[109] T. Reps, G. Balakrishnan, and J. Lim. Intermediate-representation recovery from low-level code. In PEPM ’06: Proceedings of the 2006 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation, pages 100–111, New York, NY, USA, 2006. ACM.

[110] T. Reps, G. Balakrishnan, J. Lim, and T. Teitelbaum. A Next-Generation Platform for Analyzing Executables. In APLAS, pages 212–229, 2005.

[111] T. Rybina and A. Voronkov. Using Canonical Representations of Solutions to Speed Up Infinite-State Model Checking. In 14th International Conference on Computer-Aided Verification (CAV), pages 386–400, 2002.

[112] M. Sagiv, T. W. Reps, and S. Horwitz. Precise Interprocedural Dataflow Analysis with Applications to Constant Propagation. Theor. Comput. Sci., 167(1&2):131–170, 1996.

[113] A. Salcianu and M. C. Rinard. Purity and Side Effect Analysis for Java Programs. In VMCAI, pages 199–215. Springer-Verlag, 2005.

[114] S. Sankaranarayanan, M. Colón, H. Sipma, and Z. Manna. Efficient strongly relational polyhedral analysis. In 7th International Conference, Verification, Model Checking and Abstract Interpretation (VMCAI), 2006.

[115] S. Sankaranarayanan, F. Ivancic, and A. Gupta. Program Analysis Using Symbolic Ranges. In Static Analysis, 14th International Symposium, volume 4634 of Lecture Notes in Computer Science, pages 366–383. Springer, 2007.

[116] S. Sankaranarayanan, H. Sipma, and Z. Manna. Constraint based linear relations analysis. In 11th International Static Analysis Symposium (SAS), pages 53–68, 2004.

[117] B. Schlich. Model Checking of Software for Microcontrollers. Dissertation, RWTH Aachen University, Aachen, Germany, June 2008.

[118] J. Schneider and C. Ferdinand. Pipeline behavior prediction for superscalar processors by abstract interpretation. In LCTES ’99: Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems, pages 35–44. ACM, 1999.

[119] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, Inc., New York, USA, 1986.

[120] B. Schwarz, S. Debray, and G. Andrews. Disassembly of Executable Code Revisited. In Proceedings of the Ninth Working Conference on Reverse Engineering (WCRE’02), page 45, Washington, DC, USA, 2002. IEEE Computer Society.

[121] H. Seidl, A. Flexeder, and M. Petter. Interprocedurally analysing linear inequality relations. In 16th European Symposium on Programming (ESOP), 2007.

[122] M. Sharir and A. Pnueli. Two Approaches to Interprocedural Data Flow Analysis. In Program Flow Analysis: Theory and Application, pages 189–234, 1981.

[123] A. Simon. Splitting the Control Flow with Boolean Flags. In SAS ’08: Proceedings of the 15th international symposium on Static Analysis, pages 315–331, Berlin, Heidelberg, 2008. Springer-Verlag.

[124] A. Simon, A. King, and J. M. Howe. Two Variables per Linear Inequality as an Abstract Domain. In Logic Based Program Development and Transformation (LOPSTR), pages 71–89, 2002.

[125] S. Sobek and K. Burke. PowerPC Embedded Application Binary Interface (EABI): 32-Bit Implementation, 2004. http://www.freescale.com/files/32bit/doc/app_note/PPCEABI.pdf.

[126] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A New Approach to Computer Security via Binary Analysis. In Proceedings of the 4th International Conference on Information Systems Security (keynote invited paper), Hyderabad, India, December 2008.

[127] B. Spengler. Linux Kernel Exploit, 2009. http://lists.grok.org.uk/pipermail/full-disclosure/2009-July/069714.html.

[128] B. Steffen, J. Knoop, and O. Rüthing. The Value Flow Graph: A Program Representation for Optimal Program Transformations. In 3rd European Symp. on Programming (ESOP), pages 389–405. Springer-Verlag, LNCS 432, 1990.

[129] Sicherheitsgarantien Unter REALzeitanforderungen, 2010. http://www.sureal-projekt.org/.

[130] H. Theiling. Extracting safe and precise control flow from binaries. In RTCSA ’00: Proceedings of the Seventh International Conference on Real-Time Systems and Applications, page 23, Washington, DC, USA, 2000. IEEE Computer Society.

[131] H. Theiling. Control Flow Graphs for Real-Time System Analysis. PhD thesis, Universität des Saarlandes, 2003.

[132] H. Theiling, C. Ferdinand, and R. Wilhelm. Fast and Precise WCET Prediction by Separated Cache and Path Analyses. Real-Time Syst., 18(2-3):157–179, 2000.

[133] TIS Committee. Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification, 1995.

[134] M. Venable, M. R. Chouchane, M. E. Karim, and A. Lakhotia. Analyzing Memory Accesses in Obfuscated x86 Executables. In Detection of Intrusions and Malware, and Vulnerability Assessment, Second International Conference (DIMVA), volume 3548 of Lecture Notes in Computer Science. Springer, 2005.

[135] R. Venkitaraman and G. Gupta. Static program analysis of embedded executable assembly code. In CASES ’04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 157–166, New York, NY, USA, 2004. ACM.

[136] VoTUM, 2010. http://www2.in.tum.de/votum.

[137] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenström. The worst-case execution-time problem—overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst., 7(3):1–53, 2008.

[138] R. Wilhelm and H. Seidl. Übersetzerbau: Virtuelle Maschinen. Springer, 2007.

[139] R. P. Wilson and M. S. Lam. Efficient context-sensitive pointer analysis for C programs. In PLDI ’95: Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation, pages 1–12, New York, NY, USA, 1995. ACM.

[140] P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An integrated simdization framework using virtual vectors. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS ’05, 2005.

List of Figures

1.1 Executable and Linkable Format (ELF) ...... 13
1.2 PIC before Linking ...... 16
1.3 PIC after Linking ...... 17
1.4 Stack Layout ...... 20
4.1 A Program with Variable-Variable Assignments ...... 89

6.1 Control Flow Representation of Example 6.2 ...... 109
8.1 Example Program for Transition Invariants ...... 142

List of Tables

2.1 Benchmark Suite for Programs at Optimisation Level O0 ...... 43
2.2 Benchmark Suite for Programs at Optimisation Level O2 ...... 43

3.1 Benchmark Results ...... 73
6.1 Benchmark Suite for CFR at Optimisation Levels O0 and O2 ...... 118
6.2 Benchmark Suite for CFR with Side-Effect Analysis ...... 119
6.3 Benchmark Suite for Checking Adherence to the ABI ...... 120
7.1 Burst Accesses ...... 132
7.2 Test Suite for Programs at Optimisation Level O0 ...... 137
7.3 Influence on WCET Estimation (I) ...... 139
7.4 Influence on WCET Estimation (II) ...... 139
8.1 Benchmark Suite Comparing Simplex Analysis with Polyhedra Analysis ...... 157

List of Algorithms

2.1 Control Flow Reconstruction ...... 37
8.1 Widening of a Simplex with a Vertex V ...... 154
8.2 Widening of a Simplex with a Ray R ...... 155
8.3 Widening of a Simplex with a Line L ...... 155

List of Examples

1.1. Linux Kernel Vulnerability ...... 3
1.2. PIC ...... 15
1.3. Stack Frame ...... 20
1.4. Calling Convention ...... 21
1.5. Dynamic Stack Modification ...... 22
1.6. Local and Global Variables ...... 23
2.1. Switch Statement in PPC Assembler ...... 28
2.2. Jump Tables in PIC ...... 39
2.3. Branching ...... 40
2.4. Function Pointers ...... 41
2.5. Zero-Optimised Code ...... 42
2.6. Assembly and its corresponding Control Flow Representation ...... 46
3.1. Graphical Representation of a Conjunction of Equalities ...... 53
3.2. Least Upper Bound Computation ...... 56
3.3. Combining WP Transformers ...... 63
3.4. Embedding WP Transformers ...... 71
3.5. Convention for Saving and Restoring the Registers x3 and x4 ...... 75
3.6. Assembly for Optimisation Level O0 ...... 77
4.1. Least Upper Bound Computation ...... 83
4.2. Normal Form for Conjunctions of Linear Two-Variable Equalities ...... 85
5.1. Combining Analyses ...... 91
5.2. Scala Analysis Specification ...... 97
6.1. Global Variable Access ...... 101
6.2. Reference Parameter ...... 102
6.3. Modifying Potential ...... 103
6.4. Instantiating the Modifying Potential ...... 108
6.5. Escaping Local Addresses ...... 115
6.6. Reference Parameter ...... 116
6.7. Overwriting Organisational Stack Locations ...... 120
7.1. Pointers ...... 124
7.2. Memory Operand Alignment ...... 130
7.3. Burst Access ...... 132
7.4. Aggregate Structures ...... 134
7.5. Records ...... 135
7.6. Arrays ...... 135


8.1. Array Accesses ...... 141