Code Obfuscation and Malware Detection by Abstract Interpretation

Mila Dalla Preda
Ph.D. Thesis
Università degli Studi di Verona, Dipartimento di Informatica
Advisor: prof. Roberto Giacobazzi
Series N°: TD ????
Università di Verona, Dipartimento di Informatica, Strada le Grazie 15, 37134 Verona, Italy

Summary

An obfuscating transformation aims at confusing a program in order to make it more difficult to understand while preserving its functionality. Software protection and malware detection are two major applications of code obfuscation. Software developers use code obfuscation to defend their programs against attacks on their intellectual property, usually called malicious host attacks. Making a program more difficult to understand obstructs malicious reverse engineering, a typical attack on the intellectual property of programs. Malware writers, on the other hand, usually obfuscate their malicious code in order to avoid detection. In this setting, the ability of code obfuscation to foil most existing detection techniques, such as misuse detection algorithms, lies in the purely syntactic nature of those techniques, which makes malware detection sensitive to slight modifications of program syntax.

In the software protection scenario, researchers try to develop sophisticated obfuscating techniques that resist as many attacks as possible. In the malware detection scenario, it is instead important to design advanced detection algorithms that detect as many variants of obfuscated malware as possible. Both malicious host attacks and malicious code attacks are harmful threats to the security of computer networks.

In this dissertation, we are interested in both security issues described above. In particular, we describe a formal approach to code obfuscation and malware detection based on program semantics and abstract interpretation.
This theoretical framework helps to counter some well-known drawbacks of software protection through code obfuscation, and to improve existing malware detection schemes. In fact, the lack of rigorous theoretical bases for code obfuscation prevents any possibility of formally studying and certifying the effectiveness of obfuscating transformations in protecting proprietary programs. Moreover, in order to design malware detection schemes that are resilient to obfuscation, we have to focus on program semantics rather than on program syntax.

Our formal framework for code obfuscation relies on a semantics-based definition of code obfuscation that characterizes each program transformation T as a potential obfuscation in terms of the most concrete property preserved by T on program semantics. Deobfuscating techniques, and reverse engineering in general, usually begin with some sort of static program analysis, which can be specified as an abstraction of program semantics. In the software protection scenario, this observation naturally leads to modeling attackers as abstractions of program semantics. In fact, the abstraction modeling the attacker expresses the amount of information, namely the semantic properties, that the attacker is able to observe. It follows that, by comparing the degree of abstraction of an attacker A with that of the most concrete property preserved by an obfuscating transformation T, it is possible to understand whether obfuscation T defeats attack A. By the same reasoning it is possible to compare the efficiency of different obfuscating transformations, as well as the ability of different attackers to counter a given obfuscation. We apply our semantics-based framework to a known control code obfuscation technique that aims at confusing the control flow of the original program by inserting opaque predicates.

As argued above, an obfuscating transformation modifies a program while preserving an abstraction of its semantics.
This means that different obfuscated versions of the same malware have to share (at least) the malicious intent, namely the maliciousness of their semantics, even if they express it through different syntactic forms. The basic idea of our formal approach to malware detection is to use program semantics to model both malware and program behaviour, and semantic abstractions to hide the details changed by the obfuscation. Thus, given an obfuscation T, we are interested in defining an abstraction of program semantics that does not distinguish between the semantics of malware M and the semantics of its obfuscated version T(M). In particular, we provide this suitable abstraction for an interesting class of commonly used obfuscating transformations. Given a malware detector D, it is always possible to define its semantic counterpart by analyzing how D works on program semantics. By translating both malware detectors and obfuscating transformations into the semantic world, we are able to certify which obfuscations a detector is able to handle. This means that our semantics-based framework provides a formal setting where designers of malware detectors can prove the efficiency of their algorithms.

Acknowledgements

The first person that I would like to thank is my advisor Roberto Giacobazzi, for his precious guidance and encouragement over these years. He taught me how to develop my ideas and how to be independent. Many thanks go to Saumya Debray for his very kind hospitality and constant support while I was visiting the Department of Computer Science at Tucson. I sincerely thank Somesh Jha and Mihai Christodorescu for the interesting discussions we had and their precious collaboration. I also thank Matias Madou and Koen De Boscheere for the work done together.
Warm thanks go to the participants of the Doctoral Symposium affiliated with Formal Methods 2006, and in particular to the organizers Ana Cavalcanti, Augusto Sampaio and Jim Woodcock, for their interesting comments and advice on my work. I would also like to thank my PhD thesis referees Christian Collberg and Patrick Cousot, as well as Andrea Masini and Massimo Merro, for their precious advice and comments on my studies.

Contents

Preface
1 Introduction
    1.1 Motivations
    1.2 The Problem
    1.3 The Idea: A Semantics-based Approach
    1.4 Main Results
    1.5 Overview of the Thesis
2 Basic Notions
    2.1 Mathematical Background
        2.1.1 Sets
        2.1.2 Ordered structures
        2.1.3 Fixpoints
        2.1.4 Closure operators
        2.1.5 Galois connections
        2.1.6 Galois connections and closure operators
    2.2 Abstract Interpretation
        2.2.1 Lattice of abstract interpretations
        2.2.2 Abstract Operations
        2.2.3 Abstract Semantics
    2.3 Syntactic and Semantic Program Transformations
3 Code Obfuscation
    3.1 Software Protection
        3.1.1 Obfuscating Transformations and their Evaluation
        3.1.2 A Taxonomy of Obfuscating Transformations
        3.1.3 Positive and Negative Theoretical Results
        3.1.4 Code Deobfuscation
    3.2 Malware Detection
        3.2.1 Detection Techniques
        3.2.2 Metamorphic Malware
        3.2.3 Theoretical Limitations
        3.2.4 Formal Methods Approaches
4 Code Obfuscation as Semantic Transformation
    4.1 Standard Definition of Code Obfuscation
    4.2 Semantics-based Definition of Code Obfuscation
        4.2.1 Constructive characterization of δØ
        4.2.2 Comparing Transformations
    4.3 Modeling Attackers
    4.4 Case study: Constant Propagation
    4.5 Discussion
5 Control Code Obfuscation
    5.1 Control Code Obfuscation
        5.1.1 Semantic Opaque Predicate Insertion
        5.1.2 Syntactic Opaque Predicate Insertion
        5.1.3 Obfuscating behaviour of opaque predicate insertion
        5.1.4 Detecting Opaque Predicates
    5.2 Opaque Predicates Detection Techniques
        5.2.1 Dynamic Attack
        5.2.2 Brute Force Attack
    5.3 Breaking Opaque Predicates by Abstract Interpretation
        5.3.1 Breaking Opaque Predicates n|f(x)
        5.3.2 Experimental results
        5.3.3 Breaking Opaque Predicates h(x) = g(x)
        5.3.4 Comparing Attackers
    5.4 Discussion
6 A Semantics-Based approach to Malware Detection
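
The Summary describes attackers as abstractions of program semantics and applies the framework to opaque predicate insertion. As a minimal sketch of that idea, the Python fragment below models an attacker as the parity abstraction and checks whether a predicate of the form 2 | f(x) holds for every integer x. The concrete predicates, the choice of the parity domain, and all names (Parity, abs_add, abs_mul, join, predicate_always_true) are illustrative assumptions and are not taken from the thesis.

```python
from enum import Enum


class Parity(Enum):
    """Hypothetical abstract domain: the only property observed is parity."""
    EVEN = 0
    ODD = 1
    TOP = 2  # parity unknown


def abs_add(a, b):
    """Abstract addition on parities."""
    if Parity.TOP in (a, b):
        return Parity.TOP
    return Parity.EVEN if a == b else Parity.ODD


def abs_mul(a, b):
    """Abstract multiplication on parities."""
    if Parity.EVEN in (a, b):
        return Parity.EVEN  # an even factor makes the product even
    if Parity.TOP in (a, b):
        return Parity.TOP
    return Parity.ODD


def join(a, b):
    """Least upper bound of two parities."""
    return a if a == b else Parity.TOP


def f_abstract(x):
    """Abstract counterpart of f(x) = x*x + x."""
    return abs_add(abs_mul(x, x), x)


def g_abstract(x):
    """Abstract counterpart of g(x) = x*x + 1."""
    return abs_add(abs_mul(x, x), Parity.ODD)


def predicate_always_true(f_abs):
    """Decide 2 | f(x) for all x by evaluating f on each parity class
    of x and joining the results. True means the abstraction proves
    the predicate constant; False means it cannot."""
    result = join(f_abs(Parity.EVEN), f_abs(Parity.ODD))
    return result == Parity.EVEN


print(predicate_always_true(f_abstract))  # True: 2 | x*x + x is opaque
print(predicate_always_true(g_abstract))  # False: parity of x*x + 1 varies
```

In this sketch the parity attacker is precise enough to see through the first predicate, so the branch it guards adds no confusion for that attacker. An attacker modeled by a coarser abstraction, one that cannot observe parity, would not reach the same conclusion.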
