FAKULT¨AT F¨UR INFORMATIK Behavior-Based Malware Detection
Total Page:16
File Type:pdf, Size:1020Kb
FAKULTAT¨ FUR¨ INFORMATIK DER TECHNISCHEN UNIVERSITAT¨ MUNCHEN¨ Dissertation in Informatik Behavior-based Malware Detection with Quantitative Data Flow Analysis Tobias Wuchner¨ ii FAKULTAT¨ FUR¨ INFORMATIK DER TECHNISCHEN UNIVERSITAT¨ MUNCHEN¨ Lehrstuhl XXII - Software Engineering Behavior-based Malware Detection with Quantitative Data Flow Analysis Tobias Wuchner¨ Vollstandiger¨ Abdruck der von der Fakultat¨ fur¨ Informatik der Technischen Universitat¨ Munchen¨ zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation. Vorsitzende: Univ.-Prof. Dr. rer. nat. Claudia Eckert Prufer¨ der Dissertation: 1. Univ.-Prof. Dr. rer. nat. Alexander Pretschner 2. Univ.-Prof. Dr.-Ing. Felix Freiling, Friedrich-Alexander-Universitat¨ Erlangen-Nurnberg¨ Die Dissertation wurde am 27.01.2016 bei der Technischen Universitat¨ Munchen¨ eingereicht und durch die Fakultat¨ fur¨ Informatik am 10.05.2016 angenommen. Acknowledgments I would like to thank my first adviser, Prof. Dr. Alexander Pretschner, for giving me the opportunity to almost freely decide on a research topic that I was truly passionate and excited about and for letting me “play around” – even though this meant that I got a bit offside the main research profile of the group. Although in particular the beginning of my time as a Phd student, without even a vague vision of where the whole thing should be going to, was sometimes frustrating, in retrospect I am glad having taken this road. My thanks also go to Prof. Dr.-Ing. Felix Freiling for agreeing to be my second adviser, for providing me with useful feedback in earlier stages of my research, and in particular for supporting me with my research stay in England. Furthermore, I would like to thank Siemens AG and the Siemens CERT for fund- ing my research. In particular I want to thank Thomas Schreck for stimulating my thoughts and teaching me what really matters in industry. My sincere thanks go to Prof. Dr. Mart´ın Ochoa for always encouraging me to carry on with my research, even when I was close to giving up. Your continuous belief in me and your tireless support definitely were among the main reasons for me to push further – thank you a lot for this! Also, I shall thank my office-mate Dr. Enrico Lovat and my flat-mate Dominik Holling for enduring my frustrations, for endless and fruitful discussions, and for always being open and critical. I am very grateful to Hannah Brosch, Benjamin Lautenschlager,¨ Alei Salem, and Enrico Lovat for reviewing my thesis and for providing valuable feedback that definitely improved the quality of this work. Thank you also Andrea Catalucci, Aleksander Cisłak, and Gaurav Srivastava for supporting me in the development of my research prototypes and for helping me to turn my often half-baked ideas into usable software. I would further like to express my gratitude to my parents and family for always having supported me in this endeavor, both financially and motivational. Last but most important, thank you Monika for always backing me up and sup- porting me through all these years, even though my decisions for sure did not make your life much easier. Thank you for your empathy and understanding when I was not as much there for you as I should have been. Thank you for forc- ing me to sometimes hold on, reflect on my goals and plans in life, and for always reminding me that there is a life beyond work and research. I will not forget this! v vi Abstract Malware remains one of the biggest IT security threats, with available detec- tion approaches struggling to cope with a professionalized malware development industry. The increasing sophistication of today’s malware and the prevalent us- age of obfuscation techniques renders traditional static detection approaches in- creasingly ineffective. This thesis contributes towards improving this situation by proposing a novel effective, robust, and efficient concept of leveraging quantita- tive data flow analysis for behavior-based malware detection. We interpret system calls, issued by monitored processes, as quantifiable flows of data between system entities, such as files, sockets, or processes. We aggregate multiple flows as quantitative data flow graphs (QDFGs) that model the behav- ior of a system during a certain period of time. We operationalize this model for behavior-based malware detection in four different ways by either detecting pat- terns of known malicious behavior in QDFGs of unknown samples, or by profiling and identifying malicious behavior with graph metrics on QDFGs. The core contribution of this thesis is the demonstration that quantitative data flow information improves detection effectiveness compared to non-quantitative analyses. We establish high detection effectiveness, obfuscation robustness, and efficiency by evaluations on a large and diverse malware and goodware data set. vii viii Zusammenfassung Schadsoftware stellt nach wie vor eines der großten¨ Probleme im Bereich IT-Sicherheit dar. Die Effektivitat¨ verfugbarer¨ Erkennungsmethoden, insbe- sondere kommerzieller Anti-Virus-Produkte, ist in zunehmendem Maß durch die ansteigende Professionalisierung der Entwickler von Schadsoftware be- droht. Vor allem die stetig steigende Komplexitat¨ und der mittlerweile nahezu flachendeckende¨ Einsatz von Obfuskations- und Anti-Analysetechniken durch Standard-Schadsoftware stellen insbesondere statische Erkennungsansatze¨ vor große Probleme. Die vorliegende Arbeit leistet einen Beitrag hin zu einer Verbesse- rung dieser Situation mit einem neuen, effektiven, robusten und effizienten Ansatz zur Erkennung von Schadsoftware basierend auf quantitativer Datenflussanalyse. Dazu interpretieren wir von Prozessen ausgeloste¨ Funktionsaufrufe als quan- tifizierbare Datenflusse¨ zwischen verschiedenen Systemressourcen, etwa Datei- en, Netzwerkschnittstellen, oder anderen Prozessen. Mehrerer solcher Datenflusse¨ werden dann in Form sogenannter quantitativer Datenflussgraphen (QDFGs) ag- gregiert, die das Systemverhalten wahrend¨ eines bestimmten Zeitraums modellie- ren. Wir stellen vier verschiedene Ansatze¨ zur Operationalisierung dieses Modells fur¨ die verhaltensbasierte Erkennung von Schadsoftware vor. Diese geschieht ent- weder durch Definition und Wiederkennung von Mustern bekannten Schadver- haltens in QDFGs unbekannter Prozesse oder uber¨ das Erstellen von Profilen ty- pischen Schadverhaltens mit Hilfe spezieller Graphmetriken auf QDFGs. Der Kernbeitrag dieser Arbeit ist der Nachweis der Nutzlichkeit¨ quantitativer Datenflussanalyse zur Verbesserung der Erkennungseffektivitat¨ gegenuber¨ nicht- quantitativen Analysen. Wir weisen die hohe absolute Effektivitat,¨ Robustheit und Effizienz unserer Erkennungsansatze¨ auf Basis einer großen und heterogenen Sammlung bosartiger¨ und gutartiger Software nach. ix x Contents Acknowledgmentsv Abstract vii Zusammenfassung ix Outline of this Thesis xv 1. Introduction1 1.1. Research Description...........................4 1.1.1. Tackled Problems and Research Goals.............5 1.1.2. Research Questions........................5 1.2. Structure and Research Methodology.................6 1.3. Contributions...............................8 2. Background9 2.1. Malware..................................9 2.1.1. Malware History.........................9 2.1.2. Malware Taxonomy....................... 11 2.2. Malware Detection............................ 15 2.2.1. Static Detection.......................... 16 2.2.2. Dynamic Detection........................ 20 2.2.3. Gap Analysis and Assessment................. 24 3. System Model 25 3.1. Quantitative Data Flow Graphs..................... 26 3.2. Model Instantiation............................ 29 3.3. Malware Data Flow Behavior Example................. 34 4. Pattern-based Detection 35 4.1. Deductive Pattern-based Detection................... 37 4.1.1. Introduction............................ 37 4.1.2. Approach............................. 39 4.1.3. Evaluation............................. 48 xi Contents 4.1.4. Related Work........................... 54 4.1.5. Discussion and Conclusion................... 56 4.2. Inductive Pattern-based Detection................... 58 4.2.1. Introduction............................ 58 4.2.2. Preliminaries........................... 61 4.2.3. Approach............................. 63 4.2.4. Evaluation............................. 78 4.2.5. Related Work........................... 85 4.2.6. Discussion and Conclusion................... 87 5. Metric-based Detection 89 5.1. Deductive Metric-based Detection................... 93 5.1.1. Introduction............................ 93 5.1.2. Approach............................. 96 5.1.3. Evaluation............................. 106 5.1.4. Related Work........................... 114 5.1.5. Discussion and Conclusion................... 115 5.2. Inductive Metric-based Detection.................... 116 5.2.1. Introduction............................ 116 5.2.2. Preliminaries........................... 119 5.2.3. Approach............................. 120 5.2.4. Evaluation............................. 134 5.2.5. Related Work........................... 147 5.2.6. Discussion and Conclusion................... 148 6. Assessment and Operationalization 149 6.1. Assessment................................ 149 6.1.1. Effectiveness........................... 149 6.1.2. Efficiency............................. 155 6.1.3. Summary and Discussion.................... 157 6.2. Operationalization............................ 160 6.2.1. Offline Detection......................... 160 6.2.2. Online Detection........................