Efficient Machine Learning for Attack Detection

Institut für Systemsicherheit

Dissertation approved by the Carl-Friedrich-Gauß-Fakultät of the Technische Universität Carolo-Wilhelmina zu Braunschweig for the degree of Doktoringenieur (Dr.-Ing.)

by Christian Wressnegger
born on 29 July 1984 in Graz, Austria

Submitted: 11 September 2018
Defense: 16 November 2018

First reviewer: Prof. Dr. Konrad Rieck
Second reviewer: Prof. Dr. Thorsten Holz

Supervisory committee:
Prof. Dr. Konrad Rieck, Technische Universität Braunschweig
Prof. Dr. Klaus-Robert Müller, Technische Universität Berlin

Doctoral committee:
First reviewer: Prof. Dr. Konrad Rieck, Technische Universität Braunschweig
Second reviewer: Prof. Dr. Thorsten Holz, Ruhr-Universität Bochum
Chair: Prof. Dr. Martin Johns, Technische Universität Braunschweig

Dedicated to my parents.

Abstract

Detecting and fending off attacks on computer systems is an enduring problem in computer security. In light of a plethora of different threats and the growing automation used by attackers, we are in urgent need of more advanced methods for attack detection. Manually crafting detection rules is by no means feasible at scale, and automatically generated signatures often lack context, so that they fall short in detecting slight variations of known threats. In this thesis, we address the need for advanced attack detection and develop methods that detect attacks using machine learning, in order to establish a higher degree of automation for reactive security.

Machine learning is data-driven and not void of bias. Applying machine learning effectively to attack detection therefore requires periodic retraining over time. However, the training complexity of many learning-based approaches is substantial. We show that with the right data representation, efficient algorithms for mining substring statistics, and implementations based on probabilistic data structures, training the underlying model can be achieved in linear time.

In two different scenarios, we demonstrate the effectiveness of so-called language models, which generically portray the content and structure of attacks: on the one hand, we learn the malicious behavior of Flash-based malware using classification, and on the other hand, we detect intrusions by learning normality in industrial control networks using anomaly detection. With a data throughput of up to 580 Mbit/s during training, we not only meet our expectations with respect to runtime but also outperform related approaches by up to an order of magnitude in detection performance.

The same techniques that facilitate learning in the previous scenarios can also be used for revealing malicious content embedded in passive file formats, such as Microsoft Office documents. As a further showcase, we develop a method based on the efficient mining of substring statistics that is able to break obfuscations irrespective of the key length used, at up to 25 Mbit/s, and thus succeeds where related approaches fail.

These methods significantly improve detection performance and enable operation in linear time. In doing so, we counteract the trend of compensating for increasing runtime requirements with additional resources. While the results are promising and the approaches provide urgently needed automation, they cannot and are not intended to replace human experts or traditional approaches, but are designed to assist and complement them.

Summary (Zusammenfassung)

Detecting and fending off attacks on end users and networks has been a persistent problem in computer security for many years. Given the large number of different attack vectors and the increasing automation with which attacks are carried out, modern methods for attack detection are urgently needed. Manually creating rules for detecting attacks is no longer manageable at this scale, and automatic approaches to signature generation often generalize poorly. This dissertation develops approaches for detecting attacks reliably, but also efficiently, using machine learning methods. They counter the automation of attacks with a correspondingly high degree of automation of defensive measures.

For reliable, learning-based attack detection, the underlying models must be periodically retrained over the course of their deployment. Training such methods, however, is computationally expensive and is carried out on very large amounts of data. Runtime-efficient learning methods are therefore crucial. We show that by using efficient algorithms for the statistical analysis of strings and implementations based on probabilistic data structures, learning effective attack detection is possible even in linear time.

Using two different use cases, we demonstrate the effectiveness of models based on the extraction of so-called n-grams: on the one hand, we consider the detection of Flash-based malware by means of classification, and on the other hand, the detection of attacks on industrial networks and SCADA systems by means of anomaly detection. In doing so, we achieve a data throughput of up to 580 Mbit/s while training these models and at the same time clearly exceed the detection performance of other approaches.

The same techniques that enable these learning approaches can also be used to detect malicious code that has been embedded in other (passive) file formats and obfuscated by means of simple encryption. To this end, we develop a method that breaks simple encryption based on the statistical analysis of strings. The developed approach works independently of the key length used, with a data throughput of up to 25 Mbit/s, and thus enables successful deobfuscation in cases where other approaches fail.

The results achieved with regard to runtime efficiency and detection performance are promising. The presented methods enable the urgently needed automation of defensive measures, but are not meant to replace experts or established methods; rather, they are meant to support and complement them.

Acknowledgments

Above all, I would like to thank my parents, who have supported me on my bumpy journey through school and all the years at university, where I have finally been able to live out my passion for computers. I consider myself very lucky.

Speaking of luck, I would like to especially thank Prof. Dr. Konrad Rieck for not brushing me off when I first contacted him out of the blue, most naively asking for an internship or, I quote, “something”. No, I did not have funding and obviously I had not thought this through, but I was looking for a change and I was passionate about it. Luckily, he was patient enough to listen to me and suggested that, if I was serious about it, I should apply as a PhD candidate at the upcoming call in a couple of months. So I did. Ever since, Konrad has been a great advisor and friend to me. Thank you for all the discussions, valuable suggestions and words of advice.

Special thanks also go to Prof. Dr. Klaus-Robert Müller and Dr. Sebastian Mika, who made a smooth start to my PhD back in Berlin possible.

Moreover, I would like to thank Prof. Dr. Thorsten Holz for refereeing the thesis and Prof. Dr. Martin Johns for chairing the defense. In light of your full schedules, I am extremely glad that both of you took the time to participate in the process. Thank you for an interesting discussion and all the kind words.

My most profound thanks, furthermore, go to all my colleagues and friends whom I have met along the way. I would like to emphasize that these two groups of people are not mutually exclusive. I very much appreciate the time we have spent discussing anything and everything, playing in the worst imaginable soccer team, traveling, trolling and ranting. I am not even going to try to name you all, but I am truly thankful.

Finally, I would like to gratefully acknowledge funding from the German Federal Ministry of Education and Research under the projects PROSEC (FKZ 01BY1145) and INDI (FKZ 16KIS0154K).

Publications

The following papers and journal articles have emerged in the course of this thesis. They include works from various fields of computer security, such as malware and intrusion detection or the discovery of vulnerabilities. Publications indicated by a filled square (■) have been authored by the thesis author; those denoted by an empty square (□) originate from collaborations under the lead of other researchers.

Vulnerability Discovery

□ Chucky: Exposing Missing Checks in Source Code for Vulnerability Discovery.
  F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck.
  In Proc. of the 20th ACM Conference on Computer and Communications Security (CCS)

□ PULSAR: Stateful Black-box Fuzzing of Proprietary Network Protocols.
  H. Gascon, C. Wressnegger, F. Yamaguchi, D. Arp, and K. Rieck.
  In Proc. of the 11th International Conference on Security and Privacy in Communication Networks (SECURECOMM)

■ Twice the Bits, Twice the Fun: Vulnerabilities caused by Migration to 64-Bit Systems.
  C. Wressnegger, F. Yamaguchi, A. Maier, and K. Rieck.
  In Proc. of the 23rd ACM Conference on Computer and Communications Security (CCS)

■ 64-Bit Migration Vulnerabilities.
  C. Wressnegger, F. Yamaguchi, A. Maier, and
