Detection of Unknown Cyber Attacks Using Convolution Kernels Over Attributed Language Models

Detection of Unknown Cyber Attacks Using Convolution Kernels Over Attributed Language Models Kumulative Dissertation zur Erlangung des Doktorgrades (Dr. rer. nat.) der Mathematisch-Naturwissenschaftlichen Fakultät der Rheinischen Friedrich-Wilhelms-Universität Bonn von Patrick Duessel aus Berlin Bonn, Juli 2018 Dieser Forschungsbericht wurde als Dissertation von der Mathematisch-Naturwissenschaftlichen Fakultät der Universität Bonn angenommen und ist auf dem Hochschulschriftenserver der ULB Bonn http://hss.ulb.uni-bonn.de/diss_online elektronisch publiziert. 1. Gutachter: Prof. Dr. Michael Meier 2. Gutachter: Prof. Dr. Sven Dietrich Tag der Promotion: 2. Juli 2018 Erscheinungsjahr: 2018 To my beloved family. Acknowledgements I would like to express my gratitude to my supervisor Prof. Dr. Michael Meier for continuous support and advise throughout my dissertation journey. I would also like to thank Prof. Dr. Sven Dietrich for providing direction to my thesis as co-referent. Furthermore, I would like to thank Prof. Dr. Klaus-Robert Mueller for his scientific inspiration and his support to foster my interest for research and development in the field of machine learning and network security during the first years of my studies. I would also like to express my gratitude to all my colleagues at the Fraunhofer Institute, Berlin Institute of Technology as well as University of Bonn for fruitful discussions as well as an interactive and collaborative working atmosphere. I would like to thank my friends and colleagues Christian Gehl, Dr. Ulrich Flegel and Dr. Alexandra James for great moments shared and their personal support. Finally, I would like to express my love and gratitude to my family for their persistent moral support and for everything they have done for me. v Contents 1 Introduction 1 1.1 Known and unknown threats...............................2 1.2 Machine learning to detect unknown threats.......................4 1.3 Data representations for sequential data.........................9 1.4 Measuring similarity in feature spaces.......................... 13 1.5 Thesis outline....................................... 17 Bibliography 19 2 Learning intrusion detection: supervised or unsupervised? 27 2.1 Introduction........................................ 27 2.2 Publication........................................ 27 2.3 Summary......................................... 36 3 Cyber-Critical Infrastructure Protection Using Real-Time Payload-Based Anomaly Detection 37 3.1 Introduction........................................ 37 3.2 Publication........................................ 37 3.3 Summary......................................... 50 4 Automatic feature selection for anomaly detection 51 4.1 Introduction........................................ 51 4.2 Publication........................................ 51 4.3 Summary......................................... 58 5 Incorporation of Application Layer Protocol Syntax into Anomaly Detection 59 5.1 Introduction........................................ 59 5.2 Publication........................................ 59 5.3 Summary......................................... 75 6 Detecting zero-day attacks using context-aware anomaly detection at the application layer 77 6.1 Introduction........................................ 77 6.2 Publication........................................ 77 6.3 Summary......................................... 95 7 Learning and classification of malware behavior 97 7.1 Introduction........................................ 97 7.2 Publication........................................ 97 vii 7.3 Summary......................................... 118 8 Conclusions 119 List of Figures 123 List of Tables 125 Acronyms 127 viii CHAPTER 1 Introduction Information Technology (IT) plays a crucial role in modern society. Over the past decade the Internet has emerged as a key communication platform for businesses and private users demonstrated by an exponential increase not only of hosts connected to the Internet but also Internet users. While in 2007 561.6 Million hosts were advertised in the Domain Name System (DNS) 1, the number of hosts has almost doubled to 1.062 Billion 2 over the past 10 years [1]. In addition, with the ongoing adoption of Internet of Things (IoT) as key technology enabler for advanced infrastructure concepts such as smart cities, smart grids, virtual power plants, or intelligent transportation systems (e.g. connected cars), the number of hosts connected to the Internet is expected to increase exponentially to 30 Billion devices by 2020 [2]. With an increase of hosts on the Internet, the number of Internet users has also increased dramatically over the past decade. While in 2007 20% of the world’s population (1,319 Billion) utilized the Internet, the number of Internet users in 2017 has more than doubled to 3,885 Billion users (approx. 52% of the world’s population [3]). Considering the pervasiveness of today’sIT organizations become increasingly exposed – predominantly driven by threat likelihood and vulnerability level of the organization. On the one hand, the number of reported security incidents world-wide has increased tremendously at a compound annual growth rate of approx. 60% between 2009 (3.4 Million incidents) and 2015 (59 Million incidents) [4]. One reason for the tremendous growth of computer and network attacks is an increase of attack automation and attack sophistication [5]. While in earlier days computer attacks had to be crafted manually and were aimed at specific targets, today complex malware tool kits are readily available to infiltrate a network, persistently deploy on target hosts, selectively propagate across the network and perform a wide range of actions ranging from sensitive data ex-filtration to distributed denial of service attacks [6]. A common concept to mount sophisticated and large-scale attacks are Botnets in which hosts are infected by malware to receive commands from a Command & Control (C2) server to execute malicious payload [7]. For example, the self-propagating worm Mirai infected over 600.000 IoT devices (mostly cameras and routers). Infected devices were controlled by so called C2 servers to mount the largest distributed denial of service attack on record with more than one Terabits per second (Tbps) at its peak causing temporary but significant connectivity issues for Internet users. On the other hand, while there is an increase in threats, the vulnerability surface of organizations is also growing. As an example, the number of disclosed software vulnerabilities has grown at a compound annual growth rate of approx. 8.3% from 6,516 in 2007 to 14,451 reported vulnerabilities in 2017 [8]. Lack of security awareness, continuous adoption of emerging technology, increasingly complex software 1 Devices behind Network Address Translation (NAT) segmented networks are not included in the estimation. 2 Billion = 109 1 Chapter 1 Introduction combined with fast-paced software development life cycles can be considered major factors that drive an organization’s vulnerability level [9]. The following sections will provide an overview of challenges in the fields of computer and network security – specifically related to the detection of network-based computer attacks and host-based malware attacks – and furthermore introduce a machine learning-based approach to overcome these challenges. 1.1 Known and unknown threats One of today’s key challenges is the detection of unknown threats. The majority of the state-of-the-art security controls deployed in organizations are reactive in nature and predominantly based on known attack signatures. Although those safeguards demonstrate high detection accuracy for known attack patterns, drawbacks include their inability to reliably detect not only unknown but also modified versions of known threats. For example, in the context of targeted attacks by adversaries with advanced capabilities (e.g. state-sponsored attackers), computer and network attacks have become increasingly sophisticated. Obfuscation techniques such as polymorphism, metamorphism, armoring or simply using "living-off-the- land" functionality (i.e. file-less malware) are specifically designed to thwart existing countermeasures and enable undetected delivery and execution of malicious code [e.g. 10, 11]. The problem of unknown threat detection is known in many different areas in computer and network security. Network attacks and malware are among the most common threat vectors [12] and thus, are focal point of this dissertation. Network attacks. Extensive research has been conducted to develop methods and systems for the detection of computer and network attacks. A computer attack refers to an attempt to compromise confidentiality, availability or integrity of a computing resource. An Intrusion Detection System (IDS) captures and analyzes streams of data in networks or on hosts to detect computer attacks. The conceptual design of intrusion detection systems goes back to the work of Denning [13] which serves as a foundation for numerous state-of-the-art intrusion detection systems nowadays [e.g. 14, 15]. An overview of existing intrusion detection approaches is provided by the work of McHugh [5]. Intrusion detection techniques can be broadly classified into misuse detection and anomaly detection [16]. While misuse detection methods are intended to recognize known attack patterns, anomaly detection techniques aim at identifying unusual structural or activity patterns in observed data. A well-known drawback of misuse detection systems is their inability

Load more