Network Traffic Profiling and Anomaly Detection for Cyber Security
Laurens D'hooge
Student number: 01309688

Supervisors: Prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters
Counselors: Prof. dr. Bruno Volckaert, dr. ir. Tim Wauters

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Information Engineering Technology

Academic year: 2017-2018

Acknowledgements

This thesis is the result of 4 months of work and I would like to express my gratitude towards the people who have guided me throughout this process. First and foremost I'd like to thank my thesis advisors prof. dr. Bruno Volckaert and dr. ir. Tim Wauters. By virtue of their knowledge and clear communication, I was able to maintain a clear target. Secondly I would like to thank prof. dr. ir. Filip De Turck for providing me the opportunity to conduct research in this field with the IDLab research group. Special thanks to Andres Felipe Ocampo Palacio and dr. Marleen Denert are in order as well. Mr. Ocampo's PhD research into big data processing for network traffic and the resulting framework are an integral part of this thesis. Ms. Denert has been the go-to member of the faculty staff for general advice and administrative dealings. The final token of gratitude I'd like to extend to my family and friends for their continued support during this process.

Laurens D'hooge

Network traffic profiling and anomaly detection for cyber security

Laurens D'hooge
Supervisor(s): prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters
L. D'hooge does his dissertation at the IDLab research group of the Faculty of Engineering and Architecture, Ghent University (UGent), Gent, Belgium. E-mail: [email protected]

Abstract— This article is a short summary of the research findings of a Master's dissertation on the intersection of network intrusion detection, big data processing and machine learning. Its results contribute to the foundation of a new research project at the Internet Technology and Data Science Lab (IDLab) of the University of Ghent.

Keywords— Network intrusion detection, big data, Apache Spark, machine learning, Metasploit

I. INTRODUCTION

The full text of this dissertation covers a wide range of topics, connected to existing research fields at IDLab [1], a.o.:
• Machine learning and data mining
• Cloud and big data infrastructures
• Cyber security

The three main sections that were researched are summarized briefly. These sections are:
• A capture setup for network traffic with an automated hacker and an intentionally vulnerable target
• A detailed study of the state of the art in big data processing for the purpose of network intrusion detection (NIDS), with special attention for the Apache Spark engine and ecosystem
• The processing of a public NIDS data set with machine learning algorithms. Implementations cover both Scikit-learn and Apache Spark to research the benefits and drawbacks of single-host versus distributed processing

II. AUTOMATED ATTACKER AND VULNERABLE TARGET

Data quality is of paramount importance to build any machine learning system. A system that can generalize needs to have seen lots of normal and attack traffic. Obtaining clean samples is a difficult problem, especially if those samples have to be labeled. Human labeling is hard because network traffic quickly generates large volumes of varied data. The labeling is complicated further by the contextual classification difficulty of network packets and flows: they might not be anomalous on their own, but when seen as part of a set, they do indicate an attack. To solve this problem a setup was created that combines an automated hacker and a target with intentionally vulnerable services to exploit. This experiment was tested on the cloud experiment infrastructure of the university, the Virtual Wall [2].

A. Automated hacker

Manual penetration testing is a laborious, repetitive process that can be automated. This thought was the inspiration for the creation of APT2 [3], an open-source project on GitHub by an employee of Rapid7, the company behind the biggest penetration testing framework, Metasploit. APT2 is a Python-powered extensible framework for Metasploit and nmap automation. APT2 starts with an nmap scan or an nmap file with the details of a previous scan. Based on the information from the scan, events are fired that get picked up by automated versions of reconnaissance and exploit modules from Metasploit. The program requires almost no human interaction and is customizable. To avoid unwanted intrusiveness, a safety setting is available in APT2, with values ranging from one to five. One is the most aggressive level and can potentially crash the target server. Level 5 is the weakest intrusiveness level and only performs information gathering tasks. As a final extension to this research part, I have written an attack that automates another Metasploit module and nmap to find hosts with a vulnerability in the TCP/IP stack, allowing them to act as intermediaries for a stealthy port scan of the real target.
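To make the scan-driven, event-based flow concrete, the following minimal Python sketch illustrates the idea. It is not APT2's actual code or API; the handler names, the scan.xml file and the safety ratings are hypothetical stand-ins. The sketch parses an nmap XML report and fires a handler for every open service it finds, similar in spirit to how APT2 picks up scan events and launches the matching reconnaissance and Metasploit modules.

# Conceptual sketch of scan-driven module dispatch (not APT2's real code or API).
# Assumes an nmap XML report exists, e.g. produced with: nmap -sV -oX scan.xml <target>
import xml.etree.ElementTree as ET

# Hypothetical handlers standing in for automated reconnaissance/exploit modules.
# The number is a stand-in safety rating: 5 = information gathering only, 1 = aggressive.
def enum_ssh(host, port):
    print(f"[recon] ssh banner enumeration on {host}:{port}")

def probe_http(host, port):
    print(f"[recon] http fingerprinting on {host}:{port}")

HANDLERS = {"ssh": (5, enum_ssh), "http": (5, probe_http)}

def dispatch(xml_path="scan.xml", safety_level=5):
    """Fire a handler for every open service in the nmap report whose
    safety rating is at least the configured level (APT2-like behaviour)."""
    root = ET.parse(xml_path).getroot()
    for host in root.iter("host"):
        addr = host.find("address").get("addr")
        for port in host.iter("port"):
            state, service = port.find("state"), port.find("service")
            if state is None or state.get("state") != "open" or service is None:
                continue
            rating, handler = HANDLERS.get(service.get("name"), (None, None))
            if handler is not None and rating >= safety_level:
                handler(addr, port.get("portid"))

if __name__ == "__main__":
    dispatch()

In APT2 itself the modules, their safety levels and the event queue are of course far richer; the sketch only shows the dispatch pattern described above.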
B. Vulnerable target

An automated hacker isn't useful without a target to attack. To collect quality traffic beyond probing (i.e. port scanning and fingerprinting), the target should be exploitable. The second stage of this research part was the search for and integration of a deliberately vulnerable system in a controlled environment. After comparing different options, Metasploitable3 was chosen to be the target. It integrates well with Metasploit because it is also developed and maintained by Rapid7 (and the open source community). Metasploitable3 is a portable virtual machine (VM) built with Packer, Chef and Vagrant [4]. Packer uses a template system to specify the creation steps of virtual machines in a portable way. Chef is a tool to configure what software should be installed on a VM and how it should be configured. Chef's configuration files are called recipes and are listed in a section of the Packer build template. After building the VM, the final configuration (e.g. networking) is done by Vagrant, which also acts as a management system for virtual machines, with functionality akin to Docker for containers.

C. Results

The setup has been experimentally verified on the Virtual Wall. The experiment layout is shown in figure 1. The layout is a stripped-down version of the full layout to reduce the resource claim on the Virtual Wall. An even smaller layout without the us and dst nodes has been used for testing as well. Traffic collection was done with TShark, Wireshark's command line interface. The packet capture files were transformed into flows with Joy, an open source tool by Cisco for network security research, monitoring and forensics [5]. Inspection of the generated traffic at the available safety levels revealed that APT2 was successful in gathering information with the modules for which Metasploitable ran a service. This proves the validity of the setup and opens the door to extending APT2 and Metasploitable in tandem to exploit a greater number of services. Labeling the resulting captures is less problematic, because of the controlled environment in which the experiment runs. Specific modules can be activated to attack specific services, with much less overhead and noise than capturing in a network with active users.
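The capture step itself can be scripted; the sketch below is a minimal example, assuming a Linux capture node with TShark installed, in which the interface name, duration and file name are placeholders. Joy's own invocation for the pcap-to-flow conversion is deliberately not shown, to avoid misstating its options.

# Minimal capture sketch: record experiment traffic with TShark for later
# flow extraction with Joy. Interface, duration and paths are placeholders.
import subprocess

def capture(interface="eth0", seconds=300, outfile="apt2_run.pcap"):
    """Run TShark non-interactively and stop after a fixed duration."""
    cmd = [
        "tshark",
        "-i", interface,              # capture interface
        "-a", f"duration:{seconds}",  # autostop condition
        "-w", outfile,                # write raw packets to a capture file
    ]
    subprocess.run(cmd, check=True)
    return outfile

if __name__ == "__main__":
    pcap = capture()
    print(f"Capture written to {pcap}; convert it to flows with Joy next.")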
Fig. 1. Experiment layout

III. BIG DATA FOR NETWORK INTRUSION DETECTION SYSTEMS

Network traffic maps directly onto the three dimensions of big data: volume, velocity and variety. Because of this, a part of the research time was invested in surveying the state of the art of big data processing, with the specific purpose of network intrusion detection. After this research phase, the Apache Spark engine was studied from an architectural overview down to the optimization efforts at the byte- and native-code level.

A. Apache Spark

The core processing engine in this dissertation is Apache Spark, the successor of Apache Hadoop. Spark is an in-memory big data engine with three layers (see figure 2), the foundation being the Spark Core, which provides shared functionality for the four libraries on top of it.

Fig. 2. The Spark ecosystem

The main abstraction underlying Spark is the resilient distributed dataset (RDD), on top of which more recent additions like DataFrames and Datasets have been built. More efficient processing is continually introduced into the Spark project and its libraries. Two main projects stand out. The Spark SQL Catalyst optimizer works like a database query optimizer: it receives a programmed logical query plan, generates an optimized logical query plan and ultimately outputs Java bytecode that runs on each machine. The other umbrella project concerned with optimization is called Project Tungsten. The research efforts under Tungsten are focused on improving memory management and binary processing (elimination of memory and garbage collection overhead), cache-aware computation (making optimal use of on-die CPU cache) and code generation (improving serialization and removing virtual function calls). These improvements aim to make Spark the dominant big data processing engine for times to come.

B. IDLab NIDS architecture

This dissertation is complementary to the research of an IDLab PhD student, Andres Ocampo. His research focuses on user profiling and data analysis from a streaming perspective [8], while this research has a batch perspective. The layout in which both systems integrate is shown in figure 3. An avenue for future research is the deep integration of the real-time stream processing and profiling with detailed batch analysis.

IV. MACHINE LEARNING FOR NETWORK INTRUSION DETECTION SYSTEMS

The biggest and last part of this dissertation is the use of machine learning (ML) algorithms for IDS purposes, with implementations on Spark (distributed) and Scikit-learn [11] (single-host), to study whether and how using Spark is beneficial in this process. Research began with a broad search of the state of the art in machine learning and anomaly detection, followed by more spe-
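To make the single-host path concrete, the sketch below is an illustrative example only: the flows.csv file, its feature columns and its label column are hypothetical placeholders rather than the actual data set used in the dissertation. It trains a random forest on labeled flow records with Scikit-learn and reports accuracy on a held-out split.

# Single-host sketch: classify labeled flow records with Scikit-learn.
# flows.csv, its feature columns and the 'label' column are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("flows.csv")
X = df.drop(columns=["label"])        # numeric flow features
y = df["label"]                       # benign / attack classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))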
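The distributed counterpart can be written against Spark's DataFrame and pyspark.ml APIs. The sketch below mirrors the Scikit-learn example under the same hypothetical assumptions (a flows.csv with numeric features and a string label) and also shows where the Catalyst-optimized plan can be inspected with explain().

# Distributed sketch: the same classification task on Spark (pyspark.ml).
# The input file and column names are the same hypothetical placeholders.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nids-ml").getOrCreate()
df = spark.read.csv("flows.csv", header=True, inferSchema=True)
df.explain()  # print the Catalyst-optimized physical plan for the read

feature_cols = [c for c in df.columns if c != "label"]
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label", outputCol="indexedLabel"),
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    RandomForestClassifier(labelCol="indexedLabel", featuresCol="features",
                           numTrees=100),
])

train, test = df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
print("accuracy:", evaluator.evaluate(predictions))
spark.stop()

Because both sketches share the same pipeline structure (load, assemble features, fit, evaluate), they give a like-for-like basis for comparing single-host and distributed processing, which is the question this section set out to study.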