Mitigating Concept Drift in Data Mining Applications for Intrusion Detection Systems
Total Page:16
File Type:pdf, Size:1020Kb
Mitigating concept drift in data mining applications for intrusion detection systems Koutrouki Evgenia SID: 3307160006 SCHOOL OF SCIENCE & TECHNOLOGY A thesis submitted for the degree of Master of Science (MSc) in Communications and Cybersecurity DECEMBER 2017 THESSALONIKI – GREECE 1 Mitigating concept drift in data mining applications for intrusion detection systems Koutrouki Evgenia SID: 3307160006 Supervisor: Prof. Georgios Ioannou SCHOOL OF SCIENCE & TECHNOLOGY A thesis submitted for the degree of Master of Science (MSc) in Communications and Cybersecurity DECEMBER 2017 THESSALONIKI – GREECE 2 ABSTRACT The phenomenon of concept drift is defined as the unexpected behavior of a data stream under changing environments. It is considered one of the major problems of data mining applications. Intrusion detection systems are correlated with data mining due to the fact that they collect, monitor and analyze data, so as it is expected, they also experience concept drift. By analyzing the different types of data mining, machine learning, intrusion detection systems and concept drift, a new method is developed with the aim to mitigate this phenomenon in data mining applications for intrusion detection systems. Koutrouki Evgenia 31/12/2017 3 Contents ABSTRACT ................................................................................................................................. 3 CHAPTER 1 INTRODUCTION .......................................................................................................................... 6 1.1 Context ........................................................................................................................ 6 1.2 Problem Statement ....................................................................................................... 7 1.2.1 Concept Drift ............................................................................................................ 7 1.2.2 Machine Learning and Data Mining ............................................................................ 9 1.2.3 Intrusion Detection Systems .................................................................................... 10 1.3 Aims & Objectives ...................................................................................................... 11 1.4 Dissertation Layout ..................................................................................................... 11 CHAPTER 2 LITERATURE REVIEW ............................................................................................................... 12 2.1 An overview of data mining techniques................................................................................ 12 2.1.1 Classification.............................................................................................................. 12 2.1.2 Regression ................................................................................................................ 14 2.1.3 Association ................................................................................................................ 15 2.1.4 Clustering .................................................................................................................. 16 2.2 Data Mining Techniques in Intrusion Detection Systems ....................................................... 17 2.2.1 k-Nearest Neighbor Algorithm ...................................................................................... 17 2.2.2 Naïve Bayes Algorithm................................................................................................ 18 2.2.3 Decision Tree Algorithm .............................................................................................. 19 2.3 Developed Methods for Mitigating Concept Drift in Intrusion Detection Systems ...................... 20 2.3.1 MineClass Algorithm ................................................................................................... 20 2.3.2 Early Drift Detection Method ........................................................................................ 21 2.3.3 DWM Algorithm .......................................................................................................... 22 2.3.4 FAE Algorithm ............................................................................................................ 23 CHAPTER 3 METHODOLOGY ....................................................................................................................... 25 3.1 Software Setup ................................................................................................................. 25 3.1.1 Waikato Environment for Knowledge Analysis (WEKA) .................................................. 25 3.1.2 Massive Online Analysis (MOA) ................................................................................... 26 3.1.3 Eclipse ...................................................................................................................... 26 3.1.4 Operating System ....................................................................................................... 27 3.2 Experimental Dataset ........................................................................................................ 27 3.3 Algorithm Implementation .................................................................................................. 28 4 3.3.1 Objective ................................................................................................................... 28 3.3.2 The New Method Hypothesis ....................................................................................... 29 3.3.3 Classifier Behavior...................................................................................................... 30 3.4 Data Analysis ................................................................................................................... 30 3.4.1 Output Information ...................................................................................................... 30 3.4.2 Testing Method .......................................................................................................... 31 3.4.3 Evaluation Methods .................................................................................................... 31 CHAPTER 4 RESULTS AND ANALYSIS ......................................................................................................... 33 4.1 DATASET PRESENTATION .............................................................................................. 33 4.2 RESULTS IN WEKA ENVIRONMENT ................................................................................. 34 4.1.1 Hoeffding Tree ........................................................................................................... 35 4.1.2 Naïve Bayes .............................................................................................................. 37 4.1.3 J48............................................................................................................................ 39 4.1.3 lazy IBk ..................................................................................................................... 41 4.3 RESULTS IN MOA ENVIRONMENT ................................................................................... 43 CHAPTER 5 CONCLUSIONS ......................................................................................................................... 45 REFERENCES ........................................................................................................................... 46 APPENDIX ................................................................................................................................ 48 5 CHAPTER 1 INTRODUCTION This first chapter is the one where the introduction on data mining, machine learning and intrusion detection system takes place. Along with that, the problem of concept drift is presented and analyzed. A preview of the methodology used is also explained. 1.1 Context When discussing about technology, the words machine learning continuously comes up. Those two terms are paired together because with technology evolving and becoming such a big part of people’s everyday lives, the need of automating several processes is a critical issue. Numerous techniques have been developed to automate both systems and communications but also new problems appear along with that. Securing sensitive data is also related to automated processes. There are different types of systems that are built for detecting and preventing attacks that aim sensitive data. That kind of systems are called intrusion detection systems (IDS). Intrusion detection systems use machine learning algorithms to become smarter and faster when it comes to detecting a malicious movement. Data that are continuously received over a long period of time are known as data streams. As those sequential data are arriving they are processed and analyzed by the algorithms so that useful information can be acquired [1]. The data streams can be processed either offline or online. In offline training the whole training dataset is available when the model is training. When the model is trained and the training process has finished, it can be used for predictions. In online training the data keep arriving and the processing is sequential. The training dataset is not fully complete during the training of the model. The model keeps training as new training data keep coming up. The trained model is used for predictions