Inductive Intrusion Detection in Flow-Based Network Data Using One-Class Support Vector Machines

Philipp Winter

Diploma thesis (Diplomarbeit) submitted to the University of Applied Sciences master's programme Secure Information Systems (Sichere Informationssysteme) in Hagenberg, July 2010

© Copyright 2010 Philipp Winter
All Rights Reserved

Declaration

I hereby declare under oath that I wrote this thesis independently and without outside help, that I used no sources or aids other than those indicated, and that I marked as such all passages taken from other sources.

Hagenberg, 14 July 2010
Philipp Winter

Contents

Declaration
Preface
Kurzfassung (German abstract)
Abstract

1 Introduction
  1.1 Motivation
  1.2 Hypothesis
  1.3 Related Work
  1.4 Thesis Outline

2 Analysed Network Data
  2.1 Overview
  2.2 Network Data Sources
    2.2.1 Requirements
    2.2.2 Protocol-Based
    2.2.3 Packet-Based
    2.2.4 Flow-Based
    2.2.5 Comparison
  2.3 Flow-Based Network Data
    2.3.1 Protocols
    2.3.2 Definition
    2.3.3 Technical Details

3 Machine Learning
  3.1 Overview
  3.2 Introduction
    3.2.1 Definition
    3.2.2 Supervised Learning
    3.2.3 Unsupervised Learning
    3.2.4 Training Data
  3.3 Dimensionality
    3.3.1 Feature Selection
    3.3.2 Feature Extraction
  3.4 Support Vector Machines
    3.4.1 Operating Mode
    3.4.2 One-Class Support Vector Machines
  3.5 Performance Evaluation
    3.5.1 Performance Measures
    3.5.2 ROC-Curves
    3.5.3 Cross Validation
    3.5.4 Further Criteria

4 Experimental Results
  4.1 Overview
  4.2 Proposed Approach
    4.2.1 System Design
    4.2.2 Discussion
  4.3 The Data Sets
    4.3.1 Training Data
    4.3.2 Testing Data
    4.3.3 Data Set Allocation
  4.4 Model and Feature Selection
    4.4.1 Approach
    4.4.2 Feature Optimisation
    4.4.3 Parameter Optimisation
    4.4.4 Joint Algorithmic Optimisation
  4.5 Evaluation
    4.5.1 Model Testing
    4.5.2 Data Set Limitations
    4.5.3 Discussion

5 Conclusions
  5.1 Thesis Summary
  5.2 Interpretation
  5.3 Future Work

A Content of the enclosed CD-ROM
  A.1 Diploma Thesis
  A.2 Code
  A.3 Data Sets for Coarse Grained Optimisation
  A.4 Data Sets for Fine Grained Optimisation
  A.5 Results

B Code
  B.1 nfdump Patch
  B.2 svm-scale Patch
  B.3 Feature and Model Optimisation
    B.3.1 One-Class SVM Wrapper
    B.3.2 Cross Validation
    B.3.3 Feature Subset Generator
    B.3.4 Coarse Grained Feature and Model Selection
    B.3.5 Fine Grained Model Selection
    B.3.6 Utility Functions
Acronyms
Bibliography

List of Figures

2.1 Bidirectional communication between two computers on a network. The network traffic monitored by the NetFlow router results in two unidirectional flow records.

2.2 A common scenario for the use of NetFlow. Several NetFlow probes send their records to a central collector. The collector stores the flows and provides an analysis interface for network operators who are responsible for network accounting and monitoring.

3.1 The concept behind supervised machine learning. A training set is used by a machine learning algorithm to build a hypothesis which is a generalisation of the training data. This hypothesis is then used to classify as yet unknown data.

3.2 A linear two-class classification problem. There are two different data distributions, namely male and female points. These two data sets can be linearly separated in R² as illustrated by the black line.

3.3 A linear regression problem. The distribution of all the points of the training set can be approximated by a line in R² as illustrated by the red line.

3.4 Scatter plot which explains the idea of unsupervised learning. Diagram (a) shows unstructured data points. Diagram (b) shows the same data points after a clustering algorithm assigned all points to one of the two clusters.

3.5 A hierarchical clustering algorithm creates three clusters out of the four data points. First, two points together form a cluster and finally the two clusters form another final cluster.

3.6 A dendrogram which represents four iterations of a hierarchical clustering algorithm. The dendrogram can be seen as a binary tree with the data points as its leaves.

3.7 A hypothesis which fits the training data very well. In fact, there are some minor training errors but the generalisation ability is adequate.

3.8 Two hypotheses which overfit and underfit, respectively. Diagram (a) illustrates an overfitting hypothesis. The training errors are minimised whereas the generalisation ability can be considered poor. Diagram (b) features an underfitting hypothesis. Training errors are high and the generalisation ability will also be far from good.

3.9 Sequential forward selection which resulted in the selection of three features, namely BAD. After each iteration, the feature yielding the best intermediate error rate is added to the list of features.

3.10 Sequential backward elimination which resulted in the selection of a single feature, namely B. After each iteration, an attribute is eliminated until the algorithm ended up with B.

3.11 The basic functionality of genetic algorithms as described in [88, p. 1]. An initial random population is created out of the input features. Then, mutation and/or crossover is performed as long as the cost function does not decide that the current feature set is “good enough”.

3.12 Optimal (a) and poorly (b) separating hyperplanes of an SVM. The poorly separating hyperplane offers bad generalisation ability whereas the optimal separating hyperplane perfectly divides both data sets by maximising the margin of the hyperplane.

3.13 Two data sets which contain linearly inseparable vectors. Linear separation would incur many training errors. Nonlinear separation, as realised by the black curve, permits the division of both data sets.

3.14 ROC-curve of the kind often found in evaluations of NIDSs. The curve illustrates how the detection rate and the false alarm rate change when a parameter of the machine learning model is modified. A good tradeoff between the false alarm rate and the detection rate seems to be at the point with the false alarm rate being 8 and the detection rate being 75.

4.1 High-level view of the proposed approach for inductive network intrusion detection. An incoming flow is first preprocessed. Then, two independent SVMs are used to first detect malicious flows and then, if the flow turns out to be malicious, to perform network traffic classification.

4.2 ER-model of the training data set. The model shows how the tables are interconnected and what information they hold. The actual flows are stored in the table “flows”. The remaining tables provide the correlation to alerts and alert clusters.

4.3 Distribution of the services in the training set. By far the most flows have been collected for the SSH protocol. The remaining flows belong to auth/ident, HTTP or IRC.

4.4 Attack types in the training set. Almost all malicious flows are part of automated attacks. Only 6 attacks are manual. Furthermore, far more attacks failed than succeeded.

4.5 Relationship between the time necessary to train a training set and the size of the respective training set. For each training set size three randomly sampled sets were created and trained. The time necessary to train the respective sets is plotted. Although the training time seems to increase with additional set size, it often varies drastically.

4.6 Setup for the creation of benign network flows. A clean host inside a virtual machine creates benign network traffic heading towards the Internet. All this attack-free network traffic is captured and transformed to flow format.

4.7 The two created data sets are divided into a validation

