MACHINE LEARNING ALGORITHMS FOR THE ANALYSIS AND DETECTION OF NETWORK ATTACKS

by Maryam Mousaarab Najafabadi

A Dissertation Submitted to the Faculty of The College of Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Florida Atlantic University Boca Raton, FL August 2017 Copyright 2017 by Maryam Mousaarab Najafabadi


ACKNOWLEDGEMENTS

First and foremost, I would like to acknowledge my graduate advisor and mentor, Dr. Taghi M. Khoshgoftaar. His unwavering support, patience, and knowledge have helped me to become the researcher I am today. I would also like to thank Dr. Bassem Alhalabi, Dr. Xingquan Zhu, and Dr. Hanqi Zhuang, for being on my PhD supervisory committee. I want to thank the members of the FAU security analytics group, Mr. Richard Zuech and especially Mr. Chad Calvert and Mr. Clifford Kemp, for their data collection efforts. I would like to acknowledge the NSF I/UCRC, which provided a framework for interaction between FAU faculty and industry, as well as the LexisNexis company for supporting my research assistantship position during my PhD studies. I want to thank Mr. Richard Bauder, from the FAU Data Mining and Machine Learning Laboratory, for his constructive reviews of this dissertation. My thanks also go to the other members of the FAU Data Mining and Machine Learning Laboratory at Florida Atlantic University for their continued feedback and support. I also gratefully acknowledge partial support by the National Science Foundation, under grant number CNS-1427536. Any opinions, findings, and conclusions or recommendations expressed in this dissertation are those of the author and do not necessarily reflect the views of the National Science Foundation.

ABSTRACT

Author: Maryam Mousaarab Najafabadi
Title: Machine learning algorithms for the analysis and detection of network attacks
Institution: Florida Atlantic University
Dissertation Advisor: Dr. Taghi M. Khoshgoftaar
Degree: Doctor of Philosophy
Year: 2017

The Internet and computer networks have become an important part of our organizations and everyday life. With the increase in our dependence on computers and communication networks, malicious activities have become increasingly prevalent. Network attacks are an important problem in today's communication environments. The network traffic must be monitored and analyzed to detect malicious activities and attacks to ensure reliable functionality of the networks and security of users' information. Recently, machine learning techniques have been applied toward the detection of network attacks. Machine learning models are able to extract similarities and patterns in the network traffic. Unlike signature-based methods, there is no need for manual analyses to extract attack patterns. Applying machine learning algorithms can automatically build predictive models for the detection of network attacks. This dissertation reports an empirical analysis of the usage of machine learning methods for the detection of network attacks. For this purpose, we study the detection of three common attacks in computer networks: SSH brute force, Man In The Middle (MITM) and application layer Distributed Denial of Service (DDoS) attacks. Using outdated and non-representative benchmark data, such as the DARPA dataset, in the intrusion detection domain has caused a practical gap between building detection models and their actual deployment in a real computer network. To alleviate this limitation, we collect representative network data from a real production network for each attack type. Our analysis of each attack includes a detailed study of the usage of machine learning methods for its detection. This includes the motivation behind the proposed machine learning based detection approach, the data collection process, feature engineering, building predictive models and evaluating their performance. We also investigate the application of feature selection in building detection models for network attacks. Overall, this dissertation presents a thorough analysis on how machine learning techniques can be used to detect network attacks. We not only study a broad range of network attacks, but also study the application of different machine learning methods including classification, anomaly detection and feature selection for their detection at the host level and the network level.

To my beloved family, my parents, Mostafa and Manije, my sisters, Anna and Elahe, and my brother, Ehsan. You are the greatest treasure in my life. None of my success would be possible without your endless love and support.

List of Tables ...... xi

List of Figures ...... xii

1 Introduction ...... 1
   1.1 Motivation ...... 1
      1.1.1 Intrusion Detection ...... 1
      1.1.2 Machine Learning for the Detection of Network Attacks ...... 2
      1.1.3 Lack of Proper Intrusion Detection Public Data ...... 3
   1.2 Contributions ...... 6
   1.3 Dissertation Structure ...... 7

2 Methodology ...... 8

3 SSH Brute Force Attacks ...... 15
   3.1 Background and Motivation ...... 15
   3.2 Related Work ...... 19
   3.3 Data Collection ...... 20
   3.4 Detection of SSH Brute Force Attacks ...... 23
   3.5 Experimental Results and Analysis ...... 28
   3.6 Chapter Summary ...... 29

4 Man In The Middle Attacks ...... 31
   4.1 Background and Motivation ...... 31
   4.2 Related Work ...... 35
   4.3 Data Collection ...... 38
   4.4 Detection of Man In The Middle Traffic ...... 42
      4.4.1 General Approach ...... 42
      4.4.2 Selecting a Subset of Packet Header Fields ...... 46
   4.5 Experimental Results and Analysis ...... 49
   4.6 Chapter Summary ...... 54

5 Application Layer DDoS Attacks ...... 56
   5.1 Background and Motivation ...... 56
   5.2 Related Work ...... 63
   5.3 Data Collection ...... 65
   5.4 User Behavior Anomaly Detection ...... 69
      5.4.1 Defining User Behavior ...... 71
      5.4.2 Detecting Anomalous User Behaviors ...... 74
   5.5 Experimental Results and Analysis ...... 78
      5.5.1 Results for PCA-subspace Anomaly Detection ...... 78
      5.5.2 Results for One-class SVM Anomaly Detection ...... 80
      5.5.3 Comparison analysis ...... 82
   5.6 Chapter Summary ...... 84

6 Feature Selection ...... 86
   6.1 Background and Motivation ...... 86
   6.2 Related Work ...... 88
   6.3 Methodology ...... 91
      6.3.1 Evaluating Feature Selection Methods for Detection of Network Attacks ...... 91
      6.3.2 Ensemble of Feature Selection Methods for Analyzing Important Features ...... 95
   6.4 Experimental Results and Analysis ...... 100
      6.4.1 Results for Evaluating Feature Selection Methods for Detection of Network Attacks ...... 100
      6.4.2 Results for Ensemble of Feature Selection Methods for Analyzing Important Features ...... 107
   6.5 Chapter Summary ...... 109

7 Conclusion and Future Works ...... 112
   7.1 Conclusions ...... 113
      7.1.1 Data Collection ...... 113
      7.1.2 Machine Learning Methods for The Detection of Network Attacks ...... 115
      7.1.3 Future Work ...... 118

Bibliography ...... 119

LIST OF TABLES

3.1 Description of features extracted from aggregated data ...... 24
3.2 Cross validation results ...... 29

4.1 Collected data information ...... 40
4.2 Selected header fields for each attack and the whole data ...... 51
4.3 Selected header fields for each attack and the whole data when checksum and length-related fields are removed ...... 51
4.4 Performance results for different subsets of packet header fields and different datasets ...... 52

5.1 Attack variants ...... 70
5.2 Number of instances in each dataset ...... 78
5.3 AUC and number of instances for each attack type ...... 82

6.1 Details of the dataset ...... 92
6.2 Description of features extracted from sessions ...... 98
6.3 AUC values for different combinations of feature selection and classification methods ...... 103
6.4 ANOVA Results ...... 104
6.5 Selected features by the ensemble of rankers ...... 107
6.6 Cross validation results on the whole feature set ...... 108
6.7 Cross validation results on the selected feature set with 7 features ...... 108
6.8 ANOVA Results ...... 109

LIST OF FIGURES

2.1 General methodology ...... 8
2.2 Network topology ...... 11

4.1 Man In The Middle (MITM) Attack ...... 32
4.2 Half-duplex Man In The Middle ...... 43
4.3 Full-duplex Man In The Middle ...... 43
4.4 Plotted TPR results for Table 4.4 ...... 53

5.1 Distributed Denial of Service attack ...... 57
5.2 Fraction of total variance captured by each principal component ...... 79
5.3 Boxplot of error values for normal training instances, test training instances and each of the nine attack types instances ...... 79
5.4 Boxplot of SVM decision values for normal training instances, test training instances and each of the nine attack types instances ...... 81
5.5 ROC curve for PCA-subspace method ...... 83
5.6 ROC curve for one-class SVM ...... 83

6.1 Tukey’s HSD results for feature selection ...... 104

CHAPTER 1 INTRODUCTION

1.1 MOTIVATION

1.1.1 Intrusion Detection

The rapidly growing progress in Internet-based technology has brought tremendous benefits to our society. Communications and other services that the Internet provides have transformed our lives in many ways. The Internet has opened up a whole new world of possibilities for accessing information. Students and researchers no longer need to go to libraries to collect the information they need. Nowadays, the information is just a few clicks away in one's web browser. Social networking sites have eliminated geographic distance and made it easier to stay in contact with family and friends. Online services, such as online shopping, online banking, and online learning, have made all these activities more convenient.

While the Internet has made our lives much more convenient, its vulnerabilities and the amount of information communicated over it create opportunities for adversaries to perform malicious activities within its infrastructure. Any host connected to the public Internet, or even a private network, is under constant threat from potential attacks. Many threats are created every day by individuals and organizations that attack computer networks to steal private information and data. This information can be very critical and sensitive, such as social security numbers or bank account information. This has created the need for security technologies to secure users' information and provide reliable computer network environments. Network security has become a very important factor for companies and organizations to consider. In

that regard, intrusion detection plays an important role in the detection of attacks and in securing computer networks. Intrusion Detection Systems (IDS) monitor and analyze network systems to detect malicious activities. Even though users benefit from the use of IDS technology, more is needed to detect better obfuscated or more complex attack patterns.

1.1.2 Machine Learning for the Detection of Network Attacks

Machine learning is a subfield of computer science which uses pattern recognition and artificial intelligence methods to group and extract behaviors and entities from data. The patterns and relationships learned by machine learning algorithms can then be used to perform prediction tasks on new data. With today's technology, machine learning algorithms touch our everyday life by being used in a wide range of applications. Examples of common domains in which machine learning algorithms are extensively used include product recommendation systems, such as the ones used by Amazon [73] and Netflix [24]; natural language processing [27], [79]; spam detection [65]; image recognition [105]; and fraud detection [21].

The most common operational network intrusion detection systems are signature-based systems [67]. These systems consist of a database of attack signatures. Human experts produce the attack signatures by manually analyzing the attack data. The monitored network traffic is matched against this database to detect malicious activities. Producing attack signatures is a time consuming and manually intensive task. Recently, machine learning techniques have been applied to build predictive models for the detection of network attacks [127], [107], [43]. Unlike signature based methods, which need manual analysis by human experts to extract attack patterns, machine learning algorithms are able to automatically extract similarities and patterns in the network data. With more data being produced than the human brain has the capacity to monitor, machine learning analysis provides results that even an army of

analyst experts would be unable to accomplish. With machine learning affecting many aspects of our everyday life, it is necessary to study the usage of its interdisciplinary capabilities in the detection of computer network attacks. Machine learning algorithms can be applied to network data to extract patterns and similarities which distinguish between normal and attack instances. These trained patterns can be used to build intrusion detection systems for the detection of network attacks. With the bulk of the work being carried out by machine learning, cybersecurity experts can become more productive by focusing on the analytical results from machine learning models in order to gain more insight about current and future threats.

1.1.3 Lack of Proper Intrusion Detection Public Data

An important component for developing an effective intrusion detection system using machine learning algorithms is the presence of a good IDS dataset. Such a dataset is used to build and evaluate intrusion detection models. Machine learning algorithms use the data directly in order to build predictive models. The more informative the data is, the better the built models generalize to new data. It is very important to thoroughly evaluate the attack detection models on a mixture of attacks and normal behaviors. Evaluating their detection capabilities and false alarms on normal data can be useful for determining and eliminating weaknesses.

A low false alarm rate is a very important factor in the evaluation of an intrusion detection system. False alarms arise when normal traffic is misclassified as attack traffic. Evaluation results on a benchmark dataset that is not representative of normal network traffic do not provide a realistic reflection of the model's performance on real network traffic. Similarly, a dataset that does not contain a good variety of attack behaviors does not provide reliable detection evaluation results. The lack of current IDS evaluation datasets that contain real-world representative

traffic data, and are collected from modern complex networks, is an important problem faced by the intrusion detection research community [131], [103]. This has caused a practical gap between the research done in this domain and the actual deployment of the proposed detection mechanisms in real computer networks.

One of the publicly available and widely used datasets for intrusion detection is the DARPA dataset [4], [130], [63], [64]. The DARPA dataset was created by collecting multiple weeks of data from a simulated air force network. The data is synthesized and does not represent real-world network traffic. The claim is that the DARPA dataset is similar to the traffic observed during several months of sampling from a number of air force bases; however, the statistics for the real traffic and the measures used to simulate the DARPA dataset are not provided [76]. As criticized in [12], no rational explanation is given for the DARPA dataset which indicates that a system tested on this dataset would have the same false alarm behavior when exposed to real-world data. Similar criticisms have been issued regarding the attack data. The attack data in the DARPA dataset is produced via scripts. No attempt has been made to ensure that the synthetic attacks are realistically distributed in the background data. These issues make the DARPA dataset ill-suited for the evaluation of systems that need to work in real-world environments. The DARPA dataset is outdated and no longer adequate for current studies.

The KDD 99 dataset [9] is another publicly available IDS benchmark dataset, which was created by processing the tcpdump portion of the DARPA dataset. As KDD 99 is derived from DARPA, it also inherits the shortcomings of the DARPA dataset.

The Kyoto 2006+ [108] and ISCX [104] datasets have also been introduced for studying intrusion detection. The Kyoto dataset is a collection of network traffic over approximately 2.5 years. The attack data is mainly obtained from honeypots. All the traffic from the honeypots is labeled as attack data. On the positive side, by

using honeypots, there is no need to manually label the attack data, while on the negative side, honeypots give a limited view of the network traffic, i.e. experts can only observe the attacks that target the honeypots and no other devices on the network. For generating normal traffic, the authors deployed a server in the same network as the honeypots. The server had only two main services, for email and DNS. This makes the normal traffic very limited. Not having quality, representative normal traffic leads to an unrealistic analysis of the false alarms of an intrusion detection system under study.

The ISCX dataset is a relatively recent dataset that has been used in some studies [125], [129], [23]. In the ISCX dataset, the data is produced by executing profiles that include descriptions of the intrusions and the abstract distribution models for the applications, protocols, or lower level network entities. A profile can be defined explicitly and carried out by humans or autonomous agents, or it can be defined with stochastic distributions. The concept is to generate more realistic data by simulating realistic profiles. However, even a very detailed simulation based on realistic profiles does not guarantee that the produced network data accurately mimics behaviors observed in real-world network traffic. The ISCX dataset is not as realistic as real network traffic.

Most of the publicly available datasets for intrusion detection do not provide an adequate representation of real-world network traffic. To overcome these shortcomings, there is a need to collect representative intrusion detection data to develop and analyze detection mechanisms for computer network attacks. Such a dataset should provide adequate coverage of the networking scenarios that happen in real-world networks. In addition to representative normal data, it should also contain a proper diversity of different types of attacks. Such a dataset is necessary to provide a realistic analysis of a proposed approach for the detection of network attacks.

1.2 CONTRIBUTIONS

The contribution of this work lies in the application of machine learning techniques for the detection of computer network attacks. Applying machine learning algorithms in any application domain includes a number of steps: integrating and pre-processing the data, training a machine learning model, and making predictions and decisions based on what is learned by the trained model [124]. This work provides a comprehensive study of all these steps for the detection of three common computer network attacks. We not only study different types of attacks, but we also utilize different machine learning methods including classification, anomaly detection and feature selection mechanisms for building detection models at the host level and the network level. More specifically, our contributions involve the following:

• We study the application of machine learning for the detection of SSH brute force, Man In The Middle and application layer distributed denial of service attacks.

• We provide a thorough analysis of the machine learning approaches for the detection of each attack. This includes the data collection, feature extraction and the machine learning analysis part. We provide the motivation behind the selected machine learning approach and its new contributions compared to other studies available in the literature.

• For each attack, we collect representative data, whether it is network traffic or server logs, from a live campus network. Our data collection for each attack is well-studied in order to eliminate the limitations which exist in the current public datasets.

• For each attack, experiments are conducted on the collected data to validate our detection approach.

• Our analysis in this work represents an extended analysis of how different machine learning methods, including classification and anomaly detection algorithms, can be used to build both host and network based detection models for network attacks.

• We study the application of feature selection for the detection of network attacks, building a more efficient intrusion detection model by reducing the number of features. In addition, we show how we can gain additional insight about the studied attack from the selection of the most important features leading to its detection.

1.3 DISSERTATION STRUCTURE

This dissertation is organized as follows. In Chapter 2, we introduce our general methodology to detect computer network attacks by applying machine learning algorithms. The next three chapters (Chapter 3, Chapter 4 and Chapter 5) discuss the detection of SSH brute force, Man In The Middle and application layer DDoS attacks, respectively. Each chapter covers the data collection, the machine learning based detection methodology and finally the experimental results. Chapter 6 provides our study of the application of feature selection in the detection of network attacks. Lastly, we present our conclusions and future work in Chapter 7.

CHAPTER 2 METHODOLOGY

This chapter describes the general methodology utilized in this work to detect network attacks by applying machine learning methods. There are some common steps that need to be followed when applying machine learning algorithms in any application domain. These steps include data collection and preprocessing, feature extraction, building the predictive model and evaluating the prediction results. This chapter explains our approach to following these steps for the specific application of computer network attack detection. The following chapters provide in-depth elaboration of the specifics of each step when applying machine learning methods for the detection of each of the studied attacks. Figure 2.1 shows the general methodology followed in this work for the detection of network attacks.

Figure 2.1: General methodology

The first step is to define the problem. The general problem throughout this work is the detection of network attacks. The problem definition becomes more specific for each attack type and includes an expanded definition of the attack and its behavior. The definition of an attack needs to answer the following questions:

• What is the attacker’s goal?

• What method/methods can be used to launch the attack?

• What are the available tools or scripts to run the attack?

• What detection methods are available in the literature to detect the attack?

Detailed answers to the above questions are necessary in order to achieve a good understanding of the behavior of the attack under study. The related work present in the literature, network security reports and blog posts are good resources to gain more knowledge about the attack. In-depth understanding of the studied attack specifies its behavior and its impact on the networking data, such as network traffic and log files. This is very important in order to decide on the source of data which needs to be collected, as well as the features and machine learning methods which need to be used in the second and third steps, as described in Figure 2.1.

After the attack is studied, the second step is collecting the data. This involves deciding what data source is required to apply the machine learning analytics. The collected data directly affects the nature of the features that can be extracted from it, and consequently the detection mechanism. Therefore, the data collection step cannot be considered completely independent of the next step, which involves the feature extraction and building the machine learning predictive model. One of the first aspects to be considered for the data collection is whether the detection mechanism is a host-based or network-based method. This affects whether the data needs to be collected at the network level (e.g. network traffic) or at the host level (e.g. log files).

Network attacks can be detected by monitoring and analyzing related data at the network level or at the host level. In a host-based detection, internal software is installed on the host to monitor traffic that originates from and arrives at that particular host. The host-based attack detection software also has access to internal logs such as users' login activities, running processes, etc. In network-based detection, all the data that passes through the network is monitored and analyzed. To do so, the detection element of the network must be positioned at entry and exit points from the network to the outside world. In comparison with a host-based detection, which is done on a particular host, a network-based detection is more scalable. It is not dependent on the host's operating system. In addition to providing protection for hosts which do not have their own protection mechanism, network-based detection is the only possibility in networks where there is no direct access to particular hosts.

Deciding on a host-based or network-based detection mechanism specifies the environment and tools needed to collect the data. We use network-based detection for SSH brute force and Man In The Middle attacks. Conversely, a host-based detection is proposed for the application layer DDoS attacks, for which the data includes web server logs collected directly from the web server. The specific choice of a network-based or host-based detection for an attack depends on the attack's behavior and how it affects the network traffic and the host generated log files. This implies the importance of gaining an in-depth understanding of the attacker behavior in the first step of our analysis in order to move on to the next steps. It might be necessary to collect different traffic and log file samples through penetration testing, using an attack tool, in order to analyze the attacker behavior and its impact on data from different sources.

We use a campus network for our data collection and experimental analysis of the detection of each of the studied attack types.
The network exists in a live production environment, which consists of hosts from classrooms, labs, and offices. The network architecture is shown in Figure 2.2. There are multiple switches, routers and servers to support the network usage. A Cisco Adaptive Security Appliance (ASA) is utilized as the firewall for the whole network. This ASA sits between two HP network switches. A management machine is also connected to the ASA to provide DHCP services. For all our experiments, we utilize subnet XXX.XXX.10.0 associated with switch 1 in our network.

Figure 2.2: Network topology

The network-level capture for the brute force and Man In The Middle attacks involved collecting the traffic via port monitoring, with the switch forwarding all traffic to a Security Onion (SO) server, which collects full-duplex traffic. The collected traffic is then filtered using Wireshark [95] and tcpdump and then converted to comma-separated files. The capture at the host level for the application layer DDoS attacks is done by collecting web server logs from an Apache server connected to switch 1.

It is important for labeled network data, which is used to train an attack detection machine learning model, to properly represent real-world network traffic. Representative data includes the different networking and communication scenarios seen in real communications over computer networks. Different network scenarios include data transfer, interactive traffic (interactive login and interactive commands), streaming and other scenarios seen in communication networks. Non-representative or inadequately representative data is data that does not include a sufficient number of the different network scenarios seen in the real world. Representative traffic should also include the traffic that is more likely to be mistaken as attack by predictive models. This is a kind of normal traffic with patterns that are similar to the patterns generated by the attack traffic. The models built on such training data learn the patterns for attack data along with the patterns for different scenarios of normal data, which makes them better at discriminating between attack instances and normal instances.

A representative collection of the traffic makes it possible for practitioners to compare the attack data with normal data in order to introduce discriminative features which distinguish these two from one another. If the collected traffic is not representative and does not contain various types of normal network traffic, even a simple feature set might appear to work well for discriminating normal and attack traffic. However, using such a feature set in building the predictive models can lead to poor performance when new types of traffic show up at detection time. The collected data should contain a representative amount of normal traffic as well as the normal traffic that might be similar to the attack traffic. This makes it possible to do more in-depth comparisons of normal and attack data and define the features in a way that can be truly discriminative for recognizing attacks among normal traffic.

After collecting representative data for the detection of a specific attack type, the next step is to use machine learning to build a method for its detection. The detection of network attacks can be considered anomaly detection. There are three different setups for applying an anomaly detection algorithm [51] in the machine learning area.

• Supervised anomaly detection describes the setup where both training data and test data are fully labeled. A machine learning classifier, such as a Support Vector Machine (SVM) or a decision tree, can be used to build a model on the training data and evaluate it on the test data afterwards. In most cases, the anomalies might not be known in advance, and assuming that the data is fully labeled is not practical. In future chapters, whenever referring to a fully labeled setup, we use "classification" instead of "anomaly detection." We reserve the term anomaly detection for cases where the anomalies are not known in advance and, therefore, the training data cannot be fully labeled.

• Semi-supervised anomaly detection also uses training and test data; however, the training data only contains normal instances. The basic idea is to build a model of the patterns seen in the normal training data. The test instances that deviate substantially from this model are considered anomalies. This idea is also known as "one-class" learning, which we will see in Chapter 5; a brief illustrative sketch follows this list.

• Unsupervised anomaly detection does not require any labels and there is no distinction between a training and a test dataset. Typically, distances and estimations are used to decide which instance is normal and which instance is an anomaly. The methods used in this setup include nearest-neighbor methods, clustering and statistical methods.
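As an illustration of the semi-supervised (one-class) setup described above, the following sketch trains a one-class SVM on normal instances only and flags deviating test instances as anomalies. This is a hedged example only, not the configuration used in Chapter 5; the synthetic data, the use of scikit-learn, and the parameter values are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical inputs: X_normal holds only normal training instances,
# X_test holds unlabeled test instances (normal and possibly attack).
rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 10))
X_test = np.vstack([rng.normal(0.0, 1.0, size=(50, 10)),
                    rng.normal(4.0, 1.0, size=(10, 10))])  # shifted rows mimic anomalies

# Scale features using statistics computed from the normal data only.
scaler = StandardScaler().fit(X_normal)

# Train a one-class model on normal data; nu (an upper bound on the fraction
# of training points treated as outliers) is an illustrative value here.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(scaler.transform(X_normal))

# Negative decision values indicate deviation from the learned model of
# normal behavior, i.e. likely anomalies.
scores = model.decision_function(scaler.transform(X_test))
print((scores < 0).sum(), "test instances flagged as anomalous")
```

In the fully labeled (classification) setup, the same pipeline would instead train a standard classifier, such as an SVM or a decision tree, on both normal and attack instances.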

Since we want to build a model which can be used later to label new data (separate training and test data), we used the first and second setups explained above throughout this work. We use classification for the detection of SSH brute force attacks and a semi-supervised anomaly detection approach for the detection of application layer DDoS attacks. Our detection method for Man In The Middle attacks utilizes a feature selection approach from the machine learning area for efficient detection of these attacks. One important step in building models by applying machine learning methods is to extract discriminative features for the prediction task. Our analysis of each attack explains how we used domain knowledge about the attack behavior, as

well as analysis of the collected representative data, to extract discriminative features for its detection.

After building a model, we evaluate its performance on the test data. The evaluation metrics used throughout this work are True Positive Rate (TPR), False Positive Rate (FPR) and Area Under receiver operating characteristic Curve (AUC). In the application of network attack detection, TPR is the detection rate. It is calculated as the number of attack instances which are correctly predicted as attack divided by the number of all the attack instances. On the other hand, FPR is the rate of normal data which is wrongly detected as attack by the classifier. It is calculated by dividing the number of normal instances wrongly predicted as attack by the number of all the normal instances. Obviously, a good intrusion detection model should have a high TPR and a low FPR.

We also used the effective and commonly used Area under the Receiver Operating Characteristic Curve (AUC) [28] as the classification performance metric in our experiments. The ROC curve plots the True Positive Rate vs. the False Positive Rate as the classifier decision threshold is varied, and the area under this curve is then used as the performance across all decision thresholds. AUC demonstrates the trade-off between TPR and FPR, where higher AUC values indicate a high TPR and low FPR, which is preferable in the current application, i.e. network attack detection. Using AUC in network attack detection is a good choice because IDS datasets are typically imbalanced (i.e. the number of normal instances is much larger than the number of attack instances).
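To make these metrics concrete, the short sketch below computes TPR and FPR at a single decision threshold and AUC across all thresholds. It is a hedged illustration only (our experiments were run with standard toolkits rather than this snippet); the labels and scores are made up, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (1 = Attack, 0 = Normal) and classifier scores.
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.3, 0.4, 0.65, 0.2, 0.55, 0.7, 0.1, 0.35])
y_pred = (y_score >= 0.5).astype(int)  # one particular decision threshold

tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)  # detection rate: attacks correctly predicted as Attack
fpr = fp / (fp + tn)  # normal instances wrongly predicted as Attack

# AUC summarizes the TPR/FPR trade-off over all possible thresholds.
auc = roc_auc_score(y_true, y_score)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, AUC={auc:.2f}")
```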

CHAPTER 3 SSH BRUTE FORCE ATTACKS

3.1 BACKGROUND AND MOTIVATION

The brute force attack is one of the most prevalent attacks in computer networks. Different studies [20], [7] have reported this attack as one of the steadiest attacks over time. In a brute force attack, the attacker tries to log into a user's account. He continues trying different passwords on the victim's machine to reveal the login password. Normally attackers use automated software which generates different combinations of passwords to try on the victim's machine. Unfortunately, human chosen passwords are inherently weak because they are selected from a limited domain. It is well within the realm of possibility for an attacker to find the correct password by trying all different possible passwords.

A machine compromised by a brute force attack can cause serious damage by joining botnets, distributing sensitive information, participating in distributed attacks, etc. Brute force attacks are still a prevalent threat for computer networks. The Cisco 2014 annual security report [13] explains that "Although brute-force login attempts are by no means a new tactic for cybercriminals, their use increased threefold in the first half of 2013." In a survey reported by the Ponemon 2014 SSH Security Vulnerability Report [14], 51% of the respondent companies had been compromised by SSH brute force attacks in the last 14 months.

Weak passwords, user names used as passwords and not changing the default configured password are all reasons that make attackers interested in launching brute force attacks to find the correct password. Also, there are some services which provide

users with default passwords, such as WiFi routers. This again gives attackers a good opportunity to apply brute force attacks. Brute force password guessing attacks can be launched against telnet, SSH and FTP servers [96]. All these protocols allow communication between two remote computers, and passwords are used in order to confirm the authentication of the two machines. Telnet is a Unix utility that lets the user log in to a remote computer on the Internet, assuming that the user already has an account on that computer. This can be especially useful to log in to a school or work computer when working from home. One of the drawbacks of telnet is that the information is sent unencrypted over the network. This allows attackers to easily use available software to snoop on network traffic and monitor the communicated information. SSH (Secure Shell) is a cryptographic network protocol, which can be considered a secure version of telnet, as it encrypts the network communication. In fact, SSH provides secure communication over an insecure network. FTP stands for File Transfer Protocol and it specifies a method for transferring files between computers.

In this work, we study SSH brute force attacks. The attackers try different passwords against an SSH login server hoping to get access to the user's account. Any SSH server open to the general Internet is at risk of SSH brute force attacks, where the attacker attempts to locate instances of weak authentication [61]. The SSH servers in our campus network were also under SSH attack. Therefore, during the data collection, we labeled real attacks in our collected data and there was no need for generating attacks through penetration testing.

The research on the detection of brute force attacks is mostly focused on host-based detection. In a host-based detection, access logs are inspected and if the number of failed login attempts in a specific time exceeds a predefined threshold, an alert is fired. In this work, we study the detection of brute force attacks at the network level. It scales better in comparison to host-based detection and it also promises

to protect devices that do not have internal (host level) protection for themselves. We propose using machine learning methods for automatic detection of brute force attacks at the network level based on Netflow data. We collected network traffic data from our campus network, which includes some SSH servers. Our network experts labeled real brute force attacks in the collected data. We applied our analyses to the IPFIX Netflows [22] extracted from the collected traffic. A Netflow is a summary of network packets that share some common characteristics. In recent years, Netflow analysis has increased in intrusion detection studies because it provides faster analysis than packet analysis. In addition, since no packet payload is used in the construction of a Netflow, Netflow analysis can be applied on encrypted traffic as well.

In addition to the normal data which is already included in the collected network traffic, we also produced some normal data which is similar to the attack data. We generated some failed login attempt data that represents the kind of traffic which is produced when a legitimate user has forgotten/misspelled his/her password. The final data is collected over eight days. It also includes failed login data produced by penetration testing. These characteristics make the collected data include most of the networking and communication scenarios seen in a real computer network.

We use aggregated Netflow data along with a machine learning approach for the detection of SSH brute force attacks at the network level using our representative data [80]. We use four different classification models in our experiments: two types of decision trees, 5-nearest neighbor and Naïve Bayes. We explain our feature extraction process, which is based on using domain knowledge and the analysis of the representative collected data in building the predictive models. Extracting discriminative features is a very important step in any machine learning application. Even simple machine learning algorithms can provide good performance results on a well-defined feature set. Conversely, a sophisticated machine learning method can fail in providing

good performance results when applied on a poorly defined feature set. To introduce features that can discriminate between normal SSH and brute force attack data, we inspected the collected data by comparing brute force Netflows with normal Netflows, which also included failed logins produced by our network operators. Our analysis showed that the brute force Netflows are very similar to the failed login attempt Netflows of legitimate users. This means the Netflow features are not discriminative enough to differentiate between brute force and normal data. We decided to extract features from an aggregation of Netflow data. The aggregation of the Netflows can reflect the number of attempts made, which is the main difference between the brute force traffic and the legitimate failed login attempt traffic. Our results show that, using these features, classifiers provide very good performance for the detection of brute force attacks. Our contributions in this work can be considered as follows:

• We propose a network-based detection for detecting brute force attacks. Network-based detection is more scalable and provides protection for hosts that do not have their own detection mechanism.

• Our collected data is real network traffic and includes real attacks in the wild. We collected representative traffic by including eight days of traffic as well as traffic similar to attack traffic.

• We used domain knowledge and analyzed attack data with failed login data to extract robust features, which can discriminate between attack and normal traffic.

• We apply four different machine learning algorithms to evaluate the extracted features in order to provide a comprehensive analysis from machine learning perspective.

The structure of the rest of this chapter is as follows. In section 3.2, we discuss related work on the topic of the detection of SSH brute force attacks. Section 3.3 explains our data collection and the experimental data used. Section 3.4 presents our approach for detecting SSH brute force attacks. Section 3.5 presents experimental results. Finally, in section 3.6, we conclude our work in this chapter and provide suggestions for future research.

3.2 RELATED WORK

Some studies have been done on the detection of brute force attacks. Work conducted by Javed and Paxson [61] outlines a method for the detection of distributed SSH brute force attacks. A two-phase approach is considered to first identify attack epochs in order to determine if an attack has occurred, and secondly, to classify the hosts as either participants or non-participants. Their method was evaluated using 8 years of SSH login records collected at the Lawrence Berkeley National Laboratory. The first phase consists of an aggregate site analyzer that monitors the probability distribution of the parameter for excessive change and then denotes such changes as possible attacks. The second phase of this method is accomplished by denoting common characteristics of legitimate users, singleton brute-forcers and low-rate distributed brute-forcers. The results found that, during the aggregation analyzer phase, 99 attack epochs were detected with 9 false alarms reported, i.e., a nearly 10% false positive rate. Classification of the attack hosts provided better results, detecting a total of 9,306 hosts with only 37 benign hosts misclassified.

Network flows have also been utilized in the detection of brute force attacks. By analyzing information found in the network flow, the amount of data that must be mitigated is lessened, making attacks more visible. Drasar et al. [42] analyzed the impact that this methodology has based on different types of brute force attack detection methods. In their work, five methodologies were examined: signature-based,

similarity-based, automated action detection based on a time window heuristic and a Fourier transform, and entropy time changes. Each approach was deemed successful in detecting attacks but each came with its own considerations. Signature-based approaches were simple but other data sources were required to help eliminate false positives. Similarity-based methods ended up being overly generic, and therefore, they were highly dependent on the parameters identified. Both automated detection methods were capable of detecting constant periodic traffic alluding to ongoing attacks. The entropy time series method was very scalable and fast to deploy.

Hofstede et al. [57] also utilized network flows to aid in the detection of SSH attacks, such as brute force. The authors propose the use of their open source IDS SSHCure, which has been updated to incorporate detection algorithms capable of identifying if an attack was successful. Their algorithms are based upon a Hidden Markov Model which assumes one to three attack phases: scan, brute force, and/or compromise. Traffic possibly consisting of brute force attacks is identified by considering hosts sending flows with a particular range of packets per flow (PPF). Their results showed that their IDS was capable of close to 100% accuracy; however, it was observed that retransmissions can cause false positives. This is due to the fact that retransmissions are not contained in the Netflow, except for the presence of increased packet and byte counts.

In this chapter, we apply machine learning for the detection of brute force attacks by using Netflow data. We use the aggregation of Netflows to extract features that are more robust than standard Netflow features for discriminating between brute force and normal SSH data.

3.3 DATA COLLECTION

The network-based detection element needs to monitor all the data passing through the network, thus it needs to be fast. One solution is to apply the analysis at the

Netflow level. In recent years, network security research has started to focus more on flow data to detect attacks [69], [56]. A Netflow is an aggregation of packets that share some identical network features in a specific interval of time. The five most common features used in the definition of a network flow form a 5-tuple key consisting of source IP, destination IP, source port, destination port and protocol. Since a network flow is an aggregation of network packets, analyzing at the flow level speeds up the processing time in comparison to analyzing on a packet-by-packet basis, because it reduces the amount of data that needs to be processed. On the other hand, flow data is only based on packet headers and does not carry any packet payload information. This is a reasonable choice for the detection of attacks in the presence of encrypted traffic, in which the payload is encrypted.

Data was collected from our live campus network utilized by both students and faculty [86]. Students specifically used the network to upload and download assignments via SSH connections. The server was configured with one public IP allowing traffic from both local and outside users. Traffic was collected over a period of eight days through the use of port mirroring with Cisco switches and in some instances by placing a network tap (nTAP).

Amongst the natural traffic being captured, penetration testing was also done to simulate failed login traffic. This traffic was generated to represent users who have possibly forgotten their password and would attempt multiple logins before either successfully logging in or giving up. A script was run on the network which would generate several failed logins before ultimately giving up and quitting. This script was designed in such a way as to not trigger a Snort [101] alert for brute force. Since certain Snort rules will flag any communication that attempts more than five unsuccessful logins in one minute, the script made sure to stop before this threshold and wait at least a minute before attempting more logins.

Our capture machine was configured with the Security Onion IDS and utilized

Snort to perform full packet captures (pcaps) of all data passing through the network daily. Once the initial data capture was completed, Wireshark was utilized to merge all daily pcaps into a single capture and filter out unneeded communications such as broadcast, IPv6, UDP, or ARP traffic. Next, Snort was run against the resulting pcap to produce an alert file to be used for labeling. This alert file would help to identify which Netflows were associated with various attack types. Once the alert file had been produced, the pcap was run through SiLK [10] to extract the necessary Netflow information and export it into the csv file format for labeling and analysis.

Netflow, also referred to as session data, represents a high-level summary of network conversations. A network flow record is identified based upon the standard 5-tuple attribute set that makes up a conversation: source IP, destination IP, source port, destination port, and transport protocol. Based upon which Netflow standard is being implemented, other attribute fields can also be produced. As stated prior, SiLK was used in our experiments to extract Netflows, and it utilizes the IPFIX standard for flow generation. Once a flow record has been initiated, there are only two ways it can be terminated. The IPFIX standard states that when no data for a flow has been received within 30 seconds of the last packet, the flow record is terminated, and when a flow has been open for 30 minutes, the flow record is terminated and a new one is created.

Once the Netflow csv and alert file had been generated, labeling was conducted by attempting to associate each Netflow with an alert fired by Snort. If a Netflow aligned with the time-stamp of a brute force associated alert, then that Netflow was labeled as Attack. If no alert could be associated with the Netflow, then it was labeled as Normal. During the later aggregation process, if any Netflow in a particular aggregation had been labeled as Attack, then the entire resulting aggregation was also labeled as Attack. If no Netflows in the aggregation had triggered alerts, then it remained a Normal aggregation.

Aggregation of Netflows was performed over a five minute time window. A shorter time window might not include enough attack data if the attack data is sent with some delays. A larger time window, on the other hand, can delay the detection of the attack. We decided to use five minute time intervals as a trade-off between short and large time windows. Within this window, Netflows were aggregated based on three common features: source IP (sIP), destination IP (dIP), and destination port (dPort). Once the Netflows were aggregated, 19 features were calculated and/or extracted for evaluation based upon a Netflow's packet size, byte count, duration, etc. A description of each extracted feature can be found in Table 3.1. We did not include sIP, dIP and dPort in this table because we did not use them in applying the machine learning methods. This makes the approach generalizable, as it does not depend on the IPs and ports seen in a specific network. The aggregated dataset includes 425 Attack instances and 558 Normal instances. The contributions of our data collection for brute force attacks can be considered as follows:

• We collected real network traffic, which includes real brute force attacks. Our collected data is a true representation of real-world traffic.

• We collected representative network data from eight days of traffic, which includes different scenarios happening in a computer network. We also included data similar to attack traffic (failed login traffic) through penetration testing to make the collected traffic even more representative.
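To make the five-minute aggregation and labeling steps described above concrete, the following sketch groups Netflow records by time window, source IP, destination IP and destination port, computes a few of the aggregation features, and propagates the Attack label. This is not the SiLK-based pipeline actually used; the file name and column names are assumptions, and pandas is used purely for illustration.

```python
import pandas as pd

# Load Netflow records previously exported to CSV (column names are assumed).
flows = pd.read_csv("netflows.csv", parse_dates=["start_time"])

# Assign each Netflow to a five-minute window, then group by the aggregation
# key: source IP, destination IP, and destination port.
flows["window"] = flows["start_time"].dt.floor("5min")
groups = flows.groupby(["window", "sIP", "dIP", "dPort"])

# Compute a subset of the aggregation features listed in Table 3.1.
agg = groups.agg(
    numOfNetflows=("packets", "size"),
    avgPkts=("packets", "mean"),
    medPkts=("packets", "median"),
    stdPkts=("packets", "std"),
    avgBytes=("bytes", "mean"),
    avgDur=("duration", "mean"),
    # If any Netflow in the aggregation was labeled Attack, the whole
    # aggregated instance is labeled Attack; otherwise it stays Normal.
    label=("label", lambda s: "Attack" if (s == "Attack").any() else "Normal"),
).reset_index()

print(agg.head())
```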

3.4 DETECTION OF SSH BRUTE FORCE ATTACKS

To build the predictive models for the detection of SSH brute force attacks at the network level, we followed these steps:

1- Collecting representative network traffic.

2- Analyzing the collected data and using the domain knowledge to extract features.

3- Using machine learning algorithms to build the predictive models for the detection of attacks.

Table 3.1: Description of features extracted from aggregated data

sIP             Source IP
numOfNetflows   Number of Netflows in aggregation
avgPkts         Average number of packets within aggregation
medPkts         Median number of packets within aggregation
stdPkts         Standard deviation of packets within aggregation
avgBytes        Average number of bytes within aggregation
medBytes        Median number of bytes within aggregation
stdBytes        Standard deviation of bytes within aggregation
avgPktSz        Average packet size (bytes/packets) within aggregation
medPktsz        Median packet size (bytes/packets) within aggregation
stdPktSz        Standard deviation of packet size (bytes/packets) within aggregation
avgDur          Average duration length (in seconds) within aggregation
medDur          Median duration length (in seconds) within aggregation
stdDur          Standard deviation of duration length (in seconds) within aggregation
flags           All flags triggered from all Netflows within aggregation
initialFlags    All initial flags triggered from all Netflows within aggregation
sessionFlags    All session flags triggered from all Netflows within aggregation
class           Class label (Attack or Normal) associated with the Netflows within aggregation

Producing network traffic is an essential step in any intrusion detection study. A lot of previous work in the application of intrusion detection has been done on the KDD Cup 99 dataset [9]. This dataset is outdated and does not reflect the current trends in today's computer networks. It is simulated network traffic and does not contain the noise and randomness seen in real network traffic. The models built on such data do not provide the same performance in a real computer network. To close the gap between intrusion detection studies and the actual deployment of these systems in real computer networks, real network traffic should be used to build the predictive models. In addition, our previous study [85] has shown that the collected data has to be representative enough to provide realistic results that are applicable to real computer networks. Representative network traffic means the data includes different scenarios seen in real computer networks, such as data streaming, interactive traffic, etc.

After the data is collected, the next step is to define discriminative features for the task of building machine learning models. To define features that are powerful enough to discriminate between a brute force attack and normal SSH traffic, we decided to compare the collected brute force Netflows with the normal Netflows. By looking at the data, we realized that the number of packets and the number of bytes in the brute force Netflows are different from the normal SSH Netflows. Compared to the upload/download and even successful login traffic, the brute force attack Netflows had relatively fewer packets and bytes.

Although it appeared that Netflow features are able to discriminate between brute force attacks and normal SSH traffic, we decided to compare brute force traffic with the failed login data that a normal user who has forgotten his/her password would produce. The reason is that a brute force attack is a sequence of the attacker's failed login attempts to log in to the user's account until the correct password is found. Therefore, it is expected that a brute force attack might be similar to failed login data. The difference is that in legitimate failed login data the number of attempts is limited, while in the brute force attack the number of attempts is high.

The comparison between the brute force data and the failed login data showed that the failed login Netflows produced by legitimate users can be very similar to the failed login data produced by the brute force attack. The reason is that most of the time, the attackers change their source port when sending a new login attempt. This causes a new Netflow to be generated for each login attempt, which is similar to the login attempt Netflows produced by legitimate users.

Based on comparing brute force data with failed login data, it appeared that Netflow features are not robust enough to discriminate the brute force data from the normal failed login data. We decided to add new features. The main difference between brute force attacks and normal failed login data is the number of failed login attempts made. In the brute force attack data the number of attempts is higher than in the data from failed logins by a legitimate user. To reflect this characteristic in our features, we decided to extract the features not from the Netflows, but from an aggregation of Netflows which is independent of the sender's source port.
Altering the source port when producing attack attempts makes each attack attempt produce a different Netflow. In fact, we are combining the attacker's attempts into one single instance by aggregating the Netflows with the same characteristics but different source ports. Such an attack instance has a higher number of Netflows than an instance related to failed login attempts by a legitimate user. We aggregated all the Netflows with the same source IP, destination IP and destination port into one single instance over 5-minute intervals. We extracted the features

The extracted features are explained in more detail in section 3.3 . After the features are extracted, four different machine learning algorithms are applied to build the predictive models. We chose four classification algorithms: 5-Nearest Neighbor (5-NN), two forms of C4.5 decision trees (C4.5D and C4.5N), and Naïve Bayes (NB). These learners are commonly used in machine learning applications, and using them provides a broader analysis from a data mining point of view. We built all models using the WEKA machine learning toolkit [ 53 ].

K-nearest-neighbors, or K-NN, is an instance learning or lazy learning algorithm. This algorithm delays building the predictive model until the testing phase. K-NN stores the training instances in memory. When predicting the class of a new instance, its distance or similarity to all the stored training instances is calculated, and the algorithm uses the K (in our study, K=5) instances closest to the test instance to decide its class.

C4.5 (the J48 implementation in WEKA) is a tree-based learning algorithm in which a decision tree is built from the training data. Each branch of the tree represents a feature in the data, which divides the instances into further branches based on the values that feature can take, and the leaves represent the final class labels. The C4.5 algorithm uses a normalized version of Information Gain to decide the hierarchy of features in the final tree structure. In this study, we employed a version of C4.5 using the default parameter values from WEKA (denoted C4.5D) as well as a version (denoted C4.5N) with Laplace smoothing activated and tree pruning deactivated.

The Naïve Bayes algorithm uses Bayes' theorem with the strong assumption of feature independence to calculate the posterior probability of an instance being a member of a specific class. While this assumption makes Naïve Bayes a relatively weak learner, it is a fast classifier.

We use the Area Under the receiver operating characteristic Curve (AUC) as the evaluation metric. The Receiver Operating Characteristic (ROC) curve is a graph of the True Positive Rate (TPR) versus the False Positive Rate (FPR). In the current application, TPR is the percentage of brute force attack instances that are correctly predicted as Attack, and FPR is the percentage of normal data that is wrongly predicted as Attack by the model. The ROC curve is built by plotting TPR versus FPR as the classifier decision threshold is varied, and the area under the ROC curve is reported as the AUC performance metric. A higher value of AUC means a higher TPR and a lower FPR, which is preferable in intrusion detection applications.

To evaluate the performance values, we used 5-fold cross validation. In 5-fold cross validation, the data is divided into 5 non-overlapping parts (folds). In each iteration, one part is held out as the test data and the other four parts are used as the training data. The final performance values are calculated by aggregating the performance values of the models tested on each of the 5 parts of the data. In order to decrease the bias of randomly selected folds, we applied four runs of 5-fold cross validation to provide each performance value.
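The evaluation protocol can be sketched as follows, using scikit-learn analogues of the WEKA learners. This is illustrative only: it does not reproduce the exact WEKA parameter settings (for example, the C4.5N/C4.5D variants), and the learner names and data arrays are assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def evaluate(X, y, n_runs=4, n_folds=5):
    """Run n_runs of n_folds-fold cross validation and report mean/std AUC."""
    learners = {
        "5-NN": KNeighborsClassifier(n_neighbors=5),
        "Decision tree": DecisionTreeClassifier(),
        "Naive Bayes": GaussianNB(),
    }
    results = {}
    for name, clf in learners.items():
        aucs = []
        for run in range(n_runs):
            # A new shuffle per run decreases the bias of randomly selected folds.
            cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=run)
            aucs.extend(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc"))
        # 4 runs x 5 folds = 20 AUC values per learner.
        results[name] = (np.mean(aucs), np.std(aucs))
    return results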

3.5 EXPERIMENTAL RESULTS AND ANALYSIS

We built four models by using the four different classification algorithms explained in section 3.4 . Table 3.2 shows the performance results. Each AUC value in the table is the average of the 20 AUC values acquired from the four runs of 5-fold cross validation, and the std values reflect the standard deviation of those values. The results show that the predictive models perform very well in the detection of SSH brute force attacks. The different classification algorithms all provide good results, which indicates that the features extracted from the network traffic are discriminative enough for recognizing attack versus normal traffic.

Table 3.2: Cross validation results

Classifier     AUC          AUC std
Naïve Bayes    0.994586     0.00279617
C4.5N          0.99648205   0.00438882
5-NN           0.9965878    0.003975975
C4.5D          0.98712355   0.011668849

Although all the classification algorithms provide close performance results, decision tree algorithms are easier to interpret. The final decision tree structures show the features that contribute most to the classification task. Between the two decision trees, C4.5N has better performance with a lower standard deviation than C4.5D. In C4.5N, numOfNetflows is selected at the first level of the decision tree structure. We extracted features from an aggregation of Netflows instead of from the Netflows themselves, because this way the instances that belong to brute force attacks have more Netflows than the instances that belong to the failed login attempts of legitimate users who have forgotten their password. Therefore, it is not unexpected that numOfNetflows is selected at the first level of the tree to discriminate between normal and attack traffic. The next feature, selected at the second level of the tree structure, is avgPktSz. This feature represents the average packet size (bytes/packets) in the aggregated instance. Since a brute force attack is a sequence of login attempts, the average packet size is usually not very high and falls below a threshold in the tree structure.

3.6 CHAPTER SUMMARY

In this chapter, we explained the process of building machine learning models for the detection of SSH brute force attacks in network traffic. We explained how we incorporated domain knowledge as well as analysis of real, representative collections of network traffic to define discriminative features. We collected our network traffic for the underlying intrusion detection task (detection of brute force attacks) from a campus computer network, which provides real normal and attack data. We also added data similar to attack network traffic (failed login data produced by legitimate users who have forgotten their passwords) to the collected data. We analyzed brute force versus normal SSH traffic in order to define features that are discriminative enough to separate normal and attack traffic. Our analysis showed that attackers tend to change their source port when sending a new login attempt. This results in a new Netflow being built for each attack login attempt. These Netflows are very similar to the Netflows produced in a login attempt by a user who has forgotten his password. The main difference is the number of launched attempts (Netflows), which can be represented in an aggregated instance. The Netflows whose start time falls in the same 5-minute interval and which have the same source IP, destination IP and destination port are aggregated into one network instance. The features extracted from aggregated Netflows can discriminate between brute force attacks and normal SSH traffic, especially the failed login traffic by legitimate users, which can be similar to attack traffic. We used four different classification algorithms to build the predictive models. Our results show that the predictive models perform very well for the detection of SSH brute force attacks. Our approach of defining discriminative features by taking into account domain knowledge and analysis of representative traffic containing normal data similar to the attack data can be used in building machine learning-based models for the detection of other types of attacks as well. For future work, we aim to compare the performance of the aggregated features and the Netflow features for the detection of brute force attacks.

CHAPTER 4
MAN IN THE MIDDLE ATTACKS

4.1 BACKGROUND AND MOTIVATION

One older, yet still popular, attack on computer networks is the Man In The Middle (MITM) attack [ 38 ], [ 81 ]. A Man In The Middle attack involves an attacker intercepting existing communication between two computers. These attacks are used by malicious internal users to collect confidential information (passwords, emails, etc.) about other network users. In this attack, the attacker places him/herself as a "man in the middle" of the communication between two computers. Because the attacker sits between both machines, their transferred network traffic reaches the attacker machine first, before it gets to either of the victim computers (Figure 4.1 ). The attacker can then decide to steal sensitive information from the received traffic, selectively modify it, or perform any other kind of malicious activity before the traffic reaches its intended destination. The two communicating computers believe they are directly connected to each other and, unless a MITM detection method alerts them, they are not aware of the presence of an attacker in the middle of their communication. Although different mechanisms exist to launch a MITM attack, there is a common behavior among them: the attacker forwards the traffic it receives from either of the victim parties to the other party. We call this "forwarding behavior." It allows the packets generated by the victim machine to reach their actual destination. Therefore, the targeted machine does not get suspicious of the existence of an attacker machine in the middle of the path toward its destination.

Figure 4.1: Man In The Middle (MITM) Attack

In a scenario where the attacker drops the traffic it receives, the two communicating parties would realize there is an anomaly in their communication.

In the forwarding behavior of a MITM attacker, the attacker receives a packet from the victim and forwards it to the real destination. Both the received packet and its corresponding forwarded packet show up in the network traffic as two nearly identical packets. This produces semi-duplicated packets in the network traffic. We call them semi-duplicated because the attacker might change some of the received packet information in order to forward it to the real destination. In addition, some packet fields might change characteristically, such as the Time To Live (TTL) field. Network packets can be compared with one another to find these semi-duplicated packets, and the detection of semi-duplicated packets in network traffic indicates the existence of a MITM attack.

To the best of our knowledge, there are only two previous works that have proposed comparing different packets in network traffic in order to detect MITM semi-duplicated packets [ 18 ], [ 34 ]. However, neither of them systematically analyzed which parts of a packet's content have to be compared among different network packets, which is necessary to find the MITM traffic efficiently and with better detection performance. In addition, they do not provide experimental analysis on real network traffic. Analyzing real network traffic, which includes the different scenarios that can occur in a computer network, provides more realistic results compared to experiments done on controlled or lab-generated traffic. In this chapter, we study data collected from our real production network with MITM attacks generated through penetration testing.

Our study is mainly focused on providing a systematic approach to automatically select which parts of packet content should be compared among different packets for the detection of semi-duplicated MITM traffic. Comparing the whole packet content cannot find the semi-duplicated packets because the packets are not exactly the same. On the other hand, some packet content is not informative enough, and including it not only increases the computational cost but might even decrease the detection performance. There is a need for a systematic approach that can automatically select which parts of packet content should be compared for the detection of MITM semi-duplicated traffic.

We use the Forward Greedy Stepwise feature subset selection algorithm [ 52 ] to select which packet header fields provide better performance results for the detection of semi-duplicated MITM data. Feature subset selection methods are used as a preprocessing step in classification tasks. They evaluate different subsets of features in order to decide which subset provides better performance, eliminating redundant and irrelevant features, which results in decreased processing time and potentially increased classification performance. We adapt this idea in order to select a subset of packet header fields that can be compared between different packets to detect MITM semi-duplicated packets. We only analyze the packet header and not the packet payload. The reason is that some MITM attackers might modify the packet payload; however, since the packet header mostly carries control data used to route the packet, there is less chance of the attacker altering certain packet header fields.

Along with the Forward Greedy Stepwise feature subset selection algorithm, we define a performance metric by using the linear scalarization technique from multi-criteria decision making [ 59 ]. This performance metric is used in the Forward Greedy Stepwise algorithm to evaluate different subsets of packet header fields, and it makes a trade-off between minimizing the False Positives and the False Negatives.

Since our approach to detecting MITM attacks is based on exploiting a behavior common to all MITM attacks, i.e. the attacker's forwarding behavior, it is independent of the method used to launch the attack. To demonstrate this independence, we study three different types of MITM attacks. We performed Address Resolution Protocol (ARP) Spoofing, Dynamic Host Configuration Protocol (DHCP) Spoofing and Port Stealing MITM attacks on our network and collected the generated Local Area Network (LAN) traffic. We analyzed the collected traffic as a representation of MITM attack behavior in a LAN environment.

Our experimental results show that using the feature subset selection algorithm to select packet header fields for detecting MITM traffic results in the selection of as few as two or three packet header fields, which at the same time provide better performance results compared to not using feature selection. Using only a small number of packet header fields for the detection of MITM attacks provides faster analysis by eliminating the need to compare the packet header fields that are not as informative. Our method also provides good performance results on all three types of MITM traffic included in our experimental data, which demonstrates the independence of this method from the MITM launching mechanism. The contributions of this chapter are as follows:

• We propose a systematic approach to select which packet header fields should be compared to detect the semi-duplicated packets generated by MITM traffic. Using the right subset of packet fields reduces the processing time by removing the fields that are not important for the detection of MITM traffic, and makes the intrusion detection system more effective for high-speed networks.

• We use a general behavior of MITM attacks to detect the MITM traffic, which makes our approach independent of the attack method. Our approach can be considered a general mechanism for the detection of MITM traffic regardless of the method used to launch the attack.

• We experimentally validate the effectiveness of our approach on real traffic and attack data generated through penetration testing. Our experimental data includes three different varieties of MITM attacks in order to demonstrate the independence of our detection approach from the attack method.

The structure of the rest of this chapter is as follows. In section 4.2 , we discuss related work on the detection of Man In The Middle attacks. Section 4.3 explains our data collection and the experimental data used. Section 4.4 presents our approach for detecting Man In The Middle attacks. In section 4.5 , experimental results are presented. Finally, in section 4.6 , we conclude our work in this chapter and provide suggestions for future research.

4.2 RELATED WORK

Since ARP Spoofing is the most popular MITM mechanism in LAN networks, many MITM detection methods are proposed specifically for the detection of ARP Spoofing attacks. ARPWatch [ 2] and Snort [ 101 ] are tools that are able to detect ARP Spoofing by checking each packet's content in order to keep track of the IP address-MAC (Media Access Control) address pairings seen in the traffic. If an analyzed packet contains an IP-MAC pair that does not match the one included in the database, the system generates an alert.

Carnut et al. [ 34 ] proposed using the entire packet content, modulo the Ethernet addresses, the IP Time To Live and the IP header Checksum fields, to detect duplicated packets in the network. Receiving a packet whose hash already exists in a hash table, within the last 2 seconds, results in a MITM alert. They proposed this method only for the detection of ARP Spoofing attacks and did not provide the details of their experimental analysis. Singh et al. [ 106 ] survey several schemes for the detection and prevention of ARP Spoofing. However, these methods are limited to the detection of ARP Spoofing and not MITM attacks in general.

Trabelsi [ 111 ] compared the resilience of popular versions of Windows OSs and Apple Mac OS X against ARP poisoning attacks in a local area network. Based on their experimental results, all tested versions of both Windows OSs and Apple Mac OS X are vulnerable to ARP cache poisoning and do not deploy efficient built-in security features that prevent this attack. They also show that by using static MAC addresses, both OS types are resilient against ARP poisoning. However, in large networks it can be challenging for the administrator to manage and update these tables throughout the network. In addition, an attacker can use other methods, such as Port Stealing, in order to launch a MITM attack in a LAN environment with static ARP tables. They conclude that there is no efficient universal defense measure, or cure, for ARP cache poisoning. Other works, such as the work done by Al-Hemairy et al. [ 19 ], also show that even very famous and expensive ARP detection/prevention systems have shortcomings. This further emphasizes the importance of detecting this attack and other types of MITM attacks in LAN networks.

Vallivaara et al. [ 112 ] proposed a method for detecting MITM attacks using timestamps of TCP packet headers. They proposed comparing the mean of the sequential timestamp delays in the current connection to one calculated from previous sessions, in order to detect MITM attacks when the observed delay is unusually high.

The main idea is that the presence of the man in the middle delays communications, and this delay is large enough to be observed as an anomaly when compared to the delays observed in normal network communications. They defined a simple detection threshold based on the mean and variance of previous connections; a connection with a mean delay above this threshold is considered a MITM connection. They collected data with systematic sampling from four large financial companies. Sampling was done by sniffing TCP packets for 24 hours while a script connected to the target website every two minutes. They also collected the same kind of data with the connections routed through a proxy. This experimental design does not reflect real MITM data on a computer network: the data is limited to the traffic type generated by their script, and instead of a real MITM attacker, a proxy is used to resemble the attacker machine. Considering the amount of delay fluctuation in real network traffic due to router configurations, etc., false alarms might become an issue in this approach. Also, if the attacker is fast in forwarding the packets, the delay difference might not show up as an anomaly.

Abri [ 18 ] proposed a method for detecting MITM attacks based on matching the payload of frames exchanged in the network. They compare a selected portion of the packet payload in order to find duplicates. Two parameters, called matching offset (MO) and matching length (ML), are used to decide what portion of the frame payload should be used in the comparisons. MO defines the starting point by specifying how much of the start of the frame to skip; from the specified MO, a portion of the payload of size ML is used for comparisons. The authors do not suggest any systematic approach for selecting the MO and ML parameters except trying different values. Their approach limits the portion of payload that is compared to be an unbroken portion of the payload, and it does not provide the flexibility of comparing different portions (potentially different packet header fields) of the payload. They applied their experiments on lab data which includes only FTP download traffic.

This traffic is limited and does not reflect the scenarios that can occur in a real network. In this chapter, we collect MITM traffic on a real production network and propose using a feature subset selection algorithm, along with a scalarized performance criterion, in order to decide what part of the packet content should be compared for finding MITM packets.

4.3 DATA COLLECTION

In order to apply our analysis to real MITM data, we collected network traffic from our live campus network [ 33 ], [ 30 ]. MITM attacks were generated through penetration testing over the network. The collected data includes traffic produced by students, staff and faculty from hosts in classrooms, labs and offices. This data contains uploads, downloads, streaming, browsing and any other kind of general course work that can occur on a campus network. We collected packets by port monitoring with a Cisco switch, forwarding all traffic to a capturing server. Our attacker and targeted machine are on a single switch, which causes all of their traffic to be collected by the capturing server.

Each attack was generated during a live class while one of the students' machines was targeted, and each attack lasted approximately one hour. Each MITM attack was performed separately to avoid potential performance overhead on the network. Three different MITM attacks were produced: ARP Spoofing, DHCP Spoofing and Port Stealing. Two different tools were used for the implementation of ARP Spoofing: ARPspoof [ 1] and Cain & Abel [ 3]; these tools are commonly used for ARP Spoofing attacks. The implementation of DHCP Spoofing and Port Stealing was done using Ettercap [ 5].

After we completed our data collection efforts, we labeled each packet as a MITM or non-MITM packet. A packet is labeled as a MITM packet if it belongs to the traffic received by the attacker during the actual MITM attack or to the traffic forwarded by the attacker; otherwise, the packet is labeled as non-MITM, which means the packet has not been received or forwarded by the attacker.

We used information about each attack, such as the attacker's MAC and IP address, the target's MAC and IP address, and the attack time and duration, via manual analysis to label the data. Table 4.1 shows the number of MITM and non-MITM packets for each attack.

We generated three different types of MITM attacks in order to demonstrate the independence of our proposed detection method from the attack type. We launched MITM attacks using the ARP Spoofing method, which is the most common method used in LAN environments. We also used DHCP Spoofing as another variety of MITM attack methods. Finally, since attackers tend to use Port Stealing when static MAC addresses are used to avoid ARP Spoofing, we used Port Stealing as the third method of generating MITM traffic. Below, we provide explanations of these three attacks.

ARP Spoofing. ARP is a protocol used to map an IP address to the physical layer address (MAC address) in LAN environments. The communications between devices in a LAN environment are done through the physical address. If computer A wants to communicate with computer B on the LAN and it only knows the IP address of computer B, it needs to discover computer B's MAC address in order to be able to send traffic to it. Computer A broadcasts an ARP packet over the network asking for the MAC address of computer B. When receiving this ARP request, computer B sends a unicast message to computer A containing its MAC address. Computer A then updates its ARP cache with the IP/MAC pair of computer B. The ARP protocol does not have any authentication mechanism to prevent invalid associations of IP and MAC addresses provided in an ARP message. Thus, a malicious host can send an ARP message to a victim host with another machine's IP address (call it machine B) and its own MAC address.

Table 4.1: Collected data information

Dataset          # of MITM packets   # of non-MITM packets
ARP Spoofing     84659               173137
DHCP Spoofing    211060              625617
Port Stealing    299667              1620513

Upon receiving this ARP message, the victim host updates the entry associated with machine B with the fake MAC address, which is actually the attacker's MAC address. From then on, whenever the victim machine wants to send a packet to machine B, it uses the attacker's MAC address and the packet is sent to the attacker instead. This attack is called ARP cache poisoning or ARP Spoofing. The attacker can also poison machine B's ARP cache with the same approach and listen to all traffic from machine B to the victim machine. Therefore, the attacker observes the bi-directional traffic transmitted between the victim and machine B. Usually attackers use this method to listen to the bi-directional traffic between a target machine and the router on a LAN. This is called a full-duplex attack, as opposed to half-duplex attacks, in which the attacker only listens to half of the traffic.

DHCP Spoofing. DHCP is a network protocol that dynamically distributes network configuration parameters, such as gateway or DNS server information. When a new host is connected to the LAN, it broadcasts a DHCP request for the network gateway IP address. Since clients broadcast DHCP requests, an attacker can receive the requests and respond with its own information, so the newly connected host believes the attacker is the default gateway and sends its outgoing packets to it. DHCP Spoofing occurs when an attacker attempts to respond to DHCP requests in order to list itself as the default gateway. With DHCP Spoofing, all the traffic from the user to the real gateway is intercepted by the attacker. DHCP Spoofing is a half-duplex MITM attack, since only the data going out from the network is captured by the attacker and not the data coming from outside the network

targeting the victim machine.

Port Stealing. Port Stealing is a useful MITM attack when methods such as ARP Spoofing are not effective due to static configuration of the ARP cache on network machines, which prevents an ARP poisoning attack from modifying hosts' ARP caches. In this situation, attackers direct their effort from the network hosts to the network switch. A switch connects computer devices on a network by receiving packets and forwarding them to the right destination. Each computer is connected to the switch via a port. When the switch receives a packet, it looks up its Content Addressable Memory (CAM) table to find the port associated with the packet's destination MAC address, and then forwards the packet to that port. In Port Stealing attacks, the attacker floods the LAN with bogus ARP requests. These ARP requests contain a victim machine's MAC address as their source MAC address. Upon receiving these ARP requests, the switch updates its CAM table and assigns the victim's MAC address to the attacker's port. This way, the attacker steals the victim's port, and from then on the switch forwards all the traffic destined for the victim's machine to the attacker. The attacker continues the flooding process until it receives packets intended for the victim machine. It then pauses the flooding and sends an ARP request to the victim's machine. This results in the victim machine sending an ARP response back, which makes it communicate directly with the switch and forces the CAM table to be updated back to its normal state. Now the attacker can forward the received packets to the victim. The flooding then recommences and the process repeats.

The contributions of our data collection can be summarized as follows:

• We are collecting real network traffic, not simulated or lab generated traffic.

• Our attack data were generated concurrently with the normal network usage, which makes our data more representative of real-world attack scenarios.

• Our data includes three MITM attack variants: ARP Spoofing, DHCP Spoofing, and Port Stealing. These attacks were selected because each is executed in a unique way. Together they incorporate both full- and half-duplex behaviors on hosts using static and dynamically mapped ARP tables.

4.4 DETECTION OF MAN IN THE MIDDLE TRAFFIC

4.4.1 General Approach

A MITM attack can be launched through the use of several different mechanisms. No matter what kind of mechanism is used to perform a MITM attack, there is one common behavior: the attacker forwards the packets received from one party to the other party. This behavior makes the two parties believe they are directly connected to each other and that there is nothing wrong with their connection. We do not consider MITM attacks where the attacker drops received packets. Those attacks cause abnormal behavior in the communication of the two computers and make the communication look suspicious. In addition, dropping packets usually results in re-transmission requests by the communication protocols, which does not provide any new information to the attacker.

The forwarding behavior of a MITM attacker in the half-duplex and full-duplex scenarios is shown in Figure 4.2 and Figure 4.3 , respectively. Regardless of whether the attacker is listening to only one direction of the traffic (the half-duplex case) or to both directions (the full-duplex case), it forwards the packet received from one side (packet 1 in the figures) to the other side. The forwarded packet is shown as packet 2 in the figures. Packet 2 is a semi-duplicate of packet 1. The forwarding process in a MITM attack thus results in the generation of semi-duplicated packets in the network traffic. In order to forward the packet, the attacker may change some information. For example, in an ARP Spoofing MITM attack, the attacker has to change the destination MAC address in the forwarded packet in order to forward it to the real intended destination.

Figure 4.2: Half-duplex Man In The Middle

Figure 4.3: Full-duplex Man In The Middle

Some MITM attacks might also modify the packet payload. Therefore, the forwarded packet might not be exactly the same as the one received by the attacker. However, the received and forwarded packets still share a good amount of the same information, which can be used to detect them as a "MITM pair." We refer to these semi-duplicated packets as "pairs"/"MITM pairs" in the remainder of the chapter. Each pair in fact contains a packet from the victim machine which is received by the attacker and the corresponding packet which the attacker forwards to the other side of the communication.

We use packet header information to detect MITM pairs in the traffic; the detection of pairs in the network indicates the existence of MITM attacks. We apply our analysis only to TCP traffic, since it makes up the majority of the collected traffic. We compare specific packet header fields between different packets in order to find MITM pairs. The question then is: which packet header fields should be compared? During our preliminary analysis of the MITM data, we observed that there are some packet header fields, e.g. MAC addresses, IP checksum, TTL, or even length-related fields such as IP header length, that are more likely to be changed by the attacker when forwarding a received packet. We also observed that there are some fields, such as the IP Identification Number, TCP Acknowledgment Number and TCP Sequence Number, which are less likely to be changed by the attacker.

We need to select packet header fields which are less likely to be changed by the attacker in the forwarding process. For example, including the MAC address most likely would not work well in the detection of MITM pairs. The reason is that if the packet received from the victim machine has a destination MAC address different from the MAC address of the real intended destination (e.g. due to ARP Spoofing), the attacker has to change the destination MAC address in order to forward the packet. Therefore, the destination MAC address will not be the same in the MITM pair packets. On the other hand, the selected fields have to be discriminative.

A field such as the IP Service Number usually has the same value for most packets. Whether or not the two compared packets belong to a MITM pair, this field will probably be the same for both of them. Such packet header fields are not informative enough to be used for detecting pairs; they only add the computational cost of comparing an extra, uninformative field. After deciding which subset of packet header fields, called subset S, should be used in detecting MITM pairs, we take two steps to find such pairs:

1. The traffic is divided into non-overlapping intervals. Each interval is T seconds long.

2. In each interval, each packet is matched against the rest of the packets included in the interval to find whether there exists a match for it.

Two packets match if all the packet header fields in S have the same value in both of them. If exactly one packet is found which matches the examined packet, both packets are labeled as a MITM pair. If no match is found for the examined packet, or more than one match is found, the packet is labeled as non-MITM. The reason we label packets with more than one match as non-MITM is that, in MITM traffic, only one forwarded packet is generated for each received packet. Therefore, we expect each MITM pair to include only two packets (the received one and the forwarded one).

The selected time interval length (T) should not be very long, because a long interval increases the computational cost. Finding MITM pairs requires all the packets in a time interval to be matched against each other; this process is O(N²), where N is the number of packets in the time interval. The longer the time interval, the more packets it includes (a larger N), which increases the computation. On the other hand, the time interval should not be too small, because a small time interval increases the chance of having MITM pair packets fall into neighboring time intervals.

In that case, our approach would not be able to find those pairs. Based on preliminary experiments, we selected T = 30 seconds, which is not too long for the computations and also gives good performance results. The process of selecting a subset of packet header fields, S, for detecting MITM pairs is explained in the next subsection.
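A minimal sketch of this pair-detection procedure is given below, assuming packets are represented as dictionaries with a 'time' key (in seconds) and the chosen header fields. The field names ip_id and tcp_seq are placeholders for whichever subset S is ultimately selected.

from collections import Counter, defaultdict

def detect_mitm_pairs(packets, fields=("ip_id", "tcp_seq"), interval=30):
    """packets: iterable of dicts with a 'time' key and header-field values.
    Returns the set of packet indices labeled as MITM."""
    buckets = defaultdict(list)
    for idx, pkt in enumerate(packets):
        bucket = int(pkt["time"] // interval)       # T-second interval
        key = tuple(pkt[f] for f in fields)         # values of the subset S
        buckets[bucket].append((idx, key))

    mitm = set()
    for entries in buckets.values():
        counts = Counter(key for _, key in entries)
        for idx, key in entries:
            # Exactly two packets sharing the same field values form a pair;
            # zero matches or more than one match leaves the packet non-MITM.
            if counts[key] == 2:
                mitm.add(idx)
    return mitm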

4.4.2 Selecting a Subset of Packet Header Fields

Feature subset selection methods evaluate different subsets of features in order to decide which subset provides better performance. Different algorithms exist to search through the space of possible feature subsets and evaluate each one. In order to evaluate the performance of each feature subset, some feature subset selection algorithms build a classification model on each examined feature subset to calculate the performance metric, while other methods calculate the performance based on simpler filter metrics instead of evaluating a classification model.

In this chapter, we use a feature subset selection algorithm to select which subset of packet header fields should be used for the detection of MITM pairs. We use the Forward Greedy Stepwise algorithm as our subset search algorithm to search through the space of different subsets of packet header fields. We define the performance metric used to evaluate different subsets, based on a trade-off between the number of MITM packets which are not detected and the number of non-MITM packets which are wrongly detected as MITM, using scalarization methods from multi-criteria decision making.

The Forward Greedy Stepwise algorithm begins with an empty set of features called the "working set" and progressively adds one feature at a time to this set. The features are selected from a "pool" of features which is initially fed to the algorithm. At each step, every feature that is not included in the current working set is temporarily added to the working set, one at a time.

The feature that most improves performance when temporarily added to the working set becomes a permanent member, and the algorithm iterates until no remaining feature improves performance.

We define the metric for evaluating each feature subset in the Forward Greedy Stepwise algorithm based on how well the examined subset detects MITM packets while avoiding flagging non-MITM packets as MITM. The metric gives a better evaluation value when fewer MITM packets are left undetected and few non-MITM packets are wrongly labeled as MITM.

To define our evaluation metric, called E, we first explain some other evaluation metrics which we will use in defining E and which will also be used later in the results section to demonstrate the performance of our detection mechanism. After applying our pair detection algorithm, each packet is labeled as either MITM or non-MITM. A True Positive (TP) is a packet in the MITM traffic which is also labeled as MITM. A False Negative (FN), or missed packet, is a packet that belongs to the MITM traffic but is not detected as MITM and has been labeled as non-MITM. A True Negative (TN) is a packet in the non-MITM traffic which is also labeled as non-MITM. A False Positive (FP) is a packet in the non-MITM traffic which is wrongly labeled as MITM. The TP rate (TPR) and FN rate (FNR) are obtained by dividing TP and FN by the number of packets in the MITM data. Similarly, the FP rate (FPR) and TN rate (TNR) are calculated by dividing FP and TN by the number of packets in the non-MITM data.

The preference is for TPR to be high and FNR to be low, which means most of the MITM packets are correctly labeled as MITM. On the other hand, FPR should be low and TNR should be high, which means few packets in the non-MITM data are wrongly labeled as MITM. Different subsets of packet header fields will result in different evaluation values. In order to define a single metric that can be used in the Forward Greedy Stepwise subset selection algorithm, we used scalarization methods from multi-criteria decision making [ 59 ]. We use linear

scalarization, in which a linear combination of the different objectives is minimized. To define the evaluation metric, we only considered FNR and FPR, both of which are preferred to have smaller values. We do not include TPR or TNR because a lower value of FNR or FPR corresponds to a higher value of TPR or TNR, respectively; thus, there is no need to include all of these values in the evaluation metric. Using FNR and FPR along with the scalarization method, we define the evaluation metric as below:

E = γ × FNR + (1 − γ) × FPR

This evaluation metric is minimized while selecting different subsets of packet header fields in the subset selection algorithm. The coefficient γ indicates which of the two criteria (FNR and FPR) is more important. A higher value of γ causes FNR to be decreased more while minimizing E; conversely, a lower γ value (higher 1 − γ) puts more focus on decreasing FPR. In our preliminary analysis, we observed that FNR is usually higher than FPR, so we decided to focus more on decreasing the number of missed MITM packets. Based on our preliminary analysis, we use γ = 0.8, which means we care more about reducing the number of missed packets in the MITM data than the number of False Positives in the non-MITM data. We use this evaluation metric in the subset selection algorithm: the algorithm keeps adding packet header fields to the selected subset and stops when no reduction in the E value is seen. The FPR and FNR in the calculation of E are computed from a dataset of labeled data provided to the algorithm. The initial pool consists of all the packet header fields in a TCP/IP packet. We do not include the fields from the Ethernet header, as MAC addresses are most likely to change during the forwarding process. TCP and IP headers have their own control fields, which are used for

routing packets in computer networks. We included all TCP and IP packet header fields in the pool set for the subset selection algorithm.
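A minimal sketch of this search, scored with E = γ × FNR + (1 − γ) × FPR, is shown below. The function evaluate_subset is a stand-in for running the interval-based pair detection on the labeled dataset and measuring (FNR, FPR) for a candidate subset; it is assumed here, not shown.

def scalarized_E(fnr, fpr, gamma=0.8):
    """Linear scalarization of the two objectives."""
    return gamma * fnr + (1.0 - gamma) * fpr

def forward_greedy_stepwise(pool, evaluate_subset, gamma=0.8):
    """pool: list of candidate header fields (all TCP/IP fields).
    evaluate_subset(subset) -> (FNR, FPR) measured on the labeled data."""
    working, best_e = [], float("inf")
    while True:
        best_candidate, candidate_e = None, best_e
        for field in pool:
            if field in working:
                continue
            # Temporarily add the field and score the enlarged working set.
            fnr, fpr = evaluate_subset(working + [field])
            e = scalarized_E(fnr, fpr, gamma)
            if e < candidate_e:
                best_candidate, candidate_e = field, e
        if best_candidate is None:       # no field reduces E any further
            break
        working.append(best_candidate)   # make the best field permanent
        best_e = candidate_e
    return working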

4.5 EXPERIMENTAL RESULTS AND ANALYSIS

In the first set of experiments, we applied the subset selection algorithm to each dataset (ARP Spoofing, DHCP Spoofing, Port Stealing) separately, as well as to the whole dataset, which includes all three aforementioned MITM datasets. The pool set includes all the IP and TCP packet header fields. The results are shown in Table 4.2 . The field names ip id, tcp seq and tcp checksum indicate the IP Identification Number, TCP Sequence Number and TCP Checksum fields from the packet header, respectively. The results show that IP Identification and TCP Sequence Number are selected for all datasets, and TCP Checksum is selected for all datasets except the ARP Spoofing dataset. All three of these packet header fields (IP Identification, TCP Sequence Number and TCP Checksum) inherently take different values for different packets most of the time. This property makes them appropriate candidates for our MITM pair detection algorithm: since these fields usually take different values for different packets, observing two packets with the same values for all of these fields most likely indicates a MITM pair.

The IP Identification Number field [ 110 ] is used for uniquely identifying IP packets. Its value is incremented every time the source sends a packet to the destination. This field is useful during the reassembly of fragmented IP packets; currently, it is required to be unique within the maximum packet lifetime for all packets with a given source/destination address/protocol tuple. The TCP Sequence Number [ 11 ] is a 32-bit number which is used by each host on either side of a TCP session to keep track of how much data it has sent. This number is included in each transmitted packet. The packet is acknowledged by the opposite host using an acknowledgement number, which informs the sender that the packet was successfully received.

When a host initiates a TCP session, the sequence number randomly takes a value between 0 and 4,294,967,295. The TCP Checksum field is used for error checking of the header and data. All of these fields take different values within a TCP session or among different packets.

Among the three selected packet header fields explained above, TCP Checksum will change if the packet payload changes. If the attacker changes the packet payload while forwarding the packet, the TCP Checksum field will not have the same value in the received and forwarded packets. In order to see which packet header fields would be selected if a payload-modifying attack were going on, we decided to remove all the fields that might change in such an attack and apply our subset selection algorithm again. This way, the selected subset is independent of whether the attacker changes the packet payload or not. We eliminated IP Checksum, TCP Checksum and the length-related fields (IP Header Length, IP Total Length, TCP Header Length or Data Offset) from the pool set in the subset selection algorithm. The results are shown in Table 4.3 . For the ARP Spoofing dataset, the same fields as in Table 4.2 are selected; this is expected considering that this set does not include any of the fields eliminated from the pool set. For the DHCP dataset, TCP Checksum is obviously eliminated compared to the previous results and no other field is added to compensate for it, which shows that TCP Checksum does not have a significant effect on the performance for this dataset, and removing it does not make any difference. For the Port Stealing dataset and the whole dataset, in addition to IP Identification and TCP Sequence Number, the TCP Flags field is selected to compensate for the elimination of TCP Checksum.

Table 4.4 shows the performance results for the different subsets and datasets. In this table, we provide performance results for 5 different packet header field sets. The "AllFields" set is the pool set used in the first set of our experiments, which includes all IP and TCP packet header fields.

Table 4.2: Selected header fields for each attack and the whole data

Dataset          Selected Header Fields
ARP Spoofing     ip id, tcp seq
DHCP Spoofing    tcp checksum, ip id, tcp seq
Port Stealing    tcp checksum, ip id, tcp seq
Whole Data       tcp checksum, ip id, tcp seq

Table 4.3: Selected header fields for each attack and the whole data when checksum and length-related fields are removed

Dataset          Selected Header Fields
ARP Spoofing     ip id, tcp seq
DHCP Spoofing    ip id, tcp seq
Port Stealing    tcp seq, ip id, tcp flags
Whole Data       tcp seq, ip id, tcp flags

The "AllFieldsmod" set is the pool set used in the second set of experiments, which includes all IP and TCP header fields modulo the checksum and length-related fields. The "SelectedFields1" set includes IP Identification, TCP Sequence Number and TCP Checksum. The "SelectedFields2" set includes IP Identification and TCP Sequence Number. The "SelectedFields3" set includes IP Identification, TCP Sequence Number and TCP Flags. We provide the performance results for all 5 sets of packet fields even if a set was not selected for a specific dataset by the subset selection algorithm. For example, SelectedFields2 is not selected for the ARP dataset in either of the two experiments; however, we provide its performance results on the ARP dataset for the sake of comparison.

Table 4.4: Performance results for different subsets of packet header fields and different datasets

Dataset          Subset            TPR      FNR      TNR      FPR

Whole Dataset    AllFields         0.9443   0.0557   0.9998   0.0002
                 AllFieldsmod      0.9448   0.0552   0.9998   0.0002
                 SelectedFields1   0.9735   0.0265   0.9999   0.0001
                 SelectedFields2   0.9725   0.0275   0.9944   0.0056
                 SelectedFields3   0.9747   0.0253   0.9944   0.0056

ARP Spoofing     AllFields         0.7011   0.2989   1        0
                 AllFieldsmod      0.7011   0.2989   1        0
                 SelectedFields1   0.9060   0.0940   1        0
                 SelectedFields2   0.9139   0.0861   0.9974   0.0026
                 SelectedFields3   0.9113   0.0887   0.9974   0.0026

DHCP Spoofing    AllFields         0.9653   0.0348   1        0
                 AllFieldsmod      0.9675   0.0324   1        0
                 SelectedFields1   0.9653   0.0347   1        0
                 SelectedFields2   0.9705   0.0295   1        0
                 SelectedFields3   0.9676   0.0324   1        0

Port Stealing    AllFields         0.9983   0.0017   0.9998   0.0002
                 AllFieldsmod      0.9977   0.0023   0.9998   0.0002
                 SelectedFields1   0.9983   0.0017   0.9998   0.0002
                 SelectedFields2   0.9904   0.0096   0.9919   0.0081
                 SelectedFields3   0.9977   0.0023   0.9920   0.0080

Figure 4.4: Plotted TPR results for Table 4.4

Since, overall, the FPR values are very small, we only present TPR values in Figure 4.4 . The results show that, overall, the selected subsets of packet header fields provide better performance compared to using all the IP and TCP fields (AllFields) or all fields without the checksum and length-related fields (AllFieldsmod), across all datasets. This performance improvement is significant for a dataset such as the ARP Spoofing dataset, where TPR increases from 70% to almost 91% when using the subset selection algorithm, while FPR stays almost the same. We observed that in the ARP dataset the attacker tends to change packet header fields more than in the other attacks. The performance improvement achieved with the subset selection algorithm indicates that, when the attacker changes some packet header fields, this method works well in increasing the detection performance. The ARP dataset also has lower performance compared to the other datasets; the reason is, again, that in the ARP traffic the packet header fields changed more in our collected data. Using the subset selection algorithm along with the scalarized performance metric not only increases

the performance most of the time, it also decreases the number of packet header fields significantly, which is beneficial for speeding up the detection process, especially in high-speed networks. In addition, the fact that our method provides good detection performance on all three datasets shows its independence from the MITM attack type.

4.6 CHAPTER SUMMARY

In a MITM attack, the attacker intercepts, reads or alters information moving between two computers. A common behavior among all these attacks is that the attacker forwards the packets received from one side to the other side, which causes semi-duplicated packets to be observed in the collected traffic [ 82 ]. Although previous works have exploited this behavior to detect MITM traffic by comparing packet contents in the network traffic, none of them has proposed a systematic approach to select which parts of packet content have to be used in the packet comparisons. In this chapter, we used the Forward Greedy Stepwise feature selection algorithm to decide which packet header fields have to be compared. We defined a performance metric, which trades off between FNR and FPR, to evaluate each subset. This performance metric is defined using the scalarization idea from multi-criteria decision making. The fact that we only used packet header fields in our analysis makes our method adaptable to the case where the attacker changes the packet payload.

We applied our analysis to real network traffic collected from a campus network. We collected three different types of MITM attacks, ARP Spoofing, Port Stealing and DHCP Spoofing, through penetration testing. The traffic is divided into non-overlapping intervals, and in each interval the selected packet header fields are compared between all the packets in that interval in order to detect MITM pairs. Our results show that, overall, using the subset selection algorithm along with the defined scalarized performance metric selected subsets of packet header fields which provide

better detection performance compared to using all the IP and TCP packet header fields for detecting MITM pairs. This performance improvement is especially observable for ARP attacks, which tend to change packet header fields more frequently when forwarding the received packets. Our approach shows that using only two packet header fields (IP Identification and TCP Sequence Number) provides very good performance in the detection of MITM pairs across all MITM attack types. Since these two fields are less likely to change if the attacker alters the packet payload, we suggest that they can also be used for detecting MITM traffic where the attacker modifies the packet payload. In future work, we plan to collect MITM data where the attacker changes the packet payload and investigate its detection further.

CHAPTER 5
APPLICATION LAYER DDOS ATTACKS

5.1 BACKGROUND AND MOTIVATION

Denial of Service (DoS) attacks, which aim to prevent legitimate users from using services over the Internet, are among the most prevalent network attacks. These attacks have consistently been a major threat to computer networks and Internet security [ 15 ], and they have been known to the network community since the early 1980s [ 128 ]. DoS attacks cause the services offered by organizations to become unavailable, damaging the hard-earned brand image of, for example, a financial institution. The first Distributed Denial of Service (DDoS) incident was reported by the Computer Incident Advisory Capability (CIAC) in the summer of 1999 [ 41 ]. Since then, most Denial of Service attacks continue to be distributed in nature. A Distributed Denial of Service attack uses multiple machines operating in concert to target a network or server. DDoS attacks are often launched by a network of remotely controlled zombie or bot computers that simultaneously send a large volume of traffic and/or service requests to the target system. Consequently, the target system responds very slowly, the connection becomes unstable, or the target system crashes. As a result, service to legitimate users is denied [ 78 ], [ 36 ]. The remotely controlled computers are part of a botnet and are usually recruited through the use of worms, trojan horses or backdoors. A botnet, occasionally referred to as a zombie army [ 39 ], is a collection of Internet-connected computers that are infected and controlled by a single attacker, often a cybercriminal. The users of

infected computers are mostly unaware that a botnet is infecting their system [ 97 ]. The objective of creating a botnet is to infect as many computers as possible over the Internet, and to use their computing power and resources to perform automated tasks that generally remain hidden from the users of the devices. In the case of a DDoS attack, the botnet computers are used to simultaneously launch a Distributed Denial of Service attack toward a single target system to make the attack much larger and more disruptive. Figure 5.1 shows a typical DDoS attack. Conventionally, DDoS attacks are carried out at the network and transport layers [ 128 ], commonly via SYN floods, in which part of the normal TCP three-way handshake is exploited to consume resources on the targeted server and render it unresponsive.

Figure 5.1: Distributed Denial of Service attack

Essentially, TCP connection requests are sent faster than the target machine can process them, causing network saturation. Various methods exist that can detect and block these attacks [ 45 ], [ 123 ], so it is not as easy as it was in the past for attackers to launch network and transport layer DDoS attacks. Attackers continue to change their methods as protection against network and transport layer DDoS attacks improves. One increasingly common trick is a shift to application layer attacks, in which attackers target the application layer instead of the network and transport layers. The goal is to overload application servers [ 16 ] to the point that access to the services provided by the server becomes impossible. Protecting against an application layer DDoS attack is much more difficult than protecting against a network layer DDoS attack, because the generated traffic looks normal at the IP and TCP levels.

Application layer DDoS attacks target vulnerabilities in the application layer, usually through the Hypertext Transfer Protocol (HTTP) [ 49 ]. HTTP is designed as a request-response protocol to enable communications between clients and servers. It is used to deliver data (HTML files, image files, query results, etc.) on the World Wide Web (WWW). The web browser on a user's computer can be considered the client, and an application on the computer that hosts the website which the user is browsing is the server. Whenever a URL is issued from the browser to get a web resource using HTTP, e.g. https://www.google.com/, the browser turns the URL into a request message and sends it to the HTTP server. The HTTP server interprets the request message and returns an appropriate response message, which is either the requested resource or an error message.

Two commonly used HTTP methods for a request-response between a client and server are GET and POST. The GET method retrieves information from the server, while the POST method submits data to the server and requests that the web server accept the data, either for analyzing it or, most likely, for storing it. Attackers exploit

the weaknesses in either the HTTP GET or HTTP POST methods when launching an application layer DDoS attack. In HTTP GET attacks, the attacker sends a large number of HTTP GET requests to flood the web server. The request packets have legitimate HTTP payloads, which do not allow the victim server to distinguish them from normal users' packets [ 29 ]. Another type of misuse of the HTTP GET method is the Slowloris attack [ 48 ], in which the attacker separates the lines of the HTTP headers and sends them at a very slow rate to keep the server busy. Attackers also misuse weaknesses in the HTTP POST method, similarly to the Slowloris attack, in that the attacker tries to keep the server busy by sending HTTP POST requests at a very slow rate. The behavior of Slowloris and HTTP POST attacks is different from that of HTTP GET flood attacks, as the former send pieces of data at a very slow rate while the latter sends a large number of requests. We focus on HTTP GET flood attacks in this chapter.

Most studies analyze web server logs 1 to differentiate between a normal user's behavior and an attacker's behavior [ 100 ], [ 118 ], [ 72 ]. These studies follow two main strategies: anomaly detection and classification. In anomaly detection methods, the normal users' behavior is modeled; if a new behavior does not conform to this normal, or baseline, behavior model, it is marked as a potential attack. Anomaly detection methods are able to detect new or zero-day attacks, and training the model only requires normal data, which can be collected during the typical operation of a computer network. A potential downside of these methods is a high false alarm rate, because no attack data is used to train the model. The classification based studies, on the other hand, use both normal and attack data in order to build a classification model which can distinguish between normal and attack instances. The issue with these studies is that they are not able to detect a new type of attack, because it was not part of the model training dataset. In addition, in real-world applications, it is

1https://httpd.apache.org/docs/1.3/logs.html

hard to access data that includes real attack instances beforehand in order to train the model. We focus our efforts on anomaly detection, rather than classification. Our main reason is that classification methods need labeled data. Since HTTP GET flood attacks can utilize numerous variants, building a labeled dataset that contains all these variants is not a feasible task. Attackers are capable of sending different HTTP requests to the web server in different orders. They can target their requests toward the main webpage, a random webpage, a particular resource such as an image file, or a combination of these. Considering that a website can contain a large number of webpages and resources, the number of possible HTTP GET flood behaviors can be very large. It is not feasible to obtain data that contains all variants of HTTP GET flood attacks by enumerating an exhaustive list of behaviors attackers can exhibit while sending HTTP requests to the target server. However, it is possible, and also simpler, to collect normal data during normal server operations in order to have a representation of normal users' access behavior. This data can be used to model users' behavior and detect anomalous behaviors (potential attacks). The main idea in application layer attack anomaly detection methods is to model normal user behavior and detect any new behavior that does not conform to the normal model as an anomaly, i.e., an attack [100], [118], [72]. A user behavior can be represented as a set of features which describes how a user requests resources from a web server. Such features are usually extracted from web server logs. The web server records the requests sent by users' browsers, along with some operational information, in web server logs. Some studies [122], [118] model user behaviors based on the webpages they have browsed on a specific website. The problem with this approach is that they need to first map the sequence of resources a user has requested from the web server to the actual sequence of webpages browsed on the website. This can be a complicated task

as each webpage contains several resources. Additionally, the sequence of resources requested from the web server can differ from user to user based on their browsers and on the resources already cached on their machines, and webpages can share different resources. These reasons make the process of mapping a sequence of requested resources to the actual browsed webpages difficult. In this chapter, we propose an approach to model user behaviors based on the resources they have requested rather than the webpages they have browsed. This simple variation eliminates the difficulty and complexity of modeling the webpages that a user has browsed. We define each user behavior as the number of times she/he has requested specific resources, such as an image, a PDF file or an HTML file, in a particular time window. Our motivation is that the repetitive behavior of attackers in requesting resources to exhaust the server will reflect itself as higher values in our proposed representation of the user behaviors. This repetitive behavior is also considered under the assumption that normal users spend longer times reading webpages while attackers do not [71]. There are approaches which use timing information, such as the request rate, to model users' behaviors [71], [126]. However, timing information can fluctuate widely in a networking environment. Our proposed approach provides a more robust representation of the repetitive behavior of attackers by not using explicit timing information. We apply PCA-subspace anomaly detection [44] and one-class SVM [74] to analyze whether the proposed user behavior model is discriminative between normal and attack instances. We apply our experimental analysis on real web server logs associated with a student resource portal website. We generated nine different HTTP GET DDoS attacks through penetration testing by changing the pattern in which the GET requests are sent to the server. We designed the attacks in a way that they mimic normal users' behaviors more naturally than current, popular attack tools.

To the best of our knowledge, this is the first approach which considers the existence of embedded objects in requesting a webpage as a factor in generating attack data. Embedded objects are HTML tags inside a webpage, such as images, audio and CSS style sheets. The browser needs to request the embedded objects from the web server in order to display the webpage. Our experimental results show that using the two anomaly detection methods along with the proposed user behavior instances extracted from web server logs provides effective performance in the detection of HTTP GET DDoS attacks. The contributions of this chapter are as follows.

• We propose a new method to extract user behavior instances from web server logs. Our method is based only on the number of times specific resources have been requested from the server by a user in a particular time window. This approach eliminates the complexity that comes with modeling user behaviors based on the browsed webpages or on the request rate and timing information.

• We use two popular anomaly detection methods, PCA-subspace and one-class SVM, for the detection of application layer DDoS attacks using the extracted user behavior instances.

• We experimentally validate the effectiveness of our approach on real web server logs and attack data generated through penetration testing.

• Our experimental data includes a wide variety of attacks. We consider different factors, such as the popularity of a webpage, randomness, and embedded objects in requesting a webpage, when generating our attacks.

The structure of the rest of this chapter is as follows: In Section 5.2, we discuss related work on the detection of HTTP GET flood attacks. Section 5.3 explains our data collection and the experimental data used. Section 5.4 presents our approach for detecting HTTP GET flood attacks. Section 5.5 presents experimental results. Finally, in Section 5.6, we conclude our work in this chapter and provide suggestions for future research.

5.2 RELATED WORK

Xie and Yu [122] proposed an extended hidden semi-Markov model to model the browsing behavior of web surfers. The Markov state space is used to describe the webpage set of the website, and the state transition probability matrix represents the hyperlink relationships between different webpages. When a user clicks a hyperlink pointing to a page, a number of HTTP requests are generated for the page and its in-line objects. To estimate the order of pages a user is requesting, the requested objects in an observation sequence are grouped into different clusters. This allows each user request sequence to be transformed into the corresponding webpage. The order and transition sequence of consecutive groups show the user's browsing behavior. The entropy of an observed request sequence made by a user is defined as the anomaly measure. This model is very complicated to train and computationally heavy. Ye et al. [126] proposed clustering user sessions. They calculate the deviation between sessions and normal clusters as the abnormality measure. Four features are extracted from each session in order to cluster the normal users' sessions and describe normal users' behavior. The attack data contains a fixed number of requests per second, with random objects, sent to the web server. Their generated attack characteristics are based on the normal data they collected: the number of attack requests per second is calculated by multiplying the number of requests in a session randomly selected from the normal data by two. This makes the attack characteristics not representative of a real attack that happens in computer networks. Liao et al. [71] used a classification method to separate attacks from normal data. They extracted two features to represent a user's browsing behavior in one time window. The first feature describes the number of user requests in every sub-time window. The

second feature describes the time intervals between user requests. The idea is that human visitors spend time on the webpages that interest them. First, the average interval for a sequence of user accesses during one hour is calculated and compared with a threshold value to filter users who are clearly normal. Both normal and attack data are used to define this threshold. Rhythm matching is then used on the frequency sequences (the first feature) to group the remaining sequences into clusters. Suspected attack clusters are defined as clusters whose scale is less than a threshold; this filters out the normal users. Each of the remaining suspected clusters is again clustered using a labeled clustering algorithm (L-Kmeans). Their approach depends on defining threshold values at different steps to filter the data. In addition, they did not provide results on test datasets; the results are provided on the training datasets only. Providing results on test data is important, as it verifies whether the parameter values, which are selected solely based on the training data, provide the same performance results on the test data as well. Wang et al. [118] presented two different methods to characterize users' web access behavior in order to detect application layer DDoS attacks. To characterize web access behavior with webpage ratios, they construct the website's a priori click ratio vector, which represents the click ratio for each webpage in the website. Each user session is defined by its sequence of webpage requests in a 30-minute time interval, and each user has an empirical click ratio vector which shows the user's interest in different webpages during a session. The second method builds the transition probability matrix between webpages. Again, each user's empirical access behavior matrix is calculated, depicting the user's access logic on the website. They compare users' empirical click ratio vectors to the website's a priori vector, and adopt large deviation theory to estimate the probability of the deviation. They do the same to measure the deviation of an ongoing user's behavior from the website's a priori transition probability matrix. Their simulation results show that the first approach can detect application layer DDoS attacks accurately, while

the second approach has high false negatives. Since clicking on a webpage produces several HTTP requests, in order to apply this method it should be clear how a user's webpage accesses can be derived from the user's HTTP requests. Also, their first approach might not provide good detection results if the attacker targets only highly accessed webpages on a website.

5.3 DATA COLLECTION

This section discusses the process used to collect our experimental data. We collected web server logs for an active student resource portal inside our collection environment [31]. The student resource portal is built with Wordpress and is hosted on a local Apache server running on CentOS. A web server log is a log file generated by the server which maintains a history of requests received by the web server. A typical record in a web server log looks similar to the following: “222.222.222.222”, “57890”, “837074”, “/var/www/html/wordpress/index.php”, “HTTP/1.1”, “GET”, “/wordpress/index.php”, “Mozilla/5.0 (compatible; Baiduspider/2.0;)”. Here, “222.222.222.222” is the client IP address of the request. “57890” is the size of the response in bytes and “837074” is the time it took the server to serve this request. “HTTP/1.1” is the request protocol and “GET” is the request method. “/wordpress/index.php” shows which URL path is requested by the client and, finally, “Mozilla/5.0 (compatible; Baiduspider/2.0;)” shows the browser from which the request has been sent, known as the user agent. We generated nine different distributed HTTP GET flood attacks through penetration testing. All attacks are targeted toward the web server for the student portal in our collection environment. In a distributed HTTP GET flood attack, each zombie machine sends a large number of HTTP GET requests to a targeted web server. The large number of HTTP GET requests coming toward the web server causes it to allocate increasingly more resources to handle them. Eventually, after enough requests

are generated by the distributed HTTP GET flood attack, the server resources saturate and the service provided by the web server is denied to legitimate users. Our nine different variations of HTTP GET flood attacks were generated by targeting various resources with different orderings on our web server. We considered methods performed by real-world attack tools as well as variants aimed to better mimic normal user behavior when generating our attacks. Each attack session is generated in a distributed manner to better mimic a Distributed Denial of Service scenario. For each attack session, 45 host machines within our network were configured to launch the attack simultaneously toward the targeted web server. Each attack session lasts for a full hour and only one attack variant was implemented at a time. For all attack variants, the HTTP GET requests are sent toward the web server using a random interval of 1 to 5 seconds. The web server is constantly generating web server logs; therefore, as with our normal traffic, attack traffic was also collected and stored in the continuous web logs. The known IP addresses of the attacker machines are used to separate attack data from normal data. In order to make our generated attacks mimic normal users' behavior, we considered embedded objects in generating the attacks. Embedded objects are in-line objects included in a webpage which are necessary to properly display the webpage in a browser. These objects can consist of scripts, images, CSS, and fonts, amongst other needed resources. When the URL of a website's webpage is navigated in a browser, the browser typically sends HTTP GET requests for the embedded objects included in that webpage to the web server, in addition to the original page request. Usually, the first access to a webpage causes some of the embedded objects to be cached on the user machine. Therefore, subsequent requests to the same webpage from the web browser only include HTTP GET requests for the embedded objects which are not cached on the machine, i.e. non-cached embedded objects. Most attack tools, such as High Orbit Ion Cannon (HOIC) [6] and Hulk [8], only request the targeted

page without considering the embedded objects. This can make detection of these attacks much simpler, as they do not follow the behavior of standard web browsers. We decided to take embedded objects into account in generating our attacks in order to make the attacks more sophisticated and similar to normal user behavior. The order in which embedded objects are requested from the web server can vary from one browser to another. We evaluated the request behavior of three popular browsers: Chrome, Firefox, and Safari. We navigated to a targeted webpage (e.g. the home page) from each of these three browsers. We then monitored the web log to document the embedded objects requested by each browser and their ordering. These different ordering lists were used in the scripts to generate different attacks. Thus, the attack would also request the webpage's embedded objects, in association with a particular browser behavior. We also took into account the case where the request to the webpage is a subsequent one and only non-cached embedded objects are requested. Analyzing such subsequent requests for non-cached embedded objects for the three studied browsers indicated that all three had the same behavior in requesting these objects: they cached the same objects and therefore requested the same non-cached resources. This object order was also considered in the attack scripts to generate a different type of attack. Overall, our generated attacks belong to three main categories. For two of these attack categories, we considered embedded objects and non-cached embedded objects to make the attacks mimic normal user behavior more closely. Each attack category targets a different set of webpages on our student resource portal website. These sets of webpages are as follows:

• Home Page - This attack targets the home page of our server. The home page is a common target for real-world attacks because it is typically one of the most accessed pages by a website's users. Attackers target the home page hoping that their requests will not be detected among all the normal requests targeting the home page.

• Top 5 Pages - This attack targets the five most accessed webpages within our website. A Wordpress plug-in was incorporated to track the number of views for each webpage. We selected the top five webpages with the most views to target in the second category of our attacks.

• Random Page - This attack targets a random webpage on our website for each new request. A random selection from all available webpages on the website is taken when sending a new request.

For the Home Page attack category, we generated 5 different variants. One attack simply sends requests for the home page without requesting any embedded objects; this behavior is similar to the behavior observed in most of the attack tools. Another variant of the Home Page attack sends requests for the home page as well as the non-cached embedded objects. There are also three variants where the attacker sends requests for the home page and all the embedded objects in the home page. The order in which the embedded objects are requested is based on the three chosen browsers; one attack type is generated for each chosen browser. For our Top 5 Pages attack category, we generated three different variants, similar to those for the Home Page attack: one to represent a typical attack without embedded objects, one to request the webpages and their non-cached embedded objects, and one to request the webpages and their Chrome browser embedded objects. Since we had to document an object ordering for each of the five top pages, we decided to focus on only one browser. Lastly, for the Random Page attack category, we only generated one type of attack, where the attacker sends requests for random webpages without including any embedded objects. This approach was chosen because creating an object ordering list for every page in the resource site, even for a single browser, was not deemed

feasible for the scope of this particular capture. Each of the attack variants is represented in Table 5.1.
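As noted above, attack traffic is recorded in the same continuous web logs as normal traffic, and the known IP addresses of the attacker machines are used to separate attack records from normal ones. A minimal R sketch of this labeling step is shown below; the file and column names are hypothetical and assume the raw access log has already been parsed into a CSV with one request per row.

    # Hypothetical file and column names; the raw Apache log is assumed to have been
    # parsed into a CSV with one request per row beforehand.
    logs <- read.csv("web_server_log.csv", stringsAsFactors = FALSE)
    attacker_ips <- readLines("attacker_ips.txt")   # known IPs of the 45 attacking hosts
    logs$label <- ifelse(logs$client_ip %in% attacker_ips, "attack", "normal")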

The contributions of our data collection can be summarized as follows:

• Compared to related works, which inject generated attack traffic into pre-collected normal data [121], [126], our attack data were generated concurrently with normal network usage. This makes our data more representative of real-world attack scenarios.

• Our HTTP GET flood attacks are generated in a distributed manner. This provides a more realistic representation of the web server behavior under a flood of a large number of HTTP GET requests.

• We took into account the embedded objects and non-cached embedded objects in generating our attacks. This not only makes our attacks more sophisticated than the attacks generated by common attack tools, but also mimics normal users’ behaviors in sending requests for a webpage.

5.4 USER BEHAVIOR ANOMALY DETECTION

This section explains our methodology for the detection of HTTP GET flood attacks using web server logs. Our main idea is to define user behavior instances based on web server logs and apply an anomaly detection method to detect anomalous behaviors (potential attacks). To apply an anomaly detection method for the detection of anomalous behaviors, the first step is to provide a definition of a user behavior instance. Such user behavior should be defined in a way that HTTP GET flood attackers' behavior instances stand out from normal users' behavior instances as outliers or anomalies.

Table 5.1: Attack variants

Attack                        Description

Home Page: Chrome w/embed     Requests the home page along with the associated embedded objects. Objects are requested in the order used by the Chrome browser.

Home Page: Firefox w/embed    Requests the home page along with the associated embedded objects. Objects are requested in the order used by the Firefox browser.

Home Page: Safari w/embed     Requests the home page along with the associated embedded objects. Objects are requested in the order used by the Safari browser.

Home Page: Non-cached         Requests the home page. Only objects that are not cached by browsers are also requested.

Home Page: w/o embed          Requests the home page without embedded objects. Most similar to standard HTTP GET attack tools.

Top 5 Pages: Chrome w/embed   Requests the top five pages along with their associated embedded objects. Objects are requested in the order used by the Chrome browser.

Top 5 Pages: Non-cached       Requests the top five pages. Only objects that are not cached by browsers are also requested.

Top 5 Pages: w/o embed        Requests the top five pages without embedded objects. Most similar to standard HTTP GET attack tools.

Random Page                   Selects a random page from all active pages on each new request.

The next subsection explains our definition of a user behavior instance, considering the behavior of HTTP GET flood attacks. Our user behavior instances are extracted from web server logs. After user behavior instances are extracted from web server logs, an anomaly detection method is applied to detect HTTP GET flood attacks. An anomaly detection method builds a normal baseline model from the provided normal training data. Each newly seen instance in the test data is then compared with the normal baseline model. If the instance does not conform to the model, it is detected as an anomaly [35]. We use two popular anomaly detection methods, PCA-subspace and one-class SVM, to detect anomalous user behaviors.

5.4.1 Defining User Behavior

A normal Internet user accesses the resources on a web server based on the structure of the website she/he is visiting. For example, a user who is accessing an educational website might first visit the home page, then click on the article section, and finally download an article. Each URL navigation or click on a webpage results in the browser sending a number of HTTP GET requests to the web server. These requests include the webpage and its embedded objects, such as images, flash, video and audio. On the other hand, an infected computer in an HTTP GET flood attack sends HTTP GET requests to the web server based on a pre-written script running on it. The repetitive behavior of the attacker in sending requests in order to exhaust the server resources is not similar to what a normal user does when accessing a website. For example, it is very unlikely that a normal user accesses the home page or downloads the same PDF file over and over again in a short period of time. Effective detection of HTTP GET attack requests requires the ability to separate them from normal users' requests. We extract user browsing behavior instances from web server logs. Each behavior instance represents how each user is accessing the

resources on the web server in a particular time interval. We then apply an anomaly detection method to detect anomalous user behavior instances. These anomalous instances most likely belong to attack data. Each web server provides users with different resources. These resources include webpages, images, configuration files, etc. When a user enters a URL in the browser or clicks on a URL link, the browser sends HTTP GET requests to the web server for the webpage and the objects embedded in it. Some of these HTTP requests will be answered by a proxy or cache and never reach the web server. The requests that are received by the web server will be recorded in the web server log. Each record in a web server log provides information about “who” has requested “what” at “what time.” The “who” is the client IP field, which represents the user who has sent the request. We consider the “what” to be the URL field, which represents the resource for which the user has sent an HTTP request. The “what time” is the time field, which represents the time at which the request was received by the web server. Using this information, we extract the browsing behavior of each user based on the resources she/he has requested from the web server in a particular time window. We apply a non-overlapping sliding window to divide all the log records into 5-minute time windows. The records inside each time window are grouped based on the client IP field. For each client IP, the group of records inside a time window shows all the resources, i.e. URL fields, the client has requested during that particular time window. We define each user browsing behavior instance as a feature vector extracted from the group of URL fields for each user in each time window. It should be noted that if the selected time window size is too short, it does not sufficiently represent the attack behavior. On the other hand, a very long time window delays the detection of the attack. Our preliminary analysis showed that 5 minutes is a good trade-off to provide proper performance results. To define the feature vector, we first build our features by considering each unique

URL seen in a web server log as a single feature. During our preliminary analysis, we realized that some of the URLs are requested by only a small number of users. Following this analysis, we only consider the URLs that have been requested by at least 5 users in the web server log to build the features. All the other URLs, which are requested by fewer than 5 users, are considered as one single feature. We build a feature vector by counting the number of times each feature appears in a group of log records. Each feature vector represents a user browsing behavior instance. Our approach to extracting features is similar to the bag-of-words model [75] in Natural Language Processing [84], [83]. Algorithm 1 shows the process of extracting feature vectors from web server logs.

Algorithm 1 Extract Feature Vectors

 1: procedure ExtractFeatureVectors
 2:   Define Features as all the unique URLs seen in the server log which are requested by at least 5 users
 3:   Add one additional feature to Features for all the URLs which are requested by fewer than 5 users in the web server log
 4:   Divide the server log into non-overlapping 5-minute time windows
 5:   for each time window w do
 6:     Group the log records based on the client IP field and extract the URL fields for each group
 7:     for each client IP in w do
 8:       Extract the feature vector: count the number of times each feature in Features occurs in the group of URL fields
 9:     end for
10:   end for
11: end procedure
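A minimal R sketch of Algorithm 1 is given below. It assumes the web server log has been parsed into a data frame logs with columns client_ip, url and time (a POSIXct timestamp); these column names, and the use of base R, are assumptions for illustration rather than the exact implementation used in our experiments. As described later, the feature set is built from the normal training log only and is then reused, via the factor levels, when extracting test instances.

    extract_feature_vectors <- function(logs, feature_set = NULL,
                                        min_users = 5, window_sec = 300) {
      if (is.null(feature_set)) {
        # Features: URLs requested by at least `min_users` distinct users; all other
        # (rarely requested) URLs collapse into a single catch-all feature.
        users_per_url <- tapply(logs$client_ip, logs$url,
                                function(ip) length(unique(ip)))
        feature_set <- c(names(users_per_url)[users_per_url >= min_users], "OTHER")
      }
      logs$feature <- factor(ifelse(logs$url %in% feature_set, logs$url, "OTHER"),
                             levels = feature_set)

      # Non-overlapping 5-minute windows; one instance per (window, client IP).
      logs$window      <- floor(as.numeric(logs$time) / window_sec)
      logs$user_window <- paste(logs$window, logs$client_ip, sep = "_")

      # Count how many times each user requested each feature in each window.
      as.matrix(table(logs$user_window, logs$feature))
    }

    train_x <- extract_feature_vectors(normal_logs)        # builds the feature set
    test_x  <- extract_feature_vectors(test_logs,
                                       feature_set = colnames(train_x))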

5.4.2 Detecting Anomalous User Behaviors

According to [47] and [50], the general architecture of all anomaly-based network intrusion detection systems consists of three basic modules or stages: parameterization, training and detection. Parameterization includes collecting network data from the target environment. As discussed in the previous sections, the raw data should be representative of the system to be modeled (in our case, web server logs). The training stage models the system using either manual or automatic methods. The detection stage compares the model generated in the training stage with the selected parameterized data portion; threshold criteria are selected to determine anomalous data instances. Formulating our problem of user behavior anomaly detection for the detection of HTTP GET flood attacks, our parameterized data is the web server logs collected from our web server, which hosts a student portal website. We extracted user behavior instances from the collected web server logs using Algorithm 1. Each user behavior instance is a feature vector of length 85. Since our purpose is to detect HTTP GET flood attacks, we only included GET log records in our analysis. Our data is divided into two parts: training data and test data. The training data only contains normal user behavior instances. These are extracted from server logs during normal operational times of the web server. The test data, on the other hand, is extracted from log files which include both normal and generated attack records. It should be noted that, even though we use the same process (Algorithm 1) to extract all instances, the feature set (Features in lines 2 and 3 of Algorithm 1) is built using the normal training data only. We use the same feature set to extract the test instances from the test log data. At the training stage, we apply PCA-subspace and one-class SVM anomaly detection separately. The reason we apply two different anomaly detection methods is to compare their performance. Similar performance would suggest that the defined

features are discriminative enough that two different anomaly detection methods both result in good performance. Both of these methods are machine learning algorithms and allow us to build a model of the normal training data automatically. They build a model from the normal user behavior patterns in the training data using PCA and one-class SVM, respectively. In the detection stage, the PCA-subspace method projects a new instance onto the anomalous subspace extracted from the normal training data to detect an anomaly. One-class SVM creates a representational model of the data; if newly encountered data is too different from the trained model, it is labeled as out-of-class.

PCA-subspace Anomaly Detection

PCA is a multivariate technique which transforms a number of correlated features into a set of linearly uncorrelated features called principal components [17]. The first principal component captures the greatest amount of variance in the data along a single axis. The second principal component captures the maximum variance among the remaining orthogonal directions, and so on. After applying PCA to a set of data points, the calculated principal components are ordered based on the amount of data variance that they capture. PCA has been used for network anomaly detection by researchers, with the most popular work done by Lakhina et al. [66] for the detection of network-wide traffic anomalies. We use the same terminology as [66] to explain PCA and the PCA-subspace method in the rest of this section. We build the user behavior matrix $Y$ by considering each feature vector extracted in the previous subsection as one row of $Y$. Each row is a point in $\mathbb{R}^m$, where $m$ is the number of features. To apply PCA, we need to make sure each feature in $Y$ has a zero mean. In the rest of this section, we assume that $Y$ has already been altered so that each feature has a zero mean. This ensures that the principal components capture the correct variance.

Applying PCA on $Y$ results in $m$ principal components $\{v_i\}_{i=1}^{m}$. The first principal component, $v_1$, is a vector in $\mathbb{R}^m$ whose direction represents the direction of maximum variance in $Y$. Proceeding iteratively, the $k$-th principal component corresponds to the direction of maximum variance in the difference between the original data and its mapping onto the first $k-1$ principal axes. One important use of PCA is dimensionality reduction [26]. We can add up the amount of variance captured by the first $r$ principal components. If this variance represents the majority of the variance in our data (for example, more than 95% of the variance in the data), we can map the data to the reduced $r$-dimensional space represented by the first $r$ principal components. In our analysis, we do not use PCA for dimensionality reduction, but for anomaly detection using the PCA-subspace method. The subspace method works by dividing the principal axes into two sets which represent the normal and anomalous variations in the user behavior data. Any instance $y$, which corresponds to one row of the $Y$ matrix, can be represented as $y = \hat{y} + \tilde{y}$ by projecting it onto the normal ($\hat{y}$) and anomalous ($\tilde{y}$) subspaces. The idea of subspace anomaly detection is that an anomalous change in the feature correlations of a feature vector reflects an increase in the projection of that data point onto the anomalous subspace. This results in the magnitude of $\tilde{y}$ being higher than what is seen among normal instances. To calculate the magnitude of the projection of each instance onto the anomalous subspace, we first consider the set of principal components in the normal subspace as the columns of a matrix $P$ of size $m \times r$. This matrix contains the first $r$ principal components, where $r$ is selected as explained above for the case of dimensionality reduction.

$\hat{y} = PP^{T}y = Cy \qquad (5.1)$

$\tilde{y} = (I - PP^{T})\,y = \tilde{C}y \qquad (5.2)$

The matrix $C = PP^{T}$ represents the linear projection onto the normal subspace, while $\tilde{C} = I - PP^{T}$ represents the linear projection onto the anomalous subspace. Abnormal changes in $\tilde{y}$ can indicate an anomaly. The Squared Prediction Error (SPE) can be used to monitor the changes in $\tilde{y}$. The SPE is calculated as:

$SPE = \|\tilde{y}\|^{2} \qquad (5.3)$

If the SPE of a user behavior instance is greater than a threshold $\delta$, that instance is detected as an anomalous instance.
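To make the detection stage concrete, the following is a minimal R sketch of the PCA-subspace detector using the stats package (prcomp). It assumes train_x (normal instances only) and test_x are the user behavior feature matrices produced by Algorithm 1; the variable names and the 0.99 training-data quantile used to illustrate the choice of the threshold $\delta$ are assumptions for illustration, not the exact settings of our experiments.

    # Fit PCA on the normal training matrix Y (prcomp centers the features for us).
    pca <- prcomp(train_x, center = TRUE, scale. = FALSE)

    # Normal subspace: the first r principal components capturing 95% of the variance.
    var_captured <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    r <- which(var_captured >= 0.95)[1]
    P <- pca$rotation[, 1:r, drop = FALSE]               # m x r matrix of principal axes

    # SPE of each instance: squared norm of its projection onto the anomalous subspace.
    spe <- function(x) {
      y <- scale(x, center = pca$center, scale = FALSE)  # zero-mean the features
      rowSums((y - y %*% P %*% t(P))^2)                  # || (I - P P^T) y ||^2
    }

    spe_test  <- spe(test_x)
    delta     <- quantile(spe(train_x), 0.99)            # one possible way to set the threshold
    anomalous <- spe_test > delta                        # flagged user behavior instances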

One-Class SVM Anomaly Detection

One-class Support Vector Machine (SVM) [102] can be used for the task of anomaly detection [55]. The support vector machine is a discriminative classifier [40]. Given labeled training data from two different classes (supervised learning), the algorithm finds an optimal hyperplane which categorizes new instances into either of the learned classes. Schölkopf [102] extended the SVM methodology to handle training using only one class of data. By providing only the normal training data, the algorithm creates a (representational) model of this data. If newly encountered data is too different from the trained model, it is labeled as out-of-class, i.e., an anomaly [68], [37]. We use one-class SVM to detect user behavior instances that belong to HTTP GET flood attacks. We use the feature vectors from normal users' behaviors to train the one-class SVM. This learns what a normal user's requests should look like. Each newly seen instance is then provided to this trained model and a decision value is calculated for that instance. Instances with lower decision values represent anomalies.
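A minimal R sketch of this detector using the e1071 package is shown below; the nu and gamma values are placeholders (in our experiments they were chosen by grid search, as described later), and the zero cut-off on the decision values is only one possible threshold.

    library(e1071)

    # Train on normal user behavior instances only (one-class formulation).
    ocsvm <- svm(train_x, y = NULL, type = "one-classification",
                 kernel = "radial", nu = 0.05, gamma = 0.01)

    # Decision values for newly seen instances; lower values are more anomalous.
    pred <- predict(ocsvm, test_x, decision.values = TRUE)
    decision_values <- as.numeric(attr(pred, "decision.values"))
    anomalous <- decision_values < 0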

Table 5.2: Number of instances in each dataset

Dataset     # of Attack instances     # of Normal instances
Training    0                         2047
Test        3917                      1133

5.5 EXPERIMENTAL RESULTS AND ANALYSIS

Table 5.2 shows the number of instances in each of the training and test datasets. The number of attack instances in the test dataset is higher than the number of normal instances because our attacks are distributed: a relatively large number of computers target the web server at the same time, which results in a large number of attack instances. The following subsections provide our results for the PCA-subspace and one-class SVM anomaly detection methods.

5.5.1 Results for PCA-subspace Anomaly Detection

We use the training instances to build the Y matrix discussed in Section 5.4 for the PCA-subspace anomaly detection method. We applied PCA using the R stats package [98] on our training data. Figure 5.2 shows the fraction of total variance captured by each principal component of Y. To choose the number of principal components in the normal subspace, we selected the first r principal components which cumulatively account for 95% of the variance in the data. After applying PCA on the training data, we use Equation 5.3 to calculate error values for the instances in the test data. Figure 5.3 shows the boxplot of error values on normal training instances, normal test instances and each of the attack types' instances in the test data. We show the boxplot of error values on normal training instances as a baseline against which to compare the error values of the normal and attack test instances. The error values of normal instances in the test data are similar to the error values of the normal instances in the training data. On the other hand, the error values for attack instances are clearly higher than the error values of the training and test normal data. This shows that the PCA-subspace method works well for discriminating between normal and attack instances. To provide performance values on the test data, we calculated the AUC, Area Under the Receiver Operating Characteristic Curve, by varying the threshold value δ. The calculated AUC on the test dataset is 0.9964. Figure 5.5 shows the ROC curve for the PCA-subspace anomaly detection results. We also calculated AUC values for each attack type. For each attack type, we built a dataset which includes that attack type's instances plus the normal test instances. We then calculated the AUC value on the built dataset. The AUC results for each attack type using the PCA-subspace method are shown in the second column of Table 5.3. The results show that the AUC value for each attack type is higher than 0.9921, which is still a high performance value. Overall, our results show that the PCA-subspace method provides good discrimination between normal and attack data using the browsing behavior features extracted in this chapter.

Figure 5.2: Fraction of total variance captured by each principal component

Figure 5.3: Boxplot of error values for normal training instances, normal test instances and each of the nine attack types
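The AUC values reported above can be computed, for example, with the pROC package. The sketch below assumes spe_test from the earlier PCA-subspace sketch, a vector test_label with values "normal" or "attack" for the test instances, and a vector attack_type naming the variant of each attack instance (and, say, "none" for normal instances); all three names are hypothetical.

    library(pROC)

    # Overall AUC on the test set: higher SPE should indicate attacks.
    roc_all <- roc(response = test_label, predictor = spe_test,
                   levels = c("normal", "attack"), direction = "<")
    auc(roc_all)

    # Per-attack-type AUC: each attack type's instances plus all normal test instances.
    for (a in unique(attack_type[test_label == "attack"])) {
      keep <- test_label == "normal" | attack_type == a
      r    <- roc(response = test_label[keep], predictor = spe_test[keep],
                  levels = c("normal", "attack"), direction = "<")
      cat(a, as.numeric(auc(r)), "\n")
    }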

5.5.2 Results for One-class SVM Anomaly Detection

We used the “e1071” [77] library for the one-class SVM. We used the Radial Basis Function (RBF) kernel, and the ν and γ parameters were chosen using a grid-search method [58]. Since, in our application, the attack instances are the class of interest, we consider the attack class as the positive class and the normal class as the negative class in building the one-class SVM model. After the one-class SVM is trained, we applied the trained model to the test data, which includes both normal and attack instances. The trained one-class SVM model calculates a decision value for each test instance. Figure 5.4 shows the boxplot of SVM decision values on normal training instances, normal test instances and each of the attack types' instances in the test data. We show the boxplot of decision values on normal training instances as a baseline against which to compare the decision values of the normal and attack test instances. The decision values of normal instances in the test data are similar to the decision values of the normal instances in the training data, and they are mostly positive. On the other hand, the decision values for attack instances are clearly lower than the decision values of the training and test normal data, and they are mostly negative. This shows that the one-class SVM method works well for discriminating between normal and attack instances. Again, we calculated AUC values as the performance metric by varying the threshold for the decision values computed by the one-class SVM on the test data. The calculated AUC on the test dataset is 0.9941. Figure 5.6 shows the ROC curve. We also calculated AUC values for each attack type, similar to what we did for the PCA-subspace experimental results. The AUC values for each attack type, using one-class SVM anomaly detection, are shown in the third column of Table 5.3.

Figure 5.4: Boxplot of SVM decision values for normal training instances, normal test instances and each of the nine attack types

Table 5.3: AUC and number of instances for each attack type

Attack                         PCA-Subspace AUC    One-Class SVM AUC    # of Attack instances
Home Page: Chrome w/embed      0.9990996           0.9965218            397
Home Page: Firefox w/embed     0.9975946           0.9941442            375
Home Page: Safari w/embed      0.9976941           0.9960058            374
Home Page: Non-cached          0.9973396           0.9966887            373
Home Page: w/o embed           0.9980106           0.9955232            455
Top 5 Pages: Chrome w/embed    0.9937477           0.9928553            528
Top 5 Pages: Non-cached        0.9920824           0.9804020            505
Top 5 Pages: w/o embed         0.9966260           0.9949420            415
Random Page                    0.9974426           0.9962314            494

5.5.3 Comparison Analysis

Table 5.3 shows the AUC values calculated for each attack type using PCA-subspace anomaly detection and one-class SVM. Overall, the results show that both methods provide very similar performance, with PCA-subspace providing slightly better AUC values. Comparing the overall AUC values calculated on the test dataset for PCA-subspace (0.9964) and one-class SVM (0.9941) again suggests that both methods provide similar results, with PCA-subspace performing slightly better. The fact that two different anomaly detection methods provide good performance results on the same set of user behavior features suggests that the defined user behaviors do well in discriminating between normal and HTTP GET flood attack instances. Compared to approaches such as [122] and [118], which model how users access different webpages on a website, our approach models how users access the resources on a website. This eliminates the need to determine which webpage the user is accessing by examining the sequence of requests, which can be a complicated task considering that webpages can share different resources. In addition, those approaches do not fit attacks where the attacker targets resources on a website instead of webpages.

Figure 5.5: ROC curve for PCA-subspace method

Figure 5.6: ROC curve for one-class SVM

5.6 CHAPTER SUMMARY

The large number of studies and mitigation methods proposed for IP and TCP level DDoS attacks has pushed attackers toward application layer methods for launching DDoS attacks in recent years. Therefore, application layer DDoS attacks can be considered a more recent trend in network attacks. These attacks exploit vulnerabilities in the application layer to exhaust web server resources. The traffic generated by these attacks looks normal at the TCP and IP levels, which makes them more difficult to detect. We studied the detection of distributed HTTP GET flood attacks in this chapter. HTTP GET flood attacks send a large number of HTTP GET requests toward the targeted web server in order to exhaust its resources and make its services unavailable to legitimate users. There is a need for a detection mechanism that discriminates normal user requests from attack requests, which enables the server to block the attack requests. In this chapter, we proposed an anomaly detection mechanism for detecting HTTP GET flood attacks. First, user behavior instances are extracted from web server logs, and then an anomaly detection method is applied to detect the attacks. We used two different anomaly detection methods, PCA-subspace and one-class SVM, to detect unusual user behaviors (potential attacks). The reason

we used two different anomaly detection methods is to estimate whether the features defined to represent user behaviors are discriminative enough to distinguish attack behaviors from normal behaviors. We defined a browsing behavior instance as a feature vector which represents the resources that a specific user has requested from the web server in a particular time window. To perform our experimental analysis, we used web server logs from a student resource portal web server, and we also generated nine different types of HTTP GET flood attacks through penetration testing. In generating our attacks, we considered different levels of similarity between the attack pattern and a normal user pattern in requesting resources from a web server by taking embedded objects into account. Our experimental results show that applying the PCA-subspace and one-class SVM anomaly detection methods to our browsing behavior instances can discriminate between normal behavior instances and attack behavior instances. For future work, we plan to collect more normal data as well as more variants of attack data in order to expand our analysis.

CHAPTER 6

FEATURE SELECTION

6.1 BACKGROUND AND MOTIVATION

A network intrusion detection system needs to analyze all the traffic passing through the network. Analyzing large amounts of network traffic can be challenging for any intrusion detection system, so there is a need for data reduction methods which reduce the amount of data that needs to be analyzed. Since machine learning methods are being used to build intrusion detection models, data reduction strategies can also be borrowed from the machine learning domain. Feature selection is a pre-processing step used to reduce the number of features in building machine learning predictive models [60], [90], [87]. To build a predictive model, a set of features is extracted from each data instance in the training dataset. These features are then used to build the predictive models by applying machine learning algorithms. Feature selection focuses on removing redundant and irrelevant features. This results in a smaller set of features and is considered a data reduction step. It is important that a feature selection method reduces the number of features while not adversely affecting the predictive model's performance. The efficiency of an intrusion detection model is dependent on the number of features, among other factors. Once the intrusion detection model is built, all the feature values have to be extracted and evaluated for every newly observed network instance to decide whether it is a benign instance or a malicious one. The fewer features the intrusion detection model needs to extract, the more efficient it would

be. By using feature selection, most of the relevant features are selected during the process of building the predictive models. Consequently, during the intrusion detection process, fewer features need to be extracted for each single network traffic record. In addition to improving the efficiency of the detection process, removing redundant and irrelevant features through feature selection sometimes increases the prediction performance. In this chapter, we study feature selection in the detection of network attacks for two purposes. Firstly, we study the main application of feature selection, which is reducing the number of features in building the attack detection predictive models. We apply our analysis on the Kyoto dataset [108]. Our analysis provides a guideline for selecting a proper feature selection method for an intrusion detection task. Secondly, we investigate the application of feature selection to discover the important features in the detection of a specific attack. Such features provide more insight into the attack behavior and how it can be distinguished from normal traffic. This information can be used in the process of feature engineering for similar attacks [32]. We use an ensemble of different feature selection methods to find the important features for the detection of the RUDY attack in the SANTA dataset [119], [88]. The contributions of this chapter are as follows.

• We investigate the application of feature selection in intrusion detection from two perspectives. Firstly, we study how different feature selection methods can be evaluated for the detection of network attacks. Secondly, we use feature selection methods to identify the important features for the detection of a specific attack.

• Our first study provides a guideline for evaluating feature selection methods and selecting the one that best suits an intrusion detection application.

• We propose an ensemble of feature selection methods for our second study. It selects the important features for the detection of a specific attack. Such an approach not only can increase the efficiency of an intrusion detection system by reducing the number of features, it also gives more insight into the attack and the characteristics that make it distinguishable from normal data.

The structure of the rest of this chapter is as follows. In Section 6.2, we discuss related work on the topic of feature selection in intrusion detection applications. Section 6.3 presents our methodology for studying the two aforementioned applications of feature selection in the detection of network attacks. Section 6.4 presents experimental results. Finally, in Section 6.5, we conclude our work in this chapter and provide suggestions for future research.

6.2 RELATED WORK

Feature selection is a very important pre-processing step in building machine learning models for the detection of network attacks [117]. Kayacik et al. [62] used a hierarchy of self-organizing feature maps (SOMs) for intrusion detection. They used the KDD 99 dataset in their experiments. They performed two sets of experiments, one using all 41 features in the KDD dataset and one using 6 features selected with expert-based domain knowledge. Within the context of intrusion detection, the hierarchical SOM is designed to correlate network behavior with the network features with greater specificity as the hierarchy levels increase. Their results show that when using the 6 basic features, the resulting model is capable of detecting over 95% of Normal and Denial of Service (DoS) connections. Utilizing all 41 features results in incremental improvements in the detection of the Normal and DoS categories and, in particular, yields significant performance improvements for Probe attacks. Nguyen et al. [93] focused on correlation feature selection (CFS) and minimal redundancy maximal relevance (mRMR), which are feature subset evaluation methods.

89 features. Each time they remove one feature from the feature set, the SVM or neural network performance is then calculated. This performance is then compared with the performance of SVM or neural network on the whole feature set. The eliminated feature is then ranked as “important”, “secondary” or “insignificant” based on the rules defined on the total classifier accuracy, training time and testing time. Finally, some of the important features are kept and SVM or neural network is applied to classify the 5 different types of attacks in the KDD 99 dataset. They conclude that using the important features yields the most remarkable performance in terms of training time. Eid et al. [ 46 ] propose a linear correlation-based feature selection for building intrusion detection models. Their proposed feature selection method consists of two layers for analyzing feature redundancy. The first layer selects the feature subset based on the analysis of Pearson correlation coefficients between the features and the second layer selects a new set of features by analyzing the Pearson coefficient between the features and the classes. They compare their proposed feature selection method with methods such as PCA and gain ratio. Based on the accuracy analyses, their proposed method outperforms the other feature selection methods used in their experimental analyses. Previous studies [ 115 ] have shown that the performance analysis on different feature selection methods depends on what performance metric is used. In the area of intrusion detection, accuracy does not seem to be a reliable performance metric as it is not good for imbalanced data. When comparing different feature selection methods it is better to use proper performance metrics that work well for the IDS application domain (imbalanced data). Onut and Ghorbani [ 94 ] proposed a ranking mechanism to evaluate the effec- tiveness of different features for the detection of different types of attacks. Their statistical feature ranking process is designed as a 3-tier method. The ultimate goal is to calculate the probability of each individual feature being able to detect one of

90 the main types of attacks defined in the used dataset (DARPA). The higher the prob- ability, the better the feature is for the detection of attacks. Once all the probabilities are computed, the features are ranked based on their corresponding probability val- ues. The top selected features for the DoS attack are the ones correlating to Internet Control Message Protocol (ICMP) such as the number of ICMP bytes sent by the source IP or the number of ICMP packets sent by the source IP.

6.3 METHODOLOGY

6.3.1 Evaluating Feature Selection Methods for Detection of Network Attacks

The purpose of our first analysis of the application of feature selection in the detec- tion of network attacks is to evaluate which feature selection method can reduce the number of features while still yielding a similar or better performance to the intrusion detection model built based on using the whole feature set. In this section, we explain the main steps we took in our empirical studies toward this goal. As the main goal in our experiments is to compare different feature selection methods in building intrusion detection models, we apply 4 different feature selection methods along with 3 different classification algorithms to build the models. We also apply the 3 classifiers when no feature selection is performed. We use the Kyoto 2006+ dataset [ 108 ], [ 116 ] in our experiments. We build the models on one day of data and test the models’ performance on the days chosen from the following days after the day the models are built. Due to large quantity of the Kyoto 2006+ data with over 2.5 years of network traffic, we use only a subset of the data. All the classifiers and feature selections are applied on the first of January of 2008 to build the classification models. We test the models on the data containing the 12 days data between 6 and 12 months after the training day. Models are tested on each day separately. Table 6.1 contains all of 91 Table 6.1: Details of the dataset

Date Fit or Test,dataset # of Normal # of Attacks

2008-01-01 Fit 61,462 50,127

2008-07-10 Test 72,863 19,508

2008-07-25 Test 78,519 44,756

2008-08-10 Test 64,759 59,634

2008-08-25 Test 87,253 37,483

2008-09-10 Test 75,749 38,552

2008-09-23 Test 53,703 22,835

2008-10-10 Test 79,298 19,407

2008-10-25 Test 61,647 14,087

2008-11-10 Test 78,875 22,587

2008-11-25 Test 43,716 36,792

2008-12-10 Test 66,459 58,728

2008-12-25 Test 79,537 44,869

Although the original data has 24 features, we exclude the features related to security analysis. The reason is that we wanted our analysis to generalize to other networks, and not all networks have such analysis available for their network data. We also wanted to test whether our models could work effectively without such security analysis included. We also removed IP addresses from the features. The reason is that we built our model on one day of data (the first of January of 2008) and we test it on days between 6 and 12 months after the training day. In the Kyoto dataset, the original IPv4 address was sanitized to one of the Unique Local IPv6 Unicast Addresses, and the same private IP addresses are only valid within the same month. Essentially, the IP addresses differ from month to month, so using the IP addresses in building the models does not make sense. To make this more clear, consider that we build a model on one day of the data and the model uses IP=X in its construction. We then test that model on one day from a subsequent month. There is no data in the day from the subsequent month with IP=X. This means that if we test a model built on one day of the data on days from another month, the IP values are not useful. A description of the features in the Kyoto dataset can be found in [108].

measure which is fast to compute to select a feature subset. Conversely, wrapper methods use a classification model to score a feature subset. For each feature subset, a predictive model is built and its prediction performance is calculated. Since a wrapper method trains a new model for each feature subset, wrapper methods are very computationally intensive. In this work, we focus on a comparison of the filter-based feature selection methods. We selected four feature selection methods from the filter-based category. These methods are drawn from two sub-categories: feature rankers and feature subset selection. Filter-based feature ranking methods use different techniques to assign a score to each feature. The features are ranked based on these scores in order from best to worst, and the top N features are then selected as the result of the feature selection method. These methods are computationally efficient. Three different feature rankers are used in this study: Chi Squared (CS), Area under the Receiver Operating Characteristic Curve (ROC) and Signal-to-Noise ratio (S2N). CS is a commonly used feature selection method, ROC is a threshold-based feature selection method and S2N is a first order statistic filter. The Chi Squared method (CS) [120] utilizes the X² statistic to measure the strength of the relationship between each feature and the class. This method basically evaluates each feature independently with respect to the class labels: the larger the Chi Squared value, the more relevant the feature is with respect to the class. The ROC method is a threshold-based filter method [114] that assigns a value to each feature. Threshold-based feature selection methods treat the feature values as posterior probabilities to estimate classification performance. In the ROC method, the classification performance is calculated using the Area under the ROC Curve metric (AUC); the larger the value of the AUC metric (between 0 and 1), the more predictive power the attribute has. It should be noted that the ROC feature selection method uses the AUC metric to

rank the features. This use is distinctly different from the AUC value that is calculated as the classification performance in our analyses. In the ROC feature selection method, an AUC value is calculated for each single feature in order to rank the features; in our performance analyses, an AUC value is calculated when testing each model on each test day to evaluate its performance. The Signal-to-Noise ratio, or S2N, as it relates to classification or feature selection, represents how well a feature separates two classes. The filter-based subset evaluation method used in our analyses is Correlation-based Feature Selection (CFS) [54]. CFS uses the Pearson correlation coefficient to trade off high correlation between each feature and the class against low correlation among the features themselves. We use Forward Greedy Stepwise as our subset search algorithm. This algorithm begins with an empty set of features, called the working set, and progressively adds exactly one feature at a time to it. In each iteration, every feature not included in the current working set is temporarily added one at a time; the feature which most improves performance when temporarily added becomes a permanent member, and the algorithm iterates until no feature improves performance further. Based on preliminary results, we decided to choose the top 6 features from each ranked list. We also considered a “no feature selection” strategy that employs all features from the dataset for building the models.
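To make the ranking procedure concrete, the following is a minimal illustrative sketch in Python of how two of the rankers above (S2N and the AUC-based ROC ranker) can score features and how the top N features would be kept. It is not the WEKA-based pipeline used in this chapter; the scikit-learn dependency and the function names are our own assumptions.

# Illustrative sketch (not the WEKA pipeline used in this work): scoring
# features with two of the rankers described above and keeping the top N.
import numpy as np
from sklearn.metrics import roc_auc_score

def s2n_scores(X, y):
    """Signal-to-Noise ratio per feature: |mean_pos - mean_neg| / (std_pos + std_neg)."""
    pos, neg = X[y == 1], X[y == 0]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0) + 1e-12   # avoid division by zero
    return num / den

def auc_scores(X, y):
    """Threshold-based ROC ranker: the AUC of each single feature is its score."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    return np.maximum(aucs, 1.0 - aucs)               # direction-insensitive

def top_n_features(scores, n=6):
    """Indices of the n highest-scoring features."""
    return np.argsort(scores)[::-1][:n]

A classifier would then be trained only on the columns returned by top_n_features() for the fit day and evaluated on each test day, mirroring the protocol used in this chapter.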

6.3.2 Ensemble of Feature Selection Methods for Analyzing Important Features

The purpose of our second analysis is to investigate the application of feature selection in discovering important features for the detection of a specific attack. We use feature selection methods to determine which of the defined features are more important for the detection of RUDY attacks. The selected features reveal the important characteristics of the RUDY attack that are beneficial for its detection, as well as

defining features for the detection of similar attacks. To determine the important features for the detection of the RUDY attack, we applied 10 different feature ranker methods and aggregated their results. We applied three classification methods to build the predictive models. Finally, we applied ANOVA [25] to compare the performance of the predictive models built with the whole feature set to that of the models built with the selected feature set. Our results show that the selected feature set provides performance results very similar to the case where all the features are used in building the predictive models. The selected features represent three main characteristics: the traffic size, the self-similarity between packets and the traffic velocity. The attack studied in this section is the RUDY attack, a slow rate application layer DoS attack. This attack attempts to open a relatively low number of connections to the targeted machine over a period of time and keeps them open as long as possible to keep the machine's resources tied up. Eventually, these open sessions exhaust the targeted machine and make it unavailable to legitimate users. The low and slow traffic associated with this attack makes it difficult for traditional mitigation tools to detect. RUDY attacks exploit a weakness in the HTTP protocol, which was originally designed to provide service to users with very slow rate traffic (such as dial-up users) [89], [91]. RUDY attacks take advantage of the fact that an HTTP POST operation allows the connection to remain open indefinitely in cases where the POST data arrives very slowly, for example one byte per second. The attacker sends a legitimate HTTP POST request with an abnormally long “content length” header field, and then starts injecting the content to the server at a very slow rate (usually one byte at a time). The long content length prevents the server from closing the connection. The information is not only sent in very small chunks, but is also sent at a very slow rate. On the server side, this traffic creates a massive backlog

of application threads, while the server does not close the connections because of the long content length field. The attacker launches simultaneous connections to the server. Since the server hangs while waiting for the rest of these HTTP POST requests, its connection table eventually gets exhausted and the server crashes. We use the SANTA [119] dataset in our analysis. This dataset is collected from an operational network of a commercial Internet Service Provider (ISP). The network data includes a mixture of varying types of internet traffic. The ISP hosts a wide range of server traffic that is accessed by users from across the internet, including web servers, email servers and other internet services that are common to internet providers. The customer networks accessing the internet through the ISP network generate traffic such as email, browsing and all other types of internet traffic that an average business might generate in the course of day-to-day operation. The data is collected from two border routers that connect the ISP network to the outside world; therefore the collected traffic does not include internal traffic (from one internal host to another internal host). Since the collected data only includes the border traffic, the features are defined based on two concepts: inbound and outbound traffic. Inbound traffic targets the network, with packets originating from hosts outside the network. Outbound traffic leaves the network, with the originating hosts inside the network sending packets to the outside world. The two concepts are used to define bi-directional network instances which include both inbound and outbound packets. This allows the extracted features to include inbound and outbound traffic information simultaneously for one network instance. The RUDY attack traffic was produced through penetration testing. The descriptions of the features in the SANTA dataset are shown in Table 6.2. In this study, we apply 10 different feature ranker methods on our dataset. These methods provide 10 different ranking lists of the features. The feature ranking

Table 6.2: Description of features extracted from sessions

Feature Name Description

Protocol Transmission protocol

IO match Whether the inbound Netflow has an associated outbound Netflow record (Boolean)

Duration The elapsed time from the earliest of the associated inbound or outbound Netflow until the end of the later Netflow

Bytes Total size for the session in bytes

Packets Total number of packets in the session

Inbound session convergence Self-similarity of the inbound packets in the session, determined by examining the variance in size of the inbound packets

Outbound session convergence Self-similarity of the outbound packets in the session, determined by examining the variance in size of the outbound packets

Repetition The ratio of the number of most common packet

Periodicity Standard deviation of packet size (bytes/packets) within the session

Inbound velocity pps Velocity of inbound traffic measured in packets per second

Inbound velocity bps Velocity of inbound traffic measured in bits per second

Inbound velocity bpp Velocity of inbound traffic measured in bytes per packet

Outbound velocity pps Velocity of outbound traffic measured in packets per second

Outbound velocity bps Velocity of outbound traffic measured in bits per second

Outbound velocity bpp Velocity of outbound traffic measured in bytes per packet

RIOT packets Ratio of inbound to outbound traffic measured in packets

RIOT Bytes Ratio of inbound to outbound traffic measured in bytes

Flags Cumulative OR of all the TCP flags seen in this session

Class Class label (Attack or Normal) associated with the Netflows within the session

methods used in this study are: F-Measure (F), Geometric Mean (GM), Kolmogorov-Smirnov statistic (KS), Mutual Information (MI), Area Under the Receiver Operating Characteristic Curve (ROC), Fisher Score (FS), Signal to Noise Ratio (S2N), Chi Squared (CS), Information Gain (IG) and Gain Ratio (GR). These can be divided into three groups: threshold-based techniques, first order statistics and techniques commonly used in the literature. Threshold-based feature selection techniques use the feature values as posterior probabilities to estimate the classification error (F, GM, KS, MI and ROC). First order statistics based methods use first order statistical measures such as the mean and standard deviation to measure the relevance of the features (FS and S2N). The remaining methods are techniques commonly used in the literature (CS, IG and GR). By aggregating the feature ranking lists and applying expert analysis, we selected seven features. Once the features were ranked in ten different ranked lists, we counted, for each feature, how often it appears in first place, in second place, and so on among all ten ranking lists. When sorting the features based on how frequently they appear toward the top of the ten ranked lists, we observed natural cut-offs where some features tend to appear at the beginning of the ranked lists and others tend to appear at the end. Based on this information, we determined that the features that appear more than 4 times in the top 7 positions of the ranked lists are of greatest interest. Further investigations could consider more features, but we decided to go with 7 features since we only had 18 features to begin with, and choosing more features would defeat the purpose of feature selection. To build the predictive models, we chose three classification algorithms: K-Nearest Neighbor (K-NN) and two forms of C4.5 decision trees (C4.5D and C4.5N). These learners were all chosen due to their popular use in machine learning applications as well as their relative ease of computation. Using these learners provides a broader

analysis from a data mining point of view. We built all models using the WEKA machine learning toolkit [53].
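The rank-aggregation step described above can be summarized by a short sketch: given the ten ranked lists, count how often each feature lands in the top k positions and keep the features whose count reaches the chosen cut-off. This is an illustrative Python sketch, not the exact procedure used to produce Table 6.5, and the example feature names in the comment are placeholders.

# Minimal sketch of the rank-aggregation step: given ranked lists of feature
# names (best feature first), count appearances in the top k positions and
# keep features whose count is at least min_count (here, "more than 4 times").
from collections import Counter

def aggregate_rankings(ranked_lists, k=7, min_count=5):
    counts = Counter()
    for ranking in ranked_lists:          # one list per ranker
        counts.update(ranking[:k])        # credit every feature in the top k
    return [feature for feature, c in counts.most_common() if c >= min_count]

# Hypothetical usage with two rankings (the study used ten):
# aggregate_rankings([["Bytes", "Packets", "RIOT Bytes"],
#                     ["RIOT Bytes", "Bytes", "Duration"]], k=3, min_count=2)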

6.4 EXPERIMENTAL RESULTS AND ANALYSIS

This section provides our results for the two studied applications of feature selection in the detection of network attacks.

6.4.1 Results for Evaluating Feature Selection Methods for Detection of Network Attacks

In our analysis, we aimed to provide a comprehensive comparison of different feature selection methods for the application of network intrusion detection. We compare four different feature selection methods, and three different classifiers are used to build the classification models along with these feature selection algorithms. For the sake of comparison, we also provide the classification results when the whole feature set is used and no feature selection (NF) is done. We use the Kyoto 2006+ data in our experiments. All the models are built upon the January 1, 2008 data collection day in the Kyoto 2006+ dataset. Considering the combination of four feature selection methods and three different classifiers, overall 12 models are built. Each model is tested on 12 days selected after the training day. Training a model on one day of the data and testing it on subsequent days has two benefits. It demonstrates that the features selected by the feature selection method on one day of the data are not specific to only that day and are useful on later days too. It also shows that a classification model built on former data can be used for intrusion detection on future data. Table 6.3 shows the test evaluation results in terms of AUC values for the different combinations of classifiers and feature selection methods on the 12 test days. The feature selection method that has the highest AUC value for a particular classifier is shown in boldface.

Table 6.3: AUC values for different combinations of feature selection and classification methods.

Test Date    FS    5-NN      C4.5      NB

07/10/2008   NF    0.99820   0.99654   0.98764
             CS    0.98956   0.98520   0.98552
             ROC   0.99909   0.99914   0.98816
             S2N   0.99947   0.99950   0.99649
             CFS   0.96404   0.96381   0.96377

07/25/2008   NF    0.99874   0.99682   0.98129
             CS    0.99926   0.99583   0.99233
             ROC   0.99939   0.99897   0.96832
             S2N   0.99960   0.99964   0.98927
             CFS   0.99225   0.99226   0.99203

08/10/2008   NF    0.99722   0.99745   0.98862
             CS    0.97955   0.99555   0.99225
             ROC   0.99928   0.99885   0.98605
             S2N   0.99943   0.99938   0.99645
             CFS   0.98948   0.98954   0.98943

08/25/2008   NF    0.99088   0.99458   0.98700
             CS    0.99893   0.99660   0.99452
             ROC   0.99403   0.99913   0.98877
             S2N   0.99962   0.99965   0.99692
             CFS   0.99011   0.99002   0.98972

09/10/2008   NF    0.99508   0.99394   0.98659
             CS    0.99458   0.99535   0.98608
             ROC   0.99816   0.99898   0.98451
             S2N   0.99807   0.99958   0.99757
             CFS   0.98301   0.98303   0.98198

09/23/2008   NF    0.99809   0.98971   0.97441
             CS    0.93165   0.98604   0.97542
             ROC   0.99881   0.99830   0.97342
             S2N   0.99943   0.99645   0.99910
             CFS   0.98582   0.98586   0.98486

10/10/2008   NF    0.99715   0.98526   0.97826
             CS    0.99316   0.97713   0.98024
             ROC   0.99922   0.99413   0.98504
             S2N   0.99936   0.99501   0.99646
             CFS   0.97886   0.97874   0.97708

10/25/2008   NF    0.99932   0.99541   0.98350
             CS    0.99042   0.98990   0.97289
             ROC   0.99969   0.99577   0.97895
             S2N   0.99962   0.99660   0.99911
             CFS   0.98507   0.98507   0.98484

11/10/2008   NF    0.99469   0.97940   0.98238
             CS    0.97970   0.98477   0.98543
             ROC   0.99687   0.99354   0.98055
             S2N   0.99578   0.98977   0.99670
             CFS   0.97508   0.97481   0.97370

11/25/2008   NF    0.98143   0.99470   0.94754
             CS    0.99127   0.98993   0.94808
             ROC   0.99432   0.99008   0.95099
             S2N   0.98762   0.99683   0.97578
             CFS   0.97600   0.97588   0.97577

12/10/2008   NF    0.99016   0.99078   0.97426
             CS    0.99750   0.98846   0.98859
             ROC   0.99252   0.98416   0.96760
             S2N   0.99257   0.99292   0.98094
             CFS   0.98856   0.98856   0.98765

12/25/2008   NF    0.96021   0.98547   0.88475
             CS    0.83107   0.99839   0.96755
             ROC   0.99121   0.95580   0.89229
             S2N   0.99633   0.99657   0.93357
             CFS   0.96216   0.96280   0.95246
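As a rough illustration of the fit/test protocol that produced the AUC values above, the sketch below trains one classifier on the fit day and scores it separately on each test day. It is only an approximation of the experimental setup (the actual models were built in WEKA), and the load_day() helper, its arguments and the choice of 5-NN are assumptions.

# Sketch of the per-day fit/test evaluation behind Table 6.3.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

def evaluate_over_days(train_day, test_days, selected_features, load_day):
    """load_day(day, features) is assumed to return (X, y) for that day."""
    X_fit, y_fit = load_day(train_day, selected_features)     # e.g. "2008-01-01"
    model = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
    results = {}
    for day in test_days:                                      # the 12 later days
        X_test, y_test = load_day(day, selected_features)
        scores = model.predict_proba(X_test)[:, 1]
        results[day] = roc_auc_score(y_test, scores)           # one AUC per test day
    return results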

Considering the AUC results for different test days and different classification methods, the most obvious observation is that the S2N feature selection method performs the best. S2N provides the best AUC values when applied with different

Table 6.4: ANOVA Results

Source DF Sum Sq Mean Sq F value Prob >F

Feature Selection 4 0.00426 0.0010652 2.764 0.0292

Residuals 175 0.06746 0.0003855

Figure 6.1: Tukey’s HSD results for feature selection

classifiers on most of the days. Also, its performance is always better than NF for all the test days and all the different classifiers. Since we applied 3 different classifiers for each feature selection method, it is interesting to see for which classifier each feature selection method provides the best results. Considering S2N as the best feature selection method, we can see that the C4.5 classifier has the best performance in 7 out of 12 days and 5-NN has the best performance in 4 out of 12 days. Overall, it can be said that C4.5 performs better than the other

classification methods in combination with S2N feature selection. We recommend using the S2N feature selection method along with C4.5 to obtain a good combination of classification performance and fast analyses. After S2N, ROC has the second best performance among the feature selection methods. When ROC feature selection is applied along with the 5-NN classifier, its performance is always better than NF. Also, when we use C4.5 along with ROC, the performance is better than NF most of the time. All in all, it can be said that for the 5-NN and C4.5 classifiers ROC does well, with performance relatively close to that of S2N. For the NB classifier, the difference between S2N and ROC is larger. CS comes in third, while CFS is clearly the worst feature selection method in these experiments. CS and CFS perform worse than no feature selection in most of the scenarios. The fact that the CS and CFS selection methods perform worse than NF most of the time indicates the importance of the choice of feature selection method. One might decide to use these feature selection methods just because they are popularly used in many machine learning applications. However, our results show that these methods do not perform well for the intrusion detection application on this particular dataset. It is important to select a feature selection method which performs well for the specific domain. We performed a one-way ANOVA [25] test to validate our results and to determine whether any of the differences between the intrusion detection models can be considered statistically significant. ANOVA is a statistical model that is used to analyze the differences between group means in order to determine whether these differences are significant or not. The factor considered in our ANOVA analysis is the choice of feature selection method. We chose a significance level of 5% for this ANOVA analysis; a “Pr(>F)” score of less than 0.05 is considered to be statistically significant. All the statistical analysis is done in R version 3.1.2 [99]. The ANOVA analysis results are shown in

Table 6.4. The results show that the choice of feature selection method is significant. In order to determine which of the choices for the feature selection technique is significantly better or worse than the other choices, we also performed a multiple comparison test with Tukey's Honestly Significant Difference (HSD) test [25]. This test is a multiple comparison procedure which finds the means that are significantly different from each other. Figure 6.1 contains the HSD results for the feature selection methods. This is a letter display figure, in the sense that the feature selection methods that share a common letter are not significantly different. Figure 6.1 shows that S2N is the best among all the feature selection methods, and is significantly better than the CFS method while not significantly better than the ROC, NF (No Feature selection) and CS methods. The main contribution of applying feature selection in intrusion detection applications is to reduce the number of features while the classification results remain the same or do not decrease significantly. Thus, S2N is a good choice in this application, and the statistical analysis confirms our results. Also based on the HSD test, the feature subset selection method (CFS) is the worst and is significantly worse than the S2N method. Overall, it can be said that the feature rankers and no feature selection perform better than the feature subset selection method in our analysis. CFS almost always performs worse than the NF method. This shows that the choice of feature selection in the intrusion detection application is important and should not be taken lightly. Overall, our results demonstrate that feature ranking methods work better than the feature subset evaluation method on this particular dataset (Kyoto). S2N yields the best performance results, with ROC coming after it but with no significant difference. In the future, we want to include more feature selection methods from both the feature ranker and feature subset evaluation categories in our analyses toward further investigation of this conclusion.
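The statistical procedure applied here (one-way ANOVA with feature selection as the factor, followed by Tukey's HSD) was run in R; a hedged Python sketch of the same comparison is shown below for illustration only. The layout of the results dictionary (one list of AUC values per feature selection method) is an assumption.

# Sketch of the ANOVA + Tukey HSD comparison over per-day AUC values.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_feature_selection(results, alpha=0.05):
    """results: dict mapping method name (NF, CS, ROC, S2N, CFS) -> list of AUC values."""
    f_stat, p_value = f_oneway(*results.values())              # one-way ANOVA
    aucs = np.concatenate([np.asarray(v, dtype=float) for v in results.values()])
    groups = np.concatenate([[m] * len(v) for m, v in results.items()])
    hsd = pairwise_tukeyhsd(aucs, groups, alpha=alpha)          # pairwise comparison
    return f_stat, p_value, hsd                                 # significant if p_value < alpha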

6.4.2 Results for Ensemble of Feature Selection Methods for Analyzing Important Features

We selected seven features by using the ensemble method explained in subsection 6.3.2 for the detection of the RUDY attack in the SANTA dataset. By looking at the ranking lists side by side and at the frequency with which different features appear in different positions in the ranked lists, we decided to select the features that appear more than 4 times in the first 7 positions of the ranked lists as the final selected features. The selected features and their frequency in the first 7 positions of the ten ranked lists are shown in Table 6.5. We applied three classification algorithms on the data with the whole feature set and with the 7 selected features. The performance values are obtained through 4 runs of 5-fold cross validation. The cross validation results on the whole feature set and on the selected feature set are shown in Table 6.6 and Table 6.7, respectively. The AUC values above 0.99, together with the high TPR and low FPR values, indicate that the machine learning methods perform very well in the detection of RUDY attacks, whether the whole feature set or the selected feature set is used.

Table 6.5: Selected features by the ensemble of rankers.

Feature Number of occurrences

Outbound session convergence 5

Inbound session convergence 5

Packets 5

Bytes 6

RIOT Bytes 6

Outbound velocity bpp 6

Outbound velocity bps 7

Table 6.6: Cross validation results on the whole feature set

Classifier AUC TPR FPR

C4.5N 0.9988 0.9873 0.000282

C4.5D 0.9940 0.9866 0.000307

5-NN 0.9999 0.9883 0.000316

Table 6.7: Cross validation results on the selected feature set with 7 features

Classifier AUC TPR FPR

C4.5N 0.9983 0.9907 0.00029

C4.5D 0.9996 0.9890 0.00041

5-NN 0.9944 0.9897 0.000265
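For illustration only, the following sketch approximates the "4 runs of 5-fold cross validation" protocol behind Tables 6.6 and 6.7, reporting mean AUC, TPR and FPR for one classifier on a given feature subset. The actual experiments were performed in WEKA, and the 0.5 decision threshold and the choice of 5-NN are assumptions.

# Sketch of repeated stratified cross-validation with AUC, TPR and FPR.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

def cross_validate(X, y, n_splits=5, n_repeats=4, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    aucs, tprs, fprs = [], [], []
    for fit_idx, test_idx in cv.split(X, y):
        model = KNeighborsClassifier(n_neighbors=5).fit(X[fit_idx], y[fit_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
        tn, fp, fn, tp = confusion_matrix(y[test_idx], scores >= 0.5).ravel()
        tprs.append(tp / (tp + fn))
        fprs.append(fp / (fp + tn))
    return np.mean(aucs), np.mean(tprs), np.mean(fprs)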

We applied ANOVA analysis in order to determine whether the selection of feature sets significantly affects the performance. We applied a one-way ANOVA analysis with the factor being whether the whole or the selected feature set is used. We chose a significance level of 5% for this ANOVA analysis, and a “Prob >F” score of less than 0.05 is considered to be statistically significant. The ANOVA results are shown in Table 6.8. The results show that there is no significant difference in the selection of feature sets, which means the selected feature set performs very similarly to the whole feature set in the detection of RUDY attacks. By looking at the selected features shown in Table 6.5, we observe that the features correlate with the RUDY attack behavior. In a RUDY attack scenario, the attacker sends small packets at a slow rate and the server only responds with TCP acknowledgement packets to the incoming slow rate attack packets, which makes the response, i.e., the outbound traffic, have a small number of bytes per packet. This also affects the speed at which the packets are sent out, i.e., the bits per second and bytes per packet. These characteristics can be represented in the Outbound velocity bps and Outbound velocity

Table 6.8: ANOVA Results

Source DF Sum Sq Mean Sq F value Prob >F

Whole/selected feature set 1 6.00e-07 6.190e-07 0.04 0.843

Residuals 118 1.85e-03 1.568e-05

bpp features. On the other hand, the small packet sizes sent by the attacker and the short responses can be represented in the features describing the overall size of a session (Bytes and Packets), as well as in RIOT Bytes, which shows the ratio of inbound to outbound traffic in bytes and thus represents the relative size of inbound to outbound traffic. Another important characteristic of the RUDY attack is the self-similarity between the request packets sent by the attacker and the response packets sent by the server. The two features Inbound session convergence and Outbound session convergence represent the self-similarity between inbound packets and between outbound packets in a session, respectively. Thus, it is expected to see these features among the important features selected for the RUDY attack.
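As a rough sketch of how the selected session features relate to the underlying packet measurements, the function below derives approximations of Bytes, Packets, RIOT Bytes, the convergence features and the outbound velocity features from per-direction packet sizes and the session duration. The exact feature definitions used to build the SANTA dataset may differ, so these formulas should be read as assumptions rather than the dataset's specification.

# Illustrative derivation of RUDY-related session features (assumed formulas).
import numpy as np

def rudy_related_features(in_sizes, out_sizes, duration_s):
    """in_sizes/out_sizes: packet sizes in bytes per direction; duration_s: session length in seconds."""
    in_sizes = np.asarray(in_sizes, dtype=float)
    out_sizes = np.asarray(out_sizes, dtype=float)
    eps = 1e-12                                        # guard against division by zero
    return {
        "Bytes": in_sizes.sum() + out_sizes.sum(),
        "Packets": len(in_sizes) + len(out_sizes),
        "RIOT Bytes": in_sizes.sum() / (out_sizes.sum() + eps),
        "Inbound session convergence": np.var(in_sizes),    # self-similarity via size variance
        "Outbound session convergence": np.var(out_sizes),
        "Outbound velocity bps": 8 * out_sizes.sum() / (duration_s + eps),
        "Outbound velocity bpp": out_sizes.sum() / (len(out_sizes) + eps),
    }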

6.5 CHAPTER SUMMARY

The high volume of data that needs to be processed, as well as high speed networking, challenges the task of intrusion detection in terms of the computational analysis needed. One solution is to use feature selection methods in the intrusion detection algorithms. Feature selection methods reduce the number of features with no or very low performance degradation; the fewer features that need to be analyzed, the faster the intrusion detection can be done. We studied the application of feature selection for the detection of network attacks from two perspectives. We not only studied the main application of feature selection, which is removing redundant features in order to build a more efficient predictive model, but also studied its application to gain more insight about the important features for the detection of a specific attack. Our first analysis provides a guideline for evaluating different feature selection methods for an intrusion detection task. To compare the feature selection methods, we focus on filter-based feature selection because these methods are computationally less expensive than wrapper-based feature selection methods. We applied four different filter-based feature selection methods chosen from the 2 main sub-categories, filter-based rankers and filter-based subset evaluation methods. For comparison purposes, we also provided the results of no feature selection, where the whole feature set is used to build the predictive models. For the feature selection methods, we chose CS, ROC and S2N from the filter-based ranker category and CFS from the filter-based subset evaluation category. We applied three different classification methods along with the feature selection methods to provide a comprehensive analysis. We used 5-NN, C4.5 and Naive Bayes as the classification methods. All the predictive models, resulting from each combination of feature selection and classification algorithms, are built on one day of the Kyoto data. We tested the trained models on 12 days following the day the models were built. We used the Area Under the ROC curve (AUC) as the performance metric. To determine whether the different feature selection methods are significantly different, we applied ANOVA along with Tukey's honestly significant difference test on our fit versus test results. Our results show that, overall, filter-based rankers perform better than the feature subset evaluation method. Among the feature ranker methods, S2N performs the best; ROC comes after S2N but with no significant difference. In our second analysis, we used an ensemble of feature selection methods to investigate the important features for the detection of the RUDY attack in the SANTA dataset. The RUDY attack is an application layer HTTP denial of service attack which exploits the server behavior of supporting users with slower connections. We

used feature selection methods to determine which features are more important for the detection of the RUDY attack at the network level. We used an ensemble feature selection approach including 10 different feature ranking methods to investigate which features are more important for the detection of this attack. Based on our results, the features related to traffic size, self-similarity between traffic packets and traffic velocity are the most important for the detection of RUDY attacks. These features provide very good classification performance for the detection of RUDY attacks which, based on our ANOVA results, is not significantly different from the classification performance using the whole feature set. This information can be used to introduce features for the detection of other types of attacks similar to the RUDY attack. In future work, we intend to investigate feature selection for IDS further by examining additional feature selection methods, both from the filter-based category and from wrapper methods. We also plan to investigate the use of different feature selection methods on our own network traffic collected from a real operational network.

CHAPTER 7
CONCLUSION AND FUTURE WORKS

The Internet has revolutionized our everyday lives by simplifying services such as making reservations, finding jobs, education, socializing, banking and business transactions. While the Internet has brought convenience to our lives, it has also created many opportunities for adversaries to target this huge infrastructure in order to access the information communicated over it or to disrupt its services. Attacks on computer networks are increasing at a worrying pace, which has made network security tremendously important over the past decades. The network traffic must be monitored and analyzed to detect malicious activities and attacks. Since users' information is transmitted throughout computer networks, it is essential to secure the data communications. Recently, machine learning techniques have been applied for the detection of network attacks. Machine learning models are able to extract similarities and patterns in the network data. Unlike signature based methods, there is no need for manual analyses to extract the attack patterns. We can use machine learning algorithms to build mechanisms for the detection of network attacks. This dissertation focuses on the usage of machine learning techniques in the detection of attacks on computer networks.

7.1 CONCLUSIONS

In this research, we considered three common attacks on computer networks and proposed methods for their detection by applying machine learning algorithms. In that regard, we introduced a general methodology, which includes acquiring an in-depth understanding of the attack under study, collecting representative network traffic, building predictive models using machine learning methods and evaluating their performance. Our analysis of each attack included an extensive elaboration of these steps toward its detection. Our conclusions from this research are derived from two main aspects: the first is the conclusions from our data collection, and the second is the conclusions from our analysis of applying machine learning methods for the detection of network attacks.

7.1.1 Data Collection

One important contribution in this research is collecting representative network traffic from a real computer network. This is a critical step in applying machine learning methods for the detection of network attacks. Representative traffic should contain the different scenarios that occur in a real computer network. This includes user activities, such as transferring files, browsing and streaming. The available public datasets for intrusion detection, such as DARPA, ISCX and Kyoto, suffer from limitations, like outdated traffic, simulated traffic that does not reflect real-world network traffic and a lack of representative normal data. In addition to representative normal data, a network attack benchmark should also contain a variety of attacks generated by different tools and methods. Building and evaluating the detection models on representative data provides realistic evaluation results, which can reduce the gap between building detection mechanisms using machine learning methods and their actual deployment in real computer networks. Our study demonstrated how representative network data can be collected for different types of attacks. Our data is collected from a campus network. It contains the typical workload found on a campus network, including the traffic generated by students, faculty and staff. This makes our data representative of real-world computer network traffic. Throughout the data collection, specific instructions were occasionally given to the students by their instructors in order for the generated traffic to include a variety of different activities, such as uploading and downloading files, video streaming, interactive sessions and Internet browsing. This step was taken, along with collecting a good amount of traffic, to make our collected data even more representative of networking scenarios that occur in the wild. We introduced the concept of using normal traffic that is similar to attacks while collecting representative network data. This includes studying the attack behavior and including normal traffic with behavior similar to the attack traffic during the data collection. For SSH brute force attacks, we included failed login traffic which can look similar to the failed logins generated by the attacker. In the case of the Man In The Middle attacks, we generated TCP re-transmission data which can look similar to attack data and result in semi-duplicated packets in the traffic. Including normal traffic with behavior similar to the attack traffic is an important step in reducing false positives. The two traffic patterns can be compared in order to introduce discriminative features that can distinguish the two from one another. This prevents normal traffic that is similar to the attack traffic from being labeled as attacks by the detection models, and consequently greatly reduces such false alarms. Another important aspect when collecting network data is to decide which data source should be collected. We collected network traffic for the SSH brute force and Man In The Middle attacks and web server logs for the HTTP GET flood attacks. The reason for collecting different sources of network data for the detection of distinct attacks is the difference in their behaviors. An SSH brute force attacker needs to send

a lot of failed login traffic before finally obtaining the correct password, so the generated traffic mostly includes failed logins. Such behavior can be observed in network traffic as Netflows with a lower number of bytes and packets, and the high number of such Netflows is another factor that can discriminate brute force traffic from normal traffic. In the case of the Man In The Middle attacks, the forwarding behavior of the attacker generates semi-duplicated packets in the network traffic. On the other hand, the HTTP GET flood attack generates a large number of HTTP GET requests. It can be difficult to distinguish these requests in the network traffic, because they look like normal HTTP GET requests sent by legitimate users. Even the repetitive behavior of this attack, in sending a lot of HTTP GET requests, might not be very helpful in distinguishing the attack traffic, because the generated requests can differ from one another. We concluded that evaluating the web server logs is a better option than network traffic for the detection of these attacks. The web server logs reveal the attackers based on how they access the resources on the web server. It is very hard for an attacker to completely mimic normal user behavior in accessing resources on the web server; therefore, analyzing the web server logs can lead to their detection.

7.1.2 Machine Learning Methods for The Detection of Network Attacks

Our study showed different applications of machine learning methods in the detection of network attacks. This includes building classification and anomaly detection models for the detection of attacks, as well as reducing the number of features to build a more efficient detection model and to gain additional insight about the attack. In Chapter 3, we studied the detection of SSH brute force attacks by building classification models using machine learning algorithms. We used domain knowledge and studied the representative collected SSH data to define discriminative features. Our analysis of the data showed that the Netflow data generated by attackers is very

similar to the Netflow data generated by normal failed login attempts by legitimate users. Therefore, we extracted the features from an aggregation of Netflows in order to represent the repetitive behavior of an attacker in sending login attempts. As our experimental results showed, these features provide good performance in the detection of these attacks. Our study showed that incorporating domain knowledge and analyzing real network traffic to define robust features is an important factor for the detection of network attacks. In Chapter 4, we studied the detection of Man In The Middle attacks. We exploited the forwarding behavior of the attacker, who sends the packets it receives from the victim on to the actual destinations, to detect the traffic generated by the attacker. This forwarding behavior causes the network traffic to contain semi-duplicated packets. We proposed a framework to detect such packets, which is based on comparing packet header fields within a specific time window. We did not build a predictive model for the detection of the attack traffic. However, we benefited from machine learning by using the Greedy Stepwise feature selection algorithm to decide which packet header fields need to be compared to detect the MITM traffic. Using feature selection from the machine learning area not only improved the performance results compared to using all of the TCP and IP fields, it also reduced the number of packet header fields that needed to be compared down to only two fields. This increases the efficiency of the detection method by reducing the computations needed to compare the packet header fields in order to find Man In The Middle traffic. Our study in this chapter demonstrated how algorithms from machine learning can be used to provide better performance and more efficient detection methods for the detection of attacks, even in the absence of a predictive model.

can choose any resource on a web server, in any order, to send a large number of requests to the web server. This makes it infeasible to build a dataset that contains an exhaustive list of the different types of HTTP GET flood attacks for a classification task. An anomaly detection method builds a model of the normal users' behaviors, and if a new behavior does not conform to the model, it is detected as an anomaly (a potential attack). We defined user behavior instances as feature vectors, which represent how the users access resources on a web server. Our results showed that the features provide good performance for the detection of HTTP GET flood attacks when used in two different anomaly detection methods. Our analysis in this chapter showed how anomaly detection methods along with discriminative features can be used for the detection of network attacks. In Chapter 6, we studied the application of feature selection methods from the machine learning domain in the detection of network attacks. Our analysis showed how different feature selection methods can be evaluated to select the best one, i.e., the one that reduces the number of features without adversely affecting the detection model performance. Reducing the number of features produces models which are more efficient due to the decreased computation. We also used feature selection as a tool to gain more insight about an underlying attack. Our analysis showed that applying feature selection methods for the detection of network attacks revealed the features that are most correlated with the attacker's behavior. This provides more insight about the attack and can be used to introduce discriminative features for the detection of similar attacks. Overall, our analysis in this research demonstrates how different algorithms from machine learning can be utilized for detecting attacks at the network level as well as at the host level. We used classification, anomaly detection and feature selection methods to build detection mechanisms for three common attacks in computer networks and to gain more insight about their behaviors. This work gives recommendations

for applying machine learning methods in intrusion detection, which covers both data collection and analyses.

7.1.3 Future Work

Opportunities for future work include:

• Studying other types of attacks common in today's computer networks, such as stealthy Denial of Service attacks, Slowloris and HTTP POST application layer DDoS attacks.

• Comparing performance of classification and anomaly detection methods for the detection of a particular attack.

• Applying Deep Learning methods [92] for the detection of network attacks.

• Incorporating high performance computing capabilities for analyzing large amount of network traffic.

BIBLIOGRAPHY

[1] Arpspoof. http://su2.info/doc/arpspoof.php .

[2] Arpwatch. http://www.linuxcommand.org/man_pages/arpwatch8.html .

[3] Cain and Abel. http://www.oxid.it/cain.html .

[4] Darpa intrusion detection data sets - mit lincoln laboratory. http://www.ll. mit.edu/ideval/data/ .

[5] Ettercap. http://ettercap.github.io/ettercap/ .

[6] Hoic. https://sourceforge.net/projects/highorbitioncannon/ .

[7] Huge increase in brute force attacks in december and what to do. https: //www.wordfence.com/blog/2016/02/wordpress-password-security/ .

[8] Hulk. https://github.com/grafov/hulk .

[9] Kdd cup data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99. html .

[10] Silk. https://tools.netsa.cert.org/silk/ .

[11] Transmission control protocol. RFC 793, RFC Editor, September 1981. https: //tools.ietf.org/html/rfc793 .

[12] Testing intrusion detection systems: A critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory. ACM Trans. Inf. Syst. Secur. , 3(4):262–294, Nov. 2000.

[13] Cisco 2014 annual security report. Technical report, Cisco, 2014. http://www. cisco.com/web/offer/gist_ty2_asset/Cisco_2014_ASR.pdf .

[14] Ponemon 2014 ssh security vulnerability report. Technical report, Venafi, 2014. https://www.venafi.com/assets/pdf/Ponemon_2014_SSH_Security_ Vulnerability_Report.pdf .

[15] Mcafee labs, threats report. Technical report, Intel Security, March 2016. https://www.mcafee.com/us/resources/reports/ rp-quarterly-threats-mar-2016.pdf .

[16] Worldwide infrastructure security report. Technical report, ARBOR Networks, The security division of netscout, 2016. https://www.arbornetworks.com/images/documents/WISR2016_EN_Web.pdf

[17] H. Abdi and L. J. Williams. Principal component analysis. Wiley interdisci- plinary reviews: computational statistics , 2(4):433–459, 2010.

[18] D. Al Abri. Detection of mitm attack in lan environment using payload match- ing. In Industrial Technology (ICIT), 2015 IEEE International Conference on , pages 1857–1862. IEEE, 2015.

[19] M. Al-Hemairy, S. Amin, and Z. Trabelsi. Towards more sophisticated arp spoofing detection/prevention systems in lan networks. In Current Trends in Information Technology (CTIT), 2009 International Conference on the , pages 1–6. IEEE, 2009.

[20] E. Alata, V. Nicomette, M. Kaˆaniche, M. Dacier, and M. Herrb. Lessons learned from the deployment of a high-interaction honeypot. In Dependable Computing Conference, 2006. EDCC’06. Sixth European , pages 39–46. IEEE, 2006.

[21] J. A. Alvarez-Jare˜no,´ E. Badal-Valero, J. M. Pav´ıa, et al. Using machine learn- ing for financial fraud detection in the accounts of companies investigated for money laundering. Technical report, 2017.

[22] E. B. Claise. Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information. Request for Comments: 5101 , RFC Editor, January 2008.

[23] J. M. Beaver, C. T. Symons, and R. E. Gillen. A learning system for discrimi- nating variants of malicious network traffic. In Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop , CSIIRW ’13, pages 23:1–23:4, New York, NY, USA, 2013. ACM.

[24] J. Bennett, S. Lanning, et al. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, NY, USA, 2007.

[25] M. L. Berenson, M. Goldstein, and D. Levine. Intermediate Statistical Methods and Applications: A Computer Package Approach 2nd Edition . Prentice Hall, 1983.

[26] C. M. Bishop. Pattern recognition. Machine Learning , 128:1–58, 2006.

[27] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated cor- pus for learning natural language inference. arXiv preprint arXiv:1508.05326 , 2015.

[28] C. D. Brown and H. T. Davis. Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 80(1):24–38, 2006.

[29] S. Byers, A. D. Rubin, and D. Kormann. Defending against an internet-based attack on the physical world. ACM Transactions on Internet Technology (TOIT), 4(3):239–254, 2004.

[30] C. Calvert, T. M. Khoshgoftaar, C. Kemp, and M. M. Najafabadi. Capturing man in the middle attack traffic on a live network environment. In 22nd ISSAT International Conference on Reliability and Quality in Design , pages 203–209, 2016.

[31] C. Calvert, T. M. Khoshgoftaar, C. Kemp, and M. M. Najafabadi. A framework for capturing http get ddos attacks on a live network environment. In 23rd ISSAT International Conference on Reliability and Quality in Design , 7 pages, 2017.

[32] C. Calvert, T. M. Khoshgoftaar, and M. M. Najafabadi. Survey on selected features used for network level attack detection. In 21st ISSAT International Conference on Reliability and Quality in Design –Special Session on Security Analytics , pages 168–173, 2015.

[33] C. Calvert, T. M. Khoshgoftaar, M. M. Najafabadi, and C. Kemp. A proce- dure for collecting and labeling man-in-the-middle attack traffic. International Journal of Reliability, Quality and Safety Engineering , 24(1):19 pages, 2017.

[34] M. Carnut and J. Gondim. Arp spoofing detection on switched ethernet net- works: A feasibility study. In Proceedings of the 5th Simposio Seguranca em Informatica , 2003.

[35] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR) , 41(3):15, 2009.

[36] R. K. Chang. Defending against flooding-based distributed denial-of-service attacks: a tutorial. IEEE communications magazine , 40(10):42–51, 2002.

[37] Y. Chen and W. Wu. Application of one-class support vector machine to quickly identify multivariate anomalies from geochemical exploration data. Geochem- istry: Exploration, Environment, Analysis , pages geochem2016–024, 2017.

[38] M. Conti, N. Dragoni, and V. Lesyk. A survey of man in the middle attacks. IEEE Communications Surveys & Tutorials , 18(3):2027–2051, 2016.

[39] E. Cooke, F. Jahanian, and D. McPherson. The zombie roundup: Understand- ing, detecting, and disrupting botnets. SRUTI , 5:6–6, 2005.

[40] C. Cortes and V. Vapnik. Support-vector networks. Machine learning , 20(3):273–297, 1995.

[41] P. J. Criscuolo. Distributed denial of service: Trin00, tribe flood network, tribe flood network 2000, and stacheldraht ciac-2319. Technical report, DTIC Document, 2000.

[42] M. Drasar and J. Vykopal. Flow-based brute-force attack detection. In F. Verlag, editor, Advances in IT Early Warning, pages 41–51. Stuttgart, Oxford, 2013.

[43] S. Dua and X. Du. Data mining and machine learning in cybersecurity. CRC press, 2016.

[44] R. Dunia and S. J. Qin. Multi-dimensional fault diagnosis using a subspace approach. In American Control Conference, 1997.

[45] V. Durcekova, L. Schwartz, and N. Shahmehri. Sophisticated denial of service attacks aimed at application layer. In ELEKTRO, 2012, pages 55–60. IEEE, 2012.

[46] H. F. Eid, A. E. Hassanien, T.-h. Kim, and S. Banerjee. Linear correlation-based feature selection for network intrusion detection model. In Advances in Security of Information and Communication Networks, pages 240–248. Springer, 2013.

[47] J. M. Estevez-Tapiador, P. Garcia-Teodoro, and J. E. Diaz-Verdejo. Anomaly detection methods in wired networks: a survey and taxonomy. Computer Communications, 27(16):1569–1584, 2004.

[48] J. M. Estevez-Tapiador, P. García-Teodoro, and J. E. Díaz-Verdejo. Detection of web-based attacks through markovian protocol parsing. In 10th IEEE Symposium on Computers and Communications (ISCC'05), pages 457–462. IEEE, 2005.

[49] R. Fielding, U. Irvine, and J. G. and. Hypertext Transfer Protocol – HTTP/1.1. Request for Comments: 2068 1654, RFC Editor, January 1997.

[50] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, and E. Vázquez. Anomaly-based network intrusion detection: Techniques, systems and challenges. computers & security, 28(1):18–28, 2009.

[51] M. Goldstein and S. Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PloS one, 11(4):e0152173, 2016.

[52] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.

[53] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.

[54] M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.

[55] K. A. Heller, K. M. Svore, A. D. Keromytis, and S. J. Stolfo. One class support vector machines for detecting anomalous windows registry accesses. In Proc. of the workshop on Data Mining for Computer Security, volume 9, 2003.

[56] R. Hofstede, V. Bartos, A. Sperotto, and A. Pras. Towards real-time intrusion detection for netflow and ipfix. In Network and Service Management (CNSM), 2013 9th International Conference on, pages 227–234. IEEE, 2013.

[57] R. Hofstede, L. Hendriks, A. Sperotto, and A. Pras. Ssh compromise detection using netflow/ipfix. Computer Communication Review , pages 20–26, 2014.

[58] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al. A practical guide to support vector classification. 2003.

[59] C.-L. Hwang and A. S. M. Masud. Multiple Objective Decision Making Methods and Applications . Springer Berlin Heidelberg, 1979.

[60] G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning , volume 6. Springer, 2013.

[61] M. Javed and V. Paxson. Detecting stealthy, distributed ssh brute-forcing. In Proceedings of the 2013 ACM SIGSAC Conference on Computer; Communica- tions Security , CCS ’13, pages 85–96, New York, NY, USA, 2013. ACM.

[62] H. G. Kayacik, A. N. Zincir-Heywood, and M. I. Heywood. A hierarchical som-based intrusion detection system. Engineering applications of artificial intelligence, 20(4):439–451, 2007.

[63] T. M. Khoshgoftaar and M. E. Abushadi. Resource-sensitive intrusion detection models for network traffic. In 8th IEEE International Symposium on High- Assurance Systems Engineering (HASE 2004), 25-26 March 2004, Tampa, FL, USA , pages 249–258, 2004.

[64] T. M. Khoshgoftaar, K. Gao, and N. H. Ibrahim. Evaluating indirect and direct classification techniques for network intrusion detection. Intell. Data Anal. , 9(3):309–326, 2005.

[65] S. Kumar, X. Gao, I. Welch, and M. Mansoori. A machine learning based web spam filtering approach. In Advanced Information Networking and Applications (AINA), 2016 IEEE 30th International Conference on , pages 973–980. IEEE, 2016.

[66] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anoma- lies. In ACM SIGCOMM Computer Communication Review , volume 34, pages 219–230. ACM, 2004.

[67] A. Lazarevic, V. Kumar, and J. Srivastava. Intrusion detection: A survey. In Managing Cyber Threats , pages 19–78. Springer, 2005.

[68] S. Lecomte, R. Lengellé, C. Richard, F. Capman, and B. Ravera. Abnormal events detection using unsupervised one-class svm-application to audio surveillance and evaluation. In Advanced Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International Conference on, pages 124–129. IEEE, 2011.

[69] B. Li, J. Springer, G. Bebis, and M. H. Gunes. A survey of network flow applications. Journal of Network and Computer Applications, 36(2):567–581, 2013.

[70] Y. Li, B.-X. Fang, Y. Chen, and L. Guo. A lightweight intrusion detection model based on feature selection and maximum entropy model. In Communication Technology, 2006. ICCT’06. International Conference on , pages 1–4. IEEE, 2006.

[71] Q. Liao and H. Li. Application layer ddos attack detection using cluster with label based on sparse vector decomposition and rhythm matching. Security and Communication Networks , 8(17):3111–3120, 2015.

[72] Q. Liao, H. Li, S. Kang, and C. Liu. Feature extraction and construction of application layer ddos attack based on user behavior. In Control Conference (CCC), 2014 33rd Chinese , pages 5492–5497. IEEE, 2014.

[73] G. Linden, B. Smith, and J. York. Amazon. com recommendations: Item-to- item collaborative filtering. IEEE Internet computing , 7(1):76–80, 2003.

[74] L. M. Manevitz and M. Yousef. One-class svms for document classification. Journal of Machine Learning Research , 2(Dec):139–154, 2001.

[75] C. D. Manning, P. Raghavan, H. Sch¨utze, et al. Introduction to information retrieval , volume 1. Cambridge university press Cambridge, 2008.

[76] J. McHUGH. Testing intrusion detection systems: A critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln lab- oratory. ACM Trans. Inf. Syst. Secur. , 3(4):262–294, Nov. 2000.

[77] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien , 2015. R package version 1.6-7.

[78] J. Mirkovic and P. Reiher. A taxonomy of ddos attack and ddos defense mecha- nisms. ACM SIGCOMM Computer Communication Review , 34(2):39–53, 2004.

[79] T. Miyato, A. M. Dai, and I. Goodfellow. Virtual adversarial training for semi- supervised text classification. stat, 1050:25, 2016.

[80] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp. Detection of ssh brute force attacks using aggregated netflow data. In 2015 IEEE 14th In- ternational Conference on Machine Learning and Applications (ICMLA) , pages 283–288. IEEE, 2015.

[81] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp. Detecting man in the middle traffic using packet header information. In 22nd ISSAT International Conference on Reliability and Quality in Design, pages 197–201, 2016.

[82] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp. Effective selection of packet header fields for detection of man in the middle attacks in lan environments. In 23rd ISSAT International Conference on Reliability and Quality in Design, 8 pages, 2017.

[83] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp. User behavior anomaly detection for application layer ddos attacks. In IEEE IRI. IEEE, 8 pages, 2017.

[84] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp . A text mining approach for anomaly detection in application layer ddos attacks. In FLAIRS Conference , pages 312–217. AAAI, May 2017.

[85] M. M. Najafabadi, T. M. Khoshgoftaar, and C. Kemp. The importance of representative network data on classification models for the detection of specific network attacks. In 21th ISSAT International Conference on Reliability and Quality in Design , pages 59–64, Philadelphia, PA, USA, 2015.

[86] M. M. Najafabadi, T. M. Khoshgoftaar, C. Kemp, N. Seliya, and R. Zuech. Machine learning for detecting brute force attacks at the network level. In Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Confer- ence on , pages 379–385. IEEE, 2014.

[87] M. M. Najafabadi, T. M. Khoshgoftaar, and A. Napolitano. A comparison of feature selection strategies for identifying malicious network sessions. In 21st ISSAT International Conference on Reliability and Quality in Design –Special Session on Security Analytics , pages 161–167, 2015.

[88] M. M. Najafabadi, T. M. Khoshgoftaar, and A. Napolitano. Detecting network attacks based on behavioral commonalities. International Journal of Reliability, Quality and Safety Engineering , 23(1):19 pages, 2016.

[89] M. M. Najafabadi, T. M. Khoshgoftaar, A. Napolitano, and C. Wheelus. Rudy attack: Detection at the network level and its important features. In FLAIRS Conference, pages 282–287. AAAI, 2016.

[90] M. M. Najafabadi, T. M. Khoshgoftaar, and N. Seliya. Evaluating feature selection methods for network intrusion detection with kyoto data. International Journal of Reliability, Quality and Safety Engineering , 23(1):22 pages, 2016.

[91] M. M. Najafabadi, T. M. Khoshgoftaar, and C. Wheelus. Attack commonali- ties: Extracting new features for network intrusion detection. In 21st ISSAT International Conference on Reliability and Quality in Design –Special Session on Security Analytics , pages 46–50, 2015.

[92] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic. Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 21 pages, 2015.

[93] H. T. Nguyen, S. Petrović, and K. Franke. A comparison of feature-selection methods for intrusion detection. In International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, pages 242–255. Springer, 2010.

[94] I.-V. Onut and A. A. Ghorbani. Features vs. attacks: A comprehensive feature selection model for network based intrusion detection systems. In 10th International Conference, ISC 2007, volume 4779, pages 19–36. Springer, 2007.

[95] A. Orebaugh, G. Ramirez, and J. Beale. Wireshark & Ethereal network protocol analyzer toolkit. Syngress, 2006.

[96] J. Owens and J. Matthews. A study of passwords and methods used in brute-force ssh attacks. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2008.

[97] R. Puri. Bots & botnet: An overview. SANS Institute, 3:58, 2003.

[98] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.

[99] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016.

[100] S. Ranjan, R. Swaminathan, M. Uysal, A. Nucci, and E. Knightly. Ddos-shield: Ddos-resilient scheduling to counter application layer attacks. IEEE/ACM Transactions on Networking (TON), 17(1):26–39, 2009.

[101] M. Roesch et al. Snort: Lightweight intrusion detection for networks. In LISA, volume 99, pages 229–238, 1999.

[102] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[103] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31(3):357–374, 2012.

[104] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput. Secur., 31(3):357–374, May 2012.

[105] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[106] J. Singh, G. Kaur, and J. Malhotra. A comprehensive survey of current trends and challenges to mitigate arp attacks. In Electrical, Electronics, Signals, Communication and Optimization (EESCO), 2015 International Conference on, pages 1–6. IEEE, 2015.

[107] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 305–316. IEEE, 2010.

[108] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, and K. Nakao. Statistical analysis of honeypot data and building of kyoto 2006+ dataset for nids evaluation. In Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, BADGERS ’11, pages 29–36, New York, NY, USA, 2011. ACM.

[109] A. H. Sung and S. Mukkamala. Identifying important features for intrusion detection using support vector machines and neural networks. In Applications and the Internet, 2003. Proceedings. 2003 Symposium on , pages 209–216. IEEE, 2003.

[110] J. Touch. Updated Specification of the IPv4 ID Field. RFC 6864, RFC Editor, February 2013.

[111] Z. Trabelsi. Microsoft windows vs. apple mac os x: Resilience against arp cache poisoning attack in a local area network. Information Security Journal: A Global Perspective , pages 1–15, 2016.

[112] V. A. Vallivaara, M. Sailio, and K. Halunen. Detecting man-in-the-middle attacks on non-mobile systems. In Proceedings of the 4th ACM Conference on Data and Application Security and Privacy , CODASPY ’14, pages 131–134, New York, NY, USA, 2014. ACM.

[113] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspec- tives on learning from imbalanced data. In Proceedings of the 24th international conference on Machine learning , pages 935–942. ACM, 2007.

[114] J. Van Hulse, T. M. Khoshgoftaar, A. Napolitano, and R. Wald. Threshold-based feature selection techniques for high-dimensional bioinformatics data. Network Modeling Analysis in Health Informatics and Bioinformatics, 1(1-2):47–61, 2012.

[115] R. Wald, T. Khoshgoftaar, and A. Napolitano. The importance of performance metrics within wrapper feature selection. In Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, pages 105–111. IEEE, 2013.

[116] R. Wald, T. M. Khoshgoftaar, R. Zuech, and A. Napolitano. Network traffic prediction models for near- and long-term predictions. In Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on, pages 362–368. IEEE, 2014.

[117] R. Wald, F. Villanustre, T. M. Khoshgoftaar, R. Zuech, J. Robinson, and E. Muharemagic. Using feature selection and classification to build effective and efficient firewalls. In Proceedings of the 15th IEEE International Conference on Information Reuse and Integration, IRI 2014, Redwood City, CA, USA, August 13-15, 2014, pages 850–854, 2014.

[118] J. Wang, X. Yang, and K. Long. Web ddos detection schemes based on measuring user’s access behavior with large deviation. In Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE, pages 1–5. IEEE, 2011.

[119] C. Wheelus, T. M. Khoshgoftaar, R. Zuech, and M. M. Najafabadi. A session based approach for aggregating network traffic data - the santa dataset. In Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on - workshop on Big Data and data analytics applications, pages 369–378, 2014.

[120] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical machine learning tools and techniques . Morgan Kaufmann, 2016.

[121] Y. Xie and S.-Z. Yu. A novel model for detecting application layer ddos attacks. In Computer and Computational Sciences, 2006. IMSCCS’06. First International Multi-Symposiums on, volume 2, pages 56–63. IEEE, 2006.

[122] Y. Xie and S.-Z. Yu. A large-scale hidden semi-markov model for anomaly detection on user browsing behaviors. IEEE/ACM Transactions on Networking (TON), 17(1):54–65, 2009.

[123] S. Yadav and S. Selvakumar. Detection of application layer ddos attack by modeling user behavior using logistic regression. In Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), 2015 4th International Conference on, pages 1–6. IEEE, 2015.

[124] J. Yang and S. Olafsson. Optimization-based feature selection with adaptive instance sampling. Computers & Operations Research , 33(11):3088–3106, 2006.

[125] W. Yassin, N. I. Udzir, Z. Muda, and M. N. Sulaiman. Anomaly-based intrusion detection through kmeans clustering and naives bayes classification. In Proceedings of the 4th International Conference on Computing and Informatics, ICOCI 2013, pages 298–303, 2013.

[126] C. Ye, K. Zheng, and C. She. Application layer ddos detection using clustering analysis. In Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on , pages 1038–1041. IEEE, 2012.

[127] M. Zamani and M. Movahedi. Machine learning techniques for intrusion detec- tion. arXiv preprint arXiv:1312.2177 , 2013.

[128] S. T. Zargar, J. Joshi, and D. Tipper. A survey of defense mechanisms against distributed denial of service (ddos) flooding attacks. IEEE Communications Surveys & Tutorials , 15(4):2046–2069, 2013.

[129] D. Zhao, I. Traore, B. Sayed, W. Lu, S. Saad, A. Ghorbani, and D. Garant. Botnet detection based on traffic behavior analysis and flow intervals. Comput. Secur., 39:2–16, Nov. 2013.

[130] S. Zhong, T. M. Khoshgoftaar, and N. Seliya. Clustering-based network intrusion detection. International Journal of Reliability, Quality and Safety Engineering, 14(02):169–187, 2007.

[131] R. Zuech, T. M. Khoshgoftaar, N. Seliya, M. M. Najafabadi, and C. Kemp. A new intrusion detection benchmarking system. In FLAIRS Conference , pages 252–256, 2015.
