Ph.D. in Electronic and Computer Engineering Dept. of Electrical and Electronic Engineering University of Cagliari

Host and Network based Anomaly Detectors for HTTP Attacks

Davide Ariu

Advisor: Prof. Giorgio Giacinto Curriculum: ING-INF/05 SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI

XXII Cycle, March 2010


Alla mia famiglia...

Host and Network based Anomaly Detectors for HTTP Attacks

by

Davide Ariu

Abstract

The huge number of people that connect to the Internet every day makes web applications an attractive target for computer criminals. For example, an attack against a web service might be used to quickly spread malware or to steal access credentials from the web service users. To avoid this, the protection of web applications with Intrusion Detection Systems (IDS) is necessary. Unfortunately, protecting web applications is a tricky task, since they are in general large, complex and highly customized. Traditional systems based on signatures are not adequate to guarantee a solid defense, since they are not able to cope with zero-day attacks. Anomaly-based systems represent a valid alternative to signature-based ones, and they also offer protection against zero-day attacks.

In this dissertation we propose several anomaly-based Intrusion Detection Systems for the protection of web applications. We address Intrusion Detection as a Pattern Recognition problem, discussing all the aspects that must be considered in realizing an anomaly-based IDS. The formulation of the problem most suitable for web application security is the "one-class" formulation. This formulation builds a statistical model based on legitimate patterns only, thus ignoring any kind of information about the attacks. With this approach, potentially any attack pattern can be detected, provided that it is statistically anomalous with respect to the patterns within the target (that is, the legitimate) class.

We propose one Host-based and two Network-based IDS. Both Host- and Network-based solutions are effective, even if with different scopes. Host-based IDS are more specific and tailored to protect a particular application. A Network-based IDS offers the great advantage of being able to monitor the traffic toward an entire network segment. In all of the proposed IDS we employ Multiple Classifiers to increase the accuracy of the IDS and to harden it against evasion attempts. All the IDS have been tested on several datasets of attacks and normal traffic, both private and publicly available. Experimental results confirm the effectiveness of the proposed solutions, in terms of both a high detection rate and a small amount of generated false alarms.

Acknowledgements

Understand the things I say, Don’t turn away from me. ’Cause I’ve spent half my life out there, You wouldn’t disagree. D. O’Riordan

At the end of an amazing experience such as pursuing a Ph.D., there is always somebody to thank. I don’t want to escape this unwritten rule, since the last three years have been a school of life even before being an unbelievable professional experience. There has been at least one moment in these three years where each one of the people that I am going to mention has been very important to me. Thus, to thank them I will simply follow a chronological order. The first person I would like to thank is Prof. Giorgio Giacinto, as he gave me the possibility of living this experience. I thank him for his valuable advice, for his helpfulness, and for having been far more than a simple advisor. The second person I would like to thank is Prof. Fabio Roli, for his exhortations to live research with passion and ambition. I thank both Prof. Giacinto and Prof. Roli for giving me the opportunity to join the PRA group and to begin my Ph.D. with an amazing experience at the Georgia Institute of Technology. The period in Atlanta has certainly been one of the hardest periods of my Ph.D., but I will always remember it as one of the most beautiful moments of my entire life. Many people helped me in having such an enjoyable period there. Certainly, without Andrea, Claudio and Roberto (in strict alphabetical order) my American stay wouldn’t have had the same taste. I have much appreciated that nice person which is Jack A. Lang, who gave me his hospitality with courtesy and helpfulness. There are also other people who in that period showed me their affection, even though I was more than six thousand kilometers away. Thanks to everybody. A big thank you to all the PRA people and, above all, to Battista and Luca. I shared with them moments of both hard work and fun, and I spent hours with them waiting at the University cafeteria. Finally, there is my Family. I don’t have enough words to explain how much I love them, how much I am thankful for their kind support and infinite patience, and to say how much I am proud of them. Without them I certainly wouldn’t be who I am, and for this I will be eternally grateful to them.


Ringraziamenti

Understand the things I say, Don’t turn away from me. ’Cause I’ve spent half my life out there, You wouldn’t disagree. D. O’Riordan

Al termine di un esperienza ricca e impegnativa come un Dottorato di Ricerca, c’é sempre qualcuno a cui dover dire grazie. Personalmente, non ho alcuna intenzione di sottrarmi a questa tacita regola, dato che gli ultimi tre anni sono stati per me una scuola di vita prima ancora che una fantastica esperienza professionale. Stabilire una gerarchia e un ordine secondo il quale ringraziare tutte le persone a cui intendo esprimere la mia gratitudine non sarebbe possibile e forse nemmeno giusto, dato che c’é stato almeno un momento lungo questo cammino nel quale ognuna di esse é stata fondamentale. Per non fare torto a nessuno, mi limiteró pertanto a seguire un ordine cronologico. La prima persona a cui devo dire grazie é il Prof. Giorgio Giacinto, il quale mi ha offerto la possibilitá di vivere quest’esperienza. Lo ringrazio per i suoi preziosi consigli, per essersi dimostrato disponibile ogni qual volta io abbia richiesto il suo aiuto, e per essere stato molto di piú di un semplice tutor. Ringrazio il Prof. Fabio Roli, per l’incitamento a vivere con passione e ambizione l’attivitá di ricerca. Un ringraziamento congiunto a Prof. Giacinto e a Prof. Roli, per avermi offerto la possibilitá di entrare a far parte del gruppo PRA e di iniziare il mio dottorato di ricerca con uno straordinario periodo al Georgia Institute of Technology. Il periodo trascorso ad Atlanta é stato senza dubbio uno dei momenti piú faticosi del mio dottorato ma lo ricorderó sempre come uno dei momenti piú belli e intensi della mia vita. Se ha potuto essere quel che é stato lo devo a diverse persone. Tra queste Andrea, Claudio e Roberto (in rigoroso ordine alfabetico) senza la cui presenza il soggiorno negli Stati Uniti non avrebbe potuto avere lo stesso sapore. Un ringraziamento a parte, lo devo a Jack Lang, il quale mi ha aperto le porte di casa propria con grande cortesia e disponibilitá. Ci sono poi tante altre persone, che non elenco esplicitamente, ma che hanno saputo starmi vicino anche a piú di 6000 chilometri di distanza. A tutte, grazie. Un grazie a tutte le persone del gruppo PRA. Tra tutte, con Battista e Luca in particolare ho avuto modo di condividere, oltre che ore di duro lavoro, anche tanti momenti divertenti, tra cui le lunghe ore di fila ai tornelli delle mense universitarie. Infine il ringraziamento piú grande, va senza dubbio alla mia Famiglia. Nonostante credo sappiano ben poco di ció di cui mi occupo, mi hanno sempre incitato e sostenuto, aiutandomi a ritrovare l’equilibrio che ogni tanto avevo smarrito. Non ci sono parole che possano esprimere i miei sentimenti, la mia gratitudine e quanto io sia orgoglioso di ognuno di loro. Se ho potuto raggiungere questo traguardo é anche grazie alla serenitá che mi hanno sempre insegnato ad avere anche di fronte alle difficoltá. E di questo non potró mai essere abbastanza grato.

Contents

1 Introduction
  1.1 Contribution of the thesis
  1.2 Organization

2 An Introduction to Intrusion Detection
  2.1 Taxonomy of Intrusion Detection Systems
  2.2 Taxonomy of Attacks
  2.3 Intrusion Detection Systems Evaluation
    2.3.1 ROC Curves
    2.3.2 More on evaluation metrics

3 Pattern Recognition Algorithms for Anomaly Detection
  3.1 One vs. Multi-class Pattern Classification
    3.1.1 Multi-class Pattern Classification
    3.1.2 One-class Pattern Classification
  3.2 Algorithms for pattern classification
    3.2.1 Hidden Markov Models
    3.2.2 One-Class SVM
  3.3 Multiple Classifier Systems
    3.3.1 Classifier Selection
    3.3.2 Classifier Fusion
    3.3.3 Combining Multiple One-Class SVM Classifiers

4 Web Applications Security: an Host-based solution
  4.1 Web Applications: An overview
  4.2 Attacks against Web Applications
    4.2.1 SQL injection
    4.2.2 Cross Site Scripting - XSS
    4.2.3 Remote File Inclusion - Shellcode Injection
  4.3 State of Art
  4.4 A Host-based IDS: HMM-Web
    4.4.1 Feature Extraction
    4.4.2 Application-specific modules
    4.4.3 Decision module
    4.4.4 HMM building


    4.4.5 Fusion of HMM outputs
  4.5 HMM-Web Evaluation
    4.5.1 Dataset and performance evaluation
    4.5.2 Experimental Results

5 Network Based Intrusion Detection Applications
  5.1 Payload based anomaly detection
    5.1.1 State of Art
    5.1.2 Evading Payload-based IDS
  5.2 McPAD - Multiple classifiers Payload Anomaly Detector
    5.2.1 Feature Extraction
    5.2.2 Dimensionality Reduction
    5.2.3 Payload Classification
    5.2.4 Complexity Analysis
  5.3 McPAD Evaluation
    5.3.1 Experimental Setup
    5.3.2 Datasets
    5.3.3 Experimental Results
    5.3.4 Bayesian Detection Rate
  5.4 HMMPayl - HMM for Payload Analysis
    5.4.1 Feature Extraction
    5.4.2 Pattern Analysis
    5.4.3 Classification
  5.5 HMMPayl Evaluation
    5.5.1 Experimental Setup
    5.5.2 Datasets
    5.5.3 Evaluation Metrics
  5.6 Experimental Results
    5.6.1 Shell-code and CLET dataset
    5.6.2 Generic Dataset
    5.6.3 XSS and SQL-Injection Attacks
    5.6.4 Sequences Sampling
    5.6.5 Ideal Selector
    5.6.6 Performance evaluation in terms of Detection Rate at fixed values of False Alarm rate
    5.6.7 Single Classifiers Performance
    5.6.8 Analysis of the Computational Cost

6 Conclusions and Future works
  6.1 Conclusions
  6.2 Future Works

Bibliography

List of Figures

1.1 The growth of Internet from 1995 until 2009.
1.2 The Web 2.0 Mind-Map by Markus Angermeier.
1.3 Percentage of Vulnerability Disclosures that Affect Web Applications, 2009 H1.

3.1 A possible representation of the problem of Intrusion Detection as a two-class problem. In this case the two distributions of patterns are perfectly separable with a linear classifier.
3.2 A possible representation of the problem of Intrusion Detection as a two-class problem. The two classes are separated with a K-Nearest Neighbor classifier. A more detailed view of the region within the rectangle is proposed in Figure 3.3.
3.3 A detail of Figure 3.2. The presence of several points strictly close to the decision boundary indicates that the system is neither resilient to attempts of evasion nor robust with respect to variations of the normal patterns.
3.4 Posterior probabilities distribution for a two-class one-dimensional problem.
3.5 A possible representation of the problem of Intrusion Detection as a one-class problem. A closed surface is drawn around the distribution of normal patterns which leaves outside a certain fraction of rejected samples.
3.6 A possible representation of the problem of Intrusion Detection as a one-class problem. A closed surface is drawn around the distribution of normal patterns which leaves outside a certain fraction of rejected samples.
3.7 Three fundamental reasons why an ensemble may work better than a single classifier [29].
3.8 A general schema of a MCS based on HMM.
3.9 An example of ideal score selector with two classifiers on a real dataset. The distributions of the output values resulting from the “ideal selection” exhibit a larger separability than the original ones.

4.1 A web application example. The web browser requests the page search.php and provides two inputs: the attributes cat and key with associated values. The web server processes the request and sends the output back to the client.
4.2 IDS scheme. The Parser processes the request URI and identifies the web application (i.e. search.php) and its input query. Applying a threshold on the probability value associated to the codified query, it is labeled as legitimate/anomalous. The threshold depends on the web application probability and the α parameter.
4.3 Distribution of queries and percentage of attacks for the 14 most frequent web applications.


4.4 Average Detection Rate and False Positive Rate for different values of α and a single HMM per ensemble. Our approach for query codification (on the left) outperforms the solution proposed in [58].
4.5 Average Detection Rate and False Positive Rate for different values of α either with single or multiple HMM per ensemble.

5.1 A Network IDS Sensor.
5.2 An example of legitimate HTTP payload.
5.3 Long Request Buffer Overflow attack. Bugtraq: 5136.
5.4 URL decoding error attack. Microsoft: MS01-020.
5.5 Overview of McPAD.
5.6 Average relative mutual information for varying ν (computed on GATECH dataset).
5.7 Generic Attacks detection - AUCp obtained for different values of the number of the feature clusters k, and number of combined classifiers m. The combination rule used was minimum probability.
5.8 Shell-code Attacks detection - AUCp obtained for different values of the number of the feature clusters k, and number of combined classifiers m. The combination rule used was minimum probability.
5.9 PAYL - ROC curves for Generic, Shell-code, and CLET attacks.
5.10 McPAD - ROC curves for Generic, Shell-code, and CLET attacks.
5.11 PAYL - ROC curves for Code-Red PBA attacks (the string “cred” in the legend stands for “code-red”).
5.12 McPAD - ROC curves for Code-Red PBA attacks (the string “cred” in the legend stands for “code-red”).
5.13 PAYL - ROC curves for DDK PBA attacks.
5.14 McPAD - ROC curves for DDK PBA attacks.
5.15 PAYL - ROC curves for WMS PBA attacks.
5.16 McPAD - ROC curves for WMS PBA attacks.
5.17 McPAD - ROC curves for Generic, Shell-code, and CLET attacks. The ROC refers to classification results obtained by combining 3 classifiers chosen at random among 11 one-class classifiers.
5.18 McPAD - ROC curves for Code-Red PBA attacks. The ROC refers to classification results obtained by combining 3 classifiers chosen at random among 11 one-class classifiers.
5.19 A simplified scheme of HMMPayl.
5.20 Distribution of bytes from normal traffic (green) and from Shell-code attacks (red).
5.21 AUCp values for the Generic Attacks Dataset. The AUCp increases with the length n of sequences extracted from the payload.
5.22 Values of AUCp for the XSS and SQL Injection Attacks. The AUCp increases with the length n of sequences extracted from the payload.
5.23 Performance of HMMPayl in terms of AUCp when a subset of sequences is randomly chosen. The sampling varies between 20% and 100%.
5.24 Comparison of the AUCp attained by the Ideal Selector with that attained by the Minimum Rule. GT and DARPA datasets, with Generic and XSS-SQL attacks.

6.1 A possible scheme of a comprehensive protection mechanism including McPAD, HMMPayl and HMM-Web.

List of Tables

2.1 Pros and Cons of Host-Based Intrusion Detection Systems.
2.2 Pros and Cons of Network-Based Intrusion Detection Systems.

4.1 Principal characteristics of dataset D. Columns contain respectively: the number of queries, the duration of the collection period, the number of applications for administration (Admin) and public (Pub) services.
4.2 References for attacks inside XSS-SQL Dataset. Attacks are taken from www.. com. For each attack the number identifying the exploit and that of the paper where the vulnerability is described are provided.

5.1 Summary of the parameter settings used in the experiments with McPAD.
5.2 DARPA Dataset Characteristics.
5.3 GATECH dataset characteristics.
5.4 ATTACKS dataset characteristics.
5.5 DARPA dataset - summary of AUCp results computed over different values of k.
5.6 GATECH dataset - summary of AUCp results computed over different values of k.
5.7 DARPA dataset - Average of AUCp results for different values of k.
5.8 GATECH dataset - Average of AUCp results for different values of k.
5.9 PAYL’s average processing time per payload. The number between parenthesis represents the number of payloads in the test dataset.
5.10 McPAD’s average processing time per payload. The number between parenthesis in the first column represents the number of payloads in the test dataset.
5.11 Number of packets and size (MB) of traces of normal traffic.
5.12 Resume of the values of AUCp for Shell-code and CLET attacks.
5.13 Detection Rate at False Positive rate = 0.01 and 0.001.
5.14 Average AUCp achieved by single classifiers on GT dataset. AUCp are averaged both on single HMM and on sequences length n.
5.15 Workstation Specifics.
5.16 HMMPayl’s average processing time per payload. The value between brackets represents the number of payloads in the test dataset. The sampling ratio indicates the percentage of sequences sampled from each payload with the randomization strategy. HMMPayl uses m=5 HMM.


Chapter 1

Introduction

A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratliff

When in 1990 Tim Berners-Lee created the HyperText Transfer Protocol (HTTP) and the language of the WWW (HTML), he certainly could not have foreseen what the Internet would become in the following years [14, 15]. That tool, born to support the information requirements of research into high-energy physics, was destined to become one of the most important human inventions of recent history. In fact, the Internet changed the everyday life of billions of people just as the radio, the television and the telephone did. And in many cases, it has also replaced them. In that period the effects of the revolution produced by the Internet were definitely unpredictable even for insiders. Just consider that the first release of Microsoft Windows 95, which appeared in August 1995, was shipped without Internet Explorer. Even worse, the default network installation did not install TCP/IP, the network protocol used on the Internet. The first release of Windows 95 that bundled Internet Explorer with the OS appeared only six months later, in 1996. It was also completely impossible to predict such an exceptional growth of the Internet within fifteen years. As Figure 1.1 shows, the number of hosts connected to the network grew by a factor of 17,000 from 1995 to 2009.

Figure 1.1: The growth of Internet from 1995 until 2009

Consider that Google, which was born in 1996 as a research project at Stanford University, is today a company with more than 20,000 employees located around the world and with quarterly revenues of more than 6,600 million dollars. Furthermore, the Internet has been the main vehicle of the globalization phenomenon that has swept over the world in the last twenty years, and in particular in the last ten.

2004 was a crucial moment in Internet history, since the advent of new technologies completely changed the way of thinking about the Internet. The birth of "Web 2.0" is in fact dated 2004. A concise definition of Web 2.0 is the one provided by Tim O'Reilly during a conference at O'Reilly Media:

Web 2.0 is the business revolution in the computer industry caused by the move to the internet as platform, and an attempt to understand the rules for success on that new platform. Chief among those rules is this: build applications that harness network effects to get better the more people use them.

Figure 1.2 presents an interesting representation of the “Web 2.0 Mind-Map” proposed by Markus Angermeier.

Figure 1.2: The Web 2.0 Mind-Map by Markus Angermeier

The aim of this figure is to show that with "Web 2.0" the approach to the Internet has completely changed: everybody is both the creator and the end-user of contents. Social Network platforms such as Facebook, Twitter or LinkedIn are just the final products of a giant revolution started several years ago. It is obvious that a phenomenon such as the Internet has also had an enormous impact on the worldwide economy. Consider home banking services or on-line stores: every day they move a massive amount of money across the network. In addition, there is another aspect which is not so well known but which is equally important: the underground economy of cyber crime. Within this economy cyber criminals sell not only credit card numbers or bank account credentials, but also full governmental identities (such as passports or driver's licenses). After 9/11, all governments enforced anti-terrorism checks, so that authentic governmental identities are particularly desired by criminals. For example, the free movement of European citizens within the European Union countries makes European identities extremely attractive for terrorists. Since national and local governments offer citizens more and more on-line services, the sensitive data that flow every day through the network are an appealing target for cyber criminals. Thus, there is indeed a need for protection mechanisms that guarantee the security of data on the Internet, for reasons of both economics and social security. Nevertheless, computer security is a problem that has been addressed since the early seventies. At that time the main problem was that of preventing unauthorized access to systems. In the eighties the first computer viruses appeared; they remained the main threat against computer networks until the first years of the third millennium. Two pioneering works in Computer Security are certainly that of Anderson in 1980 [8] and that of Dorothy Denning [26]. The second one in particular is generally recognized as the first attempt at providing a formulation of the problem of detecting intrusion attempts in a computer network. Since then, plenty of solutions for the protection of computer networks have been proposed. Unfortunately, the spread of web applications has made protecting computer networks an even more challenging task, since very often the architecture of web applications makes them particularly exposed to attacks. Figure 1.3, taken from the results of a recent study by the X-Force team, shows that more than 50% of the vulnerabilities disclosed during the first half of 2009 affected web applications [48].

Figure 1.3: Percentage of Vulnerability Disclosures that Affect Web Applications, 2009 H1

As a consequence, the security of web applications is a key topic in information security.

1.1 Contribution of the thesis

In this thesis we address the problem of protecting web applications through the analysis of HTTP requests, proposing several solutions for Host- and Network-based protection. The problem is tackled as a statistical Pattern Recognition problem or, more precisely, as a problem of "novelty" or "anomaly" detection. During a learning phase, a statistical model is created starting from a dataset of patterns belonging to the target class. Attack patterns are detected if they statistically deviate from the model created during the learning phase. Compared to traditional signature-based systems, this approach makes it possible to cope with "zero-day" (that is, unknown) attacks.

Our main contributions relate to Network-based IDS, and concern in particular systems based on the analysis of the application-layer payload. We propose two different solutions, called McPAD and HMMPayl respectively. The first one is able to detect attacks even at very low false positive rates. This is an important result, because being able to maintain a high detection rate with very few false positives greatly improves the Bayesian detection rate, i.e., the probability that an intrusion actually occurred given that the IDS issued an alarm. The second one, HMMPayl, is an IDS based on Hidden Markov Models that allows for a very accurate model of the payload. Thus, HMMPayl is particularly effective against those categories of attacks that do not significantly alter the payload statistics.

Another contribution of this work relates to Host-based IDS. Here we propose HMM-Web, an IDS able to detect attacks against web applications that exploit flaws in the validation of the input provided to the web application. The experimental results confirm that HMM-Web attains both a high detection rate and a small false positive rate.

Finally, all the IDS we propose are based on Multiple Classifier Systems. In Intrusion Detection, multiple classifiers are used to increase the overall accuracy of the IDS, but also to harden the system against evasion attempts. This is also our case, since the experimental results confirm that using multiple classifiers improves the systems from both these points of view.

1.2 Organization

The rest of the dissertation is organized as follows.
Chapter 2 presents the required background on Intrusion Detection Systems, also providing an overview of the metrics most frequently used to evaluate them.
Chapter 3 provides the necessary background on Pattern Recognition. In particular, the issues to be addressed in designing an Intrusion Detection System as a Pattern Recognition system are discussed. Further, a brief description of the statistical models that will be used in the following chapters is provided. Finally, fundamentals of Multiple Classifier Systems are given.
Chapter 4 focuses on Host-based Intrusion Detection Systems for Web Application Security. First we provide a little background on web applications and briefly describe the most frequent attacks against them. After that, the state of the art on Host-based IDS in the field of web application security is presented. Next we describe the architecture of HMM-Web, a Host-based Intrusion Detection System that protects web applications. Then we present the results of the experiments on HMM-Web.
Chapter 5 focuses on Network-based Intrusion Detection Systems, and particularly on systems that analyze the application-layer payload. Initially the state of the art on Network-based IDS is provided. After that, several strategies for evading Network-based IDS based on payload analysis are presented. Then we present two IDS, called McPAD and HMMPayl respectively, describing for each one its architecture, the experimental setup on which it has been evaluated, and the results achieved.
Finally, Chapter 6 concludes the dissertation and presents some future research directions that can extend our work.

Chapter 2

An Introduction to Intrusion Detection

Anti-virus software does nothing to stop a worm, Firewalls can’t stop a worm. We’re preaching those burglar alarm systems with IDS capability and vulnerability scanning. Christopher Klaus - Founder of ISS

In this chapter we introduce the problem of computer security and Intrusion Detection. Let us consider a toy example that will help us introduce some definitions and problems that are very common in computer security. Suppose that Bob wants to build his own "malware detector", that is, a piece of software able to recognize malicious software such as viruses, worms, trojans and so on. Assume also that Bob has got a copy of the executable code of a particular malware and that he would like to use it to test his detector. What he wants to do is to scan his laptop with the detector to verify whether the machine has been infected by that malware. If he were not an expert in computer security, he would probably write a program that compares the files on the hard drive with the executable code of the malware. This is intuitive and somewhat reasonable. Luckily, Bob discovered that considering the whole code of the malware is not necessary, since it is enough to consider certain pieces of it. That is great. Therefore Bob extracts the sequences of bytes that characterize the malware: usually, these sequences are called signatures. At this point Bob runs the detector, which matches all the files on the hard drive against the signatures. At the end of the scan no malware has been found on the hard disk. Assuming that the malware is indeed on the hard disk, and that both the matching routine and the signatures are correct, where is the problem?
In computer security the approach based on signatures has been widely used for decades, and it is basically still in use. Traditionally, systems that rely on signatures are known as signature- or misuse-based, since the detection of attacks depends on a database of signatures that describe known attacks. We can consider as an attack anything that can cause damage inside a computer network; thus the malware in our example is also an attack.
Nowadays the effectiveness of misuse-based systems is seriously limited by the appearance every day of a massive amount of new attacks. In addition, several versions of the same attack might exist: as in the case of Bob, a signature against a certain attack might not be effective against all of its variants. Let us consider a few figures to get an idea of the significance of the problem.

In its annual report for 2008, Panda Security claims to have identified an average of 35,000 malware samples per day, 22,000 of which turned out to be new infections [74]. The problem of the large variety of attacks involves not only malware but computer security in general. Consider for example web applications. Every day, everybody accesses their home-banking account, webmail or social networking platforms. The popularity they have gained in the last ten years has made web applications an attractive target for attackers. According to the X-Force mid-year report for 2009 [48], more than 50% of the software vulnerabilities discovered in 2009 affected web applications: that is more than 17,000 vulnerabilities. Obviously, this number does not include custom-developed web applications or customized versions of standard packages, which also introduce vulnerabilities.
Writing signatures against attacks requires time, and obviously signatures can be extracted only after an attack has appeared. This means that a misuse-based system cannot stop "zero-day attacks", that is, attacks never seen before. As a consequence, nowadays misuse-based systems alone cannot guarantee the security of computer networks. In the most innovative systems, misuse-based modules are integrated with others based on "anomaly-detection" engines. Anomaly detectors are based on Pattern Recognition algorithms and create a statistical model of the resource that is being protected. If something is detected that deviates from the statistical model stored for the resource, an alarm is raised. Among all the tools that are available today against the army of hackers and viruses that threaten the security of computer networks, Intrusion Detection Systems play a crucial role, especially as far as enterprise security is concerned.
In this chapter we provide a brief introduction to Intrusion Detection Systems, describing how they can be categorized, attacked and evaluated. In particular, we provide a taxonomy of Intrusion Detection Systems in Section 2.1. In Section 2.2 we describe how the attacks against IDS can be categorized; in the same section we also give a categorization of attacks against anomaly-based IDS. Finally, we describe the metrics most frequently used to evaluate IDS in Section 2.3.
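To make the signature-matching idea from Bob's example concrete, the following minimal Python sketch (illustrative only, not taken from this thesis; the signature names and byte strings are invented) scans a directory tree and reports the files that contain any of the stored byte-sequence signatures. Its limitation is exactly the one discussed above: any sample whose bytes do not contain a stored signature, including a slightly modified variant of a known malware, goes undetected.

import os

# Hypothetical signatures: short byte sequences assumed to characterize two
# fictional malware samples. Real signature databases are far larger.
SIGNATURES = {
    "toy-worm-a": bytes.fromhex("deadbeef4f70656e"),
    "toy-trojan-b": b"\x4d\x5a\x90\x00evil_payload",
}

def scan_file(path, signatures=SIGNATURES):
    """Return the names of all signatures found inside the file at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    return [name for name, sig in signatures.items() if sig in data]

def scan_tree(root):
    """Scan every regular file under `root` and report signature matches."""
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                matches = scan_file(path)
            except OSError:
                continue  # unreadable file: skip it
            if matches:
                print(f"{path}: matched {', '.join(matches)}")

scan_tree("/tmp")  # example target directory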

2.1 Taxonomy of Intrusion Detection Systems

An Intrusion Detection System (IDS) is a tool which aims at:

• Detecting intrusion attempts inside a network of computers.

• Providing forensic information that allows organizations to discover the origins of an attack. This also makes it possible to reinforce systems and make them more resilient to future attacks.

Based on the source of the data being audited, IDS are usually classified as Host- or Network-based. In particular:

• Host-based Intrusion Detection Systems (HIDS) monitor the host where the sensor is installed. They can monitor the activity of the Operating System or that of a particular application. A host-based IDS can monitor several parameters, such as the CPU or memory usage. A host sensor protecting a web server might monitor the logs produced by the HTTP server software, looking for anomalous HTTP request patterns. This type of IDS is good at discerning attacks that are initiated by local users and which involve misuse of the capabilities of one system [78].

The main drawback of HIDS is that they are insulated from network events that occur at a low level (because they only interpret high-level logging information) [78]. In addition, they must be installed on every single host and can also degrade the system performance. Finally, since there are no clear standards on the type of information that the operating system should provide to the monitor, it is usually very hard to port HIDS from one platform to another.

Table 2.1: Pros and Cons of Host-Based Intrusion Detection Systems.

Host-based IDS

PROS:
• Can detect attacks that do not involve the network
• Can analyze what an application is doing
• Do not require additional hardware

CONS:
• Must be installed on every single host
• Degrade the system performance
• Only interpret high-level logging information

• Network-based Intrusion Detection Systems (NIDS). Network intrusion detection systems are driven by the interpretation of raw network traffic. A network-based IDS analyzes the traffic transmitted over a network segment, so it can be used to protect a single machine or an entire network. Data used by a network IDS include packet header data, packet statistics, and application-layer data. Compared to HIDS, network-based IDS offer the great advantage that a single system can be used to monitor an entire network, without the need to install a dedicated software sensor on each host. NIDS are good at discerning attacks that involve low-level manipulation of the network and attacks that alter the network activity. An example might be that of a worm that infects a machine and scans other machines in the network looking for possible victims. The detection of the attack at the network level can be particularly useful to stop the attack before all the machines in the network are compromised. Furthermore, NIDS can easily correlate attacks against multiple hosts in the network. A hard constraint for NIDS is that they must be very fast, in order to keep up with the speed of the network. In addition, the IDS might have problems if the traffic flows through an encrypted channel. Obviously, NIDS are bad at determining exactly what is occurring on a single host.

Table 2.2: Pros and Cons of Network-Based Intrusion Detection Systems.

Network-based IDS

PROS:
• Can monitor multiple hosts at the same time
• Can correlate attacks against multiple hosts
• Do not affect host performance
• Can detect attacks that are not visible from single hosts

CONS:
• Must be able to keep up with the network speed
• May have problems with encrypted channels

On the basis of the detection mechanism, both Host-based and Network-based IDS can be further classified as:

• Misuse-based Intrusion Detection Systems. Misuse-based IDS rely on a database of signatures that describe known attacks. Thus, an attack is detected only if it matches at least one signature. For this reason, misuse-based IDS are unable to face "zero-day" (that is, never seen before) attacks: since zero-day attacks are unknown, no signature can exist in the IDS database to stop them. For instance, a misuse-based IDS cannot stop the spreading of a new worm. The ineffectiveness against zero-day attacks is a severe limitation for misuse-based systems, because a number of new attacks appear every day. Since a certain period of time elapses between the moment an attack is detected for the first time and the moment the signature is ready, the system remains exposed to the attack for all that period. A great advantage that misuse-based systems usually offer is a very low rate of false alarms.

• Anomaly-based Intrusion Detection Systems. In anomaly-based IDS, a statistical model of the normal behavior of the resource to be protected is created. We can define the normal behavior of a resource as "a set of characteristics that are observed during its normal operation". Attacks are detected because they produce an "anomalous" (that is, statistically different) behavior with respect to that model (a minimal sketch of this idea is given below). This approach can be applied to monitor a variety of resources, such as the network traffic in a network segment, a particular application, or the sequences of syscalls on a particular host. In the past, the popularity of anomaly-based IDS has been limited by the high rate of false alarms they generate. This might still be true in those cases where the resource to be protected changes frequently: if the statistical model is not able to take these variations into account, the system obviously raises a large number of false alarms. Despite this, nowadays several commercial products based on anomaly detection mechanisms exist [21, 34, 49].
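As a minimal illustration of the anomaly-detection principle (a deliberately simplified sketch, not one of the systems proposed in this dissertation; the parameter values are invented), the following code learns the typical length of a web application parameter from legitimate requests only, and raises an alarm when a new value deviates too much from that model. Overly long values are typical, for instance, of injected SQL statements or shellcode.

import statistics

class LengthAnomalyDetector:
    """Toy anomaly detector: models the length of a request parameter."""

    def fit(self, normal_values):
        lengths = [len(v) for v in normal_values]
        self.mean = statistics.mean(lengths)
        self.std = statistics.pstdev(lengths) or 1.0
        return self

    def is_anomalous(self, value, k=3.0):
        # alarm when the length deviates more than k standard deviations
        # from what was observed during normal operation
        return abs(len(value) - self.mean) > k * self.std

# Values observed for the "key" parameter of a search page during normal use.
normal = ["jazz", "networks", "databases", "pattern recognition", "hmm", "svm"]
detector = LengthAnomalyDetector().fit(normal)

print(detector.is_anomalous("intrusion detection"))                         # False: similar length
print(detector.is_anomalous("' UNION SELECT password FROM users -- " * 5))  # True: far too long

A real anomaly detector combines many such statistical models and, as discussed above, must be retrained (or must adapt) when the normal behavior of the protected resource changes, otherwise the false alarm rate grows.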

2.2 Taxonomy of Attacks

A great variety of attacks against computer systems exists nowadays. Without entering into the details of the different attack mechanisms, we provide here a taxonomy of the attacks against a computer system. It was proposed by Anderson in 1980 but, in spite of its age, it is still valid [8]. This taxonomy is particularly useful since understanding the effects of different threats makes it possible to understand how each threat may manifest itself in audit data. According to Anderson, threats are categorized as follows:

• Risk: Accidental or unpredictable exposure of information, or violation of operations integrity due to malfunction of hardware or incomplete or incorrect software design.

• Vulnerability: A known or suspected flaw in the hardware or software design or operation of a system that exposes the system to penetration or its information to accidental disclosure.

• Threat: The potential possibility of a deliberate unauthorized attempt to:

1. Access Information
2. Manipulate Information
3. Render a system unreliable or unusable

• Attack: A specific formulation or execution of a plan to carry out a threat.

• Penetration: A successful attack; the ability to obtain unauthorized (undetected) access to files and programs or the control state of a computer system.

For the sake of the following discussion, the definitions of Vulnerability and Attack are particularly important: in fact, we will assume that an attack always takes place by exploiting a vulnerability of a certain application. This classification is general and is valid for Network- and Host-based IDS as well as for Misuse- and Anomaly-based systems. More recently, a different taxonomy has been proposed by Barreno et al. [10]. Unlike Anderson, Barreno proposes a taxonomy focused on anomaly-based systems. As we will discuss in the following, the training phase is crucial for an anomaly-based system, because during the training phase the parameters of all the statistical models used by the IDS are estimated. If an attacker is able to attack the IDS during the training phase, he can compromise the statistical model and make the IDS completely useless. The classification proposed by Barreno is particularly focused on this issue. The categorization divides the attacks based on their:

• Influence:

– Causative - Causative attacks alter the training process through influence over the training data.
– Exploratory - Exploratory attacks do not alter the training process but use other techniques, such as probing the learner or offline analysis, to discover information.

• Specificity:

– Targeted - The specificity of an attack is a continuous spectrum. At the targeted end, the focus of the attack is on a particular point or a small set of points.
– Indiscriminate - At the indiscriminate end, the adversary has a more flexible goal that involves a very general class of points, such as "any false negative".

• Security Violation:

– Integrity - An integrity attack results in intrusion points being classified as normal (false negatives).
– Availability - An availability attack is a broader class of attack than an integrity attack. An availability attack results in so many classification errors, both false positives and false negatives, that the system becomes effectively unusable.

Attacks are thus classified on the basis of their Influence, Specificity and Security Violation. The Influence takes into account what the aim of the attack is: the attacker may aim at compromising the system (causative attack) or just at inferring some information about it (exploratory attack). The Specificity discriminates between targeted and indiscriminate attacks: in the first case the attacker pursues a particular strategy to evade the IDS, whereas in the second he just wants to evade it. Finally, the Security Violation considers whether, after the attack, the IDS is completely unusable (its availability is compromised) or whether it is just prone to a particular attack (its integrity with respect to that attack is compromised).

2.3 Intrusion Detection Systems Evaluation

The choice of good evaluation metrics is crucial to attain an accurate and reliable assessment of the detection capabilities of an IDS. Regardless of the details of a particular IDS, any Intrusion Detection System can be modeled as a black box which receives data from a certain source and has to decide whether an attack is occurring or not. The source of data might be the network, a log file or something else. This source generates objects that, for instance, might be single packets in the case of the network, or different entries if the source is a log file. The IDS has to decide whether each one of these objects is an attack or not. In the case of misuse-based IDS the decision is made on the basis of a signature database, whereas in the case of anomaly-based IDS the decision is based on a statistical model. What is important is that at the end of the analysis the IDS labels each object either as "normal" (that is, non-intrusive) or as an "attack". The first step in evaluating an IDS is to assess how many objects are labeled correctly and how many are not. More formally, since the class of attacks is usually known as the positive class and the normal class as the negative one, we can give the following definitions [35]:

• A true positive is an instance of the positive class which is labeled as positive. A true positive consists of an attack which is detected by the IDS.

• A false positive is an instance of the negative class which is labeled as positive. A false positive consists of a normal object which is wrongly labeled as an attack.

• A true negative is an instance of the negative class which is labeled as negative. A true negative consists of a normal object which is correctly labeled as normal.

• A false negative is an instance of the positive class which is labeled as negative. A false negative consists of an attack which is not detected by the IDS and consequently evades it.

Starting from these definitions, an Intrusion Detection System can be evaluated in terms of:

\[
\text{False Positive Rate} = \frac{\text{Negatives incorrectly classified}}{\text{Total Negatives}} \tag{2.1}
\]

\[
\text{Detection Rate} = 1 - \text{False Negative Rate} \tag{2.2}
\]

where the False Negative Rate is defined as

\[
\text{False Negative Rate} = \frac{\text{Positives incorrectly classified}}{\text{Total Positives}} \tag{2.3}
\]

The False Positive Rate is also known as False Alarm Rate, whereas the Detection Rate is also called True Positive Rate.
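As an illustration (not evaluation code from this thesis), the rates defined in Equations (2.1)-(2.3) can be computed from the labels assigned by an IDS as follows, using the convention that attacks form the positive class:

# Illustrative computation of the rates in Eqs. (2.1)-(2.3).
# Convention: 1 = attack (positive class), 0 = normal (negative class).
def ids_rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    false_positive_rate = fp / (fp + tn)        # Eq. (2.1)
    false_negative_rate = fn / (fn + tp)        # Eq. (2.3)
    detection_rate = 1.0 - false_negative_rate  # Eq. (2.2)
    return detection_rate, false_positive_rate, false_negative_rate

# Toy example: 4 attacks and 6 normal objects; the IDS misses one attack
# and wrongly flags one normal object.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(ids_rates(y_true, y_pred))  # (0.75, 0.1666..., 0.25)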

2.3.1 ROC Curves

Given a certain IDS, the detection rate and the false positive rate cannot be tuned independently, as they both depend on the same threshold. By changing the value of this threshold the operator can choose a desired value for one of them, and the other changes consequently.

To evaluate the performance of the IDS for different values of the threshold, the Receiver Operating Characteristic (ROC) curve analysis and the Area Under the Curve (AUC) are frequently used. The ROC curve provides a way to visually represent how the trade-off between false positives and detection rate varies for different values of the detection threshold [19]. Differently from the classic accuracy metric, which suffers from a dependency on specific values of the detection threshold [19], the AUC summarizes the classification performance of the classifier in the entire range [0,1] of the false positive rate and can be interpreted as the probability of scoring attack packets higher than legitimate packets [23] (i.e., the higher the AUC, the easier it is to distinguish attacks from normal traffic). One problem with the AUC for evaluating intrusion detection systems is that it is computed along the entire range [0,1] of the false positive rate. Because it is not realistic that an intrusion detection system will be configured to generate a high number of false alarms, we are mainly interested in evaluating the classification performance of our anomaly detector for low values of the false positive rate. To cope with this problem, a partial AUC (AUCp) can be considered. The partial AUC is calculated by computing the area under the ROC curve in the range [0,0.1] of the false positive rate (i.e., we do not take into account how the classification system performs for a false positive rate higher than 10%). To obtain a value in the range [0,1], the partial AUC is divided by 0.1.
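The following sketch shows one possible way to compute the normalized partial AUC described above; it assumes scikit-learn is available for the ROC computation and is meant only as an illustration, not as the evaluation code actually used in this thesis.

import numpy as np
from sklearn.metrics import roc_curve

def partial_auc(y_true, anomaly_scores, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr], normalized by max_fpr."""
    # attacks are the positive class and should receive higher anomaly scores
    fpr, tpr, _ = roc_curve(y_true, anomaly_scores)
    grid = np.linspace(0.0, max_fpr, 1000)   # fine grid on [0, max_fpr]
    tpr_grid = np.interp(grid, fpr, tpr)     # piecewise-linear ROC approximation
    return np.trapz(tpr_grid, grid) / max_fpr

# Toy example with synthetic scores: normal traffic around 0, attacks around 2.
rng = np.random.default_rng(0)
y_true = [0] * 900 + [1] * 100
scores = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(2.0, 1.0, 100)])
print(partial_auc(y_true, scores))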

2.3.2 More on evaluation metrics

Several works criticizing the use of metrics such as the accuracy or the area under the ROC curve for evaluating Intrusion Detection Systems have been proposed. Gu et al. consider the use of ROC curves just a way to compare different IDS, not a useful way to evaluate a system, and propose a new measure called intrusion detection capability [42]. In [65] an approach based on cost metrics is proposed. Ali et al. argue that, since the input of a real-time IDS changes considerably over time, using time-invariant classification thresholds does not characterize the best accuracy that an Intrusion Detection System can achieve. In order to deal with this problem they propose a solution based on Markov Chains which is able to predict and adapt the IDS classification threshold [7]. In this context a relevant work is that of Axelsson, who demonstrates that the false alarm rate is the limiting factor for the performance of an intrusion detection system [9]. In his work the problem of evaluating an IDS is formulated using Bayes' rule. Let I and ¬I denote intrusive and non-intrusive behavior respectively, and let A and ¬A denote the presence or absence of an intrusion alarm. Let us also provide a Bayesian definition of the detection and false positive rates. In particular:

• The Detection rate is the probability P(A|I), that is, the probability of an alarm being raised when the IDS is tested on a population made of attacks only.

• The False alarm rate is the probability P(A|¬I), that is, the probability of an alarm being raised when the IDS is tested on a population made of normal patterns only.

From these definitions it follows that:

• The False negative rate is the probability of having a missed alarm:

\[
P(\neg A \mid I) = 1 - P(A \mid I) \tag{2.4}
\]

• The True negative rate is the probability of correctly labeling normal samples:

\[
P(\neg A \mid \neg I) = 1 - P(A \mid \neg I) \tag{2.5}
\]

Obviously, what we want is to maximize both:

• The Bayesian detection rate P(I|A), that is, the probability that an alarm really indicates an intrusion.

• The probability P(¬I|¬A), that is, the probability that if an alarm is not raised we do not have anything to worry about.

Using Bayes' rule, the Bayesian detection rate P(I|A) can be rewritten as:

\[
P(I \mid A) = \frac{P(I) \cdot P(A \mid I)}{P(I) \cdot P(A \mid I) + P(\neg I) \cdot P(A \mid \neg I)} \tag{2.6}
\]

To evaluate the Bayesian detection rate, we can assume to have a system with 1,000,000 audit records per day, 2 intrusions per day and 10 audit records per intrusion. Thus, the probability of having an intrusion is:

\[
P(I) = \frac{2 \cdot 10}{1{,}000{,}000} = 2 \cdot 10^{-5} \tag{2.7}
\]

and the probability of not being attacked is:

\[
P(\neg I) = 1 - P(I) = 0.99998 \tag{2.8}
\]

By substituting P(I) and P(¬I) in Equation 2.6 we obtain:

\[
P(I \mid A) = \frac{2 \cdot 10^{-5} \cdot P(A \mid I)}{2 \cdot 10^{-5} \cdot P(A \mid I) + 0.99998 \cdot P(A \mid \neg I)} \tag{2.9}
\]

The important outcome here is that if we want to have a Bayesian detection rate close to 1, P(A|¬I) must be approximately $10^{-5}$. Since P(A|¬I) indicates the probability of having false alarms, this result tells us that a reliable intrusion detection system is required to generate false alarms at a rate not larger than $10^{-5}$. This result is obviously a consequence of the heavy unbalance of the prior probabilities P(I) and P(¬I).
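The consequence of Equation (2.9) can be verified numerically with a few lines of code (an illustrative check, assuming an ideal detection rate P(A|I) = 1):

# Numeric check of Equations (2.6)-(2.9): Bayesian detection rate P(I|A)
# for different false alarm rates P(A|notI), assuming an ideal detection
# rate P(A|I) = 1 and the prior probabilities derived above.
p_i = (2 * 10) / 1_000_000      # P(I) = 2e-5, Eq. (2.7)
p_not_i = 1.0 - p_i             # P(notI) = 0.99998, Eq. (2.8)
p_a_given_i = 1.0               # P(A|I), assumed ideal

for p_a_given_not_i in (1e-3, 1e-4, 1e-5):  # candidate false alarm rates
    bayesian_dr = (p_i * p_a_given_i) / (
        p_i * p_a_given_i + p_not_i * p_a_given_not_i)
    print(f"P(A|notI) = {p_a_given_not_i:g}  ->  P(I|A) = {bayesian_dr:.3f}")

# Output: 0.020, 0.167 and 0.667 respectively: only when the false alarm rate
# approaches 1e-5 does the Bayesian detection rate stop being dominated by
# false alarms, which is exactly Axelsson's point.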

Chapter 3

Pattern Recognition Algorithms for Anomaly Detection

Part of the inhumanity of the computer is that, once it is competently programmed and working smoothly, it is completely honest. Isaac Asimov

In past years, the uselessness of misuse-based systems against zero-day attacks has pushed the research on anomaly-based systems so far that nowadays several commercial products based on anomaly-detection engines exist [21, 34, 49]. An anomaly-based IDS makes use of Pattern Recognition algorithms, which allow for a statistical representation of the problem of intrusion detection. The design of such a system is not trivial, since a number of issues have to be addressed. Let us analyze them:

1. What is the most appropriate model for the problem? The basic problem in computer security is substantially the following: given a certain object (e.g. a network packet, an executable file, a request toward a web server), it must be automatically established whether it belongs to the "normal" class or to the "malicious" class (that is, it is an attack). However, the choice of the theoretical model most suitable to represent this problem is not trivial. Consider for example the case of malware detection. What an anti-virus software is required to do is to establish whether a certain file (typically an executable one) is malware or not. It is not particularly hard to obtain a large amount of malware samples, and obviously it is not difficult to obtain a number of samples of "benign" (that is, non-malicious) executables. Therefore, since the availability of a large amount of samples from both populations is not a problem, we can easily represent the problem with a two-class model [77]. Consider now a different case: let us assume, for example, that we want to protect a web application, and that we decide to do so by monitoring the requests that the web server receives. Whereas it is easy to collect a large number of legitimate requests toward the web application, it is not so simple to gather examples of attacks against it. What we can certainly do is scan the web application looking for possible vulnerabilities and create samples of attacks that exploit them. Unfortunately, by doing this we are just collecting samples of known vulnerabilities (and related attacks), and we cannot say anything about vulnerabilities that might exist and have not been pointed out by the scan.


Thus, in this case, a one-class model is probably more appropriate: it is based only on "normal" requests, and requests that are statistically too diverse from the normal ones are labeled as attacks [76]. In Section 3.1 we will describe both approaches in more detail.

2. Which kind of pattern should we look at to detect intrusion attempts? Let us assume that we want to build a biometric system to restrict the access to a particular area of a building. In biometrics it is well known that looking at the iris or at the fingerprint is a good way to verify the identity of a person [52]. It is also known that both the iris and fingerprints are definitely better than other biometric traits, such as, for example, the voice or the hand geometry [52]. This is substantially due to the "amount of information" about the identity, which is larger in the iris and fingerprints than in other biometric traits. But what about Intrusion Detection? If we want to protect a web application, is it better to look at the HTTP traffic incoming to the web server or to monitor the log files? Whatever the choice is, it seems quite artificial, because neither the network traffic nor the web server logs are an "inner property" of the web application in the way fingerprints are for a person. Motivations for choosing a particular pattern instead of another will be provided in Chapters 4 and 5 in the context of the solutions that we will describe.

3. Given a pattern, what is the best choice of features? Obviously this is a concern not only for anomaly-based IDS, but a general question which involves the design of every Pattern Recognition system. Suppose that we decided to monitor the network traffic toward a web server to detect attacks against the web applications that it is hosting. Is it enough to model only the HTTP payload, or should we also consider the information within the IP header? Assuming that we decided to model just the HTTP payload, is an analysis of the byte distribution enough, or should we also put into the model the a priori knowledge about the structure of the HTTP payload? This is not a trivial question to answer. In general, the more a priori knowledge is used, the more accurate the resulting system is in classifying patterns. Nevertheless, as we already discussed in the previous chapter, the amount of false alarms and the detection rate are not the only parameters that must be considered in the evaluation of an IDS. For example, a network-based IDS has to meet severe real-time constraints, and this means that the representation of the pattern in the feature space cannot be computationally too expensive. Anyway, a discussion of the choice of features goes beyond the focus of this chapter and will be resumed in the following chapters in the context of the solutions we will present.

4. Which algorithm is the most suitable? The choice of the algorithm most suitable to separate normal and malicious patterns is tricky and heavily depends on the features chosen. The two options are supervised and unsupervised algorithms. Successful applications of both supervised [76] and unsupervised [100] methods exist. In [64] the authors show that, as far as network intrusion detection is concerned, supervised methods work better; we are convinced that such a comparison can be made only in relation to a particular problem. Once the choice between supervised and unsupervised methods has been made, a further choice has to be made for a specific algorithm: a variety of alternatives exists. In this work we consider supervised algorithms only. In particular, we used two different models (a minimal illustrative sketch of the one-class approach is given at the end of this list):

• Hidden Markov Models, which are particularly powerful in modeling the distribution of sequential data. HMM will be described in Section 3.2.1.

• Support Vector Machines, which are particularly appropriate to classify data that are not linearly separable. One-class SVM will be described in Section 3.2.2.

5. Is a single classifier sufficient, or should the IDS be based on an ensemble? Classifier ensembles are generally used to increase the classification accuracy with respect to that achievable by a single classifier. The price that must be paid for this gain in accuracy is an increased complexity of the resulting system. This complexity may translate, for example, into an increase of the computational cost. In a system such as a network IDS this can be particularly critical, since the IDS must be able to keep up with the network speed. Thus, the trade-off between the gain in accuracy and the computational cost must be carefully evaluated. A critical review of the approach based on multiple classifiers has been expressed by Ho [46]:

Instead of looking for the best set of features and the best classifier, now we look for the best set of classifiers and then the best combination method. One can imagine that very soon we will be looking for the best set of combination methods and then the best way to use them all. If we do not take the chance to review the fundamental problems arising from this challenge, we are bound to be driven into such an infinite recurrence, dragging along more and more complicated combination schemes and theories and gradually losing sight of the original problem.

Nevertheless, the combination of classifiers has been deeply investigated and many successful applications exist in fields such as intrusion and spam detection or biometrics [16, 22, 29, 38, 40, 56, 63, 70, 76]. In biometrics, intrusion and spam detection, the employment of multiple classifiers is further motivated by the fact that not only the classification accuracy but also the robustness against evasion attempts is a crucial parameter in the evaluation of a system. Usually, it is more difficult for an attacker to evade multiple classifiers than a single one. In Section 3.3 we will provide a brief description of Multiple Classifier Systems and we will illustrate how they can be useful also in Intrusion Detection.

3.1 One vs. Multi-class Pattern Classification

The aim of an IDS is that of blocking any kind of malicious activity, while leaving legitimate activity as undisturbed as possible. Since the objects the IDS is looking for are the "attacks", their class is frequently labeled as the "positive" class. On the other hand, the class of legitimate patterns is usually known as the "normal" or "negative" class. Basically, every pattern recognition system works in two different phases:

• A training or learning phase, where the parameters of the models are estimated.

• An operating phase, which in the case of an IDS is the phase in which the system detects intrusion attempts. We also call this phase the detection phase.

During the training phase the models' parameters are estimated based on a population of examples that represent the objects the system has to classify: the data used for training are usually called the "training set". In principle the training set should provide a good representation of the real data: that is, it should contain a number of samples of each class large enough to obtain a good estimate of the distribution of the real population. This is generally true for patterns of the normal class, since in computer security it is quite simple to obtain large volumes of legitimate samples (e.g. network traffic traces, web-server log files). Unfortunately this is not always true for the positive class: we mentioned in the previous section that it might be reasonably simple to collect samples of malware, whereas collecting attacks against a web application might not be so easy. The same issue arises when it is actually possible to collect a large number of malicious samples, but this number is negligible with respect to the number of possible attacks. Depending on the availability of attack samples, two different approaches can be adopted:

• If we have enough samples for both the normal and the attack class, a multi-class model might be employed. During the training phase the (classifier inside the) IDS models both the normal and the attack class. During the detection phase, the pattern is assigned to one of the two classes.

• If we do not have enough samples for the attack class, a one-class model might be employed. During the training phase, the (classifier inside the) IDS models only the normal class. During the detection phase the classifier estimates whether the pattern belongs to the normal class; if it does not, the IDS considers it anomalous.

The reasons that make an anomaly-based IDS able to deal with zero-day attacks should now be clear: during the training phase the system does not make any assumption on the distribution of attacks inside the feature space; it only models the distribution of normal patterns.

3.1.1 Multi-class Pattern Classification


Figure 3.1: A possible representation of the problem of Intrusion Detection as a two-class problem. In this case the two distributions of patterns are perfectly separable with a linear classifier.

In figure 3.1 we propose a possible representation of a two-class problem. We have two populations of samples, one representing normal patterns and the other representing attacks. During the training phase, the classifier determines a decision boundary which separates the two classes as well as possible. In this example the problem is quite simple, since even a linear classifier can separate the two classes perfectly. Unfortunately this is not a realistic situation, since there is usually a certain overlap between the two classes. An additional example is presented in figure 3.2. Here the two distributions are not linearly separable and a more sophisticated classifier is necessary to separate the two classes. The classifier achieves a good performance, since only two patterns out of two thousand are misclassified. Unfortunately such a classifier is neither robust against evasion attempts nor immune to false alarms. Let us focus on the region of the feature space within the rectangle, which has been enlarged in figure 3.3.


Figure 3.2: A possible representation of the problem of Intrusion Detection as a two-class problem. The two classes are separated with a K-Nearest Neighbor classifier. A more detailed view of the region within the rectangle is proposed in figure 3.3.

What is important to notice is that there are several points from both classes that are classified correctly but are very close to the boundary. We labeled the patterns from the normal class that are in this situation as False Positive Candidates. The presence of these points close to the boundary means that a small change in the normal patterns might produce a large increase in the false positive rate. In the case of a network-based IDS this is a very risky situation, since with high volumes of traffic the amount of false alarms can become very large. In addition, an attacker might decide to exploit this situation by sending "fake attacks" that are not dangerous at all but that make the IDS generate a huge number of alarms [99]. Although these attacks do not produce any damage by themselves, the false alarms represent noise that an attacker can use to mask the real attacks. In the worst case, these false alarms might lead the security officer to switch off the IDS, leaving the network completely unprotected. On the contrary, the presence of attack patterns close to the boundary is dangerous for the robustness of the system against evasion attempts. We labeled these patterns as False Negative Candidates.


Figure 3.3: A detail of figure 3.2. The presence of several points very close to the decision boundary indicates that the system is neither resilient to evasion attempts nor robust with respect to variations of the normal patterns.

The proximity of these attacks to the boundary represents for an attacker a great opportunity of evading the IDS with small modifications of the attack patterns. A practical example is that of a polymorphic engine such as CLET [27]: it can modify the distribution of bytes inside an attack packet to make it similar to that of a normal packet. In such a way an IDS based on packet statistics will not be able to detect the attack. A simple mathematical model of the two-class problem is that based on Bayesian Decision Theory [30]. Class labels are assigned by a Bayesian classifier on the basis of the a posteriori probabilities. Given a generic class "c" and a pattern "x", the a posteriori probability of c given x is the probability of the class being c given that the observed pattern is x. Basically, the a posteriori probability (also known as the posterior) indicates how likely the pattern x is to belong to the class c. In an intrusion detection problem formulated as a two-class problem the two possible classes are obviously "normal" and "attack". According to the Bayes decision rule, if

P(normal | x) > P(attack | x)    (3.1)

x is assigned to the normal class; otherwise it is assigned to the attack class. The probability of error for this rule is:

P(error | x) = min[ P(normal | x), P(attack | x) ]    (3.2)

An example of posteriors for a simple one-dimensional problem is proposed in figure 3.4. From the figure it is easy to see how the false positive and false negative rates vary as the threshold moves. Moving the threshold toward the attack posterior reduces the false positive rate but increases the number of undetected attacks. On the contrary, a threshold which moves toward the normal-class posterior increases the detection rate, but also the amount of false positives.
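To make the decision rule concrete, the following sketch computes the posteriors from Gaussian class-conditional densities and applies rule (3.1) to a one-dimensional pattern; the density parameters, priors and the test value are illustrative assumptions, not values taken from this dissertation.

```python
# Minimal sketch of the Bayes decision rule (3.1) on a toy 1-D problem.
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

P_NORMAL, P_ATTACK = 0.95, 0.05                       # assumed priors

def posterior_normal(x):
    p_n = gaussian(x, mu=0.3, sigma=0.10) * P_NORMAL  # p(x|normal) P(normal)
    p_a = gaussian(x, mu=0.7, sigma=0.15) * P_ATTACK  # p(x|attack) P(attack)
    return p_n / (p_n + p_a)

x = 0.55
post = posterior_normal(x)
label = "normal" if post > 0.5 else "attack"          # rule (3.1)
print(f"P(normal|x)={post:.3f} -> {label}, P(error|x)={min(post, 1 - post):.3f}")
```

Sliding the decision point in this toy example reproduces the trade-off of figure 3.4: accepting more patterns as normal lowers the false positives but raises the false negatives, and vice versa.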


Figure 3.4: Posterior probability distributions for a two-class, one-dimensional problem.

3.1.2 One-class Pattern Classification

One-class classification techniques are particularly useful for two-class learning problems in which one of the classes, referred to as the target class, is well sampled, whereas the other one, referred to as the outlier class, is severely undersampled. The low number of examples from the outlier class may be due to the fact that it is too difficult or expensive to obtain a significant number of training patterns of that class [88]. The goal of one-class classification is to construct a decision surface around the examples from the target class in order to distinguish between target objects and all the other possible objects, i.e., the outliers [88]. Given an unlabeled training dataset that is deemed to contain mostly target objects, a rejection rate is usually chosen during training so that a certain percentage of training patterns lies outside the constructed decision surface. This takes into account the possible presence of noise (i.e., unlabeled outliers), and allows us to obtain a more precise description of the target class [88]. In the case when the training set contains only "pure" target patterns, this rejection rate can be interpreted as a tolerable false positive rate. This situation is represented in figure 3.5. The decision boundaries obtained with two different classifiers are compared in figure 3.6. "Decision Boundary 1" is obtained using a quadratic discriminant classifier, which realizes a closed surface around the distribution of normal patterns. "Decision Boundary 2" is obtained using a polynomial of 3rd degree. Obviously, neither of the two classifiers can do anything against the attacks that fall exactly over the distribution of normal patterns: additional features would be necessary to detect them. Nevertheless, the closed surface realized by the quadratic classifier is by far better than the decision boundary drawn by the polynomial. The problem with the polynomial classifier is that the region assigned to the normal class is considerably wider than the space effectively covered by normal patterns. This offers an attacker a great opportunity to evade the IDS.


Figure 3.5: A possible representation of the problem of Intrusion Detection as a one-class problem. A closed surface is drawn around the distribution of normal patterns, leaving outside a certain fraction of rejected samples.

! "#$%&' ())&*+, -.*/,/#0!1#203&$4!!!5 -.*/,/#0!1#203&$4!!!6

!

Figure 3.6: Comparison of the decision boundaries obtained by two one-class classifiers on the same distribution of normal patterns: a quadratic discriminant classifier (Decision Boundary 1) and a 3rd degree polynomial classifier (Decision Boundary 2).

To evade the IDS, an attacker has only to craft an attack that falls inside the "normal region"; it is not necessary for the pattern to be similar to the normal ones. On the contrary, to evade an IDS based on the quadratic classifier an attack has to fall exactly over the distribution of normal patterns. One might object that the polynomial boundary is preferable because it generates fewer false positives in case the distribution of normal patterns drifts. This is true, but such an approach would be completely insecure, since it leaves the attacker a wide region of the feature space in which to evade the IDS.

3.2 Algorithms for pattern classification

3.2.1 Hidden Markov Models

Hidden Markov Models (HMM) represent a very useful tool to model time series and to capture the underlying structure of a set of strings of symbols. Markovian models and Hidden Markov Models have been applied to information security problems only recently, whereas they were originally used in applications such as speech recognition [79], handwritten text recognition [43], and biological sequence analysis [32]. In the field of information security, HMM have been largely employed in host-based Intrusion Detection. The seminal work in this direction is that of Warrender et al., where HMM are used to model system call sequences [97]. In [20] HMM have been used to model privilege flows, while in [39] Gao et al. propose to use HMM for computing a behavioral distance between processes. In [59] the authors proposed a framework to detect attacks against web servers and web-based applications: attack detection is performed by multiple models, including a Markov model of the request URI. SPECTROGRAM is another sensor that aims at detecting web-layer code injection attacks by operating above the packet layer [85]. SPECTROGRAM uses a mixture of Markov chains to address the curse of dimensionality arising from the need for a large value of n in the n-gram analysis. An HMM is a stateful model whose states are not observable (hidden). Two probability density functions are associated with each hidden state: one provides the probability of transition to another state, the other provides the probability that a given symbol is emitted from that state. According to [79], an HMM is characterized by the following:

• N, the number of states in the model.

• M, the number of distinct observation symbols per state, i.e. the discrete alphabet size.

• A, the state transition probability distribution. In our case, the element a_ij is the probability of transition from state i to state j.

• B, the observation symbol probability distribution. The element b_ik is the probability that the k-th symbol is emitted from state i.

• π, the initial state distribution. Each element π_i is the probability that the initial state is the i-th state.

The problems HMM can deal with are usually grouped into three categories: (a) Decoding, (b) Evaluation and (c) Training. As our goal is to train an HMM so that it models the byte distribution of normal HTTP payloads, in the following we briefly review the basics of Training and Evaluation.

• Training an HMM requires solving the problem of estimating the HMM parameters, i.e. λ = (A, B, π). The goal is to maximize the probability assigned by the model to the sequences in the training set. This problem is usually solved iteratively by resorting to the Baum-Welch algorithm [12].

• The Evaluation problem is the problem of estimating the probability of a sequence for a given model. This problem is solved using the Forward-Backward procedure [11, 13]. The Forward-Backward procedure is based on a so-called forward variable, defined as:

α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)    (3.3)

The forward variable represents the joint probability of the partial observation sequence O_1 O_2 ... O_t and of the model λ being in state S_i at time t. Given a sequence of length n, the forward variable is calculated as follows:

1. Initialization: α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N

2. Induction: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}), 1 ≤ t ≤ n−1, 1 ≤ j ≤ N

3. Termination: P(O | λ) = Σ_{i=1}^{N} α_n(i)

For the sake of the discussion in the following sections, it is important to remark that, for a given model λ, the longer the sequence analyzed, the smaller the value of the forward variable. A minimal sketch of the Forward procedure is given below.
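The following sketch implements the Forward procedure above for a discrete-observation HMM; the model parameters and the observation sequence are illustrative assumptions, not those estimated in this dissertation.

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """Compute P(O|lambda) with the Forward procedure.

    A:   (N, N) state transition matrix, A[i, j] = a_ij
    B:   (N, M) emission matrix, B[i, k] = b_i(k)
    pi:  (N,)   initial state distribution
    obs: sequence of observation symbol indices O_1..O_n
    """
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction: alpha_{t+1}(j)
    return alpha.sum()                     # termination: sum_i alpha_n(i)

# Toy 2-state, 3-symbol model (illustrative values only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_probability(A, B, pi, obs=[0, 1, 2, 1]))
```

Running the same model on progressively longer observation sequences shows the remark above in practice: the returned probability shrinks as the sequence grows.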

3.2.2 One-Class SVM

In this section we briefly describe a one-class classifier, inspired by the Support Vector Machine (SVM) [92], that was proposed by Schölkopf et al. in [81]. Let us define a pattern vector x_k = [x_k1, x_k2, .., x_kl] as the description of an object π_k in an l-dimensional feature space F. The one-class classification problem is formulated as finding a hyperplane that separates a desired fraction of the training patterns, called the target patterns, from the origin of the feature space F. Such a hyperplane cannot always be found in the original feature space, thus a mapping function Φ : F → F′, from F to a kernel space F′, is used. In particular, it can be proven that when the Gaussian kernel

K(x, y) = Φ(x) · Φ(y) = exp( −γ ||x − y||² )    (3.4)

is used, it is always possible to find a hyperplane that solves the separation problem. The problem is formulated as follows:

min_{w, ξ, ρ}  ( (1/2) ||w||² − ρ + (1/(hC)) Σ_i ξ_i )
subject to  w · Φ(x_i) ≥ ρ − ξ_i,   ξ_i ≥ 0,   ∀ i = 1, .., h    (3.5)

where w is a vector orthogonal to the separation hyperplane, C represents the fraction of training patterns that are allowed to be rejected (i.e., that are not separated from the origin by the hyperplane), x_i is the i-th training pattern, h is the total number of training patterns, ξ = [ξ_1, .., ξ_h] is a vector of slack variables used to "penalize" the rejected patterns, and ρ represents the margin, i.e., the distance of the hyperplane from the origin.

The solution of (3.5) gives us the desired separation hyperplane. A generic test pattern z can then be classified as target or outlier using the following decision function [81]:

f_svc(z) = I( Σ_i α_i K(x_i, z) ≥ ρ ),   with   Σ_{i=1}^{h} α_i = 1    (3.6)

where I is the indicator function (I(x) = 1 if x is true, otherwise I(x) = 0), and the coefficients α_i and the threshold ρ are provided by the solution of (3.5). According to (3.6), a pattern z is either rejected (i.e., classified as outlier) if f_svc(z) = 0, or accepted as a target object if f_svc(z) = 1. It is worth noting that most of the coefficients α_i are usually equal to zero, therefore f_svc(z) can be efficiently computed. The training patterns x_i for which α_i ≠ 0 are referred to as support vectors.
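As a concrete illustration of the classifier just described, the sketch below trains a one-class SVM with a Gaussian (RBF) kernel on target patterns only. It relies on scikit-learn's OneClassSVM, whose nu parameter plays a role analogous to the rejected fraction C in (3.5); the data are synthetic placeholders, not patterns from the datasets used in this work.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_target = rng.normal(loc=0.0, scale=1.0, size=(500, 2))    # "normal" training patterns
X_test = np.vstack([rng.normal(0.0, 1.0, (10, 2)),           # unseen target patterns
                    rng.normal(6.0, 0.5, (10, 2))])          # anomalous patterns

# Gaussian kernel as in (3.4); nu bounds the fraction of rejected training patterns.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_target)

pred = ocsvm.predict(X_test)   # +1 -> target (normal), -1 -> outlier (anomalous)
print("outliers detected:", int((pred == -1).sum()))
```

Only the support vectors (training patterns with non-zero α_i) contribute to the decision function, which is what keeps the evaluation of (3.6) cheap at detection time.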

3.3 Multiple Classifier Systems

Multiple Classifier Systems (MCS) are widely used in Pattern Recognition applications, as they allow better performance to be obtained than with a single classifier. The reasons why an MCS can perform better than a single classifier have been deeply investigated in the literature, and the effectiveness of MCS for computer security has also been shown [22, 63, 75]. Basically, an MCS exploits the decisions made by an ensemble of classifiers, and combines these decisions to obtain a "better" classification. According to [29], there are at least three reasons for which an ensemble turns out to be more accurate and robust than any classifier in the ensemble:

• The statistical reason. A learning algorithm can be viewed as a search for the best hypothesis in a space H of hypotheses. As a consequence of the finite size of the training set, the learning algorithm will usually end up with a number of classifiers that achieve the same accuracy on the training data. These classifiers, however, may not produce the same accuracy on unseen data. By constructing an ensemble out of all of them, the risk of choosing the wrong classifier can be reduced.

• The computational reason. In many cases the optimal training of a classifier is an NP-hard problem: consequently, most learning algorithms usually aim at finding a local optimum of the target function. This optimum usually depends on the starting point. This means that by running the local search from different starting points, and using the obtained classifiers to build an ensemble, a better approximation of the true unknown function can be attained.

• The representational reason. In most machine learning applications, the true function for the problem at hand cannot be represented by any of the functions available in H. The use of weighted sums of hypotheses drawn from H may allow expanding the space of representable functions.

In this work we use the MCS paradigm to combine different HMM. A general scheme of the proposed HMM ensemble is shown in figure 3.8. A payload x_i is submitted to an ensemble H = {HMM_j} of n HMM; each j-th HMM produces an output s_ij, and these outputs are combined into a "new" output s*_i.

Different combination strategies for building a MCS have been proposed in the literature. They can be roughly subdivided into two main approaches, namely the Fusion approach, and the Dynamic Selection approach. In the following, a brief overview of these combination strategies is given.

3.3.1 Classifier Selection

Classifier Selection is based on the assumption that each classifier in a given ensemble exhibits a higher "expertise" than the others on a subset of patterns. For each pattern to be classified, the system selects the classifier which is considered to provide the highest accuracy for the pattern at hand. It is easy to see that the main difficulty with this approach is the design of the selection criterion. On the other hand, it can be easily shown that if the selector works properly, very high accuracy can be attained. For this reason, in this work we use the Selection paradigm to provide an upper bound on the performance that could be achieved by the HMM ensemble employed in HMMPayl. Let us define the expected output of an "ideal selector" [90]:

s*_i = max{ s_ij }  if x_i is a negative pattern;   s*_i = min{ s_ij }  if x_i is a positive pattern    (3.7)

An example of the result attained by classifier selection is shown in figure 3.9, where two classifiers are combined. In particular, for each classifier the distributions of the output values for the two classes are shown. It is easy to see that the distribution of the output values of the ideal selector allows a better separation between the classes with respect to the combined classifiers.


Figure 3.7: Three fundamental reasons why an ensemble may work better than a single classifier [29].

It can be easily seen that the above ideal score selector exhibits a better ROC curve than those of the individual classifiers used in the combination, and consequently a larger AUC [90]. Moreover, it has also been shown that the ideal selector always attains a larger AUC than that obtained by the linear combination, whatever the values of the weights and the number of classifiers [90].

3.3.2 Classifier Fusion

This strategy "fuses" the outputs of the ensemble of classifiers to produce a single output. A large number of fusion functions are available in the literature, each one with its pros and cons. One of the basic assumptions of the fusion strategy is that the combined classifiers are considered as competitive rather than complementary.

!" !""#$ #$&

!""#% ( '" !" !"#$ !" # #$% #

!""#& !" #$'

Figure 3.8: A general scheme of an MCS based on HMM.


Figure 3.9: An example of ideal score selector with two classifiers on a real dataset. The distributions of the output values resulting from the "ideal selection" exhibit a larger separability than the original ones.

In this work we consider the Maximum, the Minimum, the Mean and the Geometric mean rules.

• the Maximum rule: s*_i = max_j { s_ij }

• the Minimum rule: s*_i = min_j { s_ij }

• the Mean rule: s*_i = (1/N) Σ_{j=1}^{N} s_ij

• the Geometric mean rule: s*_i = ( Π_{j=1}^{N} s_ij )^(1/N)

These "static" rules are widely used in Pattern Recognition to combine classifiers because they achieve good results in spite of their simplicity. Nevertheless, "trained" combination rules have also been proposed to better exploit additional knowledge of the domain at hand. As pointed out in [31], combining classifiers using static rules is a suboptimal solution, whereas trained combining rules are asymptotically optimal. Despite this, in this work we used static rules for two main reasons. One reason is related to the issues involved in building a trained combiner, which make its design a non-trivial task. The other reason is that static rules are very fast to compute, and thus the additional computational cost is very small compared to the one typically required by complex trained rules. A sketch of the four static rules is given below.
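The following sketch applies the four static fusion rules listed above to a matrix of ensemble scores; the score values are illustrative and assumed to be strictly positive (e.g. probabilities), as required by the geometric mean.

```python
import numpy as np

def fuse_scores(S):
    """Combine ensemble outputs with the four static rules.

    S: array of shape (num_patterns, N), where S[i, j] = s_ij is the score
       assigned by the j-th classifier of the ensemble to the i-th pattern.
    Returns a dict mapping rule name -> combined score s*_i per pattern.
    """
    return {
        "max": S.max(axis=1),                       # maximum rule
        "min": S.min(axis=1),                       # minimum rule
        "mean": S.mean(axis=1),                     # mean rule
        "gmean": np.exp(np.log(S).mean(axis=1)),    # geometric mean rule
    }

scores = np.array([[0.80, 0.60, 0.70],   # pattern 1, three classifiers
                   [0.10, 0.20, 0.05]])  # pattern 2
print(fuse_scores(scores))
```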

3.3.3 Combining Multiple One-Class SVM Classifiers

Unlike the combination of two-class or multi-class classifiers, the combination of one-class classifiers is usually not straightforward [89]. This is due to the fact that it is usually not possible to reliably estimate the probability distribution of the outlier class. As a consequence, the posterior class probability cannot be estimated, and many combination rules used in multi-class classification problems cannot be applied. However, when the Gaussian kernel (3.4) is used, the output of the one-class SVM can be formulated in terms of a class-conditional probability by

p(x | ω_t) = 1/(2πs)^(d/2) · Σ_{i=1}^{h} α_i K(x, x_i) = Σ_{i=1}^{h} α_i · 1/(2πs)^(d/2) · exp( −||x − x_i||² / (2s) )    (3.8)

which respects the constraint ∫_{R^d} p(x | ω_t) dx = 1 [40]. Assuming a uniform distribution for the outlier class, this allows us to combine L different one-class SVM classifiers as

y_avg(x) = (1/L) Σ_{i=1}^{L} p_i(x | ω_t)    (3.9)

for example, where ω_t represents the target class. We can then use the simple decision criterion [40]

y_avg(x) < θ   ⇒   x is an outlier    (3.10)

where θ is a predefined threshold that can be tuned to find the desired trade-off between false positives and detection rate. Equations (3.9) and (3.10) represent the average of probabilities combination rule. Other alternative combination rules are

y_prod(x) = Π_{i=1}^{L} p_i(x | ω_t),   y_prod(x) < θ  ⇒  x is an outlier    (3.11)

which is the product of probabilities rule, and

y_min(x) = min_{i=1..L} p_i(x | ω_t),   y_min(x) < θ  ⇒  x is an outlier    (3.12)

and

y_max(x) = max_{i=1..L} p_i(x | ω_t),   y_max(x) < θ  ⇒  x is an outlier    (3.13)

which are the minimum and maximum probability combination rules, respectively. Average, product, maximum and minimum of probabilities are popular, simple (low-cost) non-trainable combiners that have been shown to be quite successful for different classification problems [56, 63]. However, it is often difficult to predict which combination rule will perform best on a specific real problem. This is the reason why in the following we will compare the results obtained using different rules for combining the outputs of the different classifiers in our anomaly detection system. It is worth noting that in general only a small number of coefficients α_i will be different from zero, thus p(x | ω_t) can be efficiently computed. In the case of the majority voting rule there is no need to estimate the class-conditional probabilities, and the application of the combination rule is straightforward. Assume the output of the L classifiers for a payload π_k is given as a vector c(π_k) = [c_1(x_k^(1)), c_2(x_k^(2)), .., c_L(x_k^(L))] ∈ {0,1}^L, where c_h(x_k^(h)) = 1 if the h-th classifier labels π_k as target, and c_h(x_k^(h)) = 0 otherwise. The majority voting rule can then be written as Σ_{i=1..L} c_i(x_k^(i)) > L/2.
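The sketch below applies the probability-based rules (3.9)–(3.13) and the majority voting rule to the outputs of L one-class classifiers for a single pattern; the probability values, the threshold θ and the way each individual vote is derived are illustrative assumptions.

```python
import numpy as np

# Estimated class-conditional probabilities p_i(x|omega_t) from L = 4
# one-class classifiers for the same test pattern (illustrative values).
p = np.array([0.020, 0.015, 0.002, 0.018])
theta = 0.01          # assumed threshold on the combined probability

rules = {
    "average": p.mean(),     # rule (3.9)
    "product": p.prod(),     # rule (3.11)
    "minimum": p.min(),      # rule (3.12)
    "maximum": p.max(),      # rule (3.13)
}
for name, y in rules.items():
    print(f"{name:8s}: {'outlier' if y < theta else 'target'} (y={y:.4g})")

# Majority voting: here each classifier's target/outlier vote is obtained by
# thresholding its own probability (an assumption for this illustration).
votes = (p >= theta).astype(int)
print("majority:", "target" if votes.sum() > len(votes) / 2 else "outlier")
```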

Chapter 4

Web Applications Security: an Host-based solution

Distrust and caution are the parents of security. Benjamin Franklin

This chapter focuses on web application security, and in particular on host-based solutions. The most important part of the chapter is dedicated to describing HMM-Web, a host-based IDS based on Hidden Markov Models. In more detail, Section 4.1 provides a quick explanation of how a web application works. Section 4.2 briefly describes the most frequent attacks against web applications. Section 4.3 provides an overview of the state of the art. The architecture of HMM-Web is described in Section 4.4. Experimental results are provided in Section 4.5.

4.1 Web Applications: An overview

Web applications can be realized using a number of different technologies. Without entering into the details of how an application is realized, we are particularly interested in the interaction between the client and the web application. Let us consider the simple scheme presented in figure 4.1. The scheme represents a typical situation where a web browser sends to the server a request for a certain resource; in this case the page search.php is requested.

1. A network packet is sent from the web browser to the web server (www.example.com). The packet contains a request for the page search.php. The request is sent according to the HTTP protocol [3] and is contained in the portion of the network packet known as the HTTP payload (a concrete request of this kind is sketched after this list). We can observe from the example that several pieces of information are provided to the web server with the request:

• The name of the HTTP method. A different method is used depending on the kind of action the client is requesting from the web server. In particular, GET is the method typically used to retrieve information from the web server.

• The "Request URI", which identifies a page or a set of pages on a web server by the path and/or query parameters (also called "attributes").


Figure 4.1: A web application example. The web browser requests the page search.php and provides it two inputs: the attributes cat and key with their associated values. The web server processes the request and sends the output back to the client.

For instance, if www.example.com is developed in static HTML and has a page called "about.html" located in a sub-directory of the site, the request URI for that page might be "/prag/eng/about.html". On the other hand, if www.example.com is developed in PHP, the request URI for that page might look like /pages.php?group=prag&lang=eng&page=about.php. The example in figure 4.1 contains a request for the application search.php. The application receives two attributes: "cat" with value 32 and "key" with value hmm. It is worth remarking that in the following we will consider every single script on the web server as a different application.

• The version of the HTTP protocol. In the example, the request is sent according to version 1.1 of the protocol.

• A set of HTTP headers. In the example the headers are just two: the Host header, which specifies the host of the resource being requested, and the Connection header, which allows the sender to specify options that are desired for that particular connection. Additional headers are possible: the client generally uses them to specify more options to the web server.

2. When the HTTP request is received, it is interpreted by a parsing engine which extracts the name of the requested application and the input attributes.

3. The application is called and the attributes are passed as arguments.

4. The application generates the resource required.

5. The required resource is sent back to the web client within a response message.
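As a concrete illustration of steps 1 and 2, the sketch below shows a raw HTTP request of the kind discussed above (host name, attributes and values mirror the search.php example of figure 4.1) and how the application name and its input attributes can be extracted from the Request URI. The parsing code is a simplified assumption for illustration, not the parser actually used by the web server or by HMM-Web.

```python
from urllib.parse import urlsplit, parse_qsl

# Raw HTTP payload of a request like the one in figure 4.1 (illustrative reconstruction).
raw_request = (
    "GET /search.php?cat=32&key=hmm HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

# Step 2: a minimal parser extracting the application name and its attributes.
request_line = raw_request.split("\r\n", 1)[0]
method, uri, version = request_line.split(" ")
path, query = urlsplit(uri).path, urlsplit(uri).query
attributes = dict(parse_qsl(query))

print(method, version)   # GET HTTP/1.1
print(path)              # /search.php  -> the web application
print(attributes)        # {'cat': '32', 'key': 'hmm'}
```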

4.2 Attacks against Web Applications

In this section a very concise description is given of the most frequent attacks against web applications. A more exhaustive description of all the attacks to which a web application might be vulnerable can be found in [41].

4.2.1 SQL injection

SQL injection is a class of attacks where unsanitized user input is able to change the structure of an SQL query, so that when it is executed it has an unintended effect on the database. SQL injection is made possible by the fact that SQL queries are usually assembled by performing a series of string concatenations of static strings and variables. Let us consider a simple example taken from [91]. It is related to a web application that lets a user list all his registered credit cards of a given type. A possible pseudo-code for such an application is the following:

uname = getAuthenticatedUser();
cctype = getUserInput();
result = sql("SELECT nb FROM creditcards WHERE user = '" + uname + "' AND type = '" + cctype + "';");
print(result);

If a user bob searches for all his VISA cards, the following query would be executed:

"SELECT nb FROM creditcards WHERE user = ’bob’ AND type = ’VISA’;

Bob could manipulate the structure of the SQL query if the supplied input is not properly sanitized. For example, if the value ' OR user = 'alice is passed to the cctype attribute, the following query would result:

"SELECT nb FROM creditcards WHERE user = ’bob’ AND type = ’’ OR user = ’alice’;

This query will return to Bob a list of all the credit cards belonging to Alice.

4.2.2 Cross Site Scripting - XSS

Cross Site Scripting (XSS) is an attack that exploits the client's trust in a web server. To be precise, a web page from the trusted server is executed in a context that has more permissions than a page from an attacker's website would get. This is done by injecting a script (often written in JavaScript) into a server which is not under the control of the attacker. Then, if the uploaded code is viewable by other users, the malicious script will be executed in the victims' browsers. Since the script originates from the web server, it is run with the same privileges as legitimate scripts originating from the web server. Below is a simple example taken from [4]. The XSS attack takes place when an attacker is able to make a user click on a link that embeds a malicious script in one of the request parameters (mycomment in this example), such as the following:

Click here

If the web server simply echoes the value of mycomment in the result page, the script is executed on the client within the page from the trusted server.

4.2.3 Remote File Inclusion - Shellcode Injection

With Remote File Inclusion or code injection, an attacker uses a flaw in the web application to inject code from another server and have it run on the server hosting the application. This is by far the most dangerous class of attacks, because it allows the execution of remote code on the web server. Shellcode Injection attacks usually exploit Buffer Overflow vulnerabilities. A buffer overflow happens when a process stores data in a buffer outside the memory the programmer set aside for it. Shellcode Injection is explained in depth in [73].

4.3 State of the Art

Misuse-based IDS are not particularly suitable to protect web applications, as a consequence of the inherent variability of these applications. In fact, two scenarios are generally possible when a web application is realized:

• The application is built over a framework such as a Content-Management System. The developer customizes the application according to the requirements of the purchaser.

• The web application is developed from scratch and it is fully tailored to meet individual requirements.

In particular in the second case, but also in the first, the application will have its own vulnerabilities, different from those of similar applications. Thus, ad-hoc signatures would be necessary to protect the application with a misuse-based IDS. In addition, developers of web applications often have little or no security skills; moreover, they mostly concentrate on the functionalities and on the time constraints within which the application must be realized. These reasons make web applications further prone to the presence of vulnerabilities. In order to cope with these problems, a number of solutions have been proposed to scan web applications for vulnerabilities, and to block and detect attacks that exploit these vulnerabilities. The problem of the identification of vulnerabilities has been addressed in several ways. Kals et al. proposed SecuBat, a tool which scans web applications looking for vulnerabilities exploitable with SQL Injection and XSS attacks [53]. The tool has been used to scan more than one hundred web applications, among which were also those of governmental institutions and of several global companies. The results show how easy it is for attackers to automatically discover and exploit application-level vulnerabilities. Huang et al. show how software testing techniques can be applied to web applications to discover vulnerabilities [47]. In [45] the authors propose a new approach to penetration testing of web applications. The proposed method allows for a more accurate and exhaustive analysis of the web application with respect to previous methods. In addition, it produces a list of all the input vectors the application can receive; based on this list, a detailed list of vulnerabilities is generated. Unfortunately, even if a vulnerability is detected, patching the application is not always possible. In this case a tool such as a web application firewall might be particularly useful, since it can prevent the vulnerability from being exploited. Nevertheless, writing rules for the firewall can be a tedious and time-consuming task. To make this job simpler, Scott et al. proposed a framework which helps designers in writing security policies for a web application firewall [82]. In [24] the authors propose an anomaly-based web application firewall which works as a reverse proxy in front of the web application to be protected. This solution allows the detection of attacks at both servers and clients. Moreover, several commercial products exist, such as [21, 34, 49]. Several anomaly-based Intrusion Detection Systems have also been proposed. Spectrogram is an anomaly-based sensor which operates above the network layer [85]. It examines individual HTTP requests and models the content and structure of script inputs: using a mixture of Markovian models, the sensor represents the portion of the HTTP request that contains the attributes and their values. Valeur et al. proposed an anomaly-based approach to the detection of SQL attacks [91]. During the training phase a number of models is created and the maximum accepted anomaly score (that is, a threshold) is calculated. During the detection phase, features are extracted from the query and a score that measures how well the query fits the model parameters is calculated. If the score exceeds the threshold, the query is labeled as anomalous. Cho et al.
proposed a solution that relies on a definition of "web session": they define a web session as the sequence of page requests received from a certain IP address within a certain window of time. The IDS evaluates the web session related to each client and assigns it a probability that is compared with a threshold. In [58] the authors proposed a multi-model framework for the detection of attacks against web applications. They modelled (legitimate) queries using both spatial features (related to a single request) and temporal features (related to multiple consecutive requests). Different models were applied to represent these features, and the scores each of them associated with the query were combined to obtain an overall score. A similar approach has also been proposed by Ingham et al. [51]. The presence of an attack inside a web request is spotted using a Deterministic Finite Automaton (DFA), which can be considered as nothing more than a language acceptor. When a request is received, a heuristic is applied that measures the number of changes that would have to be made to the DFA for it to accept the request. This similarity measure is thresholded to decide whether or not the request contains an attack. Finally, the "concept drift" of web applications has also been addressed. In fact, by their very nature web applications change very often, and not taking this into account might cause the IDS to generate a large amount of false positives. In [68] this problem is addressed by proposing a technique which is able to automatically discover changes in the web application. The automated detection of changes allows the anomaly models to be retrained, reducing the amount of false positives generated.

4.4 A Host-based IDS: HMM-Web

In this section we describe HMM-Web, a host-based IDS for protecting web applications against attacks that exploit input validation flaws. HMM-Web is based on an ensemble of Hidden Markov Models that are used to model:

• The sequence of attributes received by the web application.

• The value assigned to each attribute.

Since a web server can host several applications and each application has its own attributes, the IDS is composed of a set of (independent) application-specific modules. Each module, composed of multiple HMM ensembles, is trained using queries on a specific web application. During the operational phase, HMM-Web outputs a probability value for each query on that web application.

Figure 4.2: IDS scheme. The Parser processes the request URI and identifies the web application (i.e. search.php) and its input query. By applying a threshold to the probability value associated with the codified query, the query is labeled as legitimate/anomalous. The threshold depends on the web application probability and on the α parameter.

Furthermore, a decision module classifies the query as suspicious (a possible attack) or legitimate by applying a threshold to this probability value. Thresholds are fixed independently for each application-specific module. A simple scheme of one module is proposed in figure 4.2. From an operational point of view, the context of the proposed system is the same as [51, 58, 85], and in particular [58] is the work most similar to HMM-Web. The following sections provide details about (a) the feature extraction process (4.4.1), (b) the application-specific modules (4.4.2), (c) the decision module (4.4.3) and (d) the building of HMM ensembles (4.4.4). Throughout these sections we will refer to fig. 4.2, which shows the IDS processing for a search.php application, which could be used to list publications of a certain category (cat attribute) containing a key-word (key attribute). It may finally be useful to remark that in some works the term web application is used to identify a set of executables/scripts that offers a certain number of services (e.g. a search engine: main page, database interrogation, image generation). For the sake of clarity, in the following we will refer to each program or script which generates dynamic web contents as a different web application.

4.4.1 Feature Extraction

Since the aim of HMM-Web is to detect attacks that exploit flaws in the validation of the input, it models:

• The position of each attribute in the sequence. In fact, normal requests are generated by clicking on hyperlinks located somewhere in a web page: consequently, if a request is received that contains attributes in an unusual arrangement, this might be symptomatic of something bad happening.

• The value of each attribute in the sequence. For example, if the typical value of an attribute is numeric and a string is received (or vice versa), there is obviously something anomalous in the request.

Both the model of the sequence of attributes and the models of the attribute values are realized using Hidden Markov Models.

Sequence of attributes Regarding the sequence of attributes, we consider each attribute as a symbol of the sequence, as this is enough to detect anomalies either in the order of the attributes or in the presence of suspicious attributes.

Attribute values It is useful to provide a more complex codification of attribute inputs than the simple extraction of the character sequence. By scrutinising several attacks against web applications (cve.mitre.org), it is evident that, typically, non-alphanumeric characters have a higher relevance than alphanumeric characters when interpreting the meaning of attribute inputs. Non-alphanumeric characters can be used as meta-characters, with a special "meaning", during the processing made by web applications. Thus, a distinction between them (e.g. between "/" and "-") is definitely necessary. On the contrary, distinguishing between digits or between alphabetic letters is not useful to detect input validation attacks. Consequently, we substitute every digit with the symbol N and every letter with the symbol A, while all the other characters remain unchanged. For example, the attribute value "/dir/sub/1,2" becomes "/AAA/AAA/N,N". The obtained sequence of symbols is then processed by the HMM. A minimal sketch of this codification is given below.
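The following sketch is a straightforward reading of the codification rule just described (digits to N, letters to A, everything else unchanged); it is an illustration, not the exact code used in HMM-Web.

```python
def encode_attribute_value(value: str) -> str:
    """Codify an attribute value: digits -> 'N', letters -> 'A',
    all other characters (potential meta-characters) are kept as they are."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("N")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

print(encode_attribute_value("/dir/sub/1,2"))        # -> /AAA/AAA/N,N
print(encode_attribute_value("' OR user = 'alice"))  # meta-characters stand out
```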

4.4.2 Application-specific modules

As shown in figure 4.2, each application-specific module consists of: (1) an HMM ensemble which analyses the sequence of attributes (i.e. {cat, key}); (2) for each attribute found in the training queries, an HMM ensemble which analyses the input sequence for that attribute (i.e. {N,N} for cat, {A,A,A} for key). Braces and commas are not part of the sequence; we use them just to represent it. Thus, each HMM ensemble models a different feature of the web application query. Our goal is to detect an anomaly in any of these features of the query, because each of them can be related to a different attack typology. To this end, we apply the minimum rule (see Section 3.3.2) to the HMM ensemble outputs.

4.4.3 Decision module

For the aim of the following discussion, let us define, for a specific training set:

• M, the number of web applications.

• q(w_i), the set of queries for the i-th web application w_i, ∀ i ∈ [1, M]. |q(w_i)| is the number of queries for w_i.

• N = Σ_{j=1}^{M} |q(w_j)|, the total number of queries collected.

The decision module classifies a query as suspicious if its probability is under a threshold; otherwise, the query is classified as legitimate. Now, it is important to note that, for a certain web application, the basic assumption that a considerable amount of its training queries is legitimate may not hold. The attacker may exploit web applications which are rarely interrogated by users to perform some unauthorised action. For example, this is the case of applications for testing/configuration purposes (www.milw0rm.com, exploits: 6287, 6314, 6269, 5955). If attacks of this kind are inside the training set, in the worst case we model only instances of attack queries (instead of legitimate queries) for the web applications involved. To cope with this problem we consider the relative frequency of queries toward each web application, freq(w_i) = |q(w_i)| / N. This frequency reflects in some way how strong the assumption is that its queries are actually legitimate: the higher this frequency, the stronger the assumption. In the detection phase, the frequency freq(w_i) represents an estimate of the probability of (a query on) the web application w_i, and it is stored in a look-up table (figure 4.2). If we expect an overall fraction of attack queries α in the training set, it will be equal to:

α = (1/N) Σ_{i=1}^{M} α_i · |q(w_i)|    (4.1)

where α_i is the fraction of suspicious training queries toward w_i. The simplest solution may be α_i = α, that is, an equal fraction α of training queries is classified as suspicious by each application-specific module. However, this setting does not take into account how strong the assumption is that the training queries are really legitimate. Aiming at including this information, for each application-specific module S(w_i) we fix a threshold t_i so that the value of α_i is in inverse proportion to freq(w_i). In this case, in agreement with eq. 4.1, the α_i are given by

α_i = α / ( M · freq(w_i) )   ∀ i ∈ [1, M]    (4.2)

It is easy to see that, with such a setting, the smaller the frequency of a web application, the larger the fraction of its training queries classified as suspicious. In other terms, the weaker the assumption that the web application queries are legitimate, the bigger the fraction of training queries classified as suspicious. Thus, the α parameter is used to estimate the threshold t_i to be used in the detection phase, as α can be considered an overall "confidence factor" for the legitimacy of the training queries.
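The sketch below computes the per-application rejection fractions α_i of eq. (4.2) from the query counts of each web application; the application names, query counts and the overall value of α are illustrative assumptions, not the values observed in the dataset used in this work.

```python
# Per-application rejection fractions alpha_i = alpha / (M * freq(w_i)), eq. (4.2).
query_counts = {"search.php": 12000, "news.php": 3000, "testcfg.php": 400}  # assumed |q(w_i)|
alpha = 0.01   # assumed overall fraction of attack queries, eq. (4.1)

N = sum(query_counts.values())
M = len(query_counts)
alpha_i = {app: alpha / (M * (count / N)) for app, count in query_counts.items()}

# Rarely queried applications get a larger fraction of rejected training queries.
for app, a in alpha_i.items():
    print(f"{app:12s} freq={query_counts[app] / N:.4f} alpha_i={a:.4f}")

# Consistency with eq. (4.1): (1/N) * sum_i alpha_i * |q(w_i)| == alpha
assert abs(sum(alpha_i[a] * query_counts[a] for a in alpha_i) / N - alpha) < 1e-12
```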

4.4.4 HMM building

In this work we address only two out of the three basic problems for HMM (see Section 3.2.1): the Training (Learning) problem, during the training phase, and the Evaluation problem, during the detection phase. We use the well-known Baum-Welch algorithm to train the HMM [79]. As the algorithm may find only a local maximum of the likelihood function (that is, the HMM models well only a subset of the training sequences), we use an ensemble of HMM in order to better model the whole training set. Moreover, HMM performance depends on parameters such as the number of states, the initial state distribution, the symbol emission matrix and the state transition matrix. As the estimation of the best suited values of the HMM parameters is more art than science, the use of an ensemble of HMM can counterbalance this lack of knowledge. We set an equal number of states for each HMM inside an ensemble. This number is equal to the average effective length of the training sequences, rounded up to the nearest integer, where the effective length of a sequence is defined as the number of different symbols in the sequence. For example, in the sequence {a,b,c,b,c} there are 3 different symbols, a, b and c; consequently, 3 is the effective length of this sequence (a sketch of this heuristic is given below). As a consequence of the heuristic we used to set the number of states, each state can be associated with an element of the analyzed sequence, rather than with a particular state of the web application. Both the state transition and the symbol emission matrices are randomly initialised. Considering IDS performance, this choice seems reasonable: our IDS consists of a large number of HMM, and using the a priori knowledge of the problem to model the structure of the matrices could be a time- and effort-expensive task. Finally, we build the dictionary of symbols by extracting them from the training sequences. This means assuming that no a priori information about the dictionary of legitimate symbols is available. HMM in the same ensemble use the same dictionary.
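The sketch below shows the heuristic used to choose the number of states: the effective length of each training sequence (its number of distinct symbols) is averaged and rounded up. The example sequences are illustrative.

```python
import math

def effective_length(sequence) -> int:
    """Number of distinct symbols in a sequence, e.g. {a,b,c,b,c} -> 3."""
    return len(set(sequence))

def num_states(training_sequences) -> int:
    """Number of HMM states: average effective length, rounded up."""
    avg = sum(effective_length(s) for s in training_sequences) / len(training_sequences)
    return math.ceil(avg)

train = [list("abcbc"), list("/AAA/AAA/N,N"), list("NN")]
print([effective_length(s) for s in train])   # -> [3, 4, 1]
print(num_states(train))                      # -> 3
```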

4.4.5 Fusion of HMM outputs

In principle, the best fusion rule for the HMM inside an ensemble is unknown. However, it may be useful to refer to a theoretical analysis of the HMM outputs. Given an input sequence s, the output of the i-th HMM m_i (out of K HMM) in the ensemble can be written as

p(s | m_i) = p(m_i | s) · p(s) / p(m_i).    (4.3)

We can set the same a priori probability for all models, that is,

p(m_i) = c,   ∀ i ∈ [1, K].    (4.4)

This is a reasonable assumption since, in principle, all models are equally valid. It is easy to see that, when the maximum fusion rule is used,

output = max{ p(s | m_i), i ∈ [1, K] },    (4.5)

the output is proportional to max{ p(m_i | s), i ∈ [1, K] } (the term p(s) is a constant). So, using the maximum rule, we select the model in the ensemble that best "describes" the analyzed sequence to compute the probability of the sequence. This reasoning is in agreement with the original goal of an HMM ensemble, that is, to exploit the diversity of multiple HMM to better model the whole set of training sequences. To confirm this hypothesis we also performed several preliminary experiments, which confirmed that the maximum rule performs definitely better than the other static rules.

4.5 HMM-Web Evaluation

4.5.1 Dataset and performance evaluation

In order to test our IDS in a realistic scenario, we collected a dataset of queries from a production web server of our Academic Institution. Web application queries are extracted from the web server logs. In particular, we selected all requests whose method is GET and that received a successful response (status code 2xx, as described in RFC 2616). Then, the web application and its input query are obtained from the Uniform Resource Identifier (URI), by considering the web server configuration and its URI parsing routine. It is worth noting that, if we wanted to monitor web application inputs also for POST requests, we would have to acquire the features in another way (i.e. through a specialised module inside the web server), as those queries are not stored inside the web server logs. Let us call D the dataset of queries collected from the web server. It consists of more than 150,000 queries collected over a period of six months. The queries are distributed over a total of 52 web applications: 24 of these applications provide services for registered users, while the remaining 28 provide public services. Information about the dataset is summarized in table 4.1.

Table 4.1: Principal characteristics of dataset D. The columns contain, respectively: the number of queries, the duration of the collection period, and the number of applications for administration (Admin) and public (Pub) services.

Dataset D
Queries     Collection Period     Admin     Pub
154,036     183 days              24        28

As D consists of a set of real requests from the web server log files, it may contain both normal and attack queries. Our first goal is to assess the IDS performance in terms of false alarm rate and detection of attacks similar to those which may be inside the training queries. To this end, each query inside D has been labelled as attack or legitimate through a semi-automatic process, further described in section 4.5.1. Furthermore, D has been split randomly into 5 parts (all containing the same number of queries) in order to apply a 5-fold cross-validation. In this 5-fold cross-validation the IDS is trained 5 times, each time with a different part: it is tested on the union of the remaining four, and the overall performance is obtained by averaging over the different training sets. As we are dealing with real traffic, each split of D will contain unknown attacks and, as a consequence of the random sampling, we assume the percentage of attacks to be the same in all five splits. Exploiting the labelling process, we evaluated both the false positive rate (FPR) and the detection rate (DR) on D. In order to evaluate the detection rate, we selected a set of attacks published on www.milw0rm.com and used these as a basis to build a dataset (called A) of attacks exploiting specific vulnerabilities of our set of applications. This dataset consists of 19 SQL Injection and 19 Cross Site

Table 4.2: References for attacks inside the XSS-SQL dataset. Attacks are taken from www.milw0rm.com. For each attack, the number identifying the exploit and the number of the paper where the vulnerability is described are provided.

Attack Type      Exploit N.                                                          Paper N.
SQL Injection    6512, 6510, 6502, 6490, 6469, 6467, 6465, 6449, 6336, 3490, 5507    16, 174, 202, 215
XSS              2776, 2881, 2987, 3405, 3490, 4681, 4989, 6332                      162, 173, 192

This dataset consists of 19 SQL Injection and 19 Cross Site Scripting (XSS) attacks, targeting a subset of 18 web applications (see Table 4.2). It is worth noting that, due to privacy issues and the problem formulation, public datasets are not available; this is also the reason why the datasets used for performance evaluation in related works [51, 58] are not public. There are, however, publicly available attacks, such as those referred to in [50] and the set of attacks used in [51], since they are related to known vulnerabilities of widely deployed, open-source web applications. On the other hand, in practice, web applications which manage critical information (e.g. public administration, home banking) are typically highly customised and their source code is not public. This also reflects our case, and it is the reason why the attacks inside A are derived from well-known attacks, representing versions of them customised against the applications in our set.

Attack queries inside dataset D

In order to distinguish between attacks and legitimate queries in the training set, we exploit the IDS itself. As attack queries are typically few in number with respect to legitimate queries, and their structure is typically different (this is definitely evident from the working exploits in Table 4.2), we expect that they will receive a lower probability than legitimate queries. Our experiments fully confirmed this behaviour. However, it may not be possible to fully automate the labelling process without falling into error, e.g. simply because dataset D does not contain enough information to do so. Thus, for each web application w_i having a relatively high number of queries, we identified (and labelled) attack queries inside the training set by manually inspecting the queries which received the lowest probability. For web applications having a relatively low amount of queries, we inspected all training queries, because an attack query might not receive a lower probability than a legitimate query (e.g. when the majority of queries are attacks), as discussed in Sec. 4.4. In spotting attack queries inside D, we also exploited additional information. We assessed legitimate inputs through the links contained inside the web pages, when browsing as a typical user. Moreover, in evidencing attack queries, we exploited our expertise regarding typical attacks and, possibly, the corresponding output generated by the web applications. In such a way, for each application w_i we computed the corresponding fraction of attacks α*_i inside D. Using such a method, we found an overall fraction of attacks α* = Σ_{i=0}^{M} α*_i = 0.995% inside D. These attacks were very similar to each other and were mainly related to the injection of HTML code. As an example, we found the following (attack) request URI:

/app?attr=h

where /app and attr are, respectively, the URI which identifies the vulnerable web application and its attribute. The web application is vulnerable because its output contains the string submitted as input for attr, without any validation/sanitization. In this case, the attacker added an arbitrary link to the page, perhaps to make the link more "popular" by exploiting the indexing of search engines. It is worth noting that the system administrator did not notice such attacks, simply because they did not cause any (permanent) software damage.

Figure 4.3: Distribution of queries and percentage of attacks for the 14 most frequent web applications.

To conclude this section, we present in Figure 4.3 two histograms which represent the percentages of attack queries received by each web application. In particular, the top of the figure is a histogram of the number of queries received by the 14 most requested applications on the web server. The bottom of the figure represents the percentage of attack queries received by each web application. The histograms clearly confirm our intuition that the lower the number of queries, the weaker the assumption that they are legitimate. In fact, we observe that as the number of queries toward a certain application decreases, the percentage of attacks among them tends to increase (about 10% of attacks for the 9th application). The applications after the 9th (excluding the 13th) are not attacked at all. These are applications used for administrative tasks and consequently are not publicly available. Thus, to spot their presence on the web server an attacker would have to scan the web server. This is not hard, but obviously the publicly exposed applications are more subject to being attacked.

4.5.2 Experimental Results

In this section we summarise the experimental results obtained when (1) a single HMM and (2) multiple HMM are used to model a generic sequence. Our IDS has always been able to detect all attacks inside dataset A, so in the following we focus our analysis on the evaluation of FPR/DR on dataset D (average values over all splits). Figure 4.4 shows a comparison of the results obtained using our query codification strategy (see Section 4.4.1) with those obtained using the codification proposed in [58]. In both cases a single HMM per ensemble has been used. As we can see, the solution we propose to model query parameters is really effective in heavily reducing the false alarm rate. In particular, we evaluated the IDS performance for different values of α, since in real scenarios we cannot rely on a reliable estimation, but only on raw estimates. With a small expense in terms of false alarm rate, a positive value of α definitely enhances the detection rate of attacks, and it is evident that a precise value is not necessary. For example, if we estimate α ≈ 3% (remember that the attacks in D amount to about α* ≈ 1%), we are able to detect 96% of the attacks while raising a fraction of false alarms lower than 1%. The point α = 0 identifies the IDS performance when we are fully confident in the legitimacy of the training queries, as in [58]. It is evident that for the proposed query codification the lowest amount of false alarms is obtained (about 0.4%), but a significant part (about 15%) of the attacks inside D cannot be detected.
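As an illustration of the role of α, one plausible way to turn an estimated attack fraction into a decision threshold is to place the threshold at the α-quantile of the probabilities assigned to the training queries; the sketch below (Python, illustrative only and not necessarily the exact procedure of Section 4.4) shows this idea.

    import numpy as np

    def threshold_from_alpha(training_scores, alpha):
        # training_scores: probabilities assigned by the ensemble to training queries.
        # alpha: estimated fraction of attacks hidden in the training set.
        # Queries scoring below the returned threshold are flagged as anomalous.
        return np.quantile(training_scores, alpha)

    # With alpha = 0.03 the 3% lowest-scoring training queries fall below the
    # threshold, so similar queries seen at test time raise an alert.
    scores = np.random.default_rng(0).random(1000)   # placeholder training scores
    theta = threshold_from_alpha(scores, alpha=0.03)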

Figure 4.4: Average Detection Rate and False Positive Rate for different values of α and a single HMM per ensemble. Our approach for query codification (on the left) outperforms the solution proposed in [58].

Results obtained with a single HMM are compared with those obtained using 3, 5 and 7 HMM per ensemble in Figure 4.5. For small values of α, the IDS takes advantage of being based on an ensemble of classifiers. In fact, at the same false positive rate the detection rate increases with the number of HMM. This is a reasonable result, because using more models the IDS is able to better model the information inside the training set.

Figure 4.5: Average Detection Rate and False Positive Rate for different values of α, either with a single or with multiple HMM per ensemble.

However, the improvement in performance is not as large as we expected. This is substantially due to the fact that, with the codification of the attributes (Section 4.4.1), the structure of the strings processed by the HMM turns out to be quite simple. Consequently, a single classifier is already able to provide an accurate model. In some preliminary experiments performed using the codification of [58] we observed that the improvement in performance when using multiple HMM was larger. This is because that codification of the attributes is less powerful, and in that case the ensemble allowed a significant improvement of the accuracy. Moreover, as the value of α increases, the 4 curves get closer to each other. This may be explained by noting that increasing α can lead to a heavy modification of the thresholds (in a complex relationship), which may overwhelm the advantages of a more thorough computation of the query probability.

Chapter 5

Network Based Intrusion Detection Applications

In theory, one can build provably secure systems. In theory, theory can be applied to practice, but in practice, it can’t. Marc Dacier - EURECOM Institute

A strategy for the defense of a computer network cannot set aside the monitoring of the network activity. In fact, very often when an attack occurs there is evidence in the network activity that something bad is happening. Consider, for example, the case of a malware that infects a machine. Once the machine has been compromised, the malware usually scans the network looking for other possible victims. Another case might be that of an attacker who scans the hosts inside a network to discover which ports are open on each machine. This is only the first step of an attack attempt, where the attacker is trying to assess which services are running on each host with a view to exploiting potential vulnerabilities. If such an attempt is detected, the attack might be stopped even before any of the hosts within the network is compromised. Thus, monitoring the network activity is crucial not only to detect intrusion attempts but also to prevent attacks from happening. In addition, Network-based IDS offer the great advantage that a single sensor which monitors a network segment may protect several hosts at the same time.


Figure 5.1: A Network IDS Sensor


Figure 5.1 shows a simple example. A single network sensor might be used to protect a web server, a file server and a mail server at the same time. To protect these machines with a Host-based IDS, we would have to place a different sensor on each host. Like Host-based IDS, Network-based IDS can be categorized into signature-based and anomaly-based. A well-known example of a signature-based IDS is SNORT [6]. As already explained, in this thesis we are interested in anomaly-based IDS, such as ADAM or PAYL [83, 96]. A further categorization distinguishes IDS depending on which part of the packet the IDS looks at to discover anomalies. In fact, an IDS might look either at the IP header or at the application layer payload (or, eventually, at both). IDS that look at the IP header take into account information such as the source and destination addresses and ports. On the contrary, IDS based on the analysis of the payload model the data generated by application layer protocols such as HTTP, FTP and SMTP. To understand which kind of attacks might be detected by a payload-based or by a header-based IDS, it is useful to consider the categorization of anomalies in hostile traffic provided by Mahoney [69].

• User Behavior. Hostile traffic may have a novel source address because it comes from an unauthorized user of a restricted (password protected) service. Also, probes such as ipsweep and portsweep may attempt to access nonexistent hosts and services, generating anomalies in the destination addresses and port numbers [54].

• Bug Exploits. Attacks often exploit errors in the target software, for example, a buffer overflow vulnerability. Such errors are the most likely to be found in the least-used features of the program, because otherwise the error is likely to have been discovered during normal use and fixed in a later version. Thus, any remaining errors are involved only with unusual inputs which are not likely to occur during normal use.

• Response Anomalies. Sometimes a target will generate anomalous outgoing traffic in response to a successful attack, for example, a victimized mail server passing root shell command responses back to an attacker.

• Bugs in the attack. Attackers typically must implement client protocols themselves, and will fail to duplicate the target environment, either out of carelessness or because it is not necessary. For example, many text-based protocols such as FTP, SMTP and HTTP allow both uppercase and lowercase. An attacker may use lowercase for convenience, even though normal clients might always use uppercase.

• Evasion. Attackers may deliberately manipulate network protocols to hide an attack from an improperly coded intrusion detection system (IDS) monitoring the application layer [78]. Such methods include IP fragmentation, overlapping TCP segments that do not match, deliberate use of bad checksums, short TTL values, and so on. Such events must be rare in normal traffic, or else the IDS would have been written to handle them properly.

In this thesis we are interested in detecting attacks against web applications. The most frequent attacks against web applications (see Section 4.2) all fall into the fourth category. Since they are carried by HTTP requests, in this Chapter we will focus our attention on IDS based on the analysis of the HTTP payload.

The rest of the Chapter is organized as follows. Section 5.1 provides an overview of payload-based IDS, with a survey of recent related works and a brief discussion of how payload-based IDS can be evaded. Section 5.2 provides a description of McPAD, an IDS based on an ensemble of One-class SVM classifiers. Section 5.3 describes the experimental setup on which McPAD has been evaluated and reports the experimental results. In Section 5.4 we describe HMMPayl, an IDS based on Hidden Markov Models. Finally, Section 5.5 contains the experimental results on HMMPayl.

5.1 Payload based anomaly detection

Basically, the assumption behind the analysis of the HTTP payload to detect attacks against web applications is that the distribution of bytes within attack payloads is different with respect to that of legitimate requests. In order to make this clearer, let us consider a few examples. First, we consider a legitimate payload. As we briefly discussed in Section 4.1, an HTTP request has the following structure:

• A request URI, which specifies the requested resource.

• A set of Header-value pairs that are used by the client to provide additional information to the web server. For example, with the User-agent header, the client host notifies the web server of the type and version of the web browser. This information can be used by the web server to optimize the response sent back to the client according to the version of the browser.

Figure 5.2 provides a simple example of a legitimate HTTP request. It is clear that if we calculated the byte distribution within this payload, we would obtain a distribution quite similar to that of an English text.

GET /pra/ita/home.php HTTP/1.1
Host: prag.diee.unica.it
Connection: Keep-alive
Accept: text/*, text/html
Accept-Encoding: compress, gzip
Accept-Language: it, en-gb

Figure 5.2: An example of legitimate HTTP payload

Now, let us consider the two examples of attacks shown in Figures 5.3 and 5.4, respectively.

HEAD / aaaaaaaaa ... aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Figure 5.3: Long Request Buffer Overflow attack. Bugtraq: 5136

The attack in Figure 5.3 contains a sequence of 4,096 "a" characters (we omitted a large number of them for typesetting constraints) which overfills a buffer, producing a Denial of Service. The byte distribution in this packet is definitely anomalous with respect to that of a normal packet, as a consequence of the pronounced peak corresponding to the "a" character. Any statistical model of the payload would spot the anomaly, allowing the attack to be detected. If the attack also injected executable code into the victim machine, that code would appear in the packet after the sequence of "a". That would be an even easier situation for the IDS, as the injected code would also contain non-ASCII characters. In general, the attacks that aim to overflow a buffer in the victim machine contain just the request URI. Consequently, they lack the sequence of HTTP headers (and corresponding values) that are present in normal payloads.

GET /d/winnt/system32/cmd.exe?/c+dir HTTP/1.0
Host: www
Connnection: close

Figure 5.4: URL decoding error attack. Microsoft: MS01-020

The attack in Figure 5.4 exploits a fault of the web server in decoding the URL: as a consequence of the decoding error, a Denial of Service is caused. It is quite evident that the packet carrying an attack like this looks far more similar to a normal one than the packet in Figure 5.3. More generally, the payload statistics of attacks similar to the above URL decoding example are not so far from those of a normal payload. It follows that a simple statistical model of the payload is sufficient to spot attacks similar to the one in Figure 5.3, while a more detailed representation is necessary to detect attacks such as the one in Figure 5.4.
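To make the byte-distribution argument concrete, the short sketch below (illustrative Python, not part of any of the IDS described here) computes the 1-gram relative frequencies of a payload: for a request in the style of Figure 5.3 the probability mass concentrates almost entirely on the byte "a", while for a request in the style of Figure 5.4 the histogram remains close to that of ordinary ASCII text.

    from collections import Counter

    def byte_distribution(payload: bytes):
        # Relative frequency of each byte value (1-gram analysis) in the payload.
        counts = Counter(payload)
        total = len(payload)
        return {byte: n / total for byte, n in counts.items()}

    overflow = b"HEAD / " + b"a" * 4096                           # figure 5.3 style
    decoding = b"GET /d/winnt/system32/cmd.exe?/c+dir HTTP/1.0"   # figure 5.4 style
    print(max(byte_distribution(overflow).values()))   # close to 1: one dominant byte
    print(max(byte_distribution(decoding).values()))   # much smaller: text-like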

5.1.1 State of the Art

A number of payload-based anomaly IDS have been proposed in the literature. In the following we briefly review the main characteristics of previously proposed solutions. In particular, the next subsection is completely dedicated to PAYL, which is by far the most relevant work in payload-based anomaly detection. Afterwards, we briefly describe other related works.

PAYL

PAYL [96] is a payload-based anomaly detector. In PAYL, intrusions are detected by analyzing the distribution of bytes inside the HTTP payload. In particular, the analysis performed by PAYL is known as n-gram analysis and was originally used for text classification [25]. With n-gram analysis, a payload is represented through a vector containing the relative frequencies of n-grams, that is, sequences of n contiguous bytes. If n = 1, the histogram of the byte distribution in the payload is obtained. The relative position of different bytes inside the payload is not taken into account, so the structure of the payload is not modeled. To model the structure of the payload, a value of n ≥ 2 should be considered. Unfortunately, the representation of the payload by n-gram analysis generates a feature space of size 256^n. It is easy to see that as the value of n increases the problem quickly becomes intractable, and we would never see enough training data to properly fit a full n-gram distribution. This is the reason why in a real scenario a value of n greater than 2 can't be used. Another element that must be considered is that the distribution of n-grams changes with the length of the payload. For example, large payloads are more likely to contain non-printable characters, which are typical of media formats and binary representations. To take into account different payload lengths, for each host i and port j a model M_ij is created for each payload length encountered in the training set. Thus, a set of profiles is computed for every possible payload length. A second phase clusters the profiles to increase accuracy and decrease resource consumption. The developers of PAYL presented an improved version in [94]. In particular, the new version builds a number of models for each packet length, and performs inbound and outbound traffic correlation to detect the propagation of worms.

Other works

In [60] Kruegel et al. describe a service-specific intrusion detection system. They combine the type, length and payload byte distribution of the service requests to build a statistical model of normal traffic that is used to compute an anomaly score. NETAD monitors the first 48 bytes of IP packets [69]. Thus, in the case of TCP packets, the model includes at most the first 8 bytes of the payload. A number of separate models are constructed corresponding to the most common network protocols. An anomaly score is computed in order to detect rare events. Sekar et al. propose a solution that combines anomaly-based and specification-based intrusion detection with the aim of mitigating the weaknesses of the two approaches [84]. Estévez-Tapiador et al. address the problem of measuring the normality of HTTP traffic for the purpose of anomaly-based intrusion detection [33]. In the same work they also propose an anomaly-detection technique based on Markov chains. They split the HTTP payload into a certain number of contiguous blocks. Each block is quantized according to a previously trained codebook. Then, the resulting sequence of blocks is analyzed using Markov chains. In [95] Wang et al. propose a solution called ANAGRAM where n-grams are extracted from both legitimate and intrusive traffic. ANAGRAM stores all the n-grams extracted from the normal traffic in a Bloom filter. Another Bloom filter stores the n-grams extracted from known malicious packets. At detection time, the packet is scored on the basis of the number of unobserved n-grams, and the number of malicious n-grams is used to weight the score. In ANAGRAM the problem is represented as that of distinguishing between two classes of patterns, namely the normal class and the malicious class. It is worth noting that in this work we adopt a one-class approach, as we do not consider anomalous patterns for training; ANAGRAM is therefore fundamentally different from our solutions. Bolzoni and Zambon propose POSEIDON, an anomaly-based network IDS [18]. Like the majority of IDS based on payload analysis, POSEIDON is packet-oriented, that is, it does not reassemble packets to reconstruct the whole connection. POSEIDON is based on a two-tier architecture. The first tier is based on Self-Organizing Maps (SOM), which are used to cluster packets without taking into account properties such as the length, the destination port and the semantics of the data. Next, a second tier analyzes the byte distribution in the payload using a modified version of the PAYL algorithm [96].

5.1.2 Evading Payload-based IDS

Polymorphism is a strategy for creating instances of the same attack that appear different from each other. It is not new, having been introduced by Dark Avenger in 1992. The aim of polymorphism is obviously to evade both IDS and anti-virus systems that are signature-based: since every instance of the attack appears different from the others, signatures extracted from a particular instance are not effective in detecting the other instances. Several strategies to realize polymorphic attacks exist [57]. Without entering into the details of a particular strategy, the basic idea is that the attack is first encrypted (to make it unrecognizable by the IDS or the anti-virus) and then decrypted to be executed once the vulnerability in the victim machine has been exploited. Thus, three different components can generally be distinguished in a polymorphic attack. They are:

• The Attack Vector. It is the part of the attack that exploits the vulnerability in the victim machine. For example, if the attack exploits a buffer overflow vulnerability, the attack vector is the sequence of characters that overfills the buffer.

• The Attack Body. It is the code that implements the malicious action the attack has been designed to perform. For instance, it might be the code of a worm.

• The Decryptor. It is the section of the code that decrypts the shellcode: it decrypts the attack body and passes control to it. Polymorphism of the decryptor can be achieved using various code obfuscation techniques.

Tools like CLET, adMutate and tPE [27, 61, 98] perform advanced code polymorphism and are available on the web. More recently, Mason et al. proposed a technique for automatically producing "English Shellcode", transforming arbitrary shellcode into a representation that is superficially similar to English prose [71]. The shellcode is self-contained, i.e., it does not require an external loader and executes as valid IA-32 code. Anomaly detection systems offer a defense against polymorphic attacks because typically each attack instance still looks different from normal data. For example, the payload of an attack packet may contain some non-printable characters or an unusual byte structure, whereas the payload of a normal packet predominantly contains ASCII characters with a predefined structure, as required by the application protocol. In 2006, Fogla et al. proposed the Polymorphic Blending Attack (PBA, [37]), an attack able to evade high-speed network-based anomaly IDS such as PAYL [94, 96]. The hard real-time constraints deriving from the high network speed require the statistics extracted from the payload to be simple and easily computable. An attacker can exploit this weakness in modeling the payload by creating an attack such that the payload, as measured by the IDS, will match the normal profile. The PBA is for network-based IDS the equivalent of mimicry attacks for host-based systems [93]. The creation of a Polymorphic Blending Attack is performed in three steps:

• Learning the IDS Normal Profile. The task is that of observing the traffic toward a target machine inside the target network, in order to generate a profile close to that used by the IDS that protects the target network.

• Attack Body Encryption. After the normal profile has been created, the attacker encrypts (and blends) the attack instance to make it match the normal profile. This can be done with a joint use of byte substitution and padding techniques.

• Polymorphic Decryptor. The decryptor has to remove the extra padding and to decrypt the attack with a substitution table which is the inverse of that used to encrypt it. For example, a decode table for a one-to-one mapping can be stored in an array where the i-th entry of the array represents the normal character used to substitute the attack character i.

Finally, the creation of a PBA is a hard (NP-complete) problem, but it can be reduced to the SAT (satisfiability) or ILP (integer linear programming) problem.
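The byte-substitution idea behind the blending step can be illustrated with the following sketch (Python, purely illustrative: a real PBA must also handle padding, keep the decryptor itself within the normal profile, and requires at least as many distinct "normal" bytes as distinct attack bytes for a one-to-one mapping).

    def build_tables(attack_bytes, normal_bytes):
        # Map each distinct attack byte to a byte observed in normal traffic
        # (one-to-one), and build the inverse table used by the decryptor.
        normal_pool = sorted(set(normal_bytes))
        encode = {b: normal_pool[i] for i, b in enumerate(sorted(set(attack_bytes)))}
        decode = {v: k for k, v in encode.items()}
        return encode, decode

    attack = bytes(range(0x80, 0x90)) * 4              # placeholder "attack body"
    normal = b"GET /index.html HTTP/1.1 Host: www"     # placeholder normal traffic
    enc, dec = build_tables(attack, normal)
    blended = bytes(enc[b] for b in attack)            # looks like normal bytes
    restored = bytes(dec[b] for b in blended)          # the decryptor recovers the body
    assert restored == attack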

5.2 McPAD - Multiple classifiers Payload Anomaly Detector

In this section we present McPAD, a payload anomaly detector based on an ensemble of one-class SVM classifiers. A simplified view of McPAD is depicted in Figure 5.5. When a packet is received by McPAD, the payload is extracted in order to perform Feature Extraction. After feature extraction, the same payload is represented in m different feature spaces. In addition, a clustering algorithm is applied to reduce the dimensionality of each feature space. Next, the payload is processed by m different classifiers, each one working in a different feature space. Finally, the outputs of the classifiers are combined in a fusion stage. All these steps are described in detail in the following sections. In addition, in Section 5.2.4 we formally analyze the complexity of our classification system. In Section 5.3 an evaluation of McPAD's performance is provided.

Figure 5.5: Overview of McPAD

5.2.1 Feature Extraction

As we mentioned above, the detection model used by PAYL [96] is based on the frequency distribution of the n-grams (i.e., the sequences of n consecutive bytes) in the payload. The occurrence frequency of the n-grams is measured by using a sliding window of length n. The window slides over the payload with a step equal to one byte and counts the occurrence frequency in the payload of the 256^n possible n-grams. Therefore, in this case the payload is represented by a pattern vector in a 256^n-dimensional feature space. It is easy to see that the higher n, the larger the amount of structural information extracted from the payload. However, using n = 2 we already obtain 65,536 features. Larger values of n are impractical, given the exponentially growing dimensionality of the feature space and the curse of dimensionality problem [30].

In order to solve this problem we propose to measure the occurrence frequency of pairs of bytes that are ν positions (i.e., ν bytes) apart from each other in the payload. This allows us to efficiently extract some information related to the n-grams, with n > 2. We call such pairs of bytes 2ν-grams. Regardless of the value of the parameter ν, measuring the 2ν-grams extracts 256² features. As we will discuss in the following, by combining the information extracted using 2ν-grams with different values of ν we can somehow (partially) reconstruct the information that we would extract by directly measuring the frequency of n-grams, with n > 2.

In practice, the occurrence frequency of the 2ν-grams can be measured by using a (ν+2)-long sliding window with a "gap" between the first and last byte. Consider a payload B = [b_1, b_2, .., b_l], where b_i is the byte value at position i. The occurrence frequency in the payload B of an n-gram β = [β_1, β_2, .., β_n], with n < l, is computed as

f(β|B) = (# of occurrences of β in B) / (l − n + 1)   (5.1)

where the number of occurrences of β in B is measured by using the sliding window technique, and (l − n + 1) is the total number of times the window can "slide" over B. f(β|B) can be interpreted as an estimate of the probability p(β|B) of finding the n-gram β (i.e., the sequence of consecutive bytes [β_1, β_2, .., β_n]) in B. Accordingly, the probability of finding a 2ν-gram {β_1, β_{ν+2}} can be written as

p({β_1, β_{ν+2}}|B) = Σ_{β_2, .., β_{ν+1}} p([β_1, β_2, .., β_{ν+1}, β_{ν+2}]|B)   (5.2)

where the summation is over all the possible combinations of β_2, .., β_{ν+1}. It is worth noting that for ν = 0 the 2ν-gram technique reduces to the "standard" 2-gram technique. When ν > 0, the occurrence frequency in the payload of a 2ν-gram {β_1, β_{ν+2}} can be viewed as a marginal probability computed on the distribution of the (ν+2)-grams that start with β_1 and end with β_{ν+2}.

From the occurrence frequency of the n-grams it is possible to derive the distribution of the (n−1)-grams, (n−2)-grams, etc. On the other hand, measuring the occurrence frequency of the 2ν-grams does not allow us to automatically derive the distribution of the 2_{ν−1}-grams, 2_{ν−2}-grams, etc. The distributions of 2ν-grams with different values of ν give us different structural information about the payload. The intuition is that, ideally, if we could somehow combine the structural information extracted using different values of ν = 0, .., N we would be able to (at least partially) reconstruct the structural information given by the distribution of n-grams, with n = (N + 2).
This intuition motivates the combination of classifiers that work on different descriptions of the payload obtained using the 2ν-gram technique with different values of ν.
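A minimal sketch of this measurement (Python, illustrative code, not the McPAD implementation): for a given gap ν, the occurrence frequency of each pair of bytes that are ν positions apart is accumulated into a 256×256 feature vector, following equation (5.1).

    import numpy as np

    def two_nu_gram_features(payload: bytes, nu: int) -> np.ndarray:
        # Frequency of 2_nu-grams: pairs of bytes separated by a gap of nu bytes.
        # A window of length nu + 2 slides over the payload one byte at a time;
        # only its first and last bytes are used, giving 256*256 features.
        features = np.zeros((256, 256))
        n_windows = len(payload) - (nu + 2) + 1
        if n_windows <= 0:
            return features.ravel()
        for i in range(n_windows):
            first, last = payload[i], payload[i + nu + 1]
            features[first, last] += 1.0
        return features.ravel() / n_windows

    # nu = 0 reduces to the standard 2-gram analysis; larger gaps capture the
    # longer-range structure that approximates higher-order n-grams.
    x0 = two_nu_gram_features(b"GET /index.html HTTP/1.1", nu=0)
    x3 = two_nu_gram_features(b"GET /index.html HTTP/1.1", nu=3)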

5.2.2 Dimensionality Reduction

Payload anomaly detection based on the frequency of n-grams is analogous to a text classification problem for which the bag-of-words model and a simple unweighted raw frequency vector representation [66] are used. The different possible n-grams can be viewed as the words, whereas a payload can be viewed as a document to be classified. In general, for text classification only the words that are present in the documents of the training set are considered. This approach is not suitable in the case of a one-class classification problem. Given that the training set contains (almost) only target examples (i.e., "normal" documents), we cannot conclude that a word that has a probability equal to zero of appearing in the training dataset will not be discriminant. As a matter of fact, if we knew of a word w that has probability p(w|d_t) = 0, ∀ d_t ∈ C_t, of appearing in the class of target documents C_t, and probability p(w|d_o) = 1, ∀ d_o ∈ C_o, of appearing in documents of the outlier class C_o, it would be sufficient to measure just one binary feature, namely the presence or absence of w in the document, to construct a perfect classifier. This is the reason why we choose to take into account all the 256^n n-grams, even though their occurrence frequency measured on the training set is equal to zero. Using the 2ν-gram technique we still extract 256² features. This high number of features could make it difficult to construct an accurate classifier, because of the curse of dimensionality [30] and possible computational complexity problems related to learning algorithms.

A number of techniques have been proposed in the literature to reduce the dimensionality of the feature space for text classification [28, 44, 55]. One of the most common techniques is feature selection. However, in the case of one-class classification problems the application of feature selection algorithms is in general not straightforward [88]. In order to reduce the dimensionality of the feature space for payload anomaly detection, we apply a feature clustering algorithm originally proposed by Dhillon et al. in [28] for text classification. Given the number of desired clusters k, which is chosen a priori, the algorithm first randomly splits the features into k groups. Then, the features are iteratively moved from one of the k clusters to another until the information loss due to the clustering process is less than a certain threshold τ. This clustering algorithm has the property of reducing the within-cluster and among-clusters Jensen-Shannon divergence [28] computed on the distribution of words, and has been shown to achieve better classification accuracy with respect to other feature reduction techniques for text classification [28]. The inputs to the algorithm are:

1. The set of distributions {p(C_i|w_j) : 1 ≤ i ≤ m, 1 ≤ j ≤ l}, where C_i is the i-th class of documents, m is the total number of classes, w_j is a word and l is the total number of possible different words in the documents.

2. The set of all the priors {p(w_j), 1 ≤ j ≤ l}.

3. The number of desired clusters k.

4. The tolerable information loss τ.

The output is represented by the set of word clusters W = {W_1, W_2, .., W_k}. Therefore, after clustering, the dimensionality of the feature space is reduced from l to k. In the original l-dimensional feature space, the j-th feature of a pattern vector x_i represents the occurrence frequency f(w_j|d_i) of the word w_j in the document d_i. The new representation x'_i of d_i in the k-dimensional feature space can be obtained by computing the features according to

f(W_h|d_i) = Σ_{w_j ∈ W_h} f(w_j|d_i),   h = 1, .., k   (5.3)

where f(W_h|d_i) can be interpreted as the occurrence frequency of the cluster of words W_h in the document d_i. In the case of a one-class problem, m = 2 and we can call C_t the target class and C_o the outlier class. The posterior probabilities {p(C_i|w_j) : i = t,o, 1 ≤ j ≤ l} can be computed as

p(C_i|w_j) = p(w_j|C_i) p(C_i) / [ p(w_j|C_t) p(C_t) + p(w_j|C_o) p(C_o) ],   i = t,o,  1 ≤ j ≤ l   (5.4)

and the priors {p(w_j), 1 ≤ j ≤ l} can be computed as

p(w_j) = p(w_j|C_t) p(C_t) + p(w_j|C_o) p(C_o),   1 ≤ j ≤ l   (5.5)

The probabilities p(w_j|C_t) of finding a word w_j in documents of the target class C_t can be reliably estimated on the training dataset, whereas it is difficult to estimate p(w_j|C_o), given the low number (or the absence) of examples of documents in the outlier class C_o. Similarly, it is difficult to reliably estimate the prior probabilities p(C_i) = N_i/N, i = t,o, where N_i is the number of training patterns of the class C_i and N = N_t + N_o is the total number of training patterns. Given that N_o ≪ N_t (or even N_o = 0), the estimated priors are p(C_o) ≃ 0 and p(C_t) ≃ 1, which may be very different from the real prior probabilities.

In our application, the words w_j are represented by the 256² possible different 2ν-grams (with a fixed ν). In order to apply the feature clustering algorithm, we estimate p(w_j|C_t) by measuring the occurrence frequency of the 2ν-grams w_j on the training dataset, and we assume a uniform distribution p(w_j|C_o) = 1/l of the 2ν-grams for the outlier class. We also assume p(C_o) to be equal to the desired rejection rate for the one-class classifiers (see Section 3.2.2), and accordingly p(C_t) = 1 − p(C_o).
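The quantities fed to the clustering algorithm can be obtained as in the following sketch (Python, illustrative code): p(w_j|C_t) is estimated from training traffic, p(w_j|C_o) is assumed uniform, and p(C_o) is set to the desired rejection rate, exactly as described above.

    import numpy as np

    def clustering_inputs(target_word_counts, rejection_rate):
        # target_word_counts: occurrence counts of the l possible 2_nu-grams,
        #                     measured on (mostly normal) training traffic.
        # rejection_rate:     desired rejection rate, used as p(C_o).
        l = target_word_counts.shape[0]
        p_w_given_t = target_word_counts / target_word_counts.sum()  # p(w_j | C_t)
        p_w_given_o = np.full(l, 1.0 / l)                            # uniform p(w_j | C_o)
        p_o = rejection_rate
        p_t = 1.0 - p_o
        p_w = p_w_given_t * p_t + p_w_given_o * p_o                  # equation (5.5)
        post_t = p_w_given_t * p_t / p_w                             # equation (5.4), i = t
        post_o = p_w_given_o * p_o / p_w                             # equation (5.4), i = o
        return post_t, post_o, p_w

    # Example with l = 256*256 features and a 1% desired rejection rate.
    counts = np.random.default_rng(0).integers(0, 100, 256 * 256).astype(float)
    post_t, post_o, priors = clustering_inputs(counts, rejection_rate=0.01)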

5.2.3 Payload Classification

By varying the parameter ν and applying the dimensionality reduction algorithm explained above, we obtain different compact representations of the payload in different feature spaces. For each of these representations we construct a model of normal traffic by training a one-class SVM classifier (see Section 3.2.2). In practice, given a set of "gap" values {ν_i}_{i=1..m} and a dataset D = {p_k}_{k=1..N} of (mostly) normal traffic, we construct m datasets D^{ν_i} = {p_k^{ν_i}}_{k=1..N}, i = 1..m, one for each ν_i, according to the feature extraction and dimensionality reduction process described in Section 5.2.1 and Section 5.2.2. Afterwards, we train a one-class SVM classifier on each dataset D^{ν_i}, thus obtaining m models M_1, M_2, .., M_m of normal traffic.

During the operational phase, whenever a test payload p is received, we compute m different representations of p, namely p^{ν_1}, p^{ν_2}, .., p^{ν_m}, according to the feature extraction and dimensionality reduction process described above. Then, we classify each representation p^{ν_i} using the model M_i. Finally, we combine the classification results obtained from each model in order to make a final decision. We use the combination approach described in Section 3.3.3.

In the case of the average, product, minimum and maximum probability rules, the output of the combiner is a score that can be interpreted as the probability of the payload under test being normal, P(normal|p) (see Section 3.3.3). The final decision depends on a threshold θ, whereby the payload p is classified as an attack if P(normal|p) < θ, and as normal otherwise. In the case of the majority voting combination rule (see Section 3.3.3), the output of the combiner equals the number of classifiers that labeled p as attack. In this case, we can again set a threshold θ, whereby if the number of classifiers that deemed p an attack is greater than θ, the final decision will be to classify p as an attack, otherwise it will be classified as normal. The threshold θ can be chosen in order to tune the inevitable trade-off between false positives and detection rate.
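The combination stage can be summarized as in the following sketch (Python, illustrative; the per-classifier outputs are assumed to already be expressed as probabilities P(normal|p)):

    import numpy as np

    def combine(probs, rule="min"):
        # Combine the m per-classifier probabilities P(normal | p).
        probs = np.asarray(probs)
        if rule == "avg":
            return probs.mean()
        if rule == "prod":
            return probs.prod()
        if rule == "min":
            return probs.min()
        if rule == "max":
            return probs.max()
        raise ValueError(f"unknown rule: {rule}")

    def classify(probs, theta, rule="min"):
        # Flag the payload as an attack when the combined score falls below theta.
        return "attack" if combine(probs, rule) < theta else "normal"

    # Majority voting would instead count how many classifiers labeled p as an
    # attack and compare that count against the threshold.
    print(classify([0.92, 0.88, 0.15, 0.97], theta=0.5, rule="min"))   # -> attack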

5.2.4 Complexity Analysis

In this section we provide an analysis of the computational complexity of our McPAD detection algorithm. Because the training of McPAD can be performed off-line, here we only present an analysis of the computation involved in the test phase. Given a payload π_j of length n and a fixed value of ν, the frequency of 2ν-grams (see Section 5.2.1) can be computed in O(n). As the number of extracted features is constant (equal to 2^16, regardless of the actual values of n and ν), the mapping between the frequency distribution of 2ν-grams and the k feature clusters (see Section 5.2.2) can be computed using a simple look-up table and a number of sum operations that is always less than 2^16 (regardless of the value of k). Therefore, the feature reduction process can be computed in O(1). The feature extraction and reduction process has to be repeated m times, choosing every time a different value of ν; m represents the number of different one-class classifiers used to make a decision about each payload π_j, and the overall feature extraction and reduction process can be accomplished in O(nm).

Once the features have been extracted, and the dimensionality reduced to k, each payload has to be classified according to each of the m one-class SVM classifiers. Let us first consider one single classifier. As mentioned in Section 3.2.2, in order to classify a pattern vector x_j (i.e., a representation of the payload π_j, in our case), the distance between x_j and each of the support vectors obtained during training has to be computed. Given the number of feature clusters k, and the number of support vectors s, the classification of a pattern can be computed in O(ks). This classification process has to be repeated m times and the results are then combined (e.g., using the average of probabilities). The overall classification of a pattern vector x_j can therefore be computed in O(mks). It is worth noting that the number of support vectors s depends on two main parameters, namely the size t of the training set and the value of the parameter C in Equation (3.5). The parameter C can be interpreted as the fraction of "desired" false positives. Schölkopf et al. [81] showed that when the solution of the problem in Equation (3.5) satisfies ρ ≠ 0, then tC ≤ s. Therefore, for a fixed size t of the training dataset, the higher the "desired" false positive rate chosen during training, the higher s, and in turn the higher the computation time spent on each packet. We will show in Section 5.3 that this is the source of an inevitable trade-off between the accuracy of McPAD and its average computational cost per payload.

5.3 McPAD Evaluation

In this section we report the results of the extensive experiments we performed. We present the results regarding the accuracy of McPAD first, and then we compare it to PAYL [96]. Both McPAD and PAYL have been evaluated in terms of the partial AUC (AUC_p) described in Section 2.3.1. We limited our analysis to HTTP traffic, which consists of several days of simulated and real legitimate HTTP requests, several HTTP attacks provided by the authors of [50], and a large number of polymorphic HTTP attacks we generated ourselves. We found that collecting a sufficient amount of attack traffic for protocols other than HTTP is very hard and expensive; to the best of our knowledge no such dataset is publicly available.

5.3.1 Experimental Setup

We implemented an open-source proof-of-concept version of McPAD, which we made available at http://prag.diee.unica.it/pra/ita/research/ids/prototypes/mcpad. McPAD is written entirely in Java and is built on top of LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and Jpcap (http://netresearch.ics.uci.edu/kfujii/jpcap/doc/). In order to compare McPAD to PAYL [96], we requested and obtained a copy of the PAYL software from Columbia University. We experimented with several different combinations of the configuration parameters of McPAD. We used a "gap" ν = 0..10 for the 2ν-gram feature extraction process (see Section 5.2.1). We used one-class SVM classifiers with a Gaussian kernel (see Section 3.2.2). The value of γ for the Gaussian kernel was set equal to 0.5 in all the experiments, because this value gave good results during preliminary experiments [76]. We used several values for the number of feature clusters k in the dimensionality reduction process (see Section 5.2.2); in particular, we experimented with k = 10, 20, 40, 80, 160 feature clusters. The "desired" false positive rate (see Section 3.1.2) used during the training of the one-class SVM classifiers was set to FP = 10%, 5%, 2%, 1%, 0.5%, 0.2%, 0.1%, 0.05%, 0.02%, 0.01%, 0.005%, 0.002%, and 0.001%.

Table 5.1: Summary of the parameter settings used in the experiments with McPAD.

γ                      0.5
ν                      0-10
Feature Clusters (k)   10, 20, 40, 80, 160
Desired FP Rate        10%, 5%, 2%, 1%, 0.5%, 0.2%, 0.1%, 0.05%, 0.02%, 0.01%, 0.001%

The parameter values used during the experiments are summarized in Table 5.1 for convenience.

5.3.2 Datasets

In this section we describe the characteristics of the datasets that we used in our experiments.

DARPA. We used the HTTP requests extracted from the first week of the DARPA'99 dataset [67], which consists of five entire days of simulated normal traffic to and from an air force base.

Although the DARPA dataset is outdated and has been criticized [69, 72] for the way it was generated, to the best of our knowledge it is the only public dataset of network traffic that represents a common base on which experimental results may be reproduced and compared to the ones obtained using different approaches. We randomly split each day of traffic into two parts, a training set made of approximately 80% of the traffic and a validation set made of the remaining 20% of the traffic. For each day of training/validation, a test dataset was constructed which consists of 20% of the traffic randomly chosen from all the remaining days (i.e., the traffic from the days which are not included in the training/validation set). The characteristics of the obtained dataset are reported in Table 5.2.

Table 5.2: DARPA Dataset Characteristics

DARPA    Training Set          Validation Set        Test Set
Day      Size (MB)   Packets   Size (MB)   Packets   Size (MB)   Packets
1        19          161,602   4.7         40,057    44          137,997
2        23          196,605   5.7         48,905    42          131,738
3        23          189,362   5.5         46,957    42          133,133
4        30          268,250   7.6         67,593    39          121,999
5        18          150,847   4.4         37,639    45          139,869

GATECH. We collected seven days of real HTTP requests towards the website of the College of Computing at the Georgia Institute of Technology. Although this traffic is completely unlabeled, it is very reasonable to consider it as containing mostly legitimate traffic. This would not be true only in case persistent intrusion attempts were ongoing at the time we collected the traffic. Such attacks are generally unlikely and usually noticeable. Given that no evidence of persistent attacks was reported during the period in which we collected the traffic, we speculate that the level of noise in our dataset is negligible. Therefore, in the following we consider the GATECH dataset as "clean" for the purpose of measuring the false positive rate. Similarly to the DARPA dataset, we divided each day of traffic into two parts, a training set made of approximately 80% of the traffic and a validation set made of 20% of the traffic. For each day of training/validation, a test dataset was constructed which consists of 20% of the traffic from all the remaining days. The characteristics of the GATECH dataset are reported in Table 5.3.

Table 5.3: GATECH dataset characteristics.

GATECH   Training Set          Validation Set        Test Set
Day      Size (MB)   Packets   Size (MB)   Packets   Size (MB)   Packets
1        131         307,929   33          76,654    147         350,849
2        72          171,750   19          43,418    162         385,247
3        124         289,649   31          72,320    149         354,637
4        110         263,498   28          65,260    152         361,789
5        79          195,192   20          48,653    161         379,610
6        78          184,572   20          45,949    160         380,895
7        127         296,425   32          74,218    148         352,119

ATTACKS. We experimented with several non-polymorphic and polymorphic HTTP attacks. Although we were able to find a public source of non-polymorphic HTTP attacks, provided by the authors of [50], we were not able to find any public source of polymorphic attacks.

We therefore created the polymorphic attacks ourselves, using both the polymorphic engine CLET [27] and a Polymorphic Blending Attack engine similar to the one used in [36, 37]. We decided to make the entire attack dataset publicly available at http://prag.diee.unica.it/pra/ita/research/ids/datasets, in the hope that this will foster future research. We divided the attack dataset into the following groups of attacks:

• Generic Attacks. This dataset includes all the HTTP attacks provided by the authors of [50], plus a shell-code attack that exploits a vulnerability (MS03-022) in Windows Media Service (WMS), which was used in [76]. In total this dataset consists of 66 HTTP attacks. Among these, 11 are shell-code attacks, i.e., attacks that carry executable code in the payload. The other attacks cause, for example, Information Leakage and Denial of Service (DoS).

• Shell-code Attacks. This dataset contains the 11 shell-code attacks from the Generic Attacks dataset. Shell-code attacks are particularly dangerous because their objective is to inject executable code and hijack the normal execution of the target application. Some famous worms, like Code-Red, for example, use shell-code attacks to propagate.

• CLET Attacks. This dataset contains 96 polymorphic attacks generated using the polymorphic engine CLET [27]. We selected 8 among the 11 Shell-code Attacks for which the source code was available, and created a polymorphic version of each attack, using the payload statistics computed on each distinct day of traffic from the DARPA and GATECH datasets to train CLET's polymorphic engine. Overall we generated 96 polymorphic CLET attacks.

• Polymorphic Blending Attacks (PBAs). We created this dataset using three shell-code attacks, namely Code-Red (a famous worm that exploits a vulnerability in Windows IIS (MS01-044)), DDK (an exploit for a buffer overflow vulnerability in Windows IIS (MS01-033)), and an attack against Windows Media Service (MS03-022). For each one of these attacks we created PBAs that mimic the normal traffic of five different hosts that we selected at random from the GATECH dataset. Based on the traffic of these hosts, we created several versions of the attacks by spreading the attack payload over different values of the total number of attack packets and targeting a different feature extraction method. We created PBAs that mimic the statistical distribution of n-grams with n = 1..12, of 2ν-grams with ν = 1..10, and of the 2_all-gram, which is intended to mimic all the 2ν-grams with ν = 1..10 at the same time. Overall, we generated 6,339 PBAs that aim to evade both PAYL and McPAD. It is worth noting that the main goal of the PBAs we generated is to mimic the distribution of n-grams (with different values of n), and they can therefore be seen as evasion attacks against any payload-based anomaly IDS that uses n-gram analysis.

The characteristics of the ATTACKS dataset are summarized in Table 5.4.

5.3.3 Experimental Results

In the first part of our experiments we show that the 2ν-gram feature extraction technique presented in Section 5.2.1 is actually able to extract structural information from the payload. Afterwards, we evaluate the accuracy of McPAD in detecting Generic Attacks, Shell-code Attacks and polymorphic CLET Attacks. In the last part of the experiments we compare McPAD to PAYL on these three groups of attacks, and we evaluate the robustness of the two detectors in the face of advanced Polymorphic Blending Attacks (PBAs).

Table 5.4: ATTACKS dataset characteristics

Type         Attacks   Attack Packets
Generic      66        205
Shell-code   11        93
CLET         96        792
PBA          6,339     71,449
Total        6,512     72,539

2ν-gram Analysis.

We discussed in Section 5.2.1 how to extract the features using the 2ν-gram technique. We also argued that the occurrence frequency of 2ν-grams somehow "summarizes" the occurrence frequency of n-grams. This allows us to capture some byte sequence information. In order to show that the 2ν-grams actually extract structural information from the payload, we can consider the bytes in the payload as random variables and then compute the relative mutual information of bytes that are ν positions apart from each other. That is, for a fixed value of ν we compute the quantity

RMI_{ν,i} = I(B_i; B_{i+ν+1}) / H(B_i)   (5.6)

where I(B_i; B_{i+ν+1}) is the mutual information of the bytes at positions i and (i+ν+1), and H(B_i) is the entropy of the bytes at position i. By computing the average of RMI_{ν,i} over the index i = 1, .., (L−ν−1), with L equal to the maximum payload length, we obtain the average relative mutual information for the 2ν-grams along the payload. We measured this average relative mutual information on both the training and the test set, varying ν from 0 to 20. The results are shown in Figure 5.6 for the traffic of day 1 and the merged traffic of days 2 to 5 from the GATECH dataset. It is easy to see that the amount of information extracted using the 2ν-gram technique is maximum for ν = 0 (i.e., when the 2-gram technique is used) and decreases for growing ν. However, the decreasing trend is slow and the average RMI is always higher than 0.5 until ν = 10. This is probably due to the fact that HTTP is a highly structured protocol.
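Equation (5.6) can be estimated from a set of payloads as in the following sketch (Python, illustrative; the byte values observed at each position are treated as samples of discrete random variables):

    import numpy as np
    from collections import Counter

    def entropy(samples):
        # Empirical entropy (in bits) of a list of observed values.
        counts = np.array(list(Counter(samples).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def relative_mutual_information(payloads, i, nu):
        # RMI_{nu,i} = I(B_i; B_{i+nu+1}) / H(B_i), estimated over the payloads.
        pairs = [(p[i], p[i + nu + 1]) for p in payloads if len(p) > i + nu + 1]
        b_i = [a for a, _ in pairs]
        b_j = [b for _, b in pairs]
        h_i, h_j = entropy(b_i), entropy(b_j)
        h_ij = entropy(pairs)
        mutual_info = h_i + h_j - h_ij        # I(X;Y) = H(X) + H(Y) - H(X,Y)
        return mutual_info / h_i if h_i > 0 else 0.0

    # Averaging RMI_{nu,i} over the positions i gives the curves of figure 5.6.
    payloads = [b"GET /index.html HTTP/1.1", b"GET /images/logo.png HTTP/1.1"]
    print(relative_mutual_information(payloads, i=6, nu=0))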


Figure 5.6: Average relative mutual information for varying ν (computed on the GATECH dataset).

Summing up, the 2ν-gram technique indeed extracts structural information from the payload, which helps to construct accurate classifiers.

Validation of McPAD

Similarly to [76], we performed experiments with a combination of 11 one-class classifiers. Each classifier is trained on a different representation of the normal payloads obtained using the 2ν-gram technique, with ν = 0, .., 10, as explained in Section 5.2.1. In the previous paragraph we showed that for ν ≤ 10 the relative mutual information of bytes in the payload that are ν positions apart is higher than 0.5. This means that 2ν-grams with ν ≤ 10 actually convey structural information extracted from the payload. Also, combining 2ν-grams with ν ≤ 10 allows us to approximate the distribution of 12-grams, which may be difficult for the attacker to mimic, as explained in [76].

Table 5.5: DARPA dataset - summary of AUCp results computed over different values of k.

                     Maj. Vot.   Avg. Prob.   Prod. Prob.   Min. Prob.   Max. Prob.
Generic Attacks
  MAX                0.94425     0.97837      0.97868       0.97377      0.91530
  MIN                0.83100     0.87116      0.87092       0.83963      0.84533
  AVG                0.89211     0.93982      0.93996       0.90682      0.87993
  STDEV              0.02834     0.02672      0.02668       0.03048      0.02078
Shell-code Attacks
  MAX                0.97910     0.99990      0.99990       0.99986      0.98400
  MIN                0.95410     0.95235      0.95228       0.94354      0.92836
  AVG                0.96876     0.98698      0.98700       0.98537      0.96496
  STDEV              0.00616     0.01684      0.01686       0.01830      0.01009
CLET Attacks
  MAX                0.99939     0.99944      0.99944       0.99969      0.99924
  MIN                0.98467     0.98324      0.98318       0.97328      0.95686
  AVG                0.99547     0.99639      0.99638       0.99500      0.99329
  STDEV              0.00407     0.00459      0.00460       0.00717      0.00881

Table 5.6: GATECH dataset - summary of AUCp results computed over different values of k.

                     Maj. Vot.   Avg. Prob.   Prod. Prob.   Min. Prob.   Max. Prob.
Generic Attacks
  MAX                0.8787      0.8978       0.8976        0.9073       0.8424
  MIN                0.8054      0.8360       0.8359        0.8370       0.5822
  AVG                0.8452      0.8671       0.8673        0.8752       0.7578
  STDEV              0.0213      0.0169       0.0170        0.0173       0.0587
Shell-code Attacks
  MAX                0.99729     0.99991      0.99991       0.99987      0.99918
  MIN                0.97904     0.98512      0.98512       0.96748      0.56770
  AVG                0.98926     0.99744      0.99742       0.99492      0.92340
  STDEV              0.00456     0.00330      0.00333       0.00813      0.10504
CLET Attacks
  MAX                0.99897     0.99953      0.99954       0.99970      0.99930
  MIN                0.99403     0.99721      0.99722       0.99713      0.74793
  AVG                0.99764     0.99829      0.99828       0.99908      0.94872
  STDEV              0.00098     0.00066      0.00064       0.00070      0.08468

In this work, we performed separate tests using both the DARPA dataset and the GATECH dataset for the training phase and for computing the false positive rate. In both cases we used the ATTACKS dataset to estimate the detection rate. For each day of normal traffic, we trained McPAD on 80% of the traffic (the training dataset), and tuned the detection threshold (see Section 5.2.3) on the remaining 20% of traffic (the validation dataset) in order to obtain the desired false positive rate. We then tested the obtained classifier on 20% of the traffic from all the remaining days of traffic not involved in the training and validation process. For example, we trained and validated (for choosing the detection threshold) McPAD on the first day of the DARPA dataset, and tested it on 20% of the traffic randomly sampled from days 2 to 5 and on each group of attacks in the ATTACKS dataset. We repeated the same process for all the other days, thus performing a 5-fold cross-validation evaluation (the DARPA dataset consists of 5 days of normal traffic). We proceeded in a similar way for the GATECH dataset, for which we performed a 7-fold cross-validation evaluation (the GATECH dataset consists of 7 days of normal traffic).

We repeated such experiments fixing the number of feature clusters k and using different values of the desired false positive rate and combination rule. Therefore, for each fixed value of k and combination rule we were able to compute a 5-fold and 7-fold cross-validation of the AUC_p. Table 5.5 and Table 5.6 report the maximum, minimum, average and standard deviation of the cross-validation AUC_p computed over the different values of k for each combination rule we considered. On the other hand, Table 5.7 and Table 5.8 report the average AUC_p obtained with cross-validation for each different value of k and combination rule. It is easy to see that the average, product, and minimum probability combination rules provide the highest values of average AUC_p, and therefore the best classification performance. Also, as we can see, the average AUC_p reaches values very close to 1 when Shell-code and CLET Attacks are considered. As mentioned above, Shell-code and CLET Attacks carry some form of executable code in the payload. This causes the statistical distribution of byte values in the payload to be severely altered, compared to the normal distribution. We believe the presence of executable code in the payload has an even more evident effect on the distribution of 2ν-grams. This is the reason why McPAD is so effective in detecting this kind of attacks. The Generic Attacks are more difficult to detect. The reason is that this dataset contains, besides shell-code attacks, information leakage and Denial of Service (DoS) attacks, for example. Information leakage and DoS attacks usually do not carry executable code, and although they include some abnormality in the payload that allows the attacker to exploit the targeted vulnerability, they do not significantly alter the distribution of byte values in the payload, compared to normal traffic.

We also performed experiments with a number of classifiers m lower than 11. Figure 5.7 and Figure 5.8 present the results on the classification of Generic Attacks and Shell-code Attacks, respectively, for different values of m between 3 and 11. The experiments were performed in the following way. We merged the 7 days of GATECH traffic, and then we split it in two parts of equal size.
We used one half of the obtained dataset for training McPAD, and the second half to test the false positive rate. We used the Generic Attacks and Shell-code Attacks, respectively, to compute the detection rate. During test, for each payload in input McPAD picks m classifiers at random among the pool of 11 available classifiers (each trained with a different value of ν = 0...10, as explained above), classifies the payload, and combines the obtained m outputs. Both Figure 5.7 and Figure 5.8 report the results obtained using the minimum probability combination rule. As we can see, the AUCp slightly decreases for decreasing values of the number of feature clusters k. Also, the AUCp decreases as the number of combined classifiers m decreases. However, it is worth noting that in Figure 5.8 the AUCp is always higher than 0.97, and for k = 160 the AUCp is always higher than 0.985 even when m = 3. This confirms that McPAD is very good at detecting Shell-code Attacks, even when the number of one-class classifiers in the ensemble is low.

Table 5.7: DARPA dataset - Average of AUCp results for different values of k.

                     Maj. Vot.   Avg. Prob.   Prod. Prob.   Min. Prob.   Max. Prob.
Generic Attacks
  k=10               0.87097     0.91665      0.91698       0.88028      0.87066
  k=20               0.87631     0.92025      0.92013       0.90145      0.87058
  k=40               0.89713     0.93291      0.93334       0.8981       0.88476
  k=80               0.8954      0.95584      0.95589       0.91748      0.88482
  k=160              0.92075     0.97343      0.97347       0.93677      0.8888
Shell-code Attacks
  k=10               0.96707     0.97691      0.97686       0.97572      0.96962
  k=20               0.96762     0.98282      0.98281       0.98107      0.96483
  k=40               0.96745     0.98256      0.98263       0.98194      0.95724
  k=80               0.96958     0.99404      0.99404       0.99145      0.96713
  k=160              0.97208     0.9986       0.99865       0.99668      0.966
CLET Attacks
  k=10               0.99539     0.99496      0.99495       0.99409      0.9953
  k=20               0.99423     0.99489      0.99488       0.99352      0.98642
  k=40               0.99412     0.99575      0.99574       0.99601      0.99424
  k=80               0.9972      0.99815      0.99815       0.99558      0.99524
  k=160              0.99639     0.9982       0.9982        0.99582      0.99525

Table 5.8: GATECH dataset - Average of AUCp results for different values of k.

                     Maj. Vot.   Avg. Prob.   Prod. Prob.   Min. Prob.   Max. Prob.
Generic Attacks
  k=10               0.83501     0.86331      0.8633        0.87187      0.76765
  k=20               0.8366      0.8613       0.86135       0.86882      0.7492
  k=40               0.8366      0.86312      0.86407       0.87783      0.77834
  k=80               0.84778     0.85948      0.8595        0.88594      0.80212
  k=160              0.87016     0.8884       0.88828       0.87131      0.69164
Shell-code Attacks
  k=10               0.98632     0.99544      0.99543       0.99323      0.94105
  k=20               0.98758     0.99689      0.9969        0.99361      0.94685
  k=40               0.98903     0.99827      0.99826       0.99417      0.97585
  k=80               0.99613     0.99874      0.99875       0.9965       0.98666
  k=160              0.98723     0.99785      0.99775       0.99709      0.76661
CLET Attacks
  k=10               0.99776     0.99854      0.99854       0.99866      0.9589
  k=20               0.99778     0.99839      0.99839       0.99925      0.969
  k=40               0.99757     0.99815      0.99815       0.99908      0.98624
  k=80               0.99773     0.99785      0.9979        0.99925      0.99669
  k=160              0.99737     0.9985       0.99844       0.99913      0.83275

On the other hand, Figure 5.7 shows that McPAD suffers more when detecting Generic Attacks for low values of m and k, although the AUCp stays above 0.8 when k = 160 is used. In Section 5.3.3 we will discuss how the number of classifiers m and the value of k impact the average computational cost per payload.

Comparison between McPAD and PAYL

In this section we present the results of the comparison between McPAD and PAYL. In order to compare the two anomaly detectors, we proceeded as follows: we first merged the 5 days of traffic of the DARPA dataset and then we randomly split the obtained traffic in two portions, 50% reserved for training purposes and 50% for testing. We did the same for the GATECH dataset, i.e., we merged the 7 days of traffic and then we randomly split it in two parts of roughly the same size.

Figure 5.7: Generic Attacks detection - AUCp obtained for different values of the number of feature clusters k, and number of combined classifiers m. The combination rule used was minimum probability.

First, we trained both McPAD and PAYL on the first half of the DARPA dataset, and then we tested both detectors on the second half of the DARPA dataset and on the entire ATTACKS dataset. We repeated the same procedure using the GATECH dataset. For the sake of brevity, in the following we only present the results obtained using the GATECH dataset. The results on the DARPA dataset are similar. We set the parameters of McPAD to k = 160, a desired false positive rate of 1% for each single one-class SVM classifier, and the maximum probability combination rule. We chose this configuration of McPAD because it provided slightly better results compared to other configurations we tried for the detection of polymorphic blending attacks at very low false positives, while also maintaining good results for the detection of shell-code attacks and polymorphic CLET attacks at very low false positives. We varied the detection threshold on the output of the combination of classifiers, so as to vary the trade-off between false positives and detection rate, thus allowing us to draw the ROC curve. Figure 5.9 and Figure 5.10 report the results obtained with PAYL and McPAD, respectively. In this case, the test dataset consisted of the second half of the GATECH traffic, on which we computed the false positives, and the Generic Attacks, Shell-code Attacks, and CLET Attacks, on which we computed the detection rate. The three curves in each graph reflect the obtained results. It is easy to see that for all three groups of attacks, the detection rate of PAYL rapidly decreases for a false positive rate below 5·10^-3. On the other hand, McPAD is able to detect the three groups of attacks even at a false positive rate of 10^-5. In particular, McPAD is able to detect shell-code attacks and polymorphic shell-code attacks generated using CLET very well, with a detection rate around 90% and above even at very low false positive rates.
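As described above, such ROC curves are obtained simply by sweeping the detection threshold over the anomaly scores produced on the test traffic. The following is a minimal sketch of that procedure; the function name and the synthetic scores are purely illustrative and not data from our experiments.

    import numpy as np

    def roc_points(scores_normal, scores_attack):
        # Sweep a detection threshold over the anomaly scores and collect
        # (false positive rate, detection rate) pairs; higher score = more anomalous.
        thresholds = np.unique(np.concatenate([scores_normal, scores_attack]))
        curve = []
        for t in thresholds:
            fpr = float(np.mean(scores_normal >= t))   # fraction of normal payloads flagged
            tpr = float(np.mean(scores_attack >= t))   # fraction of attack payloads detected
            curve.append((fpr, tpr))
        return sorted(curve)

    # Purely illustrative scores.
    rng = np.random.default_rng(0)
    curve = roc_points(rng.normal(0, 1, 10000), rng.normal(3, 1, 500))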

Figure 5.8: Shell-code Attacks detection - AUCp obtained for different values of the number of the feature clusters k, and number of combined classifiers m. The combination rule used was minimum probability.

Figure 5.9: PAYL - ROC curves for Generic, Shell-code, and CLET attacks

Figure 5.11 and Figure 5.12 show the results obtained on several Polymorphic Blending Attacks (PBAs) derived from the Code-Red worm. Although we generated a high number of polymorphic attacks, here we only report part of the results, for the sake of brevity. Both figures report the ROC curves obtained using n-gram Code-Red attacks with n = 1, 2, 4, 12, and 2-all-gram attacks (see Section 5.3.2), constructed using either 5 or 10 overall attack packets (reported at the end of the attack's name in the graph legend). Intuitively, the larger the number of packets, the larger the space available for the attacker to better mimic the distribution of normal traffic [37, 36], and thus the more difficult it is to detect the attack.

Figure 5.10: McPAD - ROC curves for Generic, Shell-code, and CLET attacks

Figure 5.11: PAYL - ROC curves for Code-Red PBA attacks (the string “cred” in the legend stands for “code-red”).

It is easy to see that while the detection rate of PAYL rapidly drops at a false positive rate around 10^-3, in some cases McPAD is able to “push” the ROC curves farther to the left (lower false positives). This shows that McPAD is more robust than PAYL for PBA instances that are spread over a limited number of packets. However, when the attack is spread over a large number of packets (e.g., 10 in the case of the Code-Red PBA), neither PAYL nor McPAD is able to detect the attack at very low false positives. Figure 5.13 and Figure 5.14 show similar results for PBAs derived from the DDK attack using either 3 or 5 packets, whereas Figure 5.15 and Figure 5.16 show the results obtained with PBAs derived from the WMS attack using either 1 or 3 packets.

Figure 5.12: McPAD - ROC curves for Code-Red PBA attacks (the string “cred” in the legend stands for “code-red”).

Figure 5.13: PAYL - ROC curves for DDK PBA attacks.

Computational Cost Analysis

In this section we discuss the experimental results regarding the performance of McPAD and PAYL in terms of average computational cost per payload. We performed the experiments on a machine equipped with a 2GHz Dual Core AMD Opteron™ Processor and 8GB of RAM, although both McPAD and PAYL used only one CPU core at a time, and always less than 4GB of RAM. Table 5.9 and Table 5.10 report the results obtained with PAYL and McPAD, respectively, on both the DARPA and GATECH datasets. The numbers in parentheses in the first column of both tables represent the number of payloads in each test dataset. We filtered out all the packets which did not carry a TCP payload (e.g., we did not count the time spent on SYN and FIN packets). The average time per payload is reported in milliseconds. In Table 5.10 we report the results obtained considering only a few possible parameter configurations for McPAD, for the sake of brevity.

Figure 5.14: McPAD - ROC curves for DDK PBA attacks.

Figure 5.15: PAYL - ROC curves for WMS PBA attacks.

As discussed above, k represents the number of feature clusters, whereas FP represents the “desired” false positive rate which we chose to train the SVM models and to tune the detection threshold on the output of the combination rules. m represents the number of combined models. We experimented with m = 11, which refers to the combination of all the SVM models constructed on the different values of the parameter ν = 0...10. Also, we experimented with m = 3 by picking 3 different models out of the available 11 SVM models at random in order to classify each payload. In other words, during the test phase, for each payload McPAD picks 3 of the 11 SVM models at random, classifies the payload using only the chosen 3 models, and combines the obtained 3 outputs to make the final decision about whether the payload under test is anomalous or not. We found that this approach decreases the average computational cost per payload while maintaining a high detection rate at very low false positive rates in most scenarios, as shown in Figure 5.17 and Figure 5.18 (these two figures were obtained using the same configuration of McPAD used to plot Figure 5.10 and Figure 5.12, with the only difference that we combined only 3 classifiers, instead of 11).

Figure 5.16: McPAD - ROC curves for WMS PBA attacks.

Figure 5.17: McPAD - ROC curves for Generic, Shell-code, and CLET attacks. The ROC refers to classification results obtained by combining 3 classifiers chosen at random among 11 one-class classifiers.

It is easy to see that the results reported for McPAD in Table 5.10 are consistent with the computational complexity analysis in Section 5.2.4, in particular for the results on the GATECH dataset. The difference between the results on the DARPA and GATECH datasets is due to the fact that, for a fixed “desired” false positive rate, the number of support vectors grows with the size of the training dataset. As the GATECH training dataset is larger than the DARPA training dataset, the SVM models used to classify GATECH traffic have more support vectors (see Section 5.2.4), and therefore a higher average cost per test payload. The performance of PAYL in terms of average computational cost per payload is much better than that of our McPAD software. However, it is worth noting that our software is a proof-of-concept Java implementation of the algorithms described in this work, and that neither LibSVM, on which McPAD is based, nor McPAD itself is optimized. In particular, the classification of a payload using a given one-class SVM can easily be parallelized. This is because the distance between the pattern vector representing the payload and each support vector in Equation (3.6) can be computed independently. The results can then be summed up and compared to the threshold ρ in order to compute the probability p(x|ωt) in Equation (3.8).

Figure 5.18: McPAD - ROC curves for Code-Red PBA attacks. The ROC refers to classification results obtained by combining 3 classifiers chosen at random among 11 one-class classifiers.

Table 5.9: PAYL's average processing time per payload. The number in parentheses represents the number of payloads in the test dataset.

DARPA (137,997)      0.039 ms
GATECH (1,068,429)   0.032 ms

Also, the classification results of each single one-class SVM can be computed independently and then combined using one of the combination rules discussed in Section 3.3.3 (e.g., average of probabilities). Another approach that can be used to optimize the performance of our IDS is to use it as a second-stage classifier. For example, we could use PAYL configured with a detection threshold that forces it to generate a high false positive rate, say 5%. In turn, this high percentage of “desired” false positives forces PAYL to classify as normal only those payloads that have a very low anomaly score (as computed by PAYL itself), and for which we are therefore very confident that they do not carry an attack (as we can see from Figures 5.9, 5.11, 5.13, and 5.15, PAYL has a high detection rate at false positive rates higher than 1%). Only those payloads that are classified as anomalous by PAYL are then fed to McPAD, which should be configured to accept only a small percentage of false positives, say 0.01%. As McPAD has a higher detection rate than PAYL at very low false positive rates (in particular for shell-code and CLET attacks), the overall result is a two-stage IDS characterized by a high detection rate at a very low false positive rate. At the same time, since on average only 5% of the payloads will undergo “precise scrutiny” by McPAD, the average computational cost per payload of the entire two-stage classifier will also be much lower, compared to using McPAD by itself. It is also worth noting that techniques like load balancing can be employed. For example, it is typical to have more than one web server offering the same service, for both efficiency and fault tolerance reasons. Putting an instance of our IDS in front of each of a pool of N web servers in load balance would decrease the traffic crossing each instance of McPAD, thus multiplying the overall traffic that can be handled by the provider of the web services by N.

Table 5.10: McPAD's average processing time per payload. The number in parentheses in the first column represents the number of payloads in the test dataset.

                               FP=0.001   FP=0.001   FP=0.01    FP=0.01
                               m=3        m=11       m=3        m=11
DARPA (137,997)     k=10       3.07 ms    10.96 ms   3.16 ms    11.04 ms
                    k=40       3.04 ms    11.02 ms   3.11 ms    11.31 ms
                    k=160      3.13 ms    10.92 ms   3.81 ms    13.39 ms
GATECH (1,068,429)  k=10       4.28 ms    16.23 ms   4.94 ms    17.53 ms
                    k=40       4.16 ms    15.92 ms   6.14 ms    21.93 ms
                    k=160      4.95 ms    17.11 ms   10.49 ms   38.45 ms

Although this solution may be financially more expensive, we believe that in a number of practical scenarios the increase in security provided by our IDS would generate a positive return on investment. This reasoning is supported by the analysis presented in Section 5.3.4, where we show that PAYL has a much lower Bayesian detection rate compared to McPAD.
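The two-stage arrangement described above can be sketched in a few lines of Python. The `payl` and `mcpad` objects, their `score()` method and the two thresholds are hypothetical placeholders, not the actual interfaces of the two tools; this is only an illustration of the idea.

    def two_stage_classify(payload, payl, mcpad, payl_threshold, mcpad_threshold):
        # Stage 1: PAYL tuned for a generous false positive rate (e.g. about 5%).
        # Payloads with a low anomaly score are accepted as normal without further analysis.
        if payl.score(payload) < payl_threshold:
            return "normal"
        # Stage 2: only the payloads flagged by PAYL undergo the more expensive
        # McPAD analysis, tuned for a very low false positive rate (e.g. 0.01%).
        if mcpad.score(payload) < mcpad_threshold:
            return "normal"
        return "anomalous"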

5.3.4 Bayesian Detection Rate

The experimental results show that McPAD is able to detect attacks even at very low false positive rates. This is particularly true for shell-code attacks and polymorphic attacks created using morphing engines such as CLET [27]. On the other hand, the detection rate of PAYL quickly drops to zero at low false positive rates. This is an important result, because being able to maintain a high detection rate with very low false positives greatly improves the Bayesian detection rate P(Intrusion|Alarm) (see Section 2.3.2).

In [9], Axelsson presented a realistic example in which he showed that in order to have a significant Bayesian detection rate we need to reduce the false positive rate to around 10^-5, while maintaining a relatively high detection rate. It is easy to see from Figure 5.12 that McPAD achieves this goal, in particular in the case of shell-code attacks and polymorphic CLET attacks. Following Axelsson's example [9], assume P(I) = 1 − P(¬I) = 2·10^-5. McPAD has a detection rate around 95% for shell-code attack packets at a false positive rate of 10^-5. Therefore, P(A|I) = 0.95 and P(A|¬I) = 10^-5. In this case the Bayesian detection rate is P(I|A) = 0.65. On the other hand, PAYL has a detection rate equal to zero at a false positive rate of 10^-5, and therefore its Bayesian detection rate is zero. If we consider that the detection rate of PAYL is around 0.32 at a false positive rate of 10^-3, we have a Bayesian detection rate for PAYL equal to P(I|A) = 0.006. Even if we had 100% detection rate at 10^-3, the Bayesian detection rate would only be P(I|A) ≃ 0.02. This confirms that we need to maintain a high detection rate at very low false positive rates in order to increase the Bayesian detection rate. We showed that McPAD meets this requirement.
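The Bayesian detection rate used above follows directly from Bayes' theorem. The small sketch below reproduces the numbers of Axelsson's example with the detection and false positive rates reported in this section (the function name is ours).

    def bayesian_detection_rate(p_intrusion, detection_rate, false_positive_rate):
        # P(I|A) = P(A|I) P(I) / (P(A|I) P(I) + P(A|not I) P(not I))
        numerator = detection_rate * p_intrusion
        denominator = numerator + false_positive_rate * (1.0 - p_intrusion)
        return numerator / denominator

    p_i = 2e-5                                          # P(I), from Axelsson's example
    print(bayesian_detection_rate(p_i, 0.95, 1e-5))     # McPAD, shell-code: ~0.65
    print(bayesian_detection_rate(p_i, 0.32, 1e-3))     # PAYL at 10^-3 FPR: ~0.006
    print(bayesian_detection_rate(p_i, 1.00, 1e-3))     # even with 100% DR at 10^-3: ~0.02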

5.4 HMMPayl - HMM for Payload Analysis

Now we will present HMMPayl, a payload-based anomaly IDS which models the payload using Hidden Markov Models. HMMPayl represents an improvement with respect to previously discussed solutions such as PAYL and McPAD, since it addresses limitations of both. In particular, with respect to PAYL [96, 94] we address the infeasibility of the n-gram analysis for large values of n, proposing an analysis that has the same expressive power but which is feasible also for values of n larger than two. With respect to McPAD, we address the limitation that the 2ν-gram analysis, even if effective, only approximates the n-grams.

One of the reasons which initially motivated the development of HMMPayl was the consideration that HMM, as well as n-grams, have been widely used in text classification [43]. In particular, it has been shown that HMM and n-grams belong to the same theoretical model [86]: both can be described using stochastic finite state automata (sFSA) [32, 85, 36]. In spite of the same expressive power, HMM offer a great advantage with respect to n-grams in modeling sequences of bytes. While n-grams produce feature spaces of size 256^n, HMM can process sequences of any length without differences in computational complexity. Thus we expect to produce an effective representation of the payload.

HMMPayl performs payload processing in three steps, as shown in Figure 5.19. The algorithm we propose for Feature Extraction (step 1) allows the HMM to produce an effective statistical model which is sensitive to the “details” of the attacks (e.g., the bytes that have a particular value). Since HMM are particularly robust to noise, their use during the Pattern Analysis phase (step 2) guarantees a system which is robust to the presence of attacks (i.e., noise) in the training set. In the Classification phase (step 3) we adopted a Multiple Classifier System approach, in order to improve both the accuracy and the difficulty of evading the IDS. Besides, the MCS paradigm mitigates the weaknesses of classifiers due to a sub-optimal choice of the initial parameters. In the same step, we also evaluated the performance of a classifier selection approach that provides an upper bound on accuracy and may suggest some guidelines to exploit the combination of classifiers [90]. In Sections 5.4.1, 5.4.2 and 5.4.3 we will provide a detailed description of the Feature Extraction, Pattern Analysis and Classification steps, respectively. In Section 5.5 we will describe the experimental setup of our experiments and in Section 5.6 we will describe the experimental results.

Figure 5.19: A simplified scheme of HMMPayl.

5.4.1 Feature Extraction

The most critical task in designing an IDS is that of choosing good features. The main problem in Intrusion Detection is that the “target is moving”, i.e., new attacks appear every day, and a number of variants of known attacks are developed, thus making attack modeling a difficult task. Moreover, when the resources to be protected change frequently, as in the case of web applications, the problem becomes even more difficult [68]. As a consequence, having good features is crucial because they should not only model the normal traffic, but also allow distinguishing known and possibly never-seen-before attacks. In addition, the features should be chosen in a way that makes it difficult for an attacker to craft an attack whose representation in terms of the selected features is similar to normal traffic. The above considerations have been taken into account when selecting the features to be used in HMMPayl.

At a low level, an HTTP payload is nothing more than a sequence of L bytes, where L is the length of the payload. As HMM are able to analyze sequences of data, it would be possible to feed the HMM with the payload as it is. Although this would be a very simple solution, it has two considerable drawbacks. The first one is that the longer the payload, the smaller the probability associated with it. This is a consequence of the fact that the number of products performed by the Forward-Backward procedure increases with L (see Section 3.2.1). From a practical point of view, this behavior of HMM increases the risk of having long normal payloads classified as intrusive, and short attack payloads classified as normal. The second drawback is related to the fact that HMM work better if the length of the processed sequences is equal to (or at least close to) the number of states of the model. This would not be possible if the payload were processed as it is, for two reasons:

(a) the length of the payloads may change;

(b) HMM having hundreds of states are not manageable (the typical length of the HTTP payload is several hundreds of bytes).

These are the reasons why we propose the following approach to represent the payload. A window of width “n” slides over the payload byte by byte, and each group of n bytes that falls inside the window is considered as a sequence. Starting from a payload P, the sequence extractor produces as output a set O of sequences. Given the length of the payload L and the width of the window n, the number of sequences N in O is equal to:

N = L − n + 1    (5.7)

In order to better illustrate the feature extractor of HMMPayl, we will use a toy example. Let us consider the following very short payload made up of 10 bytes, and let us also assume that the bytes can take only three values, namely 0, 1 and 2.

2 1 2 0 0 1 2 1 0 2

Let us now consider a sliding window of size n equal to 5. Thus it follows that the number of sequences extracted from the payload is computed as N = 10 − 5 + 1 = 6. The set of sequences extracted from the toy payload is:

Sequence 1: 2-1-2-0-0
Sequence 2: 1-2-0-0-1
Sequence 3: 2-0-0-1-2
Sequence 4: 0-0-1-2-1
Sequence 5: 0-1-2-1-0
Sequence 6: 1-2-1-0-2

Thus, a payload P is represented by a set O of N sequences of length n. Each sequence will then be processed by the HMM, and the probability of P will then be computed as a combination of the outputs for the N sequences in O. It is easy to see that payloads with different lengths L will be translated into sets containing a different number of sequences, the size of each sequence being constantly equal to n.

Our approach offers several advantages with respect to other works addressing the problem of payload representation, namely PAYL [96] and McPAD. While PAYL proposes the use of n-gram analysis, whose computational cost rapidly increases for values of n greater than 2, the additional computational cost related to the increase of the width n of the sliding window in our approach is negligible. On the other hand, the 2ν-gram analysis in McPAD is an approximation of the n-gram analysis which requires multiple classifiers to be implemented, while the proposed approach models sequences of length n that can be processed serially by a single HMM.

Finally, the above example shows that the proposed approach produces some redundancy, as the same byte appears in n sequences in O. In Section 5.5.3 we will show how this redundancy can be exploited to reduce the computational cost of the analysis.
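As an illustration, the sliding-window extractor described in this section can be sketched in a few lines of Python; the function name is ours, and the toy values reproduce the example above.

    def extract_sequences(payload, n):
        # A window of width n slides over the payload byte by byte:
        # a payload of length L yields N = L - n + 1 sequences (Equation 5.7).
        return [payload[i:i + n] for i in range(len(payload) - n + 1)]

    # Toy example from the text: 10 "bytes" over the alphabet {0, 1, 2}, n = 5.
    toy_payload = bytes([2, 1, 2, 0, 0, 1, 2, 1, 0, 2])
    sequences = extract_sequences(toy_payload, n=5)
    assert len(sequences) == 10 - 5 + 1                 # 6 sequences
    assert sequences[0] == bytes([2, 1, 2, 0, 0])       # Sequence 1: 2-1-2-0-0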

5.4.2 Pattern Analysis The input for this step is the set O of N sequences extracted from a payload P. As the pattern to be classified by HMMPayl is the payload P, we will now illustrate how we produce the output probability for P. First of all, one HMM is trained on sets of sequences extracted from normal payloads according to the technique previously described. Then, at detection time, for each payload P, the HMM calculates the probability of emitting each one of the sequences inside O:

p_j = P(o_j | λ),   j = 1, ..., N    (5.8)

Thus, for each payload P, the HMM produces a set of N probabilities associated to the N sequences in O. A simple solution to obtain an overall probability for P is to calculate the arithmetic mean of the output probabilities:

P(O | λ) = (1/N) Σ_{j=1}^{N} p_j = (1/N) Σ_{j=1}^{N} P(o_j | λ)    (5.9)

as the arithmetic mean provides an efficient and unbiased estimate. The scheme of HMMPayl reported in Figure 5.19 shows that K different HMM are used in parallel during the Pattern Analysis step. We have provided in Section 3.3 several reasons that make an ensemble of classifiers more suitable than a single one. Consequently, we decided to have the payload analyzed in parallel by an ensemble of HMM. As each HMM processes the whole payload, all the HMM in the ensemble have the same number of states, and are trained on the same training set. The difference among the HMM in the ensemble is in the random initialization of the A and B matrices during the training phase, which produces different final matrices. Summing up, each HMM receives as input the whole set of sequences O, and produces as output a set of N probabilities. The set of output probabilities from each HMM is the input for the “Sequences Probabilities Fusion” block, which calculates the arithmetic mean of the N probabilities produced by each HMM. At the end of this step we obtain a vector of K probabilities assigned to the payload by the K HMM.
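A minimal sketch of this fusion step follows. Here `hmm_sequence_prob` stands for any routine returning P(o_j | λ) for a single sequence (e.g., the Forward procedure of a trained HMM); it is an assumed placeholder, not an actual API of the libraries we used.

    import numpy as np

    def payload_probability(sequences, hmm_sequence_prob):
        # Equation (5.9): the payload score is the arithmetic mean of the
        # probabilities P(o_j | lambda) assigned by the HMM to the N sequences.
        probs = np.array([hmm_sequence_prob(o) for o in sequences])
        return float(probs.mean())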

5.4.3 Classification

At this level, the problem is that of combining the outputs of several classifiers. We have already discussed in Section 3.3 the reasons why a system based on multiple classifiers should be preferred to a system based on a single classifier. We just recall here that both the accuracy and the difficulty of evading the IDS benefit from such an approach [22, 76]. Among all of the available techniques for combining an ensemble of classifiers, we decided to use the static combination rules described in Section 3.3.2, i.e., the minimum, the maximum, the mean, and the geometric mean rules [63]. The output of the MCS block is thus the probability that the payload P belongs to the normal traffic. A payload P is then classified as normal or attack if the probability assigned by the system is respectively above or below a predefined threshold. The choice of the threshold depends on the accepted trade-off between detection and false positive rates.
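For illustration, the four static combination rules can be sketched as follows; the probability vector and the threshold in the usage line are made-up values.

    import numpy as np

    def combine(probabilities, rule="min"):
        # Static combination of the K per-HMM payload probabilities.
        p = np.asarray(probabilities, dtype=float)
        if rule == "min":      # minimum rule: consensus-like, conceptually an AND
            return float(p.min())
        if rule == "max":      # maximum rule
            return float(p.max())
        if rule == "mean":     # arithmetic mean
            return float(p.mean())
        if rule == "geomean":  # geometric mean (assumes strictly positive probabilities)
            return float(np.exp(np.log(p).mean()))
        raise ValueError("unknown rule")

    # A payload is labeled as an attack when the combined probability of being
    # normal falls below the detection threshold.
    is_attack = combine([0.91, 0.88, 0.95, 0.90, 0.93], rule="min") < 0.5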

5.5 HMMPayl Evaluation

5.5.1 Experimental Setup

This Section is organized as follows. The following paragraphs provide some guidelines on how to set up the system, and the choices we made to perform the reported experiments. Subsection 5.5.2 describes the datasets used to validate the proposed model. In particular, we used different datasets of normal traffic and of attacks. Finally, Subsection 5.5.3 details the measures we used to evaluate the performance of HMMPayl.

The width n of the sliding window

From the previous Section, it should be clear that the larger the width of the window, the more accurate the IDS. The reported experiments are aimed at showing this behavior, using values of n from 2 to 10. It is also worth noting that a further increase in the window size does not provide a significant gain in performance, as distant bytes are loosely related.

Hidden Markov Models Parameters

In Section 3.2.1 it has been shown that the performance of an HMM is affected by the choice of its parameters, namely

(a) the number of hidden states;

(b) the initialization of the emission and transition matrices.

A brief description of how to set these parameters follows.

Number of States The estimation of the most appropriate number of hidden states of an HMM for a given application is more art than science. Nevertheless, there are several heuristics that allow building effective HMM. In particular, we observed that a good choice of the number of hidden states is related to the length of the sequences to be processed. Thus, we have chosen to build the HMM with a number of states exactly equal to the length n of the sequences. Therefore, in our experiments the number of states varies in the range from 2 to 10, according to the value of n. It is worth noticing that all the HMM in the ensemble have the same number of states, the only difference being the initialization of the matrices.

Initialization of Matrices The behavior of an HMM resulting from the training procedure depends on the training data as well as on the initial values of the matrices A and B (see Section 3.2.1). Among all the strategies available to initialize the matrices A and B, we opted for random initialization [79]. In fact, strategies other than random initialization generally try to take into account the structure of the sequences to be modeled. Since in HMMPayl this structure is arbitrary (as it depends on the value of n), these strategies are not suitable for our purposes. Furthermore, random initialization is a common practice in Pattern Recognition when the selection of parameters cannot be driven by the data.
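A minimal sketch of such a random initialization follows, assuming n hidden states and 256 observable symbols (one per byte value); the helper name is ours.

    import numpy as np

    def random_stochastic_matrix(rows, cols, rng):
        # Each row is non-negative and sums to one (a valid probability distribution).
        m = rng.random((rows, cols))
        return m / m.sum(axis=1, keepdims=True)

    rng = np.random.default_rng()
    n_states, n_symbols = 5, 256                              # e.g. n = 5 states, one symbol per byte value
    A = random_stochastic_matrix(n_states, n_states, rng)     # transition matrix
    B = random_stochastic_matrix(n_states, n_symbols, rng)    # emission matrix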

Number of HMM

After some preliminary experiments, we found that K = 5 HMM was a good compromise between system complexity and the ability of the system to be accurate, hard to evade, and fast enough.

5.5.2 Datasets

In this Section we describe the characteristics of the datasets we used in our experiments. HMMPayl has been extensively tested on three different datasets of normal traffic, and on five datasets containing different types of attacks.

Normal Traffic

In our experiments three datasets of normal traffic have been employed, one of them containing simulated traffic, while the other two contain real traffic collected at academic institutions. These datasets are called DARPA, GT and DIEE, respectively. The DARPA and GT datasets are exactly the same datasets on which we evaluated McPAD. They have already been described in Section 5.3.2. The DIEE dataset consists of six days of traffic toward the website of the Department of Electrical and Electronic Engineering (DIEE) at the University of Cagliari, Italy. It is important to remark that both the GT and the DIEE datasets are completely unlabeled. Nevertheless, it is reasonable to assume that the traffic is made up of legitimate HTTP requests, and even if we cannot exclude the presence of attacks, they would represent a negligible fraction with respect to the normal traffic. For this reason, we consider the possible fraction of attacks in the dataset as noise. In fact, both networks are protected by firewalls and IDS, and in case of persistent intrusion attempts, there is usually evidence that an attack is occurring. As no evidence was reported in the period in which we collected the traffic, we assume that the level of noise in both datasets is negligible. Therefore, in the following we consider the GT and DIEE datasets as “clean” from known attacks for the purpose of measuring the false positive rate. The experiments have been carried out in the same way on all three datasets, both for training and testing. A k-fold cross validation has been performed, using in rotation one day of traffic for training and all the remaining days for testing purposes. Details about the number of packets and the size (in MB) of each trace are summarized in Table 5.11.

Table 5.11: Number of packets and size (MB) of traces of normal traffic

        DARPA                  DIEE                   GT
Day     Size (MB)  Packets     Size (MB)  Packets     Size (MB)  Packets
1       19         161,602     1.1        13,797      131        307,929
2       23         196,605     4.32       13,451      72         171,750
3       23         189,362     1          12,377      124        289,649
4       30         268,250     0.612      7,418       110        263,498
5       18         150,847     0.612      7,419       79         195,192
6       –          –           0.98       11,933      78         184,572
7       –          –           –          –           127        296,425

Very recently, some researchers have made publicly available a dataset of network traffic collected during a network warfare competition (the ITOC dataset) [80]. However, it appeared too late for us to include results on it in this work.

Attack Datasets

We evaluated HMMPayl with four different datasets made up of the most frequently observed attacks against web applications. We used the Generic, Shell-code and CLET datasets used to test McPAD. In addition, we used a fourth dataset, which we call XSS-SQL, that is the same dataset of attacks on which we evaluated HMM-Web (see Table 4.2). It is worth remembering here that the XSS-SQL dataset consists of 38 attacks: 19 Cross Site Scripting (XSS) and 19 SQL-Injection attacks.

5.5.3 Evaluation Metrics

In order to validate the classification performance of our detector, we use the Receiver Operating Characteristic (ROC) curve analysis, and the related Area Under the ROC Curve (AUC) described in Section 2.3.1. In particular, we will use the partial AUC (AUCp), that is, the AUC calculated with a maximum of 10% of false alarms. Further, in Section 5.6.6 we show the value of the detection rate for 1% and 0.1% of false alarms, respectively.

To conclude, we provide a final remark about the evaluation of the IDS. As HMMPayl does not reconstruct HTTP sessions, we consider a per-packet detection rate. This means that the detection rate reported in our experiments is the fraction of attack packets detected by the IDS. This is a pessimistic estimate of the real detection rate, which would be calculated in terms of the fraction of attacks detected. In real cases, an attack is considered to be detected if at least one of its packets is detected.
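For reference, the AUCp used throughout this evaluation can be computed from the ROC points as sketched below; the rescaling by the maximum false positive rate, so that a perfect detector scores 1, is an assumption consistent with the values reported in this chapter, not a statement of the exact implementation we used.

    import numpy as np

    def partial_auc(fpr, tpr, max_fpr=0.1):
        # fpr and tpr are the ROC points, sorted by increasing false positive rate.
        fpr = np.asarray(fpr, dtype=float)
        tpr = np.asarray(tpr, dtype=float)
        keep = fpr <= max_fpr
        x = np.append(fpr[keep], max_fpr)
        y = np.append(tpr[keep], np.interp(max_fpr, fpr, tpr))
        # Trapezoidal integration up to max_fpr, rescaled so a perfect detector scores 1.
        return float(np.trapz(y, x) / max_fpr)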

Sequences sampling

In Section 5.4.1 we showed the relationship between the length of the payload and the size of the set of extracted sequences. Typically, the length of the payload is several hundred bytes, so our system extracts several hundred sequences. With the exception of the first n and the last n bytes of the payload, we also showed that the same byte falls in n different sequences, thus generating a certain redundancy. This redundancy can be exploited to reduce the number of sequences to be processed by the HMM. In particular, we propose to randomly sample the sequences generated by our system, and use them to classify the payload.

From a practical perspective, the randomization strategy we propose has several outcomes. First of all, it speeds up the IDS in proportion to the reduction of the sampling ratio. Moreover, randomization might also be a defense strategy against attempts of evasion [16]. As the subset of analyzed sequences is chosen randomly for each payload, an attacker is unaware of the portions of the payload used to classify it. Consequently, to evade the IDS an attacker would need to reproduce the whole structure of normal packets, which is obviously a harder task. While this is an interesting issue, in this work we are not able to provide any quantitative evaluation of the difficulty of evading the IDS. On the other hand, the experiments reported in Sections 5.6.4 and 5.6.8 show the performance of the system when 20%, 40%, 60% and 80% of the sequences are randomly chosen from the whole set of sequences.
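A minimal sketch of the per-payload random sampling, assuming the sequences have already been extracted as described in Section 5.4.1 (the function name and the sampling ratio in the comment are illustrative):

    import random

    def sample_sequences(sequences, ratio):
        # Draw a fresh random subset for every payload, so an attacker cannot
        # predict which portions of the payload will be analyzed.
        k = max(1, int(len(sequences) * ratio))
        return random.sample(sequences, k)

    # e.g. analyze only 20% of the extracted sequences:
    # subset = sample_sequences(sequences, ratio=0.2)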

5.6 Experimental Results

This section shows the performance of HMMPayl, and it is organized as follows. Section 5.6.1 shows the performance of HMMPayl on both Shell-code and CLET attacks, while Sections 5.6.2 and 5.6.3 show the ability of the system in detecting Generic and XSS-SQL attacks, respectively. The behavior of the system when a randomly sampled subset of sequences is used for classification is discussed in Section 5.6.4. Section 5.6.5 reports the upper bound in performance attained by the ideal selector, and Section 5.6.7 summarizes the average results obtained using a single HMM. Finally, Section 5.6.8 provides an evaluation of the computational cost of HMMPayl.

5.6.1 Shell-code and CLET dataset

The results attained on the Shell-code and the CLET datasets are shown together, as the attacks in the CLET dataset have been obtained by modifying Shell-code attacks with a polymorphic engine so that the n-gram statistics match those of the normal traffic. Thus, all of the considered attacks aim at injecting executable code into the target machine, where CLET attacks should be more difficult to detect as they have been crafted to be similar to normal traffic. Table 5.12 shows the average and the standard deviation of the partial AUC (AUCp) calculated over the whole range of considered sequence lengths (namely, from 2 to 10), and for all the combining rules illustrated in Section 3.3.2. The reported results show that, for this category of attacks, the length of the sequences does not affect the accuracy significantly, as the values of the standard deviation are quite small. The reported values of AUCp are quite high, thus pointing out that the proposed system is effective in detecting not only Shell-code attacks, but also their polymorphic variants that have been crafted to evade the detection system.

Such a good result can be easily explained if we inspect Figure 5.20, where the distribution of 2-grams for the normal traffic and for the Shell-code attacks is shown. The 2-grams of normal traffic are concentrated in the bottom-left corner of the graph, as the normal traffic contains only ASCII characters. On the contrary, the bytes of Shell-code attacks are distributed evenly in the range 0-255. As the training set is made up of normal traffic, the vast majority of training sequences is made up of ASCII characters, and the probability associated by the HMM to non-ASCII symbols is close to zero. This makes it quite easy for HMMPayl to recognize Shell-code attacks.

Table 5.12: Summary of the values of AUCp for Shell-code and CLET attacks

                 Minimum               Maximum               Mean                  GeoMean
Attack           AUCp      σ           AUCp      σ           AUCp      σ           AUCp      σ
DARPA
Shell-code       0.99909   0.0019376   0.91937   0.065238    0.97187   0.041408    0.99967   0.00030095
CLET             0.9984    0.002354    0.93551   0.05324     0.97553   0.039972    0.99935   0.00059503
DIEE
Shell-code       0.99103   0.011958    0.99134   0.0097119   0.99038   0.011111    0.9932    0.0092128
CLET             0.99359   0.0085823   0.9946    0.0068128   0.99356   0.0081228   0.99506   0.0066504
GT
Shell-code       0.9887    0.0035      0.8547    0.0937      0.9153    0.0659      0.9865    0.0043
CLET             0.9906    0.0087      0.8887    0.0825      0.9434    0.0596      0.9915    0.0094

Figure 5.20: Distribution of bytes from normal traffic (green) and from Shell-code attacks (red).

Finally, the comparison of the different combining rules shows that the minimum rule attains the best performance. This result is not surprising, and it is also supported by a theoretical background [17]. In fact, it is known that the minimum rule is conceptually equivalent to the logic operator AND [62]. This means that combining the outputs of several classifiers using the minimum rule amounts to requiring a complete consensus among all the classifiers in the ensemble.

5.6.2 Generic Dataset

The “Generic” dataset includes different attack types, including Shell-code attacks. While Shell-code attacks can be detected quite easily, other attacks included in this dataset (e.g., DoS, URL decoding errors, etc.) are more difficult to detect for an IDS based on simple payload statistics, as pointed out in Section 5.1. In particular, let us recall that the statistics of a payload containing a Denial of Service attack (see for example Figure 5.4) do not significantly deviate from those of normal traffic. This implies that a detailed model of the payload is necessary to achieve a high detection rate. HMMPayl can provide such a detailed model, as it allows attaining higher detection rates than those of similar approaches in the literature.

Figure 5.21: AUCp values for the Generic Attacks dataset. The AUCp increases with the length n of sequences extracted from the payload.

Experimental results are presented in Figure 5.21, where HMMPayl is compared to McPAD on both the DARPA and the GT datasets. PAYL is not considered in this comparison because it has already been shown that McPAD outperforms PAYL when these attacks are considered (see Figures 5.9 and 5.10). The graph clearly shows that the AUCp attained by HMMPayl increases with the value of n. This means that the larger the value of n, the better the structure of the payload is inferred by the IDS. In particular, a value of n = 3 is sufficient for HMMPayl to outperform McPAD on the DARPA dataset, whereas on the GT dataset the same behavior is attained with a value of n greater than or equal to 5. This result can be explained by observing that the traffic in the DARPA dataset is synthetically generated, while the GT dataset contains real traffic. As a consequence, a larger value of n is needed to obtain an accurate model of the higher variability of the payload observed in real traffic. For the sake of clarity, we report the best performance attained by HMMPayl and McPAD among the different combination rules. In particular, the reported results for HMMPayl are related to the minimum rule, while those of McPAD are related to the arithmetic average, as they provided the best performance, respectively. As far as the results attained on the DIEE dataset are concerned, we limited to five the maximum length of the sequences extracted from the payload. This choice is motivated by the fact that the applications running on the DIEE web server are quite simple, and consequently it is easy to model byte frequencies. This behavior is clearly shown in the graph, as the maximum value of AUCp is reached for n = 3.

Figure 5.22: Values of AUCp for the XSS and SQL Injection Attacks. The AUCp increases with the length n of sequences extracted from the payload

5.6.3 XSS and SQL-Injection Attacks

When HMMPayl is tested on Cross Site Scripting (XSS) and SQL-Injection attacks, the behavior is similar to that exhibited by HMMPayl when tested on “Generic Attacks”. This similarity is not surprising, since the structure of the payload of attacks belonging to the XSS-SQL dataset is quite similar to that of normal traffic, as is the case for some of the attacks inside the “Generic Attacks” dataset (see Section 5.6.2). Accordingly, the value of AUCp increases with the length n of the sequences extracted from the payload (Figure 5.22).

It is worth pointing out two aspects arising from the analysis of the results reported in Figure 5.22. One comment is related to the behavior on the GT dataset, which is definitely the most difficult to model. In spite of this difficulty, the AUCp provided by HMMPayl is larger than 0.85, and a value of n = 6 is sufficient to attain this result. Another comment is related to the behavior on the DIEE dataset. The instance of HMMPayl trained on this dataset achieves 100% accuracy against XSS and SQL-Injection attacks. This result is particularly noticeable since XSS and SQL-Injection attacks are tailored to exploit vulnerabilities of the applications running on the DIEE web server.

5.6.4 Sequences Sampling

In Section 5.5.3 we proposed a randomization strategy to exploit the redundancy in the sequences used to classify the payload. We expect a decrease in performance related to the reduction of the number of sequences generated by the sliding window. However, it is of interest to quantify the amount of this decrease, in order to measure the trade-off between computational complexity and accuracy. To this end, we tested the performance of HMMPayl on the DARPA and GT datasets, as they are more difficult to model than the DIEE dataset. Attacks from the Shell-code, XSS-SQL and Generic datasets have been considered, while the ensemble of HMM has been combined with the minimum rule. Experimental results are reported in Figure 5.23. As we expected, the larger the percentage of randomly sampled sequences, the larger the AUCp.

Figure 5.23: Performance of HMMPayl in terms of AUCp when a subset of sequences is randomly chosen. The sampling varies between 20% and 100%.

The maximum AUCp is achieved when the whole packet is considered (100%, i.e., no sampling is performed). Nevertheless, it is worth remarking that HMMPayl attains good values of AUCp even if a very small percentage of sequences is considered (e.g., 20-30%). For example, at a sampling rate of 20% on the GT dataset, the corresponding reduction in AUCp is approximately 3% for Shell-code attacks, 6% for Generic attacks and 9% for XSS-SQL attacks. If 40% of the sequences are considered, the loss reduces to 0.6%, 2% and 3%, respectively. The performance loss on the DARPA dataset is always smaller than that observed on the GT dataset, since the normal traffic in the DARPA dataset is simpler to model. We can also observe that the loss is larger for those attacks (such as XSS) whose detection is strictly related to the amount of information that can be extracted from the payload. These results make sense when compared with the reduction in the per-packet processing time: by considering just 20% of the sequences, the cost of payload processing is reduced approximately by a factor of 4. A more detailed discussion on the computational cost will be provided in Section 5.6.8. It is also worth remarking that we performed a random selection of sequences in order to make it difficult for an attacker to modify the attacks so as to evade detection. On the other hand, a deterministic sampling of the sequences based on the characteristics of the attacks may allow reducing the performance loss. However, the measure of the trade-off between detection accuracy, difficulty of evasion, and computational cost is outside the scope of this work.

Figure 5.24: Comparison of the AUCp attained by the Ideal Selector with that attained by the Minimum Rule. GT and DARPA datasets, with Generic and XSS-SQL attacks

5.6.5 Ideal Selector

The reported results on classifier selection have been carried out on the DARPA and the GT datasets, and aim to show the gain in performance that could be attained by a careful design of the classifier fusion module. In particular, as the performance attained by HMMPayl in the case of Shell-code and CLET attacks is already very high (see Section 5.6.1), the experiments in Figure 5.24 are restricted to the Generic and XSS-SQL attack datasets. The relevant result is that, in principle, it is possible to increase the AUCp up to 100%, if the best HMM in the ensemble is always dynamically selected for each sequence. In other words, the use of the ensemble of HMM provides complementary information that can be further exploited to increase the performance. However, the task of deploying the most suitable combination rule could be as hard as the solution of the detection problem itself, as it would require resorting to “trained” fusion rules [31].

5.6.6 Performance evaluation in terms of Detection Rate at fixed values of False Alarm rate

The measure of performance used so far provided an overall evaluation of the IDS in the range of false alarm rates from 0% to 10%. While this is a useful measure, as its value does not depend on a particular threshold value, it is also useful to inspect the behavior of HMMPayl for some particular values of the threshold. A typical performance measure is the evaluation of the detection rate obtained by setting the threshold so that the false positive rate is equal to 0.01 and 0.001, that is, 1% and 0.1% of false alarms, respectively. These values are shown in Table 5.13 for all the datasets of normal traffic and attacks.

These results are consistent with those presented in the previous Sections, thus confirming the validity of the proposed approach. In fact, when a percentage of false alarms equal to 1% is allowed, HMMPayl detects all the attacks in the Shell-code and CLET datasets. At the same percentage of false alarms, approximately 90% and 94% of the attacks in the Generic dataset are detected on the GT and DARPA datasets, respectively. When a percentage of false alarms of 0.1% is allowed, the detection rate still remains very high with respect to attacks in the Shell-code and CLET datasets.

Table 5.13: Detection Rate at False Positive rate = 0.01 and 0.001

                          Detection Rate (Real False Positive)
Dataset   False Positive   XSS-SQL          Generic          Shell-code       CLET
DARPA     0.01             1 (0.0103)       0.984 (0.0103)   0.997 (0.0103)   0.996 (0.0103)
          0.001            0.996 (0.0013)   0.941 (0.0013)   0.989 (0.0013)   0.984 (0.0013)
DIEE      0.01             0.872 (0.0101)   0.942 (0.0101)   1 (0.0101)       1 (0.0101)
          0.001            0.857 (0.0069)   0.875 (0.0069)   0.985 (0.0069)   0.989 (0.0069)
GT        0.01             0.865 (0.0137)   0.897 (0.0137)   1 (0.0137)       1 (0.0137)
          0.001            0.805 (0.0025)   0.779 (0.0025)   0.978 (0.0025)   0.991 (0.0025)

As far as attacks in the Generic and XSS-SQL datasets are concerned, HMMPayl guarantees a detection rate of at least approximately 80%. These results clearly point out the effectiveness of the proposed model compared to other approaches in the literature.

A direct comparison to results reported in the literature is not possible, because tests are typically carried out on private data, and the source code of the proposed techniques is usually not available. Spectrogram has been tested against the same categories of attacks we used to test HMMPayl, and both Spectrogram and HMMPayl exhibited the highest performance on Shell-code attacks and lower performance on the other attacks (those that we denoted by the term “Generic”) [85]. However, it is worth pointing out that the detection rates of HMMPayl at the 0.1% false alarm rate are quite high compared to those of Spectrogram. In other words, the proposed technique exhibits a better trade-off between detection rate and false alarm rate.
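The detection rates at fixed false alarm rates reported in Table 5.13 can be obtained by setting the threshold on the scores of normal traffic, as sketched below; names are illustrative, and the returned “real” false positive rate corresponds to the values in parentheses in the table.

    import numpy as np

    def detection_rate_at_fpr(scores_normal, scores_attack, target_fpr):
        # The threshold is the (1 - target_fpr) quantile of the scores of normal
        # traffic; scores are anomaly scores (higher = more anomalous).
        threshold = np.quantile(scores_normal, 1.0 - target_fpr)
        real_fpr = float(np.mean(scores_normal >= threshold))    # may differ slightly from the target
        detection_rate = float(np.mean(scores_attack >= threshold))
        return detection_rate, real_fpr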

5.6.7 Single Classifiers Performance

In Section 3.3 we provided theoretical motivations for the use of multiple classifiers. To support these theoretical arguments with experimental results, here we present the results achieved by HMMPayl when a single classifier instead of an ensemble is used. Basically, these are the average results obtained when the number of HMM K in the Pattern Analysis step is set equal to one. Results are averaged over the five different classifiers and over the values of the sequence length n. It is evident from Table 5.14 that the value of the AUCp benefits significantly from the employment of multiple classifiers.

Table 5.14: Average AUCp achieved by single classifiers on the GT dataset. The AUCp is averaged both over the single HMM and over the sequence length n

GT Dataset   Shell-code   CLET     Generic
AUCp         0.8369       0.7778   0.6438
σ            0.0598       0.169    0.1566

5.6.8 Analysis of the Computational Cost

In this section, we provide a brief discussion of the computational cost of HMMPayl. At the moment, HMMPayl is just a set of scripts that implements the algorithm. This means that we have not implemented an optimized software tool able to work online, reading packets from the network interface and assigning a probability to them. Consequently, the computational cost reported here is just an average per-packet cost, calculated during the testing phase. No evaluation of the computational cost has been made for the training phase, as training can be performed off-line and thus does not require the IDS to keep up with the network speed. Table 5.15 reports the characteristics of the machine on which we ran the experiments.

Table 5.15: Workstation specifications

CPU                                          Memory   O.S.
2 Dual Core AMD Opteron(tm) Processor 280    8 GB     Debian 4.0

The scripts that implements the HMMPayl algorithm are based on two libraries:

• LibPcap [2], which is a network programming library used to extract bytes from the HTTP Payload.

• GHMM [1], which is a library that implements Hidden Markov Models. In particular, we used the Python wrapper of GHMM.

Table 5.16: HMMPayl’s average processing time per payload. The value between brackets represents the number of payloads in the test dataset. The sampling ratio indicates the per- centage of sequences sampled from each payload with the randomization strategy. HMM- Payl uses m=5 HMM.

                    Sampling Ratio
Dataset             20%        40%         60%         80%         100%
DARPA (137,997)     7.48 ms    12.71 ms    17.70 ms    22.96 ms    27.51 ms
GT (1,068,429)      8.37 ms    14.18 ms    20.72 ms    27.59 ms    32.66 ms

The performance of HMMPayl is presented in Table 5.16. We can compare these results with those in Tables 5.9 and 5.10, where the performance of PAYL and McPAD is reported. It can be observed that HMMPayl is definitely slower than PAYL, even when it processes just a small subset of the sequences extracted from each packet. On the other hand, the processing times of McPAD and HMMPayl are quite similar, and both are higher than that of PAYL. We are aware that the reported processing time of HMMPayl is not suitable for an IDS with real-time constraints. Nevertheless, we can provide several reasons that explain the slowness of our implementation:

Proof of concept The implementations of both HMMPayl and GHMM are proofs of concept, aimed at evaluating the performance in terms of detection and false alarm rates.

Python Both HMMPayl and the GHMM bindings we used are written in Python, a fast interpreted language which is nevertheless not as fast as compiled languages such as C or C++.

These processing times could be improved in several ways. Among them, we can consider:

• Implementing the IDS in C or C++. LibPcap, which is probably the most widely used library for network programming, is also written in C [5].

• Multi-thread processing. As in HMMPayl the payload analysis is carried out by a set of classifiers, a parallel computation exploiting recent multi-core architectures would allow the IDS to be sped up, as sketched below.
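
The following is a minimal sketch of the second option, using only Python's standard concurrent.futures module; score_with_hmm() is a placeholder for the scoring performed by one classifier of the ensemble, not a function of the original scripts. Since each HMM scores the payload independently, the computations can be distributed over separate worker processes and the results fused afterwards.

from concurrent.futures import ProcessPoolExecutor

def score_with_hmm(args):
    """Placeholder: score assigned to one payload by a single trained HMM of the ensemble."""
    hmm, payload = args
    return hmm.score(payload)   # hypothetical method name, for illustration only

def parallel_payload_score(hmms, payload, workers=4):
    """Score one payload with every HMM of the ensemble in parallel, then fuse by averaging.
    Worker processes (rather than threads) are used so that the CPU-bound scoring can
    actually exploit multiple cores."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score_with_hmm, [(hmm, payload) for hmm in hmms]))
    return sum(scores) / len(scores)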

For these reasons, we are very confident that a good implementation of the algorithm would significantly improve the performance in terms of processing time. It is also worth remarking on the benefits provided by the random sampling in terms of computational cost. In Section 5.6.4 we have already discussed how the random sampling strategy affects the accuracy of HMMPayl. It has been shown that even if only a small percentage (20%) of the sequences extracted from the packet is considered, the AUCp still remains very high, being approximately 97% of that obtained considering the whole packet. At the same time, as can be observed from Table 5.16, the IDS becomes 4 times faster. For this reason, our opinion is that sampling the sequences produced for each payload can be seen as a way to speed up the IDS. To conclude, we provide some information on the amount of memory used by the algorithm: the memory employed by the process was 2-3% of the on-board memory (200-300 MB), which is a negligible amount with respect to that available on modern machines. The optimization of the code will certainly also reduce the amount of memory employed.
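
As a minimal sketch of the sampling strategy (assuming, as described earlier in this chapter, that the sequences are the overlapping windows of n consecutive bytes obtained by sliding over the payload; this is not the original code), the following Python fragment extracts the sequences of a payload and keeps only a random fraction of them:

import random

def extract_sequences(payload, n):
    """All sequences of n consecutive bytes obtained by sliding a window over the payload."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]

def sample_sequences(sequences, sampling_ratio, seed=None):
    """Keep only a random fraction of the extracted sequences (e.g. 0.2 for a 20% ratio)."""
    rng = random.Random(seed)
    k = max(1, int(round(sampling_ratio * len(sequences))))
    return rng.sample(sequences, k)

# Example: keep 20% of the 5-byte sequences of a toy payload.
payload = list(b"GET /index.html HTTP/1.1")
subset = sample_sequences(extract_sequences(payload, n=5), sampling_ratio=0.2, seed=0)

With a 20% ratio only about one fifth of the sequences is scored by the HMMs, which is consistent with the roughly four-fold speed-up reported in Table 5.16.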

Chapter 6

Conclusions and Future Works

You can’t defend. You can’t prevent. The only thing you can do is detect and respond. Bruce Schneier

Nowadays, one of the key challenges in computer security is the security of web applications. In fact, every day the Internet is crossed by huge amounts of sensitive data, such as credit card numbers, personal data and so on. These data are stored in databases that are accessed by web applications. Thus, the security of these data is strictly connected to that of the web applications that access them. Since web applications are generally large, complex and highly customized, relying only on signature-based IDS to protect them is not a good choice, for several reasons. One is that zero-day attacks are particularly dangerous in the case of web applications. Exploiting a vulnerability in an application that receives a large number of requests every day might allow the attacker to quickly spread malware to a large number of victims or to steal personal data from many users. Thus, a protection mechanism is necessary that is able to cope also with attacks that exploit unknown vulnerabilities. Another reason is that, as web applications are often highly customized, an IDS based on signatures would require signatures written explicitly for each particular application. Writing signatures is a tedious and time-consuming task, which also requires the administrator to have an in-depth knowledge of the application to be protected. Since the administrator of the IDS is not always the developer of the application, this might not be possible. A third reason is that, even if tools are available to scan web applications for vulnerabilities, it is not possible to guarantee that the application is completely vulnerability-free. In fact, using several tools we can check the application for known vulnerabilities, but we cannot claim that the application is completely secure, as unknown vulnerabilities might exist. Therefore, an application protected only with a signature-based IDS cannot be considered secure. Anomaly detection systems better meet the protection requirements of web applications. Anomaly-based IDS learn a model from a set of training data. If the model matches the data well, the IDS can detect novel attacks. In addition, with respect to signature-based IDS, anomaly-based systems do not require higher skills on the part of the system developer or administrator. As a consequence of the crucial role that anomaly-based IDS play within a defense strategy, in recent years the activity of the research community has been

mainly focused on them.

6.1 Conclusions

This dissertation summarizes our research activity on anomaly-based Intrusion Detection Systems for protecting web applications. We proposed several solutions, in particular two (called McPAD and HMMPayl, respectively) for protection at the network level and one (called HMM-Web) for protection at the application level. Despite the practical details that obviously distinguish these solutions (e.g., the features used, the classifiers, etc.), all of them are inspired by the same philosophy and are designed following the same principles. The problem is formulated as a one-class problem that is substantially based on the definition of the normal behavior of the web application. The normal behavior refers to a set of characteristics extracted from HTTP messages that are observed during normal operation. Detection is performed under the assumption that attacks cause significant changes (i.e., anomalies) in the application behavior. We adopted this approach in all the proposed solutions. A statistical model of the normal behavior is created starting from a set of training data assumed to be attack-free. Whenever an HTTP request is received, it is analyzed by the IDS to check how well it fits the statistical model created during the learning phase. If the request deviates from the statistical model of the normal behavior, the request is considered anomalous. Even if this approach is widely used, several questions can be raised that require an answer. One is whether the assumption that the training dataset is attack-free is indeed realistic and, if it is not, how the presence of “noise” (that is, attacks) has been taken into account during the learning phase. This is an important point, since the presence of attacks within the training set might result in an IDS which is vulnerable to those very attacks [10]. In the case of the network traces used to test McPAD and HMMPayl, we did not observe signs of attacks in the network activity during the period in which we collected them. Since each trace consists of several tens of thousands of packets, we can reasonably presume that our assumption holds. On the contrary, as far as the dataset on which we tested HMM-Web is concerned, we are aware of the presence of attacks inside it, and we designed the IDS in a way that takes this “noise” into account. To cope with this problem, we adopted different strategies that take into account both the scenario and, obviously, the design of the IDS. With McPAD this issue is addressed during the training phase, since to train the One-class SVMs a percentage of desired false positives must be chosen. Depending on the value chosen for this percentage, a certain amount of training patterns is rejected and left outside the decision boundary. This allows those attacks that lie in the proximity of the decision boundary to be left outside the closed surface that encloses the normal patterns. Obviously, we cannot do anything against attack patterns that perfectly mimic the normal ones [37, 87, 93]. In that case, additional features would be necessary to distinguish between them. With HMM-Web the presence of noise is considered after the learning phase. We addressed the problem by taking into account how frequently an application is requested. Our intuition was that the more frequently an application is requested, the smaller the percentage of attacks among its requests.
Experimental results confirmed that setting the decision threshold by taking into account the relative frequency of the application (that is, the percentage of requests it receives out of the total received by the web server) allows a better trade-off between false positive rate and detection rate to be found. With HMMPayl we did not explicitly address the problem, but there are several reasons to conclude that even if attacks were present in the training set, they did not undermine the IDS performance. One is a consequence of the way HMMPayl extracts features. The redundancy of the process makes the same sequence of bytes be processed several times. If the percentage of malicious payloads is negligible, the number of legitimate sequences will be by far larger than the number of non-legitimate ones. Another reason is related to Hidden Markov Models, which are particularly robust to noise. In addition, by manually inspecting the emission and transition matrices of the HMMs, we did not observe anything anomalous. Another point worth remarking on is the approach based on multiple classifiers. Is it true that multiple classifiers improve the IDS performance? Experimental results allow us to answer this question affirmatively. As far as McPAD is concerned, Figures 5.7 and 5.8 clearly show that the accuracy of the IDS increases with the number of classifiers. Also in the case of HMM-Web multiple classifiers work better than a single one, as shown by Figure 4.5. Nevertheless, in the case of HMM-Web the advantage is not so evident, since the preprocessing step significantly simplifies the structure of the HTTP requests, allowing even a single HMM to achieve good results. Finally, HMMPayl too benefits from the employment of multiple classifiers. We demonstrated that if an ensemble is used, the IDS achieves better accuracy than that achieved using a single classifier (compare Table 5.12 and Figure 5.21 with Table 5.14). Furthermore, we showed that multiple classifiers increase the robustness of the IDS against evasion attempts. We investigated this point in particular in the case of McPAD, testing the IDS against the Polymorphic Blending Attack, which is an attack crafted to evade IDS based on payload statistics [37]. Results of the comparison between McPAD and PAYL show that McPAD is able to achieve the same detection rate as PAYL with a false positive rate of 10^-5, compared to the 10^-3 attained by PAYL. This result is particularly relevant if we consider that, as shown by Axelsson [9], the false positive rate is the limiting factor for the performance of an intrusion detection system. To conclude, we recall the main contributions provided by HMM-Web, McPAD and HMMPayl, respectively. The main contribution provided with HMM-Web is the coding strategy of the requests received by web applications. We showed that, using this strategy, the structure of the legitimate requests can be easily inferred. This allows the classifier, which in HMM-Web is a Hidden Markov Model, to easily detect intrusion attempts. The main contribution provided with McPAD is a model of the payload which allows the rate of false positives to be significantly reduced. This result is particularly relevant since McPAD is a network-based IDS and, as a consequence, we expect it to deal with large amounts of traffic. The main contribution provided with HMMPayl is a new way of extracting features from the payload that allows a particularly accurate statistical model to be built.
This allows HMMPayl to be particularly effective against all those attacks that do not deeply affect the structure of the payload.
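
The following fragment is a purely illustrative sketch of the intuition behind the frequency-aware thresholds: the threshold of each application is placed at a percentile of the scores obtained by its own training requests, and the rejected fraction grows as the relative frequency of the application decreases, since the training requests of rarely used applications are more likely to contain attacks. Both the rejection function and its parameters are hypothetical; the actual rule adopted in HMM-Web is the one described in Chapter 4.

def rejection_fraction(relative_frequency, base=0.01, extra=0.10):
    """Hypothetical rule: the lower the relative frequency of the application,
    the larger the fraction of its training requests treated as suspicious."""
    return base + extra * (1.0 - relative_frequency)

def per_application_threshold(training_scores, relative_frequency):
    """Place the threshold so that the chosen fraction of the lowest-scoring
    training requests of this application falls below it."""
    scores = sorted(training_scores)   # higher score = closer to the normal model
    index = int(rejection_fraction(relative_frequency) * len(scores))
    return scores[min(index, len(scores) - 1)]

def is_anomalous(score, threshold):
    """One-class decision rule: flag requests whose score falls below the threshold."""
    return score < threshold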

6.2 Future Works

The experimental results presented in this dissertation show that all the proposed IDS could be used effectively to protect a web server. Despite this, the problem could be further investigated to find solutions other than those already proposed. Nevertheless, we are firmly convinced that future research efforts should be directed in a different way. In fact, plenty of possible solutions already exist in the literature. Even if each solution has its own limitations, most of them attain the goal for which they have been designed. Thus, working on further solutions (using, for example, slightly different features or different classifiers) might lead to new solutions for specific problems, but not to a strong answer to the problem of intrusion detection. One of the reasons that make intrusion detection such a big (and sometimes frustrating) challenge is the huge variety of devices connected to the network. Each device has its own operating system, dedicated applications and so on. As a consequence, it is almost impossible to find a solution that alone would be able to guarantee an adequate protection. In addition, the effectiveness of each solution is heavily conditioned by the specific scenario in which it is applied. So, the research activity should be concentrated mainly on two points:

• Optimizing as much as possible the solutions that already exist.

• Integrating these solutions with each other.

The first point might seem a trivial matter, but it includes at least two aspects. One is related to the design of the sensors. For example, in the case of HMMPayl we showed that with an accurate design of the fusion stage the accuracy of the IDS can be further improved (see Section 5.6.5). So, there is still room for improving already existing algorithms before starting the development of new IDS from scratch. The other aspect is related to the usability of the IDS. When we work with machine learning algorithms, typically several parameters have to be estimated to guarantee that proper results are achieved. Examples of such parameters are the number of states of an HMM or the number of clusters resulting from a clustering algorithm. Our experimental results showed that the classification accuracy of an IDS might change significantly depending on the value of these parameters. Unfortunately, finding the best estimate for the value of these parameters is more art than science. Nevertheless, in order for the IDS to achieve good performance, at least a good estimate must be found. This estimate usually depends on the particular application scenario. For example, the properties of the traffic toward a web server will obviously depend on the characteristics of the applications hosted by the web server. Thus, the parameters of a network anomaly-based IDS standing in front of the web server should be optimized according to the network traffic properties. A security officer responsible for the administration of an anomaly-based IDS should not be required to know the IDS internals to find the best tuning of its parameters. This means that the IDS should be able to tune itself, requiring as little human intervention as possible. On the other hand, the other main issue that must be addressed is that of making different sensors able to interact and cooperate with each other. A comprehensive detection strategy should not rely only on a particular detection mechanism or sensor, but should consist of an ensemble of different sensors. The diversity of detection mechanisms requires the attacker to be able to evade all of them at the same time, which is by far more difficult than evading a single IDS. In Figure 6.1 we propose a possible scheme of a comprehensive protection mechanism that includes McPAD, HMMPayl and HMM-Web. The scheme is very general and might, for example, represent that of a corporate network. Services such as the e-mail and the web server are placed in the DMZ, so that they are protected by a Front-end firewall. The internal network is behind two firewalls, the Front-end and the Back-end respectively. The Back-end firewall allows the internal network to initiate connections to any other network, but no other network can initiate connections to it. Hosts inside the DMZ can initiate connections to the outside network, but not to the inside network. Any other network can initiate connections to the DMZ. A possible placement of the IDS could be the following:

• McPAD can be used to analyze the traffic toward the DMZ. Possibly, different instances of McPAD can be used, one for each application-layer protocol.

• HMMPayl can be placed in front of the Web Server. Given the high detection rate of HMMPayl, and that the network traffic at this point has already been analyzed by McPAD, we expect the traffic after HMMPayl to be almost attack-free.

• In addition, the Web Server can be further protected with HMM-Web to detect the remaining attacks.

In such a way, the protection of the web server would be guaranteed by:

• Two IDS (McPAD and HMMPayl) that guarantee a high detection rate against Shell-code attacks, and also against attacks modified with polymorphic engines.

• An IDS (HMM-Web) that guarantees a high detection rate against attacks such as Cross-Site Scripting and SQL-Injection.

• An IDS that has a good detection capability against attacks belonging to the other categories (HMMPayl).

Figure 6.1: A possible scheme of a comprehensive protection mechanism including McPAD, HMMPayl and HMM-Web.

List of Publications Related to the Thesis

Published papers

Journal papers

• R. Perdisci, D. Ariu, P. Fogla, G. Giacinto, and W. Lee. McPAD: A multiple classifier system for accurate payload-based anomaly detection. Computer Networks, 53(6):864 - 881, 2009. Special Issue on Traffic Classification and Its Applications to Modern Networks.

Conference papers

• I. Corona, D. Ariu, and G. Giacinto. HMM-Web: A framework for the detection of attacks against web applications. In Communications, 2009. ICC ’09. IEEE International Conference on, pages 1-6, June 2009.

• D. Ariu, G. Giacinto, R. Perdisci, Sensing Attacks in Computers Networks with Hidden Markov Models. In P.Perner, editor, MLDM, volume 4571 of Lecture Notes in Computer Science, pages 449-463. Springer, 2007.

Submitted papers

Journal papers

• D. Ariu, G. Giacinto, and R. Tronci. HMMPayl: an Intrusion Detection System based on Hidden Markov Models. Submitted to Computers & Security, Elsevier.


Bibliography

[1] GHMM: General Hidden Markov Model library. http://ghmm.org/. [cited at p. 82]

[2] LibPcap: Network programming library. http://www.tcpdump.org. [cited at p. 82]

[3] RFC 2616 - Hypertext Transfer Protocol – HTTP/1.1. [cited at p. 29]

[4] CERT Advisory CA-2000-02: Malicious HTML tags embedded in client web requests, 2000. [cited at p. 31]

[5] Programming with libpcap - sniffing the network from our own application, 2008. [cited at p. 83]

[6] Snort - open source network intrusion prevention and detection systems, 2009. [cited at p. 44]

[7] Muhammad Qasim Ali, Hassan Khan, Ali Sajjad, and Syed Ali Khayam. On achieving good operating points on an roc plane using stochastic anomaly score prediction. In CCS ’09: Pro- ceedings of the 16th ACM conference on Computer and communications security, pages 314–323, New York, NY, USA, 2009. ACM. [cited at p. 11]

[8] J. P. Anderson. Computer security threat monitoring and surveillance. Technical Report 98–17, James P. Anderson Co., Fort Washington, Pennsylvania, USA, April 1980. [cited at p. 3, 8]

[9] Stefan Axelsson. The base-rate fallacy and the difficulty of intrusion detection. ACM Transac- tions on Information and System Security, 3(3):186–205, 2000. [cited at p. 11, 68]

[10] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, and J. D. Tygar. Can machine learning be secure? In Ferng-Ching Lin, Der-Tsai Lee, Bao-Shuh Lin, Shiuhpyng Shieh, and Sushil Jajodia, editors, ASIACCS, pages 16–25. ACM, 2006. [cited at p. 9, 86]

[11] L.E. Baum and J.A. Egon. An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bulletin of the American Meteorological Society, 73:360–363, 1967. [cited at p. 22]

[12] L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statis- tical analysis of probabilistic functions of markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970. [cited at p. 21]

[13] L.E. Baum and G.R. Sell. Growth functions for transformations on manifolds. Pacific Journal of Mathematics, 27(2):211–227, 1968. [cited at p. 22]

[14] Tim Berners-Lee and Mark Fischetti. Weaving the Web: The Original Design and Ultimate Des- tiny of the World Wide Web. DIANE Publishing Company, 2001. [cited at p. 1]


[15] Tim Berners-Lee, Wendy Hall, and James A. Hendler. A Framework for Web Science. Now Pub- lishers Inc, September 2006. [cited at p. 1]

[16] Battista Biggio, Giorgio Fumera, and Fabio Roli. Adversarial pattern classification using mul- tiple classifiers and randomisation. In Niels da Vitoria Lobo, Takis Kasparis, Fabio Roli, James Tin-Yau Kwok, Michael Georgiopoulos, Georgios C. Anagnostopoulos, and Marco Loog, edi- tors, SSPR/SPR, volume 5342 of Lecture Notes in Computer Science, pages 500–509. Springer, 2008. [cited at p. 15, 75]

[17] Battista Biggio, Giorgio Fumera, and Fabio Roli. Multiple classifier systems for adversarial clas- sification tasks. In Jon Atli Benediktsson, Josef Kittler, and Fabio Roli, editors, MCS, volume 5519 of Lecture Notes in Computer Science, pages 132–141. Springer, 2009. [cited at p. 76]

[18] D. Bolzoni, S. Etalle, and P.Hartel. Poseidon: a 2-tier anomaly-based network intrusion detec- tion system. In Information Assurance, 2006. IWIA 2006. Fourth IEEE International Workshop on, pages 10 pp.–156, April 2006. [cited at p. 47]

[19] Andrew P.Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145 – 1159, 1997. [cited at p. 11]

[20] Sung-Bae Cho and Sang-Jun Han. Two sophisticated techniques to improve hmm-based in- trusion detection systems. In Giovanni Vigna, Erland Jonsson, and Christopher Krügel, edi- tors, RAID, volume 2820 of Lecture Notes in Computer Science, pages 207–219. Springer, 2003. [cited at p. 21]

[21] Inc. Citrix Systems. Netscaler application firewall. http://www.citrix.com, November 2009. [cited at p. 8, 13, 33]

[22] I. Corona, G. Giacinto, C. Mazzariello, F.Roli, and C. Sansone. Information fusion for computer security: State of the art and open issues. Information Fusion, 10:274–284, 2009. [cited at p. 15, 23, 72]

[23] C. Cortes and M. Mohri. Confidence intervals for the area under the roc curve. In Advances in Neural Information Processing Systems (NIPS 2004), volume 17. MIT Press, 2005. [cited at p. 11]

[24] Claudio Criscione and Stefano Zanero. Masibty: an anomaly based intrusion prevention system for web applications. In Black Hat Technical Security Conference, 2009. [cited at p. 32]

[25] Marc Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848, 1995. [cited at p. 46]

[26] D.E. Denning. An intrusion-detection model. Software Engineering, IEEE Transactions on, SE- 13(2):222–232, Feb. 1987. [cited at p. 3]

[27] T. Detristan, T. Ulenspiegel, Y. Malcom, and M. Underduk. Polymorphic shellcode engine using spectrum analysis. Phrack Issue 0x3d, 2003. [cited at p. 18, 48, 56, 68]

[28] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. A divisive information theo- retic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265–1287, 2003. [cited at p. 51]

[29] Thomas G. Dietterich. Ensemble methods in machine learning. In Josef Kittler and Fabio Roli, editors, Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer, 2000. [cited at p. ix, 15, 23, 24]

[30] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley-Interscience Publication, 2000. [cited at p. 18, 50, 51]

[31] R.P.W. Duin. The combining classifier: to train or not to train? In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 765–770 vol.2, 2002. [cited at p. 26, 80]

[32] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis. Cambridge University Press, 2006. [cited at p. 21, 69]

[33] Juan M. Estévez-Tapiador, Pedro García-Teodoro, and Jesús E. Díaz-Verdejo. Measuring nor- mality in http traffic for anomaly-based intrusion detection. Computer Networks, 45(2):175 – 193, 2004. [cited at p. 47]

[34] Inc. F5 Networks. Big-ip application security manager. http://www.f5.com/products/big- ip/product-modules/application-security-manager.html, November 2009. [cited at p. 8, 13, 33]

[35] Tom Fawcett. An introduction to roc analysis. Pattern Recognition Letters, 27(8):861 – 874, 2006. ROC Analysis in Pattern Recognition. [cited at p. 10]

[36] Prahlad Fogla and Wenke Lee. Evading network anomaly detection systems: formal reasoning and practical techniques. In CCS ’06: Proceedings of the 13th ACM conference on Computer and communications security, pages 59–68, New York, NY, USA, 2006. ACM. [cited at p. 56, 63, 69]

[37] Prahlad Fogla, Monirul Sharif, Roberto Perdisci, Oleg M. Kolesnikov, and Wenke Lee. Poly- morphic blending attack. In Angelos D. Keromytis, editor, USENIX Security Symposium, pages 241–256. USENIX Association, 2006. [cited at p. 48, 56, 63, 86, 87]

[38] Giorgio Fumera and Fabio Roli. A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:942–956, 06/2005 2005. [cited at p. 15]

[39] D. Gao, M.K. Reiter, and D. Song. Beyond output voting: Detecting compromised replicas using hmm-based behavioral distance. Dependable and Secure Computing, IEEE Transactions on, 6(2):96–110, April-June 2009. [cited at p. 21]

[40] Giorgio Giacinto, Roberto Perdisci, Mauro Del Rio, and Fabio Roli. Intrusion detection in com- puter networks by a modular ensemble of one-class classifiers. Information Fusion, 9(1):69 – 82, 2008. Special Issue on Applications of Ensemble Methods. [cited at p. 15, 26]

[41] Dieter Gollmann. Securing web applications. Information Security Technical Report, 13(1):1 – 9, 2008. [cited at p. 31]

[42] Guofei Gu, Prahlad Fogla, David Dagon, Wenke Lee, and Boris Škorić. Measuring intrusion detection capability: an information-theoretic approach. In ASIACCS ’06: Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pages 90–101, New York, NY, USA, 2006. ACM. [cited at p. 11]

[43] S. Günter and H. Bunke. Optimizing the number of states, training iterations and gaussians in an hmm-based handwritten word recognizer. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, page 472. IEEE Computer Society, 2003. [cited at p. 21, 69]

[44] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003. [cited at p. 51]

[45] William G. J. Halfond, Shauvik Roy Choudhary, and Alessandro Orso. Penetration testing with improved input vector identification. In ICST ’09: Proceedings of the 2009 International Con- ference on Software Testing Verification and Validation, pages 346–355, Washington, DC, USA, 2009. IEEE Computer Society. [cited at p. 32]

[46] T.K Ho. Multiple classifier combination: Lessons and next steps. In A. Kandel and H. Bunke, editors, Hybrid Methods in Pattern Recognition, pages 171–198. World Scientific Publishing, 2002. [cited at p. 15]

[47] Yao-Wen Huang, Shih-Kun Huang, Tsung-Po Lin, and Chung-Hung Tsai. Web application se- curity assessment by fault injection and behavior monitoring. In WWW ’03: Proceedings of the 12th international conference on World Wide Web, pages 148–159, New York, NY, USA, 2003. ACM. [cited at p. 32]

[48] Internet Security Systems IBM-ISS. X-Force 2009 mid-year trend and risk report. Technical report, IBM Global Technology Services, August 2009. [cited at p. 3, 6]

[49] Breach Security Inc. Breach webdefend. http://www.breach.com/products/webdefend.html, November 2009. [cited at p. 8, 13, 33]

[50] Kenneth L. Ingham and Hajime Inoue. Comparing anomaly detection techniques for http. In Christopher Krügel, Richard Lippmann, and Andrew Clark, editors, RAID, volume 4637 of Lec- ture Notes in Computer Science, pages 42–62. Springer, 2007. [cited at p. 39, 54, 55, 56]

[51] K.L. Ingham, A. Somayaji, J. Burge, and S. Forrest. Learning DFA representations of HTTP for protecting web applications. Computer Networks, 51:1239–1255, 2007. [cited at p. 33, 34, 39]

[52] Anil K. Jain, Patrick Flynn, and Arun A. Ross, editors. Handbook of Biometrics. Springer, 2008. [cited at p. 14]

[53] Stefan Kals, Engin Kirda, Christopher Kruegel, and Nenad Jovanovic. Secubat: a web vulner- ability scanner. In Proceedings of the 15th international conference on World Wide Web, pages 247–256, New York, NY, USA, 2006. ACM. [cited at p. 32]

[54] Kristopher Kendall. A database of computer attacks for the evaluation of intrusion detection systems. PhD thesis, Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science, 1999. [cited at p. 44]

[55] Hyunsoo Kim, Peg Howland, and Haesun Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6:37–53, 2005. [cited at p. 51]

[56] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(3):226–239, Mar 1998. [cited at p. 15, 27]

[57] Oleg Kolesnikov and Wenke Lee. Advanced polymorphic worms: Evading ids by blending in with normal traffic. Technical report, Georgia Institute of Technology, 2004. [cited at p. 48]

[58] C. Kruegel, G. Vigna, and W. Robertson. A multi-model approach to the detection of web-based attacks. Computer Networks, 48(5):717–738, 2005. [cited at p. x, 33, 34, 39, 40, 41, 42]

[59] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web-based attacks. In CCS ’03: Proceedings of the 10th ACM conference on Computer and communications security, pages 251–261, New York, NY, USA, 2003. ACM. [cited at p. 21]

[60] Christopher Krügel, Thomas Toth, and Engin Kirda. Service specific anomaly detection for network intrusion detection. In SAC ’02: Proceedings of the 2002 ACM symposium on Applied computing, pages 201–208, New York, NY, USA, 2002. ACM. [cited at p. 47]

[61] Ktwo. Admutate: Shellcode mutation engine, 2001. [cited at p. 48]

[62] L. Kuncheva. Fuzzy Classifier Design, volume 49 of Studies in Fuzziness and Soft Computing. Springer-Verlag, 2000. [cited at p. 76]

[63] L. Kuncheva. Combining Pattern Classifiers. Wiley, 2004. [cited at p. 15, 23, 27, 72]

[64] Pavel Laskov, Patrick Düssel, Christin Schäfer, and Konrad Rieck. Learning intrusion detection: Supervised or unsupervised? In Fabio Roli and Sergio Vitulano, editors, ICIAP, volume 3617 of Lecture Notes in Computer Science, pages 50–57. Springer, 2005. [cited at p. 14]

[65] Wenke Lee and Salvatore J. Stolfo. A framework for constructing features and models for intru- sion detection systems. ACM Transactions on Information and System Security, 3(4):227–261, 2000. [cited at p. 11]

[66] Edda Leopold and Jörg Kindermann. Text categorization with support vector machines. how to represent texts in input space? Machine Learning, 46(1-3):423–444, 2002. [cited at p. 51]

[67] Richard Lippmann, Joshua W. Haines, David J. Fried, Jonathan Korba, and Kumar Das. The 1999 darpa off-line intrusion detection evaluation. Computer Networks, 34(4):579 – 595, 2000. Recent Advances in Intrusion Detection Systems. [cited at p. 54]

[68] Federico Maggi, William K. Robertson, Christopher Krügel, and Giovanni Vigna. Protecting a moving target: Addressing web application concept drift. In Engin Kirda, Somesh Jha, and Davide Balzarotti, editors, RAID, volume 5758 of Lecture Notes in Computer Science, pages 21– 40. Springer, 2009. [cited at p. 33, 69]

[69] Matthew V. Mahoney. Network traffic anomaly detection based on packet bytes. In SAC ’03: Proceedings of the 2003 ACM symposium on Applied computing, pages 346–350, New York, NY, USA, 2003. ACM. [cited at p. 44, 47, 55]

[70] Gian Luca Marcialis, Fabio Roli, and Luca Didaci. Personal identity verification by serial fusion of fingerprint and face matchers. Pattern Recognition, 42(11):2807 – 2817, 2009. [cited at p. 15]

[71] Joshua Mason, Sam Small, Fabian Monrose, and Greg MacManus. English shellcode. In CCS ’09: Proceedings of the 16th ACM conference on Computer and communications security, pages 524–533, New York, NY, USA, 2009. ACM. [cited at p. 48]

[72] J. McHugh. Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intru- sion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information and System Security, 3(4):262–294, 2000. [cited at p. 55]

[73] Aleph One. Smashing the stack for fun and profit. Phrack Issue 49, 7(49), 2003. [cited at p. 32]

[74] Panda-Labs. Annual report - 2008. Technical report, Panda Security, 2008. [cited at p. 6]

[75] R. Perdisci, D. Ariu, P. Fogla, G. Giacinto, and W. Lee. McPAD: A multiple classifier system for accurate payload-based anomaly detection. Computer Networks, 53(6):864 – 881, 2009. Special Issue on Traffic Classification and Its Applications to Modern Networks. [cited at p. 23]

[76] R. Perdisci, Guofei Gu, and Wenke Lee. Using an ensemble of one-class svm classifiers to harden payload-based anomaly detection systems. In Data Mining, 2006. ICDM ’06. Sixth In- ternational Conference on, pages 488–498, Dec. 2006. [cited at p. 14, 15, 54, 56, 58, 72]

[77] Roberto Perdisci, Andrea Lanzi, and Wenke Lee. Classification of packed executables for accurate detection. Pattern Recognition Letters, 29(14):1941 – 1946, 2008. [cited at p. 13]

[78] Thomas Ptacek and Timothy Newsham. Insertion, evasion, and denial of service: Eluding net- work intrusion detection. Technical report, Secure Networks, Inc., January 1998. [cited at p. 7, 44]

[79] L.R. Rabiner. A tutorial on hidden markov models and selected applications in speech recogni- tion. Proceedings of the IEEE, 77(2):257–286, 1989. [cited at p. 21, 36, 73]

[80] B. Sangster, T.J. O’Connor, T. Cook, R. Fanelli, E. Dean, J. Adams, C. Morrell, and G. Conti. Toward instrumenting network warfare competitions to generate labeled datasets. In Secu- rity’s Workshop on Cyber Security Experimentation and Test (CSET). USENIX, August 2009. [cited at p. 74]

[81] Bernhard Schölkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001. [cited at p. 22, 23, 53]

[82] David Scott and Richard Sharp. Abstracting application-level web security. In WWW ’02: Pro- ceedings of the 11th international conference on World Wide Web, pages 396–407, New York, NY, USA, 2002. ACM. [cited at p. 32]

[83] R. Sekar, M. Bendre, D. Dhurjati, and P.Bollineni. A fast automaton-based method for detecting anomalous program behaviors. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages 144–155, 2001. [cited at p. 44]

[84] R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based anomaly detection: a new approach for detecting network intrusions. In CCS ’02: Proceedings of the 9th ACM conference on Computer and communications security, pages 265–274, New York, NY, USA, 2002. ACM. [cited at p. 47]

[85] Yingbo Song, Angelos D. Keromytis, and Salvatore J. Stolfo. Spectrogram: A mixture-of-markov- chains model for anomaly detection in web traffic. In NDSS. The Internet Society, 2009. [cited at p. 21, 33, 34, 69, 81]

[86] Ching Y. Suen. n-gram statistics for natural language understanding and text processing. Pat- tern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-1(2):164–172, April 1979. [cited at p. 69]

[87] Kymie M. C. Tan, Kevin S. Killourhy, and Roy A. Maxion. Undermining an anomaly-based in- trusion detection system using common exploits. In RAID, pages 54–73, 2002. [cited at p. 86]

[88] D. M. J. Tax. One-Class Classification, Concept Learning in the Absence of Counter Examples. PhD thesis, Delft University of Technology, Delft, Netherland, 2001. [cited at p. 19, 51]

[89] David M. J. Tax and Robert P. W. Duin. Combining one-class classifiers. In Josef Kittler and Fabio Roli, editors, Multiple Classifier Systems, volume 2096 of Lecture Notes in Computer Science, pages 299–308. Springer, 2001. [cited at p. 26]

[90] Roberto Tronci, Giorgio Giacinto, and Fabio Roli. Dynamic score selection for fusion of multiple biometric matchers. In Rita Cucchiara, editor, ICIAP, pages 15–22. IEEE Computer Society, 2007. [cited at p. 24, 25, 69]

[91] Fredrik Valeur, Darren Mutz, and Giovanni Vigna. A learning-based approach to the detection of sql attacks. In Klaus Julisch and Christopher Krügel, editors, DIMVA, volume 3548 of Lecture Notes in Computer Science, pages 123–140. Springer, 2005. [cited at p. 31, 33]

[92] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998. [cited at p. 22]

[93] David Wagner and Paolo Soto. Mimicry attacks on host-based intrusion detection systems. In CCS ’02: Proceedings of the 9th ACM conference on Computer and communications security, pages 255–264, New York, NY, USA, 2002. ACM. [cited at p. 48, 86]

[94] Ke Wang, Gabriela F.Cretu, and Salvatore J. Stolfo. Anomalous payload-based worm detection and signature generation. In Alfonso Valdes and Diego Zamboni, editors, RAID, volume 3858 of Lecture Notes in Computer Science, pages 227–246. Springer, 2005. [cited at p. 47, 48, 68]

[95] Ke Wang, Janak J. Parekh, and Salvatore J. Stolfo. Anagram: A content anomaly detector resis- tant to mimicry attack. In Diego Zamboni and Christopher Krügel, editors, RAID, volume 4219 of Lecture Notes in Computer Science, pages 226–248. Springer, 2006. [cited at p. 47]

[96] Ke Wang and Salvatore J. Stolfo. Anomalous payload-based network intrusion detection. In Erland Jonsson, Alfonso Valdes, and Magnus Almgren, editors, RAID, volume 3224 of Lecture Notes in Computer Science, pages 203–222. Springer, 2004. [cited at p. 44, 46, 47, 48, 49, 54, 68, 71]

[97] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: alterna- tive data models. In Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on, pages 133–145, 1999. [cited at p. 21]

[98] Tarkan Yetiser. Polymorphic viruses - implementation, detection, and protection. Technical report, Purdue University, 1993. [cited at p. 48]

[99] W. Yurcik. Controlling intrusion detection systems by generating false positives: squealing proof-of-concept. In Local Computer Networks, 2002. Proceedings. LCN 2002. 27th Annual IEEE Conference on, pages 134–135, Nov. 2002. [cited at p. 17]

[100] Stefano Zanero and Sergio M. Savaresi. Unsupervised learning techniques for an intrusion detection system. In SAC ’04: Proceedings of the 2004 ACM symposium on Applied computing, pages 412–419, New York, NY, USA, 2004. ACM. [cited at p. 14]