SELECTING PAYLOAD FEATURES USING N-GRAM ANALYSIS TO CHARACTERIZE IRC TRAFFIC AND MODEL BEHAVIOUR OF IRC-BASED BOTNETS

by

Goaletsa Rammidi

BSc (Computer Science), University of Botswana, 2006

A Thesis Submitted in Partial Fulfillment

of the Requirements for the Degree of

Master of Computer Science

in the Graduate Academic Unit of the Faculty of Computer Science

Supervisor: Dr. Ali A. Ghorbani, PhD, Faculty of Computer Science

Examining Board: Professor John DeDourek, Computer Science, Chair

Dr. Harold Boley, Adjunct Professor, Computer Science

Dr. Donglei Du, Faculty of Business Administration, UNB

This thesis is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK

September, 2009

© Goaletsa Rammidi, 2010

DEDICATION

To my mother Khumoetsile Rammidi, and in loving memory of my father Maranyane Rammidi. Being your daughter is a blessing.

ABSTRACT

A botnet is a network of compromised computers remotely controlled by an attacker.

Different feature selection methods are applied to find a lower-dimension subset of Unicode characters as payload features, using n-gram analysis with n=1, to classify TCP packets into IRC and non-IRC application groups.

The identified IRC packets are grouped into 1-minute intervals to create a temporal-frequent distribution, and unsupervised clustering is then applied to separate botnet IRC from normal IRC. The botnet cluster is labeled as the one with the minimum cluster standard deviation. We found a subset of 9 features that separates IRC packets from non-IRC in less time and with accuracy comparable to using all 256 features. We also found that IRC traffic is dominated by the first 128 Unicode characters; therefore, using all 256 characters may not be necessary. Clustering packets with merged X-means had lower false alarm rates than K-means and consistently higher detection rates than unmerged X-means.

ACKNOWLEDGEMENTS

I would like to thank Dr. Ali A. Ghorbani for his supervision throughout this thesis. I am also thankful to Dr. Wei Lu for his professional guidance and support during my research. I am grateful to my friend Nyaladzani Jairo Nkhwanana for his valuable comments and for being supportive in writing this thesis, and to my family for their love and support throughout my studies. I also extend my gratitude to the Botswana International University of Science and Technology for funding me to pursue this degree.

Table of Contents

DEDICATION
ABSTRACT
ACKNOWLEDGEMENTS
Table of Contents
List of Tables
List of Figures
List of Symbols, Nomenclature or Abbreviations
Introduction
1.1 Introduction
1.2 Summary of thesis contributions
1.3 Thesis organization
2 Background information and literature review
2.1 What is a bot and a botnet?
2.2 Botnet communication control topologies
2.2.1 Centralized C&C
2.2.2 Peer to Peer (P2P) botnets
2.3 Literature on botnet detection techniques
2.4 Clustering
2.4.1 Hierarchical clustering
2.4.2 Density-based clustering
2.4.3 Partition-based methods
2.4.4 Related research using clustering algorithms in intrusion detection
2.5 N-gram and its application to intrusion detection
2.6 Concluding remarks
3 Proposed framework
3.1 Overview of proposed framework
3.2 Traffic classification
3.2.1 The rationale
3.2.2 Obtaining 1-gram frequent payload features
3.2.3 Feature selection process
3.2.4 Traffic classification using C4.5 algorithm
3.3 Botnet detection
3.3.1 The rationale
3.3.2 Obtaining temporal-frequent payload features
3.3.3 Standard deviation metric for cluster labeling
3.3.4 K-means detection
3.3.5 Comparison to X-means detection approach
3.4 Concluding remarks
4 Experiments and Results
4.1 Feature selection and application classification
4.1.1 Datasets
4.1.2 Metrics
4.1.3 Experimental procedure
4.1.4 Subset evaluation results
4.1.5 Gain ratio results
4.1.6 Classification results for final subset of 9 selected features
4.1.7 Comparing 9, 128 and 256 features
4.2 Validating standard deviation metric
4.2.1 Validation on frequency vectors
4.2.2 Validation on temporal-frequency vectors
4.3 Botnet detection
4.3.1 Description of clustering datasets
4.3.2 Metrics to evaluate clustering performance
4.3.3 K-means detection
4.3.4 Unmerged X-means detection
4.3.5 Merged X-means detection
4.3.6 Comparison of the three detection approaches
4.4 Concluding remarks
5 Conclusions and future works
5.1 Conclusion
5.2 Future Work
Bibliography
Appendix A: Gain Ratio values for bottom 15 selected features
Appendix B: Time results for comparing 9, 128 and 256 features

Curriculum Vitae

List of Tables

4.1 - Description of the datasets
4.2 - CfsSubsetEval classification results on traindata1
4.3 - CfsSubsetEval classification results on traindata2
4.4 - Top 5 gain ratio selected features
4.5 - GainRatioAttributeEval classification results on traindata1
4.6 - GainRatioAttributeEval classification results on traindata2
4.7 - Unicode index values and the corresponding Java printed characters
4.8 - Classification results using the final selected subset of 9 features
4.9 - Example to approximate time for 1 feature
4.10 - Compare classification accuracy for 9, 128 and 256 features
4.11 - Line of best fit constants for normal versus botnet IRC datasets
4.12 - Validation using 128 versus 256 temporal-frequent features
4.13 - IRC botnet detection datasets
4.14 - Clustering data models
4.15 - K-means cluster statistics
4.16 - K-means performance results
4.17 - X-means original number of clusters
4.18 - Unmerged X-means cluster statistics
4.19 - Unmerged X-means performance results
4.20 - Merged X-means cluster statistics
4.21 - Merged X-means performance results

List of Figures

2.1 - An example of n-gram sliding window
3.1 - Framework of proposed solution
3.2 - Average character frequencies for IRC packets versus non-IRC packets
3.3 - Algorithm to extract 1-gram frequency features from a packet payload
3.4 - Algorithm for ID3 decision tree for boolean-valued functions
3.5 - Illustration of dimension reduction in training and testing dataset
3.6 - Illustration of C4.5 decision tree using 1-gram frequency features
3.7 - High level view of clustering IRC and non-IRC traffic
3.8 - Algorithm for K-means clustering
3.9 - Algorithm for IRC botnet detection using K-means
3.10 - Algorithm for IRC botnet detection using unmerged X-means
3.11 - Algorithm for IRC botnet detection using merged X-means
4.1 - J48 versus SVM false positive rates
4.2 - Comparing times for 9, 128 and 256 features on J48
4.3 - Compare FPR for 9, 128 and 256 features
4.4 - Normalized standard deviation graph for normal versus botnet IRC packets
4.5 - Normalized standard deviation graphs on scale [-0.6, 0.6]
4.6 - Line of best fit graphs for normal versus botnet IRC datasets

List of Symbols, Nomenclature or Abbreviations

f_j - Average frequency of the Unicode character with integer value j (it is also at index position j).
F_j - Frequency of the Unicode character at the jth index position.
σ_j - Standard deviation of the Unicode character with integer value j (it is also at index position j). Can also refer to the standard deviation of the cluster labeled j.
p_ij - Reduced average frequency of the Unicode character at the jth position in the ith minute interval.
C_b - Cluster labeled b, where there are k clusters and 1 <= b <= k.
c_b - Center of the cluster labeled b, where there are k clusters and 1 <= b <= k.

Chapter 1

Introduction

1.1 Introduction

Computers now support most of the tasks that people perform daily, which makes information security highly important. Botnet attackers take advantage of Internet users through spam emails that lure them into opening malicious attachments or visiting malicious websites on enticing subjects such as holidays and good deals on expensive merchandise. As a result, many computers on the Internet are infected, and most owners have no clue that important information is being stolen from their machines or that their machines are being remotely controlled as part of massive attacks. As security professionals keep fighting botnets, attackers keep improving and changing tactics to avoid detection, for example by using encryption or a decentralized control architecture that has no central point of failure. This has led to growing research in botnet tracking, detection, and mitigation over the past few years. This thesis presents a content-based Internet Relay Chat (IRC) botnet detection system that first classifies traffic into IRC and non-IRC groups and then detects botnet IRC packets within the IRC community. IRC is a text-based protocol that allows users running IRC clients to connect to groups (called channels) on an IRC server and communicate in real time. IRC uses the Transmission Control Protocol (TCP) as its transport layer protocol; therefore, during our traffic classification the IRC application group (or community) refers to all IRC packets, and the non-IRC group refers to all TCP packets belonging to applications other than IRC, e.g. HTTP and FTP packets. In addition to separating botnet IRC packets from normal IRC, this thesis aims to model different characteristics of IRC traffic, such as evaluating whether all the payload Unicode characters are necessary for IRC botnet detection.

1.2 Summary of thesis contributions

Key contributions made by this thesis are as follows:

Using n-gram analysis with n=1 to calculate the frequency of each of the first 256 Unicode characters in a packet's payload, this thesis found a reduced-dimension subset (of size 9) of the 256 Unicode payload frequency features that can be used for supervised classification of traffic into IRC and non-IRC application communities with high classification accuracy and low false positive rates. A previous study used all 256 frequent payload features for traffic classification. However, to the best of our knowledge, this thesis is the first study to treat 256 as a high dimension and to seek a smaller subset of these 256 features that, given any TCP network traffic dataset, can distinguish the different application communities it contains. Using IRC versus non-IRC classification as an example study, our selected subset of features takes less time to build a classification model and has classification accuracy comparable to using 128 or 256 features.
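The 1-gram (n = 1) feature extraction underlying this contribution reduces to counting byte values. The sketch below is an illustrative reimplementation, not the thesis code; the function name and the example payload are made up. A feature-selection step would then keep only the most informative of these 256 positions (the thesis keeps 9).

```python
def one_gram_features(payload: bytes, n_features: int = 256) -> list:
    """Relative frequency of each of the first 256 Unicode (byte) values
    in a packet payload -- n-gram analysis with n = 1."""
    counts = [0] * n_features
    for b in payload:
        if b < n_features:          # iterating bytes yields ints 0..255
            counts[b] += 1
    total = len(payload)
    return [c / total if total else 0.0 for c in counts]

# Hypothetical payload: one IRC PRIVMSG line.
vec = one_gram_features(b"PRIVMSG #chan :hello")
```

Each packet thus becomes a 256-dimensional frequency vector summing to 1, which is what the supervised classifier (C4.5 in the thesis) consumes.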

The necessity of using only the first 128 Unicode characters of the packet payload to separate botnet IRC from normal IRC, versus all of the first 256 Unicode characters, is evaluated, and it was found that IRC traffic is dominated by the first 128 Unicode characters. Although the aim here is not to claim that using only the first 128 features detects IRC botnets better than using all 256, results from two experiments in this thesis have led us to conclude that when analyzing or studying only IRC traffic, it is sufficient to use only the first 128 Unicode characters of the packet payload. Firstly, the average frequency of each of the first 256 Unicode characters was calculated and compared for several IRC and non-IRC datasets, and we found that for most IRC datasets the average frequency of most Unicode characters beyond 128 was 0. Secondly, the mean of means and the mean of mean standard deviations for individual Unicode characters computed over 256 characters were all about half the values obtained over only the first 128 Unicode characters on the same IRC datasets. This may imply that using 256 features simply appends 0's after the first 128 characters, forcing the values computed over all 256 characters to be half those computed over the first 128.
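The halving effect described above follows directly from the arithmetic: appending 128 zero-valued features doubles the denominator of the mean without changing the numerator. A small illustration with made-up frequency values:

```python
# Hypothetical 1-gram frequencies for an IRC packet: only characters
# below 128 ever occur, so positions 128..255 are all zero.
freqs_128 = [0.02] * 50 + [0.0] * 78   # 128 features
freqs_256 = freqs_128 + [0.0] * 128    # same packet, 256 features

mean_128 = sum(freqs_128) / len(freqs_128)
mean_256 = sum(freqs_256) / len(freqs_256)

# Appending zeros doubles the denominator but not the numerator,
# so the 256-feature mean is exactly half the 128-feature mean.
assert mean_256 == mean_128 / 2
```

The same argument applies to the mean of mean standard deviations reported in the experiment.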

In this thesis, experiments are conducted to compare the standard deviation of the payload frequency features of individual botnet IRC packets to that of normal IRC packets. The results show that the property reported in a previous study by Lu and Ghorbani [1], that the standard deviation of botnet IRC flows is lower than that of normal IRC flows, also holds when dealing with individual packets.

N-gram analysis with n=1 and a time interval of 1 minute is used to group individual packet payload features into a temporal-frequent distribution, which is then clustered using the K-means unsupervised clustering algorithm. The minimum standard deviation approach is applied to label the botnet cluster. When clustering is applied to separate normal from intrusive traffic, the aim is usually to end up with two clusters labeled normal and intrusive. This thesis compares fixing the number of clusters to two versus dynamically finding the true number of clusters in the dataset and then deriving the normal and botnet clusters from the list of returned clusters. K-means represents the fixed-number-of-clusters approach, and X-means is used to determine the true number of clusters in the dataset. K-means was selected because it is approximately linear (approximately O(N), where N is the number of instances), making it suitable for clustering large datasets, and X-means is based on K-means; together this makes our proposed system more scalable. K-means and X-means are also commonly used clustering algorithms, which increases the chance of finding well-established machine learning software packages implementing them for use during the implementation stage.
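The detection step above can be sketched as follows. This is a toy reimplementation under stated assumptions, not the thesis code: the K-means initialization is deterministic for simplicity, and the two-dimensional "temporal-frequent" vectors are fabricated so that the bot cluster is near-constant while the normal cluster varies.

```python
import statistics

def kmeans(points, k=2, iters=20):
    """Plain K-means; initial centers are spread across the input order
    (a simplification for this sketch -- real K-means seeds randomly)."""
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            j = min(range(k),
                    key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters

def botnet_label(clusters):
    """Index of the cluster with minimum average per-dimension standard
    deviation: synchronized bot traffic varies less than human chat."""
    def avg_std(cl):
        return statistics.mean(statistics.pstdev(col) for col in zip(*cl))
    return min(range(len(clusters)), key=lambda j: avg_std(clusters[j]))

# Hypothetical temporal-frequent vectors: near-identical for bots, varied for humans.
bot = [[0.30 + 0.001 * i, 0.10] for i in range(10)]
normal = [[0.70 + 0.02 * i, 0.50 + 0.03 * i] for i in range(10)]
clusters = kmeans(bot + normal, k=2)
label = botnet_label(clusters)   # index of the low-variance (bot) cluster
```

The X-means variants differ only in how the cluster count k is chosen; the minimum standard deviation labeling step is the same.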

1.3 Thesis organization

This thesis is organized as follows:

Chapter 1 briefly introduces network security, botnet detection and security problems posed by botnets to the Internet today. The main contributions of this thesis are also discussed.

Chapter 2 gives background information necessary to understand the proposed framework solution. Some examples of botnet attacks, different botnet communication control structures and different clustering techniques are discussed. Related literature on botnet detection techniques and related research using n-gram analysis for network traffic application classification and intrusion detection and their limitations are reviewed.

In Chapter 3, the proposed solution to the problems and possible areas of improvement identified in Chapter 2 is presented in the form of a hierarchical framework. The rationale for the traffic classification and botnet detection used in our framework is discussed. The feature selection process and traffic classification using the C4.5 decision tree are then discussed, with a support vector machine classifier used as a comparison supervised classifier. Botnet detection using K-means is discussed, followed by unmerged and merged X-means botnet detection as comparison approaches.

Chapter 4 describes the datasets, evaluation metrics, experimental procedures and results. Experimental traffic application classification results for applying feature selection methods to find a subset of informative features are discussed, followed by classification results using the final subset of selected features. The performance of this selected subset of features is then compared to classification using the first 128 and 256 frequent payload features. The standard deviation cluster labeling metric on packets and effectiveness of using 128 versus 256 features for IRC botnet detection are validated.

Botnet detection results using K-means, unmerged X-means and lastly merged X-means are discussed, and these three approaches are compared at the end in terms of IRC botnet detection performance.

Chapter 5 concludes the thesis by discussing its contributions, limitations and possible future work to improve or extend the work done in this thesis.

Chapter 2

Background information and literature review

2.1 What is a bot and a botnet?

A botnet is a group of compromised computers remotely controlled by one attacker, or a small group of attackers working together, called a "botmaster". A bot, also known as a zombie or drone, can refer either to a computer that has been compromised and is part of the botnet, or to the malicious program used to compromise new machines into the botnet. The botmaster's ability to carry out an attack from hundreds or more computers means increased bandwidth, increased processing power, increased storage, and a large number of attack sources, making botnet attacks more malicious and difficult to detect and defend against.

Botnet attacks include distributed denial of service (DDoS) attacks, sending spam emails, click fraud and stealing information. Bots steal information in different ways, such as by sending phishing spam emails [2] or capturing sensitive information with keyloggers (e.g. Agobot and Spybot [2]). Some bots can also steal CD keys from victim machines (e.g. Agobot [2]), retrieve information such as CPU uptime and the IP address of the infected machine (e.g. SDBot [2]), search for and download files from the victim machine to the botmaster using expressions (e.g. Reverb [2]), and others can even delete files on the victim machine (e.g. Spybot [2]).

2.2 Botnet communication control topologies

Unlike viruses, worms and other malware that work as individual entities, a botnet needs a communication architecture that the botmaster(s) can use to send out commands, receive responses from bots, and perform all other botnet management tasks. This control architecture can be classified according to two criteria. Firstly, it can be classified by the underlying protocol into IRC-based, HTTP-based and P2P-based. The other classification is based on communication topology and groups control architectures into centralized C&C and P2P botnets, which are discussed in detail below.

2.2.1 Centralized C&C

In a centralized command and control structure there is a central location, called a botnet C&C server, where all communication between bots and the botmaster takes place. Gu et al. in [3] categorize centralized C&C into push-style, where bots stay connected to a C&C server channel and commands are downloaded to bots in real time (e.g. IRC-based), and pull-style, where it is the bots' responsibility to regularly connect to the C&C server and fetch commands (e.g. HTTP-based).

The Internet Relay Chat (IRC) protocol is used to manage chat sessions and is described in RFC 1459, with the latest update in RFC 2813 [4]. In IRC-based C&C, the botmaster creates a channel on the C&C server to post commands on, and the bots of this botnet must subscribe to this channel in order to access the posted commands. On secured IRC servers, bots must first provide a connection password. Bots join an IRC channel using a unique nickname; the nickname is authenticated and, if accepted, the bot provides more details such as a hostname using the USER command in order to be registered.

A bot's nickname may be rejected if it does not follow the botnet's nickname pattern, if it is suspected of being a spy, or to avoid overloading the IRC server [5]. IRC messages are then exchanged as channel "TOPIC" messages or using the "PRIVMSG" or "NOTICE" commands. The bot stays connected to the IRC channel until it chooses to leave the channel with the PART command or completely closes its connection to the IRC server with QUIT, or until the botmaster forcibly kicks it out of the channel with the KICK command or completely closes the connection between the bot client and the IRC server with the KILL command. Therefore, allowing for at least one attempt to connect to an IRC server channel, denoted by resending the NICK and USER commands, IRC communication can proceed as follows:

PASS* -> NICK+ -> USER+ -> JOIN -> TOPIC -> PRIVMSG | NOTICE -> ... -> PRIVMSG | NOTICE -> PART | QUIT | KICK | KILL
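This command sequence can be checked mechanically. As an illustrative sketch (not part of the thesis), the sequence maps onto a regular expression over a space-separated trace of command verbs; the trace format is an assumption made for this example.

```python
import re

# PASS is optional, NICK/USER may be retried, then channel traffic
# (PRIVMSG/NOTICE) until the session ends with PART/QUIT/KICK/KILL.
SESSION = re.compile(
    r"^(PASS )*(NICK )+(USER )+JOIN TOPIC "
    r"((PRIVMSG|NOTICE) )+(PART|QUIT|KICK|KILL)$"
)

ok = SESSION.match("NICK USER JOIN TOPIC PRIVMSG PRIVMSG NOTICE QUIT")
bad = SESSION.match("JOIN TOPIC PRIVMSG QUIT")  # no registration -> no match
```

A detector built on this idea would first have to reassemble the per-connection command stream from TCP packets, which is where the payload analysis of later chapters comes in.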

Examples of IRC-based bots are Phatbot, Spybot, Sdbot, and GTBot. New machines are recruited into the IRC botnet as follows: an already infected bot uses different reconnaissance strategies, such as port scanning, to identify vulnerabilities in specified target machines; if vulnerable, the machine is made to connect to an FTP or HTTP server from which it downloads the bot binary. The bot binary usually installs itself on the newly infected machine so that it always starts automatically on system reboot [6]. The newly infected machine then attempts to connect to the IRC server, either directly with a provided IP address or, more likely, after a DNS lookup for the IRC server hostname.

The new bot authenticates itself to start a session with the IRC server, and must also authenticate itself to be allowed into an IRC channel; the botmaster, too, must authenticate before issuing commands. Bot authentication helps block out spies, while botmaster authentication protects the botnet from being taken over by other botmasters [6]. More information on IRC-based botnets can be found in [6].

In HTTP-based C&C the botmaster posts commands in a file, and bots periodically connect to the C&C server using a URL to check for commands. Therefore, in HTTP-based C&C the botmaster does not have real-time control over its bots' actions [3]. Examples of HTTP-based bots include Bobax.

2.2.2 Peer to Peer (P2P) botnets

The major limitation of a centralized C&C is that the C&C server is a single point of failure: if it is located, the entire botnet can be destroyed by shutting it down. Botmasters tend to regularly migrate their botnets to different C&C servers to avoid detection [7], but some have started moving to a distributed architecture with no central point of failure, e.g. P2P-based. In a P2P network any host can be both a client and a server at the same time; therefore, the botnet is robust to the removal of a few hosts, and it is difficult to destroy the whole botnet by shutting down a central server [8, 9]. The most common and currently growing P2P bot is the Storm worm, which has aliases including Peacomm, Nuwar and Zhelatin [9]. Other P2P-based bot examples are Sinit, Nugache, Slapper and Phatbot.

Storm worm spreads using social engineering techniques, mainly emails with tempting subjects of public interest such as politics and public holidays [10]. These emails either contain malicious attachments or link to malicious websites. According to Holz et al. in [9], after the user opens the malicious attachment or connects to the malicious server, a Storm binary is installed to infect the new machine. The Storm binary uses a rootkit to avoid detection, and a configuration file is stored containing hash values and the IP/port combinations of a list of peers to connect to after installation. The binary computes a global identifier and stores it. To join the network, it searches for and connects to some peers in its initial list. After contacting these peers, the newly infected bot computes, from content published in advance by the botmaster at these peers, an IP address and TCP port number combination with which to contact the botmaster. Communication then takes place between the new bot and the botmaster. After completing the TCP handshake and successful authentication, the botmaster sends commands to the bot over a zlib-encoded communication channel. The botmaster's commands observed in [9] instruct bots to send out spam emails or start DDoS attacks. Details of how the Storm worm botnet and P2P-based botnets in general work can be found in [9].

2.3 Literature on botnet detection techniques

Recent years have seen great interest in botnet detection techniques. Although most existing techniques are only applicable to centralized C&C botnets, some have the potential to be extended to other C&C types, and a few are designed to detect any type of botnet. Lu and Ghorbani in [1] classified botnet detection techniques into honeypot-based, passive anomaly analysis based, and traffic application classification based. In this section we discuss the different botnet detection techniques in the related literature.

Botnet detection techniques based on traffic application classification, such as in [1, 11, 12], are usually guided by a botnet C&C control protocol; e.g. if one is only interested in IRC-based botnets, then traffic is classified into IRC and non-IRC groups. The general procedure for traffic-application-classification-based botnet detection is as follows:

a) Pre-process traffic to reduce the size of the data, e.g. by applying filters.

b) Classify traffic into broad traffic application communities of interest, e.g. Web and Chat, or into more specific applications, e.g. HTTPWeb, BitTorrent and IRC, which enables selecting the traffic for the C&C protocol of interest.

c) Separate botnet traffic from normal traffic within the identified group(s).

Strayer et al. in [11] used statistical flow characteristics and supervised classifiers to classify traffic into IRC and non-IRC groups. In [11], average bytes per packet was found to be the most informative feature. Once IRC traffic is identified, flows that were active at the same time are correlated. The last stage detects a malicious botnet by finding a common IP address endpoint and any evidence of communication between the botmaster and the C&C server.

Lu and Ghorbani in [1] used payload signatures to classify traffic into different application communities. For traffic classified as unknown, about 40% of the total [1], a cross-association algorithm was applied to map the unknown traffic into already known communities. Botnet IRC was detected by applying the K-means clustering algorithm to temporal-frequent characteristics of the 256 ASCII bytes of the flow payloads; the number of clusters was set to 2, and the cluster with the smaller average standard deviation was labelled the botnet cluster. In a similar approach, Lu et al. in [12] labelled unknown flows by applying a C4.5 decision tree to temporal-frequent characteristics of the 256 ASCII payload bytes. Malicious botnets within the IRC community were detected using agglomerative hierarchical clustering.

Bots connect to an IRC server channel using unique nicknames that must share some common part indicating they are true members of the botnet. The botnet detection techniques in [13, 14] exploit this characteristic. Wang et al. in [13] computed the similarity between pairs of IRC nicknames in an IRC channel to determine the overall nickname similarity in the channel. Since hosts connect to an IRC server channel using unique nicknames, and botnet IRC nicknames must share a common part to be accepted into the IRC server channel, a botnet IRC channel is likely to contain more similar nicknames than a normal channel [13]. Similarity between two IRC nicknames was calculated using Euclidean distance in a four-dimensional space. After a predefined number of nicknames was reached, the channel distance was computed as the mean of the nickname similarities, and [13] concluded that botnet IRC channel distance is much smaller.
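As an illustrative sketch of this idea: the four nickname features below (length, digit count, uppercase count, non-alphanumeric count) are assumptions made for this example; [13] does not necessarily use these exact dimensions, and the nicknames are invented.

```python
import math

def nick_features(nick: str):
    """Map a nickname to a 4-dimensional feature vector (assumed features)."""
    return (
        len(nick),
        sum(c.isdigit() for c in nick),
        sum(c.isupper() for c in nick),
        sum(not c.isalnum() for c in nick),
    )

def channel_distance(nicks):
    """Mean Euclidean distance over all pairs of nicknames in a channel."""
    pts = [nick_features(n) for n in nicks]
    pairs = [(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Template-generated bot nicknames are near-identical; human nicknames vary.
bots = ["bot|0412", "bot|0413", "bot|0421", "bot|0509"]
humans = ["alice", "Neo_42", "xX_dragon_Xx", "kp"]
```

With these inputs, the bot channel's mean pairwise distance is far smaller than the human channel's, which is the separation [13] relies on.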

In a similar technique, Rishi, an application implemented by Goebel and Holz in [14], monitors TCP traffic for suspicious IRC nicknames. Rishi uses n-gram analysis and a scoring function. TCP packets containing common IRC keywords (with NICK the most desirable) are collected and packet details (e.g. port numbers) extracted. The packet's nickname is passed through a scoring function, and a nickname with a score higher than a set threshold triggers an alarm. Both studies in [13, 14] aim to detect botnets at an early stage, but [14] requires pre-knowledge of the bot nickname signatures. These nickname-based techniques are limited to detecting only the IRC-based C&C protocol.

There are several techniques that detect the presence of botnets on victim machines, mainly by examining spam emails, as in [15, 16, 17]. In a study by Sroufe et al. [15], the shapes of spam emails were analyzed, because botnet spam uses templates to bypass spam filters [15]. The skeleton of a spam email was obtained from its HTML code, and the email's shape was obtained from the skeleton. This shape was then fed into a classifier that identifies the spam email's botnet based on the minimum Hellinger distance between the extracted email shape and stored botnet spam email shape signatures. Similarly, in a study by Brodsky and Brodsky [16], for each received spam e-mail the source IP address was extracted and queried in a distributed database that kept a score of how many e-mails each source IP sent in a fixed time period. An IP address with a high score that does not appear in the recipient's contact lists was considered botnet spam.

Neither [15] nor [16] requires examining what is written in the spam email, but [16] may not always be effective, as botnets can send out only a few spam emails in short time intervals to look legitimate; the possibility of this behavior is noted in [17].

Botgraph is a graph-based system implemented by Zhao et al. in [17] that detects botnet spamming caused by web account abuse attacks. Since bot web email accounts need to be accessed from bot hosts, botnet spam is likely to show more shared IP addresses than spam from normal email accounts [17]. Botgraph constructs large user-user graphs to correlate user login activities in terms of the IP addresses of spam emails, and the authors found that subgraphs for bot users are more tightly connected than those for normal users.

In anomaly-based botnet detection, Gu et al. in [3] implemented a system called BotSniffer that detects only centralized C&C botnets using the IRC and HTTP C&C protocols. It uses spatial-temporal correlation based on the characteristic that bots respond with similar timing and correlated activities, and connect to the same C&C server. BotSniffer classifies a bot response as a "message" when the bot replies to the C&C server, or as an "activity" when the bot performs malicious actions against victim hosts (e.g. sending out spam emails). The approach searches for handshake keywords to identify IRC and HTTP clients; the identified clients are then monitored for any message or activity responses. Binkley and Singh in [18] modeled IRC channels as IRC meshes and applied SYN scanning detection to find the hosts performing scanning in each channel.

There are also log-correlation botnet detection methods, which analyze traffic log files from multiple hosts to find any correlations between the activities, applications and processes on the hosts. Masud et al.'s approach in [19] was based on the assumption that bot hosts respond to received commands faster than normal hosts, and applied temporal correlation to multiple log files to detect IRC botnet flows between hosts. Different packet features that can differentiate the response times of bots from those of normal hosts were extracted and aggregated into flow features, which were then classified using five different classifiers.

Other botnet detection techniques are presented in [5, 20]. Villamarin-Salomon and Brustoloni in [20] applied a Bayesian method to DNS traffic to determine whether, knowing a set of bot hosts whose C&C server is blacklisted, they could detect other hosts in the same botnet whose C&C server is not yet blacklisted, and thus find other domain name aliases for the botnet's C&C server. This method requires knowing at least one bot in the botnet; therefore, it may not always work well on its own, because it may require the assistance of other methods to detect the first one (or more) bots belonging to a new botnet. In another method, Kugisaki et al. in [5] identified differences in the communication patterns of IRC bot clients and normal IRC clients (e.g. bots may repeatedly try different nicknames if the IRC server suspects them).

Most of the botnet detection techniques mentioned above are limited to detecting botnets based on a particular C&C protocol (e.g. IRC-based) or structure. The botnet detection techniques presented in [7, 21, 22] are independent of C&C protocol or structure, hence they have the potential to detect any botnet. Gu et al. in [21] designed a system called "Botminer" that clusters traffic to identify hosts with common normal and abnormal communication patterns and activities. Once a host is identified with a suspicious activity, a score function is kept for it, and during the correlation stage hosts with score functions below a set detection threshold are discarded. The communication and activity clusters are correlated for the remaining hosts to detect any abnormal patterns that may suggest they belong to a botnet. The Botminer system has many desirable features, but if the activity carried out by the botnet is not yet a known suspicious activity (e.g. is not spamming), this approach may fail to detect the botnet.

Different from [21], Choi et al. in [7] present an anomaly-based approach that uses the IP headers of DNS traffic. The method is based on several characteristics of botnets that differentiate normal DNS traffic from botnet DNS traffic; for example, it is common practice for the botmaster to frequently migrate the C&C server to avoid detection, and during this migration all bots migrate at the same time, hence there will be some DNS group activity [7]. Another C&C-independent technique is presented by Villamarin-Salomon and Brustoloni in [22], which looks for domain names with abnormally increased DDNS traffic as a sign of botnet migrations. A large number of hosts accessing a server whose domain does not exist is interpreted as a sign that bots are trying to access a C&C server that has been shut down or has moved to a different domain name.

2.4 Clustering

Clustering is a form of unsupervised learning that takes a set of data objects and tries to group them such that objects in one cluster are more similar to each other than to objects in a different cluster. It does not require class labels. In this section several clustering algorithm categories are discussed, with emphasis on partition-based clustering because this thesis uses K-means and X-means in the proposed solution.

2.4.1 Hierarchical clustering

Hierarchical clustering algorithms group data into a binary tree-like structure called a dendrogram. The root of the dendrogram represents the whole dataset, while the leaf nodes are individual data points in the dataset. Agglomerative hierarchical clustering takes a dataset of N data points and creates N clusters, each with exactly one element. It repeatedly merges the two closest clusters until all the data points are merged into a single cluster. In contrast, divisive hierarchical clustering puts all N data points into one cluster and repeatedly splits it until each data point belongs to its own cluster.
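As an illustration of the agglomerative case, the following is a minimal single-linkage sketch in pure Python (squared Euclidean distance, stopping at a target cluster count; an illustrative example, not code from this thesis):

```python
def agglomerative(points, target_clusters=1):
    """Agglomerative clustering: start with one singleton cluster per point
    and repeatedly merge the two closest clusters (single linkage)."""
    clusters = [[p] for p in points]            # N singleton clusters
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest point-to-point distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)          # merge cluster j into cluster i
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
two = agglomerative(points, target_clusters=2)  # two well-separated groups
```

Running the merges to completion (target_clusters=1) and recording each merge would yield the dendrogram described above.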

2.4.2 Density-based clustering

These methods are based on the density distribution of data points: data points within a cluster have a higher density than those outside the cluster [23]. The Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is an example of density-based clustering that generates clusters of arbitrary shapes, isolates noise or outliers, and is efficient for large spatial databases [24]. More information on DBSCAN can be found in [24].

2.4.3 Partition-based methods

Clustering algorithms in this category try to assign N data points to k clusters while optimizing a criterion function. Partition-based methods have two important properties, namely a membership function and a weight function, as described by Hamerly et al. in [25]. A membership function determines what proportion of a data point belongs to a cluster; in hard-partitioning methods (e.g. K-means and X-means) a data point x either is close to cluster center c_i, and hence fully belongs to cluster i, or it does not belong to this cluster at all, i.e. the membership function is 0 or 1. Soft-partitioning methods such as Fuzzy c-means allow an object to belong to anywhere from 1 to k clusters, with membership values ranging from 0 to 1. The weight function determines how much influence a data point has in calculating the centers for the next iteration; K-means assigns a weight of 1 to all data points, meaning they are treated equally.

2.4.3.1 K-means clustering

K-means is the most widely used clustering algorithm due to its simplicity of implementation. K-means aims to create well-separated clusters by minimizing the sum of squared errors. To cluster N data objects into k clusters, choose k initial cluster centers, called centroids, and assign each data point to the cluster with the closest center. Cluster centers are commonly chosen randomly from the dataset, or can be predefined. Once all data points are assigned, update each centroid to be the mean value of its members, and re-assign the data points to clusters. Repeat the update and re-assign steps until there is no more change in centroid values or a set maximum number of iterations is reached. The above steps describe batch-mode K-means, but there is also an online mode that updates centroids after assigning each point [23].
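The batch-mode steps above can be sketched as follows (a minimal pure-Python illustration with squared Euclidean distance and seeded random initialization; not the implementation used in this thesis):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Batch-mode K-means: assign all points, then update all centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # initial centroids drawn from the data
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: each non-empty centroid becomes the mean of its members.
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 8.1)]
centroids, clusters = kmeans(points, k=2)       # separates the two point groups
```

An online-mode variant would instead move the winning centroid a small step toward each point as it is assigned.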

K-means has several limitations: it does not always converge to a global optimum [23, 26], the user is required to know the number of clusters in the dataset in advance, which is rarely easy in practice [23], some of the final clusters may be empty [26], and all data points take part equally in calculating the centroids, making K-means less robust to outliers and noise [23]. Several algorithms have been proposed over the years to solve some of K-means' problems; the next section discusses the X-means clustering algorithm, which is based on K-means and dynamically finds the number of clusters.

2.4.3.2 X-means clustering

X-means is a hard-partition clustering algorithm based on K-means. It does not require the user to supply a fixed number of clusters, and it is much faster than K-means [27]. X-means requires the user to specify the range in which the true number of clusters in the dataset is likely to fall, and it returns a number within this range that scored best using the Bayesian Information Criterion (BIC) [27]. After each run of K-means, X-means determines which subset of centroids should be split in two and runs K-means to completion on the split centroids; if the results are better, the split is accepted. The K-means runs and centroid splitting continue until the number of clusters reaches the upper bound of the user-specified range, and the model with the best score is then reported. More information on X-means can be found in [27].

2.4.4 Related research using clustering algorithms in intrusion detection

Over the years there have been several studies that utilized clustering algorithms for intrusion detection. Erman et al. in [28] compared the K-means, DBSCAN and AutoClass clustering algorithms for classifying network traffic using transport layer statistics. The study reported that AutoClass had the highest accuracy; DBSCAN was the slowest, but it also produced better clusters. In [1, 12] IRC flows were grouped using temporal-frequent analysis, and K-means clustering and agglomerative hierarchical clustering, respectively, were then applied to separate botnet IRC flows from normal IRC flows. In another botnet detection approach, the study in [21] clustered network traffic using X-means to identify common communication patterns and common activities, and in [11] the clustering concept was applied during flow correlation.

In the proposed approach, this thesis uses clustering algorithms to separate botnet IRC packets from normal IRC packets. Hard-partition algorithms are chosen because this thesis assumes a network packet is either a normal packet or an abnormal packet; it cannot be both. As an extension to the studies in [1, 12], which used only a fixed number of clusters, this thesis compares the performance of an algorithm that requires the number of clusters to be known in advance, K-means, with one that dynamically finds the true number of clusters in the dataset within a specified range, X-means.

2.5 N-gram and its application to intrusion detection

An n-gram is a subsequence of n items from a sequence of items. It can be used to compare two or more streams of data, or to look for changes between new incoming data and an existing set of data. To perform n-gram analysis on a stream of bytes, define a fixed window size, say 2 bytes for a 2-gram. Working from left to right, slide the window 1 byte at a time so that 2 bytes are covered each time, as shown in Figure 2.1, and count the number of occurrences of each distinct byte sequence.


Figure 2.1 - An example of n-gram sliding window
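The sliding-window counting just described can be expressed in a few lines (an illustrative Python sketch over a toy byte string):

```python
from collections import Counter

def ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Slide an n-byte window one byte at a time over the data and
    count each distinct byte sequence (n = 2 gives a 2-gram analysis)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

counts = ngram_counts(b"10110", n=2)
# Windows seen, left to right: b"10", b"01", b"11", b"10"
```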

Wang et al. in [29] applied n-gram analysis to build two payload-based anomaly detectors aimed at detecting packets that display normal traffic connection behaviour, and so look legitimate to connection-based detectors, while carrying anomalous payload. The first detector used n = 1 and created profiles of each byte, modeling the occurrence of each of the 256 ASCII characters in the packet payload as a frequency distribution; the second detector was used for n > 1 and applied to sequences of bytes.

Lu and Ghorbani in [1] extended the concept of applying n-grams to payloads into a temporal-frequency analysis. IRC flows were grouped into 1-minute intervals to create a temporal data vector. N-gram analysis (with n = 1) was then applied to each element of the temporal vector to calculate the number of occurrences of each of the 256 ASCII bytes in the flow payloads. K-means unsupervised clustering was applied to the resulting temporal-frequency vector to separate malicious botnet IRC from normal IRC.

Lu et al. in [12] obtained temporal-frequency vectors from IRC flows in a similar fashion, but instead of K-means this system separated botnet IRC flows from normal IRC using agglomerative hierarchical clustering. In the same study [12], the C4.5 supervised classifier was applied to temporal-frequency vectors of the payloads of flows labelled as unknown by the payload-based signature, and these flows were classified into different communities.

All the above studies showed good accuracy and considerably low false alarm or false positive rates. The proposed approach starts from the hypothesis that there must exist a subset of these 256 ASCII payload features with enough informative value to distinguish IRC from non-IRC traffic, such that given any dataset made up of IRC protocol packets and packets from other TCP-based protocols, payload features can be obtained only for this subset and still separate IRC and non-IRC traffic with good classification accuracy and low false positive rates. The methods in [1, 12] were based on flows; this thesis applies n-gram payload features to detect botnet IRC packets among normal IRC packets.

2.6 Concluding remarks

This chapter has provided background information on bots and botnets, n-gram analysis and unsupervised clustering that will be helpful in understanding our proposed solution, to be presented in the next chapter. Also as preparation for Chapter 3, botnet detection techniques in the related literature have been discussed, including the rationale behind the different techniques, their strengths and their limitations. Previous research that used clustering and n-gram analysis in intrusion detection has also been discussed, along with a brief explanation of how we propose to solve the identified problems or extend current studies in our proposed solution. The next chapter, Chapter 3, presents the proposed solution as an IRC botnet detection framework.

Chapter 3

Proposed framework

Chapter 2 introduced background information that will help in understanding the proposed solution presented in this chapter. Section 3.1 presents an overview of the framework. Section 3.2 discusses the feature selection and traffic classification process, including the main idea behind the feature selection and traffic classification approach. In Section 3.3 IRC botnet detection using K-means is explained, followed by unmerged X-means and merged X-means IRC botnet detection.

3.1 Overview of proposed framework

Our proposed solution is a hierarchical, content-based IRC botnet detection framework that uses an informative subset of frequent packet payload features to classify unencrypted traffic into different application communities, and further uses temporal-frequent payload features and unsupervised clustering to separate botnet IRC packets from normal IRC packets within the identified IRC community.

Figure 3.1 shows the framework. Let D_test represent the different 256-dimensional labelled datasets of packet n-gram frequencies used in experiments to find a smaller informative subset of the 256 features, D_classify represent the reduced-dimension frequency vectors of incoming packets to be classified into non-IRC and IRC groups, F_all represent the initial set of all 256 features, and F_subset be the final subset of selected features. The remaining input variables T and k are used at the botnet detection stage; T represents the length of the time interval used to group IRC packets, and k is the number of K-means clusters.

The subset evaluation and gain ratio feature selection methods are performed on all 256 frequent payload features F_all over several datasets D_test. For each dataset, a subset of informative features is obtained, and the features common to all or most of the datasets form the final selected subset of informative features F_subset. The dimension of the datasets used for supervised classification D_classify is reduced accordingly, leaving only the features contained in F_subset. Our proposed framework uses the C4.5 decision tree classifier.

After classification, non-IRC packets are discarded and the IRC packets are grouped into time intervals of length T; an average 1-gram frequency for each of the first 128 Unicode characters is obtained, forming an IRC temporal-frequent data structure. This temporal-frequent data structure is then clustered using K-means.

This thesis also implements a support vector machine supervised classifier as a comparison with the C4.5 decision tree. Using K-means clustering for botnet detection requires the user to fix the number of clusters at two; therefore, X-means clustering is also implemented to compare botnet detection using a fixed number of clusters with obtaining botnet and normal clusters from a dynamic number of clusters.

[Figure 3.1 shows three stages: D_test and F_all are inputs to the feature selection stage, which produces F_subset; D_classify is input to the traffic application classification stage, which discards non-IRC packets and passes IRC packets on; T and k are inputs to the IRC botnet detection stage, which separates botnet IRC from normal IRC.]

Figure 3.1 - Framework of proposed solution

3.2 Traffic classification

3.2.1 The rationale

Classifying traffic into different application groups using port numbers was a very effective approach in the past, but it is no longer reliable for several reasons: applications no longer use default ports, to avoid detection; some traffic hides inside other traffic to bypass filters (e.g. some applications can hide their traffic inside HTTP traffic); and some applications use dynamic ports. A study in [13] showed that only about 18% of the botnet IRC servers detected used the standard IRC port 6667. Traffic can also be classified into different application communities using flow statistics. Content signature-based methods examine content to match predefined application signatures, and these methods may be limited if the communication is encrypted. Using n-gram analysis to obtain the frequencies of the 256 ASCII characters in a packet's payload, together with temporal frequencies, to classify network traffic flows into different application communities proved to be a very effective metric in a previous study [12]. However, in this thesis the hypothesis is that 256 is a high dimension and some of these 256 features may not be necessary for classification. In the learning process, feature selection tries to answer the question "which features are informative or important?" Therefore, combining our hypothesis that 256 is a high dimension with the concept of feature selection leads to the question

"What if within the 256 n-gram frequent payload features Faii, there exists a

smaller subset of informative features Fsubset such that given any unencrypted

TCP-protocol traffic (or dataset), we can classify this network traffic into

different application communities using only Fsubse, and still obtain high

classification accuracy and low false positive rate values close to when

classifying using FaU ?"

Our proposed solution answers this question using the IRC protocol as an example study. A feature selection method that evaluates subsets of features and a gain ratio feature selection method that evaluates the informative value of individual features are applied to several datasets. The final subset of features is then selected as those features common to most of the resulting subsets. These features are the ones used for supervised classification with the C4.5 decision tree. Hence this thesis aims to find a subset of the first 256 Unicode characters that discriminates IRC packets from non-IRC packets with desirably high classification accuracy and low false positive rates.

Figure 3.2 illustrates why this thesis assumes the proposed solution should work. The graphs in Figure 3.2 show the average frequency of each of the first 256 Unicode characters in the payloads of the packets of the dataset represented by each graph. Given a dataset with n packet payload frequency vectors of the form <F_0, F_1, ..., F_255>, where each F_j is the frequency of the Unicode character at index position j, the average frequency for each character is calculated using Equation (3.1). Details of how to obtain 1-gram payload frequencies are discussed in Section 3.2.2.

avg_j = (1/n) * sum_{i=1..n} F_ij    (3.1)

where j is the index of the jth Unicode character (or column number), 0 <= j <= number of features - 1, and n is the number of rows or instances.
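Equation (3.1) amounts to a column-wise mean over the packet frequency vectors; a small illustrative sketch with toy 4-dimensional vectors:

```python
def average_frequencies(freq_vectors):
    """Column-wise mean of n packet frequency vectors, i.e. Equation (3.1):
    avg_j = (1/n) * sum over i of F_ij for each character index j."""
    n = len(freq_vectors)
    return [sum(col) / n for col in zip(*freq_vectors)]

# Three toy 4-dimensional frequency vectors (one row per packet).
vectors = [[4, 0, 2, 0],
           [6, 0, 1, 0],
           [2, 0, 3, 0]]
avg = average_frequencies(vectors)   # [4.0, 0.0, 2.0, 0.0]
```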

For example, according to Figure 3.2, packets in the non-IRC datasets "nonirc1" and "nonirc2" on average have approximately 4 occurrences of the Unicode character at index position 0 (i.e. the Unicode character with integer value 0), while the IRC datasets "honeyirc", "irc2" and "SkypeIRC" have an average of 0 occurrences for the same Unicode character.

[Figure 3.2 contains six panels plotting average frequency against the index of the first 256 Unicode characters: a) nonirc1.pcap (non-IRC), b) nonirc2.pcap (non-IRC), c) honeyirc.pcap (botnet IRC), d) irc1.trace (normal IRC), e) SkypeIRC.pcap (normal IRC), and f) the relationship between all datasets.]

Figure 3.2 - Average character frequencies for IRC packets versus non-IRC packets

Figure 3.2 compares the average character frequencies of packets in the non-IRC, botnet IRC and normal IRC datasets. Observation of these graphs shows that for more than two thirds of the first 256 Unicode characters, non-IRC packets have higher average character frequency values than IRC traffic. In these regions the non-IRC averages range from 4 to 6 while the IRC averages remain in the 0 to 2 range. This may be evidence that non-IRC packets have larger payloads than IRC packets. All datasets show an approximately 0 average for characters from around 125 to 160, and from around 161 to 255 all IRC datasets are dominated by averages close to 0 while non-IRC shows consistently high averages. This shows that most IRC packet payload is dominated by the first 128 Unicode characters rather than the last 128; hence, when doing tests and analysis involving only IRC traffic payload features, it may be reasonable to use only the first 128 Unicode characters.

These differences in average character frequencies between the IRC and non-IRC datasets show that frequent packet payload characteristics, obtained by applying n-gram analysis with n = 1 over the first 256 Unicode characters of each packet payload, can be considered a possible metric for differentiating IRC and non-IRC packets. The graphs also show consistent similarities and differences for certain Unicode characters across both the IRC and non-IRC datasets, and this thesis takes these Unicode characters to be the likely important features that discriminate IRC from non-IRC packets.

3.2.2 Obtaining 1-gram frequent payload features

N-gram analysis with n = 1 is applied to obtain the frequencies of individual Unicode characters. The proposed approach uses n = 1 rather than higher values such as n = 2 because we are interested in the frequencies of individual characters, not the frequencies of sequences of characters. A 1-gram is also preferred over higher values of n because the computational complexity of n-gram analysis increases exponentially with the sliding window size: over X distinct tokens, the space complexity is X^n. Therefore, with a sliding window of size 1 over 256 distinct Unicode characters there are 256^1 distinct possible 1-grams, while using 2-grams gives 256^2 distinct possible 2-grams, and so on. Figure 3.3 shows the algorithm for calculating a multi-dimensional data structure of 1-gram payload frequencies from an individual packet payload byte structure.

Function: getFrequency
Input: TCP packet payload, payload; required dimension, dim (e.g. dim = 128)
Output: 1-gram frequency vector as a dim-dimensional vector of integers

1:  For j = 0 to dim - 1 do    / Initialize the frequency vector FreqVector to all 0's /
2:      FreqVector[j] = 0
3:  End for
    / The loop below extracts all Unicode characters (from bytes) of payload. /
4:  For i = 0 to length of payload - 1 do
5:      Characters[i] = ith character in payload
6:  End for
7:  For i = 0 to length of Characters - 1 do
8:      intValue = integer value of Characters[i]
9:      If 0 <= intValue <= dim - 1 then
10:         FreqVector[intValue] = FreqVector[intValue] + 1    / increment count at this position by 1 /
11:     End if
12: End for
13: Return FreqVector

Figure 3.3 - Algorithm to extract 1-gram frequency features from a packet payload

The algorithm takes a packet payload as a sequence of bytes; a decoding scheme is used to convert each byte to its Unicode character equivalent in lines 4 to 6, and the integer value of each identified character is used to count the number of occurrences of that character in line 10.
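A minimal Python rendering of the algorithm in Figure 3.3 (the decoding scheme, here Latin-1, which maps bytes 0-255 onto the first 256 Unicode characters, is an assumption for illustration):

```python
def get_frequency(payload: bytes, dim: int = 128) -> list:
    """1-gram frequency vector: count occurrences of each character whose
    integer value falls in [0, dim-1], mirroring Figure 3.3."""
    freq = [0] * dim                      # lines 1-3: initialize vector to zeros
    text = payload.decode("latin-1")      # lines 4-6: bytes -> Unicode characters
    for ch in text:                       # lines 7-12: count in-range characters
        value = ord(ch)
        if 0 <= value <= dim - 1:
            freq[value] += 1              # line 10: increment count at this position
    return freq

vec = get_frequency(b"PING :irc.example.net\r\n", dim=128)
```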

3.2.3 Feature selection process

In this section the different feature selection methods used in the process of finding the final subset of features are discussed. The first approach, described in Section 3.2.3.1, evaluates subsets of features and selects the best subset; the second approach, described in Section 3.2.3.2, evaluates the informative value of individual features.

Applying these two methods to x datasets results in 2x subsets, and the features that appear in most subsets are selected to form the final subset of features.

3.2.3.1 Evaluating subsets of features

A correlation-based feature selection method is used for evaluating subsets. The method tries to find subsets of features that are highly correlated with the class but have little correlation with each other. The importance of a feature is determined by how well it predicts the classes of instances not already predicted by other features. The first step is to discretize a copy of the training data by converting continuous values into discrete values. Then, for all features, feature-feature correlations as well as feature-class correlations are calculated using symmetrical uncertainty. This metric reduces the bias of the information gain metric, which favours attributes with many values, and is normalized to [0, 1]. After the correlation values are obtained, the subset of features is generated using a greedy stepwise backward search.

The greedy stepwise backward search starts with the set of all features, say Z, and for each feature i it calculates the subset merit of all features excluding feature i. The candidate subset with the best merit identifies the feature to eliminate, and Z is replaced by Z \ {i}. The process is repeated with the new Z until only one feature is left or the subset merit can no longer be improved. The subset with the highest merit is returned. If all subsets have a merit below a set threshold, it is concluded that no subset with sufficient informative value was found.
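The backward search can be sketched with a generic merit function standing in for the correlation-based subset merit (the merit used below is a toy stand-in for illustration, not the symmetrical-uncertainty merit itself):

```python
def greedy_backward_select(features, merit):
    """Greedy stepwise backward search: starting from all features Z,
    repeatedly drop the feature whose removal yields the best-scoring
    subset, remembering the best subset seen overall."""
    Z = list(features)
    best_subset, best_merit = list(Z), merit(Z)
    while len(Z) > 1:
        # Score every candidate subset Z \ {f}.
        candidates = [[g for g in Z if g != f] for f in Z]
        Z = max(candidates, key=merit)
        m = merit(Z)
        if m > best_merit:
            best_subset, best_merit = list(Z), m
    return best_subset

# Toy merit: features "a" and "b" are informative, the rest are noise.
def toy_merit(subset):
    hits = len({"a", "b"} & set(subset))
    noise = len(set(subset) - {"a", "b"})
    return hits - 0.1 * noise

selected = greedy_backward_select(["a", "b", "c", "d"], toy_merit)
```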

3.2.3.2 Evaluating individual features

The above section evaluated subsets of features; now individual features are evaluated on how well each classifies the dataset. The second feature selection method used is the gain ratio measure, which is mainly used in decision tree learning to select the root node attribute that best classifies the given set of training examples. Gain ratio (calculated using Equation (3.5)) is a better alternative to the information gain measure, which gives more preference to attributes with many values, such as dates and product IDs.

Information gain, Gain(S, A) (see Equation (3.3)), measures the amount of information gained about the target function by knowing attribute A over the set of examples S. Gain ratio uses split information, SplitInformation(S, A) (see Equation (3.4)), which penalizes attributes with many values by increasing as the number of attribute values increases. An attribute with many values gets a larger split information, and dividing that attribute's information gain by this large split information value results in a lower gain ratio.

After the gain ratio values are obtained, the ranker search method is used to rank attributes in descending order of gain ratio and perform selection by discarding features with gain ratio below a set threshold. We provide a parameter n that selects the top n features (e.g. n = 20). If, among the top n selected features, some features have a gain ratio below the threshold, they are considered to be of unsatisfactory informative value and are discarded, and the final returned subset contains fewer than n features.

Entropy(S) = sum_{i=1..c} -p_i * log2(p_i)    (3.2)

Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)    (3.3)

SplitInformation(S, A) = - sum_{i=1..c} (|S_i| / |S|) * log2(|S_i| / |S|)    (3.4)

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)    (3.5)

where S is the set of examples, A is the attribute currently being evaluated, Values(A) is the set of possible values of A, S_v is the subset of examples in S with value v for attribute A, and in Equation (3.4) each S_i is the subset of S for which attribute A takes its ith value. In Equation (3.2), c is set to 2 because this thesis uses two class labels (IRC and non-IRC), and p_i is the proportion of examples with class value i.
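Equations (3.2)-(3.5) can be computed directly for a small labelled dataset (an illustrative sketch for discrete attributes, not code from this thesis):

```python
from math import log2

def entropy(labels):
    """Entropy(S) per Equation (3.2): sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain_ratio(rows, labels, attr):
    """GainRatio(S, A) per Equations (3.3)-(3.5) for a discrete attribute."""
    n = len(rows)
    values = set(r[attr] for r in rows)
    # Partition S into subsets S_v, one per value v of attribute A.
    parts = {v: [labels[i] for i, r in enumerate(rows) if r[attr] == v]
             for v in values}
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy data: attribute "x" perfectly predicts the class.
rows = [{"x": 0}, {"x": 0}, {"x": 1}, {"x": 1}]
labels = ["IRC", "IRC", "non-IRC", "non-IRC"]
ratio = gain_ratio(rows, labels, "x")   # gain = 1.0, split info = 1.0 -> 1.0
```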

3.2.4 Traffic classification using C4.5 algorithm

The C4.5 decision tree is used in this thesis because it is one of the most commonly used supervised classifiers. C4.5 is an extension of the ID3 algorithm. ID3 builds the tree in a greedy, top-down search by finding the attribute best suited to be the root node, based on a high information gain value. After the root node is selected, a descendant node is created for each value of the root node attribute. The training examples are then sorted among the descendant nodes according to their values, and at each descendant node the process is repeated, using the examples sorted to that node, to find its own root attribute. Figure 3.4 shows the ID3 algorithm, adapted from [30].

C4.5 addresses several limitations of ID3: it can handle missing attribute values, use continuous-valued attributes, avoid overfitting the tree, and apply rule post-pruning, and it uses gain ratio, calculated using Equation (3.5), which is a better statistical test than information gain, since information gain tends to favour attributes with many values [30].

After a set of features is selected by the feature selection module, the decision tree is trained with a reduced-dimension dataset containing only the selected features and the class labels. Since the proposed approach classifies packets into IRC and non-IRC groups, IRC is considered the positive class, with label "IRC", and the negative class is labelled "non-IRC". Figure 3.5 illustrates how the selected features are used to pre-process the training and testing datasets for classification.

Function: ID3(Examples, Target_attribute, Attributes)

Input: Set of training examples, Examples; the target attribute to be predicted by the tree, Target_attribute; a list of other attributes that may be tested, Attributes.

Output: A decision tree model that correctly classifies Examples.

1:  Let Examples_vi = the subset of Examples that have value v_i for A; label = null.
2:  Create a Root node for the decision tree
3:  If all Examples are positive then Return the single-node tree Root, with label = +
4:  If all Examples are negative then Return the single-node tree Root, with label = -
5:  If Attributes is empty then Return the single-node tree Root, with label = most common value of Target_attribute in Examples
6:  Otherwise Begin
7:      A = the attribute from Attributes that has the highest information gain for Examples
8:      The decision attribute for Root = A
9:      For each possible value v_i of A do
10:         Add a new tree branch below Root, corresponding to the test A = v_i
11:         If Examples_vi is empty then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
12:         Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
13: End
14: Return Root

Figure 3.4 - Algorithm for ID3 decision tree for boolean-valued functions

A subset of selected features: Unicode characters at index positions 0, 1, 20, 35, 129

Original training instances with all 256 features: <F_0, F_1, ..., F_255, class label>
Original testing instances with all 256 features: <F_0, F_1, ..., F_255, class label>
Reduced-dimension training dataset used for classification: <F_0, F_1, F_20, F_35, F_129, class label>
Reduced-dimension testing dataset of unseen instances: <F_0, F_1, F_20, F_35, F_129, class label>

Figure 3.5 - Illustration of dimension reduction in training and testing dataset

The reduced-dimension training data is what is finally used to train the classifier, and the reduced-dimension unseen instances are used for testing the resulting classification model. In an offline mode the reduced-dimension testing dataset can be created for all testing packets at once; in an online mode, the 1-gram frequency values of the selected features are calculated for one packet, that packet is classified, and the process is repeated for each subsequent packet.
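The dimension reduction of Figure 3.5 is a simple projection onto the selected indices; a sketch (the index positions are the illustrative ones from the figure, and the toy instance is hypothetical):

```python
def reduce_dimension(instance, selected_indices):
    """Keep only the frequency values at the selected feature indices,
    plus the trailing class label, as in Figure 3.5."""
    *freqs, label = instance
    return [freqs[j] for j in selected_indices] + [label]

selected = [0, 1, 20, 35, 129]               # indices from the feature selection stage
full = list(range(256)) + ["IRC"]            # toy instance: F_j = j, then the label
reduced = reduce_dimension(full, selected)   # [0, 1, 20, 35, 129, 'IRC']
```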

Figure 3.6 illustrates how a C4.5 decision tree model built using IRC and non-IRC 1-gram payload frequencies of Unicode characters may look. An incoming packet is classified as non-IRC if it has a frequency value greater than 0 for the Unicode character at index position 47; otherwise the other tree branch is followed until a leaf node with its class label is reached.

Unicode47 <= 0
|  Unicode10 <= 0 : non-IRC
|  Unicode10 > 0
|  |  Unicode61 <= 0 : IRC
|  |  Unicode61 > 0 : non-IRC
Unicode47 > 0 : non-IRC

Figure 3.6 - Illustration of C4.5 decision tree using 1-gram frequency features

3.3 Botnet detection

3.3.1 The rationale

The idea behind the feature selection and traffic classification modules of the proposed solution has been discussed; this section discusses the rationale behind the IRC botnet detection. Several characteristics of (or assumptions about) botnets influence the botnet detection techniques used in the literature. Bots in the same botnet usually receive commands from the botmaster at the same time, and because they are pre-programmed they do not need time to decide how and when to respond; bots are therefore likely to exhibit similar timing [1, 7, 11].

Because bots are pre-programmed by the botmaster, they are likely to respond to the botmaster's commands in a similar way, so botnet traffic content is less diverse [1] and botnet activities are more correlated [11, 21] than normal (human) traffic or malware instances working individually. Botnets also regularly migrate to new C&C servers and hence show observable DNS group activities [7], and botnet traffic is likely to exhibit similar bandwidth patterns [11].

Our proposed IRC-based botnet detection is based on the assumption that bots in the same botnet are pre-programmed by the botmaster, and hence all bots respond to the botmaster's commands with similar content and similar timing. Bots replying with similar content implies that botnet content is less diverse than normal content. Our detection approach therefore groups IRC packets into specified time intervals (e.g. 1 minute) and obtains the average n-gram (with n = 1) frequency of each of the first 128 Unicode characters within each interval, forming a temporal-frequent distribution of the dataset.

The IRC temporal-frequent structure is then clustered using the K-means unsupervised clustering algorithm, and the minimum standard deviation metric is used to label the botnet cluster. Previous studies using n-gram payload frequency features for intrusion detection used all 256 payload features, but observations from Figure 3.2 show that both the normal IRC and botnet IRC datasets are dominated by approximately the first 128 characters. Thus, our proposed approach uses only the first 128 features for botnet detection.

The algorithm in Figure 3.7 provides the entry point into our botnet detection system from a training perspective. The user specifies which clustering approach to use (K-means, unmerged X-means or merged X-means) and is redirected to the algorithm that performs the specified clustering. Line 8 returns the botnet and normal IRC clusters, which are then used for detecting botnet IRC packets among unseen IRC instances.

Function: IRC_Botnet_Detection

Input: n temporal-frequent individual packet payload data structures, F; the cluster labeling approach, choice = {Kmeans | mergedXmeans | unmergedXmeans}; the number of clusters, k, or the range of clusters, minK to maxK

Output: the botnet cluster and the normal cluster

If choice = Kmeans then Botdet_K(F, k)
Else if choice = unmergedXmeans then Botdet_UX(F, minK, maxK)
Else if choice = mergedXmeans then Botdet_MX(F, minK, maxK)
End if
Return botnet cluster

Figure 3.7 - High-level view of clustering IRC traffic into botnet and normal clusters

3.3.2 Obtaining temporal-frequent payload features

Each instance in the botnet detection input vector is an average of packet payload 1-gram frequencies over packets that occurred consecutively within the same time window. Using a 1-minute time window, we retrieve the minute value in each packet timestamp; for example, if 10 packets were captured at 9:05 am, we calculate the mean frequency of each Unicode character over the 10 packets using Equation (3.1) and use the resulting mean vector as the temporal-frequent instance for that minute. Note that our proposed botnet detection approach does not "wrap over" time intervals: if this example packet capture continues past 10:05 am, any packets captured during that minute interval are processed as a new instance, treated independently from the 9:05 am instance even though their minute values are equal. The number of occurrences of each Unicode character in an individual packet payload is calculated using the algorithm already described in Section 3.2.2.
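The per-minute grouping above can be sketched as follows; a minimal illustration assuming packets arrive as (minute, frequency-vector) pairs in capture order, so that consecutive packets with the same minute value form one instance and no wrap-over occurs:

```python
from itertools import groupby

def temporal_frequent(packets, dims=128):
    """packets: list of (minute, freq_vector) pairs in capture order.
    Consecutive packets sharing a minute value form one instance whose
    value is the per-column mean frequency (groupby only groups adjacent
    keys, which is exactly the no-wrap-over behaviour described above)."""
    instances = []
    for _, group in groupby(packets, key=lambda p: p[0]):
        vecs = [v for _, v in group]
        n = len(vecs)
        instances.append([sum(v[j] for v in vecs) / n for j in range(dims)])
    return instances
```

With 2-dimensional toy vectors, a later packet at the same minute value (e.g. an hour later) starts a new instance rather than being merged into the earlier one.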

To reduce the impact of outliers, for each obtained temporal-frequent instance (or row) we calculate the simple mean and standard deviation of all the 128 payload features.

The original instance vector is reduced to the mu - 2*sigma to mu + 2*sigma range by replacing each frequency value x_i in the original vector with a new frequency value x_i_new as follows:

If mu - 2*sigma <= x_i <= mu + 2*sigma then x_i_new = x_i, otherwise x_i_new = 0
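This outlier reduction of a single temporal-frequent instance might look like the following sketch, assuming the population (divide-by-n) standard deviation:

```python
import math

def reduce_outliers(vec):
    """Zero out frequency values outside the mu +/- 2*sigma band of this
    instance, where mu and sigma are computed over the instance itself."""
    n = len(vec)
    mu = sum(vec) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in vec) / n)
    lo, hi = mu - 2 * sigma, mu + 2 * sigma
    return [x if lo <= x <= hi else 0 for x in vec]
```

A single extreme frequency well outside the band is zeroed while typical values pass through unchanged.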

The reduced temporal-frequent vectors are the final vectors used for IRC botnet detection. In our proposed botnet detection approach, the average frequency vector of the first 128 Unicode features of the packet payloads in a 1-minute interval is represented as a 1-element 128-dimensional vector <F^0, F^1, ..., F^127>, where each F^j, 0 <= j <= 127, is the reduced average frequency of the j-th Unicode character.

Therefore, a data structure of n instances is represented as n such 128-feature vectors:

<F_1^0, F_1^1, ..., F_1^127>, <F_2^0, F_2^1, ..., F_2^127>, ..., <F_n^0, F_n^1, ..., F_n^127>

where there are n minute intervals (hence n instances to be used for botnet detection) and each F_i^j, 1 <= i <= n, 0 <= j <= 127, is the reduced average frequency of the j-th Unicode character in the i-th minute interval.

3.3.3 Standard deviation metric for cluster labeling

As already mentioned in Section 3.3.1, our botnet detection approach is based on the assumption that bots from the same botnet are pre-programmed by the botmaster on how and when to respond to received commands. This implies that botnet content is less diverse than normal content; hence we perform unsupervised clustering on the temporal-frequent distribution of IRC packets and label as botnet the cluster with the minimum standard deviation. Using only the first 128 Unicode characters for IRC botnet detection, the standard deviation of a cluster is found by first calculating the individual column standard deviation for each of the 128 Unicode characters using Equation (3.1) and Equation (3.6). This results in a 1-element 128-dimensional vector of standard deviations, which is averaged using Equation (3.7):

sigma_j = sqrt( (1/n) * sum_{i=0}^{n-1} (x_ij - mu_j)^2 )    (3.6)

Average_sigma = ( sum_{j=0}^{dimension-1} sigma_j ) / dimension    (3.7)

where dimension is the number of features, j indexes the Unicode characters (0 <= j <= dimension - 1), n is the number of rows or instances, mu_j is the mean of column j from Equation (3.1), and sigma_j is the standard deviation of the j-th column.
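Equations (3.6) and (3.7) together give the cluster labeling metric: the average of the per-column standard deviations. A minimal sketch:

```python
import math

def cluster_std(cluster):
    """Average of the per-column (population) standard deviations of a
    cluster, per Equations (3.6) and (3.7). cluster: list of n instances,
    each a list of `dimension` frequency values."""
    n = len(cluster)
    dimension = len(cluster[0])
    sigmas = []
    for j in range(dimension):
        mu_j = sum(row[j] for row in cluster) / n
        sigmas.append(math.sqrt(sum((row[j] - mu_j) ** 2 for row in cluster) / n))
    return sum(sigmas) / dimension
```

A tight cluster (identical rows) scores 0, so the less diverse botnet cluster is expected to score lower than the normal cluster.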

3.3.4 K-means detection

The unsupervised clustering algorithm used for IRC botnet detection in our framework is batch-mode K-means with Euclidean distance (see Equation (3.8)). Hard partitioning is used because we believe a packet is either a normal packet or an attack; hence every IRC instance is identified as either botnet or normal.

Euclidean distance d(x, y) = sqrt( sum_i (x_i - y_i)^2 )    (3.8)

where x_i is the i-th column of data point x, and y_i is the i-th column of data point y.

The time complexity of K-means is O(Nkdt), where there are N objects, k clusters, t iterations, and d is the dimension of the dataset [23]. N is usually much larger than the other parameters, making K-means approximately linear, O(N), and hence suitable for clustering large datasets [23].

Figure 3.8 shows steps for K-means clustering given a set of instances and the number of clusters to be returned.

Function Kmeans(F, k) returns k clusters

Inputs: data instances f_i, i = 1, 2, ..., n, and number of clusters k

Initialization:
1: Set th_1 to the maximum number of iterations, th_2 to the minimum center difference, and assign cluster centers c = {c_1, c_2, ..., c_k} for the k clusters
2: do: iter = iter + 1
3: Calculate the distance d(f_i, c_j) between each data instance f_i and each center c_j
4: Assign f_i to the cluster C_j with the closest distance min_j d(f_i, c_j)
5: Update each cluster centroid as the mean of the current members of C_j
6: while iter < th_1 or |centroid_new - centroid_current| > th_2
7: Return the k clusters C_1, C_2, ..., C_k

Figure 3.8 - Algorithm for K-means clustering
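The batch K-means loop of Figure 3.8 can be sketched as follows; `max_iter` and `tol` play the roles of the thresholds th_1 and th_2, and the initial centers are passed in explicitly rather than chosen by the sketch:

```python
import math

def euclidean(x, y):
    """Equation (3.8): Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, centers, max_iter=100, tol=1e-6):
    """Batch K-means: assign each point to its nearest center, recompute
    centroids, and stop after max_iter iterations or when no centroid
    moves by more than tol."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: euclidean(p, centers[j]))
            clusters[nearest].append(p)
        # Empty clusters keep their old center to avoid division by zero.
        new_centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[j]
                       for j, c in enumerate(clusters)]
        moved = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return clusters, centers
```

On two well-separated 1-dimensional groups the loop converges in two iterations to the group means.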

The input to K-means IRC botnet detection is a temporal-frequent distribution of botnet IRC and normal IRC instances <F_1^0, ..., F_1^127>, ..., <F_n^0, ..., F_n^127>, where each F_i^j is the reduced average frequency of Unicode character index j in time interval i, as described in Section 3.3.2. In this thesis we use 1-gram frequency and 1-minute time intervals to group packets. The number of K-means clusters is fixed to 2, corresponding to the normal IRC and botnet IRC clusters. As shown by the algorithm in Figure 3.9, the standard deviations of the two clusters are calculated (line 2) and compared (line 3), and the cluster with the minimum standard deviation is labeled as the botnet cluster. The remaining cluster is then labeled as the normal cluster.

Function Botdet_K(F, k) returns botnet cluster

Inputs: data instances F_i, i = 1, 2, ..., n, and number of clusters k

1: Obtain k clusters with K-means, {C_1, C_2, ..., C_k} = Kmeans(F, k)
2: Calculate the average standard deviation sigma_j for each cluster, where 1 <= j <= k, sigma_j = standard deviation of C_j
3: If sigma_b = min(sigma_1, sigma_2, ..., sigma_k), label C_b as the botnet cluster
4: Return the botnet cluster C_b

Figure 3.9 - Algorithm for IRC botnet detection using K-means
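The labeling step of Figure 3.9 then reduces to picking the cluster with the smallest average per-column standard deviation. A sketch (the `cluster_std` helper repeats the Section 3.3.3 metric so the block is self-contained):

```python
import math

def cluster_std(cluster):
    """Average per-column standard deviation (Equations 3.6 and 3.7)."""
    n, dimension = len(cluster), len(cluster[0])
    total = 0.0
    for j in range(dimension):
        mu = sum(row[j] for row in cluster) / n
        total += math.sqrt(sum((row[j] - mu) ** 2 for row in cluster) / n)
    return total / dimension

def botdet_k(clusters):
    """Label the minimum-standard-deviation cluster as botnet (Figure 3.9)."""
    stds = [cluster_std(c) for c in clusters]
    return clusters[stds.index(min(stds))]
```

The less diverse (tighter) cluster wins, matching the assumption that botnet content varies less than normal content.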

To categorize new unseen instances into normal and botnet IRC, our approach uses ungrouped individual packet frequency vectors rather than temporal-frequent vectors.

The reasons for not grouping unseen instances are as follows:

a) If incoming packets are grouped into time intervals, some packets in that interval can be botnet and others normal; labeling the whole interval vector as botnet (or normal) would therefore be incorrect.

b) Grouping into time intervals would require the system to wait at least 1 minute before making a decision on an incoming packet, introducing delay. Calculating the average character frequencies may also take some time, especially in a high-speed network.

c) Grouping would also increase the storage requirements of the system, as all packets for a time interval would have to be stored temporarily until the whole minute elapses and the average frequency of each of the first 128 Unicode characters is calculated.

We calculate the center of each cluster as the mean vector containing the average frequency value of each Unicode character over all instances in the cluster. The frequency vector of an incoming IRC packet is calculated as described in Figure 3.3, and the Euclidean distance between each cluster center and the unseen packet frequency vector is calculated using Equation (3.8). The instance is then assigned to the category with the minimum distance.
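Categorizing an unseen packet by its distance to the two cluster centers might look like this sketch, which assumes ties go to the botnet cluster:

```python
import math

def classify_packet(freq_vec, botnet_center, normal_center):
    """Assign an unseen per-packet frequency vector to the nearer cluster
    center by Euclidean distance (Equation 3.8); ties go to botnet."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    if dist(freq_vec, botnet_center) <= dist(freq_vec, normal_center):
        return "botnet"
    return "normal"
```

Because single packets are classified individually, no buffering over a time interval is needed, matching reasons a)-c) above.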

3.3.5 Comparison to X-means detection approach

When applying clustering algorithms to intrusion detection, we expect to have two clusters: the abnormal cluster and the normal cluster. This has led most previous studies to fix the number of clusters to two, such as in [1, 12], but this raises the question: what if there are more than two groupings in the data? This section presents the unmerged and merged X-means botnet detection approaches, which do not fix the number of clusters to two but aim to label the botnet cluster and normal cluster among the several returned clusters. X-means requires the user to specify the range in which the true number of clusters might fall, and the algorithm returns the number in that range that scored best.

The input to our X-means IRC botnet detection is a labeled dataset. X-means itself performs normal unsupervised clustering that does not require class labels; however, class labels are needed to label the clusters, as they enable us to count the number of botnet and normal instances in each cluster. These counts are used during cluster labeling. Therefore, the temporal-frequent distribution input for X-means botnet detection is

<F_1^0, ..., F_1^127, class_label_1>, ..., <F_n^0, ..., F_n^127, class_label_n>

where class_label_i tells whether a training instance is normal IRC or botnet IRC.

Similar to K-means detection, unseen instances are categorized based on the Euclidean distance between the data point and the cluster centers. The unmerged and merged X-means IRC botnet detection are discussed in detail in Section 3.3.5.1 and Section 3.3.5.2, respectively.

3.3.5.1 Unmerged X-means IRC botnet detection

The unmerged X-means IRC botnet detection approach tries to find two clusters, one containing most of the botnet instances and the other most of the normal instances, and discards the remaining clusters on the assumption that they are outliers from the normal and botnet instance groups. Two metrics, botnet_Proportion and normal_Proportion, are introduced. The IRC botnet detection procedure using unmerged X-means is described in Figure 3.10.

For each cluster returned by X-means clustering, botnet_Proportion is calculated in line 3 as the number of botnet instances contained in the cluster as a percentage of the total botnet instances used for clustering, Equation (3.9). After selecting the cluster containing the most botnet instances, normal_Proportion finds the cluster containing the highest percentage of normal instances among the remaining clusters, using Equation (3.10).

botnet_Proportion = (number of botnet instances in the cluster / total number of botnet instances) * 100    (3.9)

normal_Proportion = (number of normal instances in the cluster / total number of normal instances) * 100    (3.10)

Once the two clusters are selected, the standard deviation for each cluster is calculated using a similar approach as for K-means. The two values are then compared and the cluster with minimum standard deviation is labeled as the botnet cluster.

Function Botdet_UX(F, kmin, kmax) returns botnet cluster

Inputs: data instances F_i, i = 1, 2, ..., n; minimum number of clusters kmin; and maximum number of clusters kmax.

1: Create a copy of F_i, F_new, without class labels.
2: Obtain m clusters {C_1, C_2, ..., C_m} with X-means using F_new, kmin and kmax.
3: Find the clusters with the largest proportions of botnet and normal instances by calculating:
   botnet_j = number of botnet instances in C_j / total number of botnet instances
   normal_j = number of normal instances in C_j / total number of normal instances
4: botnet_b = max(botnet_1, botnet_2, ..., botnet_m);
   for all clusters except C_b, normal_n = max(normal_1, normal_2, ..., normal_m)
5: Calculate the average standard deviations of clusters C_n and C_b:
   sigma_n = standard deviation of C_n, sigma_b = standard deviation of C_b
6: If sigma_z = min(sigma_b, sigma_n), label C_z as the botnet cluster
7: Return the botnet cluster C_z

Figure 3.10 - Algorithm for IRC botnet detection using unmerged X-means
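The selection logic of Figure 3.10 can be sketched as follows. Since the denominators in Equations (3.9) and (3.10) are constant across clusters, raw counts suffice for the argmax; instances are assumed to be (features, label) pairs:

```python
import math

def avg_std(cluster):
    """Average per-column standard deviation of (features, label) instances."""
    rows = [f for f, _ in cluster]
    n, d = len(rows), len(rows[0])
    total = 0.0
    for j in range(d):
        mu = sum(r[j] for r in rows) / n
        total += math.sqrt(sum((r[j] - mu) ** 2 for r in rows) / n)
    return total / d

def botdet_ux(clusters):
    """Figure 3.10 sketch: keep the cluster with the most botnet-labeled
    instances and, among the rest, the one with the most normal-labeled
    instances; of those two, the tighter one is labeled botnet."""
    def count(cluster, label):
        return sum(1 for _, lab in cluster if lab == label)
    b = max(range(len(clusters)), key=lambda i: count(clusters[i], "botnet"))
    rest = [i for i in range(len(clusters)) if i != b]
    nrm = max(rest, key=lambda i: count(clusters[i], "normal"))
    return min([clusters[b], clusters[nrm]], key=avg_std)
```

Any other clusters are simply discarded as outliers, as the section describes.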

3.3.5.2 Merged X-means IRC botnet detection

The merged X-means approach finds the botnet cluster as the cluster with the lowest standard deviation and then gathers as many instances representing botnet IRC as possible from the remaining clusters. In this thesis we assume that merging other, smaller clusters dominated by botnet instances into the identified botnet cluster adds information that may improve the learner's ability to correctly classify more unseen botnet instances.

As shown by the algorithm in Figure 3.11, merged X-means first performs clustering on the unlabeled IRC temporal-frequent distribution.

After obtaining the set of clusters, two empty temporary clusters are created: a temporary botnet cluster and a temporary normal cluster. The standard deviation is calculated for each cluster, and the cluster with the minimum standard deviation is added to the temporary botnet cluster in line 5. Each remaining cluster is added to the temporary botnet cluster if it contains more botnet instances than normal instances, and to the temporary normal cluster otherwise. Cluster A is added (or assigned) to cluster B by simply appending all instances in cluster A to cluster B. After all the X-means clusters are processed, the temporary botnet cluster is labeled as the final botnet cluster and the temporary normal cluster as the final normal cluster.

Function Botdet_MX(F, kmin, kmax) returns botnet cluster

Inputs: data instances F_i, i = 1, 2, ..., n; minimum number of clusters kmin; and maximum number of clusters kmax.

1: Create a copy of F_i, F_new, without class labels.
2: Obtain m clusters {C_1, C_2, ..., C_m} with X-means using F_new, kmin and kmax.
3: Create empty temporary botnet and normal clusters, tempBotnet and tempNormal.
4: Calculate the average standard deviation sigma_j for each cluster, where 1 <= j <= m, sigma_j = standard deviation of C_j.
5: If sigma_z = min(sigma_1, sigma_2, ..., sigma_m), copy C_z to tempBotnet.
6: Assign the remaining clusters based on the number of botnet instances contained:
   botnetSizeC_j = number of botnet instances in C_j
   normalSizeC_j = number of normal instances in C_j
7: If botnetSizeC_j > normalSizeC_j, copy C_j to tempBotnet; otherwise copy C_j to tempNormal.
8: Return the botnet cluster tempBotnet.

Figure 3.11 - Algorithm for IRC botnet detection using merged X-means
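The merging logic of Figure 3.11 can be sketched as follows, again assuming (features, label) instances; the sketch returns both the final botnet and normal clusters:

```python
import math

def avg_std(cluster):
    """Average per-column standard deviation of (features, label) instances."""
    rows = [f for f, _ in cluster]
    n, d = len(rows), len(rows[0])
    total = 0.0
    for j in range(d):
        mu = sum(r[j] for r in rows) / n
        total += math.sqrt(sum((r[j] - mu) ** 2 for r in rows) / n)
    return total / d

def botdet_mx(clusters):
    """Figure 3.11 sketch: seed the botnet cluster with the minimum-
    standard-deviation cluster, then append every remaining cluster whose
    botnet-labeled instances outnumber its normal-labeled ones."""
    seed = min(range(len(clusters)), key=lambda i: avg_std(clusters[i]))
    temp_botnet, temp_normal = list(clusters[seed]), []
    for i, c in enumerate(clusters):
        if i == seed:
            continue
        botnet_n = sum(1 for _, lab in c if lab == "botnet")
        normal_n = sum(1 for _, lab in c if lab == "normal")
        (temp_botnet if botnet_n > normal_n else temp_normal).extend(c)
    return temp_botnet, temp_normal
```

Merging is plain list concatenation, matching the "appending all instances in cluster A to cluster B" description above.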

3.4 Concluding remarks

This chapter has introduced our proposed solution as a hierarchical IRC botnet detection framework. The procedures for the individual modules of the framework were then discussed in detail, along with formal algorithms for the individual tasks. The next chapter carries out experiments to implement and evaluate the framework, and compares it with the traffic classification and clustering methods discussed in this chapter.

Chapter 4

Experiments and Results

The proposed solution framework was discussed in Chapter 3; in this chapter, experiments are conducted to implement the different modules of the framework, and the experimental results are discussed. Section 4.1 presents feature selection and traffic application classification; Section 4.2 validates the standard deviation metric and the effectiveness of using 128 features versus 256 features for IRC botnet detection.

Section 4.3 presents the K-means, unmerged and merged X-means experiments to cluster traffic into IRC and non-IRC.

4.1 Feature selection and application classification

This section starts by describing the datasets used for the experiments and the metrics used to evaluate the performance of the supervised classifiers; then the different feature selection experiments, along with IRC and non-IRC classification results for the selected features, are presented. In this thesis, we use the JPCAP package [31] to read and extract the different components of a packet, such as the packet payload and timestamp values.

4.1.1 Datasets

The feature selection and traffic application classification experiments were conducted using several IRC and non-IRC datasets of different sizes, captured from different environments, as shown in Table 4.1. At this stage all the IRC datasets, both normal and botnet, are referred to simply as IRC datasets. The first column in Table 4.1 shows the names of the different datasets; throughout this chapter, file extensions are omitted when referring to a dataset (e.g. irc1.trace is referred to as the irc1 dataset). The environment where each dataset was captured is described in the second column. The third column indicates whether a dataset is IRC protocol or is made up of TCP-based protocols other than IRC. The IRC protocol uses TCP as its transport layer protocol; therefore, filters are applied such that only TCP is retained and all other protocols, such as UDP, are discarded. The Wireshark network protocol analyzer [32] was used to filter the desired protocols; to pre-process the IRC datasets we applied the filter "tcp and irc", and for the non-IRC datasets the filter "(tcp && edonkey) || (tcp && bittorrent) || (tcp && http) || (tcp && dns) || (tcp && imap) || (tcp && snmp) || (tcp && vnc) || (tcp && nfs) || (tcp && ftp)" was used.

The fourth and fifth columns in Table 4.1 show the size of each dataset before and after applying the filters, respectively. In non-irc1 the filter reduced 1154251 packets to 80181 packets. Visual inspection in Wireshark showed that the filtered traffic in non-irc1 is dominated by eDonkey, BitTorrent and HTTP, with a few DNS packets.

Since the original non-irc2 traffic file is very large, we used only about 20% of the original dump file; using this filter we obtained 35958 non-IRC packets out of 580146 read packets. Compared to the size of the IRC datasets, the non-irc1 and non-irc2 datasets are very large; therefore, we selected only the first 30000 packets passed by the filter for the experiments.

Table 4.1 - Description of the datasets

Name      Captured from        Type     Packets before filter  Packets after filtering
irc1      UNB testbed          IRC      110609                 354
irc2      UNB honeynet         IRC      125494                 4471
SkypeIRC  Wireshark Wiki [32]  IRC      2263                   159
HoneyIRC  Honeynet [33]        IRC      54536                  9809
Kaiten    UNB testbed          IRC      5272                   846
non-irc1  UNB testbed          Non-IRC  1154251                First 30000
non-irc2  UNB testbed          Non-IRC  >250 MB                First 30000

Using different combinations of the shuffled datasets from Table 4.1, we formed training and testing datasets as follows:

The first training dataset, referred to as "traindata1" or "training data 1" in our discussions, was made from 3250 IRC instances from the shuffled HoneyIRC dataset and 3250 non-IRC instances from the shuffled packets of the non-irc1 dataset. Considering the IRC type as the positive class label and non-IRC as the negative (this assignment applies to all other datasets), this training dataset has 3250 positive and 3250 negative instances.

The second training dataset, referred to as "traindata2" or "training data 2", is made up of 3250 IRC instances from the shuffled irc2 dataset and 3250 non-IRC instances from the shuffled packets of the non-irc2 dataset.

The first testing dataset, called "testdata1" or "testing data 1", has a total of 10000 instances, with 5000 instances coming from the re-shuffled non-irc1 and 5000 from the re-shuffled HoneyIRC. Therefore, traindata1 and testdata1 were made from the same IRC and non-IRC dataset combinations.

The second testing dataset, referred to as "testdata2" or "testing data 2", is smaller than testdata1, with 5846 instances, of which 5000 are non-IRC from non-irc2 and 846 are IRC from the Kaiten dataset. The non-IRC instances in this testing dataset are drawn from the same dataset as traindata2, but the IRC instances are from a dataset that was not used to create any training data.

4.1.2 Metrics

During the subset evaluation and gain ratio experiments to find a smaller subset of informative features, the accuracy of the J48 and SVM supervised classifiers in classifying unseen instances into IRC and non-IRC groups using the selected features is measured with the false negative rate (FNR), false positive rate (FPR), precision and recall metrics. Once the final subset of features is selected, the accuracy metric is used in addition to these four metrics. Precision is the percentage of examples classified as positive that are actually positive, and recall is the percentage of real positives that are correctly predicted. The precision, recall and accuracy metrics are calculated using Equation (4.1), Equation (4.2), and Equation (4.3), respectively.

Precision = TP / (TP + FP)    (4.1)

Recall = TP / (TP + FN)    (4.2)

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (4.3)
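Equations (4.1)-(4.3) computed from a confusion matrix, with IRC as the positive class; a minimal sketch:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy (Equations 4.1-4.3) from the
    confusion-matrix counts, with IRC as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy
```

For example, 90 true positives, 10 false positives, 10 false negatives and 90 true negatives give precision, recall and accuracy of 0.9 each.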

4.1.3 Experimental procedure

The same training and testing datasets are used for all our feature selection and traffic application classification experiments, but with different numbers of features. For subset evaluation feature selection, CfsSubsetEval from WEKA [34] (version 3.6.0) is used, and gain ratio based feature selection uses GainRatioAttributeEval, also from WEKA. The J48 classifier, WEKA's implementation of the C4.5 decision tree, is used, along with WEKA's implementation of the sequential minimal optimization (SMO) function for training the SVM.

First the CfsSubsetEval method is applied on traindata1 to obtain a subset of features; using these returned features, the dimensions of copies of traindata1, testdata1 and testdata2 are reduced accordingly, and J48 followed by SVM classification is performed. Each trained classifier is tested using 10-fold cross-validation, testdata1 and testdata2. The experiment is then repeated by applying the same subset evaluation method on traindata2. The process then moves to the GainRatioAttributeEval method and follows the same steps as the subset evaluation experiment. At the end, features that appear in at least 2 of the 4 subsets are selected as the final subset of features, and the J48 and SVM classification is repeated using these selected features.

Experiments in the last section compare classification using the selected subset of features against using 128 features and 256 features, in terms of the time to build the J48 model and the classification accuracy.

4.1.4 Subset evaluation results

Excluding the class label, the CfsSubsetEval method selected 4 features on training dataset 1: U0, U10, U47 and U61, at WEKA index positions 1, 11, 48 and 62, where U0 denotes the Unicode character with integer value 0 (the first Unicode character) and U255 the 256th character. Table 4.7 shows the Java printed characters corresponding to the features selected by the different attribute selection approaches. Table 4.2 shows the classification results using these selected features; J48 performed better than SVM, with both higher recall and precision values. Both classifiers had low false negative rates, between 0% and 10.6%, implying that few IRC packets are lost; however, SVM had higher false positive rates, with a minimum of 16.8% and a maximum of 30.8%.

Performing CfsSubsetEval feature selection on training dataset 2 selected 8 features, U0, U10, U33, U60, U61, U64, U84 and U216, plus the class attribute. For this model J48 again performed better than SVM, which had higher false positive rates, as shown in Table 4.3. For both classifiers, the traindata1 model performs better on testdata1 than on testdata2 because testdata1 is from the same environment as the training data.

Table 4.2 - CfsSubsetEval classification results on traindatal

Classifier  Test data                  FNR    FPR    Recall  Precision
J48         10-fold cross-validation   0.000  0.013  1.000   0.987
J48         Testdata1                  0.000  0.015  1.000   0.985
J48         Testdata2                  0.106  0.039  0.894   0.958
SVM         10-fold cross-validation   0.036  0.178  0.964   0.844
SVM         Testdata1                  0.039  0.168  0.961   0.851
SVM         Testdata2                  0.000  0.308  1.000   0.765

Table 4.3 - CfsSubsetEval classification results on traindata2

Classifier  Test data                  FNR    FPR    Recall  Precision
J48         10-fold cross-validation   0.002  0.004  0.998   0.996
J48         Testdata1                  0.040  0.005  0.960   0.995
J48         Testdata2                  0.032  0.004  0.968   0.996
SVM         10-fold cross-validation   0.010  0.270  0.990   0.786
SVM         Testdata1                  0.003  0.161  0.966   0.857
SVM         Testdata2                  0.004  0.282  0.996   0.779

4.1.5 Gain ratio results

The top 20 features with the highest gain ratio values were selected, using WEKA's default minimum gain ratio threshold of -1.7977. The top 5 features selected by gain ratio on traindata1 and traindata2, along with their gain ratio values, are listed in Table 4.4 (see also Table 4.7 for the Java printed characters). The remaining bottom 15 features selected on traindata1 and traindata2 are shown in Appendix A(a) and Appendix A(b), respectively. Table 4.5 shows classification results using the selected top 20 high gain ratio features on traindata1. As with the subset evaluation method, J48 and SVM had low false negative rates and high recall values. However, SVM still has high false positive rates and lower precision values. Testing on traindata2 yielded similar results, as shown in Table 4.6: J48 has lower false positive rates, 0.007 to 0.009, while SVM had 0.189 to 0.251.

Table 4.4 - Top 5 gain ratio selected features

Training dataset 1:
Attribute   Gain ratio
Unicode47   0.679
Unicode61   0.649
Unicode59   0.629
Unicode0    0.622
Unicode62   0.611

Training dataset 2:
Attribute   Gain ratio
Unicode0    0.484
Unicode216  0.405
Unicode10   0.403
Unicode168  0.375
Unicode60   0.374

Table 4.5 - GainRatioAttributeEval classification results on traindatal

Classifier  Test data                  FNR    FPR    Recall  Precision
J48         10-fold cross-validation   0.000  0.022  1.000   0.978
J48         Testdata1                  0.000  0.018  1.000   0.982
J48         Testdata2                  0.106  0.023  0.894   0.870
SVM         10-fold cross-validation   0.000  0.180  1.000   0.847
SVM         Testdata1                  0.000  0.171  1.000   0.854
SVM         Testdata2                  0.004  0.280  0.996   0.781

Table 4.6 - GainRatioAttributeEval classification results on traindata2

Classifier  Test data                  FNR    FPR    Recall  Precision
SVM         10-fold cross-validation   0.000  0.250  1.000   0.800
SVM         Testdata1                  0.028  0.189  0.972   0.837
SVM         Testdata2                  0.000  0.251  1.000   0.799
J48         10-fold cross-validation   0.006  0.009  0.994   0.991
J48         Testdata1                  0.050  0.007  0.950   0.993
J48         Testdata2                  0.037  0.009  0.963   0.991

4.1.6 Classification results for final subset of 9 selected features

From the groups of features selected by the CfsSubsetEval subset evaluation and GainRatioAttributeEval gain ratio feature selection methods on traindata1 and traindata2, there are a total of 4 subsets of features from which to select the final subset of informative features. Excluding the class label attribute, the following attributes are common to at least 3 of the 4 feature selection subsets: U0, U10, U47, U60, U61, and U216. The features U177, U179, and U217 are common to 2 of the 4 subsets.

Therefore, the 9 payload features U0, U10, U47, U60, U61, U177, U179, U216, and U217, which represent the Unicode characters with integer values (and hence Unicode index positions) 0, 10, 47, 60, 61, 177, 179, 216 and 217, are chosen as the final subset of important features; see Table 4.7.

Table 4.7 - Unicode index values and the corresponding Java printed characters

Index (integer value)  0    10  33  47  59  60  61  62  64  84  168  177  179  216  217
Printed character      NUL  LF  !   /   ;   <   =   >   @   T   ¨    ±    ³    Ø    Ù

The dimensions of the training and testing datasets are reduced accordingly to leave only these 9 selected features, and the J48 and SVM classification experiments are repeated. Classification results are shown in Table 4.8. SVM shows slightly higher false positive rates for all testing datasets and slightly lower precision and accuracy values, meaning that more non-IRC packets are passed as IRC when classifying with this SVM. J48 has both lower false negative and lower false positive rates. Similar to the results of the above experiments, J48 shows a slightly higher FNR and slightly lower precision values for testdata2 than for testdata1, because it is from a different environment; however, overall J48 still performs much better than SVM on this testing dataset. All classifiers give good classification accuracy.

Figure 4.1 visually compares the false positive rates of J48 and SVM; the subset evaluation method is shown in Figure 4.1(a), the gain ratio method in Figure 4.1(b), and the results for the final subset of 9 features in Figure 4.1(c). Based on the classification results from the CfsSubsetEval method, the GainRatioAttributeEval method and our final subset of 9 features, this thesis has shown that not all 256 payload features are necessary to classify network traffic into IRC and non-IRC application communities. A very small subset of the 256 payload frequency features, made up of only 9 features, was identified, and the J48 classifier performs better than SVM in using these features to separate IRC from non-IRC packets.

Table 4.8 - Classification results using the final selected subset of 9 features

Classifier  Training data  Test data              FNR    FPR    Precision  Recall  Accuracy
J48         Traindata1     Testdata1              0      0.015  0.985      1.000   0.993
                           Testdata2              0.106  0.039  0.794      0.894   0.928
                           10-fold cross-valid.   0      0.013  0.987      1.000   0.994
            Traindata2     Testdata1              0.05   0.006  0.994      0.95    0.972
                           Testdata2              0.037  0.006  0.962      0.963   0.978
                           10-fold cross-valid.   0.005  0.006  0.994      0.995   0.995
SVM         Traindata1     Testdata1              0      0.208  0.828      1.000   0.896
                           Testdata2              0      0.381  0.724      1.000   0.809
                           10-fold cross-valid.   0      0.216  0.822      1.000   0.812
            Traindata2     Testdata1              0.007  0.183  0.845      0.993   0.905
                           Testdata2              0      0.319  0.758      1.000   0.841
                           10-fold cross-valid.   0.002  0.311  0.762      0.998   0.843


a) CfsSubsetEval b) GainRatioAttributeEval c) Final 9 selected features

Figure 4.1- J48 versus SVM false positive rates

4.1.7 Comparing 9, 128 and 256 features

The final subset of 9 features selected during the feature selection experiments produced good classification results for both the J48 and SVM classifiers, as already shown in Table 4.8. In this section, these 9 features are compared with 128 and 256 features in terms of the time it takes to build a classification model from an already processed dataset, and in terms of classification accuracy.

4.1.7.1 Comparing in terms of time

The J48 classifier is used in this experiment to compare times for different numbers of features on a Windows Vista machine with 1.00 GB RAM and a 1.80 GHz processor. In our classification approach, when using n features the dataset's original dimension is reduced so that only the n features and the class label remain, i.e. for 9 features the input is a dataset with 10 columns, and 128 and 256 features give 129 and 257 columns, respectively. For each case J48 was run 5 times and the smallest and largest time values were discarded as outliers. Figure 4.2 shows the average times it took to build the J48 classifier on the training datasets for 9, 128 and 256 features; raw time results are shown in Appendix B(a). From the graphs it can be concluded that the time (in seconds) it takes to build the J48 model is approximately linear and increases with the number of features. For both Traindata1 and Traindata2, the time for 256 features is approximately double the time for 128 features.

128 features is approximately 9 features × 14 and 256 features is approximately 9 features × 28; using example time values obtained for Traindata1 and Traindata2 shows that it takes approximately 0.03 seconds to process one feature, see Table 4.9.

Table 4.9 - Example to approximate time for 1 feature

Dataset     Number of features  Time (seconds)  Ratio (i.e. time for 1 feature)
Traindata1  9                   0.26            0.0289
            128                 4.44            0.0347
Traindata2  9                   0.32            0.0356
            256                 10.19           0.0398


Figure 4.2 - Comparing times for 9, 128 and 256 features on J48

Therefore, based on results from this experiment on the J48 classifier, we conclude that doubling the number of features doubles the time to build the J48 model. Using our 9 features instead of the whole 256 therefore reduces the time to build the classification model approximately 28 times.

4.1.7.2 Comparing in terms of classification accuracy

Figure 4.3 compares the false positive rates obtained for the J48 and SVM classifiers using 9, 128 and 256 features. The corresponding FPR values are in Appendix B(b) and Appendix B(c). J48 classification using 256 features on Traindata2 with Testdata1, and all SVM classifications using 256 features, were not performed due to out-of-memory errors.

The J48 classifier's FPR for 9 features was higher, but not far from those obtained using 128 and 256 features. However, using 128 and 256 features improved results for SVM. As shown in Figure 4.3(b), the FPR obtained using 128 features for SVM stays below about 15%, a big improvement over the FPR that reached close to 40% when using the 9 selected features. More classification accuracy results are shown in Table 4.10.


a) J48 classifier b) SVM classifier

Figure 4.3 - Compare FPR for 9, 128 and 256 features

Table 4.10 - Compare classification accuracy for 9, 128 and 256 features

Classifier  Number of features  Training data  Test data   Precision  Recall  Accuracy
J48         128                 Traindata1     Testdata1   0.997      0.999   0.998
                                               Testdata2   0.825      0.173   0.584
                                Traindata2     Testdata1   0.995      0.971   0.983
                                               Testdata2   0.979      0.953   0.974
            256                 Traindata1     Testdata1   0.997      0.999   0.998
                                               Testdata2   0.825      0.173   0.586
                                Traindata2     Testdata2   0.979      0.953   0.975
SVM         128                 Traindata1     Testdata1   0.987      0.998   0.993
                                               Testdata2   0.877      0.193   0.583
                                Traindata2     Testdata1   0.922      1       0.958
                                               Testdata2   0.89       0.96    0.922

4.2 Validating standard deviation metric

Experiments in this section aim to validate the standard deviation metric on packets. Our botnet detection approach uses only the first 128 Unicode characters, based on earlier observations from sample datasets that IRC packets are dominated by an average frequency of 0 for characters beyond 128. Therefore, this section also validates the effectiveness of using only the first 128 Unicode characters for IRC botnet detection versus all the first 256. HoneyIRC and Kaiten represent botnet datasets, and Irc1 and Irc2 represent normal IRC datasets.

4.2.1 Validation on frequency vectors

To be able to compare the individual Unicode character frequency standard deviations (calculated using Equation (3.6)) from different IRC datasets on the same scale, the obtained standard deviation values are normalized into a distribution with mean 0 and standard deviation 1 using Equation (4.4).

x_i,new = (x_i - μ) / σ    (4.4)

This normalization expresses how far an item is from the mean as a number of standard deviations, and about 99% of values are scaled to fall in the [-3, +3] range. Raw values below the mean yield negative normalized values, and those above the mean yield positive normalized values. First, the normalized standard deviation graphs for botnet IRC and normal IRC datasets are compared; the second experiment uses the same normalized standard deviation vectors to compare their lines of best fit.

4.2.1.1 Comparing normalized standard deviation graphs

The hypothesis for plotting and comparing the individual Unicode character standard deviation graphs rests on the assumption that botnet content is less diverse than normal content: we expect the standard deviation for most botnet payload Unicode characters to be smaller than for the same characters from normal IRC datasets, hence the botnet IRC graphs are expected to lie visually below the normal IRC graphs for most characters.

Figure 4.4 shows a plot comparing the HoneyIRC and Kaiten botnet IRC datasets to the irc1 and irc2 normal IRC datasets. The line graphs are very close together and difficult to distinguish; therefore, Figure 4.5 shows the graphs on a y-axis scale of [-0.6, 0.6]. This range was selected because it is where most of the graphs lie according to Figure 4.4. The HoneyIRC botnet dataset graph agrees with the hypothesis because it lies below the other three graphs for most characters, with a big gap. The Kaiten botnet graph is above all the graphs for about the first 30 Unicode characters; for the remaining characters it alternates positions with the two normal IRC datasets.


Figure 4.4 - Normalized standard deviation graph for normal versus botnet IRC packets


Figure 4.5 - Normalized standard deviation graphs on scale [-0.6, 0.6]

4.2.1.2 Comparing normalized standard deviation lines of best-fit

The hypothesis for comparing lines of best fit likewise rests on the assumption that botnet content is less diverse than normal content: we expect the standard deviation for most botnet characters to be smaller than for the same characters from normal IRC datasets, hence most IRC botnet characters should have close to 0 standard deviation. Therefore, the botnet IRC normalized standard deviation line of best fit is expected to be less steep (i.e. to have a lower value of m) than the normal IRC line of best fit most of the time, indicating that botnet traffic varies less.

The Matlab polyfit function was used to obtain the values of the constants m and k (see Table 4.11), and new standard deviation values are calculated using y = mx + k and plotted in Figure 4.6.
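The MATLAB polyfit step has a direct NumPy analogue, np.polyfit with degree 1. The synthetic data below is illustrative only (seeded noise around irc1's reported line), not the thesis data.

```python
import numpy as np

# Fit y = m*x + k (degree-1 polynomial), as MATLAB's polyfit does.
x = np.arange(128)  # Unicode index positions 0..127
rng = np.random.default_rng(0)
y = 0.0064 * x - 0.4071 + rng.normal(0.0, 0.05, size=128)  # noisy line

m, k = np.polyfit(x, y, 1)   # slope m and intercept k of the best-fit line
fitted = m * x + k           # the values that would be plotted
```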

Table 4.11 - Line of best-fit constants for normal versus botnet IRC datasets

Dataset          m       k
Honey (botnet)   0.0082  -0.5229
Kaiten (botnet)  0.0056  -0.3586
Irc1             0.0064  -0.4071
Irc2             0.0097  -0.6130


Figure 4.6 - Line of best fit graphs for normal versus botnet IRC datasets

The HoneyIRC botnet dataset has a lower m value of 0.0082 than the normal irc1 dataset, which has 0.0064, a difference of 0.0018. HoneyIRC is also lower than Kaiten. Although Kaiten has the highest m value, it is not very far from that of irc1 (a difference of 0.0008).

4.2.2 Validation on temporal-frequency vectors

Experiments in Section 4.2.1 compared ungrouped individual packets; this section compares IRC packet temporal-frequent vectors obtained with n-gram n=1 and a time interval of 1 minute. The mean of means and the mean of mean standard deviations are compared for various IRC datasets using the first 128 Unicode characters. The effectiveness of using only the first 128 versus all the first 256 Unicode features is evaluated by repeating the same experiment on the first 256 Unicode characters and comparing the results. Experimental results are shown in Table 4.12.

Results from Table 4.12 show that for all the test datasets, the mean of means and mean of mean standard deviation values for the 256-dimensional feature space are approximately half the values obtained for 128 features. This implies that when using all 256 features we are mostly adding zero values in the last 128 characters; dividing by the total number of characters (256) therefore halves the sums computed over the first 128 characters.
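The halving effect is pure arithmetic: padding the frequency vector with 128 zeros doubles the divisor while leaving the sum unchanged. A minimal illustration, with made-up frequencies:

```python
import statistics

freqs_128 = [3.0, 1.0, 0.0, 4.0] * 32   # stand-in for 128 mean frequencies
freqs_256 = freqs_128 + [0.0] * 128     # characters 128..255 all zero

# Same sum, double the count: the mean over 256 features is exactly half.
assert statistics.fmean(freqs_256) == statistics.fmean(freqs_128) / 2
```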

Table 4.12 - Validation using 128 versus 256 temporal-frequent features

IRC dataset            Number of features  Mean of means  Mean of mean standard deviation
Irc2 (normal)          128                 0.8435         0.8847
                       256                 0.4219         0.4433
Irc1 (normal)          128                 1.3741         1.403
                       256                 0.6872         0.7018
SkypeIRC (normal)      128                 4.5953         4.641
                       256                 2.2984         2.3237
HoneyIRC (botnet)      128                 0.6227         0.5854
                       256                 0.3294         0.3099
Kaiten_08_10 (botnet)  128                 0.1492         0.207
                       256                 0.0746         0.1035

Table 4.12 also shows that the mean of mean standard deviation for normal IRC is higher than that of botnet IRC, implying that botnet IRC individual packet content is less diverse than normal IRC content.

Based on the majority of result cases from the experiments using individual frequent packets and the experiment using temporal-frequent features, we conclude that the finding of previous studies, namely that IRC botnet flows are less diverse (shown by lower standard deviations) than normal IRC flows, also holds when working with individual packets. It is also concluded that, for IRC, only the first 128 Unicode characters are necessary for experiments.

4.3 Botnet detection

In this section IRC botnet detection is performed using K-means, unmerged X-means and merged X-means. These algorithms are each applied to two models of training and testing datasets, and finally the results of the three approaches are compared.

The same datasets are used for all the IRC botnet detection experiments with the three algorithms; Section 4.3.1 describes the datasets used. The metrics used to evaluate clustering results are presented in Section 4.3.2. Experimental results for K-means botnet detection are presented in Section 4.3.3, unmerged X-means in Section 4.3.4, and merged X-means detection in Section 4.3.5.

4.3.1 Description of clustering datasets

Botnet detection experiments were carried out using normal and botnet IRC datasets of different sizes, captured from different environments, as described in Table 4.13. The first column of Table 4.13 shows the names of the datasets; each dataset will be referred to by only the first part of its name during the discussions (e.g. Irc1.trace will be referred to simply as Irc1). The second column tells whether the dataset is made up entirely of botnet IRC instances or entirely of normal IRC instances, and the third column describes where the dataset was captured. Knowing the environment from which datasets were captured is important when evaluating the performance of the built clustering model on clustering unseen instances.

To ensure that the datasets contain only IRC protocol packets, each dataset was loaded into the Wireshark network protocol analyzer (1) and the filter "tcp && irc" was applied. The fourth and fifth columns of Table 4.13 show the number of packets in each dataset before and after applying the filter, respectively. In the experiments we use 1 minute for temporal correlation of packet payloads, and the last column shows the number of 1-minute intervals in each dataset, hence the final number of instances of each dataset used during clustering.
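The 1-minute temporal correlation of 1-gram payload counts can be sketched as below. The (timestamp, payload) input format and the function name are assumptions, standing in for packets exported from the filtered Wireshark captures.

```python
from collections import Counter, defaultdict

def temporal_frequent(packets, interval=60):
    """Group packet payloads into time buckets and sum 1-gram byte counts.

    `packets` is an iterable of (timestamp_seconds, payload_bytes) pairs,
    an assumed input format for parsed capture data.
    """
    buckets = defaultdict(Counter)
    for ts, payload in packets:
        buckets[int(ts // interval)].update(payload)  # count bytes per minute
    return dict(buckets)

# Toy capture: two packets in minute 0, one packet in minute 1.
pkts = [(0.5, b"PING"), (30.2, b"PONG"), (70.0, b"PRIVMSG")]
buckets = temporal_frequent(pkts)
```

Building the final 129-column temporal-frequent instances (128 character counts plus a class label) from these per-minute counters is then a simple projection step.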

Table 4.13 - IRC botnet detection datasets

Dataset name   Botnet or   Captured from        Packets before  Packets after  1-gram 1-minute
               normal IRC                       filtering       filtering      intervals
Irc1.trace     Normal      UNB testbed          110609          354            28
Irc2.trace     Normal      UNB honeynet         125494          4471           101
SkypeIRC.pcap  Normal      Wireshark Wiki [32]  2263            159            6
HoneyIRC.pcap  Botnet      Honeynet [33]        54536           9809           244
Kaiten.pcap    Botnet      UNB testbed          5272            846            80

K-means, unmerged X-means and merged X-means experiments were performed on the same data models, obtained as combinations of the IRC datasets from Table 4.13. As depicted in Table 4.14, the first data model, referred to as "Model-1", uses Irc2 normal instances and HoneyIRC botnet instances for training; the obtained clustering model is tested on the Irc1 and Kaiten datasets. The second model, "Model-2", uses Irc1 and Kaiten for training, and its testing instances are obtained from the Irc2 and HoneyIRC datasets. Table 4.14 shows the number of normal and botnet temporal-frequent instances used for training in each model, as well as the number of normal and botnet individual frequent instances used for testing. Note that all three of our botnet detection approaches build a clustering model using temporal-frequent characters, but new unseen instances are clustered using ungrouped packet payload frequency vectors.

Table 4.14 - Clustering data models

Data model  Training datasets   Normal instances  Botnet instances  Testing datasets    Normal instances  Botnet instances
Model-1     Irc2 and HoneyIRC   101               244               Irc1 and Kaiten     354               846
Model-2     Irc1 and Kaiten     28                80                Irc2 and HoneyIRC   4471              9809

4.3.2 Metrics to evaluate clustering performance

The performance of the clustering models formed by K-means, unmerged X-means and merged X-means on Model-1 and Model-2 was evaluated using the detection rate (DR) and false alarm rate (FAR). DR evaluates how many of the true botnet packets the system is able to detect as botnet, and FAR gives the proportion of normal packets that the system falsely classifies as botnet. DR and FAR were calculated using Equation (4.5) and Equation (4.6), respectively.

DR = (number of botnet instances correctly detected) / (total number of botnet instances)    (4.5)

FAR = (number of normal instances detected as botnet) / (total number detected as botnet)    (4.6)

BPR = (number of botnet instances in the botnet cluster) / (total number of botnet instances) × 100    (4.7)

NPR = (number of normal instances in the botnet cluster) / (total number of normal instances) × 100    (4.8)
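Equations (4.5) through (4.8) reduce to simple ratios. The helper names below are illustrative, and the instance counts in the sanity check are back-calculated from Table 4.15's percentages, i.e. inferred rather than reported.

```python
def dr(botnet_detected, total_botnet):
    """Equation (4.5): fraction of true botnet instances detected."""
    return botnet_detected / total_botnet

def far(normal_detected_as_botnet, total_detected_as_botnet):
    """Equation (4.6): fraction of botnet detections that were normal."""
    return normal_detected_as_botnet / total_detected_as_botnet

def bpr(botnet_in_cluster, total_botnet):
    """Equation (4.7): % of all botnet instances in the botnet cluster."""
    return botnet_in_cluster / total_botnet * 100

def npr(normal_in_cluster, total_normal):
    """Equation (4.8): % of all normal instances in the botnet cluster."""
    return normal_in_cluster / total_normal * 100

# Model-1 K-means sanity check: 236 of 244 botnet and 99 of 101 normal
# instances (inferred counts) reproduce BPR 96.72% and NPR 98.02%.
assert round(bpr(236, 244), 2) == 96.72
assert round(npr(99, 101), 2) == 98.02
```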

Two more metrics were defined to evaluate the composition of the botnet cluster, namely Botnet Percentage Representation (BPR) and Normal Percentage Representation (NPR). During the botnet detection experiments class labels are included in the clustering input (but ignored during clustering itself), which makes it possible to count the number of botnet and normal instances in each cluster. BPR (see Equation (4.7)) and NPR (see Equation (4.8)) denote the proportions of botnet and normal instances, respectively, that make up the identified botnet cluster. These two metrics help evaluate whether the identified botnet cluster is really composed mainly of botnet instances or is just a collection of instances with a smaller standard deviation.

Evaluating 3 clustering approaches on 2 data models means there are a total of 6 performance results to evaluate and compare. New instances are clustered into either the botnet or the normal IRC category based on the Euclidean distance (using Equation (3.8)) between each cluster center and the unseen instance; the instance is assigned to the IRC type with minimum distance.

4.3.3 K-means detection

This section presents the K-means IRC botnet detection experiment details and performance results. This thesis used the implementation of the K-means clustering algorithm from MATLAB (2). As already mentioned in Section 3.3.4, when performing K-means clustering the proposed detection approach does not require class labels to label the botnet and normal clusters; however, during experiments the class labels are added as part of the K-means input to enable determining the composition of the botnet clusters, i.e. to find statistically whether our botnet clusters are dominated by botnet instances. The class labels are ignored during clustering. The following input values were used:

The number of clusters k was fixed to k = 2 for both the Model-1 and Model-2 experiments.

To improve the chances of the final K-means clusters being well separated into botnet IRC and normal IRC, the initial centroids were predefined to ensure each class is represented. Therefore, two integers were provided as index positions for the botnet and normal centroids.

An n × 129 dimensional vector containing n reduced temporal-frequent instances from the botnet IRC and normal IRC datasets. During all K-means experiments the first 1 to m instances came from the botnet IRC dataset and the remaining m+1 to n instances were normal IRC. The structure of the input dataset (described in detail in Section 3.3.2) is as follows:

(x_0, x_1, ..., x_127, class label)

In data model Model-1, botnet instances ranged from instance number 1 to 244, and normal instances ranged from 245 to 345. The Matlab randomizing function was used to pick a random integer in the botnet range as the position of the botnet centroid, and another random integer in the normal range as the normal centroid position. Instances at positions 46 and 256 were selected as the botnet and normal centroids, respectively. This K-means run was repeated several times with different random start position pairs, and all runs returned two clusters of similar sizes: one cluster with 335 instances and another with 10 instances. Botnet instances in data Model-2 ranged from 1 to 80 and normal instances from 81 to 108. The cluster centroids were chosen at positions 68 and 83. Once clustering was complete, the standard deviation was calculated for each cluster and the cluster with the minimum standard deviation was labeled as the botnet cluster.

Table 4.15 shows the standard deviations of the normal and botnet clusters formed by K-means. For both Model-1 and Model-2 the standard deviation of the botnet cluster is much less than that of the normal cluster, agreeing with this thesis' assumption and validations that botnet IRC is less diverse. However, results from Table 4.15 (4th and 5th columns) also show that K-means grouped almost all instances (96.72% of total botnet instances and 98.02% of total normal instances in Model-1, and 100% of total botnet instances and 85.71% of total normal instances in Model-2) into one big cluster, which was labeled as the botnet cluster by the minimum standard deviation approach, while the remaining few instances formed the normal cluster. This difference in cluster sizes shows that K-means did not separate normal IRC temporal-frequent instances well from botnet IRC instances.

Table 4.15 - K-means cluster statistics

Model    Botnet cluster std. dev.  Normal cluster std. dev.  BPR %  NPR %
Model-1  0.1990                    1.2557                    96.72  98.02
Model-2  0.253                     0.8946                    100    85.71

Table 4.16 - K-means performance results

Model    FAR    DR
Model-1  0.295  1
Model-2  0.319  0.957

Separating unseen botnet IRC and normal IRC individual packet instances using the K-means clusters on Model-1 gives desirable performance results, shown in Table 4.16. All the botnet instances in the test dataset were correctly classified into the botnet IRC group (i.e. a DR of 100%), and 29.5% of the identified botnet instances were false alarms. K-means also performed well on clustering unseen instances with the Model-2 clustering model: a DR of 95.7% and a FAR of 31.9% were obtained. The slightly lower performance of K-means on Model-2 may be due to differences in the sizes of the training and testing datasets: Model-2 is trained with fewer instances and tested on a larger number of instances than Model-1.

4.3.4 Unmerged X-means detection

The botnet detection experiment on dataset models Model-1 and Model-2 from K-means clustering above was repeated using the X-means clustering algorithm; the WEKA (3) implementation was used. Unlike K-means, X-means does not require a fixed number of clusters; instead, the minimum and maximum numbers of clusters were provided as minK = 2 and maxK = 6, respectively. The lower bound was set to 2 because the goal in this thesis is to end up with two clusters: a normal cluster and a botnet cluster. During both unmerged X-means and merged X-means detection, the proposed approach does not require class labels to build a clustering model, but the class labels are required for labeling clusters. A time interval of 1 minute and n-grams with n = 1 were used to form temporal-frequent IRC instances. Therefore, as in K-means clustering, the input training instances form an n × 129 vector as shown below:

(x_0, x_1, ..., x_127, class label)

The X-means clustering algorithm returned 4 clusters on Model-1 and also 4 clusters on Model-2, as shown in Table 4.17.

Table 4.17 - X-means original number of clusters

Data model  Number of instances per cluster
            Cluster0  Cluster1  Cluster2  Cluster3
Model-1     11        233       16        85
Model-2     5         24        11        68

Using the X-means clusters from Table 4.17, this section discusses the unmerged X-means botnet cluster labeling experiment. As per the unmerged X-means algorithm discussed in Section 3.3.5, the first step is to count the number of botnet instances in each cluster to find the cluster containing the highest proportion of total botnet instances. In Model-1, cluster1 contains 233 / 244 × 100 = 95.5% of total botnet instances. Next, a cluster containing a high percentage of normal instances among the remaining clusters is determined: cluster3 has 85 / 101 × 100 = 84.2% of normal instances. Therefore, cluster1 and cluster3 were selected for the unmerged X-means, and the minimum standard deviation approach was applied to the two clusters to determine the botnet cluster.

Cluster1 had a standard deviation of 0.0983 and cluster3 had a standard deviation of 0.1425; therefore, these clusters were labeled as the final botnet and normal clusters, respectively, as shown in Table 4.18. The cluster labeled botnet by the minimum standard deviation is the same as the cluster initially identified as containing the highest percentage of total botnet instances. The BPR and NPR values from Table 4.18 show that the final botnet cluster on Model-1 has more than 95% of the total botnet instances and contains no normal instances. This is good, as it indicates that the cluster labeled botnet by the unmerged X-means approach is indeed made up of botnet instances. Unmerged X-means performed well in clustering unseen Irc1 and Kaiten individual packet instances into botnet IRC and normal IRC groups in Model-1, as shown in Table 4.19.

Table 4.18 - Unmerged X-means cluster statistics

Model    Botnet cluster std. dev.  Normal cluster std. dev.  BPR %  NPR %
Model-1  0.0983                    0.1425                    95.5   0
Model-2  0.0190                    0.2590                    81.25  10.71

Table 4.19 - Unmerged X-means performance results

Model    FAR    DR
Model-1  0.231  0.905
Model-2  0.181  0.469

In Model-2, cluster3 and cluster1 were the initially chosen clusters with the largest proportions of botnet and normal instances, respectively. After calculating cluster standard deviations, cluster3 was labeled as the botnet cluster with a standard deviation of 0.019; it is made up of 81.25% of total botnet instances and only 10.71% of total normal instances, as shown in Table 4.18. Therefore, as with Model-1, the botnet cluster identified by the unmerged X-means is dominated by botnet instances and represents a large percentage of the total botnet instances used for training. Although it has a lower FAR than Model-1, unmerged X-means on Model-2 failed to correctly identify about 53% of the testing botnet instances, see Table 4.19. This unsatisfactory DR may be because Model-2 is smaller than Model-1; discarding the other two originally returned clusters may therefore have caused a loss of information necessary to distinguish some of the unseen botnet instances.
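The unmerged X-means labeling procedure can be sketched as below. The per-cluster botnet/normal splits for cluster0 and cluster2 are assumed (the text only reports cluster1's 233 botnet instances and cluster3's 85 normal instances), and the standard deviations for those two clusters are placeholders.

```python
def unmerged_label(clusters, std):
    """Sketch of unmerged X-means labeling as described in Section 3.3.5:
    pick the cluster with most botnet instances, then, among the rest, the
    one with most normal instances; the lower-std one of the pair is botnet.

    `clusters` maps name -> (botnet_count, normal_count); `std` maps
    name -> overall cluster standard deviation (both assumed precomputed).
    """
    bot = max(clusters, key=lambda c: clusters[c][0])
    rest = {c: v for c, v in clusters.items() if c != bot}
    norm = max(rest, key=lambda c: rest[c][1])
    pair = sorted([bot, norm], key=lambda c: std[c])
    return {"botnet": pair[0], "normal": pair[1]}

# Model-1 cluster sizes from Table 4.17; c0/c2 splits and stds assumed.
counts = {"c0": (11, 0), "c1": (233, 0), "c2": (0, 16), "c3": (0, 85)}
stds = {"c0": 0.2, "c1": 0.0983, "c2": 0.3, "c3": 0.1425}
result = unmerged_label(counts, stds)
```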

4.3.5 Merged X-means detection

The merged X-means cluster labeling was performed on the same clusters returned by X-means clustering on Model-1 and Model-2, already shown in Table 4.17. First, two temporary empty clusters called "tempBotnet" and "tempNormal" are created. Using all the returned clusters in Model-1, standard deviations are calculated and the cluster with the lowest standard deviation forms the first component of the tempBotnet cluster; cluster1 had the lowest standard deviation of 0.0983. For every other remaining cluster, if it contains more botnet instances than normal instances it is added to tempBotnet, otherwise it is added to tempNormal. After completing all the assignments, tempBotnet was labeled as the final botnet cluster with a standard deviation of 0.3055.

The tempNormal cluster is labeled as the final normal cluster and has an overall cluster standard deviation of 0.3004. In this case the botnet cluster standard deviation is slightly higher than that of the normal cluster; this may be considered an outlier. Note that in merged X-means slightly higher overall standard deviations for botnet clusters may happen often, as only the first component of the temporary tempBotnet cluster is determined by minimum standard deviation; the remaining members are added based entirely on containing more botnet instances than normal instances. After merging, the final botnet cluster in Model-1 contains all the botnet instances used (BPR of 100% and NPR of 0%) and still no normal instances, as in the results for unmerged X-means on Model-1. A DR of 87.7% was obtained, with a slightly higher FAR than in the unmerged case, see Table 4.21.

Table 4.20 - Merged X-means cluster statistics

Model    Botnet cluster std. dev.  Normal cluster std. dev.  BPR %  NPR %
Model-1  0.3055                    0.3004                    100    0
Model-2  0.1165                    0.7488                    95     10.71

Table 4.21 - Merged X-means performance results

Model    FAR    DR
Model-1  0.245  0.877
Model-2  0.255  0.846

In data model Model-2, the first cluster added to tempBotnet was cluster3 with a standard deviation of 0.019; the final botnet cluster had an overall standard deviation of 0.1165 and the final normal cluster a standard deviation of 0.7488. The BPR and NPR values in Table 4.20 show that the merged X-means final botnet cluster contained 95% of the total botnet instances and only about 10% of the normal instances. A good detection rate of 84.6% was obtained.
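The merged X-means labeling procedure can be sketched the same way. Again, the cluster0/cluster2 botnet/normal splits and their standard deviations are assumed for illustration; only cluster1's and cluster3's statistics are reported in the text.

```python
def merged_label(clusters, std):
    """Sketch of merged X-means labeling: seed tempBotnet with the
    minimum-std cluster, then add every other cluster to tempBotnet if it
    holds more botnet than normal instances, otherwise to tempNormal.

    `clusters` maps name -> (botnet_count, normal_count); `std` maps
    name -> overall cluster standard deviation.
    """
    seed = min(clusters, key=lambda c: std[c])
    temp_botnet, temp_normal = [seed], []
    for name, (b, n) in clusters.items():
        if name == seed:
            continue
        (temp_botnet if b > n else temp_normal).append(name)
    return temp_botnet, temp_normal

# Model-1 cluster sizes from Table 4.17; c0/c2 splits and stds assumed.
counts = {"c0": (11, 0), "c1": (233, 0), "c2": (0, 16), "c3": (0, 85)}
stds = {"c0": 0.2, "c1": 0.0983, "c2": 0.3, "c3": 0.1425}
bot, norm = merged_label(counts, stds)
```

Under these assumed splits the merged botnet cluster absorbs both botnet-dominated clusters, consistent with the reported BPR of 100% on Model-1.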

4.3.6 Comparison of the three detection approaches

To conclude the IRC botnet detection module, the three clustering approaches used are compared in terms of clustering performance (overall DR, FAR, BPR and NPR). Clustering using K-means and merged X-means had consistently high detection rates for the two dataset models. Unmerged X-means gave a 90.5% DR for Model-1 but did not perform well on Model-2, with a low DR of 46.9%. Despite being able to correctly detect most unseen botnet instances in general, all three botnet cluster labeling approaches had somewhat high FAR. Unmerged X-means had the lowest FAR, 23.1% for Model-1 and 18.1% for Model-2, while K-means had the highest, 29.5% for Model-1 and 31.9% for Model-2. These FAR values are reasonable, but considering that all the testing datasets used had more botnet instances than normal instances, if the sizes of both types were equal the FAR might be higher. These FAR values may also stem from the numbers of normal instances in the training datasets being about half the numbers of botnet instances. With regard to categorizing unseen instances, this thesis concludes that merged X-means performed better than the other two approaches, since it had a more consistently good detection rate than unmerged X-means and a lower false alarm rate than K-means.

BPR and NPR results for Model-1 and Model-2 show that the botnet clusters labeled by unmerged and merged X-means contained a large proportion of the total IRC botnet training instances, with very few normal instances. From the experiments, botnet clusters labeled by these two methods contained at least 81.25% of the total botnet instances in the training dataset, and at most 10.71% of the normal instances. In contrast, K-means botnet clusters contained almost all the botnet instances in the training dataset (at least 96.72%); however, they were also composed of a very high percentage of normal instances (98.02% for Model-1 and 85.71% for Model-2). Although K-means has good detection rates in both models, a large percentage of normal instances in the botnet cluster may indicate it did not clearly separate normal IRC instances from botnet IRC instances. Comparing results for all four IRC botnet detection metrics used in the experiments, the thesis concludes that merged X-means performed best, followed by unmerged X-means, and lastly K-means.

4.4 Concluding remarks

This chapter presented the experiments that evaluate feature selection, traffic application classification, and the separation of malicious botnet IRC packets from normal IRC packets. After analyzing results for each clustering approach, the three approaches were compared and it was concluded that merged X-means was better than the other two methods. The next chapter concludes the thesis and gives possible directions for future work.

Chapter 5

Conclusions and future work

Chapter 1 introduced the thesis and its contributions; background information necessary to understand the proposed framework and related botnet detection techniques was discussed in Chapter 2; the proposed framework was presented in Chapter 3; and the experiments and their results were reported in Chapter 4. This chapter concludes the thesis by summarizing its achievements and limitations in Section 5.1, and points out possible improvements or future work that interested researchers may pursue from this thesis in Section 5.2.

5.1 Conclusion

This thesis found that a subset of the 256 1-gram payload features, made up of 9 Unicode characters, can be used effectively for supervised classification of unencrypted TCP packets into IRC and non-IRC groups. These 9 characters have integer values (i.e., index positions) 0, 10, 47, 60, 61, 177, 179, 216, and 217. Although a support vector machine trained with the SMO algorithm had slightly higher false positive rates than the C4.5 decision tree, both classifiers performed well overall. Experiments demonstrated that classifying with these 9 features greatly reduces the time to build a classifier model while achieving classification accuracy comparable to using the first 128 or all 256 features.
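To make the reduced feature set concrete, the following sketch computes the 9-element 1-gram feature vector for a single packet payload. The function name and the normalization by payload length are illustrative assumptions, not the exact implementation used in the thesis:

```python
from collections import Counter

# Byte values this thesis identified as sufficient for IRC vs. non-IRC
# classification of unencrypted TCP payloads.
SELECTED_BYTES = [0, 10, 47, 60, 61, 177, 179, 216, 217]

def one_gram_features(payload: bytes) -> list:
    """Return the relative frequency of each selected byte value in the
    packet payload (a reduced 1-gram feature vector)."""
    if not payload:
        return [0.0] * len(SELECTED_BYTES)
    counts = Counter(payload)
    total = len(payload)
    return [counts.get(b, 0) / total for b in SELECTED_BYTES]

# Example: newline (10), '/' (47), '<' (60) and '=' (61) occur often
# in IRC protocol messages.
features = one_gram_features(b"PRIVMSG #chan :hello\n")
```

A classifier such as J48 or SMO would then be trained on vectors like `features`, one per packet.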

The proposed framework does not rely on port numbers to classify network traffic, which is desirable since port-based classification is no longer very effective. It depends entirely on packet payload content and therefore, like many other content-based traffic classification systems, it may not be effective when communication is encrypted.

Experiments also validated the assumption that individual botnet IRC packets are less diverse than normal IRC packets; hence botnet IRC packets have a lower standard deviation than normal packets, and botnet IRC clusters are expected to have a lower cluster standard deviation.
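The minimum cluster standard deviation labeling heuristic can be sketched as follows; the helper names are hypothetical, and the thesis applied this idea to clusters produced by K-means and X-means rather than to hand-built lists:

```python
import math

def cluster_std(points):
    """Average per-dimension standard deviation of a cluster of vectors."""
    n, dims = len(points), len(points[0])
    stds = []
    for d in range(dims):
        vals = [p[d] for p in points]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        stds.append(math.sqrt(var))
    return sum(stds) / dims

def label_botnet_cluster(clusters):
    """Return the index of the cluster with minimum standard deviation,
    reflecting the assumption that botnet IRC packets are less diverse."""
    return min(range(len(clusters)), key=lambda i: cluster_std(clusters[i]))

# A tight (botnet-like) cluster vs. a diverse (normal-like) cluster:
tight = [[0.10, 0.20], [0.11, 0.19], [0.10, 0.21]]
diverse = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
botnet_idx = label_botnet_cluster([diverse, tight])
```

Under this heuristic the tighter cluster is labeled as the botnet cluster.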

We found that IRC traffic is dominated by the first 128 Unicode characters; thus, it may not be necessary to use all 256 Unicode characters to separate botnet IRC from normal IRC. Unsupervised clustering algorithms were applied to temporal-frequent characteristics of individual packets (obtained using n-gram analysis with n = 1 and packets grouped into 1-minute time intervals) to separate botnet IRC from normal IRC. The minimum cluster standard deviation technique was used to label the botnet cluster using K-means, which fixes the number of clusters at two, and was compared with unmerged X-means and merged X-means, which find the true number of clusters in the dataset dynamically. Merged X-means performed better than unmerged X-means and K-means because it yielded lower false alarm rates than K-means and more consistently good detection rates than unmerged X-means. Botnet clusters formed by X-means are better representatives of the botnet instances than those formed by K-means, because K-means botnet clusters also contained at least 85.71% of the total normal instances in the training dataset. Results from the traffic classification and IRC botnet detection experiments showed that n-gram analysis with n = 1 performs well, removing the need to experiment with higher values of n.

5.2 Future Work

The following are some points which may be of interest in improving or extending the work done in this thesis:

Implement the framework as a multiple detector: This thesis used K-means clustering with a fixed number of clusters for botnet detection, and compared this approach with two versions of X-means cluster labeling that use a dynamic number of clusters to find which performs better. Instead of comparing and choosing the best of the three, the framework could be implemented as a multiple detector: unseen IRC instances would be passed to all three detectors and a "voting" scheme used to make a final decision on whether a packet is botnet or normal.
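A simple majority-vote combiner of the three detectors might look like the following sketch; the detector ordering and the boolean verdict representation are assumptions for illustration:

```python
def majority_vote(verdicts):
    """Combine botnet/normal verdicts from multiple detectors.

    verdicts: list of booleans (True = botnet), e.g. one each from the
    K-means, unmerged X-means and merged X-means detectors."""
    return sum(verdicts) > len(verdicts) / 2

# A packet flagged by both X-means variants but not K-means is
# classified as botnet; a packet flagged by only one detector is not.
flagged = majority_vote([False, True, True])
cleared = majority_vote([True, False, False])
```

With three detectors, at least two must agree before a packet is declared botnet.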

Extend to find a subset of the 256 payload 1-gram frequency features that can distinguish a larger number of application communities: This thesis used classifying network traffic into IRC and non-IRC classes as an example study; future work may include finding a smaller subset of informative features that can classify traffic into many application communities or into specific applications such as HTTP, IRC, BitTorrent, and eDonkey.

Traffic classification using n-gram analysis with n > 1 values: In addition to the anomaly-based detector using n-grams with n = 1, Wang et al. in [29] also used n > 1 to examine sequences of bytes. Lu et al. in [12] used 1-grams to classify flows into different applications, and this thesis used 1-grams to classify packets into IRC and non-IRC groups. Therefore, it may be interesting to investigate the use of sequences of bytes or characters with higher values of n (e.g. n = 2 or n = 3) for traffic classification.
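Byte-level n-gram extraction for n > 1 can be sketched as below; the function name and normalization are illustrative assumptions. Note that the feature space grows from 256 dimensions at n = 1 to 256**n at higher n, which is why sparse representations become necessary:

```python
from collections import Counter

def ngram_frequencies(payload: bytes, n: int = 2):
    """Relative frequencies of overlapping byte n-grams in a payload,
    returned as a sparse dict mapping each n-gram to its frequency."""
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

# 2-grams of a short IRC-style message; a 13-byte payload yields
# 12 overlapping 2-grams.
freqs = ngram_frequencies(b"NICK bot_42\r\n", n=2)
```

Such sparse vectors could then feed the same feature selection and clustering pipeline used for the 1-gram case.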

Feature selection using n-gram analysis with n > 1 values: Finding a reduced subset of features using 2-grams, 3-grams, etc., and studying the relationships among features in the reduced feature set.

Implementing and evaluating the framework using other feature selection methods, classifiers, unsupervised clustering algorithms, and different time interval values (e.g. 1 second, 5 seconds) to improve the classification accuracy, detection rate and false alarm rate.

Bibliography

[1] Lu, W., and Ghorbani, A. A., "Botnets detection based on IRC-community," Global Telecommunications Conference, New Orleans, LA, pp. 1-5, Nov.-Dec. 2008.

[2] Provos, N., and Holz, T., Virtual Honeypots: From Botnet Tracking to Intrusion Detection, Boston: Addison-Wesley, 2008, pp. 359-390.

[3] Gu, G. F., Zhang, J. J., and Lee, W. K., "BotSniffer: detecting botnet command and control channels in network traffic," Proceedings of the 15th Annual Network and Distributed System Security Symposium, San Diego, CA, Feb. 2008.

[4] http://www.irchelp.org, Accessed: April 2009.

[5] Kugisaki, Y., Kasahara, Y., Hori, Y., and Sakurai, K., "Bot detection based on traffic analysis," The 2007 International Conference on Intelligent Pervasive Computing, Jeju City, pp. 303-306, Oct. 2007.

[6] Rajab, M. A., Zarfoss, J., Monrose, F., and Terzis, A., "A multifaceted approach to understanding the botnet phenomenon," Proceedings of the Sixth ACM SIGCOMM Conference on Internet Measurement, Rio de Janeiro, Brazil, pp. 41-52, 2006.

[7] Choi, H., Lee, H., Lee, H., and Kim, H., "Botnet detection by monitoring group activities in DNS traffic," International Conference on Computer and Information Technology, Aizu-Wakamatsu, Fukushima, pp. 715-720, Oct. 2007.

[8] Thuraisingham, B., "Data mining for security applications: Mining concept-drifting data streams to detect peer to peer botnet traffic," International Conference on Intelligence and Security Informatics, Taipei, pp. xxix-xxx, June 2008.

[9] Holz, T., Steiner, M., Dahl, F., Biersack, E., and Freiling, F., "Measurements and mitigation of peer-to-peer-based botnets: a case study on Storm Worm," Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, San Francisco, CA, 2008.

[10] forums2.symantec.com, Accessed: April 2009.

[11] Strayer, W. T., Walsh, R., Livadas, C., and Lapsley, D., "Detecting botnets with tight command and control," Proceedings of the Thirty-first IEEE Conference on Local Computer Networks, Tampa, FL, pp. 195-202, Nov. 2006.

[12] Lu, W., Tavallaee, M., Rammidi, G., and Ghorbani, A. A., "BotCop: an online botnet traffic classifier," Proceedings of the Seventh Annual Communication Networks and Services Research Conference, Moncton, NB, Canada, pp. 70-77, May 2009.

[13] Wang, W., Fang, B., Zhang, Z., and Li, C., "A novel approach to detect IRC-based botnets," International Conference on Networks Security, Wireless Communications and Trusted Computing, Wuhan, Hubei, pp. 408-411, April 2009.

[14] Goebel, J., and Holz, T., "Rishi: identify bot contaminated hosts by IRC nickname evaluation," Proceedings of the First Workshop on Hot Topics in Understanding Botnets, Cambridge, MA, 2007.

[15] Sroufe, P., Phithakkitnukoon, S., Dantu, R., and Cangussu, J., "Email shape analysis for spam botnet detection," Sixth IEEE Consumer Communications and Networking Conference, Las Vegas, NV, pp. 1-2, Jan. 2009.

[16] Brodsky, A., and Brodsky, D., "A distributed content independent method for spam detection," Proceedings of the First Workshop on Hot Topics in Understanding Botnets, Cambridge, MA, 2007.

[17] Zhao, Y., Xie, Y., Yu, F., Ke, Q., Yu, Y., Chen, Y., and Gillum, E., "BotGraph: large scale spamming botnet detection," Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, Boston, MA, pp. 321-334, 2009.

[18] Binkley, J. R., and Singh, S., "An algorithm for anomaly-based botnet detection," USENIX SRUTI: 2nd Workshop on Steps to Reducing Unwanted Traffic on the Internet, San Jose, CA, pp. 43-48, 2006.

[19] Masud, M. M., Al-khateeb, T., Khan, L., Thuraisingham, B., and Hamlen, K. W., "Flow-based identification of botnet traffic by mining multiple log files," First International Conference on Distributed Framework and Applications, Penang, pp. 200-206, Oct. 2008.

[20] Villamarín-Salomón, R., and Brustoloni, J. C., "Bayesian bot detection based on DNS traffic similarity," Proceedings of the 2009 ACM Symposium on Applied Computing, Honolulu, Hawaii, pp. 2035-2041, 2009.

[21] Gu, G. F., Perdisci, R., Zhang, J. J., and Lee, W. K., "BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection," Proceedings of the 17th USENIX Security Symposium, San Jose, CA, pp. 139-154, July-Aug. 2008.

[22] Villamarín-Salomón, R., and Brustoloni, J. C., "Identifying botnets using anomaly detection techniques applied to DNS traffic," Fifth IEEE Consumer Communications and Networking Conference, Las Vegas, NV, pp. 476-481, Jan. 2008.

[23] Xu, R., and Wunsch II, D. C., Clustering, Hoboken: John Wiley & Sons, 2009.

[24] Sander, J., Ester, M., Kriegel, H., and Xu, X., "A density-based algorithm for discovering clusters in large spatial databases with noise," Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

[25] Hamerly, G., and Elkan, C., "Alternatives to the k-means algorithm that find better clusterings," Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, Virginia, pp. 600-607, 2002.

[26] Guan, Y., Ghorbani, A. A., and Belacel, N., "Y-means: a clustering method for intrusion detection," Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 1083-1086, May 2003.

[27] Pelleg, D., and Moore, A. W., "X-means: extending k-means with efficient estimation of the number of clusters," Proceedings of the Seventeenth International Conference on Machine Learning, pp. 727-734, June-July 2000.

[28] Erman, J., Arlitt, M., and Mahanti, A., "Traffic classification using clustering algorithms," Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, Pisa, Italy, pp. 281-286, Sept. 2006.

[29] Wang, K., and Stolfo, S. J., "Anomalous payload-based network intrusion detection," Symposium on Recent Advances in Intrusion Detection, Sophia Antipolis, France, 2004.

[30] Mitchell, T. M., Machine Learning, Boston: McGraw-Hill, 1997, pp. 52-76.

[31] http://netresearch.ics.uci.edu/kfujii/jpcap/doc/index.html, Accessed: November 2008.

[32] http://www.wireshark.org, Accessed: January 2009.

[33] http://www.honeynet.org, Accessed: February 2009.

[34] http://www.cs.waikato.ac.nz/ml/weka, Accessed: February 2009.

Appendix A: Gain Ratio values for bottom 15 selected features

a) Features selected from the first training dataset, traindata1.

Attribute     Gain ratio
Unicode60     0.609
Unicode34     0.595
Unicode38     0.581
Unicode44     0.578
Unicode95     0.574
Unicode216    0.574
Unicode9      0.573
Unicode63     0.563
Unicode75     0.557
Unicode39     0.554
Unicode1      0.552
Unicode88     0.55
Unicode177    0.547
Unicode217    0.547
Unicode179    0.542

b) Features selected from second training dataset, traindata2.

Attribute     Gain ratio
Unicode218    0.371
Unicode177    0.363
Unicode217    0.361
Unicode170    0.36
Unicode2      0.36
Unicode5      0.36
Unicode171    0.359
Unicode219    0.358
Unicode19     0.357
Unicode183    0.357
Unicode175    0.356
Unicode167    0.355
Unicode179    0.354
Unicode172    0.351
Unicode185    0.35

Appendix B: Time results for comparing 9, 128 and 256 features

a) Raw and final average values comparing the time it took to build a J48 classification model for each number of features. Outliers (shown in bold in the original) were discarded before calculating the average time, which is shown in the final column.

Number of   Training   Time1   Time2   Time3   Time4   Time5   Final average
features    dataset    (s)     (s)     (s)     (s)     (s)     (outliers removed)
9           Train1     0.26    0.25    0.28    0.24    0.27    0.26
9           Train2     0.35    0.32    0.32    0.32    0.30    0.32
128         Train1     4.54    4.43    4.46    4.42    4.35    4.44
128         Train2     5.37    5.47    5.73    5.53    5.34    5.46
256         Train1     9.14    8.90    8.91    8.80    8.87    8.89
256         Train2     10.23   10.36   10.26   10.08   10.06   10.19

b) The table below shows false positive rates obtained using the J48 classifier.

Number of   Traindata1:   Traindata1:   Traindata2:   Traindata2:
features    Testdata1     Testdata2     Testdata1     Testdata2
9           0.015         0.039         0.006         0.006
128         0.003         0.006         0.005         0.003
256         0.003         0.006         0.003

c) The table below shows false positive rates obtained using the SVM classifier.

Number of   Traindata1:   Traindata1:   Traindata2:   Traindata2:
features    Testdata1     Testdata2     Testdata1     Testdata2
9           0.208         0.381         0.183         0.319
128         0.013         0.027         0.084         0.117

Curriculum Vitae

Goaletsa Rammidi

Universities attended:

• University of Botswana, Bachelor of Science (Computer Science), 2002-2006

• University of New Brunswick, Diploma in University Teaching, 2008-2009

• University of New Brunswick, 2007-2009

Publications:

• Wei Lu, Goaletsa Rammidi and Ali A. Ghorbani, "Clustering Botnet Communication Traffic Based on N-gram Feature Selection," submitted to the Special Issue on Information and Future Communication Security, Computer Communications, Elsevier, 2009.

• Wei Lu, Mahbod Tavallaee, Goaletsa Rammidi and Ali A. Ghorbani, "BotCop: an online botnet traffic classifier," Proceedings of the Seventh Annual Communication Networks and Services Research Conference, Moncton, NB, Canada, pp. 70-77, May 2009.