SELECTING PAYLOAD FEATURES USING N-GRAM ANALYSIS TO CHARACTERIZE IRC TRAFFIC AND MODEL BEHAVIOUR OF IRC-BASED BOTNETS

by

Goaletsa Rammidi
BSc (Computer Science), University of Botswana, 2006

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Computer Science in the Graduate Academic Unit of Faculty of Computer Science

Supervisor: Dr. Ali A. Ghorbani, PhD, Faculty of Computer Science
Examining Board: Professor John DeDourek, Computer Science, Chair
Dr. Harold Boley, Adjunct Professor, Computer Science
Dr. Donglei Du, Faculty of Business Administration, UNB

This thesis is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK
September, 2009
© Goaletsa Rammidi, 2010

Library and Archives Canada, Published Heritage Branch, 395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-82638-6

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats. The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

DEDICATION

To my mother Khumoetsile Rammidi, and in loving memory of my father Maranyane Rammidi. Being your daughter is a blessing.

ABSTRACT

A botnet is a network of compromised computers remotely controlled by an attacker. Different feature selection methods are applied to find a lower-dimensional subset of Unicode characters as payload features, using n-gram analysis with n=1, to classify TCP packets into IRC and non-IRC application groups. The identified IRC packets are grouped into 1-minute intervals to create a temporal-frequent distribution, and unsupervised clustering is then applied to separate botnet IRC from normal IRC. The botnet cluster is labeled as the one with the minimum cluster standard deviation. We found a subset of 9 features that separates IRC packets from non-IRC packets in less time than, and with accuracy comparable to, using all 256 features. We also found that IRC traffic is dominated by the first 128 Unicode characters; therefore, using all 256 characters may not be necessary. Clustering packets into IRC and non-IRC using merged X-means had lower false alarm rates than K-means and consistently higher detection rates than unmerged X-means.

ACKNOWLEDGEMENTS

I would like to thank Dr Ali A. Ghorbani for his supervision throughout this thesis.
I am also thankful to Dr Wei Lu for his professional guidance and support during my research. I am grateful to my friend Nyaladzani Jairo Nkhwanana for his valuable comments and for being supportive in writing this thesis, and to my family for their love and support throughout my studies. I also extend my gratitude to Botswana International University of Science and Technology for funding me to pursue this degree.

Table of Contents

DEDICATION
ABSTRACT
ACKNOWLEDGEMENTS
Table of Contents
List of Tables
List of Figures
List of Symbols, Nomenclature or Abbreviations
1 Introduction
  1.1 Introduction
  1.2 Summary of thesis contributions
  1.3 Thesis organization
2 Background information and literature review
  2.1 What is a bot and a botnet?
  2.2 Botnet communication control topologies
    2.2.1 Centralized C&C
    2.2.2 Peer to Peer (P2P) botnets
  2.3 Literature on botnet detection techniques
  2.4 Clustering
    2.4.1 Hierarchical clustering
    2.4.2 Density-based clustering
    2.4.3 Partition-based methods
    2.4.4 Related research using clustering algorithms in intrusion detection
  2.5 N-gram and its application to intrusion detection
  2.6 Concluding remarks
3 Proposed framework
  3.1 Overview of proposed framework
  3.2 Traffic classification
    3.2.1 The rationale
    3.2.2 Obtaining 1-gram frequent payload features
    3.2.3 Feature selection process
    3.2.4 Traffic classification using C4.5 algorithm
  3.3 Botnet detection
    3.3.1 The rationale
    3.3.2 Obtaining temporal-frequent payload features
    3.3.3 Standard deviation metric for cluster labeling
    3.3.4 K-means detection
    3.3.5 Comparison to X-means detection approach
  3.4 Concluding remarks
4 Experiments and Results
  4.1 Feature selection and application classification
    4.1.1 Datasets
    4.1.2 Metrics
    4.1.3 Experimental procedure
    4.1.4 Subset evaluation results
    4.1.5 Gain ratio results
    4.1.6 Classification results for final subset of 9 selected features
    4.1.7 Comparing 9, 128 and 256 features
  4.2 Validating standard deviation metric
    4.2.1 Validation on frequency vectors
    4.2.2 Validation on temporal-frequency vectors
  4.3 Botnet detection
    4.3.1 Description of clustering datasets
    4.3.2 Metrics to evaluate clustering performance
    4.3.3 K-means detection
    4.3.4 Unmerged X-means detection
    4.3.5 Merged X-means detection
    4.3.6 Comparison of the three detection approaches
  4.4 Concluding remarks
5 Conclusions and future works
  5.1 Conclusion
  5.2 Future Work
Bibliography
Appendix A: Gain Ratio values for bottom 15 selected features
Appendix B: Time results for comparing 9, 128 and 256 features
Curriculum Vitae

List of Tables

4.1 - Description of the datasets
4.2 - CfsSubsetEval classification results on traindata1
4.3 - CfsSubsetEval classification results on traindata2
4.4 - Top 5 gain ratio selected features
4.5 - GainRatioAttributeEval classification results on traindata1
4.6 - GainRatioAttributeEval classification results on traindata2
4.7 - Unicode index values and the corresponding Java printed characters
4.8 - Classification results using the final selected subset of 9 features
4.9 - Example to approximate time for 1 feature
4.10 - Compare classification accuracy for 9, 128 and 256 features
4.11 - Line of best fit constants for normal versus botnet IRC datasets
4.12 - Validation using 128 versus 256 temporal-frequent features
4.13 - IRC botnet detection datasets
4.14 - Clustering data models
4.15 - K-means cluster statistics
4.16 - K-means performance results
4.17 - X-means original number of clusters
4.18 - Unmerged X-means cluster statistics
4.19 - Unmerged X-means performance results
4.20 - Merged X-means cluster statistics
4.21 - Merged X-means performance results

List of Figures

2.1 - An example of n-gram sliding window
3.1 - Framework of proposed solution
3.2 - Average character frequencies for IRC packets versus non-IRC packets
3.3 - Algorithm to extract 1-gram frequency features from a packet payload
3.4 - Algorithm for ID3 decision tree for boolean-valued functions
3.5 - Illustration of dimension reduction in training and testing dataset
3.6 - Illustration of C4.5 decision tree using 1-gram frequency features
3.7 - High level view of clustering IRC and non-IRC traffic
3.8 - Algorithm for K-means clustering
3.9 - Algorithm for IRC botnet detection using K-means
3.10 - Algorithm for IRC botnet detection using unmerged X-means
3.11 - Algorithm for IRC botnet detection using merged X-means
4.1 - J48 versus SVM false positive rates
4.2 - Comparing times for 9, 128 and 256 features on J48
4.3 - Compare FPR for 9, 128 and 256 features
4.4 - Normalized standard deviation graph for normal versus botnet IRC packets
4.5 - Normalized standard deviation graphs on scale [-0.6, 0.6]
4.6 - Line of best fit graphs for normal versus botnet IRC datasets

List of Symbols, Nomenclature or Abbreviations

μj - Average frequency of the Unicode character with integer value j (it is also at index position j).
Fj - Frequency of the Unicode character at the jth index position.
σj - Standard deviation of the Unicode character with integer value j (it is also at index position j). Can also refer to the standard deviation of the cluster labeled j.
pi - Reduced average frequency of the Unicode character at the jth position on the ith minute interval.
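
As a rough illustration of two ideas named in the abstract and in the list of symbols, the sketch below computes 1-gram (n=1) frequency features for a packet payload and then picks the cluster with the minimum standard deviation as the candidate botnet cluster. It is a minimal sketch under stated assumptions, not the thesis's implementation: payload characters are treated as byte values 0 to 255, frequencies are relative rather than raw counts, and the cluster standard deviation is taken as the mean of the per-feature population standard deviations. The function names are hypothetical.

```python
from collections import Counter
from statistics import pstdev
from typing import List, Sequence


def one_gram_frequencies(payload: bytes) -> List[float]:
    """Return 256 relative frequencies, one per byte value (1-gram, n=1).

    Assumption: characters are bytes and frequencies are normalized by
    payload length; the thesis may define them differently.
    """
    total = len(payload)
    if total == 0:
        return [0.0] * 256
    counts = Counter(payload)  # iterating bytes yields ints in 0..255
    return [counts.get(j, 0) / total for j in range(256)]


def cluster_std(cluster: Sequence[Sequence[float]]) -> float:
    """Mean of per-feature population standard deviations (assumed metric).

    The cluster must be non-empty and all vectors must share one length.
    """
    dims = len(cluster[0])
    per_feature = [pstdev([vec[j] for vec in cluster]) for j in range(dims)]
    return sum(per_feature) / dims


def label_botnet_cluster(clusters: Sequence[Sequence[Sequence[float]]]) -> int:
    """Return the index of the cluster with the minimum standard deviation."""
    return min(range(len(clusters)), key=lambda k: cluster_std(clusters[k]))


if __name__ == "__main__":
    features = one_gram_frequencies(b"PING :irc\r\n")
    print(len(features), features[ord("P")])

    loose = [[0.0, 1.0], [1.0, 0.0]]   # widely varying vectors
    tight = [[0.5, 0.5], [0.5, 0.5]]   # near-identical vectors
    print(label_botnet_cluster([loose, tight]))
```

The minimum-deviation rule matches the abstract's statement that the botnet cluster is the one with minimum cluster standard deviation; how clusters are produced (K-means or X-means) is left outside this sketch.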