Counter Intrusion Software: Malware Detection Using Structural and Behavioural Features and Machine Learning
Total Page:16
File Type:pdf, Size:1020Kb
Counter Intrusion Software Malware Detection using Structural and Behavioural Features and Machine Learning by Joseph Rabaiotti School of Computer Science Cardiff University A thesis submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy Version: August 4, 2007 1 Abstract Over the past twenty-five years malicious software has evolved from a minor annoyance to a major security threat. Authors of malicious software are now more likely to be organised criminals than bored teenagers, and modern malicious software is more likely to be aimed at stealing data (and hence money) than trashing data. The arms race between malware authors and manufacturers of anti-malware software continues apace, but despite this, the majority of anti-malware solutions still rely on relatively old technology such as signature scanning, which works well enough in the majority of cases but which has long been known to be ineffective if signatures are not updated regularly. The need for regular updating means there is often a critical window between the publication of a flaw exploitable by malware and the distribution of the appropriate counter measures or signature. At this point a user system is open to attack by hitherto unseen malware. The object of this thesis is to determine if it is practical to use machine learning techniques to abstract generic structural or behavioural features of malware which can then be used to recognise hitherto unseen examples. Although a sizeable amount of research has been done on various ways in which malware detection might be automated, most of the proposed methods are burdened by excessive complexity. This thesis looks specifically at the possibility of using learning systems to classify software as malicious or nonmalicious based on easily-collectable structural or behavioural data. On the basis of the experimental results presented herein it may be concluded that classification based on such structural data is certainly possible, and on behavioural data is at least feasible. Malware Detection using Structural and Behavioural Features and Machine Learning Joseph Rabaiotti 2 Acknowledgements Personal acknowledgements I would like to thank my supervisor Professor Antonia Jones for her support and encouragement during the past three years. I must also thank my friend Stuart Goring, without whose keen observations Chapter 3 would not have come into being, my parents Robert and Carole for supporting me financially and providing me with accommodation for the first two and a half years, and Mark, for supporting and putting up with me for the remainder of the time and without whose love and encouragement this thesis could never have been written. Technical acknowledgements I must also acknowledge the significant contribution to this work made by the authors of the website which hosted the virus/malware library, without which this work could not have proceeded. I hope they will forgive me for not naming them explicitly, but they probably know who they are. Work of this type is not without risk and approaches a grey area of UK Law. Everyone involved has taken great pains to remain within the law as we understand it. Con- siderable thought and discussion occurred within the School and University, before agreements were reached which ensured that the work could be undertaken safely, i.e. that no harm would be caused by the study. A number of precautions were put in place in order to isolate, and limit access to, the computers used in this work. There- fore, thanks are also due to my Supervisor Professor Antonia J. Jones, the Head of the Cardiff School of Computer Science Professor Nick Fiddian, Robert Evans of the School's Computer Support Group, to the University level Computer Support team in INSRV, and to the Corporate Compliance and HR sections in the University,1 who collectively, and somewhat courageously, trusted me sufficiently to allow the work to go forward. 1http://www.cf.ac.uk/az/divisions/index.html Malware Detection using Structural and Behavioural Features and Machine Learning Joseph Rabaiotti 3 Contents Contents 3 List of Figures 9 List of Tables 11 1 Introduction 14 1.1 What Is Malicious Software? . 14 1.2 The Connectivity Issue . 15 1.3 Combat Approaches . 19 1.4 Scope and Aim . 20 1.4.1 Hypotheses . 20 1.4.2 Aims . 20 1.4.3 Principal Contributions . 21 2 Historical Background and Literature Review 22 2.1 Types of Malware . 22 2.1.1 Viruses . 23 2.1.2 Worms . 27 2.1.3 Other Email Tricks . 28 2.1.4 Trojan Horses . 29 2.1.5 Backdoors . 29 2.1.6 Spyware and Adware . 30 2.1.7 Botnets and DDoS Attacks . 32 2.1.8 Rootkits . 35 2.1.9 Malware Payloads . 35 Malware Detection using Structural and Behavioural Features and Machine Learning Joseph Rabaiotti CONTENTS 4 2.2 Vulnerabilities exploited by malware . 36 2.2.1 Code Injection . 36 2.2.2 Buffer Overflow . 36 2.3 Malware Naming Conventions . 37 2.4 Detection and Prevention Techniques . 38 2.4.1 Heuristic Scanners . 40 2.4.2 Structure-Based Detection . 41 2.4.3 Activity Monitors . 41 2.4.4 Behaviour-Based Anomaly Detection . 42 2.5 Detection Countermeasures . 42 2.5.1 Heuristic Frustration: Concealment . 43 2.5.2 Polymorphism and Metamorphism . 43 2.5.3 Encryption and Packing . 43 2.5.4 Retroviruses . 44 2.6 Malware and the Law . 44 2.7 Literature Review . 45 2.7.1 Introduction . 45 2.7.2 Early Research and Theory . 46 2.7.3 Practical Detection Systems . 48 2.7.4 Static and Semantic Analysis . 49 2.8 Tools and Sources of Information . 54 3 The law of unintended consequences: a case study 55 3.1 Introduction . 55 3.2 Disclaimer . 56 3.3 Unintended consequences . 56 3.4 The login procedure . 57 3.5 The Vulnerability . 58 3.6 Attack Methodology . 59 3.7 Probabilistic Analysis . 61 3.8 Conclusions . 63 Malware Detection using Structural and Behavioural Features and Machine Learning Joseph Rabaiotti CONTENTS 5 4 Machine learning classification methods and systems 66 4.1 Introduction . 66 4.2 Classification methods . 66 4.2.1 Decision Trees . 67 4.2.2 Naive Bayes . 68 4.2.3 Artificial Neural Networks . 69 4.3 Classifier Implementations: Machine Learning Systems . 70 4.3.1 C4.5 . 71 4.3.2 WEKA . 73 4.4 Conclusion . 74 5 Gathering structural and behavioural data in Windows 76 5.1 Introduction . 76 5.2 Structural Data . 77 5.2.1 What sort of data to use? . 77 5.2.2 The Portable Executable (PE) file format . 77 5.2.3 The PE Header . 79 5.2.4 Sections and the Section Table . 79 5.2.5 Viruses and the PE format . 80 5.2.6 Collecting Structural Data . 87 5.3 Behavioural Monitoring . 88 5.3.1 The collection of real-time process data . 88 5.3.2 Behavioural Data Gathering Methods . 97 5.3.3 Two Means of Monitoring . 99 5.3.4 Windows Performance Data . 100 5.3.5 Lower-Level Methods . 102 5.3.6 Collecting Other Relevant Behavioural Data . 103 5.3.7 Behavioural Monitoring: Conclusion . 104 6 Structural identification experiments 106 6.1 Introduction . 106 6.2 Methodology . 106 6.2.1 The data-gathering program . 107 6.2.2 The source datasets . 109 Malware Detection using Structural and Behavioural Features and Machine Learning Joseph Rabaiotti CONTENTS 6 6.2.3 Creating experimental datasets . 110 6.2.4 Splitting into Training and Test sets . 110 6.3 Results . 110 6.3.1 The kappa statistic κ . 111 6.3.2 Baseline . 111 6.3.3 Learning curves . 111 6.3.4 Experiment One . 113 6.3.5 Experiment Two . 114 6.3.6 Experiment Three . 115 6.3.7 Experiment Four . 116 6.3.8 Experiment Five . 117 6.4 Analysis . 118 6.4.1 General Trends . 118 6.4.2 Specifics . 119 6.5 Conclusion . 120 7 Behavioural identification experiments 121 7.1 Introduction . 121 7.2 Methodology . 121 7.3 The Data-Gathering Program . 122 7.3.1 Real-time Data . 123 7.3.2 Choice of Malicious Examples . 123 7.3.3 Final Datasets . ..