Effective Malicious Features Extraction and Classification for Incident Handling Systems
Cho Cho San
University of Computer Studies, Yangon

A thesis submitted to the University of Computer Studies, Yangon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

October, 2019

Statement of Originality

I hereby certify that the work embodied in this thesis is the result of original research and has not been submitted for a higher degree to any other University or Institution.

……………………………          ……………………………
Date                         Cho Cho San

ACKNOWLEDGEMENTS

First of all, I would like to thank His Excellency, the Minister for the Ministry of Education, for providing full facilities and support during the Ph.D. course at the University of Computer Studies, Yangon.

Secondly, my profound gratitude goes to Dr. Mie Mie Thet Thwin, Rector of the University of Computer Studies, Yangon, for allowing me to develop this research and giving me general guidance during the period of my study.

I would like to express my deepest appreciation to my supervisor, Dr. Mie Mie Su Thwin, Professor, the University of Computer Studies, Yangon, for her excellent guidance, care, patient supervision, and excellent ideas throughout the study of this thesis.

I would also like to extend my special appreciation to Dr. Khine Moe Nwe, Professor and Course-coordinator of the Ph.D. 9th Batch, the University of Computer Studies, Yangon, for her useful comments, advice, and insight, which were invaluable throughout the process of researching and writing this dissertation.

I would like to express my special appreciation and thanks to my external examiner, Dr. Thandar Phyu, Director of Technology Group, ATG Company Ltd., for her useful comments and suggestions.
I would also like to express my respectful gratitude to Daw Aye Aye Khine, Associate Professor, Head of English Department, for her valuable support from the language point of view and for pointing out correct usage, not only throughout the Ph.D. course work but also in my dissertation.

My sincere thanks also go to all my respected Professors and teachers for giving me valuable lectures and knowledge during the Ph.D. course work and dissertation. I also thank my respected Professor Dr. Abhishek Vaish for his comments, advice, and insight, which have been invaluable to me. Moreover, I thank my friends from the Ph.D. 9th Batch for providing support, care, co-operation and encouragement along the way. Special thanks go to all the past and current members of the Cyber Security Research Laboratory.

Last but not least, I am very much indebted to my parents, U Than Kyaw and Daw Yee Shwe, my baby sister, aunts and uncles for always believing in me; for providing me with unfailing support and continuous encouragement; and for their endless love and support during these years of my Ph.D. study and through the process of researching and writing this dissertation. This accomplishment would not have been possible without them.

ABSTRACT

Every day, malicious software writers continue to create new variants, new innovations, new infections, and more heavily obfuscated malware by using packing and encryption techniques. Malicious software classification and detection play an important role in cyber security research and remain a major challenge for it. Because of the increasing rate of false alarms, accurate classification and detection of malware is a pressing problem. This research provides a classification system to differentiate malware from benign software and to classify malware types.
This research contributes the Malicious Sample Names Extraction (MSNE) procedure and the Naming Malicious Samples using Regular Expression (NMS_RE) technique to label the malicious samples. It also contributes the Malware Feature Extraction Algorithm (MFEA), which identifies the dominant features in the generated report files. The features are the API, DLL, and PROCESS items called by malicious and benign executables, obtained through automated analysis. During the experiments, the extracted raw data was cleansed, the n-gram technique was applied, and the malicious dataset was represented and prepared for the malware classification system.

This research work uses two datasets for malware classification. The Benign Malware Classification (BMC) dataset is used for binary classification, to identify whether a sample is malicious or not, and the Benign Malware Family Classification (BMFC) dataset is used for multi-class classification, to identify the malware family. Chi-Square and Principal Component Analysis (PCA) feature selection methods have been applied in this system to select the best features. Classification algorithms such as k-Nearest Neighbor (kNN), Random Forest (RF) and Support Vector Classification (SVC) have been used for both multi-class and binary classification. The proposed approach is able to classify malicious and benign executable files effectively. This research work provides malware classification using Machine Learning (ML) classifiers. The experimental findings show that the extracted API_DLL features give the best results in terms of accuracy, confusion matrix (CM), True Positive Rate (TPR), False Positive Rate (FPR), and area under the Receiver Operating Characteristic (ROC) curve.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS

1. INTRODUCTION
1.1 The Importance of Malicious Software Analysis on Cyber Security
1.2 Motivation of the Research
1.3 Problem Statements and Solutions
1.4 Objectives of the Research
1.5 Contributions of the Research
1.6 Organization of the Research

2. LITERATURE REVIEW AND RELATED WORKS
2.1 Evolution of Malware
2.1.1 Types of Malware
2.1.1.1 Virus
2.1.1.2 Trojan Horses
2.1.1.3 Worms
2.1.1.4 Backdoor
2.1.1.5 Rootkits
2.1.1.6 Adware
2.1.1.7 Spyware
2.1.1.8 Phishing
2.2 Malicious Features for Classification and Detection
2.2.1 Extracted Features through Static Analysis
2.2.2 Extracted Features through Dynamic Analysis
2.3 Feature Selection Methods for Malware Classification and Detection
2.4 Malicious Classification and Detection in Machine Learning
2.5 Summary

3. THEORETICAL BACKGROUND
3.1 Malware Analysis Techniques
3.1.1 Static Malware Analysis Tools and Techniques
3.1.1.1 Advantages and Disadvantages of Static Analysis
3.1.2 Dynamic Malware Analysis Tools and Techniques
3.1.2.1 Advantages and Disadvantages of Dynamic Analysis
3.2 Malware Feature Extraction and Selection Techniques
3.2.1 Static Based Features
3.2.2 Dynamic Based Features
3.2.3 Feature Selection Techniques
3.2.3.1 Chi-Square (χ²)
3.2.3.2 Principal Component Analysis (PCA)
3.3 ML Techniques for Malicious Family Classification