Early Stage Malware Classification Using Behavior Analysis

Early Stage Malware Classification using Behavior Analysis A thesis submitted in partial fulfillment of the requirements for the degree of Master of Technology by Mugdha Gupta Department of Computer Science And Engineering INDIAN INSTITUTE OF TECHNOLOGY KANPUR June 2018 Abstract Name of the student: Mugdha Gupta Roll No: 16111041 Degree for which submitted: M.Tech. Department: Computer Science and Engineering Thesis title: Early Stage Malware Classification using Behavior Analysis Thesis supervisor: Dr. Sandeep Shukla Month and year of thesis submission: June 2018 In the recent years, there has been an exponential growth in the number of malware captured and analyzed by the antivirus companies. However, much of these malwares are variants of al- ready known malware. Thus, it has become necessary to determine whether a malware belongs to a known family, or exhibits a new behavior hitherto unseen, and requires further analysis. Existing traditional approaches used by antivirus companies are based on signature-based detection and can be thwarted in case of zero-day exploit-based malware. Manual examination of such executables is extremely cumbersome due to the enormous number of such cases. Also, it has become necessary to speed up the detection process and predict before the executable releases its malicious payload. In this work, we addressed all the above issues using automated yet efficient malware analysis. We classified the malicious executables into different malware classes in the earliest possible time using dynamic analysis. Dynamic analysis provides useful insights in the case of obfuscated or packed malware where static analysis fails. Our experi- ments achieve an accuracy of 98.02% for classifying malware into classes in the initial 4 seconds of its execution using XGBoost. We also classified samples which were not seen by the classi- fier before, thus attempted to classify zero-day malware. Our solution is robust and scalable as we have increased the number of samples used during analysis compared to prior work and reduced the execution time drastically. Our solution is also efficient since the state of the art accuracy for early stage malware detection is 91% for the first 4 seconds of execution and 96% for the first 19 seconds using recurrent neural networks. Acknowledgements I would express my profound gratitude to Dr. Sandeep Shukla for guiding me in this project. I would also like to thank Pranjul Ahuja, Bhaskar Mukhoty and Rohit Singh Kharanghar for their help and support whenever I needed. I am grateful to my parents and my siblings for the immense love they have given me. I am thankful to the Virustotal community for being so generous by providing me the access to their private API. I would also like to take this opportunity to thank CDAC Mohali for their help in building the dataset and TCG Digital for their support in creating the virtual network. iv Contents Abstract iii Acknowledgements iv Contents v List of Figures vii List of Tables viii 1 Introduction 1 1.1 Need for Malware Classification..............................1 2 Background 3 2.1 Malware and its classes...................................3 2.2 Malware Nomenclature...................................5 2.3 Available Defenses......................................6 2.4 Malware Analysis techniques................................6 2.4.1 Static Analysis.....................................7 2.4.1.1 Limitations of Static Analysis.......................7 2.4.2 Dynamic Analysis...................................8 3 Past Work 10 3.1 Static analysis based feature extraction.......................... 10 3.2 Dynamic analysis based feature extraction........................ 11 3.3 Time efficient detection................................... 13 3.4 Goals of this thesis...................................... 13 4 Machine Learning Background 15 4.1 Classifiers........................................... 15 4.2 Handling Imbalanced Data................................. 16 4.3 Cross Validation........................................ 18 4.4 Evaluation Metrics...................................... 19 4.4.1 Confusion Matrix................................... 19 v Contents vi 5 Classification of Existing malware 21 5.1 Architecture of classification system............................ 21 5.1.1 Dataset collection, Generation and Labeling................... 21 5.1.1.1 Dataset collection............................. 21 5.1.1.2 Dataset generation............................. 22 5.1.1.3 Labeling................................... 23 5.1.2 Feature Extraction.................................. 24 5.1.2.1 Network related features......................... 25 5.1.2.2 Process related features.......................... 30 5.1.2.3 API bins................................... 30 5.1.2.4 Signatures.................................. 33 5.1.3 Training and Testing................................. 38 5.1.4 Comparison to Existing Approaches........................ 40 6 Classification of Zero Day malwares 42 6.1 Architecture.......................................... 42 6.1.1 Dataset Collection and Generation........................ 42 6.1.2 Feature Extraction.................................. 45 6.1.3 Handling Imbalanced Data............................. 45 6.1.4 Training and Testing................................. 47 7 Scope And Future Work 49 7.1 Building a Hierarchical model............................... 49 7.2 Sliding window based approach for classification.................... 49 7.3 Building a robust classification system.......................... 49 A Appendix A 50 Bibliography 51 List of Figures 1.1 Growth of the the malware over years [4].........................2 2.1 Naming convention used by Microsoft [18]........................5 4.1 Neural network with one hidden layer [20]........................ 16 4.2 SMOTE oversampling technique [3]............................ 17 4.3 Tomek undersampling technique [3]............................ 17 4.4 K-Fold Cross Validation [2]................................. 18 4.5 Confusion Matrix[1]..................................... 19 5.1 Architecture of our classification system......................... 22 5.2 Cuckoo Architecture [6]................................... 23 5.3 Protocol Hierarchy of a malware using Wireshark.................... 26 5.4 LLMNR Poisoning [16].................................... 27 5.5 TLS Connections....................................... 28 5.6 HTTP Requests by Sventore.A malware.......................... 28 5.7 Frequency of API Calls in each bin............................. 31 5.8 Shortcuts created by worm family Yuner......................... 33 5.9 Registry Keys modified by Backdoor Agent malware to install itself at startup.... 34 5.10 Polymorphic nature exhibited by malware Yuner.................... 36 5.11 Exception raised by malware Renos............................ 37 5.12 Confusion Matrix - XGBoost................................. 40 6.1 Architecture of our classification system......................... 43 6.2 tSNE - Test Set......................................... 44 6.3 Imbalanced Virus families.................................. 45 6.4 Imbalanced Trojan families................................. 46 6.5 Confusion Matrix - XGBoost................................. 48 vii List of Tables 3.1 Summary - Dynamic analysis based feature extraction................. 12 5.1 Dataset............................................. 24 5.2 Testing accuracy - Simple Neural Network, for various optimizers and loss functions 39 5.3 Test Results for all classifiers................................ 39 5.4 Comparison to previous approaches............................ 41 6.1 Number of samples in Training and Testing Set..................... 44 6.2 Families in Training and Testing Set............................ 44 6.3 Number of samples in Training Set after SMOTE..................... 46 6.4 Accuracy for each type with corresponding FPR..................... 47 viii Dedicated to my parents ix Chapter 1 Introduction 1.1 Need for Malware Classification The rise of the Internet has readily affected our day to day life, turning it upside down. From buying products, doing online banking to using it for entertainment purposes or social net- working, it has made our lives a lot easier. With the ease of information flow, every other or- ganization is now getting connected to the Internet and going transparent with their opera- tions and resources. But as the Internet economy has grown, more serious cyber crimes have evolved. Almost every device like mobile phones, laptops to large systems such as power grid and nuclear plants are subjected to cyber attacks. Among serious cyber threats, there is malware which evolves daily and has the capacity to disrupt every other sector without any fail. According to the reports published by AV-Test institute [4], there has been tremendous growth in the number of malicious samples as shown in figure 1.1, registering over 250,000 new malicious samples every day. Analyzing these samples manually using reverse engineering and dis- assembly is a tedious and cumbersome task, therefore not appreciated enough by the security analysts. Thus there is a dire need for automated malware analysis systems which produces efficient results with minimal human intervention. Antivirus systems use the most common and primitive approach which involves generation of signatures of known malware beforehand and then comparing newly downloaded executables against these signatures to predict its nature. This technique drastically fails in case of any

Load more