CLASSIFICATION OF MALWARE USING REVERSE ENGINEERING AND DATA MINING TECHNIQUES

A Thesis Presented to The Graduate Faculty of The University of Akron in Partial Fulfillment of the Requirements for the Degree Master of Science

Ravindar Reddy Ravula

August, 2011

Thesis Approved:
Advisor: Dr. Kathy J. Liszka
Committee Member: Dr. Chien-Chung Chan
Committee Member: Dr. Zhong-Hui Duan

Accepted:
Department Chair: Dr. Chien-Chung Chan
Dean of the College: Dr. Chand K. Midha
Dean of the Graduate School: Dr. George R. Newkome

ABSTRACT

Detecting new and unknown malware is a major challenge in today's software security profession. Many approaches to malware detection using data mining techniques have already been proposed, and the majority of them used static features of malware. However, static detection methods fall short of detecting present-day complex malware. Although some researchers have proposed dynamic detection methods, those methods did not use all of the malware features. In this work, an approach for the detection of new and unknown malware was proposed and implemented. 582 malware samples and 521 benign software samples were collected from the Internet. Each sample was reverse engineered to analyze its effect on the operating environment and to extract its static and behavioral features. The raw data extracted through reverse engineering was preprocessed, and two datasets were obtained: a dataset with reversed features (DRF) and a dataset with API Call features (DAF). Feature reduction was performed manually on the dataset with reversed features, and the features that do not contribute to the classification were removed.
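The reverse engineering step summarized above reduces each collected sample to a row of static features plus a class label. As an illustration only, here is a minimal Python sketch of that bookkeeping: an MD5 digest fingerprints a sample (the cryptographic hash function step of the static analysis), and a flat record holds a few reversed features. The helper names and the exact field set are assumptions made for this sketch, not the thesis's actual tooling, which used dedicated utilities such as PEiD and IDA Pro.

```python
import hashlib

def md5_of_sample(data: bytes) -> str:
    """MD5 digest used as a unique fingerprint for a collected sample."""
    return hashlib.md5(data).hexdigest()

def make_record(name: str, data: bytes, packed: bool,
                file_access: bool, registry_access: bool, label: str) -> dict:
    """One raw-dataset row: a few static features plus the class label
    ("YES" = malware, "NO" = benign). Field names are illustrative."""
    return {
        "file_name": name,
        "file_size": len(data),
        "md5": md5_of_sample(data),
        "packer_detected": packed,
        "file_access": file_access,
        "registry_access": registry_access,
        "class": label,
    }

# Hypothetical sample bytes; a real PE file begins with the "MZ" magic number.
sample = make_record("example.exe", b"MZ\x90\x00", True, True, True, "YES")
```

Fingerprinting of this kind lets identical samples collected under different file names be collapsed into one instance before the datasets are built.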
The machine learning classification algorithm J48 was applied to the dataset with reversed features, producing a decision tree and a set of classification rules. To reduce the tree size and obtain an optimal number of decision rules, the attribute values in the dataset with reversed features were discretized, and another dataset with discretized attribute values (DDF) was prepared. The new dataset was given to the J48 algorithm, and a decision tree with another set of classification rules was generated. To further reduce the tree and the number of decision rules, the dataset with discretized features was processed with BLEM2, a machine learning tool based on rough set theory that produces decision rules. To test the accuracy of these rules, the dataset with the decision rules from BLEM2 was given as input to the J48 algorithm. The same procedure was followed for the dataset with API Call features. Another set of experiments was conducted on the three datasets using the Naïve Bayes classifier to generate training models for classification. All of the training models were tested with an independent testing set. The J48 decision tree algorithm produced better results with the DDF and DAF datasets, with accuracies of 81.448% and 89.140% respectively. The Naïve Bayes classifier produced its best result with the DDF dataset, with an accuracy of 85.067%.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to the people who made this research possible. I want to express my heartiest thanks to Dr. Kathy J. Liszka for giving me the opportunity to work on this thesis; her invaluable guidance and support at every stage led to the successful conclusion of the study. I would like to thank Dr. Chien-Chung Chan for his expert advice in data mining and for the insightful suggestions that have been very helpful in the study. In addition, I want to thank Dr. Zhong-Hui Duan for taking the time to serve on the thesis committee.
I want to convey my special thanks to my parents, sister, brother, brother-in-law, cousin and friends for their love and continuous encouragement. Their blessings and moral support have been invaluable at every stage of my life. Thank you all for standing by me at all times.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
II. LITERATURE REVIEW
III. TYPES OF MALWARE AND ANTI-MALWARE DEFENSE TECHNIQUES
  3.1 Malware Types
    3.1.1 Virus
    3.1.2 Worm
    3.1.3 Backdoor
    3.1.4 Trojan Horse
    3.1.5 Rootkit
    3.1.6 Spyware
    3.1.7 Adware
  3.2 Antivirus Detection Techniques
    3.2.1 Signature Based Detection
    3.2.2 Heuristic Approach
    3.2.3 Sandbox Approach
    3.2.4 Integrity Checking
IV. REVERSE ENGINEERING
  4.1 Controlled Environment
  4.2 Experimental Setup
  4.3 Static Analysis
    4.3.1 Cryptographic Hash Function
    4.3.2 Packer Detection
    4.3.3 Code Analysis
  4.4 Dynamic Analysis
    4.4.1 File System Monitor
    4.4.2 Registry Monitor
    4.4.3 API Call Tracer
V. DATA MINING
  5.1 System Design
  5.2 KDD Process
    5.2.1 Target Data
    5.2.2 Preprocessing
    5.2.3 Transformation
    5.2.4 Data Mining
    5.2.5 Interpretation/Evaluation
VI. RESULTS AND DISCUSSIONS
  6.1 Experiment 1: Classification of DRF
  6.2 Experiment 2: Classification of DDF
  6.3 Experiment 3: Classification of DDF using BLEM2
  6.4 Experiment 4: Classification of DDF from BLEM2 using J48
  6.5 Experiment 5: Classification of DAF
  6.6 Experiment 6: Classification of DAF using BLEM2
  6.7 Experiment 7: Classification of DAF from BLEM2 using J48
  6.8 Accuracies
  6.9 Pattern in API Call Frequencies
VII. CONCLUSIONS AND FUTURE WORK
  7.1 Conclusions
  7.2 Future Work
REFERENCES
APPENDICES
  APPENDIX A. DATASETS

LIST OF TABLES

5.1 Attributes in DRF
5.2 Attributes in DRF after Transformation
5.3 Discretized Values
6.1 Decision rules for DRF for the decision label "YES"
6.2 Decision rules for DRF for the decision label "NO"
6.3 Decision rules for DDF for the decision label "YES"
6.4 Decision rules for DDF for the decision label "NO"
6.5 BLEM2 rules for DDF for the decision label "YES"
6.6 BLEM2 rules for DDF for the decision label "NO"
6.7 Decision rules for DDF from BLEM2 for the decision label "YES"
6.8 Decision rules for DDF from BLEM2 for the decision label "NO"
6.9 Decision rules for DAF for the decision label "YES"
6.10 Decision rules for DAF for the decision label "NO"
6.11 Decision rules for DAF from BLEM2 for the decision label "YES"
6.12 Decision rules for DAF from BLEM2 for the decision label "NO"
6.13 Testing set Results against Training Models from Experiments 1, 2 and 3
6.14 Testing set Results against Training Models from Experiments 4 and 5
6.15 Experimental Results from Naïve Bayes Classifier
A1: An Instance for Attributes File Name, File Size and MD5 Hash in DRF
A2: An Instance for Attributes Packer, File Access, Directory Access and Internet Access in DRF
A3: API Calls Accessed By the Trojan
A4: DLLs Accessed By the Trojan
A5: Registry Keys Added By the Trojan
A6: Registry Keys Modified By the Trojan
A7: Registry Keys Deleted By the Trojan
A8: URL References Made By the Trojan
A9: Programming Language used, Strings and Decision label of the Trojan
A10: An Instance of DRF Dataset after Preprocessing
A11: An Instance of DDF Dataset
A12: An Instance of DAF Dataset

LIST OF FIGURES

3.1 Typical Malware Signature
4.1 Snapshot Manager
4.2 Normal PE File
4.3 Packed PE File
4.4 PEiD
4.5 IDA Pro Disassembler
4.6 File Monitor
4.7 Registry Monitor
4.8 Registry Key Changes Made by a PE
4.9 Maltrap