Improved Detection for Advanced Polymorphic Malware James B
Nova Southeastern University
NSUWorks
CEC Theses and Dissertations, College of Engineering and Computing
2017

Improved Detection for Advanced Polymorphic Malware
James B. Fraley
Nova Southeastern University

This document is a product of extensive research conducted at the Nova Southeastern University College of Engineering and Computing.

Part of the Computer Sciences Commons

NSUWorks Citation
James B. Fraley. 2017. Improved Detection for Advanced Polymorphic Malware. Doctoral dissertation. Nova Southeastern University. Retrieved from NSUWorks, College of Engineering and Computing. (1008) https://nsuworks.nova.edu/gscis_etd/1008.

This Dissertation is brought to you by the College of Engineering and Computing at NSUWorks. It has been accepted for inclusion in CEC Theses and Dissertations by an authorized administrator of NSUWorks.

Improved Detection for Advanced Polymorphic Malware
by James B. Fraley

A Dissertation Proposal submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Assurance
College of Engineering and Computing
Nova Southeastern University
2017

An Abstract of a Dissertation Submitted to Nova Southeastern University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Improved Detection for Advanced Polymorphic Malware
by James B. Fraley
May 2017

Malicious software (malware) attacks across the internet are increasing at an alarming rate. Cyber-attacks have become increasingly sophisticated and targeted. These targeted attacks are aimed at compromising networks, stealing personal financial information and removing sensitive data or disrupting operations.
Current malware detection approaches work well for previously known signatures. However, malware developers utilize techniques to mutate and change software properties (signatures) to avoid and evade detection. Polymorphic malware is practically undetectable with signature-based defensive technologies. Today’s effective detection rate for polymorphic malware ranges from 68.75% to 81.25%. New techniques are needed to improve malware detection rates. Improved detection of polymorphic malware can only be accomplished by extracting features beyond the signature realm. Targeted detection for polymorphic malware must rely upon extracting key features and characteristics for advanced analysis. Traditionally, malware researchers have relied on limited dimensional features such as behavior (dynamic) or source/execution code analysis (static). This study’s focus was to extract and evaluate a limited set of multidimensional topological data in order to improve detection for polymorphic malware. This study used multidimensional analysis (file properties, static and dynamic analysis) with machine learning algorithms to improve malware detection. This research demonstrated that improved polymorphic malware detection can be achieved with machine learning. This study conducted a number of experiments using a standard experimental testing protocol. This study utilized three advanced algorithms (Metabagging (MB), Instance Based k-Means (IBk) and Deep Learning Multi-Layer Perceptron) with a limited set of multidimensional data. The experiments delivered detection results above 99.43%. In addition, the experiments delivered near-zero false positives. The study’s approach was based on single case experimental design, a well-accepted protocol for progressive testing. The study constructed a prototype to automate feature extraction, assemble files for analysis, and analyze results through multiple clustering algorithms.
The study performed an evaluation of large malware sample datasets to understand effectiveness across a wide range of malware. The study developed an integrated framework which automated feature extraction for multidimensional analysis. The feature extraction framework consisted of four modules: 1) a pre-process module that extracts and generates topological features based on static analysis of machine code and file characteristics, 2) a behavioral analysis module that extracts behavioral characteristics based on file execution (dynamic analysis), 3) an input file construction and submission module, and 4) a machine learning module that employs various advanced algorithms. As with most studies, careful attention was paid to false positive and false negative rates which reduce their overall detection accuracy and effectiveness. This study provided a novel approach to expand the malware body of knowledge and improve the detection for polymorphic malware targeting Microsoft operating systems.

Acknowledgments

I would like to thank Professor James Cannady for providing guidance and leadership as the Chair of my dissertation committee. His brutal honesty has made me a better student and person. I will always look back fondly on his instruction, guidance and wisdom. I also want to thank Dr. Hughes and Dr. Liu for working diligently on my committee. Your comments and feedback served to build a better end-product and researcher. I also want to thank Dr. Ackerman and Dr. Seagull who provided guidance and inspiration early on in the process.

I would also like to thank David Wasson who provided technical guidance and helped me build an infrastructure to process literally hundreds of experimental models. He provided sound technical advice and allowed me to test ideas that neither of us had ever tackled. Some for good reason. I am grateful that he took my calls and provided technical insights when things broke.
He helped me break through by answering questions, giving up his time and providing technical expertise when it was desperately needed.

I am very appreciative of my employers, both McAfee and Intel. My journey was only made possible by the company’s support and encouragement. McAfee has been exceptionally supportive, and I have appreciated the support from my senior management. I also have been surrounded by exceptional colleagues at McAfee who have increased my knowledge in information security by leaps and bounds. It has been my distinct honor to work and learn from all of you over the years.

I am so grateful to Theresa who partnered with me on my life journey. I thank her for inspiring, encouraging and fully supporting my academic journey. I appreciated her patience and willingness to let me work some insane hours for a long time. I am grateful that my mother instilled a belief that anything is possible if you work hard enough. Jake and Dylan, I hope my hard work and dedication provides a good example for you as you begin your academic journey. Lastly, I appreciate my family’s support and encouragement (especially David Bressi) throughout the entire process.

List of Tables

1. Advanced Detection Studies. Sources Used in Study
2. Confusion Matrix
3. Example Confusion Matrix
4. Example Unweighted Testing Results
5. Experimental Resource Requirements
6. MB Baseline Classification Results
7. MB Baseline Experimental Results
8. MB Reduced Feature Selection Classification Results
9. MB Reduced Feature Selection Experimental Results
10. MB Data Amplification Classification Results
11. MB Data Amplification Experimental Results
12. MB Experimental Results
13. IBK Baseline Classification Results
14. IBK Baseline Experimental Results
15. IBK Reduced Feature Selection Classification Results
16. IBK Reduced Feature Selection Experimental Results
17. IBK Data Amplification Classification Results
18. IBK Data Amplification Experimental Results
19. IBK Experimental Results
20. DLMLP Baseline Classification Results
21. DLMLP Baseline Experimental Results
22. DLMLP Reduced Feature Selection Classification Results
23. DLMLP Reduced Feature Selection Experimental Results
24. DLMLP Data Amplification Classification Results
25. DLMLP Data Amplification Experimental Results
26. DLMLP Experimental Results

List of Figures

1. System Overview
2. ROC Results
3. Comparative Baseline Testing Results
4. ROC Analysis
5. MB Graph Results
6. IBk Graph Results
7. DLMLP Graph Results

Table of Contents

Abstract
Acknowledgements
Approval/Signature Page
List of Tables
List of Figures

Chapters

1 Introduction
    Background
    Problem Statement
    Dissertation Goal
    Research Questions and/or Hypotheses
    Relevance and Significance
    Barriers and Issues
    Assumptions, Limitations and Delimitations
    Summary
    Definition of Terms
    List of Acronyms

2 Review of the Literature
    Overview of Topics
    Synthesis of Current Literature
    Research Methods
    Gaps in Current Literature
    Strengths and Weaknesses of Current Studies
    Similar Research Methods
    Summary

3 Methodology
    Overview
    Research Design
    Research Procedures
    Prototype Environment
    Threats to Validity
    Sample
    Data Analysis
    Data Formats for Results
    Resource Requirements
    Summary

4 Results
    Research Goals
    Review of the Methodology
    Experimental Outcomes
    Data Analysis
    Findings
    Summary of Findings

5 Conclusions, Implications, Recommendations, and Summary
    Conclusions
    Implications
    Recommendations
    Summary