COUNTERING MALICIOUS DOCUMENTS AND ADVERSARIAL LEARNING

by

Charles Smutz
A Dissertation Submitted to the Graduate Faculty of George Mason University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy, Information Technology

Committee:

Dr. Angelos Stavrou, Dissertation Director

Dr. Daniel Barbará, Committee Member

Dr. Daniel B. Carr, Committee Member

Dr. Duminda Wijesekera, Committee Member

Dr. Stephen G. Nash, Senior Associate Dean

Dr. Kenneth S. Ball, Dean, The Volgenau School of Engineering

Date: Summer Semester 2016
George Mason University
Fairfax, VA

Countering Malicious Documents and Adversarial Learning

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George Mason University

By

Charles Smutz
Master of Science, George Mason University, 2009
Bachelor of Science, Brigham Young University, 2006

Director: Dr. Angelos Stavrou, Professor Department of Information Technology

Summer Semester 2016
George Mason University
Fairfax, VA

Copyright © 2016 by Charles Smutz
All Rights Reserved

Acknowledgments

I thank my dissertation committee members for crucial guidance, feedback, and patience as I’ve conducted this research. I also acknowledge the feedback received from numerous reviewers of this work. I recognize the support of my employer, Lockheed Martin Corporation, as I’ve conducted this research.

Table of Contents

List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Background
  1.2 Thesis Contributions
  1.3 Thesis Outline
2 Related Work
3 Content Randomization in Office Documents
  3.1 Microsoft Office File Formats
    3.1.1 OLE Format
    3.1.2 Office Open XML File Format
  3.2 Microsoft Office Exploit Protections
  3.3 Approach
    3.3.1 Content Randomization in .doc Files
    3.3.2 Content Randomization in .docx Files
    3.3.3 Strength of Content Randomization Mechanisms
  3.4 Exploit Protection Evaluation
  3.5 Performance Evaluation
    3.5.1 .doc DCR Performance
    3.5.2 .docx DCR Performance
  3.6 Content Randomization Evasion
  3.7 Content Injection and Memory Displacement
  3.8 Discussion
4 Structural Feature Based PDF Malware Classifier
  4.1 PDF File Format
  4.2 Feature Extraction
  4.3 Feature Selection
  4.4 Machine Learning Algorithm Selection
    4.4.1 Support Vector Machines
    4.4.2 Random Forests
  4.5 Classification Labels
  4.6 PDFrate Evaluation
    4.6.1 Evaluation Data
    4.6.2 Adequacy of Features
    4.6.3 Classification & Detection Performance
    4.6.4 New Variant Detection
    4.6.5 Comparison to PJScan
    4.6.6 Computational Complexity
  4.7 PDFrate Online Service
5 Countering Adversarial Learning
  5.1 Top Feature Mimicry Attack
    5.1.1 Mimicry Attack Effectiveness
    5.1.2 Introducing Noise to Counter Top Feature Mimicry
  5.2 Independent Evasion Attacks
    5.2.1 Mimicus
    5.2.2 EvadeML
    5.2.3 Reverse Mimicry
    5.2.4 Parser Confusion
  5.3 Mutual Agreement Analysis
  5.4 Evaluation on Operational Data
  5.5 Evaluation on Virustotal Data
  5.6 Independent Evasion Attack Evaluations
    5.6.1 Mimicus Evaluation
    5.6.2 EvadeML Evaluation
    5.6.3 Reverse Mimicry Evaluation
    5.6.4 Parser Confusion Evaluation
  5.7 Mutual Agreement Threshold Tuning
  5.8 Ensemble Classifier Diversity
  5.9 Evaluation on Drebin Android Malware Detector
  5.10 Discussion
6 Specific Value Based Features
  6.1 PDF Metadata Values
  6.2 Metadata and Structural Based Signatures
  6.3 Quantized Prevalence of Specific Values
    6.3.1 Combining General and Specific Value Methods
    6.3.2 Improving Extrapolation through Prevalence Database Bagging
7 Microsoft Office Document Malware Detection
  7.1 Detection of OLE Based Malware Using Metadata and Structure
  7.2 Detection of Zip Based Malware
    7.2.1 Zip File Attributes
    7.2.2 Zip Features
    7.2.3 Zip Based File Format Classification Evaluation
    7.2.4 Zip Creator Fingerprinting
  7.3 Discussion
8 Conclusion
  8.1 Summary
  8.2 Lessons Learned
  8.3 Limitations
  8.4 Future Work
A Examples of Malicious PDF Structure
B PDF Feature Descriptions
C Copyright
Bibliography

List of Tables

3.1 DCR Exploit Protection Evaluation
3.2 DCR Performance
3.3 Metasploit Windows Shellcode Sizes
3.4 Deflate Decompressor Size (Linux)
4.1 Optimal SVM Error Rates by Kernel Type
4.2 SVM Tuning: Error Rates for Polynomial Kernel
4.3 SVM Tuning: Error Rates for Linear Kernel
4.4 SVM Tuning: Error Rates for Sigmoid Kernel
4.5 SVM Tuning: Error Rates for Gaussian (RBF) Kernel
4.6 Random Forest Tuning: Classification Error Rates
4.7 Classifier Performance Comparison
4.8 Data Set Summary
4.9 FP/TP Rates: Training Set (ben/mal)
4.10 FP/TP Rates: Training Set (opp/tar)
4.11 FP/TP Rates: Operational Set (ben/mal)
4.12 FP/TP Rates: Operational Set (opp/tar)
4.13 Variants in Data Sets
4.14 PJScan Results
4.15 Run Times on Training Data
5.1 Mimicry: Classifier Error Increase
5.2 Classifier Error with Features Removed
5.3 Classification Error with Training Data Perturbation
5.4 Relative Performance of Individual Trees in Contagio Classifier
5.5 Ensemble Classifier Outcomes
5.6 PDFrate Outcomes for Benign Documents
5.7 PDFrate Outcomes for Malicious Documents
5.8 Scores of Benign Documents Using Supplemented Classifier
5.9 Scores of Malicious Documents Using Supplemented Classifier
5.10 Outcomes for Benign Documents from VirusTotal
5.12 Comparison of Classifier Performance
5.13 PDFrate Contagio Classifier Outcomes for Mimicus Evasion Attacks
5.14 PDFrate Outcomes for EvadeML Attacks
5.15 PDFrate Outcomes for Reverse Mimicry Attacks
5.16 PDFrate Scores for Parser Confusion Attacks
5.17 University Classifier Mutual Agreement Threshold Tuning
5.18 Ensemble SVM Classifier Performance with Feature Bagging
5.19 Ensemble SVM Classifier Performance with Training Data Bagging
5.20 PDFrate SVM Ensemble Classifier Outcomes for GD-KDE Attacks
5.21 Drebin Random Forest Mutual Agreement Threshold Tuning
6.1 Example Creator Value Frequencies
6.2 Example Box Value Frequencies
6.3 Classification Error by Feature Set Formulation
6.4 Evaluation of Quantized Indexes
6.5 Combining General and Specific Value Methods
6.6 Varying Portion of Training Set in Prevalence Database
6.7 Bagging of Prevalence Data
7.1 OLE Data Set Summary
7.2 OLE Classification Matrix (ben/mal)
7.3 OLE Classification Matrix (opp/tar)
7.4 Zip File Format Top Features
7.5 Zip File Classification Result by Type
A.1 Example Structure: Targeted
A.2 Example Structure: Opportunistic

List of Figures

3.1 OLE Compound Document Format
3.2 OLE Fragmentation
3.3 ZIP Encoding Randomization
4.1 Dual Classifier Arrangement
4.2 Error Decrease with Feature Count (ben/mal)
4.3 ROC for Training Set
4.4 Classification Votes Density (ben/mal)
4.5 Classification Votes Density (opp/tar)
5.1 Feature Importance
5.2 Mutual Agreement Based on Ensemble Vote Result
5.3 Operational Evaluation Score Distributions
5.4 VirusTotal Score Distributions, University Classifier
5.5 Mimicus Score Distributions, Contagio Classifier
5.6 Score Distribution for FC Mimicry Attack, University Classifier
5.7 EvadeML Score Distributions
5.8 Reverse Mimicry Score Distributions, University Classifier
5.9 Score Distribution Random Forests Based Drebin Classifier
5.10 Comparison of Detection Rates for Previously Unknown Malware Families
5.11 Score Distribution for Unknown Family A

Abstract

COUNTERING MALICIOUS DOCUMENTS AND ADVERSARIAL LEARNING
Charles Smutz, PhD
George Mason University, 2016
Dissertation Director: Dr. Angelos Stavrou

In order to exploit the large number of vulnerabilities offered by user applications, malware is often distributed through insertion in document, media, and application files. This avenue for malware propagation is also frequently enabled by user response to a ruse, such as opening a fake invoice or following a link to a purported shipping tracking site. Embedded malware is used in a wide range of criminally motivated attacks, but it is also used in highly targeted espionage attacks. Targeted attacks are particularly challenging due to the increased resources of the attacker, which make novel malware, very specific social engineering, and persistent exploitation attempts feasible. Current approaches to defeating embedded malware include exploit protections such as data execution prevention (DEP) and address space layout randomization (ASLR), which are implemented in most current operating systems. Signature matching systems, such as antivirus tools, are deployed ubiquitously to prevent malware propagation. Other approaches to embedded malware detection include dynamic analysis and machine learning based methods. While these defense mechanisms make malware distribution more difficult, common file formats remain widely used carriers for malware. In this thesis, I build on existing work to advance both exploit protection and malware detection. While the methods I introduce apply to embedded malware generally, I focus my evaluations on document file formats, such as Adobe PDF and Microsoft Office, because these file formats have been used the most in targeted attacks [20, 49].

To improve exploit protection, I propose modifications to document files that introduce exploit breaking entropy into the reader application without changing the user visible content. This method is called document content randomization (DCR). DCR functions by randomizing the layout or encoding of document content, which changes the raw document file and document reader memory. DCR is effective in preventing many current exploits, but can be circumvented by scripting and use of external content. Recognizing that exploit protections can be circumvented, I also advance detection of malicious documents. To complement other detection mechanisms, I propose use of a Random Forest based classifier relying on features derived from document metadata and structure. This detector, called PDFrate, provides high malware detection rates, even when operating on previously unseen malware samples. PDFrate is available to researchers through an online service. Due to the high classification rates and availability of PDFrate, it has been subject to numerous recently published evasion studies [23, 70, 71, 107, 120].

Adversarial learning can limit the effectiveness of machine learning based classifiers through training set poisoning or mimicry attacks that operate by subverting feature extractors or using knowledge of trained models to create evasive samples. Since preventing all forms of evasion is not feasible, I propose mechanisms that detect classifier degradation due to novel malware or mimicry attacks. Ensemble classifier mutual agreement analysis measures the coherence of votes in an ensemble to determine when the classifier is providing an accurate prediction. If the individual classifiers do not agree on the prediction, the prediction is not trusted and the correct label must be obtained through another source. Mutual agreement analysis is shown to be effective against most contemporary mimicry attacks and assists in optimizing classifier retraining. While the effectiveness of mutual agreement analysis is dependent upon the accuracy of the underlying classifier and the susceptibility of the feature extractor to subversion, it raises the bar for mimicry attacks.

Chapter 1: Introduction

1.1 Background

Computer misuse includes unauthorized access to computer systems, compromise of user privacy, exposure of confidential information, distribution of unwanted messages (SPAM), and denial of service. While many attack vectors are used, distribution of malicious software (malware) remains a key enabler of computer system abuse. In years past, direct exploitation of the operating system through internet worms such as Code-Red [78], Blaster [10], and Witty [99] was common. However, as use of firewalls has expanded, operating system security has improved, and automated updates are used widely, attackers have focused on vulnerabilities in applications.

Today, most exploits target vulnerabilities in client applications such as document readers, multimedia programs, and internet browsers. For example, in 2008, Microsoft reported that the number of operating system vulnerabilities had decreased steadily over the past five years and that vulnerabilities in applications constituted over 90% of disclosed vulnerabilities [47]. Hence, large scale attacks such as the Storm Worm [51] evolved to use email attachments or links as the propagation method. More recently, exploit kits, such as Blackhole or Incognito [41, 46] are commonly used to automate exploitation of browsers and browser plugins by using one of many available exploits depending upon the software exposed by the victim system.

As malware propagation has shifted from attacking network services to user facing ap- plications, exploitation often relies on user action such as opening an email attachment or following a link. Indeed, social engineering, or tricking the user to take an action that leads to exploitation of the user’s system or exposure of sensitive information is common [87].

Use of specific file formats, such as documents attached to emails, often helps make the ruse seem more realistic. For example, an attacker may use the promise of an attached purchase order or a link to a shopping promotion to entice users to open a malware laden document or follow a link to malware. On the other hand, some infection occurs without interaction of the user through drive-by downloads of malware [28, 37, 109]. In either case, embedding malware in media files or web content also helps evade detection. Most file formats allow for various forms of encoding, linking, embedding, and scripting that can be used to evade signature matching and advance exploits [32, 38, 89].

Malware is used for various nefarious purposes. Botnets, or large networks of compromised systems, are used to assist in further malware propagation or to wage distributed denial-of-service attacks [5, 113, 124]. Financial fraud has long been one of the top goals of malware [50, 79]. Ransomware, such as CryptoWall, which performs extortion by promising decryption of maliciously encrypted user files, is now a prevalent problem [40, 45]. Some malware is used in targeted attacks to conduct espionage. Beyond traditional national security and diplomatic goals, these attacks also seek to undermine the economic competitive advantage of private companies and disrupt the activities of non-governmental organizations (NGOs) [2, 6, 14, 20, 49, 65, 115]. Reports that these attacks are performed on the behalf of governments are especially concerning because of the resources that can be expended in these attacks [3, 31, 39, 42, 54, 73, 75, 94]. These targeted attacks typically involve emails with very specific information related to the intended victim and malware embedded in Adobe PDF or Microsoft Office documents [1, 20, 49, 65].

Validating digital signatures or using vetted application stores are both common approaches to preventing malware execution, but these mechanisms are often circumvented or subverted [123]. Operating system level protection mechanisms such as address space layout randomization (ASLR) [86] and executable space protection [33] are used widely, but these mechanisms are commonly defeated [111]. Preventing software flaws and automatically updating software also limit exploitation, but preventing all software exploitations is not practical.

Another means of preventing the spread of malware is through detection. Antivirus and intrusion detection systems (IDS) [91] are deployed widely, using signatures to detect known malware. These systems are limited in their ability to stop previously unseen malware variants so they are often complemented with dynamic analysis [35]. Dynamic analysis is used in limited situations due to difficulty in modeling end systems (and users), computational expense, and need for an expert to interpret software behavior. In order to improve malware detection systems using both statically and dynamically extracted features, machine learning is employed [9, 30, 100, 106]. Machine learning based classifiers promise to provide better detection of new malware variants, but they are susceptible to mimicry attacks.

Adversarial learning concerns learning based systems that operate in environments where attackers actively seek to avoid detection, such as malware detectors or SPAM filters [52]. Mimicry attacks leverage knowledge of features and the learned model to create attack samples that appear benign to the classifier. Many recent studies have shown that mimicry attacks against machine learning based classifiers are practical [71, 107, 120]. Other studies have demonstrated the importance of keeping training sets free from influence of attackers [11, 29, 63, 80]. Improving evasion resistance of learning based malware detectors remains an open problem.

1.2 Thesis Contributions

I seek to prevent the use of common file formats as a carrier for malware. I propose interrupting exploits through entropy inducing modifications to documents that do not change document rendering [101]. Because it is not practical to block all exploits, I also advance methods for malware detection. I propose a machine learning based classifier that relies on features derived from document structure and metadata [100]. This learning based detector, PDFrate, has been the target of numerous recent adversarial learning studies [23, 70, 71, 107, 120]. In order to counter mimicry attacks, I devise a novel method of detecting evasion attacks using ensemble classifier introspection [102].

The key contributions are:

Document Content Randomization: An exploit protection technique using permutations to documents to introduce entropy in the reader application.

PDFrate: A malware detector for PDF documents using metadata and structural features with a Random Forests classifier.

Mutual Agreement Analysis: A method of detecting failures in ensemble classifiers by measuring the level of coherence in voting.

1.3 Thesis Outline

Following this introductory chapter, Chapter 2 contains an overview of related work. I demonstrate how exploits in Office documents can be defeated through document content randomization in Chapter 3. As a complementary approach, Chapter 4 describes classification of malicious PDF documents employing structural and metadata features. My approach to countering recent mimicry attacks against PDFrate utilizing ensemble classifier mutual agreement analysis is found in Chapter 5. In addition to numeric features, my study to determine the degree to which term based features improve PDF classification is reported in Chapter 6. In Chapter 7, I study machine learning based detection for file formats beyond PDF documents.

Chapter 2: Related Work

This thesis builds upon years of research in the areas of exploit protections, malware detection, and machine learning techniques. I seek to defeat all forms of malware, but am especially motivated by targeted attacks in which threat actors aggressively pursue specific organizations in order to achieve strategic objectives such as espionage.

Computer systems are exploited for a large number of end goals including financial gain, political demonstration, and espionage. Network intrusion attempts for the purpose of espionage are often called targeted attacks [2, 6, 73, 115]. These targeted attacks differ from financially motivated attacks in both methods and consequences. Herley [50] argues that targeted attacks are not effective for an attacker driven solely by economic gain. Li et al. expose malware obtained from two email based attacks in order to prove that target persistent attacks exist [65]. Hardy et al. find that espionage focused attacks involve varying levels of technical sophistication but usually involve a high degree of social engineering [49].

Blond et al. also perform analysis of targeted attacks against members of a non-government organization [20]. These studies confirm that while other exploitation vectors are studied, emails with malicious documents as attachments remain an extremely common methodology for targeted attacks. This thesis seeks to detect malware from all threat types, but the serious consequences of targeted attacks provide additional impetus and focus. I seek to improve identification of malicious documents, a common exploit vector used by targeted attackers. I build upon previous work in characterizing targeted espionage threats by demonstrating that I can train a high accuracy classifier which will separate targeted attacks from opportunistic threats.

Amin [7] advances a machine learning based approach for detecting targeted malicious email. This thesis is related in that it seeks to address delivery of targeted attacks and uses a Random Forest classifier. While I focus on malware-centric features of documents, Amin finds that features of emails reflecting persistent attackers and recipient targeting can improve detection rates. Recognizing that malware construction utilities are often re-used, Donaldson studies fingerprinting of specific document generation tools using structural elements of documents [34]. I also study the repetition of specific structural and metadata items in malware samples, knowing that these items often coincide with, but are not necessary to, the existence of malware.

Seeking to defeat all threat types, there is a large corpus of research in probabilistic exploit mitigations that are typically implemented in the operating system of computing systems. Address space layout randomization (ASLR) [86] is adopted widely. ASLR is effective in defeating many classes of exploits, but is circumvented through limitations in implementation [97], use of heap sprays [19, 117], or data leakage [95]. As return oriented programming and similar techniques [96, 118] have become popular, mechanisms to relocate or otherwise mitigate code (gadget) reuse have been proposed [59, 84, 122]. ASLR incurs little run time overhead because it relies on virtual memory techniques where address translation and relocation are already performed.

Code level approaches such as instruction set randomization have also been proposed but are prohibitively expensive [60]. Data space randomization enciphers program data with random keys, but this method is not feasible in practice due to deployment difficulty and computational expense [16]. My approach focuses on misuse of data in exploits but differs in that I propose modifications to the malware hosting document to induce computationally efficient entropy in the document reader. Other policy enforcing techniques, such as executable space protections (DEP or W⊕X) [33], are used widely. Despite the many proposed exploit mitigations, exploits are still practical on modern systems [111].

Because exploit protection does not prevent all malware propagation, detection of malware, including malware embedded in common file formats such as documents, is a current research topic. Pattern matching and malware clustering have been studied extensively [13, 22, 56, 58, 68, 121]. Despite known limitations in detecting new samples, signature matching and policy enforcement techniques are found in commonly deployed systems [91, 103]. More specifically, malware bearing documents have been the subject of much research over the years. Rautiainen describes the exploitation techniques and vulnerabilities used in PDF files [90]. Wressnegger et al. disclose that the very obfuscation methods used to evade signature matching in Office documents can be detected using probable plaintext attacks [119], but this approach is limited to situations where weak cryptography is used.

Stolfo et al. perform n-gram analysis on various file types including Office and PDF documents, finding that this analysis provides an improvement over signature based detection [108]. Li et al. also study n-gram analysis in Office files [67], finding that parsing files into individual document objects is necessary to achieve high detection rates. While existing malware can be detected in this manner, byte level statistical analysis is shown to be fragile due to the ability to obfuscate malware and inherent issues in separating malicious code from other data [74, 105]. Li et al. also demonstrate that examining dynamic traces of documents opened in sandbox systems can detect many forms of malware, but challenges with this approach include separating benign from malicious content and ensuring complete execution of stealthy malware. Tabish et al. layer various statistical measures of divergence or entropy on top of standard n-gram data to achieve modest classification rates on various file types including documents [112]. While I do not rely on content based features, I do study the degree to which the specific values observed in structure and metadata can be used to improve classification rates.

Cova et al. perform execution of javascript and utilize features derived from both system behavior and inspection of generated content in a Bayesian classifier to identify malicious javascript [28]. Although originally designed for web based javascript, this functionality has been extended to detect malicious content in PDF files (https://wepawet.iseclab.org/). Laskov and Šrndić parse and extract lexical features from javascript embedded in PDF documents to provide classification using Support Vector Machines (SVM) [64]. Tzermias et al. combine basic structural analysis with emulation of the Adobe reader javascript API in a PDF to detect shellcode obfuscated by javascript [114]. Maiorca et al. perform classification based on javascript API references. Liu et al. inject monitoring javascript into documents, combining dynamic monitoring with static features at run time [69]. I also study document based malware execution, but instead of focusing on detection, I seek to defeat exploits through document content randomization [101].

Cross and Munson [30] use an instrumented reader application and dynamic analysis to extract structural features of PDF documents to be used in a machine learning based classifier. Maiorca et al. advance a pattern matching based approach focusing on PDF document structure and utilizing a Random Forest based classifier [72]. These studies are very similar in concept to my work. My thesis differs in that I use a more comprehensive feature set employing both metadata and features derived from multiple structural markers [100]. Šrndić and Laskov further advance static analysis of PDF document structure by employing more complete parsing of documents to model the hierarchical structure of documents utilizing a linear SVM classifier [106]. Maiorca et al. advance the combination of structural features with scanning of content for suspicious conditions including malformed objects [70]. Despite these varied techniques for malware detection, malware propagation still occurs due to evasion through polymorphism and mimicry attacks.

Beyond detection of malicious documents, machine learning techniques are used pervasively to solve computer security problems such as detection of unsolicited email [15, 57, 92], malicious downloads [28, 88], detection of account misuse in social networks [36, 110], and detection of malware in other file formats such as Java Archives [93] and Android applications [9]. Due to the pervasive use of machine learning and statistical techniques, adversarial learning, or the use of machine learning in a hostile environment, is a popular research topic. Huang et al. provide a taxonomy and models for describing the threats that machine learning based classifiers face [52].

Some studies have proposed methods for creating effective classifier based intrusion detection systems [12, 43, 104]. Many studies have addressed the importance of data sanitization or adversarial influence at training time [11, 29, 63, 80]. Yet other adversarial learning studies focus on evasion of the deployed classifier [17, 18]. Maiorca et al. study a reverse mimicry attack against detection systems including PDFrate [71]. This attack method seeks to minimize the visibility of the malicious objects in PDF documents. Šrndić and Laskov perform a systematic evaluation of mimicry attacks against PDFrate, modeling an attacker that has knowledge of PDFrate’s features, training set, and classifier [107]. Xu et al. use a genetic programming approach employing multiple randomized mutations to attack PDFrate [120]. Another current research thread demonstrates that many malware detectors are defeated by exploiting weaknesses in their parsing routines [48, 55]. Carmony et al. demonstrate that most PDF malware detectors can be foiled by exploiting bugs and limitations in PDF parsers [23]. I also focus on evasion of a classifier during operation, but instead of focusing on strategies for evasion, I propose a means of detecting these evasion attempts.

Recent work has demonstrated that the diversity in ensemble classifiers can improve malware detection rates [62, 76, 116, 121]. Few studies, however, advance practical strategies for detection of evasion attempts against these ensemble classifiers. Chinvale et al. proposed the use of mutual agreement between a small number of independent SPAM filters to optimize individual classifier re-training necessary due to drift in input data [26]. I extend this approach to introspection of ensemble classifiers in order to provide a per observation confidence estimate at test time [102]. My thesis differs fundamentally from Chinvale et al. in that they use the majority result of their ensembles as ground truth for re-training of individual classifiers while I focus on identifying the specific examples where the ensemble prediction is not trustworthy. In short, Chinvale et al. use diversity in ensembles to improve classifier performance. I use diversity to identify when resorting to external ground truth is necessary. Nissim et al. utilize inspection of SVM margins to identify samples to be added to the training set, but they do not address mimicry attacks [82]. I study the factors that enable diversity based confidence estimates in ensembles using variations of bagging.

Going beyond natural drift or novel attacks, I apply mutual agreement analysis to focused mimicry attacks.
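To make the mutual agreement idea concrete, the sketch below (illustrative only, not the PDFrate implementation; the 0.75 threshold is an assumed value) computes a per-observation agreement score from the binary votes of an ensemble and flags predictions whose vote coherence is too low to be trusted.

    # Illustrative sketch: per-observation mutual agreement from ensemble votes.
    # This is not the PDFrate implementation; the threshold is an assumed value.
    from typing import List, Tuple

    def mutual_agreement(votes: List[int]) -> Tuple[int, float]:
        """votes holds one 0/1 vote per ensemble member (e.g., per tree)."""
        malicious = sum(votes)
        benign = len(votes) - malicious
        prediction = 1 if malicious >= benign else 0
        agreement = max(malicious, benign) / len(votes)  # 0.5 = split vote, 1.0 = unanimous
        return prediction, agreement

    # 60 of 100 trees vote malicious: the prediction is "malicious", but with
    # agreement of only 0.6 an (assumed) 0.75 threshold would route the sample
    # to an external source of ground truth instead of trusting the classifier.
    prediction, agreement = mutual_agreement([1] * 60 + [0] * 40)
    print(prediction, agreement)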

Estimation of confidence based on knowledge of a population has long been foundational to statistical methods [81]. Contemporary research has demonstrated how these confidence estimates can be applied to a machine learning based classifier deployed in an online setting [98]. However, since these approaches rely on new observations matching the distributions of training samples for which ground truth is known, they are not applicable to intrusion detection systems that face novel observations and mimicry attacks. Rather than seeking to quantify the overall accuracy of a classifier, I identify the individual observations for which a classifier cannot provide a reliable response. My approach makes use of data already provided by the classifier, without additional appeal to ground truth or independent outlier analysis.

Chapter 3: Content Randomization in Office Documents

I studied methods to prevent exploits in common file formats through modifications to the input data. Due to the popularity of document based exploits, I sought to defeat exploits through document content randomization (DCR). The goal of DCR is to make modifications to documents that introduce exploit disrupting changes without interfering with normal document use. This approach is inspired by existing execution environment based exploit mitigations such as ASLR and data randomization, but is implemented in the document itself.

I designed and evaluated exploit protections using transformations performed on documents between production and consumption. Document content fragment and encoding randomization are effective in scrambling exploit critical content in document files and in document reader process memory. I evaluated the ability to mitigate current exploits in Office 2003 (.doc) and Office 2007 (.docx) file formats using hundreds of malicious documents, demonstrating a memory misuse exploit block rate of over 96%. The overhead of transforming documents is comparable in run time to a common antivirus engine and the added latency of opening a content layout randomized document is negligible for .doc and about 3% for .docx files. The transformed documents are functionally equivalent to the original documents, barring the exploit protections that are induced. The evasion resistance of content randomization is rooted in the number of raw content permutations possible. File content randomization should be applicable to other file formats as complementary controls force attackers to use direct access to file content to advance their attacks.

I also studied exploit protection mechanisms using resource consumption, such as large memory allocations. While this approach can mitigate some exploits, it is not practical for use due to the extreme computational resources required.

3.1 Microsoft Office File Formats

There are multiple commonly used Microsoft Office file formats. The OLE Compound Document Format was used as the default by Office 97-2003 and is used in the files whose extension is .doc, .ppt, or .xls. Beginning in Office 2007, the default file format is Office Open XML, which uses the .docx, .pptx, and .xlsx extensions.

3.1.1 OLE Compound Document Format

The file format used by Office 97-2003 is called by many names including Compound File Binary Format [77, 83] and OLE Compound Document Format. I refer to this format as the “OLE” file format throughout this thesis, as many of the libraries and utilities for parsing this format use variations of this name.

The OLE Compound Document Format supports the storage of many independent data streams and borrows many structures from filesystems, especially the File Allocation Table (FAT) filesystem. OLE files are used as a container for many file types, including Office 97-2003 (.doc/.ppt/.xls), Outlook Message (.msg), Windows Installer (.msi), Windows Thumbnails (Thumbs.db), and ActiveX or OCX controls (.ocx). All of these file formats use the OLE format as a base container, implementing their own data structures inside of streams stored by the OLE format. In this way, the OLE file format can be compared to the zip archive that serves as the base container for diverse file formats including Java Archives (.jar), Office Open XML Documents (.docx/.pptx/.xlsx), and Mozilla Extensions (.xpi).

Like a filesystem, the OLE format is comprised of many data blocks or sectors. The vast majority of OLE files use a 512 byte sector size, but other sizes are possible. The first sector of an OLE file is a header which contains the file signature 0xd0cf11e0, various parameters for the file such as the size of the blocks, and the offsets of key data structures. This could be compared to the super block of a filesystem.
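As a concrete illustration of the header fields just described, the sketch below reads a few of them using Python's struct module. The field offsets follow the published Compound File Binary layout; this is an illustrative reader, not part of the DCR tooling described later.

    # Minimal sketch: read selected OLE/CFB header fields from the first sector.
    # Offsets follow the Compound File Binary layout; illustrative only.
    import struct

    OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")   # full 8-byte signature

    def read_ole_header(path):
        with open(path, "rb") as f:
            header = f.read(512)                    # the header occupies the first sector
        if header[:8] != OLE_MAGIC:
            raise ValueError("not an OLE Compound Document")
        sector_shift, mini_shift = struct.unpack_from("<HH", header, 0x1E)
        fat_sector_count, first_dir_sector = struct.unpack_from("<II", header, 0x2C)
        return {
            "sector_size": 1 << sector_shift,       # typically 512 bytes
            "mini_sector_size": 1 << mini_shift,    # typically 64 bytes
            "fat_sector_count": fat_sector_count,
            "first_directory_sector": first_dir_sector,
        }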

The first data block also contains the Master Sector Allocation Table, which contains pointers to the sectors used by the Sector Allocation Table. The Sector Allocation Table is a table of pointers to the next sector in each data stream. Following the chain of pointers to data sectors in this allocation table allows the individual data streams to be constructed from potentially arbitrarily ordered sectors. The majority of the data blocks in a typical OLE file are allocated to the storage of the embedded file streams. Some data sectors in a file may not be used so it is possible to have OLE files larger than the logical contents of the file.
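The sketch below shows the chain-following step described above: given the Sector Allocation Table as a simple array, a stream is reassembled by hopping from sector to sector until the end-of-chain marker is reached. It is a simplified illustration that assumes a 512 byte sector size.

    # Sketch: rebuild one stream by walking its chain through the Sector
    # Allocation Table (FAT). Simplified and illustrative; assumes 512 byte sectors.
    ENDOFCHAIN = 0xFFFFFFFE

    def read_stream(data: bytes, fat: list, start_sector: int, sector_size: int = 512) -> bytes:
        """data is the raw OLE file; fat[n] gives the sector that follows sector n."""
        chunks = []
        sector = start_sector
        while sector != ENDOFCHAIN:
            offset = 512 + sector * sector_size     # sector 0 begins after the header
            chunks.append(data[offset:offset + sector_size])
            sector = fat[sector]
        return b"".join(chunks)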

The individual data streams are logically organized in a hierarchical structure similar to the file/folder organization of a filesystem. There is a directory entry for each OLE stream containing the name of the file, the location of the first sector, and other metadata such as modification times. These entries create a tree data structure starting at the root entry.

The OLE format also supports small or mini data streams. These are similar to the normal data streams, but are intended to be much smaller to prevent the space wasted by allocating a full sector to a very small data stream. Typically, the block size for these streams is 64 bytes. Another sector allocation table and these smaller sectors are embedded into normal sector size data streams.

Figure 3.1 shows the layout of a typical OLE Compound Document Format file. Except for the header, which must be located in the first sector, all of the contents in an OLE file can be located arbitrarily within the OLE file. Furthermore, there is no requirement for the various data streams to be arranged in order or in contiguous blocks, although it is the norm.

The OLE format serves as a container for arbitrary data streams. Individual file formats use this basic container but implement their own format for the data in the various streams.

I do not describe the particular OLE-based file formats because they vary widely, they are often proprietary and poorly documented, and my study relies on the container capability of OLE files.

3.1.2 Office Open XML File Format

The Office Open XML (OOXML) file format became the default file format for Office documents starting in Office 2007. These files use the extensions of .docx, .pptx, and .xlsx.

Figure 3.1: OLE Compound Document Format (layout showing the Header, Data Streams, Sector Allocation Table, Directory Entries, Mini-Sector Allocation Table, and Mini-Data Streams)

This format has been codified in the international standards ECMA-376 (http://www.ecma-international.org/publications/standards/Ecma-376.htm) and ISO/IEC 29500 (http://www.iso.org/iso/catalogue_detail?csnumber=51463).

The OOXML file format uses a zip file as a container, with individual objects stored as files in the zip. The majority of the content in an OOXML file is XML data. The document content is represented as XML with markup that is unique to the OOXML format, but which is generally similar to other markup such as HTML. The contents of an OOXML document can be modified by unzipping the archive, modifying it with a text editor, and zipping the archive. However, the relatively complex markup requires extensive knowledge to make major changes.
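Because an OOXML file is an ordinary zip archive, its parts can be enumerated and read with standard zip tooling, as in the short sketch below (the file name is a placeholder).

    # Sketch: list the parts of a .docx and read its main document markup.
    # Illustrative only; "example.docx" is a placeholder file name.
    import zipfile

    with zipfile.ZipFile("example.docx") as docx:
        for info in docx.infolist():
            print(info.filename, info.compress_type, info.file_size)
        body_xml = docx.read("word/document.xml")   # main document content part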

While text and formatting can be represented using XML, other content, such as images and other documents, is embedded in the document as separate files in the zip archive.

Some of the binary objects embedded in OOXML files use the OLE format. Examples of these files include Office 2003 files, some multimedia files, equations, ActiveX controls, and executables.

Throughout this thesis, I refer to the container format common to Office 2003 and many other files as OLE. I refer specifically to Office 2003 and Office 2007 files by their extension: .doc and .docx respectively. While I refer to these file formats by the file extension of the document files (.doc/.docx) for brevity, I also included the presentation (.ppt/.pptx) and spreadsheet (.xls/.xlsx) files in my evaluations. My study relies on file format characteristics that are common across the document, presentation, and spreadsheet file variations.

3.2 Microsoft Office Exploit Protections

I briefly describe some of the most important exploit protections provided by Microsoft Office. Macro based viruses have long been an issue in Office documents. All recent Office versions disable the automatic execution of macros. The OOXML file format assigns macro based files a separate extension, such as .docm instead of .docx, making inadvertent execution of macros extremely difficult. The OOXML format is designed to minimize the amount of binary content and improve the readability of the text content of documents, making the file format easier to validate and simplifying the parser.

Data Execution Prevention (DEP) was enabled in Office 2010, which prevents execution of arbitrary shellcode in the heap and has generally forced the adoption of return-oriented programming (ROP) code reuse techniques. ASLR was enabled by default in Windows Vista. Office running on versions of Windows since Vista will have the benefit of ASLR for operating system libraries, but ASLR was not enabled for all Office provided libraries until Office 2013. Hence, up until Office 2013, one observes use of ROP gadgets from Office libraries without the need for an ASLR bypass.

3.3 Approach

Inspired by the simplicity and generality of ASLR-like techniques, I seek to obtain similar exploit mitigation outcomes through transformations to input data. The properties of a document can often directly and predictably influence various run time attributes of the opening application. The memory of a reader program is necessarily influenced by the file that it opens. I seek to find practical ways to make exploitation more difficult by content induced variations to the document file and reader memory. I call this general approach Document Content Randomization (DCR).

I propose two specific forms of document entropy infusion: document content fragment randomization (DCFR) and document content encoding randomization (DCER). DCFR is analogous to ASLR in that the order of data blocks in the document file is randomized. DCER can be compared to data space randomization because I randomize file level encoding. Both of these approaches apply to data stored in document files, but they can affect memory as files and subfiles are loaded into memory.

These transformations are envisioned to operate on documents during transfer between the potentially malicious source and the intended victim. They could be employed in network gateways such as email relays or web proxies where modification is already supported.

In practice, filtering based on blacklists and antivirus scanning is already common at these points. The modifications to the document could also be implemented on the client. For example, the client could employ the mechanisms presented here at download time, similar to other defenses such as blacklists and antivirus.

I focus on Office documents, but most of the high level principles discussed here apply to other file formats. Specifically, I seek to mitigate exploits in two common document formats: Office 2003 (.doc) and Office 2007 (.docx).

3.3.1 Content Randomization in .doc Files

The most promising opportunity to apply DCR to .doc files is at the raw document file level. Malicious content is often stored in the raw document file and accessed through the filesystem during exploitation. Typically, file level access of malicious content occurs later in the exploitation phase and this content is usually malicious code, whether it be shellcode or a portable executable. Sourcing malicious content from the file is surprisingly common in document based exploits.

I observed obfuscated portable executables embedded in the raw document file of 96% of the malicious Office 2003 documents in the Contagio document corpus [85]. This set of malicious documents, observed in targeted attacks, includes files that were 0-day attacks when collected. Retrieving additional malicious content from the raw document file is also common in PDF files used in targeted email attacks, while web based PDF exploits usually load their final payload through web download.

File level access is most frequently achieved through standard file access mechanisms, such as reading the file handle. Because most client object exploits, including document exploits, are triggered by opening a malicious file, a handle to the exploit file is usually already available in the reader application. While the malicious content may be embedded raw into the document file, exploits typically employ signature matching evasion techniques, such as trivial XOR encryption. The malicious content accessed through the raw document is sometimes accessed by offset, but typically an egg hunt is employed, where the file is searched for a specific marker. In most .doc files, I observed the shellcode and portable executables embedded within the bounds of the structure of the document, but simply appending malicious content to the end of an existing document is possible and is used sometimes.
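To illustrate the attacker-side pattern just described, the sketch below performs an egg hunt over the raw document bytes and applies a trivial single-byte XOR decode to the payload that follows the marker; the marker and key are hypothetical. Fragmenting the file with DCFR may leave the marker findable, but the bytes that follow it no longer form the intended payload.

    # Illustrative attacker-side pattern (hypothetical marker and key): search the
    # raw document for an egg marker, then XOR-decode the bytes that follow it.
    def egg_hunt_and_decode(document: bytes, egg: bytes = b"EGG!", key: int = 0x7A) -> bytes:
        start = document.find(egg)
        if start == -1:
            return b""                              # marker not found
        payload = document[start + len(egg):]
        return bytes(b ^ key for b in payload)      # trivial single-byte XOR "encryption"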

I defeat raw file reflection by malware by performing file level content fragmentation (DCFR). OLE based file formats such as .doc files are especially accommodating of this technique. Typically, the streams in an OLE file are sequentially stored. However, reordering can occur and is expressly allowed. To implement this approach I built an OLE file block randomizer. It simply creates a new OLE file functionally equivalent to the original except that the layout of the data blocks is randomized. This is accomplished by randomizing the location of the data blocks, and then adjusting the sector allocation tables and directory data structures accordingly. This is essentially the inverse of running a filesystem defragmentation utility. An example of fragmentation of three OLE data streams is given in Figure 3.2. Note that the blocks in the data streams are typically sequentially arranged.

Normal Layout: A1 A2 A3 A4 A5 B1 B2 C1 C2 C3
Random Layout: B2 A5 C1 A3 C2 A1 A2 B1 C3 A4
Figure 3.2: OLE Fragmentation: The Order of Blocks in Data Streams is Randomized, Fragmenting the Payloads

I randomize the order of the data blocks across all the streams. In the event that any data exists in the raw document file stream but is not contained in valid OLE sectors, the data is not transferred to the new randomized document.
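The sketch below captures the core of this idea in simplified form: assign each data sector a random new position and remap the Sector Allocation Table entries through the same permutation so every stream still reassembles correctly. A working randomizer, like the one described above, must also rewrite the Master Sector Allocation Table, directory entries, and mini-stream structures consistently; this fragment is illustrative only.

    # Simplified sketch of OLE DCFR: shuffle sector positions and remap the FAT
    # through the same permutation. Illustrative; a real tool must also update
    # the MSAT, directory entries, and mini-FAT.
    import random

    def randomize_sector_layout(sectors: list, fat: list):
        """sectors holds raw sector payloads; fat[n] is the sector after n."""
        n = len(sectors)
        new_pos = list(range(n))
        random.shuffle(new_pos)                     # new_pos[old] = new sector number
        shuffled = [None] * n
        for old, new in enumerate(new_pos):
            shuffled[new] = sectors[old]
        remap = lambda s: new_pos[s] if 0 <= s < n else s   # keep end-of-chain markers
        new_fat = [None] * n
        for old, nxt in enumerate(fat):
            new_fat[new_pos[old]] = remap(nxt)      # chains still walk the same streams
        return shuffled, new_fat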

Reordering data blocks, or DCFR, in an OLE file provides a consistently effective and quantifiable way to prevent access to malicious content in raw document files without impacting normal use. Since OLE files do not implement any form of encoding at the container level, DCER is not a practical option.

3.3.2 Content Randomization in .docx Files

I also studied the use of document content in .docx exploits. I found a small number of OOXML files where the raw zip container was accessed for a malicious payload. These attacks simply included the malicious payload, usually an encrypted portable executable, in the zip file without compression. It is then trivially located in the file through an egg hunt, similar to that done in OLE files.

I devised two simple ways to introduce entropy in the OOXML file. First, I randomized the order of the files in the zip archive. This defeats access based on offset. I also recompressed the zip data streams, randomly selecting one of four deflate compression levels (superfast, fast, normal, maximum). Figure 3.3 demonstrates transformation of a simple zip file with three subfiles. Note that the order of the files in the archive and the compression used on each file are randomized. Office uses superfast compression, the lowest compression level, so my archive randomization usually results in smaller files. Compression level randomization is enough to foil simple access to file content, even if an egg hunt is used.

Normal Layout: Stream A (superfast compression), Stream B (superfast), Stream C (superfast)
Random Layout: Stream B (maximum), Stream A (normal compression), Stream C (fast)
Figure 3.3: ZIP Encoding Randomization: The Order and Compression Level of Data Streams is Randomized

However, most .docx exploits gain access to the final malware payload through a web download or through an egg hunt in memory. I found a common method of performing scriptless heap sprays in contemporaneous exploits that can be mitigated by DCFR [4,

66]. In this heap spray technique, first observed in CVE-2013-3906, many ActiveX objects containing primarily heap spray data are read when the document is opened and loaded into the heap. These objects are loaded into memory raw, without interpretation or parsing. It is not clear why these objects are loaded into memory in this manner, while other embedded

files do not receive the same treatment. Dynamic analysis by the author confirmed that these embedded ActiveX objects are loaded directly into memory, while most other data from the document is not loaded into memory wholesale. Even if these ActiveX controls are not activated, they represent a simple and effective way to introduce content directly into the memory of the reader program.

Heap sprays are used to defeat ASLR. They ensure that the malicious payload can be

19 located with high certainty through duplication of the malicious payload across a large memory address range, even if the address of a single copy cannot be predicted. Only one copy of the malicious payload is needed for successful exploitation. Traditionally, heap sprays contain shellcode. However, DEP prevents execution from the heap. In the case of exploits targeting systems with DEP, the heap is commonly sprayed with ROP gadgets and a stack pivot is used to move the stack into the sprayed region. These attacks successfully evade ASLR and DEP. I observed this technique for scriptless heap sprays used for both traditional shellcode and to implement fake stacks containing malicious ROP chains. While these two techniques have been observed, this ability to easily and predictably influence the reader process’s memory could be used for other attacks such as object corruption exploits.

This general technique is also used to load single copies of arbitrary content, including portable executables, into memory that is later egg-hunted and used in exploits.

Since these ActiveX objects use the same OLE container format that Office 2003 documents use, I use the same OLE fragmentation techniques to defeat these scriptless heap sprays. I randomized the layout of all OLE files embedded in .docx files, regardless of their role. When these objects are loaded into RAM, the content is scrambled, but can still be retrieved by a document reader that implements the OLE decoding routines. These scriptless heap sprays in .docx files represent an example of how document content directly influences reader memory.

For .docx files, I perform both file level encoding randomization and fragmentation of objects to be loaded into memory.

3.3.3 Strength of Content Randomization Mechanisms

Like other probabilistic exploit protections, one can calculate the likelihood of exploit success in the face of brute force attacks against DCR. Methods such as ASLR obfuscate the location of malicious payloads. Document content randomization does this as well. However, content based malicious payloads are very frequently located via egg hunts or are duplicated in heap sprays, obviating randomized relocation. In practice, the primary protection power lies in randomization of the content representation, whether through fragmentation or through encoding.

In the simple case, the probability of a randomly fragmented payload being in proper order is the inverse of the number of possible permutations or 1/n! where n is the number of fragments. In practice, this should be adjusted to account for other data mixed in with the malicious payload, repetition of the malicious payloads, and other limitations or constraints.

For example, when OLE DCFR is employed, the number of fragments that influence the possible permutations is not just the number of fragments in the malicious payload, but includes all of the sectors that are randomized.
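As a worked example of these quantities (the counts are hypothetical: a payload occupying eight 512 byte sectors randomized together with 192 other sectors), the snippet below computes both the simple 1/n! figure and the probability that the payload's sectors land in order at one particular position when all sectors are permuted.

    # Worked example with hypothetical counts: an 8-sector payload shuffled
    # among 200 randomized sectors in total.
    from math import factorial, perm

    payload_sectors = 8
    total_sectors = 200
    p_simple = 1 / factorial(payload_sectors)        # 1/8! ~= 2.5e-05
    p_exact_placement = 1 / perm(total_sectors, payload_sectors)
    # probability the eight payload sectors land, in order, at one chosen
    # set of positions when all 200 sectors are permuted uniformly
    print(p_simple, p_exact_placement)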

When I perform DCER on .docx files, I randomly select among four deflate compression levels. This is adequate for all the samples we observed where DCER has an effect because they all involve data streams that are originally uncompressed. This is, however, a very small number of permutations. Part of the strength of DCER is also rooted in how difficult the encoding of the content is to reverse or circumvent. Compression makes generating a specific post compression malicious payload more difficult through transformations and restrictions in the encoded output. For example, repeated byte sequences, such as the high order bytes in addresses used in ROP gadgets, are not found in compressed output. Unlike straightforward fragmentation, the constraining power of encoding is more difficult to quantify. Individual implementations of deflate are deterministic, but they are also allowed great latitude in how the encoding occurs. The same data stream can have many byte level representations using the same encoding method. If an entropy inducing compressor/encoder is used, the number of encoding induced permutations could be quantified.

The strength of DCR lies in the ability to fragment or encode malicious payloads in an unpredictable and constraining manner. This strength can be quantified as proportional to the number of randomized content permutations. I address possible DCR evasion approaches in Section 3.6.

3.4 Exploit Protection Evaluation

I evaluated the effectiveness of my content based exploit protections on hundreds of malicious Office documents sourced from VirusTotal (https://www.virustotal.com/). These documents were downloaded daily from the recent uploads to VirusTotal over the course of months. My downloads were limited primarily by my monthly download limit on VirusTotal. I obtained 64,617 unique .doc files between May 2013 and March 2015 and 32,383 unique .docx files between November 2013 and March 2015, averaging 98 .doc and 66 .docx files per day. Of these collected documents, 40,720 .doc and 2,901 .docx files were labeled by at least one AV engine as malicious in a scan conducted two weeks following initial submission. Of these malicious documents, 1,085 .doc and 578 .docx files were labeled by the antivirus engines as utilizing a known exploit. The majority of the non-exploit malicious documents were identified by the antivirus engines as utilizing macros.

I advance methods to break exploits using mechanisms not applicable to pure social engineering attacks. Therefore, I focused my evaluation solely on maldocs leveraging a software vulnerability. Furthermore, to be able to better explain how my mechanisms applied to specific exploits, I used only those maldocs that were labeled by antivirus engines to use a single exploit. I was left with 962 .doc and 363 .docx files after inconsistent exploit labels were removed. Of these documents, I found all exploits for which I was able to replicate successful exploitation and for which there were at least 20 samples. This resulted in three exploits in .doc files and three exploits in .docx files. Surprisingly, the malicious documents were distributed heavily across a small number of particularly popular exploits.

For example, the three top exploits in the .docx file types comprised 306 of the 363 files, with 225 of these samples using the most popular exploit. In the event that I had many samples for a given exploit, I randomly selected a subset achieving a maximum of 100 documents to test and a maximum of 50 viable maldocs per exploit. In total, there were 343 documents tested and 217 documents demonstrating successful malware execution across these six sets.

To test for exploitation, I attempted dynamic execution of the Trojan documents by opening them in a virtual machine. To achieve successful exploitation, I used various configurations of software including both Windows XP and Windows 7 and Office 2007 and Office 2010. The ROP based exploits required specific versions of the libraries from which they reuse code. Since one of the exploits selected for my testing is in Adobe Flash, I also installed the appropriate version of Flash player. I considered the malware execution successful when malicious code was executed or requested from the network that would have been executed. Successful exploitation occurred in 217 or 63% of the malicious documents I tested. I attribute this relatively low malware success rate to VirusTotal being used by malware authors for testing, sometimes testing unreliable or incomplete exploits. For example, in a few of the successful exploits I observed calc.exe, the malware “hello world”, as the final payload. There were a small number of apparent false positives by antivirus software as well.

Taking these successful malicious document based exploits, I applied my document content based mitigations and re-ran the documents. I considered the exploit blocked by DCR when the final malware payload was blocked. I observed the differences in malware execution through both host based and network based instrumentation. In a very small number of cases, DCR was not possible due to the malicious document having defective structure. These failures were considered blocked as well, due to the rudimentary file validation provided by performing content randomization.

Generally, the malicious documents I observed employ a portable executable as the final malicious payload. Most of these executables are extracted from the raw document file, many are downloaded from an external server, and a few are extracted from document reader memory. In many of the Trojan documents, the original document file is overwritten by a benign document, which is opened and presented to the user. Most of the malware immediately beacons to a controller node, but a small minority of the malware performed other actions such as infecting other files on the local system. I observed dropped benign documents and malware that correlate to recent reports of targeted attacks against NGOs [20, 49] as well as more opportunistic crimeware.

Table 3.1: DCR Exploit Protection Evaluation

CVE        File Type     Blocked  Total  Block Rate  Effective Mechanism
2009-3129  .xls               36     36       100%   File Fragmentation
2011-0611  .doc, .xls         29     29       100%   File Fragmentation
2012-0158  .doc, .xls         50     50       100%   File Fragmentation
2012-0158  .pptx, .xlsx        4     10        40%   File Encoding
2013-3906  .docx              42     42       100%   Memory Fragmentation
2014-4114  .ppsx, .docx        2     50         4%   File Validation
All        -                 163    217      75.1%   -

When the document based exploit is blocked by DCR, the document reader typically crashes. However, sometimes instead of crashing, the reader enters an infinite loop, presumably performing an egg hunt that is never successful. When a decoy benign document is provided by the malware, it is either never opened due to a failure in malware execution or the benign document is scrambled due to DCR and the attempt to open the document fails because the file is invalid. When DCR interrupts file-level access, shellcode that is attempting to extract a portable executable or additional shellcode from the document file is interrupted. When memory fragmentation is effective, it scrambles either shellcode or ROP chains, preventing exploitation earlier. Table 3.1 contains the high level results of my evaluation.

CVE-2009-3129 is triggered by a malformed spreadsheet that causes a memory corruption error. All of the successful exploits were .xls spreadsheet files. In all of these exploits, the pattern of extracting an encrypted portable executable and benign decoy document is employed. Due to raw access to the document file, all of these exploits were defeated by file level DCFR.

CVE-2011-0611 is actually a vulnerability in Adobe software products, including Flash player, but it is most often observed inside of Office documents. This exploit triggers a type confusion error through a malformed Flash file embedded in the Office document. I was able to observe successful exploitation in both .doc and .xls files. Like the other exploits embedded in OLE based file formats, all of the exploits are defeated by file level DCFR because the malicious executable and decoy document are extracted from the raw document file.

It is interesting to observe that this exploit in Adobe products was used so heavily in Office files. It is likely that part of the reason this exploit was embedded in Office documents was to leverage the social engineering of email based attacks.

CVE-2012-0158 is caused by malformed ActiveX controls that corrupt system state. While originally reported in RTF documents, my VirusTotal sourced malware contained a large number of 2012-0158 exploits in the OLE container as well. I observed successful exploitation in both .doc and .xls files, which was defeated by file level document fragmentation.

I also observed 2012-0158 in OOXML based files. These OOXML based 2012-0158 exploits were much less common than the .doc version, making this set the smallest in my evaluation. I observed both .pptx slideshows and .xlsx spreadsheet files containing viable exploits.

This vulnerability exists in the MSCOMCTL library, which handles ActiveX controls. Until May 2014 (CVE-2014-1809), ASLR was not enabled on this library on any version of Office (including Office 2013) or Windows (including Windows 7 and Windows 8). Since this library is easily locatable, it is trivial to reuse code from the same library as is used for the initial vulnerability. Due to this lack of OS level exploit mitigations and the simplicity of exploitation, DCR, including memory fragmentation, does not block this exploit. It is noteworthy that since the ActiveX controls used in this exploit are OLE files, my DCR mechanisms fragmented these objects. However, since the access to these objects comes through legitimate means, the layout randomization provides no mitigation power.

However, some exploits are foiled because they use anomalous access to the raw document file. In the case where the raw document is accessed, the encrypted malicious payload is stored in the zip container without compression. My recompression of the zip streams with a randomly selected compression level defeats this file level access.

CVE-2013-3906 is a vulnerability in the TIFF image format parser that permits memory corruption resulting in possible code execution. This exploit was manifest in .docx documents. Some of these exploits use ROP chains, while some use traditional shellcode. The ROP based exploits can evade DEP using a stack pivot and code reuse. Since ASLR is not enabled on the MSCOMCTL library, this library is used for gadgets in the ROP based exploits. Hence, the ROP formulation of the exploit was able to evade both ASLR and DEP as implemented at the time. However, in either the case of traditional shellcode or ROP chains, the 2013-3906 exploits are defeated through fragmentation of the ActiveX objects used to implement a scriptless heap spray. The majority of the 2013-3906 samples I observed attempt to load the final malware via HTTP requests. The other exploits load the final malware into memory using the same ActiveX control loading mechanism, such that these payloads are also fragmented.

The CVE-2014-4114 vulnerability is not caused by a software coding flaw, but rather by a policy that allows remote code to be executed. In this vulnerability, an ActiveX control allows execution of a remote .inf file, which then allows execution of a portable executable. The malware is most typically downloaded via Windows file sharing (SMB/CIFS). The vast majority of these maldocs were .ppsx files, which are presentations that open automatically as slide shows. There were a small number of .docx files as well. Since this vulnerability is a policy flaw, mitigations such as ASLR and DEP do not apply. Similarly, DCR does not apply even though I fragment the OLE ActiveX controls implementing the exploit. I only block a small number of these exploits because my file fragmenter identifies them as improperly formatted.

Overall, I am able to block over 75% of the exploits in my evaluation set. If 2014-4114, which is not a traditional memory safety vulnerability, is excluded, then DCR blocks over 96% of the exploits in my evaluation set.

Table 3.2: DCR Performance

File Type   Transform Speed   Render Overhead
.doc        68.9 Mbps         0%
.docx       43.1 Mbps         2.9%

3.5 Performance Evaluation

The core performance characteristics of DCR are the time required to perform the document transformation and the overhead incurred when opening the document. The document content randomization time was evaluated by performing DCR on a number of documents. The file open overhead was measured by timing the document reader opening and rendering the document, comparing the times from the original and randomized documents. I also validated that the view of the document presented to the user remained invariant by scripting Office to open the document and print it as a PDF. I compared the resulting PDFs created from the original and modified documents to ensure equivalence in rendering. The results of the performance evaluation of DCR are summarized in Table 3.2.

3.5.1 .doc DCR Performance

To evaluate the computational expense of performing the document content randomization, I measured the time to perform this operation on a 1,000-document, 249 MB set randomly selected from the Govdocs corpus [44]. The average time to perform the document fragmentation was 28.9 seconds using a single thread on a commodity server. This equates to 68.9 Mbps of throughput in a single thread. To put this execution time in perspective, I scanned the same corpus with ClamAV, which required an average of 28.7 seconds to complete. Performing this content fragmentation on a single 248 KB sample (close to the average document size) yielded an average execution time of 0.028 seconds. The DCR operations are similar in cost to those incurred by a common antivirus engine and result in a delay that should be acceptable for most situations.

To test the performance impact of DCR on document opening and rendering, this set of benign documents was converted to PDF using Microsoft Office and PowerShell scripting. There were 39 documents that were removed from this set because they required user input to open or because printing was prohibited by Office. The most common cause of failing to print was invocation of protected view, which limits printing, apparently because the documents were created by old versions of Office (the Govdocs corpus contains some very old documents). Other obstacles to automation included prompting for a password or prompting the user as a result of automated file repair actions. In addition, following OLE file format fragmentation, an additional 125 documents opened in protected view, which prevented automated printing. These files apparently triggered some file validation heuristics in Office. The same mechanisms used to break exploits can also be used for malicious intent, such as evading virus scanners. All content was present, and it was later discovered that the validation heuristic did not trigger reliably on independent formulations of the same original document: some transformations would trigger protected view and some would not. This protection built into Office triggers on some particular block layouts, but the exact criteria were not determined by the author. If DCR is to be used widely, it would be necessary to understand and prevent triggering of this heuristic, although documents from untrusted sources (email or web) are already opened in protected view.

The test data set therefore contained 836 documents totaling 197 MB. It took about 15 minutes for the documents to be converted to PDFs, which equals just over one second per document. Across multiple trials, there was no consistent difference in speed between the original and the fragmented documents. The difference in mean open times between the original and fragmented documents was 1/50th of the 95% confidence interval. Therefore, the randomized documents take no longer to open and render. This is expected, as there is no additional work required to reassemble the randomized streams. Any effects resulting from less efficient read patterns appear to be masked by file caching.

Having converted both the original and fragmented documents to PDF documents, the resulting PDFs were compared for similarity. Since the PDFs had unique attributes such as creation times, none of the PDFs generated from rendering the original documents were identical to those generated from the fragmented documents. However, they were very similar in all respects. The average difference in size of the resulting PDFs was 40 bytes, with 513 of the PDF pairs having the exact same size. The average binary content similarity score of these derivative document pairs was 87 (out of 100) using the ssdeep utility [61]. Manual review of a small number of samples also confirmed the same content in the fragmented documents as in the original documents.
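The rendering equivalence check can be approximated with a fuzzy hash comparison such as the one sketched below. It assumes the python-ssdeep binding (ssdeep.hash / ssdeep.compare) is available and uses placeholder file names; it is illustrative rather than the scripts used for this evaluation.

import ssdeep  # assumption: the python-ssdeep binding is installed

def render_similarity(path_a: str, path_b: str) -> int:
    """Return the ssdeep similarity score (0-100) for two rendered PDFs."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return ssdeep.compare(ssdeep.hash(fa.read()), ssdeep.hash(fb.read()))

print(render_similarity("original_render.pdf", "fragmented_render.pdf"))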

3.5.2 .docx DCR Performance

The performance impact of .docx DCR was similarly evaluated. To measure the cost of performing embedded object layout randomization, I compiled a corpus of benign .docx files from the Internet, using a web search with the sole criterion of seeking .docx files. The search yielded a wide diversity of sites with no known relevant bias on the part of the researchers. This corpus consisted of 341 files weighing in at 76 MB. Executing my utility required an average of 14.3 seconds, from which I derive a single threaded bandwidth of 43.1 Mbps.

Scanning the same corpus with ClamAV required 28.0 seconds, nearly double the time required for my mechanism. The time to execute on a single 225 KB document, which was an average size document in this corpus, was 0.034 seconds.

As with .doc files, I tested the impact on rendering by converting both the original and randomized documents to PDF using Office. The outcome was a mean open time of 268.5 seconds for the original documents and 276.3 seconds for the DCR documents. This 2.9% increase in document render time following document fragmentation is greater than the 95% confidence interval for these trials. This slowdown is very likely due to the use of higher levels of compression in the zip container. By default, Microsoft Office uses deflate compression with the fastest compression level, while the randomized compression levels are spread among four levels. Indeed, the corpus of randomized documents was 8% smaller than the original document set.

This performance evaluation excluded one of the 341 documents that crashed Office post randomization. This document did not appear to be malicious in any way, but simply contained a large number of ActiveX controls that triggered a bug in Office following fragmentation. I did not determine the exact cause of this crash, but did isolate it to the fragmented OLE based ActiveX objects. Since it caused a crash instead of causing a file validation/parsing error, I do not consider it evidence of a fundamental issue with my approach, but rather a bug in Office or a special case my randomizer needs to handle.

Beyond the zip container, the vast majority of the documents in this benign corpus were not modified. Of the 341 documents, only 10 had OLE subobjects on which fragmentation was performed, including the crash inducing document. Since this number was so small, the user visible representations of these samples were validated manually. Both the original and the modified documents were opened and compared. Barring the aforementioned single document, randomizing the OLE objects embedded in .docx files maintained the integrity of the original document as presented to the user.

For both .doc and .docx files, the CPU time required to perform document randomization is reasonable, comparable with that of signature matching based detectors. The overhead on document open is negligible. I observed an issue with heuristic detections triggering protected view in about 12% of .doc files. I also seemed to trigger a bug for a single .docx file. Barring these exceptions, the transformed documents provided the same display to the user as is produced by the original.

3.6 Content Randomization Evasion

Document content randomization is effective against many exploits created without knowledge that it would be used. If it is to remain effective following wide-scale deployment, it must be resilient to evasion.

The strength of malicious payload fragmentation lies in the number of fragments required for the payload. For fragmentation to be effective, the size of the malicious payload must be larger than the fragmentation block size.

The OLE containers used in .docx heap sprays employ a default block size of 64 bytes, which is much smaller than the shellcode required for a meaningful exploit. In most of the examples I observed, the shellcode was approximately 500 bytes in length.

Table 3.3 lists the names and sizes of shellcode components provided in the Metasploit Framework4. Some of these shellcode components are intended to be combined with other components to implement full shellcode functionality. For example, the block api component provides access to the Windows API via hashes as identifiers. Most of the single and stager shellcode items implement a practically useful shellcode chain. The average size of all of these components is 289 bytes. In most situations, these shellcode blocks will be extended a small amount with exploit specific register setup and shellcode encoding. The size of the larger shellcode components is on par with the approximately 500 byte shellcode observed in the .docx scriptless heap sprays. All but the smallest shellcode components would require multiple 64 byte content blocks. The smallest components provide simple building blocks that are not useful on their own. For example, the block exitfunk component provides functionality to terminate a process and requires the inclusion of the block api component to function. Shellcode that provides enough malicious content to be useful in a real exploit is invariably larger than can fit within the 64 byte default block size imposed by content fragmentation in these examples.

Current exploits are not resilient to malicious payload fragmentation because it is not currently widely deployed. However, the documented countermeasure to limits on payload size is to perform an egg hunt per payload block, which has been styled omelette shellcode [21]. Omelette shellcode locates and combines multiple smaller eggs into a larger buffer, reconstructing a malicious payload from many small pieces. The omelette approach adds at least one more stage to the exploit, in exchange for accommodating fragmentation of the malicious content.

A typical heap spray involves filling a portion of the heap with the same malicious content repeated many times, with each repetition being a valid entry point. This approach would be altered for an optimal omelette based exploit. One would spray the heap with the omelette code solely, then load a single copy of the additional shellcode eggs into memory outside the target region for the spray.

4http://www.metasploit.com/

Table 3.3: Metasploit Windows Shellcode Sizes

Name                                   Size (bytes)
stage shell                            240
stage upexec                           398
single shell hidden bind tcp           341
single service stuff                   448
single shell reverse tcp               314
single shell bind tcp                  341
single loadlibrary                     190 + strlen(libpath)
single exec                            192 + strlen(command)
single create remote process           307
createthread                           167
apc                                    244
executex64                             75
migrate                                219
block service change description       448
block service                          448
block exitfunk                         31
block api                              137
block create remote process            307
block service stopped                  448
stager sysenter hook                   202
stager reverse tcp rc4                 405
stager reverse https proxy             274
stager bind tcp nx                     301
stager reverse ipv6 tcp nx             298
stager reverse https                   274
stager reverse tcp nx                  274
stager bind tcp rc4                    413
stager reverse tcp dns                 274
stager reverse tcp nx allports         274
stager reverse tcp dns connect only    274
stager reverse http                    274
stager reverse tcp rc4 dns             405

When multiple egg hunts are used to defeat malicious payload fragmentation, the primary mitigation power is shifted to the size of the block in which the reassembly code must reside. Each egg containing the partitioned payload could have an arbitrarily small size, with a few bytes of overhead for a marker used to locate the egg and an identifier to facilitate proper re-ordering. The size of the omelette code is invariably the bottleneck of the technique. If the omelette code can fit fully within a fragmentation block, then malicious payload fragmentation will not be effective.

Therefore, for omelette shellcode to operate, it must be loaded in a single 64 byte block or it will be fragmented and re-ordered. Most openly available examples of omelette shellcode, which are designed specifically to be as compact as possible, are about 80-90 bytes [27]. Of course, it may be possible to shrink the size of the omelette functionality in a given exploit, and probabilistic attacks are possible.

However, if the 64 byte block size provides insufficient fragmentation, this block size could be dropped to a level rendering any sort of egg hunt infeasible. The size of these blocks in OLE files is tunable. It is also noteworthy that the cutoff between normal and small block streams can be changed and that the block size for the normal streams is also tunable. Ergo, this flexibility in size applies generally to both normal and small OLE streams. Due to the arbitrary tuning of OLE block sizes, it is not feasible to prevent malicious payload fragmentation by shrinking the payload size using techniques such as omelette shellcode.
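To make the granularity argument concrete, the back-of-the-envelope sketch below counts how many 64 byte small-stream blocks payloads of various sizes occupy and the factorial upper bound on block orderings. Real layouts carry placement constraints, so the bound is loose, and the payload sizes are only representative of the figures discussed above.

import math

def fragment_stats(payload_bytes: int, block_size: int = 64):
    """Number of blocks a payload spans and an upper bound on their orderings."""
    blocks = math.ceil(payload_bytes / block_size)
    return blocks, math.factorial(blocks)

for size in (90, 300, 500, 1000):  # omelette code, typical shellcode, observed ROP chain
    blocks, orderings = fragment_stats(size)
    print(f"{size:5d}-byte payload -> {blocks:2d} blocks, {orderings} possible orderings")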

In exploring malicious payload size limitations, I use shellcode because methods such as omelette shellcode are relatively well documented. The same general principles apply to other situations such as ROP based exploits. Typical ROP chains are similar in size to the shellcode, so the fragmentation of DCFR is equally effective. The ROP chains I saw in the CVE-2013-3906 heap sprays were about 1000 bytes in length. Therefore small block OLE fragmentation should be able to disrupt ROP chains as well, even if omelette style techniques are employed. The same arguments should apply to .doc file level content randomization. To the degree that exploits cannot implement malicious payload reconstruction mechanisms, file level content randomization will remain effective.

Because document content randomization is not used widely, no examples of malicious documents could be found in the wild that used countermeasures such as omelette code. However, observations made during the manual validation performed for current exploits indicate that DCR would still be successful.

In my study of Office documents, I saw a relatively small number of exploits that were defeated by encoding based content randomization. I observed no attempts to counter this exploit protection, and there is a dearth of studies that apply to DCER evasion. As such, counterevasion strategies are necessarily speculative.

One likely DCER evasion approach would be to anticipate the encoding and adjust the payload accordingly. Some encodings are so simplistic that they could be defeated by preparing the malicious payload so that it appears as desired post encoding. For example, if base64 were a possible encoding, it would likely be possible to prepare a malicious payload that was operable following encoding despite some restrictions in content [74]. This approach would be more difficult with encoding mechanisms such as compression which have greater complexity. Even if attackers were able to circumvent the tighter constraints caused by compression, an arbitrarily large number of compression representations are possible because of the latitude afforded in compression algorithms such as deflate. Adding a custom, entropy infusing compressor to the existing DCR mechanisms would be operationally feasible.

Assuming there are enough possible encodings to make brute forcing infeasible, the indirect approach, analogous to omelette shellcode, would be to implement a decoder. If a very small decoder can be created, then it might be used to decode a larger payload. Trivial encodings such as hexadecimal or base64 may well be possible to implement in a very small decoder. Assuming an encoding method such as deflate compression is used, it is not likely that a sufficiently small decoder can be created to make this method worthwhile.

I studied the compiled size of a few common decompress-only deflate implementations designed specifically for small size567. Table 3.4 shows the size of these implementations compiled for 32 bit and 64 bit Linux. The minimum size of these implementations is about 5 KB. When compared with other decoders used in exploits, these sizes are very large. It seems that scenarios where using an over 5 KB decoder would be useful for defeating encoding based content randomization would be rare.

5https://code.google.com/p/miniz/
6https://bitbucket.org/jibsen/tinf
7http://www.zlib.net/

Table 3.4: Deflate Decompressor Size (Linux)

                    Object Size (KB)
Name                32 bit    64 bit
tinf tinfl100.c        5.1       6.3
zlib puff.c            5.4       6.9
miniz tinfl.c           15        18

When attacked directly, DCFR’s strength is driven by the minimum fragment size, which determines the number of fragments and the resulting number of possible permutations. It is not feasible to shrink a malicious payload enough to evade the granularity provided by DCFR in OLE files. DCER’s evasion resistance lies in both the constraints imposed by the encoding techniques employed and the number of possible encodings. It seems that the flexibility provided by encoding, especially compression, should allow sufficient entropy to make defeating DCER infeasible.

3.7 Content Injection and Memory Displacement

I explored alternatives for influencing process memory in Office using modifications to the document data itself, instead of the document container. For example, I attempted rearranging subcomponent reference or XML tag order, but could not find any practical way to influence memory load order without changing the representation of the document. A less sophisticated approach is to add non-visible content to consume memory and shift subsequent memory allocations. I sought to confirm that I could influence memory layout by injecting additional content.

Specifically, I sought to defeat .docx heap sprays by displacing them with benign content. To decrease the concentration of the heap spray, one injects benign content in the midst of the malicious content. Because most documents have a relatively small amount of content compared to heap sprays, which are often tens of MB or larger, a large amount of benign content must be added to the heap.

This approach, while functional, has many limitations. The primary shortcoming is that it requires loading a large amount of useless content into memory, which comes at great cost. Heap sprays have very poor performance in and of themselves. To be effective, one has to drown out the malicious content with even more inert content. This insertion of content must occur on all documents, including benign ones, for this technique to be effective. However, maldocs are free to create very large heap sprays. I sought alternative mechanisms to dilute heap sprays, but I could devise no way to consume virtual memory space without incurring resource sapping data copies into memory. Using this technique incurs considerable overhead and is probably often not feasible due to performance impacts.

In addition to the cost of all the benign content needed to dilute a heap spray, there are other challenges. For this technique to be effective, the benign content must either displace or be interspersed with the malicious content. Proper alignment can be challenging because the part of the document that implements the heap spray cannot be known a priori. In the case of Office 2007 .docx files, adding additional content required changing some semantic meaning of the document, unlike OLE file fragmentation or OOXML zip re-encoding.

To demonstrate the effectiveness of content based memory consumption, a utility was created to modify .docx files such that, when opened, the heap was sprayed with benign content. This method of spraying inert content without the use of scripting is similar to that used in contemporaneous exploits [4]. This technique was pioneered in CVE-2013-3906 exploits, but has since been used in connection with other vulnerabilities and file formats [66].

ActiveX controls with a large amount of superfluous benign content are inserted at random locations in the document. These items are set to be hidden so that they do not impact normal content. No method of adding benign content to the heap without making additions to the document could be found. However, under normal conditions, these items are hidden, so the document renders as usual.

Inspection of the memory using dynamic analysis showed that the benign and malicious sprays were both loaded in memory. The individual benign and malicious objects were shuffled together such that the two sprays were interspersed according to the location of these objects in the document.

After performing manual validation via dynamic analysis on a small number of CVE-2013-3906 samples, it was confirmed that the malicious heap sprays were diluted, as the ratio of exploit success matched that expected. For example, when the benign content was equal in size to the malicious heap data, the exploit success rate was consistent with the estimated rate of 50%. When the benign content was increased, the exploit success rate dropped proportionally. I found that ensuring proper alignment, and therefore optimal mixing of the benign and malicious sprays, is challenging, resulting in deviations from the anticipated exploit success rate.
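A simple model of this dilution effect, assuming the exploit's entry point lands uniformly within the sprayed region, is that the expected success rate is the malicious fraction of the spray. The sketch below is purely illustrative and the sizes are hypothetical.

def expected_success(malicious_mb: float, benign_mb: float) -> float:
    """Expected exploit success rate under uniform heap spray dilution."""
    return malicious_mb / (malicious_mb + benign_mb)

print(expected_success(20, 20))  # equal benign and malicious content -> 0.5
print(expected_success(20, 60))  # three parts benign to one malicious -> 0.25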

Adding content to influence memory layout resulted in generally poor performance. Due to compression and the ability to reference data stored in the document multiple times, it was possible to incur only minor file size increases. However, to influence memory layout, memory had to be used, and the processing to consume the memory space was expensive. I found that optimal memory displacing benign objects added about 10 seconds of load time per 100MB of memory filled.

The performance characteristics of influencing memory layout through memory consumption, the difficulty of ensuring that benign content is interspersed with malicious content, and the difficulty of overpowering potentially large heap sprays with benign content make heap spray dilution impractical. Document content based mitigations using displacement would be analogous to an operating system memory protection that, instead of using virtual memory based techniques to implement ASLR, used actual allocations that consumed memory to introduce entropy into memory addresses.

Despite its high cost, interspersing malicious content with inert content is possible using document content modifications. The ability to dilute heap sprays and drop exploit success rates was demonstrated. Since it is relatively easy to ensure that inert objects are loaded early in the document, reliable shifts to the heap are possible. While the document had to be modified, the objects used to consume memory were marked as hidden, such that they did not change the rendering of the document.

It is possible that other exploit tactics, including memory corruption, object confusion, use after free, or other ASLR bypass techniques, may provide situations where a limited content based memory dilution or shift could break exploits that would not be mitigated by OS level mitigations [25]. If situations are found where a small shift in the heap is useful for defeating exploits, the memory consumption based technique could be useful in practice.

3.8 Discussion

Not all exploits are directly impacted by DCR and some vulnerabilities may be formulated to circumvent DCR. For example, the malicious documents foiled through OLE file randomization could be modified to load the final malicious executable through a web download instead of extracting it from inside the document file. Similarly, the OOXML documents defeated through memory content location randomization could use a scripted heap spray instead of relying on document content loaded into memory. However, these changes might cause the exploit to run afoul of additional mitigations such as restrictions on the ability to download executables or restrictions on the execution of macros. Hence, DCR is enabled by environmental controls such as restrictions on web downloads, Office based protections such as disabling of scripting, and operating system controls such as DEP. If these complementary protection mechanisms are not used, DCR will not be as effective. To the degree that security controls that drive attackers to use raw file content become more prevalent, DCR should increase in applicability, including in other file formats.

Some forms of DCR are more difficult to circumvent than others because they operate much earlier in the exploitation process, where the attacker has less control over the system. For example, DCR that defeats heap sprays is more resilient than DCR that disrupts egg hunts that extract the final malicious payload. In my evaluation, the older exploits were interrupted later in the exploitation process while the newer exploits were interrupted much earlier. It appears that complementary mitigations in the operating system (ASLR and DEP) constrain exploit authors to use document content earlier in the exploits.

DCR is an attractive mitigation technique because it incurs a very low performance impact. Transforming the document requires roughly the same computational resources that are already commonly employed to perform signature matching on both network servers and client programs. DCR incurs a very small performance penalty when the transformed document is opened because this mechanism leverages the file stream reassembly routines already executed by the document reader.

Just as virtual memory mechanisms enable ASLR with little overhead, the parsing and reassembly that enables multiple file level representations of the same logical document allows for efficient DCR. Any situation where data is referenced indirectly, providing for multiple possible low level representations, could potentially be used to implement exploit protections similar to DCR. I focus on content fragmentation because the file formats studied here support a large degree of layout changes. Content encoding randomization is only effective in a small number of Office exploits. However, other document and media formats might not support the same level of data fragmentation but may support arbitrary encoding or compression. The PDF format is a good candidate for file level DCER to prevent raw file reflection based malware retrieval. There is an opportunity for studying the limits of DCER, especially in document formats such as PDF where there are multiple options for encoding, the encodings can be combined for the same stream, and encoding mechanisms themselves can be tweaked. For example, instead of using standard compression levels for the deflate method, one could use probabilistic Huffman coding trees and randomized use of LZ77 data deduplication. Operating system based encoding or data randomization techniques generally have been unsuccessful due to computational overhead and the difficulty of deploying the technique, which requires modifying system libraries as well as applications. However, DCER has the potential to be computationally feasible because the content encoding already occurs.

DCR is likely to be employed in situations where many repeated exploitation attempts are not easy, lowering the concern of probabilistic attacks. For example, document based attacks usually require the user to take an action to view the document. Because of how client applications are used, probabilistic attacks requiring numerous attempts, similar to those employed against network daemons to defeat ASLR, are not likely to be possible.

While DCR does not impact the content of the document as interpreted by the document reader and viewed by the user, it does change the raw document file. This could potentially impact signature matching systems that operate on raw files instead of interpreting files as the document reader does. Also, cryptographic signatures such as those used in signed emails would not validate correctly on the transformed document. Solutions to these issues have yet to be elaborated, but promising approaches exist. For example, signature matching systems can implement file parsing. Signature validation systems could operate on an invariant logical representation of the parsed document, instead of a potentially arbitrary file level representation.

Chapter 4: Structural Feature Based PDF Malware Classifier

I recognize that preventing all exploitation is not possible. Even when exploits can be prevented, it is often advantageous to detect attacks so that current threats can be comprehended. Because of this, I sought to improve detection in commonly exploited file formats. Seeking to complement existing detection mechanisms, including signature matching, I studied machine learning based approaches. I found that features based on structural and metadata elements were robust for classifying PDF documents. Evaluating multiple learning algorithms, I found Random Forests to be effective. I named this PDF malware classifier PDFrate.

In evaluating my approach, I found that PDFrate provides high classification rates through cross-validation and on malware not included in the training set. PDFrate also separates commodity crimeware from malware used in targeted attacks. PDFrate is implemented as a free online service.

4.1 PDF File Format

The PDF file format is documented in ISO 320001. PDF documents are based on the hierarchical organization of many objects. For example, documents are made up of pages that can contain forms, images, fonts, and text. The relationships between objects are tracked through references based on object identifiers. The PDF format utilizes various text based markers to delineate objects contained in the document and their relationships. For example, pages are contained in objects preceded by a /Page marker, fonts exist within a /Font object, and JavaScript objects are denoted with a /JS or /JavaScript tag. Appendix A demonstrates the structure of two PDF documents.

1http://www.iso.org/iso/catalogue_detail.htm?csnumber=51502

While the structure of a PDF document is text based, the format allows binary data streams such as images to be embedded in the document. Furthermore, the PDF format allows data to be compressed and encoded with a large number of methods. PDF documents also permit inclusion of dynamic content based on JavaScript, forms, and other media formats such as Flash. Malware authors use the flexibility of the PDF format to advance their exploits and evade detection.

4.2 Feature Extraction

I implemented my own routines for reliably extracting metadata and structural information from PDF documents to ensure computational efficiency and resiliency in the face of malformed documents. Regular expressions are applied to the raw document to identify and extract data for further processing, if necessary. Many of the features can be derived from simple string matching. For example, the number of font objects is determined by counting the number of /Font markers. Some features require simple comparisons of string matches. For example, the size of stream objects is determined from the relative location of the start of stream and end of stream markers. Many features require extracting specific data for further normalization or processing. Examples include the dimensions of a box object or the number of lower case characters in the title. This software functions without significant PDF structure parsing or validation. This approach is beneficial when dealing with malformed documents, and it provides strong performance. This fast and loose metadata extraction intentionally results in some inaccurate or ambiguous extracted data, but the end features are deterministic. For example, a PDF edited with incremental updates may have extracted features that report an inflated object count. Additionally, instead of trying to determine the correct value in the case of multiple instances of a metadata item repeated in a document, other features such as the number of times the values differ are used. Feature extraction and classification works well even on encrypted documents because in PDF documents each object/stream is encrypted individually, leaving structure and metadata to be extracted the same as in normal documents.
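The sketch below conveys the flavor of this regular expression driven extraction. It is not PDFrate's extractor; the marker patterns and feature names are simplified approximations of those described in this chapter, and the input path is a placeholder.

import re

def extract_features(raw: bytes) -> dict:
    """Derive a few structural features directly from the raw bytes of a PDF."""
    features = {
        "count_font":       len(re.findall(rb"/Font\b", raw)),
        "count_javascript": len(re.findall(rb"/JavaScript\b", raw)),
        "count_js":         len(re.findall(rb"/JS\b", raw)),
        "count_obj":        len(re.findall(rb"\bobj\b", raw)),
    }
    # Difference between "stream" and "endstream" markers, as described above.
    streams = len(re.findall(rb"\bstream\b", raw))
    endstreams = len(re.findall(rb"\bendstream\b", raw))
    features["count_stream_diff"] = streams - endstreams
    return features

with open("sample.pdf", "rb") as fh:  # placeholder path
    print(extract_features(fh.read()))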

The majority of the metadata items are inherently numeric. The other features are all extracted or transformed to make them numeric. There are largely no differences in how binary, discrete, and continuous data are handled, although some of each type exist.

4.3 Feature Selection

I selected a large number of features to characterize PDF documents. My aim was to provide strong classification quality, including the ability to reliably distinguish targeted attacks from opportunistic attacks. In addition, the approach taken here seeks to be resilient to differences in threats and vulnerabilities by focusing on patterns in documents that apply broadly. Therefore, features are derived either from PDF document metadata or structure.

In my approach, the extracted features are designed to eliminate reliance on specific strings or byte sequences. For example, when dealing with data that might represent artifacts of specific actors, such as the author metadata item, abstracted features, such as the number of characters in the author field, are used. Similarly, I intentionally avoided features that are tightly related to specific vulnerabilities but have little general application, because including these features could result in strong classification for known attacks while yielding low detection rates for novel attacks.

The philosophy for feature identification was to generate as many features that parameterize the metadata and structure of the document as possible, without short-sighted regard for the usefulness of individual features in discriminating between document classes. The features reflect properties of the metadata, such as the count of the characters in each metadata field; objects/streams, such as the size and count of each; boxes and images, such as the size and location of each; data encoding methods, such as use of each data encoding method; and object types, such as count of encryption objects. In total, 202 features were chosen for use.

A brief description of all the features is given in Appendix B. The names of the top features are shown in Figure 5.1. More detailed descriptions are given here for a few of the features. The count font, count javascript, and count js features are determined by the number of instances of /Font, /JavaScript, and /JS markers. The count stream diff variable is the difference of the instances of “stream” and “endstream” markers. The relative position in the document of the last box marker (box objects are used in layout) is reported as pos box max. The sum of all the pixels in all the images in the document is named image totalpx. producer len is the number of characters in the producer metadata object and count obj is the number of instances of the “obj” marker. Each document has a unique document identifier that should never be modified between revisions, which is called pdfid0; pdfid0 mismatch reflects the number of unique instances of pdfid0 values in a document (which is usually one). These features are identified by either simple string matches or more complex regular expressions applied to the raw document.

Most features are taken directly from observation of the document metadata or document structure, such as the number of font objects in the document. A few of the features are further refined by transformation of one or more elements. For example, one feature is the ratio of the number of pages to the size of the whole document.

4.4 Machine Learning Algorithm Selection

The proper selection and tuning of the machine learning mechanism is important for the overall performance of my approach. Both Support Vector Machines (SVM) and Random Forests were evaluated. Random Forests was found to be the most effective, but Support Vector Machines was competitive.

In determining the best machine learning algorithm, both Random Forests and Support Vector Machines were studied extensively, including tuning their respective parameters. These methods were chosen for primary focus because they both perform well on the type of data and features used in this study. Also, these mechanisms perform well in other similar research. The details of tuning each method are found hereafter. Other mechanisms, such as naive Bayes and artificial neural networks, were explored in a limited manner but were not studied extensively due to poor initial results.

As shown in Sections 4.4.1 and 4.4.2, when both are tuned optimally, Random Forests performs about half an order of magnitude better than Support Vector Machines. Random Forests is used as the primary machine learning method throughout this thesis.

4.4.1 Support Vector Machines

Support Vector Machines is a kernel method based classifier. SVMs operate by finding an optimal boundary, or margin, that separates the classes of the data in the training set. To make this practical on multidimensional data sets, the kernel trick is used, where the data is mapped into a high dimensional feature space using a kernel function. The classifier model consists of the parameters of the boundary that separates the classes and the parameters of the kernel that define the feature space mapping. When a new observation is classified, the data is normalized and the boundary is applied in the kernel function based feature space to determine the classification.

There are a few parameters that need to be tuned to optimize SVM. The primary option is the kernel type. Four kernel types were evaluated: polynomial, linear, sigmoid, and Gaussian (radial basis function). In addition, the various parameters for SVM and those specific to the various kernel functions were optimized. The optimal values for these tunables were found through a grid search of the various parameters using cross validation on the generic features from the training set. The 10-fold cross validation was repeated ten times per parameter permutation and the results averaged.
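Under the assumption of a scikit-learn style workflow (not the tooling actually used for this dissertation), the kernel and parameter grid search could be sketched as follows; the grids are abbreviated and X_train/y_train stand in for the 202 training features and their labels.

from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [1, 10, 100]},
    {"svc__kernel": ["rbf", "sigmoid"],
     "svc__C": [1, 10, 100], "svc__gamma": [1e-6, 1e-4, 1e-2, 1e-1]},
    {"svc__kernel": ["poly"], "svc__degree": [2, 3, 4], "svc__coef0": [-1, 0, 1],
     "svc__C": [1, 10, 100], "svc__gamma": [1e-6, 1e-4, 1e-2, 1e-1]},
]
pipe = make_pipeline(StandardScaler(), SVC())          # normalize, then SVM
search = GridSearchCV(pipe, param_grid,
                      cv=RepeatedKFold(n_splits=10, n_repeats=10),
                      scoring="accuracy", n_jobs=-1)
# search.fit(X_train, y_train)  # X_train, y_train: hypothetical training data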

An overview of the optimal classification performance by kernel type is given in Table 4.1.

Table 4.1: Optimal SVM Error Rates by Kernel Type

Kernel           Error Rate
Polynomial       0.53%
Linear           0.83%
Sigmoid          0.98%
Gaussian (RBF)   1.6%

The results of the grid search for each kernel type are reported in Tables 4.2, 4.3, 4.4, and 4.5. For brevity and simplicity, only variations in cost and gamma are shown in these tables. For the kernels with additional parameters, the data presented represents the optimal value for these other parameters. In the case of the polynomial kernel, the data presented is for a coefficient of 1 and a degree of 2, while the grid search included values of -1, 0, 1 and 2, 3, 4 for these parameters respectively. For the sigmoid kernel, the values -1, 0, 1 were used for the coefficient and the results with this set at 0 are presented.

Table 4.2: SVM Tuning: Error Rates for Polynomial Kernel

            cost
gamma       1        10       100
10e-6       39%      4.2%     2.6%
10e-5       4.2%     2.6%     1.3%
10e-4       2.6%     1.3%     0.90%
10e-3       1.1%     0.76%    0.57%
10e-2       0.62%    0.53%    0.54%
10e-1       0.68%    0.26%    0.68%

Table 4.3: SVM Tuning: Error Rates for Linear Kernel

            cost
gamma       1        10       100
10e-6       0.83%    0.89%    0.92%
10e-5       0.83%    0.89%    0.92%
10e-4       0.83%    0.89%    0.92%
10e-3       0.83%    0.89%    0.92%
10e-2       0.83%    0.89%    0.92%
10e-1       0.83%    0.89%    0.92%

4.4.2 Random Forests

The Random Forests classification method gives the result of classification based on the output of many individual classification trees, each of which votes for one of the possible classes. Each decision tree is generated from a randomly selected subset of training data. Hence, Random Forests is an ensemble classifier using bagged training data. Each node in a tree is created by selecting a random subset of features and determining the best split at that node using the training data for that node. Furthermore, each tree is based on an independent subset of features. Lastly, during classification the votes of each tree determine the result.

Table 4.4: SVM Tuning: Error Rates for Sigmoid Kernel

            cost
gamma       1        10       100
10e-6       50%      5.0%     3.3%
10e-5       5.0%     3.3%     1.5%
10e-4       3.3%     1.5%     0.98%
10e-3       1.6%     1.6%     2.2%
10e-2       4.7%     5.2%     5.1%
10e-1       13%      14%      14%

Table 4.5: SVM Tuning: Error Rates for Gaussian (RBF) Kernel

            cost
gamma       1        10       100
10e-6       29%      4.4%     2.8%
10e-5       4.6%     2.9%     1.8%
10e-4       3.2%     1.9%     1.6%
10e-3       2.3%     1.8%     1.7%
10e-2       3.0%     2.6%     2.6%
10e-1       4.8%     4.6%     4.6%

The Random Forests algorithm has two primary parameters that were tuned for this study. The parameter “ntree” dictates the number of trees to grow in the classifier. In addition, “mtry” controls the number of features sampled at each node in the trees. By default, a Random Forest is constructed using the square root of the number of variables for mtry and 500 for ntree. To properly tune these parameters, multiple values were tested with the classification error being compared. For the results shown in Table 4.6, the average of 10 independent trials of ten fold cross validation was used on the training data set. The values for the parameters are given as the ratio of the default values.

Table 4.6: Random Forest Tuning: Classification Error Rates

            ntree
mtry        .5       1        1.5      2
1           0.25%    0.23%    0.25%    0.24%
1.5         0.22%    0.22%    0.22%    0.22%
2           0.21%    0.20%    0.21%    0.21%
2.5         0.19%    0.19%    0.21%    0.19%
3           0.20%    0.20%    0.19%    0.20%
3.5         0.20%    0.18%    0.19%    0.19%

It should be noted that the variance between individual cross-validation runs was relatively high when compared to the average error rate itself. In fact, the mean of the standard deviations in the trials with the same parameters was .088%. The relatively low differences between the various tuning parameters and the relatively high variance exhibited between cross-fold validation trials indicate that the actual values chosen have very little reliable impact on classification rates.

The majority of the parameter values tested perform equally well. I select double the default for ntree and triple the default for mtry, which are among the optimal levels. These parameters are used throughout this thesis. As data sets changed, this optimization was occasionally re-tested, with this selection consistently yielding among the optimal classification rates. Since ntree is not dependent on the data set, a constant 1000 is used for ntree, while three times the square root of the number of variables is used for mtry.
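Expressed with scikit-learn (a different implementation than the one used in this work), the selected configuration corresponds roughly to the sketch below; the helper name and the commented fit/predict usage are illustrative.

import math
from sklearn.ensemble import RandomForestClassifier

def build_forest(n_features: int) -> RandomForestClassifier:
    """1000 trees; mtry set to three times the square root of the feature count."""
    mtry = min(n_features, 3 * int(math.sqrt(n_features)))
    return RandomForestClassifier(n_estimators=1000, max_features=mtry, n_jobs=-1)

clf = build_forest(202)  # 202 structural and metadata features
# clf.fit(X_train, y_train); votes = clf.predict_proba(X_test)[:, 1]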

Given optimal parameters for each classifier, a summary of the classification and computational performance is given in Table 4.7. While Random Forests provides better classification, its training time with the chosen parameters is higher than that of Support Vector Machines. The time to classify unknown observations is essentially identical. Drawing conclusions from the tuning sections for the respective learning methods, SVM is more dependent on proper tuning, as it has a much larger variance in outcomes as the classifier parameters are tuned.

For this study, Random Forests is used primarily due to its edge in classification performance and much lower reliance on appropriate tuning. However, Support Vector Machines can be tuned to have classification rates that are very competitive, and both are computationally efficient for the task of classifying new observations. I also found Random Forests to be more resistant to evasion attacks (see Chapter 5).

Table 4.7: Classifier Performance Comparison

Classifier                Error Rate   Train Time   Classify Time
Random Forest             0.20%        260 sec      0.04 sec
Support Vector Machine    0.53%        4.9 sec      0.04 sec

4.5 Classification Labels

In our experiments, documents are classified as either benign or malicious, with malicious being further split into two categories: opportunistic and targeted. These are abbreviated “ben”, “mal”, “opp”, and “tar”, respectively. This arrangement, which is achieved using a dual-stage classifier, is represented in Figure 4.1. To be classified as malicious, documents must exploit a software vulnerability and execute malicious code. For the purpose of this study, documents that contain text instructing the reader to perform malicious actions (e.g., wire money), that contain hyperlinks to malicious content where the user must click, etc., are considered benign. Targeted attacks are separated from opportunistic ones by factors such as victim specific social engineering and correlations with other known targeted attacks.
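A minimal sketch of this dual-stage arrangement follows, assuming two already trained classifiers with scikit-learn style predict() methods that return the labels used above; the function and argument names are hypothetical.

def classify_document(features, ben_mal_clf, opp_tar_clf) -> str:
    """Return "ben", "opp", or "tar" for a single feature vector."""
    # Stage 1: benign versus malicious.
    if ben_mal_clf.predict([features])[0] == "ben":
        return "ben"
    # Stage 2: only documents labeled malicious reach the opp/tar split.
    return opp_tar_clf.predict([features])[0]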

4.6 PDFrate Evaluation

We evaluated the performance of our classifier, which uses features taken from document metadata and structure. We studied the adequacy of the features selected and the resulting classification performance. Classifier performance is demonstrated through application to the training data set, use on the independent operational data set, application to multiple variants, and comparison to other contemporary techniques.

[Figure 4.1: Dual Classifier Arrangement. The input document is first classified as ben or mal; documents labeled mal are then classified as opp or tar, yielding the final labels ben, opp, and tar.]

4.6.1 Evaluation Data

There are two primary data sources used in this study. The first is the widely available Contagio data set [85], which is designated for signature research and testing. This data set was selected because it contains a large number of labeled benign and malicious documents, including a relatively large number from targeted attacks. This source provides a few collections of documents. All of the PDF documents from “Collection 1: Email attachments from targeted attacks” were used as targeted malicious documents. The documents from “Collection 4: Web exploit (I think they all are pdf) files” are used as malicious documents. The vast majority of the documents from Collection 4 were attributed to opportunistic threats and were used as such, with a few exceptions. There were 10 identical documents in Collection 1 and Collection 4; these were considered targeted. Also, a few additional documents were identified as targeted through manual inspection and correlation to other targeted attacks. Lastly, “Collection 5: Non-Malicious PDF Collection” was used for benign PDF examples. A total of 10,000 documents were used from the Contagio data set as training data.

The second collection is taken from monitoring a large university campus network. These documents were extracted from HTTP and SMTP traffic. The bulk of this collection was taken from approximately six days of capture. Because this data is taken from a real data feed, it is termed the “operational” data set. The operational data set required labeling by the researchers to be useful for evaluation. To separate the malicious documents from the benign, a combination of five common virus scanners was used. The corpus was scanned with the virus scanners until signature updates ceased adding detections. The virus scanner detections continued to improve until approximately 10 days following the end of the collection. Note that no single virus scanner detected all the malicious documents. All the documents that were flagged by only a single scanner were subjected to additional virus scanning and manual analysis. Two of the twelve documents identified by a lone AV scanner as malicious were found to be benign. All of the malicious documents from the six-day collection were considered opportunistic, as there was no evidence to support labeling them as targeted. Eleven malicious PDFs associated with targeted attacks on the same organization, but representing multiple victims and attack groups, were added to this collection. These targeted PDFs were observed on the campus network over the span of approximately 18 months and were collected in the same manner. The addition of these targeted attacks was necessary to allow the operational data set to be used to evaluate targeted attack detection. A total of 100,000 unique documents were used from the operational data set.

In both data sets, only unique documents were used. Sampling, when it occurred, was random. Table 4.8 summarizes these data sets by displaying the number of documents of each class. Note that the training data set includes equal parts benign and malicious documents, which is desirable for training. The operational data set’s ratio of benign to malicious is intended to mirror a typical operational environment and to provide insight into detection rates and false positives in the real world. The number of targeted documents is undesirably low but is the best that could be obtained given their scarcity. The operational

Table 4.8: Data Set Summary

                      Training    Testing/Operational
benign (ben)          5,000       99,703
opportunistic (opp)   4,802       286
targeted (tar)        198         11
total                 10,000      100,000

and training data sets are completely independent. The training set was compiled months before the operational data set.

4.6.2 Adequacy of Features

I sought to determine the degree to which the features I selected were adequate for accurate classification. In Figure 4.2, I depict the classification error as features are added to the benign/malicious classifier. The average and 95% confidence interval of the classification error of multiple randomly selected subsets of features are presented.
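This experiment can be reproduced along the following lines (a sketch in Python with scikit-learn; the feature matrix X and labels y are placeholders, and the original experiments were not run with this code):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def error_for_feature_count(X, y, n_features, trials=10, seed=0):
        # Mean classification error (%) and 95% confidence interval over several
        # randomly chosen feature subsets of the given size.
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(trials):
            cols = rng.choice(X.shape[1], size=n_features, replace=False)
            accuracy = cross_val_score(RandomForestClassifier(n_estimators=100),
                                       X[:, cols], y, cv=10).mean()
            errors.append(100.0 * (1.0 - accuracy))
        errors = np.array(errors)
        half_width = 1.96 * errors.std(ddof=1) / np.sqrt(trials)
        return errors.mean(), half_width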

Figure 4.2: Error Decrease with Feature Count (ben/mal). [Plot: classification error (%, with 95% confidence intervals) versus number of features, 0 to 200.]

4.6.3 Classification & Detection Performance

The classifier was applied to the training data set using 10-fold cross-validation. The results from each fold are averaged to produce a single outcome for the whole set. These resulting

Receiver Operating Characteristic (ROC) curves for the ben/mal and opp/tar classifiers are displayed in Figure 4.3a and Figure 4.3b, respectively.

Figure 4.3: ROC for Training Set. [(a) ben/mal: true positive rate (mal as mal) vs. false positive rate (ben as mal); (b) opp/tar: true positive rate (tar as tar) vs. false positive rate (opp as tar).]

In addition, Table 4.9 and Table 4.10 list select data points from these graphs. The cutoff reported in these tables is the percentage of votes that an observation must exceed to be assigned to the positive class (mal for ben/mal, tar for opp/tar).
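For reference, the rows of these tables can be derived from per-document vote fractions as in the sketch below (Python; the arrays votes and y_true are hypothetical placeholders for the cross-validation output):

    import numpy as np

    def fp_tp_at_cutoffs(votes, y_true, cutoffs=(0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)):
        # votes: fraction of ensemble votes for the positive class per document
        # y_true: 1 for the positive class (mal or tar), 0 otherwise
        votes, y_true = np.asarray(votes), np.asarray(y_true)
        negatives, positives = np.sum(y_true == 0), np.sum(y_true == 1)
        rows = []
        for cutoff in cutoffs:
            predicted = votes > cutoff            # must exceed the cutoff
            fp = int(np.sum(predicted & (y_true == 0)))
            tp = int(np.sum(predicted & (y_true == 1)))
            rows.append((cutoff, fp / negatives, tp / positives, fp, tp))
        return rows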

Table 4.9: FP/TP Rates: Training Set (ben/mal)

Votes   FP Rate   TP Rate   FP Count   TP Count
0.7     0         0.9942    0          4971
0.6     0         0.9962    0          4981
0.5     0.0008    0.998     4          4990
0.4     0.0014    0.9986    7          4993
0.3     0.0034    0.999     17         4995
0.2     0.0076    0.999     38         4995
0.1     0.0208    0.9998    104        4999

Table 4.10: FP/TP Rates: Training Set (opp/tar)

Votes   FP Rate   TP Rate   FP Count   TP Count
0.7     0.00125   0.8889    6          176
0.6     0.00167   0.9195    8          182
0.5     0.00271   0.9495    13         188
0.4     0.00333   0.9545    16         189
0.3     0.00437   0.9595    21         190
0.2     0.00583   0.9595    28         190
0.1     0.00854   0.9645    41         191

The classifier (trained with the training set) was applied to the operational data set collected from live network observation. In lieu of presenting the ROC graphs of the classifiers applied to the operational data set, Figures 4.4 and 4.5 contain the density plots of the votes for the two classes in each of the binary classifiers.

Figure 4.4: Classification Votes Density (ben/mal). [Density of normalized voting scores for the ben and mal classes.]

Figure 4.5: Classification Votes Density (opp/tar). [Density of normalized voting scores for the opp and tar classes.]

These plots show the separation between the classes in each classifier. Perfect classification would result in the negative class being a spike at zero and the positive class being located at one. Note that while the operational data set is largely independent of the training set, the classifiers still provide strong discrimination between the classes, even if some trees contribute an incorrect vote. These plots also clearly demonstrate the operator's ability to tune the sensitivity of the classifier by adjusting the vote threshold.

In Table 4.11 and Table 4.12 I list select data points for the ROC from this data.

Table 4.11: FP/TP Rates: Operational Set (ben/mal)

Votes   FP Rate   TP Rate   FP Count   TP Count
0.8     0.00021   0.9327    21         277
0.7     0.00057   0.9461    57         281
0.6     0.00132   0.9529    132        283
0.5     0.00244   1.0000    243        297

Table 4.12: FP/TP Rates: Operational Set (opp/tar)

Votes   FP Rate   TP Rate   FP Count   TP Count
0.8     0.0000    0.82      0          9
0.7     0.0000    0.82      0          9
0.6     0.0035    1.00      1          11
0.5     0.0105    1.00      3          11

Table 4.13: Variants in Data Sets

                Training   Operational   Overlap
opp samples     4,802      286           -
opp variants    812        31            6
tar samples     198        11            -
tar variants    186        9             1

4.6.4 New Variant Detection

To demonstrate the resiliency of structural and metadata features across differing data sets, the results of the off-the-shelf antivirus scanners were used to distinguish “variants” of very similar malicious documents. The results of the five antivirus scanners were combined into a variant identifier, which is the 5-tuple of the signature names reported by the scanners. The results of categorizing the samples into variants are shown in Table 4.13.
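A minimal sketch of this grouping, assuming a hypothetical mapping from sample identifiers to the 5-tuple of signature names:

    from collections import defaultdict

    def group_variants(av_results):
        # av_results: {sample_id: (sig_engine1, ..., sig_engine5)}, the 5-tuple of
        # signature names returned by the five antivirus engines for each sample
        variants = defaultdict(list)
        for sample_id, signature_tuple in av_results.items():
            variants[signature_tuple].append(sample_id)
        return variants

    # Variant counts and overlap between two labeled data sets:
    # training_variants = set(group_variants(training_av))
    # operational_variants = set(group_variants(operational_av))
    # overlap = training_variants & operational_variants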

This categorization of samples into groups of variants resulted in a significant reduction in the opportunistic samples but a relatively minor reduction in the targeted samples. This is consistent with the wider and more automated distribution of opportunistic samples, as well as the more exclusive distribution of targeted samples and their higher likelihood of including manual modifications to support AV evasion.

To the degree possible to discern from the names, the AV signatures are related to the exploit and javascript used in exploitation. There were relatively few signatures that appeared to be related to the actual malware families concealed in the document. Indeed, many of the documents merely contain shellcode which in turn downloads specific malware.

Of the two groups of targeted variants in the operational data set, one pair was attributed to the same persistent actor/embedded malware family, while the other pair of documents are from separate actors/embedded malware families. In both cases, the documents had very similar structure and metadata, leading to the conclusion that they were derived from the same document template.

There is a relatively small amount of overlap in the variants from the training and operational data sets. The ability of the classifier to effectively classify new variants that were not included in the training set demonstrates that the features used for classification are durable. Variations in malicious documents that require unique signatures can be detected with a common classifier based on metadata and structure.

4.6.5 Comparison to PJScan

To demonstrate the effectiveness of the mechanisms presented here and to add further validation to the quality of the data sets used, the results of using PJScan are also presented. PJScan is not designed to separate targeted from opportunistic attacks. Thus, I only used benign/malicious classification to separate malicious from benign documents regardless of the type of threat. The classifier was trained on the 5,000 malicious documents in the training set with default parameters. Results are shown in Table 4.14. PJScan only returns a result for documents from which it can extract javascript. Therefore, in my experiments, I only count the documents for which a result is returned, and the classification error rate is given for those documents only.

Table 4.14: PJScan Results

Data Set      Class   Classified   Not Classified   Classification Error   Detection Rate
Training      ben     5%           95%              1%                     -
Training      mal     85%          15%              9%                     78%
Operational   ben     3%           97%              1%                     -
Operational   mal     17%          83%              36%                    3%

PJScan was unable to classify many malicious documents. Manual analysis reveals that many of the malicious documents PJScan cannot analyze do contain javascript, but the javascript is in atypical locations. Javascript code inside document metadata sections or inside corrupted document structures is not correctly parsed by PJScan, which is a known limitation. It appears that the newer operational data set has a much higher prevalence of these conditions, which prevent successful analysis. PJScan provides decent results for the training set, but when the trained classifier is applied to the operational data set, the quality of classification drops dramatically.

The biggest limitation of PJScan applied to the data sets in this study is the inability to successfully extract the features used for classification. The mechanisms presented here, which use simple signature matching without document parsing or decoding, compare favorably because their features can be extracted more reliably in practice. The mechanisms presented here also compare favorably when considering the adequacy and durability of the classifier when applied to the training data and extrapolated to other, independent data sets.

4.6.6 Computational Complexity

The document classification process can be divided into three logical steps: feature extraction, classifier training, and classification of new observations. Feature extraction must occur for both training data and new data to be classified. The majority of the processing in feature extraction is dedicated to matching signatures on the documents. This facet was poorly optimized in the implementation used for this study, where multiple signatures were applied to the document serially, with each of the signatures requiring another pass through the document. This implementation could be improved by making the signature matching parallel, which would improve performance by approximately an order of magnitude and put performance roughly on par with conventional antivirus scanners.
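One way to avoid a separate pass per signature is to combine the patterns into a single scan of the document, as in the sketch below (Python; the signature names and patterns shown are hypothetical and are not PDFrate's actual signature set):

    import re

    # Hypothetical signature patterns; the real PDFrate signature set differs.
    signatures = {
        "count_javascript": rb"/JavaScript",
        "count_js":         rb"/JS",
        "count_page":       rb"/Page",
    }

    # Combine all patterns into one alternation with named groups so the document
    # is scanned once, instead of once per signature.
    combined = re.compile(b"|".join(b"(?P<%s>%s)" % (name.encode(), pattern)
                                    for name, pattern in signatures.items()))

    def count_signatures(document_bytes):
        counts = {name: 0 for name in signatures}
        for match in combined.finditer(document_bytes):
            counts[match.lastgroup] += 1
        return counts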

Once the features are extracted from documents to be classified, running these observations through the classifier is extremely fast. Training the classifier is more expensive, but this only needs to occur infrequently. Table 4.15 demonstrates the run times of these

operations applied to the training data set, which contains 10,000 documents. The experiments were performed on an Intel Xeon X5550 processor running at 2.67 GHz. All the applications were executed in single-threaded mode.

Table 4.15: Run Times on Training Data

Operation                    Time
Feature Extraction           14 min
Classifier Training          38 sec
Observation Classification   1 sec

Similarly, little effort was placed into minimizing use of memory. However, for all operations, memory usage was negligible except for training the classifier, which required about 1 GB of RAM.

4.7 PDFrate Online Service

To better study adversarial learning, I provided a publicly accessible online implementation of PDFrate2. The goals were to collect additional PDF based malware samples, including novel malware and evasion attacks, and to observe attempts at other forms of adversarial learning, including training set poisoning.

The PDFrate website provides scores from classifiers based on multiple training sets. The

Contagio data set is taken from a widely available data set designated for researchers [85].

It contains 10,000 documents, evenly split between benign and malicious. The list of documents in this set is provided such that this training set can be replicated. I compiled the

University data set from various sources. It contains over 100,000 documents. There is another classifier, based on a training set created by the PDFrate user community. The

PDFrate service allows users to provide labels for individual documents and these labels are used in construction of this Community classifier.

During operation from September 2012 to March 2016, over 50,000 unique documents

2http://pdfrate.com

were submitted to PDFrate. The vast majority of these documents are associated with external studies [23, 70, 71, 107, 120]. PDFrate has been among the most popular malware detectors used in recent adversarial learning studies. While the PDFrate service was very effective at attracting practical mimicry attacks, there were few attempts at adversarial learning in the form of training set poisoning attacks. The PDFrate online service received under 100 community labels. This small amount of data makes any studies of attacks against the community driven training set impractical, as that training set is not sufficiently differentiated from other training sets.

Chapter 5: Countering Adversarial Learning

I studied the operation of PDFrate in an adversarial environment extensively. I evaluated synthetic attacks targeting the top features and demonstrated how to counter these attacks by introducing noise in the training set. Since these attacks operate on feature vectors instead of real documents, it is not clear if they are practical to execute.

As a result of making PDFrate available for public use at pdfrate.com, there were multiple studies that sought to defeat PDFrate [23, 70, 71, 107, 120]. These studies implement state of the art attacks against PDFrate. These attacks include mimicry attacks (addition of benign attributes), reverse mimicry attacks (minimization of malicious attributes), and feature extractor subversion.

I study ensemble classifier introspection. Mutual agreement analysis produces a per-observation confidence estimate by measuring the amount of coherence in voting. I find that ensemble classifier mutual agreement analysis detects most misclassifications, including evasion attempts and previously unseen malware. My approach is also effective in optimizing retraining of the classifier. I also find that feature bagging is important to building a classifier with enough diversity to withstand evasion.

5.1 Top Feature Mimicry Attack

It is important that any detection mechanism demonstrate resistance to intentional evasion.

Therefore, the robustness of the selected features under mimicry and evasion attacks is crucial to the actual detection rates that can be achieved in a real-world environment. The detection mechanism presented in this thesis is designed to classify documents based on similarity to past documents of the same class. These similarities between documents can arise from a wide spectrum of root causes varying among necessity, convenience, convention,

and ambivalence. Presumably, some of the attributes of malicious documents are easy to modify and others are more difficult. For example, while use of javascript is often not strictly required for exploiting vulnerabilities in PDF readers, it is often the most practical method for triggering many exploits. Hence the features related to the existence of javascript may be hard for attackers to modify. Alternatively, it may be trivial to spoof or remove metadata such as the producer field. It is infeasible to fully enumerate or to accurately predict all the methods used to attempt evasion, especially as some of the factors are dependent on the attacker and attack vector. Some constraints on evasion are also caused by the use of this mechanism in parallel to other techniques.

Note that the mere existence of benign elements or lack of specific malicious artifacts is not sufficient to evade detection. Some of the malicious documents evaluated in this study contained large portions of benign content. This indicates that the malware packager either intentionally added benign elements to a malicious document or that the malware was added to an existing benign document. Conversely, many malicious documents are devoid of optional metadata. Reliable classification, regardless of the existence of coincidental features, is critical.

5.1.1 Mimicry Attack Effectiveness

One likely evasion technique I anticipate is a mimicry attack where malicious documents are purposefully modified to normalize some of their features and make them similar to benign documents while still retaining the embedded malicious content. If the attacker has knowledge of specific features used in the classifier and their importance, along with a good representation of what the defender considers as normal, the attacker can focus on mimicking the features most important for classification.

To simulate mimicry of document properties, I modified the top ranked features of malicious observations and subjected these modified observations to the classifier. For simplicity's sake, the documents themselves are not actually modified, but rather the previously extracted feature sets are modified. Specifically, the mean and standard deviation of the

benign observations are calculated and the values for the malicious documents are replaced with random values that fit a normal distribution with the same mean and standard deviation. Note that this method may result in doctored features that are inconsistent or illogical. The six most important features, as ranked by the mean decrease in accuracy measurement, were selected for evasion testing. These features are ranked above the others with some amount of separation, as shown in Figure 5.1.
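A sketch of this feature-vector manipulation (Python/NumPy; X_mal, X_ben, and the column indices are hypothetical stand-ins for the extracted feature sets, and the actual experiments operated on the previously extracted features rather than with this code):

    import numpy as np

    def mimic_top_features(X_mal, X_ben, top_feature_columns, seed=0):
        # Replace the listed feature columns of the malicious observations with draws
        # from a normal distribution fit (mean, standard deviation) to the benign
        # observations, simulating a top feature mimicry attack on the feature vectors.
        rng = np.random.default_rng(seed)
        X_attack = X_mal.copy()
        for col in top_feature_columns:
            mu, sigma = X_ben[:, col].mean(), X_ben[:, col].std()
            X_attack[:, col] = rng.normal(mu, sigma, size=X_mal.shape[0])
        return X_attack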

Figure 5.1: Feature Importance (ben/mal), ranked by mean decrease in accuracy. [The top-ranked features are count_font, count_javascript, count_js, count_stream_diff, pos_box_max, and image_totalpx.]

By causing the malicious samples to mirror the top six features of the benign samples, the classifier error rate can be raised considerably, as shown in Table 5.1. The average of the results of five independent trials using 10-fold cross validation is presented.

Table 5.1: Mimicry: Classifier Error Increase

Features Mimicked       Classification Error (%)
None                    0.14
count font              12.26
(+) count javascript    17.04
(+) count stream diff   20.01
(+) count js            20.07
(+) pos box max         22.18
(+) image totalpx       22.30

By manipulating the most heavily used or distinctive features, it is possible to severely curtail the detection capabilities of the classifier.

5.1.2 Introducing Noise to Counter Top Feature Mimicry

The best reaction to changes in document attributes leading to misclassification is to retrain the classifier, causing the classifier to adjust how it treats the mimicked features. If retraining the classifier is not adequate to raise classification rates to an acceptable level, additional features can be discovered and used instead. This tactic is reactionary at best and cannot ensure detection of documents that are very dissimilar to historical examples of documents of the same class. To be able to detect intentional evasion, proactive measures must be taken.

An obvious reaction to mimicry attacks on the features heavily employed by the classifier is to remove them altogether and rely on the other features. An important distinction is that variable importance, as reported by Random Forests, is an indication of the value of the feature as used in the classifier. However, that a feature has a high importance does not necessarily mean that the feature is useful for classification on its own, nor does it mean that the classifier has to rely heavily on that feature for successful classification. Table 5.2 shows the increase in classification error as the top features are removed.

Removing the top ranked features has a surprisingly low effect on classification error because so many other useful features are retained. If the attacker is able to only modify a

Table 5.2: Classifier Error with Features Removed

Features Removed        Classification Error (%)
None                    0.14
count font              0.21
(+) count javascript    0.28
(+) count stream diff   0.28
(+) count js            0.29
(+) pos box max         0.29
(+) image totalpx       0.29

few attributes of malicious documents, and the defender is able to anticipate these, removing features may be an acceptable countermeasure. It is desirable to be able to counter evasion without fully negating the predictive value of variables targeted for evasion. One method of achieving this result is to perturb the training set such that the resulting classifier is no longer as susceptible to evasion. The perturbation is performed by artificially modifying the features of a subset of the malicious observations in the training set to increase the variance of these features, making them less uniform. The loss of a focal point due to the increased variance reduces the importance of these features without fully eliminating them.

Table 5.3: Classification Error with Training Data Perturbation

% Perturbation   Original Data   Mimicry Data
0                0.14            26.12
0.05             0.14            14.11
0.1              0.15            9.19
0.5              0.15            1.80
1                0.16            1.13
5                0.21            0.69
10               0.22            0.52
50               0.26            0.16
100              2.06            0.12

To test the effectiveness of perturbation, the same method used to simulate evasion is used to modify a subset of the observations in the training set. The top six features of

a subset of the malicious observations are set to values taken from a randomly generated normal distribution mirroring the mean and standard deviation of the benign observations.
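The perturbation can be sketched in the same style (Python/NumPy; the array and parameter names are hypothetical):

    import numpy as np

    def perturb_training_set(X_train, y_train, X_ben, top_feature_columns, fraction, seed=0):
        # Overwrite the top features of a randomly chosen `fraction` of the malicious
        # training observations with benign-fitted normal draws, increasing the variance
        # of those features and reducing the classifier's reliance on them.
        rng = np.random.default_rng(seed)
        X_out = X_train.copy()
        mal_rows = np.flatnonzero(y_train == 1)
        chosen = rng.choice(mal_rows, size=int(len(mal_rows) * fraction), replace=False)
        for col in top_feature_columns:
            mu, sigma = X_ben[:, col].mean(), X_ben[:, col].std()
            X_out[chosen, col] = rng.normal(mu, sigma, size=len(chosen))
        return X_out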

Table 5.3 shows the results of testing using the perturbation method. The average of the results of 5 independent trials using 10-fold cross validation is presented. The training data is perturbed and the resulting classifier is used both on the remaining unmodified training data and on the same training data modified to simulate mimicry evasion. The percentage of the training data perturbed is varied, demonstrating a trade-off between accuracy on historical data and evasion resistance. Hence, it is possible to defeat the top feature mimicry attacks with only a minor drop in classification accuracy by perturbing the training set. This perturbation decreases the learned model's reliance on these top features.

5.2 Independent Evasion Attacks

PDFrate was the target of many published evasion studies: Mimicus [107], EvadeML [120],

Reverse Mimicry [70, 71], and Parser Confusion [23].

5.2.1 Mimicus

Mimicus1 is a framework for performing mimicry attacks against PDFrate. It is the implementation of what is described by Šrndić and Laskov as “the first empirical security evaluation of a deployed learning-based system” [107]. It is an independent, comprehensive, and openly available framework for attacks against the online implementation of PDFrate.

Mimicus implements mimicry attacks by modifying existing malicious documents to appear more like benign documents. Mimicus adds markers for additional structural and metadata items to documents. These additions do not involve adding actual content that is interpreted by a standards conforming PDF reader, but rather these additions exploit a weakness in the feature extractor of PDFrate. The extraneous PDF attributes are added in slack, or unused space, immediately preceding the document trailer (structure at the end of the document), which is not prohibited by the PDF specification. This approach

1http://github.com/srndic/mimicus

provides considerable flexibility in the evasion attack, as the additional elements do not have to be valid. Mimicus enables a simple process for the attacker. The attacker constructs a malicious document without concern for PDFrate evasion. Mimicus then adds the necessary decoy structural elements. This mimicry attack only adds fake elements to the document

file; no existing elements are removed or modified.

Mimicus constructs these decoy elements by comparing a malicious document to multiple different benign documents. The feature vectors for the malicious documents are adjusted to mirror the feature vectors for the benign documents. These adjustments are bounded by the modification approach Mimicus uses. The candidate mimicry feature vectors are run through a local PDFrate replica to determine the scores. The best feature vector is selected.

That feature vector is used as the goal in modifying the original malicious document by adding decoy structural and metadata elements. Due to interrelated features and other complications, it is not feasible to construct a final mimicry malicious document that exactly matches the target mimicry feature vector. The resulting mimicry malicious document has a feature vector that is somewhere between that of the original Trojan document and that of a benign document. After the mimicry document is created, it is submitted to pdfrate.com for evaluation.

An important observation of the Mimicus study is that the interdependency of PDFrate's features makes mimicry attacks more difficult because modifying one feature necessarily affects other features. It is generally accepted that irrelevant or redundant features are not desirable for machine learning methods. However, in the case of PDFrate, redundant features appear to make evasion attacks, like those implemented by Mimicus, more difficult by making construction of a PDF matching a target feature vector harder.

The Mimicus attack model requires knowledge of the feature set used by PDFrate. The premise is that for a mimicry attack to be successful, at least knowledge of the type of features is necessary. Also, since this attack leverages a difference between normal PDF readers and the PDFrate feature extractor, knowledge of how to exploit this difference is also necessary. Hence, all Mimicus attack scenarios are labeled with an “F”, indicating that

the attacker used knowledge of the feature set.

Relying on the common basis of the feature extraction, the Mimicus attacks demonstrate various levels of knowledge used by the attacker. In situations where the training data and classifier are known by the attacker, replicas that are very close to the original are used.

When an attacker with limited system knowledge is modeled, reasonable substitutes are employed. The labels “T” and “C” are used to denote attacker knowledge of training data and classifier, respectively. Hence, an attack scenario with the label “FTC” denotes attacker knowledge of all three major facets of PDFrate.

The training set used by the Contagio classifier of PDFrate is publicly documented and is readily available to researchers. Hence, in attack scenarios where the training data is known by the attacker, the same data set is used by PDFrate and Mimicus. For scenarios where the attacker has no knowledge of the training set, Šrndić and Laskov compiled a surrogate training set with malicious documents sourced from VirusTotal and benign documents sourced from the Internet. In addition, they selected 100 malicious documents from within the Contagio training set for the baseline attack documents. To allow reproduction of results, all of the data sets used by Šrndić and Laskov are documented.

Lastly, to complete the offline PDFrate replica, Šrndić and Laskov used a Random Forests classifier when the classifier was known to the attacker, and a Support Vector Machine classifier to simulate the case of the naive attacker. The Mimicus study shows that when all three particulars of PDFrate are spoofed, the result is nearly identical scores from the PDFrate online and the Mimicus offline classifier, despite various implementation differences. Mimicus also implements a GD-KDE attack which seeks to attack the SVM surrogate classifier directly. This attack does not apply to Random Forests classifiers, and therefore does not directly apply to PDFrate.

5.2.2 EvadeML

Xu et al. studied yet another approach to evading machine learning based PDF classifiers [120]. Their approach, EvadeML2, is inspired by biological evolution. Xu et al. perform randomized mutations to documents operating on PDF objects. Each mutation either adds an object from a benign PDF, removes an object from the malicious PDF, or performs an object replacement. To evaluate a mutation, it is applied to multiple documents. A mutated document is subject to dynamic analysis to ensure the malicious behavior is retained. The evasiveness of the mutated document is evaluated by testing against a local replica of PDFrate. Multiple rounds of mutation are performed, resulting in evolution traces up to hundreds of mutations in length.

Xu et al. performed random mutations on 500 seed documents. They found that their technique was able to drop PDFrate scores below 50% using traces ranging from very short up to a length of 354, depending upon the document. In analyzing the modifications found to be effective in dropping PDFrate scores, Xu et al. observed that the genetic programming approach was effective in identifying features ranked highly by the classifier, but which are not high fidelity indicators of either benign or malicious documents. The stochastic mutation approach takes advantage of the same weakness that my top feature mimicry attack exploits (see Section 5.1).

5.2.3 Reverse Mimicry

Maiorca et al. also study evasion against PDFrate and other PDF document classifiers [70, 71]. They advance the Reverse Mimicry technique. Instead of adding content to a malicious document to make it appear benign (as Mimicus does), they embed malicious content into a benign PDF, taking care to modify as little as possible. The Reverse Mimicry attack implements an independent evasion approach against PDFrate.

Three different evasion scenarios are advanced by Maiorca et al. In the EXEembed scenario, a malicious executable is implanted in an existing benign PDF document. The

2http://evademl.org/

malware is executed when the document is opened. These documents utilize CVE-2010-1240. In the PDFembed scenario, a malicious PDF is embedded into a benign PDF. These embedded documents are rendered automatically when the document is opened. For evaluation, Maiorca et al. embedded a document leveraging CVE-2009-0927 into existing benign

PDF documents. Lastly, in the JSinject scenario, malicious javascript, the same used in the

PDFembed embedded document, is injected directly into the root benign document.

In order to evade detection, the Reverse Mimicry attacks focus on changing the document structure as little as possible. For example, in the EXEembed attack, a new logical version of the PDF is constructed with few new structural elements, but all the content from the original PDF is left in the file. A compliant reader will not display the content associated with the previous version of the document, but the artifacts will be analyzed by the feature extractor of PDFrate and similar detectors.

In addition to minimizing the structural artifacts of the malcode injection, Maiorca et al. make use of PDF encoding, especially stream compression, to hide the inserted content. For example, in the PDFembed attack, the malicious document is embedded in a compressed

PDF stream. Detection tools, such as PDFrate, that do not decompress the PDF streams are not able to extract features from the embedded malicious PDF.

5.2.4 Parser Confusion

Carmony et al. study the ability to defeat PDF detectors through parser confusion attacks [23]. These attacks operate by exploiting flaws or limitations in PDF parser implementations. Carmony et al. focus on various popular PDF parsing libraries and tools.

The confusion attacks leverage ambiguities in the PDF specification, misinterpretations of the specification, and incomplete implementations of the specification in the parsers used for malware analysis.

These attacks also leverage promiscuous parsing in the Adobe PDF reader, which attempts to repair and render broken or invalid documents. In addition to parser confusion attacks,

Carmony et al. also implement a reverse mimicry attack. This approach is similar to, but more advanced than, the PDFembed scenario proposed by Maiorca et al. This technique involves

embedding malicious content in many layers of encoding, including encryption techniques that are often not implemented.

Carmony et al. evaluated their approach using a single base malicious file with various parser confusion methods against multiple parsers and detection mechanisms. Since

PDFrate does not analyze PDF content (only structure and metadata), PDFrate is evaluated against the combination of all parser confusion methods and their reverse mimicry implementation.

The Mimicus, EvadeML, Reverse Mimicry, and Parser Confusion attacks all use unique paths to evade PDFrate. Mimicus uses addition of decoy objects that would not be processed by a normal PDF reader but are parsed by the simple regular expression based processing of

PDFrate. EvadeML uses genetic programming to find mutation traces that result in evasive documents. The Reverse Mimicry attacks, on the other hand, use valid PDF constructs to minimize and hide malicious indicators. The Parser Confusion attacks attempt to break the parsing routines necessary for feature extraction.

5.3 Mutual Agreement Analysis

An ensemble classifier is constructed from many base classifiers. To provide meaningful diversity in the ensemble, each individual classifier is constructed using mechanisms such as random sampling (bagging) of training data and features. Typically, the results are combined by voting, where each independent classifier gets an equal vote. The votes are summed to generate a score. If the score is over 50%, then the observation is labeled malicious. Otherwise, the result is benign.

Ensembles have been shown to improve accuracy in many use cases, including malware detection. However, we have found the primary advantage of ensemble classifiers to be that they can provide a measure of internal coherence that serves as an estimate of the classifier’s confidence of individual predictions.

In a well performing ensemble, the majority of individual classifiers provide the same vote. If the base classifiers provide conflicting votes, then the ensemble is in a state of

Table 5.4: Relative Performance of Individual Trees in Contagio Classifier, Indicated as Above (+), Below (-), or Within (0) 0.5 Standard Deviations of the Forest Average

Evasion Scenario   Individual Tree Performance
F mimicry          0++-00-+0+-0+-+0
FC mimicry         +++-+0-+0+--+000
FT mimicry         0++--00+00-000+-
FTC mimicry        -++-0+0--+0-+0++
F gdkde            -+++++--++00+-+-
FT gdkde           ++++0+--+++-++--
JSinject           +--0++-0+++00+00
PDFembed           0--+000----++---
EXEembed           -00---+0+0---+0+

disagreement and the prediction is less trustworthy. The agreement or disagreement in voting of individual contributors in the ensemble provides an estimate of the confidence of the prediction of the ensemble.

A classifier may not be able to provide an accurate response for some observations. For example, when a 50/50 vote split occurs in traditional ensembles, a prediction is provided using a method such as random selection. Most applications will treat a randomly selected prediction, made when the classifier is in total disagreement, the same as one in which all contributors vote for the same class. However, in the case of complete disagreement, the only reasonable interpretation is that the classifier cannot make a competent prediction.

Diversity in ensemble classifiers is the core attribute that facilitates mutual agreement based confidence estimates. This diversity is caused by extrapolation in individual classifiers. Barring limitations of the classifier scheme and quality of features, when an observation is close to samples in the training set, the classification is well supported and should be accurate. However, as new observations diverge farther from training samples, the classifier is forced to extrapolate. For ensemble classifiers that employ bagging effectively, the farther new observations are from classifier training, the more disagreement there will be in the ensemble.

This diversity in extrapolation is observed in the Random Forest based classifiers used in

Table 5.5: Ensemble Classifier Outcomes

Voting Score   Outcome                 Evasion Type
[0, 25]        Benign                  Strong Evasion
(25, 50)       Uncertain (Benign)      Weak Evasion
[50, 75)       Uncertain (Malicious)   Weak Evasion
[75, 100]      Malicious               No Evasion

PDFrate. Table 5.4 shows the classification performance of the first 16 trees (out of 1000) in the Contagio classifier applied to various mimicry attacks. Performance is reported relative to the forest average number of votes for the correct class, dividing at ± 0.5 standard deviations. It is observed that the vast majority of the trees have all three outcomes depending upon the evasion scenario: average (0), below average (-), and above average

(+). Hence, when applied to data distant from the training data, the accuracy of each tree varies widely among observations. There are no universally strong or weak trees. The random noise present when extrapolating far from the training data is what enables mutual agreement analysis.

To apply mutual agreement analysis generally, I propose a new outcome, uncertain, in addition to the predictions of benign and malicious. Instead of splitting the vote region in half, I split it into four quadrants. In the 0% to 25% region, the majority of the votes agree that the result is negative (benign). Similarly, in the 75% to 100% region, the majority of the votes agree that the result is positive (malicious). However, if the score is between

25% and 75%, the individual classifiers disagree and the outcome is uncertain. To support comparison with simple ensemble voting predictions, this area can be split into the other two quadrants: uncertain (benign) from 25% to 50% and uncertain (malicious) from 50% to 75%. These classification outcomes are demonstrated in Table 5.5. The uncertain rate (UR) is the portion of observations that fall within the uncertain range.

To be more precise about this concept, I introduce a metric to quantify the agreement between individual votes in an ensemble classifier:

Figure 5.2: Mutual Agreement Based on Ensemble Vote Result. [Plot of ensemble mutual agreement (%) versus ensemble classifier votes (%), showing the benign (agree), uncertain (benign), uncertain (malicious), and malicious (agree) outcome regions at a 50% agreement threshold.]

A = |v − 0.5| ∗ 2

Where A is the ensemble classifier mutual agreement rate and v is the portion of votes for either of the classes. This function is demonstrated in Figure 5.2, which also shows the classifier outcomes resulting from a 50% mutual agreement threshold. The end and middle points drive the general shape of this function. If the classifier vote ratio is either 0 or 1, then the classifier has full agreement on the result and the mutual agreement should be

1 (or 100%). If the classifier is split with 0.5 of the votes for each class, then the mutual agreement should be at the minimum of 0 (or 0%). As long as a single threshold is used, it is not important what shape is used for the lines between these end and middle points; any continuous curve would allow the selection of a given threshold on the classifier vote scores. The function need not follow the distribution of scores, for example. I choose a linear function because it is straightforward.
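Concretely, the agreement metric and the outcomes of Table 5.5 can be expressed as in the following sketch (Python; vote_fraction denotes the portion of ensemble votes for the malicious class):

    def mutual_agreement(vote_fraction):
        # A = |v - 0.5| * 2, where v is the portion of ensemble votes for either class
        return abs(vote_fraction - 0.5) * 2

    def outcome(vote_fraction, agreement_threshold=0.5):
        # Map an ensemble vote fraction (votes for "malicious", 0..1) to the outcomes
        # of Table 5.5 using a mutual agreement threshold (50% by default)
        if mutual_agreement(vote_fraction) >= agreement_threshold:
            return "malicious" if vote_fraction >= 0.5 else "benign"
        return "uncertain (malicious)" if vote_fraction >= 0.5 else "uncertain (benign)"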

The threshold for mutual agreement is the boundary above which the classifier is said to be in a state of ensemble agreement, and the resulting classification should be considered

valid. Below this mutual agreement rating, the classification is specious. I use the boundary of 50% throughout most of this thesis. However, this value should be adjusted by the operator. Decreasing this threshold decreases the number of observations in the disagreement or uncertain classification zone. Tuning of this threshold is discussed in detail in Section 5.7.

Mutual agreement analysis is effective at identifying the specific samples on which the classifier performs poorly. In the case of evasion attacks, ensemble mutual agreement serves as a criterion for separating novel attacks and weak mimicry attacks from effective mimicry attacks. For novel attacks, it is common for the voting result to be distributed around 50%, indicating that the observations under consideration map consistently close to neither the benign nor the malicious samples in the training set. Since these attacks fall in the relatively rare uncertain range, they are easily discerned and are considered weak evasions.

Strong mimicry attacks are those where the distribution of the attack votes is close to that of the benign observations. Hence, typical novel attacks are identified by mutual agreement analysis, but strong mimicry attacks cannot be. Since uncertain observations are supported poorly by the training set, these observations are the most effective to add to the training set in order to improve classifier accuracy.

In operation, mutual agreement analysis is employed to prevent evasion of an intrusion detection system. The mutual agreement rate is trivially derived from the result provided by an ensemble classifier at the time that detection occurs. Ensemble classifier agreement can be used in many ways by the operator, including adjusting the vote threshold to prevent false positives or false negatives, filtering observations for quarantine or more expensive analysis, and prioritizing alerts. The strength of mutual agreement analysis is that it can be used to identify probable intrusion detection evasion at the time of evasion attempts.

5.4 Evaluation on Operational Data

I applied mutual agreement analysis to PDFrate scores for documents taken from a network monitor processing files transferred through web and email. This data set includes

110,000 PDF documents, which I randomly partitioned into two data sets. The operational

Table 5.6: PDFrate Outcomes for Benign Documents from Operational Evaluation Set

Classifier   Benign   Uncertain (Ben)   Uncertain (Mal)   Malicious
Contagio     98076    1408              203               40
University   99217    360               95                55

Table 5.7: PDFrate Outcomes for Malicious Documents from Operational Evaluation Set

Classifier   Benign   Uncertain (Ben)   Uncertain (Mal)   Malicious
Contagio     0        0                 19                254
University   0        0                 0                 273

evaluation set contains 100,000 documents and the operational training set contains 10,000 documents. Ground truth for the documents was determined by scanning with many antivirus engines months after collection. These data sets included 273 and 24 malicious documents, respectively. Table 5.6 and Table 5.7 show the scores for the operational evaluation data set using both the Contagio and the University classifiers of PDFrate. The distribution of the PDFrate scores for the benign and malicious samples of this operational evaluation data set is shown in Figure 5.3.

It is important to note that the scores for the benign and malicious examples are weighted heavily to the far end of their respective score range, with the distribution falling off quickly.

In a typical system deployment, the number of observations in the uncertain range is very small and the majority of misclassifications fall within the uncertain region. Hence, mutual agreement analysis can be used to make an estimate of the upper bound on the number of misclassifications, at least in the absence of strong evasion attacks.

Not only is ensemble classifier mutual agreement analysis useful for identifying when the classifier is performing poorly, it is also effective for identifying specific examples that will provide the most needed support to improve the classifier. To demonstrate this, I sought to replicate improvements to the classification scores that would occur in the operational evaluation data set as additional samples are added to the classifier training set. I started

Figure 5.3: Operational Evaluation Score Distributions. [(a) Benign Documents, Contagio Classifier; (b) Malicious Documents, Contagio Classifier; (c) Benign Documents, University Classifier; (d) Malicious Documents, University Classifier. Histograms of document counts versus PDFrate classifier score.]

with the Contagio classifier and added samples from the operational training set.

Using the original Contagio training data set, I determined the rating of all the observations in the operational training set. In an operational setting, all observations above the uncertain threshold (scores greater than 25) would typically require additional investigation, whether the outcome is uncertain or malicious. There were 200 documents in the operational training set meeting these criteria. Of these 200 samples, 43 would be false positives and 14 would be false negatives using a traditional threshold. I added these 200 observations to the Contagio training set with the correct ground truth and created another classifier.
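A minimal sketch of this selection step (Python; the score map and the names are hypothetical):

    def select_for_retraining(scores, uncertain_low=25.0):
        # scores: {doc_id: ensemble vote score in percent}.  Documents above the lower
        # bound of the uncertain region (uncertain or malicious outcomes) are the ones
        # worth investigating, labeling, and adding to the training set.
        return [doc_id for doc_id, score in scores.items() if score > uncertain_low]

    # After obtaining ground truth for the selected documents, append them to the
    # existing training set and retrain the ensemble classifier.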

For comparison, I also created additional classifiers with varying-sized, randomly selected

Table 5.8: Scores of Benign Documents from Operational Evaluation Set Using Contagio Classifier Supplemented with Operational Training Data

Additional Training Data    Training Set Size   Benign   Uncertain (Ben)   Uncertain (Mal)   Malicious
None (original Contagio)    10000               98076    1408              203               40
Random subset 2500          12500               99332    265               98                32
Random subset 5000          15000               99444    200               71                12
Random subset 7500          17500               99502    169               49                7
Uncertain and Malicious     10200               99506    183               26                12
Full training partition     20000               99540    134               48                5

Table 5.9: Scores of Malicious Documents from Operational Evaluation Set Using Contagio Classifier Supplemented with Operational Training Data

Additional Training Data    Training Set Size   Benign   Uncertain (Ben)   Uncertain (Mal)   Malicious
None (original Contagio)    10000               0        0                 19                254
Random subset 2500          12500               0        14                4                 255
Random subset 5000          15000               0        14                4                 255
Random subset 7500          17500               0        14                4                 255
Uncertain and Malicious     10200               0        14                7                 252
Full training partition     20000               0        14                4                 255

subsets of the operational training set to simulate randomly selected additions to the Contagio classifier. The performance of these classifiers applied to the operational evaluation set is demonstrated in Table 5.8 and Table 5.9.

These results indicate that local tuning of the classifier has a great effect on improving the accuracy of the classifier. Note that shifting a few samples across the score midpoint in the wrong direction, as occurs with the malicious observations, is not considered harmful as these samples are already deep in the uncertain range (very close to the 50% vote mark) as shown in Figure 5.3b. The ratio of observations in the benign region (certain true negatives) rises from 98.3% to 99.8% for either of the top two re-training strategies, even surpassing the accuracy of the generally superior University classifier (99.5%). The corresponding drop in false positives is important because it coincides with a drop in uncertain observations.

In this case, if an operator responds to all uncertain or malicious observations, the majority of alerts will be true positives.

The random subset training additions have the outcome anticipated by intuition. As the number of random samples added from the training set increases, the classification results on the partitioned evaluation data improve. Adding the samples above the uncertain threshold from the training partition results in a classifier that is very close in accuracy to that constructed with the complete training partition. It follows that mutual agreement analysis is effective at identifying the observations on which the classifier performs poorly. It also follows that adding these samples to the training set does indeed improve the classifier by providing support in the region near these samples. On the other hand, adding the observations for which there is high mutual agreement improves the classifier very little.

The results of adding the whole training partition and of adding only the uncertain samples are similar, but the effort invested is drastically different. The difference between obtaining ground truth for and adding 10,000 versus 200 observations to the training set is monumental.

5.5 Evaluation on Virustotal Data

I measured the mutual agreement of PDFrate scores for VirusTotal submissions during the year following the latest re-training of the University classifier, which occurred in October 2013. From a corpus of PDF documents organized by initial upload to VirusTotal, I randomly select 500 benign and 500 malicious documents per month. I consider any sample that has a detection by 3 or more AV engines as malicious and any that has fewer as benign.

Table 5.10 contains the two PDFrate classifier outcomes for the malicious samples, and

Table 5.11 for the benign samples. I present monthly results for the University classifier, but for brevity, only present the year total for the Contagio classifier. These tables present the number of documents that receive classifier ratings of benign, uncertain, or malicious.

I keep the convention of showing the split in the middle of the uncertain region based on a

50% score, allowing better comparison to standard classifier predictions and better showing the distribution of the scores. Generally, these tables demonstrate that the classifiers cast

Table 5.10: Outcomes for Malicious Documents from VirusTotal

University Classifier
Date     Benign   Uncertain (Ben)   Uncertain (Mal)   Malicious
201311   7        4                 11                478
201312   2        0                 2                 496
201401   2        1                 20                477
201402   10       6                 16                468
201403   2        20                19                459
201404   9        10                19                462
201405   3        4                 4                 489
201406   20       9                 22                449
201407   11       2                 8                 479
201408   20       18                22                440
201409   2        25                14                459
201410   7        21                5                 467
total    95       120               162               5623

Contagio Classifier
total    841      1246              667               3246

the majority of their votes for the correct class, malicious and benign respectively. The counts drop off rapidly through the uncertain outcomes, and the incorrect class is a rare outcome. The distribution of ensemble classifier voting scores for the University classifier is shown in Figure 5.4.

The primary observation is that using mutual agreement to add an additional outcome or prediction of uncertain dramatically decreases classifier error. This comes at the expense of a small number of observations receiving a prediction of uncertain. Table 5.12 compares the predictions of a traditional classifier with those using mutual agreement analysis. Classification error and uncertain rates are presented. For the University classifier, the false positive rate (FPR) drops from 0.22% to 0.08% and the false negative rate (FNR) drops from 3.57% to 1.58%. The trade-off is that 3.68% of the incoming observations are classified as uncertain. Of the observations labeled uncertain, 34% would be misclassifications with a traditional vote threshold. For the Contagio classifier, 54% of the uncertains would be

Table 5.11: Outcomes for Benign Documents from VirusTotal

University Classifier
Date     Benign   Uncertain (Ben)   Uncertain (Mal)   Malicious
201311   479      19                0                 2
201312   494      5                 1                 0
201401   483      14                3                 0
201402   480      19                1                 0
201403   493      6                 1                 0
201404   492      5                 2                 1
201405   490      9                 0                 1
201406   483      17                0                 0
201407   485      14                0                 1
201408   482      18                0                 0
201409   491      9                 0                 0
201410   483      17                0                 0
total    5835     152               8                 5

Contagio Classifier
total    5638     280               72                10

Table 5.12: Comparison of Classifier Performance Using Conventional Vote Threshold and Mutual Agreement Derived Uncertain Rate (UR)

University Classifier
                    FPR      FNR      UR
Conventional        0.22%    3.57%    -
Mutual Agreement    0.08%    1.58%    3.68%

Contagio Classifier
                    FPR      FNR      UR
Conventional        1.37%    34.8%    -
Mutual Agreement    0.17%    14.0%    18.9%

classification errors. Note that I report the Uncertain Rate (UR) using the count of all observations as the denominator, while I use the conventional definitions for FPR and FNR, which use the counts of the benign and malicious observations, respectively, as the denominators.
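For reference, the quantities reported in Table 5.12 can be computed as in the following sketch (Python/NumPy; scores are vote percentages and labels use 1 for malicious):

    import numpy as np

    def error_rates(scores, labels, low=25.0, high=75.0):
        # scores: ensemble vote percentages; labels: 1 = malicious, 0 = benign.
        # FPR/FNR count only confident misclassifications (outside the uncertain
        # region), each divided by the total count of its class; the uncertain
        # rate (UR) is computed over all observations.
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels)
        uncertain = (scores > low) & (scores < high)
        fpr = np.sum((scores >= high) & (labels == 0)) / np.sum(labels == 0)
        fnr = np.sum((scores <= low) & (labels == 1)) / np.sum(labels == 1)
        return fpr, fnr, uncertain.mean()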

One would typically expect the portion of classifier errors in the uncertain outcome to

81 4000 4000

3000 3000

2000 2000

Document Count 1000 Document Count 1000

0 0 0 25 50 75 100 0 25 50 75 100 PDFrate University Classifier Score PDFrate University Classifier Score

(a) VirusTotal Benign Documents (b) VirusTotal Malicious Documents

Figure 5.4: VirusTotal Score Distributions, University Classifier

be under 50%, at least when the counts of benign and malicious samples are equal. Even the most uncertain classifier, a random guess, should yield the correct prediction half the time.

So even uncertain predictions should be correct about 50% of the time. Hence, 50% should be the normal upper bound for the classification error rate inside the uncertain outcome.

The Contagio classifier exceeds this slightly because it has an irregular score distribution.

With the known classes of the samples labeled, it is clear that the University classifier is superior to the Contagio classifier. This is expected, as the Contagio classifier's training set contains over an order of magnitude fewer documents and was compiled nearly three years before the University classifier's. Without any external knowledge, the mutual agreement analysis derived Uncertain Rates of 3.68% for the University classifier and 18.9% for the Contagio classifier give us an objective measure of the relative confidence of these classifiers. These measures are very close to the ground truth misclassification rates of 1.9% and 18.1%, respectively. The ability to estimate classifier error with no knowledge of ground truth makes the mutual agreement analysis derived Uncertain Rate extremely valuable.

One surprising observation from this data is the lack of a steep decrease in classification

accuracy throughout the year following the training of the classifier. It might be anticipated that the classifier would need to be retrained frequently to retain accuracy. I suspect that the low drift in the malicious documents over time was due to a lack of active evasion attempts against PDFrate. While polymorphism may be used to attempt to defeat signatures, rapid changes to the features used by PDFrate do not appear to occur in VirusTotal submissions.

I also tried to correlate exploits over time with classifier error and could discern no strong correlations between new software vulnerabilities and classifier evasion. In fact, the most common exploit found in the samples labeled uncertain was CVE-2010-0188, a very old yet prolific exploit. This exploit was the most common exploit reported in my VirusTotal submission data set. My labeling of exploits was limited to the analysis provided by the cumulative detections of the AV engines in VirusTotal, which may introduce a bias in this analysis. To the degree that my identification of the exploits used in documents was not biased, it appears that new exploits are not associated with PDFrate evasion. It also appears that the various techniques used to defeat signature matching are generally orthogonal to the attributes that PDFrate uses for classification. This implies that PDFrate and signature matching techniques complement each other well.

It is also noteworthy that the false negative rate is higher than the false positive rate, and the contribution to the uncertain outcome is also higher from the malicious samples than the benign samples. This has a few implications. First, it seems that the classification PDFrate provides is more volatile for malicious samples than for benign samples, possibly due to less variation in benign samples. I presented equal quantities of benign and malicious documents, but most environments are heavily skewed to benign observations. Hence, classification error and uncertain rates will drop in a typical, mostly benign, environment.

Despite covering half of the possible voting score range (using a 50% mutual agreement threshold), the uncertain result occurs relatively infrequently in practice because the bulk of the scores reside at the ends of the spectrum. Removing the observations with high ensemble classifier disagreement allows the classification error to drop dramatically. Mutual agreement analysis permits a higher degree of confidence in the outcome of a classifier

without additional external information.

5.6 Independent Evasion Attack Evaluations

I evaluated the degree to which mutual agreement analysis allows identification of classifier evasion using published evasion studies against PDFrate: Mimicus [107], EvadeML [120],

Reverse Mimicry [70, 71], and Parser Confusion [23].

5.6.1 Mimicus Evaluation

To demonstrate the utility of mutual agreement analysis in identifying observations that evade detection, I reproduced the work of Šrndić and Laskov [107] and applied mutual agreement analysis to these evasion attempts. I used the Mimicus framework to generate

PDF documents that implement various evasion attack scenarios. I used the same data sets as the Šrndić and Laskov publication and submitted the resulting documents to pdfrate.com to obtain scores. Because I used the same attack data, my results are limited to 100 samples per attack type. I was able to achieve results that closely mirrored those documented in the Mimicus study.

I present the results of classification using mutual agreement from the various attack scenarios in Table 5.13. Note that since all of these documents are malicious, the correct classification is malicious. A rating of benign indicates successful evasion.

The distribution of PDFrate voting scores for the documents in each non-GD-KDE scenario is shown in Figure 5.5. The GD-KDE attacks will be addressed specifically in Section 5.8. The vote score distribution of these attacks is largely disjoint from that of typical benign or malicious observations. Using an ensemble classifier diversity based approach, the majority of these attacks can be separated from benign observations. Hence, these attacks should be considered weak mimicry attempts.

Table 5.13: PDFrate Contagio Classifier Outcomes for Mimicus Evasion Attacks

Scenario          Benign   Uncertain (Benign)   Uncertain (Malicious)   Malicious
Baseline Attack        0                    0                       0         100
F mimicry              2                   70                      26           2
FC mimicry             7                   78                      15           0
FT mimicry            10                   64                      26           0
FTC mimicry           33                   62                       5           0
F gdkde                7                   92                       1           0
FT gdkde               4                   95                       0           1

When all attributes of the classifier are known, 33% of the attacks are effective. However, when either the details of the classifier or the training set are withheld, the attack success rate drops to 10% or lower. In addition to evaluation against the Contagio data set, the mimicry attack data was tested against the classifier trained with the University data set.

This results in an alternate FC attack scenario because the training set is unknown to the attacker. Figure 5.6 shows the distribution of scores from applying the malware from the FTC attack scenario against the Contagio classifier to the University classifier. The results are very similar between the two classifiers. In both cases, only seven of the 100 evasion attempts are classified as benign. Carefully comparing Figure 5.5c and Figure 5.6 yields the observation that the University classifier provides a tighter cluster of scores near the center of the disagreement region. The results from the Contagio classifier are similar to those of the University classifier because the Mimicus evasion attempts use Contagio data for both baseline benign and attack data.

When mutual agreement analysis is employed, the majority of Mimicus attacks are labeled as uncertain, indicating known classifier failure and possible evasion. In the best mimicry attack scenario, where all attributes of PDFrate are known, only 33% of the mimicry attempts are successfully classified as benign. If some details of the classifier, such as the exact training set, are not known by the attacker, then the mimicry success rate is below 10%.

Figure 5.5: Mimicus Score Distributions, Contagio Classifier. Panels: (a) Score Distribution for F Mimicry Attack, (b) FT Mimicry Attack, (c) FC Mimicry Attack, (d) FTC Mimicry Attack. Each panel plots Document Count against the PDFrate Contagio Classifier Score.

5.6.2 EvadeML Evaluation

I evaluated the effectiveness of mutual agreement analysis applied to the EvadeML attacks advanced by Xu et al. [120]. The procedures to replicate the EvadeML attacks are publicly documented, including source code (http://evademl.org). In addition, Xu et al. provided the documents used in their published evaluation for validation against PDFrate.

Figure 5.6: Score Distribution for FC Mimicry Attack, University Classifier. The plot shows Document Count against the PDFrate University Classifier Score.

The evaluation performed by Xu et al. resulted in 16,985 documents derived from 500 seed malicious PDFs. Unlike the Mimicus and Reverse Mimicry attacks, there is only one evasion scenario advanced in EvadeML. Since there are multiple samples derived from a single seed, I provide results for all document mutations and the best mutation (lowest PDFrate score) for each seed document. Table 5.14 contains the results for the application of mutual agreement analysis to the EvadeML attacks using both the Contagio and University classifiers. To remain comparable with the results from the Mimicus and Reverse Mimicry attacks, I report these results as a percentage of all the documents for their respective attack sets. Figure 5.7 depicts the score distributions for these various evasion attacks.

Table 5.14: PDFrate Outcomes for EvadeML Attacks

Contagio Classifier
Scenario   Benign   Uncertain (Benign)   Uncertain (Malicious)   Malicious
All          57.5                 42.5                     0.0         0.0
Best         81.8                 18.2                     0.0         0.0

University Classifier
Scenario   Benign   Uncertain (Benign)   Uncertain (Malicious)   Malicious
All           0.0                 94.8                     5.2         0.0
Best          0.8                 97.2                     2.0         0.0

Analyzing mutual agreement in the PDFrate outcomes for the EvadeML attacks results in different outcomes based on the training set. Many attacks against the Contagio classifier are effective, falling below the 50% mutual agreement threshold or below a 25% voting score. In fact, over 80% of the best attacks per seed fall below the 25% vote score threshold, making this a relatively strong attack. However, less than 1% of the attacks are effective against the University classifier.

Figure 5.7: EvadeML Score Distributions. Panels: (a) EvadeML All, Contagio Classifier; (b) EvadeML Best, Contagio Classifier; (c) EvadeML All, University Classifier; (d) EvadeML Best, University Classifier. Each panel plots Document Count against the respective PDFrate classifier score.

The EvadeML attacks used the Contagio data set for evaluation during construction.

It is difficult to determine how much of the difference in the score distributions for the two classifiers is due to general differences in quality (the University classifier contains many more training samples) and how much is due to the University classifier being unknown to the attacker. The large difference in the distributions for the two classifiers and similar qualitative results when compared to those of the Reverse Mimicry attacks seem to indicate that the quality of the training sets is important in preventing evasion attacks. On the other hand, the difference in score distributions for all the mutations and the best mutations indicates that there is a large degree of variance in the outcomes using the genetic programming approach. Indeed, Xu et al. observed that the longest traces were not necessarily the most evasive. Hence, optimizing these attacks, or finding the best attack for a given classifier, probably requires testing against that classifier. As with the other attacks, it appears that a private training set does improve evasion resistance.

The EvadeML attacks are able to attain over 80% evasion when the classifier based on the publicly known Contagio data set is used. However, when the University classifier, which is of higher quality and unknown to the attacker, is used, under 1% of evasion attacks are successful.

5.6.3 Reverse Mimicry Evaluation

I also applied mutual agreement analysis to the Reverse Mimicry attack proposed by Maiorca et al. [70, 71]. The exact procedures required to replicate these attacks are not publicly documented. However, Maiorca et al. provided me with the documents used in their studies. Their most recent attacks involved approximately 500 documents in each evasion scenario. To remain consistent with the Mimicus attack evaluation, I took a 100 sample random subset of each evasion attack for my evaluation.

In Table 5.15, I present the results of applying mutual agreement analysis to the Reverse Mimicry attacks against both the Contagio and University classifiers. The score distributions for these attacks against the University classifier are shown in Figure 5.8. In spite of mutual agreement analysis, 67% of the Reverse Mimicry attacks are successful evasions (considered benign) against the Contagio classifier.

Table 5.15: PDFrate Outcomes for Reverse Mimicry Attacks

Contagio Classifier
Scenario   Benign   Uncertain (Benign)   Uncertain (Malicious)   Malicious
EXEembed       77                   22                       1           0
PDFembed       93                    7                       0           0
JSinject       30                   67                       3           0

University Classifier
Scenario   Benign   Uncertain (Benign)   Uncertain (Malicious)   Malicious
EXEembed        0                    4                      16          80
PDFembed       81                   19                       0           0
JSinject        0                   22                      55          23

The University classifier fares much better than the Contagio classifier. The only evasions against the University classifier are achieved by the PDFembed attack. This attack is so successful because a complete malicious PDF is embedded in an otherwise benign document. This embedded document resides in a compressed data stream, which means that the structural features cannot be observed by PDFrate’s feature extractor. This is in contrast to the other scenarios, EXEembed and JSinject, where despite efforts at minimization, some indicators of malfeasance remain exposed.

The PDFembed scenario is effective against the detector at pdfrate.com because that service does not perform the recursive decoding and analysis that would be necessary in an operational system.

This failure is similar to malware analysis systems that assume an input of an unpacked executable and fail when presented with a packed executable or a Trojan document. When PDFrate is deployed in operational detection systems, it is usually done within a framework that provides both decoding of PDF streams and extraction of PDFs from other containers such as emails or zip files [8]. In all the PDFembed attacks, the embedded document was identical. The Contagio and University classifiers both easily detect this document with high confidence once it is extracted, returning scores of 97.6% and 100% respectively.

For the isolated PDFrate implementation, the PDFembed scenario represents a strong evasion scenario, where classifier introspection provides little benefit because the feature extractor is evaded so well. Even though the Contagio based classifier is a poor fit for the malware used in the EXEembed and JSinject, many of these samples still fall in the uncertain outcome vote range. When the stronger University classifier is used, mutual agreement analysis flags these evasion scenarios that would otherwise be successful.

Figure 5.8: Reverse Mimicry Score Distributions, University Classifier. Panels: (a) EXEembed Attack, (b) PDFembed Attack, (c) JSinject Attack. Each panel plots Document Count against the PDFrate University Classifier Score.

Table 5.16: PDFrate Scores for Parser Confusion Attacks

Attack Scenario                         Contagio Classifier   University Classifier
Base Malicious                                        86.4%                   89.6%
Parser Confusion                                      70.0%                   65.8%
Parser Confusion and Reverse Mimicry                   7.8%                    2.3%

5.6.4 Parser Confusion Evaluation

I also evaluated the effectiveness of the parser confusion attacks of Carmony et al. [23]. A single base malicious document implementing their parser confusion and reverse mimicry attacks was tested against PDFrate in their evaluation. The hash of the document for each attack variant was published in their study. The PDFrate scores for these attack methods are shown in Table 5.16.

The Parser Confusion attack is able to drop the voting score into the uncertain region, but it remains far from a successful evasion. This result is explained by the fact that PDFrate focuses on structural features instead of content, limiting the impact of this method. However, when combined with the reverse mimicry attack, the result is strong evasion. This is explained in the same manner as the PDFembed attack discussed in Section 5.6.3. This attack is effective because the features indicating the malicious activity are hidden from the PDFrate feature extractor through various layers of encoding and encryption. However, due to the parser confusion attacks, the file is not trivially decoded with existing PDF parsing libraries or tools. Hence, even if PDFrate is used in a framework that provides recursive PDF parsing and analysis, the parsing will not be effective if the parser is subverted.

5.7 Mutual Agreement Threshold Tuning

For most of my evaluations, I used a 50% mutual agreement threshold, which splits the classifier voting score range into four equal-sized regions. It is possible to choose an arbitrary mutual agreement threshold. In Table 5.17, I present the PDFrate University classifier outcomes applied to the operational evaluation data set and the FC Mimicus attacks across the full mutual agreement range.

The exact mutual agreement threshold chosen strikes a balance between improvement in classification failure detection and the number of classifier predictions thrown out as uncertain. Operators who wish to have a lower amount of uncertain outcomes may choose a lower threshold. Taking the PDFrate performance in Table 5.17 as an example, if 30% is selected as a threshold, the uncertain region comprises ensemble classifier voting scores between 35% and 65% instead of 25% and 75% with a 50% threshold. For the operational data set, the uncertain rate for benign samples drops from 0.456% to 0.256%. However, the number of successful evasion attempts rises from 7% to 12%. The optimal setting for this threshold depends on the preferences of the operator. The sensitivity of uncertain detection is adjusted by tuning the mutual agreement threshold, setting the boundaries for the uncertain range.
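As a concrete illustration of this mapping, the short sketch below converts a mutual agreement threshold into its uncertain score range and tallies outcomes over a set of stored voting scores (0-100 scale). The score arrays are assumed inputs; this is not the code used to produce Table 5.17.

```python
# Sketch of sweeping the mutual agreement threshold over stored voting scores.
def outcome(score, threshold):
    """Map a 0-100 voting score to an outcome for a given mutual agreement threshold."""
    lo, hi = 50 - threshold / 2, 50 + threshold / 2
    if score >= hi:
        return "malicious"
    if score <= lo:
        return "benign"
    return "uncertain"

def outcome_rates(scores, threshold):
    """Fraction of scores mapped to each outcome at the given threshold."""
    labels = [outcome(s, threshold) for s in scores]
    return {o: labels.count(o) / len(labels)
            for o in ("benign", "uncertain", "malicious")}

# With a 30% threshold the uncertain range is (35, 65); with 50% it is (25, 75).
# For benign samples, outcome_rates(...) yields the TNR ('benign'), the uncertain
# rate ('uncertain'), and the FPR ('malicious') reported in Table 5.17.
```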

Table 5.17: University Classifier Performance as Mutual Agreement Threshold Is Adjusted

Benign Operational Evaluation
Mutual Agreement   Uncertain     True Negative   False Positive   Uncertain
Threshold          Score Range   Rate (TNR)      Rate (FPR)       Rate (UR)
0%                 -             99.8%           0.150%           0%
10%                (45,55)       99.8%           0.128%           0.0592%
20%                (40,60)       99.7%           0.103%           0.147%
30%                (35,65)       99.7%           0.0832%          0.256%
40%                (30,70)       99.6%           0.0712%          0.342%
50%                (25,75)       99.5%           0.0552%          0.456%
60%                (20,80)       99.3%           0.0331%          0.618%
70%                (15,85)       99.1%           0.0291%          0.825%
80%                (10,90)       98.7%           0.0261%          1.27%
90%                (5,95)        97.0%           0.0120%          3.01%
100%               (0,100)       53.6%           0%               46.4%

Malicious Operational Evaluation
Threshold   Range     FNR   TPR     UR
0%          -         0%    100%    0%
10%         (45,55)   0%    100%    0%
50%         (25,75)   0%    100%    0%
60%         (20,80)   0%    99.6%   0.366%
70%         (15,85)   0%    99.6%   0.366%
80%         (10,90)   0%    99.6%   0.366%
90%         (5,95)    0%    99.6%   0.366%
100%        (0,100)   0%    95.6%   4.40%

Mimicus FC Attack
Threshold   Range     FNR   TPR   UR
0%          -         84%   16%   0%
10%         (45,55)   69%   8%    23%
20%         (40,60)   31%   4%    65%
30%         (35,65)   12%   4%    84%
40%         (30,70)   7%    1%    92%
50%         (25,75)   7%    1%    92%
60%         (20,80)   7%    0%    93%
70%         (15,85)   6%    0%    94%
80%         (10,90)   0%    0%    100%
90%         (5,95)    0%    0%    100%
100%        (0,100)   0%    0%    100%

5.8 Ensemble Classifier Diversity

Mutual agreement analysis should apply to all ensemble classifiers that provide sufficient diversity in individual classifiers. To validate this, I studied the feasibility of countering evasion against SVMs by applying mutual agreement analysis to SVMs using an ensemble approach.

The Mimicus attack framework implements a Gradient Descent and Kernel Density Estimation (GD-KDE) attack against their PDFrate replica utilizing an SVM classifier. This attack operates by exploiting the known decision boundary of a differentiable classifier [17].

I reproduced the GD-KDE evasion attacks of Mimicus and confirm that they are indeed extremely effective. Using the e1071 package of R (http://www.r-project.org/), which relies on libSVM [24], I calculated the average probability of 8.9% malicious (or 91.1% benign) for both GD-KDE scenarios, putting these attacks squarely within the evasion region. Šrndić and Laskov use the scaled distance from the SVM decision boundary of a different SVM implementation to provide a similar result. The GD-KDE attacks demonstrate that introspection of a single classifier such as SVM cannot be relied upon to detect evasions.

While effective against an SVM classifier, the results of the GD-KDE attack on PDFrate’s Random Forest classifier are roughly comparable to their conventional counterparts (see Table 5.13). It is not practical to wage a similar type of attack against Random Forests because Random Forests have extremely complex and stochastic decision boundaries.

I sought to determine the extent to which I could make an SVM classifier identify probable evasions through diversity enabled introspection. I implemented a simple SVM based ensemble classifier using 100 independent SVM classifiers with the score being the simple sum of the votes of individual classifiers. To determine the attributes important to building diversity in ensembles, I varied the subset of features and training data used in constructing each of the individual SVMs. I performed a full grid search. The most salient results are reported in Table 5.18, which shows feature bagging using the full training data set, and Table 5.19, which shows bagging on training data using the full feature set. These tables demonstrate the portion of classifier outcomes that match the correct result (the desired result for evasion attempts is malicious or uncertain).

Table 5.18: Number of Documents per GD-KDE Attack Where Ensemble SVM Classifier Provides Correct Prediction as Portion of Features Used Is Varied

                             Feature Subset
Attack                5%    7.5%    10%    12.5%
Baseline Malicious   100      99     98       98
Baseline Benign        2      41     93       94
F gdkde              100     100     99        5
FT gdkde              99     100     92        1

Table 5.19: Number of Documents per GD-KDE Attack Where Ensemble SVM Classifier Provides Correct Prediction as Portion of Training Data Used Is Varied

                          Training Data Subset
Attack               12.5%    25%    50%    100%
Baseline Malicious      86     87     92      98
Baseline Benign        100    100    100     100
F gdkde                  0      0      0       0
FT gdkde                 0      0      0       0

It appears that bagging of training data is not particularly important in building an ensemble classifier where mutual agreement analysis is useful. Surprisingly, I found no situation where anything but the full training set provided the best results. However, bagging of features is critical to constructing a classifier where mutual agreement analysis is able to identify uncertain predictions. This bagging of features for an SVM classifier provides the necessary diversity in extrapolation that makes mutual agreement analysis meaningful. It seems that the individual classifiers based on subsets of the complete feature set are much harder to evade collectively than a single classifier using all the features. While a single classifier can be evaded by successfully mimicking a subset of the features, it appears that a combination of multiple classifiers based on a small number of features requires a more complete mimicry across the full feature set. The application of feature bagging to the many independent SVMs makes a GD-KDE style attack infeasible as there is no longer a single predictable decision boundary to attack.

Table 5.20: PDFrate SVM Ensemble Classifier Outcomes for GD-KDE Attacks

Attack               Benign   Uncertain (Benign)   Uncertain (Malicious)   Malicious
Baseline Malicious        0                    0                       2          98
Baseline Benign          93                    7                       0           0
F gdkde                   3                   97                       0           0
FT gdkde                  8                   91                       1           0

The results also indicate that careful tuning of the portion of features used in bagging is critical when using an SVM based ensemble. There seems to be a trade-off between the ability to correctly classify malicious observations (including evasion attempts) by using fewer features in each classifier, and benign observations by using more features. The use of fewer features results in a more complex classifier with smaller divisions while more features moves closer to a standard SVM, which has a single hyperplane divider. This result might be explained by suggesting that the features used in PDFrate provide better extrapolation for benign samples but that malicious samples have higher variation in PDFrate’s features requiring more similar training samples for successful classification.

Table 5.20 shows the outcomes of the SVM ensemble classifier applied to the Mimicus GD-KDE attacks and baseline benign and malicious samples. The outcome shows that while the evasion attempts are successful in dropping the scores out of the malicious range, the vast majority of the evasion attempts fall in the uncertain range. Only 8% of the evasion attempts are fully successful in the best scenario while only 4.5% of the known data is in the uncertain region. These results are comparable to results obtained using PDFrate’s Random Forest classifier, where GD-KDE attacks are not possible. Hence, mutual agreement analysis applies not only to Random Forests, but seems to apply generally to all ensembles that have adequate diversity. Bagging of features appears central to this capability.
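A minimal sketch of this kind of feature-bagged SVM ensemble appears below. It uses scikit-learn's LinearSVC as a stand-in for the SVM formulation actually used, assumes 0/1 class labels, and sums member votes into a score, per the construction described above.

```python
# Sketch of an ensemble of 100 SVMs, each trained on a random subset of features.
import numpy as np
from sklearn.svm import LinearSVC

class FeatureBaggedSVMEnsemble:
    def __init__(self, n_members=100, feature_fraction=0.10, seed=0):
        self.n_members = n_members
        self.feature_fraction = feature_fraction
        self.rng = np.random.default_rng(seed)
        self.members = []  # (feature index array, fitted SVM) pairs

    def fit(self, X, y):
        n_features = X.shape[1]
        k = max(1, int(self.feature_fraction * n_features))
        for _ in range(self.n_members):
            idx = self.rng.choice(n_features, size=k, replace=False)
            self.members.append((idx, LinearSVC().fit(X[:, idx], y)))
        return self

    def vote_score(self, X):
        """Fraction of member SVMs voting malicious (label 1) for each row of X."""
        return np.mean([svm.predict(X[:, idx]) for idx, svm in self.members], axis=0)

# Scores near 0 or 1 reflect ensemble agreement; under a 50% mutual agreement
# threshold, scores between 0.25 and 0.75 would be reported as uncertain.
```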

5.9 Evaluation on Drebin Android Malware Detector

Ensemble classifier mutual agreement analysis should be applicable to all situations where evasion is possible, including other malware classifiers. I evaluated the utility of mutual agreement analysis on the Drebin Android malware detector [9]. Drebin is a strong complement to PDFrate because it operates on a software package instead of a document and utilizes many string based/binary features instead of numerical features. Since the data used in the original Drebin study has been published, I use this data for my evaluation.

Drebin operates by performing a quick scan to extract features from the Android application manifest and disassembled code. These features are formatted as strings. Features extracted from the manifest include the names/values of hardware components, requested permissions, application components, and intents (message framework). The values of API calls, used permissions, and network addresses/URLs are taken from the disassembled code.

The string values are mapped into a binary feature vector containing over 500,000 unique values.

A linear SVM is trained offline and used to provide weights (distance from hyperplane) for each feature observed during classification. This per predictor weight is combined to provide an overall score and compared to a threshold to determine the outcome. Due to this scheme, Drebin provides a malicious score and can identify variables that contribute to this score. Drebin is evaluated with over 100,000 benign and 5,000 malicious samples, providing a false positive rate of 1% and a malware detection rate of nearly 94%.

Since the linear SVM’s score is based on a single division, it does not have enough diversity to provide a measure of internal coherence. I employ a Random Forest classifier, which requires adapting the features to ensure computational efficiency and to ensure results comparable to the original linear SVM. Instead of using all string values as features, I use a subset of 891 features that comprise the most resilient features. I use all of the features for constrained categories such as API calls and permissions. For arbitrarily named attributes, such as components and intents, I use the most prolific values, selecting those which occur over 100 times in the training set. I ignore specific values for highly volatile items such as URLs and network addresses, which compose over half the features used by Drebin. Lastly, I sum the occurrences of each category of features and use these counts as features. As an optimization, I de-duplicated any equivalent feature vectors during classifier training (not during evaluation). This de-duplication using our narrow feature set results in a reduction from 123,453 to 63,379 unique benign and from 5,560 to 2,185 unique malicious samples. Apart from these transformations of the data, I used the published Drebin data sets and data set partitions in my evaluation.
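The sketch below outlines this feature reduction. The selection rules follow the description above (keep every value in constrained categories, keep arbitrarily named values occurring more than 100 times, drop volatile values, and append per-category counts), but the specific category labels and helper names are illustrative rather than the exact Drebin feature groups.

```python
# Sketch of the reduced Drebin feature mapping; category names are illustrative.
from collections import Counter

CONSTRAINED = {"api_call", "permission", "used_permission", "hardware"}
ARBITRARY = {"app_component", "intent"}
VOLATILE = {"url", "network_address"}

def select_vocabulary(training_samples, min_count=100):
    """training_samples: list of sets of (category, value) string pairs."""
    counts = Counter(item for sample in training_samples for item in sample)
    vocab = set()
    for (category, value), n in counts.items():
        if category in CONSTRAINED:
            vocab.add((category, value))          # keep all constrained values
        elif category in ARBITRARY and n > min_count:
            vocab.add((category, value))          # keep only prolific values
        # volatile categories contribute no per-value features
    return sorted(vocab)

def to_feature_vector(sample, vocab):
    """Binary indicators for vocabulary values plus per-category counts."""
    indicators = [1 if item in sample else 0 for item in vocab]
    categories = sorted(CONSTRAINED | ARBITRARY | VOLATILE)
    counts = [sum(1 for cat, _ in sample if cat == c) for c in categories]
    return indicators + counts
```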

I tuned my Random Forest based classifier to provide classification performance comparable to the linear SVM classifier of Drebin. The primary item I tuned was the ratio of benign to malicious samples used in training each tree. This was necessary because there is an extreme imbalance in the benign to malicious ratio of the various training sets. I tuned this ratio of benign to malicious for individual tree training to 2.5 benign to 1 malicious in order to match the desired false positive rate of 1% chosen by Arp et al. I set the other tunable parameters for Random Forest to standard values: each Random Forest contained 1000 trees and the number of variables tried at each split was set to the square root of the number of features. My Random Forest classifier provided an average false positive rate of 1.06% and a malware detection rate of 92.3% on the published data set partitions using a traditional threshold without an uncertain region. The Random Forest based classifier performance is very similar to, albeit slightly inferior to, that provided by Drebin’s linear SVM. Figure 5.9a shows the distribution of scores for the benign samples using one of the published data set partitions. Figure 5.9b shows the same for the malicious samples. As expected, the score distributions are shaped similarly to those of PDFrate, but since the classifier accuracy is lower, the samples are distributed farther from the respective ends of the score continuum.
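A simplified sketch of the per-tree class-ratio sampling follows: each tree is trained on all malicious samples plus a 2.5:1 random draw of benign samples, with mtry set to the square root of the feature count. This approximates, rather than reproduces, the tuning described above.

```python
# Sketch of a Random Forest trained with a fixed benign:malicious ratio per tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ratio_balanced_forest(X, y, n_trees=1000, benign_per_malicious=2.5, seed=0):
    rng = np.random.default_rng(seed)
    benign_idx = np.flatnonzero(y == 0)
    malicious_idx = np.flatnonzero(y == 1)
    n_benign = min(len(benign_idx), int(benign_per_malicious * len(malicious_idx)))
    trees = []
    for _ in range(n_trees):
        sample = np.concatenate([malicious_idx,
                                 rng.choice(benign_idx, size=n_benign, replace=False)])
        tree = DecisionTreeClassifier(max_features="sqrt")  # mtry = sqrt(#features)
        trees.append(tree.fit(X[sample], y[sample]))
    return trees

def vote_score(trees, X):
    """Fraction of trees voting malicious for each observation."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```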

Table 5.21 shows the classifier outcomes for typical mutual agreement thresholds.

Figure 5.9: Score Distribution, Random Forests Based Drebin Classifier. Panels: (a) Benign Samples, (b) Malicious Samples. Each panel plots Application Count against the Drebin Random Forests Score.

Table 5.21: Drebin Random Forest Outcomes as Mutual Agreement Threshold Is Adjusted

Benign Samples
Mutual Agreement   Benign   Uncertain      Uncertain         Malicious
Threshold (%)      (%)      (Benign) (%)   (Malicious) (%)   (%)
30                 97.46    1.49           0.54              0.52
40                 96.49    2.45           0.63              0.43
50                 95.12    3.82           0.71              0.35

Malicious Samples
Mutual Agreement   Benign   Uncertain      Uncertain         Malicious
Threshold (%)      (%)      (Benign) (%)   (Malicious) (%)   (%)
30                 4.44     3.27           5.44              86.85
40                 3.77     3.93           7.30              84.99
50                 3.16     4.56           10.34             81.95

An important facet of the original Drebin study is the division of the malware by family and evaluation of the classifier on previously unknown malware families. This was achieved by withholding the family to be evaluated from the training set, and then applying the resultant classifier to the malware samples in that family. It is noted by Arp et al. that Drebin provides relatively poor classification of previously unknown malware. I applied my Random Forest based classifier and uncertain score region to this same problem. Figure 5.10 compares the detection rates of the linear SVM classifier and our Random Forest based classifier using mutual agreement analysis.

Figure 5.10: Comparison of Detection Rates for Previously Unknown Malware Families

As expected, the vast majority of unknown malware families have the score distribution of a weak evasion attack, indicating that the classifier considers these observations similar to neither the benign nor the malicious samples seen in the training set. As an example, the scores of malware family A are shown in Figure 5.11. On average, 75.2% of every family is labeled as uncertain and an additional 8.2% are labeled as malicious using our Random Forest based classifier, while 50.6% of every family is labeled as malicious by the Drebin linear SVM. Families Q and R represent strong evasion. Arp et al. note that Family R cannot be reliably detected with the feature set used by Drebin. While the features used by Drebin are sufficient for the detection of Family Q when included in the training set, it is too different from other families in Drebin’s feature space to be flagged as an evasion. On the other hand, Family P is so similar to other malware families in Drebin’s feature space that it is not necessary to have samples of this family in the training set. Removing these three families, an average of 89.7% of the samples in the remaining 17 families are identified as malicious or uncertain by the Random Forests while 53.2% are detected by the linear SVM. It should be considered advantageous to label these previously unknown samples as uncertain so that the operator can take action to improve the classifier. While the linear SVM provides the average classification accuracy of a coin toss in these scenarios, the mutual agreement conscious ensemble is able to flag the majority of the novel attacks as possible evasions.

Mutual agreement analysis does apply well to the Drebin Android malware detector.

While the linear SVM classifier performs poorly on novel malware attacks, a Random Forest classifier monitoring ensemble disagreement flags the majority of novel attacks as uncertain.

Figure 5.11: Score Distribution for Unknown Family A. The plot shows Application Count against the Drebin Random Forests Score.

5.10 Discussion

Mutual agreement analysis in ensemble classifiers provides an estimate of confidence that the classifier prediction is accurate, without external validation. Many classifiers can provide a score continuum, such as the distance from the decision boundary used in SVM, but these metrics are not accurate in the face of mimicry attacks. Furthermore, conventional measures of confidence are not applicable to data that diverges from the population for which ground truth is known.

Mutual agreement reflects the internal consistency of the classifier. This internal consistency is a proxy for the confidence of the classifier, assuming adequate strength of the features. The attacks against PDFrate demonstrate that mimicry resistant features are critical to identification of novel attacks. If the feature extractor is resistant to tampering and the features are proper indicators of malfeasance, then novel attacks will either be detected as malicious or be rated as uncertain. However, if the feature set (or feature extraction mechanism) is weak, then evasion will still be possible. Operators must be vigilant to prevent evasion during the feature extraction phase of malware detection.

The evasion attacks that were successful against PDFrate fully evaded PDFrate’s features by embedding a malicious PDF in another, making the malicious PDF invisible to the feature extractor. Other attacks, while seeking to fool the feature extractor, were insufficient because some features were still operative. This evasion could be overcome by a parser that more fully mirrors that of the target application. However, this is difficult in practice due to the issues exposed by the Parser Confusion attacks. As an alternative to using the target reader program to extract features, one could add parsing errors or anomalous constructs to the feature set.

In building an ensemble using base SVM classifiers, I found feature bagging to be critical to generating the diversity necessary to make mutual agreement measurements meaningful.

Unqualified, bagging refers to the utilization of random subsets of training data. This method is used extensively in machine learning techniques. In our study, bagging of training data was not shown to be important for mutual agreement analysis. This may have been due to a lack of diversity in that training set. Further studies might show under what conditions training data bagging provides diversity useful for facilitating mutual agreement analysis. I also observed that tuning the portion of features used in our SVM ensemble was important. I observed no similar need to tune the parameters of Random Forest, but this could be an area of future study. The number of features tried at each node (mtry) and the depth of the trees might impact the useful diversity in a Random Forest. It appears that many features, even if they are interdependent or have low classification value, contribute to make evasion more difficult.

If the features are strong, then the relevance of the training set will dictate the mutual agreement rating for individual observations. If the test observations are similar to samples in the training set, then high certainty predictions will occur. The test observations that differ from the training set in feature space will be given classifications considered uncertain by the classifier. In some instances in our evaluation, the quality of the training set was shown to be important to detection of evasion attacks. For example, the superior PDFrate University classifier had considerably fewer evasions than the Contagio classifier for the Reverse Mimicry attacks. For the Drebin evaluation, Family R represented strong evasion due to weak features, but Family Q was detectable if it was included in the training set.

Therefore, the effectiveness of mutual agreement analysis is also dependent upon adequate coverage in the training set. However, the effectiveness of the training set is directly dependent upon the strength of the features. A weak feature set will require a more expansive training set than a feature set that more closely models fundamental malicious attributes.

Operators should ensure that features used for malware detection are not only resistant to spoofing but that they are based on artifacts caused by malware and not merely coincidental with current attacks.

As was shown in Section 5.6.1, keeping the training set of a classifier secret helps improve resiliency against targeted evasion attempts. It might be advisable for operational systems to hide the exact scores returned from their classifiers, as these scores assist attackers in knowing if changes they make hurt or help their evasion attempts. This information could weaken the benefit provided by a secret training set [52].

The GD-KDE attacks of Mimicus demonstrate that some classifiers can make machine learning based detectors susceptible to evasion attacks. Stochastically generated ensemble classifiers have not been shown to be vulnerable to similar attacks, but new approaches might be found. The ability to measure mutual agreement in ensemble classifiers comes at little cost, but provides for detection of practical classifier evasion. This capability is a strong reason to use ensembles in situations where classifier evasion is a concern, such as in malware detectors. If mutual agreement analysis is used to optimize classifier training, then an attacker may have more knowledge of additions to the training set than if random selection is used. However, it is not obvious how this knowledge could be exploited by an attacker. Any effective attacks that use knowledge of mutual agreement based training optimization to poison the classifier would be important.

Some advocate the use of simple, monolithic classifiers because the result is perceived as easier to interpret. For example, the ability of Drebin to identify the features that contribute to the classification is lauded. It is not clear, however, if this information is really useful to end users. Users are already given the opportunity to review permissions and often choose incorrectly when prompted. Given that URLs and API calls can be socially engineered and that users are generally not aware of these elements, it is not likely that providing these items as context to a user will help them make a correct decision. For security professionals, ensemble classifiers provide mechanisms that aid in analysis, such as similarity to existing known malicious or benign samples. Most importantly, the feature set will be useful to a trained analyst.

Mutual agreement analysis gives operators greater confidence in the accuracy of the classifier and the ability to prioritize response to alerts. Some operators will use ensemble classifier introspection simply to adjust the voting threshold. Environments that seek to avoid false negatives (evasion attacks) will use a low threshold and increase the number of false positives. On the other hand, some environments might use a higher than normal voting threshold to achieve a low false positive but potentially higher false negative rate, such as that achieved by antivirus engines. The operator gets the most benefit from mutual agreement analysis when uncertain observations are subjected to focused analysis. These samples must necessarily be subjected to different and complementary analysis or detections.

Since the number of uncertain observations is low for a well performing classifier, this second opinion can be relatively expensive, possibly manually driven or involving dynamic analysis.

Ensemble diversity based confidence estimates are useful for organizations that desire to identify novel attacks to perform additional analysis. While possibly unconventional for the machine learning field, the addition of the uncertain outcome is intuitive for the security field where many systems provide only adjudications for known observations, whether benign or malicious. For example, it is common for SPAM filters to utilize a quarantine for samples that cannot be classified reliably. Very often, high fidelity alerts are preferred over a response for every observation.

Mutual agreement analysis is very effective at identifying those samples that are not similar to the already known samples in the training set. Since adding uncertain samples to the classifier dramatically improves the classifier accuracy, analysis of uncertain observations is likely to motivate rather than desensitize operators. Operators are empowered to improve the classifier in a manner much more effective than random additions to the training set.

Evaluation of machine learning based detectors might be improved through application of mutual agreement analysis. A concise metric is the Uncertain Rate, or the portion of observations for which a classifier is poorly suited to provide a prediction. The effectiveness of classifier evaluation using the mutual agreement score distribution and variance could be a topic of future studies. The classifier score distributions shown in Figure 5.9a and Figure 5.9b seem to indicate that regression could be used to predict the amount of successful evasions. The difficulty in this type of analysis, however, is separating the arcs for the benign and malicious data when external ground truth is not provided.

Most importantly, monitoring mutual agreement in ensemble classifiers raises the bar for evasion, for both previously unseen attacks and targeted mimicry attacks. Contemporary evasion attacks, which have called into question the resiliency of learning based detectors, are shown to be weaker than previously supposed. Merely obfuscating attacks such that they no longer appear as known attacks is not enough. Successful mimicries must very closely mirror benign samples. Of course, future research into the degree to which mutual agreement analysis can improve attack quality is imperative.

Chapter 6: Specific Value Based Features

PDFrate relies on features that parameterize the documents. In all cases, these features quantify general metrics of the document, minimizing reliance on specific values or terms.

For example, features such as the number of characters in a metadata field are used.

I studied the feasibility of using specific value or term based features in addition to general structural and metadata based features. Since these features are optional and easily modified by the document author, it was recognized these features were inadequate on their own. However, I sought to determine the degree to which employing observation of specific values could improve classification performance. I attempted various methods of turning these values into numeric features suitable for machine learning. I found that if bagging is used in a prevalence database, it is possible to train a classifier that is able to use the overlap in specific values when present without a large increase in false negatives when values do not overlap.

6.1 PDF Metadata Values

I studied the specific values found in PDF documents. I used the same regular expression based metadata and structural information based extractor. However, instead of deriving numeric features such as the number of objects, I operated on this data as strings or terms.

This data can be categorized as follows:

• Author, Company, etc.: Metadata items tied to the author of the PDF.

• Creator, Producer, etc.: Metadata items that likely identify the tool used to generate the PDF.

• Title, Subject, etc.: Metadata items associated with the document content.

• Box dimensions, Image dimensions, etc.: Structural artifacts that reflect the layout of the document.

• CreateDate, InstanceID, etc.: Metadata items that should be unique on a per-document basis.

I studied the metadata for nearly 7 million PDFs taken from the same source as the operational data set. I observed that the opportunistic malicious PDFs had a much smaller amount of metadata than the other classes. It also appeared that many of the targeted malicious PDFs had consistent metadata. As examples of the types of metadata encountered, I present a selection of values from the Creator metadata item and layout box dimensions. I report the prevalence of a few specific values, giving the count of the occurrences of these metadata values (which can be more than once per document). I observed notable differences between the targeted attacks and the other document types (benign or opportunistic malicious), so I provide counts based on this division.

The Creator metadata values indicate the software used to generate the document.

Table 6.1 shows some of the most common specific values that are exclusive to each class, as well as some that occur in both.

Table 6.1: Example Creator Value Frequencies

Targeted    Benign and
Malicious   Opportunistic Malicious   Value
       22                         0
       20                         0   ...A.c.r.o.b.a.t. ....Vh. .8...0
        6                         0   Advanced PDF Repair: http://www.pdf-repair.com
        8                      6769   Adobe LiveCycle Designer ES 8.2
        5                       724   ...W.P.S. .O.f.f.i.c.e. N*N.rH
        4                        33   llPDFLib program
        4                       308   ...A.c.r.o.b.a.t. .P.D.F.M.a.k.e.r. .9...0. .W.o.r.d. rH
        0                    298307   Microsoft Word
        0                    384345   PScript5.dll Version 5.2.2

The Box structural items reflect the dimensions of various types of boxes used in layout of the document. Table 6.2 shows some of the most common values for these classes. It should be noted that some values are either exclusively found within or without the targeted malicious class, while some are found in the benign class also.

Table 6.2: Example Box Value Frequencies

Targeted    Benign and
Malicious   Opportunistic Malicious   Value
       24                         0   Box: 8240x1 (other)
        1                         0   Box: 196x554 (other)
        1                         0   Box: 1691x1532 (other)
      543                  10362262   Box: 595x842 (A4)
      242                  95145391   Box: 612x792 (letter)
       82                        62   Box: 379x698 (other)
       76                        25   Box: 674x799 (other)
       65                      5108   Box: 14x20 (other)
       40                    151123   Box: 1000x1000 (other)
       35                    852932   Box: 2568x1314 (other)
       34                      5247   Box: 7x17 (other)
       31                     14447   Box: 1300x1135 (other)
       31                  19632217   Box: 0x0 (other)
       30                        44   Box: 81x4 (other)
        0                   6696393   Box: 1x0 (other)
        0                   8455450   Box: 10x10 (other)
        0                   8890847   Box: 0x1 (other)
        0                  10740361   Box: 432x648 (other)

6.2 Metadata and Structural Based Signatures

An obvious mode of employing specific value indexes would be to generate signatures for values appearing in one class but not appearing in others. A quick scan of the data indicates that one could generate static signatures for a few Creator values and detect a large number of the targeted documents. More extensive analysis has been done on artifacts associated with box sizes common in targeted documents. While most of the box sizes as summarized above are not strong indicators themselves, the actual objects they are part of seem to be. It is possible to create more precise signatures by including actual raw values seen in the document, and in some cases, some surrounding data. A handful of these signatures can be constructed to have over a 50% TP rate while maintaining a zero FP rate. Manual analysis shows that these signatures, and the artifacts they are derived from, can be found consistently across a large number of documents, which span different embedded malware, different exploits, vastly different structure, and years of time.
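A hedged sketch of this kind of signature mining is shown below: it simply emits values that occur repeatedly in targeted documents and never in the benign/opportunistic corpus. The minimum count of five is an arbitrary illustration, not a tuned parameter.

```python
# Sketch of mining exact-match signature candidates from per-class value counts.
def signature_candidates(value_counts, min_targeted=5):
    """value_counts: {value: (targeted_count, benign_and_opportunistic_count)}."""
    return [value for value, (targeted, other) in value_counts.items()
            if targeted >= min_targeted and other == 0]

# Example in the style of Table 6.1 (illustrative counts):
# signature_candidates({"Advanced PDF Repair: http://www.pdf-repair.com": (6, 0),
#                       "Microsoft Word": (0, 298307)})
# returns only the first value.
```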

Using this knowledge in static signatures is significant because it represents a previously untapped source of signatures that can be used in conventional systems. However, it is conceded that if attackers know this analysis is being performed, they may learn how to evade these detections, much like malware authors evade other signatures. Even without direct evasion, it may not be feasible to have a large enough data set to vet signatures enough to ensure an acceptably low FP rate.

Where metadata and structural based byte level analysis is likely to have value is in linking related attacks. It is likely that documents that share rare indicators are related either through use of a common tool or common PDF structures used in embedded malware in the document. This view into malware provenance is particularly useful in targeted attacks where understanding of a persistent attacker is sought.

6.3 Quantized Prevalence of Specific Values

When used in conjunction with other features in a machine learner, these specific values show promise to add value when they are consistent. The challenge is to determine if these consistencies can be achieved without hurting detections when metadata values are dynamic or absent.

An important question is how this data would be leveraged in a machine learner. A straightforward implementation would be to use each term as a variable, as one would do in SPAM detection or document clustering. The core challenge in doing so is that the number of distinct values is very large. It is not clear how one would reduce these terms to an acceptable number of features. Unlike terms in an email body or document text, only a sparse number of each class of metadata/structural item appears in a given document. It is not feasible to select a subset of terms in this data that will be present in a large number of documents of each class.

One must consider various forms of dimension reduction. For example, one could reduce the Creator field to the core program name while ignoring versions or reduce box sizes to an acceptably low number of box size ranges. Unfortunately, this will not likely be effective, as the specific Creator versions or box sizes are what are useful for correlation. Note that the generic features already include some amount of binned sizes, such as image sizes. One could use some form of statistical learning on a first pass to identify the most important terms, and then use these. This will likely have significant challenges including a still very unwieldy number of terms to select from, lack of stability over time and data sets, and no guarantee that the features selected by one statistical learning method will be optimized for use with the second one actually used for classification.

Another option for using this data would be to only use terms for minority classes, such as targeted documents. This would largely address the dimensionality problem. However, it would limit the value of deviations from the majority classes.

I explored various approaches for converting these values extracted from the documents into usable features that could be used in conjunction with the numeric features already employed. I studied using feature hashing, but this provided poor performance. I also considered a Bayesian approach, but it provided lower classification quality and higher complexity. I found that binned prevalence based features were more effective than the other methods I tested.

I employed features based on the quantization of the frequency of the term in each class, with the features split by the high level type of the field from which the value is taken. The histogram of the number of documents containing each term is split into buckets, with the bucket into which a given term falls forming the basis for these features. The count of the existence of each pertinent metadata or structural item is calculated on a per document basis as well as aggregated into a database providing counts of membership of each value in each class.

The basic types of specific values are split into 10 divisions: creator, producer, author, company, box, image, subject, keywords, title, and file, representing their respective metadata and structural items. Furthermore, the frequency of observance of the values in each class is divided into a small number (2-5) of buckets. A feature set can be derived from the counts of the items in each bucket. The buckets can be constructed with each class considered independently, or the cross product of the prevalence levels for the classes can be used.

For example, when classes are taken independently and three levels are used, the features for the creator items would be: creator benign bucket0, creator benign bucket1, creator benign bucket2, creator malicious bucket0, creator malicious bucket1, and creator malicious bucket2.

If the classes are used together to generate buckets, then features for two prevalence levels would be: creator benign bucket0 malicious bucket0, creator benign bucket0 malicious bucket1, creator benign bucket1 malicious bucket0, creator benign bucket1 malicious bucket1.

When a new document is to be classified, each value is looked up in the database and is determined to belong in one of the buckets. The count for that bucket is incremented.

This is repeated for all present metadata and structural elements.
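The following sketch illustrates this feature construction for the independent-class case with two prevalence levels. The linear bucketing function is an assumption consistent with the discussion below; the exact boundaries used in the evaluation are not reproduced here.

```python
# Sketch of quantized prevalence features (independent classes, two levels).
from collections import defaultdict

CLASSES = ("benign", "malicious")

def bucket(count, max_count, n_levels=2):
    """Linear quantization of a prevalence count into n_levels buckets."""
    if max_count == 0:
        return 0
    return min(n_levels - 1, int(n_levels * count / (max_count + 1)))

def quantized_features(doc_values, prevalence_db, max_counts, n_levels=2):
    """doc_values: (field, value) pairs from one document.
    prevalence_db[(field, value)][cls] is the number of training documents of
    class cls containing the value; max_counts[cls] is the largest such count."""
    feats = defaultdict(int)
    for field, value in doc_values:
        counts = prevalence_db.get((field, value), {})
        for cls in CLASSES:
            b = bucket(counts.get(cls, 0), max_counts[cls], n_levels)
            feats[f"{field}_{cls}_bucket{b}"] += 1
    return dict(feats)
```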

The primary tunable parameters for this method are the number of levels or quantizations indicating the frequency of a given value, whether the frequency for each class is taken independently or dependently, and the function that is used for partitioning the prevalence parameter.

In general, the function used to define the prevalence based quantization does not matter.

This is not surprising as the number of divisions used also had a relatively small impact on the outcome. The following functions were tested: log, sqrt, linear, square, and exp. There was very little difference between them, but the linear divisions performed the best, so they were used.

Table 6.3 demonstrates the differences in classification error based on whether buckets were created independently by class or the features included the full cross products of the buckets by class, and on the number of levels or divisions by term prevalence. The error rate is the Random Forests out-of-bag error estimate. Here again, these parameters have only a small effect on classification error rates. Hence, for my evaluations, I employ two prevalence levels and use features based on a single class.

Table 6.3: Classification Error by Feature Set Formulation

Number of Prevalence Levels   Class Dependent (Error %)   Class Independent (Error %)
2                             .12                         .11
3                             .08                         .11
4                             .07                         .12
5                             .07                         .11

The results of applying this method to the training and operational data sets are shown in Table 6.4. I use the Contagio data set for training and the Operational data set to test extrapolation. Using quantized indexes provides mediocre classification rates, which drop as the method is used in extrapolation to less closely related data sets.

Table 6.4: Evaluation of Quantized Indexes

                    Cross Validation        Extrapolation
Method              TPR (%)   FPR (%)       TPR (%)   FPR (%)
Quantized Indexes   99.3      4.2           99.3      13.6

6.3.1 Combining General and Specific Value Methods

It is important to evaluate the effectiveness of specific value methods in conjunction with features based on general attributes. The classification effectiveness of both the quantized and per term forms of specific value indexing is shown in Table 6.5.

112 Table 6.5: Combining General and Specific Value Methods

                           Cross Validation        Extrapolation
Method                     TPR (%)   FPR (%)       TPR (%)   FPR (%)
Generic Features           99.7      0.20          100       0.28
Specific Value Features    99.3      4.2           99.3      13.6
Generic + Specific Value   99.9      0.20          99.7      0.82

The specific value indexing methods perform especially poorly when extrapolating from one data set to another unrelated data set. Combining generic features and quantized specific value indexes does improve detection rates for cross-validation (the TPR rises from 99.7% to 99.9%). It also raises the false positive rate in extrapolation (from 0.28% to 0.82%).

6.3.2 Improving Extrapolation through Prevalence Database Bagging

Thus far, specific value indexing has been shown to be effective for data from similar sources, but has been less effective when operating on more diverse data sets. There is a tendency to high false positive rates when extrapolating to unseen data. This error could be explained by poorer durability in the underlying features, or it could be caused by generation of a machine learner that extrapolates poorly due to overfitting, for example.

To improve the predictive capabilities of specific value indexes, the training process was modified such that the database used to look up terms during feature generation was populated with a subset of documents from the training set. The full training set is used to generate the feature set for the Random Forests classifier, but the database of specific values from which specific value prevalence rates are taken only contains a portion of the training set. As a result, during training, many of the specific values found in the documents in the training set are not found or have a lower prevalence than they would have otherwise, tending to diminish the classification error related to previously unseen values.
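A minimal sketch of populating the prevalence database from only a portion of the training set is shown below; the 50% fraction mirrors the best-performing setting reported in Table 6.6.

```python
# Sketch of bagging the specific value prevalence database.
import random
from collections import defaultdict

def build_prevalence_db(training_docs, subset_fraction=0.5, seed=0):
    """training_docs: list of (label, [(field, value), ...]) tuples.
    Only a random subset of documents contributes prevalence counts."""
    rng = random.Random(seed)
    subset = rng.sample(training_docs, int(subset_fraction * len(training_docs)))
    db = defaultdict(lambda: defaultdict(int))
    for label, values in subset:
        for field, value in set(values):   # count each value once per document
            db[(field, value)][label] += 1
    return db
```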

The results of modulating the portion of the training set represented in the specific value prevalence database are shown in Table 6.6.

Table 6.6: Varying Portion of Training Set in Prevalence Database

Subset Size (%)   True Positive Rate (%)   False Positive Rate (%)
100               99.3                     14
80                93.6                     1.6
66                93.6                     0.72
50                93.6                     0.55
33                15.5                     0.57

Note that if only half the training set is represented in the specific value prevalence database, then the high false positive rate that occurs when extrapolating to unseen data dramatically diminishes. This suggests that these features are durable enough to be useful for classification, but the classifier needs to be trained with tolerance for unseen data.

When the data in the prevalence database is bagged, specific value indexing is shown to be effective.

This improvement also causes the combination of the specific value features and the generic features to perform much better than either alone. This improvement is demonstrated in Table 6.7.

Table 6.7: Combining Specific Value and General Features with Bagging of Prevalence Data

                           Cross Validation        Extrapolation
Feature Set                TPR (%)   FPR (%)       TPR (%)   FPR (%)
Generic                    99.7      0.20          100       0.28
Specific Value             99.6      0.20          93.6      0.55
Generic + Specific Value   99.8      0.10          99.7      0.16

When optimized, specific value indexing can be effective at reducing a portion of the residual classification error from classification based on general features of PDF documents.

It was determined that using a quantized approach to generating features based on specific value prevalence is most effective. Also, using bagging during insertion into the specific value prevalence database allows the classifier to be more robust to typical differences in documents and to perform better when extrapolating to previously unseen data. The challenge to this approach is that the majority of these values should be relatively easy to modify, as they are generally coincidental with adding malware to documents.

Chapter 7: Microsoft Office Document Malware Detection

The ability to detect malware using features based on document structure and metadata is not unique to the PDF file format. To demonstrate the universality of this approach, the same methodology is applied to other file formats, particularly the Microsoft Office file formats. I demonstrate that it is possible to classify both OLE and zip based files using features derived from these file containers.

7.1 Detection of OLE Based Malware Using Metadata and Structure

I performed a much less rigorous study using mechanisms similar to PDFrate on the Microsoft Office (OLE) file format. The same general methodology was applied to the Microsoft Office file format, including development of metadata extraction techniques, identification of features (231 were used), and the application of the same statistical learning routines. The features used for OLE files were similar to those used for PDF but were peculiar to the OLE format. For example, the number and sizes of embedded streams, the number and type of embedded image files, and the count of embedded fonts and tables are used as features.
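As a rough illustration of container-level OLE features, the sketch below counts and sizes embedded streams with the olefile library. The library choice and the macro check are assumptions for illustration; they are not the extractor used in this study.

```python
# Sketch of simple OLE container features (stream counts and sizes).
import olefile

def ole_features(path):
    ole = olefile.OleFileIO(path)
    try:
        streams = ole.listdir(streams=True, storages=False)
        sizes = [ole.get_size("/".join(s)) for s in streams]
        return {
            "num_streams": len(streams),
            "total_stream_bytes": sum(sizes),
            "max_stream_bytes": max(sizes) if sizes else 0,
            "has_macro_storage": int(ole.exists("Macros") or ole.exists("VBA")),
        }
    finally:
        ole.close()
```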

The data set for these preliminary results consisted of 10,000 documents obtained from a combination of the same sources as the Contagio and operational PDF data sets, but is compiled without regard for maintaining separation of the two data sources. This data set is summarized in Table 7.1. The outcome of this testing is shown in Tables 7.2 and 7.3, which show the classification confusion matrix for the internal estimate of error performed during classifier construction for each facet of classification. The overall classification error estimate is 0.45% for ben/mal and 1.47% for opp/tar.

Table 7.1: OLE Data Set Summary

Class                 Count
benign (ben)          9,592
opportunistic (opp)     352
targeted (tar)           56
total                10,000

Table 7.2: OLE Classification Matrix (ben/mal)

Class   ben          mal
ben     9587 (TN)    5 (FP)
mal     40 (FN)      368 (TP)

Table 7.3: OLE Classification Matrix (opp/tar)

Class   opp          tar
opp     349 (TN)     3 (FP)
tar     3 (FN)       53 (TP)

As with PDF documents, it is anticipated that detection rates can be increased through improvements to feature extraction and data set compilation. Regardless, these basic results demonstrate strong promise in classifying OLE files.

7.2 Detection of Zip Based Malware

There are numerous file formats based on the zip container. I studied the ability to classify several of these formats using features derived from the zip container itself. Specifically, I studied classification of Office 2007 OOXML documents (.docx), Java Archives (.jar), and Android packages (.apk).

7.2.1 Zip File Attributes

Zip files contain a fair amount of metadata about each file contained inside of them. Generally, they contain information similar to a filesystem, such as dates and attributes. Zip files also contain information related to how the files are compressed, including the compression type and the version of the zip specification needed to extract the file.
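As a rough illustration, Python's standard zipfile module exposes most of these per-file attributes directly; the selection below is a sketch of the raw material available for feature construction, not the extractor used in this study.

```python
import zipfile

# Sketch: enumerate per-file zip metadata usable as raw feature material.
def zip_entries(path):
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            yield {
                "name": info.filename,
                "date_time": info.date_time,             # stored timestamp
                "create_version": info.create_version,   # zip spec version of creator
                "extract_version": info.extract_version,
                "compress_type": info.compress_type,     # e.g. stored vs. deflated
                "compress_size": info.compress_size,
                "file_size": info.file_size,
                "external_attr": info.external_attr,     # permissions/attributes
                "extra": info.extra,                      # raw extended fields
            }
```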

7.2.2 Zip Features

From the few attributes contained in the zip metadata, 99 features were constructed. The most important features by file type are shown in Table 7.4.

There are a few classes of features that are ranked highly across all zip based file types. The "file type" predictors are based on the prevalence of several categories of file types derived from inspection of the extension of the file name. For example, "file type img" indicates the prevalence of various types of images, including .gif, .jpg, and .png image files. The "dir depth" features indicate the prevalence of files at the specified level of the directory tree in the archive; for example, "dir depth 0" indicates the prevalence of files in the root of the zip archive. The "compress ratio" features indicate the prevalence of files whose compressed size to raw size ratio is below the threshold specified; for example, "compress ratio 20" indicates the prevalence of files whose ratio of compressed size to raw size is between 0% and 20%. The features that end with "first" or "last" indicate the relative position of the first or last file in the archive that has the indicated attribute; for example, "core.xml last" indicates if the last file in the archive is named "core.xml". The "extended field" features indicate the prevalence of specific extended fields.
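A rough sketch of how such prevalence-style features might be derived from the raw zip metadata is given below; the extension groups, depth range, and ratio buckets are assumptions for illustration only.

```python
import os
import zipfile

IMG_EXTS = {".gif", ".jpg", ".jpeg", ".png"}   # assumed "file type img" grouping

def zip_features(path):
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
    n = float(len(infos)) or 1.0
    feats = {}
    # Prevalence of image files, judged only by the declared extension.
    feats["file_type_img"] = sum(
        os.path.splitext(i.filename)[1].lower() in IMG_EXTS for i in infos) / n
    # Prevalence of files at each directory depth (0 = archive root).
    for depth in range(4):
        feats["dir_depth_%d" % depth] = sum(
            i.filename.rstrip("/").count("/") == depth for i in infos) / n
    # Prevalence of files whose compressed/raw size ratio is below each bucket.
    for bound in (20, 40, 60, 80, 100, 120):
        feats["compress_ratio_%d" % bound] = sum(
            i.file_size > 0 and 100.0 * i.compress_size / i.file_size <= bound
            for i in infos) / n
    # Relative-position style feature: is [Content_Types].xml the last entry?
    feats["Content_Types_last"] = int(
        bool(infos) and infos[-1].filename == "[Content_Types].xml")
    return feats
```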

The effectiveness of these features can be split into two main categories. Many of the features reflect attributes of the files making up the zip archive. Some of the features are artifacts of the program used to create the zip archive, which can also be useful for classification.

Most of the features that are useful for classification reveal attributes of the files contained in the zip archives. For example, the features based on file type reveal the type of files contained in the archive. In the case of .docx files, the existence of .bin files is important because dynamic content such as ActiveX content is stored in these files. The only features that provide definitive information about the content of the files in the archive are the compression ratios, which serve as a proxy for the entropy or information density of the files. This is painted with a broad stroke, but it does provide the capability to separate highly compressible data, such as text, from data that does not compress as well.

Table 7.4: Zip File Format Top Features

docx                        jar                         apk
file type bin               dir depth 0                 ext cafe last
compress ratio 20           create version 2.0          type sig last
file type xml               compress method store       extended field 0xCAFE
extended field 0xA220       dir depth 2                 AndroidManifest.xml last
dir depth 0                 compress method deflateN    compress ratio 100
dir depth 1                 file type class             dir depth 2
dir depth 3                 num dir entries             classes.dex last
Content Types.xml last      dir depth 1                 create version 2.0
compress ratio 60           file type oth               dir depth 1
dir depth 2                 extended field 0xCAFE       compress ratio 40
compress ratio 40           compress ratio 40           file type oth
file type oth               ext cafe last               file type xml
core.xml last               file type xml               compress method deflateN
num dir entries             dir depth 4                 compress ratio 60
compress ratio 80           file type img               compress ratio 80
compress method deflateS    compress ratio 120          file type img
app.xml last                compress ratio 80           num timestamps uniq
compress ratio 120          file type sig               compress method store
compress method store       num timestamps uniq         dir depth 0
file type img               compress ratio 60           file type dex

These features provide high classification quality based on a relatively small amount of data and superficial analysis of the zip archive. The durability of these features against direct evasion is not well known and can probably only fairly be evaluated with extended research. However, there are a few reasons why these features are rooted in attributes that are not completely arbitrary and, therefore, not trivially evaded by an attacker. Typically, the declared file extension is not to be trusted, but in many cases the method of file access and the conventions of the file format may enforce some consistency. For example, the .jar format generally requires Java class files to be named as .class files to be loaded through the typical class loading mechanisms. Similarly, the information density of some content may be difficult for an attacker to normalize. For example, it may be difficult or even prohibitive to convert malicious code to match the information density of other content, especially if decoding of the content is not practical because of how the content is used in the exploit.

Some malicious file attributes may be relatively easy for an attacker to mimic. For example, the tendency of malicious files to have a simpler directory structure could be overcome by an attacker by merely adding superfluous files and directories to the malicious archive. If this additional inert content does not make delivery of the exploit more difficult, then this class of features may be easy to evade.

Table 7.5: Zip File Classification Result by Type

              .docx                     .jar                      .apk
Class    ben     mal    error      ben     mal     error     ben     mal     error
ben      399     16     3.9%       4745    271     5.4%      1780    184     9.4%
mal      26      389    6.3%       163     4853    3.2%      396     1568    20.2%

7.2.3 Zip Based File Format Classification Evaluation

The effectiveness of the classification for each file type is shown in Table 7.5. The data for this evaluation is taken from VirusTotal, using equal parts benign and malicious samples.

The results are taken from the Random Forests out-of-bag error estimate.
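The evaluation setup can be sketched as follows; the random placeholder data stands in for zip-derived feature vectors, and the parameter choices are assumptions rather than the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: Random Forest whose out-of-bag (OOB) error serves as the accuracy
# estimate. The random data is a placeholder for real zip-derived features.
rng = np.random.RandomState(0)
X = rng.rand(200, 99)             # 99 zip features per sample (assumed shape)
y = rng.randint(0, 2, size=200)   # 0 = benign, 1 = malicious

clf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1,
                             random_state=0)
clf.fit(X, y)
print("OOB accuracy estimate: %.3f" % clf.oob_score_)
```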

This demonstrates that, operating on zip features alone, it is possible to provide surprisingly high accuracy in classifying various zip based file formats. For example, the accuracy is near 95% for .docx and .jar and near 85% for .apk.

7.2.4 Zip Creator Fingerprinting

Some features derive their utility from how the zip archive is constructed and, therefore, reflect characteristics of specific archive creation mechanisms or programs. The features may reflect a known malicious zip archive creator or the features may indicate that a known benign creator or set of benign creators was not utilized. The basis of classification is not then the content in the zip file, but the way the zip file is packaged.

Some zip archive programs create zip files with attributes that allow them to be discriminated. At a high level, this is similar to other forms of software fingerprinting, including OS network stacks, web browsers, and web server software. The attributes useful for fingerprinting include the zip version attributes, the compression algorithms used, the dates set, permission attributes, various facets of extended fields, the inclusion or exclusion of specific files, and the order of files in the archive. Despite all these attributes, many of the zip creators use null, default, or the same values for these attributes, resulting in the inability to uniquely fingerprint every zip creator program. However, some generally benign and some generally malicious zip creators can be successfully identified by the zip files they create.

The following zipinfo output shows the contents of a basic .docx file created with Microsoft Office:

-rw---- 4.5 fat 1312 b- defS 80-Jan-01 00:00 [Content_Types].xml

-rw---- 4.5 fat 590 b- defS 80-Jan-01 00:00 _rels/.rels

-rw---- 4.5 fat 17187 b- defS 80-Jan-01 00:00 word/document.xml

-rw---- 4.5 fat 6994 b- defS 80-Jan-01 00:00 word/theme/theme1.xml

-rw---- 4.5 fat 1797 b- defS 80-Jan-01 00:00 word/settings.xml

-rw---- 4.5 fat 1295 b- defS 80-Jan-01 00:00 word/fontTable.xml

-rw---- 4.5 fat 677 b- defS 80-Jan-01 00:00 word/webSettings.xml

-rw---- 4.5 fat 713 b- defS 80-Jan-01 00:00 docProps/app.xml

-rw---- 4.5 fat 641 b- defS 80-Jan-01 00:00 docProps/core.xml

-rw---- 4.5 fat 15148 b- defS 80-Jan-01 00:00 word/styles.xml

Attributes that are strongly characteristic of Microsoft Office zip packaging include the zip version of 4.5, the use of the deflateS compression method, and the dates set to 1980-01-01 (the MS-DOS epoch used by zip).

In contrast, the following excerpt from a malicious .docx file leveraging CVE-2013-3906 shows differences in these attributes:

-rw---- 4.5 fat 1296 b- defS 80-Jan-01 00:00 word/fontTable.xml

-rw-a-- 6.3 fat 19098 bx defN 13-Mar-23 13:57 word/media/image2.tiff

-rw---- 4.5 fat 1584 b- defS 80-Jan-01 00:00 word/settings.xml

-rw---- 4.5 fat 15542 b- defS 80-Jan-01 00:00 word/styles.xml

-rw---- 4.5 fat 7043 b- defS 80-Jan-01 00:00 word/theme/theme1.xml

-rw---- 4.5 fat 260 b- defS 80-Jan-01 00:00 word/webSettings.xml

-rw-a-- 6.3 fat 6678 bx defN 13-Apr-06 19:48 [Content_Types].xml

-rw---- 4.5 fat 590 b- defS 80-Jan-01 00:00 _rels/.rels

-rw-rw- 2.0 fat 2097088 b- defN 13-Apr-26 16:28 word/activeX/activeX.bin

-rw-rw- 2.0 fat 482542 b- stor 13-Jul-08 04:37 word/activeX/activeX1.bin

Note that the files added or modified to enable the exploit have different zip versions, varying dates, and differences in permissions and attributes, and that the order of the files ([Content_Types].xml should be the first file) is different from what it would be if this zip had been created with Microsoft Office.
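A hedged sketch of such a packaging check appears below. The specific tests (zip spec version 4.5, deflate compression, 1980-01-01 timestamps, [Content_Types].xml as the first entry) mirror the zipinfo observations above and are illustrative rather than a definitive Office fingerprint.

```python
import zipfile

OFFICE_EPOCH = (1980, 1, 1, 0, 0, 0)   # zip/MS-DOS epoch seen in Office output

def looks_like_office_packaging(path):
    """Return True if every entry matches the assumed Office packaging traits."""
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        if not infos or infos[0].filename != "[Content_Types].xml":
            return False
        for info in infos:
            if info.create_version != 45:                  # zip spec version 4.5
                return False
            if info.compress_type != zipfile.ZIP_DEFLATED:
                return False
            if info.date_time != OFFICE_EPOCH:
                return False
    return True
```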

Characteristics that are normally tied to benign or malicious zip archive creators must be used cautiously, however. Detections based on these characteristics are false negative prone because not all malicious files necessarily deviate from the norm for benign files. For example, in the case of malicious .docx files, it was observed that all of the files that involve an exploit have zip packaging characteristics that do not align with Microsoft Office. However, there were a large number of malicious documents that used macros instead of true exploits. The majority of these macro based maldocs had zip characteristics consistent with Microsoft Office, with one notable exception being those created with Metasploit. Detections based on artifacts of the creation program are also false positive prone. While the vast majority of benign documents are created by Microsoft Office, other applications, such as LibreOffice and Polaris Office, are commonly used to create benign documents that do not match the zip characteristics of Microsoft Office.

These attributes of zip archive structure that aid classification are superficial. Whether indicative of differences from benign creators or artifacts left by malware generation tools, countering detections based on these attributes would be achieved by mimicking typical or benign zip archive attributes. This is not particularly difficult. It would be straightforward to create a utility to modify these attributes by re-creating zip archives with typical characteristics.

7.3 Discussion

Detection of malfeasance is possible using attributes of zip archives in various zip based file formats. Classification accuracy can be as high as 95%. However, features of this type have not been shown to be durable: many should be susceptible to mimicry evasion with varying degrees of difficulty on the part of the attacker. As such, these features based on zip attributes could be suitable for inclusion in a classifier that also considers features from other sources.

Chapter 8: Conclusion

8.1 Summary

In this thesis, I sought to defeat exploits through modifications to commonly exploited file formats. I also studied detection of malicious documents using structural and metadata features and a machine learning based classifier. Responding to evasion attacks against PDFrate, I used diversity in ensemble classifiers to identify possible evasion.

I found that document content randomization, where content storage is modified at the file format level without changing the logical representation, is effective in blocking many exploits in Office documents. It is possible to randomize the block layout in OLE files and the encoding method in zip files, mangling exploit data in both document files and reader memory. This method is employed between document creation and document opening such that no modifications to the reader program or operating system are required. The overhead is comparable to signature matching. Document content randomization is applicable to situations where exploits use improper access to document content.

Recognizing that not all exploits can be prevented, I sought to improve malware detection rates using a Random Forest based classifier and structural features from documents. PDFrate provides high classification accuracy, including detection of novel malware samples that evade signature based detectors and separation of targeted attacks from broad based malware.

Due to the accuracy and availability of PDFrate, it has been the target of numerous recently published evasion studies. The attacks in these studies employed addition of decoy elements (mimicry), minimization of malicious content (reverse mimicry), and feature extractor subversion. Most attacks against PDFrate cause the classifier voting score distribution to fall between that of benign and malicious documents, making them easy to differentiate as outliers. I introduce mutual agreement analysis, where the level of consensus in the votes from individual trees in the Random Forest is measured. Mutual agreement in ensemble classifiers serves as an effective estimate of classifier confidence for each prediction. Mutual agreement analysis applies generally to situations in which classifiers are subject to mimicry attacks and helps optimize retraining.
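In sketch form, this per-sample measure can be computed from the individual tree votes of a fitted Random Forest; the placeholder data and the 50% uncertainty threshold below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mutual_agreement(clf, X):
    # Collect each tree's vote, then measure consensus per sample:
    # 1.0 means all trees agree, 0.0 means an even split.
    votes = np.stack([tree.predict(X) for tree in clf.estimators_])
    malicious_fraction = votes.mean(axis=0)
    return 2.0 * np.abs(malicious_fraction - 0.5)

# Placeholder data standing in for document feature vectors.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(200, 10), rng.randint(0, 2, 200)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

agreement = mutual_agreement(clf, rng.rand(5, 10))
uncertain = agreement < 0.5   # low consensus: treat the prediction as unreliable
```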

8.2 Lessons Learned

In this thesis, I learned the following:

Content Randomization Based Exploit Protections: Modifications to input data, such as randomization of content in Office documents, can foil exploits by inducing entropy in raw file access and reader memory. It is possible to permute data in common file formats without changing the representation presented to the end user.

Structural Feature Based Malware Detection: Machine learning based classifiers can provide high malware detection rates for files such as PDF documents. A robust feature set can be extracted from PDF document structure and metadata. Using a Random Forest classifier, reliable detection of previously unseen malicious documents can be achieved.

Mutual Agreement Analysis Identifies Classifier Failures: Measuring the level of concurrence in the individual votes in an ensemble provides a measure of confidence of predictions. Feature bagging is critical to creating the entropy in an ensemble that provides evasion resistance. Using mutual agreement analysis, most classifier failures due to novel malware or mimicry attacks can be identified.

8.3 Limitations

I recognize the following limitations in this work:

Content Randomization Applicability: Document Content Randomization is applicable when exploits rely on misuse of document content. The use of document content in malware is often driven by exploit protections such as ASLR, prohibitions on scripting or macros, and obstacles to using externally sourced malware. Office documents fit these parameters, hence content randomization is found to be effective. I did a cursory study of PDF documents and found that some have embedded malware that should be defeated with file level encoding randomization. In many malicious PDFs, javascript is used for heap sprays and the final malicious executable is downloaded from an external source.

Exploit Based Attacks: This thesis focuses on defeating delivery of malware through malicious documents. I advance malicious document countermeasures that focus on preventing exploitation of vulnerabilities in the reader application. I do not seek to detect malware propagation that relies on user exploitation. The mechanisms proposed do not detect macro based malware, which is very common in Office documents, and other social engineering based attempts to convince users to execute malware contained in documents. Furthermore, I do not address documents used to exploit human vulnerabilities, such as those that instruct users to visit malicious websites or transfer money. Hence, this thesis is limited to malware that exploits a software vulnerability, and the mechanisms presented have limited applicability to other forms of malicious content.

Superficial Feature Extractor and Deployment: As demonstrated by the Mimicus attacks explained in Section 5.2.1, my feature extractor can be fooled by spoofed document artifacts. Approaches that employ more complete parsing could prevent these forms of evasion. As shown in Section 5.2.3 and Section 5.2.4, limited or incorrect parsing of documents can cause incomplete feature extraction and can enable classifier evasion. Also, both my exploit mitigation techniques and malicious document detectors are stand alone systems that operate on documents. These mechanisms require a system that provides raw files for analysis, such as Laika BOSS (https://github.com/lmco/laikaboss), which is used with PDFrate in practice. The mechanisms presented in this thesis require access to raw document files for operation and rely on effective feature extraction for reliable detection.

Ground Truth: My evaluation relies on knowledge of the classification of a large number of documents. It is not trivial to determine ground truth for large numbers of documents. I rely on antivirus engines, after a waiting period, to determine if documents are benign or malicious. This is very effective for known malware, but it has been shown that antivirus engines can fail to detect targeted malware for years [53]. Determining ground truth for classification as opportunistic malicious or targeted malicious is much more difficult, as the difference between these two attack groups can be difficult to define, let alone determine. Very often, the distinction is made using factors loosely related to the document itself, including information in the embedded malware or patterns in the delivery vector or targeting. The accuracy of my evaluation rests on my ability to obtain ground truth on these samples.

8.4 Future Work

Building upon this thesis, the following are topics for future work:

Input Based Exploit Protections: Document Content Randomization prevents some exploits. Other forms of input based exploit protections and additional file formats should be studied. For example, greater investigation of compression based data permutations is warranted. Improving file format specific exploit protections using modifications to the document format or reader program operation is an area of future work. For example, designing file formats that facilitate or prevent use of file content in exploits is a possible research topic.

Structural and Metadata Based Features: In this thesis, I demonstrate that the structural and metadata based approach provides high classification accuracy for PDF files. I conduct only a limited evaluation on other file formats. Understanding the degree to which other file formats are amenable to this technique is a natural extension of this work.

Combining Structural Features with Other Attributes: I advance the use of features based on structure and metadata in documents. Future work should explore these features combined with features taken from sources such as document content, dynamic analysis, malware delivery infrastructure, and victim specific targeting. Since feature extractor subversion is a major evasion vector, features that indicate extractor failures are an additional feature type that should be explored.

Mutual Agreement Analysis Refinement: Mutual agreement analysis is shown to be effective through evaluation on evasion attacks against real malware detectors. The general applicability of mutual agreement analysis could be better demonstrated with a study of a larger number of evasion attacks on machine learners. A more comprehensive comparison with other methods of outlier detection should be performed.

Mutual Agreement Aware Adversarial Learning: This thesis demonstrates that mutual agreement analysis is effective at distinguishing contemporary evasion attacks. Mutual agreement analysis has not been subjected to direct adversarial learning. For example, the degree to which mutual agreement analysis enables training set poisoning should be studied. Mutual agreement analysis will likely be employed to improve mimicry attacks. Effective evasion of mutual agreement, and countering this evasion, is an important topic that will likely be addressed in future research.

Appendix A: Examples of Malicious PDF Structure

Malicious PDF documents present a wide diversity in structure. Presented here are two distinct malicious documents with large differences in metadata and structure, despite using the same exploit: CVE-2009-4324. The salient structural elements of these documents, one targeted malicious and the other opportunistic malicious, are represented in Table A.1 and Table A.2 respectively.

The targeted document is large (4 MB) and was delivered via a targeted email. It was rated by the classifier as 84.4% malicious and 80% targeted. The targeted document has many structural elements and attributes, including text content and font objects, that are superfluous to successful exploitation. Indeed, even the content in these unnecessary elements, which would not be seen by the user, is inconsistent with the social engineering used in this attack. Concerning metadata, the targeted document contains PDF ID values, which is normal. However, these values are the same, which indicates that this document was not modified by a conventional PDF editor, which should change the ID1 value but leave the ID0 static. Upon successful exploitation of the reader program, the shellcode decrypts and drops a malicious Windows portable executable and a benign PDF document. The existence of benign document artifacts and the method of embedding the malicious payloads in the targeted document suggest the author constructed the document from an existing benign document or document template. Construction was likely performed without a conforming PDF editor, as evidenced by the PDFID metadata and invalid streams.

The opportunistic document is very small (2 KB) and was delivered through malicious web traffic. The classifier assigned a rating of 100% malicious, 0% targeted (100% opportunistic). This document is minimal in structure. It has only the few structural elements necessary for obfuscation and exploitation. This document has no valid optional metadata. However, an object labeled as the "Creator" metadata item houses the bulk of the malicious content in the document and comprises 70% of the document. Regardless, the "Creator" metadata item is not reported by PDF metadata extraction tools. When exploitation occurs, the necessarily small shellcode pulls additional malicious content from the Internet. Contrasting with the embedding seen in the targeted document, the streams containing the exploit code must be decoded by the vulnerable reader, so these streams are well-formed.

Table A.1: Example Structure: Targeted

Location 000000
Content: %PDF-1.5..%...... 1 0 obj..<>..endobj..2 0 obj..<
Description: PDF header and OpenAction object.

Location 000140-000160
Content: j..4 0 obj..<>..e
Description: Font object: Times Roman.

Location 000180-000390
Content: ndobj..5 0 obj.. <>stream..x.U..n ...... ....endstream..endobj..6 0 obj<<
(Decoded, extracted raw text): Financial Reform Puts Republicans on the Spot (April 26) -- Even as they lost today's Senate vote
Description: Object containing formatted text. This content is not consistent with the rest of the social engineering used in the attack.

Location 0003E0-000420
Content: .endobj..7 0 obj..<>..endo
Description: Font object: Helvetica.

Location 000500-000580
Content: n..trailer..<<5181383ede94 ... 727bcb32ac27ded71c68>]>>..startx ref..0..%%EOF..8 0 obj..<
Description: Document trailer and PDF metadata.

Location 0005A0-000AC0
Content: /Action /S /JavaScript /JS (..fu nction re(count, what) {...var v ...... this.media.newPlayer(sgo);}.. ...... Func8x9();..)..>>..endobj..9 0
Description: Malicious javascript object. This javascript exploits CVE-2009-4324.

Location 000AE0-000F80
Content: obj< ... h 1062>>stream..iPh4Code...... ... ....endstream..endobj..10 0 obj<
(Decoded, extracted, and disassembled shellcode):
jmp short 0x12
pop edx
dec edx
xor ecx,ecx
mov cx,0x40f
xor byte [edx+ecx],0x8e
loop 0xa
jmp short 0x17
call 0x100000002
Description: Object containing a purportedly compressed stream; the stream is not validly compressed and contains shellcode. The excerpt shows a JMP-CALL-POP sequence followed by an XOR decryption loop that decodes more shellcode.

Location 000FA0-3FF920
Content: 688>>stream....l...... D.. ... ..endstream..endobj..xref..9 4..
(File extracted and decrypted using XOR key of 0xFC, location 000FCD to 0079CD):
Type: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
Name: update.exe
Compiled: Wed Dec 29 02:37:00 2010
(File extracted and decrypted using XOR key of 0xFC, location 007A15 to 3FF921):
Type: PDF document, version 1.5
CreationDate: D:20110125105603+08'00'
Producer: PDFlib 7.0.3 (C++/Win32)
PdfID0: D19C9464650960655B0FB612FD9702E0
PdfID1: D19C9464650960655B0FB612FD9702E0
(Decoded, extracted raw text): Rovos rail - Pride of Africa
Description: Object containing a purportedly compressed stream that does not contain valid compressed data. The majority of the stream contains two encrypted payloads: a PE executable and a PDF document. The PE executable is decrypted, installed on the system, and executed. This malware provides remote access trojan capabilities. The PDF document is decrypted and opened for the user. This PDF's content is consistent with the rest of the social engineering in the attack. When considered in isolation, this dropped PDF is benign.

Location 3FF9A0
Content: 12>>..startxref ..1092..%%EOF..
Description: End of document.

Table A.2: Example Structure: Opportunistic

Location 000000
Content: %PDF-1.0..1 0 obj<
Description: Document header.

Location 000120-000690
Content: >>endobj..7 0 obj..<>..endobj..8 0 obj..<>. .stream..x..Y...(..R.'...k:.'..? ... ...... bw...endstream..endobj.
(Deflated and un-escaped to reveal javascript excerpt): function a(){util.printd('p@111111111111111111111111 : yyyy111', new Date());}var h = app.plugIn ... d + e;}a();a();try {this.media.newPlayer(null);} catch(e) {}a();
(Further un-escaped to reveal shellcode excerpt): t....URLMON.DLL. URLDownloadToFileA.pdfupd.exe.crash.php.http://111.gosdfsdjas.com/l.php?i=16..
Description: Reference to, and definition of, the object labeled as the "Creator" metadata item. This object contains a compressed stream. The compressed stream contains obfuscated javascript that contains obfuscated shellcode. The javascript exploits CVE-2009-4324. The shellcode downloads more malicious content from the Internet.

Location 0006B0-000770
Content: .111611 0 obj<>..stream..x...J.* ...... a......endstream..endobj..trai
(De-obfuscated by removing comments to reveal javascript): var b=this.creator;var a=unescape(b);eval(unescape(this.creator.replace(/z/igm,'%'));
Description: Compressed javascript object. This javascript decodes and executes the javascript in the "Creator" stream above.

Location 000790-0007B0
Content: ler< ... ze 11>>
Description: End of document (without %EOF footer).

Appendix B: PDF Feature Descriptions

Name Description author dot Author: Count of dot characters author lc Author: Count of lower case characters author len Author: Count of characters author mismatch Count of differences in author values author num Author: Count of numeric characters author oth Author: Count of other characters author uc Author: Count of upper case characters box nonother types Count of page sized (A4, letter, etc) boxes box other only Boxes are all other sized (binary) company mismatch Count of differences in company values count aa Count of AA object markers count aa obs Count of obfuscated AA object markers count acroform Count of Acroform object markers count acroform obs Count of obfuscated Acroform objects count box a4 Count of A4 sized boxes count box legal Count of legal sized boxes count box letter Count of US letter sized boxes count box other Count of other boxes count box overlap Count of A4-width, letter-height boxes count docid Count of DocumentID objects count encrypt Count of Encrypt object markers count encrypt obs Count of obfuscated Encrypt object markers count endobj Count of end of object markers count endstream Count of end of stream markers

136 count eof Count of end of file markers count filter Count of all stream filters count filter A85 Count of A85 stream filters count filter AHx Count of AHx stream filters count filter ascii85 Count of ASCII85Decode filters count filter asciihex Count of ASCIIHexDecode filters count filter CCF Count of CCF filters count filter ccittfax Count of CCITTFaxDecode filters count filter crypt Count of Crypt filters count filter dct Count of DCT filters count filter Fl Count of FL filters count filter flate Count of FlateDecode Filters count filter jbig2 Count of JBIG2 Filters count filter jpx Count of JPX filters count filter lzw Count of LZWDecode filters count filter LZW Count of LZW filters count filter mult Instances streams with multiple filters count filter obs Instances of obfuscated filter names count filter RL Instances of RL filters count filter runlength Instances of RunLengthDecode filters count font Count of Font object markers count font obs Count of obfuscated Font object markers count image large Count of images between 786433 and 12582912 count image med Count of images between 64001 and 786432 count image small Count of images between 4097 and 64000 pixes count image total Count of image objects count image xlarge Count of images over 12582912 pixels

137 count image xsmall Count of images between 0 and 4096 pixels count instid Count of InstanceID metadata items count javascript Count of JavaScript object markers count javascript obs Count of obfuscated JavaScript object markers count js Count of JS object markers count js obs Count of obfuscated JS object markers count launch Count of Launch objects count launch obs Count of obfuscated Launch objects count obj Count of object markers count objstm Count of object stream markers count objstm obs Count of obfuscated object stream markers count openaction Count of OpenAction objects count openaction obs Count of obfuscated OpenAction objects count page Count of page markers count page obs Count of obfuscated page markers count pdfid0 Count of pdfid0 items count pdfid1 Count of pdfid1 items count ref Count of object references count ref nz Count of non-zero revision object references count richmedia Count of RichMedia objects count richmedia obs Count of obfuscated RichMedia objects count startxref Count of cross reference table markers count stream Count of stream markers count stream diff Difference of count stream and count endstream count trailer Count of trailer markers count xref Count cross reference table markers createdate dot Count of dot characters in creation date

138 createdate mismatch Count of differences in creation timestamp values createdate ts Creation timestamp (seconds–unix epoch) createdate tz Creation timezone (seconds–UTC offset) createdate version ratio Ratio of creation date to version (days since Jan 1 1993 / version) creator dot Creator: Count of dot characters creator lc Creator: Count of lower case characters creator len Creator: Count of characters creator mismatch Count of differences in creator values creator num Creator: Count of numeric characters creator oth Creator: Count of other characters creator uc Creator: Count of upper case characters delta ts Difference between creation and modification timestamps delta tz Difference between creation and modification timezones docid dot DocumentID: Count of dot characters docid lc DocumentID: Count of lower case characters docid len DocumentID: Count of characters docid mismatch Count of differences in DocumentID values docid num DocumentID: Count of numeric characters docid oth DocumentID: Count of other characters docid uc DocumentID: Count of upper case characters docinstid mismatch Count of differences in DocumentID and InstanceID values

file dot File: Count of dot characters

file lc File: Count of lower case characters

file len File: Count of characters

file mismatch Count of differences in File values

file num File: Count of numeric characters

139 file oth File: Count of other characters

file uc File: Count of upper case characters image mismatch Count of differences in image dimensions image totalpx Sum of image pixels instid dot InstanceID: Count of dot characters instid lc InstanceID: Count of lower case characters instid len InstanceID: Count of characters instid mismatch Count of differences in InstanceID values instid num InstanceID: Count of numeric characters instid oth InstanceID: Count of other characters instid uc InstanceID: Count of upper case characters keywords dot Keywords: Count of dot characters keywords lc Keywords: Count of lower case characters keywords len Keywords: Count of characters keywords mismatch Count of differences in keywords values keywords num Keywords: Count of numeric characters keywords oth Keywords: Count of other characters keywords uc Keywords: Count of upper case characters len obj avg Average difference between position of obj and next endobj

markers len obj max Maximum difference between position of obj and next en-

dobj markers len obj min Minimum difference between position of obj and next endobj

markers len stream avg Average difference between position of stream and next end-

stream markers

140 len stream max Maximum difference between position of stream and next

endstream markers len stream min Minimum difference between position of stream and next

endstream markers moddate dot Count of dot characters in modification date moddate mismatch Count of differences in modification timestamp values moddate ts Modification timestamp (seconds–unix epoch) moddate tz Modification timezone (seconds–UTC offset) moddate version ratio Ratio of modification date to version (days since Jan 1 1993 / version) pdfid0 dot PDFid0: Count of dot characters pdfid0 lc PDFid0: Count of lower case characters pdfid0 len PDFid0: Count of characters pdfid0 mismatch Count of differences in PDFid0 values pdfid0 num PDFid0: Count of numeric characters pdfid0 oth PDFid0: Count of other characters pdfid0 uc PDFid0: Count of upper case characters pdfid1 dot PDFid1: Count of dot characters pdfid1 lc PDFid1: Count of lower case characters pdfid1 len PDFid1: Count of characters pdfid1 mismatch Count of differences in PDFid1 values pdfid1 num PDFid1: Count of numeric characters pdfid1 oth PDFid1: Count of other characters pdfid1 uc PDFid1: Count of upper case characters pdfid mismatch pdfid0 different from pdfid1 (binary) pos acroform avg Average normalized positions of page markers (% of size) pos acroform max Normalized position of last acroform marker (% of size)

141 pos acroform min Normalized position of first acroform marker (% of size) pos box avg Average normalized positions of box markers (% of size) pos box max Normalized position of last marker (% of size) pos box min Normalized position of first box marker (% of size) pos eof avg Average of normalized positions of last EOF marker (% of size) pos eof max Normalized position of last EOF marker (% of size) pos eof min Normalized position of first EOF marker (% of size) pos image avg Average normalized positions of image markers (% of size) pos image max Normalized position of last image marker (% of size) pos image min Normalized position of first image marker (% of size) pos page avg Average normalized positions of page markers (% of size) pos page max Normalized position of last page marker (% of size) pos page min Normalized position of first page marker (% of size) pos ref avg Average of position of object references pos ref max Position of last object reference pos ref min Position of first object reference producer dot Producer: Count of dot characters producer lc Producer: Count of lower case characters producer len Producer: Count of characters producer mismatch Count of differences in producer values producer num Producer: Count of numeric characters producer oth Producer: Count of other characters producer uc Producer: Count of upper case characters ratio imagepx size Ratio of image totalpx to size ratio size obj Ratio of count obj to size ratio size page Ratio of count page to size

142 ratio size stream Ratio count stream to size ref max id Highest numerical value of object references ref min id Lowest numerical value of object references size Size of document (bytes) subject dot Subject: Count of dot characters subject lc Subject: Count of lower case characters subject len Subject: Count of characters subject mismatch Count of differences in subject values subject num Subject: Count of numeric characters subject oth Subject: Count of other characters subject uc Subject: Count of upper case characters title dot Title: Count of dot characters title lc Title: Count of lower case characters title len Title: Count of characters title mismatch Count of differences in title values title num Title: Count of numeric characters title oth Title: Count of other characters title uc Title: Count of upper case characters url dot Title: Count of dot characters url lc Title: Count of lower case characters url len Title: Count of characters url mismatch Count of differences in URL values url num URL: Count of numeric characters url oth URL: Count of other characters url uc URL: Count of upper case characters version PDF version as extracted from header
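As a small illustration of the character-class features listed above (for example "author lc", "author uc", "author num"), the counts for a metadata string field might be computed as follows; how whitespace and punctuation are bucketed into "oth" is an assumption here, not the extractor's exact rule.

```python
def char_class_features(field_name, value):
    # Sketch of per-field character-class counts; bucketing rules are assumed.
    value = value or ""
    return {
        "%s_len" % field_name: len(value),
        "%s_dot" % field_name: value.count("."),
        "%s_lc" % field_name: sum(c.islower() for c in value),
        "%s_uc" % field_name: sum(c.isupper() for c in value),
        "%s_num" % field_name: sum(c.isdigit() for c in value),
        "%s_oth" % field_name: sum(not c.isalnum() and c != "." for c in value),
    }

print(char_class_features("producer", "PDFlib 7.0.3 (C++/Win32)"))
```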

Appendix C: Copyright

Much of the material, both text and graphics, in this dissertation was published prior to its inclusion in this work. The material appearing in previously published copyrighted works is duplicated here by permission of the respective copyright holders denoted below.

Some of the content in this thesis, especially that in Chapter 3, was published in [101]. Copyright Springer International Publishing Switzerland 2015. http://link.springer.com/chapter/10.1007/978-3-319-26362-5_11

Some of the content in this thesis, especially that in Chapter 4, was published in [100]. Copyright 2012 ACM. http://doi.acm.org/10.1145/2420950.2420987

Some of the content in this thesis, especially that in Chapter 5, was published in [102]. Copyright 2016 Internet Society. http://dx.doi.org/10.14722/ndss.2016.23078

Bibliography

[1] PDF Most Common File Type in Targeted Attacks - F-Secure Weblog : News from the Lab. http://www.f-secure.com/weblog/archives/00001676.html. [2] Shadows in the cloud: Investigating cyber espionage 2.0. http://shadows-in-the- cloud.net/, 2010. [3] Foreign Spies Stealing US Economic Secrets in Cyberspace, Report to Congress on Foreign Economic Collection and Industrial Espionage, 2009-2011. Technical report, Office of the National Counterintelligence Executive, October 2011. [4] 5 Attackers & Counting: Dissecting The ”docx.image” Exploit Kit. http://www.proofpoint.com/threatinsight/posts/dissecting-docx-image-exploit- kit-cve-exploitation.php, December 2013. [5] Moheeb Abu Rajab, Jay Zarfoss, Fabian Monrose, and Andreas Terzis. A Multifaceted Approach to Understanding the Botnet Phenomenon. In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, IMC ’06, pages 41–52, New York, NY, USA, 2006. ACM. [6] Dmitri Alperovitch. Revealed Operation Shady RAT. http://www.mcafee.com/us/resources/white-papers/wp-operation-shady-rat.pdf, 2011. [7] R. Amin, J. Ryan, and J. van Dorp. Detecting Targeted Malicious Email. IEEE Security Privacy, 10(3):64 –71, May-June 2012. [8] Matthew Arnao, Charles Smutz, Adam Zollman, Andrew Richardson, and Eric Hutchins. Laika BOSS: Scalable File-Centric Malware Analysis and Intrusion De- tection System. https://github.com/lmco/laikaboss, July 2015. [9] Daniel Arp, Michael Spreitzenbarth, Malte H¨ubner,Hugo Gascon, and Konrad Rieck. Drebin: Effective and explainable detection of android malware in your pocket. In 21th Annual Network and Distributed System Security Symposium (NDSS), February 2014. [10] M. Bailey, E. Cooke, F. Jahanian, and D. Watson. The Blaster Worm: Then and Now. IEEE Security Privacy, 3(4):26–31, July 2005. [11] Daniel Barbara, Carlotta Domeniconi, and James P. Rogers. Detecting outliers using transduction and statistical testing. In Proceedings of the 12th ACM SIGKDD inter- national conference on Knowledge discovery and data mining, KDD ’06, pages 55–64, New York, NY, USA, 2006. ACM. 146 [12] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, and J. D. Tygar. Can Machine Learning Be Secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS ’06, pages 16–25, New York, NY, USA, 2006. ACM.

[13] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Scalable, Behavior-Based Malware Clustering. In Network and Distributed System Security Symposium (NDSS) 2009, 2009.

[14] Richard Bejtlich. TaoSecurity: What Is APT and What Does It Want? http://taosecurity.blogspot.com/2010/01/what-is-apt-and-what-does-it-want., January 2010.

[15] R Beverly and K Sollins. Exploiting transport-level characteristics of spam. In Con- ference on Email and Anti-Spam, 2008.

[16] Sandeep Bhatkar and R. Sekar. Data Space Randomization. In Diego Zamboni, editor, Detection of Intrusions and Malware, and Vulnerability Assessment, number 5137 in Lecture Notes in Computer Science, pages 1–22. Springer Berlin Heidelberg, January 2008.

[17] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion Attacks against Machine Learning at Test Time. In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný, editors, Machine Learning and Knowledge Discovery in Databases, number 8190 in Lecture Notes in Computer Science, pages 387–402. Springer Berlin Heidelberg, 2013.

[18] Battista Biggio, Igino Corona, Blaine Nelson, Benjamin I. P. Rubinstein, Davide Maiorca, Giorgio Fumera, Giorgio Giacinto, and Fabio Roli. Security Evaluation of Support Vector Machines in Adversarial Environments. In Yunqian Ma and Guodong Guo, editors, Support Vector Machines Applications, pages 105–153. Springer Inter- national Publishing, 2014.

[19] Dionysus Blazakis. Interpreter Exploitation. In WOOT, 2010.

[20] Stevens Le Blond, Adina Uritesc, Cédric Gilbert, Zheng Leong Chua, Prateek Saxena, and Engin Kirda. A Look at Targeted Attacks Through the Lense of an NGO. In 23rd USENIX Security Symposium (USENIX Security 14), pages 543–558, San Diego, CA, 2014. USENIX Association.

[21] Stephen Bradshaw. The Grey Corner: Omlette Egghunter Shellcode. http://www.thegreycorner.com/2013/10/omlette-egghunter-shellcode.html, October 2013.

[22] D. Brumley, P. Poosankam, D. Song, and Jiang Zheng. Automatic Patch-Based Exploit Generation is Possible: Techniques and Implications. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 143–157, 2008.

[23] Curtis Carmony, Xunchao Hu, Heng Yin, Abhishek Vasisht, and Mu Zhang. Extract Me If You Can: Abusing PDF Parsers in Malware Detectors. In 21th Annual Network and Distributed System Security Symposium (NDSS), February 2016.

[24] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Ma- chines. http://www.csie.ntu.edu.tw/ cjlin/libsvm/, May 2011.

[25] Xiaobo Chen. ASLR Bypass Apocalypse in Recent Zero-Day Exploits. http://www.fireeye.com/blog/technical/cyber-exploits/2013/10/aslr-bypass- apocalypse-in-lately-zero-day-exploits.html, October 2013.

[26] Deepak Chinavle, Pranam Kolari, Tim Oates, and Tim Finin. Ensembles in Ad- versarial Classification for Spam. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pages 2015–2018, New York, NY, USA, 2009. ACM.

[27] Corelan Team. Exploit notes-win32 eggs-to-omelet. https://www.corelan.be/index.php/2010/08/22/exploit-notes-win32-eggs-to- omelet/, August 2010.

[28] Marco Cova, Christopher Kruegel, and Giovanni Vigna. Detection and Analysis of Drive-by-download Attacks and Malicious JavaScript Code. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 281–290, New York, NY, USA, 2010. ACM.

[29] G.F. Cretu, A. Stavrou, M.E. Locasto, S.J. Stolfo, and A.D. Keromytis. Casting out Demons: Sanitizing Training Data for Anomaly Sensors. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 81–95, 2008.

[30] Jesse S. Cross and M. Arthur Munson. Deep PDF Parsing to Extract Features for Detecting Embedded Malware. Technical Report SAND2011-7982, Sandia National Laboratories, September 2011.

[31] CrowdStrike. Crowdstrike Intelligence Report: Put- ter Panda. http://cdn0.vox-cdn.com/assets/4589853/ crowdstrike-intelligence-report-putter-panda.original.pdf, 2014. [32] Mark Daniel, Jake Honoroff, and Charlie Miller. Engineering Heap Overflow Exploits with JavaScript. In Proceedings of the 2Nd Conference on USENIX Workshop on Offensive Technologies, WOOT’08, pages 1:1–1:6, Berkeley, CA, USA, 2008. USENIX Association.

[33] Theo de Raadt. OpenBSD 3.3 Release Notes. http://www.openbsd.org/33.html, May 2003.

[34] John P. Donaldson. Source Fingerprinting in Adobe PDF Files. PhD thesis, Naval Postgraduate School, Monterey, California, USA, December 2013.

[35] Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. A Survey on Automated Dynamic Malware-analysis Techniques and Tools. ACM Comput. Surv., 44(2):6:1–6:42, March 2008. 148 [36] Manuel Egele, Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. COMPA: Detecting Compromised Accounts on Social Networks. In NDSS, 2013, 2013.

[37] Manuel Egele, Peter Wurzinger, Christopher Kruegel, and Engin Kirda. Defend- ing Browsers against Drive-by Downloads: Mitigating Heap-Spraying Code Injection Attacks. In Ulrich Flegel and Danilo Bruschi, editors, Detection of Intrusions and Malware, and Vulnerability Assessment, number 5587 in Lecture Notes in Computer Science, pages 88–106. Springer Berlin Heidelberg, July 2009.

[38] Jose Miguel Esparza. PDF Attack: A journey from the Exploit Kit to the shellcode, July 2013.

[39] FBI. Update on Sony Investigation. https://www.fbi.gov/news/pressrel/press- releases/update-on-sony-investigation, December 2014.

[40] FBI Internet Crime Complaint Center. Criminals Continue to Defraud and Extort Funds from Victims Using CryptoWall Ransomware Schemes. http://www.ic3.gov/media/2015/150623.aspx, June 2015.

[41] Paul Ferguson. Observations on Emerging Threats. 2012.

[42] Fireeye. APT28: A Window into Russia’s Cyber Espionage Opera- tions? https://www.fireeye.com/blog/threat-research/2014/10/apt28-a-window-into- russias-cyber-espionage-operations.html, October 2014.

[43] Prahlad Fogla and Wenke Lee. Evading network anomaly detection systems: formal reasoning and practical techniques. In Proceedings of the 13th ACM Conference on Computer and Communications Security, pages 59–68, Alexandria, Virginia, USA, 2006. ACM.

[44] Simson Garfinkel, Paul Farrell, Vassil Roussev, and George Dinolt. Bringing Science to Digital Forensics with Standardized Forensic Corpora. Digit. Investig., Digital Investigation, Volume 6:S2–S11, September 2009.

[45] Alexandre Gazet. Comparative analysis of various ransomware virii. Journal in Computer Virology, 6(1):77–90, July 2008.

[46] Chris Grier, Lucas Ballard, Juan Caballero, Neha Chachra, Christian J. Dietrich, Kirill Levchenko, Panayiotis Mavrommatis, Damon McCoy, Antonio Nappa, Andreas Pitsillidis, Niels Provos, M. Zubair Rafique, Moheeb Abu Rajab, Christian Rossow, Kurt Thomas, Vern Paxson, Stefan Savage, and Geoffrey M. Voelker. Manufacturing Compromise: The Emergence of Exploit-as-a-service. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS ’12, pages 821–832, New York, NY, USA, 2012. ACM.

[47] Vinny Gullotto, Joe Faulhaber, Jeffrey Friedberg, Jeff Jones, Jimmy Kuo, John Lam- bert, Ziv Mador, Mike Reavey, Adam Shostack, George Stathakopoulos, Scott Wu, and Jeff Williams. Microsoft Security Intelligence Report volume 5. Technical report, October 2008.

149 [48] Mark Handley, Vern Paxson, and Christian Kreibich. Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics. In 2001 USENIX Security Symposium, pages 115–131, 2001. [49] Seth Hardy, Masashi Crete-Nishihata, Katharine Kleemola, Adam Senft, Byron Sonne, Greg Wiseman, Phillipa Gill, and Ronald J Deibert. Targeted threat index: Characterizing and quantifying politically-motivated targeted malware. In Proceed- ings of the 23rd USENIX Security Symposium, 2014. [50] Cormac Herley. The Plight of the Targeted Attacker in a World of Scale. In WEIS, 2010. [51] Thorsten Holz, Moritz Steiner, Frederic Dahl, Ernst Biersack, and Felix Freiling. Mea- surements and Mitigation of Peer-to-peer-based Botnets: A Case Study on Storm Worm. In Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emer- gent Threats, LEET’08, pages 9:1–9:9, Berkeley, CA, USA, 2008. USENIX Associa- tion. [52] Ling Huang, Anthony D. Joseph, Blaine Nelson, Benjamin I.P. Rubinstein, and J. D. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence, AISec ’11, pages 43–58, New York, NY, USA, 2011. ACM. [53] Mikko Hypponen. Why Antivirus Companies Like Mine Failed to Catch Flame and Stuxnet. http://www.wired.com/2012/06/internet-security-fail/, June 2012. [54] iSIGHT Partners. NEWSCASTER - An Iranian Threat Inside Social Media. Technical report, May 2014. [55] S. Jana and V. Shmatikov. Abusing File Processing in Malware Detectors for Fun and Profit. In 2012 IEEE Symposium on Security and Privacy (SP), pages 80–94, May 2012. [56] Jiyong Jang, David Brumley, and Shobha Venkataraman. BitShred: feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS ’11, pages 309–320, New York, NY, USA, 2011. ACM. [57] Georgios Kakavelakis, Robert Beverly, and Joel Young. Auto-learning of SMTP TCP Transport-Layer Features for Spam and Abusive Message Detection. In LISA 2011, 25th Large Installation System Administration Conference, Boston, MA, USA, De- cember 2011. USENIX Association. [58] Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Brad Miller, Vaishaal Shankar, Rekha Bachwani, Anthony D. Joseph, and J. D. Tygar. Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, AISec ’15, pages 45–56, New York, NY, USA, 2015. ACM. [59] M. Kanter and S. Taylor. Attack Mitigation through Diversity. In MILCOM 2013 - 2013 IEEE Military Communications Conference, pages 1410–1415, November 2013. 150 [60] Gaurav S. Kc, Angelos D. Keromytis, and Vassilis Prevelakis. Countering Code- injection Attacks with Instruction-set Randomization. In Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS ’03, pages 272– 280, New York, NY, USA, 2003. ACM.

[61] Jesse Kornblum. Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, Digital Investigation Volume 3, Supplement:91–97, September 2006.

[62] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. In Machine Learning, volume 51, pages 181–207, May 2003.

[63] Pavel Laskov and Richard Lippmann. Machine learning in adversarial environments. In Machine Learning, volume 81, pages 115–119, August 2010.

[64] Pavel Laskov and Nedim Srndic. Static detection of malicious JavaScript-bearing PDF documents. In Proceedings of the 27th Annual Computer Security Applications Conference, ACSAC ’11, pages 373–382, New York, NY, USA, 2011. ACM.

[65] F. Li, A. Lai, and D. Ddl. Evidence of Advanced Persistent Threat: A case study of malware for political espionage. In Malicious and Unwanted Software (MALWARE), 2011 6th International Conference on, pages 102 –109, October 2011.

[66] Haifei Li, Stanley Zhu, and Jun Xie. RTF Attack Takes Advantage of Multiple Ex- ploits. http://blogs.mcafee.com/mcafee-labs/rtf-attack-takes-advantage-of-multiple- exploits, April 2014.

[67] Wei-Jen Li, Salvatore Stolfo, Angelos Stavrou, Elli Androulaki, and Angelos D. Keromytis. A Study of Malcode-Bearing Documents. In Bernhard Hämmerli and Robin Sommer, editors, Detection of Intrusions and Malware, and Vulnerability Assessment 2007, volume 4579, pages 231–250. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[68] Zhichun Li, M. Sanghi, Yan Chen, Ming-Yang Kao, and B. Chavez. Hamsa: fast signature generation for zero-day polymorphic worms with provable attack resilience. In 2006 IEEE Symposium on Security and Privacy, pages 15 pp.–47, May 2006.

[69] Daiping Liu, Haining Wang, and A. Stavrou. Detecting Malicious Javascript in PDF through Document Instrumentation. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 100–111, June 2014.

[70] Davide Maiorca, Davide Ariu, Igino Corona, and Giorgio Giacinto. A Structural and Content-Based Approach for a Precise and Robust Detection of Malicious PDF Files. In Proceedings of the 1st International Conference on Information Systems Security and Privacy, pages 27–36. ScitePress Digital Library, 2015.

[71] Davide Maiorca, Igino Corona, and Giorgio Giacinto. Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. In Proceedings of the 8th ACM SIGSAC symposium on Information, computer and

communications security, ASIA CCS '13, pages 119–130, New York, NY, USA, 2013. ACM.

[72] Davide Maiorca, Giorgio Giacinto, and Igino Corona. A Pattern Recognition System for Malicious PDF Files Detection. In Proceedings of the 8th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM’12, pages 510– 524, Berlin, Heidelberg, 2012. Springer-Verlag.

[73] William R. Marczak, John Scott-Railton, Morgan Marquis-Boire, and Vern Paxson. When Governments Hack Opponents: A Look at Actors and Technology. In Proceed- ings of the 23rd USENIX Security Symposium, pages 511–525, 2014.

[74] Joshua Mason, Sam Small, Fabian Monrose, and Greg MacManus. English Shell- code. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS ’09, pages 524–533, New York, NY, USA, 2009. ACM.

[75] Dan Mcwhorter. APT1: Exposing One of China’s Cyber Espionage Units, 2013.

[76] Eitan Menahem, Asaf Shabtai, Lior Rokach, and Yuval Elovici. Improving malware detection by applying multi-inducer ensemble. In Computational Statistics & Data Analysis, volume 53, pages 1483–1494, February 2009.

[77] Microsoft. Compound File Binary File Format. https://msdn.microsoft.com/en-us/library/dd942138.aspx.

[78] David Moore, Colleen Shannon, and k claffy. Code-Red: A Case Study on the Spread and Victims of an Internet Worm. In Proceedings of the 2Nd ACM SIGCOMM Work- shop on Internet Measurment, IMW ’02, pages 273–284, New York, NY, USA, 2002. ACM.

[79] Tyler Moore, Richard Clayton, and Ross Anderson. The Economics of Online Crime. The Journal of Economic Perspectives, 23(3):3–20, August 2009.

[80] Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D. Joseph, Benjamin I. P. Rubinstein, Udam Saini, Charles Sutton, J. D. Tygar, and Kai Xia. Exploiting ma- chine learning to subvert your spam filter. In Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats 2008, pages 7:1–7:9, Berkeley, CA, USA, 2008. USENIX Association.

[81] J. Neyman. Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767):333–380, 1937.

[82] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren BarAd, and Yuval Elovici. Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework. Security Informatics, 5(1), December 2016.

[83] OpenOffice.org. The Microsoft Compound Document File Format. http://www.openoffice.org/sc/compdocfileformat.pdf.

[84] V. Pappas, M. Polychronakis, and A. D. Keromytis. Smashing the Gadgets: Hindering Return-Oriented Programming Using In-place Code Randomization. In 2012 IEEE Symposium on Security and Privacy (SP), pages 601–615, May 2012.

[85] Mila Parkour. 11,355+ Malicious documents - archive for signature testing and research. http://contagiodump.blogspot.com/2010/08/malicious-documents-archive-for.html, April 2011.

[86] PaX Team. PaX address space layout randomization. http://pax.grsecurity.net/docs/aslr.txt, 2003.

[87] Niels Provos, Moheeb Abu Rajab, and Panayiotis Mavrommatis. Cybercrime 2.0: When the Cloud Turns Dark. Commun. ACM, 52(4):42–47, April 2009.

[88] Moheeb Abu Rajab, Lucas Ballard, Noé Lutz, Panayiotis Mavrommatis, and Niels Provos. CAMP: Content-Agnostic Malware Protection. In NDSS 2013, 2013.

[89] M. Ramilli and M. Bishop. Multi-stage delivery of malware. In 2010 5th International Conference on Malicious and Unwanted Software (MALWARE), pages 91–97, October 2010.

[90] Sami Rautiainen. A look at Portable Document Format vulnerabilities. Inf. Secur. Tech. Rep., 14(1):30–33, February 2009.

[91] Martin Roesch. Snort - Lightweight Intrusion Detection for Networks. In Proceedings of the 13th USENIX Conference on System Administration, pages 229–238, Seattle, Washington, 1999. USENIX Association.

[92] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In AAAI 98 Workshop on Text Categorization, July 1998.

[93] Johannes Schlumberger, Christopher Kruegel, and Giovanni Vigna. Jarhead analysis and detection of malicious Java applets. In Proceedings of the 28th Annual Computer Security Applications Conference, ACSAC '12, pages 249–257, New York, NY, USA, 2012. ACM.

[94] John Scott-Railton, Morgan Marquis-Boire, Claudio Guarnieri, and Marion Marschalek. Packrat: Seven Years of a South American Threat Actor. Technical report, Citizen Lab, University of Toronto, December 2015.

[95] Fermin J. Serna. The Info Leak Era on Software Exploitation. Black Hat USA, 2012.

[96] Hovav Shacham. The Geometry of Innocent Flesh on the Bone: Return-into-libc Without Function Calls (on the x86). In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS '07, pages 552–561, New York, NY, USA, 2007. ACM.

[97] Hovav Shacham, Matthew Page, Ben Pfaff, Eu-Jin Goh, Nagendra Modadugu, and Dan Boneh. On the Effectiveness of Address-space Randomization. In Proceedings of the 11th ACM Conference on Computer and Communications Security, CCS '04, pages 298–307, New York, NY, USA, 2004. ACM.

[98] Glenn Shafer and Vladimir Vovk. A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9:371–421, June 2008.

[99] C. Shannon and D. Moore. The spread of the Witty worm. IEEE Security & Privacy, 2(4):46–50, July 2004.

[100] Charles Smutz and Angelos Stavrou. Malicious PDF Detection Using Metadata and Structural Features. In Proceedings of the 28th Annual Computer Security Applications Conference, ACSAC '12, pages 239–248, New York, NY, USA, 2012. ACM.

[101] Charles Smutz and Angelos Stavrou. Preventing Exploits in Microsoft Office Documents Through Content Randomization. In Herbert Bos, Fabian Monrose, and Gregory Blanc, editors, Research in Attacks, Intrusions, and Defenses, number 9404 in Lecture Notes in Computer Science, pages 225–246. Springer International Publishing, 2015.

[102] Charles Smutz and Angelos Stavrou. When a Tree Falls: Using Diversity in Ensemble Classifiers to Identify Evasion in Malware Detectors. In 23rd Annual Network and Distributed System Security Symposium (NDSS), February 2016.

[103] Robin Sommer and Vern Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS '03, pages 262–271, New York, NY, USA, 2003. ACM.

[104] Robin Sommer and Vern Paxson. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 305–316, May 2010.

[105] Yingbo Song, Michael E. Locasto, Angelos Stavrou, Angelos D. Keromytis, and Salvatore J. Stolfo. On the Infeasibility of Modeling Polymorphic Shellcode. In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS '07, pages 541–551, New York, NY, USA, 2007. ACM.

[106] Nedim Srndic and Pavel Laskov. Detection of malicious PDF files based on hierarchical document structure. In Proceedings of the 20th Annual Network & Distributed System Security Symposium 2013, 2013.

[107] Nedim Srndic and Pavel Laskov. Practical Evasion of a Learning-Based Classifier: A Case Study. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, SP '14, pages 197–211, Washington, DC, USA, 2014. IEEE Computer Society.

[108] Salvatore J. Stolfo, Ke Wang, and Wei-Jen Li. Fileprint analysis for malware detection. ACM CCS WORM, 2005.

[109] B. Stone-Gross, M. Cova, C. Kruegel, and Giovanni Vigna. Peering through the iframe. In 2011 Proceedings IEEE INFOCOM, pages 411–415, April 2011.

[110] Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Shady Paths: Leveraging Surfing Crowds to Detect Malicious Web Pages. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, CCS '13, pages 133–144, New York, NY, USA, 2013. ACM.

[111] L. Szekeres, M. Payer, Tao Wei, and D. Song. SoK: Eternal War in Memory. In 2013 IEEE Symposium on Security and Privacy (SP), pages 48–62, May 2013.

[112] S. Momina Tabish, M. Zubair Shafiq, and Muddassar Farooq. Malware detection using statistical analysis of byte-level file content. In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, CSI-KDD ’09, pages 23–31, New York, NY, USA, 2009. ACM.

[113] K. Thomas and D. M. Nicol. The Koobface botnet and the rise of social malware. In 2010 5th International Conference on Malicious and Unwanted Software (MALWARE), pages 63–70, October 2010.

[114] Zacharias Tzermias, Giorgos Sykiotakis, Michalis Polychronakis, and Evangelos P. Markatos. Combining static and dynamic analysis for the detection of malicious documents. In Proceedings of the Fourth European Workshop on System Security, EUROSEC ’11, pages 4:1–4:6, New York, NY, USA, 2011. ACM.

[115] Nart Villeneuve, Greg Walton, SecDev Group, and Citizen Lab, Munk Centre for International Studies. Tracking GhostNet: Investigating a Cyber Espionage Network. http://www.infowar-monitor.net/2009/09/tracking-ghostnet-investigating-a-cyber-espionage-network/, 2009.

[116] B. Waske, S. van der Linden, J. A. Benediktsson, A. Rabe, and P. Hostert. Sensitivity of Support Vector Machines to Random Feature Selection in Classification of Hyperspectral Data. IEEE Transactions on Geoscience and Remote Sensing, 48:2880–2889, July 2010.

[117] Tao Wei, Tielei Wang, Lei Duan, and Jing Luo. Secure Dynamic Code Generation Against Spraying. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS ’10, pages 738–740, New York, NY, USA, 2010. ACM.

[118] R. Wojtczuk. The advanced return-into-lib(c) exploits: PaX case study. Phrack Magazine, Volume 11, Issue 58, 2001.

[119] Christian Wressnegger, Frank Boldewin, and Konrad Rieck. Deobfuscating Embedded Malware Using Probable-Plaintext Attacks. In Salvatore J. Stolfo, Angelos Stavrou, and Charles V. Wright, editors, Research in Attacks, Intrusions, and Defenses, number 8145 in Lecture Notes in Computer Science, pages 164–183. Springer Berlin Heidelberg, January 2013.

[120] Weilin Xu, Yanjun Qi, and David Evans. Automatically Evading Classifiers: A Case Study on PDF Malware Classifiers. In 23rd Annual Network and Distributed System Security Symposium (NDSS), February 2016.

[121] Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao. SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging. Journal in Computer Virology, 5:283–293, November 2008.

[122] Chao Zhang, Tao Wei, Zhaofeng Chen, Lei Duan, L. Szekeres, S. McCamant, D. Song, and Wei Zou. Practical Control Flow Integrity and Randomization for Binary Executables. In 2013 IEEE Symposium on Security and Privacy (SP), pages 559–573, May 2013.

[123] Yajin Zhou, Zhi Wang, Wu Zhou, and Xuxian Jiang. Hey, You, Get Off of My Market: Detecting Malicious Apps in Official and Alternative Android Markets. In 19th Annual Network and Distributed System Security Symposium (NDSS), February 2012.

[124] Z. Zhu, G. Lu, Y. Chen, Z. J. Fu, P. Roberts, and K. Han. Botnet Research Survey. In 2008 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC '08), pages 967–972, July 2008.

Curriculum Vitae

Charles Smutz was born in Pendleton, Oregon in 1982. He graduated from Brigham Young University in 2006 with a bachelor's degree in Information Technology and minors in Physics and Business. Since 2006, Charles has been employed at Lockheed Martin, focusing primarily on incident response. He began studies at George Mason University in 2006, fulfilling the requirements for a master's degree in Information Security and Assurance in 2009. He completed his doctorate in 2016.
