Improving the Effectiveness and Efficiency Of

IMPROVING THE EFFECTIVENESS AND EFFICIENCY OF DYNAMIC MALWARE ANALYSIS USING MACHINE LEARNING

by

Leonardo De La Rosa

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Financial Services Analytics

Summer 2018

© 2018 Leonardo De La Rosa
All Rights Reserved

Approved: Bintong Chen, Ph.D.
Chair of the Department of Financial Services Analytics

Approved: Bruce Weber, Ph.D.
Dean of the College of Business and Economics

Approved: Douglas J. Doren, Ph.D.
Interim Vice Provost for Graduate and Professional Education

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.
Signed: John Cavazos, Ph.D.
Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.
Signed: Adam Fleischhacker, Ph.D.
Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.
Signed: Starnes Walker, Ph.D.
Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.
Signed: Michael Silas, Ph.D.
Member of dissertation committee

DEDICATION

I would like to dedicate this dissertation to my Mom, my Angel.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION
  1.1 My Thesis
  1.2 Problem Statement
    1.2.1 Need to Quickly Detect Malware in Real Time
  1.3 Organization

2 BACKGROUND AND LITERATURE REVIEW
  2.1 Introduction
  2.2 Malware Overview
  2.3 Signature-based Techniques
  2.4 Heuristics-based Approaches
  2.5 Static Analysis
    2.5.1 Drawbacks of Static Analysis
    2.5.2 Static Methods for Binary Characterization
      2.5.2.1 Bytes Analysis
      2.5.2.2 Hashes Histograms
      2.5.2.3 Disassembly Analysis
      2.5.2.4 Graph-based Features
  2.6 Dynamic Analysis
    2.6.1 Drawbacks of Dynamic Analysis
    2.6.2 Cuckoo Sandbox
      2.6.2.1 Architecture
      2.6.2.2 Processing Modules
      2.6.2.3 Repository of Malicious Behaviors
      2.6.2.4 Maliciousness Scale
  2.7 Machine Learning for Cyber Security
    2.7.1 Using Static Analysis Features
    2.7.2 Using Dynamic Analysis Features
    2.7.3 Using Hybrid Features
  2.8 Discussion

3 PROPOSED APPROACH
  3.1 TURACO: Training Using Runtime Analysis from Cuckoo Outputs
  3.2 SEEMA: Selecting the Most Efficient and Effective Malware Attributes
  3.3 MAGIC: Malware Analysis to Generate Important Capabilities
  3.4 Summary of Contributions

4 TURACO: TRAINING USING RUNTIME ANALYSIS FROM CUCKOO OUTPUTS
  4.1 Introduction
  4.2 Problem Statement
    4.2.1 Dynamic Analysis Time
  4.3 Approach
    4.3.1 Features
      4.3.1.1 Byte Features
      4.3.1.2 Hashes Histograms
      4.3.1.3 Target Labels
      4.3.1.4 Threat Level
      4.3.1.5 Feature Engineering
    4.3.2 Dataset
  4.4 Learning Methodology
  4.5 Experimental Infrastructure
  4.6 Experimental Results
    4.6.1 Accuracy Results
    4.6.2 Recall and Precision Results
  4.7 Discussion
    4.7.1 Efficiency of TURACO
  4.8 Related Work
  4.9 Conclusions

5 SEEMA: SELECTING THE MOST EFFICIENT AND EFFECTIVE MALWARE ATTRIBUTES
  5.1 Introduction
  5.2 Problem Statement
  5.3 Approach
    5.3.1 Features
      5.3.1.1 Byte Features
      5.3.1.2 Graph-based Features
      5.3.1.3 Dynamic Features
      5.3.1.4 Feature Engineering
    5.3.2 Dataset
  5.4 Learning Methodology
    5.4.1 Malware Family Classifiers
    5.4.2 SEEMA Model
  5.5 Experimental Infrastructure
  5.6 Experimental Results
    5.6.1 Malware Classifier Results
      5.6.1.1 Accuracy Results
      5.6.1.2 Precision and Recall Results
    5.6.2 SEEMA Model Results
  5.7 Discussion
    5.7.1 Efficiency of SEEMA
  5.8 Related Work
    5.8.1 Feature Exploration
      5.8.1.1 Hybrid Characterizations of Malware
  5.9 Conclusions

6 MAGIC: MALWARE ANALYSIS TO GENERATE IMPORTANT CAPABILITIES
  6.1 Introduction
  6.2 Problem Statement
  6.3 Approach
    6.3.1 Features
      6.3.1.1 Instruction-Based Features
      6.3.1.2 Global Instruction Features
        6.3.1.2.1 Global Instruction histograms
        6.3.1.2.2 Global Instruction bit vectors
      6.3.1.3 Individual Node Features
        6.3.1.3.1 Random Walk
        6.3.1.3.2 Random Walk bit vector
      6.3.1.4 Dynamic Characterization
      6.3.1.5 Cuckoo Sandbox
      6.3.1.6 Malware Attribute Enumeration and Characterization
      6.3.1.7 A1000 Cloud Analysis System
  6.4 Learning Methodology
  6.5 Experimental Infrastructure
    6.5.1 Malware Family Distribution
    6.5.2 Cuckoo/MAEC Capability Distribution
    6.5.3 A1000 Indicators of Interest Distribution
  6.6 Experimental Results
    6.6.1 Decision Trees
      6.6.1.1 Decision Tree Results
  6.7 Discussion
    6.7.1 Decision Tree Model
  6.8 Related Work
  6.9 Conclusions

7 CONCLUSIONS

8 FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

4.1 Reported time for dynamic malware analysis
4.2 Assignment of labels for malware dataset
4.3 Training Dataset for TURACO model
4.4 Confusion Matrix for TURACO Model
4.5 Individual recall and precision results for TURACO model
5.1 Composition of dataset for malware family classifiers
5.2 Construction of input dataset for SEEMA model
5.3 Input dataset for SEEMA model
5.4 Confusion matrix for SEEMA model
5.5 Precision and recall results for SEEMA model
5.6 Output from logistic regression model
6.1 Categories of instructions extracted by Radare2
6.2 Cuckoo/MAEC Capabilities
6.3 A1000 Indicators of Interest (IOI)
6.4 Training instance for MAGIC model
6.5 Distribution of Families for our Malware Datasets
6.6 Dataset distribution for Cuckoo/MAEC capabilities

LIST OF FIGURES

2.1 Overall development of malware in the last decade
2.2 Extraction of bytes-entropy histograms
2.3 Feature extraction
