IMPROVING THE EFFECTIVENESS AND EFFICIENCY OF

DYNAMIC MALWARE ANALYSIS USING MACHINE LEARNING

by

Leonardo De La Rosa

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Financial Services Analytics

Summer 2018

© 2018 Leonardo De La Rosa
All Rights Reserved

IMPROVING THE EFFECTIVENESS AND EFFICIENCY OF

DYNAMIC MALWARE ANALYSIS USING MACHINE LEARNING

by

Leonardo De La Rosa

Approved: Bintong Chen, Ph.D. Chair of the Department of Financial Services Analytics

Approved: Bruce Weber, Ph.D. Dean of the College of Business and Economics

Approved: Douglas J. Doren, Ph.D. Interim Vice Provost for Graduate and Professional Education

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: John Cavazos, Ph.D. Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Adam Fleischhacker, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Starnes Walker, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Michael Silas, Ph.D. Member of dissertation committee

DEDICATION

I would like to dedicate this dissertation to my Mom, my Angel.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION
    1.1 My Thesis
    1.2 Problem Statement
        1.2.1 Need to Quickly Detect Malware in Real Time
    1.3 Organization

2 BACKGROUND AND LITERATURE REVIEW
    2.1 Introduction
    2.2 Malware Overview
    2.3 Signature-based Techniques
    2.4 Heuristics-based Approaches
    2.5 Static Analysis
        2.5.1 Drawbacks of Static Analysis
        2.5.2 Static Methods for Binary Characterization
            2.5.2.1 Bytes Analysis
            2.5.2.2 Hashes Histograms
            2.5.2.3 Disassembly Analysis
            2.5.2.4 Graph-based Features
    2.6 Dynamic Analysis
        2.6.1 Drawbacks of Dynamic Analysis
        2.6.2 Cuckoo Sandbox
            2.6.2.1 Architecture
            2.6.2.2 Processing Modules
            2.6.2.3 Repository of Malicious Behaviors
            2.6.2.4 Maliciousness Scale
    2.7 Machine Learning for Cyber Security
        2.7.1 Using Static Analysis Features
        2.7.2 Using Dynamic Analysis Features
        2.7.3 Using Hybrid Features
    2.8 Discussion

3 PROPOSED APPROACH
    3.1 TURACO: Training Using Runtime Analysis from Cuckoo Outputs
    3.2 SEEMA: Selecting the Most Efficient and Effective Malware Attributes
    3.3 MAGIC: Malware Analysis to Generate Important Capabilities
    3.4 Summary of Contributions

4 TURACO: TRAINING USING RUNTIME ANALYSIS FROM CUCKOO OUTPUTS
    4.1 Introduction
    4.2 Problem Statement
        4.2.1 Dynamic Analysis Time
    4.3 Approach
        4.3.1 Features
            4.3.1.1 Byte Features
            4.3.1.2 Hashes Histograms
            4.3.1.3 Target Labels
            4.3.1.4 Threat Level
            4.3.1.5 Feature Engineering
        4.3.2 Dataset
    4.4 Learning Methodology
    4.5 Experimental Infrastructure
    4.6 Experimental Results
        4.6.1 Accuracy Results
        4.6.2 Recall and Precision Results
    4.7 Discussion
        4.7.1 Efficiency of TURACO
    4.8 Related Work
    4.9 Conclusions

5 SEEMA: SELECTING THE MOST EFFICIENT AND EFFECTIVE MALWARE ATTRIBUTES
    5.1 Introduction
    5.2 Problem Statement
    5.3 Approach
        5.3.1 Features
            5.3.1.1 Byte Features
            5.3.1.2 Graph-based Features
            5.3.1.3 Dynamic Features
            5.3.1.4 Feature Engineering
        5.3.2 Dataset
    5.4 Learning Methodology
        5.4.1 Malware Family Classifiers
        5.4.2 SEEMA Model
    5.5 Experimental Infrastructure
    5.6 Experimental Results
        5.6.1 Malware Classifier Results
            5.6.1.1 Accuracy Results
            5.6.1.2 Precision and Recall Results
        5.6.2 SEEMA Model Results
    5.7 Discussion
        5.7.1 Efficiency of SEEMA
    5.8 Related Work
        5.8.1 Feature Exploration
            5.8.1.1 Hybrid Characterizations of Malware
    5.9 Conclusions

6 MAGIC: MALWARE ANALYSIS TO GENERATE IMPORTANT CAPABILITIES
    6.1 Introduction
    6.2 Problem Statement
    6.3 Approach
        6.3.1 Features
            6.3.1.1 Instruction-Based Features
            6.3.1.2 Global Instruction Features
                6.3.1.2.1 Global Instruction Histograms
                6.3.1.2.2 Global Instruction Bit Vectors
            6.3.1.3 Individual Node Features
                6.3.1.3.1 Random Walk
                6.3.1.3.2 Random Walk Bit Vector
            6.3.1.4 Dynamic Characterization
            6.3.1.5 Cuckoo Sandbox
            6.3.1.6 Malware Attribute Enumeration and Characterization
            6.3.1.7 A1000 Cloud Analysis System
    6.4 Learning Methodology
    6.5 Experimental Infrastructure
        6.5.1 Malware Family Distribution
        6.5.2 Cuckoo/MAEC Capability Distribution
        6.5.3 A1000 Indicators of Interest Distribution
    6.6 Experimental Results
        6.6.1 Decision Trees
            6.6.1.1 Decision Tree Results
    6.7 Discussion
        6.7.1 Decision Tree Model
    6.8 Related Work
    6.9 Conclusions

7 CONCLUSIONS

8 FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

4.1 Reported time for dynamic malware analysis
4.2 Assignment of labels for malware dataset
4.3 Training dataset for TURACO model
4.4 Confusion matrix for TURACO model
4.5 Individual recall and precision results for TURACO model
5.1 Composition of dataset for malware family classifiers
5.2 Construction of input dataset for SEEMA model
5.3 Input dataset for SEEMA model
5.4 Confusion matrix for SEEMA model
5.5 Precision and recall results for SEEMA model
5.6 Output from logistic regression model
6.1 Categories of instructions extracted by Radare2
6.2 Cuckoo/MAEC capabilities
6.3 A1000 Indicators of Interest (IOI)
6.4 Training instance for MAGIC model
6.5 Distribution of families for our malware datasets
6.6 Dataset distribution for Cuckoo/MAEC capabilities

LIST OF FIGURES

2.1 Overall development of malware in the last decade
2.2 Extraction of bytes-entropy histograms
2.3 Feature extraction from a malware file's strings
2.4 Disassembly process
2.5 Cuckoo's architecture
4.1 Threat level score extracted from Cuckoo
4.2 Training of TURACO model
4.3 Distribution of run time labels
4.4 Cumulative histogram of run time labels
4.5 Accuracy, precision, and recall results for TURACO model
4.6 Efficiency of TURACO model
5.1 Cost of three characterizations of malware
5.2 SEEMA model
5.3 API calls extracted from Cuckoo reports
5.4 Accuracy, precision, and recall results for malware classifiers and SEEMA model
5.5 Individual accuracy results for malware classification models
5.6 Distribution of features for malware classification
5.7 Precision results for malware classifiers
5.8 Recall results for malware classifiers
5.9 Composition of dataset for SEEMA model
5.10 Efficiency of SEEMA model
6.1 Extraction of instruction histograms
6.2 Random walks
6.3 Pipeline for MAGIC model
6.4 Accuracy results for the DT model evaluated using the A1000 IOIs
6.5 Accuracy results for the DT model evaluated using the Cuckoo/MAEC capabilities
6.6 Decision tree for malware capabilities

ABSTRACT

The malware threat landscape is constantly evolving, with upwards of one million new variants being released every day. Traditional approaches for detecting and classifying malware usually contain brittle handcrafted heuristics that quickly become outdated and can be exploited by nefarious actors. As a result, it is necessary to change the way software security is managed by using advanced analytics (i.e., machine learning) and significantly more automation to develop adaptable malware analysis engines that correctly identify, categorize, and characterize malware.

In this dissertation, we introduce a next-generation sandbox that leverages machine learning to create an adaptive malware analysis platform. This intelligent environment considerably extends the capabilities of Cuckoo, an open-source malware analysis sandbox, and significantly optimizes the resources dedicated to the dynamic analysis of malware.

Dynamic analysis allows security analysts to collect information about the behavior of malicious samples in an isolated environment. However, running malware in a sandbox is time-consuming and computationally expensive. Static analysis, by contrast, extracts information from malware without executing it and is orders of magnitude faster than dynamic analysis. Nevertheless, for some malware it may still be necessary to use dynamic-based features to produce better classifications and characterizations.

With our system, we were successful in identifying the simplest characterizations required to accurately classify malware. This is an important feature because it allows us to determine the subset of samples that is truly different and requires very expensive dynamic characterization. When dynamic analysis is imperative, our system also estimates the minimum amount of time required to accurately detect and classify malware. As a result, our intelligent analysis platform can reallocate the time saved to analyzing files that require longer execution times and produce actionable intelligence for our system. Finally, by leveraging the speed of static analysis, our system induces highly accurate machine learning models for malware capability detection, removing the need to perform dynamic analysis to identify high-level functionalities of malicious code.

Chapter 1

INTRODUCTION

1.1 My Thesis

The central thesis of this dissertation is: we can build an adaptive malware analysis system that leverages machine learning and 1) predicts the length of time needed to execute malware in an isolated environment, 2) learns when dynamic behavior improves the accuracy of malware classification beyond performing static analysis, and 3) automatically identifies the important capabilities of malware.

1.2 Problem Statement

Malicious software, or malware, is a type of computer program designed to cause harm to a user's computer, mobile phone, website or data [Lemonnier, 2015]. Depending on its propagation method and the goal for which it is created, malware can be classified as viruses, worms, trojans, spyware, and ransomware, among other categories [Christodorescu and Jha, 2003]. Malware has evolved from small malicious applications, whose main intent was to expose security vulnerabilities and flaunt the technical ability of its authors [Egele et al., 2012], to complex and sophisticated programs whose goal is to profit from forced advertising (adware), to steal private information (spyware) or to extort money (ransomware) [Symantec Corporation, 2017a]. Furthermore, malware authors have embraced automation, and it has been estimated that over one million new malware are released to the public every day [Harrison and Pagliery, 2015a].

Numerous techniques have been developed over the last three decades to fight the increasing threat of malware. One such technique is a signature-based approach that assigns a signature or unique identifier (i.e., a hash) to each malware sample and attempts to identify the presence of malware on a system by comparing these hashes against a repository of signatures of known malware [Mujumdar et al., 2013]. Nevertheless, this tool fails at detecting new malware variants for which a signature has not yet been generated [Pandey and Mehtre, 2014]. Heuristic-based tools use rules to examine suspicious files and classify them as malware [Cobb, 2016]. This approach is limited, however, because it relies on the appearance of repeated code that is indicative of malicious intent [Cade, 2017].

Static analysis has the ability to extract code from a malicious file without executing the malware binary [Egele et al., 2012]. In addition, static techniques observe the entire structure of malware and characterize all possible execution paths of a malicious sample [Nath and Mehtre, 2014]. Yet, static analysis is hindered by obfuscation techniques such as packing (i.e., programs that transform a malware binary into another form without affecting its execution semantics [Guo et al., 2008]) and encryption, which are designed to hamper the disassembly process of malicious code.

On the other hand, dynamic analysis has emerged as a state-of-the-art approach for malware detection and classification that overcomes the shortcomings of the previously mentioned techniques [Hu et al., 2013]. In this type of analysis, a malicious file is executed in an isolated chamber (or sandbox) and its behavior is monitored and reported for additional examination [Cuckoo, 2017]. The main objective of this type of analysis is to collect information about the behavior of the monitored samples, which can later be used to identify new malicious files [Vadrevu and Perdisci, 2016]. Sandboxes can inspect and record the behavior of any file, including Windows executables, Portable Document Format (PDF) files, Java applets, Microsoft Office documents, mobile phone apps, and malicious websites [Cuckoo, 2017].
In addition, by executing malicious software in a sandbox, it is possible to construct behavioral profiles that describe the high-level functionalities or capabilities of malware, such as the ability to detect the presence of an analysis environment or capture keystrokes [Saxe et al., 2014]. Sandboxes have become an excellent cybersecurity tool in the security analyst's arsenal and have gained in popularity over the last decade [Oktavianto and Muhardianto, 2013], [Blasing et al., 2010]. The growing number of new malware being released daily requires high-performance analysis systems that can examine as many malicious files as possible within a short timeframe [Vadrevu and Perdisci, 2016].

Nevertheless, traditional sandbox environments contain many handcrafted heuristics that are inherently brittle, as a small change or mutation to the code of the malware can render them useless [Cade, 2017]. Furthermore, bad actors have developed anti-sandbox malware that will not execute any malicious activity if it detects that it is running in a sandbox environment [Kirat et al., 2011]. By exploiting the side effects imposed by virtualized systems, malware can inspect the names of the available devices, registry keys, background processes, and IP addresses to detect the presence of the sandbox and change its behavior to circumvent the analysis [Issa, 2012].

Additionally, running malware in a sandboxed environment can be computationally expensive compared to static analysis, and often there is no additional benefit to executing a suspicious file in a sandbox beyond performing fast static analysis of malicious code [Kilgallon et al., 2017].

1.2.1 Need to Quickly Detect Malware in Real Time

Dynamic analysis can be a bottleneck for analyzing malware and deducing its important capabilities [Rhode et al., 2018]. In traditional sandbox environments, malware is examined for a fixed amount of time (e.g., one minute [Kirat et al., 2011], two minutes [Willems et al., 2007], three minutes [Hansen et al., 2016], four minutes [Christodorescu et al., 2008], or even five minutes [Damodaran et al., 2017], [Tobiyama et al., 2016]). Dynamic analysis is performed for long periods of time regardless of the information extracted from the file in question, or until the execution of the malware ends [Bayer et al., 2010].

However, a significant number of the malware samples that are submitted for analysis are repackaged versions of previously seen and analyzed malware [Vadrevu and Perdisci, 2016]. In addition, the overall run time might be longer, because numerous systems allow a post-processing phase following the analysis of the binary sample [Bayer et al., 2010]. This results in a large amount of time being wasted in studying files that do not improve the intelligence of the analysis system.

Moreover, running malware in a sandbox for long periods of time leads to predictive models that cannot scale to millions of malware samples [Bierma et al., 2014]. As a result, dynamic analysis cannot be efficiently used to construct behavioral profiles and deduce the important capabilities of extremely large malware datasets [Storlie et al., 2014].

Given the increasing stream of new malware that is generated every day, it is necessary to develop techniques that quickly detect malware in real time [Harrison and Pagliery, 2015a]. By constructing models that predict the shortest amount of dynamic analysis time required by each malware sample, it is possible to reduce the resources spent executing files whose harmful nature can be revealed in seconds [Vadrevu and Perdisci, 2016]. The time saved can then be reallocated to running binaries that demand longer periods of time and produce actionable intelligence for the analysis system [Bayer et al., 2010].

In this dissertation, we introduce an intelligent malware analysis system that leverages machine learning to detect and classify new malware in a short amount of time and deduce its important capabilities. It also optimizes the resources dedicated to the dynamic analysis of malware. Additionally, our system identifies the simplest characterizations required to accurately classify malware. This allows us to determine the subset of samples that is truly different and requires very expensive dynamic characterization. When dynamic analysis is necessary, our system predicts the shortest amount of time required to accurately detect and classify malware. This is a crucial quality because the traditional process of dynamic analysis can be a bottleneck for studying malware [Rhode et al., 2018]. It is therefore essential that our malware analysis system spends the least amount of time analyzing malware whose malicious nature can be detected in seconds [Rhode et al., 2018]. Finally, we leverage the speed of static analysis features [LMSecurity, 2017] to induce highly accurate machine learning models for malware capability detection.
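The idea of predicting a per-sample analysis budget can be made concrete with a deliberately tiny sketch. This is not the TURACO model developed later in this dissertation; the feature (a coarse 4-bin byte histogram), the training pairs, and the time labels below are all invented purely to show the shape of the approach: cheap static features in, a dynamic-analysis time budget out.

```python
import math

def histogram4(data: bytes):
    """Toy static feature: a 4-bin normalized byte-value histogram."""
    bins = [0, 0, 0, 0]
    for b in data:
        bins[b // 64] += 1
    n = len(data) or 1
    return [c / n for c in bins]

# Invented training data: (static feature vector, seconds of sandbox time
# that were needed before the sample revealed its nature).
TRAINING = [
    (histogram4(b"\x00" * 40 + b"ABCD"), 30),   # mostly padding: 30 s sufficed
    (histogram4(bytes(range(256))), 300),       # high byte diversity: 300 s needed
]

def predict_analysis_seconds(sample: bytes) -> int:
    """1-nearest-neighbor over histograms: reuse the time budget of the
    most similar previously analyzed sample."""
    h = histogram4(sample)
    def dist(pair):
        hist, _seconds = pair
        return math.dist(hist, h)
    return min(TRAINING, key=dist)[1]

print(predict_analysis_seconds(b"\x00" * 100 + b"XYZ"))  # closest to the padded sample
```

A real system would replace the toy histogram with the richer static characterizations of Chapter 2 and the two hand-written rows with a large labeled corpus, but the mechanics are the same: similarity in cheap feature space stands in for similarity in expensive dynamic behavior.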

1.3 Organization

We organize the remainder of this dissertation as follows. Chapter 2 provides background information. In the first part, we introduce a historical overview of malicious software and malware analysis techniques. Specifically, we cover signature-based approaches, heuristic-based tools, static and dynamic analysis, as well as the software and techniques that we use to extract dynamic and static features from our malware dataset. These features are used as input to our machine learning algorithms that predict important capabilities of malware, the time needed to execute malicious samples in the sandbox environment, and the simplest and least expensive characterizations that lead to accurate malware family classification models. Additionally, we include a literature review of machine learning for cyber security in order to understand the advances that have been made in the field of malware detection and classification. Finally, we conclude the chapter with a discussion of the limitations of existing techniques that leverage machine learning for malware detection and classification, as well as the approaches with which we move the state of the art forward in dynamic malware analysis and sandbox technology.

Chapter 3 introduces the three machine learning-based approaches proposed in this dissertation. Chapter 4 introduces TURACO, a model used to estimate the approximate amount of time needed to perform dynamic analysis on a file in order to detect its malicious intent. Chapter 5 presents SEEMA, a model that determines when running malware in a sandbox provides additional benefits beyond performing quick and effective static analysis. Chapter 6 describes MAGIC, a model that predicts important capabilities of malware. Chapter 7 concludes this dissertation, and Chapter 8 suggests directions and implications for future work.

Chapter 2

BACKGROUND AND LITERATURE REVIEW

2.1 Introduction

In this section, we introduce the background necessary to understand the research described in this dissertation. First, we present a historical overview of malicious software and the techniques that have been proposed to detect, classify, and predict important capabilities of malware. Specifically, we describe signature-based approaches, heuristic-based tools, and static and dynamic analysis. In addition, we introduce Cuckoo Sandbox, the malware analysis system used to extract behavioral information from our malware dataset, and Radare2, an open-source reverse engineering framework that can disassemble the executable code of malware binaries. This is followed by a literature review of machine learning for cyber security that discusses the advances that have been made in the field of malware analysis. The chapter is then concluded with a discussion of the limitations of current approaches for malware detection and classification, as well as the techniques with which we move the state of the art forward in dynamic malware analysis and sandbox technology.

2.2 Malware Overview

A computer program designed to cause harm to a user's computer, website or information is commonly referred to as malicious software, or malware [Moser et al., 2007]. Malware can infect computers and devices in numerous ways and comes in a variety of forms. Depending on its propagation method and the intent for which it is created [McGraw and Morrisett, 2000], malicious code is usually classified in the following categories:

1. Viruses: correspond to programs that self-replicate throughout a computer or network. Viruses attach themselves to existing programs and documents and can corrupt or delete data, reformat hard disks, and use up computer memory [Paloalto Networks, 2016].

2. Worms: are self-replicating viruses that spread across computers and networks. Unlike many viruses, worms do not corrupt files or attach themselves to existing programs. Due to their self-replicating nature, they can consume significant system resources, taking up a lot of space on the hard drive and exhausting network bandwidth [Swain, 2009].

3. Trojans: unlike viruses and worms, trojans do not self-replicate [Paloalto Networks, 2016]. They disguise themselves as genuine files or software to create a backdoor in a computer or network, allowing personal and confidential information to be stolen [Swain, 2009].

4. Ransomware: is a type of malware that can lock down a machine, encrypt its data, and threaten to erase its contents unless a ransom is paid [Lemonnier, 2015].

5. Spyware: refers to a type of malware that gathers information about the usage and browsing habits of the infected machine and sends it back to the attacker. This occurs with or without the permission of the user and includes malicious programs such as botnets, adware, keyloggers, and backdoors, among others [Paloalto Networks, 2016].

Historically, viruses were the first instances of malicious code to appear [Landesman, 2017]. The first computer virus was named Elk Cloner and was found on a Mac in 1982 [GData Internet Security, 2017]. The propagation of most early viruses was largely due to infected floppy disks used across college campuses. In addition, the intent of the authors of such early malware was to expose security vulnerabilities and flaunt technical ability [Egele et al., 2012]. By the mid-1990s, businesses were facing significant threats from the appearance of macro viruses (i.e., a type of virus that infects Word, Excel and PowerPoint files [Swain, 2009]) [Landesman, 2017]. By the end of the twentieth century, home users were also having severe problems with viruses, as their propagation by email became commonplace [Radware, 2017]. Furthermore, the start of the new millennium saw the rise of internet and email worms, as broadband internet adoption increased [Khanse, 2014].

As time passed, the motivations that drove cybercriminals to create malware changed radically, and now there is a thriving underground economy based on malware [Moser et al., 2007]. It is no longer the mischief factor or vandalism that motivates malware authors, but the financial gain that can be made from the target user or system [Symantec Corporation, 2017b]. Today, a significant number of malware is developed to profit from forced advertising (adware), to compromise private information (spyware), or to extort money (ransomware) [Symantec Corporation, 2017a].

Furthermore, it has been estimated that upwards of one million malware variants are released every day [Harrison and Pagliery, 2015a]. The alarming rate at which these malware are created is due to the widespread adoption of automated tools, which enable the construction of hundreds of malware variants with a few clicks. As reported by AV-Test, and seen in Figure 2.1, the number of unique malicious samples released to the public has continued to grow exponentially since 2006 [AV-Test, 2016]. Despite a declining trend in 2016, the amount of new malicious software released to the public still represented the third-highest volume of newly developed malware since AV-TEST Systems first began recording malware data [AV-Test, 2016].

To fight this onslaught of malware, numerous techniques have been proposed over the last three decades. When malicious software first began to appear, the analysis of malware was done manually, which was still a viable option at the time [Landesman, 2017]. Antivirus vendors were successful at detecting and stopping malware attacks by simply identifying every new piece of malware using a unique hash signature (i.e., segments of code that act as unique identifiers for malicious programs) [Ye et al., 2007].
However, signature-based approaches could only recognize previously seen malware variants and failed to detect zero-day malware, i.e., malware for which signatures had not been generated [Pandey and Mehtre, 2014]. As a result, better methods to characterize and classify malware were required. As malware continued to evolve, more sophisticated technologies such as heuristic analyzers (i.e., techniques that use rules to detect malicious files) [Cobb, 2016], as well as dynamic analysis (i.e., methods that execute and monitor a suspicious file in an emulated environment) [Cuckoo, 2017] and static analysis (i.e., tools that extract a malware's code without executing the program) [Egele et al., 2012], were proposed to overcome the limitations of signature-based approaches [Shevchenko, 2007]. Unfortunately, these methods have weaknesses and often cannot keep up with the massive amounts of new malware variants being released daily.

[Figure 2.1 appeared here: a bar chart of total new malware released per year, in millions, for 2006-2016.]

Figure 2.1: This figure shows the exponential growth of unique malware released to the public over the last decade. Despite a declining trend in 2016, the amount of new malicious software released to the public still represented the third-highest volume of newly developed malware since AV-TEST Systems first began recording malware data.

In the following sections, we introduce and explain the malware analysis techniques that have been employed over the last few decades to analyze and detect malware. Particularly, we focus on static and dynamic analysis, as these methods allow us to extract meaningful characterizations from our malware files and generate features that are used as input to our machine learning models.

2.3 Signature-based Techniques

Signature-based techniques are one of the most common approaches for malware detection. A signature can represent a sequence of bytes that is unique to a program [Ye et al., 2007], as well as a cryptographic hash of a file and its contents [Zeltser, 2001]. Antivirus tools attempt to identify the presence of malware by comparing hashes and bytecode patterns of files on a system against a repository of signatures of known malware [Mujumdar et al., 2013]. When a match is found, the file is flagged as malicious and a mitigation process is initiated. While successful in detecting known malware, signature-based techniques fail to detect zero-day malware [Pandey and Mehtre, 2014], or even polymorphic malware, which are variations of known malware [Tang et al., 2011]. Furthermore, metamorphic obfuscation techniques such as insertion of dead code, register reassignment, and instruction substitution are frequently used by malware authors to change their files' signatures while retaining their malicious functionality [Christodorescu and Jha, 2003].
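The hash-matching step described above reduces to a set-membership test on cryptographic digests. In the sketch below, the "repository" is a stand-in seeded with one invented sample; the example also demonstrates the weakness just discussed, since a single-byte mutation yields a completely different digest and evades the signature.

```python
import hashlib

def sha256_of_bytes(data: bytes) -> str:
    """A file's signature: the hex SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()

# Stand-in signature repository: in practice, digests of known malware.
known_sample = b"...payload of a previously analyzed malware sample..."
signature_db = {sha256_of_bytes(known_sample)}

def is_known_malware(data: bytes) -> bool:
    """Flag a file as malicious iff its digest is already in the repository."""
    return sha256_of_bytes(data) in signature_db

print(is_known_malware(known_sample))            # exact copy: detected
print(is_known_malware(known_sample + b"\x90"))  # one-byte variant: missed
```

The second call is exactly the polymorphic-malware problem: the repackaged variant is functionally identical but produces an unseen hash.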

2.4 Heuristics-based Approaches

In contrast to signature-based approaches, heuristics-based techniques use rules to examine suspicious files and detect malicious intent embedded in the running program without needing a signature [Cobb, 2016]. Malware can be detected by looking for the presence of junk code in the examined file [Yin et al., 2007], modifications to the registry, and insertion of hooks into a specific library or system interfaces [Zeltser, 2001]. However, this approach is limited to the appearance of repeated fragments of code that are indicative of malicious intent [Cade, 2017]. As a result, heuristic rules are inherently brittle, as a small change or mutation to the code of the malware can render them useless. Moreover, malware can evade detection if a proper rule has not been constructed to identify a particular set of malicious actions [Cade, 2017]. Finally, heuristic-based approaches can incur a high number of false positives, as they can inadvertently flag a legitimate file as malicious. Since these techniques monitor potentially malicious actions, such as registry modifications, a program displaying a similar behavior may be deemed malicious despite the application possibly being an important system utility [Yin et al., 2007].
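A minimal rule-based scanner illustrates the idea. The byte patterns (real Windows API names that malware commonly abuses) and the two-rule threshold are invented for this example and are not taken from any real antivirus product.

```python
# Hypothetical heuristic rules: byte patterns treated as weak evidence of
# malicious intent when found inside a binary.
HEURISTIC_RULES = {
    b"SetWindowsHookEx": "installs a system-wide hook (possible keylogger)",
    b"RegSetValueEx": "modifies the registry",
    b"CreateRemoteThread": "injects code into another process",
}
SCORE_THRESHOLD = 2  # flag a file once at least two rules fire

def heuristic_scan(data: bytes):
    """Return (flagged, reasons): brittle substring matching over raw bytes."""
    reasons = [why for pattern, why in HEURISTIC_RULES.items() if pattern in data]
    return len(reasons) >= SCORE_THRESHOLD, reasons

sample = b"\x00MZ...CreateRemoteThread\x00...SetWindowsHookExA\x00"
flagged, reasons = heuristic_scan(sample)
print(flagged, reasons)
```

Note how this encodes both weaknesses from the text: trivially renaming or encrypting the strings defeats every rule, while a legitimate system utility that imports the same APIs would be flagged as a false positive.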

2.5 Static Analysis

Static analysis corresponds to one of the first approaches for malware analysis and detection. It was first introduced by Lo et al. in 1995 [Lo et al., 1995] and is done by extracting a malware's code without executing the actual program [Egele et al., 2012]. Automated tools, such as portable reverse engineering frameworks, can help extract static features from a malware binary in a very accurate and effective manner, providing valuable insights about the intent of the suspicious file [Oktavianto and Muhardianto, 2013]. Other tools, such as call graphs, provide a security analyst with a general overview of the functions that might be called from specific parts in the code of the malware [Shanks, 2014]. In contrast to dynamic analysis, which can take minutes to hours to perform, static analysis can typically be performed in seconds [LMSecurity, 2017]. Another advantage of static techniques is their unique ability to characterize all possible execution paths of a malicious sample, while dynamic malware analysis is limited to a single sequence of system calls executed by the program [Nath and Mehtre, 2014]. Finally, static techniques benefit from the fact that a significant portion of newly generated malware corresponds to file-level mutations of previously seen samples [Bayer et al., 2010]. As a result, a full dynamic analysis run can be avoided if the malware detection system identifies that an analysis or a report for a similar program already exists.
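One of the simplest static techniques is recovering printable strings from a binary without running it, in the spirit of the classic Unix `strings` utility; embedded URLs, command lines, and API names often reveal intent. The toy "binary" below is fabricated for the example.

```python
import re

def extract_strings(data: bytes, min_len: int = 5):
    """Pull printable ASCII runs out of a binary without executing it,
    mimicking the classic `strings` utility used in static analysis."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[ -~]{%d,}" % min_len, data)]

# A toy "binary": opaque bytes with embedded artifacts an analyst looks for.
blob = (b"\x00\x01MZ\x90"
        + b"http://example.com/payload"
        + b"\xff\xfe"
        + b"cmd.exe /c")
print(extract_strings(blob))  # ['http://example.com/payload', 'cmd.exe /c']
```

As the next subsection explains, packing or encrypting the binary destroys exactly this kind of evidence, which is why static analysis alone is not sufficient.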

2.5.1 Drawbacks of Static Analysis

Static analysis alone is not enough to detect or classify malicious code, because of the ability of certain malware to bypass static methods. Historically, malware has used techniques such as self-modifying code to hamper static analysis and hide its malicious nature [Egele et al., 2012]. In the beginning, malware authors wrote malware in a way that its malicious code was difficult to read without the malware being executed first. Obfuscation techniques such as polymorphism, in which the payload algorithm of the malware is kept constant while the viral code is mutated [Narayanan et al., 2013], and metamorphism, where a malicious file automatically recodes itself each time it propagates [Hosmer, 2008], were designed by bad actors to avoid detection and thwart static malware analysis [Pirscoveanu et al., 2015]. More recently, malware programs have made use of packers (i.e., programs that transform a malware binary into another form, without affecting its execution semantics, to create new malware variants) and encryption to hamper the disassembly of malicious code [Guo et al., 2008]. In fact, according to Yan et al. [Yan et al., 2008], a significant portion of malware today comes in packed form, which encumbers detection even more. Furthermore, the application of static analysis is frequently limited by the fact that the source code of most malicious samples is not readily available [Oktavianto and Muhardianto, 2013]. The static analysis of malware therefore poses complex challenges; for example, disassembly can produce incomplete results when the suspicious file uses obfuscation.

2.5.2 Static Methods for Binary Characterization

There are different static methods to characterize binaries as fixed-size features. In our experiments, we leverage three main categories of features: 1) byte-level representations of malware binaries [Saxe and Berlin, 2015], 2) hashes histograms [InfoSec, 2018], [Saxe and Berlin, 2015], and 3) disassembled instructions provided by Radare2 [Searles et al., 2017], [Radare2, 2018]. These sets of features are fed into our machine learning algorithms to induce classification models, as described in subsequent chapters.

2.5.2.1 Bytes Analysis

Byte-level representations of malicious code correspond to a traditional approach for malware analysis in which each file is parsed as a stream of bytes. For our work, we consider two types of byte features:

1. Bytes Histogram: represents an aggregated count of each byte value in the file [Saxe and Berlin, 2015].

2. Bytes-Entropy Histogram: entropy corresponds to a powerful metric to identify the type of information contained in a file, and measures the randomness in the bytes of the malware [Arora, 2016]. This provides valuable information about malicious files, as a high entropy value is usually indicative of the malware being encrypted [Arora, 2016]. Finally, combining the entropy value with the bytes analysis creates additional context capable of supporting more complex learning [Saxe and Berlin, 2015].

Byte-level analysis considers the raw bytes in malicious binaries. This analysis can be used to generate bytes-entropy histograms to identify the amount of information associated with various byte distributions in our malicious files [Saxe and Berlin, 2015]. One can scan binaries using a sliding window to extract the bytes histogram for each window and compute its entropy. These histograms can then be converted into a two-dimensional bytes-entropy histogram that contains all the byte-entropy pairs for all windows [Saxe and Berlin, 2015]. The process to generate bytes-entropy histograms is depicted in Figure 2.2.
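The sliding-window procedure above can be sketched in Python as follows. The window length of 1024 and step of 256 bytes match the description in Figure 2.2; the 16x16 binning of byte values and entropy values is an illustrative choice, not necessarily the exact layout used in our pipeline:

```python
import math
from collections import Counter

def byte_entropy_histogram(data: bytes, window: int = 1024,
                           step: int = 256, bins: int = 16):
    """Build a 2-D byte/entropy histogram over sliding windows (sketch)."""
    hist = [[0] * bins for _ in range(bins)]
    for start in range(0, max(1, len(data) - window + 1), step):
        chunk = data[start:start + window]
        counts = Counter(chunk)
        total = len(chunk)
        # Shannon entropy of the window, in bits (0..8 for byte data)
        entropy = -sum(c / total * math.log2(c / total)
                       for c in counts.values())
        e_bin = min(int(entropy / 8.0 * bins), bins - 1)
        # Accumulate each (byte value, window entropy) pair into the 2-D grid
        for byte, c in counts.items():
            hist[byte * bins // 256][e_bin] += c
    return hist
```

A fully encrypted or packed region would push its mass toward the high-entropy columns of the grid, whereas padding or sparse data concentrates in the low-entropy columns.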

2.5.2.2 Hashes Histograms

Static analysis features can also be constructed from the list of strings included in the file, as well as from the metadata and import table of the Portable Executable (PE) header found in the code of the malware. These features can be transformed into a feature vector of fixed length.

1. Strings: correspond to sequences of printable characters. Strings usually provide insights about the functionality or capabilities of malware. For example, strings such as “exec” and “sleep” are normally used to control a remote file, and a field such as “InternetOpenURL” indicates that the malware will try to establish a remote connection to an external server [InfoSec, 2018]. In our experiments, we extract the ASCII strings using the GNU strings tool.

2. Portable Executable (PE) header: the PE format is a data structure that encapsulates the information necessary for the Windows OS to load the executable code [InfoSec, 2018]. Additionally, the PE format contains the following sections:

(a) Import Table: a lookup table that is consulted when the program calls a function in a different module [Microsoft, 2018].

Figure 2.2: This figure illustrates the byte-level characterization of malicious software. We scan malicious files using a sliding window of length 1024 with a step size of 256 bytes. We then proceed to extract the bytes histogram for each window and compute its corresponding entropy. Finally, these histograms are converted into a two-dimensional bytes-entropy histogram that contains all the byte-entropy pairs for all windows.

(b) Metadata Table: a table that contains information about different elements of the application, such as the classes, fields, constants, and references among them [Microsoft, 2010].

Each string found in the file can be “hashed” to an integer between zero and the selected vector length, and a histogram of these hashes can be produced by counting the occurrences of each value. The result is a vector of positive integers of the desired size. Figure 2.3 shows how a list of strings is transformed into a fixed-size feature vector.
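The string extraction and hashing steps described above can be sketched as follows. The vector length of 256, the minimum string length of 4 (the default of the GNU strings tool), and the use of CRC32 as the hash function are illustrative assumptions, not necessarily the exact parameters of our platform:

```python
import re
import zlib

def string_hash_histogram(data: bytes, vector_len: int = 256,
                          min_len: int = 4):
    """Hash printable strings into a fixed-size histogram (sketch)."""
    # Mimic `strings`: runs of at least min_len printable ASCII characters
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    vec = [0] * vector_len
    for s in re.findall(pattern, data):
        # Hashing trick: map each string to a bucket and count occurrences.
        # CRC32 is used here for determinism; Python's built-in hash() is
        # randomized per process.
        vec[zlib.crc32(s) % vector_len] += 1
    return vec
```

The same hashing trick applies unchanged to tokens taken from the PE import table or metadata, so all three string-like sources can share one fixed-length representation.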

Figure 2.3: This figure shows how features are generated from a malware file’s strings. First, each string is “hashed” to an integer between zero and the desired vector length. Second, a histogram of these hashes is produced by counting the occurrences of each value. The result is a vector of positive integers of the desired size.

2.5.2.3 Disassembly Analysis

We are currently using Radare2 to translate machine code into assembly code (i.e., a human-readable representation of machine code [Hope, 2017]). Radare2 corresponds to an open-source reverse engineering framework that was released in February of 2006 [Radare2, 2018]. It can disassemble the executable code of binaries that run on different architectures, including x86, ARM, Bytecode (Java), and Javascript (from HTML files) [Searles et al., 2017]. Instead of considering the raw binaries, Radare2 considers the binary file in terms of instructions [Cavazos, 2017]. From the assembly code, we construct different types of compiler representation graphs of the binary, such as control-flow graphs, from which features are extracted to train machine learning models. In addition, Radare2 provides security analysts with a visual and web user interface that allows them to obtain a more interactive view of the malware binaries [Alvarez, 2016]. Finally, it provides a free and simple command-line hexadecimal editor, with support for 64-bit offsets, to statically analyze malicious executables [Radare2, 2018].

The process to perform disassembly analysis of malware using Radare2 is depicted in Figure 2.4. First, Radare2 disassembles the executable file into instructions, blocks, and functions, which we define below.

1. Instructions: the program statements of the executable file.

2. Blocks: sequences of instructions where an operation in each position is executed before the instructions in later positions [Kruegel et al., 2004].

3. Functions: a list of blocks that can be analyzed independently and sequentially during the disassembly process [Kruegel et al., 2004].

We then explore the expressiveness of compiler representation graphs constructed from the assembly code by generating three types of graphs: call-flow, control-flow, and data-flow.

1. Call-flow graph: indicates the calling relationships between subroutines in a computer program (i.e., caller-callee relationship between functions) [Searles et al., 2017].

2. Control-flow graph: is a directed graph where nodes represent basic blocks, and edges indicate possible control flows from one basic block to another [Hubicka, 2003].

Figure 2.4: This figure shows the process of extracting features from a binary representation using Radare2. First, grouped assembly instructions (ASM) are generated. Secondly, compiler graphs, such as control-flow graphs, are constructed from the disassembled code. Finally, the compiler graphs are regularized to form feature graphs.

3. Data-flow graph: represents the read/write dependencies between instructions [Stuart, 2005].

As depicted in the lower middle part of Figure 2.4, compiler representation graphs are composed of operations linked by different types of edges, such as calls (black thick curved arrow), branches (dotted arrows), and fall-throughs (small thin straight arrows). Edges between blocks represent jumps in the control flow from the end of a basic block to the start of another [Kruegel et al., 2004]. Branches connect the last instruction of one block to the first instruction of the following basic block [Hubicka, 2003]. Branches also implement control structures (e.g., if and switch) and loops (e.g., for and while) [Zelenski and Parlante, 2008]. Finally, fall-throughs correspond to the edges that indicate the normal flow of instructions (from one basic block to another) in the absence of branching [Hubicka, 2003].
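To make the graph structure concrete, the following toy Python representation encodes a hypothetical four-block control-flow graph with typed edges, plus one way to mine it for a simple global statistic. All block names, instructions, and edge labels here are invented for the example:

```python
# Hypothetical toy control-flow graph: nodes are basic blocks, and each
# typed edge records whether it is a branch, a fall-through, or a call.
cfg = {
    "entry": {"instructions": ["push ebp", "mov ebp, esp",
                               "cmp eax, 0", "je exit"],
              "edges": [("branch", "exit"), ("fallthrough", "body")]},
    "body": {"instructions": ["call decode", "jmp entry"],
             "edges": [("call", "decode"), ("branch", "entry")]},
    "exit": {"instructions": ["pop ebp", "ret"], "edges": []},
    "decode": {"instructions": ["xor eax, eax", "ret"], "edges": []},
}

def edge_type_counts(graph):
    """Aggregate edge-type statistics over the whole graph
    (one example of a global-level feature)."""
    counts = {}
    for node in graph.values():
        for kind, _target in node["edges"]:
            counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Counting the typed edges of this toy graph yields two branches, one fall-through, and one call; statistics of exactly this kind feed the graph-based features described next.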

2.5.2.4 Graph-based Features

In order to train our machine learning models, we extract relevant features from the compiler representations of malicious binaries. For each of these graphs, the nodes are labeled with feature vectors representative of the information in the node, where each node represents a block in the original code. Finally, we scan the graph data structure and accumulate statistics through four levels of granularity: instruction, block, function, and global level [Vanderbruggen and Cavazos, 2017].

1. Instruction Level: for each instruction in the whole-program instruction-flow-graph (WPIFG), we aggregate statistics that include instruction type and size.

2. Block Level: we aggregate statistics, instruction 1-grams, and instruction 2-grams for each block of the whole-program control-flow-graph (WPCFG). Statistics contain the block’s size, edge statistics, and aggregated averages of the block’s operations. Instruction 1-grams and 2-grams correspond to histograms of sequences of one or two instructions.

3. Function Level: for each function of the call-graph (CG), we aggregate instruction 1-grams, instruction 2-grams, and statistics that include information about the function and aggregated averages of its blocks and operations.

4. Global Level: we collect statistics about the code size, the number of operations, blocks, and functions, and their corresponding aggregated averages.
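The instruction 1-gram and 2-gram histograms used at the block and function levels can be sketched as follows; the list of mnemonics is a made-up example of one basic block:

```python
from collections import Counter

def instruction_ngrams(mnemonics, n=2):
    """Histogram of consecutive instruction n-grams within one block (sketch)."""
    return Counter(tuple(mnemonics[i:i + n])
                   for i in range(len(mnemonics) - n + 1))

# Hypothetical basic block, reduced to instruction mnemonics
block = ["mov", "xor", "mov", "call", "ret"]
unigrams = instruction_ngrams(block, n=1)   # counts of single instructions
bigrams = instruction_ngrams(block, n=2)    # counts of instruction pairs
```

Summing these per-block histograms over all blocks of a function, and over all functions of the program, gives the function-level and global-level aggregates described above.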

2.6 Dynamic Analysis

Dynamic analysis (or behavioral analysis) is a process in which a malicious sample is executed and its behavior, as well as the changes that occur in the analysis system, are monitored and reported for further examination [Cuckoo, 2017]. Reports derived from dynamically analyzed malware usually include the following results:

1. File activities highlight the archives that were created or deleted during the execution of the suspected sample [Bayer et al., 2010].

2. Traces of system calls denote requests spawned by the malware to perform an operation or run a program on the analysis system [Thakur, 2016].

3. Windows registry activities describe modifications of the information and configuration settings of the software and hardware installed on a Windows operating system [Fisher, 2017].

4. Memory dumps correspond to the analysis of the contents of the system’s Random Access Memory (RAM) to extract information such as running processes, network connections, and modules loaded by the malware [Ka, 2016].

5. Network activities indicate the files that were downloaded, the information sent over the network, and malicious websites contacted by the malware [Vadrevu and Perdisci, 2016].

Furthermore, dynamic malware analysis has emerged as a state-of-the-art detection approach that compensates for the shortcomings of static analysis techniques [Hu et al., 2013]. Schemes such as obfuscation and run-time packing are less likely to affect the dynamic execution of a suspicious file [Hu et al., 2013].

Nevertheless, given that the execution of a malicious sample can lead to undesired consequences (e.g., file deletion and modification, loss of confidential information, and changes in the registry), dynamic analysis must be performed in a safe environment or sandbox, which is a confined execution system that can be used to isolate and run unreliable software and observe its malicious behavior [Oktavianto and Muhardianto, 2013]. By using a sandbox, it is possible to perform dynamic analysis without worrying about the changes that will occur during the execution of the suspicious sample.

In our experiments, we use Cuckoo Sandbox to dynamically analyze potentially malicious applications in an isolated and safe environment. This allows us to extract meaningful information and generate dynamic features, such as the malware analysis run time and the threat level assigned to a suspicious sample, both of which are used as input to our machine learning algorithms.
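A minimal sketch of how such dynamic features might be pulled out of a Cuckoo-style JSON report is shown below. The key names (`info`, `duration`, `score`, `behavior`, `processes`, `calls`) are illustrative assumptions; the actual report layout varies across Cuckoo versions:

```python
def dynamic_features(report: dict):
    """Extract a few dynamic features from a Cuckoo-style report dict
    (key names are illustrative, not a fixed schema)."""
    info = report.get("info", {})
    duration = info.get("duration", 0)   # analysis run time, in seconds
    score = info.get("score", 0.0)       # threat level assigned by the sandbox
    # Total API calls observed across all monitored processes
    n_api_calls = sum(len(p.get("calls", []))
                      for p in report.get("behavior", {}).get("processes", []))
    return {"duration": duration, "score": score, "api_calls": n_api_calls}
```

These scalar features can be concatenated with the static feature vectors of Section 2.5.2 before being fed to a classifier.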

2.6.1 Drawbacks of Dynamic Analysis

While dynamic analysis can examine malware that is resistant to static analysis, it is still limited by the ability of some malware to detect a sandbox environment and stop the execution of any malicious activity [Kirat et al., 2014]. In addition, dynamic analysis often requires a significant amount of time (e.g., five minutes) to execute malicious samples, which poses a challenge, as quickly performing dynamic analysis is essential for coping with the ever-increasing load of malware that appears every day [Bayer et al., 2010]. Finally, dynamic malware analysis has limited coverage of the behavior of the malware, as it only captures the behavior (e.g., API calls, registry changes, and network activities) that corresponds to the execution path taken during a particular run of the sample [Hu et al., 2013]. Moreover, bad actors also add logic to their malware to hide its malicious nature and to perform certain operations only under specific conditions (e.g., a particular date/time is reached or input from a user is detected in the analysis environment). Such trigger-based malware generates fewer run-time traces, which results in an incomplete picture of its malicious activity when dynamic analysis is performed [Hu and Shin, 2013].

2.6.2 Cuckoo Sandbox

Cuckoo is an open source sandbox for malware analysis [Cuckoo, 2017]. It is used to automatically analyze suspicious samples and record their behavior while running inside an isolated environment. Cuckoo began as a Google Summer of Code project in 2010, and its first beta version was announced and released in 2011 [Oktavianto and Muhardianto, 2013].

In January 2012, Cuckoo launched malwr.com, a free and public environment for malware analysis [Oktavianto and Muhardianto, 2013]. Security analysts can use this instance to submit files for dynamic analysis and obtain a report outlining their behavior. The most recent release of Cuckoo (version 2.0) was officially announced in 2017 [Cuckoo, 2017]. This version simplified the usage and maintenance of Cuckoo, making it more accessible for analysts interested in performing dynamic analysis of malware.

Furthermore, Cuckoo offers the following benefits over other traditional sandbox environments [Cuckoo, 2017]. It lists the files that were created, deleted, and downloaded by the malicious sample during its execution [Cuckoo, 2017]. Additionally, it dumps and analyzes network activities and traces system calls, including services that were installed and processes that were terminated during the analysis of the sample [Oktavianto and Muhardianto, 2013]. Moreover, Cuckoo can inspect and record the behavior of any file, including Windows executables, Portable Document Format (PDF) files, Java applets, Microsoft Word documents, mobile phone apps, and malicious websites [Shanks, 2014].

In addition, Cuckoo has the following advantages over commercial approaches for dynamic analysis such as JoeSandbox [JoeSandbox, 2017] and ThreatExpert [ThreatExpert, 2017]. It is fully open source and has a modular design that enables the customization of the analysis environment. It is also supported by a strong and active community of developers and security analysts. Finally, Cuckoo can easily be integrated into an existing framework for malware analysis without licensing requirements [Cuckoo, 2017].

2.6.2.1 Architecture

The general architecture of Cuckoo is depicted in Figure 2.5 and consists of the following main components [Oktavianto and Muhardianto, 2013]:

1. Host machine: is the underlying operating system on which we deploy Cuckoo (usually a GNU/Linux distribution) [Cuckoo, 2017]. It manages the core functions of the sandbox, such as submitting files for analysis to the isolated environment and generating reports about the behavior of the samples.

21 2. Guest machines: are the safe and confined environments that run and analyze malware and report its behavior back to the host [Oktavianto and Muhardianto, 2013].

3. Malware initiator: a critical component of a malware analysis system which starts the execution of the binary [Kirat et al., 2014]. In our case, the initiator takes the form of an in-guest agent. This cross-platform initiator receives the malicious sample from the host machine through a local network, starts the execution of the file, and communicates the analysis results back to the host. Furthermore, a start-up entry is generated in the system to ensure that the agent is running after each system reboot and to control the duration of the execution of the malware.

Figure 2.5: This figure shows Cuckoo’s architecture. The host manages the main functions of the sandbox, such as submitting files for analysis and generating reports about the behavior of the samples. The guests correspond to isolated environments that safely execute and analyze malware, and report its behavior back to the host. These guest machines are part of a virtual network or bare-metal architecture (i.e., a system running on real hardware).

2.6.2.2 Processing Modules

Cuckoo’s modular framework allows users to define custom ways to process the results derived from the execution of the malware. The core processing modules of Cuckoo are listed and defined below [Oktavianto and Muhardianto, 2013].

1. AnalysisInfo provides some basic-level information about the analysis, such as the timestamps for the system calls executed by the malware [Oktavianto and Muhardianto, 2013].

2. BehaviorAnalysis processes the raw dynamic logs to generate a behavioral summary of the malicious actions performed by the malware, as well as a process tree that outlines the relationships between these actions [Oktavianto and Muhardianto, 2013].

3. Dropped describes the files dropped by the malware during its execution [Oktavianto and Muhardianto, 2013].

4. NetworkAnalysis extracts network information, such as IP addresses and domains the malware tried to contact [Oktavianto and Muhardianto, 2013].

5. TargetInfo presents information such as the hash of the analyzed file (i.e., the cryptographic fingerprint that uniquely identifies malware [Honig and Sikorski, 2012]).

6. VirusTotal looks up the hash of the file on VirusTotal.com for antivirus signatures [Oktavianto and Muhardianto, 2013].

7. ProcMon provides information regarding the file systems, registries, and other activities triggered during the analysis of the malware [Zeltser, 2011].

8. Droidmon includes information about dynamic API calls spawned by the malware [Oktavianto and Muhardianto, 2013].

2.6.2.3 Repository of Malicious Behaviors

The developers of Cuckoo have determined a priori a set of behaviors that could be deemed malicious, such as a file installing itself for autorun at Windows startup or querying for computer name information [Cuckoo, 2017]. When a malicious sample is executed in the sandbox environment, it is possible to run these behaviors against the analysis results and identify patterns or indicators of interest [Oktavianto and Muhardianto, 2013]. These behaviors provide context to the analyses, as they simplify the interpretation of the results and can help identify particular malware families (e.g., ransomware) by isolating unique malicious actions performed by these types of malware [Oktavianto and Muhardianto, 2013]. For our work, we explore this behavioral space and create models that predict the important capabilities of the samples in our malware dataset (e.g., modification of Windows registries, creation and deletion of files, and network activities, among others).
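The idea of matching pre-defined behaviors against analysis results can be illustrated with a toy example. Cuckoo’s real signatures are Python classes with their own matching API; the signature names, action strings, and severity weights below are invented for the sketch:

```python
# Hypothetical repository: each signature maps a name to a predicate over
# the set of observed actions plus a severity weight (names are invented).
SIGNATURES = {
    "persistence_autorun": (lambda acts: "registry_write_run_key" in acts, 3),
    "queries_computer_name": (lambda acts: "get_computer_name" in acts, 1),
}

def match_signatures(actions):
    """Return the names of signatures triggered by the observed actions."""
    observed = set(actions)
    return [name for name, (pred, _severity) in SIGNATURES.items()
            if pred(observed)]
```

The list of triggered signature names, one boolean per behavior, is exactly the kind of capability label our models are trained to predict.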

2.6.2.4 Maliciousness Scale

After a suspicious sample is dynamically analyzed in the sandbox environment, Cuckoo provides a measurement that aims to quantify the maliciousness of the file [Cuckoo, 2017]. The aforementioned malicious behaviors are given a severity score, and the scores are combined into a single maliciousness scale from 1 to 10. Finally, Cuckoo classifies a file as malicious if the final score is equal to or greater than 5.0 [Oktavianto and Muhardianto, 2013].
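A simplified illustration of such a scoring scheme follows. Cuckoo’s actual scoring formula is version-dependent; this sketch merely sums the triggered severities, caps the result at 10, and applies the 5.0 threshold stated above:

```python
def maliciousness_score(severities, threshold=5.0):
    """Combine per-signature severities into a capped 1-10 scale
    (simplified illustration, not Cuckoo's exact formula)."""
    score = min(10.0, sum(severities))
    return score, score >= threshold
```

For example, two triggered signatures of severity 3 would place a sample just above the threshold, while a single low-severity match would leave it below.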

2.7 Machine Learning for Cyber Security

In this section, we introduce several studies that are most relevant to the work described in this dissertation. We present a summary of existing research that uses machine learning, in conjunction with features extracted from static and dynamic analysis, to detect and classify malicious executables.

2.7.1 Using Static Analysis Features

Static features extracted from malware, such as PE headers [Wicherski, 2009], instruction sequences [Karim et al., 2005], binary sequences [Jang et al., 2011], and code patterns [Christodorescu and Jha, 2003], have been largely used to train machine learning algorithms and build predictive models for malware detection and the categorization of zero-day malware [Shafiq et al., 2009].

The work of Schultz et al. [Schultz et al., 2001] is widely known for first introducing the concept of machine learning for malware detection. In their research, the authors leveraged naive Bayes classifiers along with static analysis features to detect malware. The static features selected to train their learning algorithm corresponded to byte-sequence n-grams, portable executable (PE) headers, and string features. Their results showed that a naive Bayes algorithm trained on strings extracted from malicious

code could yield a classification accuracy of up to 97.11%, outperforming traditional methods such as signature-based techniques [Schultz et al., 2001].

The approach proposed by Schultz et al. was later improved by Kolter and Maloof [Kolter and Maloof, 2004]. They used n-grams, as opposed to byte sequences, to train a set of machine learning algorithms such as decision trees, naive Bayes, and support vector machines, achieving the best classification rate with the boosted version of the decision tree algorithm [Kolter and Maloof, 2004].

Tian et al. [Tian et al., 2008] leveraged function length frequency, i.e., a measure of the number of bytes in the code of malware, coupled with machine learning algorithms provided by the Weka library [Hall et al., 2009], to effectively classify malicious executables [Tian et al., 2008].

Similarly, Siddiqui et al. [Siddiqui et al., 2009] employed decision trees and random forest models to classify malware based on variable-length instruction sequences [Siddiqui et al., 2009]. The work of Schmidt et al. [Schmidt et al., 2009a], [Schmidt et al., 2009b] demonstrated the high detection rates that could be achieved by incorporating the function calls extracted from static analysis with machine learning techniques (support vector machine and naive Bayes) to classify malware [Schmidt et al., 2009a], [Schmidt et al., 2009b].

Other approaches based on more complex features extracted from malicious code have also been proposed for malware detection and classification. Kong and Yan [Kong and Yan, 2013] introduced a framework for malware classification based on features extracted from the function call graph of malware. Using similarity learning and an ensemble of classifiers, the authors were able to classify malicious samples into their corresponding families with an accuracy of 93.33% [Kong and Yan, 2013].

Furthermore, more intricate approaches have been proposed to detect and classify malware based on static features. Dahl et al. [Dahl et al., 2013] extracted strings and API tri-grams from malicious code and used this feature space to train multiple neural network architectures, achieving an error rate of 0.49% for the classification of malware [Dahl et al., 2013].

25 Saxe and Berlin [Saxe and Berlin, 2015] developed a deep learning architecture trained on ASCII strings and bytes-entropy histograms using feedforward neural net- works to detect malware. Their platform managed to yield a detection rate of almost 95% based on more than 400,000 malware binaries [Saxe and Berlin, 2015]. To the best of our knowledge, this work represents the most effective tool for detecting malware based solely on static characterizations.

2.7.2 Using Dynamic Analysis Features

Other researchers opt to work with dynamic analysis techniques to improve the effectiveness and efficiency of malware detection and classification. For example, API calls have been largely used to train machine learning classifiers to identify and characterize malware [Anderson et al., 2011], [Bayer et al., 2009], [Tian et al., 2010], [Bailey et al., 2007], [Park et al., 2010], [Firdausi et al., 2010]. They indicate the requests spawned by the malware to perform an operation or run a program on the analysis system [Thakur, 2016]. Dynamic analysis has gained in popularity due to the fact that schemes that normally thwart static analysis, such as obfuscation and run-time packing, are less likely to affect the dynamic execution of a suspicious file [Hu et al., 2013]. In the following paragraphs, we describe some of the approaches developed to train machine learning models based on this particular dynamic characterization of malware.

Tian et al. [Tian et al., 2010] examined API call sequences extracted from dynamic analysis of malicious executables. Using classifiers available in the WEKA library, such as decision tables, random forest, and support vector machine, they were able to differentiate between malware and goodware and classify malicious samples into their corresponding families, with an overall accuracy of up to 97% for both problems [Tian et al., 2010].

Firdausi et al. [Firdausi et al., 2010] leveraged dynamic reports obtained from the execution of malware in a sandbox environment. Information about the API calls spawned by the malicious executables is transformed into sparse feature vectors for

classification of malware using classifiers such as support vector machine, k-nearest neighbors, naive Bayes, and decision trees. The best performing model corresponded to the J48 decision tree algorithm, which managed to accurately classify malware with an average accuracy of 96.8% [Firdausi et al., 2010].

Furthermore, more complex approaches, such as neural networks in the context of supervised learning, have proved to be well suited for malware classification. Kolosnjaji et al. [Kolosnjaji et al., 2016b] trained a neural network with convolutional and recurrent network layers using system call sequences extracted from dynamic analysis reports. Their results outperformed traditional machine learning methods for malware classification, such as hidden Markov models and support vector machines, achieving an average precision and recall of 85.6% and 89.4%, respectively [Kolosnjaji et al., 2016b].

Huang and Stokes [Huang and Stokes, 2016] proposed a deep neural network architecture that employed API call events and null-terminated strings (i.e., objects where the character “\0” signals the end of the string) as input feature vectors to their deep learning framework. Their approach was able to yield an error rate of 2.94% for the problem of malware classification, and 0.358% for the task of detecting malware [Huang and Stokes, 2016].

Hardy et al. [Hardy et al., 2016] constructed an intelligent framework for malware detection based on deep learning. Their approach leveraged the stacked AutoEncoders (SAEs) model and features constructed from sequences of API calls to identify malicious binaries. With an estimated accuracy of over 96.85%, their platform was able to exceed the results obtained by other shallow learning-based classification methods such as support vector machine, naive Bayes, and decision tree [Hardy et al., 2016]. Pascanu et al.
[Pascanu et al., 2015] attempted to learn the “language of malware” by examining semantic representations of instructions executed by malicious files in an emulated environment. In their research, the authors used recurrent neural networks and echo-state networks to learn from event sequences and effectively detect reordered malware [Pascanu et al., 2015].

In addition, features such as network activities have also been used to induce

highly accurate malware detection and classification models. Nari and Ghorbani [Nari and Ghorbani, 2013] developed a framework that leveraged network behavior to perform malware family classification. After constructing behavioral graphs from network traces exhibited by the malware, features such as graph size and number of nodes were extracted to perform malware classification using the decision tree algorithm. Their results surpassed the performance obtained by anti-virus programs in classifying malware, achieving an overall accuracy of nearly 95% [Nari and Ghorbani, 2013].

Finally, Rieck et al. [Rieck et al., 2011] proposed a framework for the automatic analysis of malware behavior using machine learning. In their work, the authors used an incremental approach for behavior-based analysis, which was capable of processing the behavior of thousands of malware binaries on a daily basis. Their framework automatically identifies novel classes of malware with similar behavior and assigns unknown malware to these discovered classes [Rieck et al., 2011].

2.7.3 Using Hybrid Features

In addition to static and dynamic analysis, there are a few documented cases of research that combines these features into a hybrid approach to detect and classify malware.

Santos et al. [Santos et al., 2013] used opcode sequences extracted from static analysis and a large number of dynamic features to build a malware detection framework. Their platform employed supervised learning (specifically, decision tree, k-nearest neighbor, Bayesian networks, and support vector machine) to classify unseen malware with high accuracy [Santos et al., 2013].

Islam et al. [Islam et al., 2013] employed function length frequency and string information from static analysis, as well as API calls from dynamic analysis, to train support vector machine, decision tree, random forest, and instance-based classifiers. This improves on previous results based on individual features and significantly reduces the time spent testing features from static and dynamic analysis separately [Islam et al., 2013].

Anderson et al. [Anderson et al., 2012] leveraged features extracted from static analysis (e.g., control flow graphs) and dynamic analysis (e.g., dynamic instruction traces and system call traces) to train a support vector machine classifier to detect malware [Anderson et al., 2012]. Finally, a more recent paper by Kolosnjaji et al. [Kolosnjaji et al., 2016a] showed that supervised learning applied to a hybrid model that combines features from static and dynamic analysis displays significantly more robustness than traditional approaches and increases the accuracy of inferred class labels [Kolosnjaji et al., 2016a].

2.8 Discussion

Based on the previous summary, it is evident that there is a deep relationship between static and dynamic malware analysis and machine learning, as classifiers can aid in categorizing malware for which signatures have not been recorded. By taking advantage of both static and dynamic features, it is possible to construct models that extract the underlying structure of the data of a malicious file and help researchers make predictions from these analyses. Nevertheless, the aforementioned methods have significant weaknesses that we highlight in the following paragraphs.

Models trained exclusively on static analysis features are limited due to the vulnerability of this approach to obfuscation techniques, such as insertion of dead code, register reassignment, and instruction substitution [Christodorescu and Jha, 2003]. In fact, Grosse et al. [Grosse et al., 2016] demonstrated that the accuracy of a machine learning classifier can dramatically decrease when trained on adversarial malware that incorporates obfuscation.

Although classifiers built on dynamic-based features overcome some of the limitations of models trained using static analysis information, they still suffer from performance and scalability issues [Bierma et al., 2014]. Existing approaches that leverage machine learning and dynamic data to detect and classify malware, such as [Huang and Stokes, 2016] and [Hardy et al., 2016], use the entire dynamic execution cycle of suspicious samples. Malware takes a long time to be executed in emulated environments,

which creates an impractical delay (in comparison with static techniques) if dynamic analysis is used as part of an endpoint security system [Rhode et al., 2018]. As a result, dynamic analysis cannot efficiently scale to large malware datasets consisting of millions of samples.

In the following chapters, we prove that the malicious nature of most malware can be detected in a short amount of time, avoiding a full dynamic analysis on the files. By leveraging static analysis features extracted from unpacked malware, i.e., malicious binaries from which all packing, obfuscation, and protection artifacts have been removed, we construct models that estimate and reduce the malware execution time and increase the throughput of our malware detection platform. In addition, we also demonstrate that dynamic analysis is not always required to detect and classify malware and that simpler characterizations are often enough to make a correct prediction. Finally, using features extracted from fast static characterization of unpacked malware, we are able to induce accurate models for malware capability detection, removing the need to perform dynamic analysis to extract high-level functionalities of malicious code.

Chapter 3

PROPOSED APPROACH

In this dissertation, we propose an intelligent malware analysis system that leverages machine learning to detect and classify new malware in a short amount of time and infer its important capabilities. Our malware analysis system can identify the simplest features required to accurately classify malware. This allows us to determine the subset of samples that is truly different and requires very expensive dynamic characterization. When dynamic analysis is imperative, our system also estimates the minimum amount of time required to detect and classify malware. This is a crucial quality because the traditional process of dynamic analysis can be a bottleneck for studying malware [Damodaran et al., 2017]. It is therefore essential that our malware analysis system spends the least amount of time analyzing malware whose malicious nature can be detected in seconds. The time savings can then be reallocated to analyzing files that require longer execution times and produce actionable intelligence for the system [Vadrevu and Perdisci, 2016]. Finally, we leverage the speed of static analysis [LMSecurity, 2017] to induce highly accurate machine learning models for malware capability detection.

Our approach determines the minimum amount of time needed to detect the malicious nature of malware. In contrast, traditional sandbox environments execute suspected samples for long periods of time (often five minutes or more [Bierma et al., 2014], [Quist et al., 2011], [Tobiyama et al., 2016]), regardless of the information extracted from the file in question [Bayer et al., 2010].

In addition, our approach uses machine learning to automatically deduce important capabilities of malware. In contrast, traditional approaches for describing malware, such as YARA, depend entirely on a large number of rules that are inherently brittle

and difficult to maintain [Saxe et al., 2013]. YARA is an open-source project that tries to predict important capabilities of malware [Alvarez, 2017]. YARA provides a framework to generate descriptions of malware families based on textual or binary patterns. Nevertheless, a small change or mutation to the code of the malware can render these patterns useless [Cade, 2017]. Our approach, on the other hand, leverages automated machine learning-based models that infer high-level functionalities of malicious code.

Furthermore, our system uses fast static analysis to predict important capabilities of malware. Other automated techniques rely on slow dynamic analysis to create behavioral profiles that describe high-level functionalities of malicious code [Yavvari et al., 2012]. Moreover, our system creates descriptions of malware capabilities based on the decision tree algorithm. This technique offers unique advantages since its output has the potential to be human interpretable and easy to visualize [Petri, 2015]. Decision trees also inherently perform useful tasks, including feature importance and selection on a dataset [Deshpande, 2011a]. We construct visual models that provide security analysts with more information about the specific elements of malicious code associated with a particular capability. Other machine learning algorithms, such as deep learning, generate representations that are more difficult for security analysts to interpret [Holmes, 2014].

Additionally, our system can effectively single out the subset of malware with anti-sandbox capabilities that requires execution on a bare-metal (i.e., physical) machine to display its malicious behavior. In contrast, other automated approaches run large malware datasets on bare-metal architectures (in an attempt to mitigate the side-effects of virtualized environments), incurring high hardware costs and suffering from scalability limitations [Kirat et al., 2014].
Finally, our system speeds up feature generation by comparing the predictive power of different malware characterizations, and selecting the least computationally expensive features that lead to a correct classification. As a result, expensive characterizations are not always required. In contrast, other automated techniques select

features that can lead to accurate predictions but demand high computational power to be generated [Ahmadi et al., 2016].

The contribution of this dissertation corresponds to the development of an intelligent sandbox environment that significantly extends the capabilities of Cuckoo, an open-source malware analysis system [Oktavianto and Muhardianto, 2013]. A set of machine learning algorithms is used to build three predictive models and address several important sandbox problems currently being solved by manually-tuned heuristics: 1) predicting the length of time needed to execute a malware, 2) learning when dynamic behavior improves the accuracy of malware detection beyond performing static analysis, and 3) automatically identifying important capabilities of malware.

3.1 TURACO: Training Using Runtime Analysis from Cuckoo Outputs

We demonstrate that the malicious nature of most malware can be detected in a short amount of time (as short as ten seconds), without requiring a full dynamic analysis on the files. Traditional sandbox environments execute suspected malware for long periods of time [Damodaran et al., 2017]. In addition, the overall run time might be longer in some cases, as numerous systems include a post-processing phase following the analysis of the suspicious sample [Bayer et al., 2010]. This creates an impractical delay if dynamic analysis is used as part of an endpoint security system [Rhode et al., 2018]. Most malware, however, do not require long executions, because a model may be able to predict their malicious intent and corresponding family after observing their behavior for only a few seconds [Rhode et al., 2018].

We introduce an approach that accurately determines the amount of time needed to execute malware in a sandbox environment. Specifically, we develop a technique that we refer to as TURACO (Training Using Runtime Analysis from Cuckoo Outputs). This model leverages features extracted from fast static analysis of malware, as well as dynamic reports generated by Cuckoo. By feeding this information into a machine

learning model, TURACO can predict the minimum amount of time needed to reveal the malicious nature of malware with an estimated accuracy of 93.86%.

3.2 SEEMA: Selecting the Most Efficient and Effective Malware Attributes

Our automated malware analysis system also uses a model called SEEMA that Selects the most Efficient and Effective Malware Attributes to assign malicious binaries into their corresponding families. Malware family classification corresponds to the process of determining whether a binary is part of a previously seen family, or if it corresponds to a new unseen sample that requires further examination [AlAhmadi and Martinovic, 2018]. A malware family provides security analysts with crucial information about the threat level that a file poses to the system, as well as the remediation process that needs to be followed to mitigate a potential threat [Kaspersky, 2018]. For example, the incident response for ransomware differs from a botnet’s mitigation process, as the former represents a larger threat to a system and must be prioritized accordingly [AlAhmadi and Martinovic, 2018].

Using dynamic analysis for malware classification can be computationally expensive and should not be considered for every sample that is part of a large malware dataset [Kolosnjaji et al., 2016b]. Less expensive characterizations of malware, such as static analysis, can be used to generate descriptive features to be used in conjunction with machine learning to accurately predict malware families [Saxe and Berlin, 2015]. Nevertheless, for some malware it may still be necessary to use dynamic-based features to produce a better prediction [Islam et al., 2013].

We extract various malware characterizations from both static and dynamic analysis and explore different feature sets to feed as input to our machine learning algorithms. We construct classification models based on a set of features of increasing computational cost and predictive power, ranging from cheap-to-compute static-based features to more computationally expensive dynamic-based features.
This approach allows us to reduce the overall time it takes to generate features for a significantly

large malware dataset. SEEMA can determine the simplest characterizations required to correctly classify malware, and the subset of samples that truly need more expensive dynamic analysis to be correctly classified. Our results indicate that SEEMA can predict the simplest features and models to classify malware with an accuracy of up to 95.11%.
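The cascade idea behind SEEMA can be sketched as follows. This is a hedged, minimal sketch under our own assumptions, not the dissertation's implementation: `models` is an ordered list of (feature extractor, classifier) pairs running from cheap static characterizations to expensive dynamic ones, and `StubClassifier` is an illustrative stand-in for a trained scikit-learn model.

```python
import numpy as np

def classify_with_cheapest_features(sample, models, confidence=0.9):
    """Try the cheapest feature set first; only fall back to costlier
    characterizations when the classifier is not confident enough."""
    probs = None
    for extract_features, clf in models:
        probs = clf.predict_proba([extract_features(sample)])[0]
        if probs.max() >= confidence:
            break                 # confident: skip costlier characterizations
    return int(np.argmax(probs))  # predicted family index

class StubClassifier:
    """Illustrative stand-in for a trained model with predict_proba."""
    def __init__(self, probs):
        self.probs = probs
    def predict_proba(self, X):
        return np.array([self.probs])

cheap = (lambda s: [len(s)], StubClassifier([0.55, 0.45]))   # static features, unsure
costly = (lambda s: [len(s)], StubClassifier([0.05, 0.95]))  # dynamic features, confident
print(classify_with_cheapest_features(b"MZ...", [cheap, costly]))  # 1
```

Under this scheme, only the samples on which the cheap model is uncertain ever pay the cost of dynamic feature extraction.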

3.3 MAGIC: Malware Analysis to Generate Important Capabilities

We use our intelligent sandbox environment to induce machine learning models to predict important capabilities of malware. Malware capabilities are high-level functionalities of malicious software, such as the ability to capture key strokes or encrypt a user’s personal information [Saxe et al., 2014].

Machine learning models trained using features extracted from the dynamic execution of malicious code can be used to predict behavioral components of malware and infer its important capabilities [Yavvari et al., 2012]. However, dynamic analysis usually limits the throughput of an analysis system, because traditional sandbox environments take a long time to execute suspected malware [Damodaran et al., 2017]. This poses a challenge since dynamic analysis needs to be orders of magnitude faster to cope with the ever-increasing load of malware that appears on a daily basis [Vadrevu and Perdisci, 2016].

Static analysis helps mitigate the computational expense of dynamic analysis, as it can often be performed in seconds [LMSecurity, 2017]. As a result, machine learning models trained on static features can lead to predictive classifiers that can scale to large amounts of malware [Costin et al., 2014]. We use machine learning to construct a model that we refer to as MAGIC (Malware Analysis to Generate Important Capabilities) that learns from static and dynamic characterizations of malware and automates malware capability discovery. We use cheap-to-compute static features of malicious code and capabilities obtained from Reversing Labs (a malware dataset provider) and Cuckoo Sandbox, to map fast static analysis characteristics that indicate malicious behaviors. By using inexpensive static

characterization of our malware dataset, MAGIC can dramatically reduce the time that is required to deduce high-level functionalities of malware. Finally, MAGIC enables us to predict the important capabilities of malware with an accuracy of up to 97.11%.

3.4 Summary of Contributions We summarize our contributions:

1. We introduce TURACO (Training Using Runtime Analysis from Cuckoo Outputs) to automatically estimate the amount of time needed to execute malware and reveal its malicious intent.

2. We introduce a model called SEEMA (Selecting the most Efficient and Effective Malware Attributes) that determines when simple and less expensive characterization of malware, such as static analysis, will suffice to accurately classify malicious executables or when more computationally expensive descriptions, such as dynamic analysis, are required.

3. We introduce a model that we refer to as MAGIC (Malware Analysis to Generate Important Capabilities) that learns from static and dynamic characterizations of malware and automatically induces its important capabilities.

Chapter 4

TURACO: TRAINING USING RUNTIME ANALYSIS FROM CUCKOO OUTPUTS

4.1 Introduction

Malware analysis sandboxes are isolation and containment environments where new software can be forced to run prior to allowing it to execute in an enterprise. Sandboxes can inspect the behavior of any file, including Windows executables, PDFs, Microsoft Word documents, mobile phone apps, etc. In this chapter, we refer to malware samples executing in a sandbox without any loss of generality. Sandboxes have become an excellent cybersecurity tool in the security analyst's arsenal and have continued to gain in popularity over the last decade. Platforms such as CWSandbox [Willems et al., 2007], JoeSandbox [JoeSandbox, 2017], and ThreatExpert [ThreatExpert, 2017] have been developed to help security analysts monitor the behavior of malicious samples in a confined environment. In these traditional sandbox environments, malware is usually executed for a fixed amount of time regardless of the information extracted from the file in question (e.g., one minute [Kirat et al., 2011], two minutes [Willems et al., 2007], three minutes [Hansen et al., 2016], four minutes [Christodorescu et al., 2008], or even five minutes [Damodaran et al., 2017], [Tobiyama et al., 2016]). However, a significant number of malicious binaries submitted for dynamic analysis are repackaged versions of previously seen and analyzed malware [Vadrevu and Perdisci, 2016]. This results in a large amount of time being wasted in studying files that do not improve the intelligence of the analysis system.

In this chapter, we investigate the application of machine learning to predict the time needed to reveal the malicious intent of a file. We develop a model called TURACO (Training Using Runtime Analysis from Cuckoo Outputs) that significantly extends

the capabilities of Cuckoo. TURACO leverages features extracted from fast static analysis of malware, as well as reports generated by Cuckoo that provide insights about the time needed to accurately classify a file as malicious. By feeding this information into a machine learning model, TURACO can predict the minimum amount of time needed to deem a file as malware.

4.2 Problem Statement

One of the biggest challenges of dynamic analysis systems corresponds to having to process and provide meaningful reports for an ever-increasing number of malware, within a given period of time and with a limited set of computational resources [Vadrevu and Perdisci, 2016]. Therefore, it is critical to estimate the shortest amount of time required to execute a suspicious file for it to reveal its malicious intent. In traditional sandbox environments, malware is examined for a fixed amount of time, as indicated in Table 4.1 [Rhode et al., 2018], or until the execution of the malware ends [Bayer et al., 2010].

Time required       References
1 minute            [Kirat et al., 2011]
2 minutes           [Willems et al., 2007]
3 minutes           [Vadrevu and Perdisci, 2016]; [Hansen et al., 2016]
4 minutes           [Christodorescu et al., 2008]
5 minutes           [Bayer et al., 2010]; [Bierma et al., 2014]; [Anderson et al., 2011]; [Quist et al., 2011]; [Damodaran et al., 2017]; [Tobiyama et al., 2016]; [Storlie et al., 2014]

Table 4.1: Reported time for dynamic malware analysis

This time is naturally reduced, should the execution of the malware exit or stop (because of an error) before the timeout is reached. On the other hand, the overall

run time might be longer, because numerous systems allow a post-processing phase following the analysis of the binary sample. For instance, certain systems allow for an additional step of four minutes, on top of the actual execution of the malware [Bayer et al., 2010].

4.2.1 Dynamic Analysis Time Bayer et al. [Bayer et al., 2010] formally define the overall dynamic analysis time as follows:

OverallDynamicAnalysisTime = ( Σ_{b ∈ B} t_a(b) ) / I    (4.1)

with B being the set of binaries, t_a the analysis time for a single malware binary, and I the total number of instances of the analysis system that are running in parallel [Bayer et al., 2010]. Specifically, the analysis time t_a(b) of a binary b is composed of a setup time t_s(b) and a post-processing time t_p(b), in addition to the time t_e(b) used for executing the binary in a secure environment, as indicated in Equation 4.2 [Bayer et al., 2010].

t_a(b) = t_s(b) + t_e(b) + t_p(b)    (4.2)

During the setup time, the analysis environment is prepared (possibly by loading a virtual machine and transferring the program into it). In the execution time, an initiator starts the analysis of the malware. This initiator can take the form of an in-guest agent which receives the malware from the host machine through a local network [Kirat et al., 2014]. In the post-processing step, various kinds of offline analysis methods are applied to the information gained during the execution of the binary [Oktavianto and Muhardianto, 2013]. Tasks performed during this stage range from archiving the analysis results and updating databases, to running scripts for analyzing a network traffic dump file [Oktavianto and Muhardianto, 2013]. As mentioned earlier, numerous malware include logic to circumvent dynamic analysis by performing their malicious actions after a long period of time [Assor, 2016].
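Equations 4.1 and 4.2 can be combined into a short sketch. The component times below (10 s setup, 240 s post-processing, varying execution) are hypothetical numbers chosen for illustration, not measurements from the dissertation:

```python
def analysis_time(setup, execute, post):
    """Per-binary analysis time: t_a(b) = t_s(b) + t_e(b) + t_p(b) (Eq. 4.2)."""
    return setup + execute + post

def overall_dynamic_analysis_time(per_binary_times, instances):
    """Total per-binary analysis times divided by the number of analysis
    instances running in parallel (Eq. 4.1)."""
    return sum(per_binary_times) / instances

# Three binaries with execution times of 10, 60, and 300 seconds each,
# plus fixed (hypothetical) setup and post-processing overheads.
times = [analysis_time(10, te, 240) for te in (10, 60, 300)]
print(overall_dynamic_analysis_time(times, instances=2))  # 560.0
```

The sketch makes the bottleneck explicit: with fixed overheads, shrinking t_e(b) for the easy samples is the only lever that scales with dataset size.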

Consequently, longer executions might be required in some cases to reveal the entire capabilities of a sample, exacerbating the resource-intensive nature of dynamic malware analysis [Bayer et al., 2010].

Given the increasing stream of new malware that is generated every day [Harrison and Pagliery, 2015b], it is necessary to develop techniques that minimize the aforementioned execution time t_e(b). As a result, by constructing models that predict the shortest amount of dynamic analysis time required by each malware, it is possible to reduce the resources spent executing files whose harmful nature can be revealed in seconds. The time saved can then be reallocated to running binaries that demand longer periods of time and produce actionable intelligence for the analysis system [Vadrevu and Perdisci, 2016].

4.3 Approach

We want to build a machine learning-based model that determines malware analysis run time. To accomplish this, we propose a model that we refer to as TURACO (Training Using Runtime Analysis from Cuckoo Outputs). TURACO can predict the approximate time needed to perform dynamic analysis on a file, in order to detect its malicious intent. This is a crucial factor, as dynamic analysis can provide valuable information, but it can be a bottleneck for analyzing malware. Traditional sandbox environments execute suspected files for long periods of time, often five minutes or more (see Table 4.1). In contrast, static analysis can often be performed in seconds. Furthermore, many malware do not need to be executed for a long time, because their malicious intent can be detected after observing their behavior for only a few seconds.

TURACO leverages machine learning to predict the analysis time required to monitor each malware sample, instead of executing each file for a fixed amount of time. Effectively, our environment reduces the execution time dedicated to each malware binary (e.g., from 5 minutes to 10 seconds). These savings can be leveraged in multiple ways [Vadrevu and Perdisci, 2016]:

1. Decrease the cost of malware analysis systems: because fewer resources are needed to handle a predetermined daily volume of malicious files. Thus, it is possible to design and provision a bare-metal architecture (i.e., a system running on real hardware) at a lower cost [Vadrevu and Perdisci, 2016].

2. Significantly increase the capacity of malware analysis systems: measured as the average number of malware samples that can be processed per day [Vadrevu and Perdisci, 2016].

3. Spend less time analyzing samples from well-explored families: and reallocate the time saved to running samples that require longer periods of time to display their malicious behavior, or are derived from truly new families [Vadrevu and Perdisci, 2016].

4.3.1 Features

The process of training TURACO involves using Radare2 to extract static analysis information from each unpacked malware that is being considered for dynamic analysis. Specifically, we extract byte-level representations of malicious code, as well as hashes histograms from each binary. These generated static features are then used to form part of our training dataset of static information. Analysis of static features of malware offers several unique benefits. First, it has the potential to cover all possible code paths of a malicious program, yielding more accurate representations of the entire functionalities of the malware [Nath and Mehtre, 2014]. Secondly, we intend to leverage the fact that approaches based on static features are much more scalable than their dynamic counterparts, as they do not require resource-intensive and time-consuming monitoring of malicious programs [Hu and Shin, 2013]. This is particularly important given the rapidly-increasing number of new malware samples. Finally, the goal of our system is not to exclude dynamic analysis approaches, but to complement the strengths of static analysis with dynamic analysis, to achieve higher accuracies in identifying the set of malicious files that require more expensive dynamic executions.

The remaining part of the dataset includes determining targets for each malware required for the supervised learning algorithms. Particularly, we leverage reports generated by Cuckoo that provide insights about the time required to confidently classify a file as malicious.

4.3.1.1 Byte Features

We provided an overview of byte-level features and bytes-entropy histograms in Chapter 2 (see Section 2.5.2.1). In this chapter, we use feature vectors that only consider 16 entropy bins (En) instead of 256 (as originally proposed in Figure 2.2), which reduces the length of the bytes-entropy histogram matrix by a factor of sixteen.
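A bytes-entropy histogram with 16 entropy bins can be sketched as follows. The window size, step, and the demo byte string are assumptions made for illustration; the dissertation's exact parameters are given in Chapter 2:

```python
import numpy as np

def bytes_entropy_histogram(data: bytes, window=1024, step=256, entropy_bins=16):
    """Slide a window over the raw bytes, compute each window's entropy
    (0..8 bits), and accumulate a 2-D histogram over (byte value,
    entropy bin) pairs, returned as a fixed-length vector."""
    hist = np.zeros((256, entropy_bins))
    for start in range(0, max(len(data) - window + 1, 1), step):
        chunk = np.frombuffer(data[start:start + window], dtype=np.uint8)
        counts = np.bincount(chunk, minlength=256)
        probs = counts[counts > 0] / len(chunk)
        entropy = -np.sum(probs * np.log2(probs))             # 0 .. 8 bits
        e_bin = min(int(entropy / 8.0 * entropy_bins), entropy_bins - 1)
        for b, c in zip(*np.unique(chunk, return_counts=True)):
            hist[b, e_bin] += c
    return hist.ravel()  # length 256 * 16 = 4096

demo = bytes(range(256)) * 8          # stand-in for a file's raw bytes
features = bytes_entropy_histogram(demo)
print(features.shape)  # (4096,)
```

With 16 entropy bins rather than 256, the flattened vector shrinks from 65,536 to 4,096 entries, which is the factor-of-sixteen reduction mentioned above.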

4.3.1.2 Hashes Histograms

We also introduced hashes histograms in Chapter 2 (see Section 2.5.2.2 and Figure 2.3). We extract the list of ASCII strings embedded in the code of a malware sample, as well as the import and metadata tables from the Portable Executable (PE) header. These strings are then hashed into fixed-size feature vectors of length 256.
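The hashing step can be sketched as a simple feature-hashing scheme. The MD5-modulo bucketing and the sample strings below are our own illustrative choices, not necessarily the hash function used in the dissertation:

```python
import hashlib
import numpy as np

def hashes_histogram(strings, length=256):
    """Hash each extracted string into one of `length` buckets; the
    bucket counts form a fixed-size feature vector regardless of how
    many strings a sample contains."""
    vec = np.zeros(length)
    for s in strings:
        bucket = int(hashlib.md5(s.encode()).hexdigest(), 16) % length
        vec[bucket] += 1
    return vec

# Hypothetical strings pulled from a PE file's imports and string table.
v = hashes_histogram(["kernel32.dll", "CreateRemoteThread", "http://evil.example"])
print(int(v.sum()))  # 3
```

Because the output length is fixed, binaries with wildly different string counts still map to vectors the classifier can consume directly.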

4.3.1.3 Target Labels

The remaining part of our input features involves determining targets for each malware in our training dataset. We define a set of seven target classes: 10, 20, 30, 40, 50, 60, and 300 seconds. These labels indicate the minimum time each malware in the training set should be executed for, in order for its behavior to reach a minimum threat score of 5 within Cuckoo. Our choice for these targets is driven by the fact that they cover a wide number of execution times, ranging from very short executions (e.g., 10 and 20 seconds), to long executions (e.g., 60 seconds), to the traditional time used to run malware in an isolated environment (e.g., 300 seconds).

Furthermore, Table 4.2 shows the threat level scores for three different malware identified by their hash, and the seven different execution times corresponding to the seven target classes specified above. The score that determined the label that was assigned to each malware is highlighted in bold. The label corresponded to the minimum of our set target times that was needed for a threat level of 5 or higher to be reached. Each malware that was part of the training set was executed for each of the target times using Cuckoo, and a JSON report was generated at the end of each run containing the threat level score for the malware as highlighted in Figure 4.1.

Malware (m)                          Scores for Malware (m)                  Label (l)
                                     10s  20s  30s  40s  50s  60s  300s
0d8e20309f0...802eba6f8b27532        7.8  7.8  8.0  8.4  8.4  8.4  8.4      10s
2d1a2e11a90...25dcd362277652         1.2  2.4  2.6  2.6  5.4  5.6  6.0      50s
2bae0de6cda...18d7f8c5536f142        0.4  1.0  1.0  1.6  1.8  2.4  5.2      300s

Table 4.2: Assignment of labels for malware dataset
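The label-assignment rule of Table 4.2 can be sketched directly. This is a minimal sketch; the scores dictionary below reproduces the second row of the table:

```python
TARGET_TIMES = [10, 20, 30, 40, 50, 60, 300]   # seconds
THRESHOLD = 5.0

def assign_label(scores_by_time):
    """Return the shortest target execution time at which the Cuckoo
    threat score reaches 5.0, or None if the sample never crosses the
    threshold (such samples are dropped from the training set)."""
    for t in TARGET_TIMES:
        if scores_by_time.get(t, 0.0) >= THRESHOLD:
            return t
    return None

# Second row of Table 4.2: the score first reaches 5.0 at 50 seconds.
scores = {10: 1.2, 20: 2.4, 30: 2.6, 40: 2.6, 50: 5.4, 60: 5.6, 300: 6.0}
print(assign_label(scores))  # 50
```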

Figure 4.1: This figure illustrates a malware's threat level score extracted from Cuckoo.

4.3.1.4 Threat Level

A malware's threat level score can range from 1 to 10, based on the reports produced after multiple runs. Cuckoo confidently classifies a file as malicious if the final score is equal to or greater than 5.0. This quantification of maliciousness was generated by Cuckoo after each analysis was completed. Furthermore, Cuckoo took an inventory of all the malicious behaviors that the malware exhibited. The developers of Cuckoo have a-priori identified a set of behaviors that could be deemed as malicious, such as installing itself for autorun during Windows startup, or querying for the computer name information. Each behavior in this set was given a severity score, and the scores were combined into a single maliciousness scale from 1 to 10. Each malware is labeled with a target t_e ∈ {10, 20, 30, 40, 50, 60, 300}, signifying the minimum time needed to run the malware to receive a threat level score of 5.0.
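Pulling the score out of a Cuckoo JSON report can be sketched as follows. We assume the overall score lives under `report["info"]["score"]`, which is where recent Cuckoo versions place it; the exact location may vary across versions:

```python
import json

def threat_score(report_path):
    """Read the overall threat score from a Cuckoo JSON report.
    Assumes the score is stored under report["info"]["score"]."""
    with open(report_path) as f:
        report = json.load(f)
    return float(report["info"]["score"])

def is_malicious(score, threshold=5.0):
    """A file is deemed malicious once its score reaches 5.0."""
    return score >= threshold
```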

4.3.1.5 Feature Engineering

We applied a series of transformations to make our predictive features more suitable for our classifiers. It has been demonstrated that some machine learning algorithms do not perform well when the input numerical attributes do not have a uniform scale [Brownlee, 2016]. An example of this phenomenon is found in the gradient descent algorithm used in logistic regression, support vector machine, and neural networks. If the input features are on different scales, certain weights may be updated faster than others, negatively affecting the convergence of the algorithm [Raschka, 2014]. As a result, we decided to normalize our input feature space by computing its mean µ and standard deviation σ and scaling each feature according to Equation 4.3 [Raschka, 2014].

z = (x − µ) / σ    (4.3)

Moreover, we used a logarithmic transformation (log) to help stabilize the variances in our input data and handle possibly skewed distributions in our feature space [Sarkar, 2018]. Finally, the threat level of each malware and its normalized static analysis information are appended, to obtain the final training input that is fed to the supervised learning algorithms, as indicated in Table 4.3.

Hash (h)                     Static Feature Vector (f)                      Label (l)
                             x_0     x_1     ...   x_{n−1}   x_n
fad4f2b0...2ca8b839f         0.056   0.028   ...   0.042     0.061          10s
8236581b...e88db0e60         0.0233  0.1029  ...   0.0861    0.0376         40s
...                          ...     ...     ...   ...       ...            ...
6488913cf...148b5c14f        0.0933  0.0176  ...   0.0561    0.0972         300s

Table 4.3: Training Dataset for TURACO model
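The two transformations above can be sketched in a few lines. The feature matrix is a made-up example; `StandardScaler` from Sklearn performs the same standardization as the explicit formula used here:

```python
import numpy as np

# Hypothetical static feature matrix (rows = samples, columns = features).
X = np.array([[120.0, 3.0],
              [4500.0, 7.0],
              [80.0, 1.0]])

# The log transform tames skewed feature distributions...
X_log = np.log1p(X)

# ...and standardization rescales each column to z = (x - mu) / sigma
# (Equation 4.3).
mu, sigma = X_log.mean(axis=0), X_log.std(axis=0)
X_scaled = (X_log - mu) / sigma

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
```

After scaling, every column has zero mean and unit variance, so no single feature dominates the gradient updates of scale-sensitive learners.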

4.3.2 Dataset

We obtained a data stream of 1.2 million binaries from ReversingLabs, which included malicious executables targeting financial services institutions. These malware

were derived from forty different families, designed to infect a variety of MS Windows versions including XP, 7, 8, and 10 for both 32-bit and 64-bit configurations. Additionally, we obtained a timestamp tag for each of these files indicating the date the malware was captured. These tags showed that our samples were gathered over a period of twelve years (2006 - 2018), with most of the binaries being collected in 2014 and 2016.

Furthermore, our dataset was curated to guarantee the extraction of relevant static information from the malicious files. It is a well-known fact that most Windows malware are packed [Ugarte-Pedrero et al., 2015], and static analysis approaches fail at extracting meaningful features from packed malware. Our dataset provider removed all packing, obfuscation, and protection artifacts from the binary files to extract all internal objects with their metadata. As a result, the unpacked malware were available for further analysis using our disassemblers. Finally, for the experiments presented in this chapter, we selected malware families that included more than six thousand instances. As a result, our final dataset consisted of 28,336 files from five different families.
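The family-size filter can be sketched as a one-pass count-and-keep. The (hash, family) schema is illustrative, not the dissertation's actual data layout:

```python
from collections import Counter

def keep_large_families(samples, min_count=6000):
    """Keep only the samples whose family has more than `min_count`
    members, mirroring the curation step described above.
    `samples` is a list of (hash, family) pairs."""
    counts = Counter(family for _, family in samples)
    return [(h, f) for h, f in samples if counts[f] > min_count]
```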

4.4 Learning Methodology

Figure 4.2 illustrates the architecture of our TURACO model. Each sample in our training set was executed for each of the target times using Cuckoo, and a JSON report was generated at the end of each run indicating the threat score for the malware. However, not all files reached the desired threshold to be recognized as malicious within the predefined time periods. This significantly reduced the size of our dataset (from 28,336 to 20,020 files), given that our system can only make predictions for executables that displayed malicious activities during their execution.

Figure 4.3 shows the final distribution of target labels extracted for the files in our dataset. In addition, it highlights the fact that it is indeed possible to run malware for a very short amount of time to reveal their malicious intent. In fact, Cuckoo was able to detect 1,979 malicious samples in just 10 seconds. The number of files that required longer execution times, such as 300 seconds, only corresponded

Figure 4.2: This figure describes the process of training our TURACO model. Dynamic analysis is performed in seven phases according to our predefined target labels of 10s, 20s, 30s, 40s, 50s, 60s, and 300s, until a threat score of 5.0 or greater is achieved. The minimum run time Cuckoo needed to reach a threat level score of 5 corresponds to the target label of our model. The label, together with the static features extracted from the malicious sample, becomes the training data for our learning algorithms.

to about 3,977 samples in our dataset. This proves that it is not necessary to run an entire malware dataset for long periods of time (e.g., five minutes), because the malicious nature of many malware can be detected in only a few seconds. This is an important advancement in the field of malware detection, especially when we consider the narrow margin of time that security analysts have when confronted with malware such as ransomware, which can encrypt a user's data in a very short amount of time.

In addition, we constructed a cumulative histogram based on the aforementioned target labels. As indicated in Figure 4.4, our sandbox could classify 6,001 of the samples as malicious in 20 seconds or less, which corresponded to about 30% of our dataset. Moreover, a significant number of files were deemed malicious in 60 seconds or less (16,040). Specifically, 80.13% of our dataset was determined as malware after only executing in the sandbox for one minute. This is indicative of the capabilities of

Figure 4.3: This figure shows the final distribution of target labels extracted for the malware in our dataset: 10s (1,979 samples, 9.89%), 20s (4,022, 20.09%), 30s (2,973, 14.85%), 40s (1,614, 8.06%), 50s (1,944, 9.71%), 60s (3,508, 17.53%), and 300s (3,977, 19.87%). It highlights the fact that the malicious nature of many malware can be detected in a short amount of time.

our system to analyze a file and detect that it is malware in a rapid manner.
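The label-extraction process illustrated in Figure 4.2 can be sketched as a simple escalation loop; here `threat_score` is a hypothetical stand-in for a Cuckoo run of the given duration, not part of the actual Cuckoo API.

```python
# Sketch of the target-label extraction loop from Figure 4.2.
# threat_score(sample, seconds) is a hypothetical wrapper that runs the
# sample in Cuckoo for the given time and returns the reported threat score.
TARGET_TIMES = [10, 20, 30, 40, 50, 60, 300]
THRESHOLD = 5.0

def extract_label(threat_score, sample):
    """Return the shortest run time at which the sample scores >= 5.0,
    or None if it never crosses the threshold (the sample is discarded)."""
    for seconds in TARGET_TIMES:
        if threat_score(sample, seconds) >= THRESHOLD:
            return seconds
    return None
```

Samples for which `extract_label` returns `None` correspond to the files dropped from the dataset because they never reached the threshold within the predefined times.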

4.5 Experimental Infrastructure

The deployment of TURACO was accomplished using Sklearn, an open-source Python library that implements a wide range of machine learning algorithms for data analysis [Pedregosa et al., 2011]. Four classifiers were selected to test our approach, namely: decision tree (DT), support vector machine (SVM), artificial neural network (ANN), and random forest (RF). In addition, we used nested cross-validation to perform the training, validation, and testing of our algorithms [Albon, 2017]. This technique divides the dataset into five equally-sized subsets and runs five experiments in which three of the subsets are used for training, one for validation, and the final subset for testing. Moreover, we perform stratification in our experiments. Stratification enables us to randomly select malware from the dataset so that the test set

Figure 4.4: This figure shows a cumulative histogram of the run time labels extracted from Cuckoo: 1,979 samples at 10 seconds, 6,001 at 20, 8,974 at 30, 10,588 at 40, 12,532 at 50, 16,040 at 60, and 20,017 at 300. Our sandbox could classify 6,001 binaries, or 30% of our malware corpus, as malicious in 20 seconds or less. This is indicative of the capabilities of our system to analyze a file and detect that it is malware in a rapid manner.

contains the same distribution of instances among the target classes as the overall data set [Dwinnell, 2007].

4.6 Experimental Results In this section, we discuss the major results from the implementation of our TURACO model using Sklearn. First, we discuss the accuracy scores obtained by each of our classification models. Then, we evaluate the effectiveness and completeness of TURACO by analyzing its recall and precision results, as well as its confusion matrix.

4.6.1 Accuracy Results

As seen in Figure 4.5, the random forest classifier yields the highest accuracy (93.86%), followed by artificial neural network (91.51%), decision tree (90.41%), and support vector machine (89.94%). These results are clearly indicative of the ability of machine learning algorithms to estimate the time that malware should be executed in a sandbox. Additionally, we confirm that we can indeed use fast static features extracted from suspected samples to estimate the malware analysis run time, based on the seven predefined target classes of 10, 20, 30, 40, 50, 60, and 300 seconds.

Figure 4.5: This figure depicts the accuracy, precision, and recall results for our machine learning classifiers (SVM, DT, ANN, and RF). The implementation of our TURACO model using the random forest classifier yields the highest accuracy (93.86%). In addition, the recall and precision results are fairly consistent across our four machine learning models, which is indicative of the completeness of our classifiers.

If our model predicts a conservative stopping point (e.g., 300 seconds), it is simply defaulting to a fixed amount of time, which is what traditional sandbox environments do. However, TURACO often predicts to run the malware for shorter amounts of time and is almost 94% accurate at correctly predicting the time needed to execute the malware.

4.6.2 Recall and Precision Results

In addition to the accuracy scores, we examined the precision and recall produced by our machine learning classifiers. Given that the approach proposed in this chapter considers a wide range of run times, ranging from short executions (e.g., 10 seconds) to long analysis run times (e.g., 300 seconds), our models need to be able to identify these critical classes to be considered effective. Figure 4.5 shows that the recall and precision results are fairly consistent across our four machine learning models. To further expand on these results, we plot the confusion matrix for our best-performing model (i.e., random forest) and compute the recall and precision for each class label. Table 4.4 depicts the confusion matrix for our TURACO model based on the random forest algorithm. The large values in the diagonal of the matrix are indicative of the high predictive power of TURACO to determine malware analysis run times. Moreover, most of the misclassifications of our model corresponded to pairs of non-critical adjacent times, such as 30 and 40 seconds or 50 and 60 seconds.

                          Predicted
  Actual    10s   20s   30s   40s   50s   60s  300s
  10s       370    14     9     0     2     0     1
  20s        19   759     8     5    10     2     1
  30s         8    17   550    13     0     5     2
  40s         0     1    11   290    13     4     4
  50s         5     3     2     9   345    24     1
  60s         0     1     6     4    21   668     2
  300s        2     2     0     3     5     7   776

Table 4.4: Confusion Matrix for TURACO Model

Additionally, the recall results in Table 4.5 highlight the completeness of our model, as it can return most of our critical classes (i.e., 10, 20, and 300 seconds) and yield a low number of false negatives for the samples in our dataset. This is a crucial factor of the analysis system, as long periods of time, such as five minutes, are not spent analyzing samples that only require a few seconds to reveal themselves (e.g., 10 or 20 seconds). Similarly, the label of 300 seconds is correctly assigned to a significant number of files that need longer times and, therefore, more computational resources from the analysis system to reveal their malicious intent.

  Run Time Label   10s    20s    30s    40s    50s    60s    300s
  Recall (%)       93.43  94.40  92.44  89.78  88.69  95.16  97.61
  Precision (%)    91.58  95.23  93.86  89.51  87.12  94.08  98.60

Table 4.5: Individual recall and precision results for TURACO model.
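The per-class scores in Table 4.5 can be recomputed directly from the confusion matrix in Table 4.4: recall divides each diagonal entry by its row (actual) total, and precision divides it by its column (predicted) total.

```python
# Recomputing Table 4.5 from the confusion matrix in Table 4.4.
import numpy as np

cm = np.array([
    [370,  14,   9,   0,   2,   0,   1],   # actual 10s
    [ 19, 759,   8,   5,  10,   2,   1],   # actual 20s
    [  8,  17, 550,  13,   0,   5,   2],   # actual 30s
    [  0,   1,  11, 290,  13,   4,   4],   # actual 40s
    [  5,   3,   2,   9, 345,  24,   1],   # actual 50s
    [  0,   1,   6,   4,  21, 668,   2],   # actual 60s
    [  2,   2,   0,   3,   5,   7, 776],   # actual 300s
])

recall = np.diag(cm) / cm.sum(axis=1) * 100      # diagonal over row totals
precision = np.diag(cm) / cm.sum(axis=0) * 100   # diagonal over column totals
# e.g., the 10s class: recall 370/396 = 93.43%, precision 370/404 = 91.58%
```

Running this reproduces the figures in Table 4.5 to two decimal places.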

4.7 Discussion

In this chapter, we aimed to answer the following two questions: Can we run a file for a short amount of time and still detect that it is malware? Can we use features extracted from fast static analysis to predict the time that malware should be run in a sandbox environment? We address the first question by analyzing the results obtained by running our initial malware dataset using Cuckoo. Our findings show that we can indeed execute malware in a sandbox for a very short amount of time and still collect enough information to deem it malicious. In fact, our results demonstrate that nearly 80% of the samples in our dataset can be identified in less than 60 seconds and only 20% need to be executed for long periods of time. This is a crucial finding, as it enables us to reduce the time spent on dynamic analysis and reallocate the time saved to analyzing samples that actually improve the intelligence of our malware analysis system. To answer the second question, we examine the results obtained from the implementation of TURACO. Our findings indicate that a random forest algorithm trained

on fast static characterization of malware can lead to accurate run time predictions. TURACO is able to distinguish malware that needs long execution times from malicious executables whose harmful nature can be detected in a time as short as 10 seconds. This allows us to increase the throughput of our analysis system as we decrease the time that is required for expensive and slow dynamic analysis of malware. This claim is further supported in the next section.

4.7.1 Efficiency of TURACO

In this section, we analyze the time that was saved by using our TURACO model compared to another approach used to perform dynamic analysis. For the purpose of this study, we used the test dataset for which our model predicted the malware analysis time. Additionally, we defined the following two methods to perform dynamic analysis on the test dataset.

1. Using maximum amount of time: the entire malware test dataset was analyzed for the predefined time of 300 seconds.

2. Using predictions made by TURACO: our prediction model was used to determine the times that each sample of the test dataset should be executed in our analysis system. If the threshold of 5.0 was not achieved after running the sample according to the predicted labels, the malware was resubmitted for analysis for the next predefined time until the desired score was obtained.
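The second method can be sketched as an escalation loop; `run_sandbox` is a hypothetical wrapper around a Cuckoo submission that runs the sample for the given duration and returns the resulting threat score.

```python
# Sketch of method 2: start at the predicted time and, if the threshold is
# not reached, resubmit at the next predefined time. Each resubmission is a
# fresh run, so the total time spent accumulates across attempts.
TIMES = [10, 20, 30, 40, 50, 60, 300]

def analyze_with_prediction(run_sandbox, sample, predicted_time, threshold=5.0):
    """Return (time_at_detection_or_None, total_seconds_spent)."""
    start = TIMES.index(predicted_time)
    spent = 0
    for seconds in TIMES[start:]:
        spent += seconds
        if run_sandbox(sample, seconds) >= threshold:
            return seconds, spent
    return None, spent
```

When the prediction is correct, the loop stops on its first iteration, which is where the time savings reported below come from; a too-short prediction only costs the extra resubmissions up the ladder.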

Figure 4.6: This figure depicts the time saved by using our TURACO model compared to a traditional dynamic analysis approach where malware is executed for 300 seconds (traditional sandbox: 333.67 hours; TURACO: 100.66 hours). TURACO requires the least amount of time (100.66 hours) and reduces the average execution time that is required to uncover the malicious intent of our test dataset by 232.01 hours, or 69.53%.

As indicated in Figure 4.6, TURACO requires the least amount of time (100.66 hours), compared to the maximum-time approach for dynamic analysis of malware (333.67 hours). Specifically, our prediction model reduces the average execution time that is required to uncover the malicious intent of our test dataset by 232.01 hours, or 69.53%, when compared to a method where all the samples are submitted for a total of 300 seconds (5 minutes), which corresponds to the approach that traditional sandbox environments use to examine malware behavior [Vadrevu and Perdisci, 2016].

4.8 Related Work

Existing approaches using machine learning and dynamic data to detect and classify malware use the entire execution cycle of the suspected samples, regardless of the information extracted from the files in question. This creates an impractical delay (in comparison with static analysis) if dynamic analysis is used as part of an endpoint security system [Rhode et al., 2018]. As a result, a few works have been proposed to improve the efficiency of dynamic malware analysis. In the following sections, we examine and compare the results found in this chapter to two particular approaches proposed in the literature. Bayer et al. [Bayer et al., 2010] attempted to improve the efficiency of dynamic malware analysis by developing a technique that reduced the overall malware analysis time. Their solution was based on the insight that a large number of new malware samples are merely mutations of only a few malicious programs. To avoid performing a full dynamic analysis of the same polymorphic malware, the authors analyzed binaries by executing them for a predefined time of 45 seconds. The next step involved checking whether or not the behavior seen during this time frame was similar to an already analyzed file. If that was the case, the authors interrupted the analysis of the file and, instead, leveraged the information from the existing behavioral profile to characterize the suspected sample. This approach was tested on a set of 10,922 files, and the authors managed to avoid a full dynamic analysis of over 2,747 files (nearly 26% of their dataset). Nevertheless, this technique is less effective than the one we propose in this

chapter. First, malware authors can craft two binaries that appear to have identical behavioral profiles at the 45-second check point, and display completely different malicious activities afterwards. In addition, their method can also be evaded if the executables need additional time to reveal their true intent [Bayer et al., 2010]. Similarly, Vadrevu and Perdisci [Vadrevu and Perdisci, 2016] proposed a probabilistic multi-hypothesis testing framework to speed up the dynamic analysis of malware. Their approach attempted to determine whether a malware sample that was submitted for analysis in a sandbox environment had likely been examined before, so that its execution could be preemptively stopped [Vadrevu and Perdisci, 2016]. First, the authors clustered malicious samples into groups or families that displayed similar network activities (e.g., equivalent DNS queries or HTTP requests) [Vadrevu and Perdisci, 2016]. Then, a behavioral profile was constructed summarizing the network events generated by all the samples in the family. The next step involved extracting dynamic information from a set of malware files. The authors computed the probability that the sample belonged to an already seen family, given all the events generated by the file until a specific execution time. If the probability was greater than a predefined threshold, the execution was stopped and the sample was assigned to the family with equivalent network behavior. Otherwise, the files were allowed to continue their execution and, for every new network event, the probability of being part of an existing family was recomputed until the maximum amount of time (360 seconds) was reached [Vadrevu and Perdisci, 2016]. This technique was tested on two datasets of 15,431 and 62,063 samples, and their results showed that the average stopping times for both sets were 69.9 seconds and 50.4 seconds, respectively [Vadrevu and Perdisci, 2016].
This method, however, fails if a malware author prepends network events of a known malware family to a new malicious sample. Their platform may then decide that a new malware belongs to an already analyzed family and prematurely stop its dynamic execution. In addition, the approach proposed by Vadrevu and Perdisci relies exclusively on network information [Vadrevu and Perdisci, 2016]. Therefore, it cannot successfully determine the time required to

analyze a malicious sample if the file does not exhibit any network activities. TURACO, on the other hand, does not depend on any specific type of dynamic information, and considers all the possible malicious activities that a malware can exhibit during its execution in a sandbox environment. Furthermore, it leverages fast static analysis information to predict the time at which to stop the execution of the suspected files. The approaches proposed by Vadrevu and Perdisci, as well as Bayer et al., attempted to reduce the malware execution time by employing dynamic features only, which is inherently slower than our approach, as it still requires the execution of the suspicious samples in an isolated environment. Our model bases its prediction on static features of unpacked malware, which are extracted in seconds, without the need to perform expensive dynamic analysis. Finally, TURACO is able to detect a significant number of malware in just under ten seconds, outperforming the average stopping times of the previously discussed approaches.

4.9 Conclusions

In this chapter, we investigated the application of machine learning techniques to the problem of predicting malware analysis run time. We introduced a model called TURACO (Training Using Runtime Analysis from Cuckoo Outputs) that significantly extends the capabilities of Cuckoo, an open-source sandbox. TURACO leverages features extracted from fast static analysis of malware, as well as reports generated by Cuckoo, to predict malware analysis run time. Our results demonstrated that TURACO could successfully identify the amount of time that malware should be executed in a sandbox in order to reveal itself. This execution did not require a significant amount of time, as seen from the fraction of our dataset that was deemed malware in under ten seconds. Additionally, we showed that it was possible to use static features to construct highly accurate machine learning models that estimate malware analysis run time. Finally, we evaluated the efficiency of our proposed model. TURACO drastically reduced the average execution time required to uncover the malicious intent of our test

malware dataset. In fact, a reduction in execution time of over 69.53% (or 232.01 hours) was achieved by our model, when compared to a method where all the samples were submitted for a total of 300 seconds, which corresponds to the approach that traditional sandboxes use to examine malware behavior.

Chapter 5

SEEMA: SELECTING THE MOST EFFICIENT AND EFFECTIVE MALWARE ATTRIBUTES

5.1 Introduction

Dynamic analysis is an effective characterization of malware that allows security analysts to extract meaningful runtime information from the execution of binaries in a sandbox environment. As shown in Chapter 2, Section 2.7, features extracted from dynamic analysis reports, such as system call sequences, can lead to models that yield high accuracy rates for the problem of malware classification. Moreover, dynamic analysis is less susceptible to obfuscation techniques designed to thwart static analysis and, as a result, can successfully examine encrypted and packed malware [Hu et al., 2013]. Yet, on other malware, applying dynamic analysis has very little impact and may lead to an inaccurate prediction in some cases. This can be explained by the fact that dynamic analysis has limited coverage of the behavior of the malware, as it only captures the API calls, registry changes, and network activities that correspond to the execution path taken during a particular run of the sample [Hu and Shin, 2013]. In fact, various execution paths may be taken depending on the program's internal code and the configuration of the analysis environment. For example, some malware stall the execution of any malicious activity if they do not detect any user interaction with the analysis environment. Other malware perform certain malicious operations only under specific conditions (e.g., a particular date or time is reached) [Hu et al., 2013]. In contrast, static analysis can characterize all possible code paths of a malicious sample, including those that are not normally executed [Nath and Mehtre, 2014].

Therefore, static analysis can lead, in some instances, to better characterizations of malware and more accurate and robust classification models. Even so, if dynamic analysis were an inexpensive characterization of malware, we would apply it to all binaries, regardless of whether or not it leads to an accurate prediction. However, performing dynamic analysis is computationally expensive because it involves executing the suspected samples for a significant amount of time (e.g., five minutes or more). Because it is costly and does not always improve the classification of malware, we want to apply dynamic analysis selectively. In this chapter, we first introduce a set of three models of increasing cost and predictive power for malware classification. These models are constructed based on information extracted from both static and dynamic analysis of malware. We then develop a surrogate model called SEEMA (Selecting the most Efficient and Effective Malware Attributes) that correctly predicts when less expensive characterizations, such as static analysis, will suffice to accurately classify malware. This allows us to filter the binaries that will not benefit from time-consuming executions in an emulated environment. Given that in practice a large portion of malware does not require expensive-to-compute dynamic features, and static analysis is orders of magnitude faster than dynamic analysis, we significantly decrease the time spent extracting and engineering features for malware classification.

5.2 Problem Statement

Assigning malicious samples to their corresponding families is one of the most challenging problems in cyber security. Malware family classification describes the process of determining whether a binary is part of a previously seen family or if it corresponds to a new, unseen sample that requires further examination [AlAhmadi and Martinovic, 2018]. A malware family provides security analysts with crucial information about the threat level that a file poses to the system, as well as the remediation process that needs to be followed to mitigate a potential threat [Kaspersky, 2018]. For example,

the incident response for ransomware differs from a botnet's mitigation process, as the former represents a larger threat to a system and must be prioritized accordingly [AlAhmadi and Martinovic, 2018]. Additionally, malicious programs derived from the same code base bear significant similarities, and this consistent behavior can thus be leveraged to identify malware families [Hu et al., 2013]. In recent years, several approaches have been proposed to solve the problem of predicting malware families. As described in Chapter 2, techniques based on static and dynamic features extracted from malicious code can lead to highly accurate classification models. However, the selection of relevant features to feed as input to a classifier remains a complex task. In addition, training predictive models based on dynamic-based features is not efficient compared to static analysis. To support this claim, we monitored the time required to extract two different static characterizations of malware for datasets of varying sizes, namely: 1) byte-level analysis, and 2) graph-based features. The results were compared against the standard time required to perform dynamic analysis (i.e., three hundred seconds), and our findings are summarized in Figure 5.1. The time required to generate byte features generally depends on the size of the file. Nevertheless, they are extremely efficient and can be constructed using optimized C code. As seen in Figure 5.1, byte features can be generated in less than a second, even for large malware datasets. Graph-based features, on the other hand, take longer to generate because our tool needs to extract assembly code from the file and transform it into graphs at three levels of granularity (i.e., functions, blocks, and operations) from which we construct feature vectors.
This expensive static characterization of malware, however, still remains more efficient than performing dynamic analysis, as it only demands 77 seconds for large malware datasets, in contrast to the 300 seconds usually required to execute a file in a sandbox. Furthermore, Figure 5.1 shows that dynamic analysis is nearly 300 times slower than byte-level analysis of malware. This highlights the importance of leveraging cheap-to-compute features for malware family classification and identifying when dynamic

analysis leads to more accurate prediction models.

Figure 5.1: This figure shows the cost of producing three characterizations of malware across malware file-size bins (byte features: between 0.03 and 0.18 seconds; graph-based features: between 7.42 and 77.94 seconds; dynamic analysis: a fixed 300 seconds). Analysis of malicious software at the byte level usually takes less than a second. Graph-based features, on the other hand, take longer to produce because our tool needs to extract assembly code from the file and transform it into graph-like data structures, from which we construct feature vectors. Finally, dynamic analysis often demands the execution of malware for 5 minutes or more, making it less efficient for the characterization of large malware datasets.

5.3 Approach

We want to construct a machine-learning-based model that predicts when using dynamic-based features benefits the classification of malware over simpler characterizations, such as static analysis. To accomplish this, we propose SEEMA, a surrogate model that selects the most efficient and effective attributes of malware. SEEMA can choose from three malware classification models constructed using features of increasing cost and predictive power, namely: 1) byte-level analysis, 2) graph-based features, and 3) dynamic-based features. As seen in Figure 5.2, SEEMA acts as a filter that

reduces the feature engineering process and saves execution time. Our model only extracts the most costly features for a small fraction of a malware corpus when it deems that the additional benefit of using dynamic analysis is worth the effort.

Figure 5.2: Our SEEMA model acts as a filter that decreases the time dedicated to feature engineering. SEEMA only extracts the most expensive features for a small subset of a malware corpus when it deems that the additional benefit of using dynamic analysis is worth the effort.

5.3.1 Features

First, we explore the selection of relevant features from both static and dynamic analysis. Specifically, there are three main categories of features that we investigate:

1. Traditional computationally-inexpensive byte-level representations of malicious code.

2. Computationally-expensive statically analyzed disassembled instructions.

3. Features extracted from the dynamic execution of malware in a sandbox envi- ronment.

5.3.1.1 Byte Features

We presented a description of byte-level features and bytes-entropy histograms in Chapter 2 (see Section 2.5.2.1). For the feature vectors used in this chapter, we only consider 16 entropy bins En instead of 256 (see Figure 2.2), which decreases the length of the bytes-entropy histogram matrix by a factor of sixteen.
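As a rough illustration, a 16-entropy-bin bytes-entropy histogram might be computed as sketched below. The window size, stride, and high-nibble byte bucketing are illustrative assumptions for this sketch, not the exact construction from Chapter 2.

```python
# Sketch of a 16x16 bytes-entropy histogram: slide a window over the file,
# compute the Shannon entropy of each window, and accumulate byte counts
# (bucketed by high nibble) into the row for that window's entropy bin.
import numpy as np

def entropy_histogram(data: bytes, window=1024, stride=256, bins=16):
    hist = np.zeros((bins, bins))
    for start in range(0, max(len(data) - window, 0) + 1, stride):
        chunk = np.frombuffer(data[start:start + window], dtype=np.uint8)
        probs = np.bincount(chunk, minlength=256) / len(chunk)
        p = probs[probs > 0]
        entropy = -(p * np.log2(p)).sum()                  # 0..8 bits per byte
        e_bin = min(int(entropy / 8 * bins), bins - 1)     # entropy bucket
        hist[e_bin] += np.bincount(chunk >> 4, minlength=bins)  # 16 byte buckets
    return hist.ravel()                                    # fixed-length vector
```

Flattening the 16x16 matrix yields a 256-element feature vector, one sixteenth the size of the corresponding 256-entropy-bin variant.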

5.3.1.2 Graph-based Features

We described graph-based features in Chapter 2 (see Section 2.5.2.3). We use Radare2 to translate a malware executable into assembly code and construct graph data structures at three levels of granularity: instruction-flow graph (instructions), control-flow graph (blocks), and call graph (functions). We then use graph spectral analysis to build fixed-size representations of the graphs. In particular, we construct adjacency lists, which aggregate information about the nodes and edges of the graph, and an instruction histogram for each node in the graph. For our experiments, we choose to leverage the block-level information of the malware, since it is more manageable in size than the instruction level and usually more expressive than the function level.

5.3.1.3 Dynamic Features

As described in Chapter 2, dynamic analysis of malware is performed using Cuckoo. We monitor the activities of malicious executables in a sandbox environment for a total of five minutes and extract their behaviors in JSON files. These reports contain a section called apistats, which summarizes the number of times a particular set of API calls was executed during the analysis of the file (see Figure 5.3). Finally, we leverage this information to produce a histogram of API calls that is used to train our malware family classifiers.

Figure 5.3: This figure depicts a snippet from a report generated by Cuckoo, which summarizes the number of times a set of API calls were executed during the analysis of the file.

5.3.1.4 Feature Engineering

As described in Chapter 4, Subsection 4.3.1.5, some machine learning algorithms do not perform well when the input numerical attributes do not have a uniform scale [Brownlee, 2016]. Therefore, we decided to normalize our input feature space and use a logarithmic transformation (log) to help stabilize the variances in our input data and handle possibly skewed distributions in our feature space [Sarkar, 2018].
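A minimal sketch of this normalization step is shown below; `log1p` is used here so that zero counts remain zero, though the exact transform in the pipeline may differ.

```python
# Sketch of the log transformation applied to the raw feature counts to
# compress skewed distributions before training.
import numpy as np

def normalize_features(X):
    # log(1 + x) maps 0 -> 0 and damps large counts
    return np.log1p(np.asarray(X, dtype=float))
```

For example, a raw count of 9 becomes log(10) ≈ 2.30, while a count of 0 stays at 0, so sparse histogram entries are preserved.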

5.3.2 Dataset

As mentioned in Chapter 4, Subsection 4.3.2, we obtained a data stream of 1.2 million binaries from ReversingLabs; this data stream included malicious executables

targeting financial services institutions. For the experiments presented in this chapter, we selected the malware families that included more than two thousand instances. As a result, our final dataset consisted of 30,315 files from nine different families. Table 5.1 provides a breakdown of our malware dataset, including the count and type for each family.

  Name       Count  Type
  Andromeda  2220   virus
  Shifu      2255   spyware
  Inject     3405   trojan
  Cutwail    3435   downloader
  Zbot       3470   downloader
  Ramnit     3565   trojan
  Injector   3850   trojan
  Banker     4045   spyware
  Banload    4070   downloader

Table 5.1: Composition of dataset for malware family classifiers

5.4 Learning Methodology

5.4.1 Malware Family Classifiers

We generate our three malware classification models using a set of features of increasing cost and predictive power. In particular, our first classifier (B) only includes features extracted from byte-level representations of malware. The second classifier that we propose (B-G) leverages the previous byte-level information as well as graph-based features to predict malware families. Finally, our most expensive classifier (B-G-D) takes advantage of a hybrid feature set that includes both static information (i.e., byte and graph-based features) and dynamic information. To build these malware classification models, we use a method known as nested cross-validation [Albon, 2017]. This technique uses a series of train/validation/test set splits, where the validation set is used to tune the parameters of the algorithm and select the best configuration, and the testing fold provides an unbiased evaluation of the model. Consequently, we first split our dataset into five stratified folds. Stratification allows us to generate training, validation, and testing folds that contain a distribution of class values consistent with our dataset [Dwinnell, 2007]. We use three folds for training, one fold for validation, and the remaining fold for testing. This breakdown enables us to evaluate the ability of our models to generalize what was learned on a training set to a testing set. In addition, it guarantees that all the files in our dataset are used for testing exactly once. This is a vital step, because we need to determine whether or not our three classifiers yield an accurate prediction for the classification of malware. This information is leveraged to create the input features for our SEEMA model. Given a malware sample m, we aggregate information about the predictions made by each of our classification models (B, B-G, B-G-D). A value of "1" indicates that a classifier c generated a correct prediction for a malware file m, and a value of "0" designates a misclassification. This process is summarized in Table 5.2.

                                   Classifier c
  Malware m                    B    B-G   B-G-D   Label l
  abb16d8c . . . d2ef8f37fe    0     1      1     B-G
  8236581b . . . e88db0e60     1     1      1     B
  ⋮                            ⋮     ⋮      ⋮     ⋮
  6488913cf . . . 148b5c14f    0     0      1     B-G-D

Table 5.2: Construction of input dataset for SEEMA model.

Furthermore, if more than one classifier produces an accurate classification for a malware m, we select the most inexpensive model as the label for the sample. For instance, the second row of Table 5.2 shows a malware binary that was correctly classified by all of our models. A target label of "B" is then assigned to this sample, as it corresponds to the simplest characterization required to produce a correct classification.
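The labeling rule from Table 5.2 can be sketched as follows, with the classifiers ordered by increasing cost.

```python
# Sketch of the SEEMA labeling rule: among the classifiers that got the
# sample right, pick the cheapest one.
CLASSIFIERS = ["B", "B-G", "B-G-D"]   # ordered by increasing cost

def seema_label(correct):
    """correct: dict mapping classifier name -> 1 (correct) or 0 (wrong).
    Returns the cheapest correct classifier, or None if all misclassify."""
    for name in CLASSIFIERS:
        if correct[name] == 1:
            return name
    return None
```

Scanning the list in cost order guarantees that a sample correctly handled by byte features alone is never labeled with a more expensive characterization.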

5.4.2 SEEMA Model

As mentioned earlier, the results from our three malware classifiers are used to generate the input features for our SEEMA model. In addition, this feature space includes byte-level information extracted from the files in our dataset. We chose bytes

because they correspond to the fastest static characterization and can be generated in an average of 0.06 seconds. This information is combined with the target labels describing the most inexpensive model required to classify the sample to form the final training instances, as indicated in Table 5.3. Finally, SEEMA is deployed using an approach similar to the one previously described to build our malware classifiers (i.e., five stratified folds, where three are used for training, one is used as a validation set, and the last one gauges the effectiveness of the model).

                                        Byte Features
  Malware (m)                  x0      x1      ...  xn−1    xn      Label (l)
  abb16d8c . . . d2ef8f37fe    0.0421  0.0609  ...  0.0715  0.0238  B-G-D
  8236581b . . . e88db0e60     0.0233  0.1029  ...  0.0861  0.0376  B
  ⋮                            ⋮       ⋮       ⋱    ⋮       ⋮       ⋮
  6488913cf . . . 148b5c14f    0.0933  0.0176  ...  0.0561  0.0972  B-G

Table 5.3: Input dataset for SEEMA model

5.5 Experimental Infrastructure We implemented our classifiers using Sklearn's logistic regression module [Pedregosa et al., 2011]. Our choice of logistic regression was driven by the highly interpretable nature of its output, as it can highlight the probability that a malicious sample belongs to each of the target classes considered in our experiments. Moreover, given that our classifiers must make predictions for nine different malware families, we implemented the logistic regression classifier in its multiclass form, which makes use of the one-vs-rest (OvR) scheme. This particular implementation involves creating a binary classifier h^(i)(x) for every class i in our dataset. The OvR strategy then transforms the values of a class i into positive samples (i.e., they are assigned a value of "1"), and the rest of the classes are turned into negative samples (i.e., they are given a value of "0"). Given an instance x and parameter θ, each binary classifier estimates the probability that the target label y of the instance is equal to class i, i.e., P(y = i | x; θ). This results in a vector h_θ(x) = [h_θ^(1)(x), h_θ^(2)(x), ..., h_θ^(n)(x)], where n corresponds to the number of classes. Finally, to make a prediction for a new sample x, the algorithm selects the class i that maximizes h_θ^(i)(x).
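The OvR scheme described above can be sketched from scratch; the following is a minimal illustration on synthetic data (NumPy only; the data, hyperparameters, and helper names are our own, not the dissertation's implementation, which uses Sklearn's logistic regression module):

```python
# Minimal one-vs-rest logistic regression: one binary classifier h^(i)(x)
# per class, prediction by argmax over the per-class scores.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.1, iters=1000):
    """Plain gradient-descent logistic regression for one class-vs-rest split."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def fit_ovr(X, y, n_classes):
    # class i becomes the positive class ("1"); all other classes become "0"
    return [fit_binary(X, (y == i).astype(float)) for i in range(n_classes)]

def predict(models, X):
    # h_theta(x) = [h^(1)(x), ..., h^(n)(x)]; pick the class with the max score
    return np.column_stack([sigmoid(X @ w) for w in models]).argmax(axis=1)

# three well-separated synthetic "families"
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in centers])
X = np.hstack([X, np.ones((len(X), 1))])   # bias column
y = np.repeat(np.arange(3), 40)

models = fit_ovr(X, y, 3)
acc = (predict(models, X) == y).mean()
assert acc > 0.9
```

The argmax step is the same decision rule as in the text: the predicted family is the class whose binary classifier assigns the highest probability.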

5.6 Experimental Results In this section, we evaluate the effectiveness of static analysis as compared to dynamic analysis for the problem of malware classification by comparing the performance of our three classifiers in terms of accuracy, precision, and recall.

5.6.1 Malware Classifier Results 5.6.1.1 Accuracy Results The accuracy rates obtained by each of our three malware classifiers are presented in Figure 5.4. In addition, we show the accuracy results obtained for each class in Figure 5.5, in order to evaluate the individual performance of our models across malware families. The malware classification model B-G-D, which was constructed using hybrid features (i.e., dynamic and static analysis information), produced the highest accuracy out of all three malware family classifiers (93.02%). However, the B model, which was trained on simple byte-level information, still produced competitive classification results. In fact, it achieved an average accuracy of 80.37%, which was partially lowered by certain malware families. As seen in Figure 5.5, our B model was able to accurately classify malware as cutwail, zbot, ramnit, shifu, and andromeda, but did not perform well on the injector, inject, banker, and banload families. This phenomenon can be explained by existing connections among these malware families. Banload corresponds to a family of trojans that downloads other malware onto a user's machine, typically members of the banker family [Microsoft, 2018]. Similarly, both inject and injector are part of a family of trojans designed to insert malicious code into processes running on a computer in order to modify registry information (as in the case of the inject class) or download other malware and monitor


Figure 5.4: This figure shows the accuracy, precision, and recall results for our malware classifiers and SEEMA model.

a user's actions (as done by the injector family). These similarities might explain the inability of our model to make accurate predictions for these four classes. On the other hand, the accuracy rate of 80.37% achieved by our B model highlights how a significant portion of our malware dataset can be correctly assigned into their corresponding families using only cheap-to-compute features. This result is further supported by Figure 5.6, where nearly 82.16% of the samples in our dataset were correctly classified using simple byte features. This value is slightly larger than the overall accuracy of the B model because it also includes samples for which no model yielded a correct classification. Moreover, adding more computationally expensive static analysis (B-G) and dynamic analysis (B-G-D) only improved the accuracy scores for 9.67% and 8.17% of the binaries in our dataset, respectively.

Figure 5.5: Individual accuracy results for malware classification models.

5.6.1.2 Precision and Recall Results We also computed precision and recall for each malware family to further evaluate the performance of our classifiers. As indicated in Figure 5.7 and Figure 5.8, the precision and recall results generally improve as we use more computationally expensive features to predict malware families. However, in some cases, our most expensive model (B-G-D) does not necessarily yield the best results. For instance, the precision results indicate that given all the

[Pie chart: B: 24,907 samples (82.16%); B-G: 2,477 (9.67%); B-G-D: 2,931 (8.17%)]

Figure 5.6: This figure shows the distribution of features required to classify the samples in our dataset. 82.16% of the files only need simple byte features. Furthermore, adding more computationally expensive static analysis (B-G) and dynamic analysis (B-G-D) only improves the accuracy scores for 9.67% and 8.17% of the binaries, respectively.

samples that were labeled as being part of the shifu, zbot, and andromeda families, our B-G-D model was only correct 90.27%, 97.71%, and 96.63% of the time, respectively. In contrast, our B-G model managed to achieve better precision scores of 91.46%, 97.74%, and 97.01% for the same families, respectively. The most important contribution of our hybrid model corresponds to the accurate identification of the inject, injector, banload, and banker families. The results for these families gradually improved as we used more expensive features to train our machine learning models. However, it was our B-G-D model that showed the greatest improvement in precision and recall, largely outperforming the results attained by our static-based classifiers.

5.6.2 SEEMA Model Results As explained in Section 5.4.2, we use the output from our classifiers to create the input dataset for our SEEMA model. Nevertheless, as described in the preceding paragraphs, a significant segment of this malware corpus (82.16%) was labeled as needing only the simplest static features to be correctly classified. Using all the available observations from this imbalanced dataset would significantly affect the quality and


Figure 5.7: Individual precision results for malware classification models.

relevance of our model. For instance, SEEMA could potentially predict all samples as requiring cheap-to-compute static features and still achieve an accuracy rate of more than 80%. This model, however, would not be good at recognizing malware that can only be classified with the use of more expensive characterizations. As a result, it is imperative that this class imbalance be corrected before the data are fed to our models. To solve this problem, we decided to undersample the observations that are part of the B class. Undersampling is one of the most common techniques used for handling class imbalance [Phatak, 2018]. In this technique, the majority class is down-sampled to match the number of observations of the minority class. Hence, we select random binaries with label B until we reach a total of 2477 (i.e., the number of samples whose label corresponds to B-G). The final composition of the input dataset is presented in Figure 5.9.

As previously shown in Figure 5.4, our SEEMA model was able to achieve an accuracy that is superior to the other three malware classification models (95.11%, in contrast to 80.37%, 89.97%, and 93.02% for the B, B-G, and B-G-D models, respectively). In terms of precision and recall, our surrogate model also managed to yield very high scores.

Figure 5.8: Individual recall results for malware classification models.


Figure 5.9: This figure illustrates the composition of the input dataset for our SEEMA model: B with 2,477 samples (31.41%), B-G with 2,477 (31.41%), and B-G-D with 2,931 (37.17%).
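The undersampling step described above can be sketched as follows; the label counts mirror Figures 5.6 and 5.9, and the helper name is ours, not the dissertation's:

```python
# Illustrative random undersampling: only the majority class ("B") is
# down-sampled to the size of the smallest class; the other labels are kept
# untouched, matching the composition in Figure 5.9.
import random
from collections import Counter

def undersample_majority(labels, seed=42):
    """Return indices to keep: everything except the majority class,
    which is randomly reduced to the minority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    target = min(counts.values())
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority]
    keep = set(rng.sample(majority_idx, target))
    return [i for i, lab in enumerate(labels) if lab != majority or i in keep]

labels = ["B"] * 24907 + ["B-G"] * 2477 + ["B-G-D"] * 2931
kept = undersample_majority(labels)
assert Counter(labels[i] for i in kept) == Counter(
    {"B": 2477, "B-G": 2477, "B-G-D": 2931})
```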

To further expand on these results, we show the confusion matrix for our SEEMA model as well as precision and recall results per class label. The large values on the diagonal of the matrix in Table 5.4 are indicative of the high predictive power of our SEEMA model.

                       Predicted
                  B     B-G   B-G-D
Actual  B        561     23       2
        B-G       10    467      19
        B-G-D      6     17     472

Table 5.4: Confusion matrix for SEEMA model.

Furthermore, the recall results presented in Table 5.5 highlight the completeness of this classifier. SEEMA can return most of the relevant results and yield a low number of false negatives for the samples in our dataset. For instance, out of all the malware binaries that should have been predicted as needing the B model to be correctly classified, SEEMA was able to capture 95.73% of them. In addition, our classifier was able to identify 95.35% of the instances that should only have been categorized with the most expensive model (B-G-D).

Target Label      B      B-G    B-G-D
Recall          95.73   94.15   95.35
Precision       97.22   92.11   95.74

Table 5.5: Precision and recall results for SEEMA model.
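The per-class figures in Table 5.5 follow directly from the confusion matrix in Table 5.4; a quick recomputation (rows are actual labels, columns are predicted labels):

```python
# Recomputing Table 5.5 from the Table 5.4 confusion matrix.
cm = [[561, 23, 2],     # actual B
      [10, 467, 19],    # actual B-G
      [6, 17, 472]]     # actual B-G-D
classes = ["B", "B-G", "B-G-D"]

for i, name in enumerate(classes):
    recall = 100 * cm[i][i] / sum(cm[i])                    # per-row total
    precision = 100 * cm[i][i] / sum(row[i] for row in cm)  # per-column total
    print(f"{name}: recall {recall:.2f}%, precision {precision:.2f}%")
```

The computed values agree with Table 5.5 to within rounding.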

Finally, the output from our SEEMA model has the advantage of being highly interpretable because it provides an analyst with the probabilities that each of our classifiers will lead to a correct prediction for a specific sample, as illustrated in Table 5.6.

Target Label      B     B-G   B-G-D
Probability      89%     9%     2%

Table 5.6: Output from logistic regression model.

5.7 Discussion In this chapter, we aimed to answer the following two questions: How effective is fast static analysis as compared to slow dynamic analysis for the classification of malware? How efficient is our approach in terms of generating features for a large malware dataset? We address the first question by comparing the results obtained from our malware classification models. Our findings show that we can indeed use cheap-to-compute static characterizations to accurately assign malware into their corresponding families. In fact, the results for our B model demonstrate that simple byte features hold enough predictive power to accurately classify around 80% of the samples in our dataset, and only 20% require more expensive characterizations. This is a crucial finding because it allows us to determine when more computationally intensive analyses improve the classification of malware. These results are further enhanced by the ability of our surrogate model to select the simplest static characterizations that lead to an accurate classification of malware. Our results show that SEEMA produces high true positive rates (i.e., recall)

in terms of identifying malware that require our B and B-G-D models for an accurate classification. This is a vital factor for our analysis system, as long periods of time will not be spent analyzing samples that only require simple characterizations to be correctly classified. Similarly, SEEMA also identifies the binaries that demand more computational resources from our analysis system (i.e., files labeled as B-G-D). This enables us to decrease the overall time it takes to generate features for a large malware corpus.

5.7.1 Efficiency of SEEMA We also computed the time saved by using our SEEMA model by comparing it to an approach for malware family prediction that relies exclusively on expensive dynamic characterization of malware. For the purpose of this comparison, we used the target labels predicted by SEEMA for the test dataset used to evaluate its performance. Additionally, we defined the following two methods for feature generation:

1. Using dynamic analysis for all available samples: each sample in the malware dataset is analyzed in a sandbox for 300 seconds in order to generate dynamic-based features.

2. Using the predictions made by our SEEMA model: for each sample in the test dataset, we extract features according to the labels predicted by SEEMA. If the predicted label does not lead to an accurate classification, we extract the next possible characterization until the malware is correctly assigned into its corresponding family.

As shown in Figure 5.10, SEEMA requires the least amount of time to generate features for our test set. Specifically, it reduces the average time required for feature engineering by over 64%, compared to a traditional approach, in which all the samples are submitted for analysis in a sandbox environment for 5 minutes. As a result, we can leverage SEEMA to predict the simplest features and classifiers that work best for different malware datasets. This will allow us to scale malware analysis to upwards of 1 million samples, as we will only need to generate the most expensive dynamic characterizations for a small subset of the entire malware corpus.

Traditional Sandbox: 131.42 hours. SEEMA: 47.71 hours.

Figure 5.10: This figure depicts the time saved by using our SEEMA model compared to a traditional dynamic analysis approach where malware is executed for 300 seconds. SEEMA requires the least amount of time (47.71 hours) and reduces the average exe- cution time that is required to uncover the malicious intent of our test dataset by 84.01 hours or 63.92%.

5.8 Related Work In this section, we review the related work on feature set selection for malware classification as well as the effect of using hybrid features (i.e., adding dynamic analysis to static information extracted from malicious code) for malware family prediction.

5.8.1 Feature Exploration There are a few related works in the area of feature set construction for malware family classification. Moonsamy et al. [Moonsamy et al., 2011] investigated feature reduction to speed up malware classification. In their research, the authors leveraged information gain as a method to discriminate between important and less important string features extracted from malicious code. They hypothesized that the frequency of a string in a feature set is indicative of its importance to the malware classification process. By using this approach on a dataset of approximately 1824 executable samples, the authors managed to distinguish malware families with an accuracy of over 94.5% [Moonsamy et al., 2011]. Ahmadi et al. [Ahmadi et al., 2016] performed feature selection on a set of static features extracted from over 20,000 malware samples. Their feature space included metadata, byte-entropy histograms, byte-sequences, and API call frequency. By leveraging a parallel implementation of the gradient boosting tree classifier, they succeeded

in selecting the best combination of features for malware classification, attaining an accuracy rate of over 99% [Ahmadi et al., 2016]. Islam et al. [Islam et al., 2013] developed a model that identified the best combination of function and string features for malware classification. The authors hypothesized that a classifier trained on these two characterizations would achieve better results than if the two characterizations were used separately, due to their independent nature. Their approach was tested on a dataset of over 1400 malware samples using five different classifiers, with random forest yielding the best performance in terms of classification accuracy (98%) [Islam et al., 2013]. On the other hand, Vadrevu and Perdisci [Vadrevu and Perdisci, 2016] attempted to reduce the need for dynamic-based features by identifying malware that could be detected using fast static analysis information. In their work, the authors used static-based features to learn behavioral profiles from a training dataset. These attributes included number and size of memory sections as well as byte-entropy features. If the static features of a new malware sample perfectly matched those of a learned malware family, it was determined that dynamic analysis was not required for the file in question. This approach, however, has a significant limitation, as its similarity function only allows it to filter out binaries with identical static analysis features [Vadrevu and Perdisci, 2016]. The previous approaches focus on determining the particular combination of features that leads to the highest accuracy rate for the problem of malware classification. However, in our research, we not only integrate different types of features into a highly accurate malware classification model, but we also determine which particular classifiers and features work best for each individual sample in our dataset.
As demonstrated in our results section, simple and fast static characterizations of malware, such as byte features, hold enough predictive power to accurately identify malware, and more expensive characterizations, such as assembly or dynamic features, are not always required. This allows us to increase the throughput of our system, as expensive features are only necessary for a small portion of our malware corpus.

5.8.1.1 Hybrid Characterizations of Malware Approaches leveraging both static and dynamic information in the context of malware classification have been proposed by numerous authors [Robiah et al., 2009], [Santos et al., 2013], [Hu and Shin, 2013], [Eskandari et al., 2013], [Elhadi et al., 2012], [Choi et al., 2012], and [Damodaran et al., 2017]. The general consensus is that hybrid characterizations of malware generally improve the performance of machine learning classifiers compared to models built exclusively on static or dynamic features. Some of the proposed approaches are further described in the following sections. Santos et al. [Santos et al., 2013] examined the benefits of adding dynamic information to predictive features based on static analysis information. For their hybrid approach, the authors used opcode-sequences (i.e., instructions that specify the action to perform in machine code language), which were extracted from static analysis, as well as system calls, operations, and exceptions aggregated during the execution of malware in an emulated environment. In addition, this feature space was reduced by means of information gain and combined to form a hybrid static-dynamic input dataset. This approach was tested on a set of 2000 executables using four machine learning models that included k-nearest neighbors, decision tree, support vector machine, and Bayesian network algorithms. Their results showed that their hybrid approach outperformed the results of models trained exclusively on static or dynamic information from malware, with an average accuracy of over 96% [Santos et al., 2013]. Hu and Shin [Hu and Shin, 2013] combined information from both dynamic and static analyses to cluster malware families. Static information included opcodes, whereas dynamic features were constructed based on sequences of API calls collected during the execution of malware in a sandbox environment. Using a clustering ensemble method, the authors tested the effectiveness of their approach on a dataset of over 5647 malware files. Their results showed that a model trained on a hybrid characterization of malware could generate a better accuracy score for the classification of malware (98.72%), as opposed to models built on a specific set of features extracted from malicious code, such as static analysis (84.9%) or dynamic analysis (87.5%) [Hu and Shin, 2013]. Eskandari et al. [Eskandari et al., 2013] developed a tool called HDM Analyzer that takes advantage of both static and dynamic information during the training portion of their algorithms but uses only static analysis in the testing stage. The authors intended to leverage the high predictive power of dynamic analysis information in the training phase while exploiting the efficiency of fast static analysis for the testing stage, a concept that we also use in this dissertation. Their results showed that their hybrid approach had a higher overall accuracy (95.27%) than a model built on static analysis features alone (89.43%) [Eskandari et al., 2013]. Our models significantly extend the previous approaches by creating a method that determines when the classification of malware benefits from hybrid features. Although it has been demonstrated that combining dynamic and static information generally leads to more accurate models, cheap-to-compute static features may still suffice to classify a significant number of malicious binaries. As a result, dynamic analysis is not always required, and a hybrid approach might not necessarily yield a better prediction. Our SEEMA model leverages this concept and determines when dynamic analysis is actually required to improve the classification of malware beyond performing simple static analysis. As a result, we need to generate the most expensive characterizations for only a small subset of a large malware corpus, which allows us to scale malware analysis to large datasets of over 1 million malicious samples.

5.9 Conclusions This chapter introduced a series of models of increasing cost and predictive power for the classification of malware. These classifiers are constructed based on information extracted from both static and dynamic analysis of malware, namely byte-level information, graph-based features, and dynamic analysis information. Our results showed that a hybrid approach (i.e., a model constructed by using all available static and dynamic features) produced the highest mean accuracy (93.02%) for malware classification. The model built using the simplest static characterization, however, still performed well, with an average accuracy of 80.37%. These results demonstrate that a significant portion of our malware dataset only requires cheap-to-compute features to be correctly assigned into their corresponding families. The results from our malware classifiers were used to train a surrogate model, which we referred to as SEEMA (Selecting the most Efficient and Effective Malware Attributes). SEEMA correctly predicted when less expensive malware characterization, such as static analysis, would suffice to accurately classify malware. This allowed us to filter out the malware that would not benefit from time-consuming executions in an emulated environment. Given that in practice a large portion of malware does not require expensive-to-compute dynamic features, and since static analysis is generally orders of magnitude faster than dynamic analysis, we managed to significantly decrease the time spent extracting features for malware classification and to improve the throughput of our intelligent malware analysis system.

Chapter 6

MAGIC: MALWARE ANALYSIS TO GENERATE IMPORTANT CAPABILITIES

6.1 Introduction Manually constructed malware analysis platforms that identify important capabilities in malicious software cannot keep up with the massive amounts of malware being released on a daily basis [Harrison and Pagliery, 2015b]. Traditional approaches that detect the functional capabilities of malware usually contain brittle handcrafted heuristics that quickly become outdated and can be exploited by nefarious actors. As a result, it is necessary to change the way software security is approached by using advanced analytics (i.e., machine learning) and significantly more automation to develop more adaptable malware analysis engines that correctly deduce the important capabilities of malware. In this chapter, we introduce MAGIC (Malware Analysis to Generate Important Capabilities). MAGIC is a machine learning-based model that learns from static and dynamic characterizations of malware and automates malware capability discovery. It uses static features of malicious code and capabilities obtained from Reversing Labs (a malware dataset provider) and Cuckoo Sandbox to map fast static analysis characteristics that indicate malicious behaviors. By using static characterization of our malware dataset, MAGIC can dramatically reduce the time that is required to deduce high-level functionalities of malware.

6.2 Problem Statement One of the most challenging problems when analyzing malicious software corresponds to the identification of the capabilities of malware. Malware capabilities are high-level functionalities of malicious software [Saxe et al., 2014], such as the ability to detect the presence of an analysis environment, capture keystrokes, or encrypt a user's personal information. For decades, security professionals developed techniques to analyze malware manually and deduce its capabilities [Landesman, 2017]. Malicious applications were smaller and less complex, and it was simpler to perform manual inspection to both detect malware and infer its important capabilities. However, malware has become increasingly complex and numerous [Huang and Stokes, 2016], demanding adaptive techniques for inferring the functionality of malicious code. Furthermore, unlike data for problems such as malware family prediction, which can be retrieved from most antivirus vendors, information regarding the capabilities of malware binaries is not as readily available [Saxe et al., 2013]. An explanation for this phenomenon is found in the research conducted by Saxe et al. [Saxe et al., 2014]. In their work, the authors explain that semantic representation of high-level features of malware has traditionally required extensive knowledge engineering work. Standard formats, such as YARA [Alvarez, 2017], that are used for describing malware are constructed using static and dynamic features extracted from malicious code and rely entirely on a large number of rules that are inherently brittle and difficult to maintain. As described in Chapter 2, machine learning has recently emerged as a state-of-the-art approach to automate the construction of heuristics used in malware analysis systems.
When trained on meaningful characterizations of malicious code, machine learning techniques have proven to be much more resilient to noise than traditional handcrafted methods of analyzing malware and can yield better malware classification models [Kolosnjaji et al., 2016b]. Recently, it was demonstrated that machine learning models trained using features extracted from the dynamic execution of malicious code could be used to predict

behavioral components of malware, a problem related to the identification of malware capabilities [Yavvari et al., 2012]. By executing malicious software in an isolated sandbox environment, it is possible to extract knowledge from the malware and construct behavioral profiles that describe the functionality of the binary. Nevertheless, dynamic analysis can be a bottleneck for analyzing malware. As mentioned in Chapter 4 and Chapter 5, traditional sandbox environments take a long time to execute suspected malware, often 5 minutes or more [Damodaran et al., 2017]. This poses a challenge because dynamic analysis needs to be orders of magnitude faster to cope with the ever-increasing load of malware that appears daily. Static analysis compensates for the computational expense of dynamic analysis, since static analysis of malicious binaries can often be performed in seconds [Vadrevu and Perdisci, 2016]. As shown in Chapter 5, Figure 5.1, byte-level information takes less than a second to produce, even for significantly large malware files. More expressive characterizations of malware, such as graph-based features constructed from assembly code, usually take longer to generate, but can still be extracted faster than information derived from slow dynamic analysis. As a result, machine learning classifiers trained on static features can lead to predictive models that are able to scale to large amounts of malware in a relatively short time.

6.3 Approach We want to build a machine learning-based model that predicts important capabilities of malware. To accomplish this, we propose MAGIC (Malware Analysis to Generate Important Capabilities). MAGIC learns from static and dynamic characterizations of malware and automates malware capability discovery. Using static features of malicious code, as well as capabilities and indicators obtained from Reversing Labs and Cuckoo Sandbox, MAGIC can map fast static analysis characteristics that indicate malicious behaviors. In addition, by using static characterization of our malware dataset, MAGIC reduces the time that is required to infer high-level functionalities of malware.

To induce accurate machine learning models for predicting important capabilities of malware, we build a dataset using features and targets derived from both static and dynamic analysis.

6.3.1 Features Byte and hash histograms can be calculated faster than most other features, even other static ones. However, we believe that these two characterizations of malware do not lead to particularly intuitive or robust models. It is difficult to gain intuition from a file's byte entropy values, and using strings in a file that can easily be changed will not lead to robust models. We therefore choose to focus on instruction-based features, which correspond to the code generated by the compiler used to create the binary.

6.3.1.1 Instruction-Based Features We use Radare2 to translate the machine code into assembly code and generate instruction vectors for our malware dataset. First, the executable is disassembled into function, block, and operation level information, which corresponds to the executable's call graph, control flow graph, and instruction flow graph, respectively. At each level of the assembly graph for the given executable, we extract adjacency lists, which hold the graph's node and edge information, and an instruction histogram for each node in the graph. Each instruction in the instruction histogram maps to one of the 53 categories listed in Table 6.1 [Searles et al., 2017]. We choose to use the block level representation of the executable to build our feature vectors because it is more manageable in size than the operation level and generally more expressive than the function level.

6.3.1.2 Global Instruction Features A global feature vector representing all instructions in the binary is constructed in two different ways.

control flow:   switch, case, call, ucall, jmp, ujmp, cjmp, ucjmp, uccall, ccall, ret, cret, swi
arithmetic:     not, length, cmp, acmp, add, mod, cast, cpl, sub, abs, mul, div, shr, shl, nor, sal, sar, or, and, xor, crypto, ror, rol
memory:         mov, lea, cmov, xchg, leave, store, load, upush, pop, push
miscellaneous:  new, io, null, nop, unk, trap, ill

Table 6.1: This table shows the 53 categories of instructions extracted using Radare2. These instructions correspond to one of four major categories, namely: control flow, arithmetic, memory, and miscellaneous.
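The grouping in Table 6.1 can be captured as a lookup table; the sketch below simply restates the table (the dictionary names are ours) and sanity-checks that the four groups cover all 53 categories:

```python
# Table 6.1 restated as data: each instruction category belongs to one of
# four major groups. CATEGORY_TO_MAJOR is a hypothetical reverse lookup.
MAJOR = {
    "control flow": ["switch", "case", "call", "ucall", "jmp", "ujmp", "cjmp",
                     "ucjmp", "uccall", "ccall", "ret", "cret", "swi"],
    "arithmetic":   ["not", "length", "cmp", "acmp", "add", "mod", "cast",
                     "cpl", "sub", "abs", "mul", "div", "shr", "shl", "nor",
                     "sal", "sar", "or", "and", "xor", "crypto", "ror", "rol"],
    "memory":       ["mov", "lea", "cmov", "xchg", "leave", "store", "load",
                     "upush", "pop", "push"],
    "miscellaneous": ["new", "io", "null", "nop", "unk", "trap", "ill"],
}
CATEGORY_TO_MAJOR = {c: m for m, cats in MAJOR.items() for c in cats}

assert sum(len(v) for v in MAJOR.values()) == 53   # the 53 categories
assert CATEGORY_TO_MAJOR["xor"] == "arithmetic"
```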

6.3.1.2.1 Global Instruction histograms To generate global instruction histograms, we take every instruction histogram for each block in the control flow graph and sum them. This produces a complete histogram of instructions in the executable. This process is illustrated in Figure 6.1.

Figure 6.1: This figure illustrates the process of constructing instruction histograms from the block level representation (control flow graph) of a malicious binary. We sum the instruction histograms for every node in the graph, resulting in a global representation of the instructions used in the executable. We can also convert this histogram into a bit vector by changing all non-zero entries to a "1".
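The summation in Figure 6.1 can be sketched as follows (the per-block counts are made up for illustration; the 53-slot vocabulary is Table 6.1's):

```python
# Build a global instruction histogram by summing per-block histograms,
# then collapse it into a bit vector of used/unused categories.
N_CATEGORIES = 53

def global_histogram(block_histograms):
    """Element-wise sum of every block's instruction histogram."""
    total = [0] * N_CATEGORIES
    for hist in block_histograms:
        for i, count in enumerate(hist):
            total[i] += count
    return total

def to_bit_vector(histogram):
    """Replace every non-zero count with a 1."""
    return [1 if count > 0 else 0 for count in histogram]

blocks = [[0] * N_CATEGORIES for _ in range(3)]
blocks[0][4] = 2    # two instructions of one category in block 0
blocks[1][4] = 1    # one more of the same category in block 1
blocks[2][36] = 5   # five instructions of another category in block 2

g = global_histogram(blocks)
assert g[4] == 3 and g[36] == 5
assert sum(to_bit_vector(g)) == 2   # only two categories appear at all
```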

6.3.1.2.2 Global Instruction bit vectors We also include a binary representation of the global instruction histogram. Any instruction that has a value of one or greater will be converted to a “1”; any instruction

that has a value of less than one will be converted to a "0". This amounts to a bit vector of 53 entries indicating whether each of the 53 instruction categories appears in the binary.

6.3.1.3 Individual Node Features We can also construct feature vectors from information about individual nodes in the graph, as opposed to using a feature vector that expresses a summarized view of all the instructions in all the nodes of the graph. Feature vectors that pertain to individual nodes in the graph give more fine-grained information about the behavior of the program and lead to more intuitive models. We use random walks of our graphs and construct feature vectors from the nodes that were walked. In this chapter, we use a random walk method to generate individual node features.

6.3.1.3.1 Random Walk

We perform random walks over the block level representation of the executable to produce feature vectors. We extract adjacency lists that hold the node and edge information of the control flow graph, along with an instruction histogram for each node in the graph. The entry point of the control flow graph is the chosen start position of our walk. For the length of the walk, a random choice is made from the current node's neighbors and that node becomes the new start node. This iterates until the walk is completed; if a node has no neighbors to visit, we end the walk early. We complete multiple walks in this fashion and output a list of visited nodes. The final feature vector for a random walk is compiled by concatenating the instruction histograms of the visited nodes. If a walk ended early, we pad the feature vector with a zero-valued instruction histogram for each node that could not be visited. We empirically chose 2 random walks of length 5, for a total of 10 visited nodes. A depiction of this process is shown in Figure 6.2.
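The walk procedure described above can be sketched as follows. The adjacency list, histograms, and node names are toy stand-ins for the control flow graph data extracted with Radare2; the walk count and length default to the values we chose empirically (2 walks of 5 nodes).

```python
import random

def random_walk_features(adj, histograms, entry, walks=2, length=5):
    """Concatenate the instruction histograms of the nodes visited by
    `walks` random walks of `length` nodes, zero-padding early exits."""
    dim = len(next(iter(histograms.values())))
    features = []
    for _ in range(walks):
        node, visited = entry, 0
        while visited < length:
            features.extend(histograms[node])
            visited += 1
            neighbors = adj.get(node, [])
            if not neighbors:        # dead end: stop this walk early
                break
            node = random.choice(neighbors)
        # pad the unvisited positions with zero histograms
        features.extend([0] * dim * (length - visited))
    return features

# Toy CFG with 3-entry histograms per node (hypothetical data).
adj = {"entry": ["a", "b"], "a": ["b"], "b": []}
hists = {"entry": [1, 0, 2], "a": [0, 1, 0], "b": [2, 2, 0]}
vec = random_walk_features(adj, hists, "entry")
```

With 2 walks of 5 nodes and a 3-entry histogram, the feature vector always has 30 entries regardless of where the walks terminate, which keeps the model input fixed-length.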

6.3.1.3.2 Random Walk Bit Vector

Similar to the global instruction bit vectors, we include a binary representation of the random walk feature vector, where each instruction entry is 0 if that instruction is not used at the node and 1 if the instruction appears.

Figure 6.2: This figure shows the process to complete random walks of the block level representation of a malicious binary. For a given walk of 5 nodes, we use the entry point of the control flow graph as the start position of our walk. For the length of the walk, a random choice is made from the node's neighbors and that node becomes the new start node. This iterates until the walk is completed and the output corresponds to a list of visited nodes. The final feature vector for a random walk is constructed by concatenating the instruction histogram for each visited node.

6.3.1.4 Dynamic Characterization

The second critical component of our research is to compute the target capabilities that we will predict with our machine learning algorithms. To accomplish this, we leveraged two different sources of information regarding high-level capabilities of malware: Cuckoo Sandbox and Reversing Labs' A1000 Cloud Analysis System. We use these two analysis sources to generate two different sets of target capabilities. Using two different sources allowed us to evaluate the power of our models in predicting capabilities, regardless of the system used to generate our dataset.

6.3.1.5 Cuckoo Sandbox

The developers of Cuckoo have determined, a priori, a set of behaviors that can be deemed malicious, such as a file installing itself for autorun during Windows startup, or querying information about the computer, e.g., its uptime or browser history. When a malicious sample is executed in the sandbox environment, it is possible to match these behaviors against the analysis results and identify capabilities or indicators of interest. These behaviors provide context to the analyses, as they simplify the interpretation of the results and can help identify particular malware functionalities, such as anti-sandbox behavior. For the work proposed in this chapter, we explored this behavioral space to induce models that serve as a proxy for running much more expensive dynamic analysis to predict the important capabilities of the samples in our malware dataset.

6.3.1.6 Malware Attribute Enumeration and Characterization

After a file is dynamically analyzed, Cuckoo generates a report that outlines malware indicators of interest. One of the reporting formats that Cuckoo offers is the Malware Attribute Enumeration and Characterization (MAEC) format. MAEC is a standardized information-sharing format used to describe the important capabilities of malware [Oktavianto and Muhardianto, 2013]. It reduces the inherent ambiguity and discrepancies of malware descriptions by providing a general reporting framework with a unified structure and vocabulary. This gives organizations and cyber analysts an unambiguous communication channel to describe the nature and attributes of malware.
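To make the labeling step concrete, the sketch below extracts a 0/1 capability vector from a MAEC-style report. The JSON layout and field names here are simplified assumptions for illustration only; the actual MAEC schema is considerably richer.

```python
import json

# Illustrative report fragment; the field names below are assumptions
# for this sketch, not the real MAEC schema.
report_json = """
{
  "malware_instance": {
    "capabilities": [
      {"name": "anti-detection"},
      {"name": "data-theft"},
      {"name": "persistence"}
    ]
  }
}
"""

# The seven Cuckoo/MAEC capabilities tracked in this chapter.
TRACKED = ["anti-behavioral-analysis", "anti-code-analysis", "anti-detection",
           "data-theft", "persistence", "security-degradation", "spying"]

def capability_labels(report, tracked=TRACKED):
    """Turn a report's capability list into the 0/1 target vector."""
    found = {c["name"] for c in report["malware_instance"]["capabilities"]}
    return [1 if name in found else 0 for name in tracked]

labels = capability_labels(json.loads(report_json))
```

One such vector per analyzed sample forms the label half of the training dataset described in Section 6.5.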

6.3.1.7 A1000 Cloud Analysis System

Reversing Labs' A1000 Cloud System is a malware analysis appliance that integrates static analysis and a file reputation database to enable the extraction of indicators of interest [ReversingLabs, 2018]. There are many indicators that are common in malicious applications, like accessing and modifying registries, or writing to files in system directories. These behaviors indicate what the suspected sample is capable of doing, i.e., its capabilities, and are organized in priority order [ReversingLabs, 2018].

6.4 Learning Methodology

Our MAGIC model automates the search for the most expressive methods of characterizing malware and leverages machine learning techniques to accurately identify important capabilities of malware. The architecture to generate this prediction model is shown in Figure 6.3.

Figure 6.3: This figure illustrates the process to construct our MAGIC model. Static features are extracted using Radare2 and malware capabilities are generated using Cuckoo and Reversing Labs' A1000. Static features are used as input to train machine learning algorithms that predict the important capabilities of malware.

The process to build our MAGIC model involves using Radare2 to extract static analysis information from each malware that we analyzed. In particular, we characterize malware using various types of instruction features to form the input dataset of our model. The remaining part of our dataset includes the targets, or labels, for each malware required to train our classifiers. These target classes correspond to the important capabilities of the malware binaries and are extracted using two different analysis systems: Cuckoo Sandbox and Reversing Labs' A1000 Cloud Analysis System. A total of 30073 files were analyzed using Cuckoo, and 59627 malicious executables were examined using the A1000. The discrepancy in dataset size stems from the fact that performing dynamic analysis in Cuckoo is a much more resource-intensive and time-consuming process. As a result, a smaller subset of malware was selected to be dynamically analyzed in our sandboxed environment. We executed the samples for 1 minute and a report outlining their important capabilities was generated at the end of each run. Sixty seconds proved to be enough time to extract meaningful information from the majority of the samples in our dataset. However, some files did not display any malicious activity, even after the one-minute execution. Consequently, we reprocessed those files for an additional 4 minutes in order

to identify malware with potentially stalling code. On the other hand, the A1000 is a high-throughput malware analysis engine that can generate summary reports for each file submitted for scanning, containing the indicators of interest of the suspected samples. However, while the A1000 scales better than Cuckoo, it is built using handcrafted heuristics that become outdated as malware evolves. Cuckoo was able to identify a total of seven malware capabilities across the 30073 files submitted for dynamic analysis, as indicated in Table 6.2.

Anti-behavioral analysis: Refers to the ability of malware to detect the existence of an analysis environment, by searching for files and processes that are indicative of the presence of virtualization and emulation software [Assor, 2016].

Anti-code analysis: Describes the presence of code designed to thwart static malware analysis, such as instructions to prevent disassembly and the generation of accurate call graphs [Kirillov et al., 2017].

Anti-detection: Specifies the ability of malware to circumvent security software, by hiding processes that contain traces that indicate malicious behavior, such as network traffic and registry artifacts [Kirillov et al., 2017].

Data theft: Defines the functionality of malware to compromise a user's personal files and information, such as images, documents, and authentication credentials.

Persistence: Designates the property of malware to remain on the infected system regardless of changes made by the user, such as an operating system reboot or reinstall.

Security degradation: Indicates that the suspected sample is able to deactivate the operating system's security features, such as security alerts and user account control.

Spying: Corresponds to the ability of malware to capture a user's keyboard, mouse or camera input, as well as information about the usage and browsing habits of the infected machine [Paloalto Networks, 2016].

Table 6.2: Cuckoo/MAEC Capabilities

The A1000 yielded a total of 22 indicators of interest (IOI), organized under seven categories, as presented in Table 6.3. These IOIs are akin to the capabilities we extracted from Cuckoo/MAEC.

Evasion (tries to evade common debuggers/sandboxes/analysis tools): Uses anti-debugging methods.

Monitor (monitors host activities): Detects/enumerates process modules; Tampers with keyboard/mouse status.

Execution (creates or starts processes): Writes to files in Windows system directories; Creates/opens files in system directories; Writes to files; Reads from files; Creates/Opens a file; Executes a file; Terminates a process/thread.

File (accesses files in an unusual way): Might load additional DLLs and APIs; Contains references to kernel32.dll; Contains references to user32.dll.

Settings (alters system settings): Enumerates system information; Enumerates system variables.

Registry (accesses registry and configuration files in an unusual way): Accesses-modifies registry; Checks operating system version.

Search (collects system information): Reads paths to system directories on Windows; Enumerates files; References executable file extensions; References source code file extensions; Enumerates user local information.

Table 6.3: A1000 Indicators of Interest (IOI)

6.5 Experimental Infrastructure

Two different datasets were constructed, one using the Cuckoo/MAEC capabilities and one using the A1000 IOIs. Both datasets use the same static analysis features as inputs to the models. Either Cuckoo or the A1000 can highlight the presence or absence of a discovered capability, so each dataset pairs the static features with the labels from one of these analysis systems. A value of "1" indicates the occurrence of a potentially malicious behavior in the execution of the file, whereas a value of "0" designates the lack of said behavior. Table 6.4 presents a sample of target values constructed for one malware from the dataset that was submitted for analysis in Cuckoo.

Malware                          Anti-Debugging   Anti-Sandbox   ...   Data-Theft   Security-Degradation
56ed8678c0e1f...b5ab4c685dfd92   1                0              ...   0            1

Table 6.4: Training instance for MAGIC model.

The static analysis information and the set of target values are then appended to yield the input training dataset. This dataset is then fed to the machine learning models to predict the important capabilities of malware.
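A minimal sketch of this append step, using hypothetical (truncated) feature rows and capability labels:

```python
import numpy as np

# Hypothetical static feature rows (e.g., truncated 53-entry instruction
# histograms) and per-sample capability labels.
static_features = np.array([[3, 0, 7, 1],
                            [0, 2, 5, 0]])
targets = np.array([[1, 0, 1],      # e.g., anti-detect, data-theft, persistence
                    [0, 1, 0]])

# Append the targets to the static features to form one training table.
training = np.hstack([static_features, targets])

X = training[:, :static_features.shape[1]]   # model inputs
Y = training[:, static_features.shape[1]:]   # one column per capability
```

Each capability column is then used as the binary target for a separate classifier.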

6.5.1 Malware Family Distribution

We used the stream of malware that we obtained from Reversing Labs, which included malicious binaries targeting institutions in the financial services industry. Table 6.5 shows the name, type, and count of each family in our two datasets. Furthermore, our dataset was curated to guarantee the extraction of relevant information from the malicious files. As mentioned earlier, it is well known that most Windows malware is packed [Ugarte-Pedrero et al., 2015], and static analysis approaches can fail to extract meaningful features from packed malware. Reversing Labs removed all packing, obfuscation, and protection artifacts from the binary files to extract all internal objects with their metadata. Hence, the unpacked malware were available for further analysis using our disassemblers.

6.5.2 Cuckoo/MAEC Capability Distribution

We generated evenly-distributed datasets for each Cuckoo/MAEC capability extracted from our malware executables. This was done to ensure that our machine learning models could be trained on well-balanced datasets that contained malware both with and without a specific capability. For example, the dataset for the Cuckoo/MAEC data-theft capability included a total of 10543 samples, where 50% exhibited the aforementioned capability and the remaining 50% did not display said behavior during its

execution.

Name        Type         Cuckoo/MAEC   A1000
Andromeda   virus        3516          4458
Banker      spyware      2730          8456
Banload     downloader   4358          8582
Cutwail     downloader   3018          4073
Inject      virus        3181          8193
Injector    trojan       2498          8522
Ramnit      trojan       3525          7169
Shifu       spyware      3497          4629
Zbot        downloader   3751          5545

Table 6.5: Distribution of Families for our Malware Datasets

Finally, we created a total of seven datasets with even distributions for the Cuckoo/MAEC capabilities, as indicated in Table 6.6.

Capability   Anti-behave   Anti-code   Anti-detect   Data-theft   Persistence   Security-deg   Spying
Total        9763          6740        22420         10543        20420         12272          9722

Table 6.6: Dataset distribution for Cuckoo/MAEC capabilities

6.5.3 A1000 Indicators of Interest Distribution

Our A1000 datasets are substantially larger than our Cuckoo/MAEC datasets. Therefore, we do not balance the A1000 datasets and instead use the data that we capture from our larger malware feed.

6.6 Experimental Results

After constructing the training dataset of relevant static features and important capabilities of malware, we used machine learning (ML) to induce prediction models. Specifically, we leveraged Sklearn [Pedregosa et al., 2011] to train our ML algorithms. In the following sections, we discuss the major results from the implementation of our MAGIC model using Sklearn. First, we discuss the datasets used to train, test, and evaluate our models. We then present preliminary results from different ML algorithms evaluated on our malware datasets. Models trained using random forest achieved the highest average accuracy across all capabilities, at 92.94%. Similar mean accuracies were obtained by decision trees (92.54%) and artificial neural networks (91.21%) across malware capabilities and indicators of interest. The least successful model in terms of average accuracy was the support vector machine, which yielded 89.73%. For the remainder of the results, we focus on decision trees because they achieve performance comparable to the other ML algorithms and because their output models tend to be very intuitive. We explain the reasoning for the selection of this particular ML algorithm and show the advantages that this technique offers to cyber analysts in terms of automated malware capability discovery.
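The model comparison can be reproduced in miniature with Sklearn; the snippet below uses synthetic data in place of our instruction features and capability labels, so the resulting scores are illustrative only.

```python
# Minimal sketch of the classifier comparison on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 53 features mimic the 53-entry instruction histograms; data are synthetic.
X, y = make_classification(n_samples=600, n_features=53, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
# Fit each model and record its held-out accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```

In our experiments this loop ran once per capability dataset, with accuracy averaged across capabilities to produce the figures quoted above.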

6.6.1 Decision Trees

Decision trees (DT) are a machine learning technique that has the potential to be human interpretable and easy to visualize. This makes DTs advantageous to cyber security analysts trying to comprehend the specific features embedded in the code of the malware that are responsible for its capabilities. Moreover, DT algorithms inherently perform useful tasks for human analysts, including feature importance ranking and feature selection on a malware dataset, as the top few nodes on which the tree splits correspond to the most important features of the malware, upon which its important capabilities are derived [Deshpande, 2011b]. Finally, DTs do not require a considerable amount of computation to construct their models. In fact, our models built using DTs took on average around one minute to be trained and validated. We now present and discuss the results obtained after evaluating our proposed approach using DTs.
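The feature-ranking behavior described above can be seen directly in Sklearn: the snippet below trains a small DT on synthetic data (the hypothetical `instr_i` names stand in for the 53 instruction types), prints a textual view of the tree, and ranks features by importance.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names standing in for the 53 instruction types.
names = [f"instr_{i}" for i in range(10)]
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# The top splits identify the instructions most predictive of a capability.
ranked = sorted(zip(names, tree.feature_importances_),
                key=lambda p: p[1], reverse=True)

# Textual rendering of the tree, analogous to the visual model of Fig. 6.6.
print(export_text(tree, feature_names=names))
```

The `feature_importances_` attribute gives the analyst the instruction-level ranking for free, which is the property we exploit in Section 6.7.1.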

6.6.1.1 Decision Tree Results

In this section, we discuss the results generated by our decision tree classifier when trained on our smaller dataset of 30073 malware samples constructed using the Cuckoo/MAEC capabilities, as well as the larger dataset of 59627 malicious files built using the A1000. As shown in Figure 6.4 and Figure 6.5, our DT classifier models are

able to accurately predict the important capabilities of malware, achieving up to a 97% accuracy on predicting capabilities.

[Figure 6.4 bar chart: accuracy (%) per A1000 indicator of interest. File write 94.16, Sys-dir write 93.36, Anti-debug 94.95, Term-proc 96.02, Keyboard-tamper 96.39, Read paths 92.63, Read from files 92.02, Load DLLs/APIs 93.06, Execute a file 91.32, Enum user info 94.67, Enum sys vars 94.43, Enum sys info 92.33, Enum files 93.36, Enum modules 94.95, Open files in sys dir 92.63, Open a file 92.86, Ref src code file 84.68, Ref exec file 89.96, Ref user32.dll 95.02, Ref kernel32.dll 91.49, Check OS version 94.51, Access registry 94.53.]

Figure 6.4: This figure shows accuracy results for the DT model evaluated using the A1000 IOIs.

For our dataset constructed using the A1000 indicators of interest, the accuracy results were evenly distributed across the indicators. In fact, the results obtained for twenty of the 22 indicators of interest crossed the 90% threshold, which indicates the high predictive power of our models trained on instruction-based features. These results also highlight the high accuracy that can be obtained when models are trained using a large corpus of samples both with and without a particular capability. The A1000 dataset contained 22 indicators of interest and, as a result, was able to generate higher accuracies in contrast to the models trained on the smaller Cuckoo/MAEC dataset.

For our dataset constructed using Cuckoo/MAEC capabilities, the best performing results were obtained for the anti-detect capability, which achieved an accuracy of up to 97.11%, followed by security degradation (93.58%), data-theft (90.39%) and

[Figure 6.5 bar chart: accuracy (%) for each Cuckoo/MAEC capability (Spying, Anti-code, Data-theft, Persistence, Anti-detect, Anti-behave, Security-deg).]

Figure 6.5: This figure shows accuracy results for the DT model evaluated using the Cuckoo/MAEC capabilities.

The worst performing capability was anti-code. Although the maximum accuracy obtained for this capability (86.32%) did not match the highly accurate results obtained for the other Cuckoo/MAEC capabilities, we still made significant progress towards the discovery of malware that is likely to contain code designed to hamper static analysis, giving us the opportunity to focus on examining the samples that do improve the intelligence of our analysis system.

6.7 Discussion

The results for our MAGIC model show that by using instruction-based features extracted from fast static analysis, we can accurately predict the important capabilities

of malware. Particularly, by using simple and inexpensive disassembly characterizations of malware, such as instruction histograms, we remove the need to perform dynamic malware analysis to identify the important capabilities of malware. This drastically reduces the overall time it takes to estimate the important capabilities of a large malware dataset, because it is no longer necessary to run malware in a sandboxed environment for long periods of time. Additionally, our prediction model makes significant progress towards the identification of crucial functionalities of malware, such as anti-sandbox capabilities. By leveraging these predictions, it is possible to isolate malware that might contain logic to circumvent dynamic analysis if it detects the presence of an emulator or virtualized environment. This particular limitation can be overcome by running this subset of anti-sandbox malware on bare-metal architectures. Bare-metal systems provide a platform with no virtualization or emulation instrumentation, which allows analysts to observe the behavior of malware that would otherwise enable its anti-sandbox capabilities to circumvent analysis. Although bare-metal systems often incur high hardware costs and suffer from scalability limitations, our MAGIC model can effectively single out the anti-sandbox malware that will display its malicious behavior if executed on a physical machine. Also, by predicting anti-detection capabilities of malware, our model determines which files are more likely to contain code designed to thwart security software by hiding processes that are indicative of malicious behavior.
This gives us the opportunity to skip expensive dynamic analysis on samples that do not generate new information and thus can focus only on analyzing unseen malware that truly requires additional dynamic examination, drastically increasing the throughput of our malware analysis system.

6.7.1 Decision Tree Model

A DT constructed for the Cuckoo/MAEC persistence capability is shown in Figure 6.6. DTs can generate a visual model that gives security analysts more intuition about the capabilities of malware. This is accomplished by exploiting the benefits of instruction vectors extracted from the compiler representation graphs of malicious binaries. This particular characterization of malware provides analysts with a more readable description of the elements of malicious code associated with a specific capability. As seen in Figure 6.6, instruction-based DT models can highlight the most relevant instructions of the malware, upon which its important capabilities are derived.

Figure 6.6: This figure depicts a DT constructed using instruction histograms derived from static analysis of malicious code. The information contained in the nodes can be used to predict a specific capability based on the presence (i.e., a node value greater than 0.5) or absence (i.e., a node value less than or equal to 0.5) of particular instructions in the code of the binary. For instance, the far-right nodes of the graph indicate that the existence of instructions such as switch and call can lead to the discovery of the anti-code analysis capability in the code of the malware.

In addition, we can use the information contained in the nodes to predict a specific capability based on the presence (i.e., if a node value is greater than 0.5) or absence (i.e., if a node value is less than or equal to 0.5) of particular instructions in the code of the binary. For instance, the far-right nodes of the graph indicate that the existence of instructions such as switch and call can lead to the discovery of the anti-code analysis capability in the code of the malware.

6.8 Related Work

Although the literature on important capabilities of malware is scarce, there are a few related works in the area of malware functionality discovery. Saxe et al. proposed using natural language processing (NLP) to identify important capabilities of malware and showed preliminary results regarding the feasibility of this approach in [Saxe et al., 2013]. A full implementation of this methodology was later conducted by the same authors in [Saxe et al., 2014]. First, they extracted function call symbols from the import table of the Portable Executable (PE) header of their malware corpus. In addition, they used dynamic analysis to retrieve information regarding the execution paths and registry modifications performed by the malware. The function call symbols embedded in their hybrid dataset were mapped to a corpus of 6 million posts mined from the website www.stackoverflow.com [Saxe et al., 2014]. Finally, the authors identified the occurrence rate of function call symbols within the posts and derived information about the important capabilities of the suspected samples. On the other hand, there is a small number of commercial solutions, such as JoeSandbox [JoeSandbox, 2017] and ThreatExpert [ThreatExpert, 2017], that provide an overview of the behavior of malware in an isolated environment and extract indicators of interest. For instance, JoeSandbox summarizes the network activities performed by the malware (e.g., downloaded files, DNS lookups) as well as file activities (e.g., number of created and deleted files) and registry activities as part of the report that it generates after the execution of the file. This behavioral space can be used to describe

and predict the capabilities that the malware exhibits during its execution. However, JoeSandbox is a commercial implementation of a dynamic analysis system and is therefore hampered by long analysis times and fragile human-constructed heuristics [JoeSandbox, 2017]. The open-source project YARA is another effort to predict important capabilities of malware. YARA provides a framework to generate descriptions of malware families based on textual or binary patterns [Alvarez, 2017]. Typically, YARA rules consist of a set of strings, which correspond to pattern-based descriptions of malware families, and a Boolean expression that determines their logic. After a malicious sample is executed in a sandbox environment, it is possible to run these rules against the extracted behaviors and identify patterns or indicators of interest [Alvarez, 2017].

6.9 Conclusions

In this chapter, we introduced MAGIC (Malware Analysis to Generate Important Capabilities). First, we developed highly expressive and inexpensive static characterizations of malicious code to feed as input to our machine learning model. Specifically, MAGIC was constructed using input from instruction feature vectors that come from graphs of malicious code. Then, we extracted meaningful functionality information from our malware dataset, which was used as target labels for our machine learning algorithms. Our experimental results based on the decision tree classifier showed that by turning the malware capability prediction problem into a machine learning problem, it is possible to induce predictive models that estimate the high-level functionalities of malware with an accuracy of up to 97.11%. In addition, we demonstrated that by leveraging DTs it is possible to generate a visual model that is highly intuitive and can provide security analysts with insights about the specific features embedded in the code of the malware that are responsible for its capabilities.

Chapter 7

CONCLUSIONS

In this dissertation, we proposed the development of an intelligent sandbox environment that combines machine learning with static and dynamic analysis features for the analysis and classification of malware. This combination of analyses attempts to cover their individual limitations and produce models that are far more accurate and robust than models built with just one of the analysis techniques. Moreover, our machine-learning based models were able to address several important sandbox problems currently being solved by manually-tuned heuristics, such as 1) predicting the length of time needed to execute a malware, 2) learning when dynamic behavior improves the accuracy of malware detection beyond performing static analysis, and 3) automatically identifying important capabilities of malware. In Chapter 4, we introduced a model that we referred to as TURACO (Training Using Runtime Analysis from Cuckoo Outputs) that significantly extends the capabilities of Cuckoo, an open-source sandbox. TURACO leverages features extracted from fast static analysis of malware, as well as reports generated by Cuckoo that provide insights about the time needed to accurately classify a file as malicious. Our results demonstrated that TURACO could accurately identify the amount of time that malware should be executed in a sandbox in order to reveal itself. This execution did not require a significant amount of time, as indicated by the subset of files that was deemed malware in just under ten seconds. Additionally, we showed that it was possible to use static features to construct highly accurate machine learning models that estimate malware analysis run time. Furthermore, TURACO drastically reduced the average execution time required to uncover the malicious intent of a malware test dataset, compared to a method where all the samples were submitted for a total of

five minutes, which is the approach that traditional sandboxes use to examine malware behavior. Thus, TURACO can be used to design and provision a bare-metal architecture (i.e., a system running on real hardware) for the dynamic analysis of malware at a lower cost. In Chapter 5, we introduced a series of models of increasing cost and predictive power for the problem of malware classification. These classifiers were constructed based on information extracted from both static and dynamic analysis of malware, specifically: byte-level information, graph-based features, and dynamic analysis information. Our results showed that a hybrid approach (i.e., a model constructed using all available static and dynamic features) produced the highest mean accuracy for malware classification. Nevertheless, the model built using the simplest static characterization also managed to generate modest results, which indicated that a significant portion of our malware dataset only required cheap-to-compute features to be correctly assigned to their corresponding families. The results from these classifiers were used to train a surrogate model that we referred to as SEEMA (Selecting the most Efficient and Effective Malware Attributes). SEEMA correctly predicted when less expensive malware characterization, such as static analysis, would suffice to accurately classify malware. This allowed us to filter out the malware that would not benefit from time-consuming executions in an emulated environment. Given that in practice a large portion of malware does not require expensive-to-compute dynamic features, and since static analysis is generally orders of magnitude faster than dynamic analysis, we managed to significantly decrease the time spent extracting features for malware classification and improve the throughput of our intelligent malware analysis system. Finally, in Chapter 6 we introduced MAGIC (Malware Analysis to Generate Important Capabilities).
First, we developed highly expressive and inexpensive static characterizations of malicious code to feed as input to our machine learning model. Specifically, MAGIC was constructed using input from instruction feature vectors that come from graphs of malicious code. Then, we extracted meaningful functionality information from our malware dataset, that was used as target labels for our machine

learning algorithms. Our experimental results based on the decision tree classifier showed that, by turning the malware capability prediction problem into a machine learning problem, it is possible to induce predictive models that estimate the high-level functionalities of malware. In addition, we demonstrated that by leveraging decision trees it is possible to generate a visual model that is highly intuitive, and can provide security analysts with insights about the specific features embedded in the code of the malware that are responsible for its capabilities.

Chapter 8

FUTURE WORK

In the future, we plan to test the effectiveness of our models on a larger dataset that contains different types of malware. One limitation of the dataset used in this dissertation is that it only included Windows executables. However, this malware corpus can be expanded to handle any kind of malware: as long as the dynamic analysis generates a report for a malware, that instance can be included in our model. Furthermore, we only examined malware using MS Windows 7 Professional, as it was compatible with our bare-metal and cloud-computing environments. Using different operating systems (OS) for the sandbox environment will give us the ability to detect maliciousness in malware built specifically to attack other OS architectures (e.g., Linux, iOS, and Android). Additionally, some malware have the ability to detect the presence of a monitored environment and stall the execution of any malicious activity. As a result, we plan to investigate anti-anti-sandbox capabilities. Anti-sandboxing refers to the capability of malware to shut down or switch to benign behavior if the malware detects it is running in a sandboxed environment [Kirat et al., 2014]. Some anti-sandbox techniques performed by malware include checking the uptime of the analysis system, or simply verifying whether sandbox software is installed and running on the computer [Assor, 2016]. We intend to invent new techniques to bypass the anti-sandbox code that malware uses. This will be an important leap forward in the way sandbox environments are designed and developed. We will extend Cuckoo with the ability to automatically identify anti-sandbox capabilities. We will develop machine learning techniques to predict the code in malware that is anti-sandbox, and that can be safely ignored because it is being used to avoid sandbox introspection.

Accurately identifying the anti-sandbox statements that are placed in a program and that can be safely ignored will be a challenging problem. However, this is a problem well suited to Reinforcement Learning (RL), a machine learning technique that uses positive and negative rewards to train its models [Kaelbling et al., 1996]. Reinforcement learning is a type of machine learning in which the learning system, or agent, observes an environment and performs actions based on feedback from the environment [Champandard, 2016]. The agent must then determine the ideal behavior, or policy, that maximizes the rewards over time or minimizes the penalties (negative rewards) received from the environment [Géron, 2017]. Our RL models will predict code that is anti-sandbox and can be safely ignored. We will give positive rewards to our model based on the length of time a malware executes. If we skip code that causes a malware to crash or end its execution abruptly, we will give negative reinforcement to our model so that it does not skip that code. We will run thousands of independent executions of a malware in our sandboxed environment, and we are confident that over many executions we can develop highly accurate RL models that will reliably skip anti-sandbox code.

Moreover, we plan to explore additional static characterizations of malware to further increase the predictive power of MAGIC. In particular, we plan to study other compiler representation graphs of malicious binaries to train MAGIC and improve our results for malware capability discovery. Finally, we intend to train and validate our approach using deep learning. Deep learning has recently emerged as the state-of-the-art technique for malware detection and classification, due to increasing computational capabilities and extensive datasets [Saxe and Berlin, 2015].
Deep learning models have been shown to help analysts achieve breakthrough results in terms of both high accuracy and low false positive rates for malware characterization and classification [Kolosnjaji et al., 2016b]. However, deep learning models are not inherently intuitive to analysts. We believe analysts will have to decide whether to use deep learning based on a tradeoff between the accuracy and the readability of the model.
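The reward scheme proposed above for the planned RL models can be illustrated with a toy simulation. Everything here is a simplified sketch under assumed values, not the actual system: a program is reduced to a list of blocks, where True marks a hypothetical anti-sandbox check and False genuine payload code, and the reward magnitudes are arbitrary.

```python
# Toy illustration of the proposed reward design, not the actual system.
# True marks a hypothetical anti-sandbox block; False marks payload code.
PROGRAM = [False, True, False, False, True]

CRASH_PENALTY = -5.0  # negative reinforcement for skipping genuine code


def run_episode(skip_decisions):
    """Simulate one sandboxed run and return the episode reward."""
    blocks_run = 0
    for is_anti_sandbox, skip in zip(PROGRAM, skip_decisions):
        if is_anti_sandbox and not skip:
            # the sample detects the sandbox and switches to benign
            # behavior, so reward is capped at the progress made so far
            return float(blocks_run)
        if not is_anti_sandbox and skip:
            # skipping real code crashes the sample: negative reward
            return CRASH_PENALTY
        blocks_run += 1
    return float(blocks_run)  # reward grows with how long the malware ran


execute_everything = run_episode([False] * len(PROGRAM))  # stalls early
skip_everything = run_episode([True] * len(PROGRAM))      # crash penalty
skip_only_checks = run_episode(list(PROGRAM))             # full execution
```

An agent trained over thousands of such episodes would be rewarded for converging on the last policy, skipping exactly the anti-sandbox blocks while leaving the payload intact.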

BIBLIOGRAPHY

[Ahmadi et al., 2016] Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., and Giacinto, G. (2016). Novel feature extraction, selection and fusion for effective malware family classification. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, pages 183–194. ACM.

[AlAhmadi and Martinovic, 2018] AlAhmadi, B. A. and Martinovic, I. (2018). Malclassifier: Malware family classification using network flow sequence behaviour. In Anti-Phishing Working Group (APWG) Symposium on Electronic Crime Research (eCrime), 2018, pages 1–13. IEEE.

[Albon, 2017] Albon, C. (2017). Nested cross validation. Retrieved from: https://chrisalbon.com/machine_learning/model_evaluation/nested_cross_validation/.

[Alvarez, 2016] Alvarez, S. (2016). Android malware analysis with radare: Dissecting the triada trojan. Retrieved from: https://www.nowsecure.com/blog/2016/11/21/android-malware-analysis-radare-triada-trojan/.

[Alvarez, 2017] Alvarez, V. (2017). Yara: The pattern matching swiss knife for malware researchers. Retrieved from: https://virustotal.github.io/yara/.

[Anderson et al., 2011] Anderson, B., Quist, D., Neil, J., Storlie, C., and Lane, T. (2011). Graph-based malware detection using dynamic analysis. Journal in Computer Virology, 7(4):247–258.

[Anderson et al., 2012] Anderson, B., Storlie, C., and Lane, T. (2012). Improving malware classification: bridging the static/dynamic gap. In Proceedings of the 5th ACM workshop on Security and Artificial Intelligence, pages 3–14. ACM.

[Arora, 2016] Arora, T. (2016). File entropy in malware analysis. Retrieved from: https://www.talentcookie.com/2016/02/file-entropy-in-malware-analysis/.

[Assor, 2016] Assor, Y. (2016). Anti-vm and anti-sandbox explained. Retrieved from: https://www.cyberbit.com/blog/endpoint-security/anti-vm-and-anti-sandbox-explained/.

[AV-Test, 2016] AV-Test (2016). Security report 2015/16. Retrieved from: https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf.

[Bailey et al., 2007] Bailey, M., Oberheide, J., Andersen, J., Mao, Z. M., Jahanian, F., and Nazario, J. (2007). Automated classification and analysis of internet malware. In International Workshop on Recent Advances in Intrusion Detection, pages 178–197. Springer.

[Bayer et al., 2009] Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C., and Kirda, E. (2009). Scalable, behavior-based malware clustering. In Network and Distributed System Security Symposium, volume 9, pages 8–11.

[Bayer et al., 2010] Bayer, U., Kirda, E., and Kruegel, C. (2010). Improving the efficiency of dynamic malware analysis. In Proceedings of the 2010 ACM Symposium on Applied Computing, pages 1871–1878. ACM.

[Bierma et al., 2014] Bierma, M., Gustafson, E., Erickson, J., Fritz, D., and Choe, Y. R. (2014). Andlantis: Large-scale android dynamic analysis. In Proceedings of the Third Workshop on Mobile Security Technologies (MoST).

[Blasing et al., 2010] Blasing, T., Batyuk, L., Schmidt, A.-D., Camtepe, S. A., and Albayrak, S. (2010). An android application sandbox system for suspicious software detection. In 2010 5th International Conference on Malicious and Unwanted Software (MALWARE 2010), pages 55–62. IEEE.

[Brownlee, 2016] Brownlee, J. (2016). How to prepare your data for machine learning in python with scikit-learn. Retrieved from: https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/.

[Cade, 2017] Cade, C. (2017). Understanding heuristic-based scanning vs. sandboxing. Retrieved from: https://www.opswat.com/blog/understanding-heuristic-based-scanning-vs-sandboxing.

[Cavazos, 2017] Cavazos, J. (2017). Malware analysis and detection using graph-based characterization and machine learning. US Patent App. 15/256,883.

[Champandard, 2016] Champandard, A. J. (2016). Reinforcement learning. Retrieved from: http://reinforcementlearning.ai-depot.com/.

[Choi et al., 2012] Choi, Y. H., Han, B. J., Bae, B. C., Oh, H. G., and Sohn, K. W. (2012). Toward extracting malware features for classification using static and dynamic analysis. In 8th International Conference on Computing and Networking Technology (ICCNT), pages 126–129. IEEE.

[Christodorescu and Jha, 2003] Christodorescu, M. and Jha, S. (2003). Static analysis of executables to detect malicious patterns. In Proceedings of the 12th Conference on USENIX Security Symposium, volume 12, pages 12–12. USENIX Association.

[Christodorescu et al., 2008] Christodorescu, M., Jha, S., and Kruegel, C. (2008). Mining specifications of malicious behavior. In Proceedings of the 1st India Software Engineering Conference, pages 5–14. ACM.

[Cobb, 2016] Cobb, M. (2016). Why signature-based detection isn’t enough for enterprises. Retrieved from: http://searchsecurity.techtarget.com/tip/Why-signature-based-detection-isnt-enough-for-enterprises.

[Costin et al., 2014] Costin, A., Zaddach, J., Francillon, A., and Balzarotti, D. (2014). A large-scale analysis of the security of embedded firmwares. In USENIX Security Symposium, pages 95–110.

[Cuckoo, 2017] Cuckoo (2017). Retrieved from: https://cuckoosandbox.org/.

[Dahl et al., 2013] Dahl, G. E., Stokes, J. W., Deng, L., and Yu, D. (2013). Large-scale malware classification using random projections and neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3422–3426. IEEE.

[Damodaran et al., 2017] Damodaran, A., Di Troia, F., Visaggio, C. A., Austin, T. H., and Stamp, M. (2017). A comparison of static, dynamic, and hybrid analysis for malware detection. Journal of Computer Virology and Hacking Techniques, 13(1):1–12.

[Deshpande, 2011a] Deshpande, B. (2011a). 4 key advantages of using decision trees for predictive analytics. Retrieved from: http://www.simafore.com/blog/bid/62333/4-key-advantages-of-using-decision-trees-for-predictive-analytics.

[Deshpande, 2011b] Deshpande, B. (2011b). 4 key advantages of using decision trees for predictive analytics. Retrieved from: http://www.simafore.com/blog/bid/62333/4-key-advantages-of-using-decision-trees-for-predictive-analytics.

[Dwinnell, 2007] Dwinnell, W. (2007). Data mining in matlab. Retrieved from: http://matlabdatamining.blogspot.com/2007/02/stratified-sampling.html.

[Egele et al., 2012] Egele, M., Scholte, T., Kirda, E., and Kruegel, C. (2012). A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR), 44(2):6.

[Elhadi et al., 2012] Elhadi, A. A., Maarof, M. A., and Osman, A. H. (2012). Malware detection based on hybrid signature behaviour application programming interface call graph. American Journal of Applied Sciences, 9(3):283.

[Eskandari et al., 2013] Eskandari, M., Khorshidpour, Z., and Hashemi, S. (2013). Hdm-analyser: a hybrid analysis approach based on data mining techniques for malware detection. Journal of Computer Virology and Hacking Techniques, 9(2):77–93.

[Firdausi et al., 2010] Firdausi, I., Erwin, A., Nugroho, A. S., et al. (2010). Analysis of machine learning techniques used in behavior-based malware detection. In Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on, pages 201–203. IEEE.

[Fisher, 2017] Fisher, T. (2017). What is the windows registry? Retrieved from: https://www.lifewire.com/windows-registry-2625992.

[GData Internet Security, 2017] GData Internet Security (2017). History of malware. Retrieved from: https://www.gdata-software.com/seccuritylabs/information/history-of-malware.

[Géron, 2017] Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O’Reilly Media Inc., Sebastopol, CA.

[Grosse et al., 2016] Grosse, K., Papernot, N., Manoharan, P., Backes, M., and McDaniel, P. (2016). Adversarial perturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435.

[Guo et al., 2008] Guo, F., Ferrie, P., and Chiueh, T.-C. (2008). A study of the packer problem and its solutions. In Research in Attacks, Intrusions and Defenses (RAID), volume 8, pages 98–115. Springer.

[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The weka data mining software: an update. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) Explorations Newsletter, 11(1):10–18.

[Hansen et al., 2016] Hansen, S. S., Larsen, T. M. T., Stevanovic, M., and Pedersen, J. M. (2016). An approach for detection and family classification of malware based on behavioral analysis. In 2016 International Conference on Computing, Networking and Communications (ICNC), pages 1–5. IEEE.

[Hardy et al., 2016] Hardy, W., Chen, L., Hou, S., Ye, Y., and Li, X. (2016). Dl4md: A deep learning framework for intelligent malware detection. In Proceedings of the International Conference on Data Mining (DMIN), page 61. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp).

[Harrison and Pagliery, 2015a] Harrison, V. and Pagliery, J. (2015a). Nearly 1 million new malware threats released every day. Retrieved from: http://money.cnn.com/2015/04/14/technology/security/cyber-attack-hacks-security/index.html.

[Harrison and Pagliery, 2015b] Harrison, V. and Pagliery, J. (2015b). Nearly 1 million new malware threats released every day. Retrieved from: http://money.cnn.com/2015/04/14/technology/security/cyber-attack-hacks-security/.

[Holmes, 2014] Holmes, J. (2014). Knowledge discovery in biomedical data: theory and methods. Methods in Biomedical Informatics: A Pragmatic Approach, pages 179–240.

[Honig and Sikorski, 2012] Honig, A. and Sikorski, M. (2012). Hashing: A fingerprint for malware. Retrieved from: https://www.safaribooksonline.com/library/view/practical-malware-analysis/9781593272906/ch02s02.html.

[Hope, 2017] Hope, C. (2017). Assembly language. Retrieved from: https://www.computerhope.com/jargon/a/al.htm.

[Hosmer, 2008] Hosmer, C. (2008). Polymorphic and metamorphic malware. Retrieved from: https://www.blackhat.com/presentations/bh-usa-08/Hosmer/BH_US_08_Hosmer_Polymorphic_Malware.pdf.

[Hu and Shin, 2013] Hu, X. and Shin, K. G. (2013). Duet: integration of dynamic and static analyses for malware clustering with cluster ensembles. In Proceedings of the 29th Annual Computer Security Applications Conference, pages 79–88. ACM.

[Hu et al., 2013] Hu, X., Shin, K. G., Bhatkar, S., and Griffin, K. (2013). Mutantx-s: Scalable malware clustering based on static features. In USENIX Annual Technical Conference, pages 187–198.

[Huang and Stokes, 2016] Huang, W. and Stokes, J. W. (2016). Mtnet: a multi-task neural network for dynamic malware classification. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 399–418. Springer.

[Hubicka, 2003] Hubicka, J. (2003). Control flow graph. Retrieved from: http://www.ucw.cz/~hubicka/papers/proj/node18.html.

[InfoSec, 2018] InfoSec (2018). Malware analysis basics - part 1 static malware analysis. Retrieved from: https://resources.infosecinstitute.com/malware-analysis-basics-static-analysis/.

[Islam et al., 2013] Islam, R., Tian, R., Batten, L. M., and Versteeg, S. (2013). Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications, 36(2):646–656.

[Issa, 2012] Issa, A. (2012). Anti-virtual machines and emulations. Journal in Computer Virology, 8(4):141–149.

[Jang et al., 2011] Jang, J., Brumley, D., and Venkataraman, S. (2011). Bitshred: feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security, pages 309–320. ACM.

[JoeSandbox, 2017] JoeSandbox (2017). Retrieved from: https://www.joesecurity.org/.

[Ka, 2016] Ka, M. (2016). Finding advanced malware using volatility. Retrieved from: https://eforensicsmag.com/finding-advanced-malware-using-volatility/.

[Kaelbling et al., 1996] Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.

[Karim et al., 2005] Karim, M. E., Walenstein, A., Lakhotia, A., and Parida, L. (2005). Malware phylogeny generation using permutations of code. Journal in Computer Virology, 1(1-2):13–23.

[Kaspersky, 2018] Kaspersky (2018). Types of malware. Retrieved from: https://usa.kaspersky.com/resource-center/threats/malware-classifications.

[Khanse, 2014] Khanse, A. (2014). Evolution of malware. Retrieved from: http://www.thewindowsclub.com/evolution-of-malware-virus.

[Kilgallon et al., 2017] Kilgallon, S., De La Rosa, L., and Cavazos, J. (2017). Improving the effectiveness and efficiency of dynamic malware analysis with machine learning. In 2017 Resilience Week (RWS), pages 1–6. IEEE.

[Kirat et al., 2011] Kirat, D., Vigna, G., and Kruegel, C. (2011). Barebox: efficient malware analysis on bare-metal. In Proceedings of the 27th Annual Computer Security Applications Conference, pages 403–412. ACM.

[Kirat et al., 2014] Kirat, D., Vigna, G., and Kruegel, C. (2014). Barecloud: Bare-metal analysis-based evasive malware detection. In USENIX Security, volume 2014, pages 287–301.

[Kirillov et al., 2017] Kirillov, I., Beck, D., Chase, P., and Martin, R. (2017). Maec 5.0 specification, vocabularies. Retrieved from: http://maecproject.github.io/releases/5.0/MAEC_Core_Specification.pdf.

[Kolosnjaji et al., 2016a] Kolosnjaji, B., Zarras, A., Lengyel, T., Webster, G., and Eckert, C. (2016a). Adaptive semantics-aware malware classification. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 419–439. Springer.

[Kolosnjaji et al., 2016b] Kolosnjaji, B., Zarras, A., Webster, G., and Eckert, C. (2016b). Deep learning for classification of malware system call sequences. In Australasian Joint Conference on Artificial Intelligence, pages 137–149. Springer.

[Kolter and Maloof, 2004] Kolter, J. Z. and Maloof, M. A. (2004). Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM International Conference on Knowledge Discovery and Data Mining, pages 470–478. ACM.

[Kong and Yan, 2013] Kong, D. and Yan, G. (2013). Discriminant malware distance learning on structural information for automated malware classification. In Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining, pages 1357–1365. ACM.

[Kruegel et al., 2004] Kruegel, C., Robertson, W., Valeur, F., and Vigna, G. (2004). Static disassembly of obfuscated binaries. In USENIX Security Symposium, volume 13, pages 1–18.

[Landesman, 2017] Landesman, M. (2017). A brief history of malware. Retrieved from: https://www.lifewire.com/brief-history-of-malware-153616.

[Lemonnier, 2015] Lemonnier, J. (2015). What is malware, how malware works and how to remove it. Retrieved from: https://www.avg.com/en/signal/what-is-malware.

[LMSecurity, 2017] LMSecurity (2017). Static malware analysis. Retrieved from: http://resources.infosecinstitute.com/malware-analysis-basics-static-analysis/.

[Lo et al., 1995] Lo, R. W., Levitt, K. N., and Olsson, R. A. (1995). Mcf: A malicious code filter. Computers & Security, 14(6):541–566.

[McGraw and Morrisett, 2000] McGraw, G. and Morrisett, G. (2000). Attacking malicious code: A report to the infosec research council. IEEE Software, 17(5):33–41.

[Microsoft, 2010] Microsoft (2010). Metadata and the pe file structure. Retrieved from: https://msdn.microsoft.com/en-us/library/8dkk3ek4(v=vs.100).aspx.

[Microsoft, 2018] Microsoft (2018). Pe format. Retrieved from: https://docs.microsoft.com/en-us/windows/desktop/Debug/pe-format#import_directory_table.

[Microsoft, 2018] Microsoft (2018). Trojandownloader:win32/banload. Retrieved from: https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=TrojanDownloader:Win32/Banload.

[Moonsamy et al., 2011] Moonsamy, V., Tian, R., and Batten, L. (2011). Feature reduction to speed up malware classification. In Nordic Conference on Secure IT Systems, pages 176–188. Springer.

[Moser et al., 2007] Moser, A., Kruegel, C., and Kirda, E. (2007). Exploring multiple execution paths for malware analysis. In IEEE Symposium on Security and Privacy, pages 231–245. IEEE.

[Mujumdar et al., 2013] Mujumdar, A., Masiwal, G., and Meshram, D. B. (2013). Analysis of signature-based and behavior-based anti-malware approaches. International Journal of Advanced Research in Computer Engineering and Technology, 2(6):2037–2039.

[Narayanan et al., 2013] Narayanan, A., Chen, Y., Pang, S., and Tao, B. (2013). The effects of different representations on static structure analysis of computer malware signatures. The Scientific World Journal, 2013(13):1–8.

[Nari and Ghorbani, 2013] Nari, S. and Ghorbani, A. A. (2013). Automated malware classification based on network behavior. In International Conference on Computing, Networking and Communications (ICNC), pages 642–647.

[Nath and Mehtre, 2014] Nath, H. V. and Mehtre, B. M. (2014). Static malware analysis using machine learning methods. In Security in Computer Networks and Distributed Systems (SNDS), pages 440–450. Springer.

[Oktavianto and Muhardianto, 2013] Oktavianto, D. and Muhardianto, I. (2013). Cuckoo Malware Analysis. Packt Publishing Ltd., Birmingham.

[Paloalto Networks, 2016] Paloalto Networks (2016). What is malware? Retrieved from: https://www.paloaltonetworks.com/cyberpedia/what-is-malware.

[Pandey and Mehtre, 2014] Pandey, S. K. and Mehtre, B. M. (2014). A lifecycle based approach for malware analysis. In 4th International Conference on Communication Systems and Network Technologies, pages 767–771.

[Park et al., 2010] Park, Y., Reeves, D., Mulukutla, V., and Sundaravel, B. (2010). Fast malware classification by automated behavioral graph matching. In Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research, pages 45–49. ACM.

[Pascanu et al., 2015] Pascanu, R., Stokes, J. W., Sanossian, H., Marinescu, M., and Thomas, A. (2015). Malware classification with recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 1916–1920. IEEE.

[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

[Petri, 2015] Petri, C. (2015). Decision trees. Retrieved from: http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf.

[Phatak, 2018] Phatak, M. (2018). Undersampling to achieve better recall. Retrieved from: https://www.kaggle.com/madhukaraphatak/under-sampling-to-achieve-better-recall.

[Pirscoveanu et al., 2015] Pirscoveanu, R. S., Hansen, S. S., Larsen, T. M. T., Stevanovic, M., Pedersen, J. M., and Czech, A. (2015). Analysis of malware behavior: Type classification using machine learning. In International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), pages 1–7.

[Quist et al., 2011] Quist, D., Liebrock, L., and Neil, J. (2011). Improving antivirus accuracy with hypervisor assisted analysis. Journal in Computer Virology, 7(2):121–131.

[Radare2, 2018] Radare2 (2018). Radare2. Retrieved from: https://www.gitbook.com/download/pdf/book/radare/radare2book.

[Radware, 2017] Radware (2017). The history of malware. Retrieved from: https://www.radware.com/resources/malware_timeline.aspx.

[Raschka, 2014] Raschka, S. (2014). About feature scaling and normalization. Retrieved from: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#about-min-max-scaling.

[ReversingLabs, 2018] ReversingLabs (2018). Malware analysis platform. Retrieved from: https://www.reversinglabs.com/products/malware-analysis-appliance.html.

[Rhode et al., 2018] Rhode, M., Burnap, P., and Jones, K. (2018). Early-stage malware prediction using recurrent neural networks. Computers and Security, 77:578–594.

[Rieck et al., 2011] Rieck, K., Trinius, P., Willems, C., and Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4):639–668.

[Robiah et al., 2009] Robiah, Y., Rahayu, S. S., Zaki, M. M., Shahrin, S., Faizal, M., and Marliza, R. (2009). A new generic taxonomy on hybrid malware detection technique. arXiv preprint arXiv:0909.4860.

[Santos et al., 2013] Santos, I., Devesa, J., Brezo, F., Nieves, J., and Bringas, P. G. (2013). Opem: A static-dynamic approach for machine-learning-based malware detection. In International Joint Conference CISIS12-ICEUTE 12-SOCO 12 Special Sessions, pages 271–280. Springer.

[Sarkar, 2018] Sarkar, D. (2018). Understanding feature engineering (part 1) - continuous numeric data. Retrieved from: https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b.

[Saxe and Berlin, 2015] Saxe, J. and Berlin, K. (2015). Deep neural network based malware detection using two dimensional binary program features. In Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on, pages 11–20. IEEE.

[Saxe et al., 2013] Saxe, J., Mentis, D., and Greamo, C. (2013). Mining web technical discussions to identify malware capabilities. In 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops (ICDCSW), pages 1–5. IEEE.

[Saxe et al., 2014] Saxe, J., Turner, R., and Blokhin, K. (2014). Crowdsource: Automated inference of high level malware functionality from low-level symbols using a crowd trained machine learning model. In Malicious and Unwanted Software: The Americas (MALWARE), 2014 9th International Conference on, pages 68–75. IEEE.

[Schmidt et al., 2009a] Schmidt, A.-D., Bye, R., Schmidt, H.-G., Clausen, J., Kiraz, O., Yuksel, K. A., Camtepe, S. A., and Albayrak, S. (2009a). Static analysis of executables for collaborative malware detection on android. In Communications, 2009. ICC’09. IEEE International Conference on, pages 1–5. IEEE.

[Schmidt et al., 2009b] Schmidt, A.-D., Clausen, J. H., Camtepe, A., and Albayrak, S. (2009b). Detecting symbian os malware through static function call analysis. In Malicious and Unwanted Software (MALWARE), 2009 4th International Conference on, pages 15–22. IEEE.

[Schultz et al., 2001] Schultz, M. G., Eskin, E., Zadok, F., and Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages 38–49. IEEE.

[Searles et al., 2017] Searles, R., Xu, L., Killian, W., Vanderbruggen, T., Forren, T., Howe, J., Pearson, Z., Shannon, C., Simmons, J., and Cavazos, J. (2017). Parallelization of machine learning applied to call graphs of binaries for malware detection. In 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pages 69–77.

[Shafiq et al., 2009] Shafiq, M. Z., Tabish, S., and Farooq, M. (2009). Pe-probe: leveraging packer detection and structural information to detect malicious portable executables. In Proceedings of the Virus Bulletin Conference (VB), pages 29–33.

[Shanks, 2014] Shanks, W. (2014). Enhancing incident response through forensic, memory analysis and malware sandboxing techniques. SANS Institute, Mar, 25(1):1–36.

[Shevchenko, 2007] Shevchenko, A. (2007). The evolution of technologies used to detect malicious code. Retrieved from: https://securelist.com/the-evolution-of-technologies-used-to-detect-malicious-code/36177/.

[Siddiqui et al., 2009] Siddiqui, M., Wang, M. C., and Lee, J. (2009). Detecting internet worms using data mining techniques. Journal of Systemics, Cybernetics and Informatics, 6(6):48–53.

[Storlie et al., 2014] Storlie, C., Anderson, B., Vander Wiel, S., Quist, D., Hash, C., and Brown, N. (2014). Stochastic identification of malware with dynamic traces. The Annals of Applied Statistics, 8(1):1–18.

[Stuart, 2005] Stuart, T. (2005). Instruction scheduling. Retrieved from: http://www.cl.cam.ac.uk/teaching/2005/OptComp/slides/lecture14.pdf.

[Swain, 2009] Swain, B. (2009). What are malware, viruses, spyware, and cookies, and what differentiates them? Retrieved from: https://www.symantec.com/connect/articles/what-are-malware-viruses-spyware-and-cookies-and-what-differentiates-them.

[Symantec Corporation, 2017a] Symantec Corporation (2017a). What is malware and how can we prevent it? Retrieved from: https://us.norton.com/internetsecurity-malware.html.

[Symantec Corporation, 2017b] Symantec Corporation (2017b). When were computer viruses first written, and what were their original purposes? Retrieved from: https://us.norton.com/internetsecurity-malware-when-were-computer-viruses-first-written-and-what-were-their-original-purposes.html.

[Tang et al., 2011] Tang, Y., Xiao, B., and Lu, X. (2011). Signature tree generation for polymorphic worms. IEEE Transactions on Computers, 60(4):565–579.

[Thakur, 2016] Thakur, D. (2016). What is system call. Retrieved from: http://ecomputernotes.com/fundamental/disk-operating-system/what-is-system-call.

[ThreatExpert, 2017] ThreatExpert (2017). Retrieved from: http://www.threatexpert.com/.

[Tian et al., 2008] Tian, R., Batten, L. M., and Versteeg, S. (2008). Function length as a tool for malware classification. In 3rd International Conference on Malicious and Unwanted Software, pages 69–76. IEEE.

[Tian et al., 2010] Tian, R., Islam, R., Batten, L., and Versteeg, S. (2010). Differ- entiating malware from cleanware using behavioural analysis. In 5th International Conference on Malicious and Unwanted Software, pages 23–30.

[Tobiyama et al., 2016] Tobiyama, S., Yamaguchi, Y., Shimada, H., Ikuse, T., and Yagi, T. (2016). Malware detection with deep neural network using process behavior. In 40th Annual Computer Software and Applications Conference (COMPSAC), volume 2, pages 577–582.

[Ugarte-Pedrero et al., 2015] Ugarte-Pedrero, X., Balzarotti, D., Santos, I., and Bringas, P. G. (2015). Sok: deep packer inspection: a longitudinal study of the complexity of run-time packers. In IEEE Symposium on Security and Privacy (SP), pages 659–673. IEEE.

[Vadrevu and Perdisci, 2016] Vadrevu, P. and Perdisci, R. (2016). Maxs: Scaling malware execution with sequential multi-hypothesis testing. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 771–782. ACM.

[Vanderbruggen and Cavazos, 2017] Vanderbruggen, T. and Cavazos, J. (2017). Large-scale exploration of feature sets and deep learning models to classify malicious applications. In Resilience Week (RWS), 2017, pages 37–43. IEEE.

[Wicherski, 2009] Wicherski, G. (2009). pehash: a novel approach to fast malware clustering. In Proceedings of the 2nd USENIX conference on Large-scale Exploits and Emergent Threats, pages 1–8.

[Willems et al., 2007] Willems, C., Holz, T., and Freiling, F. (2007). Toward automated dynamic malware analysis using cwsandbox. IEEE Security & Privacy, pages 32–39.

[Yan et al., 2008] Yan, W., Zhang, Z., and Ansari, N. (2008). Revealing packed malware. IEEE Security & Privacy, pages 65–69.

[Yavvari et al., 2012] Yavvari, C., Tokhtabayev, A., Rangwala, H., and Stavrou, A. (2012). Malware characterization using behavioral components. In International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, pages 226–239. Springer.

[Ye et al., 2007] Ye, Y., Wang, D., Li, T., and Ye, D. (2007). Imds: Intelligent malware detection system. In Proceedings of the 13th ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pages 1043–1047. ACM.

[Yin et al., 2007] Yin, H., Song, D., Egele, M., Kruegel, C., and Kirda, E. (2007). Panorama: capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 116–127. ACM.

[Zelenski and Parlante, 2008] Zelenski, J. and Parlante, N. (2008). Computer architecture: Take i. Retrieved from: https://see.stanford.edu/materials/icsppcs107/12-Computer-Architecture.pdf.

[Zeltser, 2001] Zeltser, L. (2001). How antivirus software works: Virus detection techniques. Retrieved from: http://searchsecurity.techtarget.com/tip/How-antivirus-software-works-Virus-detection-techniques.

[Zeltser, 2011] Zeltser, L. (2011). Process monitor filters for malware analysis and forensics. Retrieved from: https://zeltser.com/process-monitor-filters-for-malware-analysis/.
