Applying Supervised Learning on Malware Authorship Attribution

RADBOUD UNIVERSITY NIJMEGEN
MASTER THESIS

Applying Supervised Learning on Malware Authorship Attribution

Author: Coen BOOT
Supervisors: Dr. Ir. Erik POLL, Alexandru C. SERBAN

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in the Digital Security Group, Institute for Computing and Information Sciences

May 1, 2019

Abstract

Faculty of Science, Institute for Computing and Information Sciences
Master of Science

Malware is a problem in today's digital society, since it can cause economic or physical damage and ultimately disrupt society as a whole. To fight cyber threats effectively by imposing (legal) consequences on the actor behind the malware, it is important to be able to provide a certain degree of proof about who is responsible for it. The process of linking an author to an asset is called authorship attribution. In the case of malware, attribution needs to be based on binary executables, since the source code is mostly unavailable.

This thesis focuses on evaluating and comparing two promising approaches for performing authorship attribution on malware. These approaches are based on two supervised learning algorithms, namely a neural network and a random forest classifier. Both approaches use automatically generated analysis reports from a sandbox solution as input data.

Malware can be divided into two types with respect to the actors behind it: state-sponsored or criminal. This thesis focuses on the first type, since state-sponsored malware has a richer and clearer hierarchy of authorship (i.e. country-level and APT group-level) than criminal malware, which is often attributed to a group of individuals that is not explicitly related to a nation-state.

Since no suitable dataset containing state-sponsored malware was available yet, we collected one using a newly devised method based on indicators of compromise found in threat intelligence reports. In this way we collected a dataset of 3,594 state-sponsored malware samples, which forms the first publicly available dataset of its kind. We ran the retrieved samples through two sandboxes, Cuckoo and VMRay, to generate reports about them. Moreover, we downloaded the reports belonging to the samples from VirusTotal as well. All these reports were converted to a bag of words, after which they were used as input for the classification algorithms.

We evaluated the two approaches on the dataset and found that both perform well (with accuracy results up to 98.8%) and match the performance described in the original papers. The neural network-based approach tends to perform slightly better than the approach based on a random forest classifier, whereas the latter takes considerably less time to train.

A trained classification algorithm contains knowledge about the characteristics of the different classes, since it needs to decide to which class an unseen sample belongs. We extracted this information, attempting to discover new insights and to verify whether the classifier makes sensible decisions.
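
The conversion of sandbox reports to a bag of words mentioned in the abstract can be illustrated with a minimal sketch. It is not the code used in this thesis: the token lists are hand-written placeholders for API call names extracted from a report, the helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and the use of occurrence counts (rather than, for example, binary presence) is an assumption.

```python
from collections import Counter


def build_vocabulary(reports):
    """Map every distinct token seen in the reports to a fixed column index."""
    vocabulary = {}
    for tokens in reports:
        for token in tokens:
            # First-seen order determines the column of each token.
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bag_of_words(tokens, vocabulary):
    """Return a count vector for one report; unknown tokens are ignored."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical token lists standing in for parsed sandbox reports.
reports = [
    ["CreateFileW", "WriteFile", "RegSetValueExW", "WriteFile"],
    ["CreateFileW", "InternetOpenA", "InternetOpenA"],
]
vocabulary = build_vocabulary(reports)
vectors = [to_bag_of_words(tokens, vocabulary) for tokens in reports]
```

Each report becomes a fixed-length vector with one column per token in the vocabulary, which is the kind of numeric input a neural network or a random forest classifier can be trained on. Token order within a report is discarded by construction.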

Contents

Abstract
1 Introduction
  1.1 Research Questions
  1.2 Structure of this Thesis
2 Background and Related Work
  2.1 Legal Goals of Attribution
    2.1.1 Distinctions in Types of Attribution
    2.1.2 State-sponsored or not?
    2.1.3 Legal Goals
  2.2 Complications Inherent to Malware
    2.2.1 Unavailability of Source Code
    2.2.2 Hiding Intents
      Static Methods
      Dynamic Methods
    2.2.3 Fake Traces
  2.3 Technical Means for Authorship Attribution and Family Classification
    2.3.1 Approaches Focused on Family Classification
    2.3.2 Approaches Focused on Authorship Attribution
    2.3.3 Approaches Using Machine Learning
    2.3.4 Selected Approaches
3 Used Malware Classification Techniques
  3.1 Classification Algorithms
    3.1.1 Supervised Learning
      Methodology
      Overfitting and Underfitting
    3.1.2 Deep Artificial Neural Network
      Structure
      Basic Principles
      Peculiarities and Parameters
    3.1.3 Random Forest Classifier
      Structure
      Basic Principles
      Peculiarities and Parameters
  3.2 Classification Performance Metrics
    3.2.1 Recall / True Positive Rate (TPR)
    3.2.2 False Positive Rate (FPR)
    3.2.3 Precision
    3.2.4 F1-Score
    3.2.5 ROC-Curve
    3.2.6 Accuracy
4 Collecting a Dataset of State-Sponsored Malware
  4.1 Collection Method
    4.1.1 Collecting Samples
    4.1.2 Collecting Sandbox Reports
  4.2 Data Preprocessing
    4.2.1 Duplicate Detection
    4.2.2 Filtering Sandbox Reports
    4.2.3 Extracting API Calls
    4.2.4 Creating a Bag of Words
  4.3 Overview of the Collected Dataset
  4.4 Dealing with Imbalanced Datasets
5 Experimental Data & Setup
  5.1 Constructing Training and Test Sets
  5.2 Tested Scenarios
  5.3 Sandboxes Used
    5.3.1 Cuckoo
    5.3.2 VirusTotal
    5.3.3 VMRay
  5.4 Overview of Training and Test Sets
    5.4.1 Used Variants
    5.4.2 k-Fold Cross-Validation
6 Evaluation of the Neural Network-based Approach (Ro18)
  6.1 Configuration of the Neural Network
  6.2 Results
    6.2.1 Country-Level Authorship Attribution with Unseen APT Groups
    6.2.2 Country-Level Authorship Attribution with Earlier Seen APT Groups
    6.2.3 APT Group-Level Authorship Attribution
  6.3 Preliminary Conclusions
7 Evaluation of the RFC-based Approach (Am17)
  7.1 Configuration of the Random Forest Classifier
  7.2 Results
    7.2.1 Country-Level Authorship Attribution with Unseen APT Groups
    7.2.2 Country-Level Authorship Attribution with Earlier Seen APT Groups
    7.2.3 APT Group-Level Authorship Attribution
  7.3 Preliminary Conclusions
8 Comparing the Two Approaches
  8.1 Comparison with Respect to Algorithm
  8.2 Comparison with Respect to Sampling
  8.3 Comparison with Respect to Metrics
  8.4 Comparison with Respect to Sandboxes
9 Extracting Human-Interpretable Characteristics
  9.1 Knowledge Extraction on Random Forest Classifiers
    9.1.1 Feature Importance
    9.1.2 Manual Decision Tree Analysis
  9.2 Knowledge Extraction on Neural Networks
10 Complications Faced and Discussion
  10.1 Complications Faced
  10.2 Discussion
11 Conclusions and Future Work
  11.1 Conclusions
  11.2 Future Work
Bibliography
A Additional Results of the Neural Network-based Approach
  A.1 Country-Level Authorship Attribution with Unseen APT Groups
  A.2 Country-Level Authorship Attribution with Earlier Seen APT Groups
  A.3 APT Group-Level Authorship Attribution
B Additional Results of the RFC-based Approach
  B.1 Country-Level Authorship Attribution with Unseen APT Groups
  B.2 Country-Level Authorship Attribution with Earlier Seen APT Groups
  B.3 APT Group-Level Authorship Attribution
C Most Important Features in a Trained RFC
D Extracted Decision Tree

Chapter 1
Introduction

Malware is a problem in current digital society, since it can cause economic or physical damage and in the end disrupt society as a whole.
Although anti-virus solutions attempt to withstand the massive flows of malicious software, they fail to block every single piece of malware. This is because they often look for exact fingerprints and cannot recognize characteristics that merely resemble known malware. This makes it possible to create malware that successfully infects a machine and performs its (illicit) tasks. Malware is not only used by cyber criminals. State-sponsored Advanced Persistent Threat (APT) groups use malware as well to spy on their targets, contributing to an even more tumultuous cyber
