Early Stage Classification using Behavior Analysis

A thesis submitted in partial fulfillment of the requirements

for the degree of Master of Technology

by

Mugdha Gupta

Department of Computer Science And Engineering

INDIAN INSTITUTE OF TECHNOLOGY KANPUR

June 2018

Abstract

Name of the student: Mugdha Gupta Roll No: 16111041

Degree for which submitted: M.Tech. Department: Computer Science and Engineering

Thesis title: Early Stage Malware Classification using Behavior Analysis

Thesis supervisor: Dr. Sandeep Shukla

Month and year of thesis submission: June 2018

In recent years, there has been an exponential growth in the number of malware samples captured and analyzed by antivirus companies. However, most of these are variants of already known malware. Thus, it has become necessary to determine whether a sample belongs to a known family, or exhibits new behavior hitherto unseen and requires further analysis.

Existing traditional approaches used by antivirus companies are based on signature detection and can be thwarted by zero-day exploit-based malware. Manual examination of such executables is extremely cumbersome due to their enormous number. It has also become necessary to speed up the detection process and make a prediction before the executable releases its malicious payload. In this work, we address all of the above issues using automated yet efficient malware analysis. We classify malicious executables into different malware classes in the earliest possible time using dynamic analysis, which provides useful insights for obfuscated or packed malware where static analysis fails. Our experiments achieve an accuracy of 98.02% for classifying malware into classes within the first 4 seconds of execution using XGBoost. We also classified samples which were not seen by the classifier before, thus attempting to classify zero-day malware. Our solution is robust and scalable, as we have increased the number of samples used during analysis compared to prior work and reduced the execution time drastically. Our solution is also efficient, since the state-of-the-art accuracy for early-stage malware detection is 91% for the first 4 seconds of execution and 96% for the first 19 seconds, using recurrent neural networks.

Acknowledgements

I would like to express my profound gratitude to Dr. Sandeep Shukla for guiding me in this project. I would also like to thank Pranjul Ahuja, Bhaskar Mukhoty and Rohit Singh Kharanghar for their help and support whenever I needed it. I am grateful to my parents and my siblings for the immense love they have given me.

I am thankful to the Virustotal community for generously providing me access to their private API. I would also like to take this opportunity to thank CDAC Mohali for their help in building the dataset and TCG Digital for their support in creating the virtual network.

Contents

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

1 Introduction
   1.1 Need for Malware Classification

2 Background
   2.1 Malware and its classes
   2.2 Malware Nomenclature
   2.3 Available Defenses
   2.4 Malware Analysis techniques
       2.4.1 Static Analysis
           2.4.1.1 Limitations of Static Analysis
       2.4.2 Dynamic Analysis

3 Past Work
   3.1 Static analysis based feature extraction
   3.2 Dynamic analysis based feature extraction
   3.3 Time efficient detection
   3.4 Goals of this thesis

4 Machine Learning Background
   4.1 Classifiers
   4.2 Handling Imbalanced Data
   4.3 Cross Validation
   4.4 Evaluation Metrics
       4.4.1 Confusion Matrix


5 Classification of Existing malware
   5.1 Architecture of classification system
       5.1.1 Dataset collection, Generation and Labeling
           5.1.1.1 Dataset collection
           5.1.1.2 Dataset generation
           5.1.1.3 Labeling
       5.1.2 Feature Extraction
           5.1.2.1 Network related features
           5.1.2.2 Process related features
           5.1.2.3 API bins
           5.1.2.4 Signatures
       5.1.3 Training and Testing
       5.1.4 Comparison to Existing Approaches

6 Classification of Zero Day malwares
   6.1 Architecture
       6.1.1 Dataset Collection and Generation
       6.1.2 Feature Extraction
       6.1.3 Handling Imbalanced Data
       6.1.4 Training and Testing

7 Scope And Future Work
   7.1 Building a Hierarchical model
   7.2 Sliding window based approach for classification
   7.3 Building a robust classification system

A Appendix A

Bibliography

List of Figures

1.1 Growth of malware over the years [4]

2.1 Naming convention used by Microsoft [18]

4.1 Neural network with one hidden layer [20]
4.2 SMOTE oversampling technique [3]
4.3 Tomek undersampling technique [3]
4.4 K-Fold Cross Validation [2]
4.5 Confusion Matrix [1]

5.1 Architecture of our classification system
5.2 Cuckoo Architecture [6]
5.3 Protocol Hierarchy of a malware using Wireshark
5.4 LLMNR Poisoning [16]
5.5 TLS Connections
5.6 HTTP Requests by Sventore.A malware
5.7 Frequency of API Calls in each bin
5.8 Shortcuts created by worm family Yuner
5.9 Registry Keys modified by Backdoor Agent malware to install itself at startup
5.10 Polymorphic nature exhibited by malware Yuner
5.11 Exception raised by malware Renos
5.12 Confusion Matrix - XGBoost

6.1 Architecture of our classification system
6.2 tSNE - Test Set
6.3 Imbalanced Virus families
6.4 Imbalanced Trojan families
6.5 Confusion Matrix - XGBoost

List of Tables

3.1 Summary - Dynamic analysis based feature extraction

5.1 Dataset
5.2 Testing accuracy - Simple Neural Network, for various optimizers and loss functions
5.3 Test Results for all classifiers
5.4 Comparison to previous approaches

6.1 Number of samples in Training and Testing Set
6.2 Families in Training and Testing Set
6.3 Number of samples in Training Set after SMOTE
6.4 Accuracy for each type with corresponding FPR

Dedicated to my parents

Chapter 1

Introduction

1.1 Need for Malware Classification

The rise of the Internet has profoundly affected our day-to-day life. From buying products and online banking to entertainment and social networking, it has made our lives much easier. With the ease of information flow, organizations are increasingly connecting to the Internet and becoming transparent about their operations and resources. But as the Internet economy has grown, more serious cyber crimes have evolved. Almost every device, from mobile phones and laptops to large systems such as power grids and nuclear plants, is subject to cyber attacks. Among the most serious cyber threats is malware, which evolves daily and has the capacity to disrupt every sector. According to reports published by the AV-Test institute [4], there has been tremendous growth in the number of malicious samples, as shown in Figure 1.1, with over 250,000 new malicious samples registered every day. Analyzing these samples manually using reverse engineering and disassembly is a tedious and cumbersome task that security analysts are reluctant to perform at scale. Thus there is a dire need for automated malware analysis systems which produce reliable results with minimal human intervention. Antivirus systems use the most common and primitive approach, which involves generating signatures of known malware beforehand and then comparing newly downloaded executables against these signatures. This technique fails for zero-day malware, i.e., malware which is newly created and for which no signature is available. Other common techniques are static analysis and dynamic analysis. Static analysis examines an executable without executing it; it is generally used because it is relatively fast, but it fails if the malware is packed, encrypted or obfuscated. As a result, researchers have turned to dynamic analysis, which

involves collecting behavioral data by executing the sample in a sandboxed environment and then using it for detection and classification.

FIGURE 1.1: Growth of malware over the years [4]

Malware attacks, which include malicious programs being downloaded by victims from a website onto their systems, or worm-like malware which propagates laterally in a network by taking advantage of weaknesses in protection mechanisms, are becoming extremely common nowadays. Adware, which simply displays malicious ads on a victim's system, is very different from a worm like Stuxnet, which disrupted Iran's nuclear plants and hampered services of national importance. These samples not only threaten the privacy and availability of systems but also affect their integrity, leading to national security concerns. Thus it has become necessary to classify malicious samples into their respective families and classes so that attacks can be responded to accordingly.

In this work, we created a system which can classify known and unknown malicious samples into their classes with better precision than any existing system in the literature.

Chapter 2

Background

2.1 Malware and its classes

The term malware refers to malicious software which is used to gain unauthorized access to a victim's computer, steal sensitive information, or disrupt its operation. We now briefly discuss some common malware classes. These classes are neither mutually exclusive nor exhaustive and may share similar exploitation tactics.

• Trojans: often referred to as trojan horses, these are the most common type of malware. These files seem benign but often have hidden purposes, i.e., to install spyware, keyloggers or other malware, infect system files, etc. They generally trick victims using social engineering (e.g., phishing) into loading and executing them on their systems. They are usually hidden in email attachments, web browser plugins for a game, fake copies of expensive genuine software, etc. Some common examples of trojan horses are Startpage, Banker, Delf, etc. [32]

According to the types of actions they can perform on a victim's system, trojans are broadly classified into the following categories:

– Trojan Downloader: These are generally used to download and install new versions of malicious software onto a victim's system. They copy themselves into hidden files and modify registry keys so that the malicious files are reinstalled at system startup. The only way to remove them is to identify and delete those hidden files and registry keys. [32]


– Trojan Dropper: These are generally used either to install other kinds of malware, such as viruses and worms, or to hide the activities of malicious programs from antivirus systems. Once they have delivered the malicious payload onto a victim's system, they cease to work, as their primary objective has been fulfilled. [32]

– Trojan Spy: As the name suggests, these are generally used to spy on victims by tracking their data, taking screenshots and sending them to their command and control (C&C) servers periodically. [32]

– Trojan clickers: These types of trojans generally reside in a victim system's memory and regularly connect to a few websites to enhance their creator's revenue on a pay-per-click basis. [32]

• Virus: These types of malware usually insert unwanted code into other programs or executables. During each execution of the host program, the code added by the virus runs and in turn adds more unwanted code, either to the same program or to other programs present on the system, leading to the corruption of files. Some of the most common types of viruses are Expiro, Virut, etc. [31]

• Worms: Unlike viruses, worms spread by exploiting network vulnerabilities. Usually, they aim to exhaust the system's resources; they do so by sitting in main memory and performing the unwanted actions of replicating and spreading themselves. There is also another type of worm that carries malicious payloads, used either to steal sensitive information or to install other kinds of malware. Some common worms are Allaple, Vobfus, etc. [31]

• Backdoor: These types of malware usually provide remote access to a compromised computer to an illegitimate user by exploiting security vulnerabilities in the system. Using a backdoor, a person can perform any number of activities on the victim's system, such as installing illegal software like keyloggers, leaking sensitive information, infecting other hosts on the same network, etc. Some common backdoors include Rbot, Hupigon, Bifrose, etc. [30]

• Virtool: Virtools are software tools which are not generally malicious but have the potential to compromise a user's security. They are also called riskware because they can be used to access a user's computer and perform malicious activities. Malware authors generally use these tools to hide the actual malware from antivirus agencies. [28]

• PWS: PassWord Stealer is a family of malware which steals confidential information from users, typically online banking usernames and passwords.

2.2 Malware Nomenclature

There is no standard convention followed while assigning labels to malicious executables; every antivirus agency follows a different scheme. In this work, we have used the labels provided by Microsoft, whose convention is given below. Microsoft names these executables as per Figure 2.1, following the Computer Antivirus Research Organization (CARO) malware naming scheme.

FIGURE 2.1: Naming convention used by Microsoft [18]

For example, from Worm:Win32/Allaple.A, we can easily interpret that Worm is the type, Win32 (Windows) is the platform, Allaple is the malware family and A is the variant. A small parsing sketch is given after the component list below.

A detailed explanation of label components is as follows:

• Type: It determines the activity performed by the malware on the victim's system. Usually, it is among the classes discussed in the previous section.

• Platform: It determines the operating system on which the executable exhibits its malicious behavior. The platform also indicates the file format and extensions used by the malware.

• Family: It is the grouping of malware based on common characteristics. In most cases malware authors reuse code belonging to an existing family; thus there is a high chance that samples belonging to the same family share code similarity and require similar detection and removal methods.

• Variant: It is used to identify the different versions of the same malware family.

• Information: It is used to provide additional information about an executable, for example whether it is packed or compressed, developed using an existing toolkit, or whether it has a rootkit or plugin component.
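To make the convention concrete, the following minimal sketch (our illustration, not Microsoft's own tooling) splits a Microsoft-style label into the components described above using a regular expression; the optional !information suffix follows the CARO convention.

import re

# Pattern for Type:Platform/Family[.Variant][!Information] labels.
NAME_RE = re.compile(
    r"^(?P<type>[^:]+):(?P<platform>[^/]+)/(?P<family>[^.!]+)"
    r"(?:\.(?P<variant>[^!]+))?(?:!(?P<info>.+))?$"
)

def parse_label(label):
    m = NAME_RE.match(label)
    return m.groupdict() if m else None

print(parse_label("Worm:Win32/Allaple.A"))
# {'type': 'Worm', 'platform': 'Win32', 'family': 'Allaple', 'variant': 'A', 'info': None}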

In this thesis, we have mainly used malware labeled up to its family, or up to its variant in a few cases. Also, we have used only Windows 32-bit executables for our analysis, to narrow the scope of our work. However, similar techniques can be used for malware designed for other platforms.

2.3 Available Defenses

• Antivirus Systems: These systems are the most primitive defense against malware. Previously, antivirus engines were entirely based on static fingerprints of malware created by security experts, which slowly became a bottleneck. This technique fails to identify zero-day malware, so security experts started exploring other techniques, including heuristic-based detection and machine learning. They incorporated heuristic analysis into the signature matching technique to raise warnings against possible threats. However, statically examining files can sometimes inadvertently mark a legitimate file as malicious (false positives). [12]

• Intrusion Detection/Prevention Systems: IDPS are generally used to monitor network traffic by examining packets closely against known threats and alerting the user in case any anomaly is found. They provide protection against many malicious activities, hackers breaching security and users violating access policies. An IPS is similar to an IDS but with an active defense: it not only monitors the network or system but also responds to immediate threats. [14]

• Firewalls: Firewalls monitor and control incoming and outgoing network traffic using predefined rules. They are often categorized into network firewalls, which filter traffic between two or more networks, and host-based firewalls, which control traffic moving in and out of a system. [11]

There are many other available defenses such as anti-spam and anti-phishing software, authentication, authorization, etc.

2.4 Malware Analysis techniques

Since the techniques discussed above require a lot of man-hours, researchers have started automating feature extraction processes, which comprise static and dynamic analyses. Moreover, applications of machine learning techniques have shown positive results and thus encouraged further development of these automated malware analysis systems.

The two techniques which are most commonly used in these systems are Static Analysis and Dynamic or Behavioral Analysis.

2.4.1 Static Analysis

As the name suggests, this analysis is done without executing the samples. This technique is generally popular because it requires less time and fewer resources. It is easy to perform if the code is available; if not, we can examine the hex code, or disassemble the binary and examine its assembly code. Below are some of the prominent techniques used for feature extraction via static analysis; a short sketch follows the list.

• We can get the sequences of printable characters in the binary using the strings utility, which is present in GNU Binutils. Analyzing its output can sometimes give useful information about the action a binary is trying to perform. For example, if a binary is trying to connect to its command and control server, an IP address or port numbers will appear in the output.

• We can also extract useful information from the Portable Executable (PE) header. PE is the standard binary format for any Windows executable or DLL. Fields present in the COFF header (which sits in the PE file header along with the MS-DOS stub, PE signature and optional header), such as the number of sections or the size of the optional header, can serve as important features in detection.

• We can perform a comprehensive analysis of the executable by first disassembling the machine code into assembly and then applying n-grams (continuous, overlapping sequences of n items) to the opcodes.

• A few researchers are working on image representations of executables, applying classifiers to them for detection and classification.
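As an illustration of the first two techniques above, the sketch below extracts printable strings and a few COFF header fields; it assumes the third-party pefile package, and the feature names are our own illustrative choices, not a standard.

import re
import pefile  # third-party: pip install pefile

def printable_strings(path, min_len=4):
    # Rough analogue of the GNU strings utility: runs of printable ASCII bytes.
    with open(path, "rb") as f:
        data = f.read()
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

def pe_header_features(path):
    # A few COFF header fields that can serve as static features.
    pe = pefile.PE(path)
    return {
        "num_sections": pe.FILE_HEADER.NumberOfSections,
        "size_optional_header": pe.FILE_HEADER.SizeOfOptionalHeader,
        "timestamp": pe.FILE_HEADER.TimeDateStamp,
    }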

2.4.1.1 Limitations of Static Analysis

Static analysis may seem promising for detection; however, there are a few techniques available which can evade such analysis. Some of them are discussed below.

• Polymorphism means changing the appearance of a binary so that it can evade any technique which uses pattern matching. Generally, packers and crypters are used to create such malware by packing the binary and attaching an unpacking subroutine on top of it. When the sample is executed, the subroutine unpacks the binary in memory only and passes execution to the start of the unpacked code. The only way to detect it is to perform memory analysis and search for its signatures in memory. [49]

• Metamorphism in malware analysis refers to self-modifying binary executables. In each execution of the binary, it inserts or modifies a few instructions in its code, changing its signature; however, the basic functionality remains the same. Common methods include insertion of NOP instructions, changing variable names, permuting registers, or replacing instructions with equivalent ones. [49]

2.4.2 Dynamic Analysis

Dynamic analysis involves the collection of behavioral data during the execution of a sample and then using it for detection and classification. It is widely used because researchers believe that malware cannot achieve its aims without leaving a sufficient footprint behind. This analysis is not prone to polymorphism or metamorphism, since it is completely independent of the source code. The most commonly used techniques in this analysis are described below; a short report-parsing sketch follows the list.

• System call monitoring: It involves recording the API calls made to the operating system's kernel. We can capture the side effects of a program through its system calls only while it executes in user mode. Close observation of system calls, and of the sequence in which they execute, often reveals the malicious intentions of executables. However, this technique will not work if the system is in kernel mode.

• Information flow tracking: This technique is mainly used to check whether sensitive information is handled in a wrong way. It focuses on two concepts: taint sources and taint sinks. A taint source generates tainted variables comprising sensitive information such as browser data, while a taint sink raises an alert when tainted information passes through it, for example any information sent over the network.

• Instruction trace: This technique involves capturing the sequence in which instructions are executed by the malware with the help of control flow graphs.

• Tracking machine activity: This technique works by closely examining machine activity, such as RAM used (memory and swap), the maximum process id, the percentage of CPU utilization, etc., during the execution of a sample. [46]

• Memory forensics: There is a separate line of research in which researchers try to predict malware by generating a memory dump of the sample. A clean snapshot of memory is taken before execution, which is compared with the dump after execution using frameworks such as Volatility. However, this technique consumes a lot of disk space. Malware often check whether they are executing in a sandbox environment by querying system components such as the amount of RAM available. To avoid detection, researchers provide ample RAM (up to 2 GB) while analyzing the sample, which in turn increases the size of the memory dump generated per sample (often 2 GB).

• Capturing file and registry modifications: To alter the system's behavior or to execute programs at startup, malware performs registry changes. Keeping track of such changes thus provides substantial information for detection. Furthermore, capturing every file operation (read, write, modify, delete) performed by malicious executables can also prove to be a good feature for detection and classification.
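As a simple illustration of system call monitoring, the sketch below flattens the API calls recorded in a sandbox report into a single sequence; it assumes the JSON layout of Cuckoo 2.x reports (behavior, then processes, then calls), so other sandboxes or versions will need different keys.

import json

def api_call_sequence(report_path):
    # Collect the API names recorded for every monitored process.
    with open(report_path) as f:
        report = json.load(f)
    calls = []
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            calls.append(call.get("api"))
    return calls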

Chapter 3

Past Work

3.1 Static analysis based feature extraction

Kolter et al. [41] used n-grams of opcodes as features and performed experiments on 1,971 benign and 1,651 malicious executables. Their approach resulted in 255 million distinct n-grams, from which they selected the most important ones and applied various machine learning classifiers such as Naive Bayes, decision trees, SVM and boosting. Their results show that boosted decision trees outperformed the others with an area under the ROC curve of 0.996 (the ROC curve plots the true positive rate against the false positive rate at various thresholds; the AUC, the area under the ROC curve, summarizes test accuracy). They then applied boosting to classification based on payload function, for example whether the malware creates a backdoor or does mass mailing, and obtained an area under the ROC curve of 0.90.

Kong et al. [42] created a framework based on function call graphs, capturing the structural information of malware. They evaluated similarity by applying discriminant distance metric learning, which grouped families belonging to the same class into clusters while keeping a marginal distance between them. They then used an ensemble of classifiers (KNN and SVM) to perform the classification. Their dataset contains 526,179 packed and unpacked malware samples; however, they only used the unpacked samples belonging to 11 malware families.

Tian et al. [52] used the frequency of function lengths as a feature for classifying trojans. They used various machine learning algorithms from the WEKA library for classification, with an average accuracy of 0.8776 on 721 samples. However, their approach relies on unpacking the samples before feature extraction.



Saxe and Berlin [48] performed binary classification and achieved a true positive rate of 95.2% using a deep feed-forward neural network on 431,926 binaries, with 81,910 labeled as benignware and 350,016 as malware. However, their true positive rate dropped to 67.7% when the model was trained on files discovered before a certain date and tested on files discovered after that date. This shows the inefficiency of static models in detecting completely novel malware families.

Damodaran et al. [36] performed a comparative study of static, behavioral and hybrid analysis models using Hidden Markov Models and found behavioral models to outperform the rest, with the highest area under the curve (AUC) value (0.98), on 745 malware samples belonging to 6 families and 40 benign samples.

Grosse et al. [39] showed that obfuscated samples can drastically reduce the accuracy of static models from 97% to 20%. Training with a few obfuscated samples partially recovered the accuracy; however, it did not increase beyond a certain limit. They performed experiments on 123,453 benign and 5,560 malicious Android applications.

Static data, being relatively fast to collect, is still the first choice of many researchers. Although it performs relatively poorly on obfuscated and packed samples, some papers have found a workaround by using entropy-based features, while others chose to unpack the samples before analysis using known packers available in the market.

3.2 Dynamic analysis based feature extraction

Due to the recent surge in obfuscated, packed and encrypted executables, researchers have started relying on behavior analysis techniques for detection and classification more than on static ones. Among these, the most common technique is to use API calls for prediction.

Firdausi et al. [38] analyzed the behavior of samples by executing them in the Anubis sandbox and then preprocessed the generated reports into sparse vector models for classification. They performed a comparative study of 5 classifiers, namely Naive Bayes, decision tree, SVM, multilayer perceptron neural network and kNN, using correlation-based feature selection, and achieved an accuracy of 96.8% with a false positive rate of 2.4% on 220 malware and 250 benign samples.

Nari et al. [44] used network behavior to classify malware into their respective families. They extracted network flows from the pcap files and created behavior graphs depicting the network activity of an executable and the dependencies between network flows via protocols. They collected various graph properties, such as average out-degree, maximum out-degree and graph size, and used them as features for training various machine learning algorithms.

Tobiyama et al. [54] proposed a stepwise application of deep neural networks to classify malware. They used process behavior (API call sequences) to generate log files and then used an RNN to extract feature vectors. They then converted these vectors to feature images and used a CNN to classify them. They obtained an ROC AUC of 0.96 in the best case on 81 malware and 69 benign samples.

Ahmed et al. [34] used spatio-temporal information of API calls with a Naive Bayes classifier to classify malicious executables into trojans, viruses and worms. They also identified changes in accuracy when only the spatial (arguments) or only the temporal (sequences) information of API calls was used for prediction. They reported that by monitoring only memory-specific and file I/O system calls, they could achieve accuracy as high as 97% on 416 malware samples.

Tian et al. [53] applied various pattern recognition techniques and statistical methods to API calls to perform binary classification as well as classification into families. They recorded the frequency of API calls globally as well as per file and achieved accuracy close to 97% using RandomForest.

Kolosnjaji et al. [40] also worked on API call sequences and constructed a neural network comprising convolutional and recurrent layers. They achieved on average 85.6% precision and 89.4% recall (precision and recall are explained in detail in the "Evaluation Metrics" section of the next chapter).

Table 3.1 shows a summary of the most representative previous work using behavioral data as features. Where no time cap is mentioned, we believe the samples were fully executed.

Author | Accuracy          | Analysis Time         | Dataset
[54]   | 0.96 AUC          | 5 minutes             | 81 malware and 69 benign
[38]   | 96.8%             | no time cap mentioned | 220 malware and 250 benign
[44]   | 0.945 ROC area    | 15 minutes            | 3,768 samples (13 malware families)
[34]   | 98%               | no time cap mentioned | 416 malware (trojans, worms, viruses)
[52]   | 97%               | 30 seconds            | 1,368 malware and 456 benign
[40]   | 85.6% (precision) | no time cap mentioned | 4,753 in 10 clusters

TABLE 3.1: Summary - Dynamic analysis based feature extraction


3.3 Time efficient detection

All previous work focusing on early-stage detection rests on one of two ideas: either omit some components from the data collection process, or stop the process early.

Shibahara et al. [51] reduced the total time required for analysis by 67.1%, compared to methods requiring full analysis (up to 15 minutes), by detecting changes in network communication.

Neugschwandtner et al. [45] first determined whether a malware sample is a known variant of existing malware, or to which behavioral class it belongs, using a clustering algorithm on static features; only if the sample was found unlikely to resemble any existing class did they execute it. Their work showed significant improvement in accuracy over randomly selecting samples to execute or selecting based on sample diversity.

Bayer et al. [35] used behavior profiles of malware to avoid full analysis in case the malware is a mutation of an existing polymorphic malware. In an experiment conducted on 10,922 executable files, they were able to avoid full analysis in 25% of cases.

Das et al. [37] proposed a comprehensive rule-based model which groups system calls using their arguments and return values for feature construction. Their model can detect 47% of malicious samples within 30% of their execution time and 98% after full execution.

Rhode et al. [46] predicted malicious behavior during execution, unlike others which perform analysis post-execution, and obtained 91% accuracy in the first 4 seconds of execution and up to 96% in the first 20 seconds. They used various machine activity features, such as the percentage of CPU usage and the available memory and swap space, with recurrent neural networks as the model. To our knowledge, this is the best accuracy in the literature for early-stage detection.

3.4 Goals of this thesis

Most existing systems, with the exception of [46], examine executables for at least 60 seconds and up to their complete execution period; by then, however, the malware can cause tremendous harm to the victim's system. Other common problems include inefficiency in classifying new malware and using very few samples for classification. This thesis aims to solve the following problems:

• Classifying any known/unknown malicious samples into its types/classes in the earliest possible time

• Using machine learning as a robust prediction tool for classification since current threats can spread faster than defenses can react.

• Building predictive models on a large dataset for efficient classification between various malware types

Chapter 4

Machine Learning Background

4.1 Classifiers

We applied a variety of classifiers to our dataset and evaluated our approach, using Scikit-learn [23], a popular Python-based machine learning library, and Keras [15], a neural network API on top of Theano [26]. These classifiers are listed below; a short training sketch follows the list.

• XGBoost: eXtreme Gradient Boosting (XGBoost) [33] is a popular algorithm based on boosting, which produces accurate results by combining the outputs of weak learners such as decision trees. It is an ensemble technique in which predictions are made sequentially rather than independently as in Random Forests. The technique follows the logic of learning from mistakes: each subsequent predictive model is built based on the mistakes of the previous one. As a result, samples have unequal probabilities of appearing in subsequent models, with those having the highest error appearing most often. It has recently gained wide popularity in Kaggle competitions because of its speed and performance.

• Simple Neural network: Neural networks [21] are nonlinear statistical models which can learn complex mathematical functions by composing simpler ones. Many complex problems which remained unsolved for years can now be solved with their help. A network comprises 3 layers, i.e., input, hidden and output, with each layer comprising several units, as shown in Figure 4.1. There is no connection among units of the same layer or between units of nonadjacent layers. Let i and j be two units in adjacent layers and w_{ij} be the weight of the connection between them. The input to unit j is a linear combination of the outputs of the units i with the corresponding weights w_{ij}, plus a bias term b_j:

I_j = f\left( \sum_i w_{ij} O_i + b_j \right)

where I_j denotes the input to node j in the current layer, O_i denotes the output of node i in the previous layer, f is the activation function and b_j is the bias term for node j.

FIGURE 4.1: Neural network with one hidden layer [20]

• KNN: K Nearest Neighbors [19] is a widely used non-parametric classification algorithm which works on the idea of feature similarity: if an area of the feature space consists of samples predominantly from one class, the algorithm labels that area as belonging to that class. When a new sample arrives, the algorithm computes its k nearest neighbors and assigns the label based on a majority vote among them. The distance measure usually used is the Euclidean distance:

D(i, j) = \sqrt{ (i_1 - j_1)^2 + (i_2 - j_2)^2 + \cdots + (i_n - j_n)^2 }

where i_1, i_2, \dots, i_n are the feature values of sample i, and similarly for sample j.
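A minimal training sketch for two of the classifiers above; X and y are placeholders for a ready feature matrix and label vector, and the hyperparameters are illustrative rather than the tuned values used in our experiments.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier  # third-party: pip install xgboost

def train_and_score(X, y):
    # Hold out 20% of the data, keeping class proportions (stratify).
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Gradient-boosted trees; depending on the xgboost version,
    # string labels may first need integer encoding.
    xgb = XGBClassifier(n_estimators=200, max_depth=6)
    xgb.fit(X_tr, y_tr)

    # Euclidean distance with a majority vote over the 5 nearest neighbors.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_tr, y_tr)

    return {"xgboost": xgb.score(X_te, y_te), "knn": knn.score(X_te, y_te)}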

4.2 Handling Imbalanced Data

The most widely used technique for handling an imbalanced dataset is resampling [3], which consists of two approaches: oversampling and undersampling. Undersampling removes points from the majority class, while oversampling adds points to the minority class. Despite their advantages, these techniques have a few disadvantages: the simplest method of oversampling duplicates random samples, which can cause overfitting, and removing samples can cause data loss. The specific techniques we used are listed below; a short resampling sketch follows the list.

• SMOTE: Synthetic Minority Oversampling TEchnique creates synthetic samples of the minority class from already existing samples. It works by selecting a random sample from the minority class and calculating its k nearest neighbors; synthetic samples are then added between the point and its neighbors, as described in Figure 4.2. [3]

FIGURE 4.2: SMOTE oversampling technique [3]

• Tomek links: These are pairs of points of separate classes that are very close to each other, i.e., nearest neighbors. Removing such points increases the distance between the two classes and thus improves classification accuracy. An alternative undersampling method is to remove points only from the majority class if the dataset is highly unbalanced, as shown in Figure 4.3. [3]

FIGURE 4.3: Tomek undersampling technique [3]

• SMOTETomek: It is a technique which combines both undersampling and oversampling, i.e., Tomek links and SMOTE.

• ENN: Edited Nearest Neighbors is another undersampling technique, which removes samples from the majority class if their class label does not match the majority of their k nearest neighbors.

• SMOTEENN: It is a technique which combines both SMOTE and ENN.
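A short sketch of the resampling techniques above using the imbalanced-learn package; X and y are assumed to be a pre-built feature matrix and label vector, and the method name fit_resample follows recent imbalanced-learn releases (older ones used fit_sample).

from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import SMOTE

def rebalance(X, y):
    # Pure oversampling: synthesize minority points between neighbors.
    X_sm, y_sm = SMOTE(k_neighbors=5).fit_resample(X, y)
    print("after SMOTE:", Counter(y_sm))

    # Combined over- and undersampling schemes discussed above.
    X_st, y_st = SMOTETomek().fit_resample(X, y)
    X_se, y_se = SMOTEENN().fit_resample(X, y)
    return X_sm, y_sm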

4.3 Cross Validation

Cross validation is a technique to evaluate models by partitioning the data into training and validation sets.

FIGURE 4.4: K-Fold Cross Validation [2]

The steps for K-fold cross-validation [2] are as follows:

• Partition the dataset into K subsamples (usually K = 10).

• Train the model using K−1 subsamples and retain the remaining subsample as the validation set for evaluating the model.

• Repeat the above two steps K times (the folds), such that every subsample is used exactly once as the validation set.

• Combine or average the error from every fold to produce the final result. For parameter tuning, iterate through all combinations of parameters and select the combination for which the folds give the lowest error; use that combination of parameters to train on the whole training set. Care should be taken not to minimize the fold error excessively, since this leads to overfitting.

Figure 4.4 shows the steps mentioned above.

Stratified K-fold cross-validation is similar to K-fold cross-validation but, in addition, ensures that the class labels are in the same proportion in every fold as in the complete training set, as sketched below.
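A minimal sketch of 10-fold stratified cross-validation with scikit-learn; the classifier and the data X, y are placeholders.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

def cv_accuracy(X, y, k=10):
    # Each fold preserves the class proportions of the full training set.
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(XGBClassifier(), X, y, cv=skf)
    return scores.mean(), scores.std()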

4.4 Evaluation Metrics

4.4.1 Confusion Matrix

The confusion matrix is used to summarize the performance of a classifier. It contains information about the actual and predicted classifications on a held-out test set. The concept is easiest to explain for a binary classification problem; for multiclass problems, we can compute the metrics for each class using a one-vs-rest methodology. For example, let us take two classes, A (trojan) and B (the rest of the malware types).

The various metrics [1] are as follows; they are also depicted in Figure 4.5.

FIGURE 4.5: Confusion Matrix [1]

• True Positive (TP): The sample actually belongs to the trojan class and is also predicted as a trojan.

• True Negative (TN): The sample actually belongs to a type other than trojan and is predicted as such.

• False Positive (FP): The sample belongs to a type other than trojan but is predicted as a trojan by the classifier.

• False Negative (FN): The sample is a trojan but is predicted as not a trojan (another malware type) by the classifier.

• True Positive Rate: TPR, or Recall, is the proportion of trojans which are correctly identified. It is given by the formula:

TPR = TP / (TP + FN)

• False Positive Rate: The proportion of samples belonging to other types but classified as trojans.

FPR = FP / (TN + FP)

• Precision: It is the ratio of the number of trojans correctly predicted as trojans to the total number of trojans predicted by the classifier.

Precision = TP / (TP + FP)

Low precision signifies a higher number of false positives.

• F Measure: In the case of an imbalanced dataset, a classifier can achieve high accuracy just by assigning all labels in the test set to the majority class. The F Measure considers both Precision and Recall, instead of favoring one over the other, and is thus a good evaluation metric:

F-Measure = 2 · TPR · Precision / (TPR + Precision)
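The one-vs-rest computation of these metrics can be sketched with scikit-learn as follows; the positive label "trojan" is illustrative.

from sklearn.metrics import confusion_matrix

def one_vs_rest_metrics(y_true, y_pred, positive="trojan"):
    # Collapse the multiclass labels into positive vs. rest.
    y_t = [int(lbl == positive) for lbl in y_true]
    y_p = [int(lbl == positive) for lbl in y_pred]
    tn, fp, fn, tp = confusion_matrix(y_t, y_p, labels=[0, 1]).ravel()

    tpr = tp / (tp + fn)                 # recall
    fpr = fp / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * tpr * precision / (tpr + precision)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "F": f_measure}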

Chapter 5

Classification of Existing malware

5.1 Architecture of classification system

Collecting dynamic data is more robust against obfuscated malware; however, it usually takes a long time. Integrated into an antivirus engine, it would detect malicious executables, but at the cost of compromising the user's system in the meantime if, for example, a ransomware corrupts or encrypts the user's data. In this chapter, we discuss the architecture of our system (Figure 5.1), which classifies malicious executables into malware types with just 4 seconds of behavioral data. We have incorporated various feature engineering and machine learning techniques in order to achieve high classification accuracy. The basic outline is as follows:

• Dataset collection, Generation and Labeling

• Feature Extraction and data preprocessing

• Training and testing models

5.1.1 Dataset collection, Generation and Labeling

5.1.1.1 Dataset collection

We collected approximately 1.2 lakh (120,000) malicious PE executables from CDAC Mohali [5] and various malware repositories such as Malshare [17] and VirusShare [29]. CDAC Mohali has installed honeypots all across India for malware collection.


FIGURE 5.1: Architecture of our classification system

These repositories also collect samples through their portals, on which users from all across the world submit their files for analysis. To ensure that we collected valid malicious files, we cross-checked the samples by submitting them to Virustotal, which produces results from 70 antivirus engines in its report. In this work, we kept only those samples identified as malicious by at least one AV engine in the Virustotal report. Also, we worked only on 32-bit PE executables; however, our work can easily be extended to 64-bit versions as well.

5.1.1.2 Dataset generation

Cuckoo [7] is the tool responsible for performing sandboxed malware analysis and generating reports containing the behavior shown by the samples. Cuckoo's architecture comprises a host machine, which manages the analysis, and a set of guest machines, which execute the samples. We submit samples to the Cuckoo database present on the host machine for execution; the database identifies each file (through its hash) and skips the analysis if the file has already been executed once. Each analysis is launched in a fresh guest machine; a snapshot of the machine's clean state is taken so that Cuckoo can restore it once the analysis is over. In this work, samples are monitored for 4 seconds, after which the analysis is terminated. All the behavior of the sample is recorded and stored in a JSON report on the host machine. Cuckoo also provides a pcap containing all the network activity of the executable. The report and the pcap are used in the next phase. We used Ubuntu 16.04 on the host machine with 16 GB RAM and a 1 TB hard disk. For the guest machines, we used 32-bit Windows 7, with 2 GB RAM and a 25 GB hard disk allocated to each virtual machine; this avoids the situation in which malware ceases to show its malicious behavior after sensing the virtual environment through low memory or missing hardware components. Also, automatic updates, the firewall and user access control (UAC) are disabled on these machines to maximize the activities performed by the samples. The architecture is shown in Figure 5.2, as in [6]; a submission sketch is given after the figure.

FIGURE 5.2: Cuckoo Architecture [6]
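The submission step can be sketched against Cuckoo's REST API as follows; the endpoint and the timeout option follow the Cuckoo 2.x documentation, but the host and port are hypothetical and should be checked against the local installation.

import requests  # third-party: pip install requests

CUCKOO_API = "http://localhost:8090"  # hypothetical host/port

def submit_sample(path, timeout=4):
    # Create an analysis task capped at `timeout` seconds of monitoring.
    with open(path, "rb") as f:
        resp = requests.post(
            CUCKOO_API + "/tasks/create/file",
            files={"file": f},
            data={"timeout": timeout},
        )
    resp.raise_for_status()
    return resp.json()["task_id"]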

5.1.1.3 Labeling

To perform supervised learning, we need labels for our malicious executables. The Cuckoo report provides labels by querying Virustotal. Apart from that, it also provides normalized keywords (the family the sample belongs to) extracted from the label given by each AV engine. We collected all these normalized keywords, calculated the frequency of each keyword in the report, and assigned to the executable the keyword with the maximum frequency. However, this approach did not work in a few cases: different AV engines assigned the executable to different malware types while keeping the family the same.

TABLE 5.1: Dataset

Malware Type     | Malware Family | Number of Samples
TrojanDropper    | Sventore.C     | 1,577
TrojanDropper    | Sventore.A     | 1,347
TrojanDownloader | Renos          | 1,985
TrojanDownloader | Small          | 1,316
TrojanDownloader | Tugspay        | 3,417
Worm             | Yuner          | 3,794
Worm             | Allaple        | 4,258
Worm             | VB             | 2,418
Trojan           | Startpage      | 1,565
Trojan           | Comame!gmb     | 1,830
Virus            | Luder          | 1,967
Virtool          | VBInject       | 1,202
PWS              | OnlineGames    | 1,041
Backdoor         | Agent          | 1,020
Backdoor         | RBot           | 817
Total            |                | 29,554

So, as discussed in Chapter 1, we used the labels provided by the Microsoft AV engine. There were many executables in our dataset which are malicious but which the Microsoft AV engine failed to classify; we dropped such executables. Table 5.1 shows the number of malicious executables belonging to particular families within the malware types described in Chapter 1. We have only included families with more than 500 samples.
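The majority-vote labeling step itself reduces to a frequency count; in the sketch below, the list of normalized family keywords per sample is assumed to be already extracted, since the exact report fields vary across Virustotal API versions.

from collections import Counter

def majority_label(keywords):
    # Assign the most frequent normalized family keyword to a sample.
    counts = Counter(k.lower() for k in keywords if k)
    return counts.most_common(1)[0][0] if counts else None

print(majority_label(["Allaple", "allaple", "Rahack"]))  # -> "allaple"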

5.1.2 Feature Extraction

This section describes the features extracted from dynamic analysis. All executables are submitted to the Cuckoo sandbox for analysis. The generated reports contain the execution traces of the executables, which include the process tree, API statistics, a pcap containing network activity, registry, file and mutex changes, static information, memory dump, etc. Although a large amount of information is available in these reports, it is not feasible to include everything as features. In this work, we extracted only features based on critical resources such as network data, system calls, processes and the registry.

5.1.2.1 Network related features

Tshark [27] is a tool used to dump and analyze network traffic. In this work, we used it to analyze every executable's network activity and to extract useful information from it. The features are listed below; a short sketch computing the first two follows the list.

• IP Entropy - This entropy measures how many source IP addresses there are, how many packets originate from each IP address, and how random the distribution of packets across these IP addresses is. We similarly included a TCP entropy and an HTTP entropy to capture the randomness in TCP and HTTP packets; however, these did not prove useful. IP entropy is calculated using the following formula:

E(y_i) = \log \frac{N!}{N_{IP_1}! \, N_{IP_2}! \cdots N_{IP_n}!}

where N is the total number of packets and N_{IP_i} is the number of packets from the i-th source IP address.

• Ratio of public and private IP addresses - The intuition behind this feature is to capture the mechanism through which worms, or worm components in any malware, spread in a network, especially a local area or home network. Worms identify the IP addresses of systems connected to the private network through the IP address of the infected machine and scan them for vulnerabilities; if one is found, the worm replicates itself, and the chain continues. Thus the number of private addresses to which a worm sends query requests for system configuration information is high compared to public IPs. Also, some malware, such as trojan downloaders and trojans, frequently contact their Command & Control (C&C) servers for further instructions; thus the count of destination public IPs will be higher in their network activity.

• Protocol Information - It is challenging for malware authors to alter the underlying protocol through which the malware interacts with its C&C servers, so very often they use standard protocols. Also, to avoid detection by anomaly detection systems installed to monitor network traffic, malware authors tend to use a variety of protocols for communication. For instance, a malware may use an unknown application layer protocol on top of TCP to communicate with a malicious server and another protocol like HTTP to perform fraudulent actions. In this work, we used the number of packets sent through the standard protocols (TCP, UDP, DNS, ICMP, etc.) and some protocols specific to Microsoft Windows (LLMNR, NBNS) as features. Figure 5.3 shows the protocol information of a malware of the family Sventore.A.

FIGURE 5.3: Protocol Hierarchy of a malware using Wireshark

Various protocols which are observed are as follows:

– ICMP - Internet Control Message Protocol is an error-reporting, network-diagnostic protocol used by devices like routers to inform source IP addresses about the failure of delivery of IP packets. Through this protocol, malware often inquires about closed or open ports on a device, or the protocol can be used to send data captured by the malware to its attacker [8].

– IGMP - Internet Group Message Protocol is used to report and manage memberships of multicast groups. If misused, it can enable a DDoS attack [47].

– LLMNR and NBNS - Link-Local Multicast Name Resolution (LLMNR) and NetBIOS Name Service (NBNS) are protocols specific to Windows. LLMNR succeeded NBNS and was first introduced in Windows Vista. These protocols are used to query other hosts present in a private network when a DNS query about a particular host fails. One instance is shown in Figure 5.4, in which a host wants to connect to printserver located in the network but accidentally typed pintserver [16].

FIGURE 5.4: LLMNR Poisoning [16]

– SSDP - Simple Service Discovery Protocol is used for the discovery of network devices on port 1900. It is used by universal plug and play devices to exchange information. When a control point (e.g., a phone or laptop) is connected to the network, it uses this protocol to discover printers, TVs, audio systems, etc. If misused, it can enable a very large scale DDoS attack [9][10].

– mDNS - Multicast DNS is similar to the protocols explained above. It is used by the devices on the same network to discover each other and the services. Apart from laptops and phones, it is used by a variety of other devices like printers, network connected storage systems etc.

– SSL - SSL/TLS was rarely used by malware authors in the past. However, nowadays they encrypt the data sent over the network to C&C servers to avoid being detected by network administrators or anomaly detection systems. Many banking trojan families such as Zbot, Vawtrak and Trickbot, and many other trojans such as Fareit and Papra, use SSL. Malware authors are also exploiting SSL/TLS vulnerabilities to distribute malicious data. Figure 5.5 shows TLS connections made by the Sventore.A malware.

FIGURE 5.5: TLS Connections

Apart from these, we also used the number of packets sent through the TCP and UDP protocols and DNS as features.

• Data packets - This feature is included to capture the count of data packets sent and received by the malware.

• HTTP Information - Hypertext Transfer Protocol (HTTP) is the application layer protocol generally used to exchange hypermedia information such as HTML documents. It is most commonly used by malware for Denial of Service (DOS) attacks.

– HTTP request packets - This feature captures the count of request packets sent by a malware. There are mainly three types of request packets, i.e., GET, POST and SEARCH. HTTP flooding of any request type generally cannot be detected by an IPS, because the majority of IPSs tend to focus on TCP-based DOS attacks (such as SYN floods). It is also very difficult to craft IPS rules to prevent such attacks, since in most cases the traffic is hard to distinguish from real traffic; thus this approach is becoming popular among malware authors. Below is a detailed description of each request type and its associated attacks.

FIGURE 5.6: HTTP Requests by Sventore.A malware

* GET - The GET method requests the representation of a particular resource from the server. A GET flood is generally used to exhaust the network bandwidth of a server so that it is unable to serve legitimate users. [13]

* POST - The POST method is used to submit an entity to a specified resource, often causing a change of state on the server [13]. Since a POST query is resource-intensive (e.g., it may perform a database query), initiating many POST requests simultaneously can exhaust a system's resources, such as memory, and lead to a DOS attack. Another example is a brute force or dictionary attack, in which several users try to log in to the application using a set of usernames and passwords.

* SEARCH - The SEARCH method is used to initiate a server-side search for a particular resource. It differs from the GET method in that the payload returned in response to the query cannot be assumed to be a representation of the resource identified by the request URI. [24]

– HTTP Response Packets - This feature captures the count of response packets received by a malware in return for the requests it made. The response codes included in our feature set are 2XX (Success), 3XX (Redirection), 4XX (Client errors) and 5XX (Server errors).

• Domains - Nowadays, domains are used to host botnets instead of IRC networks (IRC is an application layer protocol used for text-based communication; a botnet is a collection of devices connected to the Internet which are infected and controlled by a common malware). Infected hosts visit the malicious domain, which serves the list of controlling commands and behaves as the C&C server. The major advantage of using domains is that it is easy to control a large botnet using code that can be easily updated. The biggest disadvantage is that government agencies can easily track malicious domains and shut them down; thus multiple domains are used as C&C servers. This feature captures the count of unique domains with which the malicious executable interacts.

• Deadhosts - It is assumed that if a service is legitimate, it will be maintained by the corresponding authorities. If it is down and not responding, it may be the case that the executable trying to access it is malicious and the service is used for malicious purposes. Also, if the executable tries to contact too many dead hosts, there is a possibility of malicious intent.
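The first two features above can be computed as sketched below; the lists of source and destination IP strings are assumed to be pre-extracted from the pcap (e.g., via tshark), and log-factorials keep the entropy computation numerically stable.

import math
from collections import Counter
from ipaddress import ip_address

def ip_entropy(src_ips):
    # E = log( N! / (N_IP1! N_IP2! ... N_IPn!) ), as defined above;
    # math.lgamma(n + 1) equals log(n!) and avoids huge factorials.
    counts = Counter(src_ips)
    n = sum(counts.values())
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts.values())

def private_public_ratio(dst_ips):
    # Ratio of private to public destination addresses.
    private = sum(1 for ip in dst_ips if ip_address(ip).is_private)
    public = len(dst_ips) - private
    return private / public if public else float("inf")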

5.1.2.2 Process related features

• Process count - When an executable is submitted to Cuckoo for analysis, Cuckoo executes it and monitors the process, creating a process tree containing all processes spawned by the original process and its children. We included the number of processes in the process tree as a feature.

• Dropped files - This feature stores the count of files dropped by the malware on the system. It is especially important for the malware type TrojanDropper; however, many other malware families, such as OnlineGames and Yuner, also drop a few files which help them fulfil their malicious intent.

5.1.2.3 API bins

WinAPI, or the Windows API, is the set of application programming interface (API) calls present in the Windows operating system. For this set of features, we grouped the various API calls into 16 bins and calculated the frequency of calls in each bin (a counting sketch is given at the end of this subsection). Figure 5.7 shows the frequency of API calls that fall into each bin for the training set. Detailed descriptions of the most common bins are as follows:

• File - This bin consists of API calls capturing all the actions performed on the file system. Possible actions performed by malicious executables include creating a new file or directory, reading a file, copying files or directories to a newly created directory, deleting file contents, files or directories, etc. To hide their activity, most malware tend to create files in heavily used directories or, in the case of ransomware, create new encrypted files and delete the original files from the disk. This feature captures such information by keeping a count of these API calls.

• Registry - The registry contains configuration settings for the software and hardware installed on the system, including complete information about user profiles as well as hardware and device drivers. Many malware heavily use registry keys to store their payload so that it is executed at startup. As with files, several actions can be performed on the registry: create, modify, read and delete.

• Process - This set of API calls captures the process functions, for example, creating a new process, getting information about running processes, terminating a process, opening an existing local process object, etc. Nowadays, to hide a malicious process, malware authors use a technique called Process Hollowing [22] in which a legitimate process is used as a container for a malicious process. When the legitimate process is launched, its code is deallocated and replaced with the malicious code. To unmap the benign code from memory, a malware generally uses the ZwUnmapViewOfSection or NtUnmapViewOfSection API call on 32-bit Windows, provided it has the required privileges.

FIGURE 5.7: Frequency of API Calls in each bin

• Synchronization - This bin contains the API calls which provide mechanisms that threads can use to access a shared resource. The most common synchronization object used by malware is the mutex. A simple example of its usage is web browsers: during browsing, multiple windows (essentially multiple processes) need to update the history file in a mutually exclusive manner, and they do so by registering a mutex object with the history file. This feature captures the calls of this kind made by various malware types.

• Crypto & Certificate - The Crypto API is provided by Microsoft to add cryptography-based security to an application. It includes functions such as encryption and decryption of data, authentication using digital certificates and certificate management using certificate stores. A polymorphic malware decrypts its code to exhibit malicious behavior and re-encrypts it to avoid detection, thus heavily using these API calls.

• Network - This bin includes the WinSock (Windows Sockets) API calls, which are the standard for socket programming, and the Remote Procedure Call (RPC) APIs. RPC gives applications the capability to invoke a function on other machines. Nowadays, many malware like Stuxnet use the RPC mechanism to communicate with C&C servers and other infected machines on the network.

• OLE - Object Linking and Embedding (OLE) allows linking to documents and other objects. Nowadays, malware authors use this method as a means of spreading malware: emails are used to send MS Office documents embedded with OLE objects. Whenever a user opens the attachment, a connection with the malicious server is opened and malicious files are downloaded onto the victim's system without the user's permission [25].

• Notification - The API calls in this category are used to display notifications to the user. Many malware, for example Trojan Renos, display fake security warnings to users about the presence of some malware and instruct them to install certain software in order to remove it from their system. Many users get duped by this and download malicious executables onto their system which may contain other kinds of malware.

• Exception - Malware analysts use a variety of debuggers for a better understanding of the binary, so malware authors apply various anti-debugging techniques such as self-modification, detecting and removing checkpoints, etc. to evade analysis. Apart from these, exception attacks are also used as a means of avoiding debuggers. These attacks leverage exception handling in Windows in the presence of debuggers. In the case of suppressible exception attacks, the malware registers a custom handler with the exception. When the exception is invoked, since it is suppressible, it is not passed to the application. However, in the presence of a debugger, Windows passes it to the debugger, which in turn passes it to the debuggee. If the custom handler is invoked, the malware detects the presence of a debugger [50]. This feature captures the count of all the API calls which belong to this category.

Apart from the API bins mentioned above, we have also used the count of API calls in the categories Services, Resource, UI, Netapi and System. The rest of the API calls are grouped into a separate category called miscellaneous. A minimal sketch of the binning step is shown below.
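The following sketch illustrates the binning step; the API_BINS mapping here is a hypothetical excerpt for illustration, not our actual 16-bin definition:

    # Hypothetical excerpt of a call-to-bin mapping; the real mapping
    # covers all 16 bins plus the miscellaneous category.
    API_BINS = {
        "NtCreateFile": "file", "NtReadFile": "file", "DeleteFileW": "file",
        "RegOpenKeyExW": "registry", "RegSetValueExW": "registry",
        "CreateProcessInternalW": "process", "NtTerminateProcess": "process",
        "NtCreateMutant": "synchronization",
        "CryptEncrypt": "crypto",
        "connect": "network", "send": "network",
    }

    def bin_frequencies(api_calls, bins=API_BINS):
        """Frequency of API calls per bin; calls outside the mapping go to 'misc'."""
        freq = {}
        for call in api_calls:
            name = bins.get(call, "misc")
            freq[name] = freq.get(name, 0) + 1
        return freq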

5.1.2.4 Signatures

There is a variety of behaviors which are more specific to malware than to benign executables, for example anti-detection techniques like checking the amount of memory present in the system. The presence of these signatures (behaviors) during the execution of a file strongly indicates that it is malicious and requires further analysis. Cuckoo provides a set of 433 signatures specific to Windows and the network and ranks them according to their importance in malware detection. We calculated the frequency of these signatures in our training set and selected the top 25 signatures as features. All these signatures are binary features, i.e., if the malware exhibits the behavior during its execution, the feature is marked as 1, otherwise 0.
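As a sketch of how such binary features can be derived from a parsed cuckoo report (the names in TOP_SIGNATURES are illustrative placeholders, not necessarily our selected 25):

    # Illustrative signature names; the actual top-25 list was selected
    # by frequency over the training set.
    TOP_SIGNATURES = ["allocates_rwx", "antivm_memory_available",
                      "persistence_autorun", "deletes_self"]  # ... 25 in total

    def signature_features(report, top=TOP_SIGNATURES):
        """Binary vector: 1 if the signature fired during execution, else 0."""
        fired = {sig["name"] for sig in report.get("signatures", [])}
        return [1 if name in fired else 0 for name in top]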

Here is a detailed explanation of a few of these 25 signatures:

• Check available memory and disk size - Malware analysts generally assign low memory and disk size to a virtual machine (VM) so that they can execute multiple VMs in parallel, thus decreasing the overall time required for executing all samples. Due to this, most malware query the memory and disk size before exhibiting malicious behavior in order to detect the virtual environment. It is not a dominant feature, since most benign processes which are intended to perform heavy tasks also query for these sizes. The API call which checks available memory is GlobalMemoryStatusEx and the API call which checks disk size is GetDiskFreeSpaceExW (a small sketch of these two calls appears at the end of this list).

• Creates a shortcut to an executable file - Many malware authors use the shortcut system to install malware or exploit other potential threats in the system, because a shortcut is not a binary executable file, thus cannot be flagged by antivirus programs as malicious, and can also execute Windows shell commands. Figure 5.8 shows some of the shortcuts created by the worm family yuner.

FIGURE 5.8: Shortcuts created by worm family Yuner

• Attempts to identify installed AV products - To refrain from getting detected, malware first query the installed antivirus products on the system using a registry key such as

HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run\\AVP

• Installs itself at Windows startup - There is a possibility that the victim may detect the presence of malware in the system, delete its files and stop the malicious processes. To tackle such a scenario, most malware install themselves for autorun during startup so that they are executed again when the system is restarted. Many malware also use this as a means to contact C&C servers for updates and new versions. Figure 5.9 shows some of the registry keys modified by the backdoor agent to install itself at startup.

FIGURE 5.9: Registry Keys modified by Backdoor Agent malware to install itself at startup

• Disables system restore - Disabling system restore is the first step of many malware which are used to destroy services or to deny access to the victim while demanding something in return. The registry key which is modified is

HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\SystemRestore\\DisableSR

• Modifies security center warnings and disables security feature notifications - Many malware that intend to exploit a zero-day vulnerability in the operating system disable security warnings about updates, antivirus, etc. in order to cause maximum damage to the user and infect the maximum number of systems present in the network. Following are the registry keys modified by such malware:

HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Security Center\\FirewallDisableNotify
HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Security Center\\FirewallOverride
HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Security Center\\FirstRunDisabled
HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Security Center\\AntiVirusDisableNotify
HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Security Center\\UpdatesDisableNotify
HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Security Center\\AntiVirusOverride

• Prevents display of file extensions - Many malware trick users into executing a file by changing the executable name and hiding its extension. For example, if a malware leaves a file named "aejknffj.exe" on the system, a user will not execute it and will instead run an AV product to delete such a file, or will simply delete it. But if the malware renames the file to "Tickets.pdf.exe", changes its icon and hides its .exe extension, many users will fall for it. The registry key modified to do so is

HKEY_CURRENT_USER\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Advanced\\HideFileExt

• Prevents display of hidden files - Most malware hide malicious files and folders to avoid detection by the user. To prevent the user from unhiding them again, they modify the registry key mentioned below. There also exist some viruses that hide the user's files and create shortcuts in their place; these shortcuts are symbolically linked to another file in the system, and when the user clicks on one, the malicious file is executed.

HKEY_CURRENT_USER\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Advanced\\Hidden

• Drops a binary and executes it - Although this characteristic is mainly shown by trojandroppers, many other malware authors use it as a means to drop malware onto the system.

• Deletes original binary - After carrying out the malicious intent, leaving the binary behind may leave traces of what has happened on the system. Thus malware authors prefer deleting the binary to keep the size of their footprint low; in many cases, the original binary is deleted over the course of execution.

• Creates a slightly modified copy of itself - This signature proves helpful in the case of polymorphic malware. Figure 5.10 shows the example of the malware family yuner, which creates four copies of itself during execution.

FIGURE 5.10: Polymorphic nature exhibited by malware Yuner

• Tries to locate browsers - Malware tend to locate browsers in order to modify them for running threats or to lead them to websites which generate revenue for the malware authors. This can also be used to promote misleading applications to the user.

• Executes one or more WMI queries - Windows Management Instrumentation (WMI) is a set of tools used to manage Windows locally and remotely. It has become popular among malware authors because of its usefulness for exploring the system, for VM and antivirus detection, for data theft, for code execution, etc. Example WMI queries used by Virus Comame!gmb are as follows:

SELECT * FROM Win32_ComputerSystem
SELECT * FROM Win32_BaseBoard

• Checks adapter addresses - Like available memory and disk size, malware also check adapter addresses to detect virtual network interfaces. The network API call used is GetAdaptersAddresses.

• One or more processes crashed - As mentioned for the Exception API bin, malware may raise exceptions to avoid getting detected by a debugger. Figure 5.11 shows the exception raised by malware of the family Renos.

FIGURE 5.11: Exception raised by malware Renos

• Allocates read-write-execute memory - This is the most important signature distinguishing benign and malicious executables. To avoid getting detected by static analysis, almost all malware nowadays are either packed or encrypted; the stub attached to the malware generally allocates read-write-execute memory to unpack it. The API call used for this is NtAllocateVirtualMemory.

Apart from these, a malware may inquire about a specific process through the Process32Next API call, try to detect the cuckoo sandbox through the presence of the analysis.py file, create a hidden window, check for the presence of known devices belonging to debuggers and forensic tools, or extract buffers that may contain injected code, configuration data, etc.
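For illustration, the sketch below shows how the two anti-VM queries from the first signature above (GlobalMemoryStatusEx and GetDiskFreeSpaceExW) look when invoked from Python via ctypes; it is Windows-only and shown purely to clarify the signature, not as part of our feature-extraction pipeline:

    import ctypes
    from ctypes import wintypes

    class MEMORYSTATUSEX(ctypes.Structure):
        # Layout of the structure expected by GlobalMemoryStatusEx.
        _fields_ = [("dwLength", wintypes.DWORD),
                    ("dwMemoryLoad", wintypes.DWORD),
                    ("ullTotalPhys", ctypes.c_uint64),
                    ("ullAvailPhys", ctypes.c_uint64),
                    ("ullTotalPageFile", ctypes.c_uint64),
                    ("ullAvailPageFile", ctypes.c_uint64),
                    ("ullTotalVirtual", ctypes.c_uint64),
                    ("ullAvailVirtual", ctypes.c_uint64),
                    ("ullAvailExtendedVirtual", ctypes.c_uint64)]

    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))

    total_bytes = ctypes.c_uint64()
    free_bytes = ctypes.c_uint64()
    ctypes.windll.kernel32.GetDiskFreeSpaceExW(
        ctypes.c_wchar_p("C:\\"), None,
        ctypes.byref(total_bytes), ctypes.byref(free_bytes))

    # A sandbox with little RAM or a small disk looks suspicious to malware.
    print(status.ullTotalPhys, total_bytes.value)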

5.1.3 Training and Testing

Our major goal is to classify malware into their classes using behavioral analysis in the earliest possible time so that the malware won’t affect the user significantly. If the model is accurate within a short time, we can initiate the cleanup process quickly according to the malware type, and can integrate this solution in any existing antivirus engine.

All the tests were conducted on an Ubuntu 16.04 LTS machine with 16 GB RAM and an Intel i7 octa-core processor. We split the dataset mentioned at the beginning of the chapter in the ratio 80%-20% for training and testing. The classifiers used are XGBoost, a simple neural network and K Nearest Neighbor. We also used 10-fold stratified cross-validation for parameter tuning and selected the parameters with minimum misclassification error (total number of misclassified points in the testing data).
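A minimal sketch of this evaluation scheme, assuming the feature matrix X and the integer type labels y are already loaded (KNN is shown; the same scheme applies to the other classifiers):

    from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # 80-20 split; stratification keeps the class proportions in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)

    # 10-fold stratified cross-validation for parameter tuning.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={"n_neighbors": [3, 5, 7, 9]},
                          cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    print(search.best_params_, search.score(X_test, y_test))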

Based on the correlation between input features, many features were removed from the training and testing data because their pairwise correlation exceeded the threshold of 0.80. This is done since highly correlated features provide the same information and, as per the curse of dimensionality, fewer features usually make learning faster. Our final feature set comprised 52 features. A minimal sketch of this pruning step follows.
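A minimal sketch of the correlation-based pruning, assuming the features are held in a pandas DataFrame:

    import numpy as np
    import pandas as pd

    def drop_correlated(features: pd.DataFrame, threshold: float = 0.80) -> pd.DataFrame:
        """Drop one feature from every pair whose absolute correlation exceeds threshold."""
        corr = features.corr().abs()
        # Keep only the upper triangle so each pair is inspected once.
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return features.drop(columns=to_drop)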

In Section 4.1, we discussed the single hidden layer neural network; here we have used the same. The input layer has 52 dimensions and the output layer has 8 dimensions (for the eight types). The dimension of the hidden layer is 50; we experimented with many hidden layer sizes, but this one performed the best. Since a neural network often faces difficulty in converging if the data is not normalized, we applied standardization (Z-score normalization) to our data. Standardization is a widely used technique which transforms the data such that the distribution has mean (µ) 0 and standard deviation (σ) 1. The formula for computing it is

z = (x − µ)/σ
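A minimal sketch of the standardization and the 52-50-8 network in Keras; the hidden-layer activation, epoch count and batch size shown here are assumptions for illustration:

    from keras.models import Sequential
    from keras.layers import Dense
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()                    # applies z = (x - mu) / sigma per feature
    X_train_std = scaler.fit_transform(X_train)
    X_test_std = scaler.transform(X_test)

    model = Sequential([
        Dense(50, activation="relu", input_dim=52),  # single hidden layer
        Dense(8, activation="softmax"),              # one output per malware type
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="nadam", metrics=["accuracy"])
    model.fit(X_train_std, y_train, epochs=50, batch_size=64, verbose=0)
    print(model.evaluate(X_test_std, y_test, verbose=0))

The optimizer is set to nadam here since that combination performed best in our experiments (Table 5.2); swapping the optimizer string reproduces the other rows of the table.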

Table 5.2 shows the accuracy on the test data for various combinations of loss functions¹ and optimizers² for all features. Here we have only used sparse categorical cross entropy and categorical hinge as loss functions, since these are the most widely used for multiclass classification. Also, we have used all the available optimizers, for example adam, adadelta, sgd, etc., to evaluate our model.

¹ The loss function is the function used to calculate the error; the error is calculated as the difference between the actual output and the predicted output.
² The error is a function of the internal parameters of the model, i.e., the weights and biases. Backpropagation is used to minimize the error in neural networks: the error is propagated backwards to the previous layers and the weights and biases are adjusted so as to minimize it. The function used to modify the weights is called the optimization function.

Optimizer   Sparse categorical cross entropy   Categorical hinge
adam        96.29%                             95.62%
adadelta    95.94%                             95%
adagrad     94.51%                             93.89%
adamax      95.88%                             95%
nadam       96.59%                             96.13%
rmsprop     96.57%                             95.60%
sgd         91.29%                             80.92%

TABLE 5.2: Testing accuracy - Simple Neural Network, for various optimizers and loss functions

We also applied the K Nearest Neighbor classifier to our dataset and achieved an accuracy of 93.62% for K = 5.

Previous works have shown that tree-based classifiers such as Decision Trees and Random Forests perform significantly well in detecting and classifying malware. In this work, we used XGBoost, which is based on boosting as described in Section 4.1, and achieved an accuracy of 98.02%. A minimal training sketch is shown below.
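A minimal training sketch with the XGBoost Python API; the hyperparameter values shown are illustrative, the actual values were selected by the cross-validation described earlier:

    from xgboost import XGBClassifier

    clf = XGBClassifier(objective="multi:softprob",
                        n_estimators=300, max_depth=6, learning_rate=0.1)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))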

Table 5.3 shows the results of applying these classifiers on the testing set with all the features discussed in the previous section. The evaluation metrics have been discussed in Section 4.4.

                 XGBoost          KNN              Simple Neural Net
Class            TPR     FPR      TPR     FPR      TPR     FPR
Backdoor         0.96    0.003    0.955   0.007    0.955   0.006
PWS              0.955   0.0005   0.91    0.002    0.945   0.001
Trojan           0.923   0.0006   0.921   0.004    0.905   0.001
TrojDownloader   0.986   0.008    0.905   0.015    0.968   0.014
TrojDropper      0.996   0.0001   0.985   0.0009   0.992   0.0009
Virtool          0.992   0.001    0.836   0.003    0.944   0.002
Virus            0.977   0.0047   0.770   0.011    0.910   0.004
Worms            0.993   0.0041   0.992   0.036    0.992   0.009

TABLE 5.3: Test Results for all classifiers

Figure 5.12 shows the confusion matrix obtained by XGBoost, which performed relatively better than the other two classifiers.

FIGURE 5.12: Confusion Matrix - XGBoost

5.1.4 Comparison to Existing Approaches

In this section, we compare our work with approaches that classified malware into families/types and with those that tried to predict maliciousness from a short duration of behavioral data. The biggest problem we find in these works is the small number of samples in their datasets and the amount of time they use for capturing behavior. Nari et al. [44] used 3 datasets of 3768 (identified by 6 antivirus engines as belonging to a particular family), 3347 (identified by 7 AVs) and 2907 (identified by 8 AVs) samples from 13 families. There were many families in the training set which comprised less than 5% of the total set; clearly, the data is highly imbalanced. They achieved a 0.945 ROC area, which is also not very good; a possible reason could be the noise created by malware to hide their malicious traffic. Similarly, in Ahmed et al. [34], 516 total samples were used, consisting of 100 benign and 416 malware. Although they achieved good accuracy, they did not mention the time used to collect the analysis features. Another setback to their approach is that they directly use the API calls as binary features, so if in future a malware uses other API calls, not present in the feature set, to fulfill its malicious intent, their model will not be able to capture it. Also, API calls largely increase the feature space. Rhode et al. [46] used only ten features to predict whether a file is malicious; however, the sample size they used is very small (594 malicious and 594 benign PE executables). In the second version of the paper, they increased their sample size to 2,345 benign and 2,286 malicious files, which led to a decrease in their achieved accuracy. We believe that the features they used are not discriminative enough to build a robust prediction model. Their revised accuracies are 91% in the first 4 seconds and 96% in 19 seconds. Also, the number of false positives (3.17%) and false negatives (4.72%) is very high in their case.

We achieved an accuracy of 98.02% with just 4 seconds of behavioral data. There is always a trade-off between achieving good accuracy and performing classification in a short time; our work takes care of both of the above shortcomings.

Table 5.4 summarizes the previous work and ours.

Author   Features (no. of features)                     Dataset                                Accuracy
[44]     Network                                        3768 samples (13 families)             0.945 ROC area
[34]     API calls in 7 categories                      416 malwares (Trojan, worms, virus)    98%
[43]     API calls and their parameters (7,605/1,000)   1368 samples (10 families)             97.4% / 94.5%
[46]     Machine activity (10)                          2345 benign, 2286 malicious            91% (4 sec), 96% (20 sec)
Ours     Network, Process, API Bins, Signatures (52)    29554 samples (15 families, 8 types)   98.02% (4 sec)

TABLE 5.4: Comparison to previous approaches

Chapter 6

Classification of Zero Day malwares

6.1 Architecture

In the previous chapter, we were able to classify a malware accurately 98% of the time if a variant of it, or the malware itself, already exists in our dataset. But what happens if malware authors develop a completely new malware family by exploiting some zero-day vulnerability?

In this chapter, we simulate the zero-day scenario with only 4 seconds of behavioral information. We assume that although the actions of newly created malware will be different, the delivery mechanism will remain the same.

The architecture of our system is shown in Figure 6.1 and its basic outline is as follows:

• Dataset Collection and Generation

• Feature Extraction and Handling Imbalanced Data

• Training and testing models

6.1.1 Dataset Collection and Generation

Here we have taken only six malware types into account and ignored the other types, as the number of samples for those classes was significantly smaller in our dataset. Also, to capture the delivery mechanism of each type, we need more samples. Thus, we included all samples of the above six types in our dataset, including malware families that are not present in significant amounts. We used Cuckoo Sandbox for generating reports and used the labeling by Microsoft as previously discussed.

FIGURE 6.1: Architecture of our classification system

Table 6.1 shows our training and testing sets for each type, and Table 6.2 shows the malware families which are present in these sets. Here we removed all the samples of the malware families yuner, krepper, etc. (present in our test set) from our training set to ensure that our model has no prior knowledge about these families. Thus our testing set acts as a set of zero-day malware for our model.

Figure 6.2 shows the t-SNE of our test set. t-SNE is a technique widely used for the visualization of high-dimensional datasets; a plotting sketch is shown below.
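A sketch of how a plot like Figure 6.2 can be produced with scikit-learn and matplotlib; the perplexity value is an assumption, and the labels are assumed to be integer-encoded:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Embed the test-set feature vectors into two dimensions.
    embedding = TSNE(n_components=2, perplexity=30,
                     random_state=42).fit_transform(X_test)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=y_test, s=4, cmap="tab10")
    plt.title("t-SNE of the test set")
    plt.show()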

Types              Training   Testing
Worms              7587       3794
Virus              5417       645
Trojan             4288       1565
TrojanDropper      3326       1347
TrojanDownloader   5700       1985
Backdoor           5448       1020
Total              31766      10356

TABLE 6.1: Number of samples in Training and Testing Set

Types              Training                                                          Testing
Worms              Allaple, VB, Vobfus, Mydoom                                       Yuner
Virus              luder, Expiro, Virut, Ramnit, Parite, Mabezat, Patchload          Krepper
Trojan             Bulta!rfn, Comame!gmb, BHO, Koutodoor, Startpage,                 Rimecud
                   Vundo, Agent, Toga!rfn, VB, Bagsu!rfn
TrojanDropper      Agent, Lamechi, Small, Sirefef                                    Sventore.A
TrojanDownloader   Small, Tugspay, Agent, Banload, Delf, Adload, Wintrim             Renos
Backdoor           Rbot, Zegost, Hupigon, IRCbot, Delf, Cycbot, Sdbot, VB, Bifrose   Agent

TABLE 6.2: Families in Training and Testing Set

FIGURE 6.2: tSNE - Test Set

6.1.2 Feature Extraction

From the generated cuckoo reports, all the features related to network activities, processes, API calls and signatures are extracted. We already discussed these features in the previous chapter. For the signature features, we extracted all the signatures whose frequency is greater than 50, leading to 65 signatures in total. This is done to capture the activity of malware families which are not present in significant amounts in our dataset.

6.1.3 Handling Imbalanced Data

By closely examining the families in each type, we found that our dataset is highly imbalanced. Figures 6.3 and 6.4 show the distribution of families in the Virus and Trojan types. We tried a variety of techniques mentioned in Section 4.2 and found that only oversampling helps in our case: removing data samples of the majority class would lead to a large data loss, which is not desired, and duplicating random samples from the minority class would lead to overfitting. Out of the many oversampling techniques we tried, only SMOTE gave comparatively better results on the validation set. SMOTETomek failed because Tomek links remove pairs of points of opposite classes which are very close to each other; in our case, all the classes subjected to oversampling belong to a single type, and we want these classes to remain as close to each other as possible in order to capture the basic mechanism of that malware type. Similarly, SMOTEENN removes all points whose labels do not match those of their K nearest neighbors, and thus faces the same problem as SMOTETomek.

FIGURE 6.3: Imbalanced Virus families

FIGURE 6.4: Imbalanced Trojan families

Table 6.3 shows our training set for each type after applying SMOTE to the Virus, Trojan and TrojanDownloader types. A minimal resampling sketch follows the table.

Types              Samples in Training Set
Worms              7587
Virus              13776
Trojan             20627
TrojanDropper      3326
TrojanDownloader   13598
Backdoor           5448
Total              64362

TABLE 6.3: Number of samples in Training Set after SMOTE
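A minimal resampling sketch with imbalanced-learn, as referenced above; X_type and y_family (the samples of one type and their family labels) are hypothetical names:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # Oversample the minority families within a single malware type.
    X_resampled, y_family_resampled = SMOTE(random_state=42).fit_resample(X_type, y_family)
    print(Counter(y_family), "->", Counter(y_family_resampled))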

6.1.4 Training and Testing

For classifying unknown samples, a different approach was adopted, since the basic classifiers did not perform well on this dataset. Six binary classifiers were created, one for each type, and trained using a One vs. All approach, i.e., if the classifier for the type Trojan is being trained, then one class contains all the trojans and the other class contains samples from the rest of the types. Since the feature set consists of 4 categories, namely network, process, bins and signatures, various combinations of these categories were tried to find the best feature set for each binary classifier. Also, several experiments were performed where the top n features (ranked on the basis of their importance, measured using F-score) were selected as the feature set for the classifiers, with n varying from 5 to 50. Finally, the feature set which gave the minimum misclassification error on the validation set was selected for each classifier. Then, for any test sample, the probability of it belonging to each type is calculated using these classifiers, and the sample is assigned to the type with maximum probability. A minimal sketch of this scheme is shown below.
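A minimal sketch of the One vs. All scheme just described; feature_idx, a per-type array of selected column indices, is a hypothetical placeholder for the validation-chosen feature subsets:

    import numpy as np
    from xgboost import XGBClassifier

    TYPES = ["Worms", "Virus", "Trojan", "TrojanDropper",
             "TrojanDownloader", "Backdoor"]

    classifiers = {}
    for i, t in enumerate(TYPES):
        y_binary = (y_train == i).astype(int)            # type t vs. all the rest
        clf = XGBClassifier()
        clf.fit(X_train[:, feature_idx[t]], y_binary)    # per-type feature subset
        classifiers[t] = clf

    def predict_type(x):
        """Assign the sample to the type whose binary classifier is most confident."""
        probs = [classifiers[t].predict_proba(
                     x[feature_idx[t]].reshape(1, -1))[0, 1]
                 for t in TYPES]
        return TYPES[int(np.argmax(probs))]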

For TrojanDropper, Worms and Backdoor, we used the top 25 features as the feature set. For Virus, the API bins were used; for TrojanDownloader, the signature and process categories were used; and lastly, for Trojan, the bins and network categories were used. The selection of bins as the feature set for viruses is justified, since the basic mechanism of a virus involves infecting files in the file system, which is easily captured by the file category of API calls. For TrojanDownloader, the inclusion of network features was expected, but on careful examination of the top features from the process and signature categories we concluded that this choice is also justified: these top features were dropped files, process count, connects to IP addresses that are no longer responding to requests (legitimate services usually remain up and running), installs itself for autorun at Windows startup, performs some HTTP requests, etc.

Figure 6.5 shows the confusion matrix obtained by using XGBoost in the One vs. All approach mentioned above, and Table 6.4 contains the accuracy and false positive rate obtained for each type.

Types              Accuracy   FPR
Worms              78.30%     0.008
Virus              73.79%     0.065
Trojan             61.85%     0.166
TrojanDropper      91.98%     0.02
TrojanDownloader   34.55%     0.101
Backdoor           69.31%     0.013

TABLE 6.4: Accuracy for each type with corresponding FPR

FIGURE 6.5: Confusion Matrix - XGBoost

From Table 6.4, we can see the accuracy for each type. TrojanDownloader performed relatively poorly compared to the other types, but since our classifier places its samples in the parent family (Trojan), the classifier still seems effective.

Chapter 7

Scope And Future Work

7.1 Building a Hierarchical model

Our approach focuses on classifying malicious executables into their types. In future work, we would like to create a hierarchical model which will first predict whether a sample is benign or malicious, then, if malicious, classify it into its appropriate type, and after that into the appropriate family within that type.

7.2 Sliding window based approach for classification

Our approach focuses on classification in the initial few seconds of execution; however, if malware authors know about it, they can withhold the malicious behavior during that time, and the above hierarchical model will fail at its very first step. In the future, we aim to incorporate a sliding window strategy into the hierarchical model: a system will monitor the executable's behavior using a sliding window (say, 4 seconds) and will detect and classify its malicious behavior regardless of when it occurs during execution.

7.3 Building a robust classification system

We will try to add more malware types and families, such as ransomware, rogue, exploits, rootkits, etc., which exist in the wild, to our zero-day classification dataset. Also, we will try to analyze the pre- and post-execution memory dumps to find possible features which can help in detecting malware, especially rootkits.

Appendix A


We have released all the code used in this analysis on GitHub: https://github.com/mugdhagupta/Malware-Classification

Bibliography

[1] (2016). Confusion matrix. https://classeval.wordpress.com/introduction/basic-evaluation-measures/.

[2] (2016). K fold cross validation. http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/08_CV_Ensembling.html.

[3] (2017). Resampling techniques. https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets.

[4] (2018). AV-TEST security institute. https://www.av-test.org/en/statistics/malware/.

[5] (2018). CDAC Mohali. https://cdac.in/index.aspx?id=mohali.

[6] (2018a). Cuckoo architecture. http://docs.cuckoosandbox.org/en/latest/introduction/what/.

[7] (2018b). Cuckoo sandbox. https://cuckoosandbox.org/.

[8] (2018). Exploiting ICMP. https://blog.trendmicro.com/trendlabs-security-intelligence/phishing-trojan-uses-icmp-packets-to-send-data/.

[9] (2018a). Exploiting SSDP. https://blog.cloudflare.com/ssdp-100gbps/.

[10] (2018b). Exploiting SSDP. https://www.corero.com/resources/ddos-attack-types/ssdp-amplication-ddos.html.

[11] (2018). Firewalls. https://personalfirewall.comodo.com/what-is-firewall.html.

[12] (2018). How antivirus works? https://www.howtogeek.com/125650/htg-explains-how-antivirus-software-works/.

[13] (2018). HTTP methods. https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods.

[14] (2018). Intrusion detection and prevention. https://www.incapsula.com/web-application-security/intrusion-detection-prevention.html.

[15] (2018). Keras: The Python deep learning library. https://keras.io/.

[16] (2018). LLMNR and NBNS poisoning. https://www.sternsecurity.com/blog/local-network-attacks-llmnr-and-nbt-ns-poisoning.

[17] (2018). MalShare. https://malshare.com/.

[18] (2018). Malware nomenclature. https://www.microsoft.com/en-us/wdsi/help/malware-naming.

[19] (2018). Nearest neighbors. http://scikit-learn.org/stable/modules/neighbors.html.

[20] (2018a). Neural network with one hidden layer. http://cs231n.github.io/assets/nn1/neural_net.jpeg.

[21] (2018b). Neural networks. http://neuralnetworksanddeeplearning.com/chap1.html.

[22] (2018). Process hollowing. https://www.trustwave.com/Resources/SpiderLabs-Blog/Analyzing-Malware-Hollow-Processes/.

[23] (2018). scikit-learn: Machine learning in Python. http://scikit-learn.org/stable/.

[24] (2018). SEARCH HTTP request type. https://trac.tools.ietf.org/id/draft-snell-search-method-00.html.

[25] (2018). Spreading malware through OLE objects. https://threatpost.com/microsoft-patches-word-zero-day-spreading-dridex-malware/124906/.

[26] (2018). Theano. http://www.deeplearning.net/software/theano/.

[27] (2018). Tshark. https://www.wireshark.org/docs/man-pages/tshark.html.

[28] (2018a). Using legitimate tools to hide malicious code. https://securelist.com/using-legitimate-tools-to-hide-malicious-code/83074/.

[29] (2018b). VirusShare. https://virusshare.com/.

[30] (2018). What is a backdoor? https://www.wired.com/2014/12/hacker-lexicon-backdoor/.

[31] (2018c). What is a computer virus or a computer worm? https://usa.kaspersky.com/resource-center/threats/computer-viruses-vs-worms.

[32] (2018). What is a trojan? https://www.kaspersky.co.in/resource-center/threats/trojans.

[33] (2018). XGBoost. http://xgboost.readthedocs.io/en/latest/python/python_api.html.

[34] Ahmed, F., Hameed, H., Shafiq, Z., and Farooq, M. (2009). Using spatio-temporal information in API calls with machine learning algorithms for malware detection. Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence (AISec '09), pages 55-62.

[35] Bayer, U., Kirda, E., and Kruegel, C. (2010). Improving the efficiency of dynamic malware analysis. Proceedings of the 2010 ACM Symposium on Applied Computing (SAC '10).

[36] Damodaran, A., Troia, F. D., Visaggio, C. A., Austin, T. H., and Stamp, M. (2017). A comparison of static, dynamic, and hybrid analysis for malware detection. Journal of Computer Virology and Hacking Techniques (Volume 13, Issue 1).

[37] Das, S., Liu, Y., Zhang, W., and Chandramohan, M. (2016). Semantics-based online malware detection: Towards efficient real-time protection against malware. IEEE Transactions on Information Forensics and Security (Volume 11, Issue 2).

[38] Firdausi, I., Lim, C., Erwin, A., and Nugroho, A. S. (2010). Analysis of machine learning techniques used in behavior-based malware detection. Second International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT).

[39] Grosse, K., Papernot, N., Manoharan, P., Backes, M., and McDaniel, P. D. (2016). Adversarial perturbations against deep neural networks for malware classification. CoRR, abs/1606.04435.

[40] Kolosnjaji, B., Zarras, A., Webster, G., and Eckert, C. (2016). Deep learning for classification of malware system call sequences. Australasian Conference on Artificial Intelligence.

[41] Kolter, J. Z. and Maloof, M. A. (2006). Learning to detect and classify malicious executables in the wild. The Journal of Machine Learning Research (Volume 7).

[42] Kong, D. and Yan, G. (2013). Discriminant malware distance learning on structural information for automated malware classification. Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '13).

[43] Moonsamy, V., Tian, R., and Batten, L. (2012). Feature reduction to speed up malware classification. Information Security Technology for Applications, Springer, pp. 176-188.

[44] Nari, S. and Ghorbani, A. A. (2013). Automated malware classification based on network behavior. International Conference on Computing, Networking and Communications (ICNC).

[45] Neugschwandtner, M., Comparetti, P. M., Jacob, G., and Kruegel, C. (2011). Forecast: skimming off the malware cream. Proceedings of the 27th Annual Computer Security Applications Conference (ACSAC '11).

[46] Rhode, M., Burnap, P., and Jones, K. (2017). Early stage malware prediction using recurrent neural networks. CoRR, abs/1708.03513.

[47] Sargent, M., Kristoff, J., Paxson, V., and Allman, M. (2017). On the potential abuse of IGMP. ACM SIGCOMM Computer Communication Review (Volume 47, Issue 1).

[48] Saxe, J. and Berlin, K. (2015). Deep neural network based malware detection using two dimensional binary program features. 10th International Conference on Malicious and Unwanted Software (MALWARE).

[49] Sharma, A. and Sahay, S. K. (2014). Evolution and detection of polymorphic and metamorphic malwares: A survey. International Journal of Computer Applications.

[50] Shi, H. and Mirkovic, J. (2017). Hiding debuggers from malware with Apate. Proceedings of the Symposium on Applied Computing (SAC '17).

[51] Shibahara, T., Yagi, T., Akiyama, M., Chiba, D., and Yada, T. (2016). Efficient dynamic malware analysis based on network behavior using deep learning. IEEE Global Communications Conference (GLOBECOM).

[52] Tian, R., Batten, L., and Versteeg, S. (2008). Function length as a tool for malware classification. 3rd International Conference on Malicious and Unwanted Software (MALWARE).

[53] Tian, R., Islam, R., Batten, L., and Versteeg, S. (2010). Differentiating malware from cleanware using behavioural analysis. 5th International Conference on Malicious and Unwanted Software (MALWARE).

[54] Tobiyama, S., Yamaguchi, Y., Shimada, H., Ikuse, T., and Yagi, T. (2016). Malware detection with deep neural network using process behavior. 40th Annual IEEE Computer Software and Applications Conference (COMPSAC).