Nova Southeastern University NSUWorks

CEC Theses and Dissertations College of Engineering and Computing

2017

Improved Detection for Advanced Polymorphic Malware

James B. Fraley, Nova Southeastern University, [email protected]



NSUWorks Citation James B. Fraley. 2017. Improved Detection for Advanced Polymorphic Malware. Doctoral dissertation. Nova Southeastern University. Retrieved from NSUWorks, College of Engineering and Computing. (1008) https://nsuworks.nova.edu/gscis_etd/1008.


Improved Detection for Advanced Polymorphic Malware

by

James B. Fraley

A Dissertation submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

in

Information Assurance

College of Engineering and Computing

Nova Southeastern University

2017

An Abstract of a Dissertation Submitted to Nova Southeastern University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Improved Detection for Advanced Polymorphic Malware

by
James B. Fraley
May 2017

Malicious Software (malware) attacks across the internet are increasing at an alarming rate. Cyber-attacks have become increasingly sophisticated and targeted. These targeted attacks are aimed at compromising networks, stealing personal financial information and removing sensitive data or disrupting operations. Current malware detection approaches work well for previously known signatures. However, malware developers utilize techniques to mutate and change software properties (signatures) to avoid and evade detection. Polymorphic malware is practically undetectable with signature-based defensive technologies. Today’s effective detection rates for polymorphic malware range from 68.75% to 81.25%. New techniques are needed to improve malware detection rates. Improved detection of polymorphic malware can only be accomplished by extracting features beyond the signature realm. Targeted detection for polymorphic malware must rely upon extracting key features and characteristics for advanced analysis. Traditionally, malware researchers have relied on limited dimensional features such as behavior (dynamic analysis) or source/execution code analysis (static analysis). This study’s focus was to extract and evaluate a limited set of multidimensional topological data in order to improve detection for polymorphic malware. This study used multidimensional analysis (file properties, static and dynamic analysis) with machine learning algorithms to improve malware detection. This research demonstrated that improved polymorphic malware detection can be achieved with machine learning. This study conducted a number of experiments using a standard experimental testing protocol. This study utilized three advanced algorithms (MetaBagging (MB), Instance-Based k-Nearest Neighbor (IBk) and Deep Learning Multi-Layer Perceptron) with a limited set of multidimensional data. Experimental results delivered detection rates above 99.43%.
In addition, the experiments delivered near-zero false positives. The study’s approach was based on single case experimental design, a well-accepted protocol for progressive testing. The study constructed a prototype to automate feature extraction, assemble files for analysis, and analyze results through multiple clustering algorithms. The study performed an evaluation of large malware sample datasets to understand effectiveness across a wide range of malware. The study developed an integrated framework which automated feature extraction for multidimensional analysis. The feature extraction framework consisted of four modules: 1) a pre-process module that extracts and generates topological features based on static analysis of machine code and file characteristics, 2) a behavioral analysis module that extracts behavioral characteristics based on file execution (dynamic analysis), 3) an input file construction and submission module, and 4) a machine learning module that employs various advanced algorithms. As with most studies, careful attention was paid to the false positive and false negative rates that reduce overall detection accuracy and effectiveness. This study provided a novel approach to expand the malware body of knowledge and improve detection for polymorphic malware targeting Microsoft operating systems.

Acknowledgments

I would like to thank Professor James Cannady for providing guidance and leadership as the Chair of my dissertation committee. His brutal honesty has made me a better student and person. I will always look back fondly on his instruction, guidance and wisdom. I also want to thank Dr. Hughes and Dr. Liu for working diligently on my committee. Your comments and feedback served to build a better end-product and researcher. I also want to thank Dr. Ackerman and Dr. Seagull who provided guidance and inspiration early on in the process.

I would also like to thank David Wasson, who provided technical guidance and helped me build an infrastructure to process literally hundreds of experimental models. He provided sound technical advice and allowed me to test ideas that neither of us had ever tackled. Some for good reason. I am grateful that he took my calls and provided technical insights when things broke. He helped me break through by answering questions, giving up his time and providing technical expertise when it was desperately needed.

I am very appreciative of my employers, both McAfee and Intel. My journey was only made possible by the company’s support and encouragement. McAfee has been exceptionally supportive, and I have appreciated the support from my senior management. I also have been surrounded by exceptional colleagues at McAfee who have increased my knowledge in information security by leaps and bounds. It has been my distinct honor to work with and learn from all of you over the years.

I am so grateful to Theresa who partnered with me on my life journey. I thank her for inspiring, encouraging and fully supporting my academic journey. I appreciated her patience and willingness to let me work some insane hours for a long time. I am grateful that my mother instilled a belief that anything is possible if you work hard enough. Jake and Dylan, I hope my hard work and dedication provides a good example for you as you begin your academic journey. Lastly, I appreciate my family’s support and encouragement (especially David Bressi) throughout the entire process.


List of Tables

Tables
1. Advanced Detection Studies. Sources Used in Study 92
2. Confusion Matrix 125
3. Example Confusion Matrix 128
4. Example Unweighted Testing Results 128
5. Experimental Resource Requirements 134
6. MB Baseline Classification Results 144
7. MB Baseline Experimental Results 144
8. MB Reduced Feature Selection Classification Results 145
9. MB Reduced Feature Selection Experimental Results 145
10. MB Data Amplification Classification Results 145
11. MB Data Amplification Experimental Results 146
12. MB Experimental Results 147
13. IBK Baseline Classification Results 152
14. IBK Baseline Experimental Results 152
15. IBK Reduced Feature Selection Classification Results 152
16. IBK Reduced Feature Selection Experimental Results 153
17. IBK Data Amplification Classification Results 153
18. IBK Data Amplification Experimental Results 153
19. IBK Experimental Results 155
20. DLMLP Baseline Classification Results 160
21. DLMLP Baseline Experimental Results 160
22. DLMLP Reduced Feature Selection Classification Results 161
23. DLMLP Reduced Feature Selection Experimental Results 161
24. DLMLP Data Amplification Classification Results 161
25. DLMLP Data Amplification Experimental Results 162
26. DLMLP Experimental Results 163


List of Figures

Figures

1. System Overview 101
2. ROC Results 124
3. Comparative Baseline Testing Results 129
4. ROC Analysis 130
5. MB Graph Results 146
6. IBk Graph Results 154
7. DLMLP Graph Results 162


Table of Contents

Abstract ii
Acknowledgements iii
Approval/Signature Page iv
List of Tables v
List of Figures vi

Chapters

1 Introduction 1
  Background 1
  Problem Statement 10
  Dissertation Goal 11
  Research Questions and/or Hypotheses 14
  Relevance and Significance 16
  Barriers and Issues 18
  Assumptions, Limitations and Delimitations 20
  Summary 23
  Definition of Terms 25
  List of Acronyms 31

2 Review of the Literature 33
  Overview of Topics 33
  Synthesis of Current Literature 39
  Research Methods 77
  Gaps in Current Literature 81
  Strengths and Weaknesses of Current Studies 85
  Similar Research Methods 87
  Summary 93

3 Methodology 94
  Overview 94
  Research Design 96
  Research Procedures 97
  Prototype Environment 100
  Threats to Validity 108
  Sample 121
  Data Analysis 123
  Data Formats for Results 127
  Resource Requirements 130
  Summary 135


4 Results 136
  Research Goals 136
  Review of the Methodology 136
  Experimental Outcomes 138
  Data Analysis 163
  Findings 166
  Summary of Findings 167

5 Conclusions, Implications, Recommendations, and Summary 169
  Conclusions 169
  Implications 171
  Recommendations 172
  Summary 174

References 184
Appendix A – Advanced Detection Studies 191
Appendix B – Cuckoo Installation and Configuration 193


Chapter 1

Introduction

Background

Malware (malicious software) represents a significant problem for today’s highly networked and distributed computer systems (Cesare & Xiang, 2010). In general, the term malware represents a variety of annoying or hostile software programs specifically designed to access, interrupt or establish communication channels, and/or perform data exfiltration from compromised computer systems without the owner’s informed consent (Kumar, Rama Krishna, Aggarwal, Sehgal, & Chamotra, 2014). The ability to detect malware is a major challenge due to the proliferation and complexity of malware across a wide range of endpoints throughout the enterprise (Qu & Hughes, 2013). The growth of malware threats continues to far exceed the security industry’s projections and estimates (Qu & Hughes, 2013). According to the McAfee Labs Threat Report (2014), the security company collected approximately 235 million malware samples in the first quarter of 2014. In 2015, McAfee estimated that the company collected over 3 million new samples a day (McAfee, 2016). A majority of samples collected were detected by McAfee’s worldwide sensor network protecting computer systems and networks for both commercial and Government organizations (McAfee, 2014). These samples were collected based on the suspicious activity or behavior of the malware observed by typical network protection devices and security appliances (McAfee, 2014). New malware grew by an unprecedented 174% for the same quarter from the previous year (2013), exceeding an additional 100 million new samples (McAfee, 2014). McAfee’s 2015 Threat Report stated that malware rose by 13% in the first quarter (McAfee Labs, 2015). Given the growth and sophistication of malware, advanced methods to detect new malware continue to be a major challenge for security researchers and commercial security companies (Qu & Hughes, 2013).

The ability to detect and classify malware provides a means to identify and stop the proliferation of malware across the network. Today’s detection relies upon signatures of known malware to identify malware at the network and host (Cesare & Xiang, 2011). Signature detection of advanced malware is complicated by polymorphism (Muhaya, Khan, & Xiang, 2011), and signature detection alone does not provide adequate protection for networks and endpoints from polymorphic malware. The McAfee Threat Report (2016) stated that on average the security vendor saw almost 40 million new malware samples in each quarter of 2015.

In order to understand the malware threat landscape, it is helpful to understand the various types of malware and the motivational factors driving malware development. Malware is often developed for a specific purpose: to attack or disrupt a specific target or environment (Richardson, 2011). Typically, malware aims to steal information, compromise sensitive networks, and establish launch points for future attacks (Brenner, 2008; Chu, Holt, & Ahn, 2010; Wall, 2012). The term “malware” includes viruses, Trojan programs and other malicious tools installed on hosts or endpoints to automate some type of compromise that may lead to more sophisticated or complex intrusions (Brenner, 2008; Richardson, 2011; Wall, 2012). Another type of malware, the botnet, extends this capability and combines multiple aspects of existing malware into a single program, enabling it to form a network of infected computers world-wide (Bächer, Holz, Kotter, & Wicherski, 2005). Botnets can provide a foothold to create a number of attacks ranging from the distribution of spam, to denial of service attacks, to advanced network scanning (Bächer et al., 2005). There has been significant research involved with malware, the role it plays in cybercrimes and the financial impact attacks have on organizations (Bächer et al., 2005; Wall, 2012). Additional research has examined the social factors that influence the creation, distribution, and use of malware in the community (Chu et al., 2010). Sophisticated malware enables hackers to sell access to or steal information from their botnets across the world (Bächer et al., 2005).

Botnets have revolutionized attacks on infrastructures and have advanced targeted attacks on organizations (Chu et al., 2010). Hackers have established forums and Internet Relay Chat (IRC) channels to sell and distribute malicious software, stolen data, and hacking tools that enable less skilled malware developers to gain direct access to advanced malware services (Chu et al., 2010).

Ransomware has become a major malware challenge within the last few years (McAfee Labs Threats Report, 2016). Ransomware, better known as “scareware,” has become quite popular with cyber criminals (Kharraz, Robertson, Balzarotti, Bilge, & Kirda, 2015). Most ransomware is polymorphic in nature (McAfee Labs Threats Report, 2016). Ransomware and the criminals operating it seek to take advantage of people’s fear of releasing private information, losing critical data, or facing irreversible hardware damage (Kharraz et al., 2015). In most cases ransomware locks or encrypts the victims’ computers until they make a payment to unlock or regain access to their data (Kharraz et al., 2015). Ransomware has been on the malware landscape for over 10 years; however, the volume of these attacks has dramatically increased over the last few years: 500% in 2013, 550% in 2014 and nearly 600% in 2015 according to Kaspersky’s Threat Report (2016). The CryptoLocker ransomware has been detected on over 750,000 computers in the United States (Gostev, Unuchek, Garnaeva, Makrushin, & Ivanov, 2016). Ransomware attacks have migrated from individuals to organizations such as healthcare/hospitals and local governments. One such ransomware case involved an entire police department that had to pay the ransom in order to unlock or decrypt the department’s document repository (Gostev et al., 2016). This has led to new energy for detecting polymorphic ransomware and has mobilized many of the leading security vendors to share findings because of the levels of sophistication and the current ineffective defensive techniques (Gostev et al., 2016).

In order to improve detection, advances in malware classification and detection are needed to provide better end-point protection. Malware detection has traditionally been largely signature-based (Qu & Hughes, 2013). These signatures are based on the hashing of the malware files, executables or binaries (Sulaiman, Ramamoorthy, Mukkamala, & Sung, 2005). Detection is then based on the a-priori signature or hash of the complete binary or partial characteristics of byte sequences of known malware samples (Wüchner, Ochoa, & Pretschner, 2014). Security vendors collect, analyze and create extensive repositories of malware signatures that are used for detection within anti-virus and other security protection. Signature-based detection continues to be one of the main tools used for detection in commercial anti-virus products (Wüchner et al., 2014). However, signature-based detection suffers from several major limitations. The first major limitation is speed. New malware released into the wild will not be detected until the signature is captured by some security mechanism and the executable/binary analyzed. In reality, once the file/binary has been captured it can take weeks for one of the many security vendors to publish the malicious reputation for a specific file. Unfortunately, polymorphic malware can be developed and distributed via the internet across the world in minutes and hours (Mandiant Research, 2014). Polymorphic malware can then self-generate new signatures in order to appear to security devices as a new file. These new signatures need time to be discovered, analyzed (determined malicious or benign) and released by commercial security vendors for protection (Szor, 2005). Signature-based defense will always suffer a time gap between the release of a new or unknown malware type, the determination of maliciousness by the security community and the signatures for the corresponding malware (Wüchner et al., 2014). Unfortunately, it is during this time gap that the malware goes undetected, reaches the intended targets and propagates itself throughout compromised networks.
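The hash-based signature workflow described above, and the way a single-byte mutation defeats it, can be sketched as follows. The payload bytes and the signature repository here are hypothetical stand-ins; real vendors store digests of actual captured samples:

```python
import hashlib

# Hypothetical signature repository: SHA-256 digests of known-malicious files.
KNOWN_MALWARE_HASHES = {
    hashlib.sha256(b"original malware payload").hexdigest(),
}

def is_known_malware(file_bytes: bytes) -> bool:
    """Signature check: flag a file only if its full-file hash is already known."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_MALWARE_HASHES

# The unmodified sample matches its stored signature.
assert is_known_malware(b"original malware payload")

# A polymorphic variant differing by a single byte yields an entirely new
# hash, so the signature check misses it even though behavior is unchanged.
assert not is_known_malware(b"original malware payloae")
```

The second assertion illustrates the time gap: until the mutated file is captured and its new digest published, the variant passes every purely signature-based check.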

Malware developers understand the time gap and exploit the malware determination frequency cycle referred to as “catch, analyze, deploy” (Wüchner et al., 2014). The challenge for malware researchers is to shorten this gap or approach the problem from a different perspective (Cesare, Xiang, & Zhou, 2007). The second limitation of signature-based detection is polymorphism. Polymorphic malware is an umbrella term used to refer to both polymorphic and metamorphic malware (Campo-Giralte, Jimenez-Peris, & Patino-Martinez, 2009). Polymorphic and metamorphic malware exploit the malware frequency determination cycle by dynamically creating a new executable and changing its signature. Because the signature changes, most current defensive technologies ignore the executable as it has a new identity. The new executable will have a new signature and evade existing signature detection (Cesare, Xiang, & Zhou, 2013). Malware developers use the new signature to propagate and infect a large number of hosts (Wüchner et al., 2014). The detection of malware grows increasingly difficult over time due to the sophistication of the malware programs (Guri, Kedma, & Sela, 2013). Previous polymorphic detection rates using advanced classification range from 68.75% to 81.25% (Amos, Turner, & White, 2013). These detection rates used Bayes and Multilayer Perceptron techniques to establish a baseline for effective detection of various types of polymorphic malware (Amos et al., 2013).

Malware researchers have developed techniques to break the polymorphic malware executable into segments and attempt pattern matching on malware segments (David & Yahav, 2013). This approach has been extremely helpful as malware developers tend to re-use and share functions (Rekdal & Bloemerus, 2013). However, advanced malware developers, understanding this approach, employ other techniques to protect code from being analyzed, encrypting segments of execution code and leveraging other obfuscation techniques to avoid sub-function signature development (Campo-Giralte et al., 2009). Malware developers design malware to invade, self-mutate and propagate throughout the infected network using advanced techniques that are generally used one time, making the research very difficult (Bossert, Hiet, & Inria, 2014). Detection of advanced malware grows increasingly difficult because of the sophistication and concealment of the executable. Malware detection and classification research methods must be advanced in order to detect new types of malware, and advanced malware analysis methodologies need to be developed in order to improve detection and classification of the evolving nature of polymorphic malware.

Identification of malicious or benign programs (files, portable executables, etc.) must involve new discovery techniques and leverage several types of advanced analysis to identify malware (Cesare & Xiang, 2013). There are typically two types of advanced analysis: static and dynamic. Static analysis never executes the malware, but rather decodes the compiled code into human-readable segments (Sulaiman et al., 2005). Dynamic analysis is typically conducted via a platform referred to as a “sandbox”. The sandbox provides an environment to execute the program and captures each sequence of the runtime behavior (Cesare & Xiang, 2014). Both analysis tools are extremely helpful during malware analysis as researchers are able to understand how the program executes and what functions are actually executed by the malware (Sulaiman et al., 2005). Static and dynamic analysis allow researchers to evaluate and determine the intent of the malware programs and the potential targeted environment. Researchers can then classify malware into families based on the results from static and dynamic analysis. The ability of advanced analysis to produce classification and signature detection provides a means to perform advanced detection.
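As an illustration of the kind of file-property and static features such analysis can yield, the sketch below computes a few simple measures (file size, byte-level Shannon entropy, printable-byte ratio). The feature set is hypothetical and far smaller than what a real static analyzer or the study’s extraction framework would produce:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy; packed or encrypted sections score near 8 bits."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def static_features(data: bytes) -> dict:
    """A few illustrative file-property / static features (hypothetical set)."""
    printable = sum(1 for b in data if 32 <= b <= 126)
    return {
        "size": len(data),
        "entropy": round(shannon_entropy(data), 3),
        "printable_ratio": round(printable / len(data), 3) if data else 0.0,
    }

features = static_features(b"MZ\x90\x00example executable bytes")
```

High entropy combined with a low printable ratio is one conventional hint that a sample is packed or encrypted, which is why entropy-style features often appear alongside behavioral ones in multidimensional datasets.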

Advanced detection relies on feature extraction of the malware being analyzed (Qu & Hughes, 2013). Recent malware research has divided detection into two tasks: malware classification and malware detection (Qu & Hughes, 2013). Classification of malware enables researchers to understand the family of malware and generally the intent of the attacker (Qu & Hughes, 2013). Detection involves the identification of novel instances of malware and the detection of copies or variants of known malware (Qu & Hughes, 2013). Both classification and detection require feature extraction of the malware binary and the class features associated with the malware (Cesare & Xiang, 2013). Detection of new malware relies heavily on statistical machine learning (Qu & Hughes, 2013). Searching for malware variants uses the concept of similarity searching to query a database of known malware types and functions (Qu & Hughes, 2013). Similarity queries or nearest neighbor searches rely upon techniques within machine learning referred to as instance-based learning (Yerima, Sezer, McWilliams, & Muttik, 2013). Instance-based learning utilizes inherent distance functions within machine learning to highlight similarity or dissimilarity between objects (Yerima et al., 2013). These search algorithms use distance functions to develop mathematical metrics for various searches and are much more efficient than traditional database queries (Yerima et al., 2013).
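A minimal sketch of such a similarity query, with hypothetical three-element feature vectors (e.g., entropy, import count, API-call count), shows how a Euclidean distance function ranks known samples by closeness to a query:

```python
import math

def euclidean(a, b):
    """Distance metric underpinning instance-based similarity queries."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, database):
    """Return the known sample whose feature vector is closest to the query."""
    return min(database, key=lambda item: euclidean(query, item["features"]))

# Hypothetical known-sample database with illustrative feature vectors.
known_samples = [
    {"label": "benign",  "features": [3.1, 12, 4]},
    {"label": "malware", "features": [7.8, 2, 40]},
]

# A query vector near the malicious sample retrieves it as the closest match.
match = nearest_neighbor([7.5, 3, 38], known_samples)
```

Unlike an exact-match database lookup, the distance ranking still retrieves the malicious neighbor when the query vector is similar but not identical, which is the property that makes instance-based search attractive for polymorphic variants.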

Traditional research and commercial malware detection systems have relied upon static string signature searching techniques to classify and detect malware (Cesare & Xiang, 2011). This type of search relies upon capturing sections of malware executable code that uniquely identify the malware (Cesare & Xiang, 2011). String signature searching performs well in terms of query performance and produces low false positive results; therefore, string searching is heavily utilized in real-time systems for malware detection (Cesare, Xiang, & Zhou, 2013). Unfortunately, string signature searching performs poorly when searching for polymorphic malware variants because their signatures are closely related but non-identical (Cesare et al., 2013). Polymorphic malware variants have similar properties, but the changes between instances cause ineffective searches. Additional research needs to be performed in order to find advanced search techniques for polymorphic and metamorphic malware.

The advancement in functionality and behavior of computer malware can be categorized into five distinct generations (Noreen, Murtaza, Shafiq, & Farooq, 2009). First-generation malware was quite simple: it caused infection by simply attaching itself to the code sections of benign executables (Noreen et al., 2009). Second-generation malware had some additional functionality such as self-replication (Noreen et al., 2009). Third-generation malware had stealth capabilities that make it more difficult for signature-based detection to detect (Noreen et al., 2009). Fourth-generation malware uses armoring techniques to protect the code from static analysis (Noreen et al., 2009). Finally, malware of the current generation uses polymorphic techniques to encrypt functions and obfuscate its code with every replication (Noreen et al., 2009). Most of today’s polymorphic malware utilizes armoring and obfuscating protections to hide and propagate itself across the environment.

The development and analysis of today’s malware requires experienced programmers. The development of malware requires years of experience due to the sophistication and nature of the programming. Therefore, malware developers have built “malware creation engines,” which generate different versions of malware in order to re-use logic and functions through a complex set of algorithms (Noreen et al., 2009). The malware creation engines use various techniques such as compression, function insertion and instruction substitution to develop a new variant (Noreen et al., 2009). However, these variants can be discovered through static and dynamic analysis as they have the same functionality and semantics (Noreen et al., 2009). Thus, polymorphic and metamorphic malware retains similar functionality and executable code as it replicates itself. These functions and code base can essentially provide genetic markers for detection (Pfeffer, Call, & Chamberlain, 2012).


The goal of polymorphic and metamorphic malware is to dynamically generate new versions of malware in order to evade signature detection techniques (Qu & Hughes, 2013). Today’s malware is capable of mutating or creating copies of itself, making it very difficult for signature-based detection to be effective (Qu & Hughes, 2013). Therefore, malware research must consider new techniques to identify and detect malware variants to improve endpoint or host security (Qu & Hughes, 2013).

Problem Statement

There is currently no single method to effectively detect polymorphic malware at the host or end-point. Advanced malware is spreading across the enterprise through internet downloads, email attachments, malicious hyperlinks, and mobile devices/hosts, and is a major challenge for system owners and cyber-security professionals (Campo-Giralte et al., 2009). Signature-based defense approaches such as anti-virus software do not provide adequate protection from polymorphic malware (Campo-Giralte et al., 2009). According to McAfee’s report (2016), the security vendor collects over 3 million new samples of malware per day. Due to the overwhelming number of new and advanced malware samples, new detection techniques must be researched to discover advanced malware and provide improved detection (Borojerdi & Abadi, 2013). New detection methods must be relied upon to examine untrusted programs and prevent malware from causing damage and disruption to computer systems (Cesare & Xiang, 2013). Better malware detection methods may prevent the proliferation of end-point infections. Identifying and preventing malware from executing improves overall system security and represents significant cost avoidance for organizations (Qu & Hughes, 2013). New malware research must provide new detection techniques to identify and classify today’s sophisticated malware (Qu & Hughes, 2013). The research must also understand the sophistication of today’s malware and the evolving nature of targeted malware attacks in the future (Mandiant Research, 2014).

This research focused on developing improved detection rates using various machine learning cluster algorithms with a limited set of multidimensional topological features. As part of the study, the research developed a feature extraction methodology, built a prototype environment and leveraged a testing protocol to better understand advanced clustering algorithm performance and feature weighting to improve detection for host or endpoint protection.

Dissertation Goal

The goal of the research was to develop an experimental prototype system to provide improved detection for polymorphic malware. The prototype system utilized various feature extraction methodologies for all types of samples. These features were assembled and evaluated with advanced clustering algorithms within machine learning to deliver improved detection. This study quantitatively examined and prototyped various techniques to test the theory that machine learning with advanced clustering can provide improved detection for polymorphic malware. This study is grounded in previous quantitative experiments that leverage supervised and unsupervised machine learning for malware research (Boro et al., 2012; Pradesh, 2014). Further, previous research provided a basis for the various feature extraction techniques and feature isolation models used to enhance malware detection with machine learning (Chaumette et al., 2011; Devesa et al., 2010). This study extends previous malware detection techniques by combining file properties, static analysis and dynamic analysis for feature extraction. The feature extraction methodology was used to assemble the multidimensional topological data, which became the input (dataset) for machine learning. Based on a review of nearly 40 related and recent malware studies, it is believed that this is the first study to evaluate multidimensional topological features (file properties, static analysis and dynamic analysis) using advanced clustering algorithms. Ultimately, the detection of polymorphic malware was the focus and emphasis of the study. Current literature offers detection rates for polymorphic detection using advanced classifiers ranging from 68.75% to 81.25% (Amos et al., 2013). The study delivered detection rates of 99.43%, well above 81.25%, with better than acceptable accuracy rates.

As part of the study, a feature extraction methodology used an integrated analysis environment to perform feature extraction from all samples (malware, benign and unknown) as input for machine learning. The feature extraction methodology produced a rich dataset for each sample. Feature extraction was achieved through static analysis and dynamic analysis. This multidimensional topological approach delivered the unique dataset for each sample used for polymorphic malware detection.

The study evaluated detection rates for three advanced clustering algorithms using

machine learning within WEKA. The clustering algorithms used for this studying

included: 1) Advanced Ensemble Classification (Bootstrap Aggregating (Meta Bagging

(MB)), 2) Instance Based k-Nearest Neighbor (IBk) and 3) Deep Learning Multi-Layer

Perceptron (DLMLP). The study demonstrated improved malware detection based on

advanced cluster analysis using multidimensional topological data that delivered true

positive detection rates above the established baseline range of 68.75% to 81.25% for

polymorphic malware (Amos et al., 2013). Additionally, the research delivered improved


false positive rates, well below the range of 15.86% to 33.79% established by the same

study (Amos et al., 2013). The ability to achieve these goals was based on developing an

environment to analyze, extract, assemble and classify the malware.
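For illustration, the bootstrap aggregating (bagging) technique behind the MB classifier can be sketched in pure Python with a simple nearest-neighbor base learner. This is a minimal sketch of the technique only; the study itself used the WEKA implementation, and the dataset and labels below are toy values.

```python
import random
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nn_predict(train, x):
    """Trivial 1-nearest-neighbor base learner."""
    return min(train, key=lambda pt: euclidean(pt[0], x))[1]

def bagging_predict(train, x, n_models=15, rng=random.Random(42)):
    """Majority vote over base learners fit on bootstrap resamples."""
    votes = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]  # resample with replacement
        votes.append(nn_predict(boot, x))
    return Counter(votes).most_common(1)[0][0]

# toy 2-feature dataset: label 1 = "malicious", 0 = "benign"
train = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1), ((0.8, 0.9), 1)]
label = bagging_predict(train, (0.85, 0.85))
```

Aggregating many base learners trained on resampled data is what gives bagging its variance reduction over any single classifier.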

The Instance Based k-Nearest Neighbor (IBk) algorithm was chosen to replace

LSH as the algorithm improves upon the “locality” aspects of the clustering algorithm.

The IBk is a k-nearest-neighbor classifier that utilizes a distance metric similar to that used in

the LSH algorithm. The calculated Euclidean distance function is used by IBk as a

critical search parameter within the algorithm (Manikandan, Ramyachitra, Kalaivani, &

Ranjani Rani, 2016). The IBk algorithm was selected as it provided newer techniques for

clustering outcomes.
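A minimal sketch of the IBk approach, written in pure Python rather than WEKA, shows the role of the Euclidean distance function as the critical search parameter: the k nearest training samples vote on the class of an unknown sample. The toy feature vectors and labels are illustrative assumptions.

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance: the search parameter within IBk-style k-NN."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def ibk_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    neighbors = sorted(train, key=lambda pt: euclidean(pt[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# toy 2-feature samples with known labels
train = [((0.1, 0.1), "benign"), ((0.2, 0.3), "benign"),
         ((0.9, 0.8), "malware"), ((0.8, 0.7), "malware"), ((0.7, 0.9), "malware")]
verdict = ibk_predict(train, (0.75, 0.8), k=3)
```

The "locality" improvement over LSH comes from computing exact distances to candidates rather than relying on hash-bucket collisions.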

The Deep Learning Multi-Layer Perceptron (DLMLP) clustering algorithm was

selected to replace the BP algorithm. Recent literature discussed the benefits of DLMLP over BP as it improves upon the dimensions of belief for training and test datasets

(Gruber, Cammerer, Hoydis, & Brink, 2017). The main idea in replacing the BP

algorithm with DLMLP was recently presented by Gruber et al. (2017). New research

has found that artificial neural networks (ANN) can leverage belief propagation for

clustering or classification but do so inefficiently (Gruber et al., 2017). ANNs build

networks of neurons, share information between neurons and propagate results

throughout the network using weights or beliefs (Gruber et al., 2017). This approach to

propagation is inefficient because it is repeated many times throughout the entire

network (Gruber et al., 2017). A more efficient way to achieve similar or better results is

to establish shallow neural networks and combine other algorithms to achieve similar

propagation (Gruber et al., 2017). The goal was to utilize the DLMLP algorithm for


improved detection. Given that the goal of this research was to establish new detection

using advanced algorithms, DLMLP was selected for its model efficiencies and to leverage

nascent algorithms that improve upon established approaches.
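The structure of a multi-layer perceptron’s forward pass can be sketched in pure Python. The weights below are hand-set toy values (a trained network such as DLMLP would learn them through training); the sketch shows only how weighted sums and activations propagate an input toward a classification score.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums passed through sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass of a one-hidden-layer perceptron; output ~ P(malicious)."""
    return layer(layer(x, hidden_w, hidden_b), out_w, out_b)[0]

# toy hand-set weights: two features, two hidden units, one output
hidden_w = [[4.0, 4.0], [-4.0, -4.0]]
hidden_b = [-4.0, 4.0]
out_w = [[6.0, -6.0]]
out_b = [0.0]
score = mlp_forward([0.9, 0.9], hidden_w, hidden_b, out_w, out_b)
```

A deep network stacks more such layers; training adjusts the weights so that the output score separates malicious from benign feature vectors.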

Research Questions and/or Hypotheses

This study quantitatively examined and prototyped various techniques to test the

theory that machine learning with advanced clustering can provide improved detection

for polymorphic malware. Therefore, the research question was - Can machine learning

utilizing clustering algorithms with multidimensional topological feature extraction

deliver improved detection for polymorphic malware? The research did in fact deliver

improved malware detection given multidimensional topological data extracted from

static analysis, dynamic analysis and file properties. This study developed an

experimental prototype that analyzed over 1M samples using MB, IBk and DLMLP.

These samples were used to evaluate polymorphic malware detection using machine

learning with advanced clustering.

Previous results suggested that various clustering algorithms could deliver impressive detection results with machine learning on limited datasets (Fraley & Figueroa, 2016). This study analyzed much larger datasets.

Each algorithm was allocated a dataset of 200,000 samples comprising known malware, known benign and unknown or undetermined files. Earlier demonstration results were well above 81.25% for a small dataset (Fraley & Figueroa, 2016).

The research produced detection rates for MB, IBk and DLMLP that exceeded 99.99%.


Feature extraction was accomplished using both static and dynamic analysis, the results of which became the inputs for machine learning. The multidimensional topological data contained file features, static analysis features and dynamic analysis features as input for the experimental research model to demonstrate improved detection (Liu, Chen, & Guan, 2012; Tamersoy, Roundy, & Chau, 2014). These features were utilized by machine learning with advanced clustering algorithms to produce the achieved detection rates.

Based on the research of nearly 40 related and recent malware studies, it is believed that this is the first study to evaluate the combination of topological features (file properties, static analysis and dynamic analysis) with machine learning using advanced clustering. Most studies use either static or dynamic analysis. Ultimately, the focus of the study was to demonstrate improved detection rates for malware using multidimensional topological features with machine learning utilizing larger datasets. It is important to note that polymorphic malware has the ability to change behavior and execute different embedded functions over time (Campo-Giralte et al., 2009). Therefore, detection of polymorphic malware becomes more difficult and baseline detection rates range from 68.75% to 81.25% (Amos et al., 2013). Using this approach, the study was able to produce detection rates for MB, IBk and DLMLP that exceeded 99.999%.

This study developed a prototype to quantitatively evaluate malware detection effectiveness and accuracy. Accuracy can be expressed as the number of correct assessments divided by the number of all assessments. Effective detection rates were evaluated using standard statistical measures such as True Positive Rate (TPR), False

Positive Rates (FPR), True Negative Rate (TNR) and False Negative Rate (FNR). Using these statistical measures allowed deeper understanding for the various detection rates.
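These rates follow directly from the four confusion-matrix counts; a minimal sketch with hypothetical counts chosen for illustration only:

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard detection rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # correct / all assessments
        "TPR": tp / (tp + fn),  # true positive rate (sensitivity)
        "FPR": fp / (fp + tn),  # false positive rate
        "TNR": tn / (tn + fp),  # true negative rate (specificity)
        "FNR": fn / (fn + tp),  # false negative rate
    }

# hypothetical counts, not results from the study
m = detection_metrics(tp=990, fp=5, tn=995, fn=10)
```

Note that TPR and TNR are the Sensitivity and Specificity measures discussed below, so a single confusion matrix supports all of the evaluation criteria used in this study.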


The focus of the experimental prototype was to evaluate detection rates for the

various clustering algorithms given the various features, weighting of the features and the

advanced cluster algorithms used with a series of machine learning tests. There are a

number of statistical measures to be utilized to evaluate malware detection performance.

Sensitivity and Specificity are two standard measures that evaluate effective detection rate (Kolter & Maloof, 2006). Sensitivity and Specificity statistical measures have been

used to evaluate malware detection performance for a number of previous malware

detection experiments (Kolter & Maloof, 2006). This analysis is provided later in the

report.

Relevance and Significance

There is currently no single method to identify and detect polymorphic malware at

an endpoint. Advanced malware affects almost every organization, business and

government entity connected to a network and internet (Pramono & Suhardi, 2015).

Traditional signature-based detection systems are ineffective due to the dynamic nature

of polymorphic malware (Cesare et al., 2013). Polymorphic malware can be delivered

through a number of distribution channels: email, embedded files, program updates, and

internet web sites (Cesare et al., 2007). Major security companies spend billions of

dollars to solve general malware issues (McAfee, 2014). The detection of polymorphic

malware is more difficult because of the sophistication of the malware programs (Guri et

al., 2013). Polymorphic malware developers continue to advance malicious programs to

evade or avoid detection by security protection devices (Qu & Hughes, 2013). There is no single answer for detecting polymorphic and metamorphic malware. However, researching polymorphic malware may provide greater insight into classes of malware

and effective detection techniques for various classes. Malware research continues to be needed in order to provide details regarding the evolving nature of polymorphic malware.

Many of today’s cyber-criminal activities on the internet can be directly attributed to malware or malicious programs (Bayer, Kirda, & Kruegel, 2010). Malware comes in different sizes and shapes – Trojans, viruses, bots, etc. – and gives miscreants a wide range of possibilities for achieving nefarious activities (Bayer et al., 2010).

As a result, security companies see a huge number of new malware samples each day –

McAfee estimates at least 30 per second according to their most recent report (McAfee,

2014). Globally speaking, in 2014, cybercrime was estimated to cost businesses more than

$400 billion a year (McAfee Labs, 2015). Other experts believe that a $400 billion estimate is conservative and the total cost of cybercrime in 2014 could have approached nearly $600 billion worldwide (McAfee Labs, 2015). The rate of growth for cost impact has soared from $56 billion in 2004 to today’s $400 billion estimate (Schneidewind,

2010; McAfee Labs, 2015). In 2011, the Federal Bureau of Investigation (FBI) reported that companies reporting a network breach lost an average of $150,000 per incident (Richardson, 2011). Data breaches today are estimated to cost US businesses at least $200 billion (McAfee Labs, 2015). Many of these breaches have been attributed to unauthorized money transfers and account hijacking on mobile devices (A.

Sharma & Sahay, 2014). Being able to detect and identify malware infections can deny hackers financial reward, improve security and reduce the organizational cost of mitigating these infections.

This research was built upon the current knowledge base for polymorphic malware research by improving detection through machine learning. This research developed an experimental approach to improve advanced detection by leveraging

multidimensional feature extraction for three advanced clustering algorithms using

machine learning. By utilizing this experimental approach, this research extends the

body of knowledge for detection of polymorphic malware. By achieving improved

detection rates for polymorphic malware there is the potential to reduce the financial

impact on businesses worldwide.

Barriers and Issues

There were a number of barriers and issues with the experimental research. The first

and most imposing issue was obtaining the malware data set. In order to perform this research, the malware data set needed to contain representative samples of contemporary polymorphic malware. These data sets are usually collected by large commercial security organizations and are shared with partners for a fee. The researcher was able to obtain sufficient sample datasets for known malware, known benign and a collection of unknown samples. The researcher was able to secure over 2 million total samples in order to develop each of the sample datasets. The collection consisted of 1,009,108 known malware samples, 756,322 known benign samples and 748,976 unknown files. In addition, the researcher ensured that known malware samples in the dataset were targeting Microsoft operating systems. Obtaining the collection of samples required and consumed a significant amount of time, a number of transportable drives and long data transfer times.

The second issue was that, in addition to the malware data, the expanded data set needed to contain representative non-malware and unknown samples. The non-malware data set had to be included as part of the research in order to evaluate the effectiveness of detection. A number of vendors were able to share a large number of benign files. In


particular, Virus Total provided 90% of the 756,322 known benign samples. Unknown

samples also needed to be included as part of the study. A number of vendors were able

to share “undetermined” or “unknown” samples as part of this study. Virus Total and

Malwarebytes provided unknown file samples for this study. Contemporary non-malware has characteristics and behaviors that protect the intellectual property of commercial

software. These non-malware behaviors and characteristics are often mimicked by

malware developers in order to bypass detection. This research evaluated known

malware, known benign and unknown samples in order to evaluate overall detection

effectiveness (False Positives and False Negatives). Initially, it was thought that the

benign samples and unknown samples would be more difficult to obtain than the malware

samples. This was not the case, the malware samples targeting Microsoft operating

systems were much more difficult to obtain.

A related dataset issue was storing and securing the experimental dataset. The objective of the research was to analyze millions of samples in order to evaluate efficacy and validate the proposed detection approach. The study required over 2 million samples. It was expected that the storage for malware and non-malware samples would not exceed 1 TB. However, this was underestimated. Malware samples are typically stored in a .zip compressed file format with a password. In many cases, vendors who share samples also used the Unix TAR command to ship compressed files (30-35 GB for

15 tarballs). This reduced the network bandwidth and storage needed for these files.

However, the uncompressed files consumed over 1.3 TB of hard drive storage.

Once the various test data sets had been pseudo-randomly generated, the feature extraction process would simply use the unique id (file hash) associated with the sample


for feature extraction. As a separate process, selected files needed to be unzipped and

the password supplied in order to access the sample. Additional unique identifiers were

developed as many of the files had a mixture of MD5, SHA1 and SHA256 hashes. All

files obtained, known malware, known benign and unknown files were stored in a secure

and protected repository. The secure storage of the malware data set was needed to ensure

non-detonation within the experimental storage environment. The malware samples were

kept in a state such that the malware did not detonate or infect the analysis

platform/environment while stored.
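Because vendors supplied a mixture of MD5, SHA1 and SHA256 identifiers, each sample can be indexed under all three digests so that any vendor-supplied hash resolves to one canonical record. The sketch below, using Python’s hashlib, is illustrative of this normalization step rather than the study’s actual tooling; the sample contents are toy values.

```python
import hashlib

def sample_ids(data: bytes) -> dict:
    """Compute every hash form a vendor might use for a sample's contents."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def build_index(samples: dict) -> dict:
    """Map each known hash (any algorithm) to the sample's canonical SHA256."""
    index = {}
    for data in samples.values():
        ids = sample_ids(data)
        for digest in ids.values():
            index[digest] = ids["sha256"]  # any id resolves to one record
    return index

index = build_index({"a": b"sample-one", "b": b"sample-two"})
```

With such an index, a feature-extraction run keyed on any vendor hash retrieves the same stored record, avoiding duplicate analysis of the same file.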

Lastly, the software tools required for the prototype environment did

present some challenges. Sufficient software and hardware computing resources had been acquired prior to beginning the study. However, the open source version of Cuckoo had several new features as part of the new release. Installing the new release presented some major challenges. In addition, previous versions of Cuckoo processed samples in a different fashion and produced automatic dynamic analysis reports. The upgraded version of Cuckoo provided some advanced features that were more suitable for this study. Online resources were used to eventually assist with installing and configuring

Cuckoo using the Ubuntu operating system (Kolo, 2016). Details regarding installation dependencies, networking and configuring VirtualBox are provided in Appendix A (Kolo,

2016). Managing the overall analysis environment was somewhat challenging but achievable. Difficulties with installing Cuckoo impacted schedule by a few weeks.

Assumptions, Limitations and Delimitations

There were a number of assumptions, limitations and delimitations with the

research conducted. Each of these areas is discussed in detail below.


Assumptions

There were a limited number of assumptions for this study. The first assumption

was that the dataset collected from across the malware repositories represents the malware

population from across the internet. The population of malware selected was malware

captured relatively recently. Secondly, there was no determination made regarding the

genetic relationships for the polymorphic malware samples used for the study. In other

words, some samples may be related but the study assumes that these relationships do not

matter for the purpose of detection. Lastly, it was assumed that collecting the samples and maintaining them in a secure .zip container did not disturb or alter the malware samples themselves.

The .zip container is a common industry protection mechanism that protects working

environments from malware outbreaks. All malware was secured and stored in a .zip file format in order to protect the prototype environment. It is believed that these assumptions were minimal and did not affect the research outcomes.

Limitations

This study included malware samples from various communities of interest for

malware research including Symantec, Virus Total, contagio, Virus Sign, and VxHeaven.

There was a limitation regarding collecting samples across these available sites: only polymorphic samples dated within the last eighteen months were used for detection research from these sites. Therefore, the sample

set may not be representative for older malware or other malware sites. The purpose of

the study was to evaluate detection for current malware. Older samples may have some value, but more recent malware is more relevant. Collecting samples from multiple sites and


then performing random selection should provide the needed sampling technique for

generalizability.

Delimitations

Delimitations provide boundaries for the research or study (R. Yin, 2015).

Delimitations include topics or areas that the study chose to intentionally exclude from

consideration. For the purpose of this study, the three clustering algorithms have been

selected due to demonstrated detection results from previous studies (Alam, Horspool, &

Traore, 2013; Murphy, Weiss, & Jordan, 1999; Tamersoy et al., 2014). There are most

certainly other advanced clustering algorithms that could have been used for this study.

However, the intent of this research was to demonstrate that detection can be improved

by adding a combination of features from file properties, static and dynamic analysis. The

proposed research expands the current body of knowledge for detection by using proven

clustering algorithms with the expanded features (static, dynamic and file properties) within the machine learning environment. Other delimitations include selecting only malware samples specifically targeting Microsoft operating systems. Malware targeting other operating systems was not a consideration for this study. However, there were a limited number of samples collected that were not targeting Microsoft operating systems across the community (Symantec, Virus Total, contagio, Virus Sign, and

VxHeaven). The few non-Microsoft malware samples collected were discarded prior to random selection. Malware targeting Microsoft is believed to represent a majority of targeted polymorphic malware (Ahmadi,

Ulyanov, Semenov, Trofimov, & Giacinto, 2016). Therefore, this study focused only on samples targeting Microsoft.


Summary

Malware represents some of the most serious security concerns for today’s

Internet. Security breaches and cyber-attacks can be directly attributed to malware or multi-stage cyber-attacks. Malware can compromise networks and computers in the form of botnets, viruses, worms, ransomware and advanced persistent threats (APTs). These cyber-attacks are launched using targeted and advanced malware techniques to steal personal, proprietary or financial information. The high number of attacks and the associated negative notoriety make malware one of the most popular areas for advanced research. Much of today’s advanced research has been concentrated on developing techniques to collect, study, and mitigate malware. This research focused on detecting

“real” malware and samples found “live” on the internet. As improved detection becomes a reality, mitigation or elimination of malware for end-points can be greatly enhanced. Unfortunately, current host-based detection approaches that leverage signature-based detection are largely ineffective for new polymorphic malware.

Polymorphic malware avoids or evades signature detection by using advanced obfuscation or encryption techniques. To address these shortcomings, this research developed dynamic detection approaches to identify potential malware threats. This study proposed a novel malware detection approach that provided improved detection for polymorphic malware. The research should enhance and complement traditional end-point detection approaches. This study’s approach extracted key features from file properties, static and dynamic analysis and through advanced cluster analysis determined the likelihood of files (samples) to be benign (good) or malicious (bad). The proposed approach analyzed the malware

executable (program) in a controlled environment in order to better understand behaviors, function calls and the inclusion of dynamic libraries. The research conducted leveraged this information through machine learning to improve detection. In order to better understand the malware threat landscape, a review of past research literature and previous malware studies is provided next.


Definition of Terms

Advanced Persistent Threat: A deliberately slow-moving cyber-attack that is applied quietly to compromise information systems.

Anomaly detection: The search for network connections which do not conform to an expected normal traffic.

Bot: A term short for robot. Criminals distribute malicious software that can turn a computer into a bot. When this occurs, a computer can perform automated tasks over the Internet without one’s awareness.

Botnet: Criminals use bots to infect large numbers of computers. These computers form a network, or a botnet. Criminals use botnets to send out spam email messages, spread viruses, attack computers and servers, and commit other kinds of fraud.

Bring Your Own Device (BYOD): Mobile devices that are personally owned, not a corporate asset.

Clustering: Method that organizes objects with similarities into one cluster, and objects with dissimilarities into other clusters.

Crimeware: Tools that drive hackers’ attacks and fuel the black market (e.g., bots,

Viruses, Trojan, spyware, adware, etc.)

Cryptocurrency: A digital medium of exchange that uses encryption to secure the process involved in generating units and conducting transactions.


Cyber Attack: An attack, via cyberspace, that targets an enterprise’s use of

cyberspace for the purpose of disrupting, destroying, or maliciously controlling a

computer environment/infrastructure; destroying the integrity of the data; or stealing controlled information.

Data Breach: An organization’s unauthorized or unintentional exposure,

disclosure, or loss of sensitive PI, such as social security numbers; financial information,

such as credit card numbers; date of birth; or mother’s maiden name.

Data Security Incident: A violation or imminent threat of violation of a computer

security policy, acceptable use policy, or standard security practice.

Denial of Service (DoS): The prevention of authorized access to resources or the

delaying of time-critical operations.

Detection Rate: The percentage of the number of intrusion instances detected by

the system over the total number of intrusion instances present in the test set.

Distributed Denial of Service (DDoS): An approach whereby the hacker attempts

to make a service unavailable to its intended users by draining system or networking

resources, using multiple attacking systems.


False Alarm Rate: The percentage of the total number of incorrectly classified normal instances over the total number of instances.

Hacker: Unauthorized user who attempts to or gains access to an information system.

Indicators of Compromise (IOCs): Pieces of forensic data, such as data found in system log entries or files, that identify potentially malicious activity on a system or network.

Intrusion Detection System (IDS): Hardware or software that gathers and analyzes information from various areas within a computer or a network to identify possible security breaches, which include both intrusions and misuse.

Intrusion Prevention System (IPS): Systems that can detect and attempt to stop an intrusive activity, ideally before it reaches its target.

Malicious Code: Software or firmware intended to perform an unauthorized process that will have an adverse impact on the confidentiality, integrity, or availability of an information system.

Malware: Programs or executables targeted to infect a user’s device. When successful, the hacker is able to control the user’s device, which may lead to data loss or escalation in the hacker’s privileges on the information system.

Mobile Device: Smart phones, tablets, portable cartridge/disk-based, removable storage media (e.g., floppy disks, compact disks, USB flash drives, external hard drives,

flash memory cards/drives that contain nonvolatile memory; NIST, 2013d, 2013e).

Personal Information: Information from individuals that can uniquely identify a specific person.

Personal Identifiable Information (PII): Information that can be used to distinguish or trace an individual’s identity, such as his or her name, social security number, or biometric records, alone, or when combined with other personal or identifying information that is linked to a specific individual, such as date and place of birth or mother’s maiden name.

Privileged Account: An information system account with approved authorizations of a privileged user.

Privileged User: A user that is authorized to perform security relevant functions on a computer server that ordinary users are not authorized to perform (NIST, 2013e).

Ransomware: A type of malware that encrypts files and prevents the user from accessing data until the user pays a certain amount of money (ransom) to decrypt the files.

Rootkit: A set of tools used by an attacker after gaining root-level access to a host to conceal the attacker’s activities on the host and permit the attacker to maintain the access through covert means.

Security Event: Any observable security occurrence in a system or network.

Security Incident: A violation or imminent threat of violation of a computer security policy, acceptable use policy, or standard security practice. These include

an occurrence that actually or potentially jeopardizes the confidentiality, integrity, or availability of an information system or the information that the system processes, stores, or transmits.

Security Information and Event Management (SIEM) Tool: Application that provides the ability to gather security data from information system components and present that data as actionable information via a single interface.

Spyware: Software that is secretly or surreptitiously installed on an information system to gather information on individuals or organizations without their knowledge.

Testing Phase: A detection phase in which the distance (e.g. Euclidean distance) between each test instance and the normal cluster’s centroid is measured to determine whether or not that instance is normal.

Training phase: A period in which a normal profile is built and or updated.

Trojan: A computer program that appears to have a useful function, but also has a hidden and potentially malicious function that evades security mechanisms, sometimes by exploiting legitimate authorizations of a system entity that invokes the program.

Unauthorized User: A user who accesses a resource that he or she is not authorized to use.


Virtualization: Hiding the discrepancy between the virtual and physical allocation of information technology resources.

Virtual Machine: A separate logical instance of resources for a user or application that in reality is shared physical hardware.

Virus: A type of malicious software program that infects computer system programs, data or operating system files and is capable of replicating itself to other systems.

Worm: A self-replicating, self-propagating, self-contained program that uses networking mechanisms to spread malicious code.


List of Acronyms

ACC: Accuracy

API: Application Programming Interface

BP: Belief Propagation

CC: Correlation Coefficient

C&C: Command and Control

DLMLP: Deep Learning Multi-Layer Perceptron

FNR: False Negative Rate

FPR: False Positive Rate

GUI: Graphical User Interface

LSH: Locality Sensitive Hashing

IBk: Instance Based k-Nearest Neighbor

MB: Meta Bagging or Advanced Ensemble Classification (Bootstrap Aggregating)

ML: Machine Learning

ROC: Receiver Operating Characteristic curve

SDK: Software Development Kit

SCED: Single Case Experimental Design

SCED CCD: Single Case Experimental Design Changing Criterion Design


SVM: Support Vector Machine

TNR: True Negative Rate

TPR: True Positive Rate


Chapter 2

Review of the Literature

Overview of Topics

New malware development techniques render current signature protections for

polymorphic malware practically useless from a timeliness perspective (Rodríguez-

Gómez, Maciá-Fernández, & García-Teodoro, 2013). Being able to detect polymorphic, metamorphic and zero-day malware requires advanced detection techniques that provide rapid adaptation, scalability and produce low false positive rates (Borojerdi & Abadi,

2013). There are numerous research studies that offer attractive alternatives for detecting polymorphic and metamorphic malware. Given the security issues concerning malware, it is not surprising that a majority of today’s security research is focused on developing enhanced detection using techniques that collect, study, and mitigate malicious code (Kolbitsch et al., 2009). Some studies are focused on botnets and botnet networks, others are focused on infected executables from websites and others study behavioral aspects of Windows and mobile devices (Seigneur & Kölndorfer, 2013). There are other studies that strictly look at the Windows API or system calls (Ye, Wang, Li,

Ye, & Jiang, 2008). New research attempts to capture a comprehensive snapshot of malicious behaviors and activities in order to classify the malware sample in question

(Cesare & Xiang, 2013). The crucial aspect for this and other malware research is to understand the significance of the malware problem, investigate new ways to detect the multitude of malware types and then benchmark those results against the current approaches. An overview of malware and the various aspects of malware are presented next.


Malware is software or a set of programs whose sole purpose is to damage,

disrupt or steal information from computer systems or networks (Kauranen & Makinen,

1990). Malware is often a broader term that includes viruses, worms, Trojans, botnets,

backdoors, exploits, etc. The most well-known type of malware is a virus. The term

“virus” was initially used by Fred Cohen in 1983 while conducting research for his

dissertation at the University of Southern California (Cohen, 1985). The fundamental

reason for creating such software or sets of programs is to create chaos, disrupt business or

pursue financial gain by harming computer systems (Kauranen & Makinen, 1990). The

creation of various types of malware has launched an entire commercial industry known as “anti-virus software” with revenues skyrocketing to several billion dollars (Noreen et al., 2009). Malware has been a major threat to computer and networks since the early

1990s (Noreen et al., 2009). However, malware sophistication has significantly

improved since the 1980s, when Fred Cohen coined the term. Cohen’s virus-based

research would propel numerous other researchers and the discovery of “classes” of malware.

Another type of malware is referred to as a Worm. A Worm is defined

as a type of malware that exploits vulnerabilities of unpatched systems with self-

propagation means to spread pervasively throughout a network (Weaver, Paxson,

Staniford, & Cunningham, 2003). Nazario (2004) describes worms as having the ability

to take advantage of system vulnerabilities that enable propagation via a network and

allows the execution of arbitrary code on a remote system. Large-scale worm outbreaks

have decreased significantly since the early 2000s. Panda Security (2014) reports that

worm malware accounted for approximately 6% of all malware infections in the first

quarter of 2013. Zero-day worms are still today a real concern for system and security

34

professionals. Zero-day worm attacks still represent a real threat to organizations due to

the lack of detection and the availability of high speed networks (Kaur & Singh, 2014).

Organizations that have highly interconnected networks are susceptible to such worm

malware attacks and should be a major concern for Internet users (Kaur & Singh, 2014).

Worms also can deliver other types of malware such as Trojans.

Trojans invade systems and reside within them in order to execute commands or instructions given by external threat actors (Gordon & Chess, 1998). Trojans enable threat actors to take control of compromised systems (K. Chen, Zhang, & Lian, 2013). Trojans conceal themselves inside computer systems in the hope of not being discovered. Some Trojans are active soon after being installed; other types wait for an instruction or some condition to be satisfied before executing (K. Chen et al., 2013). After receiving remote instructions, Trojans can execute and relay commands in order to disrupt computer operations, gather and exfiltrate sensitive information or attempt to gain access and privileges on host computer systems (Gordon & Chess, 1998). Trojans tend to be detected readily by signature-based detection such as anti-virus and other end-point protection software. However, systems are still susceptible to new Trojan attacks if user awareness and system hygiene are not addressed (K. Chen et al., 2013). Zero-day attacks taking advantage of system vulnerabilities are still real possibilities. However, system policies requiring analysis of first-time-run executables make this type of attack less likely.

Botnets represent one of the greatest infrastructure threats to the Internet (Barakat & Khattab, 2010). For years, the research community has investigated and described the impending issues and proposed countermeasures for disarming this capable adversary. Botnet is typically a term used to describe a network of infected end-hosts that become bots under the control of a bot master (Barakat & Khattab, 2010). The bot master's control may come from a human operator or from prescribed automated operations. Cyber criminals and adversaries use botnets to launch Denial of Service (DoS) attacks and Distributed Denial of Service (DDoS) attacks, and to aid in the propagation of polymorphic malware across the enterprise (Li, Duan, Liu, & Wu, 2010). Detection of bots comes in the form of monitoring for abnormal network, memory and system behavior. Unfortunately, botnet malware can remain dormant for long periods of time and then become active briefly for a special purpose (J. Zhang et al., 2011). Unless systems are closely monitored, botnet malware will have completed execution before notification and remediation action can be taken (Rodrigues, 2011).

Backdoors represent a malware mechanism that allows attackers surreptitious access to a computer system (Y. Zhang & Paxson, 2000). Backdoors have existed for many years and were initially designed into operating systems in order to facilitate access by system administrators (Bohra, Neamtiu, Gallard, Sultan, & Iftode, 2004). Today's systems are supposed to be free of backdoors that facilitate unauthorized access (K. Chen et al., 2013). Typically, malware backdoors embed themselves into systems and networks in order to provide a means of repeatable access for external attackers. Backdoors can exist for both interactive and non-interactive services on systems (K. Chen et al., 2013). Interactive services simply run commands or carry out instructions on the compromised system (Y. Zhang & Paxson, 2000). Non-interactive services include services such as relaying email spam or file transfer services through which data can be exfiltrated outside the organization (Y. Zhang & Paxson, 2000). Backdoors are very difficult, if not impossible, to detect because they take advantage of authorized services, ports and protocols.

Taking advantage of flaws within a computer system, network or mobile platform requires that a vulnerability exist and that attackers have the means to exploit the weakness (Ritchey & Ammann, 2000). Malware exploits target these vulnerabilities and seek to gain access to and somehow compromise the system or network (Rodrigues, 2011). Computer systems, networks and mobile devices will always have some underlying vulnerability. System designers and developers try to minimize the exposure of such weaknesses (Ritchey & Ammann, 2000). System and network vulnerabilities may also be introduced by other activities such as system integration, system operations and maintenance or poor system or network hygiene (Kim & Hong, 2014). However, today's network connectivity makes malware exploits more feasible, as attackers take advantage of that connectivity to probe and understand system vulnerabilities. Malware exploits target networks, servers and applications in order to disrupt operations or steal information for financial gain (Alam, Horspool, & Traore, 2014). Detection of malware exploits is even more difficult, as they appear to simply be using standard network and authorized system calls.

Ransom Malware or "Ransomware" has become quite prevalent over the past few years. Ransomware has also become known as scareware, as cyber criminals use the malware to prey on the fears of infected computer users by stealing or encrypting user data (Kharraz et al., 2015). Ransomware preys on people's fear of losing control of personal or private information, losing access to highly critical or sensitive data or damaging hardware such that access is no longer possible (Kharraz et al., 2015). Certain types of ransomware will essentially encrypt data or system files such that access to the system or data can only be regained by unlocking the system with a "key" (Gazet, 2010). Ransomware generally provides payment information, and once the transaction is complete, users are provided decryption keys in order to regain access (Gazet, 2010). Payments are generally made in bitcoin or some other method that makes tracing the payment to the recipient difficult or near impossible (Kharraz et al., 2015). Although ransomware has been around for over 10 years, not until recently has the volume of ransomware raised major concerns within the security community. According to Symantec (2014), the number of ransomware attacks increased by over 500% in 2013. In addition, ransomware has been in the press with attacks on local police, hospitals and small municipalities. In 2013, the Cryptolocker ransomware was reported to have infected nearly 250,000 computers worldwide (Symantec Corporation, 2014). Given the substantial growth of and spotlight on ransomware attacks, developing protection and detection mechanisms should be a major research area for the security community. However, detecting ransomware and protecting organizations from this type of attack is difficult without insight into the tactics and sophistication of these attacks.

The discussion above highlights the various types of malware and the impact they have on system owners, security professionals and ordinary end-users. Protecting against, detecting and remediating malware has become one of the fastest growing markets in the technology sector. More research is needed in order to keep pace with advanced malware, as detection becomes more critical each day.


Synthesis of Current Literature

Several researchers have described the history of malware in "waves" or levels of sophistication. The first wave spans from the late 1970s through the early 1990s (T. Chen & Robert, 2004). This first malware outbreak or "wave" can be described as inquisitive, exploratory and unsophisticated (Wüchner et al., 2014). The first wave was really considered to be exploratory or accidental, with no clear malicious intent (T. Chen & Robert, 2004). This period can be symbolized by the first self-spreading internet malware (the Morris Worm), which infected approximately 10% of the computers connected to the internet (T. Chen & Robert, 2004).

The second and third waves cover almost eleven years, from 1990 through 2001. The second wave covers 1990 through 1999 and can be characterized by the first use of polymorphic malware and encryption techniques to evade detection by anti-virus software (T. Chen & Robert, 2004). The third wave spanned only a couple of years, 1999 through 2001, and was consumed with malware being distributed through email (T. Chen & Robert, 2004). Malware such as the Melissa, PrettyPark or LoveLetter viruses was distributed through email, and the executables had additional functionality that allowed them to propagate and maintain persistence (T. Chen & Robert, 2004). The third wave was also the first in which malware achieved remote system access in order to steal sensitive information.

The fourth wave is considered to have taken place from 2001 through 2009. The rise of malware during this era was characterized by the increasing sophistication of malware that leveraged multiple vulnerabilities to infect and propagate infections through instant messaging or peer-to-peer file sharing (Shafiq, Khayam, & Farooq, 2008). The fourth wave also took advantage of unprotected applications and application interfaces by dynamically downloading macros and additional malicious payloads (Rafique & Chen, 2014). The fourth wave also saw malware adapting to more effective detections, along with changes in malware developers' goals and motivations. Malware such as CodeRed and Slammer took advantage of multiple host vulnerabilities to infect hosts and utilized email to propagate throughout networks to exploit system vulnerabilities (Rafique & Chen, 2014). The fourth wave of malware also seized the opportunity to compromise unprotected network shares and served up drive-by infections via webservers (Rafique & Chen, 2014).

The fifth wave raised the level of intent, targeting and sophistication of the malware (Wüchner et al., 2014). During this time period, roughly 2010 until present day, malware has become more targeted and is developed by highly skilled professionals (Wüchner et al., 2014). Malware such as Stuxnet attacked Supervisory Control and Data Acquisition (SCADA) systems and was linked to sabotaging the Iranian nuclear program (Virvilis, Gritzalis, & Apostolopoulos, 2013). This wave was the first in which malware was referred to in terms of Advanced Persistent Threats (APTs) and Zero-Day exploits. Malware from this period is designed to attack specific systems and targets, and its level of sophistication goes beyond that of the commodity malware developed in previous waves (Virvilis et al., 2013). APTs such as Stuxnet, Duqu or Flame have been linked to government sponsors (Wüchner et al., 2014). The motivation of these professional malware developers is economic, espionage or sabotage against targeted systems (Wüchner et al., 2014). Motivations for commodity malware differ and are more experimental and economically based (Wüchner et al., 2014). Commodity malware is typically developed by amateurs and organized crime. However, commodity malware developers will take advantage of collaborating and sharing with professional developers to gain insight into advanced exploit methods and techniques (Wüchner et al., 2014). In addition to methods and techniques, these groups will often share the concealment and obfuscation techniques needed to avoid detection (Wüchner et al., 2014). Zero-day exploits and APTs can go unnoticed for long periods of time due to advanced stealth and concealment techniques (Kaur & Singh, 2014). These waves represent an ongoing threat to organizations and the internet as a whole.

In summary, the development of malware has changed over time in both motivation and sophistication. There has been a radical shift from accidental outbreaks to very targeted and specific attacks. Many malware attacks have economic and organizational motivations. Malware developers have changed over time as well, from amateurs to highly skilled and trained professionals. The malware itself has adopted sophisticated exploitation, propagation and replication techniques to avoid and evade detection. Today's advanced malware uses various techniques such as encryption, environmental sensing and embedded compilers to hide functions from static and dynamic analysis. Malware research therefore spans years of work across multiple topical areas. The synthesis of current literature is organized by topic area and presented sequentially, with the goal of providing a foundational as well as a temporal perspective for each malware research topic. In some cases, there have been and continue to be arguments among researchers as to which characteristics constitute certain malware types or classes. The following is an overview of the foundational topics and related research in conducting malware detection research.


Virus Research

John von Neumann was the first to develop the concept of today's computer virus, and his work is considered to be the seminal work for viruses (von Neumann & Burks, 1966). Von Neumann developed his theory of the computer virus while speculating on the fact that programs could produce and generate "code"; his lectures on the subject were published posthumously in 1966. He conceived and developed the concept of computer programs self-replicating through what he called "reproducing automata" (von Neumann & Burks, 1966). Although von Neumann did not develop an actual computer program, he envisioned the idea of self-replicating automata that could go viral if not addressed (T. Chen & Robert, 2004). In essence, von Neumann predicted the spawning of malware prior to the actual technical implementation of the concept (T. Chen & Robert, 2004). In 1971, Bob Thomas created the Creeper program, the first implementation of von Neumann's virus concept (T. Chen & Robert, 2004). The Creeper program is considered to be the first virus and the first piece of malware released into the wild (T. Chen & Robert, 2004). Creeper was not developed as a piece of malware but rather to test the concept of self-replication through a network (the ARPANET) and to inform users of its existence once it achieved a foothold on a computer (T. Chen & Robert, 2004). Creeper is not considered to be malware in the truest sense of the word, as it was not designed to actually harm any computer system (T. Chen & Robert, 2004). However, Thomas's Creeper program is largely considered to be the "father" of all future worms and viruses for its self-replication techniques (T. Chen & Robert, 2004). Although Thomas's program became the first to actually achieve self-replication from a malware perspective, such automata would not be formally studied for another ten years.


Cohen's research is considered the first formal work in the field of computer viruses (Noreen et al., 2009). Cohen's seminal research on "viruses" was conducted from 1983 through 1984 and would later be published in 1985. Cohen is credited with bringing viruses to the attention of the larger computer community with this research (Noreen et al., 2009). Cohen explored a number of approaches to detecting viruses. His research was aimed not only at detection but also at the removal of unwanted software/programs, and he began to classify various forms of viruses. Cohen's detection and removal methods did not rely upon information sharing or the transitivity of information flow (Cohen, 1985). Instead, his detection techniques were based on identifying various code traits used by viruses to exploit Turing machines (Cohen, 1985). Cohen's research outlines several types of computer viruses and the ability of the attacker to quickly gain administrative rights to systems once infected (Cohen, 1985). Cohen also began to describe future issues with malware and variants that would be difficult to detect, as well as the problems with removing a virus from a system once it was infected (Cohen, 1985). Cohen also began to articulate advanced classes of viruses that could mutate to avoid detection and removal (Cohen, 1985). His research would serve to accelerate other malware research and begin to paint a picture of new attacks on computer systems.

Research conducted in 1987 by Maria M. Pozzo and Terence E. Gray provided an approach to detecting the modification of executable code by viruses and by a new type of malware known as Worms (P. K. Singh & Lakhotia, 2002). Pozzo and Gray presented various methods for detecting changes in executable code that could be harboring a virus. Their approach analyzed the run-time executable(s) and detected whether the executable(s) had been modified since installation (P. K. Singh & Lakhotia, 2002). This early virus detection utilized encryption to store code values for the executable modules of a program (A. Singh, Walenstein, & Lakhotia, 2012). The encrypted value for each module was used to create a hash value, known today as a "signature". This set of signatures enabled the detection of modified executables, as the hash values would no longer match the values of the original modules. Although the implementation approach would not be sufficient for today's rapidly changing code environment, the signature concept would be used by an entire anti-virus industry to register, track and share virus hashes (A. Singh et al., 2012).
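The signature concept described above — record a digest of each executable module at installation time and flag any later mismatch — can be sketched in a few lines. This is an illustrative reconstruction, not Pozzo and Gray's implementation: the function names are invented, and a modern SHA-256 hash stands in for their encryption-based code values.

```python
import hashlib
from pathlib import Path

def module_signature(path: Path) -> str:
    """Compute a cryptographic digest (the 'signature') of one module."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_baseline(modules) -> dict:
    """Record a signature for each executable module at install time."""
    return {str(p): module_signature(p) for p in modules}

def detect_modified(baseline: dict) -> list:
    """Return the modules whose current digest no longer matches."""
    return [name for name, sig in baseline.items()
            if module_signature(Path(name)) != sig]
```

Any byte-level change to a module changes its digest, so a later scan against the stored baseline reveals the modification — the same property the anti-virus industry later exploited to register and share virus hashes.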

Fred Cohen, in 1989, advanced a number of malware theories regarding virus detection and protection through a series of published papers intended to highlight malware issues. Cohen (1989) presented a formal model for defining computer viruses. Cohen's virus model formally defines sets of transitive integrity-corrupting mechanisms called "viral sets". These viral sets contained various characteristics that uniquely identified the virus and the type of attack (Cohen, 1989b). Further, Cohen's research explored the deeper computational properties of the defined viral sets in order to expose the underlying code. Additional research conducted by Cohen shifts from the detection of viruses to the protection of computer systems from attacks by these viral sets (Cohen, 1989a). Cohen also presented additional research regarding the automated detection of modified executables in order to prevent the spread of viruses on networked computer systems (Cohen, 1989b). His work illustrated various virus models and the detections needed by host computers (Kauranen & Makinen, 1990). Further, Cohen's models were used to simulate the infection and protection of trusted and untrusted systems (Kauranen & Makinen, 1990). These models were then used to demonstrate both theoretical and operational infections to illustrate the feasibility of viral attacks (Cohen, 1989b). Cohen's research provides greater insight into the optimal protection mechanisms needed to stop the propagation of secondary infections to other systems (Kauranen & Makinen, 1990).

In 1990, Kerchen et al. at the University of California, Davis proposed to analyze malware in new ways. Their approach was to use static analysis techniques to discover whether code was indeed malware (Kerchen et al., 1990). The techniques provided heuristic tools to detect malicious code in a UNIX environment, and the tools could detect computer viruses prior to loading and executing the malware (Kerchen et al., 1990). Kerchen et al. (1990) used two tools to accomplish malware detection. The first detection tool searched for duplicate system calls in the compiled and linked program (Kerchen et al., 1990). The second detection tool used static analysis of the executable to determine the files/libraries the program would write to during execution (Kerchen et al., 1990). Through the use of both tools, Kerchen et al. were able to determine whether a program could be identified as malicious or benign. The approach presented in this research would lead other researchers to investigate new ways to look at the growing malware problem.
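The first of these heuristics — flagging system calls that appear more often than a benign program would need — can be loosely approximated by counting call-name strings in an executable's raw bytes. This is a toy sketch, not Kerchen et al.'s tool: the watchlist of call names is invented for illustration, and a real implementation would inspect the program's symbol table rather than raw strings.

```python
from collections import Counter

# Hypothetical watchlist: call names a small benign utility would
# rarely reference more than once (illustrative only).
SUSPECT_CALLS = (b"open", b"unlink", b"exec", b"chmod")

def scan_calls(binary: bytes) -> Counter:
    """Count occurrences of watched call-name strings in raw bytes."""
    counts = Counter()
    for name in SUSPECT_CALLS:
        counts[name.decode()] = binary.count(name)
    return counts

def flag_duplicates(binary: bytes) -> list:
    """Heuristic: report watched calls appearing more than once."""
    return [call for call, n in scan_calls(binary).items() if n > 1]
```

A program whose image references `exec` twice, for instance, would be flagged for manual review, mirroring the spirit of the original duplicate-system-call heuristic.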

In 1990, Kephart and White began to look at computer viruses through epidemiological models similar to those used in advanced disease research (J. Kephart & White, 1991). Kephart and White, performing research at the IBM Thomas J. Watson Research Center, began to address malware from an immune system perspective. Their research draws parallels between viral outbreaks and infections of the human body and those of computer systems (J. Kephart & White, 1991). Theirs was the first published research to compare infections of the human body with computer virus epidemics. Their work began a theoretical discussion of viral propagation using deterministic and stochastic models (J. Kephart & White, 1991). The study also describes the conditions under which widespread computer viral epidemics would be likely to occur across inter-connected networks. One of the key outcomes of their research was the argument that imperfect defenses can still be highly effective at preventing the widespread propagation of malware (J. Kephart & White, 1991).
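The deterministic side of such epidemiological analysis can be illustrated with a minimal susceptible-infected-susceptible (SIS) simulation: infected machines infect others at a birth rate β and are cured at a rate δ, and an epidemic persists only when β exceeds the epidemic threshold set by δ. This sketch and its parameter values are illustrative, not drawn from Kephart and White's paper.

```python
def sis_fraction(beta: float, delta: float, i0: float = 0.01,
                 steps: int = 5000, dt: float = 0.01) -> float:
    """Iterate di/dt = beta*i*(1 - i) - delta*i (homogeneous-mixing
    SIS model) and return the final infected fraction i."""
    i = i0
    for _ in range(steps):
        i += dt * (beta * i * (1.0 - i) - delta * i)
    return i
```

With β above the threshold (e.g. β = 0.5, δ = 0.1) the infected fraction settles near the endemic level 1 − δ/β; below it (e.g. β = 0.1, δ = 0.5) the infection dies out — which is why even imperfect defenses that merely push δ above β can stop widespread propagation.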

In 1992, Cohen presented advanced findings regarding defensive models for computer viruses. Cohen proposed a formal definition of "computer worms" and detailed the properties that would define this class of virus. He defined "computer worms" as a malware subclass of viruses based on certain properties (Cohen, 1992). Cohen presented an alternative formal definition of a virus based on the foundational work presented by Professor Leonard Adleman in 1989 (Spafford, 1991). Cohen adopted Adleman's definition of viruses based on set theory. Although these virus definitions were not specific, they covered a broad range of replicating programs, including Worms (Cohen, 1992). Cohen performed some analysis of the internal code of Worms and viruses. He later went on to discern the differences between viruses and worms by the nature of the functions and the self-replication needed for sustainment. Cohen demonstrated that viruses merely create replicas. Worms, on the other hand, were presented as more purposeful viruses because of their reliability, ability to spread and ability to maintain malicious functionality with replication (Cohen, 1992). This deeper analysis provided a stark contrast between Worms and viruses. These definitions launched additional research and created classes of malware based on executable functions. The debate regarding the characteristics of worms and viruses continues throughout today's research community.

Kephart and Arnold (1994) extended this research by identifying viral signatures from executable code using statistical methods applied to the various functions contained within the virus. The researchers later used the analogy of the human body and its immune system to model viruses and viral attacks (J. Kephart et al., 1995). The researchers believed that new viruses acting as "intelligent agents" could begin to infect and propagate themselves across connected networks in new ways (J. Kephart et al., 1995). In 1995, IBM was developing techniques to prevent computer infections by using biologically inspired anti-virus protection (J. Kephart et al., 1995). The researchers were early adopters of neural network-based virus detection (J. Kephart et al., 1995). This early research leveraged neural network learning to discriminate between infected and uninfected programs. Further, this research was extended to identify new viruses and remove the infected files automatically (J. Kephart & Arnold, 1994).
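The feature side of such a learned discriminator can be sketched by counting short byte sequences in a program image and scoring them with a single weighted neuron. The trigram weights and threshold below are invented for illustration; in the original IBM work they were learned from labeled samples rather than hand-set.

```python
from collections import Counter

def trigram_features(code: bytes) -> Counter:
    """Count overlapping 3-byte sequences in a program image."""
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def score(code: bytes, weights: dict, bias: float = 0.0) -> float:
    """Single-neuron score: weighted sum of trigram counts plus a
    bias. A positive score means 'looks infected'."""
    feats = trigram_features(code)
    return bias + sum(w * feats.get(t, 0) for t, w in weights.items())

def classify(code: bytes, weights: dict, bias: float = 0.0) -> bool:
    """Discriminate infected from uninfected by the score's sign."""
    return score(code, weights, bias) > 0.0
```

Training would adjust the per-trigram weights so that sequences common in infected samples but rare in clean code push the score positive — the discrimination this early research relied upon.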

Spafford (1994) began to define how computer viruses operate and to distinguish classes of malware based on these operations. Spafford discussed the nature of true viruses and began to describe the capabilities contained in the malware. Spafford defines true viruses as having two major components: one that handles the spread of the virus, and another that delivers a payload task (Spafford, 1994). Spafford was one of the first researchers to recognize that the payload task may follow an infection and not be present initially; instead, the payload task may await a triggering condition or event (Spafford, 1994). Spafford describes how viruses work and how a virus must add itself to another piece of executable code (Spafford, 1994). Spafford's research classified various forms of computer viruses, including Worms, shell code, intrusive code and companion code (Trojans). Spafford went to great lengths to highlight that companion viruses are not real viruses unless a more encompassing definition of virus is used. Again, the classification of viruses began to differentiate malware properties. The research also draws parallels between viruses and the properties associated with life as defined by some researchers in the area of artificial life and self-organizing systems (Spafford, 1994). Spafford also begins to describe an artificial "life" for viruses within computer systems and related environments (Spafford, 1994).

In 1996, Bontchev presented a new threat vector and an approach for detecting the presence of macro viruses in Microsoft Word for Windows. This type of attack was relatively new, and the new "virus" would rely on the availability of an underlying program that supported "macros" (Bontchev, 1996). Bontchev dove deeper into the MS Word macro attack and found that the typical anti-viral software companies had developed inadequate protections for this type of attack. Bontchev also demonstrated that while typical viruses replicate themselves in certain ways, the Word macros lived from document to document (Bontchev, 1996). Bontchev also discussed the need for advanced integrity checking for application programs. His paper also discussed the significant threat that this type of attack could represent for application programs that enable macros and the protections needed to prevent such an attack (Bontchev, 1996).

In 1997, Kephart et al. began to reflect on and research a decade of the growing computer virus problem. The authors believed that the anti-virus community was engaged in an escalating arms race with malware developers (J. Kephart, Sorkin, Swimmer, & White, 1999). However, the authors believed that the "war" was still manageable and winnable (J. Kephart et al., 1999). Their solution was to develop a blueprint that automated the detection and remediation of viruses on computer systems. The authors believed that detection rates would increase and that human experts would solve the problem in the long term using such a blueprint (J. Kephart et al., 1999). The researchers also acknowledged that the internet was becoming a fertile ground for new breeds of malware (J. Kephart et al., 1999). Kephart et al. believed the solution would be to develop a protection system for computers similar to the human immune system (J. Kephart et al., 1999). Just as the human immune system senses the presence of previously unknown pathogens, the researchers envisioned protection for computer systems that would one day automatically detect and remove malware. However, they would severely underestimate the growth of the malware problem.

In 1998, White began to identify and articulate the challenges of the growing computer virus problem (White, 1998). White's desire was to raise awareness of, and alert the growing computer community to, the many open issues facing researchers in virus detection and protection (J. Kephart et al., 1999). White's research highlights five problematic issues: 1) the development of new heuristics for virus detection, 2) the study of viral spread and epidemiology, 3) the deployment of a distributed digital immune system for detecting new viruses, 4) the detection of worm programs and 5) proactive versus reactive approaches toward the detection of virus programs. Many of these issues still exist today.

In 1998, Bontchev extended his previous computer virus research by expanding the definition of "viruses". His research expanded the "virus" definition and pushed for a greater understanding of the virus ecosystem. Bontchev explored the incomplete nature of definitions of computer viruses (V. Bontchev, 1998). This work discussed advanced classification and analysis of computer viruses. Bontchev also discussed the incomplete nature of anti-virus software and possible attacks bypassing anti-virus software. Further, Bontchev called for advanced testing methods for anti-virus software. The research also discussed system issues and the social aspects of the growing computer virus problem. Bontchev went on to discuss useful applications of self-replicating software that are not malware.

In 1999, Jeffrey Kephart and Steve White presented an update to their previous work conducted in 1993. This updated research proposed two new epidemiological models of computer virus spread (J. Kephart et al., 1999). The two models were developed to explain the epidemic and non-epidemic spread of viruses outside the laboratory environment. This work was conducted in order to explain the non-occurrence of predicted virus outbreaks in the workplace. The researchers had predicted a global virus outbreak in their previous study (J. O. Kephart, 1993). However, their interest was heightened because only a small fraction of all well-known viruses seemed to have appeared in real business environments (J. Kephart et al., 1999). The researchers wanted to investigate whether the low rate of virus infections was real or whether the theoretical epidemic threshold was too high and reporting was being lost. The researchers developed a localized model of software exchange in order to observe and explain the sub-exponential rate of viral spread (J. Kephart et al., 1999).

Worms

In 1989, Joyce Reynolds, in her work at the University of Southern California, began to describe a new type of malware known as a Worm (Reynolds, 1989). Reynolds describes the infection and cure of the newly released Internet Worm. Reynolds' work begins to envision the impact of a worm on the greater Internet community. Her work describes how Worms could propagate through a series of inter-connected computer systems and begins to paint a picture of the damage that Worms could cause through such an attack (Reynolds, 1989). Her work also evaluated the social and ethical issues of attacking the Internet ecosystem. A large part of Reynolds' work was based on reviewing and detailing the inadvertent release of the Internet Worm on the evening of November 2, 1988.

Other malware research followed and began to leverage Cohen’s viral models.

These models were used to describe and classify additional types of malware. Additional research efforts began to classify malware by execution properties, executable traits, program characteristics and behaviors (Cohen, 1989b). There began a debate over types of malware. The classification of malware became a hot topic as researchers began to describe all malware as viruses. Some researchers believed that a distinction between various classes of malware needed to be drawn. The malware term Worm is largely credited to the work of Spafford (Spafford, 1989). Spafford (1989) defined a worm as “a program that can run independently and can propagate a fully working version of itself onto other machines”. Spafford’s work at Purdue University wanted to make sure that

Worms were classified differently from viruses (Denning, 1989). Spafford analyzed and described the characteristics of a Worm to self-replicate and spread itself to computer systems over a network. Spafford also explored issues with Worm replication over the integrated network known today as the Internet. Spafford is credited with dissecting the

November 1988 Internet Worm incident that infected thousands of machines (Denning,

1989). Spafford describes how the Worm attack known as the Morris Worm worked its


way through a series of inter-connected computer systems. Spafford raises concerns with

possible attacks on other computer systems/networks in which commerce, transportation,

utilities, defense, space flight and other critical activities depended on system inter-

connectivity (Denning, 1989). Other researchers would begin additional research into

various malware classes.

Eichin and Rochlis (1989) also began to analyze the inadvertent Internet Worm

attack in 1989 at the Massachusetts Institute of Technology. Their published paper defines the classification of the Internet "Worm" as a "virus" (Eichin & Rochlis, 1989).

Their paper leverages some of the work from Cohen but began to deviate from

classifying malware by executable code (Eichin & Rochlis, 1989). Instead, Eichin and

Rochlis begin to describe the possible intent of development teams in releasing malware

to attack specific computer targets. Their research was the first published work that

focused on the targeting strategies employed to execute a specific malware attack (Eichin

& Rochlis, 1989). This research also discussed the effective and ineffective defenses

used by the larger “Internet” community as a whole (Eichin & Rochlis, 1989). Their

work at MIT detailed a step by step account of the Internet crisis of 1988. Their work

outlined the various defensive security flaws that were exploited to attack the “inter-

connected” systems. Their work also described the propagation of the Worm/Virus

across the Internet. This research emphasized the corrective actions needed to prevent future attacks.

Seeley at the University of Utah also analyzed the same November 1988 Internet

Worm incident. Seeley analyzed the program, systems and the executables needed to propel the attack (Seeley, 1989). Seeley examined the Worm program, a 99-line


bootstrap program written in the C language and an accompanying object file used on UNIX systems such as VAX and Sun (Seeley, 1989). The basic goals of the attack were to locate

systems across the network, penetrate those systems by exploiting security flaws with

remote connections and replicate and execute the Worm (code) on the remote system

(Seeley, 1989). Penetration of a remote system was accomplished in one of three ways:

1) taking advantage of security flaws in the “listening” server, 2) exploiting the “trap

door” in the SMTP mail service, or 3) guessing passwords for administrative accounts or

taking advantage of non-set passwords to elevate credentials (Seeley, 1989). Seeley also

outlined the defensive measures used by the Worm to prevent detection by inhibiting

analysis of the program. The worm’s simplest means of hiding itself was to change the

program name and directory (Seeley, 1989). Seeley’s work set the stage for analyzing

malware from an attacker’s defensive perspective.

In 2002, other researchers continued to investigate the 2001 Worm attack. Moore

et al. (2002) analyzed the Code-Red worm, which infected roughly 359,000 hosts

across the Internet in 2001 in less than 14 hours. Additional research for more advanced

Worm malware highlights how the Storm Worm can be used to create botnets used by bot

masters to send spam emails or perform distributed denial of service attacks (DDOS)

(Kanich et al., 2008). The cost of the Code-Red malware epidemic was estimated to be in

excess of $2.6 billion (Moore et al., 2002). The researchers used various techniques to

analyze how multiple worm-infected computers worked to propagate the Code-Red

Worm and consume network bandwidth in a targeted and coordinated manner (Moore et

al., 2002).


In 2004, Williamson et al. presented the idea of Virus Throttling. Virus

Throttling was a technique used to slow the spread of worms and viruses by disrupting their propagation activities largely over TCP/IP (Williamson et al., 2004). This research was conducted in conjunction with Massachusetts Institute of Technology (MIT) and

Hewlett Packard (HP). The Virus Throttling concept was used to prevent an infected machine from infecting other machines on the same network (Williamson et al., 2004).

The end result of such a technique was that there were fewer machines infected and there was less traffic generated by the virus over the network (Williamson et al., 2004). The technique worked well for Worms that used TCP/IP protocols and seemed to have promise for other protocols. The propagation of Worms at this time was largely attributed to the use of Instant Messaging over corporate networks. Malware being spread

over Instant Messaging was a growing concern and represented a significant threat at this time (Williamson et al., 2004). Virus Throttling was a technique used to address a specific malware using certain protocols. This was a step forward in terms of addressing specific malware behaviors.
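The throttling mechanism described above can be approximated with a short sketch: outbound connections to recently contacted hosts pass immediately, while connections to new hosts are queued and released at a fixed rate, slowing a scanning worm without disturbing normal traffic. The class and parameter names below are illustrative only, not the implementation of Williamson et al.:

```python
from collections import deque

class VirusThrottle:
    """Approximation of Williamson-style virus throttling: outbound
    connections to recently seen hosts pass; connections to new hosts
    are queued and released one per clock tick, capping a worm's scan rate."""

    def __init__(self, working_set_size=5):
        self.working_set = deque(maxlen=working_set_size)  # recent destinations
        self.delay_queue = deque()                         # pending new destinations

    def request(self, host):
        """Return True if the connection may proceed now, else queue it."""
        if host in self.working_set:
            return True
        self.delay_queue.append(host)
        return False

    def tick(self):
        """Once per time unit, release at most one queued connection."""
        if not self.delay_queue:
            return None
        host = self.delay_queue.popleft()
        self.working_set.append(host)
        return host

throttle = VirusThrottle()
# Normal traffic: repeated contact with a few familiar servers passes freely.
throttle.working_set.extend(["mail", "web"])
assert throttle.request("mail") is True
# Worm-like traffic: a burst of fresh addresses is queued, not sent.
blocked = [throttle.request(f"10.0.0.{i}") for i in range(4)]
assert blocked == [False, False, False, False]
assert throttle.tick() == "10.0.0.0"   # only one new host released per tick
```

Because an infected machine contacts many new addresses per second while a normal one contacts few, the queue grows rapidly only under worm behavior, which both slows propagation and serves as a detection signal.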

Trojans

Ken Thompson, in 1984, wrote an additional seminal paper regarding Trojan malware (Thompson, 1984). His Turing Award lecture “Reflections on Trusting Trust” was widely publicized for identifying the problem known today as Trojan malware; in it he clearly presented, explained, and demonstrated a practical and dangerous Trojan attack using a standard UNIX compiler (Wheeler, 2005). His presentation demonstrated how to modify the Unix C compiler to inject a Trojan piece of code into an executable program (Wheeler, 2005). The injected code modified the


operating system login program to escalate privileges and grant root access to the UNIX

system (Wheeler, 2005). Thompson then demonstrated the ability to modify and

recompile the compiler itself with additional code designed to detect the existence of

Trojans in compiled code (Wheeler, 2005). Once the additional code was added, the

“Trojan code” attack would be removed from the source code so that no trace of the attack could be detected in the source (Wheeler, 2005). Thompson showed that these Trojan attacks could

persist through numerous recompilations and cross-compilations of the compiler

(Wheeler, 2005). He then argued that no level of source-level verification or scrutiny would protect systems from such embedded malware code. In fact, he described how the problem could exist at levels below the compiler, such as assembly-level code or even hardware microcode, which would be even harder to detect (Wheeler, 2005). Thompson implemented his attack on a Bell Labs UNIX C compiler and successfully launched the attack on another Bell Labs group’s computer systems (Wheeler, 2005). His attacks were never detected within Bell Labs and the malicious compiler was never released outside of

Bell Labs (Wheeler, 2005). This research began to propel other research into Trojan malware and investigation into detection methods for executable code.

Botnets

Botnets represent the greatest threat to the internet according to multiple researchers (Barakat & Khattab, 2010). Bots represent an infected host that has a connection via a network to a botmaster (Barakat & Khattab, 2010). A botnet represents

a number of infected end-hosts under the command and control (C&C) of a bot-

master (Barakat & Khattab, 2010). Botnets set out to infiltrate and connect more

vulnerable machines to the botmaster. Recruiting additional bots is usually done


by exploiting various software vulnerabilities or by propagating malware to eventually

exploit a host (Vogt, Aycock, & Jacobson, 2007). All botnets are controlled by at least

one command and control (C&C) channel tied to at least one botmaster, but in many cases there are several primary and secondary botmasters (Vogt et al., 2007). Communication channels are established in order to receive commands and funnel information back to the botmaster (Vogt et al., 2007). These channels have become more sophisticated over time by encrypting traffic and exfiltrating key information about the end-points, configurations, platforms, networks and organization (Vogt et al., 2007). The main purpose of such a secure communication channel is to provide a command and control

(C&C) medium for the botmaster’s commands (Vogt et al., 2007). Botnets are constantly recruiting new bots by exploiting different software vulnerabilities on end-points, replicating themselves to other hosts using the same malware and using advanced propagation techniques to spread across various networks (Barakat & Khattab, 2010). Newly recruited hosts often download the latest version of the “bot code” and run this code on the end-point, typically in the background. This new software then establishes connections to primary and secondary C&C servers. Once established, and usually going undetected by running as a known service, infected machines communicate and execute the commands of the botmaster (Vogt et al., 2007). Most C&C channels operate at the application layer and can use the IRC chat protocol to further hide their activities (Barakat & Khattab,

2010). Establishing IRC chat protocols and using open source P2P protocols can put an organization’s end-points and networks at great risk of being taken over by botnets and super-botnets (Barakat & Khattab, 2010). Additionally, sophisticated botnets can also implement encryption and/or digitally sign instructions to make it almost impossible for


network security and defensive operations to detect and stop a botnet attack (Barakat &

Khattab, 2010).

Backdoors

Landwehr et al. (1994) provide some additional definitions of security flaws that

then could be used as an attack vector. The researchers defined security flaws as "any

conditions or circumstances that can result in denial of service, unauthorized disclosure,

unauthorized destruction of data, or unauthorized modification of data” (Landwehr et al.,

1994). These researchers developed a taxonomy for security flaws and detailed over 50

actual security flaws. The goal for developing such a taxonomy was to organize or

classify computer security flaws by type in order to prevent the exploitation of

unintended security flaws and purposeful misuse of flaws to compromise computer systems (Landwehr et al., 1994). This research was similar to the Research in Secured

Operating Systems (RISOS) project and Protection Analysis project conducted by

Information Sciences Institute of the University of Southern California (Landwehr et al.,

1994). Moreover, the researchers wanted to bring attention to the inherent security flaws

in software and the consequence of these security flaws in operating systems (Landwehr

et al., 1994).

Exploits

In 2000, McGraw and Morrisett published a paper that presented the growing

malware problem. This paper detailed a historical perspective of malware and the

various approaches to detect and remove malicious files (McGraw & Morrisett, 2000).

McGraw and Morrisett (from Cornell University) chaired a group of over twenty


malware researchers from across the world to elevate a world-wide malware discussion and a call to action (McGraw & Morrisett, 2000). The group discussed such issues as the increasing complexity of computer systems and the networks that deliver

connectivity. The group also discussed the ease of computer extensibility and the

susceptibility of these “networked” computer systems to be attacked. The group also concluded that any networked computing system is susceptible to malware or hostile code. However, the researchers also pointed out that an ever-present network, like the internet, provides attack vectors with ease (McGraw & Morrisett, 2000). The group also came to the realization that attackers no longer have to gain physical access to computer systems to propagate attacks. Networks, not physical access, became the highway to drive malware attacks (McGraw & Morrisett, 2000). Networks combined with rising system complexity provide more avenues for attack, and complex systems make it easier to hide or mask malicious code (McGraw & Morrisett, 2000).

Ransomware

Ransomware has become quite popular for cyber-criminals over the past few years. This type of malware has grown by some accounts by 500% each year since 2013

(Symantec, 2014). Ransomware has been classified as a Trojan variant or virus depending upon the particular malware variant infecting the system (Khakhutskyy,

2016). Although ransomware is thought to be fairly recent, ransomware has been around for nearly thirty years (Hampton & Baig, 2015). Ransomware is used by cyber criminals

to infect systems and encrypt data/files such that access to this information is only granted once

ransom or financial arrangements have been made to provide the “key” to unlock these

files (Gazet, 2010). However, these types of attacks are not new. The PC CYBORG


(AIDS) trojan, as an example, was delivered to many computers via floppy disk in 1989

(Hampton & Baig, 2015). The PC CYBORG trojan then encrypted files and instructed

users via a socially engineered message to pay a license fee with a $189 check to a

company in Panama (Hampton & Baig, 2015). Much of the 1990’s ransomware was

driven by amateur hackers in order to test and demonstrate technical capability (Hampton

& Baig, 2015).

Ransomware began to rise in the early 2000’s as malware developers started to

sell their “bot-nets” (Bechtel, 2014). Malware developers started to see a market for their

“bot-nets” and began to profit from direct information theft and advertising revenue

(Bechtel, 2014). The early 2000’s also saw an increase in theft of banking credentials or

sensitive passwords (Condon, 2012). By the late 2000’s, malware developers started to

work together and share compromised assets for sale to the highest bidder (Hampton &

Baig, 2015). It was these networks of “bots” that enabled cyber criminals to launch

large-scale attacks on organizations (Condon, 2012). It was about this time that nation

state organizations also started to realize the benefit of such networks (Carlson, Davis, &

Leach, 2014). Cyber criminals were now positioned to offer cyber-attacks to steal

intellectual property, run sophisticated phishing campaigns and propagate networks

further into targeted organizations (Hampton & Baig, 2015).

Until 2011, ransomware attacks had been isolated and unsophisticated (Condon,

2012). In late 2011, ransomware began to directly attack end-users in mass (Hampton &

Baig, 2015). By 2012, ransomware campaigns began attacking the more connected internet user base and preying on the hype of computer viruses (Hampton & Baig,

2015). In 2012, ransomware developers launched large-scale end-user attacks with “Fake


Anti-virus” (Kharraz et al., 2015). These fake attacks tricked end-users into believing

they had been infected by a serious virus and needed to pay for non-existent AV software

to remediate the infection and restore the compromised system (Krebs, 2012). Information

security companies and researchers alerted the public to the scam shortly after the attack.

However, due to the sheer number of these attacks, the credit card companies responded

to payments made to the “Fake AV” and this first ransomware attack was virtually shut

down overnight (Krebs, 2012).

More sophisticated ransomware attacks began to take place around 2005.

Malware known as “lockers” began to emerge and stage denial of service attacks on

infected systems (Young & Yung, 2016). These early “lockers” attacked boot operations and would not allow the machine to initialize until the “ransom” request was paid (Young

& Yung, 2016). Early “lockers” did not attack or touch the file system and remained in memory after boot-up. Most security companies responded to these “locker” attacks by extending Anti-Virus software to remove malicious software (Young & Yung, 2016).

More advanced “locker” attacks were seen in late 2005 with the PGPCoder/GPCode

locker. The PGPCoder/GPCode encryption locker was the first instance of ransomware

where files and content were encrypted and released only upon payment (Young & Yung, 2016).

Malware attackers released various versions of GPCode, and many of these versions

were greatly flawed. The flawed GPCode versions had issues including poorly implemented

encryption routines, breakable encryption keys and recovery of deleted content (Hampton

& Baig, 2015). Malware developers began to strengthen GPCode over time and improve

the many issues with encryption and encryption keys (Young & Yung, 2016). The

GPCode malware was not the only limiting factor for ransomware. Directly attacking

end-users required many points of contact including the end-user, end-user payment gateways, the cyber-attacker’s payment gateway and ultimately the malware developer

(Hampton & Baig, 2015). This ransom payment process was complex and risky. The many points of contact could slow the pace of payment and encumber the extortion process by allowing law enforcement to intervene to stop or track payments (Hampton &

Baig, 2015).

In 2013, ransomware met with the perfect storm for delivery and execution around the world. The perfect storm was represented by three components:

1) strong and unbreakable encryption technology, 2) delivery and exchange of encryption keys and 3) untraceable methods to execute ransom payments (Hampton &

Baig, 2015). CTB-Locker was the first ransom malware to take advantage of the perfect storm. CTB stands for “Curve, TOR, and Bitcoin” (Hampton & Baig, 2015). The

“curve” represented the elliptic curve cryptography implemented to encrypt the targeted files/content. The Onion Routing (TOR) protocol enabled anonymous communication for key exchange. Bitcoin provided secure and untraceable crypto-currency transactions in order to pay ransom. The CTB-Locker model still provides the means for a new generation of ransom malware to deny access, secure key exchange and make untraceable payments

(Young & Yung, 2016). According to Symantec, ransomware grew by 500% in 2013, likely due to the perfect storm (Symantec, 2014). Large, medium and small organizations were hit with the CTB-Locker ransomware. However, large corporate organizations had backup solutions that offered protection against a number of threats. Many medium and small organizations did not and still do not have the financial or technical resources to develop the necessary backup solutions to address these threats.


Today’s malware researchers offer that ransomware can be easily defeated through collaboration and sharing of malware samples (Kharraz et al., 2015). However, many information security experts caution that history suggests attackers will continue to improve their malware over time (Kharraz et al., 2015).

Ransomware continues to dominate the media and prey on the public’s fear of losing access to their information (Hampton & Baig, 2015). The security industry and academic community are always playing catch-up with advanced malware threats such as ransomware (Wüchner et al., 2014). Currently, ransomware has the technical and financial model for much larger payoffs (Kharraz et al., 2015). It is imperative that ransomware be monitored and analyzed for new and improved releases (Hampton &

Baig, 2015). The ransomware financial model is simply too lucrative to ignore. At some point, ransomware will migrate and target large corporate networks such as banks, hospitals and critical infrastructure. Sooner rather than later, large enterprise organizations will have to address the threat of ransomware within their operational environment.

Advanced Persistent Threats (APTs)

Advanced Persistent Threat (APT) is a term used to describe a new type of malware attack (Tankard, 2011). The term Advanced Persistent Threat (APT) was originally identified as a specific type of malware by the United States Air Force in 2006

(Rekdal & Bloemerus, 2013). APT’s have been described as well-funded, technically advanced and well-organized with financial motivations (Rekdal & Bloemerus, 2013).

APT attacks are typically unique and utilize multiple attack vectors to gain access to networks (Rekdal & Bloemerus, 2013). APTs employ advanced techniques to avoid

detection and remain on infected systems for long periods of time before activation

(Rekdal & Bloemerus, 2013). In 2011, high profile APT attacks gained notoriety by attacking some of the who’s who of government and technology organizations (Rekdal

& Bloemerus, 2013). Commercial organizations such as Sony, RSA Security, Lockheed

Martin, Citigroup, Fox Broadcasting and Public Broadcasting Service (PBS) were faced with new types of attacks (Nicho & Khan, 2014). United States Government organizations such as the National Aeronautics and Space Administration (NASA), Federal

Bureau of Investigation (FBI) and Department of Treasury were also targeted (Nicho &

Khan, 2014). Europe also faced similar struggles with this new type of malware.

European organizations such as the European Space Agency and the British and French treasuries were also targeted by APTs (Nicho & Khan, 2014).

APT malware has quickly become a major issue for information security professionals and leaders around the world (Molok, Ahmad, & Chang, 2012). APTs employ stealthy techniques to breach networks and establish long term surveillance within the network

(Molok et al., 2012). Malware developers for APTs are concerned with breaching systems and gaining unauthorized access to systems over time (Nicho & Khan, 2014).

The first phase for an APT is typically breaching and evading detection for extended periods of time (Molok et al., 2012). The second phase for APTs is to gather intelligence and perform reconnaissance within the breached network (Molok et al.,

2012). APTs’ intelligence gathering and reconnaissance is sophisticated and, the literature suggests, highly targeted to the organization (Symantec, 2014). During the intelligence gathering phase, APTs try to gain insight into operational aspects of the organization such as information assets, business functions, approval authorities and “normal”


communications (Molok et al., 2012). After the intelligence has been gathered, APTs

begin to capture the necessary credentials, gain access or escalate privileges in order to

achieve the target objective (Rekdal & Bloemerus, 2013). APT attackers tend to target

large organizations such as financial services, government and defense contractors

(Symantec, 2014). In 2010, Google, Adobe and other large U.S. organizations were

reportedly breached by Chinese APT attacks that stole intellectual property, email

accounts, and other organizational information (Panda Security, 2014). The APTs then

sent the stolen information to Taiwanese IP addresses (Panda Security, 2014). APT’s

remain a powerful malware and represent a major threat to businesses and Governments.

However, APT’s remain the least studied and the least understood of all malware (Nicho

& Khan, 2014). It is understood that APT’s adapt constantly and take advantage of

polymorphic techniques to evade detection for long periods of time (Kaur, 2014).

Several studies and authors suggest that the sophistication and targeting of APT malware may represent the next wave of military conflict (Dunlap, 2011).

Analyzing and Detecting Malware

In 1995, Lo et al. developed detection tools for various types of malware including computer viruses, worms, Trojans, and logic bombs (Lo, Levitt, & Olsson,

1995). This research was one of the first detection approaches that included Static

Analysis for the malware in question (Lo et al., 1995). The detection method used was called the Malicious Code Filter (MCF). MCF was used to detect malicious code and security related vulnerabilities in software (Lo et al., 1995). MCF could be used to perform off-line analysis to determine indications of compromise. The researchers proposed MCF to slice the program into small functions or “code pieces” in order to


analyze the overall program. Each sub-function could be analyzed to determine its maliciousness. Program slicing techniques were used to evaluate and collect the program properties (Lo et al., 1995). This approach enabled researchers to evaluate a number of sub-functions of a larger program to determine overall malicious behavior.

The researchers also proposed an approach to defeat “program slicing” and the potential countermeasures needed to maintain detection rates (Lo et al., 1995).
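The program-slicing idea behind this kind of analysis can be illustrated with a toy backward slicer over straight-line code: starting from a variable of interest, keep only the statements that can influence it, so each small slice can be examined for malicious behavior on its own. This is a simplified sketch of the general technique under a hypothetical program representation, not the MCF implementation:

```python
def backward_slice(program, target):
    """Toy backward slice over straight-line assignments.
    `program` is an ordered list of (defined_var, used_vars) tuples;
    the slice keeps only the statement indices that can influence `target`."""
    needed = {target}
    keep = []
    for index in range(len(program) - 1, -1, -1):   # walk backwards
        defined, used = program[index]
        if defined in needed:
            keep.append(index)
            needed.discard(defined)
            needed.update(used)                      # now need its inputs
    return sorted(keep)

# Toy program:
# 0: a = input()   1: b = input()
# 2: c = a + 1     3: d = b * 2     4: out = c
program = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"]), ("out", ["c"])]
# Only statements 0, 2 and 4 can affect `out`; statement 3 is irrelevant.
assert backward_slice(program, "out") == [0, 2, 4]
```

Real slicers must also handle control flow, pointers and procedure calls, but the core idea is the same: the slice for a suspicious output is usually much smaller than the whole program, making each "code piece" tractable to analyze.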

In 1998, Lee and Stolfo, working at Columbia University developed a general and systematic methodology for Intrusion Detection using data mining techniques. The researchers leveraged pattern recognition and machine learning techniques in order to model program execution properties and user behavior (W. Lee & Stolfo, 1998). The techniques employed were able to analyze system call data with network tcpdump data in order to detect potential anomalies from both programs and user behavior (W. Lee &

Stolfo, 1998). The researchers were able to construct concise and accurate classifiers by using the association rules algorithm and the frequent episodes algorithm (W. Lee &

Stolfo, 1998). These two algorithms were used to compare both intra-and inter-audit record patterns (W. Lee & Stolfo, 1998). Using this approach, researchers could analyze both standard and abnormal program or user behaviors. The discovered patterns then could be used to facilitate additional feature selection to improve detection rates.

In 2001, Wagner and Dean developed static and behavioral analysis methods for host-based intrusion detection. The researchers working at University of California

Berkeley developed methods to examine program specifications and behaviors exhibited at time of execution (Wagner & Dean, 2001). Static analysis for the first time was being used for intrusion detection and the researchers recognized that attacks should have


atypical behavioral characteristics (Wagner & Dean, 2001). Wagner and Dean first developed a specification for a program based upon the results from static analysis. Second, the authors used execution monitoring to determine whether the program executed as expected. The primary challenge for the researchers was to develop an intrusion detection technique with low false positive rates (Wagner & Dean, 2001). The researchers were able to develop a static analysis and execution monitoring approach to look for unexpected execution of functions. This approach was able to demonstrate positive results for host-based intrusion detection. The authors highlight three distinct advantages of combining static and dynamic analysis: 1) a high degree of automation, 2) protection against a broad class of malware attacks based on corrupted code, and 3) limited false positives or “false alarms” (Wagner & Dean, 2001).

In 2003, Linn and Debray, working at the University of Arizona, researched techniques used to obfuscate executable code to avoid static disassembly. Static analysis provides the ability to expose machine code as human readable functions that then can be used to reverse engineer software executables (Linn & Debray, 2003). Static analysis, while helpful in malware research, creates deep insight into the logic of executables and the library of functions used by the software by detailing the step-functions as the code executes. This insight provides malware analysts the ability to look into the executable and look for malicious functions as they execute. Static analysis also offers the opportunity for others to reverse engineer software and steal intellectual property from legitimate software providers (Linn & Debray, 2003). As part of the reverse engineering, executable disassembly provides a translation from machine code to assembly code (Linn & Debray, 2003). Code and function obfuscation techniques are

used by software developers in order to disrupt reverse engineering and program disassembly by making programs harder to disassemble correctly (Linn & Debray, 2003).

Linn and Debray concluded the paper with a discussion on two static disassembly algorithms that break obfuscation and various techniques used to impede these algorithms.

In 2005, Gheorghescu, working at Microsoft Corporation, performed research that would automate malware classification. The research conducted provided advanced malware classification methods to aid in the detection of malware. Classification and naming of viruses is helpful for sharing discovered malware. However, the anti-virus industry did not adhere to a standard naming convention, causing issues with processing new malware samples (Gheorghescu, 2005). Standard naming conventions could significantly speed the determination of files being malware or benign. The researcher introduced an innovative classification system for desktop computers. The classification system compared new and unknown samples with a known database of malware within minutes.

This approach would also track samples based on evolution of the malware sample

(Gheorghescu, 2005). Gheorghescu’s approach used three matching algorithms to process malware samples and, based on the results, made determinations of malicious or benign.

The research also presented methods for malware-handling tasks including sample clustering, outbreak detection, automatic virus naming, and phylogeny tree construction

(Gheorghescu, 2005).

In 2006, Baecher et al., working at the University of Mannheim, developed and presented a platform for processing a large-scale collection of self-replicating malware collected in the wild (Baecher, Koetter, Holz, Dornseif, & Freling, 2006). At this point in


time, there was little empirical data (quantitative or qualitative) to describe self-replicating malware. The inability to harness empirical data hampered many counter-

measures for malware including network-based and host-based intrusion detection

(Baecher et al., 2006). The Nepenthes platform provided an emulation environment to expose and capture attack data. It also provided a means to capture empirical data about self-replicating malware and to analyze thousands of samples of previously unknown malware (Baecher et al., 2006). The data collected by the Nepenthes platform provided empirical data to vendors of host-based

IDS/anti-virus systems (Baecher et al., 2006).

Kolter and Maloof (2006) describe the use of machine learning and data mining to detect and classify malicious executables found in the wild. The researchers, from

Stanford University and Georgetown, gathered nearly 2,000 benign and 1,700 malicious executables to perform advanced classification research. The researchers extracted more than 255 million distinct n-grams from these two sample sets. An n-gram is a substring of a file of fixed length n. These n-grams can be efficiently collected and analyzed for signatures. The researchers processed and extracted features for 3,700 files (Kolter & Maloof, 2006). The researchers evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting to process n-gram samples (Kolter & Maloof, 2006). The researchers found that boosted decision trees outperformed other algorithms with an area under the receiver operating characteristic (ROC) curve of 0.996. The researchers proposed an analysis methodology that should scale to larger collections of samples. The conducted studies examined three key parameters: the size of n-grams, the size of words and the number of selected features


(Kolter & Maloof, 2006). Due to limited computational resources, the study was unable to

evaluate exhaustively all methods for all settings. However, once the researchers applied

detectors to 291 malicious executables, they achieved a true-positive rate of 98% at a desired false-

positive rate of less than 5%. This was an important finding as the methodology

suggested that the approach could be used in operational systems for detecting unknown

malicious samples (Kolter & Maloof, 2006).
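The n-gram extraction step that Kolter and Maloof describe can be sketched in a few lines. This is a minimal illustration, not their implementation: the n-gram length and vocabulary size below are hypothetical, and their actual feature selection used information gain rather than raw frequency.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count overlapping byte n-grams in a binary blob (n=4 is illustrative)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def top_features(samples, k: int = 500):
    """Pool n-gram counts over many samples and keep the k most common as a
    shared feature vocabulary (a frequency-based stand-in for the
    information-gain selection used in the study)."""
    pooled = Counter()
    for blob in samples:
        pooled.update(byte_ngrams(blob))
    return [gram for gram, _ in pooled.most_common(k)]
```

Each file would then be vectorized as the presence or absence of every vocabulary n-gram, yielding the boolean features the inductive methods consume.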

In 2006, Lee and Mody presented methods for systems to automatically classify malware into families or categories. Their approach monitored the runtime behavior of applications and captured the series of functions as they were executed (T. Lee & Mody, 2006). Based upon these environmental parameters, the system was able to categorize applications. The system accurately classified malware based on execution properties and behavioral characteristics observed on a Microsoft-based computer system (T. Lee & Mody, 2006). The researchers constructed a knowledge base of application groups by sampling a large population of applications. Based upon the set of known functions and behaviors exhibited by prior classifications, the system was able to classify new applications into known application groups and render a verdict on whether the application was malware (T. Lee & Mody, 2006).

In 2007, Bilar, from Wellesley College (Massachusetts), presented a paper discussing detection mechanisms for malicious code based on statistical analysis of operation code (opcode) distributions (Bilar, 2007). The researcher disassembled 67 malware executables and computed opcode frequency distributions for each (Bilar, 2007). These results were compared to those of 20 non-malicious samples. Bilar (2007) found that the malware opcode distributions did differ significantly from those of non-malicious


samples. In addition, unique opcodes appeared to be a stronger predictor, as many of the malware samples had unique frequency distribution patterns (Bilar, 2007). Bilar found that sixty opcodes accounted for 99.8% of opcodes found in both malicious and benign samples. However, 14 malicious opcodes accounted for 92% of the total extracted opcodes, and the top 5 malicious opcodes accounted for 65% of the extracted opcodes. This research would later serve as a basis for applying machine learning to malware detection.
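The distribution statistics Bilar reports can be sketched as follows. This is an illustrative reconstruction only: the opcode lists would come from a disassembler, which is out of scope here, and the example opcodes are hypothetical.

```python
from collections import Counter

def opcode_distribution(opcodes):
    """Normalized opcode frequency distribution for one disassembled sample."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: c / total for op, c in counts.items()}

def top_share(dist, k):
    """Fraction of all opcodes accounted for by the k most frequent ones --
    the concentration statistic behind findings such as 'the top 5 malicious
    opcodes accounted for 65% of the extracted opcodes'."""
    return sum(sorted(dist.values(), reverse=True)[:k])
```

Comparing `opcode_distribution` outputs between malicious and benign corpora is the kind of statistical test from which Bilar drew his conclusions.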

In 2007, Martignoni, Christodorescu and Jha discussed the growing threat of malware whose sophistication allows it to evade signature-based detection (Martignoni, Christodorescu, & Jha, 2007). Malware developers can easily evade detection by "packing" the malicious payload in layers of compression or encryption (Martignoni et al., 2007). The researchers described state-of-the-art malware detection using both static and dynamic techniques to uncover packed malware. These techniques are slow due to the off-line nature of the analysis and prove to be highly ineffective against metamorphic malware (Martignoni et al., 2007). The researchers proposed a new technique known as OmniUnpack, which closely monitors the execution of a program in real time and detects when the program has removed the various layers of packing (Martignoni et al., 2007). OmniUnpack improved detection of packed malware by analyzing the unpacked malicious payload during runtime execution. Experimental results demonstrated quite effective detection with low overhead.

In 2008, Ye, Wang, Li, Ye, and Jiang discussed the failure of traditional signature-based anti-virus systems to detect polymorphic/metamorphic malware (Ye et


al., 2008). These systems were also blind to new and previously unseen malicious executables (Ye et al., 2008). The researchers argued that data mining techniques are needed to cope with large and growing collections of malware (Ye et al., 2008). Their research involved analyzing and developing profiles of the Windows APIs called by PE files. The researchers discussed the results of their Intelligent Malware Detection System (IMDS), which uses Objective-Oriented Association (OOA) mining based classification (Ye et al., 2008). IMDS consisted of three major modules: 1) a PE parser, 2) an OOA rule generator, and 3) a rule-based classifier. The system outperformed commercial software products such as Norton Anti-Virus and McAfee VirusScan (Ye et al., 2008). The approach also outperformed previous data mining based detection systems using Naive Bayes, Support Vector Machine (SVM) and Decision Tree techniques (Ye et al., 2008). The study demonstrated a solid approach for polymorphic and metamorphic malware detection.

In 2009, Rieck et al. presented the problems that malware variants (polymorphism) and obfuscation pose for detection at the file level (Rieck, Trinius, Willems, & Holz, 2009). The researchers acknowledged that the amount and diversity of malware variants render classic security defenses like anti-virus ineffective (Rieck et al., 2009). The research suggested that the sheer volume of attacks from malware, including viruses, worms and Trojans, makes detection more difficult by the day (Rieck et al., 2009). That volume, combined with obfuscation and polymorphism techniques, requires a new approach to detection. The researchers believed that a machine learning approach was needed to address the growing polymorphic malware problem. The research presented offers a framework for automatically identifying novel classes of malware with similar


behavior (clustering) and assigning unknown malware to these discovered classes

(classification) (Rieck et al., 2009). These techniques, clustering and classification, allowed for improved detection and an environment to process thousands of malware binaries. Further, clustering and classification provided improved discovery of novel malware variants.
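The two-step framework described by Rieck et al. — cluster known behavior to discover classes, then assign unknowns to the nearest class or flag them as novel — can be sketched as below. This is a simplification under stated assumptions: the centroid representation and the fixed novelty threshold are illustrative, and the actual system operated on much richer behavior reports.

```python
import numpy as np

def centroids(clusters):
    """Mean behavior profile per discovered cluster (label -> centroid)."""
    return {label: np.mean(vecs, axis=0) for label, vecs in clusters.items()}

def assign(sample, cents, novelty_threshold=1.0):
    """Assign an unknown sample to the nearest cluster centroid, or flag it
    as a potential novel class when it is far from all centroids
    (the threshold value is illustrative)."""
    label, dist = min(((l, np.linalg.norm(sample - c)) for l, c in cents.items()),
                      key=lambda t: t[1])
    return label if dist <= novelty_threshold else "novel"
```

Samples flagged "novel" would seed new clusters in the next clustering pass, which is how the framework discovers emerging malware classes over time.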

In 2009, Bayer et al., working in conjunction with the University of California, Santa Barbara and the University of Vienna, developed an automated environment to analyze malware samples in a controlled setting (U. Bayer, Milani-Comparetti, Hlauscheck, Kruegel, & Kirda, 2009). This automated environment produced reports detailing a program's actions during execution. These details were then used to analyze both benign and malware samples. The researchers then explored clustering techniques to identify samples based on execution behavior. The researchers admitted that previous clustering techniques did not scale and failed to generalize the observed behavior (U. Bayer et al., 2009). Their new approach proposed a scalable clustering method to identify and classify malware samples. The researchers performed dynamic analysis to capture execution traces of malware programs. Profiles for malware were generated from these execution traces in a generalized manner (U. Bayer et al., 2009). These profiles were then used as input to a scalable clustering algorithm capable of handling large datasets.

Dai, Guha and Lee (2009) provided unique insight into malware classification by analyzing the frequency distributions of unique call sequences for various types of malware. The researchers approached malware detection by extracting dynamic instruction sequences from malware through data mining techniques (Dai et al., 2009). They extracted runtime instruction sequences from unknown executables and


organized the instruction sequences into basic blocks in order to evaluate malware (Dai et al., 2009). The extraction techniques relied on instruction sequence patterns derived from instruction associations within the basic blocks (Dai et al., 2009). The researchers used a data mining process to perform feature extraction and feature selection and to build the classification model (Dai et al., 2009). This approach yielded an accurate, reliable and efficient predictive classification model for malware detection.

In 2010, Devesa et al. reported that a significant security threat exists due to the exponential growth in malware (viruses, Trojans or worms) (Devesa et al., 2010). The researchers defined malware as any kind of program explicitly designed to harm or disrupt computer system operations (Devesa et al., 2010). In order to mitigate the malware problem, all incoming code must be analyzed to classify the files as malware or benign software. The most common approach is to combine static and dynamic analysis techniques in order to extract execution properties of the unknown files (Devesa et al., 2010). However, given the escalation in polymorphic and metamorphic malware attacks, manually analyzing thousands of suspicious files each day would be futile (Devesa et al., 2010). The researchers proposed an emulation environment that provided secure and safe execution of suspicious code, and used the properties it extracted to classify samples with several machine-learning algorithms (Devesa et al., 2010). The study tested the proposed system with real malware samples. The initial results reported reliable detection with high performance on the malware sample set.

In 2010, Paulevé et al., working in France, applied Locality Sensitive Hashing (LSH) to dramatically improve the performance of pattern-recognition processing applied to malware detection (Paulevé, Jégou, & Amsaleg, 2010). The researchers viewed traditional search schemes as computationally expensive and poorly suited to real data sets (Paulevé et al., 2010). Although several extensions had been proposed to address the limitations, there had not been a comprehensive review of the algorithms or recommendations for "real world" data. The study conducted a comparison of several families of space hashing functions in a real-world environment (Paulevé et al., 2010). The comparison included random projections, lattice quantizers, k-means and hierarchical k-means. The findings demonstrated that the unstructured quantizer significantly improved the accuracy of LSH (Paulevé et al., 2010). The study also discussed previous findings concerning LSH and the merits and limitations of the proposed LSH approach.
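The random-projection family compared in that study can be sketched briefly: each random hyperplane contributes one bit of the hash, set by which side of the hyperplane the feature vector falls on, so nearby vectors tend to land in the same bucket. The dimensions and number of hyperplanes below are illustrative, not values from the study.

```python
import numpy as np

def lsh_signature(vec, planes):
    """Random-projection LSH: one bit per hyperplane, set by the sign of
    the projection of the feature vector onto that hyperplane's normal."""
    return tuple(bool(b) for b in (planes @ vec) > 0)

# 16 random hyperplanes for 8-dimensional feature vectors (illustrative sizes).
rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 8))
```

Because only the sign of each projection matters, vectors that differ only in magnitude hash identically, and the Hamming distance between signatures approximates angular distance between vectors.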

In 2011, Bailey et al. presented a study highlighting the ineffectiveness of commercial Anti-Virus (AV) products. The researchers used a large collection of malware spanning a variety of attack vectors (e.g., spyware, worms, spam) to demonstrate the ineffectiveness of AV in an operational environment (Bailey, Oberheide, & Andersen, 2011). Further, the study showed that AV products mischaracterized malware and failed in large part to semantically define malware or classes of malware. The authors proposed a new classification technique that describes malware behavior in terms of system state changes (e.g., files written, processes created) rather than in sequences or patterns of system calls (Bailey et al., 2011). The researchers developed a methodology to automatically categorize malware into classes based on behaviors (Bailey et al., 2011). The study also demonstrated that behavior-based clustering provided a more direct and effective way to classify malware (Bailey et al., 2011).


Alam, Horspool and Traore (2013) extended malware research by examining dynamic binary obfuscation. Dynamic binary obfuscation is used by metamorphic malware to generate a new sequence of opcodes in memory in order to avoid detection (Alam et al., 2013). Polymorphic and metamorphic malware is very difficult to analyze manually because of its sophistication (Alam et al., 2013). The researchers developed an automated tool and language named MAIL (Malware Analysis Intermediate Language) to automate and optimize the analysis and detection process (Alam et al., 2013). MAIL provided an element of portability for building metamorphic malware analysis and detection tools through its use of annotated control flow graphs for pattern matching (Alam et al., 2013). The study yielded a detection rate for metamorphic malware of 93.92% with a low false positive rate of 3.02% (Alam et al., 2013). The study also discussed the re-use of functions from previously successful malware exploits, as well as the use of embedded compilers and obfuscation by metamorphic malware to evade detection (Alam et al., 2013).

In 2014, Tamersoy and Roundy presented research for detecting both polymorphic and metamorphic malware. The researchers stressed the need for new detection methods due to the increasing sophistication of malicious software (Tamersoy et al., 2014), calling for defensive techniques that are scalable and agile enough to keep pace with malware. The researchers proposed a new algorithm that identifies malicious executable files by associating reputational data with each file (Tamersoy et al., 2014). They constructed a large dataset from Norton Community Watch and performed analysis of non-infected/infected machines and the benign/malicious files resident on those machines. The researchers then identified relationships that exist between machines and


malware (Tamersoy et al., 2014). The resulting algorithm, "Aesop," leverages locality-sensitive hashing to measure the strength of these inter-file relationships (Tamersoy et al., 2014). These relationships were used to construct a bipartite graph from which a file's benign or malicious nature could be inferred (Tamersoy et al., 2014). The study produced impressive results, with a 99.61% true positive rate and a 0.01% false positive rate (Tamersoy et al., 2014).

In 2015, Mohaisen and Alrawi presented an approach to automatically detect metamorphic and polymorphic malware based on behavior exhibited during analysis. Automated Malware (AMAL) provides both analysis and labeling (classification and clustering) to improve malware detection (Mohaisen & Alrawi, 2015). AMAL consists of two sub-systems, AutoMal and MaLabel. AutoMal provides tools to collect detailed behavioral artifacts of file, memory, network and registry activity during execution (Mohaisen & Alrawi, 2015). MaLabel uses the data provided by AutoMal to train and build classifiers. AutoMal can also be used with unsupervised learning by leveraging multiple clustering algorithms for sample grouping (Mohaisen & Alrawi, 2015). The research achieved a precision rate of 99.6%. The researchers also discussed several benchmarks, cost estimates and measurements that highlight and support the merits and features of AMAL.

The foundational studies presented above were intended to provide a historical perspective and context for malware research. Advanced detection studies are provided below. Advanced detection studies provide deeper insight into designing, conducting and evaluating malware experimental research. Additional details regarding advanced detection studies referenced below are presented in Appendix A.


Research Methods

Malware research is typically conducted using quantitative research methods (Rossow et al., 2012). Quantitative studies have been used to observe and collect datasets for a wide array of experiments (Creswell, 2007). In many cases, quantitative malware experiments have been designed to improve detection models, study longitudinal behaviors and validate prior malware research (Rossow et al., 2012). This study was aligned with Rossow's (2012) three basic concerns for conducting malware research: 1) following a research methodology and being transparent about the study details, 2) developing an experimental framework that uses correct and representative data, and 3) taking due care that the experiment does not harm the research or its resources or contaminate the findings. Each of these concerns was addressed and discussed in detail later in the study's approach.

The experimental design and approach enabled this research to advance in stages. The staged approach enabled the researcher to conduct experiments and validate results at various checkpoints. This approach provided flexibility and allowed the feature set and clustering algorithms to be utilized and adjusted over time. The developed prototype environment was able to discover polymorphic malware using multiple datasets – training datasets and experimental datasets. The prototype environment was designed to inspect files, determine their malicious or benign attributes and compare results. The focus of this study was to identify and detect Windows-based polymorphic malware, as Windows has become a primary target for malware attacks and represents the largest population of end-points at the enterprise level. An illustration of the proposed environment is provided below.


This malware research followed the single-subject experimental approach outlined by Neuman and McCormick (1995). Single-subject experimental design has evolved over time to facilitate researchers' quests to answer different types of questions under varying experimental circumstances (Neuman & McCormick, 1995). The most common single-subject experimental designs include reversal designs, multiple-baseline designs and multi-element designs (Neuman & McCormick, 1995). The purpose of using the single-subject design for this experiment was to develop a protocol well suited to the malware research subject area. It was believed that malware research closely resembles medical research, in which single-subject designs are used to evaluate incremental treatments and measure results for drug treatment and patient care protocols.

Single-subject experimental design was used for this study as it paralleled baseline testing with incremental treatment testing. This study achieved the desired goal of improving malware detection given multidimensional topological data extracted from static analysis, dynamic analysis and file properties. The study further improved detection rates by adding weighted multidimensional features from these three sources and evaluating the resulting detection rates. This study was designed to answer two basic research questions:

1) Can detection rates be improved by increasing the quality and quantity of multidimensional features supplied to the advanced machine learning clustering algorithms from file properties, static and dynamic analysis?

2) Which of the advanced machine learning clustering algorithms performed best given the multidimensional features from file properties, static and dynamic analysis?


The single-subject design provided a standard and widely accepted protocol to answer the research questions posed for this study. The following basic single-subject protocol was observed throughout:

1. Establish baseline data – established a baseline for polymorphic malware detection, given a standard feature dataset and a standard clustering algorithm, through multiple measurements before any intervention. Current detection rates for polymorphic malware using advanced classifiers range from 68.75% to 81.25% (Amos et al., 2013). Three datasets were generated that allowed detection rates to exceed this baseline range.

2. Manipulate feature set – the standard feature dataset was augmented to evaluate whether detection rates improved after the intervention of features. A number of studies have examined dynamic analysis (Rieck, Trinius, Willems, & Holz, 2011), static analysis (Kerchen et al., 1990) and file properties (Subramanya & Lakshminarasimhan, 2001). This study examined three multidimensional datasets, each generated by combining known malware, known benign and unknown samples into a single feature dataset. Once the baseline was generated, the feature weighting was manipulated to see if detection rates would improve (Siddiqui, 2008).

3. Controlled procedures and environment – the developed prototype environment and datasets were controlled to ensure the extracted feature dataset remained static over time and the automated feature extraction process remained unaltered (Rossow et al.,

2012).


4. Standard measurement approaches – a standard measurement approach was completed after performing baseline detection measurements for known (malware and benign) and unknown samples using multidimensional topological data with baseline clustering. The study established baseline detection rates for each of the three clustering algorithms: 1) Advanced Ensemble Classification via Bootstrap Aggregating (Meta Bagging (MB)), 2) Instance Based k-Nearest Neighbor (IBk) and 3) Deep Learning Multi-Layer Perceptron (DLMLP). Standard measurements included Accuracy (ACC), the Correlation Coefficient (CC), True Positive Rate (a sensitivity measure), and False Positive Rate (a specificity measure). As specified by the standard protocol, these measurements were established as a permanent observational recording (Rossow et al., 2012).

5. Weighting of features – as part of this testing protocol the inputs were manipulated and assessed. Within a single dataset, features were weighted according to a structured methodology, then evaluated and assessed. Weightings of static features alone, of dynamic relative to static features, and of file properties relative to dynamic and static features were assessed to evaluate improved detection rates for polymorphic malware.

6. Capturing testing results - All testing data were captured in terms of ACC, CC,

True Positive Rate (TPR) and False Positive Rate (FPR).

7. Graphing results - All testing data were graphed and presented in terms of

ACC, CC, True Positive Rate (TPR) and False Positive Rate (FPR).

8. Evaluating results – results were collected after each test. Each test was analyzed with rigor to understand the ACC, CC, True Positive Rate (TPR) and False Positive Rate (FPR). The results of each test were evaluated, and changes to inputs were documented to understand detection rate improvement. Test results were generated after each test, and provisions were made to ensure that the results were accurately reported.

9. Test controls – the specific design for this study required that test controls remain in place from one test to the next, such that the test data, the interpretation of those data and the conclusions reached from testing were reliable and believable.
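The measures recorded throughout the protocol derive directly from confusion-matrix counts. The sketch below assumes the study's Correlation Coefficient (CC) is the Matthews correlation coefficient commonly reported by machine-learning toolkits; that interpretation is an assumption, and the counts shown in the test are illustrative.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, TPR (sensitivity) and FPR (1 - specificity) from confusion
    counts, plus a correlation coefficient (here the Matthews correlation
    coefficient -- an assumption about the CC the study reports)."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)          # fraction of malware correctly detected
    fpr = fp / (fp + tn)          # fraction of benign files falsely flagged
    mcc_den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"ACC": acc, "TPR": tpr, "FPR": fpr, "CC": mcc}
```

Recording these four numbers after every run, as steps 4 and 6–8 require, makes baseline and treatment conditions directly comparable.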

The experimental design and approach enabled the researcher to advance the study through a defined testing protocol. The testing protocol enabled the researcher to conduct baseline testing and perform experimental testing with the various clustering algorithms. It also provided a means to select and weight feature sets, conduct testing and document results for the various clustering algorithms, and it allowed the researcher to validate detection results in stages and evaluate effectiveness. The experimental research methodology was consistent with previous quantitative malware studies (Creswell, 2007). The study's methodology was constructed to improve detection models and validate test results prior to conducting additional research (Rossow et al., 2012). The research design utilized a widely accepted quantitative approach for conducting experimental studies, similar to those conducted in healthcare, drug trials and other medical studies (Rizvi & Nock, 2008).

Gaps in Current Literature

In both the foundational research and the advanced detection studies there exist a number of gaps for ongoing malware research. As the war against malware (Trojans, viruses, worms, APTs, etc.) continues to escalate and spiral into unknown dimensions, some perspective is needed to understand the threat landscape. It is important to recognize that malware attacks are causing businesses and governments to lose billions of dollars (McAfee Labs, 2015). Malware research is unfortunately not keeping pace with the adversary, and the rapid escalation of polymorphic and metamorphic malware has continued to evade today's detection techniques (Hansen, Mark, Larsen, Stevanovic, & Pedersen, 2016).

In reviewing nearly six-hundred (584) articles and studies, roughly one fifth (124)

of those studies conducted advanced polymorphic malware research. However,

polymorphic malware remains a major threat for ransomware and other types of malware

(Kaur, 2014). More research is needed for polymorphic malware as the threat has

become more advanced. Polymorphic malware is thought to represent at least half of the

new malware released annually (Symantec, 2014). In addition, polymorphic malware

growth in 2012 was 392% (Qu & Hughes, 2013) and greater than 500% in 2015 (Gostev

et al., 2016). Due to the changing nature of polymorphic malware, additional malware

research is needed to identify the evolving characteristics of the malware in order to

provide better and different detection approaches. New techniques are needed to identify and classify the evolving polymorphic families of malware (Kaur, 2014).

Another research gap identified is the dataset, or "samples," used for conducting malware research. There are many variations in collecting and testing malware datasets. Some of the major malware researchers from industry and academia continue to make samples available for consumption. Major industry contributors include McAfee, Symantec, Virus Total, etc. There are also a number of unaffiliated sites that offer malware for research, including contagio, Virus Sign, VxHeaven and others. The majority of reviewed studies collected and used malware from these community-of-interest sites. In doing so, researchers leveraged known malware samples for detection experiments and were able to validate new detection techniques with "live" malware samples. However, few studies mentioned the use of benign samples to validate detection. A majority (24 of 33) of the advanced malware detection studies referenced "labeled" test data – meaning the researchers knew the identity of all samples submitted. A minority of the advanced malware detection studies used both known malware and known benign samples to evaluate and validate detection rates. However, of the more than thirty advanced malware studies reviewed, not a single one evaluated a dataset containing unknown samples. Several leading researchers suggest that using unknown samples would strengthen a study and offer additional validation for detection techniques (Hansen et al., 2016). It is well understood that assembling a malware dataset for research purposes is difficult, and finding additional benign and unknown samples makes the task of assembling a rich dataset much harder. As Hansen (2016) points out, there are significant benefits to constructing datasets that contain known malware, known benign and unknown samples.

Due to the polymorphic and metamorphic techniques utilized by malware developers, the number of unique malware files has exploded (Tamersoy et al., 2014). It is the sheer number of these unique files that renders signature detection essentially useless (Fraley & Figueroa, 2016). However, there exists some number of unique features that can identify "families" of malware (Wüchner et al., 2014). The challenge for improving malware detection is to find the right combinations that lead to better results (Wüchner et al., 2014). Malware researchers have been utilizing varying techniques to identify "clusters" of related malware families and use the proximity of the clusters for detection (Tamersoy et al., 2014). Several researchers have suggested that files can be clustered based on the results of static or dynamic analysis (Hansen et al., 2016; Kaur, 2014; Qu & Hughes, 2013). Other researchers have demonstrated that unique file properties and executable location may be used to detect malware (Mohaisen & Alrawi, 2015). Taking feature extraction outputs from both static and dynamic analysis would allow the construction of a rich dataset. In addition to static and dynamic analysis features, values such as file characteristics, environmental properties and file relationships could be used for additional analysis (Mohaisen & Alrawi, 2015). Further, advanced algorithms such as Meta Bagging (MB), Instance Based k-Nearest Neighbor (IBk) and Deep Learning Multi-Layer Perceptron (DLMLP) use proximity clustering to understand file-property affinity and maliciousness (Yedidia et al., 2003). Thus, if one file with certain features is known to be malicious, like files with like features can be associated with it. Advanced clustering techniques such as General Belief Propagation, Loopy Belief Propagation, Dynamic Bayesian Networks, Hidden Markov Models (HMM), and Markov Random Fields (MRF) could be examined and evaluated using the rich dataset offered by combining the results of static and dynamic analysis with file properties. These techniques have been examined by previous studies and should prove a solid approach for future study. The benefit of advanced clustering algorithms is that they may provide additional fidelity for analyzing feature clusters.


In summary, there are a number of gaps in current literature. The first issue is that there is not sufficient research on polymorphic malware, which continues to grow at rates that far exceed the ability of industry and academia to keep pace. The second issue concerns the datasets used for research. The datasets should have representation from all three types of files: 1) known malware, 2) known benign and 3) unknown samples. Constructing datasets in this manner should yield more "real world" detection results. Lastly, many of the malware studies have utilized advanced algorithms based on either static or dynamic analysis, but none of the studies reviewed combined the features of both types of analysis in order to improve detection. Combining the feature datasets and using advanced clustering algorithms should provide improved polymorphic malware detection.
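The multidimensional dataset argued for above can be sketched as a namespaced merge of the three feature blocks, with an optional per-block weight for the structured weighting step. The feature names, values and weights below are hypothetical, chosen only to illustrate the construction.

```python
def combine_features(static_f, dynamic_f, file_props, weights=(1.0, 1.0, 1.0)):
    """Merge static-analysis, dynamic-analysis and file-property features
    into one namespaced vector so keys never collide, optionally scaling
    each block by a weight (all names and weights are illustrative)."""
    merged = {}
    blocks = (("static", static_f), ("dyn", dynamic_f), ("file", file_props))
    for (prefix, feats), w in zip(blocks, weights):
        for key, value in feats.items():
            merged[f"{prefix}:{key}"] = value * w
    return merged
```

Feeding vectors built this way to the clustering algorithms keeps each dimension's origin (static, dynamic or file property) visible, which makes the later per-block weighting experiments straightforward.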

Strengths and Weaknesses of Current Studies

The research reviewed for previous studies includes both foundational malware research (discussed in the previous section) and advanced detection studies, in order to establish a baseline for this study. It was important to establish a baseline of topics and emerging malware research, and to expand the understanding of contemporary experimental research leading to advanced malware detection. For the purposes of this discussion, the research referred to in this section is the Advanced Malware Detection Studies presented later in this chapter. Across over thirty contemporary malware detection studies spanning seven years (2010-2016), researchers presented unique malware detection and classification work using a number of experimental approaches. The majority of the studies presented an experimental framework for conducting the malware research. A number of the studies selected a sufficient sample size for conducting the research. Some of the better studies conducted malware research and thoroughly evaluated detection findings. These findings and results were analyzed to evaluate achieved detection rates in terms of True Positive and False Positive rates.

The thirty-three (33) studies presented in the Advanced Malware Detection Studies were reviewed and evaluated with the intent to leverage the positive aspects of that research for this study. These studies were reviewed and evaluated based on a number of factors. The strengths of these studies include: 1) most of the studies (32 of 33) performed the study through an experimental framework, 2) almost all of the studies (31 of 33) surveyed were focused on improving malware detection with "live" or real malware samples, 3) a majority of the studies (27 of 33) used dynamic analysis or static analysis to perform feature extraction for their study, 4) almost all of the studies (32 of 33) utilized advanced algorithms to aid or boost detection rates using features extracted from static or dynamic analysis, and 5) most of the studies (30 of 33) analyzed detection results with a quantitative evaluation of study results. In most cases, the studies evaluated findings with several statistical measures to assess detection effectiveness. These criteria are typically used as statistical quality measures with machine learning. The performance of the various algorithms can be evaluated with Accuracy (ACC), the Correlation Coefficient, the True Positive Rate (a sensitivity measure), and the False Positive Rate (a specificity measure).
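The measures named above can be computed directly from confusion-matrix counts. The following sketch is illustrative (not taken from any of the surveyed studies); the counts passed in are hypothetical:

```python
# Illustrative sketch: computing the statistical quality measures named
# above (ACC, TPR, FPR and the Matthews Correlation Coefficient) from
# confusion-matrix counts. The example counts are hypothetical.
import math

def detection_metrics(tp, fp, tn, fn):
    """Return Accuracy, True Positive Rate (sensitivity),
    False Positive Rate (complement of specificity) and the
    Matthews Correlation Coefficient for one detection experiment."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)              # sensitivity
    fpr = fp / (fp + tn)              # 1 - specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return acc, tpr, fpr, mcc

acc, tpr, fpr, mcc = detection_metrics(tp=950, fp=30, tn=970, fn=50)
print(f"ACC={acc:.3f} TPR={tpr:.3f} FPR={fpr:.3f} MCC={mcc:.3f}")
```

A high ACC with a low FPR is the combination the surveyed studies typically report; MCC is a common choice for the correlation coefficient on binary detection results.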

Upon deeper inspection, the studies collectively also had some weaknesses or issues in conducting the research. The weaknesses for the advanced malware studies include: 1) malware sample storage - there is little to no discussion regarding working with or securing "live" malware for the study or experiment, 2) sample size - the range of malware samples tested was very large (from a low of 90 samples to a high of 24 million samples), 3) feature selection - not a single study (0 of 33) combined the features from both static and dynamic analysis; many of the studies performed either static (14) or dynamic (13) analysis, 4) malware classification - the types of malware used for the experiment were not disclosed, or the study only worked with known malware; there were no unknown samples to further test detection rates from a blind study perspective, and 5) the use of training datasets for machine learning - there were varied and differing approaches to selecting, maintaining and using training datasets. The weaknesses outlined here should not discount any work conducted.

However, addressing these issues can only help to improve future research.

Similar Research Methods

Researchers who choose to pursue quantitative research methods seek to understand complex models and make knowledgeable claims regarding the subject matter (Creswell, 2007). Researchers develop and apply strategies of inquiry or develop evaluation methodologies to analyze complex models or to better understand phenomena (Creswell, 2007). Strategies of inquiry or evaluation methodologies assist with the overall research approach by focusing on experiments, quasi-experiments or correlational studies (Creswell, 2007). Quantitative research strategies design and execute complex experiments with many variables and specialize in the treatment of interdependent variables (Creswell, 2007). Quantitative studies seek to explain experiments through structural equation models that explain and prototype experimental outcomes (Creswell, 2007).


Malware research is typically conducted using quantitative research methods (Rossow et al., 2012). In the past, malware researchers have relied on quantitative experimental designs to understand how malicious code behaves during execution (Rossow et al., 2012). Quantitative studies have been used to observe and collect datasets for a wide array of malware experiments. In many cases, malware experiments were designed to improve detection models, study longitudinal behavior and validate prior research (Rossow et al., 2012). The previous advanced malware detection quantitative research methods have been reviewed and will be presented later in this section.

Malware research, similar to other types of research, needs to construct the appropriate framework for the experimental study. The research design must address a number of basic concerns. The first concern is the research methodology. According to Rossow (2012), the research methodology must be described in sufficient detail to allow other researchers to reproduce the approach. In addition, the methodology must be transparent with the study's details in order to fully share findings (Rossow et al., 2012). Secondly, the experimental framework must have the correct and representative dataset. Researchers must be careful with selecting and using datasets such that bias is not introduced into the experiment (Rossow et al., 2012). Lastly, the experiment must take due consideration to not harm others (Rossow et al., 2012). Malware research must be very careful, as these experiments deal with live samples, similar to medical virus research. An outbreak of malware can harm not only the experiment but also those resources around or supporting the research (Rossow et al., 2012).


The thirty-three (33) studies presented in the Advanced Malware Detection Studies section were reviewed to understand: 1) the research methodology, 2) the dataset and 3) the safeguards put in place in order to do no harm. Almost all of the studies (32 of 33) performed the study through a quantitative experimental framework. The one study that did not follow an experimental framework conducted an observational study. Almost all studies (31 of 33) performed malware detection research with "live" or real malware samples. A majority of the studies (27 of 33) used either dynamic or static analysis as part of the experiment to extract further data for analysis. Almost all of the studies (32 of 33) utilized advanced algorithms to aid or boost detection rates. However, the majority of the studies did not provide sufficient details to replicate the study. Most of the previous studies (30 of 33) analyzed detection results. In most cases, the findings were evaluated with several statistical measures to evaluate detection effectiveness.

Overall, the Advanced Malware Detection Studies provided the necessary framework and were fairly consistent with the experimental methodology and the use of representative datasets.

Advanced Malware Detection Studies

Over sixty (60) contemporary malware detection studies spanning seven years (2010-2016) were surveyed. These studies were surveyed and reviewed to understand the unique detection and classification approaches presented by the researchers. In order to address current malware detection research relevant to the proposed research, the studies were analyzed to understand:

1) research approach or framework for the study;

2) sample size – both labeled and unlabeled samples;


3) feature selection - based on static or dynamic analysis;

4) use of machine learning (supervised/unsupervised training) or advanced

algorithms used to support detection;

5) evaluation of detection results.

After reviewing the sixty studies, it became clear that a number of the studies had strengths and weaknesses. However, studies that had severe weaknesses needed to be eliminated from further review, as they provided little or suspect value. Studies that did not provide a solid research approach, a reasonable sample size or set, and an evaluation of detection results were eliminated from further review. In many cases, the evaluation of 1) research approach, 2) sample set and 3) evaluation of results was pass/fail. Based upon the criteria above, twenty-seven (27) of the studies were eliminated from further review. Some of the studies were eliminated because they lacked sufficient detail regarding the experimental research approach or how the experiment was conducted. Other studies were eliminated because they were unclear or vague about the malware samples used for the experiment. Some of the eliminated studies had contradictory discussions regarding the experimental research and the results. These studies either did not specify or did not directly address the research approach for conducting the malware study. Studies were also eliminated because the malware detection results were either presented at a high level or not evaluated. In many cases, the eliminated studies presented results but did not quantitatively present an evaluation of the detection results achieved. Lastly, studies were eliminated because of the malware sampling or how the malware was collected for the study. Only the studies that clearly specified the research framework, provided sufficient malware samples and evaluated detection results remained as part of the next phase of reviewing current research.

The remaining thirty-three (33) studies were then reviewed and data were extracted for evaluation purposes. These studies were reviewed and evaluated based on a number of factors:

1. Most of the studies (29 of 33) performed the study through an experimental research framework;

2. All remaining studies (33 of 33) surveyed were focused on improving malware detection with "live" malware samples;

3. A majority of the studies (27 of 33) used dynamic analysis or static analysis to perform feature extraction for their study;

4. Almost all of the studies (32 of 33) utilized advanced algorithms to help or boost detection rates; and

5. Most of the studies (30 of 33) analyzed detection results with a quantitative evaluation of study results.

The highlights for the remaining malware experimental studies are presented in Table 1 below.


Table 1. Advanced Detection Studies


Summary

Detecting polymorphic and metamorphic malware continues to be a challenge for the security community. A majority of today's security research is focused on developing enhanced detection using techniques that collect, study, and mitigate malicious code (Kolbitsch et al., 2009). However, new polymorphic malware and its detection-evading techniques render many of the current signature protections useless and therefore leave end-points unprotected (Rodríguez-Gómez et al., 2013). The speed at which polymorphic malware is advancing threatens enterprise computing and internet operations (Symantec Corporation, 2016). Detecting polymorphic, metamorphic and zero-day malware requires advanced detection techniques that provide rapid adaptation and scalability and produce low false positive rates (Borojerdi & Abadi, 2013). This study offers an attractive alternative for detecting polymorphic malware specifically targeting Windows operating systems. This study offers a single data set containing known malware, known benign and unknown samples, a feature extraction methodology, a prototype environment and a detection approach that far exceeded today's accepted baseline. The proposed methodology, experimental framework and experimental design are presented next.


Chapter 3

Methodology

Overview

The purpose of this study was to develop a quantitative experimental prototype using multidimensional topological data with machine learning and advanced clustering to provide improved detection for polymorphic malware. The study ultimately yielded a detection methodology for polymorphic malware with true positive detection rates well above the established baseline range of 68.75% to 81.25% (Amos et al., 2013). Quantitative methods were chosen in order to provide objective measurements for malware detection (Babbie, 2010). The statistical, mathematical, or numerical analysis of the data collected through the malware experiments was conducted using standard and accepted computational techniques (Babbie, 2010). Previous malware research highlighted the need for correct and representative use of datasets, proper experimental methodology design such that there is sufficient transparency to enable reproducibility, and safely conducting experiments such that no harm is caused to others while performing the research (Rossow, 2013). This study leveraged past research for selecting supervised and unsupervised machine learning training and experiments (Kaur, 2014).

Malware research relies mainly on observing or analyzing malicious code (H. Yin, Song, Egele, Kruegel, & Kirda, 2007). Malware research methods must be cautious in designing and conducting experiments that observe and analyze the execution of malware within a real or virtual environment (Rossow, 2013). Malware experiments must carefully and thoughtfully select datasets, detection models, and research methodologies in order to evaluate effectiveness (Hu, Wang, Li, Bai, & Jing, 2014). Malware experiments must also study the longitudinal behavior and validate various aspects of the research (Rossow, 2013). The experimental methodology utilized was mindful of these requirements and was carried out in a series of steps outlined below.

The experimental methodology had four components:

1) Secure dataset – the prototype protected the dataset (samples), the data (features extracted from samples), the experiment (cluster analysis) and experimental resources, with secure storage of malware, benign and unknown samples (over 2M total samples) (Rossow et al., 2012);

2) Automated malware feature extraction – the prototype system processed, analyzed, extracted and assembled multidimensional topological features from the sample datasets through automated means using both static and dynamic analysis (Tamersoy et al., 2014);

3) Supervised and unsupervised machine learning – the developed prototype used multidimensional topological features and applied advanced clustering algorithms to demonstrate improved detection rates with training and extended datasets (Kaur, 2014); and

4) Results validation – the developed prototype evaluated detection effectiveness using various accepted quantitative methods (Mohaisen & Alrawi, 2015).

This novel research developed a unique detection methodology using multidimensional topological features in a machine learning environment. The multidimensional topological features approach combined static and dynamic analysis in addition to unique file properties. Extensive research of over 500 studies over the last six years has not produced a single study that leverages multidimensional topological features using static and dynamic analysis with file properties. In addition, the developed prototype produced repeatable feature extraction for various malware binaries such that new sample datasets could be analyzed over time. The study analyzed and evaluated results with the rigorous quantitative evaluation measures described by Mohaisen and Alrawi (2015).

Research Design

In this study, the researcher sought to understand the role multidimensional topological features play in improved malware detection. The study employed a quantitative experimental approach using the Single-Case Experimental Design (SCED). SCED was chosen for this study as it enables the researcher to study small groups (in this case unique polymorphic malware) and generalize findings to a larger population (malware classes) (Neuman & McCormick, 1995). SCED provided a methodology for establishing a baseline and measuring interventions or changes over time (Neuman & McCormick, 1995). SCED was chosen because each polymorphic malware sample may be unique and can be treated as an individual subject for research purposes. Therefore, the features extracted through static and dynamic analysis from the individual malware become criteria for further study. Instead of using control groups, the Changing Criterion Design (CCD) was selected for this study, as it allows the researcher to add or subtract features to better understand and manipulate detection rates based on the adding/subtracting or weighting of features. SCED CCD was chosen because control groups would not be appropriate for this malware study. In drug and disease treatment protocol studies, control groups are commonly used to evaluate treatment over time. Detection is binary and will not improve or decline over time. The SCED CCD allowed the researcher to introduce incremental changes in terms of multidimensional topological features and measure those changes in terms of detection rates throughout the study (Neuman & McCormick, 1995). Parallel studies in both the educational and clinical arenas use the SCED research design with changing criterion design to understand behaviors, drugs and treatment for certain conditions (Neuman & McCormick, 1995). SCED CCD allows experimental controls to be demonstrated by a consistent shift in the rate of the targeted outcome with each successive test or measurement (Neuman & McCormick, 1995). In theory, each achieved criterion of a behavior can function as the baseline for the next targeted criterion for successive studies (Neuman & McCormick, 1995). Given that each malware sample may be unique and that unique features may improve detection, the design is appropriate.

Research Procedures

The study randomly selected and assigned three test datasets of 200,000 samples each. These datasets were constructed via a pseudo-random program that composed the datasets on a stratified sample basis (malware, benign and unknown). The test datasets maintained the following distribution profile on a percentage basis: malware (40%), benign (30%) and unknown (30%). The datasets were then used to conduct baseline and further testing with the three clustering algorithms (MB, IBk and DLMLP). The study ran a series of baseline tests with unweighted multidimensional topological features to determine baseline detection rates. Baseline detection rates were documented for each clustering algorithm. The datasets were used as part of the SCED CCD testing process for manipulating feature weighting for each clustering algorithm. After establishing the baseline for each algorithm, incremental tests were conducted with weighted multidimensional topological features to evaluate improved or decreased detection rates.
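The stratified, pseudo-random dataset composition described above (40% malware, 30% benign, 30% unknown) could be sketched as follows. This is a hypothetical reconstruction, not the study's actual program; the function name and seeding scheme are assumptions:

```python
# Hypothetical sketch of the stratified, pseudo-random dataset
# composition described above: 40% malware, 30% benign, 30% unknown.
import random

def build_test_dataset(malware, benign, unknown, size, seed=0):
    """Draw a stratified sample of `size` items keeping the
    40/30/30 distribution profile. Inputs are lists of sample IDs."""
    rng = random.Random(seed)          # seeded for reproducibility
    quotas = {"malware": int(size * 0.40),
              "benign":  int(size * 0.30),
              "unknown": int(size * 0.30)}
    pools = {"malware": malware, "benign": benign, "unknown": unknown}
    dataset = []
    for label, n in quotas.items():
        dataset.extend((label, s) for s in rng.sample(pools[label], n))
    rng.shuffle(dataset)               # mix the strata for testing order
    return dataset

ds = build_test_dataset(list(range(1000)), list(range(1000)),
                        list(range(1000)), size=100)
```

Seeding the generator makes each 200,000-sample dataset reproducible, which matters for re-running the SCED CCD weighting experiments against an identical baseline.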

The study used the established effective detection rate range for polymorphic malware of 68.75% to 81.25% (Amos, Turner, & White, 2013). Each of the three clustering algorithms produced detection rates well above 81.25%. Additional experiments were conducted to tune and document the effects of feature weighting on detection rates. Additional tests were used to better understand how the weighting of multidimensional topological features improved or reduced the effective detection rate.

Baseline Testing. Initial baseline experimental testing was conducted for each clustering algorithm using unweighted multidimensional topological features. These features were derived by extracting static analysis, dynamic analysis and file property features. Baseline testing enabled the researcher to establish a training dataset for each clustering algorithm. Baseline testing supported further investigation of cause and effect to enable the researcher to develop manipulation or intervention strategies (Neuman & McCormick, 1995).

Weighted Occurrence Testing. The first set of experimental tests utilized weighted multidimensional topological features derived from static analysis. These tests are similar to experiments presented by Bilar in 2007. Bilar (2007) performed a series of experiments with features weighted by frequency of occurrence, using static analysis of various malware samples. Bilar (2007) used the weighted features in a series of Support Vector Machine (SVM) experiments. This study performed both weighted feature and frequency of occurrence testing from static analysis using the three clustering algorithms (MB, IBk and DLMLP).
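A frequency-of-occurrence weighting of this kind can be sketched as below. This is an illustrative simplification, not Bilar's exact procedure: each opcode feature is weighted by its relative frequency in a sample's disassembly, and the example trace is hypothetical:

```python
# Illustrative sketch (not Bilar's exact procedure): weight each opcode
# feature by its relative frequency of occurrence in a sample's
# disassembly, yielding a weighted feature vector for a classifier.
from collections import Counter

def weighted_opcode_features(opcodes):
    """Map an opcode trace to {opcode: relative frequency} weights."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

trace = ["MOV", "MOV", "LEA", "XOR", "MOV", "ADD", "XOR", "MOV"]
weights = weighted_opcode_features(trace)
```

The resulting weight vector sums to 1.0 per sample, so samples of very different sizes remain comparable as classifier inputs.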


Dynamic-Static Occurrence Testing. The next set of experiments involved leveraging Bilar's weighted occurrence tests (derived from static analysis) together with features derived from dynamic analysis. Polymorphic malware can encrypt functions and have embedded library calls which can be hidden from static analysis (Hansen, Mark, Larsen, Stevanovic, & Pedersen, 2016). Other studies have observed that malware developers have padded or introduced "junk" code to bypass static analysis, thus making function call tracing very difficult (Jasiul, Śliwa, Gleba, & Szpyrka, 2014). Therefore, linking static and dynamic analysis may serve to enhance detection by linking the dynamic occurrence of functions to those function calls observed in static analysis. This study extended Bilar's weighted feature study (2007) by linking opcodes observed in an actual execution environment (dynamic analysis) with those observed through static analysis. The weighting of features was directly correlated to the frequency of occurrence in dynamic relative to static analysis. The experiments evaluated the detection results linking the static to dynamic features for all three advanced clustering algorithms (MB, IBk and DLMLP).

Extended Static-Dynamic Occurrence Testing. Polymorphic malware may call libraries or use functions available through the operating system, such as Microsoft's dynamic link libraries (DLLs) (Egele, Scholte, Kirda, & Kruegel, 2012). The intended use or abuse of such functions by malware can be linked to file properties of the malware itself (Hansen et al., 2016). Therefore, experiments conducted during this phase added various features from file properties to those of the dynamic-static occurrence tests performed previously. Previous research has shown that there exist a number of "metadata" properties that can be used for malware detection (Kamongi, Kotikela, Kavi, Gomathisankaran, & Singhal, 2013; A. Singh et al., 2012). This study found similar results. This study performed extended static to dynamic occurrence testing with all three advanced clustering algorithms (MB, IBk and DLMLP).

The research process evaluated and demonstrated effective detection rates for polymorphic malware. The detection rates for each algorithm exceeded the established range for polymorphic malware of 68.75% to 81.25% (Amos, Turner, & White, 2013). It was anticipated that each of the clustering algorithms with weighted features would deliver detection rates above the baseline test results. Additional weighting and experiments conducted did in fact deliver higher detection rates. The developed prototype environment used to extract and assemble test datasets is presented next.

Prototype Environment

In order to consistently, accurately and safely extract multidimensional topological feature information, the study developed an integrated system to perform automated feature extraction and analysis for all datasets. This integrated system delivered the capability to extract features for file properties, static analysis and dynamic analysis. The integrated system then constructed one dataset for each clustering algorithm. The feature extraction framework consisted of five modules: 1) a pre-process module that extracts and generates topological features based on static analysis of machine code and file characteristics, 2) a behavioral analysis module that extracts behavioral characteristics based on file execution (dynamic analysis), 3) a post-processing module that reviews results from the pre-process and behavioral modules, 4) an input file construction and submission module, and 5) a machine learning module that employs various clustering techniques specified at run-time. As with most signature-based or behavior-based detection mechanisms, careful attention was paid to false positive and false negative rates, which can serve to reduce overall detection effectiveness.
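The flow through the five modules can be sketched as a pipeline. Everything in this sketch is a hypothetical placeholder (function names, feature values and the classification rule are invented for illustration); each stub stands in for one of the modules described above:

```python
# Hypothetical orchestration of the five modules described above; each
# stage function is a placeholder standing in for the real module.
def preprocess(sample):            # module 1: static features + file props
    return {"md5": sample["md5"], "static": ["MOV", "XOR"],
            "file_props": {"size": sample["size"]}}

def behavioral_analysis(sample):   # module 2: dynamic (execution) features
    return {"syscalls": ["CreateFile", "RegSetValue"]}

def postprocess(static, dynamic):  # module 3: review and merge results
    return {**static, "dynamic": dynamic["syscalls"]}

def build_input_record(features):  # module 4: input file construction
    return features                # (e.g. a row destined for an ARFF file)

def classify(record):              # module 5: machine learning (toy rule)
    return "malware" if "RegSetValue" in record["dynamic"] else "benign"

def pipeline(sample):
    static = preprocess(sample)
    dynamic = behavioral_analysis(sample)
    record = build_input_record(postprocess(static, dynamic))
    return classify(record)

verdict = pipeline({"md5": "d41d8cd9", "size": 4096})
```

In the real system, module 5 is a trained clustering algorithm rather than a single-feature rule; the sketch only shows how the modules hand records to one another.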

Figure 1. System Overview – Advanced Malware Inspection Environment (pipeline stages: Data Acquisition, Pre-Processing, Static Analysis, Dynamic Analysis, Classification & Detection, and Peer Review)

Data Acquisition. The first step was to acquire the samples needed to conduct the study's experiments. This study needed to acquire three types of samples: 1) known malware, 2) known benign files and 3) unknown files. Known malware and benign files were acquired through industry partnerships and open source repositories available to malware researchers. However, this required more effort and time than anticipated. This study leveraged malware repositories from across industries and communities of interest. The malware samples were acquired through relationships with VirusTotal, VirusShare, VXHeaven and Contagio. There are a number of such malware repositories that contain polymorphic and metamorphic samples specifically for malware research. Symantec had previously agreed to share samples; however, due to personnel changes at Symantec this was no longer possible in time for the study. A majority of the malware and benign samples were acquired through VirusTotal and VirusShare. These repositories provided deep metadata and samples to track various types of malware, as well as other details such as unique files, cloned files and file identifiers. Working with VirusTotal and VirusShare, the researcher was able to obtain samples (malware, benign and unknown), metadata/labels and other key malware classification information. Unknown files were acquired through the same means and through other open source repositories for malware research. The unknown files were acquired from VirusShare, VXHeaven and Contagio.

Analysis and Feature Extraction. As discussed previously, static and dynamic analysis provide a means to observe and identify malicious behaviors and characteristics. Pre-processing of samples is needed to identify obfuscated malware variants. This step provides the ability to gather multidimensional topological data from malware samples. It extracts file properties and performs detailed feature extraction via static and dynamic analysis. The integrated prototype system confirmed the inventory of functions through static analysis and observed the execution of behaviors through dynamic analysis. While many families of malware obfuscate and encrypt modules, many commercial vendors have incorporated "malware-like" obfuscation techniques to protect intellectual property. Pre-processing was needed to ensure the dataset contained representative samples for all three data types. As expected, static and dynamic analysis provided sufficient details, features and characteristics for identifying samples. The dynamic malware features consisted of observed malicious behaviors, triggered behaviors and malicious characteristics.

The integrated prototype system produced an environment to run both static and dynamic analysis tools in order to extract multidimensional features from known and unknown samples. The integrated prototype used the static analysis tool IDA SDK v6.95 (IDAPro) to extract static features and topological information. An example of static analysis is shown in Fig. 3. The extraction software development kit (SDK) provides a means to submit samples and extract feature information. As discussed in the literature review (Bilar, 2007), previous research suggests that there exists a limited set of operations that may point to malware; operations such as MOV, ADD, LEA, SUB, AND, INC, OR, NEG, XOR, ASSIGN (XCHG), STACK (POP), and CONTROL_C (JMP) have been used for malware detection in other studies. Bilar (2007) observed that the total universe of opcodes in Microsoft 32-bit executables was 398. Upon deeper inspection, Bilar (2007) observed that non-executable opcodes account for 192 of the 398 opcodes. Further, Bilar (2007) found that malware accounted for 141 of the 398 opcodes. Analyzing the 141 malware opcodes further, Bilar (2007) found that 72 opcodes accounted for over 99% of the malware opcodes, 14 opcodes accounted for over 92%, and 5 opcodes accounted for over 65% of the malware opcodes found (Bilar, 2007). Thus, based on Bilar's research (2007), the number of opcodes needed for improved detection was somewhere between 14 and 72. The opcodes captured during feature extraction were used for the static feature component of the multidimensional topological data.


Fig. 3. Static Analysis - IDAPro
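Bilar's observation that a small opcode subset covers most occurrences suggests a simple selection rule: take the most frequent opcodes until a cumulative coverage threshold is met. The sketch below is illustrative only, with invented example frequencies:

```python
# Illustrative sketch: pick the smallest set of opcodes whose cumulative
# frequency reaches a coverage threshold, mirroring Bilar's observation
# that a handful of opcodes account for most malware opcode occurrences.
def top_opcodes_by_coverage(freq, threshold):
    """freq: {opcode: count}. Return opcodes (most frequent first)
    whose cumulative share of all occurrences >= threshold."""
    total = sum(freq.values())
    chosen, covered = [], 0
    for op, n in sorted(freq.items(), key=lambda kv: -kv[1]):
        chosen.append(op)
        covered += n
        if covered / total >= threshold:
            break
    return chosen

freq = {"MOV": 40, "PUSH": 20, "CALL": 15, "POP": 10,
        "XOR": 8, "LEA": 4, "JMP": 3}
core = top_opcodes_by_coverage(freq, 0.65)
```

Raising the threshold from 65% toward 99% grows the selected set, which is the trade-off behind choosing a feature count somewhere between Bilar's 14 and 72 opcodes.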

As discussed previously, dynamic analysis is needed to collect additional features during execution. Dynamic analysis was performed on the basis of system-wide quantitative data flows from previous research (Wüchner et al., 2014). System-wide data flow analysis allows the researcher to capture the actual processes executed and the context of the system calls within a specific time frame within an operating environment (Wüchner et al., 2014). System calls were captured and aggregated as part of the integrated prototype environment. Dynamic analysis features were captured for the purpose of populating the dynamic feature component of the multidimensional topological data. As previous researchers found, features such as File Size (FSize), File Type (Ftype), Malware Type (Malware_Type), Virtual Machine Aware (VM_Aware), Compiler Type (Comp_Type), Internal or External Libraries (Libraries) and Encryption Function (EncrypFunction) have proven in the past to be good indicators of malware during execution (Ulrich Bayer et al., 2010; Devesa et al., 2010; Wüchner et al., 2014). The dynamic analysis information was augmented with both the file and static feature information to form a complete data set for a single sample. An example output from an open source dynamic analysis tool, Cuckoo, is shown in Fig. 4. During the dynamic analysis step, a sample was placed into an operational virtual machine environment and system calls were captured. The dynamic analysis features were combined with file and static analysis data and stored into a single repository for the dataset being processed. Some programming/scripting enabled the automated data collection and dataset construction.

Fig. 4. Dynamic Analysis - Cuckoo
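One combined per-sample record, keyed by the sample's hash and merging file, static and dynamic features, could look like the sketch below. The field values and the `op_*` naming are invented for illustration; only the feature names (FSize, Ftype, VM_Aware, Comp_Type, Libraries, EncrypFunction) come from the text above:

```python
# Sketch of one combined multidimensional record as described above,
# keyed by the sample's MD5 hash; the values shown are illustrative.
import hashlib

def combine_features(binary, static_feats, dynamic_feats):
    """Merge file-property, static and dynamic features into one record."""
    md5 = hashlib.md5(binary).hexdigest()    # unique sample identifier
    record = {"md5": md5,
              "FSize": len(binary),          # file-property features
              "Ftype": "PE32"}               # placeholder file type
    record.update(static_feats)              # e.g. opcode frequencies
    record.update(dynamic_feats)             # e.g. VM_Aware, Libraries
    return record

rec = combine_features(b"MZ\x90\x00",
                       {"op_MOV": 0.42, "op_XOR": 0.07},
                       {"VM_Aware": False, "Comp_Type": "MSVC",
                        "Libraries": ["kernel32.dll"],
                        "EncrypFunction": True})
```

Keying every record on the MD5 hash is what later allows the three analysis outputs to be linked back together in the feature database.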

Binary Feature Extraction. The study utilized three types of multidimensional topological data for detection: 1) file features, 2) static analysis features, and 3) dynamic analysis features. The study treated input features as independent variables. The emphasis of the study was to improve detection rates for polymorphic malware. Currently, polymorphic detection rates using advanced classifiers range from 68.75% to 81.25% (Amos et al., 2013). These detection rates used Bayes and Multilayer Perceptron techniques to establish a baseline for effective detection of various types of polymorphic malware (Amos et al., 2013). Due to the polymorphic nature of the malware, the study treated the detection rate as the dependent variable while controlling for time (the control variable) for polymorphic malware samples collected from the malware analysis community. The independent variables (file features, static analysis features and dynamic analysis features) were used as a single input of features for the sample in question. In other words, the topological file properties/features extracted from the executable/sample, the static analysis features (the topological properties/features extracted from code analysis of the executable/sample) and the dynamic analysis features (the observational behavioral features extracted from the execution of the sample) were all used for detection purposes. The detection rate (dependent variable) was defined as the effective detection rate for malware in terms of Accuracy (number of correct assessments / number of all assessments); the control variable was defined as time (as polymorphic samples mutate, changes occur); and the intervening variable was defined as external calls/libraries.

Feature Aggregation and Detection. All information collected during file property, static and dynamic analysis was compiled into a single database. The researcher explored various tools and databases (e.g., Elasticsearch) to store extracted features from the various modules based on a unique identifier. Elasticsearch enabled the construction of linked features and enabled analysis to be conducted multiple times. Ultimately, the database (all features linked to the MD5 hash) was converted into the Attribute-Relation File Format (ARFF). The ARFF file was used as an input file for the Weka open source machine learning tool. Weka was used to perform cluster analysis on the ARFF input file for each dataset. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms were applied to the three datasets directly through a programming application interface (API) or a graphical user interface (GUI). Weka was used to perform a number of functions and to perform advanced analysis on the experimental datasets.

Weka was used to conduct additional analysis regarding feature isolation and hidden feature dependencies. Weka was used to advance the understanding of the relationships between the features for each of the three datasets. The available Weka algorithms were used to perform advanced functions such as data mining pre-processing, data classification, regression analysis, clustering, association rules, attribute selection and data visualization. An example of Weka is shown in Fig. 5.

Fig. 5. Weka User Interface

The extracted multidimensional topological features in the form of an ARFF file

were then submitted to Weka using the various clustering algorithms. Weka was used

to perform detection based on both supervised and unsupervised learning. A malicious

feature matrix was constructed and used to train the three clustering algorithms in

Weka. These algorithms were then used for detection to identify known malware,

benign and unknown files. These features were correlated with attributes in the Weka

machine learning environment. The study considered and evaluated the three cluster

algorithms to explore, weight and strengthen possible detection techniques. A number

of other algorithms were researched and explored in Weka. This study focused on MB, IBk and DLMLP, which were selected based on a number of factors including recent studies, new approaches and new techniques.

Other machine learning algorithms considered included Belief Propagation (BP), General Belief Propagation, Loopy Belief Propagation, Dynamic Bayesian Networks, Hidden Markov Models (HMM), and Markov Random Fields (MRF). It was believed that the combination of MB, IBk and DLMLP provided the study with a sufficiently broad set of algorithms for detection purposes. IBk provided advantages over LSH by leveraging connectivity-based and proximity-based clustering. DLMLP improved upon BP techniques. New research has found that artificial neural networks (ANN) can leverage belief propagation for clustering or classification but do so inefficiently (Gruber et al., 2017). ANNs build networks of neurons, share information between neurons and propagate results throughout the network using weights or beliefs (Gruber et al., 2017). DLMLP improves the ANN approach by creating localized networks to share "beliefs" (Gruber et al., 2017). The DLMLP approach builds localized or shallow nets and distributes beliefs locally (Gruber et al., 2017). These shallow networks work more efficiently because new "beliefs" do not need to be propagated throughout an entire network (Gruber et al., 2017).

In addition to IBk and DLMLP, an ensemble technique, Bootstrap Aggregating (Meta Bagging), was also evaluated. Meta Bagging may provide a model in which the ensemble voting can be done with equal weights or with weighted attributes for the various features extracted. This approach may allow the evaluation model to promote variance based on training data to improve malware detection. Once satisfied with the approach (through a series of techniques), the approach was tested against a larger dataset to evaluate its efficacy for detection.

Threats to Validity

Research that uses quantitative measures and experimental methods to explore hypothetical questions is categorized as quantitative research (Hoepfl, 1997).

Quantitative research emphasizes the measurement and analysis of causal relationships between variables (Denzin & Lincoln, 2006). Quantitative research uses numbers, charts and graphs to explain or describe the relationships that exist between variables, populations and the results of tests (Bogdan & Biklen, 1998). The quantitative research process allows the researcher to understand the problem, develop a methodology to study the problem and, if needed, generate hypotheses to be tested in terms of numbers that can be quantified and summarized (Bogdan & Biklen, 1998). Quantitative research utilizes a mathematical process to explore, analyze and explain findings (Charles, 1995). Generally, quantitative research "…supported by the positivist or scientific paradigm, leads us to regard the world as made up of observable, measurable facts" (Glesne & Peshkin, 1992, p. 6). Quantitative research attempts to segment and control the experimental environment such that the research can be generalized to wider populations and similar situations (Winter, 2000). Quantitative researchers must construct an experiment to study the problem and use standard measures if the research is to be useful (Crocker & Algina, 1986). In simple terms, quantitative research design must ensure the validity of the tests and measurements, with special attention to the replicability or repeatability of the experimental results (Crocker & Algina, 1986).

Internal Validity

Internal validity is concerned with the rigor of the study design (Crocker & Algina, 1986). Internal validity can be determined by the degree of control over the potential extraneous variables impacting the quantitative study (Crocker & Algina, 1986). Researchers must control issues such as confounding variables in order to minimize the potential for alternative explanations of results (Crocker & Algina, 1986). Researchers must also control the treatment of variables and testing inputs so that there is greater confidence that the effects observed are due to the research manipulation (Crocker & Algina, 1986). Campbell & Stanley (1963) have provided the research world with the most authoritative source for understanding threats to internal validity.

Threats to internal validity compromise the study's confidence in establishing that relationships exist between the independent and dependent variables (Morse, Barrett, Mayan, Olson, & Spiers, 2008). Campbell & Stanley (1963) identified eight threats to internal validity: history, maturation, testing, instrumentation, regression, selection, experimental mortality, and an interaction of threats. Each is described and addressed below.

History. History becomes a threat to validity when other factors impact the study,

subject or findings by the passage of time (Campbell & Stanley, 1963). For the purpose

of this study, the research conducted and the malware samples do not suffer from this

threat as malware remains in effect for long periods of time. History is not an internal

validity concern to be addressed for this study.

Maturation. Maturation becomes an issue when changes occur for subjects or

testing procedures and impact the outcomes or findings for the study (Campbell &

Stanley, 1963). History and maturation are both major concerns for longitudinal studies

(Campbell & Stanley, 1963). For the purpose of this study, the malware samples and

procedures will remain static throughout the study. Therefore, maturation is not an

internal validity concern to be addressed for this study.


Testing. The testing process may introduce issues with studies and test results because of repeated testing rather than the intervention or manipulation of variables by the researcher (Campbell & Stanley, 1963). Testing issues often occur as a result of researchers administering pretests and posttests. For the purpose of this study, the malware samples are processed through a single repeatable extraction process. The features are extracted and subjected to testing. The feature dataset remains static and is not influenced by repeated testing. Therefore, testing is not an internal validity concern to be addressed for this study.

Instrumentation. Instrumentation issues arise when instrument calibration or measurement protocols change and the results seemingly change due to measurement rather than to a true treatment effect (Campbell & Stanley, 1963). Measurement and evaluation protocols must remain constant over the course of all experiments (Campbell & Stanley, 1963). For the purpose of this study, all testing protocols and clustering algorithms used for detection will remain the same across all tests. Only feature sets will change as part of the experimental testing. Therefore, instrumentation is not an internal validity concern to be addressed for this study.

Regression. A regression threat occurs when subjects have been selected on the basis of previous test scores (low or high) (Campbell & Stanley, 1963). Issues with regression occur in studies where test scores move closer to the mean as a result of repeated testing (Campbell & Stanley, 1963). For the purpose of this study, repeated testing will have no effect on testing outcomes. Malware samples are subjected to a standard single extraction process. The features are extracted and subjected to testing. The feature dataset remains static and is not influenced by repeated testing. Therefore, regression is not an internal validity concern to be addressed for this study.

Selection. The selection threat is of utmost concern for many experimental studies (Campbell & Stanley, 1963). Studies that cannot randomly assign subjects to test groups or treatment groups have issues with measurement before and after treatment intervention (Campbell & Stanley, 1963). For the purpose of this study, each of the test groups has both random selection and random assignment. The test groups remain static throughout the study. The study design and testing protocol specifically address the internal validity issue of selection. Therefore, selection is not an internal validity concern to be addressed for this study.

Experimental mortality. The issue with experimental mortality occurs when testing groups suffer from attrition, withdrawals, or dropouts (Campbell & Stanley,

1963). Experimental mortality becomes problematic with ongoing testing by losing subjects from comparison groups subsequent to random selection (Campbell & Stanley,

1963). For the purpose of this study, each of the test groups will remain static after random selection and random assignment. The test groups will remain static and will not suffer from mortality throughout the study. Therefore, experimental mortality is not an internal validity concern to be addressed for this study.

Interaction of Threats. The final issue threatening internal validity is the interaction of the threats across the study (Campbell & Stanley, 1963). Most experimental studies are concerned with the interaction of selection and maturation (Campbell & Stanley, 1963). Internal validity can be impacted when subjects are selected and assigned to groups based on subject maturation (Campbell & Stanley, 1963). For the purpose of this study, random selection and random assignment to parallel groups address the interaction-of-threats issues directly and effectively. Random selection and random assignment address most internal validity issues except experimental mortality. Differential selection is controlled for this study, as random assignment creates groups that are equivalent with respect to known and unknown variables (Campbell & Stanley, 1963). Lastly, influences of maturation on other variables can be ruled out because of static and consistent test groups. Other issues considered for the interaction of threats include the reactive or interaction effect of testing subjects multiple times. Reactive validity issues concern pretesting procedures and the potential increase or decrease of a subject's sensitivity or responsiveness to the experiment (Lana, 1959; Willson & Putnam, 1982). Again, malware samples are subjected to a standard single extraction process. The features are extracted and the data is subjected to testing. The feature dataset remains static and is not influenced by repeated testing. Therefore, reactive issues are not an internal validity concern to be addressed for this study.

External Validity

Campbell and Stanley (1963) provided the foundational groundwork for examining issues related to external threats to validity. Smith & Glass (1987) classified threats to external validity into a number of related issues concerning 1) population validity, 2) ecological validity and 3) external validity of operations (Smith & Glass, 1987). Population validity and selection-treatment issues concern the sampling and selection process that impacts generalizability (Smith & Glass, 1987). The ecological validity issues concern experimenter effects, multiple-treatment interference, reactive arrangements, time and treatment interaction, and history and treatment interaction (Smith & Glass, 1987). Lastly, issues for external validity of operations concern specificity of variables and pretest sensitization (Smith & Glass, 1987). Each of these is discussed below.

Population validity. Population validity refers to the extent to which findings can

be generalized from the study sample to that of the larger target population and

potentially to subgroups within the larger target population (Smith & Glass, 1987).

Utilizing sufficiently large and random samples will increase the population validity of

results. For the purpose of this study, sample size far exceeds the minimum sample size

requirement. Further, random selection and random assignment reduce the concern for

this external validity threat. Unfortunately, population validity is a threat in virtually all

experimental studies. The collection of malware samples from various malware

collaboration sites may under-represent new or emerging malware. Most researchers are

forced to use samples from accessible populations. In most cases these samples represent the best group of participants available for a study. Therefore, the degree of

representativeness depends on how large the accessible population is relative to the target

population. In this study, the collection of samples exceeds one million and should be

sufficiently large to satisfy this concern.

Ecological validity. Ecological validity refers to the extent to which findings from one study can be generalized to another experimental setting (Rossow et al., 2012; Smith & Glass, 1987). The experimental setting must be described accurately and in sufficient detail to allow other researchers to understand the settings, conditions, variables, and contexts of the study (Rossow et al., 2012; Smith & Glass, 1987). Ecological validity thus represents the extent to which findings are independent of the setting or location in which the study was performed (Rossow et al., 2012; Smith & Glass, 1987). For the purpose of this study, ecological validity is not a concern. The environment, settings, hardware, and software (including versions) were disclosed as part of the study.

Temporal validity. This external validity issue refers to the ability for the

research and findings to be generalized across time (Smith & Glass, 1987). Researchers

conducting experimental research rarely mention temporal validity because most studies

are conducted within a relatively short period of time (Smith & Glass, 1987). For the

purpose of this study, the timeframe is sufficiently short such that issues with

experimental design and data collection are minimized. It should be noted that only malware and benign samples created within the past eighteen months were used for the study. Also, the single case experimental design was selected for this study as the protocol facilitates timely selection, testing and analysis of results. Therefore, temporal validity is not a concern for this study.

Multiple-treatment interference. This external validity issue occurs when the same research subjects are exposed to multiple interventions (Smith & Glass, 1987). Multiple-treatment interference also occurs when subjects are selected for multiple studies with multiple interventions (Smith & Glass, 1987). Thus, test results or outcomes may not be generalizable to the target population because of the sequence of interventions administered prior to or during the experiment (Smith & Glass, 1987). For the purpose of this study, the sample groups are each distinct with a population of 200,000. The sampling technique used is random selection without replacement across the three types of samples. The total sample size is over 2 million and, assuming the selector is not repeated, the probability of any sample being selected more than once is remote: malware (1:1 million), benign samples (1:750,000) and unknown (1:300,000). In addition, the single case experimental design selected for this study facilitates the manipulation of variables over time. However, the feature dataset remains static and is not influenced by repeated testing. Therefore, multiple-treatment interference is not an external validity concern to be addressed for this study.

Researcher bias. This external validity issue arises due to processes or

procedures the researcher introduces while conducting the experiment (Smith & Glass,

1987). Researcher bias is an external validity issue because the study's findings may be dependent upon the researcher and thus not generalizable to the intended target population (Smith & Glass, 1987). The more unique the researcher's process, protocol and procedures are, the more they interfere with findings (Smith & Glass, 1987). The single case experimental design (SCED) was specifically selected for this study as the testing protocol facilitates the manipulation of variables and testing over time. SCED is an accepted testing protocol for unique subjects such as polymorphic malware. The testing protocol and procedures are consistent across tests. The dataset remains static and is not influenced by repeated testing. Therefore, researcher bias is minimized for this study.

Reactive Arrangements. This external threat to validity occurs when subjects' responses change because they know they are being studied (Smith & Glass, 1987). Reactive arrangements reduce external validity because findings may become less generalizable when results have been compromised or interventions not properly measured (Smith & Glass, 1987). For the purpose of this study, certain malware samples are sophisticated enough to know that they are being analyzed. For example, advanced malware will check environmental conditions prior to executing in a virtual environment such as that used for dynamic analysis. In this case, the additional static analysis and file property analysis provide the missing pieces as part of the standard feature extraction process. Therefore, reactive arrangements have been dealt with as part of the feature extraction process and are not an external validity concern to be addressed for this study.

Order bias. This external validity issue occurs when observed findings become dependent upon the order in which multiple interventions are administered (Smith & Glass, 1987). In this case, findings resulting from administering interventions in a particular order may not be confidently generalized to situations in which the interventions are applied differently (Smith & Glass, 1987). The single case experimental design was specifically selected for this study as the testing protocol facilitates performing various tests in a structured protocol. However, the dataset, clustering algorithms and testing procedures are not structured such that order dictates findings or outcomes. Therefore, order bias is not a concern for this study.

Matching bias. This external validity issue arises in cases where the subjects tested do not match those in the accessible population (Smith & Glass, 1987). Thus, matching bias is a threat to external validity to the extent that findings cannot be generalized to the general population (Smith & Glass, 1987). This issue is a concern when the subjects studied fall outside the sampling frame (Smith & Glass, 1987). For the purpose of this study, the samples collected and selected become the subjects studied. Therefore, the study findings will directly correlate to the sampling frame, which reduces the concern for this external validity threat.


Specificity of variables. This issue is a threat to external validity in almost every study. Researchers attempt to define and control a number of variables within the experiment in order to generalize findings (Smith & Glass, 1987). However, the number of variables that impact generalizability is large and includes 1) a specific type of individual, 2) at a specific time, 3) at a specific location, 4) under a specific set of circumstances, 5) based on a specific operational definition of the independent variable, 6) using specific dependent variables, and 7) using specific instruments to measure all the variables. The more unique an experiment's participants, testing time frames, context, testing conditions, and other variables are, the less likely the experiment will produce generalizable findings. In order to counter threats from specificity of variables, the researcher must operationally define variables, be able to represent the contextual aspects of the study setting and use extreme caution when generalizing findings. The single case experimental design was specifically selected for this study as the testing protocol facilitates the study setting, baseline testing, manipulation of variables and production of findings over time. SCED is an accepted testing protocol for unique subjects such as polymorphic malware. The setting, setting variables, testing protocol and procedures are consistent with each test. Therefore, the variables are held constant and the threat posed by specificity of variables is minimized for this study.

Treatment diffusion. This external validity threat arises as subjects are exposed to multiple interventions and these manipulations become diffused or impact other treatments, threatening the researcher's ability to generalize findings (Smith & Glass, 1987). In essence, the manipulation of testing variables contaminates one or more of the treatment conditions and prevents the study from being replicated in the future (Smith & Glass, 1987). SCED is an accepted testing protocol for manipulating treatments in experiments. The establishment of baselines, application of treatments and measurement of the impact of treatments is consistent across tests. Therefore, the external validity threat of treatment diffusion is minimized for this study.

Pretest x treatment interaction. This external validity threat arises when the

administering of pretests changes the participants’ responsiveness (sensitivity increases

or decreases) and renders the observed findings for the pretest group generalizable to

only that group and not to the untested group or larger population (Smith & Glass, 1987).

In order to produce generalizable pretest findings, researchers must carefully choose the pretest conditions and treatments, the characteristics of the research participants, the duration of the study, and the nature of the independent and dependent variables. SCED was specifically selected for this study as the testing protocol facilitates performing baseline testing (pretests) in a structured protocol. The proposed baseline tests will not influence further testing after intervention or manipulation. The baseline tests are quantitative in nature, not attitudinal, so future influence from the pretests is non-existent. Therefore, pretest x treatment interaction is not a concern for this study.

Selection x treatment interaction. This external validity threat arises when treatment groups differ from non-treatment groups and thus findings cannot be generalized to the larger population (Smith & Glass, 1987). The selection-treatment interaction threat occurs more often when randomization is not used for selecting intervention and test groups (Smith & Glass, 1987). However, this external threat to validity can still prevail when randomization is used, as randomization does not guarantee that the group selected is representative of the target population (Smith & Glass, 1987). For the purpose of this study, there are a number of groups selected for testing from a sample population. The sample set is large (over 1 million samples) and each of the individual test groups far exceeds the required minimum sample size. Therefore, the likelihood that the population studied does not represent the larger population is remote, and this is not an external validity concern for this study.

An experiment is deemed to be valid as long as relationships can be established and the results possess internal and external validity (Campbell & Stanley, 1963). The proposed experiment has addressed internal validity concerns from an experimental design, sampling and replication perspective. Internal validity is maintained when conditions are changed or manipulated and the results are measured or observed (Campbell & Stanley, 1963). Thus, internal validity provides assurance that results are in fact due to direct treatment and not to other circumstances (Campbell & Stanley, 1963). The proposed research has also addressed internal and external validity concerns from a number of perspectives. The proposed study has a controlled approach and environment, and its experimental design addresses the generalizability of findings. Considerable thought has been given to how the results and findings can be extended to broader populations, groups, environments, and contexts outside the experimental setting (Rossow et al., 2012; Smith & Glass, 1987). Consequently, the proposed experimental design and approach should satisfy the issues raised for internal and external validity as defined by Campbell and Stanley (1963). A number of researchers have contributed to addressing threats to internal and external validity, not only by analyzing results but, more importantly, through quantitative research and experimental design (Smith & Glass, 1987). The proposed study kept internal and external validity in mind by selecting an accepted experimental design (SCED), designing the experiment with a testing protocol, adopting a sampling strategy and ensuring that any possible outside influences are controlled or mitigated, so that, to the greatest extent possible, concerns with validity have been addressed.

Sample

The sample used for this study is an unrestricted probability sample (a simple random or pseudo-random sample). The study conducted a random selection and random assignment of files for the three datasets. However, the dataset is considered to be a stratified dataset. The sample set (files) included three data types or strata on a percentage basis: 1) known bad samples – known malware (polymorphic/metamorphic), 2) known good samples – known benign samples and 3) unknown samples – files that were unknown to the researcher but were determined post-test. Although malware detection is a straightforward bi-partite problem (malware or non-malware), the study also included a "blind" dataset for validating detection results.

Known Malware Dataset. The study had a sample set of over 1 million polymorphic malware samples targeting 32-bit and 64-bit malicious executables. Although there are a number of malware types, this study grouped the distinct malware types into a single class – malware. A majority of the malware samples were acquired from Virus Total, Virus Sign, and VxHeaven. The samples used represented various malware types including backdoors, Trojans, bots, worms, and viruses. According to Symantec (2016), approximately 1 million new malware samples are released and discovered daily. In conducting a simple sizing analysis, a sample size of at least 666 yields a 99% confidence level and a 5% confidence interval for a population of one million. A total collection of 1,009,108 malware samples was acquired for the study. Therefore, the study dataset far exceeded the required number of malware samples needed for developing a valid sample size and representative dataset.
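The sizing analysis can be reproduced with Cochran's sample-size formula. This sketch assumes z ≈ 2.58 for the 99% confidence level and p = 0.5 (maximum variability), which yields the 666-sample minimum cited above, and shows that the finite-population correction for a population of one million is negligible.

```python
import math

def cochran_sample_size(z, p, e, population=None):
    """Cochran's minimum sample size, with an optional
    finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# 99% confidence (z ~= 2.58), 5% margin of error, p = 0.5
print(cochran_sample_size(2.58, 0.5, 0.05))                    # -> 666
print(cochran_sample_size(2.58, 0.5, 0.05, population=10**6))  # -> 666
```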

Known Benign Dataset. The known benign samples were collected from Virus

Total and Virus Sign. These files were tested as "clean" by the providers and were likely collected from clean installation disks and other known clean distribution sites.

Special attention was paid to collecting vendor software with valid digitally signed installation files. In conducting a simple sizing analysis, a sample size of at least 666 yields a 99% confidence level and a 5% confidence interval for a population of 1 million. A total collection of 756,322 known benign samples was acquired for the study.

Therefore, the study dataset far exceeded the required number of known benign samples needed for developing a valid sample size and representative dataset.

Unknown Dataset. The unknown samples were collected from Virus Total,

Virus Sign and VirusVx. A majority of these samples were collected from Virus Total.

Users can submit files to sites like Virus Total as unknown, and eventually these files are determined to be malware or benign; until that determination is made, they remain in an unknown status. Again, in conducting a simple sizing analysis, a sample size of at least 666 yields a 99% confidence level and a 5% confidence interval for a population of 1 million. A total collection of 748,976 unknown samples was acquired for the study. Therefore, the study dataset far exceeded the required number of unknown samples needed for developing a valid sample size and representative dataset.

In summary, the dataset for the study was developed as a simple stratified random sample. The study performed a pseudo-random selection via a Python script for selecting and generating datasets on a percentage basis. Random assignment was made from a stratified sample of three subsets: 1) known malware (40%), 2) known benign (30%) and 3) unknown (30%). The study's sample size far exceeded the 666 samples required for a confidence level of 99%. The three data subsets had the following sample sizes: 1) malware – over 1 million samples, 2) known benign – over 750,000 samples and 3) unknown – just under 750,000 samples.
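A stratified pseudo-random selection of this kind can be sketched as follows. The 40/30/30 split follows the description above, while the seed, helper name and miniature sample pools are illustrative rather than the study's actual script.

```python
import random

def build_dataset(strata, proportions, total, seed=42):
    """Draw a stratified pseudo-random sample without replacement.

    strata: dict mapping stratum label -> list of sample identifiers
    proportions: dict mapping stratum label -> fraction of the dataset
    """
    rng = random.Random(seed)  # seeded for a reproducible pseudo-random draw
    dataset = []
    for label, fraction in proportions.items():
        k = round(total * fraction)
        dataset.extend((label, s) for s in rng.sample(strata[label], k))
    rng.shuffle(dataset)
    return dataset

# Miniature pools standing in for the million-sample collections
strata = {
    "malware": [f"mal_{i}" for i in range(1000)],
    "benign": [f"ben_{i}" for i in range(1000)],
    "unknown": [f"unk_{i}" for i in range(1000)],
}
sample = build_dataset(strata, {"malware": 0.4, "benign": 0.3, "unknown": 0.3}, 100)
```

Sampling without replacement within each stratum mirrors the study's requirement that no file appear twice in a generated dataset.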

Data Analysis

Data analysis was conducted throughout the experiment. The focus of the

experiments was to evaluate detection rates given the various features, weighting of the

features and the advanced cluster algorithms leveraged during testing. There were a

number of data analysis tools that were used to evaluate detection performance.

Sensitivity and specificity were two statistical measures that were used to evaluate

malware detection performance (Kolter & Maloof, 2006). Sensitivity is the statistical

measure used to understand and evaluate the proportion of True Positives in an

experiment (Kolter & Maloof, 2006). The True Positive Rate (TPR) represents the

proportion of correctly identified malware samples within a given test. Specificity is another important statistical measure and is used to represent the True Negative Rate

(TNR) (Kolter & Maloof, 2006). Specificity measures the proportion of true negatives

(non-malware) correctly identified within an experiment (Kolter & Maloof, 2006). Together, the two measures provide insight into detection rates from a true positive rate (sensitivity) and true negative rate (specificity) perspective. For any experiment, there is usually a trade-off between the two measures in terms of the usability of results. Too many false positives or false negatives can erode confidence in the detection approach.
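These measures reduce to simple ratios over confusion-matrix counts. A minimal sketch with illustrative counts (not the study's results):

```python
def detection_metrics(tp, tn, fp, fn):
    """Sensitivity (TPR), specificity (TNR) and accuracy from
    confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # proportion of malware detected
    specificity = tn / (tn + fp)                 # proportion of benign detected
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct / all assessments
    return sensitivity, specificity, accuracy

# Illustrative counts only
sens, spec, acc = detection_metrics(tp=950, tn=930, fp=70, fn=50)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```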

Depending upon the detection goal, one measure may be more important than the other. In some cases, false positives (false alarms) significantly impact detection rates and experimental outcomes. On the other hand, false negatives (non-detections) allow researchers to evaluate the characteristics and issues behind missed detections. In a perfect world, sensitivity and specificity would both be 100%; however, achieving 100% for both is usually not possible. Both measures can be graphed in order to analyze the trade-offs between sensitivity and specificity. The Receiver Operating Characteristic (ROC) curve depicts the relationship between sensitivity and specificity. An example ROC plot is provided below.

[Figure: ROC curves (True Positive Rate vs. False Positive Rate) for the Metabagging, IBk and DLMLP classifiers.]

Figure 2. ROC Results


In order to evaluate other aspects of detection results, other statistical measures can be used to evaluate detection effectiveness that go further than sensitivity and specificity. Accuracy (ACC) and the Correlation Coefficient (CC) are additional statistical quality measures used in machine learning to develop a deeper understanding of detection rates (Kolter & Maloof, 2006). In order to gain additional insight into detection rates with Accuracy and Correlation Coefficient, the study produced the results in a confusion table format. The confusion table allows the results to be fully examined. An example of the confusion table is provided below.

Table 2 Confusion Matrix

            TRUE ACCEPT       FALSE REJECT
TRUE (T)    True Positive     True Negative
FALSE (F)   False Positive    False Negative

The confusion matrix presented in Table 2 provides a straightforward way to evaluate detection outcomes and was generated for each test. Each entry in the table was used to evaluate the effectiveness of the detection algorithm. As part of the confusion matrix, or truth table as it is also called, each of the rates was measured. Four entries in the confusion matrix were used to evaluate testing outcomes: 1) True Positive Rate, 2) True Negative Rate, 3) False Positive Rate and 4) False Negative Rate. The True Positive Rate (TPR) is defined as the number of malware samples correctly classified as malware; in other words, the TPR represents the proportion of positive instances of malware correctly detected. The True Negative Rate (TNR) represents the number of benign samples correctly detected as benign, i.e., the proportion of negative instances detected correctly. The False Positive Rate (FPR) represents the number of non-malware samples classified as malware; the FPR is, in essence, the proportion of negative instances incorrectly detected as positive (malware). The False Negative Rate (FNR) represents malware classified as benign; thus, the FNR is the proportion of positive instances wrongly classified as non-malware. The equations for each statistical measure are provided below:

TPR = TP / (TP + FN)

TNR = TN / (TN + FP)

FPR = FP / (FP + TN)

FNR = FN / (FN + TP)

Once the confusion matrix had been fully populated, the additional statistical measures of

Accuracy and Correlation Coefficient were then applied. The equations for these two measures are provided below:

ACC = (TP + TN) / (TP + TN + FP + FN)

CC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
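The two measures can be sketched in Python directly from confusion-matrix counts; the CC here is computed as the standard Matthews-style coefficient, and the counts used in the demo are illustrative rather than the study's results:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Proportion of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def correlation_coefficient(tp, tn, fp, fn):
    """Matthews-style correlation between predicted and actual labels."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical confusion-matrix counts for illustration.
tp, tn, fp, fn = 940, 970, 30, 60
print(round(accuracy(tp, tn, fp, fn), 4))
print(round(correlation_coefficient(tp, tn, fp, fn), 4))
```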

The malware detection effective rate was analyzed based on the results from the testing for the three clustering algorithms. These measurements were largely concerned with correctly classified samples based on the statistical measures described above. The study used these measures to evaluate detection results based on the various datasets for detection purposes.

Additional analysis was performed using Sensitivity, Specificity, Accuracy and

Correlation Coefficient information for the various tests. Developing the confusion matrix provided a solid understanding of the results for each test. The various types of analysis were conducted to compare and evaluate detection rates across the three algorithms used during testing. Also, the results were evaluated using various sample sets, feature sets and weighting of features. There were three tests conducted per algorithm. The testing approach used a benchmark test with two other tests for weighted feature testing per algorithm. The formatting of test results for the various experiments is presented in the next section.

Data Formats for Results

Once the base data (TPR, TNR, FPR and FNR) had been captured, additional data

analysis was conducted using the ACC, CC, and ROC curve methodology. The detection

model had the possibility of two classes – positive (malware) or negative (benign). The

malware detection effective rate was analyzed using the results from the three algorithms in terms of correctly classified samples based on the statistical measures described above.

The study evaluated detection results based on mapping those results into a single

classification set (p, n): positive and negative. Given a confusion matrix, there are four possible outcomes: 1) detected as a positive and correctly classified, therefore a True Positive (TP), 2) detected as a negative and correctly classified, therefore a True Negative (TN), 3)


detected as a positive but should have been classified as a negative therefore this is a

False Positive (FP), and 4) detected as a negative but should have been classified as a

positive therefore this is a False Negative (FN). The confusion matrix illustrates the set

of instances in a two-by-two layout. The following example shows the various

rates achieved by clustering algorithm (MB, IBk and DLMLP).

Table 3 Example Confusion Matrix

MB          TRUE ACCEPT   FALSE REJECT
TRUE (T)    0.9630        0.0010
FALSE (F)   0.0031        0.0060

IBk         TRUE ACCEPT   FALSE REJECT
TRUE (T)    0.9780        0.0012
FALSE (F)   0.0080        0.0017

DLMLP       TRUE ACCEPT   FALSE REJECT
TRUE (T)    0.9970        0.0000
FALSE (F)   0.0000        0.0000

Further, the following example demonstrates the preliminary results for MB, IBk and DLMLP algorithms each with 20 unweighted features:

Table 4 Example Unweighted Testing Results

         True Positive   False Positive   True Negative   False Negative   ACC      CC
IBk      0.9630          0.0031           0.0010          0.0060           0.9741   0.0010
DLMLP    0.9780          0.0012           0.0080          0.0017           0.9971   0.0078
MB       0.9970          0.0000           0.0000          0.0000           0.9970   0.0000

From this data, a graph can be generated to demonstrate the effective rates achieved across the three clustering algorithms. The graph below presents the baseline testing and the achieved results across the three algorithms.


Figure 3. Comparative Baseline Testing Results.

Additionally, based on the results from the confusion matrix, the ROC graph can be constructed as a two-dimensional plot. The TP rate is plotted on the Y axis and the FP rate is plotted on the X axis. Overall, the ROC graph enables the visualization of relative trade-offs between benefits (true positives) and the associated detection costs (false positives) (Fawcett, 2006). The graph below demonstrates the ROC with the three classifiers (MB, IBk and DLMLP).
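Because each classifier contributes a single operating point in ROC space, the classifiers can be compared by distance to the ideal corner (FPR = 0, TPR = 1). The sketch below uses hypothetical (FPR, TPR) values, not the study's measurements:

```python
import math

# Hypothetical ROC operating points (fpr, tpr) for the three classifiers.
points = {"MB": (0.0031, 0.9630), "IBk": (0.0012, 0.9780), "DLMLP": (0.0000, 0.9970)}

def distance_to_ideal(fpr, tpr):
    """Euclidean distance to the perfect-classifier corner (FPR=0, TPR=1)."""
    return math.hypot(fpr, 1.0 - tpr)

# The smallest distance marks the classifier closest to perfect detection.
best = min(points, key=lambda name: distance_to_ideal(*points[name]))
print(best)  # DLMLP
```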


[Figure: ROC plot (True Positive Rate vs. False Positive Rate) for the three classifiers.]

Figure 4. ROC Analysis.

Data analysis provided a means to evaluate achieved results. Each experiment produced a series of data sets: Sensitivity, Specificity, Accuracy, Correlation Coefficient and ROC. These standard statistical measures provide deeper insight into the TPR, TNR, FPR and FNR produced by each experiment. The malware detection effective rate was analyzed from the results of the three clustering algorithms in terms of correctly classified samples based on the statistical measures described above.

Resource Requirements

The study had a number of resource requirements. The most critical resource for conducting the study was the dataset. A number of unaffiliated websites

(contagio, Virus Sign, VxHeaven, Virus Total, and others) contributed samples in order to achieve the 1 million malware sample target. Malware samples collected from the various websites were checked via the MD5 hash to avoid duplication. It is common for these websites to share and collaborate on samples. Malware researchers from industry and academia frequently find interesting samples and share findings across the community. In doing so, researchers leverage known malware samples for detection purposes and experiments.

The second issue was that of constructing a dataset with a sufficient number of non-malware (unknown and benign) samples. Fortunately, Virus Total provided a majority of the benign samples for the research. Virus Total was also able to provide unknown samples.

Both sample types – unknown and benign were needed in order to evaluate the algorithm detection effectiveness. However, these samples were identified by the SHA256 hash versus MD5. Sample Identifiers (SampleIDs) were then generated for tracking purposes.
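The de-duplication and identification step described above can be sketched with Python's standard hashlib; the sample data here is illustrative only:

```python
import hashlib

def sample_hashes(data: bytes):
    """Return (md5, sha256) hex digests used to identify and de-duplicate samples."""
    return hashlib.md5(data).hexdigest(), hashlib.sha256(data).hexdigest()

seen = set()

def register_sample(data: bytes):
    """Record a sample keyed by its SHA256 digest; returns False for duplicates."""
    _, sha = sample_hashes(data)
    if sha in seen:
        return False
    seen.add(sha)
    return True

print(register_sample(b"example payload"))  # True (first sighting)
print(register_sample(b"example payload"))  # False (duplicate)
```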

All samples were collected, uniquely identified, analyzed and screened. All samples were analyzed to ensure that they contained contemporary characteristics and behaviors, including those that protect intellectual property for commercial software. Duplicate samples were removed. These non-malware behaviors and characteristics are often mimicked by malware developers in order to bypass detection. In order to conduct a realistic evaluation, malware, benign and unknown samples were used for testing. Over 1 million benign samples were collected, analyzed and screened. Benign samples historically have been more difficult to obtain than malware samples. Of the 1 million benign samples collected, only roughly 75% passed the screening criteria. Over 0.5 million unknown samples were collected, analyzed and screened. Roughly 50% of the unknown samples passed the screening criteria. These samples were then used to create the basis for the generation of the datasets used for testing. The non-malware samples were then used to evaluate overall detection effectiveness (False Positives and False Negatives) during testing.


A related data set issue was simply the storage of the malware data set. Once the

various data sets had been collected, the malware and non-malware samples were stored in a secure and protected repository. The secure storage of the malware data set needed to

ensure non-detonation of samples within the experimental storage environment. The

malware samples were kept in a password-protected .zip file to prevent detonation or infection of the analysis platform/environment. Another issue was the size and number of malware samples. The objective of the research was to analyze millions of samples in order to evaluate efficacy and validate the proposed detection approach. This study acquired over 2 million total samples, which required over two terabytes of storage for malware and non-malware samples.

Lastly, the tools required to perform the analysis had issues. IDAPro did not

ultimately provide the necessary API for bulk analysis. Open Source tools such as

Capstone provided better programmatic access and were readily available. These tools provided static and dynamic analysis and were used to extract features from the sample repository. A number of modules were developed to extract the various static and

dynamic features. The seven feature extraction modules included: 1) Compiler Extractor

(Embedded Compiler, Compiler Type, Compiler Used), 2) Encryption Extractor

(Encrypted Indicator, Encrypted Type), 3) Library Call Extractor (Call Indicator, Call

Counter, Library Call), 4) Malware Extractor (Malware Indicator, Malware Type), 5)

Virtual Machine Aware Extractor (VM Aware Indicator), 6) OpCodes Extractor (MOV,

ADD, LEA, SUB, AND, INC, OR, NEG, XOR, XCHG, POP, and JMP) and 7) File

Properties Extractor (File Type and File Size). The dynamic analysis tool (Cuckoo) was leveraged for some feature extraction and detection. The tools used for these upgrades


did impact the schedule but did not affect other tools such as IDAPro and Weka. Managing the overall analysis environment was challenging but technically achievable.
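As an illustration of the simplest of these modules, a stand-in for the File Properties Extractor could sniff magic bytes for the file type and record the file size. This sketch is a simplification for illustration only, not the study's actual extractor:

```python
import os
import tempfile

# Magic-byte prefixes for file types commonly seen in malware corpora.
MAGIC = {b"MZ": "PE executable", b"\x7fELF": "ELF executable", b"PK\x03\x04": "ZIP archive"}

def file_properties(path):
    """Simplified File Properties Extractor: returns (file_type, file_size)."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    ftype = next((name for magic, name in MAGIC.items() if head.startswith(magic)), "unknown")
    return ftype, os.path.getsize(path)

# Demo on a synthetic 64-byte sample beginning with the PE 'MZ' magic bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"MZ\x90\x00" + b"\x00" * 60)
print(file_properties(tmp.name))  # ('PE executable', 64)
```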

The researcher had a mitigation plan or alternative plan for all issues raised. In terms of malware samples, the researcher had collected a majority of the malware samples. Virus Total was able to supply both benign and unknown samples. There were a number of blogs that offered malware samples as well. However, the researcher had a strong desire to work with mainstream malware research data sets. In corresponding with both Symantec and Virus Total, there was a high degree of confidence that at least one of the firms would deliver the number of benign samples needed. The researcher also found a number of malware research sites that provide “benign” samples as well.

Therefore, the non-malware data set was constructed largely using samples from Virus Total. Additional analysis was required to ensure that the samples represented the types of samples needed for the research; this additional analysis was factored into the effort. Mitigating the secure storage of malware involved a process that checked and changed file extensions to prevent file execution. In addition, shipments of malware were done via third-party compression software such as WinZip. Each set of Zip files transferred was also password protected in order to prevent file execution. In terms of meeting storage requirements for the project, a high performance 24 TB Network-Attached Storage (NAS) device was purchased to house both malware and non-malware data sets. The only remaining concern for storage was a backup for the experimental data set. The researcher did explore cloud-based backup solutions.

In order to construct a solid analysis environment, the researcher began with the latest releases of a number of static analysis tools (IDAPro, Capstone and objdump),


Open Source dynamic analysis tool (Cuckoo) and the Open Source machine learning tool

(Weka). The researcher had sufficient understanding of the computing requirements for the various tools as well as the underlying operating system compatibility issues. This approach enabled the researcher to begin assembling the analysis environment and provided a programmatic means to investigate integration issues with the latest releases. The assumption was that the latest releases would have a vibrant technical support network and that issues raised would be supported by the community of interest. Most of the other resources were in place or could be easily secured. A full description of the resources needed is provided below.

Table 5 Experimental Resource Requirements

Type       Resource           Description                               Possess?   Risk Assessment
Dataset    Malware Samples    Symantec Agreement                        No         Other sources are available
Hardware   Server             Dell Server i7 64-bit w/Linux, 16GB RAM   Yes        New server/no warranty
Hardware   Workstation        Toshiba i3 64-bit w/Windows 7             Yes        Current laptop/no warranty
Hardware   Storage            4 TB Storage                              No         To be purchased
Hardware   Network            Internet Connectivity                     Yes        WiFi enabled
Software   IDAPro             v6.6                                      Yes        Education version/limited version
Software   Volatility         v2.3                                      Yes        Education version/limited version
Software   Yara               v3.2                                      Yes        Researcher owned
Software   Weka               v3.7.9                                    Yes        Open Source
Software   Cuckoo             v1.1                                      Yes        Open Source
Software   Python             v2.7                                      Yes        Education version/limited version
Software   Microsoft Office   2010 or better                            Yes        Education version/limited version

Most of the resources needed fall into the software category. The researcher had developed risk mitigation for each item. The only issue that presented itself was Weka: during the experimental period, Weka had a major upgrade and release that took time to configure. Yara and Volatility are tools for analysis. Volatility is an advanced memory analysis framework for deeper inspection of memory. This software can be used to analyze memory dumps after malware has been detonated through dynamic analysis.


Yara is a tool that assists malware researchers to identify and classify malware samples.

Contingencies for most issues were to use or leverage academic or demonstration licenses offered by the software vendor. No issues arose during integration or testing.

Summary

This research developed a unique detection methodology using multidimensional topological features in a machine learning environment. The multidimensional topological features approach combined static and dynamic analysis in addition to unique

file properties. The study was designed to develop a quantitative experimental prototype

using multidimensional topological data with machine learning using advanced clustering that provided detection for polymorphic malware above 81.25%. The study utilized quantitative methods to provide objective

measurements for experimental results as described by Babbie (2010). The data collection, data analysis and presentation of experimental results were conducted using

standard and accepted computational techniques (Babbie, 2010). The study was designed with correctly assembled and representative datasets and a proper experimental methodology such that the experiments were reproducible and conducted safely so that the malware could not infect the research environment (Rossow, 2013). The experimental methodology was designed to establish, capture, analyze and evaluate results with the rigorous quantitative evaluation measures offered by Mohaisen and

Alrawi (2015).


Chapter 4

Results

Research Goals

The goal of this study was to develop an experimental prototype system to provide improved detection for polymorphic malware. Today’s effective detection rate for polymorphic malware detection ranges from 68.75% to 81.25% (Amos et al., 2013).

The prototype system developed performed the various feature extractions and assembled the datasets. This study conducted various tests within the SCED testing protocol. The test results were then quantitatively examined to evaluate detection rates for polymorphic malware. This study leveraged previous quantitative experiments in supervised machine learning for malware research (Boro et al., 2012; Pradesh, 2014) to better understand how a limited set of multidimensional topological information can be used for malware detection.

Review of the Methodology

The single-subject experimental design provides a standard framework and is a widely accepted protocol for the research questions posed in this study. The study followed the basic single-subject protocol throughout. The protocol consists of the following steps.

1. Establish baseline test data – established a testing baseline for detection given a standard feature dataset with a standard clustering algorithm for polymorphic malware, through multiple measurements before an intervention. The standard features included file properties, static analysis and dynamic analysis.


2. Manipulate feature set – the standard feature dataset was manipulated to evaluate whether detection rates improved or deteriorated after the intervention. The study examined three multidimensional datasets – one dedicated for each algorithm. The three datasets were generated by combining known malware, known benign and unknown samples into single feature dataset of 200,000 samples.

3. Controlled procedures and environment – the developed prototype environment and datasets were controlled to ensure the extracted feature dataset remained static over time and the automated feature extraction process remained unaltered (Rossow et al.,

2012).

4. Standard measurement approaches – a standard measurement approach for documenting and capturing baseline testing data and subsequent test results. Detection measurements were conducted in the same fashion for all tests. The study established baseline detection rates for each of the three clustering algorithms. Standard measurements were documented to evaluate Accuracy (ACC), Correlation Coefficient

(CC), True Positive Rate (sensitivity measure), and False Positive Rate (specificity measure). As specified by the standard protocol, these measurements were established as a permanent observational recording (Rossow et al., 2012).

5. Weighting of features – as part of this testing protocol the inputs were manipulated and assessed. Within a single dataset, features were removed based on an attribute selection algorithm (e.g. greedy stepwise) and then evaluated for each algorithm.

6. Capturing testing results - All test results were captured in terms of ACC, CC,

True Positive Rate (TPR) and False Positive Rate (FPR).


7. Graphing results - All test results were graphed and presented in terms of ACC,

CC, True Positive Rate (TPR) and False Positive Rate (FPR).

8. Evaluating results – All test results were evaluated after each test. Each test was analyzed with rigor to understand the ACC, CC, True Positive Rate (TPR) and False

Positive Rate (FPR).

9. Test controls – All tests were conducted using the required test controls. Test controls regulated the test environment, test data, and test results.

The experimental design and approach enabled the researcher to conduct experiments in a staged and systematic manner. The standard testing protocol provided a means to plan, prepare, execute, document and analyze the test results. The testing protocol allowed the researcher to document the detection results for the various testing

algorithms (Rossow et al., 2012). The results were collected in stages and results were evaluated for effectiveness at the end of each experiment (Rossow et al., 2012). The experimental research methodology was consistent with previous quantitative malware studies (Creswell, 2007). The study’s methodology provides a means to introduce new measures and systematically evaluate test results prior to conducting additional testing.

The quantitative experimental approach for conducting this study leveraged other testing techniques conducted in healthcare, drug trials and other medical studies (Rizvi & Nock,

2008).

Experimental Outcomes

The following test cases were executed as part of the SCED protocol for each

algorithm (MB, IBk and DLMLP).


Test Case 1

As with all test cases, the SCED testing protocol was used to structure and guide

the execution of the various experiments. Test Case 1 utilized Dataset 1 with the

Metabagging (MB) cluster algorithm. Dataset 1 was generated through a pseudo-random selection of samples representing a stratified sample. The stratified sample contained three sample types: malware, benign and unknown. Dataset 1 was selected and generated on a percentage basis: Malware (40%), Benign (30%) and Unknown (30%). Dataset 1 was automatically selected and generated from a 2.5 million sample population.

However, the generation of the final file required some manual intervention to produce the final Comma Separated Values (CSV) file. Once the CSV file was generated, Weka was used for data validation purposes and to convert the CSV file to an ARFF file.
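The pseudo-random stratified selection described above can be sketched as follows; the pool contents and sizes here are toy stand-ins for the actual sample repositories:

```python
import random

def stratified_sample(pools, proportions, total, seed=42):
    """Pseudo-randomly draw a stratified sample: one pool per class,
    sized by the requested proportions (e.g. 40/30/30)."""
    rng = random.Random(seed)
    picked = []
    for label, fraction in proportions.items():
        picked += rng.sample(pools[label], int(total * fraction))
    rng.shuffle(picked)
    return picked

# Toy pools standing in for the malware / benign / unknown sample IDs.
pools = {"malware": list(range(1000)),
         "benign": list(range(1000, 1800)),
         "unknown": list(range(1800, 2600))}
dataset = stratified_sample(pools, {"malware": 0.4, "benign": 0.3, "unknown": 0.3}, 100)
print(len(dataset))  # 100
```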

1. Establish baseline test data – Dataset 1 was validated by processing the features through Weka. Upon file import, it was discovered that Dataset 1 had several features/attributes mistyped. The fields are typed as part of validation upon import of the CSV file into Weka. A number of numeric features (typically counts) were incorrectly typed or imported as nominal values. In order to use these features for detection, these typing errors needed to be corrected. Errors introduced through automated parsing were fixed by searching for and eliminating leading spaces and/or special characters embedded in the field. These errors were removed from the entire dataset and all features


were checked for leading spaces and embedded special characters. Validation was

conducted on all 200,000 rows and 26 features. Once completed, the dataset was found

to be “clean” or validated. Dataset 1 was then saved into a controlled directory for

validated ARFF files. Dataset 1 was validated prior to the beginning of any baseline

testing. The 26 features were validated and included all attributes from file properties,

static analysis and dynamic analysis.
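The cleaning step described above (stripping leading spaces and embedded special characters so numeric count features are typed as numbers rather than nominal strings) can be sketched as:

```python
import re

def clean_field(value):
    """Strip stray non-printable characters and surrounding whitespace,
    then coerce the field to int or float when possible."""
    value = re.sub(r"[^\x20-\x7e]", "", value).strip()
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value

print(clean_field("  42"))     # 42 (int, not the nominal string '  42')
print(clean_field("\x0bMOV"))  # MOV
```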

2. Manipulate feature set – Once Dataset 1 was validated, it was used to produce three distinct datasets: Dataset 1-1, Dataset 1-2, and Dataset 1-3. These three datasets were used to test the Metabagging algorithm in various ways. Dataset 1-1 was used for MB Baseline experimental testing, Dataset 1-2 for MB Reduced Features and Dataset 1-3 for MB Information Gain. Dataset 1 was used as the basis for all MB training and test datasets.

Dataset 1 was then used to generate Dataset 1-1 (baseline – no manipulation).

Dataset 1-1 was used as a “full-feature” dataset and all 26 features were used with the

MB algorithm to establish a detection baseline. The baseline testing did not filter and did not use any feature analysis to improve any test results. Upon completion of baseline testing, the dataset was exposed to further feature analysis.

Dataset 1 was also used to generate Dataset 1-2 Reduced Features dataset.

Dataset 1-2 was generated by analyzing the features in Dataset 1 to better understand relationships and possible feature interdependencies. The full Dataset 1 – was analyzed and all 26 features were processed by the Greedy Stepwise algorithm to understand data relationships and interdependencies. This algorithm analyzes various features from the dataset based on analyzing relationships that exist between features. Previous studies


have analyzed and discovered that features or attributes can have relationships of various

types including irrelevant, weakly relevant or strongly relevant feature relationships

(John, Kohavi, & Pfleger, 1994). The output of the Greedy Stepwise algorithm suggested that the 26 original features within Dataset 1 could be reduced to 6 features/attributes.

The Greedy Stepwise algorithm produced a .793 Merit of Subset value. The features selected by this algorithm included: 1) File Size, 2) File Type, 3) Compiler Type, 4)

Library Calls, 5) Encryption, and the Op Code 6) XCHG. The next step was to reimport the entire Dataset 1 and remove extraneous features/attributes; this became Dataset 1-2.

Dataset 1-2 removed all features except for File Size, File Type, Compiler Type, Library

Calls, Encryption, and XCHG. Dataset 1-2 was then processed with the MB algorithm.
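Greedy stepwise forward selection can be sketched as below; the merit function here is a toy stand-in for Weka's subset-merit evaluator, and the feature names are drawn from the study for illustration only:

```python
def greedy_stepwise(features, merit):
    """Greedy forward selection: repeatedly add the feature that most improves
    the subset's merit score, stopping when no addition helps."""
    selected, best = [], merit([])
    while True:
        candidates = [(merit(selected + [f]), f) for f in features if f not in selected]
        if not candidates:
            break
        score, feat = max(candidates)
        if score <= best:
            break
        selected.append(feat)
        best = score
    return selected, best

# Toy merit function (hypothetical): rewards a specific pair of features,
# with a small penalty per feature to discourage large subsets.
def toy_merit(subset):
    return 0.5 * ("File Size" in subset) + 0.3 * ("XCHG" in subset) - 0.1 * len(subset)

subset, score = greedy_stepwise(["File Size", "File Type", "XCHG"], toy_merit)
print(subset, round(score, 2))  # ['File Size', 'XCHG'] 0.6
```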

The third and final dataset (Dataset 1-3) was analyzed utilizing another feature

analysis tool to evaluate features and relationships. The Information Gain algorithm

seeks to amplify certain features within a dataset to potentially improve model outcomes

(C. Lee & Lee, 2006). In this case, the intent was to use information gain to provide a

reduced feature set that potentially increases the detection rate. The information gain

algorithm was used to analyze the entire Dataset 1 – all 26 original features. The

information gain algorithm suggested that the top seven attributes/features included: 1)

Compiler Type (1.50537), 2) Malware Type (1.35976), 3) MOV (1.15211), 4) File Size

(1.07808), 5) Encrypt Type (0.89771), 6) File Type (0.88218), and 7) Library Calls

(0.79424). Dataset 1-3 used the full dataset and reduced the features to only the top

seven features to potentially improve detection results. Therefore, Dataset 1-3 used these

seven features for detection within MB. Further, Compiler Type could be used as a


leading detection indicator or attribute. Dataset 1-3 then used Compiler Type as a selected attribute and was then processed with the MB algorithm.
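Information gain (the entropy reduction obtained by splitting on a single feature) can be sketched as follows; the toy rows and labels are illustrative only:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature_index, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    on the given feature."""
    base = entropy(labels)
    splits = {}
    for value, label in zip((r[feature_index] for r in rows), labels):
        splits.setdefault(value, []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in splits.values())
    return base - remainder

# Toy data: the feature perfectly separates malware from benign here.
rows = [("gcc",), ("gcc",), ("msvc",), ("msvc",)]
labels = ["benign", "benign", "malware", "malware"]
print(information_gain(rows, 0, labels))  # 1.0
```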

After running various experiments with Datasets 1-1, 1-2 and 1-3, it was determined that the results being produced delivered higher than anticipated results.

After researching the issues, it was determined that the models were potentially experiencing training data saturation as noted by Zhu (2012). These higher than expected results produced were attributed to using a subset of Dataset 1 for training purposes. The initial training approach was to use a subset of the dataset for training the classifier with widely accepted 10 fold cross validation (Kohavi, 1995). This approach is a standard supervised learning practice and has been used many times to build the internal estimation training models within machine learning (Kohavi, 1995). Based on

the previous research (Zhu, 2012), a new training dataset was generated and MB testing

was repeated to evaluate testing outcomes.
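The 10-fold cross-validation referenced above partitions the data so that each sample is held out exactly once; a minimal index-splitting sketch:

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Split n sample indices into k roughly equal folds, as in standard
    10-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(200, k=10)
# Each pass holds one fold out for testing and trains on the other nine.
test_fold = folds[0]
train = [i for fold in folds if fold is not test_fold for i in fold]
print(len(train), len(test_fold))  # 180 20
```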

Previous research has shown that datasets too closely related can produce higher

than expected results (Zhu, 2012). Therefore, a new training dataset was generated

similar to Dataset 1. The training dataset was selected and generated in the same manner as Dataset 1. However, the training dataset was not allowed to have overlapping samples with Dataset 1. Based on a re-run of the MB Baseline test, the newly generated

training dataset seemed to correct the training set saturation issue. Based on the MB test

results, additional training datasets were generated in a pseudo-random stratified fashion

for use with the other two algorithms. Training Dataset 1 was used for MB testing.

Training Dataset 2 and Training Dataset 3 were generated and used for the other cluster

algorithm experiments.


3. Controlled procedures and environment – the developed prototype environment and datasets were controlled to ensure the extracted feature dataset remained static over

time and the automated feature extraction process remained unaltered (Rossow et al.,

2012). Dataset 1-1, Dataset1-2 and Dataset 1-3 remained unaltered during any and all

testing. Additional precautions were taken to ensure that the Datasets 1-2 and 1-3

remained static after removal of extraneous features. All datasets were exported in ARFF

format after validation and manipulation using naming standard conventions identifying

the dataset.

4. Standard measurement approaches – a standard measurement approach was

used to document and capture testing results. Test results for Baseline testing (Dataset 1-

1), Reduced Feature Selection (Dataset 1-2), and Data Amplification (Dataset 1-3) were

collected and captured. Detection measurements were conducted for all tests. Per the

testing protocol, the study established a baseline detection rate for MB. Subsequent tests

results were captured and analyzed. Results were analyzed using standard measurements

including Accuracy (ACC), Correlation Coefficient (CC), True Positive Rate (sensitivity

measure), and False Positive Rate (specificity measure). As specified by the standard

protocol, these measurements were captured as a permanent observational recording

(Rossow et al., 2012).

5. Weighting of features – as part of this testing protocol the inputs were

manipulated and assessed. Using the baseline dataset (Dataset 1-1), features were

removed based on an attribute selection algorithm (e.g. greedy stepwise) and then

evaluated for MB. Likewise, using the baseline dataset (Dataset 1-1), features were

removed based on the information gain for data amplification. The three tests for MB

included MB Baseline (Dataset 1-1), MB Reduced Feature Selection (Dataset 1-2), and

MB Data Amplification (Dataset 1-3).

6. Capturing testing results - All results were captured in terms of test and measurement. Each test (MB Baseline, MB Reduced Feature Selection, and MB Data

Amplification) was captured with the associated measurement - ACC, CC, True Positive

Rate (TPR) and False Positive Rate (FPR). The results are provided below:

Table 6 MB Baseline Classification Results

Experimental Results                MB Baseline (Dataset 1-1)
Correctly Classified Instances      198,866      99.433%
Incorrectly Classified Instances    1,134        0.567%

Table 7 MB Baseline Experimental Results

MB Baseline
Measurement                     Result
True Positive Rate (TPR)        0.99983
True Negative Rate (TNR)        0.99983
False Positive Rate (FPR)       0.00017
False Negative Rate (FNR)       0.00017
Accuracy (ACC)                  0.99983
Correlation Coefficient (CC)    0.00001


Table 8
MB Reduced Feature Selection Classification Results

Experimental Results: MB Reduced Feature Selection (Dataset 1-2)
Correctly Classified Instances        199,994    99.997%
Incorrectly Classified Instances            6     0.003%

Table 9
MB Reduced Feature Selection Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99991
True Negative Rate (TNR)          0.99995
False Positive Rate (FPR)         0.00005
False Negative Rate (FNR)         0.00009
Accuracy (ACC)                    0.99992
Correlation Coefficient (CC)      0.00001

Table 10
MB Data Amplification Classification Results

Experimental Results: MB Data Amplification (Dataset 1-3)
Correctly Classified Instances        199,997   99.9985%
Incorrectly Classified Instances            3    0.0015%


Table 11
MB Data Amplification Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99999
True Negative Rate (TNR)          0.99998
False Positive Rate (FPR)         0.00002
False Negative Rate (FNR)         0.00001
Accuracy (ACC)                    0.99998
Correlation Coefficient (CC)      0.00001

7. Graphing results - All test results were graphed and presented in terms of test (MB Baseline, MB Reduced Feature Selection, and MB Data Amplification) and measurement. True Positive Rate (TPR) and False Positive Rate (FPR) are shown below. Accuracy essentially followed the same trendline as TPR. CC was a constant 0.00001 for all three experiments (MB Benchmark, MB Reduced Features and MB Amplified Features).

Figure 5. MB Graph Results. Bar chart of True Positive Rate (TPR) and False Positive Rate (FPR) for the MB Benchmark, MB Reduced Features, and MB Amplified Features experiments.


8. Evaluating results – All test results were evaluated after each test. Each test was analyzed with rigor to understand the ACC, CC, True Positive Rate (TPR) and False

Positive Rate (FPR).

Table 12
MB Experimental Results

Measurement                       MB Baseline   MB Reduced          MB Data
                                                Feature Selection   Amplification
True Positive Rate (TPR)          0.99983       0.99991             0.99999
True Negative Rate (TNR)          0.99983       0.99995             0.99998
False Positive Rate (FPR)         0.00017       0.00005             0.00002
False Negative Rate (FNR)         0.00017       0.00009             0.00001
Accuracy (ACC)                    0.99983       0.99992             0.99998
Correlation Coefficient (CC)      0.00001       0.00001             0.00001

9. Test controls – All tests were conducted using the required test controls. Test

controls regulated the MB test environment, MB test data (Training Datasets 1-1, 1-2, 1-3

and Test Datasets 1-1, 1-2, 1-3), and test results (MB Baseline, MB Reduced Feature

Selection and MB Data Amplification).

Test Case 2

As with all test cases, the SCED testing protocol was used to structure the

experiment. Test Case 2 utilized Dataset 2 with the IBk Nearest Neighbor clustering

algorithm (IBk). The Instance Based k-Nearest Neighbor (IBk) algorithm was chosen to

replace LSH as the algorithm improves upon the “locality” aspects of the clustering

algorithm. The IBk algorithm is a k-nearest-neighbor classifier that utilizes a distance metric similar to the one used in the LSH algorithm. The calculated Euclidean distance function is used by IBk as a critical search parameter within the algorithm (Manikandan et al., 2016).

Dataset 2 was generated through a pseudo-random selection of samples representing all three types of samples: malware, benign and unknown. Dataset 2 was automatically generated based on the same random selection algorithm that generated Dataset 1. Dataset 2 was selected and generated on a percentage basis: Malware (40%), Benign (30%) and Unknown (30%). Dataset 2 was automatically selected and generated from a 2.5 million sample population. However, the generation of the final file required some manual intervention to produce the final Comma Separated Values (CSV) file. Once the CSV file was generated, Weka was used for data validation purposes and to convert the CSV file to an ARFF file. Once completed, the dataset was found to be “clean” or validated. Dataset 2 was then saved into a controlled directory for validated ARFF files. Dataset 2 was validated prior to the beginning of any baseline testing. The 26 features were validated and included all attributes from file properties, static analysis and dynamic analysis.
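The Euclidean-distance search at the core of IBk can be illustrated with a minimal k-nearest-neighbor sketch. This is not the Weka implementation, and the feature vectors and labels below are hypothetical placeholders, not instances from Dataset 2:

```python
# IBk-style instance-based classification: majority vote of the k training
# instances nearest to the query under the Euclidean distance.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ibk_classify(train, query, k=1):
    """train: list of (feature_vector, label); returns majority label of k nearest."""
    nearest = sorted(train, key=lambda inst: euclidean(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical instances: (file_size_kb, library_calls, encrypted_flag)
train = [((820.0, 14, 1), "malware"),
         ((64.0, 3, 0), "benign"),
         ((900.0, 16, 1), "malware"),
         ((72.0, 2, 0), "benign")]
print(ibk_classify(train, (850.0, 15, 1), k=3))  # malware
```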

1. Establish baseline test data – Like Dataset 1, Dataset 2 was validated by processing the features through Weka. As was seen in Dataset 1, several of the numeric features within Dataset 2 were typed incorrectly as nominal. These errors were highlighted upon importing the Dataset 2 CSV file into Weka. Data validation was conducted and data errors were corrected in exactly the same fashion as Dataset 1.

Again, these errors were introduced into the dataset through automated parsing during the extraction process. These errors were fixed through searching and eliminating leading spaces and/or special characters embedded in the field. These errors were removed from the entire dataset and all features were validated. Once completed, the dataset was found

148

to be “clean” or validated. Per the testing protocol, the dataset was validated prior to the

beginning of baseline testing. The features (26) included attributes from file properties,

static analysis and dynamic analysis. As noted above, a Training Dataset 2 was generated, validated and used as part of the classifier training protocol.

2. Manipulate feature set – The manipulation of the feature set produced three test datasets – Dataset 2-1, Dataset 2-2, and Dataset 2-3. Similarly, the training dataset was developed from the Training Dataset 2 for each test. After both datasets (training and test) were validated, Training Dataset 2-1 and Test Dataset 2-1 (baseline – no manipulation) were processed with all 26 features using the IBk algorithm to establish a detection baseline. The baseline testing did not filter the data and did not use any feature analysis to improve test results. Upon completion of baseline testing, the dataset was exposed to further feature analysis.

The second set of datasets – Training Dataset 2-2 and Test Dataset 2-2 – were generated by analyzing the features to better understand relationships and possible feature interdependencies. The full Dataset 2-1 (all 26 features) was analyzed with the Greedy Stepwise algorithm. The output of the Greedy Stepwise algorithm suggested that, of the 26 original features, the dataset could be reduced to 7 features/attributes. The Greedy Stepwise algorithm produced a merit of best subset of 0.896. The predictive attributes for this dataset included 1) File Size, 2) File Type, 3) Compiler, 4) Compiler Type, 5) Library Calls, 6) Encryption and the Op Code 7) XCHG. The next step was to reimport both the Training Dataset 2 and Test Dataset 2 and remove extraneous features/attributes.

The classifier training was conducted using Training Dataset 2-2. Experimental testing was conducted using Dataset 2-2. In both cases, Training Dataset 2-2 and Test Dataset 2-


2 removed all features except for File Size, File Type, Compiler, Compiler Type, Library

Calls, Encryption, and XCHG. Dataset 2-2 was then processed with the IBk algorithm.
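Greedy stepwise attribute selection of this kind can be sketched as forward selection over a merit function: repeatedly add the single feature that most improves the merit, and stop when no addition helps. The merit function below is a toy stand-in for Weka's subset evaluator, and the attribute scores are hypothetical:

```python
# Greedy stepwise forward selection over a subset-merit function.
def greedy_stepwise(features, merit):
    """features: iterable of names; merit: callable on a frozenset of names."""
    selected, best = frozenset(), merit(frozenset())
    while True:
        candidates = [(merit(selected | {f}), f) for f in features if f not in selected]
        if not candidates:
            break
        score, feat = max(candidates)
        if score <= best:
            break                      # no single addition improves the merit
        selected, best = selected | {feat}, score
    return set(selected), best

# Toy merit: hypothetical per-attribute contributions; redundant ones add nothing.
USEFUL = {"FileSize": 0.25, "CompilerType": 0.5, "XCHG": 0.125}
def toy_merit(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset)

subset, merit_val = greedy_stepwise(["FileSize", "CompilerType", "XCHG", "MOV"], toy_merit)
print(sorted(subset), merit_val)  # ['CompilerType', 'FileSize', 'XCHG'] 0.875
```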

The third and final dataset (Dataset 2-3) was analyzed utilizing another feature analysis tool to evaluate features and relationships for Dataset 2. The Information Gain algorithm seeks to amplify certain features within a dataset to potentially improve model outcomes (C. Lee & Lee, 2006). In this case, the intent was to use information gain to provide a reduced feature set that potentially increases the detection rate. The information gain algorithm was used to analyze the entire Dataset 2 – all 26 original features. The information gain algorithm

suggested that the top seven attributes/features included 1) Compiler Type (1.55744), 2)

Malware Type (1.40863), 3) XCHG (1.26408), 4) File Size (1.09264), 5) Encrypt Type

(0.88897), 6) File Type (0.86818), and 7) Library Calls (0.77946). Dataset 2-3 used the full dataset and reduced the features to only the top seven features to potentially improve detection results. Therefore, Dataset 2-3 used these seven features for detection within the IBk classifier. Further, Compiler Type could be used as a leading indicator attribute.

Dataset 2-3 then used Compiler Type with the IBk classifier to deliver experimental results.
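The information gain ranking above can be sketched as the reduction in class entropy after splitting on an attribute, IG(A) = H(class) - H(class | A). This is not the Weka InfoGainAttributeEval implementation, and the nominal data below is hypothetical:

```python
# Information gain of a nominal attribute with respect to the class label.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    groups = defaultdict(list)
    for v, y in zip(attr_values, labels):
        groups[v].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical samples: here the attribute perfectly separates the classes,
# so it receives the maximum possible gain (the class entropy, 1 bit).
labels   = ["malware", "malware", "benign", "benign"]
compiler = ["packerA", "packerA", "msvc", "msvc"]
print(info_gain(compiler, labels))  # 1.0
```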

3. Controlled procedures and environment – the developed prototype environment and datasets were controlled to ensure the extracted feature dataset remained static over

time and the automated feature extraction process remained unaltered (Rossow et al.,

2012). Training Datasets (Training Dataset 2-1, Training Dataset 2-2 and Training

Dataset 2-3) and Test Datasets (Test Dataset 2-1, Test Dataset 2-2 and Test Dataset 2-3)


remained unaltered during any and all testing. Additional precautions were taken to

ensure that the Test and Training Datasets 2-2 and 2-3 remained static after removal of

extraneous features. All datasets were exported in ARFF format after validation and

manipulation using standard naming conventions identifying the dataset.

4. Standard measurement approaches – a standard measurement approach was

used to document and capture testing results. Test results for Baseline testing (Dataset 2-

1), Reduced Feature Selection (Dataset 2-2), and Data Amplification (Dataset 2-3) were

collected and captured. Detection measurements were conducted for all tests. Per the testing protocol, the study established a baseline detection rate for IBk. Subsequent test

results were captured and analyzed. Results were analyzed using standard measurements including Accuracy (ACC), Correlation Coefficient (CC), True Positive Rate (sensitivity measure), and False Positive Rate (specificity measure). As specified by the standard protocol, these measurements were captured as a permanent observational recording

(Rossow et al., 2012).

5. Weighting of features – as part of this testing protocol the inputs were manipulated and assessed. Using the baseline dataset (Dataset 2-1), features were removed based on an attribute selection algorithm (e.g. greedy stepwise) and then evaluated for IBk. Likewise, using the baseline dataset (Dataset 2-1), features were removed based on the information gain for data amplification. The three tests for IBk included IBk Baseline (Dataset 2-1), IBk Reduced Feature Selection (Dataset 2-2), and

IBk Data Amplification (Dataset 2-3).

6. Capturing testing results - All results were captured in terms of test and measurement. Each test (IBk Baseline, IBk Reduced Feature Selection, and IBk Data


Amplification) was captured with the associated measurement - ACC, CC, True Positive

Rate (TPR) and False Positive Rate (FPR).

Table 13
IBk Baseline Classification Results

Experimental Results: IBk Baseline (Dataset 2-1)
Correctly Classified Instances        199,872    99.936%
Incorrectly Classified Instances          128     0.064%

Table 14
IBk Baseline Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99984
True Negative Rate (TNR)          0.99991
False Positive Rate (FPR)         0.00009
False Negative Rate (FNR)         0.00016
Accuracy (ACC)                    0.99986
Correlation Coefficient (CC)      0.00001

Table 15
IBk Reduced Feature Selection Classification Results

Experimental Results: IBk Reduced Feature Selection (Dataset 2-2)
Correctly Classified Instances        199,923    99.962%
Incorrectly Classified Instances           77     0.038%


Table 16
IBk Reduced Feature Selection Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99992
True Negative Rate (TNR)          0.99997
False Positive Rate (FPR)         0.00003
False Negative Rate (FNR)         0.00008
Accuracy (ACC)                    0.99993
Correlation Coefficient (CC)      0.00001

Table 17
IBk Data Amplification Classification Results

Experimental Results: IBk Data Amplification (Dataset 2-3)
Correctly Classified Instances        200,000   100.0000%
Incorrectly Classified Instances            0     0.0000%

Table 18
IBk Data Amplification Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99999
True Negative Rate (TNR)          1.00000
False Positive Rate (FPR)         0.00000
False Negative Rate (FNR)         0.00001
Accuracy (ACC)                    0.99999
Correlation Coefficient (CC)      0.00001

7. Graphing results - All test results were graphed and presented in terms of test (IBk Baseline, IBk Reduced Feature Selection, and IBk Data Amplification) and measurement for True Positive Rate (TPR) and False Positive Rate (FPR). The results are shown below. Accuracy essentially followed the same trendline as TPR. CC was a constant 0.00001 across all three experiments (IBk Benchmark, IBk Reduced Features and IBk Amplified Features).

Figure 6. IBk Graph Results. Bar chart of True Positive Rate (TPR: 0.99984, 0.99992, 0.99999) and False Positive Rate (FPR: 0.00009, 0.00003, 0.00000) for the IBk Benchmark, IBk Reduced Features, and IBk Amplified Features experiments.

8. Evaluating results – All test results were evaluated after each test. Each test was analyzed with rigor to understand the ACC, CC, True Positive Rate (TPR) and False

Positive Rate (FPR).


Table 19
IBk Experimental Results

Measurement                       IBk Baseline   IBk Reduced         IBk Data
                                                 Feature Selection   Amplification
True Positive Rate (TPR)          0.99984        0.99992             0.99999
True Negative Rate (TNR)          0.99991        0.99997             1.00000
False Positive Rate (FPR)         0.00009        0.00003             0.00000
False Negative Rate (FNR)         0.00016        0.00008             0.00001
Accuracy (ACC)                    0.99986        0.99993             0.99999
Correlation Coefficient (CC)      0.00001        0.00001             0.00001

9. Test controls – All tests were conducted using the required test controls. Test

controls regulated the IBk test environment, IBk test data (Training Datasets 2-1, 2-2, 2-3 and Test Datasets 2-1, 2-2, 2-3), and test results (IBk Baseline, IBk Reduced Feature

Selection and IBk Data Amplification).

Test Case 3

As with all test cases, the SCED testing protocol was used to structure the experiment. Test Case 3 utilized Dataset 3 with the Deep Learning Multilayer Perceptron (DLMLP) clustering algorithm. The DLMLP algorithm was chosen to replace the BP algorithm as it improves upon the dimensions of belief for training and test datasets. The main idea in replacing the BP algorithm with DLMLP was recently presented by Gruber et al. (2017). ANNs can leverage belief propagation for clustering or classification inefficiently (Gruber et al., 2017). ANNs build networks of neurons, share information between neurons and propagate results throughout the entire network (Gruber et al.,


2017). Gruber et al. (2017) proposed using shallow neural networks with deep learning to achieve similar or better propagation results. The deep learning approach established shallow neural networks and combined other algorithms such as MLP to achieve similar propagation (Gruber et al., 2017). The goal was to utilize the DLMLP algorithm for improved detection. Given that the goal of this research was to establish new detection using advanced algorithms, DLMLP was selected for model efficiencies and to leverage nascent algorithms that improved upon other algorithms.

Dataset 3 was generated through a pseudo-random selection of samples representing all three types of samples: malware, benign and unknown. Dataset 3 was automatically generated based on the same random selection algorithm that generated the other datasets. Dataset 3 was selected and generated on a percentage basis: Malware (40%), Benign (30%) and Unknown (30%). Dataset 3 was automatically selected and generated from a 2.5 million sample population. However, the generation of the final file required some manual intervention to produce the final Comma Separated Values (CSV) file. Once the CSV file was generated, Weka was used for data validation purposes and to convert the CSV file to an ARFF file. Once completed, the dataset was found to be “clean” or validated. Dataset 3 was then saved into a controlled directory for validated ARFF files. Dataset 3 was validated prior to the beginning of any baseline testing. The 26 features were validated and included all attributes from file properties, static analysis and dynamic analysis.
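The shallow-network MLP component described above can be sketched as a forward pass through a single sigmoid hidden layer. This is a structural illustration only; the weights and inputs are hypothetical, not trained values from the study:

```python
# A shallow multilayer perceptron forward pass: one hidden layer with a
# sigmoid activation feeding a sigmoid output unit.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer, one output: returns a malware-likeness score in (0, 1)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, features)) + b)
              for ws, b in zip(hidden_w, hidden_b)]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

# Hypothetical 3-feature input with a 2-unit hidden layer.
score = mlp_forward(features=[0.8, 0.1, 1.0],
                    hidden_w=[[1.0, -2.0, 0.5], [-1.0, 0.3, 2.0]],
                    hidden_b=[0.0, -0.5],
                    out_w=[1.5, -0.75],
                    out_b=0.2)
print(0.0 < score < 1.0)  # True: a sigmoid output is always in (0, 1)
```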

1. Establish baseline test data – Like the previous datasets, Dataset 3 was validated by processing the features through Weka. As was seen in previous datasets, several of the numeric features within Dataset 3 were typed incorrectly as nominal. These errors were highlighted upon importing the Dataset 3 CSV file into Weka. Data validation was


conducted and data errors were corrected in exactly the same fashion as previous

datasets. Again, these errors were introduced into the dataset through automated parsing

during the extraction process. These errors were fixed through searching and eliminating

leading spaces and/or special characters embedded in the field. These errors were

removed from the entire dataset and all features were validated. Once completed, the

dataset was found to be “clean” or validated. Per the testing protocol, the dataset was

validated prior to the beginning of baseline testing. The features (26) included attributes

from file properties, static analysis and dynamic analysis. As noted above, a Training

Dataset 3 was generated, validated and used as part of the classifier training protocol.

2. Manipulate feature set – The manipulation of the feature set produced three test datasets – Dataset 3-1, Dataset 3-2, and Dataset 3-3. Similarly, the training dataset was developed from the Training Dataset 3 for each test. After both datasets (training and test) were validated, Training Dataset 3-1 and Test Dataset 3-1 (baseline – no manipulation) were processed with all 26 features using the DLMLP algorithm to establish a detection baseline. The baseline testing did not filter the data and did not use any feature analysis to improve test results. Upon completion of baseline testing, the dataset was exposed to further feature analysis.

The second set of datasets – Training Dataset 3-2 and Test Dataset 3-2 – were generated by analyzing the features to better understand relationships and possible feature interdependencies. The full Test Dataset 3-1 (all 26 features) was analyzed with the Greedy Stepwise algorithm. The output of the Greedy Stepwise algorithm suggested that, of the 26 original features, the dataset could be reduced to 7 features/attributes. The

Greedy Stepwise algorithm produced a Merit of best subset of 0.807. The predictive


attributes for this dataset included 1) File Size, 2) File Type, 3) Compiler, 4) Compiler

Type, 5) Library Calls, 6) Encryption and the Op Code 7) XCHG. The next step was to reimport both the Training Dataset 3 and Test Dataset 3 and remove extraneous features/attributes. The classifier training was conducted using Training Dataset 3-2.

Experimental testing was conducted using Dataset 3-2. In both cases, Training Dataset 3-

2 and Test Dataset 3-2 removed all features except for File Size, File Type, Compiler,

Compiler Type, Library Calls, Encryption, and XCHG. Dataset 3-2 was then processed with the DLMLP algorithm.

The third and final dataset (Training and Test Dataset 3-3) was analyzed utilizing another feature analysis tool to evaluate features and relationships for Dataset 3. The

Information Gain algorithm seeks to amplify certain features within a dataset to potentially improve model outcomes (C. Lee & Lee, 2006). In this case, the intent was to use information gain to provide a reduced feature set that potentially increases the detection rate. The information gain algorithm was used to analyze the entire Dataset 3 – all 26 original features. The information gain algorithm suggested that the top seven attributes/features included 1) Compiler Type

(1.69143), 2) Malware Type (1.51408), 3) XCHG (1.37264), 4) File Size (1.10192), 5)

Encrypt Type (0.91679), 6) File Type (0.78658), and 7) Library Calls (0.71794). Dataset

3-3 used the full dataset and reduced the features to only the top seven features to potentially improve detection results. Therefore, Dataset 3-3 used these seven features for detection within the DLMLP classifier. Further, Compiler Type could be used as a


leading indicator attribute. Dataset 3-3 then used Compiler Type with the DLMLP classifier to deliver experimental results.

3. Controlled procedures and environment – the developed prototype environment and datasets were controlled to ensure the extracted feature dataset remained static over

time and the automated feature extraction process remained unaltered (Rossow et al.,

2012). Training Datasets (Training Dataset 3-1, Training Dataset 3-2 and Training

Dataset 3-3) and Test Datasets (Test Dataset 3-1, Test Dataset 3-2 and Test Dataset 3-3) remained unaltered during any and all testing. Additional precautions were taken to ensure that the Test and Training Datasets 3-2 and 3-3 remained static after removal of extraneous features. All datasets were exported in ARFF format after validation and manipulation using standard naming conventions identifying the dataset.
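Exporting a validated dataset in ARFF format can be sketched with a minimal writer (in the study, Weka performed the actual CSV-to-ARFF conversion). The relation name, attributes, and rows below are hypothetical:

```python
# Minimal ARFF export: @RELATION header, @ATTRIBUTE declarations, @DATA rows.
def to_arff(relation, attributes, rows):
    """attributes: list of (name, type_spec); rows: lists of values."""
    lines = [f"@RELATION {relation}", ""]
    for name, spec in attributes:
        lines.append(f"@ATTRIBUTE {name} {spec}")
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

arff = to_arff("Dataset_2-2",
               [("FileSize", "NUMERIC"),
                ("Encryption", "{yes,no}"),
                ("class", "{malware,benign,unknown}")],
               [[812.5, "yes", "malware"], [64.0, "no", "benign"]])
print(arff.splitlines()[0])  # @RELATION Dataset_2-2
```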

4. Standard measurement approaches – a standard measurement approach was used to document and capture testing results. Test results for Baseline testing (Dataset 3-

1), Reduced Feature Selection (Dataset 3-2), and Data Amplification (Dataset 3-3) were collected and captured. Detection measurements were conducted for all tests. Per the testing protocol, the study established a baseline detection rate for the DLMLP classifier.

Subsequent test results were captured and analyzed. Results were analyzed using standard measurements including Accuracy (ACC), Correlation Coefficient (CC), True

Positive Rate (sensitivity measure), and False Positive Rate (specificity measure). As specified by the standard protocol, these measurements were captured as a permanent observational recording (Rossow et al., 2012).

5. Weighting of features – as part of this testing protocol the inputs were manipulated and assessed. Using the baseline dataset (Dataset 3-1), features were


removed based on an attribute selection algorithm (e.g. greedy stepwise) and then

evaluated for DLMLP. Likewise, using the baseline dataset (Dataset 3-1), features were removed based on the information gain for data amplification. The three tests for

DLMLP included DLMLP Baseline (Dataset 3-1), DLMLP Reduced Feature Selection

(Dataset 3-2), and DLMLP Data Amplification (Dataset 3-3).

6. Capturing testing results - All results were captured in terms of test and measurement. Each test (DLMLP Baseline, DLMLP Reduced Feature Selection, and

DLMLP Data Amplification) was captured with the associated measurement - ACC, CC,

True Positive Rate (TPR) and False Positive Rate (FPR).

Table 20
DLMLP Baseline Classification Results

Experimental Results: DLMLP Baseline (Dataset 3-1)
Correctly Classified Instances        199,903    99.952%
Incorrectly Classified Instances           97     0.048%

Table 21
DLMLP Baseline Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99994
True Negative Rate (TNR)          0.99995
False Positive Rate (FPR)         0.00005
False Negative Rate (FNR)         0.00006
Accuracy (ACC)                    0.99994
Correlation Coefficient (CC)      0.00001


Table 22
DLMLP Reduced Feature Selection Classification Results

Experimental Results: DLMLP Reduced Feature Selection (Dataset 3-2)
Correctly Classified Instances        199,996    99.998%
Incorrectly Classified Instances            4     0.002%

Table 23
DLMLP Reduced Feature Selection Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99999
True Negative Rate (TNR)          0.99998
False Positive Rate (FPR)         0.00002
False Negative Rate (FNR)         0.00001
Accuracy (ACC)                    0.99999
Correlation Coefficient (CC)      0.00001

Table 24
DLMLP Data Amplification Classification Results

Experimental Results: DLMLP Data Amplification (Dataset 3-3)
Correctly Classified Instances        199,999   99.9995%
Incorrectly Classified Instances            1    0.0005%


Table 25
DLMLP Data Amplification Experimental Results

Measurement                       Result
True Positive Rate (TPR)          0.99999
True Negative Rate (TNR)          1.00000
False Positive Rate (FPR)         0.00000
False Negative Rate (FNR)         0.00001
Accuracy (ACC)                    0.99999
Correlation Coefficient (CC)      0.00001

7. Graphing results - All test results were graphed and presented in terms of test

(DLMLP Baseline, DLMLP Reduced Feature Selection, and DLMLP Data

Amplification) and measurement for True Positive Rate (TPR) and False Positive Rate

(FPR). The results are shown below. Accuracy essentially followed the same trendline as TPR and was not graphed for that reason. CC was a constant 0.00001 across all three experiments (DLMLP Benchmark, DLMLP Reduced Features and DLMLP Amplified Features) and did not provide substantial value.

Figure 7. DLMLP Graph Results. Bar chart of True Positive Rate (TPR: 0.99994, 0.99999, 0.99999) and False Positive Rate (FPR: 0.00005, 0.00002, 0.00000) for the DLMLP Benchmark, DLMLP Reduced Features, and DLMLP Amplified Features experiments.


8. Evaluating results – All test results were evaluated after each test. Each test

was analyzed with rigor to understand the ACC, CC, True Positive Rate (TPR) and False

Positive Rate (FPR).

Table 26
DLMLP Experimental Results

Measurement                       DLMLP Baseline   DLMLP Reduced       DLMLP Data
                                                   Feature Selection   Amplification
True Positive Rate (TPR)          0.99994          0.99999             0.99999
True Negative Rate (TNR)          0.99995          0.99998             1.00000
False Positive Rate (FPR)         0.00005          0.00002             0.00000
False Negative Rate (FNR)         0.00006          0.00001             0.00001
Accuracy (ACC)                    0.99994          0.99999             0.99999
Correlation Coefficient (CC)      0.00001          0.00001             0.00001

9. Test controls – All tests were conducted using the required test controls. Test

controls regulated the DLMLP test environment, DLMLP test data (Training Datasets 3-

1, 3-2, 3-3 and Test Datasets 3-1, 3-2, 3-3), and test results (DLMLP Baseline, DLMLP

Reduced Feature Selection and DLMLP Data Amplification).

Data Analysis

The purpose of this study was to evaluate the effective malware detection rates

using multidimensional topological features with advanced cluster algorithms. Previous

research (Fraley & Figueroa, 2016) provided some insight into detection rates using

multidimensional topological features with using advanced clustering algorithms.


However, previous research was limited to a few thousand files/samples. This study greatly expanded both the training and test datasets. Having larger datasets provided deeper insight into how each of the various algorithms performs given a limited set of topological features. This study also provided insight into how the algorithms perform given the full (26) and optimized (6 or 7) feature sets. Experiments were performed with full data, reduced features and amplified features. This process provided an understanding of how detection performed given some algorithm tuning. This study established baselines for all three cluster algorithms (MB, IBk and DLMLP). The MB classifier outperformed other algorithms (LSH and BP) in previous studies (Fraley & Figueroa, 2016). However, given the larger dataset, both the IBk and DLMLP classifiers outperformed MB in almost every respect.

This study also provided a test case for detecting malware using multidimensional topological data. The baseline testing included processing the full set of features that included file properties, static analysis and dynamic analysis. The features include File Size, File Type, Malware Type, VM Aware, Compiler, Embedded Compiler, Compiler Type, Library Import/Export, Library Call, Encryption, Encrypt Type, and Op codes – ADD, AND, INC, LEA, MOV, NEG, OR, SUB, XOR, POP, JMP and XCHG. The testing from this study supports the concept that effective malware detection can be achieved using multidimensional topological data. Further, experimental testing has shown that a reduced feature set (6 or 7) delivered better and more effective detection rates than the full feature set for this dataset. The features that delivered optimal detection rates include: 1) File Size, 2) File Type, 3) Compiler Type, 4) Library Calls, 5) Encryption, 6) Malware Type and the Op Codes 7) XCHG or MOV.


In terms of algorithm performance, IBk and DLMLP delivered impressive true

positive test results given this dataset. The IBk classifier scored almost a perfect

detection rate with the Amplified Feature training and test dataset (Dataset 2-3). The IBk true positive detection rate (0.99999) was validated and the test was repeated several times to confirm the results. This far exceeded the initial study expectations. In addition, the study considered training dataset saturation and developed separate training datasets in order not to artificially increase detection rates. The near-perfect detection rate should also validate the selection of this algorithm to replace LSH for the study.

Evaluating algorithm performance in terms of false positive rates, both IBk

Amplified Data and DLMLP Amplified Feature delivered impressive results. The false positive rates for both of these classifiers were near zero. It is difficult to improve upon a false positive rate of zero. Again, several tests were repeated in order to validate these rates

for both IBk and DLMLP.

It should be noted that both the IBk and DLMLP algorithms are expensive in

terms of observed processing time. The MB algorithm would typically process and deliver test results (200,000 samples) in under 240 seconds. The IBk and DLMLP by

contrast would process and deliver results in excess of 2,800 and 10,000 seconds

respectively. The infrastructure was not tuned for processing any algorithm. Memory

allocation and processor allocations remained constant and unchanged throughout the

testing process.

Lastly, ACC and CC were not discriminators for the various algorithms. It was

originally thought that ACC and CC would provide greater insight into algorithm

performance over and above true positive and false positive rates. However, because of


the high true positive and low false positive rates, the evaluation of algorithms based on

ACC and CC was not as useful as anticipated.

Findings

There were a number of substantial findings from this study. The study

answered the two basic research questions posed previously.

1) Can detection rates be improved by increasing the quality and quantity of

multidimensional features for the machine learning advanced clustering

algorithms from file properties, static and dynamic analysis?

Research findings suggest that a reduced multidimensional topological feature set using file properties, static and dynamic analysis delivered better overall results than the full dataset (7 features vs. 26 features). The reduced and amplified datasets delivered better results. While all three algorithms delivered above-target results, fewer or selected features provided better detection results (99.99%) across all three algorithms.

2) Which of the machine learning advanced clustering algorithms performed

better given the multidimensional features from file properties, static and

dynamic analysis?

It should be stated that the experiments conducted for this study demonstrated that multidimensional topological data can be used to improve malware detection. The 26 selected features did help to establish and improve overall detection rates. The experimental approach demonstrated impressive detection rates of 99.99% (MB, IBk and

DLMLP) for all experimental datasets. The lowest detection rate delivered in experimental testing was 99.43% (MB Baseline). The experimental detection rates


produced were far above the 81.25% target for polymorphic malware (Amos et al.,

2013).

It should be noted that this study was concerned with recent malware (within the

previous 18 months) and used a malware dataset specifically targeting Microsoft

operating systems. Lastly, feature selection and data amplification both reduced the

number of features processed by the algorithm. In performing the research for both the IBk and DLMLP algorithms, the theoretical aspects of both suggested they would perform well with an abundance of data. It was thought that the 26 multidimensional features

would provide sufficient data features to enhance classifier performance. However, both

algorithms performed incredibly well given a reduced and amplified dataset. Both

produced near perfect true positive rates (99.99%) and near zero false positive rates.
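The two rates quoted throughout this chapter follow directly from a confusion matrix. The sketch below uses hypothetical counts (chosen only so the arithmetic mirrors the magnitudes reported above) to show how a near-perfect true positive rate and near-zero false positive rate are derived.

```python
# Illustrative computation of the evaluation measures used in this study.
# The counts below are hypothetical, not taken from the experimental data.
def rates(tp, fn, fp, tn):
    tpr = tp / (tp + fn)   # true positive (detection) rate
    fpr = fp / (fp + tn)   # false positive rate
    return tpr, fpr

tpr, fpr = rates(tp=99_999, fn=1, fp=2, tn=99_998)
print(f"TPR = {tpr:.5f}, FPR = {fpr:.5f}")   # TPR = 0.99999, FPR = 0.00002
```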

Summary of Findings

The goal of this study was to develop an experimental approach for improved detection of polymorphic malware. The experimental detection rates delivered for this study were as high as 99.99% and as low as 99.43%. These experimental testing results far exceeded the current detection rate of 81.25% for polymorphic malware (Amos et al., 2013). The hypothesis for the study was that detection could be achieved by combining file properties, static and dynamic analysis features. Experimental testing with this dataset delivered effective detection using file properties, static and dynamic analysis features. This study leveraged the SCED protocol for performing various experimental tests. The test results were then quantitatively examined and evaluated.

This study leveraged previous quantitative experiments for supervised machine learning for malware research (Boro et al., 2012; Pradesh, 2014) to better understand how multidimensional topological information can be utilized for malware detection.

The study’s target was to develop an approach that would deliver an effective detection rate higher than 81.25% using multidimensional topological data for malware detection. The prototype system performed various feature extraction tasks and assembled the datasets for training and testing. The study sought to establish baselines for three advanced cluster algorithms (MB, IBk and DLMLP) and then manipulate feature weighting to exceed the target effective rate. In all cases (MB, IBk and DLMLP), the feature weighting delivered better detection performance.


Chapter 5

Conclusions, Implications, Recommendations, and Summary

Conclusions

There were a number of conclusions reached in this study. First, the study demonstrated that multidimensional topological data can be effectively used for detection. Secondly, the study demonstrated that the various algorithms used delivered solid detection results using multidimensional topological data. The study also demonstrated that high malware detection rates can be achieved using reduced or amplified multidimensional topological features. A brief explanation for each conclusion is provided below.

This study produced experimental evidence that multidimensional topological data can be used for improved malware detection. The initial premise for the study was that by combining multidimensional features from file properties, static and dynamic analysis with machine learning, effective malware detection rates could be improved.

Experimental testing with three algorithms successfully demonstrated detection rates greater than 99.99% with the experimental datasets. In addition, the algorithms did not suffer from an increase in false positive rates, which is usually the case with algorithms that produce high detection rates. Given the high true positive rates and low false positive rates, it can be safely stated that multidimensional topological data can be used to improve detection rates.

The study also produced highly effective malware detection rates by leveraging the advanced algorithms (MB, IBk and DLMLP). All three algorithms delivered effective detection results given the experimental datasets: MB (99.985%), IBk (100%) and DLMLP (99.9995%). All three algorithms also delivered low false positive rates given the experimental datasets: MB (0.00002%), IBk (0.0%) and DLMLP (0.0%). It can be safely concluded that these algorithms produced effective detection for the experimental datasets.

The study also produced evidence that very high malware detection rates can be achieved with reduced or amplified features. IBk and DLMLP both produced 99.999% effective detection rates using a minimum number of features (6 or 7). Both the IBk and DLMLP algorithms are designed to process volumes of data, and it was assumed that these algorithms would produce better results with more data. The opposite was found to be true for this study. Both algorithms produced better effective detection rates with the reduced features or amplified dataset. In addition, neither algorithm suffered from an increase in false positive rate, which is normally associated with algorithms that deliver higher detection rates. Therefore, given the high true positive rate and the low false positive rate for both IBk and DLMLP, it can be safely concluded that high malware detection rates can be achieved with reduced or amplified features leveraging multidimensional topological data with advanced algorithms.

As seen above, there are three valuable conclusions reached for this study: leveraging multidimensional topological data for detection, utilizing advanced algorithms to improve detection, and using reduced or amplified features to produce higher than anticipated detection rates. These three conclusions are supported by both the experimental protocol and the test results from the various experiments presented earlier.


Implications

There are two major implications that can be gleaned from this research. First, malware can be accurately detected using multidimensional topological data. This means that the combination of file properties, static and dynamic analysis provided impressive detection rates for this study. The implication is that detection of polymorphic malware targeting Microsoft platforms can be readily achieved by extracting multidimensional features and subjecting those features to a clustering algorithm. This implication may extend or augment current endpoint detection techniques and/or approaches.

The second implication is that endpoint protection could be greatly improved using reduced or amplified multidimensional topological features. The experimental results with these datasets demonstrate that using a few features (reduced or amplified) with advanced algorithms produced impressive malware detection rates approaching 100%. Further, the implication of using only a few features (6 or 7) for detection of executable files may change or enhance how endpoint protection is currently performed. Endpoint detection for malware could embed feature extraction and machine learning modules to process minimal features.

These two major implications for malware detection may change endpoint strategies. Machine learning is being incorporated into many commercial products. The study demonstrates that endpoint protection could and should use multidimensional data for advanced protection. Secondly, endpoint protection could use a minimal number of features (6 or 7) for detection purposes. The study also illustrates that reduced or amplified features could dramatically improve malware detection at the endpoint. These two implications could easily improve or enhance the way endpoint detection is conducted today across the enterprise.

Recommendations

There are three areas in which additional research is recommended. Two areas deal with understanding the various functions or opcodes inherent in an executable. This study extracted and counted the associated opcodes for each sample through a dynamic analysis process. For each sample, all opcodes were extracted and counted on an execution basis. This study focused on Bilar’s (2007) top twelve opcodes. The total universe of opcodes is approximately 398, depending on the code set. Just as Bilar (2007) observed, there were samples that executed a high number of opcodes. However, in some cases the number of times a particular opcode was executed was incredibly high in comparison to other samples. As an example, one sample executed ADD nearly 10,000 times. Likewise, other opcodes such as LEA were executed over 7,000 times. The average for most opcodes was under 1,000. While not the focus of this study, further research could evaluate why some opcodes were executed so many times. As it turns out, the particular sample in question was indeed malware. Additional research needs to be conducted to further understand whether this is typical of poorly written (and therefore noisy) malware, or whether it is something more menacing, such as reconnaissance activity probing for detection at the endpoint. From a research point of view, it is interesting to ask why these opcodes are executed so many times.
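The per-sample opcode counting described above can be sketched as follows. This is an illustrative reconstruction, not the study's extraction pipeline; the trace is a hypothetical list of executed mnemonics, and the 1,000 threshold simply reflects the sub-1,000 average noted above.

```python
from collections import Counter

# Bilar's (2007) top twelve opcodes, used as features in this study.
TOP_12 = {"ADD", "AND", "INC", "LEA", "MOV", "NEG",
          "OR", "SUB", "XOR", "POP", "JMP", "XCHG"}

def opcode_counts(trace):
    """Count executed opcodes, keeping only the top-twelve set."""
    counts = Counter(op.upper() for op in trace)
    return {op: counts.get(op, 0) for op in TOP_12}

def abnormal(counts, threshold=1000):
    """Flag opcodes executed far more often than the ~1,000 average."""
    return {op: n for op, n in counts.items() if n > threshold}

# Hypothetical trace: ADD dominates, as in the sample discussed above.
trace = ["add"] * 10_000 + ["lea"] * 7_000 + ["mov"] * 400
print(sorted(abnormal(opcode_counts(trace)).items()))
# [('ADD', 10000), ('LEA', 7000)]
```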

Similarly, additional research needs to be conducted for two specific opcodes: MOV and XCHG. This study focused on Bilar’s top twelve opcodes. Bilar’s (2007) research demonstrated that these twelve opcodes represent 95% of malicious opcodes, and profiled ADD, AND, INC, LEA, MOV, NEG, OR, SUB, XOR, POP, JMP and XCHG. In performing attribute analysis, reduced feature selection and data amplification, MOV and XCHG were the only two opcodes selected from an attribute perspective. Feature analysis routinely highlighted MOV and XCHG as interesting attributes. This study used the MOV and XCHG attributes in both reduced feature selection and amplified feature experiments to perform advanced detection. These two attributes produced highly effective detection rates for several experiments and provided improved detection across all algorithms in this study. Additional research should be conducted to understand the nature of both opcodes in malware and benign samples.

The last research area is single-layer packed and multi-layer packed executables. A substantial number of samples for this study used packing algorithms to avoid or evade further analysis. Packing algorithms are commonly used for packing malware and hiding execution routines (Jeong, Choo, Lee, Bat-Erdene, & Lee, 2010). As previously discussed, packing is also utilized to protect intellectual property for legitimate purposes. Packed executables essentially encode the data sections so that dynamic analysis cannot view results (Jeong et al., 2010). As these executables are loaded into memory, packed executables dynamically change the size and content of the data (Jeong et al., 2010). There were a number of samples in this study that exhibited single-layer packing, re-packing and multi-layer packing. This made feature extraction more difficult or, in some cases, impossible. This study did not find a single benign sample that used multi-layer packing. Therefore, additional research should be conducted to understand the nature of single-layer packed and multi-layer packed executables for malware detection. It is believed that classification of packing algorithms could lead to better detection of advanced malware.
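A common first indicator of packing, consistent with the entropy analysis described by Jeong et al. (2010), is that packed or encrypted sections approach the maximum of 8 bits of entropy per byte. The sketch below is a minimal illustrative heuristic, not the study's method; the 7.0 threshold and the sample byte strings are assumptions for demonstration only.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def looks_packed(section: bytes, threshold: float = 7.0) -> bool:
    # Packed/encrypted sections approach 8 bits/byte; plain code is lower.
    return shannon_entropy(section) > threshold

plain = b"mov eax, ebx; add eax, 1; " * 100     # low-entropy, code-like bytes
random_like = bytes(range(256)) * 16            # uniform, high-entropy stand-in
print(looks_packed(plain), looks_packed(random_like))   # False True
```

In practice such a check would be applied per PE section, and a multi-layer packer would keep yielding high-entropy sections after each unpacking pass.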

Summary

Malware represents some of the most serious security concerns for today’s Internet. Security breaches and cyber-attacks can be directly attributed to malware or multi-stage cyber-attacks. Malware can compromise networks and computers in the form of botnets, viruses, worms, ransomware and advanced persistent threats (APTs). These cyber-attacks are launched using targeted and advanced malware techniques to steal personal, proprietary or financial information. The high number of attacks and the associated negative notoriety make malware one of the most popular areas for advanced research. Much of today’s advanced research has been concentrated on developing techniques to collect, study, and mitigate malware. This research focused on detecting “real” malware samples found “live” on the internet. As improved detection becomes a reality, mitigation or elimination of malware at end-points can be greatly enhanced. Unfortunately, current host-based approaches that leverage signature-based detection are largely ineffective against new polymorphic malware. Polymorphic malware avoids or evades signature detection by using advanced obfuscation or encryption techniques. The goal of this research was to address these malware detection shortcomings. New research was conducted to develop new dynamic detection approaches, which demonstrated that machine learning with advanced algorithms can correctly and efficiently identify potential malware threats.

This study demonstrated a novel malware detection approach that provides improved detection for polymorphic malware. The research should enhance and augment traditional end-point detection approaches. This experimental study extracted key features from file properties, static and dynamic analysis. Machine learning using advanced algorithms correctly determined the likelihood of files (samples) being benign (good) or malicious (bad). The prototype environment analyzed malware executables (programs) in a controlled environment in order to better understand behaviors, function calls and the inclusion of dynamic libraries. Features were extracted and assembled for further analysis by three different cluster algorithms. Experiments were conducted to better understand the malware threat landscape in terms of file properties, static and dynamic analysis. Foundational and experimental reviews of previous research literature were conducted and summarized.

Detecting polymorphic and metamorphic malware continues to be a challenge for the security community. A majority of today’s security research is focused on developing enhanced detection using techniques that collect, study, and mitigate malicious code (Kolbitsch et al., 2009). However, new polymorphic malware and its detection-evading techniques render many of the current signature protections useless and therefore leave end-points unprotected (Rodríguez-Gómez et al., 2013). The speed at which polymorphic malware is advancing threatens enterprise computing and internet operations (Symantec Corporation, 2016). Being able to detect polymorphic, metamorphic and zero-day malware requires advanced detection techniques that provide rapid adaptation and scalability and produce low false positive rates (Borojerdi & Abadi, 2013). The results from this study offer an attractive alternative for detecting polymorphic malware specifically targeting Windows operating systems. This study performed analysis on a sample population of over 2.5M files. This study assembled a single dataset containing known malware, known benign and unknown samples, performed feature extraction and developed a prototype environment for detection that far exceeded today’s accepted baseline (Amos et al., 2013). This study successfully demonstrated a feature extraction methodology and a prototype detection environment, and conducted a number of experiments within an accepted testing protocol.

This study developed a unique detection methodology using multidimensional topological features in a machine learning environment. The ability to extract unique multidimensional topological features utilizing file properties, static and dynamic analysis is a new approach for feature extraction. The study used this information to develop a quantitative experimental prototype applying multidimensional topological data to machine learning with advanced algorithms. This approach and every experiment conducted for this study provided ample evidence, based on the experimental dataset, of improved detection for polymorphic malware far above 81.25%. All experiments used quantitative methods to provide objective measurements for experimental results as described by Babbie (2010). Experimental data collection, data analysis and presentation of results were conducted using standard and accepted computational techniques (Babbie, 2010). The study was carefully designed and executed using the SCED testing protocol. The study paid particular attention to the assembly of representative datasets and the execution of the various experiments within the testing protocol. The testing methodology followed an approach that can easily be reproduced. All experiments were conducted in a safe environment such that the malware could not escape and infect the research environment (Rossow, 2013). The experimental prototype was designed to assemble, capture, analyze and evaluate results with the rigorous quantitative evaluation measures discussed by Mohaisen and Alrawi (2015).

The goal of this study was to develop and demonstrate an experimental approach for improved detection of polymorphic malware that exceeded the effective detection rate of 81.25% for polymorphic malware (Amos et al., 2013). The experimental detection rate delivered across all three algorithms exceeded 99.99%. A number of the experiments (IBk and DLMLP) delivered effective detection results as high as 99.999%. The lowest effective rate observed through testing was 99.43% for MB. The hypothesis for the study was that detection could be achieved by combining file properties, static and dynamic analysis features. The study successfully demonstrated through multiple experiments that multidimensional topological data can deliver effective malware detection. Experimental testing with this dataset delivered effective detection using file properties, static and dynamic analysis features. This study leveraged the SCED protocol for performing and documenting various experimental tests. The test results were then quantitatively examined, evaluated and analyzed, and testing outcomes were presented. This study leveraged previous quantitative experiments for supervised machine learning for malware research (Boro et al., 2012; Pradesh, 2014) to better understand how multidimensional topological information can be utilized for malware detection. The study’s target was to develop an approach that would deliver an effective detection rate higher than 81.25% using multidimensional topological data for malware detection. The developed prototype system performed various feature extraction tasks and assembled the datasets for training and testing. The study sought to establish baselines for three advanced cluster algorithms (MB, IBk and DLMLP) and then manipulate feature weighting to exceed the target effective rate. In all cases, each algorithm at baseline testing delivered higher than target detection rates. Additionally, feature weighting through reduced feature selection and amplified features delivered substantially better detection performance over baseline results.

Although this study was successful in addressing the research goal, there were a number of important conclusions, implications and recommendations that stemmed from this research. In terms of conclusions, this study produced three important conclusions: 1) multidimensional topological features did demonstrate effective malware detection with machine learning algorithms, 2) advanced algorithms using multidimensional topological features delivered impressive detection results (MB (99.985%), IBk (100%) and DLMLP (99.9995%)) and 3) advanced algorithms (IBk and DLMLP) with reduced or amplified features delivered near-perfect (99.999%) malware detection results. These three conclusions provide greater insight into how multidimensional topological data can be used to enhance endpoint detection.

There were two major implications from this study. The first implication is that polymorphic malware targeting Microsoft platforms can be readily detected through machine learning with multidimensional topological data. This is the first study to evaluate and successfully demonstrate malware detection using multidimensional topological data with machine learning. The second implication is that endpoint protection could potentially be augmented with minimal features (6 or 7) and machine learning for malware detection. This implication could change or expand how detection is conducted at the endpoint. Both implications could potentially expand enterprise malware detection by commercial security endpoint products as well as network boundary devices scanning and evaluating executables in transport at the network layer.

There are three areas in which additional research was recommended. Two of the three areas deal with executable functions or opcodes. This study extracted and studied over 2M samples through a dynamic analysis process. Leveraging prior research (Bilar, 2007), only the top twelve opcodes were collected, analyzed and counted. In some samples, these opcodes were executed in abnormally high numbers (e.g., ADD nearly 10,000 times). Abnormally high counts may be another indication of maliciousness for detecting malware. The second area requiring additional research concerns two specific opcodes: MOV and XCHG. Across over 2M samples, using reduced feature selection and data amplification, these two opcodes were found to provide additional indications of maliciousness for this dataset. However, additional research would have to be conducted to understand the nature of both opcodes in malware and benign samples. The third research area involves packed executables. This study observed that over 25% of the samples analyzed used either single-layer packed or multi-layer packed executables. Packing algorithms are commonly used for packing malware and hiding execution routines (Jeong et al., 2010). Packed executables encode aspects of execution to hide routines so that dynamic analysis cannot view results. This study did not find a single benign sample that used multi-layer packing. Therefore, additional research should be conducted to understand the frequency of single-layer packed and multi-layer packed executables for malware detection.


References

Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., & Giacinto, G. (2016). Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, 183–194. Alam, S., Horspool, R. N., & Traore, I. (2014). MARD: A framework for metamorphic malware analysis and real-time detection. Proceedings - International Conference on Advanced Information Networking and Applications, AINA, 480–489. Alam, S., Horspool, R., & Traore, I. (2013). MAIL: Malware Analysis Intermediate Language: a step towards automating and optimizing malware detection. Proceedings from the 6th International Conference on Security of Information Networks. Amos, B., Turner, H., & White, J. (2013). Applying machine learning classifiers to dynamic Android malware detection at scale. IEEE Wireless Communications and Mobile Computing Conference (IWCMC), 1666–1671. Babbie, E. (2010). The Practice of Social Research. The Practice of Social Research. Bächer, P., Holz, T., Kotter, M., & Wicherski, G. (2005). Tracking Botnets: Using Honeynets to Learn More about Bots. The Honeynet Project and Research Alliance, 1–116. Baecher, P., Koetter, M., Holz, T., Dornseif, M., & Freling, F. (2006). The nepenthes platform: An efficient approach to collect malware. Recent Advances in Intrusion Detection, 165–184. Bailey, M., Oberheide, J., & Andersen, J. (2011). Automated classification and analysis of internet malware. Recent Advances in Intrusion Detection, 1–18. Barakat, A., & Khattab, S. (2010). A comparative study of traditional botnets versus super-botnets. Informatics and Systems (INFOS), 2010 The 7th International Conference on, 1–5. Bayer, U., Kirda, E., & Kruegel, C. (2010). Improving the efficiency of dynamic malware analysis. Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 1871. Bayer, U., Milani-Comparetti, P., Hlauscheck, C., Kruegel, C., & Kirda, E. (2009). 
Scalable, Behavior-Based Malware Clustering. 16th Symposium on Network and Distributed System Security, 120–129. Bechtel, K. (2014). Malware’s Journey from Hobby to Profit-Driven Attacks, 10–26. Bilar, D. (2007). Opcodes as predictor for malware. International Journal of Electronic Security and Digital Forensics, 1(2), 156. Bogdan, R., & Biklen, S. (1998). Qualitative research in education: An introduction to theory and methods. Foundations of Qualitative Research in Education, 1–48. Bohra, A., Neamtiu, I., Gallard, P., Sultan, F., & Iftodet, L. (2004). Remote repair of operating system state using backdoors. Proceedings - International Conference on 180

Autonomic Computing, 256–263. Bontchev, V. (1996). Possible macro virus attacks and how to prevent them. Computers & Security, 15(7), 595–626. Bontchev, V. (1998). Methodology of computer anti-virus research. Doctoral Thesis, Faculty of Informatics, University of Hamburg, 1–233. Boro, D., Nongpoh, B., & Bhattacharyya, D. K. (2012). Anomaly based intrusion detection using meta ensemble classifier. Proceedings of the Fifth International Conference on Security of Information and Networks - SIN ’12, 143–147. Borojerdi, H. R., & Abadi, M. (2013). MalHunter : Automatic Generation of Multiple Behavioral Signatures for Polymorphic Malware Detection. IEEE Systems Journal, (3rd International Conference on Computer and Knowledge Engineering (ICCKE 2013)), 1–7. Bossert, G., Hiet, G., & Inria, S. (2014). Towards Automated Protocol Reverse Engineering Using Semantic Information Categories and Subject Descriptors. ACM Symposium on Information, Computer and Communications Security (2014), 51–62. Brenner, S. (2008). Cyberthreats: The Emerging Fault Lines of the Nation State. New York.: Oxford University Press. Campbell, D., & Stanley, J. (1963). Experimental and Quasi-Experimental Designs for Research. American Educational Research Association. Campo-Giralte, L., Jimenez-Peris, R., & Patino-Martinez, M. (2009). PolyVaccine: Protecting Web Servers against Zero-Day, Polymorphic and Metamorphic Exploits. 2009 28th IEEE International Symposium on Reliable Distributed Systems, 91–99. Carlson, S. M., Davis, A. C., & Leach, J. G. (2014). Less Is More, (February). Cesare, S., & Xiang, Y. (2010). A Fast Flowgraph Based Classification System for Packed and Polymorphic Malware on the Endhost. 2010 24th IEEE International Conference on Advanced Information Networking and Applications, 721–728. Cesare, S., & Xiang, Y. (2011). Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs. 
2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications, 181–189. Cesare, S., & Xiang, Y. (2013). Malwise — An Effective and Efficient Classification System for Packed and Polymorphic Malware. IEEE Transactions on Computers, 62(6), 1193–1206. Cesare, S., & Xiang, Y. (2014). Control Flow-Based Malware Variant Detection. IEEE Transaction on Dependable and Secure Computing, 11(4), 304–317. Cesare, S., Xiang, Y., & Zhou, W. (2007). Malwise – An Effective and Efficient Classification System for Packed and Polymorphic Malware. IEEE Transactions on Computers, 1–14. Cesare, S., Xiang, Y., & Zhou, W. (2013). Malwise-an effective and efficient classification system for packed and polymorphic malware. IEEE Transactions on Computers, 62(6), 1193–1206. Charles, C. (1995). Introduction to educational research. (Longman, Ed.) (Second Edi).

181

San Diego, CA. Chaumette, S., Ly, O., & Tabary, R. (2011). Automated extraction of polymorphic virus signatures using abstract interpretation. Network and System Security. Chen, K., Zhang, Y., & Lian, Y. (2013). Vulnerability-based backdoors: Threats from two-step trojans. Proceedings - 7th International Conference on Software Security and Reliability, SERE 2013, 169–177. Chen, T., & Robert, J.-M. (2004). Evolution of Viruses and Worms. Statistical Methods in Computer Security, 1–16. Retrieved from http://vx.netlux.org/lib/atc01.html Chu, B., Holt, T., & Ahn, G. (2010). Examining the Creation, Distribution, and Function of Malware On Line. Department of Justice Abstract, 1–183. Cohen, F. (1985). Computer viruses, (January), 1–152. Cohen, F. (1989a). Computational aspects of computer viruses. Computers & Security, 8(4), 297–298. Cohen, F. (1989b). Models of practical defenses against computer viruses. Computers & Security, 8(2), 149–160. Cohen, F. (1992). A formal definition of computer worms and some related results. Computers Security, 11(7), 641–652. Condon, F. (2012). Towards a Scalable Framework for Android Application Analytics. UC Berkley 2013 Security Symposium, 1–10. Creswell, J. W. (2007). Research Design: Qualitative, Quantitative and Mixed Method Aproaches. SAGE Publications. Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart and Winston. Dai, J., Guha, R., & Lee, J. (2009). Efficient Virus Detection Using Dynamic Instruction Sequences. Journal of Computers, 4(5), 405–414. David, Y., & Yahav, E. (2013). Tracelet-based code search in executables. Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI ’14, 349–360. Denning, J. (1989). The Science of Computing: The Internet Worm. American Scientist, Vol. 77(No. 2), 126–128. Denzin, N. K., & Lincoln, Y. S. (2006). The Sage Handbook of Qualitative Research, 2nd ed. Edited by Norman K. 
Denzin, and Yvonna S. Lincoln. Library, 28(August), 467–468. Devesa, J., Santos, I., Cantero, X., Penya, Y. K., & Bringas, P. G. (2010). Automatic behaviour-based analysis and classification system for malware detection. Computer, 2 AIDSS(November 2015), 395–399. Dunlap, C. J. (2011). Perspectives for Cyber Strategists on Law for Cyberwar. Strategic Studies Quarterly, (June 2010), 81–99. Eichin, M. W., & Rochlis, J. a. (1989). With microscope and tweezers: an analysis of the Internet virus of November 1988. Proceedings. 1989 IEEE Symposium on Security

182

and Privacy, (November), 1–17. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters 27 (2006) 861–874, 27(6), 861–874. Fraley, J. B., & Figueroa, M. (2016). Polymorphic malware detection using topological feature extraction with data mining. In SoutheastCon 2016 (pp. 1–7). Gazet, A. (2010). Comparative analysis of various ransomware virii. Journal in Computer Virology, 6(1), 77–90. Gheorghescu, M. (2005). An Automated Virus Classification System. Virus Bulletin Conference, (October), 294–300. Glesne, C., & Peshkin, A. (1992). Becoming qualitative researchers: An introduction. White Plains, NY: Longman. Gordon, S., & Chess, D. (1998). Attitude Adjustment : Trojans and Malware on the Internet Attitude Adjustment : Trojans and Malware on the Internet, (October 1998). Gostev, A., Unuchek, R., Garnaeva, M., Makrushin, D., & Ivanov, A. (2016). It Threat Evolution in Q1 2016. Kapersky 2015 Report, Kapersky L. Gruber, T., Cammerer, S., Hoydis, J., & Brink, S. ten. (2017). On Deep Learning-Based Channel Decoding. arXiv Preprint arXiv:1701.07738, 1–6. Guri, M., Kedma, G., & Sela, T. (2013). Noninvasive detection of anti-forensic malware. Proceedings of the 2013 8th International Conference on Malicious and Unwanted Software: “The Americas”, MALWARE 2013, 1–10. Hampton, N., & Baig, Z. A. (2015). Ransomware: Emergence of the cyber-extortion menace. The Proceedings of the13th Australian Information Security Management, 2015, 47–56. Hansen, S. S., Mark, T., Larsen, T., Stevanovic, M., & Pedersen, J. M. (2016). An Approach for Detection and Family Classification of Malware Based on Behavioral Analysis. 2016 International Conference on Computing; Networking and Communications (ICNC), (February), 1–5. Hoepfl, M. C. (1997). Choosing Qualitative Research : A Primer for Technology Education Researchers. Journal of Technology Education, 9(1), 47–63. Hu, C., Wang, X., Li, N., Bai, H., & Jing, X. (2014). 
Approach for malware identification using dynamic behaviour and outcome triggering. IET Information Security, 8(2), 140–151. Jeong, G., Choo, E., Lee, J., Bat-Erdene, M., & Lee, H. (2010). Generic unpacking using entropy analysis. Proceedings of the 5th IEEE International Conference on Malicious and Unwanted Software, Malware 2010, 98–105. John, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, 121–129. Kamongi, P., Kotikela, S., Kavi, K., Gomathisankaran, M., & Singhal, A. (2013). VULCAN: Vulnerability Assessment Framework for Cloud Computing. 2013 IEEE 7th International Conference on Software Security and Reliability, 218–226.
Kanich, C., Kreibich, C., Levchenko, K., Enright, B., Voelker, G. M., Paxson, V., & Savage, S. (2008). Spamalytics: an empirical analysis of spam marketing conversion. ACM Conference on Computer and Communications Security, 3–14. Kaur, R. (2014). Efficient Hybrid Technique for Detecting Zero-Day Polymorphic Worms, (September 2011), 95–100. Kaur, R., & Singh, M. (2014). A Survey on Zero-Day Polymorphic Worm, 16(3), 1520–1549. Kauranen, K., & Makinen, E. (1990). A note on Cohen’s formal model for computer viruses. ACM SIGSAC Review, 8(2), 40–43. Kayacik, H., Zincir-Heywood, A. N., & Heywood, M. I. (2005). Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets. Proceedings of the Third Annual Conference on Privacy, Security and Trust (PST 2005), 3–8. Kephart, J., & Arnold, W. (1994). Automatic extraction of computer virus signatures. 4th Virus Bulletin International Conference, 178–184. Kephart, J. O. (1993). Measuring and Modeling Computer Virus Prevalence. Proceedings of IEEE Computer Society Symposium on Security and Privacy, 2–15. Kephart, J., Sorkin, G., Arnold, W., Chess, D., Tesauro, G., & White, S. (1995). Biologically inspired defenses against computer viruses, 985–996. Kephart, J., Sorkin, G., Swimmer, M., & White, S. (1999). Blueprint for a computer immune system. IBM Thomas J. Watson Research Center, 1–17. Kephart, J., & White, S. (1991). Directed-graph epidemiological models of computer viruses. Proceedings, 1991 IEEE Computer Society Symposium on Research in Security and Privacy, 343–359. Kerchen, P., Lo, R., Crossley, J., Elkinbard, G., Levitt, C., & Olsson, R. (1990). Static Analysis Virus Detection Tools For Unix Systems. Proceedings of the 13th National Computer Security Conference, 350–365. Khakhutskyy, V. (2016). Behavior-based Malware Detection with Quantitative Data Flow Analysis, (c), 2–4. Kharraz, A., Robertson, W., Balzarotti, D., Bilge, L., & Kirda, E. (2015). 
Cutting the Gordian knot: A look under the hood of ransomware attacks. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9148, 3–24. Kim, K., & Hong, S. (2014). Study on Enhancing Vulnerability Evaluations for BYOD Security. Perspectives on Security, 8(4), 229–238. Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. International Joint Conference on Artificial Intelligence, 14(12), 1137–1143. Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X., Wang, X., … Antipolis, S. (2009). Effective and Efficient Malware Detection at the End Host. System, 4(1), 351–366.
Kolbitsch, C., Holz, T., Kruegel, C., & Kirda, E. (2010). Automated Extraction of Proprietary Gadgets from Malware Binaries. 2010 IEEE Symposium on Security and Privacy, 29–44. Kolo, E. (2016). Kolobyte. Kolobyte Blog, February. Kolter, J., & Maloof, M. (2006). Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research, 7, 2721–2744. Krebs, B. (2012). Massive Profits Fueling Rogue Antivirus Market. Washington Post. Lana, R. (1959). Pretest-treatment interaction effects in attitudinal studies, 56(4), 293. Landwehr, C. E., Bull, A. R., McDermott, J. P., & Choi, W. S. (1994). A taxonomy of computer program security flaws. ACM Computing Surveys, 26(3), 211–254. Lee, C., & Lee, G. G. (2006). Information gain and divergence-based feature selection for machine learning-based text categorization. Information Processing and Management, 42(1 SPEC. ISS), 155–165. Lee, T., & Mody, J. (2006). Behavioral Classification. In 15th European Institute for Computer Antivirus Research (EICAR 2006) Annual Conference. Lee, W., & Stolfo, S. J. (1998). Data Mining Approaches for Intrusion Detection. Proceedings of the Seventh Usenix Security Symposium, 79–93. Li, X., Duan, H., Liu, W., & Wu, J. (2010). The growing model of Botnets. 1st International Conference on Green Circuits and Systems, ICGCS 2010, 414–419. Linn, C., & Debray, S. (2003). Obfuscation of executable code to improve resistance to static disassembly. Proceedings of the 10th ACM Conference on Computer and Communication Security - CCS ’03, 290–299. Liu, Y., Chen, W., & Guan, Y. (2012). Monitoring Traffic Activity Graphs with low-rank matrix approximation. 37th Annual IEEE Conference on Local Computer Networks, 59–67. Lo, R. W., Levitt, K. N., & Olsson, R. A. (1995). MCF: A malicious code filter. Computers & Security, 14(6), 541–566. Mandiant Research. (2014). Beyond the Breach. Mandiant Threat Report, 1–79. Manikandan, P., Ramyachitra, D., Kalaivani, S., & Ranjani Rani, R. (2016). 
An improved instance based K-nearest neighbor (IIBK) classification of imbalanced datasets with enhanced preprocessing. International Journal of Applied Engineering Research, 11(1), 642–649. Martignoni, L., Christodorescu, M., & Jha, S. (2007). OmniUnpack: Fast, generic, and safe unpacking of malware. Proceedings - Annual Computer Security Applications Conference, ACSAC, 431–440. McAfee. (2014). McAfee Labs Threats Report, (June). McAfee. (2016). McAfee Labs Threat Report, (June). McAfee Labs. (2015). McAfee Labs Threats Report, (May). McGraw, G., & Morrisett, G. (2000). Attacking malicious code: A report to the Infosec Research Council. IEEE Software, 17(5), 33–41.
Mohaisen, A., & Alrawi, O. (2015). High-fidelity, Behavior-based Automated Malware Analysis and Classification. Computers & Security, 1–12. Molok, N. N. A., Ahmad, A., & Chang, S. (2012). Social networking: a source of intelligence for advanced persistent threats. International Journal of Cyber Warfare and Terrorism (IJCWT), 2(1), 1–13. Moore, D., Shannon, C., & Brown, J. (2002). Code-Red: A Case Study on the Spread and Victims of an Internet Worm. Proceedings of the Second ACM SIGCOMM Workshop on Internet Measurement (IMW ’02), 273–284. Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2008, December 19). Verification Strategies for Establishing Reliability and Validity in Qualitative Research. International Journal of Qualitative Methods. Muhaya, F. Bin, Khan, M. K., & Xiang, Y. (2011). Polymorphic Malware Detection Using Hierarchical Hidden Markov Model. 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, (9), 151–155. Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. Proceedings of Uncertainty in AI, 9(4), 467–475. Nazario, J., Ptacek, T., & Song, D. (2004). Wormability: A description for vulnerabilities. Arbor Networks. Neuman, S. B., & McCormick, S. (1995). What is single-subject experimental research? Single-Subject Experimental Research: Applications for Literacy (128th ed.). Newark, Delaware: International Reading Association. Nicho, M., & Khan, S. (2014). Identifying Vulnerabilities of Advanced Persistent Threats. International Journal of Information Security and Privacy, 8(1), 1–18. Noreen, S., Murtaza, S., Shafiq, M. Z., & Farooq, M. (2009). Evolvable malware. Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation GECCO 09, 1569–1576. Panda Security. (2014). PANDA Labs Annual Report 2014. Paulevé, L., Jégou, H., & Amsaleg, L. (2010). 
Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11), 1348–1358. Pék, G., Lanzi, A., & Srivastava, A. (2014). On the feasibility of software attacks on commodity virtual machine monitors via direct device assignment. Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, 305–316. Pfeffer, A., Call, C., & Chamberlain, J. (2012). Malware Analysis and attribution using Genetic Information. 2012 7th International Conference on Malicious and Unwanted Software, 39–45. Ponomarev, S., Durand, J., Wallace, N., & Atkison, T. (2013). Evaluation of random projection for malware classification. Proceedings - 7th International Conference on Software Security and Reliability Companion, SERE-C 2013, 68–73.
Pradesh, A. (2014). Poster: Machine-learning Approaches for P2P Botnet Detection using Signal-processing Techniques. Debs ’14, 338–341. Pramono, Y. W. T., & Suhardi. (2015). Design of anomaly-based intrusion detection and prevention system for smart city web application using rule-growth sequential pattern mining. Proceedings - 2014 International Conference on ICT for Smart Society: “Smart System Platform Development for City and Society, GoeSmart 2014”, ICISS 2014, 56–60. Qu, Y., & Hughes, K. (2013). Detecting metamorphic malware by using behavior-based aggregated signature. Internet Security (WorldCIS), 2013 World Conference Proceedings, 13–18. Rafique, M., & Chen, P. (2014). Evolutionary algorithms for classification of malware families through different network behaviors. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, 1167–1174. Rekdal, J., & Bloemerus, P. (2013). Advanced Persistent Threat (APT): Beyond the hype, 1–50. Reynolds, J. K. (1989). The helminthiasis of the Internet. Computer Networks and ISDN Systems, 22(5), 347–361. Richardson, R. (2011). 15th Annual 2010/2011 Computer Crime and Security Survey. Computer Security Institute, 1–90. Rieck, K., Trinius, P., Willems, C., & Holz, T. (2009). Automatic Analysis of Malware Behavior using Machine Learning. Journal of Computer Security, (18–2009), 1–30. Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic Analysis of Malware Behavior using Machine Learning, 1–30. Ritchey, R. W., & Ammann, P. (2000). Using Model Checking to Analyze Network Vulnerabilities. IEEE Security and Privacy, 156–165. Rizvi, S. L., & Nock, M. K. (2008). Single-case experimental designs for the evaluation of treatments for self-injurious and suicidal behaviors. Suicide & Life-Threatening Behavior, 38(5), 498–510. Rodrigues, M. (2011). Utilizing Rootkits To Address the Vulnerabilities Exploited By, (December). Rodríguez-Gómez, R. A., Maciá-Fernández, G., & García-Teodoro, P. (2013). 
Survey and taxonomy of botnet research through life-cycle. ACM Computing Surveys, 45(4), 1–33. Rossow, C. (2013). Using Malware Analysis to Evaluate Botnet Resilience. Vrije Universiteit, (April), 145. Rossow, C., Dietrich, C. J., Grier, C., Kreibich, C., Paxson, V., Pohlmann, N., … Van Steen, M. (2012). Prudent practices for designing malware experiments: Status quo and outlook. Proceedings - IEEE Symposium on Security and Privacy, 65–79. Schneidewind, N. (2010). Metrics for mitigating cybersecurity threats to networks. IEEE Internet Computing, 14(1), 64–71. Seeley, D. (1989). A Tour of the Worm. Proceedings of the USENIX Winter Technical
Conference, 287–304. Seigneur, J., & Kölndorfer, P. (2013). A survey of trust and risk metrics for a BYOD mobile working world. Third International Conference on Social Eco-Informatics (SOTICS 2013), 11–22. Shafiq, M. Z., Khayam, S. A., & Farooq, M. (2008). Embedded malware detection using Markov n-grams. DIMVA (Vol. 7591). Sharma, A., & Sahay, S. (2014). Evolution and Detection of Polymorphic and Metamorphic Malwares: A Survey. International Journal of Computer Applications, 90(2), 7–12. Retrieved from http://adsabs.harvard.edu/abs/2014IJCA...90b...7S Sharma, N., Bajpai, A., & Litoriya, R. (2012). Comparison the various clustering algorithms of weka tools. International Journal of Emerging Technology and Advanced Engineering, 2(5), 73–80. Siddiqui, M. A. (2008). Data Mining Methods for Malware Detection. ProQuest. Singh, A., Walenstein, A., & Lakhotia, A. (2012). Tracking concept drift in malware families. Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence - AISec ’12, 81–92. Singh, P. K., & Lakhotia, A. (2002). Analysis and detection of computer viruses and worms. ACM SIGPLAN Notices, 37(2), 29. Smith, M., & Glass, G. (1987). Research and evaluation in education and the social sciences. Prentice Hall. Spafford, E. H. (1989). The internet worm program: an analysis. ACM SIGCOMM Computer Communication Review, 19(1), 17–57. Spafford, E. H. (1991). Computer viruses and ethics. Purdue University ePubs, (Technical Report 91-061), 1–22. Spafford, E. H. (1994). Computer Viruses as Artificial Life. Artificial Life, 1(3), 249–265. Subramanya, S. R., & Lakshminarasimhan, N. (2001). Computer viruses. IEEE Potentials, 20(4), 18–21. Sulaiman, A., Ramamoorthy, K., Mukkamala, S., & Sung, A. (2005). Malware examiner using disassembled code (medic). Proceedings of the 2005 IEEE Workshop on Information Assurance and Security, 428–429. Symantec. (2014). Internet Security Threat Report, 19(April). Symantec Corporation. (2014). Symantec Intelligence Report, (June). 
Symantec Corporation. (2016). 2016 Internet Security Threat Report. Industry Report, 1–81. Szor, P. (2005). The Art of Computer Virus Research and Defense. Pearson Education. Tamersoy, A., Roundy, K., & Chau, D. H. (2014). Guilt by Association: Large Scale Malware Detection by Mining File-relation Graphs. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’14, 1524–1533.
Tankard, C. (2011). Advanced persistent threats and how to monitor and deter them. Network Security, (8), 16–19. Thompson, K. (1984). Reflections on Trusting Trust. Communications of the ACM, 27(8), 761–763. Virvilis, N., Gritzalis, D., & Apostolopoulos, T. (2013). Trusted Computing vs. Advanced Persistent Threats: Can a Defender Win This Game? 2013 IEEE 10th International Conference on Ubiquitous Intelligence and Computing and 2013 IEEE 10th International Conference on Autonomic and Trusted Computing, 396–403. Vogt, R., Aycock, J., & Jacobson, M. (2007). Army of botnets. Network and Distributed System Security Symposium, 111–123. von Neumann, J., & Burks, A. W. (1966). Theory of self-reproducing automata. IEEE Transactions on Neural Networks, 5(10), 3–14. Wagner, D., & Dean, R. (2001). Intrusion detection via static analysis. Proceedings of the 2001 IEEE Symposium on Security and Privacy, 156–168. Wall, D. S. (2012). Enemies within: Redefining the insider threat in organizational security policy. Security Journal, 26(2), 107–124. Weaver, N., Paxson, V., Staniford, S., & Cunningham, R. (2003). A taxonomy of computer worms. Proceedings of the 2003 ACM Workshop on Rapid Malcode, 11–18. Wheeler, D. A. (2005). Countering trusting trust through diverse double-compiling. Proceedings - Annual Computer Security Applications Conference, ACSAC, 2005, 33–45. White, S. R. (1998). Open Problems in Computer Virus Research. Virus Bulletin Conference, October, 1–11. Williamson, M. M., Parry, A., & Byde, A. (2004). Virus Throttling for Instant Messaging. Virus Bulletin Conference, (September), 38–48. Willson, V., & Putnam, R. (1982). A meta-analysis of pretest sensitization effects in experimental design. Educational Research Journal, 19(2), 249–258. Winter, G. (2000). A comparative discussion of the notion of ‘validity’ in qualitative and quantitative research. The Qualitative Report, 4(3), 1–14. 
Wüchner, T., Ochoa, M., & Pretschner, A. (2014). Malware Detection with Quantitative Data Flow Graphs. Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security - ASIA CCS ’14, 271–282. Xiao, X., Yuxin, D., Yibin, Z., Ke, T., & Dai, W. (2013). Malware Detection Based on Object-oriented Association Mining. Proceedings of the 2013 International Conference on Machine Learning and Cybernetics, 14–17. Ye, Y., Wang, D., Li, T., Ye, D., & Jiang, Q. (2008). An intelligent PE-malware detection system based on association mining. Journal in Computer Virology, 4(4), 323–334.
Yedidia, J., Freeman, W., & Weiss, Y. (2003). Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8, 236–239. Yerima, S. Y., Sezer, S., McWilliams, G., & Muttik, I. (2013). A New Android Malware Detection Approach Using Bayesian Classification. 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), 121–128. Yin, H., Song, D., Egele, M., Kruegel, C., & Kirda, E. (2007). Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis. Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS ’07), 116–127. Yin, R. (2015). Qualitative research from start to finish. Guilford Publications. Young, A., & Yung, M. (2016). Cryptography as an Attack Technology: Proving the RSA/Factoring Kleptographic Attack. The New Codebreakers, 243–255. Zhang, J., Luo, X., Perdisci, R., Gu, G., Lee, W., & Feamster, N. (2011). Boosting the scalability of botnet detection using adaptive traffic sampling. Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security - ASIACCS ’11, 124. Zhang, Y., & Paxson, V. (2000). Detecting Backdoors, (August), 1–11. Zhu, X., Vondrick, C., Ramanan, D., & Fowlkes, C. (2012). Do We Need More Training Data or Better Models for Object Detection? Proceedings of the British Machine Vision Conference 2012 (BMVC12), 80.1–80.11. Zhuang, W., Ye, Y., Chen, Y., & Li, T. (2012). Ensemble clustering for internet security applications. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, 42(6), 1784–1796. Zolotukhin, M., & Hamalainen, T. (2013). Support vector machine integrated with game-theoretic approach and genetic algorithm for the detection and classification of malware. 2013 IEEE Globecom Workshops (GC Wkshps), 211–216.
Appendix A – Advanced Detection Studies

Lead Author | Year | Study Type | Dataset | Malware % | Labeled Samples | Static Analysis | Dynamic Analysis
Park, Y. (2010) | 2010 | Experiment | 200 | 55% | Yes | Yes | Yes
Shamili, A. S. (2010, August) | 2010 | Quantitative Experiments | 897,922 | 33% | No | None | None
Sun, X. (2010) | 2010 | Experiment | 90 | 67% | Yes | None | Yes
Wang, Z. H. (2010, May) | 2010 | Iterative Experiment | 5,223 | 68% | Yes | Yes | None
Ye, Y. (2010) | 2010 | Experiment | 50,000 | 70% | Yes | None | None
Caballero, J. (2011) | 2011 | Experiment | 313,791 | Not Specified | Yes | Yes | None
Jacob, G. (2011) | 2011 | Experiment | 37,572 | Not Specified | Yes | None | Yes
Jang, J. (2011) | 2011 | Experiment | 20,000 | 100% | Yes | Yes | Yes
Kolbitsch, C. (2011) | 2011 | Experiment | Not Specified | Not Specified | Yes | Yes | Yes
Rossow, C. (2011) | 2011 | Experiment | 100,000 | 85% | Yes | Yes | None
Zhang, J. (2011) | 2011 | Experiment | Not Specified | Not Specified | Yes | None | None
Chen, Y. (2012) | 2012 | Experiment | 10,000 | 60% | Yes | Yes | None
Eskandari, M. (2012) | 2012 | Experiment | 956 | 52% | Yes | Yes | Yes
Ghiasi, M. (2012) | 2012 | Experiment | 1,211 | 68% | Yes | None | Yes
Zhuang, W. (2012) | 2012 | Experiment | 75,000 | 78% | Yes | None | None
Borojerdi, H. R. (2013) | 2013 | Experiment | 360 | 67% | Yes | None | Yes
Jha, S. (2013) | 2013 | Experiment | 961 | 95% | Yes | None | Yes
Naval, S. (2013) | 2013 | Experiment | 1,296 | 37% | Yes | Yes | None
Ponomarev, S. (2013) | 2013 | Experiment | 1,544 | 54% | Yes | Yes | None
Tsuruta, H. (2013) | 2013 | Experiment | 23,234,538 | 27% | Yes | Yes | None
Wang, H. T., Wei, T. E., & Lee, H. M. (2013) | 2013 | Experiment | 3,000 | 96% | Yes | Yes | Yes
Xiao, X. (2013) | 2013 | Experiment | 3,401 | 53% | Yes | None | Yes
Zolotukhin, M. (2013) | 2013 | Experiment | 1,089 | 55% | Yes | Yes | None
Adebayo, O. S. (2014) | 2014 | Empirical Study | 1,500 | 67% | Yes | Yes | None
Dornhackl, H. (2014) | 2014 | Observation | Not Specified | Not Specified | No | None | None
El Attar, A. (2014) | 2014 | Experiment | Not Specified | Not Specified | No | None | None
Elaziz, P. E. A. (2014) | 2014 | Experiment | 80,077 | Not Specified | No | None | None
Li, J. (2014) | 2014 | Experiment | Not Specified | Not Specified | No | None | None
Pramono, Y. W. T. (2014, August) | 2014 | Experiment | 1,059,419 | Not Specified | No | None | None
Pramono, Y. W. T. (2014, September) | 2014 | Experiment | Not Specified | Not Specified | No | None | None
Li, Y. H. (2015) | 2015 | Experiment | Not Specified | Not Specified | No | None | Yes
Hansen, S. S. (2016) | 2016 | Experiment | 270,837 | 100% | Yes | None | Yes
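The Static Analysis and Dynamic Analysis columns of the table above can be tallied with a short script. This is a sketch for convenience, not part of the original study; the Y/N pairs below are hand-transcribed from the table ("Y" = Yes, "N" = None), one study per line in table order.

```shell
#!/bin/sh
# Write the (static, dynamic) pair for each of the 32 studies in Appendix A.
cat <<'EOF' > /tmp/appendix_a_analysis.txt
Y Y
N N
N Y
Y N
N N
Y N
N Y
Y Y
Y Y
Y N
N N
Y N
Y Y
N Y
N N
N Y
N Y
Y N
Y N
Y N
Y Y
N Y
Y N
Y N
N N
N N
N N
N N
N N
N N
N Y
N Y
EOF

# Count how many studies fall into each static/dynamic combination.
awk '{c[$1 $2]++} END {
  printf "both static and dynamic: %d\n", c["YY"]
  printf "static only:             %d\n", c["YN"]
  printf "dynamic only:            %d\n", c["NY"]
  printf "neither:                 %d\n", c["NN"]
}' /tmp/appendix_a_analysis.txt
```

The tally shows only 5 of the 32 surveyed studies combined static and dynamic analysis, which is the gap in multidimensional feature extraction this dissertation targets.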
Appendix B – Cuckoo Installation and Configuration

Eugene Kolo’s blog provided useful information for installing and configuring Cuckoo (Kolo, 2016): https://eugenekolo.com/blog/installing-and-setting-up-cuckoo-sandbox/

Installation dependencies for Cuckoo

sudo apt-get install python
sudo apt-get install mongodb
sudo apt-get install g++
sudo apt-get install python-dev python-dpkt python-jinja2 python-magic python-pymongo python-gridfs python-libvirt python-bottle python-pefile python-chardet python-pip
sudo apt-get install libxml2-dev libxslt1-dev
sudo pip2 install sqlalchemy yara
sudo pip2 install cybox==2.0.1.4
sudo pip2 install maec==4.0.1.0
sudo pip2 install python-dateutil
sudo apt-get install python-dev libfuzzy-dev
sudo pip2 install pydeep
sudo apt-get install tcpdump  # if not already installed
# Allow tcpdump to read raw TCP data without root:
sudo setcap cap_net_raw,cap_net_admin=eip /usr/sbin/tcpdump
wget http://downloads.volatilityfoundation.org/releases/2.4/volatility-2.4.zip && unzip volatility-2.4.zip && cd volatility-2.4
sudo python setup.py install
# Install the libraries that Volatility wants:
sudo pip2 install distorm3
Install Cuckoo and VirtualBox

git clone git://github.com/cuckoosandbox/cuckoo.git
wget http://download.virtualbox.org/virtualbox/5.0.14/virtualbox-5.0_5.0.14-105127~Ubuntu~trusty_i386.deb
sudo dpkg -i virtualbox-5.0_5.0.14-105127~Ubuntu~trusty_i386.deb
sudo apt-get install -f

Networking

sudo iptables -A FORWARD -o eth0 -i vboxnet0 -s 192.168.56.0/24 -m conntrack --ctstate NEW -j ACCEPT
sudo iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
sudo iptables -A POSTROUTING -t nat -j MASQUERADE
sudo sysctl -w net.ipv4.ip_forward=1
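The iptables and sysctl settings above do not survive a reboot. One way to persist them, a sketch that assumes a Debian/Ubuntu host and the iptables-persistent package conventions rather than anything specified in the Cuckoo documentation, is:

```shell
# Make IP forwarding permanent instead of the one-shot `sysctl -w` above:
echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Save the current iptables rules and reload them at boot via the
# iptables-persistent package (Debian/Ubuntu):
sudo apt-get install iptables-persistent
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
```

Without this step, the FORWARD and MASQUERADE rules must be re-entered each time the analysis host is restarted, and guest VMs in the 192.168.56.0/24 sandbox network silently lose internet access.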