An Active Learning Framework for Efficient Acquisition and Detection of Unknown Malware

Thesis submitted in partial fulfillment of the requirements for the degree of

By

Nir Nissim

Submitted to the Senate of Ben-Gurion University of the Negev

31.12.2015

Beer-Sheva


An Active Learning Framework for Efficient Acquisition and Detection of Unknown Malware

Thesis submitted in partial fulfillment of the requirements for the degree of

“DOCTOR OF PHILOSOPHY”

By

Nir Nissim

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor: ______

Approved by the Dean of the Kreitman School of Advanced Graduate Studies: ______

31.12.2015

Beer-Sheva


This work was carried out under the supervision of Prof. Yuval Elovici at the

Department of Information Systems Engineering

Faculty of Engineering Sciences,

Ben-Gurion University of the Negev.


Research-Student's Affidavit when Submitting the Doctoral Thesis for Judgment

I, Mr. Nir Nissim, whose signature appears below, hereby declare that:

 I have written this Thesis by myself, except for the help and guidance offered by my Thesis Advisors.

 The scientific materials included in this Thesis are products of my own research, culled from the period during which I was a research student.

Date: 31.12.2015 Student’s Name: Nir Nissim Signature: ______


Acknowledgements

First and foremost, I want to thank God for providing me with the capabilities, wisdom, and blessing of success during these important years of research, and for surrounding me with an outstanding group of colleagues and researchers who were helpful in this research.

I would also like to thank my advisor, Prof. Yuval Elovici, for his support, guidance, and the opportunities provided to me, all of which have made these years of research extremely productive and challenging.

Thanks also to the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space, which partially supported my research.

I also wish to thank Clint Feher, Oren Barad, and Aviad Cohen, who assisted in the collection and creation of the datasets, and Yuval Fledel for his valuable advice regarding the efficient implementation aspects of my research. I would also like to thank Prof. Yuval Shahar for the meaningful discussions we shared and for his expertise and support in expanding this research in additional directions in the biomedical domain.

Special thanks both to Dr. Robert Moskovitch and Prof. Lior Rokach for their assistance and helpful advice during the course of my research.

Thanks also to Ms. Yehudith Naftalovitch, the administrative and operational manager of our Cyber Security Research Center, who assisted with many administrative matters during these years of research, providing valuable support that allowed me to better focus on the research itself.

I would also like to thank Ms. Robin Levy-Stevenson for her devoted assistance, providing much-appreciated English editing and proofreading during my Ph.D. studies, which helped make my publications clearer and more comprehensible.

And last but not least, thanks to my dear parents and my special grandparents who supported me in every way they possibly could, ensuring that I would always have the passion, and everything else I would need, to succeed.


Abstract

The sheer volume of new malware created every day poses a significant challenge to existing detection solutions. This malware is aimed at compromising nearly every kind of widely used digital device, threatening individuals as well as organizations. Popular types of malware take different forms, including computer worms, malicious PC executables, malicious documents (non-executables), and malicious applications aimed at mobile devices. Widely used antivirus software, which is based on manually crafted signatures, is only capable of identifying known malware and relatively similar variants. To identify new and unknown malware and keep their signature repositories up to date, antivirus vendors must collect new suspicious files on a daily basis for manual analysis by information security experts who label the files as malicious or benign. Analyzing suspected files is a time-consuming task, and it is impossible to manually analyze all questionable files. Consequently, antivirus vendors use detection models based on machine learning (ML) algorithms and heuristics in order to reduce the number of suspected files that must be inspected manually. In addition to antivirus software, recent detection solutions have also used machine learning algorithms independently in order to provide better detection of new malware, an area in which antivirus software is limited.

In light of the mass creation of new files daily, both antivirus and machine learning based detection solutions lack an essential element: they cannot be frequently and efficiently updated with newly created malware, a situation that creates a dangerous time gap between the creation and proliferation of malware and its detection and discovery. This time gap allows new malware to attack many targets before it is identified and thwarted. Therefore, both antivirus and machine learning based solutions must be frequently updated: the antivirus software must be updated with new malware signatures, and machine learning based solutions require new informative files, both malicious and benign.

In this research we introduce a solution for this updatability gap. We present a novel, generic, and efficient active learning (AL) framework and new AL methods that may assist antivirus vendors and machine learning based solutions, allowing them to focus their analytical efforts by acquiring only a small set of new files that are either most likely malicious or informative benign files, a process that enables efficient and frequent enhancement of the knowledge stores of both the detection model and the antivirus software. In addition to intelligent selection of the most contributive files, our framework is also designed to work at a finer level of granularity, in which it can efficiently select only a small number of instances related to the behavior of a specific analyzed file. By doing this, our framework can filter out the misleading and noisy behavioral instances that are common among sophisticated and elusive malware and thus improve detection capabilities. Our framework also integrates tailored feature extraction methods for each of the above-mentioned types of malware, and these feature extraction methods provide an accurate basis for enhancing the detection capabilities leveraged by our AL methods.


The main contributions of the study are summarized as follows. First, the experimental results showed that our framework can improve the detection capabilities of antivirus software and machine learning based solutions by frequently and efficiently enhancing the knowledge stores of the detection model and the antivirus software; in our experiments it outperformed the existing solutions and methods. Second, under the predefined daily acquisition budget used in our experiments, the existing AL method showed a decrease in the number of new malware acquired daily, while our AL methods showed a daily increase and acquired more new malware each day than every other solution. Third, our framework conducts the above-mentioned update using only a small set of the most informative files (malicious and benign), leading to a significant reduction in the labeling efforts associated with manual analysis of the files by security experts. Fourth, our framework was also found to be efficient in the retrospective acquisition of malware from the large stores of files usually found in organizations. Fifth, our framework is able to efficiently improve detection capabilities by enhancing its robustness through filtering out misleading malware instances and behavior. Lastly, as a proof of concept for the generality of our AL based framework, we recently extended the framework's capabilities so that it provides solutions in additional domains. We adapted it to the biomedical informatics domain, in which we successfully enhanced the capabilities of a classification model used for condition severity classification while significantly reducing labeling efforts, which can result in substantial savings in both the time and costs associated with medical experts.

Keywords: Malware, Malicious, Computer Worm, Executable, Android, Document, PDF, Machine Learning, Active Learning, Detection, Acquisition, Antivirus.


Table of Contents

1. Introduction

1.1. Background and Related Work

1.1.1. Malicious Executables and Computer Worms

1.1.2. Malicious Documents

1.1.3. Malicious Android Applications

1.2. The Problem Statement and Proposed Approach

1.3. Deployment of our Framework

2. Overview of the Core Papers in the Research

2.1. Research Results

2.1.1. Core Papers

3. Summary and Conclusions

4. Future Directions

5. References

6. Appendix

6.1. Additional Accepted Papers in the Malware Detection Domain

6.2. Additional Accepted Papers in the Biomedical Informatics Domain


1. Introduction

1.1. Background and Related Work

In recent years, the Internet has become an integral part of our lives, particularly with the increased availability of high-speed Internet connections, cloud computing, and the proliferation of mobile devices, which have rapidly become indispensable to individuals around the world, handling many of our daily needs and interests such as communication, health, news, banking, shopping, mail, and entertainment. Increasingly large numbers of files are created and transferred over the Internet, including a growing percentage of malware that compromises a growing list of targets through a variety of attack methods. Although the creation of malware nowadays requires much less expertise than in the past [75], attacks launched by today's malware have become more sophisticated, harder to detect, and more dangerous [72]. These facts have shaped an insecure reality in which, according to a Kaspersky report presented in 2013 [76], at least 315,000 new malware samples are created every day and spread widely over the Internet with ease; since that time, this number has been increasing exponentially each year.

There are several levels of defense against malware attacks, and each level consists of different types of specialized techniques and tools. The lowest level of defense is at the level of the host computer and includes the user's computer itself and an organization's application servers. The techniques most often used at this level are host-based intrusion detection and prevention systems (IDSs and IPSs) that are installed on the host computer and can protect it from malware that has reached the host. Signature-based antivirus software is an integral tool implemented at the host level; such widely used tools detect known malware and its variants using signature matching, relatively effectively for most organizations and individuals. Each time a new malware is found, antivirus vendors create a new signature and update their signature repository, as well as their clients. It takes time to detect malicious code and update clients, and such actions are definitely not immediate. Speed is essential: during the period of time between the appearance of a new unknown malware, its subsequent detection by the antivirus vendor, and the delivery of the new signature to the client's database, many computers might be infected. Although more than a decade has passed since their first appearance, computer worms remain the most well-known examples of malware that maliciously take advantage of the time available to them prior to detection and neutralization. "Slammer," the fastest computer worm in history, infected more than 75,000 hosts (representing 90% of the vulnerable hosts) within ten minutes [70], while "Code Red," the most harmful and famous worm, infected 359,000 hosts within 14 hours [71]. Each of these computer worm attacks caused significant disruption to financial, transportation, and government institutions.

In order to accurately and quickly detect the newest malware, antivirus companies devote considerable effort, both in terms of time and resources, to maintaining an up-to-date signature repository of malicious code files. These efforts include monitoring new and unknown malicious code files sent over the Internet and the use of various types of honeypots to catch malicious files [77] [78]. This mission is complicated and time-consuming, particularly because these efforts rely heavily on manual inspection of suspected files [77].

This challenging situation has motivated researchers to develop more comprehensive and efficient solutions for the agile detection of new unknown malware. Studies conducted over the last 15 years have shown that machine learning methods and algorithms (traditionally used for challenging classification and prediction tasks) can be used for the detection of unknown malware. New detection tools based on machine learning algorithms have been developed, and antivirus vendors have started to incorporate machine learning based detection models and heuristics into their processes in order to enhance their detection capabilities. Prior studies have primarily focused on two approaches: dynamic and static analysis. In each case, during both the training and detection phases, malicious and benign files are analyzed and subsequently represented by a vector of features, extracted either statically from the content and structure of the file or dynamically according to its behavior, which can be monitored by measuring elements within the system in which the malware is executed. These files are used during the training phase to induce a classifier that acts as the detection model. Based on the generalization capabilities of the detection model, an unknown file (one that did not appear in the training set and is not detected by the antivirus tool) is classified as malicious or benign during the detection phase.
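To make the training and detection phases described above concrete, the following minimal sketch (not the thesis implementation; the feature vectors, their values, and the library choice are illustrative assumptions) induces an SVM-based detection model from labeled feature vectors using scikit-learn and then classifies a previously unseen file.

```python
from sklearn.svm import SVC

# X_train: one feature vector per file (features extracted statically from the file's
# content and structure, or dynamically from its monitored behavior).
# y_train: labels provided by security experts (0 = benign, 1 = malicious).
X_train = [[0.0, 3.0, 1.0], [5.0, 0.0, 2.0], [0.1, 2.7, 1.2], [4.8, 0.2, 1.9]]
y_train = [0, 1, 0, 1]

clf = SVC(kernel="rbf")      # the induced detection model
clf.fit(X_train, y_train)    # training phase

# Detection phase: an unknown file, represented by the same features, is classified
# as malicious or benign based on the model's generalization capabilities.
unknown_file = [[4.5, 0.3, 2.1]]
print(clf.predict(unknown_file))  # expected: [1], i.e., malicious
```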

Static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable: the analyzed file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis "traps" to detect analysis, these traps can actually be used as a contributing feature for the detection of malware [90]. In addition, since static analysis is relatively efficient and fast, it can be performed both without causing bottlenecks and within an acceptable timeframe. Static analysis is also easy to implement, monitor, and measure. Moreover, static analysis scrutinizes the file's "genes" as opposed to its current behavior, which can be changed or delayed to an unexpected time in order to evade detection by dynamic analysis. An additional advantage is that static analysis can be used for a scalable pre-check of malware before deeper, more time-consuming analysis is conducted.

On the other hand, static analysis can be evaded by code obfuscation and is also limited in its ability to analyze encrypted files. Whenever machine learning methods based on static analysis are used for the detection of unknown malicious applications, a question arises regarding the ability of the suggested framework to detect obfuscated malware. In contrast, dynamic analysis (also known as behavioral analysis), which is aimed at tracing the behavior of the file and its effect on the environment in which it is executed, is not affected by code obfuscation. This type of analysis and its versatile methods for detecting unknown malware based on its behavior have been thoroughly explored over the past several years. These dynamic analysis based methods are aimed at detecting malicious activity and content that cannot be discovered using static analysis, for example code obfuscation, encrypted files, and dynamic loading of malicious code during run time.

Machine learning solutions based on static and dynamic analysis have been successfully applied for the detection of common types of malware including: malicious executables and computer worms, malicious documents (non-executables), and malicious applications aimed at mobile devices. Each type has its own characteristics and unique properties, and our research is aimed at providing comprehensive long-term detection solutions to the challenges posed by the various malware types. Thus, we present a brief introduction to the types of malware that have become more popular and attractive during the period in which this research was conducted and mention the machine learning approaches and developments associated with each.

1.1.1 Malicious Executables and Computer Worms

Malicious executables, especially those aimed at the Windows operating system, the most commonly attacked system, include malware families such as computer worms, computer viruses, Trojan horses, spyware, and adware. Computer worms are a widespread form of malicious executable that proactively propagates across networks while exploiting vulnerabilities in operating systems, protocols, devices, and installed programs. In contrast, other malicious executables such as viruses, Trojan horses, spyware, and adware usually operate and attack within a host while also infecting the host's files. Ransomware, which extorts its victims, represents an emerging trend of malicious executables belonging to the Trojan malware family. Once ransomware reaches a host, it encrypts the host's files using a strong encryption algorithm and a unique key and prevents the host's owner from accessing and using his/her own files until the owner pays the requested ransom to the attacker. Ransomware is financially driven and therefore aimed at attacking large organizations that rely heavily upon the availability of significant and valuable files that form a critical part of their daily work and business. In this situation the attacked organization remains helpless and is forced to comply with the attacker's demands and pay the ransom, or else lose its data. A well-known example of ransomware is CryptoLocker [91], which was able to extort approximately three million dollars before it was taken down by authorities.

Regardless of the malware's mode of operation, today's antivirus tools do not offer an adequate solution for the detection of new unknown malware, i.e., malware that does not share signatures with the known malware comprising the antivirus signature repository. Over the past 15 years, many studies have investigated the possibility of enhancing the detection of unknown malicious executables using machine learning algorithms based on either static [49-57] or dynamic analysis [58-69].

The detection of elusive computer worms transmitted over computer networks has also been intensively researched over the past decade. Typically, worms operate autonomously, spreading quickly and attacking as many targets as possible, causing considerable harm, as was demonstrated by the "Slammer" [70] and "Code Red" [71] worms. Stuxnet [72], a more elusive malware (and probably the most sophisticated ever created), is an example of a new attack trend, the advanced persistent threat (APT). Stuxnet was a directed cyberwarfare attack against the Iranian nuclear program; it spread within the attacked systems and targeted the controllers of the nuclear SCADA systems in order to physically destroy its military target.

Worms are elusive malware that try to hide their malicious activity by spending most of their time in a dormant state or by otherwise acting benignly. Machine learning based solutions have been proposed to complement and enhance antivirus software in order to detect new computer worms based on behavioral classification of the host [73] [74]. However, the key to detecting new computer worms using machine learning algorithms is the ability to filter out their misleading behavior from the data provided to the machine learning algorithms.


In addition, some worms act in a misleading way, behaving as a legitimate application part of the time and consequently generating misleading instances. Worms are not always active, and even when active they do not always behave in an illegitimate way. Because they sometimes act like non-worm instances, their detection is much more difficult; furthermore, monitoring their behavior can be misleading, and when this is done, misleading instances become part of the dataset.

In most domains, misleading instances are not created intentionally but exist naturally. In our case, however, the misleading data posed a greater problem. In the security domain, worms are created in a sophisticated way so that they behave similarly to a legitimate application, in order to make their detection harder. Thus, monitoring worm behavior using dynamic analysis creates many instances that are very similar to non-worm instances and are therefore considered misleading instances that confuse the induced classifier. This phenomenon is called "malicious noise," as presented in [92]. Misleading instances usually create confusion in the classification process and cause degradation in the classifier's performance.

With regard to worm detection, the task is more complicated, since the misleading data is inherent to the class, and its presence is even greater in the class we want to detect. In this case, we used the AL method's premise of selecting the most informative instances among the existing instances, so that the misleading instances would not be selected, as was done previously [93] and discussed in other work [94].

1.1.2 Malicious Documents

Cyber-attacks aimed at organizations have increased since 2009, with 91% of all organizations hit by cyber-attacks in 2013.1 Attacks aimed at organizations usually include harmful activities such as stealing confidential information, spying on and monitoring an organization's activity, and disrupting an organization's actions. The vast majority of organizations rely heavily on email for internal and external communication. Thus, email has become a very attractive platform from which to initiate cyber-attacks against organizations.

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber attacks-in-2013/

According to Trend Micro,2 APT attacks, particularly those against government agencies and large corporations, are largely dependent upon spear-phishing3 emails. As malicious executables have been widely used to launch attacks, current defensive solutions and organizational policies often prevent executables from entering organizational networks via email4 [88]. Therefore, recent APT attacks tend to attach documents, which are non-executable files (PDF, MS Office files, Flash files, etc.) that, unlike executables, are not independent files and require specific software in order to be opened (e.g., Adobe Reader, Microsoft Office, etc.). These types of documents are widely used in organizations and are often mistakenly considered less suspicious or dangerous than executables.

Furthermore, because email communication is an integral part of daily business operations, APT attackers frequently leverage email as an attack vector for initial penetration of the targeted organization. Attackers usually use social engineering in order to make the recipient open the malicious email, click a link, or open an attachment containing such a document. F-Secure's 2008-2009 report5 indicates that the most popular file types for targeted attacks in 2008-2009 were PDF and Microsoft Office files. Since that time, as was reported in 2010-2011, the number of attacks on Adobe Reader has grown.6 A report presented in 2015 by Symantec [88] revealed that Microsoft Office documents have become the most frequently used email attachments for spear-phishing attacks and were used in 39% of such attacks during 2014.

To date, antivirus packages are not sufficiently effective at intercepting malicious documents, even in the case of highly prominent PDF threats (Tzermias et al. [38]). However, according to studies such as [38-47], machine learning methods can be effective in distinguishing between malicious and benign PDF files and discovering new malicious PDF documents. Several deterministic solutions have been presented for enhanced detection of new malicious Microsoft Office files, such as BISSAM [48], OfficeMalScanner7, OfficeCat8, Microsoft OffVis9, pyOLEScanner.py10, and Threat Emulation11, and a new machine learning based methodology called SFEM [8] has been presented as well.

2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://searchsecurity.techtarget.com/definition/spear-phishing
4 https://www.paloaltonetworks.com/content/dam/paloaltonetworks-com/en_US/assets/pdf/datasheets/threats/threat-prevention.pdf
5 http://www.f-secure.com/weblog/archives/00001676.html
6 http://www.computerworld.com/article/2517774/security0/pdf-exploits-explode--continue-climb-in-2010.html
7 http://www.reconstructer.org/code.html
8 http://www.aldeid.com/wiki/Officecat
9 http://www.microsoft.com/en-us/download/details.aspx?id=2096
10 http://www.aldeid.com/wiki/PyOLEScanner

1.1.3 Malicious Android Applications

While the detection of malware aimed at PCs (elusive worms, executables, and malicious documents) using ML methods has been intensively researched for nearly two decades, [12] and [13] were the first to discuss malware for smartphones, in 2004. Since then there has been a significant increase in the use of smartphones, dramatically increasing the possibility of cyber-attacks [14] [15]. Likewise, the recent growth of the Android market has been accompanied by increased threats to Android security over the past few years [16-18]; "Secure-List" [19] reported that 9,000 such malware were created during 2012, a figure that indicates that the dominance of the Android operating system likely led to the massive creation of new types of Android malware.

The smartphone domain is an area in which the need for antivirus enhancement is even greater. In contrast to PCs, which rely on advanced detection tools (e.g., sandboxes, ML based solutions, anti-exploitation solutions, etc.) in addition to their basic reliance on antivirus software, smartphones are heavily dependent on antivirus solutions because of the inability to apply advanced detection methods (machine learning solutions based on static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate effective detection of new malware and efficiently and nimbly updated antivirus tools. Antivirus solutions are lightweight and thus more appropriate for smartphones.

In addition, Android antivirus vendors must deal with large quantities of new applications on a daily basis in order to identify and update the antivirus signature repository with new unknown malware instances. The majority of these applications can be collected from application markets, and others can be collected by installing agents on smartphones that upload applications to a central server for analysis. Antivirus vendors must filter out known malware (and its variants), as well as known legitimate applications, utilizing white lists based on the reputation and certificates of applications [20]. Despite this filtering process, a large number of new unknown applications, both benign and malicious, remain.

11 https://www.checkpoint.com/products/threatcloud-emulation-service/

Antivirus vendors use complementary solutions that focus on the applications most likely to be malicious in order to further reduce the number of applications that must be handled manually. Among the complementary solutions that have been proposed for efficiently discovering new Android malware are heuristic engines based on a scoring algorithm [21] and many different detection models based on machine learning techniques [22-37].


1.2. The Problem Statement and Proposed Approach

Complementary solutions, including machine learning based solutions, targeted at the detection of various malware types (computer worms and malicious executables, malicious documents, and malicious Android applications) have enhanced detection and have also demonstrated the ability to detect new unknown malware, an ability not shared by antivirus software. This stems from the generalization capabilities inherent in induced machine learning models. However, to date, such complementary solutions have one significant drawback that renders them inefficient in the long run: in each case, the knowledge store is not frequently and actively updated. This is particularly problematic in light of the mass creation of new malware.

A natural concept drift process [80] [81] exists, specifically in the malware domain [81], as benign files and newly created malware contain new properties and features that have not been seen by the detection model, as well as existing features with very different values than those on which the detection model was trained. These new features may result from different programming languages, compilers, platforms, operating systems, devices, etc. In addition, the malware domain is very dynamic, since attackers continually seek out new ways of attacking, new vulnerabilities that can be exploited, and new targets. These changing parameters eventually affect the static features and the behavior of the analyzed malware and thus significantly reduce the detection capabilities of induced detection models that are not updated and become outdated. None of the existing machine learning based solutions address the crucial need to frequently and efficiently update the detection model and antivirus software.

In this research [1-8], we concentrated on the updatability process and the enhancement of the detection capabilities of the detection model, striving to improve efficiency and speed in these areas. A well enhanced and updated detection model will have a better ability to detect future malware and thus will update the antivirus software more rapidly. It is therefore essential to 'sustain' the classifier constantly and frequently with new files (malicious and benign) in order to maintain detection accuracy over time. However, when a file is classified, the classifier cannot indicate whether it should be acquired as a new and informative sample for the training set. Additionally, in order to add the file to the training set, a labeling operation, usually by a human expert, is required. The labeling process is a very time-consuming task, because each unknown file (suspected of being malicious) has to be analyzed and inspected by an expert; the expert will likely have to perform static and/or dynamic analysis and inspect the file's behavior using a sandbox [13,14] or other behavioral based tools in order to determine its nature. Because there are many files (malicious and especially benign) to inspect, it is not feasible to send them all to a human expert for labeling. All of these difficulties affect the updatability (of both the detection model and the antivirus), which is directly related to one of the most challenging tasks in the domain: the agile detection of new unknown malware.

One of the keys to solving this challenging task is finding an automatic and efficient way to identify the most informative of the many new files, in an effort to minimize (to the greatest extent possible) the number of files sent to the human expert for labeling. Only the most informative labeled files will provide the knowledge required for the updatability of the detection model and ensure its ability to detect previously unknown malicious code. Our research aims to develop a framework that combines practical and efficient solutions for the agile detection of new unknown malware.

In order to meet this challenge we have divided the suggested framework into two main modules. The first is the Detection Module, which integrates the best methods from several relevant domains, such as feature extraction and representation, feature selection, text categorization, information retrieval, and classification algorithms. This module concentrates on collecting and representing the files in the dataset in a way that provides maximal knowledge to the classification algorithm for inducing the optimal detection model. The second is the Updatability Module, in which we propose the active learning (AL) approach to reduce the number of labeled training examples while maintaining high classification accuracy. By integrating AL, the classifier actively indicates the specific new files that should be labeled, i.e., the most informative, the addition of which to the training set will provide the maximal improvement in the detection model and consequently will also update the widely used antivirus software. The Updatability Module is also designed to work at a finer level of granularity, in which it can efficiently select only a small number of instances related to the behavior of a specific analyzed file. By doing this, the misleading and noisy instances of malware's behavior can be filtered out, thus improving detection capabilities; this level of granularity is especially tailored for enhancing the detection of elusive malware such as computer worms.
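The following minimal sketch, using illustrative names only, outlines one acquisition cycle of the Updatability Module described above: the current detection model is induced from the labeled files, an AL criterion selects a small budget of informative unknown files, a human expert labels them, confirmed malware is passed on for signature generation, and the detection model is retrained. The helpers select, expert_label, and update_antivirus_signatures are placeholders standing in for framework components, not actual implementations.

```python
from sklearn.svm import SVC

def update_antivirus_signatures(file_vector):
    """Placeholder: hand a newly confirmed malicious file over for signature creation."""
    pass

def daily_update(labeled_X, labeled_y, unknown_files, select, budget, expert_label):
    """One acquisition cycle (illustrative). `select` is an AL criterion that returns the
    indices of the unknown files to acquire; `expert_label` stands in for manual analysis."""
    clf = SVC(kernel="rbf").fit(labeled_X, labeled_y)   # current detection model
    for idx in select(clf, unknown_files, budget):      # AL: pick the informative files
        label = expert_label(unknown_files[idx])        # costly manual inspection
        labeled_X.append(unknown_files[idx])
        labeled_y.append(label)
        if label == 1:                                  # confirmed malware
            update_antivirus_signatures(unknown_files[idx])
    return SVC(kernel="rbf").fit(labeled_X, labeled_y)  # enhanced detection model
```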

The framework can be instantiated using different AL methods. Two of the methods used are well-known AL methods that have been previously proposed: SVM Simple Margin (Exploration) [95] and Error-Reduction [96]; these were used as a baseline for comparison in the various experiments documented in our research and published papers. Both methods select examples for which the classifier is less confident regarding the true label; in the SVM case, these are the examples that lie closest to the SVM separating hyperplane. Our two new methods are Exploitation and Combination. In contrast to Exploration, Exploitation chooses examples located deep inside the malicious side and farthest from the SVM's separating hyperplane. Our Combination AL method is a two-phase method that combines the principles of Exploration and Exploitation; in the early phase it conducts more Exploration, while Exploitation becomes the dominant strategy in the later phase of the acquisition process.
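The sketch below expresses these selection criteria in terms of the signed distance to the SVM separating hyperplane, assuming a fitted scikit-learn SVC whose positive decision values correspond to the malicious class; the switch point used for Combination is an illustrative assumption rather than the schedule used in the thesis.

```python
import numpy as np

def select_exploration(clf, X_unknown, budget):
    """SVM-Simple-Margin style: the files closest to the separating hyperplane
    (those for which the classifier is least confident)."""
    dist = np.abs(clf.decision_function(X_unknown))
    return np.argsort(dist)[:budget]

def select_exploitation(clf, X_unknown, budget):
    """Files located deep inside the malicious side, i.e., with the largest signed
    distance toward the (assumed positive) malicious class."""
    signed = clf.decision_function(X_unknown)
    return np.argsort(-signed)[:budget]

def select_combination(clf, X_unknown, budget, day, switch_day=5):
    """Two-phase strategy: mostly Exploration in the early phase of the acquisition
    process, Exploitation afterwards (the switch point here is illustrative)."""
    if day < switch_day:
        return select_exploration(clf, X_unknown, budget)
    return select_exploitation(clf, X_unknown, budget)
```

Any of these functions can serve as the selection criterion in the acquisition cycle sketched earlier in this section.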

The framework was thoroughly tested on several different types of malware: computer worms, executables, malicious documents (PDF and docx MS Office files), and malicious Android applications. For each of these applications, sophisticated and specifically tailored feature extraction and dataset creation methodologies were proposed and implemented. We used SVM as the base classifier, and the experiments were carried out using the various SVM kernels. A solid and comprehensive evaluation methodology was used in order to test the framework, both in terms of classification performance (accuracy, TPR, FPR, and AUC) and the number and percentage of malware acquired daily (NOMA/POMA), which are important measures given that the purpose of the framework is to update the antivirus signature repository with new malware on a frequent basis.
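As a brief illustration, the following sketch computes these measures for a single day of such an experiment; the label and score arrays are toy values, and NOMA/POMA are computed here simply as the count and fraction of the acquired files that the expert confirmed as malware.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # expert labels of the test files
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # detection model's predictions
scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # classifier scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                  # true positive rate
fpr = fp / (fp + tn)                  # false positive rate
accuracy = (tp + tn) / len(y_true)
auc = roc_auc_score(y_true, scores)

# NOMA / POMA for one day's acquisition: the expert's verdicts on the selected files.
acquired_labels = np.array([1, 1, 0, 1])
noma = int(acquired_labels.sum())     # number of malware acquired
poma = noma / len(acquired_labels)    # percentage (fraction) of malware acquired
print(tpr, fpr, accuracy, auc, noma, poma)
```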


1.3. Deployment of our Framework

Another key to addressing the challenging task of agile detection of new unknown malware is the efficient and sophisticated deployment of the detection method over strategic nodes in the Internet network. In order to meet this challenge, we strive to expose our framework to as many new files transferred over the Internet as possible, so that most of the new informative files will be acquired; therefore, deployment is defined in such a way as to achieve the largest coverage while minimizing costs by involving as few units as possible. Thus, once a new unknown malware is created and transferred over the network, it will be monitored by the sophisticated deployment of the framework, like a "fly caught in tangled spider webs." The combination of these two components (effective deployment and identification of the most informative new files) will contribute to cleaner network traffic for the hosts.

A comprehensive study by Puzis et al. [82-85] that deals with the efficient deployment of IDSs provides significant insight into this challenge. The study used the "betweenness" centrality algorithm, which is a good heuristic for traffic load, and found that most network traffic can be monitored by listening to only a few strategic nodes over the Internet. In our opinion, in order to achieve the maximal efficiency of our framework, it should be located and deployed at different levels of the network: the higher level should include strategic NSP routers' links in order to prevent propagation of the malicious code and reduce the extent of the damage, as was suggested by Puzis, while the lower level should act as a host-based IDS that protects workstations when the higher levels have not been exposed to the malicious code.
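As a small illustration of this deployment heuristic, the sketch below ranks the nodes of a toy topology by betweenness centrality and keeps only the top-k nodes as monitoring points; the graph, the node names, and the budget k are invented for the example and do not reflect any real NSP topology.

```python
import networkx as nx

# Toy backbone topology: routers/gateways as nodes, links as edges (illustrative only).
g = nx.Graph([("r1", "r2"), ("r2", "r3"), ("r2", "r4"), ("r4", "r5"),
              ("r4", "r6"), ("r3", "r6"), ("r6", "r7")])

centrality = nx.betweenness_centrality(g)      # heuristic for traffic load
k = 2                                          # monitoring budget (strategic nodes)
strategic_nodes = sorted(centrality, key=centrality.get, reverse=True)[:k]
print(strategic_nodes)  # nodes whose links would host the framework's sensors
```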

The deployment will consider routers, gateways to organizations, and also the markets of mobile applications (official and non-official) as part of the strategic nodes in the Internet network. The integration between these strategic nodes and several levels of the framework will strengthen the detection and updatability capabilities.


2. Overview of the Core Papers in the Research

In the course of this research we have published eight papers supporting the efficiency and contributions of our active learning based framework for the malware detection domain (presented in Figure 1). Our research focused on the development of advanced detection frameworks and methods for frequent and enhanced detection of four of the most popular types of malware: computer worms, malicious executables, malicious documents, and malicious Android applications.

The main contributions of the eight publications that comprise this thesis include: improving the detection capabilities of machine learning based solutions by frequently and efficiently enhancing the knowledge stores of the detection model and antivirus tools, reducing the number of files that must be acquired to keep the model updated, and improving the detection model's capabilities by enhancing its robustness through filtering out misleading malware instances.

In order to efficiently update the detection model, the new framework employs active learning techniques that enable the experts to label instances that may better contribute to a more accurate model. A new active learning technique is proposed, which is used to detect various types of malware, including worms, executables, and Android applications. In addition, the new framework is used for detecting malicious documents (PDF and MS Word) that may be used by attackers to inject malware into victims’ computers.

As a proof of concept for the generality of our AL based framework, we have also extended the framework's capabilities so that it will provide solutions in additional domains. We have adapted it to the biomedical informatics domain, in which we successfully enhanced the capabilities of a classification model that is used for condition severity classification, while significantly reducing labeling efforts that can result in a substantial savings, both in the time and financial costs associated with medical experts.

Among the eight published papers [1-8], four [1-4] were published in top peer reviewed journals and form the core of this research. The remaining four papers [5-8] were accepted to additional journals, ranked conferences, and workshops within top tier conferences. In this section, we provide a brief introduction to each of the four core journal papers; the complete papers will be presented in the next section. The other four papers are included in the appendix in the subsection entitled, "Additional Accepted Papers in the Malware Detection Domain."

As can be seen in Figure 1, the topics researched span two domains and several sub-domains. While my primary expertise lies in the malware detection domain, the involvement and expertise of my co-authors enabled me to widen the scope of my research and delve into an entirely new domain, biomedical informatics. For example, the application of the framework to this domain required knowledge of the new field and access to an additional dataset, an area in which the role of the co-authors was invaluable. It is important to note that as this research constitutes my Ph.D. research, I was responsible for all aspects of the research and the experiments that comprise it.

Our four core papers are based on an evolving program of research that was guided by our attentiveness and awareness of upcoming trends in the malware detection domain. Broadly our research progressed as follows. We started with a behavioral active learning based framework [1] for the enhanced detection of elusive computer worms, on the heels of the discovery of the sophisticated and elusive “Stuxnet” malware in the SCADA systems of Iran’s nuclear facilities. After demonstrating improved and more efficient detection of unknown computer worms in our first study, we identified a major gap in the area of detection solutions aimed at another popular type of malware: malicious executables; in this case, a weakness was found in the updatability (or lack thereof) of existing detection solutions. Therefore we enhanced our AL based framework and extended it with the addition of two novel and efficient active learning methods. Based on these changes, the framework provides a solution for frequent and efficient updatability [2] of both the detection model and antivirus software, which is particularly needed in light of the daily creation of new malicious executables.

During that time, a new trend was emerging, and mobile devices (especially Android OS based smartphones) increasingly became attractive targets for Android malware. The amount of Android malware has increased at a significant rate; many unofficial application markets were contaminated with malicious versions of known Android applications, and the contamination also found its way into the official market of Android applications, even affecting Google Play. In 2012, Google presented Bouncer, which comprises machine learning algorithms based on dynamic and basic static analysis of applications uploaded to the market. However, according to [86], it was announced at SummerCon 2012 that more than 20 ways of evading Bouncer had been discovered [87]. The insufficient detection of Android malware, the reliance of Android smartphones on antivirus solutions, and the updatability gap that we also identified in the detection solutions for Android malware led us to enhance the capabilities of our active learning framework and create ALDROID [3]. This new framework outperformed the existing solutions and provides a solution for the enhanced detection of Android malware in the long run.

After providing solutions and enhanced detection of computer worms, malicious executables, and Android malware, our research continued, this time aimed at another emerging malware trend, as APT attacks increased in popularity and became better funded, more sophisticated, and well planned. Organizations increasingly prevent the entrance of executables into their internal networks because of their high risk; thus a new trend was created: instead of sending executables, attackers create malicious documents and attach them to email messages sent to organizations. In this way attackers attempt to penetrate organizations' defenses and perform malicious activities, utilizing social engineering techniques to cause innocent employees to open malicious documents (such as PDF and MS Office files). We applied our expertise and newfound insights toward the goal of enhancing the detection of malicious documents [4], enhancing our AL based framework for use with malicious documents, and created the ALPD [5,6] and ALDOCX [7,8] frameworks, which integrate our newly developed feature extraction methodologies for the efficient detection of malicious documents.

Figure 1 presents the domains and sub-domains to which our framework has been applied during the current Ph.D. research, as well as our related papers, including their reference numbers and ranking details, clustered into journals (green nodes), conferences (orange nodes), and workshops (yellow nodes). The main domains in which our framework was successfully applied appear in the red nodes. Figure 1 is divided into two main sub-diagrams: the upper red node, which represents the core of this study (malware detection within the cyber security domain), and the lower red node, which presents the additional domain (condition severity classification within the biomedical informatics domain). The upper diagram is also divided into four blue nodes that represent sub-domains of malware detection: computer worms, malicious executables, malicious Android applications, and malicious documents. The rightmost sub-domain within malware detection, malicious documents, is sub-divided into two blue nodes, MS Office and PDF files, which are the most popular document types through which cyber-attacks are launched.


[Figure 1 appears here: a diagram of the active learning based framework and the domains in which it was applied. The upper part covers malware detection in the cyber security domain, with sub-domains for computer worms [1], malicious executables [2], malicious Android applications [3], and malicious documents (PDF [4-6] and MS Office docx [7,8]); the lower part covers condition severity classification in the biomedical informatics domain [9-11]. The papers are clustered into journals, conferences, and workshops.]

Figure 1: The domains and sub-domains in which our framework has been used and our related papers (published and in press) during the current Ph.D. research, including their ranking details, and clustering into journals, conferences, and workshops.


We now provide a brief overview of each of our core papers. Our first core paper [1] is entitled, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning." In this paper we aimed to enhance the detection of computer worms by dynamically monitoring and analyzing their behavior at high frequency rates. This research showed that our framework and AL method can efficiently select just a small number of instances related to an analyzed worm's behavior and filter out the misleading and noisy behavioral instances that are common among elusive computer worms, thereby improving the detection of unknown computer worms. Our behavioral analysis of these worms was based on computer measurements extracted from the operating system. We designed a series of experiments to test the new technique by employing several computer configurations and background application activities. In the course of the experiments, 323 computer features were monitored. In addition, we used active learning as a selective sampling method to increase the performance of the detection model, which was improved by between 19% and 25%, thus also improving its robustness in the presence of misleading instances of computer worms.

Our second core paper [2] is entitled, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS." In this paper we introduced a solution addressing the main problem associated with the agile detection of new unknown malware: the updatability gap of both antivirus software and the detection model in the domain of malicious executables. We presented an active learning framework and introduced two novel AL methods that assist antivirus vendors, helping them better focus their analytical efforts by acquiring the files that are most likely malicious. The new AL methods were designed and oriented toward new malware acquisition. Our AL methods outperformed the existing AL method in two respects related to the number of new malware samples acquired daily, which was the core measure in this study. First, on the ninth day of the experiment our best performing AL method, termed Exploitation, acquired approximately 2.6 times more malware than the existing AL method and 7.8 times more than the random selection method. Second, while the existing AL method showed a decrease in the number of new malware samples acquired over ten days, our AL methods showed an increase and daily improvement in the number of new malware samples acquired. Both results point toward increased efficiency that can potentially assist antivirus vendors.

Our third core paper [3] is entitled, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods." In this paper our efforts were directed at the smartphone domain, an area in which the need for antivirus enhancement is even greater than in the PC domain. In contrast to PCs, smartphones are heavily dependent on antivirus solutions because of the inability to apply advanced detection methods (static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate the effective detection of new malware, as well as the efficient and nimble update of antivirus tools. It is not feasible to analyze every new application, so our ALDROID framework selects only the most probable Android malware for labeling. While our framework reduces the number of unknown applications that must be manually analyzed, it strengthens the detection model at the same time by also selecting informative benign applications. Thus the framework addresses the resource limitations of the smartphone, as well as the challenge presented by the sheer volume of unknown applications created daily. Our approach is capable of providing more frequent updates to the detection model, because only a small and manageable set of informative applications is sent to the human expert for inspection and subsequently acquired by the detection model. This is in contrast to heuristic approaches based on scoring algorithms or other types of detection models which are only updated periodically due to the labor intensive process of human expert analysis.

In our framework the updated detection model efficiently updates the antivirus signature repository which, in turn, improves the detection capabilities of the installed and widely used antivirus software within smartphones. The structure of Android applications differs substantially from that of the executable files (including computer worms) within the Windows OS that we previously investigated. Therefore, the results of our previous paper [2] cannot automatically be assumed to hold in the Android OS, since the detection model and AL methods used in this study rely on different dataset characteristics related to the Android applications domain, particularly in terms of the extracted features, the malware distribution, and the attack techniques detected by the detection model. ALDROID combined three important elements. First, we presented a set of general descriptive features for the detection of Android malware, features which are robust and unaffected by obfuscation or transformation evasion techniques. The features are based on the application's static genes and not on the optional operations it might conduct; therefore, the features are also robust against evasion techniques based on delayed malicious operations. Second, from these features we induced a detection model using the SVM machine learning algorithm, and then we applied our malware oriented AL methods in order to leverage our general descriptive set of features and the knowledge of the detection model, so as to frequently enhance the detection model and the antivirus software.

Results indicate that our AL methods outperformed the other solutions, including the existing AL method and the heuristic engine. Our AL methods acquired the largest number and percentage of new malware, while preserving the detection model's detection capabilities (high TPR and low FPR rates). Specifically, our methods acquired more than double the amount of new malware acquired by the heuristic engine and 6.5 times more malware than the existing AL method.

Our fourth core paper [4] is entitled, "Detection of Malicious PDF Files and Directions for Enhancements: A State of the Art Survey." This paper is based on research which proved pivotal to our solid understanding of malicious documents in general and malicious PDF files in particular. Through our comprehensive survey of advanced academic solutions for the detection of PDF malware, we were able to identify the best performing feature extraction methods for malicious PDF files and observe the significant lack of updatability in the detection solutions for PDF malware as well. In this paper we provided comparisons, insights, conclusions, and avenues for future research in order to enhance the detection of malicious PDFs. One of the most important contributions of this paper is revealing and highlighting the correlation between the structural incompatibility of PDF files and their likelihood of maliciousness. By leveraging this correlation, at least 96.5% of the malicious PDF files can be easily filtered out using a simple and deterministic filtering process. The second, and probably more important, contribution is providing a detailed explanation of our active learning based framework for enhancing and supporting the updatability of detection solutions for malicious PDF files.

This paper was followed by our additional peer reviewed journal paper [5] (based on the preliminary results of our research on the ALPD framework, which were presented at the IEEE JISIC conference [6]); in this extended paper [5] we implemented the framework suggested in [4] for efficiently updating detection models of PDF malware using AL methods. Results showed that our AL method, Combination, outperformed all of the other methods, enriching the signature repository of the antivirus with almost seven times more new malicious PDF files, while further improving the detection model's capabilities each day. At the same time, it dramatically reduced security experts' efforts by 75%. Despite this significant reduction, results also indicate that our framework detects new malicious PDF files better than leading antivirus tools commonly used by organizations for protection against malicious PDF files.


Furthermore, it is also worth mentioning our additional paper [7], which was inspired by this important core paper [4]; in [7] we presented the ALDOCX system at the ranked machine learning conference ICMLA-2015 (based on our preliminary results [8] presented in a workshop at the KDD-2015 machine learning conference). In this paper [7], we presented the SFEM feature extraction methodology and designated active learning methods aimed at the accurate detection of new unknown malicious docx files; these methods also efficiently enhance the detection model's capabilities over time and quickly utilize the vast number of documents within an organization. Results show that our active learning methods used only 14% of the labeled docx files within an organization, which led to a reduction of 95.5% in labeling efforts compared to passive learning and SVM-Margin (an existing active learning method). Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and SVM-Margin, thus providing an improved updating solution for the detection model, as well as for the antivirus software widely used within organizations. We also showed that our novel structural feature extraction methodology (SFEM) results in a set of very discriminative and general features of XML based MS Office documents, and that these features can be effectively leveraged by machine learning algorithms to induce an accurate detection model for malicious docx files.
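To give a rough sense of what structural features of XML-based Office documents look like, the sketch below counts element paths inside the XML parts of a docx archive; it only illustrates the general concept and is not the SFEM implementation described in [7], whose path format and selected features differ.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

def structural_paths(docx_path):
    """Count element paths (e.g., 'document/body/p/r/t') in the XML parts of a .docx file."""
    features = Counter()
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue
            stack = [(root, root.tag.split("}")[-1])]   # strip XML namespaces
            while stack:
                elem, path = stack.pop()
                features[path] += 1
                for child in elem:
                    stack.append((child, path + "/" + child.tag.split("}")[-1]))
    return features

# Feature vectors for the classifier would then be built from the most discriminative paths.
```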

To summarize, our fourth core paper [4] was the basis for four more papers [5-8] aimed at enhanced detection of malicious documents. These papers can be found in the appendix subsection entitled, “Additional Accepted Papers in the Malware Detection Domain.”

In addition to our contribution to the malware detection domain, as can be seen in the lower section of Figure 1, we have published three additional papers in the biomedical domain [9-11]. The second of these papers [10] won the best student paper award at a prestigious artificial intelligence conference (AIME-2015, Rank A). The third paper [11] was recently accepted for publication by the Journal of Biomedical Informatics (JBI), a top peer-reviewed biomedical informatics journal. These papers can be found in the appendix subsection entitled "Additional Accepted Papers in the Biomedical Informatics Domain." They present the results and methodology of our recently extended framework as applied to the biomedical informatics domain, a field entirely different from malware detection. In this research we were able to enhance the capabilities of a classification model used for condition severity classification while significantly reducing labeling efforts, which can result in substantial savings,

both in terms of the time and money associated with the efforts of medical experts. As part of the extension of our framework, we developed an additional AL method, called Combination_XA, which is better oriented to the acquisition needs of the biomedical informatics domain. This extension showed that our methods and framework are generic and can provide a solution for a variety of problems in many different domains.

- 31 -

- 32 -

2.1. Research Results

The following is a complete list of the 11 papers that were published and submitted during this research in the cyber-security and biomedical informatics domains, as presented in Figure 1.

[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning," Pattern Analysis and Applications, (2012) 15:459-475.

[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS," Expert Systems with Applications, (2014), http://authors.elsevier.com/sd/article/S095741741400133X.

[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods," Knowledge and Information Systems (2016), 1-39. Accepted on 11 January 2016.

[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, "Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the-Art Survey," Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.

[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici, "Keeping Pace with the Creation of New Malicious PDF Files Using an Active-Learning Based Detection Framework," Security Informatics, 5(1), 1-20, (2016).

[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Assaf Shabatai, Yuval Elovici, "ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations," JISIC, (2014).

[7] Nir Nissim, Aviad Cohen, Yuval Elovici, "Boosting the Detection of Malicious Documents Using Designated Active Learning Methods," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765, doi: 10.1109/ICMLA.2015.52.

[8] Nir Nissim, Aviad Cohen, Yuval Elovici, "Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents," ODDX3 Workshop at the KDD Conference, (2015), Sydney.

[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification," BigCHAT Workshop at the KDD Conference, (2014).

[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "An Active Learning Framework for Efficient Condition Severity Classification," Artificial Intelligence in Medicine, Pages 13-24, Springer International Publishing, AIME (2015). Winner of the Mario Stefanelli Best Paper Award at the AIME 2015 Conference.

[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch, "Improving Condition Severity Classification with an Efficient Active Learning Based Framework," Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464.

- 33 -

2.1.1. Core Papers

In this section we list the four papers which form the core of this research; the list will be followed by the papers themselves. The appendix contains our other accepted papers.

[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning," Pattern Analysis and Applications, (2012) 15:459-475.

[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS," Expert Systems with Applications, (2014), http://authors.elsevier.com/sd/article/S095741741400133X.

[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods," Knowledge and Information Systems (2016), 1-39. Accepted on 11 January 2016.

[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, "Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the-Art Survey," Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.

- 34 -

Pattern Anal Applic (2012) 15:459–475 DOI 10.1007/s10044-012-0296-4

INDUSTRIAL AND COMMERCIAL APPLICATION

Detecting unknown computer worm activity via support vector machines and active learning

Nir Nissim • Robert Moskovitch • Lior Rokach • Yuval Elovici

Received: 8 December 2009 / Accepted: 5 September 2012 / Published online: 25 September 2012 © Springer-Verlag London Limited 2012

Abstract: To detect the presence of unknown worms, we propose a technique based on computer measurements extracted from the operating system. We designed a series of experiments to test the new technique by employing several computer configurations and background application activities. In the course of the experiments, 323 computer features were monitored. Four feature-ranking measures were used to reduce the number of features required for classification. We applied support vector machines to the resulting feature subsets. In addition, we used active learning as a selective sampling method to increase the performance of the classifier and improve its robustness in the presence of misleading instances in the data. Our results indicate a mean detection accuracy in excess of 90%, and an accuracy above 94% for specific unknown worms using just 20 features, while maintaining a low false-positive rate when the active learning approach is applied.

Keywords: Malware detection · Supervised learning · Active learning

N. Nissim · R. Moskovitch · L. Rokach · Y. Elovici
Department of Information Systems Engineering, Ben Gurion University of the Negev, P.O.B. 653, 84105 Beer-Sheva, Israel
e-mail: [email protected]

R. Moskovitch
e-mail: [email protected]

Y. Elovici
e-mail: [email protected]

N. Nissim · R. Moskovitch · L. Rokach · Y. Elovici
Deutsche Telekom Laboratories, Ben Gurion University, Beer-Sheva, Israel
e-mail: [email protected]

1 Introduction

The detection of malicious code (malcode) transmitted over computer networks has been researched intensively in recent years. Worms, a particularly widespread malcode, proactively propagate across networks while exploiting vulnerabilities in operating systems or in installed programs. Other types of malcode include computer viruses, Trojan horses, spyware, and adware.

Nowadays, excellent technology (i.e., antivirus software packages) exists for detecting known malicious code. Typically, antivirus software packages inspect each file that enters the system, looking for known signs (signatures) that uniquely identify a malcode. Antivirus technology cannot, however, be used for detecting an unknown malcode, since it is based on prior explicit knowledge of malcode signatures. Following the appearance of a new worm instance, operating system providers provide a patch to deal with the problem, while antivirus vendors update their signature base accordingly. This solution has obvious demerits, however, since worms propagate very rapidly. By the time the antivirus software has been updated with the new worm, very expensive damage has already been inflicted [1].

Intrusion detection, termed a network-based intrusion detection system (NIDS), is commonly implemented at the network level. NIDS has been substantially researched but remains limited in its detection capabilities (like any detection system). In order to detect malcodes that have slipped through the NIDS at the network level, detection operations are performed locally by implementing a host-based intrusion detection system (HIDS). To monitor activities at the host level, HIDS usually compares various states of the computer, such as the changes in the file system, using checksum comparisons. The main drawback of this approach is the ability of malcodes to disable antiviruses. The main problem in using HIDS, however, is detection knowledge maintenance, which is usually performed manually by the domain expert. This is apt to be time-consuming and expensive.

Recent studies have proposed methods for detecting unknown malcode using machine-learning techniques. Given a training set of malicious and benign binary executables, a classifier is trained to identify and classify unknown malicious executables as malicious [2–4]. While this approach is potentially a good solution, it is not complete. It can detect only executable files, and not malcodes located entirely in the memory, such as the Slammer worm [5]. In a previous research report, we presented a new method for detecting unknown computer worms [6–8]. The underlying assumption was that malcode within the same category (e.g., worms, Trojans, spyware, adware) share similar characteristics and behavior patterns and that these patterns can be induced using machine-learning techniques. By continuously monitoring and matching the computer's vital signs (such as CPU and hard disk usage) against the previously induced malcode patterns, we can gain an indication as to whether the computer is infected. While this approach does not prevent infection, it enables its fast detection. Relevant decisions and policies, such as disconnecting a single computer or a cluster, can then be implemented.

The goal of this study is to assess the viability of employing support vector machines (SVM) in an individual computer host to detect unknown worms based on their behavior (measurements), and to examine whether selective sampling can improve the detection performance. The behavior of some of the worms is unstable, so that some of the time they tend to behave as a legitimate application does. Thus, by monitoring their behavior we would derive instances that would negatively affect the model (hereafter we will call these instances misleading instances). The selection of the right instances to be included in the training set is therefore also very challenging.

This paper makes four contributions to our armory for combating malware:

1. Development of a selective sampling procedure using active learning: Active learning is commonly used to reduce the amount of labeling required from an expert (a time-consuming task). The Oracle is actively asked to label specific examples in the dataset that the learner considers the most informative, based on its current knowledge, which eventually reduces the acquisition cost. In this study, all the training examples are labeled in advance. However, the goal is to select intelligently the best examples that will increase the accuracy by filtering misleading or non-informative instances.
2. Adaptation of SVM for malware detection: In our previous study [6] we used the algorithms decision trees, naïve Bayes, Bayesian networks, and neural networks. In this paper, we study the performance of SVM. Specifically, we examine which of the three kernel functions (linear, polynomial, RBF) is most suitable for detecting unknown worms. We argue that SVM will achieve better results when using active learning as a selective sampling method.
3. Comparison of feature selection methods for improving malware detection: We examine experimentally which feature selection method (if any) best fits the worm detection task using SVM.
4. We investigate the contribution of specific worms to the detection performance and examine if all worms are equally informative.

The rest of the article is structured as follows. Section 2 surveys the relevant background for this work, while Sect. 3 describes the SVM, relevant kernel functions and active learning methods used in this study. Section 4 discusses the research question, corresponding experimental plan, and evaluation results. Finally, in Sect. 5 we conclude the paper with a discussion of the evaluation results, conclusions, and future work.

2 Background and related work

2.1 Malicious code and worms

The term 'malicious code' (malcode) refers to a piece of code, not necessarily an executable file, the intention of which is to harm. In [9], the authors define a worm according to how it can be distinguished from other types of malcode: (1) network propagation or human intervention—worms propagate actively over a network, while other types of malicious codes, such as viruses, commonly require human activity to propagate; (2) standalone or file infecting—while viruses infect a host, a worm does not require a host file and sometimes does not even require an executable file since it may reside entirely in the memory. This was the case with the Code Red worm [10].

Worm developers have different purposes and motivations [11]. Some are motivated by experimental curiosity (ILoveYou worm [12]), while pride and the desire for power lead others to flaunt their knowledge and skill through the harm caused by the worm. Still other motivations are commercial advantage, extortion and criminal gain, random and political protest, and terrorism and cyber warfare. The wide variety of motivations that we find among worm developers indicates that computer worms will be a

123 Pattern Anal Applic (2012) 15:459–475 461 long-lasting phenomenon. To address the challenge posed request the true class label for a certain number of by worms effectively, as much meaningful experience and instances in the pool. Other approaches focus on the knowledge as possible should be extracted from known expected improvement of class entropy [26], or mini- worms by analyzing them. Today, given the number of mizing both labelling and misclassification costs [27]. known worms, we have an excellent opportunity to learn Although in our problem all the examples are actually from these examples. We argue that active learning labeled, we decided to apply AL as the selective sampling methods can be very useful for learning and generalizing approach for choosing the most informative examples to from previously encountered worms in order to detect reduce the number of the misleading instances in the previously unseen worms effectively. training data. In Sect. 3.4, we explain how AL can be used to achieve this goal. 2.2 Detecting malicious code using supervised learning techniques 3 Methods Supervised learning techniques have already been used for detecting malicious codes and creating protection against 3.1 Dataset creation them. For example, in [13], the authors proposed a framework consisting of a set of algorithms for extracting Since no benchmark dataset exists that we could use for anomalies from a user’s normal behavior patterns. A this study, we created our own. A controlled network with normal behavior is learned and any abnormal activity is various computers (configurations) was deployed into considered intrusive. In order to determine what constitutes which we could inject worms, and monitor and log the normal, the authors suggest several techniques, such as computer operating system features using a dedicated classification, meta-learning, association rules, and fre- agent. In order to create the datasets, we isolated the local quent episodes. The extracted knowledge forms the basis of network of computers, simulating a real Internet network an anomaly-based intrusion detection system. that allowed worms to propagate. A naı¨ve Bayesian classifier was suggested in [14], We designed several experiments centered around eight referring to its implementation within the ADAM system, datasets, which we created based on three aspects: hard- developed in 2001 [15]. The classifier consists of three ware configuration, background applications, and user main parts: (a) a network data monitor listening to TCP/IP activities. Using this model, we designed several experi- protocol; (b) a learning engine for acquiring association ments to achieve our research goals: rules from the network data; and (c) a classification module that classifies the nature of the traffic into two possible a. To find out whether a classifier, trained on data classes, normal and abnormal, that can later be linked to collected from a computer with a certain hardware specific attacks. Other soft computing algorithms proposed configuration and specific background activity, is for detecting malicious code include: artificial neural net- capable of classifying correctly the behavior of a works (ANN) [16–19]; self-organizing maps (SOM) [20], computer that has other configurations. and fuzzy logic [21–23]. b. To select the minimal subset of features required to classify new cases correctly. 
Reducing the number of 2.3 Active learning features used in the model implies that less monitoring effort would be needed in an operational system. Labeled examples are crucial when training classifiers. In the course of experimentation, we applied four However, in certain domains the labeling operation is classification algorithms on the given datasets in a varied costly and time-consuming. Active learning (AL) [24] series of experiments in order to detect, first, known worms refers to learning policies, in which a learner actively in different environments and, later, completely new, selects unlabeled instances for labeling, based on some previously unseen worms. criterion. The objective of most AL methods is to mini- Figure 1 depicts the process that was used in this study. mize the cost of acquiring the labeled instances needed The upper part refers to the training phase. We collected a for inducing an accurate model. In this paper, we scruti- set of worms and used them to infect the hosts in the nize another aspect of AL. Instead of minimizing the controlled environment. An agent, which was installed on acquisition costs, our objective is to increase the gener- each host, then recorded its behavior. Based on the col- alization accuracy using an approach that disregards lected dataset, we trained the classifiers. The bottom part of misleading instances. Several AL frameworks are pre- Fig. 1 refers to the test phase. In this phase, we examined sented in the literature. In pool-based active learning [25], whether the induced classifier can be used to identify the the learner has access to a pool of unlabeled data and can existence of an unknown worm. 123 462 Pattern Anal Applic (2012) 15:459–475

Fig. 1 Outline of the train phase and the test phase. The worms are injected into the computers, which are monitored. Features are extracted and a SVM classifier is induced. In the test step the monitored features are provided to the classifier, which classifies whether there is worm activity or not

3.1.1 Environment description Trojans for installation, in parallel, on the distribution process of the network; others focused entirely on distri- The laboratory network consisted of seven computers, bution. Another feature that we desired to obtain was that which contained heterogenic hardware, and a server the worms would have different strategies for IP scanning computer simulating the Internet. We used the Windows that would result in varying communication behaviors, performance counters1 which enabled us to monitor system CPU consumption, and network usage. While all the features that appeared in the following categories (the worms were different, we wanted to find common char- number of features in each category appears in parentheses): acteristics, which could be used to detect an unknown internet control message protocol (27), internet protocol worm. We briefly describe here the main characteristics of (17), memory (29), network interface (17), physical disk the five worms included in this study. The information is (21), process (27), processor (15), system (17), transport based on the virus libraries on the Web2,3,4 control protocol (9), thread (12), user datagram protocol (5). In addition, we used VTrace [28], a software tool that can 3.1.3 W32.Dabber.A be installed for monitoring purposes on a PC running Windows. VTrace collects traces of the file system, the This worm randomly scans IP addresses and uses the network, the disk drive, processes, threads, inter-process W32.Sasser.D worm to propagate and open the FTP server communication, cursor changes, etc. The Windows perfor- in order to upload itself to the vicitom’s computer. The mance counters were configured to measure the features worm registers itself for implementation at the next user every second and to store them in a log file as a vector. login (human-based activation). It drops a backdoor, which VTrace stored, time-stamped events were aggregated into listens in on a predefined port. This worm is distinguished the same fixed intervals, and merged with the Windows by its use of an external worm in order to propagate. performance log files. This body of data eventually con- sisted of a vector of 323 features collected every second. We 3.1.4 W32.Deborm.Y worked with this granularity because these loggers’ most granular level was 1 s. Larger time windows, in which we W32.Deborm.Y is a self-carried worm that prefers local IP could aggregate the measurements over longer time periods, addresses. It registers itself as an MS Windows service and might have been too slow for worm activity detection. is implemented upon user login (human-based activation). This worm contains and implements three Trojans as a 3.1.2 Injected worms payload: Backdoor.Sdbot, Backdoor.Litmus, and Trojan. KillAV. We chose this worm because of its heavy payload. When selecting worms for injection, we tried to include every variety. Some of the worms had a heavy payload of 2 Symantec - http://www.symantec.com. 3 1 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/ Kasparsky - http://www.viruslist.com. counter/counters2_lbfc.asp. 4 Macfee - http://vil.nai.com. 123 Pattern Anal Applic (2012) 15:459–475 463

3.1.5 W32.Korgo.X The data were collected in the presence or absence of background application and user activity in each of the This is a self-carrying worm that uses a completely random hardware configurations. We therefore had three binary method for scanning IP addresses. It is self-activated and aspects, which resulted in eight possible feature-collecting tries to inject itself via a new thread of MS Internet conditions, shown in Table 1, representing a variety of Explorer. It contains a payload code that enables it to dynamic computer configurations and usage patterns. Each connect to predefined websites in order to receive orders or dataset contained monitored instances of all the five download newer worm versions. injected worms separately, and instances of normal com- puter behavior without any injected worm. Each instance 3.1.6 W32.Sasser.D was labeled with the relevant worm (class), or ‘none’ for ‘‘clean’’ instances; Each worm was monitored for a period W32.Sasser.D has a preference for local address optimi- of 20 min with a resolution of 1 s. Thus, each instance zation while scanning the network. It divides its time, contained a vector of measurements that represents a 1-s approximately half and half, between scanning local snapshot. Accordingly, each dataset contained a few addresses and random addresses. In particular, it opens 128 thousand such labeled instances. Worms and legitimate threads for scanning the network. This requires heavy CPU applications were monitored in different configurations consumption, as well as significant network traffic. It is a (computer hardware configuration, existence of back- self-carried worm and uses a shell to connect to the ground application and also existence user-activity). The infected computer’s FTP server and to upload itself. outcome of this monitoring process was features that represent the application’s (worm/non worm) behavior. 3.1.7 W32.Slackor.A Some of the worms tend to behave in one environment similarly to a legitimate application in another environ- This is a self-carried worm that propagates by exploiting ment; similarly, a legitimate application might be per- MS Windows’ sharing vulnerability to propagate. The ceived as non legitimate when its behavior is monitored in worm registers itself for execution upon user login. It different environments. Thus, these cases are also a source contains a Trojan payload and opens an IRC server on the of misleading instances in the data. In order to derive a infected computer in order to receive orders. training set that included applications with distinctive behavior in any environment, we chose to disregard 3.1.8 Computer measurements applications whose behavior is not stable in all the environments. We examined the influence of computer hardware config- uration, applications running in the background, and user 3.2 Feature selection activity. In machine-learning applications, the large number of 1. Computer hardware configuration: We used two dif- features in many domains presents a huge challenge. ferent configurations. Both ran on Windows XP, which Typically, some of the features do not contribute to the is considered the most widely used operating system, accuracy of the classification task and may even hamper it. 
having two configuration types: the ‘‘old’’ configura- Moreover, in our approach, reducing the amount of tion has a Pentium 3,800 Mhz CPU, a bus speed of 133 Mhz, and 512 Mb memory; the ‘‘new’’ configu- ration has a Pentium 4 3 Ghz CPU, a bus speed of Table 1 The three aspects resulting in eight datasets, representing a 800 Mhz, and 1 Gb memory. variety of feature collecting conditions of a monitored computer 2. Background application: We ran an application that Computer Background application User activity Dataset name affects mainly the following features: processor object, processor time (usage of 100 %); page faults/s; Old No No o physical disk object; average disk bytes/transfer avg Old No Yes ou disk bytes/write, and disk writes/s. Old Yes No oa 3. User activity included several applications, among Old Yes Yes oau them: browsing, downloading, and streaming opera- Old Yes Yes oau tions through Internet Explorer, Word, Excel, chat New No Yes nu through MSN messenger, and Windows Media Player. New Yes No na These activities were implemented in such a way as to New Yes Yes nau imitate user activity in a scheduled way.

123 464 Pattern Anal Applic (2012) 15:459–475 features while maintaining a high level of detection overcome a bias in the information gain (IG) measure, and accuracy is crucial for meeting computer and resource measures the expected reduction of entropy caused by consumption requirements for the monitoring operations partitioning the examples according to a chosen feature. (measurements) and the classifier computations. This state Given entropy E(S) as a measure of the impurity in a can be achieved using the feature selection technique. collection of items, it is possible to quantify the effec- Since this is not the focus of this paper, we will describe tiveness of a feature in classifying the training data. the feature selection preprocessing only very briefly. In Equation 3 presents the entropy of a set of items S, based order to compare the performance of the different kernels on C subsets of S (for example, classes of the items), in the SVM, we used the filter approach, which is applied presented by Sc. IG measures the expected reduction of on the dataset and is independent of any classification entropy caused by partitioning the examples according to algorithm (unlike wrappers, in which the best subset is attribute A, in which V is the set of possible values of A,as chosen using an iterative evaluation experiment). Using shown in Eq. 2. These equations refer to discrete values; filters, a measure was calculated to quantify the correlation however, it is possible to extend them to continuous valued of each feature with the class, in our case, the presence or attribute. absence of worm activity. Each feature received a rank X j Sv j representing its expected contribution in the classification IGðS; AÞ¼EðSÞ EðSvÞð2Þ j S j task. Finally, the top ranked features were selected. v2VðAÞ X j S j j S j EðSÞ¼ c log c ð3Þ 3.2.1 Feature ranking metrics S 2 S c2C j j j j We used three feature metrics, which resulted in a list of The IG measure favors features having a high variety of ranked features for each metric and an ensemble incorpo- values over those with only a few. GR overcomes this rating all three of them. We used chi-square (CS), gain problem by considering how the feature splits the data ratio (GR) and RELIEF implemented in the WEKA envi- (Eqs. 4, 5). Si are d subsets of examples resulting from ronment [29] and their ensemble. partitioning S by the d-valued feature A. IGðS; AÞ 3.2.2 Chi-square GRðS; AÞ¼ ð4Þ SIðS; AÞ

Xd Chi-square measures the lack of independence between a j S j j S j SIðS; AÞ¼ i log i ð5Þ feature f and a class ci (such as W32.Dabber.A) and can be j S j 2 j S j compared to the chi-square distribution with one degree of i¼1 freedom to judge extremeness. Equation 1 shows how the 3.2.4 Relief chi-square measure is defined and computed, where N is the total number of documents, f refers to the presence of Relief [32] estimates the quality of the features according the feature (and f its absence), and ci refers to its mem- to how well their values distinguish between instances that bership in ci. P(f, ci) is the probability that the feature f are near each other. Given a randomly selected instance x, occurs in ci and the probability Pðf ; ciÞ is the probability from a dataset s with k features, Relief searches the dataset that the feature f does not occur in ci. Similarly, Pðf ; ciÞ for its two nearest neighbors from the same class, an action and Pðf; ciÞ are the probabilities that the feature does or termed ‘‘nearest hit H’’, and from a different class, referred does not occur in a file that is not labeled to ci, respec- to as ‘‘nearest miss M’’. The quality estimation W[Ai]is tively. P(f) is the probability that the feature appears in a stored in a vector of the features Ai, based on the values of file, and PðfÞ is the probability that the feature does not a difference function dif f() given x, H and M as shown in appear in the file. P(c ) is the probability that a file is Eq. 6. i 8 labeled to c , and Pðc Þ is the probability that a file is not to i i

Gain ratio (GR) was originally presented in 1993 [30]in Instead of selecting features based on feature selection the context of decision trees [31]. It was designed to methods, one can use the ensemble strategy (see for

123 Pattern Anal Applic (2012) 15:459–475 465 instance [33–35]), which combines the feature subsets that separation of the examples. Note that when the kernel are obtained from several feature selection methods. Spe- function satisfies Mercer’s condition [38], then K can be cifically, we combine several methods by averaging the written as shown in Eq. (8), where U is a function that feature ranks as shown in Eq. 7: maps the example from the original feature space into P k j higher dimensional space, while K captures the ‘‘inner j¼1 rank ðfiÞ rankðfiÞ¼ ð7Þ product’’ between the mappings of examples x1, x2. For the k general case, the SVM classifier will be in the form shown where fi is a feature and filter is one of the k filtering in Eq. (9). Note that n is the number of examples in the (feature selection) methods. Specifically in our case k = 3. training set. Eq. (10) defines W. Kðx ; x Þ¼Uðx ÞUðx Þð8Þ 3.3 Support vector machines 1 2 1 2 Xn f x w U x a y K x ; x We employed the SVM classification algorithm [36] using ð Þ¼ ð Þ¼ i i ð i Þð9Þ 1 three different kernel functions in a supervised learning Xn approach. We now briefly introduce the SVM classification w ¼ aiyiUðxiÞð10Þ algorithm and the principles and implementation of the 1 active learning method we used in this study. SVM is a binary classifier that finds a linear hyperplane that separates The use of kernel functions, often referred to as the the given examples into the two given classes. SVM is kernel trick [39], is of great importance. Equation (8) means known for its capability to handle a large amount of fea- that inner products in the higher dimensional space can be tures, such as text. We used the SVM-light implementation evaluated simply using the kernel function; it is therefore [37], given a training set in which an example is a vector not necessary to work explicitly in the higher dimensional whenever only inner products are required. Therefore, the xi ¼hf1; f2; ...fni labeled by yi = {-1, ?1} where fi is a feature. problem that arises from the high dimensional feature space The SVM attempts to specify a linear hyperplane that has is alleviated, because it allows the computations to take a maximal margin, defined by the maximal (perpendicular) place in the original feature space of the problem, which distance between the examples of the two classes. Figure 2 involves the computation of inner products in Eq. (8). After illustrates a two-dimensional space, in which the examples projecting the examples into the higher dimension space, are located according to their features; the hyperplane splits the SVM tries to identify the optimal hyperplane that them according to their label. The examples lying closest to separates the two classes. Logically there can be more than the hyperplane are the ‘‘supporting vectors’’. W, the normal one separating hyperplane for a specific projection of a of the hyperplane, is a linear combination of the most dataset; therefore, as a criterion of selection, the one important examples (supporting vectors), multiplied by maximizing the margin is selected in an attempt to achieve a Lagrange multipliers (alphas). better generalization capability in order to increase the Since the dataset in the original space often cannot be expected accuracy. separated linearly, a kernel function K is used. 
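For reference, the feature-ranking measures described in Sect. 3.2 above, written out in standard notation. Equations 1 and 6 are given in their usual textbook forms (the chi-square statistic for a binary feature and class, and the Relief weight update over m sampled instances), which is an assumption rather than a verbatim transcription; the remaining formulas follow the definitions in the text.

\[
\chi^2(f, c_i) \;=\; \frac{N\,\bigl[\,P(f, c_i)\,P(\bar{f}, \bar{c}_i) - P(f, \bar{c}_i)\,P(\bar{f}, c_i)\,\bigr]^2}{P(f)\,P(\bar{f})\,P(c_i)\,P(\bar{c}_i)} \tag{1}
\]
\[
IG(S, A) \;=\; E(S) \;-\; \sum_{v \in V(A)} \frac{|S_v|}{|S|}\, E(S_v) \tag{2}
\]
\[
E(S) \;=\; -\sum_{c \in C} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|} \tag{3}
\]
\[
GR(S, A) \;=\; \frac{IG(S, A)}{SI(S, A)} \tag{4}
\]
\[
SI(S, A) \;=\; -\sum_{i=1}^{d} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \tag{5}
\]
\[
W[A] \;\leftarrow\; W[A] \;-\; \frac{\mathrm{diff}(A, x, H)}{m} \;+\; \frac{\mathrm{diff}(A, x, M)}{m} \tag{6}
\]
\[
rank(f_i) \;=\; \frac{1}{k} \sum_{j=1}^{k} rank^{\,j}(f_i) \tag{7}
\]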
Using a Since the kernel function is derived from the theoretic kernel function, the SVM actually projects the examples basis of SVM, one should select a kernel function that has into a higher dimensional space in order to create linear the appropriate parameter configurations, as was empiri- cally demonstrated in [37]. Each kernel function creates a different separating plane in the original space as demonstrated in Figs. 3 and 4. Commonly, the kernel functions used are the Polynomial and RBF kernel. One should note that the performance of the kernel also depends on the true data distribution, which is usually unknown, and thus one should scrutinize dif- ferent kernels in order to determine the best kernel for the specific problem and dataset.
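The SVM decision function and the kernel trick referred to in Sect. 3.3 (Eqs. 8–10), restated with Φ the implicit feature mapping, α_i the Lagrange multipliers, y_i ∈ {−1, +1} the labels, and n the number of training examples, as defined in the text:

\[
K(x_1, x_2) \;=\; \Phi(x_1) \cdot \Phi(x_2) \tag{8}
\]
\[
f(x) \;=\; w \cdot \Phi(x) \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) \tag{9}
\]
\[
w \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, \Phi(x_i) \tag{10}
\]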

3.3.1 Polynomial kernel

The polynomial kernel creates values of degree p, where Fig. 2 SVM that separates the training set into two classes with the output depends on the direction of the two vectors maximal margin (examples x1, x2), as shown in Eq. 11, in the original 123 466 Pattern Anal Applic (2012) 15:459–475 ! kx x k2 Kðx ; x Þ¼exp 1 2 : ð12Þ 1 2 2r2

As Fig. 4 shows, the same training set is given to the SVM with the RBF and the polynomial kernels. While the SVM with the polynomial kernel (left side) has not determined the hyperplane that separates the training set, the SVM with the RBF kernel (right side) has successfully deter- mined such a one: The RBF has successfully separated the dataset, whereas the polynomial has not. There are several reasons for using the SVM as the classification algorithm. Primarily, SVM was successfully used for the detection of worms as indicated in four previ- Fig. 3 The same training set was given to polynomial (left) and linear ous works [41–44]. Moreover, in the first work [41], it was kernels (right); the polynomial kernel achieved better separation with indicated that ‘‘SVM learns a black-box classifier that is its induced model. The figure was produced by the applet provided in hard for worm writers to interpret’’. In addition, SVM was the LIBSVM software [40] very efficient when combined with AL methods in closely related domains, as was presented in [45]. An additional problem space. A special case of the polynomial kernel, reason is related to the fact that the data contain misleading having P = 1, is actually the linear kernel. data (the misleading issue will be explained in more detail in Sect. 3.4). In a nutshell, misleading data mean that it is Kðx ; x Þ¼ðx x þ 1ÞP ð11Þ 1 2 1 2 hard to find a clear separation in the dataset between the In order to convey the significance of the kernels, we pro- worm and non-worm instances due to the similarity vided an explanation combined with visualizations. As seen between the behaviors of these two classes. Thus, our goal in Fig. 3, the same training set is given to SVM with linear was to detect the worm activity through system behavior, and polynomial kernels. While the SVM with a linear kernel since, on the one hand, the RBF kernel might help because it (right side) has not determined a hyperplane that separates is sophisticated and very sensitive to misleading data, and, the training set, the SVM with the polynomial kernel (left on the other hand, the linear kernel, which is the simplest side) has successfully determined such a one: one, might find a simple separation between the classes.
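As a small, self-contained illustration of the kernel comparison discussed above, the three kernels can be contrasted with scikit-learn on synthetic data. The library, the synthetic dataset, and the parameter values are assumptions made only for this sketch; the study itself used the SVM-light implementation on the monitored worm data.

# Illustrative sketch (assumes scikit-learn); synthetic data stands in for the
# monitored host-behavior measurements.
# Polynomial kernel (Eq. 11): K(x1, x2) = (x1 . x2 + 1)^p, obtained here with
# gamma=1.0 and coef0=1.0. RBF kernel (Eq. 12): K(x1, x2) = exp(-||x1 - x2||^2 / (2*sigma^2)),
# where scikit-learn's gamma corresponds to 1 / (2*sigma^2).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for 20 selected monitoring features; flip_y mimics noisy labels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           flip_y=0.1, random_state=0)

kernels = {
    "linear": SVC(kernel="linear"),
    "polynomial (p=3)": SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0),
    "RBF": SVC(kernel="rbf", gamma="scale"),
}

for name, clf in kernels.items():
    scores = cross_val_score(clf, X, y, cv=10)  # ten-fold cross-validation
    print(f"{name:18s} mean accuracy = {scores.mean():.3f}")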

3.3.2 Radial basis function (RBF) kernel

3.4 Active learning

The second most used kernel is a radial basis function Active learning (AL) is usually used to reduce the effort (RBF), as shown in Eq. 12, in which a Gaussian is used as expended on labeling examples, generally a time-con- the RBF and the output of the kernel depends on the suming and costly task, while obtaining a high accuracy Euclidean distance of examples x1, x2. rate. In AL, the learner actively acquires the labels of the

Fig. 4 The same training set was given to polynomial (left) and RBF kernels (right), the RBF kernel achieved better separation with its induced model. The figure was produced by the applet provided in the LIBSVM software [40]

123 Pattern Anal Applic (2012) 15:459–475 467 most informative instances. In our study, all the examples current actual classifier. Since the real future error rates are were labeled, since we knew which worm was active unknown, the learner utilizes its current classifier in order during the measurements. However, since the data were to estimate these errors, as will now be elaborated. At the ^ misleading, we used AL as a selective sampling technique beginning of an AL trial, an initial classifier PDðyi j xiÞ is that increases accuracy by selecting only those examples trained over a randomly selected initial set D, and for every that lead to a better classifier. Some of the worms behave as candidate example (xi, y1) where xi 2 P and its possible a legitimate application part of the time, and as a conse- ^ labels y1 2 Y, the algorithm induces a new classifier PDðyi j 0 quence, they generate misleading instances. In order to xiÞ trained on the extended training set D = D ? (xi, yi). prevent these instances from effecting the detection accu- One should note that this new classifier actually represents racy of our model negatively, we did not select them for the addition of the new example with a specific label into inclusion in the training set. The worms are not always the training set—and these are the classifiers by which the active and even when active they do not always behave in selection criterion is being calculated in order to select the an illegitimate way. Thus, according to their monitored most suitable examples. behavior in this period, they seem to act like a non-worm X X 1 ^ ^ instances. This creates misleading instances in the dataset E ^ ¼ P 0 ðy j xÞ j log P 0 ðy j xÞ j P 0 D ðyijxiÞ D ðyijxiÞ D ðyijxiÞ P and makes their detection much more difficult. The mis- x2P y2Y leading data here were a greater problem; in the security ð13Þ X domain, worms are created in such a sophisticated way that ^ ^ SExi ¼ PDðyi j xiÞEP ð14Þ they behave similarly to a legitimate application in order to D0ðy jx Þ y2Y i i make their detection harder. Thus, monitoring worm behavior creates many instances (snapshots) that are very The future expected generalization error of the new clas- similar to non-worm instances and are therefore considered sifier is then estimated using the entropy of the new induced as misleading instances that confuse the SVM. This phe- classifier itself, averaged over j P j, as given in Eq. 13. nomenon is called ‘‘malicious noise’’, as presented in [46]. From Eq. 13, it can be understood that the error calculation In most other domains, the misleading instances are not is being done over all the examples in the pool, and the more ^ purposefully created, but exist naturally. Here in worm confident the new classifier P 0 is in knowing x’s true D ðyijxiÞ detection, the task is more complicated since the mislead- label, the lower is the expected error. For example, ing data are inherent in the class, and even present in a ^ P 0 ðy j xÞ¼1 means that the classifier is confident that D ðyijxiÞ larger amount in the class we want to detect. Here we used ^ y is the true label of x; thus, error = 0, and P 0 ðy j xÞ¼ the AL idea to select the most informative instances among D ðyijxiÞ the existing ones so that the misleading ones would not be 0 means it is confident that y is not the true label of x and ^ error = 0, while P 0 ðy j xÞ¼0:5 means that the clas- selected, as was done in previous work [47] and discussed D ðyijxiÞ in other work [48]. 
sifier is most uncertain that y is the true label of x, and thus Misleading instances usually create confusion in the the error is maximal. Note that Eq. 13 actually calculates classification processes and cause degradation in the clas- the expected error of the new classifier over all the examples sifier’s performance. Thus, these misleading instances, in the pool. Yet it is not enough, since it was calculated only generally, will not meet the AL selection criterion, that is, for one label of xi; thus, for each candidate example, Eq. 13 for the error reduction method, the instances whose addition is calculated one time for each possible label and it is to the training set will create a classifier that is more averaged using Eq. 14 (in our context, there are only two confident in its capability to classify unknown instances labels: worm and non worm). Equation 14 is the self-esti- correctly. Selecting the misleading instances has the oppo- mated average error of candidate example xi denoted by site effect: it changes the decision function of the classifier SExi, which is actually a weighted average of the error for so that the classifier is less accurate and thus less confident in all the examples and its possible labels, as shown in Eq. 13. its capability to classify the unknown instances correctly. The example xi with the lowest expected self-estimated

In this study, we implemented an AL, termed ‘‘error- error (SExi ) is chosen and added to the training set. In brief, reduction’’ [26] the objective of which is to reduce the an example is chosen from the pool only if it dramatically expected generalization error. The task is to select the improves the confidence of the current classifier for all the most contributive examples for the classifier from a given examples in the pool. pool of unlabeled examples denoted by P. By estimating the expected error reduction achieved through labeling an 3.5 Evaluation measures example and adding it to the training set, the example that has the maximal error reduction will be selected for true For evaluation purposes, we measured: the true positive labeling and will also be added to the training set of the rate (TPR) measure, which is the number of positive 123 468 Pattern Anal Applic (2012) 15:459–475 instances correctly classified, as shown in Eq. 15; the false which contains all the eight datasets presented in positive rate (FPR), which is the number of misclassified Table 1. In the second option, features were ranked negative instances (Eq. 15); and the Total Accuracy, which separately on each dataset. We then computed the measures the number of absolutely correct classified average rank for each feature; instances, either positive or negative, divided by the entire 3. SVM kernel functions: linear, polynomial and RBF number of instances shown in Eq. 16. kernel; j TP j j FP j 4. Training set (selected from the eight datasets in TPR ¼ ; FPR ¼ ð15Þ Table 1) for inducing the SVM classifier; j TP þ FN j j FP þ TN j 5. Test set (selected from Table 1) for evaluating the j TP þ TN j Total accuracy ¼ ð16Þ SVM classifier. j TP þ FP þ TN þ FN j When the training and test sets were collected under the We also measured a confusion matrix, which depicts the same conditions (i.e., the same computer configuration, number of instances from each class that were classified in background application, and user activity), we employed a each one of the classes (ideally all the instances would be ten-fold cross-validation procedure for evaluating the in their actual class). accuracy. In all other cases, we simply used the entire training/test set for the corresponding training/testing. To analyze the results, we performed a factorial ANOVA. 4 Experiments and results Section 4.1.1 presents the results obtained when the training and test set were collected in the same condition. In the first part of the study, our objective was to identify Section 4.1.2 presents the results for all other cases the best feature selection measure, the best kernel function, and the minimal features required to maintain a high level 4.1.1 Training and test on the same feature collecting of accuracy. In the second part, we wanted to measure the condition possibility of classifying unknown worms using a training set of known worms, and to examine the possibility of Figure 5 presents the accuracy obtained by different fea- increasing the detection performance using selective sam- ture ranking measures on different features subset sizes. pling. In order to elucidate these issues, we designed three The results indicate that for this scenario feature selection experimental plans. We applied 4 different feature selec- reduces the accuracy. 
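Two elements described above can be restated compactly. First, the evaluation measures of Sect. 3.5 (Eqs. 15–16):

\[
TPR \;=\; \frac{|TP|}{|TP| + |FN|}, \qquad FPR \;=\; \frac{|FP|}{|FP| + |TN|} \tag{15}
\]
\[
\text{Total accuracy} \;=\; \frac{|TP| + |TN|}{|TP| + |FP| + |TN| + |FN|} \tag{16}
\]

Second, a minimal sketch of the expected-error-reduction selection criterion of Sect. 3.4 (Eqs. 13–14). The classifier interface (scikit-learn style fit/predict_proba), the two-label assumption, and the representation of the pool as plain Python lists of feature vectors are simplifications made for illustration, not the study's implementation.

# Illustrative sketch of error-reduction sampling (Eqs. 13-14); any classifier
# exposing fit/predict_proba in the scikit-learn style can be plugged in.
import math


def expected_log_loss(model, pool_X):
    """Self-estimated error of a classifier over the pool (Eq. 13): mean entropy."""
    total = 0.0
    for probs in model.predict_proba(pool_X):
        total += -sum(p * math.log(p + 1e-12) for p in probs)  # entropy per example
    return total / len(pool_X)


def select_next_example(model_factory, labeled_X, labeled_y, pool_X, labels=(0, 1)):
    """Pick the pool example whose addition minimizes the self-estimated error (Eq. 14)."""
    base = model_factory().fit(labeled_X, labeled_y)
    base_probs = base.predict_proba(pool_X)  # assumes columns ordered as in `labels`
    best_index, best_score = None, float("inf")
    for i, x in enumerate(pool_X):
        score = 0.0
        for j, y in enumerate(labels):
            # Retrain with the candidate (x, y) added and estimate the new error,
            # weighted by the current classifier's belief P_D(y | x).
            model = model_factory().fit(labeled_X + [x], labeled_y + [y])
            score += base_probs[i][j] * expected_log_loss(model, pool_X)
        if score < best_score:
            best_index, best_score = i, score
    return best_index

Because every candidate requires retraining once per possible label, this criterion is computationally heavy, which is consistent with its use here on a pool of already labeled instances to filter out misleading ones rather than to query an oracle online.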
The GainRatio measure was partic- tion measures to generate 17 training sets such that each ularly less effective in selecting the most relevant features, measure was used to extract the Top 5, 10, 20 and 30.In especially in small subsets (top 5 and top 10). The null- addition, the full feature set was also used. hypothesis, that all feature-ranking measures perform equally and the observed differences are merely random, 4.1 Experiment I—analysis the effect of feature was rejected. Similarly, the null-hypothesis that all feature selection subset sizes perform equally was also rejected. Moreover, the interaction between the feature-ranking measure and We performed a wide set of experiments in order to eval- the top select features was found to be statistically signif- uate how feature selection affects the detection of unknown icant with F(12,752) = 7.9533 and p \ 1%. worms. Specifically, we compared the accuracy perfor- The trend in the results is that the more features added to mance obtained after selecting features using each one of the training set, the higher the accuracy of worm detection the above-mentioned feature-ranking metrics (chi-square, is, where here it is strongly related to the fact that the same gain ratio, relief, and ensemble). Note that in this experi- feature collecting conditions were used for the training and mental study we examined the task of identifying worm or test sets. The more features given to the classifier, the more no-worm activity. In certain scenarios, this binary classi- information the classifier receives. Consequently, due to fication is sufficient. While identifying the exact worm is the fact that over a small set of features some instances considered to be more challenging, we decided to explore might appear similar, while expanding this set of features this direction because of the possibility of obtaining addi- with additional features will probably reveal a difference tional insights. The evaluations were performed on differ- between them (if indeed it exists), these additional features ent conditions, based on the following factors: help the classifier to cope with the misleading instances. In 1. Top n—select top (best) 5, 10, 20, 30 or all features addition, it implies that the features we have monitored are according to the features ranking; relevant and actually help the classifier find patterns in the 2. Feature consolidation (unified, averaged). In the first dataset that are necessary for distinguishing between a option, features were ranked on a unified dataset, worm and non-worm behavior.

123 Pattern Anal Applic (2012) 15:459–475 469

Fig. 5 The interaction between feature-ranking measures and the top selected features. In general the FS reduces the accuracy, but the chi-square was found most effective. The vertical bars denote 0.95 confidence intervals

Fig. 6 The interaction between the kernel function and the top selected features. The vertical bars denote 0.95 confidence intervals

Figure 6 presents the mean accuracy obtained by the lower results (mostly under top 20) implies a complicated different kernel functions using different feature subset dataset that cannot be easily separated linearly: the sizes. The results indicate that the polynomial function instances of the worm and non-worm tend to be similar and performs best in terms of accuracy. The null-hypothesis, thus more features are needed to distinguish between them. that all kernel functions perform equally and the observed Figure 7a shows a comparison of the accuracy obtained differences are merely random, was rejected. Moreover, the by the unified and the averaged consolidation approaches. interaction between the kernel function and the top select The one-way ANOVA indicates that the averaged approach features was found to be statistically significant with significantly outperforms the unified approach with F(7,716) = 7.483 and p \ 1%. F(1,770) = 85.1, p \ 1 %. Nevertheless, a further inves- Again, it can be seen that the amount of features that is tigation, shown in Fig. 7b, indicates that the averaged taken into consideration has a positive influence on the approach outperforms the unified approach only when detection rate, and, although there are differences in the small feature sets are used. For the Top 30 and FULL accuracy rates among the various kernels, the trend is quite feature sets the unified approach was found to be better. similar: a steep incline in the accuracy when moving from Moreover, the interaction effect of the consolidation factor top 5 to top 10, while from top 10 to Full there is a mod- and feature subset size factor was found to be statistically erate increase. That the linear kernel achieved significantly significant with F(3,700) = 10.474 and p \ 1%.

123 470 Pattern Anal Applic (2012) 15:459–475

Fig. 7 a One-way ANOVA of the consolidation method. The averaged approach outperformed the unified approach. The vertical bars denote 0.95 CI. b The interaction between the consolidation method and the top selected features. The averaged consolidation method was better for the small number of features selected (5–20), while the unified was better for the Top30 and FULL. The vertical bars denote 0.95 CI

4.1.2 Training and test on different feature collecting features when using GainRatio included: A_1ICMP: conditions Sent_Echo_sec, Messages_Sent_sec, Messages_sec, and A_1TCP: Connections_Passive and Connection_Failures, We trained each classifier on a single dataset and tested on which are Windows’ performance counters, related to it each one of the remaining seven datasets. Thus, we had a ICMP and TCP, describing general communication set of eight iterations in which a dataset was used for properties. training, and seven corresponding evaluations of each one The null-hypothesis, that all feature-ranking measures of the datasets. In short, there were 56 evaluation runs for perform equally and that the observed differences are each combination. merely random, was rejected. Similarly, the null-hypothe- Figure 8 presents the accuracy obtained by various sis that all feature subset sizes perform equally was also feature-ranking measures on different features subset sizes. rejected. Moreover, the interaction between the feature- As expected, it can be seen that the accuracy level in this ranking measure and the top select features was found to case is significantly lower than the accuracy obtained in be statistically significant with F(12,5350) = 7.479 and Sect. 4.1.1. Contrary to the results appearing in Fig. 5, the p \ 1%. GainRatio measure outperforms other measures. Generally, Figure 9 presents the accuracy obtained by the different it can be seen that the above 20 features do not improve kernel functions using different feature-ranking methods. performance, possibly because the additional features The results indicate that the best combination is obtained correlate less with the classes. The Top5 significant by first using GainRatio for selecting the features and then

123 Pattern Anal Applic (2012) 15:459–475 471

Fig. 8 The interaction between feature ranking measures and the top selected features. The GainRatio outperforms all the methods and selecting more than 20 features reduces the accuracy. The vertical bars denote 0.95 CI

Fig. 9 The interaction between the kernel function and the top selected features. The RBF kernel outperforms the other kernels. The vertical bars denote 0.95 CI

building the SVM using the RBF function. The null- information-gain. However, in our experimental study, on hypothesis, that all kernel functions perform equally and the one hand, in the same feature-collecting conditions that observed differences are merely random, was rejected. (Fig. 5), chi-square provided the best results, while infor- Moreover, the interaction between the kernel function and mation gain yielded the worst results; on the other hand, in the feature-ranking measures was found to be statistically the different feature collecting conditions (Fig. 8), the significant with F(7,5392) = 7.2035 and p \ 1%. dominance relation was reversed. According to the experimental results given in 4.1.1 and 4.1.2, the different feature selection methods performed 4.2 Experiment II—unknown worms detection significantly differently in our context of worm detection. We understand that this is a result of the critical influence To evaluate the capability of the suggested approach to of each relevant feature that was selected. Different classify unknown worm activity, which was our main methods select different features, and it seems that the objective, an additional experiment was performed. In this features we monitored are very diverse, which means that experiment, the training set consisted of (5 - k) worms we have a complementary set of features that are inde- and the testing set contained the k excluded worms; the pendent of each other. Thus, the selection of different none activity appeared in both datasets. This process was features had a significant impact on the results. In addition, repeated for all the possible combinations of the k worms, the results reveal an interesting phenomenon. Previous for k = 1–4. Note that in these experiments, unlike in the research [49] showed a correlation between chi-square and previous section, there were two classes: (generally) worm,

for any type of worm, and none for any other cases. For selecting the features, we used the Top20 features of the GainRatio with unified consolidation. The full training set, when no worm was excluded, contained 7126 instances of worm and non-worm, while the full training set, when 1, 2, 3, or 4 worms were excluded, contained 5881, 4689, 3497, and 2305 instances, respectively.

Figure 10 presents the results when all training data were used. As more worms were included in the training set, a monotonic improvement was observed. However, RBF was less affected by the number of excluded worms. Consequently, we prefer the RBF kernel when there are fewer worms in the training set. In addition, the linear kernel consistently outperformed the polynomial kernel. Note that the number of worms on the x axis refers to the number of excluded worms. The RBF outperformed all the other kernels, while the polynomial kernel performed worse than the linear kernel.

4.3 Experiment 3—using selective sampling

In this set of experiments, we wanted to maximize the performance achieved by the RBF kernel function. Specifically, we examined whether improved results can be achieved by using a selective sampling approach to reduce the number of misleading instances in the training set, which poses a challenge for the RBF. Thus, we employed a selective sampling approach based on AL.

The evaluation was made using the same setup as in the previous section, in which worms excluded from the training set appeared in the test set. Specifically, we repeated the same experiment with the entire set of examples as a baseline using the selective sampling method. For the selective sampling process, first, we randomly selected six examples from each type of class (worms/none); subsequently, in each AL iteration, an additional example was selected. Performances were noted after selecting 50, 100 and 150 additional instances.

Figure 11 presents the obtained results. Two main outcomes can be observed. First, the selective sampling significantly improved the baseline accuracy by more than 10%. Second, actively selecting only 50 instances can be sufficient for obtaining high accuracy. When we used the entire dataset, the accuracy increased as more worms were removed from the training set. This can be explained by the fact that some worms behave most of the time as legitimate applications. Thus, adding all their instances to the training set, without filtering out the confusing instances, might affect the training of the SVM negatively. Another observation that supports these insights is that the fewer the worms excluded, the larger the gap between the results of selective sampling and learning from the full training set, with the largest gap occurring at 1 excluded worm. This means that when more misleading instances exist in the training set, selective sampling is more contributive in filtering out the worm instances that are very similar to the non-worm behavior.

One should note that most of the instances in the test set that are presented to the SVM seem to be legitimate, yet the detection of the worm is done according to its own process in which there are also abnormal instances by which the SVM successfully determines that it is indeed a worm; all the instances that belong to the same process are classified as worm, although they seem to be legitimate.

Figure 12 presents the accuracy obtained when different worms were excluded from the training set. It can be seen that not all worms have the same detection accuracy. The differences were found to be statistically significant with F(4,155) = 7.84 and p < 1%. Further investigation, as shown in Fig. 13, indicated that the exclusion of the

Fig. 10 The performance monotonically increases as fewer worms are excluded (and more worms appear in the training set). The RBF kernel presents a different behavior, in which a high level of accuracy is achieved even when learning from a single worm
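The leave-k-worms-out protocol of Experiment II can be sketched as follows, assuming every monitored instance is tagged with the name of the worm that generated it (or None for the none activity). This is an illustrative reconstruction rather than the original code; in particular, the none instances are shared here by the training and test sets for simplicity, whereas the original experiment drew none activity for both sets.

from itertools import combinations

import numpy as np
from sklearn.svm import SVC

def leave_k_worms_out(X, y, worm_ids, all_worms, k, kernel="rbf"):
    """Train on the instances of the (5 - k) remaining worms plus the 'none'
    activity, test on the k excluded worms plus 'none', and average the
    accuracy over all possible exclusions.

    X: instance-by-feature matrix; y: 1 for worm, 0 for none;
    worm_ids: the worm that produced each instance, or None for 'none'."""
    accuracies = []
    for excluded in combinations(all_worms, k):
        # 'none' instances are shared by both sets in this simplified sketch
        train_mask = np.array([w is None or w not in excluded for w in worm_ids])
        test_mask = np.array([w is None or w in excluded for w in worm_ids])
        clf = SVC(kernel=kernel).fit(X[train_mask], y[train_mask])
        accuracies.append(clf.score(X[test_mask], y[test_mask]))
    return float(np.mean(accuracies))

Evaluating the returned accuracy for k = 1–4 and for each kernel reproduces the kind of comparison summarized in Fig. 10.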


Fig. 11 The selective sampling approach significantly improves accuracy. Generally, an improvement of above 10 % in accuracy was achieved
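A rough sketch of the selective sampling procedure is given below: six randomly chosen instances per class seed the model, and in each AL iteration the instance closest to the current SVM hyperplane is acquired, with models noted after 50, 100 and 150 acquisitions. The closest-to-the-hyperplane criterion is one plausible reading of the simple-margin-style selection used here; the exact acquisition function of the original experiments may differ, and the function names are illustrative.

import numpy as np
from sklearn.svm import SVC

def selective_sampling(X_pool, y_pool, budget=150, seed_per_class=6,
                       checkpoints=(50, 100, 150), random_state=0):
    """Seed with a few random instances per class, then repeatedly acquire the
    pooled instance lying closest to the current SVM hyperplane. y_pool is
    consulted only for the instances that are actually 'labeled'."""
    rng = np.random.default_rng(random_state)
    labeled = []
    for cls in np.unique(y_pool):
        candidates = np.flatnonzero(y_pool == cls)
        labeled.extend(rng.choice(candidates, seed_per_class, replace=False).tolist())
    models = {}
    for acquired in range(1, budget + 1):
        clf = SVC(kernel="rbf").fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(y_pool)), labeled)
        # smallest absolute decision value = closest to the separating hyperplane
        distances = np.abs(clf.decision_function(X_pool[unlabeled]))
        labeled.append(int(unlabeled[np.argmin(distances)]))
        if acquired in checkpoints:
            # note the model retrained on everything acquired so far
            models[acquired] = SVC(kernel="rbf").fit(X_pool[labeled], y_pool[labeled])
    return models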

Fig. 12 The accuracy obtained when different worms are excluded. The W32.Deborm.Y seems to be most informative and crucial for use in the training set. The vertical bars denote 0.95 CI

Fig. 13 The interaction effect of excluding W32.Deborm.Y worm and the tested worm. The vertical bars denote 0.95 CI. The interaction is statistically significant with F(3,2100) = 44.496 and p \ 1%


W32.Deborm.Y decreased the detection performance for both the W32.Deborm.Y and W32.Sasser.D worms. This implies that some worms can be detected by learning patterns sampled from other worms.

5 Conclusion and future work

We have presented a concept for detecting unknown computer worms based on host behavior, using the SVM classification algorithm with different kernels. The results show that the use of SVM in the task of detecting unknown computer worms is possible. We used a feature-selection method that enabled us to identify the most important computer features in order to detect unknown worm activity. This sort of work is currently performed by human experts. We specifically focused on the use of active learning as a selective sampling method to increase the performance of unknown computer worm detection with minimal human effort. We rigorously analyzed the data from the large set of experiments that we performed. In the case of different conditions (in the training set and test set), the GainRatio measure for feature selection was most effective. On average, the Top20 features produced the highest results, and the RBF kernel commonly outperformed the other kernels. For detecting unknown worms, the results show that it was possible to achieve a high level of accuracy; accuracy improved as more worms were included in the training set. To reduce the number of misleading instances in the training set and improve the learning, we show that the AL approach, as a selective method, can improve the performance. Selecting only 50 examples increased the accuracy to about 90%, and to 94% when the training set contained 4 worms, in comparison to about 65% and 75%, respectively. When we selected 100 and 150 examples, no improvement was observed over the performance achieved with 50 examples. Furthermore, we analyzed the importance of using each worm in the training set. We found that a significant decrease in the performance occurred only when the W32.Deborm.Y was excluded from the training set. This can be explained by its nature, which is probably more general in its activity than are the other worms.

We conclude that selective sampling can be used to select the most informative examples from data that include misleading instances. These results are highly encouraging and show that the propagation of unknown worms, which commonly spread intensively, can be stopped in real time. The advantage of the suggested approach is the automatic acquisition and maintenance of knowledge based on inductive learning. This avoids the need for a human expert who may not always be available or familiar with generating rules or signatures.

References

1. Fosnock C (2008) Computer worms: past, present and future. Technical report, East Carolina University
2. Schultz MG, Eskin E, Zadok E, Stolfo SJ (2001) Data mining methods for detection of new malicious executables. In: Proceedings of the 2001 IEEE symposium on security and privacy, SP '01, Washington, DC, USA, p 38
3. Abou-Assaleh T, Cercone N, Keselj V, Sweidan R (2004) N-gram-based detection of new malicious code. In: Proceedings of the 28th annual international computer software and applications conference—workshops and fast abstracts, COMPSAC '04, vol 02. IEEE Computer Society, Washington, DC, pp 41–42
4. Kolter JZ, Maloof MA (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res
5. Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N (2003) Inside the Slammer worm. IEEE Secur Priv 1(4):33–39
6. Moskovitch R, Elovici Y, Rokach L (2008) Detection of unknown computer worms based on behavioral classification of the host. Comput Stat Data Anal 52(9):4544–4566
7. Menahem E, Shabtai A, Rokach L, Elovici Y (2009) Improving malware detection by applying multi-inducer ensemble. Comput Stat Data Anal 53(4):1483–1494
8. Moskovitch R, Stopel D, Feher C, Nissim N, Japkowicz N, Elovici Y (2009) Unknown malcode detection and the imbalance problem. J Comput Virol 5:295–308. doi:10.1007/s11416-009-0122-8
9. Kienzle DM, Elder MC (2003) Recent worms: a survey and trends. In: Proceedings of the 2003 ACM workshop on rapid malcode, WORM '03, ACM, New York, pp 1–10
10. Moore D, Shannon C, Claffy K (2002) Code-Red: a case study on the spread and victims of an internet worm. In: Proceedings of the 2nd ACM SIGCOMM workshop on internet measurement, IMW '02, ACM, New York, pp 273–284
11. Weaver N, Paxson V, Staniford S, Cunningham R (2003) A taxonomy of computer worms. In: Proceedings of the 2003 ACM workshop on rapid malcode, WORM '03, ACM, New York, pp 11–18
12. CERT (2000) Multiple denial-of-service problems in ISC BIND. http://www.cert.org/advisories/CA-2000-20.html. Accessed 23 July 2012
13. Lee W, Stolfo SJ, Mok KW (1999) A data mining framework for building intrusion detection models. In: Proceedings of the 1999 IEEE symposium on security and privacy, pp 120–132
14. Kabiri P, Ghorbani AA (2005) Research on intrusion detection and response: a survey. Int J Netw Secur 1:84–102
15. Barbará D, Wu N, Jajodia S (2001) Detecting novel network intrusions using Bayes estimators. In: Proceedings of the first SIAM conference on data mining
16. Zanero S, Savaresi SM (2004) Unsupervised learning techniques for an intrusion detection system. In: Proceedings of the 2004 ACM symposium on applied computing, SAC '04, ACM, New York, NY, USA, pp 412–419
17. Kayacik HG, Zincir-Heywood AN, Heywood MI (2003) On the capability of an SOM based intrusion detection system. In: Proceedings of the international joint conference on neural networks 2003, vol 3, pp 1808–1813
18. Lei JZ, Ghorbani A (2004) Network intrusion detection using an improved competitive learning neural network. In: Proceedings of the second annual conference on communication networks and services research, pp 190–197
19. Stopel D, Moskovitch R, Boger Z, Shahar Y, Elovici Y (2009) Using artificial neural networks to detect unknown computer worms. Neural Comput Appl 18:663–674


20. Hu P, Heywood MI (2003) Predicting intrusions with local linear models. In: Proceedings of the international joint conference on neural networks 2003, vol 3, pp 1780–1785
21. Dickerson JE, Dickerson JA (2000) Fuzzy network profiling for intrusion detection. In: 19th international conference of the North American Fuzzy Information Processing Society, NAFIPS, pp 301–306
22. Bridges SM, Vaughn RB (2000) Fuzzy data mining and genetic algorithms applied to intrusion detection. In: Proceedings of the national information systems security conference (NISSC), pp 6–19
23. Botha M, von Solms R (2003) Utilising fuzzy logic and trend analysis for effective intrusion detection. Comput Secur 22(5):423–434
24. Cohn DA, Ghahramani Z, Jordan MI (1995) Active learning with statistical models. Technical report, Cambridge, MA, USA
25. Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '94. Springer-Verlag New York, Inc, New York, pp 3–12
26. Roy N, McCallum A (2001) Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the eighteenth international conference on machine learning, ICML '01. Morgan Kaufmann Publishers Inc, San Francisco, pp 441–448
27. Margineantu DD (2005) Active cost-sensitive learning. In: IJCAI, pp 1622–1623
28. Lorch JR, Smith AJ (2000) Building VTrace, a tracer for Windows NT and Windows 2000. Technical report UCB/CSD-00-1093, EECS Department, University of California, Berkeley
29. Francisco A (2006) Witten IH, Frank E: Data mining: practical machine learning tools and techniques. BioMed Eng OnLine 5:1–2
30. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA
31. Mitchell TM (1997) Machine learning. McGraw-Hill, New York
32. Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
33. Rokach L, Maimon O, Arbel R (2006) Selective voting—getting more for less in sensor fusion. IJPRAI 20(3):329–350
34. Rokach L, Chizi B, Maimon O (2007) A methodology for improving the performance of non-ranker feature selection filters. IJPRAI 21(5):809–830
35. Rokach L, Romano R, Maimon O (2008) Negation recognition in medical narrative reports. Inf Retr 11(6):499–538
36. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory, COLT '92, ACM, New York, pp 144–152
37. Joachims T (1999) Making large-scale support vector machine learning practical. In: Advances in kernel methods. MIT Press, Cambridge, pp 169–184
38. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167
39. Aizerman MA, Braverman EM, Rozonoer LI (1964) Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control 25:821–837
40. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol
41. Wang X, Yu W, Champion A, Fu X, Xuan D (2007) Detecting worms via mining dynamic program execution. In: Proceedings of the third international conference on security and privacy in communications networks and the workshops, SecureComm 2007, pp 412–421
42. Masud MM, Khan L, Thuraisingham B (2007) Feature based techniques for auto-detection of novel email worms. In: Proceedings of the 11th Pacific-Asia conference on advances in knowledge discovery and data mining, PAKDD '07. Springer, Berlin, pp 205–216
43. Moskovitch R, Nissim N, Stopel D, Feher C, Englert R, Elovici Y (2007) Improving the detection of unknown computer worms activity using active learning. In: Proceedings of the 30th annual German conference on advances in artificial intelligence, KI '07. Springer, Berlin, Heidelberg, pp 489–493
44. Zhu Y, Wang X, Shen H (2008) Detection method of computer worms based on SVM. Mech Electr Eng Mag 8
45. Moskovitch R, Nissim N, Elovici Y (2009) Malicious code detection using active learning. In: Bonchi F, Ferrari E, Jiang W, Malin B (eds) Privacy, security, and trust in KDD. Lecture notes in computer science, vol 5456. Springer, Berlin, Heidelberg, pp 74–91
46. Servedio RA (2003) Smooth boosting and learning with malicious noise. J Mach Learn Res 4:633–648
47. Chen Y, Zhan Y (2009) Co-training semi-supervised active learning algorithm based on noise filter. In: Proceedings of the 2009 WRI global congress on intelligent systems, GCIS '09, vol 03. IEEE Computer Society, Washington, DC, USA, pp 524–528
48. Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the seventeenth international conference on machine learning, ICML '00. Morgan Kaufmann Publishers Inc, San Francisco, pp 839–846
49. Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res

Expert Systems with Applications 41 (2014) 5843–5857


Novel active learning methods for enhanced PC malware detection in Windows OS

Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici

Telekom Innovation Laboratories at Ben-Gurion University, Department of Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B 653, Be'er Sheva 84105, Israel

Keywords: Malware; Malicious code; Machine learning; Active learning; SVM

Abstract: The formation of new malwares every day poses a significant challenge to anti-virus vendors since anti-virus tools, using manually crafted signatures, are only capable of identifying known malware instances and their relatively similar variants. To identify new and unknown malwares for updating their anti-virus signature repository, anti-virus vendors must daily collect new, suspicious files that need to be analyzed manually by information security experts who then label them as malware or benign. Analyzing suspected files is a time-consuming task and it is impossible to manually analyze all of them. Consequently, anti-virus vendors use machine learning algorithms and heuristics in order to reduce the number of suspect files that must be inspected manually. These techniques, however, lack an essential element – they cannot be daily updated. In this work we introduce a solution for this updatability gap. We present an active learning (AL) framework and introduce two new AL methods that will assist anti-virus vendors to focus their analytical efforts by acquiring those files that are most probably malicious. These new AL methods are designed and oriented towards new malware acquisition. To test the capability of our methods for acquiring new malwares from a stream of unknown files, we conducted a series of experiments over a ten-day period. A comparison of our methods to existing high-performance AL methods and to random selection, which is the naïve method, indicates that the AL methods outperformed random selection for all performance measures. Our AL methods outperformed the existing AL method in two respects, both related to the number of new malwares acquired daily, the core measure in this study. First, our best performing AL method, termed "Exploitation", acquired on the 9th day of the experiment about 2.6 times more malwares than the existing AL method and 7.8 times more than the random selection. Secondly, while the existing AL method showed a decrease in the number of new malwares acquired over 10 days, our AL methods showed an increase and a daily improvement in the number of new malwares acquired. Both results point towards increased efficiency that can possibly assist anti-virus vendors.

1. Introduction

The number of new malicious files is constantly increasing (Schultz, Eskin, Zadok, & Stolfo, 2001). The term "malicious software" (malware) commonly refers to pieces of code, not necessarily executable files, that contain malicious functionality. Malwares are classified into four main categories mainly based on their transport mechanisms: worms, viruses, Trojans, and backdoors (Shabtai, Moskovitch, Elovici, & Glezer, 2009).

Unlike in the past, creating a malware today is relatively easy because malicious code libraries are shared between attackers. For example, on the website http://vx.netlux.org/, malware developers share tools for facilitating the generation of new malwares. Furthermore, attackers have become much more sophisticated, creating malicious code files that seem to act like benign files but are harder to detect, such as the case with Trojan horses (CERT, 1999). Additionally, attackers either detect new vulnerabilities or follow announcements about the latest vulnerabilities before creating malicious codes to exploit these vulnerabilities. Attackers also know that anti-virus vendors distribute patches to their anti-virus packages relatively slowly, providing the virus with a window of opportunity to attack and spread.

Current anti-virus packages rely mainly on signature-based

⇑ Corresponding author. Tel.: +972 086428121. detection of malwares that have already been seen. In addition, E-mail addresses: [email protected] (N. Nissim), [email protected] sets of heuristics and rules are defined to look for generic and dis- (R. Moskovitch), [email protected] (L. Rokach), [email protected] (Y. Elovici). tinguishing characteristics of malwares in unknown files. These http://dx.doi.org/10.1016/j.eswa.2014.02.053 0957-4174/Ó 2014 Elsevier Ltd. All rights reserved. 5844 N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 various methods depend on the presence of a previously detected studies focus on passive learning. We, on the other hand, focus malware. Consequently, a new unknown malware (with new on active learning and present novel AL methods that have been characteristics) will not be detected since its signature does not especially designed to enrich the signature repository of the anti- bear any similarity (including rules and heuristics) to signatures virus with new malwares in the course of several days thus ensur- in the repository. Until an update is distributed damages can be ing that the detection model is as up-to-date as possible. In a series extensive. ‘‘Slammer’’ was the fastest computer worm recorded of experiments that simulate reality, we show that AL can effi- in history. Within 10 min (Moore et al., 2003), it infected about ciently scan an ongoing stream of executable files and select the 75,000 vulnerable hosts. Another harmful and famous worm, most informative ones for manual analysis by human experts. ‘‘Code Red’’, infected 359,000 hosts within 14 h (David Moore, These files are then used to update the detection model, with the Shannon, & Claffy, 2002). malwares being used to update the signatures repository. Accord- In order to accurately and quickly detect the newest malicious ingly, both the detection model and ant-virus are being updated files, anti-virus companies devote considerable effort to preserving daily and detection capabilities improved. the updatability of their signature repository of malicious code We are aware of the limitations of static analysis in malware files. These efforts include monitoring new and unknown malicious detection but circumvent these difficulties by focusing on an AL code files sent over the Internet or using various types of honey- approach rather than specific analysis, which can be either static pots to catch the malicious files (Mokube, 2007; Provos & Holz, or dynamic. The use of AL concepts actually leverages the knowl- 2008). This mission is complicated and time-consuming, especially edge of the detection model, therefore our approach is effective when done manually (Iyatiti Mokube & Adams, 2007). in both analysis instances (Moskovitch, Nissim, & Elovici, 2009; A trivial solution to the problem of finding new malicious files Nissim, Moskovitch, Rokach, & Elovici, 2012). would be to randomly select files from the Internet in order to Our contributions in this paper are twofold: determine whether these files were malicious or not. As a result, the signature repository would be updated and the anti-virus – First, we present a framework for efficiently updating PC anti- application enhanced. However, such a strategy is obviously ineffi- viruses tools on a daily basis. cient and the chances of finding a new unknown malicious file by – Secondly, we present two AL methods for acquiring new mal- random selection is low given that the percentage of malicious files wares. 
The two methods, termed Exploitation and Combination, on the Internet amounts to about 10% (Moskovitch, Stopel, Feher, acquired a larger number of malwares daily compared to the Nissim, & Elovici, 2008; Moskovitch, Stopel, et al., 2009). existing AL method SVM-Margin. Our methods can be used In order to effectively analyze every day tens of thousands of for any domain for which an acquisition of specific class is new, potentially malicious code files, anti-virus vendors have inte- needed. grated into the core of their signature repository update activities, a component of a detection model based on machine learning The rest of the paper is organized as follows. Section 2 surveys methods (Kiem, Thuy, & Quang, 2004). This component, which is related work while Section 3 presents the dataset. Section 4 responsible for determining what files are most likely to be mali- describes the methods we applied and introduces the proposed cious, is intended to reduce the number of files sent to the human framework for improving detection capabilities. Section 5 dis- expert for labeling. This approach, however, suffers from several cusses the measures used for measuring and evaluating the pro- shortcomings. First, it has to be constantly updated with new infor- posed framework followed by a presentation of the experiment’s mative files to effectively maintain a high level of classification. design. Section 6 presents the results from the proposed approach, Since there are dozens of new unlabeled files every day, the while Section 7 discusses how the framework copes with potential problem is determining which files should be acquired and ana- attacks. Finally, Section 8 provides conclusions, discusses the lyzed by a human expert, and labeled as either malicious or benign. advantages of the described framework, and suggests future Additionally, this approach focuses on finding new malicious files research directions. in order to keep the signatures repository as updated as possible. However, this can only be done by updating the detection model 2. Background with new, informative benign files that are also crucial for accu- rately discriminating between malicious and benign files. Over the past 15 years, a number of studies have investigated In this paper we present a framework and active learning (AL) the possibility of detecting an unknown malcode based on its bin- methods for frequently updating anti-virus software by focusing ary code. Schultz et al. (2001) were the first to introduce the use of expert efforts on labeling those files that are most likely to be mal- classification methods on static representation using various sets ware or on benign files that can possibly improve the detection of features including program headers, string features, and byte model. Note that, both the anti-virus and the detection model must sequence features. They used standard classifiers that outperform be updated with new and labeled files. Using an updated detection a signature-based method (anti-virus).The use of n-grams to repre- model we can detect new malwares that are used to sustain the sent binary files was further performed by Abou-Assaleh et al. anti-virus signature repository. (2004), Kolter and Maloof (2004) and Schultz et al. (2001) using The framework that we present maintains a detection model various combinations of classifiers and feature selection methods. 
based on a classifier that is trained on a representative set of files Later Kolter and Maloof (2006) extended their work into classifying (malicious and benign) using static analysis. The advantages of malwares into families (classes) based on the functions of their the detection model is that it has generalization capabilities that respective payloads. This approach compared to previous experi- enable it to detect new unknown malwares at a high probability ments was not successful. This lack of success indicated the impor- level even before the files have infected the host computer (due tance of maintaining the training set by acquiring new executables, to static analysis). an approach that we propose in this paper. Due to the large num- While machine learning has been successfully used for inducing ber of features extracted by the n-grams, the feature selection issue malware detection models (Abou-Assaleh, Cercone, Keselj, & was specifically investigated by several researchers including Sweidan, 2004; Henchiri & Japkowicz, 2006; Jang, Brumley, & Henchiri and Japkowicz (2006) who presented a hierarchical Venkataraman, 2011; Kolter & Maloof, 2004; Kolter & Maloof, feature selection approach and Schultz et al. (2001). Working on 2006; Moskovitch Stopel, et al., 2008; Moskovitch, Stopel, et al., a very large test collection of more than 30,000 executable files, 2009; Schultz et al., 2001; Tahan, Rokach, & Shahar, 2012), most Moskovitcdh et al. investigated the problem of imbalance in N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 5845 unknown malicious code detection using n-grams (Moskovitch, Moskovitch, Nissim, et al. (2008) and Nissim et al. (2012) suc- Stopel, et al., 2008; Moskovitch, Nissim, Englert, & Elovici, 2008) cessfully used AL methods to detect unknown computer worms, and op codes (Moskovitch, Feher, et al., 2008). They subsequently enhancing the methods proposed in Moskovitch, Elovici, and extended their work (Moskovitch, Stopel, et al., 2009) and provided Rokach (2008), Moskovitch et al. (2007) and Stopel, Boger, additional insights regarding their results such as that their meth- Moskovitch, Shahar, and Elovici (2006). Using AL in such cases ods able to classify executables based on the function of their pay- was very useful in removing noisy examples and in selecting the load. Menahem, Shabtai, Rokach, and Elovici (2009) presented a most informative examples. Other studies, using AL for unknown framework for improving malware detection by applying a multi- malware detection (Moskovitch, Nissim, et al., 2008; Moskovitch, inducer ensemble that actually leverages the knowledge of several Nissim, et al., 2009), demonstrate a somewhat limited approach, different classifiers by utilizing the classification decisions of every since an attempt is made to replace an antivirus with AL, which classifier for calculating the final classification decision. In 2011, is unrealistic. Additionally, in their experimental work, these Jang et al. (2011) presented Bitshred, a system that is mainly researchers do not refer to the real and crucial need of repeatingly designed for malware similarity analysis and used for sorting and and frequently updating the detection components along time. clustering on a large-scale. They used feature hashing to reduce In this paper we provide an answer to these shortcomings. We the feature space and thus made the triage of the malware faster used the detection model in combination with the AL to assist and and more efficient. 
More recent studies have shown the advantage improve the performance of the detection model and the updat- of these automatic methods both in time and in accuracy (Nataraj ability of the anti-virus tool. This makes our framework solution et al., 2011a; Nataraj et al., 2011b; Tahan et al., 2012). much more practical and secure. Consequently our framework is Dynamic analysis, also known as behavioral analysis, is aimed at feasible for widespread use since it can efficiently enrich the signa- tracing the behavior of the file and its implications for the environ- ture repository and quickly update the anti-virus tool on a daily ment in which it is being executed. This type of analysis, which basis. To look at it in a slightly different way, the framework acts has been thoroughly explored over the past several years, presents as a consultant, suggesting which of the suspected files should be versatile methods for detecting an unknown malware based on its sent to a human expert for labeling. Additionally, we present in behavior. These methods are aimed at detecting malicious activity our paper a comprehensive set of experiments that focus on the that cannot be discovered using static analysis (discussed above). daily process of updating the detection model and the signature However, since we are focusing on static analysis we will just men- repository by intelligently acquiring malicious files. Presenting tion several notable studies based on dynamic analysis: CWSandbox the framework on a daily basis with only new files that do not (Willems, Holz, & Freiling, 2007) Rotalum´ e(Sharif, Lanzi, Giffin, & appear in the training set or the signature repository results in a Lee, 2009), Polyunpack (Rieck, Holz, Willems, Düssel, & Laskov, highly accurate experiment that closely reflects reality. 2008; Rieck, Trinius, Willems, & Holz, 2011; Royal, Halpin, Dagon, In light of the advantages and disadvantages of dynamic and Edmonds, & Lee, 2006), BitBlaze (Song et al., 2008), McBoost frame- static analysis briefly presented above, we decided to focus on work (Moser, Kruegel, & Kirda, 2007a; Perdisci, Lanzi, & Lee, 2008), the static approach in our work. Our aim was to provide an K-Tracer (Bayer, Comparetti, Hlauschek, Kruegel, & Kirda, 2009; applicable active learning framework that would be empirically Jacob, Debar, & Filiol, 2009; Kolbitsch et al., 2009; Lanzi, Sharif, & evaluated over a large set of PC files in a reasonable execution time. Lee, 2009). Static analysis methods have several advantages over dynamic 3. Dataset creation analysis. First, they are virtually undetectable – the PC file cannot detect that it is being analyzed since it is not being executed. While 3.1. Dataset collection it is possible to create static analysis ‘‘traps’’ to deter analysis, these traps can actually be used as a contributing feature for detecting We created a dataset of malicious and benign executables for malware. In addition, since static analysis is relatively efficient the Windows operating system, the most commonly attacked and fast, it can be performed in an acceptable timeframe and con- system. We acquired 7688 malicious files from the VX Heaven sequently will not cause bottlenecks. Static analysis is also simple website1 that contains several types of malicious files such as to implement, monitor and measure. Moreover, it scrutinizes the viruses, worms, Trojans and malware (probes). 
To identify the files, file’s ‘‘genes’’ and not its current behavior which can be changed we used the Kaspersky2 anti-virus software and the Windows ver- or delayed to an unexpected time in order to evade the dynamic sion of the Unix ‘file’ command for file type identification. We also analysis. An additional aspect is that static analysis can be used included the obfuscated executables that VX Heaven provides. for a scalable pre-check of malwares. Among these executables, some were obfuscated using compression Labeling examples, which is crucial for the learning process, is or encryption while others were obfuscated using both techniques. often an expensive task since it involves human experts. Active As was the case with Kolter and Maloof (2006), we were not learning (AL) was designed to reduce the labeling efforts, by informed which of the files were obfuscated and which were not. actively selecting the examples with the highest potential contri- The files in the benign set, including executable and dynamic-link bution to the learning process of the detection model. AL is library (DLL) files, were gathered from computers running the roughly divided into two major approaches: membership queries Windows operating system. The benign set contained 22,735 files (Angluin, 1988), in which examples are artificially generated that the Kaspersky anti-virus program reported as being completely from the problem space and selective-sampling (Lewis & Gale, virus-free. To the best of our knowledge none of the benign files 1994), in which examples are selected from a pool, which is were obfuscated. used in this study. Studies in several domains have successfully applied active learning in order to reduce the time and money required for 3.2. Dataset representation using text categorization labeling examples. Unlike random learning, in which a classifier randomly selects examples from which to learn, the classifier In order to detect and acquire unknown malicious codes, we actively indicates the specific examples that should be labeled implemented well-studied concepts from the field of information and which are commonly the most informative examples for the training task. SVM-Simple-Margin (Tong & Koller, 2000–2001)is 1 http://vx.netlux.org. an existing AL method considered in our experiments. 2 http://www.kaspersky.com. 5846 N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 retrieval (IR) and, more specifically, the text categorization to estimate its expected contribution to the classification task. domain. In the course of implementing our task, binary files (exec- Three feature selection measures were used. As a baseline we used utables) are parsed and n-gram terms are extracted. Each n-gram the document frequency measure DF (Eq. (2)); Gain ratio (GR) term (the sequence of bytes in the binary code) is analogous to Mitchell, 1997; and Fisher score (Golub et al., 1999). Eventually words in the textual domain. Here we present the IR concepts used the following top features 50, 100, 200 300, 1000, 1500 and 2000 in this study. Salton, Wong, and Yang (1975), presented the vector were selected using each feature selection method. space model to represent a textual file as a bag-of-words. For clar- ity, a word in our case is a binary sequence of bytes. For instance, a 4. Machine learning methods and the suggested framework 4-gram word will be in the form of 0101, 1110 among a total of 16 possibilities. After parsing the text and extracting the words, a 4.1. 
Support vector machines (SVM) vocabulary of the entire collection of words is constructed. Each of these words may appear multiple times in a document or not We employed the SVM classification algorithm using the radial at all. A vector of terms is created such that each element in the basis function (RBF) kernel in a supervised learning approach. vector represents the term frequency (TF) in the document. Eq. There are several reasons for using SVM as the classification algo- (1) shows the definition of a normalized TF in which the TF is rithm. First and foremost, SVM has been successfully used in divided by the frequency of the maximally appearing term in the detecting worms, as three previous works indicated (Masud, document with values in the range of [0–1]. Another common Khan, & Thuraisingham, 2007; Wang, Yu, Champion, Fu, & Xuan, representation is the TF inverse document frequency (TFIDF) that 2007; Yu, Xin-cai, & Hai-bin, 2008). Moreover, in the first work combines the term frequency in the document with its frequency (Wang et al., 2007) the authors noted that ‘‘SVM learns a black- in the document’s collection, as shown in Eq. (2), in which the box classifier that is hard for worm writers to interpret’’. As term’s (normalized) TF value is multiplied by the IDF = log (N/n), Moskovitch, Nissim, et al. (2009) have presented, SVM has proven where N is the number of documents in the entire file collection to be very efficient when combined with AL methods. Lastly, SVM and n is the number of documents in which the term appears. was also successfully used by Chen et al. (2012) in their ‘‘Malware term frequency Evaluator’’, a system that classifies malwares into species and TF ¼ ð1Þ maxðterm frequency in documentÞ detects zero day attacks based on information and features pro- vided by anti-virus vendors about every known malware and its TFIDF ¼ TF: logðDFÞ; ð2Þ breed. We now briefly introduce the SVM classification algorithm and where the active learning principles and their implementation as used in N this study. SVM is a binary classifier that finds a linear hyperplane DF ¼ : that separates given examples into two specific classes, yet is also n capable of handling multiclass classification (Vapnik, 1982). As Joachims (1999) demonstrated, SVM is widely known for its capa- 3.3. Data preparation and feature selection bility for handling large amounts of features, such as texts. We used the Lib-SVM implementation of Chang and Lin (2011), We parsed the binary code of the executable files using several which also supports multiclass classification. Given a training set n-gram length sliding windows. The parsing was performed on the in which an example is a vector xi = hf1,f2,...,fmi, where fi’isa raw file without any decryption or decompression. Vocabularies of feature, and labeled by yi ={1,+1}, the SVM attempts to specify 16,777,216, 1,084,793,035, 1,575,804,954 and 1,936,342,220 for a linear hyperplane with the maximal margin defined by the 3-gram, 4-gram, 5-gram and 6-gram, respectively, were extracted. maximal (perpendicular) distance between the examples of Later, the TF and TFIDF representations were calculated for each the two classes. Fig. 1 illustrates a two-dimensional space where n-gram in each file. the examples are located according to their features. The hyper- In machine learning applications, the large number of features plane splits them according to their label. 
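The byte n-gram parsing and the TF/TFIDF weighting of Eqs. (1) and (2) can be sketched as follows, assuming the vocabulary has already been reduced to the top-DF terms described in Sect. 3.3; the helper names are illustrative, and the sketch ignores the engineering needed to cope with vocabularies of billions of terms.

import math
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4):
    """Slide an n-byte window over the raw binary (no unpacking or decryption)."""
    return (data[i:i + n] for i in range(len(data) - n + 1))

def tf_vector(data: bytes, vocabulary, n: int = 4):
    """Normalized term frequency: each count divided by the count of the most
    frequent term in the file, cf. Eq. (1)."""
    counts = Counter(g for g in byte_ngrams(data, n) if g in vocabulary)
    max_count = max(counts.values(), default=1)
    return {term: counts.get(term, 0) / max_count for term in vocabulary}

def tfidf_vector(data: bytes, vocabulary, doc_freq, n_docs, n: int = 4):
    """TFIDF = TF * log(N / n_term), cf. Eq. (2); doc_freq maps each term to the
    number of files in the collection that contain it."""
    tf = tf_vector(data, vocabulary, n)
    return {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary}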
in many domains, many of which do not contribute to the accuracy The examples lying closest to the hyperplane are the ‘‘support- of the detection model and may even decrease it, present a huge ing vectors’’. W, the Normal of the hyperplane, is a linear combina- problem. Moreover, in our task, a reduction in the number of fea- tion of the most important examples (supporting vectors) tures is crucial for practical reasons but must be performed while multiplied by LaGrange multipliers (alphas), as can be seen in Eq. simultaneously maintaining a high level of accuracy. This is due (5). Since the dataset in the original space cannot often be linearly to the fact that, as shown earlier, the vocabulary size may exceed separated, a kernel function K is used. SVM actually projects the billions of features, far more than can be processed by any feature examples into a higher dimensional space in order to create a selection tool within a reasonable period of time. Additionally, it is important to identify those features that appear in most of the files in order to avoid zeroed representation vectors. Thus, the features Class (+1) with the highest document frequency (DF) value (Eq. (2)) were ini- tially extracted. Based on the DF measure, two sets were selected; W the top 5500 terms and the top 1000–6500 terms. The use of a set consisting of the top 1000–6500 sets of features was motivated by the removal of stop-words, as is often done in information retrieval for common words. Later, feature selection methods were applied margin to each of these two sets. Since feature selection preprocessing is not the focus of this paper, we will describe it very briefly. We used a filter approach (Bi, Bennett, Embrechts, Breneman, & Song, 2003) to compare the performances of the different classification algo- Class(-1) rithms, where the measure was independent of any classification algorithm. In a filter approach, a measure is used to quantify the Fig. 1. An SVM that separates the training set into two classes, with a maximal correlation of each feature to the class (malicious or benign) and margin, in a two-dimensional space. N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 5847 linear separation of the examples. Note that when the kernel func- accordingly. Thus the primal problem can be defined by optimiza- tion satisfies Mercer’s condition, as Burges (1998) explained, K can tion as in Eq. (12), where C is a parameter for tuning between the be presented using Eq. (3), where U is a function that maps the importance of classification errors and the width of the margin. example from the original feature space into a higher dimensional 1 XN space while K relies on the ‘‘inner product’’ between the mappings Min wT w þ C n ð12Þ ðw;b;nÞ 2 i of examples x1, x2. For the general case, the SVM classifier will be in i¼1 the form shown in Eq. (4), where n is the number of examples in subject to the training set; K is the kernel function; alpha is the LaGrange T multiplier that defines the linear combination of the Normal W; yi½w /ðxiÞþb P 1 ni; where i ¼ 1; ...; N and yi is the class label of support vector Xi. ni P 0; where i ¼ 1; ...; N Kðx ; x Þ¼Uðx ÞUðx Þð3Þ The solution of the primal problem in Eq. (12) requires using a 1 2 1 2 ! Xn Lagrange multiplier ai for every training instance, where the condi- f ðxÞ¼signðw UðxÞÞ ¼ sign aiyiKðxixÞ ð4Þ tions of the primal problem result in a quadratic programming (QP) 1 problem with Lagrange multiplier ai. 
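As a point of reference, the detection model itself, an SVM with an RBF kernel trained on the selected n-gram features, can be set up in a few lines; scikit-learn's SVC wraps LIBSVM, the implementation the paper reports using. The parameters below are illustrative defaults rather than the settings tuned in the experiments.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def train_detection_model(X, y, C=1.0, gamma="scale", folds=5):
    """Train the RBF-kernel SVM that serves as the detection model.
    X: TF (or TFIDF) vectors over the selected top n-gram features,
    y: 1 for malicious, 0 for benign."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    accuracy = cross_val_score(clf, X, y, cv=folds).mean()
    print(f"{folds}-fold cross-validation accuracy: {accuracy:.3f}")
    return clf.fit(X, y)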
Those training instances xi for Xn which a is not zero, are called the support vectors and are there- w y U x 5 i ¼ ai i ð iÞðÞ fore the only instances that play a role in identifying the separating 1 hyperplane of the SVM. Since the primal problem in Eq. (12) is hard Two commonly used kernel functions were used. First, the poly- to solve, it can be converted into dual problem that will be easier to nomial kernel, as shown in Eq. (6), creates polynomial values of solve due to the fact that the decision variables are actually the degree p, where the output depends on the direction of the two support vectors. The dual problem can be seen in Eq. (13) where vectors, examples x1, x2, in the original problem space. Note that Q is a N N semi-definite matrix, as defined in Eq. (14) that uses a private case of a polynomial kernel, where p = 1, is actually the the kernel trick from Eq. (3), and e is a vector of all ‘‘ones’’. linear kernel. The second is the radial basis function (RBF), as 1 T T shown in Eq. (7), in which a Gaussian function is used as the RBF MaxðaÞ a Qa e a ð13Þ and the output of the kernel depends on the Euclidean distance 2 of examples x1, x2. subject to P 0 6 ai 6 C; where i ¼ 1; ...; N Kðx1; x2Þ¼ðx1 x2 þ 1Þ ð6Þ yT a ¼ 0 jjx x jj2 1 2 2r2 Kðx1; x2Þ¼e ð7Þ Qij ¼ yiyjKðxi; xjÞð14Þ We now provide a formal and rigorous explanation about the algorithm that builds a SVM classifier from a given training set D After solving the optimization problem, the SVM classifier is in the form presented in Eq. (4) as shown above. that includes N examples xi with class label yi. Each instance is in a vector form with n features fi: 4.2. Selective sampling and active learning methods Training set: D x y N ¼f i; i;gi¼1 1 n T 2 Since our framework aims to provide solutions to real problems Instance in training set in vector form: xi ¼ fi ; ...; fi 2 R it must be based on a sophisticated, fast, and selective high-perfor- Class labels: yi 2 {+1,1} mance sampling method. We compared our proposed AL method According to Vapnik (1998) original definition and formulation, to several strategies described below. the SVM classifier that will be induced from the training set D, should satisfy the following conditions in Eq. (8) where W is the 4.2.1. Random selection weight vector, b is the bias and / is a function that maps the exam- Random selection is obviously not an active learning method ples from the original problem space (called weight space as well) yet it is actually the ‘‘lower bound’’ of the selection methods that into a higher dimensional space referred to as the feature space. Eq. will be discussed. As far as we know, no anti-virus tool uses an (8) can be simply summarized as Eq. (9): active learning method for preserving and improving its updatabil- ity. Consequently, we expect that all the AL methods will perform If y ¼þ1 then wT /ðx Þþb P þ1 better than a selection process that is based on the random acqui- i i ð8Þ T 6 sition of files. If yi ¼1 then w /ðxiÞþb 1

4.2.2. The SVM-Simple-Margin active learning method (SVM-Margin) y ½wT /ðx Þþb P 1; where i ¼ 1; ...; N ð9Þ i i The SVM-Simple-Margin method (Tong & Koller, 2000–2001) Eq. (9) actually defines the construction of two parallel surfaces that (referred to as SVM-Margin in the discussion below) is directly lie at similar distances from both sides of the separating hyperplane related to the SVM classifier. As is well-known, using a kernel as depicted in Fig. 1. The separating hyperplane can be seen in Eq. function, the SVM implicitly projects the training examples into a (10), where the decision of the classifier (positive or negative class) different (usually a higher dimensional) feature space denoted by on a given example x is provided by the sign calculated from Eq. F. In this space there is a set of hypotheses that are consistent with (11): the training set. This means that these hypotheses create a linear separation of the training set. This set of consistent hypotheses is wT /ðx Þþb ¼ 0 ð10Þ referred to as the version-space (VS). From among the consistent i T hypotheses, the SVM then identifies the best hypothesis with the sign w /ðxiÞþb ð11Þ maximal margin. To achieve a situation where the VS contains As explained above, since in reality most classification problems the most accurate and consistent hypothesis, the SVM-Margin AL cannot be linearly separated, a slack variable (ni) is used in order to method selects examples from the pool (of unlabeled examples) permit misclassifications in the training phase and to compensate which reduces the number of hypotheses. Calculating the VS is 5848 N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 complex and impractical where large datasets are concerned and Malicious therefore, this method is based on simple heuristics that depend on the relationship between the VS and SVM with the maximal margin. Practically speaking, examples that lie closest to the separating hyperplane (inside the margin), are more likely to be informative and new to the classifier. Consequently, these exam- ples will be selected for labeling and acquisition. This method, contrary to ours, selects the examples according to W their distance from the separating hyperplane only to explore and acquire the informative files without relation to their classified labels. Thus it will not necessarily focus on selecting and acquiring malware instances. The SVM-Margin AL method is very fast and margin can be applied to real problems, yet, as its authors indicate (Tong & Koller, 2000–2001), this agility is achieved due to the fact that it is based on a rough approximation and relies on the assumption that the VS is fairly symmetric and that the hyperplane’s normal Benign (W) is centrally placed. It has been demonstrated, both in theory and practice, that these assumptions can fail significantly Fig. 2. The criteria by which Exploitation acquires new unknown malicious files. (Herbrich, Graepel, & Campbell, 2001). Indeed, the method may These files lie the farthest from the hyperplane and are regarded as representative actually query instances whose hyperplane does not intersect the files. VS and therefore may not be at all informative. Moskovitch, Nissim, et al. (2009) used the SVM-Margin method for detecting similar (the similarity is checked according to its representation instances of PC malware and according to their preliminary results, in the SVM kernel space). 
Consequently, only the representative the method also assisted in updating the detection model but not files that are most probably malicious are being selected. In case the anti-virus application itself. In Moskovitch, Nissim, et al. the representative file is detected as malware as a result of the (2009) the method was used for only one day-long trial and not manual analysis, then all its variants that were not acquired will for a period of several consecutive days as actually happens in real- be detected the moment the anti-virus is updated. And in case ity. Accordingly, we thought it would be interesting to compare its these files are not actually variants of the same malware, they will performance to our proposed AL methods in a daily process of be acquired the following day as long as they are still most likely to detection and acquisition experiments. Therefore, SVM-Margin is be malware after the detection model has been updated. In Fig. 2,it the base line AL method we expect to improve. can be observed that there are sets of relatively similar files (according to their distance in the kernel space). However, only 4.2.3. Exploitation: our proposed active learning method the representative files that are most likely to be malwares are Our method, which we term ‘‘Exploitation’’, is designed and being acquired. based on SVM classifier principles. It is oriented towards selecting As is well-known, the SVM classifier defines the class margins the examples that are probably the most malicious. More specifi- using a small set of supporting vectors (i.e., PC files).While the cally, it selects the examples that lie furthest from the separating usual goal is to improve the classification by uncovering (labeling) hyperplane. In our domain, detection of PC malware, this indicates files from the margin area, in our case the primary goal is to that only the files that are most likely to be malware will be acquire malwares for updating the AV. Actually the same number acquired. Our motivation for this set of actions is the desire to of files are acquired each day, but with Exploitation we attempt enhance the signature repository of the anti-virus tool with as to better explore the ‘‘malicious side’’ of the incoming files), result- many new malwares as possible. Thus for every file X that is ing sometimes in the discovery of also benign files (these files will suspected of being malicious, Exploitation rates its distance from probably become support vectors and update the classifier). In the separating hyperplane using Eq. (15) based on the Normal of Fig. 2 we can observe an example of a benign file lying deep inside the separating hyperplane of the SVM classifier that serves as the the malicious side. Contrary to SVM-Margin that explores detection model. As explained above, the separating hyperplane examples that lie inside the SVM margin, Exploitation explores of the SVM is represented by W, which is the Normal of the the ‘‘malicious side’’ more efficiently as part of an effort to discover separating hyperplane and actually a linear combination of the new and unknown malicious files that are essential for the fre- most important examples (supporting vectors), multiplied by quent update of the antivirus signature repository. LaGrange multipliers (alphas) and by the kernel function K that The distance calculation required for each instance in this assists in achieving linear separation in higher dimensions. 
Accord- method is quite fast and equal to the time it takes to classify an ingly, the distance calculation in Eq. (15) is simply done between instance in a SVM classifier. Consequently, it is a very practical example X and the Normal (W) that is presented in Eq. (5). and fast method that can provide the ranking for acquisition in a In Fig. 2, it can be observed that the files that were acquired short time frame. It is therefore applicable for products working (marked with a red circle) are those files that were classified as in real-time. malicious and have maximum distance from the separating hyper- ! Xn plane. Acquiring several new malicious files that are quite similar DistðXÞ¼ a y Kðx xÞ ð15Þ and belong to the same virus family is considered a waste of man- i i i 1 ual analysis resources since these files will probably be detected by the same signature. Thus, acquiring one representative file for this set of new malicious files will serve the goal of efficiently updating 4.2.4. Combination: a combined active learning method the signatures repository. In order to adhere to the goal of enhanc- The combination method lies between SVM-Margin and Exploi- ing the signature repository as much as possible, we also check the tation. On the one hand, the combination method will start with a similarity between the selected files using the kernel farthest-first phase in which it will acquire examples based on SVM-Margin (KFF) method suggested by Baram, El-Yaniv, and Luz (2004).By criteria in order to acquire the most informative files. using this method, we avoid acquiring examples that are quite Consequently, both malicious and benign files will be acquired. N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 5849

4.2.4. Combination: a combined active learning method
The Combination method lies between SVM-Margin and Exploitation. On the one hand, the Combination method will start with a phase in which it acquires examples based on the SVM-Margin criterion in order to acquire the most informative files; consequently, both malicious and benign files will be acquired. This Exploration-type phase is important in order to enable the detection model to discriminate between malicious and benign files. On the other hand, the Combination method will then try to maximally update the signature repository in an Exploitation-type phase. This means that in the early acquisition period, in the first part of the day, SVM-Margin predominates over Exploitation; as the day progresses, Exploitation becomes predominant. The Combination was also applied over the course of the ten-day experiment and not only on a specific day. Consequently, as the days progress, the Combination will perform more Exploitation than SVM-Margin. This means that on the ith day there is more Exploitation than on the (i-1)th day. We defined and tracked several configurations over the course of several days. We found that in the relation between SVM-Margin and Exploitation, a balanced division performs better than other divisions (i.e., for 50% of the days the method will conduct more SVM-Margin, with Exploitation being implemented for the remaining time). In short, this method tries to take the best from both of the previous methods.

4.3. A Framework for improving detection capabilities

Fig. 3 illustrates the framework and the process of detecting and acquiring new malicious files while preserving the updatability of the anti-virus and the detection model. In order to receive maximal contribution from the suggested framework, it should be deployed in strategic nodes over the Internet network in an attempt to expand its exposure to as many new files as possible. This wide deployment will result in a scenario in which almost every new file will go through the framework. If the file is informative enough or is grasped as probably malicious, then it will be acquired for manual analysis. Examples of strategic nodes can be ISPs and gateways of large organizations. As Fig. 3 shows, the packets of files transported over the Internet through our framework are constructed into files {1}. These files are transformed into vector form {2}; this means that n-grams are extracted from the binary code of the files, their frequencies are calculated, and the files are then represented as vectors of n-gram frequencies (as explained above). Then, the known files are filtered out by the "known files module", which filters all the known benign and malicious files {3} (according to a white list, reputation systems (Jnanamurthy, Warty, & Singh, 2013) and the signature repository).

The remaining files, which are unknown, are then introduced to the detection model based on SVM and AL. The detection model scrutinizes the files and provides two values for each file: a classification decision using Eq. (4) and a distance calculation from the separating hyperplane using Eq. (15) {4}. A file that the AL method recognizes as informative, and which it has indicated should be acquired, is sent to an expert who manually analyzes and labels it {5}. By acquiring these informative files, we aim to frequently update the anti-virus software by focusing the expert's efforts on labeling files that are most likely to be malware or on benign files that are expected to improve the detection model. Note that informative files are those that, when added to the training set, improve the detection model's predictive capabilities and enrich the anti-virus signature repository. Accordingly, in our context, there are two types of files that may be more informative. The first type includes files for which the classifier is not confident as to their classification (the probability that they are malicious is very close to the probability that they are benign). Acquiring them as labeled examples will probably improve the model's detection capabilities. In practical terms, these files will have new n-grams or special combinations of existing n-grams that represent their execution code (inside the binary code of the executable). Therefore these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin strategy, which selects informative files, both malicious and benign, that are a small distance from the separating hyperplane.

The second type of informative files includes those that lie deep inside the malicious side of the SVM margin and that are at a maximal distance from the separating hyperplane according to Eq. (15). These files will be acquired by the Exploitation strategy and are also a large distance from the labeled files; this distance is measured by the KFF calculation that was explained in the Exploitation AL method section. These informative files are then added to the training set {6} for updating and retraining the detection model {8}. The files that were labeled as malicious are also added to the anti-virus signature repository to enrich it and preserve its updatability {7}. This updating of the signature repository also requires the distribution of an update to clients running the anti-virus application.

The framework integrates two main phases: training and detection/updating.

Fig. 3. The process of preserving the updatability of the anti-virus tool using AL methods.
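The daily flow of Fig. 3 can be summarized in a compressed sketch, shown below under several simplifying assumptions: files are represented as dictionaries with precomputed hashes and raw bytes, the selected n-gram vocabulary is given, and `select_informative` and `expert_label` are placeholders for the AL criterion and the human analyst. The numbered comments map to steps {1}-{8} described above; this is an illustration, not the authors' code.

```python
from collections import Counter

def ngram_tf_vector(file_bytes, vocabulary, n=5):
    """Step {2}: term-frequency vector of byte n-grams over a fixed vocabulary."""
    grams = Counter(file_bytes[i:i + n] for i in range(len(file_bytes) - n + 1))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in vocabulary]

def daily_cycle(stream, vocabulary, known_hashes, signatures, model,
                training_set, select_informative, expert_label):
    files = list(stream)                                       # {1} reconstructed files
    unknown = [f for f in files
               if f['sha256'] not in known_hashes]             # {3} drop known files
    vectors = [ngram_tf_vector(f['bytes'], vocabulary)
               for f in unknown]                               # {2} vectorize
    chosen = select_informative(model, vectors)                # {4} classify and rank
    for idx in chosen:
        label = expert_label(unknown[idx])                     # {5} manual analysis
        training_set.append((vectors[idx], label))             # {6} extend training set
        if label == 'malicious':
            signatures.add(unknown[idx]['sha256'])             # {7} update signatures
    model.fit([v for v, _ in training_set],
              [lbl for _, lbl in training_set])                # {8} retrain the model
    return model
```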

4.3.1. Training
A detection model is trained over an initial training set that includes 10% malicious files (MFP). After the model is tested over a stream consisting only of unknown files that were presented to it on the first day, the initial accuracy of the detection model is evaluated.

4.3.2. Detection and updating
For every unknown file in the stream, the detection model provides a classification, while the AL method provides a rank representing how informative the file is; based on this rank, the framework considers whether to acquire the file. After being selected and receiving their true labels from the expert, the informative files are added to the training set and, in case the files are malicious, the signature repository is updated as well. The detection model is retrained over the updated and extended training set, which now also includes the acquired examples that are regarded as being very informative. At the end of the day, the updated model receives a new stream of unknown files, on which it is once again tested and from which it again acquires informative files. Note that the motive is to acquire as many malicious files as possible, since such information will maximally update the anti-virus tool.

5. Measurements and evaluation

The objective of the first set of experiments was to determine the optimal settings for: the feature representation (TF or TFIDF); the n-gram representation (3, 4, 5 or 6); the best global range (top 5500 or top 1000-6500); the feature selection method (DF, FS or GR); and the top number of selected features (50, 100, 200, 300, 1000, 1500 or 2000). After determining the optimal settings, we performed a second set of experiments to evaluate our proposed acquisition process using the active learning methods.

5.1. Evaluation measures
To evaluate the capability of the framework and the AL methods to efficiently acquire new files and update the detection model, we relied upon a set of standard, widely used measures. We measured the true positive rate (TPR), which is the percentage of correctly classified positive instances, as shown in Eq. (16). The false positive rate (FPR), which is the percentage of misclassified negative instances, is also shown in Eq. (16). The total accuracy, which measures the number of correctly classified instances, either positive or negative, divided by the entire number of instances, is shown in Eq. (17).

TPR = \frac{|TP|}{|TP| + |FN|}, \qquad FPR = \frac{|FP|}{|FP| + |TN|}    (16)

Total\ Accuracy = \frac{|TP| + |TN|}{|TP| + |FP| + |TN| + |FN|}    (17)

In addition to the abovementioned measures, we present the core measure of this study, which is the number of new malwares acquired each day, that is to say, the malwares that were discovered and acquired daily and added into the training set and signature repository of the anti-virus software.

5.2. Experiment design

5.2.1. Experiment 1: determining the best configuration of the dataset and kernel
To determine the best settings of the file representation and the feature selection, we used the results and insights from previous work that we conducted on the same dataset (Moskovitch, Stopel, et al., 2009). In that study, we performed a comprehensive set of evaluation runs including all combinations of the optional settings for each of the aspects listed above. The number of runs amounted to 1536 in a 5-fold cross-validation format for all three kernels. It should be noted that the files in the test set were not in the training set, so that they presented unknown files to the classifier. Elaboration and analysis of the results of this experiment can be found in Moskovitch, Stopel, et al. (2009). Here, however, we will briefly present the best configuration and detection accuracy rate.

The best configuration included the dataset represented by: 5-grams; global top 5500; TF representation; the Fisher score feature selection method; and Top300, which is the number of features considered. The RBF kernel outperformed the others, with 93.9% detection accuracy and a low false positive rate of less than 0.03%.

5.2.2. Experiment 2: acquisition of new malwares through active learning
The objective in this main experiment was to evaluate and compare the performance of our new AL methods to the existing selection method, SVM-Margin, with regard to two tasks:
- Acquiring as many new unknown malicious files as possible in order to efficiently enrich, on a daily basis, the signature repository of the anti-virus.
- Updating the predictive capabilities of the detection model that serves as the knowledge store of the AL methods in efficiently identifying the most informative new malwares.

As assumed in previous work (Moskovitch, Stopel, et al., 2009), malwares on the Internet amount to approximately 10% of the traffic (which actually represents the test set). In this previous work (Moskovitch, Stopel, et al., 2009), it was also found that the percentage of malicious files in the training set that leads to the highest detection accuracy is 10%. In order to adhere as closely as possible to real-life conditions in our experiment, we used the guidelines proposed by Rossow et al. (2012). Consequently, we used 25,260 executables (22,734 benign, 2526 malicious), out of a total of 30,423 executables, of which 10% were malicious and 90% benign. One should note that in reality the detection model encounters both known and unknown files, but since there is no need to scrutinize known files, we filtered them out, since they would in any case be detected by the signature repository of the anti-virus or the white list of the training set. We conducted this experiment using the insights from experiment 1, the dataset configuration specifics, and the RBF kernel of the SVM.

Over a ten-day period, we compared file acquisition based on active learning methods and random selection, based on the performance of the detection model. We took the 25,260 executables (10% MFP) and created ten randomly selected datasets, with each dataset containing ten sub-sets of 2521 files representing each day's stream of new files. The 50 remaining files were used as the initial training set to induce the initial model. Note that each day's stream contained 2521 files with 10% MFP. At first, we induced the initial model by training it over the 50 known files. We then tested it on the first day's stream. Next, from this same stream, the selective sampling method selected the most informative files according to that method's criteria. The informative files were sent to a human expert who labeled them. The files were then acquired by the training set, which was enriched with an additional X new informative files. Of course, when a file was found to be malicious, it was immediately used for updating the signature repository of the anti-virus, and an update was also distributed to clients. The process was repeated over the next 9 days.
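The protocol just described can be summarized in a short simulation harness. The sketch below is illustrative rather than the original evaluation code: it assumes NumPy feature matrices, uses the ground-truth labels to simulate the human expert, and takes `select` as whichever selective sampling method (SVM-Margin, Exploitation, Combination or Random) is under test. The helper also computes the measures of Eqs. (16) and (17) each day.

```python
import numpy as np
from sklearn.svm import SVC

def tpr_fpr_accuracy(y_true, y_pred):
    """Eqs. (16) and (17); the positive class (1) denotes 'malicious'."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fn), fp / (fp + tn), (tp + tn) / (tp + fp + tn + fn)

def run_ten_days(X_init, y_init, daily_streams, select, acquisition_amount):
    """daily_streams: ten (X_day, y_day) pairs of 2521 files each (10% MFP)."""
    X_train, y_train = X_init.copy(), y_init.copy()
    model = SVC(kernel='rbf').fit(X_train, y_train)
    history = []
    for X_day, y_day in daily_streams:
        y_pred = model.predict(X_day)
        history.append(tpr_fpr_accuracy(y_day, y_pred))    # test before acquisition
        chosen = select(model, X_day, acquisition_amount)  # selective sampling
        # The expert's manual labels are simulated here by the ground truth.
        X_train = np.vstack([X_train, X_day[chosen]])
        y_train = np.concatenate([y_train, y_day[chosen]])
        model = SVC(kernel='rbf').fit(X_train, y_train)    # retrain for the next day
    return model, history
```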
The performance of the detection model was averaged over 10 runs on the 10 different datasets that were created. Each selective sampling method was checked separately for 43 different amounts of acquired files. This means that for every acquisition amount, the methods were restricted to acquiring a number of files equal to that acquisition amount, denoted as X: 10 files, 20 files and so on, up to 350 files (in steps of 10 files), and then 450, 550, 600, 650, 700, 750, 800. We also considered the acquisition of all the files in the daily stream (referred to as the ALL method), which represents an ideal but not a feasible situation of acquiring all the new files.

The experiment's various steps are as follows:
1. Inducing the initial detection model from the 50 files in the training set.
2. The detection model is tested on the stream of the first day to measure its initial performance.
3. The stream of the first day is introduced to the selective sampling method, which chooses the X most informative files according to its criteria and sends them to the virus expert, who manually analyzes them to determine their true labels.
4. These informative files are added to the training set, where the malwares are also used for updating the signature repository of the anti-virus tool.
5. Inducing an updated detection model from the extended training set for the next day.
6. The second day begins with a test of the updated detection model over the stream of the second day to measure its performance and improvement relative to the previous day.
7. The stream of the second day is introduced to the selective sampling method, which chooses the X most informative files according to its criteria and sends them to the virus expert.
8. Those informative files are added to the training set, where the malwares are also used for updating the signature repository of the anti-virus.
9. Inducing an updated detection model from the extended training set for the next day.

The process repeats itself until the 10th day.

6. Results

In order to appropriately evaluate the efficiency and effectiveness of our framework, we compared four selective sampling methods: an existing AL method, SVM-Simple-Margin (SVM-Margin); our method (Exploitation); a combination of the two previous methods (Combination); and random selection (Random). Each method was checked for all forty-three acquisition amounts, where the results, in order to neutralize the variance, were the average of 10 different folds. As previously mentioned, we also took into consideration the acquisition of all the files in the stream (the ALL method) in order to compare the performance of the methods in relation to an ideal case. Obviously ALL is not a feasible method, since anti-virus vendors cannot deal with the daily amount of new files requiring manual analysis and inspection. We depict the results for the most representative acquisition amounts: batches of 50, 250 and 800 files.

We now present the results for the core measure in this study, the number of new unknown malicious files that were discovered and finally acquired into the training set and signature repository of the anti-virus software. As explained above, each day the framework deals with 2521 new files with a 10% MFP, i.e., about 250 new unknown malicious files. Statistically, the more files that are selected daily, the more malicious files will be acquired daily. Using AL methods, we tried to improve upon the number of malicious files acquired by existing solutions. More specifically, using our methods (Exploitation, Combination), we also sought to improve upon the number of files that are acquired by the current AL method (SVM-Margin).

Figs. 4-6 present the number of malicious files obtained by acquiring batches of 50, 250 and 800 files, respectively, by each of the four methods during the course of the ten-day experiment. Note that in these three figures, the graph of Combination falls behind Exploitation, and it is therefore difficult to observe Combination's behavior. As can be seen in Fig. 4, Exploitation and Combination outperformed the other selection methods and showed an increasing trend from the second day. These methods succeeded in acquiring the maximal number of malwares from the 50 files acquired daily.

On the second day, all the AL methods acquired the fewest number of new malwares, even fewer than random selection. This can be explained by the fact that the initial detection model was trained on an initial set of only 50 labeled files that included only 6 malwares. Thus the detection model was not accurate enough to provide the AL methods with the knowledge needed to select new unknown malwares out of the daily batch of 50 files.

On the 10th day, using Exploitation, 88.5% of the acquired files were malicious (44.1 files out of 50); implementing SVM-Margin, only 59.8% of the acquired files were malicious (29.9 files out of 50). This is a significant improvement of almost 30% in unknown malware acquisition. Note that on the 10th day, using Random, only 11.8% of the acquired files were malicious.

Fig. 4. The number of malicious files acquired by the framework for the different methods through the acquisition of 50 files daily.

Fig. 5. The number of malicious files acquired by the framework for the different methods through the acquisition of 250 files daily.

Fig. 6. The number of malicious files acquired by the framework for the different methods through the acquisition of 800 files daily.

This is far less than the malware acquisition rates that Exploitation and Combination achieved. The trend is very clear: each day, Exploitation and Combination acquired more malicious files than the day before - a feature that supports the daily improvement in the capability to update the detection model, identify new malwares, and enrich the signatures repository of the anti-virus. As far as we could observe, the random selection trend was constant; there was no improvement in acquisition capabilities over the 10-day period.

As can be seen in Fig. 5, Exploitation outperformed the other selection methods and showed an increasing trend. It succeeded in acquiring the maximal number of malwares from the 250 files acquired daily.

In tracking the performance of the various methods, we observed an interesting phenomenon. Until the 4th day, all the AL methods showed improvement and an increasing number of acquired files. However, after the 4th day, the SVM-Margin AL method showed a decrease in the number of malwares acquired, while our methods, Exploitation and Combination, continued to show an increase and improvement in their acquisition capabilities. This phenomenon can be explained by the way the methods work. SVM-Margin acquires examples about which the detection model is less confident; consequently, they are considered more informative but not necessarily malicious. As was explained above, SVM-Margin selects new informative files inside the margin of the SVM. Over time, and with the improvement of the detection model towards more malicious files, it seems that the malicious files become less informative (due to the fact that malware writers frequently use upgraded variants of previous malwares). Since these new malwares might not lie inside the margin, SVM-Margin may actually be acquiring informative benign rather than malicious files. Our methods, however, especially Exploitation, are more oriented towards acquiring the most informative and probable malwares by acquiring informative files from the malicious side of the SVM margin. As a result, an increasing number of new malwares is acquired. And if an acquired benign file lies deep within the malicious side, it is still a very informative file that can be used for learning purposes and for improving the detection capabilities for the next day.

This observation could not have been made from the results of a previous study (Moskovitch, Nissim, et al., 2009) in which there was only one active learning trial. In our experiment, which sought to represent reality, there were several days of detection and acquisition. Consequently, we see that SVM-Margin is less efficient in acquiring malwares on a continuous basis. The Exploitation method outperformed the SVM-Margin method throughout the ten-day period, displaying on the 9th day the largest gap between the two methods in acquiring malwares. At this point, SVM-Margin acquired 79.2 malwares while Exploitation acquired 205.3, or 2.6 times more malwares than SVM-Margin. We can also see that Exploitation acquired 7.8 times more malwares than Random. The advantage of Exploitation over SVM-Margin and Random is very clear: Random fails to improve over time and fails to select new and informative malwares, while SVM-Margin fails to acquire more malwares daily over the course of the 10 days. These results actually emphasize the efficiency of our framework and methods. Combination performed almost as well as Exploitation, yet was the second best performer.

Table 1 and Fig. 6 present the results of the selection methods for acquiring 800 files daily. The bold red numbers in the original table represent the highest quantities acquired by each of the selective sampling methods. Almost the same trend that appeared when dealing with 250 files appears here, with the AL methods performing better than random selection, Exploitation outperforming the other methods, and SVM-Margin AL displaying a decreasing trend.

These results are not significantly better than those achieved with the acquisition of 250 files. Here the number of files acquired daily was 800, which is much more than the 250 malwares in the daily stream (due to the 10% MFP). We expected that acquiring 800 files would identify almost all of the 250 malwares in each day's stream. However, it seems that identifying and acquiring all the new malwares is not simple, even when more files are acquired daily. The cost of acquiring (and manually analyzing) more files daily should be considered in light of the benefits to be obtained by acquiring additional files in an attempt to discover more malicious files. For example, if we compare the results of Exploitation, we can see that on the 10th day, with the acquisition amount at 250 as presented in Fig. 5, Exploitation acquired 201.1 out of 250 malwares (80.44%). On the same day, when the acquisition amount stood at 800, as presented in Table 1 and Fig. 6, Exploitation acquired 230 out of 250 malwares (92.2%). This gain comes at the cost of acquiring and analyzing 550 more files daily, an increase in manual effort 3.2 times greater than when the acquisition amount stood at 250 files. The remaining benign files that were acquired because they were thought to be malicious will be discussed later.

We have shown here that our AL methods outperformed the SVM-Margin AL method and improved the capabilities for acquiring new malwares. This improvement enriched the signatures repository of the anti-virus software, which is the main goal of this study. We now show that our methods, compared to SVM-Margin, also preserve and even improve the predictive performance of the detection model, which serves as the knowledge store in the acquisition process. We present the Accuracy, TPR and FPR levels and their trends over the course of the 10 days.

The same trends were observed when measuring the accuracy rates for the three acquisition amounts (50, 250 and 800) by each of the four selection methods over the ten-day experimental setup. First and foremost, the main and significant observation is that our AL methods performed as well as the baseline existing AL method, SVM-Margin. The accuracy rates of the methods were very similar, with a negligible difference between them. When 50 files were acquired daily, SVM-Margin performed a bit better than our methods for several days within the ten-day period; but when dealing with the larger acquisition amounts of 250 and 800, our methods (Exploitation and Combination) performed a bit better than SVM-Margin during the whole ten-day period. Secondly, as expected, the daily acquisition of additional files indeed contributes to the accuracy of the detection model: the accuracy usually increases over time, and the larger the number of files acquired each day, the higher the accuracy. Lastly, the three AL methods outperformed random selection in such a way that the gap in performance between the AL methods and random selection becomes larger over the ten-day period.

Since the same trend was observed for the three acquisition amounts, we present in Table 2 only the results achieved when 800 files were acquired daily. As can be seen, the three AL methods performed almost the same, indicating that different informative files contributed similarly to the detection model's accuracy. Exploitation and Combination outperformed all the selection methods during the 10 days, with Exploitation outperforming Combination and specifically presenting slightly better accuracy than the baseline SVM-Margin method we aim to improve upon. When dealing with large acquisition amounts, every improvement in accuracy is significant, since this helps reduce the extent of the manual analysis the experts must carry out.

Table 2 also shows that the difference between our AL methods (97.83%) and random selection (95.85%) amounts to a 2% detection accuracy rate by the 10th day. This rate is especially significant when the detection model encounters, each day, dozens of files from which it should detect the newest malicious files. Note that since the dataset is imbalanced and consists of 90% benign files, it is not hard to achieve 90% accuracy, whereas every additional percentage point above 90% accuracy is a challenge. Thus, the differences in accuracy that were achieved are very significant.

While the accuracy measure provides a means for determining the overall efficiency and effectiveness of the framework, the primary task is to detect the files that are most likely to be malicious in order to use them for updating the signature repository. Accordingly, the TPR and FPR measures shed significant light on the four methods that we are examining in this paper.

Fig. 7 presents the TPR levels achieved by each of the four methods when acquiring 50, 250 and 800 files on the final day of the experiment - the tenth day - including also the acquisition of all 2521 files in the daily stream, which is the unfeasible scenario. It can be observed that the three AL methods performed almost the same (again, Combination's graph falls behind Exploitation's). SVM-Margin outperformed the other selection methods for the 50 and 250 acquisition amounts, while our AL methods (Exploitation and Combination) outperformed it for the acquisition amount of 800 files daily.

Table 1. The quantities of malicious files that the framework has acquired for its different methods through the acquisition of 800 files daily.

Day    Random    SVM-Margin    Exploitation    Combination
1      6         6             6               6
2      78        150.7         152.6           150.7
3      79.9      181.9         226.7           225.7
4      83.1      168.3         235.1           233.9
5      76.4      146.1         229.1           227.6
6      72.1      133.7         227.7           227.1
7      78.2      125.1         231.4           230.5
8      77.8      118.8         240.8           239.8
9      79.3      112.5         235.6           236.1
10     77.9      102           230.7           230.1

Table 2. The accuracy of the framework for different methods through the daily acquisition of 800 files.

Day    Random (%)    SVM-Margin (%)    Exploitation (%)    Combination (%)
1      90.05         90.05             90.05               90.05
2      91.97         92.98             93.05               92.98
3      93.43         94.89             94.93               94.97
4      94.30         95.96             96.03               96.06
5      94.94         96.50             96.55               96.58
6      95.21         96.78             96.89               96.87
7      95.51         97.16             97.31               97.28
8      95.56         97.28             97.42               97.40
9      96.04         97.58             97.69               97.69
10     95.85         97.68             97.83               97.83

Fig. 7. The TPR of the framework on the tenth day for different methods through the acquisition of 50, 250, 800 and 2521 (ALL) files daily.

In addition, the performance of the detection model improves as more files are acquired daily, so that for the acquisition amount of 800 files daily, the results indicate that by acquiring only a small and well-selected set of informative files (31% of the stream), the detection model can achieve TPR levels (85.12%) that are quite close to those achieved by acquiring the whole stream (88.14%) - represented by the single point of 2521 files. As can be observed, the trend indicates that the difference between the AL methods and ALL becomes smaller, a trend that supports the efficiency of the framework and the AL methods. This approach is viable in terms of time and money, since it dramatically reduces the number of files sent to virus experts. There is no doubt that these achievements with minimal acquisition amounts have implications in terms of time and money for the efficiency of the framework in preserving the updatability of the detection model and, ultimately, of the anti-virus tool. These factors indicate the benefits to be obtained from performing this process on a daily basis. Note that when the acquisition amount was 50 files, the TPR levels were quite low, because the detection model is induced from a small number of files: 50 files on the first day (only 5-6 malicious files and 44-45 benign files), resulting in only 550 files by the tenth day.

We can see that the difference between the AL methods and random selection becomes greater than 30% for the acquisition amount of 250 files. This difference becomes smaller (15%) yet remains significant when the daily acquisition amount is 800 files. Acquiring files depends upon sending them to human experts for final labeling. This is carried out manually and requires time and money. High TPR rates, achieved by acquiring a small set of files, indicate a capability for preserving the anti-virus updatability and achieving significant savings of time and money.

Fig. 8 presents the FPR levels of the four acquisition methods for a batch of 800 files. As can be observed, the FPR rates were low and quite similar among the AL methods. A similar decreasing FPR trend began to emerge on the 4th day. This decrease indicates an improvement in the detection capabilities and the contribution of the AL methods, in contrast to the increase in FPR rates for Random. Random had the lowest FPR until the 4th day; this can be explained by a random selection of informative files that actually improved the initial detection model, which was not very accurate at the beginning of the process due to the relatively small initial training set. However, in the long run, from the 5th day on, Random had the highest FPR levels, which indicates the selection of files that were not informative enough to properly update the detection model over the days.

Most of the time, from the 5th day on, Exploitation and Combination achieved the lowest FPR rates, a bit lower than SVM-Margin. This indicates that our methods (Exploitation and Combination) performed as well as the SVM-Margin method with regard to predictive capabilities (Accuracy, TPR and FPR), but better than SVM-Margin in acquiring a larger number of new malwares daily and in enriching the signature repository of the anti-virus. The FPR is presented for the ten-day period due to the setup of the evaluation: on each day of the acquisition iteration, we evaluated the learnt classifier, and since a set of new unknown applications (malwares and benign) is presented to the classifier each day, the FPR is not constantly decreasing, as would have happened if the classifier had been tested on the same files daily. Again, we can see a general trend in which the difference between the AL methods and Random becomes larger over the course of the ten-day period. This trend supports the efficiency of the AL framework.

Fig. 8. The FPR trends of the framework for the different methods based on acquiring 800 files daily.

7. Coping with possible attacks

Zhao, Long, Yin, Cai, and Xia (2012) recently discussed two possible attack scenarios on AL methods. In these attacks, referred to as adding and deleting, the attacker can actually pollute the unlabeled data before it is sampled by the active learner module. The results of their experiments on an intrusion detection dataset showed that these attacks disrupt the performance of the AL process and significantly decrease detection accuracy: a decrease of 16-30% due to the adding attack and a 15-34% decrease due to deleting. In our context, an adaptive attacker might "guide" the AL process and poison the classifiers by producing many malwares that contain specifically chosen n-grams by design. Consequently, the AL would acquire these files, since they would contain new and interesting n-grams which did not exist before. Attacking such a biased classifier then becomes easy: the attacker simply leaves out the chosen n-grams and creates a malicious file that can evade the detection model.

The way to counter this attack is quite simple. First, the AL process is not based on a specific node on the Internet, but is sustained by many sources of information and files. Thus, such an attack would have to flood significant parts of the Internet in order to poison the presented framework in a way that would bias the classifier. Not only is such a flood by an attacker not feasible, but it is also time-consuming, and therefore anti-virus vendors have enough time to distribute a patch against it. Secondly, since our framework tries to select the most informative files and attempts to enlarge the signature repository, it does not choose files that are similar to ones that were acquired before. Our AL methods would not acquire a full set of malicious files that are similar in specific n-grams, but only a few representative ones. Thus the framework is resilient to such attacks, and its detection capabilities remain unaffected.

Whenever one uses machine learning methods based on static analysis (especially n-grams) for detecting unknown malicious files, a question is raised about the capability of the suggested framework to detect obfuscated (including encrypted, compressed and packed) malicious files (Moser, Kruegel, & Kirda, 2007b). For PC executables (in contrast to Android mobile applications), most files, benign and malicious, are not obfuscated in the first instance. When they are obfuscated and encrypted, it may be due to an attempt to evade security mechanisms such as anti-virus packages (Nachenberg, 1997) that analyze the static data of files. Thus, obfuscated files are more likely to be malicious rather than benign and therefore become more suspicious; such files are automatically sent to the lab for deeper scrutiny. Additionally, as was already reported by Tahan et al. (2012), obfuscation is usually performed by automated tools that generate distinctive properties inside the binary code of the obfuscated file, properties that distinguish it from unobfuscated files. The FalckonEncrypter is an example of an obfuscation library that was developed by hackers, and there is no reason for genuine software companies to use an untrustworthy library developed by hackers. Consequently, the obfuscation actually helps in detecting malicious files. Moreover, as was presented in Newsome and Song (2005) and Newsome, Karp, and Song (2005), one of the desirable properties of a malicious file is its self-propagation capability. Since the malicious file is likely to have self-decompression or self-decryption commands inside its code, represented by fixed binary sequences, these commands can be used for detection. Accordingly, these sequences are taken into account in the learning process and assist in discriminating between malicious and benign files.

Two other independent studies have shown that ML methods based on static analysis of files can detect obfuscated malicious files even better than unobfuscated malicious files. Kolter and Maloof (2006) used n-gram features, as we did in our study, and reported that the detection accuracy on obfuscated malicious files was higher when using the n-gram features than when using payload functions as a feature extraction methodology. A recent study by Zhao et al. (2012) reported the same trend, namely that the TPR values among obfuscated files are higher than those among unobfuscated files, for the same abovementioned reasons.

It should be noted that packed files also contain a portion of code that is responsible for the unpacking operation, and those portions of code can be used for identifying different and informative packed files that might contain malicious code, especially in cases where those files were not packed by a popular packer. For instance, PolyUnpack (Royal et al., 2006), mentioned above, also conducted a static analysis phase of packed files in which the sequences of packed or hidden code in a malware instance can be made self-identifying.

There are also different malware families that use popular packers such as UPX/PKLite, in which a portion of the unpacking operation will be similar to that of benign files that have been packed with the same packer. However, the other part, which is packed, will contain patterns that differ from the benign ones. These patterns can be discovered with high probability when a suitable feature extraction methodology is used.

In relation to handling obfuscated files, one may ask how the framework will react if the reality is altered and many benign files are obfuscated. Likewise, what if an adaptive attacker obfuscates multiple benign programs and floods the network with them? Our answer is that this attack would probably ruin the discrimination in the binary sequences which currently exists between obfuscated malicious and benign files. This is a subject for further exploration, and we have recently begun working on this idea. Our work is based on building two complementary detection models: one will be trained on an obfuscated dataset and will discriminate between obfuscated files (benign and malicious); the other will be trained on a non-obfuscated dataset and will discriminate between non-obfuscated files (benign and malicious).

8. Discussion and conclusion

The main goal of this paper was to present a framework for efficiently updating anti-virus tools with new unknown malwares. With an updated classifier, we can detect new malwares that can be utilized for sustaining an anti-virus tool. Both the anti-virus and the detection model (classifier) must be updated with new and labeled files. This labeling is done manually by experts; thus the goal of the classification is to focus expert efforts on labeling files that are more likely to be malware or on files that might add new information about benign files.

In this paper we proposed a framework based on new active learning methods (Exploitation and Combination) designed for acquiring unknown malware. The framework seeks to acquire the most important files, benign and malicious, in order to improve classifier performance, enabling it to frequently discover new unknown malwares and enrich the signature repository of anti-virus tools with them.

Adopting ideas from text categorization, we presented a static analysis methodology for representing malicious and benign executables for detecting unknown malicious code. Two experiments were conducted. In the first and most basic experiment, we tried to find the optimal configuration of dataset and SVM kernel that would yield the best capability for detecting unknown malwares. In the second and most important experiment, we evaluated the proposed framework, comparing the performance of our two AL methods to SVM-Margin (an existing AL method) and random selection in acquiring new malicious files and updating both the signature repository of the anti-virus and the detection model.

In general, the three AL methods performed very well, with our methods, Combination and Exploitation, outperforming SVM-Margin. This fact is one of the contributions of our study: the development of new AL methods that are more suitable than current ones for acquiring unknown malwares. The evaluation of the classifier before and after the daily acquisition showed an improvement in the detection rate, and subsequently more new malwares were acquired - a fact that actually justifies the acquisition process performed by our framework.

On the 10th day with the 50-file acquisition amount, Exploitation acquired 44 new malwares, an improvement of almost 30 percentage points over SVM-Margin (30 malwares) and of about 77 percentage points over the random selection method (6 malwares). On the 10th day of acquisition with the 250 batch, Exploitation acquired 201 malwares, which is almost eight times more than the amount acquired by random sampling (26 malwares), the baseline selection method. It also acquired 2.6 times more malwares than the existing AL method, SVM-Margin (74 malwares). For both the 50 and 250 acquisition amounts, the trend over the course of the ten-day period was very clear: each day Exploitation acquired more malicious files than the day before. This is an important feature that enables the detection model to update itself daily and to identify new malwares for enriching the signature repository of the anti-virus tool. For both the 250 and 800 acquisition amounts, we observed an interesting phenomenon: in the first few days, all the AL methods showed an increase in the number of acquired files; however, for the rest of the time, the SVM-Margin AL method showed a decrease in the number of malwares acquired, while our AL methods continued to increase and improve their acquisition capabilities. Therefore, we conclude that the SVM-Margin method is less efficient in continuously acquiring new malwares and that our methods provide an essential capability for continuously acquiring new files and updating anti-virus tools. The larger acquisition amounts of 250/800 were also checked in order to measure the capability of the framework to efficiently acquire informative and malicious files in large-scale scenarios. Basically, this step demonstrated that the more the anti-virus vendor can invest in labeling, the better our framework will update the detection model and enlarge the signature repository of the anti-virus.

We can explain the better acquisition performance of Exploitation by the way it actually functions. Exploitation tries to acquire the files that are most likely malicious. In fact, it also acquires benign files that are thought to be malicious. Although these benign files are indeed confusing, they are very informative for the detection model. As a consequence, their acquisition improved the performance of the detection model more than the SVM-Margin method, which acquires files that are known to be confusing and thus contribute less to improving the detection model. We understood from this phenomenon that there are noisy benign files lying deep within the malicious side of the classification boundary. As noted, they are perhaps confusing but nevertheless very informative and valuable to the detection model (these files will probably become support vectors). Additionally, they help in acquiring malicious files that eventually update and enrich the signature repository of the anti-virus. It should be noted that these files seem to be even more informative than malwares, since they embody relevant information that was hitherto hidden: the classifier initially regarded them as malwares, and they were finally discovered to be benign. In contrast to domains in which noisy data is not crafted deliberately, in the malware detection domain there is a significant rate of noisy files that make detection much more complicated (such as malwares that are purposely designed to look like benign files). This was demonstrated in a recent and comprehensive study on worm detection (Nissim et al., 2012). It seems that our method (Exploitation) for acquiring files that mostly seem malicious induces a better detection model that will eventually also acquire confusing but valuable and informative benign files.

In future work, we are interested in implementing this framework also for Android applications, where it is not very feasible to apply advanced detection techniques on the device itself due to its resource limitations (CPU, battery, etc.). These mobile devices are therefore very dependent on anti-virus solutions, which should be frequently and efficiently updated. Very possibly our suggested AL framework could address this problem. An additional research direction lies in developing AL methods for non-executable files such as PDF files. Malicious PDF files have been found to exploit many vulnerabilities in Adobe Reader versions. This follows from the recent phenomenon of PDF files being used to perform malicious activities, especially as part of APT attacks against organizations. The AL methods will help in detecting and identifying the newest malicious PDF files that utilize zero-day exploits found in Adobe Reader.

Acknowledgements

This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. We would like to thank Clint Feher, who assisted in the dataset creation, and Yuval Fledel for meaningful discussions and comments on the efficient implementation aspects.

References

Abou-Assaleh, T., Cercone, N., Keselj, V., & Sweidan, R. (2004). N-gram-based detection of new malicious code. In Computer software and applications conference, 2004. COMPSAC 2004. Proceedings of the 28th annual international, September 28-30 (Vol. 2, pp. 41-42).
Angluin, D. (1988). Queries and concept learning. Machine Learning, 2(4), 319-342.
Baram, Y., El-Yaniv, R., & Luz, K. (2004). Online choice of active learning algorithms. The Journal of Machine Learning Research, 5, 255-291.
Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C., & Kirda, E. (2009). Scalable, behavior-based malware clustering. In NDSS (Vol. 9, pp. 8-11).
Bi, J., Bennett, K., Embrechts, M., Breneman, C., & Song, M. (2003). Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 3, 1229-1243.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.
CERT (1999). Trojan horse version of tcp wrappers.
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chen, Z., Roussopoulos, M., Liang, Z., Zhang, Y., Chen, Z., & Delis, A. (2012). Malware characteristics and threats on the internet ecosystem. Journal of Systems and Software, 85(7), 1650-1672.
Golub, T., Slonim, D., Tamaya, P., Huard, C., Gaasenbeek, M., Mesirov, J., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Henchiri, O., & Japkowicz, N. (2006). A feature selection and evaluation scheme for computer virus detection. In Sixth international conference on data mining, ICDM '06, 18-22 December (pp. 891-895).
Herbrich, R., Graepel, T., & Campbell, C. (2001). Bayes point machines. The Journal of Machine Learning Research, 1, 245-279.
Jacob, G., Debar, H., & Filiol, E. (2009). Malware behavioral detection by attribute-automata using abstraction from platform and language. In Recent advances in intrusion detection (pp. 81-100). Berlin Heidelberg: Springer.
Jang, J., Brumley, D., & Venkataraman, S. (2011). BitShred: Feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM conference on computer and communications security (CCS '11) (pp. 309-320). New York, NY, USA: ACM.
Jnanamurthy, H. K., Warty, C., & Singh, S. (2013). Threat analysis and malicious user detection in reputation systems using mean bisector analysis and cosine similarity (MBACS).
Joachims, T. (1999). Making large scale SVM learning practical.
Kiem, H., Thuy, N. T., & Quang, T. M. N. (2004). A machine learning approach to anti-virus system. In Joint workshop of Vietnamese society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on active mining, 4-7 December (pp. 61-65), Hanoi, Vietnam.

Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X. Y., & Wang, X. (2009). Effective and efficient malware detection at the end host. In USENIX security symposium (pp. 351-366).
Kolter, J. Z., & Maloof, M. A. (2004, August). Learning to detect malicious executables in the wild. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 470-478). ACM.
Kolter, J. Z., & Maloof, M. A. (2006). Learning to detect and classify malicious executables in the wild. The Journal of Machine Learning Research, 7, 2721-2744.
Lanzi, A., Sharif, M. I., & Lee, W. (2009). K-tracer: A system for extracting kernel malware behavior. In NDSS.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 3-12). Springer-Verlag New York Inc.
Masud, M. M., Khan, L., & Thuraisingham, B. (2007). Feature based techniques for auto-detection of novel email worms. In Advances in knowledge discovery and data mining (pp. 205-216). Berlin Heidelberg: Springer.
Menahem, E., Shabtai, A., Rokach, L., & Elovici, Y. (2009). Improving malware detection by applying multi-inducer ensemble. Computational Statistics and Data Analysis, 53(4), 1483-1494.
Mitchell, T. M. (1997). Machine learning. Burr Ridge, IL: McGraw Hill.
Mokube, I., & Adams, M. (2007). Honeypots: Concepts, approaches, and challenges. In Proceedings of the 45th annual southeast regional conference (ACM-SE 45) (pp. 321-326). New York, NY, USA: ACM.
Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). Inside the Slammer worm. IEEE Security and Privacy.
Moore, D., Shannon, C., & Claffy, K. (2002). Code-Red: A case study on the spread and victims of an internet worm. In Proceedings of the 2nd ACM SIGCOMM workshop on internet measurement (IMW '02) (pp. 273-284). New York, NY, USA: ACM.
Moser, A., Kruegel, C., & Kirda, E. (2007a). Exploring multiple execution paths for malware analysis. In Security and Privacy, 2007. SP '07. IEEE Symposium on (pp. 231-245). IEEE.
Moser, A., Kruegel, C., & Kirda, E. (2007b). Limits of static analysis for malware detection. In Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual (pp. 421-430). IEEE.
Moskovitch, R., Elovici, Y., & Rokach, L. (2008). Detection of unknown computer worms based on behavioral classification of the host. Computational Statistics and Data Analysis, 52(9), 4544-4566.
Moskovitch, R., Gus, I., Pluderman, S., Stopel, D., Glezer, C., Shahar, Y., & Elovici, Y. (2007). Detection of unknown computer worms activity based on computer behavior using data mining. In Computational Intelligence in Security and Defense Applications, 2007. CISDA 2007. IEEE Symposium on (pp. 169-177). IEEE.
Moskovitch, R., Stopel, D., Feher, C., Nissim, N., & Elovici, Y. (2008). Unknown malcode detection via text categorization and the imbalance problem. In Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on (pp. 156-161). IEEE.
Moskovitch, R., Nissim, N., Englert, R., & Elovici, Y. (2008). Detection of unknown computer worms activity using active learning. In The 11th international conference on information fusion, Cologne, Germany, June 30-July 3.
Moskovitch, R., Feher, C., Tzachar, N., Berger, E., Gitelman, M., Dolev, S., et al. (2008). Unknown malcode detection using OPCODE representation. In Intelligence and security informatics (pp. 204-215). Berlin Heidelberg: Springer.
Moskovitch, R., Nissim, N., & Elovici, Y. (2009). Malicious code detection using active learning. In Privacy, security, and trust in KDD (pp. 74-91). Berlin Heidelberg: Springer.
Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., & Elovici, Y. (2009). Unknown malcode detection and the imbalance problem. Journal in Computer Virology, 5(4), 295-308.
Nachenberg, C. (1997). Computer virus-antivirus coevolution. Communications of the ACM, 40(1), 46-51.
Nataraj, L., Yegneswaran, V., Porras, P., & Zhang, J. (2011). A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM workshop on security and artificial intelligence (pp. 21-30). ACM.
Nataraj, L., Karthikeyan, S., Jacob, G., & Manjunath, B. S. (2011). Malware images: Visualization and automatic classification. In Proceedings of the 8th international symposium on visualization for cyber security (p. 4). ACM.
Newsome, J., & Song, D. (2005). Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software.
Newsome, J., Karp, B., & Song, D. (2005). Polygraph: Automatically generating signatures for polymorphic worms. In Security and Privacy, 2005 IEEE Symposium on (pp. 226-241). IEEE.
Nissim, N., Moskovitch, R., Rokach, L., & Elovici, Y. (2012). Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications, 15(4), 459-475.
Perdisci, R., Lanzi, A., & Lee, W. (2008). McBoost: Boosting scalability in malware collection and analysis using statistical classification of executables. In Computer Security Applications Conference, 2008. ACSAC 2008. Annual (pp. 301-310). IEEE.
Provos, N., & Holz, T. (2008). Virtual honeypots: From botnet tracking to intrusion detection. Addison-Wesley, pp. 231-272.
Rieck, K., Holz, T., Willems, C., Düssel, P., & Laskov, P. (2008). Learning and classification of malware behavior. In Detection of intrusions and malware, and vulnerability assessment (pp. 108-125). Berlin Heidelberg: Springer.
Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4), 639-668.
Rossow, C., Dietrich, C. J., Grier, C., Kreibich, C., Paxson, V., Pohlmann, N., & van Steen, M. (2012). Prudent practices for designing malware experiments: Status quo and outlook. In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 65-79). IEEE.
Royal, P., Halpin, M., Dagon, D., Edmonds, R., & Lee, W. (2006). PolyUnpack: Automating the hidden-code extraction of unpack-executing malware. In Computer Security Applications Conference, 2006. ACSAC '06. 22nd Annual (pp. 289-300). IEEE.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
Schultz, M. G., Eskin, E., Zadok, E., & Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on (pp. 38-49). IEEE.
Shabtai, A., Moskovitch, R., Elovici, Y., & Glezer, C. (2009). Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report, 14, 16-29.
Sharif, M., Lanzi, A., Giffin, J., & Lee, W. (2009). Automatic reverse engineering of malware emulators. In Security and Privacy, 2009 30th IEEE Symposium on (pp. 94-109). IEEE.
Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M. G., et al. (2008). BitBlaze: A new approach to computer security via binary analysis. In Information systems security (pp. 1-25). Berlin Heidelberg: Springer.
Stopel, D., Boger, Z., Moskovitch, R., Shahar, Y., & Elovici, Y. (2006). Improving worm detection with artificial neural networks through feature selection and temporal analysis techniques. In Proceedings of the third international conference on neural networks, Barcelona.
Tahan, G., Rokach, L., & Shahar, Y. (2012). Mal-ID: Automatic malware detection using common segment analysis and meta-features. The Journal of Machine Learning Research, 13(1), 949-979.
Tong, S., & Koller, D. (2000-2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45-66.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data (Vol. 41). Berlin: Springer.
Vapnik, V. (1998). Statistical learning theory. New York: Springer.
Wang, X., Yu, W., Champion, A., Fu, X., & Xuan, X. D. (2007). Worms via mining dynamic program execution. In Third international conference on security and privacy in communication networks and the workshops, SecureComm (pp. 412-421).
Willems, C., Holz, T., & Freiling, F. (2007). Toward automated dynamic malware analysis using CWSandbox. IEEE Security and Privacy, 5(2), 32-39.
Yu, Z. H. U., Xin-cai, W., & Hai-bin, S. (2008). Detection method of computer worms based on SVM. Mechanical and Electrical Engineering Magazine, 8, 2.
Zhao, W., Long, J., Yin, J., Cai, Z., & Xia, G. (2012). Sampling attack against active learning in adversarial environment. In Modeling decisions for artificial intelligence (pp. 222-233). Berlin Heidelberg: Springer.

Knowl Inf Syst DOI 10.1007/s10115-016-0918-z

REGULAR PAPER

ALDROID: efficient update of Android anti-virus software using designated active learning methods

Nir Nissim1 · Robert Moskovitch2 · Oren BarAd1 · Lior Rokach1 · Yuval Elovici1

Received: 6 January 2015 / Revised: 12 October 2015 / Accepted: 11 January 2016 © Springer-Verlag London 2016

Abstract Many new unknown malwares aimed at compromising smartphones are created constantly. These widely used smartphones are very dependent on anti-virus solutions due to their limited resources. To update the anti-virus signature repository, anti-virus vendors must deal with vast quantities of new applications daily in order to identify new unknown malwares. Machine learning algorithms have been used to address this task, yet they must also be efficiently updated on a daily basis. To improve detection and updatability, we introduce a new framework, "ALDROID", and the active learning (AL) methods on which ALDROID is based. Our methods are aimed at selecting only new informative applications (benign and especially malicious), thus reducing the labeling efforts of security experts and enabling a frequent and efficient process of enhancing the framework's detection model and Android's anti-virus software. Results indicate that our AL methods outperformed other solutions, including an existing AL method and a heuristic engine. Our AL methods acquired the largest number and percentage of new malwares, while preserving the detection models' detection capabilities (high TPR and low FPR rates). Specifically, our methods acquired more than double the amount of new malwares acquired by the heuristic engine and 6.5 times more malwares than the existing AL method.

Corresponding author: Nir Nissim ([email protected]; [email protected])
Robert Moskovitch ([email protected])
Oren BarAd ([email protected])
Lior Rokach ([email protected])
Yuval Elovici ([email protected])
1 Ben Gurion University of the Negev, Beersheba, Israel
2 Columbia University, New York, NY, USA

Keywords Detection · Acquisition · Malware · Android · Active learning · Anti-virus · Application

1 Introduction

Anti-virus vendors face increasing difficulty detecting unknown malware on smartphones, and alternative approaches must be developed in order to provide an effective solution. Existing approaches to detecting unknown malware have often been based on machine learning (ML) methods that can induce a model from a set of samples (malware and benign), which later makes it possible to detect unknown malware. The detection of malwares on PCs using ML methods based on static analysis has been intensively researched over the past decade [1,16,35,36,44,48,66]. Dagon et al. [11] and Piercy [58] were the first to discuss malware for smartphones, in 2004. However, the dramatic increase in the use of smartphones also increases the possibility of cyber attacks [37,71]. More specifically, the growth of the Android market has led to increased threats to Android security over the past few years [41,65,68]. The dominance of the Android operating system most likely led to the massive creation of new types of Android malware, as reported by "Secure-List" [27], which indicated that 9000 such malwares were created during 2012. Among the well-known Android malware specially designed to perform malicious activities are: Geinimi [31], an information stealing Trojan horse [2]; the DroidDream [30] Trojan family that was discovered circulating in the official Android market; and DroidKungfu [25], which was the first to encrypt the exploit code it used to gain root access to the device. The main sources of both Android malware and legitimate Android applications are Web sites called "markets." Specifically, most Android applications are downloaded from the official market, which also goes by the name "Play Store" [6]. A comprehensive characterization of 1200 known Android malware based on a variety of relevant features and aspects was presented by Zhou and Jiang [82]. While Apple applies a rigorous automatic and manual review process [61] requiring at least two human reviewers, Android relies primarily on the Android permission system, as well as several basic security mechanisms. To detect malicious behavior, Google presented Bouncer in February 2012. Bouncer comprises machine learning algorithms based on dynamic and basic static analysis of applications uploaded to the market. However, according to [41], it was announced at Summer-Con 2012 that more than 20 ways of evading Bouncer had been discovered [56]. Currently, despite the increased amount of research aimed at providing solutions for the detection of unknown malwares, signature-based anti-viruses, with all their drawbacks, remain the commonly used solution. Signature-based anti-viruses are characterized by a delay in detection, such that it often takes between 48 and 80 days to detect new malware and update clients with the new signature [55,58]. This period of time is prohibitive in cases involving fast-propagating malware, such as the MMS worm which attacked the Symbian operating system and infected about 700,000 mobile devices in three hours [8]. Anti-virus vendors have tried to address this by attempting to reduce this amount of time, releasing multiple signature versions per day [29]. The key to this type of updatability lies in the fast and efficient discovery of new malware instances. Anti-virus vendors expend considerable effort in order to keep their signature repository up to date and maintain their ability to accurately detect malware [34].
This is a costly and complicated task that requires expertise in analyzing and labeling a vast number of applications daily [43].


To identify new unknown malware instances, anti-virus vendors must deal with large quantities of new applications on a daily basis. Some of these applications can be collected by installing agents on smartphones that upload them to a central server for analysis. This is in contrast to the PC domain, which is more likely to use honeypots for this purpose [59]. Anti-virus vendors first filter out known malware and its variants. Then known legitimate applications are filtered out utilizing white lists based on an application’s reputation and certificates. This information is available for Android from a variety of market sources and from reputation systems that indicate an application’s popularity [21]. Despite this filtering process, a large number of new unknown applications, both benign and malicious, remain. Anti-virus vendors use complementary solutions that focus on the applications most likely to be malicious in order to further reduce the number of applications that must be handled manually. Among the complementary solutions that have been proposed for efficiently discovering new Android malware are heuristic engines based on a scoring algorithm [4] and detection models based on machine learning [7,57,61–63,69]. These solutions are inefficient in the long run, because in each case the knowledge store is not frequently and actively updated. Our framework, based on active learning (AL) methods and introduced in this paper, was designed and developed to frequently update Android anti-virus software in order to address this issue. The framework makes it possible to focus expert efforts on labeling the applications that are most likely to be new malware, as well as benign applications that may improve the detection model. Both the anti-virus and the detection model must be updated with the newly labeled applications (malicious or benign). The updated detection model is used to detect new malware which enriches the anti-virus signature repository. Our framework maintains a detection model based on a classifier that is trained on a representative set of applications (malicious and benign) using static analysis. The detection model’s advantage lies in its generalization capabilities which enable it to detect new unknown malware with high probability. The use of static analysis allows for early detection which takes place even before an application has been installed on (or has infected) the host device. The novelty of our framework is its ability to frequently update the knowledge stores utilizing AL methods that prioritize selecting the most informative applications (both benign and malicious) for deeper analysis in order to update the anti-virus and detection model. These methods were previously found to be effective and to significantly enhance the detection of PC malware, both malicious executables [54] and non-executables (documents such as PDF files) [53]. As is known, the structure of Android applications differs substantially from that of the executable files within the Windows OS and the PDF documents we previously investigated. Therefore, the results of our previous studies cannot automatically be assumed to hold for the Android OS, since the detection model and AL methods used in the current study rely on different dataset characteristics related to the Android application domain, particularly in terms of the extracted features, the malware distribution, and the attack techniques detected by the detection model.
In addition, our current efforts are directed at the smartphone domain, an area in which the need for anti-virus enhancement is even greater. In contrast to PCs, smartphones are heavily dependent on anti-virus solutions, because of the inability to apply advanced detection methods (static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate the effective detection of new malware and efficiently and nimbly updated anti-virus tools. It is not feasible to analyze every new application, so our framework selects only the most probable malware for labeling. While our framework reduces the number of unknown applications that must be manually analyzed, it strengthens the detection model at the same time by also selecting informative benign applications. Thus, the framework addresses the resource limitations of the smartphone, as well as the challenge presented by the sheer volume of unknown applications created daily. Our approach is capable of providing more frequent updates to the detection model, because only a small, yet manageable, set of informative applications is sent to the human expert for inspection and subsequently acquired by the detection model. This is in contrast to heuristic approaches based on scoring algorithms or other types of detection models which are only updated periodically due to the labor-intensive process of human expert analysis. In our framework, the updated detection model efficiently updates the anti-virus signature repository which, in turn, improves the detection capabilities of the installed and widely used anti-virus software within smartphones. The advantages of our framework are meaningful: the detection model is strengthened, fewer applications require expert handling, and the anti-virus signature repository is more frequently updated, thus limiting the risk associated with malicious applications and improving the detection ability of the anti-virus within smartphones. We are fully aware of the limitations of static analysis in malware detection (as will be discussed below); however, we focus on the use of active learning (AL) concepts rather than on the analysis environment, which can be either static or dynamic, and our methods have been effective in both analysis settings [49,53,54]. To evaluate our methodology, we conducted an extensive and rigorous evaluation based on a test collection containing more than 27,000 Android applications (benign and malicious). The contributions of this paper are threefold:
1. We introduce the ALDROID framework for frequent (i.e., on a daily/weekly/monthly basis) and efficient updating of Android anti-virus tools.
2. We introduce two active learning methods, called Exploitation and Combination, that were designed for the acquisition of new Android malware. The two methods are rigorously evaluated in comparison with a conventional malware detection and selection method and a basic active learning method.
3. We present a set of general descriptive features for the detection of Android malware, features which are robust and unaffected by obfuscation or transformation evasion techniques. The features are based on the application’s static genes and not on the optional operations it might conduct. Therefore, the features are also robust against evasive techniques based on delayed malicious operations.
The rest of the paper is organized as follows. Section 2 surveys related work, and Sect. 3 presents the suggested framework and methods.
Section 4 discusses the measures used for evaluating the proposed framework, followed by a presentation of the experiment’s design. Section 5 presents the results of the proposed approach, while Sect. 6 discusses how the framework copes with potential attacks. Finally, Sect. 7 provides conclusions, discusses the advantages of the described framework, and suggests future research directions.

2 Background

Several studies have presented advanced machine learning-based methods for the detection of unknown Android malware. These methods analyze the Android application and extract sets of indicative features that are used to train a learning algorithm or rule-based engine. The methods can be roughly divided, based on their feature extraction approach, into static and dynamic. Features extracted using static analysis are derived from basic elements related to the application’s structure and logic, such as the application’s permissions, dex code, and sources, while the features extracted dynamically can represent either the application’s behaviors (function flow, call sequences, etc.) or the device itself (memory usage, traffic, code execution, power, heat, etc.). In this section, we present a comprehensive overview of notable works of both types of analysis, grouping works by their feature extraction approach for the convenience of the reader. This thorough review process has been valuable on two counts: the extensive overview provides valuable background material to the reader, and it also strengthened our work by enabling us to better understand which features to include in our feature set and improving our ability to maximally leverage the information provided by the AL methods.

2.1 Detection methods based on static analysis

Prior to the discovery of actual malware on Android, Shabtai et al. [69] explored the applicability of ML techniques to Android malware detection. Their classification task was to classify games and tools in order to demonstrate possible capabilities for the later classification of malware. Later, in [62], they expanded this work to categorize applications from a variety of categories and also malware. As permission analysis can shed light on the potential behavior and possible actions of an application, we present several examples of studies that applied static analysis and leveraged permission data in order to detect malicious applications. In the PUMA method [61], the authors suggested a new representation approach for Android applications based on the analysis of the application’s permissions found in its AndroidManifest file. They noticed several differences between malicious and benign applications, such as the fact that a malicious application is more likely to require only a single permission, while benign applications usually require two or three permissions. A method for evaluating malware risk for Android applications based on application permissions, using the SVM classifier and risk signals, was introduced in [63]. Later it was expanded to a deeper analysis [57] using probabilistic general models based on several variations of the Naïve Bayes classifier. Their approach successfully differentiated between critical permissions and less critical ones, an ability that is helpful in risk evaluation for Android malware, on relatively small datasets of a few hundred malware samples. Zhou et al. [81] proposed DroidRanger and presented two schemes: the first was aimed at the detection of new samples of known Android malware families and based on permission-based behavioral footprinting, and the second was heuristic-based filtering for the detection of inherent behavior of unknown malicious families. They focused on detecting the misuse of loading new untrusted code, such as Java binary code loaded from untrusted Web sites. In addition, they monitored the dynamic loading of native code locally, which can be an indication of malicious applications that exploit vulnerabilities in the OS kernel. Therefore, an application that tries to hide its native code in a directory other than the default directory (lib/armeabi) is considered suspicious. Their evaluation included 204,040 applications collected from five different Android markets in 2011 (including the official market), and their results revealed 211 malicious applications, including two zero-day malware that were embedded in 40 different applications. In 2012, Zhou et al. [83] proposed a system called DroidMOSS, a static analysis approach that focuses on analyzing the code within the application’s dex file. DroidMOSS aims at detecting repackaged applications and is based on a fuzzy hashing technique. By extracting varying sizes of code from applications and then applying a hash function to each piece, they computed the features that served as fingerprints and used them in order to identify similarities between applications.


In 2014, Suarez-Tangil et al. [72] presented the Dendroid system, aimed at classifying Android applications into malware families based on “code chunks” (CCs) that represent every method associated with a class within the application. Moreover, they represented the source code of the application using CFG algorithms provided by the Androguard tool [12]. This approach improved the resilience to code obfuscation over approaches based on analyzing the code’s sequence of instructions. Another direction for static analysis was presented by Luoshi et al. in 2013; their A3 system [40] looks for command and control servers that are usually used for malicious purposes, aiming at detecting the two most common malicious Android application behaviors: the collection of user information and the sending of premium-rate SMS messages. They suggested concentrating on static features based on the following steps: the extraction of IPs/URLs from the application’s decompiled code, tracing the relationships of all the function calls through the construction of a graph, looking for sensitive API calls that are related to accessing the user’s information, and checking whether the IP/URL is used for malicious behavior. They used just three malware samples (Alsalah, Sp_ntm, and Instagram), and their results demonstrated a clear relationship between sensitive APIs and the IP/URL in malware and showed that the malicious code is short and isolated, since it is actually embedded within the normal application. Each of the static analysis methods presented earlier relies on one type of feature, while the following works combine analysis from several sources, including the resources used, the dex code, and permissions. DroidAnalyzer was presented in 2014 by Seo et al. [67]. This is a static analysis tool that was designed to detect potential vulnerabilities of Android applications that are more likely to appear in malicious applications. In order to be up to date with the latest malware, they first analyzed the main features (API, keywords, commands, permissions, etc.) that represent root exploit malware by using the malware set of the Android Malware Genome Project [26]. Root exploits are the foundation for the detection of a root-level exploit, and indeed their evaluation exposed four banking applications that performed malicious activity, such as SMS charging through gaining permissions. Motivated by the desire to reduce the lead time in detecting unknown malware, [4] presented a heuristic engine based on static analysis to provide preliminary and fast analysis of the large numbers of new applications uploaded daily to the Android market. The heuristic engine is actually a rule-based system that calculates an application’s risk score based on its patterns. Their dataset contained 947 benign applications and 107 malware samples. The features were extracted from three sources of the application: the Android application package (APK) was decompressed and its resources extracted; the classes.dex was decompiled using Baksmali, which resulted in a readable format, SMALI; and the manifest.xml was analyzed. Their search for dangerous patterns and elements in all three of these sources encompassed 39 features divided into several main groups: permissions; API calls; commands; presence of binaries or Zip applications (indicating possible exploits); geographic origin of the application that might raise the risk (40% of malware originates in Russia and 35% in China); URLs [33]; size of the code; and finally a combination of rules.
Their results showed a clear difference between benign and malicious applications.

2.2 Detection methods based on dynamic analysis

Dynamically analyzing the application’s dex code has many advantages, the most significant of which is overcoming well-obfuscated code that is designed to evade static analysis methods. In AAsandbox [7], a two-step analysis of Android applications was presented, combining static and dynamic analysis. The static analysis is first employed on the disassembled dex code, and then a dynamic approach is applied which executes the extracted code using a specially crafted Android emulator sandbox. Their dataset contained one malware and 150 popular benign applications that had been downloaded from the Android market in October 2009. Their experiment showed that the AAsandbox successfully detected the malware among the benign applications. Another dynamic analysis approach involves monitoring the permissions that were actually used, rather than those that were merely declared as required, as is done by the works that employed static analysis. In 2013, Zhang et al. [77] presented VetDroid, an approach that conducts an actual analysis of permission use in Android applications. Their approach is based on dynamic analysis conducted during the resource request stage. In this process, they identify requests for protected system resources. These requests have to do with the call sites of the application that triggered the Android API. By conducting dynamic analysis, their approach reconstructs permission use behavior which represents malicious activity. Their approach was applied to 600 malware samples collected from the Malware Genome Project [26], and it was found to be efficient in identifying whether the acquired permissions were used for sensitive information access or data exfiltration. Analyzing the network traffic patterns created by running an application was suggested by Shabtai et al. [70], who attempted to detect malware that uses self-updating capabilities. Such malware is first downloaded from the Google Play market as a benign application with sufficient permissions and becomes malicious through its self-updating capabilities. They proposed the detection of this malware based on its network traffic patterns. Using their self-developed malware and 50,000 benign applications, they demonstrated that self-updating malware can be detected within a few minutes by scrutinizing its network traffic activity. A sophisticated approach was presented in 2013 by Lin et al. [39], who proposed the SCSdroid system (system call sequence droid), which was also aimed at detecting repackaged applications. Malicious code is usually repackaged into various benign applications. Therefore, in order to detect a specific family of malicious repackaged applications, it is crucial to compare it to the original benign application. The novelty of their work is that by utilizing thread-grained system call sequences monitored during runtime, the malicious behavior cannot be hidden and will appear in this sequence. In 2013, Ham and Choi [14] presented an approach based on gathering several dynamically extracted features for the detection of new Android malware using machine learning algorithms. Behavioral features were collected by their agent which was installed on the Android device. After feature selection, they used only ten features from five categories: network, SMS (send/receive SMS), CPU (usage), memory (native, Dalvik), and virtual memory (VMpeack, VmHwn).
Their dataset included 30 benign applications and five malware samples (GoldDream, PJApps, DroidKungfu2, Snake, and Angry Birds Rio Unlocker). The Random Forest classifier was found to outperform the others with 0.998 AUC, 98% TPR, and 0.01% FPR. The novelty of their feature set lies in the low level of resource consumption required to collect the features.

2.3 Static versus dynamic analysis of Android applications

Each of the analytical approaches (static and dynamic) has advantages and disadvantages. Consequently, a multi-approach framework that conducts both static and dynamic analysis might reduce the ways in which malware evades the detection model. The static analytical methods have several advantages. First, static analysis is virtually undetectable – the application cannot know that it is being analyzed, because it is not executed. While it is possible to create static analysis traps to deter analysis, these traps can actually be used to detect malware [28]. Another beneficial feature is that static analysis is relatively efficient and fast and can therefore be performed in acceptable timeframes. Consequently, it will not cause bottlenecks, as was explained by [69]. Static analysis is also easy to implement, monitor, and measure. Moreover, it scrutinizes the application’s “genes” and not its current behavior, which can be changed or postponed to a different, unknown time. An additional aspect, discussed by [4], shows that static analysis can be used to provide a scalable pre-check of malware. Lastly, when using lightweight algorithms, static analysis can also be deployed on smartphones; therefore, the approach can be used for an online and collaborative detection scheme, as was shown by [64]. Hence, once an Android worm is discovered, static analysis can also help prevent its propagation and that of similar malware to other devices. On the other hand, static analysis can be evaded by code obfuscation. Whenever one uses machine learning methods based on static analysis (especially n-grams) for the detection of unknown malicious code, a question arises regarding the ability of the suggested framework to detect obfuscated malware. Almost all of the dex code of Android applications is obfuscated to some degree. Moreover, additional techniques exist for evading static analysis in Android, such as Java Reflection and the Java Native Interface’s ability to run dynamically loaded libraries at runtime. Providing solutions to these evasion techniques is not the focus of our paper; however, we constructed our feature set so that it generally reflects the application’s possible behavior and is generically descriptive. This was done in order to ensure that the features are not affected by evasion techniques (especially code obfuscation) and that the features supply enough information about the expected behavior and abilities of the application. Therefore, our framework is designed to identify especially informative applications for further analysis. The dataset creation and collection section presents a more detailed explanation of these features. Considerable research has been conducted on the dynamic analysis approach for Android applications. However, dynamic analysis suffers from high costs and complexity. It can also be detected and avoided by the executed malcode. For example, Google’s Bouncer, based on the dynamic approach [22], was proven to be easily evaded and manipulated. After evaluating the advantages and disadvantages presented above, we chose to focus on static analysis, with the aim of developing an active learning framework capable of empirical evaluation over a large set of Android applications in a reasonable amount of time.

2.4 Active learning methods for malware detection enhancement

Both the static and dynamic analysis methods presented above require labeled applications. Labeling applications, a task which is crucial for the learning process, is often an expensive task since it involves human experts. Active learning (AL) was designed to reduce the labeling effort by actively selecting the examples with the highest potential contribution to the learning process of the detection model. AL is roughly divided into two major approaches: membership queries [3], in which examples are artificially generated from the problem space, and selective sampling [38], in which examples are selected from a pool. The selective sampling approach is used in our study. Studies in several domains have successfully applied active learning in order to reduce the time and money required for labeling examples. Unlike random learning, in which a classifier randomly selects examples from which to learn, in active learning the classifier actively indicates the specific examples that should be labeled—commonly, the most informative examples for the training task. SVM Simple-Margin [74] is an existing AL method that we considered in our experiments. Moskovitch et al. [47] and Nissim et al. [49] successfully applied AL methods to detect unknown computer worms. Using AL in such cases was very useful in removing noisy examples and in selecting the most informative examples. Other studies utilizing AL for unknown PC malware detection [45,46] demonstrate a somewhat limited approach in which an attempt is made to replace an anti-virus with ML and AL. However, this is unrealistic, particularly for smartphones, which are strongly dependent on anti-virus software. Additionally, in their experimental work, these researchers [45,46] do not refer to the real and crucial need for repeated and frequent updating of detection components over time. A study carried out in 2012 [78] presented the RobotDroid system, which uses AL on Android applications to induce an accurate detection model with minimal labeled samples and is based on a principle similar to a system that had been previously used for the detection of PC worms [49]; however, RobotDroid [78] was somewhat limited in extending the anti-virus signature repository over time. Moreover, their solution conducts detection on the smartphone itself and not in markets that contain a large number of applications, such as the Google market. Thus, their detection mechanism has less exposure to new malware that exists in the market, because it is only exposed to malware that has been actively downloaded to the smartphones on which their detection model is deployed. On the other hand, recent work by Nissim et al. [54] presented a framework and novel AL methods that were specially designed to update the detection model with informative files and to enrich both it and the signature repository of the anti-virus with new malware daily. They successfully applied this concept to enhance the detection of executable malware in the Windows OS. Additionally, Nissim et al. [50,53] presented the ALPD framework, based on AL methods and aimed at enhancing the detection of malicious PDF files targeted at organizations.

3 Suggested framework and methods

3.1 The framework

Figure 1 illustrates the framework and the process of detecting and acquiring new Android malware while preserving the updatability of the anti-virus and the detection model. In order to derive the maximal contribution from the suggested framework, the framework should be deployed at strategic nodes over the Internet, including application markets, in an attempt to expand its exposure to as many new applications as possible. This wide deployment will result in a scenario in which almost every new application will go through the framework. If an application is informative enough or is perceived as malicious or likely to be malicious, then it will be acquired for manual analysis. Examples of strategic nodes include ISPs and gateways of large organizations, and significant application markets include Google Play and official and unofficial Chinese markets. Figure 1 presents each step of the process (denoted by {step}). Further explanation is provided below. After the applications for analysis are collected from several primary resources, such as official and unofficial application markets, Web sites and forums, and mobile devices (using installed agents) {1}, they are transformed into vector form {2} as explained in the dataset preparation subsection. The vectored applications are filtered by what we refer to as the “Known Applications Module,” which filters out all the known benign and malicious applications (according to the white lists, anti-virus tools, and current signature repository) {3}. The remaining applications, which are unknown, are then introduced to the detection model based on SVM and AL.


Fig. 1 The process of preserving the updatability of Android anti-virus software using AL methods

We employed the support vector machine (SVM) classification algorithm [9] in a supervised learning approach, since SVM has been successfully used to detect PC worms, as indicated in three previous works [42,75,76]. Moreover, Wang states that “SVM learns a black-box classifier that is hard for worm writers to interpret.” In addition, SVM has been proven to be very efficient at enhancing malicious PDF file detection when combined with AL methods [53] and at enhancing PC malware detection [54]. The latter study also provided a comprehensive explanation of SVM principles. The SVM classification algorithm was also used for the detection of Android malware [78]. In our work, we used the Lib-SVM implementation of [10] which also supports multiclass classification. Drawing upon its acquired knowledge, the SVM-based detection model (utilizing the AL method) scrutinizes the unknown applications and provides a classification decision and a measure indicating the distance from the separating hyperplane for each application {4} using Eq. 1 (to be presented next). This distance represents the certainty of the detection model regarding the specific application. An application that the AL method perceives as informative is acquired and sent to an expert for analysis. The expert then provides its true label {5}. By acquiring these informative applications, we aim to frequently update the anti-virus software by focusing the experts’ efforts on labeling applications that are most likely to be malware or on benign applications that are expected to improve the detection model. Accordingly, in our context there are two types of applications that may be more informative. The first type includes applications for which the classifier is not confident regarding their classification (the probability that they are malicious is very close to the probability that they are benign). Acquiring these applications as labeled examples is expected to improve the model’s detection capabilities. Therefore, as will be explained in the following section, these applications will probably lie inside the SVM margin and consequently will be acquired by the Exploration strategy, which selects informative applications, both malicious and benign, that are close to the separating hyperplane.

The second type of informative applications includes those that lie deep inside the malicious side of the SVM margin and that, according to Eq. 1, are found at a maximal distance from the separating hyperplane. These applications will be acquired by the Exploitation strategy and are also found at a distance far from the labeled applications. This distance is measured by the KFF calculation that is explained in the Exploitation AL method subsection. The informative applications are then added to the training set {6} for updating and retraining the detection model {8}. The applications that were labeled as malicious are also added to the anti-virus signature repository, enriching it and preserving its updatability {7}. The last step of updating the signature repository also requires the distribution of an update to clients utilizing the anti-virus application. The framework consists of two integrated phases: Phase one: training. A detection model is trained over an initial training set that includes a substantial number of applications, including a percentage of malware (predefined by the study) which will be discussed below. Next, the detection performance of the detection model is evaluated on a stream of unknown applications that are presented to it on the first day, and its ability to update the anti-virus tool with new malicious applications is also measured.

Phase two: detection and update. The detection model classifies each application in the stream, which also contains the predefined percentage of malware, which will be discussed below. The AL method calculates a rank representing the extent to which an application is informative and potentially malicious. This rank is used by the framework to decide whether the application should be selected and sent to the virus expert for deeper analysis. Once the informative applications have been selected and labeled, they are used to update the detection model’s training set (benign and malicious) as well as the anti-virus signature repository (when the application is malicious). The detection model is retrained over the updated and extended training set, which now also includes the acquired examples which are presumed at this point to be very informative. At the end of each day (or any other baseline period defined by the framework’s administrator), the updated model receives a new stream of unknown applications on which the updated model is tested and from which the updated model acquires informative applications. Note that the motivation is to acquire as much malware as possible, since such information will maximally update the anti-virus tool and help protect Android devices.

3.2 Selection strategies for new malware acquisition

3.2.1 Conventional selection method: heuristic rule-based engine

The heuristic engine is actually a rule-based system that calculates a risk score for an application based on its patterns. According to a major anti-virus company (one of our authors also works as a virus expert at this company) with whom we are cooperating, heuristic engines are one of the most basic and common complementary solutions for discovering new malware instances from among the vast quantities of unknown applications received daily by the company. The rules are usually characterized by a domain expert (malware analyst), who also decides (for every rule) what value should be added to the total score of the application. In Android applications, the rules might refer to an application’s country of origin, whether or not a GPS is being used, whether there is access to a contact list, etc. For comparison purposes, we implemented a heuristic engine based on specifications presented in recent work [4].

The main traits we implemented for the scoring algorithm were: the size of the dex file, if smaller than 70 KB; a certificate from China or Russia; the permissions SEND_SMS, RECEIVE_SMS, RECEIVE_MMS, INSTALL_PACKAGES, and CALL_PHONE; a URL in the dex strings, with legitimate URLs such as Google, Admob, and Yahoo being filtered out; the presence of strings such as “pm install,” “cmwap,” “cmnet,” and “10086” in the dex strings or in the manifest strings; and the presence of a Zip file, APK file, or binary file in the APK. The algorithm has proven to be quite effective, as the average score for the benign set was 1.7, while the average score for the malware set was 8.7. This engine was designed to assign a higher risk score in cases in which the application is more likely to be malware. Therefore, the engine selects those applications that have higher risk scores, which might indicate how malicious the application is. It should be noted that the scoring algorithm will give a high score to certain benign applications that are legitimate, such as text managers/senders, certain system applications such as the default application installer, and small and simple apps with ads. It will also give a low score to specific malware families that originated in Europe or contain highly obfuscated strings in the dex file or those that display extremely rare functionality, such as the “No Permission Remote Shell” applications.
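To make the scoring logic above concrete, the following minimal Python sketch shows how such a rule-based risk score might be computed. It is an illustrative assumption rather than the engine used in our experiments or in [4]; the trait weights, the input dictionary format, and the function name heuristic_risk_score are hypothetical.

# Minimal sketch of a rule-based heuristic risk scorer.
# The rule set and weights below are illustrative assumptions, not the exact
# engine described in [4]; each matched rule adds a fixed value to the risk score.

RISKY_PERMISSIONS = {"SEND_SMS", "RECEIVE_SMS", "RECEIVE_MMS",
                     "INSTALL_PACKAGES", "CALL_PHONE"}
WHITELISTED_URL_SUBSTRINGS = ("google", "admob", "yahoo")
SUSPICIOUS_STRINGS = ("pm install", "cmwap", "cmnet", "10086")


def heuristic_risk_score(app):
    """app is a dict summarizing static traits of one APK (hypothetical format)."""
    score = 0
    if app.get("dex_size_kb", 0) < 70:                      # very small dex file
        score += 1
    if app.get("certificate_country") in ("CN", "RU"):      # certificate origin
        score += 2
    score += len(RISKY_PERMISSIONS & set(app.get("permissions", [])))
    for url in app.get("dex_urls", []):                     # URLs found in dex strings
        if not any(w in url.lower() for w in WHITELISTED_URL_SUBSTRINGS):
            score += 1
    if any(s in app.get("dex_strings", "") for s in SUSPICIOUS_STRINGS):
        score += 2
    if app.get("contains_embedded_apk_or_binary", False):   # possible exploit payload
        score += 2
    return score


# Applications with the highest scores would be sent for manual analysis.
example = {"dex_size_kb": 45, "certificate_country": "CN",
           "permissions": ["SEND_SMS", "INTERNET"],
           "dex_urls": ["http://unknown-host.example/cmd"],
           "dex_strings": "pm install payload.apk",
           "contains_embedded_apk_or_binary": True}
print(heuristic_risk_score(example))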

3.2.2 Baseline active learning method: SVM Simple-Margin (Exploration)

The Simple-Margin method [74] is directly related to the SVM classifier. As is known, by using a kernel function, SVM implicitly projects the training examples into a different (usually higher-dimensional) feature space denoted by F. In this space, there is a set of hypotheses that are consistent with the training set, meaning that they create a linear separation of the training set. This set of consistent hypotheses is called the version space (VS). From among the consistent hypotheses, SVM then identifies the best hypothesis, the one with the maximal margin. The motivation behind the Simple-Margin AL method is to select examples from the pool that reduce the number of hypotheses, in order to achieve a situation where the VS contains the most accurate and consistent hypothesis. Calculating the VS is complex and impractical where large datasets are concerned; therefore, this method relies on a simple heuristic based on the relation between the VS and the SVM with the maximal margin. Practically speaking, instances that lie closest to the separating hyperplane (inside the margin) are more likely to be informative and new to the classifier, and these instances will be selected for labeling and acquisition. This method, contrary to our strategy, selects instances according to their distance from the separating hyperplane only in order to explore and acquire informative applications, without regard to their classified label. Thus, it will not necessarily focus on selecting and acquiring malware instances. The Simple-Margin AL method is very fast and can be applied to real problems; yet, as was indicated by its authors [74], this agility is achieved due to the fact that it provides a rough approximation and relies on the assumptions that the VS is fairly symmetric and that the hyperplane’s Normal (W) is centrally placed. It has been demonstrated, both in theory and in practice, that these assumptions can fail significantly [17]. Therefore, the method may actually query instances where the hyperplane does not even intersect the VS, and such instances might not even be informative. The Simple-Margin method was used for detecting PC malware instances [45], and according to preliminary results, the method also assisted in updating the detection model but not the anti-virus software. However, in [45] the method was used for only one trial, not in a process that consisted of several sequential days. Given this, we thought it would be interesting to compare its performance against our proposed AL methods through a daily process of detection and acquisition of unknown Android malware. We refer to it in our experiments as “Exploration.”
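The selection criterion of the Exploration (Simple-Margin) method can be illustrated with the following minimal Python sketch, which assumes a scikit-learn SVC classifier and synthetic data; it is not the implementation used in our experiments, and the explore function name and budget parameter are illustrative.

# Minimal sketch of the Exploration (SVM Simple-Margin) selection criterion:
# pick the unlabeled applications closest to the separating hyperplane,
# regardless of their predicted label. The data here is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(1000, 10))          # unlabeled daily stream

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)


def explore(clf, X_pool, budget):
    """Return indices of the `budget` pool instances nearest the hyperplane."""
    dist = np.abs(clf.decision_function(X_pool))   # |signed distance from hyperplane|
    return np.argsort(dist)[:budget]


selected = explore(clf, X_pool, budget=50)
print(selected[:10])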

3.2.3 Active-learning-based selection strategies for new malware acquisition

Since the goal of our framework is to provide solutions to real problems, our selective sampling method must be fast. We evaluate this by comparing our proposed AL methods with the two above-mentioned existing baseline methods: the heuristic rule-based engine and the Exploration AL method.

Exploitation We designed “Exploitation” based on the SVM classifier’s principles. It was recently introduced and successfully applied to enhance the detection of PC malware using AL concepts [54]. It is aimed at the selection of examples that are potentially the most malicious; thus, it selects the examples that lie farthest from the separating hyperplane of the support vector machine. In our investigation of the detection of Android malware, only the applications that are most likely to be malware will be acquired. Our motivation for this set of actions is the desire to enhance the signature repository of the anti-virus tool with as much new malware as possible. Thus, for every unknown application x, Exploitation rates its distance from the separating hyperplane using Eq. 1, based on the Normal of the separating hyperplane of the SVM classifier that serves as the detection model. As explained above, the separating hyperplane of the SVM is represented by W, the Normal of the separating hyperplane, which is a linear combination of the most important examples (the support vectors) multiplied by Lagrange multipliers (alphas) and by the kernel function K, which assists in achieving linear separation in higher dimensions. Accordingly, the distance calculation in Eq. 1 is simply performed between example x and the SVM Normal (W). In Fig. 2, for example, the applications that were acquired (marked with a red circle) are those applications that were classified as malicious and have the maximal distance from the separating hyperplane. Acquiring several new malicious applications that are quite similar and belong to the same virus family is considered a waste of manual analysis resources, since these applications will probably be detected by the same signature.

Fig. 2 The criteria by which Exploitation acquires new unknown malicious applications. These applications lie the farthest from the hyperplane and are regarded as representative applications

Thus, acquiring one representative application for this set of new malicious applications will serve the goal of efficiently updating the signature repository. In order to adhere to the goal of enhancing the signature repository as much as possible, we also check the similarity between the selected applications using the kernel farthest-first (KFF) method suggested by Baram et al. [5]. By using this method, we avoid acquiring examples that are quite similar (the similarity is checked according to their representation in the SVM kernel space). Consequently, only the representative applications that are most probably malicious are selected. If the representative application is detected as malware as a result of the manual analysis, all of its variants that were not previously acquired will be detected the moment the anti-virus is updated. In cases in which these applications are not variants of the same malware, they will be acquired the following day, as long as they are still most likely to be malware after the detection model has been updated. Figure 2 displays sets of relatively similar applications (according to their distance in the kernel space), and thus, only the representative applications that are most likely to be malware are acquired. It is well known that the SVM classifier defines the class margins using a small set of support vectors (i.e., Android applications). While the usual goal is to improve the classification by uncovering (labeling) applications from the margin area, in our case the primary goal is to acquire malware to be used for updating the anti-virus. Actually, the same number of applications is acquired each day, but with Exploitation we attempt to better explore the “malicious side” of the incoming applications, sometimes resulting in the discovery of additional benign applications (these applications will probably become support vectors and update the classifier). In Fig. 2, we can observe an example of a benign application lying deep inside the malicious side (in the blue triangle). Contrary to Exploration, which explores examples that lie inside the SVM margin, Exploitation explores the “malicious side” more efficiently as part of an effort to discover new and unknown malicious applications that are essential for the frequent update of the anti-virus signature repository. The distance calculation required for each instance in this method is quite fast and is comparable to the time it takes to classify an instance with an SVM classifier. Consequently, it is a very practical and fast method that can provide an acquisition ranking in a short timeframe. It is therefore applicable for products working in real time.

Dist(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x)    (1)
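The following simplified Python sketch illustrates the Exploitation criterion of Eq. 1 together with a KFF-style diversity filter. It assumes a fitted binary SVC whose positive decision values correspond to the malicious class, and it uses an RBF-kernel similarity as a stand-in for the kernel farthest-first check of Baram et al. [5]; the threshold, gamma value, and function names are illustrative assumptions, not our exact implementation.

# Simplified sketch of the Exploitation criterion with a KFF-style diversity
# filter. Assumes a fitted binary SVC whose positive decision values mean
# "malicious"; the RBF-kernel similarity used here stands in for the kernel
# farthest-first check of Baram et al. [5].
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def exploit(clf, X_pool, budget, diversity_threshold=0.95, gamma=0.1):
    scores = clf.decision_function(X_pool)          # Eq. 1: signed distance from the hyperplane
    candidates = np.argsort(scores)[::-1]           # deepest in the malicious side first
    selected = []
    for idx in candidates:
        if scores[idx] <= 0:                        # stop once no longer classified malicious
            break
        if selected:
            # skip candidates too similar (in kernel space) to already-selected ones
            sim = rbf_kernel(X_pool[[idx]], X_pool[selected], gamma=gamma).max()
            if sim > diversity_threshold:
                continue
        selected.append(idx)
        if len(selected) == budget:
            break
    return selected


# Example (reusing clf and X_pool from the Exploration sketch above):
# malicious_picks = exploit(clf, X_pool, budget=50)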

It was found that Exploitation outperformed the Exploration method in the PC malware domain [54]. In the current study, we evaluate its performance in the Android malware domain and compare it to the performance of the conventional heuristic engine mentioned earlier as well. Combination The combination method (termed Combination) lies between Exploration and Exploitation. On the one hand, the combination method begins with a phase in which it acquires examples based on the Exploration criteria in order to acquire the most informative applications; thus, both malicious and benign applications will be acquired. This Exploration-type phase is important in order to enable the detection model to discriminate between malicious and benign applications. On the other hand, the combination method tries to maximally update the signature repository in an Exploitation-type phase. This means that in the early acquisition period, during the first part of the day, Exploration predominates over Exploitation; as the day progresses, Exploitation becomes predominant. The combination of Exploration and Exploitation was also applied over the course of the ten days and not only within one specific day. As the days progress, the combination performs more

Exploitation than Exploration, which means that on the ith day there is more Exploitation than on the (i − 1)th day. We defined and tracked several configurations over the course of several days. We found that a combination based on a balance between Exploration and Exploitation performs better than other divisions (i.e., for 50% of the days the method conducts more Exploration, and Exploitation is implemented during the remaining time). In short, this method tries to take the best from both of the previous methods.
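As an illustration of how the acquisition budget could shift from Exploration to Exploitation over the ten-day period, consider the following minimal Python sketch. The linear schedule and the daily budget of 50 applications are illustrative assumptions and do not reproduce the exact configurations we tracked.

# Minimal sketch of the Combination schedule: the share of the daily acquisition
# budget given to Exploitation grows from day to day, while the remainder is
# spent on Exploration. The linear schedule below is an illustrative assumption.
def combination_budgets(day, total_days, daily_budget):
    """Split the daily budget between Exploration and Exploitation for a given day."""
    exploitation_share = day / float(total_days)      # early days -> mostly Exploration
    n_exploit = int(round(daily_budget * exploitation_share))
    n_explore = daily_budget - n_exploit
    return n_explore, n_exploit


for day in range(1, 11):
    print(day, combination_budgets(day, total_days=10, daily_budget=50))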

4 Evaluation

Using a set of standard and widely utilized measures that cover our experimental objectives, we evaluated the ability of the proposed methods to efficiently acquire new malicious applications. The first objective was to validate the capability of the induced detection model to detect unknown malware based on the static features that we extracted. After validation, we carried out our main experiment for the evaluation of our proposed acquisition process using the various selection methods discussed above, as well as our AL methods.

4.1 Evaluation measures

For evaluation purposes, we measured the true-positive rate (TPR) [13,15], which is the percentage of positive instances classified correctly. The false-positive rate (FPR) is the percentage of negative instances misclassified. We also used the total accuracy measure, which is the number of correctly classified instances, either positive or negative, divided by the total number of instances. In addition, in the main experiment we measured the number of malicious applications that were acquired daily for labeling and updating both the anti-virus signature repository and the detection model’s training set. As was discussed earlier, the aim here is to maintain and improve the updatability of the anti-virus tool. This can only be done by daily enriching its signature repository with as many new malicious applications as possible. For that purpose, we present two simple measures: the number of malwares acquired (NOMA) and the percentage of malwares acquired (POMA). Note that each new malware acquired contributes to the updatability of the anti-virus by creating new signatures or updating existing ones. This is due to the fact that the set of applications presented to the framework consists of unknown applications that were not previously detected as malware by either the current anti-virus signatures or the white lists.
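The measures described above can be computed directly from the predicted and true labels, as in the following minimal Python sketch; the array-based input format and function names are illustrative assumptions, and the example values are synthetic.

# Minimal sketch of the evaluation measures described above; y_true/y_pred are
# label arrays (1 = malicious), and acquired_labels are the true labels of the
# applications selected for acquisition on a given day.
import numpy as np


def detection_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)        # true-positive rate
    fpr = np.mean(y_pred[y_true == 0] == 1)        # false-positive rate
    acc = np.mean(y_pred == y_true)                # total accuracy
    return tpr, fpr, acc


def acquisition_measures(acquired_labels, total_malware_in_stream):
    noma = int(np.sum(acquired_labels))                        # number of malwares acquired
    poma = 100.0 * noma / total_malware_in_stream              # percentage of malwares acquired
    return noma, poma


print(detection_measures([1, 0, 1, 0], [1, 0, 0, 0]))
print(acquisition_measures([1, 1, 0, 1], total_malware_in_stream=243))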

4.2 Experiment design

Our experiments were designed to answer the following research questions:
1. Is it possible to utilize the large number of unknown applications to efficiently update an Android anti-virus tool on a daily basis using our suggested framework and the designated AL methods?
2. Do the new AL methods suggested in this paper perform better than existing methods, such as conventional heuristic engines or the Simple-Margin AL method, regarding the number of new Android malware acquisitions?


4.3 Dataset creation and collection

4.3.1 Dataset collection

We constructed a dataset of viable and known malware/benign applications. The malware samples were downloaded from “found in the wild” repositories such as the Contagio mobile malware blog [20], the Malware Genome Project [26], and various third-party application stores based primarily in Russia, China, and Europe. We verified the malware by using a known vendor anti-virus solution provided by the AVG Company [24] and cross-checked it using VirusTotal [32] queries. We collected some 10,000 malware samples covering Trojans, intrusive ads, premium SMS senders, information stealers, native payloads, and Prankware, such as DroidDream, DDlight, BaseBridge, Geinimi (samples from all generations), DroidKungFu (A, B, C), Legacy Native (A and B), and various RuFraud samples; 30,000 benign samples were collected from Google Play between April 2012 and January 2013, and these were verified with the same tools. Notably, in Android it is possible to use Java Reflection and even execute malicious code that is dynamically loaded at runtime. An example of this may be seen in the GingerMaster malware, which is capable of downloading an “update” containing a binary exploit for the currently installed version. Another malware, called Plankton, can load Java code dynamically. The LeNa malware contains a binary file (.so) which is a malicious loadable binary code library. In our dataset, we included these and other types of malware that demonstrate this unique type of malicious behavior.

4.4 Feature extraction

We used modified versions of two open-source tools. The Axmlprinter project [19] extracts features from the Android manifest; the AndroidManifest.xml defines the application’s essential parts, including services, activities, content providers, receivers, the package name, the list of declared permissions, and the list of requested permissions. The Smali/Baksmali project [18] was used to extract features from the classes.dex file that contains the transformed Java bytecode to be executed. The current version [23] used by the public is Apps 3.5. For our research, 229 features were selected as generally indicative of the application’s behavior and properties—the full list can be found in the appendix section. Features extracted from the Android manifest file included a list of 207 permission features as Booleans. We defined summarizing features such as the number of activities, permissions, services, receivers, and content providers declared in the manifest, and also seven more Boolean features, such as whether the metadata tag is present; this feature might indicate that there is additional data which can be found in the AndroidManifest. General features were extracted from the classes.dex file, including features that represent the obfuscation rate of the application, since we assumed that heavily obfuscated applications have a much higher percentage of short class names [73], while slightly obfuscated apps have longer, “human readable” class names. Thus, we calculated the percentage of class names of various lengths: names shorter than 5 characters, between 5 and 10, between 10 and 15, and 15 and above. We also collected the number of implemented activities, services, receivers, and content providers from the DEX data. We added generic Boolean features to discover the use of OpenGL, Bouncycastle, and probable crypto-library calls. This was done to complement the

high-level functionality of the application not covered by the permissions list, adding six more features. Furthermore, we calculated derivative discrepancy features—the differences between the services, activities, and receivers declared in the manifest and those implemented in the classes.dex file—adding three more features. We normalized these features to prevent bias. The normalization was not performed on the Boolean attributes and class name percentages, since normalizing Boolean values is meaningless and the name length percentage is already normalized within the instance. These feature combinations and configurations are used routinely by mobile malware researchers to assist the decision-making process and quickly assess whether the examined application is worth another look, but the patterns defining the need for deeper analysis are usually not well defined and depend on the researcher’s experience. The use of ML techniques to streamline the process is very important and helps to reduce the need for costly human work. The patterns malware researchers look for can be as simple as a single Android permission, such as CALL_PHONE, SEND_SMS, or CHANGE_APN_SETTINGS, all of which are known to be used by malware to change an important system setting or cause the user to incur expenses. Combinations of permissions can be indicative of possible malware behaviors; for instance, INTERNET combined with READ_CONTACTS or FINE_LOCATION means that the application has the potential to send the user’s contact list to a remote server or send the user’s geolocation to a central tracking server. For certain malware strains, the permission pattern is distinctive enough for complete identification of the malware, such as the Eastern European/Russian premium SMS malware. Moreover, we chose our features to reflect possible modifications made to the application’s dex file; for instance, comparing the number of activities, services, and BroadcastReceivers declared in the manifest with those actually implemented in the code can indicate that a modification was made to the application code and/or manifest, signifying a possibly Trojanized application. We also chose an obfuscation degree indicator as one of our features. The obfuscation degree on its own does not signify that the application is malicious, but when high obfuscation is combined with certain permissions, we can identify whether the application is Trojanized.
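To illustrate two of the feature groups described above—the class-name length distribution used as a rough obfuscation indicator and the manifest/dex discrepancy features—the following minimal Python sketch can be used. The input formats (a list of class names and two component-count dictionaries) are illustrative assumptions; the actual extraction in our work relies on the Axmlprinter and Smali/Baksmali tools mentioned above.

# Illustrative sketch of two feature groups: the class-name length distribution
# (a rough obfuscation indicator) and the manifest/dex discrepancy features.
# Input formats are assumptions; real extraction would parse the APK with the
# tools described in the text.
def class_name_length_features(class_names):
    """Percentage of class names in each length bucket: <5, 5-10, 10-15, >=15 characters."""
    buckets = [0, 0, 0, 0]
    for name in class_names:
        n = len(name)
        if n < 5:
            buckets[0] += 1
        elif n < 10:
            buckets[1] += 1
        elif n < 15:
            buckets[2] += 1
        else:
            buckets[3] += 1
    total = max(len(class_names), 1)
    return [100.0 * b / total for b in buckets]


def discrepancy_features(manifest_counts, dex_counts):
    """Differences between components declared in the manifest and implemented in dex."""
    return {key: manifest_counts.get(key, 0) - dex_counts.get(key, 0)
            for key in ("activities", "services", "receivers")}


print(class_name_length_features(["a", "bc", "MainActivity", "PaymentService"]))
print(discrepancy_features({"activities": 5, "services": 2, "receivers": 1},
                           {"activities": 7, "services": 2, "receivers": 3}))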

4.4.1 Creating a representative dataset

We created an experimental dataset in order to evaluate our framework’s general performance in new Android malware acquisition and the updatability process of the signature repository, as well as to address the two research questions presented in Sect. 4.2. Our dataset is available1 on the Web and can be easily downloaded for experimental and research use. We present several observations which shaped our thinking in this research. As was elaborated in the background section, many works induced detection models using machine learning algorithms for the detection of Android malware. Yet none of them took into account the actual percentage of Android malware, a parameter which strongly affects the results and might even produce biased results that are not compatible with real scenarios. These include the work by Zhou et al. [83], who proposed a permission-based behavioral footprint scheme in 2012 aimed at detecting new samples of known Android malware families. In this case, the group of malware represented 0.2 to 0.47% of the total applications in several markets; however, this percentage is not representative, since there are many other types of malware that were not detected by their method. In another work, Lin et al. [39] evaluated SCSdroid with a malware rate of 30%, using 100 benign applications and 49 malicious applications.

1 www.ise.bgu.ac.il/engineering/PersonalWebSite1main.aspx?id=VMidijue.
Alternatively, Zhao [78] evaluated RobotDroid with a malware rate of 31% and used 90 infected applications (infected by only three different malware) and 200 benign applications. Similarly, Ham and Choi [14] evaluated their method with a malware rate of 14%, including only five malware samples and 30 benign applications. Nevertheless, several works that deal with PC malware detection have taken this percentage into consideration and carefully constructed datasets that imitate the reality of the malware detection domain as much as possible. These are usually imbalanced datasets that are compatible with the real-life scenario, consisting of a small percentage of malware and a large number of benign files. According to [44] and [54], the malware rate among PC executables is about 9–10%, and the current reality is that most programs, services, applications, and sites that run on PCs are being transformed and adapted to mobile devices. Additionally, as noted above, we have cooperated with a major anti-virus company which confirmed, based on company statistics, that 9% is more or less a correct estimate of the percentage of malware existing in the Android market. We do not claim that this represents the correct percentage of malware in the market, and we also recognize that the unofficial markets are more likely to include greater percentages of malware than the official market. The methodology presented is also suitable for smaller percentages of malware in the daily stream (e.g., 1 and 0.2%), only necessitating that the detection model’s initial training set be adjusted to this new percentage. The imbalance problem of malware percentage was studied comprehensively by Moskovitch et al. [44], who presented the problem of imbalance in the domain of PC malware detection and the effect of the malware rate on detection capabilities. Therefore, we tried to simulate real-life conditions as much as possible, and we adjusted our large dataset to the malware rate of 9% (discussed previously), taking into account the problem of imbalance in malware detection addressed in [44]. Hence, we adjusted our large dataset to reflect real-life conditions, reducing it from 40,000 applications to 27,500 applications (91%, totaling 25,000, benign Android applications, and 9%, totaling 2500, Android malware). The detection model encounters many known and unknown applications daily within the labs of anti-virus vendors. Since there is no need to scrutinize known applications, we filtered them out as either detected by the anti-virus’ current signatures or found in the white list of known and legitimate applications. We thus conducted a pure experiment based on unknown instances, rendering the task facing our AL methods much more challenging. To demonstrate the incremental daily arrival of new suspected applications, we split the entire dataset of 27,500 Android applications into ten different sets that were randomly selected and represented the number of new applications the detection model and anti-virus would presumably encounter over a ten-day period. Each of the ten datasets contained 2670 applications representing the daily stream of new applications with a 9% malicious application rate. The remaining 800 applications, used as the initial training set to induce the initial model, also contained 9% malicious applications.
This process was repeated ten times utilizing different randomizations and data splittings in order to decrease the variance. Finally, the results of these ten runs were averaged and presented as the final results.
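To make the data-preparation step concrete, the following is a minimal sketch of how such an imbalanced split into an initial training set and ten daily streams might be produced. It is not the code used in our experiments; the function name, the use of NumPy, and the handling of rounding are assumptions introduced here, so the composition of each stream only approximates the figures reported above.

```python
import numpy as np

def build_daily_streams(benign_ids, malware_ids, n_days=10,
                        init_size=800, day_size=2670,
                        malware_rate=0.09, seed=0):
    """Sketch (not the authors' code): shuffle application IDs and cut them into
    one initial training set and n_days daily streams, each containing roughly
    `malware_rate` malware."""
    rng = np.random.default_rng(seed)
    benign_ids = list(rng.permutation(benign_ids))
    malware_ids = list(rng.permutation(malware_ids))

    def draw(pool, k):
        if len(pool) < k:
            k = len(pool)  # rounding may leave the last stream marginally short
        return [pool.pop() for _ in range(k)]

    sets = []
    for size in [init_size] + [day_size] * n_days:
        n_mal = round(size * malware_rate)
        sets.append(draw(malware_ids, n_mal) + draw(benign_ids, size - n_mal))
    return sets[0], sets[1:]  # initial training set, list of daily streams

# Proportions reported above: 25,000 benign and 2,500 malicious applications
initial_set, daily_streams = build_daily_streams(np.arange(25000),
                                                 np.arange(25000, 27500))
```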

4.5 Stages of the experiment

As noted, a daily stream contained 2670 applications, of which 9% were malwares. We first induced the initial model by training it on the 800 known applications (also with a malware rate of 9%). We then tested it on the first day's stream of new applications. Next, from the first day's stream, the selective sampling method selected the most informative applications. The informative applications were sent to a human expert for true labeling, and these applications were later acquired by the framework into the training set. Whenever an application was determined to be malware, it was immediately used to update the signature repository of the anti-virus, and an update was made available for distribution to clients. Following the first day's acquisition, the detection model was updated for the next day, and this process was repeated over the next ten days. For each day, the performance of the detection model was averaged over ten runs on the ten different datasets that were created. Each selection method (heuristic engine, Exploration, Exploitation, and Combination) was separately checked, and we considered several acquisition amounts in order to scrutinize each selection method's performance over various daily acquisition amounts (50, 100, and 245 applications), representing 2, 3.7, and 9.1% of the daily stream's applications. The 245 acquisition amount (9.1%) is equal to the malicious file percentage (MFP) of Android malwares included in the daily stream (9.1% malware), and thus this amount enabled us to evaluate the selection methods and their ability to acquire all of the malware presented on a daily basis. We considered the two lower percentages in order to better understand the relationship between the number of applications acquired and the labeling effort required and to determine whether there is a point at which the value added by acquisition maxes out. To summarize, the experiment consisted of the following steps for each selection method:

1. The initial detection model is induced based on the 800 labeled applications in the initial training set.
2. The detection model is tested on the daily stream.
3. The daily stream is introduced to the selective sampling method, which selects the X most informative applications and asks for their true labels from the security expert.
4. These applications are labeled by the experts and added to the training set, and the malicious applications are used to update the signature repository.
5. The anti-virus signature repository is updated, and an updated detection model is induced from the enhanced training set for use on the next day.
6. Steps 2-5 repeat until the tenth day.
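The daily loop above can be outlined in code. The following is an illustrative sketch rather than the implementation used in our experiments: a scikit-learn SVC stands in for the SVM-based detection model (the kernel choice is an assumption), and select_informative and expert_label are hypothetical placeholders for the selection method under evaluation and for the human analyst, respectively.

```python
import numpy as np
from sklearn.svm import SVC

def run_daily_acquisition(X_init, y_init, daily_streams, select_informative,
                          expert_label, budget=245):
    """Sketch of steps 1-6. `select_informative(model, X_day, budget)` returns the
    indices of the applications sent to the expert; `expert_label(indices)` returns
    their true labels (1 = malware, 0 = benign). Both are placeholders."""
    model = SVC(kernel="rbf").fit(X_init, y_init)                # step 1: initial model
    X_train, y_train = X_init, y_init
    signatures, daily_predictions = [], []

    for X_day in daily_streams:
        daily_predictions.append(model.predict(X_day))           # step 2: test on the stream
        chosen = select_informative(model, X_day, budget)        # step 3: selective sampling
        labels = expert_label(chosen)                            # step 4: expert labeling
        X_train = np.vstack([X_train, X_day[chosen]])
        y_train = np.concatenate([y_train, labels])
        signatures.extend(i for i, y in zip(chosen, labels) if y == 1)  # step 5: new signatures
        model = SVC(kernel="rbf").fit(X_train, y_train)          # step 5: retrain for next day
    return model, signatures, daily_predictions                  # step 6: loop covers all days
```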

5 Results

In this experiment, we had two goals: the first was enriching and updating the signature repository of the anti-virus vendors with as many new malwares as possible. However, the first goal can only be achieved through the second goal, which is efficiently updating the detection model with new and informative applications. The AL methods use the detection model to identify new malware, thus directly contributing to reaching the first goal. Section 5.1 discusses the first goal, and in Sect. 5.2, we present the results for the second goal.

5.1 Number and percentage of malware acquired daily

We now present the results of each selective sampling method using the NOMA and POMA measures presented earlier. Note that we also compared the performance of the AL methods to the performance of the heuristic engine. As was previously explained, on the first day the detection model was induced from a randomly selected training set of 800 applications which contained 72 malwares. The process of updating the detection model starts from the second day; however, we depict the NOMA and POMA measures for all ten days.
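As a reminder of how the two measures are computed, the following one-function sketch (the helper name is ours, not part of the framework) makes them explicit:

```python
def noma_poma(acquired_labels):
    """NOMA: number of malwares among the applications acquired on a given day;
    POMA: their percentage of that day's acquisition (labels: 1 = malware)."""
    noma = sum(acquired_labels)
    poma = 100.0 * noma / len(acquired_labels)
    return noma, poma

# e.g., a day on which all 50 acquired applications are malware:
# noma_poma([1] * 50) -> (50, 100.0)
```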

Figure 3 shows the number of malwares acquired (NOMA) using each of the selection methods when the acquisition amount is limited to 50 applications daily, which can represent, for example, the available labeling resources. The Exploitation method performed well: on each of the ten days, all 50 acquired applications were new malwares, much more malware than was acquired by the other selection methods. Additionally, the Exploitation method's performance was stable throughout the ten days due to the daily update of the detection model. Not only does the framework enrich the signature repository with new malware instances (discovered among the 50 acquired), but it also enhances the detection model with 50 newly labeled informative applications. Those 50 applications are added to the training set on which the model is trained, improved, and updated toward new and informative applications. The percentage of malware acquired (POMA) indicates that all of the applications acquired by the Exploitation AL method were malware (50 out of 50 applications, or 100%). Note that, as was previously mentioned, there were no "known" malwares in the daily stream, since they were initially filtered out by the known application module. The advantage of Exploitation over the Exploration method can be explained as follows. Exploration mainly focuses on an efficient update of the detection model by acquiring informative applications lying inside the SVM's margin, regardless of the application's class (benign or malicious). This acquisition strategy updates the detection model so that it is potentially more accurate after the acquisition; however, it is less effective at acquiring as many malwares as possible for the anti-virus update. In addition, it can be observed that there is a decreasing trend in the number of malicious applications acquired (NOMA) daily by Exploration. Because of the detection model's daily update, fewer malwares lie within the SVM's margin every day, and the malware is therefore less informative for the Exploration AL method, an observation that first emerged during our experiment after several days of detection and acquisition. Therefore, Exploration is less efficient at the task of continuously acquiring malwares. The largest gap in the amount of malware acquired occurs on the tenth day: Exploitation acquired 50 malwares (POMA = 100%), while Exploration acquired only 33 (POMA = 66%), a difference of 34% in POMA, or more than 1.5 times more malware acquired.
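The contrast between the two acquisition strategies can be sketched in terms of SVM decision values. This is an illustrative sketch, not our implementation: it assumes a fitted scikit-learn SVC whose decision_function is positive on the malicious side, and the Combination method's exact scheduling of the two criteria is deliberately not reproduced here.

```python
import numpy as np

def exploration_select(model, X_day, budget):
    """Exploration: pick the applications closest to the separating hyperplane
    (inside or near the SVM margin), regardless of their predicted class."""
    distance = np.abs(model.decision_function(X_day))
    return np.argsort(distance)[:budget]

def exploitation_select(model, X_day, budget):
    """Exploitation: pick the applications lying deepest inside the malicious side
    of the hyperplane, i.e., those most likely to be new malware (assuming the
    positive class corresponds to malware)."""
    score = model.decision_function(X_day)
    return np.argsort(score)[::-1][:budget]

# The Combination method mixes the two criteria over the acquisition period;
# its exact schedule is not reproduced in this sketch.
```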

Fig. 3 The number of Android malwares acquired (NOMA) for updating the anti-virus and the detection model through the acquisition of 50 applications daily

The Combination method presented a trend of continuous improvement over the ten-day experiment; it started by acquiring 33 malwares on the second day, and on the tenth day it acquired 48. In the acquisition of 50 applications daily, Combination was positioned between Exploration and Exploitation in terms of performance and contributed both to a significant improvement in the performance of the detection model and to the amount of malware acquired daily. On the one hand, Exploration showed an increasing trend over the ten-day period in the performance of the detection model (Figs. 6, 7), yet showed a decreasing trend in the NOMA and POMA measures. In contrast, Exploitation did not show an increasing trend in the performance of the detection model (Figs. 6, 7), yet it maintained its almost perfect performance in the NOMA and POMA measures (Fig. 3). The better performance of Exploitation and Combination over the heuristic engine can be easily explained. The instability of the heuristic engine is reflected in the observation that it has neither a clear nor a constant trend over the time period. This is due to the fact that it is a rule-based method which lacks frequent updates to its knowledge base. On the one hand, the heuristic engine selects malwares to enrich the anti-virus; on the other hand, it is not enhanced daily with additional new knowledge, because it uses a fixed set of deterministic rules which rate how malicious an application is. During the first six days, the heuristic engine performed worse than or equal to Exploration, which was the poorest performing AL method; during the next four days, its performance improved relative to its earlier days. The Exploitation method outperformed the heuristic engine throughout the ten-day period. On the fourth day, when we observed the largest gap in the amount of malwares acquired, the Exploitation method acquired 50 malwares (POMA = 100%), while the heuristic engine acquired 30 (POMA = 60%), a difference of 40% in POMA, or 1.6 times more malwares acquired. Figure 4 shows the NOMA measure for each of the selection methods when 100 applications are acquired daily. These results shed additional light on the performance achieved by the methods when 50 applications were acquired daily. With an acquisition amount of 100 applications, not only did the Exploitation method continually outperform the other methods during the ten-day period, it also showed improvement in the NOMA and POMA measures, reflecting this method's strength in updating the detection model with informative applications.

Fig. 4 The number of Android malwares that were acquired (NOMA) for updating the anti-virus and the detection model through the acquisition of 100 applications daily

The updating process consequently improved the Exploitation method's ability to identify more malwares than the amount identified the day before, because each day new applications are proposed to the framework, and it continuously acquires new and relevant knowledge for the discovery of new malware from the detection model. On the tenth day, it acquired 97 malwares out of 100 (POMA = 97%). Based on this trend, the method would probably achieve even higher rates of NOMA and POMA over the course of time. The Exploitation method outperformed the heuristic engine during the ten-day period. On the eighth day, which contained the largest difference in the number of malwares acquired, the Exploitation method acquired 94 malwares (POMA = 94%), while the heuristic engine acquired 66 (POMA = 66%), a difference of 28 malwares (28% in POMA), or 1.4 times more malwares acquired. The heuristic engine showed the same trend of instability demonstrated in the acquisition of 50 applications. However, its overall performance improved when a larger amount (100 applications) was considered; for instance, on the tenth day it had a POMA of 71% compared to 62% when 50 applications were acquired daily. For the Exploration method, the decreasing trend of the NOMA measure over the ten-day period was even more significant when a larger number of applications was acquired. On the second day, it had a POMA of 57%, whereas on the tenth day it had a POMA of only 35% (NOMA = 24), which is even less than it achieved on the tenth day when 50 applications were acquired daily (POMA = 57% and NOMA = 29). The Combination method showed an increasing trend in the NOMA and POMA measures, so that on the tenth day it achieved a NOMA of 93 and a POMA of 93%, thus making the Exploitation and Combination AL methods better than the Exploration method for the main goal of this study. These results uphold the claim that with each day fewer malwares lie inside the SVM's margin and are thus less informative for the detection model, supporting the need for our proposed AL methods. Figure 5 shows the NOMA measure for each of the selection methods through the acquisition of 245 applications daily. These results present the largest difference in performance between our proposed AL methods (Exploitation and Combination) and the existing solutions (the Exploration AL method and the heuristic engine). It can be observed that after ten days of malware acquisition and detection model updates, both Exploitation and Combination acquired 207 new malwares (POMA = 85%), significantly more malwares than the amount acquired by the existing solutions.

Fig. 5 The number of Android malwares that were acquired (NOMA) for updating the anti-virus and the detection model through the acquisition of 245 applications daily

A NOMA of 207 is more than twice the number of new malwares acquired by the heuristic engine (NOMA = 103) and more than 6.5 times the number acquired by the Exploration method (NOMA = 32).

5.2 Predictive performance and update of the detection model

Here we present the predictive performance of the detection model in terms of accuracy, TPR, and FPR. It is essential to maintain and improve the performance of the detection model, since the AL selection methods rely upon its knowledge in order to identify the most probable malware with which to update the signature repository. The detection model, in turn, relies on the selection methods' performance, since the acquired applications (those selected by the selection methods) are also used to update the detection model; each selection method selects applications differently, and this in turn affects the performance of the detection model. The detection model's performance was measured using the aforementioned measures over the ten-day experiment. Note that since the heuristic engine does not rely on the detection model and relies only on a set of rules, the applications it selects are used only to update the signature repository; they are not used to update the heuristic engine's knowledge store on a daily basis, which is instead updated only infrequently. Thus, in this subsection only the AL methods are compared and presented, because only these methods rely on the detection model and provide it with new informative applications for a daily update. The same trends were observed for each of the three AL selection methods over the ten-day experiment when measuring the accuracy rates for the three acquisition amounts (50, 100, and 245). First and foremost, the most significant observation is that our AL methods performed almost as well as the baseline existing AL method, Exploration. The accuracy rates of the methods were very similar, with negligible differences between them. When 50 applications were acquired daily, Exploration performed a little better than our methods for several days during the ten-day period, but when the larger acquisition sizes of 100 and 245 were used, our proposed methods (Exploitation and Combination) performed as well as Exploration throughout the whole ten-day period. Secondly, as was expected and shown in the figures, all the acquisition amounts exhibit a similar trend, demonstrating that the acquisition of additional applications every day in fact improves the accuracy of the detection model in general; moreover, the more applications that are acquired daily, the higher the accuracy achieved. Since the same trend was observed for the three acquisition sizes, only the results achieved when 245 applications were acquired daily are presented in Table 1. Table 1 provides a comparison of the three AL methods (Exploration, Exploitation, and Combination) based on the accuracy rates achieved after acquiring 245 applications each day. In our experiments, the AL methods were presented with 2670 unknown applications daily, from which they were limited to selecting and acquiring only 245 applications in order to update both the knowledge store of the detection model and the signature repository of the anti-virus. The three AL methods performed quite similarly, indicating that different informative applications contributed in the same manner to the detection model's accuracy. Exploration outperformed all the selection methods during the ten days, with slightly higher accuracy than our AL methods (a difference of only 0.3% on the tenth day). Table 1 also shows that all of the AL methods achieved 97.79% accuracy or higher by the tenth day, a rate which is particularly meaningful given the volume of applications the detection model encounters daily.


Table 1 The accuracy of the framework for different methods through the acquisition of 245 applications daily

Day   Exploration (%)   Exploitation (%)   Combination (%)
1     92.78             92.78              92.78
2     94.60             94.35              94.60
3     96.10             95.67              96.09
4     96.63             96.36              96.63
5     97.03             96.80              97.02
6     97.29             97.01              97.25
7     97.67             97.34              97.63
8     97.79             97.44              97.70
9     97.89             97.57              97.78
10    98.10             97.79              97.97

Note that since the dataset is imbalanced and consists of 91% benign applications, it is not difficult to achieve 91% accuracy; however, every percentage point above 91% accuracy is a challenge. Thus, the high rates of accuracy achieved are very significant in the malware detection domain. Although Exploration outperformed Exploitation in updating the detection model through the acquisition of 245 applications, the Exploitation method outperformed Exploration on the NOMA and POMA measures during the ten-day period, as was presented in the previous subsection, and these measures reflect the goal of this study. Table 1 also indicates that by acquiring only 9.1% of the daily stream (245 applications out of 2670), the accuracy achieved by the detection model using the AL methods was almost as high as that achieved by acquiring the whole stream (98.8%). Hypothetically, a detection model updated with all of the stream's 2670 applications over the ten-day period would have achieved an accuracy level of 98.8% on the tenth day, whereas the three AL methods updated the detection model utilizing only 245 applications a day and achieved very similar rates of accuracy (97.7, 97.9, and 98.1%, respectively, for Exploitation, Combination, and Exploration). Note that the minor differences in accuracy rates between our AL methods and the Exploration method during the experiment's 10 days do not have much impact on the capabilities of the framework in the task of malicious application detection, and thus, with regard to predictive capabilities, our AL methods perform as well as the Exploration method. To support the results presented in Table 1 and our claim, we performed a single-factor ANOVA statistical test on the accuracy rates achieved by the three AL methods. The ANOVA test provided a p value of 92.23%, which is much higher than the 5% significance level (alpha); thus, the difference between the methods is not statistically significant, confirming that our AL methods performed as well as the Exploration method in terms of predictive capabilities. Note that the differences in accuracy levels between Exploration and our AL methods observed in the acquisition of 50 and 100 applications become negligible, while between the second and the ninth day, Combination even outperformed the Exploration method. This indicates that our methods perform better when the acquisition amount is closer to the number of malicious applications in the daily stream and thus better select informative applications for updating the detection model. The accuracy measure provided a general indication of the efficiency and effectiveness of the framework, yet the task here is to detect the applications that are most likely to be malicious in order to add them to the signature repository, and thus the TPR and FPR measures provide additional insights regarding the comparison of these methods. Figure 6 presents the TPR levels achieved by each of the three methods with the three acquisition sizes of 50, 100, and 245 applications on the last day of the experiment (the tenth day). Also included in Fig. 6 is the TPR rate for the unfeasible scenario (termed ALL) in which all of the 2670 applications in the daily stream were acquired and sent to security experts for inspection and labeling. The chart shows that as the acquisition amount grows, the difference between the AL methods and ALL becomes smaller, a trend that supports the efficiency of the framework and the AL methods.
This reflects the framework's and methods' viability in terms of time and cost, since it dramatically reduces the number of applications sent to virus experts. Note that when the acquisition amount was 50 applications, the TPR levels were quite low, primarily because the detection model was induced from a smaller number of applications (starting from a total of 800 applications on the first day, which included 728 benign applications and only 72 malicious applications, and concluding with only 1300 applications by the tenth day). In addition, the initial induction of an accurate detection model from a small amount of malware was made even more difficult because the malware consisted of a variety of malware families that differed from one another. However, through the process of acquiring informative applications, the TPR increases favorably. The Exploitation method had a lower performance in this regard, which can be explained by the method's character, exhibiting "greediness" and a craving to acquire new malwares. Consequently, Exploitation does not sustain the detection model with the crucial benign applications needed for a correct learning process. Thus, it might be more suited to the more mature stages of the detection model and less suitable for the early stages of acquisition. Figure 6 also shows the efficiency of the AL methods with the acquisition amount of 245 applications. The maximal TPR, which can be achieved by acquiring the entire stream (2670 applications), is almost 90%, whereas by acquiring only 245 applications daily, constituting 9.1% of the stream, the Exploration, Combination, and Exploitation methods achieved 80.5, 79.3, and 77.1%, respectively, on the tenth day. This is quite high when compared to the maximal TPR (almost 90%), which is achieved by acquiring almost 11 times more applications. Note that based on the acquisition of 245 applications, the three AL methods performed nearly the same.
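The statistical comparison described above can be reproduced in outline. The following is a sketch using SciPy's one-way ANOVA (the paper does not state which tool was used); the daily accuracy values are taken directly from Table 1, and running it should yield an F statistic well below 1 and a p value close to the 92.23% reported above.

```python
from scipy.stats import f_oneway

# Daily accuracy (%) of each AL method when acquiring 245 applications (Table 1)
exploration  = [92.78, 94.60, 96.10, 96.63, 97.03, 97.29, 97.67, 97.79, 97.89, 98.10]
exploitation = [92.78, 94.35, 95.67, 96.36, 96.80, 97.01, 97.34, 97.44, 97.57, 97.79]
combination  = [92.78, 94.60, 96.09, 96.63, 97.02, 97.25, 97.63, 97.70, 97.78, 97.97]

f_stat, p_value = f_oneway(exploration, exploitation, combination)
# A p value far above 0.05 indicates no statistically significant difference
# between the methods' daily accuracy rates.
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```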

Fig. 6 The TPR on the tenth day through the daily acquisition of 50, 100, 245, and 2670 (ALL) applications

Fig. 7 The FPR through the acquisition of 245 applications daily

The process of acquiring applications is labor-intensive and depends upon analysis and final labeling by human experts, a task which is carried out manually and requires significant time and money. Our framework's high TPR rates are achieved by acquiring a small set of applications, indicating its capacity for preserving the updatability of the anti-virus and the potential to save significant amounts of time and money. Figure 7 illustrates the FPR levels for the acquisition amount of 245 applications, which are very low and indicate the framework's ability to minimize false alarms. Similar trends existed for the other acquisition amounts and are therefore not depicted. Throughout most of the ten-day period, Exploration displayed the lowest FPR. The FPR rates are presented across days, and thus, for each day we evaluated the learned classifier. Since we tried to create a situation in which a set of new unknown applications (e.g., 2670 malwares and benign applications) is presented to the classifier each day, the FPR does not constantly decrease as it would if the classifier were tested on the same applications daily. This presents an even greater challenge to the framework, since it is constantly coming up against new unknown applications and therefore does not benefit from encountering applications it already knows to some extent.

6 Coping with possible attacks

Rastogi et al. [60] presented an attack based on common malware transformation techniques in a framework called "DroidChameleon" in which, given enough time, resources, and obfuscation know-how, a determined attacker can modify the code and certain manifest portions of an Android application, either manually or automatically, until static detection techniques fail. The failure stems from the fact that almost all detection engines rely on unique identification signatures, and re-obfuscation of the code will eventually modify the identification sequences. We argue that our approach will perform better than other approaches, as it is highly resistant to this type of attack for the following reason: our feature set is based on Android permissions, and the Chameleon attack cannot remove the basic permissions needed by the malware without rendering the malware ineffective. Moreover, our selected features that are based on the code portion of the Android application are generally indicative of obfuscation, and re-obfuscation attacks will affect these features. Such an attack will likely raise the values of the obfuscation-related features, and the manifest/code discrepancy measures can be affected by this attack as well, albeit in an unknown way. This means that any application that has been subjected to this type of treatment will contain new information for our AL approach, because it triggers human analysis and identifies permission features that will be used to detect any other similar malware. We do not argue that our approach is capable of detecting whether dynamically loaded code changes malware behavior dramatically, unless the basic dataset contains previous samples with similar patterns. If the modification by the attacker changes the malware's behavior in a fundamental way, for example if the attacker changes the malware from a premium SMS type to an information stealer, it will probably be acquired by our framework due to the new unknown patterns it conceals. We also view the Chameleon attack as an opportunity for randomly generating malware which will improve our AL methods. Zheng et al. [80] presented the ADAM system, which introduced several transformations and techniques that can be applied to Android malware in order to evade detection mechanisms. Our framework offers some resistance against this as well, since our selected features are generally descriptive of the application's behaviors and are not specific to the samples in the set, as n-gram-based approaches are. Such an attack can be successfully protected against only if the mutation of the malware code is predictable, or if the mutation itself leaves a signature. Malware code mutations and obfuscation techniques will not affect the set of Android permissions used by the malware code without disabling the malware through the removal of a required permission or by adding a new redundant permission. This will not affect our ability to identify the malware, as we base our feature set on the permission set used by the malware. Our feature set is also able to detect mutation and obfuscation using our unique features that determine the obfuscation level of the application code and measure discrepancies between the code and the Android manifest. Zhao et al. [79] presented two possible attacks on AL, referred to as adding and deleting. In these attacks, the attacker can actually pollute the unlabeled data before it is sampled by the AL.
The techniques and methods we present in this paper are not affected by these types of attacks, mainly because of the amount of effort required by an attacker to create a sufficient number of applications and distribute them across the Internet and markets in such a way that they would be picked up by the AL. Even in the extreme case of a determined attacker using a novel attack methodology, the AL model is constantly being monitored by a human expert, who will take notice when the AL model shows signs of a sudden bias or unexplained drift. Moreover, since our framework selects the most informative malicious applications and uses them to update the signature repository, it does not choose applications that are similar to those that were previously acquired. A set of malicious applications that share specific feature combinations would therefore not be fully acquired by our AL methods; only a few representative applications would be acquired. Thus, the framework is resilient to such attacks, and its detection capabilities will remain unaffected. Another aspect that should be considered is the relatively high resilience of our framework compared to the heuristic engine. Knowledge of the deterministic set of rules on which the heuristic engine is based makes it simple for an attacker to design Android malware that evades these rules, whereas it is more difficult to design malware that can evade a detection model based on an SVM classifier, as was noted by [75].
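To make the manifest/code discrepancy features mentioned above (listed in the Appendix as dex_man_diff_*) more concrete, the following is a hypothetical sketch. The exact definition of the discrepancy used by the framework is not restated here; a symmetric difference between the components declared in the AndroidManifest and those found in the compiled dex code is one reasonable interpretation, and how those two sets are extracted (e.g., with a tool such as Androguard [12]) is outside the scope of the sketch.

```python
def discrepancy_features(manifest_components, dex_components):
    """Sketch of dex/manifest discrepancy counts. Both arguments are dicts mapping a
    component type ('Activities', 'Services', 'Recivers', following the Appendix
    naming) to a set of class names. The symmetric difference is an assumption."""
    features = {}
    for kind in ("Activities", "Services", "Recivers"):
        declared = manifest_components.get(kind, set())
        in_code = dex_components.get(kind, set())
        features[f"dex_man_diff_{kind}"] = len(in_code ^ declared)
    return features

# e.g., a service implemented in the dex code but never declared in the manifest
# contributes 1 to dex_man_diff_Services.
```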

7 Discussion and conclusions

We introduce the problem of frequent and efficient updating of Android anti-virus software in light of the proliferation of unknown Android malwares. The framework we propose to solve this problem is based on active learning methods and has been designed for anti-virus vendors, focusing their analytic efforts on the applications most likely to be malicious. As previously discussed, mobile devices are very dependent on lightweight solutions such as anti-virus software due to their limited computational and power resources. On the other hand, for anti-virus vendors, the detection of new unknown Android malwares is time-consuming. To identify new unknown malware, anti-virus vendors need to deal with vast quantities of new applications on a daily basis, and based on selection heuristics, the most probable malwares are identified and analyzed by human experts. This information is then used to update the vendor's signature repository. However, the manner in which these heuristic methods select malwares becomes less relevant with time due to the lack of efficient and frequent updates, a deficiency which renders the update of the anti-virus signature repository inefficient. Our framework was developed to address this problem. We introduced the framework, and we compared our two designated AL methods, Exploitation and Combination, with two existing methods: the conventional heuristic engine and the Exploration AL method. Our evaluation showed that, over the course of the ten days, the AL methods updated and improved the detection model's capabilities, an update that is not frequently performed for a heuristic engine. The results show that the larger the acquisition amount, the greater the improvement in the detection model. The initial detection model was induced from a relatively small initial set of 800 applications that contained just 72 malwares. On the first day, the model showed performance of 92.7% accuracy, 25.3% TPR, and 0.09% FPR. However, it is the rate of improvement in these measures that matters, rather than the final performance on the tenth day. The acquisition amount of 245 applications showed the greatest improvement, because the AL methods acquired the maximal amount of new malwares, the most informative applications, and those bearing the most valuable information needed to update the detection model. Thus, it seems that anti-virus vendors should acquire a percentage of applications as close as possible to the malware percentage in real application markets or other repositories. With the acquisition amount of 245 applications, Exploration and Combination outperformed Exploitation, with an improvement of 5.4% in accuracy, about 50% in TPR, and 0.04% in FPR. These improvements are satisfactory and could be improved further over the course of additional days of acquisition. Although Exploration was the AL method that best updated the detection model, it was not the strongest at acquiring the maximal amount of new malwares for the purpose of updating the anti-virus signatures. This can be explained by the character of each AL method: Exploration tries to acquire informative applications (both benign and malicious), while Exploitation tries to acquire new malicious applications. We defined two simple measures, NOMA and POMA, respectively the Number and Percentage of Malwares Acquired daily. Looking at the performance of the selection methods with regard to the acquisition amounts (50, 100, and 245), we see that according to POMA, all the selection methods perform better when the acquisition amounts are smaller.
On the tenth day, Exploitation's POMA was 100, 97, and 85%, respectively, for the 50, 100, and 245 daily acquisition amounts. On the other hand, the method's ability to acquire larger amounts of malwares (e.g., 100, 245) improved over the course of the ten days. With an amount of 50 applications, the performance remained as high as it was at the beginning of the process, which represents an achievement in preserving the updatability and malware acquisition capabilities. A similar trend was observed for the Combination method, which performs better on smaller acquisition amounts. Specifically, on the tenth day, its POMA was 96, 93, and 85%, respectively, for the 50, 100, and 245 daily acquisition amounts.

As for the Exploration method, it performed quite poorly and became worse as the acquisition amounts increased, with a POMA of only 58, 35, and 13%, respectively, for the 50, 100, and 245 daily acquisition amounts. Moreover, over the ten-day period, it acquired fewer and fewer malwares and showed a sharp decrease. These results do not parallel the improvement in the performance of the detection model in terms of accuracy, TPR, and FPR. In this context, this affirms that a selection method should not only be oriented toward efficient updating of the detection model with informative applications; rather, it should also be oriented toward the acquisition of new malwares, as the Exploitation and Combination methods are. Similar to the other methods, the heuristic engine's overall performance decreased when faced with larger acquisition amounts. On the tenth day, it had a POMA of 72, 71, and 42%, respectively, for the 50, 100, and 245 daily acquisition amounts. For all three daily acquisition amounts (50, 100, and 245) considered in our evaluation, our AL methods outperformed the other two methods over the ten days in the amount of new malwares acquired daily, which, as stated earlier, was the main goal of this study. While the heuristic engine showed instability, it performed better than Exploration most of the time. Exploration displayed a decreasing performance trend over the ten-day period. When acquiring 50 applications daily, Exploitation had a POMA of 100%, indicating that all of the applications acquired by this method were malwares. This performance was stable during the ten-day period. Exploitation performed 34% better than the Exploration method, which achieved a POMA of only 66% on the tenth day and showed a decreasing trend in performance over the course of the ten days. Moreover, Exploitation performed 40% better than the heuristic engine, which achieved a POMA of only 60% on the fourth day. When acquiring 100 applications daily, Exploitation exhibited an improving trend in the NOMA and POMA measures during the ten-day period, indicating that the method successfully contributed to updating the detection model with informative applications. The highest POMA value for this acquisition amount was achieved on the final day and stands at 97%, which is 62% better than Exploration and 16% better than the heuristic engine. The results achieved in acquiring 245 applications daily present the greatest difference in performance between our proposed AL methods (Exploitation and Combination) and the other existing solutions. Both the Exploitation and Combination methods had a NOMA of 207 on the tenth day, which is more than twice the amount of new malwares acquired by the heuristic engine (NOMA = 103) and 6.5 times more malwares than were acquired by the Exploration method (NOMA = 32). These results indicate that our presented framework can make a meaningful contribution to anti-virus vendors by helping them improve the updatability of their anti-virus software with more frequent updates, greater efficiency, and a reduction in the time between the creation of new Android malware, its subsequent discovery, and the process of updating the signature repository of the anti-virus. The improvements achieved and the applications acquired on day i improve the capabilities of the anti-virus and detection model and strengthen their capacity to accurately detect new unknown malware on day i + 1.
Both the Exploitation and Combination methods succeeded in acquiring a high percentage of malwares; the small percentage that remained included very informative benign applications. It should be noted that when benign applications that were thought to be malicious are selected, the AL's acquisitions are no less important than when actual malwares are acquired. Using the Exploitation approach (which is also a phase in the Combination method), we explore the malicious "side" more efficiently, occasionally resulting in the discovery of benign applications as well (these will probably become support vectors after being added to the detection model's training set as labeled instances). Thus, with Exploitation, the acquisition of new applications will contain more malwares than with Exploration, which is the main goal of the framework and methods presented in this paper. It is also worthwhile mentioning that we addressed the imbalance problem in Android malwares. Since our original dataset included 40,000 applications with 10,000 malwares, its malware rate was far higher than is realistic, making it unsuitable for measuring how our framework would perform in real life. This paper presents unbiased results for Android malware detection and acquisition based on a realistic Android malware percentage. The results shown on this large and representative dataset make this research relevant for industrial purposes as well. It should be noted that in this paper we focus on presenting an efficient method for improving the updatability of the anti-virus tool rather than presenting methods for detecting special malwares. Yet, our suggested framework and set of features were also found to be effective in detecting special malwares which run malicious code obtained by techniques such as Java reflection, Java native code, and obfuscated malicious code (such as GingerMaster, Plankton, and LeNa). In future work, our framework and AL methods might also be useful for additional domains in which an efficient selection of a specific class is needed more than acquisition from another class. One interesting domain would be online social networks (OSNs), for the detection and acquisition of new fictitious profiles, a hard and important task due to the increase in malicious profiles that tempt innocent and unsophisticated users such as children and the elderly. We also intend to extend our studies [51,52] in the bioinformatics domain. In these studies, an intelligent and active selection of a specific class is needed in order to reduce the labeling efforts required by medical experts and to enhance the detection of severe and rare diseases.

Acknowledgments This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space.

Appendix

List of features we used: number_of_activities number_of_services number_of_recivers number_of_permissions number_of_providers man_is_meta_data perm_android.intent.category.MASTER_CLEAR.permission.C2D_MESSAGE perm_android.permission.ACCESS_ALL_EXTERNAL_STORAGE perm_android.permission.ACCESS_CACHE_FILESYSTEM perm_android.permission.ACCESS_CHECKIN_PROPERTIES perm_android.permission.ACCESS_COARSE_LOCATION perm_android.permission.ACCESS_CONTENT_PROVIDERS_EXTERNALLY perm_android.permission.ACCESS_FINE_LOCATION perm_android.permission.ACCESS_LOCATION_EXTRA_COMMANDS perm_android.permission.ACCESS_MOCK_LOCATION perm_android.permission.ACCESS_MTP perm_android.permission.ACCESS_NETWORK_STATE 123 ALDROID: efficient update of Android anti-virus software… perm_android.permission.ACCESS_SURFACE_FLINGER perm_android.permission.ACCESS_WIFI_STATE perm_android.permission.ACCESS_WIMAX_STATE perm_android.permission.ACCOUNT_MANAGER perm_android.permission.ALLOW_ANY_CODEC_FOR_PLAYBACK perm_android.permission.ASEC_ACCESS perm_android.permission.ASEC_CREATE perm_android.permission.ASEC_DESTROY perm_android.permission.ASEC_MOUNT_UNMOUNT perm_android.permission.ASEC_RENAME perm_android.permission.AUTHENTICATE_ACCOUNTS perm_android.permission.BACKUP perm_android.permission.BATTERY_STATS perm_android.permission.BIND_ACCESSIBILITY_SERVICE perm_android.permission.BIND_APPWIDGET perm_android.permission.BIND_DEVICE_ADMIN perm_android.permission.BIND_DIRECTORY_SEARCH perm_android.permission.BIND_INPUT_METHOD perm_android.permission.BIND_KEYGUARD_APPWIDGET perm_android.permission.BIND_PACKAGE_VERIFIER perm_android.permission.BIND_REMOTEVIEWS perm_android.permission.BIND_TEXT_SERVICE perm_android.permission.BIND_VPN_SERVICE perm_android.permission.BIND_WALLPAPER perm_android.permission.BLUETOOTH perm_android.permission.BLUETOOTH_ADMIN perm_android.permission.BLUETOOTH_STACK perm_android.permission.BRICK perm_android.permission.BROADCAST_PACKAGE_REMOVED perm_android.permission.BROADCAST_SMS perm_android.permission.BROADCAST_STICKY perm_android.permission.BROADCAST_WAP_PUSH perm_android.permission.CALL_PHONE perm_android.permission.CALL_PRIVILEGED perm_android.permission.CAMERA perm_android.permission.CHANGE_BACKGROUND_DATA_SETTING perm_android.permission.CHANGE_COMPONENT_ENABLED_STATE perm_android.permission.CHANGE_CONFIGURATION perm_android.permission.CHANGE_NETWORK_STATE perm_android.permission.CHANGE_WIFI_MULTICAST_STATE perm_android.permission.CHANGE_WIFI_STATE perm_android.permission.CHANGE_WIMAX_STATE perm_android.permission.CLEAR_APP_CACHE perm_android.permission.CLEAR_APP_USER_DATA perm_android.permission.CONFIGURE_WIFI_DISPLAY perm_android.permission.CONFIRM_FULL_BACKUP perm_android.permission.CONNECTIVITY_INTERNAL perm_android.permission.CONTROL_LOCATION_UPDATES perm_android.permission.CONTROL_WIFI_DISPLAY 123 N. Nissim et al. 
perm_android.permission.COPY_PROTECTED_DATA perm_android.permission.CRYPT_KEEPER perm_android.permission.DELETE_CACHE_FILES perm_android.permission.DELETE_PACKAGES perm_android.permission.DEVICE_POWER perm_android.permission.DIAGNOSTIC perm_android.permission.DISABLE_KEYGUARD perm_android.permission.DUMP perm_android.permission.EXPAND_STATUS_BAR perm_android.permission.FACTORY_TEST perm_android.permission.FILTER_EVENTS perm_android.permission.FLASHLIGHT perm_android.permission.FORCE_BACK perm_android.permission.FORCE_STOP_PACKAGES perm_android.permission.FREEZE_SCREEN perm_android.permission.GET_ACCOUNTS perm_android.permission.GET_DETAILED_TASKS perm_android.permission.GET_PACKAGE_SIZE perm_android.permission.GET_TASKS perm_android.permission.GLOBAL_SEARCH perm_android.permission.GLOBAL_SEARCH_CONTROL perm_android.permission.GRANT_REVOKE_PERMISSIONS perm_android.permission.HARDWARE_TEST perm_android.permission.INJECT_EVENTS perm_android.permission.INSTALL_LOCATION_PROVIDER perm_android.permission.INSTALL_PACKAGES perm_android.permission.INTERACT_ACROSS_USERS perm_android.permission.INTERACT_ACROSS_USERS_FULL perm_android.permission.INTERNAL_SYSTEM_WINDOW perm_android.permission.INTERNET perm_android.permission.KILL_BACKGROUND_PROCESSES perm_android.permission.MAGNIFY_DISPLAY perm_android.permission.MANAGE_ACCOUNTS perm_android.permission.MANAGE_APP_TOKENS perm_android.permission.MANAGE_NETWORK_POLICY perm_android.permission.MANAGE_USB perm_android.permission.MANAGE_USERS perm_android.permission.MASTER_CLEAR perm_android.permission.MODIFY_APPWIDGET_BIND_PERMISSIONS perm_android.permission.MODIFY_AUDIO_SETTINGS perm_android.permission.MODIFY_NETWORK_ACCOUNTING perm_android.permission.MODIFY_PHONE_STATE perm_android.permission.MOUNT_FORMAT_FILESYSTEMS perm_android.permission.MOUNT_UNMOUNT_FILESYSTEMS perm_android.permission.MOVE_PACKAGE perm_android.permission.NET_ADMIN perm_android.permission.NET_TUNNELING perm_android.permission.NFC perm_android.permission.PACKAGE_USAGE_STATS 123 ALDROID: efficient update of Android anti-virus software… perm_android.permission.PACKAGE_VERIFICATION_AGENT perm_android.permission.PERFORM_CDMA_PROVISIONING perm_android.permission.PERSISTENT_ACTIVITY perm_android.permission.PROCESS_OUTGOING_CALLS perm_android.permission.READ_CALENDAR perm_android.permission.READ_CALL_LOG perm_android.permission.READ_CELL_BROADCASTS perm_android.permission.READ_CONTACTS perm_android.permission.READ_DREAM_STATE perm_android.permission.READ_EXTERNAL_STORAGE perm_android.permission.READ_FRAME_BUFFER perm_android.permission.READ_INPUT_STATE perm_android.permission.READ_LOGS perm_android.permission.READ_NETWORK_USAGE_HISTORY perm_android.permission.READ_PHONE_STATE perm_android.permission.READ_PRIVILEGED_PHONE_STATE perm_android.permission.READ_PROFILE perm_android.permission.READ_SMS perm_android.permission.READ_SOCIAL_STREAM perm_android.permission.READ_SYNC_SETTINGS perm_android.permission.READ_SYNC_STATS perm_android.permission.READ_USER_DICTIONARY perm_android.permission.REBOOT perm_android.permission.RECEIVE_BOOT_COMPLETED perm_android.permission.RECEIVE_DATA_ACTIVITY_CHANGE perm_android.permission.RECEIVE_EMERGENCY_BROADCAST perm_android.permission.RECEIVE_MMS perm_android.permission.RECEIVE_SMS perm_android.permission.RECEIVE_WAP_PUSH perm_android.permission.RECORD_AUDIO perm_android.permission.REMOTE_AUDIO_PLAYBACK perm_android.permission.REMOVE_TASKS perm_android.permission.REORDER_TASKS perm_android.permission.RESTART_PACKAGES 
perm_android.permission.RETRIEVE_WINDOW_CONTENT perm_android.permission.RETRIEVE_WINDOW_INFO perm_android.permission.SEND_SMS perm_android.permission.SEND_SMS_NO_CONFIRMATION perm_android.permission.SERIAL_PORT perm_android.permission.SET_ACTIVITY_WATCHER perm_android.permission.SET_ALWAYS_FINISH perm_android.permission.SET_ANIMATION_SCALE perm_android.permission.SET_DEBUG_APP perm_android.permission.SET_KEYBOARD_LAYOUT perm_android.permission.SET_ORIENTATION perm_android.permission.SET_POINTER_SPEED perm_android.permission.SET_PREFERRED_APPLICATIONS perm_android.permission.SET_PROCESS_LIMIT perm_android.permission.SET_SCREEN_COMPATIBILITY 123 N. Nissim et al. perm_android.permission.SET_TIME perm_android.permission.SET_TIME_ZONE perm_android.permission.SET_WALLPAPER perm_android.permission.SET_WALLPAPER_COMPONENT perm_android.permission.SET_WALLPAPER_HINTS perm_android.permission.SHUTDOWN perm_android.permission.SIGNAL_PERSISTENT_PROCESSES perm_android.permission.START_ANY_ACTIVITY perm_android.permission.STATUS_BAR perm_android.permission.STATUS_BAR_SERVICE perm_android.permission.STOP_APP_SWITCHES perm_android.permission.SUBSCRIBED_FEEDS_READ perm_android.permission.SUBSCRIBED_FEEDS_WRITE perm_android.permission.SYSTEM_ALERT_WINDOW perm_android.permission.TEMPORARY_ENABLE_ACCESSIBILITY perm_android.permission.UPDATE_DEVICE_STATS perm_android.permission.UPDATE_LOCK perm_android.permission.USE_CREDENTIALS perm_android.permission.USE_SIP perm_android.permission.VIBRATE perm_android.permission.WAKE_LOCK perm_android.permission.WRITE_APN_SETTINGS perm_android.permission.WRITE_CALENDAR perm_android.permission.WRITE_CALL_LOG perm_android.permission.WRITE_CONTACTS perm_android.permission.WRITE_DREAM_STATE perm_android.permission.WRITE_EXTERNAL_STORAGE perm_android.permission.WRITE_GSERVICES perm_android.permission.WRITE_MEDIA_STORAGE perm_android.permission.WRITE_PROFILE perm_android.permission.WRITE_SECURE_SETTINGS perm_android.permission.WRITE_SETTINGS perm_android.permission.WRITE_SMS perm_android.permission.WRITE_SOCIAL_STREAM perm_android.permission.WRITE_SYNC_SETTINGS perm_android.permission.WRITE_USER_DICTIONARY perm_com.android.alarm.permission.SET_ALARM perm_com.android.browser.permission.READ_HISTORY_BOOKMARKS perm_com.android.browser.permission.WRITE_HISTORY_BOOKMARKS perm_com.android.voicemail.permission.ADD_VOICEMAIL perm_com.fede.launcher.permission.READ_SETTINGS perm_com.htc.launcher.permission.READ_SETTINGS perm_com.lge.launcher.permission.INSTALL_SHORTCUT perm_com.lge.launcher.permission.READ_SETTINGS perm_com.motorola.dlauncher.permission.INSTALL_SHORTCUT perm_com.motorola.dlauncher.permission.READ_SETTINGS perm_com.motorola.launcher.permission.INSTALL_SHORTCUT perm_com.motorola.launcher.permission.READ_SETTINGS perm_org.adw.launcher.permission.READ_SETTINGS 123 ALDROID: efficient update of Android anti-virus software… number_of_classes number_of_classes_1-5 number_of_classes_5-10 number_of_classes_10-15 number_of_classes_15+ dex_strings_count dex_activities dex_services dex_recivers dex_bouncycastle_use dex_opengl_use dex_crypto_use dex_man_diff_Activities dex_man_diff_Services dex_man_diff_Recivers
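The list above is dominated by binary permission indicators. As a hypothetical illustration (not our extraction code), a feature vector over such a list could be assembled as follows, where FEATURE_PERMISSIONS is the ordered permission list from this appendix and the application's requested permissions are assumed to have been parsed from its AndroidManifest.xml beforehand.

```python
def permission_vector(requested, feature_permissions):
    """Build the binary perm_* part of the feature vector: 1 if the application
    requests the permission, 0 otherwise (order follows the appendix list)."""
    requested = set(requested)
    return [1 if p in requested else 0 for p in feature_permissions]

# Hypothetical example with a small subset of the appendix list
FEATURE_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]
app_permissions = {"android.permission.INTERNET", "android.permission.SEND_SMS"}
print(permission_vector(app_permissions, FEATURE_PERMISSIONS))  # [1, 1, 0]
```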

References

1. Abou-Assaleh T, Cercone N, Keselj V, Sweidan R (2004) N-gram based detection of new malicious code. In: Proceedings of the 28th annual international computer software and applications conference (COMPSAC’04) 2. Andriatsimandefitra R, Geller S, Tong VVT (2012) Designing information flow policies for Android’s operating system. In: 2012 IEEE International conference on communications (ICC), 10–15 June 2012, pp 976–981 3. Angluin D (1988) Queries and concept learning. Mach Learn 2:319–342 4. Apvrille A, Strazzere T (2012) Reducing the window of opportunity for Android malware Gotta catch ’em all. J Comput Virol 8(1–2):61–71 5. Baram Y, El-Yaniv R, Luz K (2004) Online choice of active learning algorithms. J Mach Learn Res 5:255–291 6. Batyuk L, Herpich M, Camtepe SA, Raddatz K, Schmidt A, Albayrak S (2011) Using static analysis for automatic assessment and mitigation of unwanted and malicious activities within Android applications. In: 2011 6th International conference on malicious and unwanted software (MALWARE), 18–19 October 2009, pp 66–72 7. Bläsing T, Batyuk L, Schmidt A-D, Camtepe SA, Albayrak S (2010) An Android Application Sandbox system for suspicious software detection. In: 2010 5th International conference on malicious and unwanted software (MALWARE), 19–20 October 2010, pp 55–62 8. Bulygin Y (2007) Epidemics of mobile worms. In: Proceedings of the 26th IEEE international performance computing and communications conference, IPCCC 2007, April 11–13, 2007, New Orleans, Louisiana, USA. IEEE Computer Society, pp 475–478 9. Burges CJC (1988) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167 10. Chang CC, Lin C-J (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/ cjlin/libsvm 11. Dagon D, Martin T, Starner T (2004) Mobile phones as computing devices: the viruses are coming! IEEE Pervasive Comput 3(4):11–15 12. Desnos A (2013) (Visited June 2013) Androguard reverse engineering tool. http://code.google.com/p/ androguard/ 13. Fawcett T (2003) ROC graphs: notes and practical considerations for researchers. Technical report, HPL- 2003-4, HP Laboratories 14. Ham H-S, Choi M-J (2013) Analysis of Android malware detection performance using machine learning classifiers. In: 2013 International conference on ICT Convergence (ICTC). IEEE 15. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45:171–186 16. Henchiri O, Japkowicz N (2006) A feature selection and evaluation scheme for computer virus detection. In: Proceedings of ICDM-2006, Hong Kong, pp 891–895 123 N. Nissim et al.

17. Herbrich R, Graepel T, Campbell C (2001) Bayes point machines. J Mach Learn Res 1:245–279. doi:10. 1162/153244301753683717 18. http://code.google.com/p/smali/ 19. http://code.google.com/p/xml-apk-parser/ 20. http://contagiominidump.blogspot.co.il/ 21. http://docs.oracle.com/javase/tutorial/security/apisign/gensig.html 22. http://jon.oberheide.org/applications/summercon12-bouncer.pdf 23. http://source.android.com/tech/dalvik/dex-format.html 24. http://www.avgmobilation.com/ 25. http://www.csc.ncsu.edu/faculty/jiang/DroidKungFu.html 26. http://www.malgenomeproject.org/ 27. http://www.securelist.com/en/analysis/204792250/IT_Threat_Evolution_Q3_2012 28. http://www.strazzere.com/papers/DexEducation-PracticingSafeDex.pdf 29. http://www.symantec.com/security_response/writeup.jsp?docid=2003-011711-1226-99 30. https://blog.lookout.com/blog/2011/03/01/security-alert-malware-found-in-official-android-market-dr oiddream/ 31. https://blog.lookout.com/_media/Geinimi_Trojan_Teardown.pdf 32. https://www.virustotal.com/ 33. Ikinci A, Holz T, Freiling FC (2008) Monkey-spider: detecting malicious websites with low-interaction honeyclients. In: Sicherheit, pp 407–421 34. Kiem H, Thuy NT, Quang TMN (2004) A machine learning approach to anti-virus system (2004) Joint workshop of Vietnamese society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on active mining, Hanoi-Vietnam, 4–7 December 2004, pp 61–65 35. Kolter JZ, Maloof MA (2004) Learning to detect malicious executables in the wild. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, NY, pp 470–478 36. Kolter J, Maloof M (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res 7:2721–2744 37. Leavitt N (2005) Mobile phones: the next frontier for hackers? Computer 38(4):20–23 38. Lewis D, Gale W (1994) A sequential algorithm for training text classifiers. In: Proceedings of the seventeenth annual international ACM-SIGIR conference on research and development in information retrieval. Springer, pp 3–12 39. Lin Y-D, Lai Y-C, Chen C-H, Tsai H-C (2013) Identifying android malicious repackaged applications by thread-grained system call sequences. Comput Secur 39(Part B):340–350. doi:10.1016/j.cose.2013.08. 010 40. Luoshi Z, Yan N, Xiao W, Zhaoguo W, Yibo X (2013) A3: automatic analysis of Android malware. In: 1st International workshop on cloud computing and information security. Atlantis Press, November 2013 41. Mansfield-Devine S (2012) Android malware and mitigations. Netw Secur 2012(11):12–20. doi:10.1016/ S1353-4858(12)70104-6 42. Masud M, Khan L, Thuraisingham B (2007) Feature based techniques for auto-detection of novel email worms. In: Advances in knowledge discovery and data mining 43. Mokube I, Adams M (2007) Honeypots: concepts, approaches, and challenges. In: ACMSE 2007. Winston-Salem, North Carolina, USA, 23–24 March, pp 321–325 44. Moskovitch R, Stopel D, Feher C, Nissim N, Japkowicz N, Elovici Y (2009) Unknown malcode detection and the imbalance problem. J Comput Virol 5(4):295–308 45. Moskovitch R, Nissim N, Elovici Y (2008) Acquisition of malicious code using active learning. In; 2nd ACM SIGKDD international workshop on privacy, security, and trust in KDD, PinKDD08, Springer, Lectures Notes in Computer Sciences, Las Vegas, USA, vol 5456, 25 August 2008, pp 74–91 46. Moskovitch R, Nissim N, Elovici Y (2009) Malicious code detection using active learning. In: Privacy, security, and trust in KDD, pp 74–91 47. 
Moskovitch R, Nissim N, Englert R, Elovici Y (2008) Detection of unknown computer worms activity using active learning. In: The 11th International conference on information fusion, Cologne, Germany, June 30–July 3 48. Moskovitch R, Stopel D, Feher C, Nissim N, Elovici Y (2008) Unknown malcode detection via text categorization and the imbalance problem. In: IEEE intelligence and security informatics (ISI08), Taiwan 49. Nissim N, Moskovitch R, Rokach L, Elovici Y (2012) Detecting unknown computer worm activity via support vector machines and active learning. Pattern Anal Appl 15:459–475 50. Nissim N, Cohen A, Glezer C, Elovici Y (2015) Detection of malicious PDF files and directions for enhancements: a state-of-the art survey. Comput Secur 48:246–266. doi:10.1016/j.cose.2014.10.014


51. Nissim N, Boland MR, Moskovitch R, Tatonetti NP, Elovici Y, Shahar Y, Hripcsak G (2015) An active learning framework for efficient condition severity classification. In: Artificial intelligence in medicine. Springer International Publishing (AIME-15), pp 13–24 52. Nissim N, Borland M, Moskovitch R, Tatonetti N, Elovici Y, Shahar Y, Hripcsak G (2014) An active learning enhancement for conditions severity classification. In: ACM KDD on workshop on connected health at big data era, NYC, USA 53. Nissim N, Cohen A, Moskovitch R, Shabtai A, Edry M, Bar-Ad O, Elovici Y (2014) ALPD: active learning framework for enhancing the detection of malicious PDF files. In: Intelligence and security informatics conference (JISIC), 2014 IEEE joint. IEEE, September 2014, pp 91–98 54. Nissim N, Moskovitch R, Rokach L, Elovici Y (2014) Novel active learning methods for enhanced PC malware detection in windows OS. Expert Syst Appl 41(13):5843–5857 55. Oberheide J, Cooke E, Jahanian F (2008) Cloudav: N-version antivirus in the network cloud. In: Proceed- ings of the 17th USENIX security symposium (Security’08), San Jose, CA, July 2008 56. Oberheide J, Miller J (2012) Dissecting the android bouncer. SummerCon2012, New York 57. Peng H, Gates C, Sarma B, Li N, Qi A, Potharaju R, Nita-Rotaru C, Molloy I (2012) Using probabilistic generative models for ranking risks of Android Apps. In: Proceedings of ACM CCS 58. Piercy M (2004) Embedded devices next on the virus target list. IEEE Electron Syst Softw 2:42–43 59. Provos N, Holz T (2008) Virtual honeypots: from botnet tracking to intrusion detection. Addison-Wesley, Upper Saddle River 60. Rastogi V, Chen Y, Jiang X (2013) DroidChameleon: evaluating Android anti-malware against trans- formation attacks. Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security. ACM 61. Sanz B, Santos I, Galán-García P, Laorden C, Ugarte-Pedrero X, Bringas PG, Alvarez PUMA G (2012) Permission Usage to detect Malware in Android. In: Proceedings of the 5th international conference on computational intelligence in security for information systems (CISIS). Ostrava (Czech Republic), 5–7 September 2012 62. Sanz B, Santos I, Laorden C, Ugarte-Pedrero X, Bringas PG (2012) On the automatic categorisation of android applications. In: 2012 IEEE Consumer communications and networking conference (CCNC), 14–17 January, pp 149–153 63. Sarma B, Li N, Gates C, Potharaju R, Nita-Rotaru C (2012) Android permissions: a perspective combining risks and benefits. In: Proceedings of SACMAT 64. Schmidt A-D, Bye R, Schmidt H-G, Clausen J, Kiraz O, Yuksel KA, Camtepe SA, Albayrak S (2009) Static analysis of executables for collaborative malware detection on Android. In: IEEE international conference on communications, 2009 ICC’09, 14–18 June 2009, pp 1–5 65. Schmidt A-D, Schmidt H-G, Batyuk L, Clausen JH, Camtepe SA, Albayrak S, Yildizli C (2009) Smart- phone malware evolution revisited: android next target? In: 2009 4th International conference on malicious and unwanted software (MALWARE), 13–14 October 2009, pp 1–7 66. Schultz M, Eskin E, Zadok E, Stolfo S (2001) Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE symposium on security and privacy, pp 178–184 67. Seo S-H, Gupta A, Sallam AM, Bertino E, Yim K (2014) Detecting mobile malware threats to homeland security through static analysis. J Netw Comput Appl 38:43–53. doi:10.1016/j.jnca.2013.05.008 68. 
Shabtai A, Fledel Y, Kanonov U, Elovici Y, Dolev S, Glezer C (2010) Google android: a comprehensive security assessment. IEEE Secur Priv 2:35–44 69. Shabtai A, Fledel Y,Elovici Y (2010) Automated static code analysis for classifying Android applications using machine learning. In: 2010 International conference on computational intelligence and security (CIS), 11–14 December 2010, pp 329–333 70. Shabtai A, Tenenboim-Chekina L, Mimran D, Rokach L, Shapira, B, Elovici Y (2014) Mobile malware detection through analysis of deviations in application network behavior. Comput Secur 43:1–18 71. Shih DH, Lin B, Chiang HS, Shih MH (2008) Security aspects of mobile phone virus: a critical survey. Ind Manag Data Syst 108(4):478–494 72. Suarez-Tangil G et al (2014) Dendroid: a text mining approach to analyzing and classifying code structures in Android malware families. Expert Syst Appl 41(4):1104–1117 73. Tahan G, Rokach L, Shahar Y (2012) Mal-ID: automatic malware detection using common segment analysis and meta-features. J Mach Learn Res 13:949–979 74. Tong S, Koller D (2000–2001) Support vector machine active learning with applications to text classifi- cation. J Mach Learn Res 2:45–66 75. Wang X, Yu W, Champion A, Fu X, Xuan D (2007) Worms via mining dynamic program execution. In: Security and privacy in communications networks and the workshops, 2007. SecureComm 2007. Third international conference security and privacy in communication networks and the workshops, SecureComm, pp 412–421

123 N. Nissim et al.

76. Yu ZHU, Wang X-C, Shen H-B (2008) Detection method of computer worms based on SVM. Mech Electr Eng Mag 8 77. Zhang Y et al (2013) Vetting undesirable behaviors in android apps with permission use analysis. In: Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. ACM 78. Zhao M, Zhang T, Ge F, Yuan Z (2012) RobotDroid: a lightweight malware detection framework on smartphones. JNW 7(4):715–722 79. Zhao W, Long J, Yin J, Cai Z, Xia G-M (2012) Sampling attack against active learning in adversarial environment. In: MDAI, pp 222–233 80. Zheng M, Lee PPC, Lui JCS (2013) Adam: an automatic and extensible platform to stress test android anti- virus systems. In: Detection of intrusions and malware, and vulnerability assessment. Springer, Berlin, Heidelberg, pp 82–101 81. Zhou Y et al (2012) Hey, you, get off of my market: detecting malicious apps in official and alternative android markets. In: Proceedings of the 19th annual network and distributed system security symposium 82. Zhou Y, Jiang X (2012) Dissecting android malware: characterization and evolution. In: 2012 IEEE symposium on security and privacy (SP). IEEE, pp 95–109, May 2012 83. Zhou W, Zhou Y, Jiang X, Ning P (2012) Detecting repackaged smartphone applications in third-party Android marketplaces. In: Proceedings of the second ACM conference on data and application security and privacy. ACM, pp 317–326

Nir Nissim is a researcher and the head of the Malware-Lab in the Cyber Security Research Center (CSRC) at Ben-Gurion University. Mr. Nissim submitted his Ph.D. dissertation this year in the Department of Information Systems Engineering at Ben-Gurion University. Mr. Nissim has published several remarkable papers dealing with active learning approaches for the acquisition and detection of malware on both PC and Android platforms. Mr. Nissim is recognized as an expert in information systems security and has led several large-scale projects and research efforts in this field. His main areas of interest are mobile and computer security, machine learning, and data mining. In addition to his contribution to the cyber security domain, Nir is also interested in the bioinformatics domain and has published a number of papers regarding the efficient classification of condition severity.

Robert Moskovitch holds a B.Sc., an M.Sc., and a Ph.D. in Information Systems Engineering from Ben-Gurion University, Israel. His Ph.D. focused on the development of novel multivariate temporal data analytics methods using temporal abstraction and time intervals mining, such as the KarmaLego algorithm. Currently, he is a postdoctoral research scientist in the Department of Biomedical Informatics at Columbia University, in which he applies KarmaLego on Columbia University Medical Center data for purposes such as the prediction of clinical procedures or diagnoses in patient data. He has served on several journal editorial boards, as well as on program committees of several conferences and workshops in biomedical informatics and in information security. He has published more than fifty peer-reviewed papers in leading journals and conferences, several of which won best paper awards.


Oren Bar-Ad is a Mobile and Security Architect for Nuro Secure Messaging. For the last six months he has been managing the development teams for the iOS and Android clients, as well as security, at Nuro Secure Messaging Ltd. He worked as the Mobile Security Team leader for the antivirus vendor AVG from October 2012 to June 2015. He has been working on secure code development for Android since 2008. He holds a B.Sc. in Information Systems Engineering from Ben-Gurion University.

Lior Rokach is a data scientist and a Full Professor of Information Systems and Software Engineering at Ben-Gurion University. His research interests lie in the design and analysis of machine learning and data mining algorithms and their applications in Cyber Security, Recommender Systems, and Information Retrieval. Prof. Rokach is the author of over 200 peer-reviewed papers in leading journals (e.g., Machine Learning, Machine Learning Research, Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, and Pattern Recognition), conference proceedings, patents, and book chapters. Prof. Rokach's publications cover topics in the cyber security domain in which advanced machine learning and big data techniques are required to provide solutions for existing problems, particularly malware detection, user authentication and verification, privacy-preserving data mining, and anomaly detection.

Yuval Elovici is a Full Professor in the Department of Information Systems Engineering at Ben-Gurion University, the director of the Telekom Innovation Laboratories at BGU, and the head of the Cyber Security Research Center (CSRC) at Ben-Gurion University. He holds B.Sc. and M.Sc. degrees in Computer and Electrical Engineering from BGU and a Ph.D. in Information Systems from Tel-Aviv University. His primary research interests are cyber security and machine learning, and he is the author of a book on information leakage detection and prevention.


Detection of malicious PDF files and directions for enhancements: A state-of-the art survey

Nir Nissim a,*, Aviad Cohen a, Chanan Glezer b, Yuval Elovici a

a Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel
b Department of Industrial Engineering and Management, Ariel University, Israel

Article history: Received 23 April 2014; Received in revised form 12 September 2014; Accepted 23 October 2014; Available online 3 November 2014

Keywords: APT; PDF; Malicious code; Malware; Detection; Email; Organizations; Cyber-attack

Abstract

Initial penetration is one of the first steps of an Advanced Persistent Threat (APT) attack, and it is considered one of the most significant means of initiating cyber-attacks aimed at organizations. Such an attack usually results in the loss of sensitive and confidential information. Because email communication is an integral part of daily business operations, APT attackers frequently leverage email as an attack vector for initial penetration of the targeted organization. Emails allow the attacker to deliver malicious attachments or links to malicious websites. Attackers usually use social engineering in order to make the recipient open the malicious email, open the attachment, or press a link. Existing defensive solutions within organizations prevent executables from entering organizational networks via emails; therefore, recent APT attacks tend to attach non-executable files (PDF, MS Office, etc.) which are widely used in organizations and mistakenly considered less suspicious or malicious. This article surveys existing academic methods for the detection of malicious PDF files. The article outlines an Active Learning framework and highlights the correlation between structural incompatibility of PDF files and their likelihood of maliciousness. Finally, we provide comparisons, insights and conclusions, as well as avenues for future research in order to enhance the detection of malicious PDFs.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Since 2009, cyber-attacks against businesses and organizations have increased, and in 2013, 91% of all organizations were hit with cyber-attacks and 9% were the victims of targeted attacks, according to Kaspersky Labs. Attacks aimed at organizations usually take the form of harmful activities such as stealing confidential information, spying on and monitoring an organization, and disrupting an organization's actions. Attackers may be motivated by ideology, criminal intent, a desire for publicity, etc. Electronic mail (a.k.a. email) is a method for exchanging digital messages (composed of a header, body, and oftentimes an attachment) between authors and one or more recipients through mail servers. Email is one of the most popular means of communicating, and the vast majority of organizations use email for internal and external communication.


Therefore, email containing attachments of malicious files has become an attractive platform by which to initiate cyber-attacks against organizations.

The primary responsibility of an organization's cyber security team is to prevent such attacks from penetrating the organization's internal network, and to prevent these attacks from causing any damage to the organization. This is done by using defensive tools such as firewalls, Intrusion Detection Systems (IDS), Intrusion Prevention Systems (IPS), antiviruses, etc. However, these tools are limited in their ability to detect and identify the attacks that occur within email today, particularly when a sophisticated APT attack is executed against an organization when a user opens a malicious email attachment. Moreover, there is a lag time between the time a new, unknown malcode appears and the time antivirus vendors update their clients with the new signature, a time in which many computers are vulnerable to the new malcode (Christodorescu and Jha, 2004; White et al., 1999).

Attackers usually use social engineering in order to encourage the recipient to open a malicious email, open an attachment, or press a link. The term social engineering refers to psychological manipulation techniques used to fool people into performing actions that meet the needs of the attacker. For example, attackers will send an email with an intriguing subject or sophisticated content that will tempt the user to open the attachment. Trend Micro (an Internet security company) found that APT attacks, particularly those against government agencies and large corporations, are almost entirely dependent upon spear-phishing emails.

As most email servers prevent the attachment of executable files to email messages, the non-executable files attached to an email have played a major role in many recent cyber-attacks. These files are written in a format that can only be read by a tailored program, and these files cannot be directly executed. For instance, only a PDF reader, such as Adobe Reader or Foxit Reader, can open a PDF file.

Users consider non-executable files safer than executables, and thus, they are less suspicious toward such files received by email. Most naive computer users avoid opening executable files received via email, but they don't think twice about opening a Microsoft Office document or a PDF file. Unfortunately, non-executable files are as dangerous as executable files, since their readers can contain vulnerabilities that, when exploited, could allow an attacker to execute malicious actions on the victim's computer. F-Secure's 2008–2009 report indicates that the most popular file types for targeted attacks in 2008–2009 were PDF and Microsoft Office files. Note that since that time, the number of targeted attacks on Adobe Reader has almost doubled.

To demonstrate this point, an incident aimed at the Israeli Ministry of Defense (IMOD) took place on January 15, 2014, which provides an example of a new type of targeted cyber-attack. According to various media reports published on January 26, 2014, the Seculert Company reported that it identified an attack in which attackers sent email messages, allegedly from IMOD, with a malicious PDF file attachment posing as an IMOD document. When opened, the PDF file installed a Trojan horse that enabled the attacker to take control of the computer. This example clearly demonstrates that the existing solutions previously mentioned are insufficient in detecting and preventing such attacks, and furthermore emphasizes the necessity for advanced detection methods.

To date, antivirus packages are not sufficiently effective in intercepting malicious PDF files, even in the case of highly prominent PDF threats (Tzermias et al., 2011). Conversely, according to studies such as Tzermias et al. (2011), Srndic and Laskov (2013), Laskov and Srndic (2011), Vatamanu et al. (2012), Schmitt et al. (2012), Smutz and Stavrou (2012), Maiorca et al. (2012), Pareek et al. (2013), Lu et al. (2013), and Snow et al. (2011), machine learning (ML) methods can be effective in distinguishing between malicious and benign PDF files.

In the following survey, we present several significant studies pertaining to PDF detection using ML algorithms based on static analysis, dynamic analysis, or both. Our survey focuses on existing academic solutions that offer a variety of defense mechanisms against cyber-attacks. The typical use case scenario is characterized by a PC user, connected to an organizational network, who opens a malicious PDF email attachment, as the PC is the platform most commonly used by organizations. It should be noted that mobile platforms such as Android and iPhone are not within the scope of this survey.

This paper also outlines a novel Active Learning (AL) framework and highlights the correlation between the structural incompatibility of PDF files and their maliciousness. Additionally, it provides comparisons, insights and conclusions, as well as research avenues for future work in order to enhance the detection of malicious PDFs.

2. Structure of PDF files

A Portable Document Format (PDF) is a formatting language first conceived by John Warnock, one of the founders of Adobe Systems. The first version, version 1.0, was introduced in 1993, and the most current version now available is 1.7. The PDF specification is publicly available and thus can be used by anyone. A PDF has many functions beyond simple text: it can include images and other multimedia elements, be password protected, execute JavaScript, etc.

Likewise, the PDF is supported in all the prominent operating systems for the PC and mobile platforms (e.g., Microsoft Windows, Linux, Mac OS, Android, Windows Phone and iOS).

A PDF file is a set of interconnected objects built hierarchically. The PDF file structure is depicted in Fig. 1 and is comprised of four basic parts according to the Adobe PDF Reference (Maiorca et al., 2013; Srndic and Laskov, 2013):

1. Objects – basic elements in a PDF file:
   - Indirect objects – objects referenced by a number
   - Direct objects – objects that are not referenced by a number
   Object types:
   - Boolean – for true or false values
   - Numeric – an integer value or a real value
   - String – either a literal string, a sequence of literal characters enclosed with ( ) brackets, or a hexadecimal string, a sequence of hexadecimal numbers enclosed with <> brackets
   - Name – a sequence of 8-bit characters starting with /
   - Null – an empty object represented by the keyword null
   - Array – an ordered sequence of objects enclosed with [ ] brackets that can be composed of any combination of object types, including another array
   - Dictionary – an unordered sequence of key–value pairs: keys being names which should be unique in the dictionary, and values being any object type. Most of the indirect objects in a PDF document are dictionaries, and dictionaries are enclosed with << >> brackets.
   - Stream – a special dictionary object followed by a sequence of bytes enclosed with the keywords "stream" and "endstream." The information inside the stream may be compressed or encrypted by a filter. The prefix dictionary contains information on whether and how to decode the stream. Streams are used to store data such as images, text, script code, etc.
2. File Structure – defines how the objects are accessed and how they are updated. A valid PDF file consists of the following four parts:
   1. Header – the first line of a PDF file, which specifies the version number of the PDF specification which the document uses. The header format is "%PDF-[version number]".
   2. Body – the biggest section in the file, which contains all the PDF objects. The body is used to hold all of the document's data that is shown to the user.
   3. Cross reference – a table that includes the position of every indirect object in memory and allows random access to objects in the file, so the application does not need to read the whole file to locate a particular object. Each object is represented by one entry in the table, which is always 20 bytes long.
   4. Trailer – provides relevant information about how the application reading the file should find the cross reference table and other special objects. The trailer also contains information about the number of revisions made to the document. All PDF readers should begin reading a file from this section.
3. Document Structure – defines how objects are logically and hierarchically organized to reflect the content of a PDF file. Each page in the document is represented by a page object, which is a dictionary that includes references to the page's content. The root object of the hierarchy is the catalog object, which is also a dictionary.
4. Content Streams – objects that contain instructions which define the appearance of the page.

3. Techniques and possible attacks via PDF files

Before presenting the techniques and possible attacks, it is worthwhile to mention that Adobe Reader version X was released in 2011, containing a new feature called PMAR (Protected Mode Adobe Reader). Protected mode uses the sandbox technique in order to create an isolated environment for the Acrobat Reader rendering agent to run while reading a PDF file. Thus, malicious code actions cannot affect the operating system. However, most organizations are not up-to-date with the newest versions of PDF readers, and thus they are exposed to many of the well-known attacks that exploit the vulnerabilities existing in previous versions of Adobe Reader. In addition, PDF files can be utilized for malicious purposes using a variety of techniques.

After understanding the implications of Adobe Reader version X and introducing the PDF's basic structure and elements, we now present the possible attack techniques that can be leveraged by attackers.

3.1. JavaScript code attack

PDF files can contain client-side JavaScript code for legitimate purposes including: 3D content, form validation, and calculations. JavaScript code can reside on a local host or a remote server, and can be retrieved using /URI or /GoTo (Laskov and Srndic, 2011). The main indicator for JavaScript code embedded in a PDF file is the presence of the /JS keyword in some dictionaries (as previously explained in section 2). However, an object containing such a dictionary can be placed in a filtered stream. The /JS keyword will then not be visible in plain text, and therefore may obstruct static analysis detection techniques that rely on these keywords (Maiorca et al., 2012).
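To make the structural elements above concrete, the following short Python sketch (written for this survey as an illustration; it is not part of any surveyed tool) counts raw-byte occurrences of a few of the markers just discussed, such as object and stream delimiters and the /JS and /JavaScript name keywords. As noted above, keywords hidden inside filtered streams will not be visible to such a plain scan.

    import re
    import sys
    from collections import Counter

    # Markers drawn from the structural elements described in section 2 and the
    # JavaScript-related name keywords mentioned above.
    MARKERS = [b"obj", b"endobj", b"stream", b"endstream", b"xref", b"trailer",
               b"/JS", b"/JavaScript", b"/OpenAction", b"/EmbeddedFile"]

    def count_structural_markers(path):
        # Count occurrences of each marker in the raw bytes of the file.
        # Note: b"obj" also matches inside b"endobj"; dedicated tools such as
        # PDFiD handle these cases more carefully.
        with open(path, "rb") as f:
            data = f.read()
        counts = Counter()
        for marker in MARKERS:
            counts[marker.decode()] = len(re.findall(re.escape(marker), data))
        return counts

    if __name__ == "__main__":
        print(count_structural_markers(sys.argv[1]))

Counts of this kind are the raw material used by the keyword-based static detectors surveyed later in section 4.1.2.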

Fig. 1 – Simple PDF file structure.

The primary goal of the malicious JavaScript code inside a PDF file is to exploit a vulnerability in the PDF viewer in order to divert the normal execution flow to the embedded malicious JavaScript code. This can be achieved by performing a heap spraying attack, as implemented through JavaScript. Another malicious activity that can be carried out using JavaScript is downloading an executable file from the Internet that initiates an attack on the victim's machine once executed. Alternatively, JavaScript code can also open a malicious website that can perform a variety of malicious operations targeting the victim's machine.

Code obfuscation is legitimately used to prevent reverse engineering of proprietary applications. It can also be used by attackers to conceal malicious JavaScript code from being recognized by signature based detectors, and to reduce readability by a human analyst.

Filters are used in PDFs to compress data in order to enhance compactness, or to facilitate encoding. By using filters, an attacker can conceal the malicious code, rendering it undetectable or unreadable. Multiple filters can be applied to a stream, one after the other. The filter names used must be declared in the stream dictionary. Available filters and their primary purposes are discussed by P. Baccas and J. Kittilsen (Baccas, 2010; Kittilsen, 2011).
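Because filters can hide keywords such as /JS from a plain-text scan, static analyzers usually try to decode filtered streams before searching them. The sketch below is an illustration written for this survey (not code from any of the surveyed tools); it inflates FlateDecode-compressed streams with Python's zlib module and reports which decoded streams mention JavaScript. Other filters (e.g., ASCIIHexDecode) would require their own decoders.

    import re
    import sys
    import zlib

    STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

    def decoded_streams_with_js(path):
        # Yield the indices of FlateDecode streams whose decoded bytes mention JavaScript.
        with open(path, "rb") as f:
            data = f.read()
        for i, match in enumerate(STREAM_RE.finditer(data)):
            try:
                decoded = zlib.decompress(match.group(1))  # succeeds only for Flate-compressed data
            except zlib.error:
                continue  # stream uses another filter, is encrypted, or is malformed
            if b"/JS" in decoded or b"/JavaScript" in decoded or b"eval(" in decoded:
                yield i

    if __name__ == "__main__":
        print(list(decoded_streams_with_js(sys.argv[1])))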

Table 1 summarizes various code obfuscation techniques that can be employed by attackers (Kittilsen, 2011).

Table 1 – Code obfuscation techniques in PDF files that can be used by an attacker.
- Separating malicious code over multiple objects: Malicious code is spread among multiple objects. Code chunks are collected, merged, and compiled to form a malicious piece of code only during runtime. This makes it difficult for static analysis detectors to recognize the malicious code.
- Applying filters: Filters are used to conceal malicious code (as described above).
- White space randomization: Random white spaces are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. White spaces do not affect the code since JavaScript ignores them.
- Comment randomization: Random comments are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. Comments do not affect the code since JavaScript ignores them.
- Variable name randomization: Changing variable names randomly in order to fool signature based maliciousness detectors.
- Integer obfuscation: Representing numbers in a different way. For example, this can be used to hide a specific memory address.
- String obfuscation: Making changes to a string in order to make it difficult for a human analyst to understand the code, for example, by splitting the string into several substrings.
- Function name obfuscation: Hiding the name of the function used, which could provide a clue about the code's intention. This is done by creating a pointer with a random name to the required function.
- Advanced code obfuscation: A string can hold encrypted malicious code. The decryption process takes place during runtime, just before usage. Metadata fields and even the document's words can also be used to store malicious code.
- Block randomization: Changing the syntax of the code but not its action.
- Dead code: Inserting blocks of code that are not intended to be executed.
- Pointless code: Inserting blocks of code that do not perform anything.

3.2. Embedded files attack

A PDF file can contain other file types inside of it, for example, HTML, JavaScript, SWF, XLSX, EXE, Microsoft Office files or even another PDF file. An attacker can use this functionality in order to embed a malicious file inside a benign file. This way, the attacker can leverage the vulnerabilities of other file types in order to perform malicious activity, such as exploiting the vulnerability of a Flash file embedded within a PDF file. By using special techniques, the embedded file can be opened without alerting the user. Usually, embedded malicious files are obfuscated in order to avoid detection. The PDF viewer will not allow the launching of an embedded executable file because of its blacklist, which is based on file extension. However, Python code (*.py) is not blacklisted.

In 2013, Maiorca et al. (Maiorca et al., 2013) presented a practical, novel evasion technique called "reverse mimicry." This technique was designed to evade state-of-the-art malicious PDF detectors, which have the ability to accurately detect malicious PDF files by analyzing their logical structure (Srndic and Laskov, 2013). This technique will be elaborated upon in the solution section.

Mimicry attacks attempt to change a malicious file's structure and objects so that the file is similar to a benign file. The proposed reverse mimicry attack changes a benign file to make it malicious. It injects malicious content into a benign PDF file so that its benign structure remains. This method can be automated easily and does not require knowledge about the structural features used in the maliciousness detector.

The authors present three methods of implementing the reverse mimicry attack: firstly, embedding a malicious EXE payload into a benign PDF file; secondly, embedding a malicious PDF file into a benign PDF file; and thirdly, JavaScript injection, in which malicious JavaScript code that is embedded in the PDF file, and does not contain references to other objects inside the PDF file elements, is encapsulated – which is perhaps the best way to perform a reverse mimicry attack. Here, the JavaScript code is not located in typical places, and it is not related to objects in the PDF; thus, it might evade structural based detection techniques that will be elaborated upon later. Sample files created by this attack were evaluated by state-of-the-art malicious PDF detectors (discussed in the detection methods section) based on supervised machine learning: PJScan (Laskov and Srndic, 2011); three online versions of PDFRate (Smutz and Stavrou, 2012) based on three different data sources: Contagio, George Mason University, and Community; and PDF Malware Slayer (PDFMS) (Maiorca et al., 2012). The evaluation results show that the three techniques presented are highly successful in evading the detection tools listed above. Only PJScan (Laskov and Srndic, 2011) detected the JavaScript attack.

They also proposed a new framework to deal with the evasion attacks presented in the article. The proposed framework recursively extracts any embedded PDF file from a suspicious PDF file and then applies three analyses on it: embedded JavaScript code analysis, PDF structural analysis, and analysis of the embedded EXE or SWF file.
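In the spirit of the framework just described, which recursively inspects embedded content, the following illustrative sketch (ours, not the authors' implementation) flags a few coarse indicators of embedded or launchable payloads by scanning a PDF's raw bytes for the corresponding name keywords. It is a rough triage heuristic, not a detector.

    import sys

    # Name keywords commonly associated with embedded or launchable content.
    INDICATORS = [b"/EmbeddedFile", b"/Filespec", b"/Launch", b"/RichMedia", b"/ObjStm"]

    def embedded_content_indicators(path):
        # Return the indicator keywords that appear in the raw bytes of the file.
        with open(path, "rb") as f:
            data = f.read()
        return [ind.decode() for ind in INDICATORS if ind in data]

    if __name__ == "__main__":
        found = embedded_content_indicators(sys.argv[1])
        print("indicators found:", found if found else "none")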

3.3. Form submission and URI attacks

In 2013, Valentin Hamon (Hamon, 2013) presented practical techniques that can be used by attackers to execute malicious code from a PDF file. Adobe Reader supports the option of submitting a PDF form from a client to a specific server using the /SubmitForm command. Several file formats can be used for that purpose, one of which is the Forms Data Format (FDF), the default format, which is based on XML. Adobe generates an FDF file from a PDF in order to send the data to a specified URL. If the URL belongs to a remote webserver, it is able to respond. Responses are temporarily stored in the %APPData% directory, which automatically pops up in the default web browser. An attack can be performed by a simple request to a malicious website that will automatically pop up in the web browser, and the malicious website can exploit a vulnerability in the user's web browser. The author showed that security mechanisms such as the Protected Mode of Adobe Reader X or the URL Security Zone Manager of Internet Explorer can be easily disabled by changing the corresponding registry key, a change that can be implemented with user privileges. Moreover, a URI address can be used (instead of a URL), referring to any type of file located remotely without limitations, both executable and non-executable files, including .exe or .PDF. One interesting scenario is launching a malicious PDF file from a benign one. Other attacks described in the paper include an attack in which a benign PDF simply submits its form to the attacker's PHP web server. The server searches the submitted form header for the Adobe Reader version using regular expressions. Once the server identifies the user's Adobe Reader version, it sends back a malicious PDF that exploits a vulnerability corresponding to that version. The returned PDF is automatically launched. Another attack described is the Big Brother attack. When the user opens a PDF, it automatically downloads a malicious executable file using the web browser (URI address). This attack can be performed prior to the previously described attack, in order to make changes to registry keys that configure the security mechanisms discussed above.

4. Advanced methods for the detection of malicious PDF files

Advanced methods for the detection of new, unknown malicious PDF files are based mainly on classifiers induced by machine learning algorithms that leverage information from features extracted from the PDF files using either static or dynamic analysis. Static analysis examines and evaluates a file or application solely by analyzing its code, without actually executing it. Alternatively, dynamic analysis executes the file or application and examines its actions and behavior during runtime. Moreover, static and dynamic analysis methods can both be used to examine either executables or non-executables. The desire to enhance security in the face of attacks based on malicious PDF files has led to a great deal of published research in recent years. Most of the academic work on the detection of malicious PDF files is based on static analysis, because static analysis requires less computing resources and it is much faster. Static analysis methods usually inspect embedded JavaScript code or examine document metadata (such as its objects or structure, or the number of specific streams, objects, keywords, etc.). However, static analysis has drawbacks as well, including the inability to detect well-obfuscated code that acts maliciously during runtime, in contrast to dynamic analysis, which will probably detect the well-obfuscated malicious PDF.

We divided the methods into two main categories of analysis, static and dynamic, based on the primary process applied to the PDF file. Static analysis includes methods that statically analyze specific components inside the PDF file without executing or opening the PDF file. These components might conceal indications of the maliciousness of a PDF file. JavaScript code is one component of PDF files well worth analyzing. According to Vatamanu et al. (2012), most attacks related to PDF files are conducted using JavaScript code embedded inside the PDF; therefore, we created a sub-category of JavaScript analysis and surveyed the methods aimed at detecting malicious PDF files based on analysis of their JavaScript code (Laskov and Srndic, 2011; Vatamanu et al., 2012). Another sub-category of static analysis is metadata analysis, focused on analyzing meta-features related to all the content of the file. The main advantage of the latter methods over JavaScript analysis is that they are not affected by code obfuscation, because metadata analysis doesn't focus on analyzing the JavaScript code itself. This sub-category includes several approaches (Srndic and Laskov, 2013; Smutz and Stavrou, 2012; Maiorca et al., 2012; Pareek et al., 2013).

The second category, dynamic analysis, largely includes studies that dynamically execute the JavaScript code embedded in a PDF. These studies conduct an extraction of the JavaScript code, either statically or dynamically. We surveyed four dynamic analysis works (Tzermias et al., 2011; Schmitt et al., 2012; Lu et al., 2013; Snow et al., 2011) and divided them into two sub-categories based on the extraction method of the embedded JavaScript code within the PDF file.

Fig. 2 presents a taxonomy of the latest research on the detection of malicious PDF files that is further described in the literature review. The techniques are grouped into two categories, static analysis and dynamic analysis, while each category also contains sub-categories that contribute to the reader's knowledge and orientation.

4.1. Detection methods based on static analysis

We now present the two sub-categories of static analysis for the detection of malicious PDF files. Since most of the attacks related to PDF files are conducted using JavaScript code embedded inside the PDF, the sub-category which includes methods aimed at statically analyzing the embedded JavaScript code inside PDF files (Laskov and Srndic, 2011; Vatamanu et al., 2012) will be presented first. The second sub-category includes four methods (Srndic and Laskov, 2013; Smutz and Stavrou, 2012; Maiorca et al., 2012; Pareek et al., 2013) that conduct static analysis based on the PDF file's metadata. It is important to note that almost all static analysis methods rely upon a PDF parser capable of parsing the file and extracting the desired informative data. This parser should also be capable of dealing with the encryption and encoding used by filters in a file. The robustness of the parser and its ability to reach all information within the file is the key to thorough and profound feature extraction. A parser that is not able to cope with some filters and other obfuscation techniques may miss valuable and important information such as JavaScript code, and this may result in a malicious PDF file being misclassified as benign.
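As a minimal illustration of the classifier-based approach described in this section, the sketch below builds feature vectors from keyword counts (in the spirit of the keyword-based methods surveyed in section 4.1.2) and trains a Random Forest with scikit-learn. The file paths, labels, and keyword set are placeholders chosen for the example; none of the surveyed systems is reproduced here.

    import re
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder keyword set; real systems use much richer feature sets.
    KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch",
                b"/EmbeddedFile", b"obj", b"stream"]

    def feature_vector(path):
        # Represent a PDF as a vector of keyword-occurrence counts.
        with open(path, "rb") as f:
            data = f.read()
        return [len(re.findall(re.escape(k), data)) for k in KEYWORDS]

    def train_detector(benign_paths, malicious_paths):
        X = [feature_vector(p) for p in benign_paths + malicious_paths]
        y = [0] * len(benign_paths) + [1] * len(malicious_paths)  # 0 = benign, 1 = malicious
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)
        return clf

    # Usage (paths are hypothetical):
    #   clf = train_detector(["benign1.pdf"], ["malicious1.pdf"])
    #   clf.predict([feature_vector("suspicious.pdf")])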

Fig. 2 – Taxonomy of academic research on detection methods of malicious PDF files.

4.1.1. JavaScript analysis
The following two JavaScript analysis methods attempt to tokenize the JavaScript code extracted from a PDF file, each of them applying a different approach. Both methods apply machine learning algorithms to the tokenized code in order to build a classification model and classify new, unfamiliar PDF files after the embedded JavaScript code has been extracted from them. The tokenization of the code captures the variable types, function names, operators, etc. that are used. Such methods, aimed at analyzing JavaScript code, must be capable of coping with code obfuscation techniques such as those presented in Table 1. Well-performed code obfuscation techniques can evade code analysis methods, consequently causing a malicious PDF file to be misclassified as benign. For instance, code obfuscation can conceal the use of suspicious JavaScript functions such as the Eval() function. Eval() enables a dynamic execution of JavaScript source code stored in a string variable.

4.1.1.1. Lexical analysis. Srndic and Laskov (Laskov and Srndic, 2011) introduced PJScan, a purely static analysis and anomaly detection tool for the detection of malicious JavaScript code inside a PDF file. In this method, a One-Class Support Vector Machine (OCSVM), a machine learning method, is used to automatically construct models from available data for subsequent classification of new data. The dataset used for testing contained over 65,000 real-life PDF documents collected from the VirusTotal corpus, ~40,000 benign files and ~25,000 malicious files. The feature extraction component makes use of an open source PDF rendering library called POPPLER to search for embedded JavaScript code in a document. The extraction component permits reliable extraction of JavaScript code from a PDF document by searching all potential locations provided by the Adobe PDF reference documentation. Files which do not contain JavaScript do not move to the next stage. After the JavaScript code has been found and extracted, a lexical analysis is performed on it using the Mozilla SpiderMonkey JavaScript interpreter. Lexical analysis attempts to represent the JavaScript code as a sequence of tokens, for example, left parenthesis, plus, right parenthesis, etc. Using these tokens, PJScan tries to induce learning detection models that differentiate between benign and malicious PDF files. PJScan demonstrates fast performance of less than 50 ms per file. The attained average true positive rate (TPR) was 85% for previously known PDF attacks and 71% for unknown PDF attacks. A false positive rate (FPR) of 16–17% is measured when tested against JavaScript-bearing benign PDF files.
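To give a concrete sense of PJScan's one-class formulation, the sketch below (a simplified illustration, not PJScan itself) tokenizes JavaScript with a coarse regular expression, builds token-frequency vectors, and fits scikit-learn's One-Class SVM on known-malicious scripts only, so that new scripts resembling that class are flagged. PJScan's actual tokenizer is SpiderMonkey's lexer, and its feature and kernel choices differ.

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import OneClassSVM

    # Very coarse lexical tokenizer: identifiers, numbers, and single punctuation marks.
    TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z0-9]")

    def lex(script):
        # Turn a JavaScript string into a space-separated token sequence.
        return " ".join(TOKEN_RE.findall(script))

    def train_one_class(malicious_scripts):
        # Fit a one-class model on token-frequency vectors of known-malicious scripts.
        vectorizer = CountVectorizer(token_pattern=r"\S+")
        X = vectorizer.fit_transform(lex(s) for s in malicious_scripts)
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)
        return vectorizer, model

    def resembles_malicious(vectorizer, model, script):
        # +1 means the script falls inside the learned (malicious) region.
        return model.predict(vectorizer.transform([lex(script)]))[0] == 1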

4.1.1.2. Clustering. While the previous work (Laskov and Srndic, 2011) presented a static lexical analysis of the JavaScript code found in PDF files, the following solution focuses on establishing a clustering method for identifying similar scripts that have been obfuscated using a variety of obfuscation techniques, in order to detect malicious JavaScript code inside a PDF file.

Vatamanu et al. (Vatamanu et al., 2012) introduced two different static methods for clustering PDF files based on tokenization of their embedded JavaScript. The first is hierarchical bottom-up clustering and the second is hash table clustering. The article focuses on establishing a clustering method for the identification of similar scripts that have been obfuscated using different techniques. For each examined PDF file, a fingerprint is created. The fingerprint is a set of unique JavaScript tokens and their frequencies. Experimentation included two datasets of PDF files. The malicious PDF dataset consisted of 997,615 different malicious PDF files collected from honeypots, spam messages, etc. The benign PDF dataset consisted of 1,333,420 files collected from popular websites. Results showed that 93% of the malicious PDF files contain JavaScript and only 5% of the benign PDF files contain JavaScript. The study also revealed that the hash table clustering method is much faster and more appropriate for large datasets than hierarchical bottom-up clustering.

4.1.2. Metadata analysis
The following static analysis methods analyze a PDF file by examining its metadata (i.e., data about data). Each method uses a different approach, such as analyzing the occurrences of embedded keywords (Maiorca et al., 2012), analyzing the hierarchical structural paths (Srndic and Laskov, 2013), calculating the entropy of sets of byte sequences of the entire file (Pareek et al., 2013), computing n-grams of the file's hexadecimal dumps (Pareek et al., 2013), and using specific significant meta-features (Smutz and Stavrou, 2012). These approaches share a focus on global or statistical information about the PDF file's objects and structure, rather than on its actual content (plaintext or code).

4.1.2.1. Keywords analysis. Maiorca et al. (Maiorca et al., 2012) introduced the PDF Malware Slayer (PDFMS), a static analysis tool which characterizes PDF files according to the set of embedded keywords and their occurrence. This information is used in order to classify suspicious files. The dataset consisted of 21,146 files in total, 11,157 of them malicious and 9989 benign. The proposed tool consists of two modules: a data retrieval module which retrieves files for the training and testing phases, and a feature extractor module which determines the type of features to be used by the classifier. To retrieve the keywords from the PDF file, the authors used the PDFiD tool (a Python script) developed by Didier Stevens. The files were characterized by keywords such as: /JS, /JavaScript, /Encrypt, obj, stream, filter, etc. The authors chose to use the Random Forest classifier, which provided the highest accuracy and performed significantly better than other tested classifiers such as Naive Bayes, SVM with a linear kernel, and the J48 decision tree.

Their main contribution is the ability to detect malicious PDF files whether or not they contain JavaScript code, unlike previously described tools such as PJScan (Laskov and Srndic, 2011) which only detects malicious files if they contain JavaScript code. However, an attacker can learn which keywords characterize benign files and inject these keywords inside a malicious file in order to bypass PDFMS, thus demonstrating the tool's weakness.

4.1.2.2. Hierarchical structure analysis. The previous solution, PDFMS, aggregates data on specific object types inside the PDF file. The following solution takes it a step further and also aggregates the objects' hierarchy, which possesses much more information.

Srndic and Laskov (Srndic and Laskov, 2013) introduced a high performance static method for the detection of malicious PDF documents which, instead of analyzing JavaScript or any other content, makes use of essential differences in the structural properties of malicious and benign PDF files. The static analysis method they introduced evaluates documents based on side effects of malicious content within their structure. When an attacker injects malicious content into the PDF file, the file structure inevitably changes. When a PDF is examined, it is converted into a set of structural paths which characterize the document's structure. A structural path is a path within the document's structural hierarchy, and the occurrences of the paths are also counted. The PDF is parsed using the PDF parser POPPLER. The parser extracts structural paths from malicious and benign real-world PDF files, which are used to create the training set. Feature extraction for the 150 GB of data took a total of 121 minutes and 55 seconds. Two classification models were trained: SVM – LibSVM, a well-known stand-alone SVM implementation, and Decision Tree – a C5.0 inference implementation. The number of different structural paths extracted from the PDF files in the training set was over nine million. Thus, only structural paths that appeared in more than 1000 files of the dataset were selected, generating a training set with 6087 features. Their results show a TPR of 99.88% and an FPR of 0.001%. Detection accuracy surpasses VirusTotal accuracy. This detection model was found to be a very effective way to differentiate malicious PDFs from benign PDFs and was even effective against malicious files created two months after the classification model was created. The method was tested against previous detectors: MDScan (Tzermias et al., 2011), PJScan (Laskov and Srndic, 2011), ShellOS (Snow et al., 2011) and PDFMS (Maiorca et al., 2012), and the comparison clearly demonstrated the efficiency and resilience of this method in the detection of new malicious PDF files.

The authors also discussed evasion techniques such as the feature addition attack – the addition of benign features to malicious PDF files so that they will be classified as benign – and the feature removal attack – the removal of features from malicious PDF files so that they will be classified as benign. Their main contribution is a novel technique for the detection of malicious PDF files based on the difference between the underlying structural properties of benign and malicious PDF files.
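The structural-path representation described in section 4.1.2.2 can be illustrated with the short sketch below. It is our simplified example, not the original implementation (which parsed files with POPPLER); it assumes some parser has already turned the document's object hierarchy into nested Python dictionaries and lists, and it ignores the cycles that back-references create in real PDFs.

    from collections import Counter

    def structural_paths(obj, prefix=""):
        # Enumerate structural paths of an already-parsed object hierarchy,
        # e.g. {"/Type": "/Catalog", "/Pages": {"/Kids": [{"/Contents": {}}]}}.
        if isinstance(obj, dict):
            for key, value in obj.items():
                path = prefix + key
                yield path
                yield from structural_paths(value, path)
        elif isinstance(obj, list):
            for item in obj:  # array elements extend the same path
                yield from structural_paths(item, prefix)

    def path_count_features(catalog, vocabulary):
        # Count occurrences of each structural path in a fixed path vocabulary.
        counts = Counter(structural_paths(catalog))
        return [counts[p] for p in vocabulary]

    # Example with a hypothetical, heavily simplified catalog object:
    #   catalog = {"/Type": "/Catalog", "/Pages": {"/Kids": [{"/Contents": {}}]}}
    #   sorted(set(structural_paths(catalog)))
    #   -> ['/Pages', '/Pages/Kids', '/Pages/Kids/Contents', '/Type']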


4.1.2.3. Content metadata analysis. The aforementioned hierarchical structure analysis converts the structure of a file to a list of features that represents the file and can be used by machine learning algorithms. The following solution selects 202 specific meta-features that are available from a minimal parsing process. These meta-features are obtained from both the document's metadata and its structure.

Smutz and Stavrou (Smutz and Stavrou, 2012) presented PDFRate, a framework which is based on meta-features extracted from a document's content for the detection of malicious PDF files. The process is based on the use of a self-implemented reliable parser for feature extraction, because existing tools are unable to deal with malformed documents. The extracted features include the number of font objects, the average length of stream objects, and the number of lower case characters in the title. In total, 202 features were chosen for classification. Two data sources were used for the research: the first is the Contagio dataset collection, and the second is based on monitoring the network of a large university's HTTP traffic (six days of capturing). They collected over 5000 unique malicious PDF files and over 100,000 benign PDF files. Their results showed that the Random Forest classifier provided the best results, based on its ability to distinguish opportunistic attacks from targeted attacks. The total results were as follows: the TPR exceeded 99%, and the FPR was lower than 0.2%. To evaluate their method they conducted three experiments. First, an evaluation of classification and detection performance was carried out, which resulted in two ROC curves: one for the benign and malicious classifier and another for the opportunistic and targeted classifier. Next, an evaluation of new variant detection was performed, in which five antiviruses were used to distinguish variants of similar malicious documents. The results were used to create a variant identifier. Finally, PDFRate was compared with PJScan (Laskov and Srndic, 2011), which produced results only for files in which it found JavaScript. The results show that PJScan was unable to classify many malicious documents that do in fact contain JavaScript but in atypical locations, such as metadata sections or corrupted document structures. The main contribution of this work was the implementation of a robust feature extractor that can also stand up to malformed documents and to JavaScript that is embedded in atypical locations inside the PDF.

4.1.2.4. Term frequency and entropy analysis. Contrary to the aforementioned approaches, which rely upon a PDF parser's ability to extract relevant data from objects embedded in the PDF file, the following study proposes two different detection methods that do not employ a PDF parser. Alternatively, the methods apply the analysis to the whole file, after its content is converted to hexadecimal dumps or byte sequences.

Pareek and Eswari (Pareek et al., 2013) introduced two static analysis methods for the detection of malicious PDFs. The first method is based on entropy, and the second is based on n-gram term frequency. The benign and malicious datasets used for evaluating the two approaches contained 792 PDF documents each. The benign files were taken mainly from research papers, reports, and financial documents. The malicious PDF files were obtained using computing.net, malware.lu, contagiodump.blogspot.com, and Brandon Dixon. The first, entropy based method was used to measure the uncertainty or randomness in a given dataset. A file is represented as a set of byte sequences. The authors assumed that the level of uncertainty of a malicious file should be less than that of a benign file of similar format. Low entropy of a file is not a strong indicator of maliciousness; however, it can be a useful feature in combination with other features. Evaluation results of entropy calculations on malicious and benign PDF files indicate that the average entropy for the malicious dataset (4.8) is significantly lower than that of the benign dataset (7.7). The second method, the n-gram based approach, takes substrings of a given large string, where the n-gram can be words or bytes; for example, applying 3-grams of words to the sentence "is this PDF malicious?" creates the following two terms: "is this PDF" and "this PDF malicious?" The authors generated hexadecimal dumps from PDF files and generated 2-grams (words) from them. The 2-grams were represented by term frequency (TF) and term frequency inverse document frequency (TFIDF). The J48 algorithm was used to build a model from the TF and TFIDF. The results were a TPR of 0.9922 and an FPR of 0.006. Their main contribution is the combination of the n-gram approach and entropy measurement for the detection of malicious PDF files.
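The entropy and n-gram measures used by Pareek and Eswari can be computed in a few lines; the sketch below is an illustration written for this survey, not their implementation. It computes the Shannon entropy of a file's bytes and the term frequencies of byte-level 2-grams over its hexadecimal dump.

    import math
    from collections import Counter

    def byte_entropy(path):
        # Shannon entropy, in bits per byte, of the file's raw bytes.
        with open(path, "rb") as f:
            data = f.read()
        if not data:
            return 0.0
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def hex_2gram_tf(path):
        # Term frequencies of 2-grams (pairs of consecutive bytes) over the hex dump.
        with open(path, "rb") as f:
            hexdump = f.read().hex()
        grams = [hexdump[i:i + 4] for i in range(0, len(hexdump) - 2, 2)]
        if not grams:
            return {}
        total = len(grams)
        return {g: c / total for g, c in Counter(grams).items()}

    # Per the study cited above, malicious samples averaged lower byte entropy
    # (about 4.8) than benign ones (about 7.7) on the evaluated datasets.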
extract relevant data from objects embedded in the PDF file, The second work was presented two years later by Knut the following study proposes two different detection methods Borg (Borg, 2013) as a continuation of Kittilsen's research that do not employ a PDF parser. Alternatively, the methods (Kittilsen, 2011). This thesis focuses on online detection of PDF apply the analysis methods on the whole file, after its content files, while Kittilsen's thesis featured offline detection. Kittil- is converted to hexadecimal dumps or byte sequences. sen's proposed IDS extracted PDF files from the network traffic Pareek and Eswari (Pareek et al., 2013) introduced two static to the local hard drive and then executed a classification al- analysis methods for the detection of malicious PDF. The first gorithm to detect maliciousness. This meant that a malicious method is based on entropy, and the second is based on n- PDF file could reach its destination. Borg sought to provide gram term frequency. The benign and malicious datasets used answers regarding the viability of an online detection system for evaluating the two approaches contained 792 PDF docu- of PDF files which analyzes the PDF from the network traffic in ments each. The benign files were taken mainly from research real-time and does not allow the file to continue on to the papers, reports, and financial documents. The malicious PDF target if it is classified as malicious. Borg's research sought to files were obtained using computing.net, malware.lu, answer the following questions: “Is online analysis of PDF files contagiodump.blogspot.com, and Brandon Dixon. The first viable and what kind of time delay will the user experience?” entropy based method was used to measure the uncertainty and “Does the programming language C perform the same or randomness in a given dataset. A file is represented as a set task at a significantly higher speed than Kittilsen achieved of byte sequences. The authors assumed that the level of with Python and how significant is the difference?” The uncertainty of a malicious file should be less than that of a answer to the first question is that the detection system in its benign file of similar format. Low entropy of a file is not a current form should not be implemented in a real environ- strong indicator of maliciousness, however it can be a useful ment because of its many faults, including the limitations of feature in combination with other features. Evaluation results SNORT (specifically related to control), the difficulty in of entropy calculations on malicious and benign PDF files knowing when a PDF file ends because the end-of-file marker indicate that average entropy for the malicious dataset (4.8) is can exist almost anywhere in a document, and issues related significantly lower than that of the benign dataset (7.7). The to buffering the traffic in a network, such as what happens second method, the n-gram based approach, takes substrings of a given large string where the n-gram can be words or bytes; 27 for example, applying 3-grams of words to the sentence “is Open source network intrusion prevention and detection system (IDS/IPS) developed by Sourcefire. http://www.snort.org/. this PDF malicious?” creates the following two terms: “is this 28 https://github.com/jasonish/snort/blob/master/tools/u2boat/ ” “ ” PDF and this PDF malicious? The authors generated README.u2boat. 29 http://www.circlemud.org/jelson/software/tcpflow/. 26 http://contagiodump.blogspot.co.il/. 

The answer to Borg's second question is that there was a significant difference in time usage between C and Python, in favor of the C programming language. The results show that it is quite hard to develop a system that can analyze PDF files in real-time and prevent them from reaching their targets solely by listening to the network traffic.

Due to reasons of insufficient applicability, the last two works described above will not be listed as solutions in the summary tables presented in the upcoming section.

4.2. Detection methods based on dynamic analysis

All of the following dynamic analysis methods focus on the analysis of embedded JavaScript code (which may or may not reside in a PDF file) during runtime. All these methods include a dynamic step either in the feature extraction or in the analysis phase, hence belonging to the dynamic analysis category. We divided this category into two sub-categories based on the way the methods extract the JavaScript to be analyzed. The first sub-category presents studies that statically extract the JavaScript code and includes three methods. Two of these methods, MDScan (Tzermias et al., 2011) and PDF Scrutinizer (Schmitt et al., 2012), start with a static extraction of the embedded JavaScript code from a PDF file and then execute the extracted code using a JavaScript engine. The execution is examined during runtime using heuristics in order to detect suspicious or malicious activity. The third method, ShellOS V1 (Snow et al., 2011), also appears in the second sub-category of dynamic extraction as ShellOS V2 (Snow et al., 2011). The authors did not indicate how they extract the JavaScript code for further analysis, and from this, we assume that it can be done either by static or dynamic extraction, which is then followed by the method's special approach of analysis based on a hosting operating system, which we will elaborate on later. MPScan (Lu et al., 2013) also belongs to this second sub-category of dynamic extraction, as it extracts the JavaScript code dynamically during runtime.

The presented dynamic analysis methods, which execute JavaScript code in order to detect malicious behavior, differ in the way that they extract data. Generally speaking, the more exhaustive the extraction of JavaScript code is, the better the dynamic analysis can be in terms of detection ability. Nevertheless, an attempt to extract JavaScript code from a file statically may fail and result in JavaScript code that does not accurately represent the behavior of the file. The reasons may vary; for example, the code can be well obfuscated or located in irregular locations within the PDF file. Dynamic extraction is more robust with regard to the weaknesses of static extraction.

It is important to clarify that the academic solutions in this category do not perform a dynamic analysis on the entire file; rather, dynamic analysis is only performed on the JavaScript code that was extracted from the PDF file. This is in contrast to some commercial solutions that execute the PDF file and examine its behavior and influence on the host operating system during runtime (full dynamic analysis). Full dynamic analysis consumes much more resources than the dynamic analysis done by academic solutions, but is probably better at detecting malicious PDF files because it provides a better indication of the file's real purpose. Moreover, the full dynamic analysis approach is robust against code obfuscation since it does not analyze string code or pretend to extract the code from the file. This approach is most similar to the examination of suspicious code by a security expert with forensic tools.

4.2.1. Static JavaScript extraction
Identical to the static methods presented in section 4.1, MDScan and PDF Scrutinizer rely on a PDF parser that should be capable of parsing the PDF file, locating the embedded JavaScript, and extracting it. In cases in which the parser isn't robust enough or is unable to extract the code from a file that incorporates embedded malicious JavaScript code, the file might be mistakenly classified as benign.

Tzermias et al. (Tzermias et al., 2011) introduced the design and implementation of MDScan, a standalone malicious document scanner which uses both static and dynamic analysis methods to detect malicious PDF files. MDScan statically analyzes the PDF structure and extracts all of the objects from the document body, including objects that contain JavaScript and objects that are deliberately left out of the cross-reference table for malicious reasons. Then it pulls out the embedded JavaScript code and examines it by actually running it on a SpiderMonkey JavaScript engine. Used string variables are dynamically analyzed during execution, and if some form of shellcode is revealed in the address space of the JavaScript interpreter, the document is classified as malicious. MDScan is not affected by code obfuscation, since it runs the code and does not simply look at it. MDScan does not rely on previously known vulnerabilities and thus is able to detect malicious PDF documents which exploit unknown vulnerabilities (zero-day) in PDF readers. The average processing time for a benign PDF file is 1.5 seconds, and approximately 80% of files are scanned in less than one second. The effectiveness of MDScan was tested using real world samples. The malicious dataset contained 197 malicious PDFs collected from public malware repositories and malicious websites, and it also contained nine self-written samples generated using the nine different PDF exploit modules of Metasploit. The benign dataset consisted of 2000 benign PDF files found in Google. Evaluation results show a TPR of 89% and an FPR of 0%.

As opposed to the previous solution, which relies solely on the detection of shellcode, the following solution employs multiple heuristics to enable the detection of a wider range of malicious operations. Furthermore, it enables the identification of known vulnerabilities and can often recognize the CVE-IDs of the vulnerabilities.

Schmitt et al. (Schmitt et al., 2012) introduced PDF Scrutinizer, a malicious PDF detection and analysis tool that also uses static and dynamic analysis methods to detect maliciousness. PDF Scrutinizer focuses on JavaScript-based attacks, but it is also suitable for non-JavaScript-based attacks. The paper presents malicious techniques used in PDF documents
256 computers & security 48 (2015) 246e266

such as obfuscation techniques, the exploitation of vulnera- increases code execution efficiency. ShellOS runs as a guest bilities, heap spraying attacks, and malicious embedded files. under a host operating system using Kernel Virtual Machine PDF Scrutinizer has three main modules: the first is a parser, (KVM). ShellOS and the host operating system share the which simulates the way Adobe Reader parses a document; memory address space region used by the host operating the second is an action extractor, which statically extracts system which provides ShellOS a stream of code to analyze. In JavaScript actions; and the third module consists of an actions the examination phase the code is simply executed on the executor, which executes the extracted JavaScript code in CPU until the instruction sequence encounters a general fault Mozilla Rhino33 JavaScript engine. PDF Scrutinizer is a JavaScript or times out because it reaches its execution time or exceeds library, which extends existing components. Apache PDFBox34 some adjustable maximum threshold. Shellcode is likely to Java library is used as interface with a loaded PDF document. generate a fault, because it contains invalid instructions or During execution, libemu35 library is used to analyze variable accesses an invalid memory location. When a fault occurs, values for the existence of shellcode. PDF Scrutinizer does not ShellOS efficiently resets the program state and starts an use a machine learning algorithm for classification. Both static execution from the next position in the code stream. After the and dynamic heuristics are applied to detect maliciousness. examination, ShellOS writes its results to the shared region. Static heuristics focus on JavaScript code string analysis to When shellcode is executed, ShellOS collects useful informa- find a signature of known suspicious, vulnerable, or malicious tion, such as function name and parameters logged. The function. Dynamic heuristics focus on the detection of mali- increased analysis performance enables the framework to cious code behavior, for example, analyzing whether the code process more of the network stream and execute longer code tries to add multiple identical data blocks into an array which sequences. The framework makes evasion difficult for at- is a strong indicator that a heap spraying attack has occurred. tackers and has the ability to use existing runtime heuristics Another example is the checking of used strings for an in a manner that does not require tracing every machine-level excessive number of characters which is an indicator of ma- instruction or performing unsafe optimization. ShellOS source licious JavaScript code (the threshold is 100,000 characters). code has been released for the use of others and will be During runtime, PDF Scrutinizer counts the occurrence of available as part of the NSF SDCI36 program. The dataset events that are likely to represent malicious operations, such prepared for evaluation contained a set of 374 unique mali- as the use of vulnerable methods or the use of suspicious cious PDF files provided by Google from 2007 to 2010 attacks variable names, and if known exploits are used, the CVE-ID is and 179 benign PDF files collected from USENIX conferences. often provided. The evaluation dataset contained 6054 benign Of the 374 malicious PDF files, 325 were detected and 70 files and 11,278 malicious PDF files collected from email attach- used the Return Oriented Programming37 (ROP) method. Di- ments and websites. 
Evaluation results show a detection rate agnostics show that 87% of ROP payloads follow the same of 90% with 0% false positives. The authors compared their pattern: downloading an executable file from a hardcode URL proposed PDF Scrutinizer to other existing malicious PDF address using the URLDownloadToFile() function and executing analysis tools such as Wepawet, MDScan (Tzermias et al., 2011), the file using the ShellExecuteA() function. Surprisingly, all the and PJScan (Laskov and Srndic, 2011). URLs contain the substring “spl¼pdf_sing.”. Diagnostics also The following study differs from the previous work pre- show that 85% of non-ROP payloads had exactly the same API sented in this section in several respects. First, ShellOS (Snow call sequence: downloading a binary file from a hardcode URL et al., 2011) is an operating system. Second, unlike previous address using the URLDownloadToCacheFile() function and runtime analysis techniques that use software-based CPU creating a process to execute that binary file using Crea- emulation, the proposed framework leverages hardware vir- teProcessA() function. The remaining 15% follow a similar tualization technology. Finally, it can't examine a PDF file as a pattern: downloading a binary file from a URL using URL- whole, and instead it relies on a host operating system that DownloadToCacheFile() function and executing the binary file extracts the JavaScript code from the PDF file (either static or with another combination of API calls. dynamic), and then the JavaScript code is examined by Shel- lOS. Therefore we included ShellOS (Snow et al., 2011) in both 4.2.2. Dynamic JavaScript extraction sub-categories: V1 (static feature extraction based version) Contrary to previous solutions, the following solutions and V2 (dynamic feature extraction based version). employ a dynamic approach to extracting the embedded Snow et al. (Snow et al., 2011) presented ShellOS, a frame- JavaScript code from the file, thus overcoming the possible work for the detection of code injection attacks, based on code weakness of the parser. analysis during runtime (dynamic analysis). ShellOS is a new Lu et al. (Lu et al., 2013) introduced MPScan, a technique lightweight operation system kernel (approximately 2500 that integrates static malware detection and dynamic Java- lines of code), designed for efficient execution of code streams. Script de-obfuscation. MPScan can deal with two types of ex- The proposed framework uses hardware virtualization in ploitations involving JavaScript: vulnerabilities deriving from order to execute instruction sequences directly on the CPU. bugs in the implementation of Adobe JavaScript API, and This provides faster and more accurate code analysis and vulnerabilities triggered by non-JavaScript features of PDFs.

33 https://developer.mozilla.org/en-US/docs/Rhino. 36 http://www.nsf.gov/pubs/2011/nsf11504/nsf11504.htm. 34 http://pdfbox.apache.org/. 37 Return Oriented Programming e a generalization of return- 35 Small library written in C offering basic x86 emulation and into-libc that allows an attacker to undertake arbitrary, Turing- shellcode detection using GetPC heuristics http://libemu. complete computation without injecting code. http://cseweb. carnivore.it/. ucsd.edu/~hovav/talks/blackhat08.html. computers & security 48 (2015) 246e266 257

MPScan is composed of two modules: an embedded code Static analytical methods have several advantages. First, extraction module and a multilevel malware detection mod- static analysis is virtually undetectable e the malicious code ule that includes a shellcode/heap spraying detection inside the PDF does not know that it is being analyzed, component and an opcode38 signature matching component because it is not opened by the PDF reader or by an emulator. that searches for malicious signatures in the JavaScript While it is possible to create static analysis “traps” to deter opcode. By hooking39 the Adobe Reader's native JavaScript analysis, these traps can actually be used for detecting mal- engine, embedded codes (JavaScript source code and opera- ware. Another advantageous feature is that static analysis is tional code) can be extracted during execution and then relatively quick and efficient and can therefore be performed evaluated by the static detection module. MPScan is robust and in acceptable timeframes. Consequently, it will not cause effective against any kind of obfuscation including the type bottlenecks in the workflow of the organization. This is that takes advantage of the ambiguity and complexities of PDF especially important when the PDF files are attached to emails specification. Previous methods such as MDScan (Tzermias sent to the organization as part of its standard, day-to-day et al., 2011) and PDFphoneyC40 statically parse the PDF file operation, and there is a need to ensure that they arrive in a and extract JavaScript code and then examine the code timely manner and prevent a delay in their arrival. Static dynamically by running it in the emulated environment of the analysis is also easy to implement, monitor, and measure; SpiderMonkey JavaScript engine. The difference between therefore, most of the presented studies perform static anal- MPScan and these tools is that the other tools execute the ysis. Moreover, static analysis scrutinizes the application's extracted code in an emulated environment that lacks some “genes” instead of its current behavior which can be manip- proprietary features in the native Adobe environment. ulated or postponed to an unexpected time. The static anal- MPScan, on the other hand, hooks the native JavaScript engine ysis approaches can be divided roughly into two groups: the of Adobe Reader, and thus, it is more representative. For the first group analyzes the JavaScript code embedded inside the evaluation phase, the authors collected 198 malicious PDF PDF in a variety of representations. The second group relies samples from the Internet and nine malicious PDF samples upon meta-feature based approaches and focuses on the from the Metasploit framework. 500 benign PDF files were ob- content and structure of the PDF file. tained by crawling the Alexa41 top 50 websites. Evaluations Looking at the disadvantages, static analysis can be evaded show a detection accuracy of 98%. The processing time for all using code obfuscation. Whenever machine learning methods 207 files when the Adobe JavaScript engine was hooked was based on static analysis are used (especially n-grams) for 3.9 s, whereas without hooking it the processing time was detecting unknown malicious code applications, there is a 0.5 s. Their main contribution was designing a novel approach question about the capability of the suggested framework for to de-obfuscate embedded JavaScript code by hooking the detecting obfuscated code inside PDF files. 
Many of the Java- Acrobat Reader's native JavaScript execution engine. This Script code inside the PDF files are obfuscated to some degree. approach is more realistic and robust against unknown Static analysis also cannot detect malicious code that is obfuscation techniques. In addition, they presented a multi- dynamically loaded from a remote server. Yet, static analysis level malware detection scheme for the detection of shell- can detect pieces of code that are used for establishing access code and heap spraying. It should be noted that the analysis of to remote sources outside of the PDF which can raise some the extracted JavaScript code is carried out statically. suspicions about the PDF file. On the other hand, such iden- As mentioned, ShellOS (Snow et al., 2011), the method tification of remote access might create false positives in the previously discussed in the static extraction sub-category, can detection process. also be sustained by dynamic extraction of JavaScript code, We have also presented studies employing a dynamic and therefore it appears both in static and dynamic feature analysis approach for detecting malicious PDF files. In most of extraction. these studies, this approach dynamically runs the JavaScript Table 2 summarizes academic solutions for the detection code embedded in a PDF file by performing pre-static analysis of malicious PDFs, including dataset information and perfor- of the PDF file in order to extract JavaScript code which will be mance measures. analyzed dynamically. Therefore, the extractor must also be very robust and capable of handling complicated cases such 4.3. Advanced methods and coping with exiting attacks as corrupted files and embedded files inside PDF e as was presented with regard to the reverse mimicry attack (Maiorca Each of the aforementioned analytical approaches (static and et al., 2013). The authors in (Maiorca et al., 2013) proposed a dynamic) has its pros and cons. Consequently, a hybrid new framework to deal with the evasion attacks presented in detection framework meshing static and dynamic detection the paper. The proposed framework extracts any embedded techniques could reduce the likelihood of evasion of the PDF file from a suspicious PDF file recursively and then applies detection mechanism by a malicious PDF. three analyses on it: embedded JavaScript code analysis, PDF structural analysis, and analysis of the embedded EXE or SWF 38 Opcode e An intermediate instruction set generated by the file. Dynamic analysis, however, is complex as well as costly. JavaScript engine for efficient execution which better reflects the In addition, dynamic analysis can also be detected and avoi- ' actual behavior of malware since it s lower than the source code. ded by the executed malicious PDF file which can postpone its 39 Hooking e technique for intercepting functions calls, mes- malicious behavior until the emulation ends. sages, or events passed between software components in order to alter an application or operating system behavior. Table 3 summarize five main attacks and techniques which 40 https://code.google.com/p/phoneyc/. can use malicious PDF files and attempt to determine which of 41 http://www.alexa.com/. the surveyed solutions is capable of detecting each of the 258

Table 2 e Summary of academic solutions e main details and performance (ms represents milliseconds). System name PJScan ShellOS MDScan MPScan PDFMS PDF PDFRate Entropy Structural JavaScript Reverse mimicry Scrutinizer and n-gram paths clustering attack solution analysis #Article (Laskov and (Snow et al., (Tzermias (Lu et al., 2013)(Maiorca (Schmitt (Smutz and (Pareek (Srndic and (Vatamanu (Maiorca et al., 2013) Srndic, 2011) 2011) et al., 2011) et al., 2012) et al., 2012) Stavrou, 2012) et al., 2013) Laskov, 2013) et al., 2012) Year 2011 2011 2011 2013 2012 2012 2012 2013 2013 2012 2013 optr euiy4 21)246 (2015) 48 security & computers Analysis Static Dynamic Static & Static & Static Static & Static Static Static Static Static & Dynamic Dynamic Dynamic Dynamic #Malicious 30,157 405 197 207 11,157 11,278 5000 65,536 82,142 997,615 3 Samples Malicious VirusTotal Some web Public malware Internet & Contagio & Emails, Contagio Computing.net, VirusTotal Honeypots, Self-made reverse Samples malware repositories & Metasploit Yahoo! attachments malware.lu and spam mimicry attacks Source detection self-written & websites contagiodump. messages systems blogspot.com #Benign 906 179 2000 500 9989 6054 100,000 46,933 576,621 1,333,420 e Samples Benign VirusTotal USENIXa Google Alexa Contagio & Emails, 6 days Research papers, Google Popular e Samples Yahoo! attachments university reports and websites Source & websites traffic financial documents Processing 23 ms 7400e25, 1500e3000 ms eeeee 28 ms ee Time 460 ms e

(per file) 266 TPR 0.7194 0.8024 0.8934 0.98 0.9955 0.9 0.99þ 0.9922 0.9988 e 1.0 (Detection Rate) FPR 0.0011 e 0 e 0.0251 0 0.00244 0.006 0.001 ee

Note that it is incorrect to compare the solutions based on their TPR and FPR, since the dataset used for training the model and the dataset used for testing the model in order to evaluate the solution are different by size and content. a https://www.usenix.org/. Table 3 e Summary of academic solutions and their ability to detect five main attacks. System PJScan ShellOS MDScan MPScan PDFMS PDF PDFRate Entropy Structural JavaScript Reverse mimicry name Scrutinizer and paths clustering attack solution n-gram analysis #Article (Laskov and (Snow (Tzermias (Lu et al., (Maiorca (Schmitt (Smutz and (Pareek (Srndic and (Vatamanu (Maiorca et al., 2013) Srndic, 2011) et al., 2011) et al., 2011) 2013) et al., 2012) et al., 2012) Stavrou, 2012) et al., 2013) Laskov, 2013) et al., 2012) Year 2011 2011 2011 2013 2012 2012 2012 2013 2013 2012 2013 Detection Static Dynamic Static & Static & Static Static & Static Static Static Static Static & Dynamic method Dynamic Dynamic Dynamic Main idea Lexical Running Running Running Frequency Detect Meta features JavaScript Structural Tokenization of Ensemble of 4 systems: and features analysis of JavaScript JavaScript JavaScript of all JavaScript JavaScript code code paths the embedded PJScan (Laskov and the JavaScript code on found in code by keywords code found in Java Script Srndic, 2011) code hardware objects hooking the inside the Detect atypical PDFMS (Maiorca virtualization Acrobat PDF file shellcode locations et al., 2012)

reader JS Find exploit Handing PDFRate (Smutz 246 (2015) 48 security & computers engine Embedded corrupted and Stavrou, 2012) files documents Wepawet tool Attacks Malicious V (Only in V V (including V V V (including V (including VV V V JavaScript typical places) atypical atypical atypical code locations) locations) locations) Code e VVVe V Probably Probably V Probably obfuscation Reverse V (Only eeProbably e V eeeeV mimicry JavaScript attack attack) (Maiorca et al., 2013) (embedded files: EXE, PDF,

unrelated e JavaScript) 266 Loading e V e Probably e Probably eeeee malicious code from remote server (Hamon, 2013) Malicious eeeeeee eee e URI resolving (Hamon, 2013)

“V” represents the solution capable of detecting this attack, “Probably” represents our estimation that it can detect the attack yet needs to be checked, and “e” represents the inability of the solution to cope with the attack. 259 260 computers & security 48 (2015) 246e266

attacks. While most of the solutions are likely to detect ma- Table 4 e Our collected dataset categorized as malicious, licious JavaScript code, only the dynamic analysis based so- benign and incompatible PDF files. lutions and static analysis meta-features based solutions will Dataset source Year Malicious files Benign detect an obfuscated malicious JavaScript code. Detecting the files reverse mimicry attack relies upon a parser that can identify e and extract embedded files recursively as was suggested by VirusTotal repository 2012-2014 17,596 (1017) Srndic and Laskov 2012 27,757 (437) e (Maiorca et al., 2013). We assumed that since MPScan actually (Srndic and Laskov, performs hooking on the Adobe Reader JavaScript engine, it 2013) would probably detect the embedded malicious JavaScript Contagio project 2010 410 (175) e code. Note that PJScan was the only method, out of those Internet and Ben-Gurion 2013-2014 0 5145 evaluated by the authors in (Maiorca et al., 2013), that detected University (random the embedded malicious JavaScript code. selection) Total 45,763 (1629) 5145

section is located in the file. In cases of incompatibility, the 5. Dataset collection and preliminary number that appears is incorrect. Table 4 includes the number analysis of compatible files (bracketed) in each of our collected data- sets. Note that while incompatible benign files were not pre- This survey was conducted in order to acquire maximal un- sent in our dataset, this does not mean that there weren't any derstanding of existing solutions for the detection of mali- incompatible benign files. It might, however, suggest the very cious PDFs and their strengths and limitations, as well as to low probability of incompatibility among benign files and provide conclusions and concrete ideas for future work in provides support of our observation mentioned above. order to enhance the detection of malicious PDF files. In the conclusion section, we will describe several ideas, techniques, frameworks, and methods that we plan to implement in an 6. Our suggested active learning based attempt to enhance the detection of malicious PDF files. As a framework basis for empirically evaluating those ideas, we collected and created a dataset of malicious and benign PDF files for the In this survey we presented many studies that were based on Microsoft Windows operating system e the most commonly machine learning approaches and were successfully used to attacked system used by organizations. induce malicious PDF detection models. However, all of them We acquired a total of 50,908 PDF files, including 45,763 focus on passive learning. With passive learning, the induced malicious and 5145 benign files, from four sources as pre- detection model, as accurate as it is (based on a representative sented in Table 4 below. The benign files were reported to be set of features), quickly becomes obsolete since it is incapable virus-free by Kaspersky antivirus software. The malicious PDF of adaptive learning and integrating new malicious PDF files. files contain several types of malware families such as viruses, The detection model must be sustained and updated with Trojans, and backdoors. We also included obfuscated PDF newly labeled, informative PDF files (both malicious and files. benign). The labeling operation usually relies upon a human Analysis of our large dataset of 50,908 files by the parser expert who analyzes a file manually and labels it as malicious (PdfFileAnalyzer42) shows that most of the malicious files or benign; thus, only a small number of new informative PDF (96.5%) are not compatible with the PDF file format specifica- files can be labeled on a daily basis. We suggest using active tions according to the Adobe PDF Reference.43 When the user learning methods to prioritize the new PDF files so that only tries to open an incompatible file (malicious or benign), the the most informative files, those that are maximally expected PDF reader is not able to open it and provides an error mes- to improve the capabilities of the detection model, will be sent sage. If it is a malicious PDF file, the malicious operation is to the expert for manual analysis. In cases in which the PDF executed; if it is a benign file, nothing takes place. However, in files are labeled as malicious by the human expert, they will be both cases the file remains unopened and cannot be viewed by used to update the antivirus tool as well, which is currently an innocent user. Thus, it is clear that there is no reason to the most common solution for organizations. 
deliver an incompatible file to the user, and this observation In Fig. 3 we illustrate our suggested framework and the was taken into account in our suggested framework which process of detecting and acquiring new malicious PDF files by identifies such files and marks them as suspicious, blocking maintaining the updatability of the detection model and tool them from the outset and sending them for deeper inspection as well. In order to maximize the suggested framework's e before they are ever opened by the PDF reader. contribution, it should be deployed in strategic nodes (such as The incompatibility observed was located at the end of the ISPs and gateways of large organizations) over the network, in file (as seen in Fig. 1), in the line between “startxref” and “%% an attempt to expand its exposure to as many new files as EOF” lines. This line should contain a number serving as a possible. Widespread deployment will result in a scenario in reference (offset) to where the last cross reference table which almost every new file goes through the framework. If the file is informative enough or is identified as likely to be 42 http://www.codeproject.com/Articles/450254/PDF-File- Analyzer-With-Csharp-Parsing-Classes-Vers. malicious, it will be acquired for manual analysis. As this 43 http://www.adobe.com/content/dam/Adobe/en/devnet/ framework is proposed for future work, we aim at developing acrobat/pdfs/pdf_reference_1-7.pdf. it with orientation of becoming a multilayered online analysis computers & security 48 (2015) 246e266 261

Fig. 3 e The process of maintaining the updatability of the antivirus tool using AL methods. framework, starting from the fastest and most general and Specifically, JavaScript code attacks, embedded file attacks, then moving to slower yet deeper analysis. and form submission and URI attacks, are the most common As Fig. 3 depicts, the PDF files transported over the Internet attacks launched via PDF files and three of them are present in are collected and scrutinized within our framework {1}. Then, our data set. the “known files module” filters all the known benign and As being a large and representative dataset based upon malicious PDF files {2} (according to white lists, reputation trusted sources, our conclusion of high incompatibility among systems (Jnanamurthy et al., 2013) and antivirus signature malicious files is empirically well based. The PDF files which repository). The unknown PDF files are then checked for their are compatible and unknown are then introduced (as vectors compatibility as viable PDF files. The incompatible PDF files of features) to the detection model which is a classifier are immediately blocked from being transported into the induced by Machine Learning algorithms. The Active Learning organizational network. Since only compatible files are rele- methods are aimed at efficiently updating the detection model vant for organizations and innocent users, just these files are and antivirus tool in light of the creation of new PDF files. The transformed into vector form for the advanced check {3}. files are represented as vectors of features that are either Our framework provides detection solutions for both in- extracted statically or dynamically, as we will recommend in stances, whether the malicious file is compatible or not, and it the conclusion section. does so more efficiently than any other solution that exists today. We consider employing several algorithms in order to The framework uses the insight that most of the malicious files induce detection models, one of them is the SVM classification are incompatible as a first layer of filtering, and not as a detection algorithm with the radial basis function (RBF) kernel in a su- rule. As noted, there is no reason to open an incompatible file e pervised learning approach. We will use the SVM algorithm be it benign or malicious. Therefore, this understanding provides for the following reasons: first, in many of the surveyed works a significant reduction (~96.5%) of the analysis efforts of sus- presented in this paper (Srndic and Laskov, 2013; Laskov and pected malicious files. This effort reduction is done by simply Srndic, 2011; Maiorca et al., 2012; Borg, 2013; Schreck et al., filtering any PDF file, malicious or benign, that is not compatible. 2013), SVM was proven to be a very accurate classifier for Our dataset includes PDF files that were collected from detecting malicious PDF files, especially when it was based on several reliable sources (e.g., Virus-Total, Contagio, researches many features extracted by static analysis. Many of these published in recent papers (Srndic and Laskov, 2013)) and also works chose to use the SVM as their classification algorithm contains more than 45,000 malicious PDF files. This makes it a based on comparing it with other classification algorithms, large, diverse and representative dataset of malicious PDF SVM outperformed the others and achieved up to 0.998 TPR. material for empirical experiments. 
Second, the trained SVM classifier based on the RBF kernel Our dataset is not claimed to be a complete collection of actually projects the dataset to be higher in dimensional every possible type of PDF attack. However, having many space, aimed at calculating a surface that creates a linear malicious PDF files that are available and labeled by trusted separation with maximal margin between the classes. This sources, serves the purpose of being representative regarding projection into higher dimensional space actually makes the existing attacks that were previously mentioned in section 3. induced model complex and thus more difficult for an 262 computers & security 48 (2015) 246e266

attacker to understand. Additionally, it is also more difficult to The second type of informative file includes those that lie find specific features or patterns that can be used for evading deep inside the malicious side of the SVM margin and are a the induced detection model as was noted by Wang et al. maximal distance from the separating hyperplane according (2007). Third, SVM is known for its ability to handle large to Equation (1). These PDF files will be acquired by the numbers of features (Joachims, 1999) which makes it suitable “Exploitation” active learning method and are different from for handling the large number of features that we aim to the labeled PDF files existing in the training set. These infor- extract from the PDF files. Specifically, when considering mative files are then added to the training set {6} for updating handling large number of features for the task of malicious and retraining the detection model {7}. The files that were code detection, SVM showed high performance in the detec- labeled as malicious are also added to the antivirus signature tion of PC malwares when it was based on many n-grams repository in order to enrich and maintain its updatability {8}. features extracted from executables (Windows PE) (Nissim Updating the signature repository also requires an update of et al., 2014). It was successfully used to detect worms based clients utilizing the antivirus application. The framework in- on a large number of behavioral features (Nissim et al., 2012; tegrates two main phases: training and detection/updating. Wang et al., 2007), and showed efficiency and high capabil- Training: A detection model is trained over an initial ities when using large number of features in the task of clas- training set that includes both malicious and benign PDF files. sifying malware into species, and detect zero-day attacks Detection and updating: For every unknown PDF file that is (Chen et al., 2012). Finally, SVM has proven to be very efficient both transported over the Internet traffic and through the when combined with AL methods for enhancing the detection framework, the framework's detection model provides a of malicious code (Nissim et al., 2012; Nissim et al., 2014). classification, and its active learning method provides a rank These reasons are a convincing justification to consider using representing how informative the file is. The framework will the SVM as one of classification algorithms in our experi- then consider acquiring the files based on this assessment. ments. In our implementation we will use Lib-SVM imple- After being selected and receiving their true labels from the mentation (Chang and Lin, 2011) which is fast and also expert, the informative PDF files are acquired by the training supports multiclass classification. set. The signature repository is also updated, just in case the The detection model scrutinizes PDF files and provides two files are malicious. The detection model is retrained over the values for each file: 1) a classification decision using the SVM augmented training set which now also includes the acquired classification algorithm and 2) Distance calculation from the examples regarded as being informative. At the next cycle, the SVM's separating hyperplane using Equation (1) {4}. 
A file that updated model receives a new stream of unknown files on the AL method recognizes as informative and which it has which the updated model is once again tested, and from indicated should be acquired, is sent to an expert who which the updated model again acquires informative files. manually analyzes and labels it {5}. By acquiring these infor- Note that the motive is to acquire as many malicious PDF files mative PDF files we aim to frequently update the antivirus as possible since such information will maximally update the software by focusing the expert's efforts on labeling PDF files antivirus tool that protects most organizations. that are most likely to be malware, or on benign PDF files that The purpose of this framework is to provide a better solu- are expected to improve the detection model. Note that tion than random selection or passive learning employed informative files are defined as such in that when they are nowadays. We are not aiming at providing a holistic solution, added to the training set, they improve the detection model's however, we try to present a better solution for the existing predictive capabilities and enrich the antivirus signature re- problem. Our framework will reduce labeling efforts by pository. Accordingly, in our context, there are two types of selecting only the most informative examples for inspection. files that may be considered informative. The first type in- The number of files inside the borderline is not relevant e cludes PDF files in which the classifier has limited confidence what is relevant is the inspection resources that we or the AV as to their classification (the probability that they are mali- company can allocate. Therefore, we will prioritize the files for cious is very close to the probability that they may be benign). inspection, and the larger the resources are, the more files will Acquiring them as labeled examples will probably improve be inspected. Yet the advantage here is that the most the model's detection capabilities. In practical terms, these contributive files will be treated in early stages, thus updating PDF files will have new features or special combinations of the AV and detection model more frequently, ultimately existing features that should fairly represent their operations reducing the number of labeled files. and ambience. Therefore, these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin (Tong and Koller, 2000e2001) strategy, an exist- 7. Discussion and conclusions ing AL method that selects informative files, both malicious and benign, that are a short distance from the separating In this paper, we aimed to review the methods, techniques, hyperplane. and tools used for the detection of malicious PDF files. These PDF's are usually attached to emails that are sent to organi- Equation (1): the distance calculation between classified zations in order to perform the initial penetration of an APT example and the SVM's hyperplane attack, therefore their detection is a significant concern which requires attention. ! Xn According to our preliminary analysis in the dataset sec- a Dist X ¼ iyiK xix (1) tion, we found that most of the incompatible PDF files are 1 actually malicious, and therefore we recommend that before computers & security 48 (2015) 246e266 263

any analysis is performed, an incompatibility check be per- initial stage and is also not detected by antivirus tools merits formed. An integral component of our suggested framework is further dynamic analysis. a module that checks each PDF file's compatibility with the For the static analysis phase, the key to precise and sen- standards of viable PDF files. The incompatible PDF files must sitive detection is preliminary knowledge of the primary immediately be blocked from being transported to the orga- attack and evasion techniques that could be used by a PDF file, nizations. The reason for this is that only compatible files are as is described in section 3. The first mission is therefore relevant for organizations and innocent users, mainly because finding and extracting the indicators that assist and support incompatible files cannot be opened and used by the user. If determination of these attacks. A prerequisite for a compre- an incompatible file is actually malicious, it can still perform hensive analysis of a given PDF file is the development of a its malicious actions, and should thus be blocked. Therefore, sophisticated and robust parser that is able to extract all the this check is extremely significant since, as shown in Table 4, relevant information from the analyzed file (including cor- most of the malicious files are incompatible (96.5%). There- rupted PDF files, embedded EXE, PDF, and SWF files). fore, the incompatibility of a file is a strong indicator of its According to Vatamanu et al. (2012), which had the largest maliciousness. Such an approach can reduce the efforts and PDF file corpus, about 93% of the one million malicious PDF time invested in detecting the new malicious PDF files. files (out of a corpus of 2.2 million) contained JavaScript, However, one should note that we don't claim that every whereas only 5% of the benign PDF files contained JavaScript malicious PDF file is incompatible. And therefore, after the code. Therefore, as a mitigation strategy for the malicious incompatibility check within our framework, we aim at JavaScript code, all the JavaScript code should be extracted providing a comprehensive static and dynamic analysis based using a robust parser (including an unrelated object of Java- on advanced Machine Learning algorithms and detection Script code as was presented by Maiorca et al. (2013)). The models. Our framework does not rely upon the fact that most JavaScript code should be analyzed using two different direct of the malicious files are incompatible, therefore in the case representations that provided high TPR, the lexical analysis of that an attacker crafts a malicious PDF as an incompatible file, JavaScript code (Laskov and Srndic, 2011), and tokenization of it will be filtered out and will not be transported to the orga- the embedded JavaScript (Vatamanu et al., 2012). Direct rep- nizational network. And in the case that the attacker does resentation means analyzing the code itself, while indirect craft a compatible malicious file, it will be carefully analyzed representation means analyzing meta-features related to all and will most likely be detected if it has patterns e similar to the content of the file. The JavaScript will also be dynamically previous attacks. Additionally, if it contains new informative analyzed during the dynamic analysis phase. We also suggest patterns it will probably be acquired by our Active Learning conducting indirect static analysis that analyzes general methods for deeper inspection by a security expert. 
descriptive content in the PDF file rather than direct analysis In this survey paper we do not provide an elaborate seg- of the JavaScript code. This can be achieved by an approach mentation on our dataset and the attacks which occurred that utilizes meta-features of the content and structure of the within it, for two reasons: first because the dataset is in the PDF file such as structural paths (Srndic and Laskov, 2013), building process and has not yet reached the point at which it summarized meta-features (Smutz and Stavrou, 2012) and is considered as the final dataset. Secondly, it is not the focus frequency of keywords (Maiorca et al., 2012) that provided of this survey paper, as the full description of the dataset satisfactory results as well. The advantage of using meta- creation and segmentation pertaining to the attacks and virus features such as structural paths (Srndic and Laskov, 2013)is families, will be presented in future work where we will that they are not affected by code obfuscation. It was shown to conduct a comprehensive series of experiments in order to be a very effective way to discriminate malicious PDFs from evaluate the efficiency of our Active Learning frame- benign PDFs even for malicious files created two months after workdboth in the detection and the updatability aspects. the classification model was created. Based on our survey, we propose that the detection model As a solution to embedded malicious files (reverse mimicry include a hybrid detection approach that conducts both static attack (Maiorca et al., 2013)) the parser should also indicate and dynamic analysis as was suggested by Tzermias et al. whenever this scenario (a file embedded inside the PDF) exists (2011), Maiorca et al. (2013), Schmitt et al. (2012), and Lu in the suspicious PDF file. Generally speaking, there are few et al. (2013). This way the chance of an attack evading the benign reasons to embed a file inside a PDF file. In addition, the detection mechanism is significantly reduced, because most parser should recursively extract every embedded file inside attacks can be determined by dynamic analysis. Still, several the PDF and analyze it using the aforementioned static anal- techniques may evade detection, including techniques that ysis methods we suggested. One of the reverse mimicry attacks perform the malicious actions of the PDF file only when spe- (Maiorca et al., 2013) that embeds malicious EXE files inside the cific conditions are met (e.g., time, date, IP, specific user PDF and auto-executes it when the PDF file is opened is based intervention, etc.). In these instances, the dynamic analysis on a well-known legitimate feature which has been blocked in will be useless since it won't encounter and detect the mali- Adobe Reader X (version 10). Many organizations, however, cious behavior through its analysis. Static analysis scrutinizes don't update their installed software and thus, are exposed to the file's genes, content, and structure which are usually EXE running (such as in Adobe Reader MS Office, etc.). constant; consequently static analysis will not be affected by Regardless, once another feature or vulnerability is found that these techniques and will therefore be more effective than enables the operation of running EXE files embedded in PDF dynamic analysis. 
Because of the advantages of static anal- files, it can be detected with a variety of advanced techniques ysis, we suggest an initial static analysis stage for unknown (Jacob et al., 2008; Gryaznov, 1999; Schultz et al., 2001; Abou- PDF files. A file that is not identified as malicious after this Assaleh et al., 2004; Kolter and Maloof, 2004, 2006; Mitchell, 264 computers & security 48 (2015) 246e266

1997; Henchiri and Japkowicz, 2006; Moskovitch et al., 2008a, Nevertheless, this detection approach provides a compre- 2008b, 2009; Menahem et al., 2009; Jang et al., 2011; Tahan et al., hensive indication of the file's purposes and is robust against 2012; Nataraj et al., 2011a, 2011b; Royal et al., 2006; Willems many evasion techniques, such as code obfuscation and URI et al., 2007; Rieck et al., 2008; Sharif et al., 2009; Song et al., 2008; resolving. Therefore we suggest the integration of a full dy- Rieck et al., 2011; Perdisci et al., 2008; Moser et al., 2007; Lanzi namic analysis module that might detect malicious behavior et al., 2009; Kolbitsch et al., 2009; Jacob et al., 2009; Bayer et al., or determine the intention of PDF files in cases in which the 2009; Nachenberg, 1997; Zhao et al., 2012; Newsome and Song, static or dynamic analyses (based on analysis of specific 2005; Newsome et al., 2005) aimed at the detection of malicious components of the PDF file) are unable to provide the executables using static and dynamic analysis. comprehensive inspection provided by full dynamic analysis. All the extracted features mentioned in this article can be We also suggest running each suspicious PDF file through leveraged by an ensemble of classifiers such that each clas- several versions of Adobe Reader (or any PDF reader) in order sifier will be induced from different sets of features. It was to compare its behavior. Some malicious PDF files will behave shown by Menahem et al. (Menahem et al., 2009) that using an differently depending upon the version of Adobe Reader used, ensemble of classifiers using different features can signifi- because vulnerabilities are treated differently from one cantly improve detection capabilities. version to another. The differing behavior might provide an The attacks that were presented by Hamon et al. (Hamon, indication of a file's maliciousness. 2013) conducted a dynamic load of malicious code from a Moreover, one should remember that many organizations remote source as well as URI resolving (executing external currently rely on outdated versions of PDF readers due to malicious file). These attacks usually rely on clicking a link, financial constraints and lack of proper administrative con- however it is possible to open the link when the PDF file is trols. The fact that many organizations don't update their opened, and therefore the PDF file becomes the link. Conse- installed software (including their PDF readers) exposes their quently, as was indicated by Hamon et al. (Hamon, 2013), the/ computers and users to many known exploits and bugs OpenAction command is considered dangerous and can be associated with PDF readers such as JBIG2Decode algorithm detected by simple static analysis. Restricted use of this and util.printf Java function, as was discussed by Stevens command will help prevent this kind of attack. (Stevens, 2011). New readers take these exploits into account, In the dynamic analysis phase, it is better to rely on however the exploits and bugs remain relevant in older ver- hooking the Adobe Reader or using hardware virtualization in sions of software; as a rule of thumb, we therefore recom- order to execute the JavaScript code embedded in the PDF file mend that organizations strive equip themselves with the rather than running it in an emulator, as presented by Lu et al. latest version of PDF Readers as a standard security policy. (2013) and Snow et al. (2011). 
The malicious JavaScript code In future work we plan to implement our suggested active inside the PDF, however, can identify that it is being executed learning framework which addresses an important issue that in an emulated environment, and therefore it might refrain none of the articles surveyed considered e the detection from performing its malicious behavior. This will, however, model's lack of updatability. It is not enough to construct and probably provide a solution for malicious obfuscated Java- calibrate a preliminary detection model based on sophisticated Script code that wasn't detected by the static analysis. feature extraction techniques; rather the model should be As far as we could identify, no product or academic solu- constantly updated in light of the daily creation of new mali- tion actually analyzes the URLs inside the links in a PDF files. A cious PDF files. While machine learning has been successfully link, after being pressed, can refer the user to a malicious used to induce malicious PDF detection models, all methods website that, when loaded, initiates an attack on the user's utilizing this approach focus on passive learning. Alternatively, computer. An attacker can place a malicious link inside a we suggest focusing on active learning (Settles, 2010) and the benign file and persuade the user to click it. Dynamic analysis use of active learning methods that have been especially methods will not be able to detect this kind of attack since a designed to enrich the detection model with new malicious PDF user intervention is needed to press the link. However, static files in the course of several days e thus ensuring that the analysis methods can easily extract and analyze the links that detection model is up-to-date. This notion was successfully may be malicious. Thus, we recommend the addition of a used to enhance the detection of executable malwares in the module that checks the links inside the PDF file for mali- Microsoft Windows OS (Nissim et al., 2014) and is expected to ciousness to the detection model, as this module can integrate enhance the detection of malicious PDF files as well. many of the academic solutions designed for analyzing links As an additional indication of future work in the interest of (URLs) or websites for maliciousness (Xuewen et al., 2013; enhancing the detection of malicious PDF files attached to Eshete, 2013; Zhou et al., 2013; Su et al., 2013; Chitra et al., emails, we suggest augmenting the detection models of PDF 2012; Ranganayakulu and Chellappan, 2013; Ma et al., 2011). files with analysis of features and meta-features extracted As was noted in section 4.2, full dynamic analysis of PDF from the hosting email component e such as email header files is a costly approach. For instance, Checkpoint Threat and content. Methods for the detection of malicious email Emulation44 and SourceFire FireAMP45 execute the entire PDF file (Miyamoto et al., 2009; Amin et al., 2012; Shih et al., 2005) can in an isolated environment (sandbox) and examine the effect provide supportive indications for the suspicious PDF file. of the behavior and actions on the system during runtime. The final indication of future work we presently suggest pertains to the fact that PDFs are one of the most common 44 https://threatemulation.checkpoint.com/teb/. type of files that act as malicious attachments, however one 45 http://www.sourcefire.com/security-technologies/advanced- cannot ignore the phenomenon of malicious Microsoft Office malware-protection/fireamp. 
files attached to email (Schreck et al., 2013; Dechaux et al., computers & security 48 (2015) 246e266 265

2010). Therefore, we suggest combining email features (mentioned previously) with features extracted from attached Microsoft Office files, thus enhancing the detection of malicious Office files, as was explained in reference to PDF files.

Acknowledgments
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. We would like to thank Mattan Edry, who assisted in creating the dataset, and Roy Nissim for meaningful discussions and comments on the efficient implementation aspects.

Nir Nissim is a researcher and project manager at Telekom Innovation Laboratories at Ben-Gurion University. Nir is also a Ph.D. student in the Department of Information Systems Engineering at Ben-Gurion University and has published several papers dealing with active learning approaches for the acquisition and detection of malware on both PC and Android platforms. Nir is recognized as an expert in information systems security and has led several large-scale projects and research efforts in this field. His main areas of interest are mobile and computer security, machine learning, and data mining. Nir holds a B.Sc. (2007) and an M.Sc. (2010) in Information Systems Engineering, both from Ben-Gurion University.

Aviad Cohen is a graduate of the Department of Information Systems Engineering at Ben-Gurion University (BGU) and is a Master's student and researcher at Telekom Innovation Laboratories at Ben-Gurion University. His Master's work focuses on detecting malicious email by combining both attachment and email header analysis.

Prof. Chanan Glezer is an associate professor at the Department of Industrial Engineering and Management and at the MBA program of Ariel University of Samaria, Israel. His main research interests are applied and include cyber security, electronic commerce, and organizational computing. He also has broad industry experience, both as an IT professional and as a leader of a funded R&D project.

Prof. Yuval Elovici is the director of the Telekom Innovation Laboratories at Ben-Gurion University of the Negev (BGU), head of the Cyber Security Labs at BGU, and a professor in the Department of Information Systems Engineering at BGU. He holds B.Sc. and M.Sc. degrees in Computer and Electrical Engineering from BGU and a Ph.D. in Information Systems from Tel-Aviv University. Prof. Elovici has published 56 articles in leading peer-reviewed journals. In addition, he has co-authored a book on social network security and a book on information leakage detection and prevention. His primary research interests are cyber security and machine learning.

- 35 -

3. Summary and Conclusions

In this section we briefly summarize the main research results presented in the papers that constitute this dissertation and discuss their impact and contributions to research in the malware detection domain – specifically the areas of improving updatability and enhancing detection capabilities of detection models and antivirus software.

The proposed framework was thoroughly tested on very different types of malware: computer worms, executables, malicious documents (PDF and docx MS Office files), and malicious Android applications. For each of these domains, sophisticated and specifically tailored feature extraction and dataset creation methodologies were proposed and implemented. We used the SVM classifier as the base classifier, and the experiments were carried out using various SVM kernels. A solid and comprehensive evaluation methodology was used in order to test the framework, both in terms of classification performance (accuracy, TPR, FPR, and AUC) and the number and percentage of malware acquired daily (NOMA / POMA), which are important measures given that the purpose of the framework is to update the antivirus signature repository with new malware on a frequent basis.
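To make these measures concrete, the following is a minimal, self-contained sketch (not the dissertation's evaluation code) of how the classification measures (accuracy, TPR, FPR, AUC) and the daily acquisition measures (NOMA/POMA) can be computed for one simulated day of unknown files; the synthetic features and labels, the RBF kernel, and the daily budget of 20 files are placeholders.

```python
# Illustrative sketch only (not the dissertation's evaluation code): computing
# accuracy, TPR, FPR, and AUC for one simulated "day" of unknown files, plus
# NOMA (number of malware acquired) and POMA (percentage of that day's malware
# acquired) for the files selected for expert labeling.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_day, y_day = rng.normal(size=(100, 10)), rng.integers(0, 2, 100)

clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
scores = clf.predict_proba(X_day)[:, 1]          # estimated probability of "malicious"
pred = (scores >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_day, pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / len(y_day)
tpr = tp / (tp + fn)                             # true positive rate (detection rate)
fpr = fp / (fp + tn)                             # false positive rate
auc = roc_auc_score(y_day, scores)

acquired = np.argsort(scores)[-20:]              # daily budget: 20 most suspicious files
noma = int(y_day[acquired].sum())                # number of malware acquired today
poma = noma / max(1, int(y_day.sum()))           # share of today's malware acquired
print(accuracy, tpr, fpr, auc, noma, poma)
```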

We also extended the proposed framework and applied it to the biomedical informatics domain, which is a completely different domain from malware detection. In this research we were able to successfully enhance the capabilities of a classification model used for condition severity classification while achieving a significant reduction in labeling efforts; this can result in significant savings, in terms of both the time and money associated with the efforts of medical experts. This extension showed that our methods and framework are generic and can provide a solution for a variety of problems in many different domains.

Our research was guided by our attentiveness to emerging trends in the malware detection domain. We began with a behavioral active learning based framework [1] for the enhanced detection of elusive computer worms. Then, after identifying a serious gap in the area of detection solutions for malicious executables (specifically the limited updatability of current solutions), we enhanced our AL based framework and extended it with two novel and efficient AL methods (Exploitation and Combination). These methods allow our framework to address the existing updatability gap through frequent and efficient updating [2] of both the detection model and the antivirus software.
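The exact acquisition criteria of Exploitation, Combination, and SVM-Margin are defined in the cited papers; the hedged sketch below only approximates the intuition on synthetic data, using the signed distance from an SVM hyperplane: SVM-Margin favors the files closest to the boundary, Exploitation favors those deepest inside the malicious side, and a naive "combination" splits the daily labeling budget between the two.

```python
# Hedged sketch on synthetic data: the precise Exploitation and Combination
# acquisition criteria are defined in [2]; here SVM-Margin is approximated as
# picking the files closest to the SVM hyperplane (most uncertain) and
# Exploitation as picking the files deepest inside the malicious side.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X_labeled = rng.normal(size=(300, 20))
y_labeled = rng.integers(0, 2, 300)              # 1 = malicious, 0 = benign
X_stream = rng.normal(size=(1000, 20))           # today's stream of unknown files

clf = LinearSVC(dual=False).fit(X_labeled, y_labeled)
dist = clf.decision_function(X_stream)           # signed distance from the hyperplane

budget = 10                                      # files an expert can label today
svm_margin_picks = np.argsort(np.abs(dist))[:budget]   # closest to the boundary
exploitation_picks = np.argsort(-dist)[:budget]        # deepest in the malicious side
combination_picks = np.unique(np.concatenate(
    [exploitation_picks[: budget // 2], svm_margin_picks[: budget - budget // 2]]
))
print(len(combination_picks), combination_picks[:5])
```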

- 36 -

Our next goal was identified based on the increased prevalence of Android malware and the contamination caused by malicious versions of known Android applications in many unofficial application markets, as well as in the official Android application market, Google Play. The inadequate detection solutions in this area, the reliance of Android smartphones on antivirus solutions, and the updatability gap we also identified in detection solutions for Android malware pointed to the need for a better solution. In response, we enhanced the capabilities of our active learning framework and created ALDROID [3], a framework that outperformed existing solutions and enhanced the detection of Android malware in the long run.

The next trend we responded to was the increase in attacks aimed at organizations which, having limited the entrance of executables into their networks, are increasingly threatened by malicious documents (PDF and MS Office files) that are attached to email messages and sent in order to penetrate organizations and perform malicious activities. Therefore, we further enhanced our AL based framework for the detection of this type of malicious document [4] and created the ALPD [5, 6] and ALDOCX [7, 8] frameworks, which include our newly developed feature extraction methodologies for the efficient detection of malicious documents.

As can be seen in our published papers [1-8] which are included here, in the course of our study we have compared our framework with existing methods, solutions, and tools. The most meaningful results are highlighted below.

In the computer worm domain [1], our framework increased the performance of the detection model by 19%-25% and improved its robustness in the presence of misleading instances of elusive computer worms. This was accomplished using AL methods based on behavioral analysis of the monitored system in which the worm was executed.

In the domain of malicious executables [2], our novel AL methods (Exploitation and Combination) outperformed the existing AL method (SVM-Margin) in the number of new malware files acquired daily from a stream of new unknown files, acquiring 2.6 times more malware than the existing AL method and 7.8 times more than the random selection method, which resulted in better updating and enrichment of the antivirus software's repository with new malware signatures. In addition, the performance of the detection model improved as more files were acquired daily, and the results indicate that by acquiring only a small and well selected set of informative files (31% of the stream), the detection model was able to achieve TPR levels that were 96.5% of the maximal TPR rates that could be achieved (by acquiring the whole stream of new unknown files).

In the Android application domain, we presented a general descriptive set of static features on which our ALDROID [3] framework is based. The framework acquired more than twice as many new malware files as a heuristic engine and 6.5 times more malware than the existing AL method (SVM-Margin). By acquiring more malware we frequently enriched the signature repository of the Android antivirus software, while also updating the detection capabilities of the detection model.

For the detection of malicious PDF documents, we first demonstrated [4] the correlation between the structural incompatibility of PDF files and their likelihood of maliciousness. By leveraging this correlation, at least 96.5% of the malicious PDF files can be easily filtered out using a simple and deterministic filtering process. Later, we developed our ALPD [5, 6] active learning based framework, which was capable of inducing an updated detection model on a daily basis, outperforming the detection rates of all of the evaluated antivirus tools (by at least 5% TPR) and accomplishing this using only a relatively small fraction of the new PDF files (25%) – the most informative portion, consisting of the most valuable information required for updating the knowledge stores of the detection model and antivirus tools.
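The exact deterministic filter is specified in paper [4]; as a hedged illustration of the idea only, the sketch below applies two hypothetical, minimal structural-compatibility checks (a %PDF- header and an %%EOF marker near the end of the file), under the assumption that files failing such basic checks are structurally incompatible with the PDF standard and therefore strong candidates for filtering.

```python
# Hedged illustration only: the exact deterministic filter is specified in [4].
# Two hypothetical, minimal structural-compatibility checks are shown here.
def is_structurally_compatible(path: str) -> bool:
    with open(path, "rb") as f:
        data = f.read()
    has_header = data.startswith(b"%PDF-")
    has_eof_marker = b"%%EOF" in data[-1024:]    # trailer region should contain %%EOF
    return has_header and has_eof_marker

# Usage with a hypothetical file path:
# print(is_structurally_compatible("suspicious_sample.pdf"))
```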

For the detection of malicious Microsoft Office files, our ALDOCX [7, 8] framework, based on a novel feature extraction methodology (SFEM), also showed a significant improvement of 91% in unknown docx malware acquisition compared to the random selection method and existing AL methods, thus providing an improved updating solution for the detection model, as well as for the antivirus software widely used within organizations. ALDOCX also achieved a 94.4% TPR, compared to the most accurate antivirus, TrendMicro, which had a detection rate of 85.9%. We achieved this using only 14% of the labeled data (2,011 docx files out of 14,000), which represents a reduction of 84.4% in labeling efforts.

Our results show that we have developed a framework that is general, yet flexible and adaptable, and that has proven efficient in many sub-domains of new unknown malware detection. Thus, in order to maximize the efficiency of our framework, it should be strategically deployed over specific nodes of the Internet in an attempt to achieve the greatest coverage possible while limiting deployment to as few units as possible (minimizing costs). After deploying the framework in the strategic nodes, the integration of all the deployed units will expose the framework to as many new files transferred over the Internet as possible, ensuring that most of the new informative files will be acquired. Therefore, once a new unknown malware is created and transferred over the network, it will be monitored by the strategic deployment of the framework, like a "fly caught in tangled spider webs." Routers, gateways to organizations, and the markets of mobile applications (official and unofficial) are considered part of the strategic nodes in which the framework should be deployed. The integration between these strategic nodes and several levels of the framework will improve detection and updatability capabilities.

It is important to consider possible attacks against our framework. Zhao et al. [89] recently discussed two possible attack scenarios on AL methods. In these attacks, referred to as adding and deleting, the attacker can actually pollute the unlabeled data before it is sampled by the active learner module. The results of their experiments on an intrusion detection dataset showed that these attacks disrupted the performance of the AL process and significantly decreased detection accuracy: a decrease of 16%-30% due to the adding attack and a decrease of 15%-34% due to the deleting attack. In our context, an adaptive attacker might "guide" the AL process and poison the classifiers by producing many malware files that contain specifically chosen features by design. Consequently, the AL methods would acquire these files, since they would contain new and interesting features which did not exist before. Attacking such a biased classifier then becomes easy: the attacker simply leaves out the chosen features and creates a malicious file that can evade the detection model.

The way to confront this attack is quite simple and relates to the deployment of our framework, as well as to the file acquisition strategies integrated within our AL based framework. First, the AL process is not based on a specific node in the Internet but is sustained by many sources of information and files. Thus, such an attack must flood significant segments of the Internet in order to poison the presented framework in a way that will bias the classifier. Not only is such a flood by an attacker not feasible, it also takes time to conduct, allowing antivirus vendors enough time to distribute a patch against it. In addition, since our framework tries to select the most informative files and attempts to enlarge the signature repository, it does not choose files that are similar to previously acquired files. Therefore, our AL methods would not acquire a full set of malicious files sharing the same specifically chosen features but would only acquire a few representative files. Thus the framework is resilient to such attacks, and its detection capabilities remain unaffected and intact.

As a proof of concept for the generality of our AL based framework, we have recently extended the framework's capabilities so that it can provide a solution in additional domains. We adjusted it to meet the needs of the biomedical informatics domain and created the CAESAR-ALE [9-11] framework. We successfully enhanced the capabilities of a classification model used for condition severity classification, with a significant reduction in the labeling efforts of medical experts (ranging from 48% to 64%), which can result in savings of both time and money. In fact, our framework can be adjusted to contribute to any domain that demands efficient and enhanced classification and detection capabilities and could benefit from minimizing the efforts and resources associated with the experts involved in classification and labeling.

Importantly, we demonstrated that our AL methods in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of different human labelers, particularly regarding differences in the level of medical training (i.e., having a medical degree versus having a master's degree). In fact, our new AL method, Combination_XA, acquired conditions in such a way that the variance between the classifiers created by individual labelers was 50% lower than the variance of classifiers created through acquisition by the traditional SVM-Margin AL method. Therefore, not only does CAESAR-ALE result in a 62% reduction in labeler efforts to acquire the same number of severe instances; it can also learn from labeled data obtained by labelers without clinical training.

The results of the evaluation of our framework in a variety of domains and tasks serve as the motivation for its full deployment in real systems. Currently we are collaborating with several cyber-security and online bidding companies for which the framework can be beneficial. As presented in the results section, the FPR rates of our framework are low enough for deployment within real systems. Furthermore, the low time complexity associated with our AL methods provides the ability to handle a large number of cases online, as is required by many actual online systems. Prior to the full deployment of our framework in a real system, the user should first identify the prior probability of the positive class he/she aims to acquire. Second, a good set of general descriptive features, on which the classifier and AL methods will be based, must be designed and created. Third, in order to acquire a manageable amount of unlabeled and informative examples, the user must evaluate the number of examples the human experts can inspect manually in a predefined time slice.
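A hedged sketch of how these three deployment prerequisites might be recorded follows; every name and number here is a hypothetical placeholder rather than a recommended value.

```python
# Hedged sketch of the three deployment prerequisites listed above; all values
# are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class DeploymentConfig:
    positive_class_prior: float                          # estimated share of malicious files in the stream
    feature_names: list = field(default_factory=list)    # general descriptive features
    files_per_expert_per_day: int = 40                   # files one expert can inspect per day
    num_experts: int = 3

    def daily_acquisition_budget(self) -> int:
        # The AL methods should acquire no more files than the experts can label.
        return self.files_per_expert_per_day * self.num_experts

cfg = DeploymentConfig(
    positive_class_prior=0.05,
    feature_names=["n_javascript_objects", "n_embedded_files", "file_size"],
)
print(cfg.daily_acquisition_budget())                    # 120 files/day in this hypothetical setup
```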

- 41 -

4. Future Directions

We plan to continue our research, adapting our framework and methods further and using them as a springboard to enhance detection and address other types of malware, develop tools in the biomedical informatics domain, and serve other domains.

More specifically, we plan to develop a new AL method that complements our Exploitation method (presented previously) and acquires interesting and informative files from deep inside the benign side of the classifier (e.g., SVM). The goal is to discover very elusive malicious files that, in order to evade the detection model, hide their malicious behavior and content and resemble benign files, in that they have many features (dynamic or static) that are more likely to appear in a benign file than in a malicious one; such files are, in effect, a type of Trojan horse.
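As a minimal sketch of this planned criterion (mirroring the earlier Exploitation sketch on the opposite side of the hyperplane), and assuming an SVM whose decision function is positive on the malicious side, the files with the most negative scores lie deepest inside the benign side and would be the acquisition candidates for such a complementary method.

```python
# Minimal sketch of the planned criterion on synthetic data: files with the
# most negative SVM scores lie deepest inside the benign side and would be
# the acquisition candidates for this complementary method.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X_lab, y_lab = rng.normal(size=(300, 20)), rng.integers(0, 2, 300)
X_new = rng.normal(size=(500, 20))

clf = LinearSVC(dual=False).fit(X_lab, y_lab)
scores = clf.decision_function(X_new)
deep_benign_picks = np.argsort(scores)[:10]      # 10 files deepest inside the benign side
print(deep_benign_picks)
```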

Barnabé-Lortie et al. [97] presented an AL method aimed at one-class classification problems. Such a method is likely to be effective in the malware domain, particularly in cases in which the minority class associated with malware and attacks is not sufficiently representative, where a sophisticated and specially tailored AL method such as the one they proposed [97] can be very beneficial and appropriate. We therefore plan to integrate this method into our AL framework as an additional AL method in future work.

In addition, we aim to adjust our framework and AL methods for additional domains within cyber security, in which an efficient selection of a specific class is needed. Online social networks (OSNs) are a domain that could benefit from our methods; in this setting our method could be used for the detection and acquisition of new fictitious profiles, a difficult and important task given the increased prevalence of malicious profiles that are used to tempt innocent and unsophisticated users such as children and the elderly.

Biomedical domains often involve significant efforts by experts (including clinical experts) in labeling the cases to be classified, such as structured and unstructured aspects of patient records, multiple types of images, etc. AL is a methodology that can significantly reduce the effort involved by initially labeling only a small portion of the data, while obtaining the same level of accuracy achieved via the use of the complete dataset. Furthermore, using specific enhanced AL methods, this level of accuracy might be obtained while also detecting a higher percentage of positive cases, an aspect that in some situations might be of interest. Many biomedical domains and tasks might benefit from the application of our AL methods and enhanced versions of these methods.

In addition, we plan to develop an online tool for medical experts to use to label condition severity. This will enable medical experts worldwide to label conditions and allow us to compare their labels and learn differences based on the clinical background or geographical location of the experts. However, because labelers (generally speaking, and particularly those in the medical domain) have varying levels of expertise, a major issue associated with learning methods (and more specifically AL methods) is how best to use the labeling provided by a committee of labelers. First, we want to know, based on the labelers' learning curves, whether using AL methods (versus standard passive learning methods) has an effect on the intra-labeler variability (within the learning curve of each labeler) and inter-labeler variability (among the learning curves of different labelers). Then, we want to examine the effect of learning (either passively or actively) from the labels created by the majority consensus of a group of labelers.
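As a hedged illustration of the two variability notions defined above, the following sketch computes intra-labeler variability (variance along each labeler's learning curve) and inter-labeler variability (variance across labelers at each learning step) from hypothetical learning curves; the accuracy values are invented for illustration only.

```python
# Hedged illustration of the two variability notions defined above, computed on
# hypothetical learning curves (accuracy per AL iteration, one row per labeler).
import numpy as np

curves = np.array([                              # rows: labelers, columns: AL iterations
    [0.70, 0.78, 0.83, 0.86],
    [0.68, 0.75, 0.82, 0.85],
    [0.72, 0.80, 0.84, 0.88],
])
intra_labeler_variability = curves.var(axis=1)   # within each labeler's learning curve
inter_labeler_variability = curves.var(axis=0)   # among labelers at each iteration
print(intra_labeler_variability, inter_labeler_variability.mean())
```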

Our research has been driven by emerging trends in malware and technology which will continue changing into the future, demanding new solutions and fueling additional research. Our methods and approaches have been proven adaptable and flexible to meet changing needs and will continue to evolve in the future.

- 43 -

5. References

[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning," Pattern Analysis and Applications, (2012) 15:459-475.
[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS," Expert Systems with Applications, (2014). Link: http://authors.elsevier.com/sd/article/S095741741400133X.
[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods," Knowledge and Information Systems (2016), 1-39. Accepted on 11 January 2016.
[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, "Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the-Art Survey," Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.
[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici, "Keeping Pace with the Creation of New Malicious PDF Files Using an Active-Learning Based Detection Framework," Security Informatics (2016), 5(1), 1-20.
[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Asaf Shabtai, Yuval Elovici, "ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations," JISIC (2014).
[7] Nir Nissim, Aviad Cohen, Yuval Elovici, "Boosting the Detection of Malicious Documents Using Designated Active Learning Methods," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765. doi: 10.1109/ICMLA.2015.52.
[8] Nir Nissim, Aviad Cohen, Yuval Elovici, "Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents," ODDX3 Workshop at KDD Conference (2015), Sydney.
[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification," BigCHAT Workshop at KDD Conference (2014).

Mario Stefanelli Best Paper Award at AIME 2015 Conference:
[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "An Active Learning Framework for Efficient Condition Severity Classification," Artificial Intelligence in Medicine (pp. 13-24), Springer International Publishing, AIME (2015).
[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch, "Improving Condition Severity Classification with an Efficient Active Learning Based Framework," Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464.

- 44 -

[12] D. Dagon, T. Martin, and T. Starner, “Mobile phones as computing devices: The viruses are coming!” IEEE Pervasive Computing, vol. 3,no. 4, pp. 11–15, 2004. [13] M. Piercy, “Embedded devices next on the virus target list,” IEEE electronics Systems and Software, vol. 2, pp. 42–43, Dec.-Jan. 2004. [14] N. Leavitt, "Mobile phones: the next frontier for hackers?" Computer, vol. 38(4), 2005, pp. 20-23. [15] D.H. Shih, B. Lin, H.S. Chiang, M.H. Shih, "Security aspects of mobile phone virus: a critical survey," Industrial Management & Data Systems, vol. 108(4), 2008, pp. 478-494. [16] Schmidt, A.-D.; Schmidt, H.-G.; Batyuk, L.; Clausen, J.H.; Camtepe, S.A.; Albayrak, S.; Yildizli, C.; , "Smartphone malware evolution revisited: Android next target?," Malicious and Unwanted Software (MALWARE), 2009 4th International Conference on , vol., no., pp.1-7, 13-14 Oct. 2009 [17] Shabtai, A., Fledel, Y., Kanonov, U., Elovici, Y., Dolev, S., & Glezer, C. (2010). Google android: A comprehensive security assessment. IEEE Security & Privacy, (2), 35-44. [18] Steve Mansfield-Devine, Android malware and mitigations, Network Security, Volume 2012, Issue 11, November 2012, Pages 12-20, ISSN 1353-4858, 10.1016/S1353-4858(12)70104-6. [19] http://www.securelist.com/en/analysis/204792250/IT_Threat_Evolution_Q3_2012 [20] http://docs.oracle.com/javase/tutorial/security/apisign/gensig.html [21] Axelle Apvrille, Tim Strazzere: Reducing the window of opportunity for Android malware Gotta catch 'em all. Journal in Computer Virology 8(1-2): 61-71 (2012) [22] B. Sanz, I. Santos, P. Galán-García, C. Laorden, X. Ugarte-Pedrero, P.G. Bringas and G. Alvarez PUMA: Permission Usage to detect Malware in Android. In Proceedings of the 5th International Conference on Computational Intelligence in Security for Information Systems (CISIS). Ostrava (Czech Republic), 5-7 September, 2012 [23] Bläsing, T.; Batyuk, L.; Schmidt, A.-D.; Camtepe, S.A.; Albayrak, S.; , "An Android Application Sandbox system for suspicious software detection," Malicious and Unwanted Software (MALWARE), 2010 5th International Conference on , pp.55-62, 19-20 Oct. 2010 [24] Shabtai, A.; Fledel, Y.; Elovici, Y.; , "Automated Static Code Analysis for Classifying Android Applications Using Machine Learning," Computational Intelligence and Security (CIS), 2010 International Conference on, pp.329-333, 11-14 Dec. 2010 [25] Sanz, B.; Santos, I.; Laorden, C.; Ugarte-Pedrero, X.; Bringas, P.G.; , "On the automatic categorisation of android applications," Consumer Communications and Networking Conference (CCNC), 2012 IEEE,pp.149-153, 14-17 Jan. 2012 [26] Min Zhao, Tao Zhang, FangbinGe, Zhijian Yuan: RobotDroid: A Lightweight Malware Detection Framework On Smartphones. JNW 7(4): 715-722 (2012) [27] B. Sarma, N. Li, C. Gates, R. Potharaju, and C. Nita-Rotaru. Android Permissions: A Perspective Combining Risks and Benefits. In Proceedings of SACMAT, 2012. [28] H. Peng, C. Gates, B. Sarma, N. Li, A. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy. Using Probabilistic Generative Models for Ranking Risks of Android Apps.In Proceedings of ACM CCS, 2012.

- 45 -

[29] Suarez-Tangil, Guillermo, et al. "Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families." Expert Systems with Applications 41.4 (2014): 1104-1117. [30] Zhou, W., Zhou, Y., Jiang, X., & Ning, P. (2012). Detecting repackaged smartphone applications in third-party Android marketplaces. In Proceedings of the second ACM conference on data and application security and privacy (pp. 317–326). ACM.

[31] Ying-Dar Lin, Yuan-Cheng Lai, Chien-Hung Chen, Hao-Chuan Tsai, Identifying android malicious repackaged applications by thread-grained system call sequences, Computers & Security, Volume 39, Part B, November 2013, Pages 340-350, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2013.08.010.

[32] Seung-Hyun Seo, Aditi Gupta, Asmaa Mohamed Sallam, Elisa Bertino, Kangbin Yim, Detecting mobile malware threats to homeland security through static analysis, Journal of Network and Computer Applications, Volume 38, February 2014, Pages 43-53, ISSN 1084-8045, http://dx.doi.org/10.1016/j.jnca.2013.05.008.

[33] Zhou, Yajin, et al. "Hey, you, get off of my market: Detecting malicious apps in official and alternative android markets." Proceedings of the 19th Annual Network and Distributed System Security Symposium. 2012. [34] Ham, Hyo-Sik, and Mi-Jung Choi. "Analysis of Android malware detection performance using machine learning classifiers." ICT Convergence (ICTC), 2013 International Conference on. IEEE, 2013. [35] Luoshi, Z., Yan, N., Xiao, W., Zhaoguo, W., & Yibo, X. (2013, November). A3: Automatic Analysis of Android Malware. In 1st International Workshop on Cloud Computing and Information Security. Atlantis Press. [36] Shabtai, A., Tenenboim-Chekina, L., Mimran, D., Rokach, L., Shapira, B., & Elovici, Y., Mobile Malware Detection through Analysis of Deviations in Application Network Behavior, Computers & Security, 2014.

[37] Zhang, Yuan, et al. "Vetting undesirable behaviors in android apps with permission use analysis." Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013.

[38] Z. Tzermias, G. Sykiotakis, M. Polychronakis and E. P. Markatos. Combining static and dynamic analysis for the detection of malicious documents. Presented at Proceedings of the Fourth European Workshop on System Security. 2011.

[39] N. Šrndic and P. Laskov. Detection of malicious pdf files based on hierarchical document structure. Presented at Proceedings of the 20th Annual Network & Distributed System Security Symposium. 2013.

[40] P. Laskov and N. Šrndić. Static detection of malicious JavaScript-bearing PDF documents. Presented at Proceedings of the 27th Annual Computer Security Applications Conference. 2011.

[41] J. Kittilsen. Detecting malicious PDF documents.

[42] F. Schmitt, J. Gassen and E. Gerhards-Padilla. PDF scrutinizer: Detecting JavaScript-based attacks in PDF documents. Presented at Privacy, Security and Trust (PST), 2012 Tenth Annual International Conference On. 2012.

- 46 -

[43] C. Smutz and A. Stavrou. Malicious PDF detection using metadata and structural features. Presented at Proceedings of the 28th Annual Computer Security Applications Conference. 2012.

[44] D. Maiorca, G. Giacinto and I. Corona. "A pattern recognition system for malicious pdf files detection," in Machine Learning and Data Mining in Pattern RecognitionAnonymous 2012.

[45] H. Pareek, P. Eswari, N. S. C. Babu and C. Bangalore. Entropy and n-gram analysis of malicious PDF documents. Int. J. Eng. 2(2), 2013.

[46] X. Lu, J. Zhuge, R. Wang, Y. Cao and Y. Chen. De-obfuscation and detection of malicious PDF files with high accuracy. Presented at System Sciences (HICSS), 2013 46th Hawaii International Conference On. 2013.

[47] K. Z. Snow, S. Krishnan, F. Monrose and N. Provos. SHELLOS: Enabling fast detection and forensic analysis of code injection attacks. Presented at USENIX Security Symposium. 2011.

[48] T. Schreck, S. Berger and J. Göbel. "BISSAM: Automatic vulnerability identification of office documents," in Detection of Intrusions and Malware, and Vulnerability Assessment Anonymous 2013.

[49] Abou-Assaleh, T.; Cercone, N.; Keselj, V.; Sweidan, R., "N-gram-based detection of new malicious code," Computer Software and Applications Conference, 2004. COMPSAC 2004. Proceedings of the 28th Annual International , vol.2, no., pp.41,42 vol.2, 28-30 Sept. 2004

[50] Henchiri, O., & Japkowicz, N. (2006, December). A feature selection and evaluation scheme for computer virus detection. In Data Mining, 2006. ICDM'06. Sixth International Conference on (pp. 891-895). IEEE.

[51] Kolter, J. Z., & Maloof, M. A. (2004, August). Learning to detect malicious executables in the wild. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 470-478). ACM.

[52] Kolter, J. Z., & Maloof, M. A. (2006). Learning to detect and classify malicious executables in the wild. The Journal of Machine Learning Research, 7, 2721-2744.

[53] Moskovitch, R., Stopel, D., Feher, C., Nissim, N., & Elovici, Y. (2008, June). Unknown malcode detection via text categorization and the imbalance problem. In Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on (pp. 156-161). IEEE.

[54] Schultz, M. G., Eskin, E., Zadok, E., & Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on (pp. 38-49). IEEE.

[55] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, and Y. Elovici, "Unknown Malcode Detection and the Imbalance Problem," Journal in Computer Virology, 5(4), 2009, 295-308.

[56] Jiyong Jang, David Brumley, and Shobha Venkataraman. 2011. BitShred: feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM conference on Computer and communications security (CCS '11). ACM, New York, NY, USA, 309-320.

- 47 -

[57] Tahan, G., Rokach, L., & Shahar, Y. (2012). Mal-ID: Automatic malware detection using common segment analysis and meta-features. The Journal of Machine Learning Research, 13(1), 949-979.

[58] Carsten Willems, Thorsten Holz, and Felix Freiling. 2007. Toward Automated Dynamic Malware Analysis Using CWSandbox. IEEE Security and Privacy 5, 2 (March 2007), 32-39.

[59] Sharif, M., Lanzi, A., Giffin, J., & Lee, W. (2009, May). Automatic reverse engineering of malware emulators. In Security and Privacy, 2009 30th IEEE Symposium on (pp. 94-109). IEEE.

[60] Royal, P., Halpin, M., Dagon, D., Edmonds, R., & Lee, W. (2006, December). Polyunpack: Automating the hidden-code extraction of unpack-executing malware. In Computer Security Applications Conference, 2006. ACSAC'06. 22nd Annual (pp. 289-300). IEEE.

[61] Rieck, K., Holz, T., Willems, C., Düssel, P., & Laskov, P. (2008). Learning and classification of malware behavior. In Detection of Intrusions and Malware, and Vulnerability Assessment (pp. 108-125). Springer Berlin Heidelberg.

[62] Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4), 639-668.

[63] Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M. G., ... & Saxena, P. (2008). BitBlaze: A new approach to computer security via binary analysis. In Information systems security (pp. 1-25). Springer Berlin Heidelberg.

[64] Perdisci, R., Lanzi, A., & Lee, W. (2008, December). McBoost: Boosting scalability in malware collection and analysis using statistical classification of executables. In Computer Security Applications Conference, 2008. ACSAC 2008. Annual (pp. 301-310). IEEE.

[65] Moser, A.; Kruegel, C.; Kirda, E., "Exploring Multiple Execution Paths for Malware Analysis," Security and Privacy, 2007. SP '07. IEEE Symposium on , pp.231,245, 20-23 May 2007

[66] Lanzi, A., Sharif, M. I., & Lee, W. (2009, February). K-Tracer: A System for Extracting Kernel Malware Behavior. In NDSS.

[67] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X. Y., & Wang, X. (2009, August). Effective and Efficient Malware Detection at the End Host. In USENIX Security Symposium (pp. 351-366).

[68] Jacob, G., Debar, H., & Filiol, E. (2009, January). Malware behavioral detection by attribute-automata using abstraction from platform and language. In Recent Advances in Intrusion Detection (pp. 81-100). Springer Berlin Heidelberg.

[69] Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C., & Kirda, E. (2009, February). Scalable, Behavior-Based Malware Clustering. In NDSS (Vol. 9, pp. 8-11).

[70] Moore, D., Paxson, V. Savage, S., Shannon, C., Staniford, S., Weaver, N., "Inside the Slammer Worm", IEEE Security & Privacy 2003.

[71] Moore, D., Shannon, C. and Brown, J. Code-red: a case study on the spread and victims of an internet worm. In Traffic analysis, pages 273-284. The Second Internet Measurement Workshop, Nov. 2002

[72] Langner, Ralph. "Stuxnet: Dissecting a cyberwarfare weapon." Security & Privacy, IEEE 9.3 (2011): 49-51.

- 48 -

[73] Moskovitch R, Elovici Y, Rokach L (2008) Detection of unknown computer worms based on behavioral classification of the host. Comput Stat Data Anal 52(9):4544–4566

[74] Stopel D, Moskovitch R, Boger Z, Shahar Y, Elovici Y (2009) Using artificial neural networks to detect unknown computer worms. Neural Comput Appl 18:663–674

[75] Hansman, S. A taxonomy of network and computer attack methodologies. http://www.cosc.canterbury.ac.nz/research/reports/HonsReps/2003/hons 0306.pdf, Nov. 2003.

[76] Report: http://www.kaspersky.com/about/news/virus/2013/number-of-the-year

[77] Iyatiti Mokube and Michele Adams. 2007. Honeypots: concepts, approaches, and challenges. In Proceedings of the 45th annual southeast regional conference (ACM-SE 45). ACM, New York, NY, USA, 321-326.

[78] N. Provos, and T. Holz, Virtual Honeypots: From Botnet Tracking to Intrusion Detection, Addison- Wesley,2008, pp. 231–272.

[79] T. Yamamoto, M. Matsushita, T. Kamiya, and K. Inoue. Measuring similarity of large software systems based on source code correspondence. Product Focused Software Process Improvement, pages 179–208, 2005.

[80] I. Zliobaite. Learning under concept drift: an overview. Technical report, Vilnius University, Lithuania, 2009.

[81] Singh, Anshuman, Andrew Walenstein, and Arun Lakhotia. "Tracking concept drift in malware families." Proceedings of the 5th ACM workshop on Security and artificial intelligence. ACM, 2012.

[82] Puzis, R., Tubi, M., Elovici, Y., Glezer, C., & Dolev, S. (2011). A decision support system for placement of intrusion detection and prevention devices in large-scale networks. ACM Transactions on Modeling and Computer Simulation (TOMACS), 22(1), 5.

[83] Puzis, R., Klippel, M. D., Elovici, Y., & Dolev, S. (2008). Optimization of NIDS placement for protection of intercommunicating critical infrastructures. In Intelligence and Security Informatics (pp. 191-203). Springer Berlin Heidelberg.

[84] Dolev, S., Elovici, Y., Puzis, R., & Zilberman, P. (2009). Incremental deployment of network monitors based on group betweenness centrality. Information Processing Letters, 109(20), 1172-1176.

[85] Puzis, R., Zilberman, P., Elovici, Y., Dolev, S., & Brandes, U. (2012, September). Heuristics for speeding up betweenness centrality computation. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom) (pp. 302-311). IEEE.

[86] Steve Mansfield-Devine, Android malware and mitigations, Network Security, Volume 2012, Issue 11, November 2012, Pages 12-20, ISSN 1353-4858, 10.1016/S1353-4858(12)70104-6.

[87] Oberheide, J., Miller, J.: Dissecting the android bouncer (2012)

[88] https://www4.symantec.com/mktginfo/whitepaper/ISTR/21347932_GA-internet-security-threat-report-volume-20-2015-social_v2.pdf

- 49 -

[89] Zhao, W., Long, J., Yin, J., Cai, Z., & Xia, G. (2012). Sampling attack against active learning in adversarial environment. In Modeling Decisions for Artificial Intelligence (pp. 222-233). Springer Berlin Heidelberg.

[90] http://www.strazzere.com/papers/DexEducation-PracticingSafeDex.pdf

[91] Beuhring, A.; Salous, K., "Beyond Blacklisting: Cyberdefense in the Era of Advanced Persistent Threats," in Security & Privacy, IEEE , vol.12, no.5, pp.90-93, Sept.-Oct. 2014 doi: 10.1109/MSP.2014.86 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6924678&isnumber=6924618

[92] Servedio RA (2003) Smooth boosting and learning with malicious noise. J Mach Learn Res 4:633–648

[93] Chen Y, Zhan Y (2009) Co-training semi-supervised active learning algorithm based on noise filter. In: Proceedings of the 2009 WRI global congress on intelligent systems, GCIS ’09, vol 03. IEEE Computer Society, Washington, DC, USA, pp 524–528

[94] Schohn G , Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the seventeenth international conference on machine learning, ICML ’00. Morgan Kaufmann Publishers Inc, San Francisco, pp 839–846

[95] Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." Journal of Machine Learning Research 2.Nov (2001): 45-66.

[96] Roy, Nicholas, and Andrew McCallum. "Toward optimal active learning through Monte Carlo estimation of error reduction." ICML, Williamstown (2001): 441-448.

[97] V. Barnabé-Lortie, C. Bellinger and N. Japkowicz, "Active Learning for One-Class Classification," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, 2015, pp. 390-395. doi: 10.1109/ICMLA.2015.167

- 50 -

- 51 -

6. Appendix

6.1. Additional Accepted Papers in the Malware Detection Domain

In this section we present the list of our four additional papers in the malware domain, which support the results and present additional results and experiments related to the four core papers. The full published versions of these papers are presented following the list.

[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici. (2016). Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework. Security Informatics, 5(1), 1-20.

[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Asaf Shabtai, Yuval Elovici, "ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations," JISIC (2014).

[7] Nir Nissim, Aviad Cohen, Yuval Elovici, "Boosting the Detection of Malicious Documents Using Designated Active Learning Methods," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765. doi: 10.1109/ICMLA.2015.52

[8] Nir Nissim, Aviad Cohen, Yuval Elovici. Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents. ODDX3 Workshop at KDD Conference (2015), Sydney.

- 52 -

Nissim et al. Secur Inform (2016) 5:1 DOI 10.1186/s13388-016-0026-3

RESEARCH (Open Access)

Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework

Nir Nissim1*, Aviad Cohen1, Robert Moskovitch2, Asaf Shabtai1, Matan Edri1, Oren BarAd1 and Yuval Elovici1

Abstract

Attackers increasingly take advantage of naive users who tend to treat non-executable files casually, as if they are benign. Such users often open non-executable files although they can conceal and perform malicious operations. Existing defensive solutions currently used by organizations prevent executable files from entering organizational networks via web browsers or email messages. Therefore, recent advanced persistent threat attacks tend to leverage non-executable files such as portable document format (PDF) documents which are used daily by organizations. Machine Learning (ML) methods have recently been applied to detect malicious PDF files, however these techniques lack an essential element—they cannot be efficiently updated daily. In this study we present an active learning (AL) based framework, specifically designed to efficiently assist anti-virus vendors focus their analytical efforts aimed at acquiring novel malicious content. This focus is accomplished by identifying and acquiring both new PDF files that are most likely malicious and informative benign PDF documents. These files are used for retraining and enhancing the knowledge stores of both the detection model and anti-virus. We propose two AL based methods: exploitation and combination. Our methods are evaluated and compared to existing AL method (SVM-margin) and to random sampling for 10 days, and results indicate that on the last day of the experiment, combination outperformed all of the other methods, enriching the signature repository of the anti-virus with almost seven times more new malicious PDF files, while each day improving the detection model's capabilities further. At the same time, it dramatically reduces security experts' efforts by 75 %. Despite this significant reduction, results also indicate that our framework better detects new malicious PDF files than leading anti-virus tools commonly used by organizations for protection against malicious PDF files.

Keywords: Active learning, Machine learning, PDF, Malware

Introduction
Cyber-attacks aimed at organizations have increased since 2009, with 91 % of all organizations hit by cyber-attacks in 2013.1 Attacks aimed at organizations usually include harmful activities such as stealing confidential information, spying and monitoring an organization, and disrupting an organization's actions. Attackers may be motivated by ideology, criminal intent, a desire for publicity, and more. The vast majority of organizations rely heavily on email for internal and external communication. Thus, email has become a very attractive platform from which to initiate cyber-attacks against organizations. Attackers often use social engineering in order to encourage recipients to press a link or open a malicious web page or attachment. According to Trend Micro,2 attacks, particularly those against government agencies

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyberattacks-in-2013/.

*Correspondence: [email protected]
1 Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beersheba, Israel
Full list of author information is available at the end of the article

2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/.

© 2016 Nissim et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

and large corporations, are largely dependent upon Spear-Phishing3 emails.

An incident in 2014 aimed at the Israeli ministry of defense (IMOD) provides an example of a new type of targeted cyber-attack involving non-executable files attached to an email. According to media reports,4 the attackers posed as IMOD representatives and sent email messages with a malicious portable document format (PDF) file attachment which, when opened, installed a Trojan horse enabling the attacker to control the computer.

Non-executable files attached to an email are a component of many recent cyber-attacks as well. This type of attack has grown in popularity, in part because executable files (e.g., *.EXE) attached to emails are filtered by most email servers due to the risk they pose and also because non-executables (e.g., *.PDF, *.DOC, etc.) are not filtered out and are considered safe by most users. Non-executable files are written in a format that can be read only by a program that is specifically designed for that purpose and often cannot be directly executed. For example, a PDF file can be read only by a PDF reader such as Adobe Reader or Foxit Reader. Unfortunately, non-executable files are as dangerous as executable files, since their readers can contain vulnerabilities that, when exploited, may allow an attacker to perform malicious actions on the victim's computer. F-Secure's 2008–2009 report5 indicates that the most popular file types for targeted attacks in 2008–2009 were PDF and Microsoft Office files. Since that time, the number of attacks on Adobe Reader has grown.6

To prevent such attacks, defensive tools such as firewalls, intrusion detection systems (IDSs), intrusion prevention systems (IPSs), anti-viruses, sandboxes, and others are used; however, these tools have limitations in the detection of attacks that are launched via non-executable files, particularly when a sophisticated advanced persistent threat attack is executed against an organization. These tools rely heavily on the store of known signatures maintained by anti-virus vendors and communicated to clients. Their main limitation is their inability to detect very new unknown types of attacks through known signatures, due to the time lag that exists between when a new unknown malware appears and the time anti-virus vendors update their clients with the new signature. During this period of time, many computers are vulnerable to the spread and actions of new malware [1], [2]. Thus, the currently known signatures of the clients of anti-virus vendors are consistently not up to date.

In order to effectively analyze tens of thousands of new, potentially malicious PDF files on a daily basis, anti-virus vendors have integrated a component of a detection model based on machine learning (ML) and rule-based algorithms [3] into the core of their signature repository update activities. However, these solutions are often ineffective, because their knowledge base is not adequately updated. This occurs because many new, potentially malicious PDF files are created every day, and only a limited number of security researchers are tasked with manually labeling them. Thus, the problem lies in prioritizing which PDF files should be acquired, analyzed, and labeled by a human expert.7

In this study we present an active learning (AL) based framework for frequently updating anti-virus software with new malicious PDF files. The framework focuses on improving anti-virus tools by labeling those informative PDF files (potentially malicious or very informative benign files) that are most likely to improve the detection model's performance and, in so doing, enrich the signature repository with as many new PDF malware files as possible, further enhancing the detection process. Specifically, the presented framework favors files that contain new content. We focus on PCs, the platform most used by organizations; mobile platforms are outside the scope of this study.

Background
Before we provide a review of existing techniques and known methods of attack, it is worthwhile to mention that Adobe Reader version X, released in 2011, offers a new feature called Protected Mode Adobe Reader (PMAR). Protected mode uses a sandbox8 technique in order to create an isolated environment for the Acrobat Reader rendering agent to run while reading a PDF file. As a result, malicious code actions cannot affect the operating system. However, most organizations are not up-to-date with the newest versions of software, including PDF readers,9 and thus they are exposed to many well-known attacks that exploit vulnerabilities that exist in previous versions of Adobe Reader. In addition, PDF files can be utilized for malicious purposes using a variety of techniques. In order to explain how PDF files can be exploited when created or manipulated by an attacker, we first describe the structure of a viable PDF file.

3 http://searchsecurity.techtarget.com/definition/spear-phishing.
4 http://www.israeldefense.co.il/?CategoryID=512&ArticleID=5766.
5 http://www.f-secure.com/weblog/archives/00001676.html.
6 http://www.computerworld.com/article/2517774/security0/pdf-exploits-explode–continue-climb-in-2010.html.
7 http://www.kaspersky.com/se/images/KESB_Whitepaper_KSN_ENG_final.pdf.
8 http://searchsecurity.techtarget.com/definition/sandbox.
9 http://www.csoonline.com/article/2133359/malware-cybercrime/cyberattack-highlights-software-update-problem-in-large-organizations.html.

PDF file structure
A PDF is a formatting language first conceived by John Warnock, one of the founders of Adobe Systems.10 The first version, version 1.0, was introduced in 1993, and the most current version, XI11 (11.0.10), was released on December 9, 2014. The PDF specification is publicly available,12 and thus can be used by anyone. In addition to simple text, a PDF can include images and other multimedia elements; a PDF can be password protected, execute JavaScript, and more. Likewise, the PDF is supported in all of the prominent operating systems for the PC and mobile platforms (e.g., Microsoft Windows, Linux, MacOS, Android, Windows Phone, and iOS). A PDF file is a set of interconnected objects built hierarchically. The PDF file structure is depicted in Fig. 1 and is comprised of four basic parts according to the Adobe PDF Reference13 [4], [5], and [6]:

1. Objects—basic elements in a PDF file:

• Indirect objects—objects referenced by a number
• Direct objects—objects that are not referenced by a number
• Object types:

–– Boolean—for true or false values
–– Numeric:
Integer value
Real value
–– String:
Literal string—a sequence of literal characters enclosed with ( ) brackets
Hexadecimal string—a sequence of hexadecimal numbers enclosed with < > brackets
–– Name—a sequence of 8-bit characters starting with /
–– Null—an empty object represented by the keyword 'null'
–– Array—an ordered sequence of objects enclosed with [] brackets that can be composed of any combination of object types, including another array
–– Dictionary—an unordered sequence of key-value pairs: keys being names which should be unique in the dictionary, and values being any object type. Most of the indirect objects in a PDF document are dictionaries, and dictionaries are enclosed with ≪ ≫ brackets.
–– Stream—a special dictionary object followed by a sequence of bytes enclosed with the keywords "stream" and "endstream." The information inside the stream may be compressed or encrypted by a filter. The prefix dictionary contains information on whether and how to decode the stream. Streams are used to store data such as images, text, script code, etc.

Fig. 1 Simple PDF file structure

2. File structure—defines how the objects are accessed and how they are updated. A valid PDF file consists of the following four parts:

1. Header—the first line of a PDF file which specifies the version number of the PDF specification which the document uses. Header format is "%PDF-[version number]."
2. Body—the largest section of the file which contains all the PDF objects. The body is used to hold all of the document's data that is shown to the user.

10 http://partners.adobe.com/public/developer/tips/topic_tip31.html.
11 http://get.adobe.com/reader/.
12 http://www.adobe.com/devnet/pdf/pdf_reference.html.
13 http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.

3. Cross reference—a table that includes the posi- attacks related to PDF files are conducted using JavaScript tion of every indirect object in the memory and code embedded inside a PDF. allows random access to objects in the file, so JavaScript code obfuscation is legitimately used to the application does not need to read the entire prevent reverse engineering of proprietary applications. file to locate a particular object. Each object is However, it is also used by attackers to conceal malicious represented by one entry in the table which is JavaScript code and prevent it from being recognized by always 20 bytes long. signature based detectors or detectors based on lexical 4. Trailer—provides relevant information about analysis of code (such as [7]), and to reduce readability by how the application reading the file should a human security analyst. find the cross reference table and other special Filters are used in PDFs to compress data in order to objects. The trailer also contains information enhance compactness or facilitate encoding. By using about the number of revisions made to the doc- filters, an attacker can conceal the malicious JavaScript ument. All PDF readers should begin reading a code, rendering it undetectable or unreadable. Multi- file from this section. ple filters can be applied to a stream, one after the other. 3. Document structure—defines how objects are logi- The filter names used must be declared in the stream cally and hierarchically organized to reflect the con- dictionary. Available filters and their primary purposes tent of a PDF file. Each page in the document is rep- are discussed by P. Baccas and J. Kittilsen [10], [11]. resented by a page object which is a dictionary that Table 1 summarizes various code obfuscation techniques includes references to the page’s content. The root employed by attackers [4]. object of the hierarchy is the catalog object which is also a dictionary. Embedded file attack 4. Content streams—objects that contain instructions A PDF file can contain other file types within it, including which define the appearance of the page. HTML, JavaScript, SWF, XLSX, EXE, Microsoft Office files, or even another PDF file. An attacker can use this Techniques and possible attacks via PDF files functionality in order to embed a malicious file inside a JavaScript code attacks benign file. This way, the attacker can leverage the vul- PDF files can contain client-side JavaScript code for legiti- nerabilities of other file types in order to perform mali- mate purposes including: 3D content, form validation, cious activity, such as exploiting the vulnerability of a and calculations. JavaScript code can reside on a local Flash file embedded within a PDF file. CVE-2010-3654 is host or remote server, and can be retrieved using/URI or/ an example of arbitrary code execution that can be GoTo commands [7]. The main indicator for JavaScript achieved by embedding a specially crafted Flash movie code embedded in a PDF file is the presence of the/JS key- file (.SWF) into a PDF file. The embedded file can be word in some dictionaries (as previously explained in the opened when the PDF file is opened using embedded previous subsection II-A). However, an object containing JavaScript code or by other techniques such as PDF com- such a dictionary can be placed in a filtered stream. 
The/ mands (e.g.,/Launch or/Action/EmbeddedFiles) which JS keyword will not be visible in plain text and therefore are usually combined with sophisticated social engineer- may obstruct detection techniques that rely on these key- ing techniques.15 Usually, embedded malicious files are words [8]. The primary goal of the malicious JavaScript obfuscated in order to avoid detection. Adobe Reader code inside a PDF file is to exploit a vulnerability in the PDF viewer versions 9.3.3 and above restrict file formats PDF viewer in order to divert the normal execution flow that can be opened and do not allow the launching of an to the embedded malicious JavaScript code. This can be embedded executable file such as *.EXE, *.BAT, or *.CMD achieved by performing a heap spraying14 or buffer over- because of its blacklist which is based on file extension. flow attack implemented through JavaScript. Another However, Python code (*.PY) is not blacklisted and can malicious activity that can be carried out using JavaScript perform malicious activities when executed.16 is downloading an executable file from the Internet which initiates an attack on the victim’s machine once executed. Form submission and URI attacks Alternatively, JavaScript code can also open a malicious In 2013, Valentin Hamon [12] presented practical tech- website that can perform a variety of malicious operations niques that can be used by attackers to execute malicious targeting the victim’s machine. According to [9], most code from a PDF file. Adobe Reader supports the option

15 http://www.zdnet.com/article/hacker-finds-a-way-to-exploit-pdf-files- 14 Heap Spraying A technique used in exploits to assist random code execu- without-a-vulnerability/. tion. 16 http://www.decalage.info/file_formats_security/pdf. Nissim et al. Secur Inform (2016) 5:1 Page 5 of 20

Table 1 Code obfuscation techniques in PDF files that can be used by an attacker

Obfuscation technique Details

Separating malicious code over multiple objects Malicious code is spread among multiple objects. Code chunks are collected and merged and com- piled to form a malicious piece of code only during runtime. This makes it difficult for static analysis detectors to recognize the malicious code Applying filters Filters are used to conceal malicious code White space randomization Random white spaces are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. White spaces do not affect the code since JavaScript ignores them Comment randomization Random comments are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. Comments do not affect the code since JavaScript ignores them Variable name randomization Changing the variable’s name randomly in order to fool signature based maliciousness detectors Integer obfuscation Representing numbers in a different way which can be used to hide a specific memory address String obfuscation Making changes to string in order to make it difficult for a human analyst to understand the code (e.g., by splitting string into several substrings) Function name obfuscation Hiding the name of the function used which can provide a clue about the code’s intention. This is done by creating a pointer with a random name, pointing to the required function Advanced code obfuscation String can hold encrypted malicious code. The decryption process takes place during runtime, just before usage. Metadata fields and even the document’s words can also be used to store malicious code Block randomization Changing the syntax of the code but not its action Dead code Inserting blocks of code that are not intended to be executed Pointless code Inserting blocks of code that do not perform anything of submitting the PDF form from a client to a specific involves launching a malicious PDF file from a benign server using the/SubmitForm command. Several file for- one. Other attacks described in the paper include an mats can be used for that purpose, one of which is the attack in which a benign PDF simply submits its form to forms data format (FDF), the default format based on the attacker’s PHP web server. The server searches the XML. Adobe generates an FDF file from a PDF in order submitted form header for the Adobe Reader version to send the data to a specified URL. If the URL belongs to using regular expressions. Once the server identifies the a remote web server, it is able to respond. Responses are user’s Adobe Reader version, it sends back a malicious temporarily stored in the %APPData % directory which PDF that exploits a vulnerability corresponding to that automatically pops up in the default web browser. An version. The returned PDF is automatically launched. attack can be performed by a simple request to a mali- Another attack described is the Big Brother attack. When cious website that will automatically pop up on the web the user opens a PDF, it automatically downloads a mali- browser, and the malicious website can exploit a vulnera- cious executable file using the web browser (URI bility in the user’s web browser. The author also showed address). 
This attack can be performed prior to the previ- that security mechanisms such as the protected mode of ously described form submission attack, in order to make Adobe Reader X or the URL Security Zone Manager of changes to registry keys that configure the security the Internet Explorer web browser can be disabled easily mechanisms discussed above. by changing the corresponding registry key, a change that can be implemented with user privileges. However it Related work should be noted that vulnerabilities17 that target Adobe Existing anti-virus software is not adequately effective Reader X and XI have been released18 and are being against unknown non-executable malicious PDF files exploited as well. Moreover, a URI19 address can be used [13]. Advanced methods for the detection of new or (instead of a URL) to refer to any file type located unknown malicious PDF files are based primarily on clas- remotely (both executable and non-executable files, sifiers induced by ML algorithms. These methods lever- including *.EXE and *.PDF). One interesting scenario age information from features extracted from labeled PDF files using static or dynamic analysis. Static analysis 17 http://www.cvedetails.com. methods examine and evaluate a suspicious file solely by 18 https://blogs.mcafee.com/mcafee-labs/analyzing-the-first-rop-only- analyzing its code, without actually executing it. Alterna- sandbox-escaping-pdf-exploit. tively, dynamic analysis methods execute the file, usually 19 URI According to RFC2396, URI is a compact string of characters used in an isolated environment (sandbox), and examine its for identifying an abstract or physical resource. It is an extension of URL actions and behavior during runtime. used for identifying any web resource (not limited to web pages). Nissim et al. Secur Inform (2016) 5:1 Page 6 of 20

In recent years, the need to enhance security in the PDF files according to the set of embedded keywords face of attacks based on malicious PDF files has led to and their frequencies. The files are characterized by the increased research in this area. Most of the academic presence of specific chosen keywords, for example:/JS,/ work focuses on static analysis, since it requires less com- JavaScript,/Encrypt, obj, stream, filter, etc. The main putational resources and is much faster than dynamic contribution is the ability to detect malicious PDF files analysis. Static analysis methods usually examine and whether or not they contain JavaScript code, unlike pre- inspect the document’s embedded JavaScript code, file viously described tools which detect malicious files only structure, or metadata (such as the number of specific if they contain JavaScript code (e.g., PJScan [7]). How- streams, objects, keywords, etc.). However, static analy- ever, an attacker can learn which keywords characterize sis has drawbacks as well, including the inability to detect benign files and inject these keywords inside a malicious well obfuscated code that acts maliciously during runt- file in order to bypass PDFMS, thus demonstrating the ime, in contrast to dynamic analysis that will likely detect tool’s weakness. that code. Here we present the most relevant studies, Smutz and Stavrou [16] presented PDFRate, a frame- however a more comprehensive analysis of related work work for the detection of malicious PDF files which is can be found in our recent survey study [4]. based on meta-features extracted from a document’s Srndic and Laskov [7] introduced PJScan,20 a static content. The extracted features include the number of analysis and anomaly detection tool for the detection of font objects, average length of stream objects, and the malicious JavaScript code inside a PDF file. After the number of lower case characters in the title. In total, JavaScript code has been found and extracted, a lexical 202 features were chosen for classification. analysis is applied using a JavaScript interpreter. Lexical Pareek and Eswari [17] introduced two different analysis represents the JavaScript code as a sequence of static analysis methods that do not rely upon a PDF tokens. For example, left parenthesis, plus, right paren- parser to extract data from the file. Alternatively, the thesis, etc. By using these tokens PJScan tries to induce methods apply the analysis on the whole file, after its learning detection models that differentiate between content is converted to hexadecimal dumps or byte benign and malicious PDF files. Liu et al. [14] combined sequences. The first method is based on entropy and both static and dynamic analysis to detect potential is used to measure the uncertainty or randomness in infection attempts in the context of JavaScript execution. a given dataset, assuming the level of uncertainty of First, they extract a set of static features, and then they a malicious file should be less than that of a benign insert context monitoring code into a PDF document, a file with a similar format. Low entropy in a file is not code that later cooperates with the runtime monitor used a strong indicator of maliciousness, however it can be for the detection task. Additional work done by Corona a useful detection feature in combination with other et al. [15] in which they presented the Luxor system features. 
The second method leverages the n-gram which applied this combination of static and dynamic approach [18] in order to distinguish between benign analysis as well. Their idea involved translating the JavaS- and malicious PDF files. cript code into an API reference pattern, and accumulat- Srndic and Laskov [6] introduced a method that makes ing the times of presences for every API reference. By use of essential differences in the structural properties doing this they tried to find a discriminative set of refer- of malicious and benign PDF files. Their static method ences that characterizes malware code in order to auto- evaluates documents based on side effects of malicious matically differentiate this code from benign code within content within their structure by evaluating their struc- PDF files. tural paths and the frequencies of these paths. This detec- Yatamanu et al. [9] introduced two different static tion model was found to be effective at differentiating methods for clustering PDF files based on tokenization between malicious and benign PDF files and was effective of their embedded JavaScript. The article focuses on against new unknown malicious files that were created establishing a clustering method for the identification of 2 months after the classification model was built. similar scripts that have been obfuscated using different The following dynamic analysis methods focus on the techniques. For each examined PDF file, a fingerprint is analysis of embedded JavaScript code (which may or may created, consisting of a set of unique JavaScript tokens not reside in a PDF file) during runtime. All of these and their frequencies. methods include a dynamic step, either in the feature Maiorka et al. [8] introduced the PDF Malware Slayer extraction or analysis phase. MDScan [13] and PDF Scru- (PDFMS), a static analysis tool which characterizes tinizer [19] start with a static extraction of the embedded JavaScript code from a PDF file and then execute the extracted code using a JavaScript engine. The extracted 20 The source code of PJScan and its underlying library (libPDFJS) can be found at http://sourceforge.net/p/pjscan/home/Home/. JavaScript code is examined during runtime using Nissim et al. Secur Inform (2016) 5:1 Page 7 of 20

heuristics in order to detect suspicious or malicious examples needed to maintain the performance of an activity. Snow et al. [20] presented ShellOS, a framework ML classifier. Unlike random (or passive) learning, in for the detection of code injection attacks based on code which a classifier randomly selects examples from analysis during runtime. Lu et al. [21] introduced which to learn, the ML based classifier actively indicates MPScan, a technique that integrates static malware the specific examples which are commonly the most detection and dynamic JavaScript de-obfuscation. informative examples for the training task and should MPScan hooks21 the Adobe Reader’s native JavaScript be labeled. SVM-Simple-Margin [24] is a current AL engine; thus embedded codes (JavaScript source and method considered in our experiments. Active learn- operational code) can be extracted during execution and ing was successfully used to enhance the detection of evaluated by the static detection module. unknown computer worms [25] and malicious execut- While the presented dynamic analysis methods execute able files targeting the Windows OS [26]. Such methods the JavaScript code in order to detect malicious behav- are used in the current study in order to enhance the ior, they differ in the way that they extract the JavaS- detection of malicious PDF files. cript code. Generally speaking, the more exhaustive the extraction of JavaScript code is, the better the dynamic Suggested framework and methods analysis can be in terms of detection ability. Nevertheless, A framework for improving detection capabilities an attempt to extract JavaScript code from a file statically Figure 2 illustrates the framework and the process of may fail and result in the extraction of JavaScript code detecting and acquiring new malicious PDF files by main- that does not accurately represent the behavior of the taining the updatability of the anti-virus and detection file. The reasons may be varied; for example, this could model. In order to maximize the suggested framework’s be due to cases in which the code can be well obfuscated contribution, it should be deployed in strategic nodes or located in irregular locations within the PDF file. (such as ISPs and gateways of large organizations) over Dynamic extraction is more robust compared to static the Internet network in an attempt to expand its expo- extraction in these areas. sure to as many new files as possible. Widespread deploy- It is important to clarify that the academic solutions in ment will result in a scenario in which almost every new this category do not perform a dynamic analysis on the file goes through the framework. If the file is informative entire file, rather dynamic analysis is only performed on enough or is assessed as likely being malicious, it will be the JavaScript code that was extracted from the PDF file. acquired for manual analysis. As Fig. 2 shows, the PDF This is in contrast to some commercial solutions that files transported over the Internet are collected and scru- execute the PDF file and examine its behavior and influ- tinized within our framework {1}. Then, the “known files ence on the host operating system during runtime (full module” filters all the known benign and malicious PDF dynamic analysis). Full dynamic analysis consumes signif- files {2} (according to a white list, reputation systems icantly more resources than the dynamic analysis found [27], and the anti-virus signature repository). 
in academic solutions but is probably better at detecting Then, only the unknown PDF files from the previous malicious PDF files, because it provides a better indica- step are checked for their compatibility with PDF speci- tion of the file’s real purpose. Moreover, the full dynamic fications (explained in “Dataset” Section that describes analysis approach is robust against code obfuscation, the dataset collection) {3}. The incompatible PDF files are since it does not analyze string code or pretend to extract immediately blocked (since only compatible files are rel- the code from the file. This approach is most like the evant for organizations and individual users). Note that examination of suspicious code by a security expert with the compatibility check is extremely important since it forensic tools. Wepawet [22], an example of a well-known blocks many of the malicious files (discussed in “Dataset and widely used full-dynamic analysis tool, was pre- collection” section) from being introduced to the next sented in 2010 by Cova et al. In-Nimbo Sandboxing sys- step which is more time consuming and, in some cases, tem for PDF files, presented by Maas et al. [23], is based might even require manual analysis done by a human on conducting vulnerable or malicious computations on expert. The remaining PDF files which are compatible a virtual machine instances in a remote cloud computing and unknown are then introduced to the next step. In environment, and by so doing, preventing the ability of the advanced analysis step {4}, these compatible files are malware to affect the host system. transformed into vector form for the advanced check. In Studies in several domains have successfully applied vector form the files are represented as vectors of the fre- AL in order to improve the efficiency of labeling quencies of structural paths of the PDF files (as will be explained in the following sub-section). The detection 21 Hooking A technique for intercepting functions calls, messages, or events model within this check scrutinizes the PDF files and passed between software components in order to alter an application or operating system behavior. provides two values for each file: a classification decision Nissim et al. Secur Inform (2016) 5:1 Page 8 of 20

Fig. 2 The process of maintaining the updatability of the anti-virus tool using AL methods

using the SVM (Support Vector Machine) classification strategy that selects informative files, both malicious algorithm and a distance calculation from the separat- and benign, that are a short distance from the separating ing hyperplane using Eq. 1. The AL method acquires files hyperplane. which are found to be informative and sends these files The second type of informative files includes those that to an expert for manual analysis and labeling {5}. These lie deep inside the malicious side of the SVM margin and labeled informative files are being added to the training are a maximal distance from the separating hyperplane set which is used to induce new and updated detection according to Eq. 1. These PDF files will be acquired by model. By acquiring these informative PDF files, we aim the exploitation method (described later) and are also to frequently update the anti-virus software by focusing a maximal distance from the labeled files. This distance the expert’s efforts on labeling PDF files that are most is measured by the KFF calculation that will be further likely to be malware or benign PDF files that are expected explained as well. These informative files are then added to improve the detection model. Recall that informative to the training set {6} for updating and retraining the files are those files that when added to the training set detection model {7}. The files that were labeled as mali- improve the detection model’s predictive capabilities and cious are also added to the anti-virus signature reposi- enrich the anti-virus signature repository. Accordingly, in tory in order to enrich and maintain its updatability {8}. our context there are two types of files that may be con- Updating the signature repository also requires an update sidered informative. The first type includes files in which to clients utilizing the anti-virus application. The frame- the classifier has limited confidence as to their classifica- work integrates two main phases: training and detection/ tion (the probability that they are malicious is very close updating. to the probability that they are benign). Acquiring them as labeled examples will probably improve the model’s Training detection capabilities. In practical terms, these PDF files A detection model is trained over an initial training set will have new structural paths or special combinations of that includes both malicious and benign PDF files. After existing structural paths that represent their execution the model is tested over a stream that consists only code (inside the binary code of the executable). There- of unknown files that were presented to it on the first fore these files will probably lie inside the SVM margin day, the initial performance of the detection model is and consequently will be acquired by the SVM-Margin evaluated. Nissim et al. Secur Inform (2016) 5:1 Page 9 of 20

Detection and updating on the hierarchical structural path feature extraction For every unknown PDF file in the stream, the detection method presented by Šrndic and Laskov [6]. Instead model provides a classification, while the AL method of analyzing JavaScript code or any other content, this provides a rank representing how informative the file is. approach makes use of essential differences in the struc- The framework will consider acquiring the files based on tural properties of malicious and benign PDF files. It these two factors. After being selected and receiving their parses the PDF files and extracts their structural paths true labels from the expert, the informative PDF files are which are the paths in the files’ hierarchical object tree acquired by the training set, and the signature reposi- that characterize a document’s structure. Each structural tory is updated as well (in cases which the files are mali- path is analogous to a set of relations between the objects cious). The detection model is retrained over the updated within the PDF file. For example, the “…/JS” path means and extended training set which now also includes the that the PDF file contains JavaScript code. The structural acquired examples that are regarded as being very inform- paths represent the file’s properties and possible actions, ative. At the end of the current day the updated model acting like the file’s genes rather than its current behavior receives a new stream of unknown files on which the which can be postponed or delayed according to specific updated model is once again tested and from which the conditions. Their detection model was based on these updated model acquires additional informative files. Note structural paths and was found to be very effective at that the motive is to acquire as many malicious PDF files differentiating between malicious and benign PDF files, as possible, since such information will maximally update even against new unknown malicious files that were cre- the anti-virus tools that protect most organizations. ated two months after the classification model was built. We employed the SVM classification algorithm using Šrndic and Laskov reported on high TPR and low FPR. the radial basis function (RBF) kernel in a supervised The method was tested against previous detectors: MDS- learning approach. We used the SVM algorithm for the can [13], PJScan [7], ShellOS [20], and PDFMS [8], and following reasons: (1) SVM has been successfully used to the comparison clearly demonstrated the efficiency and detect worms [25, 28], classify malware into species, and resilience of this method in the detection of new mali- detect zero day attacks [29]; (2) the trained SVM classi- cious PDF files over other methods. fier is a black-box that is hard for an attacker to under- Figure 3 provides a simple example of the conversion stand [28]; (3) SVM has proven to be very efficient when of a PDF file into a set of structural paths. The PDF code combined with AL methods [26]; and (4) SVM is known is treated as a tree of objects. Note that only paths of the for its ability to handle large numbers of features which leaves in the structural tree are counted. makes it suitable for handling the number of structural When an attacker injects malicious content into the paths extracted from the PDF files [18]. In our experi- PDF file, the file structure inevitably changes. 
Thus, this ments we used Lib-SVM implementation [30] which also approach can easily discriminate between benign and supports multiclass classification. malicious files. This approach has several advantages. First, it is not affected by code obfuscation, filtering, and Detection of malicious PDF files using structural paths other encryption methods used for hiding and conceal- In order to detect and acquire unknown malicious PDF ing malicious code in the PDF file, since it doesn’t actu- files, we implemented a static analysis approach based ally analyze the embedded JavaScript code. Second, it is

Fig. 3 Example of the conversion of a PDF file to a set of structural paths Nissim et al. Secur Inform (2016) 5:1 Page 10 of 20

robust towards mimicry and reverse mimicry attacks [6]. achieved, because it is based on a rough approximation Finally, it is very fast and lightweight, since the analysis is and relies on assumptions that the VS is fairly symmet- done statically and does not require execution of the PDF ric and the hyperplane’s Normal (W) is centrally placed, file. Because of this, analysis is conducted quite quickly at assumptions that have been shown to fail significantly the rate of 28 ms for an average file [6]. [31]. The method may query instances whose hyper- plane does not intersect the VS and therefore may not Selective sampling and active learning methods be informative. The SVM-Margin method for detecting Since our framework aims to provide solutions to real instances of PC malware was used by Moskovitch et al. problems it must be based on a sophisticated, fast selec- [32, 33] whose preliminary results found that the method tive sampling method which also efficiently identifies also assisted in updating the detection model but not the informative files. We compared our proposed AL meth- anti-virus application itself; however, in this study the ods to other selective sampling methods, and all of these method was only used for a one day-long trial. We com- methods are described below. pare its performance to our proposed AL methods over a longer period, in a daily process of detection and acqui- Random selection (random) sition experiments that better reflect reality. This serves While random selection is obviously not an AL method, as our baseline AL method, and we expect our method it is at the “lower bound” of the selection methods dis- to improve the new malicious PDF detection and acquisi- cussed. We are unaware of an anti-virus tool that uses tion seen in SVM-Margin. an AL method for maintaining and improving its updat- ability. Consequently, we expect that all AL methods will Exploitation: our proposed active learning method perform better than a selection process based on the ran- Our method, “Exploitation,” is based on SVM classi- dom acquisition of files. fier principles and is oriented towards selecting exam- ples most likely to be malicious that lie furthest from the The SVM‑Simple‑Margin AL method (SVM‑Margin) separating hyperplane. Thus, our method supports the The SVM-Simple-Margin method [24] (referred to as goal of boosting the signature repository of the anti-virus SVM-Margin) is directly related to the SVM classifier. tool by acquiring as much new malware as possible. For Using a kernel function, the SVM implicitly projects every file X that is suspected of being malicious, Exploi- the training examples into a different (usually a higher tation rates its distance from the separating hyperplane dimensional) feature space denoted by F. In this space using Eq. 1 based on the Normal of the separating hyper- there is a set of hypotheses that are consistent with the plane of the SVM classifier that serves as the detection training set, and these hypotheses create a linear sepa- model. As explained above, the separating hyperplane of ration of the training set. From among the consistent the SVM is represented by W, which is the Normal of the hypotheses, referred to as the version-space (VS), the separating hyperplane and is actually a linear combina- SVM identifies the best hypothesis with the maximum tion of the most important examples (supporting vec- margin. 
To achieve a situation where the VS contains tors), multiplied by LaGrange multipliers (alphas) and by the most accurate and consistent hypothesis, the SVM- the kernel function K that assists in achieving linear sepa- Margin method selects examples from the pool of unla- ration in higher dimensions. Accordingly, the distance beled examples reducing the number of hypotheses. This in Eq. 1 is simply calculated between example X and the method is based on simple heuristics that depend on the Normal (W) presented in Eq. 2. relationship between the VS and the SVM with the maxi- 1 mum margin, because calculating the VS is complex and Dist(X) = αiyiK(xix) (1) impractical where large datasets are concerned. Exam- n  ples that lie closest to the separating hyperplane (inside  the margin) are more likely to be informative and new to n the classifier, and these examples are selected for labeling = w αiyi�(xi) (2) and acquisition. 1 This method, in contrast to ours, selects examples according to their distance from the separating hyper- In Fig. 4 the files that were acquired (marked with a plane only to explore and acquire the informative files red circle) are those files that are classified as malicious without relation to their classified labels, i.e., not specifi- and are at the maximum distance from the separating cally focusing on malware instances. The SVM-Margin hyperplane. Acquiring several new malicious files that method is very fast and can be applied to real problems, are very similar and belong to the same virus family is yet SVM-Margin’s authors [24] indicate that this agility is considered a waste of manual analysis resources, since Nissim et al. Secur Inform (2016) 5:1 Page 11 of 20

was found to be benign. With this method the distance calculation required for each instance is fast and equiva- lent to the time it takes to classify an instance in a SVM classifier, thus it is applicable for products working in real-time.

Combination: a combined active learning method The “combination” method lies between SVM-Margin and exploitation. On the one hand, the combination method begins by acquiring examples based on SVM- Margin criteria in order to acquire the most informa- tive PDF files, acquiring both malicious and benign files. This exploration phase is important in order to enable the detection model to discriminate between malicious Fig. 4 The criteria by which exploitation acquires new unknown and benign PDF files. On the other hand, the Combina- malicious PDF files. These files lie the farthest from the hyperplane and are regarded as representative files tion method then tries to maximally update the signa- ture repository in an exploitation phase, drawing on the exploitation method. This means that in the early acquisi- tion period, during the first part of the day, SVM-Margin these files will probably be detected by the same signa- leads, compared to exploitation. As the day progresses, ture. Thus, acquiring one representative file for this set of Exploitation becomes predominant. However, the new malicious files will serve the goal of efficiently updat- Combination method, which consists of first conduct- ing the signature repository. In order to enhance the ing exploration by SVM-Margin and then conducting signature repository as much as possible, we also check Exploitation, was also applied in the course of the 10 days the similarity between the selected files using the kernel of the experiment, and over a period of days, combina- farthest-first (KFF) method suggested by Baram et al. tion will perform more exploitation than SVM-Margin. [34] which enables us to avoid acquiring examples that This means that on the ith day there is more exploitation are quite similar. Consequently, only the representative than on the (i–1)th day. We defined and tracked several files that are most likely malicious are selected. In cases configurations over the course of several days. Regard- in which the representative file is detected as malware as ing the relation between SVM-Margin and Exploitation, a result of the manual analysis, all its variants that weren’t we found that a balanced division performs better than acquired will be detected when the anti-virus is updated. other divisions (i.e., during the first half of the study, the In cases in which these files are not actually variants of method will acquire more files using SVM-Margin, while the same malware, they will be acquired the following during the second half of the study, exploitation takes day (after the detection model has been updated), as the leading role in the acquisition of files). In short, this long as they are still most likely to be malware. In Fig. 4 method tries to take the best from both of the previous it can be observed that there are sets of relatively similar methods. files (based on their distance in the kernel space), how- Note that our combination AL method actually ever, only the representative files that are most likely to repeats in cycles of X days according to the configu- be malware are acquired. The SVM classifier defines the rations of the administrator, so that in each cycle, the class margins using a small set of supporting vectors combination method starts with more explorative (i.e., PDF files). While the usual goal is to improve clas- phase (SVM-Margin) in which both informative mali- sification by uncovering (labeling) files from the margin cious and benign PDF files are being acquired. 
Then, area, our primary goal is to acquire malware in order combination conducts more acquisition according to to update the anti-virus. In contrast to SVM-Margin the exploitation AL method in order to acquire more which explores examples that lie inside the SVM mar- PDF malware and enrich the antivirus signature reposi- gin, Exploitation explores the “malicious side” to discover tory. In our experiments we conducted only one cycle new and unknown malicious files that are essential for of 10 days experiment, yet on a long term, this strategy the frequent update of the anti-virus signature repository, will work on a continuous manner in cycles of 10 days, a process which occasionally also results in the discov- in which every cycle will be both consisted of explora- ery of benign files (files which will likely become support tion (SVM-Margin) and exploitation. The administrator vectors and update the classifier). Figure 4 presents an have the privilege to decide what will be the length of example of a file lying far inside the malicious side that the cycles. Nissim et al. Secur Inform (2016) 5:1 Page 12 of 20

Table 2 Our collected dataset categorized as malicious, benign, and incompatible PDF files (bracketed)

Dataset source Year Malicious files Benign files

VirusTotala repository 2012–2014 17,596 (1017) – Srndic and Laskov [6] 2012 27,757 (437) – Contagio project 2010 410 (175) – Internet and Ben-Gurion University (random selection) 2013–2014 0 5145 Total 45,763 (1629) 5145 a https://www.virustotal.com/

Evaluation user, and an error message will likely be displayed on the In this section we present the dataset on which we based screen when an attempt is made to open the file. How- our experiments and elaborate on the composition of ever in such cases involving incompatible malicious PDF the files within the dataset. We also discuss our insights files, the malicious operations will still be executed. A regarding the compatibility of the PDF files. At the con- common incompatibility observed was located at the end clusion of this section we present our experimental of the file between the “startxref” and “ %EOF” lines. This design which was aimed at comprehensively evaluating line should contain a number serving as a reference (off- our framework. set) to where the last cross reference table section is located in the file. In cases of incompatibility, the number Dataset that appears is incorrect. Dataset collection We would like to reiterate that incompatible files are We created a collection of benign and malicious PDF unreadable, because once they are opened they either files. We acquired a total of 50,908 PDF files consisting of cause the PDF reader to crash or generate an error mes- 45,763 malicious and 5145 benign files. The malicious sage sent to the user regarding the corruption of the PDF files aimed at compromising the Windows operating file. However, in cases in which the file is malicious, the system, the most commonly attacked system used by malicious operation is likely to occur whether or not the organizations. The files were collected during the years file has been presented the PDF reader. Regardless as to 2012–2014 from the four sources presented in Table 2. whether the incompatible file is benign or malicious, the All the files were verified to be labeled correctly as mali- file cannot be viewed by the user. Thus, there is no reason cious or benign using Kaspersky22 anti-virus. to transfer such files to the user, and we suggest that they The benign files were collected from Ben-Gurion Uni- get blocked. We observed that 96.5 % of the malicious versity, a very diverse organization that generates an files within our large and representative data collection of enormous number of PDF files from diverse sources, more than 45,000 malicious PDF files were incompatible; including many different academic and administra- based on this, we can therefore assume that 96.5 % of all tive departments. Although the entire benign collection malicious PDF files are incompatible. Because of this, originated from one institution, it should be noted that when one encounters an incompatible PDF file, there is a it is actually quite varied and contains PDF files that are strong chance that it is, in fact, a malicious PDF file. substantially different from one another in their content, Our proposed framework uses these observations to elements, and purpose that were randomly selected from flag files that can initially be blocked from the network the following sources within the university: academic of an organization or private computer, since they cannot papers, automatically generated reports, student assign- be opened correctly; therefore our ML based framework ments, previous exams, administrative forms (also with was applied only to the compatible files (6774). Table 2 active functionality), architectural renderings and plan- presents the number of compatible files in each of our ning documentation, brochures, and newsletters. collections (bracketed). 
In our recent studies [4], [35] we thoroughly investi- Figure 5 presents a general breakdown of the threats gated PDF files and found that most malicious PDF files raised by the compatible malicious PDF files within our (96.5 %) are not compatible with the PDF file format dataset. specifications (checked with the Adobe PDF Reference23). As can be seen, 3 % of the 1629 malicious files were These files cannot be viewed by the PDF reader or the classified as containing a Trojan which is a malicious pro- gram that when executed performs covert actions that

22 http://www.kaspersky.com. have not been permitted by the user. 97 % of the mali- 23 http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/ cious files were identified as exploiting some vulnerability pdf_reference_1-7.pdf. in one or more PDF readers. More specifically, they Nissim et al. Secur Inform (2016) 5:1 Page 13 of 20

capabilities reported upon now. We were unable to find any reliable source that published an accurate percentage of malicious PDF files on the web. Therefore we will rely upon the results of our recent study [36], and when the exact percentage is determined, we will adjust both the training and test sets to that percentage. In addition, in Table 4 we also included the percentage of malicious PDF files in previous and relevant studies in the malicious PDF detection domain. We have done so in order to com- pare the way this was handled in previous studies and to demonstrate that none of the previous studies addressed this issue or seemed to adhere to a figure representing the actual percentage of malicious PDF files on the web. As can be seen, the average percentage used within these studies was 38 % which is very high and not likely rea- sonable. It is important to understand that many of the malicious files on the web haven’t been discovered yet, Fig. 5 Breakdown of the threats identified among the malicious PDF since anti-virus tools are very limited in their detection files found in our dataset capabilities and discover only relatively similar variants of known malwares; in addition, it takes time for anti-virus vendors to discover new 0-day malicious files. Therefore our percentage (24 %) likely represents the upper rea- exploited 23 unique vulnerabilities, including the follow- sonable boundary of the actual percentage of malicious 24 ing common vulnerability and exposures (CVE ): CVE- PDF files. Moreover, in our previous study [26], aimed at 2007-3845, CVE-2008-2551, CVE-2009-0927, the detection of malicious executables using ML and AL CVE-2010-0188, and CVE-2013-0640. The four digits methods, we estimated the malicious executables per- after the “CVE-“represent the year the vulnerability was centage to be approximately 9 %. In that study we there- discovered. As can be seen, our dataset contains mali- fore adjusted our dataset to 9 % malicious files, and our cious files that exploit cross-time vulnerabilities. It is methods worked very well, providing high true positive interesting to note that despite the fact that our dataset rates (TPR) and low false positive rate (FPR) rates. Thus, consists of malicious PDF files that were created from we assume that these methods will be effective in the cur- 2012 to 2014, the files also exploited much older vulnera- rent study as well. bilities, including some discovered during 2007–2011. This indicates that attackers are aware of the fact that Dataset creation many organizations are not up-to-date with the newest As was explained previously, in order to detect and versions of software and thus are exposed to older vul- acquire unknown malicious PDF files, we implemented a nerabilities already discovered, as well as new unknown static analysis approach based on the hierarchical struc- vulnerabilities. In Table 3 we provide brief details regard- tural path feature extraction method presented by Šrndic ing the most commonly exploited vulnerabilities men- and Laskov [6]. Instead of analyzing JavaScript code or tioned above. The table indicates the percentage each any other content, this approach makes use of essential vulnerability represents of the total dataset. differences in the structural properties of malicious and The percent of malicious PDF files in the final dataset benign PDF files. Using the PdfFileAnalyzer25 parser we after removing incompatible files is 24 %. 
This percentage parsed the compatible PDF files (both malicious and is likely higher than the actual percentage of malicious benign, totaling 6774) and extracted all 7963 unique PDF files found on the web. According to our recent paths that were found within our dataset. Each of these study [36] that dealt with the issue of imbalanced datasets paths was used as feature. We didn’t apply a feature selec- in the cyber-security domain, the best detection perfor- tion method, since this number of features can easily be mance is achieved when the percentage of malicious PDF handled by the SVM classifier used in our experiments. files in the training set is equal to the percentage in the Each PDF file was represented as a vector of Boolean fea- test set. We have adhered to this guideline, thus the per- tures so that the presence (1) or absence (0) of a centage should not affect the detection and updatability

25 http://www.codeproject.com/Articles/450254/PDF-File-Analyzer-With- 24 https://cve.mitre.org/. Csharp-Parsing-Classes-Vers. Nissim et al. Secur Inform (2016) 5:1 Page 14 of 20

Table 3 List of the most exploited vulnerabilities in our dataset

The CVE Description Percentage

CVE-2007-3845a The version of some PDF readers was found to allow remote attackers to execute arbitrary commands via certain vectors 49 associated with launching malicious code based on the file extension at the end of the URI CVE-2008-2551b The DownloaderActiveX Control in Icona SpA C6 Messenger allows remote attackers to force the download and execu- 4 tion of arbitrary files via a URL in the prop download url parameter with the propPost download action parameter set to “run” CVE-2009-0927c Stack-based buffer overflow in some adobe reader versions allows remote attackers to execute malicious code via a 5 crafted argument to the ‘getIcon’ method of a ‘Collab’ object. This executed code can exfiltrate sensitive data to a remote server where it can download and execute dangerous payload to the host CVE-2010-0188d Unspecified vulnerability in adobe reader and acrobat allows attackers to cause a denial of service (application crash) or 6 possibly execute malicious code via unknown vectors CVE-2013-0640e Adobe reader and acrobat versions allow remote attackers to execute malicious code or cause a denial of service 32 (memory corruption) a http://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2007-3845 = b http://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2008-2551 = c http://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2009-0927 = d https://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2010-0188 = e https://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2013-0640 =

Table 4 Percentage of malicious PDF files in previous studies

Study Year Core of the study Malicious files percentage

[13] 2011 Dynamic analysis of embedded JavaScript code 9 [7] 2011 Static lexical analysis of embedded JavaScript code 94 [20] 2011 Dynamic analysis of embedded JavaScript code 69 [9] 2012 Static tokenization of embedded JavaScript code 43 [19] 2012 Dynamic analysis of embedded JavaScript code 65 [16] 2012 Static analysis of meta-features 5 [8] 2012 Static analysis of embedded keywords 53 [17] 2013 Static analysis using entropy and n-gram 58 [21] 2013 Dynamic analysis of embedded JavaScript code 29 [6] 2013 Static analysis of structural features 12 Average 38 structural path within a PDF file is represented by 1 or 0 Over a 10-day period, we compared PDF file acquisition respectively. based on AL methods to random selection based on the performance of the detection model. In our acquisition Experimental design experiments we used 6774 compatible PDF files (5145 The objective in our main experiment was to evaluate benign and 1629 malicious) in our repository and created and compare the performance of our new AL methods 10 randomly selected datasets as each one of them con- to the existing selection method, SVM-Margin, on two tains 10 subsets of 620 files representing each day’s stream tasks: of new files. The 574 remaining files were used by the ini- tial training set to induce the initial model. Note that each –– Acquiring as many new unknown malicious PDF files day’s stream contained 620 PDF files. At first we induced as possible on a daily basis in order to efficiently enrich the initial model by training it over the 574 known PDF the signature repository of the anti-virus files. We then tested it on the first day’s stream. Next, from –– Updating the predictive capabilities of the detection this same stream, the selective sampling method selected model that serves as the knowledge store of AL meth- the most informative PDF files according to that method’s ods and improving its ability to efficiently identify the criteria. The informative files were sent to an expert who most informative new malicious PDF files labeled them. The files were later acquired by the training Nissim et al. Secur Inform (2016) 5:1 Page 15 of 20

set which was enriched with an additional X new inform- We now present the results of the core measure in this ative files. When a file was found to be malicious, it was study, the number of new unknown malicious files that immediately used to update the signature repository of were discovered and finally acquired into the training the anti-virus, and an update was also distributed to cli- set and signature repository of the anti-virus software. ents. The process was repeated over the next 9 days. The As explained above, each day the framework deals with performance of the detection model was averaged for 10 620 new PDF files consisting of about 160 new unknown runs over the 10 different datasets that were created. Each malicious PDF files. Statistically, the more files that are selective sampling method was checked separately on 20 selected daily, the more malicious files will be acquired different acts of file acquisition (each consisting of a dif- daily. Yet, using AL methods, we tried to improve the ferent number of PDF files). This means that for each act number of malicious files acquired by means of existing of acquisition, the methods were restricted to acquiring solutions. More specifically, using our methods (exploi- a predefined number of files equal to the amounts that tation and combination) we also sought to improve the followed, denoted as X: 10 files, 20 files, and so on (with number of files acquired by SVM-Margin. gaps of 10 files), until 160 files, and subsequently at 200, Figure 6 presents the number of malicious PDF files 250, 300 and 350 files. We also considered the acquisition obtained by acquiring 160 files daily by each of the four of all the files in the daily stream (referred to as the ALL methods during the course of the 10-day experiment. method), which represents an ideal, but impractical way, Exploitation and combination outperformed the other of acquiring all the new files, and more specifically, all the selection methods. Exploitation was the only one that malicious PDF files. showed an increasing trend from the first day, while The experiment’s steps are summarized as follows combination showed a decrease in the second day and then exhibited an increasing trend on the following days. 1. Inducing the initial detection model from the initial Thus, from days four to 10, it performed like exploitation, available training set, i.e., training set available up to and during these days they both outperformed the other day d (the initial training set includes 574 PDF files) methods (SVM-Margin and random selection). Exploita- 2. Evaluating the detection model on the stream of day tion was successful at acquiring the maximal number of (d + 1) to measure its initial performance malwares from the 160 files acquired daily and to sup- 3. Introduction of the stream of day (d + 1) to the selec- port the results presented in Fig. 6 and our claim, we tive sampling method which chooses the X most performed a single-factor Anova statistical test on the informative files according to its criteria and sends acquisition rates achieved by the exploitation and combi- them to the expert for manual analysis and labeling nation methods. The Anova tests between the AL meth- 4. 
Acquiring the informative files and adding them to ods provided P values lower than the 5 % (alpha) of the the training set, as well as using their extracted signa- significance level; thus the difference between the exploi- ture to update the anti-virus signature repository tation and combination are statistically significant, con- 5. Inducing an updated detection model from the firming that exploitation AL method performed better updated training set and applying the updated model than combination in terms of number of malicious PDF on the stream of the next day (d + 2) files acquired.

This process repeats itself from the first day until the tenth day.

Results We rigorously evaluated the efficiency and effectiveness of our framework, comparing four selective sampling methods: (1) a well-known existing AL method, termed SVM-Simple-Margin (SVM-Margin) and based on [24]; our proposed methods (2) exploitation and (3) combi- nation; and (4) random-selection (random) as a “lower bound”. Each method was checked for all 20 acquisition amounts and the results were the mean of 10 differ- ent folds. Due to space limitations we have depicted the results of the most representative acquisition amount of 160 PDF files which is equal to the maximal mean num- Fig. 6 The number of malicious PDF files acquired by the framework ber of PDF files found in the daily stream. for different methods with acquisition of 160 files daily Nissim et al. Secur Inform (2016) 5:1 Page 16 of 20

On the first day, the number of new malicious PDF files is 128, since the initial detection model was trained on an initial set of 574 labeled PDF files that contained 128 malwares. We decided to induce the initial detection model on 574 files in order to have a stable detection model with sufficient detection performance from the start (92.5 % TPR on the first day) that could still be improved through our AL based framework.

The superiority of exploitation over combination can be observed on the second and third days. During these 2 days, combination conducts more exploration with regard to the acquisition of informative PDF files (both malicious and benign), and therefore acquired far fewer malicious PDF files than the exploitation method, which is oriented towards the acquisition of malicious PDF files. On the second day, exploitation acquired 10 times more PDF malware than combination (140 compared to 14 malicious PDF files), and on the third day exploitation acquired 1.6 times more malicious PDF files than combination (141 compared to 87). From the fourth day, combination performed as well as exploitation with regard to malicious PDF file acquisition, since the exploitation phase of combination became more dominant as the days progressed.

On the tenth day, using combination and exploitation, 93.75 and 92.5 % of the acquired files were malicious (150 and 148 files out of 160); using SVM-Margin, only 13.5 % of the acquired files were malicious (22 files out of 160, which is even less than random). This represents a significant improvement of almost 80 % in unknown malware acquisition. Note that on the tenth day, using random, only 25 % of the acquired PDF files were malicious (40 files out of 160). This is far less than the malware acquisition rates achieved by both combination and exploitation. The trend is very clear from the second day, at which point combination and exploitation each acquired more malicious files than the day before, a finding that demonstrates the impact of updating the detection model, identifying new malwares, and enriching the signature repository of the anti-virus. Moreover, because they are different, the acquired malwares are also expected to be of higher quality in terms of their contribution to the detection model and signature repository.

As far as we could observe, the random selection trend was constant, with no improvement in acquisition capabilities over the 10 days. While the SVM-Margin method showed a decrease in the number of malwares acquired through the fifth day, it showed a very negligible improvement from the sixth day. It can be seen that the performance of our methods was much closer to the ALL line, which represents the maximum number of malicious PDF files that can be acquired each day. This phenomenon can be explained by looking at the ways in which the methods work. SVM-Margin acquires examples about which the detection model is less confident. Consequently, they are considered to be more informative but not necessarily malicious. As was explained previously, SVM-Margin selects new informative PDF files inside the margin of the SVM. Over time, and with the improvement of the detection model towards more malicious files, it seems that the malicious files become less informative (due to the fact that malware writers frequently try to use upgraded variants of previous malwares). Since these new malwares might not lie inside the margin, SVM-Margin may actually be acquiring informative benign, rather than malicious, files. However, our methods, combination and exploitation, are more oriented toward acquiring the most informative files and most likely malware by obtaining informative PDF files from the malicious side of the SVM margin. As a result, an increasing number of new malwares are acquired; in addition, if an acquired benign file lies deep within the malicious side, it is still informative and can be used for learning purposes and to improve the next day's detection capabilities.

Our AL methods outperformed the SVM-Margin method and improved the capabilities of acquiring new PDF malwares and enriching the signature repository of the anti-virus software. In addition, compared to SVM-Margin, our methods also maintained the predictive performance of the detection model that serves as the knowledge store of the acquisition process.

Figure 7 presents the TPR levels and their trends during the course of the 10-day study. SVM-Margin outperformed the other selection methods in the TPR measure, while our AL methods, combination and exploitation, came close to SVM-Margin (SVM-Margin achieved 1 % better TPR rates than combination and 2 % better than exploitation) and performed better than random. To support the results presented in Fig. 7 and our claim, we performed several single-factor ANOVA statistical tests on the TPR rates achieved by the three AL methods and random. The ANOVA test between the three AL methods provided P values higher than the 5 % (alpha) significance level; thus the difference between the AL methods is not statistically significant, confirming that our AL methods performed as well as the exploration method in terms of predictive capabilities. The ANOVA tests between the AL methods and random provided P values lower than the 5 % (alpha) significance level; thus the difference between the AL methods and random is statistically significant, confirming that our AL methods also performed better than random in terms of predictive capabilities. In addition, the performance of the detection model improves as more files are acquired daily, so that on the tenth day of the experiment, the results indicate that by acquiring only a small and well selected set of informative files (25 % of the stream), the detection model can achieve TPR levels (97.7 % with SVM-Margin, 96.7 % with combination, and 96.05 % with exploitation) that are quite close to those achieved by acquiring the whole stream (98.4 %).

Fig. 7 The TPR of the framework over the period of 10 days for different methods through the acquisition of 160 PDF files daily

Figure 8 presents the FPR levels of the four acquisition methods. As can be observed, the FPR rates were low and quite similar among the AL methods. A similar decreasing FPR trend began to emerge on the second day. This decrease indicates an improvement in the detection capabilities and the contribution of the AL methods, in contrast to the increase in FPR rates observed for random from the fifth day. Random had the highest FPR over the course of the 10-day period, which indicates its inability to select the more informative files over the 10 days. Apart from the second day, combination and exploitation achieved quite similar FPR rates, which were slightly higher than those of SVM-Margin. To support the results presented in Fig. 8 and our claim, we performed several single-factor ANOVA statistical tests on the FPR rates achieved by the three AL methods and random. The ANOVA test between the three AL methods provided P values higher than the 5 % (alpha) significance level; thus the difference between the AL methods is not statistically significant, confirming that our AL methods performed as well as the exploration method in terms of the false positives of the predictive capabilities. The ANOVA tests between the AL methods and random provided P values lower than the 5 % (alpha) significance level; thus the difference between the AL methods and random is statistically significant, confirming that our AL methods also performed better than random in terms of the false positives of the predictive capabilities.

On the tenth day, combination and exploitation had an FPR of 0.1 %, while SVM-Margin had an FPR of 0.05 %. This indicates that our methods, exploitation and combination, performed as well as the SVM-Margin method with regard to predictive capabilities (TPR and FPR) but much better than SVM-Margin in acquiring a large number of new PDF malwares daily and in enriching the signature repository of the anti-virus. On each day of the acquisition iteration, we evaluated the learned classifier, and the FPR is presented for the 10-day period. A set of new unknown files, both malicious and benign, is presented to the classifier each day; thus the FPR does not constantly decrease, which would occur if the classifier were tested on the same files each day.

We now compare the detection rate of our framework with the leading anti-virus tools (those with the top TPR rates) commonly used by organizations, utilizing the malicious PDF files within our experimental dataset. We used VirusTotal (http://www.virustotal.com), an anti-virus service that includes many different anti-virus tools, in order to evaluate the detection rates. Figure 9 shows that the most accurate anti-virus, AntiVir, had a detection rate of 92.5 %, while our methods outperformed all of the anti-viruses in the task of detecting new malicious PDF files. Using our full framework, including the SVM based detection model, the AL methods, and the enhancement process, we achieved a TPR of 97.7 % using only 25 % of labeled data daily (160 PDF files out of 620), which means a reduction of 75 % in the security experts' efforts.
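As a minimal illustration of the daily evaluation, the TPR and FPR for one day's stream can be computed from the classifier's predictions, for example with scikit-learn's confusion matrix; this is a sketch, not the authors' evaluation code.

```python
# Minimal sketch: TPR and FPR for one day's stream from true labels and predictions.
# Labels are 1 (malicious) / 0 (benign). Not the authors' evaluation code.
from sklearn.metrics import confusion_matrix

def daily_tpr_fpr(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of malicious files detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of benign files falsely flagged
    return tpr, fpr

# Toy example with invented values:
print(daily_tpr_fpr([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```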
In a nutshell, our AL based framework was able to better induce an updated detection model on a daily basis, outperforming all the anti-virus tools and managing to accomplish this using only a fraction of the new PDF files (25 %)—the most informative portion, consisting of the most valuable information required for updating the knowledge stores of the detection model and anti-virus tools.

Fig. 8 The FPR trends of the framework for different methods based on acquiring 160 PDF files daily

These results demonstrate the efficiency of the framework in maintaining and improving the updatability of the detection model and, ultimately, of the anti-virus tool. These factors also demonstrate the benefits obtained by performing this process on a daily basis; these benefits will likely include economic rewards as well.

Discussion and conclusion
We presented a framework for efficiently updating anti-virus tools with unknown malicious PDF files. With our updated classifier, we can better detect new malicious PDF files, which can then be utilized to maintain an anti-virus tool. Due to the mass creation of new files, especially new malicious PDF files, both the anti-virus and the detection model (classifier) must be updated with new and labeled PDF files. Such labeling is done manually by human experts; thus the goal of the AL component is to focus expert efforts on labeling PDF files that are more likely to be malicious or on PDF files that might add new information about benign files. Our proposed framework is based on our AL methods (exploitation and combination), which were specially designed for acquiring unknown malware. The framework seeks to acquire the most informative PDF files, benign and malicious, in order to improve classifier performance, enabling it to frequently discover new unknown malware and enrich the signature repository of anti-virus tools.

In general, three of the AL methods performed very well at updating the detection model, with our methods, combination and exploitation, outperforming SVM-Margin in the main goal of the study, which is the acquisition of new unknown malicious PDF files. The evaluation of the classifier before and after the daily acquisition showed an improvement in the detection rate, and subsequently more new malwares were acquired. On the tenth day, combination acquired almost seven times more PDF malwares (150) than the number acquired by SVM-Margin (22 PDF malwares) and almost four times more PDF malwares than those acquired by the random method (40 PDF malwares). While our combination and exploitation methods showed an increasing trend in the number of PDF malwares acquired over the course of the 10 days, SVM-Margin showed unstable and poor performance, and random was consistent. Therefore our framework was found to be effective at updating the anti-virus software by acquiring the maximum number of malicious PDF files. It also maintained a well updated model aimed at the daily detection of new and unknown malicious PDF files.

Fig. 9 The TPR of the framework compared to anti-viruses commonly used by organizations

One of our authors is a security expert who works as a virus analyst at one of the known anti-virus companies. According to his experience, it requires up to 30 min for a virus analyst to determine whether a file is malicious or benign using both static and dynamic tools. Therefore, our approach was aimed at focusing the experts' efforts on the most informative files. The manual labeling effort equals the number of files acquired daily multiplied by the analysis time (30 min per file). This means that if 160 files were acquired per day, 80 h would be required each day to analyze all the files. This is the equivalent of ten security experts, a reasonable number of analysts for an anti-virus company. Anti-virus companies can use this framework, adjust it to suit their needs and resources, and thus acquire the desired number of files for analysis.

It is also worth noting some of the advantages of our AL methods compared to SVM-Margin and passive learning. First, using our AL methods it is possible to acquire fewer files while achieving nearly the same detection capabilities as other methods; the other methods must acquire a larger number of files to achieve poorer results. Second, our AL methods acquired a greater number of malicious PDF files daily than the alternative solutions; in so doing, updating the anti-virus software became more efficient.

Our framework is currently based on a feature extractor tailored to PDF files (structural paths, as previously discussed), and consequently our framework is limited to providing updating solutions and detection capabilities for attacks that affect the structural paths within PDF files. Implementing and integrating more feature extractors within the framework will result in more robust detection and updatability capabilities. An additional limitation is the fact that the framework can only provide solutions for PDF files; however, there are many other widely used document types, such as Microsoft Office files (e.g., *.docx, *.xlsx, *.pptx, *.rtf), that have become popular means for launching cyber-attacks aimed at compromising organizations. These types of files are substantially different from PDF files, and thus the framework must be adapted to cope with the challenges they pose.

In future work, in addition to other types of malicious documents, we are interested in extending this framework to Android applications. Due to their resource limitations, mobile devices are extremely dependent on anti-virus solutions that are frequently and efficiently updated. Quite possibly our suggested AL framework could address this problem.
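The analyst-effort estimate given above (30 minutes of manual analysis per acquired file) reduces to simple arithmetic; the short sketch below only restates that calculation, with an assumed 8-hour working day per analyst (an assumption not stated in the text).

```python
# Back-of-the-envelope labeling effort, restating the estimate in the text.
# Assumption (not stated in the text): one analyst works an 8-hour day.
files_per_day = 160
minutes_per_file = 30
hours_per_analyst_day = 8

total_hours = files_per_day * minutes_per_file / 60    # 80 hours of manual analysis per day
analysts_needed = total_hours / hours_per_analyst_day  # about 10 analysts
print(total_hours, analysts_needed)                    # 80.0 10.0
```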

Authors' contributions
NN initiated and led the study and was the main author of the manuscript. His parts included determining the problem, designing the solution and empirical experiments, analyzing the results, and providing the insights. He especially focused on developing the AL framework and AL methods applied in the study. AC focused on the related work done in the malicious PDF detection domain and its background, provided comprehensive explanations regarding the PDF file's structure and the attacks that can be carried out through it, participated in the collection of the PDF files (malicious and benign), and revised the manuscript. RM participated in shaping the AL framework, as well as revising the manuscript. AS participated in the design of the experiments and in the collection of the malicious PDF files, as well as revising the manuscript. ME participated in the collection of the malicious PDF files and in extracting the structural features from the PDF files, and helped to draft the manuscript. OB participated in the collection of the malicious PDF files, provided insights regarding efficient extraction of the features, and participated in revising the manuscript. YE participated in determining the problem, designing the solution and experiments, and provided insights and significant information regarding the need for such a framework in AV companies. All authors read and approved the final manuscript.

Author details
1 Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beersheba, Israel. 2 Department of Biomedical Informatics, Columbia University, New York, USA.

Acknowledgements
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. Thanks to Šrndic and Laskov from the University of Tübingen for sharing their malicious PDF dataset with us. We would like to thank VirusTotal for granting us access to their private services.

Competing interests
The authors declare that they have no competing interests.

Received: 5 March 2015 Accepted: 19 January 2016

References
1. Christodorescu M, Jha S (2004) Testing malware detectors. ACM SIGSOFT Softw Eng Notes 29(4):34–44
2. White SR, Swimmer M, Pring EJ, Arnold WC, Chess DM, Morar JF (1999) Anatomy of a commercial-grade immune system. IBM Research White Paper
3. Kiem H, Thuy NT, Quang TMN (2004) A machine learning approach to anti-virus system. Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, 61–65, 4–7 December, Hanoi, Vietnam
4. Nissim N, Cohen A, Glezer C, Elovici Y (2015) Detection of malicious PDF files and directions for enhancements: a state-of-the-art survey. Comput Secur 48:246–266
5. Maiorca D, Corona I, Giacinto G (2013) Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security
6. Šrndic N, Laskov P (2013) Detection of malicious PDF files based on hierarchical document structure. 20th Annual Network and Distributed System Security Symposium
7. Laskov P, Šrndić N (2011) Static detection of malicious JavaScript-bearing PDF documents. 27th Annual Computer Security Applications Conference
8. Maiorca D, Giacinto G, Corona I (2012) A pattern recognition system for malicious PDF files detection. In: Machine Learning and Data Mining in Pattern Recognition
9. Vatamanu C, Gavriluţ D, Benchea R (2012) A practical approach on clustering malicious PDF documents. J Comput Virol 8(4):151–163
10. Baccas P (2010) Finding rules for heuristic detection of malicious PDFs: with analysis of embedded exploit code
11. Kittilsen J. Detecting malicious PDF documents
12. Hamon V (2013) Malicious URI resolving in PDF documents. J Comput Virol Hacking Tech 9(2):65–76
13. Tzermias Z, Sykiotakis G, Polychronakis M, Markatos EP (2011) Combining static and dynamic analysis for the detection of malicious documents. 4th European Workshop on System Security
14. Liu D, Wang H, Stavrou A (2014) Detecting malicious JavaScript in PDF through document instrumentation. In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE, 100–111
15. Corona I, Maiorca D, Ariu D, Giacinto G (2014) Lux0R: detection of malicious PDF-embedded JavaScript code through discriminant analysis of API references. In: Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop (AISec '14), ACM, New York, NY, USA, 47–57. doi:10.1145/2666652.2666657
16. Smutz C, Stavrou A (2012) Malicious PDF detection using metadata and structural features. 28th Annual Computer Security Applications Conference
17. Pareek H et al (2013) Entropy and n-gram analysis of malicious PDF documents. Int J Eng Res Tech 2(2)
18. Joachims T (1999) Making large scale SVM learning practical
19. Schmitt F, Gassen J, Gerhards-Padilla E (2012) PDF Scrutinizer: detecting JavaScript-based attacks in PDF documents. Tenth Annual International Conference on Privacy, Security and Trust (PST)
20. Snow KZ, Krishnan S, Monrose F, Provos N (2011) SHELLOS: enabling fast detection and forensic analysis of code injection attacks. USENIX Security Symposium
21. Lu X, Zhuge J, Wang R, Cao Y, Chen Y (2013) De-obfuscation and detection of malicious PDF files with high accuracy. 46th Hawaii International Conference on System Sciences (HICSS)
22. Cova M, Kruegel C, Vigna G (2010) Detection and analysis of drive-by-download attacks and malicious JavaScript code. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), ACM, New York, NY, USA, 281–290. doi:10.1145/1772690.1772720
23. Maass M, Scherlis WL, Aldrich J (2014) In-nimbo sandboxing. In: Proceedings of the 2014 Symposium and Bootcamp on the Science of Security, ACM
24. Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66
25. Nissim N, Moskovitch R, Rokach L, Elovici Y (2012) Detecting unknown computer worm activity via support vector machines and active learning. Pattern Anal Appl 15(4):459–475
26. Nissim N, Moskovitch R, Rokach L, Elovici Y (2014) Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Syst Appl. http://dx.doi.org/10.1016/j.eswa.2014.02.053
27. Jnanamurthy HK, Warty C, Singh S (2013) Threat analysis and malicious user detection in reputation systems using mean bisector analysis and cosine similarity (MBACS)
28. Wang X, Yu W, Champion A, Fu X, Xuan D (2007) Detecting worms via mining dynamic program execution. Third International Conference on Security and Privacy in Communication Networks and the Workshops, SecureComm, 412–421
29. Chen Z, Roussopoulos M, Liang Z, Zhang Y, Chen Z, Delis A (2012) Malware characteristics and threats on the internet ecosystem. J Syst Softw 85(7):1650–1672
30. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2(3):27
31. Herbrich R, Graepel T, Campbell C (2001) Bayes point machines. J Mach Learn Res 1:245–279
32. Moskovitch R, Nissim N, Elovici Y (2009) Malicious code detection using active learning. In: Privacy, Security, and Trust in KDD. Springer, Berlin, Heidelberg, 74–91
33. Moskovitch R, Nissim N, Elovici Y (2007) Malicious code detection and acquisition using active learning. IEEE International Conference on Intelligence and Security Informatics (IEEE ISI-2007), Rutgers University, New Jersey, USA
34. Baram Y, El-Yaniv R, Luz K (2004) Online choice of active learning algorithms. J Mach Learn Res 5:255–291
35. Nissim N, Cohen A, Moskovitch R, Barad O, Edry M, Shabtai A, Elovici Y (2014) ALPD: active learning framework for enhancing the detection of malicious PDF files aimed at organizations. IEEE Joint Intelligence and Security Informatics Conference (JISIC)
36. Moskovitch R, Stopel D, Feher C, Nissim N, Japkowicz N, Elovici Y (2009) Unknown malcode detection and the imbalance problem. J Comput Virol 5(4):295–308

2014 IEEE Joint Intelligence and Security Informatics Conference

ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files

Nir Nissim, Aviad Cohen, Robert Moskovitch, Assaf Shabtai, Mattan Edry, Oren Bar-Ad, Yuval Elovici. Department of Information Systems Engineering Ben-Gurion University of the Negev Beer-Sheva, Israel 84105 P.O.B 653 Abstract—Email communication carrying malicious attachments Additionally, these days most users consider non-executables or links is often used as an attack vector for initial penetration of safer and behave less suspiciously toward such files. However, the targeted organization. Existing defense solutions prevent non-executable files are as dangerous as executable files, executables from entering organizational networks via emails, because their readers may contain vulnerabilities that can be therefore recent attacks tend to use non-executable files such as exploited maliciously by attackers. F-Secure’s 2008-2009 PDF. Machine learning algorithms have recently been applied for report4 indicates that the most popular file types for targeted detecting malicious PDF files. These techniques, however, lack an essential element-- they cannot be updated daily. In this study we attack in 2008-2009 were PDF and Microsoft Office files. present ALPD, a framework that is based on active learning Since that time, the number of attacks on Adobe Reader has methods that are specially designed to efficiently assist anti-virus grown. vendors to focus their analytical efforts. This is done by Thus, in order to effectively analyze tens of thousands of new, identifying and acquiring new PDF files that are most likely potentially malicious PDF files on a daily basis, anti-virus malicious, as well as informative benign PDF documents. These vendors have integrated a component of a detection model files are used for retraining and enhancing the knowledge stores. based on machine learning and rule-based algorithms [27] into Evaluation results show that in the final day of the experiment, the core of their signature repository update activities. Combination, one of our AL methods, outperformed all the However, these solutions are often ineffective, because their others, enriching the anti-virus's signature repository with almost seven times more new PDF malware while also improving the knowledge base is not adequately updated. This occurs detection model's performance on a daily basis. because many new, potentially malicious PDF files are created every day, and only a limited number of security researchers — Keywords Active Learning; Machine Learning; PDF; Malware; tasked with manual labeling them. Thus, the problem lies in I. INTRODUCTION prioritizing which PDF files should be acquired, analyzed, and labeled by a human expert. In this study we present ALPD – Since 2009, cyber-attacks against organizations have an active learning based framework for frequently updating increased, and 91% of all organizations were hit by cyber- anti-virus software with PDF files. 1 attacks in 2013 . The vast majority of organizations rely ALPD focuses on improving anti-virus frameworks by heavily on email for internal and external communication, and labeling those PDF files (potentially malicious or informative email has become a very attractive platform from which to benign files) that are most likely to improve the detection initiate cyber-attacks against organizations. 
model's performance and, in so doing, enriching the signature Attackers usually use social engineering in order to encourage repository with as many new PDF malware files as possible, recipients to press a link or open a malicious web page or further enhancing the detection process. Specifically, the 2 attachment. According to Trend Micro , attacks, particularly ALPD framework favors files that contain new content. We those against government agencies and large corporations, are focus on PCs, the platform most used by organizations, and almost entirely dependent upon spear-Phishing emails. An mobile platforms are outside the scope of this study. incident in 2014 aimed at the Israeli Ministry of Defense (IMOD) provides an example of a new type of targeted cyber- II. BACKGROUND attack involving non-executable files attached to an email. 3 Adobe Reader version X, Protected Mode Adobe Reader According to media reports , the attackers posed as IMOD (PMAR), was released in 2011, featuring a new sandbox representatives and sent email messages with a malicious PDF isolated environment aimed at preventing malicious code file attachment which when opened installed a Trojan horse actions from affecting the operating system. Organizations are enabling the attacker to control the computer. This type of not always equipped with the newest version of PDF readers, attack has grown in popularity, in part because executable files leaving them exposed to many types of attacks. For example, a attached to emails are filtered by most email servers. malicious PDF file can run code embedded in the PDF file

(usually JavaScript) or retrieved from a URI that exploits a vulnerability in the PDF viewer to divert the normal execution flow to the malicious code; a PDF file can contain other file

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber-attacks-in-2013/
2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://www.israeldefense.co.il/?CategoryID=512&ArticleID=5766
4 http://www.f-secure.com/weblog/archives/00001676.html

978-1-4799-6364-5/14 $31.00 © 2014 IEEE 91 DOI 10.1109/JISIC.2014.23 types, such as HTML, JavaScript, SWF, XLSX, EXE, or even II. THE SUGGESTED FRAMEWORK AND METHODS another PDF file, which can be used to embed malicious files A. A Framework for Improving Detection Capabilities that are often obfuscated and can be opened without alerting the user; and filters used in PDFs to compress data in order to Figure 1 illustrates the framework and the process of detecting reduce file size or for encoding, can be used by attackers to and acquiring new malicious PDF files by maintaining the conceal malicious content [7] [8]. Recently, Maiorca et al.[4] updatability of the anti-virus and detection model. In order to presented a practical novel evasion technique called "reverse maximize the suggested framework’s contribution, it should mimicry" that was designed to evade state-of-the-art malicious be deployed in strategic nodes (such as ISPs and gateways of PDF detectors based on their logical structure [5]. large organizations) over the Internet network in an attempt to Existing anti-virus software is not effective enough against expand its exposure to as many new files as possible. new non-executable malicious PDF files which are unknown Widespread deployment will result in a scenario in which [3]. Several significant studies pertaining to malicious PDF almost every new file goes through the framework. If the file detection used machine learning (ML) algorithms based on is informative enough or is assessed as likely being malicious, static and dynamic analysis. Academic studies often focus on it will be acquired for manual analysis. As Figure 1 shows, the static analysis which is faster and requires less computing PDF files transported over the Internet are collected and resources, however this method cannot detect well-obfuscated scrutinized within our framework {1}. Then, the "known files code that acts maliciously during runtime. module" filters all the known benign and malicious PDF files Similar to the detection of unknown malicious executables {2} (according to a white list, reputation systems [25] and using static analysis [2] [16], detection of PDF files is mainly anti-virus signature repository). The unknown PDF files are based on induced classifiers that extract and leverage then checked for their compatibility as viable PDF files. The information used for representing the PDF files. Such incompatible PDF files are immediately blocked from being information may include features extracted from JavaScript transported inside the organizations. Since only compatible code embedded in a document [6] [9]; meta-static features, files are relevant for organizations and innocent users, only [11]; or sets of embedded terms and their frequencies [12]. these files are transformed into vector form for the advanced However, an attacker can learn which keywords characterize check {3}. This check is extremely significant since, as can be benign files and inject these keywords inside a malicious file seen in Table 1, most of the malicious files are incompatible in order to evade detection. Another study [13] suggested (96.5%), thus, the incompatibility of a file is a strong using n-grams and entropy that represent the content of the indication of its maliciousness. In vector form the files are PDF file. 
Srndic and Laskov [5] introduced a high represented as vectors of the frequencies of structural paths of performance static method for the detection of malicious PDF the PDF files (as explained in Section IV). documents which, instead of analyzing JavaScript or any other The remaining PDF files which are compatible and unknown content, makes use of essential differences in the structural are then introduced to the detection model based on SVM and properties of malicious and benign PDF files. The method was AL. The detection model scrutinizes the PDF files and tested against previous detectors: MDScan [3], PJScan [6], provides two values for each file: a classification decision ShellOS [15] and PDFMS [12], and the comparison clearly using the SVM classification algorithm and a distance demonstrated the efficiency and resilience of this method in calculation from the separating hyperplane using Equation 1 the detection of new malicious PDF files. {4}. A file that the AL method recognizes as informative and In dynamic analysis, the file or application is executed and which it has indicated should be acquired is sent to an expert monitored during runtime [1] [17] [19]. Snow et al.[15] who manually analyzes and labels it {5}. By acquiring these presented ShellOS, a framework for the detection of code informative PDF files, we aim to frequently update the anti- injection attacks, based on code analysis during runtime virus software by focusing the expert’s efforts on labeling (dynamic analysis). Additional methods based on both static PDF files that are most likely to be malware or on benign PDF and dynamic analysis are [3] [6] [10] [14]. files that are expected to improve the detection model. Note Studies in several domains have successfully applied active that informative files are defined as those that when added to learning in order to improve the efficiency of labeling the training set improve the detection model's predictive examples needed to maintain the performance of a machine capabilities and enrich the anti-virus signature repository. learning classifier. Unlike random (or passive) learning, in Accordingly, in our context, there are two types of files that which a classifier randomly selects examples from which to may be considered informative. The first type includes files in learn, the classifier actively indicates the specific examples which the classifier has limited confidence as to their which are commonly the most informative examples for the classification (the probability that they are malicious is very training task and should be labeled. SVM-Simple-Margin [18] close to the probability that they may be benign). Acquiring is a current AL method considered in our experiments. them as labeled examples will probably improve the model’s Active learning was successfully used to enhance the detection detection capabilities. In practical terms, these PDF files will of unknown computer worms [17] and of malicious executable have new structural paths or special combinations of existing files targeting the Windows OS [16]. Such methods are used structural paths that represent their execution code (inside the in this study in order to enhance the detection of malicious binary code of the executable). Therefore these files will PDF files. probably lie inside the SVM margin and consequently will be

92 acquired by the SVM-Margin strategy that selects informative since such information will maximally update the anti-virus files, both malicious and benign, that are a short distance from tool that protects most organizations. the separating hyperplane. We employed the SVM classification algorithm using the The second type of informative files includes those that lie radial basis function (RBF) kernel in a supervised learning deep inside the malicious side of the SVM margin and are a approach. We used the SVM algorithm for the following maximal distance from the separating hyperplane according to reasons: 1) SVM has been successfully used to detect worms Equation 1. These PDF files will be acquired by the [17] [19], classify malware into species, and detect zero day Exploitation method (will be farther explained) and are also a attacks [20], 2) the trained SVM classifier is a black-box that maximal distance from the labeled files. This distance is is hard for an attacker to understand [19], 3) SVM has proven measured by the KFF calculation that will be farther explained to be very efficient when combined with AL methods [16], as well. These informative files are then added to the training and 4) SVM is known for its ability to handle large numbers set {6} for updating and retraining the detection model {7}. of features which makes it suitable for handling the large The files that were labeled as malicious are also added to the number of structural paths extracted from the PDF files [21]. anti-virus signature repository in order to enrich and maintain In our experiments we used Lib-SVM implementation [22] its updatability {8}. Updating the signature repository also which also supports multiclass classification. requires an update to clients utilizing the anti-virus B. Selective Sampling and Active Learning Methods application. The framework integrates two main phases: training and detection/updating. Since our framework aims to provide solutions to real problems it must be based on a sophisticated, fast, and selective high-performance sampling method. We compared our proposed AL methods to other strategies, and all the methods considered are described below. 1) Random Selection (Random) While random selection is obviously not an active learning method, it is at the "lower bound" of the selection methods discussed. We are unaware of an anti-virus tool that uses an active learning method for maintaining and improving its updatability. Consequently, we expect that all AL methods will perform better than a selection process based on the random acquisition of files. 2) The SVM-Simple-Margin AL Method (SVM-Margin) The SVM-Simple-Margin method [18] (referred to as SVM- Figure 1: The process of maintaining the updatability of the anti-virus Margin) is directly related to the SVM classifier. Using a tool using AL methods. kernel function, the SVM implicitly projects the training examples into a different (usually a higher dimensional) Training: A detection model is trained over an initial training feature space denoted by F. In this space there is a set of set that includes both malicious and benign PDF files. After hypotheses that are consistent with the training set, and these the model is tested over a stream that consists only of hypotheses create a linear separation of the training set. From unknown files that were presented to it on the first day, the among the consistent hypotheses, referred to as the version- initial performance of the detection model is evaluated. 
space (VS), the SVM identifies the best hypothesis with the Detection and updating: For every unknown PDF file in the maximum margin. To achieve a situation where the VS stream, the detection model provides a classification while the contains the most accurate and consistent hypothesis, the AL method provides a rank representing how informative the SVM-Margin AL method selects examples from the pool of file is, and the framework will consider acquiring the files unlabeled examples reducing the number of hypotheses. This based on this. After being selected and receiving their true method is based on simple heuristics that depend on the labels from the expert, the informative PDF files are acquired relationship between the VS and SVM with the maximum by the training set, and the signature repository is updated as margin because calculating the VS is complex and impractical well, just in case the files are malicious. The detection model where large datasets are concerned. Examples that lie closest is retrained over the updated and extended training set which to the separating hyperplane (inside the margin) are more now also includes the acquired examples that are regarded as likely to be informative and new to the classifier, and these being very informative. At the end of the day, the updated examples are selected for labeling and acquisition. model receives a new stream of unknown files on which the This method, in contrast to ours, selects examples according to updated model is once again tested and from which the their distance from the separating hyperplane only to explore updated model again acquires informative files. Note that the and acquire the informative files without relation to their motive is to acquire as many malicious PDF files as possible classified labels, i.e., not specifically focusing on malware

93 instances. The SVM-Margin AL method is very fast and can malicious are selected. In cases in which the representative file be applied to real problems, yet, as its authors indicate [18], is detected as malware as a result of the manual analysis, all its this agility is achieved because it is based on a rough variants that weren't acquired will be detected the moment the approximation and relies on assumptions that the VS is fairly anti-virus is updated. In cases in which these files are not symmetric and that the hyperplane's Normal (W) is centrally actually variants of the same malware, they will be acquired placed, assumptions that have been shown to fail significantly the following day (after the detection model has been [26]. The method may query instances whose hyperplane does updated), as long as they are still most likely to be malware. In not intersect the VS and therefore may not be informative. The Figure 2 it can be observed that there are sets of relatively SVM-Margin method for detecting instances of PC malware similar files (based on their distance in the kernel space), was used by Moskovitch et al. [23] whose preliminary results however, only the representative files that are most likely to be found that the method also assisted in updating the detection malware are acquired. The SVM classifier defines the class model but not the anti-virus application itself; however, in this margins using a small set of supporting vectors (i.e., PDF study the method was only used for one day-long trial. We files). While the usual goal is to improve classification by compare its performance to our proposed AL methods for a uncovering (labeling) files from the margin area, our main longer period, in a daily process of detection and acquisition goal is to acquire malware in order to update the anti-virus. experiments, as actually happens in reality. This serves as our Contrary to SVM-Margin which explores examples that lie baseline AL method, and we expect our method to improve inside the SVM margin, Exploitation explores the "malicious the new malicious PDF detection and acquisition seen in side" to discover new and unknown malicious files that are SVM-Margin. essential for the frequent update of the anti-virus signature repository, a process which occasionally also results in the 3) Exploitation: Our Proposed Active Learning Method discovery of benign files (files which will likely become Our method, "Exploitation," is based on SVM classifier support vectors and update the classifier). Figure 2 presents an principles and is oriented towards selecting examples most example of a file lying far inside the malicious side that was likely to be malicious that lie furthest from the separating found to be benign. The distance calculation required for each hyperplane. Thus, our method supports the goal of boosting instance in this method is fast and equal to the time it takes to the signature repository of the anti-virus tool by acquiring as classify an instance in a SVM classifier, thus it is applicable much new malware as possible. For every file X that is for products working in real-time. suspected of being malicious, Exploitation rates its distance from the separating hyperplane using Equation 1 based on the Normal of the separating hyperplane of the SVM classifier that serves as the detection model. 
As explained above, the separating hyperplane of the SVM is represented by W, which is the Normal of the separating hyperplane and actually a linear combination of the most important examples (supporting vectors), multiplied by LaGrange multipliers (alphas) and by the kernel function K that assists in achieving linear separation in higher dimensions. Accordingly, the distance in Equation 1 is simply calculated between example X and the Normal (W) presented in Equation 2.

Dist(X) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, X) \quad (1)

w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad (2)
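As an illustration of how the Exploitation criterion of Equation 1 can be realized with an off-the-shelf SVM, the sketch below ranks the unlabeled files by their signed distance from the separating hyperplane (scikit-learn's decision_function, which corresponds to Equation 1 up to the intercept term) and keeps the candidates that lie farthest on the malicious side; a simple pairwise-similarity test stands in for the kernel farthest-first (KFF) check discussed in the text. This is a sketch under these assumptions, not the authors' implementation.

```python
# Sketch of the Exploitation selection criterion (illustrative, not the authors' implementation).
# decision_function gives the signed distance from the separating hyperplane (Equation 1 plus
# the intercept); positive values are assumed here to correspond to the "malicious" class (label 1).
# The similarity test below is a crude stand-in for the KFF check used to skip near-duplicates.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def exploitation_select(model: SVC, X_stream: np.ndarray, budget: int,
                        similarity_threshold: float = 0.95) -> list:
    distances = model.decision_function(X_stream)  # signed distance for every candidate
    order = np.argsort(-distances)                 # farthest on the malicious side first
    chosen: list = []
    for i in order:
        if distances[i] <= 0 or len(chosen) >= budget:  # only the malicious side, up to the budget
            break
        if chosen:  # skip candidates too similar to already chosen files
            if rbf_kernel(X_stream[[i]], X_stream[chosen]).max() > similarity_threshold:
                continue
        chosen.append(int(i))
    return chosen
```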

In Figure 2 the files that were acquired (marked with a red Figure 2: The criteria by which Exploitation acquires new unknown circle) are those files classified as malicious and have malicious PDF files. These files lie the farthest from the hyperplane maximum distance from the separating hyperplane. Acquiring and are regarded as representative files. several new malicious files that are very similar and belong to 4) Combination: A Combined Active Learning Method the same virus family is considered a waste of manual analysis resources since these files will probably be detected by the The "Combination" method lies between SVM-Margin and same signature. Thus, acquiring one representative file for this Exploitation. On the one hand, the combination method begins set of new malicious files will serve the goal of efficiently by acquiring examples based on SVM-Margin criteria in order updating the signature repository. In order to enhance the to acquire the most informative PDF files, acquiring both signature repository as much as possible, we also check the malicious and benign files, and this exploration phase is similarity between the selected files using the kernel farthest- important in order to enable the detection model to first (KFF) method suggested by Baram et al. [24] which discriminate between malicious and benign PDF files. On the enables us to avoid acquiring examples that are quite similar. other hand, the combination method then tries to maximally Consequently, only the representative files that are most likely update the signature repository in an exploitation phase, drawing on the Exploitation method. This means that in the

94 early acquisition period, during the first part of the day, SVM- to a set of relations between the objects within the PDF file. Margin leads, compared to Exploitation. As the day Those paths were found to be very informative and capable of progresses, Exploitation becomes predominant. However, the discriminating between benign and malicious PDF files. In combination was also applied in the course of the 10-day addition, this feature extraction approach is not affected by experiment, and over a period of days, the combination will code obfuscation, filtering, and other encryptions methods. perform more Exploitation than SVM-Margin. This means We parsed and represented the compatible PDF files using the that on the ith day there is more Exploitation than in the (i-1)th PdfFileAnalyzer6 parser. We used all the 7,963 unique paths day. We defined and tracked several configurations over the that were extracted based on our preliminary experiments course of several days. Regarding the relation between SVM- (space constraints prevent further discussion here). Margin and Exploitation, we found that a balanced division B. Experimental Design performs better than other divisions (i.e., during the first half of the study, the method will acquire more files using SVM- The objective in our main experiment was to evaluate and Margin, while during the second half of the study, compare the performance of our new AL methods to the Exploitation takes the leading role in the acquisition of files). existing selection method, SVM-Margin, on two tasks: In short, this method tries to take the best from both of the - Acquiring as many new unknown malicious PDF files as previous methods. possible on a daily basis in order to efficiently enrich the signature repository of the anti-virus. III. EVALUATION - Updating the predictive capabilities of the detection model A. Data Set Creation that serves as the knowledge store of AL methods and We created a dataset of malicious and benign PDF files for the improving its ability to efficiently identify the most Windows OS, the most commonly attacked system used by informative new malicious PDF files. organizations. We acquired a total of 50,908 PDF files, Over a 10-day period, we compared PDF file acquisition including 45,763 malicious and 5,145 benign files, from the based on AL methods to random selection based on the four sources presented in Table 1. The benign files were performance of the detection model. In our acquisition reported to be virus free by Kaspersky anti-virus software. experiments we used 6,774 compatible PDF files (5,145 Many of the malicious files are not compatible with the PDF benign, 1,629 malicious) in our repository and created 10 file format specifications according to the Adobe PDF 5 randomly selected datasets with each dataset containing 10 Reference and cannot be opened by the PDF reader and sub-sets of 620 files representing each day’s stream of new viewed by the user. In cases involving malicious PDF files, files. The 574 remaining files were used by the initial training the malicious operations will be executed. Our proposed set to induce the initial model. Note that each day’s stream framework uses these observations to flag files that can contained 620 PDF files. At first, we induced the initial model initially be blocked from the network of an organization or by training it over the 574 known PDF files. We then tested it private computer, to be marked as suspicious and sent for on the first day’s stream. 
Next, from this same stream, the deeper analysis. A common incompatibility observed was an selective sampling method selected the most informative PDF incorrect cross reference table, the table responsible for the files according to that method’s criteria. The informative files relation between objects within the PDF file. We provided the were sent to an expert who labeled them. The files were later number of compatible files in each of our collected datasets in acquired by the training set which was enriched with an Table 1 (bracketed). The malicious set contains several additional X new informative files. When a file was found to malware families such as viruses, Trojans and backdoors. be malicious, it was immediately used to update the signature Malicious Benign repository of the anti-virus, and an update was also distributed Dataset Source Year files files to clients. The process was repeated over the next nine days. VirusTotal repository 2012-2014 17,596 (1,017) - The performance of the detection model was averaged for 10 Srndic and Laskov [5] 2012 27,757 (437) - runs over the 10 different datasets that were created. Each Contagio project 2010 410 (175) - selective sampling method was checked separately on 20 Internet and Ben-Gurion 2013-2014 0 5,145 different acts of file acquisitions (each consisting of a different University (random selection) Total 45,763 (1,629) 5,145 amount of PDF files). This means that for each act of Table 1: Our collected dataset categorized as malicious, benign and acquisition, the methods were restricted to acquiring a number incompatibles PDF files. of files equal to the amounts that followed, denoted as X: 10 files, 20 files and so on until 160 files had been acquired (with In order to detect and acquire unknown malicious PDF files, gaps of 10 files), 200, 250, 300 and 350. We also considered we implemented a static analysis approach based on the the acquisition of all the files in the daily stream (referred to as hierarchical structural path feature extraction method the ALL method), which represents an ideal, but not a feasible presented by Šrndic and Laskov [5] which parses PDF files way, of acquiring all the new files and more specifically, all and extracts structural paths. Each structural path is analogous the malicious PDF files.
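The structural-path representation described above can be illustrated with a short vectorization sketch. The parsing step itself (done in the study with the PdfFileAnalyzer parser) is abstracted away behind a hypothetical list of extracted paths, and the example paths and the use of scikit-learn's DictVectorizer are illustrative choices rather than the authors' implementation.

```python
# Sketch: turning per-file structural-path counts into fixed-length frequency vectors.
# The path lists below are invented examples; only the vectorization step is shown.
from collections import Counter
from typing import Dict, List
from sklearn.feature_extraction import DictVectorizer

def path_counts(structural_paths: List[str]) -> Dict[str, int]:
    # e.g. ["/Root/Pages/Kids", "/Root/OpenAction/JS", "/Root/Pages/Kids"] -> counts per path
    return dict(Counter(structural_paths))

# Toy example standing in for two parsed PDF files:
docs = [
    path_counts(["/Root/Pages/Kids", "/Root/Pages/Kids", "/Root/Metadata"]),
    path_counts(["/Root/OpenAction/JS", "/Root/Pages/Kids"]),
]
vectorizer = DictVectorizer(sparse=True)  # one column per unique structural path
X = vectorizer.fit_transform(docs)        # frequency matrix used to train the SVM
print(vectorizer.get_feature_names_out())
print(X.toarray())
```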

5 http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
6 http://www.codeproject.com/Articles/450254/PDF-File-Analyzer-With-Csharp-Parsing-Classes-Vers

95 The experiment’s various steps are as follows: method, Combination, succeeded in acquiring the maximal 1. Inducing the initial detection model from the initial number of malwares from the 160 files acquired daily. available training set; i.e., training set available up to day d On the first day, the number of new malicious PDF files is 128 (the initial training set includes 574 PDF files). since the initial detection model was trained on an initial set of 2. Evaluating the detection model on the stream of day (d+1) 574 labeled PDF files that consisted of 128 malwares. We to measure its initial performance. decided on 574 files from which the initial detection model would be induced in order to have a stable detection model 3. Introduction of the stream of day (d+1) to the selective with sufficient detection performance from the start (92.5% sampling method, which chooses the X most informative TPR in the first day) that can still be improved through our files according to its criteria and sends them to the expert for active learning based framework. manual analysis and labeling. 4. Acquiring the informative files and adding them to the training set, as well as using their extracted signature to update the anti-virus signature repository. 5. Inducing an updated detection model from the updated training-set and applying the updated model on the stream of the next day (d+2). This process repeats itself on our dataset from first day until the tenth day.

IV. RESULTS
We rigorously evaluated the efficiency and effectiveness of our framework, comparing four selective sampling methods: (1) a well-known existing AL method, termed SVM-Simple-Margin (SVM-Margin), based on [18]; our proposed methods (2) Exploitation and (3) Combination; and (4) random selection (Random) as a "lower bound". Each method was checked for all 20 acquisition amounts, and the results were the mean of 10 different folds. Due to space limitations we depict the results of the most representative acquisition amount of 160 PDF files, which equals the maximal mean number of PDF files found in the daily stream. We now present the results of the core measure in this study, the number of new unknown malicious files that were discovered and finally acquired into the training set and the signature repository of the anti-virus software. As explained above, each day the framework deals with 620 new PDF files, consisting of about 160 new unknown malicious PDF files. Statistically, the more files that are selected daily, the more malicious files will be acquired daily. Yet, using AL methods, we tried to improve the number of malicious files acquired by means of existing solutions. More specifically, using our methods (Exploitation and Combination) we also sought to improve the number of files acquired by SVM-Margin.
Figure 3 presents the number of malicious PDF files obtained by acquiring 160 files daily, by each of the four methods, during the course of the 10-day experiment. Exploitation and Combination outperformed the other selection methods. Exploitation was the only method that showed an increasing trend from the first day, while Combination had a decrease on the second day and an increasing trend on the following days. Thus, from the fourth day to the tenth it performed like Exploitation, and they both outperformed all the other methods, both SVM-Margin and Random selection. The Combination method succeeded in acquiring the maximal number of malwares from the 160 files acquired daily. On the first day, the number of new malicious PDF files is 128, since the initial detection model was trained on an initial set of 574 labeled PDF files that consisted of 128 malwares. We decided on 574 files from which the initial detection model would be induced in order to have a stable detection model with sufficient detection performance from the start (92.5% TPR on the first day) that can still be improved through our active learning based framework.

Figure 3: The number of malicious PDF files acquired by the framework for different methods with acquisition of 160 files daily.

On the tenth day, using Combination and Exploitation, 93.5% and 92.5% of the acquired files were malicious (150 and 148 files out of 160, respectively); using SVM-Margin, only 13.5% of the acquired files were malicious (22 files out of 160, which is even less than Random). This represents a significant improvement of almost 80% in unknown malware acquisition. Note that on the tenth day, using Random, only 25% of the acquired PDF files were malicious (40 files out of 160). This is far less than the malware acquisition rates achieved by both Combination and Exploitation. The trend is very clear from the second day: each day, Combination and Exploitation acquired more malicious files than the day before, a finding that demonstrates the impact of updating the detection model for identifying new malwares and enriching the signature repository of the anti-virus. Moreover, the acquired malwares are expected to have higher quality in terms of their contribution to the detection model, as well as to the signature repository, since they are different.
As far as we could observe, the random selection trend was constant; there was no improvement in acquisition capabilities over the 10 days. While the SVM-Margin AL method showed a decrease in the number of malwares acquired through the fifth day, it showed a very negligible improvement from the sixth day. It can be seen that the performance of our methods was much closer to the ALL line, which represents the maximum number of malicious PDF files that can be acquired each day. This phenomenon can be explained by looking at the ways in which the methods work.

The SVM-Margin method acquires examples about which the detection model is less confident. Consequently, they are considered to be more informative but not necessarily malicious. As was explained previously, SVM-Margin selects new informative PDF files inside the margin of the SVM. Over time, with the improvement of the detection model towards more malicious files, it seems that the malicious files are less informative (due to the fact that malware writers frequently try to use upgraded variants of previous malwares). Since these new malwares might not lie inside the margin, SVM-Margin may actually be acquiring informative benign, rather than malicious, files. However, our methods, Combination and Exploitation, are more oriented toward acquiring the most informative files and most likely malware by obtaining informative PDF files from the malicious side of the SVM margin. As a result, an increasing number of new malwares are acquired; in addition, if an acquired benign file lies deep within the malicious side, it is still informative and can be used for learning purposes and to improve the next day's detection capabilities.
We have shown here that our AL methods outperformed the SVM-Margin AL method and improved the capabilities for acquiring new PDF malwares and enriching the signature repository of the anti-virus software. In addition, our methods, compared to SVM-Margin, also maintain the predictive performance of the detection model that serves as the knowledge store of the acquisition process. Figure 4 presents the TPR levels and their trends over the 10-day course of the study. SVM-Margin outperformed the other selection methods in the TPR measure, while our AL methods, Combination and Exploitation, came close to SVM-Margin (SVM-Margin achieved 1% better TPR rates than Combination and 2% better than Exploitation) and performed better than Random. In addition, the performance of the detection model improves as more files are acquired daily, so that on the tenth day of the experiment the results indicate that by acquiring only a small and well selected set of informative files (25% of the stream), the detection model can achieve TPR levels (97.7% with SVM-Margin, 96.7% with Combination and 96.05% with Exploitation) that are quite close to those achieved by acquiring the whole stream (98.4%).

Figure 4: The TPR of the framework over the 10 days for different methods through the acquisition of 160 PDF files daily.

These results will probably have economic implications and demonstrate the efficiency of the framework in maintaining and improving the updatability of the detection model and ultimately of the anti-virus tool. These factors demonstrate the benefits obtained by performing this process on a daily basis. Figure 5 presents the FPR levels of the four acquisition methods. As can be observed, the FPR rates were low and quite similar among the AL methods. A similar decreasing FPR trend began to emerge on the second day. This decrease indicates an improvement in the detection capabilities and the contribution of the AL methods, contrary to the increase in FPR rates for Random from the fifth day. Random had the highest FPR over the course of the 10 days, which indicates the selection of minimally informative files that contributed little to the update of the detection model over the 10 days. Apart from the second day, Combination and Exploitation achieved quite similar FPR rates which were slightly higher than SVM-Margin. On the tenth day, Combination and Exploitation had an FPR of 0.1% while SVM-Margin had an FPR of 0.05%. This indicates that our methods, Exploitation and Combination, performed as well as the SVM-Margin method with regard to predictive capabilities (TPR and FPR) but much better than SVM-Margin in acquiring a large number of new PDF malwares daily and in enriching the signature repository of the anti-virus. On each day of the acquisition iteration, we evaluated the learned classifier, and the FPR is presented for the 10-day period. A set of new unknown applications, both malwares and benign, is presented to the classifier each day, thus the FPR does not constantly decrease as would have occurred if the classifier was tested on the same files each day.

Figure 5: The FPR trends of the framework for different methods based on acquiring 160 PDF files daily.
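To make the difference between the acquisition criteria concrete, the sketch below ranks a day's stream by the signed SVM decision value, assuming the positive class denotes malicious. It is a simplified illustration using scikit-learn's decision_function; the even split used for Combination and the handling of duplicates are assumptions, not the paper's exact procedure.

import numpy as np

def select_for_acquisition(svm, X_stream, method, budget=160, seed=0):
    # Signed distance of each file from the separating hyperplane
    # (positive = classified as malicious).
    scores = svm.decision_function(X_stream)
    if method == "svm-margin":
        # Most uncertain files: smallest absolute distance (inside the margin).
        order = np.argsort(np.abs(scores))
    elif method == "exploitation":
        # Files most likely to be malicious: deepest inside the malicious side.
        order = np.argsort(-scores)
    elif method == "combination":
        # Exploration (margin) first, then exploitation; an even split is assumed here.
        half = budget // 2
        merged = np.concatenate([np.argsort(np.abs(scores))[:half], np.argsort(-scores)])
        _, first = np.unique(merged, return_index=True)   # drop duplicates, keep first occurrence
        order = merged[np.sort(first)]
    else:  # "random" lower bound
        order = np.random.default_rng(seed).permutation(len(scores))
    return order[:budget]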

V. DISCUSSION AND CONCLUSION
We presented a framework for efficiently updating anti-virus tools with unknown PDF malware files. With our updated classifier, we can better detect new malicious PDF files that can be utilized for sustaining an anti-virus tool. Both the anti-virus and the detection model (classifier) must be updated with new and labeled PDF files. Such labeling is done manually by human experts, thus the goal of the active learning is to focus expert efforts on labeling files that are more likely to be PDF malware or PDF files that might add new information about benign files. Our proposed framework is based on our active learning methods (Exploitation and Combination), specially designed for acquiring unknown malware. The framework seeks to acquire the most informative PDF files, benign and malicious, in order to improve classifier performance, enabling it to frequently discover new unknown malware and enrich the signature repository of anti-virus tools.
In general, three of the AL methods performed very well in updating the detection model, with our methods, Combination and Exploitation, outperforming SVM-Margin in the main goal of the study, which is the acquisition of new unknown malicious PDF files. The evaluation of the classifier before and after the daily acquisition showed an improvement in the detection rate, and subsequently more new malwares were acquired. On the 10th day, Combination acquired almost seven times more PDF malware (150) than the number acquired by SVM-Margin (22 PDF malware) and almost four times more PDF malware than the Random method acquired (40 malware). While our Combination and Exploitation methods showed an increasing trend in the number of PDF malware acquired over the course of the 10 days, SVM-Margin showed an unstable and poor performance, whereas Random was consistent. Therefore our framework was found to be effective in updating the anti-virus software by acquiring the maximum number of malicious PDF files. It also maintains a well updated model that is aimed at daily detection of new and unknown malicious PDF files.
In future work, we are interested in extending this framework to Android applications. Due to their resource limitations, mobile devices are very dependent on anti-virus solutions that should be frequently and efficiently updated. Quite possibly our suggested AL framework could address this problem.

VI. ACKNOWLEDGMENT
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. We would like to thank Roy Nissim for assisting in the efficient implementation aspects. We also thank Šrndić and Laskov from the University of Tübingen for sharing their malicious PDF dataset with us.

VII. REFERENCES
[1] R. Moskovitch, Y. Elovici and L. Rokach. Detection of unknown computer worms based on behavioral classification of the host. Comput. Stat. Data Anal., 52(9), pp. 4544-4566, 2008.
[2] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz and Y. Elovici. Unknown malcode detection and the imbalance problem. Journal in Computer Virology, 5(4), pp. 295-308, 2009.
[3] Z. Tzermias, G. Sykiotakis, M. Polychronakis and E. P. Markatos. Combining static and dynamic analysis for the detection of malicious documents. 4th European Workshop on System Security, 2011.
[4] D. Maiorca, I. Corona and G. Giacinto. Looking at the bag is not enough to find the bomb: An evasion of structural methods for malicious PDF files detection. 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, 2013.
[5] N. Šrndić and P. Laskov. Detection of malicious PDF files based on hierarchical document structure. 20th Annual Network & Distributed System Security Symposium, 2013.
[6] P. Laskov and N. Šrndić. Static detection of malicious JavaScript-bearing PDF documents. 27th Annual Computer Security Applications Conference, 2011.
[7] P. Baccas. Finding rules for heuristic detection of malicious PDFs: With analysis of embedded exploit code. pp. 12-12, 2010.
[8] J. Kittilsen. Detecting malicious PDF documents.
[9] C. Vatamanu, D. Gavriluţ and R. Benchea. A practical approach on clustering malicious PDF documents. Journal in Computer Virology, 8(4), pp. 151-163, 2012.
[10] F. Schmitt, J. Gassen and E. Gerhards-Padilla. PDF Scrutinizer: Detecting JavaScript-based attacks in PDF documents. Tenth Annual International Conference on Privacy, Security and Trust (PST), 2012.
[11] C. Smutz and A. Stavrou. Malicious PDF detection using metadata and structural features. 28th Annual Computer Security Applications Conference, 2012.
[12] D. Maiorca, G. Giacinto and I. Corona. A pattern recognition system for malicious PDF files detection. In Machine Learning and Data Mining in Pattern Recognition, 2012.
[13] H. Pareek, P. Eswari, N. S. C. Babu and C. Bangalore. Entropy and n-gram analysis of malicious PDF documents. Int. J. Eng., 2(2), 2013.
[14] X. Lu, J. Zhuge, R. Wang, Y. Cao and Y. Chen. De-obfuscation and detection of malicious PDF files with high accuracy. 46th Hawaii International Conference on System Sciences (HICSS), 2013.
[15] K. Z. Snow, S. Krishnan, F. Monrose and N. Provos. SHELLOS: Enabling fast detection and forensic analysis of code injection attacks. USENIX Security Symposium, 2011.
[16] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications, http://dx.doi.org/10.1016/j.eswa.2014.02.053.
[17] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications, 15(4), pp. 459-475, 2012.
[18] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2000-2001.
[19] X. Wang, W. Yu, A. Champion, X. Fu and D. Xuan. Worms via mining dynamic program execution. Third International Conference on Security and Privacy in Communication Networks and the Workshops, SecureComm, pp. 412-421, 2007.
[20] Z. Chen, M. Roussopoulos, Z. Liang, Y. Zhang, Z. Chen and A. Delis. Malware characteristics and threats on the internet ecosystem. Journal of Systems and Software, 85(7):1650-1672, July 2012.
[21] T. Joachims. Making large scale SVM learning practical. 1999.
[22] C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27, 2011.
[23] R. Moskovitch, N. Nissim and Y. Elovici. Malicious code detection using active learning. In Privacy, Security, and Trust in KDD, pp. 74-91, Springer Berlin Heidelberg, 2009.
[24] Y. Baram, R. El-Yaniv and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5, pp. 255-291, 2004.
[25] H. K. Jnanamurthy, C. Warty and S. Singh. Threat analysis and malicious user detection in reputation systems using mean bisector analysis and cosine similarity (MBACS). 2013.
[26] R. Herbrich, T. Graepel and C. Campbell. Bayes point machines. The Journal of Machine Learning Research, 1, pp. 245-279, 2001.
[27] H. Kiem, N. T. Thuy and T. M. N. Quang. A machine learning approach to anti-virus system. Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, pp. 61-65, 4-7 December 2004, Hanoi, Vietnam.

2015 IEEE 14th International Conference on Machine Learning and Applications

Boosting the Detection of Malicious Documents Using Designated Active Learning Methods

Nir Nissim, Aviad Cohen, and Yuval Elovici
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
[email protected]

ABSTRACT — Most organizations usually create, send and receive huge amounts of documents daily. Attackers increasingly take advantage of innocent users who tend to casually open email messages assumed to be benign, carrying malicious documents. Recent targeted attacks aimed at organizations utilize the new Microsoft Word documents (*.docx). Anti-virus software fails to detect new unknown malicious files, including malicious docx files. In this study, we present the SFEM feature extraction methodology and designated Active Learning (AL) methods, aimed at accurate detection of new unknown malicious docx files that also efficiently enhance the detection model's capabilities over time. Our AL methods identify and acquire only a small set of new docx files that are most likely malicious, as well as informative benign files; these files are used for enhancing the knowledge stores of both the detection model and the anti-virus software. Results show that our active learning methods used only 14% of the labeled docx files within an organization, which led to a reduction of 95.5% in labeling efforts compared to passive learning and SVM-Margin (an existing active learning method). Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and SVM-Margin, thus providing an improved updating solution for the detection model, as well as the anti-virus software widely used within organizations.

Keywords — Active Learning; Machine Learning; Structural; Documents; Microsoft Office files; docx; Malware; Malicious.

I. INTRODUCTION
The vast majority of organizations rely heavily on email for internal and external communication, and email has become an attractive platform from which to initiate cyber-attacks against organizations. Attackers usually use social engineering in order to encourage recipients to press a link or open a malicious web page or attachment. In addition, most users today consider non-executable files safer than executable files, and therefore users are less suspicious of such files. However, non-executable files are as dangerous as executable files, because their readers may contain vulnerabilities that can be exploited for malicious purposes by attackers. Cybercriminals launch attacks through Microsoft Office files,1 taking advantage of the fact that Office documents are widely used among most organizations, as well as the fact that organizations' employees don't usually take precautions when receiving and opening these files. In an attempt to prevent or mitigate these types of attacks, Microsoft Office provides several mechanisms such as macro security level, trusted locations, and digital signatures. However, Dechaux et al. [2] showed that these mechanisms can easily be bypassed. In order to effectively analyze thousands of new, potentially malicious Microsoft Office files on a daily basis, anti-virus vendors have integrated a component of a detection model based on machine learning and rule-based algorithms [6] into the core of their signature repository update activities. However, these solutions are often ineffective, because their knowledge base is not adequately updated. This occurs because an enormous number of new, potentially malicious Microsoft Office files are created each day, and only a limited number of security researchers are tasked with manually inspecting and labeling them. Thus, there is a need to prioritize which files should be acquired, analyzed, and labeled by a human expert. As Microsoft Word files are the most popular Microsoft Office files used by organizations, our work focuses on this type of file. In this study we present ALDOCX, a framework and new structural feature extraction methodology (SFEM) aimed at the accurate detection of malicious Microsoft Word XML-based (*.docx) files. ALDOCX is an active learning based framework for frequently updating anti-virus software with docx files. ALDOCX focuses on improving the performance of the detection model and anti-virus software by labeling the docx files (potentially malicious or informative benign files) that are most likely to add new and discriminative information. By doing so, ALDOCX enriches the signature repository with as many new malicious docx files as possible, further enhancing the detection process. Specifically, the ALDOCX framework favors files that contain new content. In this paper we focus on PCs and docx files, the platform and documents most used by organizations. Microsoft Word legacy binary files (*.doc) are beyond the scope of this paper, as they have a file structure that substantially differs from the new XML-based files (*.docx) and requires a feature extraction methodology that is specifically tailored for them. Our contributions in this paper are:
- Presenting SFEM, a new methodology of feature extraction based on structural features within a docx file and capable of providing accurate detection of malicious docx files.
- Presenting the use of machine learning algorithms for the detection of malicious Microsoft Word XML-based documents using a new structural feature extraction methodology (SFEM).
- Developing ALDOCX, a framework based on active learning methods for enhancing and maintaining detection capabilities and updatability.

II. BACKGROUND
With the release of Microsoft Office 2007, Microsoft introduced an entirely new file format based on XML called "Open XML."2 The new file format applies to Microsoft Word, Excel, and PowerPoint and comes with the addition of the "x" suffix to the recognized file extension: *.docx, *.xlsx, and *.pptx. The files are automatically compressed (up to 75% in some cases) using Zip technology, and when the file is opened, it is automatically unzipped.

1 http://securelist.com/blog/research/65414/obfuscated-malicious-office-documents-adopted-by-cybercriminals-around-the-world/
2 http://office.microsoft.com/en-001/help/introduction-to-new-file-name-extensions-HA010006935.aspx

In 2013, Schreck et al. [1] presented a new approach called BISSAM (Binary Instrumentation System for Secure Analysis of Malicious Documents) with three purposes: distinguishing malicious documents from benign documents, extracting embedded malicious shellcode, and detecting and identifying the vulnerability exploited by a malicious document. BISSAM analyzes only the Microsoft Office legacy binary files and not the new XML-based ones. Other tools aimed at detecting legacy binary malicious Office files exist, such as OfficeMalScanner,3 OfficeCat,4 Microsoft OffVis,5 and pyOLEScanner.py.6 The Threat Emulation7 product conducts dynamic analysis and executes the suspicious file in multiple operating systems and with different versions of viewing programs (such as Microsoft Office or Adobe Reader), and if the tool detects that an unusual network connection has been established or changes have been made to the File System, Registry, or processes, the file will be labeled as malicious. However, dynamic analysis has several disadvantages, including its high resource costs, computational complexity, and the length of time it requires; in addition, it can also be detected by the executed malicious code, which may consequently cease its malicious operations. On the other hand, static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable—the docx file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis "traps" to deter static analysis, these traps can themselves be leveraged and used to detect malware using static analysis. In addition, since static analysis is relatively efficient and fast, it can be performed in an acceptable timeframe, minimizing bottlenecks. Static analysis is also easy to implement, monitor, and measure. Moreover, it scrutinizes the file's "genes" and not its current behavior, which can be changed or delayed to an unexpected time or specific conditions in order to evade detection by dynamic analysis.
The above mentioned tools aimed at malicious Microsoft Office files can be categorized into two groups. The first is aimed at detecting only Microsoft Office legacy binary files. The second group is also capable of analyzing the new XML-based format files, doing so through dynamic analysis with all of its limitations. Both groups share the same significant disadvantage in that they use deterministic analysis and/or rule-based heuristics in order to detect maliciousness; yet none of them applies machine learning algorithms, which have many known advantages, including generalization capabilities and the ability to discover hidden patterns for better detection of the new XML-based format Office files. As anti-viruses are capable of detecting only known malicious files and their relative variants, their ability to detect new and unknown malicious Office files is limited. There is a distinct lack of high performance methods for the detection of malicious Microsoft Office XML-based files, particularly the widely used docx files. Moreover, given the enormous number of documents created daily, the strength of a detection method lies in its ability to be continuously up-to-date (i.e., its updatability resulting from efficient and frequent updates of both the detection model and commonly used anti-virus software).

III. MICROSOFT OFFICE FILES SECURITY THREATS
Several security threats associated with Microsoft Office documents and files follow. Microsoft Office files can contain a macro, which is embedded code written in Visual Basic for Applications (VBA). A macro is a legitimate component that can be dangerous when used for malicious purposes. In Microsoft Office 2013 all macros are disabled by default, and notification of this fact is provided. However, the security level and trusted location security features can be bypassed using techniques that were presented by Dechaux et al. [2]. In addition, malicious Office files can contain a malicious Object Linking and Embedding (OLE) object. An OLE package object may contain any file or command line. If the user double-clicks on the object (located in the document), the file or the command is launched. Microsoft Office files can embed active content such as binary files, command and script files, Hypertext Markup Language files which can contain malicious JavaScript code, and other document files such as *.pdf, *.doc, *.xls, and *.ppt which can also be malicious. Embedded malicious files can be automatically executed when the container file is opened, using the macro.

IV. OFFICE FILE STRUCTURE
An introduction to the structure of a viable docx file is provided. Figure 1 shows the directory tree of a sample *.docx file. The actual content of the file is stored in various XML files located in different folders. Each XML file holds the information of a different component. For example, the 'styles.xml' file holds the styling data; 'app.xml' and 'core.xml' hold the metadata of the document (i.e., the title, the document's author, the number of lines and words, etc.). Our framework, aimed at enhancing the detection of malicious docx files, is based on this file structure.

Figure 1: Example of an unzipped *.docx file.

V. METHODS AND FRAMEWORK
A. Structural Feature Extraction Methodology (SFEM)
We now present our new structural feature extraction methodology (SFEM) in which we use the hierarchical nature of an Office file and convert it to a list of unique paths. Figure 2 includes the beginning of the full list of unique paths extracted from the sample file and from its XML files (presented in Figure 1), after unzipping it. The red paths represent directories and the purple paths represent files. Since XML files have a hierarchical nature as well, we converted the XML files within the Office files to a list of paths by concatenating the names of the hierarchical tags within the XML, using '\'. Only the tags' names are concatenated; tags' properties and properties' values are ignored. The green paths represent a path of tags within an XML file. The paths represent the file's properties and actions. For example, the "\word\media\image1.png" path means that the document contains an image, and since this is the only path under the "\word\media\" path, we know that this is the only media item in the file. There are a couple of paths whose presence in the file can indicate the presence of macro code, one being the "\word\vbaData.xml" path.

3 http://www.reconstructer.org/code.html
4 http://www.aldeid.com/wiki/Officecat
5 http://www.microsoft.com/en-us/download/details.aspx?id=2096
6 http://www.aldeid.com/wiki/PyOLEScanner
7 https://www.checkpoint.com/products/threatcloud-emulation-service/
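A rough sketch of this kind of structural path extraction is shown below: it unzips a docx file, records the path of every archive entry, and flattens each XML part's tag hierarchy into '\'-separated paths of tag names only (properties and values ignored). This is a minimal sketch, not the authors' extractor, and its path format only approximates the listing shown in Figure 2 below.

import zipfile
import xml.etree.ElementTree as ET

def local_tag(element):
    # Strip the XML namespace, keeping only the local tag name.
    return element.tag.split('}')[-1]

def extract_structural_paths(docx_path):
    # Returns the set of structural paths of a .docx file: the zip directory
    # tree plus the concatenated tag names of every element in its XML parts.
    paths = set()
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            entry = '\\' + name.replace('/', '\\')
            paths.add(entry)
            if not name.lower().endswith('.xml'):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue
            stack = [(root, entry + '\\' + local_tag(root))]
            while stack:
                element, prefix = stack.pop()
                paths.add(prefix)
                for child in element:
                    stack.append((child, prefix + '\\' + local_tag(child)))
    return paths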

[Document Folder]\[Content_Types].xml
[Document Folder]\[Content_Types].xml\Types
[Document Folder]\[Content_Types].xml\Types\Override
[Document Folder]\docProps
[Document Folder]\docProps\app.xml
[Document Folder]\docProps\app.xml\Properties
[Document Folder]\docProps\app.xml\Properties\Template
[Document Folder]\docProps\app.xml\Properties\Template\#text
[Document Folder]\docProps\app.xml\Properties\TotalTime
[Document Folder]\docProps\app.xml\Properties\DocSecurity

Figure 2: List of paths extracted from a sample *.docx file.

Note that the leaves, as well as the nodes in the directory tree, have been included. We do this in order to maintain the generality of the higher nodes in the tree. For example, the path '\word\vbaData.xml', which is the ancestor of the leaf path 'word\vbaData.xml\wne:vbaSuppData\wne:mcds\wne:mcd', was found to be a more powerful feature than its descendant path, as it indicates the presence of macro code in the file and not just a specific property in the vbaData.xml files. The extraction process is done statically without executing the file and is conducted quite quickly, at a rate of 270 ms for an average file. SFEM advantages include:
- SFEM does not focus on the extraction and analysis of malicious code inside the document (which does not always exist), and because of this, it cannot be evaded by code obfuscation techniques and is therefore a more general and robust approach for many types of attacks.
- The extraction process is done statically, without executing the file, and therefore it is more secure, fast, and requires less computational resources; as a result it can be deployed over endpoint and lightweight devices as well (e.g., smartphones).

B. Selective Sampling and Active Learning Methods
Since our goal is to provide solutions to real problems, our framework must be based on a sophisticated, fast, and selective high-performance sampling method that will select only the minimal yet informative set of documents by which the detection model and anti-virus tool can be updated on a daily basis. We compared our proposed AL methods to other strategies, and the methods considered are described below.

1) Random Selection (Random)
While random selection is obviously not an active learning method, it is the "lower bound" of the selection methods discussed. We are unaware of an anti-virus tool that uses an active learning method for maintaining and improving its updatability. Consequently, we expect that all AL methods will perform better than a selection process based on the random acquisition of files.

2) The SVM-Simple-Margin AL Method (SVM-Margin)
The SVM-Simple-Margin method [9] (referred to as SVM-Margin) is directly related to the SVM classifier. However, in contrast to our methods, it selects examples according to their distance from the separating hyperplane in order to explore and acquire the informative files without relation to their classified labels (i.e., without specifically focusing on malware instances). The SVM-Margin AL method is very fast and can be applied to real problems; yet, as its authors indicate [9], this agility is achieved because it is based on a rough approximation and relies on assumptions that the version space (VS) is fairly symmetric and that the hyperplane's Normal (W in Equation 2) is centrally placed, assumptions that have been shown to fail significantly [3]. The method may query instances in which the hyperplane does not intersect the VS, and therefore may not be informative. The SVM-Margin method for detecting instances of PC malware was used by Moskovitch et al. [10], whose preliminary results found that the method also assisted in updating the detection model but not the anti-virus application itself. This serves as our baseline AL method, and we expect our method to improve the new malicious docx detection and acquisition seen in SVM-Margin.

3) Exploitation: Our Proposed Active Learning Method
Our method, "Exploitation," is based on SVM classifier principles and is oriented towards selecting the examples most likely to be malicious that lie furthest from the separating hyperplane. Thus, our method supports the goal of boosting the signature repository of the anti-virus software by acquiring as much new malware as possible. For every file X that is suspected of being malicious, Exploitation rates its distance from the separating hyperplane using Equation 1, based on the Normal of the separating hyperplane of the SVM classifier that serves as the detection model. As explained above, the separating hyperplane of the SVM is represented by W, which is the Normal of the separating hyperplane and is actually a linear combination of the most important examples (support vectors), multiplied by Lagrange multipliers (alphas) and by the kernel function K that assists in achieving linear separation in higher dimensions. Accordingly, the distance in Equation 1 is simply calculated between example X and the Normal (W) presented in Equation 2.

Dist(X) = Σ_{i=1}^{n} α_i y_i K(x_i, X)    (Equation 1)

w = Σ_{i=1}^{n} α_i y_i x_i    (Equation 2)

In Figure 3, the files that were acquired (marked with a red circle) are the files classified as malicious that are at the maximum distance from the separating hyperplane. Acquiring several new malicious files that are very similar to one another and belong to the same virus family is considered a waste of manual analysis resources, since these files will probably be detected by the same signature. Thus, acquiring one representative file for this set of new malicious files will serve the goal of efficiently updating the signature repository. In order to enhance the signature repository as much as possible, we check the similarity between the selected files using the kernel farthest-first (KFF) method suggested by Baram et al. [8], which enables us to avoid acquiring very similar examples. Consequently, only the representative files that are most likely malicious are selected. In cases in which the representative file is determined to be malware by the security expert, all variants that were not acquired will be detected the moment the anti-virus is updated. In cases in which these files are not actually variants of the same malware, they will be acquired in the following trial (after the detection model has been updated), as long as they are again determined to most likely be malware.

Figure 3: The criteria by which Exploitation acquires new unknown malicious docx files. These files lie the farthest from the hyperplane and are regarded as representative files.

In Figure 3 it can also be observed that there are sets of relatively similar files (based on their distance in the kernel space); however, only the representative files that are most likely to be malware are acquired. Exploitation explores the "malicious side" to discover new and unknown malicious files that are essential for the frequent update of the anti-virus signature repository, a process which occasionally also results in the discovery of benign files (files which will likely become support vectors and update the classifier).
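Assuming an RBF-kernel SVM, Equation 1 can be evaluated directly from the trained model's support vectors, since libraries such as scikit-learn store the products α_i·y_i as dual coefficients. The sketch below is one possible reading of Equation 1 under that assumption (gamma must match the value used when training the model); it is not the paper's implementation.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def exploitation_distance(svm, X, gamma):
    # Equation 1: Dist(X) = sum_i alpha_i * y_i * K(x_i, X) over the support vectors.
    # svm.dual_coef_ holds alpha_i * y_i; gamma must equal the trained model's gamma.
    K = rbf_kernel(X, svm.support_vectors_, gamma=gamma)
    return (K * svm.dual_coef_).sum(axis=1)

# Exploitation then acquires the files classified as malicious that lie farthest
# from the hyperplane, i.e., those with the largest positive Dist(X); for a
# fitted binary SVC this quantity equals svm.decision_function(X) - svm.intercept_.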

Figure 3 also presents an example of a file lying far inside the malicious side that was found to be benign. The distance calculation required for each instance in this method is fast and equal to the time it takes to classify an instance in an SVM classifier, thus it is applicable for products working in real time.

4) Combination: A Combined Active Learning Method
The "Combination" method lies between the SVM-Margin and Exploitation methods. On the one hand, in order to acquire the most informative docx files (acquiring both malicious and benign files), the Combination method begins by acquiring examples based on SVM-Margin criteria, an exploration phase which is important in order to enable the detection model to discriminate between malicious and benign docx files. On the other hand, the Combination method then tries to maximally update the signature repository in an exploitation phase, drawing on the Exploitation method. This means that in the early acquisition period, during the first part of the day, SVM-Margin is more dominant compared to Exploitation. We found that a balanced division of labor between SVM-Margin and Exploitation achieved the best performance. In short, this method tries to take the best from both of these methods.

C. ALDOCX Framework for Enhancing Detection Capabilities
First we collect all the docx files within the organization, which are introduced to the ALDOCX framework that includes a detection model based on SVM, SFEM, and AL methods. The detection model scrutinizes the docx files and provides two values for each file: a classification decision using the SVM classification algorithm and a distance calculation from the separating hyperplane using Equation 1. A file that the AL method recognizes as informative and has indicated should be acquired is sent to an expert who manually analyzes and labels it. We aim to frequently update the anti-virus software by acquiring these informative docx files and focusing the expert's efforts on labeling docx files that are most likely to be malware or on benign docx files that are expected to improve the detection model. Note that informative files are defined as those that, when added to the training set, improve the detection model's predictive capabilities and enrich the anti-virus signature repository. Accordingly, in our context there are two types of files that may be considered informative. The first type includes files in which the classifier has limited confidence as to their classification (the probability that they are malicious is very close to the probability that they may be benign). Acquiring them as labeled examples will probably improve the model's detection capabilities. In practical terms, these docx files will have new structural paths or special combinations of existing structural paths that represent their execution code (e.g., inside the macro code of the docx file). Therefore these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin strategy, which selects informative files, both malicious and benign, that are a short distance from the separating hyperplane.
The second type of informative files includes those that lie deep inside the malicious side of the SVM margin and are a maximal distance from the separating hyperplane according to Equation 1. These docx files, which are a maximal distance from the labeled files, will be acquired by the Exploitation method, as this distance is measured by the KFF calculation [8]. These informative files are then added to the training set for updating and retraining the detection model. The files that were labeled as malicious are also added to the anti-virus signature repository in order to enrich and maintain its updatability. Updating the signature repository also requires an update to clients utilizing the anti-virus application.
For every unknown docx file in the scrutinized file collection, the detection model provides a classification, while the AL method provides a rank representing how informative the file is. The framework will consider acquiring the files based on this information. After being selected and receiving their true labels from the expert, the informative docx files are acquired and added to the training set. The signature repository is also updated, in case the files are malicious. The detection model is retrained over the updated and extended training set, which now also includes the acquired examples that are regarded as being very informative. Note that the goal is to acquire as many malicious docx files as possible, since such information will maximally update the anti-virus software that protects most organizations. We employed several algorithms in order to induce detection models, including the SVM classification algorithm with the radial basis function (RBF) kernel in a supervised learning approach. SVM has proven to be very efficient at enhancing the detection of malware when combined with AL methods [4], [5], [7].

VI. EVALUATION
A. Dataset Collection
In order to evaluate our proposed framework and methods, we created a large and representative dataset of malicious and benign Microsoft Word files (*.docx). We acquired a total of 16,811 files, including 327 malicious and 16,484 benign files, from three sources: the VirusTotal repository, the Contagio Project, and files randomly collected from the Internet and our university. Our dataset contains only 1.9% malicious documents in order to reflect reality as closely as possible. At the time of the paper's composition, we used all 327 existing malicious docx files in the VirusTotal repository (as samples of the old format, *.doc files, are not relevant), and we used only files detected as malicious by at least five anti-virus engines. All the files were assured to be labeled correctly as malicious or benign using the VirusTotal service.

B. Dataset Creation
We developed a feature extractor based on our structural feature extraction methodology (SFEM) in order to extract and analyze the features from all the files in the dataset. The feature extraction process resulted in a vocabulary of 134,854 unique features (paths) extracted from both benign and malicious files. In order to select the most prominent features from the list of 134,854 features, we used a feature selection method based on Information Gain, and we were left with a list of the 5,000 most prominent features, which we called the "Top 5000." Using the ranked list of the 5,000 most prominent features and the collection of 16,811 *.docx files (benign and malicious), we created the dataset for our experiments, which contained 16,811 records and 5,001 fields - field 5,001 represents the class of the file, "Malicious" if the record represents a malicious docx file and "Benign" if the record represents a benign docx file. We used a Boolean representation to indicate the presence (1) or absence (0) of a feature within a file.
We also took into account different numbers of prominent features and created 8 datasets using the 10, 40, 80, 100, 300, 500, 800, and 1000 top features. The top features with which the best detection rates are achieved will be the features used for further experiments. Among the 10 most prominent features, features 1 to 8 are related to the existence of macro code in the document and its activation, feature 9 is related to the existence of embedded files, and feature 10 is related to OLE objects in the document. Deeper analysis of these features and their percentages of occurrence within benign and malicious files contributes to a better understanding of the dataset's composition. About 44% of the malicious files contain macro code, whereas among the benign files only 0.13% to 0.16% contain macro code. Additionally, almost 60% of the malicious files contain an embedded file, compared to only 3.14% of the benign files.
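The Boolean dataset described above can be assembled from the per-file path sets and the ranked feature list as in the following sketch (selected_paths stands for the top features retained after the Information Gain ranking; the names are illustrative, not the authors' code):

import numpy as np

def build_boolean_dataset(path_sets, labels, selected_paths):
    # path_sets: one set of structural paths per docx file.
    # labels: "Malicious" or "Benign" per file.
    # selected_paths: the top-N paths kept after feature selection.
    column = {path: j for j, path in enumerate(selected_paths)}
    X = np.zeros((len(path_sets), len(column)), dtype=np.int8)
    for i, paths in enumerate(path_sets):
        for path in paths:
            j = column.get(path)
            if j is not None:
                X[i, j] = 1   # presence (1) / absence (0) of the feature
    return X, np.asarray(labels)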

The most popular attacks through docx files are launched via the macro, file embedding, and OLE categories, and it is significant that the most prominent features belong to these categories. However, in order to boost and generalize the detection capabilities, we will evaluate the detection model with additional features which are not necessarily related directly to the most popular attacks.

C. Experimental Design
Our experimental design aims at providing a clear and practical answer to the following research question:
Is it possible to leverage and utilize an organization's large collection of unlabeled docx files to create an accurate detection model and efficiently update the anti-virus software through the acquisition of only a minimal, yet informative set of docx files while also reducing the labeling efforts of the security experts?
The objective of this experiment, which compares the performance of our new AL methods to the existing selection method, SVM-Margin, was to measure and evaluate our framework's contribution when applying it inside an organization, both for enhancing the capabilities of the detection model and anti-virus and for reducing the labeling efforts of the security expert when labeling the most informative docx files. This experiment aims to evaluate the framework with respect to the fact that many organizations already have huge amounts of docx files within their networks, files that can be very informative and contributive. Thus these files can be utilized and leveraged for enhancing the detection model's performance in the very early stages. By doing so, the framework (SFEM and AL methods) might even detect, discover, and acquire new malicious files within the organizational networks, files that performed malicious activity against the organization and haven't been detected by existing solutions such as anti-viruses.
During the acquisition trials, which ended by acquiring every docx file in the pool of unlabeled docx files, we compared the acquisition of docx files based on AL methods to random selection, based on the performance of the classification model. In this set of experiments, we used all 16,811 docx files (16,484 benign, 327 malicious) in our repository and created ten randomly selected datasets, with each dataset containing three elements: an initial set of 1,811 randomly selected docx files that were used to induce the initial detection model; a test set of 1,000 docx files upon which the detection model was tested and evaluated after every trial in which it was updated; and a pool of the remaining 14,000 unlabeled and unknown docx files, from which the framework and the selective sampling methods selected the most informative and most likely malicious docx files according to each method's criteria. The informative docx files were sent to a security expert who inspected and labeled them. The docx files were later acquired by the training set, which was enriched with an additional X new informative docx files. When a file was found to be malicious, it was immediately used to update the signature repository of the anti-virus, and an update was also distributed to clients. The process was repeated over the next trials until the entire pool was acquired. The performance of the detection model was averaged over ten runs on the ten different datasets that were created. Each selective sampling method was checked separately on five different acts of docx file acquisition (each consisting of a different number of docx files). This means that for each act of acquisition, the methods were restricted to acquiring a number of files equal to the amount that followed, denoted as X: 10 files, 20 files, 50 files, 100 files, and 250 files. Out of the total 16,811 files, we used 1,811, a relatively small number of docx files (~1,774 benign, ~37 malicious), as an initial set, since labeling files can become costly due to the need for a security expert to inspect them; thus we tried to reduce the need for extensive labeling efforts. The experiment's steps are as follows:
1. Inducing the initial detection model from the initial available training set (the initial training set includes 1,811 docx files).
2. Evaluating the detection model on the test set of 1,000 docx files to measure its initial performance.
3. Introducing the pool of unknown and unlabeled docx files to the selective sampling method, which chooses the X most informative docx files and sends them to the security expert for inspection and labeling.
4. Acquiring these labeled informative docx files, removing them from the pool and adding them to the training set, as well as using their extracted signatures to update the anti-virus signature repository (in the case that they were found to be malicious).
5. Inducing an updated detection model from the updated training set and applying the updated model on the pool (which now contains fewer docx files).
This process is repeated until the entire pool is acquired.

VII. RESULTS
Our focus in this paper is evaluating the process of learning from a small sample of documents within an organization for updatability and detection enhancement using our AL methods; therefore the basic detection experiment is out of the scope of this paper, and we only summarize its results, which serve as a basis for the main AL process of this paper. After evaluating different classifiers and top-feature sets through a 10-fold cross validation detection experiment, we can conclude that the configuration that provides the best detection capabilities is the SVM classifier trained on the top 100 features: TPR of 93.34%, FPR of 0.19%, and an accuracy rate of 99.67%.
For the main focus of the paper, we compared the AL methods to random selection (passive learning) and show only the results of acquiring 50 files per trial (due to space limitations). We can see in Figure 4 that SVM-Margin achieved the highest TPR of 94.4% after only five trials. This indicates that only 2,061 informative, well selected docx files (2,061 = 250 + 1,811, which are 14% of the pool of unlabeled docx files) were required to induce this accurate detection model. Both Exploitation and Combination achieved a high and stable TPR rate of 93.2% after ten trials, which means only 2,511 informative and well selected docx files (18% of the pool of unlabeled docx files). It took the Random selection 237 trials, which is 13,661 labeled files, representing 97.5% of the docx files in the pool. Therefore we demonstrate a reduction of 81% in labeling efforts when comparing Random selection to our AL methods, while still maintaining low FPR rates of 0.18%.

Figure 4: The TPR of the framework over the 280 trials of acquisition for different methods when acquiring 50 docx files in each trial.

In Figure 5 we can see that during very early stages, both of our AL methods acquired 304 malicious docx files out of the 312 present in the pool (97.4% of the malicious docx files), and updated the detection model and anti-virus signature repository. This was achieved in the 12th trial, while it took the Random method 270 trials to acquire the same number of malicious docx files, and SVM-Margin required the entire pool, or 280 trials. Ultimately, ALDOCX provided a reduction of 95.5% and 95.7%, respectively, compared with Random and SVM-Margin in the labeling efforts, and updated both the detection model and the anti-virus software with new informative docx files – benign and especially malicious.

Figure 5: The cumulative number of malicious docx files acquired by the framework for different methods with acquisition of 50 files in each trial.

We also compared the detection rate of our methods with the 10 best anti-viruses commonly used by organizations. The most accurate anti-virus, TrendMicro, had a detection rate of 85.9%, while our methods outperformed all the anti-viruses in the task of detecting new malicious docx files. Using the SVM classifier with 100 structural features (SFEM), we achieved 93.4%, which required using 90% of the dataset for training (due to the 10XV settings). Using the full ALDOCX framework, including the AL methods and its enhancement process, we even improved the performance of SVM-SFEM, achieving a 94.4% TPR using only 14% of labeled data (2,011 docx files out of 14,000), which means a reduction of 84.4% in labeling efforts.

VIII. DISCUSSION AND CONCLUSION
We presented ALDOCX, a framework for boosting the detection of unknown malicious Microsoft Word documents using designated active learning methods. ALDOCX is based mainly on machine learning algorithms trained with our new Structural Feature Extraction Methodology (SFEM), which is static, fast, and robust against the different evasion attacks used by attackers. We evaluated our framework through a comprehensive series of experiments using our large and representative dataset. We found that the configuration that yielded the best results was the SVM classifier trained on the top 100 structural features, which achieved a TPR of 93.34%, FPR of 0.19%, and an accuracy rate of 99.67%. The number of features (top 100) showed that the detection of malicious docx files with high TPR rates requires the consideration of more than simply the top ten "trivial" features (macro, embedding, OLE) and must include features that are extracted from deep within the structure of the docx file. These non-trivial features are difficult for an attacker to learn and cannot be evaded easily. The basic results show that the ALDOCX framework can be integrated into Microsoft Office tools or deployed on endpoints in order to detect malicious docx files. Our contribution is even greater since our new feature extraction methodology, SFEM, is general and aimed also at other Microsoft Office XML-based files (e.g., Excel [*.xlsx], PowerPoint [*.pptx]), a level of coverage that other existing techniques are incapable of.
The main results show that our framework is capable of utilizing the vast number of docx files within organizations for efficiently and effectively inducing an accurate detection model. Additionally, they show that our framework is capable of updating the anti-virus software using only a minimal set of well selected, informative docx files and significantly reducing the labeling efforts required by security experts. Moreover, the fact that in this historically based experiment we achieved higher TPR rates (94.44%) than in the non-historical (93.6%) AL experiment emphasizes that a docx file that was not considered to be informative at an early stage may be very informative during later stages of acquisition and can contribute to the detection model's performance and the updatability of the anti-virus as well. We showed that during very early stages, both of our AL methods acquired 304 malicious docx files out of the 312 present in the pool (97.4% of the malicious docx files) and both updated the detection model and anti-virus signature repository. This was achieved on the 12th trial, while it took the Random method 270 trials to acquire the same number of malicious docx files, and SVM-Margin required the entire pool, or 280 trials. Ultimately, ALDOCX provided a reduction in the labeling efforts of 95.5% and 95.7%, respectively, compared with Random and SVM-Margin, and updated both the detection model and the anti-virus software with new informative docx files – benign and especially malicious, which are more valuable for detection purposes. The results of this third and last experiment prove that our framework can also be deployed inside organizations for creating a detection model that is well tailored toward the proclivities of the organization itself – according to the docx files it uses and creates. In future work, we are interested in extending this framework to the detection of additional Office files such as xlsx and pptx, which share the same XML-based structure as the docx file.

IX. REFERENCES
[1] T. Schreck, S. Berger and J. Göbel. BISSAM: Automatic vulnerability identification of office documents. In Detection of Intrusions and Malware, and Vulnerability Assessment, 2013.
[2] J. Dechaux, E. Filiol and J. Fizaine. Office documents: New weapons of cyberwarfare. 2010.
[3] R. Herbrich, T. Graepel and C. Campbell. Bayes point machines. The Journal of Machine Learning Research, 1, pp. 245-279, 2001.
[4] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications, 15(4), pp. 459-475, 2012.
[5] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications, http://dx.doi.org/10.1016/j.eswa.2014.02.053.
[6] H. Kiem, N. T. Thuy and T. M. N. Quang. A machine learning approach to anti-virus system. Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, pp. 61-65, 4-7 December 2004, Hanoi, Vietnam.
[7] N. Nissim, A. Cohen, R. Moskovitch, O. Barad, M. Edry, A. S. and Y. Elovici. ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations. Proceedings of JISIC, 2014.
[8] Y. Baram, R. El-Yaniv and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5, pp. 255-291, 2004.
[9] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2000-2001.
[10] R. Moskovitch, N. Nissim and Y. Elovici. Malicious code detection using active learning. In Privacy, Security, and Trust in KDD, pp. 74-91, Springer Berlin Heidelberg, 2009.

Designated Active Learning Methods for Detection Enhancement of Unknown Malicious Microsoft Office Documents

Nir Nissim, Aviad Cohen, and Yuval Elovici
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
[email protected]

ABSTRACT
Attackers increasingly take advantage of innocent users who tend to casually open email messages assumed to be benign, carrying malicious documents. Recent targeted attacks aimed at organizations utilize the new Microsoft Word documents (*.docx). Anti-virus software fails to detect new unknown malicious files, including malicious docx files. In this study, we present ALDOCX, a framework aimed at accurate detection of new unknown malicious docx files that also efficiently enhances the framework's detection capabilities over time. Detection relies upon our new structural feature extraction methodology (SFEM), which is performed statically using meta-features within docx files. Using machine learning algorithms with SFEM, we created a detection model that successfully detects new unknown malicious docx files. In addition, because it is crucial to maintain the detection model's updatability and incorporate the new malicious files created daily, ALDOCX integrates our active learning (AL) methods, which are designed to efficiently assist anti-virus vendors by better focusing their analytical efforts and enhancing detection capability. ALDOCX identifies and acquires new docx files that are most likely malicious, as well as informative benign files; these files are used for enhancing the knowledge stores of both the detection model and the anti-virus software. The evaluation results show that by using ALDOCX with SFEM to detect malicious docx files, we achieved a high detection rate (93.6% TPR) compared to anti-virus software (85.9% TPR) – with very low FPR rates (0.2%). ALDOCX's active learning methods used only 7.7% of the labeled docx files, and this led to a reduction of 91.4% in labeling efforts. Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and an existing AL method, thus providing an improved updating solution for the detection model, as well as the anti-virus software widely used by organizations.

Categories and Subject Descriptors
D.2.0 [Computer-Communication-Networks]: Security and protection

General Terms
Algorithms, Measurement, Security, Human Factors.

Keywords
Active Learning; Machine Learning; Structural; Documents; Microsoft Office files; docx; Malware; Malicious.

1. INTRODUCTION
Since 2009, cyber-attacks against organizations have increased, and 91% of all organizations were hit by cyber-attacks in 2013.1 The vast majority of organizations rely heavily on email for internal and external communication, and email has become an attractive platform from which to initiate cyber-attacks against organizations. Attackers usually use social engineering in order to encourage recipients to press a link or open a malicious web page or attachment. According to TrendMicro,2 attacks, particularly those against government agencies and large corporations, are almost entirely launched through spear-phishing emails.3 In addition, most users today consider non-executable files safer than executable files, and therefore users are less suspicious of such files. However, non-executable files are as dangerous as executable files, because their readers may contain vulnerabilities that can be exploited for malicious purposes by attackers. Cybercriminals launch attacks through Microsoft Office files,4 taking advantage of the fact that Office documents are widely used among most organizations. Cybercriminals also take advantage of the fact that organizations' employees don't usually take precautions when receiving and opening these files.
In an attempt to prevent or mitigate these types of attacks, Microsoft Office provides several mechanisms such as macro security level, trusted locations, and digital signatures. However, Dechaux et al. [2] showed that these mechanisms can easily be bypassed. In order to effectively analyze thousands of new, potentially malicious Microsoft Office files on a daily basis, anti-virus vendors have integrated a component of a detection model based on machine learning and rule-based algorithms [13] into the core of their signature repository update activities. However, these solutions are often ineffective, because their knowledge base is not adequately updated. This occurs because an enormous number of new, potentially malicious Microsoft Office files are created each day, and only a limited number of security researchers are tasked with manually inspecting and labeling them. Thus, there is a need to prioritize which files should be acquired, analyzed, and labeled by a human expert. As Microsoft Word files are the most popular Microsoft Office files used by organizations, our work focuses on this type of file. In this study we present ALDOCX, a framework and new structural feature extraction methodology (SFEM) aimed at the accurate detection of malicious Microsoft Word XML-based (*.docx) files. ALDOCX is an active learning based framework for frequently updating anti-virus software with docx files. ALDOCX focuses on improving the performance of the detection model and anti-virus software by labeling the docx files (potentially malicious or informative benign files) that are most likely to add new and discriminative information. By doing so, ALDOCX enriches the signature repository with as many new malicious docx files as possible, further enhancing the detection process. Specifically, the ALDOCX framework favors files that contain new content. In this paper we focus on PCs and docx files, the platform and documents most used by organizations. Microsoft Word legacy binary files (*.doc) are beyond the scope of this paper, as they have a file structure that substantially differs from the new XML-based files (*.docx) and requires a feature extraction methodology that is specifically tailored for them. Our contributions in this paper are:
- Presenting SFEM, a new methodology of feature extraction based on structural features within a docx file and capable of providing accurate detection of malicious docx files.
- Presenting the use of machine learning algorithms for the detection of malicious Microsoft Word XML-based documents using a new structural feature extraction methodology (SFEM).
- Developing ALDOCX, a framework based on active learning methods for enhancing and maintaining detection capabilities and updatability.

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber-attacks-in-2013/
2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://searchsecurity.techtarget.com/definition/spear-phishing
4 http://securelist.com/blog/research/65414/obfuscated-malicious-office-documents-adopted-by-cybercriminals-around-the-world/

2. BACKGROUND
Microsoft Office 97 and more recent versions used the legacy binary file as their default file format (referred to as "Microsoft Office 97-2003"). These files can be read only by using Microsoft Office. The extensions for the well-known Microsoft Word, Excel, and PowerPoint files are *.doc, *.xls, and *.ppt respectively. With the release of Microsoft Office 2007, Microsoft introduced an entirely new file format based on XML called "Open XML."5 The new file format applies to Microsoft Word, Excel, and PowerPoint and comes with the addition of the "x" suffix to the recognized file extension: *.docx, *.xlsx, and *.pptx. The files are automatically compressed (up to 75% in some cases) using Zip technology, and when the file is opened, it is automatically unzipped.
In 2013, Schreck et al. [1] presented a new approach called BISSAM (Binary Instrumentation System for Secure Analysis of Malicious Documents) with three purposes: distinguishing malicious documents from benign documents, extracting embedded malicious shellcode, and detecting and identifying the
based on vulnerability signatures. pyOLEScanner.py9 is a Python based script that can examine and decode some aspects of malicious legacy binary Microsoft Office files. The Threat Emulation10 product conducts dynamic analysis and executes the suspicious file in multiple operating systems and with different versions of viewing programs (such as Microsoft Office or Adobe Reader), and if the tool detects that an unusual network connection has been established or changes have been made to the File System, Registry, or processes, the file will be labeled as malicious. However, dynamic analysis has several disadvantages, including its high resource costs, computational complexity, and the length of time it requires; in addition, it can also be detected by the executed malicious code, which may consequently cease its malicious operations. On the other hand, static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable—the docx file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis "traps" to deter static analysis, these traps can themselves be leveraged and used to detect malware using static analysis. In addition, since static analysis is relatively efficient and fast, it can be performed in an acceptable timeframe, minimizing bottlenecks. Static analysis is also easy to implement, monitor, and measure. Moreover, it scrutinizes the file's "genes" and not its current behavior, which can be changed or delayed to an unexpected time or specific conditions in order to evade detection by dynamic analysis.
The above mentioned tools aimed at malicious Microsoft Office files can be categorized into two groups. The first is aimed at detecting only Microsoft Office legacy binary files. The second group is also capable of analyzing the new XML-based format files, doing so through dynamic analysis with all of its limitations. Both groups share the same significant disadvantage in that they use deterministic analysis and/or rule-based heuristics in order to detect maliciousness; yet none of them applies machine learning algorithms, which have many known advantages, including generalization capabilities and the ability to discover hidden patterns for better detection of the new XML-based format Office files. As anti-viruses are capable of detecting only known malicious files and their relative variants, their ability to detect new and unknown malicious Office files is limited.
There is a vulnerability exploited by a malicious document. The system distinct lack of high performance methods for the detection of inspects the Office file, extracts the embedded malicious shellcode malicious Microsoft Office XML-based files, particularly the from it, and then automatically determines what type of widely used docx files. Moreover, given the enormous number of vulnerability the malicious code targets. The system focuses on documents created daily, the strength of a detection method lies in Microsoft Office 2003 and 2007 running on Microsoft Windows its ability to be continuously up-to-date (i.e., its updatability XP, but it can be also used with other applications. resulting from efficient and frequent updates of both the detection model and commonly used anti-virus software). A number of other tools aimed at detecting malicious Office files exist. OfficeMalScanner6 is a free forensic tool used to scan for Many studies have focused on the detection of malicious PDF malicious traces, such as shellcode heuristics, PE-files or documents, as was surveyed by Nissim et.al [8]. While PDF is embedded OLE streams in legacy binary Microsoft Office files. also a very popular type of documents, these studies are not Another tool called OfficeCat7 is a file checker based on directly relevant, because PDF files differ significantly from docx vulnerability signatures, aimed at the detection of various files, in two important ways that highlight the novelty and vulnerabilities in Microsoft Office legacy binary files. Microsoft contributions made by our framework. First, PDF files consist of a OffVis,8 a free tool that provides visualization for displaying the set of related objects, while docx files are archival files consisting raw content and structure of Microsoft Office legacy binary of folders and XML files. Given this, applying a structural based documents and identifies some common exploits, which is also approach for malicious PDF detection [14] on malicious docx files would be an unsuccessful detection approach. Second, PDF and docx files are associated with different attack techniques, and even when they do share an attack technique (e.g., embedded- 5 http://office.microsoft.com/en-001/help/introduction-to-new-file-name-extensions- files), the attack is launched very differently, such that the same HA010006935.aspx 6 http://www.reconstructer.org/code.html 7 http://www.aldeid.com/wiki/Officecat 9 http://www.aldeid.com/wiki/PyOLEScanner 8 http://www.microsoft.com/en-us/download/details.aspx?id=2096 10 https://www.checkpoint.com/products/threatcloud-emulation-service/ attack affects the file structure differently in each case. Therefore, 5. FRAMEWORK AND METHODS the detection of malicious docx files requires different and customized structural features based on the special file structure 5.1 Structural Feature Extraction of these files. Methodology (SFEM) We now present our new structural feature extraction 3. MICROSOFT OFFICE FILES methodology (SFEM) in which we use the hierarchical nature of SECURITY THREATS an Office file and convert it to a list of unique paths. Figure 2 includes the beginning of the full list of unique paths extracted Several security threats associated with Microsoft Office from the sample file and from its XML files (presented in Figure documents files follow. Microsoft Office files can contain macro 1), after unzipping it. The red paths represent directories and the which is an embedded code written in Visual Basic for purple paths represent files. 
Since XML files have a hierarchical Applications (VBA). Macro is a legitimate component that can be nature as well, we converted the XML files within the Office files dangerous when used for malicious purposes. In Microsoft Office to a list of paths by concatenating the names of the hierarchal tags 2013 all macros are disabled by default, and notification of this within the XML, using ‘\’. Only the tags' names are concatenated; fact is provided. However, security level and trusted location tags' properties and properties’ values are ignored. Although tags' security features can be bypassed using techniques that were properties and their values contain a great deal of information that presented by Dechaux et.al [2]. In addition, malicious Office files 11 can be helpful, the integration of all the properties and their values can contain a malicious Object Linking and Embedding (OLE ) will exponentially increase the number of extracted paths and the object. An OLE package object may contain any file or command extraction process time. The green paths represent a path of tags line. If the user double-clicks on the object (located in the within an XML file. The total path count in the list is 425. The document), the file or the command is launched. Microsoft Office paths represent the file’s properties and actions. For example, the files can embed active content such as binary files, command and “\word\media\image1.png” path means that the document script files, Hypertext Markup Language files which can contain contains an image, and since this is the only path under the malicious JavaScript code, and other document files such as *.pdf, “\word\media\” path, we know that this is the only media item in *.doc, *.xls, and *.ppt which can also be malicious. Embedded the file. There are a couple of paths whose presence in the file can malicious files can be automatically executed when the container indicate the presence of macro code, one being the file is opened using the macro. “\word\vbaData.xml” path. [Document Folder]\[Content_Types].xml 4. OFFICE FILE STRUCTURE [Document Folder]\[Content_Types].xml\Types [Document Folder]\[Content_Types].xml\Types\Default An introduction to the [Document Folder]\[Content_Types].xml\Types\Override structure of a viable docx file [Document Folder]\docProps [Document Folder]\docProps\app.xml is provided. Figure 1 shows [Document Folder]\docProps\app.xml\Properties the directory tree of a sample [Document Folder]\docProps\app.xml\Properties\Template *.docx file. The actual [Document Folder]\docProps\app.xml\Properties\Template\#text [Document Folder]\docProps\app.xml\Properties\TotalTime content of the file is stored in [Document Folder]\docProps\app.xml\Properties\TotalTime\#text various XML files located in [Document Folder]\docProps\app.xml\Properties\Pages different folders. Each XML [Document Folder]\docProps\app.xml\Properties\Pages\#text [Document Folder]\docProps\app.xml\Properties\Words file holds the information of [Document Folder]\docProps\app.xml\Properties\Words\#text a different component. For [Document Folder]\docProps\app.xml\Properties\Characters example, ‘styles.xml’ file [Document Folder]\docProps\app.xml\Properties\Characters\#text [Document Folder]\docProps\app.xml\Properties\Application holds the styling data. [Document Folder]\docProps\app.xml\Properties\Application\#text ‘app.xml’ and ‘core.xml’ [Document Folder]\docProps\app.xml\Properties\DocSecurity hold the metadata of the Figure 2: List of paths extracted from a sample *.docx file. 
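The path-extraction step described above (unzipping the docx archive and concatenating XML tag names with '\', while ignoring tag attributes and their values) can be illustrated with a short Python sketch. This is a minimal illustration under our own assumptions, not the original SFEM implementation: it uses only the standard library, the function names are ours, and the '#text' leaf entries shown in Figure 2 are omitted for brevity.

import zipfile
import xml.etree.ElementTree as ET

def _strip_ns(tag):
    # Drop the XML namespace, keeping only the local tag name.
    return tag.split('}', 1)[-1]

def xml_tag_paths(xml_bytes, prefix):
    # Concatenate hierarchical tag names with '\'; tag attributes and
    # attribute values are ignored, as in the SFEM description.
    paths = set()
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return paths
    stack = [(root, prefix + '\\' + _strip_ns(root.tag))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + '\\' + _strip_ns(child.tag)))
    return paths

def sfem_paths(docx_path):
    # Statically extract the set of unique structural paths from a *.docx
    # file: archive entries (folders and files) plus tag paths inside each
    # embedded XML part. The file is never executed.
    paths = set()
    with zipfile.ZipFile(docx_path) as z:
        for name in z.namelist():
            entry = name.replace('/', '\\')
            parts = entry.split('\\')
            # Add every prefix so inner directory nodes become features too.
            for i in range(1, len(parts) + 1):
                paths.add('\\'.join(parts[:i]))
            if name.lower().endswith('.xml'):
                paths |= xml_tag_paths(z.read(name), entry)
    return paths

For the sample file of Figure 1, a call such as sfem_paths('sample.docx') would return a set containing entries like 'word\media\image1.png' alongside the tag paths found inside each XML part.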
document (i.e., the title, document's author, the Note that the leaves, as well as the nodes in the directory tree have number of lines and words, been included. We do this in order to maintain the generality of etc.). Our framework aimed the higher nodes in the tree. For example, the path at enhancing the detection of '\word\vbaData.xml' which is the ancestor of the leaf path: malicious docx files is based 'word\vbaData.xml\wne:vbaSuppData\wne:mcds\wne:mcd', was on this file structure. found to be a more powerful feature than its descendant path as it indicates the presence of macro code in the file, and not just a

specific property in the vbaData.xml files. The extraction process Figure 1: Example of an is done statically without executing the file and is conducted quite unzipped *.docx file. quickly at the rate of 270ms for an average file. SFEM advantages include: - SFEM does not focus on the extraction and analysis of malicious code inside the document (which does not always exist), and because of this, it cannot be evaded by code obfuscation techniques and is therefore a more general and robust approach for many types of attacks.

- SFEM aims to be robust against a mimicry attack in which 11 http://office.microsoft.com/en-us/excel-help/create-edit-and-control-ole-objects- changes were made to a malicious file to make it appear benign. HP010217697.aspx These changes should not affect our method significantly, since the features of malicious actions still exist deep inside the The second type of informative files includes those that lie structure of the file. deep inside the malicious side of the SVM margin and are a - SFEM also aims to be robust against a reverse mimicry attack maximal distance from the separating hyperplane according to in which changes are made to a benign file in order to make it Equation 1. These docx files which are a maximal distance from malicious. These changes should not affect our method the labeled files will be acquired by the Exploitation method significantly, because the changes related to the new malicious (explanation will follow) as this distance is measured by the KFF behavior will be reflected in the file structure, and given this, calculation [12]. These informative files are then added to the the features of malicious actions will probably be detected. training set {6} for updating and retraining the detection model {7}. The files that were labeled as malicious are also added to the - The extraction process is done statically, without executing the anti-virus signature repository in order to enrich and maintain its file, and therefore it is more secure, fast, and requires less updatability {8}. Updating the signature repository also requires computational resources; as a result it can be deployed over an update to clients utilizing the anti-virus application. The endpoint and lightweight devices as well (e.g., smartphones). framework integrates two main phases: training and 5.2 A Framework for Enhancing Detection detection/updating. Capabilities Training: A detection model is trained over an initial training set Figure 3 illustrates the framework and process of detecting that includes both malicious and benign docx files. The initial and acquiring new malicious docx files by maintaining the performance of the detection model is evaluated after the model is updatability of the anti-virus and detection model. In order to tested over a stream that consists solely of unknown files that maximize the suggested framework’s contribution, it should be were presented to it on the first day. deployed in strategic nodes (such as ISPs and gateways of large Detection and updating: For every unknown docx file in the organizations) over the Internet network in an attempt to expand stream, the detection model provides a classification, while the its exposure to as many new files as possible. Widespread AL method provides a rank representing how informative the file deployment will result in a scenario in which almost every new is. The framework will consider acquiring the files based on this file goes through the framework. If the file is informative enough information. After being selected and receiving their true labels or is assessed as likely to be malicious, it will be acquired for from the expert, the informative docx files are acquired by the manual analysis. As Figure 3 shows, the docx files transported training set. The signature repository is also updated, in case the over the Internet are collected and scrutinized within our files are malicious. The detection model is retrained over the framework {1}. 
Then, the "known files module" filters all the updated and extended training set which now also includes the known benign and malicious docx files {2} (according to a white acquired examples that are regarded as being very informative. At list, reputation systems [3], and anti-virus signature repository). the end of the day, the updated model receives a new stream of The unknown docx files are then transformed into vector form unknown files on which the updated model is once again tested using our structural feature extraction methodology (SFEM); and from which the updated model again acquires informative these vectors represent the new files and are used for the advanced files. Note that the goal is to acquire as many malicious docx files check {3}. as possible since such information will maximally update the anti- The remaining docx files which are unknown, are then virus software that protects most organizations. We employed introduced to the detection model based on SVM and AL. The several algorithms in order to induce detection models, including detection model scrutinizes the docx files and provides two values the SVM classification algorithm with the radial basis function for each file: a classification decision using the SVM (RBF) kernel in a supervised learning approach using the Lib- classification algorithm and a distance calculation from the SVM [6] implementation. We expect that the SVM classification separating hyperplane using Equation 1 {4}. A file that the AL algorithm will provide the best results since SVM has proven to method recognizes as informative and has indicated should be be very efficient at enhancing the detection of malware when acquired is sent to an expert who manually analyzes and labels it combined with AL methods [4], [5], [7], [8]. {5}. We aim to frequently update the anti-virus software by acquiring these informative docx files and focusing the expert’s efforts on labeling docx files that are most likely to be malware or on benign docx files that are expected to improve the detection model. Note that informative files are defined as those that when added to the training set, improve the detection model's predictive capabilities and enrich the anti-virus signature repository. Accordingly, in our context there are two types of files that may be considered informative. The first type includes files in which the classifier has limited confidence as to their classification (the probability that they are malicious is very close to the probability that they may be benign). Acquiring them as labeled examples will probably improve the model’s detection capabilities. In practical terms, these docx files will have new structural paths or special combinations of existing structural paths that represent their execution code (e.g., inside the macro code of the docx file). Therefore these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin strategy that Figure 3: ALDOCX framework - the process of maintaining selects informative files, both malicious and benign, that are a the updatability of the anti-virus tool and the detection model short distance from the separating hyperplane. using AL methods.

5.3 Selective Sampling and Active Learning malicious files that are very similar to one another and belong to the same virus family is considered a waste of manual analysis Methods resources, since these files will probably be detected by the same Since our framework aims to provide solutions to real problems it signature. Thus, acquiring one representative file for this set of must be based on a sophisticated, fast, and selective high- new malicious files will serve the goal of efficiently updating the performance sampling method. We compared our proposed AL signature repository. In order to enhance the signature repository methods to other strategies, and the methods considered are as much as possible, we check the similarity between the selected described below. files using the kernel farthest-first (KFF) method suggested by 1) Random Selection (Random) Baram et al. [12] which enables us to avoid acquiring very similar While random selection is obviously not an active learning examples. Consequently, only the representative files that are method, it is at the "lower bound" of the selection methods most likely malicious are selected. In cases in which the discussed. We are unaware of an anti-virus tool that uses an active representative file is determined as malware by the security expert, learning method for maintaining and improving its updatability. all variants that were not acquired will be detected the moment the Consequently, we expect that all AL methods will perform better anti-virus is updated. In cases in which these files are not actually than a selection process based on the random acquisition of files. variants of the same malware, they will be acquired the following day (after the detection model has been updated), as long as they 2) The SVM-Simple-Margin AL Method (SVM- are again determined to most likely to be malware. In Figure 4 it Margin) can be observed that there are sets of relatively similar files (based The SVM-Simple-Margin method [9] (referred to as SVM- on their distance in the kernel space), however, only the Margin) is directly related to the SVM classifier. However, in representative files that are most likely to be malware are contrast to our methods, it selects examples according to their acquired. Exploitation explores the "malicious side" to discover distance from the separating hyperplane in order to explore and new and unknown malicious files that are essential for the acquire the informative files without relation to their classified frequent update of the anti-virus signature repository, a process labels (i.e., without specifically focusing on malware instances). which occasionally also results in the discovery of benign files The SVM-Margin AL method is very fast and can be applied to (files which will likely become support vectors and update the real problems; yet, as its authors indicate [9], this agility is classifier). Figure 4 presents an example of a file lying far inside achieved, because it is based on rough approximation and relies the malicious side that was found to be benign. The distance on assumptions that the VS is fairly symmetric and that the calculation required for each instance in this method is fast and hyperplane's Normal (W In Equation 2) is centrally placed, equal to the time it takes to classify an instance in a SVM assumptions that have been shown to fail significantly [11]. The classifier, thus it is applicable for products working in real-time. 
method may query instances in which the hyperplane does not intersect the VS, and therefore may not be informative. The SVM- Margin method for detecting instances of PC malware was used by Moskovitch et.al [10] whose preliminary results found that the method also assisted in updating the detection model but not the anti-virus application itself. This serves as our baseline AL method, and we expect our method to improve the new malicious docx detection and acquisition seen in SVM-Margin. 3) Exploitation: Our Proposed Active Learning Method Our method, "Exploitation," is based on SVM classifier principles and is oriented towards selecting examples most likely to be malicious that lie furthest from the separating hyperplane. Thus, our method supports the goal of boosting the signature repository Figure 4: The criteria by which Exploitation acquires new of the anti-virus software by acquiring as much new malware as unknown malicious docx files. These files lie the farthest from possible. For every file X that is suspected of being malicious, the hyperplane and are regarded as representative files. Exploitation rates its distance from the separating hyperplane using Equation 1 based on the Normal of the separating 4) Combination: A Combined Active Learning Method hyperplane of the SVM classifier that serves as the detection The "Combination" method lies between the SVM-Margin and model. As explained above, the separating hyperplane of the SVM Exploitation methods. On the one hand, in order to acquire the is represented by W, which is the Normal of the separating most informative docx files (acquiring both malicious and benign hyperplane and is actually a linear combination of the most files), the combination method begins by acquiring examples important examples (supporting vectors), multiplied by LaGrange based on SVM-Margin criteria, an exploration phase which is multipliers (alphas) and the kernel function K that assists in important in order to enable the detection model to discriminate achieving linear separation in higher dimensions. Accordingly, the between malicious and benign docx files. On the other hand, the distance in Equation 1 is simply calculated between example X combination method then tries to maximally update the signature and the Normal (W) presented in Equation 2. repository in an exploitation phase, drawing on the Exploitation n n method. This means that in the early acquisition period, during the   first part of the day, SVM-Margin is more dominant compared to Dist(X )   i yi K(xi x) w   y (x )   i i i Exploitation. As the day progresses, Exploitation becomes  1  1 Equation 1 Equation 2 predominant. However, Combination was also applied in the In Figure 4, the files that were acquired (marked with a red circle) course of the ten day experiment, and over a period of days, are the files classified as malicious and are at the maximum Combination performs more Exploitation than SVM-Margin. This distance from the separating hyperplane. Acquiring several new means that on the ith day there is more Exploitation than in the (i- representation to indicate the presence (1) or absence (0) of a 1)th day. feature within a file. We found that a balanced division of labor with SVM-Margin and In order to check how the number of features affects the detection Exploitation achieved the best performance. In short, this method rates, we also took into consideration subsets of prominent tries to take the best from both of these methods. 
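Equations 1 and 2 were garbled by the layout conversion; restated in LaTeX from the surrounding description, Exploitation rates a candidate file x by its kernel-based distance from the separating hyperplane, whose normal w is a weighted combination of the support vectors:

\mathrm{Dist}(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) \quad \text{(Equation 1)}

w = \sum_{i=1}^{n} \alpha_i \, y_i \, \Phi(x_i) \quad \text{(Equation 2)}

A minimal sketch of the Exploitation selection step follows, again assuming the scikit-learn model from the earlier sketch. The RBF-kernel distance used to skip near-duplicate variants is a simplified stand-in for the kernel farthest-first (KFF) calculation of Baram et al. [12], and the min_dist threshold is an illustrative parameter, not a value from the paper.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_distance(a, b, gamma):
    # Squared distance in the RBF feature space:
    # ||phi(a) - phi(b)||^2 = K(a,a) + K(b,b) - 2*K(a,b) = 2 - 2*K(a,b).
    return 2.0 - 2.0 * rbf_kernel(a.reshape(1, -1), b.reshape(1, -1),
                                  gamma=gamma)[0, 0]

def exploitation_select(model, candidates, budget, gamma, min_dist=0.5):
    # Pick up to `budget` files lying deepest on the malicious side of the
    # hyperplane, skipping candidates too similar (in kernel space) to files
    # already chosen, so that near-duplicate variants are not acquired.
    scores = model.decision_function(candidates)   # Equation 1 per candidate
    order = np.argsort(-scores)                    # farthest first
    chosen = []
    for idx in order:
        if scores[idx] <= 0:                       # reached the benign side
            break
        if all(kernel_distance(candidates[idx], candidates[j], gamma) >= min_dist
               for j in chosen):
            chosen.append(idx)
        if len(chosen) == budget:
            break
    return chosen

Combination can then be viewed as splitting the daily acquisition budget between the SVM-Margin criterion (smallest absolute distance) and this routine, shifting the split toward Exploitation as the day, and the ten-day experiment, progresses.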
features and created 8 datasets using the 10, 40, 80, 100, 300, 500, 800, and 1000 top features. The top features in which the 5) Comb-Ploit: A Combined Active Learning Method best detection rates are achieved, will be the features that will be Comb-Ploit is the opposite of Combination, as it starts with used for further experiments. Among the 11 most prominent Exploitation in the early stages of the acquisition of new features, features 1 to 8 are related to the existence of macro code informative malicious docx files and then tries to acquire in the document and its activation, feature 9 is related to the generally informative docx files. The motivation behind this existence of embedded files, and feature 10 is related to OLE approach is that most of the malicious docx files are very objects in the document. Feature 11 signifies the presence of an informative and acquired during the very early stages. The method *.emf image in the document. then changes its strategy and tries to improve its knowledge store by also acquiring benign docx files. As our framework is also Deeper analysis of these features and their percentages of aimed at acquiring more malicious files that are used for occurrence within benign and malicious files contributes to better enhancing the anti-virus tools, we suggested the Comb-Ploit understanding of the dataset's composition. About 44% of the method as an additional contributive selection strategy. malicious files contain macro code, whereas among the benign files only 0.13% to 0.16% contain macro code. Additionally, 6. EVALUATION almost 60% of the malicious files contain an embedded file compared to only 3.14% of the benign files. The most popular 6.1 Dataset Collection attacks through docx files are launched via macro, file embedding, In order to evaluate our proposed framework and methods, and OLE categories, and it is significant that the most prominent we created a large and representative dataset of malicious and features belong to these categories. Note also that the eleventh benign Microsoft Word files (*.docx). We acquired a total of feature is indicative of the presence of an EMF (Enhanced Meta 16,811 files, including 327 malicious and 16,484 benign files, File). We checked the appearance of the EMF feature (11) in our from the three sources presented in Table 1. Our dataset contains dataset and came to the conclusion that it appears in many 1.9% malicious documents in order to adequately reflect the families - not solely in many variants of one family within the reality as closely as possible. During the paper's composition, we malicious files (40.37%). Therefore, the EMF's existence is used all 327 existing malicious docx files in the VirusTotal12 actually a strong, additional indication of maliciousness. repository (as samples of the old format, *.doc files, are not relevant), and we used only malicious-files detected as malicious 6.3 Experimental Design by at least five anti-virus engines. All the files were assured to be Our experimental design aims at providing clear and practical labeled correctly as malicious or benign using VirusTotal service. answers to the following research question:

Table 1: Dataset sources. On a daily basis, is it possible to improve both the detection Malicious Benign model’s performance and the anti-virus detection Dataset Source Year files files capabilities by enlarging the signature repository with new VirusTotal repository 2010-2014 327 15,517 unknown malicious docx files using AL methods and the new structural feature extraction methodology? Contagio13 collection 2013 ---- 100 Ben-Gurion Uni’ (Random) 2010-2014 ---- 867 For this question, we designed a comprehensive and specific Total 327 16,484 experiment, and here we present an experiment pertaining to that research question.
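The dataset construction described above ranks the 134,854 extracted paths by Information Gain and keeps the top-ranked paths as Boolean features. For Boolean features and a binary class, Information Gain reduces to a short entropy calculation; the sketch below is illustrative (function names are ours), assuming X is a binary file-by-path matrix and y holds the class labels (1 = malicious, 0 = benign). An equivalent ranking, up to the logarithm base, could likely be obtained with scikit-learn's mutual_info_classif on discrete features.

import numpy as np

def entropy(y):
    # Shannon entropy of a binary label vector.
    p = float(np.mean(y))
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(feature, y):
    # IG of a single Boolean feature (path present / absent).
    present = feature == 1
    p_present = float(np.mean(present))
    ig = entropy(y)
    for mask, weight in ((present, p_present), (~present, 1 - p_present)):
        if weight > 0:
            ig -= weight * entropy(y[mask])
    return ig

def top_k_paths(X, y, paths, k=5000):
    # Rank all structural paths by Information Gain and keep the top k
    # (the "Top 5000" list used to build the experimental dataset).
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    order = np.argsort(gains)[::-1][:k]
    return [paths[j] for j in order], order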

6.2 Dataset Creation 6.3.1 Updatability and Detection Enhancement We developed a feature extractor based on our structural feature Evaluation extraction methodology (SFEM) in order to extract and analyze The objective of this experiment was to evaluate and the features from all the files in the dataset. The feature extraction compare the performance of our AL methods to the existing process resulted in a vocabulary of 134,854 unique features selection method, SVM-Margin and passive learning (random (paths) extracted from both benign and malicious files. In order to selection), based on two tasks: select the most prominent features from the list of 134,845 features, we used a feature selection method based on Information - Acquiring as many new, unknown malicious docx files as Gain [15]. We sorted the features according to their Information possible on a daily basis in order to efficiently enrich the Gain grade and were left with a list of the 5,000 most prominent signature repository of the anti-virus. features, which we called the “Top 5000.” Using the ranked list of - Updating the predictive capabilities of the detection model the 5,000 most prominent features and the collection of 16,811 that serves as the knowledge store of AL methods and *.docx files (benign and malicious), we created the dataset for our improving its ability to efficiently identify the most experiments which contained 16,811 records and 5,001 fields - informative new malicious docx files. field 5,001 represents the class of the file, “Malicious” if the Over a ten day period, we compared docx file acquisition record represents a malicious docx file and “Benign” if the record based on AL methods to random selection based on the represents a benign docx file. We chose to use a Boolean performance of the detection model. In our acquisition experiments we used 16,811 docx files (16,484 benign, 327 malicious) from our repository and created ten randomly selected 12 https://www.virustotal.com/ datasets with each dataset containing ten subsets of 1,600 files 13 http://contagiodump.blogspot.co.il/ which represent each day’s stream of new files. The 811 sent to the security expert for inspection was reasonable, in terms remaining files were used by the initial training set to induce the of the number that could be dealt with within one day, and was initial model. The experiment’s steps are as follows: also a meaningful amount that could contribute to the efficient 1. Inducing the initial detection model from the initial and frequent update of the detection model and the anti-virus. available training set, i.e., the training set available up to We now present the results of the core measure in this study, day d (the initial training set includes 811 docx files). the number of new, unknown malicious files that were discovered 2. Evaluating the detection model on the stream of day (d+1) and finally acquired into the training set and signature repository to measure its initial performance. of the anti-virus software. As explained above, each day the framework dealt with 1,600 new docx files, consisting of about 33 3. Introducing the stream of day (d+1) to the selective new, unknown malicious docx files (1.9%). Statistically, the more sampling method, which chooses the X most informative files that are selected daily, the more malicious files will be files according to its selection criteria and sends them to the acquired daily. Yet, using AL methods, we tried to improve the expert for manual analysis and labeling. 
number of malicious files acquired by means of existing solutions. 4. Acquiring the informative files and adding them to the More specifically, using our methods (Exploitation, Combination training set, as well as using their extracted signatures to and Comb-Ploit) we also sought to improve the number of files update the anti-virus signature repository. acquired by the SVM-Margin. Figure 5 presents the number of 5. Inducing an updated detection model from the updated malicious docx files obtained by acquiring the 50 files daily, by training set and applying the updated model on the stream of each of the five methods during the course of the ten day the next day (d+2). experiment. Exploitation and Combination outperformed the other Each selective sampling method was checked separately on five selection methods. Combination had a decrease on the first day different acts of file acquisition (each consisting of a different and then had the same performance as Exploitation over the number of docx files). This means that for each act of acquisition, following days. During the course of the ten days, SVM-Margin the methods were restricted to acquiring a number of files equal to and Random selection had a decreasing trend in the number of the amounts that followed, denoted as X: 10 files, 20 files, 50 malicious docx files acquired and showed the poorest files, 100 files, and 250 files. We also considered the acquisition performance in updating the detection model and anti-virus with of all the files in the daily stream (referred to as the ALL method), new malicious docx files. Comb-Ploit was designed to act 50% of which represents an ideal, but impractical way, of acquiring all the the time like Exploitation and during the remaining 50% of the new files and more specifically, all the malicious docx files. The time like SVM-Margin, and its acquisition performance is ALL method was considered in order to compare the effectiveness depicted accordingly. Exploitation succeeded in acquiring the of our methods against the maximal malicious docx file maximal number of malwares from the 50 files acquired daily and acquisition. Inducing the initial detection model from the initial outperformed all the other methods. available training set, i.e., the training set available up to day d (the initial training set includes 811 docx files). This process repeats itself on our dataset from the first day until the tenth day. The performance of the detection model was averaged for ten runs over the ten different datasets that were created. 7. RESULTS Our focus this paper is evaluation of updatability and detection enhancement using our AL methods, therefore the basic detection experiment is out of the scope of this paper, yet we will only conclude its results that serves us as a basis for the main AL process of this paper. After evaluating different classifiers and tops using these three important measures through 10 folds cross validation detection experiment, we can conclude that the configuration that provides the best detection capabilities is the SVM classifier trained on the top 100 features: TPR of 93.34%, Figure 5: The number of malicious docx files acquired by FPR of 0.19% and accuracy rate of 99.67%. Although using the the framework for different methods with the acquisition of 50 top 40, SVM achieved a 0.14% higher TPR rate, we will use the files daily. top 100 features because this had a lower FPR (0.1% lower). 
A On the first day, the number of new malicious docx files was false alarm on benign file (FP) is a very expensive event, which 21 since the initial detection model was trained on an initial set of should be minimized when dealing with detection of malicious 811 labeled docx files that consisted of only 21 malwares (1.9%). files, particularly within organizations. We wanted to measure the updatability improvement process We rigorously evaluated the efficiency and effectiveness of through our active learning based framework. Therefore, in order our AL framework, comparing five selective sampling methods: to have an initial detection model that can be improved over time (1) a well-known existing AL method, termed SVM-Simple- - we used 811 files only to induce the initial detection model Margin (SVM-Margin) based on [9]; our three proposed methods: which resulted in a TPR of 42.13% on the first day. (2) Exploitation, (3) Combination, (4) Comb-Ploit, and (5) On the tenth day, using Combination and Exploitation, 94% random-selection (Random) as a "lower bound." Each method was of the acquired files were malicious (31 out of 33); using SVM- checked for all five acquisition amounts in which the results were Margin, only 3% of the acquired files were malicious (1 file out of the mean of ten different folds. Due to space limitations, we 33—less than Random). This presents a significant improvement depicted the results of the most representative acquisition amount of 91% in unknown docx malware acquisition. Note that on the of 50 docx files which is not much more than the average number tenth day, using Random, only 6% of the acquired docx files were of malicious files (33) found in the daily stream. The 50 docx files malicious (2 out of 33).It can be seen that the performance of our methods was much closer to the ALL line which represents the classifier with the top 100 when trained with over 90% of the maximum malicious docx files that can be acquired each day. dataset. This led to a reduction of 91.4% in labeling efforts of Therefore, in comparing the acquisition graph lines of unknown docx files and also induced a more accurate detection Combination and Exploitation to the ALL graph line, the trend is model with a TPR of 93.6% compared to 93.34% TPR for the quite clear from the third day: each day, Combination and detection model without AL. The FPR rates were low especially Exploitation maintained the same high acquisition rates of 93- among our AL methods and varied between 0.13% and 0.21%. 94% of malicious files despite the fact that every day there were new, unknown files—a finding that demonstrates the impact of updating the detection model with new informative files which results in identifying new malwares for enriching the signature repository of the anti-virus. SVM-Margin AL method showed a decrease in the number of malwares acquired from the first day. This phenomenon can be explained by looking at the ways the methods work. The SVM-Margin acquires documents about which the detection model is less confident. Consequently, these documents are considered to be more informative but not necessarily malicious. As was explained previously, SVM-Margin selects new informative docx files inside the margin of the SVM. Over time and with the improvement of the detection model regarding more malicious files, it seems that the malicious files are less informative (due to the fact that malware writers frequently try to use upgraded variants of previous malwares). 
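The daily TPR and FPR trends discussed next (and plotted in Figure 6) can, roughly, be reproduced by applying the current detection model to each day's full stream of unknown files before acquisition; a minimal sketch, assuming the same model object as in the earlier sketches:

def daily_tpr_fpr(model, day_X, day_y):
    # True positive rate and false positive rate of the current model on
    # one day's stream (labels: 1 = malicious, 0 = benign).
    predictions = model.predict(day_X)
    tp = sum(1 for p, t in zip(predictions, day_y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predictions, day_y) if p == 1 and t == 0)
    positives = sum(1 for t in day_y if t == 1)
    negatives = len(day_y) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr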
Figure 6: The TPR of the framework over the ten days for Since these new malwares might not lie inside the margin, SVM- different methods through the acquisition of 50 docx files Margin may actually be acquiring informative benign files rather daily. than malicious ones. However, our methods of Combination and Exploitation are more oriented toward acquiring the most The results of the predictive capabilities demonstrated by the TPR informative files and those most likely to be malicious by and FPR measures indicate that our Exploitation and Combination obtaining informative docx files from the malicious side of the methods performed as well as the SVM-Margin method with SVM margin. As a result, an increasing number of new malwares regard to predictive capabilities (TPR and FPR) but, as can be are acquired; in addition, if an acquired benign file lies deep seen in Figure 5, much better than the SVM-Margin in acquiring a within the malicious side, it is still informative and can be used large number of new docx malwares daily and enriching the for learning purposes and to improve the next day’s detection signature repository of the anti-virus. capabilities. We have shown here that our AL methods We now compare the detection rate of our methods with the outperformed the SVM-Margin AL method and improved the leading anti-virus tools commonly used by organizations. From capabilities for acquiring new docx malwares and enriching the Figure 7, it can be seen that the most accurate anti-virus, signature repository of the anti-virus software. In addition, as is TrendMicro, had a detection rate of 85.9%, while our methods shown in Figure 6, our methods also maintain the predictive outperformed all the anti-viruses in the task of detecting new performance of the detection model that serves as the knowledge malicious docx files. Using the SVM classifier with 100 structural store of the acquisition process. features (SFEM), we achieved a detection rate of 93.4% which Figure 6 presents the TPR levels and their trends during the required using 90% of the dataset for training (due to the 10XV ten day course of study. SVM-Margin outperformed other settings). Using the full ALDOCX framework, including the AL selection methods in the TPR measure, while our AL methods, methods and its enhancement process, we also improved the Combination, Exploitation and Comb-Ploit, came very close to performance of SVM-SFEM, achieving a TPR of 93.6% using SVM-Margin (SVM-Margin achieved 0.4% better TPR rates than only 7.7% of labeled data (1,311 docx files out of 16,811), which Comb-Ploit, 0.5% better than Exploitation and 1.3% better than means a reduction of 91.4% in labeling efforts. Combination) and performed much better than Random. In addition, the performance of the detection model improves as more files are acquired daily, so that on the tenth day of the experiment, the results indicate that by only acquiring a small and well-selected set of informative files (50 docx files out of 1600 are 3% of the stream), the detection model can achieve TPR levels that are quite close to those achieved by acquiring the whole stream (93.6% with SVM-Margin, 93.2% with Comb-Ploit, 93% with Exploitation, and 92.3% with Combination, as compared with 94.4% for the whole stream). These factors demonstrate the benefits obtained by performing this process on a daily basis. To better demonstrate the contribution of AL methods within ALDOCX, we will now compare the TPR rates with and without AL methods. 
The TPR rate of 93.6% was achieved in this AL Figure 7: The TPR of the framework against anti-viruses process using only 1,311 docx files (811 initial set + 500 acquired commonly used by organizations. after ten days) out of the total 16,811, which is 7.7%. Whereas, in the previous experiment of detection evaluation (without AL), the best TPR rate was 93.34% which was achieved using the SVM 8. DISCUSSION AND CONCLUSION without AL using 90% of the docx files to be labeled. Our AL We presented ALDOCX, a framework for enhancing the detection methods also showed a significant improvement of 91% in of unknown malicious Microsoft Word documents using unknown docx malware acquisition compared to passive learning designated active learning methods. ALDOCX is based mainly on and SVM-Margin. These results show that the ALDOCX machine learning algorithms trained with our new Structural framework can be deployed on strategic nodes of the Internet Feature Extraction Methodology (SFEM), which is static, fast, network in order to actively identify, select, and acquire the most and robust against the different evasion attacks used by attackers. informative and most likely malicious docx files, and thus provide We evaluated our framework through a comprehensive series of an efficient, frequent, and valuable update for anti-virus software experiments using our large and representative dataset. We found that is widely used by organizations. We also compared the that the configuration that yielded the best results was the SVM detection rate of our methods with several widely used anti-virus classifier trained on the top 100 structural features, which engines commonly used by organizations. The best detection rate, achieved a TPR of 93.34%, FPR of 0.19%, and an accuracy rate 85.9%, was provided by TrendMicro, while our methods of 99.67%. If the number of features plays a significant role, then outperformed all the anti-viruses in the task of detecting new the top 40 could also be used for achieving nearly the same malicious docx files. Using the SVM classifier with 100 structural results. The number of features (top 40, top 100) showed that the features (SFEM), we achieved a TPR of 93.4% which required detection of malicious docx files with high TPR rates requires the using 90% of the dataset for training (due to the 10 folds XV consideration of more than simply the top ten "trivial" features settings). Using the full ALDOCX framework, including the AL (macro, embedding, OLE) and must include features that are methods and its enhancement process, we even improved the extracted from deep within the structure of the docx file. These performance of SVM-SFEM, achieving a 93.6% TPR using only non-trivial features are difficult for an attacker to learn and cannot 7.7% of the labeled data (1,311 docx files out a total of 16,811), be evaded easily. The results our results in the detection which represents a reduction of 91.4% in labeling efforts. In experiment show that the ALDOCX framework can be integrated future work, we are interested in extending this framework to the into Microsoft Office tools or deployed on endpoints in order to detection of additional Microsoft Office files, such as xlsx and detect malicious docx files. Our contribution is even greater since pptx which share the same XML based structure as the docx file. our new feature extraction methodology, SFEM, is general and aimed also at other Microsoft Office XML based files (e.g., Excel 9. REFERENCES [1] T. Schreck, S. Berger and J. Göbel. 
"BISSAM: Automatic vulnerability [*.xlsx], Power-Point [*.pptx]), a level of coverage that other identification of office documents," in Detection of Intrusions and existing techniques are incapable of. Malware, and Vulnerability AssessmentAnonymous 2013. Using ALDOCX’s designated AL methods, we showed that [2] J. Dechaux, E. Filiol and J. Fizaine. Office documents: New weapons of we can efficiently update anti-virus software with unknown cyberwarfare. 2010. malicious docx files. With our updated classifier, we can better [3] Jnanamurthy, H. K., Chirag Warty, and Sanjay Singh. "Threat Analysis detect new malicious docx files that can be utilized to sustain anti- and Malicious User Detection in Reputation Systems using Mean virus software. Both the anti-virus and the detection model Bisector Analysis and Cosine Similarity (MBACS)." (2013). (classifier) must be updated with new and labeled docx files. The [4] Nissim, N., Moskovitch, R., Rokach, L., and Elovici, Y. (2012). framework seeks to acquire the most informative docx files, Detecting unknown computer worm activity via support vector machines benign and malicious, in order to improve classifier performance, and active learning. Pattern Analysis and Applications, 15(4), 459-475. enabling it to frequently discover and enrich the signature [5] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS, repository of anti-virus software with new unknown malware. Expert Systems with Applications. In general, four of the AL methods performed very well at [6] Chang, C.C., and Lin, C. J. (2011). LIBSVM: a library for support vector updating the detection model, with two of our methods, machines. ACM Transactions on Intelligent Systems and Technology Combination and Exploitation, outperforming SVM-Margin in the (TIST),2(3), 27. study’s highest objective - the acquisition and detection of new [7] Nissim N, Cohen A, Moskovitch R, Barad O, Edry M, A S, Elovici Y. unknown malicious docx files. The evaluation of the classifier ALPD: Active Learning Framework for Enhancing the Detection of before and after daily acquisition showed an improvement in the Malicious PDF Files Aimed at Organizations. Proc’ of JISIC. 2014. detection rate, and subsequently more new malicious files were [8] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici Detection of acquired. On the tenth day, Combination and Exploitation Malicious PDF Files and Directions for Enhancements: a State of the Art acquired more than ten times more malicious docs files (31) than Survey, Computers & Security, Volume 49, November 2014, Pages 1-18. the number acquired by SVM-Margin (3 docx files) and more [9] S. Tong, and D. Koller, "Support vector machine active learning with applications to text classification," JMLR, 2:45–66, 2000-2001. than five times more malicious docx than those acquired by the Random method (6 malicious docx files). While our Combination [10] Moskovitch, R., Nissim, N., & Elovici, Y. (2009). Malicious code detection using active learning. In Privacy, Security, and Trust in and Exploitation methods showed a stable trend and almost KDD (pp. 74-91). Springer Berlin Heidelberg. perfect acquisition rates in the number of malicious docx files in [11] Herbrich, Ralf, Thore Graepel, and Colin Campbell. "Bayes point the course of the ten days, SVM-Margin and Random showed a machines."The Journal of Machine Learning Research 1 (2001): 245-279. steep decrease and poor performance along the ten days. 
[12] Baram, Y., El-Yaniv, R., and Luz, K.. Online choice of active learning Therefore our framework was found to be effective at updating the algorithms. Journal of Machine Learning Research, 5, 255-291, 2004. anti-virus software by acquiring the maximum number of [13] Kiem, H., Thuy, N.T., Quang, T.M.N., A machine learning approach to malicious PDF files. By acquiring only 50 docx files out of the anti-virus system (2004) Joint Workshop of Vietnamese Society of AI, new 1600 docx files presented to the framework each day, we SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, pp. 61- showed that our AL methods can provide a reduction of 91.4% in 65. , 4-7 December, Hanoi-Vietnam. the labeling efforts of unknown docx files (only 7.7% of the docx [14] N. Šrndic and P. Laskov. Detection of malicious pdf files based on files needed to be labeled) and also induced a more accurate hierarchical document structure. Presented at Proceedings of the 20th Annual Network & Distributed System Security Symposium. detection model, with a TPR of 93.6% compared to the results achieved in the detection experiment (TPR of 93.34%). The [15] Quinlan, J.R. (1993). C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 93.34% TPR rate in the detection experiment was achieved

6.2 Additional Accepted Papers in the Biomedical Informatics Domain

In this section we present our additional papers in the biomedical informatics domain, in which we extended our AL framework; these papers support the generality of the framework beyond the four core papers. The published versions of the papers are presented following this list.

[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak. "CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification." BigCHAT Workshop at the KDD Conference (2014).

Winner of the Mario Stefanelli Best Paper Award at the AIME 2015 conference:

[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak. "An Active Learning Framework for Efficient Condition Severity Classification." Artificial Intelligence in Medicine (AIME 2015), Springer International Publishing, pp. 13-24 (2015).

[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch. "Improving Condition Severity Classification with an Efficient Active Learning Based Framework." Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464.


CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification

Nir Nissim (1), Mary Regina Boland (2), Robert Moskovitch (2), Nicholas Tatonetti (2), Yuval Elovici (1), Yuval Shahar (1), George Hripcsak (2)
(1) Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel
(2) Biomedical Informatics, Columbia University, New York, New York, USA
nirni,elovici,[email protected]; mb3402,robert.moskovitch,nick.tatonetti,[email protected]

ABSTRACT phenotypes from EHRs for Phenome-Wide Association Studies [3]. Defining phenotypes from EHRs is a complex process Electronic Health Records (EHRs) are a treasure trove of health- because of discrepancies in definitions [4], data sparseness, data related data. Prioritizing conditions extracted from EHRs is important for minimizing the burden on medical experts, who often quality [5], bias [6], and healthcare process effects [7]. need to manually review patient conditions for accuracy. Severity is Currently, around 100 conditions/phenotypes have been useful for prioritizing and discriminating among conditions. successfully defined and extracted from EHRs. However, a short Recently, a framework called CAESAR (Classification Approach list of conditions ranked by priority (or severity) remains for Extracting Severity Automatically from Electronic Health lacking. Records), for classifying condition-level severity, was proposed. However, it used passive learning that requires extensive manual To generate a prioritized list of conditions, we sought to rank labeling efforts by medical experts of each condition severity. We them by their severity status by classifying conditions as either present CAESAR-ALE, an Active Learning (AL) based framework that uses several Active Learning (AL) methods to decrease the severe or mild at the condition-level. Classifying severity at the manual labeling efforts required. At each step in the algorithm, condition-level distinguishes acne as a mild condition from only the most informative conditions are labeled to train the myocardial infarction (MI) as a severe condition. In contrast, algorithm. Our results show that our first method, which we refer patient-level severity assesses whether a given patient has a mild to as Exploitation, reduced labeling efforts by 64% while achieving or severe form of a condition (e.g., acne). The bulk of the a true positive rate equivalent to that achieved by passive methods. literature focuses on patient-level condition severity with many Additionally, our second method that we refer to as indices being unique to a given condition [8-11]. However, none Combination_XA reduced labeling efforts by 48% while achieving accuracy equivalent to that achieved by passive learning. Our of these indices capture severity at the condition-level. proposed methods (Exploitation and Combination_XA) were superior in identifying a larger number of severe conditions, Others developed methods to study patient-specific condition compared to SVM-Margin and the Random methods with a severity at the whole-body level, e.g., the Severity of Illness reduction of 46% in labeling efforts. As for the PPV (precision) Index [12], for a wide range of conditions. This is useful for measure, CAESAR-ALE achieved a 71% relative improvement in characterizing patients as severe or mild manifestations of a the predictive capabilities of the framework when classifying given disease condition. However, it does not measure severity conditions as severe. These results demonstrate the potential of AL methods to decrease the labeling efforts of medical experts, while at the condition-level (e.g., acne vs. MI), which is required to increasing accuracy given the same or even a smaller number of prioritize conditions by severity and thereby reduce the selection acquired conditions. space to only the most severe conditions.

Keywords In this paper, we describe the development and validation of an Active Learning, Electronic Health Records, Phenotyping. Active Learning (AL) approach to classifying severity from EHRs. This builds on a previous passive learning approach 1. INTRODUCTION called CAESAR (Classification Approach for Extracting Connected health is increasingly becoming a common Severity Automatically from Electronic Health Records). We framework to improve health service. An important component call our Active Learning Enhancement of CAESAR, CAESAR- is diagnosing patients and labeling the severity of their ALE, and we demonstrate that it reduces the burden on medical diagnoses, which we focus in this study using active learning experts by minimizing the number of conditions requiring approach to decrease labeling efforts. Many national and severity assignment. CAESAR-ALE works well in the international organizations study conditions and their clinical biomedical domain by utilizing EHR-derived variables to assess outcomes. The Observational Medical Outcomes Partnership severity of EHR-derived conditions. (OMOP) standardized condition/phenotype identification and extraction from electronic data sources including Electronic Health Records (EHRs) [1]. The Electronic Medical Records and Genomics Network [2] successfully extracted some 20 2. BACKGROUND 2.3 Active Learning In prior work, a classification method was developed called Labeling examples, which is crucial for the learning process, is CAESAR (Classification Approach for Extracting Severity often an expensive task since it involves human experts. Active Automatically from Electronic Health Records) using a passive learning (AL) was designed to reduce labeling efforts by learning approach to capture condition severity from EHRs [13]. actively selecting the examples with the highest potential This method required medical experts to manually review contribution to the learning process of the classification model. conditions and assign severity status to each (severe or mild). AL is roughly divided into two major approaches: membership They assigned severity to a set of 516 conditions included in the queries [24], in which examples are artificially generated from reference standard. These severity assignments were then used the problem space and selective-sampling [25], in which to evaluate the quality of the classifier. The review of conditions examples are selected from a pool. Selective-sampling is used was limited to 516 conditions out of the 4,683 conditions in this paper. Studies in several domains have successfully included in the reference standard, because medical expert applied active learning in order to reduce the time and money review is time-consuming and costly. required for labeling examples. Unlike random learning, in which a classifier randomly selects examples from which to 2.1 SNOMED-CT learn, in active learning the classifier actively indicates the SNOMED-CT (Systemized Nomenclature of Medicine-Clinical specific examples that should be labeled and which are Terms) is a specialized ontology developed to capture commonly the most informative examples for the training task. conditions from EHRs obtained during the clinical encounter [14, 15]. SNOMED-CT is the terminology of choice of the Active learning approaches can be useful for selecting the most World Health Organization and the International Health discriminative conditions from the entire dataset in order to Terminology Standards Development Organization (IHTSDO). 
minimize the number of conditions that experts need to It also satisfies Meaningful Use requirements of the Health manually review. Doing this focuses experts’ efforts on a Information Technology component of the American Recovery smaller subset of conditions, thereby saving time and money. and Reinvestment Act of 2009 [16], and often clinical ontologies are used for the retrieval of clinical guidelines [39]. 2.4 Active Learning in Biomedicine Therefore, we used SNOMED-CT to extract patient conditions Although applications of active learning algorithms have been from EHRs treating each coded clinical event as a “condition” widely demonstrated in other domains, their applications in the or “phenotype,” knowing that this is a broad definition [4]. biomedical domain have been limited. Liu described a method similar to relevance feedback [26]. Warmuth et al. used a 2.2 Classification of Conditions similar approach to separate active (positive side) and inactive Classification of conditions in the biomedical domain typically (negative side) compounds [27]. More recently, active learning is based on two main methods: 1) manual approach where was reported to be useful in biomedicine for classification of experts assign labels to conditions; and 2) passive classification text [28] and radiology reports [29]. In all these cases active approaches (typically supervised) where a dataset is labeled learning methods were found to perform better than passive based on a subset of labeled data. learning. The Chronic Condition Indicator (CCI) was developed, as part of the Healthcare Cost and Utilization Project, using a totally 3. METHODS manual approach [17] to assign chronicity categories (acute vs. chronic) to ICD-9 codes. Medical experts were asked whether or 3.1 Dataset Development not a particular ICD-9 code was chronic. Disagreements were The dataset used in this study was developed as a result of prior handled by consultation with one of the physician panel work [13]. It contains 516 conditions (SNOMED-CT codes) members [17]. The CCI built on original work by Hwang et al. labeled as mild and severe. The labeling was performed using a [18], and it has been used successfully in multiple studies [18, set of expert heuristics, described in detail elsewhere [13] and 19] demonstrating the value of manual expert labeling. validated with five independent evaluators. The dataset also contains six severity measures for each condition: number of Others employed passive learning approaches, including Perotte comorbidities, number of procedures, number of medications, et al. who classified International Classification of Disease cost, treatment time, and a proportion term. These six measures version 9 (ICD-9) codes and showed that leveraging the ICD-9 were used to classify conditions previously in the original hierarchy outperformed treating ICD-9 codes as a flat list [20]. passive learning method (CAESAR). Therefore, when Another work by Perotte et al. classified conditions into constructing and testing CAESAR-ALE, we used exactly the chronicity categories [21]. Other machine learning approaches same dataset used in developing and evaluating CAESAR. have been used in biomedicine typically in the subfield of text classification. Torii et al. showed that the performance of 3.2 The CAESAR-ALE Framework automatic taggers improved when trained on a dataset The purpose of the CAESAR-ALE framework is to decrease the comprised of multiple data sources [22]. They also mention the labeling efforts of an expert. 
Using active learning, the need to have more documents available for training to improve framework actively asks the expert to label a specific condition performance [22], a common issue in passive learning as severe or mild, rather than label randomly selected techniques. Nguyen et al. built an algorithm for classifying lung conditions. The workflow of the CAESAR-ALE framework is cancer stages (tumor, node, and metastasis) using pathology described in figure 1. reports and SNOMED-CT [23].

Figure 1 illustrates the framework and the process of labeling both severe and mild, that are a short distance from the and acquiring new conditions by maintaining the updatability of separating hyper-plane. The second type of informative the classification model. If the AL method finds the condition to conditions includes those that lie deep inside the severe side of be more informative than others, the conditions will be acquired the SVM margin and are a maximal distance from the separating for labeling. Conditions are collected and scrutinized within our hyper-plane according to Equation 1. These conditions will be framework. Then, they are transformed into a vector form (as acquired by the Exploitation method (which will be further with the CEASER method) for the advanced check. The explained below) and are also a maximal distance from the conditions are then introduced to the classification model based labeled conditions. This distance is measured by the KFF on a Support Vector Machine (SVM) and Active Learning (AL) calculation that will be further explained below as well. method. The classification model scrutinizes the conditions and The motivation underlying the selection of the conditions that provides two values for each condition: a classification decision are most likely to be classified as severe is based on two using the SVM classification algorithm and a distance reasons: first, severe conditions have a higher medical and calculation from the separating hyper-plane using Equation 1. A practical value, since they provide information about high condition that the AL method recognizes as informative and priority conditions, those that should be treated more urgently. which it has indicated should be acquired is sent to an expert Second, by attempting to select conditions from deep within the who labels it. The labeled conditions are then added to the "severe" instances sub-space of the SVM's separating hyper- training set for further use. By labeling the most informative plane, we may encounter a mild condition that was erroneously conditions, we aim to frequently update and improve the considered as being severe; the addition of this mild condition to classification model. Note that informative conditions are the classifier provides highly valuable information, which defined as those that when added to the training set improve the greatly improves the classification model. Finally, the classification model's predictive capabilities. informative conditions are then added to the training set for updating and retraining the classification model (Figure 1). The framework integrates two main phases: training and classification/updating. Training: A classification model is trained over an initial training set that includes both severe and mild conditions. After the model is tested over a test set that consists only of unknown conditions that were not presented to it during training, the initial performance of the classification model is evaluated. Classification and updating: For every condition in the pool of unknown conditions the classification model provides a classification, while the AL method provides a rank representing how informative the condition is. The framework will then consider acquiring the conditions based on this. After being selected and receiving their true labels from the expert, the informative conditions are acquired by the training set (and removed from the pool). 
The classification model is retrained over the updated and extended training set that now also Figure 1: The process of using AL methods to detect discriminative includes the acquired conditions that are regarded as being very conditions requiring medical expert annotation. informative. At the end of the update, the updated model again receives the pool of unknown conditions from which the The AL methods are based on the predictive capabilities of the updated framework and model again actively select informative classification model, thus an updated classification model conditions. directly affects the AL method's ability to select the most informative conditions and by doing so decreases the labeling We employed the SVM classification algorithm using the radial efforts; with few and well-selected labeled conditions we can basis function (RBF) kernel in a supervised learning approach. maintain an accurate model and decrease the labeling efforts, in We used the SVM algorithm, because it has proven to be very contrast to a situation in which the expert is required to labeled a efficient when combined with AL methods [26], [27]. In our large number of less informative conditions. Accordingly, in our experiments we used Lib-SVM implementation [30], because it context, there are two types of conditions that may be also supports multiclass classification. considered informative. The first type includes conditions in which the classifier has limited confidence as to their 3.3 Active Learning Methods classification (the probability that they are mild is very close to Since our framework aims to provide solutions to real problems the probability that they may be Severe). Labeling them would it must be based on a sampling method. We compared our improve the model’s classification capabilities. In practical proposed AL methods to other strategies, and all the methods terms, these conditions will have new combinations of features considered are described below. (e.g., low in cost and requiring a long treatment time) or special 3.3.1 Random Selection (Random) combinations of existing features that represent their particular Random selection (also referred to as random learning or permutations. Therefore, these conditions will probably lie passive learning) is the default case in machine learning, in inside the SVM margin and consequently will be acquired by the SVM-Margin strategy that selects informative conditions, which the classifier is given by a set of labeled training Accordingly, the distance in Equation 1 is simply calculated examples. Thus, this is used as a baseline method. between example X and the Normal (W) presented in Equation While random selection is obviously not an active learning 2.  n  n method, it is at the "lower bound" of the selection methods Dist(X )    y K(x x)  i i i w   i yi (xi ) discussed. Consequently, we expect that all AL methods will  1   1 (2) perform better than a selection process based on the random (1) selection of examples. In Figure 2 the conditions that were acquired (marked with a red 3.3.2 The SVM-Simple-Margin AL Method (SVM-Margin) circle) are those conditions classified as severe and have The SVM-Simple-Margin method [31] (referred to as SVM- maximum distance from the separating hyper-plane. Acquiring Margin) is directly related to the SVM classifier. 
Using a kernel several new severe conditions that are very similar and whose function, the SVM implicitly projects the training examples into values share nearly the same features is considered a waste of a different (usually a higher dimensional) feature space denoted manual analysis resources. Thus, acquiring one representative by F. In this space there is a set of hypotheses that are consistent condition for this set of new severe conditions will serve the with the training set, and these hypotheses create a linear goal of efficiently updating the classification model. In order to separation of the training set. From among the consistent enhance the training set as much as possible, we also check the hypotheses (referred to as the version-space (VS)), the SVM similarity between the selected conditions using the kernel identifies the best hypothesis with the maximum margin. To farthest-first (KFF) method suggested by Baram et al. [33] achieve a situation where the VS contains the most accurate and which enables us to avoid acquiring conditions that are quite consistent hypothesis, the SVM-Margin AL method selects similar. Consequently, only the representative conditions that examples from the pool of unlabeled examples reducing the are most likely severe are selected. In Figure 2 it can be number of hypotheses. This method is based on simple observed that there are sets of relatively similar conditions heuristics that depend on the relationship between the VS and (based on their distance in the kernel space), however, only the the SVM's maximum margin. The heuristics are used since representative conditions that are most likely to be severe are calculating the VS is complex and impractical where large acquired. The SVM classifier defines the class margins using a datasets are concerned. Examples that lie closest to the small set of supporting vectors (i.e., conditions). While the usual separating hyper-plane (inside the margin) are more likely to be goal is to improve classification by uncovering (labeling) informative and new to the classifier, and these examples are conditions from the margin area, Exploitation's goal is to selected for labeling and acquisition. acquire conditions in order to enhance the detection of severe This method selects examples according to their distance from conditions. Contrary to SVM-Margin which explores examples the separating hyper-plane only to explore and acquire the that lie inside the SVM margin, Exploitation explores the informative conditions without relation to their classified labels, "severe side" to discover new and unknown severe conditions i.e., not specifically focusing on severe or mild conditions. The that are essential for the detection of severe conditions which SVM-Margin AL method is very fast and can be applied to real might cost more money and requires different treatment in problems, yet as its authors indicate [N-18], this agility is hospitals (conditions which will likely become support vectors achieved because it is based on a rough approximation and and update the classifier). relies on assumptions that the VS is fairly symmetric and that Figure 2 presents an example of a condition lying far inside the the hyper-plane's Normal (W) is centrally placed, assumptions severe side that was found to be mild. The distance calculation that have been shown to fail significantly[32]. 
The method may required for each instance in this method is quick and equal to query instances whose hyper-plane does not intersect the VS and the time it takes to classify an instance in a SVM classifier, thus therefore may not be informative. it is applicable for products working in real-time. 3.3.3 Exploitation We have developed a method, called "Exploitation", for efficient detection of malicious contents [34, 35], such as for malicious files [36, 38] or documents [37]. Exploitation is based on the SVM classifier principles and is oriented towards selecting examples most likely to be severe that lie furthest from the separating hyper-plane. Thus, this method supports the goal of boosting the classification capabilities of the classification model by acquiring as many new severe conditions as possible. For every condition X that is suspected of being severe, Exploitation rates its distance from the separating hyper-plane using Equation 1 based on the Normal of the separating hyper- plane of the SVM classifier that serves as the classification model. As explained above, the separating hyper-plane of the SVM is represented by W, which is the Normal of the separating hyper-plane and actually a linear combination of the most important examples (supporting vectors), multiplied by Figure 2: The criteria by which Exploitation acquires new unknown LaGrange multipliers (alphas) and by the kernel function K that severe conditions. These conditions lie the farthest from the hyper- assists in achieving linear separation in higher dimensions. plane and are regarded as representative conditions. 3.3.4 Combination_XA: A Combined Active Learning equal to the amount that followed, denoted as X: five Method conditions, 10 conditions, 20 conditions and 30 conditions. The "Combination_XA" method lies between SVM-Margin and Exploitation. It conducts a cross acquisition of informative The experiment’s steps are as follows: conditions, which means the first trial it selects conditions 1. Inducing the initial classification model from the initial according to SVM-Margin criteria and next trial it selects available training set (the initial training set includes six according to Exploitation criteria and so on with cross selection. conditions). Thus, on the odd trials the Combination_XA method selects 2. Evaluating the classification model on the test set of 200 examples based on SVM-Margin criteria in order to acquire the conditions to measure its initial performance. most informative conditions, acquiring both severe and mild 3. Introduction of the pool of unknown and unlabeled conditions conditions; this exploration phase is important in order to enable to the selective sampling method, which chooses the X most the classification model to discriminate between severe and mild informative conditions according to its criteria and sends them conditions. While on even trials, the Combination_XA method to the medical expert for labeling. then tries to maximally update the detection capabilities of severe conditions using the exploitation phase, drawing on the 4. Acquiring the informative conditions, removing them from Exploitation method. On the one hand, this strategy is aimed at the pool and adding them to the training set. selecting the most informative conditions, both mild and severe, 5. 
Inducing an updated classification model from the updated and on the other hand, it tries to boost the classification model training set and applying the updated model on the pool with severe conditions or very informative mild conditions (which now contains fewer conditions). which are confusing and lie deep inside the severe side of the This process repeats itself on our dataset from first trial until the SVM's hyper-plane. entire pool is acquired. 4. EVALUATION The objective in our primary experiment was to evaluate and 5. RESULTS compare the performance of our new AL methods to the existing We evaluated the efficiency and effectiveness of our framework selection method, SVM-Margin, on the tasks of: by comparing four selective sampling methods: 1) a well-known existing AL method, termed SVM-Simple-Margin (SVM- - Updating the predictive capabilities (Accuracy) of the Margin) based on [27]; our proposed methods 2) Exploitation, classification model that serves as the knowledge store of AL 3) Combination_XA, and 4) random-selection (Random) as a methods and improving its ability to efficiently identify the "lower bound." Each method was checked against all four most informative new conditions. acquisition amounts, in which the results were the mean of 10 - Identifying which of the AL methods better improve the different folds. Due to space limitations we present the results of capabilities of the classification model to correctly classify the most representative acquisition amount of five conditions in the severe conditions (TPR) with minimal errors (FPR), a each trial. task which is of particular importance given the need to We now present the results of the core measures in this study the identify severe conditions from the outset. accuracy, TPR, and the improvement in the classification model regarding these measures after each acquisition and retraining During a variable number of acquisition trials that ended trial. In addition, we also measured the number of new severe with acquiring every condition in the pool of unlabeled conditions that were discovered and finally acquired into the condition, we compared the acquisition of conditions based on training set. As explained above, five conditions (while the AL methods to random selection based on the performance of acquisition amount varied, we present only the most pertinent the classification model. In our acquisition experiments we used results from our experiments using an acquisition amount of five 516 conditions (372 mild, 144 Severe) in our repository and conditions because of page constraints) were selected from a created 10 randomly selected datasets with each dataset pool of new unlabeled conditions during each trial of CAESAR- containing three elements: an initial set of six conditions that ALE. It is well known that selecting more conditions per trial were used to induce the initial classification model, a test set of will improve accuracy. However, we wanted to reduce the 200 conditions on which the classification was tested and medical experts’ efforts in labeling conditions, and therefore, we evaluated after every trial in which it was updated, and a pool of used AL methods to maximally improve the classification the remaining 310 unlabeled conditions, from which the model's accuracy while minimizing the number of conditions framework and the selective sampling method selected the most acquired. More specifically, we used two of our methods informative conditions according to that method’s criteria. 
The (Exploitation and Combination_XA) to reduce the number of informative conditions were sent to a medical expert who conditions selected by SVM-Margin. labeled them. The conditions were later acquired by the training set that was enriched with an additional X new informative Figure 3 presents the Accuracy levels and their trends in the 62 conditions. The process was repeated over the next trials until trials at acquisition level of five conditions per trial (62*5=310 the entire pool was acquired. The performance of the conditions in pool). As can be seen, in most of the trials all of classification model was averaged for 10 runs over the 10 the AL methods outperformed Random. This shows that the use different datasets that were created. Each selective sampling of AL methods can reduce the number of conditions required to method was checked separately on four different acts of achieve similar accuracy to the passive learning methods (i.e., condition acquisition (each consisting of a different number of Random). The classification model had an initial accuracy of conditions). This means that for each act of acquisition, the 0.72, and all methods converged at an accuracy of 0.975 after methods were restricted to acquiring a number of conditions the pool was fully acquired into the training set. Figure 4: The TPR of the framework over 61 trials for different methods through the acquisition of five conditions in each trial. We now present the results of another important measure in this study, the number of new severe conditions that were discovered and finally acquired into the training set. As explained above, during each trial the framework deals with pool of conditions beginning with 310 conditions, consisting of about 82 new severe conditions. Statistically, the more conditions selected during each trial, the more severe conditions will be acquired. Yet, using AL methods, we tried to improve the number of severe conditions acquired by means of existing solutions. More specifically, using our methods (Exploitation and Combination_XA) we also sought to improve the number of files acquired by SVM-Margin. Figure 5 presents a cumulative number of severe conditions obtained by acquiring the five conditions during each trial by each of the four methods until the pool was fully acquired. From the fifth trial, Exploitation and Combination_XA outperformed Figure 3: The accuracy of the framework over 62 trials for different the other selection methods (their graph intersects in Figure 5). methods (Five conditions acquired during each trial). It can be observed that after 23 trials (115 conditions) both of The first selection method that arrived at a 0.95 rate of accuracy our AL methods acquired 73 severe condition out of the 82 was our Combination_XA method, which required 23 severe conditions in the pool, whereas SVM-Margin and acquisition trials (acquisition of 115 conditions out of 310 Random achieved it after 42 trials (210 conditions) and 60 trials conditions) while other AL methods required 26 trials. When (300 conditions), respectfully. This represents a reduction of compared to random selection, the Combination_XA method 46% compared to SVM-margin and 62% compared to Random. performed almost twice as well (23 vs. 44 trials) while achieving The greatest difference between our AL methods and SVM- the same accuracy (i.e., 0.95). Margin was 15 severe conditions after 23 trials, while during the Figure 4 presents TPR levels and their trends over 62 trials. 
Five same trial we also observed the greatest difference between our new conditions are acquired during each trial (62*5=310 AL methods and Random, a difference of 43 severe conditions conditions in pool). Using TPR, Exploitation outperformed the after the 23 trials. The difference between our AL methods and other selection methods. It achieved a 0.85 TPR rate after only Simple-SVM can be explained by the way this method acts: The 17 trials (85 conditions out of 310), while random selection SVM-Margin acquires examples about which the classification achieved a TPR of 0.85 after 47 trials. model is less confident. Consequently, they are considered to be more informative but not necessarily severe. As was explained In addition, the performance of the classification model previously, SVM-Margin selects informative conditions inside improved as more conditions were acquired. After 36 trials, all the margin of the SVM. Over time and with the improvement of AL methods converged to TPR rates around 0.92. Our results the detection model towards more severe conditions, it seems show that using AL methods to select discriminative conditions that the severe conditions are less informative. Since these for classification can reduce the number of trials required for severe conditions might not lie inside the margin anymore, training the classifier. In turn, this will reduce the total number SVM-Margin may actually be acquiring informative mild of conditions requiring medical expert review and thereby conditions, rather than severe conditions. reduce costs.

Figure 5: The number of severe conditions in the training set acquired The stronger acquisition performance of the Exploitation and by the framework for different methods with the acquisition of five Combination_XA methods can be explained by the way they conditions in each trial. function. Both methods have an exploitation phase during However, our methods, Combination_XA and Exploitation, are which they attempt to acquire conditions that are most likely more oriented toward acquiring the most informative severe conditions by obtaining conditions from the severe side of the severe. In fact, these two methods also acquire mild conditions SVM margin. As a result, an increasing number of new severe that are thought to be severe. Although these mild conditions conditions are acquired in the earliest trials, thus improving the are indeed initially confusing, they are actually very informative classification model's performance and subsequently reducing to the classification model, since they lead to a major labeling efforts; In addition, if an acquired mild condition lies modification of the SVM margin and its separating hyper-plane. deep within the severe side, it is still informative and can be As a consequence, their acquisition improves the performance used for learning purposes and to improve the upcoming trial’s of the classification model better than the SVM-Margin method, classification capabilities. which focuses on acquiring conditions that are known to be 6. DISCUSSION AND CONCLUSION confusing, lead to only small changes in the SVM margin and its separating hyper-plane, and thus contribute less to improving We present a framework, CAESAR-ALE, used to detect the classification model. We understand from this phenomenon informative conditions that, if labeled by experts, improve the that there are often noisy "mild" conditions lying deep within classifier. We presented results from the lowest acquisition what seems to be the sub-space of the "severe" conditions, as amount, because our primary goal was to minimize the number of conditions sent to medical experts for manual labeling. was explained in recent study that focused on the detection of Two different measures were mainly used to evaluate our PC worms [34]. As noted, these "surprising" cases are very algorithm: accuracy and TPR. TPR is important in severity informative and valuable to the improvement of the classification, because of the great importance of detecting classification model (these conditions will probably become severe conditions. Therefore, because of the consequences support vectors after acquiring them and retraining the model). inherent in misclassifying severity, it is better to classify a In addition, they are helpful in the acquisition of severe condition as severe when it is actually mild instead of conditions that eventually update and enrich the knowledge classifying a condition as mild when in reality it is severe. store. It should be noted that these conditions seem to be more Bearing this in mind, traditional passive learning approaches informative than severe conditions, because they provide require large amounts of training data to achieve sufficient relevant information that was previously not considered (they performance. However, our Exploitation method achieved a TPR of 0.85 after only 17 trials. Only 85 conditions would were initially classified tentatively as severe by the classifier). require manual expert labeling in this scenario. 
In contrast, (That is, the classifier initially considered them as being severe, random selection required 47 trials or 235 conditions to achieve but they were eventually discovered as being mild). It seems that the same TPR (representing a 64% reduction). This would cut our Exploitation and Combination_XA methods for acquiring costs by almost two-thirds and allow medical experts to focus conditions that are most likely severe induce a better their energy elsewhere. classification model, a model that will eventually also acquire In terms of accuracy, the Combination_XA AL method confusing but valuable and informative mild conditions. performed the best with a reduction in the number of trials from 44 to 23 (when compared to random). If we translate this to the When calculating the Positive Predictive Value (PPV) (i.e., number of conditions, we find that the Combination_XA method Precision) of our enhanced framework CAESAR-ALE and required 115 vs. 220 conditions (representing a 48% reduction). comparing it to the basic approach CAESER, we observed that Therefore, because for our purposes, FPR is less important (we by acquiring 200 conditions (40 trials), CAESAR-ALE achieved don’t mind calling some mild conditions severe as long as we a 96.6% PPV, compared to the 56.2% PPV that was achieved by accurately capture all severe conditions), we can reduce efforts CAESAR. This represents more than a 40% absolute, and a 71% and cost by 64% without compromising the classification performance. However, in some instances we may desire relative improvement in the predictive capabilities of the maximal accuracy, and in those cases we would still achieve a framework when classifying conditions as severe, an reduction in the number of trials required of 48% when using improvement that was achieved along with the simultaneous AL methods vs. passive learning. significant reduction of labeling efforts. Considering the number of severe conditions acquired across the In our future work, we hope to extend our efforts and provide a trials, we observed that our methods (Exploitation and tool that prompts medical experts to label only pertinent and Combination_XA) were more successful at the identification of discriminative conditions. This should significantly reduce the severe conditions, acquiring many more severe conditions in the workload of already busy clinicians. early stages of the enhancement process, compared to the SVM- Margin and Random methods. The results showed a reduction In conclusion, we presented a framework called CAESAR-ALE of 46% in the number of acquired conditions needed for that reduces the manual efforts of medical experts by identifying identification of most of the severe conditions. the most important phenotypes for labeling. Our AL framework reduced labeling efforts significantly, with a reduction by 64% for the same TPR, and 48% for the same accuracy level. We us.ahrq.gov/toolssoftware/chronic/chronic.jsp - download, also demonstrated the strength of using AL methods on EHR Accessed on February 25, 2014. [18] Hwang, W., Weller, W., Ireys, H. and Anderson, G. 2001. Out-Of-Pocket data in the biomedical domain. Medical Spending For Care Of Chronic Conditions. Health Affairs, 20, 6 (November 1, 2001), 267-278. 7. ACKNOWLEDGMENTS [19] Chi, M.-j., Lee, C.-y. and Wu, S.-c. 2011. 
The prevalence of chronic This research was partly supported by the National Cyber Bureau of conditions and medical expenditures of the elderly by chronic condition the Israeli Ministry of Science, Technology and Space. Support for indicator (CCI). Archives of Gerontology and Geriatrics, 52, 3, 284-289. portions of this research provided by R01 LM006910 (GH). [20] Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F. and Elhadad, N. 2014. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21, 2 8. REFERENCES (March 1, 2014), 231-237. [1] Stang, P. E., Ryan, P. B., Racoosin, J. A., Overhage, J. M., Hartzema, A. G., [21] Perotte, A. and Hripcsak, G. 2013. Temporal properties of diagnosis code Reich, C., Welebob, E., Scarnecchia, T. and Woodcock, J. 2010. time series in aggregate. IEEE journal of biomedical and health Advancing the science for active surveillance: rationale and design for the informatics, 17, 2 (Mar), 477-483. Observational Medical Outcomes Partnership. Ann Intern Med., 153, 9 [22] Torii, M., Wagholikar, K. and Liu, H. 2011. Using machine learning for (Nov 2), 600-606. concept extraction on clinical documents from multiple data sources. [2] Kho, A. N., Pacheco, J. A., Peissig, P. L., Rasmussen, L., Newton, K. M., Journal of the American Medical Informatics Association(June 27, 2011). Weston, N., Crane, P. K., Pathak, J., Chute, C. G., Bielinski, S. J., Kullo, [23] Nguyen, A. N., Lawley, M. J., Hansen, D. P., Bowman, R. V., Clarke, B. I. J., Li, R., Manolio, T. A., Chisholm, R. L. and Denny, J. C. 2011. E., Duhig, E. E. and Colquist, S. 2010. Symbolic rule-based classification Electronic medical records for genetic research: results of the eMERGE of lung cancer stages from free-text pathology reports. Journal of the consortium. Science translational medicine, 3, 79 (Apr 20), 79re71. American Medical Informatics Association, 17, 4 (July 1, 2010), 440-445. [3] Denny, J. C., Ritchie, M. D., Basford, M. A., Pulley, J. M., Bastarache, L., [24] Angluin, D. 1988. Queries and concept learning. Machine Learning, 2, Brown-Gentry, K., Wang, D., Masys, D. R., Roden, D. M. and Crawford, 319-342. D. C. 2010. PheWAS: demonstrating the feasibility of a phenome-wide [25] Lewis, D. and Gale, W. 1994. A sequential algorithm for training text scan to discover gene–disease associations. Bioinformatics, 26, 9 (May 1), classifiers. Proceedings of the Seventeenth Annual International ACM- 1205-1210. SIGIR Conference on Research and Development in Information [4] Boland, M. R., Hripcsak, G., Shen, Y., Chung, W. K. and Weng, C. 2013. Retrieval, Springer-Verlag, , 3-12. Defining a comprehensive verotype using electronic health records for [26] Liu, Y. 2004. Active learning with support vector machine applied to gene personalized medicine. J Am Med Inform Assoc., 20, e2 (December 1), expression data for cancer classification. Journal of chemical information e232-e238. and computer sciences, 44, 6, 1936-1941. [5] Weiskopf, N. G. and Weng, C. 2013. Methods and dimensions of electronic [27] Warmuth, M. K., Liao, J., Rätsch, G., Mathieson, M., Putta, S. and health record data quality assessment: enabling reuse for clinical research. Lemmen, C. 2003. Active learning with support vector machines in the J Am Med Inform Assoc., 20, 1, 144-151. drug discovery process. Journal of chemical information and computer [6] Hripcsak, G., Knirsch, C., Zhou, L., Wilcox, A. and Melton, G. B. 2011. sciences, 43, 2, 667-673. 
Bias associated with mining electronic health records. Journal of [28] Figueroa, R. L., Zeng-Treitler, Q., Ngo, L. H., Goryachev, S. and biomedical discovery and collaboration, 6, 48. Wiechmann, E. P. 2012. Active learning for clinical text classification: is [7] Hripcsak, G. and Albers, D. J. 2013. Correlating electronic health record it better than random sampling? Journal of the American Medical concepts with healthcare process events. J Am Med Inform Assoc., 20, e2 Informatics Association, amiajnl-2011-000648. (December 1), e311-e318. [29] Nguyen, D. H. and Patrick, J. D. 2014. Supervised machine learning and [8] Rich, P. and Scher, R. K. 2003. Nail psoriasis severity index: a useful tool active learning in classification of radiology reports. Journal of the for evaluation of nail psoriasis. Journal of the American Academy of American Medical Informatics Association, amiajnl-2013-002516. Dermatology, 49, 2, 206-212. [30] Chang, C. C. and Lin, C. J. 2011. LIBSVM: a library for support vector [9] Bastien, C. H., Vallières, A. and Morin, C. M. 2001. Validation of the machines. ACM Transactions on Intelligent Systems and Technology Insomnia Severity Index as an outcome measure for insomnia research. (TIST), 2, 3, 27. Sleep Medicine, 2, 4, 297-307. [31] Tong, S. and Koller, D. 2000-2001. Support vector machine active [10] McLellan, A. T., Kushner, H., Metzger, D., Peters, R., Smith, I., Grissom, learning with applications to text classification. Journal of Machine G., Pettinati, H. and Argeriou, M. 1992. The fifth edition of the Addiction Learning Research, 2, 45-66. Severity Index. Journal of substance abuse treatment, 9, 3, 199-213. [32] Ralf, H., Graepel, T. and Campbell, C. 2001. Bayes point machines. The [11] Rockwood, T. H., Church, J. M., Fleshman, J. W., Kane, R. L., Journal of Machine Learning Research 1, 245-279. Mavrantonis, C., Thorson, A. G., Wexner, S. D., Bliss, D. and Lowry, A. [33] Baram, Y., El-Yaniv, R. and Luz, K. 2004. Online choice of active C. 1999. Patient and surgeon ranking of the severity of symptoms learning algorithms. . Journal of Machine Learning Research, 5, 255-291. associated with fecal incontinence. Diseases of the colon & rectum, 42, [34] Nissim, N., Moskovitch, R., Rokach, L., & Elovici, Y. (2012). Detecting 12, 1525-1531. unknown computer worm activity via support vector machines and active [12] Horn, S. D. and Horn, R. A. 1986. Reliability and validity of the severity learning. Pattern Analysis and Applications, 15(4), 459-475. of illness index. Medical care, 24, 2, 159-178. [35] Nissim, N., Moskovitch, R., Rokach, L., and Elovici, Y., Novel Active [13] Boland, M. R., Tatonetti, N. and Hripcsak, G. 2014. CAESAR: a Learning Methods for Enhanced PC Malware Detection in Windows OS, Classification Approach for Extracting Severity Automatically from Expert Systems With Applications, 41(13), 2014. Electronic Health Records. Intelligent Systems for Molecular Biology [36] Moskovitch, R., Nissim, N., and Elovici, Y., Malicious code detection Phenotype Day, Boston, MA, In Press, 1-8. using active learning, ACM SIGKDD Workshop In Privacy, Security and [14] Elkin, P. L., Brown, S. H., Husser, C. S., Bauer, B. A., Wahner-Roedler, Trust in KDD, Las Vegas, 2008. D., Rosenbloom, S. T. and Speroff, T. Evaluation of the content coverage [37] Nissim, N., Cohen, A., Moskovitch, R., Barad, O., Edry, M., Shabatai A., of SNOMED CT: ability of SNOMED clinical terms to represent clinical and Elovici, Y., ALPD: Active Learning framework for Enhancing the problem lists. 
Elsevier, City, 2006. Detection of Malicious PDF Files aimed at Organizations, Proceedings of [15] Stearns, M. Q., Price, C., Spackman, K. A. and Wang, A. Y. SNOMED JISIC, 2014. clinical terms: overview of the development process and project status. [38] Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., Elovici, American Medical Informatics Association, City, 2001. Y., Unknown Malcode Detection and The Imbalance Problem, Journal in [16] Elhanan, G., Perl, Y. and Geller, J. 2011. A survey of SNOMED CT direct COmputer Viorology, 5 (4), 2009. users, 2010: impressions and preferences regarding content and quality. [39] Moskovitch, R., and Shahar, R., Vaidurya: A Multiple Ontology, Concept Journal of the American Medical Informatics Association, 18, Suppl 1 Based, Context Sensitive Clinical Guideline Search Engine, Journal of (December 1, 2011), i36-i44. Biomedical Informatics, 42 (1), 2009. [17] 2011. HCUP Chronic Condition Indicator for ICD-9-CM. Healthcare Cost and Utilization Project (HCUP), https://http://www.hcup-

An Active Learning Framework for Efficient Condition Severity Classification

Nir Nissim1(), Mary Regina Boland2, Robert Moskovitch2, Nicholas P. Tatonetti2, Yuval Elovici1, Yuval Shahar1, and George Hripcsak2

1 Department of Information Systems Engineering, Ben-Gurion University, Beer-Sheva, Israel ([email protected])
2 Department of Biomedical Informatics, Columbia University, New York, USA

Abstract. Understanding condition severity, as extracted from Electronic Health Records (EHRs), is important for many public health purposes. Methods requiring physicians to annotate condition severity are time-consuming and costly. Previously, a passive learning algorithm called CAESAR was developed to capture severity in EHRs. This approach required physicians to label conditions manually, an exhaustive process. We developed a framework that uses two Active Learning (AL) methods (Exploitation and Combination_XA) to decrease manual labeling efforts by selecting only the most informative conditions for training. We call our approach CAESAR-Active Learning Enhancement (CAESAR-ALE). As compared to passive methods, CAESAR-ALE's first AL method, Exploitation, reduced labeling efforts by 64% and achieved an equivalent true positive rate, while CAESAR-ALE's second AL method, Combination_XA, reduced labeling efforts by 48% and achieved equivalent accuracy. In addition, both these AL methods outperformed the traditional AL method (SVM-Margin). These results demonstrate the potential of AL methods for decreasing the labeling efforts of medical experts, while achieving greater accuracy and lower costs.

Keywords: Active-learning · Condition · Electronic health records · Phenotyping

1 Introduction

Connected health is increasingly becoming an important framework for improving health. A crucial aspect of this includes labeling condition severity for prioritization purposes. Many national and international organizations study medical conditions and their clinical outcomes. The Observational Medical Outcomes Partnership standardized condition/phenotype identification and extraction from electronic data sources, including Electronic Health Records (EHRs) [1]. The Electronic Medical Records and Genomics Network [2] successfully extracted over 20 phenotypes from EHRs [3]. However, defining phenotypes from EHRs is a complex process because of definition discrepancies [4], data sparseness, data quality [5], bias [6], and healthcare process effects [7]. Currently, around 100 conditions/phenotypes have been successfully defined and extracted from
EHRs out of the approximately 401,200¹ conditions they contain. To utilize all of the data available in EHRs, a prioritized list of conditions classified by severity at the condition level is needed. Condition-level severity classification can distinguish acne (a mild condition) from myocardial infarction (a severe condition). In contrast, patient-level severity determines whether a given patient has a mild or severe form of a condition (e.g., acne). The bulk of the literature focuses on patient-level severity, which generally requires condition-specific metrics [8-11], although whole-body methods exist [12]. We define "severe conditions" as those that are life-threatening or permanently disabling and thus have a high priority for generating phenotype definitions for tasks such as pharmacovigilance. In this paper, we describe the development and validation of a specially designed Active Learning (AL) approach, whose objective is to minimize the number of training instances that must be labeled by experts (see Section 2.3), for classifying condition severity from EHRs. This builds on our previous work using passive learning, called CAESAR (Classification Approach for Extracting Severity Automatically from Electronic Health Records) [13]. We call our algorithm the CAESAR Active Learning Enhancement, or CAESAR-ALE. We show that CAESAR-ALE can reduce the burden on medical experts by minimizing the number of conditions requiring manual severity assignment. Our AL methods, integrated in CAESAR-ALE, actively select during the classification-model training phase only those conditions that both add new informative knowledge to the classification model and improve its classification performance. This focused, informed selection contrasts with the randomly selected, often poorly informative conditions chosen by a passive learning method.

2 Background and Significance

Our previous algorithm, CAESAR, used passive learning to capture condition severity from EHRs [13]. This method required medical experts to review conditions manually and assign a severity status (severe or mild) to each. The resulting reference standard contained 516 labeled conditions. Because of the significant effort involved, only a relatively small number of conditions was labeled by a group of human experts in the original CAESAR study [13]. These severity assignments were then used to evaluate the quality of CAESAR.

2.1 SNOMED-CT
The Systemized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) is a specialized ontology developed for conditions obtained during the clinical encounter and recorded in EHRs [14, 15]. SNOMED-CT is the terminology of choice of the World Health Organization and the International Health Terminology Standards Development Organization and satisfies the Meaningful Use requirements of the Health Information Technology component of the American Recovery and Reinvestment Act of 2009 [16]. We used SNOMED-CT, because clinical ontologies are useful for retrieval [17]. Each coded clinical event is considered a "condition" or "phenotype," knowing that this is a broad definition [4].

¹ The number of SNOMED-CT codes as of September 9, 2014. Accessed via: http://bioportal.bioontology.org/ontologies/SNOMEDCT

2.2 Classification of Conditions
In biomedicine, condition classification follows two main approaches: 1) manual approaches, where experts manually assign labels to conditions [18-22]; and 2) passive classification approaches that require a labeled training set and are based on machine learning approaches and text classification [23-24].

2.3 Active Learning
AL approaches are learning methods useful for selecting, during the learning process, the most discriminative and informative conditions from the entire dataset, thus minimizing the number of instances that experts need to review manually. Studies in several domains have successfully applied AL to reduce the time and money required for labeling examples [25-27]. The two major AL approaches are membership queries [28], in which examples are artificially generated from the problem space, and selective-sampling [29], in which examples are selected from a pool (the method used in this paper). Applications in the biomedical domain remain limited. Liu described a method similar to relevance feedback for cancer classification [30]. Warmuth et al. used a similar approach to separate compounds for drug discovery [31]. More recently, AL was applied in biomedicine for text [32] and radiology report classification [33].
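The pool-based, selective-sampling setting used throughout this paper can be summarized by a short generic loop. The sketch below is illustrative only: query_strategy stands in for any of the selection criteria described later in Section 3.3, and label_by_expert is a hypothetical placeholder for the manual annotation step; neither name comes from the CAESAR-ALE implementation.

import numpy as np

# Generic pool-based selective sampling (illustrative sketch).
# query_strategy(model, X_pool, batch_size) -> indices of the most informative pool items
# label_by_expert(x) -> "severe" or "mild", assigned manually by a medical expert
def selective_sampling(model, X_train, y_train, X_pool,
                       query_strategy, label_by_expert,
                       batch_size=5, n_trials=10):
    for _ in range(n_trials):
        model.fit(X_train, y_train)                       # retrain on the current labels
        chosen = query_strategy(model, X_pool, batch_size)
        new_y = np.array([label_by_expert(x) for x in X_pool[chosen]])
        X_train = np.vstack([X_train, X_pool[chosen]])    # acquire the newly labeled examples
        y_train = np.concatenate([y_train, new_y])
        X_pool = np.delete(X_pool, chosen, axis=0)        # remove them from the pool
    return model, X_train, y_train

In contrast to membership queries, no artificial examples are generated here; only existing pool items are ever shown to the expert.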

3 Materials and Methods

3.1 Dataset
The development and evaluation of CAESAR-ALE used the CAESAR dataset developed previously [13], which contains 516 conditions (SNOMED-CT codes) labeled as mild or severe. These 516 conditions were randomly selected out of a total of 4,683 unlabeled conditions. The gold-standard labeling of the 516 conditions used in the current study was manual, following an automated filtering phase that significantly reduced the labeling expert's effort (e.g., all malignant cancers and accidents were labeled as "severe"), as described elsewhere [13]. The dataset contains six severity features for each condition: the number of comorbidities, procedures, and medications; cost; treatment time; and a proportion term. Each of these features describing a specific condition was aggregated as an average value over all the records of the same condition in the EHR system.
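As a concrete illustration of how such per-condition feature vectors might be assembled, the sketch below averages the six severity measures over all EHR records sharing a SNOMED-CT code. The file and column names (condition_records.csv, snomed_code, and so on) are assumptions made for the example, not the actual CAESAR schema.

import pandas as pd

# Hypothetical per-record table: one row per (condition, patient record) pair.
records = pd.read_csv("condition_records.csv")

severity_measures = ["n_comorbidities", "n_procedures", "n_medications",
                     "cost", "treatment_time", "proportion_term"]

# Average each measure over all records of the same condition, yielding one
# six-dimensional feature vector per SNOMED-CT code, as described above.
features = records.groupby("snomed_code")[severity_measures].mean()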

3.2 The CAESAR-ALE Framework
The purpose of CAESAR-ALE is to decrease experts' labeling efforts. Utilizing AL methods, CAESAR-ALE directs only informative conditions to experts for labeling. Informative conditions are defined as those that improve the classification model's predictive capabilities when added to the training set. Figure 1 illustrates the process of labeling and acquiring new conditions by maintaining the updatability of the classification model within CAESAR-ALE. Conditions are introduced to the classification model, which is based on the SVM algorithm and AL methods; using both, the informative conditions are selected and sent to a medical expert for annotation. We can maintain an accurate model and decrease labeling efforts by adding only two types of informative conditions: 1) those for which the classifier has low confidence (the probability that the condition is mild is close to the probability that it is severe); and 2) those conditions that are at a maximal distance from the separating hyper-plane (see Equation 1) and thus lie deep within the "severe" instances sub-space of the SVM's separating hyper-plane. Adding the mild conditions that exist within this space of otherwise severe conditions greatly informs and improves the classification model. The overall CAESAR-ALE framework integrates two main phases. Training: the model is trained using an initial set of severe and mild conditions and evaluated against a test set consisting of conditions not used during training. Classification and updating: the AL method ranks how informative each condition is using the classification model's prediction; only the most informative conditions are selected and labeled by the expert; these conditions are added to the training set and removed from the pool; and the model is then retrained using the updated training set. This process is repeated until all conditions in the pool have been added to the training set. We employed the SVM classification algorithm with the radial basis function (RBF) kernel in a supervised learning approach, because it has proven to be very efficient when combined with AL methods [26], [27]. We used the Lib-SVM implementation [34] because it supports multiclass classification.

Fig. 1. Process of using AL methods to detect discriminative conditions requiring medical expert annotation
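The two quantities that drive the selection criteria in Section 3.3, namely the model's severe/mild prediction and the signed distance from the separating hyper-plane, can be obtained as in the sketch below. The paper uses the Lib-SVM implementation directly; here, for illustration only, we assume scikit-learn's SVC (which wraps LIBSVM) and synthetic stand-in data, with the positive class taken to encode "severe".

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data: six-dimensional severity features (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 6))
y_train = np.array([1, 0, 1, 0, 1, 0])     # 1 = severe, 0 = mild (assumed encoding)
X_pool = rng.normal(size=(310, 6))

svm = SVC(kernel="rbf")                    # RBF kernel, as in the paper
svm.fit(X_train, y_train)                  # training phase

predictions = svm.predict(X_pool)          # severe/mild decision for every pooled condition
distances = svm.decision_function(X_pool)  # signed distance from the separating hyper-plane
# The AL methods in Section 3.3 rank the pool using these two outputs and forward
# only the top-ranked conditions to the medical expert for labeling.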

3.3 Active Learning Methods

CAESAR-ALE uses two new AL methods (Exploitation and Combination_XA), described below, which we compare against the existing SVM-Margin method and random selection.

3.3.1 Random Selection or Passive Learning (Random)
Random selection, the default case in machine learning and the "lower bound" among the selection methods considered here, adds randomly selected conditions from the pool to the training set.
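A minimal sketch of this baseline, written to share the query-strategy signature assumed in the earlier sketch (the signature itself is our assumption):

import numpy as np

def random_selection(model, X_pool, batch_size, seed=0):
    # Passive baseline: ignore the model and pick batch_size pool indices at random.
    rng = np.random.default_rng(seed)
    return rng.choice(len(X_pool), size=batch_size, replace=False)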

3.3.2 The SVM-Simple-Margin AL Method (SVM-Margin)
The SVM-Simple-Margin method [35] (or SVM-Margin) is directly related to the SVM classifier. SVM-Margin selects examples to explore and acquire informative conditions according to their distance from the separating hyper-plane, disregarding their classified labels. Examples that lie closest to the separating hyper-plane (inside the margin) are more likely to be informative and are therefore acquired and labeled. SVM-Margin is fast, yet it has significant limitations [35], since it relies on assumptions that have been shown to fail [36].
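Under the same assumed interface, SVM-Margin can be sketched as selecting the pool examples with the smallest absolute distance from the hyper-plane (those inside or closest to the margin), regardless of their predicted label:

import numpy as np

def svm_margin_selection(model, X_pool, batch_size):
    # Smallest |decision value| = closest to the separating hyper-plane.
    distances = np.abs(model.decision_function(X_pool))
    return np.argsort(distances)[:batch_size]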

3.3.3 Exploitation
Exploitation, one of CAESAR-ALE's AL methods, was originally developed for the efficient detection of malicious content, files, and documents [37-41]. Exploitation is based on SVM classifier principles and selects the examples that are most likely to be severe and that lie furthest from the separating hyper-plane. Thus, this method supports the goal of boosting the classification capabilities of the model by acquiring as many new severe conditions as possible. Exploitation rates the distance of every condition X from the separating hyper-plane using Equation 1, based on the Normal of the separating hyper-plane of the SVM classifier that serves as the classification model. Accordingly, the distance in Equation 1 is calculated between example X and W (Equation 2), the Normal that represents the separating hyper-plane.

Dist(X) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \quad (1)

w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \quad (2)

To optimally enhance the training set, we also checked the similarity among selected conditions using the kernel farthest-first (KFF) method, suggested by Baram et al. [42], enabling us to avoid acquiring similar conditions, which would waste manual analysis resources.
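The following sketch illustrates an Exploitation-style selection step under stated assumptions: the SVM decision value stands in for the distance of Equation 1, and the RBF-kernel similarity check (with illustrative gamma and sim_threshold values) is a simplified stand-in for kernel farthest-first filtering; it is not the authors' implementation.

```python
# Illustrative sketch of Exploitation-style selection (assumed interfaces):
# pick pool items that the current SVM places deepest inside the "severe" side,
# while skipping near-duplicates in kernel space, in the spirit of KFF.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def exploitation_select(clf: SVC, X_pool, k=5, gamma=0.1, sim_threshold=0.95):
    dist = clf.decision_function(X_pool)      # distance from the separating hyperplane
    order = np.argsort(-dist)                 # most "severe" first
    chosen = []
    for i in order:
        if dist[i] <= 0:                      # only the severe side is exploited
            break
        if chosen:
            sims = rbf_kernel(X_pool[[i]], X_pool[chosen], gamma=gamma)
            if sims.max() >= sim_threshold:   # too similar to an already chosen condition
                continue
        chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```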

3.3.4 Combination_XA: A Combined Active Learning Method

The "Combination_XA" method is a hybrid of SVM-Margin and Exploitation. It conducts a cross acquisition (XA) of informative conditions, meaning that during the first trial (and all odd-numbered trials) it acquires conditions according to SVM-Margin's criteria, and during the next trial (and all even-numbered trials) it selects conditions using Exploitation's criteria. This strategy alternates between the exploration phase (conditions acquired using SVM-Margin) and the exploitation phase (conditions acquired using Exploitation) to select the most informative conditions, both mild and severe, while boosting the classification model with severe conditions or very informative mild conditions that lie deep inside the severe side of the SVM's hyper-plane.
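A minimal, illustrative rendering of this cross-acquisition schedule follows; the trial parity decides which criterion is applied, and the decision-function distances are assumed to come from the current SVM model (this is a sketch, not the authors' code).

```python
# Minimal sketch of the cross-acquisition (XA) schedule described above:
# odd-numbered trials select the conditions closest to the separating hyperplane
# (SVM-Margin's criterion); even-numbered trials select those deepest inside the
# severe side (Exploitation's criterion).
import numpy as np

def combination_xa_select(clf, X_pool, trial_number, k=5):
    dist = clf.decision_function(X_pool)     # signed distance from the hyperplane
    if trial_number % 2 == 1:                # exploration: lowest-confidence items
        return np.argsort(np.abs(dist))[:k]
    return np.argsort(-dist)[:k]             # exploitation: most-severe items
```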

4 Evaluation

The objective of our experiments was to evaluate the performance of CAESAR-ALE's two new AL methods and compare it with that of the existing AL method, SVM-Margin, for two tasks: (1) to update the predictive capabilities (accuracy) of the classification model that serves as the knowledge store of the AL methods and improve its ability to efficiently identify the most informative new conditions; and (2) to identify the method that best improves the capabilities of the classification model by correctly classifying conditions according to the accuracy measure, and specifically severe conditions as assessed by the true positive rate (TPR), with minimal errors as measured by the false positive rate (FPR). This task is of particular importance, given the need to identify severe conditions from the outset.

In our first acquisition experiment, we used the 516 conditions (372 mild, 144 severe) in our repository and created 10 randomly selected datasets, with each dataset containing three elements: an initial set of six conditions that were used to induce the initial classification model, a test set of 200 conditions on which the classification model was tested and evaluated after every trial in which it was updated, and a pool of 310 unlabeled conditions. Informative conditions were selected according to each method's criteria and then sent to a medical expert for labeling; the labeled conditions were then acquired by the training set, enriching it with an additional five new informative conditions per trial. The process was repeated over the subsequent trials until the entire pool was acquired. The performance of the classification model was averaged over 10 runs on the 10 different datasets that were created. The experiment's steps, which are repeated until the entire pool is acquired, are listed below (a schematic code sketch follows the list):

(1) Induce the initial classification model from the initial training set.
(2) Evaluate the classification model's initial performance using the test set.
(3) Introduce the unlabeled conditions in the pool to the selective sampling method. The five most informative conditions are selected according to each method's criteria and then sent to the medical expert for labeling.
(4) Add the acquired conditions to the training set (removing them from the pool).
(5) Induce an updated classification model using the updated training set and apply the updated model to the pool (now containing fewer conditions).
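The sketch below mirrors this protocol under simplifying assumptions: scikit-learn's SVC replaces Lib-SVM, a caller-supplied select_informative function implements one of the selection criteria, and the held-back pool labels simulate the medical expert. It is intended only to clarify the acquisition loop, not to reproduce the experiments.

```python
# Sketch of the acquisition protocol described above (hypothetical helper
# `select_informative`; dataset sizes in the text: 6 initial, 200 test,
# 310 pool, 5 conditions per trial -> 62 trials).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def run_acquisition_experiment(X_init, y_init, X_test, y_test, X_pool, y_pool,
                               select_informative, per_trial=5):
    X_train, y_train = X_init.copy(), y_init.copy()
    accuracies = []
    while len(X_pool) > 0:
        clf = SVC(kernel="rbf").fit(X_train, y_train)
        accuracies.append(accuracy_score(y_test, clf.predict(X_test)))
        idx = select_informative(clf, X_pool, k=min(per_trial, len(X_pool)))
        # the expert's labels are simulated here by the held-back y_pool
        X_train = np.vstack([X_train, X_pool[idx]])
        y_train = np.concatenate([y_train, y_pool[idx]])
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx)
    return accuracies
```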

5 Results

Results are presented for the core evaluation measures used in this study: accuracy and TPR. We also measured the number of new severe conditions discovered and acquired into the training set at each trial.

Figure 2 presents accuracy levels and their trends in the 62 trials with an acquisition level of 5 conditions per trial (62*5=310 conditions in the pool). In most trials, the AL methods outperformed Random selection, illustrating that using AL methods can reduce the number of conditions required to achieve accuracy similar to that of passive learning (i.e., random). The classification model had an initial accuracy of 0.72, and all methods converged at an accuracy of 0.975 after the pool was fully acquired. Combination_XA reached a 0.95 rate of accuracy first, requiring 23 acquisition trials (115/310 conditions), while the other AL methods required 26 trials. Furthermore, Combination_XA required only approximately half the number of conditions required by the random acquisition method (23 vs. 44 trials), while achieving the same accuracy of 0.95.

Fig. 2. The accuracy of CAESAR-ALE AL methods versus SVM-Margin and Random over 62 trials (five conditions acquired during each trial).
Fig. 3. TPR for active and passive learning methods over 62 trials.

Fig. 4. FPR for active and passive learning methods over 62 trials.
Fig. 5. Number of acquired severe conditions for active and passive learning methods over 62 trials.


Figure 3 shows TPR levels and their trends over 62 trials. Exploitation outperformed the other selection methods, achieving a TPR of 0.85 after only 17 trials (85/310 conditions). This was much better than random selection (TPR=0.85 after 47 trials), and performance improved as additional conditions were acquired. After 36 trials, all AL methods converged to TPR rates around 0.92. Our results demonstrate that using AL methods for condition selection can reduce the number of trials required in training the classifier, thereby reducing the total number of conditions required for medical expert labeling, and correspondingly reducing the training costs. Figure 4 shows FPR levels and their trends over 62 trials. It can be seen that throughout the trials, the random selection method had the highest FPR values, as compared to the AL methods. Furthermore, its FPR levels were mostly unstable until the end of the trials. In contrast, from trial 23 onward, all the AL methods displayed a low and stable FPR of 1.3%. Considering the TPR and FPR measures together, we can see how efficient the AL methods were, in spite of the unbalanced mix of condition severities in our data set. The cumulative number of severe conditions acquired for each trial is shown in Figure 5. By the fifth trial, Exploitation and Combination_XA outperformed the other selection methods (both lines overlap in Figure 5). We observed that after 23 trials (115 conditions) both of CAESAR-ALE's methods acquired 73 out of 82 severe conditions in the pool, as compared to 42 trials (210 conditions) for SVM-Margin and 60 trials (300 conditions) for random. This represents a 46% reduction as compared to SVM-Margin and a 62% reduction as compared to random. After 23 trials (Figure 5), we observed the largest difference between CAESAR-ALE's methods and random, a difference of 43 severe conditions.

6 Discussion

We present CAESAR-ALE, an AL framework for identifying informative conditions for medical expert labeling. CAESAR-ALE was evaluated based on accuracy and TPR. Traditional passive learning approaches require large amounts of training data to achieve a satisfactory performance. Exploitation achieved a TPR of 0.85 after only 17 trials (a scenario in which only 85 conditions would require manual expert labeling), whereas random selection required 47 trials or 235 conditions to achieve the same TPR, representing a 64% reduction in labeling efforts. This would reduce costs by almost two-thirds, allowing medical experts to focus their energy elsewhere. In terms of accuracy, Combination_XA performed best with a reduction in the number of trials from 44 to 23, as compared to the random acquisition method. Therefore, the Combination_XA method required 115 vs. 220 conditions (representing a 48% reduction) to achieve equivalent accuracy. The Exploitation and Combination_XA methods have an exploitation phase during which they attempt to acquire more severe conditions. They both acquire mild conditions when they are thought to be severe, and this contributes to the methods' strong acquisition performance. These mild conditions are very informative to the classification model, because they lead to a major modification of the SVM margin and its separating hyper-plane. Consequently, acquisition of these conditions improves the performance of the model.

In contrast, traditional AL methods (e.g., SVM-Margin) focus on acquiring conditions that lead to only small changes in its separating hyper-plane and contribute less to the improvement of the classification model. Our research points to the existence of noisy "mild" conditions lying deep within what seems to be the sub-space of the "severe" conditions. This is explained in a recent study focused on the detection of PC worms [38] that mentioned "surprising" cases that are very informative and valuable to the improvement of the classification model, and helpful in acquiring severe conditions that eventually update and enrich the knowledge store. These conditions are more informative than severe conditions, because they provide relevant information that was previously not considered (they were initially, if tentatively, classified as severe by the classifier even though they are actually mild). SVM-Margin acquires examples about which the classification model has low confidence. Consequently, they are informative but not necessarily severe. In contrast, the CAESAR-ALE framework is oriented toward acquiring the most informative severe conditions by obtaining conditions from the severe side of the SVM margin. As a result, more new severe conditions are acquired in earlier trials. Additionally, if an acquired condition that lies deep within the supposedly severe side of the margin is found to be mild, it is still informative (perhaps even more so) and can be used to improve the iteratively modified classifier's classification capabilities in the next trials. In our future work, we hope to develop an online tool that medical experts can use to label condition severity. This should further reduce the workload on busy clinicians, and offer an easy to use method for condition-level severity labeling. Using several different labelers for subsets of conditions, we would like to scrutinize and analyze the difference between the different labelers and understand how AL methods can handle these differences. The presented approach is general and not domain-dependent; therefore, it can be applied to and provide a solution for every medical domain (and even non-medical domains) in which it is beneficial to reduce the costly or time-consuming labeling efforts at training time. In addition, we will consider applying our AL framework to the publicly available MIMIC II intensive care domain database, to better understand the benefits of applying active learning methods to various medical domains.

7 Conclusions

We presented the CAESAR-ALE framework, which uses AL to identify important conditions for labeling. AL methods are based on the predictive capabilities of the classification model; thus, an updated classification model directly affects the AL method's ability to select the most informative conditions and thereby decreases labeling efforts. CAESAR-ALE reduced labeling efforts by 48%, while achieving equivalent accuracy. Overall, CAESAR-ALE demonstrates the strength and utility of employing AL methods that are specially designed for the biomedical domain. In addition, both our AL methods (Exploitation and Combination_XA) outperformed the traditional AL method (SVM-Margin). These results demonstrate the potential of AL methods for decreasing the labeling efforts of medical experts, while achieving greater accuracy and lower costs.

References

1. Stang, P.E., Ryan, P.B., Racoosin, J.A., et al.: Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Ann. Intern. Med. 153(9), 600–606 (2010)
2. Kho, A.N., Pacheco, J.A., Peissig, P.L., et al.: Electronic medical records for genetic research: results of the eMERGE consortium. Science Translational Medicine 3(79), 79re1 (2011)
3. Denny, J.C., Ritchie, M.D., Basford, M.A., et al.: PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 26(9), 1205–1210 (2010)
4. Boland, M.R., Hripcsak, G., Shen, Y., Chung, W.K., Weng, C.: Defining a comprehensive verotype using electronic health records for personalized medicine. J. Am. Med. Inform. Assoc. 20(e2), e232–e238 (2013)
5. Weiskopf, N.G., Weng, C.: Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J. Am. Med. Inform. Assoc. 20(1), 144–151 (2013)
6. Hripcsak, G., Knirsch, C., Zhou, L., Wilcox, A., Melton, G.B.: Bias associated with mining electronic health records. Journal of Biomedical Discovery and Collaboration 6, 48 (2011)
7. Hripcsak, G., Albers, D.J.: Correlating electronic health record concepts with healthcare process events. J. Am. Med. Inform. Assoc. 20(e2), e311–e318 (2013)
8. Rich, P., Scher, R.K.: Nail psoriasis severity index: a useful tool for evaluation of nail psoriasis. Journal of the American Academy of Dermatology 49(2), 206–212 (2003)
9. Bastien, C.H., Vallières, A., Morin, C.M.: Validation of the Insomnia Severity Index as an outcome measure for insomnia research. Sleep Medicine 2(4), 297–307 (2001)
10. McLellan, A.T., Kushner, H., Metzger, D., et al.: The fifth edition of the Addiction Severity Index. Journal of Substance Abuse Treatment 9(3), 199–213 (1992)
11. Rockwood, T.H., Church, J.M., Fleshman, J.W., et al.: Patient and surgeon ranking of the severity of symptoms associated with fecal incontinence. Diseases of the Colon & Rectum 42(12), 1525–1531 (1999)
12. Horn, S.D., Horn, R.: Reliability and validity of the severity of illness index. Medical Care 24(2), 159–178 (1986)
13. Boland, M.R., Tatonetti, N., Hripcsak, G.: CAESAR: A classification approach for extracting severity automatically from electronic health records. In: Intelligent Systems for Molecular Biology Phenotype Day, Boston, MA, pp. 1–8 (2014) (in press)
14. Elkin, P.L., Brown, S.H., Husser, C.S., et al.: Evaluation of the content coverage of SNOMED CT: ability of SNOMED clinical terms to represent clinical problem lists. In: Mayo Clinic Proceedings, pp. 741–748. Elsevier (2006)
15. Stearns, M.Q., Price, C., Spackman, K.A., Wang, A.: SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the AMIA Symposium 2001, p. 662. American Medical Informatics Association (2001)
16. Elhanan, G., Perl, Y., Geller, J.: A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality. Journal of the American Medical Informatics Association 18(suppl. 1), i36–i44 (2011)
17. Moskovitch, R., Shahar, Y.: Vaidurya – a concept-based, context-sensitive search engine for clinical guidelines. American Medical Informatics Association (2004)

18. HCUP Chronic Condition Indicator for ICD-9-CM. Healthcare Cost and Utilization Project (HCUP) (2011), http://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp (accessed February 25, 2014)
19. Hwang, W., Weller, W., Ireys, H., Anderson, G.: Out-of-pocket medical spending for care of chronic conditions. Health Affairs 20(6), 267–278 (2001)
20. Chi, M.-J., Lee, C.-Y., Wu, S.-C.: The prevalence of chronic conditions and medical expenditures of the elderly by chronic condition indicator (CCI). Archives of Gerontology and Geriatrics 52(3) (2011)
21. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 21(2), 231–237 (2014)
22. Perotte, A., Hripcsak, G.: Temporal properties of diagnosis code time series in aggregate. IEEE Journal of Biomedical and Health Informatics 17(2), 477–483 (2013)
23. Torii, M., Wagholikar, K., Liu, H.: Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association (June 27, 2011)
24. Nguyen, A.N., Lawley, M.J., Hansen, D.P., et al.: Symbolic rule-based classification of lung cancer stages from free-text pathology reports. Journal of the American Medical Informatics Association 17(4), 440–445 (2010)
25. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications 41(13), 5843–5857 (2014)
26. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications 15, 459–475 (2012)
27. Nissim, N., Cohen, A., Glezer, C., Elovici, Y.: Detection of malicious PDF files and directions for enhancements: A state-of-the art survey. Computers & Security 48, 246–266 (2015)
28. Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342 (1988)
29. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12. Springer (1994)
30. Liu, Y.: Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences 44(6), 1936–1941 (2004)
31. Warmuth, M.K., Liao, J., Rätsch, G., Mathieson, M., Putta, S., Lemmen, C.: Active learning with support vector machines in the drug discovery process. Journal of Chemical Information and Computer Sciences 43(2), 667–673 (2003)
32. Figueroa, R.L., Zeng-Treitler, Q., Ngo, L.H., Goryachev, S., Wiechmann, E.P.: Active learning for clinical text classification: is it better than random sampling? Journal of the American Medical Informatics Association (2011), 2012:amiajnl-2011-000648
33. Nguyen, D.H., Patrick, J.D.: Supervised machine learning and active learning in classification of radiology reports. Journal of the American Medical Informatics Association (2014)
34. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)
35. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2, 45–66 (2000–2001)
36. Herbrich, R., Graepel, T., Campbell, C.: Bayes point machines. The Journal of Machine Learning Research 1, 245–279 (2001)

37. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications 41(13) (2014)
38. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications 15(4), 459–475 (2012)
39. Moskovitch, R., Nissim, N., Elovici, Y.: Malicious code detection using active learning. In: ACM SIGKDD Workshop in Privacy, Security and Trust in KDD, Las Vegas (2008)
40. Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., Elovici, Y.: Unknown malcode detection and the imbalance problem. Journal in Computer Virology 5(4) (2009)
41. Nissim, N., Cohen, A., Moskovitch, R., et al.: ALPD: Active learning framework for enhancing the detection of malicious PDF files aimed at organizations. In: Proceedings of JISIC (2014)
42. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms. Journal of Machine Learning Research 5, 255–291 (2004)
43. Herman, R.: 72 Statistics on Hourly Physician Compensation (2013), http://www.beckershospitalreview.com/compensation-issues/72-statistics-on-hourly-physician-compensation.html


Journal of Biomedical Informatics 61 (2016) 44–54


Improving condition severity classification with an efficient active learning based framework

Nir Nissim a,b, Mary Regina Boland c,f, Nicholas P. Tatonetti c,d,e,f, Yuval Elovici a, George Hripcsak c,f, Yuval Shahar a, Robert Moskovitch c,d,e,f

a Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
b Malware Lab, Cyber Security Research Center, Ben-Gurion University of the Negev, Beer-Sheva, Israel
c Department of Biomedical Informatics, Columbia University, New York, NY, USA
d Department of Systems Biology, Columbia University, New York, NY, USA
e Department of Medicine, Columbia University, New York, NY, USA
f Observational Health Data Sciences and Informatics, Columbia University, New York, NY, USA

Article history: Received 29 October 2015; Revised 31 January 2016; Accepted 21 March 2016; Available online 22 March 2016.

Keywords: Active learning; Electronic Health Records; Phenotyping; Condition; Severity

Abstract: Classification of condition severity can be useful for discriminating among sets of conditions or phenotypes, for example when prioritizing patient care or for other healthcare purposes. Electronic Health Records (EHRs) represent a rich source of labeled information that can be harnessed for severity classification. The labeling of EHRs is expensive and in many cases requires employing professionals with a high level of expertise. In this study, we demonstrate the use of Active Learning (AL) techniques to decrease expert labeling efforts. We employ three AL methods and demonstrate their ability to reduce labeling efforts while effectively discriminating condition severity. We incorporate three AL methods into a new framework based on the original CAESAR (Classification Approach for Extracting Severity Automatically from Electronic Health Records) framework to create the Active Learning Enhancement framework (CAESAR-ALE). We applied CAESAR-ALE to a dataset containing 516 conditions of varying severity levels that were manually labeled by seven experts. Our dataset, called the "CAESAR dataset," was created from the medical records of 1.9 million patients treated at Columbia University Medical Center (CUMC). All three AL methods decreased labelers' efforts compared to the learning methods applied by the original CAESAR framework in which the classifier was trained on the entire set of conditions; depending on the AL strategy used in the current study, the reduction ranged from 48% to 64%, which can result in significant savings, both in time and money. As for the PPV (precision) measure, CAESAR-ALE achieved more than a 13% absolute improvement in the predictive capabilities of the framework when classifying conditions as severe. These results demonstrate the potential of AL methods to decrease the labeling efforts of medical experts, while increasing accuracy given the same (or even a smaller) number of acquired conditions. We also demonstrated that the methods included in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of human labelers with different levels of professional expertise.

© 2016 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jbi.2016.03.016

Abbreviations: CAESAR, Classification Approach for Extracting Severity Automatically from Electronic Health Records; CAESAR-ALE, Classification Approach for Extracting Severity Automatically from Electronic Health Records – Active Learning Enhancement; EHR, Electronic Health Record; AL, Active Learning; SVM, Support Vector Machines; VS, Version Space; SNOMED-CT, Systemized Nomenclature of Medicine-Clinical Terms; ICD-9, International Classification of Diseases – Version 9; SVM-Margin, Support Vector Machines-Margin Method – an existing AL method oriented towards acquiring informative conditions that lie closest to the separating hyperplane (inside the margin); Exploitation, an AL method included in the CAESAR-ALE framework that is oriented towards acquisition of severe conditions; Combination_XA, an AL method included in the CAESAR-ALE framework that combines elements of the Exploitation method and the SVM-Margin method, applying a hybrid acquisition strategy for enhanced improvement of the CAESAR method.

Corresponding authors at: Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva 84105, Israel. E-mail addresses: [email protected] (N. Nissim), [email protected] (R. Moskovitch).

1. Introduction

Condition severity is an important aspect of medical conditions that can be useful for discriminating among sets of conditions or phenotypes. For example, a condition's severity status can enable public health researchers to easily identify conditions that require higher prioritization and an allocation of resources. For the purposes of our research, we define severe conditions as those that are life-threatening or permanently disabling.

Such conditions would be considered high priority in terms of the need to generate phenotype definitions for tasks including pharmacovigilance [44,45,47].

A prioritized list of conditions (disease codes) classified by severity at the condition-level is needed. Condition-level severity classification distinguishes acne (mild condition) from myocardial infarction (severe condition). In contrast, patient-level severity determines whether a given patient has a mild or severe form of a condition (e.g., acne). The bulk of the literature focuses on patient-level severity. Patient-level severity generally requires individual condition metrics [8–11], although whole-body methods exist [11–13]. None of the prior severity metrics or classification methods use condition-level severity metrics. Condition-level severity is important for prioritizing phenotypes.

Condition-level severity is useful for prioritizing conditions that are important for specialized phenotyping algorithms. Although several consortiums and partnerships, including the Observational Medical Outcomes Partnership [1] and the Electronic Medical Records and Genomics Network [2,3], have developed phenotype extraction methods that utilize Electronic Health Records (EHRs), only a little more than 100 conditions/phenotypes have been successfully defined. Unfortunately, this represents a small fraction of the approximately 401,200 conditions recorded in EHRs (the number of SNOMED-CT codes as of September 9, 2014; accessed via http://bioportal.bioontology.org/ontologies/SNOMEDCT). Hurdles faced by experts when defining phenotype-extraction algorithms include overcoming definition discrepancies [4], data sparseness, data quality [5], bias [6], and healthcare process effects [7]. Condition severity can be a way for prioritizing conditions worthy of developing a specialized phenotype-extraction algorithm.

In our previous work we developed an algorithm which we refer to as the Classification Approach for Extracting Severity Automatically from Electronic Health Records (CAESAR) [13,47], which uses standard machine learning (also referred to as passive learning) to classify condition severity based on metrics extracted from EHRs [13]. This method requires medical experts to manually review all of the conditions and assign a severity status to each condition (i.e., severe or mild) independently from EHR metrics.

In the current study, we describe the development and validation of an active learning approach for classifying condition severity from EHRs, the Active Learning Enhancement of CAESAR, or CAESAR-ALE. We recently introduced this approach, which enhances our previous CAESAR framework [13], in a preliminary fashion [49]. We now provide a detailed description of the new methodology and our results. Using our new AL methods, we demonstrate that decreasing the number of labeled conditions required to train a classifier can reduce the theoretical burden on medical experts.

The remainder of the paper is structured as follows. In Section 2 we provide background and related work to this study. In Section 3 we describe our new methods, which is followed by the evaluation in Section 4.

2. Background

In this study we use the SNOMED-CT ontology [14–16], because it is an expressive clinical ontology that is useful for retrieval [17]. Each coded clinical event is considered a "condition" or "phenotype," with the knowledge that this is a broad definition [4]. In biomedicine, condition classification follows two main approaches: (1) manual approaches in which experts manually assign labels to conditions; and (2) passive classification approaches that require a labeled training set. Manual approaches include the development of the Chronic Condition Indicator (CCI) [18], involving expert assignment of chronicity categories (acute versus chronic) to the International Classification of Diseases – version 9 (ICD-9) codes. The CCI was built on original work by Hwang et al. [19] and was used successfully in multiple studies [19,20]. Other researchers have employed standard learning approaches in the biomedical domain, including a classification approach that leveraged the ICD-9 hierarchy for improved performance [21]. Another study classified conditions into chronicity categories [22]. Other machine learning approaches have been used in biomedicine for classifying text into condition hierarchies [48] to improve subsequent retrieval [55] compared to traditional free text [56]. Torii et al. [23] showed that performance improved when the classifier was trained on a dataset based on multiple data sources and noted that having more documents available during training improved performance [23]. Nguyen et al. built an algorithm for classifying lung cancer stages using pathology reports and SNOMED-CT [24].

2.1. Mining Electronic Health Records

Secondary use of EHRs through data mining [57] has become a trendy area of biomedical informatics research [58] and the data mining literature [59,60,67]. Learning predictive models in clinical medicine through data mining is an important and developing field [58]. Ng et al. [61] introduced a distributed platform for healthcare analytics for EHR data that builds on MapReduce principles and distributes and parallelizes the entire process of cohort construction, feature construction and selection, and classification in a cross-validation fashion. Sun et al. [62] used this framework to predict hypertension transition points in EHR data without temporal representation. Rana et al. [64] introduced a framework that models the change in interventions over time to predict outcome events and considers the temporal evolution of the events, which was shown to be useful. To handle temporal data [63], a comprehensive framework was introduced that enables learning patients' behavior over time, including discovery of frequent temporal patterns [60], learning classification models [65], and acquiring cutoffs to discretize the variables into states to increase classification performance [66].

2.2. Active learning applications in biomedical data

Labeled examples, crucial for classification, are generally expensive to acquire, since they require medical experts for annotation. Active learning (AL) approaches are useful for selecting (for labeling) the most discriminative and informative conditions from a dataset during the learning process. This selection is expected to decrease the number of conditions that experts need to manually review and label. Studies in several domains have successfully applied AL to reduce the resources (i.e., time and money) required for labeling examples [25–27,68–71].

AL is divided roughly into two major approaches: (1) membership queries [28], in which examples are artificially generated from the problem space; and (2) selective-sampling [29], in which examples are selected from a pool. This paper only focuses on the selective-sampling approach. AL algorithms have been widely utilized in multiple domains, although applications in the biomedical domain remain limited. Liu described a method similar to relevance feedback for cancer classification [30]. Warmuth et al. used a similar approach to separate compounds for drug discovery [31]. AL was also used for cell image pathology [53] and assay classification [54]. More recently, AL was applied in biomedicine for text [32] and radiology report classification [33].

3. Methods

In this section we present the methods and techniques upon which our framework is based. We aim to provide a solution to an existing challenge in the area of efficient condition severity classification, and our framework is based on a combination of methods and techniques derived from previous research (ours and research conducted by others) that we believe will be most appropriate for achieving the goals of this study.

3.1. Support vector machines classification algorithm

The support vector machines (SVM) [46] classifier is a binary classifier that finds a linear hyperplane that separates given examples into two specific classes, yet is also capable of handling multiclass classification [50]. As Joachims [51] demonstrated, the SVM is widely known for its ability to handle a large amount of features, a capability which is useful in the textual domain. Given a training set in which an example is a vector x_i = <f_1, f_2, ..., f_m>, in which f_i is a feature, labeled by y_i ∈ {−1, +1}, the SVM attempts to specify a linear hyperplane with the maximal margin defined by the maximal (perpendicular) distance between the examples of the two classes. Fig. 1 illustrates a two-dimensional space where the examples are positioned according to their features. The hyperplane splits them based on their labels.

The examples lying closest to the hyperplane are the "supporting vectors." W, the Normal of the hyperplane, is a linear combination of the most important examples (supporting vectors) multiplied by LaGrange multipliers (α), as can be seen in Eq. (3). Since the dataset in the original space cannot always be linearly separated, a kernel function K is used. SVM actually projects the examples into a higher dimensional space in order to create a linear separation of the examples. Note that when the kernel function satisfies Mercer's condition, as Burges [52] explained, K can be presented using Eq. (1), where Φ is a function that maps the example from the original feature space into a higher dimensional space, while K relies on the "inner product" between the mappings of examples x_1, x_2. For the general case, the SVM classifier will be in the form shown in Eq. (2), where n is the number of examples in the training set, K is the kernel function, α_i is the LaGrange multiplier that defines the linear combination of the Normal W, and y_i is the class label of support vector x_i.

K(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2) \quad (1)

f(x) = sign(w \cdot \Phi(x)) = sign\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x)\right) \quad (2)

w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \quad (3)

Two commonly used kernel functions are used: (1) the polynomial kernel, as shown in Eq. (4), creates polynomial values of degree p, where the output depends on the direction of the two vectors, examples x_1, x_2, in the original problem space (note that a private case of a polynomial kernel, where p = 1, is actually the linear kernel), and (2) the radial basis function (RBF), as shown in Eq. (5), in which a Gaussian function is used as the RBF and the output of the kernel depends on the Euclidean distance of examples x_1, x_2.

K(x_1, x_2) = (x_1 \cdot x_2 + 1)^{p} \quad (4)

K(x_1, x_2) = e^{-\frac{\|x_1 - x_2\|^2}{2\sigma^2}} \quad (5)

Fig. 1. An SVM with a maximal margin which separates the training set into two classes in a two-dimensional space (two features).

3.2. Random selection

Random selection is not an active learning method, but it is the "lower bound" of the selection methods that will be discussed. As far as we know, no biomedical machine learning based solution has used an active learning method to reduce the labeling efforts of medical doctors in the task of condition severity classification. Random selection doesn't have a sophisticated selection strategy; consequently, we expect that all of the AL methods we examine will perform better than a selection process based on the random selection of conditions. Thus, in the context of our framework, the random selection method will feed the SVM classifier with conditions that were randomly selected from the pool of unlabeled conditions. In our experiments we called this method Random Selection or just Random.

3.3. The SVM-Simple-Margin AL method (SVM-Margin)

The SVM-Simple-Margin method [35] (or SVM-Margin) is based on SVM classifier principles. Using a kernel function, the SVM implicitly projects the training examples into a different (usually a higher dimensional) feature space. In this space there is a set of hypotheses that are consistent with the training set, creating a linear separation of the training set. The SVM identifies the best hypothesis with the maximal margin from among the consistent hypotheses (referred to as the version-space [VS]). To achieve a situation in which the VS contains the most accurate and consistent hypothesis, the SVM-Margin AL method selects examples from a pool of unlabeled examples, thereby reducing the number of hypotheses. The SVM-Margin method selects examples according to their distance from the separating hyperplane to explore and acquire informative conditions, disregarding their labels. Examples that lie closest to the separating hyperplane (see Fig. 2, in which the selected examples from both classes are colored in red and lie inside the margin) are more likely to be informative (may improve the classifier's capabilities) and therefore are acquired and labeled. SVM-Margin is fast; yet, as its authors indicate [35], this agility is achieved because of its rough approximation and assumptions that the VS is fairly symmetric and the hyperplane's Normal, W (Eq. (3)), is centrally placed. These assumptions have been shown to fail significantly [36], because SVM-Margin may query instances whose hyperplane does not intersect with the VS and therefore may not be informative.

Fig. 2. The examples (colored in red) that will be selected according to the SVM-Margin AL method's criteria.
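As a small illustration of the kernel functions in Eqs. (4) and (5), the sketch below implements both directly on numeric feature vectors; it is independent of the Lib-SVM internals used in the study and the parameter values are illustrative.

```python
# Illustrative implementations of the kernels in Eqs. (4) and (5);
# x1 and x2 are numeric feature vectors (e.g., condition metrics).
import numpy as np

def polynomial_kernel(x1, x2, p=2):
    """Eq. (4): (x1 . x2 + 1)^p; p = 1 reduces to the linear kernel."""
    return (np.dot(x1, x2) + 1) ** p

def rbf_kernel(x1, x2, sigma=1.0):
    """Eq. (5): depends only on the Euclidean distance between x1 and x2."""
    diff = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.linalg.norm(diff) ** 2 / (2 * sigma ** 2))

# e.g., rbf_kernel([0.2, 0.4], [0.2, 0.4]) == 1.0 for two identical conditions.
```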

3.4. The CAESAR-ALE framework

The purpose of our enhanced method, CAESAR-ALE, is to decrease the experts' labeling efforts using AL methods. CAESAR-ALE does this by only asking experts to label informative conditions. Fig. 3 illustrates the process of labeling and acquiring new conditions by maintaining the updatability of the classification model within CAESAR-ALE. Conditions are introduced to the classification model, which is induced by an SVM algorithm on which the AL methods are based. The classification model scrutinizes conditions and provides two values for each condition: a classification decision using the SVM classification algorithm and a calculation of the distance from the separating hyperplane. Informative conditions are defined as conditions that are expected to improve the classification model's predictive capabilities when added to the training set. A condition that is identified as potentially informative by the AL method is sent to a human expert for labeling. In this manner, most potentially informative conditions are labeled and added to the training set so that a new and updated model will be induced.

By selecting the most informative conditions, the use of an AL method leads to a theoretical decrease in the labeling efforts, as compared to learning from the entire set of conditions. Using the AL approach, we can maintain an accurate model while decreasing the labeling efforts, since the induction method requires fewer examples, i.e., conditions, because the input instances are more informative. Accordingly, in our context, there are two types of conditions that may be considered informative. The first type includes conditions that the classifier has identified with a low level of confidence. Here, the probability of being mild is close to the probability of being severe. The calculation of probability is based on the distance of the example from the separating hyperplane of the SVM classifier; thus a maximal distance from the separating hyperplane represents a high level of confidence, while a minimal distance from the separating hyperplane represents low confidence. Eq. (6) represents the distance of example x from the separating hyperplane of the SVM classifier (note that Eq. (2) makes use of this distance and provides a classification decision regarding the sign of the distance, in which a positive sign means a positive class classification, while a negative sign means a negative class classification).

f(x) = w \cdot \Phi(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \quad (6)

In order to calculate this probability using the distance of example x from the separating hyperplane according to the given classifier's knowledge, we use a transformation function that converts the distance value into a probability [42], see Eq. (7):

P(y \mid x) = \frac{1}{1 + e^{-y \cdot f(x)}} \quad (7)

where y is an optional label of example x, y ∈ {+1, −1}, and f(x) is the decision value provided by Eq. (6). An illustration can be seen in Fig. 4, which shows two examples for which the SVM produced classification decision values. For instance, P(y = −1 | X2) = 0.8 means that the classifier is quite confident that X2 belongs to class (−1), while P(y = +1 | X2) = 0.2 means that the classifier is quite confident that X2 does not belong to class (+1); if P(y = −1 | X2) = P(y = +1 | X2) = 0.5, the classifier has a severe lack of confidence regarding the class of X2. A graphical analysis of Eq. (7) can be seen in Fig. 5.
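A minimal sketch of the transformation in Eq. (7) follows, assuming the decision value f(x) is available (for example, from an SVM's decision function); the numeric example is illustrative only and the sign convention matches the reconstruction of Eq. (7) above.

```python
# Minimal sketch of the distance-to-probability transformation of Eq. (7):
# the SVM decision value f(x) from Eq. (6) is squashed through a sigmoid, so
# examples far from the hyperplane receive confident probabilities.
import numpy as np

def label_probability(decision_value: float, label: int) -> float:
    """P(y | x) for label y in {+1, -1}, given f(x) = decision_value."""
    return 1.0 / (1.0 + np.exp(-label * decision_value))

# Example (illustrative decision value): a strongly negative f(x) makes the
# negative class likely, e.g. label_probability(-1.4, -1) ~= 0.80 and
# label_probability(-1.4, +1) ~= 0.20, the kind of split shown for X2 in Fig. 4.
```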

The second type of informative conditions includes conditions that are at a maximal distance from the separating hyperplane; these conditions are deep within the severe instances sub-space of the SVM's separating hyperplane. Nevertheless, some mild conditions may still exist within this space of otherwise severe conditions (although, of course, their being mild is unknown to the algorithm until they are selected and labeled). Consequently, presenting these mild conditions to the induction algorithm is expected to greatly inform and improve the resulting classification model.

The overall CAESAR-ALE framework includes a repetition of two main phases: training and classification/updating, which includes the selection of potentially informative examples (i.e., conditions), labeling them, and then training the model with the new labeled conditions.

Training: The model is trained using an initial pool of severe and mild conditions. The model is evaluated against a test set consisting of conditions that were not used during training to estimate the classification accuracy.

Classification and updating: The AL method estimates and ranks how potentially informative each condition remaining in the pool of unlabeled conditions is, based on the classification model's prediction. Only the most informative are selected and labeled by the expert. These conditions are added to the training set and removed from the pool. The model is then retrained using the updated training set, and this process repeats iteratively until a sufficient level of accuracy is reached or, alternatively, until the entire pool of conditions has been acquired.

We employed the SVM classification algorithm using the radial basis function (RBF) kernel in a supervised learning approach. This combination had been shown to be very efficient when combined with AL methods [26,27]. We use the Lib-SVM implementation [34] and modify it to implement our AL methods.

Although our focus in this study is on reducing the condition labeling efforts while maintaining similar or enhanced classification performance, the detection of severe conditions – even during the learning phase (as opposed to the detection of mild conditions) – has some advantages, due to their value for training and insurance reporting purposes.

Fig. 3. The process of using AL methods to detect discriminative conditions requiring medical expert labeling.
Fig. 4. Decision values given to two examples.

3.5. CAESAR-ALE's active learning methods

CAESAR-ALE uses two AL methods (Exploitation and Combination_XA). We describe each of these, along with the SVM-Margin and Random methods, below.

3.5.1. Exploitation

One of the AL methods implemented in CAESAR-ALE is called Exploitation, referred to as such because it exploits the current separating hyperplane to find condition instances that are most likely to be severe. Exploitation has shown efficiency at detecting unknown malicious code content, files [37–40], and documents [41]. Exploitation is based on SVM classifier principles and selects examples more likely to be severe, those lying further from the separating hyperplane, as can be seen in Fig. 2. Thus, this method aims at boosting the classification capabilities of the model through the acquisition of as many new severe conditions as possible. For every condition x, Exploitation measures its distance from the separating hyperplane using Eq. (8), based on the Normal (W) of the separating hyperplane of the SVM classifier. The separating hyperplane of the SVM is represented by W, which is a linear combination of the most important examples (supporting vectors), multiplied by LaGrange multipliers (α) and by the kernel function K that assists in achieving linear separation in higher dimensions. Accordingly, the distance in Eq. (8) is calculated between example x and the Normal (W) presented in Eq. (3). The distance calculation required for each instance in Exploitation is equal to the time it takes to classify an instance using SVM-Margin.

Dist(X) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \quad (8)

Acquiring several severe conditions that are highly similar to each other (i.e., which have similar values for all of the meaningful features and, of course, belong to the same target class) would waste labeling resources, while not contributing much to the future classification capabilities (generality) of the induced classifier; therefore, acquiring one representative condition from this set is preferable. In the Exploitation method, conditions are acquired if they are classified as severe and have maximal distance from the separating hyperplane (marked with a red circle in Fig. 6). To enhance the training set as much as possible, we also check the similarity among selected conditions using the kernel farthest-first (KFF) method suggested by Baram et al. [42], enabling us to avoid acquiring similar conditions. Consequently, only potentially informative conditions likely to be labeled as severe are selected. In Fig. 6, it can be seen that there are several sets of highly similar conditions, based on their distance in the kernel space. However, only representative conditions that are more likely to be severe are acquired. Contrary to SVM-Margin, Exploitation explores the "severe space" to discover potentially more informative severe conditions, a process which enables further detection of severe conditions. Fig. 6 also illustrates an additional ability of Exploitation, as it sometimes discovers conditions located far inside the severe side (i.e., class) that were ultimately labeled by the expert as mild. Finding such a surprise is useful – this type of confusing condition will become a new support vector of the SVM classifier and update the classification model with the new discovery and knowledge, and therefore these "surprises" play an important role in increasing the accuracy of the resultant classifier.

3.5.2. Combination_XA: a combined active learning method

The "Combination_XA" method is a hybrid of SVM-Margin and Exploitation. It conducts a cross acquisition (XA) of potentially informative conditions. That means that during the first trial (and all odd-numbered trials) it acquires conditions according to the SVM-Margin method's criteria, while during the next trial (and all even-numbered trials) it selects conditions using the Exploitation method's criteria. This strategy alternates between the exploration phase (conditions acquired using SVM-Margin) and the exploitation phase (conditions acquired using Exploitation) to select the most informative conditions, both mild and severe, while boosting the classification model with severe conditions or very informative mild conditions that lie deep inside the severe side of the SVM's hyperplane.

Fig. 5. Analysis of Eq. (7) – the larger the distance of the example from the separating hyperplane, the higher the probability and the greater the confidence of the classifier.
Fig. 6. Diagram showing the Exploitation method's criteria for acquiring new severe conditions.

4. Evaluation

The objective of our two experiments is to evaluate and compare the performance of CAESAR-ALE's two proposed AL methods to the three alternatives (SVM-Margin, Random Selection, and learning from the entire set of conditions) based on the following three tasks:

(1) Improving the predictive capabilities (accuracy) of the classification model that serves as the knowledge store of the AL methods and its ability to efficiently identify the most informative unlabeled conditions.
(2) Evaluating the proposed AL methods in comparison to the baseline methods based on their ability to correctly classify severe conditions (TPR) with minimal errors (FPR).
(3) Evaluating all of the selection methods using seven different labelers (experts who labeled the conditions) and measuring the variance of their learning curves.

These three tasks raise two research questions:

– Is it possible to efficiently create a condition severity classification model that uses AL methods to significantly reduce the labeling efforts of the medical expert?

Experiment 1: Our first experiment aims to evaluate and compare the selection methods in the task of the efficient creation of an accurate severity classification model while reducing the labeling efforts of medical experts. In our first acquisition experiment we use a repository of 516 conditions (the CAESAR dataset) consisting of 372 mild and 144 severe conditions. Ten randomly selected datasets are created in order to perform 10-fold cross-validation evaluation. Each fold contains three elements: (1) an initial set of six conditions that are used to induce the initial classification model, (2) a test set of 200 conditions on which the induced classifier is tested and evaluated in each active learning iteration, and (3) a pool of 310 unlabeled conditions from which the conditions are selected to be labeled by each one of the examined selection methods. The process is repeated, through the iterative acquisition steps, until the entire pool is acquired. The performance of the classification models is averaged over the ten runs of the 10-fold cross-validation. The experiment's steps are as follows:

(1) Inducing the initial classification model from the initial training set containing six conditions.
(2) Evaluating the classification model using the 200 condition test set to measure its initial performance.
(3) Introducing the pool of 310 unlabeled conditions to the sampling methods. During every trial the five most informative conditions are selected according to the selection method's preferences, and their labels are revealed by the single gold standard labeler used in the original CAESAR system (in a real system the selected conditions will be labeled by an expert, but in our dataset all of the conditions are already labeled).
(4) Adding the acquired conditions to the training set and removing them from the pool.
(5) Inducing an updated classification model using the updated training set and applying the updated model to the conditions remaining in the pool.
(6) This process iterates until the entire pool is acquired.

Based on the results of this first experiment we are able to determine the potential cost savings associated with the use of CAESAR-ALE by estimating the cost of labeling the condition-level severity for all conditions contained in SNOMED-CT (10,529 conditions) by three expert labelers. To do this, we kept track of the number of conditions labeled per minute by our physician collaborators in our second experiment (described below). We then use the customary estimate for physicians' time ($120 per hour) [43] to estimate the cost of labeling the entire dataset (Eq. (9)). We calculate the estimated savings as the entire cost multiplied by the reduction in labeling efforts (Eq. (10)).

Estimated\ Cost = (Cost\ to\ Label\ 10{,}529\ Conditions) \times 3\ Physician\ Labelers \quad (9)

Estimated\ Savings = Estimated\ Cost \times Reduction\ in\ Labeling\ Efforts\ When\ Using\ CAESAR\text{-}ALE \quad (10)

Experiment 2: Our second experiment is aimed at assessing the differences in the learning curves of the severity classification models induced from conditions that were labeled by each one of the seven different labelers. The conditions are selected using the same acquisition process described in the first experiment; however, here in experiment 2 they are actively selected only by the three AL methods in order to provide a more focused comparison between the AL methods. We use a dataset containing 100 conditions labeled by seven different labelers as our pool (three labelers are medical doctors who have completed their residency training, and the remaining four labelers are informatics experts with at least a master's degree). We follow the same steps outlined in our first experiment in which we established our initial set of six seed conditions from the gold standard. The initial classification model is trained on six randomly selected conditions. After each acquisition step we evaluate the performance of each of the labelers using the remaining 410 conditions from the gold standard: (516 conditions − 6 seed conditions) − (100 conditions given to all seven labelers) = 410 remaining conditions in the gold standard. This allows us to compare differences among the various labelers at each acquisition step. This experiment is also evaluated using 10-fold cross-validation. Note that the conditions are presented to the labelers in a different order, depending on the learning algorithm used and on the labels assigned by the labeler to previous condition instances.

CAESAR dataset: We obtain a dataset of conditions, along with six severity-related metrics related to each condition. These metrics or features consist of: average number of comorbidities, average number of procedures, average number of medications, average cost of procedures, average treatment time, and a proportion term [47]. Each of these severity-related metrics was computed using an underlying dataset of over 1.9 million patients. The proportion term is calculated to normalize each severity metric using the entire corpora and combine all five metrics into one weighted term.

Each condition's proportion term was calculated previously as part of CAESAR's construction; additional details are found in that study [47]. The method for calculating the proportion term is as follows. To calculate the proportion term we first calculate a proportion for each of the five measures. We then sum these individual proportions and divide by the total (i.e., five). It is easiest to illustrate this with an example. Let us assume the condition "myocardial infarction" has an average procedure cost of $10,000, an average treatment length of 30 days, an average number of medications of 10, an average number of procedures of 6, and an average number of comorbidities of 3. Each of these values would be divided by their respective maximums. Therefore, the proportions are as follows: average procedure cost – $10,000/$50,000; average treatment length – 30/1406 days; average number of medications – 10/25; average number of procedures – 6/15; and average number of comorbidities – 3/100. Each of these proportions is then summed:

\frac{\frac{10{,}000}{50{,}000} + \frac{30}{1406} + \frac{10}{25} + \frac{6}{15} + \frac{3}{100}}{5} = \frac{1.051}{5} = 0.210

This yields the proportion index term for this condition.
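The proportion-term calculation walked through above can be rendered in a few lines of code; the values and maxima are the ones quoted in the worked example, and the field names are illustrative rather than taken from the CAESAR dataset schema.

```python
# Sketch of the proportion-term calculation for the "myocardial infarction"
# example above: each severity metric is divided by its maximum and the five
# resulting proportions are averaged.
condition = {"procedure_cost": 10_000, "treatment_days": 30,
             "medications": 10, "procedures": 6, "comorbidities": 3}
maxima = {"procedure_cost": 50_000, "treatment_days": 1406,
          "medications": 25, "procedures": 15, "comorbidities": 100}

proportion_term = sum(condition[m] / maxima[m] for m in condition) / len(condition)
print(round(proportion_term, 3))   # -> 0.21, matching the worked example
```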

5. Results

We evaluate the efficiency and effectiveness of CAESAR-ALE by comparing the two CAESAR-ALE selective sampling methods, Exploitation and Combination_XA, to the two other selective sampling methods: Random Selection (Random) and SVM-Simple-Margin (SVM-Margin) [27]. For all methods, results are averaged over ten different runs of the 10-fold cross-validation. We now present results for the core evaluation measures used in this study: accuracy, TPR, FPR, and AUC. In addition, we also measure the number of new severe conditions discovered and acquired into the training set at each trial. As explained above, five conditions are selected from a pool of unlabeled conditions during each trial of CAESAR-ALE. It is well known that selecting more conditions per trial will improve accuracy. We use a low acquisition amount of five conditions per trial, because our primary goal is to minimize the number of conditions sent to medical experts for manual labeling and thereby reduce costs.

Fig. 7 presents accuracy levels and their trends in the 62 trials with an acquisition amount of five conditions per trial (62 × 5 = 310 conditions in the pool). In most trials, the AL methods outperformed the Random selection method, illustrating that by using the AL methods, one can reduce the number of conditions required to achieve a rate of accuracy similar to that achieved by learning from the entire set of conditions. The classification model had an initial accuracy rate of 0.72, and all methods converged at an accuracy rate of 0.975 after the pool was fully acquired. Combination_XA arrived at a 0.95 rate of accuracy first, after requiring only 23 acquisition trials (115/310 conditions), while other AL methods required 26 trials. Further, Combination_XA performed almost twice as well as Random (23 versus the 44 trials that Random required), while achieving the same accuracy rate of 0.95.

Fig. 7. The accuracy of CAESAR-ALE AL methods versus SVM-Margin and Random over 62 trials (five conditions acquired during each trial).
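To make the evaluation protocol concrete, the following is a minimal Python sketch of a pool-based acquisition loop of this kind (a model seeded with six labeled conditions, 62 trials, five conditions acquired per trial). It uses scikit-learn's SVC and a pluggable select() function purely for illustration; the actual Exploitation, Combination_XA, and SVM-Margin criteria are those defined in the cited papers, so this should be read as a sketch of the experimental loop rather than an implementation of CAESAR-ALE.

# Simplified sketch of the pool-based acquisition loop described above.
# X_seed/y_seed: 6 labeled seed conditions; X_pool: 310 unlabeled conditions;
# X_test/y_test: held-out gold-standard conditions. All names are illustrative.
import numpy as np
from sklearn.svm import SVC

def run_acquisition(X_seed, y_seed, X_pool, y_pool_oracle, X_test, y_test,
                    select, trials=62, per_trial=5):
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_idx = list(range(len(X_pool)))
    accuracy_curve = []
    for _ in range(trials):
        model = SVC(kernel="rbf").fit(X_train, y_train)
        accuracy_curve.append(model.score(X_test, y_test))
        chosen = select(model, X_pool, pool_idx, per_trial)   # AL method decides here
        # The "oracle" stands in for the medical expert who labels the chosen conditions.
        X_train = np.vstack([X_train, X_pool[chosen]])
        y_train = np.concatenate([y_train, y_pool_oracle[chosen]])
        pool_idx = [i for i in pool_idx if i not in set(chosen)]
    return accuracy_curve

def random_select(model, X_pool, pool_idx, k):
    """Baseline corresponding to the Random selection method."""
    rng = np.random.default_rng(0)
    return list(rng.choice(pool_idx, size=min(k, len(pool_idx)), replace=False))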

Summarizing the results regarding rates of accuracy, Combination_XA's performance represents a reduction of 48% in labeling efforts compared to Random. Considering the overall learning phase, Exploitation outperformed the Combination_XA method up to trial 35, while after trial 35 Combination_XA presented slightly better performance, indicating that the cross-acquisition strategy is superior in the long run.

Fig. 8 shows TPR levels (severe is the positive class) and their trends over 62 trials. Both Exploitation and Combination_XA outperformed the other selection methods, achieving a TPR rate of 0.85 after only 17 trials (85/310 conditions), while Random achieved the same TPR rate after 47 trials. Summarizing the results regarding TPR rates, the performance of Exploitation and Combination_XA represents a reduction of 64% in labeling efforts compared to Random Selection. As can also be seen in Fig. 8, the performance improved as more conditions were acquired. After 36 trials, all AL methods converged to TPR rates around 0.92. Our results demonstrate that using AL methods for condition selection can reduce the number of trials required in training the classifier. This will reduce the total number of conditions requiring medical expert labeling and thereby reduce costs.

Fig. 8. TPR for active learning and random selection methods over 62 trials.

The cumulative number of severe conditions acquired for each trial is shown in Fig. 9. By the fifth trial, CAESAR-ALE's methods, Exploitation and Combination_XA, outperformed the other selection methods with respect to the rate of acquiring severe condition instances (both lines overlap in Fig. 9). After 23 trials (115 conditions), both of CAESAR-ALE's methods acquired 73 out of the 82 severe conditions in the pool, compared to 42 trials (210 conditions) for the SVM-Margin method and 60 trials (300 conditions) for the Random method. This represents a 46% reduction in the number of trials required to acquire the same number of severe condition instances compared to the SVM-Margin method, and a 62% reduction in the labeling efforts compared to the Random method. After 23 trials (Fig. 9), we observed the largest difference between CAESAR-ALE's methods and Random: a difference of 43 severe conditions.

Fig. 9. The accumulated number of severe conditions acquired into the training set by each AL method over 62 trials.

In Fig. 10, it can be observed that the variance among the learning curves induced for each labeler depends in part on the AL method used (Fig. 10A–C). Fig. 10D shows the overall variance among labelers by method and the methods' average variance. Combination_XA was lowest with a variance of 0.0197, followed closely by Exploitation with a variance of 0.0209. In contrast, SVM-Margin had the highest variance (almost 50% greater than Combination_XA). We performed a single-factor ANOVA statistical test on the standard deviation of the labelers for both Combination_XA and Exploitation. The ANOVA test provided a P-value of 67.93%, which is much higher than the 5% significance level (alpha); thus the difference between the methods is not statistically significant, confirming that Exploitation is as robust as Combination_XA to the different clinical training levels of the labelers.

We also estimate the cost to label the severity of the 10,529 conditions contained in SNOMED-CT by the three physician labelers. Based on our experiments and experience, physicians were able to label 2.5 conditions per minute. Based on the typical physician salary of $2 per minute of time ($120 per hour), we calculate the estimated savings achieved by utilizing the CAESAR-ALE framework to select meaningful conditions (for labeling) from the entire 10,529 condition set. Based on our calculations, the entire set would cost $13,161.25 per physician labeler, with a total cost of $39,483.75 for three labelers. We found that CAESAR-ALE reduced labeling efforts by at least 48%, resulting in a savings of at least $18,952.
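The cost estimate above (Eqs. (9) and (10)) reduces to simple arithmetic, and the sketch below reproduces the quoted figures. The per-condition labeling cost of $1.25 is implied by the quoted total of $13,161.25 for 10,529 conditions; it is an assumption of this illustration rather than a figure stated explicitly in the text.

# Reproduces the estimated labeling cost and savings quoted above (Eqs. (9) and (10)).
N_CONDITIONS = 10_529          # SNOMED-CT conditions to label
N_LABELERS = 3                 # physician labelers
COST_PER_CONDITION = 1.25      # USD; implied by $13,161.25 / 10,529 (assumption)
REDUCTION = 0.48               # labeling-effort reduction achieved by CAESAR-ALE

cost_per_labeler = N_CONDITIONS * COST_PER_CONDITION      # 13,161.25
estimated_cost = cost_per_labeler * N_LABELERS            # Eq. (9): 39,483.75
estimated_savings = estimated_cost * REDUCTION            # Eq. (10): ~18,952

print(f"${cost_per_labeler:,.2f} per labeler, ${estimated_cost:,.2f} total, "
      f"${estimated_savings:,.2f} saved")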
6. Discussion

We introduce CAESAR-ALE, an active learning based framework used to identify informative conditions for medical expert labeling. The overall task is to classify conditions into severe and mild, based on features extracted from an EHR database. In our dataset, each condition is further represented by six features/metrics: average number of comorbidities, average number of procedures, average number of medications, average cost, average treatment time, and a proportion term. Based on these metrics, conditions can be categorized as severe or mild, and this information can be used to train a classifier with good classification performance. However, labeling the conditions requires the expertise and involvement of medical experts, which is costly. In response to this issue, we developed CAESAR-ALE, a framework based on active learning methods that decreases the need for labeling by medical experts. In this paper, we propose two active learning based methods (Combination_XA and Exploitation) oriented at the acquisition of potentially severe conditions. We define severe conditions as those that are life-threatening or permanently disabling. Acquiring such severe conditions is beneficial because of their value for training purposes and insurance reporting needs. In addition, we also assume that favoring potentially severe conditions will decrease the chances of misclassifying a severe condition as mild, which is often the more costly mistake.

We evaluate these two methods and compare them to two baseline methods: (1) a very basic method using random selection, and (2) SVM-Margin, an existing active learning method; our evaluation also includes comparison to learning from the entire set of conditions. Several measures are used to evaluate CAESAR-ALE: accuracy, TPR, FPR, and AUC. Combination_XA and Exploitation achieved a TPR of 0.85 after only 17 trials during which 85 conditions were acquired. Therefore, in this scenario only these 85 conditions will require manual expert labeling. In contrast, Random Selection required 47 trials or 235 conditions to achieve the same TPR. This comparison demonstrates that by using our AL methods, a reduction of 64% in labeling efforts can be achieved. This reduction also results in significant cost savings (almost two-thirds of the cost), allowing medical experts to focus their energy elsewhere.

Fig. 10. The learning curves of the three AL methods for each of the expert labelers.
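The robustness comparison reported in the results above relies on a single-factor ANOVA over the per-labeler learning-curve standard deviations. The following is a minimal sketch of such a test, assuming the per-labeler standard deviations are available as two lists; the numbers below are placeholders for illustration only, not the study's data.

# Single-factor ANOVA on per-labeler learning-curve standard deviations,
# comparing Combination_XA against Exploitation (placeholder values, not study data).
from scipy.stats import f_oneway

std_combination_xa = [0.031, 0.028, 0.035, 0.029, 0.033, 0.030, 0.032]  # one value per labeler
std_exploitation   = [0.033, 0.030, 0.034, 0.031, 0.035, 0.029, 0.036]

f_stat, p_value = f_oneway(std_combination_xa, std_exploitation)
# A p-value well above 0.05 (the paper reports ~0.68) means the difference between
# the two methods' robustness to labelers is not statistically significant.
print(f_stat, p_value)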

In terms of accuracy rates, as expected, all of the AL methods performed better than the Random Selection method, while Combination_XA performed slightly better than the others, with a meaningful reduction in the number of trials from the 44 required by the Random Selection method to 23. The Combination_XA method required only 115 conditions versus the 220 required by Random (representing a 48% reduction) to achieve the same rate of accuracy. For our purposes, FPR is less important (we don't mind calling some mild conditions severe, as long as we accurately capture all severe conditions); therefore we can focus on the TPR measure and reduce efforts and cost by 64% without compromising the classification performance. However, in some instances we may desire maximal accuracy, and in those cases we will still achieve a reduction of 48% in the number of trials required when using CAESAR-ALE's Combination_XA AL method versus Random Selection.

Note that in some cases it might be beneficial, even during the training phase, to increase the rate and number of severe conditions acquired (e.g., for the purpose of reporting them to insurance companies). By using the Combination_XA and Exploitation AL methods within CAESAR-ALE, one can achieve a reduction of 62% in labeling efforts. CAESAR-ALE's AL methods acquired severe conditions much more quickly than the baseline methods. In the 23rd trial, our AL methods managed to acquire almost 90% of the severe conditions (73 out of 82) after investing only 37% of the time and labeling efforts required to acquire all of the severe conditions by Random or by labeling the entire pool of conditions. As can be seen, the use of AL methods, and our two AL methods in particular, results in a very significant improvement in the number of severe conditions acquired, compared to the linear and poor improvement demonstrated by the Random Selection method. The best example of the strength of our AL methods can be found in the performance demonstrated during the early acquisition stages, when again in the 23rd trial, our AL methods acquired almost 2.5 times more severe conditions than Random (72 compared to 30).

The superior acquisition results achieved by the Exploitation and Combination_XA methods compared to Random and SVM-Margin can be explained by the way they function. Both methods have an exploitation phase during which they attempt to acquire more severe conditions. Thus, both methods occasionally acquire mild conditions which initially were thought to be severe. These mild conditions are very informative to the classification model, because they lead to a major modification of the SVM classifier's margin and its separating hyperplane.
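As a rough illustration of the mechanism discussed in this and the following paragraph, the sketch below ranks unlabeled conditions by the signed distance to an SVM's separating hyperplane and prefers those lying on, or deep within, the severe side. This is only an approximation of the idea; the precise Exploitation and Combination_XA criteria, including their distance and cross-acquisition components, are those specified in the papers cited above.

# Illustrative "severe-side" selection: prefer pool items the current SVM places
# deepest on the severe side of its separating hyperplane (an approximation only).
import numpy as np
from sklearn.svm import SVC

def severe_side_select(model: SVC, X_pool, pool_idx, k, severe_label=1):
    # decision_function() gives the signed distance to the hyperplane;
    # positive values correspond to model.classes_[1].
    scores = model.decision_function(X_pool[pool_idx])
    if model.classes_[1] != severe_label:   # flip sign if "severe" is the other class
        scores = -scores
    order = np.argsort(-scores)             # most confidently "severe" first
    return [pool_idx[i] for i in order[:k]]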

Consequently, acquiring these conditions improves the performance of the model. Alternatively, traditional AL methods (e.g., SVM-Margin) focus on acquiring examples (in our study, conditions) that lead to only small changes in the separating hyperplane and thus contribute less to the improvement of the classification model. We understand from this phenomenon that there are often noisy "mild" conditions lying deep within what seems to be the sub-space of the severe conditions, as was explained in a recent study that focused on the detection of PC malware [37]. As noted, these "surprising" cases are very informative and valuable in contributing to the improvement of the classification model. They are helpful in acquiring severe conditions that eventually update and enrich the knowledge store. It should be noted that these conditions are more informative than severe conditions, because they provide relevant information that was previously not considered (they were initially tentatively classified as severe by the classifier when in fact they are mild). The CAESAR-ALE framework induces a better classification model by acquiring more severe conditions within the same number of iterations.

Furthermore, the SVM-Margin method acquires examples that the classification model cannot confidently classify (there is low confidence regarding their correct class). Consequently, they are informative but are not necessarily severe. In contrast, the CAESAR-ALE framework is oriented toward acquiring the most informative severe conditions by obtaining conditions from the severe side of the SVM classifier's margin. As a result, more new severe conditions are acquired in earlier trials. Additionally, if an acquired mild condition lies deep within the severe side of the margin, it is still informative and can be used to improve the classification capabilities of the model for the upcoming trial.

In addition to the above mentioned results and improvements, we observed that after acquiring 95 conditions (19 trials), CAESAR-ALE achieved a 93.1% positive predictive value (PPV), compared to an 80.1% PPV achieved using Random Selection. This represents more than a 13% absolute improvement in the predictive capabilities of CAESAR-ALE for classifying conditions as severe. This improvement was achieved with a significant reduction in labeling efforts, a 51% reduction for equivalent accuracy (92.5%).

Importantly, we demonstrate that our AL methods in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of different human labelers with different levels of domain knowledge. In fact, our method, Combination_XA, acquired conditions in such a way that the variance between the classifiers created by individual labelers was 50% lower than the variance of learning curves that were based on the acquisition of the traditional SVM-Margin AL method.

We present the CAESAR-ALE framework, which uses AL methods to identify important conditions for labeling. CAESAR-ALE reduced labeling efforts by 48% when classifying conditions into severe and mild, while achieving classification accuracy equivalent to that of a scenario in which passive learning is applied. Use of the CAESAR-ALE framework has the potential to result in significant monetary savings. In future work, we would like to develop an online tool for medical experts to label condition severity. This will facilitate a more extensive condition severity labeling experiment that will enable us to examine labeling differences based on the experts' clinical background or geographical location.

Acknowledgments

We thank all of the labelers who patiently labeled condition severity status in our dataset. This research was partially supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. Support for portions of this research was provided by R01 LM006910 (GH) and R01 GM107145 (NPT). MRB is supported by the National Library of Medicine training grant T15 LM00707.

References

[1] P.E. Stang, P.B. Ryan, J.A. Racoosin, et al., Advancing the science for active surveillance: rationale and design for the observational medical outcomes partnership, Ann. Intern. Med. 153 (9) (2010) 600–606.
[2] A.N. Kho, J.A. Pacheco, P.L. Peissig, et al., Electronic medical records for genetic research: results of the eMERGE consortium, Sci. Translational Med. 3 (79) (2011) 79re1.
[3] J.C. Denny, M.D. Ritchie, M.A. Basford, et al., PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations, Bioinformatics 26 (9) (2010) 1205–1210.
[4] M.R. Boland, G. Hripcsak, Y. Shen, W.K. Chung, C. Weng, Defining a comprehensive verotype using electronic health records for personalized medicine, J. Am. Med. Inform. Assoc. 20 (e2) (2013) e232–e238.
[5] N.G. Weiskopf, C. Weng, Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J. Am. Med. Inform. Assoc. 20 (1) (2013) 144–151.
[6] G. Hripcsak, C. Knirsch, L. Zhou, A. Wilcox, G.B. Melton, Bias associated with mining electronic health records, J. Biomed. Discov. Collab. 6 (2011) 48.
[7] G. Hripcsak, D.J. Albers, Correlating electronic health record concepts with healthcare process events, J. Am. Med. Inform. Assoc. 20 (e2) (2013) e311–e318.
[8] P. Rich, R.K. Scher, Nail psoriasis severity index: a useful tool for evaluation of nail psoriasis, J. Am. Acad. Dermatol. 49 (2) (2003) 206–212.
[9] C.H. Bastien, A. Vallières, C.M. Morin, Validation of the insomnia severity index as an outcome measure for insomnia research, Sleep Med. 2 (4) (2001) 297–307.
[10] A.T. McLellan, H. Kushner, D. Metzger, et al., The fifth edition of the addiction severity index, J. Subst. Abuse Treat. 9 (3) (1992) 199–213.
[11] T.H. Rockwood, J.M. Church, J.W. Fleshman, et al., Patient and surgeon ranking of the severity of symptoms associated with fecal incontinence, Dis. Colon Rectum 42 (12) (1999) 1525–1531.
[12] S.D. Horn, R.A. Horn, Reliability and validity of the severity of illness index, Med. Care 24 (2) (1986) 159–178.
[13] M.R. Boland, N.P. Tatonetti, G. Hripcsak, CAESAR: a Classification Approach for Extracting Severity Automatically from Electronic Health Records, 2014.
[14] P.L. Elkin, S.H. Brown, C.S. Husser, et al., Evaluation of the content coverage of SNOMED CT: ability of SNOMED clinical terms to represent clinical problem lists, in: Mayo Clinic Proceedings, Elsevier, 2006, pp. 741–748.
[15] M.Q. Stearns, C. Price, K.A. Spackman, A.Y. Wang, SNOMED clinical terms: overview of the development process and project status, in: Proceedings of the AMIA Symposium, American Medical Informatics Association, 2001, p. 662.
[16] G. Elhanan, Y. Perl, J. Geller, A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality, J. Am. Med. Inform. Assoc. 18 (Suppl. 1) (2011) i36–i44.
[17] R. Moskovitch, Y. Shahar, Vaidurya: a multiple-ontology, concept-based, context-sensitive clinical-guideline search engine, J. Biomed. Inform. 42 (1) (2009) 11–21.
[18] HCUP Chronic Condition Indicator for ICD-9-CM, Healthcare Cost and Utilization Project (HCUP), 2011 (accessed on February 25, 2014).
[19] W. Hwang, W. Weller, H. Ireys, G. Anderson, Out-of-pocket medical spending for care of chronic conditions, Health Aff. 20 (6) (2001) 267–278.
[20] M-j. Chi, C-y. Lee, S-c. Wu, The prevalence of chronic conditions and medical expenditures of the elderly by chronic condition indicator (CCI), Arch. Gerontol. Geriatr. 52 (3) (2011) 284–289.
[21] A. Perotte, R. Pivovarov, K. Natarajan, N. Weiskopf, F. Wood, N. Elhadad, Diagnosis code assignment: models and evaluation metrics, J. Am. Med. Inform. Assoc. 21 (2) (2014) 231–237.
[22] A. Perotte, G. Hripcsak, Temporal properties of diagnosis code time series in aggregate, IEEE J. Biomed. Health Inform. 17 (2) (2013) 477–483.
[23] M. Torii, K. Wagholikar, H. Liu, Using machine learning for concept extraction on clinical documents from multiple data sources, J. Am. Med. Inform. Assoc. 27 (2011).
[24] A.N. Nguyen, M.J. Lawley, D.P. Hansen, et al., Symbolic rule-based classification of lung cancer stages from free-text pathology reports, J. Am. Med. Inform. Assoc. 17 (4) (2010) 440–445.
[25] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Novel active learning methods for enhanced PC malware detection in windows OS, Expert Syst. Appl. 41 (13) (2014) 5843–5857.

[26] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Detecting unknown computer worm activity via support vector machines and active learning, Pattern Anal. Appl. 15 (2012) 459–475.
[27] N. Nissim, A. Cohen, C. Glezer, Y. Elovici, Detection of malicious PDF files and directions for enhancements: a state-of-the-art survey, Comput. Secur. 48 (2015) 246–266.
[28] D. Angluin, Queries and concept learning, Mach. Learn. 2 (1988) 319–342.
[29] D. Lewis, W. Gale, A sequential algorithm for training text classifiers, in: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag, 1994, pp. 3–12.
[30] Y. Liu, Active learning with support vector machine applied to gene expression data for cancer classification, J. Chem. Inf. Comput. Sci. 44 (6) (2004) 1936–1941.
[31] M.K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, C. Lemmen, Active learning with support vector machines in the drug discovery process, J. Chem. Inf. Comput. Sci. 43 (2) (2003) 667–673.
[32] R.L. Figueroa, Q. Zeng-Treitler, L.H. Ngo, S. Goryachev, E.P. Wiechmann, Active learning for clinical text classification: is it better than random sampling?, J. Am. Med. Inform. Assoc. (2012), http://dx.doi.org/10.1136/amiajnl-2011-000648.
[33] D.H. Nguyen, J.D. Patrick, Supervised machine learning and active learning in classification of radiology reports, J. Am. Med. Inform. Assoc. (2014), http://dx.doi.org/10.1136/amiajnl-2013-002516.
[34] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) 2 (3) (2011) 27.
[35] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. 2 (2000–2001) 45–66.
[36] R. Herbrich, T. Graepel, C. Campbell, Bayes point machines, J. Mach. Learn. Res. 1 (2001) 245–279.
[37] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Novel active learning methods for enhanced PC malware detection in windows OS, Expert Syst. Appl. 41 (13) (2014).
[38] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Detecting unknown computer worm activity via support vector machines and active learning, Pattern Anal. Appl. 15 (4) (2012) 459–475.
[39] R. Moskovitch, N. Nissim, Y. Elovici, Malicious code detection using active learning, in: ACM SIGKDD Workshop in Privacy, Security and Trust in KDD, Las Vegas, 2008.
[40] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, Y. Elovici, Unknown malcode detection and the imbalance problem, J. Comput. Virol. 5 (4) (2009).
[41] N. Nissim, A. Cohen, R. Moskovitch, et al., ALPD: active learning framework for enhancing the detection of malicious PDF files aimed at organizations, in: Proceedings of JISIC, 2014.
[42] Y. Baram, R. El-Yaniv, K. Luz, Online choice of active learning algorithms, J. Mach. Learn. Res. 5 (2004) 255–291.
[43] R. Herman, 72 Statistics on Hourly Physician Compensation, 2013 (accessed in January 2015).
[44] M.R. Boland, N.P. Tatonetti, Are all vaccines created equal? Using electronic health records to discover vaccines associated with clinician-coded adverse events, in: AMIA Summits on Translational Science Proceedings 2015, San Francisco, CA, USA, 2015, pp. 196–200.
[45] M.R. Boland, N.P. Tatonetti, G. Hripcsak, CAESAR: a classification approach for extracting severity automatically from electronic health records, in: Intelligent Systems for Molecular Biology Phenotype Day, Boston, MA, 2014.
[46] V. Vapnik, Statistical Learning Theory, Springer, New York, 1998.
[47] M.R. Boland, N.P. Tatonetti, G. Hripcsak, Development and validation of a classification approach for extracting severity automatically from electronic health records, J. Biomed. Semantics 6 (14) (2015).
[48] R. Moskovitch, S. Cohen-Kashi, U. Dror, I. Levy, A. Maimon, Y. Shahar, Multiple hierarchical classification of free-text clinical guidelines, Artif. Intell. Med. 37 (2006) 177–190.
[49] N. Nissim, M.R. Boland, R. Moskovitch, N.P. Tatonetti, Y. Elovici, Y. Shahar, G. Hripcsak, An active learning framework for efficient condition severity classification, in: Artificial Intelligence in Medicine (AIME-15), Springer International Publishing, 2015, pp. 13–24.
[50] V.N. Vapnik, Estimation of Dependences Based on Empirical Data, vol. 41, Springer-Verlag, New York, 1982.
[51] T. Joachims, Making large scale SVM learning practical, 1999.
[52] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Disc. 2 (2) (1998) 121–167.
[53] M. Berthold, The fog of data: data exploration in the life sciences, Invited Talk at the 11th Conference on Artificial Intelligence in Medicine (AIME), 2007.
[54] N. Cebron, M.R. Berthold, Active learning for object classification: from exploration to exploitation, Data Min. Knowl. Discov. 18 (2) (2009) 283–299.
[55] R. Moskovitch, A. Hessing, Y. Shahar, Vaidurya – a concept-based, context-sensitive search engine for clinical guidelines, in: MedInfo 2004, San Francisco, USA, 2004.
[56] R. Moskovitch, S. Martins, E. Behiri, A. Weiss, Y. Shahar, A comparative evaluation of a full-text, concept-based, and context-sensitive search engine, J. Am. Med. Inform. Assoc. 14 (2007) 164–174.
[57] P.B. Jensen, L.J. Jensen, S. Brunak, Mining electronic health records: towards better research applications and clinical care, Nat. Rev. Genet. 13 (6) (2012).
[58] R. Bellazzi, B. Zupan, Predictive data mining in clinical medicine: current issues and guidelines, Int. J. Med. Inform. 77 (2) (2008).
[59] I. Batal, D. Fradkin, J. Harrison, F. Moerchen, M. Hauskrecht, Mining recent temporal patterns for event detection in multivariate time series data, in: Proceedings of Knowledge Discovery and Data Mining (KDD), Beijing, China, 2012.
[60] R. Moskovitch, Y. Shahar, Fast time intervals mining using transitivity of temporal relations, Knowl. Inf. Syst. 42 (2015) 1.
[61] K. Ng, A. Ghoting, S.R. Steinhubl, W.F. Stewart, B. Malin, J. Sun, PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records, J. Biomed. Inform. 48 (2014) 160–170.
[62] J. Sun, C.D. McNaughton, P. Zhang, A. Perer, A. Gkoulalas-Divanis, J.C. Denny, J. Kirby, T. Lasko, A. Saip, B.A. Malin, Predicting changes in hypertension control using electronic health records from a chronic disease management program, J. Am. Med. Inform. Assoc. 21 (2014) 337–344.
[63] G. Hripcsak, Physics of the medical record: handling time in health record studies, in: Artificial Intelligence in Medicine, Pavia, Italy, 2015.
[64] S. Rana, S. Gupta, D. Phung, S. Venkatesh, A predictive framework for modeling healthcare data with evolving clinical interventions, Stat. Anal. Data Min.: ASA Data Sci. J. 8 (3) (2015) 162–182.
[65] R. Moskovitch, Y. Shahar, Classification of multivariate time series via temporal abstraction and time intervals mining, Knowl. Inf. Syst. 45 (1) (2015) 35–74.
[67] Z. Huang, W. Dong, P. Bath, L. Ji, H. Duan, On mining latent treatment patterns from electronic medical records, Data Min. Knowl. Disc. 29 (2015) 914–949.
[68] N. Nissim, R. Moskovitch, O. BarAd, L. Rokach, Y. Elovici, ALDROID: efficient update of android anti-virus software using designated active learning methods, Knowl. Inf. Syst. (2016) 1–39.
[69] R. Moskovitch, N. Nissim, Y. Elovici, Malicious code detection and acquisition using active learning, in: IEEE International Conference on Intelligence and Security Informatics (IEEE ISI-2007), Rutgers University, New Jersey, USA, 2007.
[70] N. Nissim, A. Cohen, R. Moskovitch, O. Barad, M. Edry, A. Shabatai, Y. Elovici, ALPD: active learning framework for enhancing the detection of malicious PDF files, in: Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, 2014, pp. 91–98.
[71] R. Moskovitch, N. Nissim, R. Englert, Y. Elovici, Detection of unknown computer worms activity using active learning, in: The 11th International Conference on Information Fusion, Cologne, Germany, 2008.

- 54 -

- 55 -

התרומות העיקריות של המחקר שלנו הן כדלהלן: ראשית, תוצאות הניסויים הראו שהמערכת שלנו יכולה לשפר ולעדכן בצורה יעילה ותדירה את יכולות הזיהוי של חבילות האנטיוירוס וכן את יכולות הזיהוי של פתרונות מבוססי אלגוריתמי למידת מכונה, באופן טוב יותר מכל שיטה קיימת. שנית, בעוד ששיטות למידה אקטיבית קיימות הראו ירידה במספר הפוגענים החדשים שהן רוכשות יום יום, שיטות הלמידה האקטיבית שפיתחנו הראו שיפור יומי במספר הפוגענים החדשים הנרכשים, וזאת בנוסף לעובדה ששיטות הלמידה האקטיבית שלנו רכשו יותר פוגענים חדשים מכל פתרון אחר בכל יום ויום. שלישית, המערכת שלנו מבצעת את העדכונים הללו בשימוש של כמות קטנה בלבד של הקבצים האינפורמטיביים ביותר )תמימים ועוינים(, תוך הורדה משמעותית של מאמץ מומחי האבטחה בניתוח ידני של הקבצים הנבחרים. רביעית, המערכת שלנו נמצאה גם יעילה ברכישה היסטורית של פוגענים מתוך מאגרי קבצים גדולים שנמצאים בדרך כלל בארגונים רבים. חמישית, המערכת שלנו גם מסוגלת לחזק ולשפר את העמידות של יכולות הלמידה בכך שהיא מסננת מופעים רועשים ומטעים של התנהגות פוגענים חמקמקים. נקודה אחרונה, כהוכחה לגנריות של מערכת מבוססת הלמידה האקטיבית שלנו, לאחרונה הרחבנו את יכולותיה וכעת היא מסוגלת לתת מענה לבעיות מתחומים נוספים. למעשה התאמנו את המערכת לתחום המידע הרפואי, שבו הצלחנו לשפר את יכולות הסיווג של מודל למידה לצורכי סיווג רמת חומרה של מחלות, וזאת תוך כדי הפחתה משמעותית של המאמץ בתיוג ידני של מחלות, והדבר למעשה מסתכם בחיסכון גדול של כסף וזמן עבודה יקר של מומחי רפואה.

מילות מפתח: פוגען, עוין, תולעת מחשב, קובץ הרצה, אנדרואיד, מסמך, PDF, למידת מכונה, למידה אקטיבית, זיהוי, רכישה, אנטיוירוס.

- 56 -

תקציר היווצרותם של פוגענים חדשים מדי יום מציבה אתגר משמעותי לפתרונות הזיהוי הקיימים. פוגענים אלו מעלים סיכונים רבים משום שהם מוכוונים לפגוע כמעט בכל מכשיר דיגיטאלי הנמצא בשימוש פרטי או אירגוני. עם סוגי פוגענים פופולאריים נמנים תולעי מחשב, קבצי הרצה עויינים, מסמכים עויינים וכן אפליקציות עוינות המוכוונות לפגיעה ושימוש עוין במכשירים ניידים. חבילות האנטיוירוס , הנמצאות בשימוש רחב, והמבוססות החתימות, מסוגלות לזהות רק פוגענים ידועים או גרסאות דומות להן. כדי לזהות פוגענים חדשים על מנת לעדכן את מאגר החתימות של הקבצים העוינים של חבילות האנטיוירוס, חברות האנטיוירוס חייבות לאסוף בכל יום כמויות גדולות של קבצים חשודים שצריכים להיות מנותחים באופן ידני על ידי מומחי אבטחה, שלבסוף גם קובעים את סיווגם הסופי כפוגען או תכנה תמימה. ניתוח של קובץ חשוד הינה פעולה הדורשת זמן, ואין זה אפשרי לנתח ידנית את כל הקבצים החשודים. לכן חברות האנטיוירוס החלו להשתמש במודלי זיהוי מבוססי אלגוריתמי למידת מכונה והיוריסטיקות שונות, במטרה להקטין את מספר הקבצים החשודים שיש לנתחם ידנית. בנוסף לחבילות האנטיוירוס, פתרונות זיהוי חדשניים התחילו להשתמש באלגוריתמי למידת מכונה באופן עצמאי במטרה לייצר יכולות זיהוי טובות יותר מהיכולת המוגבלת של חבילות האנטיוירוס בכדי לזהות פוגענים חדשים.

לאור היצירה ההמונית של קבצים מדי יום, גם חבילות האנטיוירוס וגם פתרונות המבוססים אלגוריתמי למידת מכונה, חסרים יכולת חיונית ביותר – הם אינם יכולים להתעדכן בצורה תדירה ויעילה עם פוגענים חדשים – מצב שיוצר פער בעדכניות וחלון זמן בין מועד היווצרותו של פוגען לבין מועד הזיהוי שלו, ובכך מאפשר לפוגען חדש לתקוף מספר רב של מטרות וקורבנות טרם זיהויו. לכן גם חבילות האנטיוירוס וגם פתרונות מבוססי אלגוריתמי למידת מכונה חייבים להתעדכן באופן תדיר כך שחבילות האנטיוירוס יתעדכנו עם חתימות חדשות של פוגענים חדשים, ופתרונות המבוססים למידת מכונה צריכים להתעדכן עם קבצים אינפורמטיביים )המכילים מידע רב( הן עויינים והן תמימים. במחקר זה אנחנו מציגים פתרון לפער העדכניות המדובר, אנחנו מציגים מערכת ופלטפורמה חדשנית, גנרית ויעילה המבוססת למידה אקטיבית, וכן גם שיטות למידה אקטיבית חדשניות, שכל אלו יחד עשויים לסייע לחברות האנטיוירוס וכן גם לפתרונות מבוססי למידת המכונה, למקד את מאמץ הניתוח שלהם על ידי רכישה של כמות קטנה ביותר של קבצים שהם ככל הנראה פוגענים או לחילופין קבצים תמימים מאוד אינפורמטיביים, ובכך לאפשר לשיפור יעיל ותדיר של בסיסי המידע של חבילות האנטיוירוס ומודלי הזיהוי. בנוסף לרכישה אינטליגנטית של רוב הקבצים האינפורמטיביים, המערכת שלנו מוכוונת גם לעבוד ברזולוציה גבוהה יותר, שבה היא יכולה לסנן בצורה יעילה מופעי התנהגות "רועשים" שאינם נחוצים ואף משבשים את תהליך הלמידה של התנהגות פוגען ספציפי, ובכך המערכת שלנו משפרת את יכולות הזיהוי של פוגענים חמקמקים כדוגמת תולעי מחשב. המערכת שלנו משלבת גם שיטות חדשניות לחילוץ מאפיינים משמעותיים, שיטות שעוצבו במיוחד עבור זיהוי יעיל של סוגי הפוגענים שצוינו לעיל. שיטות אלו למעשה חילצו מאפיינים שמונפו באמצעות שיטות הלמידה האקטיבית שלנו לטובת שיפור ועדכון יעיל של יכולות הזיהוי.

- 57 -

- 58 -

העבודה נעשתה בהדרכת פרופ' יובל אלוביץ'

במחלקה להנדסת מערכות מידע

בפקולטה להנדסה

- 59 -

רכישה יעילה וזיהוי של פוגענים לא ידועים באמצעות למידה אקטיבית.

מחקר לשם מילוי חלקי של הדרישות לקבלת תואר "דוקטור לפילוסופיה"

מאת

ניסים ניר

הוגש לסינאט אוניברסיטת בן גוריון בנגב

אישור המנחה ______

אישור דיקנית בית הספר ללימודי מחקר מתקדמים ע"ש קרייטמן ______

31.12.2015 י"ט בטבת תשע"ו

באר שבע

- 60 -

- 61 -

רכישה יעילה וזיהוי של פוגענים לא ידועים באמצעות למידה אקטיבית.

מחקר לשם מילוי חלקי של הדרישות לקבלת תואר "דוקטור לפילוסופיה"

מאת

ניסים ניר

הוגש לסינאט אוניברסיטת בן גוריון בנגב

31.12.2015 י"ט בטבת תשע"ו

באר שבע

- 62 -