An Active Learning Framework for Efficient Acquisition and Detection of Unknown Malware
Thesis submitted in partial fulfillment of the requirements for the degree of
By
Nir Nissim
Submitted to the Senate of Ben-Gurion University of the Negev
31.12.2015
Beer-Sheva
An Active Learning Framework for Efficient Acquisition and Detection of Unknown Malware
Thesis submitted in partial fulfillment of the requirements for the degree of
“DOCTOR OF PHILOSOPHY”
By
Nir Nissim
Submitted to the Senate of Ben-Gurion University of the Negev
Approved by the advisor: ______
Approved by the Dean of the Kreitman School of Advanced Graduate Studies: ______
31.12.2015
Beer-Sheva
This work was carried out under the supervision of Prof. Yuval Elovici at the
Department of Information Systems Engineering
Faculty of Engineering Sciences,
Ben-Gurion University of the Negev.
Research-Student's Affidavit when Submitting the Doctoral Thesis for Judgment
I Mr. Nir Nissim, whose signature appears below, hereby declare that
I have written this Thesis by myself, except for the help and guidance offered by my Thesis Advisors.
The scientific materials included in this Thesis are products of my own research, culled from the period during which I was a research student.
Date: 31.12.2015 Student’s Name: Nir Nissim Signature: ______
Acknowledgements
First and foremost, I want to thank God for providing me with the capabilities, wisdom, and blessing of success, during these important years of research and for surrounding me with an outstanding group of colleagues and researchers who were helpful in this research.
I would also like to thank my advisor, Prof. Yuval Elovici, for his support, guidance, and the opportunities provided to me, all of which have made these years of research extremely productive and challenging.
Thanks also to the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space who partially supported my research.
I also wish to thank Clint Feher, Oren Barad, and Aviad Cohen who assisted in the collection and creation of the datasets and Yuval Fledel for his valuable advice regarding the efficient implementation aspects of my research. I would like also to thank Prof. Yuval Shahar for the meaningful discussions we shared and also for his expertise and support in the expansion of this research to additional directions in the biomedical domain.
Special thanks both to Dr. Robert Moskovitch and Prof. Lior Rokach for their assistance and helpful advice during the course of my research.
Thanks also to Ms. Yehudith Naftalovitch, the administrative and operational manager of our Cyber Security Research Center, who assisted and helped with many administrative matters during these years of research, providing valuable assistance that allowed me to better focus on the research itself.
I would also like to thank Ms. Robin Levy-Stevenson for her devoted assistance in providing much appreciated English editing and proofreading during my Ph.D. studies, which helped make my publications more comprehensive and clear.
And last but not least, thanks to my dear parents and my special grandparents who supported me in every way they possibly could, ensuring that I would always have the passion, and everything else I would need, to succeed.
Abstract
The sheer volume of new malware created every day poses a significant challenge to existing detection solutions. This malware is aimed at compromising nearly every kind of widely used digital device, threatening individuals as well as organizations. Popular types of malware take different forms including computer worms, malicious PC executables, malicious documents (non-executables), and malicious applications aimed at mobile devices. Widely used antivirus software, which is based on manually crafted signatures, is only capable of identifying known malware and their relatively similar variants. To identify new and unknown malwares and keep their antivirus signature repository up-to-date, antivirus vendors must collect new suspicious files on a daily basis for manual analysis by information security experts who label the files as malware or benign. Analyzing suspected files is a time-consuming task, and it is impossible to manually analyze all questionable files. Consequently, antivirus vendors use detection models based on machine learning (ML) algorithms and heuristics in order to reduce the number of suspected files that must be inspected manually. In addition to antivirus software, recent detection solutions have also used machine learning algorithms independently, in order to provide better detection capabilities of new malware, an area in which antivirus software is limited. In light of the mass creation of new files daily, both antivirus and machine learning based detection solutions lack an essential element – they cannot be frequently and efficiently updated with newly created malware – a situation that creates a dangerous time gap between the creation and proliferation of malware and its detection and discovery. This time gap allows new malware to attack many targets before it is identified and thwarted.
Therefore, both antivirus and machine learning based solutions must be frequently updated – the antivirus software must be updated with new signatures of malware, and machine learning based solutions require new informative files, both malicious and benign. In this research we introduce a solution for this updatability gap. We present a novel, generic, and efficient active learning (AL) framework and new AL methods that may assist antivirus vendors and machine learning based solutions by allowing them to focus their analytical efforts on acquiring only a small set of new files that are either most likely malicious or informative benign files, a process that enables efficient and frequent enhancement of the knowledge stores of both the detection model and the antivirus software. In addition to intelligently selecting the most contributive files, our framework is also designed to work at a finer level of granularity, at which it can efficiently select only a small number of instances related to the behavior of a specific analyzed file. By doing this, our framework can filter out the misleading and noisy instances of behavior that are common among sophisticated and elusive malware and thus improve detection capabilities. Our framework also integrates tailored feature extraction methods for each of the above mentioned types of malware, and these feature extraction methods provide an accurate basis for enhancing the detection capabilities leveraged by our AL methods.
The main contributions of the study are summarized as follows. First, the experimental results showed that our framework can improve the detection capabilities of antivirus software and machine learning based solutions by frequently and efficiently enhancing the knowledge stores of the detection model and the antivirus software, as our framework outperformed any other existing solution and method. Second, given the predefined limited number of files acquired daily in our experiments, the existing AL method showed a decrease in the number of new malwares acquired daily, while our AL methods showed a daily increase in that number and also acquired more new malwares each day than every other solution. Third, our framework conducts the above mentioned update using only a small set of the most informative files (malicious and benign), leading to a significant reduction in the security expert labeling efforts associated with manual analysis of the files. Fourth, our framework was also found to be efficient in retrospective acquisition of malware from the large stores of files usually found in organizations. Fifth, our framework is able to efficiently improve detection capabilities and enhance robustness by filtering out misleading malware instances and behavior. Lastly, as a proof of concept for the generality of our AL based framework, we have recently extended the framework's capabilities so that it provides solutions in additional domains. We have adapted it to the biomedical informatics domain, in which we successfully enhanced the capabilities of a classification model that is used for condition severity classification while significantly reducing labeling efforts, which can result in substantial savings in both the time and money associated with medical experts.
Keywords. Malware, Malicious, Computer Worm, Executable, Android, Document, PDF, Machine Learning, Active Learning, Detection, Acquisition, Antivirus.
Table of Contents
1. Introduction
1.1. Background and Related Work
1.1.1. Malicious Executables and Computer Worms
1.1.2. Malicious Documents
1.1.3. Malicious Android Applications
1.2. The Problem Statement and Proposed Approach
1.3. Deployment of our Framework
2. Overview of the Core Papers in the Research
2.1. Research Results
2.1.1. Core Papers
3. Summary and Conclusions
4. Future Directions
5. References
6. Appendix
6.1. Additional Accepted Papers in the Malware Detection Domain
6.2. Additional Accepted Papers in the Biomedical Informatics Domain
1. Introduction
1.1. Background and Related Work
In recent years, the Internet has become an integral part of our lives, particularly with the increased availability of high speed internet connections, cloud computing, and the proliferation of mobile devices which have rapidly become indispensable to individuals around the world, able to handle many of our daily needs and interests such as communication, health, news, banking, shopping, mail, entertainment, etc. Increasingly large numbers of files are created and transferred over the Internet, including a growing percentage of malware that compromises a growing list of targets through a variety of attack methods. Although nowadays, the creation of malware requires much less expertise than in the past [75], attacks launched by today’s malware have become more sophisticated, harder to detect, and more dangerous [72]. These facts have shaped an insecure reality in which, according to Kaspersky’s report presented in 2013 [76], every day at least 315,000 new malware are created and widely spread over the Internet with ease, and since that time, this number has been exponentially increasing each year.
There are several levels of defense against malware attacks, and each level consists of different types of specialized techniques and tools. The lowest level of defense is at the level of the host computer and includes the user's computer itself and an organization’s application server. Techniques most often used at this level are host-based intrusion detection and prevention systems (IDSs and IPSs) that are installed on the host computer and can protect it from malware that has reached the host. Signature-based antivirus software is an integral tool implemented at the host level; it is widely used and detects known malware and its variants relatively effectively for most organizations and individuals. Each time a new malware is found, antivirus vendors create a new signature and update their signature repository, as well as their clients. It takes time to detect a malicious code and update clients, and such actions are definitely not immediate. Speed is essential – during the period of time between the appearance of a new unknown malware, its subsequent detection by the antivirus, and updating the new signature in the client’s database, many computers might be infected. Although more than a decade has passed since their first appearance, computer worms remain the most well-known examples of malware that maliciously takes advantage of the time it has to operate prior to its detection and neutralization. "Slammer," the fastest computer worm in history, infected more than 75,000 hosts (representing 90% of the vulnerable hosts) within ten minutes [70], while "Code Red," the most harmful and famous worm, infected 359,000 hosts within 14 hours [71]. Each of these computer worm attacks caused significant disruption to financial, transportation, and government institutions.
In order to accurately and quickly detect the newest malware, antivirus companies devote considerable effort, both in terms of time and resources, to maintaining an up-to-date signature repository of malicious code files. These efforts include monitoring new and unknown malicious code files sent over the Internet and the use of various types of honeypots to catch malicious files [77] [78]. This mission is complicated and time-consuming, particularly because these efforts rely heavily on manual inspection of suspected files [77].
This challenging situation has motivated researchers to develop more comprehensive and efficient solutions for agile detection of new unknown malware. Studies conducted over the last 15 years have shown that machine learning methods and algorithms (traditionally used for challenging classification and prediction tasks) can be used for the detection of unknown malware. New detection tools based on machine learning algorithms have been developed, and antivirus vendors have started to incorporate machine learning based detection models and heuristics into their processes in order to enhance their detection capabilities. Prior studies have primarily focused on two approaches: dynamic and static analysis. In each case, during both the training and detection phases, malicious and benign files are analyzed and subsequently represented by a vector of features (extracted statically from the content and structure of the file, or dynamically according to its behavior) that can be monitored by measuring elements within the system in which the malware is executed. These files are used during the training phase to induce a classifier that acts as the detection model. Based on the generalization capabilities of the detection model, an unknown file (one that did not appear in the training set and is not detected by the antivirus tool) is classified as malicious or benign during the detection phase.
Static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable – the analyzed file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis “traps” to detect analysis, these traps can actually be used as a contributing feature for the detection of malware [90]. In addition, since static analysis is relatively efficient and fast, it can be performed both without causing bottlenecks and within an acceptable timeframe. Static analysis is also easy to implement, monitor, and measure.
Moreover, it scrutinizes the file’s “genes” as opposed to its current behavior which can be changed or delayed to an unexpected time in order to evade detection by dynamic analysis. An additional advantage is that static analysis can be used for a scalable pre-check of malwares before deeper, more time consuming analysis is conducted.
On the other hand, static analysis can be evaded by code obfuscation and is also limited in its ability to analyze encrypted files. Whenever one uses machine learning methods based on static analysis for the detection of unknown malicious code applications, a question arises regarding the ability of the suggested framework to detect obfuscated malware. However, dynamic analysis (also known as behavioral analysis), aimed at tracing the behavior of the file and its effect on the environment in which it is executed, is not affected by code obfuscation. This type of analysis and its versatile methods for detecting an unknown malware based on its behavior have been thoroughly explored over the past several years. These dynamic analysis based methods are aimed at detecting malicious activity and content that cannot be discovered using static analysis, for example code obfuscation, encrypted files, and dynamic load of malicious code during run time.
Machine learning solutions based on static and dynamic analysis have been successfully applied for the detection of common types of malware including: malicious executables and computer worms, malicious documents (non-executables), and malicious applications aimed at mobile devices. Each type has its own characteristics and unique properties, and our research is aimed at providing comprehensive long-term detection solutions to the challenges posed by the various malware types. Thus, we present a brief introduction to the types of malware that have become more popular and attractive during the period in which this research was conducted and mention the machine learning approaches and developments associated with each.
1.1.1 Malicious Executables and Computer Worms
Malicious executables, especially those aimed at the Windows operating system, the most commonly attacked system, include malware families such as computer worms, computer viruses, Trojan horses, spyware, and adware. Computer worms are a widespread form of malicious executables that proactively propagates across networks while exploiting vulnerabilities in operating systems, protocols, devices, and installed programs. In contrast, other malicious executables such as viruses, Trojan horses, spyware, and adware usually operate and attack within
a host while also infecting the host's files. Ransomware, which extorts its victims, represents an emerging trend of malicious executables belonging to the Trojan malware family. Once ransomware reaches a host, it encrypts the host’s files using a strong encryption algorithm and a unique key and prevents the host’s owner from accessing and using his/her own files until the owner pays the requested ransom to the attacker. Ransomware is financially driven and therefore aimed at attacking large organizations that rely heavily upon the availability of significant and valuable files that form a critical part of their daily work and business. In this situation the attacked organization remains helpless and is forced to comply with the attacker’s requests and pay the ransom, or else lose its data. A well-known example of ransomware is CryptoLocker [91], which was able to extort approximately three million dollars before it was taken down by authorities.
Regardless of the malware’s mode of operation, today’s antivirus tools don’t offer an adequate solution for the detection of new unknown malware – that which doesn’t share signatures similar to known malware (those forming the signature repository of the antivirus). Over the past 15 years, many studies have investigated the possibility of enhancing the detection of unknown malicious executables using machine learning algorithms based on either static [49-57], or dynamic analyses [58-69].
The detection of elusive computer worms transmitted over computer networks has also been intensively researched over the past decade. Typically, worms operate autonomously, spreading quickly and attacking as many targets as possible, causing considerable harm as was demonstrated by the “Slammer” [70] and “Code Red” [71] worms. Stuxnet [72], a more elusive malware (and probably the most sophisticated ever created), is an example of a new attack trend, the advanced persistent threat (APT). Stuxnet was a directed cyberwarfare attack against the Iranian nuclear program which also spread within the attacked system and was aimed at the controllers of the nuclear SCADA systems in order to physically destroy this military target.
Worms are a very elusive type of malware that tries to hide its malicious activity by spending most of its time in a dormant state or by otherwise acting benignly. Machine learning based solutions have been proposed to enhance antivirus software in order to detect new computer worms based on behavioral classification of the host [73] [74]. However, the key to the challenge of detecting new computer worms using machine learning algorithms lies in the ability to filter out their misleading behavior from the data provided to the machine learning algorithms.
In addition, some worms act in a misleading way and behave as a legitimate application part of the time, and as a consequence, they generate misleading instances. Worms are not always active and even when active they do not always behave in an illegitimate way. Because they sometimes act like non-worm instances, their detection is much more difficult; furthermore, it can be misleading to monitor their behavior, and when this is done, misleading instances become part of the dataset.
In most domains, the misleading instances are not created intentionally, but exist naturally. However, in our case, the misleading data posed a greater problem. In order to make their detection harder, in the security domain worms are created in such a sophisticated way that they behave similarly to a legitimate application. Thus, monitoring worm behavior using dynamic analysis creates many instances that are very similar to non-worm instances and are therefore considered misleading instances that confuse the induced classifier. This phenomenon is called “malicious noise,” as presented in [92]. Misleading instances usually create confusion in the classification processes and cause degradation in the classifier’s performance.
With regard to worm detection, the task is more complicated here, since the misleading data is inherent in the class, and its presence is even greater in the class we want to detect. In this case, we used the AL method’s premise to select the most informative instances among the existing instances, so that the misleading instances would not be selected, as was done previously [93] and discussed in other work [94].
1.1.2 Malicious Documents
Cyber-attacks aimed at organizations have increased since 2009, with 91% of all organizations hit by cyber-attacks in 2013.1 Attacks aimed at organizations usually include harmful activities such as stealing confidential information, spying and monitoring an organization’s activity, and disrupting an organization's actions. The vast majority of organizations rely heavily on email for internal and external communication. Thus, email has become a very attractive platform from which to initiate cyber-attacks against organizations.
1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber attacks-in-2013/
According to Trend Micro,2 APT attacks, particularly those against government agencies and large corporations, are largely dependent upon spear-phishing3 emails. As malicious executables have been widely used to launch attacks, current defensive solutions and organizational policies often prevent executables from entering organizational networks via emails4 [88]. Therefore, recent APT attacks tend to attach documents, which are non-executable files (PDF, MS-Office files, Flash files, etc.) that, unlike executables, are not independent files and require specific software in order to be opened (e.g., Adobe Reader, Microsoft Office, etc.). These types of documents are widely used in organizations and are often mistakenly considered less suspicious or malicious than executables.
Furthermore, because email communication is an integral part of daily business operations, APT attackers frequently leverage email as an attack vector for initial penetration of the targeted organization. Attackers usually use social engineering in order to make the recipient open the malicious email, press a link, and open an attachment containing such a document. F-Secure’s 2008-2009 report5 indicates that the most popular file types for targeted attacks in 2008-2009 were PDF and Microsoft Office files. Since that time, as was reported in 2010-2011, the number of attacks on Adobe Reader has grown.6 A recent report presented in 2015 by Symantec [88] revealed that Microsoft Office document file attachments have become the most frequently used documents in email attachments for spear-phishing attacks and were used in 39% of attacks during 2014.
To date, antivirus packages are not sufficiently effective at intercepting malicious documents, even in the case of highly prominent PDF threats (Tzermias et al. [38]). However, according to studies such as [38-47], machine learning methods can be effective in distinguishing between malicious and benign PDF files and discovering new malicious PDF documents. Several deterministic solutions were presented for enhanced detection of new malicious Microsoft Office files such as BISSAM [48], OfficeMalScanner7, OfficeCat8, Microsoft OffVis9, pyOLEScanner.py10, and Threat Emulation11, and a new machine learning based methodology called SFEM [8] was presented as well.
2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://searchsecurity.techtarget.com/definition/spear-phishing
4 https://www.paloaltonetworks.com/content/dam/paloaltonetworks-com/en_US/assets/pdf/datasheets/threats/threat-prevention.pdf
5 http://www.f-secure.com/weblog/archives/00001676.html
6 http://www.computerworld.com/article/2517774/security0/pdf-exploits-explode--continue-climb-in-2010.html
7 http://www.reconstructer.org/code.html
8 http://www.aldeid.com/wiki/Officecat
9 http://www.microsoft.com/en-us/download/details.aspx?id=2096
10 http://www.aldeid.com/wiki/PyOLEScanner
1.1.3 Malicious Android Applications
While the detection of malwares aimed at PCs (elusive worms, executables, and malicious documents) using ML methods has been intensively researched for nearly two decades, [12] and [13] were the first to discuss malware for smartphones in 2004. Since that time there has been a significant increase in the use of smartphones, dramatically increasing the possibility of cyber-attacks [14] [15]. Likewise, recent growth of the Android market has been accompanied by increased threats to Android security over the past few years [16-18]; "Secure-List" [19] reported that 9,000 such malware were created during 2012, a figure that indicates that the dominance of the Android operating system likely led to the massive creation of new types of Android malware.
The smartphone domain is an area in which the need for antivirus enhancement is even greater. In contrast to PCs, which rely on advanced detection tools (e.g., sandboxes, ML based solutions, anti-exploitation solutions, etc.) in addition to their basic reliance on antivirus solutions, smartphones are heavily dependent on antivirus solutions because of the inability to apply advanced detection methods (machine learning solutions based on static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate effective detection of new malware and efficient and nimbly updated antivirus tools. Antivirus solutions are lightweight and thus more appropriate for smartphones.
In addition, Android antivirus vendors must deal with large quantities of new applications on a daily basis in order to identify and update the antivirus signature repository with new unknown malware instances. The majority of these applications can be collected from application markets, and others can be collected by installing agents on smartphones that upload applications to a central server for analysis. Antivirus vendors must filter out known malware (and its variants), as well as known legitimate applications, utilizing white lists based on the reputation and certificates of applications [20]. Despite this filtering process, a large number of new unknown applications, both benign and malicious, remain.
11 https://www.checkpoint.com/products/threatcloud-emulation-service/
Antivirus vendors use complementary solutions that focus on the applications most likely to be malicious in order to further reduce the number of applications that must be handled manually. Among the complementary solutions that have been proposed for efficiently discovering new Android malware are heuristic engines based on a scoring algorithm [21] and many different detection models based on machine learning techniques [22-37].
1.2. The Problem Statement and Proposed Approach
Complementary solutions, including machine learning based solutions, targeted at detection of various malware types (computer worms and malicious executables, malicious documents, and malicious Android applications) have enhanced detection and have also demonstrated the ability to detect new unknown malware, an ability not shared by antivirus software. This stems from ML’s generalization capabilities, an inherent strength of models induced by machine learning algorithms. However, to date, such complementary solutions share one significant drawback that makes them inefficient in the long run: in each case the knowledge store is not frequently and actively updated. This is particularly problematic in light of the mass creation of new malware.
A natural concept drift process [80] [81] exists, specifically in the malware domain [81], as benign files and newly created malware contain new properties and features that haven’t been seen by the detection model, as well as existing features with very different values than those the detection model has been trained on. These new features may result from different programming languages, compilers, platforms, operating systems, devices, etc. In addition, the malware domain is very dynamic, since attackers are continually seeking out new ways of attacking, new vulnerabilities that can be exploited, and new targets. These changing parameters eventually affect the static features and the behavior of the analyzed malware and thus significantly reduce the detection capabilities of induced detection models that are not updated and therefore become outdated. None of the existing machine learning based solutions address the crucial need to frequently and efficiently update the detection model and antivirus software.
In this research [1-8], we concentrated on the updatability process and enhancement of the detection capabilities of the detection model, striving to improve efficiency and speed in these areas. A well enhanced and updated detection model will have a better ability to detect future malware and thus will update the antivirus software more rapidly. It is therefore essential to ‘sustain’ the classifier constantly and frequently with new files (malicious and benign) in order to maintain detection accuracy over time. However, when a file is classified, the classifier cannot indicate whether it should be acquired as a new and informative sample for the training set. Additionally, in order to add the file to the training set, a labeling operation, usually by a human expert, is required. The labeling process is a very time-consuming task, because each unknown file (suspected as being malicious) has to be analyzed and inspected by an expert; the expert will likely
have to perform static and/or dynamic analysis and inspect the files’ behavior using a sandbox [13,14] or other behavior-based tools in order to determine whether they are malicious or benign. Because there are many files (malicious and especially benign) to inspect, it is not feasible to send them all to the human expert for labeling. All these difficulties affect the updatability (of both the detection model and the antivirus) which is directly related to one of the most challenging tasks in the computer security domain: the agile detection of new unknown malware.
One of the keys to solving this challenging task is finding an automatic and efficient way to identify the most informative of the many new files, in an effort to minimize (to the greatest extent possible) the number of files sent to the human expert for labeling. Only the most informative labeled files will provide the knowledge required for the updatability of the detection model and ensure its ability to detect previously unknown malicious code. Our research aims to develop a framework that combines practical and efficient solutions for the agile detection of new unknown malware.
In order to meet this challenge we have divided the suggested framework into two main modules. The first is the Detection Module, which integrates the best methods from several relevant domains, such as feature extraction and representation, feature selection, text categorization, information retrieval, and classification algorithms. This module concentrates on collecting and representing the files in the dataset in a way that provides maximal knowledge to the classification algorithm for inducing the optimal detection model. The Updatability Module is the second module, in which we also propose the active learning (AL) approach to reduce the number of labeled training examples while maintaining high classification accuracy. By integrating AL, the classifier actively indicates the specific new files that should be labeled, i.e., the most informative, the addition of which to the training set will provide the maximal improvement in the detection model and consequently will also update the widely used antivirus software. The updatability module is also aimed at working under higher rates of granularity in which it can efficiently select only a small number of instances related to the behavior of a specific analyzed file. By doing this, the misleading and noisy instances of malware’s behavior can be filtered and thus improve the detection capabilities – this granularity rate is especially tailored for enhancing the detection of elusive malware such as the computer worm.
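The interplay between the two modules can be pictured as a pool-based acquisition round. The sketch below is a minimal illustration under stated assumptions, not the thesis implementation: the names `acquisition_round`, `fit`, `select`, and `oracle` are hypothetical, the toy classifier is a one-dimensional threshold rather than an SVM, and the oracle function stands in for the human expert.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, ...]            # hypothetical feature vector of one file
Labeled = Tuple[Sample, int]          # (features, label) with 1 = malicious, 0 = benign
Scorer = Callable[[Sample], float]    # signed score, positive = malicious side


def acquisition_round(
    train: List[Labeled],
    pool: List[Sample],
    fit: Callable[[List[Labeled]], Scorer],
    select: Callable[[Sequence[float], int], List[int]],
    oracle: Callable[[Sample], int],
    budget: int,
) -> Scorer:
    """One update round: score the pool, query the expert, retrain."""
    scorer = fit(train)                         # Detection Module: induce the model
    scores = [scorer(x) for x in pool]          # classify the new, unlabeled files
    chosen = select(scores, budget)             # Updatability Module: pick informative files
    for i in sorted(chosen, reverse=True):      # pop from the end so indices stay valid
        x = pool.pop(i)
        train.append((x, oracle(x)))            # expert labels only the chosen files
    return fit(train)                           # updated detection model


def fit_threshold(train: List[Labeled]) -> Scorer:
    """Toy stand-in for a classifier: threshold halfway between the class means."""
    mal = [x[0] for x, y in train if y == 1]
    ben = [x[0] for x, y in train if y == 0]
    t = (sum(mal) / len(mal) + sum(ben) / len(ben)) / 2
    return lambda x: x[0] - t


def least_confident(scores: Sequence[float], k: int) -> List[int]:
    """Assumed selection rule: the k files whose score is closest to zero."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:k]


train = [((0.0,), 0), ((10.0,), 1)]
pool = [(1.0,), (4.0,), (6.0,), (9.0,)]
model = acquisition_round(
    train, pool, fit_threshold, least_confident,
    oracle=lambda x: int(x[0] > 5.0), budget=2,
)
```

Only `budget` files per round reach the expert; the remaining files stay in the pool unlabeled, which is where the reduction in labeling effort comes from.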
The framework can be instantiated using different AL methods. Two of the methods used are well-known AL methods which have been previously proposed: SVM Simple Margin
(Exploration) [95] and Error-Reduction [96]; these were used as a baseline for comparison in the various experiments documented both in our research and published papers. Both methods select examples for which the classifier is less confident regarding the true label; in SVM it means those examples that lie closest to the SVM separating hyperplane. Our two new methods are Exploitation and Combination. In contrast to Exploration, Exploitation chooses examples located deep inside the malicious side and farthest from the SVM’s separating hyperplane. Our Combination AL method is a two-phase method that combines principles of Exploration and Exploitation; in the early phase it conducts more Exploration, while Exploitation becomes the dominant strategy in the later phase of the acquisition process.
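The selection rules described above can be sketched as follows. This is a simplified illustration, not the published algorithms: it assumes each unlabeled file already has a signed distance from the SVM separating hyperplane with positive values on the malicious side, and it models Combination as a hard switch on a hypothetical `switch_day`, whereas the text only says that Exploration dominates early and Exploitation later.

```python
from typing import List, Sequence


def exploration(distances: Sequence[float], k: int) -> List[int]:
    """SVM Simple Margin: indices of the k files closest to the hyperplane."""
    return sorted(range(len(distances)), key=lambda i: abs(distances[i]))[:k]


def exploitation(distances: Sequence[float], k: int) -> List[int]:
    """Indices of the k files deepest inside the (assumed positive) malicious side."""
    malicious_side = [i for i, d in enumerate(distances) if d > 0]
    return sorted(malicious_side, key=lambda i: -distances[i])[:k]


def combination(distances: Sequence[float], k: int, day: int, switch_day: int) -> List[int]:
    """Two-phase rule (assumed hard switch): Exploration first, Exploitation later."""
    if day < switch_day:
        return exploration(distances, k)
    return exploitation(distances, k)


# Hypothetical signed distances for five unlabeled files.
distances = [-2.0, -0.1, 0.3, 1.5, 4.0]
early = combination(distances, k=2, day=1, switch_day=5)
late = combination(distances, k=2, day=7, switch_day=5)
```

With these toy values the early phase picks the two files nearest the boundary, while the late phase picks the two files the classifier is most confident are malicious.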
The framework was thoroughly tested on several different types of malware: computer worms, executables, malicious documents (PDF and docx MS Office files), and malicious Android applications. In each of these applications, sophisticated and specifically tailored feature extraction and dataset creation methodologies were proposed and implemented. We used the SVM classifier as the base classifier, and the experiments were carried out using various SVM kernels. A solid and comprehensive evaluation methodology was used in order to test the framework, both in terms of classification performance (accuracy, TPR, FPR and AUC) and the number and percentage of malware acquired daily (NOMA / POMA), which are important measures given that the purpose of the framework is to update the antivirus signature repository with new malware on a frequent basis.
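The evaluation measures named above can be computed as in the following sketch. The function names are hypothetical, AUC is left out for brevity, and the POMA denominator is an assumption: it is taken here to be the number of malicious files present in that day's stream.

```python
from typing import Dict, Sequence, Tuple


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, TPR and FPR for binary labels (1 = malicious, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,   # share of malware detected
        "fpr": fp / (fp + tn) if fp + tn else 0.0,   # share of benign files flagged
    }


def noma_poma(acquired_labels: Sequence[int], malware_in_stream: int) -> Tuple[int, float]:
    """NOMA = malicious files among those acquired that day;
    POMA = NOMA as a percentage of the malware in the daily stream (assumed denominator)."""
    noma = sum(acquired_labels)
    poma = 100.0 * noma / malware_in_stream if malware_in_stream else 0.0
    return noma, poma


metrics = detection_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
noma, poma = noma_poma([1, 0, 1, 1], malware_in_stream=12)
```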
1.3. Deployment of our Framework
Another key to addressing the challenging task of agile detection of new unknown malware is the efficient and sophisticated deployment of the detection method over strategic nodes in the Internet network. In order to meet this challenge, we strive to expose our framework to as many new files transferred over the Internet as possible, so that most of the new informative files will be acquired; deployment is therefore defined so as to achieve the largest coverage while minimizing costs by involving as few units as possible. Thus, once a new unknown malware is created and transferred over the network, it will be monitored by the deployed framework, like a "fly caught in tangled spider webs." The combination of these two components (effective deployment and identification of the most informative new files) will contribute to cleaner Internet network traffic for the hosts.
A comprehensive study by Puzis et al. [82-85] that deals with the efficient deployment of IDSs provides significant insight into this challenge. The study used the "betweenness" algorithm, which is a good heuristic for traffic load, and it was found that most network traffic can be monitored by listening to only a few strategic nodes over the Internet. Our opinion is that in order to achieve maximal efficiency, our framework should be located and deployed at different levels of the network traffic: the higher level should include strategic NSP routers' links in order to prevent propagation of the malicious code and reduce the extent of the damage, as was suggested by Puzis, while the lower level should act as a host-based IDS that will protect the workstations when the higher levels have not been exposed to the malicious code.
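To make the betweenness heuristic concrete, the sketch below computes plain node betweenness centrality with Brandes' algorithm on a small, made-up topology. It is an illustration only: the function name, the toy graph, and its node names are hypothetical, and the variant shown is ordinary node betweenness for an unweighted, undirected graph, which is not necessarily the exact measure used in [82-85].

```python
from collections import deque
from typing import Dict, Hashable, List


def betweenness(graph: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, float]:
    """Brandes' algorithm; `graph` maps each node to its (symmetric) neighbor list."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:                 # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical topology: router "r" joins hosts a, b and gateway c; host d sits behind c.
toy = {"a": ["r"], "b": ["r"], "c": ["r", "d"], "d": ["c"], "r": ["a", "b", "c"]}
load = betweenness(toy)
ranked = sorted(load, key=load.get, reverse=True)
```

In this toy network the router carries the shortest paths of five of the ten host pairs, so placing a single monitoring unit there already covers half of the pairwise traffic routes.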
The deployment will consider routers, gateways to organizations, and also the markets of mobile applications (official and non-official) as part of the strategic nodes in the Internet network. The integration between these strategic nodes and several levels of the framework will strengthen the detection and updatability capabilities.
2. Overview of the Core Papers in the Research
In the course of this research we have published eight papers supporting the efficiency and contributions of our active learning based framework for the malware detection domain (presented in Figure 1). Our research focused on the development of advanced detection frameworks and methods for frequent and enhanced detection of four of the most popular types of malware: computer worms and malicious executables, documents, and Android applications.
The main contributions of the eight publications that comprise this thesis include: improving the detection capabilities of machine learning based solutions by frequently and efficiently enhancing the knowledge stores of the detection model and antivirus tools, reducing the number of files acquired which are used for keeping the model updated, and improving the detection model’s capabilities by enhancing its robustness by filtering out the presence of misleading malware instances.
In order to efficiently update the detection model, the new framework employs active learning techniques that enable the experts to label instances that may better contribute to a more accurate model. A new active learning technique is proposed, which is used to detect various types of malware, including worms, executables, and Android applications. In addition, the new framework is used for detecting malicious documents (PDF and MS Word) that may be used by attackers to inject malware into victims’ computers.
As a proof of concept for the generality of our AL based framework, we have also extended the framework's capabilities so that it provides solutions in additional domains. We have adapted it to the biomedical informatics domain, in which we successfully enhanced the capabilities of a classification model used for condition severity classification, while significantly reducing labeling efforts, which can result in substantial savings in both the time and financial costs associated with medical experts.
Among the eight published papers [1-8], four of them [1-4] were published in top peer reviewed journals and form the core of this research. The remaining four papers [5-8] were accepted to additional journals, ranked conferences, and workshops within top tier conferences. In this section, we provide a brief introduction to each of the four core journal papers, as the complete
papers will be presented in the next section. The other four papers are included in the appendix in the subsection entitled, “Additional Accepted Papers in Malware Detection Domain.”
As can be seen in Figure 1, the topics researched span two domains and several sub-domains. While my primary expertise lies in the malware detection domain, the involvement and expertise of my co-authors enabled me to widen the scope of my research and delve into an entirely new domain of biomedical informatics. For example, the application of the framework to this domain required knowledge of the new field and access to an additional dataset, an area in which the role of the co-authors was invaluable. It is important to note that as this research constitutes my Ph.D. research, I was responsible for all aspects of the research and the experiments that comprise it.
Our four core papers are based on an evolving program of research that was guided by our attentiveness and awareness of upcoming trends in the malware detection domain. Broadly our research progressed as follows. We started with a behavioral active learning based framework [1] for the enhanced detection of elusive computer worms, on the heels of the discovery of the sophisticated and elusive “Stuxnet” malware in the SCADA systems of Iran’s nuclear facilities. After demonstrating improved and more efficient detection of unknown computer worms in our first study, we identified a major gap in the area of detection solutions aimed at another popular type of malware: malicious executables; in this case, a weakness was found in the updatability (or lack thereof) of existing detection solutions. Therefore we enhanced our AL based framework and extended it with the addition of two novel and efficient active learning methods. Based on these changes, the framework provides a solution for frequent and efficient updatability [2] of both the detection model and antivirus software, which is particularly needed in light of the daily creation of new malicious executables.
During that time, a new trend was emerging, and mobile devices (especially Android OS based smartphones) increasingly became attractive targets for Android malware. The amount of Android malware has increased at a significant rate; many unofficial application markets were contaminated with malicious versions of known Android applications, and the contamination also found its way to the official market of Android applications, even affecting Google Play. In 2012, Google presented Bouncer which is comprised of machine learning algorithms based on dynamic and basic static analysis of applications uploaded to the market. However, according to [86], it was announced at SummerCon 2012 that more than 20 ways of evading Bouncer had been discovered [87]. The insufficient detection of Android malware, the reliance of Android smartphones on
antivirus solutions, and the updatability gap that we also identified in the detection solutions for Android malware, led us to enhance our capabilities in the active learning framework and create ALDROID [3]. This new framework outperformed the existing solutions and provides a solution for the enhanced detection of Android malware in the long run.
After providing solutions and enhanced detection of computer worms, malicious executables, and Android malware, our research continued, this time aimed at another emerging malware trend, as the popularity of APT attacks increased and they became better funded, more sophisticated, and well-planned. Organizations increasingly prevent the entrance of executables into their internal networks because of their high risk; thus a new trend was created: instead of sending executables, attackers create malicious documents and attach them to email messages sent to organizations. In this way attackers attempt to penetrate organizations’ defenses and perform malicious activities utilizing social engineering techniques, causing innocent employees to open malicious documents (such as PDF and MS Office files). We applied our expertise and newfound insights toward the goal of enhancing the detection of malicious documents [4]: we adapted our AL based framework for use with malicious documents and created the ALPD [5,6] and ALDOCX [7,8] frameworks, which integrate our newly developed feature extraction methodologies for efficient detection of malicious documents.
Figure 1 presents the domains and sub-domains to which our framework has been applied during the current Ph.D. research, as well as our related papers, including their reference number and ranking details, clustered into journals (green nodes), conferences (orange nodes), and workshops (yellow nodes). The main domains in which our framework was successfully applied appear in the red nodes. Figure 1 is divided into two main sub-diagrams: the upper red node which represents the core of this study – malware detection within the cyber security domain, and the lower red node which presents the additional domain – condition severity classification within biomedical informatics domain. The upper diagram is also divided into four blue nodes that represent sub-domains of malware detection: computer worms, malicious executables, malicious Android applications, and malicious documents. The rightmost sub-domain within malware detection, malicious documents, is sub-divided into two blue nodes, MS Office and PDF files, which are sub-types of the most popular documents through which cyber-attacks are launched.
[Figure 1 diagram. Root node: Active Learning Based Framework. Legend: Domain; Sub-Domain; Accepted Journal Paper; Accepted Conference Paper; Accepted Workshop Paper within Conference.]
Malware Detection (Cyber-Security Domain)
Computer Worms: [1] Nissim et al. (2012), “ALWORM Framework”, PAAA Journal, Q3
Executables: [2] Nissim et al. (2014), “ALPC Framework”, ESWA Journal, Q1
Android Applications: [3] Nissim et al. (2015), “ALDROID Framework”, KAIS Journal, Q1
Documents, MS Office docx Files: [7] Nissim et al. (2015), “ALDOCX Framework”, IEEE ICMLA Conference, Rank C; [8] Nissim et al. (2015), “Malicious MS Office Docs”, ODDX3 Workshop at KDD
Documents, PDF Files: [4] Nissim et al. (2014), “Malicious PDF Detection”, COSE Journal, Q2; [5] Nissim et al. (2015), “Enhanced ALPD Framework”, Security-Informatics Journal; [6] Nissim et al. (2014), “ALPD Framework”, IEEE JISIC Conference
Condition Severity Classification (Biomedical Informatics Domain)
[9] Nissim et al. (2014), “CAESAR-ALE”, Big-CHAT Workshop at KDD; [10] Nissim et al. (2015), “Condition Severity Classifications”, AIME Conference, Rank A (Best Paper Award); [11] Nissim et al. (2015), “CAESAR-ALE Enhancement”, JBI Journal, Q1
Figure 1: The domains and sub-domains in which our framework has been used and our related papers (published and in press) during the current Ph.D. research, including their ranking details, and clustering into journals, conferences, and workshops.
We now provide a brief overview of each of our core papers. Our first core paper [1] is entitled, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning.” In this paper we aimed at enhancing the detection of computer worms by dynamically monitoring and analyzing their behavior at high frequency rates. This research showed that our framework and AL method can efficiently select just a small number of instances related to an analyzed worm’s behavior and filter out the misleading and noisy instances of malware behavior, which are common among elusive computer worms, thereby improving the detection of unknown computer worms. Our behavioral analysis of these worms was based on computer measurements extracted from the operating system. We designed a series of experiments to test the new technique by employing several computer configurations and background application activities. In the course of the experiments, 323 computer features were monitored. In addition, we used active learning as a selective sampling method to increase the performance of the detection model, which improved by between 19% and 25%, and thereby also improved its robustness in the presence of misleading instances of computer worms.
Our second core paper [2] is entitled, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS.” In this paper we introduced a solution addressing the main problem associated with the agile detection of new unknown malware: the updatability gap of both antivirus software and the detection model in the domain of malicious executables. We presented an active learning framework and introduced two novel AL methods that assist antivirus vendors, helping them better focus their analytical efforts by acquiring the files that are most likely malicious. The new AL methods were designed for and oriented toward new malware acquisition. Our AL methods outperformed the existing AL method in two respects related to the number of new malwares acquired daily, which was the core measure in this study. First, on the ninth day of the experiment our best performing AL method, termed Exploitation, acquired approximately 2.6 times more malwares than the existing AL method and 7.8 times more than the random selection method. Second, while the existing AL method showed a decrease in the number of new malwares acquired over ten days, our AL methods showed an increase and daily improvement in the number of new malwares acquired. Both results point toward increased efficiency that can assist antivirus vendors.
Our third core paper [3] is entitled, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods.” In this paper our efforts were directed at the
smartphone domain, an area in which the need for antivirus enhancement is even greater than in the PC domain. In contrast to PCs, smartphones are heavily dependent on antivirus solutions, because of the inability to apply advanced detection methods (static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate the effective detection of new malware, as well as the efficient and nimble update of the antivirus tools. It is not feasible to analyze every new application, so our ALDROID framework only selects the most probable Android malwares for labeling. While our framework reduces the number of unknown applications that must be manually analyzed, it strengthens the detection model at the same time by also selecting informative benign applications. Thus the framework addresses the resource limitations of the smartphone, as well as the challenge presented by the sheer volume of unknown applications created daily. Our approach is capable of providing more frequent updates to the detection model, because only a small and manageable set of informative applications is sent to the human expert for inspection and subsequently acquired by the detection model. This is in contrast to heuristic approaches based on scoring algorithms or other types of detection models, which are only updated periodically due to the labor intensive process of human expert analysis.
In our framework the updated detection model efficiently updates the antivirus signature repository, which in turn improves the detection capabilities of the installed and widely used antivirus software within smartphones. As is known, the structure of Android applications differs substantially from the executable files (including computer worms) within the Windows OS we previously investigated. Therefore the results of our previous paper [2] cannot automatically be assumed to hold in the Android OS, since the detection model and AL methods used in the current study rely on different dataset characteristics related to the Android applications domain, particularly in terms of the extracted features, the malware distribution, and the attack techniques detected by the detection model. ALDROID combined three important elements. First, we presented a set of general descriptive features for the detection of Android malware, features which are robust and unaffected by obfuscation or transformation evasion techniques. The features are based on the application's static genes and not on the optional operations it might conduct. Therefore, the features are also robust against evasive techniques based on delayed malicious operations. Second, from these features we induced a detection model using an SVM machine learning algorithm. Third, we applied our malware oriented AL methods in order to leverage our
general descriptive set of features and the knowledge of the detection model, in order to enhance the detection model and the antivirus software frequently.
Results indicate that our AL methods outperformed other solutions including the existing AL method and heuristic engine. Our AL methods acquired the largest number and percentage of new malwares, while preserving the detection models’ detection capabilities (high TPR and low FPR rates). Specifically, our methods acquired more than double the amount of new malwares acquired by the heuristic engine and 6.5 times more malwares than the existing AL method.
Our fourth core paper [4] is entitled, "Detection of Malicious PDF Files and Directions for Enhancements: a State of the Art Survey.” This paper is based on research which proved pivotal to our solid understanding of malicious documents, in general, and malicious PDF files, in particular. Through our comprehensive survey of advanced academic solutions for the detection of PDF malware, we were able to identify the best performing feature extraction methods for malicious PDF files and observe the significant lack of updatability in the detection solutions in PDF malware as well. In this paper we provided comparisons, insights, conclusions, and avenues for future research in order to enhance the detection of malicious PDFs. One of the most important contributions of this paper is revealing and highlighting the correlation between the structural incompatibility of PDF files and their likelihood of maliciousness. By doing this, at least 96.5% of the malicious PDF files can be easily filtered out using a simple and deterministic filtering process. The second, and probably more important contribution, is providing a detailed explanation of our active learning based framework for enhancing and supporting the updatability of detection solutions for malicious PDF files.
This paper was followed by our additional peer reviewed journal paper [5] (based on the preliminary results of our research on the ALPD framework which was presented at the IEEE JISIC conference [6]); in this extended paper [5] we implemented the above mentioned suggested framework [4] for efficiently updating detection models of PDF malware using AL methods. Results showed that our AL method, Combination, outperformed all of the other methods, enriching the signature repository of the antivirus with almost seven times more new malicious PDF files, while improving the detection model’s capabilities further each day. At the same time, it dramatically reduced security experts’ efforts by 75%. Despite this significant reduction, results also indicate that our framework better detects new malicious PDF files than leading antivirus tools commonly used by organizations for protection against malicious PDF files.
Furthermore, it is also worth mentioning our additional paper [7] that was inspired by this important core paper [4]; in [7] we presented the ALDOCX system at the ranked machine learning conference, ICMLA-2015 (based on our preliminary results [8] presented in a workshop at the KDD-2015 machine learning conference). In this paper [7], we presented the SFEM feature extraction methodology and designated active learning methods aimed at accurate detection of new unknown malicious docx files; these methods also efficiently enhance the detection model’s capabilities over time and quickly utilize the vast number of documents within an organization. Results show that our active learning methods used only 14% of the labeled docx files within an organization, which led to a reduction of 95.5% in labeling efforts compared to passive learning and SVM-Margin (an existing active learning method). Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and SVM-Margin, thus providing an improved updating solution for the detection model, as well as the antivirus software widely used within organizations. We also showed that our novel structural feature extraction methodology (SFEM) results in a set of very discriminative and general features of the XML based MS Office documents, and that these features can be effectively leveraged by machine learning algorithms to induce an accurate detection model for malicious docx files.
To summarize, our fourth core paper [4] was the basis for four more papers [5-8] aimed at enhanced detection of malicious documents. These papers can be found in the appendix subsection entitled, “Additional Accepted Papers in the Malware Detection Domain.”
In addition to our contribution to the malware detection domain, as can be seen in the lower section of Figure 1, we have published three additional papers in the biomedical domain [9-11]. The second of these papers [10] won the best student paper award at a prestigious artificial intelligence conference (AIME-2015 Rank-A). The third paper [11] was recently accepted for publication by the Journal of Biomedical Informatics (JBI), a top biomedical informatics peer reviewed journal. These papers can be found in the appendix subsection entitled, “Additional Accepted Papers in the Biomedical Informatics Domain.” The papers present the results and methodology of our recently extended framework as applied to the biomedical informatics domain, which is a completely different domain than malware detection. In this research we were able to successfully enhance the capabilities of a classification model used for condition severity classification with a significant reduction of labeling efforts that can result in significant savings,
both in terms of time and money, associated with the efforts of medical experts. As a part of the extension of our framework, we developed an additional new AL method called Combination_XA which is more oriented to the acquisition needs in the biomedical informatics domain. This extension showed that our methods and framework are generic and can provide a solution for a variety of problems in many different domains.
2.1. Research Results
The following is a complete list of the 11 papers that were published and submitted during this research in the cyber-security and biomedical informatics domains presented in Figure 1.
[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, “Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning,” Pattern Analysis and Applications, (2012) 15:459-475.
[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, “Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS,” Expert Systems with Applications, (2014), Link: http://authors.elsevier.com/sd/article/S095741741400133X.
[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, “ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods,” Knowledge and Information Systems (2016), 1-39. Accepted on 11 of January 2016.
[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, “Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the Art Survey,” Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.
[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici, “Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework,” Security Informatics, 5(1), 1-20, (2016).
[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Assaf Shabatai, Yuval Elovici, “ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations,” JISIC, (2014).
[7] Nir Nissim, Aviad Cohen, Yuval Elovici, “Boosting the Detection of Malicious Documents Using Designated Active Learning Methods,” 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765. doi: 10.1109/ICMLA.2015.52
[8] Nir Nissim, Aviad Cohen, Yuval Elovici, “Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents,” ODDX3 Workshop at KDD Conference, (2015), Sydney.
[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, “CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification,” BigCHAT Workshop at KDD Conference, (2014).
Mario Stefanelli Best Paper Award at AIME 2015 Conference:
[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, “An Active Learning Framework for Efficient Condition Severity Classification,” Artificial Intelligence in Medicine (Pages 13-24), Springer International Publishing, AIME (2015).
[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch, “Improving condition severity classification with an efficient active learning based framework,” Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464,
2.1.1. Core Papers
In this section we list the four papers which form the core of this research; the list will be followed by the papers themselves. The appendix contains our other accepted papers.
[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, “Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning,” Pattern Analysis and Applications, (2012) 15:459-475.
[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, “Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS,” Expert Systems with Applications, (2014), Link: http://authors.elsevier.com/sd/article/S095741741400133X.
[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods,” Knowledge and Information Systems (2016), 1-39. Accepted on 11 of January 2016.
[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, “Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the Art Survey,” Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.
Pattern Anal Applic (2012) 15:459–475 DOI 10.1007/s10044-012-0296-4
INDUSTRIAL AND COMMERCIAL APPLICATION
Detecting unknown computer worm activity via support vector machines and active learning
Nir Nissim • Robert Moskovitch • Lior Rokach • Yuval Elovici
Received: 8 December 2009 / Accepted: 5 September 2012 / Published online: 25 September 2012 Ó Springer-Verlag London Limited 2012
Abstract To detect the presence of unknown worms, we propose a technique based on computer measurements extracted from the operating system. We designed a series of experiments to test the new technique by employing several computer configurations and background application activities. In the course of the experiments, 323 computer features were monitored. Four feature-ranking measures were used to reduce the number of features required for classification. We applied support vector machines to the resulting feature subsets. In addition, we used active learning as a selective sampling method to increase the performance of the classifier and improve its robustness in the presence of misleading instances in the data. Our results indicate a mean detection accuracy in excess of 90 %, and an accuracy above 94 % for specific unknown worms using just 20 features, while maintaining a low false-positive rate when the active learning approach is applied.

Keywords Malware detection · Supervised learning · Active learning

N. Nissim · R. Moskovitch · L. Rokach · Y. Elovici
Department of Information Systems Engineering, Ben Gurion University of the Negev, P.O.B. 653, 84105 Beer-Sheva, Israel
e-mail: [email protected]
R. Moskovitch
e-mail: [email protected]
Y. Elovici
e-mail: [email protected]

N. Nissim · R. Moskovitch · L. Rokach (✉) · Y. Elovici
Deutsche Telekom Laboratories, Ben Gurion University, Beer-Sheva, Israel
e-mail: [email protected]

Pattern Anal Applic (2012) 15:459–475

1 Introduction

The detection of malicious code (malcode) transmitted over computer networks has been researched intensively in recent years. Worms, a particularly widespread malcode, proactively propagate across networks while exploiting vulnerabilities in operating systems or in installed programs. Other types of malcode include computer viruses, Trojan horses, spyware, and adware.

Nowadays, excellent technology (i.e., antivirus software packages) exists for detecting known malicious code. Typically, antivirus software packages inspect each file that enters the system, looking for known signs (signatures) that uniquely identify a malcode. Antivirus technology cannot, however, be used for detecting an unknown malcode, since it is based on prior explicit knowledge of malcode signatures. Following the appearance of a new worm instance, operating system providers provide a patch to deal with the problem, while antivirus vendors update their signatures-base accordingly. This solution has obvious demerits, however, since worms propagate very rapidly. By the time the antivirus software has been updated with the new worm, very expensive damage has already been inflicted [1].

Intrusion detection, termed a network-based intrusion detection system (NIDS), is commonly implemented at the network level. NIDS has been substantially researched but remains limited in its detection capabilities (like any detection system). In order to detect malcodes that have slipped through the NIDS at the network level, detection operations are performed locally by implementing a host-based intrusion detection system (HIDS). To monitor activities at the host level, HIDS usually compares various states of the computer, such as the changes in the file system, using checksum comparisons. The main drawback of this approach is the ability of malcodes to disable antiviruses. The main problem in using HIDS, however, is detection knowledge maintenance, which is usually performed manually by the domain expert. This is apt to be time-consuming and expensive.

Recent studies have proposed methods for detecting unknown malcode using machine-learning techniques. Given a training set of malicious and benign binary executables, a classifier is trained to identify and classify unknown malicious executables as malicious [2–4]. While this approach is potentially a good solution, it is not complete. It can detect only executable files, and not malcodes located entirely in the memory, such as the Slammer worm [5]. In a previous research report, we presented a new method for detecting unknown computer worms [6–8]. The underlying assumption was that malcode within the same category (e.g., worms, Trojans, spyware, adware) share similar characteristics and behavior patterns and that these patterns can be induced using machine-learning techniques. By continuously monitoring and matching the computer's vital signs (such as CPU and hard disk usage) against the previously induced malcode patterns, we can gain an indication as to whether the computer is infected. While this approach does not prevent infection, it enables its fast detection. Relevant decisions and policies, such as disconnecting a single computer or a cluster, can then be implemented.

The goal of this study is to assess the viability of employing support vector machines (SVM) in an individual computer host to detect unknown worms based on their behavior (measurements), and examine whether selective sampling can improve the detection performance. The behavior of some of the worms is unstable, so that some of the time they tend to behave as a legitimate application does. Thus by monitoring their behavior we would derive instances that would negatively affect the model (hereafter we will call these instances misleading instances). The selection of the right instances to be included in the training-set is therefore also very challenging.

This paper makes four contributions to our armory for combating malware:

1. Development of a selective sampling procedure using active learning: Active learning is commonly used to reduce the amount of labeling required from an expert (a time-consuming task). The Oracle is actively asked to label specific examples in the dataset that the learner considers the most informative, based on its current knowledge, which eventually reduces the acquisition cost. In this study, all the training examples are labeled in advance. However, the goal is to select intelligently the best examples that will increase the accuracy by filtering misleading or non-informative instances.
2. Adaptation of SVM for malware detection: In our previous study [6] we used the algorithms decision trees, naïve Bayes, Bayesian networks, and neural networks. In this paper, we study the performance of SVM. Specifically, we examine which of the three kernel functions (linear, polynomial, RBF) is most suitable for detecting unknown worms. We argue that SVM will achieve better results when using active learning as a selective sampling method.
3. Comparison of feature selection methods for improving malware detection: We examine experimentally which feature selection method (if any) best fits the worm detection task using SVM.
4. We investigate the contribution of specific worms to the detection performance and examine if all worms are equally informative.

The rest of the article is structured as follows. Section 2 surveys the relevant background for this work, while Sect. 3 describes the SVM, relevant kernel functions and active learning methods used in this study. Section 4 discusses the research question, corresponding experimental plan, and evaluation results. Finally, in Sect. 5 we conclude the paper with a discussion of the evaluation results, conclusions, and future work.

2 Background and related work

2.1 Malicious code and worms

The term 'malicious code' (malcode) refers to a piece of code, not necessarily an executable file, the intention of which is to harm. In [9], the authors define a worm according to how it can be distinguished from other types of malcode: (1) network propagation or human intervention: worms propagate actively over a network, while other types of malicious codes, such as viruses, commonly require human activity to propagate; (2) standalone or file infecting: while viruses infect a host, a worm does not require a host file and sometimes does not even require an executable file since it may reside entirely in the memory. This was the case with the Code Red worm [10].

Worm developers have different purposes and motivations [11]. Some are motivated by experimental curiosity (ILoveYou worm [12]), while pride and the desire for power lead others to flaunt their knowledge and skill through the harm caused by the worm. Still other motivations are commercial advantage, extortion and criminal gain, random and political protest, and terrorism and cyber warfare.

The wide variety of motivations that we find among worm developers indicates that computer worms will be a
long-lasting phenomenon. To address the challenge posed by worms effectively, as much meaningful experience and knowledge as possible should be extracted from known worms by analyzing them. Today, given the number of known worms, we have an excellent opportunity to learn from these examples. We argue that active learning methods can be very useful for learning and generalizing from previously encountered worms in order to detect previously unseen worms effectively.

2.2 Detecting malicious code using supervised learning techniques

Supervised learning techniques have already been used for detecting malicious codes and creating protection against them. For example, in [13], the authors proposed a framework consisting of a set of algorithms for extracting anomalies from a user's normal behavior patterns. A normal behavior is learned and any abnormal activity is considered intrusive. In order to determine what constitutes normal, the authors suggest several techniques, such as classification, meta-learning, association rules, and frequent episodes. The extracted knowledge forms the basis of an anomaly-based intrusion detection system.

A naïve Bayesian classifier was suggested in [14], referring to its implementation within the ADAM system, developed in 2001 [15]. The classifier consists of three main parts: (a) a network data monitor listening to TCP/IP protocol; (b) a learning engine for acquiring association rules from the network data; and (c) a classification module that classifies the nature of the traffic into two possible classes, normal and abnormal, that can later be linked to specific attacks. Other soft computing algorithms proposed for detecting malicious code include: artificial neural networks (ANN) [16–19]; self-organizing maps (SOM) [20], and fuzzy logic [21–23].

2.3 Active learning

Labeled examples are crucial when training classifiers. However, in certain domains the labeling operation is costly and time-consuming. Active learning (AL) [24] refers to learning policies, in which a learner actively selects unlabeled instances for labeling, based on some criterion. The objective of most AL methods is to minimize the cost of acquiring the labeled instances needed for inducing an accurate model. In this paper, we scrutinize another aspect of AL. Instead of minimizing the acquisition costs, our objective is to increase the generalization accuracy using an approach that disregards misleading instances. Several AL frameworks are presented in the literature. In pool-based active learning [25], the learner has access to a pool of unlabeled data and can request the true class label for a certain number of instances in the pool. Other approaches focus on the expected improvement of class entropy [26], or minimizing both labeling and misclassification costs [27]. Although in our problem all the examples are actually labeled, we decided to apply AL as the selective sampling approach for choosing the most informative examples to reduce the number of the misleading instances in the training data. In Sect. 3.4, we explain how AL can be used to achieve this goal.

3 Methods

3.1 Dataset creation

Since no benchmark dataset exists that we could use for this study, we created our own. A controlled network with various computers (configurations) was deployed into which we could inject worms, and monitor and log the computer operating system features using a dedicated agent. In order to create the datasets, we isolated the local network of computers, simulating a real Internet network that allowed worms to propagate.

We designed several experiments centered around eight datasets, which we created based on three aspects: hardware configuration, background applications, and user activities. Using this model, we designed several experiments to achieve our research goals:

a. To find out whether a classifier, trained on data collected from a computer with a certain hardware configuration and specific background activity, is capable of classifying correctly the behavior of a computer that has other configurations.
b. To select the minimal subset of features required to classify new cases correctly. Reducing the number of features used in the model implies that less monitoring effort would be needed in an operational system.

In the course of experimentation, we applied four classification algorithms on the given datasets in a varied series of experiments in order to detect, first, known worms in different environments and, later, completely new, previously unseen worms.

Figure 1 depicts the process that was used in this study. The upper part refers to the training phase. We collected a set of worms and used them to infect the hosts in the controlled environment. An agent, which was installed on each host, then recorded its behavior. Based on the collected dataset, we trained the classifiers. The bottom part of Fig. 1 refers to the test phase. In this phase, we examined whether the induced classifier can be used to identify the existence of an unknown worm.
Fig. 1 Outline of the training phase and the test phase. The worms are injected into the computers, which are monitored. Features are extracted and an SVM classifier is induced. In the test phase, the monitored features are provided to the classifier, which determines whether there is worm activity or not
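One way to simulate an unknown worm in the test phase of Fig. 1 is to withhold every instance of one worm from training. The following is a minimal sketch under stated assumptions, not the authors' protocol: instances are taken to be (feature vector, label) pairs with the label 'none' for clean behavior, and the function name `split_unknown_worm` and the alternating assignment of clean instances are illustrative choices.

```python
def split_unknown_worm(instances, held_out_worm, clean_label="none"):
    """Split (features, label) pairs so that one worm is unseen during training."""
    train, test = [], []
    clean_seen = 0
    for features, label in instances:
        if label == held_out_worm:
            test.append((features, label))   # the "unknown" worm: test only
        elif label == clean_label:
            # Alternate clean instances between the two sets so that both
            # phases contain normal behavior (an illustrative assumption).
            target = train if clean_seen % 2 == 0 else test
            target.append((features, label))
            clean_seen += 1
        else:
            train.append((features, label))  # known worms: training only
    return train, test


# Toy instances with two hypothetical features each.
instances = [
    ([0.1, 0.2], "none"),
    ([0.9, 0.8], "W32.Sasser.D"),
    ([0.7, 0.9], "W32.Korgo.X"),
    ([0.2, 0.1], "none"),
    ([0.8, 0.7], "W32.Dabber.A"),
]
train, test = split_unknown_worm(instances, held_out_worm="W32.Korgo.X")
```

A classifier induced from `train` never sees the held-out worm, so its performance on `test` approximates detection of a previously unseen worm.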
3.1.1 Environment description

The laboratory network consisted of seven computers, which contained heterogeneous hardware, and a server computer simulating the Internet. We used the Windows performance counters¹, which enabled us to monitor system features that appeared in the following categories (the number of features in each category appears in parentheses): internet control message protocol (27), internet protocol (17), memory (29), network interface (17), physical disk (21), process (27), processor (15), system (17), transmission control protocol (9), thread (12), user datagram protocol (5). In addition, we used VTrace [28], a software tool that can be installed for monitoring purposes on a PC running Windows. VTrace collects traces of the file system, the network, the disk drive, processes, threads, inter-process communication, cursor changes, etc. The Windows performance counters were configured to measure the features every second and to store them in a log file as a vector. The time-stamped events stored by VTrace were aggregated into the same fixed intervals and merged with the Windows performance log files. This body of data eventually consisted of a vector of 323 features collected every second. We worked with this granularity because these loggers' most granular level was 1 s. Larger time windows, in which we could aggregate the measurements over longer time periods, might have been too slow for worm activity detection.

3.1.2 Injected worms

When selecting worms for injection, we tried to include every variety. Some of the worms carried a heavy payload of Trojans to install in parallel with their distribution over the network; others focused entirely on distribution. Another feature that we desired to obtain was that the worms would have different strategies for IP scanning that would result in varying communication behaviors, CPU consumption, and network usage. While all the worms were different, we wanted to find common characteristics, which could be used to detect an unknown worm. We briefly describe here the main characteristics of the five worms included in this study. The information is based on the virus libraries on the Web.²,³,⁴

3.1.3 W32.Dabber.A

This worm randomly scans IP addresses and uses the W32.Sasser.D worm to propagate and open the FTP server in order to upload itself to the victim's computer. The worm registers itself for execution at the next user login (human-based activation). It drops a backdoor, which listens on a predefined port. This worm is distinguished by its use of an external worm in order to propagate.

3.1.4 W32.Deborm.Y

W32.Deborm.Y is a self-carried worm that prefers local IP addresses. It registers itself as an MS Windows service and is executed upon user login (human-based activation). This worm contains and executes three Trojans as a payload: Backdoor.Sdbot, Backdoor.Litmus, and Trojan.KillAV. We chose this worm because of its heavy payload.

¹ http://msdn.microsoft.com/library/default.asp?url=/library/en-us/counter/counters2_lbfc.asp.
² Symantec - http://www.symantec.com.
³ Kaspersky - http://www.viruslist.com.
⁴ McAfee - http://vil.nai.com.
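The per-second merging described in Sect. 3.1.1 (time-stamped trace events aggregated into 1-s intervals and appended to the performance-counter vector of the same second) can be sketched as follows. This is an illustration only: the event names, the two counter values per row, and both function names are hypothetical, and timestamps are assumed to be non-negative seconds since the start of monitoring.

```python
from collections import Counter, defaultdict


def aggregate_events(events):
    """Count time-stamped events per (1-second interval, event type)."""
    per_second = defaultdict(Counter)
    for timestamp, event_type in events:
        # int() truncates, so 0.10 and 0.90 both fall into second 0.
        per_second[int(timestamp)][event_type] += 1
    return per_second


def merge_vectors(counter_rows, per_second, event_types):
    """Append per-second event counts to each per-second counter vector."""
    merged = {}
    for second, counters in counter_rows.items():
        counts = per_second.get(second, Counter())
        merged[second] = list(counters) + [counts[t] for t in event_types]
    return merged


# Hypothetical trace events: (seconds since start, event type).
events = [(0.10, "file_write"), (0.75, "net_send"),
          (0.90, "file_write"), (1.20, "net_send")]
# Hypothetical performance-counter rows, one vector per second.
counter_rows = {0: [12.5, 340.0], 1: [99.0, 512.0]}
event_types = ["file_write", "net_send"]

merged = merge_vectors(counter_rows, aggregate_events(events), event_types)
```

Each merged row is one instance: the counter values for that second followed by the event counts in a fixed order, which keeps the vector layout identical across seconds.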
3.1.5 W32.Korgo.X

This is a self-carried worm that uses a completely random method for scanning IP addresses. It is self-activated and tries to inject itself via a new thread of MS Internet Explorer. It contains a payload code that enables it to connect to predefined websites in order to receive orders or download newer worm versions.

3.1.6 W32.Sasser.D

W32.Sasser.D has a preference for local address optimization while scanning the network. It divides its time, approximately half and half, between scanning local addresses and random addresses. In particular, it opens 128 threads for scanning the network. This requires heavy CPU consumption, as well as significant network traffic. It is a self-carried worm and uses a shell to connect to the infected computer's FTP server and to upload itself.

3.1.7 W32.Slackor.A

This is a self-carried worm that propagates by exploiting MS Windows' sharing vulnerability. The worm registers itself for execution upon user login. It contains a Trojan payload and opens an IRC server on the infected computer in order to receive orders.

3.1.8 Computer measurements

We examined the influence of computer hardware configuration, applications running in the background, and user activity.

1. Computer hardware configuration: We used two different configurations. Both ran on Windows XP, which is considered the most widely used operating system, having two configuration types: the "old" configuration has a Pentium 3 800 MHz CPU, a bus speed of 133 MHz, and 512 MB memory; the "new" configuration has a Pentium 4 3 GHz CPU, a bus speed of 800 MHz, and 1 GB memory.
2. Background application: We ran an application that affects mainly the following features: processor object, processor time (usage of 100 %); page faults/s; physical disk object; average disk bytes/transfer, average disk bytes/write, and disk writes/s.
3. User activity included several applications, among them: browsing, downloading, and streaming operations through Internet Explorer, Word, Excel, chat through MSN messenger, and Windows Media Player. These activities were implemented in such a way as to imitate user activity in a scheduled way.

The data were collected in the presence or absence of background application and user activity in each of the hardware configurations. We therefore had three binary aspects, which resulted in eight possible feature-collecting conditions, shown in Table 1, representing a variety of dynamic computer configurations and usage patterns. Each dataset contained monitored instances of all the five injected worms separately, and instances of normal computer behavior without any injected worm. Each instance was labeled with the relevant worm (class), or 'none' for "clean" instances. Each worm was monitored for a period of 20 min with a resolution of 1 s. Thus, each instance contained a vector of measurements that represents a 1-s snapshot. Accordingly, each dataset contained a few thousand such labeled instances. Worms and legitimate applications were monitored in different configurations (computer hardware configuration, existence of a background application, and existence of user activity). The outcome of this monitoring process was features that represent the application's (worm/non-worm) behavior. Some of the worms tend to behave in one environment similarly to a legitimate application in another environment; similarly, a legitimate application might be perceived as non-legitimate when its behavior is monitored in different environments. Thus, these cases are also a source of misleading instances in the data. In order to derive a training set that included applications with distinctive behavior in any environment, we chose to disregard applications whose behavior is not stable in all the environments.

Table 1 The three aspects resulting in eight datasets, representing a variety of feature collecting conditions of a monitored computer

Computer  Background application  User activity  Dataset name
Old       No                      No             o
Old       No                      Yes            ou
Old       Yes                     No             oa
Old       Yes                     Yes            oau
New       No                      No             n
New       No                      Yes            nu
New       Yes                     No             na
New       Yes                     Yes            nau

3.2 Feature selection

In machine-learning applications, the large number of features in many domains presents a huge challenge. Typically, some of the features do not contribute to the accuracy of the classification task and may even hamper it. Moreover, in our approach, reducing the amount of
features while maintaining a high level of detection accuracy is crucial for meeting computer and resource consumption requirements for the monitoring operations (measurements) and the classifier computations. This state can be achieved using the feature selection technique. Since this is not the focus of this paper, we will describe the feature selection preprocessing only very briefly. In order to compare the performance of the different kernels in the SVM, we used the filter approach, which is applied on the dataset and is independent of any classification algorithm (unlike wrappers, in which the best subset is chosen using an iterative evaluation experiment). Using filters, a measure was calculated to quantify the correlation of each feature with the class, in our case, the presence or absence of worm activity. Each feature received a rank representing its expected contribution in the classification task. Finally, the top ranked features were selected.

3.2.1 Feature ranking metrics

We used three feature metrics, which resulted in a list of ranked features for each metric and an ensemble incorporating all three of them. We used chi-square (CS), gain ratio (GR) and RELIEF implemented in the WEKA environment [29] and their ensemble.

3.2.2 Chi-square

Chi-square measures the lack of independence between a feature $f$ and a class $c_i$ (such as W32.Dabber.A) and can be compared to the chi-square distribution with one degree of freedom to judge extremeness. Equation 1 shows how the chi-square measure is defined and computed, where $N$ is the total number of documents, $f$ refers to the presence of the feature (and $\bar{f}$ its absence), and $c_i$ refers to its membership in $c_i$. $P(f, c_i)$ is the probability that the feature $f$ occurs in $c_i$ and the probability $P(\bar{f}, c_i)$ is the probability that the feature $f$ does not occur in $c_i$. Similarly, $P(f, \bar{c}_i)$ and $P(\bar{f}, \bar{c}_i)$ are the probabilities that the feature does or does not occur in a file that is not labeled to $c_i$, respectively. $P(f)$ is the probability that the feature appears in a file, and $P(\bar{f})$ is the probability that the feature does not appear in the file. $P(c_i)$ is the probability that a file is labeled to $c_i$, and $P(\bar{c}_i)$ is the probability that a file is not

overcome a bias in the information gain (IG) measure, and measures the expected reduction of entropy caused by partitioning the examples according to a chosen feature. Given entropy $E(S)$ as a measure of the impurity in a collection of items, it is possible to quantify the effectiveness of a feature in classifying the training data. Equation 3 presents the entropy of a set of items $S$, based on $C$ subsets of $S$ (for example, classes of the items), presented by $S_c$. IG measures the expected reduction of entropy caused by partitioning the examples according to attribute $A$, in which $V$ is the set of possible values of $A$, as shown in Eq. 2. These equations refer to discrete values; however, it is possible to extend them to continuous valued attribute.

$$IG(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v) \qquad (2)$$

$$E(S) = -\sum_{c \in C} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|} \qquad (3)$$

The IG measure favors features having a high variety of values over those with only a few. GR overcomes this problem by considering how the feature splits the data (Eqs. 4, 5). $S_i$ are $d$ subsets of examples resulting from partitioning $S$ by the $d$-valued feature $A$.

$$GR(S, A) = \frac{IG(S, A)}{SI(S, A)} \qquad (4)$$

$$SI(S, A) = -\sum_{i=1}^{d} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \qquad (5)$$

3.2.4 Relief

Relief [32] estimates the quality of the features according to how well their values distinguish between instances that are near each other. Given a randomly selected instance $x$, from a dataset $s$ with $k$ features, Relief searches the dataset for its two nearest neighbors from the same class, an action termed "nearest hit H", and from a different class, referred to as "nearest miss M". The quality estimation $W[A_i]$ is stored in a vector of the features $A_i$, based on the values of a difference function diff() given $x$, $H$ and $M$ as shown in Eq. 6.
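The information gain and gain ratio measures of Eqs. 2 to 5 can be computed directly for a discrete feature. The sketch below is a plain-Python illustration rather than the WEKA implementation used in the study; the function names are hypothetical, and returning 0 for a single-valued feature (where the split information is zero) is an assumption made here to avoid division by zero.

```python
from collections import Counter
from math import log2


def entropy(items):
    """E(S) of Eq. 3: impurity of a collection of discrete items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def information_gain(values, labels):
    """IG(S, A) of Eq. 2 for one discrete feature A."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """GR(S, A) of Eq. 4; SI(S, A) of Eq. 5 is the entropy of the feature values."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # single-valued feature: GR defined as 0 here (assumption)
    return information_gain(values, labels) / split_info


labels = ["worm", "worm", "none", "none"]
informative = ["high", "high", "low", "low"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]           # independent of the class

gr_informative = gain_ratio(informative, labels)
gr_uninformative = gain_ratio(uninformative, labels)
```

Ranking features by `gain_ratio` and keeping the highest-scoring ones corresponds to the filter approach described in Sect. 3.2; continuous measurements would first need to be discretized, which this sketch does not cover.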