An Active Learning Framework for Efficient Acquisition and Detection of Unknown Malware

Thesis submitted in partial fulfillment of the requirements for the degree of

By

Nir Nissim

Submitted to the Senate of Ben-Gurion University of the Negev

31.12.2015

Beer-Sheva


An Active Learning Framework for Efficient Acquisition and Detection of Unknown Malware

Thesis submitted in partial fulfillment of the requirements for the degree of

“DOCTOR OF PHILOSOPHY”

By

Nir Nissim

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor: ______

Approved by the Dean of the Kreitman School of Advanced Graduate Studies: ______

31.12.2015

Beer-Sheva


This work was carried out under the supervision of Prof. Yuval Elovici at the

Department of Information Systems Engineering

Faculty of Engineering Sciences,

Ben-Gurion University of the Negev.


Research-Student's Affidavit when Submitting the Doctoral Thesis for Judgment

I, Mr. Nir Nissim, whose signature appears below, hereby declare that:

 I have written this Thesis by myself, except for the help and guidance offered by my Thesis Advisors.

 The scientific materials included in this Thesis are products of my own research, culled from the period during which I was a research student.

Date: 31.12.2015 Student’s Name: Nir Nissim Signature: ______


Acknowledgements

First and foremost, I want to thank God for providing me with the capabilities, wisdom, and blessing of success during these important years of research, and for surrounding me with an outstanding group of colleagues and researchers who were helpful in this research.

I would also like to thank my advisor, Prof. Yuval Elovici, for his support, guidance, and the opportunities provided to me, all of which have made these years of research extremely productive and challenging.

Thanks also to the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space, which partially supported my research.

I also wish to thank Clint Feher, Oren Barad, and Aviad Cohen, who assisted in the collection and creation of the datasets, and Yuval Fledel for his valuable advice regarding the efficient implementation aspects of my research. I would also like to thank Prof. Yuval Shahar for the meaningful discussions we shared and for his expertise and support in expanding this research in additional directions in the biomedical domain.

Special thanks both to Dr. Robert Moskovitch and Prof. Lior Rokach for their assistance and helpful advice during the course of my research.

Thanks also to Ms. Yehudith Naftalovitch, the administrative and operational manager of our Cyber Security Research Center, who assisted with many administrative matters during these years of research, providing valuable support that allowed me to better focus on the research itself.

I would also like to thank Ms. Robin Levy-Stevenson for her devoted assistance, providing much-appreciated English editing and proofreading during my Ph.D. studies, which helped make my publications clearer and more comprehensible.

And last but not least, thanks to my dear parents and my special grandparents who supported me in every way they possibly could, ensuring that I would always have the passion, and everything else I would need, to succeed.


Abstract

The sheer volume of new malware created every day poses a significant challenge to existing detection solutions. This malware is aimed at compromising nearly every kind of widely used digital device, threatening individuals as well as organizations. Popular types of malware take different forms, including computer worms, malicious PC executables, malicious documents (non-executables), and malicious applications aimed at mobile devices. Widely used antivirus software, which is based on manually crafted signatures, is only capable of identifying known malware and relatively similar variants. To identify new and unknown malware and keep their signature repositories up to date, antivirus vendors must collect new suspicious files on a daily basis for manual analysis by information security experts who label the files as malicious or benign. Analyzing suspected files is a time-consuming task, and it is impossible to manually analyze all questionable files. Consequently, antivirus vendors use detection models based on machine learning (ML) algorithms and heuristics in order to reduce the number of suspected files that must be inspected manually. In addition to antivirus software, recent detection solutions have also used machine learning algorithms independently in order to provide better detection of new malware, an area in which antivirus software is limited.

In light of the mass creation of new files daily, both antivirus and machine learning based detection solutions lack an essential element: they cannot be frequently and efficiently updated with newly created malware, a situation that creates a dangerous time gap between the creation and proliferation of malware and its detection and discovery. This time gap allows new malware to attack many targets before it is identified and thwarted. Therefore, both antivirus and machine learning based solutions must be frequently updated: the antivirus software must be updated with new malware signatures, and machine learning based solutions require new informative files, both malicious and benign.

In this research we introduce a solution for this updatability gap. We present a novel, generic, and efficient active learning (AL) framework and new AL methods that may assist antivirus vendors and machine learning based solutions, allowing them to focus their analytical efforts by acquiring only a small set of new files that are either most likely malicious or informative benign files, a process that enables efficient and frequent enhancement of the knowledge stores of both the detection model and the antivirus software. In addition to intelligent selection of the most contributive files, our framework is also designed to work at a finer level of granularity, in which it can efficiently select only a small number of instances related to the behavior of a specific analyzed file. By doing this, our framework can filter out the misleading and noisy behavioral instances that are common among sophisticated and elusive malware and thus improve detection capabilities. Our framework also integrates tailored feature extraction methods for each of the above-mentioned types of malware, and these feature extraction methods provide an accurate basis for enhancing the detection capabilities leveraged by our AL methods.


The main contributions of the study are summarized as follows. First, the experimental results showed that our framework can improve the detection capabilities of antivirus software and machine learning based solutions by frequently and efficiently enhancing the knowledge stores of the detection model and the antivirus software; in our experiments it outperformed the existing solutions and methods. Second, under the predefined daily acquisition budget used in our experiments, the existing AL method showed a decrease in the number of new malware acquired daily, while our AL methods showed a daily increase and acquired more new malware each day than every other solution. Third, our framework conducts the above-mentioned update using only a small set of the most informative files (malicious and benign), leading to a significant reduction in the labeling efforts associated with manual analysis of the files by security experts. Fourth, our framework was also found to be efficient in the retrospective acquisition of malware from the large stores of files usually found in organizations. Fifth, our framework is able to efficiently improve detection capabilities by enhancing its robustness through filtering out misleading malware instances and behavior. Lastly, as a proof of concept for the generality of our AL based framework, we recently extended the framework's capabilities so that it provides solutions in additional domains. We adapted it to the biomedical informatics domain, in which we successfully enhanced the capabilities of a classification model used for condition severity classification while significantly reducing labeling efforts, which can result in substantial savings in both the time and costs associated with medical experts.

Keywords: Malware, Malicious, Computer Worm, Executable, Android, Document, PDF, Machine Learning, Active Learning, Detection, Acquisition, Antivirus.


Table of Contents

1. Introduction

1.1. Background and Related Work

1.1.1. Malicious Executables and Computer Worms

1.1.2. Malicious Documents

1.1.3. Malicious Android Applications

1.2. The Problem Statement and Proposed Approach

1.3. Deployment of our Framework

2. Overview of the Core Papers in the Research

2.1. Research Results

2.1.1. Core Papers

3. Summary and Conclusions

4. Future Directions

5. References

6. Appendix

6.1. Additional Accepted Papers in the Malware Detection Domain

6.2. Additional Accepted Papers in the Biomedical Informatics Domain


1. Introduction

1.1. Background and Related Work

In recent years, the Internet has become an integral part of our lives, particularly with the increased availability of high-speed Internet connections, cloud computing, and the proliferation of mobile devices, which have rapidly become indispensable to individuals around the world, handling many of our daily needs and interests such as communication, health, news, banking, shopping, mail, and entertainment. Increasingly large numbers of files are created and transferred over the Internet, including a growing percentage of malware that compromises a growing list of targets through a variety of attack methods. Although the creation of malware nowadays requires much less expertise than in the past [75], attacks launched by today's malware have become more sophisticated, harder to detect, and more dangerous [72]. These facts have shaped an insecure reality in which, according to a Kaspersky report presented in 2013 [76], at least 315,000 new malware samples are created every day and spread widely over the Internet with ease; since that time, this number has been increasing exponentially each year.

There are several levels of defense against malware attacks, and each level consists of different types of specialized techniques and tools. The lowest level of defense is at the level of the host computer and includes the user's computer itself and an organization's application servers. The techniques most often used at this level are host-based intrusion detection and prevention systems (IDSs and IPSs) that are installed on the host computer and can protect it from malware that has reached the host. Signature-based antivirus software is an integral tool implemented at the host level; such widely used tools detect known malware and its variants using signature matching, relatively effectively for most organizations and individuals. Each time a new malware is found, antivirus vendors create a new signature and update their signature repository, as well as their clients. It takes time to detect malicious code and update clients, and such actions are definitely not immediate. Speed is essential: during the period of time between the appearance of a new unknown malware, its subsequent detection by the antivirus vendor, and the delivery of the new signature to the client's database, many computers might be infected. Although more than a decade has passed since their first appearance, computer worms remain the most well-known examples of malware that maliciously take advantage of the time available to them prior to detection and neutralization. "Slammer," the fastest computer worm in history, infected more than 75,000 hosts (representing 90% of the vulnerable hosts) within ten minutes [70], while "Code Red," the most harmful and famous worm, infected 359,000 hosts within 14 hours [71]. Each of these computer worm attacks caused significant disruption to financial, transportation, and government institutions.

In order to accurately and quickly detect the newest malware, antivirus companies devote considerable effort, both in terms of time and resources, to maintaining an up-to-date signature repository of malicious code files. These efforts include monitoring new and unknown malicious code files sent over the Internet and the use of various types of honeypots to catch malicious files [77] [78]. This mission is complicated and time-consuming, particularly because these efforts rely heavily on manual inspection of suspected files [77].

This challenging situation has motivated researchers to develop more comprehensive and efficient solutions for the agile detection of new unknown malware. Studies conducted over the last 15 years have shown that machine learning methods and algorithms (traditionally used for challenging classification and prediction tasks) can be used for the detection of unknown malware. New detection tools based on machine learning algorithms have been developed, and antivirus vendors have started to incorporate machine learning based detection models and heuristics into their processes in order to enhance their detection capabilities. Prior studies have primarily focused on two approaches: dynamic and static analysis. In each case, during both the training and detection phases, malicious and benign files are analyzed and subsequently represented by a vector of features, extracted either statically from the content and structure of the file or dynamically according to its behavior, which can be monitored by measuring elements within the system in which the malware is executed. These files are used during the training phase to induce a classifier that acts as the detection model. Based on the generalization capabilities of the detection model, an unknown file (one that did not appear in the training set and is not detected by the antivirus tool) is classified as malicious or benign during the detection phase.
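To make the training and detection phases described above concrete, the following minimal sketch (not the thesis implementation; the feature vectors, their values, and the library choice are illustrative assumptions) induces an SVM-based detection model from labeled feature vectors using scikit-learn and then classifies a previously unseen file.

```python
from sklearn.svm import SVC

# X_train: one feature vector per file (features extracted statically from the file's
# content and structure, or dynamically from its monitored behavior).
# y_train: labels provided by security experts (0 = benign, 1 = malicious).
X_train = [[0.0, 3.0, 1.0], [5.0, 0.0, 2.0], [0.1, 2.7, 1.2], [4.8, 0.2, 1.9]]
y_train = [0, 1, 0, 1]

clf = SVC(kernel="rbf")      # the induced detection model
clf.fit(X_train, y_train)    # training phase

# Detection phase: an unknown file, represented by the same features, is classified
# as malicious or benign based on the model's generalization capabilities.
unknown_file = [[4.5, 0.3, 2.1]]
print(clf.predict(unknown_file))  # expected: [1], i.e., malicious
```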

Static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable: the analyzed file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis "traps" to detect analysis, these traps can actually be used as a contributing feature for the detection of malware [90]. In addition, since static analysis is relatively efficient and fast, it can be performed both without causing bottlenecks and within an acceptable timeframe. Static analysis is also easy to implement, monitor, and measure. Moreover, static analysis scrutinizes the file's "genes" as opposed to its current behavior, which can be changed or delayed to an unexpected time in order to evade detection by dynamic analysis. An additional advantage is that static analysis can be used for a scalable pre-check of malware before deeper, more time-consuming analysis is conducted.

On the other hand, static analysis can be evaded by code obfuscation and is also limited in its ability to analyze encrypted files. Whenever machine learning methods based on static analysis are used for the detection of unknown malicious applications, a question arises regarding the ability of the suggested framework to detect obfuscated malware. In contrast, dynamic analysis (also known as behavioral analysis), which is aimed at tracing the behavior of the file and its effect on the environment in which it is executed, is not affected by code obfuscation. This type of analysis and its versatile methods for detecting unknown malware based on its behavior have been thoroughly explored over the past several years. These dynamic analysis based methods are aimed at detecting malicious activity and content that cannot be discovered using static analysis, for example code obfuscation, encrypted files, and dynamic loading of malicious code during run time.

Machine learning solutions based on static and dynamic analysis have been successfully applied for the detection of common types of malware including: malicious executables and computer worms, malicious documents (non-executables), and malicious applications aimed at mobile devices. Each type has its own characteristics and unique properties, and our research is aimed at providing comprehensive long-term detection solutions to the challenges posed by the various malware types. Thus, we present a brief introduction to the types of malware that have become more popular and attractive during the period in which this research was conducted and mention the machine learning approaches and developments associated with each.

1.1.1 Malicious Executables and Computer Worms

Malicious executables, especially those aimed at the Windows operating system, the most commonly attacked system, include malware families such as computer worms, computer viruses, Trojan horses, spyware, and adware. Computer worms are a widespread form of malicious executable that proactively propagates across networks while exploiting vulnerabilities in operating systems, protocols, devices, and installed programs. In contrast, other malicious executables such as viruses, Trojan horses, spyware, and adware usually operate and attack within a host while also infecting the host's files. Ransomware, which extorts its victims, represents an emerging trend of malicious executables belonging to the Trojan malware family. Once ransomware reaches a host, it encrypts the host's files using a strong encryption algorithm and a unique key and prevents the host's owner from accessing and using his/her own files until the owner pays the requested ransom to the attacker. Ransomware is financially driven and therefore aimed at attacking large organizations that rely heavily upon the availability of significant and valuable files that form a critical part of their daily work and business. In this situation the attacked organization remains helpless and is forced to comply with the attacker's demands and pay the ransom, or else lose its data. A well-known example of ransomware is CryptoLocker [91], which was able to extort approximately three million dollars before it was taken down by authorities.

Regardless of the malware's mode of operation, today's antivirus tools do not offer an adequate solution for the detection of new unknown malware, i.e., malware that does not share signatures with the known malware comprising the antivirus signature repository. Over the past 15 years, many studies have investigated the possibility of enhancing the detection of unknown malicious executables using machine learning algorithms based on either static [49-57] or dynamic analysis [58-69].

The detection of elusive computer worms transmitted over computer networks has also been intensively researched over the past decade. Typically, worms operate autonomously, spreading quickly and attacking as many targets as possible, causing considerable harm, as was demonstrated by the "Slammer" [70] and "Code Red" [71] worms. Stuxnet [72], a more elusive malware (and probably the most sophisticated ever created), is an example of a new attack trend, the advanced persistent threat (APT). Stuxnet was a directed cyberwarfare attack against the Iranian nuclear program; it spread within the attacked systems and targeted the controllers of the nuclear SCADA systems in order to physically destroy its military target.

Worms are elusive malware that try to hide their malicious activity by spending most of their time in a dormant state or by otherwise acting benignly. Machine learning based solutions have been proposed to complement and enhance antivirus software in order to detect new computer worms based on behavioral classification of the host [73] [74]. However, the key to detecting new computer worms using machine learning algorithms is the ability to filter out their misleading behavior from the data provided to the machine learning algorithms.


In addition, some worms act in a misleading way, behaving as a legitimate application part of the time and consequently generating misleading instances. Worms are not always active, and even when active they do not always behave in an illegitimate way. Because they sometimes act like non-worm instances, their detection is much more difficult; furthermore, monitoring their behavior can be misleading, and when this is done, misleading instances become part of the dataset.

In most domains, misleading instances are not created intentionally but exist naturally. In our case, however, the misleading data posed a greater problem. In the security domain, worms are created in a sophisticated way so that they behave similarly to a legitimate application, in order to make their detection harder. Thus, monitoring worm behavior using dynamic analysis creates many instances that are very similar to non-worm instances and are therefore considered misleading instances that confuse the induced classifier. This phenomenon is called "malicious noise," as presented in [92]. Misleading instances usually create confusion in the classification process and cause degradation in the classifier's performance.

With regard to worm detection, the task is more complicated, since the misleading data is inherent to the class, and its presence is even greater in the class we want to detect. In this case, we used the AL method's premise of selecting the most informative instances among the existing instances, so that the misleading instances would not be selected, as was done previously [93] and discussed in other work [94].

1.1.2 Malicious Documents

Cyber-attacks aimed at organizations have increased since 2009, with 91% of all organizations hit by cyber-attacks in 2013.1 Attacks aimed at organizations usually include harmful activities such as stealing confidential information, spying on and monitoring an organization's activity, and disrupting an organization's actions. The vast majority of organizations rely heavily on email for internal and external communication. Thus, email has become a very attractive platform from which to initiate cyber-attacks against organizations.

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber attacks-in-2013/

According to Trend Micro,2 APT attacks, particularly those against government agencies and large corporations, are largely dependent upon spear-phishing3 emails. As malicious executables have been widely used to launch attacks, current defensive solutions and organizational policies often prevent executables from entering organizational networks via email4 [88]. Therefore, recent APT attacks tend to attach documents, which are non-executable files (PDF, MS Office files, Flash files, etc.) that, unlike executables, are not independent files and require specific software in order to be opened (e.g., Adobe Reader, Microsoft Office, etc.). These types of documents are widely used in organizations and are often mistakenly considered less suspicious or dangerous than executables.

Furthermore, because email communication is an integral part of daily business operations, APT attackers frequently leverage email as an attack vector for initial penetration of the targeted organization. Attackers usually use social engineering in order to make the recipient open the malicious email, click a link, or open an attachment containing such a document. F-Secure's 2008-2009 report5 indicates that the most popular file types for targeted attacks in 2008-2009 were PDF and Microsoft Office files. Since that time, as was reported in 2010-2011, the number of attacks on Adobe Reader has grown.6 A report presented in 2015 by Symantec [88] revealed that Microsoft Office documents have become the most frequently used email attachments for spear-phishing attacks and were used in 39% of such attacks during 2014.

To date, antivirus packages are not sufficiently effective at intercepting malicious documents, even in the case of highly prominent PDF threats (Tzermias et al. [38]). However, according to studies such as [38-47], machine learning methods can be effective in distinguishing between malicious and benign PDF files and discovering new malicious PDF documents. Several deterministic solutions have been presented for enhanced detection of new malicious Microsoft Office files, such as BISSAM [48], OfficeMalScanner7, OfficeCat8, Microsoft OffVis9, pyOLEScanner.py10, and Threat Emulation11, and a new machine learning based methodology called SFEM [8] has been presented as well.

2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://searchsecurity.techtarget.com/definition/spear-phishing
4 https://www.paloaltonetworks.com/content/dam/paloaltonetworks-com/en_US/assets/pdf/datasheets/threats/threat-prevention.pdf
5 http://www.f-secure.com/weblog/archives/00001676.html
6 http://www.computerworld.com/article/2517774/security0/pdf-exploits-explode--continue-climb-in-2010.html
7 http://www.reconstructer.org/code.html
8 http://www.aldeid.com/wiki/Officecat
9 http://www.microsoft.com/en-us/download/details.aspx?id=2096
10 http://www.aldeid.com/wiki/PyOLEScanner

1.1.3 Malicious Android Applications

While the detection of malware aimed at PCs (elusive worms, executables, and malicious documents) using ML methods has been intensively researched for nearly two decades, [12] and [13] were the first to discuss malware for smartphones, in 2004. Since then there has been a significant increase in the use of smartphones, dramatically increasing the possibility of cyber-attacks [14] [15]. Likewise, the recent growth of the Android market has been accompanied by increased threats to Android security over the past few years [16-18]; "Secure-List" [19] reported that 9,000 such malware were created during 2012, a figure that indicates that the dominance of the Android operating system likely led to the massive creation of new types of Android malware.

The smartphone domain is an area in which the need for antivirus enhancement is even greater. In contrast to PCs, which rely on advanced detection tools (e.g., sandboxes, ML based solutions, anti-exploitation solutions, etc.) in addition to their basic reliance on antivirus software, smartphones are heavily dependent on antivirus solutions because of the inability to apply advanced detection methods (machine learning solutions based on static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate effective detection of new malware and efficiently and nimbly updated antivirus tools. Antivirus solutions are lightweight and thus more appropriate for smartphones.

In addition, Android antivirus vendors must deal with large quantities of new applications on a daily basis in order to identify and update the antivirus signature repository with new unknown malware instances. The majority of these applications can be collected from application markets, and others can be collected by installing agents on smartphones that upload applications to a central server for analysis. Antivirus vendors must filter out known malware (and its variants), as well as known legitimate applications, utilizing white lists based on the reputation and certificates of applications [20]. Despite this filtering process, a large number of new unknown applications, both benign and malicious, remain.

11 https://www.checkpoint.com/products/threatcloud-emulation-service/

Antivirus vendors use complementary solutions that focus on the applications most likely to be malicious in order to further reduce the number of applications that must be handled manually. Among the complementary solutions that have been proposed for efficiently discovering new Android malware are heuristic engines based on a scoring algorithm [21] and many different detection models based on machine learning techniques [22-37].


1.2. The Problem Statement and Proposed Approach

Complementary solutions, including machine learning based solutions, targeted at the detection of various malware types (computer worms and malicious executables, malicious documents, and malicious Android applications) have enhanced detection and have also demonstrated the ability to detect new unknown malware, an ability not shared by antivirus software. This stems from the generalization capabilities inherent in induced machine learning models. However, to date, such complementary solutions have one significant drawback that renders them inefficient in the long run: in each case, the knowledge store is not frequently and actively updated. This is particularly problematic in light of the mass creation of new malware.

A natural concept drift process [80] [81] exists, specifically in the malware domain [81], as benign files and newly created malware contain new properties and features that have not been seen by the detection model, as well as existing features with very different values than those on which the detection model was trained. These new features may result from different programming languages, compilers, platforms, operating systems, devices, etc. In addition, the malware domain is very dynamic, since attackers continually seek out new ways of attacking, new vulnerabilities that can be exploited, and new targets. These changing parameters eventually affect the static features and the behavior of the analyzed malware and thus significantly reduce the detection capabilities of induced detection models that are not updated and become outdated. None of the existing machine learning based solutions address the crucial need to frequently and efficiently update the detection model and antivirus software.

In this research [1-8], we concentrated on the updatability process and the enhancement of the detection capabilities of the detection model, striving to improve efficiency and speed in these areas. A well enhanced and updated detection model will have a better ability to detect future malware and thus will update the antivirus software more rapidly. It is therefore essential to 'sustain' the classifier constantly and frequently with new files (malicious and benign) in order to maintain detection accuracy over time. However, when a file is classified, the classifier cannot indicate whether it should be acquired as a new and informative sample for the training set. Additionally, in order to add the file to the training set, a labeling operation, usually by a human expert, is required. The labeling process is a very time-consuming task, because each unknown file (suspected of being malicious) has to be analyzed and inspected by an expert; the expert will likely have to perform static and/or dynamic analysis and inspect the file's behavior using a sandbox [13,14] or other behavioral based tools in order to determine its nature. Because there are many files (malicious and especially benign) to inspect, it is not feasible to send them all to a human expert for labeling. All of these difficulties affect the updatability (of both the detection model and the antivirus), which is directly related to one of the most challenging tasks in the domain: the agile detection of new unknown malware.

One of the keys to solving this challenging task is finding an automatic and efficient way to identify the most informative of the many new files, in an effort to minimize (to the greatest extent possible) the number of files sent to the human expert for labeling. Only the most informative labeled files will provide the knowledge required for the updatability of the detection model and ensure its ability to detect previously unknown malicious code. Our research aims to develop a framework that combines practical and efficient solutions for the agile detection of new unknown malware.

In order to meet this challenge we have divided the suggested framework into two main modules. The first is the Detection Module, which integrates the best methods from several relevant domains, such as feature extraction and representation, feature selection, text categorization, information retrieval, and classification algorithms. This module concentrates on collecting and representing the files in the dataset in a way that provides maximal knowledge to the classification algorithm for inducing the optimal detection model. The second is the Updatability Module, in which we propose the active learning (AL) approach to reduce the number of labeled training examples while maintaining high classification accuracy. By integrating AL, the classifier actively indicates the specific new files that should be labeled, i.e., the most informative, the addition of which to the training set will provide the maximal improvement in the detection model and consequently will also update the widely used antivirus software. The Updatability Module is also designed to work at a finer level of granularity, in which it can efficiently select only a small number of instances related to the behavior of a specific analyzed file. By doing this, the misleading and noisy instances of malware's behavior can be filtered out, thus improving detection capabilities; this level of granularity is especially tailored for enhancing the detection of elusive malware such as computer worms.
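The following minimal sketch, using illustrative names only, outlines one acquisition cycle of the Updatability Module described above: the current detection model is induced from the labeled files, an AL criterion selects a small budget of informative unknown files, a human expert labels them, confirmed malware is passed on for signature generation, and the detection model is retrained. The helpers select, expert_label, and update_antivirus_signatures are placeholders standing in for framework components, not actual implementations.

```python
from sklearn.svm import SVC

def update_antivirus_signatures(file_vector):
    """Placeholder: hand a newly confirmed malicious file over for signature creation."""
    pass

def daily_update(labeled_X, labeled_y, unknown_files, select, budget, expert_label):
    """One acquisition cycle (illustrative). `select` is an AL criterion that returns the
    indices of the unknown files to acquire; `expert_label` stands in for manual analysis."""
    clf = SVC(kernel="rbf").fit(labeled_X, labeled_y)   # current detection model
    for idx in select(clf, unknown_files, budget):      # AL: pick the informative files
        label = expert_label(unknown_files[idx])        # costly manual inspection
        labeled_X.append(unknown_files[idx])
        labeled_y.append(label)
        if label == 1:                                  # confirmed malware
            update_antivirus_signatures(unknown_files[idx])
    return SVC(kernel="rbf").fit(labeled_X, labeled_y)  # enhanced detection model
```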

The framework can be instantiated using different AL methods. Two of the methods used are well-known AL methods that have been previously proposed: SVM Simple Margin (Exploration) [95] and Error-Reduction [96]; these were used as a baseline for comparison in the various experiments documented in our research and published papers. Both methods select examples for which the classifier is less confident regarding the true label; in the SVM case, these are the examples that lie closest to the SVM separating hyperplane. Our two new methods are Exploitation and Combination. In contrast to Exploration, Exploitation chooses examples located deep inside the malicious side and farthest from the SVM's separating hyperplane. Our Combination AL method is a two-phase method that combines the principles of Exploration and Exploitation; in the early phase it conducts more Exploration, while Exploitation becomes the dominant strategy in the later phase of the acquisition process.
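The sketch below expresses these selection criteria in terms of the signed distance to the SVM separating hyperplane, assuming a fitted scikit-learn SVC whose positive decision values correspond to the malicious class; the switch point used for Combination is an illustrative assumption rather than the schedule used in the thesis.

```python
import numpy as np

def select_exploration(clf, X_unknown, budget):
    """SVM-Simple-Margin style: the files closest to the separating hyperplane
    (those for which the classifier is least confident)."""
    dist = np.abs(clf.decision_function(X_unknown))
    return np.argsort(dist)[:budget]

def select_exploitation(clf, X_unknown, budget):
    """Files located deep inside the malicious side, i.e., with the largest signed
    distance toward the (assumed positive) malicious class."""
    signed = clf.decision_function(X_unknown)
    return np.argsort(-signed)[:budget]

def select_combination(clf, X_unknown, budget, day, switch_day=5):
    """Two-phase strategy: mostly Exploration in the early phase of the acquisition
    process, Exploitation afterwards (the switch point here is illustrative)."""
    if day < switch_day:
        return select_exploration(clf, X_unknown, budget)
    return select_exploitation(clf, X_unknown, budget)
```

Any of these functions can serve as the selection criterion in the acquisition cycle sketched earlier in this section.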

The framework was thoroughly tested on several different types of malware: computer worms, executables, malicious documents (PDF and docx MS Office files), and malicious Android applications. For each of these applications, sophisticated and specifically tailored feature extraction and dataset creation methodologies were proposed and implemented. We used SVM as the base classifier, and the experiments were carried out using the various SVM kernels. A solid and comprehensive evaluation methodology was used in order to test the framework, both in terms of classification performance (accuracy, TPR, FPR, and AUC) and the number and percentage of malware acquired daily (NOMA/POMA), which are important measures given that the purpose of the framework is to update the antivirus signature repository with new malware on a frequent basis.
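As a brief illustration, the following sketch computes these measures for a single day of such an experiment; the label and score arrays are toy values, and NOMA/POMA are computed here simply as the count and fraction of the acquired files that the expert confirmed as malware.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # expert labels of the test files
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # detection model's predictions
scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # classifier scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                  # true positive rate
fpr = fp / (fp + tn)                  # false positive rate
accuracy = (tp + tn) / len(y_true)
auc = roc_auc_score(y_true, scores)

# NOMA / POMA for one day's acquisition: the expert's verdicts on the selected files.
acquired_labels = np.array([1, 1, 0, 1])
noma = int(acquired_labels.sum())     # number of malware acquired
poma = noma / len(acquired_labels)    # percentage (fraction) of malware acquired
print(tpr, fpr, accuracy, auc, noma, poma)
```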


1.3. Deployment of our Framework

Another key to addressing the challenging task of agile detection of new unknown malware is the efficient and sophisticated deployment of the detection method over strategic nodes in the Internet network. In order to meet this challenge, we strive to expose our framework to as many new files transferred over the Internet as possible, so that most of the new informative files will be acquired; therefore, deployment is defined in such a way as to achieve the largest coverage while minimizing costs by involving as few units as possible. Thus, once a new unknown malware is created and transferred over the network, it will be monitored by the sophisticated deployment of the framework, like a "fly caught in tangled spider webs." The combination of these two components (effective deployment and identification of the most informative new files) will contribute to cleaner network traffic for the hosts.

A comprehensive study by Puzis et al. [82-85] that deals with the efficient deployment of IDSs provides significant insight into this challenge. The study used the "betweenness" centrality algorithm, which is a good heuristic for traffic load, and found that most network traffic can be monitored by listening to only a few strategic nodes over the Internet. In our opinion, in order to achieve the maximal efficiency of our framework, it should be located and deployed at different levels of the network: the higher level should include strategic NSP routers' links in order to prevent propagation of the malicious code and reduce the extent of the damage, as was suggested by Puzis, while the lower level should act as a host-based IDS that protects workstations when the higher levels have not been exposed to the malicious code.
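As a small illustration of this deployment heuristic, the sketch below ranks the nodes of a toy topology by betweenness centrality and keeps only the top-k nodes as monitoring points; the graph, the node names, and the budget k are invented for the example and do not reflect any real NSP topology.

```python
import networkx as nx

# Toy backbone topology: routers/gateways as nodes, links as edges (illustrative only).
g = nx.Graph([("r1", "r2"), ("r2", "r3"), ("r2", "r4"), ("r4", "r5"),
              ("r4", "r6"), ("r3", "r6"), ("r6", "r7")])

centrality = nx.betweenness_centrality(g)      # heuristic for traffic load
k = 2                                          # monitoring budget (strategic nodes)
strategic_nodes = sorted(centrality, key=centrality.get, reverse=True)[:k]
print(strategic_nodes)  # nodes whose links would host the framework's sensors
```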

The deployment will consider routers, gateways to organizations, and also the markets of mobile applications (official and non-official) as part of the strategic nodes in the Internet network. The integration between these strategic nodes and several levels of the framework will strengthen the detection and updatability capabilities.


2. Overview of the Core Papers in the Research

In the course of this research we have published eight papers supporting the efficiency and contributions of our active learning based framework for the malware detection domain (presented in Figure 1). Our research focused on the development of advanced detection frameworks and methods for frequent and enhanced detection of four of the most popular types of malware: computer worms, malicious executables, malicious documents, and malicious Android applications.

The main contributions of the eight publications that comprise this thesis include: improving the detection capabilities of machine learning based solutions by frequently and efficiently enhancing the knowledge stores of the detection model and antivirus tools, reducing the number of files that must be acquired to keep the model updated, and improving the detection model's capabilities by enhancing its robustness through filtering out misleading malware instances.

In order to efficiently update the detection model, the new framework employs active learning techniques that enable the experts to label instances that may better contribute to a more accurate model. A new active learning technique is proposed, which is used to detect various types of malware, including worms, executables, and Android applications. In addition, the new framework is used for detecting malicious documents (PDF and MS Word) that may be used by attackers to inject malware into victims’ computers.

As a proof of concept for the generality of our AL based framework, we have also extended the framework's capabilities so that it will provide solutions in additional domains. We have adapted it to the biomedical informatics domain, in which we successfully enhanced the capabilities of a classification model that is used for condition severity classification, while significantly reducing labeling efforts that can result in a substantial savings, both in the time and financial costs associated with medical experts.

Among the eight published papers [1-8], four [1-4] were published in top peer reviewed journals and form the core of this research. The remaining four papers [5-8] were accepted to additional journals, ranked conferences, and workshops within top tier conferences. In this section, we provide a brief introduction to each of the four core journal papers; the complete papers will be presented in the next section. The other four papers are included in the appendix in the subsection entitled, "Additional Accepted Papers in the Malware Detection Domain."

As can be seen in Figure 1, the topics researched span two domains and several sub-domains. While my primary expertise lies in the malware detection domain, the involvement and expertise of my co-authors enabled me to widen the scope of my research and delve into an entirely new domain, biomedical informatics. For example, the application of the framework to this domain required knowledge of the new field and access to an additional dataset, an area in which the role of the co-authors was invaluable. It is important to note that as this research constitutes my Ph.D. research, I was responsible for all aspects of the research and the experiments that comprise it.

Our four core papers are based on an evolving program of research that was guided by our attentiveness and awareness of upcoming trends in the malware detection domain. Broadly our research progressed as follows. We started with a behavioral active learning based framework [1] for the enhanced detection of elusive computer worms, on the heels of the discovery of the sophisticated and elusive “Stuxnet” malware in the SCADA systems of Iran’s nuclear facilities. After demonstrating improved and more efficient detection of unknown computer worms in our first study, we identified a major gap in the area of detection solutions aimed at another popular type of malware: malicious executables; in this case, a weakness was found in the updatability (or lack thereof) of existing detection solutions. Therefore we enhanced our AL based framework and extended it with the addition of two novel and efficient active learning methods. Based on these changes, the framework provides a solution for frequent and efficient updatability [2] of both the detection model and antivirus software, which is particularly needed in light of the daily creation of new malicious executables.

During that time, a new trend was emerging, and mobile devices (especially Android OS based smartphones) increasingly became attractive targets for Android malware. The amount of Android malware has increased at a significant rate; many unofficial application markets were contaminated with malicious versions of known Android applications, and the contamination also found its way into the official market of Android applications, even affecting Google Play. In 2012, Google presented Bouncer, which comprises machine learning algorithms based on dynamic and basic static analysis of applications uploaded to the market. However, according to [86], it was announced at SummerCon 2012 that more than 20 ways of evading Bouncer had been discovered [87]. The insufficient detection of Android malware, the reliance of Android smartphones on antivirus solutions, and the updatability gap that we also identified in the detection solutions for Android malware led us to enhance the capabilities of our active learning framework and create ALDROID [3]. This new framework outperformed the existing solutions and provides a solution for the enhanced detection of Android malware in the long run.

After providing solutions and enhanced detection of computer worms, malicious executables, and Android malware, our research continued, this time aimed at another emerging malware trend, as APT attacks increased in popularity and became better funded, more sophisticated, and well planned. Organizations increasingly prevent the entrance of executables into their internal networks because of their high risk; thus a new trend was created: instead of sending executables, attackers create malicious documents and attach them to email messages sent to organizations. In this way attackers attempt to penetrate organizations' defenses and perform malicious activities, utilizing social engineering techniques to cause innocent employees to open malicious documents (such as PDF and MS Office files). We applied our expertise and newfound insights toward the goal of enhancing the detection of malicious documents [4], enhancing our AL based framework for use with malicious documents, and created the ALPD [5,6] and ALDOCX [7,8] frameworks, which integrate our newly developed feature extraction methodologies for the efficient detection of malicious documents.

Figure 1 presents the domains and sub-domains to which our framework has been applied during the current Ph.D. research, as well as our related papers, including their reference numbers and ranking details, clustered into journals (green nodes), conferences (orange nodes), and workshops (yellow nodes). The main domains in which our framework was successfully applied appear in the red nodes. Figure 1 is divided into two main sub-diagrams: the upper red node, which represents the core of this study (malware detection within the cyber security domain), and the lower red node, which presents the additional domain (condition severity classification within the biomedical informatics domain). The upper diagram is also divided into four blue nodes that represent sub-domains of malware detection: computer worms, malicious executables, malicious Android applications, and malicious documents. The rightmost sub-domain within malware detection, malicious documents, is sub-divided into two blue nodes, MS Office and PDF files, which are the most popular document types through which cyber-attacks are launched.


[Figure 1 appears here: a diagram of the active learning based framework and the domains in which it was applied. The upper part covers malware detection in the cyber security domain, with sub-domains for computer worms [1], malicious executables [2], malicious Android applications [3], and malicious documents (PDF [4-6] and MS Office docx [7,8]); the lower part covers condition severity classification in the biomedical informatics domain [9-11]. The papers are clustered into journals, conferences, and workshops.]

Figure 1: The domains and sub-domains in which our framework has been used and our related papers (published and in press) during the current Ph.D. research, including their ranking details, and clustering into journals, conferences, and workshops.


We now provide a brief overview of each of our core papers. Our first core paper [1] is entitled, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning." In this paper we aimed to enhance the detection of computer worms by dynamically monitoring and analyzing their behavior at high frequency rates. This research showed that our framework and AL method can efficiently select just a small number of instances related to an analyzed worm's behavior and filter out the misleading and noisy behavioral instances that are common among elusive computer worms, thereby improving the detection of unknown computer worms. Our behavioral analysis of these worms was based on computer measurements extracted from the operating system. We designed a series of experiments to test the new technique by employing several computer configurations and background application activities. In the course of the experiments, 323 computer features were monitored. In addition, we used active learning as a selective sampling method to increase the performance of the detection model, which was improved by between 19% and 25%, thus also improving its robustness in the presence of misleading instances of computer worms.

Our second core paper [2] is entitled, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS." In this paper we introduced a solution addressing the main problem associated with the agile detection of new unknown malware: the updatability gap of both antivirus software and the detection model in the domain of malicious executables. We presented an active learning framework and introduced two novel AL methods that assist antivirus vendors, helping them better focus their analytical efforts by acquiring the files that are most likely malicious. The new AL methods were designed and oriented toward new malware acquisition. Our AL methods outperformed the existing AL method in two respects related to the number of new malware samples acquired daily, which was the core measure in this study. First, on the ninth day of the experiment our best performing AL method, termed Exploitation, acquired approximately 2.6 times more malware than the existing AL method and 7.8 times more than the random selection method. Second, while the existing AL method showed a decrease in the number of new malware samples acquired over ten days, our AL methods showed an increase and daily improvement in the number of new malware samples acquired. Both results point toward increased efficiency that can potentially assist antivirus vendors.

Our third core paper [3] is entitled, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods." In this paper our efforts were directed at the smartphone domain, an area in which the need for antivirus enhancement is even greater than in the PC domain. In contrast to PCs, smartphones are heavily dependent on antivirus solutions because of the inability to apply advanced detection methods (static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate the effective detection of new malware, as well as the efficient and nimble update of antivirus tools. It is not feasible to analyze every new application, so our ALDROID framework selects only the most probable Android malware for labeling. While our framework reduces the number of unknown applications that must be manually analyzed, it strengthens the detection model at the same time by also selecting informative benign applications. Thus the framework addresses the resource limitations of the smartphone, as well as the challenge presented by the sheer volume of unknown applications created daily. Our approach is capable of providing more frequent updates to the detection model, because only a small and manageable set of informative applications is sent to the human expert for inspection and subsequently acquired by the detection model. This is in contrast to heuristic approaches based on scoring algorithms or other types of detection models which are only updated periodically due to the labor intensive process of human expert analysis.

In our framework the updated detection model efficiently updates the antivirus signature repository which, in turn, improves the detection capabilities of the installed and widely used antivirus software within smartphones. The structure of Android applications differs substantially from that of the executable files (including computer worms) within the Windows OS that we previously investigated. Therefore, the results of our previous paper [2] cannot automatically be assumed to hold in the Android OS, since the detection model and AL methods used in this study rely on different dataset characteristics related to the Android applications domain, particularly in terms of the extracted features, the malware distribution, and the attack techniques detected by the detection model. ALDROID combined three important elements. First, we presented a set of general descriptive features for the detection of Android malware, features which are robust and unaffected by obfuscation or transformation evasion techniques. The features are based on the application's static genes and not on the optional operations it might conduct; therefore, the features are also robust against evasion techniques based on delayed malicious operations. Second, from these features we induced a detection model using the SVM machine learning algorithm, and then we applied our malware oriented AL methods in order to leverage our general descriptive set of features and the knowledge of the detection model, so as to frequently enhance the detection model and the antivirus software.

Results indicate that our AL methods outperformed the other solutions, including the existing AL method and the heuristic engine. Our AL methods acquired the largest number and percentage of new malware, while preserving the detection model's detection capabilities (high TPR and low FPR rates). Specifically, our methods acquired more than double the amount of new malware acquired by the heuristic engine and 6.5 times more malware than the existing AL method.

Our fourth core paper [4] is entitled, "Detection of Malicious PDF Files and Directions for Enhancements: A State of the Art Survey." This paper is based on research which proved pivotal to our solid understanding of malicious documents in general and malicious PDF files in particular. Through our comprehensive survey of advanced academic solutions for the detection of PDF malware, we were able to identify the best performing feature extraction methods for malicious PDF files and observe the significant lack of updatability in the detection solutions for PDF malware as well. In this paper we provided comparisons, insights, conclusions, and avenues for future research in order to enhance the detection of malicious PDFs. One of the most important contributions of this paper is revealing and highlighting the correlation between the structural incompatibility of PDF files and their likelihood of maliciousness. By leveraging this correlation, at least 96.5% of the malicious PDF files can be easily filtered out using a simple and deterministic filtering process. The second, and probably more important, contribution is providing a detailed explanation of our active learning based framework for enhancing and supporting the updatability of detection solutions for malicious PDF files.

This paper was followed by our additional peer reviewed journal paper [5] (based on the preliminary results of our research on the ALPD framework, which were presented at the IEEE JISIC conference [6]); in this extended paper [5] we implemented the framework suggested in [4] for efficiently updating detection models of PDF malware using AL methods. Results showed that our AL method, Combination, outperformed all of the other methods, enriching the signature repository of the antivirus with almost seven times more new malicious PDF files, while further improving the detection model's capabilities each day. At the same time, it dramatically reduced security experts' efforts by 75%. Despite this significant reduction, results also indicate that our framework detects new malicious PDF files better than leading antivirus tools commonly used by organizations for protection against malicious PDF files.


Furthermore, it is also worth mentioning our additional paper [7], which was inspired by this important core paper [4]; in [7] we presented the ALDOCX system at the ranked machine learning conference ICMLA-2015 (based on our preliminary results [8] presented in a workshop at the KDD-2015 machine learning conference). In this paper [7], we presented the SFEM feature extraction methodology and designated active learning methods aimed at the accurate detection of new unknown malicious docx files; these methods also efficiently enhance the detection model's capabilities over time and quickly utilize the vast number of documents within an organization. Results show that our active learning methods used only 14% of the labeled docx files within an organization, which led to a reduction of 95.5% in labeling efforts compared to passive learning and SVM-Margin (an existing active learning method). Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and SVM-Margin, thus providing an improved updating solution for the detection model, as well as for the antivirus software widely used within organizations. We also showed that our novel structural feature extraction methodology (SFEM) results in a set of very discriminative and general features of XML based MS Office documents, and that these features can be effectively leveraged by machine learning algorithms to induce an accurate detection model for malicious docx files.
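To give a rough sense of what structural features of XML-based Office documents look like, the sketch below counts element paths inside the XML parts of a docx archive; it only illustrates the general concept and is not the SFEM implementation described in [7], whose path format and selected features differ.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

def structural_paths(docx_path):
    """Count element paths (e.g., 'document/body/p/r/t') in the XML parts of a .docx file."""
    features = Counter()
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue
            stack = [(root, root.tag.split("}")[-1])]   # strip XML namespaces
            while stack:
                elem, path = stack.pop()
                features[path] += 1
                for child in elem:
                    stack.append((child, path + "/" + child.tag.split("}")[-1]))
    return features

# Feature vectors for the classifier would then be built from the most discriminative paths.
```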

To summarize, our fourth core paper [4] was the basis for four more papers [5-8] aimed at enhanced detection of malicious documents. These papers can be found in the appendix subsection entitled, “Additional Accepted Papers in the Malware Detection Domain.”

In addition to our contribution to the malware detection domain, as can be seen in the lower section of Figure 1, we have published three additional papers in the biomedical domain [9-11]. The second of these papers [10] won the best student paper award at a prestigious artificial intelligence conference (AIME-2015, Rank A). The third paper [11] was recently accepted for publication by the Journal of Biomedical Informatics (JBI), a top peer-reviewed biomedical informatics journal. These papers can be found in the appendix subsection entitled "Additional Accepted Papers in the Biomedical Informatics Domain." They present the results and methodology of our recently extended framework as applied to the biomedical informatics domain, a field entirely different from malware detection. In this research we were able to enhance the capabilities of a classification model used for condition severity classification while significantly reducing labeling efforts, which can result in substantial savings,

both in terms of the time and money associated with the efforts of medical experts. As part of the extension of our framework, we developed an additional AL method, called Combination_XA, which is better oriented to the acquisition needs of the biomedical informatics domain. This extension showed that our methods and framework are generic and can provide a solution for a variety of problems in many different domains.

- 31 -

- 32 -

2.1. Research Results

The following is a complete list of the 11 papers that were published and submitted during this research in the cyber-security and biomedical informatics domains, as presented in Figure 1.

[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning," Pattern Analysis and Applications, (2012) 15:459-475.

[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS," Expert Systems with Applications, (2014), http://authors.elsevier.com/sd/article/S095741741400133X.

[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods," Knowledge and Information Systems (2016), 1-39. Accepted on 11 January 2016.

[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, "Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the-Art Survey," Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.

[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici, "Keeping Pace with the Creation of New Malicious PDF Files Using an Active-Learning Based Detection Framework," Security Informatics, 5(1), 1-20, (2016).

[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Assaf Shabatai, Yuval Elovici, "ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations," JISIC, (2014).

[7] Nir Nissim, Aviad Cohen, Yuval Elovici, "Boosting the Detection of Malicious Documents Using Designated Active Learning Methods," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765, doi: 10.1109/ICMLA.2015.52.

[8] Nir Nissim, Aviad Cohen, Yuval Elovici, "Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents," ODDX3 Workshop at the KDD Conference, (2015), Sydney.

[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification," BigCHAT Workshop at the KDD Conference, (2014).

[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "An Active Learning Framework for Efficient Condition Severity Classification," Artificial Intelligence in Medicine, Pages 13-24, Springer International Publishing, AIME (2015). Winner of the Mario Stefanelli Best Paper Award at the AIME 2015 Conference.

[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch, "Improving Condition Severity Classification with an Efficient Active Learning Based Framework," Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464.

- 33 -

2.1.1. Core Papers

In this section we list the four papers which form the core of this research; the list will be followed by the papers themselves. The appendix contains our other accepted papers.

[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning," Pattern Analysis and Applications, (2012) 15:459-475.

[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS," Expert Systems with Applications, (2014), http://authors.elsevier.com/sd/article/S095741741400133X.

[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods," Knowledge and Information Systems (2016), 1-39. Accepted on 11 January 2016.

[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, "Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the-Art Survey," Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.

- 34 -

Pattern Anal Applic (2012) 15:459–475 DOI 10.1007/s10044-012-0296-4

INDUSTRIAL AND COMMERCIAL APPLICATION

Detecting unknown computer worm activity via support vector machines and active learning

Nir Nissim • Robert Moskovitch • Lior Rokach • Yuval Elovici

Received: 8 December 2009 / Accepted: 5 September 2012 / Published online: 25 September 2012 © Springer-Verlag London Limited 2012

Abstract: To detect the presence of unknown worms, we propose a technique based on computer measurements extracted from the operating system. We designed a series of experiments to test the new technique by employing several computer configurations and background application activities. In the course of the experiments, 323 computer features were monitored. Four feature-ranking measures were used to reduce the number of features required for classification. We applied support vector machines to the resulting feature subsets. In addition, we used active learning as a selective sampling method to increase the performance of the classifier and improve its robustness in the presence of misleading instances in the data. Our results indicate a mean detection accuracy in excess of 90%, and an accuracy above 94% for specific unknown worms using just 20 features, while maintaining a low false-positive rate when the active learning approach is applied.

Keywords: Malware detection · Supervised learning · Active learning

N. Nissim · R. Moskovitch · L. Rokach · Y. Elovici
Department of Information Systems Engineering, Ben Gurion University of the Negev, P.O.B. 653, 84105 Beer-Sheva, Israel
e-mail: [email protected]

R. Moskovitch
e-mail: [email protected]

Y. Elovici
e-mail: [email protected]

N. Nissim · R. Moskovitch · L. Rokach · Y. Elovici
Deutsche Telekom Laboratories, Ben Gurion University, Beer-Sheva, Israel
e-mail: [email protected]

1 Introduction

The detection of malicious code (malcode) transmitted over computer networks has been researched intensively in recent years. Worms, a particularly widespread malcode, proactively propagate across networks while exploiting vulnerabilities in operating systems or in installed programs. Other types of malcode include computer viruses, Trojan horses, spyware, and adware.

Nowadays, excellent technology (i.e., antivirus software packages) exists for detecting known malicious code. Typically, antivirus software packages inspect each file that enters the system, looking for known signs (signatures) that uniquely identify a malcode. Antivirus technology cannot, however, be used for detecting an unknown malcode, since it is based on prior explicit knowledge of malcode signatures. Following the appearance of a new worm instance, operating system providers provide a patch to deal with the problem, while antivirus vendors update their signature base accordingly. This solution has obvious demerits, however, since worms propagate very rapidly. By the time the antivirus software has been updated with the new worm, very expensive damage has already been inflicted [1].

Intrusion detection, termed a network-based intrusion detection system (NIDS), is commonly implemented at the network level. NIDS has been substantially researched but remains limited in its detection capabilities (like any detection system). In order to detect malcodes that have slipped through the NIDS at the network level, detection operations are performed locally by implementing a host-based intrusion detection system (HIDS). To monitor activities at the host level, HIDS usually compares various states of the computer, such as the changes in the file system, using checksum comparisons. The main drawback of this approach is the ability of malcodes to disable antiviruses. The main problem in using HIDS, however, is detection knowledge maintenance, which is usually performed manually by the domain expert. This is apt to be time-consuming and expensive.

Recent studies have proposed methods for detecting unknown malcode using machine-learning techniques. Given a training set of malicious and benign binary executables, a classifier is trained to identify and classify unknown malicious executables as malicious [2–4]. While this approach is potentially a good solution, it is not complete. It can detect only executable files, and not malcodes located entirely in the memory, such as the Slammer worm [5]. In a previous research report, we presented a new method for detecting unknown computer worms [6–8]. The underlying assumption was that malcode within the same category (e.g., worms, Trojans, spyware, adware) share similar characteristics and behavior patterns and that these patterns can be induced using machine-learning techniques. By continuously monitoring and matching the computer's vital signs (such as CPU and hard disk usage) against the previously induced malcode patterns, we can gain an indication as to whether the computer is infected. While this approach does not prevent infection, it enables its fast detection. Relevant decisions and policies, such as disconnecting a single computer or a cluster, can then be implemented.

The goal of this study is to assess the viability of employing support vector machines (SVM) in an individual computer host to detect unknown worms based on their behavior (measurements), and to examine whether selective sampling can improve the detection performance. The behavior of some of the worms is unstable, so that some of the time they tend to behave as a legitimate application does. Thus, by monitoring their behavior we would derive instances that would negatively affect the model (hereafter we will call these instances misleading instances). The selection of the right instances to be included in the training set is therefore also very challenging.

This paper makes four contributions to our armory for combating malware:

1. Development of a selective sampling procedure using active learning: Active learning is commonly used to reduce the amount of labeling required from an expert (a time-consuming task). The Oracle is actively asked to label specific examples in the dataset that the learner considers the most informative, based on its current knowledge, which eventually reduces the acquisition cost. In this study, all the training examples are labeled in advance. However, the goal is to select intelligently the best examples that will increase the accuracy by filtering misleading or non-informative instances.
2. Adaptation of SVM for malware detection: In our previous study [6] we used the algorithms decision trees, naïve Bayes, Bayesian networks, and neural networks. In this paper, we study the performance of SVM. Specifically, we examine which of the three kernel functions (linear, polynomial, RBF) is most suitable for detecting unknown worms. We argue that SVM will achieve better results when using active learning as a selective sampling method.
3. Comparison of feature selection methods for improving malware detection: We examine experimentally which feature selection method (if any) best fits the worm detection task using SVM.
4. We investigate the contribution of specific worms to the detection performance and examine if all worms are equally informative.

The rest of the article is structured as follows. Section 2 surveys the relevant background for this work, while Sect. 3 describes the SVM, relevant kernel functions and active learning methods used in this study. Section 4 discusses the research question, corresponding experimental plan, and evaluation results. Finally, in Sect. 5 we conclude the paper with a discussion of the evaluation results, conclusions, and future work.

2 Background and related work

2.1 Malicious code and worms

The term 'malicious code' (malcode) refers to a piece of code, not necessarily an executable file, the intention of which is to harm. In [9], the authors define a worm according to how it can be distinguished from other types of malcode: (1) network propagation or human intervention—worms propagate actively over a network, while other types of malicious codes, such as viruses, commonly require human activity to propagate; (2) standalone or file infecting—while viruses infect a host, a worm does not require a host file and sometimes does not even require an executable file since it may reside entirely in the memory. This was the case with the Code Red worm [10].

Worm developers have different purposes and motivations [11]. Some are motivated by experimental curiosity (ILoveYou worm [12]), while pride and the desire for power lead others to flaunt their knowledge and skill through the harm caused by the worm. Still other motivations are commercial advantage, extortion and criminal gain, random and political protest, and terrorism and cyber warfare. The wide variety of motivations that we find among worm developers indicates that computer worms will be a

123 Pattern Anal Applic (2012) 15:459–475 461 long-lasting phenomenon. To address the challenge posed request the true class label for a certain number of by worms effectively, as much meaningful experience and instances in the pool. Other approaches focus on the knowledge as possible should be extracted from known expected improvement of class entropy [26], or mini- worms by analyzing them. Today, given the number of mizing both labelling and misclassification costs [27]. known worms, we have an excellent opportunity to learn Although in our problem all the examples are actually from these examples. We argue that active learning labeled, we decided to apply AL as the selective sampling methods can be very useful for learning and generalizing approach for choosing the most informative examples to from previously encountered worms in order to detect reduce the number of the misleading instances in the previously unseen worms effectively. training data. In Sect. 3.4, we explain how AL can be used to achieve this goal. 2.2 Detecting malicious code using supervised learning techniques 3 Methods Supervised learning techniques have already been used for detecting malicious codes and creating protection against 3.1 Dataset creation them. For example, in [13], the authors proposed a framework consisting of a set of algorithms for extracting Since no benchmark dataset exists that we could use for anomalies from a user’s normal behavior patterns. A this study, we created our own. A controlled network with normal behavior is learned and any abnormal activity is various computers (configurations) was deployed into considered intrusive. In order to determine what constitutes which we could inject worms, and monitor and log the normal, the authors suggest several techniques, such as computer operating system features using a dedicated classification, meta-learning, association rules, and fre- agent. In order to create the datasets, we isolated the local quent episodes. The extracted knowledge forms the basis of network of computers, simulating a real Internet network an anomaly-based intrusion detection system. that allowed worms to propagate. A naı¨ve Bayesian classifier was suggested in [14], We designed several experiments centered around eight referring to its implementation within the ADAM system, datasets, which we created based on three aspects: hard- developed in 2001 [15]. The classifier consists of three ware configuration, background applications, and user main parts: (a) a network data monitor listening to TCP/IP activities. Using this model, we designed several experi- protocol; (b) a learning engine for acquiring association ments to achieve our research goals: rules from the network data; and (c) a classification module that classifies the nature of the traffic into two possible a. To find out whether a classifier, trained on data classes, normal and abnormal, that can later be linked to collected from a computer with a certain hardware specific attacks. Other soft computing algorithms proposed configuration and specific background activity, is for detecting malicious code include: artificial neural net- capable of classifying correctly the behavior of a works (ANN) [16–19]; self-organizing maps (SOM) [20], computer that has other configurations. and fuzzy logic [21–23]. b. To select the minimal subset of features required to classify new cases correctly. 
Reducing the number of 2.3 Active learning features used in the model implies that less monitoring effort would be needed in an operational system. Labeled examples are crucial when training classifiers. In the course of experimentation, we applied four However, in certain domains the labeling operation is classification algorithms on the given datasets in a varied costly and time-consuming. Active learning (AL) [24] series of experiments in order to detect, first, known worms refers to learning policies, in which a learner actively in different environments and, later, completely new, selects unlabeled instances for labeling, based on some previously unseen worms. criterion. The objective of most AL methods is to mini- Figure 1 depicts the process that was used in this study. mize the cost of acquiring the labeled instances needed The upper part refers to the training phase. We collected a for inducing an accurate model. In this paper, we scruti- set of worms and used them to infect the hosts in the nize another aspect of AL. Instead of minimizing the controlled environment. An agent, which was installed on acquisition costs, our objective is to increase the gener- each host, then recorded its behavior. Based on the col- alization accuracy using an approach that disregards lected dataset, we trained the classifiers. The bottom part of misleading instances. Several AL frameworks are pre- Fig. 1 refers to the test phase. In this phase, we examined sented in the literature. In pool-based active learning [25], whether the induced classifier can be used to identify the the learner has access to a pool of unlabeled data and can existence of an unknown worm. 123 462 Pattern Anal Applic (2012) 15:459–475

Fig. 1 Outline of the train phase and the test phase. The worms are injected into the computers, which are monitored. Features are extracted and a SVM classifier is induced. In the test step the monitored features are provided to the classifier, which classifies whether there is worm activity or not

3.1.1 Environment description Trojans for installation, in parallel, on the distribution process of the network; others focused entirely on distri- The laboratory network consisted of seven computers, bution. Another feature that we desired to obtain was that which contained heterogenic hardware, and a server the worms would have different strategies for IP scanning computer simulating the Internet. We used the Windows that would result in varying communication behaviors, performance counters1 which enabled us to monitor system CPU consumption, and network usage. While all the features that appeared in the following categories (the worms were different, we wanted to find common char- number of features in each category appears in parentheses): acteristics, which could be used to detect an unknown internet control message protocol (27), internet protocol worm. We briefly describe here the main characteristics of (17), memory (29), network interface (17), physical disk the five worms included in this study. The information is (21), process (27), processor (15), system (17), transport based on the virus libraries on the Web2,3,4 control protocol (9), thread (12), user datagram protocol (5). In addition, we used VTrace [28], a software tool that can 3.1.3 W32.Dabber.A be installed for monitoring purposes on a PC running Windows. VTrace collects traces of the file system, the This worm randomly scans IP addresses and uses the network, the disk drive, processes, threads, inter-process W32.Sasser.D worm to propagate and open the FTP server communication, cursor changes, etc. The Windows perfor- in order to upload itself to the vicitom’s computer. The mance counters were configured to measure the features worm registers itself for implementation at the next user every second and to store them in a log file as a vector. login (human-based activation). It drops a backdoor, which VTrace stored, time-stamped events were aggregated into listens in on a predefined port. This worm is distinguished the same fixed intervals, and merged with the Windows by its use of an external worm in order to propagate. performance log files. This body of data eventually con- sisted of a vector of 323 features collected every second. We 3.1.4 W32.Deborm.Y worked with this granularity because these loggers’ most granular level was 1 s. Larger time windows, in which we W32.Deborm.Y is a self-carried worm that prefers local IP could aggregate the measurements over longer time periods, addresses. It registers itself as an MS Windows service and might have been too slow for worm activity detection. is implemented upon user login (human-based activation). This worm contains and implements three Trojans as a 3.1.2 Injected worms payload: Backdoor.Sdbot, Backdoor.Litmus, and Trojan. KillAV. We chose this worm because of its heavy payload. When selecting worms for injection, we tried to include every variety. Some of the worms had a heavy payload of 2 Symantec - http://www.symantec.com. 3 1 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/ Kasparsky - http://www.viruslist.com. counter/counters2_lbfc.asp. 4 Macfee - http://vil.nai.com. 123 Pattern Anal Applic (2012) 15:459–475 463

3.1.5 W32.Korgo.X The data were collected in the presence or absence of background application and user activity in each of the This is a self-carrying worm that uses a completely random hardware configurations. We therefore had three binary method for scanning IP addresses. It is self-activated and aspects, which resulted in eight possible feature-collecting tries to inject itself via a new thread of MS Internet conditions, shown in Table 1, representing a variety of Explorer. It contains a payload code that enables it to dynamic computer configurations and usage patterns. Each connect to predefined websites in order to receive orders or dataset contained monitored instances of all the five download newer worm versions. injected worms separately, and instances of normal com- puter behavior without any injected worm. Each instance 3.1.6 W32.Sasser.D was labeled with the relevant worm (class), or ‘none’ for ‘‘clean’’ instances; Each worm was monitored for a period W32.Sasser.D has a preference for local address optimi- of 20 min with a resolution of 1 s. Thus, each instance zation while scanning the network. It divides its time, contained a vector of measurements that represents a 1-s approximately half and half, between scanning local snapshot. Accordingly, each dataset contained a few addresses and random addresses. In particular, it opens 128 thousand such labeled instances. Worms and legitimate threads for scanning the network. This requires heavy CPU applications were monitored in different configurations consumption, as well as significant network traffic. It is a (computer hardware configuration, existence of back- self-carried worm and uses a shell to connect to the ground application and also existence user-activity). The infected computer’s FTP server and to upload itself. outcome of this monitoring process was features that represent the application’s (worm/non worm) behavior. 3.1.7 W32.Slackor.A Some of the worms tend to behave in one environment similarly to a legitimate application in another environ- This is a self-carried worm that propagates by exploiting ment; similarly, a legitimate application might be per- MS Windows’ sharing vulnerability to propagate. The ceived as non legitimate when its behavior is monitored in worm registers itself for execution upon user login. It different environments. Thus, these cases are also a source contains a Trojan payload and opens an IRC server on the of misleading instances in the data. In order to derive a infected computer in order to receive orders. training set that included applications with distinctive behavior in any environment, we chose to disregard 3.1.8 Computer measurements applications whose behavior is not stable in all the environments. We examined the influence of computer hardware config- uration, applications running in the background, and user 3.2 Feature selection activity. In machine-learning applications, the large number of 1. Computer hardware configuration: We used two dif- features in many domains presents a huge challenge. ferent configurations. Both ran on Windows XP, which Typically, some of the features do not contribute to the is considered the most widely used operating system, accuracy of the classification task and may even hamper it. 
having two configuration types: the ‘‘old’’ configura- Moreover, in our approach, reducing the amount of tion has a Pentium 3,800 Mhz CPU, a bus speed of 133 Mhz, and 512 Mb memory; the ‘‘new’’ configu- ration has a Pentium 4 3 Ghz CPU, a bus speed of Table 1 The three aspects resulting in eight datasets, representing a 800 Mhz, and 1 Gb memory. variety of feature collecting conditions of a monitored computer 2. Background application: We ran an application that Computer Background application User activity Dataset name affects mainly the following features: processor object, processor time (usage of 100 %); page faults/s; Old No No o physical disk object; average disk bytes/transfer avg Old No Yes ou disk bytes/write, and disk writes/s. Old Yes No oa 3. User activity included several applications, among Old Yes Yes oau them: browsing, downloading, and streaming opera- Old Yes Yes oau tions through Internet Explorer, Word, Excel, chat New No Yes nu through MSN messenger, and Windows Media Player. New Yes No na These activities were implemented in such a way as to New Yes Yes nau imitate user activity in a scheduled way.

123 464 Pattern Anal Applic (2012) 15:459–475 features while maintaining a high level of detection overcome a bias in the information gain (IG) measure, and accuracy is crucial for meeting computer and resource measures the expected reduction of entropy caused by consumption requirements for the monitoring operations partitioning the examples according to a chosen feature. (measurements) and the classifier computations. This state Given entropy E(S) as a measure of the impurity in a can be achieved using the feature selection technique. collection of items, it is possible to quantify the effec- Since this is not the focus of this paper, we will describe tiveness of a feature in classifying the training data. the feature selection preprocessing only very briefly. In Equation 3 presents the entropy of a set of items S, based order to compare the performance of the different kernels on C subsets of S (for example, classes of the items), in the SVM, we used the filter approach, which is applied presented by Sc. IG measures the expected reduction of on the dataset and is independent of any classification entropy caused by partitioning the examples according to algorithm (unlike wrappers, in which the best subset is attribute A, in which V is the set of possible values of A,as chosen using an iterative evaluation experiment). Using shown in Eq. 2. These equations refer to discrete values; filters, a measure was calculated to quantify the correlation however, it is possible to extend them to continuous valued of each feature with the class, in our case, the presence or attribute. absence of worm activity. Each feature received a rank X j Sv j representing its expected contribution in the classification IGðS; AÞ¼EðSÞ EðSvÞð2Þ j S j task. Finally, the top ranked features were selected. v2VðAÞ X j S j j S j EðSÞ¼ c log c ð3Þ 3.2.1 Feature ranking metrics S 2 S c2C j j j j We used three feature metrics, which resulted in a list of The IG measure favors features having a high variety of ranked features for each metric and an ensemble incorpo- values over those with only a few. GR overcomes this rating all three of them. We used chi-square (CS), gain problem by considering how the feature splits the data ratio (GR) and RELIEF implemented in the WEKA envi- (Eqs. 4, 5). Si are d subsets of examples resulting from ronment [29] and their ensemble. partitioning S by the d-valued feature A. IGðS; AÞ 3.2.2 Chi-square GRðS; AÞ¼ ð4Þ SIðS; AÞ

Xd Chi-square measures the lack of independence between a j S j j S j SIðS; AÞ¼ i log i ð5Þ feature f and a class ci (such as W32.Dabber.A) and can be j S j 2 j S j compared to the chi-square distribution with one degree of i¼1 freedom to judge extremeness. Equation 1 shows how the 3.2.4 Relief chi-square measure is defined and computed, where N is the total number of documents, f refers to the presence of Relief [32] estimates the quality of the features according the feature (and f its absence), and ci refers to its mem- to how well their values distinguish between instances that bership in ci. P(f, ci) is the probability that the feature f are near each other. Given a randomly selected instance x, occurs in ci and the probability Pðf ; ciÞ is the probability from a dataset s with k features, Relief searches the dataset that the feature f does not occur in ci. Similarly, Pðf ; ciÞ for its two nearest neighbors from the same class, an action and Pðf; ciÞ are the probabilities that the feature does or termed ‘‘nearest hit H’’, and from a different class, referred does not occur in a file that is not labeled to ci, respec- to as ‘‘nearest miss M’’. The quality estimation W[Ai]is tively. P(f) is the probability that the feature appears in a stored in a vector of the features Ai, based on the values of file, and PðfÞ is the probability that the feature does not a difference function dif f() given x, H and M as shown in appear in the file. P(c ) is the probability that a file is Eq. 6. i 8 labeled to c , and Pðc Þ is the probability that a file is not to i i

Gain ratio (GR) was originally presented in 1993 [30]in Instead of selecting features based on feature selection the context of decision trees [31]. It was designed to methods, one can use the ensemble strategy (see for

123 Pattern Anal Applic (2012) 15:459–475 465 instance [33–35]), which combines the feature subsets that separation of the examples. Note that when the kernel are obtained from several feature selection methods. Spe- function satisfies Mercer’s condition [38], then K can be cifically, we combine several methods by averaging the written as shown in Eq. (8), where U is a function that feature ranks as shown in Eq. 7: maps the example from the original feature space into P k j higher dimensional space, while K captures the ‘‘inner j¼1 rank ðfiÞ rankðfiÞ¼ ð7Þ product’’ between the mappings of examples x1, x2. For the k general case, the SVM classifier will be in the form shown where fi is a feature and filter is one of the k filtering in Eq. (9). Note that n is the number of examples in the (feature selection) methods. Specifically in our case k = 3. training set. Eq. (10) defines W. Kðx ; x Þ¼Uðx ÞUðx Þð8Þ 3.3 Support vector machines 1 2 1 2 Xn f x w U x a y K x ; x We employed the SVM classification algorithm [36] using ð Þ¼ ð Þ¼ i i ð i Þð9Þ 1 three different kernel functions in a supervised learning Xn approach. We now briefly introduce the SVM classification w ¼ aiyiUðxiÞð10Þ algorithm and the principles and implementation of the 1 active learning method we used in this study. SVM is a binary classifier that finds a linear hyperplane that separates The use of kernel functions, often referred to as the the given examples into the two given classes. SVM is kernel trick [39], is of great importance. Equation (8) means known for its capability to handle a large amount of fea- that inner products in the higher dimensional space can be tures, such as text. We used the SVM-light implementation evaluated simply using the kernel function; it is therefore [37], given a training set in which an example is a vector not necessary to work explicitly in the higher dimensional whenever only inner products are required. Therefore, the xi ¼hf1; f2; ...fni labeled by yi = {-1, ?1} where fi is a feature. problem that arises from the high dimensional feature space The SVM attempts to specify a linear hyperplane that has is alleviated, because it allows the computations to take a maximal margin, defined by the maximal (perpendicular) place in the original feature space of the problem, which distance between the examples of the two classes. Figure 2 involves the computation of inner products in Eq. (8). After illustrates a two-dimensional space, in which the examples projecting the examples into the higher dimension space, are located according to their features; the hyperplane splits the SVM tries to identify the optimal hyperplane that them according to their label. The examples lying closest to separates the two classes. Logically there can be more than the hyperplane are the ‘‘supporting vectors’’. W, the normal one separating hyperplane for a specific projection of a of the hyperplane, is a linear combination of the most dataset; therefore, as a criterion of selection, the one important examples (supporting vectors), multiplied by maximizing the margin is selected in an attempt to achieve a Lagrange multipliers (alphas). better generalization capability in order to increase the Since the dataset in the original space often cannot be expected accuracy. separated linearly, a kernel function K is used. 
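For reference, the feature-ranking measures described in Sect. 3.2 above, written out in standard notation. Equations 1 and 6 are given in their usual textbook forms (the chi-square statistic for a binary feature and class, and the Relief weight update over m sampled instances), which is an assumption rather than a verbatim transcription; the remaining formulas follow the definitions in the text.

\[
\chi^2(f, c_i) \;=\; \frac{N\,\bigl[\,P(f, c_i)\,P(\bar{f}, \bar{c}_i) - P(f, \bar{c}_i)\,P(\bar{f}, c_i)\,\bigr]^2}{P(f)\,P(\bar{f})\,P(c_i)\,P(\bar{c}_i)} \tag{1}
\]
\[
IG(S, A) \;=\; E(S) \;-\; \sum_{v \in V(A)} \frac{|S_v|}{|S|}\, E(S_v) \tag{2}
\]
\[
E(S) \;=\; -\sum_{c \in C} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|} \tag{3}
\]
\[
GR(S, A) \;=\; \frac{IG(S, A)}{SI(S, A)} \tag{4}
\]
\[
SI(S, A) \;=\; -\sum_{i=1}^{d} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \tag{5}
\]
\[
W[A] \;\leftarrow\; W[A] \;-\; \frac{\mathrm{diff}(A, x, H)}{m} \;+\; \frac{\mathrm{diff}(A, x, M)}{m} \tag{6}
\]
\[
rank(f_i) \;=\; \frac{1}{k} \sum_{j=1}^{k} rank^{\,j}(f_i) \tag{7}
\]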
Using a Since the kernel function is derived from the theoretic kernel function, the SVM actually projects the examples basis of SVM, one should select a kernel function that has into a higher dimensional space in order to create linear the appropriate parameter configurations, as was empiri- cally demonstrated in [37]. Each kernel function creates a different separating plane in the original space as demonstrated in Figs. 3 and 4. Commonly, the kernel functions used are the Polynomial and RBF kernel. One should note that the performance of the kernel also depends on the true data distribution, which is usually unknown, and thus one should scrutinize dif- ferent kernels in order to determine the best kernel for the specific problem and dataset.
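The SVM decision function and the kernel trick referred to in Sect. 3.3 (Eqs. 8–10), restated with Φ the implicit feature mapping, α_i the Lagrange multipliers, y_i ∈ {−1, +1} the labels, and n the number of training examples, as defined in the text:

\[
K(x_1, x_2) \;=\; \Phi(x_1) \cdot \Phi(x_2) \tag{8}
\]
\[
f(x) \;=\; w \cdot \Phi(x) \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) \tag{9}
\]
\[
w \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, \Phi(x_i) \tag{10}
\]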

3.3.1 Polynomial kernel

The polynomial kernel creates values of degree p, where Fig. 2 SVM that separates the training set into two classes with the output depends on the direction of the two vectors maximal margin (examples x1, x2), as shown in Eq. 11, in the original 123 466 Pattern Anal Applic (2012) 15:459–475 ! kx x k2 Kðx ; x Þ¼exp 1 2 : ð12Þ 1 2 2r2

As Fig. 4 shows, the same training set is given to the SVM with the RBF and the polynomial kernels. While the SVM with the polynomial kernel (left side) has not determined the hyperplane that separates the training set, the SVM with the RBF kernel (right side) has successfully deter- mined such a one: The RBF has successfully separated the dataset, whereas the polynomial has not. There are several reasons for using the SVM as the classification algorithm. Primarily, SVM was successfully used for the detection of worms as indicated in four previ- Fig. 3 The same training set was given to polynomial (left) and linear ous works [41–44]. Moreover, in the first work [41], it was kernels (right); the polynomial kernel achieved better separation with indicated that ‘‘SVM learns a black-box classifier that is its induced model. The figure was produced by the applet provided in hard for worm writers to interpret’’. In addition, SVM was the LIBSVM software [40] very efficient when combined with AL methods in closely related domains, as was presented in [45]. An additional problem space. A special case of the polynomial kernel, reason is related to the fact that the data contain misleading having P = 1, is actually the linear kernel. data (the misleading issue will be explained in more detail in Sect. 3.4). In a nutshell, misleading data mean that it is Kðx ; x Þ¼ðx x þ 1ÞP ð11Þ 1 2 1 2 hard to find a clear separation in the dataset between the In order to convey the significance of the kernels, we pro- worm and non-worm instances due to the similarity vided an explanation combined with visualizations. As seen between the behaviors of these two classes. Thus, our goal in Fig. 3, the same training set is given to SVM with linear was to detect the worm activity through system behavior, and polynomial kernels. While the SVM with a linear kernel since, on the one hand, the RBF kernel might help because it (right side) has not determined a hyperplane that separates is sophisticated and very sensitive to misleading data, and, the training set, the SVM with the polynomial kernel (left on the other hand, the linear kernel, which is the simplest side) has successfully determined such a one: one, might find a simple separation between the classes.
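As a small, self-contained illustration of the kernel comparison discussed above, the three kernels can be contrasted with scikit-learn on synthetic data. The library, the synthetic dataset, and the parameter values are assumptions made only for this sketch; the study itself used the SVM-light implementation on the monitored worm data.

# Illustrative sketch (assumes scikit-learn); synthetic data stands in for the
# monitored host-behavior measurements.
# Polynomial kernel (Eq. 11): K(x1, x2) = (x1 . x2 + 1)^p, obtained here with
# gamma=1.0 and coef0=1.0. RBF kernel (Eq. 12): K(x1, x2) = exp(-||x1 - x2||^2 / (2*sigma^2)),
# where scikit-learn's gamma corresponds to 1 / (2*sigma^2).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for 20 selected monitoring features; flip_y mimics noisy labels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           flip_y=0.1, random_state=0)

kernels = {
    "linear": SVC(kernel="linear"),
    "polynomial (p=3)": SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0),
    "RBF": SVC(kernel="rbf", gamma="scale"),
}

for name, clf in kernels.items():
    scores = cross_val_score(clf, X, y, cv=10)  # ten-fold cross-validation
    print(f"{name:18s} mean accuracy = {scores.mean():.3f}")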

3.3.2 Radial basis function (RBF) kernel

3.4 Active learning

The second most used kernel is a radial basis function Active learning (AL) is usually used to reduce the effort (RBF), as shown in Eq. 12, in which a Gaussian is used as expended on labeling examples, generally a time-con- the RBF and the output of the kernel depends on the suming and costly task, while obtaining a high accuracy Euclidean distance of examples x1, x2. rate. In AL, the learner actively acquires the labels of the

Fig. 4 The same training set was given to polynomial (left) and RBF kernels (right), the RBF kernel achieved better separation with its induced model. The figure was produced by the applet provided in the LIBSVM software [40]

123 Pattern Anal Applic (2012) 15:459–475 467 most informative instances. In our study, all the examples current actual classifier. Since the real future error rates are were labeled, since we knew which worm was active unknown, the learner utilizes its current classifier in order during the measurements. However, since the data were to estimate these errors, as will now be elaborated. At the ^ misleading, we used AL as a selective sampling technique beginning of an AL trial, an initial classifier PDðyi j xiÞ is that increases accuracy by selecting only those examples trained over a randomly selected initial set D, and for every that lead to a better classifier. Some of the worms behave as candidate example (xi, y1) where xi 2 P and its possible a legitimate application part of the time, and as a conse- ^ labels y1 2 Y, the algorithm induces a new classifier PDðyi j 0 quence, they generate misleading instances. In order to xiÞ trained on the extended training set D = D ? (xi, yi). prevent these instances from effecting the detection accu- One should note that this new classifier actually represents racy of our model negatively, we did not select them for the addition of the new example with a specific label into inclusion in the training set. The worms are not always the training set—and these are the classifiers by which the active and even when active they do not always behave in selection criterion is being calculated in order to select the an illegitimate way. Thus, according to their monitored most suitable examples. behavior in this period, they seem to act like a non-worm X X 1 ^ ^ instances. This creates misleading instances in the dataset E ^ ¼ P 0 ðy j xÞ j log P 0 ðy j xÞ j P 0 D ðyijxiÞ D ðyijxiÞ D ðyijxiÞ P and makes their detection much more difficult. The mis- x2P y2Y leading data here were a greater problem; in the security ð13Þ X domain, worms are created in such a sophisticated way that ^ ^ SExi ¼ PDðyi j xiÞEP ð14Þ they behave similarly to a legitimate application in order to D0ðy jx Þ y2Y i i make their detection harder. Thus, monitoring worm behavior creates many instances (snapshots) that are very The future expected generalization error of the new clas- similar to non-worm instances and are therefore considered sifier is then estimated using the entropy of the new induced as misleading instances that confuse the SVM. This phe- classifier itself, averaged over j P j, as given in Eq. 13. nomenon is called ‘‘malicious noise’’, as presented in [46]. From Eq. 13, it can be understood that the error calculation In most other domains, the misleading instances are not is being done over all the examples in the pool, and the more ^ purposefully created, but exist naturally. Here in worm confident the new classifier P 0 is in knowing x’s true D ðyijxiÞ detection, the task is more complicated since the mislead- label, the lower is the expected error. For example, ing data are inherent in the class, and even present in a ^ P 0 ðy j xÞ¼1 means that the classifier is confident that D ðyijxiÞ larger amount in the class we want to detect. Here we used ^ y is the true label of x; thus, error = 0, and P 0 ðy j xÞ¼ the AL idea to select the most informative instances among D ðyijxiÞ the existing ones so that the misleading ones would not be 0 means it is confident that y is not the true label of x and ^ error = 0, while P 0 ðy j xÞ¼0:5 means that the clas- selected, as was done in previous work [47] and discussed D ðyijxiÞ in other work [48]. 
sifier is most uncertain that y is the true label of x, and thus Misleading instances usually create confusion in the the error is maximal. Note that Eq. 13 actually calculates classification processes and cause degradation in the clas- the expected error of the new classifier over all the examples sifier’s performance. Thus, these misleading instances, in the pool. Yet it is not enough, since it was calculated only generally, will not meet the AL selection criterion, that is, for one label of xi; thus, for each candidate example, Eq. 13 for the error reduction method, the instances whose addition is calculated one time for each possible label and it is to the training set will create a classifier that is more averaged using Eq. 14 (in our context, there are only two confident in its capability to classify unknown instances labels: worm and non worm). Equation 14 is the self-esti- correctly. Selecting the misleading instances has the oppo- mated average error of candidate example xi denoted by site effect: it changes the decision function of the classifier SExi, which is actually a weighted average of the error for so that the classifier is less accurate and thus less confident in all the examples and its possible labels, as shown in Eq. 13. its capability to classify the unknown instances correctly. The example xi with the lowest expected self-estimated

In this study, we implemented an AL, termed ‘‘error- error (SExi ) is chosen and added to the training set. In brief, reduction’’ [26] the objective of which is to reduce the an example is chosen from the pool only if it dramatically expected generalization error. The task is to select the improves the confidence of the current classifier for all the most contributive examples for the classifier from a given examples in the pool. pool of unlabeled examples denoted by P. By estimating the expected error reduction achieved through labeling an 3.5 Evaluation measures example and adding it to the training set, the example that has the maximal error reduction will be selected for true For evaluation purposes, we measured: the true positive labeling and will also be added to the training set of the rate (TPR) measure, which is the number of positive 123 468 Pattern Anal Applic (2012) 15:459–475 instances correctly classified, as shown in Eq. 15; the false which contains all the eight datasets presented in positive rate (FPR), which is the number of misclassified Table 1. In the second option, features were ranked negative instances (Eq. 15); and the Total Accuracy, which separately on each dataset. We then computed the measures the number of absolutely correct classified average rank for each feature; instances, either positive or negative, divided by the entire 3. SVM kernel functions: linear, polynomial and RBF number of instances shown in Eq. 16. kernel; j TP j j FP j 4. Training set (selected from the eight datasets in TPR ¼ ; FPR ¼ ð15Þ Table 1) for inducing the SVM classifier; j TP þ FN j j FP þ TN j 5. Test set (selected from Table 1) for evaluating the j TP þ TN j Total accuracy ¼ ð16Þ SVM classifier. j TP þ FP þ TN þ FN j When the training and test sets were collected under the We also measured a confusion matrix, which depicts the same conditions (i.e., the same computer configuration, number of instances from each class that were classified in background application, and user activity), we employed a each one of the classes (ideally all the instances would be ten-fold cross-validation procedure for evaluating the in their actual class). accuracy. In all other cases, we simply used the entire training/test set for the corresponding training/testing. To analyze the results, we performed a factorial ANOVA. 4 Experiments and results Section 4.1.1 presents the results obtained when the training and test set were collected in the same condition. In the first part of the study, our objective was to identify Section 4.1.2 presents the results for all other cases the best feature selection measure, the best kernel function, and the minimal features required to maintain a high level 4.1.1 Training and test on the same feature collecting of accuracy. In the second part, we wanted to measure the condition possibility of classifying unknown worms using a training set of known worms, and to examine the possibility of Figure 5 presents the accuracy obtained by different fea- increasing the detection performance using selective sam- ture ranking measures on different features subset sizes. pling. In order to elucidate these issues, we designed three The results indicate that for this scenario feature selection experimental plans. We applied 4 different feature selec- reduces the accuracy. 
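Two elements described above can be restated compactly. First, the evaluation measures of Sect. 3.5 (Eqs. 15–16):

\[
TPR \;=\; \frac{|TP|}{|TP| + |FN|}, \qquad FPR \;=\; \frac{|FP|}{|FP| + |TN|} \tag{15}
\]
\[
\text{Total accuracy} \;=\; \frac{|TP| + |TN|}{|TP| + |FP| + |TN| + |FN|} \tag{16}
\]

Second, a minimal sketch of the expected-error-reduction selection criterion of Sect. 3.4 (Eqs. 13–14). The classifier interface (scikit-learn style fit/predict_proba), the two-label assumption, and the representation of the pool as plain Python lists of feature vectors are simplifications made for illustration, not the study's implementation.

# Illustrative sketch of error-reduction sampling (Eqs. 13-14); any classifier
# exposing fit/predict_proba in the scikit-learn style can be plugged in.
import math


def expected_log_loss(model, pool_X):
    """Self-estimated error of a classifier over the pool (Eq. 13): mean entropy."""
    total = 0.0
    for probs in model.predict_proba(pool_X):
        total += -sum(p * math.log(p + 1e-12) for p in probs)  # entropy per example
    return total / len(pool_X)


def select_next_example(model_factory, labeled_X, labeled_y, pool_X, labels=(0, 1)):
    """Pick the pool example whose addition minimizes the self-estimated error (Eq. 14)."""
    base = model_factory().fit(labeled_X, labeled_y)
    base_probs = base.predict_proba(pool_X)  # assumes columns ordered as in `labels`
    best_index, best_score = None, float("inf")
    for i, x in enumerate(pool_X):
        score = 0.0
        for j, y in enumerate(labels):
            # Retrain with the candidate (x, y) added and estimate the new error,
            # weighted by the current classifier's belief P_D(y | x).
            model = model_factory().fit(labeled_X + [x], labeled_y + [y])
            score += base_probs[i][j] * expected_log_loss(model, pool_X)
        if score < best_score:
            best_index, best_score = i, score
    return best_index

Because every candidate requires retraining once per possible label, this criterion is computationally heavy, which is consistent with its use here on a pool of already labeled instances to filter out misleading ones rather than to query an oracle online.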
The GainRatio measure was partic- tion measures to generate 17 training sets such that each ularly less effective in selecting the most relevant features, measure was used to extract the Top 5, 10, 20 and 30.In especially in small subsets (top 5 and top 10). The null- addition, the full feature set was also used. hypothesis, that all feature-ranking measures perform equally and the observed differences are merely random, 4.1 Experiment I—analysis the effect of feature was rejected. Similarly, the null-hypothesis that all feature selection subset sizes perform equally was also rejected. Moreover, the interaction between the feature-ranking measure and We performed a wide set of experiments in order to eval- the top select features was found to be statistically signif- uate how feature selection affects the detection of unknown icant with F(12,752) = 7.9533 and p \ 1%. worms. Specifically, we compared the accuracy perfor- The trend in the results is that the more features added to mance obtained after selecting features using each one of the training set, the higher the accuracy of worm detection the above-mentioned feature-ranking metrics (chi-square, is, where here it is strongly related to the fact that the same gain ratio, relief, and ensemble). Note that in this experi- feature collecting conditions were used for the training and mental study we examined the task of identifying worm or test sets. The more features given to the classifier, the more no-worm activity. In certain scenarios, this binary classi- information the classifier receives. Consequently, due to fication is sufficient. While identifying the exact worm is the fact that over a small set of features some instances considered to be more challenging, we decided to explore might appear similar, while expanding this set of features this direction because of the possibility of obtaining addi- with additional features will probably reveal a difference tional insights. The evaluations were performed on differ- between them (if indeed it exists), these additional features ent conditions, based on the following factors: help the classifier to cope with the misleading instances. In 1. Top n—select top (best) 5, 10, 20, 30 or all features addition, it implies that the features we have monitored are according to the features ranking; relevant and actually help the classifier find patterns in the 2. Feature consolidation (unified, averaged). In the first dataset that are necessary for distinguishing between a option, features were ranked on a unified dataset, worm and non-worm behavior.

123 Pattern Anal Applic (2012) 15:459–475 469

Fig. 5 The interaction between feature-ranking measures and the top selected features. In general the FS reduces the accuracy, but the chi-square was found most effective. The vertical bars denote 0.95 confidence intervals

Fig. 6 The interaction between the kernel function and the top selected features. The vertical bars denote 0.95 confidence intervals

Figure 6 presents the mean accuracy obtained by the lower results (mostly under top 20) implies a complicated different kernel functions using different feature subset dataset that cannot be easily separated linearly: the sizes. The results indicate that the polynomial function instances of the worm and non-worm tend to be similar and performs best in terms of accuracy. The null-hypothesis, thus more features are needed to distinguish between them. that all kernel functions perform equally and the observed Figure 7a shows a comparison of the accuracy obtained differences are merely random, was rejected. Moreover, the by the unified and the averaged consolidation approaches. interaction between the kernel function and the top select The one-way ANOVA indicates that the averaged approach features was found to be statistically significant with significantly outperforms the unified approach with F(7,716) = 7.483 and p \ 1%. F(1,770) = 85.1, p \ 1 %. Nevertheless, a further inves- Again, it can be seen that the amount of features that is tigation, shown in Fig. 7b, indicates that the averaged taken into consideration has a positive influence on the approach outperforms the unified approach only when detection rate, and, although there are differences in the small feature sets are used. For the Top 30 and FULL accuracy rates among the various kernels, the trend is quite feature sets the unified approach was found to be better. similar: a steep incline in the accuracy when moving from Moreover, the interaction effect of the consolidation factor top 5 to top 10, while from top 10 to Full there is a mod- and feature subset size factor was found to be statistically erate increase. That the linear kernel achieved significantly significant with F(3,700) = 10.474 and p \ 1%.

123 470 Pattern Anal Applic (2012) 15:459–475

Fig. 7 a One-way ANOVA of the consolidation method. The averaged approach outperformed the unified approach. The vertical bars denote 0.95 CI. b The interaction between the consolidation method and the top selected features. The averaged consolidation method was better for the small number of features selected (5–20), while the unified was better for the Top30 and FULL. The vertical bars denote 0.95 CI

4.1.2 Training and test on different feature collecting features when using GainRatio included: A_1ICMP: conditions Sent_Echo_sec, Messages_Sent_sec, Messages_sec, and A_1TCP: Connections_Passive and Connection_Failures, We trained each classifier on a single dataset and tested on which are Windows’ performance counters, related to it each one of the remaining seven datasets. Thus, we had a ICMP and TCP, describing general communication set of eight iterations in which a dataset was used for properties. training, and seven corresponding evaluations of each one The null-hypothesis, that all feature-ranking measures of the datasets. In short, there were 56 evaluation runs for perform equally and that the observed differences are each combination. merely random, was rejected. Similarly, the null-hypothe- Figure 8 presents the accuracy obtained by various sis that all feature subset sizes perform equally was also feature-ranking measures on different features subset sizes. rejected. Moreover, the interaction between the feature- As expected, it can be seen that the accuracy level in this ranking measure and the top select features was found to case is significantly lower than the accuracy obtained in be statistically significant with F(12,5350) = 7.479 and Sect. 4.1.1. Contrary to the results appearing in Fig. 5, the p \ 1%. GainRatio measure outperforms other measures. Generally, Figure 9 presents the accuracy obtained by the different it can be seen that the above 20 features do not improve kernel functions using different feature-ranking methods. performance, possibly because the additional features The results indicate that the best combination is obtained correlate less with the classes. The Top5 significant by first using GainRatio for selecting the features and then

123 Pattern Anal Applic (2012) 15:459–475 471

Fig. 8 The interaction between feature ranking measures and the top selected features. The GainRatio outperforms all the methods and selecting more than 20 features reduces the accuracy. The vertical bars denote 0.95 CI

Fig. 9 The interaction between the kernel function and the top selected features. The RBF kernel outperforms the other kernels. The vertical bars denote 0.95 CI

building the SVM using the RBF function. The null- information-gain. However, in our experimental study, on hypothesis, that all kernel functions perform equally and the one hand, in the same feature-collecting conditions that observed differences are merely random, was rejected. (Fig. 5), chi-square provided the best results, while infor- Moreover, the interaction between the kernel function and mation gain yielded the worst results; on the other hand, in the feature-ranking measures was found to be statistically the different feature collecting conditions (Fig. 8), the significant with F(7,5392) = 7.2035 and p \ 1%. dominance relation was reversed. According to the experimental results given in 4.1.1 and 4.1.2, the different feature selection methods performed 4.2 Experiment II—unknown worms detection significantly differently in our context of worm detection. We understand that this is a result of the critical influence To evaluate the capability of the suggested approach to of each relevant feature that was selected. Different classify unknown worm activity, which was our main methods select different features, and it seems that the objective, an additional experiment was performed. In this features we monitored are very diverse, which means that experiment, the training set consisted of (5 - k) worms we have a complementary set of features that are inde- and the testing set contained the k excluded worms; the pendent of each other. Thus, the selection of different none activity appeared in both datasets. This process was features had a significant impact on the results. In addition, repeated for all the possible combinations of the k worms, the results reveal an interesting phenomenon. Previous for k = 1–4. Note that in these experiments, unlike in the research [49] showed a correlation between chi-square and previous section, there were two classes: (generally) worm,

for any type of worm, and none for any other cases. For selecting the features, we used the Top20 features of the GainRatio with unified consolidation. The full training set, when no worm was excluded, contained 7126 instances of worm and non-worm, while the full training set, when 1, 2, 3, or 4 worms were excluded, contained 5881, 4689, 3497, and 2305 instances, respectively.

Figure 10 presents the results when all training data were used. As more worms were included in the training set, a monotonic improvement was observed. However, RBF was less affected by the number of excluded worms. Consequently, we prefer the RBF kernel when there are fewer worms in the training set. In addition, the linear kernel consistently outperformed the polynomial kernel. Note that the number of worms on the x axis refers to the number of excluded worms. The RBF outperformed all the other kernels, while the polynomial kernel performed worse than the linear kernel.

4.3 Experiment 3—using selective sampling

In this set of experiments, we wanted to maximize the performance achieved by the RBF kernel function. Specifically, we examined whether improved results can be achieved by using a selective sampling approach to reduce the number of misleading instances in the training set, which poses a challenge for the RBF. Thus, we employed a selective sampling approach based on AL.

The evaluation was made using the same setup as in the previous section, in which worms excluded from the training set appeared in the test set. Specifically, we repeated the same experiment with the entire set of examples as a baseline using the selective sampling method. For the selective sampling process, first, we randomly selected six examples from each type of class (worms/none); subsequently, in each AL iteration, an additional example was selected. Performances were noted after selecting 50, 100 and 150 additional instances.

Figure 11 presents the obtained results. Two main outcomes can be observed. First, the selective sampling significantly improved the baseline accuracy by more than 10%. Second, actively selecting only 50 instances can be sufficient for obtaining high accuracy. When we used the entire dataset, the accuracy increased as more worms were removed from the training set. This can be explained by the fact that some worms behave most of the time as legitimate applications. Thus, adding all their instances to the training set, without filtering out the confusing instances, might affect the training of the SVM negatively. Another observation that supports these insights is that the fewer the worms excluded, the larger the gap between the results of selective sampling and learning from the full training set, with the largest gap occurring at 1 excluded worm. This means that when more misleading instances exist in the training set, selective sampling is more contributive in filtering out the worm instances that are very similar to the non-worm behavior.

One should note that most of the instances in the test set that are presented to the SVM seem to be legitimate, yet the detection of the worm is done according to its own process in which there are also abnormal instances by which the SVM successfully determines that it is indeed a worm; all the instances that belong to the same process are classified as worm, although they seem to be legitimate.

Figure 12 presents the accuracy obtained when different worms were excluded from the training set. It can be seen that not all worms have the same detection accuracy. The differences were found to be statistically significant with F(4,155) = 7.84 and p < 1%. Further investigation, as shown in Fig. 13, indicated that the exclusion of the

Fig. 10 The performance monotonically increases as fewer worms are excluded (and more worms appear in the training set). The RBF kernel presents a different behavior, in which a high level of accuracy is achieved even when learning from a single worm
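The leave-k-worms-out protocol of Experiment II can be sketched as follows, assuming every monitored instance is tagged with the name of the worm that generated it (or None for the none activity). This is an illustrative reconstruction rather than the original code; in particular, the none instances are shared here by the training and test sets for simplicity, whereas the original experiment drew none activity for both sets.

from itertools import combinations

import numpy as np
from sklearn.svm import SVC

def leave_k_worms_out(X, y, worm_ids, all_worms, k, kernel="rbf"):
    """Train on the instances of the (5 - k) remaining worms plus the 'none'
    activity, test on the k excluded worms plus 'none', and average the
    accuracy over all possible exclusions.

    X: instance-by-feature matrix; y: 1 for worm, 0 for none;
    worm_ids: the worm that produced each instance, or None for 'none'."""
    accuracies = []
    for excluded in combinations(all_worms, k):
        # 'none' instances are shared by both sets in this simplified sketch
        train_mask = np.array([w is None or w not in excluded for w in worm_ids])
        test_mask = np.array([w is None or w in excluded for w in worm_ids])
        clf = SVC(kernel=kernel).fit(X[train_mask], y[train_mask])
        accuracies.append(clf.score(X[test_mask], y[test_mask]))
    return float(np.mean(accuracies))

Evaluating the returned accuracy for k = 1–4 and for each kernel reproduces the kind of comparison summarized in Fig. 10.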


Fig. 11 The selective sampling approach significantly improves accuracy. Generally, an improvement of above 10 % in accuracy was achieved
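A rough sketch of the selective sampling procedure is given below: six randomly chosen instances per class seed the model, and in each AL iteration the instance closest to the current SVM hyperplane is acquired, with models noted after 50, 100 and 150 acquisitions. The closest-to-the-hyperplane criterion is one plausible reading of the simple-margin-style selection used here; the exact acquisition function of the original experiments may differ, and the function names are illustrative.

import numpy as np
from sklearn.svm import SVC

def selective_sampling(X_pool, y_pool, budget=150, seed_per_class=6,
                       checkpoints=(50, 100, 150), random_state=0):
    """Seed with a few random instances per class, then repeatedly acquire the
    pooled instance lying closest to the current SVM hyperplane. y_pool is
    consulted only for the instances that are actually 'labeled'."""
    rng = np.random.default_rng(random_state)
    labeled = []
    for cls in np.unique(y_pool):
        candidates = np.flatnonzero(y_pool == cls)
        labeled.extend(rng.choice(candidates, seed_per_class, replace=False).tolist())
    models = {}
    for acquired in range(1, budget + 1):
        clf = SVC(kernel="rbf").fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(y_pool)), labeled)
        # smallest absolute decision value = closest to the separating hyperplane
        distances = np.abs(clf.decision_function(X_pool[unlabeled]))
        labeled.append(int(unlabeled[np.argmin(distances)]))
        if acquired in checkpoints:
            # note the model retrained on everything acquired so far
            models[acquired] = SVC(kernel="rbf").fit(X_pool[labeled], y_pool[labeled])
    return models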

Fig. 12 The accuracy obtained when different worms are excluded. The W32.Deborm.Y seems to be most informative and crucial for use in the training set. The vertical bars denote 0.95 CI

Fig. 13 The interaction effect of excluding W32.Deborm.Y worm and the tested worm. The vertical bars denote 0.95 CI. The interaction is statistically significant with F(3,2100) = 44.496 and p \ 1%


W32.Deborm.Y decreased the detection performance for both the W32.Deborm.Y and W32.Sasser.D worms. This implies that some worms can be detected by learning patterns sampled from other worms.

5 Conclusion and future work

We have presented a concept for detecting unknown computer worms based on host behavior, using the SVM classification algorithm with different kernels. The results show that the use of SVM in the task of detecting unknown computer worms is possible. We used a feature-selection method that enabled us to identify the most important computer features in order to detect unknown worm activity. This sort of work is currently performed by human experts. We specifically focused on the use of active learning as a selective sampling method to increase the performance of unknown computer worm detection with minimal human effort. We rigorously analyzed the data from the large set of experiments that we performed. In the case of different conditions (in the training set and test set), the GainRatio measure for feature selection was most effective. On average, the Top20 features produced the highest results, and the RBF kernel commonly outperformed the other kernels. For detecting unknown worms, the results show that it was possible to achieve a high level of accuracy; accuracy improved as more worms were included in the training set. To reduce the number of misleading instances in the training set and improve the learning, we show that the AL approach, as a selective method, can improve the performance. Selecting only 50 examples increased the accuracy to about 90%, and to 94% when the training set contained 4 worms, in comparison to about 65% and 75%, respectively. When we selected 100 and 150 examples, no improvement was observed over the performance achieved with 50 examples. Furthermore, we analyzed the importance of using each worm in the training set. We found that a significant decrease in the performance occurred only when the W32.Deborm.Y was excluded from the training set. This can be explained by its nature, which is probably more general in its activity than are the other worms.

We conclude that selective sampling can be used to select the most informative examples from data that include misleading instances. These results are highly encouraging and show that the propagation of unknown worms, which commonly spread intensively, can be stopped in real time. The advantage of the suggested approach is the automatic acquisition and maintenance of knowledge based on inductive learning. This avoids the need for a human expert who may not always be available or familiar with generating rules or signatures.

References

1. Fosnock C (2008) Computer worms: past, present and future. Technical report, East Carolina University
2. Schultz MG, Eskin E, Zadok E, Stolfo SJ (2001) Data mining methods for detection of new malicious executables. In: Proceedings of the 2001 IEEE symposium on security and privacy, SP '01, Washington, DC, USA, p 38
3. Abou-Assaleh T, Cercone N, Keselj V, Sweidan R (2004) N-gram-based detection of new malicious code. In: Proceedings of the 28th annual international computer software and applications conference—workshops and fast abstracts, COMPSAC '04, vol 02. IEEE Computer Society, Washington, DC, pp 41–42
4. Kolter JZ, Maloof MA (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res
5. Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N (2003) Inside the Slammer worm. IEEE Secur Priv 1(4):33–39
6. Moskovitch R, Elovici Y, Rokach L (2008) Detection of unknown computer worms based on behavioral classification of the host. Comput Stat Data Anal 52(9):4544–4566
7. Menahem E, Shabtai A, Rokach L, Elovici Y (2009) Improving malware detection by applying multi-inducer ensemble. Comput Stat Data Anal 53(4):1483–1494
8. Moskovitch R, Stopel D, Feher C, Nissim N, Japkowicz N, Elovici Y (2009) Unknown malcode detection and the imbalance problem. J Comput Virol 5:295–308. doi:10.1007/s11416-009-0122-8
9. Kienzle DM, Elder MC (2003) Recent worms: a survey and trends. In: Proceedings of the 2003 ACM workshop on rapid malcode, WORM '03, ACM, New York, pp 1–10
10. Moore D, Shannon C, Claffy K (2002) Code-Red: a case study on the spread and victims of an internet worm. In: Proceedings of the 2nd ACM SIGCOMM workshop on internet measurement, IMW '02, ACM, New York, pp 273–284
11. Weaver N, Paxson V, Staniford S, Cunningham R (2003) A taxonomy of computer worms. In: Proceedings of the 2003 ACM workshop on rapid malcode, WORM '03, ACM, New York, pp 11–18
12. CERT (2000) Multiple denial-of-service problems in ISC BIND. http://www.cert.org/advisories/CA-2000-20.html. Accessed 23 July 2012
13. Lee W, Stolfo SJ, Mok KW (1999) A data mining framework for building intrusion detection models. In: Proceedings of the 1999 IEEE symposium on security and privacy, pp 120–132
14. Kabiri P, Ghorbani AA (2005) Research on intrusion detection and response: a survey. Int J Netw Secur 1:84–102
15. Barbará D, Wu N, Jajodia S (2001) Detecting novel network intrusions using Bayes estimators. In: Proceedings of the first SIAM conference on data mining
16. Zanero S, Savaresi SM (2004) Unsupervised learning techniques for an intrusion detection system. In: Proceedings of the 2004 ACM symposium on applied computing, SAC '04, ACM, New York, NY, USA, pp 412–419
17. Kayacik HG, Zincir-Heywood AN, Heywood MI (2003) On the capability of an SOM based intrusion detection system. In: Proceedings of the international joint conference on neural networks 2003, vol 3, pp 1808–1813
18. Lei JZ, Ghorbani A (2004) Network intrusion detection using an improved competitive learning neural network. In: Proceedings of the second annual conference on communication networks and services research, pp 190–197
19. Stopel D, Moskovitch R, Boger Z, Shahar Y, Elovici Y (2009) Using artificial neural networks to detect unknown computer worms. Neural Comput Appl 18:663–674


20. Hu P, Heywood MI (2003) Predicting intrusions with local linear models. In: Proceedings of the international joint conference on neural networks 2003, vol 3, pp 1780–1785
21. Dickerson JE, Dickerson JA (2000) Fuzzy network profiling for intrusion detection. In: 19th international conference of the North American Fuzzy Information Processing Society, NAFIPS, pp 301–306
22. Bridges SM, Vaughn RB (2000) Fuzzy data mining and genetic algorithms applied to intrusion detection. In: Proceedings of the national information systems security conference (NISSC), pp 6–19
23. Botha M, von Solms R (2003) Utilising fuzzy logic and trend analysis for effective intrusion detection. Comput Secur 22(5):423–434
24. Cohn DA, Ghahramani Z, Jordan MI (1995) Active learning with statistical models. Technical report, Cambridge, MA, USA
25. Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '94. Springer-Verlag New York, Inc, New York, pp 3–12
26. Roy N, McCallum A (2001) Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the eighteenth international conference on machine learning, ICML '01. Morgan Kaufmann Publishers Inc, San Francisco, pp 441–448
27. Margineantu DD (2005) Active cost-sensitive learning. In: IJCAI, pp 1622–1623
28. Lorch JR, Smith AJ (2000) Building VTrace, a tracer for Windows NT and Windows 2000. Technical report UCB/CSD-00-1093, EECS Department, University of California, Berkeley
29. Francisco A (2006) Witten IH, Frank E: Data mining: practical machine learning tools and techniques. BioMed Eng OnLine 5:1–2
30. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA
31. Mitchell TM (1997) Machine learning. McGraw-Hill, New York
32. Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
33. Rokach L, Maimon O, Arbel R (2006) Selective voting—getting more for less in sensor fusion. IJPRAI 20(3):329–350
34. Rokach L, Chizi B, Maimon O (2007) A methodology for improving the performance of non-ranker feature selection filters. IJPRAI 21(5):809–830
35. Rokach L, Romano R, Maimon O (2008) Negation recognition in medical narrative reports. Inf Retr 11(6):499–538
36. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory, COLT '92, ACM, New York, pp 144–152
37. Joachims T (1999) Making large-scale support vector machine learning practical. In: Advances in kernel methods. MIT Press, Cambridge, pp 169–184
38. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167
39. Aizerman MA, Braverman EM, Rozonoer LI (1964) Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control 25:821–837
40. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol
41. Wang X, Yu W, Champion A, Fu X, Xuan D (2007) Detecting worms via mining dynamic program execution. In: Proceedings of the third international conference on security and privacy in communications networks and the workshops, SecureComm 2007, pp 412–421
42. Masud MM, Khan L, Thuraisingham B (2007) Feature based techniques for auto-detection of novel email worms. In: Proceedings of the 11th Pacific-Asia conference on advances in knowledge discovery and data mining, PAKDD '07. Springer, Berlin, pp 205–216
43. Moskovitch R, Nissim N, Stopel D, Feher C, Englert R, Elovici Y (2007) Improving the detection of unknown computer worms activity using active learning. In: Proceedings of the 30th annual German conference on advances in artificial intelligence, KI '07. Springer, Berlin, Heidelberg, pp 489–493
44. Zhu Y, Wang X, Shen H (2008) Detection method of computer worms based on SVM. Mech Electr Eng Mag 8
45. Moskovitch R, Nissim N, Elovici Y (2009) Malicious code detection using active learning. In: Bonchi F, Ferrari E, Jiang W, Malin B (eds) Privacy, security, and trust in KDD. Lecture notes in computer science, vol 5456. Springer, Berlin, Heidelberg, pp 74–91
46. Servedio RA (2003) Smooth boosting and learning with malicious noise. J Mach Learn Res 4:633–648
47. Chen Y, Zhan Y (2009) Co-training semi-supervised active learning algorithm based on noise filter. In: Proceedings of the 2009 WRI global congress on intelligent systems, GCIS '09, vol 03. IEEE Computer Society, Washington, DC, USA, pp 524–528
48. Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the seventeenth international conference on machine learning, ICML '00. Morgan Kaufmann Publishers Inc, San Francisco, pp 839–846
49. Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res

Expert Systems with Applications 41 (2014) 5843–5857


Novel active learning methods for enhanced PC malware detection in Windows OS

Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici

Telekom Innovation Laboratories at Ben-Gurion University, Department of Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B 653, Be'er Sheva 84105, Israel

Keywords: Malware; Malicious code; Machine learning; Active learning; SVM

Abstract: The formation of new malwares every day poses a significant challenge to anti-virus vendors since anti-virus tools, using manually crafted signatures, are only capable of identifying known malware instances and their relatively similar variants. To identify new and unknown malwares for updating their anti-virus signature repository, anti-virus vendors must daily collect new, suspicious files that need to be analyzed manually by information security experts who then label them as malware or benign. Analyzing suspected files is a time-consuming task and it is impossible to manually analyze all of them. Consequently, anti-virus vendors use machine learning algorithms and heuristics in order to reduce the number of suspect files that must be inspected manually. These techniques, however, lack an essential element – they cannot be daily updated. In this work we introduce a solution for this updatability gap. We present an active learning (AL) framework and introduce two new AL methods that will assist anti-virus vendors to focus their analytical efforts by acquiring those files that are most probably malicious. These new AL methods are designed and oriented towards new malware acquisition. To test the capability of our methods for acquiring new malwares from a stream of unknown files, we conducted a series of experiments over a ten-day period. A comparison of our methods to existing high-performance AL methods and to random selection, which is the naïve method, indicates that the AL methods outperformed random selection for all performance measures. Our AL methods outperformed the existing AL method in two respects, both related to the number of new malwares acquired daily, the core measure in this study. First, our best performing AL method, termed "Exploitation", acquired on the 9th day of the experiment about 2.6 times more malwares than the existing AL method and 7.8 times more than the random selection. Secondly, while the existing AL method showed a decrease in the number of new malwares acquired over 10 days, our AL methods showed an increase and a daily improvement in the number of new malwares acquired. Both results point towards increased efficiency that can possibly assist anti-virus vendors.

1. Introduction

The number of new malicious files is constantly increasing (Schultz, Eskin, Zadok, & Stolfo, 2001). The term "malicious software" (malware) commonly refers to pieces of code, not necessarily executable files, that contain malicious functionality. Malwares are classified into four main categories mainly based on their transport mechanisms: worms, viruses, Trojans, and backdoors (Shabtai, Moskovitch, Elovici, & Glezer, 2009).

Unlike in the past, creating a malware today is relatively easy because malicious code libraries are shared between attackers. For example, on the website http://vx.netlux.org/, malware developers share tools for facilitating the generation of new malwares. Furthermore, attackers have become much more sophisticated, creating malicious code files that seem to act like benign files but are harder to detect, such as the case with Trojan horses (CERT, 1999). Additionally, attackers either detect new vulnerabilities or follow announcements about the latest vulnerabilities before creating malicious codes to exploit these vulnerabilities. Attackers also know that anti-virus vendors distribute patches to their anti-virus packages relatively slowly, providing the virus with a window of opportunity to attack and spread.

Current anti-virus packages rely mainly on signature-based

⇑ Corresponding author. Tel.: +972 086428121. detection of malwares that have already been seen. In addition, E-mail addresses: [email protected] (N. Nissim), [email protected] sets of heuristics and rules are defined to look for generic and dis- (R. Moskovitch), [email protected] (L. Rokach), [email protected] (Y. Elovici). tinguishing characteristics of malwares in unknown files. These http://dx.doi.org/10.1016/j.eswa.2014.02.053 0957-4174/Ó 2014 Elsevier Ltd. All rights reserved. 5844 N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 various methods depend on the presence of a previously detected studies focus on passive learning. We, on the other hand, focus malware. Consequently, a new unknown malware (with new on active learning and present novel AL methods that have been characteristics) will not be detected since its signature does not especially designed to enrich the signature repository of the anti- bear any similarity (including rules and heuristics) to signatures virus with new malwares in the course of several days thus ensur- in the repository. Until an update is distributed damages can be ing that the detection model is as up-to-date as possible. In a series extensive. ‘‘Slammer’’ was the fastest computer worm recorded of experiments that simulate reality, we show that AL can effi- in history. Within 10 min (Moore et al., 2003), it infected about ciently scan an ongoing stream of executable files and select the 75,000 vulnerable hosts. Another harmful and famous worm, most informative ones for manual analysis by human experts. ‘‘Code Red’’, infected 359,000 hosts within 14 h (David Moore, These files are then used to update the detection model, with the Shannon, & Claffy, 2002). malwares being used to update the signatures repository. Accord- In order to accurately and quickly detect the newest malicious ingly, both the detection model and ant-virus are being updated files, anti-virus companies devote considerable effort to preserving daily and detection capabilities improved. the updatability of their signature repository of malicious code We are aware of the limitations of static analysis in malware files. These efforts include monitoring new and unknown malicious detection but circumvent these difficulties by focusing on an AL code files sent over the Internet or using various types of honey- approach rather than specific analysis, which can be either static pots to catch the malicious files (Mokube, 2007; Provos & Holz, or dynamic. The use of AL concepts actually leverages the knowl- 2008). This mission is complicated and time-consuming, especially edge of the detection model, therefore our approach is effective when done manually (Iyatiti Mokube & Adams, 2007). in both analysis instances (Moskovitch, Nissim, & Elovici, 2009; A trivial solution to the problem of finding new malicious files Nissim, Moskovitch, Rokach, & Elovici, 2012). would be to randomly select files from the Internet in order to Our contributions in this paper are twofold: determine whether these files were malicious or not. As a result, the signature repository would be updated and the anti-virus – First, we present a framework for efficiently updating PC anti- application enhanced. However, such a strategy is obviously ineffi- viruses tools on a daily basis. cient and the chances of finding a new unknown malicious file by – Secondly, we present two AL methods for acquiring new mal- random selection is low given that the percentage of malicious files wares. 
The two methods, termed Exploitation and Combination, on the Internet amounts to about 10% (Moskovitch, Stopel, Feher, acquired a larger number of malwares daily compared to the Nissim, & Elovici, 2008; Moskovitch, Stopel, et al., 2009). existing AL method SVM-Margin. Our methods can be used In order to effectively analyze every day tens of thousands of for any domain for which an acquisition of specific class is new, potentially malicious code files, anti-virus vendors have inte- needed. grated into the core of their signature repository update activities, a component of a detection model based on machine learning The rest of the paper is organized as follows. Section 2 surveys methods (Kiem, Thuy, & Quang, 2004). This component, which is related work while Section 3 presents the dataset. Section 4 responsible for determining what files are most likely to be mali- describes the methods we applied and introduces the proposed cious, is intended to reduce the number of files sent to the human framework for improving detection capabilities. Section 5 dis- expert for labeling. This approach, however, suffers from several cusses the measures used for measuring and evaluating the pro- shortcomings. First, it has to be constantly updated with new infor- posed framework followed by a presentation of the experiment’s mative files to effectively maintain a high level of classification. design. Section 6 presents the results from the proposed approach, Since there are dozens of new unlabeled files every day, the while Section 7 discusses how the framework copes with potential problem is determining which files should be acquired and ana- attacks. Finally, Section 8 provides conclusions, discusses the lyzed by a human expert, and labeled as either malicious or benign. advantages of the described framework, and suggests future Additionally, this approach focuses on finding new malicious files research directions. in order to keep the signatures repository as updated as possible. However, this can only be done by updating the detection model 2. Background with new, informative benign files that are also crucial for accu- rately discriminating between malicious and benign files. Over the past 15 years, a number of studies have investigated In this paper we present a framework and active learning (AL) the possibility of detecting an unknown malcode based on its bin- methods for frequently updating anti-virus software by focusing ary code. Schultz et al. (2001) were the first to introduce the use of expert efforts on labeling those files that are most likely to be mal- classification methods on static representation using various sets ware or on benign files that can possibly improve the detection of features including program headers, string features, and byte model. Note that, both the anti-virus and the detection model must sequence features. They used standard classifiers that outperform be updated with new and labeled files. Using an updated detection a signature-based method (anti-virus).The use of n-grams to repre- model we can detect new malwares that are used to sustain the sent binary files was further performed by Abou-Assaleh et al. anti-virus signature repository. (2004), Kolter and Maloof (2004) and Schultz et al. (2001) using The framework that we present maintains a detection model various combinations of classifiers and feature selection methods. 
based on a classifier that is trained on a representative set of files Later Kolter and Maloof (2006) extended their work into classifying (malicious and benign) using static analysis. The advantages of malwares into families (classes) based on the functions of their the detection model is that it has generalization capabilities that respective payloads. This approach compared to previous experi- enable it to detect new unknown malwares at a high probability ments was not successful. This lack of success indicated the impor- level even before the files have infected the host computer (due tance of maintaining the training set by acquiring new executables, to static analysis). an approach that we propose in this paper. Due to the large num- While machine learning has been successfully used for inducing ber of features extracted by the n-grams, the feature selection issue malware detection models (Abou-Assaleh, Cercone, Keselj, & was specifically investigated by several researchers including Sweidan, 2004; Henchiri & Japkowicz, 2006; Jang, Brumley, & Henchiri and Japkowicz (2006) who presented a hierarchical Venkataraman, 2011; Kolter & Maloof, 2004; Kolter & Maloof, feature selection approach and Schultz et al. (2001). Working on 2006; Moskovitch Stopel, et al., 2008; Moskovitch, Stopel, et al., a very large test collection of more than 30,000 executable files, 2009; Schultz et al., 2001; Tahan, Rokach, & Shahar, 2012), most Moskovitcdh et al. investigated the problem of imbalance in N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 5845 unknown malicious code detection using n-grams (Moskovitch, Moskovitch, Nissim, et al. (2008) and Nissim et al. (2012) suc- Stopel, et al., 2008; Moskovitch, Nissim, Englert, & Elovici, 2008) cessfully used AL methods to detect unknown computer worms, and op codes (Moskovitch, Feher, et al., 2008). They subsequently enhancing the methods proposed in Moskovitch, Elovici, and extended their work (Moskovitch, Stopel, et al., 2009) and provided Rokach (2008), Moskovitch et al. (2007) and Stopel, Boger, additional insights regarding their results such as that their meth- Moskovitch, Shahar, and Elovici (2006). Using AL in such cases ods able to classify executables based on the function of their pay- was very useful in removing noisy examples and in selecting the load. Menahem, Shabtai, Rokach, and Elovici (2009) presented a most informative examples. Other studies, using AL for unknown framework for improving malware detection by applying a multi- malware detection (Moskovitch, Nissim, et al., 2008; Moskovitch, inducer ensemble that actually leverages the knowledge of several Nissim, et al., 2009), demonstrate a somewhat limited approach, different classifiers by utilizing the classification decisions of every since an attempt is made to replace an antivirus with AL, which classifier for calculating the final classification decision. In 2011, is unrealistic. Additionally, in their experimental work, these Jang et al. (2011) presented Bitshred, a system that is mainly researchers do not refer to the real and crucial need of repeatingly designed for malware similarity analysis and used for sorting and and frequently updating the detection components along time. clustering on a large-scale. They used feature hashing to reduce In this paper we provide an answer to these shortcomings. We the feature space and thus made the triage of the malware faster used the detection model in combination with the AL to assist and and more efficient. 
More recent studies have shown the advantage improve the performance of the detection model and the updat- of these automatic methods both in time and in accuracy (Nataraj ability of the anti-virus tool. This makes our framework solution et al., 2011a; Nataraj et al., 2011b; Tahan et al., 2012). much more practical and secure. Consequently our framework is Dynamic analysis, also known as behavioral analysis, is aimed at feasible for widespread use since it can efficiently enrich the signa- tracing the behavior of the file and its implications for the environ- ture repository and quickly update the anti-virus tool on a daily ment in which it is being executed. This type of analysis, which basis. To look at it in a slightly different way, the framework acts has been thoroughly explored over the past several years, presents as a consultant, suggesting which of the suspected files should be versatile methods for detecting an unknown malware based on its sent to a human expert for labeling. Additionally, we present in behavior. These methods are aimed at detecting malicious activity our paper a comprehensive set of experiments that focus on the that cannot be discovered using static analysis (discussed above). daily process of updating the detection model and the signature However, since we are focusing on static analysis we will just men- repository by intelligently acquiring malicious files. Presenting tion several notable studies based on dynamic analysis: CWSandbox the framework on a daily basis with only new files that do not (Willems, Holz, & Freiling, 2007) Rotalum´ e(Sharif, Lanzi, Giffin, & appear in the training set or the signature repository results in a Lee, 2009), Polyunpack (Rieck, Holz, Willems, Düssel, & Laskov, highly accurate experiment that closely reflects reality. 2008; Rieck, Trinius, Willems, & Holz, 2011; Royal, Halpin, Dagon, In light of the advantages and disadvantages of dynamic and Edmonds, & Lee, 2006), BitBlaze (Song et al., 2008), McBoost frame- static analysis briefly presented above, we decided to focus on work (Moser, Kruegel, & Kirda, 2007a; Perdisci, Lanzi, & Lee, 2008), the static approach in our work. Our aim was to provide an K-Tracer (Bayer, Comparetti, Hlauschek, Kruegel, & Kirda, 2009; applicable active learning framework that would be empirically Jacob, Debar, & Filiol, 2009; Kolbitsch et al., 2009; Lanzi, Sharif, & evaluated over a large set of PC files in a reasonable execution time. Lee, 2009). Static analysis methods have several advantages over dynamic 3. Dataset creation analysis. First, they are virtually undetectable – the PC file cannot detect that it is being analyzed since it is not being executed. While 3.1. Dataset collection it is possible to create static analysis ‘‘traps’’ to deter analysis, these traps can actually be used as a contributing feature for detecting We created a dataset of malicious and benign executables for malware. In addition, since static analysis is relatively efficient the Windows operating system, the most commonly attacked and fast, it can be performed in an acceptable timeframe and con- system. We acquired 7688 malicious files from the VX Heaven sequently will not cause bottlenecks. Static analysis is also simple website1 that contains several types of malicious files such as to implement, monitor and measure. Moreover, it scrutinizes the viruses, worms, Trojans and malware (probes). 
To identify the files, file’s ‘‘genes’’ and not its current behavior which can be changed we used the Kaspersky2 anti-virus software and the Windows ver- or delayed to an unexpected time in order to evade the dynamic sion of the Unix ‘file’ command for file type identification. We also analysis. An additional aspect is that static analysis can be used included the obfuscated executables that VX Heaven provides. for a scalable pre-check of malwares. Among these executables, some were obfuscated using compression Labeling examples, which is crucial for the learning process, is or encryption while others were obfuscated using both techniques. often an expensive task since it involves human experts. Active As was the case with Kolter and Maloof (2006), we were not learning (AL) was designed to reduce the labeling efforts, by informed which of the files were obfuscated and which were not. actively selecting the examples with the highest potential contri- The files in the benign set, including executable and dynamic-link bution to the learning process of the detection model. AL is library (DLL) files, were gathered from computers running the roughly divided into two major approaches: membership queries Windows operating system. The benign set contained 22,735 files (Angluin, 1988), in which examples are artificially generated that the Kaspersky anti-virus program reported as being completely from the problem space and selective-sampling (Lewis & Gale, virus-free. To the best of our knowledge none of the benign files 1994), in which examples are selected from a pool, which is were obfuscated. used in this study. Studies in several domains have successfully applied active learning in order to reduce the time and money required for 3.2. Dataset representation using text categorization labeling examples. Unlike random learning, in which a classifier randomly selects examples from which to learn, the classifier In order to detect and acquire unknown malicious codes, we actively indicates the specific examples that should be labeled implemented well-studied concepts from the field of information and which are commonly the most informative examples for the training task. SVM-Simple-Margin (Tong & Koller, 2000–2001)is 1 http://vx.netlux.org. an existing AL method considered in our experiments. 2 http://www.kaspersky.com. 5846 N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 retrieval (IR) and, more specifically, the text categorization to estimate its expected contribution to the classification task. domain. In the course of implementing our task, binary files (exec- Three feature selection measures were used. As a baseline we used utables) are parsed and n-gram terms are extracted. Each n-gram the document frequency measure DF (Eq. (2)); Gain ratio (GR) term (the sequence of bytes in the binary code) is analogous to Mitchell, 1997; and Fisher score (Golub et al., 1999). Eventually words in the textual domain. Here we present the IR concepts used the following top features 50, 100, 200 300, 1000, 1500 and 2000 in this study. Salton, Wong, and Yang (1975), presented the vector were selected using each feature selection method. space model to represent a textual file as a bag-of-words. For clar- ity, a word in our case is a binary sequence of bytes. For instance, a 4. Machine learning methods and the suggested framework 4-gram word will be in the form of 0101, 1110 among a total of 16 possibilities. After parsing the text and extracting the words, a 4.1. 
Support vector machines (SVM) vocabulary of the entire collection of words is constructed. Each of these words may appear multiple times in a document or not We employed the SVM classification algorithm using the radial at all. A vector of terms is created such that each element in the basis function (RBF) kernel in a supervised learning approach. vector represents the term frequency (TF) in the document. Eq. There are several reasons for using SVM as the classification algo- (1) shows the definition of a normalized TF in which the TF is rithm. First and foremost, SVM has been successfully used in divided by the frequency of the maximally appearing term in the detecting worms, as three previous works indicated (Masud, document with values in the range of [0–1]. Another common Khan, & Thuraisingham, 2007; Wang, Yu, Champion, Fu, & Xuan, representation is the TF inverse document frequency (TFIDF) that 2007; Yu, Xin-cai, & Hai-bin, 2008). Moreover, in the first work combines the term frequency in the document with its frequency (Wang et al., 2007) the authors noted that ‘‘SVM learns a black- in the document’s collection, as shown in Eq. (2), in which the box classifier that is hard for worm writers to interpret’’. As term’s (normalized) TF value is multiplied by the IDF = log (N/n), Moskovitch, Nissim, et al. (2009) have presented, SVM has proven where N is the number of documents in the entire file collection to be very efficient when combined with AL methods. Lastly, SVM and n is the number of documents in which the term appears. was also successfully used by Chen et al. (2012) in their ‘‘Malware term frequency Evaluator’’, a system that classifies malwares into species and TF ¼ ð1Þ maxðterm frequency in documentÞ detects zero day attacks based on information and features pro- vided by anti-virus vendors about every known malware and its TFIDF ¼ TF: logðDFÞ; ð2Þ breed. We now briefly introduce the SVM classification algorithm and where the active learning principles and their implementation as used in N this study. SVM is a binary classifier that finds a linear hyperplane DF ¼ : that separates given examples into two specific classes, yet is also n capable of handling multiclass classification (Vapnik, 1982). As Joachims (1999) demonstrated, SVM is widely known for its capa- 3.3. Data preparation and feature selection bility for handling large amounts of features, such as texts. We used the Lib-SVM implementation of Chang and Lin (2011), We parsed the binary code of the executable files using several which also supports multiclass classification. Given a training set n-gram length sliding windows. The parsing was performed on the in which an example is a vector xi = hf1,f2,...,fmi, where fi’isa raw file without any decryption or decompression. Vocabularies of feature, and labeled by yi ={1,+1}, the SVM attempts to specify 16,777,216, 1,084,793,035, 1,575,804,954 and 1,936,342,220 for a linear hyperplane with the maximal margin defined by the 3-gram, 4-gram, 5-gram and 6-gram, respectively, were extracted. maximal (perpendicular) distance between the examples of Later, the TF and TFIDF representations were calculated for each the two classes. Fig. 1 illustrates a two-dimensional space where n-gram in each file. the examples are located according to their features. The hyper- In machine learning applications, the large number of features plane splits them according to their label. 
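The byte n-gram parsing and the TF/TFIDF weighting of Eqs. (1) and (2) can be sketched as follows, assuming the vocabulary has already been reduced to the top-DF terms described in Sect. 3.3; the helper names are illustrative, and the sketch ignores the engineering needed to cope with vocabularies of billions of terms.

import math
from collections import Counter

def byte_ngrams(data: bytes, n: int = 4):
    """Slide an n-byte window over the raw binary (no unpacking or decryption)."""
    return (data[i:i + n] for i in range(len(data) - n + 1))

def tf_vector(data: bytes, vocabulary, n: int = 4):
    """Normalized term frequency: each count divided by the count of the most
    frequent term in the file, cf. Eq. (1)."""
    counts = Counter(g for g in byte_ngrams(data, n) if g in vocabulary)
    max_count = max(counts.values(), default=1)
    return {term: counts.get(term, 0) / max_count for term in vocabulary}

def tfidf_vector(data: bytes, vocabulary, doc_freq, n_docs, n: int = 4):
    """TFIDF = TF * log(N / n_term), cf. Eq. (2); doc_freq maps each term to the
    number of files in the collection that contain it."""
    tf = tf_vector(data, vocabulary, n)
    return {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary}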
in many domains, many of which do not contribute to the accuracy The examples lying closest to the hyperplane are the ‘‘support- of the detection model and may even decrease it, present a huge ing vectors’’. W, the Normal of the hyperplane, is a linear combina- problem. Moreover, in our task, a reduction in the number of fea- tion of the most important examples (supporting vectors) tures is crucial for practical reasons but must be performed while multiplied by LaGrange multipliers (alphas), as can be seen in Eq. simultaneously maintaining a high level of accuracy. This is due (5). Since the dataset in the original space cannot often be linearly to the fact that, as shown earlier, the vocabulary size may exceed separated, a kernel function K is used. SVM actually projects the billions of features, far more than can be processed by any feature examples into a higher dimensional space in order to create a selection tool within a reasonable period of time. Additionally, it is important to identify those features that appear in most of the files in order to avoid zeroed representation vectors. Thus, the features Class (+1) with the highest document frequency (DF) value (Eq. (2)) were ini- tially extracted. Based on the DF measure, two sets were selected; W the top 5500 terms and the top 1000–6500 terms. The use of a set consisting of the top 1000–6500 sets of features was motivated by the removal of stop-words, as is often done in information retrieval for common words. Later, feature selection methods were applied margin to each of these two sets. Since feature selection preprocessing is not the focus of this paper, we will describe it very briefly. We used a filter approach (Bi, Bennett, Embrechts, Breneman, & Song, 2003) to compare the performances of the different classification algo- Class(-1) rithms, where the measure was independent of any classification algorithm. In a filter approach, a measure is used to quantify the Fig. 1. An SVM that separates the training set into two classes, with a maximal correlation of each feature to the class (malicious or benign) and margin, in a two-dimensional space. N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 5847 linear separation of the examples. Note that when the kernel func- accordingly. Thus the primal problem can be defined by optimiza- tion satisfies Mercer’s condition, as Burges (1998) explained, K can tion as in Eq. (12), where C is a parameter for tuning between the be presented using Eq. (3), where U is a function that maps the importance of classification errors and the width of the margin. example from the original feature space into a higher dimensional 1 XN space while K relies on the ‘‘inner product’’ between the mappings Min wT w þ C n ð12Þ ðw;b;nÞ 2 i of examples x1, x2. For the general case, the SVM classifier will be in i¼1 the form shown in Eq. (4), where n is the number of examples in subject to the training set; K is the kernel function; alpha is the LaGrange T multiplier that defines the linear combination of the Normal W; yi½w /ðxiÞþb P 1 ni; where i ¼ 1; ...; N and yi is the class label of support vector Xi. ni P 0; where i ¼ 1; ...; N Kðx ; x Þ¼Uðx ÞUðx Þð3Þ The solution of the primal problem in Eq. (12) requires using a 1 2 1 2 ! Xn Lagrange multiplier ai for every training instance, where the condi- f ðxÞ¼signðw UðxÞÞ ¼ sign aiyiKðxixÞ ð4Þ tions of the primal problem result in a quadratic programming (QP) 1 problem with Lagrange multiplier ai. 
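As a point of reference, the detection model itself, an SVM with an RBF kernel trained on the selected n-gram features, can be set up in a few lines; scikit-learn's SVC wraps LIBSVM, the implementation the paper reports using. The parameters below are illustrative defaults rather than the settings tuned in the experiments.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def train_detection_model(X, y, C=1.0, gamma="scale", folds=5):
    """Train the RBF-kernel SVM that serves as the detection model.
    X: TF (or TFIDF) vectors over the selected top n-gram features,
    y: 1 for malicious, 0 for benign."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    accuracy = cross_val_score(clf, X, y, cv=folds).mean()
    print(f"{folds}-fold cross-validation accuracy: {accuracy:.3f}")
    return clf.fit(X, y)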
Those training instances xi for Xn which a is not zero, are called the support vectors and are there- w y U x 5 i ¼ ai i ð iÞðÞ fore the only instances that play a role in identifying the separating 1 hyperplane of the SVM. Since the primal problem in Eq. (12) is hard Two commonly used kernel functions were used. First, the poly- to solve, it can be converted into dual problem that will be easier to nomial kernel, as shown in Eq. (6), creates polynomial values of solve due to the fact that the decision variables are actually the degree p, where the output depends on the direction of the two support vectors. The dual problem can be seen in Eq. (13) where vectors, examples x1, x2, in the original problem space. Note that Q is a N N semi-definite matrix, as defined in Eq. (14) that uses a private case of a polynomial kernel, where p = 1, is actually the the kernel trick from Eq. (3), and e is a vector of all ‘‘ones’’. linear kernel. The second is the radial basis function (RBF), as 1 T T shown in Eq. (7), in which a Gaussian function is used as the RBF MaxðaÞ a Qa e a ð13Þ and the output of the kernel depends on the Euclidean distance 2 of examples x1, x2. subject to P 0 6 ai 6 C; where i ¼ 1; ...; N Kðx1; x2Þ¼ðx1 x2 þ 1Þ ð6Þ yT a ¼ 0 jjx x jj2 1 2 2r2 Kðx1; x2Þ¼e ð7Þ Qij ¼ yiyjKðxi; xjÞð14Þ We now provide a formal and rigorous explanation about the algorithm that builds a SVM classifier from a given training set D After solving the optimization problem, the SVM classifier is in the form presented in Eq. (4) as shown above. that includes N examples xi with class label yi. Each instance is in a vector form with n features fi: 4.2. Selective sampling and active learning methods Training set: D x y N ¼f i; i;gi¼1 1 n T 2 Since our framework aims to provide solutions to real problems Instance in training set in vector form: xi ¼ fi ; ...; fi 2 R it must be based on a sophisticated, fast, and selective high-perfor- Class labels: yi 2 {+1,1} mance sampling method. We compared our proposed AL method According to Vapnik (1998) original definition and formulation, to several strategies described below. the SVM classifier that will be induced from the training set D, should satisfy the following conditions in Eq. (8) where W is the 4.2.1. Random selection weight vector, b is the bias and / is a function that maps the exam- Random selection is obviously not an active learning method ples from the original problem space (called weight space as well) yet it is actually the ‘‘lower bound’’ of the selection methods that into a higher dimensional space referred to as the feature space. Eq. will be discussed. As far as we know, no anti-virus tool uses an (8) can be simply summarized as Eq. (9): active learning method for preserving and improving its updatabil- ity. Consequently, we expect that all the AL methods will perform If y ¼þ1 then wT /ðx Þþb P þ1 better than a selection process that is based on the random acqui- i i ð8Þ T 6 sition of files. If yi ¼1 then w /ðxiÞþb 1

4.2.2. The SVM-Simple-Margin active learning method (SVM-Margin) y ½wT /ðx Þþb P 1; where i ¼ 1; ...; N ð9Þ i i The SVM-Simple-Margin method (Tong & Koller, 2000–2001) Eq. (9) actually defines the construction of two parallel surfaces that (referred to as SVM-Margin in the discussion below) is directly lie at similar distances from both sides of the separating hyperplane related to the SVM classifier. As is well-known, using a kernel as depicted in Fig. 1. The separating hyperplane can be seen in Eq. function, the SVM implicitly projects the training examples into a (10), where the decision of the classifier (positive or negative class) different (usually a higher dimensional) feature space denoted by on a given example x is provided by the sign calculated from Eq. F. In this space there is a set of hypotheses that are consistent with (11): the training set. This means that these hypotheses create a linear separation of the training set. This set of consistent hypotheses is wT /ðx Þþb ¼ 0 ð10Þ referred to as the version-space (VS). From among the consistent i T hypotheses, the SVM then identifies the best hypothesis with the sign w /ðxiÞþb ð11Þ maximal margin. To achieve a situation where the VS contains As explained above, since in reality most classification problems the most accurate and consistent hypothesis, the SVM-Margin AL cannot be linearly separated, a slack variable (ni) is used in order to method selects examples from the pool (of unlabeled examples) permit misclassifications in the training phase and to compensate which reduces the number of hypotheses. Calculating the VS is 5848 N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 complex and impractical where large datasets are concerned and Malicious therefore, this method is based on simple heuristics that depend on the relationship between the VS and SVM with the maximal margin. Practically speaking, examples that lie closest to the separating hyperplane (inside the margin), are more likely to be informative and new to the classifier. Consequently, these exam- ples will be selected for labeling and acquisition. This method, contrary to ours, selects the examples according to W their distance from the separating hyperplane only to explore and acquire the informative files without relation to their classified labels. Thus it will not necessarily focus on selecting and acquiring malware instances. The SVM-Margin AL method is very fast and margin can be applied to real problems, yet, as its authors indicate (Tong & Koller, 2000–2001), this agility is achieved due to the fact that it is based on a rough approximation and relies on the assumption that the VS is fairly symmetric and that the hyperplane’s normal Benign (W) is centrally placed. It has been demonstrated, both in theory and practice, that these assumptions can fail significantly Fig. 2. The criteria by which Exploitation acquires new unknown malicious files. (Herbrich, Graepel, & Campbell, 2001). Indeed, the method may These files lie the farthest from the hyperplane and are regarded as representative actually query instances whose hyperplane does not intersect the files. VS and therefore may not be at all informative. Moskovitch, Nissim, et al. (2009) used the SVM-Margin method for detecting similar (the similarity is checked according to its representation instances of PC malware and according to their preliminary results, in the SVM kernel space). 
Consequently, only the representative the method also assisted in updating the detection model but not files that are most probably malicious are being selected. In case the anti-virus application itself. In Moskovitch, Nissim, et al. the representative file is detected as malware as a result of the (2009) the method was used for only one day-long trial and not manual analysis, then all its variants that were not acquired will for a period of several consecutive days as actually happens in real- be detected the moment the anti-virus is updated. And in case ity. Accordingly, we thought it would be interesting to compare its these files are not actually variants of the same malware, they will performance to our proposed AL methods in a daily process of be acquired the following day as long as they are still most likely to detection and acquisition experiments. Therefore, SVM-Margin is be malware after the detection model has been updated. In Fig. 2,it the base line AL method we expect to improve. can be observed that there are sets of relatively similar files (according to their distance in the kernel space). However, only 4.2.3. Exploitation: our proposed active learning method the representative files that are most likely to be malwares are Our method, which we term ‘‘Exploitation’’, is designed and being acquired. based on SVM classifier principles. It is oriented towards selecting As is well-known, the SVM classifier defines the class margins the examples that are probably the most malicious. More specifi- using a small set of supporting vectors (i.e., PC files).While the cally, it selects the examples that lie furthest from the separating usual goal is to improve the classification by uncovering (labeling) hyperplane. In our domain, detection of PC malware, this indicates files from the margin area, in our case the primary goal is to that only the files that are most likely to be malware will be acquire malwares for updating the AV. Actually the same number acquired. Our motivation for this set of actions is the desire to of files are acquired each day, but with Exploitation we attempt enhance the signature repository of the anti-virus tool with as to better explore the ‘‘malicious side’’ of the incoming files), result- many new malwares as possible. Thus for every file X that is ing sometimes in the discovery of also benign files (these files will suspected of being malicious, Exploitation rates its distance from probably become support vectors and update the classifier). In the separating hyperplane using Eq. (15) based on the Normal of Fig. 2 we can observe an example of a benign file lying deep inside the separating hyperplane of the SVM classifier that serves as the the malicious side. Contrary to SVM-Margin that explores detection model. As explained above, the separating hyperplane examples that lie inside the SVM margin, Exploitation explores of the SVM is represented by W, which is the Normal of the the ‘‘malicious side’’ more efficiently as part of an effort to discover separating hyperplane and actually a linear combination of the new and unknown malicious files that are essential for the fre- most important examples (supporting vectors), multiplied by quent update of the antivirus signature repository. LaGrange multipliers (alphas) and by the kernel function K that The distance calculation required for each instance in this assists in achieving linear separation in higher dimensions. 
Accord- method is quite fast and equal to the time it takes to classify an ingly, the distance calculation in Eq. (15) is simply done between instance in a SVM classifier. Consequently, it is a very practical example X and the Normal (W) that is presented in Eq. (5). and fast method that can provide the ranking for acquisition in a In Fig. 2, it can be observed that the files that were acquired short time frame. It is therefore applicable for products working (marked with a red circle) are those files that were classified as in real-time. malicious and have maximum distance from the separating hyper- ! Xn plane. Acquiring several new malicious files that are quite similar DistðXÞ¼ a y Kðx xÞ ð15Þ and belong to the same virus family is considered a waste of man- i i i 1 ual analysis resources since these files will probably be detected by the same signature. Thus, acquiring one representative file for this set of new malicious files will serve the goal of efficiently updating 4.2.4. Combination: a combined active learning method the signatures repository. In order to adhere to the goal of enhanc- The combination method lies between SVM-Margin and Exploi- ing the signature repository as much as possible, we also check the tation. On the one hand, the combination method will start with a similarity between the selected files using the kernel farthest-first phase in which it will acquire examples based on SVM-Margin (KFF) method suggested by Baram, El-Yaniv, and Luz (2004).By criteria in order to acquire the most informative files. using this method, we avoid acquiring examples that are quite Consequently, both malicious and benign files will be acquired. N. Nissim et al. / Expert Systems with Applications 41 (2014) 5843–5857 5849

4.2.4. Combination: a combined active learning method
The Combination method lies between SVM-Margin and Exploitation. On the one hand, the Combination method will start with a phase in which it acquires examples based on the SVM-Margin criterion in order to acquire the most informative files; consequently, both malicious and benign files will be acquired. This Exploration-type phase is important in order to enable the detection model to discriminate between malicious and benign files. On the other hand, the Combination method will then try to maximally update the signature repository in an Exploitation-type phase. This means that in the early acquisition period, in the first part of the day, SVM-Margin predominates over Exploitation; as the day progresses, Exploitation becomes predominant. The Combination was also applied over the course of the ten-day experiment and not only on a specific day. Consequently, as the days progress, the Combination will perform more Exploitation than SVM-Margin. This means that on the ith day there is more Exploitation than on the (i-1)th day. We defined and tracked several configurations over the course of several days. We found that in the relation between SVM-Margin and Exploitation, a balanced division performs better than other divisions (i.e., for 50% of the days the method will conduct more SVM-Margin, with Exploitation being implemented for the remaining time). In short, this method tries to take the best from both of the previous methods.

4.3. A Framework for improving detection capabilities

Fig. 3 illustrates the framework and the process of detecting and acquiring new malicious files while preserving the updatability of the anti-virus and the detection model. In order to receive maximal contribution from the suggested framework, it should be deployed in strategic nodes over the Internet network in an attempt to expand its exposure to as many new files as possible. This wide deployment will result in a scenario in which almost every new file will go through the framework. If the file is informative enough or is grasped as probably malicious, then it will be acquired for manual analysis. Examples of strategic nodes can be ISPs and gateways of large organizations. As Fig. 3 shows, the packets of files transported over the Internet through our framework are constructed into files {1}. These files are transformed into vector form {2}; this means that n-grams are extracted from the binary code of the files, their frequencies are calculated, and the files are then represented as vectors of n-gram frequencies (as explained above). Then, the known files are filtered out by the "known files module", which filters all the known benign and malicious files {3} (according to a white list, reputation systems (Jnanamurthy, Warty, & Singh, 2013) and the signature repository).

The remaining files, which are unknown, are then introduced to the detection model based on SVM and AL. The detection model scrutinizes the files and provides two values for each file: a classification decision using Eq. (4) and a distance calculation from the separating hyperplane using Eq. (15) {4}. A file that the AL method recognizes as informative, and which it has indicated should be acquired, is sent to an expert who manually analyzes and labels it {5}. By acquiring these informative files, we aim to frequently update the anti-virus software by focusing the expert's efforts on labeling files that are most likely to be malware or on benign files that are expected to improve the detection model. Note that informative files are those that, when added to the training set, improve the detection model's predictive capabilities and enrich the anti-virus signature repository. Accordingly, in our context, there are two types of files that may be more informative. The first type includes files for which the classifier is not confident as to their classification (the probability that they are malicious is very close to the probability that they are benign). Acquiring them as labeled examples will probably improve the model's detection capabilities. In practical terms, these files will have new n-grams or special combinations of existing n-grams that represent their execution code (inside the binary code of the executable). Therefore these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin strategy, which selects informative files, both malicious and benign, that are a small distance from the separating hyperplane.

The second type of informative files includes those that lie deep inside the malicious side of the SVM margin and that are at a maximal distance from the separating hyperplane according to Eq. (15). These files will be acquired by the Exploitation strategy and are also a large distance from the labeled files; this distance is measured by the KFF calculation that was explained in the Exploitation AL method section. These informative files are then added to the training set {6} for updating and retraining the detection model {8}. The files that were labeled as malicious are also added to the anti-virus signature repository to enrich it and preserve its updatability {7}. This updating of the signature repository also requires the distribution of an update to clients running the anti-virus application.

The framework integrates two main phases: training and detection/updating.

Fig. 3. The process of preserving the updatability of the anti-virus tool using AL methods.
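The daily flow of Fig. 3 can be summarized in a compressed sketch, shown below under several simplifying assumptions: files are represented as dictionaries with precomputed hashes and raw bytes, the selected n-gram vocabulary is given, and `select_informative` and `expert_label` are placeholders for the AL criterion and the human analyst. The numbered comments map to steps {1}-{8} described above; this is an illustration, not the authors' code.

```python
from collections import Counter

def ngram_tf_vector(file_bytes, vocabulary, n=5):
    """Step {2}: term-frequency vector of byte n-grams over a fixed vocabulary."""
    grams = Counter(file_bytes[i:i + n] for i in range(len(file_bytes) - n + 1))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in vocabulary]

def daily_cycle(stream, vocabulary, known_hashes, signatures, model,
                training_set, select_informative, expert_label):
    files = list(stream)                                       # {1} reconstructed files
    unknown = [f for f in files
               if f['sha256'] not in known_hashes]             # {3} drop known files
    vectors = [ngram_tf_vector(f['bytes'], vocabulary)
               for f in unknown]                               # {2} vectorize
    chosen = select_informative(model, vectors)                # {4} classify and rank
    for idx in chosen:
        label = expert_label(unknown[idx])                     # {5} manual analysis
        training_set.append((vectors[idx], label))             # {6} extend training set
        if label == 'malicious':
            signatures.add(unknown[idx]['sha256'])             # {7} update signatures
    model.fit([v for v, _ in training_set],
              [lbl for _, lbl in training_set])                # {8} retrain the model
    return model
```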

4.3.1. Training
A detection model is trained over an initial training set that includes 10% malicious files (MFP). After the model is tested over a stream consisting only of unknown files that were presented to it on the first day, the initial accuracy of the detection model is evaluated.

4.3.2. Detection and updating
For every unknown file in the stream, the detection model provides a classification, while the AL method provides a rank representing how informative the file is; based on this rank, the framework considers whether to acquire the file. After being selected and receiving their true labels from the expert, the informative files are added to the training set and, in case the files are malicious, the signature repository is updated as well. The detection model is retrained over the updated and extended training set, which now also includes the acquired examples that are regarded as being very informative. At the end of the day, the updated model receives a new stream of unknown files, on which it is once again tested and from which it again acquires informative files. Note that the motive is to acquire as many malicious files as possible, since such information will maximally update the anti-virus tool.

5. Measurements and evaluation

The objective of the first set of experiments was to determine the optimal settings for: the feature representation (TF or TFIDF); the n-gram representation (3, 4, 5 or 6); the best global range (top 5500 or top 1000-6500); the feature selection method (DF, FS or GR); and the top number of selected features (50, 100, 200, 300, 1000, 1500 or 2000). After determining the optimal settings, we performed a second set of experiments to evaluate our proposed acquisition process using the active learning methods.

5.1. Evaluation measures
To evaluate the capability of the framework and the AL methods to efficiently acquire new files and update the detection model, we relied upon a set of standard, widely used measures. We measured the true positive rate (TPR), which is the percentage of correctly classified positive instances, as shown in Eq. (16). The false positive rate (FPR), which is the percentage of misclassified negative instances, is also shown in Eq. (16). The total accuracy, which measures the number of correctly classified instances, either positive or negative, divided by the entire number of instances, is shown in Eq. (17).

TPR = \frac{|TP|}{|TP| + |FN|}, \qquad FPR = \frac{|FP|}{|FP| + |TN|}    (16)

Total\ Accuracy = \frac{|TP| + |TN|}{|TP| + |FP| + |TN| + |FN|}    (17)

In addition to the abovementioned measures, we present the core measure of this study, which is the number of new malwares acquired each day, that is to say, the malwares that were discovered and acquired daily and added into the training set and signature repository of the anti-virus software.

5.2. Experiment design

5.2.1. Experiment 1: determining the best configuration of the dataset and kernel
To determine the best settings of the file representation and the feature selection, we used the results and insights from previous work that we conducted on the same dataset (Moskovitch, Stopel, et al., 2009). In that study, we performed a comprehensive set of evaluation runs including all combinations of the optional settings for each of the aspects listed above. The number of runs amounted to 1536 in a 5-fold cross-validation format for all three kernels. It should be noted that the files in the test set were not in the training set, so that they presented unknown files to the classifier. Elaboration and analysis of the results of this experiment can be found in Moskovitch, Stopel, et al. (2009). Here, however, we will briefly present the best configuration and detection accuracy rate.

The best configuration included the dataset represented by: 5-grams; global top 5500; TF representation; the Fisher score feature selection method; and Top300, which is the number of features considered. The RBF kernel outperformed the others, with 93.9% detection accuracy and a low false positive rate of less than 0.03%.

5.2.2. Experiment 2: acquisition of new malwares through active learning
The objective in this main experiment was to evaluate and compare the performance of our new AL methods to the existing selection method, SVM-Margin, with regard to two tasks:
- Acquiring as many new unknown malicious files as possible in order to efficiently enrich, on a daily basis, the signature repository of the anti-virus.
- Updating the predictive capabilities of the detection model that serves as the knowledge store of the AL methods in efficiently identifying the most informative new malwares.

As assumed in previous work (Moskovitch, Stopel, et al., 2009), malwares on the Internet amount to approximately 10% of the traffic (which actually represents the test set). In this previous work (Moskovitch, Stopel, et al., 2009), it was also found that the percentage of malicious files in the training set that leads to the highest detection accuracy is 10%. In order to adhere as closely as possible to real-life conditions in our experiment, we used the guidelines proposed by Rossow et al. (2012). Consequently, we used 25,260 executables (22,734 benign, 2526 malicious), out of a total of 30,423 executables, of which 10% were malicious and 90% benign. One should note that in reality the detection model encounters both known and unknown files, but since there is no need to scrutinize known files, we filtered them out, since they would in any case be detected by the signature repository of the anti-virus or the white list of the training set. We conducted this experiment using the insights from experiment 1, the dataset configuration specifics, and the RBF kernel of the SVM.

Over a ten-day period, we compared file acquisition based on active learning methods and random selection, based on the performance of the detection model. We took the 25,260 executables (10% MFP) and created ten randomly selected datasets, with each dataset containing ten sub-sets of 2521 files representing each day's stream of new files. The 50 remaining files were used as the initial training set to induce the initial model. Note that each day's stream contained 2521 files with 10% MFP. At first, we induced the initial model by training it over the 50 known files. We then tested it on the first day's stream. Next, from this same stream, the selective sampling method selected the most informative files according to that method's criteria. The informative files were sent to a human expert who labeled them. The files were then acquired by the training set, which was enriched with an additional X new informative files. Of course, when a file was found to be malicious, it was immediately used for updating the signature repository of the anti-virus, and an update was also distributed to clients. The process was repeated over the next 9 days.
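The protocol just described can be summarized in a short simulation harness. The sketch below is illustrative rather than the original evaluation code: it assumes NumPy feature matrices, uses the ground-truth labels to simulate the human expert, and takes `select` as whichever selective sampling method (SVM-Margin, Exploitation, Combination or Random) is under test. The helper also computes the measures of Eqs. (16) and (17) each day.

```python
import numpy as np
from sklearn.svm import SVC

def tpr_fpr_accuracy(y_true, y_pred):
    """Eqs. (16) and (17); the positive class (1) denotes 'malicious'."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fn), fp / (fp + tn), (tp + tn) / (tp + fp + tn + fn)

def run_ten_days(X_init, y_init, daily_streams, select, acquisition_amount):
    """daily_streams: ten (X_day, y_day) pairs of 2521 files each (10% MFP)."""
    X_train, y_train = X_init.copy(), y_init.copy()
    model = SVC(kernel='rbf').fit(X_train, y_train)
    history = []
    for X_day, y_day in daily_streams:
        y_pred = model.predict(X_day)
        history.append(tpr_fpr_accuracy(y_day, y_pred))    # test before acquisition
        chosen = select(model, X_day, acquisition_amount)  # selective sampling
        # The expert's manual labels are simulated here by the ground truth.
        X_train = np.vstack([X_train, X_day[chosen]])
        y_train = np.concatenate([y_train, y_day[chosen]])
        model = SVC(kernel='rbf').fit(X_train, y_train)    # retrain for the next day
    return model, history
```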
The performance of the detection model was averaged over 10 runs on the 10 different datasets that were created. Each selective sampling method was checked separately for 43 different amounts of acquired files. This means that for every acquisition amount, the methods were restricted to acquiring a number of files equal to that acquisition amount, denoted as X: 10 files, 20 files and so on, up to 350 files (in steps of 10 files), and then 450, 550, 600, 650, 700, 750, 800. We also considered the acquisition of all the files in the daily stream (referred to as the ALL method), which represents an ideal but not a feasible situation of acquiring all the new files.

The experiment's various steps are as follows:
1. Inducing the initial detection model from the 50 files in the training set.
2. The detection model is tested on the stream of the first day to measure its initial performance.
3. The stream of the first day is introduced to the selective sampling method, which chooses the X most informative files according to its criteria and sends them to the virus expert, who manually analyzes them to determine their true labels.
4. These informative files are added to the training set, where the malwares are also used for updating the signature repository of the anti-virus tool.
5. Inducing an updated detection model from the extended training set for the next day.
6. The second day begins with a test of the updated detection model over the stream of the second day to measure its performance and improvement relative to the previous day.
7. The stream of the second day is introduced to the selective sampling method, which chooses the X most informative files according to its criteria and sends them to the virus expert.
8. Those informative files are added to the training set, where the malwares are also used for updating the signature repository of the anti-virus.
9. Inducing an updated detection model from the extended training set for the next day.

The process repeats itself until the 10th day.

6. Results

In order to appropriately evaluate the efficiency and effectiveness of our framework, we compared four selective sampling methods: an existing AL method, SVM-Simple-Margin (SVM-Margin); our method (Exploitation); a combination of the two previous methods (Combination); and random selection (Random). Each method was checked for all forty-three acquisition amounts, where the results, in order to neutralize the variance, were the average of 10 different folds. As previously mentioned, we also took into consideration the acquisition of all the files in the stream (the ALL method) in order to compare the performance of the methods in relation to an ideal case. Obviously ALL is not a feasible method, since anti-virus vendors cannot deal with the daily amount of new files requiring manual analysis and inspection. We depict the results for the most representative acquisition amounts: batches of 50, 250 and 800 files.

We now present the results for the core measure in this study, the number of new unknown malicious files that were discovered and finally acquired into the training set and signature repository of the anti-virus software. As explained above, each day the framework deals with 2521 new files with a 10% MFP, i.e., about 250 new unknown malicious files. Statistically, the more files that are selected daily, the more malicious files will be acquired daily. Using AL methods, we tried to improve upon the number of malicious files acquired by existing solutions. More specifically, using our methods (Exploitation, Combination), we also sought to improve upon the number of files that are acquired by the current AL method (SVM-Margin).

Figs. 4-6 present the number of malicious files obtained by acquiring batches of 50, 250 and 800 files, respectively, by each of the four methods during the course of the ten-day experiment. Note that in these three figures, the graph of Combination falls behind Exploitation, and it is therefore difficult to observe Combination's behavior. As can be seen in Fig. 4, Exploitation and Combination outperformed the other selection methods and showed an increasing trend from the second day. These methods succeeded in acquiring the maximal number of malwares from the 50 files acquired daily.

On the second day, all the AL methods acquired the fewest number of new malwares, even fewer than random selection. This can be explained by the fact that the initial detection model was trained on an initial set of only 50 labeled files that included only 6 malwares. Thus the detection model was not accurate enough to provide the AL methods with the knowledge needed to select new unknown malwares out of the daily batch of 50 files.

On the 10th day, using Exploitation, 88.5% of the acquired files were malicious (44.1 files out of 50); implementing SVM-Margin, only 59.8% of the acquired files were malicious (29.9 files out of 50). This is a significant improvement of almost 30% in unknown malware acquisition. Note that on the 10th day, using Random, only 11.8% of the acquired files were malicious.

Fig. 4. The number of malicious files acquired by the framework for the different methods through the acquisition of 50 files daily.

Fig. 5. The number of malicious files acquired by the framework for the different methods through the acquisition of 250 files daily.

Fig. 6. The number of malicious files acquired by the framework for the different methods through the acquisition of 800 files daily.

This is far less than the malware acquisition rates that Exploitation and Combination achieved. The trend is very clear: each day, Exploitation and Combination acquired more malicious files than the day before - a feature that supports the daily improvement in the capability to update the detection model, identify new malwares, and enrich the signatures repository of the anti-virus. As far as we could observe, the random selection trend was constant; there was no improvement in acquisition capabilities over the 10-day period.

As can be seen in Fig. 5, Exploitation outperformed the other selection methods and showed an increasing trend. It succeeded in acquiring the maximal number of malwares from the 250 files acquired daily.

In tracking the performance of the various methods, we observed an interesting phenomenon. Until the 4th day, all the AL methods showed improvement and an increasing number of acquired files. However, after the 4th day, the SVM-Margin AL method showed a decrease in the number of malwares acquired, while our methods, Exploitation and Combination, continued to show an increase and improvement in their acquisition capabilities. This phenomenon can be explained by the way the methods work. SVM-Margin acquires examples about which the detection model is less confident; consequently, they are considered more informative but not necessarily malicious. As was explained above, SVM-Margin selects new informative files inside the margin of the SVM. Over time, and with the improvement of the detection model towards more malicious files, it seems that the malicious files become less informative (due to the fact that malware writers frequently use upgraded variants of previous malwares). Since these new malwares might not lie inside the margin, SVM-Margin may actually be acquiring informative benign rather than malicious files. Our methods, however, especially Exploitation, are more oriented towards acquiring the most informative and probable malwares by acquiring informative files from the malicious side of the SVM margin. As a result, an increasing number of new malwares is acquired. And if an acquired benign file lies deep within the malicious side, it is still a very informative file that can be used for learning purposes and for improving the detection capabilities for the next day.

This observation could not have been made from the results of a previous study (Moskovitch, Nissim, et al., 2009) in which there was only one active learning trial. In our experiment, which sought to represent reality, there were several days of detection and acquisition. Consequently, we see that SVM-Margin is less efficient in acquiring malwares on a continuous basis. The Exploitation method outperformed the SVM-Margin method throughout the ten-day period, displaying on the 9th day the largest gap between the two methods in acquiring malwares. At this point, SVM-Margin acquired 79.2 malwares while Exploitation acquired 205.3, or 2.6 times more malwares than SVM-Margin. We can also see that Exploitation acquired 7.8 times more malwares than Random. The advantage of Exploitation over SVM-Margin and Random is very clear: Random fails to improve over time and fails to select new and informative malwares, while SVM-Margin fails to acquire more malwares daily over the course of the 10 days. These results actually emphasize the efficiency of our framework and methods. Combination performed almost as well as Exploitation, yet was the second best performer.

Table 1 and Fig. 6 present the results of the selection methods for acquiring 800 files daily. The bold red numbers in the original table represent the highest quantities acquired by each of the selective sampling methods. Almost the same trend that appeared when dealing with 250 files appears here, with the AL methods performing better than random selection, Exploitation outperforming the other methods, and SVM-Margin AL displaying a decreasing trend.

These results are not significantly better than those achieved with the acquisition of 250 files. Here the number of files acquired daily was 800, which is much more than the 250 malwares in the daily stream (due to the 10% MFP). We expected that acquiring 800 files would identify almost all of the 250 malwares in each day's stream. However, it seems that identifying and acquiring all the new malwares is not simple, even when more files are acquired daily. The cost of acquiring (and manually analyzing) more files daily should be considered in light of the benefits to be obtained by acquiring additional files in an attempt to discover more malicious files. For example, if we compare the results of Exploitation, we can see that on the 10th day, with the acquisition amount at 250 as presented in Fig. 5, Exploitation acquired 201.1 out of 250 malwares (80.44%). On the same day, when the acquisition amount stood at 800, as presented in Table 1 and Fig. 6, Exploitation acquired 230 out of 250 malwares (92.2%). This gain comes at the cost of acquiring and analyzing 550 more files daily, an increase in manual effort 3.2 times greater than when the acquisition amount stood at 250 files. The remaining benign files that were acquired because they were thought to be malicious will be discussed later.

We have shown here that our AL methods outperformed the SVM-Margin AL method and improved the capabilities for acquiring new malwares. This improvement enriched the signatures repository of the anti-virus software, which is the main goal of this study. We now show that our methods, compared to SVM-Margin, also preserve and even improve the predictive performance of the detection model, which serves as the knowledge store in the acquisition process. We present the Accuracy, TPR and FPR levels and their trends over the course of the 10 days.

The same trends were observed when measuring the accuracy rates for the three acquisition amounts (50, 250 and 800) by each of the four selection methods over the ten-day experimental setup. First and foremost, the main and significant observation is that our AL methods performed as well as the baseline existing AL method, SVM-Margin. The accuracy rates of the methods were very similar, with a negligible difference between them. When 50 files were acquired daily, SVM-Margin performed a bit better than our methods for several days within the ten-day period; but when dealing with the larger acquisition amounts of 250 and 800, our methods (Exploitation and Combination) performed a bit better than SVM-Margin during the whole ten-day period. Secondly, as expected, the daily acquisition of additional files indeed contributes to the accuracy of the detection model: the accuracy usually increases over time, and the larger the number of files acquired each day, the higher the accuracy. Lastly, the three AL methods outperformed random selection in such a way that the gap in performance between the AL methods and random selection becomes larger over the ten-day period.

Since the same trend was observed for the three acquisition amounts, we present in Table 2 only the results achieved when 800 files were acquired daily. As can be seen, the three AL methods performed almost the same, indicating that different informative files contributed similarly to the detection model's accuracy. Exploitation and Combination outperformed all the selection methods during the 10 days, with Exploitation outperforming Combination and specifically presenting slightly better accuracy than the baseline SVM-Margin method we aim to improve upon. When dealing with large acquisition amounts, every improvement in accuracy is significant, since this helps reduce the extent of the manual analysis the experts must carry out.

Table 2 also shows that the difference between our AL methods (97.83%) and random selection (95.85%) amounts to a 2% detection accuracy rate by the 10th day. This rate is especially significant when the detection model encounters, each day, dozens of files from which it should detect the newest malicious files. Note that since the dataset is imbalanced and consists of 90% benign files, it is not hard to achieve 90% accuracy, whereas every additional percentage point above 90% accuracy is a challenge. Thus, the differences in accuracy that were achieved are very significant.

While the accuracy measure provides a means for determining the overall efficiency and effectiveness of the framework, the primary task is to detect the files that are most likely to be malicious in order to use them for updating the signature repository. Accordingly, the TPR and FPR measures shed significant light on the four methods that we are examining in this paper.

Fig. 7 presents the TPR levels achieved by each of the four methods when acquiring 50, 250 and 800 files on the final day of the experiment - the tenth day - including also the acquisition of all 2521 files in the daily stream, which is the unfeasible scenario. It can be observed that the three AL methods performed almost the same (again, Combination's graph falls behind Exploitation's). SVM-Margin outperformed the other selection methods for the 50 and 250 acquisition amounts, while our AL methods (Exploitation and Combination) outperformed it for the acquisition amount of 800 files daily.

Table 1. The quantities of malicious files that the framework has acquired for its different methods through the acquisition of 800 files daily.

Day    Random    SVM-Margin    Exploitation    Combination
1      6         6             6               6
2      78        150.7         152.6           150.7
3      79.9      181.9         226.7           225.7
4      83.1      168.3         235.1           233.9
5      76.4      146.1         229.1           227.6
6      72.1      133.7         227.7           227.1
7      78.2      125.1         231.4           230.5
8      77.8      118.8         240.8           239.8
9      79.3      112.5         235.6           236.1
10     77.9      102           230.7           230.1

Table 2. The accuracy of the framework for different methods through the daily acquisition of 800 files.

Day    Random (%)    SVM-Margin (%)    Exploitation (%)    Combination (%)
1      90.05         90.05             90.05               90.05
2      91.97         92.98             93.05               92.98
3      93.43         94.89             94.93               94.97
4      94.30         95.96             96.03               96.06
5      94.94         96.50             96.55               96.58
6      95.21         96.78             96.89               96.87
7      95.51         97.16             97.31               97.28
8      95.56         97.28             97.42               97.40
9      96.04         97.58             97.69               97.69
10     95.85         97.68             97.83               97.83

Fig. 7. The TPR of the framework on the tenth day for different methods through the acquisition of 50, 250, 800 and 2521 (ALL) files daily.

In addition, the performance of the detection model improves as more files are acquired daily, so that for the acquisition amount of 800 files daily, the results indicate that by acquiring only a small and well-selected set of informative files (31% of the stream), the detection model can achieve TPR levels (85.12%) that are quite close to those achieved by acquiring the whole stream (88.14%) - represented by the single point of 2521 files. As can be observed, the trend indicates that the difference between the AL methods and ALL becomes smaller, a trend that supports the efficiency of the framework and the AL methods. This approach is viable in terms of time and money, since it dramatically reduces the number of files sent to virus experts. There is no doubt that these achievements with minimal acquisition amounts have implications in terms of time and money for the efficiency of the framework in preserving the updatability of the detection model and, ultimately, of the anti-virus tool. These factors indicate the benefits to be obtained from performing this process on a daily basis. Note that when the acquisition amount was 50 files, the TPR levels were quite low, because the detection model is induced from a small number of files: 50 files on the first day (only 5-6 malicious files and 44-45 benign files), resulting in only 550 files by the tenth day.

We can see that the difference between the AL methods and random selection becomes greater than 30% for the acquisition amount of 250 files. This difference becomes smaller (15%) yet remains significant when the daily acquisition amount is 800 files. Acquiring files depends upon sending them to human experts for final labeling. This is carried out manually and requires time and money. High TPR rates, achieved by acquiring a small set of files, indicate a capability for preserving the anti-virus updatability and achieving significant savings of time and money.

Fig. 8 presents the FPR levels of the four acquisition methods for a batch of 800 files. As can be observed, the FPR rates were low and quite similar among the AL methods. A similar decreasing FPR trend began to emerge on the 4th day. This decrease indicates an improvement in the detection capabilities and the contribution of the AL methods, in contrast to the increase in FPR rates for Random. Random had the lowest FPR until the 4th day; this can be explained by a random selection of informative files that actually improved the initial detection model, which was not very accurate at the beginning of the process due to the relatively small initial training set. However, in the long run, from the 5th day on, Random had the highest FPR levels, which indicates the selection of files that were not informative enough to properly update the detection model over the days.

Most of the time, from the 5th day on, Exploitation and Combination achieved the lowest FPR rates, a bit lower than SVM-Margin. This indicates that our methods (Exploitation and Combination) performed as well as the SVM-Margin method with regard to predictive capabilities (Accuracy, TPR and FPR), but better than SVM-Margin in acquiring a larger number of new malwares daily and in enriching the signature repository of the anti-virus. The FPR is presented for the ten-day period due to the setup of the evaluation: on each day of the acquisition iteration, we evaluated the learnt classifier, and since a set of new unknown applications (malwares and benign) is presented to the classifier each day, the FPR is not constantly decreasing, as would have happened if the classifier had been tested on the same files daily. Again, we can see a general trend in which the difference between the AL methods and Random becomes larger over the course of the ten-day period. This trend supports the efficiency of the AL framework.

Fig. 8. The FPR trends of the framework for the different methods based on acquiring 800 files daily.

7. Coping with possible attacks

Zhao, Long, Yin, Cai, and Xia (2012) recently discussed two possible attack scenarios on AL methods. In these attacks, referred to as adding and deleting, the attacker can actually pollute the unlabeled data before it is sampled by the active learner module. The results of their experiments on an intrusion detection dataset showed that these attacks disrupt the performance of the AL process and significantly decrease detection accuracy: a decrease of 16-30% due to the adding attack and a 15-34% decrease due to deleting. In our context, an adaptive attacker might "guide" the AL process and poison the classifiers by producing many malwares that contain specifically chosen n-grams by design. Consequently, the AL would acquire these files, since they would contain new and interesting n-grams which did not exist before. Attacking such a biased classifier then becomes easy: the attacker simply leaves out the chosen n-grams and creates a malicious file that can evade the detection model.

The way to counter this attack is quite simple. First, the AL process is not based on a specific node on the Internet, but is sustained by many sources of information and files. Thus, such an attack would have to flood significant parts of the Internet in order to poison the presented framework in a way that would bias the classifier. Not only is such a flood by an attacker not feasible, but it is also time-consuming, and therefore anti-virus vendors have enough time to distribute a patch against it. Secondly, since our framework tries to select the most informative files and attempts to enlarge the signature repository, it does not choose files that are similar to ones that were acquired before. Our AL methods would not acquire a full set of malicious files that are similar in specific n-grams, but only a few representative ones. Thus the framework is resilient to such attacks, and its detection capabilities remain unaffected.

Whenever one uses machine learning methods based on static analysis (especially n-grams) for detecting unknown malicious files, a question is raised about the capability of the suggested framework to detect obfuscated (including encrypted, compressed and packed) malicious files (Moser, Kruegel, & Kirda, 2007b). For PC executables (in contrast to Android mobile applications), most files, benign and malicious, are not obfuscated in the first instance. When they are obfuscated and encrypted, it may be due to an attempt to evade security mechanisms such as anti-virus packages (Nachenberg, 1997) that analyze the static data of files. Thus, obfuscated files are more likely to be malicious rather than benign and therefore become more suspicious; such files are automatically sent to the lab for deeper scrutiny. Additionally, as was already reported by Tahan et al. (2012), obfuscation is usually performed by automated tools that generate distinctive properties inside the binary code of the obfuscated file, properties that distinguish it from unobfuscated files. The FalckonEncrypter is an example of an obfuscation library that was developed by hackers, and there is no reason for genuine software companies to use an untrustworthy library developed by hackers. Consequently, the obfuscation actually helps in detecting malicious files. Moreover, as was presented in Newsome and Song (2005) and Newsome, Karp, and Song (2005), one of the desirable properties of a malicious file is its self-propagation capability. Since the malicious file is likely to have self-decompression or self-decryption commands inside its code, represented by fixed binary sequences, these commands can be used for detection. Accordingly, these sequences are taken into account in the learning process and assist in discriminating between malicious and benign files.

Two other independent studies have shown that ML methods based on static analysis of files can detect obfuscated malicious files even better than unobfuscated malicious files. Kolter and Maloof (2006) used n-gram features, as we did in our study, and reported that the detection accuracy on obfuscated malicious files was higher when using the n-gram features than when using payload functions as a feature extraction methodology. A recent study by Zhao et al. (2012) reported the same trend, namely that the TPR values among obfuscated files are higher than those among unobfuscated files, for the same abovementioned reasons.

It should be noted that packed files also contain a portion of code that is responsible for the unpacking operation, and those portions of code can be used for identifying different and informative packed files that might contain malicious code, especially in cases where those files were not packed by a popular packer. For instance, PolyUnpack (Royal et al., 2006), mentioned above, also conducted a static analysis phase of packed files in which the sequences of packed or hidden code in a malware instance can be made self-identifying.

There are also different malware families that use popular packers such as UPX/PKLite, in which a portion of the unpacking operation will be similar to that of benign files that have been packed with the same packer. However, the other part, which is packed, will contain patterns that differ from the benign ones. These patterns can be discovered with high probability when a suitable feature extraction methodology is used.

In relation to handling obfuscated files, one may ask how the framework will react if the reality is altered and many benign files are obfuscated. Likewise, what if an adaptive attacker obfuscates multiple benign programs and floods the network with them? Our answer is that this attack would probably ruin the discrimination in the binary sequences which currently exists between obfuscated malicious and benign files. This is a subject for further exploration, and we have recently begun working on this idea. Our work is based on building two complementary detection models: one will be trained on an obfuscated dataset and will discriminate between obfuscated files (benign and malicious); the other will be trained on a non-obfuscated dataset and will discriminate between non-obfuscated files (benign and malicious).

8. Discussion and conclusion

The main goal of this paper was to present a framework for efficiently updating anti-virus tools with new unknown malwares. With an updated classifier, we can detect new malwares that can be utilized for sustaining an anti-virus tool. Both the anti-virus and the detection model (classifier) must be updated with new and labeled files. This labeling is done manually by experts; thus the goal of the classification is to focus expert efforts on labeling files that are more likely to be malware or on files that might add new information about benign files.

In this paper we proposed a framework based on new active learning methods (Exploitation and Combination) designed for acquiring unknown malware. The framework seeks to acquire the most important files, benign and malicious, in order to improve classifier performance, enabling it to frequently discover new unknown malwares and enrich the signature repository of anti-virus tools with them.

Adopting ideas from text categorization, we presented a static analysis methodology for representing malicious and benign executables for detecting unknown malicious code. Two experiments were conducted. In the first and most basic experiment, we tried to find the optimal configuration of dataset and SVM kernel that would yield the best capability for detecting unknown malwares. In the second and most important experiment, we evaluated the proposed framework, comparing the performance of our two AL methods to SVM-Margin (an existing AL method) and random selection in acquiring new malicious files and updating both the signature repository of the anti-virus and the detection model.

In general, the three AL methods performed very well, with our methods, Combination and Exploitation, outperforming SVM-Margin. This fact is one of the contributions of our study: the development of new AL methods that are more suitable than current ones for acquiring unknown malwares. The evaluation of the classifier before and after the daily acquisition showed an improvement in the detection rate, and subsequently more new malwares were acquired - a fact that actually justifies the acquisition process performed by our framework.

On the 10th day with the 50-file acquisition amount, Exploitation acquired 44 new malwares, an improvement of almost 30 percentage points over SVM-Margin (30 malwares) and of about 77 percentage points over the random selection method (6 malwares). On the 10th day of acquisition with the 250 batch, Exploitation acquired 201 malwares, which is almost eight times more than the amount acquired by random sampling (26 malwares), the baseline selection method. It also acquired 2.6 times more malwares than the existing AL method, SVM-Margin (74 malwares). For both the 50 and 250 acquisition amounts, the trend over the course of the ten-day period was very clear: each day Exploitation acquired more malicious files than the day before. This is an important feature that enables the detection model to update itself daily and to identify new malwares for enriching the signature repository of the anti-virus tool. For both the 250 and 800 acquisition amounts, we observed an interesting phenomenon: in the first few days, all the AL methods showed an increase in the number of acquired files; however, for the rest of the time, the SVM-Margin AL method showed a decrease in the number of malwares acquired, while our AL methods continued to increase and improve their acquisition capabilities. Therefore, we conclude that the SVM-Margin method is less efficient in continuously acquiring new malwares and that our methods provide an essential capability for continuously acquiring new files and updating anti-virus tools. The larger acquisition amounts of 250/800 were also checked in order to measure the capability of the framework to efficiently acquire informative and malicious files in large-scale scenarios. Basically, this step demonstrated that the more the anti-virus vendor can invest in labeling, the better our framework will update the detection model and enlarge the signature repository of the anti-virus.

We can explain the better acquisition performance of Exploitation by the way it actually functions. Exploitation tries to acquire the files that are most likely malicious. In fact, it also acquires benign files that are thought to be malicious. Although these benign files are indeed confusing, they are very informative for the detection model. As a consequence, their acquisition improved the performance of the detection model more than the SVM-Margin method, which acquires files that are known to be confusing and thus contribute less to improving the detection model. We understood from this phenomenon that there are noisy benign files lying deep within the malicious side of the classification boundary. As noted, they are perhaps confusing but nevertheless very informative and valuable to the detection model (these files will probably become support vectors). Additionally, they help in acquiring malicious files that eventually update and enrich the signature repository of the anti-virus. It should be noted that these files seem to be even more informative than malwares, since they embody relevant information that was hitherto hidden: the classifier initially regarded them as malwares, and they were finally discovered to be benign. In contrast to domains in which noisy data is not crafted deliberately, in the malware detection domain there is a significant rate of noisy files that make detection much more complicated (such as malwares that are purposely designed to look like benign files). This was demonstrated in a recent and comprehensive study on worm detection (Nissim et al., 2012). It seems that our method (Exploitation) for acquiring files that mostly seem malicious induces a better detection model that will eventually also acquire confusing but valuable and informative benign files.

In future work, we are interested in implementing this framework also for Android applications, where it is not very feasible to apply advanced detection techniques on the device itself due to its resource limitations (CPU, battery, etc.). These mobile devices are therefore very dependent on anti-virus solutions, which should be frequently and efficiently updated. Very possibly our suggested AL framework could address this problem. An additional research direction lies in developing AL methods for non-executable files such as PDF files. Malicious PDF files have been found to exploit many vulnerabilities in Adobe Reader versions. This follows from the recent phenomenon of PDF files being used to perform malicious activities, especially as part of APT attacks against organizations. The AL methods will help in detecting and identifying the newest malicious PDF files that utilize zero-day exploits found in Adobe Reader.

Acknowledgements

This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. We would like to thank Clint Feher, who assisted in the dataset creation, and Yuval Fledel for meaningful discussions and comments on the efficient implementation aspects.

References

Abou-Assaleh, T., Cercone, N., Keselj, V., & Sweidan, R. (2004). N-gram-based detection of new malicious code. In Computer software and applications conference, 2004. COMPSAC 2004. Proceedings of the 28th annual international, September 28-30 (Vol. 2, pp. 41-42).
Angluin, D. (1988). Queries and concept learning. Machine Learning, 2(4), 319-342.
Baram, Y., El-Yaniv, R., & Luz, K. (2004). Online choice of active learning algorithms. The Journal of Machine Learning Research, 5, 255-291.
Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C., & Kirda, E. (2009). Scalable, behavior-based malware clustering. In NDSS (Vol. 9, pp. 8-11).
Bi, J., Bennett, K., Embrechts, M., Breneman, C., & Song, M. (2003). Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 3, 1229-1243.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.
CERT (1999). Trojan horse version of tcp wrappers.
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chen, Z., Roussopoulos, M., Liang, Z., Zhang, Y., Chen, Z., & Delis, A. (2012). Malware characteristics and threats on the internet ecosystem. Journal of Systems and Software, 85(7), 1650-1672.
Golub, T., Slonim, D., Tamaya, P., Huard, C., Gaasenbeek, M., Mesirov, J., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Henchiri, O., & Japkowicz, N. (2006). A feature selection and evaluation scheme for computer virus detection. In Sixth international conference on data mining, ICDM '06, 18-22 December (pp. 891-895).
Herbrich, R., Graepel, T., & Campbell, C. (2001). Bayes point machines. The Journal of Machine Learning Research, 1, 245-279.
Jacob, G., Debar, H., & Filiol, E. (2009). Malware behavioral detection by attribute-automata using abstraction from platform and language. In Recent advances in intrusion detection (pp. 81-100). Berlin Heidelberg: Springer.
Jang, J., Brumley, D., & Venkataraman, S. (2011). BitShred: Feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM conference on computer and communications security (CCS '11) (pp. 309-320). New York, NY, USA: ACM.
Jnanamurthy, H. K., Warty, C., & Singh, S. (2013). Threat analysis and malicious user detection in reputation systems using mean bisector analysis and cosine similarity (MBACS).
Joachims, T. (1999). Making large scale SVM learning practical.
Kiem, H., Thuy, N. T., & Quang, T. M. N. (2004). A machine learning approach to anti-virus system. In Joint workshop of Vietnamese society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on active mining, 4-7 December (pp. 61-65), Hanoi, Vietnam.

Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X. Y., & Wang, X. (2009). Effective and efficient malware detection at the end host. In USENIX security symposium (pp. 351-366).
Kolter, J. Z., & Maloof, M. A. (2004, August). Learning to detect malicious executables in the wild. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 470-478). ACM.
Kolter, J. Z., & Maloof, M. A. (2006). Learning to detect and classify malicious executables in the wild. The Journal of Machine Learning Research, 7, 2721-2744.
Lanzi, A., Sharif, M. I., & Lee, W. (2009). K-tracer: A system for extracting kernel malware behavior. In NDSS.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 3-12). Springer-Verlag New York Inc.
Masud, M. M., Khan, L., & Thuraisingham, B. (2007). Feature based techniques for auto-detection of novel email worms. In Advances in knowledge discovery and data mining (pp. 205-216). Berlin Heidelberg: Springer.
Menahem, E., Shabtai, A., Rokach, L., & Elovici, Y. (2009). Improving malware detection by applying multi-inducer ensemble. Computational Statistics and Data Analysis, 53(4), 1483-1494.
Mitchell, T. M. (1997). Machine learning. Burr Ridge, IL: McGraw Hill.
Mokube, I., & Adams, M. (2007). Honeypots: Concepts, approaches, and challenges. In Proceedings of the 45th annual southeast regional conference (ACM-SE 45) (pp. 321-326). New York, NY, USA: ACM.
Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). Inside the Slammer worm. IEEE Security and Privacy.
Moore, D., Shannon, C., & Claffy, K. (2002). Code-Red: A case study on the spread and victims of an internet worm. In Proceedings of the 2nd ACM SIGCOMM workshop on internet measurement (IMW '02) (pp. 273-284). New York, NY, USA: ACM.
Moser, A., Kruegel, C., & Kirda, E. (2007a). Exploring multiple execution paths for malware analysis. In Security and Privacy, 2007. SP '07. IEEE Symposium on (pp. 231-245). IEEE.
Moser, A., Kruegel, C., & Kirda, E. (2007b). Limits of static analysis for malware detection. In Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual (pp. 421-430). IEEE.
Moskovitch, R., Elovici, Y., & Rokach, L. (2008). Detection of unknown computer worms based on behavioral classification of the host. Computational Statistics and Data Analysis, 52(9), 4544-4566.
Moskovitch, R., Gus, I., Pluderman, S., Stopel, D., Glezer, C., Shahar, Y., & Elovici, Y. (2007). Detection of unknown computer worms activity based on computer behavior using data mining. In Computational Intelligence in Security and Defense Applications, 2007. CISDA 2007. IEEE Symposium on (pp. 169-177). IEEE.
Moskovitch, R., Stopel, D., Feher, C., Nissim, N., & Elovici, Y. (2008). Unknown malcode detection via text categorization and the imbalance problem. In Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on (pp. 156-161). IEEE.
Moskovitch, R., Nissim, N., Englert, R., & Elovici, Y. (2008). Detection of unknown computer worms activity using active learning. In The 11th international conference on information fusion, Cologne, Germany, June 30-July 3.
Moskovitch, R., Feher, C., Tzachar, N., Berger, E., Gitelman, M., Dolev, S., et al. (2008). Unknown malcode detection using OPCODE representation. In Intelligence and security informatics (pp. 204-215). Berlin Heidelberg: Springer.
Moskovitch, R., Nissim, N., & Elovici, Y. (2009). Malicious code detection using active learning. In Privacy, security, and trust in KDD (pp. 74-91). Berlin Heidelberg: Springer.
Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., & Elovici, Y. (2009). Unknown malcode detection and the imbalance problem. Journal in Computer Virology, 5(4), 295-308.
Nachenberg, C. (1997). Computer virus-antivirus coevolution. Communications of the ACM, 40(1), 46-51.
Nataraj, L., Yegneswaran, V., Porras, P., & Zhang, J. (2011). A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM workshop on security and artificial intelligence (pp. 21-30). ACM.
Nataraj, L., Karthikeyan, S., Jacob, G., & Manjunath, B. S. (2011). Malware images: Visualization and automatic classification. In Proceedings of the 8th international symposium on visualization for cyber security (p. 4). ACM.
Newsome, J., & Song, D. (2005). Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software.
Newsome, J., Karp, B., & Song, D. (2005). Polygraph: Automatically generating signatures for polymorphic worms. In Security and Privacy, 2005 IEEE Symposium on (pp. 226-241). IEEE.
Nissim, N., Moskovitch, R., Rokach, L., & Elovici, Y. (2012). Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications, 15(4), 459-475.
Perdisci, R., Lanzi, A., & Lee, W. (2008). McBoost: Boosting scalability in malware collection and analysis using statistical classification of executables. In Computer Security Applications Conference, 2008. ACSAC 2008. Annual (pp. 301-310). IEEE.
Provos, N., & Holz, T. (2008). Virtual honeypots: From botnet tracking to intrusion detection. Addison-Wesley, pp. 231-272.
Rieck, K., Holz, T., Willems, C., Düssel, P., & Laskov, P. (2008). Learning and classification of malware behavior. In Detection of intrusions and malware, and vulnerability assessment (pp. 108-125). Berlin Heidelberg: Springer.
Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4), 639-668.
Rossow, C., Dietrich, C. J., Grier, C., Kreibich, C., Paxson, V., Pohlmann, N., & van Steen, M. (2012). Prudent practices for designing malware experiments: Status quo and outlook. In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 65-79). IEEE.
Royal, P., Halpin, M., Dagon, D., Edmonds, R., & Lee, W. (2006). PolyUnpack: Automating the hidden-code extraction of unpack-executing malware. In Computer Security Applications Conference, 2006. ACSAC '06. 22nd Annual (pp. 289-300). IEEE.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
Schultz, M. G., Eskin, E., Zadok, E., & Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on (pp. 38-49). IEEE.
Shabtai, A., Moskovitch, R., Elovici, Y., & Glezer, C. (2009). Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report, 14, 16-29.
Sharif, M., Lanzi, A., Giffin, J., & Lee, W. (2009). Automatic reverse engineering of malware emulators. In Security and Privacy, 2009 30th IEEE Symposium on (pp. 94-109). IEEE.
Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M. G., et al. (2008). BitBlaze: A new approach to computer security via binary analysis. In Information systems security (pp. 1-25). Berlin Heidelberg: Springer.
Stopel, D., Boger, Z., Moskovitch, R., Shahar, Y., & Elovici, Y. (2006). Improving worm detection with artificial neural networks through feature selection and temporal analysis techniques. In Proceedings of the third international conference on neural networks, Barcelona.
Tahan, G., Rokach, L., & Shahar, Y. (2012). Mal-ID: Automatic malware detection using common segment analysis and meta-features. The Journal of Machine Learning Research, 13(1), 949-979.
Tong, S., & Koller, D. (2000-2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45-66.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data (Vol. 41). Berlin: Springer.
Vapnik, V. (1998). Statistical learning theory. New York: Springer.
Wang, X., Yu, W., Champion, A., Fu, X., & Xuan, X. D. (2007). Worms via mining dynamic program execution. In Third international conference on security and privacy in communication networks and the workshops, SecureComm (pp. 412-421).
Willems, C., Holz, T., & Freiling, F. (2007). Toward automated dynamic malware analysis using CWSandbox. IEEE Security and Privacy, 5(2), 32-39.
Yu, Z. H. U., Xin-cai, W., & Hai-bin, S. (2008). Detection method of computer worms based on SVM. Mechanical and Electrical Engineering Magazine, 8, 2.
Zhao, W., Long, J., Yin, J., Cai, Z., & Xia, G. (2012). Sampling attack against active learning in adversarial environment. In Modeling decisions for artificial intelligence (pp. 222-233). Berlin Heidelberg: Springer.

Knowl Inf Syst DOI 10.1007/s10115-016-0918-z

REGULAR PAPER

ALDROID: efficient update of Android anti-virus software using designated active learning methods

Nir Nissim1 · Robert Moskovitch2 · Oren BarAd1 · Lior Rokach1 · Yuval Elovici1

Received: 6 January 2015 / Revised: 12 October 2015 / Accepted: 11 January 2016 © Springer-Verlag London 2016

Abstract Many new unknown malwares aimed at compromising smartphones are created constantly. These widely used smartphones are very dependent on anti-virus solutions due to their limited resources. To update the anti-virus signature repository, anti-virus vendors must deal with vast quantities of new applications daily in order to identify new unknown malwares. Machine learning algorithms have been used to address this task, yet they must also be efficiently updated on a daily basis. To improve detection and updatability, we introduce a new framework, "ALDROID", and the active learning (AL) methods on which ALDROID is based. Our methods are aimed at selecting only new informative applications (benign and especially malicious), thus reducing the labeling efforts of security experts and enabling a frequent and efficient process of enhancing the framework's detection model and Android's anti-virus software. Results indicate that our AL methods outperformed other solutions, including an existing AL method and a heuristic engine. Our AL methods acquired the largest number and percentage of new malwares, while preserving the detection models' detection capabilities (high TPR and low FPR rates). Specifically, our methods acquired more than double the amount of new malwares acquired by the heuristic engine and 6.5 times more malwares than the existing AL method.

Corresponding author: Nir Nissim ([email protected]; [email protected])
Robert Moskovitch ([email protected])
Oren BarAd ([email protected])
Lior Rokach ([email protected])
Yuval Elovici ([email protected])
1 Ben Gurion University of the Negev, Beersheba, Israel
2 Columbia University, New York, NY, USA

Keywords Detection · Acquisition · Malware · Android · Active learning · Anti-virus · Application

1 Introduction

Anti-virus vendors face increasing difficulty detecting unknown malware on smartphones, and alternative approaches must be developed in order to provide an effective solution. Existing approaches to detecting unknown malware have often been based on machine learning (ML) methods that can induce a model from a set of samples (malware and benign), which later makes it possible to detect unknown malware. The detection of malwares on PCs using ML methods based on static analysis has been intensively researched over the past decade [1,16,35,36,44,48,66]. Dagon et al. [11] and Piercy [58] were the first to discuss malware for smartphones, in 2004. However, the dramatic increase in the use of smartphones also increases the possibility of cyber attacks [37,71]. More specifically, the growth of the Android market has led to increased threats to Android security over the past few years [41,65,68]. The dominance of the Android operating system most likely led to the massive creation of new types of Android malware, as reported by "Secure-List" [27], which indicated that 9000 such malwares were created during 2012. Among the well-known Android malware specially designed to perform malicious activities are: Geinimi [31], an information stealing Trojan horse [2]; the DroidDream [30] Trojan family that was discovered circulating in the official Android market; and DroidKungfu [25], which was the first to encrypt the exploit code it used to gain root access to the device. The main sources of both Android malware and legitimate Android applications are Web sites called "markets." Specifically, most Android applications are downloaded from the official market, which also goes by the name "Play Store" [6]. A comprehensive characterization of 1200 known Android malware based on a variety of relevant features and aspects was presented by Zhou and Jiang [82]. While Apple applies a rigorous automatic and manual review process [61] requiring at least two human reviewers, Android relies primarily on the Android permission system, as well as several basic security mechanisms. To detect malicious behavior, Google presented Bouncer in February 2012. Bouncer comprises machine learning algorithms based on dynamic and basic static analysis of applications uploaded to the market. However, according to [41], it was announced at Summer-Con 2012 that more than 20 ways of evading Bouncer had been discovered [56]. Currently, despite the increased amount of research aimed at providing solutions for the detection of unknown malwares, signature-based anti-viruses, with all their drawbacks, remain the commonly used solution. Signature-based anti-viruses are characterized by a delay in detection, such that it often takes between 48 and 80 days to detect new malware and update clients with the new signature [55,58]. This period of time is prohibitive in cases involving fast-propagating malware, such as the MMS worm which attacked the Symbian operating system and infected about 700,000 mobile devices in three hours [8]. Anti-virus vendors have tried to address this by attempting to reduce this amount of time, releasing multiple signature versions per day [29]. The key to this type of updatability lies in the fast and efficient discovery of new malware instances. Anti-virus vendors expend considerable effort in order to keep their signature repository up to date and maintain their ability to accurately detect malware [34].
This is a costly and complicated task that requires expertise in analyzing and labeling a vast number of applications daily [43].


To identify new unknown malware instances, anti-virus vendors must deal with large quantities of new applications on a daily basis. Some of these applications can be collected by installing agents on smartphones that upload them to a central server for analysis. This is in contrast to the PC domain, which is more likely to use honeypots for this purpose [59]. Anti-virus vendors first filter out known malware and its variants. Then known legitimate applications are filtered out utilizing white lists based on an application’s reputation and certificates. This information is available for Android from a variety of market sources and from reputation systems that indicate an application’s popularity [21]. Despite this filtering process, a large number of new unknown applications, both benign and malicious, remain. Anti-virus vendors use complementary solutions that focus on the applications most likely to be malicious in order to further reduce the number of applications that must be handled manually. Among the complementary solutions that have been proposed for efficiently discovering new Android malware are heuristic engines based on a scoring algorithm [4] and detection models based on machine learning [7,57,61–63,69]. These solutions are inefficient in the long run, because in each case the knowledge store is not frequently and actively updated. Our framework, based on active learning (AL) methods and introduced in this paper, was designed and developed to frequently update Android anti-virus software in order to address this issue. The framework makes it possible to focus expert efforts on labeling the applications that are most likely to be new malware, as well as benign applications that may improve the detection model. Both the anti-virus and the detection model must be updated with the newly labeled applications (malicious or benign). The updated detection model is used to detect new malware which enriches the anti-virus signature repository. Our framework maintains a detection model based on a classifier that is trained on a representative set of applications (malicious and benign) using static analysis. The detection model’s advantage lies in its generalization capabilities which enable it to detect new unknown malware with high probability. The use of static analysis allows for early detection which takes place even before an application has been installed on (or has infected) the host device. The novelty of our framework is its ability to frequently update the knowledge stores utilizing AL methods that prioritize selecting the most informative applications (both benign and malicious) for deeper analysis in order to update the anti-virus and detection model. These methods were previously found to be effective and to significantly enhance the detection of PC malware, both malicious executables [54] and non-executables (documents such as PDF files) [53]. As is known, the structure of Android applications differs substantially from that of the executable files within the Windows OS and the PDF documents we previously investigated. Therefore, the results of our previous studies cannot automatically be assumed to hold for the Android OS, since the detection model and AL methods used in the current study rely on different dataset characteristics related to the Android application domain, particularly in terms of the extracted features, the malware distribution, and the attack techniques detected by the detection model.
In addition, our current efforts are directed at the smartphone domain, an area in which the need for anti-virus enhancement is even greater. In contrast to PCs, smartphones are heavily dependent on anti-virus solutions, because of the inability to apply advanced detection methods (static and dynamic analysis) within the device itself. The resource limitations of smartphones necessitate the effective detection of new malware and efficiently and nimbly updated anti-virus tools. It is not feasible to analyze every new application, so our framework selects only the most probable malware for labeling. While our framework reduces the number of unknown applications that must be manually analyzed, it strengthens the detection model at the same time by also selecting informative benign applications. Thus, the framework addresses the resource limitations of the smartphone, as well as the challenge presented by the sheer volume of unknown applications created daily. Our approach is capable of providing more frequent updates to the detection model, because only a small, yet manageable, set of informative applications is sent to the human expert for inspection and subsequently acquired by the detection model. This is in contrast to heuristic approaches based on scoring algorithms or other types of detection models which are only updated periodically due to the labor-intensive process of human expert analysis. In our framework, the updated detection model efficiently updates the anti-virus signature repository which, in turn, improves the detection capabilities of the installed and widely used anti-virus software within smartphones. The advantages of our framework are meaningful: the detection model is strengthened, fewer applications require expert handling, and the anti-virus signature repository is more frequently updated, thus limiting the risk associated with malicious applications and improving the detection ability of the anti-virus within smartphones. We are fully aware of the limitations of static analysis in malware detection (as will be discussed below); however, we focus on the use of active learning (AL) concepts rather than on the analysis environment, which can be either static or dynamic, and our methods have been effective in both analysis settings [49,53,54]. To evaluate our methodology, we conducted an extensive and rigorous evaluation based on a test collection containing more than 27,000 Android applications (benign and malicious). The contributions of this paper are threefold:
1. We introduce the ALDROID framework for frequent (i.e., on a daily/weekly/monthly basis) and efficient updating of Android anti-virus tools.
2. We introduce two active learning methods, called Exploitation and Combination, that were designed for the acquisition of new Android malware. The two methods are rigorously evaluated in comparison with a conventional malware detection and selection method and a basic active learning method.
3. We present a set of general descriptive features for the detection of Android malware, features which are robust and unaffected by obfuscation or transformation evasion techniques. The features are based on the application’s static genes and not on the optional operations it might conduct. Therefore, the features are also robust against evasive techniques based on delayed malicious operations.
The rest of the paper is organized as follows. Section 2 surveys related work, and Sect. 3 presents the suggested framework and methods.
Section 4 discusses the measures used for evaluating the proposed framework, followed by a presentation of the experiment’s design. Section 5 presents the results of the proposed approach, while Sect. 6 discusses how the framework copes with potential attacks. Finally, Sect. 7 provides conclusions, discusses the advantages of the described framework, and suggests future research directions.

2 Background

Several studies have presented advanced machine learning-based methods for the detection of unknown Android malware. These methods analyze the Android application and extract sets of indicative features that are used to train a learning algorithm or rule-based engine. The methods can be roughly divided, based on their feature extraction approach, into static and dynamic. Features extracted using static analysis are derived from basic elements related to the application’s structure and logic, such as the application’s permissions, dex code, and sources, while the features extracted dynamically can represent either the application’s behaviors (function flow, call sequences, etc.) or the device itself (memory usage, traffic, code execution, power, heat, etc.). In this section, we present a comprehensive overview of notable works of both types of analysis, grouping works by their feature extraction approach for the convenience of the reader. This thorough review process has been valuable on two counts: the extensive overview provides valuable background material to the reader, and it also strengthened our work by enabling us to better understand which features to include in our feature set and improving our ability to maximally leverage the information provided by the AL methods.

2.1 Detection methods based on static analysis

Prior to the discovery of actual malware on Android, Shabtai et al. [69] explored the applicability of ML techniques to Android malware detection. Their classification task was to classify games and tools in order to demonstrate possible capabilities for the later classification of malware. Later, in [62], they expanded this work to categorize applications from a variety of categories and also malware. As permission analysis can shed light on the potential behavior and possible actions of an application, we present several examples of studies that applied static analysis and leveraged permission data in order to detect malicious applications. In the PUMA method [61], the authors suggested a new representation approach for Android applications based on the analysis of the application’s permissions found in its AndroidManifest file. They noticed several differences between malicious and benign applications, such as the fact that a malicious application is more likely to require only a single permission, while benign applications usually require two or three permissions. A method for evaluating malware risk for Android applications based on application permissions, using the SVM classifier and risk signals, was introduced in [63]. Later it was expanded to a deeper analysis [57] using probabilistic general models based on several variations of the Naïve Bayes classifier. Their approach successfully differentiated between critical permissions and less critical ones, an ability that is helpful in risk evaluation for Android malware, on relatively small datasets of a few hundred malware samples. Zhou et al. [81] proposed DroidRanger and presented two schemes: the first was aimed at the detection of new samples of known Android malware families and based on permission-based behavioral footprinting, and the second was heuristic-based filtering for the detection of inherent behavior of unknown malicious families. They focused on detecting the misuse of loading new untrusted code, such as Java binary code loaded from untrusted Web sites. In addition, they monitored the dynamic loading of native code locally, which can be an indication of malicious applications that exploit vulnerabilities in the OS kernel. Therefore, an application that tries to hide its native code in a directory other than the default directory (lib/armeabi) is considered suspicious. Their evaluation included 204,040 applications collected from five different Android markets in 2011 (including the official market), and their results revealed 211 malicious applications, including two zero-day malware that were embedded in 40 different applications. In 2012, Zhou et al. [83] proposed a system called DroidMOSS, a static analysis approach that focuses on analyzing the code within the application’s dex file. DroidMOSS aims at detecting repackaged applications and is based on a fuzzy hashing technique. By extracting varying sizes of code from applications and then applying a hash function to each piece, they computed the features that served as fingerprints and used them in order to identify similarities between applications.


In 2014, Suarez-Tangil et al. [72] presented the Dendroid system, aimed at classifying Android applications into malware families based on “code chunks” (CCs) that represent every method associated with a class within the application. Moreover, they represented the source code of the application using CFG algorithms provided by the Androguard tool [12]. This approach improved the resilience to code obfuscation over approaches based on analyzing the code’s sequence of instructions. Another direction for static analysis was presented by Luoshi et al. in 2013; their A3 system [40] looks for command and control servers that are usually used for malicious purposes, aiming at detecting the two most common malicious Android application behaviors: the collection of user information and the sending of premium-rate SMS messages. They suggested concentrating on static features based on the following steps: the extraction of IPs/URLs from the application’s decompiled code, tracing the relationships of all the function calls through the construction of a graph, looking for sensitive API calls that are related to accessing the user’s information, and checking whether the IP/URL is used for malicious behavior. They used just three malware samples (Alsalah, Sp_ntm, and Instagram), and their results demonstrated a clear relationship between sensitive APIs and the IP/URL in malware and showed that the malicious code is short and isolated, since it is actually embedded within the normal application. Each of the static analysis methods presented earlier relies on one type of feature, while the following works combine analysis from several sources, including the resources used, the dex code, and permissions. DroidAnalyzer was presented in 2014 by Seo et al. [67]. This is a static analysis tool that was designed to detect potential vulnerabilities of Android applications that are more likely to appear in malicious applications. In order to be up to date with the latest malware, they first analyzed the main features (API, keywords, commands, permissions, etc.) that represent root exploit malware by using the malware set of the Android Malware Genome Project [26]. Root exploits are the foundation for the detection of a root-level exploit, and indeed their evaluation exposed four banking applications that performed malicious activity, such as SMS charging through gaining permissions. Motivated by the desire to reduce the lead time in detecting unknown malware, [4] presented a heuristic engine based on static analysis to provide preliminary and fast analysis of the large numbers of new applications uploaded daily to the Android market. The heuristic engine is actually a rule-based system that calculates an application’s risk score based on its patterns. Their dataset contained 947 benign applications and 107 malware samples. The features were extracted from three sources of the application: the Android application package (APK) was decompressed and its resources extracted; the classes.dex was decompiled using Baksmali, which resulted in a readable format, SMALI; and the manifest.xml was analyzed. Their search for dangerous patterns and elements in all three of these sources encompassed 39 features divided into several main groups: permissions; API calls; commands; presence of binaries or Zip applications (indicating possible exploits); geographic origin of the application that might raise the risk (40% of malware originates in Russia and 35% in China); URLs [33]; size of the code; and finally a combination of rules.
Their results showed a clear difference between benign and malicious applications.

2.2 Detection methods based on dynamic analysis

Dynamically analyzing the application’s dex code has many advantages, the most significant of which is overcoming well-obfuscated code that is designed to evade static analysis methods. In AAsandbox [7], a two-step analysis of Android applications was presented, combining static and dynamic analysis. The static analysis is first employed on the disassembled dex code, and then a dynamic approach is applied which executes the extracted code using a specially crafted Android emulator sandbox. Their dataset contained one malware and 150 popular benign applications that had been downloaded from the Android market in October 2009. Their experiment showed that the AAsandbox successfully detected the malware among the benign applications. Another dynamic analysis approach involves monitoring the permissions that were actually used, rather than those that were merely declared as required, as is done by the works that employed static analysis. In 2013, Zhang et al. [77] presented VetDroid, an approach that conducts an actual analysis of permission use in Android applications. Their approach is based on dynamic analysis conducted during the resource request stage. In this process, they identify requests for protected system resources. These requests have to do with the call sites of the application that triggered the Android API. By conducting dynamic analysis, their approach reconstructs permission use behavior which represents malicious activity. Their approach was applied to 600 malware samples collected from the Malware Genome Project [26], and it was found to be efficient in identifying whether the acquired permissions were used for sensitive information access or data exfiltration. Analyzing the network traffic patterns created by running an application was suggested by Shabtai et al. [70], who attempted to detect malware that uses self-updating capabilities. Such malware is first downloaded from the Google Play market as a benign application with sufficient permissions and becomes malicious through its self-updating capabilities. They proposed the detection of this malware based on its network traffic patterns. Using their self-developed malware and 50,000 benign applications, they demonstrated that self-updating malware can be detected within a few minutes by scrutinizing its network traffic activity. A sophisticated approach was presented in 2013 by Lin et al. [39], who proposed the SCSdroid system (system call sequence droid), which was also aimed at detecting repackaged applications. Malicious code is usually repackaged into various benign applications. Therefore, in order to detect a specific family of malicious repackaged applications, it is crucial to compare it to the original benign application. The novelty of their work is that by utilizing thread-grained system call sequences monitored during runtime, the malicious behavior cannot be hidden and will appear in this sequence. In 2013, Ham and Choi [14] presented an approach based on gathering several dynamically extracted features for the detection of new Android malware using machine learning algorithms. Behavioral features were collected by their agent which was installed on the Android device. After feature selection, they used only ten features from five categories: network, SMS (send/receive SMS), CPU (usage), memory (native, Dalvik), and virtual memory (VMpeack, VmHwn).
Their dataset included 30 benign applications and five malware samples (GoldDream, PJApps, DroidKungfu2, Snake, and Angry Birds Rio Unlocker). The Random Forest classifier was found to outperform the others with 0.998 AUC, 98% TPR, and 0.01% FPR. The novelty of their feature set lies in the low level of resource consumption required to collect the features.

2.3 Static versus dynamic analysis of Android applications

Each of the analytical approaches (static and dynamic) has advantages and disadvantages. Consequently, a multi-approach framework that conducts both static and dynamic analysis might reduce the ways in which malware evades the detection model. The static analytical methods have several advantages. First, static analysis is virtually undetectable – the application cannot know that it is being analyzed, because it is not executed. While it is possible to create static analysis traps to deter analysis, these traps can actually be used to detect malware [28]. Another beneficial feature is that static analysis is relatively efficient and fast and can therefore be performed in acceptable timeframes. Consequently, it will not cause bottlenecks, as was explained by [69]. Static analysis is also easy to implement, monitor, and measure. Moreover, it scrutinizes the application’s “genes” and not its current behavior, which can be changed or postponed to a different, unknown time. An additional aspect, discussed by [4], shows that static analysis can be used to provide a scalable pre-check of malware. Lastly, when using lightweight algorithms, static analysis can also be deployed on smartphones; therefore, the approach can be used for an online and collaborative detection scheme, as was shown by [64]. Hence, once an Android worm is discovered, static analysis can also help prevent its propagation and that of similar malware to other devices. On the other hand, static analysis can be evaded by code obfuscation. Whenever one uses machine learning methods based on static analysis (especially n-grams) for the detection of unknown malicious code, a question arises regarding the ability of the suggested framework to detect obfuscated malware. Almost all of the dex code of Android applications is obfuscated to some degree. Moreover, additional techniques exist for evading static analysis in Android, such as Java Reflection and the Java Native Interface’s ability to run dynamically loaded libraries at runtime. Providing solutions to these evasion techniques is not the focus of our paper; however, we constructed our feature set so that it generally reflects the application’s possible behavior and is generically descriptive. This was done in order to ensure that the features are not affected by evasion techniques (especially code obfuscation) and that the features supply enough information about the expected behavior and abilities of the application. Therefore, our framework is designed to identify especially informative applications for further analysis. The dataset creation and collection section presents a more detailed explanation of these features. Considerable research has been conducted on the dynamic analysis approach for Android applications. However, dynamic analysis suffers from high costs and complexity. It can also be detected and avoided by the executed malcode. For example, Google’s Bouncer, based on the dynamic approach [22], was proven to be easily evaded and manipulated. After evaluating the advantages and disadvantages presented above, we chose to focus on static analysis, with the aim of developing an active learning framework capable of empirical evaluation over a large set of Android applications in a reasonable amount of time.

2.4 Active learning methods for malware detection enhancement

Both the static and dynamic analysis methods presented above require labeled applications. Labeling applications, a task which is crucial for the learning process, is often an expensive task since it involves human experts. Active learning (AL) was designed to reduce the labeling effort by actively selecting the examples with the highest potential contribution to the learning process of the detection model. AL is roughly divided into two major approaches: membership queries [3], in which examples are artificially generated from the problem space, and selective sampling [38], in which examples are selected from a pool. The selective sampling approach is used in our study. Studies in several domains have successfully applied active learning in order to reduce the time and money required for labeling examples. Unlike random learning, in which a classifier randomly selects examples from which to learn, in active learning the classifier actively indicates the specific examples that should be labeled—commonly, the most informative examples for the training task. SVM Simple-Margin [74] is an existing AL method that we considered in our experiments. Moskovitch et al. [47] and Nissim et al. [49] successfully applied AL methods to detect unknown computer worms. Using AL in such cases was very useful in removing noisy examples and in selecting the most informative examples. Other studies utilizing AL for unknown PC malware detection [45,46] demonstrate a somewhat limited approach in which an attempt is made to replace an anti-virus with ML and AL. However, this is unrealistic, particularly for smartphones, which are strongly dependent on anti-virus software. Additionally, in their experimental work, these researchers [45,46] do not refer to the real and crucial need for repeated and frequent updating of detection components over time. A study carried out in 2012 [78] presented the RobotDroid system, which uses AL on Android applications to induce an accurate detection model with minimal labeled samples and is based on a principle similar to a system that had been previously used for the detection of PC worms [49]; however, RobotDroid [78] was somewhat limited in extending the anti-virus signature repository over time. Moreover, their solution conducts detection on the smartphone itself and not in markets that contain a large number of applications, such as the Google market. Thus, their detection mechanism has less exposure to new malware that exists in the market, because it is only exposed to malware that has been actively downloaded to the smartphones on which their detection model is deployed. On the other hand, recent work by Nissim et al. [54] presented a framework and novel AL methods that were specially designed to update the detection model with informative files and to enrich both it and the signature repository of the anti-virus with new malware daily. They successfully applied this concept to enhance the detection of executable malware in the Windows OS. Additionally, Nissim et al. [50,53] presented the ALPD framework, based on AL methods and aimed at enhancing the detection of malicious PDF files targeted at organizations.

3 Suggested framework and methods

3.1 The framework

Figure 1 illustrates the framework and the process of detecting and acquiring new Android malware while preserving the updatability of the anti-virus and the detection model. In order to derive the maximal contribution from the suggested framework, the framework should be deployed at strategic nodes over the Internet, including application markets, in an attempt to expand its exposure to as many new applications as possible. This wide deployment will result in a scenario in which almost every new application will go through the framework. If an application is informative enough or is perceived as malicious or likely to be malicious, then it will be acquired for manual analysis. Examples of strategic nodes include ISPs and gateways of large organizations, and significant application markets include Google Play and official and unofficial Chinese markets. Figure 1 presents each step of the process (denoted by {step}). Further explanation is provided below. After the applications for analysis are collected from several primary resources, such as official and unofficial application markets, Web sites and forums, and mobile devices (using installed agents) {1}, they are transformed into vector form {2} as explained in the dataset preparation subsection. The vectored applications are filtered by what we refer to as the “Known Applications Module,” which filters out all the known benign and malicious applications (according to the white lists, anti-virus tools, and current signature repository) {3}. The remaining applications, which are unknown, are then introduced to the detection model based on SVM and AL.


Fig. 1 The process of preserving the updatability of Android anti-virus software using AL methods

We employed the support vector machine (SVM) classification algorithm [9] in a supervised learning approach, since SVM has been successfully used to detect PC worms, as indicated in three previous works [42,75,76]. Moreover, Wang states that “SVM learns a black-box classifier that is hard for worm writers to interpret.” In addition, SVM has been proven to be very efficient at enhancing malicious PDF file detection when combined with AL methods [53] and at enhancing PC malware detection [54]. The latter study also provided a comprehensive explanation of SVM principles. The SVM classification algorithm was also used for the detection of Android malware [78]. In our work, we used the Lib-SVM implementation of [10] which also supports multiclass classification. Drawing upon its acquired knowledge, the SVM-based detection model (utilizing the AL method) scrutinizes the unknown applications and provides a classification decision and a measure indicating the distance from the separating hyperplane for each application {4} using Eq. 1 (to be presented next). This distance represents the certainty of the detection model regarding the specific application. An application that the AL method perceives as informative is acquired and sent to an expert for analysis. The expert then provides its true label {5}. By acquiring these informative applications, we aim to frequently update the anti-virus software by focusing the experts’ efforts on labeling applications that are most likely to be malware or on benign applications that are expected to improve the detection model. Accordingly, in our context there are two types of applications that may be more informative. The first type includes applications for which the classifier is not confident regarding their classification (the probability that they are malicious is very close to the probability that they are benign). Acquiring these applications as labeled examples is expected to improve the model’s detection capabilities. Therefore, as will be explained in the following section, these applications will probably lie inside the SVM margin and consequently will be acquired by the Exploration strategy, which selects informative applications, both malicious and benign, that are close to the separating hyperplane.

The second type of informative applications includes those that lie deep inside the malicious side of the SVM margin and that, according to Eq. 1, are found at a maximal distance from the separating hyperplane. These applications will be acquired by the Exploitation strategy and are also found at a distance far from the labeled applications. This distance is measured by the KFF calculation that is explained in the Exploitation AL method subsection. The informative applications are then added to the training set {6} for updating and retraining the detection model {8}. The applications that were labeled as malicious are also added to the anti-virus signature repository, enriching it and preserving its updatability {7}. The last step of updating the signature repository also requires the distribution of an update to clients utilizing the anti-virus application. The framework consists of two integrated phases: Phase one: training. A detection model is trained over an initial training set that includes a substantial number of applications, including a percentage of malware (predefined by the study) which will be discussed below. Next, the detection performance of the detection model is evaluated on a stream of unknown applications that are presented to it on the first day, and its ability to update the anti-virus tool with new malicious applications is also measured.

Phase two: detection and update. The detection model classifies each application in the stream, which also contains the predefined percentage of malware, which will be discussed below. The AL method calculates a rank representing the extent to which an application is informative and potentially malicious. This rank is used by the framework to decide whether the application should be selected and sent to the virus expert for deeper analysis. Once the informative applications have been selected and labeled, they are used to update the detection model’s training set (benign and malicious) as well as the anti-virus signature repository (when the application is malicious). The detection model is retrained over the updated and extended training set, which now also includes the acquired examples which are presumed at this point to be very informative. At the end of each day (or any other baseline period defined by the framework’s administrator), the updated model receives a new stream of unknown applications on which the updated model is tested and from which the updated model acquires informative applications. Note that the motivation is to acquire as much malware as possible, since such information will maximally update the anti-virus tool and help protect Android devices.

3.2 Selection strategies for new malware acquisition

3.2.1 Conventional selection method: heuristic rule-based engine

The heuristic engine is actually a rule-based system that calculates a risk score for an application based on its patterns. According to a major anti-virus company (one of our authors also works as a virus expert at this company) with whom we are cooperating, heuristic engines are one of the most basic and common complementary solutions for discovering new malware instances from among the vast quantities of unknown applications received daily by the company. The rules are usually characterized by a domain expert (malware analyst), who also decides (for every rule) what value should be added to the total score of the application. In Android applications, the rules might refer to an application’s country of origin, whether or not a GPS is being used, whether there is access to a contact list, etc. For comparison purposes, we implemented a heuristic engine based on specifications presented in recent work [4].

The main traits we implemented for the scoring algorithm were: the size of the dex file, if smaller than 70 KB; a certificate from China or Russia; the permissions SEND_SMS, RECEIVE_SMS, RECEIVE_MMS, INSTALL_PACKAGES, and CALL_PHONE; a URL in the dex strings, with legitimate URLs such as Google, Admob, and Yahoo being filtered out; the presence of strings such as “pm install,” “cmwap,” “cmnet,” and “10086” in the dex strings or in the manifest strings; and the presence of a Zip file, APK file, or binary file in the APK. The algorithm has proven to be quite effective, as the average score for the benign set was 1.7, while the average score for the malware set was 8.7. This engine was designed to assign a higher risk score in cases in which the application is more likely to be malware. Therefore, the engine selects those applications that have higher risk scores, which might indicate how malicious the application is. It should be noted that the scoring algorithm will give a high score to certain benign applications that are legitimate, such as text managers/senders, certain system applications such as the default application installer, and small and simple apps with ads. It will also give a low score to specific malware families that originated in Europe or contain highly obfuscated strings in the dex file or those that display extremely rare functionality, such as the “No Permission Remote Shell” applications.
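To make the scoring logic above concrete, the following minimal Python sketch shows how such a rule-based risk score might be computed. It is an illustrative assumption rather than the engine used in our experiments or in [4]; the trait weights, the input dictionary format, and the function name heuristic_risk_score are hypothetical.

# Minimal sketch of a rule-based heuristic risk scorer.
# The rule set and weights below are illustrative assumptions, not the exact
# engine described in [4]; each matched rule adds a fixed value to the risk score.

RISKY_PERMISSIONS = {"SEND_SMS", "RECEIVE_SMS", "RECEIVE_MMS",
                     "INSTALL_PACKAGES", "CALL_PHONE"}
WHITELISTED_URL_SUBSTRINGS = ("google", "admob", "yahoo")
SUSPICIOUS_STRINGS = ("pm install", "cmwap", "cmnet", "10086")


def heuristic_risk_score(app):
    """app is a dict summarizing static traits of one APK (hypothetical format)."""
    score = 0
    if app.get("dex_size_kb", 0) < 70:                      # very small dex file
        score += 1
    if app.get("certificate_country") in ("CN", "RU"):      # certificate origin
        score += 2
    score += len(RISKY_PERMISSIONS & set(app.get("permissions", [])))
    for url in app.get("dex_urls", []):                     # URLs found in dex strings
        if not any(w in url.lower() for w in WHITELISTED_URL_SUBSTRINGS):
            score += 1
    if any(s in app.get("dex_strings", "") for s in SUSPICIOUS_STRINGS):
        score += 2
    if app.get("contains_embedded_apk_or_binary", False):   # possible exploit payload
        score += 2
    return score


# Applications with the highest scores would be sent for manual analysis.
example = {"dex_size_kb": 45, "certificate_country": "CN",
           "permissions": ["SEND_SMS", "INTERNET"],
           "dex_urls": ["http://unknown-host.example/cmd"],
           "dex_strings": "pm install payload.apk",
           "contains_embedded_apk_or_binary": True}
print(heuristic_risk_score(example))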

3.2.2 Baseline active learning method: SVM Simple-Margin (Exploration)

The Simple-Margin method [74] is directly related to the SVM classifier. As is known, by using a kernel function, SVM implicitly projects the training examples into a different (usually higher-dimensional) feature space denoted by F. In this space, there is a set of hypotheses that are consistent with the training set, meaning that they create a linear separation of the training set. This set of consistent hypotheses is called the version space (VS). From among the consistent hypotheses, SVM then identifies the best hypothesis, the one with the maximal margin. The motivation behind the Simple-Margin AL method is to select examples from the pool that reduce the number of hypotheses, in order to achieve a situation where the VS contains the most accurate and consistent hypothesis. Calculating the VS is complex and impractical where large datasets are concerned; therefore, this method relies on a simple heuristic based on the relation between the VS and the SVM with the maximal margin. Practically speaking, instances that lie closest to the separating hyperplane (inside the margin) are more likely to be informative and new to the classifier, and these instances will be selected for labeling and acquisition. This method, contrary to our strategy, selects instances according to their distance from the separating hyperplane only in order to explore and acquire informative applications, without regard to their classified label. Thus, it will not necessarily focus on selecting and acquiring malware instances. The Simple-Margin AL method is very fast and can be applied to real problems; yet, as was indicated by its authors [74], this agility is achieved due to the fact that it provides a rough approximation and relies on the assumptions that the VS is fairly symmetric and that the hyperplane’s Normal (W) is centrally placed. It has been demonstrated, both in theory and in practice, that these assumptions can fail significantly [17]. Therefore, the method may actually query instances where the hyperplane does not even intersect the VS, and such instances might not even be informative. The Simple-Margin method was used for detecting PC malware instances [45], and according to preliminary results, the method also assisted in updating the detection model but not the anti-virus software. However, in [45] the method was used for only one trial, not in a process that consisted of several sequential days. Given this, we thought it would be interesting to compare its performance against our proposed AL methods through a daily process of detection and acquisition of unknown Android malware. We refer to it in our experiments as “Exploration.”
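The selection criterion of the Exploration (Simple-Margin) method can be illustrated with the following minimal Python sketch, which assumes a scikit-learn SVC classifier and synthetic data; it is not the implementation used in our experiments, and the explore function name and budget parameter are illustrative.

# Minimal sketch of the Exploration (SVM Simple-Margin) selection criterion:
# pick the unlabeled applications closest to the separating hyperplane,
# regardless of their predicted label. The data here is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(1000, 10))          # unlabeled daily stream

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)


def explore(clf, X_pool, budget):
    """Return indices of the `budget` pool instances nearest the hyperplane."""
    dist = np.abs(clf.decision_function(X_pool))   # |signed distance from hyperplane|
    return np.argsort(dist)[:budget]


selected = explore(clf, X_pool, budget=50)
print(selected[:10])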

3.2.3 Active-learning-based selection strategies for new malware acquisition

Since the goal of our framework is to provide solutions to real problems, our selective sampling method must be fast. We evaluate this by comparing our proposed AL methods with the two above-mentioned existing baseline methods: the heuristic rule-based engine and the Exploration AL method.

Exploitation We designed “Exploitation” based on the SVM classifier’s principles. It was recently introduced and successfully applied to enhance the detection of PC malware using AL concepts [54]. It is aimed at the selection of examples that are potentially the most malicious; thus, it selects the examples that lie farthest from the separating hyperplane of the support vector machine. In our investigation of the detection of Android malware, only the applications that are most likely to be malware will be acquired. Our motivation for this set of actions is the desire to enhance the signature repository of the anti-virus tool with as much new malware as possible. Thus, for every unknown application x, Exploitation rates its distance from the separating hyperplane using Eq. 1, based on the Normal of the separating hyperplane of the SVM classifier that serves as the detection model. As explained above, the separating hyperplane of the SVM is represented by W, the Normal of the separating hyperplane, which is a linear combination of the most important examples (the support vectors) multiplied by Lagrange multipliers (alphas) and by the kernel function K, which assists in achieving linear separation in higher dimensions. Accordingly, the distance calculation in Eq. 1 is simply performed between example x and the SVM Normal (W). In Fig. 2, for example, the applications that were acquired (marked with a red circle) are those applications that were classified as malicious and have the maximal distance from the separating hyperplane. Acquiring several new malicious applications that are quite similar and belong to the same virus family is considered a waste of manual analysis resources, since these applications will probably be detected by the same signature.

Fig. 2 The criteria by which Exploitation acquires new unknown malicious applications. These applications lie the farthest from the hyperplane and are regarded as representative applications

Thus, acquiring one representative application for this set of new malicious applications will serve the goal of efficiently updating the signature repository. In order to adhere to the goal of enhancing the signature repository as much as possible, we also check the similarity between the selected applications using the kernel farthest-first (KFF) method suggested by Baram et al. [5]. By using this method, we avoid acquiring examples that are quite similar (the similarity is checked according to their representation in the SVM kernel space). Consequently, only the representative applications that are most probably malicious are selected. If the representative application is detected as malware as a result of the manual analysis, all of its variants that were not previously acquired will be detected the moment the anti-virus is updated. In cases in which these applications are not variants of the same malware, they will be acquired the following day, as long as they are still most likely to be malware after the detection model has been updated. Figure 2 displays sets of relatively similar applications (according to their distance in the kernel space), and thus, only the representative applications that are most likely to be malware are acquired. It is well known that the SVM classifier defines the class margins using a small set of support vectors (i.e., Android applications). While the usual goal is to improve the classification by uncovering (labeling) applications from the margin area, in our case the primary goal is to acquire malware to be used for updating the anti-virus. Actually, the same number of applications is acquired each day, but with Exploitation we attempt to better explore the “malicious side” of the incoming applications, sometimes resulting in the discovery of additional benign applications (these applications will probably become support vectors and update the classifier). In Fig. 2, we can observe an example of a benign application lying deep inside the malicious side (in the blue triangle). Contrary to Exploration, which explores examples that lie inside the SVM margin, Exploitation explores the “malicious side” more efficiently as part of an effort to discover new and unknown malicious applications that are essential for the frequent update of the anti-virus signature repository. The distance calculation required for each instance in this method is quite fast and is comparable to the time it takes to classify an instance with an SVM classifier. Consequently, it is a very practical and fast method that can provide an acquisition ranking in a short timeframe. It is therefore applicable for products working in real time.

Dist(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x)    (1)
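The following simplified Python sketch illustrates the Exploitation criterion of Eq. 1 together with a KFF-style diversity filter. It assumes a fitted binary SVC whose positive decision values correspond to the malicious class, and it uses an RBF-kernel similarity as a stand-in for the kernel farthest-first check of Baram et al. [5]; the threshold, gamma value, and function names are illustrative assumptions, not our exact implementation.

# Simplified sketch of the Exploitation criterion with a KFF-style diversity
# filter. Assumes a fitted binary SVC whose positive decision values mean
# "malicious"; the RBF-kernel similarity used here stands in for the kernel
# farthest-first check of Baram et al. [5].
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def exploit(clf, X_pool, budget, diversity_threshold=0.95, gamma=0.1):
    scores = clf.decision_function(X_pool)          # Eq. 1: signed distance from the hyperplane
    candidates = np.argsort(scores)[::-1]           # deepest in the malicious side first
    selected = []
    for idx in candidates:
        if scores[idx] <= 0:                        # stop once no longer classified malicious
            break
        if selected:
            # skip candidates too similar (in kernel space) to already-selected ones
            sim = rbf_kernel(X_pool[[idx]], X_pool[selected], gamma=gamma).max()
            if sim > diversity_threshold:
                continue
        selected.append(idx)
        if len(selected) == budget:
            break
    return selected


# Example (reusing clf and X_pool from the Exploration sketch above):
# malicious_picks = exploit(clf, X_pool, budget=50)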

It was found that Exploitation outperformed the Exploration method in the PC malware domain [54]. In the current study, we evaluate its performance in the Android malware domain and compare it to the performance of the conventional heuristic engine mentioned earlier as well. Combination The combination method (termed Combination) lies between Exploration and Exploitation. On the one hand, the combination method begins with a phase in which it acquires examples based on the Exploration criteria in order to acquire the most informative applications; thus, both malicious and benign applications will be acquired. This Exploration-type phase is important in order to enable the detection model to discriminate between malicious and benign applications. On the other hand, the combination method tries to maximally update the signature repository in an Exploitation-type phase. This means that in the early acquisition period, during the first part of the day, Exploration predominates over Exploitation; as the day progresses, Exploitation becomes predominant. The combination of Exploration and Exploitation was also applied over the course of the ten days and not only within one specific day. As the days progress, the combination performs more

Exploitation than Exploration, which means that on the ith day there is more Exploitation than on the (i − 1)th day. We defined and tracked several configurations over the course of several days. We found that a combination based on a balance between Exploration and Exploitation performs better than other divisions (i.e., for 50% of the days the method conducts more Exploration, and Exploitation is implemented during the remaining time). In short, this method tries to take the best from both of the previous methods.
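As an illustration of how the acquisition budget could shift from Exploration to Exploitation over the ten-day period, consider the following minimal Python sketch. The linear schedule and the daily budget of 50 applications are illustrative assumptions and do not reproduce the exact configurations we tracked.

# Minimal sketch of the Combination schedule: the share of the daily acquisition
# budget given to Exploitation grows from day to day, while the remainder is
# spent on Exploration. The linear schedule below is an illustrative assumption.
def combination_budgets(day, total_days, daily_budget):
    """Split the daily budget between Exploration and Exploitation for a given day."""
    exploitation_share = day / float(total_days)      # early days -> mostly Exploration
    n_exploit = int(round(daily_budget * exploitation_share))
    n_explore = daily_budget - n_exploit
    return n_explore, n_exploit


for day in range(1, 11):
    print(day, combination_budgets(day, total_days=10, daily_budget=50))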

4 Evaluation

Using a set of standard and widely utilized measures that cover our experimental objectives, we evaluated the ability of the proposed methods to efficiently acquire new malicious applications. The first objective was to validate the capability of the induced detection model to detect unknown malware based on the static features that we extracted. After validation, we carried out our main experiment for the evaluation of our proposed acquisition process using the various selection methods discussed above, as well as our AL methods.

4.1 Evaluation measures

For evaluation purposes, we measured the true-positive rate (TPR) [13,15], which is the percentage of positive instances classified correctly. The false-positive rate (FPR) is the percentage of negative instances misclassified. We also used the total accuracy measure, which is the number of correctly classified instances, either positive or negative, divided by the total number of instances. In addition, in the main experiment we measured the number of malicious applications that were acquired daily for labeling and updating both the anti-virus signature repository and the detection model’s training set. As was discussed earlier, the aim here is to maintain and improve the updatability of the anti-virus tool. This can only be done by daily enriching its signature repository with as many new malicious applications as possible. For that purpose, we present two simple measures: the number of malwares acquired (NOMA) and the percentage of malwares acquired (POMA). Note that each new malware acquired contributes to the updatability of the anti-virus by creating new signatures or updating existing ones. This is due to the fact that the set of applications presented to the framework consists of unknown applications that were not previously detected as malware by either the current anti-virus signatures or the white lists.
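The measures described above can be computed directly from the predicted and true labels, as in the following minimal Python sketch; the array-based input format and function names are illustrative assumptions, and the example values are synthetic.

# Minimal sketch of the evaluation measures described above; y_true/y_pred are
# label arrays (1 = malicious), and acquired_labels are the true labels of the
# applications selected for acquisition on a given day.
import numpy as np


def detection_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)        # true-positive rate
    fpr = np.mean(y_pred[y_true == 0] == 1)        # false-positive rate
    acc = np.mean(y_pred == y_true)                # total accuracy
    return tpr, fpr, acc


def acquisition_measures(acquired_labels, total_malware_in_stream):
    noma = int(np.sum(acquired_labels))                        # number of malwares acquired
    poma = 100.0 * noma / total_malware_in_stream              # percentage of malwares acquired
    return noma, poma


print(detection_measures([1, 0, 1, 0], [1, 0, 0, 0]))
print(acquisition_measures([1, 1, 0, 1], total_malware_in_stream=243))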

4.2 Experiment design

Our experiments were designed to answer the following research questions:
1. Is it possible to utilize the large number of unknown applications to efficiently update an Android anti-virus tool on a daily basis using our suggested framework and the designated AL methods?
2. Do the new AL methods suggested in this paper perform better than existing methods, such as conventional heuristic engines or the Simple-Margin AL method, regarding the number of new Android malware acquisitions?


4.3 Dataset creation and collection

4.3.1 Dataset collection

We constructed a dataset of viable and known malware/benign applications. The malware samples were downloaded from “found in the wild” repositories such as the Contagio mobile malware blog [20], the Malware Genome Project [26], and various third-party application stores based primarily in Russia, China, and Europe. We verified the malware by using a known vendor anti-virus solution provided by the AVG Company [24] and cross-checked it using VirusTotal [32] queries. We collected some 10,000 malware samples covering Trojans, intrusive ads, premium SMS senders, information stealers, native payloads, and Prankware, such as DroidDream, DDlight, BaseBridge, Geinimi (samples from all generations), DroidKungFu (A, B, C), Legacy Native (A and B), and various RuFraud samples; 30,000 benign samples were collected from Google Play between April 2012 and January 2013, and these were verified with the same tools. Notably, in Android it is possible to use Java Reflection and even execute malicious code that is dynamically loaded at runtime. An example of this may be seen in the GingerMaster malware, which is capable of downloading an “update” containing a binary exploit for the currently installed version. Another malware, called Plankton, can load Java code dynamically. The LeNa malware contains a binary file (.so) which is a malicious loadable binary code library. In our dataset, we included these and other types of malware that demonstrate this unique type of malicious behavior.

4.4 Feature extraction

We used modified versions of two open-source tools. The Axmlprinter project [19] extracts features from the Android manifest; the AndroidManifest.xml defines the application’s essential parts, including services, activities, content providers, receivers, the package name, the list of declared permissions, and the list of requested permissions. The Smali/Baksmali project [18] was used to extract features from the classes.dex file that contains the transformed Java bytecode to be executed. The current version [23] used by the public is Apps 3.5. For our research, 229 features were selected as generally indicative of the application’s behavior and properties—the full list can be found in the appendix section. Features extracted from the Android manifest file included a list of 207 permission features as Booleans. We defined summarizing features such as the number of activities, permissions, services, receivers, and content providers declared in the manifest, and also seven more Boolean features, such as whether the metadata tag is present; this feature might indicate that there is additional data which can be found in the AndroidManifest. General features were extracted from the classes.dex file, including features that represent the obfuscation rate of the application, since we assumed that heavily obfuscated applications have a much higher percentage of short class names [73], while slightly obfuscated apps have longer, “human readable” class names. Thus, we calculated the percentage of class names of various lengths: names shorter than 5 characters, between 5 and 10, between 10 and 15, and 15 and above. We also collected the number of implemented activities, services, receivers, and content providers from the DEX data. We added generic Boolean features to discover the use of OpenGL, Bouncycastle, and probable crypto-library calls. This was done to complement the

high-level functionality of the application not covered by the permissions list, adding six more features. Furthermore, we calculated derivative discrepancy features—the differences between the services, activities, and receivers declared in the manifest and those implemented in the classes.dex file—adding three more features. We normalized these features to prevent bias. The normalization was not performed on the Boolean attributes and class name percentages, since normalizing Boolean values is meaningless and the name length percentage is already normalized within the instance. These feature combinations and configurations are used routinely by mobile malware researchers to assist the decision-making process and quickly assess whether the examined application is worth another look, but the patterns defining the need for deeper analysis are usually not well defined and depend on the researcher’s experience. The use of ML techniques to streamline the process is very important and helps to reduce the need for costly human work. The patterns malware researchers look for can be as simple as a single Android permission, such as CALL_PHONE, SEND_SMS, or CHANGE_APN_SETTINGS, all of which are known to be used by malware to change an important system setting or cause the user to incur expenses. Combinations of permissions can be indicative of possible malware behaviors; for instance, INTERNET combined with READ_CONTACTS or FINE_LOCATION means that the application has the potential to send the user’s contact list to a remote server or send the user’s geolocation to a central tracking server. For certain malware strains, the permission pattern is distinctive enough for complete identification of the malware, such as the Eastern European/Russian premium SMS malware. Moreover, we chose our features to reflect possible modifications made to the application’s dex file; for instance, comparing the number of activities, services, and BroadcastReceivers declared in the manifest with those actually implemented in the code can indicate that a modification was made to the application code and/or manifest, signifying a possibly Trojanized application. We also chose an obfuscation degree indicator as one of our features. The obfuscation degree on its own does not signify that the application is malicious, but when high obfuscation is combined with certain permissions, we can identify whether the application is Trojanized.
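To illustrate two of the feature groups described above—the class-name length distribution used as a rough obfuscation indicator and the manifest/dex discrepancy features—the following minimal Python sketch can be used. The input formats (a list of class names and two component-count dictionaries) are illustrative assumptions; the actual extraction in our work relies on the Axmlprinter and Smali/Baksmali tools mentioned above.

# Illustrative sketch of two feature groups: the class-name length distribution
# (a rough obfuscation indicator) and the manifest/dex discrepancy features.
# Input formats are assumptions; real extraction would parse the APK with the
# tools described in the text.
def class_name_length_features(class_names):
    """Percentage of class names in each length bucket: <5, 5-10, 10-15, >=15 characters."""
    buckets = [0, 0, 0, 0]
    for name in class_names:
        n = len(name)
        if n < 5:
            buckets[0] += 1
        elif n < 10:
            buckets[1] += 1
        elif n < 15:
            buckets[2] += 1
        else:
            buckets[3] += 1
    total = max(len(class_names), 1)
    return [100.0 * b / total for b in buckets]


def discrepancy_features(manifest_counts, dex_counts):
    """Differences between components declared in the manifest and implemented in dex."""
    return {key: manifest_counts.get(key, 0) - dex_counts.get(key, 0)
            for key in ("activities", "services", "receivers")}


print(class_name_length_features(["a", "bc", "MainActivity", "PaymentService"]))
print(discrepancy_features({"activities": 5, "services": 2, "receivers": 1},
                           {"activities": 7, "services": 2, "receivers": 3}))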

4.4.1 Creating a representative dataset

We created an experimental dataset in order to evaluate our framework’s general performance in new Android malware acquisition and the updatability process of the signature repository, as well as to address the two research questions presented in Sect. 4.2. Our dataset is available1 on the Web and can be easily downloaded for experimental and research use. We present several observations which shaped our thinking in this research. As was elaborated in the background section, many works induced detection models using machine learning algorithms for the detection of Android malware. Yet none of them took into account the actual percentage of Android malware, a parameter which strongly affects the results and might even produce biased results that are not compatible with real scenarios. These include the work by Zhou et al. [83], who proposed a permission-based behavioral footprint scheme in 2012 aimed at detecting new samples of known Android malware families. In this case, the group of malware represented 0.2 to 0.47% of the total applications in several markets; however, this percentage is not representative, since there are many other types of malware that were not detected by their method. In another work, Lin et al. [39] evaluated SCSdroid with a malware rate of 30%, using 100 benign applications and 49 malicious applications.

1 www.ise.bgu.ac.il/engineering/PersonalWebSite1main.aspx?id=VMidijue.
Alternatively, Zhao [78] evaluated RobotDroid with a malware rate of 31% and used 90 infected applications (infected by only three different malware) and 200 benign applications. Similarly, Ham and Choi [14] evaluated their method with a malware rate of 14%, including only five malware samples and 30 benign applications. Nevertheless, several works that deal with PC malware detection have taken this percentage into consideration and carefully constructed datasets that imitate the reality of the malware detection domain as much as possible. These are usually imbalanced datasets that are compatible with the real-life scenario, consisting of a small percentage of malware and a large number of benign files. According to [44] and [54], the malware rate among PC executables is about 9–10%, and the current reality is that most programs, services, applications, and sites that run on PCs are being transformed and adapted to mobile devices. Additionally, as noted above, we have cooperated with a major anti-virus company which confirmed, based on company statistics, that 9% is more or less a correct estimate of the percentage of malware existing in the Android market. We do not claim that this represents the correct percentage of malware in the market, and we also recognize that the unofficial markets are more likely to include greater percentages of malware than the official market. The methodology presented is also suitable for smaller percentages of malware in the daily stream (e.g., 1 and 0.2%), only necessitating that the detection model’s initial training set be adjusted to this new percentage. The imbalance problem of malware percentage was studied comprehensively by Moskovitch et al. [44], who presented the problem of imbalance in the domain of PC malware detection and the effect of the malware rate on detection capabilities. Therefore, we tried to simulate real-life conditions as much as possible, and we adjusted our large dataset to the malware rate of 9% (discussed previously), taking into account the problem of imbalance in malware detection addressed in [44]. Hence, we adjusted our large dataset to reflect real-life conditions, reducing it from 40,000 applications to 27,500 applications (91%, totaling 25,000, benign Android applications, and 9%, totaling 2500, Android malware). The detection model encounters many known and unknown applications daily within the labs of anti-virus vendors. Since there is no need to scrutinize known applications, we filtered them out as either detected by the anti-virus’ current signatures or found in the white list of known and legitimate applications. We thus conducted a pure experiment based on unknown instances, rendering the task facing our AL methods much more challenging. To demonstrate the incremental daily arrival of new suspected applications, we split the entire dataset of 27,500 Android applications into ten different sets that were randomly selected and represented the number of new applications the detection model and anti-virus would presumably encounter over a ten-day period. Each of the ten datasets contained 2670 applications representing the daily stream of new applications with a 9% malicious application rate. The remaining 800 applications, used as the initial training set to induce the initial model, also contained 9% malicious applications.
This process was repeated ten times utilizing different randomizations and data splittings in order to decrease the variance. Finally, the results of these ten runs were averaged and presented as the final results.
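To make the data-preparation step concrete, the following is a minimal sketch of how such an imbalanced split into an initial training set and ten daily streams might be produced. It is not the code used in our experiments; the function name, the use of NumPy, and the handling of rounding are assumptions introduced here, so the composition of each stream only approximates the figures reported above.

```python
import numpy as np

def build_daily_streams(benign_ids, malware_ids, n_days=10,
                        init_size=800, day_size=2670,
                        malware_rate=0.09, seed=0):
    """Sketch (not the authors' code): shuffle application IDs and cut them into
    one initial training set and n_days daily streams, each containing roughly
    `malware_rate` malware."""
    rng = np.random.default_rng(seed)
    benign_ids = list(rng.permutation(benign_ids))
    malware_ids = list(rng.permutation(malware_ids))

    def draw(pool, k):
        if len(pool) < k:
            k = len(pool)  # rounding may leave the last stream marginally short
        return [pool.pop() for _ in range(k)]

    sets = []
    for size in [init_size] + [day_size] * n_days:
        n_mal = round(size * malware_rate)
        sets.append(draw(malware_ids, n_mal) + draw(benign_ids, size - n_mal))
    return sets[0], sets[1:]  # initial training set, list of daily streams

# Proportions reported above: 25,000 benign and 2,500 malicious applications
initial_set, daily_streams = build_daily_streams(np.arange(25000),
                                                 np.arange(25000, 27500))
```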

4.5 Stages of the experiment

As noted, a daily stream contained 2670 applications, of which 9% were malwares. We first induced the initial model by training it on the 800 known applications (also with a malware rate of 9%). We then tested it on the first day's stream of new applications. Next, from the first day's stream, the selective sampling method selected the most informative applications. The informative applications were sent to a human expert for true labeling, and these applications were later acquired by the framework into the training set. Whenever an application was determined to be malware, it was immediately used to update the signature repository of the anti-virus, and an update was made available for distribution to clients. Following the first day's acquisition, the detection model was updated for the next day, and this process was repeated over the next ten days. For each day, the performance of the detection model was averaged over ten runs on the ten different datasets that were created. Each selection method (heuristic engine, Exploration, Exploitation, and Combination) was separately checked, and we considered several acquisition amounts in order to scrutinize each selection method's performance over various daily acquisition amounts (50, 100, and 245 applications), representing 2, 3.7, and 9.1% of the daily stream's applications. The 245 acquisition amount (9.1%) is equal to the malicious file percentage (MFP) of Android malwares included in the daily stream (9.1% malware), and thus this amount enabled us to evaluate the selection methods and their ability to acquire all of the malware presented on a daily basis. We considered the two lower percentages in order to better understand the relationship between the number of applications acquired and the labeling effort required and to determine whether there is a point at which the value added by acquisition maxes out. To summarize, the experiment consisted of the following steps for each selection method:

1. The initial detection model is induced based on the 800 labeled applications in the initial training set.
2. The detection model is tested on the daily stream.
3. The daily stream is introduced to the selective sampling method, which selects the X most informative applications and asks for their true labels from the security expert.
4. These applications are labeled by the experts and added to the training set, and the malicious applications are used to update the signature repository.
5. The anti-virus signature repository is updated, and an updated detection model is induced from the enhanced training set for use on the next day.
6. Steps 2-5 repeat until the tenth day.
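The daily loop above can be outlined in code. The following is an illustrative sketch rather than the implementation used in our experiments: a scikit-learn SVC stands in for the SVM-based detection model (the kernel choice is an assumption), and select_informative and expert_label are hypothetical placeholders for the selection method under evaluation and for the human analyst, respectively.

```python
import numpy as np
from sklearn.svm import SVC

def run_daily_acquisition(X_init, y_init, daily_streams, select_informative,
                          expert_label, budget=245):
    """Sketch of steps 1-6. `select_informative(model, X_day, budget)` returns the
    indices of the applications sent to the expert; `expert_label(indices)` returns
    their true labels (1 = malware, 0 = benign). Both are placeholders."""
    model = SVC(kernel="rbf").fit(X_init, y_init)                # step 1: initial model
    X_train, y_train = X_init, y_init
    signatures, daily_predictions = [], []

    for X_day in daily_streams:
        daily_predictions.append(model.predict(X_day))           # step 2: test on the stream
        chosen = select_informative(model, X_day, budget)        # step 3: selective sampling
        labels = expert_label(chosen)                            # step 4: expert labeling
        X_train = np.vstack([X_train, X_day[chosen]])
        y_train = np.concatenate([y_train, labels])
        signatures.extend(i for i, y in zip(chosen, labels) if y == 1)  # step 5: new signatures
        model = SVC(kernel="rbf").fit(X_train, y_train)          # step 5: retrain for next day
    return model, signatures, daily_predictions                  # step 6: loop covers all days
```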

5 Results

In this experiment, we had two goals: the first was enriching and updating the signature repository of the anti-virus vendors with as many new malwares as possible. However, the first goal can only be achieved through the second goal, which is efficiently updating the detection model with new and informative applications. The AL methods use the detection model to identify new malware, thus directly contributing to reaching the first goal. Section 5.1 discusses the first goal, and in Sect. 5.2, we present the results for the second goal.

5.1 Number and percentage of malware acquired daily

We now present the results of each selective sampling method using the NOMA and POMA measures presented earlier. Note that we also compared the performance of the AL methods to the performance of the heuristic engine. As was previously explained, on the first day the detection model was induced from a randomly selected training set of 800 applications which contained 72 malwares. The process of updating the detection model starts from the second day; however, we depict the NOMA and POMA measures for all ten days.
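As a reminder of how the two measures are computed, the following one-function sketch (the helper name is ours, not part of the framework) makes them explicit:

```python
def noma_poma(acquired_labels):
    """NOMA: number of malwares among the applications acquired on a given day;
    POMA: their percentage of that day's acquisition (labels: 1 = malware)."""
    noma = sum(acquired_labels)
    poma = 100.0 * noma / len(acquired_labels)
    return noma, poma

# e.g., a day on which all 50 acquired applications are malware:
# noma_poma([1] * 50) -> (50, 100.0)
```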

Figure 3 shows the number of malwares acquired (NOMA) using each of the selection methods when the acquisition amount is limited to 50 applications daily, which can represent, for example, the available labeling resources. The Exploitation method performed well: on each of the ten days, all 50 acquired applications were new malwares, much more malware than was acquired by the other selection methods. Additionally, the Exploitation method's performance was stable throughout the ten days due to the daily update of the detection model. Not only does the framework enrich the signature repository with new malware instances (discovered among the 50 acquired), but it also enhances the detection model with 50 newly labeled informative applications. Those 50 applications are added to the training set on which the model is trained, improved, and updated toward new and informative applications. The percentage of malware acquired (POMA) indicates that all of the applications acquired by the Exploitation AL method were malware (50 out of 50 applications, or 100%). Note that, as was previously mentioned, there were no "known" malwares in the daily stream, since they were initially filtered out by the known application module. The advantage of Exploitation over the Exploration method can be explained as follows. Exploration mainly focuses on an efficient update of the detection model by acquiring informative applications lying inside the SVM's margin, regardless of the application's class (benign or malicious). This acquisition strategy updates the detection model so that it is potentially more accurate after the acquisition; however, it is less effective at acquiring as many malwares as possible for the anti-virus update. In addition, it can be observed that there is a decreasing trend in the number of malicious applications acquired (NOMA) daily by Exploration. Because of the detection model's daily update, fewer malwares lie within the SVM's margin every day, and the malware is therefore less informative for the Exploration AL method, an observation that first emerged during our experiment after several days of detection and acquisition. Therefore, Exploration is less efficient at the task of continuously acquiring malwares. The largest gap in the amount of malware acquired occurs on the tenth day: Exploitation acquired 50 malwares (POMA = 100%), while Exploration acquired only 33 (POMA = 66%), a difference of 34% in POMA, or more than 1.5 times more malware acquired.
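The contrast between the two acquisition strategies can be sketched in terms of SVM decision values. This is an illustrative sketch, not our implementation: it assumes a fitted scikit-learn SVC whose decision_function is positive on the malicious side, and the Combination method's exact scheduling of the two criteria is deliberately not reproduced here.

```python
import numpy as np

def exploration_select(model, X_day, budget):
    """Exploration: pick the applications closest to the separating hyperplane
    (inside or near the SVM margin), regardless of their predicted class."""
    distance = np.abs(model.decision_function(X_day))
    return np.argsort(distance)[:budget]

def exploitation_select(model, X_day, budget):
    """Exploitation: pick the applications lying deepest inside the malicious side
    of the hyperplane, i.e., those most likely to be new malware (assuming the
    positive class corresponds to malware)."""
    score = model.decision_function(X_day)
    return np.argsort(score)[::-1][:budget]

# The Combination method mixes the two criteria over the acquisition period;
# its exact schedule is not reproduced in this sketch.
```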

Fig. 3 The number of Android malwares acquired (NOMA) for updating the anti-virus and the detection model through the acquisition of 50 applications daily

The Combination method presented a trend of continuous improvement over the ten-day experiment; it started by acquiring 33 malwares on the second day, and on the tenth day it acquired 48. In the acquisition of 50 applications daily, Combination was positioned between Exploration and Exploitation in terms of performance and contributed both to a significant improvement in the performance of the detection model and to the amount of malware acquired daily. On the one hand, Exploration showed an increasing trend over the ten-day period in the performance of the detection model (Figs. 6, 7), yet showed a decreasing trend in the NOMA and POMA measures. In contrast, Exploitation did not show an increasing trend in the performance of the detection model (Figs. 6, 7), yet it maintained its almost perfect performance in the NOMA and POMA measures (Fig. 3). The better performance of Exploitation and Combination over the heuristic engine can be easily explained. The instability of the heuristic engine is reflected in the observation that it has neither a clear nor a constant trend over the time period. This is due to the fact that it is a rule-based method which lacks frequent updates to its knowledge base. On the one hand, the heuristic engine selects malwares to enrich the anti-virus; on the other hand, it is not enhanced daily with additional new knowledge, because it uses a fixed set of deterministic rules which rate how malicious an application is. During the first six days, the heuristic engine performed worse than or equal to Exploration, which was the poorest performing AL method; during the next four days, its performance improved relative to its earlier days. The Exploitation method outperformed the heuristic engine throughout the ten-day period. On the fourth day, when we observed the largest gap in the amount of malwares acquired, the Exploitation method acquired 50 malwares (POMA = 100%), while the heuristic engine acquired 30 (POMA = 60%), a difference of 40% in POMA, or 1.6 times more malwares acquired. Figure 4 shows the NOMA measure for each of the selection methods when 100 applications are acquired daily. These results shed additional light on the performance achieved by the methods when 50 applications were acquired daily. With an acquisition amount of 100 applications, not only did the Exploitation method continually outperform the other methods during the ten-day period, it also showed improvement in the NOMA and POMA measures, reflecting this method's strength in updating the detection model with informative applications.

Fig. 4 The number of Android malwares that were acquired (NOMA) for updating the anti-virus and the detection model through the acquisition of 100 applications daily

The updating process consequently improved the Exploitation method's ability to identify more malwares than the amount identified the day before, because each day new applications are proposed to the framework, and it continuously acquires new and relevant knowledge for the discovery of new malware from the detection model. On the tenth day, it acquired 97 malwares out of 100 (POMA = 97%). Based on this trend, the method would probably achieve even higher rates of NOMA and POMA over the course of time. The Exploitation method outperformed the heuristic engine during the ten-day period. On the eighth day, which contained the largest difference in the number of malwares acquired, the Exploitation method acquired 94 malwares (POMA = 94%), while the heuristic engine acquired 66 (POMA = 66%), a difference of 28 malwares (28% in POMA), or 1.4 times more malwares acquired. The heuristic engine showed the same trend of instability demonstrated in the acquisition of 50 applications. However, its overall performance improved when a larger amount (100 applications) was considered; for instance, on the tenth day it had a POMA of 71% compared to 62% when 50 applications were acquired daily. For the Exploration method, the decreasing trend of the NOMA measure over the ten-day period was even more significant when a larger number of applications was acquired. On the second day, it had a POMA of 57%, whereas on the tenth day it had a POMA of only 35% (NOMA = 24), which is even less than it achieved on the tenth day when 50 applications were acquired daily (POMA = 57% and NOMA = 29). The Combination method showed an increasing trend in the NOMA and POMA measures, so that on the tenth day it achieved a NOMA of 93 and a POMA of 93%, thus making the Exploitation and Combination AL methods better than the Exploration method for the main goal of this study. These results uphold the claim that with each day fewer malwares lie inside the SVM's margin and are thus less informative for the detection model, supporting the need for our proposed AL methods. Figure 5 shows the NOMA measure for each of the selection methods through the acquisition of 245 applications daily. These results present the largest difference in performance between our proposed AL methods (Exploitation and Combination) and the existing solutions (the Exploration AL method and the heuristic engine). It can be observed that after ten days of malware acquisition and detection model updates, both Exploitation and Combination acquired 207 new malwares (POMA = 85%), significantly more malwares than the amount acquired by the existing solutions.

Fig. 5 The number of Android malwares that were acquired (NOMA) for updating the anti-virus and the detection model through the acquisition of 245 applications daily

A NOMA of 207 is more than twice the number of new malwares acquired by the heuristic engine (NOMA = 103) and more than 6.5 times the number acquired by the Exploration method (NOMA = 32).

5.2 Predictive performance and update of the detection model

Here we present the predictive performance of the detection model in terms of accuracy, TPR, and FPR. It is essential to maintain and improve the performance of the detection model, since the AL selection methods rely upon its knowledge in order to identify the most probable malware with which to update the signature repository. The detection model, in turn, relies on the selection methods' performance, since the acquired applications (those selected by the selection methods) are also used to update the detection model; each selection method selects applications differently, and this in turn affects the performance of the detection model. The detection model's performance was measured using the aforementioned measures over the ten-day experiment. Note that since the heuristic engine does not rely on the detection model and relies only on a set of rules, the applications it selects are used only to update the signature repository; they are not used to update the heuristic engine's knowledge store on a daily basis, which is instead updated only infrequently. Thus, in this subsection only the AL methods are compared and presented, because only these methods rely on the detection model and provide it with new informative applications for a daily update. The same trends were observed for each of the three AL selection methods over the ten-day experiment when measuring the accuracy rates for the three acquisition amounts (50, 100, and 245). First and foremost, the most significant observation is that our AL methods performed almost as well as the baseline existing AL method, Exploration. The accuracy rates of the methods were very similar, with negligible differences between them. When 50 applications were acquired daily, Exploration performed a little better than our methods for several days during the ten-day period, but when the larger acquisition sizes of 100 and 245 were used, our proposed methods (Exploitation and Combination) performed as well as Exploration throughout the whole ten-day period. Secondly, as was expected and shown in the figures, all the acquisition amounts exhibit a similar trend, demonstrating that the acquisition of additional applications every day in fact improves the accuracy of the detection model in general; moreover, the more applications that are acquired daily, the higher the accuracy achieved. Since the same trend was observed for the three acquisition sizes, only the results achieved when 245 applications were acquired daily are presented in Table 1. Table 1 provides a comparison of the three AL methods (Exploration, Exploitation, and Combination) based on the accuracy rates achieved after acquiring 245 applications each day. In our experiments, the AL methods were presented with 2670 unknown applications daily, from which they were limited to selecting and acquiring only 245 applications in order to update both the knowledge store of the detection model and the signature repository of the anti-virus. The three AL methods performed quite similarly, indicating that different informative applications contributed in the same manner to the detection model's accuracy. Exploration outperformed all the selection methods during the ten days, with slightly higher accuracy than our AL methods (a difference of only 0.3% on the tenth day). Table 1 also shows that all of the AL methods achieved 97.79% accuracy or higher by the tenth day, a rate which is particularly meaningful given the volume of applications the detection model encounters daily.


Table 1 The accuracy of the framework for different methods through the acquisition of 245 applications daily

Day   Exploration (%)   Exploitation (%)   Combination (%)
1     92.78             92.78              92.78
2     94.60             94.35              94.60
3     96.10             95.67              96.09
4     96.63             96.36              96.63
5     97.03             96.80              97.02
6     97.29             97.01              97.25
7     97.67             97.34              97.63
8     97.79             97.44              97.70
9     97.89             97.57              97.78
10    98.10             97.79              97.97

Note that since the dataset is imbalanced and consists of 91% benign applications, it is not difficult to achieve 91% accuracy; however, every percentage point above 91% accuracy is a challenge. Thus, the high rates of accuracy achieved are very significant in the malware detection domain. Although Exploration outperformed Exploitation in updating the detection model through the acquisition of 245 applications, the Exploitation method outperformed Exploration on the NOMA and POMA measures during the ten-day period, as was presented in the previous subsection, and these measures reflect the goal of this study. Table 1 also indicates that by acquiring only 9.1% of the daily stream (245 applications out of 2670), the accuracy achieved by the detection model using the AL methods was almost as high as that achieved by acquiring the whole stream (98.8%). Hypothetically, a detection model updated with all of the stream's 2670 applications over the ten-day period would have achieved an accuracy level of 98.8% on the tenth day, whereas the three AL methods updated the detection model utilizing only 245 applications a day and achieved very similar rates of accuracy (97.7, 97.9, and 98.1%, respectively, for Exploitation, Combination, and Exploration). Note that the minor differences in accuracy rates between our AL methods and the Exploration method during the experiment's 10 days do not have much impact on the capabilities of the framework in the task of malicious application detection, and thus, with regard to predictive capabilities, our AL methods perform as well as the Exploration method. To support the results presented in Table 1 and our claim, we performed a single-factor ANOVA statistical test on the accuracy rates achieved by the three AL methods. The ANOVA test provided a p value of 92.23%, which is much higher than the 5% significance level (alpha); thus, the difference between the methods is not statistically significant, confirming that our AL methods performed as well as the Exploration method in terms of predictive capabilities. Note that the differences in accuracy levels between Exploration and our AL methods observed in the acquisition of 50 and 100 applications become negligible, while between the second and the ninth day, Combination even outperformed the Exploration method. This indicates that our methods perform better when the acquisition amount is closer to the number of malicious applications in the daily stream and thus better select informative applications for updating the detection model. The accuracy measure provided a general indication of the efficiency and effectiveness of the framework, yet the task here is to detect the applications that are most likely to be malicious in order to add them to the signature repository, and thus the TPR and FPR measures provide additional insights regarding the comparison of these methods. Figure 6 presents the TPR levels achieved by each of the three methods with the three acquisition sizes of 50, 100, and 245 applications on the last day of the experiment (the tenth day). Also included in Fig. 6 is the TPR rate for the unfeasible scenario (termed ALL) in which all of the 2670 applications in the daily stream were acquired and sent to security experts for inspection and labeling. The chart shows that as the acquisition amount grows, the difference between the AL methods and ALL becomes smaller, a trend that supports the efficiency of the framework and the AL methods.
This reflects the framework's and methods' viability in terms of time and cost, since it dramatically reduces the number of applications sent to virus experts. Note that when the acquisition amount was 50 applications, the TPR levels were quite low, primarily because the detection model was induced from a smaller number of applications (starting from a total of 800 applications on the first day, which included 728 benign applications and only 72 malicious applications, and concluding with only 1300 applications by the tenth day). In addition, the initial induction of an accurate detection model from a small amount of malware was made even more difficult because the malware consisted of a variety of malware families that differed from one another. However, through the process of acquiring informative applications, the TPR increases favorably. The Exploitation method had a lower performance in this regard, which can be explained by the method's character, exhibiting "greediness" and a craving to acquire new malwares. Consequently, Exploitation does not sustain the detection model with the crucial benign applications needed for a correct learning process. Thus, it might be more suited to the more mature stages of the detection model and less suitable for the early stages of acquisition. Figure 6 also shows the efficiency of the AL methods with the acquisition amount of 245 applications. The maximal TPR, which can be achieved by acquiring the entire stream (2670 applications), is almost 90%, whereas by acquiring only 245 applications daily, constituting 9.1% of the stream, the Exploration, Combination, and Exploitation methods achieved 80.5, 79.3, and 77.1%, respectively, on the tenth day. This is quite high when compared to the maximal TPR (almost 90%), which is achieved by acquiring almost 11 times more applications. Note that based on the acquisition of 245 applications, the three AL methods performed nearly the same.
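The statistical comparison described above can be reproduced in outline. The following is a sketch using SciPy's one-way ANOVA (the paper does not state which tool was used); the daily accuracy values are taken directly from Table 1, and running it should yield an F statistic well below 1 and a p value close to the 92.23% reported above.

```python
from scipy.stats import f_oneway

# Daily accuracy (%) of each AL method when acquiring 245 applications (Table 1)
exploration  = [92.78, 94.60, 96.10, 96.63, 97.03, 97.29, 97.67, 97.79, 97.89, 98.10]
exploitation = [92.78, 94.35, 95.67, 96.36, 96.80, 97.01, 97.34, 97.44, 97.57, 97.79]
combination  = [92.78, 94.60, 96.09, 96.63, 97.02, 97.25, 97.63, 97.70, 97.78, 97.97]

f_stat, p_value = f_oneway(exploration, exploitation, combination)
# A p value far above 0.05 indicates no statistically significant difference
# between the methods' daily accuracy rates.
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```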

Fig. 6 The TPR on the tenth day through the daily acquisition of 50, 100, 245, and 2670 (ALL) applications

Fig. 7 The FPR through the acquisition of 245 applications daily

The process of acquiring applications is labor-intensive and depends upon analysis and final labeling by human experts, a task which is carried out manually and requires significant time and money. Our framework's high TPR rates are achieved by acquiring a small set of applications, indicating its capacity for preserving the updatability of the anti-virus and the potential to save significant amounts of time and money. Figure 7 illustrates the FPR levels for the acquisition amount of 245 applications, which are very low and indicate the framework's ability to minimize false alarms. Similar trends existed for the other acquisition amounts and are therefore not depicted. Throughout most of the ten-day period, Exploration displayed the lowest FPR. The FPR rates are presented across days, and thus, for each day we evaluated the learned classifier. Since we tried to create a situation in which a set of new unknown applications (e.g., 2670 malwares and benign applications) is presented to the classifier each day, the FPR does not constantly decrease as it would if the classifier were tested on the same applications daily. This presents an even greater challenge to the framework, since it is constantly coming up against new unknown applications and therefore does not benefit from encountering applications it already knows to some extent.

6 Coping with possible attacks

Rastogi et al. [60] presented an attack based on common malware transformation techniques in a framework called "DroidChameleon" in which, given enough time, resources, and obfuscation know-how, a determined attacker can modify the code and certain manifest portions of an Android application, either manually or automatically, until static detection techniques fail. The failure stems from the fact that almost all detection engines rely on unique identification signatures, and re-obfuscation of the code will eventually modify the identification sequences. We argue that our approach will perform better than other approaches, as it is highly resistant to this type of attack for the following reason: our feature set is based on Android permissions, and the Chameleon attack cannot remove the basic permissions needed by the malware without rendering the malware ineffective. Moreover, our selected features that are based on the code portion of the Android application are generally indicative of obfuscation, and re-obfuscation attacks will affect these features. Such an attack will likely raise the values of the obfuscation-related features, and the manifest/code discrepancy measures can be affected by this attack as well, albeit in an unknown way. This means that any application that has been subjected to this type of treatment will contain new information for our AL approach, because it triggers human analysis and identifies permission features that will be used to detect any other similar malware. We do not argue that our approach is capable of detecting whether dynamically loaded code changes malware behavior dramatically, unless the basic dataset contains previous samples with similar patterns. If the modification by the attacker changes the malware's behavior in a fundamental way, for example if the attacker changes the malware from a premium SMS type to an information stealer, it will probably be acquired by our framework due to the new unknown patterns it conceals. We also view the Chameleon attack as an opportunity for randomly generating malware which will improve our AL methods. Zheng et al. [80] presented the ADAM system, which introduced several transformations and techniques that can be applied to Android malware in order to evade detection mechanisms. Our framework offers some resistance against this as well, since our selected features are generally descriptive of the application's behaviors and are not specific to the samples in the set, as n-gram-based approaches are. Such an attack can be successfully protected against only if the mutation of the malware code is predictable, or if the mutation itself leaves a signature. Malware code mutations and obfuscation techniques will not affect the set of Android permissions used by the malware code without disabling the malware through the removal of a required permission or by adding a new redundant permission. This will not affect our ability to identify the malware, as we base our feature set on the permission set used by the malware. Our feature set is also able to detect mutation and obfuscation using our unique features that determine the obfuscation level of the application code and measure discrepancies between the code and the Android manifest. Zhao et al. [79] presented two possible attacks on AL, referred to as adding and deleting. In these attacks, the attacker can actually pollute the unlabeled data before it is sampled by the AL.
The techniques and methods we present in this paper are not affected by these types of attacks, mainly because of the amount of effort required by an attacker to create a sufficient number of applications and distribute them across the Internet and markets in such a way that they would be picked up by the AL. Even in the extreme case of a determined attacker using a novel attack methodology, the AL model is constantly being monitored by a human expert, who will take notice when the AL model shows signs of a sudden bias or unexplained drift. Moreover, since our framework selects the most informative malicious applications and uses them to update the signature repository, it does not choose applications that are similar to those that were previously acquired. A set of malicious applications that share specific feature combinations would therefore not be fully acquired by our AL methods; only a few representative applications would be acquired. Thus, the framework is resilient to such attacks, and its detection capabilities will remain unaffected. Another aspect that should be considered is the relatively high resilience of our framework compared to the heuristic engine. Knowledge of the deterministic set of rules on which the heuristic engine is based makes it simple for an attacker to design Android malware that evades these rules, whereas it is more difficult to design malware that can evade a detection model based on an SVM classifier, as was noted by [75].
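To make the manifest/code discrepancy features mentioned above (listed in the Appendix as dex_man_diff_*) more concrete, the following is a hypothetical sketch. The exact definition of the discrepancy used by the framework is not restated here; a symmetric difference between the components declared in the AndroidManifest and those found in the compiled dex code is one reasonable interpretation, and how those two sets are extracted (e.g., with a tool such as Androguard [12]) is outside the scope of the sketch.

```python
def discrepancy_features(manifest_components, dex_components):
    """Sketch of dex/manifest discrepancy counts. Both arguments are dicts mapping a
    component type ('Activities', 'Services', 'Recivers', following the Appendix
    naming) to a set of class names. The symmetric difference is an assumption."""
    features = {}
    for kind in ("Activities", "Services", "Recivers"):
        declared = manifest_components.get(kind, set())
        in_code = dex_components.get(kind, set())
        features[f"dex_man_diff_{kind}"] = len(in_code ^ declared)
    return features

# e.g., a service implemented in the dex code but never declared in the manifest
# contributes 1 to dex_man_diff_Services.
```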

7 Discussion and conclusions

We introduce the problem of frequent and efficient updating of Android anti-virus software in light of the proliferation of unknown Android malwares. The framework we propose to solve this problem is based on active learning methods and has been designed for anti-virus vendors, focusing their analytic efforts on the applications most likely to be malicious. As previously discussed, mobile devices are very dependent on lightweight solutions such as anti-virus software due to their limited computational and power resources. On the other hand, for anti-virus vendors, the detection of new unknown Android malwares is time-consuming. To identify new unknown malware, anti-virus vendors need to deal with vast quantities of new applications on a daily basis, and based on selection heuristics, the most probable malwares are identified and analyzed by human experts. This information is then used to update the vendor's signature repository. However, the manner in which these heuristic methods select malwares becomes less relevant with time due to the lack of efficient and frequent updates, a deficiency which renders the update of the anti-virus signature repository inefficient. Our framework was developed to address this problem. We introduced the framework, and we compared our two designated AL methods, Exploitation and Combination, with two existing methods: the conventional heuristic engine and the Exploration AL method. Our evaluation showed that, over the course of the ten days, the AL methods updated and improved the detection model's capabilities, an update that is not frequently performed for a heuristic engine. The results show that the larger the acquisition amount, the greater the improvement in the detection model. The initial detection model was induced from a relatively small initial set of 800 applications that contained just 72 malwares. On the first day, the model showed performance of 92.7% accuracy, 25.3% TPR, and 0.09% FPR. However, it is the rate of improvement in these measures that matters, rather than the final performance on the tenth day. The acquisition amount of 245 applications showed the greatest improvement, because the AL methods acquired the maximal amount of new malwares, the most informative applications, and those bearing the most valuable information needed to update the detection model. Thus, it seems that anti-virus vendors should acquire a percentage of applications as close as possible to the malware percentage in real application markets or other repositories. With the acquisition amount of 245 applications, Exploration and Combination outperformed Exploitation, with an improvement of 5.4% in accuracy, about 50% in TPR, and 0.04% in FPR. These improvements are satisfactory and could be improved further over the course of additional days of acquisition. Although Exploration was the AL method that best updated the detection model, it was not the strongest at acquiring the maximal amount of new malwares for the purpose of updating the anti-virus signatures. This can be explained by the character of each AL method: Exploration tries to acquire informative applications (both benign and malicious), while Exploitation tries to acquire new malicious applications. We defined two simple measures, NOMA and POMA, respectively the Number and Percentage of Malwares Acquired daily. Looking at the performance of the selection methods with regard to the acquisition amounts (50, 100, and 245), we see that according to POMA, all the selection methods perform better when the acquisition amounts are smaller.
On the tenth day, Exploitation's POMA was 100, 97, and 85%, respectively, for the 50, 100, and 245 daily acquisition amounts. On the other hand, the method's ability to acquire larger amounts of malwares (e.g., 100, 245) improved over the course of the ten days. With an amount of 50 applications, the performance remained as high as it was at the beginning of the process, which represents an achievement in preserving the updatability and malware acquisition capabilities. A similar trend was observed for the Combination method, which performs better on smaller acquisition amounts. Specifically, on the tenth day, its POMA was 96, 93, and 85%, respectively, for the 50, 100, and 245 daily acquisition amounts.

As for the Exploration method, it performed quite poorly and became worse as the acquisition amounts increased, with a POMA of only 58, 35, and 13%, respectively, for the 50, 100, and 245 daily acquisition amounts. Moreover, over the ten-day period, it acquired fewer and fewer malwares and showed a sharp decrease. These results do not parallel the improvement in the performance of the detection model in terms of accuracy, TPR, and FPR. In this context, this affirms that a selection method should not only be oriented toward efficient updating of the detection model with informative applications; rather, it should also be oriented toward the acquisition of new malwares, as the Exploitation and Combination methods are. Similar to the other methods, the heuristic engine's overall performance decreased when faced with larger acquisition amounts. On the tenth day, it had a POMA of 72, 71, and 42%, respectively, for the 50, 100, and 245 daily acquisition amounts. For all three daily acquisition amounts (50, 100, and 245) considered in our evaluation, our AL methods outperformed the other two methods over the ten days in the amount of new malwares acquired daily, which, as stated earlier, was the main goal of this study. While the heuristic engine showed instability, it performed better than Exploration most of the time. Exploration displayed a decreasing performance trend over the ten-day period. When acquiring 50 applications daily, Exploitation had a POMA of 100%, indicating that all of the applications acquired by this method were malwares. This performance was stable during the ten-day period. Exploitation performed 34% better than the Exploration method, which achieved a POMA of only 66% on the tenth day and showed a decreasing trend in performance over the course of the ten days. Moreover, Exploitation performed 40% better than the heuristic engine, which achieved a POMA of only 60% on the fourth day. When acquiring 100 applications daily, Exploitation exhibited an improving trend in the NOMA and POMA measures during the ten-day period, indicating that the method successfully contributed to updating the detection model with informative applications. The highest POMA value for this acquisition amount was achieved on the final day and stands at 97%, which is 62% better than Exploration and 16% better than the heuristic engine. The results achieved in acquiring 245 applications daily present the greatest difference in performance between our proposed AL methods (Exploitation and Combination) and the other existing solutions. Both the Exploitation and Combination methods had a NOMA of 207 on the tenth day, which is more than twice the amount of new malwares acquired by the heuristic engine (NOMA = 103) and 6.5 times more malwares than were acquired by the Exploration method (NOMA = 32). These results indicate that our presented framework can make a meaningful contribution to anti-virus vendors by helping them improve the updatability of their anti-virus software with more frequent updates, greater efficiency, and a reduction in the time between the creation of new Android malware, its subsequent discovery, and the process of updating the signature repository of the anti-virus. The improvements achieved and the applications acquired on day i improve the capabilities of the anti-virus and detection model and strengthen their capacity to accurately detect new unknown malware on day i + 1.
Both the Exploitation and Combination methods succeeded in acquiring a high percentage of malwares; the small percentage that remained included very informative benign applications. It should be noted that when benign applications that were thought to be malicious are selected, the AL's acquisitions are no less important than when actual malwares are acquired. Using the Exploitation approach (which is also a phase in the Combination method), we explore the malicious "side" more efficiently, occasionally resulting in the discovery of benign applications as well (these will probably become support vectors after being added to the detection model's training set as labeled instances). Thus, with Exploitation, the acquisition of new applications will contain more malwares than with Exploration, which is the main goal of the framework and methods presented in this paper. It is also worthwhile mentioning that we addressed the imbalance problem in Android malwares. Since our original dataset included 40,000 applications with 10,000 malwares, its malware rate was far higher than is realistic, making it unsuitable for measuring how our framework would perform in real life. This paper presents unbiased results for Android malware detection and acquisition based on a realistic Android malware percentage. The results shown on this large and representative dataset make this research relevant for industrial purposes as well. It should be noted that in this paper we focus on presenting an efficient method for improving the updatability of the anti-virus tool rather than presenting methods for detecting special malwares. Yet, our suggested framework and set of features were also found to be effective in detecting special malwares which run malicious code obtained by techniques such as Java reflection, Java native code, and obfuscated malicious code (such as GingerMaster, Plankton, and LeNa). In future work, our framework and AL methods might also be useful for additional domains in which an efficient selection of a specific class is needed more than acquisition from another class. One interesting domain would be online social networks (OSNs), for the detection and acquisition of new fictitious profiles, a hard and important task due to the increase in malicious profiles that tempt innocent and unsophisticated users such as children and the elderly. We also intend to extend our studies [51,52] in the bioinformatics domain. In these studies, an intelligent and active selection of a specific class is needed in order to reduce the labeling efforts required by medical experts and to enhance the detection of severe and rare diseases.

Acknowledgments This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space.

Appendix

List of features we used: number_of_activities number_of_services number_of_recivers number_of_permissions number_of_providers man_is_meta_data perm_android.intent.category.MASTER_CLEAR.permission.C2D_MESSAGE perm_android.permission.ACCESS_ALL_EXTERNAL_STORAGE perm_android.permission.ACCESS_CACHE_FILESYSTEM perm_android.permission.ACCESS_CHECKIN_PROPERTIES perm_android.permission.ACCESS_COARSE_LOCATION perm_android.permission.ACCESS_CONTENT_PROVIDERS_EXTERNALLY perm_android.permission.ACCESS_FINE_LOCATION perm_android.permission.ACCESS_LOCATION_EXTRA_COMMANDS perm_android.permission.ACCESS_MOCK_LOCATION perm_android.permission.ACCESS_MTP perm_android.permission.ACCESS_NETWORK_STATE 123 ALDROID: efficient update of Android anti-virus software… perm_android.permission.ACCESS_SURFACE_FLINGER perm_android.permission.ACCESS_WIFI_STATE perm_android.permission.ACCESS_WIMAX_STATE perm_android.permission.ACCOUNT_MANAGER perm_android.permission.ALLOW_ANY_CODEC_FOR_PLAYBACK perm_android.permission.ASEC_ACCESS perm_android.permission.ASEC_CREATE perm_android.permission.ASEC_DESTROY perm_android.permission.ASEC_MOUNT_UNMOUNT perm_android.permission.ASEC_RENAME perm_android.permission.AUTHENTICATE_ACCOUNTS perm_android.permission.BACKUP perm_android.permission.BATTERY_STATS perm_android.permission.BIND_ACCESSIBILITY_SERVICE perm_android.permission.BIND_APPWIDGET perm_android.permission.BIND_DEVICE_ADMIN perm_android.permission.BIND_DIRECTORY_SEARCH perm_android.permission.BIND_INPUT_METHOD perm_android.permission.BIND_KEYGUARD_APPWIDGET perm_android.permission.BIND_PACKAGE_VERIFIER perm_android.permission.BIND_REMOTEVIEWS perm_android.permission.BIND_TEXT_SERVICE perm_android.permission.BIND_VPN_SERVICE perm_android.permission.BIND_WALLPAPER perm_android.permission.BLUETOOTH perm_android.permission.BLUETOOTH_ADMIN perm_android.permission.BLUETOOTH_STACK perm_android.permission.BRICK perm_android.permission.BROADCAST_PACKAGE_REMOVED perm_android.permission.BROADCAST_SMS perm_android.permission.BROADCAST_STICKY perm_android.permission.BROADCAST_WAP_PUSH perm_android.permission.CALL_PHONE perm_android.permission.CALL_PRIVILEGED perm_android.permission.CAMERA perm_android.permission.CHANGE_BACKGROUND_DATA_SETTING perm_android.permission.CHANGE_COMPONENT_ENABLED_STATE perm_android.permission.CHANGE_CONFIGURATION perm_android.permission.CHANGE_NETWORK_STATE perm_android.permission.CHANGE_WIFI_MULTICAST_STATE perm_android.permission.CHANGE_WIFI_STATE perm_android.permission.CHANGE_WIMAX_STATE perm_android.permission.CLEAR_APP_CACHE perm_android.permission.CLEAR_APP_USER_DATA perm_android.permission.CONFIGURE_WIFI_DISPLAY perm_android.permission.CONFIRM_FULL_BACKUP perm_android.permission.CONNECTIVITY_INTERNAL perm_android.permission.CONTROL_LOCATION_UPDATES perm_android.permission.CONTROL_WIFI_DISPLAY 123 N. Nissim et al. 
perm_android.permission.COPY_PROTECTED_DATA perm_android.permission.CRYPT_KEEPER perm_android.permission.DELETE_CACHE_FILES perm_android.permission.DELETE_PACKAGES perm_android.permission.DEVICE_POWER perm_android.permission.DIAGNOSTIC perm_android.permission.DISABLE_KEYGUARD perm_android.permission.DUMP perm_android.permission.EXPAND_STATUS_BAR perm_android.permission.FACTORY_TEST perm_android.permission.FILTER_EVENTS perm_android.permission.FLASHLIGHT perm_android.permission.FORCE_BACK perm_android.permission.FORCE_STOP_PACKAGES perm_android.permission.FREEZE_SCREEN perm_android.permission.GET_ACCOUNTS perm_android.permission.GET_DETAILED_TASKS perm_android.permission.GET_PACKAGE_SIZE perm_android.permission.GET_TASKS perm_android.permission.GLOBAL_SEARCH perm_android.permission.GLOBAL_SEARCH_CONTROL perm_android.permission.GRANT_REVOKE_PERMISSIONS perm_android.permission.HARDWARE_TEST perm_android.permission.INJECT_EVENTS perm_android.permission.INSTALL_LOCATION_PROVIDER perm_android.permission.INSTALL_PACKAGES perm_android.permission.INTERACT_ACROSS_USERS perm_android.permission.INTERACT_ACROSS_USERS_FULL perm_android.permission.INTERNAL_SYSTEM_WINDOW perm_android.permission.INTERNET perm_android.permission.KILL_BACKGROUND_PROCESSES perm_android.permission.MAGNIFY_DISPLAY perm_android.permission.MANAGE_ACCOUNTS perm_android.permission.MANAGE_APP_TOKENS perm_android.permission.MANAGE_NETWORK_POLICY perm_android.permission.MANAGE_USB perm_android.permission.MANAGE_USERS perm_android.permission.MASTER_CLEAR perm_android.permission.MODIFY_APPWIDGET_BIND_PERMISSIONS perm_android.permission.MODIFY_AUDIO_SETTINGS perm_android.permission.MODIFY_NETWORK_ACCOUNTING perm_android.permission.MODIFY_PHONE_STATE perm_android.permission.MOUNT_FORMAT_FILESYSTEMS perm_android.permission.MOUNT_UNMOUNT_FILESYSTEMS perm_android.permission.MOVE_PACKAGE perm_android.permission.NET_ADMIN perm_android.permission.NET_TUNNELING perm_android.permission.NFC perm_android.permission.PACKAGE_USAGE_STATS 123 ALDROID: efficient update of Android anti-virus software… perm_android.permission.PACKAGE_VERIFICATION_AGENT perm_android.permission.PERFORM_CDMA_PROVISIONING perm_android.permission.PERSISTENT_ACTIVITY perm_android.permission.PROCESS_OUTGOING_CALLS perm_android.permission.READ_CALENDAR perm_android.permission.READ_CALL_LOG perm_android.permission.READ_CELL_BROADCASTS perm_android.permission.READ_CONTACTS perm_android.permission.READ_DREAM_STATE perm_android.permission.READ_EXTERNAL_STORAGE perm_android.permission.READ_FRAME_BUFFER perm_android.permission.READ_INPUT_STATE perm_android.permission.READ_LOGS perm_android.permission.READ_NETWORK_USAGE_HISTORY perm_android.permission.READ_PHONE_STATE perm_android.permission.READ_PRIVILEGED_PHONE_STATE perm_android.permission.READ_PROFILE perm_android.permission.READ_SMS perm_android.permission.READ_SOCIAL_STREAM perm_android.permission.READ_SYNC_SETTINGS perm_android.permission.READ_SYNC_STATS perm_android.permission.READ_USER_DICTIONARY perm_android.permission.REBOOT perm_android.permission.RECEIVE_BOOT_COMPLETED perm_android.permission.RECEIVE_DATA_ACTIVITY_CHANGE perm_android.permission.RECEIVE_EMERGENCY_BROADCAST perm_android.permission.RECEIVE_MMS perm_android.permission.RECEIVE_SMS perm_android.permission.RECEIVE_WAP_PUSH perm_android.permission.RECORD_AUDIO perm_android.permission.REMOTE_AUDIO_PLAYBACK perm_android.permission.REMOVE_TASKS perm_android.permission.REORDER_TASKS perm_android.permission.RESTART_PACKAGES 
perm_android.permission.RETRIEVE_WINDOW_CONTENT perm_android.permission.RETRIEVE_WINDOW_INFO perm_android.permission.SEND_SMS perm_android.permission.SEND_SMS_NO_CONFIRMATION perm_android.permission.SERIAL_PORT perm_android.permission.SET_ACTIVITY_WATCHER perm_android.permission.SET_ALWAYS_FINISH perm_android.permission.SET_ANIMATION_SCALE perm_android.permission.SET_DEBUG_APP perm_android.permission.SET_KEYBOARD_LAYOUT perm_android.permission.SET_ORIENTATION perm_android.permission.SET_POINTER_SPEED perm_android.permission.SET_PREFERRED_APPLICATIONS perm_android.permission.SET_PROCESS_LIMIT perm_android.permission.SET_SCREEN_COMPATIBILITY 123 N. Nissim et al. perm_android.permission.SET_TIME perm_android.permission.SET_TIME_ZONE perm_android.permission.SET_WALLPAPER perm_android.permission.SET_WALLPAPER_COMPONENT perm_android.permission.SET_WALLPAPER_HINTS perm_android.permission.SHUTDOWN perm_android.permission.SIGNAL_PERSISTENT_PROCESSES perm_android.permission.START_ANY_ACTIVITY perm_android.permission.STATUS_BAR perm_android.permission.STATUS_BAR_SERVICE perm_android.permission.STOP_APP_SWITCHES perm_android.permission.SUBSCRIBED_FEEDS_READ perm_android.permission.SUBSCRIBED_FEEDS_WRITE perm_android.permission.SYSTEM_ALERT_WINDOW perm_android.permission.TEMPORARY_ENABLE_ACCESSIBILITY perm_android.permission.UPDATE_DEVICE_STATS perm_android.permission.UPDATE_LOCK perm_android.permission.USE_CREDENTIALS perm_android.permission.USE_SIP perm_android.permission.VIBRATE perm_android.permission.WAKE_LOCK perm_android.permission.WRITE_APN_SETTINGS perm_android.permission.WRITE_CALENDAR perm_android.permission.WRITE_CALL_LOG perm_android.permission.WRITE_CONTACTS perm_android.permission.WRITE_DREAM_STATE perm_android.permission.WRITE_EXTERNAL_STORAGE perm_android.permission.WRITE_GSERVICES perm_android.permission.WRITE_MEDIA_STORAGE perm_android.permission.WRITE_PROFILE perm_android.permission.WRITE_SECURE_SETTINGS perm_android.permission.WRITE_SETTINGS perm_android.permission.WRITE_SMS perm_android.permission.WRITE_SOCIAL_STREAM perm_android.permission.WRITE_SYNC_SETTINGS perm_android.permission.WRITE_USER_DICTIONARY perm_com.android.alarm.permission.SET_ALARM perm_com.android.browser.permission.READ_HISTORY_BOOKMARKS perm_com.android.browser.permission.WRITE_HISTORY_BOOKMARKS perm_com.android.voicemail.permission.ADD_VOICEMAIL perm_com.fede.launcher.permission.READ_SETTINGS perm_com.htc.launcher.permission.READ_SETTINGS perm_com.lge.launcher.permission.INSTALL_SHORTCUT perm_com.lge.launcher.permission.READ_SETTINGS perm_com.motorola.dlauncher.permission.INSTALL_SHORTCUT perm_com.motorola.dlauncher.permission.READ_SETTINGS perm_com.motorola.launcher.permission.INSTALL_SHORTCUT perm_com.motorola.launcher.permission.READ_SETTINGS perm_org.adw.launcher.permission.READ_SETTINGS 123 ALDROID: efficient update of Android anti-virus software… number_of_classes number_of_classes_1-5 number_of_classes_5-10 number_of_classes_10-15 number_of_classes_15+ dex_strings_count dex_activities dex_services dex_recivers dex_bouncycastle_use dex_opengl_use dex_crypto_use dex_man_diff_Activities dex_man_diff_Services dex_man_diff_Recivers
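The list above is dominated by binary permission indicators. As a hypothetical illustration (not our extraction code), a feature vector over such a list could be assembled as follows, where FEATURE_PERMISSIONS is the ordered permission list from this appendix and the application's requested permissions are assumed to have been parsed from its AndroidManifest.xml beforehand.

```python
def permission_vector(requested, feature_permissions):
    """Build the binary perm_* part of the feature vector: 1 if the application
    requests the permission, 0 otherwise (order follows the appendix list)."""
    requested = set(requested)
    return [1 if p in requested else 0 for p in feature_permissions]

# Hypothetical example with a small subset of the appendix list
FEATURE_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]
app_permissions = {"android.permission.INTERNET", "android.permission.SEND_SMS"}
print(permission_vector(app_permissions, FEATURE_PERMISSIONS))  # [1, 1, 0]
```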

References

1. Abou-Assaleh T, Cercone N, Keselj V, Sweidan R (2004) N-gram based detection of new malicious code. In: Proceedings of the 28th annual international computer software and applications conference (COMPSAC’04) 2. Andriatsimandefitra R, Geller S, Tong VVT (2012) Designing information flow policies for Android’s operating system. In: 2012 IEEE International conference on communications (ICC), 10–15 June 2012, pp 976–981 3. Angluin D (1988) Queries and concept learning. Mach Learn 2:319–342 4. Apvrille A, Strazzere T (2012) Reducing the window of opportunity for Android malware Gotta catch ’em all. J Comput Virol 8(1–2):61–71 5. Baram Y, El-Yaniv R, Luz K (2004) Online choice of active learning algorithms. J Mach Learn Res 5:255–291 6. Batyuk L, Herpich M, Camtepe SA, Raddatz K, Schmidt A, Albayrak S (2011) Using static analysis for automatic assessment and mitigation of unwanted and malicious activities within Android applications. In: 2011 6th International conference on malicious and unwanted software (MALWARE), 18–19 October 2009, pp 66–72 7. Bläsing T, Batyuk L, Schmidt A-D, Camtepe SA, Albayrak S (2010) An Android Application Sandbox system for suspicious software detection. In: 2010 5th International conference on malicious and unwanted software (MALWARE), 19–20 October 2010, pp 55–62 8. Bulygin Y (2007) Epidemics of mobile worms. In: Proceedings of the 26th IEEE international performance computing and communications conference, IPCCC 2007, April 11–13, 2007, New Orleans, Louisiana, USA. IEEE Computer Society, pp 475–478 9. Burges CJC (1988) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167 10. Chang CC, Lin C-J (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/ cjlin/libsvm 11. Dagon D, Martin T, Starner T (2004) Mobile phones as computing devices: the viruses are coming! IEEE Pervasive Comput 3(4):11–15 12. Desnos A (2013) (Visited June 2013) Androguard reverse engineering tool. http://code.google.com/p/ androguard/ 13. Fawcett T (2003) ROC graphs: notes and practical considerations for researchers. Technical report, HPL- 2003-4, HP Laboratories 14. Ham H-S, Choi M-J (2013) Analysis of Android malware detection performance using machine learning classifiers. In: 2013 International conference on ICT Convergence (ICTC). IEEE 15. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45:171–186 16. Henchiri O, Japkowicz N (2006) A feature selection and evaluation scheme for computer virus detection. In: Proceedings of ICDM-2006, Hong Kong, pp 891–895 123 N. Nissim et al.

17. Herbrich R, Graepel T, Campbell C (2001) Bayes point machines. J Mach Learn Res 1:245–279. doi:10. 1162/153244301753683717 18. http://code.google.com/p/smali/ 19. http://code.google.com/p/xml-apk-parser/ 20. http://contagiominidump.blogspot.co.il/ 21. http://docs.oracle.com/javase/tutorial/security/apisign/gensig.html 22. http://jon.oberheide.org/applications/summercon12-bouncer.pdf 23. http://source.android.com/tech/dalvik/dex-format.html 24. http://www.avgmobilation.com/ 25. http://www.csc.ncsu.edu/faculty/jiang/DroidKungFu.html 26. http://www.malgenomeproject.org/ 27. http://www.securelist.com/en/analysis/204792250/IT_Threat_Evolution_Q3_2012 28. http://www.strazzere.com/papers/DexEducation-PracticingSafeDex.pdf 29. http://www.symantec.com/security_response/writeup.jsp?docid=2003-011711-1226-99 30. https://blog.lookout.com/blog/2011/03/01/security-alert-malware-found-in-official-android-market-dr oiddream/ 31. https://blog.lookout.com/_media/Geinimi_Trojan_Teardown.pdf 32. https://www.virustotal.com/ 33. Ikinci A, Holz T, Freiling FC (2008) Monkey-spider: detecting malicious websites with low-interaction honeyclients. In: Sicherheit, pp 407–421 34. Kiem H, Thuy NT, Quang TMN (2004) A machine learning approach to anti-virus system (2004) Joint workshop of Vietnamese society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on active mining, Hanoi-Vietnam, 4–7 December 2004, pp 61–65 35. Kolter JZ, Maloof MA (2004) Learning to detect malicious executables in the wild. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, NY, pp 470–478 36. Kolter J, Maloof M (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res 7:2721–2744 37. Leavitt N (2005) Mobile phones: the next frontier for hackers? Computer 38(4):20–23 38. Lewis D, Gale W (1994) A sequential algorithm for training text classifiers. In: Proceedings of the seventeenth annual international ACM-SIGIR conference on research and development in information retrieval. Springer, pp 3–12 39. Lin Y-D, Lai Y-C, Chen C-H, Tsai H-C (2013) Identifying android malicious repackaged applications by thread-grained system call sequences. Comput Secur 39(Part B):340–350. doi:10.1016/j.cose.2013.08. 010 40. Luoshi Z, Yan N, Xiao W, Zhaoguo W, Yibo X (2013) A3: automatic analysis of Android malware. In: 1st International workshop on cloud computing and information security. Atlantis Press, November 2013 41. Mansfield-Devine S (2012) Android malware and mitigations. Netw Secur 2012(11):12–20. doi:10.1016/ S1353-4858(12)70104-6 42. Masud M, Khan L, Thuraisingham B (2007) Feature based techniques for auto-detection of novel email worms. In: Advances in knowledge discovery and data mining 43. Mokube I, Adams M (2007) Honeypots: concepts, approaches, and challenges. In: ACMSE 2007. Winston-Salem, North Carolina, USA, 23–24 March, pp 321–325 44. Moskovitch R, Stopel D, Feher C, Nissim N, Japkowicz N, Elovici Y (2009) Unknown malcode detection and the imbalance problem. J Comput Virol 5(4):295–308 45. Moskovitch R, Nissim N, Elovici Y (2008) Acquisition of malicious code using active learning. In; 2nd ACM SIGKDD international workshop on privacy, security, and trust in KDD, PinKDD08, Springer, Lectures Notes in Computer Sciences, Las Vegas, USA, vol 5456, 25 August 2008, pp 74–91 46. Moskovitch R, Nissim N, Elovici Y (2009) Malicious code detection using active learning. In: Privacy, security, and trust in KDD, pp 74–91 47. 
Moskovitch R, Nissim N, Englert R, Elovici Y (2008) Detection of unknown computer worms activity using active learning. In: The 11th International conference on information fusion, Cologne, Germany, June 30–July 3 48. Moskovitch R, Stopel D, Feher C, Nissim N, Elovici Y (2008) Unknown malcode detection via text categorization and the imbalance problem. In: IEEE intelligence and security informatics (ISI08), Taiwan 49. Nissim N, Moskovitch R, Rokach L, Elovici Y (2012) Detecting unknown computer worm activity via support vector machines and active learning. Pattern Anal Appl 15:459–475 50. Nissim N, Cohen A, Glezer C, Elovici Y (2015) Detection of malicious PDF files and directions for enhancements: a state-of-the art survey. Comput Secur 48:246–266. doi:10.1016/j.cose.2014.10.014


51. Nissim N, Boland MR, Moskovitch R, Tatonetti NP, Elovici Y, Shahar Y, Hripcsak G (2015) An active learning framework for efficient condition severity classification. In: Artificial intelligence in medicine. Springer International Publishing (AIME-15), pp 13–24 52. Nissim N, Borland M, Moskovitch R, Tatonetti N, Elovici Y, Shahar Y, Hripcsak G (2014) An active learning enhancement for conditions severity classification. In: ACM KDD on workshop on connected health at big data era, NYC, USA 53. Nissim N, Cohen A, Moskovitch R, Shabtai A, Edry M, Bar-Ad O, Elovici Y (2014) ALPD: active learning framework for enhancing the detection of malicious PDF files. In: Intelligence and security informatics conference (JISIC), 2014 IEEE joint. IEEE, September 2014, pp 91–98 54. Nissim N, Moskovitch R, Rokach L, Elovici Y (2014) Novel active learning methods for enhanced PC malware detection in windows OS. Expert Syst Appl 41(13):5843–5857 55. Oberheide J, Cooke E, Jahanian F (2008) Cloudav: N-version antivirus in the network cloud. In: Proceed- ings of the 17th USENIX security symposium (Security’08), San Jose, CA, July 2008 56. Oberheide J, Miller J (2012) Dissecting the android bouncer. SummerCon2012, New York 57. Peng H, Gates C, Sarma B, Li N, Qi A, Potharaju R, Nita-Rotaru C, Molloy I (2012) Using probabilistic generative models for ranking risks of Android Apps. In: Proceedings of ACM CCS 58. Piercy M (2004) Embedded devices next on the virus target list. IEEE Electron Syst Softw 2:42–43 59. Provos N, Holz T (2008) Virtual honeypots: from botnet tracking to intrusion detection. Addison-Wesley, Upper Saddle River 60. Rastogi V, Chen Y, Jiang X (2013) DroidChameleon: evaluating Android anti-malware against trans- formation attacks. Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security. ACM 61. Sanz B, Santos I, Galán-García P, Laorden C, Ugarte-Pedrero X, Bringas PG, Alvarez PUMA G (2012) Permission Usage to detect Malware in Android. In: Proceedings of the 5th international conference on computational intelligence in security for information systems (CISIS). Ostrava (Czech Republic), 5–7 September 2012 62. Sanz B, Santos I, Laorden C, Ugarte-Pedrero X, Bringas PG (2012) On the automatic categorisation of android applications. In: 2012 IEEE Consumer communications and networking conference (CCNC), 14–17 January, pp 149–153 63. Sarma B, Li N, Gates C, Potharaju R, Nita-Rotaru C (2012) Android permissions: a perspective combining risks and benefits. In: Proceedings of SACMAT 64. Schmidt A-D, Bye R, Schmidt H-G, Clausen J, Kiraz O, Yuksel KA, Camtepe SA, Albayrak S (2009) Static analysis of executables for collaborative malware detection on Android. In: IEEE international conference on communications, 2009 ICC’09, 14–18 June 2009, pp 1–5 65. Schmidt A-D, Schmidt H-G, Batyuk L, Clausen JH, Camtepe SA, Albayrak S, Yildizli C (2009) Smart- phone malware evolution revisited: android next target? In: 2009 4th International conference on malicious and unwanted software (MALWARE), 13–14 October 2009, pp 1–7 66. Schultz M, Eskin E, Zadok E, Stolfo S (2001) Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE symposium on security and privacy, pp 178–184 67. Seo S-H, Gupta A, Sallam AM, Bertino E, Yim K (2014) Detecting mobile malware threats to homeland security through static analysis. J Netw Comput Appl 38:43–53. doi:10.1016/j.jnca.2013.05.008 68. 
Shabtai A, Fledel Y, Kanonov U, Elovici Y, Dolev S, Glezer C (2010) Google android: a comprehensive security assessment. IEEE Secur Priv 2:35–44 69. Shabtai A, Fledel Y,Elovici Y (2010) Automated static code analysis for classifying Android applications using machine learning. In: 2010 International conference on computational intelligence and security (CIS), 11–14 December 2010, pp 329–333 70. Shabtai A, Tenenboim-Chekina L, Mimran D, Rokach L, Shapira, B, Elovici Y (2014) Mobile malware detection through analysis of deviations in application network behavior. Comput Secur 43:1–18 71. Shih DH, Lin B, Chiang HS, Shih MH (2008) Security aspects of mobile phone virus: a critical survey. Ind Manag Data Syst 108(4):478–494 72. Suarez-Tangil G et al (2014) Dendroid: a text mining approach to analyzing and classifying code structures in Android malware families. Expert Syst Appl 41(4):1104–1117 73. Tahan G, Rokach L, Shahar Y (2012) Mal-ID: automatic malware detection using common segment analysis and meta-features. J Mach Learn Res 13:949–979 74. Tong S, Koller D (2000–2001) Support vector machine active learning with applications to text classifi- cation. J Mach Learn Res 2:45–66 75. Wang X, Yu W, Champion A, Fu X, Xuan D (2007) Worms via mining dynamic program execution. In: Security and privacy in communications networks and the workshops, 2007. SecureComm 2007. Third international conference security and privacy in communication networks and the workshops, SecureComm, pp 412–421

123 N. Nissim et al.

76. Yu ZHU, Wang X-C, Shen H-B (2008) Detection method of computer worms based on SVM. Mech Electr Eng Mag 8 77. Zhang Y et al (2013) Vetting undesirable behaviors in android apps with permission use analysis. In: Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. ACM 78. Zhao M, Zhang T, Ge F, Yuan Z (2012) RobotDroid: a lightweight malware detection framework on smartphones. JNW 7(4):715–722 79. Zhao W, Long J, Yin J, Cai Z, Xia G-M (2012) Sampling attack against active learning in adversarial environment. In: MDAI, pp 222–233 80. Zheng M, Lee PPC, Lui JCS (2013) Adam: an automatic and extensible platform to stress test android anti- virus systems. In: Detection of intrusions and malware, and vulnerability assessment. Springer, Berlin, Heidelberg, pp 82–101 81. Zhou Y et al (2012) Hey, you, get off of my market: detecting malicious apps in official and alternative android markets. In: Proceedings of the 19th annual network and distributed system security symposium 82. Zhou Y, Jiang X (2012) Dissecting android malware: characterization and evolution. In: 2012 IEEE symposium on security and privacy (SP). IEEE, pp 95–109, May 2012 83. Zhou W, Zhou Y, Jiang X, Ning P (2012) Detecting repackaged smartphone applications in third-party Android marketplaces. In: Proceedings of the second ACM conference on data and application security and privacy. ACM, pp 317–326

Nir Nissim is a researcher and the head of the Malware-Lab in the Cyber Security Research Center (CSRC) at Ben-Gurion University. Mr. Nissim submitted his Ph.D. dissertation this year in the Department of Information Systems Engineering at Ben-Gurion University. Mr. Nissim has published several remarkable papers dealing with active learning approaches for the acquisition and detection of malware on both PC and Android platforms. Mr. Nissim is recognized as an expert in information systems security and has led several large-scale projects and research efforts in this field. His main areas of interest are mobile and computer security, machine learning, and data mining. In addition to his contribution to the cyber security domain, Nir is also interested in the bioinformatics domain and has published a number of papers regarding the efficient classification of condition severity.

Robert Moskovitch holds a B.Sc., an M.Sc., and a Ph.D. in Information Systems Engineering from Ben-Gurion University, Israel. His Ph.D. focused on the development of novel multivariate temporal data analytics methods using temporal abstraction and time intervals mining, such as the KarmaLego algorithm. Currently, he is a postdoctoral research scientist in the Department of Biomedical Informatics at Columbia University, in which he applies KarmaLego on Columbia University Medical Center data for purposes such as the prediction of clinical procedures or diagnoses in patient data. He has served on several journal editorial boards, as well as on program committees of several conferences and workshops in biomedical informatics and in information security. He has published more than fifty peer-reviewed papers in leading journals and conferences, several of which won best paper awards.


Oren Bar-Ad is a Mobile and Security Architect for Nuro Secure Messaging. For the last six months he has been managing the development teams for the iOS and Android clients, as well as security, at Nuro Secure Messaging Ltd. He worked as the Mobile Security Team leader for the antivirus vendor AVG from October 2012 to June 2015. He has been working on secure code development for Android since 2008. He holds a B.Sc. in Information Systems Engineering from Ben-Gurion University.

Lior Rokach is a data scientist and a Full Professor of Information Systems and Software Engineering at Ben-Gurion University. His research interests lie in the design and analysis of machine learning and data mining algorithms and their applications in Cyber Security, Recommender Systems, and Information Retrieval. Prof. Rokach is the author of over 200 peer-reviewed papers in leading journals (e.g., Machine Learning, Machine Learning Research, Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, and Pattern Recognition), conference proceedings, patents, and book chapters. Prof. Rokach's publications cover topics in the cyber security domain in which advanced machine learning and big data techniques are required to provide solutions for existing problems, particularly malware detection, user authentication and verification, privacy-preserving data mining, and anomaly detection.

Yuval Elovici is a Full Professor in the Department of Information Systems Engineering at Ben-Gurion University, the director of the Telekom Innovation Laboratories at BGU, and the head of the Cyber Security Research Center (CSRC) at Ben-Gurion University. He holds B.Sc. and M.Sc. degrees in Computer and Electrical Engineering from BGU and a Ph.D. in Information Systems from Tel-Aviv University. His primary research interests are cyber security and machine learning, and he is the author of a book on information leakage detection and prevention.


Detection of malicious PDF files and directions for enhancements: A state-of-the art survey

Nir Nissim a,*, Aviad Cohen a, Chanan Glezer b, Yuval Elovici a

a Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel
b Department of Industrial Engineering and Management, Ariel University, Israel

Article history: Received 23 April 2014; Received in revised form 12 September 2014; Accepted 23 October 2014; Available online 3 November 2014

Keywords: APT; PDF; Malicious code; Malware; Detection; Email; Organizations; Cyber-attack

Abstract

Initial penetration is one of the first steps of an Advanced Persistent Threat (APT) attack, and it is considered one of the most significant means of initiating cyber-attacks aimed at organizations. Such an attack usually results in the loss of sensitive and confidential information. Because email communication is an integral part of daily business operations, APT attackers frequently leverage email as an attack vector for initial penetration of the targeted organization. Emails allow the attacker to deliver malicious attachments or links to malicious websites. Attackers usually use social engineering in order to make the recipient open the malicious email, open the attachment, or press a link. Existing defensive solutions within organizations prevent executables from entering organizational networks via emails; therefore, recent APT attacks tend to attach non-executable files (PDF, MS Office, etc.) which are widely used in organizations and mistakenly considered less suspicious or malicious. This article surveys existing academic methods for the detection of malicious PDF files. The article outlines an Active Learning framework and highlights the correlation between structural incompatibility of PDF files and their likelihood of maliciousness. Finally, we provide comparisons, insights and conclusions, as well as avenues for future research in order to enhance the detection of malicious PDFs.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Since 2009, cyber-attacks against businesses and organizations have increased, and in 2013, 91% of all organizations were hit with cyber-attacks and 9% were the victims of targeted attacks, according to Kaspersky Labs. Attacks aimed at organizations usually take the form of harmful activities such as stealing confidential information, spying on and monitoring an organization, and disrupting an organization's actions. Attackers may be motivated by ideology, criminal intent, a desire for publicity, etc. Electronic mail (a.k.a. email) is a method for exchanging digital messages (composed of a header, body, and oftentimes an attachment) between authors and one or more recipients through mail servers. Email is one of the most popular means of communicating, and the vast majority of organizations use email for internal and external communication.


Therefore, email containing attachments of malicious files has become an attractive platform by which to initiate cyber-attacks against organizations.

The primary responsibility of an organization's cyber security team is to prevent such attacks from penetrating the organization's internal network, and to prevent these attacks from causing any damage to the organization. This is done by using defensive tools such as firewalls, Intrusion Detection Systems (IDS), Intrusion Prevention Systems (IPS), antiviruses, etc. However, these tools are limited in their ability to detect and identify the attacks that occur within email today, particularly when a sophisticated APT attack is executed against an organization when a user opens a malicious email attachment. Moreover, there is a lag time between the time a new, unknown malcode appears and the time antivirus vendors update their clients with the new signature, a time in which many computers are vulnerable to the new malcode (Christodorescu and Jha, 2004; White et al., 1999).

Attackers usually use social engineering in order to encourage the recipient to open a malicious email, open an attachment, or press a link. The term social engineering refers to psychological manipulation techniques used to fool people into performing actions that meet the needs of the attacker. For example, attackers will send an email with an intriguing subject or sophisticated content that will tempt the user to open the attachment. Trend Micro (an Internet security company) found that APT attacks, particularly those against government agencies and large corporations, are almost entirely dependent upon spear-phishing emails.

As most email servers prevent the attachment of executable files to email messages, the non-executable files attached to an email have played a major role in many recent cyber-attacks. These files are written in a format that can only be read by a tailored program, and these files cannot be directly executed. For instance, only a PDF reader, such as Adobe Reader or Foxit Reader, can open a PDF file.

Users consider non-executable files safer than executables, and thus, they are less suspicious toward such files received by email. Most naive computer users avoid opening executable files received via email, but they don't think twice about opening a Microsoft Office document or a PDF file. Unfortunately, non-executable files are as dangerous as executable files, since their readers can contain vulnerabilities that, when exploited, could allow an attacker to execute malicious actions on the victim's computer. F-Secure's 2008–2009 report indicates that the most popular file types for targeted attacks in 2008–2009 were PDF and Microsoft Office files. Note that since that time, the number of targeted attacks on Adobe Reader has almost doubled.

To demonstrate this point, an incident aimed at the Israeli Ministry of Defense (IMOD) took place on January 15, 2014, which provides an example of a new type of targeted cyber-attack. According to various media reports published on January 26, 2014, the Seculert Company reported that it identified an attack in which attackers sent email messages, allegedly from IMOD, with a malicious PDF file attachment posing as an IMOD document. When opened, the PDF file installed a Trojan horse that enabled the attacker to take control of the computer. This example clearly demonstrates that the existing solutions previously mentioned are insufficient in detecting and preventing such attacks, and furthermore emphasizes the necessity for advanced detection methods.

To date, antivirus packages are not sufficiently effective in intercepting malicious PDF files, even in the case of highly prominent PDF threats (Tzermias et al., 2011). Conversely, according to studies such as Tzermias et al. (2011), Srndic and Laskov (2013), Laskov and Srndic (2011), Vatamanu et al. (2012), Schmitt et al. (2012), Smutz and Stavrou (2012), Maiorca et al. (2012), Pareek et al. (2013), Lu et al. (2013), and Snow et al. (2011), machine learning (ML) methods can be effective in distinguishing between malicious and benign PDF files.

In the following survey, we present several significant studies pertaining to PDF detection using ML algorithms based on static analysis, dynamic analysis, or both. Our survey focuses on existing academic solutions that offer a variety of defense mechanisms against cyber-attacks. The typical use case scenario is characterized by a PC user, connected to an organizational network, who opens a malicious PDF email attachment, as the PC is the platform most commonly used by organizations. It should be noted that mobile platforms such as Android and iPhone are not within the scope of this survey.

This paper also outlines a novel Active Learning (AL) framework and highlights the correlation between the structural incompatibility of PDF files and their maliciousness. Additionally, it provides comparisons, insights and conclusions, as well as research avenues for future work in order to enhance the detection of malicious PDFs.

2. Structure of PDF files

A Portable Document Format (PDF) is a formatting language first conceived by John Warnock, one of the founders of Adobe Systems. The first version, version 1.0, was introduced in 1993, and the most current version now available is 1.7. The PDF specification is publicly available and thus can be used by anyone. A PDF has many functions beyond simple text: it can include images and other multimedia elements, be password protected, execute JavaScript, etc.

Likewise, the PDF is supported in all the prominent operating systems for the PC and mobile platforms (e.g., Microsoft Windows, Linux, Mac OS, Android, Windows Phone and iOS).

A PDF file is a set of interconnected objects built hierarchically. The PDF file structure is depicted in Fig. 1 and is comprised of four basic parts according to the Adobe PDF Reference (Maiorca et al., 2013; Srndic and Laskov, 2013):

1. Objects – basic elements in a PDF file:
   - Indirect objects – objects referenced by a number
   - Direct objects – objects that are not referenced by a number
   Object types:
   - Boolean – for true or false values
   - Numeric – an integer value or a real value
   - String – either a literal string, a sequence of literal characters enclosed with ( ) brackets, or a hexadecimal string, a sequence of hexadecimal numbers enclosed with <> brackets
   - Name – a sequence of 8-bit characters starting with /
   - Null – an empty object represented by the keyword null
   - Array – an ordered sequence of objects enclosed with [ ] brackets that can be composed of any combination of object types, including another array
   - Dictionary – an unordered sequence of key–value pairs: keys being names which should be unique in the dictionary, and values being any object type. Most of the indirect objects in a PDF document are dictionaries, and dictionaries are enclosed with << >> brackets.
   - Stream – a special dictionary object followed by a sequence of bytes enclosed with the keywords "stream" and "endstream." The information inside the stream may be compressed or encrypted by a filter. The prefix dictionary contains information on whether and how to decode the stream. Streams are used to store data such as images, text, script code, etc.
2. File Structure – defines how the objects are accessed and how they are updated. A valid PDF file consists of the following four parts:
   1. Header – the first line of a PDF file, which specifies the version number of the PDF specification which the document uses. The header format is "%PDF-[version number]".
   2. Body – the biggest section in the file, which contains all the PDF objects. The body is used to hold all of the document's data that is shown to the user.
   3. Cross reference – a table that includes the position of every indirect object in memory and allows random access to objects in the file, so the application does not need to read the whole file to locate a particular object. Each object is represented by one entry in the table, which is always 20 bytes long.
   4. Trailer – provides relevant information about how the application reading the file should find the cross reference table and other special objects. The trailer also contains information about the number of revisions made to the document. All PDF readers should begin reading a file from this section.
3. Document Structure – defines how objects are logically and hierarchically organized to reflect the content of a PDF file. Each page in the document is represented by a page object, which is a dictionary that includes references to the page's content. The root object of the hierarchy is the catalog object, which is also a dictionary.
4. Content Streams – objects that contain instructions which define the appearance of the page.

3. Techniques and possible attacks via PDF files

Before presenting the techniques and possible attacks, it is worthwhile to mention that Adobe Reader version X was released in 2011, containing a new feature called PMAR (Protected Mode Adobe Reader). Protected mode uses the sandbox technique in order to create an isolated environment for the Acrobat Reader rendering agent to run while reading a PDF file. Thus, malicious code actions cannot affect the operating system. However, most organizations are not up-to-date with the newest versions of PDF readers, and thus they are exposed to many of the well-known attacks that exploit the vulnerabilities existing in previous versions of Adobe Reader. In addition, PDF files can be utilized for malicious purposes using a variety of techniques.

After understanding the implications of Adobe Reader version X and introducing the PDF's basic structure and elements, we now present the possible attack techniques that can be leveraged by attackers.

3.1. JavaScript code attack

PDF files can contain client-side JavaScript code for legitimate purposes including: 3D content, form validation, and calculations. JavaScript code can reside on a local host or a remote server, and can be retrieved using /URI or /GoTo (Laskov and Srndic, 2011). The main indicator for JavaScript code embedded in a PDF file is the presence of the /JS keyword in some dictionaries (as previously explained in section 2). However, an object containing such a dictionary can be placed in a filtered stream. The /JS keyword will then not be visible in plain text, and therefore may obstruct static analysis detection techniques that rely on these keywords (Maiorca et al., 2012).
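To make the structural elements above concrete, the following short Python sketch (written for this survey as an illustration; it is not part of any surveyed tool) counts raw-byte occurrences of a few of the markers just discussed, such as object and stream delimiters and the /JS and /JavaScript name keywords. As noted above, keywords hidden inside filtered streams will not be visible to such a plain scan.

    import re
    import sys
    from collections import Counter

    # Markers drawn from the structural elements described in section 2 and the
    # JavaScript-related name keywords mentioned above.
    MARKERS = [b"obj", b"endobj", b"stream", b"endstream", b"xref", b"trailer",
               b"/JS", b"/JavaScript", b"/OpenAction", b"/EmbeddedFile"]

    def count_structural_markers(path):
        # Count occurrences of each marker in the raw bytes of the file.
        # Note: b"obj" also matches inside b"endobj"; dedicated tools such as
        # PDFiD handle these cases more carefully.
        with open(path, "rb") as f:
            data = f.read()
        counts = Counter()
        for marker in MARKERS:
            counts[marker.decode()] = len(re.findall(re.escape(marker), data))
        return counts

    if __name__ == "__main__":
        print(count_structural_markers(sys.argv[1]))

Counts of this kind are the raw material used by the keyword-based static detectors surveyed later in section 4.1.2.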

Fig. 1 – Simple PDF file structure.

The primary goal of the malicious JavaScript code inside a PDF file is to exploit a vulnerability in the PDF viewer in order to divert the normal execution flow to the embedded malicious JavaScript code. This can be achieved by performing a heap spraying attack, as implemented through JavaScript. Another malicious activity that can be carried out using JavaScript is downloading an executable file from the Internet that initiates an attack on the victim's machine once executed. Alternatively, JavaScript code can also open a malicious website that can perform a variety of malicious operations targeting the victim's machine.

Code obfuscation is legitimately used to prevent reverse engineering of proprietary applications. It can also be used by attackers to conceal malicious JavaScript code from being recognized by signature based detectors, and to reduce readability by a human analyst.

Filters are used in PDFs to compress data in order to enhance compactness, or to facilitate encoding. By using filters, an attacker can conceal the malicious code, rendering it undetectable or unreadable. Multiple filters can be applied to a stream, one after the other. The filter names used must be declared in the stream dictionary. Available filters and their primary purposes are discussed by P. Baccas and J. Kittilsen (Baccas, 2010; Kittilsen, 2011).
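Because filters can hide keywords such as /JS from a plain-text scan, static analyzers usually try to decode filtered streams before searching them. The sketch below is an illustration written for this survey (not code from any of the surveyed tools); it inflates FlateDecode-compressed streams with Python's zlib module and reports which decoded streams mention JavaScript. Other filters (e.g., ASCIIHexDecode) would require their own decoders.

    import re
    import sys
    import zlib

    STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

    def decoded_streams_with_js(path):
        # Yield the indices of FlateDecode streams whose decoded bytes mention JavaScript.
        with open(path, "rb") as f:
            data = f.read()
        for i, match in enumerate(STREAM_RE.finditer(data)):
            try:
                decoded = zlib.decompress(match.group(1))  # succeeds only for Flate-compressed data
            except zlib.error:
                continue  # stream uses another filter, is encrypted, or is malformed
            if b"/JS" in decoded or b"/JavaScript" in decoded or b"eval(" in decoded:
                yield i

    if __name__ == "__main__":
        print(list(decoded_streams_with_js(sys.argv[1])))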

Table 1 summarizes various code obfuscation techniques that can be employed by attackers (Kittilsen, 2011).

Table 1 – Code obfuscation techniques in PDF files that can be used by an attacker.
- Separating malicious code over multiple objects: Malicious code is spread among multiple objects. Code chunks are collected, merged, and compiled to form a malicious piece of code only during runtime. This makes it difficult for static analysis detectors to recognize the malicious code.
- Applying filters: Filters are used to conceal malicious code (as described above).
- White space randomization: Random white spaces are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. White spaces do not affect the code since JavaScript ignores them.
- Comment randomization: Random comments are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. Comments do not affect the code since JavaScript ignores them.
- Variable name randomization: Changing variable names randomly in order to fool signature based maliciousness detectors.
- Integer obfuscation: Representing numbers in a different way. For example, this can be used to hide a specific memory address.
- String obfuscation: Making changes to a string in order to make it difficult for a human analyst to understand the code, for example, by splitting the string into several substrings.
- Function name obfuscation: Hiding the name of the function used, which could provide a clue about the code's intention. This is done by creating a pointer with a random name to the required function.
- Advanced code obfuscation: A string can hold encrypted malicious code. The decryption process takes place during runtime, just before usage. Metadata fields and even the document's words can also be used to store malicious code.
- Block randomization: Changing the syntax of the code but not its action.
- Dead code: Inserting blocks of code that are not intended to be executed.
- Pointless code: Inserting blocks of code that do not perform anything.

3.2. Embedded files attack

A PDF file can contain other file types inside of it, for example, HTML, JavaScript, SWF, XLSX, EXE, Microsoft Office files or even another PDF file. An attacker can use this functionality in order to embed a malicious file inside a benign file. This way, the attacker can leverage the vulnerabilities of other file types in order to perform malicious activity, such as exploiting the vulnerability of a Flash file embedded within a PDF file. By using special techniques, the embedded file can be opened without alerting the user. Usually, embedded malicious files are obfuscated in order to avoid detection. The PDF viewer will not allow the launching of an embedded executable file because of its blacklist, which is based on file extension. However, Python code (*.py) is not blacklisted.

In 2013, Maiorca et al. (Maiorca et al., 2013) presented a practical, novel evasion technique called "reverse mimicry." This technique was designed to evade state-of-the-art malicious PDF detectors, which have the ability to accurately detect malicious PDF files by analyzing their logical structure (Srndic and Laskov, 2013). This technique will be elaborated upon in the solution section.

Mimicry attacks attempt to change a malicious file's structure and objects so that the file is similar to a benign file. The proposed reverse mimicry attack changes a benign file to make it malicious. It injects malicious content into a benign PDF file so that its benign structure remains. This method can be automated easily and does not require knowledge about the structural features used in the maliciousness detector.

The authors present three methods of implementing the reverse mimicry attack: firstly, embedding a malicious EXE payload into a benign PDF file; secondly, embedding a malicious PDF file into a benign PDF file; and thirdly, JavaScript injection, in which malicious JavaScript code that is embedded in the PDF file, and does not contain references to other objects inside the PDF file elements, is encapsulated – which is perhaps the best way to perform a reverse mimicry attack. Here, the JavaScript code is not located in typical places, and it is not related to objects in the PDF; thus, it might evade structural based detection techniques that will be elaborated upon later. Sample files created by this attack were evaluated by state-of-the-art malicious PDF detectors (discussed in the detection methods section) based on supervised machine learning: PJScan (Laskov and Srndic, 2011); three online versions of PDFRate (Smutz and Stavrou, 2012) based on three different data sources: Contagio, George Mason University, and Community; and PDF Malware Slayer (PDFMS) (Maiorca et al., 2012). The evaluation results show that the three techniques presented are highly successful in evading the detection tools listed above. Only PJScan (Laskov and Srndic, 2011) detected the JavaScript attack.

They also proposed a new framework to deal with the evasion attacks presented in the article. The proposed framework recursively extracts any embedded PDF file from a suspicious PDF file and then applies three analyses on it: embedded JavaScript code analysis, PDF structural analysis, and analysis of the embedded EXE or SWF file.
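In the spirit of the framework just described, which recursively inspects embedded content, the following illustrative sketch (ours, not the authors' implementation) flags a few coarse indicators of embedded or launchable payloads by scanning a PDF's raw bytes for the corresponding name keywords. It is a rough triage heuristic, not a detector.

    import sys

    # Name keywords commonly associated with embedded or launchable content.
    INDICATORS = [b"/EmbeddedFile", b"/Filespec", b"/Launch", b"/RichMedia", b"/ObjStm"]

    def embedded_content_indicators(path):
        # Return the indicator keywords that appear in the raw bytes of the file.
        with open(path, "rb") as f:
            data = f.read()
        return [ind.decode() for ind in INDICATORS if ind in data]

    if __name__ == "__main__":
        found = embedded_content_indicators(sys.argv[1])
        print("indicators found:", found if found else "none")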

3.3. Form submission and URI attacks

In 2013, Valentin Hamon (Hamon, 2013) presented practical techniques that can be used by attackers to execute malicious code from a PDF file. Adobe Reader supports the option of submitting a PDF form from a client to a specific server using the /SubmitForm command. Several file formats can be used for that purpose, one of which is the Forms Data Format (FDF), the default format, which is based on XML. Adobe generates an FDF file from a PDF in order to send the data to a specified URL. If the URL belongs to a remote webserver, it is able to respond. Responses are temporarily stored in the %APPData% directory, which automatically pops up in the default web browser. An attack can be performed by a simple request to a malicious website that will automatically pop up in the web browser, and the malicious website can exploit a vulnerability in the user's web browser. The author showed that security mechanisms such as the Protected Mode of Adobe Reader X or the URL Security Zone Manager of Internet Explorer can be easily disabled by changing the corresponding registry key, a change that can be implemented with user privileges. Moreover, a URI address can be used (instead of a URL), referring to any type of file located remotely without limitations, both executable and non-executable files, including .exe or .PDF. One interesting scenario is launching a malicious PDF file from a benign one. Other attacks described in the paper include an attack in which a benign PDF simply submits its form to the attacker's PHP web server. The server searches the submitted form header for the Adobe Reader version using regular expressions. Once the server identifies the user's Adobe Reader version, it sends back a malicious PDF that exploits a vulnerability corresponding to that version. The returned PDF is automatically launched. Another attack described is the Big Brother attack. When the user opens a PDF, it automatically downloads a malicious executable file using the web browser (URI address). This attack can be performed prior to the previously described attack, in order to make changes to registry keys that configure the security mechanisms discussed above.

4. Advanced methods for the detection of malicious PDF files

Advanced methods for the detection of new, unknown malicious PDF files are based mainly on classifiers induced by machine learning algorithms that leverage information from features extracted from the PDF files using either static or dynamic analysis. Static analysis examines and evaluates a file or application solely by analyzing its code, without actually executing it. Alternatively, dynamic analysis executes the file or application and examines its actions and behavior during runtime. Moreover, static and dynamic analysis methods can both be used to examine either executables or non-executables. The desire to enhance security in the face of attacks based on malicious PDF files has led to a great deal of published research in recent years. Most of the academic work on the detection of malicious PDF files is based on static analysis, because static analysis requires less computing resources and it is much faster. Static analysis methods usually inspect embedded JavaScript code or examine document metadata (such as its objects or structure, or the number of specific streams, objects, keywords, etc.). However, static analysis has drawbacks as well, including the inability to detect well-obfuscated code that acts maliciously during runtime, in contrast to dynamic analysis, which will probably detect the well-obfuscated malicious PDF.

We divided the methods into two main categories of analysis, static and dynamic, based on the primary process applied to the PDF file. Static analysis includes methods that statically analyze specific components inside the PDF file without executing or opening the PDF file. These components might conceal indications of the maliciousness of a PDF file. JavaScript code is one component of PDF files well worth analyzing. According to Vatamanu et al. (2012), most attacks related to PDF files are conducted using JavaScript code embedded inside the PDF; therefore, we created a sub-category of JavaScript analysis and surveyed the methods aimed at detecting malicious PDF files based on analysis of their JavaScript code (Laskov and Srndic, 2011; Vatamanu et al., 2012). Another sub-category of static analysis is metadata analysis, focused on analyzing meta-features related to all the content of the file. The main advantage of the latter methods over JavaScript analysis is that they are not affected by code obfuscation, because metadata analysis doesn't focus on analyzing the JavaScript code itself. This sub-category includes several approaches (Srndic and Laskov, 2013; Smutz and Stavrou, 2012; Maiorca et al., 2012; Pareek et al., 2013).

The second category, dynamic analysis, largely includes studies that dynamically execute the JavaScript code embedded in a PDF. These studies conduct an extraction of the JavaScript code, either statically or dynamically. We surveyed four dynamic analysis works (Tzermias et al., 2011; Schmitt et al., 2012; Lu et al., 2013; Snow et al., 2011) and divided them into two sub-categories based on the extraction method of the embedded JavaScript code within the PDF file.

Fig. 2 presents a taxonomy of the latest research on the detection of malicious PDF files that is further described in the literature review. The techniques are grouped into two categories, static analysis and dynamic analysis, while each category also contains sub-categories that contribute to the reader's knowledge and orientation.

4.1. Detection methods based on static analysis

We now present the two sub-categories of static analysis for the detection of malicious PDF files. Since most of the attacks related to PDF files are conducted using JavaScript code embedded inside the PDF, the sub-category which includes methods aimed at statically analyzing the embedded JavaScript code inside PDF files (Laskov and Srndic, 2011; Vatamanu et al., 2012) will be presented first. The second sub-category includes four methods (Srndic and Laskov, 2013; Smutz and Stavrou, 2012; Maiorca et al., 2012; Pareek et al., 2013) that conduct static analysis based on the PDF file's metadata. It is important to note that almost all static analysis methods rely upon a PDF parser capable of parsing the file and extracting the desired informative data. This parser should also be capable of dealing with the encryption and encoding used by filters in a file. The robustness of the parser and its ability to reach all information within the file is the key to thorough and profound feature extraction. A parser that is not able to cope with some filters and other obfuscation techniques may miss valuable and important information such as JavaScript code, and this may result in a malicious PDF file being misclassified as benign.
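As a minimal illustration of the classifier-based approach described in this section, the sketch below builds feature vectors from keyword counts (in the spirit of the keyword-based methods surveyed in section 4.1.2) and trains a Random Forest with scikit-learn. The file paths, labels, and keyword set are placeholders chosen for the example; none of the surveyed systems is reproduced here.

    import re
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder keyword set; real systems use much richer feature sets.
    KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch",
                b"/EmbeddedFile", b"obj", b"stream"]

    def feature_vector(path):
        # Represent a PDF as a vector of keyword-occurrence counts.
        with open(path, "rb") as f:
            data = f.read()
        return [len(re.findall(re.escape(k), data)) for k in KEYWORDS]

    def train_detector(benign_paths, malicious_paths):
        X = [feature_vector(p) for p in benign_paths + malicious_paths]
        y = [0] * len(benign_paths) + [1] * len(malicious_paths)  # 0 = benign, 1 = malicious
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)
        return clf

    # Usage (paths are hypothetical):
    #   clf = train_detector(["benign1.pdf"], ["malicious1.pdf"])
    #   clf.predict([feature_vector("suspicious.pdf")])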

Fig. 2 – Taxonomy of academic research on detection methods of malicious PDF files.

4.1.1. JavaScript analysis
The following two JavaScript analysis methods attempt to tokenize the JavaScript code extracted from a PDF file, each of them applying a different approach. Both methods apply machine learning algorithms to the tokenized code in order to build a classification model and classify new, unfamiliar PDF files after the embedded JavaScript code has been extracted from them. The tokenization of the code captures the variable types, function names, operators, etc. that are used. Such methods, aimed at analyzing JavaScript code, must be capable of coping with code obfuscation techniques such as those presented in Table 1. Well-performed code obfuscation techniques can evade code analysis methods, consequently causing a malicious PDF file to be misclassified as benign. For instance, code obfuscation can conceal the use of suspicious JavaScript functions such as the Eval() function. Eval() enables a dynamic execution of JavaScript source code stored in a string variable.

4.1.1.1. Lexical analysis. Srndic and Laskov (Laskov and Srndic, 2011) introduced PJScan, a purely static analysis and anomaly detection tool for the detection of malicious JavaScript code inside a PDF file. In this method, a One-Class Support Vector Machine (OCSVM), a machine learning method, is used to automatically construct models from available data for subsequent classification of new data. The dataset used for testing contained over 65,000 real-life PDF documents collected from the VirusTotal corpus, ~40,000 benign files and ~25,000 malicious files. The feature extraction component makes use of an open source PDF rendering library called POPPLER to search for embedded JavaScript code in a document. The extraction component permits reliable extraction of JavaScript code from a PDF document by searching all potential locations provided by the Adobe PDF reference documentation. Files which do not contain JavaScript do not move to the next stage. After the JavaScript code has been found and extracted, a lexical analysis is performed on it using the Mozilla SpiderMonkey JavaScript interpreter. Lexical analysis attempts to represent the JavaScript code as a sequence of tokens, for example, left parenthesis, plus, right parenthesis, etc. Using these tokens, PJScan tries to induce learning detection models that differentiate between benign and malicious PDF files. PJScan demonstrates fast performance of less than 50 ms per file. The attained average true positive rate (TPR) was 85% for previously known PDF attacks and 71% for unknown PDF attacks. A false positive rate (FPR) of 16–17% is measured when tested against JavaScript-bearing benign PDF files.
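To give a concrete sense of PJScan's one-class formulation, the sketch below (a simplified illustration, not PJScan itself) tokenizes JavaScript with a coarse regular expression, builds token-frequency vectors, and fits scikit-learn's One-Class SVM on known-malicious scripts only, so that new scripts resembling that class are flagged. PJScan's actual tokenizer is SpiderMonkey's lexer, and its feature and kernel choices differ.

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import OneClassSVM

    # Very coarse lexical tokenizer: identifiers, numbers, and single punctuation marks.
    TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z0-9]")

    def lex(script):
        # Turn a JavaScript string into a space-separated token sequence.
        return " ".join(TOKEN_RE.findall(script))

    def train_one_class(malicious_scripts):
        # Fit a one-class model on token-frequency vectors of known-malicious scripts.
        vectorizer = CountVectorizer(token_pattern=r"\S+")
        X = vectorizer.fit_transform(lex(s) for s in malicious_scripts)
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)
        return vectorizer, model

    def resembles_malicious(vectorizer, model, script):
        # +1 means the script falls inside the learned (malicious) region.
        return model.predict(vectorizer.transform([lex(script)]))[0] == 1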

4.1.1.2. Clustering. While the previous work (Laskov and Srndic, 2011) presented a static lexical analysis of the JavaScript code found in PDF files, the following solution focuses on establishing a clustering method for identifying similar scripts that have been obfuscated using a variety of obfuscation techniques, in order to detect malicious JavaScript code inside a PDF file.

Vatamanu et al. (Vatamanu et al., 2012) introduced two different static methods for clustering PDF files based on tokenization of their embedded JavaScript. The first is hierarchical bottom-up clustering and the second is hash table clustering. The article focuses on establishing a clustering method for the identification of similar scripts that have been obfuscated using different techniques. For each examined PDF file, a fingerprint is created. The fingerprint is a set of unique JavaScript tokens and their frequencies. Experimentation included two datasets of PDF files. The malicious PDF dataset consisted of 997,615 different malicious PDF files collected from honeypots, spam messages, etc. The benign PDF dataset consisted of 1,333,420 files collected from popular websites. Results showed that 93% of the malicious PDF files contain JavaScript and only 5% of the benign PDF files contain JavaScript. The study also revealed that the hash table clustering method is much faster and more appropriate for large datasets than hierarchical bottom-up clustering.

4.1.2. Metadata analysis
The following static analysis methods analyze a PDF file by examining its metadata (i.e., data about data). Each method uses a different approach, such as analyzing the occurrences of embedded keywords (Maiorca et al., 2012), analyzing the hierarchical structural paths (Srndic and Laskov, 2013), calculating the entropy of sets of byte sequences of the entire file (Pareek et al., 2013), computing n-grams of the file's hexadecimal dumps (Pareek et al., 2013), and using specific significant meta-features (Smutz and Stavrou, 2012). These approaches share a focus on global or statistical information about the PDF file's objects and structure, rather than on its actual content (plaintext or code).

4.1.2.1. Keywords analysis. Maiorca et al. (Maiorca et al., 2012) introduced the PDF Malware Slayer (PDFMS), a static analysis tool which characterizes PDF files according to the set of embedded keywords and their occurrence. This information is used in order to classify suspicious files. The dataset consisted of 21,146 files in total, 11,157 of them malicious and 9989 benign. The proposed tool consists of two modules: a data retrieval module which retrieves files for the training and testing phases, and a feature extractor module which determines the type of features to be used by the classifier. To retrieve the keywords from the PDF file, the authors used the PDFiD tool (a Python script) developed by Didier Stevens. The files were characterized by keywords such as: /JS, /JavaScript, /Encrypt, obj, stream, filter, etc. The authors chose to use the Random Forest classifier, which provided the highest accuracy and performed significantly better than other tested classifiers such as Naive Bayes, SVM with a linear kernel, and the J48 decision tree.

Their main contribution is the ability to detect malicious PDF files whether or not they contain JavaScript code, unlike previously described tools such as PJScan (Laskov and Srndic, 2011) which only detects malicious files if they contain JavaScript code. However, an attacker can learn which keywords characterize benign files and inject these keywords inside a malicious file in order to bypass PDFMS, thus demonstrating the tool's weakness.

4.1.2.2. Hierarchical structure analysis. The previous solution, PDFMS, aggregates data on specific object types inside the PDF file. The following solution takes it a step further and also aggregates the objects' hierarchy, which possesses much more information.

Srndic and Laskov (Srndic and Laskov, 2013) introduced a high performance static method for the detection of malicious PDF documents which, instead of analyzing JavaScript or any other content, makes use of essential differences in the structural properties of malicious and benign PDF files. The static analysis method they introduced evaluates documents based on side effects of malicious content within their structure. When an attacker injects malicious content into the PDF file, the file structure inevitably changes. When a PDF is examined, it is converted into a set of structural paths which characterize the document's structure. A structural path is a path within the document's structural hierarchy, and the occurrences of the paths are also counted. The PDF is parsed using the PDF parser POPPLER. The parser extracts structural paths from malicious and benign real-world PDF files, which are used to create the training set. Feature extraction for the 150 GB of data took a total of 121 minutes and 55 seconds. Two classification models were trained: SVM – LibSVM, a well-known stand-alone SVM implementation, and Decision Tree – a C5.0 inference implementation. The number of different structural paths extracted from the PDF files in the training set was over nine million. Thus, only structural paths that appeared in more than 1000 files of the dataset were selected, generating a training set with 6087 features. Their results show a TPR of 99.88% and an FPR of 0.001%. Detection accuracy surpasses VirusTotal accuracy. This detection model was found to be a very effective way to differentiate malicious PDFs from benign PDFs and was even effective against malicious files created two months after the classification model was created. The method was tested against previous detectors: MDScan (Tzermias et al., 2011), PJScan (Laskov and Srndic, 2011), ShellOS (Snow et al., 2011) and PDFMS (Maiorca et al., 2012), and the comparison clearly demonstrated the efficiency and resilience of this method in the detection of new malicious PDF files.

The authors also discussed evasion techniques such as the feature addition attack – the addition of benign features to malicious PDF files so that they will be classified as benign – and the feature removal attack – the removal of features from malicious PDF files so that they will be classified as benign. Their main contribution is a novel technique for the detection of malicious PDF files based on the difference between the underlying structural properties of benign and malicious PDF files.
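The structural-path representation described in section 4.1.2.2 can be illustrated with the short sketch below. It is our simplified example, not the original implementation (which parsed files with POPPLER); it assumes some parser has already turned the document's object hierarchy into nested Python dictionaries and lists, and it ignores the cycles that back-references create in real PDFs.

    from collections import Counter

    def structural_paths(obj, prefix=""):
        # Enumerate structural paths of an already-parsed object hierarchy,
        # e.g. {"/Type": "/Catalog", "/Pages": {"/Kids": [{"/Contents": {}}]}}.
        if isinstance(obj, dict):
            for key, value in obj.items():
                path = prefix + key
                yield path
                yield from structural_paths(value, path)
        elif isinstance(obj, list):
            for item in obj:  # array elements extend the same path
                yield from structural_paths(item, prefix)

    def path_count_features(catalog, vocabulary):
        # Count occurrences of each structural path in a fixed path vocabulary.
        counts = Counter(structural_paths(catalog))
        return [counts[p] for p in vocabulary]

    # Example with a hypothetical, heavily simplified catalog object:
    #   catalog = {"/Type": "/Catalog", "/Pages": {"/Kids": [{"/Contents": {}}]}}
    #   sorted(set(structural_paths(catalog)))
    #   -> ['/Pages', '/Pages/Kids', '/Pages/Kids/Contents', '/Type']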


4.1.2.3. Content metadata analysis. The aforementioned hierarchical structure analysis converts the structure of a file to a list of features that represents the file and can be used by machine learning algorithms. The following solution selects 202 specific meta-features that are available from a minimal parsing process. These meta-features are obtained from both the document's metadata and its structure.

Smutz and Stavrou (Smutz and Stavrou, 2012) presented PDFRate, a framework which is based on meta-features extracted from a document's content for the detection of malicious PDF files. The process is based on the use of a self-implemented reliable parser for feature extraction, because existing tools are unable to deal with malformed documents. The extracted features include the number of font objects, the average length of stream objects, and the number of lower case characters in the title. In total, 202 features were chosen for classification. Two data sources were used for the research: the first is the Contagio dataset collection, and the second is based on monitoring the network of a large university's HTTP traffic (six days of capturing). They collected over 5000 unique malicious PDF files and over 100,000 benign PDF files. Their results showed that the Random Forest classifier provided the best results, based on its ability to distinguish opportunistic attacks from targeted attacks. The total results were as follows: the TPR exceeded 99%, and the FPR was lower than 0.2%. To evaluate their method they conducted three experiments. First, an evaluation of classification and detection performance was carried out, which resulted in two ROC curves: one for the benign and malicious classifier and another for the opportunistic and targeted classifier. Next, an evaluation of new variant detection was performed, in which five antiviruses were used to distinguish variants of similar malicious documents. The results were used to create a variant identifier. Finally, PDFRate was compared with PJScan (Laskov and Srndic, 2011), which produced results only for files in which it found JavaScript. The results show that PJScan was unable to classify many malicious documents that do in fact contain JavaScript but in atypical locations, such as metadata sections or corrupted document structures. The main contribution of this work was the implementation of a robust feature extractor that can also stand up to malformed documents and to JavaScript that is embedded in atypical locations inside the PDF.

4.1.2.4. Term frequency and entropy analysis. Contrary to the aforementioned approaches, which rely upon a PDF parser's ability to extract relevant data from objects embedded in the PDF file, the following study proposes two different detection methods that do not employ a PDF parser. Alternatively, the methods apply the analysis to the whole file, after its content is converted to hexadecimal dumps or byte sequences.

Pareek and Eswari (Pareek et al., 2013) introduced two static analysis methods for the detection of malicious PDFs. The first method is based on entropy, and the second is based on n-gram term frequency. The benign and malicious datasets used for evaluating the two approaches contained 792 PDF documents each. The benign files were taken mainly from research papers, reports, and financial documents. The malicious PDF files were obtained using computing.net, malware.lu, contagiodump.blogspot.com, and Brandon Dixon. The first, entropy based method was used to measure the uncertainty or randomness in a given dataset. A file is represented as a set of byte sequences. The authors assumed that the level of uncertainty of a malicious file should be less than that of a benign file of similar format. Low entropy of a file is not a strong indicator of maliciousness; however, it can be a useful feature in combination with other features. Evaluation results of entropy calculations on malicious and benign PDF files indicate that the average entropy for the malicious dataset (4.8) is significantly lower than that of the benign dataset (7.7). The second method, the n-gram based approach, takes substrings of a given large string, where the n-gram can be words or bytes; for example, applying 3-grams of words to the sentence "is this PDF malicious?" creates the following two terms: "is this PDF" and "this PDF malicious?" The authors generated hexadecimal dumps from PDF files and generated 2-grams (words) from them. The 2-grams were represented by term frequency (TF) and term frequency inverse document frequency (TFIDF). The J48 algorithm was used to build a model from the TF and TFIDF. The results were a TPR of 0.9922 and an FPR of 0.006. Their main contribution is the combination of the n-gram approach and entropy measurement for the detection of malicious PDF files.
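The entropy and n-gram measures used by Pareek and Eswari can be computed in a few lines; the sketch below is an illustration written for this survey, not their implementation. It computes the Shannon entropy of a file's bytes and the term frequencies of byte-level 2-grams over its hexadecimal dump.

    import math
    from collections import Counter

    def byte_entropy(path):
        # Shannon entropy, in bits per byte, of the file's raw bytes.
        with open(path, "rb") as f:
            data = f.read()
        if not data:
            return 0.0
        counts = Counter(data)
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def hex_2gram_tf(path):
        # Term frequencies of 2-grams (pairs of consecutive bytes) over the hex dump.
        with open(path, "rb") as f:
            hexdump = f.read().hex()
        grams = [hexdump[i:i + 4] for i in range(0, len(hexdump) - 2, 2)]
        if not grams:
            return {}
        total = len(grams)
        return {g: c / total for g, c in Counter(grams).items()}

    # Per the study cited above, malicious samples averaged lower byte entropy
    # (about 4.8) than benign ones (about 7.7) on the evaluated datasets.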
extract relevant data from objects embedded in the PDF file, The second work was presented two years later by Knut the following study proposes two different detection methods Borg (Borg, 2013) as a continuation of Kittilsen's research that do not employ a PDF parser. Alternatively, the methods (Kittilsen, 2011). This thesis focuses on online detection of PDF apply the analysis methods on the whole file, after its content files, while Kittilsen's thesis featured offline detection. Kittil- is converted to hexadecimal dumps or byte sequences. sen's proposed IDS extracted PDF files from the network traffic Pareek and Eswari (Pareek et al., 2013) introduced two static to the local hard drive and then executed a classification al- analysis methods for the detection of malicious PDF. The first gorithm to detect maliciousness. This meant that a malicious method is based on entropy, and the second is based on n- PDF file could reach its destination. Borg sought to provide gram term frequency. The benign and malicious datasets used answers regarding the viability of an online detection system for evaluating the two approaches contained 792 PDF docu- of PDF files which analyzes the PDF from the network traffic in ments each. The benign files were taken mainly from research real-time and does not allow the file to continue on to the papers, reports, and financial documents. The malicious PDF target if it is classified as malicious. Borg's research sought to files were obtained using computing.net, malware.lu, answer the following questions: “Is online analysis of PDF files contagiodump.blogspot.com, and Brandon Dixon. The first viable and what kind of time delay will the user experience?” entropy based method was used to measure the uncertainty and “Does the programming language C perform the same or randomness in a given dataset. A file is represented as a set task at a significantly higher speed than Kittilsen achieved of byte sequences. The authors assumed that the level of with Python and how significant is the difference?” The uncertainty of a malicious file should be less than that of a answer to the first question is that the detection system in its benign file of similar format. Low entropy of a file is not a current form should not be implemented in a real environ- strong indicator of maliciousness, however it can be a useful ment because of its many faults, including the limitations of feature in combination with other features. Evaluation results SNORT (specifically related to control), the difficulty in of entropy calculations on malicious and benign PDF files knowing when a PDF file ends because the end-of-file marker indicate that average entropy for the malicious dataset (4.8) is can exist almost anywhere in a document, and issues related significantly lower than that of the benign dataset (7.7). The to buffering the traffic in a network, such as what happens second method, the n-gram based approach, takes substrings of a given large string where the n-gram can be words or bytes; 27 for example, applying 3-grams of words to the sentence “is Open source network intrusion prevention and detection system (IDS/IPS) developed by Sourcefire. http://www.snort.org/. this PDF malicious?” creates the following two terms: “is this 28 https://github.com/jasonish/snort/blob/master/tools/u2boat/ ” “ ” PDF and this PDF malicious? The authors generated README.u2boat. 29 http://www.circlemud.org/jelson/software/tcpflow/. 26 http://contagiodump.blogspot.co.il/. 

The answer to Borg's second question is that there was a significant difference in time usage between C and Python, in favor of the C programming language. The results show that it is quite hard to develop a system that can analyze PDF files in real-time and prevent them from reaching their targets solely by listening to the network traffic.

Due to reasons of insufficient applicability, the last two works described above will not be listed as solutions in the summary tables presented in the upcoming section.

4.2. Detection methods based on dynamic analysis

All of the following dynamic analysis methods focus on the analysis of embedded JavaScript code (which may or may not reside in a PDF file) during runtime. All these methods include a dynamic step either in the feature extraction or in the analysis phase, hence belonging to the dynamic analysis category. We divided this category into two sub-categories based on the way the methods extract the JavaScript to be analyzed. The first sub-category presents studies that statically extract the JavaScript code and includes three methods. Two of these methods, MDScan (Tzermias et al., 2011) and PDF Scrutinizer (Schmitt et al., 2012), start with a static extraction of the embedded JavaScript code from a PDF file and then execute the extracted code using a JavaScript engine. The execution is examined during runtime using heuristics in order to detect suspicious or malicious activity. The third method, ShellOS V1 (Snow et al., 2011), also appears in the second sub-category of dynamic extraction as ShellOS V2 (Snow et al., 2011). The authors did not indicate how they extract the JavaScript code for further analysis, and from this, we assume that it can be done either by static or dynamic extraction, which is then followed by the method's special approach of analysis based on a hosting operating system, which we will elaborate on later. MPScan (Lu et al., 2013) also belongs to this second sub-category of dynamic extraction, as it extracts the JavaScript code dynamically during runtime.

The presented dynamic analysis methods, which execute JavaScript code in order to detect malicious behavior, differ in the way that they extract data. Generally speaking, the more exhaustive the extraction of JavaScript code is, the better the dynamic analysis can be in terms of detection ability. Nevertheless, an attempt to extract JavaScript code from a file statically may fail and result in JavaScript code that does not accurately represent the behavior of the file. The reasons may vary; for example, the code can be well obfuscated or located in irregular locations within the PDF file. Dynamic extraction is more robust with regard to the weaknesses of static extraction.

It is important to clarify that the academic solutions in this category do not perform a dynamic analysis on the entire file; rather, dynamic analysis is only performed on the JavaScript code that was extracted from the PDF file. This is in contrast to some commercial solutions that execute the PDF file and examine its behavior and influence on the host operating system during runtime (full dynamic analysis). Full dynamic analysis consumes much more resources than the dynamic analysis done by academic solutions, but is probably better at detecting malicious PDF files because it provides a better indication of the file's real purpose. Moreover, the full dynamic analysis approach is robust against code obfuscation since it does not analyze string code or pretend to extract the code from the file. This approach is most similar to the examination of suspicious code by a security expert with forensic tools.

4.2.1. Static JavaScript extraction
Identical to the static methods presented in section 4.1, MDScan and PDF Scrutinizer rely on a PDF parser that should be capable of parsing the PDF file, locating the embedded JavaScript, and extracting it. In cases in which the parser isn't robust enough or is unable to extract the code from a file that incorporates embedded malicious JavaScript code, the file might be mistakenly classified as benign.

Tzermias et al. (Tzermias et al., 2011) introduced the design and implementation of MDScan, a standalone malicious document scanner which uses both static and dynamic analysis methods to detect malicious PDF files. MDScan statically analyzes the PDF structure and extracts all of the objects from the document body, including objects that contain JavaScript and objects that are deliberately left out of the cross-reference table for malicious reasons. Then it pulls out the embedded JavaScript code and examines it by actually running it on a SpiderMonkey JavaScript engine. Used string variables are dynamically analyzed during execution, and if some form of shellcode is revealed in the address space of the JavaScript interpreter, the document is classified as malicious. MDScan is not affected by code obfuscation, since it runs the code and does not simply look at it. MDScan does not rely on previously known vulnerabilities and thus is able to detect malicious PDF documents which exploit unknown vulnerabilities (zero-day) in PDF readers. The average processing time for a benign PDF file is 1.5 seconds, and approximately 80% of files are scanned in less than one second. The effectiveness of MDScan was tested using real world samples. The malicious dataset contained 197 malicious PDFs collected from public malware repositories and malicious websites, and it also contained nine self-written samples generated using the nine different PDF exploit modules of Metasploit. The benign dataset consisted of 2000 benign PDF files found in Google. Evaluation results show a TPR of 89% and an FPR of 0%.

As opposed to the previous solution, which relies solely on the detection of shellcode, the following solution employs multiple heuristics to enable the detection of a wider range of malicious operations. Furthermore, it enables the identification of known vulnerabilities and can often recognize the CVE-IDs of the vulnerabilities.

Schmitt et al. (Schmitt et al., 2012) introduced PDF Scrutinizer, a malicious PDF detection and analysis tool that also uses static and dynamic analysis methods to detect maliciousness. PDF Scrutinizer focuses on JavaScript-based attacks, but it is also suitable for non-JavaScript-based attacks. The paper presents malicious techniques used in PDF documents
256 computers & security 48 (2015) 246e266

such as obfuscation techniques, the exploitation of vulnera- increases code execution efficiency. ShellOS runs as a guest bilities, heap spraying attacks, and malicious embedded files. under a host operating system using Kernel Virtual Machine PDF Scrutinizer has three main modules: the first is a parser, (KVM). ShellOS and the host operating system share the which simulates the way Adobe Reader parses a document; memory address space region used by the host operating the second is an action extractor, which statically extracts system which provides ShellOS a stream of code to analyze. In JavaScript actions; and the third module consists of an actions the examination phase the code is simply executed on the executor, which executes the extracted JavaScript code in CPU until the instruction sequence encounters a general fault Mozilla Rhino33 JavaScript engine. PDF Scrutinizer is a JavaScript or times out because it reaches its execution time or exceeds library, which extends existing components. Apache PDFBox34 some adjustable maximum threshold. Shellcode is likely to Java library is used as interface with a loaded PDF document. generate a fault, because it contains invalid instructions or During execution, libemu35 library is used to analyze variable accesses an invalid memory location. When a fault occurs, values for the existence of shellcode. PDF Scrutinizer does not ShellOS efficiently resets the program state and starts an use a machine learning algorithm for classification. Both static execution from the next position in the code stream. After the and dynamic heuristics are applied to detect maliciousness. examination, ShellOS writes its results to the shared region. Static heuristics focus on JavaScript code string analysis to When shellcode is executed, ShellOS collects useful informa- find a signature of known suspicious, vulnerable, or malicious tion, such as function name and parameters logged. The function. Dynamic heuristics focus on the detection of mali- increased analysis performance enables the framework to cious code behavior, for example, analyzing whether the code process more of the network stream and execute longer code tries to add multiple identical data blocks into an array which sequences. The framework makes evasion difficult for at- is a strong indicator that a heap spraying attack has occurred. tackers and has the ability to use existing runtime heuristics Another example is the checking of used strings for an in a manner that does not require tracing every machine-level excessive number of characters which is an indicator of ma- instruction or performing unsafe optimization. ShellOS source licious JavaScript code (the threshold is 100,000 characters). code has been released for the use of others and will be During runtime, PDF Scrutinizer counts the occurrence of available as part of the NSF SDCI36 program. The dataset events that are likely to represent malicious operations, such prepared for evaluation contained a set of 374 unique mali- as the use of vulnerable methods or the use of suspicious cious PDF files provided by Google from 2007 to 2010 attacks variable names, and if known exploits are used, the CVE-ID is and 179 benign PDF files collected from USENIX conferences. often provided. The evaluation dataset contained 6054 benign Of the 374 malicious PDF files, 325 were detected and 70 files and 11,278 malicious PDF files collected from email attach- used the Return Oriented Programming37 (ROP) method. Di- ments and websites. 
Evaluation results show a detection rate agnostics show that 87% of ROP payloads follow the same of 90% with 0% false positives. The authors compared their pattern: downloading an executable file from a hardcode URL proposed PDF Scrutinizer to other existing malicious PDF address using the URLDownloadToFile() function and executing analysis tools such as Wepawet, MDScan (Tzermias et al., 2011), the file using the ShellExecuteA() function. Surprisingly, all the and PJScan (Laskov and Srndic, 2011). URLs contain the substring “spl¼pdf_sing.”. Diagnostics also The following study differs from the previous work pre- show that 85% of non-ROP payloads had exactly the same API sented in this section in several respects. First, ShellOS (Snow call sequence: downloading a binary file from a hardcode URL et al., 2011) is an operating system. Second, unlike previous address using the URLDownloadToCacheFile() function and runtime analysis techniques that use software-based CPU creating a process to execute that binary file using Crea- emulation, the proposed framework leverages hardware vir- teProcessA() function. The remaining 15% follow a similar tualization technology. Finally, it can't examine a PDF file as a pattern: downloading a binary file from a URL using URL- whole, and instead it relies on a host operating system that DownloadToCacheFile() function and executing the binary file extracts the JavaScript code from the PDF file (either static or with another combination of API calls. dynamic), and then the JavaScript code is examined by Shel- lOS. Therefore we included ShellOS (Snow et al., 2011) in both 4.2.2. Dynamic JavaScript extraction sub-categories: V1 (static feature extraction based version) Contrary to previous solutions, the following solutions and V2 (dynamic feature extraction based version). employ a dynamic approach to extracting the embedded Snow et al. (Snow et al., 2011) presented ShellOS, a frame- JavaScript code from the file, thus overcoming the possible work for the detection of code injection attacks, based on code weakness of the parser. analysis during runtime (dynamic analysis). ShellOS is a new Lu et al. (Lu et al., 2013) introduced MPScan, a technique lightweight operation system kernel (approximately 2500 that integrates static malware detection and dynamic Java- lines of code), designed for efficient execution of code streams. Script de-obfuscation. MPScan can deal with two types of ex- The proposed framework uses hardware virtualization in ploitations involving JavaScript: vulnerabilities deriving from order to execute instruction sequences directly on the CPU. bugs in the implementation of Adobe JavaScript API, and This provides faster and more accurate code analysis and vulnerabilities triggered by non-JavaScript features of PDFs.

33 https://developer.mozilla.org/en-US/docs/Rhino. 36 http://www.nsf.gov/pubs/2011/nsf11504/nsf11504.htm. 34 http://pdfbox.apache.org/. 37 Return Oriented Programming e a generalization of return- 35 Small library written in C offering basic x86 emulation and into-libc that allows an attacker to undertake arbitrary, Turing- shellcode detection using GetPC heuristics http://libemu. complete computation without injecting code. http://cseweb. carnivore.it/. ucsd.edu/~hovav/talks/blackhat08.html. computers & security 48 (2015) 246e266 257

MPScan is composed of two modules: an embedded code Static analytical methods have several advantages. First, extraction module and a multilevel malware detection mod- static analysis is virtually undetectable e the malicious code ule that includes a shellcode/heap spraying detection inside the PDF does not know that it is being analyzed, component and an opcode38 signature matching component because it is not opened by the PDF reader or by an emulator. that searches for malicious signatures in the JavaScript While it is possible to create static analysis “traps” to deter opcode. By hooking39 the Adobe Reader's native JavaScript analysis, these traps can actually be used for detecting mal- engine, embedded codes (JavaScript source code and opera- ware. Another advantageous feature is that static analysis is tional code) can be extracted during execution and then relatively quick and efficient and can therefore be performed evaluated by the static detection module. MPScan is robust and in acceptable timeframes. Consequently, it will not cause effective against any kind of obfuscation including the type bottlenecks in the workflow of the organization. This is that takes advantage of the ambiguity and complexities of PDF especially important when the PDF files are attached to emails specification. Previous methods such as MDScan (Tzermias sent to the organization as part of its standard, day-to-day et al., 2011) and PDFphoneyC40 statically parse the PDF file operation, and there is a need to ensure that they arrive in a and extract JavaScript code and then examine the code timely manner and prevent a delay in their arrival. Static dynamically by running it in the emulated environment of the analysis is also easy to implement, monitor, and measure; SpiderMonkey JavaScript engine. The difference between therefore, most of the presented studies perform static anal- MPScan and these tools is that the other tools execute the ysis. Moreover, static analysis scrutinizes the application's extracted code in an emulated environment that lacks some “genes” instead of its current behavior which can be manip- proprietary features in the native Adobe environment. ulated or postponed to an unexpected time. The static anal- MPScan, on the other hand, hooks the native JavaScript engine ysis approaches can be divided roughly into two groups: the of Adobe Reader, and thus, it is more representative. For the first group analyzes the JavaScript code embedded inside the evaluation phase, the authors collected 198 malicious PDF PDF in a variety of representations. The second group relies samples from the Internet and nine malicious PDF samples upon meta-feature based approaches and focuses on the from the Metasploit framework. 500 benign PDF files were ob- content and structure of the PDF file. tained by crawling the Alexa41 top 50 websites. Evaluations Looking at the disadvantages, static analysis can be evaded show a detection accuracy of 98%. The processing time for all using code obfuscation. Whenever machine learning methods 207 files when the Adobe JavaScript engine was hooked was based on static analysis are used (especially n-grams) for 3.9 s, whereas without hooking it the processing time was detecting unknown malicious code applications, there is a 0.5 s. Their main contribution was designing a novel approach question about the capability of the suggested framework for to de-obfuscate embedded JavaScript code by hooking the detecting obfuscated code inside PDF files. 
Many of the Java- Acrobat Reader's native JavaScript execution engine. This Script code inside the PDF files are obfuscated to some degree. approach is more realistic and robust against unknown Static analysis also cannot detect malicious code that is obfuscation techniques. In addition, they presented a multi- dynamically loaded from a remote server. Yet, static analysis level malware detection scheme for the detection of shell- can detect pieces of code that are used for establishing access code and heap spraying. It should be noted that the analysis of to remote sources outside of the PDF which can raise some the extracted JavaScript code is carried out statically. suspicions about the PDF file. On the other hand, such iden- As mentioned, ShellOS (Snow et al., 2011), the method tification of remote access might create false positives in the previously discussed in the static extraction sub-category, can detection process. also be sustained by dynamic extraction of JavaScript code, We have also presented studies employing a dynamic and therefore it appears both in static and dynamic feature analysis approach for detecting malicious PDF files. In most of extraction. these studies, this approach dynamically runs the JavaScript Table 2 summarizes academic solutions for the detection code embedded in a PDF file by performing pre-static analysis of malicious PDFs, including dataset information and perfor- of the PDF file in order to extract JavaScript code which will be mance measures. analyzed dynamically. Therefore, the extractor must also be very robust and capable of handling complicated cases such 4.3. Advanced methods and coping with exiting attacks as corrupted files and embedded files inside PDF e as was presented with regard to the reverse mimicry attack (Maiorca Each of the aforementioned analytical approaches (static and et al., 2013). The authors in (Maiorca et al., 2013) proposed a dynamic) has its pros and cons. Consequently, a hybrid new framework to deal with the evasion attacks presented in detection framework meshing static and dynamic detection the paper. The proposed framework extracts any embedded techniques could reduce the likelihood of evasion of the PDF file from a suspicious PDF file recursively and then applies detection mechanism by a malicious PDF. three analyses on it: embedded JavaScript code analysis, PDF structural analysis, and analysis of the embedded EXE or SWF 38 Opcode e An intermediate instruction set generated by the file. Dynamic analysis, however, is complex as well as costly. JavaScript engine for efficient execution which better reflects the In addition, dynamic analysis can also be detected and avoi- ' actual behavior of malware since it s lower than the source code. ded by the executed malicious PDF file which can postpone its 39 Hooking e technique for intercepting functions calls, mes- malicious behavior until the emulation ends. sages, or events passed between software components in order to alter an application or operating system behavior. Table 3 summarize five main attacks and techniques which 40 https://code.google.com/p/phoneyc/. can use malicious PDF files and attempt to determine which of 41 http://www.alexa.com/. the surveyed solutions is capable of detecting each of the 258

Table 2 e Summary of academic solutions e main details and performance (ms represents milliseconds). System name PJScan ShellOS MDScan MPScan PDFMS PDF PDFRate Entropy Structural JavaScript Reverse mimicry Scrutinizer and n-gram paths clustering attack solution analysis #Article (Laskov and (Snow et al., (Tzermias (Lu et al., 2013)(Maiorca (Schmitt (Smutz and (Pareek (Srndic and (Vatamanu (Maiorca et al., 2013) Srndic, 2011) 2011) et al., 2011) et al., 2012) et al., 2012) Stavrou, 2012) et al., 2013) Laskov, 2013) et al., 2012) Year 2011 2011 2011 2013 2012 2012 2012 2013 2013 2012 2013 optr euiy4 21)246 (2015) 48 security & computers Analysis Static Dynamic Static & Static & Static Static & Static Static Static Static Static & Dynamic Dynamic Dynamic Dynamic #Malicious 30,157 405 197 207 11,157 11,278 5000 65,536 82,142 997,615 3 Samples Malicious VirusTotal Some web Public malware Internet & Contagio & Emails, Contagio Computing.net, VirusTotal Honeypots, Self-made reverse Samples malware repositories & Metasploit Yahoo! attachments malware.lu and spam mimicry attacks Source detection self-written & websites contagiodump. messages systems blogspot.com #Benign 906 179 2000 500 9989 6054 100,000 46,933 576,621 1,333,420 e Samples Benign VirusTotal USENIXa Google Alexa Contagio & Emails, 6 days Research papers, Google Popular e Samples Yahoo! attachments university reports and websites Source & websites traffic financial documents Processing 23 ms 7400e25, 1500e3000 ms eeeee 28 ms ee Time 460 ms e

(per file) 266 TPR 0.7194 0.8024 0.8934 0.98 0.9955 0.9 0.99þ 0.9922 0.9988 e 1.0 (Detection Rate) FPR 0.0011 e 0 e 0.0251 0 0.00244 0.006 0.001 ee

Note that it is incorrect to compare the solutions based on their TPR and FPR, since the dataset used for training the model and the dataset used for testing the model in order to evaluate the solution are different by size and content. a https://www.usenix.org/. Table 3 e Summary of academic solutions and their ability to detect five main attacks. System PJScan ShellOS MDScan MPScan PDFMS PDF PDFRate Entropy Structural JavaScript Reverse mimicry name Scrutinizer and paths clustering attack solution n-gram analysis #Article (Laskov and (Snow (Tzermias (Lu et al., (Maiorca (Schmitt (Smutz and (Pareek (Srndic and (Vatamanu (Maiorca et al., 2013) Srndic, 2011) et al., 2011) et al., 2011) 2013) et al., 2012) et al., 2012) Stavrou, 2012) et al., 2013) Laskov, 2013) et al., 2012) Year 2011 2011 2011 2013 2012 2012 2012 2013 2013 2012 2013 Detection Static Dynamic Static & Static & Static Static & Static Static Static Static Static & Dynamic method Dynamic Dynamic Dynamic Main idea Lexical Running Running Running Frequency Detect Meta features JavaScript Structural Tokenization of Ensemble of 4 systems: and features analysis of JavaScript JavaScript JavaScript of all JavaScript JavaScript code code paths the embedded PJScan (Laskov and the JavaScript code on found in code by keywords code found in Java Script Srndic, 2011) code hardware objects hooking the inside the Detect atypical PDFMS (Maiorca virtualization Acrobat PDF file shellcode locations et al., 2012)

reader JS Find exploit Handing PDFRate (Smutz 246 (2015) 48 security & computers engine Embedded corrupted and Stavrou, 2012) files documents Wepawet tool Attacks Malicious V (Only in V V (including V V V (including V (including VV V V JavaScript typical places) atypical atypical atypical code locations) locations) locations) Code e VVVe V Probably Probably V Probably obfuscation Reverse V (Only eeProbably e V eeeeV mimicry JavaScript attack attack) (Maiorca et al., 2013) (embedded files: EXE, PDF,

unrelated e JavaScript) 266 Loading e V e Probably e Probably eeeee malicious code from remote server (Hamon, 2013) Malicious eeeeeee eee e URI resolving (Hamon, 2013)

“V” represents the solution capable of detecting this attack, “Probably” represents our estimation that it can detect the attack yet needs to be checked, and “e” represents the inability of the solution to cope with the attack. 259 260 computers & security 48 (2015) 246e266

attacks. While most of the solutions are likely to detect ma- Table 4 e Our collected dataset categorized as malicious, licious JavaScript code, only the dynamic analysis based so- benign and incompatible PDF files. lutions and static analysis meta-features based solutions will Dataset source Year Malicious files Benign detect an obfuscated malicious JavaScript code. Detecting the files reverse mimicry attack relies upon a parser that can identify e and extract embedded files recursively as was suggested by VirusTotal repository 2012-2014 17,596 (1017) Srndic and Laskov 2012 27,757 (437) e (Maiorca et al., 2013). We assumed that since MPScan actually (Srndic and Laskov, performs hooking on the Adobe Reader JavaScript engine, it 2013) would probably detect the embedded malicious JavaScript Contagio project 2010 410 (175) e code. Note that PJScan was the only method, out of those Internet and Ben-Gurion 2013-2014 0 5145 evaluated by the authors in (Maiorca et al., 2013), that detected University (random the embedded malicious JavaScript code. selection) Total 45,763 (1629) 5145

section is located in the file. In cases of incompatibility, the 5. Dataset collection and preliminary number that appears is incorrect. Table 4 includes the number analysis of compatible files (bracketed) in each of our collected data- sets. Note that while incompatible benign files were not pre- This survey was conducted in order to acquire maximal un- sent in our dataset, this does not mean that there weren't any derstanding of existing solutions for the detection of mali- incompatible benign files. It might, however, suggest the very cious PDFs and their strengths and limitations, as well as to low probability of incompatibility among benign files and provide conclusions and concrete ideas for future work in provides support of our observation mentioned above. order to enhance the detection of malicious PDF files. In the conclusion section, we will describe several ideas, techniques, frameworks, and methods that we plan to implement in an 6. Our suggested active learning based attempt to enhance the detection of malicious PDF files. As a framework basis for empirically evaluating those ideas, we collected and created a dataset of malicious and benign PDF files for the In this survey we presented many studies that were based on Microsoft Windows operating system e the most commonly machine learning approaches and were successfully used to attacked system used by organizations. induce malicious PDF detection models. However, all of them We acquired a total of 50,908 PDF files, including 45,763 focus on passive learning. With passive learning, the induced malicious and 5145 benign files, from four sources as pre- detection model, as accurate as it is (based on a representative sented in Table 4 below. The benign files were reported to be set of features), quickly becomes obsolete since it is incapable virus-free by Kaspersky antivirus software. The malicious PDF of adaptive learning and integrating new malicious PDF files. files contain several types of malware families such as viruses, The detection model must be sustained and updated with Trojans, and backdoors. We also included obfuscated PDF newly labeled, informative PDF files (both malicious and files. benign). The labeling operation usually relies upon a human Analysis of our large dataset of 50,908 files by the parser expert who analyzes a file manually and labels it as malicious (PdfFileAnalyzer42) shows that most of the malicious files or benign; thus, only a small number of new informative PDF (96.5%) are not compatible with the PDF file format specifica- files can be labeled on a daily basis. We suggest using active tions according to the Adobe PDF Reference.43 When the user learning methods to prioritize the new PDF files so that only tries to open an incompatible file (malicious or benign), the the most informative files, those that are maximally expected PDF reader is not able to open it and provides an error mes- to improve the capabilities of the detection model, will be sent sage. If it is a malicious PDF file, the malicious operation is to the expert for manual analysis. In cases in which the PDF executed; if it is a benign file, nothing takes place. However, in files are labeled as malicious by the human expert, they will be both cases the file remains unopened and cannot be viewed by used to update the antivirus tool as well, which is currently an innocent user. Thus, it is clear that there is no reason to the most common solution for organizations. 
deliver an incompatible file to the user, and this observation In Fig. 3 we illustrate our suggested framework and the was taken into account in our suggested framework which process of detecting and acquiring new malicious PDF files by identifies such files and marks them as suspicious, blocking maintaining the updatability of the detection model and tool them from the outset and sending them for deeper inspection as well. In order to maximize the suggested framework's e before they are ever opened by the PDF reader. contribution, it should be deployed in strategic nodes (such as The incompatibility observed was located at the end of the ISPs and gateways of large organizations) over the network, in file (as seen in Fig. 1), in the line between “startxref” and “%% an attempt to expand its exposure to as many new files as EOF” lines. This line should contain a number serving as a possible. Widespread deployment will result in a scenario in reference (offset) to where the last cross reference table which almost every new file goes through the framework. If the file is informative enough or is identified as likely to be 42 http://www.codeproject.com/Articles/450254/PDF-File- Analyzer-With-Csharp-Parsing-Classes-Vers. malicious, it will be acquired for manual analysis. As this 43 http://www.adobe.com/content/dam/Adobe/en/devnet/ framework is proposed for future work, we aim at developing acrobat/pdfs/pdf_reference_1-7.pdf. it with orientation of becoming a multilayered online analysis computers & security 48 (2015) 246e266 261

Fig. 3 e The process of maintaining the updatability of the antivirus tool using AL methods. framework, starting from the fastest and most general and Specifically, JavaScript code attacks, embedded file attacks, then moving to slower yet deeper analysis. and form submission and URI attacks, are the most common As Fig. 3 depicts, the PDF files transported over the Internet attacks launched via PDF files and three of them are present in are collected and scrutinized within our framework {1}. Then, our data set. the “known files module” filters all the known benign and As being a large and representative dataset based upon malicious PDF files {2} (according to white lists, reputation trusted sources, our conclusion of high incompatibility among systems (Jnanamurthy et al., 2013) and antivirus signature malicious files is empirically well based. The PDF files which repository). The unknown PDF files are then checked for their are compatible and unknown are then introduced (as vectors compatibility as viable PDF files. The incompatible PDF files of features) to the detection model which is a classifier are immediately blocked from being transported into the induced by Machine Learning algorithms. The Active Learning organizational network. Since only compatible files are rele- methods are aimed at efficiently updating the detection model vant for organizations and innocent users, just these files are and antivirus tool in light of the creation of new PDF files. The transformed into vector form for the advanced check {3}. files are represented as vectors of features that are either Our framework provides detection solutions for both in- extracted statically or dynamically, as we will recommend in stances, whether the malicious file is compatible or not, and it the conclusion section. does so more efficiently than any other solution that exists today. We consider employing several algorithms in order to The framework uses the insight that most of the malicious files induce detection models, one of them is the SVM classification are incompatible as a first layer of filtering, and not as a detection algorithm with the radial basis function (RBF) kernel in a su- rule. As noted, there is no reason to open an incompatible file e pervised learning approach. We will use the SVM algorithm be it benign or malicious. Therefore, this understanding provides for the following reasons: first, in many of the surveyed works a significant reduction (~96.5%) of the analysis efforts of sus- presented in this paper (Srndic and Laskov, 2013; Laskov and pected malicious files. This effort reduction is done by simply Srndic, 2011; Maiorca et al., 2012; Borg, 2013; Schreck et al., filtering any PDF file, malicious or benign, that is not compatible. 2013), SVM was proven to be a very accurate classifier for Our dataset includes PDF files that were collected from detecting malicious PDF files, especially when it was based on several reliable sources (e.g., Virus-Total, Contagio, researches many features extracted by static analysis. Many of these published in recent papers (Srndic and Laskov, 2013)) and also works chose to use the SVM as their classification algorithm contains more than 45,000 malicious PDF files. This makes it a based on comparing it with other classification algorithms, large, diverse and representative dataset of malicious PDF SVM outperformed the others and achieved up to 0.998 TPR. material for empirical experiments. 
Second, the trained SVM classifier based on the RBF kernel Our dataset is not claimed to be a complete collection of actually projects the dataset to be higher in dimensional every possible type of PDF attack. However, having many space, aimed at calculating a surface that creates a linear malicious PDF files that are available and labeled by trusted separation with maximal margin between the classes. This sources, serves the purpose of being representative regarding projection into higher dimensional space actually makes the existing attacks that were previously mentioned in section 3. induced model complex and thus more difficult for an 262 computers & security 48 (2015) 246e266

attacker to understand. Additionally, it is also more difficult to The second type of informative file includes those that lie find specific features or patterns that can be used for evading deep inside the malicious side of the SVM margin and are a the induced detection model as was noted by Wang et al. maximal distance from the separating hyperplane according (2007). Third, SVM is known for its ability to handle large to Equation (1). These PDF files will be acquired by the numbers of features (Joachims, 1999) which makes it suitable “Exploitation” active learning method and are different from for handling the large number of features that we aim to the labeled PDF files existing in the training set. These infor- extract from the PDF files. Specifically, when considering mative files are then added to the training set {6} for updating handling large number of features for the task of malicious and retraining the detection model {7}. The files that were code detection, SVM showed high performance in the detec- labeled as malicious are also added to the antivirus signature tion of PC malwares when it was based on many n-grams repository in order to enrich and maintain its updatability {8}. features extracted from executables (Windows PE) (Nissim Updating the signature repository also requires an update of et al., 2014). It was successfully used to detect worms based clients utilizing the antivirus application. The framework in- on a large number of behavioral features (Nissim et al., 2012; tegrates two main phases: training and detection/updating. Wang et al., 2007), and showed efficiency and high capabil- Training: A detection model is trained over an initial ities when using large number of features in the task of clas- training set that includes both malicious and benign PDF files. sifying malware into species, and detect zero-day attacks Detection and updating: For every unknown PDF file that is (Chen et al., 2012). Finally, SVM has proven to be very efficient both transported over the Internet traffic and through the when combined with AL methods for enhancing the detection framework, the framework's detection model provides a of malicious code (Nissim et al., 2012; Nissim et al., 2014). classification, and its active learning method provides a rank These reasons are a convincing justification to consider using representing how informative the file is. The framework will the SVM as one of classification algorithms in our experi- then consider acquiring the files based on this assessment. ments. In our implementation we will use Lib-SVM imple- After being selected and receiving their true labels from the mentation (Chang and Lin, 2011) which is fast and also expert, the informative PDF files are acquired by the training supports multiclass classification. set. The signature repository is also updated, just in case the The detection model scrutinizes PDF files and provides two files are malicious. The detection model is retrained over the values for each file: 1) a classification decision using the SVM augmented training set which now also includes the acquired classification algorithm and 2) Distance calculation from the examples regarded as being informative. At the next cycle, the SVM's separating hyperplane using Equation (1) {4}. 
A file that updated model receives a new stream of unknown files on the AL method recognizes as informative and which it has which the updated model is once again tested, and from indicated should be acquired, is sent to an expert who which the updated model again acquires informative files. manually analyzes and labels it {5}. By acquiring these infor- Note that the motive is to acquire as many malicious PDF files mative PDF files we aim to frequently update the antivirus as possible since such information will maximally update the software by focusing the expert's efforts on labeling PDF files antivirus tool that protects most organizations. that are most likely to be malware, or on benign PDF files that The purpose of this framework is to provide a better solu- are expected to improve the detection model. Note that tion than random selection or passive learning employed informative files are defined as such in that when they are nowadays. We are not aiming at providing a holistic solution, added to the training set, they improve the detection model's however, we try to present a better solution for the existing predictive capabilities and enrich the antivirus signature re- problem. Our framework will reduce labeling efforts by pository. Accordingly, in our context, there are two types of selecting only the most informative examples for inspection. files that may be considered informative. The first type in- The number of files inside the borderline is not relevant e cludes PDF files in which the classifier has limited confidence what is relevant is the inspection resources that we or the AV as to their classification (the probability that they are mali- company can allocate. Therefore, we will prioritize the files for cious is very close to the probability that they may be benign). inspection, and the larger the resources are, the more files will Acquiring them as labeled examples will probably improve be inspected. Yet the advantage here is that the most the model's detection capabilities. In practical terms, these contributive files will be treated in early stages, thus updating PDF files will have new features or special combinations of the AV and detection model more frequently, ultimately existing features that should fairly represent their operations reducing the number of labeled files. and ambience. Therefore, these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin (Tong and Koller, 2000e2001) strategy, an exist- 7. Discussion and conclusions ing AL method that selects informative files, both malicious and benign, that are a short distance from the separating In this paper, we aimed to review the methods, techniques, hyperplane. and tools used for the detection of malicious PDF files. These PDF's are usually attached to emails that are sent to organi- Equation (1): the distance calculation between classified zations in order to perform the initial penetration of an APT example and the SVM's hyperplane attack, therefore their detection is a significant concern which requires attention. ! Xn According to our preliminary analysis in the dataset sec- a Dist X ¼ iyiK xix (1) tion, we found that most of the incompatible PDF files are 1 actually malicious, and therefore we recommend that before computers & security 48 (2015) 246e266 263

any analysis is performed, an incompatibility check be per- initial stage and is also not detected by antivirus tools merits formed. An integral component of our suggested framework is further dynamic analysis. a module that checks each PDF file's compatibility with the For the static analysis phase, the key to precise and sen- standards of viable PDF files. The incompatible PDF files must sitive detection is preliminary knowledge of the primary immediately be blocked from being transported to the orga- attack and evasion techniques that could be used by a PDF file, nizations. The reason for this is that only compatible files are as is described in section 3. The first mission is therefore relevant for organizations and innocent users, mainly because finding and extracting the indicators that assist and support incompatible files cannot be opened and used by the user. If determination of these attacks. A prerequisite for a compre- an incompatible file is actually malicious, it can still perform hensive analysis of a given PDF file is the development of a its malicious actions, and should thus be blocked. Therefore, sophisticated and robust parser that is able to extract all the this check is extremely significant since, as shown in Table 4, relevant information from the analyzed file (including cor- most of the malicious files are incompatible (96.5%). There- rupted PDF files, embedded EXE, PDF, and SWF files). fore, the incompatibility of a file is a strong indicator of its According to Vatamanu et al. (2012), which had the largest maliciousness. Such an approach can reduce the efforts and PDF file corpus, about 93% of the one million malicious PDF time invested in detecting the new malicious PDF files. files (out of a corpus of 2.2 million) contained JavaScript, However, one should note that we don't claim that every whereas only 5% of the benign PDF files contained JavaScript malicious PDF file is incompatible. And therefore, after the code. Therefore, as a mitigation strategy for the malicious incompatibility check within our framework, we aim at JavaScript code, all the JavaScript code should be extracted providing a comprehensive static and dynamic analysis based using a robust parser (including an unrelated object of Java- on advanced Machine Learning algorithms and detection Script code as was presented by Maiorca et al. (2013)). The models. Our framework does not rely upon the fact that most JavaScript code should be analyzed using two different direct of the malicious files are incompatible, therefore in the case representations that provided high TPR, the lexical analysis of that an attacker crafts a malicious PDF as an incompatible file, JavaScript code (Laskov and Srndic, 2011), and tokenization of it will be filtered out and will not be transported to the orga- the embedded JavaScript (Vatamanu et al., 2012). Direct rep- nizational network. And in the case that the attacker does resentation means analyzing the code itself, while indirect craft a compatible malicious file, it will be carefully analyzed representation means analyzing meta-features related to all and will most likely be detected if it has patterns e similar to the content of the file. The JavaScript will also be dynamically previous attacks. Additionally, if it contains new informative analyzed during the dynamic analysis phase. We also suggest patterns it will probably be acquired by our Active Learning conducting indirect static analysis that analyzes general methods for deeper inspection by a security expert. 
descriptive content in the PDF file rather than direct analysis In this survey paper we do not provide an elaborate seg- of the JavaScript code. This can be achieved by an approach mentation on our dataset and the attacks which occurred that utilizes meta-features of the content and structure of the within it, for two reasons: first because the dataset is in the PDF file such as structural paths (Srndic and Laskov, 2013), building process and has not yet reached the point at which it summarized meta-features (Smutz and Stavrou, 2012) and is considered as the final dataset. Secondly, it is not the focus frequency of keywords (Maiorca et al., 2012) that provided of this survey paper, as the full description of the dataset satisfactory results as well. The advantage of using meta- creation and segmentation pertaining to the attacks and virus features such as structural paths (Srndic and Laskov, 2013)is families, will be presented in future work where we will that they are not affected by code obfuscation. It was shown to conduct a comprehensive series of experiments in order to be a very effective way to discriminate malicious PDFs from evaluate the efficiency of our Active Learning frame- benign PDFs even for malicious files created two months after workdboth in the detection and the updatability aspects. the classification model was created. Based on our survey, we propose that the detection model As a solution to embedded malicious files (reverse mimicry include a hybrid detection approach that conducts both static attack (Maiorca et al., 2013)) the parser should also indicate and dynamic analysis as was suggested by Tzermias et al. whenever this scenario (a file embedded inside the PDF) exists (2011), Maiorca et al. (2013), Schmitt et al. (2012), and Lu in the suspicious PDF file. Generally speaking, there are few et al. (2013). This way the chance of an attack evading the benign reasons to embed a file inside a PDF file. In addition, the detection mechanism is significantly reduced, because most parser should recursively extract every embedded file inside attacks can be determined by dynamic analysis. Still, several the PDF and analyze it using the aforementioned static anal- techniques may evade detection, including techniques that ysis methods we suggested. One of the reverse mimicry attacks perform the malicious actions of the PDF file only when spe- (Maiorca et al., 2013) that embeds malicious EXE files inside the cific conditions are met (e.g., time, date, IP, specific user PDF and auto-executes it when the PDF file is opened is based intervention, etc.). In these instances, the dynamic analysis on a well-known legitimate feature which has been blocked in will be useless since it won't encounter and detect the mali- Adobe Reader X (version 10). Many organizations, however, cious behavior through its analysis. Static analysis scrutinizes don't update their installed software and thus, are exposed to the file's genes, content, and structure which are usually EXE running (such as in Adobe Reader MS Office, etc.). constant; consequently static analysis will not be affected by Regardless, once another feature or vulnerability is found that these techniques and will therefore be more effective than enables the operation of running EXE files embedded in PDF dynamic analysis. 
Because of the advantages of static anal- files, it can be detected with a variety of advanced techniques ysis, we suggest an initial static analysis stage for unknown (Jacob et al., 2008; Gryaznov, 1999; Schultz et al., 2001; Abou- PDF files. A file that is not identified as malicious after this Assaleh et al., 2004; Kolter and Maloof, 2004, 2006; Mitchell, 264 computers & security 48 (2015) 246e266

1997; Henchiri and Japkowicz, 2006; Moskovitch et al., 2008a, Nevertheless, this detection approach provides a compre- 2008b, 2009; Menahem et al., 2009; Jang et al., 2011; Tahan et al., hensive indication of the file's purposes and is robust against 2012; Nataraj et al., 2011a, 2011b; Royal et al., 2006; Willems many evasion techniques, such as code obfuscation and URI et al., 2007; Rieck et al., 2008; Sharif et al., 2009; Song et al., 2008; resolving. Therefore we suggest the integration of a full dy- Rieck et al., 2011; Perdisci et al., 2008; Moser et al., 2007; Lanzi namic analysis module that might detect malicious behavior et al., 2009; Kolbitsch et al., 2009; Jacob et al., 2009; Bayer et al., or determine the intention of PDF files in cases in which the 2009; Nachenberg, 1997; Zhao et al., 2012; Newsome and Song, static or dynamic analyses (based on analysis of specific 2005; Newsome et al., 2005) aimed at the detection of malicious components of the PDF file) are unable to provide the executables using static and dynamic analysis. comprehensive inspection provided by full dynamic analysis. All the extracted features mentioned in this article can be We also suggest running each suspicious PDF file through leveraged by an ensemble of classifiers such that each clas- several versions of Adobe Reader (or any PDF reader) in order sifier will be induced from different sets of features. It was to compare its behavior. Some malicious PDF files will behave shown by Menahem et al. (Menahem et al., 2009) that using an differently depending upon the version of Adobe Reader used, ensemble of classifiers using different features can signifi- because vulnerabilities are treated differently from one cantly improve detection capabilities. version to another. The differing behavior might provide an The attacks that were presented by Hamon et al. (Hamon, indication of a file's maliciousness. 2013) conducted a dynamic load of malicious code from a Moreover, one should remember that many organizations remote source as well as URI resolving (executing external currently rely on outdated versions of PDF readers due to malicious file). These attacks usually rely on clicking a link, financial constraints and lack of proper administrative con- however it is possible to open the link when the PDF file is trols. The fact that many organizations don't update their opened, and therefore the PDF file becomes the link. Conse- installed software (including their PDF readers) exposes their quently, as was indicated by Hamon et al. (Hamon, 2013), the/ computers and users to many known exploits and bugs OpenAction command is considered dangerous and can be associated with PDF readers such as JBIG2Decode algorithm detected by simple static analysis. Restricted use of this and util.printf Java function, as was discussed by Stevens command will help prevent this kind of attack. (Stevens, 2011). New readers take these exploits into account, In the dynamic analysis phase, it is better to rely on however the exploits and bugs remain relevant in older ver- hooking the Adobe Reader or using hardware virtualization in sions of software; as a rule of thumb, we therefore recom- order to execute the JavaScript code embedded in the PDF file mend that organizations strive equip themselves with the rather than running it in an emulator, as presented by Lu et al. latest version of PDF Readers as a standard security policy. (2013) and Snow et al. (2011). 
The malicious JavaScript code In future work we plan to implement our suggested active inside the PDF, however, can identify that it is being executed learning framework which addresses an important issue that in an emulated environment, and therefore it might refrain none of the articles surveyed considered e the detection from performing its malicious behavior. This will, however, model's lack of updatability. It is not enough to construct and probably provide a solution for malicious obfuscated Java- calibrate a preliminary detection model based on sophisticated Script code that wasn't detected by the static analysis. feature extraction techniques; rather the model should be As far as we could identify, no product or academic solu- constantly updated in light of the daily creation of new mali- tion actually analyzes the URLs inside the links in a PDF files. A cious PDF files. While machine learning has been successfully link, after being pressed, can refer the user to a malicious used to induce malicious PDF detection models, all methods website that, when loaded, initiates an attack on the user's utilizing this approach focus on passive learning. Alternatively, computer. An attacker can place a malicious link inside a we suggest focusing on active learning (Settles, 2010) and the benign file and persuade the user to click it. Dynamic analysis use of active learning methods that have been especially methods will not be able to detect this kind of attack since a designed to enrich the detection model with new malicious PDF user intervention is needed to press the link. However, static files in the course of several days e thus ensuring that the analysis methods can easily extract and analyze the links that detection model is up-to-date. This notion was successfully may be malicious. Thus, we recommend the addition of a used to enhance the detection of executable malwares in the module that checks the links inside the PDF file for mali- Microsoft Windows OS (Nissim et al., 2014) and is expected to ciousness to the detection model, as this module can integrate enhance the detection of malicious PDF files as well. many of the academic solutions designed for analyzing links As an additional indication of future work in the interest of (URLs) or websites for maliciousness (Xuewen et al., 2013; enhancing the detection of malicious PDF files attached to Eshete, 2013; Zhou et al., 2013; Su et al., 2013; Chitra et al., emails, we suggest augmenting the detection models of PDF 2012; Ranganayakulu and Chellappan, 2013; Ma et al., 2011). files with analysis of features and meta-features extracted As was noted in section 4.2, full dynamic analysis of PDF from the hosting email component e such as email header files is a costly approach. For instance, Checkpoint Threat and content. Methods for the detection of malicious email Emulation44 and SourceFire FireAMP45 execute the entire PDF file (Miyamoto et al., 2009; Amin et al., 2012; Shih et al., 2005) can in an isolated environment (sandbox) and examine the effect provide supportive indications for the suspicious PDF file. of the behavior and actions on the system during runtime. The final indication of future work we presently suggest pertains to the fact that PDFs are one of the most common 44 https://threatemulation.checkpoint.com/teb/. type of files that act as malicious attachments, however one 45 http://www.sourcefire.com/security-technologies/advanced- cannot ignore the phenomenon of malicious Microsoft Office malware-protection/fireamp. 
files attached to email (Schreck et al., 2013; Dechaux et al., computers & security 48 (2015) 246e266 265

2010). Therefore, we suggest combining email features (mentioned previously) with features extracted from attached Microsoft Office files, thus enhancing the detection of malicious Office files, as was explained in reference to PDF files.

Acknowledgments
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. We would like to thank Mattan Edry, who assisted in creating the dataset, and Roy Nissim for meaningful discussions and comments on the efficient implementation aspects.

Nir Nissim is a researcher and project manager at Telekom Innovation Laboratories at Ben-Gurion University. Nir is also a Ph.D. student in the Department of Information Systems Engineering at Ben-Gurion University and has published several papers dealing with active learning approaches for the acquisition and detection of malware on both PC and Android platforms. Nir is recognized as an expert in information systems security and has led several large-scale projects and research efforts in this field. His main areas of interest are mobile and computer security, machine learning, and data mining. Nir holds a B.Sc. (2007) and an M.Sc. (2010) in Information Systems Engineering, both from Ben-Gurion University.

Aviad Cohen is a graduate of the Department of Information Systems Engineering at Ben-Gurion University (BGU) and is a Master's student and researcher at Telekom Innovation Laboratories at Ben-Gurion University. His Master's work focuses on detecting malicious email by combining both attachment and email header analysis.

Prof. Chanan Glezer is an associate professor at the Department of Industrial Engineering and Management and at the MBA program of Ariel University of Samaria, Israel. His main research interests are applied and include cyber security, electronic commerce, and organizational computing. He also has broad industry experience, both as an IT professional and as a leader of a funded R&D project.

Prof. Yuval Elovici is the director of the Telekom Innovation Laboratories at Ben-Gurion University of the Negev (BGU), head of the Cyber Security Labs at BGU, and a professor in the Department of Information Systems Engineering at BGU. He holds B.Sc. and M.Sc. degrees in Computer and Electrical Engineering from BGU and a Ph.D. in Information Systems from Tel-Aviv University. Prof. Elovici has published 56 articles in leading peer-reviewed journals. In addition, he has co-authored a book on social network security and a book on information leakage detection and prevention. His primary research interests are cyber security and machine learning.

- 35 -

3. Summary and Conclusions

In this section we briefly summarize the main research results presented in the papers that constitute this dissertation and discuss their impact and contributions to research in the malware detection domain – specifically the areas of improving updatability and enhancing detection capabilities of detection models and antivirus software.

The proposed framework was thoroughly tested on very different types of malware: computer worms, executables, malicious documents (PDF and docx MS Office files), and malicious Android applications. For each of these domains, sophisticated and specifically tailored feature extraction and dataset creation methodologies were proposed and implemented. We used the SVM classifier as the base classifier, and the experiments were carried out using various SVM kernels. A solid and comprehensive evaluation methodology was used in order to test the framework, both in terms of classification performance (accuracy, TPR, FPR, and AUC) and the number and percentage of malware acquired daily (NOMA / POMA), which are important measures given that the purpose of the framework is to update the antivirus signature repository with new malware on a frequent basis.
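To make these measures concrete, the following is a minimal, self-contained sketch (not the dissertation's evaluation code) of how the classification measures (accuracy, TPR, FPR, AUC) and the daily acquisition measures (NOMA/POMA) can be computed for one simulated day of unknown files; the synthetic features and labels, the RBF kernel, and the daily budget of 20 files are placeholders.

```python
# Illustrative sketch only (not the dissertation's evaluation code): computing
# accuracy, TPR, FPR, and AUC for one simulated "day" of unknown files, plus
# NOMA (number of malware acquired) and POMA (percentage of that day's malware
# acquired) for the files selected for expert labeling.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_day, y_day = rng.normal(size=(100, 10)), rng.integers(0, 2, 100)

clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
scores = clf.predict_proba(X_day)[:, 1]          # estimated probability of "malicious"
pred = (scores >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_day, pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / len(y_day)
tpr = tp / (tp + fn)                             # true positive rate (detection rate)
fpr = fp / (fp + tn)                             # false positive rate
auc = roc_auc_score(y_day, scores)

acquired = np.argsort(scores)[-20:]              # daily budget: 20 most suspicious files
noma = int(y_day[acquired].sum())                # number of malware acquired today
poma = noma / max(1, int(y_day.sum()))           # share of today's malware acquired
print(accuracy, tpr, fpr, auc, noma, poma)
```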

We also extended the proposed framework and applied it to the biomedical informatics domain, which is a completely different domain from malware detection. In this research we were able to successfully enhance the capabilities of a classification model used for condition severity classification while achieving a significant reduction in labeling efforts; this can result in significant savings, in terms of both the time and money associated with the efforts of medical experts. This extension showed that our methods and framework are generic and can provide a solution for a variety of problems in many different domains.

Our research was guided by our attentiveness to emerging trends in the malware detection domain. We began with a behavioral active learning based framework [1] for the enhanced detection of elusive computer worms. Then, after identifying a serious gap in the area of detection solutions for malicious executables (specifically the limited updatability of current solutions), we enhanced our AL based framework and extended it with two novel and efficient AL methods (Exploitation and Combination). These methods allow our framework to address the existing updatability gap through frequent and efficient updating [2] of both the detection model and the antivirus software.
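The exact acquisition criteria of Exploitation, Combination, and SVM-Margin are defined in the cited papers; the hedged sketch below only approximates the intuition on synthetic data, using the signed distance from an SVM hyperplane: SVM-Margin favors the files closest to the boundary, Exploitation favors those deepest inside the malicious side, and a naive "combination" splits the daily labeling budget between the two.

```python
# Hedged sketch on synthetic data: the precise Exploitation and Combination
# acquisition criteria are defined in [2]; here SVM-Margin is approximated as
# picking the files closest to the SVM hyperplane (most uncertain) and
# Exploitation as picking the files deepest inside the malicious side.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X_labeled = rng.normal(size=(300, 20))
y_labeled = rng.integers(0, 2, 300)              # 1 = malicious, 0 = benign
X_stream = rng.normal(size=(1000, 20))           # today's stream of unknown files

clf = LinearSVC(dual=False).fit(X_labeled, y_labeled)
dist = clf.decision_function(X_stream)           # signed distance from the hyperplane

budget = 10                                      # files an expert can label today
svm_margin_picks = np.argsort(np.abs(dist))[:budget]   # closest to the boundary
exploitation_picks = np.argsort(-dist)[:budget]        # deepest in the malicious side
combination_picks = np.unique(np.concatenate(
    [exploitation_picks[: budget // 2], svm_margin_picks[: budget - budget // 2]]
))
print(len(combination_picks), combination_picks[:5])
```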

- 36 -

Our next goal was identified based on the increased prevalence of Android malware and the contamination caused by malicious versions of known Android applications in many unofficial application markets, as well as in the official Android application market, Google Play. The inadequate detection solutions in this area, the reliance of Android smartphones on antivirus solutions, and the updatability gap we also identified in detection solutions for Android malware pointed to the need for a better solution. In response, we enhanced the capabilities of our active learning framework and created ALDROID [3], a framework that outperformed existing solutions and enhanced the detection of Android malware in the long run.

The next trend we responded to was the increase in attacks aimed at organizations which, having limited the entrance of executables into their networks, are increasingly threatened by malicious documents (PDF and MS Office files) that are attached to email messages and sent in order to penetrate organizations and perform malicious activities. Therefore, we further enhanced our AL based framework for the detection of this type of malicious document [4] and created the ALPD [5, 6] and ALDOCX [7, 8] frameworks, which include our newly developed feature extraction methodologies for the efficient detection of malicious documents.

As can be seen in our published papers [1-8] which are included here, in the course of our study we have compared our framework with existing methods, solutions, and tools. The most meaningful results are highlighted below.

In the computer worm domain [1], our framework increased the performance of the detection model by 19%-25% and improved its robustness in the presence of misleading instances of elusive computer worms. This was accomplished using AL methods based on behavioral analysis of the monitored system in which the worm was executed.

In the domain of malicious executables [2], our novel AL methods (Exploitation and Combination) outperformed the existing AL method (SVM-Margin) in the number of new malware files acquired daily from a stream of new unknown files, acquiring 2.6 times more malware than the existing AL method and 7.8 times more than the random selection method, which resulted in better updating and enrichment of the antivirus software's repository with new malware signatures. In addition, the performance of the detection model improved as more files were acquired daily, and the results indicate that by acquiring only a small and well selected set of informative files (31% of the stream), the detection model was able to achieve TPR levels that were 96.5% of the maximal TPR rates that could be achieved (by acquiring the whole stream of new unknown files).

In the Android application domain, we presented a general descriptive set of static features on which our ALDROID [3] framework is based. The framework acquired more than twice as many new malware files as a heuristic engine and 6.5 times more malware than the existing AL method (SVM-Margin). By acquiring more malware we frequently enriched the signature repository of the Android antivirus software, while also updating the detection capabilities of the detection model.

For the detection of malicious PDF documents, we first demonstrated [4] the correlation between the structural incompatibility of PDF files and their likelihood of maliciousness. By leveraging this correlation, at least 96.5% of the malicious PDF files can be easily filtered out using a simple and deterministic filtering process. Later, we developed our ALPD [5, 6] active learning based framework, which was capable of inducing an updated detection model on a daily basis, outperforming the detection rates of all of the evaluated antivirus tools (by at least 5% TPR) and accomplishing this using only a relatively small fraction of the new PDF files (25%) – the most informative portion, consisting of the most valuable information required for updating the knowledge stores of the detection model and antivirus tools.
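The exact deterministic filter is specified in paper [4]; as a hedged illustration of the idea only, the sketch below applies two hypothetical, minimal structural-compatibility checks (a %PDF- header and an %%EOF marker near the end of the file), under the assumption that files failing such basic checks are structurally incompatible with the PDF standard and therefore strong candidates for filtering.

```python
# Hedged illustration only: the exact deterministic filter is specified in [4].
# Two hypothetical, minimal structural-compatibility checks are shown here.
def is_structurally_compatible(path: str) -> bool:
    with open(path, "rb") as f:
        data = f.read()
    has_header = data.startswith(b"%PDF-")
    has_eof_marker = b"%%EOF" in data[-1024:]    # trailer region should contain %%EOF
    return has_header and has_eof_marker

# Usage with a hypothetical file path:
# print(is_structurally_compatible("suspicious_sample.pdf"))
```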

For the detection of malicious Microsoft Office files, our ALDOCX [7, 8] framework, based on a novel feature extraction methodology (SFEM), also showed a significant improvement of 91% in unknown docx malware acquisition compared to the random selection method and existing AL methods, thus providing an improved updating solution for the detection model, as well as for the antivirus software widely used within organizations. ALDOCX also achieved a 94.4% TPR, compared to the most accurate antivirus, TrendMicro, which had a detection rate of 85.9%. We achieved this using only 14% of the labeled data (2,011 docx files out of 14,000), which represents a reduction of 84.4% in labeling efforts.

Our results show that we have developed a framework that is general, yet flexible and adaptable, and that has proven efficient in many sub-domains of new unknown malware detection. Thus, in order to maximize the efficiency of our framework, it should be strategically deployed over specific nodes of the Internet in an attempt to achieve the greatest coverage possible while limiting deployment to as few units as possible (minimizing costs). After deploying the framework in the strategic nodes, the integration of all the deployed units will expose the framework to as many new files transferred over the Internet as possible, ensuring that most of the new informative files will be acquired. Therefore, once a new unknown malware is created and transferred over the network, it will be monitored by the strategic deployment of the framework, like a "fly caught in tangled spider webs." Routers, gateways to organizations, and the markets of mobile applications (official and unofficial) are considered part of the strategic nodes in which the framework should be deployed. The integration between these strategic nodes and several levels of the framework will improve detection and updatability capabilities.

It is important to consider possible attacks against our framework. Zhao et al. [89] recently discussed two possible attack scenarios on AL methods. In these attacks, referred to as adding and deleting, the attacker can actually pollute the unlabeled data before it is sampled by the active learner module. The results of their experiments on an intrusion detection dataset showed that these attacks disrupted the performance of the AL process and significantly decreased detection accuracy: a decrease of 16%-30% due to the adding attack and a decrease of 15%-34% due to the deleting attack. In our context, an adaptive attacker might "guide" the AL process and poison the classifiers by producing many malware files that contain specifically chosen features by design. Consequently, the AL methods would acquire these files, since they would contain new and interesting features which did not exist before. Attacking such a biased classifier then becomes easy: the attacker simply leaves out the chosen features and creates a malicious file that can evade the detection model.

The way to confront this attack is quite simple and relates to the deployment of our framework, as well as to the file acquisition strategies integrated within our AL based framework. First, the AL process is not based on a specific node in the Internet but is sustained by many sources of information and files. Thus, such an attack must flood significant segments of the Internet in order to poison the presented framework in a way that will bias the classifier. Not only is such a flood by an attacker not feasible, it also takes time to conduct, allowing antivirus vendors enough time to distribute a patch against it. In addition, since our framework tries to select the most informative files and attempts to enlarge the signature repository, it does not choose files that are similar to previously acquired files. Therefore, our AL methods would not acquire a full set of malicious files sharing the same specifically chosen features but would only acquire a few representative files. Thus the framework is resilient to such attacks, and its detection capabilities remain unaffected and intact.

As a proof of concept for the generality of our AL based framework, we have recently extended the framework's capabilities so that it can provide a solution in additional domains. We adjusted it to meet the needs of the biomedical informatics domain and created the CAESAR-ALE [9-11] framework. We successfully enhanced the capabilities of a classification model used for condition severity classification, with a significant reduction in the labeling efforts of medical experts (ranging from 48% to 64%), which can result in savings of both time and money. In fact, our framework can be adjusted to contribute to any domain that demands efficient and enhanced classification and detection capabilities and could benefit from minimizing the efforts and resources associated with the experts involved in classification and labeling.

Importantly, we demonstrated that our AL methods in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of different human labelers, particularly regarding differences in the level of medical training (i.e., having a medical degree versus having a master's degree). In fact, our new AL method, Combination_XA, acquired conditions in such a way that the variance between the classifiers created by individual labelers was 50% lower than the variance of classifiers created through acquisition by the traditional SVM-Margin AL method. Therefore, not only does CAESAR-ALE result in a 62% reduction in labeler efforts to acquire the same number of severe instances; it can also learn from labeled data obtained by labelers without clinical training.

The results of the evaluation of our framework in a variety of domains and tasks serve as the motivation for its full deployment in real systems. Currently we are collaborating with several cyber-security and online bidding companies for which the framework can be beneficial. As presented in the results section, the FPR rates of our framework are low enough for deployment within real systems. Furthermore, the low time complexity associated with our AL methods provides the ability to handle a large number of cases online, as is required by many actual online systems. Prior to the full deployment of our framework in a real system, the user should first identify the prior probability of the positive class he/she aims to acquire. Second, a good set of general descriptive features, on which the classifier and AL methods will be based, must be designed and created. Third, in order to acquire a manageable amount of unlabeled and informative examples, the user must evaluate the number of examples the human experts can inspect manually in a predefined time slice.
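A hedged sketch of how these three deployment prerequisites might be recorded follows; every name and number here is a hypothetical placeholder rather than a recommended value.

```python
# Hedged sketch of the three deployment prerequisites listed above; all values
# are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class DeploymentConfig:
    positive_class_prior: float                          # estimated share of malicious files in the stream
    feature_names: list = field(default_factory=list)    # general descriptive features
    files_per_expert_per_day: int = 40                   # files one expert can inspect per day
    num_experts: int = 3

    def daily_acquisition_budget(self) -> int:
        # The AL methods should acquire no more files than the experts can label.
        return self.files_per_expert_per_day * self.num_experts

cfg = DeploymentConfig(
    positive_class_prior=0.05,
    feature_names=["n_javascript_objects", "n_embedded_files", "file_size"],
)
print(cfg.daily_acquisition_budget())                    # 120 files/day in this hypothetical setup
```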

- 41 -

4. Future Directions

We plan to continue our research, adapting our framework and methods further and using them as a springboard to enhance detection and address other types of malware, develop tools in the biomedical informatics domain, and serve other domains.

More specifically, we plan to develop a new AL method that complements our Exploitation method (presented previously) and acquires interesting and informative files from deep inside the benign side of the classifier (e.g., SVM). The goal is to discover very elusive malicious files that, in order to evade the detection model, hide their malicious behavior and content and resemble benign files, in that they have many features (dynamic or static) that are more likely to appear in a benign file than in a malicious one; such files are, in effect, a type of Trojan horse.
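As a minimal sketch of this planned criterion (mirroring the earlier Exploitation sketch on the opposite side of the hyperplane), and assuming an SVM whose decision function is positive on the malicious side, the files with the most negative scores lie deepest inside the benign side and would be the acquisition candidates for such a complementary method.

```python
# Minimal sketch of the planned criterion on synthetic data: files with the
# most negative SVM scores lie deepest inside the benign side and would be
# the acquisition candidates for this complementary method.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X_lab, y_lab = rng.normal(size=(300, 20)), rng.integers(0, 2, 300)
X_new = rng.normal(size=(500, 20))

clf = LinearSVC(dual=False).fit(X_lab, y_lab)
scores = clf.decision_function(X_new)
deep_benign_picks = np.argsort(scores)[:10]      # 10 files deepest inside the benign side
print(deep_benign_picks)
```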

Barnabé-Lortie et al. [97] presented an AL method aimed at one-class classification problems. Such a method is likely to be effective in the malware domain, particularly in cases in which the minority class associated with malware and attacks is not sufficiently representative, where a sophisticated and specially tailored AL method such as the one they proposed [97] can be very beneficial and appropriate. We therefore plan to integrate this method into our AL framework as an additional AL method in future work.

In addition, we aim to adjust our framework and AL methods for additional domains within cyber security, in which an efficient selection of a specific class is needed. Online social networks (OSNs) are a domain that could benefit from our methods; in this setting our method could be used for the detection and acquisition of new fictitious profiles, a difficult and important task given the increased prevalence of malicious profiles that are used to tempt innocent and unsophisticated users such as children and the elderly.

Biomedical domains often involve significant efforts by experts (including clinical experts) in labeling the cases to be classified, such as structured and unstructured aspects of patient records, multiple types of images, etc. AL is a methodology that can significantly reduce the effort involved by initially labeling only a small portion of the data, while obtaining the same level of accuracy achieved via the use of the complete dataset. Furthermore, using specific enhanced AL methods, this level of accuracy might be obtained while also detecting a higher percentage of positive cases, an aspect that in some situations might be of interest. Many biomedical domains and tasks might benefit from the application of our AL methods and enhanced versions of these methods.

In addition, we plan to develop an online tool for medical experts to use to label condition severity. This will enable medical experts worldwide to label conditions and allow us to compare their labels and learn differences based on the clinical background or geographical location of the experts. However, because labelers (generally speaking, and particularly those in the medical domain) have varying levels of expertise, a major issue associated with learning methods (and more specifically AL methods) is how best to use the labeling provided by a committee of labelers. First, we want to know, based on the labelers' learning curves, whether using AL methods (versus standard passive learning methods) has an effect on the intra-labeler variability (within the learning curve of each labeler) and inter-labeler variability (among the learning curves of different labelers). Then, we want to examine the effect of learning (either passively or actively) from the labels created by the majority consensus of a group of labelers.
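As a hedged illustration of the two variability notions defined above, the following sketch computes intra-labeler variability (variance along each labeler's learning curve) and inter-labeler variability (variance across labelers at each learning step) from hypothetical learning curves; the accuracy values are invented for illustration only.

```python
# Hedged illustration of the two variability notions defined above, computed on
# hypothetical learning curves (accuracy per AL iteration, one row per labeler).
import numpy as np

curves = np.array([                              # rows: labelers, columns: AL iterations
    [0.70, 0.78, 0.83, 0.86],
    [0.68, 0.75, 0.82, 0.85],
    [0.72, 0.80, 0.84, 0.88],
])
intra_labeler_variability = curves.var(axis=1)   # within each labeler's learning curve
inter_labeler_variability = curves.var(axis=0)   # among labelers at each iteration
print(intra_labeler_variability, inter_labeler_variability.mean())
```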

Our research has been driven by emerging trends in malware and technology which will continue changing into the future, demanding new solutions and fueling additional research. Our methods and approaches have been proven adaptable and flexible to meet changing needs and will continue to evolve in the future.

- 43 -

5. References

[1] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Detecting Unknown Computer Worm Activity via Support Vector Machines and Active Learning," Pattern Analysis and Applications, (2012) 15:459-475.
[2] Nir Nissim, Robert Moskovitch, Lior Rokach, Yuval Elovici, "Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS," Expert Systems with Applications, (2014). Link: http://authors.elsevier.com/sd/article/S095741741400133X.
[3] Nir Nissim, Robert Moskovitch, Oren BarAd, Lior Rokach, Yuval Elovici, "ALDROID: Efficient Update of Android Antivirus Software Using Designated Active Learning Methods," Knowledge and Information Systems (2016), 1-39. Accepted on 11 January 2016.
[4] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici, "Detection of Malicious PDF Files and Directions for Enhancements: A State-of-the-Art Survey," Computers & Security, Volume 48, February 2015, Pages 246-266, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2014.10.014.
[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici, "Keeping Pace with the Creation of New Malicious PDF Files Using an Active-Learning Based Detection Framework," Security Informatics (2016), 5(1), 1-20.
[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Asaf Shabtai, Yuval Elovici, "ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations," JISIC (2014).
[7] Nir Nissim, Aviad Cohen, Yuval Elovici, "Boosting the Detection of Malicious Documents Using Designated Active Learning Methods," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765. doi: 10.1109/ICMLA.2015.52.
[8] Nir Nissim, Aviad Cohen, Yuval Elovici, "Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents," ODDX3 Workshop at KDD Conference (2015), Sydney.
[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification," BigCHAT Workshop at KDD Conference (2014).

Mario Stefanelli Best Paper Award at AIME 2015 Conference:
[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak, "An Active Learning Framework for Efficient Condition Severity Classification," Artificial Intelligence in Medicine (pp. 13-24), Springer International Publishing, AIME (2015).
[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch, "Improving Condition Severity Classification with an Efficient Active Learning Based Framework," Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464.

- 44 -

[12] D. Dagon, T. Martin, and T. Starner, “Mobile phones as computing devices: The viruses are coming!” IEEE Pervasive Computing, vol. 3,no. 4, pp. 11–15, 2004. [13] M. Piercy, “Embedded devices next on the virus target list,” IEEE electronics Systems and Software, vol. 2, pp. 42–43, Dec.-Jan. 2004. [14] N. Leavitt, "Mobile phones: the next frontier for hackers?" Computer, vol. 38(4), 2005, pp. 20-23. [15] D.H. Shih, B. Lin, H.S. Chiang, M.H. Shih, "Security aspects of mobile phone virus: a critical survey," Industrial Management & Data Systems, vol. 108(4), 2008, pp. 478-494. [16] Schmidt, A.-D.; Schmidt, H.-G.; Batyuk, L.; Clausen, J.H.; Camtepe, S.A.; Albayrak, S.; Yildizli, C.; , "Smartphone malware evolution revisited: Android next target?," Malicious and Unwanted Software (MALWARE), 2009 4th International Conference on , vol., no., pp.1-7, 13-14 Oct. 2009 [17] Shabtai, A., Fledel, Y., Kanonov, U., Elovici, Y., Dolev, S., & Glezer, C. (2010). Google android: A comprehensive security assessment. IEEE Security & Privacy, (2), 35-44. [18] Steve Mansfield-Devine, Android malware and mitigations, Network Security, Volume 2012, Issue 11, November 2012, Pages 12-20, ISSN 1353-4858, 10.1016/S1353-4858(12)70104-6. [19] http://www.securelist.com/en/analysis/204792250/IT_Threat_Evolution_Q3_2012 [20] http://docs.oracle.com/javase/tutorial/security/apisign/gensig.html [21] Axelle Apvrille, Tim Strazzere: Reducing the window of opportunity for Android malware Gotta catch 'em all. Journal in Computer Virology 8(1-2): 61-71 (2012) [22] B. Sanz, I. Santos, P. Galán-García, C. Laorden, X. Ugarte-Pedrero, P.G. Bringas and G. Alvarez PUMA: Permission Usage to detect Malware in Android. In Proceedings of the 5th International Conference on Computational Intelligence in Security for Information Systems (CISIS). Ostrava (Czech Republic), 5-7 September, 2012 [23] Bläsing, T.; Batyuk, L.; Schmidt, A.-D.; Camtepe, S.A.; Albayrak, S.; , "An Android Application Sandbox system for suspicious software detection," Malicious and Unwanted Software (MALWARE), 2010 5th International Conference on , pp.55-62, 19-20 Oct. 2010 [24] Shabtai, A.; Fledel, Y.; Elovici, Y.; , "Automated Static Code Analysis for Classifying Android Applications Using Machine Learning," Computational Intelligence and Security (CIS), 2010 International Conference on, pp.329-333, 11-14 Dec. 2010 [25] Sanz, B.; Santos, I.; Laorden, C.; Ugarte-Pedrero, X.; Bringas, P.G.; , "On the automatic categorisation of android applications," Consumer Communications and Networking Conference (CCNC), 2012 IEEE,pp.149-153, 14-17 Jan. 2012 [26] Min Zhao, Tao Zhang, FangbinGe, Zhijian Yuan: RobotDroid: A Lightweight Malware Detection Framework On Smartphones. JNW 7(4): 715-722 (2012) [27] B. Sarma, N. Li, C. Gates, R. Potharaju, and C. Nita-Rotaru. Android Permissions: A Perspective Combining Risks and Benefits. In Proceedings of SACMAT, 2012. [28] H. Peng, C. Gates, B. Sarma, N. Li, A. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy. Using Probabilistic Generative Models for Ranking Risks of Android Apps.In Proceedings of ACM CCS, 2012.

- 45 -

[29] Suarez-Tangil, Guillermo, et al. "Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families." Expert Systems with Applications 41.4 (2014): 1104-1117. [30] Zhou, W., Zhou, Y., Jiang, X., & Ning, P. (2012). Detecting repackaged smartphone applications in third-party Android marketplaces. In Proceedings of the second ACM conference on data and application security and privacy (pp. 317–326). ACM.

[31] Ying-Dar Lin, Yuan-Cheng Lai, Chien-Hung Chen, Hao-Chuan Tsai, Identifying android malicious repackaged applications by thread-grained system call sequences, Computers & Security, Volume 39, Part B, November 2013, Pages 340-350, ISSN 0167-4048, http://dx.doi.org/10.1016/j.cose.2013.08.010.

[32] Seung-Hyun Seo, Aditi Gupta, Asmaa Mohamed Sallam, Elisa Bertino, Kangbin Yim, Detecting mobile malware threats to homeland security through static analysis, Journal of Network and Computer Applications, Volume 38, February 2014, Pages 43-53, ISSN 1084-8045, http://dx.doi.org/10.1016/j.jnca.2013.05.008.

[33] Zhou, Yajin, et al. "Hey, you, get off of my market: Detecting malicious apps in official and alternative android markets." Proceedings of the 19th Annual Network and Distributed System Security Symposium. 2012. [34] Ham, Hyo-Sik, and Mi-Jung Choi. "Analysis of Android malware detection performance using machine learning classifiers." ICT Convergence (ICTC), 2013 International Conference on. IEEE, 2013. [35] Luoshi, Z., Yan, N., Xiao, W., Zhaoguo, W., & Yibo, X. (2013, November). A3: Automatic Analysis of Android Malware. In 1st International Workshop on Cloud Computing and Information Security. Atlantis Press. [36] Shabtai, A., Tenenboim-Chekina, L., Mimran, D., Rokach, L., Shapira, B., & Elovici, Y., Mobile Malware Detection through Analysis of Deviations in Application Network Behavior, Computers & Security, 2014.

[37] Zhang, Yuan, et al. "Vetting undesirable behaviors in android apps with permission use analysis." Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013.

[38] Z. Tzermias, G. Sykiotakis, M. Polychronakis and E. P. Markatos. Combining static and dynamic analysis for the detection of malicious documents. Presented at Proceedings of the Fourth European Workshop on System Security. 2011.

[39] N. Šrndic and P. Laskov. Detection of malicious pdf files based on hierarchical document structure. Presented at Proceedings of the 20th Annual Network & Distributed System Security Symposium. 2013.

[40] P. Laskov and N. Šrndić. Static detection of malicious JavaScript-bearing PDF documents. Presented at Proceedings of the 27th Annual Computer Security Applications Conference. 2011.

[41] J. Kittilsen. Detecting malicious PDF documents.

[42] F. Schmitt, J. Gassen and E. Gerhards-Padilla. PDF scrutinizer: Detecting JavaScript-based attacks in PDF documents. Presented at Privacy, Security and Trust (PST), 2012 Tenth Annual International Conference On. 2012.

- 46 -

[43] C. Smutz and A. Stavrou. Malicious PDF detection using metadata and structural features. Presented at Proceedings of the 28th Annual Computer Security Applications Conference. 2012.

[44] D. Maiorca, G. Giacinto and I. Corona. "A pattern recognition system for malicious pdf files detection," in Machine Learning and Data Mining in Pattern RecognitionAnonymous 2012.

[45] H. Pareek, P. Eswari, N. S. C. Babu and C. Bangalore. Entropy and n-gram analysis of malicious PDF documents. Int. J. Eng. 2(2), 2013.

[46] X. Lu, J. Zhuge, R. Wang, Y. Cao and Y. Chen. De-obfuscation and detection of malicious PDF files with high accuracy. Presented at System Sciences (HICSS), 2013 46th Hawaii International Conference On. 2013.

[47] K. Z. Snow, S. Krishnan, F. Monrose and N. Provos. SHELLOS: Enabling fast detection and forensic analysis of code injection attacks. Presented at USENIX Security Symposium. 2011.

[48] T. Schreck, S. Berger and J. Göbel. "BISSAM: Automatic vulnerability identification of office documents," in Detection of Intrusions and Malware, and Vulnerability Assessment Anonymous 2013.

[49] Abou-Assaleh, T.; Cercone, N.; Keselj, V.; Sweidan, R., "N-gram-based detection of new malicious code," Computer Software and Applications Conference, 2004. COMPSAC 2004. Proceedings of the 28th Annual International , vol.2, no., pp.41,42 vol.2, 28-30 Sept. 2004

[50] Henchiri, O., & Japkowicz, N. (2006, December). A feature selection and evaluation scheme for computer virus detection. In Data Mining, 2006. ICDM'06. Sixth International Conference on (pp. 891-895). IEEE.

[51] Kolter, J. Z., & Maloof, M. A. (2004, August). Learning to detect malicious executables in the wild. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 470-478). ACM.

[52] Kolter, J. Z., & Maloof, M. A. (2006). Learning to detect and classify malicious executables in the wild. The Journal of Machine Learning Research, 7, 2721-2744.

[53] Moskovitch, R., Stopel, D., Feher, C., Nissim, N., & Elovici, Y. (2008, June). Unknown malcode detection via text categorization and the imbalance problem. In Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on (pp. 156-161). IEEE.

[54] Schultz, M. G., Eskin, E., Zadok, E., & Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on (pp. 38-49). IEEE.

[55] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, and Y. Elovici, "Unknown Malcode Detection and the Imbalance Problem," Journal in Computer Virology, 5(4), 2009, 295-308.

[56] Jiyong Jang, David Brumley, and Shobha Venkataraman. 2011. BitShred: feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM conference on Computer and communications security (CCS '11). ACM, New York, NY, USA, 309-320.

- 47 -

[57] Tahan, G., Rokach, L., & Shahar, Y. (2012). Mal-ID: Automatic malware detection using common segment analysis and meta-features. The Journal of Machine Learning Research, 13(1), 949-979.

[58] Carsten Willems, Thorsten Holz, and Felix Freiling. 2007. Toward Automated Dynamic Malware Analysis Using CWSandbox. IEEE Security and Privacy 5, 2 (March 2007), 32-39.

[59] Sharif, M., Lanzi, A., Giffin, J., & Lee, W. (2009, May). Automatic reverse engineering of malware emulators. In Security and Privacy, 2009 30th IEEE Symposium on (pp. 94-109). IEEE.

[60] Royal, P., Halpin, M., Dagon, D., Edmonds, R., & Lee, W. (2006, December). Polyunpack: Automating the hidden-code extraction of unpack-executing malware. In Computer Security Applications Conference, 2006. ACSAC'06. 22nd Annual (pp. 289-300). IEEE.

[61] Rieck, K., Holz, T., Willems, C., Düssel, P., & Laskov, P. (2008). Learning and classification of malware behavior. In Detection of Intrusions and Malware, and Vulnerability Assessment (pp. 108-125). Springer Berlin Heidelberg.

[62] Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4), 639-668.

[63] Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M. G., ... & Saxena, P. (2008). BitBlaze: A new approach to computer security via binary analysis. In Information systems security (pp. 1-25). Springer Berlin Heidelberg.

[64] Perdisci, R., Lanzi, A., & Lee, W. (2008, December). McBoost: Boosting scalability in malware collection and analysis using statistical classification of executables. In Computer Security Applications Conference, 2008. ACSAC 2008. Annual (pp. 301-310). IEEE.

[65] Moser, A.; Kruegel, C.; Kirda, E., "Exploring Multiple Execution Paths for Malware Analysis," Security and Privacy, 2007. SP '07. IEEE Symposium on , pp.231,245, 20-23 May 2007

[66] Lanzi, A., Sharif, M. I., & Lee, W. (2009, February). K-Tracer: A System for Extracting Kernel Malware Behavior. In NDSS.

[67] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X. Y., & Wang, X. (2009, August). Effective and Efficient Malware Detection at the End Host. In USENIX Security Symposium (pp. 351-366).

[68] Jacob, G., Debar, H., & Filiol, E. (2009, January). Malware behavioral detection by attribute-automata using abstraction from platform and language. In Recent Advances in Intrusion Detection (pp. 81-100). Springer Berlin Heidelberg.

[69] Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C., & Kirda, E. (2009, February). Scalable, Behavior-Based Malware Clustering. In NDSS (Vol. 9, pp. 8-11).

[70] Moore, D., Paxson, V. Savage, S., Shannon, C., Staniford, S., Weaver, N., "Inside the Slammer Worm", IEEE Security & Privacy 2003.

[71] Moore, D., Shannon, C. and Brown, J. Code-red: a case study on the spread and victims of an internet worm. In Traffic analysis, pages 273-284. The Second Internet Measurement Workshop, Nov. 2002

[72] Langner, Ralph. "Stuxnet: Dissecting a cyberwarfare weapon." Security & Privacy, IEEE 9.3 (2011): 49-51.

- 48 -

[73] Moskovitch R, Elovici Y, Rokach L (2008) Detection of unknown computer worms based on behavioral classification of the host. Comput Stat Data Anal 52(9):4544–4566

[74] Stopel D, Moskovitch R, Boger Z, Shahar Y, Elovici Y (2009) Using artificial neural networks to detect unknown computer worms. Neural Comput Appl 18:663–674

[75] Hansman, S. A taxonomy of network and computer attack methodologies. http://www.cosc.canterbury.ac.nz/research/reports/HonsReps/2003/hons 0306.pdf, Nov. 2003.

[76] Report: http://www.kaspersky.com/about/news/virus/2013/number-of-the-year

[77] Iyatiti Mokube and Michele Adams. 2007. Honeypots: concepts, approaches, and challenges. In Proceedings of the 45th annual southeast regional conference (ACM-SE 45). ACM, New York, NY, USA, 321-326.

[78] N. Provos, and T. Holz, Virtual Honeypots: From Botnet Tracking to Intrusion Detection, Addison- Wesley,2008, pp. 231–272.

[79] T. Yamamoto, M. Matsushita, T. Kamiya, and K. Inoue. Measuring similarity of large software systems based on source code correspondence. Product Focused Software Process Improvement, pages 179–208, 2005.

[80] I. Zliobaite. Learning under concept drift: an overview. Technical report, Vilnius University, Lithuania, 2009.

[81] Singh, Anshuman, Andrew Walenstein, and Arun Lakhotia. "Tracking concept drift in malware families." Proceedings of the 5th ACM workshop on Security and artificial intelligence. ACM, 2012.

[82] Puzis, R., Tubi, M., Elovici, Y., Glezer, C., & Dolev, S. (2011). A decision support system for placement of intrusion detection and prevention devices in large-scale networks. ACM Transactions on Modeling and Computer Simulation (TOMACS), 22(1), 5.

[83] Puzis, R., Klippel, M. D., Elovici, Y., & Dolev, S. (2008). Optimization of NIDS placement for protection of intercommunicating critical infrastructures. In Intelligence and Security Informatics (pp. 191-203). Springer Berlin Heidelberg.

[84] Dolev, S., Elovici, Y., Puzis, R., & Zilberman, P. (2009). Incremental deployment of network monitors based on group betweenness centrality. Information Processing Letters, 109(20), 1172-1176.

[85] Puzis, R., Zilberman, P., Elovici, Y., Dolev, S., & Brandes, U. (2012, September). Heuristics for speeding up betweenness centrality computation. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom) (pp. 302-311). IEEE.

[86] Steve Mansfield-Devine, Android malware and mitigations, Network Security, Volume 2012, Issue 11, November 2012, Pages 12-20, ISSN 1353-4858, 10.1016/S1353-4858(12)70104-6.

[87] Oberheide, J., Miller, J.: Dissecting the android bouncer (2012)

[88] https://www4.symantec.com/mktginfo/whitepaper/ISTR/21347932_GA-internet-security-threat-report-volume-20-2015-social_v2.pdf

- 49 -

[89] Zhao, W., Long, J., Yin, J., Cai, Z., & Xia, G. (2012). Sampling attack against active learning in adversarial environment. In Modeling Decisions for Artificial Intelligence (pp. 222-233). Springer Berlin Heidelberg.

[90] http://www.strazzere.com/papers/DexEducation-PracticingSafeDex.pdf

[91] Beuhring, A.; Salous, K., "Beyond Blacklisting: Cyberdefense in the Era of Advanced Persistent Threats," in Security & Privacy, IEEE , vol.12, no.5, pp.90-93, Sept.-Oct. 2014 doi: 10.1109/MSP.2014.86 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6924678&isnumber=6924618

[92] Servedio RA (2003) Smooth boosting and learning with malicious noise. J Mach Learn Res 4:633–648

[93] Chen Y, Zhan Y (2009) Co-training semi-supervised active learning algorithm based on noise filter. In: Proceedings of the 2009 WRI global congress on intelligent systems, GCIS ’09, vol 03. IEEE Computer Society, Washington, DC, USA, pp 524–528

[94] Schohn G , Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the seventeenth international conference on machine learning, ICML ’00. Morgan Kaufmann Publishers Inc, San Francisco, pp 839–846

[95] Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." Journal of Machine Learning Research 2.Nov (2001): 45-66.

[96] Roy, Nicholas, and Andrew McCallum. "Toward optimal active learning through Monte Carlo estimation of error reduction." ICML, Williamstown (2001): 441-448.

[97] V. Barnabé-Lortie, C. Bellinger and N. Japkowicz, "Active Learning for One-Class Classification," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, 2015, pp. 390-395. doi: 10.1109/ICMLA.2015.167

- 50 -

- 51 -

6. Appendix

6.1. Additional Accepted Papers in the Malware Detection Domain

In this section we present the list of our four additional papers in the malware domain, which support the results and present additional results and experiments related to the four core papers. The full published versions of these papers are presented following the list.

[5] Nir Nissim, Aviad Cohen, Robert Moskovitch, Asaf Shabtai, Matan Edri, Oren Bar-Ad, Yuval Elovici. (2016). Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework. Security Informatics, 5(1), 1-20.

[6] Nir Nissim, Aviad Cohen, Robert Moskovitch, Oren Barad, Mattan Edry, Asaf Shabtai, Yuval Elovici, "ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations," JISIC (2014).

[7] Nir Nissim, Aviad Cohen, Yuval Elovici, "Boosting the Detection of Malicious Documents Using Designated Active Learning Methods," 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 2015, pp. 760-765. doi: 10.1109/ICMLA.2015.52

[8] Nir Nissim, Aviad Cohen, Yuval Elovici. Designated Active Learning Methods for Enhanced Detection of Unknown Malicious Microsoft Office Documents. ODDX3 Workshop at KDD Conference (2015), Sydney.

- 52 -

Nissim et al. Secur Inform (2016) 5:1 DOI 10.1186/s13388-016-0026-3

RESEARCH (Open Access)

Keeping pace with the creation of new malicious PDF files using an active-learning based detection framework

Nir Nissim1*, Aviad Cohen1, Robert Moskovitch2, Asaf Shabtai1, Matan Edri1, Oren BarAd1 and Yuval Elovici1

Abstract

Attackers increasingly take advantage of naive users who tend to treat non-executable files casually, as if they are benign. Such users often open non-executable files although they can conceal and perform malicious operations. Existing defensive solutions currently used by organizations prevent executable files from entering organizational networks via web browsers or email messages. Therefore, recent advanced persistent threat attacks tend to leverage non-executable files such as portable document format (PDF) documents which are used daily by organizations. Machine Learning (ML) methods have recently been applied to detect malicious PDF files, however these techniques lack an essential element—they cannot be efficiently updated daily. In this study we present an active learning (AL) based framework, specifically designed to efficiently assist anti-virus vendors focus their analytical efforts aimed at acquiring novel malicious content. This focus is accomplished by identifying and acquiring both new PDF files that are most likely malicious and informative benign PDF documents. These files are used for retraining and enhancing the knowledge stores of both the detection model and anti-virus. We propose two AL based methods: exploitation and combination. Our methods are evaluated and compared to existing AL method (SVM-margin) and to random sampling for 10 days, and results indicate that on the last day of the experiment, combination outperformed all of the other methods, enriching the signature repository of the anti-virus with almost seven times more new malicious PDF files, while each day improving the detection model's capabilities further. At the same time, it dramatically reduces security experts' efforts by 75 %. Despite this significant reduction, results also indicate that our framework better detects new malicious PDF files than leading anti-virus tools commonly used by organizations for protection against malicious PDF files.

Keywords: Active learning, Machine learning, PDF, Malware

Introduction
Cyber-attacks aimed at organizations have increased since 2009, with 91 % of all organizations hit by cyber-attacks in 2013.1 Attacks aimed at organizations usually include harmful activities such as stealing confidential information, spying and monitoring an organization, and disrupting an organization's actions. Attackers may be motivated by ideology, criminal intent, a desire for publicity, and more. The vast majority of organizations rely heavily on email for internal and external communication. Thus, email has become a very attractive platform from which to initiate cyber-attacks against organizations. Attackers often use social engineering in order to encourage recipients to press a link or open a malicious web page or attachment. According to Trend Micro,2 attacks, particularly those against government agencies

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyberattacks-in-2013/.

*Correspondence: [email protected]
1 Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beersheba, Israel
Full list of author information is available at the end of the article

2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/.

© 2016 Nissim et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

and large corporations, are largely dependent upon Spear-Phishing3 emails.

An incident in 2014 aimed at the Israeli ministry of defense (IMOD) provides an example of a new type of targeted cyber-attack involving non-executable files attached to an email. According to media reports,4 the attackers posed as IMOD representatives and sent email messages with a malicious portable document format (PDF) file attachment which, when opened, installed a Trojan horse enabling the attacker to control the computer.

Non-executable files attached to an email are a component of many recent cyber-attacks as well. This type of attack has grown in popularity, in part because executable files (e.g., *.EXE) attached to emails are filtered by most email servers due to the risk they pose and also because non-executables (e.g., *.PDF, *.DOC, etc.) are not filtered out and are considered safe by most users. Non-executable files are written in a format that can be read only by a program that is specifically designed for that purpose and often cannot be directly executed. For example, a PDF file can be read only by a PDF reader such as Adobe Reader or Foxit Reader. Unfortunately, non-executable files are as dangerous as executable files, since their readers can contain vulnerabilities that, when exploited, may allow an attacker to perform malicious actions on the victim's computer. F-Secure's 2008–2009 report5 indicates that the most popular file types for targeted attacks in 2008–2009 were PDF and Microsoft Office files. Since that time, the number of attacks on Adobe Reader has grown.6

To prevent such attacks, defensive tools such as firewalls, intrusion detection systems (IDSs), intrusion prevention systems (IPSs), anti-viruses, sandboxes, and others are used; however, these tools have limitations in the detection of attacks that are launched via non-executable files, particularly when a sophisticated advanced persistent threat attack is executed against an organization. These tools rely heavily on the store of known signatures maintained by anti-virus vendors and communicated to clients. Their main limitation is their inability to detect very new unknown types of attacks through known signatures, due to the time lag that exists between when a new unknown malware appears and the time anti-virus vendors update their clients with the new signature. During this period of time, many computers are vulnerable to the spread and actions of new malware [1], [2]. Thus, the currently known signatures of the clients of anti-virus vendors are consistently not up to date.

In order to effectively analyze tens of thousands of new, potentially malicious PDF files on a daily basis, anti-virus vendors have integrated a component of a detection model based on machine learning (ML) and rule-based algorithms [3] into the core of their signature repository update activities. However, these solutions are often ineffective, because their knowledge base is not adequately updated. This occurs because many new, potentially malicious PDF files are created every day, and only a limited number of security researchers are tasked with manually labeling them. Thus, the problem lies in prioritizing which PDF files should be acquired, analyzed, and labeled by a human expert.7

In this study we present an active learning (AL) based framework for frequently updating anti-virus software with new malicious PDF files. The framework focuses on improving anti-virus tools by labeling those informative PDF files (potentially malicious or very informative benign files) that are most likely to improve the detection model's performance and, in so doing, enrich the signature repository with as many new PDF malware files as possible, further enhancing the detection process. Specifically, the presented framework favors files that contain new content. We focus on PCs, the platform most used by organizations; mobile platforms are outside the scope of this study.

Background
Before we provide a review of existing techniques and known methods of attack, it is worthwhile to mention that Adobe Reader version X, released in 2011, offers a new feature called Protected Mode Adobe Reader (PMAR). Protected mode uses a sandbox8 technique in order to create an isolated environment for the Acrobat Reader rendering agent to run while reading a PDF file. As a result, malicious code actions cannot affect the operating system. However, most organizations are not up-to-date with the newest versions of software, including PDF readers,9 and thus they are exposed to many well-known attacks that exploit vulnerabilities that exist in previous versions of Adobe Reader. In addition, PDF files can be utilized for malicious purposes using a variety of techniques. In order to explain how PDF files can be exploited when created or manipulated by an attacker, we first describe the structure of a viable PDF file.

3 http://searchsecurity.techtarget.com/definition/spear-phishing.
4 http://www.israeldefense.co.il/?CategoryID=512&ArticleID=5766.
5 http://www.f-secure.com/weblog/archives/00001676.html.
6 http://www.computerworld.com/article/2517774/security0/pdf-exploits-explode–continue-climb-in-2010.html.
7 http://www.kaspersky.com/se/images/KESB_Whitepaper_KSN_ENG_final.pdf.
8 http://searchsecurity.techtarget.com/definition/sandbox.
9 http://www.csoonline.com/article/2133359/malware-cybercrime/cyberattack-highlights-software-update-problem-in-large-organizations.html.

PDF file structure
A PDF is a formatting language first conceived by John Warnock, one of the founders of Adobe Systems.10 The first version, version 1.0, was introduced in 1993, and the most current version, XI11 (11.0.10), was released on December 9, 2014. The PDF specification is publicly available,12 and thus can be used by anyone. In addition to simple text, a PDF can include images and other multimedia elements; a PDF can be password protected, execute JavaScript, and more. Likewise, the PDF is supported in all of the prominent operating systems for the PC and mobile platforms (e.g., Microsoft Windows, Linux, MacOS, Android, Windows Phone, and iOS). A PDF file is a set of interconnected objects built hierarchically. The PDF file structure is depicted in Fig. 1 and is comprised of four basic parts according to the Adobe PDF Reference13 [4], [5], and [6]:

1. Objects—basic elements in a PDF file:

• Indirect objects—objects referenced by a number
• Direct objects—objects that are not referenced by a number
• Object types:

–– Boolean—for true or false values
–– Numeric:
Integer value
Real value
–– String:
Literal string—a sequence of literal characters enclosed with ( ) brackets
Hexadecimal string—a sequence of hexadecimal numbers enclosed with < > brackets
–– Name—a sequence of 8-bit characters starting with /
–– Null—an empty object represented by the keyword 'null'
–– Array—an ordered sequence of objects enclosed with [] brackets that can be composed of any combination of object types, including another array
–– Dictionary—an unordered sequence of key-value pairs: keys being names which should be unique in the dictionary, and values being any object type. Most of the indirect objects in a PDF document are dictionaries, and dictionaries are enclosed with ≪ ≫ brackets.
–– Stream—a special dictionary object followed by a sequence of bytes enclosed with the keywords "stream" and "endstream." The information inside the stream may be compressed or encrypted by a filter. The prefix dictionary contains information on whether and how to decode the stream. Streams are used to store data such as images, text, script code, etc.

Fig. 1 Simple PDF file structure

2. File structure—defines how the objects are accessed and how they are updated. A valid PDF file consists of the following four parts:

1. Header—the first line of a PDF file which specifies the version number of the PDF specification which the document uses. Header format is "%PDF-[version number]."
2. Body—the largest section of the file which contains all the PDF objects. The body is used to hold all of the document's data that is shown to the user.

10 http://partners.adobe.com/public/developer/tips/topic_tip31.html.
11 http://get.adobe.com/reader/.
12 http://www.adobe.com/devnet/pdf/pdf_reference.html.
13 http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.

3. Cross reference—a table that includes the posi- attacks related to PDF files are conducted using JavaScript tion of every indirect object in the memory and code embedded inside a PDF. allows random access to objects in the file, so JavaScript code obfuscation is legitimately used to the application does not need to read the entire prevent reverse engineering of proprietary applications. file to locate a particular object. Each object is However, it is also used by attackers to conceal malicious represented by one entry in the table which is JavaScript code and prevent it from being recognized by always 20 bytes long. signature based detectors or detectors based on lexical 4. Trailer—provides relevant information about analysis of code (such as [7]), and to reduce readability by how the application reading the file should a human security analyst. find the cross reference table and other special Filters are used in PDFs to compress data in order to objects. The trailer also contains information enhance compactness or facilitate encoding. By using about the number of revisions made to the doc- filters, an attacker can conceal the malicious JavaScript ument. All PDF readers should begin reading a code, rendering it undetectable or unreadable. Multi- file from this section. ple filters can be applied to a stream, one after the other. 3. Document structure—defines how objects are logi- The filter names used must be declared in the stream cally and hierarchically organized to reflect the con- dictionary. Available filters and their primary purposes tent of a PDF file. Each page in the document is rep- are discussed by P. Baccas and J. Kittilsen [10], [11]. resented by a page object which is a dictionary that Table 1 summarizes various code obfuscation techniques includes references to the page’s content. The root employed by attackers [4]. object of the hierarchy is the catalog object which is also a dictionary. Embedded file attack 4. Content streams—objects that contain instructions A PDF file can contain other file types within it, including which define the appearance of the page. HTML, JavaScript, SWF, XLSX, EXE, Microsoft Office files, or even another PDF file. An attacker can use this Techniques and possible attacks via PDF files functionality in order to embed a malicious file inside a JavaScript code attacks benign file. This way, the attacker can leverage the vul- PDF files can contain client-side JavaScript code for legiti- nerabilities of other file types in order to perform mali- mate purposes including: 3D content, form validation, cious activity, such as exploiting the vulnerability of a and calculations. JavaScript code can reside on a local Flash file embedded within a PDF file. CVE-2010-3654 is host or remote server, and can be retrieved using/URI or/ an example of arbitrary code execution that can be GoTo commands [7]. The main indicator for JavaScript achieved by embedding a specially crafted Flash movie code embedded in a PDF file is the presence of the/JS key- file (.SWF) into a PDF file. The embedded file can be word in some dictionaries (as previously explained in the opened when the PDF file is opened using embedded previous subsection II-A). However, an object containing JavaScript code or by other techniques such as PDF com- such a dictionary can be placed in a filtered stream. 
The/ mands (e.g.,/Launch or/Action/EmbeddedFiles) which JS keyword will not be visible in plain text and therefore are usually combined with sophisticated social engineer- may obstruct detection techniques that rely on these key- ing techniques.15 Usually, embedded malicious files are words [8]. The primary goal of the malicious JavaScript obfuscated in order to avoid detection. Adobe Reader code inside a PDF file is to exploit a vulnerability in the PDF viewer versions 9.3.3 and above restrict file formats PDF viewer in order to divert the normal execution flow that can be opened and do not allow the launching of an to the embedded malicious JavaScript code. This can be embedded executable file such as *.EXE, *.BAT, or *.CMD achieved by performing a heap spraying14 or buffer over- because of its blacklist which is based on file extension. flow attack implemented through JavaScript. Another However, Python code (*.PY) is not blacklisted and can malicious activity that can be carried out using JavaScript perform malicious activities when executed.16 is downloading an executable file from the Internet which initiates an attack on the victim’s machine once executed. Form submission and URI attacks Alternatively, JavaScript code can also open a malicious In 2013, Valentin Hamon [12] presented practical tech- website that can perform a variety of malicious operations niques that can be used by attackers to execute malicious targeting the victim’s machine. According to [9], most code from a PDF file. Adobe Reader supports the option

15 http://www.zdnet.com/article/hacker-finds-a-way-to-exploit-pdf-files- 14 Heap Spraying A technique used in exploits to assist random code execu- without-a-vulnerability/. tion. 16 http://www.decalage.info/file_formats_security/pdf. Nissim et al. Secur Inform (2016) 5:1 Page 5 of 20

Table 1 Code obfuscation techniques in PDF files that can be used by an attacker

Obfuscation technique Details

Separating malicious code over multiple objects Malicious code is spread among multiple objects. Code chunks are collected and merged and com- piled to form a malicious piece of code only during runtime. This makes it difficult for static analysis detectors to recognize the malicious code Applying filters Filters are used to conceal malicious code White space randomization Random white spaces are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. White spaces do not affect the code since JavaScript ignores them Comment randomization Random comments are inserted in the malicious code in order to evade recognition by signature based maliciousness detectors. Comments do not affect the code since JavaScript ignores them Variable name randomization Changing the variable’s name randomly in order to fool signature based maliciousness detectors Integer obfuscation Representing numbers in a different way which can be used to hide a specific memory address String obfuscation Making changes to string in order to make it difficult for a human analyst to understand the code (e.g., by splitting string into several substrings) Function name obfuscation Hiding the name of the function used which can provide a clue about the code’s intention. This is done by creating a pointer with a random name, pointing to the required function Advanced code obfuscation String can hold encrypted malicious code. The decryption process takes place during runtime, just before usage. Metadata fields and even the document’s words can also be used to store malicious code Block randomization Changing the syntax of the code but not its action Dead code Inserting blocks of code that are not intended to be executed Pointless code Inserting blocks of code that do not perform anything of submitting the PDF form from a client to a specific involves launching a malicious PDF file from a benign server using the/SubmitForm command. Several file for- one. Other attacks described in the paper include an mats can be used for that purpose, one of which is the attack in which a benign PDF simply submits its form to forms data format (FDF), the default format based on the attacker’s PHP web server. The server searches the XML. Adobe generates an FDF file from a PDF in order submitted form header for the Adobe Reader version to send the data to a specified URL. If the URL belongs to using regular expressions. Once the server identifies the a remote web server, it is able to respond. Responses are user’s Adobe Reader version, it sends back a malicious temporarily stored in the %APPData % directory which PDF that exploits a vulnerability corresponding to that automatically pops up in the default web browser. An version. The returned PDF is automatically launched. attack can be performed by a simple request to a mali- Another attack described is the Big Brother attack. When cious website that will automatically pop up on the web the user opens a PDF, it automatically downloads a mali- browser, and the malicious website can exploit a vulnera- cious executable file using the web browser (URI bility in the user’s web browser. The author also showed address). 
This attack can be performed prior to the previ- that security mechanisms such as the protected mode of ously described form submission attack, in order to make Adobe Reader X or the URL Security Zone Manager of changes to registry keys that configure the security the Internet Explorer web browser can be disabled easily mechanisms discussed above. by changing the corresponding registry key, a change that can be implemented with user privileges. However it Related work should be noted that vulnerabilities17 that target Adobe Existing anti-virus software is not adequately effective Reader X and XI have been released18 and are being against unknown non-executable malicious PDF files exploited as well. Moreover, a URI19 address can be used [13]. Advanced methods for the detection of new or (instead of a URL) to refer to any file type located unknown malicious PDF files are based primarily on clas- remotely (both executable and non-executable files, sifiers induced by ML algorithms. These methods lever- including *.EXE and *.PDF). One interesting scenario age information from features extracted from labeled PDF files using static or dynamic analysis. Static analysis 17 http://www.cvedetails.com. methods examine and evaluate a suspicious file solely by 18 https://blogs.mcafee.com/mcafee-labs/analyzing-the-first-rop-only- analyzing its code, without actually executing it. Alterna- sandbox-escaping-pdf-exploit. tively, dynamic analysis methods execute the file, usually 19 URI According to RFC2396, URI is a compact string of characters used in an isolated environment (sandbox), and examine its for identifying an abstract or physical resource. It is an extension of URL actions and behavior during runtime. used for identifying any web resource (not limited to web pages). Nissim et al. Secur Inform (2016) 5:1 Page 6 of 20

In recent years, the need to enhance security in the PDF files according to the set of embedded keywords face of attacks based on malicious PDF files has led to and their frequencies. The files are characterized by the increased research in this area. Most of the academic presence of specific chosen keywords, for example:/JS,/ work focuses on static analysis, since it requires less com- JavaScript,/Encrypt, obj, stream, filter, etc. The main putational resources and is much faster than dynamic contribution is the ability to detect malicious PDF files analysis. Static analysis methods usually examine and whether or not they contain JavaScript code, unlike pre- inspect the document’s embedded JavaScript code, file viously described tools which detect malicious files only structure, or metadata (such as the number of specific if they contain JavaScript code (e.g., PJScan [7]). How- streams, objects, keywords, etc.). However, static analy- ever, an attacker can learn which keywords characterize sis has drawbacks as well, including the inability to detect benign files and inject these keywords inside a malicious well obfuscated code that acts maliciously during runt- file in order to bypass PDFMS, thus demonstrating the ime, in contrast to dynamic analysis that will likely detect tool’s weakness. that code. Here we present the most relevant studies, Smutz and Stavrou [16] presented PDFRate, a frame- however a more comprehensive analysis of related work work for the detection of malicious PDF files which is can be found in our recent survey study [4]. based on meta-features extracted from a document’s Srndic and Laskov [7] introduced PJScan,20 a static content. The extracted features include the number of analysis and anomaly detection tool for the detection of font objects, average length of stream objects, and the malicious JavaScript code inside a PDF file. After the number of lower case characters in the title. In total, JavaScript code has been found and extracted, a lexical 202 features were chosen for classification. analysis is applied using a JavaScript interpreter. Lexical Pareek and Eswari [17] introduced two different analysis represents the JavaScript code as a sequence of static analysis methods that do not rely upon a PDF tokens. For example, left parenthesis, plus, right paren- parser to extract data from the file. Alternatively, the thesis, etc. By using these tokens PJScan tries to induce methods apply the analysis on the whole file, after its learning detection models that differentiate between content is converted to hexadecimal dumps or byte benign and malicious PDF files. Liu et al. [14] combined sequences. The first method is based on entropy and both static and dynamic analysis to detect potential is used to measure the uncertainty or randomness in infection attempts in the context of JavaScript execution. a given dataset, assuming the level of uncertainty of First, they extract a set of static features, and then they a malicious file should be less than that of a benign insert context monitoring code into a PDF document, a file with a similar format. Low entropy in a file is not code that later cooperates with the runtime monitor used a strong indicator of maliciousness, however it can be for the detection task. Additional work done by Corona a useful detection feature in combination with other et al. [15] in which they presented the Luxor system features. 
The second method leverages the n-gram which applied this combination of static and dynamic approach [18] in order to distinguish between benign analysis as well. Their idea involved translating the JavaS- and malicious PDF files. cript code into an API reference pattern, and accumulat- Srndic and Laskov [6] introduced a method that makes ing the times of presences for every API reference. By use of essential differences in the structural properties doing this they tried to find a discriminative set of refer- of malicious and benign PDF files. Their static method ences that characterizes malware code in order to auto- evaluates documents based on side effects of malicious matically differentiate this code from benign code within content within their structure by evaluating their struc- PDF files. tural paths and the frequencies of these paths. This detec- Yatamanu et al. [9] introduced two different static tion model was found to be effective at differentiating methods for clustering PDF files based on tokenization between malicious and benign PDF files and was effective of their embedded JavaScript. The article focuses on against new unknown malicious files that were created establishing a clustering method for the identification of 2 months after the classification model was built. similar scripts that have been obfuscated using different The following dynamic analysis methods focus on the techniques. For each examined PDF file, a fingerprint is analysis of embedded JavaScript code (which may or may created, consisting of a set of unique JavaScript tokens not reside in a PDF file) during runtime. All of these and their frequencies. methods include a dynamic step, either in the feature Maiorka et al. [8] introduced the PDF Malware Slayer extraction or analysis phase. MDScan [13] and PDF Scru- (PDFMS), a static analysis tool which characterizes tinizer [19] start with a static extraction of the embedded JavaScript code from a PDF file and then execute the extracted code using a JavaScript engine. The extracted 20 The source code of PJScan and its underlying library (libPDFJS) can be found at http://sourceforge.net/p/pjscan/home/Home/. JavaScript code is examined during runtime using Nissim et al. Secur Inform (2016) 5:1 Page 7 of 20

heuristics in order to detect suspicious or malicious examples needed to maintain the performance of an activity. Snow et al. [20] presented ShellOS, a framework ML classifier. Unlike random (or passive) learning, in for the detection of code injection attacks based on code which a classifier randomly selects examples from analysis during runtime. Lu et al. [21] introduced which to learn, the ML based classifier actively indicates MPScan, a technique that integrates static malware the specific examples which are commonly the most detection and dynamic JavaScript de-obfuscation. informative examples for the training task and should MPScan hooks21 the Adobe Reader’s native JavaScript be labeled. SVM-Simple-Margin [24] is a current AL engine; thus embedded codes (JavaScript source and method considered in our experiments. Active learn- operational code) can be extracted during execution and ing was successfully used to enhance the detection of evaluated by the static detection module. unknown computer worms [25] and malicious execut- While the presented dynamic analysis methods execute able files targeting the Windows OS [26]. Such methods the JavaScript code in order to detect malicious behav- are used in the current study in order to enhance the ior, they differ in the way that they extract the JavaS- detection of malicious PDF files. cript code. Generally speaking, the more exhaustive the extraction of JavaScript code is, the better the dynamic Suggested framework and methods analysis can be in terms of detection ability. Nevertheless, A framework for improving detection capabilities an attempt to extract JavaScript code from a file statically Figure 2 illustrates the framework and the process of may fail and result in the extraction of JavaScript code detecting and acquiring new malicious PDF files by main- that does not accurately represent the behavior of the taining the updatability of the anti-virus and detection file. The reasons may be varied; for example, this could model. In order to maximize the suggested framework’s be due to cases in which the code can be well obfuscated contribution, it should be deployed in strategic nodes or located in irregular locations within the PDF file. (such as ISPs and gateways of large organizations) over Dynamic extraction is more robust compared to static the Internet network in an attempt to expand its expo- extraction in these areas. sure to as many new files as possible. Widespread deploy- It is important to clarify that the academic solutions in ment will result in a scenario in which almost every new this category do not perform a dynamic analysis on the file goes through the framework. If the file is informative entire file, rather dynamic analysis is only performed on enough or is assessed as likely being malicious, it will be the JavaScript code that was extracted from the PDF file. acquired for manual analysis. As Fig. 2 shows, the PDF This is in contrast to some commercial solutions that files transported over the Internet are collected and scru- execute the PDF file and examine its behavior and influ- tinized within our framework {1}. Then, the “known files ence on the host operating system during runtime (full module” filters all the known benign and malicious PDF dynamic analysis). Full dynamic analysis consumes signif- files {2} (according to a white list, reputation systems icantly more resources than the dynamic analysis found [27], and the anti-virus signature repository). 
in academic solutions but is probably better at detecting Then, only the unknown PDF files from the previous malicious PDF files, because it provides a better indica- step are checked for their compatibility with PDF speci- tion of the file’s real purpose. Moreover, the full dynamic fications (explained in “Dataset” Section that describes analysis approach is robust against code obfuscation, the dataset collection) {3}. The incompatible PDF files are since it does not analyze string code or pretend to extract immediately blocked (since only compatible files are rel- the code from the file. This approach is most like the evant for organizations and individual users). Note that examination of suspicious code by a security expert with the compatibility check is extremely important since it forensic tools. Wepawet [22], an example of a well-known blocks many of the malicious files (discussed in “Dataset and widely used full-dynamic analysis tool, was pre- collection” section) from being introduced to the next sented in 2010 by Cova et al. In-Nimbo Sandboxing sys- step which is more time consuming and, in some cases, tem for PDF files, presented by Maas et al. [23], is based might even require manual analysis done by a human on conducting vulnerable or malicious computations on expert. The remaining PDF files which are compatible a virtual machine instances in a remote cloud computing and unknown are then introduced to the next step. In environment, and by so doing, preventing the ability of the advanced analysis step {4}, these compatible files are malware to affect the host system. transformed into vector form for the advanced check. In Studies in several domains have successfully applied vector form the files are represented as vectors of the fre- AL in order to improve the efficiency of labeling quencies of structural paths of the PDF files (as will be explained in the following sub-section). The detection 21 Hooking A technique for intercepting functions calls, messages, or events model within this check scrutinizes the PDF files and passed between software components in order to alter an application or operating system behavior. provides two values for each file: a classification decision Nissim et al. Secur Inform (2016) 5:1 Page 8 of 20

Fig. 2 The process of maintaining the updatability of the anti-virus tool using AL methods

using the SVM (Support Vector Machine) classification strategy that selects informative files, both malicious algorithm and a distance calculation from the separat- and benign, that are a short distance from the separating ing hyperplane using Eq. 1. The AL method acquires files hyperplane. which are found to be informative and sends these files The second type of informative files includes those that to an expert for manual analysis and labeling {5}. These lie deep inside the malicious side of the SVM margin and labeled informative files are being added to the training are a maximal distance from the separating hyperplane set which is used to induce new and updated detection according to Eq. 1. These PDF files will be acquired by model. By acquiring these informative PDF files, we aim the exploitation method (described later) and are also to frequently update the anti-virus software by focusing a maximal distance from the labeled files. This distance the expert’s efforts on labeling PDF files that are most is measured by the KFF calculation that will be further likely to be malware or benign PDF files that are expected explained as well. These informative files are then added to improve the detection model. Recall that informative to the training set {6} for updating and retraining the files are those files that when added to the training set detection model {7}. The files that were labeled as mali- improve the detection model’s predictive capabilities and cious are also added to the anti-virus signature reposi- enrich the anti-virus signature repository. Accordingly, in tory in order to enrich and maintain its updatability {8}. our context there are two types of files that may be con- Updating the signature repository also requires an update sidered informative. The first type includes files in which to clients utilizing the anti-virus application. The frame- the classifier has limited confidence as to their classifica- work integrates two main phases: training and detection/ tion (the probability that they are malicious is very close updating. to the probability that they are benign). Acquiring them as labeled examples will probably improve the model’s Training detection capabilities. In practical terms, these PDF files A detection model is trained over an initial training set will have new structural paths or special combinations of that includes both malicious and benign PDF files. After existing structural paths that represent their execution the model is tested over a stream that consists only code (inside the binary code of the executable). There- of unknown files that were presented to it on the first fore these files will probably lie inside the SVM margin day, the initial performance of the detection model is and consequently will be acquired by the SVM-Margin evaluated. Nissim et al. Secur Inform (2016) 5:1 Page 9 of 20

Detection and updating on the hierarchical structural path feature extraction For every unknown PDF file in the stream, the detection method presented by Šrndic and Laskov [6]. Instead model provides a classification, while the AL method of analyzing JavaScript code or any other content, this provides a rank representing how informative the file is. approach makes use of essential differences in the struc- The framework will consider acquiring the files based on tural properties of malicious and benign PDF files. It these two factors. After being selected and receiving their parses the PDF files and extracts their structural paths true labels from the expert, the informative PDF files are which are the paths in the files’ hierarchical object tree acquired by the training set, and the signature reposi- that characterize a document’s structure. Each structural tory is updated as well (in cases which the files are mali- path is analogous to a set of relations between the objects cious). The detection model is retrained over the updated within the PDF file. For example, the “…/JS” path means and extended training set which now also includes the that the PDF file contains JavaScript code. The structural acquired examples that are regarded as being very inform- paths represent the file’s properties and possible actions, ative. At the end of the current day the updated model acting like the file’s genes rather than its current behavior receives a new stream of unknown files on which the which can be postponed or delayed according to specific updated model is once again tested and from which the conditions. Their detection model was based on these updated model acquires additional informative files. Note structural paths and was found to be very effective at that the motive is to acquire as many malicious PDF files differentiating between malicious and benign PDF files, as possible, since such information will maximally update even against new unknown malicious files that were cre- the anti-virus tools that protect most organizations. ated two months after the classification model was built. We employed the SVM classification algorithm using Šrndic and Laskov reported on high TPR and low FPR. the radial basis function (RBF) kernel in a supervised The method was tested against previous detectors: MDS- learning approach. We used the SVM algorithm for the can [13], PJScan [7], ShellOS [20], and PDFMS [8], and following reasons: (1) SVM has been successfully used to the comparison clearly demonstrated the efficiency and detect worms [25, 28], classify malware into species, and resilience of this method in the detection of new mali- detect zero day attacks [29]; (2) the trained SVM classi- cious PDF files over other methods. fier is a black-box that is hard for an attacker to under- Figure 3 provides a simple example of the conversion stand [28]; (3) SVM has proven to be very efficient when of a PDF file into a set of structural paths. The PDF code combined with AL methods [26]; and (4) SVM is known is treated as a tree of objects. Note that only paths of the for its ability to handle large numbers of features which leaves in the structural tree are counted. makes it suitable for handling the number of structural When an attacker injects malicious content into the paths extracted from the PDF files [18]. In our experi- PDF file, the file structure inevitably changes. 
Thus, this ments we used Lib-SVM implementation [30] which also approach can easily discriminate between benign and supports multiclass classification. malicious files. This approach has several advantages. First, it is not affected by code obfuscation, filtering, and Detection of malicious PDF files using structural paths other encryption methods used for hiding and conceal- In order to detect and acquire unknown malicious PDF ing malicious code in the PDF file, since it doesn’t actu- files, we implemented a static analysis approach based ally analyze the embedded JavaScript code. Second, it is

Fig. 3 Example of the conversion of a PDF file to a set of structural paths Nissim et al. Secur Inform (2016) 5:1 Page 10 of 20

robust towards mimicry and reverse mimicry attacks [6]. achieved, because it is based on a rough approximation Finally, it is very fast and lightweight, since the analysis is and relies on assumptions that the VS is fairly symmet- done statically and does not require execution of the PDF ric and the hyperplane’s Normal (W) is centrally placed, file. Because of this, analysis is conducted quite quickly at assumptions that have been shown to fail significantly the rate of 28 ms for an average file [6]. [31]. The method may query instances whose hyper- plane does not intersect the VS and therefore may not Selective sampling and active learning methods be informative. The SVM-Margin method for detecting Since our framework aims to provide solutions to real instances of PC malware was used by Moskovitch et al. problems it must be based on a sophisticated, fast selec- [32, 33] whose preliminary results found that the method tive sampling method which also efficiently identifies also assisted in updating the detection model but not the informative files. We compared our proposed AL meth- anti-virus application itself; however, in this study the ods to other selective sampling methods, and all of these method was only used for a one day-long trial. We com- methods are described below. pare its performance to our proposed AL methods over a longer period, in a daily process of detection and acqui- Random selection (random) sition experiments that better reflect reality. This serves While random selection is obviously not an AL method, as our baseline AL method, and we expect our method it is at the “lower bound” of the selection methods dis- to improve the new malicious PDF detection and acquisi- cussed. We are unaware of an anti-virus tool that uses tion seen in SVM-Margin. an AL method for maintaining and improving its updat- ability. Consequently, we expect that all AL methods will Exploitation: our proposed active learning method perform better than a selection process based on the ran- Our method, “Exploitation,” is based on SVM classi- dom acquisition of files. fier principles and is oriented towards selecting exam- ples most likely to be malicious that lie furthest from the The SVM‑Simple‑Margin AL method (SVM‑Margin) separating hyperplane. Thus, our method supports the The SVM-Simple-Margin method [24] (referred to as goal of boosting the signature repository of the anti-virus SVM-Margin) is directly related to the SVM classifier. tool by acquiring as much new malware as possible. For Using a kernel function, the SVM implicitly projects every file X that is suspected of being malicious, Exploi- the training examples into a different (usually a higher tation rates its distance from the separating hyperplane dimensional) feature space denoted by F. In this space using Eq. 1 based on the Normal of the separating hyper- there is a set of hypotheses that are consistent with the plane of the SVM classifier that serves as the detection training set, and these hypotheses create a linear sepa- model. As explained above, the separating hyperplane of ration of the training set. From among the consistent the SVM is represented by W, which is the Normal of the hypotheses, referred to as the version-space (VS), the separating hyperplane and is actually a linear combina- SVM identifies the best hypothesis with the maximum tion of the most important examples (supporting vec- margin. 
To achieve a situation where the VS contains tors), multiplied by LaGrange multipliers (alphas) and by the most accurate and consistent hypothesis, the SVM- the kernel function K that assists in achieving linear sepa- Margin method selects examples from the pool of unla- ration in higher dimensions. Accordingly, the distance beled examples reducing the number of hypotheses. This in Eq. 1 is simply calculated between example X and the method is based on simple heuristics that depend on the Normal (W) presented in Eq. 2. relationship between the VS and the SVM with the maxi- 1 mum margin, because calculating the VS is complex and Dist(X) = αiyiK(xix) (1) impractical where large datasets are concerned. Exam- n  ples that lie closest to the separating hyperplane (inside  the margin) are more likely to be informative and new to n the classifier, and these examples are selected for labeling = w αiyi�(xi) (2) and acquisition. 1 This method, in contrast to ours, selects examples according to their distance from the separating hyper- In Fig. 4 the files that were acquired (marked with a plane only to explore and acquire the informative files red circle) are those files that are classified as malicious without relation to their classified labels, i.e., not specifi- and are at the maximum distance from the separating cally focusing on malware instances. The SVM-Margin hyperplane. Acquiring several new malicious files that method is very fast and can be applied to real problems, are very similar and belong to the same virus family is yet SVM-Margin’s authors [24] indicate that this agility is considered a waste of manual analysis resources, since Nissim et al. Secur Inform (2016) 5:1 Page 11 of 20

was found to be benign. With this method the distance calculation required for each instance is fast and equiva- lent to the time it takes to classify an instance in a SVM classifier, thus it is applicable for products working in real-time.

Combination: a combined active learning method The “combination” method lies between SVM-Margin and exploitation. On the one hand, the combination method begins by acquiring examples based on SVM- Margin criteria in order to acquire the most informa- tive PDF files, acquiring both malicious and benign files. This exploration phase is important in order to enable the detection model to discriminate between malicious Fig. 4 The criteria by which exploitation acquires new unknown and benign PDF files. On the other hand, the Combina- malicious PDF files. These files lie the farthest from the hyperplane and are regarded as representative files tion method then tries to maximally update the signa- ture repository in an exploitation phase, drawing on the exploitation method. This means that in the early acquisi- tion period, during the first part of the day, SVM-Margin these files will probably be detected by the same signa- leads, compared to exploitation. As the day progresses, ture. Thus, acquiring one representative file for this set of Exploitation becomes predominant. However, the new malicious files will serve the goal of efficiently updat- Combination method, which consists of first conduct- ing the signature repository. In order to enhance the ing exploration by SVM-Margin and then conducting signature repository as much as possible, we also check Exploitation, was also applied in the course of the 10 days the similarity between the selected files using the kernel of the experiment, and over a period of days, combina- farthest-first (KFF) method suggested by Baram et al. tion will perform more exploitation than SVM-Margin. [34] which enables us to avoid acquiring examples that This means that on the ith day there is more exploitation are quite similar. Consequently, only the representative than on the (i–1)th day. We defined and tracked several files that are most likely malicious are selected. In cases configurations over the course of several days. Regard- in which the representative file is detected as malware as ing the relation between SVM-Margin and Exploitation, a result of the manual analysis, all its variants that weren’t we found that a balanced division performs better than acquired will be detected when the anti-virus is updated. other divisions (i.e., during the first half of the study, the In cases in which these files are not actually variants of method will acquire more files using SVM-Margin, while the same malware, they will be acquired the following during the second half of the study, exploitation takes day (after the detection model has been updated), as the leading role in the acquisition of files). In short, this long as they are still most likely to be malware. In Fig. 4 method tries to take the best from both of the previous it can be observed that there are sets of relatively similar methods. files (based on their distance in the kernel space), how- Note that our combination AL method actually ever, only the representative files that are most likely to repeats in cycles of X days according to the configu- be malware are acquired. The SVM classifier defines the rations of the administrator, so that in each cycle, the class margins using a small set of supporting vectors combination method starts with more explorative (i.e., PDF files). While the usual goal is to improve clas- phase (SVM-Margin) in which both informative mali- sification by uncovering (labeling) files from the margin cious and benign PDF files are being acquired. 
Then, area, our primary goal is to acquire malware in order combination conducts more acquisition according to to update the anti-virus. In contrast to SVM-Margin the exploitation AL method in order to acquire more which explores examples that lie inside the SVM mar- PDF malware and enrich the antivirus signature reposi- gin, Exploitation explores the “malicious side” to discover tory. In our experiments we conducted only one cycle new and unknown malicious files that are essential for of 10 days experiment, yet on a long term, this strategy the frequent update of the anti-virus signature repository, will work on a continuous manner in cycles of 10 days, a process which occasionally also results in the discov- in which every cycle will be both consisted of explora- ery of benign files (files which will likely become support tion (SVM-Margin) and exploitation. The administrator vectors and update the classifier). Figure 4 presents an have the privilege to decide what will be the length of example of a file lying far inside the malicious side that the cycles. Nissim et al. Secur Inform (2016) 5:1 Page 12 of 20

Table 2 Our collected dataset categorized as malicious, benign, and incompatible PDF files (bracketed)

Dataset source Year Malicious files Benign files

VirusTotala repository 2012–2014 17,596 (1017) – Srndic and Laskov [6] 2012 27,757 (437) – Contagio project 2010 410 (175) – Internet and Ben-Gurion University (random selection) 2013–2014 0 5145 Total 45,763 (1629) 5145 a https://www.virustotal.com/

Evaluation user, and an error message will likely be displayed on the In this section we present the dataset on which we based screen when an attempt is made to open the file. How- our experiments and elaborate on the composition of ever in such cases involving incompatible malicious PDF the files within the dataset. We also discuss our insights files, the malicious operations will still be executed. A regarding the compatibility of the PDF files. At the con- common incompatibility observed was located at the end clusion of this section we present our experimental of the file between the “startxref” and “ %EOF” lines. This design which was aimed at comprehensively evaluating line should contain a number serving as a reference (off- our framework. set) to where the last cross reference table section is located in the file. In cases of incompatibility, the number Dataset that appears is incorrect. Dataset collection We would like to reiterate that incompatible files are We created a collection of benign and malicious PDF unreadable, because once they are opened they either files. We acquired a total of 50,908 PDF files consisting of cause the PDF reader to crash or generate an error mes- 45,763 malicious and 5145 benign files. The malicious sage sent to the user regarding the corruption of the PDF files aimed at compromising the Windows operating file. However, in cases in which the file is malicious, the system, the most commonly attacked system used by malicious operation is likely to occur whether or not the organizations. The files were collected during the years file has been presented the PDF reader. Regardless as to 2012–2014 from the four sources presented in Table 2. whether the incompatible file is benign or malicious, the All the files were verified to be labeled correctly as mali- file cannot be viewed by the user. Thus, there is no reason cious or benign using Kaspersky22 anti-virus. to transfer such files to the user, and we suggest that they The benign files were collected from Ben-Gurion Uni- get blocked. We observed that 96.5 % of the malicious versity, a very diverse organization that generates an files within our large and representative data collection of enormous number of PDF files from diverse sources, more than 45,000 malicious PDF files were incompatible; including many different academic and administra- based on this, we can therefore assume that 96.5 % of all tive departments. Although the entire benign collection malicious PDF files are incompatible. Because of this, originated from one institution, it should be noted that when one encounters an incompatible PDF file, there is a it is actually quite varied and contains PDF files that are strong chance that it is, in fact, a malicious PDF file. substantially different from one another in their content, Our proposed framework uses these observations to elements, and purpose that were randomly selected from flag files that can initially be blocked from the network the following sources within the university: academic of an organization or private computer, since they cannot papers, automatically generated reports, student assign- be opened correctly; therefore our ML based framework ments, previous exams, administrative forms (also with was applied only to the compatible files (6774). Table 2 active functionality), architectural renderings and plan- presents the number of compatible files in each of our ning documentation, brochures, and newsletters. collections (bracketed). 
In our recent studies [4], [35] we thoroughly investi- Figure 5 presents a general breakdown of the threats gated PDF files and found that most malicious PDF files raised by the compatible malicious PDF files within our (96.5 %) are not compatible with the PDF file format dataset. specifications (checked with the Adobe PDF Reference23). As can be seen, 3 % of the 1629 malicious files were These files cannot be viewed by the PDF reader or the classified as containing a Trojan which is a malicious pro- gram that when executed performs covert actions that

22 http://www.kaspersky.com. have not been permitted by the user. 97 % of the mali- 23 http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/ cious files were identified as exploiting some vulnerability pdf_reference_1-7.pdf. in one or more PDF readers. More specifically, they Nissim et al. Secur Inform (2016) 5:1 Page 13 of 20

capabilities reported upon now. We were unable to find any reliable source that published an accurate percentage of malicious PDF files on the web. Therefore we will rely upon the results of our recent study [36], and when the exact percentage is determined, we will adjust both the training and test sets to that percentage. In addition, in Table 4 we also included the percentage of malicious PDF files in previous and relevant studies in the malicious PDF detection domain. We have done so in order to com- pare the way this was handled in previous studies and to demonstrate that none of the previous studies addressed this issue or seemed to adhere to a figure representing the actual percentage of malicious PDF files on the web. As can be seen, the average percentage used within these studies was 38 % which is very high and not likely rea- sonable. It is important to understand that many of the malicious files on the web haven’t been discovered yet, Fig. 5 Breakdown of the threats identified among the malicious PDF since anti-virus tools are very limited in their detection files found in our dataset capabilities and discover only relatively similar variants of known malwares; in addition, it takes time for anti-virus vendors to discover new 0-day malicious files. Therefore our percentage (24 %) likely represents the upper rea- exploited 23 unique vulnerabilities, including the follow- sonable boundary of the actual percentage of malicious 24 ing common vulnerability and exposures (CVE ): CVE- PDF files. Moreover, in our previous study [26], aimed at 2007-3845, CVE-2008-2551, CVE-2009-0927, the detection of malicious executables using ML and AL CVE-2010-0188, and CVE-2013-0640. The four digits methods, we estimated the malicious executables per- after the “CVE-“represent the year the vulnerability was centage to be approximately 9 %. In that study we there- discovered. As can be seen, our dataset contains mali- fore adjusted our dataset to 9 % malicious files, and our cious files that exploit cross-time vulnerabilities. It is methods worked very well, providing high true positive interesting to note that despite the fact that our dataset rates (TPR) and low false positive rate (FPR) rates. Thus, consists of malicious PDF files that were created from we assume that these methods will be effective in the cur- 2012 to 2014, the files also exploited much older vulnera- rent study as well. bilities, including some discovered during 2007–2011. This indicates that attackers are aware of the fact that Dataset creation many organizations are not up-to-date with the newest As was explained previously, in order to detect and versions of software and thus are exposed to older vul- acquire unknown malicious PDF files, we implemented a nerabilities already discovered, as well as new unknown static analysis approach based on the hierarchical struc- vulnerabilities. In Table 3 we provide brief details regard- tural path feature extraction method presented by Šrndic ing the most commonly exploited vulnerabilities men- and Laskov [6]. Instead of analyzing JavaScript code or tioned above. The table indicates the percentage each any other content, this approach makes use of essential vulnerability represents of the total dataset. differences in the structural properties of malicious and The percent of malicious PDF files in the final dataset benign PDF files. Using the PdfFileAnalyzer25 parser we after removing incompatible files is 24 %. 
This percentage parsed the compatible PDF files (both malicious and is likely higher than the actual percentage of malicious benign, totaling 6774) and extracted all 7963 unique PDF files found on the web. According to our recent paths that were found within our dataset. Each of these study [36] that dealt with the issue of imbalanced datasets paths was used as feature. We didn’t apply a feature selec- in the cyber-security domain, the best detection perfor- tion method, since this number of features can easily be mance is achieved when the percentage of malicious PDF handled by the SVM classifier used in our experiments. files in the training set is equal to the percentage in the Each PDF file was represented as a vector of Boolean fea- test set. We have adhered to this guideline, thus the per- tures so that the presence (1) or absence (0) of a centage should not affect the detection and updatability

25 http://www.codeproject.com/Articles/450254/PDF-File-Analyzer-With- 24 https://cve.mitre.org/. Csharp-Parsing-Classes-Vers. Nissim et al. Secur Inform (2016) 5:1 Page 14 of 20

Table 3 List of the most exploited vulnerabilities in our dataset

The CVE Description Percentage

CVE-2007-3845a The version of some PDF readers was found to allow remote attackers to execute arbitrary commands via certain vectors 49 associated with launching malicious code based on the file extension at the end of the URI CVE-2008-2551b The DownloaderActiveX Control in Icona SpA C6 Messenger allows remote attackers to force the download and execu- 4 tion of arbitrary files via a URL in the prop download url parameter with the propPost download action parameter set to “run” CVE-2009-0927c Stack-based buffer overflow in some adobe reader versions allows remote attackers to execute malicious code via a 5 crafted argument to the ‘getIcon’ method of a ‘Collab’ object. This executed code can exfiltrate sensitive data to a remote server where it can download and execute dangerous payload to the host CVE-2010-0188d Unspecified vulnerability in adobe reader and acrobat allows attackers to cause a denial of service (application crash) or 6 possibly execute malicious code via unknown vectors CVE-2013-0640e Adobe reader and acrobat versions allow remote attackers to execute malicious code or cause a denial of service 32 (memory corruption) a http://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2007-3845 = b http://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2008-2551 = c http://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2009-0927 = d https://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2010-0188 = e https://cve.mitre.org/cgi-bin/cvename.cgi?name CVE-2013-0640 =

Table 4 Percentage of malicious PDF files in previous studies

Study Year Core of the study Malicious files percentage

[13] 2011 Dynamic analysis of embedded JavaScript code 9 [7] 2011 Static lexical analysis of embedded JavaScript code 94 [20] 2011 Dynamic analysis of embedded JavaScript code 69 [9] 2012 Static tokenization of embedded JavaScript code 43 [19] 2012 Dynamic analysis of embedded JavaScript code 65 [16] 2012 Static analysis of meta-features 5 [8] 2012 Static analysis of embedded keywords 53 [17] 2013 Static analysis using entropy and n-gram 58 [21] 2013 Dynamic analysis of embedded JavaScript code 29 [6] 2013 Static analysis of structural features 12 Average 38 structural path within a PDF file is represented by 1 or 0 Over a 10-day period, we compared PDF file acquisition respectively. based on AL methods to random selection based on the performance of the detection model. In our acquisition Experimental design experiments we used 6774 compatible PDF files (5145 The objective in our main experiment was to evaluate benign and 1629 malicious) in our repository and created and compare the performance of our new AL methods 10 randomly selected datasets as each one of them con- to the existing selection method, SVM-Margin, on two tains 10 subsets of 620 files representing each day’s stream tasks: of new files. The 574 remaining files were used by the ini- tial training set to induce the initial model. Note that each –– Acquiring as many new unknown malicious PDF files day’s stream contained 620 PDF files. At first we induced as possible on a daily basis in order to efficiently enrich the initial model by training it over the 574 known PDF the signature repository of the anti-virus files. We then tested it on the first day’s stream. Next, from –– Updating the predictive capabilities of the detection this same stream, the selective sampling method selected model that serves as the knowledge store of AL meth- the most informative PDF files according to that method’s ods and improving its ability to efficiently identify the criteria. The informative files were sent to an expert who most informative new malicious PDF files labeled them. The files were later acquired by the training Nissim et al. Secur Inform (2016) 5:1 Page 15 of 20

set which was enriched with an additional X new inform- We now present the results of the core measure in this ative files. When a file was found to be malicious, it was study, the number of new unknown malicious files that immediately used to update the signature repository of were discovered and finally acquired into the training the anti-virus, and an update was also distributed to cli- set and signature repository of the anti-virus software. ents. The process was repeated over the next 9 days. The As explained above, each day the framework deals with performance of the detection model was averaged for 10 620 new PDF files consisting of about 160 new unknown runs over the 10 different datasets that were created. Each malicious PDF files. Statistically, the more files that are selective sampling method was checked separately on 20 selected daily, the more malicious files will be acquired different acts of file acquisition (each consisting of a dif- daily. Yet, using AL methods, we tried to improve the ferent number of PDF files). This means that for each act number of malicious files acquired by means of existing of acquisition, the methods were restricted to acquiring solutions. More specifically, using our methods (exploi- a predefined number of files equal to the amounts that tation and combination) we also sought to improve the followed, denoted as X: 10 files, 20 files, and so on (with number of files acquired by SVM-Margin. gaps of 10 files), until 160 files, and subsequently at 200, Figure 6 presents the number of malicious PDF files 250, 300 and 350 files. We also considered the acquisition obtained by acquiring 160 files daily by each of the four of all the files in the daily stream (referred to as the ALL methods during the course of the 10-day experiment. method), which represents an ideal, but impractical way, Exploitation and combination outperformed the other of acquiring all the new files, and more specifically, all the selection methods. Exploitation was the only one that malicious PDF files. showed an increasing trend from the first day, while The experiment’s steps are summarized as follows combination showed a decrease in the second day and then exhibited an increasing trend on the following days. 1. Inducing the initial detection model from the initial Thus, from days four to 10, it performed like exploitation, available training set, i.e., training set available up to and during these days they both outperformed the other day d (the initial training set includes 574 PDF files) methods (SVM-Margin and random selection). Exploita- 2. Evaluating the detection model on the stream of day tion was successful at acquiring the maximal number of (d + 1) to measure its initial performance malwares from the 160 files acquired daily and to sup- 3. Introduction of the stream of day (d + 1) to the selec- port the results presented in Fig. 6 and our claim, we tive sampling method which chooses the X most performed a single-factor Anova statistical test on the informative files according to its criteria and sends acquisition rates achieved by the exploitation and combi- them to the expert for manual analysis and labeling nation methods. The Anova tests between the AL meth- 4. 
Acquiring the informative files and adding them to ods provided P values lower than the 5 % (alpha) of the the training set, as well as using their extracted signa- significance level; thus the difference between the exploi- ture to update the anti-virus signature repository tation and combination are statistically significant, con- 5. Inducing an updated detection model from the firming that exploitation AL method performed better updated training set and applying the updated model than combination in terms of number of malicious PDF on the stream of the next day (d + 2) files acquired.

This process repeats itself from the first day until the tenth day.

Results We rigorously evaluated the efficiency and effectiveness of our framework, comparing four selective sampling methods: (1) a well-known existing AL method, termed SVM-Simple-Margin (SVM-Margin) and based on [24]; our proposed methods (2) exploitation and (3) combi- nation; and (4) random-selection (random) as a “lower bound”. Each method was checked for all 20 acquisition amounts and the results were the mean of 10 differ- ent folds. Due to space limitations we have depicted the results of the most representative acquisition amount of 160 PDF files which is equal to the maximal mean num- Fig. 6 The number of malicious PDF files acquired by the framework ber of PDF files found in the daily stream. for different methods with acquisition of 160 files daily Nissim et al. Secur Inform (2016) 5:1 Page 16 of 20

On the first day, the number of new malicious PDF files is 128, since the initial detection model was trained on an initial set of 574 labeled PDF files that contained 128 malwares. We decided to induce the initial detection model on 574 files in order to have a stable detection model with sufficient detection performance from the start (92.5 % TPR on the first day) that could still be improved through our AL based framework.

The superiority of exploitation over combination can be observed on the second and third days. During these 2 days, combination conducts more exploration with regard to the acquisition of informative PDF files (both malicious and benign), and therefore acquired far fewer malicious PDF files than the exploitation method, which is oriented towards the acquisition of malicious PDF files. On the second day, exploitation acquired 10 times more PDF malware than combination (140 compared to 14 malicious PDF files), and on the third day exploitation acquired 1.6 times more malicious PDF files than combination (141 compared to 87). From the fourth day, combination performed as well as exploitation with regard to malicious PDF file acquisition, since the exploitation phase of combination became more dominant as the days progressed.

On the tenth day, using combination and exploitation, 93.75 and 92.5 % of the acquired files were malicious (150 and 148 files out of 160); using SVM-Margin, only 13.5 % of the acquired files were malicious (22 files out of 160, which is even less than random). This represents a significant improvement of almost 80 % in unknown malware acquisition. Note that on the tenth day, using random, only 25 % of the acquired PDF files were malicious (40 files out of 160). This is far less than the malware acquisition rates achieved by both combination and exploitation. The trend is very clear from the second day, at which point combination and exploitation each acquired more malicious files than the day before, a finding that demonstrates the impact of updating the detection model, identifying new malwares, and enriching the signature repository of the anti-virus. Moreover, because they are different, the acquired malwares are also expected to be of higher quality in terms of their contribution to the detection model and signature repository.

As far as we could observe, the random selection trend was constant, with no improvement in acquisition capabilities over the 10 days. While the SVM-Margin method showed a decrease in the number of malwares acquired through the fifth day, it showed a very negligible improvement from the sixth day. It can be seen that the performance of our methods was much closer to the ALL line, which represents the maximum number of malicious PDF files that can be acquired each day. This phenomenon can be explained by looking at the ways in which the methods work. SVM-Margin acquires examples about which the detection model is less confident. Consequently, they are considered to be more informative but not necessarily malicious. As was explained previously, SVM-Margin selects new informative PDF files inside the margin of the SVM. Over time, and with the improvement of the detection model towards more malicious files, it seems that the malicious files become less informative (due to the fact that malware writers frequently try to use upgraded variants of previous malwares). Since these new malwares might not lie inside the margin, SVM-Margin may actually be acquiring informative benign, rather than malicious, files. However, our methods, combination and exploitation, are more oriented toward acquiring the most informative files and most likely malware by obtaining informative PDF files from the malicious side of the SVM margin. As a result, an increasing number of new malwares are acquired; in addition, if an acquired benign file lies deep within the malicious side, it is still informative and can be used for learning purposes and to improve the next day's detection capabilities.

Our AL methods outperformed the SVM-Margin method and improved the capabilities of acquiring new PDF malwares and enriching the signature repository of the anti-virus software. In addition, compared to SVM-Margin, our methods also maintained the predictive performance of the detection model that serves as the knowledge store of the acquisition process.

Figure 7 presents the TPR levels and their trends during the course of the 10-day study. SVM-Margin outperformed the other selection methods in the TPR measure, while our AL methods, combination and exploitation, came close to SVM-Margin (SVM-Margin achieved 1 % better TPR rates than combination and 2 % better than exploitation) and performed better than random. To support the results presented in Fig. 7 and our claim, we performed several single-factor ANOVA statistical tests on the TPR rates achieved by the three AL methods and random. The ANOVA test between the three AL methods provided P values higher than the 5 % (alpha) significance level; thus the difference between the AL methods is not statistically significant, confirming that our AL methods performed as well as the exploration method in terms of predictive capabilities. The ANOVA tests between the AL methods and random provided P values lower than the 5 % (alpha) significance level; thus the difference between the AL methods and random is statistically significant, confirming that our AL methods also performed better than random in terms of predictive capabilities. In addition, the performance of the detection model improves as more files are acquired daily, so that on the tenth day of the experiment, the results indicate that by acquiring only a small and well selected set of informative files (25 % of the stream), the detection model can achieve TPR levels (97.7 % with SVM-Margin, 96.7 % with combination, and 96.05 % with exploitation) that are quite close to those achieved by acquiring the whole stream (98.4 %).

Fig. 7 The TPR of the framework over the period of 10 days for different methods through the acquisition of 160 PDF files daily

Figure 8 presents the FPR levels of the four acquisition methods. As can be observed, the FPR rates were low and quite similar among the AL methods. A similar decreasing FPR trend began to emerge on the second day. This decrease indicates an improvement in the detection capabilities and the contribution of the AL methods, in contrast to the increase in FPR rates observed for random from the fifth day. Random had the highest FPR over the course of the 10-day period, which indicates its inability to select the more informative files over the 10 days. Apart from the second day, combination and exploitation achieved quite similar FPR rates, which were slightly higher than those of SVM-Margin. To support the results presented in Fig. 8 and our claim, we performed several single-factor ANOVA statistical tests on the FPR rates achieved by the three AL methods and random. The ANOVA test between the three AL methods provided P values higher than the 5 % (alpha) significance level; thus the difference between the AL methods is not statistically significant, confirming that our AL methods performed as well as the exploration method in terms of the false positives of the predictive capabilities. The ANOVA tests between the AL methods and random provided P values lower than the 5 % (alpha) significance level; thus the difference between the AL methods and random is statistically significant, confirming that our AL methods also performed better than random in terms of the false positives of the predictive capabilities.

On the tenth day, combination and exploitation had an FPR of 0.1 %, while SVM-Margin had an FPR of 0.05 %. This indicates that our methods, exploitation and combination, performed as well as the SVM-Margin method with regard to predictive capabilities (TPR and FPR) but much better than SVM-Margin in acquiring a large number of new PDF malwares daily and in enriching the signature repository of the anti-virus. On each day of the acquisition iteration, we evaluated the learned classifier, and the FPR is presented for the 10-day period. A set of new unknown files, both malicious and benign, is presented to the classifier each day; thus the FPR does not constantly decrease, which would occur if the classifier were tested on the same files each day.

We now compare the detection rate of our framework with the leading anti-virus tools (those with the top TPR rates) commonly used by organizations, utilizing the malicious PDF files within our experimental dataset. We used VirusTotal (http://www.virustotal.com), an anti-virus service that includes many different anti-virus tools, in order to evaluate the detection rates. Figure 9 shows that the most accurate anti-virus, AntiVir, had a detection rate of 92.5 %, while our methods outperformed all of the anti-viruses in the task of detecting new malicious PDF files. Using our full framework, including the SVM based detection model, the AL methods, and the enhancement process, we achieved a TPR of 97.7 % using only 25 % of labeled data daily (160 PDF files out of 620), which means a reduction of 75 % in the security experts' efforts.
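As a minimal illustration of the daily evaluation, the TPR and FPR for one day's stream can be computed from the classifier's predictions, for example with scikit-learn's confusion matrix; this is a sketch, not the authors' evaluation code.

```python
# Minimal sketch: TPR and FPR for one day's stream from true labels and predictions.
# Labels are 1 (malicious) / 0 (benign). Not the authors' evaluation code.
from sklearn.metrics import confusion_matrix

def daily_tpr_fpr(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of malicious files detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of benign files falsely flagged
    return tpr, fpr

# Toy example with invented values:
print(daily_tpr_fpr([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```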
In a nutshell, our AL based framework was able to better induce an updated detection model on a daily basis, outperforming all the anti-virus tools and managing to accomplish this using only a fraction of the new PDF files (25 %)—the most informative portion, consisting of the most valuable information required for updating the knowledge stores of the detection model and anti-virus tools.

Fig. 8 The FPR trends of the framework for different methods based on acquiring 160 PDF files daily

These results demonstrate the efficiency of the framework in maintaining and improving the updatability of the detection model and, ultimately, of the anti-virus tool. These factors also demonstrate the benefits obtained by performing this process on a daily basis; these benefits will likely include economic rewards as well.

Discussion and conclusion
We presented a framework for efficiently updating anti-virus tools with unknown malicious PDF files. With our updated classifier, we can better detect new malicious PDF files, which can then be utilized to maintain an anti-virus tool. Due to the mass creation of new files, especially new malicious PDF files, both the anti-virus and the detection model (classifier) must be updated with new and labeled PDF files. Such labeling is done manually by human experts; thus the goal of the AL component is to focus expert efforts on labeling PDF files that are more likely to be malicious or on PDF files that might add new information about benign files. Our proposed framework is based on our AL methods (exploitation and combination), which were specially designed for acquiring unknown malware. The framework seeks to acquire the most informative PDF files, benign and malicious, in order to improve classifier performance, enabling it to frequently discover new unknown malware and enrich the signature repository of anti-virus tools.

In general, three of the AL methods performed very well at updating the detection model, with our methods, combination and exploitation, outperforming SVM-Margin in the main goal of the study, which is the acquisition of new unknown malicious PDF files. The evaluation of the classifier before and after the daily acquisition showed an improvement in the detection rate, and subsequently more new malwares were acquired. On the tenth day, combination acquired almost seven times more PDF malwares (150) than the number acquired by SVM-Margin (22 PDF malwares) and almost four times more PDF malwares than those acquired by the random method (40 PDF malwares). While our combination and exploitation methods showed an increasing trend in the number of PDF malwares acquired over the course of the 10 days, SVM-Margin showed unstable and poor performance, and random was consistent. Therefore our framework was found to be effective at updating the anti-virus software by acquiring the maximum number of malicious PDF files. It also maintained a well updated model aimed at the daily detection of new and unknown malicious PDF files.

Fig. 9 The TPR of the framework compared to anti-viruses commonly used by organizations

One of our authors is a security expert who works as a virus analyst at one of the known anti-virus companies. According to his experience, it requires up to 30 min for a virus analyst to determine whether a file is malicious or benign using both static and dynamic tools. Therefore, our approach was aimed at focusing the experts' efforts on the most informative files. The manual labeling effort equals the number of files acquired daily multiplied by the analysis time (30 min per file). This means that if 160 files were acquired per day, 80 h would be required each day to analyze all the files. This is the equivalent of ten security experts, a reasonable number of analysts for an anti-virus company. Anti-virus companies can use this framework, adjust it to suit their needs and resources, and thus acquire the desired number of files for analysis.

It is also worth noting some of the advantages of our AL methods compared to SVM-Margin and passive learning. First, using our AL methods it is possible to acquire fewer files while achieving nearly the same detection capabilities as other methods; the other methods must acquire a larger number of files to achieve poorer results. Second, our AL methods acquired a greater number of malicious PDF files daily than the alternative solutions; in so doing, updating the anti-virus software became more efficient.

Our framework is currently based on a feature extractor tailored to PDF files (structural paths, as previously discussed), and consequently our framework is limited to providing updating solutions and detection capabilities for attacks that affect the structural paths within PDF files. Implementing and integrating more feature extractors within the framework will result in more robust detection and updatability capabilities. An additional limitation is the fact that the framework can only provide solutions for PDF files; however, there are many other widely used document types, such as Microsoft Office files (e.g., *.docx, *.xlsx, *.pptx, *.rtf), that have become popular means for launching cyber-attacks aimed at compromising organizations. These types of files are substantially different from PDF files, and thus the framework must be adapted to cope with the challenges they pose.

In future work, in addition to other types of malicious documents, we are interested in extending this framework to Android applications. Due to their resource limitations, mobile devices are extremely dependent on anti-virus solutions that are frequently and efficiently updated. Quite possibly our suggested AL framework could address this problem.
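The analyst-effort estimate given above (30 minutes of manual analysis per acquired file) reduces to simple arithmetic; the short sketch below only restates that calculation, with an assumed 8-hour working day per analyst (an assumption not stated in the text).

```python
# Back-of-the-envelope labeling effort, restating the estimate in the text.
# Assumption (not stated in the text): one analyst works an 8-hour day.
files_per_day = 160
minutes_per_file = 30
hours_per_analyst_day = 8

total_hours = files_per_day * minutes_per_file / 60    # 80 hours of manual analysis per day
analysts_needed = total_hours / hours_per_analyst_day  # about 10 analysts
print(total_hours, analysts_needed)                    # 80.0 10.0
```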

Authors' contributions
NN initiated and led the study and was the main author of the manuscript. His parts included determining the problem, designing the solution and empirical experiments, analyzing the results, and providing the insights. He especially focused on developing the AL framework and AL methods applied in the study. AC focused on the related work done in the malicious PDF detection domain and its background, provided comprehensive explanations regarding the PDF file's structure and the attacks that can be carried out through it, participated in the collection of the PDF files (malicious and benign), and revised the manuscript. RM participated in shaping the AL framework, as well as revising the manuscript. AS participated in the design of the experiments and in the collection of the malicious PDF files, as well as revising the manuscript. ME participated in the collection of the malicious PDF files and in extracting the structural features from the PDF files, and helped to draft the manuscript. OB participated in the collection of the malicious PDF files, provided insights regarding efficient extraction of the features, and participated in revising the manuscript. YE participated in determining the problem, designing the solution and experiments, and provided insights and significant information regarding the need for such a framework in AV companies. All authors read and approved the final manuscript.

Author details
1 Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beersheba, Israel. 2 Department of Biomedical Informatics, Columbia University, New York, USA.

Acknowledgements
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. Thanks to Šrndic and Laskov from the University of Tübingen for sharing their malicious PDF dataset with us. We would like to thank VirusTotal for granting us access to their private services.

Competing interests
The authors declare that they have no competing interests.

Received: 5 March 2015 Accepted: 19 January 2016

References
1. Christodorescu M, Jha S (2004) Testing malware detectors. ACM SIGSOFT Softw Eng Notes 29(4):34–44
2. White SR, Swimmer M, Pring EJ, Arnold WC, Chess DM, Morar JF (1999) Anatomy of a commercial-grade immune system. IBM Research White Paper
3. Kiem H, Thuy NT, Quang TMN (2004) A machine learning approach to anti-virus system. Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, 61–65, 4–7 December, Hanoi, Vietnam
4. Nissim N, Cohen A, Glezer C, Elovici Y (2015) Detection of malicious PDF files and directions for enhancements: a state-of-the-art survey. Comput Secur 48:246–266
5. Maiorca D, Corona I, Giacinto G (2013) Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security
6. Šrndic N, Laskov P (2013) Detection of malicious PDF files based on hierarchical document structure. 20th Annual Network and Distributed System Security Symposium
7. Laskov P, Šrndić N (2011) Static detection of malicious JavaScript-bearing PDF documents. 27th Annual Computer Security Applications Conference
8. Maiorca D, Giacinto G, Corona I (2012) A pattern recognition system for malicious PDF files detection. In: Machine Learning and Data Mining in Pattern Recognition
9. Vatamanu C, Gavriluţ D, Benchea R (2012) A practical approach on clustering malicious PDF documents. J Comput Virol 8(4):151–163
10. Baccas P (2010) Finding rules for heuristic detection of malicious PDFs: with analysis of embedded exploit code
11. Kittilsen J. Detecting malicious PDF documents
12. Hamon V (2013) Malicious URI resolving in PDF documents. J Comput Virol Hacking Tech 9(2):65–76
13. Tzermias Z, Sykiotakis G, Polychronakis M, Markatos EP (2011) Combining static and dynamic analysis for the detection of malicious documents. 4th European Workshop on System Security
14. Liu D, Wang H, Stavrou A (2014) Detecting malicious JavaScript in PDF through document instrumentation. In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE, 100–111
15. Corona I, Maiorca D, Ariu D, Giacinto G (2014) Lux0R: detection of malicious PDF-embedded JavaScript code through discriminant analysis of API references. In: Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop (AISec '14), ACM, New York, NY, USA, 47–57. doi:10.1145/2666652.2666657
16. Smutz C, Stavrou A (2012) Malicious PDF detection using metadata and structural features. 28th Annual Computer Security Applications Conference
17. Pareek H et al (2013) Entropy and n-gram analysis of malicious PDF documents. Int J Eng Res Tech 2(2)
18. Joachims T (1999) Making large scale SVM learning practical
19. Schmitt F, Gassen J, Gerhards-Padilla E (2012) PDF Scrutinizer: detecting JavaScript-based attacks in PDF documents. Tenth Annual International Conference on Privacy, Security and Trust (PST)
20. Snow KZ, Krishnan S, Monrose F, Provos N (2011) SHELLOS: enabling fast detection and forensic analysis of code injection attacks. USENIX Security Symposium
21. Lu X, Zhuge J, Wang R, Cao Y, Chen Y (2013) De-obfuscation and detection of malicious PDF files with high accuracy. 46th Hawaii International Conference on System Sciences (HICSS)
22. Cova M, Kruegel C, Vigna G (2010) Detection and analysis of drive-by-download attacks and malicious JavaScript code. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), ACM, New York, NY, USA, 281–290. doi:10.1145/1772690.1772720
23. Maass M, Scherlis WL, Aldrich J (2014) In-nimbo sandboxing. In: Proceedings of the 2014 Symposium and Bootcamp on the Science of Security, ACM
24. Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66
25. Nissim N, Moskovitch R, Rokach L, Elovici Y (2012) Detecting unknown computer worm activity via support vector machines and active learning. Pattern Anal Appl 15(4):459–475
26. Nissim N, Moskovitch R, Rokach L, Elovici Y (2014) Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Syst Appl. http://dx.doi.org/10.1016/j.eswa.2014.02.053
27. Jnanamurthy HK, Warty C, Singh S (2013) Threat analysis and malicious user detection in reputation systems using mean bisector analysis and cosine similarity (MBACS)
28. Wang X, Yu W, Champion A, Fu X, Xuan D (2007) Detecting worms via mining dynamic program execution. Third International Conference on Security and Privacy in Communication Networks and the Workshops, SecureComm, 412–421
29. Chen Z, Roussopoulos M, Liang Z, Zhang Y, Chen Z, Delis A (2012) Malware characteristics and threats on the internet ecosystem. J Syst Softw 85(7):1650–1672
30. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2(3):27
31. Herbrich R, Graepel T, Campbell C (2001) Bayes point machines. J Mach Learn Res 1:245–279
32. Moskovitch R, Nissim N, Elovici Y (2009) Malicious code detection using active learning. In: Privacy, Security, and Trust in KDD. Springer, Berlin, Heidelberg, 74–91
33. Moskovitch R, Nissim N, Elovici Y (2007) Malicious code detection and acquisition using active learning. IEEE International Conference on Intelligence and Security Informatics (IEEE ISI-2007), Rutgers University, New Jersey, USA
34. Baram Y, El-Yaniv R, Luz K (2004) Online choice of active learning algorithms. J Mach Learn Res 5:255–291
35. Nissim N, Cohen A, Moskovitch R, Barad O, Edry M, Shabtai A, Elovici Y (2014) ALPD: active learning framework for enhancing the detection of malicious PDF files aimed at organizations. IEEE Joint Intelligence and Security Informatics Conference (JISIC)
36. Moskovitch R, Stopel D, Feher C, Nissim N, Japkowicz N, Elovici Y (2009) Unknown malcode detection and the imbalance problem. J Comput Virol 5(4):295–308

2014 IEEE Joint Intelligence and Security Informatics Conference

ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files

Nir Nissim, Aviad Cohen, Robert Moskovitch, Assaf Shabtai, Mattan Edry, Oren Bar-Ad, Yuval Elovici. Department of Information Systems Engineering Ben-Gurion University of the Negev Beer-Sheva, Israel 84105 P.O.B 653 Abstract—Email communication carrying malicious attachments Additionally, these days most users consider non-executables or links is often used as an attack vector for initial penetration of safer and behave less suspiciously toward such files. However, the targeted organization. Existing defense solutions prevent non-executable files are as dangerous as executable files, executables from entering organizational networks via emails, because their readers may contain vulnerabilities that can be therefore recent attacks tend to use non-executable files such as exploited maliciously by attackers. F-Secure’s 2008-2009 PDF. Machine learning algorithms have recently been applied for report4 indicates that the most popular file types for targeted detecting malicious PDF files. These techniques, however, lack an essential element-- they cannot be updated daily. In this study we attack in 2008-2009 were PDF and Microsoft Office files. present ALPD, a framework that is based on active learning Since that time, the number of attacks on Adobe Reader has methods that are specially designed to efficiently assist anti-virus grown. vendors to focus their analytical efforts. This is done by Thus, in order to effectively analyze tens of thousands of new, identifying and acquiring new PDF files that are most likely potentially malicious PDF files on a daily basis, anti-virus malicious, as well as informative benign PDF documents. These vendors have integrated a component of a detection model files are used for retraining and enhancing the knowledge stores. based on machine learning and rule-based algorithms [27] into Evaluation results show that in the final day of the experiment, the core of their signature repository update activities. Combination, one of our AL methods, outperformed all the However, these solutions are often ineffective, because their others, enriching the anti-virus's signature repository with almost seven times more new PDF malware while also improving the knowledge base is not adequately updated. This occurs detection model's performance on a daily basis. because many new, potentially malicious PDF files are created every day, and only a limited number of security researchers — Keywords Active Learning; Machine Learning; PDF; Malware; tasked with manual labeling them. Thus, the problem lies in I. INTRODUCTION prioritizing which PDF files should be acquired, analyzed, and labeled by a human expert. In this study we present ALPD – Since 2009, cyber-attacks against organizations have an active learning based framework for frequently updating increased, and 91% of all organizations were hit by cyber- anti-virus software with PDF files. 1 attacks in 2013 . The vast majority of organizations rely ALPD focuses on improving anti-virus frameworks by heavily on email for internal and external communication, and labeling those PDF files (potentially malicious or informative email has become a very attractive platform from which to benign files) that are most likely to improve the detection initiate cyber-attacks against organizations. 
model's performance and, in so doing, enriching the signature Attackers usually use social engineering in order to encourage repository with as many new PDF malware files as possible, recipients to press a link or open a malicious web page or further enhancing the detection process. Specifically, the 2 attachment. According to Trend Micro , attacks, particularly ALPD framework favors files that contain new content. We those against government agencies and large corporations, are focus on PCs, the platform most used by organizations, and almost entirely dependent upon spear-Phishing emails. An mobile platforms are outside the scope of this study. incident in 2014 aimed at the Israeli Ministry of Defense (IMOD) provides an example of a new type of targeted cyber- II. BACKGROUND attack involving non-executable files attached to an email. 3 Adobe Reader version X, Protected Mode Adobe Reader According to media reports , the attackers posed as IMOD (PMAR), was released in 2011, featuring a new sandbox representatives and sent email messages with a malicious PDF isolated environment aimed at preventing malicious code file attachment which when opened installed a Trojan horse actions from affecting the operating system. Organizations are enabling the attacker to control the computer. This type of not always equipped with the newest version of PDF readers, attack has grown in popularity, in part because executable files leaving them exposed to many types of attacks. For example, a attached to emails are filtered by most email servers. malicious PDF file can run code embedded in the PDF file

(usually JavaScript) or retrieved from a URI that exploits a vulnerability in the PDF viewer to divert the normal execution flow to the malicious code; a PDF file can contain other file

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber-attacks-in-2013/
2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://www.israeldefense.co.il/?CategoryID=512&ArticleID=5766
4 http://www.f-secure.com/weblog/archives/00001676.html

978-1-4799-6364-5/14 $31.00 © 2014 IEEE 91 DOI 10.1109/JISIC.2014.23 types, such as HTML, JavaScript, SWF, XLSX, EXE, or even II. THE SUGGESTED FRAMEWORK AND METHODS another PDF file, which can be used to embed malicious files A. A Framework for Improving Detection Capabilities that are often obfuscated and can be opened without alerting the user; and filters used in PDFs to compress data in order to Figure 1 illustrates the framework and the process of detecting reduce file size or for encoding, can be used by attackers to and acquiring new malicious PDF files by maintaining the conceal malicious content [7] [8]. Recently, Maiorca et al.[4] updatability of the anti-virus and detection model. In order to presented a practical novel evasion technique called "reverse maximize the suggested framework’s contribution, it should mimicry" that was designed to evade state-of-the-art malicious be deployed in strategic nodes (such as ISPs and gateways of PDF detectors based on their logical structure [5]. large organizations) over the Internet network in an attempt to Existing anti-virus software is not effective enough against expand its exposure to as many new files as possible. new non-executable malicious PDF files which are unknown Widespread deployment will result in a scenario in which [3]. Several significant studies pertaining to malicious PDF almost every new file goes through the framework. If the file detection used machine learning (ML) algorithms based on is informative enough or is assessed as likely being malicious, static and dynamic analysis. Academic studies often focus on it will be acquired for manual analysis. As Figure 1 shows, the static analysis which is faster and requires less computing PDF files transported over the Internet are collected and resources, however this method cannot detect well-obfuscated scrutinized within our framework {1}. Then, the "known files code that acts maliciously during runtime. module" filters all the known benign and malicious PDF files Similar to the detection of unknown malicious executables {2} (according to a white list, reputation systems [25] and using static analysis [2] [16], detection of PDF files is mainly anti-virus signature repository). The unknown PDF files are based on induced classifiers that extract and leverage then checked for their compatibility as viable PDF files. The information used for representing the PDF files. Such incompatible PDF files are immediately blocked from being information may include features extracted from JavaScript transported inside the organizations. Since only compatible code embedded in a document [6] [9]; meta-static features, files are relevant for organizations and innocent users, only [11]; or sets of embedded terms and their frequencies [12]. these files are transformed into vector form for the advanced However, an attacker can learn which keywords characterize check {3}. This check is extremely significant since, as can be benign files and inject these keywords inside a malicious file seen in Table 1, most of the malicious files are incompatible in order to evade detection. Another study [13] suggested (96.5%), thus, the incompatibility of a file is a strong using n-grams and entropy that represent the content of the indication of its maliciousness. In vector form the files are PDF file. 
Srndic and Laskov [5] introduced a high represented as vectors of the frequencies of structural paths of performance static method for the detection of malicious PDF the PDF files (as explained in Section IV). documents which, instead of analyzing JavaScript or any other The remaining PDF files which are compatible and unknown content, makes use of essential differences in the structural are then introduced to the detection model based on SVM and properties of malicious and benign PDF files. The method was AL. The detection model scrutinizes the PDF files and tested against previous detectors: MDScan [3], PJScan [6], provides two values for each file: a classification decision ShellOS [15] and PDFMS [12], and the comparison clearly using the SVM classification algorithm and a distance demonstrated the efficiency and resilience of this method in calculation from the separating hyperplane using Equation 1 the detection of new malicious PDF files. {4}. A file that the AL method recognizes as informative and In dynamic analysis, the file or application is executed and which it has indicated should be acquired is sent to an expert monitored during runtime [1] [17] [19]. Snow et al.[15] who manually analyzes and labels it {5}. By acquiring these presented ShellOS, a framework for the detection of code informative PDF files, we aim to frequently update the anti- injection attacks, based on code analysis during runtime virus software by focusing the expert’s efforts on labeling (dynamic analysis). Additional methods based on both static PDF files that are most likely to be malware or on benign PDF and dynamic analysis are [3] [6] [10] [14]. files that are expected to improve the detection model. Note Studies in several domains have successfully applied active that informative files are defined as those that when added to learning in order to improve the efficiency of labeling the training set improve the detection model's predictive examples needed to maintain the performance of a machine capabilities and enrich the anti-virus signature repository. learning classifier. Unlike random (or passive) learning, in Accordingly, in our context, there are two types of files that which a classifier randomly selects examples from which to may be considered informative. The first type includes files in learn, the classifier actively indicates the specific examples which the classifier has limited confidence as to their which are commonly the most informative examples for the classification (the probability that they are malicious is very training task and should be labeled. SVM-Simple-Margin [18] close to the probability that they may be benign). Acquiring is a current AL method considered in our experiments. them as labeled examples will probably improve the model’s Active learning was successfully used to enhance the detection detection capabilities. In practical terms, these PDF files will of unknown computer worms [17] and of malicious executable have new structural paths or special combinations of existing files targeting the Windows OS [16]. Such methods are used structural paths that represent their execution code (inside the in this study in order to enhance the detection of malicious binary code of the executable). Therefore these files will PDF files. probably lie inside the SVM margin and consequently will be

92 acquired by the SVM-Margin strategy that selects informative since such information will maximally update the anti-virus files, both malicious and benign, that are a short distance from tool that protects most organizations. the separating hyperplane. We employed the SVM classification algorithm using the The second type of informative files includes those that lie radial basis function (RBF) kernel in a supervised learning deep inside the malicious side of the SVM margin and are a approach. We used the SVM algorithm for the following maximal distance from the separating hyperplane according to reasons: 1) SVM has been successfully used to detect worms Equation 1. These PDF files will be acquired by the [17] [19], classify malware into species, and detect zero day Exploitation method (will be farther explained) and are also a attacks [20], 2) the trained SVM classifier is a black-box that maximal distance from the labeled files. This distance is is hard for an attacker to understand [19], 3) SVM has proven measured by the KFF calculation that will be farther explained to be very efficient when combined with AL methods [16], as well. These informative files are then added to the training and 4) SVM is known for its ability to handle large numbers set {6} for updating and retraining the detection model {7}. of features which makes it suitable for handling the large The files that were labeled as malicious are also added to the number of structural paths extracted from the PDF files [21]. anti-virus signature repository in order to enrich and maintain In our experiments we used Lib-SVM implementation [22] its updatability {8}. Updating the signature repository also which also supports multiclass classification. requires an update to clients utilizing the anti-virus B. Selective Sampling and Active Learning Methods application. The framework integrates two main phases: training and detection/updating. Since our framework aims to provide solutions to real problems it must be based on a sophisticated, fast, and selective high-performance sampling method. We compared our proposed AL methods to other strategies, and all the methods considered are described below. 1) Random Selection (Random) While random selection is obviously not an active learning method, it is at the "lower bound" of the selection methods discussed. We are unaware of an anti-virus tool that uses an active learning method for maintaining and improving its updatability. Consequently, we expect that all AL methods will perform better than a selection process based on the random acquisition of files. 2) The SVM-Simple-Margin AL Method (SVM-Margin) The SVM-Simple-Margin method [18] (referred to as SVM- Figure 1: The process of maintaining the updatability of the anti-virus Margin) is directly related to the SVM classifier. Using a tool using AL methods. kernel function, the SVM implicitly projects the training examples into a different (usually a higher dimensional) Training: A detection model is trained over an initial training feature space denoted by F. In this space there is a set of set that includes both malicious and benign PDF files. After hypotheses that are consistent with the training set, and these the model is tested over a stream that consists only of hypotheses create a linear separation of the training set. From unknown files that were presented to it on the first day, the among the consistent hypotheses, referred to as the version- initial performance of the detection model is evaluated. 
space (VS), the SVM identifies the best hypothesis with the Detection and updating: For every unknown PDF file in the maximum margin. To achieve a situation where the VS stream, the detection model provides a classification while the contains the most accurate and consistent hypothesis, the AL method provides a rank representing how informative the SVM-Margin AL method selects examples from the pool of file is, and the framework will consider acquiring the files unlabeled examples reducing the number of hypotheses. This based on this. After being selected and receiving their true method is based on simple heuristics that depend on the labels from the expert, the informative PDF files are acquired relationship between the VS and SVM with the maximum by the training set, and the signature repository is updated as margin because calculating the VS is complex and impractical well, just in case the files are malicious. The detection model where large datasets are concerned. Examples that lie closest is retrained over the updated and extended training set which to the separating hyperplane (inside the margin) are more now also includes the acquired examples that are regarded as likely to be informative and new to the classifier, and these being very informative. At the end of the day, the updated examples are selected for labeling and acquisition. model receives a new stream of unknown files on which the This method, in contrast to ours, selects examples according to updated model is once again tested and from which the their distance from the separating hyperplane only to explore updated model again acquires informative files. Note that the and acquire the informative files without relation to their motive is to acquire as many malicious PDF files as possible classified labels, i.e., not specifically focusing on malware

93 instances. The SVM-Margin AL method is very fast and can malicious are selected. In cases in which the representative file be applied to real problems, yet, as its authors indicate [18], is detected as malware as a result of the manual analysis, all its this agility is achieved because it is based on a rough variants that weren't acquired will be detected the moment the approximation and relies on assumptions that the VS is fairly anti-virus is updated. In cases in which these files are not symmetric and that the hyperplane's Normal (W) is centrally actually variants of the same malware, they will be acquired placed, assumptions that have been shown to fail significantly the following day (after the detection model has been [26]. The method may query instances whose hyperplane does updated), as long as they are still most likely to be malware. In not intersect the VS and therefore may not be informative. The Figure 2 it can be observed that there are sets of relatively SVM-Margin method for detecting instances of PC malware similar files (based on their distance in the kernel space), was used by Moskovitch et al. [23] whose preliminary results however, only the representative files that are most likely to be found that the method also assisted in updating the detection malware are acquired. The SVM classifier defines the class model but not the anti-virus application itself; however, in this margins using a small set of supporting vectors (i.e., PDF study the method was only used for one day-long trial. We files). While the usual goal is to improve classification by compare its performance to our proposed AL methods for a uncovering (labeling) files from the margin area, our main longer period, in a daily process of detection and acquisition goal is to acquire malware in order to update the anti-virus. experiments, as actually happens in reality. This serves as our Contrary to SVM-Margin which explores examples that lie baseline AL method, and we expect our method to improve inside the SVM margin, Exploitation explores the "malicious the new malicious PDF detection and acquisition seen in side" to discover new and unknown malicious files that are SVM-Margin. essential for the frequent update of the anti-virus signature repository, a process which occasionally also results in the 3) Exploitation: Our Proposed Active Learning Method discovery of benign files (files which will likely become Our method, "Exploitation," is based on SVM classifier support vectors and update the classifier). Figure 2 presents an principles and is oriented towards selecting examples most example of a file lying far inside the malicious side that was likely to be malicious that lie furthest from the separating found to be benign. The distance calculation required for each hyperplane. Thus, our method supports the goal of boosting instance in this method is fast and equal to the time it takes to the signature repository of the anti-virus tool by acquiring as classify an instance in a SVM classifier, thus it is applicable much new malware as possible. For every file X that is for products working in real-time. suspected of being malicious, Exploitation rates its distance from the separating hyperplane using Equation 1 based on the Normal of the separating hyperplane of the SVM classifier that serves as the detection model. 
As explained above, the separating hyperplane of the SVM is represented by W, which is the Normal of the separating hyperplane and actually a linear combination of the most important examples (supporting vectors), multiplied by LaGrange multipliers (alphas) and by the kernel function K that assists in achieving linear separation in higher dimensions. Accordingly, the distance in Equation 1 is simply calculated between example X and the Normal (W) presented in Equation 2.

Dist(X) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, X) \quad (1)

w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad (2)
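As an illustration of how the Exploitation criterion of Equation 1 can be realized with an off-the-shelf SVM, the sketch below ranks the unlabeled files by their signed distance from the separating hyperplane (scikit-learn's decision_function, which corresponds to Equation 1 up to the intercept term) and keeps the candidates that lie farthest on the malicious side; a simple pairwise-similarity test stands in for the kernel farthest-first (KFF) check discussed in the text. This is a sketch under these assumptions, not the authors' implementation.

```python
# Sketch of the Exploitation selection criterion (illustrative, not the authors' implementation).
# decision_function gives the signed distance from the separating hyperplane (Equation 1 plus
# the intercept); positive values are assumed here to correspond to the "malicious" class (label 1).
# The similarity test below is a crude stand-in for the KFF check used to skip near-duplicates.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def exploitation_select(model: SVC, X_stream: np.ndarray, budget: int,
                        similarity_threshold: float = 0.95) -> list:
    distances = model.decision_function(X_stream)  # signed distance for every candidate
    order = np.argsort(-distances)                 # farthest on the malicious side first
    chosen: list = []
    for i in order:
        if distances[i] <= 0 or len(chosen) >= budget:  # only the malicious side, up to the budget
            break
        if chosen:  # skip candidates too similar to already chosen files
            if rbf_kernel(X_stream[[i]], X_stream[chosen]).max() > similarity_threshold:
                continue
        chosen.append(int(i))
    return chosen
```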

In Figure 2 the files that were acquired (marked with a red Figure 2: The criteria by which Exploitation acquires new unknown circle) are those files classified as malicious and have malicious PDF files. These files lie the farthest from the hyperplane maximum distance from the separating hyperplane. Acquiring and are regarded as representative files. several new malicious files that are very similar and belong to 4) Combination: A Combined Active Learning Method the same virus family is considered a waste of manual analysis resources since these files will probably be detected by the The "Combination" method lies between SVM-Margin and same signature. Thus, acquiring one representative file for this Exploitation. On the one hand, the combination method begins set of new malicious files will serve the goal of efficiently by acquiring examples based on SVM-Margin criteria in order updating the signature repository. In order to enhance the to acquire the most informative PDF files, acquiring both signature repository as much as possible, we also check the malicious and benign files, and this exploration phase is similarity between the selected files using the kernel farthest- important in order to enable the detection model to first (KFF) method suggested by Baram et al. [24] which discriminate between malicious and benign PDF files. On the enables us to avoid acquiring examples that are quite similar. other hand, the combination method then tries to maximally Consequently, only the representative files that are most likely update the signature repository in an exploitation phase, drawing on the Exploitation method. This means that in the

94 early acquisition period, during the first part of the day, SVM- to a set of relations between the objects within the PDF file. Margin leads, compared to Exploitation. As the day Those paths were found to be very informative and capable of progresses, Exploitation becomes predominant. However, the discriminating between benign and malicious PDF files. In combination was also applied in the course of the 10-day addition, this feature extraction approach is not affected by experiment, and over a period of days, the combination will code obfuscation, filtering, and other encryptions methods. perform more Exploitation than SVM-Margin. This means We parsed and represented the compatible PDF files using the that on the ith day there is more Exploitation than in the (i-1)th PdfFileAnalyzer6 parser. We used all the 7,963 unique paths day. We defined and tracked several configurations over the that were extracted based on our preliminary experiments course of several days. Regarding the relation between SVM- (space constraints prevent further discussion here). Margin and Exploitation, we found that a balanced division B. Experimental Design performs better than other divisions (i.e., during the first half of the study, the method will acquire more files using SVM- The objective in our main experiment was to evaluate and Margin, while during the second half of the study, compare the performance of our new AL methods to the Exploitation takes the leading role in the acquisition of files). existing selection method, SVM-Margin, on two tasks: In short, this method tries to take the best from both of the - Acquiring as many new unknown malicious PDF files as previous methods. possible on a daily basis in order to efficiently enrich the signature repository of the anti-virus. III. EVALUATION - Updating the predictive capabilities of the detection model A. Data Set Creation that serves as the knowledge store of AL methods and We created a dataset of malicious and benign PDF files for the improving its ability to efficiently identify the most Windows OS, the most commonly attacked system used by informative new malicious PDF files. organizations. We acquired a total of 50,908 PDF files, Over a 10-day period, we compared PDF file acquisition including 45,763 malicious and 5,145 benign files, from the based on AL methods to random selection based on the four sources presented in Table 1. The benign files were performance of the detection model. In our acquisition reported to be virus free by Kaspersky anti-virus software. experiments we used 6,774 compatible PDF files (5,145 Many of the malicious files are not compatible with the PDF benign, 1,629 malicious) in our repository and created 10 file format specifications according to the Adobe PDF 5 randomly selected datasets with each dataset containing 10 Reference and cannot be opened by the PDF reader and sub-sets of 620 files representing each day’s stream of new viewed by the user. In cases involving malicious PDF files, files. The 574 remaining files were used by the initial training the malicious operations will be executed. Our proposed set to induce the initial model. Note that each day’s stream framework uses these observations to flag files that can contained 620 PDF files. At first, we induced the initial model initially be blocked from the network of an organization or by training it over the 574 known PDF files. We then tested it private computer, to be marked as suspicious and sent for on the first day’s stream. 
Next, from this same stream, the deeper analysis. A common incompatibility observed was an selective sampling method selected the most informative PDF incorrect cross reference table, the table responsible for the files according to that method’s criteria. The informative files relation between objects within the PDF file. We provided the were sent to an expert who labeled them. The files were later number of compatible files in each of our collected datasets in acquired by the training set which was enriched with an Table 1 (bracketed). The malicious set contains several additional X new informative files. When a file was found to malware families such as viruses, Trojans and backdoors. be malicious, it was immediately used to update the signature Malicious Benign repository of the anti-virus, and an update was also distributed Dataset Source Year files files to clients. The process was repeated over the next nine days. VirusTotal repository 2012-2014 17,596 (1,017) - The performance of the detection model was averaged for 10 Srndic and Laskov [5] 2012 27,757 (437) - runs over the 10 different datasets that were created. Each Contagio project 2010 410 (175) - selective sampling method was checked separately on 20 Internet and Ben-Gurion 2013-2014 0 5,145 different acts of file acquisitions (each consisting of a different University (random selection) Total 45,763 (1,629) 5,145 amount of PDF files). This means that for each act of Table 1: Our collected dataset categorized as malicious, benign and acquisition, the methods were restricted to acquiring a number incompatibles PDF files. of files equal to the amounts that followed, denoted as X: 10 files, 20 files and so on until 160 files had been acquired (with In order to detect and acquire unknown malicious PDF files, gaps of 10 files), 200, 250, 300 and 350. We also considered we implemented a static analysis approach based on the the acquisition of all the files in the daily stream (referred to as hierarchical structural path feature extraction method the ALL method), which represents an ideal, but not a feasible presented by Šrndic and Laskov [5] which parses PDF files way, of acquiring all the new files and more specifically, all and extracts structural paths. Each structural path is analogous the malicious PDF files.
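The structural-path representation described above can be illustrated with a short vectorization sketch. The parsing step itself (done in the study with the PdfFileAnalyzer parser) is abstracted away behind a hypothetical list of extracted paths, and the example paths and the use of scikit-learn's DictVectorizer are illustrative choices rather than the authors' implementation.

```python
# Sketch: turning per-file structural-path counts into fixed-length frequency vectors.
# The path lists below are invented examples; only the vectorization step is shown.
from collections import Counter
from typing import Dict, List
from sklearn.feature_extraction import DictVectorizer

def path_counts(structural_paths: List[str]) -> Dict[str, int]:
    # e.g. ["/Root/Pages/Kids", "/Root/OpenAction/JS", "/Root/Pages/Kids"] -> counts per path
    return dict(Counter(structural_paths))

# Toy example standing in for two parsed PDF files:
docs = [
    path_counts(["/Root/Pages/Kids", "/Root/Pages/Kids", "/Root/Metadata"]),
    path_counts(["/Root/OpenAction/JS", "/Root/Pages/Kids"]),
]
vectorizer = DictVectorizer(sparse=True)  # one column per unique structural path
X = vectorizer.fit_transform(docs)        # frequency matrix used to train the SVM
print(vectorizer.get_feature_names_out())
print(X.toarray())
```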

5 http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
6 http://www.codeproject.com/Articles/450254/PDF-File-Analyzer-With-Csharp-Parsing-Classes-Vers

95 The experiment’s various steps are as follows: method, Combination, succeeded in acquiring the maximal 1. Inducing the initial detection model from the initial number of malwares from the 160 files acquired daily. available training set; i.e., training set available up to day d On the first day, the number of new malicious PDF files is 128 (the initial training set includes 574 PDF files). since the initial detection model was trained on an initial set of 2. Evaluating the detection model on the stream of day (d+1) 574 labeled PDF files that consisted of 128 malwares. We to measure its initial performance. decided on 574 files from which the initial detection model would be induced in order to have a stable detection model 3. Introduction of the stream of day (d+1) to the selective with sufficient detection performance from the start (92.5% sampling method, which chooses the X most informative TPR in the first day) that can still be improved through our files according to its criteria and sends them to the expert for active learning based framework. manual analysis and labeling. 4. Acquiring the informative files and adding them to the training set, as well as using their extracted signature to update the anti-virus signature repository. 5. Inducing an updated detection model from the updated training-set and applying the updated model on the stream of the next day (d+2). This process repeats itself on our dataset from first day until the tenth day.

IV. RESULTS
We rigorously evaluated the efficiency and effectiveness of our framework, comparing four selective sampling methods: (1) a well-known existing AL method, termed SVM-Simple-Margin (SVM-Margin), based on [18]; our proposed methods (2) Exploitation and (3) Combination; and (4) random selection (Random) as a "lower bound". Each method was checked for all 20 acquisition amounts, and the results were the mean of 10 different folds. Due to space limitations we depict the results of the most representative acquisition amount of 160 PDF files, which equals the maximal mean number of PDF files found in the daily stream. We now present the results of the core measure in this study, the number of new unknown malicious files that were discovered and finally acquired into the training set and the signature repository of the anti-virus software. As explained above, each day the framework deals with 620 new PDF files, consisting of about 160 new unknown malicious PDF files. Statistically, the more files that are selected daily, the more malicious files will be acquired daily. Yet, using AL methods, we tried to improve the number of malicious files acquired by means of existing solutions. More specifically, using our methods (Exploitation and Combination) we also sought to improve the number of files acquired by SVM-Margin.
Figure 3 presents the number of malicious PDF files obtained by acquiring 160 files daily, by each of the four methods, during the course of the 10-day experiment. Exploitation and Combination outperformed the other selection methods. Exploitation was the only method that showed an increasing trend from the first day, while Combination had a decrease on the second day and an increasing trend on the following days. Thus, from the fourth day to the tenth it performed like Exploitation, and they both outperformed all the other methods, both SVM-Margin and Random selection. The Combination method succeeded in acquiring the maximal number of malwares from the 160 files acquired daily. On the first day, the number of new malicious PDF files is 128, since the initial detection model was trained on an initial set of 574 labeled PDF files that consisted of 128 malwares. We decided on 574 files from which the initial detection model would be induced in order to have a stable detection model with sufficient detection performance from the start (92.5% TPR on the first day) that can still be improved through our active learning based framework.

Figure 3: The number of malicious PDF files acquired by the framework for different methods with acquisition of 160 files daily.

On the tenth day, using Combination and Exploitation, 93.5% and 92.5% of the acquired files were malicious (150 and 148 files out of 160, respectively); using SVM-Margin, only 13.5% of the acquired files were malicious (22 files out of 160, which is even less than Random). This represents a significant improvement of almost 80% in unknown malware acquisition. Note that on the tenth day, using Random, only 25% of the acquired PDF files were malicious (40 files out of 160). This is far less than the malware acquisition rates achieved by both Combination and Exploitation. The trend is very clear from the second day: each day, Combination and Exploitation acquired more malicious files than the day before, a finding that demonstrates the impact of updating the detection model for identifying new malwares and enriching the signature repository of the anti-virus. Moreover, the acquired malwares are expected to have higher quality in terms of their contribution to the detection model, as well as to the signature repository, since they are different.
As far as we could observe, the random selection trend was constant; there was no improvement in acquisition capabilities over the 10 days. While the SVM-Margin AL method showed a decrease in the number of malwares acquired through the fifth day, it showed a very negligible improvement from the sixth day. It can be seen that the performance of our methods was much closer to the ALL line, which represents the maximum number of malicious PDF files that can be acquired each day. This phenomenon can be explained by looking at the ways in which the methods work.

The SVM-Margin method acquires examples about which the detection model is less confident. Consequently, they are considered to be more informative but not necessarily malicious. As was explained previously, SVM-Margin selects new informative PDF files inside the margin of the SVM. Over time, with the improvement of the detection model towards more malicious files, it seems that the malicious files are less informative (due to the fact that malware writers frequently try to use upgraded variants of previous malwares). Since these new malwares might not lie inside the margin, SVM-Margin may actually be acquiring informative benign, rather than malicious, files. However, our methods, Combination and Exploitation, are more oriented toward acquiring the most informative files and most likely malware by obtaining informative PDF files from the malicious side of the SVM margin. As a result, an increasing number of new malwares are acquired; in addition, if an acquired benign file lies deep within the malicious side, it is still informative and can be used for learning purposes and to improve the next day's detection capabilities.
We have shown here that our AL methods outperformed the SVM-Margin AL method and improved the capabilities for acquiring new PDF malwares and enriching the signature repository of the anti-virus software. In addition, our methods, compared to SVM-Margin, also maintain the predictive performance of the detection model that serves as the knowledge store of the acquisition process. Figure 4 presents the TPR levels and their trends over the 10-day course of the study. SVM-Margin outperformed the other selection methods in the TPR measure, while our AL methods, Combination and Exploitation, came close to SVM-Margin (SVM-Margin achieved 1% better TPR rates than Combination and 2% better than Exploitation) and performed better than Random. In addition, the performance of the detection model improves as more files are acquired daily, so that on the tenth day of the experiment the results indicate that by acquiring only a small and well selected set of informative files (25% of the stream), the detection model can achieve TPR levels (97.7% with SVM-Margin, 96.7% with Combination and 96.05% with Exploitation) that are quite close to those achieved by acquiring the whole stream (98.4%).

Figure 4: The TPR of the framework over the 10 days for different methods through the acquisition of 160 PDF files daily.

These results will probably have economic implications and demonstrate the efficiency of the framework in maintaining and improving the updatability of the detection model and ultimately of the anti-virus tool. These factors demonstrate the benefits obtained by performing this process on a daily basis. Figure 5 presents the FPR levels of the four acquisition methods. As can be observed, the FPR rates were low and quite similar among the AL methods. A similar decreasing FPR trend began to emerge on the second day. This decrease indicates an improvement in the detection capabilities and the contribution of the AL methods, contrary to the increase in FPR rates for Random from the fifth day. Random had the highest FPR over the course of the 10 days, which indicates the selection of minimally informative files that contributed little to the update of the detection model over the 10 days. Apart from the second day, Combination and Exploitation achieved quite similar FPR rates which were slightly higher than SVM-Margin. On the tenth day, Combination and Exploitation had an FPR of 0.1% while SVM-Margin had an FPR of 0.05%. This indicates that our methods, Exploitation and Combination, performed as well as the SVM-Margin method with regard to predictive capabilities (TPR and FPR) but much better than SVM-Margin in acquiring a large number of new PDF malwares daily and in enriching the signature repository of the anti-virus. On each day of the acquisition iteration, we evaluated the learned classifier, and the FPR is presented for the 10-day period. A set of new unknown applications, both malwares and benign, is presented to the classifier each day, thus the FPR does not constantly decrease as would have occurred if the classifier was tested on the same files each day.

Figure 5: The FPR trends of the framework for different methods based on acquiring 160 PDF files daily.
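To make the difference between the acquisition criteria concrete, the sketch below ranks a day's stream by the signed SVM decision value, assuming the positive class denotes malicious. It is a simplified illustration using scikit-learn's decision_function; the even split used for Combination and the handling of duplicates are assumptions, not the paper's exact procedure.

import numpy as np

def select_for_acquisition(svm, X_stream, method, budget=160, seed=0):
    # Signed distance of each file from the separating hyperplane
    # (positive = classified as malicious).
    scores = svm.decision_function(X_stream)
    if method == "svm-margin":
        # Most uncertain files: smallest absolute distance (inside the margin).
        order = np.argsort(np.abs(scores))
    elif method == "exploitation":
        # Files most likely to be malicious: deepest inside the malicious side.
        order = np.argsort(-scores)
    elif method == "combination":
        # Exploration (margin) first, then exploitation; an even split is assumed here.
        half = budget // 2
        merged = np.concatenate([np.argsort(np.abs(scores))[:half], np.argsort(-scores)])
        _, first = np.unique(merged, return_index=True)   # drop duplicates, keep first occurrence
        order = merged[np.sort(first)]
    else:  # "random" lower bound
        order = np.random.default_rng(seed).permutation(len(scores))
    return order[:budget]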

V. DISCUSSION AND CONCLUSION
We presented a framework for efficiently updating anti-virus tools with unknown PDF malware files. With our updated classifier, we can better detect new malicious PDF files that can be utilized for sustaining an anti-virus tool. Both the anti-virus and the detection model (classifier) must be updated with new and labeled PDF files. Such labeling is done manually by human experts, thus the goal of the active learning is to focus expert efforts on labeling files that are more likely to be PDF malware or PDF files that might add new information about benign files. Our proposed framework is based on our active learning methods (Exploitation and Combination), specially designed for acquiring unknown malware. The framework seeks to acquire the most informative PDF files, benign and malicious, in order to improve classifier performance, enabling it to frequently discover new unknown malware and enrich the signature repository of anti-virus tools.
In general, three of the AL methods performed very well in updating the detection model, with our methods, Combination and Exploitation, outperforming SVM-Margin in the main goal of the study, which is the acquisition of new unknown malicious PDF files. The evaluation of the classifier before and after the daily acquisition showed an improvement in the detection rate, and subsequently more new malwares were acquired. On the 10th day, Combination acquired almost seven times more PDF malware (150) than the number acquired by SVM-Margin (22 PDF malware) and almost four times more PDF malware than the Random method acquired (40 malware). While our Combination and Exploitation methods showed an increasing trend in the number of PDF malware acquired over the course of the 10 days, SVM-Margin showed an unstable and poor performance, whereas Random was consistent. Therefore our framework was found to be effective in updating the anti-virus software by acquiring the maximum number of malicious PDF files. It also maintains a well updated model that is aimed at daily detection of new and unknown malicious PDF files.
In future work, we are interested in extending this framework to Android applications. Due to their resource limitations, mobile devices are very dependent on anti-virus solutions that should be frequently and efficiently updated. Quite possibly our suggested AL framework could address this problem.

VI. ACKNOWLEDGMENT
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. We would like to thank Roy Nissim for assisting in the efficient implementation aspects. We also thank Šrndić and Laskov from the University of Tübingen for sharing their malicious PDF dataset with us.

VII. REFERENCES
[1] R. Moskovitch, Y. Elovici and L. Rokach. Detection of unknown computer worms based on behavioral classification of the host. Comput. Stat. Data Anal., 52(9), pp. 4544-4566, 2008.
[2] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz and Y. Elovici. Unknown malcode detection and the imbalance problem. Journal in Computer Virology, 5(4), pp. 295-308, 2009.
[3] Z. Tzermias, G. Sykiotakis, M. Polychronakis and E. P. Markatos. Combining static and dynamic analysis for the detection of malicious documents. 4th European Workshop on System Security, 2011.
[4] D. Maiorca, I. Corona and G. Giacinto. Looking at the bag is not enough to find the bomb: An evasion of structural methods for malicious PDF files detection. 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, 2013.
[5] N. Šrndić and P. Laskov. Detection of malicious PDF files based on hierarchical document structure. 20th Annual Network & Distributed System Security Symposium, 2013.
[6] P. Laskov and N. Šrndić. Static detection of malicious JavaScript-bearing PDF documents. 27th Annual Computer Security Applications Conference, 2011.
[7] P. Baccas. Finding rules for heuristic detection of malicious PDFs: With analysis of embedded exploit code. pp. 12-12, 2010.
[8] J. Kittilsen. Detecting malicious PDF documents.
[9] C. Vatamanu, D. Gavriluţ and R. Benchea. A practical approach on clustering malicious PDF documents. Journal in Computer Virology, 8(4), pp. 151-163, 2012.
[10] F. Schmitt, J. Gassen and E. Gerhards-Padilla. PDF Scrutinizer: Detecting JavaScript-based attacks in PDF documents. Tenth Annual International Conference on Privacy, Security and Trust (PST), 2012.
[11] C. Smutz and A. Stavrou. Malicious PDF detection using metadata and structural features. 28th Annual Computer Security Applications Conference, 2012.
[12] D. Maiorca, G. Giacinto and I. Corona. A pattern recognition system for malicious PDF files detection. In Machine Learning and Data Mining in Pattern Recognition, 2012.
[13] H. Pareek, P. Eswari, N. S. C. Babu and C. Bangalore. Entropy and n-gram analysis of malicious PDF documents. Int. J. Eng., 2(2), 2013.
[14] X. Lu, J. Zhuge, R. Wang, Y. Cao and Y. Chen. De-obfuscation and detection of malicious PDF files with high accuracy. 46th Hawaii International Conference on System Sciences (HICSS), 2013.
[15] K. Z. Snow, S. Krishnan, F. Monrose and N. Provos. SHELLOS: Enabling fast detection and forensic analysis of code injection attacks. USENIX Security Symposium, 2011.
[16] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications, http://dx.doi.org/10.1016/j.eswa.2014.02.053.
[17] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications, 15(4), pp. 459-475, 2012.
[18] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2000-2001.
[19] X. Wang, W. Yu, A. Champion, X. Fu and D. Xuan. Worms via mining dynamic program execution. Third International Conference on Security and Privacy in Communication Networks and the Workshops, SecureComm, pp. 412-421, 2007.
[20] Z. Chen, M. Roussopoulos, Z. Liang, Y. Zhang, Z. Chen and A. Delis. Malware characteristics and threats on the internet ecosystem. Journal of Systems and Software, 85(7):1650-1672, July 2012.
[21] T. Joachims. Making large scale SVM learning practical. 1999.
[22] C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27, 2011.
[23] R. Moskovitch, N. Nissim and Y. Elovici. Malicious code detection using active learning. In Privacy, Security, and Trust in KDD, pp. 74-91, Springer Berlin Heidelberg, 2009.
[24] Y. Baram, R. El-Yaniv and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5, pp. 255-291, 2004.
[25] H. K. Jnanamurthy, C. Warty and S. Singh. Threat analysis and malicious user detection in reputation systems using mean bisector analysis and cosine similarity (MBACS). 2013.
[26] R. Herbrich, T. Graepel and C. Campbell. Bayes point machines. The Journal of Machine Learning Research, 1, pp. 245-279, 2001.
[27] H. Kiem, N. T. Thuy and T. M. N. Quang. A machine learning approach to anti-virus system. Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, pp. 61-65, 4-7 December 2004, Hanoi, Vietnam.

2015 IEEE 14th International Conference on Machine Learning and Applications

Boosting the Detection of Malicious Documents Using Designated Active Learning Methods

Nir Nissim, Aviad Cohen, and Yuval Elovici
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
[email protected]

ABSTRACT — Most organizations usually create, send and receive huge amounts of documents daily. Attackers increasingly take advantage of innocent users who tend to casually open email messages assumed to be benign, carrying malicious documents. Recent targeted attacks aimed at organizations utilize the new Microsoft Word documents (*.docx). Anti-virus software fails to detect new unknown malicious files, including malicious docx files. In this study, we present the SFEM feature extraction methodology and designated Active Learning (AL) methods, aimed at accurate detection of new unknown malicious docx files that also efficiently enhance the detection model's capabilities over time. Our AL methods identify and acquire only a small set of new docx files that are most likely malicious, as well as informative benign files; these files are used for enhancing the knowledge stores of both the detection model and the anti-virus software. Results show that our active learning methods used only 14% of the labeled docx files within an organization, which led to a reduction of 95.5% in labeling efforts compared to passive learning and SVM-Margin (an existing active learning method). Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and SVM-Margin, thus providing an improved updating solution for the detection model, as well as the anti-virus software widely used within organizations.

Keywords — Active Learning; Machine Learning; Structural; Documents; Microsoft Office files; docx; Malware; Malicious.

I. INTRODUCTION
The vast majority of organizations rely heavily on email for internal and external communication, and email has become an attractive platform from which to initiate cyber-attacks against organizations. Attackers usually use social engineering in order to encourage recipients to press a link or open a malicious web page or attachment. In addition, most users today consider non-executable files safer than executable files, and therefore users are less suspicious of such files. However, non-executable files are as dangerous as executable files, because their readers may contain vulnerabilities that can be exploited for malicious purposes by attackers. Cybercriminals launch attacks through Microsoft Office files,1 taking advantage of the fact that Office documents are widely used among most organizations, as well as the fact that organizations' employees don't usually take precautions when receiving and opening these files. In an attempt to prevent or mitigate these types of attacks, Microsoft Office provides several mechanisms such as macro security level, trusted locations, and digital signatures. However, Dechaux et al. [2] showed that these mechanisms can easily be bypassed. In order to effectively analyze thousands of new, potentially malicious Microsoft Office files on a daily basis, anti-virus vendors have integrated a component of a detection model based on machine learning and rule-based algorithms [6] into the core of their signature repository update activities. However, these solutions are often ineffective, because their knowledge base is not adequately updated. This occurs because an enormous number of new, potentially malicious Microsoft Office files are created each day, and only a limited number of security researchers are tasked with manually inspecting and labeling them. Thus, there is a need to prioritize which files should be acquired, analyzed, and labeled by a human expert. As Microsoft Word files are the most popular Microsoft Office files used by organizations, our work focuses on this type of file. In this study we present ALDOCX, a framework and new structural feature extraction methodology (SFEM) aimed at the accurate detection of malicious Microsoft Word XML-based (*.docx) files. ALDOCX is an active learning based framework for frequently updating anti-virus software with docx files. ALDOCX focuses on improving the performance of the detection model and anti-virus software by labeling the docx files (potentially malicious or informative benign files) that are most likely to add new and discriminative information. By doing so, ALDOCX enriches the signature repository with as many new malicious docx files as possible, further enhancing the detection process. Specifically, the ALDOCX framework favors files that contain new content. In this paper we focus on PCs and docx files, the platform and documents most used by organizations. Microsoft Word legacy binary files (*.doc) are beyond the scope of this paper, as they have a file structure that substantially differs from the new XML-based files (*.docx) and requires a feature extraction methodology that is specifically tailored for them. Our contributions in this paper are:
- Presenting SFEM, a new methodology of feature extraction based on structural features within a docx file and capable of providing accurate detection of malicious docx files.
- Presenting the use of machine learning algorithms for the detection of malicious Microsoft Word XML-based documents using a new structural feature extraction methodology (SFEM).
- Developing ALDOCX, a framework based on active learning methods for enhancing and maintaining detection capabilities and updatability.

II. BACKGROUND
With the release of Microsoft Office 2007, Microsoft introduced an entirely new file format based on XML called "Open XML."2 The new file format applies to Microsoft Word, Excel, and PowerPoint and comes with the addition of the "x" suffix to the recognized file extension: *.docx, *.xlsx, and *.pptx. The files are automatically compressed (up to 75% in some cases) using Zip technology, and when the file is opened, it is automatically unzipped.

1 http://securelist.com/blog/research/65414/obfuscated-malicious-office-documents-adopted-by-cybercriminals-around-the-world/
2 http://office.microsoft.com/en-001/help/introduction-to-new-file-name-extensions-HA010006935.aspx

In 2013, Schreck et al. [1] presented a new approach called BISSAM (Binary Instrumentation System for Secure Analysis of Malicious Documents) with three purposes: distinguishing malicious documents from benign documents, extracting embedded malicious shellcode, and detecting and identifying the vulnerability exploited by a malicious document. BISSAM analyzes only the Microsoft Office legacy binary files and not the new XML-based ones. Other tools aimed at detecting legacy binary malicious Office files exist, such as OfficeMalScanner,3 OfficeCat,4 Microsoft OffVis,5 and pyOLEScanner.py.6 The Threat Emulation7 product conducts dynamic analysis and executes the suspicious file in multiple operating systems and with different versions of viewing programs (such as Microsoft Office or Adobe Reader), and if the tool detects that an unusual network connection has been established or changes have been made to the File System, Registry, or processes, the file will be labeled as malicious. However, dynamic analysis has several disadvantages, including its high resource costs, computational complexity, and the length of time it requires; in addition, it can also be detected by the executed malicious code, which may consequently cease its malicious operations. On the other hand, static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable—the docx file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis "traps" to deter static analysis, these traps can themselves be leveraged and used to detect malware using static analysis. In addition, since static analysis is relatively efficient and fast, it can be performed in an acceptable timeframe, minimizing bottlenecks. Static analysis is also easy to implement, monitor, and measure. Moreover, it scrutinizes the file's "genes" and not its current behavior, which can be changed or delayed to an unexpected time or specific conditions in order to evade detection by dynamic analysis.
The above mentioned tools aimed at malicious Microsoft Office files can be categorized into two groups. The first is aimed at detecting only Microsoft Office legacy binary files. The second group is also capable of analyzing the new XML-based format files, doing so through dynamic analysis with all of its limitations. Both groups share the same significant disadvantage in that they use deterministic analysis and/or rule-based heuristics in order to detect maliciousness; yet none of them applies machine learning algorithms, which have many known advantages, including generalization capabilities and the ability to discover hidden patterns for better detection of the new XML-based format Office files. As anti-viruses are capable of detecting only known malicious files and their relative variants, their ability to detect new and unknown malicious Office files is limited. There is a distinct lack of high performance methods for the detection of malicious Microsoft Office XML-based files, particularly the widely used docx files. Moreover, given the enormous number of documents created daily, the strength of a detection method lies in its ability to be continuously up-to-date (i.e., its updatability resulting from efficient and frequent updates of both the detection model and commonly used anti-virus software).

III. MICROSOFT OFFICE FILES SECURITY THREATS
Several security threats associated with Microsoft Office documents and files follow. Microsoft Office files can contain a macro, which is embedded code written in Visual Basic for Applications (VBA). A macro is a legitimate component that can be dangerous when used for malicious purposes. In Microsoft Office 2013 all macros are disabled by default, and notification of this fact is provided. However, the security level and trusted location security features can be bypassed using techniques that were presented by Dechaux et al. [2]. In addition, malicious Office files can contain a malicious Object Linking and Embedding (OLE) object. An OLE package object may contain any file or command line. If the user double-clicks on the object (located in the document), the file or the command is launched. Microsoft Office files can embed active content such as binary files, command and script files, Hypertext Markup Language files which can contain malicious JavaScript code, and other document files such as *.pdf, *.doc, *.xls, and *.ppt which can also be malicious. Embedded malicious files can be automatically executed when the container file is opened, using the macro.

IV. OFFICE FILE STRUCTURE
An introduction to the structure of a viable docx file is provided. Figure 1 shows the directory tree of a sample *.docx file. The actual content of the file is stored in various XML files located in different folders. Each XML file holds the information of a different component. For example, the 'styles.xml' file holds the styling data; 'app.xml' and 'core.xml' hold the metadata of the document (i.e., the title, the document's author, the number of lines and words, etc.). Our framework, aimed at enhancing the detection of malicious docx files, is based on this file structure.

Figure 1: Example of an unzipped *.docx file.

V. METHODS AND FRAMEWORK
A. Structural Feature Extraction Methodology (SFEM)
We now present our new structural feature extraction methodology (SFEM) in which we use the hierarchical nature of an Office file and convert it to a list of unique paths. Figure 2 includes the beginning of the full list of unique paths extracted from the sample file and from its XML files (presented in Figure 1), after unzipping it. The red paths represent directories and the purple paths represent files. Since XML files have a hierarchical nature as well, we converted the XML files within the Office files to a list of paths by concatenating the names of the hierarchical tags within the XML, using '\'. Only the tags' names are concatenated; tags' properties and properties' values are ignored. The green paths represent a path of tags within an XML file. The paths represent the file's properties and actions. For example, the "\word\media\image1.png" path means that the document contains an image, and since this is the only path under the "\word\media\" path, we know that this is the only media item in the file. There are a couple of paths whose presence in the file can indicate the presence of macro code, one being the "\word\vbaData.xml" path.

3 http://www.reconstructer.org/code.html
4 http://www.aldeid.com/wiki/Officecat
5 http://www.microsoft.com/en-us/download/details.aspx?id=2096
6 http://www.aldeid.com/wiki/PyOLEScanner
7 https://www.checkpoint.com/products/threatcloud-emulation-service/
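A rough sketch of this kind of structural path extraction is shown below: it unzips a docx file, records the path of every archive entry, and flattens each XML part's tag hierarchy into '\'-separated paths of tag names only (properties and values ignored). This is a minimal sketch, not the authors' extractor, and its path format only approximates the listing shown in Figure 2 below.

import zipfile
import xml.etree.ElementTree as ET

def local_tag(element):
    # Strip the XML namespace, keeping only the local tag name.
    return element.tag.split('}')[-1]

def extract_structural_paths(docx_path):
    # Returns the set of structural paths of a .docx file: the zip directory
    # tree plus the concatenated tag names of every element in its XML parts.
    paths = set()
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            entry = '\\' + name.replace('/', '\\')
            paths.add(entry)
            if not name.lower().endswith('.xml'):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue
            stack = [(root, entry + '\\' + local_tag(root))]
            while stack:
                element, prefix = stack.pop()
                paths.add(prefix)
                for child in element:
                    stack.append((child, prefix + '\\' + local_tag(child)))
    return paths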

[Document Folder]\[Content_Types].xml
[Document Folder]\[Content_Types].xml\Types
[Document Folder]\[Content_Types].xml\Types\Override
[Document Folder]\docProps
[Document Folder]\docProps\app.xml
[Document Folder]\docProps\app.xml\Properties
[Document Folder]\docProps\app.xml\Properties\Template
[Document Folder]\docProps\app.xml\Properties\Template\#text
[Document Folder]\docProps\app.xml\Properties\TotalTime
[Document Folder]\docProps\app.xml\Properties\DocSecurity

Figure 2: List of paths extracted from a sample *.docx file.

Note that the leaves, as well as the nodes in the directory tree, have been included. We do this in order to maintain the generality of the higher nodes in the tree. For example, the path '\word\vbaData.xml', which is the ancestor of the leaf path 'word\vbaData.xml\wne:vbaSuppData\wne:mcds\wne:mcd', was found to be a more powerful feature than its descendant path, as it indicates the presence of macro code in the file and not just a specific property in the vbaData.xml files. The extraction process is done statically without executing the file and is conducted quite quickly, at a rate of 270 ms for an average file. SFEM advantages include:
- SFEM does not focus on the extraction and analysis of malicious code inside the document (which does not always exist), and because of this, it cannot be evaded by code obfuscation techniques and is therefore a more general and robust approach for many types of attacks.
- The extraction process is done statically, without executing the file, and therefore it is more secure, fast, and requires less computational resources; as a result it can be deployed over endpoint and lightweight devices as well (e.g., smartphones).

B. Selective Sampling and Active Learning Methods
Since our goal is to provide solutions to real problems, our framework must be based on a sophisticated, fast, and selective high-performance sampling method that will select only the minimal yet informative set of documents by which the detection model and anti-virus tool can be updated on a daily basis. We compared our proposed AL methods to other strategies, and the methods considered are described below.

1) Random Selection (Random)
While random selection is obviously not an active learning method, it is the "lower bound" of the selection methods discussed. We are unaware of an anti-virus tool that uses an active learning method for maintaining and improving its updatability. Consequently, we expect that all AL methods will perform better than a selection process based on the random acquisition of files.

2) The SVM-Simple-Margin AL Method (SVM-Margin)
The SVM-Simple-Margin method [9] (referred to as SVM-Margin) is directly related to the SVM classifier. However, in contrast to our methods, it selects examples according to their distance from the separating hyperplane in order to explore and acquire the informative files without relation to their classified labels (i.e., without specifically focusing on malware instances). The SVM-Margin AL method is very fast and can be applied to real problems; yet, as its authors indicate [9], this agility is achieved because it is based on a rough approximation and relies on assumptions that the version space (VS) is fairly symmetric and that the hyperplane's Normal (W in Equation 2) is centrally placed, assumptions that have been shown to fail significantly [3]. The method may query instances in which the hyperplane does not intersect the VS, and therefore may not be informative. The SVM-Margin method for detecting instances of PC malware was used by Moskovitch et al. [10], whose preliminary results found that the method also assisted in updating the detection model but not the anti-virus application itself. This serves as our baseline AL method, and we expect our method to improve the new malicious docx detection and acquisition seen in SVM-Margin.

3) Exploitation: Our Proposed Active Learning Method
Our method, "Exploitation," is based on SVM classifier principles and is oriented towards selecting the examples most likely to be malicious that lie furthest from the separating hyperplane. Thus, our method supports the goal of boosting the signature repository of the anti-virus software by acquiring as much new malware as possible. For every file X that is suspected of being malicious, Exploitation rates its distance from the separating hyperplane using Equation 1, based on the Normal of the separating hyperplane of the SVM classifier that serves as the detection model. As explained above, the separating hyperplane of the SVM is represented by W, which is the Normal of the separating hyperplane and is actually a linear combination of the most important examples (support vectors), multiplied by Lagrange multipliers (alphas) and by the kernel function K that assists in achieving linear separation in higher dimensions. Accordingly, the distance in Equation 1 is simply calculated between example X and the Normal (W) presented in Equation 2.

Dist(X) = Σ_{i=1}^{n} α_i y_i K(x_i, X)    (Equation 1)

w = Σ_{i=1}^{n} α_i y_i x_i    (Equation 2)

In Figure 3, the files that were acquired (marked with a red circle) are the files classified as malicious that are at the maximum distance from the separating hyperplane. Acquiring several new malicious files that are very similar to one another and belong to the same virus family is considered a waste of manual analysis resources, since these files will probably be detected by the same signature. Thus, acquiring one representative file for this set of new malicious files will serve the goal of efficiently updating the signature repository. In order to enhance the signature repository as much as possible, we check the similarity between the selected files using the kernel farthest-first (KFF) method suggested by Baram et al. [8], which enables us to avoid acquiring very similar examples. Consequently, only the representative files that are most likely malicious are selected. In cases in which the representative file is determined to be malware by the security expert, all variants that were not acquired will be detected the moment the anti-virus is updated. In cases in which these files are not actually variants of the same malware, they will be acquired in the following trial (after the detection model has been updated), as long as they are again determined to most likely be malware.

Figure 3: The criteria by which Exploitation acquires new unknown malicious docx files. These files lie the farthest from the hyperplane and are regarded as representative files.

In Figure 3 it can also be observed that there are sets of relatively similar files (based on their distance in the kernel space); however, only the representative files that are most likely to be malware are acquired. Exploitation explores the "malicious side" to discover new and unknown malicious files that are essential for the frequent update of the anti-virus signature repository, a process which occasionally also results in the discovery of benign files (files which will likely become support vectors and update the classifier).
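Assuming an RBF-kernel SVM, Equation 1 can be evaluated directly from the trained model's support vectors, since libraries such as scikit-learn store the products α_i·y_i as dual coefficients. The sketch below is one possible reading of Equation 1 under that assumption (gamma must match the value used when training the model); it is not the paper's implementation.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def exploitation_distance(svm, X, gamma):
    # Equation 1: Dist(X) = sum_i alpha_i * y_i * K(x_i, X) over the support vectors.
    # svm.dual_coef_ holds alpha_i * y_i; gamma must equal the trained model's gamma.
    K = rbf_kernel(X, svm.support_vectors_, gamma=gamma)
    return (K * svm.dual_coef_).sum(axis=1)

# Exploitation then acquires the files classified as malicious that lie farthest
# from the hyperplane, i.e., those with the largest positive Dist(X); for a
# fitted binary SVC this quantity equals svm.decision_function(X) - svm.intercept_.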

Figure 3 also presents an example of a file lying far inside the malicious side that was found to be benign. The distance calculation required for each instance in this method is fast and equal to the time it takes to classify an instance in an SVM classifier, thus it is applicable for products working in real time.

4) Combination: A Combined Active Learning Method
The "Combination" method lies between the SVM-Margin and Exploitation methods. On the one hand, in order to acquire the most informative docx files (acquiring both malicious and benign files), the Combination method begins by acquiring examples based on SVM-Margin criteria, an exploration phase which is important in order to enable the detection model to discriminate between malicious and benign docx files. On the other hand, the Combination method then tries to maximally update the signature repository in an exploitation phase, drawing on the Exploitation method. This means that in the early acquisition period, during the first part of the day, SVM-Margin is more dominant compared to Exploitation. We found that a balanced division of labor between SVM-Margin and Exploitation achieved the best performance. In short, this method tries to take the best from both of these methods.

C. ALDOCX Framework for Enhancing Detection Capabilities
First we collect all the docx files within the organization, which are introduced to the ALDOCX framework that includes a detection model based on SVM, SFEM, and AL methods. The detection model scrutinizes the docx files and provides two values for each file: a classification decision using the SVM classification algorithm and a distance calculation from the separating hyperplane using Equation 1. A file that the AL method recognizes as informative and has indicated should be acquired is sent to an expert who manually analyzes and labels it. We aim to frequently update the anti-virus software by acquiring these informative docx files and focusing the expert's efforts on labeling docx files that are most likely to be malware or on benign docx files that are expected to improve the detection model. Note that informative files are defined as those that, when added to the training set, improve the detection model's predictive capabilities and enrich the anti-virus signature repository. Accordingly, in our context there are two types of files that may be considered informative. The first type includes files in which the classifier has limited confidence as to their classification (the probability that they are malicious is very close to the probability that they may be benign). Acquiring them as labeled examples will probably improve the model's detection capabilities. In practical terms, these docx files will have new structural paths or special combinations of existing structural paths that represent their execution code (e.g., inside the macro code of the docx file). Therefore these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin strategy, which selects informative files, both malicious and benign, that are a short distance from the separating hyperplane.
The second type of informative files includes those that lie deep inside the malicious side of the SVM margin and are a maximal distance from the separating hyperplane according to Equation 1. These docx files, which are a maximal distance from the labeled files, will be acquired by the Exploitation method, as this distance is measured by the KFF calculation [8]. These informative files are then added to the training set for updating and retraining the detection model. The files that were labeled as malicious are also added to the anti-virus signature repository in order to enrich and maintain its updatability. Updating the signature repository also requires an update to clients utilizing the anti-virus application.
For every unknown docx file in the scrutinized file collection, the detection model provides a classification, while the AL method provides a rank representing how informative the file is. The framework will consider acquiring the files based on this information. After being selected and receiving their true labels from the expert, the informative docx files are acquired and added to the training set. The signature repository is also updated, in case the files are malicious. The detection model is retrained over the updated and extended training set, which now also includes the acquired examples that are regarded as being very informative. Note that the goal is to acquire as many malicious docx files as possible, since such information will maximally update the anti-virus software that protects most organizations. We employed several algorithms in order to induce detection models, including the SVM classification algorithm with the radial basis function (RBF) kernel in a supervised learning approach. SVM has proven to be very efficient at enhancing the detection of malware when combined with AL methods [4], [5], [7].

VI. EVALUATION
A. Dataset Collection
In order to evaluate our proposed framework and methods, we created a large and representative dataset of malicious and benign Microsoft Word files (*.docx). We acquired a total of 16,811 files, including 327 malicious and 16,484 benign files, from three sources: the VirusTotal repository, the Contagio Project, and files randomly collected from the Internet and our university. Our dataset contains only 1.9% malicious documents in order to reflect reality as closely as possible. At the time of the paper's composition, we used all 327 existing malicious docx files in the VirusTotal repository (as samples of the old format, *.doc files, are not relevant), and we used only files detected as malicious by at least five anti-virus engines. All the files were assured to be labeled correctly as malicious or benign using the VirusTotal service.

B. Dataset Creation
We developed a feature extractor based on our structural feature extraction methodology (SFEM) in order to extract and analyze the features from all the files in the dataset. The feature extraction process resulted in a vocabulary of 134,854 unique features (paths) extracted from both benign and malicious files. In order to select the most prominent features from the list of 134,854 features, we used a feature selection method based on Information Gain, and we were left with a list of the 5,000 most prominent features, which we called the "Top 5000." Using the ranked list of the 5,000 most prominent features and the collection of 16,811 *.docx files (benign and malicious), we created the dataset for our experiments, which contained 16,811 records and 5,001 fields - field 5,001 represents the class of the file, "Malicious" if the record represents a malicious docx file and "Benign" if the record represents a benign docx file. We used a Boolean representation to indicate the presence (1) or absence (0) of a feature within a file.
We also took into account different numbers of prominent features and created 8 datasets using the 10, 40, 80, 100, 300, 500, 800, and 1000 top features. The top features with which the best detection rates are achieved will be the features used for further experiments. Among the 10 most prominent features, features 1 to 8 are related to the existence of macro code in the document and its activation, feature 9 is related to the existence of embedded files, and feature 10 is related to OLE objects in the document. Deeper analysis of these features and their percentages of occurrence within benign and malicious files contributes to a better understanding of the dataset's composition. About 44% of the malicious files contain macro code, whereas among the benign files only 0.13% to 0.16% contain macro code. Additionally, almost 60% of the malicious files contain an embedded file, compared to only 3.14% of the benign files.
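The Boolean dataset described above can be assembled from the per-file path sets and the ranked feature list as in the following sketch (selected_paths stands for the top features retained after the Information Gain ranking; the names are illustrative, not the authors' code):

import numpy as np

def build_boolean_dataset(path_sets, labels, selected_paths):
    # path_sets: one set of structural paths per docx file.
    # labels: "Malicious" or "Benign" per file.
    # selected_paths: the top-N paths kept after feature selection.
    column = {path: j for j, path in enumerate(selected_paths)}
    X = np.zeros((len(path_sets), len(column)), dtype=np.int8)
    for i, paths in enumerate(path_sets):
        for path in paths:
            j = column.get(path)
            if j is not None:
                X[i, j] = 1   # presence (1) / absence (0) of the feature
    return X, np.asarray(labels)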

The most popular attacks through docx files are launched via the macro, file embedding, and OLE categories, and it is significant that the most prominent features belong to these categories. However, in order to boost and generalize the detection capabilities, we will evaluate the detection model with additional features which are not necessarily related directly to the most popular attacks.

C. Experimental Design
Our experimental design aims at providing a clear and practical answer to the following research question:
Is it possible to leverage and utilize an organization's large collection of unlabeled docx files to create an accurate detection model and efficiently update the anti-virus software through the acquisition of only a minimal, yet informative set of docx files while also reducing the labeling efforts of the security experts?
The objective of this experiment, which compares the performance of our new AL methods to the existing selection method, SVM-Margin, was to measure and evaluate our framework's contribution when applying it inside an organization, both for enhancing the capabilities of the detection model and anti-virus and for reducing the labeling efforts of the security expert when labeling the most informative docx files. This experiment aims to evaluate the framework with respect to the fact that many organizations already have huge amounts of docx files within their networks, files that can be very informative and contributive. Thus these files can be utilized and leveraged for enhancing the detection model's performance in the very early stages. By doing so, the framework (SFEM and AL methods) might even detect, discover, and acquire new malicious files within the organizational networks, files that performed malicious activity against the organization and haven't been detected by existing solutions such as anti-viruses.
During the acquisition trials, which ended by acquiring every docx file in the pool of unlabeled docx files, we compared the acquisition of docx files based on AL methods to random selection, based on the performance of the classification model. In this set of experiments, we used all 16,811 docx files (16,484 benign, 327 malicious) in our repository and created ten randomly selected datasets, with each dataset containing three elements: an initial set of 1,811 randomly selected docx files that were used to induce the initial detection model; a test set of 1,000 docx files upon which the detection model was tested and evaluated after every trial in which it was updated; and a pool of the remaining 14,000 unlabeled and unknown docx files, from which the framework and the selective sampling methods selected the most informative and most likely malicious docx files according to each method's criteria. The informative docx files were sent to a security expert who inspected and labeled them. The docx files were later acquired by the training set, which was enriched with an additional X new informative docx files. When a file was found to be malicious, it was immediately used to update the signature repository of the anti-virus, and an update was also distributed to clients. The process was repeated over the next trials until the entire pool was acquired. The performance of the detection model was averaged over ten runs on the ten different datasets that were created. Each selective sampling method was checked separately on five different acts of docx file acquisition (each consisting of a different number of docx files). This means that for each act of acquisition, the methods were restricted to acquiring a number of files equal to the amount that followed, denoted as X: 10 files, 20 files, 50 files, 100 files, and 250 files. Out of the total 16,811 files, we used 1,811, a relatively small number of docx files (~1,774 benign, ~37 malicious), as an initial set, since labeling files can become costly due to the need for a security expert to inspect them; thus we tried to reduce the need for extensive labeling efforts. The experiment's steps are as follows:
1. Inducing the initial detection model from the initial available training set (the initial training set includes 1,811 docx files).
2. Evaluating the detection model on the test set of 1,000 docx files to measure its initial performance.
3. Introducing the pool of unknown and unlabeled docx files to the selective sampling method, which chooses the X most informative docx files and sends them to the security expert for inspection and labeling.
4. Acquiring these labeled informative docx files, removing them from the pool and adding them to the training set, as well as using their extracted signatures to update the anti-virus signature repository (in the case that they were found to be malicious).
5. Inducing an updated detection model from the updated training set and applying the updated model on the pool (which now contains fewer docx files).
This process is repeated until the entire pool is acquired.

VII. RESULTS
Our focus in this paper is evaluating the process of learning from a small sample of documents within an organization for updatability and detection enhancement using our AL methods; therefore the basic detection experiment is out of the scope of this paper, and we only summarize its results, which serve as a basis for the main AL process of this paper. After evaluating different classifiers and top-feature sets through a 10-fold cross validation detection experiment, we can conclude that the configuration that provides the best detection capabilities is the SVM classifier trained on the top 100 features: TPR of 93.34%, FPR of 0.19%, and an accuracy rate of 99.67%.
For the main focus of the paper, we compared the AL methods to random selection (passive learning) and show only the results of acquiring 50 files per trial (due to space limitations). We can see in Figure 4 that SVM-Margin achieved the highest TPR of 94.4% after only five trials. This indicates that only 2,061 informative, well selected docx files (2,061 = 250 + 1,811, which are 14% of the pool of unlabeled docx files) were required to induce this accurate detection model. Both Exploitation and Combination achieved a high and stable TPR rate of 93.2% after ten trials, which means only 2,511 informative and well selected docx files (18% of the pool of unlabeled docx files). It took the Random selection 237 trials, which is 13,661 labeled files, representing 97.5% of the docx files in the pool. Therefore we demonstrate a reduction of 81% in labeling efforts when comparing Random selection to our AL methods, while still maintaining low FPR rates of 0.18%.

Figure 4: The TPR of the framework over the 280 trials of acquisition for different methods when acquiring 50 docx files in each trial.

In Figure 5 we can see that during very early stages, both of our AL methods acquired 304 malicious docx files out of the 312 present in the pool (97.4% of the malicious docx files), and updated the detection model and anti-virus signature repository. This was achieved in the 12th trial, while it took the Random method 270 trials to acquire the same number of malicious docx files, and SVM-Margin required the entire pool, or 280 trials. Ultimately, ALDOCX provided a reduction of 95.5% and 95.7%, respectively, compared with Random and SVM-Margin in the labeling efforts, and updated both the detection model and the anti-virus software with new informative docx files – benign and especially malicious.

Figure 5: The cumulative number of malicious docx files acquired by the framework for different methods with acquisition of 50 files in each trial.

We also compared the detection rate of our methods with the 10 best anti-viruses commonly used by organizations. The most accurate anti-virus, TrendMicro, had a detection rate of 85.9%, while our methods outperformed all the anti-viruses in the task of detecting new malicious docx files. Using the SVM classifier with 100 structural features (SFEM), we achieved 93.4%, which required using 90% of the dataset for training (due to the 10XV settings). Using the full ALDOCX framework, including the AL methods and its enhancement process, we even improved the performance of SVM-SFEM, achieving a 94.4% TPR using only 14% of labeled data (2,011 docx files out of 14,000), which means a reduction of 84.4% in labeling efforts.

VIII. DISCUSSION AND CONCLUSION
We presented ALDOCX, a framework for boosting the detection of unknown malicious Microsoft Word documents using designated active learning methods. ALDOCX is based mainly on machine learning algorithms trained with our new Structural Feature Extraction Methodology (SFEM), which is static, fast, and robust against the different evasion attacks used by attackers. We evaluated our framework through a comprehensive series of experiments using our large and representative dataset. We found that the configuration that yielded the best results was the SVM classifier trained on the top 100 structural features, which achieved a TPR of 93.34%, FPR of 0.19%, and an accuracy rate of 99.67%. The number of features (top 100) showed that the detection of malicious docx files with high TPR rates requires the consideration of more than simply the top ten "trivial" features (macro, embedding, OLE) and must include features that are extracted from deep within the structure of the docx file. These non-trivial features are difficult for an attacker to learn and cannot be evaded easily. The basic results show that the ALDOCX framework can be integrated into Microsoft Office tools or deployed on endpoints in order to detect malicious docx files. Our contribution is even greater since our new feature extraction methodology, SFEM, is general and aimed also at other Microsoft Office XML-based files (e.g., Excel [*.xlsx], PowerPoint [*.pptx]), a level of coverage that other existing techniques are incapable of.
The main results show that our framework is capable of utilizing the vast number of docx files within organizations for efficiently and effectively inducing an accurate detection model. Additionally, they show that our framework is capable of updating the anti-virus software using only a minimal set of well selected, informative docx files and significantly reducing the labeling efforts required by security experts. Moreover, the fact that in this historically based experiment we achieved higher TPR rates (94.44%) than in the non-historical (93.6%) AL experiment emphasizes that a docx file that was not considered to be informative at an early stage may be very informative during later stages of acquisition and can contribute to the detection model's performance and the updatability of the anti-virus as well. We showed that during very early stages, both of our AL methods acquired 304 malicious docx files out of the 312 present in the pool (97.4% of the malicious docx files) and both updated the detection model and anti-virus signature repository. This was achieved on the 12th trial, while it took the Random method 270 trials to acquire the same number of malicious docx files, and SVM-Margin required the entire pool, or 280 trials. Ultimately, ALDOCX provided a reduction in the labeling efforts of 95.5% and 95.7%, respectively, compared with Random and SVM-Margin, and updated both the detection model and the anti-virus software with new informative docx files – benign and especially malicious, which are more valuable for detection purposes. The results of this third and last experiment prove that our framework can also be deployed inside organizations for creating a detection model that is well tailored toward the proclivities of the organization itself – according to the docx files it uses and creates. In future work, we are interested in extending this framework to the detection of additional Office files such as xlsx and pptx, which share the same XML-based structure as the docx file.

IX. REFERENCES
[1] T. Schreck, S. Berger and J. Göbel. BISSAM: Automatic vulnerability identification of office documents. In Detection of Intrusions and Malware, and Vulnerability Assessment, 2013.
[2] J. Dechaux, E. Filiol and J. Fizaine. Office documents: New weapons of cyberwarfare. 2010.
[3] R. Herbrich, T. Graepel and C. Campbell. Bayes point machines. The Journal of Machine Learning Research, 1, pp. 245-279, 2001.
[4] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications, 15(4), pp. 459-475, 2012.
[5] N. Nissim, R. Moskovitch, L. Rokach and Y. Elovici. Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications, http://dx.doi.org/10.1016/j.eswa.2014.02.053.
[6] H. Kiem, N. T. Thuy and T. M. N. Quang. A machine learning approach to anti-virus system. Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, pp. 61-65, 4-7 December 2004, Hanoi, Vietnam.
[7] N. Nissim, A. Cohen, R. Moskovitch, O. Barad, M. Edry, A. S. and Y. Elovici. ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files Aimed at Organizations. Proceedings of JISIC, 2014.
[8] Y. Baram, R. El-Yaniv and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5, pp. 255-291, 2004.
[9] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2000-2001.
[10] R. Moskovitch, N. Nissim and Y. Elovici. Malicious code detection using active learning. In Privacy, Security, and Trust in KDD, pp. 74-91, Springer Berlin Heidelberg, 2009.

Designated Active Learning Methods for Detection Enhancement of Unknown Malicious Microsoft Office Documents

Nir Nissim, Aviad Cohen, and Yuval Elovici
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
[email protected]

ABSTRACT
Attackers increasingly take advantage of innocent users who tend to casually open email messages assumed to be benign, carrying malicious documents. Recent targeted attacks aimed at organizations utilize the new Microsoft Word documents (*.docx). Anti-virus software fails to detect new unknown malicious files, including malicious docx files. In this study, we present ALDOCX, a framework aimed at accurate detection of new unknown malicious docx files that also efficiently enhances the framework's detection capabilities over time. Detection relies upon our new structural feature extraction methodology (SFEM), which is performed statically using meta-features within docx files. Using machine learning algorithms with SFEM, we created a detection model that successfully detects new unknown malicious docx files. In addition, because it is crucial to maintain the detection model's updatability and incorporate the new malicious files created daily, ALDOCX integrates our active learning (AL) methods, which are designed to efficiently assist anti-virus vendors by better focusing their analytical efforts and enhancing detection capability. ALDOCX identifies and acquires new docx files that are most likely malicious, as well as informative benign files; these files are used for enhancing the knowledge stores of both the detection model and the anti-virus software. The evaluation results show that by using ALDOCX with SFEM to detect malicious docx files, we achieved a high detection rate (93.6% TPR) compared to anti-virus software (85.9% TPR) – with very low FPR rates (0.2%). ALDOCX's active learning methods used only 7.7% of the labeled docx files, and this led to a reduction of 91.4% in labeling efforts. Our AL methods also showed a significant improvement of 91% in unknown docx malware acquisition compared to passive learning and an existing AL method, thus providing an improved updating solution for the detection model, as well as the anti-virus software widely used by organizations.

Categories and Subject Descriptors
D.2.0 [Computer-Communication-Networks]: Security and protection

General Terms
Algorithms, Measurement, Security, Human Factors.

Keywords
Active Learning; Machine Learning; Structural; Documents; Microsoft Office files; docx; Malware; Malicious.

1. INTRODUCTION
Since 2009, cyber-attacks against organizations have increased, and 91% of all organizations were hit by cyber-attacks in 2013.1 The vast majority of organizations rely heavily on email for internal and external communication, and email has become an attractive platform from which to initiate cyber-attacks against organizations. Attackers usually use social engineering in order to encourage recipients to press a link or open a malicious web page or attachment. According to TrendMicro,2 attacks, particularly those against government agencies and large corporations, are almost entirely launched through spear-phishing emails.3 In addition, most users today consider non-executable files safer than executable files, and therefore users are less suspicious of such files. However, non-executable files are as dangerous as executable files, because their readers may contain vulnerabilities that can be exploited for malicious purposes by attackers. Cybercriminals launch attacks through Microsoft Office files,4 taking advantage of the fact that Office documents are widely used among most organizations. Cybercriminals also take advantage of the fact that organizations' employees don't usually take precautions when receiving and opening these files.
In an attempt to prevent or mitigate these types of attacks, Microsoft Office provides several mechanisms such as macro security level, trusted locations, and digital signatures. However, Dechaux et al. [2] showed that these mechanisms can easily be bypassed. In order to effectively analyze thousands of new, potentially malicious Microsoft Office files on a daily basis, anti-virus vendors have integrated a component of a detection model based on machine learning and rule-based algorithms [13] into the core of their signature repository update activities. However, these solutions are often ineffective, because their knowledge base is not adequately updated. This occurs because an enormous number of new, potentially malicious Microsoft Office files are created each day, and only a limited number of security researchers are tasked with manually inspecting and labeling them. Thus, there is a need to prioritize which files should be acquired, analyzed, and labeled by a human expert. As Microsoft Word files are the most popular Microsoft Office files used by organizations, our work focuses on this type of file. In this study we present ALDOCX, a framework and new structural feature extraction methodology (SFEM) aimed at the accurate detection of malicious Microsoft Word XML-based (*.docx) files. ALDOCX is an active learning based framework for frequently updating anti-virus software with docx files. ALDOCX focuses on improving the performance of the detection model and anti-virus software by labeling the docx files (potentially malicious or informative benign files) that are most likely to add new and discriminative information. By doing so, ALDOCX enriches the signature repository with as many new malicious docx files as possible, further enhancing the detection process. Specifically, the ALDOCX framework favors files that contain new content. In this paper we focus on PCs and docx files, the platform and documents most used by organizations. Microsoft Word legacy binary files (*.doc) are beyond the scope of this paper, as they have a file structure that substantially differs from the new XML-based files (*.docx) and requires a feature extraction methodology that is specifically tailored for them. Our contributions in this paper are:
- Presenting SFEM, a new methodology of feature extraction based on structural features within a docx file and capable of providing accurate detection of malicious docx files.
- Presenting the use of machine learning algorithms for the detection of malicious Microsoft Word XML-based documents using a new structural feature extraction methodology (SFEM).
- Developing ALDOCX, a framework based on active learning methods for enhancing and maintaining detection capabilities and updatability.

1 http://www.humanipo.com/news/37983/91-of-organisations-hit-by-cyber-attacks-in-2013/
2 http://www.infosecurity-magazine.com/view/29562/91-of-apt-attacks-start-with-a-spearphishing-email/
3 http://searchsecurity.techtarget.com/definition/spear-phishing
4 http://securelist.com/blog/research/65414/obfuscated-malicious-office-documents-adopted-by-cybercriminals-around-the-world/

2. BACKGROUND
Microsoft Office 97 and more recent versions used the legacy binary file as their default file format (referred to as "Microsoft Office 97-2003"). These files can be read only by using Microsoft Office. The extensions for the well-known Microsoft Word, Excel, and PowerPoint files are *.doc, *.xls, and *.ppt respectively. With the release of Microsoft Office 2007, Microsoft introduced an entirely new file format based on XML called "Open XML."5 The new file format applies to Microsoft Word, Excel, and PowerPoint and comes with the addition of the "x" suffix to the recognized file extension: *.docx, *.xlsx, and *.pptx. The files are automatically compressed (up to 75% in some cases) using Zip technology, and when the file is opened, it is automatically unzipped.
In 2013, Schreck et al. [1] presented a new approach called BISSAM (Binary Instrumentation System for Secure Analysis of Malicious Documents) with three purposes: distinguishing malicious documents from benign documents, extracting embedded malicious shellcode, and detecting and identifying the
based on vulnerability signatures. pyOLEScanner.py9 is a Python based script that can examine and decode some aspects of malicious legacy binary Microsoft Office files. The Threat Emulation10 product conducts dynamic analysis and executes the suspicious file in multiple operating systems and with different versions of viewing programs (such as Microsoft Office or Adobe Reader), and if the tool detects that an unusual network connection has been established or changes have been made to the File System, Registry, or processes, the file will be labeled as malicious. However, dynamic analysis has several disadvantages, including its high resource costs, computational complexity, and the length of time it requires; in addition, it can also be detected by the executed malicious code, which may consequently cease its malicious operations. On the other hand, static analysis methods have several advantages over dynamic analysis. First, they are virtually undetectable—the docx file cannot detect that it is being analyzed, since it is not executed. While it is possible to create static analysis "traps" to deter static analysis, these traps can themselves be leveraged and used to detect malware using static analysis. In addition, since static analysis is relatively efficient and fast, it can be performed in an acceptable timeframe, minimizing bottlenecks. Static analysis is also easy to implement, monitor, and measure. Moreover, it scrutinizes the file's "genes" and not its current behavior, which can be changed or delayed to an unexpected time or specific conditions in order to evade detection by dynamic analysis.
The above mentioned tools aimed at malicious Microsoft Office files can be categorized into two groups. The first is aimed at detecting only Microsoft Office legacy binary files. The second group is also capable of analyzing the new XML-based format files, doing so through dynamic analysis with all of its limitations. Both groups share the same significant disadvantage in that they use deterministic analysis and/or rule-based heuristics in order to detect maliciousness; yet none of them applies machine learning algorithms, which have many known advantages, including generalization capabilities and the ability to discover hidden patterns for better detection of the new XML-based format Office files. As anti-viruses are capable of detecting only known malicious files and their relative variants, their ability to detect new and unknown malicious Office files is limited.
There is a vulnerability exploited by a malicious document. The system distinct lack of high performance methods for the detection of inspects the Office file, extracts the embedded malicious shellcode malicious Microsoft Office XML-based files, particularly the from it, and then automatically determines what type of widely used docx files. Moreover, given the enormous number of vulnerability the malicious code targets. The system focuses on documents created daily, the strength of a detection method lies in Microsoft Office 2003 and 2007 running on Microsoft Windows its ability to be continuously up-to-date (i.e., its updatability XP, but it can be also used with other applications. resulting from efficient and frequent updates of both the detection model and commonly used anti-virus software). A number of other tools aimed at detecting malicious Office files exist. OfficeMalScanner6 is a free forensic tool used to scan for Many studies have focused on the detection of malicious PDF malicious traces, such as shellcode heuristics, PE-files or documents, as was surveyed by Nissim et.al [8]. While PDF is embedded OLE streams in legacy binary Microsoft Office files. also a very popular type of documents, these studies are not Another tool called OfficeCat7 is a file checker based on directly relevant, because PDF files differ significantly from docx vulnerability signatures, aimed at the detection of various files, in two important ways that highlight the novelty and vulnerabilities in Microsoft Office legacy binary files. Microsoft contributions made by our framework. First, PDF files consist of a OffVis,8 a free tool that provides visualization for displaying the set of related objects, while docx files are archival files consisting raw content and structure of Microsoft Office legacy binary of folders and XML files. Given this, applying a structural based documents and identifies some common exploits, which is also approach for malicious PDF detection [14] on malicious docx files would be an unsuccessful detection approach. Second, PDF and docx files are associated with different attack techniques, and even when they do share an attack technique (e.g., embedded- 5 http://office.microsoft.com/en-001/help/introduction-to-new-file-name-extensions- files), the attack is launched very differently, such that the same HA010006935.aspx 6 http://www.reconstructer.org/code.html 7 http://www.aldeid.com/wiki/Officecat 9 http://www.aldeid.com/wiki/PyOLEScanner 8 http://www.microsoft.com/en-us/download/details.aspx?id=2096 10 https://www.checkpoint.com/products/threatcloud-emulation-service/ attack affects the file structure differently in each case. Therefore, 5. FRAMEWORK AND METHODS the detection of malicious docx files requires different and customized structural features based on the special file structure 5.1 Structural Feature Extraction of these files. Methodology (SFEM) We now present our new structural feature extraction 3. MICROSOFT OFFICE FILES methodology (SFEM) in which we use the hierarchical nature of SECURITY THREATS an Office file and convert it to a list of unique paths. Figure 2 includes the beginning of the full list of unique paths extracted Several security threats associated with Microsoft Office from the sample file and from its XML files (presented in Figure documents files follow. Microsoft Office files can contain macro 1), after unzipping it. The red paths represent directories and the which is an embedded code written in Visual Basic for purple paths represent files. 
Since XML files have a hierarchical Applications (VBA). Macro is a legitimate component that can be nature as well, we converted the XML files within the Office files dangerous when used for malicious purposes. In Microsoft Office to a list of paths by concatenating the names of the hierarchal tags 2013 all macros are disabled by default, and notification of this within the XML, using ‘\’. Only the tags' names are concatenated; fact is provided. However, security level and trusted location tags' properties and properties’ values are ignored. Although tags' security features can be bypassed using techniques that were properties and their values contain a great deal of information that presented by Dechaux et.al [2]. In addition, malicious Office files 11 can be helpful, the integration of all the properties and their values can contain a malicious Object Linking and Embedding (OLE ) will exponentially increase the number of extracted paths and the object. An OLE package object may contain any file or command extraction process time. The green paths represent a path of tags line. If the user double-clicks on the object (located in the within an XML file. The total path count in the list is 425. The document), the file or the command is launched. Microsoft Office paths represent the file’s properties and actions. For example, the files can embed active content such as binary files, command and “\word\media\image1.png” path means that the document script files, Hypertext Markup Language files which can contain contains an image, and since this is the only path under the malicious JavaScript code, and other document files such as *.pdf, “\word\media\” path, we know that this is the only media item in *.doc, *.xls, and *.ppt which can also be malicious. Embedded the file. There are a couple of paths whose presence in the file can malicious files can be automatically executed when the container indicate the presence of macro code, one being the file is opened using the macro. “\word\vbaData.xml” path. [Document Folder]\[Content_Types].xml 4. OFFICE FILE STRUCTURE [Document Folder]\[Content_Types].xml\Types [Document Folder]\[Content_Types].xml\Types\Default An introduction to the [Document Folder]\[Content_Types].xml\Types\Override structure of a viable docx file [Document Folder]\docProps [Document Folder]\docProps\app.xml is provided. Figure 1 shows [Document Folder]\docProps\app.xml\Properties the directory tree of a sample [Document Folder]\docProps\app.xml\Properties\Template *.docx file. The actual [Document Folder]\docProps\app.xml\Properties\Template\#text [Document Folder]\docProps\app.xml\Properties\TotalTime content of the file is stored in [Document Folder]\docProps\app.xml\Properties\TotalTime\#text various XML files located in [Document Folder]\docProps\app.xml\Properties\Pages different folders. Each XML [Document Folder]\docProps\app.xml\Properties\Pages\#text [Document Folder]\docProps\app.xml\Properties\Words file holds the information of [Document Folder]\docProps\app.xml\Properties\Words\#text a different component. For [Document Folder]\docProps\app.xml\Properties\Characters example, ‘styles.xml’ file [Document Folder]\docProps\app.xml\Properties\Characters\#text [Document Folder]\docProps\app.xml\Properties\Application holds the styling data. [Document Folder]\docProps\app.xml\Properties\Application\#text ‘app.xml’ and ‘core.xml’ [Document Folder]\docProps\app.xml\Properties\DocSecurity hold the metadata of the Figure 2: List of paths extracted from a sample *.docx file. 
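The path-extraction step described above (unzipping the docx archive and concatenating XML tag names with '\', while ignoring tag attributes and their values) can be illustrated with a short Python sketch. This is a minimal illustration under our own assumptions, not the original SFEM implementation: it uses only the standard library, the function names are ours, and the '#text' leaf entries shown in Figure 2 are omitted for brevity.

import zipfile
import xml.etree.ElementTree as ET

def _strip_ns(tag):
    # Drop the XML namespace, keeping only the local tag name.
    return tag.split('}', 1)[-1]

def xml_tag_paths(xml_bytes, prefix):
    # Concatenate hierarchical tag names with '\'; tag attributes and
    # attribute values are ignored, as in the SFEM description.
    paths = set()
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return paths
    stack = [(root, prefix + '\\' + _strip_ns(root.tag))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + '\\' + _strip_ns(child.tag)))
    return paths

def sfem_paths(docx_path):
    # Statically extract the set of unique structural paths from a *.docx
    # file: archive entries (folders and files) plus tag paths inside each
    # embedded XML part. The file is never executed.
    paths = set()
    with zipfile.ZipFile(docx_path) as z:
        for name in z.namelist():
            entry = name.replace('/', '\\')
            parts = entry.split('\\')
            # Add every prefix so inner directory nodes become features too.
            for i in range(1, len(parts) + 1):
                paths.add('\\'.join(parts[:i]))
            if name.lower().endswith('.xml'):
                paths |= xml_tag_paths(z.read(name), entry)
    return paths

For the sample file of Figure 1, a call such as sfem_paths('sample.docx') would return a set containing entries like 'word\media\image1.png' alongside the tag paths found inside each XML part.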
document (i.e., the title, document's author, the Note that the leaves, as well as the nodes in the directory tree have number of lines and words, been included. We do this in order to maintain the generality of etc.). Our framework aimed the higher nodes in the tree. For example, the path at enhancing the detection of '\word\vbaData.xml' which is the ancestor of the leaf path: malicious docx files is based 'word\vbaData.xml\wne:vbaSuppData\wne:mcds\wne:mcd', was on this file structure. found to be a more powerful feature than its descendant path as it indicates the presence of macro code in the file, and not just a

specific property in the vbaData.xml files. The extraction process Figure 1: Example of an is done statically without executing the file and is conducted quite unzipped *.docx file. quickly at the rate of 270ms for an average file. SFEM advantages include: - SFEM does not focus on the extraction and analysis of malicious code inside the document (which does not always exist), and because of this, it cannot be evaded by code obfuscation techniques and is therefore a more general and robust approach for many types of attacks.

- SFEM aims to be robust against a mimicry attack in which 11 http://office.microsoft.com/en-us/excel-help/create-edit-and-control-ole-objects- changes were made to a malicious file to make it appear benign. HP010217697.aspx These changes should not affect our method significantly, since the features of malicious actions still exist deep inside the The second type of informative files includes those that lie structure of the file. deep inside the malicious side of the SVM margin and are a - SFEM also aims to be robust against a reverse mimicry attack maximal distance from the separating hyperplane according to in which changes are made to a benign file in order to make it Equation 1. These docx files which are a maximal distance from malicious. These changes should not affect our method the labeled files will be acquired by the Exploitation method significantly, because the changes related to the new malicious (explanation will follow) as this distance is measured by the KFF behavior will be reflected in the file structure, and given this, calculation [12]. These informative files are then added to the the features of malicious actions will probably be detected. training set {6} for updating and retraining the detection model {7}. The files that were labeled as malicious are also added to the - The extraction process is done statically, without executing the anti-virus signature repository in order to enrich and maintain its file, and therefore it is more secure, fast, and requires less updatability {8}. Updating the signature repository also requires computational resources; as a result it can be deployed over an update to clients utilizing the anti-virus application. The endpoint and lightweight devices as well (e.g., smartphones). framework integrates two main phases: training and 5.2 A Framework for Enhancing Detection detection/updating. Capabilities Training: A detection model is trained over an initial training set Figure 3 illustrates the framework and process of detecting that includes both malicious and benign docx files. The initial and acquiring new malicious docx files by maintaining the performance of the detection model is evaluated after the model is updatability of the anti-virus and detection model. In order to tested over a stream that consists solely of unknown files that maximize the suggested framework’s contribution, it should be were presented to it on the first day. deployed in strategic nodes (such as ISPs and gateways of large Detection and updating: For every unknown docx file in the organizations) over the Internet network in an attempt to expand stream, the detection model provides a classification, while the its exposure to as many new files as possible. Widespread AL method provides a rank representing how informative the file deployment will result in a scenario in which almost every new is. The framework will consider acquiring the files based on this file goes through the framework. If the file is informative enough information. After being selected and receiving their true labels or is assessed as likely to be malicious, it will be acquired for from the expert, the informative docx files are acquired by the manual analysis. As Figure 3 shows, the docx files transported training set. The signature repository is also updated, in case the over the Internet are collected and scrutinized within our files are malicious. The detection model is retrained over the framework {1}. 
Then, the "known files module" filters all the updated and extended training set which now also includes the known benign and malicious docx files {2} (according to a white acquired examples that are regarded as being very informative. At list, reputation systems [3], and anti-virus signature repository). the end of the day, the updated model receives a new stream of The unknown docx files are then transformed into vector form unknown files on which the updated model is once again tested using our structural feature extraction methodology (SFEM); and from which the updated model again acquires informative these vectors represent the new files and are used for the advanced files. Note that the goal is to acquire as many malicious docx files check {3}. as possible since such information will maximally update the anti- The remaining docx files which are unknown, are then virus software that protects most organizations. We employed introduced to the detection model based on SVM and AL. The several algorithms in order to induce detection models, including detection model scrutinizes the docx files and provides two values the SVM classification algorithm with the radial basis function for each file: a classification decision using the SVM (RBF) kernel in a supervised learning approach using the Lib- classification algorithm and a distance calculation from the SVM [6] implementation. We expect that the SVM classification separating hyperplane using Equation 1 {4}. A file that the AL algorithm will provide the best results since SVM has proven to method recognizes as informative and has indicated should be be very efficient at enhancing the detection of malware when acquired is sent to an expert who manually analyzes and labels it combined with AL methods [4], [5], [7], [8]. {5}. We aim to frequently update the anti-virus software by acquiring these informative docx files and focusing the expert’s efforts on labeling docx files that are most likely to be malware or on benign docx files that are expected to improve the detection model. Note that informative files are defined as those that when added to the training set, improve the detection model's predictive capabilities and enrich the anti-virus signature repository. Accordingly, in our context there are two types of files that may be considered informative. The first type includes files in which the classifier has limited confidence as to their classification (the probability that they are malicious is very close to the probability that they may be benign). Acquiring them as labeled examples will probably improve the model’s detection capabilities. In practical terms, these docx files will have new structural paths or special combinations of existing structural paths that represent their execution code (e.g., inside the macro code of the docx file). Therefore these files will probably lie inside the SVM margin and consequently will be acquired by the SVM-Margin strategy that Figure 3: ALDOCX framework - the process of maintaining selects informative files, both malicious and benign, that are a the updatability of the anti-virus tool and the detection model short distance from the separating hyperplane. using AL methods.

5.3 Selective Sampling and Active Learning malicious files that are very similar to one another and belong to the same virus family is considered a waste of manual analysis Methods resources, since these files will probably be detected by the same Since our framework aims to provide solutions to real problems it signature. Thus, acquiring one representative file for this set of must be based on a sophisticated, fast, and selective high- new malicious files will serve the goal of efficiently updating the performance sampling method. We compared our proposed AL signature repository. In order to enhance the signature repository methods to other strategies, and the methods considered are as much as possible, we check the similarity between the selected described below. files using the kernel farthest-first (KFF) method suggested by 1) Random Selection (Random) Baram et al. [12] which enables us to avoid acquiring very similar While random selection is obviously not an active learning examples. Consequently, only the representative files that are method, it is at the "lower bound" of the selection methods most likely malicious are selected. In cases in which the discussed. We are unaware of an anti-virus tool that uses an active representative file is determined as malware by the security expert, learning method for maintaining and improving its updatability. all variants that were not acquired will be detected the moment the Consequently, we expect that all AL methods will perform better anti-virus is updated. In cases in which these files are not actually than a selection process based on the random acquisition of files. variants of the same malware, they will be acquired the following day (after the detection model has been updated), as long as they 2) The SVM-Simple-Margin AL Method (SVM- are again determined to most likely to be malware. In Figure 4 it Margin) can be observed that there are sets of relatively similar files (based The SVM-Simple-Margin method [9] (referred to as SVM- on their distance in the kernel space), however, only the Margin) is directly related to the SVM classifier. However, in representative files that are most likely to be malware are contrast to our methods, it selects examples according to their acquired. Exploitation explores the "malicious side" to discover distance from the separating hyperplane in order to explore and new and unknown malicious files that are essential for the acquire the informative files without relation to their classified frequent update of the anti-virus signature repository, a process labels (i.e., without specifically focusing on malware instances). which occasionally also results in the discovery of benign files The SVM-Margin AL method is very fast and can be applied to (files which will likely become support vectors and update the real problems; yet, as its authors indicate [9], this agility is classifier). Figure 4 presents an example of a file lying far inside achieved, because it is based on rough approximation and relies the malicious side that was found to be benign. The distance on assumptions that the VS is fairly symmetric and that the calculation required for each instance in this method is fast and hyperplane's Normal (W In Equation 2) is centrally placed, equal to the time it takes to classify an instance in a SVM assumptions that have been shown to fail significantly [11]. The classifier, thus it is applicable for products working in real-time. 
method may query instances in which the hyperplane does not intersect the VS, and therefore may not be informative. The SVM- Margin method for detecting instances of PC malware was used by Moskovitch et.al [10] whose preliminary results found that the method also assisted in updating the detection model but not the anti-virus application itself. This serves as our baseline AL method, and we expect our method to improve the new malicious docx detection and acquisition seen in SVM-Margin. 3) Exploitation: Our Proposed Active Learning Method Our method, "Exploitation," is based on SVM classifier principles and is oriented towards selecting examples most likely to be malicious that lie furthest from the separating hyperplane. Thus, our method supports the goal of boosting the signature repository Figure 4: The criteria by which Exploitation acquires new of the anti-virus software by acquiring as much new malware as unknown malicious docx files. These files lie the farthest from possible. For every file X that is suspected of being malicious, the hyperplane and are regarded as representative files. Exploitation rates its distance from the separating hyperplane using Equation 1 based on the Normal of the separating 4) Combination: A Combined Active Learning Method hyperplane of the SVM classifier that serves as the detection The "Combination" method lies between the SVM-Margin and model. As explained above, the separating hyperplane of the SVM Exploitation methods. On the one hand, in order to acquire the is represented by W, which is the Normal of the separating most informative docx files (acquiring both malicious and benign hyperplane and is actually a linear combination of the most files), the combination method begins by acquiring examples important examples (supporting vectors), multiplied by LaGrange based on SVM-Margin criteria, an exploration phase which is multipliers (alphas) and the kernel function K that assists in important in order to enable the detection model to discriminate achieving linear separation in higher dimensions. Accordingly, the between malicious and benign docx files. On the other hand, the distance in Equation 1 is simply calculated between example X combination method then tries to maximally update the signature and the Normal (W) presented in Equation 2. repository in an exploitation phase, drawing on the Exploitation n n method. This means that in the early acquisition period, during the   first part of the day, SVM-Margin is more dominant compared to Dist(X )   i yi K(xi x) w   y (x )   i i i Exploitation. As the day progresses, Exploitation becomes  1  1 Equation 1 Equation 2 predominant. However, Combination was also applied in the In Figure 4, the files that were acquired (marked with a red circle) course of the ten day experiment, and over a period of days, are the files classified as malicious and are at the maximum Combination performs more Exploitation than SVM-Margin. This distance from the separating hyperplane. Acquiring several new means that on the ith day there is more Exploitation than in the (i- representation to indicate the presence (1) or absence (0) of a 1)th day. feature within a file. We found that a balanced division of labor with SVM-Margin and In order to check how the number of features affects the detection Exploitation achieved the best performance. In short, this method rates, we also took into consideration subsets of prominent tries to take the best from both of these methods. 
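Equations 1 and 2 were garbled by the layout conversion; restated in LaTeX from the surrounding description, Exploitation rates a candidate file x by its kernel-based distance from the separating hyperplane, whose normal w is a weighted combination of the support vectors:

\mathrm{Dist}(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) \quad \text{(Equation 1)}

w = \sum_{i=1}^{n} \alpha_i \, y_i \, \Phi(x_i) \quad \text{(Equation 2)}

A minimal sketch of the Exploitation selection step follows, again assuming the scikit-learn model from the earlier sketch. The RBF-kernel distance used to skip near-duplicate variants is a simplified stand-in for the kernel farthest-first (KFF) calculation of Baram et al. [12], and the min_dist threshold is an illustrative parameter, not a value from the paper.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_distance(a, b, gamma):
    # Squared distance in the RBF feature space:
    # ||phi(a) - phi(b)||^2 = K(a,a) + K(b,b) - 2*K(a,b) = 2 - 2*K(a,b).
    return 2.0 - 2.0 * rbf_kernel(a.reshape(1, -1), b.reshape(1, -1),
                                  gamma=gamma)[0, 0]

def exploitation_select(model, candidates, budget, gamma, min_dist=0.5):
    # Pick up to `budget` files lying deepest on the malicious side of the
    # hyperplane, skipping candidates too similar (in kernel space) to files
    # already chosen, so that near-duplicate variants are not acquired.
    scores = model.decision_function(candidates)   # Equation 1 per candidate
    order = np.argsort(-scores)                    # farthest first
    chosen = []
    for idx in order:
        if scores[idx] <= 0:                       # reached the benign side
            break
        if all(kernel_distance(candidates[idx], candidates[j], gamma) >= min_dist
               for j in chosen):
            chosen.append(idx)
        if len(chosen) == budget:
            break
    return chosen

Combination can then be viewed as splitting the daily acquisition budget between the SVM-Margin criterion (smallest absolute distance) and this routine, shifting the split toward Exploitation as the day, and the ten-day experiment, progresses.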
features and created 8 datasets using the 10, 40, 80, 100, 300, 500, 800, and 1000 top features. The top features in which the 5) Comb-Ploit: A Combined Active Learning Method best detection rates are achieved, will be the features that will be Comb-Ploit is the opposite of Combination, as it starts with used for further experiments. Among the 11 most prominent Exploitation in the early stages of the acquisition of new features, features 1 to 8 are related to the existence of macro code informative malicious docx files and then tries to acquire in the document and its activation, feature 9 is related to the generally informative docx files. The motivation behind this existence of embedded files, and feature 10 is related to OLE approach is that most of the malicious docx files are very objects in the document. Feature 11 signifies the presence of an informative and acquired during the very early stages. The method *.emf image in the document. then changes its strategy and tries to improve its knowledge store by also acquiring benign docx files. As our framework is also Deeper analysis of these features and their percentages of aimed at acquiring more malicious files that are used for occurrence within benign and malicious files contributes to better enhancing the anti-virus tools, we suggested the Comb-Ploit understanding of the dataset's composition. About 44% of the method as an additional contributive selection strategy. malicious files contain macro code, whereas among the benign files only 0.13% to 0.16% contain macro code. Additionally, 6. EVALUATION almost 60% of the malicious files contain an embedded file compared to only 3.14% of the benign files. The most popular 6.1 Dataset Collection attacks through docx files are launched via macro, file embedding, In order to evaluate our proposed framework and methods, and OLE categories, and it is significant that the most prominent we created a large and representative dataset of malicious and features belong to these categories. Note also that the eleventh benign Microsoft Word files (*.docx). We acquired a total of feature is indicative of the presence of an EMF (Enhanced Meta 16,811 files, including 327 malicious and 16,484 benign files, File). We checked the appearance of the EMF feature (11) in our from the three sources presented in Table 1. Our dataset contains dataset and came to the conclusion that it appears in many 1.9% malicious documents in order to adequately reflect the families - not solely in many variants of one family within the reality as closely as possible. During the paper's composition, we malicious files (40.37%). Therefore, the EMF's existence is used all 327 existing malicious docx files in the VirusTotal12 actually a strong, additional indication of maliciousness. repository (as samples of the old format, *.doc files, are not relevant), and we used only malicious-files detected as malicious 6.3 Experimental Design by at least five anti-virus engines. All the files were assured to be Our experimental design aims at providing clear and practical labeled correctly as malicious or benign using VirusTotal service. answers to the following research question:

Table 1: Dataset sources. On a daily basis, is it possible to improve both the detection Malicious Benign model’s performance and the anti-virus detection Dataset Source Year files files capabilities by enlarging the signature repository with new VirusTotal repository 2010-2014 327 15,517 unknown malicious docx files using AL methods and the new structural feature extraction methodology? Contagio13 collection 2013 ---- 100 Ben-Gurion Uni’ (Random) 2010-2014 ---- 867 For this question, we designed a comprehensive and specific Total 327 16,484 experiment, and here we present an experiment pertaining to that research question.
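The dataset construction described above ranks the 134,854 extracted paths by Information Gain and keeps the top-ranked paths as Boolean features. For Boolean features and a binary class, Information Gain reduces to a short entropy calculation; the sketch below is illustrative (function names are ours), assuming X is a binary file-by-path matrix and y holds the class labels (1 = malicious, 0 = benign). An equivalent ranking, up to the logarithm base, could likely be obtained with scikit-learn's mutual_info_classif on discrete features.

import numpy as np

def entropy(y):
    # Shannon entropy of a binary label vector.
    p = float(np.mean(y))
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(feature, y):
    # IG of a single Boolean feature (path present / absent).
    present = feature == 1
    p_present = float(np.mean(present))
    ig = entropy(y)
    for mask, weight in ((present, p_present), (~present, 1 - p_present)):
        if weight > 0:
            ig -= weight * entropy(y[mask])
    return ig

def top_k_paths(X, y, paths, k=5000):
    # Rank all structural paths by Information Gain and keep the top k
    # (the "Top 5000" list used to build the experimental dataset).
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    order = np.argsort(gains)[::-1][:k]
    return [paths[j] for j in order], order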

6.2 Dataset Creation 6.3.1 Updatability and Detection Enhancement We developed a feature extractor based on our structural feature Evaluation extraction methodology (SFEM) in order to extract and analyze The objective of this experiment was to evaluate and the features from all the files in the dataset. The feature extraction compare the performance of our AL methods to the existing process resulted in a vocabulary of 134,854 unique features selection method, SVM-Margin and passive learning (random (paths) extracted from both benign and malicious files. In order to selection), based on two tasks: select the most prominent features from the list of 134,845 features, we used a feature selection method based on Information - Acquiring as many new, unknown malicious docx files as Gain [15]. We sorted the features according to their Information possible on a daily basis in order to efficiently enrich the Gain grade and were left with a list of the 5,000 most prominent signature repository of the anti-virus. features, which we called the “Top 5000.” Using the ranked list of - Updating the predictive capabilities of the detection model the 5,000 most prominent features and the collection of 16,811 that serves as the knowledge store of AL methods and *.docx files (benign and malicious), we created the dataset for our improving its ability to efficiently identify the most experiments which contained 16,811 records and 5,001 fields - informative new malicious docx files. field 5,001 represents the class of the file, “Malicious” if the Over a ten day period, we compared docx file acquisition record represents a malicious docx file and “Benign” if the record based on AL methods to random selection based on the represents a benign docx file. We chose to use a Boolean performance of the detection model. In our acquisition experiments we used 16,811 docx files (16,484 benign, 327 malicious) from our repository and created ten randomly selected 12 https://www.virustotal.com/ datasets with each dataset containing ten subsets of 1,600 files 13 http://contagiodump.blogspot.co.il/ which represent each day’s stream of new files. The 811 sent to the security expert for inspection was reasonable, in terms remaining files were used by the initial training set to induce the of the number that could be dealt with within one day, and was initial model. The experiment’s steps are as follows: also a meaningful amount that could contribute to the efficient 1. Inducing the initial detection model from the initial and frequent update of the detection model and the anti-virus. available training set, i.e., the training set available up to We now present the results of the core measure in this study, day d (the initial training set includes 811 docx files). the number of new, unknown malicious files that were discovered 2. Evaluating the detection model on the stream of day (d+1) and finally acquired into the training set and signature repository to measure its initial performance. of the anti-virus software. As explained above, each day the framework dealt with 1,600 new docx files, consisting of about 33 3. Introducing the stream of day (d+1) to the selective new, unknown malicious docx files (1.9%). Statistically, the more sampling method, which chooses the X most informative files that are selected daily, the more malicious files will be files according to its selection criteria and sends them to the acquired daily. Yet, using AL methods, we tried to improve the expert for manual analysis and labeling. 
number of malicious files acquired by means of existing solutions. 4. Acquiring the informative files and adding them to the More specifically, using our methods (Exploitation, Combination training set, as well as using their extracted signatures to and Comb-Ploit) we also sought to improve the number of files update the anti-virus signature repository. acquired by the SVM-Margin. Figure 5 presents the number of 5. Inducing an updated detection model from the updated malicious docx files obtained by acquiring the 50 files daily, by training set and applying the updated model on the stream of each of the five methods during the course of the ten day the next day (d+2). experiment. Exploitation and Combination outperformed the other Each selective sampling method was checked separately on five selection methods. Combination had a decrease on the first day different acts of file acquisition (each consisting of a different and then had the same performance as Exploitation over the number of docx files). This means that for each act of acquisition, following days. During the course of the ten days, SVM-Margin the methods were restricted to acquiring a number of files equal to and Random selection had a decreasing trend in the number of the amounts that followed, denoted as X: 10 files, 20 files, 50 malicious docx files acquired and showed the poorest files, 100 files, and 250 files. We also considered the acquisition performance in updating the detection model and anti-virus with of all the files in the daily stream (referred to as the ALL method), new malicious docx files. Comb-Ploit was designed to act 50% of which represents an ideal, but impractical way, of acquiring all the the time like Exploitation and during the remaining 50% of the new files and more specifically, all the malicious docx files. The time like SVM-Margin, and its acquisition performance is ALL method was considered in order to compare the effectiveness depicted accordingly. Exploitation succeeded in acquiring the of our methods against the maximal malicious docx file maximal number of malwares from the 50 files acquired daily and acquisition. Inducing the initial detection model from the initial outperformed all the other methods. available training set, i.e., the training set available up to day d (the initial training set includes 811 docx files). This process repeats itself on our dataset from the first day until the tenth day. The performance of the detection model was averaged for ten runs over the ten different datasets that were created. 7. RESULTS Our focus this paper is evaluation of updatability and detection enhancement using our AL methods, therefore the basic detection experiment is out of the scope of this paper, yet we will only conclude its results that serves us as a basis for the main AL process of this paper. After evaluating different classifiers and tops using these three important measures through 10 folds cross validation detection experiment, we can conclude that the configuration that provides the best detection capabilities is the SVM classifier trained on the top 100 features: TPR of 93.34%, Figure 5: The number of malicious docx files acquired by FPR of 0.19% and accuracy rate of 99.67%. Although using the the framework for different methods with the acquisition of 50 top 40, SVM achieved a 0.14% higher TPR rate, we will use the files daily. top 100 features because this had a lower FPR (0.1% lower). 
A On the first day, the number of new malicious docx files was false alarm on benign file (FP) is a very expensive event, which 21 since the initial detection model was trained on an initial set of should be minimized when dealing with detection of malicious 811 labeled docx files that consisted of only 21 malwares (1.9%). files, particularly within organizations. We wanted to measure the updatability improvement process We rigorously evaluated the efficiency and effectiveness of through our active learning based framework. Therefore, in order our AL framework, comparing five selective sampling methods: to have an initial detection model that can be improved over time (1) a well-known existing AL method, termed SVM-Simple- - we used 811 files only to induce the initial detection model Margin (SVM-Margin) based on [9]; our three proposed methods: which resulted in a TPR of 42.13% on the first day. (2) Exploitation, (3) Combination, (4) Comb-Ploit, and (5) On the tenth day, using Combination and Exploitation, 94% random-selection (Random) as a "lower bound." Each method was of the acquired files were malicious (31 out of 33); using SVM- checked for all five acquisition amounts in which the results were Margin, only 3% of the acquired files were malicious (1 file out of the mean of ten different folds. Due to space limitations, we 33—less than Random). This presents a significant improvement depicted the results of the most representative acquisition amount of 91% in unknown docx malware acquisition. Note that on the of 50 docx files which is not much more than the average number tenth day, using Random, only 6% of the acquired docx files were of malicious files (33) found in the daily stream. The 50 docx files malicious (2 out of 33).It can be seen that the performance of our methods was much closer to the ALL line which represents the classifier with the top 100 when trained with over 90% of the maximum malicious docx files that can be acquired each day. dataset. This led to a reduction of 91.4% in labeling efforts of Therefore, in comparing the acquisition graph lines of unknown docx files and also induced a more accurate detection Combination and Exploitation to the ALL graph line, the trend is model with a TPR of 93.6% compared to 93.34% TPR for the quite clear from the third day: each day, Combination and detection model without AL. The FPR rates were low especially Exploitation maintained the same high acquisition rates of 93- among our AL methods and varied between 0.13% and 0.21%. 94% of malicious files despite the fact that every day there were new, unknown files—a finding that demonstrates the impact of updating the detection model with new informative files which results in identifying new malwares for enriching the signature repository of the anti-virus. SVM-Margin AL method showed a decrease in the number of malwares acquired from the first day. This phenomenon can be explained by looking at the ways the methods work. The SVM-Margin acquires documents about which the detection model is less confident. Consequently, these documents are considered to be more informative but not necessarily malicious. As was explained previously, SVM-Margin selects new informative docx files inside the margin of the SVM. Over time and with the improvement of the detection model regarding more malicious files, it seems that the malicious files are less informative (due to the fact that malware writers frequently try to use upgraded variants of previous malwares). 
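The daily TPR and FPR trends discussed next (and plotted in Figure 6) can, roughly, be reproduced by applying the current detection model to each day's full stream of unknown files before acquisition; a minimal sketch, assuming the same model object as in the earlier sketches:

def daily_tpr_fpr(model, day_X, day_y):
    # True positive rate and false positive rate of the current model on
    # one day's stream (labels: 1 = malicious, 0 = benign).
    predictions = model.predict(day_X)
    tp = sum(1 for p, t in zip(predictions, day_y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predictions, day_y) if p == 1 and t == 0)
    positives = sum(1 for t in day_y if t == 1)
    negatives = len(day_y) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr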
Figure 6: The TPR of the framework over the ten days for Since these new malwares might not lie inside the margin, SVM- different methods through the acquisition of 50 docx files Margin may actually be acquiring informative benign files rather daily. than malicious ones. However, our methods of Combination and Exploitation are more oriented toward acquiring the most The results of the predictive capabilities demonstrated by the TPR informative files and those most likely to be malicious by and FPR measures indicate that our Exploitation and Combination obtaining informative docx files from the malicious side of the methods performed as well as the SVM-Margin method with SVM margin. As a result, an increasing number of new malwares regard to predictive capabilities (TPR and FPR) but, as can be are acquired; in addition, if an acquired benign file lies deep seen in Figure 5, much better than the SVM-Margin in acquiring a within the malicious side, it is still informative and can be used large number of new docx malwares daily and enriching the for learning purposes and to improve the next day’s detection signature repository of the anti-virus. capabilities. We have shown here that our AL methods We now compare the detection rate of our methods with the outperformed the SVM-Margin AL method and improved the leading anti-virus tools commonly used by organizations. From capabilities for acquiring new docx malwares and enriching the Figure 7, it can be seen that the most accurate anti-virus, signature repository of the anti-virus software. In addition, as is TrendMicro, had a detection rate of 85.9%, while our methods shown in Figure 6, our methods also maintain the predictive outperformed all the anti-viruses in the task of detecting new performance of the detection model that serves as the knowledge malicious docx files. Using the SVM classifier with 100 structural store of the acquisition process. features (SFEM), we achieved a detection rate of 93.4% which Figure 6 presents the TPR levels and their trends during the required using 90% of the dataset for training (due to the 10XV ten day course of study. SVM-Margin outperformed other settings). Using the full ALDOCX framework, including the AL selection methods in the TPR measure, while our AL methods, methods and its enhancement process, we also improved the Combination, Exploitation and Comb-Ploit, came very close to performance of SVM-SFEM, achieving a TPR of 93.6% using SVM-Margin (SVM-Margin achieved 0.4% better TPR rates than only 7.7% of labeled data (1,311 docx files out of 16,811), which Comb-Ploit, 0.5% better than Exploitation and 1.3% better than means a reduction of 91.4% in labeling efforts. Combination) and performed much better than Random. In addition, the performance of the detection model improves as more files are acquired daily, so that on the tenth day of the experiment, the results indicate that by only acquiring a small and well-selected set of informative files (50 docx files out of 1600 are 3% of the stream), the detection model can achieve TPR levels that are quite close to those achieved by acquiring the whole stream (93.6% with SVM-Margin, 93.2% with Comb-Ploit, 93% with Exploitation, and 92.3% with Combination, as compared with 94.4% for the whole stream). These factors demonstrate the benefits obtained by performing this process on a daily basis. To better demonstrate the contribution of AL methods within ALDOCX, we will now compare the TPR rates with and without AL methods. 
The TPR rate of 93.6% was achieved in this AL Figure 7: The TPR of the framework against anti-viruses process using only 1,311 docx files (811 initial set + 500 acquired commonly used by organizations. after ten days) out of the total 16,811, which is 7.7%. Whereas, in the previous experiment of detection evaluation (without AL), the best TPR rate was 93.34% which was achieved using the SVM 8. DISCUSSION AND CONCLUSION without AL using 90% of the docx files to be labeled. Our AL We presented ALDOCX, a framework for enhancing the detection methods also showed a significant improvement of 91% in of unknown malicious Microsoft Word documents using unknown docx malware acquisition compared to passive learning designated active learning methods. ALDOCX is based mainly on and SVM-Margin. These results show that the ALDOCX machine learning algorithms trained with our new Structural framework can be deployed on strategic nodes of the Internet Feature Extraction Methodology (SFEM), which is static, fast, network in order to actively identify, select, and acquire the most and robust against the different evasion attacks used by attackers. informative and most likely malicious docx files, and thus provide We evaluated our framework through a comprehensive series of an efficient, frequent, and valuable update for anti-virus software experiments using our large and representative dataset. We found that is widely used by organizations. We also compared the that the configuration that yielded the best results was the SVM detection rate of our methods with several widely used anti-virus classifier trained on the top 100 structural features, which engines commonly used by organizations. The best detection rate, achieved a TPR of 93.34%, FPR of 0.19%, and an accuracy rate 85.9%, was provided by TrendMicro, while our methods of 99.67%. If the number of features plays a significant role, then outperformed all the anti-viruses in the task of detecting new the top 40 could also be used for achieving nearly the same malicious docx files. Using the SVM classifier with 100 structural results. The number of features (top 40, top 100) showed that the features (SFEM), we achieved a TPR of 93.4% which required detection of malicious docx files with high TPR rates requires the using 90% of the dataset for training (due to the 10 folds XV consideration of more than simply the top ten "trivial" features settings). Using the full ALDOCX framework, including the AL (macro, embedding, OLE) and must include features that are methods and its enhancement process, we even improved the extracted from deep within the structure of the docx file. These performance of SVM-SFEM, achieving a 93.6% TPR using only non-trivial features are difficult for an attacker to learn and cannot 7.7% of the labeled data (1,311 docx files out a total of 16,811), be evaded easily. The results our results in the detection which represents a reduction of 91.4% in labeling efforts. In experiment show that the ALDOCX framework can be integrated future work, we are interested in extending this framework to the into Microsoft Office tools or deployed on endpoints in order to detection of additional Microsoft Office files, such as xlsx and detect malicious docx files. Our contribution is even greater since pptx which share the same XML based structure as the docx file. our new feature extraction methodology, SFEM, is general and aimed also at other Microsoft Office XML based files (e.g., Excel 9. REFERENCES [1] T. Schreck, S. Berger and J. Göbel. 
"BISSAM: Automatic vulnerability [*.xlsx], Power-Point [*.pptx]), a level of coverage that other identification of office documents," in Detection of Intrusions and existing techniques are incapable of. Malware, and Vulnerability AssessmentAnonymous 2013. Using ALDOCX’s designated AL methods, we showed that [2] J. Dechaux, E. Filiol and J. Fizaine. Office documents: New weapons of we can efficiently update anti-virus software with unknown cyberwarfare. 2010. malicious docx files. With our updated classifier, we can better [3] Jnanamurthy, H. K., Chirag Warty, and Sanjay Singh. "Threat Analysis detect new malicious docx files that can be utilized to sustain anti- and Malicious User Detection in Reputation Systems using Mean virus software. Both the anti-virus and the detection model Bisector Analysis and Cosine Similarity (MBACS)." (2013). (classifier) must be updated with new and labeled docx files. The [4] Nissim, N., Moskovitch, R., Rokach, L., and Elovici, Y. (2012). framework seeks to acquire the most informative docx files, Detecting unknown computer worm activity via support vector machines benign and malicious, in order to improve classifier performance, and active learning. Pattern Analysis and Applications, 15(4), 459-475. enabling it to frequently discover and enrich the signature [5] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Novel Active Learning Methods for Enhanced PC Malware Detection in Windows OS, repository of anti-virus software with new unknown malware. Expert Systems with Applications. In general, four of the AL methods performed very well at [6] Chang, C.C., and Lin, C. J. (2011). LIBSVM: a library for support vector updating the detection model, with two of our methods, machines. ACM Transactions on Intelligent Systems and Technology Combination and Exploitation, outperforming SVM-Margin in the (TIST),2(3), 27. study’s highest objective - the acquisition and detection of new [7] Nissim N, Cohen A, Moskovitch R, Barad O, Edry M, A S, Elovici Y. unknown malicious docx files. The evaluation of the classifier ALPD: Active Learning Framework for Enhancing the Detection of before and after daily acquisition showed an improvement in the Malicious PDF Files Aimed at Organizations. Proc’ of JISIC. 2014. detection rate, and subsequently more new malicious files were [8] Nir Nissim, Aviad Cohen, Chanan Glezer, Yuval Elovici Detection of acquired. On the tenth day, Combination and Exploitation Malicious PDF Files and Directions for Enhancements: a State of the Art acquired more than ten times more malicious docs files (31) than Survey, Computers & Security, Volume 49, November 2014, Pages 1-18. the number acquired by SVM-Margin (3 docx files) and more [9] S. Tong, and D. Koller, "Support vector machine active learning with applications to text classification," JMLR, 2:45–66, 2000-2001. than five times more malicious docx than those acquired by the Random method (6 malicious docx files). While our Combination [10] Moskovitch, R., Nissim, N., & Elovici, Y. (2009). Malicious code detection using active learning. In Privacy, Security, and Trust in and Exploitation methods showed a stable trend and almost KDD (pp. 74-91). Springer Berlin Heidelberg. perfect acquisition rates in the number of malicious docx files in [11] Herbrich, Ralf, Thore Graepel, and Colin Campbell. "Bayes point the course of the ten days, SVM-Margin and Random showed a machines."The Journal of Machine Learning Research 1 (2001): 245-279. steep decrease and poor performance along the ten days. 
[12] Baram, Y., El-Yaniv, R., and Luz, K.. Online choice of active learning Therefore our framework was found to be effective at updating the algorithms. Journal of Machine Learning Research, 5, 255-291, 2004. anti-virus software by acquiring the maximum number of [13] Kiem, H., Thuy, N.T., Quang, T.M.N., A machine learning approach to malicious PDF files. By acquiring only 50 docx files out of the anti-virus system (2004) Joint Workshop of Vietnamese Society of AI, new 1600 docx files presented to the framework each day, we SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, pp. 61- showed that our AL methods can provide a reduction of 91.4% in 65. , 4-7 December, Hanoi-Vietnam. the labeling efforts of unknown docx files (only 7.7% of the docx [14] N. Šrndic and P. Laskov. Detection of malicious pdf files based on files needed to be labeled) and also induced a more accurate hierarchical document structure. Presented at Proceedings of the 20th Annual Network & Distributed System Security Symposium. detection model, with a TPR of 93.6% compared to the results achieved in the detection experiment (TPR of 93.34%). The [15] Quinlan, J.R. (1993). C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 93.34% TPR rate in the detection experiment was achieved

6.2 Additional Accepted Papers in the Biomedical Informatics Domain

In this section we present our additional papers in the biomedical informatics domain, in which we extended our AL framework; these papers support the generality of the framework beyond the four core papers. The published versions of the papers are presented following this list.

[9] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak. "CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification." BigCHAT Workshop at the KDD Conference (2014).

Winner of the Mario Stefanelli Best Paper Award at the AIME 2015 conference:

[10] Nir Nissim, Mary Regina Boland, Robert Moskovitch, Nicholas Tatonetti, Yuval Elovici, Yuval Shahar, George Hripcsak. "An Active Learning Framework for Efficient Condition Severity Classification." Artificial Intelligence in Medicine (AIME 2015), Springer International Publishing, pp. 13-24 (2015).

[11] Nir Nissim, Mary Regina Boland, Nicholas P. Tatonetti, Yuval Elovici, George Hripcsak, Yuval Shahar, Robert Moskovitch. "Improving Condition Severity Classification with an Efficient Active Learning Based Framework." Journal of Biomedical Informatics, Volume 61, June 2016, Pages 44-54, ISSN 1532-0464.


CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification

Nir Nissim (1), Mary Regina Boland (2), Robert Moskovitch (2), Nicholas Tatonetti (2), Yuval Elovici (1), Yuval Shahar (1), George Hripcsak (2)
(1) Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel
(2) Biomedical Informatics, Columbia University, New York, New York, USA
nirni,elovici,[email protected]; mb3402,robert.moskovitch,nick.tatonetti,[email protected]

ABSTRACT phenotypes from EHRs for Phenome-Wide Association Studies [3]. Defining phenotypes from EHRs is a complex process Electronic Health Records (EHRs) are a treasure trove of health- because of discrepancies in definitions [4], data sparseness, data related data. Prioritizing conditions extracted from EHRs is important for minimizing the burden on medical experts, who often quality [5], bias [6], and healthcare process effects [7]. need to manually review patient conditions for accuracy. Severity is Currently, around 100 conditions/phenotypes have been useful for prioritizing and discriminating among conditions. successfully defined and extracted from EHRs. However, a short Recently, a framework called CAESAR (Classification Approach list of conditions ranked by priority (or severity) remains for Extracting Severity Automatically from Electronic Health lacking. Records), for classifying condition-level severity, was proposed. However, it used passive learning that requires extensive manual To generate a prioritized list of conditions, we sought to rank labeling efforts by medical experts of each condition severity. We them by their severity status by classifying conditions as either present CAESAR-ALE, an Active Learning (AL) based framework that uses several Active Learning (AL) methods to decrease the severe or mild at the condition-level. Classifying severity at the manual labeling efforts required. At each step in the algorithm, condition-level distinguishes acne as a mild condition from only the most informative conditions are labeled to train the myocardial infarction (MI) as a severe condition. In contrast, algorithm. Our results show that our first method, which we refer patient-level severity assesses whether a given patient has a mild to as Exploitation, reduced labeling efforts by 64% while achieving or severe form of a condition (e.g., acne). The bulk of the a true positive rate equivalent to that achieved by passive methods. literature focuses on patient-level condition severity with many Additionally, our second method that we refer to as indices being unique to a given condition [8-11]. However, none Combination_XA reduced labeling efforts by 48% while achieving accuracy equivalent to that achieved by passive learning. Our of these indices capture severity at the condition-level. proposed methods (Exploitation and Combination_XA) were superior in identifying a larger number of severe conditions, Others developed methods to study patient-specific condition compared to SVM-Margin and the Random methods with a severity at the whole-body level, e.g., the Severity of Illness reduction of 46% in labeling efforts. As for the PPV (precision) Index [12], for a wide range of conditions. This is useful for measure, CAESAR-ALE achieved a 71% relative improvement in characterizing patients as severe or mild manifestations of a the predictive capabilities of the framework when classifying given disease condition. However, it does not measure severity conditions as severe. These results demonstrate the potential of AL methods to decrease the labeling efforts of medical experts, while at the condition-level (e.g., acne vs. MI), which is required to increasing accuracy given the same or even a smaller number of prioritize conditions by severity and thereby reduce the selection acquired conditions. space to only the most severe conditions.

Keywords In this paper, we describe the development and validation of an Active Learning, Electronic Health Records, Phenotyping. Active Learning (AL) approach to classifying severity from EHRs. This builds on a previous passive learning approach 1. INTRODUCTION called CAESAR (Classification Approach for Extracting Connected health is increasingly becoming a common Severity Automatically from Electronic Health Records). We framework to improve health service. An important component call our Active Learning Enhancement of CAESAR, CAESAR- is diagnosing patients and labeling the severity of their ALE, and we demonstrate that it reduces the burden on medical diagnoses, which we focus in this study using active learning experts by minimizing the number of conditions requiring approach to decrease labeling efforts. Many national and severity assignment. CAESAR-ALE works well in the international organizations study conditions and their clinical biomedical domain by utilizing EHR-derived variables to assess outcomes. The Observational Medical Outcomes Partnership severity of EHR-derived conditions. (OMOP) standardized condition/phenotype identification and extraction from electronic data sources including Electronic Health Records (EHRs) [1]. The Electronic Medical Records and Genomics Network [2] successfully extracted some 20 2. BACKGROUND 2.3 Active Learning In prior work, a classification method was developed called Labeling examples, which is crucial for the learning process, is CAESAR (Classification Approach for Extracting Severity often an expensive task since it involves human experts. Active Automatically from Electronic Health Records) using a passive learning (AL) was designed to reduce labeling efforts by learning approach to capture condition severity from EHRs [13]. actively selecting the examples with the highest potential This method required medical experts to manually review contribution to the learning process of the classification model. conditions and assign severity status to each (severe or mild). AL is roughly divided into two major approaches: membership They assigned severity to a set of 516 conditions included in the queries [24], in which examples are artificially generated from reference standard. These severity assignments were then used the problem space and selective-sampling [25], in which to evaluate the quality of the classifier. The review of conditions examples are selected from a pool. Selective-sampling is used was limited to 516 conditions out of the 4,683 conditions in this paper. Studies in several domains have successfully included in the reference standard, because medical expert applied active learning in order to reduce the time and money review is time-consuming and costly. required for labeling examples. Unlike random learning, in which a classifier randomly selects examples from which to 2.1 SNOMED-CT learn, in active learning the classifier actively indicates the SNOMED-CT (Systemized Nomenclature of Medicine-Clinical specific examples that should be labeled and which are Terms) is a specialized ontology developed to capture commonly the most informative examples for the training task. conditions from EHRs obtained during the clinical encounter [14, 15]. SNOMED-CT is the terminology of choice of the Active learning approaches can be useful for selecting the most World Health Organization and the International Health discriminative conditions from the entire dataset in order to Terminology Standards Development Organization (IHTSDO). 
minimize the number of conditions that experts need to It also satisfies Meaningful Use requirements of the Health manually review. Doing this focuses experts’ efforts on a Information Technology component of the American Recovery smaller subset of conditions, thereby saving time and money. and Reinvestment Act of 2009 [16], and often clinical ontologies are used for the retrieval of clinical guidelines [39]. 2.4 Active Learning in Biomedicine Therefore, we used SNOMED-CT to extract patient conditions Although applications of active learning algorithms have been from EHRs treating each coded clinical event as a “condition” widely demonstrated in other domains, their applications in the or “phenotype,” knowing that this is a broad definition [4]. biomedical domain have been limited. Liu described a method similar to relevance feedback [26]. Warmuth et al. used a 2.2 Classification of Conditions similar approach to separate active (positive side) and inactive Classification of conditions in the biomedical domain typically (negative side) compounds [27]. More recently, active learning is based on two main methods: 1) manual approach where was reported to be useful in biomedicine for classification of experts assign labels to conditions; and 2) passive classification text [28] and radiology reports [29]. In all these cases active approaches (typically supervised) where a dataset is labeled learning methods were found to perform better than passive based on a subset of labeled data. learning. The Chronic Condition Indicator (CCI) was developed, as part of the Healthcare Cost and Utilization Project, using a totally 3. METHODS manual approach [17] to assign chronicity categories (acute vs. chronic) to ICD-9 codes. Medical experts were asked whether or 3.1 Dataset Development not a particular ICD-9 code was chronic. Disagreements were The dataset used in this study was developed as a result of prior handled by consultation with one of the physician panel work [13]. It contains 516 conditions (SNOMED-CT codes) members [17]. The CCI built on original work by Hwang et al. labeled as mild and severe. The labeling was performed using a [18], and it has been used successfully in multiple studies [18, set of expert heuristics, described in detail elsewhere [13] and 19] demonstrating the value of manual expert labeling. validated with five independent evaluators. The dataset also contains six severity measures for each condition: number of Others employed passive learning approaches, including Perotte comorbidities, number of procedures, number of medications, et al. who classified International Classification of Disease cost, treatment time, and a proportion term. These six measures version 9 (ICD-9) codes and showed that leveraging the ICD-9 were used to classify conditions previously in the original hierarchy outperformed treating ICD-9 codes as a flat list [20]. passive learning method (CAESAR). Therefore, when Another work by Perotte et al. classified conditions into constructing and testing CAESAR-ALE, we used exactly the chronicity categories [21]. Other machine learning approaches same dataset used in developing and evaluating CAESAR. have been used in biomedicine typically in the subfield of text classification. Torii et al. showed that the performance of 3.2 The CAESAR-ALE Framework automatic taggers improved when trained on a dataset The purpose of the CAESAR-ALE framework is to decrease the comprised of multiple data sources [22]. They also mention the labeling efforts of an expert. 
Using active learning, the need to have more documents available for training to improve framework actively asks the expert to label a specific condition performance [22], a common issue in passive learning as severe or mild, rather than label randomly selected techniques. Nguyen et al. built an algorithm for classifying lung conditions. The workflow of the CAESAR-ALE framework is cancer stages (tumor, node, and metastasis) using pathology described in figure 1. reports and SNOMED-CT [23].

Figure 1 illustrates the framework and the process of labeling both severe and mild, that are a short distance from the and acquiring new conditions by maintaining the updatability of separating hyper-plane. The second type of informative the classification model. If the AL method finds the condition to conditions includes those that lie deep inside the severe side of be more informative than others, the conditions will be acquired the SVM margin and are a maximal distance from the separating for labeling. Conditions are collected and scrutinized within our hyper-plane according to Equation 1. These conditions will be framework. Then, they are transformed into a vector form (as acquired by the Exploitation method (which will be further with the CEASER method) for the advanced check. The explained below) and are also a maximal distance from the conditions are then introduced to the classification model based labeled conditions. This distance is measured by the KFF on a Support Vector Machine (SVM) and Active Learning (AL) calculation that will be further explained below as well. method. The classification model scrutinizes the conditions and The motivation underlying the selection of the conditions that provides two values for each condition: a classification decision are most likely to be classified as severe is based on two using the SVM classification algorithm and a distance reasons: first, severe conditions have a higher medical and calculation from the separating hyper-plane using Equation 1. A practical value, since they provide information about high condition that the AL method recognizes as informative and priority conditions, those that should be treated more urgently. which it has indicated should be acquired is sent to an expert Second, by attempting to select conditions from deep within the who labels it. The labeled conditions are then added to the "severe" instances sub-space of the SVM's separating hyper- training set for further use. By labeling the most informative plane, we may encounter a mild condition that was erroneously conditions, we aim to frequently update and improve the considered as being severe; the addition of this mild condition to classification model. Note that informative conditions are the classifier provides highly valuable information, which defined as those that when added to the training set improve the greatly improves the classification model. Finally, the classification model's predictive capabilities. informative conditions are then added to the training set for updating and retraining the classification model (Figure 1). The framework integrates two main phases: training and classification/updating. Training: A classification model is trained over an initial training set that includes both severe and mild conditions. After the model is tested over a test set that consists only of unknown conditions that were not presented to it during training, the initial performance of the classification model is evaluated. Classification and updating: For every condition in the pool of unknown conditions the classification model provides a classification, while the AL method provides a rank representing how informative the condition is. The framework will then consider acquiring the conditions based on this. After being selected and receiving their true labels from the expert, the informative conditions are acquired by the training set (and removed from the pool). 
The classification model is retrained over the updated and extended training set that now also Figure 1: The process of using AL methods to detect discriminative includes the acquired conditions that are regarded as being very conditions requiring medical expert annotation. informative. At the end of the update, the updated model again receives the pool of unknown conditions from which the The AL methods are based on the predictive capabilities of the updated framework and model again actively select informative classification model, thus an updated classification model conditions. directly affects the AL method's ability to select the most informative conditions and by doing so decreases the labeling We employed the SVM classification algorithm using the radial efforts; with few and well-selected labeled conditions we can basis function (RBF) kernel in a supervised learning approach. maintain an accurate model and decrease the labeling efforts, in We used the SVM algorithm, because it has proven to be very contrast to a situation in which the expert is required to labeled a efficient when combined with AL methods [26], [27]. In our large number of less informative conditions. Accordingly, in our experiments we used Lib-SVM implementation [30], because it context, there are two types of conditions that may be also supports multiclass classification. considered informative. The first type includes conditions in which the classifier has limited confidence as to their 3.3 Active Learning Methods classification (the probability that they are mild is very close to Since our framework aims to provide solutions to real problems the probability that they may be Severe). Labeling them would it must be based on a sampling method. We compared our improve the model’s classification capabilities. In practical proposed AL methods to other strategies, and all the methods terms, these conditions will have new combinations of features considered are described below. (e.g., low in cost and requiring a long treatment time) or special 3.3.1 Random Selection (Random) combinations of existing features that represent their particular Random selection (also referred to as random learning or permutations. Therefore, these conditions will probably lie passive learning) is the default case in machine learning, in inside the SVM margin and consequently will be acquired by the SVM-Margin strategy that selects informative conditions, which the classifier is given by a set of labeled training Accordingly, the distance in Equation 1 is simply calculated examples. Thus, this is used as a baseline method. between example X and the Normal (W) presented in Equation While random selection is obviously not an active learning 2.  n  n method, it is at the "lower bound" of the selection methods Dist(X )    y K(x x)  i i i w   i yi (xi ) discussed. Consequently, we expect that all AL methods will  1   1 (2) perform better than a selection process based on the random (1) selection of examples. In Figure 2 the conditions that were acquired (marked with a red 3.3.2 The SVM-Simple-Margin AL Method (SVM-Margin) circle) are those conditions classified as severe and have The SVM-Simple-Margin method [31] (referred to as SVM- maximum distance from the separating hyper-plane. Acquiring Margin) is directly related to the SVM classifier. 
Using a kernel several new severe conditions that are very similar and whose function, the SVM implicitly projects the training examples into values share nearly the same features is considered a waste of a different (usually a higher dimensional) feature space denoted manual analysis resources. Thus, acquiring one representative by F. In this space there is a set of hypotheses that are consistent condition for this set of new severe conditions will serve the with the training set, and these hypotheses create a linear goal of efficiently updating the classification model. In order to separation of the training set. From among the consistent enhance the training set as much as possible, we also check the hypotheses (referred to as the version-space (VS)), the SVM similarity between the selected conditions using the kernel identifies the best hypothesis with the maximum margin. To farthest-first (KFF) method suggested by Baram et al. [33] achieve a situation where the VS contains the most accurate and which enables us to avoid acquiring conditions that are quite consistent hypothesis, the SVM-Margin AL method selects similar. Consequently, only the representative conditions that examples from the pool of unlabeled examples reducing the are most likely severe are selected. In Figure 2 it can be number of hypotheses. This method is based on simple observed that there are sets of relatively similar conditions heuristics that depend on the relationship between the VS and (based on their distance in the kernel space), however, only the the SVM's maximum margin. The heuristics are used since representative conditions that are most likely to be severe are calculating the VS is complex and impractical where large acquired. The SVM classifier defines the class margins using a datasets are concerned. Examples that lie closest to the small set of supporting vectors (i.e., conditions). While the usual separating hyper-plane (inside the margin) are more likely to be goal is to improve classification by uncovering (labeling) informative and new to the classifier, and these examples are conditions from the margin area, Exploitation's goal is to selected for labeling and acquisition. acquire conditions in order to enhance the detection of severe This method selects examples according to their distance from conditions. Contrary to SVM-Margin which explores examples the separating hyper-plane only to explore and acquire the that lie inside the SVM margin, Exploitation explores the informative conditions without relation to their classified labels, "severe side" to discover new and unknown severe conditions i.e., not specifically focusing on severe or mild conditions. The that are essential for the detection of severe conditions which SVM-Margin AL method is very fast and can be applied to real might cost more money and requires different treatment in problems, yet as its authors indicate [N-18], this agility is hospitals (conditions which will likely become support vectors achieved because it is based on a rough approximation and and update the classifier). relies on assumptions that the VS is fairly symmetric and that Figure 2 presents an example of a condition lying far inside the the hyper-plane's Normal (W) is centrally placed, assumptions severe side that was found to be mild. The distance calculation that have been shown to fail significantly[32]. 
The method may required for each instance in this method is quick and equal to query instances whose hyper-plane does not intersect the VS and the time it takes to classify an instance in a SVM classifier, thus therefore may not be informative. it is applicable for products working in real-time. 3.3.3 Exploitation We have developed a method, called "Exploitation", for efficient detection of malicious contents [34, 35], such as for malicious files [36, 38] or documents [37]. Exploitation is based on the SVM classifier principles and is oriented towards selecting examples most likely to be severe that lie furthest from the separating hyper-plane. Thus, this method supports the goal of boosting the classification capabilities of the classification model by acquiring as many new severe conditions as possible. For every condition X that is suspected of being severe, Exploitation rates its distance from the separating hyper-plane using Equation 1 based on the Normal of the separating hyper- plane of the SVM classifier that serves as the classification model. As explained above, the separating hyper-plane of the SVM is represented by W, which is the Normal of the separating hyper-plane and actually a linear combination of the most important examples (supporting vectors), multiplied by Figure 2: The criteria by which Exploitation acquires new unknown LaGrange multipliers (alphas) and by the kernel function K that severe conditions. These conditions lie the farthest from the hyper- assists in achieving linear separation in higher dimensions. plane and are regarded as representative conditions. 3.3.4 Combination_XA: A Combined Active Learning equal to the amount that followed, denoted as X: five Method conditions, 10 conditions, 20 conditions and 30 conditions. The "Combination_XA" method lies between SVM-Margin and Exploitation. It conducts a cross acquisition of informative The experiment’s steps are as follows: conditions, which means the first trial it selects conditions 1. Inducing the initial classification model from the initial according to SVM-Margin criteria and next trial it selects available training set (the initial training set includes six according to Exploitation criteria and so on with cross selection. conditions). Thus, on the odd trials the Combination_XA method selects 2. Evaluating the classification model on the test set of 200 examples based on SVM-Margin criteria in order to acquire the conditions to measure its initial performance. most informative conditions, acquiring both severe and mild 3. Introduction of the pool of unknown and unlabeled conditions conditions; this exploration phase is important in order to enable to the selective sampling method, which chooses the X most the classification model to discriminate between severe and mild informative conditions according to its criteria and sends them conditions. While on even trials, the Combination_XA method to the medical expert for labeling. then tries to maximally update the detection capabilities of severe conditions using the exploitation phase, drawing on the 4. Acquiring the informative conditions, removing them from Exploitation method. On the one hand, this strategy is aimed at the pool and adding them to the training set. selecting the most informative conditions, both mild and severe, 5. 
Inducing an updated classification model from the updated and on the other hand, it tries to boost the classification model training set and applying the updated model on the pool with severe conditions or very informative mild conditions (which now contains fewer conditions). which are confusing and lie deep inside the severe side of the This process repeats itself on our dataset from first trial until the SVM's hyper-plane. entire pool is acquired. 4. EVALUATION The objective in our primary experiment was to evaluate and 5. RESULTS compare the performance of our new AL methods to the existing We evaluated the efficiency and effectiveness of our framework selection method, SVM-Margin, on the tasks of: by comparing four selective sampling methods: 1) a well-known existing AL method, termed SVM-Simple-Margin (SVM- - Updating the predictive capabilities (Accuracy) of the Margin) based on [27]; our proposed methods 2) Exploitation, classification model that serves as the knowledge store of AL 3) Combination_XA, and 4) random-selection (Random) as a methods and improving its ability to efficiently identify the "lower bound." Each method was checked against all four most informative new conditions. acquisition amounts, in which the results were the mean of 10 - Identifying which of the AL methods better improve the different folds. Due to space limitations we present the results of capabilities of the classification model to correctly classify the most representative acquisition amount of five conditions in the severe conditions (TPR) with minimal errors (FPR), a each trial. task which is of particular importance given the need to We now present the results of the core measures in this study the identify severe conditions from the outset. accuracy, TPR, and the improvement in the classification model regarding these measures after each acquisition and retraining During a variable number of acquisition trials that ended trial. In addition, we also measured the number of new severe with acquiring every condition in the pool of unlabeled conditions that were discovered and finally acquired into the condition, we compared the acquisition of conditions based on training set. As explained above, five conditions (while the AL methods to random selection based on the performance of acquisition amount varied, we present only the most pertinent the classification model. In our acquisition experiments we used results from our experiments using an acquisition amount of five 516 conditions (372 mild, 144 Severe) in our repository and conditions because of page constraints) were selected from a created 10 randomly selected datasets with each dataset pool of new unlabeled conditions during each trial of CAESAR- containing three elements: an initial set of six conditions that ALE. It is well known that selecting more conditions per trial were used to induce the initial classification model, a test set of will improve accuracy. However, we wanted to reduce the 200 conditions on which the classification was tested and medical experts’ efforts in labeling conditions, and therefore, we evaluated after every trial in which it was updated, and a pool of used AL methods to maximally improve the classification the remaining 310 unlabeled conditions, from which the model's accuracy while minimizing the number of conditions framework and the selective sampling method selected the most acquired. More specifically, we used two of our methods informative conditions according to that method’s criteria. 
The (Exploitation and Combination_XA) to reduce the number of informative conditions were sent to a medical expert who conditions selected by SVM-Margin. labeled them. The conditions were later acquired by the training set that was enriched with an additional X new informative Figure 3 presents the Accuracy levels and their trends in the 62 conditions. The process was repeated over the next trials until trials at acquisition level of five conditions per trial (62*5=310 the entire pool was acquired. The performance of the conditions in pool). As can be seen, in most of the trials all of classification model was averaged for 10 runs over the 10 the AL methods outperformed Random. This shows that the use different datasets that were created. Each selective sampling of AL methods can reduce the number of conditions required to method was checked separately on four different acts of achieve similar accuracy to the passive learning methods (i.e., condition acquisition (each consisting of a different number of Random). The classification model had an initial accuracy of conditions). This means that for each act of acquisition, the 0.72, and all methods converged at an accuracy of 0.975 after methods were restricted to acquiring a number of conditions the pool was fully acquired into the training set. Figure 4: The TPR of the framework over 61 trials for different methods through the acquisition of five conditions in each trial. We now present the results of another important measure in this study, the number of new severe conditions that were discovered and finally acquired into the training set. As explained above, during each trial the framework deals with pool of conditions beginning with 310 conditions, consisting of about 82 new severe conditions. Statistically, the more conditions selected during each trial, the more severe conditions will be acquired. Yet, using AL methods, we tried to improve the number of severe conditions acquired by means of existing solutions. More specifically, using our methods (Exploitation and Combination_XA) we also sought to improve the number of files acquired by SVM-Margin. Figure 5 presents a cumulative number of severe conditions obtained by acquiring the five conditions during each trial by each of the four methods until the pool was fully acquired. From the fifth trial, Exploitation and Combination_XA outperformed Figure 3: The accuracy of the framework over 62 trials for different the other selection methods (their graph intersects in Figure 5). methods (Five conditions acquired during each trial). It can be observed that after 23 trials (115 conditions) both of The first selection method that arrived at a 0.95 rate of accuracy our AL methods acquired 73 severe condition out of the 82 was our Combination_XA method, which required 23 severe conditions in the pool, whereas SVM-Margin and acquisition trials (acquisition of 115 conditions out of 310 Random achieved it after 42 trials (210 conditions) and 60 trials conditions) while other AL methods required 26 trials. When (300 conditions), respectfully. This represents a reduction of compared to random selection, the Combination_XA method 46% compared to SVM-margin and 62% compared to Random. performed almost twice as well (23 vs. 44 trials) while achieving The greatest difference between our AL methods and SVM- the same accuracy (i.e., 0.95). Margin was 15 severe conditions after 23 trials, while during the Figure 4 presents TPR levels and their trends over 62 trials. 
Five same trial we also observed the greatest difference between our new conditions are acquired during each trial (62*5=310 AL methods and Random, a difference of 43 severe conditions conditions in pool). Using TPR, Exploitation outperformed the after the 23 trials. The difference between our AL methods and other selection methods. It achieved a 0.85 TPR rate after only Simple-SVM can be explained by the way this method acts: The 17 trials (85 conditions out of 310), while random selection SVM-Margin acquires examples about which the classification achieved a TPR of 0.85 after 47 trials. model is less confident. Consequently, they are considered to be more informative but not necessarily severe. As was explained In addition, the performance of the classification model previously, SVM-Margin selects informative conditions inside improved as more conditions were acquired. After 36 trials, all the margin of the SVM. Over time and with the improvement of AL methods converged to TPR rates around 0.92. Our results the detection model towards more severe conditions, it seems show that using AL methods to select discriminative conditions that the severe conditions are less informative. Since these for classification can reduce the number of trials required for severe conditions might not lie inside the margin anymore, training the classifier. In turn, this will reduce the total number SVM-Margin may actually be acquiring informative mild of conditions requiring medical expert review and thereby conditions, rather than severe conditions. reduce costs.

Figure 5: The number of severe conditions in the training set acquired The stronger acquisition performance of the Exploitation and by the framework for different methods with the acquisition of five Combination_XA methods can be explained by the way they conditions in each trial. function. Both methods have an exploitation phase during However, our methods, Combination_XA and Exploitation, are which they attempt to acquire conditions that are most likely more oriented toward acquiring the most informative severe conditions by obtaining conditions from the severe side of the severe. In fact, these two methods also acquire mild conditions SVM margin. As a result, an increasing number of new severe that are thought to be severe. Although these mild conditions conditions are acquired in the earliest trials, thus improving the are indeed initially confusing, they are actually very informative classification model's performance and subsequently reducing to the classification model, since they lead to a major labeling efforts; In addition, if an acquired mild condition lies modification of the SVM margin and its separating hyper-plane. deep within the severe side, it is still informative and can be As a consequence, their acquisition improves the performance used for learning purposes and to improve the upcoming trial’s of the classification model better than the SVM-Margin method, classification capabilities. which focuses on acquiring conditions that are known to be 6. DISCUSSION AND CONCLUSION confusing, lead to only small changes in the SVM margin and its separating hyper-plane, and thus contribute less to improving We present a framework, CAESAR-ALE, used to detect the classification model. We understand from this phenomenon informative conditions that, if labeled by experts, improve the that there are often noisy "mild" conditions lying deep within classifier. We presented results from the lowest acquisition what seems to be the sub-space of the "severe" conditions, as amount, because our primary goal was to minimize the number of conditions sent to medical experts for manual labeling. was explained in recent study that focused on the detection of Two different measures were mainly used to evaluate our PC worms [34]. As noted, these "surprising" cases are very algorithm: accuracy and TPR. TPR is important in severity informative and valuable to the improvement of the classification, because of the great importance of detecting classification model (these conditions will probably become severe conditions. Therefore, because of the consequences support vectors after acquiring them and retraining the model). inherent in misclassifying severity, it is better to classify a In addition, they are helpful in the acquisition of severe condition as severe when it is actually mild instead of conditions that eventually update and enrich the knowledge classifying a condition as mild when in reality it is severe. store. It should be noted that these conditions seem to be more Bearing this in mind, traditional passive learning approaches informative than severe conditions, because they provide require large amounts of training data to achieve sufficient relevant information that was previously not considered (they performance. However, our Exploitation method achieved a TPR of 0.85 after only 17 trials. Only 85 conditions would were initially classified tentatively as severe by the classifier). require manual expert labeling in this scenario. 
In contrast, (That is, the classifier initially considered them as being severe, random selection required 47 trials or 235 conditions to achieve but they were eventually discovered as being mild). It seems that the same TPR (representing a 64% reduction). This would cut our Exploitation and Combination_XA methods for acquiring costs by almost two-thirds and allow medical experts to focus conditions that are most likely severe induce a better their energy elsewhere. classification model, a model that will eventually also acquire In terms of accuracy, the Combination_XA AL method confusing but valuable and informative mild conditions. performed the best with a reduction in the number of trials from 44 to 23 (when compared to random). If we translate this to the When calculating the Positive Predictive Value (PPV) (i.e., number of conditions, we find that the Combination_XA method Precision) of our enhanced framework CAESAR-ALE and required 115 vs. 220 conditions (representing a 48% reduction). comparing it to the basic approach CAESER, we observed that Therefore, because for our purposes, FPR is less important (we by acquiring 200 conditions (40 trials), CAESAR-ALE achieved don’t mind calling some mild conditions severe as long as we a 96.6% PPV, compared to the 56.2% PPV that was achieved by accurately capture all severe conditions), we can reduce efforts CAESAR. This represents more than a 40% absolute, and a 71% and cost by 64% without compromising the classification performance. However, in some instances we may desire relative improvement in the predictive capabilities of the maximal accuracy, and in those cases we would still achieve a framework when classifying conditions as severe, an reduction in the number of trials required of 48% when using improvement that was achieved along with the simultaneous AL methods vs. passive learning. significant reduction of labeling efforts. Considering the number of severe conditions acquired across the In our future work, we hope to extend our efforts and provide a trials, we observed that our methods (Exploitation and tool that prompts medical experts to label only pertinent and Combination_XA) were more successful at the identification of discriminative conditions. This should significantly reduce the severe conditions, acquiring many more severe conditions in the workload of already busy clinicians. early stages of the enhancement process, compared to the SVM- Margin and Random methods. The results showed a reduction In conclusion, we presented a framework called CAESAR-ALE of 46% in the number of acquired conditions needed for that reduces the manual efforts of medical experts by identifying identification of most of the severe conditions. the most important phenotypes for labeling. Our AL framework reduced labeling efforts significantly, with a reduction by 64% for the same TPR, and 48% for the same accuracy level. We us.ahrq.gov/toolssoftware/chronic/chronic.jsp - download, also demonstrated the strength of using AL methods on EHR Accessed on February 25, 2014. [18] Hwang, W., Weller, W., Ireys, H. and Anderson, G. 2001. Out-Of-Pocket data in the biomedical domain. Medical Spending For Care Of Chronic Conditions. Health Affairs, 20, 6 (November 1, 2001), 267-278. 7. ACKNOWLEDGMENTS [19] Chi, M.-j., Lee, C.-y. and Wu, S.-c. 2011. 
The prevalence of chronic This research was partly supported by the National Cyber Bureau of conditions and medical expenditures of the elderly by chronic condition the Israeli Ministry of Science, Technology and Space. Support for indicator (CCI). Archives of Gerontology and Geriatrics, 52, 3, 284-289. portions of this research provided by R01 LM006910 (GH). [20] Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F. and Elhadad, N. 2014. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21, 2 8. REFERENCES (March 1, 2014), 231-237. [1] Stang, P. E., Ryan, P. B., Racoosin, J. A., Overhage, J. M., Hartzema, A. G., [21] Perotte, A. and Hripcsak, G. 2013. Temporal properties of diagnosis code Reich, C., Welebob, E., Scarnecchia, T. and Woodcock, J. 2010. time series in aggregate. IEEE journal of biomedical and health Advancing the science for active surveillance: rationale and design for the informatics, 17, 2 (Mar), 477-483. Observational Medical Outcomes Partnership. Ann Intern Med., 153, 9 [22] Torii, M., Wagholikar, K. and Liu, H. 2011. Using machine learning for (Nov 2), 600-606. concept extraction on clinical documents from multiple data sources. [2] Kho, A. N., Pacheco, J. A., Peissig, P. L., Rasmussen, L., Newton, K. M., Journal of the American Medical Informatics Association(June 27, 2011). Weston, N., Crane, P. K., Pathak, J., Chute, C. G., Bielinski, S. J., Kullo, [23] Nguyen, A. N., Lawley, M. J., Hansen, D. P., Bowman, R. V., Clarke, B. I. J., Li, R., Manolio, T. A., Chisholm, R. L. and Denny, J. C. 2011. E., Duhig, E. E. and Colquist, S. 2010. Symbolic rule-based classification Electronic medical records for genetic research: results of the eMERGE of lung cancer stages from free-text pathology reports. Journal of the consortium. Science translational medicine, 3, 79 (Apr 20), 79re71. American Medical Informatics Association, 17, 4 (July 1, 2010), 440-445. [3] Denny, J. C., Ritchie, M. D., Basford, M. A., Pulley, J. M., Bastarache, L., [24] Angluin, D. 1988. Queries and concept learning. Machine Learning, 2, Brown-Gentry, K., Wang, D., Masys, D. R., Roden, D. M. and Crawford, 319-342. D. C. 2010. PheWAS: demonstrating the feasibility of a phenome-wide [25] Lewis, D. and Gale, W. 1994. A sequential algorithm for training text scan to discover gene–disease associations. Bioinformatics, 26, 9 (May 1), classifiers. Proceedings of the Seventeenth Annual International ACM- 1205-1210. SIGIR Conference on Research and Development in Information [4] Boland, M. R., Hripcsak, G., Shen, Y., Chung, W. K. and Weng, C. 2013. Retrieval, Springer-Verlag, , 3-12. Defining a comprehensive verotype using electronic health records for [26] Liu, Y. 2004. Active learning with support vector machine applied to gene personalized medicine. J Am Med Inform Assoc., 20, e2 (December 1), expression data for cancer classification. Journal of chemical information e232-e238. and computer sciences, 44, 6, 1936-1941. [5] Weiskopf, N. G. and Weng, C. 2013. Methods and dimensions of electronic [27] Warmuth, M. K., Liao, J., Rätsch, G., Mathieson, M., Putta, S. and health record data quality assessment: enabling reuse for clinical research. Lemmen, C. 2003. Active learning with support vector machines in the J Am Med Inform Assoc., 20, 1, 144-151. drug discovery process. Journal of chemical information and computer [6] Hripcsak, G., Knirsch, C., Zhou, L., Wilcox, A. and Melton, G. B. 2011. sciences, 43, 2, 667-673. 
Bias associated with mining electronic health records. Journal of [28] Figueroa, R. L., Zeng-Treitler, Q., Ngo, L. H., Goryachev, S. and biomedical discovery and collaboration, 6, 48. Wiechmann, E. P. 2012. Active learning for clinical text classification: is [7] Hripcsak, G. and Albers, D. J. 2013. Correlating electronic health record it better than random sampling? Journal of the American Medical concepts with healthcare process events. J Am Med Inform Assoc., 20, e2 Informatics Association, amiajnl-2011-000648. (December 1), e311-e318. [29] Nguyen, D. H. and Patrick, J. D. 2014. Supervised machine learning and [8] Rich, P. and Scher, R. K. 2003. Nail psoriasis severity index: a useful tool active learning in classification of radiology reports. Journal of the for evaluation of nail psoriasis. Journal of the American Academy of American Medical Informatics Association, amiajnl-2013-002516. Dermatology, 49, 2, 206-212. [30] Chang, C. C. and Lin, C. J. 2011. LIBSVM: a library for support vector [9] Bastien, C. H., Vallières, A. and Morin, C. M. 2001. Validation of the machines. ACM Transactions on Intelligent Systems and Technology Insomnia Severity Index as an outcome measure for insomnia research. (TIST), 2, 3, 27. Sleep Medicine, 2, 4, 297-307. [31] Tong, S. and Koller, D. 2000-2001. Support vector machine active [10] McLellan, A. T., Kushner, H., Metzger, D., Peters, R., Smith, I., Grissom, learning with applications to text classification. Journal of Machine G., Pettinati, H. and Argeriou, M. 1992. The fifth edition of the Addiction Learning Research, 2, 45-66. Severity Index. Journal of substance abuse treatment, 9, 3, 199-213. [32] Ralf, H., Graepel, T. and Campbell, C. 2001. Bayes point machines. The [11] Rockwood, T. H., Church, J. M., Fleshman, J. W., Kane, R. L., Journal of Machine Learning Research 1, 245-279. Mavrantonis, C., Thorson, A. G., Wexner, S. D., Bliss, D. and Lowry, A. [33] Baram, Y., El-Yaniv, R. and Luz, K. 2004. Online choice of active C. 1999. Patient and surgeon ranking of the severity of symptoms learning algorithms. . Journal of Machine Learning Research, 5, 255-291. associated with fecal incontinence. Diseases of the colon & rectum, 42, [34] Nissim, N., Moskovitch, R., Rokach, L., & Elovici, Y. (2012). Detecting 12, 1525-1531. unknown computer worm activity via support vector machines and active [12] Horn, S. D. and Horn, R. A. 1986. Reliability and validity of the severity learning. Pattern Analysis and Applications, 15(4), 459-475. of illness index. Medical care, 24, 2, 159-178. [35] Nissim, N., Moskovitch, R., Rokach, L., and Elovici, Y., Novel Active [13] Boland, M. R., Tatonetti, N. and Hripcsak, G. 2014. CAESAR: a Learning Methods for Enhanced PC Malware Detection in Windows OS, Classification Approach for Extracting Severity Automatically from Expert Systems With Applications, 41(13), 2014. Electronic Health Records. Intelligent Systems for Molecular Biology [36] Moskovitch, R., Nissim, N., and Elovici, Y., Malicious code detection Phenotype Day, Boston, MA, In Press, 1-8. using active learning, ACM SIGKDD Workshop In Privacy, Security and [14] Elkin, P. L., Brown, S. H., Husser, C. S., Bauer, B. A., Wahner-Roedler, Trust in KDD, Las Vegas, 2008. D., Rosenbloom, S. T. and Speroff, T. Evaluation of the content coverage [37] Nissim, N., Cohen, A., Moskovitch, R., Barad, O., Edry, M., Shabatai A., of SNOMED CT: ability of SNOMED clinical terms to represent clinical and Elovici, Y., ALPD: Active Learning framework for Enhancing the problem lists. 
Elsevier, City, 2006. Detection of Malicious PDF Files aimed at Organizations, Proceedings of [15] Stearns, M. Q., Price, C., Spackman, K. A. and Wang, A. Y. SNOMED JISIC, 2014. clinical terms: overview of the development process and project status. [38] Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., Elovici, American Medical Informatics Association, City, 2001. Y., Unknown Malcode Detection and The Imbalance Problem, Journal in [16] Elhanan, G., Perl, Y. and Geller, J. 2011. A survey of SNOMED CT direct COmputer Viorology, 5 (4), 2009. users, 2010: impressions and preferences regarding content and quality. [39] Moskovitch, R., and Shahar, R., Vaidurya: A Multiple Ontology, Concept Journal of the American Medical Informatics Association, 18, Suppl 1 Based, Context Sensitive Clinical Guideline Search Engine, Journal of (December 1, 2011), i36-i44. Biomedical Informatics, 42 (1), 2009. [17] 2011. HCUP Chronic Condition Indicator for ICD-9-CM. Healthcare Cost and Utilization Project (HCUP), https://http://www.hcup-

An Active Learning Framework for Efficient Condition Severity Classification

Nir Nissim1(), Mary Regina Boland2, Robert Moskovitch2, Nicholas P. Tatonetti2, Yuval Elovici1, Yuval Shahar1, and George Hripcsak2

1 Department of Information Systems Engineering, Ben-Gurion University, Beer-Sheva, Israel ([email protected])
2 Department of Biomedical Informatics, Columbia University, New York, USA

Abstract. Understanding condition severity, as extracted from Electronic Health Records (EHRs), is important for many public health purposes. Methods requiring physicians to annotate condition severity are time-consuming and costly. Previously, a passive learning algorithm called CAESAR was developed to capture severity in EHRs. This approach required physicians to label conditions manually, an exhaustive process. We developed a framework that uses two Active Learning (AL) methods (Exploitation and Combination_XA) to decrease manual labeling efforts by selecting only the most informative conditions for training. We call our approach CAESAR-Active Learning Enhancement (CAESAR-ALE). As compared to passive methods, CAESAR-ALE's first AL method, Exploitation, reduced labeling efforts by 64% and achieved an equivalent true positive rate, while CAESAR-ALE's second AL method, Combination_XA, reduced labeling efforts by 48% and achieved equivalent accuracy. In addition, both these AL methods outperformed the traditional AL method (SVM-Margin). These results demonstrate the potential of AL methods for decreasing the labeling efforts of medical experts, while achieving greater accuracy and lower costs.

Keywords: Active-learning · Condition · Electronic health records · Phenotyping

1 Introduction

Connected health is increasingly becoming an important framework for improving health. A crucial aspect of this includes labeling condition severity for prioritization purposes. Many national and international organizations study medical conditions and their clinical outcomes. The Observational Medical Outcomes Partnership standardized condition/phenotype identification and extraction from electronic data sources, including Electronic Health Records (EHRs) [1]. The Electronic Medical Records and Genomics Network [2] successfully extracted over 20 phenotypes from EHRs [3]. However, defining phenotypes from EHRs is a complex process because of definition discrepancies [4], data sparseness, data quality [5], bias [6], and healthcare process effects [7]. Currently, around 100 conditions/phenotypes have been successfully defined and extracted from
EHRs out of the approximately 401,200¹ conditions they contain. To utilize all of the data available in EHRs, a prioritized list of conditions classified by severity at the condition level is needed. Condition-level severity classification can distinguish acne (a mild condition) from myocardial infarction (a severe condition). In contrast, patient-level severity determines whether a given patient has a mild or severe form of a condition (e.g., acne). The bulk of the literature focuses on patient-level severity, which generally requires condition-specific metrics [8-11], although whole-body methods exist [12]. We define "severe conditions" as those that are life-threatening or permanently disabling and thus have a high priority for generating phenotype definitions for tasks such as pharmacovigilance. In this paper, we describe the development and validation of a specially designed Active Learning (AL) approach, whose objective is to minimize the number of training instances that must be labeled by experts (see Section 2.3), for classifying condition severity from EHRs. This builds on our previous work using passive learning, called CAESAR (Classification Approach for Extracting Severity Automatically from Electronic Health Records) [13]. We call our algorithm the CAESAR Active Learning Enhancement, or CAESAR-ALE. We show that CAESAR-ALE can reduce the burden on medical experts by minimizing the number of conditions requiring manual severity assignment. Our AL methods, integrated in CAESAR-ALE, actively select during the classification-model training phase only those conditions that both add new informative knowledge to the classification model and improve its classification performance. This focused, informed selection contrasts with the randomly selected, often poorly informative conditions chosen by a passive learning method.

2 Background and Significance

Our previous algorithm, CAESAR, used passive learning to capture condition severity from EHRs [13]. This method required medical experts to review conditions manually and assign a severity status (severe or mild) to each. The resulting reference standard contained 516 labeled conditions. Because of the significant effort involved, only a relatively small number of conditions was labeled by a group of human experts in the original CAESAR study [13]. These severity assignments were then used to evaluate the quality of CAESAR.

2.1 SNOMED-CT
The Systemized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) is a specialized ontology developed for conditions obtained during the clinical encounter and recorded in EHRs [14, 15]. SNOMED-CT is the terminology of choice of the World Health Organization and the International Health Terminology Standards Development Organization and satisfies the Meaningful Use requirements of the Health Information Technology component of the American Recovery and Reinvestment Act of 2009 [16]. We used SNOMED-CT, because clinical ontologies are useful for retrieval [17]. Each coded clinical event is considered a "condition" or "phenotype," knowing that this is a broad definition [4].

¹ The number of SNOMED-CT codes as of September 9, 2014. Accessed via: http://bioportal.bioontology.org/ontologies/SNOMEDCT

2.2 Classification of Conditions
In biomedicine, condition classification follows two main approaches: 1) manual approaches, where experts manually assign labels to conditions [18-22]; and 2) passive classification approaches that require a labeled training set and are based on machine learning approaches and text classification [23-24].

2.3 Active Learning
AL approaches are learning methods useful for selecting, during the learning process, the most discriminative and informative conditions from the entire dataset, thus minimizing the number of instances that experts need to review manually. Studies in several domains have successfully applied AL to reduce the time and money required for labeling examples [25-27]. The two major AL approaches are membership queries [28], in which examples are artificially generated from the problem space, and selective-sampling [29], in which examples are selected from a pool (the method used in this paper). Applications in the biomedical domain remain limited. Liu described a method similar to relevance feedback for cancer classification [30]. Warmuth et al. used a similar approach to separate compounds for drug discovery [31]. More recently, AL was applied in biomedicine for text [32] and radiology report classification [33].
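The pool-based, selective-sampling setting used throughout this paper can be summarized by a short generic loop. The sketch below is illustrative only: query_strategy stands in for any of the selection criteria described later in Section 3.3, and label_by_expert is a hypothetical placeholder for the manual annotation step; neither name comes from the CAESAR-ALE implementation.

import numpy as np

# Generic pool-based selective sampling (illustrative sketch).
# query_strategy(model, X_pool, batch_size) -> indices of the most informative pool items
# label_by_expert(x) -> "severe" or "mild", assigned manually by a medical expert
def selective_sampling(model, X_train, y_train, X_pool,
                       query_strategy, label_by_expert,
                       batch_size=5, n_trials=10):
    for _ in range(n_trials):
        model.fit(X_train, y_train)                       # retrain on the current labels
        chosen = query_strategy(model, X_pool, batch_size)
        new_y = np.array([label_by_expert(x) for x in X_pool[chosen]])
        X_train = np.vstack([X_train, X_pool[chosen]])    # acquire the newly labeled examples
        y_train = np.concatenate([y_train, new_y])
        X_pool = np.delete(X_pool, chosen, axis=0)        # remove them from the pool
    return model, X_train, y_train

In contrast to membership queries, no artificial examples are generated here; only existing pool items are ever shown to the expert.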

3 Materials and Methods

3.1 Dataset
The development and evaluation of CAESAR-ALE used the CAESAR dataset developed previously [13], which contains 516 conditions (SNOMED-CT codes) labeled as mild or severe. These 516 conditions were randomly selected out of a total of 4,683 unlabeled conditions. The gold-standard labeling of the 516 conditions used in the current study was manual, following an automated filtering phase that significantly reduced the labeling expert's effort (e.g., all malignant cancers and accidents were labeled as "severe"), as described elsewhere [13]. The dataset contains six severity features for each condition: the number of comorbidities, procedures, and medications; cost; treatment time; and a proportion term. Each of these features describing a specific condition was aggregated as an average value over all the records of the same condition in the EHR system.
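As a concrete illustration of how such per-condition feature vectors might be assembled, the sketch below averages the six severity measures over all EHR records sharing a SNOMED-CT code. The file and column names (condition_records.csv, snomed_code, and so on) are assumptions made for the example, not the actual CAESAR schema.

import pandas as pd

# Hypothetical per-record table: one row per (condition, patient record) pair.
records = pd.read_csv("condition_records.csv")

severity_measures = ["n_comorbidities", "n_procedures", "n_medications",
                     "cost", "treatment_time", "proportion_term"]

# Average each measure over all records of the same condition, yielding one
# six-dimensional feature vector per SNOMED-CT code, as described above.
features = records.groupby("snomed_code")[severity_measures].mean()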

3.2 The CAESAR-ALE Framework
The purpose of CAESAR-ALE is to decrease experts' labeling efforts. Utilizing AL methods, CAESAR-ALE directs only informative conditions to experts for labeling. Informative conditions are defined as those that improve the classification model's predictive capabilities when added to the training set. Figure 1 illustrates the process of labeling and acquiring new conditions by maintaining the updatability of the classification model within CAESAR-ALE. Conditions are introduced to the classification model, which is based on the SVM algorithm and AL methods; using both, the informative conditions are selected and sent to a medical expert for annotation. We can maintain an accurate model and decrease labeling efforts by adding only two types of informative conditions: 1) those for which the classifier has low confidence (the probability that the condition is mild is close to the probability that it is severe); and 2) those conditions that are at a maximal distance from the separating hyper-plane (see Equation 1) and thus lie deep within the "severe" instances sub-space of the SVM's separating hyper-plane. Adding the mild conditions that exist within this space of otherwise severe conditions greatly informs and improves the classification model. The overall CAESAR-ALE framework integrates two main phases. Training: the model is trained using an initial set of severe and mild conditions and evaluated against a test set consisting of conditions not used during training. Classification and updating: the AL method ranks how informative each condition is using the classification model's prediction; only the most informative conditions are selected and labeled by the expert; these conditions are added to the training set and removed from the pool; and the model is then retrained using the updated training set. This process is repeated until all conditions in the pool have been added to the training set. We employed the SVM classification algorithm with the radial basis function (RBF) kernel in a supervised learning approach, because it has proven to be very efficient when combined with AL methods [26], [27]. We used the Lib-SVM implementation [34] because it supports multiclass classification.

Fig. 1. Process of using AL methods to detect discriminative conditions requiring medical expert annotation
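The two quantities that drive the selection criteria in Section 3.3, namely the model's severe/mild prediction and the signed distance from the separating hyper-plane, can be obtained as in the sketch below. The paper uses the Lib-SVM implementation directly; here, for illustration only, we assume scikit-learn's SVC (which wraps LIBSVM) and synthetic stand-in data, with the positive class taken to encode "severe".

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data: six-dimensional severity features (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 6))
y_train = np.array([1, 0, 1, 0, 1, 0])     # 1 = severe, 0 = mild (assumed encoding)
X_pool = rng.normal(size=(310, 6))

svm = SVC(kernel="rbf")                    # RBF kernel, as in the paper
svm.fit(X_train, y_train)                  # training phase

predictions = svm.predict(X_pool)          # severe/mild decision for every pooled condition
distances = svm.decision_function(X_pool)  # signed distance from the separating hyper-plane
# The AL methods in Section 3.3 rank the pool using these two outputs and forward
# only the top-ranked conditions to the medical expert for labeling.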

3.3 Active Learning Methods

CAESAR-ALE uses two new AL methods (Exploitation and Combination_XA), described below, which we compare against the existing SVM-Margin method and random selection.

3.3.1 Random Selection or Passive Learning (Random)
Random selection, the default case in machine learning and the "lower bound" among the selection methods considered here, adds randomly selected conditions from the pool to the training set.
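A minimal sketch of this baseline, written to share the query-strategy signature assumed in the earlier sketch (the signature itself is our assumption):

import numpy as np

def random_selection(model, X_pool, batch_size, seed=0):
    # Passive baseline: ignore the model and pick batch_size pool indices at random.
    rng = np.random.default_rng(seed)
    return rng.choice(len(X_pool), size=batch_size, replace=False)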

3.3.2 The SVM-Simple-Margin AL Method (SVM-Margin)
The SVM-Simple-Margin method [35] (or SVM-Margin) is directly related to the SVM classifier. SVM-Margin selects examples to explore and acquire informative conditions according to their distance from the separating hyper-plane, disregarding their classified labels. Examples that lie closest to the separating hyper-plane (inside the margin) are more likely to be informative and are therefore acquired and labeled. SVM-Margin is fast, yet it has significant limitations [35], since it relies on assumptions that have been shown to fail [36].
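Under the same assumed interface, SVM-Margin can be sketched as selecting the pool examples with the smallest absolute distance from the hyper-plane (those inside or closest to the margin), regardless of their predicted label:

import numpy as np

def svm_margin_selection(model, X_pool, batch_size):
    # Smallest |decision value| = closest to the separating hyper-plane.
    distances = np.abs(model.decision_function(X_pool))
    return np.argsort(distances)[:batch_size]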

3.3.3 Exploitation
Exploitation, one of CAESAR-ALE's AL methods, was originally developed for the efficient detection of malicious content, files, and documents [37-41]. Exploitation is based on SVM classifier principles and selects the examples that are most likely to be severe and that lie furthest from the separating hyper-plane. Thus, this method supports the goal of boosting the classification capabilities of the model by acquiring as many new severe conditions as possible. Exploitation rates the distance of every condition X from the separating hyper-plane using Equation 1, based on the Normal of the separating hyper-plane of the SVM classifier that serves as the classification model. Accordingly, the distance in Equation 1 is calculated between example X and W (Equation 2), the Normal that represents the separating hyper-plane.

Dist(X) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \quad (1)

w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \quad (2)

To optimally enhance the training set, we also checked the similarity among selected conditions using the kernel farthest-first (KFF) method, suggested by Baram et al. [42], enabling us to avoid acquiring similar conditions, which would waste manual analysis resources.
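The following sketch illustrates an Exploitation-style selection step under stated assumptions: the SVM decision value stands in for the distance of Equation 1, and the RBF-kernel similarity check (with illustrative gamma and sim_threshold values) is a simplified stand-in for kernel farthest-first filtering; it is not the authors' implementation.

```python
# Illustrative sketch of Exploitation-style selection (assumed interfaces):
# pick pool items that the current SVM places deepest inside the "severe" side,
# while skipping near-duplicates in kernel space, in the spirit of KFF.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def exploitation_select(clf: SVC, X_pool, k=5, gamma=0.1, sim_threshold=0.95):
    dist = clf.decision_function(X_pool)      # distance from the separating hyperplane
    order = np.argsort(-dist)                 # most "severe" first
    chosen = []
    for i in order:
        if dist[i] <= 0:                      # only the severe side is exploited
            break
        if chosen:
            sims = rbf_kernel(X_pool[[i]], X_pool[chosen], gamma=gamma)
            if sims.max() >= sim_threshold:   # too similar to an already chosen condition
                continue
        chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```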

3.3.4 Combination_XA: A Combined Active Learning Method

The "Combination_XA" method is a hybrid of SVM-Margin and Exploitation. It conducts a cross acquisition (XA) of informative conditions, meaning that during the first trial (and all odd-numbered trials) it acquires conditions according to SVM-Margin's criteria, and during the next trial (and all even-numbered trials) it selects conditions using Exploitation's criteria. This strategy alternates between the exploration phase (conditions acquired using SVM-Margin) and the exploitation phase (conditions acquired using Exploitation) to select the most informative conditions, both mild and severe, while boosting the classification model with severe conditions or very informative mild conditions that lie deep inside the severe side of the SVM's hyper-plane.
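A minimal, illustrative rendering of this cross-acquisition schedule follows; the trial parity decides which criterion is applied, and the decision-function distances are assumed to come from the current SVM model (this is a sketch, not the authors' code).

```python
# Minimal sketch of the cross-acquisition (XA) schedule described above:
# odd-numbered trials select the conditions closest to the separating hyperplane
# (SVM-Margin's criterion); even-numbered trials select those deepest inside the
# severe side (Exploitation's criterion).
import numpy as np

def combination_xa_select(clf, X_pool, trial_number, k=5):
    dist = clf.decision_function(X_pool)     # signed distance from the hyperplane
    if trial_number % 2 == 1:                # exploration: lowest-confidence items
        return np.argsort(np.abs(dist))[:k]
    return np.argsort(-dist)[:k]             # exploitation: most-severe items
```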

4 Evaluation

The objective of our experiments was to evaluate the performance of CAESAR-ALE's two new AL methods and compare it with that of the existing AL method, SVM-Margin, for two tasks: (1) to update the predictive capabilities (accuracy) of the classification model that serves as the knowledge store of the AL methods and improve its ability to efficiently identify the most informative new conditions; and (2) to identify the method that best improves the capabilities of the classification model by correctly classifying conditions according to the accuracy measure, and specifically severe conditions as assessed by the true positive rate (TPR), with minimal errors as measured by the false positive rate (FPR). This task is of particular importance, given the need to identify severe conditions from the outset.

In our first acquisition experiment, we used the 516 conditions (372 mild, 144 severe) in our repository and created 10 randomly selected datasets, with each dataset containing three elements: an initial set of six conditions that were used to induce the initial classification model, a test set of 200 conditions on which the classification model was tested and evaluated after every trial in which it was updated, and a pool of 310 unlabeled conditions. Informative conditions were selected according to each method's criteria and then sent to a medical expert for labeling; the labeled conditions were then acquired by the training set, enriching it with an additional five new informative conditions per trial. The process was repeated over the subsequent trials until the entire pool was acquired. The performance of the classification model was averaged over 10 runs on the 10 different datasets that were created. The experiment's steps, which are repeated until the entire pool is acquired, are listed below (a schematic code sketch follows the list):

(1) Induce the initial classification model from the initial training set.
(2) Evaluate the classification model's initial performance using the test set.
(3) Introduce the unlabeled conditions in the pool to the selective sampling method. The five most informative conditions are selected according to each method's criteria and then sent to the medical expert for labeling.
(4) Add the acquired conditions to the training set (removing them from the pool).
(5) Induce an updated classification model using the updated training set and apply the updated model to the pool (now containing fewer conditions).
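The sketch below mirrors this protocol under simplifying assumptions: scikit-learn's SVC replaces Lib-SVM, a caller-supplied select_informative function implements one of the selection criteria, and the held-back pool labels simulate the medical expert. It is intended only to clarify the acquisition loop, not to reproduce the experiments.

```python
# Sketch of the acquisition protocol described above (hypothetical helper
# `select_informative`; dataset sizes in the text: 6 initial, 200 test,
# 310 pool, 5 conditions per trial -> 62 trials).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def run_acquisition_experiment(X_init, y_init, X_test, y_test, X_pool, y_pool,
                               select_informative, per_trial=5):
    X_train, y_train = X_init.copy(), y_init.copy()
    accuracies = []
    while len(X_pool) > 0:
        clf = SVC(kernel="rbf").fit(X_train, y_train)
        accuracies.append(accuracy_score(y_test, clf.predict(X_test)))
        idx = select_informative(clf, X_pool, k=min(per_trial, len(X_pool)))
        # the expert's labels are simulated here by the held-back y_pool
        X_train = np.vstack([X_train, X_pool[idx]])
        y_train = np.concatenate([y_train, y_pool[idx]])
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx)
    return accuracies
```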

5 Results

Results are presented for the core evaluation measures used in this study: accuracy and TPR. We also measured the number of new severe conditions discovered and acquired into the training set at each trial.

Figure 2 presents accuracy levels and their trends in the 62 trials with an acquisition level of 5 conditions per trial (62*5=310 conditions in the pool). In most trials, the AL methods outperformed Random selection, illustrating that using AL methods can reduce the number of conditions required to achieve accuracy similar to that of passive learning (i.e., random). The classification model had an initial accuracy of 0.72, and all methods converged at an accuracy of 0.975 after the pool was fully acquired. Combination_XA reached a 0.95 rate of accuracy first, requiring 23 acquisition trials (115/310 conditions), while the other AL methods required 26 trials. Furthermore, Combination_XA required only approximately half the number of conditions required by the random acquisition method (23 vs. 44 trials), while achieving the same accuracy of 0.95.

Fig. 2. The accuracy of CAESAR-ALE AL methods versus SVM-Margin and Random over 62 trials (five conditions acquired during each trial).
Fig. 3. TPR for active and passive learning methods over 62 trials.

Fig. 4. FPR for active and passive learning methods over 62 trials.
Fig. 5. Number of acquired severe conditions for active and passive learning methods over 62 trials.


Figure 3 shows TPR levels and their trends over 62 trials. Exploitation outperformed the other selection methods, achieving a TPR of 0.85 after only 17 trials (85/310 conditions). This was much better than random selection (TPR=0.85 after 47 trials), and performance improved as additional conditions were acquired. After 36 trials, all AL methods converged to TPR rates around 0.92. Our results demonstrate that using AL methods for condition selection can reduce the number of trials required in training the classifier, thereby reducing the total number of conditions required for medical expert labeling, and correspondingly reducing the training costs. Figure 4 shows FPR levels and their trends over 62 trials. It can be seen that throughout the trials, the random selection method had the highest FPR values, as compared to the AL methods. Furthermore, its FPR levels were mostly unstable until the end of the trials. In contrast, from trial 23 onward, all the AL methods displayed a low and stable FPR of 1.3%. Considering the TPR and FPR measures together, we can see how efficient the AL methods were, in spite of the unbalanced mix of condition severities in our data set. The cumulative number of severe conditions acquired for each trial is shown in Figure 5. By the fifth trial, Exploitation and Combination_XA outperformed the other selection methods (both lines overlap in Figure 5). We observed that after 23 trials (115 conditions) both of CAESAR-ALE's methods acquired 73 out of 82 severe conditions in the pool, as compared to 42 trials (210 conditions) for SVM-Margin and 60 trials (300 conditions) for random. This represents a 46% reduction as compared to SVM-Margin and a 62% reduction as compared to random. After 23 trials (Figure 5), we observed the largest difference between CAESAR-ALE's methods and random, a difference of 43 severe conditions.

6 Discussion

We present CAESAR-ALE, an AL framework for identifying informative conditions for medical expert labeling. CAESAR-ALE was evaluated based on accuracy and TPR. Traditional passive learning approaches require large amounts of training data to achieve a satisfactory performance. Exploitation achieved a TPR of 0.85 after only 17 trials (a scenario in which only 85 conditions would require manual expert labeling), whereas random selection required 47 trials or 235 conditions to achieve the same TPR, representing a 64% reduction in labeling efforts. This would reduce costs by almost two-thirds, allowing medical experts to focus their energy elsewhere. In terms of accuracy, Combination_XA performed best with a reduction in the number of trials from 44 to 23, as compared to the random acquisition method. Therefore, the Combination_XA method required 115 vs. 220 conditions (representing a 48% reduction) to achieve equivalent accuracy. The Exploitation and Combination_XA methods have an exploitation phase during which they attempt to acquire more severe conditions. They both acquire mild conditions when they are thought to be severe, and this contributes to the methods' strong acquisition performance. These mild conditions are very informative to the classification model, because they lead to a major modification of the SVM margin and its separating hyper-plane. Consequently, acquisition of these conditions improves the performance of the model.

In contrast, traditional AL methods (e.g., SVM-Margin) focus on acquiring conditions that lead to only small changes in its separating hyper-plane and contribute less to the improvement of the classification model. Our research points to the existence of noisy "mild" conditions lying deep within what seems to be the sub-space of the "severe" conditions. This is explained in a recent study focused on the detection of PC worms [38] that mentioned "surprising" cases that are very informative and valuable to the improvement of the classification model, and helpful in acquiring severe conditions that eventually update and enrich the knowledge store. These conditions are more informative than severe conditions, because they provide relevant information that was previously not considered (they were initially, if tentatively, classified as severe by the classifier even though they are actually mild). SVM-Margin acquires examples about which the classification model has low confidence. Consequently, they are informative but not necessarily severe. In contrast, the CAESAR-ALE framework is oriented toward acquiring the most informative severe conditions by obtaining conditions from the severe side of the SVM margin. As a result, more new severe conditions are acquired in earlier trials. Additionally, if an acquired condition that lies deep within the supposedly severe side of the margin is found to be mild, it is still informative (perhaps even more so) and can be used to improve the iteratively modified classifier's classification capabilities in the next trials. In our future work, we hope to develop an online tool that medical experts can use to label condition severity. This should further reduce the workload on busy clinicians, and offer an easy to use method for condition-level severity labeling. Using several different labelers for subsets of conditions, we would like to scrutinize and analyze the difference between the different labelers and understand how AL methods can handle these differences. The presented approach is general and not domain-dependent; therefore, it can be applied to and provide a solution for every medical domain (and even non-medical domains) in which it is beneficial to reduce the costly or time-consuming labeling efforts at training time. In addition, we will consider applying our AL framework to the publicly available MIMIC II intensive care domain database, to better understand the benefits of applying active learning methods to various medical domains.

7 Conclusions

We presented the CAESAR-ALE framework, which uses AL to identify important conditions for labeling. AL methods are based on the predictive capabilities of the classification model; thus, an updated classification model directly affects the AL method's ability to select the most informative conditions and thereby decreases labeling efforts. CAESAR-ALE reduced labeling efforts by 48%, while achieving equivalent accuracy. Overall, CAESAR-ALE demonstrates the strength and utility of employing AL methods that are specially designed for the biomedical domain. In addition, both our AL methods (Exploitation and Combination_XA) outperformed the traditional AL method (SVM-Margin). These results demonstrate the potential of AL methods for decreasing the labeling efforts of medical experts, while achieving greater accuracy and lower costs.

References

1. Stang, P.E., Ryan, P.B., Racoosin, J.A., et al.: Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Ann. Intern. Med. 153(9), 600–606 (2010)
2. Kho, A.N., Pacheco, J.A., Peissig, P.L., et al.: Electronic medical records for genetic research: results of the eMERGE consortium. Science Translational Medicine 3(79), 79re1 (2011)
3. Denny, J.C., Ritchie, M.D., Basford, M.A., et al.: PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 26(9), 1205–1210 (2010)
4. Boland, M.R., Hripcsak, G., Shen, Y., Chung, W.K., Weng, C.: Defining a comprehensive verotype using electronic health records for personalized medicine. J. Am. Med. Inform. Assoc. 20(e2), e232–e238 (2013)
5. Weiskopf, N.G., Weng, C.: Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J. Am. Med. Inform. Assoc. 20(1), 144–151 (2013)
6. Hripcsak, G., Knirsch, C., Zhou, L., Wilcox, A., Melton, G.B.: Bias associated with mining electronic health records. Journal of Biomedical Discovery and Collaboration 6, 48 (2011)
7. Hripcsak, G., Albers, D.J.: Correlating electronic health record concepts with healthcare process events. J. Am. Med. Inform. Assoc. 20(e2), e311–e318 (2013)
8. Rich, P., Scher, R.K.: Nail psoriasis severity index: a useful tool for evaluation of nail psoriasis. Journal of the American Academy of Dermatology 49(2), 206–212 (2003)
9. Bastien, C.H., Vallières, A., Morin, C.M.: Validation of the Insomnia Severity Index as an outcome measure for insomnia research. Sleep Medicine 2(4), 297–307 (2001)
10. McLellan, A.T., Kushner, H., Metzger, D., et al.: The fifth edition of the Addiction Severity Index. Journal of Substance Abuse Treatment 9(3), 199–213 (1992)
11. Rockwood, T.H., Church, J.M., Fleshman, J.W., et al.: Patient and surgeon ranking of the severity of symptoms associated with fecal incontinence. Diseases of the Colon & Rectum 42(12), 1525–1531 (1999)
12. Horn, S.D., Horn, R.: Reliability and validity of the severity of illness index. Medical Care 24(2), 159–178 (1986)
13. Boland, M.R., Tatonetti, N., Hripcsak, G.: CAESAR: A classification approach for extracting severity automatically from electronic health records. In: Intelligent Systems for Molecular Biology Phenotype Day, Boston, MA, pp. 1–8 (2014) (in press)
14. Elkin, P.L., Brown, S.H., Husser, C.S., et al.: Evaluation of the content coverage of SNOMED CT: ability of SNOMED clinical terms to represent clinical problem lists. In: Mayo Clinic Proceedings, pp. 741–748. Elsevier (2006)
15. Stearns, M.Q., Price, C., Spackman, K.A., Wang, A.: SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the AMIA Symposium 2001, p. 662. American Medical Informatics Association (2001)
16. Elhanan, G., Perl, Y., Geller, J.: A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality. Journal of the American Medical Informatics Association 18(suppl. 1), i36–i44 (2011)
17. Moskovitch, R., Shahar, Y.: Vaidurya – a concept-based, context-sensitive search engine for clinical guidelines. American Medical Informatics Association (2004)

18. HCUP Chronic Condition Indicator for ICD-9-CM. Healthcare Cost and Utilization Project (HCUP) (2011), http://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp (accessed February 25, 2014)
19. Hwang, W., Weller, W., Ireys, H., Anderson, G.: Out-of-pocket medical spending for care of chronic conditions. Health Affairs 20(6), 267–278 (2001)
20. Chi, M.-J., Lee, C.-Y., Wu, S.-C.: The prevalence of chronic conditions and medical expenditures of the elderly by chronic condition indicator (CCI). Archives of Gerontology and Geriatrics 52(3) (2011)
21. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 21(2), 231–237 (2014)
22. Perotte, A., Hripcsak, G.: Temporal properties of diagnosis code time series in aggregate. IEEE Journal of Biomedical and Health Informatics 17(2), 477–483 (2013)
23. Torii, M., Wagholikar, K., Liu, H.: Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association (June 27, 2011)
24. Nguyen, A.N., Lawley, M.J., Hansen, D.P., et al.: Symbolic rule-based classification of lung cancer stages from free-text pathology reports. Journal of the American Medical Informatics Association 17(4), 440–445 (2010)
25. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications 41(13), 5843–5857 (2014)
26. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications 15, 459–475 (2012)
27. Nissim, N., Cohen, A., Glezer, C., Elovici, Y.: Detection of malicious PDF files and directions for enhancements: A state-of-the art survey. Computers & Security 48, 246–266 (2015)
28. Angluin, D.: Queries and concept learning. Machine Learning 2, 319–342 (1988)
29. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12. Springer (1994)
30. Liu, Y.: Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences 44(6), 1936–1941 (2004)
31. Warmuth, M.K., Liao, J., Rätsch, G., Mathieson, M., Putta, S., Lemmen, C.: Active learning with support vector machines in the drug discovery process. Journal of Chemical Information and Computer Sciences 43(2), 667–673 (2003)
32. Figueroa, R.L., Zeng-Treitler, Q., Ngo, L.H., Goryachev, S., Wiechmann, E.P.: Active learning for clinical text classification: is it better than random sampling? Journal of the American Medical Informatics Association (2011), 2012:amiajnl-2011-000648
33. Nguyen, D.H., Patrick, J.D.: Supervised machine learning and active learning in classification of radiology reports. Journal of the American Medical Informatics Association (2014)
34. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)
35. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2, 45–66 (2000–2001)
36. Herbrich, R., Graepel, T., Campbell, C.: Bayes point machines. The Journal of Machine Learning Research 1, 245–279 (2001)

37. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Novel active learning methods for enhanced PC malware detection in Windows OS. Expert Systems with Applications 41(13) (2014)
38. Nissim, N., Moskovitch, R., Rokach, L., Elovici, Y.: Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications 15(4), 459–475 (2012)
39. Moskovitch, R., Nissim, N., Elovici, Y.: Malicious code detection using active learning. In: ACM SIGKDD Workshop in Privacy, Security and Trust in KDD, Las Vegas (2008)
40. Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., Elovici, Y.: Unknown malcode detection and the imbalance problem. Journal in Computer Virology 5(4) (2009)
41. Nissim, N., Cohen, A., Moskovitch, R., et al.: ALPD: Active learning framework for enhancing the detection of malicious PDF files aimed at organizations. In: Proceedings of JISIC (2014)
42. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms. Journal of Machine Learning Research 5, 255–291 (2004)
43. Herman, R.: 72 Statistics on Hourly Physician Compensation (2013), http://www.beckershospitalreview.com/compensation-issues/72-statistics-on-hourly-physician-compensation.html


Journal of Biomedical Informatics 61 (2016) 44–54


Improving condition severity classification with an efficient active learning based framework

Nir Nissim a,b, Mary Regina Boland c,f, Nicholas P. Tatonetti c,d,e,f, Yuval Elovici a, George Hripcsak c,f, Yuval Shahar a, Robert Moskovitch c,d,e,f

a Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
b Malware Lab, Cyber Security Research Center, Ben-Gurion University of the Negev, Beer-Sheva, Israel
c Department of Biomedical Informatics, Columbia University, New York, NY, USA
d Department of Systems Biology, Columbia University, New York, NY, USA
e Department of Medicine, Columbia University, New York, NY, USA
f Observational Health Data Sciences and Informatics, Columbia University, New York, NY, USA

Article history: Received 29 October 2015; Revised 31 January 2016; Accepted 21 March 2016; Available online 22 March 2016.

Keywords: Active learning; Electronic Health Records; Phenotyping; Condition; Severity

Abstract: Classification of condition severity can be useful for discriminating among sets of conditions or phenotypes, for example when prioritizing patient care or for other healthcare purposes. Electronic Health Records (EHRs) represent a rich source of labeled information that can be harnessed for severity classification. The labeling of EHRs is expensive and in many cases requires employing professionals with a high level of expertise. In this study, we demonstrate the use of Active Learning (AL) techniques to decrease expert labeling efforts. We employ three AL methods and demonstrate their ability to reduce labeling efforts while effectively discriminating condition severity. We incorporate three AL methods into a new framework based on the original CAESAR (Classification Approach for Extracting Severity Automatically from Electronic Health Records) framework to create the Active Learning Enhancement framework (CAESAR-ALE). We applied CAESAR-ALE to a dataset containing 516 conditions of varying severity levels that were manually labeled by seven experts. Our dataset, called the "CAESAR dataset," was created from the medical records of 1.9 million patients treated at Columbia University Medical Center (CUMC). All three AL methods decreased labelers' efforts compared to the learning methods applied by the original CAESAR framework in which the classifier was trained on the entire set of conditions; depending on the AL strategy used in the current study, the reduction ranged from 48% to 64%, which can result in significant savings, both in time and money. As for the PPV (precision) measure, CAESAR-ALE achieved more than a 13% absolute improvement in the predictive capabilities of the framework when classifying conditions as severe. These results demonstrate the potential of AL methods to decrease the labeling efforts of medical experts, while increasing accuracy given the same (or even a smaller) number of acquired conditions. We also demonstrated that the methods included in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of human labelers with different levels of professional expertise.

© 2016 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jbi.2016.03.016

Abbreviations: CAESAR, Classification Approach for Extracting Severity Automatically from Electronic Health Records; CAESAR-ALE, Classification Approach for Extracting Severity Automatically from Electronic Health Records – Active Learning Enhancement; EHR, Electronic Health Record; AL, Active Learning; SVM, Support Vector Machines; VS, Version Space; SNOMED-CT, Systemized Nomenclature of Medicine-Clinical Terms; ICD-9, International Classification of Diseases – Version 9; SVM-Margin, Support Vector Machines-Margin Method – an existing AL method oriented towards acquiring informative conditions that lie closest to the separating hyperplane (inside the margin); Exploitation, an AL method included in the CAESAR-ALE framework that is oriented towards acquisition of severe conditions; Combination_XA, an AL method included in the CAESAR-ALE framework that combines elements of the Exploitation method and the SVM-Margin method, applying a hybrid acquisition strategy for enhanced improvement of the CAESAR method.

Corresponding authors at: Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva 84105, Israel. E-mail addresses: [email protected] (N. Nissim), [email protected] (R. Moskovitch).

1. Introduction

Condition severity is an important aspect of medical conditions that can be useful for discriminating among sets of conditions or phenotypes. For example, a condition's severity status can enable public health researchers to easily identify conditions that require higher prioritization and an allocation of resources. For the purposes of our research, we define severe conditions as those that are life-threatening or permanently disabling.

Such conditions would be considered high priority in terms of the need to generate phenotype definitions for tasks including pharmacovigilance [44,45,47].

A prioritized list of conditions (disease codes) classified by severity at the condition-level is needed. Condition-level severity classification distinguishes acne (mild condition) from myocardial infarction (severe condition). In contrast, patient-level severity determines whether a given patient has a mild or severe form of a condition (e.g., acne). The bulk of the literature focuses on patient-level severity. Patient-level severity generally requires individual condition metrics [8–11], although whole-body methods exist [11–13]. None of the prior severity metrics or classification methods use condition-level severity metrics. Condition-level severity is important for prioritizing phenotypes.

Condition-level severity is useful for prioritizing conditions that are important for specialized phenotyping algorithms. Although several consortiums and partnerships, including the Observational Medical Outcomes Partnership [1] and the Electronic Medical Records and Genomics Network [2,3], have developed phenotype extraction methods that utilize Electronic Health Records (EHRs), only a little more than 100 conditions/phenotypes have been successfully defined. Unfortunately, this represents a small fraction of the approximately 401,200 conditions recorded in EHRs (the number of SNOMED-CT codes as of September 9, 2014; accessed via http://bioportal.bioontology.org/ontologies/SNOMEDCT). Hurdles faced by experts when defining phenotype-extraction algorithms include overcoming definition discrepancies [4], data sparseness, data quality [5], bias [6], and healthcare process effects [7]. Condition severity can be a way for prioritizing conditions worthy of developing a specialized phenotype-extraction algorithm.

In our previous work we developed an algorithm which we refer to as the Classification Approach for Extracting Severity Automatically from Electronic Health Records (CAESAR) [13,47], which uses standard machine learning (also referred to as passive learning) to classify condition severity based on metrics extracted from EHRs [13]. This method requires medical experts to manually review all of the conditions and assign a severity status to each condition (i.e., severe or mild) independently from EHR metrics.

In the current study, we describe the development and validation of an active learning approach for classifying condition severity from EHRs, the Active Learning Enhancement of CAESAR, or CAESAR-ALE. We recently introduced this approach, which enhances our previous CAESAR framework [13], in a preliminary fashion [49]. We now provide a detailed description of the new methodology and our results. Using our new AL methods, we demonstrate that decreasing the number of labeled conditions required to train a classifier can reduce the theoretical burden on medical experts.

The remainder of the paper is structured as follows. In Section 2 we provide background and related work to this study. In Section 3 we describe our new methods, which is followed by the evaluation in Section 4.

2. Background

In this study we use the SNOMED-CT ontology [14–16], because it is an expressive clinical ontology that is useful for retrieval [17]. Each coded clinical event is considered a "condition" or "phenotype," with the knowledge that this is a broad definition [4]. In biomedicine, condition classification follows two main approaches: (1) manual approaches in which experts manually assign labels to conditions; and (2) passive classification approaches that require a labeled training set. Manual approaches include the development of the Chronic Condition Indicator (CCI) [18], involving expert assignment of chronicity categories (acute versus chronic) to the International Classification of Diseases – version 9 (ICD-9) codes. The CCI was built on original work by Hwang et al. [19] and was used successfully in multiple studies [19,20]. Other researchers have employed standard learning approaches in the biomedical domain, including a classification approach that leveraged the ICD-9 hierarchy for improved performance [21]. Another study classified conditions into chronicity categories [22]. Other machine learning approaches have been used in biomedicine for classifying text into condition hierarchies [48] to improve subsequent retrieval [55] compared to traditional free text [56]. Torii et al. [23] showed that performance improved when the classifier was trained on a dataset based on multiple data sources and noted that having more documents available during training improved performance [23]. Nguyen et al. built an algorithm for classifying lung cancer stages using pathology reports and SNOMED-CT [24].

2.1. Mining Electronic Health Records

Secondary use of EHRs through data mining [57] has become a trendy area of biomedical informatics research [58] and the data mining literature [59,60,67]. Learning predictive models in clinical medicine through data mining is an important and developing field [58]. Ng et al. [61] introduced a distributed platform for healthcare analytics for EHR data that builds on MapReduce principles and distributes and parallelizes the entire process of cohort construction, feature construction and selection, and classification in a cross-validation fashion. Sun et al. [62] used this framework to predict hypertension transition points in EHR data without temporal representation. Rana et al. [64] introduced a framework that models the change in interventions over time to predict outcome events and considers the temporal evolution of the events, which was shown to be useful. To handle temporal data [63], a comprehensive framework was introduced that enables learning patients' behavior over time, including discovery of frequent temporal patterns [60], learning classification models [65], and acquiring cutoffs to discretize the variables into states to increase classification performance [66].

2.2. Active learning applications in biomedical data

Labeled examples, crucial for classification, are generally expensive to acquire, since they require medical experts for annotation. Active learning (AL) approaches are useful for selecting (for labeling) the most discriminative and informative conditions from a dataset during the learning process. This selection is expected to decrease the number of conditions that experts need to manually review and label. Studies in several domains have successfully applied AL to reduce the resources (i.e., time and money) required for labeling examples [25–27,68–71].

AL is divided roughly into two major approaches: (1) membership queries [28], in which examples are artificially generated from the problem space; and (2) selective-sampling [29], in which examples are selected from a pool. This paper only focuses on the selective-sampling approach. AL algorithms have been widely utilized in multiple domains, although applications in the biomedical domain remain limited. Liu described a method similar to relevance feedback for cancer classification [30]. Warmuth et al. used a similar approach to separate compounds for drug discovery [31]. AL was also used for cell image pathology [53] and assay classification [54]. More recently, AL was applied in biomedicine for text [32] and radiology report classification [33].

3. Methods

In this section we present the methods and techniques upon which our framework is based. We aim to provide a solution to an existing challenge in the area of efficient condition severity classification, and our framework is based on a combination of methods and techniques derived from previous research (ours and research conducted by others) that we believe will be most appropriate for achieving the goals of this study.

3.1. Support vector machines classification algorithm

The support vector machines (SVM) [46] classifier is a binary classifier that finds a linear hyperplane that separates given examples into two specific classes, yet is also capable of handling multiclass classification [50]. As Joachims [51] demonstrated, the SVM is widely known for its ability to handle a large amount of features, a capability which is useful in the textual domain. Given a training set in which an example is a vector x_i = <f_1, f_2, ..., f_m>, in which f_i is a feature, labeled by y_i ∈ {−1, +1}, the SVM attempts to specify a linear hyperplane with the maximal margin defined by the maximal (perpendicular) distance between the examples of the two classes. Fig. 1 illustrates a two-dimensional space where the examples are positioned according to their features. The hyperplane splits them based on their labels.

The examples lying closest to the hyperplane are the "supporting vectors." W, the Normal of the hyperplane, is a linear combination of the most important examples (supporting vectors) multiplied by LaGrange multipliers (α), as can be seen in Eq. (3). Since the dataset in the original space cannot always be linearly separated, a kernel function K is used. SVM actually projects the examples into a higher dimensional space in order to create a linear separation of the examples. Note that when the kernel function satisfies Mercer's condition, as Burges [52] explained, K can be presented using Eq. (1), where Φ is a function that maps the example from the original feature space into a higher dimensional space, while K relies on the "inner product" between the mappings of examples x_1, x_2. For the general case, the SVM classifier will be in the form shown in Eq. (2), where n is the number of examples in the training set, K is the kernel function, α_i is the LaGrange multiplier that defines the linear combination of the Normal W, and y_i is the class label of support vector x_i.

K(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2) \quad (1)

f(x) = sign(w \cdot \Phi(x)) = sign\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x)\right) \quad (2)

w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \quad (3)

Two commonly used kernel functions are used: (1) the polynomial kernel, as shown in Eq. (4), creates polynomial values of degree p, where the output depends on the direction of the two vectors, examples x_1, x_2, in the original problem space (note that a private case of a polynomial kernel, where p = 1, is actually the linear kernel), and (2) the radial basis function (RBF), as shown in Eq. (5), in which a Gaussian function is used as the RBF and the output of the kernel depends on the Euclidean distance of examples x_1, x_2.

K(x_1, x_2) = (x_1 \cdot x_2 + 1)^{p} \quad (4)

K(x_1, x_2) = e^{-\frac{\|x_1 - x_2\|^2}{2\sigma^2}} \quad (5)

Fig. 1. An SVM with a maximal margin which separates the training set into two classes in a two-dimensional space (two features).

3.2. Random selection

Random selection is not an active learning method, but it is the "lower bound" of the selection methods that will be discussed. As far as we know, no biomedical machine learning based solution has used an active learning method to reduce the labeling efforts of medical doctors in the task of condition severity classification. Random selection doesn't have a sophisticated selection strategy; consequently, we expect that all of the AL methods we examine will perform better than a selection process based on the random selection of conditions. Thus, in the context of our framework, the random selection method will feed the SVM classifier with conditions that were randomly selected from the pool of unlabeled conditions. In our experiments we called this method Random Selection or just Random.

3.3. The SVM-Simple-Margin AL method (SVM-Margin)

The SVM-Simple-Margin method [35] (or SVM-Margin) is based on SVM classifier principles. Using a kernel function, the SVM implicitly projects the training examples into a different (usually a higher dimensional) feature space. In this space there is a set of hypotheses that are consistent with the training set, creating a linear separation of the training set. The SVM identifies the best hypothesis with the maximal margin from among the consistent hypotheses (referred to as the version-space [VS]). To achieve a situation in which the VS contains the most accurate and consistent hypothesis, the SVM-Margin AL method selects examples from a pool of unlabeled examples, thereby reducing the number of hypotheses. The SVM-Margin method selects examples according to their distance from the separating hyperplane to explore and acquire informative conditions, disregarding their labels. Examples that lie closest to the separating hyperplane (see Fig. 2, in which the selected examples from both classes are colored in red and lie inside the margin) are more likely to be informative (may improve the classifier's capabilities) and therefore are acquired and labeled. SVM-Margin is fast; yet, as its authors indicate [35], this agility is achieved because of its rough approximation and assumptions that the VS is fairly symmetric and the hyperplane's Normal, W (Eq. (3)), is centrally placed. These assumptions have been shown to fail significantly [36], because SVM-Margin may query instances whose hyperplane does not intersect with the VS and therefore may not be informative.

Fig. 2. The examples (colored in red) that will be selected according to the SVM-Margin AL method's criteria.
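As a small illustration of the kernel functions in Eqs. (4) and (5), the sketch below implements both directly on numeric feature vectors; it is independent of the Lib-SVM internals used in the study and the parameter values are illustrative.

```python
# Illustrative implementations of the kernels in Eqs. (4) and (5);
# x1 and x2 are numeric feature vectors (e.g., condition metrics).
import numpy as np

def polynomial_kernel(x1, x2, p=2):
    """Eq. (4): (x1 . x2 + 1)^p; p = 1 reduces to the linear kernel."""
    return (np.dot(x1, x2) + 1) ** p

def rbf_kernel(x1, x2, sigma=1.0):
    """Eq. (5): depends only on the Euclidean distance between x1 and x2."""
    diff = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.linalg.norm(diff) ** 2 / (2 * sigma ** 2))

# e.g., rbf_kernel([0.2, 0.4], [0.2, 0.4]) == 1.0 for two identical conditions.
```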

3.4. The CAESAR-ALE framework

The purpose of our enhanced method, CAESAR-ALE, is to decrease the experts' labeling efforts using AL methods. CAESAR-ALE does this by only asking experts to label informative conditions. Fig. 3 illustrates the process of labeling and acquiring new conditions by maintaining the updatability of the classification model within CAESAR-ALE. Conditions are introduced to the classification model, which is induced by an SVM algorithm on which the AL methods are based. The classification model scrutinizes conditions and provides two values for each condition: a classification decision using the SVM classification algorithm and a calculation of the distance from the separating hyperplane. Informative conditions are defined as conditions that are expected to improve the classification model's predictive capabilities when added to the training set. A condition that is identified as potentially informative by the AL method is sent to a human expert for labeling. In this manner, most potentially informative conditions are labeled and added to the training set so that a new and updated model will be induced.

By selecting the most informative conditions, the use of an AL method leads to a theoretical decrease in the labeling efforts, as compared to learning from the entire set of conditions. Using the AL approach, we can maintain an accurate model while decreasing the labeling efforts, since the induction method requires fewer examples, i.e., conditions, because the input instances are more informative. Accordingly, in our context, there are two types of conditions that may be considered informative. The first type includes conditions that the classifier has identified with a low level of confidence. Here, the probability of being mild is close to the probability of being severe. The calculation of probability is based on the distance of the example from the separating hyperplane of the SVM classifier; thus a maximal distance from the separating hyperplane represents a high level of confidence, while a minimal distance from the separating hyperplane represents low confidence. Eq. (6) represents the distance of example x from the separating hyperplane of the SVM classifier (note that Eq. (2) makes use of this distance and provides a classification decision regarding the sign of the distance, in which a positive sign means a positive class classification, while a negative sign means a negative class classification).

f(x) = w \cdot \Phi(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \quad (6)

In order to calculate this probability using the distance of example x from the separating hyperplane according to the given classifier's knowledge, we use a transformation function that converts the distance value into a probability [42], see Eq. (7):

P(y \mid x) = \frac{1}{1 + e^{-y \cdot f(x)}} \quad (7)

where y is an optional label of example x, y ∈ {+1, −1}, and f(x) is the decision value provided by Eq. (6). An illustration can be seen in Fig. 4, which shows two examples for which the SVM produced classification decision values. For instance, P(y = −1 | X2) = 0.8 means that the classifier is quite confident that X2 belongs to class (−1), while P(y = +1 | X2) = 0.2 means that the classifier is quite confident that X2 does not belong to class (+1); if P(y = −1 | X2) = P(y = +1 | X2) = 0.5, the classifier has a severe lack of confidence regarding the class of X2. A graphical analysis of Eq. (7) can be seen in Fig. 5.
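A minimal sketch of the transformation in Eq. (7) follows, assuming the decision value f(x) is available (for example, from an SVM's decision function); the numeric example is illustrative only and the sign convention matches the reconstruction of Eq. (7) above.

```python
# Minimal sketch of the distance-to-probability transformation of Eq. (7):
# the SVM decision value f(x) from Eq. (6) is squashed through a sigmoid, so
# examples far from the hyperplane receive confident probabilities.
import numpy as np

def label_probability(decision_value: float, label: int) -> float:
    """P(y | x) for label y in {+1, -1}, given f(x) = decision_value."""
    return 1.0 / (1.0 + np.exp(-label * decision_value))

# Example (illustrative decision value): a strongly negative f(x) makes the
# negative class likely, e.g. label_probability(-1.4, -1) ~= 0.80 and
# label_probability(-1.4, +1) ~= 0.20, the kind of split shown for X2 in Fig. 4.
```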

The second type of informative conditions includes conditions that are at a maximal distance from the separating hyperplane; these conditions are deep within the severe instances sub-space of the SVM's separating hyperplane. Nevertheless, some mild conditions may still exist within this space of otherwise severe conditions (although, of course, their being mild is unknown to the algorithm until they are selected and labeled). Consequently, presenting these mild conditions to the induction algorithm is expected to greatly inform and improve the resulting classification model.

The overall CAESAR-ALE framework includes a repetition of two main phases: training and classification/updating, which includes the selection of potentially informative examples (i.e., conditions), labeling them, and then training the model with the new labeled conditions.

Training: The model is trained using an initial pool of severe and mild conditions. The model is evaluated against a test set consisting of conditions that were not used during training to estimate the classification accuracy.

Classification and updating: The AL method estimates and ranks how potentially informative each condition remaining in the pool of unlabeled conditions is, based on the classification model's prediction. Only the most informative are selected and labeled by the expert. These conditions are added to the training set and removed from the pool. The model is then retrained using the updated training set, and this process repeats iteratively until a sufficient level of accuracy is reached or, alternatively, until the entire pool of conditions has been acquired.

We employed the SVM classification algorithm using the radial basis function (RBF) kernel in a supervised learning approach. This combination had been shown to be very efficient when combined with AL methods [26,27]. We use the Lib-SVM implementation [34] and modify it to implement our AL methods.

Although our focus in this study is on reducing the condition labeling efforts while maintaining similar or enhanced classification performance, the detection of severe conditions – even during the learning phase (as opposed to the detection of mild conditions) – has some advantages, due to their value for training and insurance reporting purposes.

Fig. 3. The process of using AL methods to detect discriminative conditions requiring medical expert labeling.
Fig. 4. Decision values given to two examples.

3.5. CAESAR-ALE's active learning methods

CAESAR-ALE uses two AL methods (Exploitation and Combination_XA). We describe each of these, along with the SVM-Margin and Random methods, below.

3.5.1. Exploitation

One of the AL methods implemented in CAESAR-ALE is called Exploitation, referred to as such because it exploits the current separating hyperplane to find condition instances that are most likely to be severe. Exploitation has shown efficiency at detecting unknown malicious code content, files [37–40], and documents [41]. Exploitation is based on SVM classifier principles and selects examples more likely to be severe, those lying further from the separating hyperplane, as can be seen in Fig. 2. Thus, this method aims at boosting the classification capabilities of the model through the acquisition of as many new severe conditions as possible. For every condition x, Exploitation measures its distance from the separating hyperplane using Eq. (8), based on the Normal (W) of the separating hyperplane of the SVM classifier. The separating hyperplane of the SVM is represented by W, which is a linear combination of the most important examples (supporting vectors), multiplied by LaGrange multipliers (α) and by the kernel function K that assists in achieving linear separation in higher dimensions. Accordingly, the distance in Eq. (8) is calculated between example x and the Normal (W) presented in Eq. (3). The distance calculation required for each instance in Exploitation is equal to the time it takes to classify an instance using SVM-Margin.

Dist(X) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \quad (8)

Acquiring several severe conditions that are highly similar to each other (i.e., which have similar values for all of the meaningful features and, of course, belong to the same target class) would waste labeling resources, while not contributing much to the future classification capabilities (generality) of the induced classifier; therefore, acquiring one representative condition from this set is preferable. In the Exploitation method, conditions are acquired if they are classified as severe and have maximal distance from the separating hyperplane (marked with a red circle in Fig. 6). To enhance the training set as much as possible, we also check the similarity among selected conditions using the kernel farthest-first (KFF) method suggested by Baram et al. [42], enabling us to avoid acquiring similar conditions. Consequently, only potentially informative conditions likely to be labeled as severe are selected. In Fig. 6, it can be seen that there are several sets of highly similar conditions, based on their distance in the kernel space. However, only representative conditions that are more likely to be severe are acquired. Contrary to SVM-Margin, Exploitation explores the "severe space" to discover potentially more informative severe conditions, a process which enables further detection of severe conditions. Fig. 6 also illustrates an additional ability of Exploitation, as it sometimes discovers conditions located far inside the severe side (i.e., class) that were ultimately labeled by the expert as mild. Finding such a surprise is useful – this type of confusing condition will become a new support vector of the SVM classifier and update the classification model with the new discovery and knowledge, and therefore these "surprises" play an important role in increasing the accuracy of the resultant classifier.

3.5.2. Combination_XA: a combined active learning method

The "Combination_XA" method is a hybrid of SVM-Margin and Exploitation. It conducts a cross acquisition (XA) of potentially informative conditions. That means that during the first trial (and all odd-numbered trials) it acquires conditions according to the SVM-Margin method's criteria, while during the next trial (and all even-numbered trials) it selects conditions using the Exploitation method's criteria. This strategy alternates between the exploration phase (conditions acquired using SVM-Margin) and the exploitation phase (conditions acquired using Exploitation) to select the most informative conditions, both mild and severe, while boosting the classification model with severe conditions or very informative mild conditions that lie deep inside the severe side of the SVM's hyperplane.

Fig. 5. Analysis of Eq. (7) – the larger the distance of the example from the separating hyperplane, the higher the probability and the greater the confidence of the classifier.
Fig. 6. Diagram showing the Exploitation method's criteria for acquiring new severe conditions.

4. Evaluation

The objective of our two experiments is to evaluate and compare the performance of CAESAR-ALE's two proposed AL methods to the three alternatives (SVM-Margin, Random Selection, and learning from the entire set of conditions) based on the following three tasks:

(1) Improving the predictive capabilities (accuracy) of the classification model that serves as the knowledge store of the AL methods and its ability to efficiently identify the most informative unlabeled conditions.
(2) Evaluating the proposed AL methods in comparison to the baseline methods based on their ability to correctly classify severe conditions (TPR) with minimal errors (FPR).
(3) Evaluating all of the selection methods using seven different labelers (experts who labeled the conditions) and measuring the variance of their learning curves.

These three tasks raise two research questions:

– Is it possible to efficiently create a condition severity classification model that uses AL methods to significantly reduce the labeling efforts of the medical expert?

Experiment 1: Our first experiment aims to evaluate and compare the selection methods in the task of the efficient creation of an accurate severity classification model while reducing the labeling efforts of medical experts. In our first acquisition experiment we use a repository of 516 conditions (the CAESAR dataset) consisting of 372 mild and 144 severe conditions. Ten randomly selected datasets are created in order to perform 10-fold cross-validation evaluation. Each fold contains three elements: (1) an initial set of six conditions that are used to induce the initial classification model, (2) a test set of 200 conditions on which the induced classifier is tested and evaluated in each active learning iteration, and (3) a pool of 310 unlabeled conditions from which the conditions are selected to be labeled by each one of the examined selection methods. The process is repeated, through the iterative acquisition steps, until the entire pool is acquired. The performance of the classification models is averaged over the ten runs of the 10-fold cross-validation. The experiment's steps are as follows:

(1) Inducing the initial classification model from the initial training set containing six conditions.
(2) Evaluating the classification model using the 200 condition test set to measure its initial performance.
(3) Introducing the pool of 310 unlabeled conditions to the sampling methods. During every trial the five most informative conditions are selected according to the selection method's preferences, and their labels are revealed by the single gold standard labeler used in the original CAESAR system (in a real system the selected conditions will be labeled by an expert, but in our dataset all of the conditions are already labeled).
(4) Adding the acquired conditions to the training set and removing them from the pool.
(5) Inducing an updated classification model using the updated training set and applying the updated model to the conditions remaining in the pool.
(6) This process iterates until the entire pool is acquired.

Based on the results of this first experiment we are able to determine the potential cost savings associated with the use of CAESAR-ALE by estimating the cost of labeling the condition-level severity for all conditions contained in SNOMED-CT (10,529 conditions) by three expert labelers. To do this, we kept track of the number of conditions labeled per minute by our physician collaborators in our second experiment (described below). We then use the customary estimate for physicians' time ($120 per hour) [43] to estimate the cost of labeling the entire dataset (Eq. (9)). We calculate the estimated savings as the entire cost multiplied by the reduction in labeling efforts (Eq. (10)).

Estimated\ Cost = (Cost\ to\ Label\ 10{,}529\ Conditions) \times 3\ Physician\ Labelers \quad (9)

Estimated\ Savings = Estimated\ Cost \times Reduction\ in\ Labeling\ Efforts\ When\ Using\ CAESAR\text{-}ALE \quad (10)

Experiment 2: Our second experiment is aimed at assessing the differences in the learning curves of the severity classification models induced from conditions that were labeled by each one of the seven different labelers. The conditions are selected using the same acquisition process described in the first experiment; however, here in experiment 2 they are actively selected only by the three AL methods in order to provide a more focused comparison between the AL methods. We use a dataset containing 100 conditions labeled by seven different labelers as our pool (three labelers are medical doctors who have completed their residency training, and the remaining four labelers are informatics experts with at least a master's degree). We follow the same steps outlined in our first experiment in which we established our initial set of six seed conditions from the gold standard. The initial classification model is trained on six randomly selected conditions. After each acquisition step we evaluate the performance of each of the labelers using the remaining 410 conditions from the gold standard: (516 conditions − 6 seed conditions) − (100 conditions given to all seven labelers) = 410 remaining conditions in the gold standard. This allows us to compare differences among the various labelers at each acquisition step. This experiment is also evaluated using 10-fold cross-validation. Note that the conditions are presented to the labelers in a different order, depending on the learning algorithm used and on the labels assigned by the labeler to previous condition instances.

CAESAR dataset: We obtain a dataset of conditions, along with six severity-related metrics related to each condition. These metrics or features consist of: average number of comorbidities, average number of procedures, average number of medications, average cost of procedures, average treatment time, and a proportion term [47]. Each of these severity-related metrics was computed using an underlying dataset of over 1.9 million patients. The proportion term is calculated to normalize each severity metric using the entire corpora and combine all five metrics into one weighted term.

Each condition's proportion term was calculated previously as part of CAESAR's construction; additional details are found in that study [47]. The method for calculating the proportion term is as follows. To calculate the proportion term we first calculate a proportion for each of the five measures. We then sum these individual proportions and divide by the total (i.e., five). It is easiest to illustrate this with an example. Let us assume the condition "myocardial infarction" has an average procedure cost of $10,000, an average treatment length of 30 days, an average number of medications of 10, an average number of procedures of 6, and an average number of comorbidities of 3. Each of these values would be divided by their respective maximums. Therefore, the proportions are as follows: average procedure cost – $10,000/$50,000; average treatment length – 30/1406 days; average number of medications – 10/25; average number of procedures – 6/15; and average number of comorbidities – 3/100. Each of these proportions is then summed:

\frac{\frac{10{,}000}{50{,}000} + \frac{30}{1406} + \frac{10}{25} + \frac{6}{15} + \frac{3}{100}}{5} = \frac{1.051}{5} = 0.210

This yields the proportion index term for this condition.
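The proportion-term calculation walked through above can be rendered in a few lines of code; the values and maxima are the ones quoted in the worked example, and the field names are illustrative rather than taken from the CAESAR dataset schema.

```python
# Sketch of the proportion-term calculation for the "myocardial infarction"
# example above: each severity metric is divided by its maximum and the five
# resulting proportions are averaged.
condition = {"procedure_cost": 10_000, "treatment_days": 30,
             "medications": 10, "procedures": 6, "comorbidities": 3}
maxima = {"procedure_cost": 50_000, "treatment_days": 1406,
          "medications": 25, "procedures": 15, "comorbidities": 100}

proportion_term = sum(condition[m] / maxima[m] for m in condition) / len(condition)
print(round(proportion_term, 3))   # -> 0.21, matching the worked example
```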

5. Results

We evaluate the efficiency and effectiveness of CAESAR-ALE by comparing the two CAESAR-ALE selective sampling methods, Exploitation and Combination_XA, to the two other selective sampling methods: Random Selection (Random) and SVM-Simple-Margin (SVM-Margin) [27]. For all methods, results are averaged over ten different runs of the 10-fold cross-validation. We now present results for the core evaluation measures used in this study: accuracy, TPR, FPR, and AUC. In addition, we also measure the number of new severe conditions discovered and acquired into the training set at each trial. As explained above, five conditions are selected from a pool of unlabeled conditions during each trial of CAESAR-ALE. It is well known that selecting more conditions per trial will improve accuracy. We use a low acquisition amount of five conditions per trial, because our primary goal is to minimize the number of conditions sent to medical experts for manual labeling and thereby reduce costs.

Fig. 7 presents accuracy levels and their trends in the 62 trials with an acquisition amount of five conditions per trial (62 × 5 = 310 conditions in the pool). In most trials, the AL methods outperformed the Random selection method, illustrating that by using the AL methods, one can reduce the number of conditions required to achieve a rate of accuracy similar to that achieved by learning from the entire set of conditions. The classification model had an initial accuracy rate of 0.72, and all methods converged at an accuracy rate of 0.975 after the pool was fully acquired. Combination_XA arrived at a 0.95 rate of accuracy first, after requiring only 23 acquisition trials (115/310 conditions), while other AL methods required 26 trials. Further, Combination_XA performed almost twice as well as Random (23 versus the 44 trials that Random required), while achieving the same accuracy rate of 0.95.

Fig. 7. The accuracy of CAESAR-ALE AL methods versus SVM-Margin and Random over 62 trials (five conditions acquired during each trial).
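To make the evaluation protocol concrete, the following is a minimal Python sketch of a pool-based acquisition loop of this kind (a model seeded with six labeled conditions, 62 trials, five conditions acquired per trial). It uses scikit-learn's SVC and a pluggable select() function purely for illustration; the actual Exploitation, Combination_XA, and SVM-Margin criteria are those defined in the cited papers, so this should be read as a sketch of the experimental loop rather than an implementation of CAESAR-ALE.

# Simplified sketch of the pool-based acquisition loop described above.
# X_seed/y_seed: 6 labeled seed conditions; X_pool: 310 unlabeled conditions;
# X_test/y_test: held-out gold-standard conditions. All names are illustrative.
import numpy as np
from sklearn.svm import SVC

def run_acquisition(X_seed, y_seed, X_pool, y_pool_oracle, X_test, y_test,
                    select, trials=62, per_trial=5):
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_idx = list(range(len(X_pool)))
    accuracy_curve = []
    for _ in range(trials):
        model = SVC(kernel="rbf").fit(X_train, y_train)
        accuracy_curve.append(model.score(X_test, y_test))
        chosen = select(model, X_pool, pool_idx, per_trial)   # AL method decides here
        # The "oracle" stands in for the medical expert who labels the chosen conditions.
        X_train = np.vstack([X_train, X_pool[chosen]])
        y_train = np.concatenate([y_train, y_pool_oracle[chosen]])
        pool_idx = [i for i in pool_idx if i not in set(chosen)]
    return accuracy_curve

def random_select(model, X_pool, pool_idx, k):
    """Baseline corresponding to the Random selection method."""
    rng = np.random.default_rng(0)
    return list(rng.choice(pool_idx, size=min(k, len(pool_idx)), replace=False))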

Summarizing the results regarding rates of accuracy, Combination_XA's performance represents a reduction of 48% in labeling efforts compared to Random. Considering the overall learning phase, Exploitation outperformed the Combination_XA method up to trial 35, while after trial 35 Combination_XA presented slightly better performance, indicating that the cross-acquisition strategy is superior in the long run.

Fig. 8 shows TPR levels (severe is the positive class) and their trends over 62 trials. Both Exploitation and Combination_XA outperformed the other selection methods, achieving a TPR rate of 0.85 after only 17 trials (85/310 conditions), while Random achieved the same TPR rate after 47 trials. Summarizing the results regarding TPR rates, the performance of Exploitation and Combination_XA represents a reduction of 64% in labeling efforts compared to Random Selection. As can also be seen in Fig. 8, the performance improved as more conditions were acquired. After 36 trials, all AL methods converged to TPR rates around 0.92. Our results demonstrate that using AL methods for condition selection can reduce the number of trials required in training the classifier. This will reduce the total number of conditions requiring medical expert labeling and thereby reduce costs.

Fig. 8. TPR for active learning and random selection methods over 62 trials.

The cumulative number of severe conditions acquired for each trial is shown in Fig. 9. By the fifth trial, CAESAR-ALE's methods, Exploitation and Combination_XA, outperformed the other selection methods with respect to the rate of acquiring severe condition instances (both lines overlap in Fig. 9). After 23 trials (115 conditions), both of CAESAR-ALE's methods acquired 73 out of the 82 severe conditions in the pool, compared to 42 trials (210 conditions) for the SVM-Margin method and 60 trials (300 conditions) for the Random method. This represents a 46% reduction in the number of trials required to acquire the same number of severe condition instances compared to the SVM-Margin method, and a 62% reduction in the labeling efforts compared to the Random method. After 23 trials (Fig. 9), we observed the largest difference between CAESAR-ALE's methods and Random: a difference of 43 severe conditions.

Fig. 9. The accumulated number of severe conditions acquired into the training set by each AL method over 62 trials.

In Fig. 10, it can be observed that the variance among the learning curves induced for each labeler depends in part on the AL method used (Fig. 10A–C). Fig. 10D shows the overall variance among labelers by method and the methods' average variance. Combination_XA was lowest with a variance of 0.0197, followed closely by Exploitation with a variance of 0.0209. In contrast, SVM-Margin had the highest variance (almost 50% greater than Combination_XA). We performed a single-factor ANOVA statistical test on the standard deviation of the labelers for both Combination_XA and Exploitation. The ANOVA test provided a P-value of 67.93%, which is much higher than the 5% significance level (alpha); thus the difference between the methods is not statistically significant, confirming that Exploitation is as robust as Combination_XA to the different clinical training levels of the labelers.

We also estimate the cost to label the severity of the 10,529 conditions contained in SNOMED-CT by the three physician labelers. Based on our experiments and experience, physicians were able to label 2.5 conditions per minute. Based on the typical physician salary of $2 per minute of time ($120 per hour), we calculate the estimated savings achieved by utilizing the CAESAR-ALE framework to select meaningful conditions (for labeling) from the entire 10,529 condition set. Based on our calculations, the entire set would cost $13,161.25 per physician labeler, with a total cost of $39,483.75 for three labelers. We found that CAESAR-ALE reduced labeling efforts by at least 48%, resulting in a savings of at least $18,952.
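The cost estimate above (Eqs. (9) and (10)) reduces to simple arithmetic, and the sketch below reproduces the quoted figures. The per-condition labeling cost of $1.25 is implied by the quoted total of $13,161.25 for 10,529 conditions; it is an assumption of this illustration rather than a figure stated explicitly in the text.

# Reproduces the estimated labeling cost and savings quoted above (Eqs. (9) and (10)).
N_CONDITIONS = 10_529          # SNOMED-CT conditions to label
N_LABELERS = 3                 # physician labelers
COST_PER_CONDITION = 1.25      # USD; implied by $13,161.25 / 10,529 (assumption)
REDUCTION = 0.48               # labeling-effort reduction achieved by CAESAR-ALE

cost_per_labeler = N_CONDITIONS * COST_PER_CONDITION      # 13,161.25
estimated_cost = cost_per_labeler * N_LABELERS            # Eq. (9): 39,483.75
estimated_savings = estimated_cost * REDUCTION            # Eq. (10): ~18,952

print(f"${cost_per_labeler:,.2f} per labeler, ${estimated_cost:,.2f} total, "
      f"${estimated_savings:,.2f} saved")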
6. Discussion

We introduce CAESAR-ALE, an active learning based framework used to identify informative conditions for medical expert labeling. The overall task is to classify conditions into severe and mild, based on features extracted from an EHR database. In our dataset, each condition is further represented by six features/metrics: average number of comorbidities, average number of procedures, average number of medications, average cost, average treatment time, and a proportion term. Based on these metrics, conditions can be categorized as severe or mild, and this information can be used to train a classifier with good classification performance. However, labeling the conditions requires the expertise and involvement of medical experts, which is costly. In response to this issue, we developed CAESAR-ALE, a framework based on active learning methods that decreases the need for labeling by medical experts. In this paper, we propose two active learning based methods (Combination_XA and Exploitation) oriented at the acquisition of potentially severe conditions. We define severe conditions as those that are life-threatening or permanently disabling. Acquiring such severe conditions is beneficial because of their value for training purposes and insurance reporting needs. In addition, we also assume that favoring potentially severe conditions will decrease the chances of misclassifying a severe condition as mild, which is often the more costly mistake.

We evaluate these two methods and compare them to two baseline methods: (1) a very basic method using random selection, and (2) SVM-Margin, an existing active learning method; our evaluation also includes comparison to learning from the entire set of conditions. Several measures are used to evaluate CAESAR-ALE: accuracy, TPR, FPR, and AUC. Combination_XA and Exploitation achieved a TPR of 0.85 after only 17 trials during which 85 conditions were acquired. Therefore, in this scenario only these 85 conditions will require manual expert labeling. In contrast, Random Selection required 47 trials or 235 conditions to achieve the same TPR. This comparison demonstrates that by using our AL methods, a reduction of 64% in labeling efforts can be achieved. This reduction also results in significant cost savings (almost two-thirds of the cost), allowing medical experts to focus their energy elsewhere.

Fig. 10. The learning curves of the three AL methods for each of the expert labelers.
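The robustness comparison reported in the results above relies on a single-factor ANOVA over the per-labeler learning-curve standard deviations. The following is a minimal sketch of such a test, assuming the per-labeler standard deviations are available as two lists; the numbers below are placeholders for illustration only, not the study's data.

# Single-factor ANOVA on per-labeler learning-curve standard deviations,
# comparing Combination_XA against Exploitation (placeholder values, not study data).
from scipy.stats import f_oneway

std_combination_xa = [0.031, 0.028, 0.035, 0.029, 0.033, 0.030, 0.032]  # one value per labeler
std_exploitation   = [0.033, 0.030, 0.034, 0.031, 0.035, 0.029, 0.036]

f_stat, p_value = f_oneway(std_combination_xa, std_exploitation)
# A p-value well above 0.05 (the paper reports ~0.68) means the difference between
# the two methods' robustness to labelers is not statistically significant.
print(f_stat, p_value)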

In terms of accuracy rates, as expected, all of the AL methods performed better than the Random Selection method, while Combination_XA performed slightly better than the others, with a meaningful reduction in the number of trials from the 44 required by the Random Selection method to 23. The Combination_XA method required only 115 conditions versus the 220 required by Random (representing a 48% reduction) to achieve the same rate of accuracy. For our purposes, FPR is less important (we don't mind calling some mild conditions severe, as long as we accurately capture all severe conditions); therefore we can focus on the TPR measure and reduce efforts and cost by 64% without compromising the classification performance. However, in some instances we may desire maximal accuracy, and in those cases we will still achieve a reduction of 48% in the number of trials required when using CAESAR-ALE's Combination_XA AL method versus Random Selection.

Note that in some cases it might be beneficial, even during the training phase, to increase the rate and number of severe conditions acquired (e.g., for the purpose of reporting them to insurance companies). By using the Combination_XA and Exploitation AL methods within CAESAR-ALE, one can achieve a reduction of 62% in labeling efforts. CAESAR-ALE's AL methods acquired severe conditions much more quickly than the baseline methods. In the 23rd trial, our AL methods managed to acquire almost 90% of the severe conditions (73 out of 82) after investing only 37% of the time and labeling efforts required to acquire all of the severe conditions by Random or by labeling the entire pool of conditions. As can be seen, the use of AL methods, and our two AL methods in particular, results in a very significant improvement in the number of severe conditions acquired, compared to the linear and poor improvement demonstrated by the Random Selection method. The best example of the strength of our AL methods can be found in the performance demonstrated during the early acquisition stages, when again in the 23rd trial, our AL methods acquired almost 2.5 times more severe conditions than Random (72 compared to 30).

The superior acquisition results achieved by the Exploitation and Combination_XA methods compared to Random and SVM-Margin can be explained by the way they function. Both methods have an exploitation phase during which they attempt to acquire more severe conditions. Thus, both methods occasionally acquire mild conditions which initially were thought to be severe. These mild conditions are very informative to the classification model, because they lead to a major modification of the SVM classifier's margin and its separating hyperplane.
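As a rough illustration of the mechanism discussed in this and the following paragraph, the sketch below ranks unlabeled conditions by the signed distance to an SVM's separating hyperplane and prefers those lying on, or deep within, the severe side. This is only an approximation of the idea; the precise Exploitation and Combination_XA criteria, including their distance and cross-acquisition components, are those specified in the papers cited above.

# Illustrative "severe-side" selection: prefer pool items the current SVM places
# deepest on the severe side of its separating hyperplane (an approximation only).
import numpy as np
from sklearn.svm import SVC

def severe_side_select(model: SVC, X_pool, pool_idx, k, severe_label=1):
    # decision_function() gives the signed distance to the hyperplane;
    # positive values correspond to model.classes_[1].
    scores = model.decision_function(X_pool[pool_idx])
    if model.classes_[1] != severe_label:   # flip sign if "severe" is the other class
        scores = -scores
    order = np.argsort(-scores)             # most confidently "severe" first
    return [pool_idx[i] for i in order[:k]]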

Consequently, acquiring these conditions improves the performance of the model. Alternatively, traditional AL methods (e.g., SVM-Margin) focus on acquiring examples (in our study, conditions) that lead to only small changes in the separating hyperplane and thus contribute less to the improvement of the classification model. We understand from this phenomenon that there are often noisy "mild" conditions lying deep within what seems to be the sub-space of the severe conditions, as was explained in a recent study that focused on the detection of PC malware [37]. As noted, these "surprising" cases are very informative and valuable in contributing to the improvement of the classification model. They are helpful in acquiring severe conditions that eventually update and enrich the knowledge store. It should be noted that these conditions are more informative than severe conditions, because they provide relevant information that was previously not considered (they were initially tentatively classified as severe by the classifier when in fact they are mild). The CAESAR-ALE framework induces a better classification model by acquiring more severe conditions within the same number of iterations.

Furthermore, the SVM-Margin method acquires examples that the classification model cannot confidently classify (there is low confidence regarding their correct class). Consequently, they are informative but are not necessarily severe. In contrast, the CAESAR-ALE framework is oriented toward acquiring the most informative severe conditions by obtaining conditions from the severe side of the SVM classifier's margin. As a result, more new severe conditions are acquired in earlier trials. Additionally, if an acquired mild condition lies deep within the severe side of the margin, it is still informative and can be used to improve the classification capabilities of the model for the upcoming trial.

In addition to the above mentioned results and improvements, we observed that after acquiring 95 conditions (19 trials), CAESAR-ALE achieved a 93.1% positive predictive value (PPV), compared to an 80.1% PPV achieved using Random Selection. This represents more than a 13% absolute improvement in the predictive capabilities of CAESAR-ALE for classifying conditions as severe. This improvement was achieved with a significant reduction in labeling efforts, a 51% reduction for equivalent accuracy (92.5%).

Importantly, we demonstrate that our AL methods in the CAESAR-ALE framework (Exploitation and Combination_XA) are more robust to the use of different human labelers with different levels of domain knowledge. In fact, our method, Combination_XA, acquired conditions in such a way that the variance between the classifiers created by individual labelers was 50% lower than the variance of learning curves that were based on the acquisition of the traditional SVM-Margin AL method.

We present the CAESAR-ALE framework, which uses AL methods to identify important conditions for labeling. CAESAR-ALE reduced labeling efforts by 48% when classifying conditions into severe and mild, while achieving classification accuracy equivalent to that of a scenario in which passive learning is applied. Use of the CAESAR-ALE framework has the potential to result in significant monetary savings. In future work, we would like to develop an online tool for medical experts to label condition severity. This will facilitate a more extensive condition severity labeling experiment that will enable us to examine labeling differences based on the experts' clinical background or geographical location.

Acknowledgments

We thank all of the labelers who patiently labeled condition severity status in our dataset. This research was partially supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and Space. Support for portions of this research was provided by R01 LM006910 (GH) and R01 GM107145 (NPT). MRB is supported by the National Library of Medicine training grant T15 LM00707.

References

[1] P.E. Stang, P.B. Ryan, J.A. Racoosin, et al., Advancing the science for active surveillance: rationale and design for the observational medical outcomes partnership, Ann. Intern. Med. 153 (9) (2010) 600–606.
[2] A.N. Kho, J.A. Pacheco, P.L. Peissig, et al., Electronic medical records for genetic research: results of the eMERGE consortium, Sci. Translational Med. 3 (79) (2011) 79re1.
[3] J.C. Denny, M.D. Ritchie, M.A. Basford, et al., PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations, Bioinformatics 26 (9) (2010) 1205–1210.
[4] M.R. Boland, G. Hripcsak, Y. Shen, W.K. Chung, C. Weng, Defining a comprehensive verotype using electronic health records for personalized medicine, J. Am. Med. Inform. Assoc. 20 (e2) (2013) e232–e238.
[5] N.G. Weiskopf, C. Weng, Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J. Am. Med. Inform. Assoc. 20 (1) (2013) 144–151.
[6] G. Hripcsak, C. Knirsch, L. Zhou, A. Wilcox, G.B. Melton, Bias associated with mining electronic health records, J. Biomed. Discov. Collab. 6 (2011) 48.
[7] G. Hripcsak, D.J. Albers, Correlating electronic health record concepts with healthcare process events, J. Am. Med. Inform. Assoc. 20 (e2) (2013) e311–e318.
[8] P. Rich, R.K. Scher, Nail psoriasis severity index: a useful tool for evaluation of nail psoriasis, J. Am. Acad. Dermatol. 49 (2) (2003) 206–212.
[9] C.H. Bastien, A. Vallières, C.M. Morin, Validation of the insomnia severity index as an outcome measure for insomnia research, Sleep Med. 2 (4) (2001) 297–307.
[10] A.T. McLellan, H. Kushner, D. Metzger, et al., The fifth edition of the addiction severity index, J. Subst. Abuse Treat. 9 (3) (1992) 199–213.
[11] T.H. Rockwood, J.M. Church, J.W. Fleshman, et al., Patient and surgeon ranking of the severity of symptoms associated with fecal incontinence, Dis. Colon Rectum 42 (12) (1999) 1525–1531.
[12] S.D. Horn, R.A. Horn, Reliability and validity of the severity of illness index, Med. Care 24 (2) (1986) 159–178.
[13] M.R. Boland, N.P. Tatonetti, G. Hripcsak, CAESAR: a Classification Approach for Extracting Severity Automatically from Electronic Health Records, 2014.
[14] P.L. Elkin, S.H. Brown, C.S. Husser, et al., Evaluation of the content coverage of SNOMED CT: ability of SNOMED clinical terms to represent clinical problem lists, in: Mayo Clinic Proceedings, Elsevier, 2006, pp. 741–748.
[15] M.Q. Stearns, C. Price, K.A. Spackman, A.Y. Wang, SNOMED clinical terms: overview of the development process and project status, in: Proceedings of the AMIA Symposium, American Medical Informatics Association, 2001, p. 662.
[16] G. Elhanan, Y. Perl, J. Geller, A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality, J. Am. Med. Inform. Assoc. 18 (Suppl. 1) (2011) i36–i44.
[17] R. Moskovitch, Y. Shahar, Vaidurya: a multiple-ontology, concept-based, context-sensitive clinical-guideline search engine, J. Biomed. Inform. 42 (1) (2009) 11–21.
[18] HCUP Chronic Condition Indicator for ICD-9-CM, Healthcare Cost and Utilization Project (HCUP), 2011 (accessed on February 25, 2014).
[19] W. Hwang, W. Weller, H. Ireys, G. Anderson, Out-of-pocket medical spending for care of chronic conditions, Health Aff. 20 (6) (2001) 267–278.
[20] M-j. Chi, C-y. Lee, S-c. Wu, The prevalence of chronic conditions and medical expenditures of the elderly by chronic condition indicator (CCI), Arch. Gerontol. Geriatr. 52 (3) (2011) 284–289.
[21] A. Perotte, R. Pivovarov, K. Natarajan, N. Weiskopf, F. Wood, N. Elhadad, Diagnosis code assignment: models and evaluation metrics, J. Am. Med. Inform. Assoc. 21 (2) (2014) 231–237.
[22] A. Perotte, G. Hripcsak, Temporal properties of diagnosis code time series in aggregate, IEEE J. Biomed. Health Inform. 17 (2) (2013) 477–483.
[23] M. Torii, K. Wagholikar, H. Liu, Using machine learning for concept extraction on clinical documents from multiple data sources, J. Am. Med. Inform. Assoc. 27 (2011).
[24] A.N. Nguyen, M.J. Lawley, D.P. Hansen, et al., Symbolic rule-based classification of lung cancer stages from free-text pathology reports, J. Am. Med. Inform. Assoc. 17 (4) (2010) 440–445.
[25] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Novel active learning methods for enhanced PC malware detection in windows OS, Expert Syst. Appl. 41 (13) (2014) 5843–5857.

[26] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Detecting unknown computer worm activity via support vector machines and active learning, Pattern Anal. Appl. 15 (2012) 459–475.
[27] N. Nissim, A. Cohen, C. Glezer, Y. Elovici, Detection of malicious PDF files and directions for enhancements: a state-of-the-art survey, Comput. Secur. 48 (2015) 246–266.
[28] D. Angluin, Queries and concept learning, Mach. Learn. 2 (1988) 319–342.
[29] D. Lewis, W. Gale, A sequential algorithm for training text classifiers, in: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag, 1994, pp. 3–12.
[30] Y. Liu, Active learning with support vector machine applied to gene expression data for cancer classification, J. Chem. Inf. Comput. Sci. 44 (6) (2004) 1936–1941.
[31] M.K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, C. Lemmen, Active learning with support vector machines in the drug discovery process, J. Chem. Inf. Comput. Sci. 43 (2) (2003) 667–673.
[32] R.L. Figueroa, Q. Zeng-Treitler, L.H. Ngo, S. Goryachev, E.P. Wiechmann, Active learning for clinical text classification: is it better than random sampling?, J. Am. Med. Inform. Assoc. (2012), http://dx.doi.org/10.1136/amiajnl-2011-000648.
[33] D.H. Nguyen, J.D. Patrick, Supervised machine learning and active learning in classification of radiology reports, J. Am. Med. Inform. Assoc. (2014), http://dx.doi.org/10.1136/amiajnl-2013-002516.
[34] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) 2 (3) (2011) 27.
[35] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. 2 (2000–2001) 45–66.
[36] R. Herbrich, T. Graepel, C. Campbell, Bayes point machines, J. Mach. Learn. Res. 1 (2001) 245–279.
[37] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Novel active learning methods for enhanced PC malware detection in windows OS, Expert Syst. Appl. 41 (13) (2014).
[38] N. Nissim, R. Moskovitch, L. Rokach, Y. Elovici, Detecting unknown computer worm activity via support vector machines and active learning, Pattern Anal. Appl. 15 (4) (2012) 459–475.
[39] R. Moskovitch, N. Nissim, Y. Elovici, Malicious code detection using active learning, in: ACM SIGKDD Workshop in Privacy, Security and Trust in KDD, Las Vegas, 2008.
[40] R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, Y. Elovici, Unknown malcode detection and the imbalance problem, J. Comput. Virol. 5 (4) (2009).
[41] N. Nissim, A. Cohen, R. Moskovitch, et al., ALPD: active learning framework for enhancing the detection of malicious PDF files aimed at organizations, in: Proceedings of JISIC, 2014.
[42] Y. Baram, R. El-Yaniv, K. Luz, Online choice of active learning algorithms, J. Mach. Learn. Res. 5 (2004) 255–291.
[43] R. Herman, 72 Statistics on Hourly Physician Compensation, 2013 (accessed in January 2015).
[44] M.R. Boland, N.P. Tatonetti, Are all vaccines created equal? Using electronic health records to discover vaccines associated with clinician-coded adverse events, in: AMIA Summits on Translational Science Proceedings 2015, San Francisco, CA, USA, 2015, pp. 196–200.
[45] M.R. Boland, N.P. Tatonetti, G. Hripcsak, CAESAR: a classification approach for extracting severity automatically from electronic health records, in: Intelligent Systems for Molecular Biology Phenotype Day, Boston, MA, 2014.
[46] V. Vapnik, Statistical Learning Theory, Springer, New York, 1998.
[47] M.R. Boland, N.P. Tatonetti, G. Hripcsak, Development and validation of a classification approach for extracting severity automatically from electronic health records, J. Biomed. Semantics 6 (14) (2015).
[48] R. Moskovitch, S. Cohen-Kashi, U. Dror, I. Levy, A. Maimon, Y. Shahar, Multiple hierarchical classification of free-text clinical guidelines, Artif. Intell. Med. 37 (2006) 177–190.
[49] N. Nissim, M.R. Boland, R. Moskovitch, N.P. Tatonetti, Y. Elovici, Y. Shahar, G. Hripcsak, An active learning framework for efficient condition severity classification, in: Artificial Intelligence in Medicine (AIME-15), Springer International Publishing, 2015, pp. 13–24.
[50] V.N. Vapnik, Estimation of Dependences Based on Empirical Data, vol. 41, Springer-Verlag, New York, 1982.
[51] T. Joachims, Making large scale SVM learning practical, 1999.
[52] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Disc. 2 (2) (1998) 121–167.
[53] M. Berthold, The fog of data: data exploration in the life sciences, Invited Talk at the 11th Conference on Artificial Intelligence in Medicine (AIME), 2007.
[54] N. Cebron, M.R. Berthold, Active learning for object classification: from exploration to exploitation, Data Min. Knowl. Discov. 18 (2) (2009) 283–299.
[55] R. Moskovitch, A. Hessing, Y. Shahar, Vaidurya – a concept-based, context-sensitive search engine for clinical guidelines, in: MedInfo 2004, San Francisco, USA, 2004.
[56] R. Moskovitch, S. Martins, E. Behiri, A. Weiss, Y. Shahar, A comparative evaluation of a full-text, concept-based, and context-sensitive search engine, J. Am. Med. Inform. Assoc. 14 (2007) 164–174.
[57] P.B. Jensen, L.J. Jensen, S. Brunak, Mining electronic health records: towards better research applications and clinical care, Nat. Rev. Genet. 13 (6) (2012).
[58] R. Bellazzi, B. Zupan, Predictive data mining in clinical medicine: current issues and guidelines, Int. J. Med. Inform. 77 (2) (2008).
[59] I. Batal, D. Fradkin, J. Harrison, F. Moerchen, M. Hauskrecht, Mining recent temporal patterns for event detection in multivariate time series data, in: Proceedings of Knowledge Discovery and Data Mining (KDD), Beijing, China, 2012.
[60] R. Moskovitch, Y. Shahar, Fast time intervals mining using transitivity of temporal relations, Knowl. Inf. Syst. 42 (2015) 1.
[61] K. Ng, A. Ghoting, S.R. Steinhubl, W.F. Stewart, B. Malin, J. Sun, PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records, J. Biomed. Inform. 48 (2014) 160–170.
[62] J. Sun, C.D. McNaughton, P. Zhang, A. Perer, A. Gkoulalas-Divanis, J.C. Denny, J. Kirby, T. Lasko, A. Saip, B.A. Malin, Predicting changes in hypertension control using electronic health records from a chronic disease management program, J. Am. Med. Inform. Assoc. 21 (2014) 337–344.
[63] G. Hripcsak, Physics of the medical record: handling time in health record studies, in: Artificial Intelligence in Medicine, Pavia, Italy, 2015.
[64] S. Rana, S. Gupta, D. Phung, S. Venkatesh, A predictive framework for modeling healthcare data with evolving clinical interventions, Stat. Anal. Data Min.: ASA Data Sci. J. 8 (3) (2015) 162–182.
[65] R. Moskovitch, Y. Shahar, Classification of multivariate time series via temporal abstraction and time intervals mining, Knowl. Inf. Syst. 45 (1) (2015) 35–74.
[67] Z. Huang, W. Dong, P. Bath, L. Ji, H. Duan, On mining latent treatment patterns from electronic medical records, Data Min. Knowl. Disc. 29 (2015) 914–949.
[68] N. Nissim, R. Moskovitch, O. BarAd, L. Rokach, Y. Elovici, ALDROID: efficient update of android anti-virus software using designated active learning methods, Knowl. Inf. Syst. (2016) 1–39.
[69] R. Moskovitch, N. Nissim, Y. Elovici, Malicious code detection and acquisition using active learning, in: IEEE International Conference on Intelligence and Security Informatics (IEEE ISI-2007), Rutgers University, New Jersey, USA, 2007.
[70] N. Nissim, A. Cohen, R. Moskovitch, O. Barad, M. Edry, A. Shabatai, Y. Elovici, ALPD: active learning framework for enhancing the detection of malicious PDF files, in: Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, 2014, pp. 91–98.
[71] R. Moskovitch, N. Nissim, R. Englert, Y. Elovici, Detection of unknown computer worms activity using active learning, in: The 11th International Conference on Information Fusion, Cologne, Germany, 2008.

- 54 -

- 55 -

התרומות העיקריות של המחקר שלנו הן כדלהלן: ראשית, תוצאות הניסויים הראו שהמערכת שלנו יכולה לשפר ולעדכן בצורה יעילה ותדירה את יכולות הזיהוי של חבילות האנטיוירוס וכן את יכולות הזיהוי של פתרונות מבוססי אלגוריתמי למידת מכונה, באופן טוב יותר מכל שיטה קיימת. שנית, בעוד ששיטות למידה אקטיבית קיימות הראו ירידה במספר הפוגענים החדשים שהן רוכשות יום יום, שיטות הלמידה האקטיבית שפיתחנו הראו שיפור יומי במספר הפוגענים החדשים הנרכשים, וזאת בנוסף לעובדה ששיטות הלמידה האקטיבית שלנו רכשו יותר פוגענים חדשים מכל פתרון אחר בכל יום ויום. שלישית, המערכת שלנו מבצעת את העדכונים הללו בשימוש של כמות קטנה בלבד של הקבצים האינפורמטיביים ביותר )תמימים ועוינים(, תוך הורדה משמעותית של מאמץ מומחי האבטחה בניתוח ידני של הקבצים הנבחרים. רביעית, המערכת שלנו נמצאה גם יעילה ברכישה היסטורית של פוגענים מתוך מאגרי קבצים גדולים שנמצאים בדרך כלל בארגונים רבים. חמישית, המערכת שלנו גם מסוגלת לחזק ולשפר את העמידות של יכולות הלמידה בכך שהיא מסננת מופעים רועשים ומטעים של התנהגות פוגענים חמקמקים. נקודה אחרונה, כהוכחה לגנריות של מערכת מבוססת הלמידה האקטיבית שלנו, לאחרונה הרחבנו את יכולותיה וכעת היא מסוגלת לתת מענה לבעיות מתחומים נוספים. למעשה התאמנו את המערכת לתחום המידע הרפואי, שבו הצלחנו לשפר את יכולות הסיווג של מודל למידה לצורכי סיווג רמת חומרה של מחלות, וזאת תוך כדי הפחתה משמעותית של המאמץ בתיוג ידני של מחלות, והדבר למעשה מסתכם בחיסכון גדול של כסף וזמן עבודה יקר של מומחי רפואה.

מילות מפתח: פוגען, עוין, תולעת מחשב, קובץ הרצה, אנדרואיד, מסמך, PDF, למידת מכונה, למידה אקטיבית, זיהוי, רכישה, אנטיוירוס.

- 56 -

תקציר היווצרותם של פוגענים חדשים מדי יום מציבה אתגר משמעותי לפתרונות הזיהוי הקיימים. פוגענים אלו מעלים סיכונים רבים משום שהם מוכוונים לפגוע כמעט בכל מכשיר דיגיטאלי הנמצא בשימוש פרטי או אירגוני. עם סוגי פוגענים פופולאריים נמנים תולעי מחשב, קבצי הרצה עויינים, מסמכים עויינים וכן אפליקציות עוינות המוכוונות לפגיעה ושימוש עוין במכשירים ניידים. חבילות האנטיוירוס , הנמצאות בשימוש רחב, והמבוססות החתימות, מסוגלות לזהות רק פוגענים ידועים או גרסאות דומות להן. כדי לזהות פוגענים חדשים על מנת לעדכן את מאגר החתימות של הקבצים העוינים של חבילות האנטיוירוס, חברות האנטיוירוס חייבות לאסוף בכל יום כמויות גדולות של קבצים חשודים שצריכים להיות מנותחים באופן ידני על ידי מומחי אבטחה, שלבסוף גם קובעים את סיווגם הסופי כפוגען או תכנה תמימה. ניתוח של קובץ חשוד הינה פעולה הדורשת זמן, ואין זה אפשרי לנתח ידנית את כל הקבצים החשודים. לכן חברות האנטיוירוס החלו להשתמש במודלי זיהוי מבוססי אלגוריתמי למידת מכונה והיוריסטיקות שונות, במטרה להקטין את מספר הקבצים החשודים שיש לנתחם ידנית. בנוסף לחבילות האנטיוירוס, פתרונות זיהוי חדשניים התחילו להשתמש באלגוריתמי למידת מכונה באופן עצמאי במטרה לייצר יכולות זיהוי טובות יותר מהיכולת המוגבלת של חבילות האנטיוירוס בכדי לזהות פוגענים חדשים.

לאור היצירה ההמונית של קבצים מדי יום, גם חבילות האנטיוירוס וגם פתרונות המבוססים אלגוריתמי למידת מכונה, חסרים יכולת חיונית ביותר – הם אינם יכולים להתעדכן בצורה תדירה ויעילה עם פוגענים חדשים – מצב שיוצר פער בעדכניות וחלון זמן בין מועד היווצרותו של פוגען לבין מועד הזיהוי שלו, ובכך מאפשר לפוגען חדש לתקוף מספר רב של מטרות וקורבנות טרם זיהויו. לכן גם חבילות האנטיוירוס וגם פתרונות מבוססי אלגוריתמי למידת מכונה חייבים להתעדכן באופן תדיר כך שחבילות האנטיוירוס יתעדכנו עם חתימות חדשות של פוגענים חדשים, ופתרונות המבוססים למידת מכונה צריכים להתעדכן עם קבצים אינפורמטיביים )המכילים מידע רב( הן עויינים והן תמימים. במחקר זה אנחנו מציגים פתרון לפער העדכניות המדובר, אנחנו מציגים מערכת ופלטפורמה חדשנית, גנרית ויעילה המבוססת למידה אקטיבית, וכן גם שיטות למידה אקטיבית חדשניות, שכל אלו יחד עשויים לסייע לחברות האנטיוירוס וכן גם לפתרונות מבוססי למידת המכונה, למקד את מאמץ הניתוח שלהם על ידי רכישה של כמות קטנה ביותר של קבצים שהם ככל הנראה פוגענים או לחילופין קבצים תמימים מאוד אינפורמטיביים, ובכך לאפשר לשיפור יעיל ותדיר של בסיסי המידע של חבילות האנטיוירוס ומודלי הזיהוי. בנוסף לרכישה אינטליגנטית של רוב הקבצים האינפורמטיביים, המערכת שלנו מוכוונת גם לעבוד ברזולוציה גבוהה יותר, שבה היא יכולה לסנן בצורה יעילה מופעי התנהגות "רועשים" שאינם נחוצים ואף משבשים את תהליך הלמידה של התנהגות פוגען ספציפי, ובכך המערכת שלנו משפרת את יכולות הזיהוי של פוגענים חמקמקים כדוגמת תולעי מחשב. המערכת שלנו משלבת גם שיטות חדשניות לחילוץ מאפיינים משמעותיים, שיטות שעוצבו במיוחד עבור זיהוי יעיל של סוגי הפוגענים שצוינו לעיל. שיטות אלו למעשה חילצו מאפיינים שמונפו באמצעות שיטות הלמידה האקטיבית שלנו לטובת שיפור ועדכון יעיל של יכולות הזיהוי.

- 57 -

- 58 -

העבודה נעשתה בהדרכת פרופ' יובל אלוביץ'

במחלקה להנדסת מערכות מידע

בפקולטה להנדסה

- 59 -

רכישה יעילה וזיהוי של פוגענים לא ידועים באמצעות למידה אקטיבית.

מחקר לשם מילוי חלקי של הדרישות לקבלת תואר "דוקטור לפילוסופיה"

מאת

ניסים ניר

הוגש לסינאט אוניברסיטת בן גוריון בנגב

אישור המנחה ______

אישור דיקנית בית הספר ללימודי מחקר מתקדמים ע"ש קרייטמן ______

31.12.2015 י"ט בטבת תשע"ו

באר שבע

- 60 -

- 61 -

רכישה יעילה וזיהוי של פוגענים לא ידועים באמצעות למידה אקטיבית.

מחקר לשם מילוי חלקי של הדרישות לקבלת תואר "דוקטור לפילוסופיה"

מאת

ניסים ניר

הוגש לסינאט אוניברסיטת בן גוריון בנגב

31.12.2015 י"ט בטבת תשע"ו

באר שבע

- 62 -