
Linux malware detection by hybrid analysis

By

ANMOL KUMAR SHRIVASTAVA

Department of Computer Science, INDIAN INSTITUTE OF TECHNOLOGY, KANPUR

A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF TECHNOLOGY

Under the supervision of: DR. SANDEEP SHUKLA

MAY 2018

Abstract

Name of the student: Anmol Kumar Shrivastava Roll No: 16111031 Degree for which submitted: M.Tech. Department: Computer Science and Engineering Thesis title: Linux malware detection by hybrid analysis Thesis supervisor: Dr. Sandeep Shukla Month and year of thesis submission: May 2018

Over the past two decades, the cyber-security research community has been working on detecting malicious programs for Windows-based platforms. However, the recent exponential growth in popularity of IoT (Internet of Things) devices is causing the malware landscape to change rapidly. This so-called 'IoT Revolution' has fueled the interest of malware authors, which has led to an exponential growth in Linux malware. The increasing number of malware is becoming a serious threat to data privacy as well as to expensive computing resources. Manual malware analysis is not effective due to the large number of such cases. Furthermore, malware authors are using various obfuscation techniques to impede detection by traditional signature-based anti-virus systems. As a result, automated yet robust malware analysis is much needed. In this thesis, we develop a hybrid approach, integrating both static and dynamic features of a malware, to detect it efficiently. We performed our analysis on 7717 malware and 2265 benign files and obtained a highly promising detection accuracy of 99.14%. All prior work on Linux malware analysis used fewer than 1000 malware samples, and hence the accuracy numbers reported by them are not completely validated. Our work improves over prior work in two ways: a substantial enhancement in the dataset, and hybrid analysis based on both static and dynamic features.

Acknowledgements

I would like to extend my sincerest gratitude to my thesis supervisor, Dr. Sandeep Shukla, for his unparalleled guidance and support. I was new to this field, and I cannot be thankful enough for all the wild ideas and new things I got to learn and explore under his guidance. I am grateful for his patience and all those weekly sessions of discussion. I would also like to thank my family for believing in me and my friends for their support. These 2 years would not have been the same without you all. Last but not the least, I would like to thank Gaurav Kumar, who helped me in this work and made it possible to complete it on time.

TABLE OF CONTENTS

Page

List of Tables vii

List of Figures viii

1 Introduction 1

2 Problem Background 4 2.1 Linux Malware and types ...... 4 2.2 Existing Malware detection strategy ...... 5 2.2.1 Static Analysis ...... 5 2.2.2 Dynamic Analysis ...... 6 2.3 Motivation For a new Approach ...... 7 2.4 Contribution ...... 7

3 Past work 8 3.1 Static analysis approaches ...... 9 3.2 Dynamic analysis approaches ...... 9 3.3 Drawbacks of past works ...... 10

4 Analysis Infrastructure and Feature extraction 11 4.1 Analysis infrastructure ...... 11 4.1.1 Data Generation ...... 11 4.1.2 Feature Extraction ...... 14 4.1.2.1 Static feature vector ...... 14 4.1.2.2 Dynamic Feature Extraction ...... 23 4.1.3 Machine learning classifier ...... 29 4.2 Summary ...... 30

5 Result And Discussion 31 5.1 Dataset ...... 31 5.2 Evaluation Metric ...... 31


5.2.1 Confusion Matrix ...... 31 5.2.2 Additional metrics ...... 32 5.3 Training and Testing ...... 33 5.4 Result ...... 33 5.5 Comparison To Existing Approaches ...... 34

6 Scope and Future Work 36 6.1 Supporting Multiple architecture ...... 36 6.2 Analysis on different file format ...... 36 6.3 Multi-path execution of files ...... 37

A Appendix A 38

Bibliography 39

LIST OF TABLES

TABLE Page

4.1 Fields in ELF header ...... 16 4.2 Fields in section header ...... 17 4.3 Fields in segment header ...... 18 4.4 Mean value comparison of different fields in ELF header ...... 19

5.1 Confusion matrix for a two class classifier ...... 32 5.2 Test result on static feature set ...... 33 5.3 Test result on hybrid features ...... 33 5.4 Previous works on Linux malware analysis ...... 35

LIST OF FIGURES

FIGURE Page

1.1 VirusTotal stats1 ...... 1 1.2 VirusTotal stats2 ...... 2

4.1 Architecture of hybrid model ...... 12 4.2 ELF file format ...... 13 4.3 Limon sandbox architecture ...... 14 4.4 ELF Layout in disk ...... 15 4.5 Frequency distribution of various sections ...... 19 4.6 Frequency distribution of various section types ...... 20 4.7 Frequency distribution of various segment types ...... 20 4.8 Frequency distribution of symbol table features ...... 21 4.9 Example of GNU strings output ...... 22 4.10 Example of strace output ...... 23 4.11 Benign sys call statistics ...... 24 4.12 Malware sys call statistics ...... 24 4.13 Benign proc file system access stats ...... 25 4.14 Malware proc files access stats ...... 25 4.15 Benign proc files access stats ...... 26 4.16 Malware sys file system access stats ...... 26 4.17 Benign etc files access stats ...... 27 4.18 Malware etc file system access stats ...... 28

5.1 Confusion matrix ...... 34

viii O fTig)dvcs aetted hw htLnxhsnwbcm rs otytre o the for target worthy fresh a become now has Linux that hackers. shows trends Latest devices. Things) of h ubro iu ae evr,LnxbsdruesadLnxbsdIT(Internet IoT based Linux and routers in based increase Linux steep servers, a based in Linux This resulted technology. of has the number technology did the in so advancement and and much Internet so the grown of has growth Internet the decades, past last the ver F IGURE ..Tpso lssbitdo iuTtl[]i h at7days 7 past the in [9] VirusTotal on submitted files of Types 1.1. 1 I NTRODUCTION

C HAPTER 1 CHAPTER 1. INTRODUCTION

A report from AV-Test [1] shows that in 2016 MacOS computers saw a 360 percent increase in malware targeting with respect to the previous year, but Linux was not far behind: it saw a 300 percent increase in malware targeting with respect to the previous year. According to the WatchGuard Security report [11] of Q1 2017, Linux malware made up 36 percent of the top threats. In Figure 1.1, we can see that the number of ELF [3] files, i.e., Linux executables, submitted in the last seven days is huge, and their counts are comparable to those of Windows executables. All these statistics show that Linux malware threats are now at an alarming level. As Linux is open source, the latest threats are easily identified, and there are regular updates to the Linux kernel to catch up with new threats. Developers regularly provide system updates and use system protection mechanisms to protect systems from emerging threats. These make Linux one of the safest platforms. But here comes the problem: most IoT device and router vendors do not provide these system updates as frequently, and when they do, it takes a long time for consumers to download the updates and install them on all their devices. This makes the devices prone to exploitation. As the Linux malware threat is on the rise, there is a need for anti-threat mechanisms to make our data and systems secure. In this work, we present an automated Linux malware detection method which currently outperforms all the existing detection methods. Our work focuses on Linux binaries, i.e., ELF (Executable and Linkable Format), which is considered the standard binary file format.

FIGURE 1.2. Number of files submitted on VirusTotal in the past 7 days


As we see in Figure 1.2, the number of new malware and total malware submitted on VirusTotal is very high. Analyzing this amount of data by manual reverse engineering and unpacking is a hectic task. This leads to the need for an efficient automatic detection system. Currently, malware analysts use two approaches for analysis, namely static and dynamic analysis. In static analysis, a file is analyzed just by looking at its structure and the data present in it. As the file is not executed, static analysis is fast and easy to deploy, but it has some limitations; for example, polymorphism/metamorphism and packers make the file data encrypted or packed so that no further static analysis can be done on it. This motivates another approach, dynamic analysis. Here, files are executed in a sandbox environment, and their behavior is analyzed. While this analysis is not affected by obfuscation and encryption of data, it has its limitations too. Some malware hide their true behavior when they find out that they are running in a controlled environment. Another limitation is that, since we usually monitor only one execution path of the process, the rest of the code remains unexplored. So there is a need for a new detection mechanism which can overcome the limitations of both approaches. In the following chapters, we discuss the Linux binary format (ELF), types of malware, and our methodology to detect Linux malware efficiently.

CHAPTER 2

PROBLEM BACKGROUND

In this chapter, we describe some of the existing Linux malware detection methodologies, discuss the problems present in the existing approaches, and explain how we plan to overcome those problems using our approach.

2.1 Linux Malware and types

Malware stands for malicious software which has the intention of stealing useful data, spying on a victim's computer, disrupting infrastructure, etc. Malware can be categorized by their actions. Some of them are listed below:

• Exploit:- Exploits are malware which use some of the system vulnerabilities to attack. Exploits are publicly available for some well-known vulnerabilities. The malware find such a vulnerability in a system and then attack the system accordingly.

• Virus:- A virus is a malware which, when executed, affects other files by inserting its code into them and infecting them. Viruses spread very fast and in a very short time can infect a large server.

• Backdoor:- There are some malware which try to find backdoors to steal user information. In Linux, they try to get the list of all the registered users, or sometimes they try to register themselves as a registered user to gain access to the system.

• DDoS:- DDoS stands for Distributed Denial of Service. This type of attack is seen quite frequently on Linux servers. When an attack is successful, the server becomes unresponsive.

• Keylogger:- A keylogger is a type of malware which tracks all the keys pressed by the victim and sends them to its command and control server.

• Digital currency mining malware:- This type of malware tries to gain access to a system and then uses its resources for mining digital currency. Some of these malware use machine learning methods to analyze the victim's usage and, on that basis, throttle their resource consumption so that they do not get exposed.

• Dropper:- This type of malware is an executable that contains another executable. When executed, it installs the other executable and runs it in parallel, so that if the dropper gets detected, the actual malware remains in the system.

There are some malware which combine the above-listed types to perform their malicious activity. So, distinguishing a malicious file from a non-malicious one becomes a major task once a file enters the system. The work in this thesis aims to detect malware by performing analysis on a large corpus of data to make our model robust to zero-day malware. A zero-day malware exploits a security vulnerability on the same day that vulnerability becomes known to the public or to the vendor who created the software. Our model outperforms many antiviruses as it does not depend on malware signatures. It uses static as well as dynamic features of an executable for detection.

2.2 Existing Malware detection strategy

Malware analysis is the method of dissecting a binary file to understand how it works and then devising methods to identify it and other similar files. It aims to gain information about the actions performed by the malware and then to develop a method to neutralize its effect and to protect our systems from further infection. Malware analysis can be used both for malware detection and for malware classification. Malware detection means labeling an executable as benign or malware; therefore malware detection is the first stage of malware analysis. Once an executable is detected as malware, further classification based on malware type and family can be performed on it. Malware analysis can be done in two basic ways: static and dynamic analysis. Static analysis aims to analyze a binary without executing it, whereas in dynamic analysis a binary is executed inside a sandbox and analysis is performed based on its behavior.

2.2.1 Static Analysis

Static analysis examines the static properties of a binary. Without executing the binary, an examination can be performed on its ELF header, embedded strings, metadata, disassembly, etc. This analysis is fast as we do not have to execute the binary. But along with its pros, it has some limitations too. There are many techniques that malware authors nowadays use to thwart static analysis. Some of these techniques are described below:

• Packing:- In this technique malware authors use an encryption or compression algorithm on the original executable to create a packed executable which contains an unpacking stub. When the executable is run, the first thing loaded in memory is the unpacking stub. This stub then unpacks the packed executable and transfers control to the actual entry point of the executable.

• Metamorphism:- A metamorphic malware is a malware which changes its code each time it gets executed, without changing the actual functionality of the malware. This can be achieved by replacing an instruction with a similar instruction having a different opcode, inserting garbage code, changing the order of subroutines, etc.

• Polymorphism:- This type of malware changes its shape as well as its signature. It has two parts: a decryption routine and an encrypted malware body. Malware authors use a randomly generated key to encrypt the malware body. Once this malware is loaded into memory, the decryption routine decrypts the encrypted part to perform the malicious activity. After execution, it encrypts itself again so that it does not get discovered.

Malware authors use these techniques so that, each time the hash of the malware is taken, it yields a different hash value, allowing them to bypass signature-based detection systems. Nor can static code analysis be performed on this type of malware, as the code is encrypted. These limitations of static analysis motivate the need for another type of analysis which can overcome them.
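The fragility of hash-based signatures against such byte-level mutation is easy to demonstrate. A minimal sketch (the "malware" bytes here are purely synthetic):

```python
import hashlib

def sha256_signature(data: bytes) -> str:
    """Return the SHA-256 hex digest used as a naive file signature."""
    return hashlib.sha256(data).hexdigest()

original = b"\x7fELF" + b"payload bytes of some sample"
# A polymorphic variant: same behavior, but one byte of the body differs.
variant = original.replace(b"payload", b"pay1oad")

sig_a = sha256_signature(original)
sig_b = sha256_signature(variant)

print(sig_a == sig_b)  # False: a 1-byte change yields a completely new signature
```

A single changed byte produces an unrelated digest, so every repacked or re-encrypted variant evades a signature database built from hashes.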

2.2.2 Dynamic Analysis

In dynamic analysis, an executable is run in a controlled environment, such as a virtual machine, and its behavior is observed to deduce whether it is malware or not. When an executable is run, we can track which files it accesses, which IPs it tries to connect to, which new files it creates, etc. The main advantage of dynamic analysis is that it remains unaffected by polymorphic or metamorphic malware, as it has nothing to do with the static code of the malware. Despite this advantage, it has some limitations too. Some of them are listed below:

• Incomplete code coverage:- During dynamic analysis we are able to monitor only a single execution path, which leads to incomplete code coverage.

• Detection of Sandbox Environment:- Some malware can detect whether it is running in a controlled environment. When this happens, the malware does not show its true behavior.


• Risk to the host machine:- If there is a bug in the sandbox environment, malware may escape the isolation, and the host machine or other computers in the network may get damaged or infected.

2.3 Motivation For a new Approach

As we have seen in the earlier sections, both static and dynamic analysis have limitations. Static analysis can be thwarted once an encryption algorithm is used, while dynamic analysis suffers from the low code coverage problem. What if we could combine the feature sets of both of these approaches? The dynamic features may be handy for getting full insights when static analysis is thwarted by obfuscation. On the other hand, static analysis can provide a full overview of the executable when dynamic analysis suffers from the code coverage issue. This shows that the two may act as complementary to each other. Malware authors can use packers, obfuscation techniques, and polymorphism/metamorphism to bypass file format based or signature-based analysis. They can make the malware perform additional actions, like randomly accessing a file or calling random system calls, to bypass dynamic analysis. But bypassing both techniques at once is a much tougher job for them. In this work we have used this hybrid approach, and the results are quite promising. In later chapters, we discuss the architecture and features of our model in further detail.

2.4 Contribution

• To the best of our knowledge and our literature survey, this is the first time hybrid analysis has been done on Linux binaries using such a large dataset. Most of the previous works, which we survey in the next chapter, are interesting but are based on experiments on much smaller datasets.

• We have used our new hybrid approach to detect Zero-day malware, and the results are quite promising.

• We have created an automated system to deal with a large number of files which are difficult to analyze manually.

CHAPTER 3

PAST WORK

A large variety of malware is being used to attack critical infrastructure, steal private data, carry out financial fraud, etc. To remain spared from them, big MNCs and government organizations are spending huge amounts of money on malware detection systems or on creating their own anti-virus software. Most of these anti-virus systems are based on signature-based or anomaly-based detection techniques. A signature is usually a hash that uniquely identifies a specific malware. The signature-based technique uses the signatures of known malware; these signatures are developed by antivirus companies to capture threats. This technique is efficient, fast and easy to deploy, but once an unseen or new malware comes into the system, it fails. Only after that malware has affected numerous systems, and analysts are able to generate its signature, will antivirus companies be able to detect it. In the anomaly-based detection technique, rules are formed about actions which are considered safe. If any of the rules are broken by a process, it is labeled as malicious. This technique has the capability to capture new malware, but it has a very high false-alarm rate. Another method which is now gaining popularity is heuristic based. In this technique, static, dynamic or behavior based features of a dataset containing both benign and malware samples are used to train a machine learning classifier. In this literature survey, we cover some of the past work that has used these heuristic techniques. Static analysis based approaches are discussed in section 3.1 and dynamic analysis based approaches in section 3.2.

3.1 Static analysis approaches

In this section, we discuss some of the past work which have used the static analysis approach.

– In Shahzad, F. [15] the authors used the Executable and Linkable Format (ELF) for analysis and extracted 383 features from the ELF header. They used information gain as the feature selection algorithm. They used four well-known supervised machine learning algorithms, C4.5 Rules, PART, RIPPER and the J48 decision tree, for classification. Their dataset contained 709 benign executables scraped from the Linux platform and 709 malware executables downloaded from VX Heavens [10] and Offensive Computing [5]. They reported nearly 99% detection accuracy with a false-alarm rate of less than 0.1.

– Jinrong Bai et al. [12] proposed a new malware detection technique in which they extracted system calls from the symbol table of Linux executables. Out of the many system calls, they selected 100 as features. Their method obtained an accuracy of 98% in malware detection. Their dataset contained 756 benign executables scraped from Linux systems and 763 malware executables from VX Heavens.

3.2 Dynamic analysis approaches

In this section, we discuss some of the past works which have used dynamic analysis.

– Ashmita, K. et al. [13] proposed an approach based on system call features. They use 'strace' [8] to trace all the system calls of executables running in a controlled environment. The authors used a two-step correlation-based feature reduction: they first calculated feature-class correlation using information gain and entropy to rank the features; then, in the next step, they removed redundant features by calculating feature-feature correlation. They used three supervised machine learning algorithms, J48, Random Forest and AdaBoost, for classification, with a feature vector of length 27. The authors used 668 files in their dataset, out of which 442 were benign and 226 were malware executables. With this approach, they reported an accuracy of 99.40%.

– Shahzad, F., Bhatti [14], [16] proposed a concept of genetic footprints in which information mined from the Process Control Block (PCB) of the kernel is used to detect the runtime behavior of a process. In this approach, the authors selected 16 out of the 118 available parameters of the task_struct for each running process. To decide which parameters to select, the authors claim to have done a forensic study. They believe that these parameters define the semantics and the behavior of the executing process. These selected parameters are called the genetic footprints of the


process. The authors then generated a dump of all these parameters for 15 seconds with a resolution of 100 ms. All the instances of benign and malware processes are classified using an RBF network, SVM, J48 decision tree and a propositional rule learner (J-Rip) in the Weka environment. The authors analyzed their results and shortlisted the J-48 and J-Rip classifiers as having less class-noise compared to the others. In the end, the authors also list a comparison with other existing system call based solutions, and discuss the robustness of their approach to evasion and to modification of the task_struct. They used a dataset of 105 benign and 114 malware processes and reported a detection rate of 96 percent with a 0 percent false alarm rate.
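Several of the surveyed works rank candidate features by information gain. A toy computation of IG(Y; X) = H(Y) - H(Y|X) on made-up syscall-usage features (not the papers' actual data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y|X) for a discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

# Toy data: 1 = syscall used, 0 = not used; labels: 'm' malware, 'b' benign.
uses_ptrace = [1, 1, 1, 1, 0, 0, 0, 0]
uses_write  = [1, 0, 1, 0, 1, 0, 1, 0]
labels      = ['m', 'm', 'm', 'm', 'b', 'b', 'b', 'b']

print(information_gain(uses_ptrace, labels))  # 1.0 — perfectly separates the classes
print(information_gain(uses_write, labels))   # 0.0 — carries no class information
```

Features with the highest information gain are kept; features whose gain is near zero are discarded before training the classifier.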

3.3 Drawbacks of past works

– All of the past works which we have seen in the literature survey used very small sample sizes, and hence the reported accuracy or false positive rates may not be reflective of the real power of their malware classification methods.

– In the dynamic analysis cases, they used a very restricted number of features.

– They do not handle zero-day malware at all.

– They do not take advantage of using both static and dynamic features.

CHAPTER 4

ANALYSIS INFRASTRUCTURE AND FEATURE EXTRACTION

In this chapter, we are going to discuss the feature engineering and the modelling technique used by our detection system to achieve a good detection accuracy. Our work is based on various 32-bit ELF executables but can be extended to the 64-bit version as well.

4.1 Analysis infrastructure

Figure 4.1 illustrates our Analysis Infrastructure. It has three phases: Data generation, Feature Extraction and Data modelling.

4.1.1 Data Generation

In this section, we are going to see how we generated the static and dynamic reports and executables for further analysis in the subsequent phases. For static reports, we use:

– GNU strings:- In Binutils, strings is one of the utilities. When used on any file, it searches for runs of ASCII characters followed by an unprintable character. Strings can be useful as it sometimes gives an overview of what an executable is going to perform; for example, if a malware tries to access /etc/passwd, we can have this file path printed in the string output, or if an executable tries to make a TCP connection, we can see an IP address embedded in the executable in ASCII.

FIGURE 4.1. Architecture of our hybrid malware detection system


– readelf:- readelf [6] is a binary utility that displays structural information about one or more ELF files. The ELF file format contains a lot of information that can be used in the detection of malware. It contains the ELF header followed by the file data. More about the ELF format will be discussed in further sections.

FIGURE 4.2. ELF file format

– Limon sandbox:- Limon [4] is a tool which allows us to run an executable in a sandboxed/controlled environment and gives us a report about what it did during its runtime. The main components of the Limon sandbox include a host machine which manages a guest machine. We used Ubuntu 16.04 as our host machine and Ubuntu 14.02, a 32-bit machine, as our guest machine. To get a full picture of a file, it is executed in fully privileged mode on the guest machine. To run an executable in the Limon sandbox, its path is given on the command line. Each analysis is performed in a fresh virtual machine. While setting up a virtual machine, a snapshot is taken so that after execution of the file, Limon can revert back to it. At the end of the execution, the sandbox returns a text file containing the full trace of system calls and userspace functions. We used the default setting of 60 seconds for which a file is monitored. The architecture of the Limon sandbox is shown in Figure 4.3.
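The GNU strings behavior described above can be approximated in a few lines; this is an illustrative sketch, not the thesis's actual tooling (the `extract_strings` helper and the sample blob are made up):

```python
import re

def extract_strings(blob: bytes, min_len: int = 4):
    """Roughly mimic GNU strings: return runs of min_len+ printable
    ASCII characters found anywhere in a binary blob."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]

# Synthetic "binary" with an embedded path and IP, as in the examples above.
blob = b"\x7fELF\x01\x00" + b"/etc/passwd" + b"\x00\x02" + b"192.168.0.7" + b"\x00\xff"
print(extract_strings(blob))  # ['/etc/passwd', '192.168.0.7']
```

The extracted strings can then serve as raw material for string-based static features.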


FIGURE 4.3. Limon sandbox architecture
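The trace file that the sandbox returns is a natural source of dynamic features; one simple step is turning it into per-syscall counts. A minimal sketch (the trace lines below are made-up examples in strace's usual `name(args) = retval` format, not real sandbox output):

```python
import re
from collections import Counter

# strace-style lines begin with "name(args) = retval"; capture the syscall name.
SYSCALL_RE = re.compile(r"^(\w+)\(")

def syscall_histogram(trace_lines):
    """Count occurrences of each system call in an strace-style trace."""
    counts = Counter()
    for line in trace_lines:
        m = SYSCALL_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical trace fragment.
trace = [
    'open("/etc/passwd", O_RDONLY) = 3',
    'read(3, "root:x:0:0"..., 4096) = 512',
    'read(3, "", 4096) = 0',
    'close(3) = 0',
]
hist = syscall_histogram(trace)
print(hist["read"])  # 2
```

Such per-syscall counts correspond to the kind of system call statistics compared for benign and malware files in Figures 4.11 and 4.12.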

4.1.2 Feature Extraction

As we use the hybrid approach, we first extract the static features and dynamic features of executables separately and then integrate both of them for use in our model.

In the following section, we describe the various static and dynamic features we have used in our model.
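The integration step can be sketched as simple fixed-order vector concatenation; the feature names below are hypothetical placeholders, and the real static and dynamic feature sets are far larger:

```python
# Hypothetical per-file feature dicts; the real sets (ELF header fields,
# section stats, syscall counts, file-access stats, ...) are much larger.
static_features  = {"e_shnum": 29, "e_ehsize": 52, "num_text_sections": 1}
dynamic_features = {"sys_open": 12, "sys_read": 40, "proc_access": 3}

def hybrid_vector(static, dynamic, static_keys, dynamic_keys):
    """Concatenate static and dynamic features into one fixed-order vector,
    filling 0 for any feature absent from a given file's report."""
    return [static.get(k, 0) for k in static_keys] + \
           [dynamic.get(k, 0) for k in dynamic_keys]

STATIC_KEYS  = sorted(static_features)
DYNAMIC_KEYS = sorted(dynamic_features)

vec = hybrid_vector(static_features, dynamic_features, STATIC_KEYS, DYNAMIC_KEYS)
print(len(vec))  # 6: one slot per static feature, then one per dynamic feature
```

The machine learning classifier is then trained on such combined vectors, one per executable.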

4.1.2.1 Static feature vector

The static features are extracted from the ELF file format of both malware and benign executables. Before describing the features, we first briefly introduce the ELF format.


– Executable and Linkable Format
ELF (Executable and Linkable Format) is the standard binary file format in Unix and Unix-like systems. The binaries in Linux are executables, shared libraries, object code and core dumps. The ELF file format basically has three major parts:

* ELF header
* Segments
* Sections

FIGURE 4.4. ELF Layout in disk

Each of these parts plays an important role in the loading and linking process. Let's now look at the role and structure of each component one by one:


* ELF header
The ELF header is a data structure which gives information about the organization of the file. The various fields in the ELF header and their definitions are given in the table below:

Table 4.1: Fields in ELF header

e_ident       Contains various flags for identification of the file; these flags help in decoding and interpreting the file content, e.g. the ELF magic, file class and file version.
e_type        Signifies the type of binary, e.g. executable, shared library.
e_machine     Gives information about the architecture of the file, e.g. x86, MIPS.
e_version     The version of the object file; set to 1 for the original version of ELF.
e_entry       Address of the process entry point.
e_phoff       Points to the start of the program header table.
e_shoff       Points to the start of the section header table.
e_flags       Information in some processor-specific flags.
e_ehsize      Size of the ELF header.
e_phentsize   Size of a program header table entry.
e_phnum       Number of entries in the program header table.
e_shentsize   Size of a section header table entry.
e_shnum       Number of entries in the section header table.
e_shstrndx    Index of the section header table entry which contains the names of all sections.

* Sections
All the information required during the linking process to turn a target object file into a working executable resides in sections. Sections are actually needed at link time; they are of no use at run time. The ELF header points to a section header table which contains the information about each section present in the file. This table contains a number of section headers, each pointing to a section. There are various fields in each section header; their description is shown in Table 4.2:


Table 4.2: Fields in section header

sh_name       Contains an offset to a string in the .shstrtab section, which is the name of the section.
sh_type       Signifies the type of section, e.g. program data, symbol table, string table.
sh_flags      Signifies the attributes of a section.
sh_addr       For sections which are loaded in memory, contains their virtual address.
sh_offset     Contains the offset of the section in the file image.
sh_size       Size of the section.
sh_link       Contains the section index of another associated section.
sh_info       Additional information about the section.
sh_addralign  Signifies the required alignment for the section.
sh_entsize    For sections holding a table of fixed-size entries, the size of each entry.

There are a number of sections, each having a different role in the linking process. Let's look at some of them:
· .text section: This contains user executable code.
· .data section: This contains all the initialized data.
· .rodata section: This contains read-only initialized data.
· .bss section: This contains all the uninitialized data.
· .got section: For dynamic binaries this section, which stands for Global Offset Table, contains the addresses of all the variables which are relocated upon loading.
· .got.plt section: These are the GOT entries assigned to dynamically linked functions.
· .dynamic section: For dynamic binaries this section contains information about dynamic linking which is used by the runtime linker.
· .dynsym section: This is the runtime symbol table.
· .dynstr section: This contains null-terminated strings which are the names of the symbols.
· .symtab section: This is the compile-time symbol table; .dynsym is a subset of this section.
· .strtab section: Contains the names of the symbols in the symbol table.
· .shstrtab section: This contains the names of the sections.
· .rela.dyn section: Run-time relocation table.


* Segments
Segments are described by program headers. In the execution view, the ELF file is broken into suitable chunks known as segments which get loaded in memory. Like the section header table, there is also a program header table; the program header table is optional in the linking view, while the section header table is optional in the runtime view. The program header table contains segment headers which give information about the various segments present in the file image. Each segment header has a number of fields; their definitions are given in the table below:

Table 4.3: Fields in segment header

p_type    Gives information about the type of segment.
p_flags   Segment related flags.
p_offset  Offset of the segment in the file.
p_vaddr   Virtual address of the segment in memory.
p_paddr   Physical address of the segment.
p_filesz  Segment size in the file.
p_memsz   Segment size in memory.
p_align   Required alignment of the segment.

There are a number of segment types; some of them are described below:

· NULL: an unassigned segment.
· LOAD: a segment which gets loaded into memory; the rest of the segments are mapped within the memory range of one of these segments.
· INTERP: the .interp section gets mapped to this segment.
· DYNAMIC: this is basically the .dynamic section in memory.
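To make the segment-header layout concrete, here is a minimal Python sketch (an illustration only, not the thesis's extraction code) that unpacks one 64-bit program header with the standard struct module; the demo bytes are synthetic:

```python
import struct

# 64-bit program (segment) header layout; note that in the 64-bit
# format p_flags comes right after p_type.
PHDR64 = struct.Struct("<IIQQQQQQ")
PT_NAMES = {0: "NULL", 1: "LOAD", 2: "DYNAMIC", 3: "INTERP"}

def parse_phdr(raw):
    """Unpack one 64-bit segment header into the fields of Table 4.3."""
    (p_type, p_flags, p_offset, p_vaddr, p_paddr,
     p_filesz, p_memsz, p_align) = PHDR64.unpack(raw)
    return {"type": PT_NAMES.get(p_type, hex(p_type)), "flags": p_flags,
            "offset": p_offset, "vaddr": p_vaddr, "paddr": p_paddr,
            "filesz": p_filesz, "memsz": p_memsz, "align": p_align}

# Synthetic LOAD segment header for demonstration.
demo = PHDR64.pack(1, 5, 0, 0x400000, 0x400000, 0x1234, 0x1234, 0x1000)
print(parse_phdr(demo)["type"])
```

In practice the same fields can be inspected with the readelf tool [6].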

We have just looked at an overview of the ELF file format. Now let's go back to our static feature extraction, which uses all the information described above.

– ELF structure feature set In this section, we are going to discuss the features we have extracted from the ELF structure.

* ELF Header This gives us information about the organisation of the ELF file. Out of the various fields present in the ELF header, we have picked seven fields for our feature set. A statistical comparison of these seven features between malware and benign files is shown in Table 4.4:


Table 4.4: Mean value comparison of different fields in ELF header

Features                            Mean for benign   Mean for malware
Number of section headers           29.647            97.396
Size of ELF header                  63.466            52.202
Number of program headers           8.921             4.222
Start of section header             170400.26         346456.475
Start of program header             63.512            52.189
Size of program headers             55.025            32.404
Section header string table index   27.729            94.977
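As an illustration of how these seven header fields can be pulled out of a binary, the sketch below (hypothetical helper names, not the actual feature-extraction pipeline) unpacks a 64-bit ELF header with Python's struct module; the demo header is synthetic:

```python
import struct

# 64-bit ELF header layout (little-endian), per the ELF specification.
ELF64_HDR = struct.Struct("<16sHHIQQQIHHHHHH")

def elf_header_features(raw):
    """Extract the seven ELF-header fields used as static features."""
    (e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
     e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
     e_shstrndx) = ELF64_HDR.unpack_from(raw)
    if e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return {
        "num_section_headers": e_shnum,
        "elf_header_size": e_ehsize,
        "num_program_headers": e_phnum,
        "section_header_offset": e_shoff,
        "program_header_offset": e_phoff,
        "program_header_size": e_phentsize,
        "shstrtab_index": e_shstrndx,
    }

# Synthetic 64-bit ELF header for demonstration.
demo = ELF64_HDR.pack(b"\x7fELF" + bytes(12), 2, 0x3E, 1, 0x400000,
                      64, 8192, 0, 64, 56, 9, 64, 29, 28)
print(elf_header_features(demo))
```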

* Section Header Table The basic structure of the Section Header Table was discussed in the previous section. From the section header, we have used the section name and section type in our feature list. The frequency distribution of various sections in benign and malware files is shown below:

FIGURE 4.5. Frequency distribution of various sections


FIGURE 4.6. Frequency distribution of various section types

* Program Header Table The structure of the Program Header Table was discussed under segments in the previous section. From the Program Header Table, out of the various fields, we have used the segment type in our feature list. A comparison of segment types for benign and malware files is shown below:

FIGURE 4.7. Frequency distribution of various segment types


* Dynamic section The runtime linker uses this segment to find all the necessary information needed for dynamic linking and relocation. The dynamic section contains two fields, 'tags' and 'values'; the number of entries in the dynamic section is not fixed. In our work, we have used the content of the 'tags' field in our feature list.

* Symbol table The symbol table contains a large amount of data needed to link or debug files. The symbol table structure has five fields, namely: name, value, info, size and section header index. We categorized the symbols according to their 'info' field, and the objects and functions according to their scope and info. As features, we have used the 14 categories we created and the count of dynamic symbols. The frequency distribution of these features for benign and malware files is shown below:

FIGURE 4.8. Frequency distribution of various symbol table features


– Strings based feature extraction Strings from a file can be informative, but most of the strings which we get from a file come from the file structure, such as the names of objects and functions from the symbol tables, arguments of functions, or some garbage values.

FIGURE 4.9. Example of GNU strings output

All of these we are already using as features in one way or another, so reusing them would cause redundancy in our feature set. However, we are using frequency bins related to the lengths of strings in our feature set.
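A minimal sketch of such length-based binning is shown below; the bin edges are an assumption for illustration, since the exact boundaries are not listed here:

```python
import re

def string_length_bins(data, bins=(4, 8, 16, 32, 64)):
    """Count printable strings per length bin (>64 chars go in the last bin)."""
    # Printable ASCII runs of length >= 4, like the default of GNU strings.
    strings = re.findall(rb"[\x20-\x7e]{4,}", data)
    counts = [0] * (len(bins) + 1)
    for s in strings:
        for i, upper in enumerate(bins):
            if len(s) <= upper:
                counts[i] += 1
                break
        else:  # longer than the largest bin edge
            counts[-1] += 1
    return counts

print(string_length_bins(b"hi\x00hello\x00averylongstring1"))
```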


4.1.2.2 Dynamic Feature Extraction

The runtime behaviour based features are extracted from the reports of the files generated by the Limon sandbox. A full Limon sandbox report comes as a text file which contains static analysis (structure information), ssdeep [7] (fuzzy hashing comparison with other reports), dynamic analysis (system call trace) and network analysis (if a file is engaged in some network activity). As most malware authors now use polymorphism and metamorphism techniques, a single file can end up with multiple signatures. But when the program is loaded in memory, to perform its action it is decrypted to its original form. Here comes the role of ssdeep: it takes fuzzy hashes of the loaded binary and compares them with those of other files. This helps us to remove multiple files that have different signatures but the same fuzzy hash, reducing redundancy in our dataset. For this work, we have used the system call trace and the arguments of system calls in our feature list. Let's see all of them one by one.

– System Calls: System calls can give us information about what a process wants to perform, or what we call the behaviour of a process. The Limon sandbox uses 'strace' to get a full system call trace of a process and the processes related to it. An example of the output of strace is shown below in the image:

FIGURE 4.10. Example of strace output

The output of 'strace', as we can see in the above picture, contains system calls, their arguments and return values. We have created an architecture through which we can get only the names of the system calls from the strace report. A Linux system uses a fixed set of system calls. We have used all the system calls in our feature list and checked, for each file, which system calls it uses. A statistical comparison of system calls for benign and malware executables as observed in our dataset is shown in fig 4.11 and fig 4.12.
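Extracting syscall names from strace output can be sketched as follows (an illustrative regex and helper name, not the thesis's actual architecture; real strace output has more line variants than this covers):

```python
import re
from collections import Counter

# Matches the syscall name at the start of an strace line, optionally
# preceded by a "[pid NNNN]" prefix as emitted by strace -f.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def syscall_names(strace_lines):
    """Count how often each system call name appears in the trace."""
    counts = Counter()
    for line in strace_lines:
        m = SYSCALL_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts

demo = [
    'open("/etc/passwd", O_RDONLY) = 3',
    'read(3, "root:x:0:0:...", 4096) = 1024',
    '[pid  1234] connect(4, {sa_family=AF_INET}, 16) = 0',
    'read(3, "", 4096) = 0',
]
print(syscall_names(demo))
```

The resulting counter can be mapped onto the fixed syscall list to build the per-file feature vector.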


FIGURE 4.11. 20 most frequent system calls of benign files

FIGURE 4.12. 20 most frequent system calls of malware files


– File system based features:

* proc and sysfs filesystems proc and sysfs are virtual filesystems which contain runtime system information on processes, system and hardware configurations, and information on the kernel subsystems and kernel drivers. A comparison of some of the proc and sys files accessed by malware and benign files is shown below.

FIGURE 4.13. 7 most frequent proc files accessed by benign files

FIGURE 4.14. 7 most frequent proc files accessed by malware


FIGURE 4.15. 7 most frequent sys files accessed by benign files

FIGURE 4.16. 7 most frequent sys files accessed by malware


From our dataset we observed that a large portion of malware accesses '/proc/net/route', i.e. the system routing table, to get the list of all active network interfaces. We also find that they access '/proc/net/tcp' and '/proc/net/dev' to get information about active TCP sockets and about sent and received packets respectively. On the counterpart, in sysfs, we saw that malware accesses '/sys/class/net/' to get the length of the transmission queue. This information is very important for performing a DDoS attack. Some of the sys and proc files are used by malware authors for VM detection, like '/proc/cpuinfo', '/proc/sysinfo', '/sys/class/dmi/id/product name' etc. From our dataset, we have observed these files being accessed by malware more frequently.
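Filtering such proc/sys accesses out of an strace log can be sketched like this (hypothetical helper; the regex only covers plain open()/openat() lines):

```python
import re

# Captures the path argument of open()/openat() calls in an strace line.
OPEN_RE = re.compile(r'open(?:at)?\([^"]*"([^"]+)"')

def accessed_proc_sys_files(strace_lines):
    """Paths under /proc or /sys opened during the trace."""
    hits = []
    for line in strace_lines:
        m = OPEN_RE.search(line)
        if m and m.group(1).startswith(("/proc/", "/sys/")):
            hits.append(m.group(1))
    return hits

demo = [
    'open("/proc/net/route", O_RDONLY) = 3',
    'openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 4',
    'openat(AT_FDCWD, "/sys/class/net/eth0/tx_queue_len", O_RDONLY) = 5',
]
print(accessed_proc_sys_files(demo))
```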

* etc file system The etc folder contains all system configuration files. It contains system configuration tables, configuration files which can force a service to start or stop, configuration files of installed programs, and configuration files which give information about allowed or restricted users, permitted IPs etc.

FIGURE 4.17. 7 most frequent etc files accessed by benign files


FIGURE 4.18. 7 most frequent etc files accessed by malware

From our malware dataset we observed that network configuration files like '/etc/resolv.conf', '/etc/hosts' etc. are accessed more frequently. We observed that a chunk of malware accesses the '/etc/passwd' file, which gives information about each registered user. Flooders (a malware type) use this information to find a backdoor account. They also try to edit '/etc/passwd' and '/etc/shadow' to add a new user.

* shell commands The shell acts as an interface between the user and the operating system; using commands we can use the services of the OS. In our dataset, 16 percent of malware executed at least one external command, while for benign files the percentage is quite low, nearly 3-4 percent. In total we got 131 unique commands from our dataset. Commands like cp, netstat, iptables, touch, file etc. are most frequently seen being executed by malware. Some of the malware try to execute the system 'reboot' command, and some of them execute the 'ufw' command, which can be used to alter the firewall of the network. In benign files, not many commands were found to be executed; commands like file, grep, basename etc. are mostly seen in them.
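External command executions show up in an strace log as execve calls; collecting the command names can be sketched as follows (illustrative helper, synthetic demo lines):

```python
import re

# execve("/bin/netstat", ["netstat", "-an"], envp) -- capture the binary path.
EXECVE_RE = re.compile(r'execve\("([^"]+)"')

def executed_commands(strace_text):
    """Distinct basenames of programs passed to execve in an strace log."""
    return sorted({path.rsplit("/", 1)[-1]
                   for path in EXECVE_RE.findall(strace_text)})

demo = ('execve("/bin/netstat", ["netstat", "-an"], envp) = 0\n'
        'execve("/usr/bin/cp", ["cp", "a", "b"], envp) = 0')
print(executed_commands(demo))
```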


4.1.3 Machine learning classifier

We have used the Python based machine learning library scikit-learn to evaluate the efficiency of the hybrid approach we used. We have used various classification algorithms, which are described below:

– KNN KNN stands for K-Nearest Neighbours. In this algorithm, a large number of labelled data points are present in the feature space; when an unlabelled data point arrives in that feature space, it can be labelled by looking at its K nearest neighbours and giving it the label that is in the majority among them. The distance generally used here is the Euclidean distance, i.e. the square root of the sum of squares of the differences of each feature of two data points.
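The neighbour-voting procedure described above can be sketched in a few lines of plain Python (a toy illustration with made-up points; the actual experiments used scikit-learn's implementation):

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Label x by majority vote among its k nearest (Euclidean) neighbours."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["benign", "benign", "malware", "malware"]
print(knn_predict(train, labels, (5, 4), k=3))
```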

– Decision Tree One of the most intuitive and popular methods of data mining, which provides explicit rules for classification and copes well with heterogeneous data, missing data and nonlinear effects, is the decision tree. It uses information gain to select as the root node the feature which gives the highest information gain value, and similarly proceeds down to the leaf nodes, where the decision is made.

– Random Forest Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it somewhat random. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result. One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. The random forest algorithm brings extra randomness into the model when growing the trees: instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This creates wide diversity, which generally results in a better model. Therefore, when growing a tree in a random forest, only a random subset of the features is considered for splitting each node.


4.2 Summary

In this chapter we saw the architecture of our model, the types of features we used, how they are extracted, and the different machine learning classifiers we have used in our model for the detection of malware. In the next chapter we are going to discuss the dataset we used and see the results of our model on that dataset.

CHAPTER 5

RESULT AND DISCUSSION

5.1 Dataset

To make our model robust, the first thing we needed was a large corpus of a dataset. Most of the previous works' authors have used a very small amount of data; we faced challenges to collect a large amount of data, from VXheavens, VirusTotal and Detux.org [2]. For benign files, we scraped the executables from the system directories /bin, /sbin and /usr/bin of our Linux operating system, and we downloaded and compiled some more open source C and C++ software projects. For the final analysis, we have used 7717 malware and 2265 benign executables in our system.

5.2 Evaluation Metric

We have used several metrics to evaluate the performance of a classification model. Below is a brief description of them.

5.2.1 Confusion Matrix

A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier. The entries in the confusion matrix have the following meaning in the context of our study:

– True positive: These are the samples which are correctly predicted as benign.

– True negative: These are the samples which are correctly predicted as malware.

– False positive: These are the samples which are malware but predicted as benign.

– False negative: These are the samples which are benign but predicted as malware.

Table 5.1: Confusion matrix for a two class classifier

                   Predicted Positive   Predicted Negative
Actual Positive    #TP                  #FN
Actual Negative    #FP                  #TN

5.2.2 Additional metrics

– TPR: TPR stands for True Positive Rate, also known as recall. This is defined as the ratio of the number of true positives to the total number of positive samples:

TPR = TP / (TP + FN)

– FPR: FPR stands for False Positive Rate. This is defined as the ratio of the number of false positives to the total number of negative samples:

FPR = FP / (FP + TN)

– Precision: Precision (PV) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

PV = TP / (TP + FP)

– F-score: When the dataset is imbalanced, the F-score is used to measure how correct the model is. It is calculated as the weighted harmonic mean of precision and recall:

F-score = (2 × TPR × PV) / (TPR + PV)
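Putting the four definitions together, a small helper (illustrative only) computes all of them from the confusion-matrix counts:

```python
def metrics(tp, fp, tn, fn):
    """TPR, FPR, precision (PV) and F-score from confusion-matrix counts."""
    tpr = tp / (tp + fn)                # recall / true positive rate
    fpr = fp / (fp + tn)                # false positive rate
    pv = tp / (tp + fp)                 # precision
    fscore = 2 * tpr * pv / (tpr + pv)  # harmonic mean of recall and precision
    return tpr, fpr, pv, fscore

print(metrics(9, 1, 9, 1))
```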


5.3 Training and Testing

For our experiments we used Ubuntu 16.04 with 32 GB RAM and an Intel i7 octa-core processor. We used 70% of the data for training and 30% for testing. To minimize the risk of overfitting and to get a generalized result, we used 10-fold cross validation.
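The 10-fold splitting can be sketched with a stdlib-only helper (an illustration; the actual experiments used scikit-learn's built-in cross-validation utilities):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # deterministic shuffle
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

folds = list(kfold_indices(20, k=5))
print([len(test) for _, test in folds])
```

Each sample lands in exactly one test fold, so every data point is used for validation once across the k runs.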

5.4 Result

As we have already seen, we have used three machine learning classifiers, and we checked our model's efficiency on all of them. The results we received are highly promising. Tables 5.2 and 5.3 show the results which we achieved on our dataset for static-only features and for the hybrid approach respectively.

Table 5.2: Test result on static feature set

                 KNN                        Decision Tree              Random Forest
Class      TPR   FPR   Pr    FM       TPR   FPR   Pr    FM       TPR   FPR   Pr    FM
Benign     0.888 0.067 0.757 0.817    0.978 0.017 0.937 0.957    0.982 0.014 0.950 0.966
Malware    0.932 0.111 0.972 0.951    0.982 0.021 0.994 0.988    0.986 0.017 0.995 0.990
Average    0.922 0.101 0.923 0.921    0.981 0.020 0.981 0.981    0.985 0.016 0.985 0.985

Table 5.3: Test result on hybrid features

                 KNN                        Decision Tree              Random Forest
Class      TPR   FPR   Pr    FM       TPR   FPR   Pr    FM       TPR   FPR   Pr    FM
Benign     0.914 0.062 0.777 0.840    0.983 0.010 0.963 0.973    0.989 0.006 0.976 0.982
Malware    0.938 0.085 0.978 0.958    0.989 0.016 0.995 0.992    0.993 0.010 0.997 0.995
Average    0.932 0.079 0.932 0.9314   0.987 0.014 0.987 0.987    0.992 0.009 0.992 0.992

As we can see from tables 5.2 and 5.3, we get the best detection accuracy using Random Forest. We can also see that the detection efficiency increases in all three models (KNN, Decision Tree, Random Forest) as we shift the feature set from static features to hybrid features. The best weighted average F-measure score, i.e. 0.992, we get with Random Forest, which is pretty good considering the fact that the TPR of benign is low in comparison with that of malware. The main goal of our work was to have a high TPR for malware while not predicting too many benign files as malware. Figure 5.1 depicts the confusion matrix for the best detection accuracy, which we got using Random Forest.


Confusion matrix statistics:

* malware/malware = 99.69% ( 2313/2320 )

* malware/benign = 0.31% ( 7/2320 )

* benign/benign = 97.62% ( 659/675 )

* benign/malware = 2.37% ( 16/675 )

FIGURE 5.1. confusion matrix result

It can be observed from the confusion matrix that the false negative count for malware is quite low, which also explains the high precision value for malware, whereas the false negative count for benign is quite high compared to malware, which also explains the low precision value for benign compared to malware.

5.5 Comparison To Existing Approaches

In this section we are going to compare our work with that of other authors who have worked on Linux malware analysis. To the best of our knowledge, the other works have used either a static or a dynamic approach. Our work is the first that has used a hybrid approach, i.e. integrating static and dynamic features.


Table 5.4 shows the comparison. Most of the works, as we can see in Table 5.4, have performed analysis on a very small dataset. In our work, we have used a large corpus of both malware and benign files to make our model robust. Shahzad, F. performed analysis using fields of the static ELF structure with a 99% detection accuracy, but since this approach is static based, they had to reject some of the samples which have forged headers. Ashmita, K. et al. (2014) used the dynamic approach, analyzing system calls. They got a great detection accuracy of 99.40%, but the dataset they used had only 226 malware, and the number of features was also very small. Our model has a comparable average detection accuracy of 99.14%, and the strength of our dataset is also pretty good compared to theirs, which makes our model robust.

Table 5.4: Previous works on Linux malware analysis

Authors                      Features                   Accuracy   Dataset                      Type of feature
Shahzad, F. (2011)           383                        99%        709 benign, 709 malware      Static: ELF structure
Jinrong Bai et al. (2012)    100                        98%        756 benign, 763 malware      Static: symbol table
Ashmita, K. et al. (2014)    27                         99.40%     442 benign, 226 malware      Dynamic: system calls
Shahzad, F., Bhatti (2013)   16                         96%        105 benign, 114 malware      Dynamic: process control block
Ours                         115 static + 260 dynamic   99.14%     2265 benign, 7717 malware    Static: ELF header + strings; Dynamic: system calls + file systems + shell commands

Conclusion: We use a new approach to performing Linux malware analysis by combining the traditional static and dynamic ones. Our model has shown some great results, and we have used a large dataset to prove the robustness of our model.

CHAPTER 6

SCOPE AND FUTURE WORK

6.1 Supporting Multiple Architectures

During the collection of malware samples, we came across various ELF files which are from different architectures. Due to the small scope of our work, we have done analysis of only those files which were based on the Intel architecture. There are many files which remain un-analyzed due to this limitation. In the future, our work can be extended to different architectures.

6.2 Analysis on Different File Formats

In this work our main focus was on the ELF file format, but there are some other malicious script files, like Perl scripts, PHP scripts, Python scripts, shell scripts, Bash scripts etc., which perform malware activity. The Limon sandbox which we used in this work has the capability to perform dynamic analysis for these files also and give us their runtime reports. In the future we can add a module to perform analysis for these file types as well.

6.3 Multi-path execution of files

Currently the Limon sandbox gives us the report of a single execution path of an executable file. This is a limitation of the dynamic approach, as we are unable to observe all the possible execution paths of the malware and hence its complete runtime behaviour. In the future we can add different modules to our model so that it can generate more comprehensive reports.

APPENDIX A

Code base for Linux Malware Detection: https://github.com/Anmol33/M.Tech_thesis.git

BIBLIOGRAPHY

[1] Av-test security report. https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf.

[2] Detux.org:. https://detux.org/.

[3] Elf format:. http://www.skyfree.org/linux/references/ELF_Format.pdf.

[4] Limon sandbox:. https://github.com/monnappa22/Limon.

[5] Offensive computing. http://www.offensivecomputing.net/.

[6] readelf tool:. https://sourceware.org/binutils/docs/binutils/readelf.html.

[7] ssdeep (fuzzy hash):. https://ssdeep-project.github.io/ssdeep/index.html.

[8] Strace tool:. https://strace.io/.

[9] Virustotal statistics. https://www.virustotal.com/en/statistics/.

[10] vx heaven. http://vx.netlux.org/.

[11] WatchGuard report. https://media.scmagazine.com/documents/306/wg-threat-reportq1-2017_76417.pdf.


[12] S.M. JINRONG BAI, YANRONG YANG AND Y. MA, Malware detection through mining symbol table of linux executables, Information Technology Journal, (2012).

[13] A.K.A AND V. P, Linux malware detection using non-parametric statistical methods, Chakraborty R.S., Matyas V., Schaumont P. (eds) Security, Privacy, and Applied Cryptography Engineering. SPACE, (2014).

[14] B.S.S.M. SHAHZAD, F. AND M. FAROOQ, In-execution malware detection using task structures of linux processes, IEEE International Conference on Communication, pp. 1–6, (2011).

[15] F. SHAHZAD AND M. FAROOQ, Elf-miner: using structural knowledge and data mining methods to detect new (linux) malicious executables, Knowledge and Information Systems, 30 (2012), pp. 589–612.

[16] S.M. SHAHZAD, F. AND M. FAROOQ, In-execution dynamic malware analysis and detection by mining information in process control blocks of linux os, Inf. Sci. 231, 45–63, (2013).
