
Linux malware detection by hybrid analysis

By

ANMOL KUMAR SHRIVASTAVA

Department of Computer Science, INDIAN INSTITUTE OF TECHNOLOGY, KANPUR

A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF TECHNOLOGY

Under the supervision of: DR. SANDEEP SHUKLA

MAY 2018

Abstract

Name of the student: Anmol Kumar Shrivastava Roll No: 16111031 Degree for which submitted: M.Tech. Department: Computer Science and Engineering Thesis title: Linux malware detection by hybrid analysis Thesis supervisor: Dr. Sandeep Shukla Month and year of thesis submission: May 2018

Over the past two decades, the cyber-security research community has been working on detecting malicious programs for Windows-based platforms. However, the recent exponential growth in popularity of IoT (Internet of Things) devices is causing the malware landscape to change rapidly. This so-called 'IoT Revolution' has fueled the interest of malware authors, which has led to an exponential growth in Linux malware. The increasing number of malware is becoming a serious threat to data privacy as well as to expensive computing resources. Manual malware analysis is not effective due to the large number of such cases. Furthermore, malware authors are using various obfuscation techniques to impede detection by traditional signature-based anti-virus systems. As a result, automated yet robust malware analysis is much needed. In this thesis, we develop a hybrid approach, integrating both static and dynamic features of a malware, to detect it efficiently. We performed our analysis on 7717 malware and 2265 benign files and obtained a highly promising detection accuracy of 99.14%. All prior work on Linux malware analysis used fewer than 1000 malware samples, and hence the accuracy numbers reported by them are not completely validated. Our work improves over prior work in two ways: a substantial enhancement in the dataset, and hybrid analysis based on both static and dynamic features.

Acknowledgements

I would like to extend my sincerest gratitude to my thesis supervisor, Dr. Sandeep Shukla, for his unparalleled guidance and support. I was new to this field, and I cannot be thankful enough for all the wild ideas and new things I got to learn and explore under his guidance. I am grateful for his patience and all those weekly sessions of discussion. I would also like to thank my family for believing in me and my friends for their support. These 2 years would not have been the same without you all. Last but not the least, I would like to thank Gaurav Kumar, who helped me in this work and made it possible to complete it on time.

TABLE OF CONTENTS

Page

List of Tables vii

List of Figures viii

1 Introduction 1

2 Problem Background 4 2.1 Linux Malware and types ...... 4 2.2 Existing Malware detection strategy ...... 5 2.2.1 Static Analysis ...... 5 2.2.2 Dynamic Analysis ...... 6 2.3 Motivation For a new Approach ...... 7 2.4 Contribution ...... 7

3 Past work 8 3.1 Static analysis approaches ...... 9 3.2 Dynamic analysis approaches ...... 9 3.3 Drawbacks of past works ...... 10

4 Analysis Infrastructure and Feature extraction 11 4.1 Analysis infrastructure ...... 11 4.1.1 Data Generation ...... 11 4.1.2 Feature Extraction ...... 14 4.1.2.1 Static feature vector ...... 14 4.1.2.2 Dynamic Feature Extraction ...... 23 4.1.3 Machine learning classifier ...... 29 4.2 Summary ...... 30

5 Result And Discussion 31 5.1 Dataset ...... 31 5.2 Evaluation Metric ...... 31


5.2.1 Confusion Matrix ...... 31 5.2.2 Additional metrics ...... 32 5.3 Training and Testing ...... 33 5.4 Result ...... 33 5.5 Comparison To Existing Approaches ...... 34

6 Scope and Future Work 36 6.1 Supporting Multiple architecture ...... 36 6.2 Analysis on different file format ...... 36 6.3 Multi-path execution of files ...... 37

A Appendix A 38

Bibliography 39

LIST OF TABLES

TABLE Page

4.1 Fields in ELF header ...... 16 4.2 Fields in section header ...... 17 4.3 Fields in segment header ...... 18 4.4 Mean value comparison of different fields in ELF header ...... 19

5.1 Confusion matrix for a two class classifier ...... 32 5.2 Test result on static feature set ...... 33 5.3 Test result on hybrid features ...... 33 5.4 Previous works on Linux malware analysis ...... 35

LIST OF FIGURES

FIGURE Page

1.1 VirusTotal stats1 ...... 1 1.2 VirusTotal stats2 ...... 2

4.1 Architecture of hybrid model ...... 12 4.2 ELF file format ...... 13 4.3 Limon sandbox architecture ...... 14 4.4 ELF Layout in disk ...... 15 4.5 Frequency distribution of various sections ...... 19 4.6 Frequency distribution of various section types ...... 20 4.7 Frequency distribution of various segment types ...... 20 4.8 Frequency distribution of symbol table features ...... 21 4.9 Example of GNU strings output ...... 22 4.10 Example of strace output ...... 23 4.11 Benign sys call statistics ...... 24 4.12 Malware sys call statistics ...... 24 4.13 Benign proc file system access stats ...... 25 4.14 Malware proc files access stats ...... 25 4.15 Benign proc files access stats ...... 26 4.16 Malware sys file system access stats ...... 26 4.17 Benign etc files access stats ...... 27 4.18 Malware etc file system access stats ...... 28

5.1 Confusion matrix ...... 34

viii O fTig)dvcs aetted hw htLnxhsnwbcm rs otytre o the for target worthy fresh a become now has Linux that hackers. shows trends Latest devices. Things) of h ubro iu ae evr,LnxbsdruesadLnxbsdIT(Internet IoT based Linux and routers in based increase Linux steep servers, a based in Linux This resulted technology. of has the number technology did the in so advancement and and much Internet so the grown of has growth Internet the decades, past last the ver F IGURE ..Tpso lssbitdo iuTtl[]i h at7days 7 past the in [9] VirusTotal on submitted files of Types 1.1. 1 I NTRODUCTION

C HAPTER 1 CHAPTER 1. INTRODUCTION

A report from AV-Test [1] shows that in 2016 MacOS computers saw a 360 percent increase in malware targeting with respect to the previous year, but Linux was not far behind: it saw a 300 percent increase in malware targeting with respect to the previous year. According to the WatchGuard Security report [11] of Q1 2017, Linux malware made up 36 percent of the top threats. In Figure 1.1, we can see that the number of ELF [3] files, i.e., Linux executables, submitted in the last seven days is huge, and their counts are comparable to those of Windows executables. All these statistics show that Linux malware threats are now at an alarming level. As Linux is open source, the latest threats are easily identified, and there are regular updates to the Linux kernel to catch up with new threats. Developers regularly provide system updates and use system protection mechanisms to protect systems from emerging threats. These make Linux one of the safest platforms. But here comes the problem: most IoT device and router vendors do not provide these system updates as frequently, and when they do, it takes a long time for consumers to download the updates and install them on all their devices. This makes the devices prone to exploitation. As the Linux malware threat is on the rise, there is a need for anti-threat mechanisms to make our data and systems secure. In this work, we present an automated Linux malware detection method which currently outperforms all the existing detection methods. Our work focuses on Linux binaries, i.e., ELF (Executable and Linkable Format), which is considered the standard binary file format.

FIGURE 1.2. Number of files submitted on VirusTotal in the past 7 days


As we see in Figure 1.2, the number of new malware and total malware submitted on VirusTotal is very high. Analyzing this amount of data by manual reverse engineering and unpacking is a hectic task. This leads to the need for an efficient automatic detection system. Currently, malware analysts use two approaches for analysis, namely static and dynamic analysis. In static analysis, a file is analyzed just by looking at its structure and the data present in it. As the file is not executed, static analysis is fast and easy to deploy, but it has some limitations; for example, polymorphism/metamorphism and packers make the file data encrypted or packed so that no further static analysis can be done on it. This motivates another approach, dynamic analysis. Here, files are executed in a sandbox environment, and their behavior is analyzed. While this analysis is not affected by obfuscation and encryption of data, it has its limitations too. Some malware hide their true behavior when they find out that they are running in a controlled environment. Another limitation is that, since we usually monitor only one execution path of the process, the rest of the code remains unexplored. So there is a need for a new detection mechanism which can overcome the limitations of both approaches. In the following chapters, we discuss the Linux binary format (ELF), types of malware, and our methodology to detect Linux malware efficiently.

CHAPTER 2

PROBLEM BACKGROUND

In this chapter, we describe some of the existing Linux malware detection methodologies, discuss the problems present in the existing approaches, and explain how we plan to overcome those problems using our approach.

2.1 Linux Malware and types

Malware stands for malicious software which has the intention of stealing useful data, spying on a victim's computer, disrupting infrastructure, etc. Malware can be categorized by their actions. Some of them are listed below:

• Exploit:- Exploits are malware which use some of the system vulnerabilities to attack. Exploits are publicly available for some well-known vulnerabilities. The malware find such a vulnerability in a system and then attack the system accordingly.

• Virus:- A virus is a malware which, when executed, affects other files by inserting its code into them and infecting them. Viruses spread very fast and in a very short time can infect a large server.

• Backdoor:- There are some malware which try to find backdoors to steal user information. In Linux, they try to get the list of all the registered users, or sometimes they try to register themselves as a registered user to gain access to the system.

• DDoS:- DDoS stands for Distributed Denial of Service. This type of attack is seen quite frequently on Linux servers. When an attack is successful, the server becomes unresponsive.

• Keylogger:- A keylogger is a type of malware which tracks all the keys pressed by the victim and sends them to its command and control server.

• Digital currency mining malware:- This type of malware tries to gain access to a system and then uses its resources for mining digital currency. Some of these malware use machine learning methods to analyze the victim's usage and, on that basis, throttle their resource consumption so that they do not get exposed.

• Dropper:- This type of malware is an executable that contains another executable. When executed, it installs the other executable and runs it in parallel, so that if the dropper gets detected, the actual malware remains in the system.

There are some malware which combine the above-listed types to perform their malicious activity. So, distinguishing a malicious file from a non-malicious one becomes a major task once a file enters the system. The work in this thesis aims to detect malware by performing analysis on a large corpus of data to make our model robust to zero-day malware. A zero-day malware exploits a security vulnerability on the same day that vulnerability becomes known to the public or to the vendor who created the software. Our model outperforms many antiviruses as it does not depend on malware signatures. It uses static as well as dynamic features of an executable for detection.

2.2 Existing Malware detection strategy

Malware analysis is the method of dissecting a binary file to understand how it works and then devising methods to identify it and other similar files. It aims to gain information about the actions performed by the malware and then to develop a method to neutralize its effect and to protect our systems from further infection. Malware analysis can be used both for malware detection and for malware classification. Malware detection means labeling an executable as benign or malware; therefore malware detection is the first stage of malware analysis. Once an executable is detected as malware, further classification based on malware type and family can be performed on it. Malware analysis can be done in two basic ways: static and dynamic analysis. Static analysis aims to analyze a binary without executing it, whereas in dynamic analysis a binary is executed inside a sandbox and analysis is performed based on its behavior.

2.2.1 Static Analysis

Static analysis examines the static properties of a binary. Without executing the binary, an examination can be performed on its ELF header, embedded strings, metadata, disassembly, etc. This analysis is fast as we do not have to execute the binary. But along with its pros, it has some limitations too. There are many techniques that malware authors nowadays use to thwart static analysis. Some of these techniques are described below:

• Packing:- In this technique malware authors use an encryption or compression algorithm on the original executable to create a packed executable which contains an unpacking stub. When the executable is run, the first thing loaded in memory is the unpacking stub. This stub then unpacks the packed executable and transfers control to the actual entry point of the executable.

• Metamorphism:- A metamorphic malware is a malware which changes its code each time it gets executed, without changing the actual functionality of the malware. This can be achieved by replacing an instruction with a similar instruction having a different opcode, inserting garbage code, changing the order of subroutines, etc.

• Polymorphism:- This type of malware changes its shape as well as its signature. It has two parts: a decryption routine and an encrypted malware body. Malware authors use a randomly generated key to encrypt the malware body. Once this malware is loaded into memory, the decryption routine decrypts the encrypted part to perform the malicious activity. After execution, it encrypts itself again so that it does not get discovered.

Malware authors use these techniques so that, each time the hash of the malware is taken, it yields a different hash value, allowing them to bypass signature-based detection systems. Nor can static code analysis be performed on this type of malware, as the code is encrypted. These limitations of static analysis motivate the need for another type of analysis which can overcome them.
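The fragility of hash-based signatures against such byte-level mutation is easy to demonstrate. A minimal sketch (the "malware" bytes here are purely synthetic):

```python
import hashlib

def sha256_signature(data: bytes) -> str:
    """Return the SHA-256 hex digest used as a naive file signature."""
    return hashlib.sha256(data).hexdigest()

original = b"\x7fELF" + b"payload bytes of some sample"
# A polymorphic variant: same behavior, but one byte of the body differs.
variant = original.replace(b"payload", b"pay1oad")

sig_a = sha256_signature(original)
sig_b = sha256_signature(variant)

print(sig_a == sig_b)  # False: a 1-byte change yields a completely new signature
```

A single changed byte produces an unrelated digest, so every repacked or re-encrypted variant evades a signature database built from hashes.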

2.2.2 Dynamic Analysis

In dynamic analysis, an executable is run in a controlled environment, such as a virtual machine, and its behavior is observed to deduce whether it is malware or not. When an executable is run, we can track which files it accesses, which IPs it tries to connect to, which new files it creates, etc. The main advantage of dynamic analysis is that it remains unaffected by polymorphic or metamorphic malware, as it has nothing to do with the static code of the malware. Despite this advantage, it has some limitations too. Some of them are listed below:

• Incomplete code coverage:- During dynamic analysis we are able to monitor only a single execution path, which leads to incomplete code coverage.

• Detection of Sandbox Environment:- Some malware can detect whether it is running in a controlled environment. When this happens, the malware does not show its true behavior.


• Risk to the host machine:- If there is a bug in the sandbox environment, malware may escape the isolation, and the host machine or other computers in the network may get damaged or infected.

2.3 Motivation For a new Approach

As we have seen in the earlier sections, both static and dynamic analysis have limitations. Static analysis can be thwarted once an encryption algorithm is used, while dynamic analysis suffers from the low code coverage problem. What if we could combine the feature sets of both of these approaches? The dynamic features may be handy for getting full insights when static analysis is thwarted by obfuscation. On the other hand, static analysis can provide a full overview of the executable when dynamic analysis suffers from the code coverage issue. This shows that the two may act as complementary to each other. Malware authors can use packers, obfuscation techniques, and polymorphism/metamorphism to bypass file format based or signature-based analysis. They can make the malware perform additional actions, like randomly accessing a file or calling random system calls, to bypass dynamic analysis. But bypassing both techniques at once is a much tougher job for them. In this work we have used this hybrid approach, and the results are quite promising. In later chapters, we discuss the architecture and features of our model in further detail.

2.4 Contribution

• To the best of our knowledge and our literature survey, this is the first time hybrid analysis has been done on Linux binaries using such a large dataset. Most of the previous works, which we survey in the next chapter, are interesting but are based on experiments on much smaller datasets.

• We have used our new hybrid approach to detect Zero-day malware, and the results are quite promising.

• We have created an automated system to deal with a large number of files which are difficult to analyze manually.

CHAPTER 3

PAST WORK

A large variety of malware is being used to attack critical infrastructure, steal private data, carry out financial fraud, etc. To remain spared from them, big MNCs and government organizations are spending huge amounts of money on malware detection systems or on creating their own anti-virus software. Most of these anti-virus systems are based on signature-based or anomaly-based detection techniques. A signature is usually a hash that uniquely identifies a specific malware. The signature-based technique uses the signatures of known malware; these signatures are developed by antivirus companies to capture threats. This technique is efficient, fast and easy to deploy, but once an unseen or new malware comes into the system, it fails. Only after that malware has affected numerous systems, and analysts are able to generate its signature, will antivirus companies be able to detect it. In the anomaly-based detection technique, rules are formed about actions which are considered safe. If any of the rules are broken by a process, it is labeled as malicious. This technique has the capability to capture new malware, but it has a very high false-alarm rate. Another method which is now gaining popularity is heuristic based. In this technique, static, dynamic or behavior based features of a dataset containing both benign and malware samples are used to train a machine learning classifier. In this literature survey, we cover some of the past work that has used these heuristic techniques. Static analysis based approaches are discussed in section 3.1 and dynamic analysis based approaches in section 3.2.

3.1 Static analysis approaches

In this section, we discuss some of the past work which have used the static analysis approach.

– In Shahzad, F. [15] the authors used the Executable and Linkable Format (ELF) for analysis and extracted 383 features from the ELF header. They used information gain as the feature selection algorithm. They used four well-known supervised machine learning algorithms, C4.5 Rules, PART, RIPPER and the J48 decision tree, for classification. Their dataset contained 709 benign executables scraped from the Linux platform and 709 malware executables downloaded from VX Heavens [10] and Offensive Computing [5]. They reported nearly 99% detection accuracy with a false-alarm rate of less than 0.1.

– Jinrong Bai et al. [12] proposed a new malware detection technique in which they extracted system calls from the symbol table of Linux executables. Out of the many system calls, they selected 100 as features. Their method obtained an accuracy of 98% in malware detection. Their dataset contained 756 benign executables scraped from Linux systems and 763 malware executables from VX Heavens.

3.2 Dynamic analysis approaches

In this section, we discuss some of the past works which have used dynamic analysis.

– Ashmita, K. et al. [13] proposed an approach based on system call features. They use 'strace' [8] to trace all the system calls of executables running in a controlled environment. The authors used a two-step correlation-based feature reduction: they first calculated feature-class correlation using information gain and entropy to rank the features; then, in the next step, they removed redundant features by calculating feature-feature correlation. They used three supervised machine learning algorithms, J48, Random Forest and AdaBoost, for classification, with a feature vector of length 27. The authors used 668 files in their dataset, out of which 442 were benign and 226 were malware executables. With this approach, they reported an accuracy of 99.40%.

– Shahzad, F., Bhatti [14], [16] proposed a concept of genetic footprints in which information mined from the Process Control Block (PCB) of the kernel is used to detect the runtime behavior of a process. In this approach, the authors selected 16 out of the 118 available parameters of the task_struct for each running process. To decide which parameters to select, the authors claim to have done a forensic study. They believe that these parameters define the semantics and the behavior of the executing process. These selected parameters are called the genetic footprints of the


process. The authors then generated a dump of all these parameters for 15 seconds with a resolution of 100 ms. All the instances of benign and malware processes are classified using an RBF network, SVM, J48 decision tree and a propositional rule learner (J-Rip) in the Weka environment. The authors analyzed their results and shortlisted the J-48 and J-Rip classifiers as having less class-noise compared to the others. In the end, the authors also list a comparison with other existing system call based solutions, and discuss the robustness of their approach to evasion and to modification of the task_struct. They used a dataset of 105 benign and 114 malware processes and reported a detection rate of 96 percent with a 0 percent false alarm rate.
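Several of the surveyed works rank candidate features by information gain. A toy computation of IG(Y; X) = H(Y) - H(Y|X) on made-up syscall-usage features (not the papers' actual data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y|X) for a discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

# Toy data: 1 = syscall used, 0 = not used; labels: 'm' malware, 'b' benign.
uses_ptrace = [1, 1, 1, 1, 0, 0, 0, 0]
uses_write  = [1, 0, 1, 0, 1, 0, 1, 0]
labels      = ['m', 'm', 'm', 'm', 'b', 'b', 'b', 'b']

print(information_gain(uses_ptrace, labels))  # 1.0 — perfectly separates the classes
print(information_gain(uses_write, labels))   # 0.0 — carries no class information
```

Features with the highest information gain are kept; features whose gain is near zero are discarded before training the classifier.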

3.3 Drawbacks of past works

– All of the past works which we have seen in the literature survey used very small sample sizes, and hence the reported accuracy or false positive rates may not be reflective of the real power of their malware classification methods.

– In the dynamic analysis cases, they used a very restricted number of features.

– They do not handle zero-day malware at all.

– They do not take advantage of using both static and dynamic features.

CHAPTER 4

ANALYSIS INFRASTRUCTURE AND FEATURE EXTRACTION

In this chapter, we are going to discuss the feature engineering and the modelling technique used by our detection system to achieve a good detection accuracy. Our work is based on various 32-bit ELF executables but can be extended to the 64-bit version as well.

4.1 Analysis infrastructure

Figure 4.1 illustrates our Analysis Infrastructure. It has three phases: Data generation, Feature Extraction and Data modelling.

4.1.1 Data Generation

In this section, we are going to see how we generated the static and dynamic reports and executables for further analysis in the subsequent phases. For static reports, we use:

– GNU strings:- In Binutils, strings is one of the utilities. When used on any file, it searches for runs of ASCII characters followed by an unprintable character. Strings can be useful as it sometimes gives an overview of what an executable is going to perform; for example, if a malware tries to access /etc/passwd, we can have this file path printed in the string output, or if an executable tries to make a TCP connection, we can see an IP address embedded in the executable in ASCII.

FIGURE 4.1. Architecture of our hybrid malware detection system


– readelf:- readelf [6] is a binary utility that displays structural information about one or more ELF files. The ELF file format contains a lot of information that can be used in the detection of malware. It contains the ELF header followed by the file data. More about the ELF format will be discussed in further sections.

FIGURE 4.2. ELF file format

– Limon sandbox:- Limon [4] is a tool which allows us to run an executable in a sandboxed/controlled environment and gives us a report about what it did during its runtime. The main components of the Limon sandbox include a host machine which manages a guest machine. We used Ubuntu 16.04 as our host machine and Ubuntu 14.02, a 32-bit machine, as our guest machine. To get a full picture of a file, it is executed in fully privileged mode on the guest machine. To run an executable in the Limon sandbox, its path is given on the command line. Each analysis is performed in a fresh virtual machine. While setting up a virtual machine, a snapshot is taken so that after execution of the file, Limon can revert back to it. At the end of the execution, the sandbox returns a text file containing the full trace of system calls and userspace functions. We used the default setting of 60 seconds for which a file is monitored. The architecture of the Limon sandbox is shown in Figure 4.3.
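The GNU strings behavior described above can be approximated in a few lines; this is an illustrative sketch, not the thesis's actual tooling (the `extract_strings` helper and the sample blob are made up):

```python
import re

def extract_strings(blob: bytes, min_len: int = 4):
    """Roughly mimic GNU strings: return runs of min_len+ printable
    ASCII characters found anywhere in a binary blob."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]

# Synthetic "binary" with an embedded path and IP, as in the examples above.
blob = b"\x7fELF\x01\x00" + b"/etc/passwd" + b"\x00\x02" + b"192.168.0.7" + b"\x00\xff"
print(extract_strings(blob))  # ['/etc/passwd', '192.168.0.7']
```

The extracted strings can then serve as raw material for string-based static features.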


FIGURE 4.3. Limon sandbox architecture
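The trace file that the sandbox returns is a natural source of dynamic features; one simple step is turning it into per-syscall counts. A minimal sketch (the trace lines below are made-up examples in strace's usual `name(args) = retval` format, not real sandbox output):

```python
import re
from collections import Counter

# strace-style lines begin with "name(args) = retval"; capture the syscall name.
SYSCALL_RE = re.compile(r"^(\w+)\(")

def syscall_histogram(trace_lines):
    """Count occurrences of each system call in an strace-style trace."""
    counts = Counter()
    for line in trace_lines:
        m = SYSCALL_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical trace fragment.
trace = [
    'open("/etc/passwd", O_RDONLY) = 3',
    'read(3, "root:x:0:0"..., 4096) = 512',
    'read(3, "", 4096) = 0',
    'close(3) = 0',
]
hist = syscall_histogram(trace)
print(hist["read"])  # 2
```

Such per-syscall counts correspond to the kind of system call statistics compared for benign and malware files in Figures 4.11 and 4.12.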

4.1.2 Feature Extraction

As we use the hybrid approach, we first extract the static features and dynamic features of executables separately and then integrate both of them for use in our model.

In the following section, we describe the various static and dynamic features we have used in our model.
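The integration step can be sketched as simple fixed-order vector concatenation; the feature names below are hypothetical placeholders, and the real static and dynamic feature sets are far larger:

```python
# Hypothetical per-file feature dicts; the real sets (ELF header fields,
# section stats, syscall counts, file-access stats, ...) are much larger.
static_features  = {"e_shnum": 29, "e_ehsize": 52, "num_text_sections": 1}
dynamic_features = {"sys_open": 12, "sys_read": 40, "proc_access": 3}

def hybrid_vector(static, dynamic, static_keys, dynamic_keys):
    """Concatenate static and dynamic features into one fixed-order vector,
    filling 0 for any feature absent from a given file's report."""
    return [static.get(k, 0) for k in static_keys] + \
           [dynamic.get(k, 0) for k in dynamic_keys]

STATIC_KEYS  = sorted(static_features)
DYNAMIC_KEYS = sorted(dynamic_features)

vec = hybrid_vector(static_features, dynamic_features, STATIC_KEYS, DYNAMIC_KEYS)
print(len(vec))  # 6: one slot per static feature, then one per dynamic feature
```

The machine learning classifier is then trained on such combined vectors, one per executable.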

4.1.2.1 Static feature vector

The static features are extracted from the ELF file format of both malware and benign executables. Before describing the features, we first briefly introduce the ELF format.


– Executable and Linkable Format
ELF (Executable and Linkable Format) is the standard binary file format in Unix and Unix-like systems. The binaries in Linux are executables, shared libraries, object code and core dumps. The ELF file format basically has three major parts:

* ELF header
* Segments
* Sections

FIGURE 4.4. ELF Layout in disk

Each of these parts plays an important role in the loading and linking process. Let's now look at the role and structure of each component one by one:


* ELF header
The ELF header is a data structure which gives information about the organization of the file. The various fields in the ELF header and their definitions are given in the table below:

Table 4.1: Fields in ELF header

e_ident       Contains various flags for identification of the file; these flags help in decoding and interpreting the file content, e.g. the ELF magic, file class and file version.
e_type        Signifies the type of binary, e.g. executable, shared library.
e_machine     Gives information about the architecture of the file, e.g. x86, MIPS.
e_version     The version of the object file; set to 1 for the original version of ELF.
e_entry       Address of the process entry point.
e_phoff       Points to the start of the program header table.
e_shoff       Points to the start of the section header table.
e_flags       Information in some processor-specific flags.
e_ehsize      Size of the ELF header.
e_phentsize   Size of a program header table entry.
e_phnum       Number of entries in the program header table.
e_shentsize   Size of a section header table entry.
e_shnum       Number of entries in the section header table.
e_shstrndx    Index of the section header table entry which contains the names of all sections.

* Sections
All the information required during the linking process to turn a target object file into a working executable resides in sections. Sections are actually needed at link time; they are of no use at run time. The ELF header points to a section header table which contains the information about each section present in the file. This table contains a number of section headers, each pointing to a section. There are various fields in each section header; their description is shown in Table 4.2:


Table 4.2: Fields in section header

sh_name       Contains an offset to a string in the .shstrtab section, which is the name of the section.
sh_type       Signifies the type of section, e.g. program data, symbol table, string table.
sh_flags      Signifies the attributes of a section.
sh_addr       For sections which are loaded in memory, contains their virtual address.
sh_offset     Contains the offset of the section in the file image.
sh_size       Size of the section.
sh_link       Contains the section index of another associated section.
sh_info       Additional information about the section.
sh_addralign  Signifies the required alignment for the section.
sh_entsize    For sections holding a table of fixed-size entries, the size of each entry.

There are a number of sections, each having a different role in the linking process. Let's look at some of them:
· .text section: This contains user executable code.
· .data section: This contains all the initialized data.
· .rodata section: This contains read-only initialized data.
· .bss section: This contains all the uninitialized data.
· .got section: For dynamic binaries this section, which stands for Global Offset Table, contains the addresses of all the variables which are relocated upon loading.
· .got.plt section: These are the GOT entries assigned to dynamically linked functions.
· .dynamic section: For dynamic binaries this section contains information about dynamic linking which is used by the runtime linker.
· .dynsym section: This is the runtime symbol table.
· .dynstr section: This contains null-terminated strings which are the names of the symbols.
· .symtab section: This is the compile-time symbol table; .dynsym is a subset of this section.
· .strtab section: Contains the names of the symbols in the symbol table.
· .shstrtab section: This contains the names of the sections.
· .rela.dyn section: Run-time relocation table.


* Segments
Segments are described by program headers. In the execution view, the ELF file is broken into suitable chunks known as segments which get loaded in memory. Like the section header table, there is also a program header table; the program header table is optional in the linking view, while the section header table is optional in the runtime view. The program header table contains segment headers which give information about the various segments present in the file image. Each segment header has a number of fields; their definitions are given in the table below:

Table 4.3: Fields in segment header

p_type    Gives information about the type of segment.
p_flags   Segment related flags.
p_offset  Offset of the segment in the file.
p_vaddr   Virtual address of the segment in memory.
p_paddr   Physical address of the segment.
p_filesz  Segment size in the file.
p_memsz   Segment size in memory.
p_align   Required alignment of the segment.

There are a number of segment types; some of them are described below:

· NULL: an unassigned segment.
· LOAD: a segment which gets loaded into memory; the rest of the segments are mapped within the memory range of one of these segments.
· INTERP: the .interp section gets mapped to this segment.
· DYNAMIC: this is basically the .dynamic section in memory.
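To make the segment-header layout concrete, here is a minimal Python sketch (an illustration only, not the thesis's extraction code) that unpacks one 64-bit program header with the standard struct module; the demo bytes are synthetic:

```python
import struct

# 64-bit program (segment) header layout; note that in the 64-bit
# format p_flags comes right after p_type.
PHDR64 = struct.Struct("<IIQQQQQQ")
PT_NAMES = {0: "NULL", 1: "LOAD", 2: "DYNAMIC", 3: "INTERP"}

def parse_phdr(raw):
    """Unpack one 64-bit segment header into the fields of Table 4.3."""
    (p_type, p_flags, p_offset, p_vaddr, p_paddr,
     p_filesz, p_memsz, p_align) = PHDR64.unpack(raw)
    return {"type": PT_NAMES.get(p_type, hex(p_type)), "flags": p_flags,
            "offset": p_offset, "vaddr": p_vaddr, "paddr": p_paddr,
            "filesz": p_filesz, "memsz": p_memsz, "align": p_align}

# Synthetic LOAD segment header for demonstration.
demo = PHDR64.pack(1, 5, 0, 0x400000, 0x400000, 0x1234, 0x1234, 0x1000)
print(parse_phdr(demo)["type"])
```

In practice the same fields can be inspected with the readelf tool [6].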

We have just looked at an overview of the ELF file format. Now let's go back to our static feature extraction, which uses all the information described above.

– ELF structure feature set In this section, we are going to discuss the features we have extracted from the ELF structure.

* ELF Header This gives us information about the organisation of the ELF file. Out of the various fields present in the ELF header, we have picked seven fields for our feature set. A statistical comparison of these seven features between malware and benign files is shown in Table 4.4:


Table 4.4: Mean value comparison of different fields in ELF header

Features                            Mean for benign   Mean for malware
Number of section headers           29.647            97.396
Size of ELF header                  63.466            52.202
Number of program headers           8.921             4.222
Start of section header             170400.26         346456.475
Start of program header             63.512            52.189
Size of program headers             55.025            32.404
Section header string table index   27.729            94.977
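As an illustration of how these seven header fields can be pulled out of a binary, the sketch below (hypothetical helper names, not the actual feature-extraction pipeline) unpacks a 64-bit ELF header with Python's struct module; the demo header is synthetic:

```python
import struct

# 64-bit ELF header layout (little-endian), per the ELF specification.
ELF64_HDR = struct.Struct("<16sHHIQQQIHHHHHH")

def elf_header_features(raw):
    """Extract the seven ELF-header fields used as static features."""
    (e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
     e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
     e_shstrndx) = ELF64_HDR.unpack_from(raw)
    if e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return {
        "num_section_headers": e_shnum,
        "elf_header_size": e_ehsize,
        "num_program_headers": e_phnum,
        "section_header_offset": e_shoff,
        "program_header_offset": e_phoff,
        "program_header_size": e_phentsize,
        "shstrtab_index": e_shstrndx,
    }

# Synthetic 64-bit ELF header for demonstration.
demo = ELF64_HDR.pack(b"\x7fELF" + bytes(12), 2, 0x3E, 1, 0x400000,
                      64, 8192, 0, 64, 56, 9, 64, 29, 28)
print(elf_header_features(demo))
```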

* Section Header Table The basic structure of the Section Header Table was discussed in the previous section. From the section header, we have used the section name and section type in our feature list. The frequency distribution of various sections in benign and malware files is shown below:

FIGURE 4.5. Frequency distribution of various sections


FIGURE 4.6. Frequency distribution of various section types

* Program Header Table The structure of the Program Header Table was discussed under segments in the previous section. From the Program Header Table, out of the various fields, we have used the segment type in our feature list. A comparison of segment types for benign and malware files is shown below:

FIGURE 4.7. Frequency distribution of various segment types


* Dynamic section The runtime linker uses this segment to find all the necessary information needed for dynamic linking and relocation. The dynamic section contains two fields, 'tags' and 'values'; the number of entries in the dynamic section is not fixed. In our work, we have used the content of the 'tags' field in our feature list.

* Symbol table The symbol table contains a large amount of data needed to link or debug files. The symbol table structure has five fields, namely: name, value, info, size and section header index. We categorized the symbols according to their 'info' field, and the objects and functions according to their scope and info. As features, we have used the 14 categories we created and the count of dynamic symbols. The frequency distribution of these features for benign and malware files is shown below:

FIGURE 4.8. Frequency distribution of various symbol table features


– Strings based feature extraction Strings from a file can be informative, but most of the strings which we get from a file come from the file structure, such as the names of objects and functions from the symbol tables, arguments of functions, or some garbage values.

FIGURE 4.9. Example of GNU strings output

All of these we are already using as features in one way or another, so reusing them would cause redundancy in our feature set. However, we are using frequency bins related to the lengths of strings in our feature set.
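A minimal sketch of such length-based binning is shown below; the bin edges are an assumption for illustration, since the exact boundaries are not listed here:

```python
import re

def string_length_bins(data, bins=(4, 8, 16, 32, 64)):
    """Count printable strings per length bin (>64 chars go in the last bin)."""
    # Printable ASCII runs of length >= 4, like the default of GNU strings.
    strings = re.findall(rb"[\x20-\x7e]{4,}", data)
    counts = [0] * (len(bins) + 1)
    for s in strings:
        for i, upper in enumerate(bins):
            if len(s) <= upper:
                counts[i] += 1
                break
        else:  # longer than the largest bin edge
            counts[-1] += 1
    return counts

print(string_length_bins(b"hi\x00hello\x00averylongstring1"))
```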


4.1.2.2 Dynamic Feature Extraction

The runtime behaviour based features are extracted from the reports of the files generated by the Limon sandbox. A full Limon sandbox report comes as a text file which contains static analysis (structure information), ssdeep [7] (fuzzy hashing comparison with other reports), dynamic analysis (system call trace) and network analysis (if a file is engaged in some network activity). As most malware authors now use polymorphism and metamorphism techniques, a single file can end up with multiple signatures. But when the program is loaded in memory, to perform its action it is decrypted to its original form. Here comes the role of ssdeep: it takes fuzzy hashes of the loaded binary and compares them with those of other files. This helps us to remove multiple files that have different signatures but the same fuzzy hash, reducing redundancy in our dataset. For this work, we have used the system call trace and the arguments of system calls in our feature list. Let's see all of them one by one.

– System Calls: System calls can give us information about what a process wants to perform, or what we call the behaviour of a process. The Limon sandbox uses 'strace' to get a full system call trace of a process and the processes related to it. An example of the output of strace is shown below in the image:

FIGURE 4.10. Example of strace output

The output of 'strace', as we can see in the above picture, contains system calls, their arguments and return values. We have created an architecture through which we can get only the names of the system calls from the strace report. A Linux system uses a fixed set of system calls. We have used all the system calls in our feature list and checked, for each file, which system calls it uses. A statistical comparison of system calls for benign and malware executables as observed in our dataset is shown in fig 4.11 and fig 4.12.
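Extracting syscall names from strace output can be sketched as follows (an illustrative regex and helper name, not the thesis's actual architecture; real strace output has more line variants than this covers):

```python
import re
from collections import Counter

# Matches the syscall name at the start of an strace line, optionally
# preceded by a "[pid NNNN]" prefix as emitted by strace -f.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def syscall_names(strace_lines):
    """Count how often each system call name appears in the trace."""
    counts = Counter()
    for line in strace_lines:
        m = SYSCALL_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts

demo = [
    'open("/etc/passwd", O_RDONLY) = 3',
    'read(3, "root:x:0:0:...", 4096) = 1024',
    '[pid  1234] connect(4, {sa_family=AF_INET}, 16) = 0',
    'read(3, "", 4096) = 0',
]
print(syscall_names(demo))
```

The resulting counter can be mapped onto the fixed syscall list to build the per-file feature vector.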


FIGURE 4.11. 20 most frequent system calls of benign files

FIGURE 4.12. 20 most frequent system calls of malware files


– File system based features:

* proc and sysfs filesystems proc and sysfs are virtual filesystems which contain runtime system information on processes, system and hardware configurations, and information on the kernel subsystems and kernel drivers. A comparison of some of the proc and sys files accessed by malware and benign files is shown below.

FIGURE 4.13. 7 most frequent proc files accessed by benign files

FIGURE 4.14. 7 most frequent proc files accessed by malware


FIGURE 4.15. 7 most frequent sys files accessed by benign files

FIGURE 4.16. 7 most frequent sys files accessed by malware


From our dataset we observed that a large portion of malware accesses '/proc/net/route', i.e. the system routing table, to get the list of all active network interfaces. We also find that they access '/proc/net/tcp' and '/proc/net/dev' to get information about active TCP sockets and about sent and received packets respectively. On the counterpart, in sysfs, we saw that malware accesses '/sys/class/net/' to get the length of the transmission queue. This information is very important for performing a DDoS attack. Some of the sys and proc files are used by malware authors for VM detection, like '/proc/cpuinfo', '/proc/sysinfo', '/sys/class/dmi/id/product name' etc. From our dataset, we have observed these files being accessed by malware more frequently.
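Filtering such proc/sys accesses out of an strace log can be sketched like this (hypothetical helper; the regex only covers plain open()/openat() lines):

```python
import re

# Captures the path argument of open()/openat() calls in an strace line.
OPEN_RE = re.compile(r'open(?:at)?\([^"]*"([^"]+)"')

def accessed_proc_sys_files(strace_lines):
    """Paths under /proc or /sys opened during the trace."""
    hits = []
    for line in strace_lines:
        m = OPEN_RE.search(line)
        if m and m.group(1).startswith(("/proc/", "/sys/")):
            hits.append(m.group(1))
    return hits

demo = [
    'open("/proc/net/route", O_RDONLY) = 3',
    'openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 4',
    'openat(AT_FDCWD, "/sys/class/net/eth0/tx_queue_len", O_RDONLY) = 5',
]
print(accessed_proc_sys_files(demo))
```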

* etc file system The etc folder contains all system configuration files. It contains system configuration tables, configuration files which can force a service to start or stop, configuration files of installed programs, and configuration files which give information about allowed or restricted users, permitted IPs etc.

FIGURE 4.17. 7 most frequent etc files accessed by benign files


FIGURE 4.18. 7 most frequent etc files accessed by malware

From our malware dataset we observed that network configuration files like '/etc/resolv.conf', '/etc/hosts' etc. are accessed more frequently. We observed that a chunk of malware accesses the '/etc/passwd' file, which gives information about each registered user. Flooders (a malware type) use this information to find a backdoor account. They also try to edit '/etc/passwd' and '/etc/shadow' to add a new user.

* shell commands The shell acts as an interface between the user and the operating system; using commands we can use the services of the OS. In our dataset, 16 percent of malware executed at least one external command, while for benign files the percentage is quite low, nearly 3-4 percent. In total we got 131 unique commands from our dataset. Commands like cp, netstat, iptables, touch, file etc. are most frequently seen being executed by malware. Some of the malware try to execute the system 'reboot' command, and some of them execute the 'ufw' command, which can be used to alter the firewall of the network. In benign files, not many commands were found to be executed; commands like file, grep, basename etc. are mostly seen in them.
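External command executions show up in an strace log as execve calls; collecting the command names can be sketched as follows (illustrative helper, synthetic demo lines):

```python
import re

# execve("/bin/netstat", ["netstat", "-an"], envp) -- capture the binary path.
EXECVE_RE = re.compile(r'execve\("([^"]+)"')

def executed_commands(strace_text):
    """Distinct basenames of programs passed to execve in an strace log."""
    return sorted({path.rsplit("/", 1)[-1]
                   for path in EXECVE_RE.findall(strace_text)})

demo = ('execve("/bin/netstat", ["netstat", "-an"], envp) = 0\n'
        'execve("/usr/bin/cp", ["cp", "a", "b"], envp) = 0')
print(executed_commands(demo))
```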


4.1.3 Machine learning classifier

We have used the Python based machine learning library scikit-learn to evaluate the efficiency of the hybrid approach we used. We have used various classification algorithms, which are described below:

– KNN KNN stands for K-Nearest Neighbours. In this algorithm, a large number of labelled data points are present in the feature space; when an unlabelled data point arrives in that feature space, it can be labelled by looking at its K nearest neighbours and giving it the label that is in the majority among them. The distance generally used here is the Euclidean distance, i.e. the square root of the sum of squares of the differences of each feature of two data points.
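The neighbour-voting procedure described above can be sketched in a few lines of plain Python (a toy illustration with made-up points; the actual experiments used scikit-learn's implementation):

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Label x by majority vote among its k nearest (Euclidean) neighbours."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["benign", "benign", "malware", "malware"]
print(knn_predict(train, labels, (5, 4), k=3))
```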

– Decision Tree One of the most intuitive and popular methods of data mining, which provides explicit rules for classification and copes well with heterogeneous data, missing data and nonlinear effects, is the decision tree. It uses information gain to select as the root node the feature which gives the highest information gain value, and similarly proceeds down to the leaf nodes, where the decision is made.

– Random Forest Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it somewhat random. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result. One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. The random forest algorithm brings extra randomness into the model when growing the trees: instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This creates wide diversity, which generally results in a better model. Therefore, when growing a tree in a random forest, only a random subset of the features is considered for splitting each node.


4.2 Summary

In this chapter we saw the architecture of our model, the types of features we used, how they are extracted, and the different machine learning classifiers we have used in our model for the detection of malware. In the next chapter we are going to discuss the dataset we used and see the results of our model on that dataset.

CHAPTER 5

RESULT AND DISCUSSION

5.1 Dataset

To make our model robust, the first thing we needed was a large corpus of a dataset. Most of the previous works' authors have used a very small amount of data; we faced challenges to collect a large amount of data, from VXheavens, VirusTotal and Detux.org [2]. For benign files, we scraped the executables from the system directories /bin, /sbin and /usr/bin of our Linux operating system, and we downloaded and compiled some more open source C and C++ software projects. For the final analysis, we have used 7717 malware and 2265 benign executables in our system.

5.2 Evaluation Metric

We have used several metrics to evaluate the performance of a classification model. Below is a brief description of them.

5.2.1 Confusion Matrix

A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier. The entries in the confusion matrix have the following meaning in the context of our study:

– True positive: These are the samples which are correctly predicted as benign.

– True negative: These are the samples which are correctly predicted as malware.

– False positive: These are the samples which are malware but predicted as benign.

– False negative: These are the samples which are benign but predicted as malware.

Table 5.1: Confusion matrix for a two class classifier

                   Predicted Positive   Predicted Negative
Actual Positive    #TP                  #FN
Actual Negative    #FP                  #TN

5.2.2 Additional metrics

– TPR: TPR stands for True Positive Rate, also known as recall. This is defined as the ratio of the number of true positives to the total number of positive samples:

TPR = TP / (TP + FN)

– FPR: FPR stands for False Positive Rate. This is defined as the ratio of the number of false positives to the total number of negative samples:

FPR = FP / (FP + TN)

– Precision: Precision (PV) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

PV = TP / (TP + FP)

– F-score: When the dataset is imbalanced, the F-score is used to measure how correct the model is. It is calculated as the weighted harmonic mean of precision and recall:

F-score = (2 × TPR × PV) / (TPR + PV)
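Putting the four definitions together, a small helper (illustrative only) computes all of them from the confusion-matrix counts:

```python
def metrics(tp, fp, tn, fn):
    """TPR, FPR, precision (PV) and F-score from confusion-matrix counts."""
    tpr = tp / (tp + fn)                # recall / true positive rate
    fpr = fp / (fp + tn)                # false positive rate
    pv = tp / (tp + fp)                 # precision
    fscore = 2 * tpr * pv / (tpr + pv)  # harmonic mean of recall and precision
    return tpr, fpr, pv, fscore

print(metrics(9, 1, 9, 1))
```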


5.3 Training and Testing

For our experiments we used Ubuntu 16.04 with 32 GB RAM and an Intel i7 octa-core processor. We used 70% of the data for training and 30% for testing. To minimize the risk of overfitting and to get a generalized result, we used 10-fold cross validation.
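The 10-fold splitting can be sketched with a stdlib-only helper (an illustration; the actual experiments used scikit-learn's built-in cross-validation utilities):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # deterministic shuffle
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

folds = list(kfold_indices(20, k=5))
print([len(test) for _, test in folds])
```

Each sample lands in exactly one test fold, so every data point is used for validation once across the k runs.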

5.4 Result

As we have already seen, we have used three machine learning classifiers, and we checked our model's efficiency on all of them. The results we received are highly promising. Tables 5.2 and 5.3 show the results which we achieved on our dataset for static-only features and for the hybrid approach respectively.

Table 5.2: Test result on static feature set

                 KNN                        Decision Tree              Random Forest
Class      TPR   FPR   Pr    FM       TPR   FPR   Pr    FM       TPR   FPR   Pr    FM
Benign     0.888 0.067 0.757 0.817    0.978 0.017 0.937 0.957    0.982 0.014 0.950 0.966
Malware    0.932 0.111 0.972 0.951    0.982 0.021 0.994 0.988    0.986 0.017 0.995 0.990
Average    0.922 0.101 0.923 0.921    0.981 0.020 0.981 0.981    0.985 0.016 0.985 0.985

Table 5.3: Test result on hybrid features

                 KNN                        Decision Tree              Random Forest
Class      TPR   FPR   Pr    FM       TPR   FPR   Pr    FM       TPR   FPR   Pr    FM
Benign     0.914 0.062 0.777 0.840    0.983 0.010 0.963 0.973    0.989 0.006 0.976 0.982
Malware    0.938 0.085 0.978 0.958    0.989 0.016 0.995 0.992    0.993 0.010 0.997 0.995
Average    0.932 0.079 0.932 0.9314   0.987 0.014 0.987 0.987    0.992 0.009 0.992 0.992

As we can see from tables 5.2 and 5.3, we get the best detection accuracy using Random Forest. We can also see that the detection efficiency increases in all three models (KNN, Decision Tree, Random Forest) as we shift the feature set from static features to hybrid features. The best weighted average F-measure score, i.e. 0.992, we get with Random Forest, which is pretty good considering the fact that the TPR of benign is low in comparison with that of malware. The main goal of our work was to have a high TPR for malware while not predicting too many benign files as malware. Figure 5.1 depicts the confusion matrix for the best detection accuracy, which we got using Random Forest.


Confusion matrix statistics:

* malware/malware = 99.69% ( 2313/2320 )

* malware/benign = 0.31% ( 7/2320 )

* benign/benign = 97.62% ( 659/675 )

* benign/malware = 2.37% ( 16/675 )

FIGURE 5.1. confusion matrix result

It can be observed from the confusion matrix that the false negative count for malware is quite low, which also explains the high precision value for malware, whereas the false negative count for benign is quite high compared to malware, which also explains the low precision value for benign compared to malware.

5.5 Comparison To Existing Approaches

In this section we are going to compare our work with that of other authors who have worked on Linux malware analysis. To the best of our knowledge, the other works have used either a static or a dynamic approach. Our work is the first that has used a hybrid approach, i.e. integrating static and dynamic features.


Table 5.4 shows the comparison. Most of the works, as we can see in Table 5.4, have performed analysis on a very small dataset. In our work, we have used a large corpus of both malware and benign files to make our model robust. Shahzad, F. performed analysis using fields of the static ELF structure with a 99% detection accuracy, but since this approach is static based, they had to reject some of the samples which have forged headers. Ashmita, K. et al. (2014) used the dynamic approach, analyzing system calls. They got a great detection accuracy of 99.40%, but the dataset they used had only 226 malware, and the number of features was also very small. Our model has a comparable average detection accuracy of 99.14%, and the strength of our dataset is also pretty good compared to theirs, which makes our model robust.

Table 5.4: Previous works on Linux malware analysis

Authors                      Features                   Accuracy   Dataset                      Type of feature
Shahzad, F. (2011)           383                        99%        709 benign, 709 malware      Static: ELF structure
Jinrong Bai et al. (2012)    100                        98%        756 benign, 763 malware      Static: symbol table
Ashmita, K. et al. (2014)    27                         99.40%     442 benign, 226 malware      Dynamic: system calls
Shahzad, F., Bhatti (2013)   16                         96%        105 benign, 114 malware      Dynamic: process control block
Ours                         115 static + 260 dynamic   99.14%     2265 benign, 7717 malware    Static: ELF header + strings; Dynamic: system calls + file systems + shell commands

Conclusion: We use a new approach to performing Linux malware analysis by combining the traditional static and dynamic ones. Our model has shown some great results, and we have used a large dataset to prove the robustness of our model.

CHAPTER 6

SCOPE AND FUTURE WORK

6.1 Supporting Multiple Architectures

During the collection of malware samples, we came across various ELF files which are from different architectures. Due to the small scope of our work, we have done analysis of only those files which were based on the Intel architecture. There are many files which remain un-analyzed due to this limitation. In the future, our work can be extended to different architectures.

6.2 Analysis on Different File Formats

In this work our main focus was on the ELF file format, but there are some other malicious script files, like Perl scripts, PHP scripts, Python scripts, shell scripts, Bash scripts etc., which perform malware activity. The Limon sandbox which we used in this work has the capability to perform dynamic analysis for these files also and give us their runtime reports. In the future we can add a module to perform analysis for these file types as well.

6.3 Multi-path execution of files

Currently the Limon sandbox gives us the report of a single execution path of an executable file. This is a limitation of the dynamic approach, as we are unable to observe all the possible execution paths of the malware and hence its complete runtime behaviour. In the future we can add different modules to our model so that it can generate more comprehensive reports.

APPENDIX A

Code base for Linux Malware Detection: https://github.com/Anmol33/M.Tech_thesis.git

BIBLIOGRAPHY

[1] Av-test security report. https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf.

[2] Detux.org:. https://detux.org/.

[3] Elf format:. http://www.skyfree.org/linux/references/ELF_Format.pdf.

[4] Limon sandbox:. https://github.com/monnappa22/Limon.

[5] Offensive computing. http://www.offensivecomputing.net/.

[6] readelf tool:. https://sourceware.org/binutils/docs/binutils/readelf.html.

[7] ssdeep (fuzzy hash):. https://ssdeep-project.github.io/ssdeep/index.html.

[8] Strace tool:. https://strace.io/.

[9] Virustotal statistics. https://www.virustotal.com/en/statistics/.

[10] vx heaven. http://vx.netlux.org/.

[11] WatchGuard report. https://media.scmagazine.com/documents/306/wg-threat-reportq1-2017_76417.pdf.


[12] S.M. JINRONG BAI, YANRONG YANG AND Y. MA, Malware detection through mining symbol table of linux executables, Information Technology Journal, (2012).

[13] A.K.A AND V. P, Linux malware detection using non-parametric statistical methods, Chakraborty R.S., Matyas V., Schaumont P. (eds) Security, Privacy, and Applied Cryptography Engineering. SPACE, (2014).

[14] B.S.S.M. SHAHZAD, F. AND M. FAROOQ, In-execution malware detection using task structures of linux processes, IEEE International Conference on Communication, pp. 1–6, (2011).

[15] F. SHAHZAD AND M. FAROOQ, Elf-miner: using structural knowledge and data mining methods to detect new (linux) malicious executables, Knowledge and Information Systems, 30 (2012), pp. 589–612.

[16] S.M. SHAHZAD, F. AND M. FAROOQ, In-execution dynamic malware analysis and detection by mining information in process control blocks of linux os, Inf. Sci. 231, 45–63, (2013).
