CLASSIFICATION OF MALWARE USING REVERSE ENGINEERING AND DATA

MINING TECHNIQUES

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Ravindar Reddy Ravula

August, 2011

CLASSIFICATION OF MALWARE USING REVERSE ENGINEERING AND DATA

MINING TECHNIQUES

Ravindar Reddy Ravula

Thesis

Approved:

_________________________
Advisor
Dr. Kathy J. Liszka

_________________________
Committee Member
Dr. Chien-Chung Chan

_________________________
Committee Member
Dr. Zhong-Hui Duan

_________________________
Date

Accepted:

_________________________
Department Chair
Dr. Chien-Chung Chan

_________________________
Dean of the College
Dr. Chand K. Midha

_________________________
Dean of the Graduate School
Dr. George R. Newkome


ABSTRACT

Detecting new and unknown malware is a major challenge in today’s software security profession. Many approaches to malware detection using data mining techniques have already been proposed. The majority of these works used static features of malware. However, static detection methods fall short of detecting present-day complex malware. Although some researchers have proposed dynamic detection methods, these methods did not use all of the malware features.

In this work, an approach for the detection of new and unknown malware was proposed and implemented. 582 malware and 521 benign software samples were collected from the Internet. Each sample was reverse engineered to analyze its effect on the operating environment and to extract its static and behavioral features. The raw data extracted from reverse engineering was preprocessed and two datasets were obtained: a dataset with reversed features and a dataset with API Call features. Feature reduction was performed manually on the dataset with reversed features, and features that did not contribute to the classification were removed.

The machine learning classification algorithm J48 was applied to the dataset with reversed features, producing a decision tree and a set of classification rules. To reduce the tree size and obtain an optimal number of decision rules, the attribute values in the dataset with reversed features were discretized and another dataset was prepared with the discretized attribute values. The new dataset was given to the J48 algorithm and a decision tree was generated with another set of classification rules. To further reduce the tree and the number of decision rules, the dataset with discretized features was given to a machine learning tool, BLEM2, which is based on rough sets and produces decision rules. To test the accuracy of the rules, the dataset with decision rules from BLEM2 was given as input to the J48 algorithm. The same procedure was followed for the dataset with API Call features. Another set of experiments was conducted on the three datasets using a Naïve Bayes classifier to generate training models for classification. All the training models were tested with an independent testing set. The J48 decision tree algorithm produced better results with the DDF and DAF datasets, with accuracies of 81.448% and 89.140% respectively. The Naïve Bayes classifier produced better results with the DDF dataset, with an accuracy of 85.067%.


ACKNOWLEDGMENTS

I would like to express my sincere gratitude to the people who made this research possible. I want to express my heartiest thanks to Dr. Kathy J. Liszka for giving me the opportunity to work on this thesis. Her invaluable guidance and support at every stage have led to the successful conclusion of the study.

I would like to thank Dr. Chien-Chung Chan for his expert advice in data mining and the insightful suggestions that have been very helpful in this study. In addition, I want to thank Dr. Zhong-Hui Duan for taking the time to serve on the thesis committee.

I want to convey my special thanks to my parents, sister, brother, brother-in-law, cousin and friends for their love and continuous encouragement. Their blessings and moral support have been invaluable at every stage of my life. Thank you all for standing by me at all times.


TABLE OF CONTENTS

Page

LIST OF TABLES…..……………………………………………………………….....viii

LIST OF FIGURES...………………………………………………………………….....ix

CHAPTER

I. INTRODUCTION………………………………………………………………...1

II. LITERATURE REVIEW…………………………………………………………….4

III. TYPES OF MALWARE AND ANTI-MALWARE DEFENSE

TECHNIQUES…………………………………………………………………..15

3.1 Malware Types……………………………………………………………….15

3.1.1 Virus………………………………………………………………….15

3.1.2 Worm………………………………………………………………...18

3.1.3 Backdoor……………………………………………………………...19

3.1.4 Trojan Horse…………………………………………………………..19

3.1.5 Rootkit…………………………………………………………………20

3.1.6 Spyware………………………………………………………………21

3.1.7 Adware…………………………………………………………….....21

3.2 Antivirus Detection Techniques………..……………………………...…….21

3.2.1 Signature Based Detection…………...……………………………....22


3.2.2 Heuristic Approach…………………………………………………...23

3.2.3 Sandbox Approach…………………………………………………...23

3.2.4 Integrity Checking…………………………………………………...24

IV. REVERSE ENGINEERING……………………………………………………..25

4.1 Controlled Environment……………………………………………….……..25

4.2 Experimental Setup…………………………………………………………..27

4.3 Static Analysis……………………………………………………………….28

4.3.1 Cryptographic Hash Function………………………...……………...28

4.3.2 Packer Detection……………………………………………………..29

4.3.3 Code Analysis………………………………………..………………31

4.4 Dynamic Analysis……………………………………………………………33

4.4.1 File System Monitor…………………………………………………33

4.4.2 Registry Monitor……………………………………………………..34

4.4.3 API Call Tracer………………………………………………………36

V. DATA MINING……………………………………………………………….…38

5.1 System Design……………………………………………………………….38

5.2 KDD Process…………………………………………………………………40

5.2.1 Target Data………………………………………………………...…44

5.2.2 Preprocessing………………………………………………………...45

5.2.3 Transformation……………………………………………………….45

5.2.4 Data Mining………………………………………………………….47

5.2.5 Interpretation/Evaluation…………………………………………….51

VI. RESULTS AND DISCUSSIONS………………………………………………..52


6.1 Experiment 1: Classification of DRF…………………………………..…….52

6.2 Experiment 2: Classification of DDF…………………………………….….56

6.3 Experiment 3: Classification of DDF using BLEM2………………………...59

6.4 Experiment 4: Classification of DDF from BLEM2 using J48..…………….62

6.5 Experiment 5: Classification of DAF………………………………………..64

6.6 Experiment 6: Classification of DAF using BLEM2……..……………….…68

6.7 Experiment 7: Classification of DAF from BLEM2 using J48……………...68

6.8 Accuracies……………………………………………………………………72

6.9 Pattern in API Call Frequencies……………………………………………...75

VII. CONCLUSIONS AND FUTURE WORK………………………………………77

7.1 Conclusions…………………………………………………………………..77

7.2 Future Work……………………………………………………………….…78

REFERENCES…………………………………………………………………………..79

APPENDICES…………………………………………………………………………...83

APPENDIX A. DATASETS…………………………………………………….83


LIST OF TABLES

Table Page

5.1 Attributes in DRF…………………………………………………………………….44

5.2 Attributes in DRF after Transformation…………………………………………..…46

5.3 Discretized Values………………………………………………………………...…47

6.1 Decision rules for DRF for the decision label “YES”………….……………………54

6.2 Decision rules for DRF for the decision label “NO”……………….………………..55

6.3 Decision rules for DDF for the decision label “YES”…………….…………………57

6.4 Decision rules for DDF for the decision label “NO”……………….………………..58

6.5 BLEM2 rules for DDF for the decision label “YES”………………………………..59

6.6 BLEM2 rules for DDF for the decision label “NO”………………………………....61

6.7 Decision rules for DDF from BLEM2 for the decision label “YES”………………..63

6.8 Decision rules for DDF from BLEM2 for the decision label “NO”………………....63

6.9 Decision rules for DAF for the decision label “YES”……………….………………65

6.10 Decision rules for DAF for the decision label “NO”…..………….………………..66

6.11 Decision rules for DAF from BLEM2 for the decision label “YES”………………69

6.12 Decision rules for DAF from BLEM2 for the decision label “NO”………………..70

6.13 Testing set Results against Training Models from Experiments 1, 2 and 3………..73

6.14 Testing set Results against Training Models from Experiments 4 and 5…………..73

6.15 Experimental Results from Naïve Bayes Classifier………………………………...73


A1: An Instance for Attributes File Name, File Size and MD5 Hash in DRF…………..84

A2: An Instance for Attributes Packer, File Access, Directory Access and Internet Access in DRF……………………………………………………………………………………84

A3: API Calls Accessed By the Trojan…………………………………………………..85

A4: DLLs Accessed By the Trojan………………………………………………………85

A5: Registry Keys Added By the Trojan………………………………………………...86

A6: Registry Keys Modified By the Trojan……………………………………………...86

A7: Registry Keys Deleted By the Trojan……………………………………………….88

A8: URL References Made By the Trojan………………………………………………88

A9: Programming Language used, Strings and Decision label of the Trojan…………...89

A10: An Instance of DRF Dataset after Preprocessing………………………………….89

A11: An Instance of DDF Dataset……………………………………………………….90

A12: An Instance of DAF Dataset……………………………………………………….90


LIST OF FIGURES

Figure Page

3.1 Typical Malware Signature…………………………………………………………..22

4.1 Snapshot Manager……………………………………………………………………27

4.2 Normal PE File……………………………………………………………………....30

4.3 Packed PE File……………………………………………………………………….30

4.4 PEiD………………………………………………………………………………….31

4.5 IDA Pro Disassembler……………………………………………………………….32

4.6 File Monitor………………………………………………………………………….34

4.7 Registry Monitor……………………………………………………………………..35

4.8 Registry Key Changes Made by a PE………………………………………………..36

4.9 Maltrap……………………………………………………………………………….37

5.1 KDD Process…………………………………………………………………………39

5.2 File System Activity Log…………………………………………………………….40

5.3 DLLs…………………………………………………………………………………41

5.4 Registry Keys Added………………………………………………………………...41

5.5 Registry Keys Deleted……………………………………………………………….42

5.6 Modified Registry Keys……………………………………………………………...42

5.7 API Call Sequence Log………………………………………………………………43

5.8 Attributes in DAF……………………………………………………………………45


5.9 WEKA Explorer……………………………………………………………………...48

5.10 Web Based BLEM2 GUI…………………………………………………………...51

6.1 J48 Decision Tree for DRF…………………………………………………………..53

6.2 J48 Decision Tree for DDF…………………………………………………………..57

6.3 J48 Decision Tree for DRF from BLEM2…………………………………………...62

6.4 J48 Decision Tree for DAF…………………………………………………………..64

6.5 J48 Decision Tree for DAF from BLEM2…………………………………………...69

6.6 API Call Graph for Malware Vs Software…………………………………………...75

6.7 API Call Graph for Software Vs Malware…………………………………………...76


CHAPTER I

INTRODUCTION

Malware, short for malicious software, is a sequence of instructions that performs malicious activity on a computer. The history of malicious programs started with the “computer virus”, a term first introduced by Cohen in 1983 [2]. A virus is a piece of code that replicates by attaching itself to other executables in the system. Today, malware includes viruses, worms, Trojans, rootkits, backdoors, bots, spyware, adware, scareware and any other program that exhibits malicious behavior.

Malware is a fast growing threat to the modern computing world. The production of malware has become a multi-billion dollar industry. The growth of the Internet, the advent of social networks and the rapid multiplication of botnets have caused an exponential increase in the amount of malware. In 2010, there was a large increase in the amount of malware spread through spam emails sent from machines that were part of botnets [3]. McAfee Labs reported 6 million new botnet infections in each month of 2010, and an average detection of 60,000 new pieces of malware per day in 2010 [4]. Symantec MessageLabs reported that, on each day of January 2011, an average of 2,751 websites hosted malware [5]. Currently, the primary and most important defense against malware is antivirus software, such as Norton, McAfee, Sophos, Kaspersky and Clam Antivirus. The vendors of these antivirus programs frequently apply new technologies to their products to fight malware. They use a signature database as the primary tool for detecting malware.

Although signature based detection is very effective against previously discovered malware, it proves to be ineffective against new and previously unknown malware. Techniques like obfuscation, code displacement, compression and encryption make it easy for malware writers to bypass signatures and evade signature based detection. The antivirus companies are trying hard to detect variants of known malware as well as new and unknown malware in order to develop robust antivirus products. Some of these techniques include heuristics, integrity verification and sandboxing. However, they are not very effective when it comes to detecting new malware. We are virtually unprotected until a signature is extracted and deployed.

Most antivirus companies use manual methods for the detection of malware. However, with the amount of new malware generated each day, manual methods will not keep pace, and automatic analysis will be a must in the near future.

Hence, we cannot depend solely on antivirus programs to combat malware. We need an alternative mechanism to detect new and unknown malware.

In an effort to solve the problem of detecting new and unknown malware, we have proposed an approach in the present study. The proposed approach uses reverse engineering and data mining techniques to classify new malware. We collected 582 malicious and 521 benign software samples and reverse engineered each executable using static and dynamic analysis techniques. By applying data mining techniques to the data obtained from the reverse engineering process, we generated a classification model that classifies a new instance with the same set of features as either malware or a benign program.

The rest of this thesis is organized as follows. Chapter 2 discusses previous work on the detection of malware using data mining techniques. Chapter 3 describes different types of malware and current antivirus detection methods. Chapter 4 presents the reverse engineering techniques used in our work. Chapter 5 explains the data mining process and the machine learning tools we used for the experiments. Chapter 6 presents and discusses the results, and finally, Chapter 7 concludes the study and suggests possible future work.


CHAPTER II

LITERATURE REVIEW

Significant research has been done in the field of computer security for the detection of known and unknown malware using different machine learning and data mining approaches.

A method for the automated classification of malware using static feature selection was proposed in [6]. The authors used two static features extracted from malware and benign software: Function Length Frequency (FLF) [11] and Printable String Information (PSI) [12]. This work was based on the hypothesis that “though function calls and strings are independent of each other they reinforce each other in classifying malware”. Disassembly of all the samples was done using IDA Pro, and the FLF and PSI features were extracted using Ida2DB.

In FLF, the function length is the number of bytes of code in the function. The frequencies of all function lengths for all malware were calculated and distributed into exponential interval ranges (1 to E, E to E², etc.); in total they got 50 intervals. The Printable String Information in each unpacked malware sample was extracted, and all the strings for all malware were combined to create a database. A dataset was created with these strings as features, which take a binary value indicating whether a particular malware sample contained the string or not. All strings with a minimum length of 3 were selected. With the selected features, 13 different datasets were created for 13 different malware families and benign programs.

The authors used five classifiers: Naïve Bayes, SVM, Random Forest, IB1 and Decision Table. The best results were obtained by AdaBoostM1 with Decision Table, with an accuracy of 98.86%. It was also observed that the results obtained by combining both features were more satisfactory than using each kind of feature individually.
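
As an illustration of how such features can be encoded, the sketch below bins function lengths into exponential intervals and turns a string vocabulary into binary presence features. It is only a minimal approximation of the approach described in [6]; the interval base, the vocabulary and the sample values are assumptions for illustration, not data from the original work.

    import math

    def flf_vector(function_lengths, num_bins=50):
        # Count how many functions fall into each exponential interval
        # [e^i, e^(i+1)); 50 bins mirror the 50 intervals mentioned above.
        bins = [0] * num_bins
        for length in function_lengths:
            if length < 1:
                continue
            idx = min(int(math.log(length)), num_bins - 1)
            bins[idx] += 1
        return bins

    def psi_vector(sample_strings, vocabulary):
        # One binary feature per vocabulary string: 1 if the sample contains it.
        present = {s for s in sample_strings if len(s) >= 3}
        return [1 if v in present else 0 for v in vocabulary]

    # Hypothetical sample: function lengths in bytes and printable strings.
    lengths = [12, 250, 4000, 18, 90]
    strings = ["LoadLibraryA", "http://example.com", "cmd.exe"]
    vocab = ["LoadLibraryA", "GetProcAddress", "cmd.exe"]
    print(flf_vector(lengths)[:10])    # first ten FLF bins
    print(psi_vector(strings, vocab))  # [1, 0, 1]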

Schultz et al. [9] used different data mining techniques to detect unknown malware. The authors collected 4,226 programs, of which 2,365 were malicious and 1,001 were benign. Of the selected data, 206 benign executables and 38 malicious executables were in PE format. Static features were extracted from each program using three approaches: binary profiling, strings and byte sequences. Binary profiling was applied only to PE files; the other approaches were used for all programs.

Binary profiling was used to extract three types of features: 1) the list of Dynamic Link Libraries used by the PE, 2) the function calls made from each Dynamic Link Library, and 3) the unique function calls in each DLL. The “GNU Strings” program was used to extract printable strings, and each string was used as a feature in the dataset. In the third method of feature extraction, the hexdump [10] utility was used, and each byte sequence was used as a feature.

The authors applied the rule based learning algorithm RIPPER [11] to the three datasets with binary profiling features, a Naïve Bayes classifier to the data with string and byte sequence features, and finally six different Naïve Bayes classifiers to the data with byte sequence features. To compare the results from these approaches with the traditional signature based method, the authors designed an automatic signature generator.

With RIPPER, they obtained accuracies of 83.62%, 89.36% and 89.07% for the datasets with DLLs used, DLL function calls and unique calls in DLLs, respectively. The accuracies obtained with Naïve Bayes and Multi-Naïve Bayes were 97.11% and 96.88%, and with the signature method they got 49.28% accuracy. Multi-Naïve Bayes produced better results compared to the other methods.

In [12], the information in PE headers was used for the detection of malware. This work was based on the assumption that there would be a difference in the characteristics of PE headers for malware and benign software, as they were developed for different purposes. 1,908 benign and 7,863 malicious executables were collected. The malware samples contained viruses, email worms, Trojans and backdoors.

The PE headers of all the files were dumped using a program called DUMPBIN. Every header (MS-DOS header, file header, optional header and section headers) in the PE was considered a potential attribute. For each malware and benign program, the position and entry values of each attribute were calculated. For the reloc section in the PE, whether an executable contained that section or not was noted. Every field in the dataset was converted to a binary value in the attribute binarization process. Unimportant and redundant attributes were eliminated in the next step: unimportant attributes are the ones that were present in only one executable, while redundant attributes were the ones present in all executables. In parallel, attribute selection was performed using Support Vector Machines.


The resulting dataset was tested with an SVM classifier using five-fold cross validation. Accuracies of 98.19%, 93.96%, 84.11% and 89.54% were obtained for viruses, email worms, Trojans and backdoors, respectively. The detection rates of viruses and email worms were high compared to the detection rates of Trojans and backdoors.

In Kolter et al. [13], multiple byte sequences from the executables were used. The authors collected 1,971 clean and 1,651 malicious executables, all in PE format. Hexadecimal code for each executable was obtained using hexdump [10], and from that code multiple bytes in sequence were combined to produce n-grams.

Training data was prepared with the extracted n-grams as binary features. The most relevant features were selected by calculating the information gain for each feature; in this process a total of 500 features were selected. Several data mining techniques like IBk, TFIDF, Naïve Bayes, Support Vector Machines (SVM) and decision trees were applied to generate rules for classifying malware. The authors also used boosted Naïve Bayes, SVM and decision tree learners.

Three experiments were conducted on the data. In the first experiment, the size of the words, the size of the n-grams and the number of features appropriate for the experiments were assessed. From a subset of the executables, n-grams were extracted with n=4. Multiple data mining experiments were conducted to find the optimal size of the n-grams by varying the subset size (10, 20, 100, 1000, etc.); the best results were obtained with a size of 500. By fixing the size to 500, n was varied, and the results were most accurate with n=4. In the second experiment, out of 68,774,909 n-grams the 500 best n-grams were selected, and 10-fold cross validation was applied in each classification method. In the third experiment, 255 million n-grams were extracted from all the executables and the same procedure was followed as in the second experiment. The boosted classifiers, SVM and IBk produced good results compared to the other methods. The performance of the classifiers was improved by boosting, and the overall performance of all the classifiers was better with the large dataset than with the small dataset.
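
The sketch below shows one way binary n-gram features can be derived from an executable's raw bytes, in the spirit of [13]; the 4-byte n-gram size and the tiny selected-feature list are illustrative assumptions, and real experiments select the top n-grams by information gain over a full corpus.

    def byte_ngrams(data, n=4):
        # Distinct n-byte sequences in the file, rendered as hex strings.
        return {data[i:i + n].hex() for i in range(len(data) - n + 1)}

    def ngram_features(path, selected_ngrams, n=4):
        # Binary feature vector: 1 if the selected n-gram occurs in the file.
        with open(path, "rb") as f:
            grams = byte_ngrams(f.read(), n)
        return [1 if g in grams else 0 for g in selected_ngrams]

    # 'selected_ngrams' would normally be the n-grams with the highest
    # information gain; here it is just a hypothetical placeholder list.
    selected_ngrams = ["4d5a9000", "50450000"]
    # print(ngram_features("sample.exe", selected_ngrams))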

Dmitry and Igor [14] used positionally dependent features in the Original Entry Point (OEP) of a file for detecting unknown malware. In this work they used 5,854 malicious and 1,656 benign executables in Win32 PE format. Various data mining algorithms like Decision Table, C4.5, Random Forest and Naïve Bayes were applied to the prepared dataset. Three assumptions were made for this work: 1) studying the entry point of the program, known as the Original Entry Point (OEP), reveals more accurate information; 2) the location of the byte value at the OEP address was set to zero, and the offsets for all the bytes around the OEP were considered to be in the range [-127, 127]; 3) only a single byte can be read for each position value, so the range for the byte-in-position value is from 0 to 255. Finally, the possible number of features that could be used for classification was 65,536. The dataset contained three features: Feature ID, Position and Byte in Position.

Feature selection was performed to extract the more significant features. The features extracted in this step were selected based on the dependencies between features, information gain and the base components of the features. The resulting data was tested against all classifiers, and the results were compared based on ROC area. Random Forest outperformed all the other classifiers.

A specification language was derived in Jha et al. [15] based on the system calls made by malware. These specifications were supposed to describe the behavior of the malware. The authors also developed an algorithm known as MINIMAL that mines the specifications of malicious behavior from dependency graphs, and applied this algorithm to the email worm Bagle.J, a variant of the Bagle malware.

Clean and malicious files were executed in a controlled environment, and traces of system calls were extracted for each sample during execution. The dependencies between system call arguments were obtained by observing the arguments and their types in the sequence of calls. A dependency graph was constructed using the system calls and their argument dependencies; in that graph, each node denotes a system call and its arguments, and each edge denotes a dependency between arguments of two system calls. A subgraph was extracted from the malware dependence graph by contrasting it with the benign software dependence graph such that it uniquely specifies the malicious behavior. A new file exhibiting these specifications would be classified as malware.

A Virus Prevention Model (VPM) to detect unknown malware using DLLs was implemented by Wang et al. [16]. 846 malicious and 1,758 benign files in Portable Executable format were collected. All files were parsed by the program Dependency Walker, which shows all the DLLs used in a tree structure. Three types of attributes, T1, T2 and T3, were derived from the resulting tree: T1 is the list of APIs used by the main program directly, T2 indicates the DLLs invoked by DLLs other than the main program, and T3 is the relationships among DLLs, which consist of dependency paths down the tree.

In total, 93,116 attributes were obtained. The attributes with low information gain were removed. Further feature reduction was done using L-SVM; the attributes with lower rank were removed. After preprocessing there were 1,398 attributes. Finally, 429 important attributes were selected and the dataset was tested with an RBF-SVM classifier using five-fold cross validation. The detection rate with the RBF-SVM classifier was 99.00%, with a true positive rate of 98.35% and a false positive rate of 0.68%.

A similarity measure approach for the detection of malware was proposed by Sung et al. [17] based on the hypothesis that variants of a malware have the same core signature, which is a combination of the features of the variants of that malware. To generate variants for different strains of malware, traditional obfuscation techniques were used. The generated variants were tested against 8 different antivirus products.

Four virus strains, W32.Mydoom, W32.Blaster, W32.Beagle and Win32.Wika, were used in this process. The new malware strains obtained from obfuscation were classified into five types: null operation and dead code insertion, data modification, control flow modification, data and control flow modification, and pointer aliasing.

The source code of each PE was parsed to produce an API calling sequence, and this sequence was considered the signature for that file. Each API call was given an integer ID, and the sequence of API calls was represented by the corresponding sequence of IDs. The resulting sequence was compared with the original malware sequence to generate a similarity measure. The similarity measures were calculated using Euclidean distance, sequence alignment and different similarity functions including the cosine measure, the extended Jaccard measure and the Pearson correlation measure. A mean value of all the measures was calculated for each signature. The largest index in the similarity table denotes to which original malware the particular variant belongs. By comparing that value with a threshold, the nature of the file, benign or malicious, was decided. After the experiments, the results from 8 different antivirus scanners were compared with SAVE. The detection rate of SAVE was far better than that of the antivirus scanners.
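
As a concrete illustration of one of the similarity functions mentioned above, the sketch below maps API calls to integer IDs, turns each sequence into a call-frequency vector and compares a variant with an original using the cosine measure. The call names and sequences are hypothetical, and the actual SAVE system combines several measures with sequence alignment rather than relying on a single one.

    import math
    from collections import Counter

    def cosine_similarity(seq_a, seq_b):
        # Cosine of the call-frequency vectors of two API-call ID sequences
        # (1.0 means an identical distribution of calls).
        ca, cb = Counter(seq_a), Counter(seq_b)
        dot = sum(ca[k] * cb[k] for k in ca)
        norm = (math.sqrt(sum(v * v for v in ca.values()))
                * math.sqrt(sum(v * v for v in cb.values())))
        return dot / norm if norm else 0.0

    # Hypothetical signatures: API names mapped to integer IDs.
    ids = {"CreateFileA": 1, "WriteFile": 2, "RegSetValueA": 3, "connect": 4}
    original = [ids["CreateFileA"], ids["WriteFile"], ids["WriteFile"],
                ids["RegSetValueA"], ids["connect"]]
    variant = [ids["CreateFileA"], ids["WriteFile"], ids["RegSetValueA"],
               ids["connect"], ids["connect"]]
    print(round(cosine_similarity(original, variant), 3))  # about 0.857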

In [18], a strain of the Nugache worm was reversed in order to study its underlying design and behavior and to understand the attacker’s approach to finding vulnerabilities in a system. In addition, the authors also reverse engineered 49 malware executables in an isolated environment and extracted various features like MD5 hash, printable strings, number of API calls made, DLLs accessed and URLs referenced. Using these features they prepared a dataset. Due to the multi-dimensional nature of the dataset, a machine learning tool, BLEM2 [19], based on rough set theory, was used to generate dynamic patterns which would help in classifying unknown malware. As the size of the dataset was small, very few decision rules were generated and the results were not satisfactory.

In another work [20] based on dynamic analysis, spatio-temporal information in API calls was used to detect unknown malware. The proposed technique consists of two modules: an offline module that develops a training model using available data, and an online module that generates a testing set by extracting spatio-temporal information at run time and compares it with the training model to classify a run-time process as either benign or malicious.

System logs for 100 benign and 416 malicious programs were collected, and 237 native Windows API calls of different categories like socket, memory management, threads, etc. were traced and used as the base.

In the dynamic analysis, spatial information was obtained from function call arguments and return values, and was divided into seven subsets (socket, memory management, processes and threads, file, DLLs, registry and network management) based on functionality.

Temporal information was obtained from the sequence of calls, and the authors observed that some of the sequences were present only in malware and were missing in benign programs.

Spatial information was quantified using statistical and information theoretic measures. By calculating the autocorrelation, the authors were able to extract the relation between calls in API call sequences. The correlation value lies in the range [-1, 1]; a value of 0 denotes no correlation at all, and 1 denotes perfect correlation, for which the lag value would be 0. For the API call sequences the best correlation was obtained at n = 3, 6, 9, ...

The API call sequence was modeled using a discrete time Markov chain, which enabled them to decide how many lags to examine in API sequences and to reduce the size of the sample space.

The Markov chain had k states, and the transition probabilities between these states were represented in a state transition matrix T. Each transition probability was considered a potential attribute. Feature selection was performed to select the attributes with the most information gain. Finally, they selected 500 transitions and prepared a set with Boolean values.

Three datasets were created by combining the benign programs’ API traces with each malware type; the three datasets were combinations of benign-Trojan, benign-virus and benign-worm. They conducted two experiments: the first to study the combined performance of spatio-temporal features compared to standalone spatial or temporal features, and the second to extract a minimal subset of API categories that gives the same accuracy as the first experiment. For this, the authors combined API call categories in all possible ways to find the minimal subset of categories that would give the same classification rate as obtained in the first experiment.

From the first experiment, the authors obtained 98% accuracy with Naïve Bayes and 94.7% accuracy with the J48 decision tree, and they got better results with combined features compared to standalone features. The detection rate of Trojans was lower compared to viruses and worms.

In the second experiment, a combination of API calls related to memory management and file I/O produced the best results, with an accuracy of 96.6%.
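
To make the Markov-chain modeling concrete, the sketch below estimates a first-order transition-probability matrix from an observed API call sequence; each non-zero transition probability could then serve as a candidate attribute, as described above. The state names and the trace are assumptions for illustration only.

    from collections import defaultdict

    def transition_matrix(call_sequence):
        # Estimate P(next call | current call) from an observed API call trace.
        counts = defaultdict(lambda: defaultdict(int))
        for cur, nxt in zip(call_sequence, call_sequence[1:]):
            counts[cur][nxt] += 1
        matrix = {}
        for cur, nxts in counts.items():
            total = sum(nxts.values())
            matrix[cur] = {nxt: c / total for nxt, c in nxts.items()}
        return matrix

    # Hypothetical API trace; each (current, next) probability is a feature.
    trace = ["NtCreateFile", "NtWriteFile", "NtClose", "NtCreateFile", "NtWriteFile"]
    for state, row in transition_matrix(trace).items():
        print(state, row)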

Our work is based on the assumption that the behavior of a malware can be fully revealed by executing it and observing its effects on the operating environment. For this task, we captured all the activities, including registry activity, file system activity, network activity, API calls made and DLLs accessed, for each executable by running it in an isolated environment. Using the features extracted from the reverse engineering process, we prepared three datasets. To these datasets, we applied the data mining algorithms C4.5 and Naïve Bayes and a rough set based tool, BLEM2, to generate classification rules, and compared the results.

In some of the above mentioned works ([6], [9], [12], [13], [14], [15], [16] and [17]), only static features like byte sequences, printable strings and API call sequences were used. Though effective in detecting malware, they would be ineffective if attackers use obfuscation techniques to write malware. To solve this problem, some other works ([18] and [20]) used dynamic detection methods. The work done in [20] used only dynamic API call sequences, and using only API calls may not be effective in detecting malware. In [18], malware samples were reversed to find their behavior, and data mining techniques were applied to the data obtained from the reversing process; a very small number of rules was generated and the results were not effective, as the experiments were conducted on very few samples. Our work is different from all the above works as we combined static and behavioral features of all malware and benign software. It is an extension of the work done in [18], and it differs from it in that we did rigorous reverse engineering of each executable to find its inner workings in detail and used a large number of samples (582 malicious and 521 benign), which enables determining the behavior of malicious executables more accurately.


CHAPTER III

TYPES OF MALWARE AND ANTI-MALWARE DEFENSE TECHNIQUES

3.1 Malware Types

Based on the infection mechanism and behavior, malware can be divided into various types such as viruses, worms, Trojans, rootkits, backdoors, spyware, adware, scareware, rogue software, etc. Following is a brief description of each malware type and the infection mechanisms used by them.

3.1.1 Virus

A program that replicates by attaching itself to other programs running on the system is known as a virus. In order to replicate, viruses normally require human participation. This section describes the most common techniques used by attackers to infect computer systems.


3.1.1.1 Infecting Executables

The most frequent targets for computer viruses are executables. Virus writers use many techniques to infect executable files, some of which include:

• Overwriting Infection Technique: These viruses infect other files in the system by replacing portions of their code with malicious code. Often this type of infection makes the file inoperable because of the significant portion of missing code.

• Appending Infection Technique: In this technique, a virus infects a target file by attaching its code to the end of the target file. To execute the attached portion of code, it inserts a jump call in the header of the target file. Sometimes executables are attached to the target file by changing the file header to reflect the changes.

• Prepending Infection Technique: In this case the virus code is attached to the beginning of the target file. When the target file is launched, the virus code is executed first and then it may transfer control to the original host program, depending on the nature of the virus.

• Parasitic Infection Technique: This technique is a slight variation of the prepending infection technique. The virus replaces the top portion of the target program with its code, and the replaced code is moved to the end of the target program. Sometimes the replaced code is moved to a temporary file instead of the end of the program.

• Compressing Technique: In this technique, the infector compresses the content of the target program using a packer. This mechanism is used to conceal the increase in file size after infection. It also helps in avoiding detection by static analysis techniques.

3.1.1.2 Infecting Boot Sectors

The Master Boot Record (MBR) and the Partition Boot Sector (PBS) are two areas on the hard drive that contain sequences of instructions which help in the boot-up process of a system. Boot sector virus writers make use of the executable instructions in these boot sectors: the virus replaces the instructions with malicious code, and each time the system boots up, the malicious code is executed.

A famous example of a boot sector virus is the Michelangelo virus, discovered in 1991. It was programmed such that if the infected computer booted up on Michelangelo’s birthday, it would overwrite sectors of the hard drive.

3.1.1.3 Infecting Word Documents

Documents created in word processors like Microsoft Word support macros. Macros are series of instructions that run when a document is opened. If the document is macro-enabled, then each time the document is opened the macro executes automatically. Virus authors take advantage of this feature to infect Word documents.


3.1.1.4 Infecting Images

In this technique, malicious code is normally embedded in images. When a user clicks on the image the embedded code within the image will be executed. Usually this technique is used to escape detection by antivirus programs installed on the system.

3.1.2 Worm

A worm is a self-replicating computer program that spreads copies of itself to other computers over a network. Most worms do not need any kind of user participation to spread. However, some worms, like mass mailer worms, sometimes require human participation.

3.1.2.1 Infection Techniques

In order to infect a target system, a worm must first gain access to it. To achieve this, worms search for vulnerabilities in the target system, and if they succeed in finding a vulnerability they exploit it to inject malicious code into that system. After infecting a target system, a worm searches for vulnerabilities in systems that are connected to the infected one. This process is repeated several times to infect thousands of computers worldwide. The techniques most widely used by worms to gain access to target computers are buffer overflow exploits, network file sharing exploits, e-mail exploits, zero-day exploits and other common system misconfigurations. Since no code is bug-free, new vulnerabilities surface each day, making the job of worm authors easier.

An example of a worm is Code Red, which was released in 2001. It gained access to thousands of computers within a few hours of release.

3.1.3 Backdoor

A backdoor is a malicious program installed by an attacker on a target system to gain remote access.

Attackers exploit various vulnerabilities on a victim machine to install backdoors. Sometimes they trick users into installing backdoors themselves by making them believe that the program is legitimate. After installation, they employ different techniques to restart the backdoor frequently, including modifying startup files, dropping a registry key that is run each time the system boots up, and registering it as a scheduled task.

Backdoors are written according to the attacker’s requirements. Some backdoors let attackers elevate their privileges to root or administrator, allowing remote execution of commands, and others let attackers monitor all the activities on a target system.

3.1.4 Trojan horse

A program that appears to be benign but performs malicious activity is known as a Trojan horse. The term is derived from the famous story of the Trojan horse in Greek mythology.


To hide the malicious behavior of these programs, attackers employ several techniques like changing the name of the file to a legitimate program’s name, manipulating the file type, modifying the source code of an original program, using polymorphic code, and many others.

Once a Trojan is installed on a system, it can be used to install other malicious programs, create backdoors in the system enabling remote access, and collect useful information from the system.

3.1.5 Rootkit

A rootkit is a program that exhibits the combined behavior of a Trojan horse and a backdoor and additionally modifies other programs of the operating system [1].

They exhibit Trojan behavior by replacing the original version of a file with an infected copy and backdoor behavior by enabling attackers to access a system remotely.

Unlike Trojans and backdoors it also modifies operating system programs.

Based on the operating environment, rootkits are divided into two types:

i) User Mode Rootkits

ii) Kernel Mode Rootkits.

User Mode Rootkits replace applications on top of the kernel with malicious code to achieve their goal. This helps attackers hide their presence. Kernel Mode Rootkits are the same as User Mode Rootkits except for the operating environment: in this case, they modify the kernel itself, impairing the victim computer completely.


3.1.6 Spyware

Spyware is a program that collects confidential information from a user’s computer, captures web browsing activity and sends all the data to a third party for monetary benefit [21].

Web browsers are the major targets for spyware. The most frequently used mechanisms to spread spyware include ActiveX controls, plug-ins and executable programs. In fact, using ActiveX controls is the easiest and most effective way for attackers to distribute spyware. A plug-in is a program that enhances a web browser’s functionality; an ActiveX control downloads and installs plug-ins in the web browser.

3.1.7 Adware

Programs that display advertisements in the form of pop-ups, flash banners and other means on the user’s screen are known as adware. Some types of adware also work as spyware and collect confidential user information. Programmers are often paid by business organizations to write these kinds of programs.

3.2 Antivirus Detection Techniques

The rapid evolution of malware demands equally strong antivirus (in fact, anti-malware) software. This section explains the following mechanisms employed by anti-malware engines.


1. Signature based detection

2. Heuristic based detection

3. Sandbox approach

4. Integrity checking

3.2.1 Signature based detection

Signature based detection is one of the simplest and most widely used mechanisms to detect malware. A signature is a sequence of hexadecimal bytes that distinctly differentiates a file from others. Normally, the signature extraction process starts by converting the binary code of the program into assembly language, then searching for the suspicious section of the code, and then selecting a sequence of bytes corresponding to that section. A typical signature looks as shown in Figure 3.1.

Figure 3.1: Typical Malware Signature

The vendors of anti-malware products collect malware samples and generate a signature for each sample. With the collection of these signatures, a huge signature database is created and deployed on the user’s system as part of the anti-malware program. When this program is launched, the scanner compares each file against this signature set. If any file matches a signature in the database, the file is marked as malware.
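
The sketch below illustrates the basic idea of signature scanning: each signature is a fixed byte pattern, and a file is flagged if any pattern occurs in its contents. The two-entry signature database and the patterns are hypothetical; production scanners use far larger databases and support wildcards and offsets.

    # Hypothetical signature database: name -> hexadecimal byte pattern.
    SIGNATURES = {
        "Example.TrojanA": bytes.fromhex("8b45fc83c0013bc1"),
        "Example.WormB": bytes.fromhex("558bec6a00e8"),
    }

    def scan_file(path):
        # Report the names of all signatures whose byte pattern occurs in the file.
        with open(path, "rb") as f:
            data = f.read()
        return [name for name, pattern in SIGNATURES.items() if pattern in data]

    # matches = scan_file("suspect.exe")
    # print("malware" if matches else "clean", matches)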


3.2.2 Heuristic Approach

Heuristic analysis is a proactive approach to malware detection. In heuristic analysis, a rule based approach is used for finding new and unknown malware. The rules are derived from the characteristics of malware that were previously found and detected by antivirus products. In this approach, a value or rank is assigned to each malware-like feature. Below are some of the features of known malware [1]:

• Logging keystrokes.

• Writing to executable files.

• Dropping new registry keys at particular places in the Windows Registry.

• Deleting files on the hard drive.

• Suspicious network activity.

If a new suspicious file is found, the scanner looks for all these features in it. The values associated with all the features found in that file are combined. If this combined value exceeds a certain predefined threshold, the file is categorized as malware.

However, the problem with this approach is that if the threshold value is set too high, a lot of malware will escape detection, and if it is set too low, some legitimate programs might be classified as malware, resulting in false positives.
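
A toy sketch of this weighted-scoring idea is given below: each malware-like feature carries a weight, the weights of the observed features are summed, and the total is compared to a threshold. The feature names, weights and threshold are invented for illustration; real heuristic engines are far more elaborate.

    # Hypothetical weights for the malware-like features listed above.
    FEATURE_WEIGHTS = {
        "logs_keystrokes": 4,
        "writes_to_executables": 3,
        "adds_run_registry_key": 3,
        "deletes_files": 2,
        "suspicious_network_activity": 2,
    }
    THRESHOLD = 6  # assumed cut-off; raising it misses malware, lowering it adds false positives

    def heuristic_verdict(observed_features):
        # Sum the weights of the observed features and compare with the threshold.
        score = sum(FEATURE_WEIGHTS.get(f, 0) for f in observed_features)
        return "malware" if score >= THRESHOLD else "benign"

    print(heuristic_verdict({"deletes_files", "suspicious_network_activity"}))  # benign (score 4)
    print(heuristic_verdict({"logs_keystrokes", "writes_to_executables"}))      # malware (score 7)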

3.2.3 Sandbox Approach

In this approach, the antivirus program emulates the operating system and executes the code in the simulated environment. It then monitors all the activities made by the executable. If any anomalous behavior, or behavior similar to previously known malware, is detected, that executable is classified as malware. This approach is not used frequently because of its high consumption of system resources.

3.2.4 Integrity Checking

In this technique, malware detection is performed by observing changes in the operating environment. In the integrity checking process, the antivirus program takes an image of the files in the system which need to be monitored. After scanning the file system, it takes another image. If there are any suspicious changes in the file system, it knows that the files under consideration have been infected by malware [22].

This technique has the same limitation as the signature based approach: it detects malware only after infection.
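
A minimal sketch of the integrity checking idea, assuming SHA-256 as the hash and a hand-picked list of monitored files: a baseline of file hashes is recorded, and a later scan reports files whose hashes have changed or which have disappeared.

    import hashlib
    import os

    def snapshot(paths):
        # Record a SHA-256 hash for every monitored file.
        baseline = {}
        for p in paths:
            with open(p, "rb") as f:
                baseline[p] = hashlib.sha256(f.read()).hexdigest()
        return baseline

    def changed_files(baseline):
        # Return files whose current hash differs from the baseline
        # (possible infection) or which no longer exist.
        changed = []
        for p, old_hash in baseline.items():
            if not os.path.exists(p):
                changed.append(p)
                continue
            with open(p, "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != old_hash:
                    changed.append(p)
        return changed

    # base = snapshot([r"C:\Windows\notepad.exe"])  # hypothetical monitored file
    # ... run the system for a while ...
    # print(changed_files(base))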

Antivirus vendors combine the above mentioned approaches and other new techniques to detect malware. They use different strategies to detect metamorphic malware, polymorphic malware, worms, Trojans and other malicious software. They frequently add new techniques to their engines to fight newly emerging and complicated malware. All antivirus vendors share malicious software samples openly for the common interest, but the detection techniques used by each vendor are proprietary.


CHAPTER IV

REVERSE ENGINEERING

Reverse engineering of malware can be defined as analysis of the malware in order to understand its design, its components and the behavior that makes it inflict damage on a computer system. The benefit of reverse engineering is that it allows us to see hidden behavior of the file under consideration which we cannot see by merely executing it [23].

In the reverse engineering process we used static and dynamic analysis techniques. There are many different tools available for each technique; all the tools used in our work are open source. In total, we reversed 1,103 PE (Portable Executable) files, of which 582 were malicious executables and 521 were benign executables. All the malicious executables were downloaded from [24] and all benign executables were downloaded from [25] and [26]. We analyzed each executable using both static and dynamic analysis techniques.

4.1 Controlled Environment

For static analysis of executables, we do not require a controlled environment, because we do not run the malware. In the case of dynamic analysis, however, the code we run is malicious and dangerous, and the environment for the reversing process should be isolated from the other hosts on the network. There are software products that provide a virtual environment for the analysis of malware; some of them are Parallels, Microsoft Virtual PC, VMware and Xen. These products allow running more than one virtual machine on a single computer. Each virtual machine has its own guest operating system, such as Windows or Linux, and each guest operating system is isolated from all the others. Due to the strong isolation between the guest operating system in the VM and the host operating system, even if the virtual machine is infected with malware there will be no effect on the host operating system.

For the analysis of malware we needed virtualization software that would allow a quick backtrack to a previous system state after it has been infected by malware. Each time a malware sample is executed in the dynamic analysis process, it infects the system, and analysis of the next sample has to be done in a clean system. We chose VMware Workstation as the virtualization software for our work. VMware Workstation [27] has a feature called Snapshot Manager that creates a tree of snapshots with the set of system states captured at various times. With this option it is easy to move to whichever virtual machine state we want by just double clicking on the corresponding snapshot. Figure 4.1 shows the tree of system states created by Snapshot Manager in VMware Workstation.


Figure 4.1: Snapshot Manager

4.2 Experimental setup

Our experimental setup for the reverse engineering of malware and benign software was as follows:

• Virtualization Software: VMware Workstation.

• Virtual System Configuration: 512 Megabytes of RAM and 40 Gigabytes of secondary storage.

• Host Operating System: Microsoft Windows XP Professional with Service Pack 2 updates.

• Target Operating System: Microsoft Windows XP Professional with Service Pack 2 updates.

• The environment for reverse engineering was isolated from the underlying Local Area Network.

4.3 Static Analysis

In general, it is a good idea to start the analysis of any given program by observing the properties associated with it and predicting its behavior from visible features without actually executing it. This kind of analysis is known as static analysis. The advantage of static analysis is that it gives us an approximate idea of how the program will affect the environment upon execution without actually being executed. However, most of the time, it is not possible to predict the absolute behavior of a program with static analysis alone.

There are many different tools available that aid in static analysis of executables, like decompilers, disassemblers, source code analyzers and other tools that help in extracting useful features from executables. The tools we used were the Malcode Analyst Pack [28], PEiD [29] and the IDA Pro Disassembler [30]. Sections 4.3.1 through 4.3.3 describe the techniques we used for static analysis in the reverse engineering process.

4.3.1 Cryptographic Hash Function

A unique cryptographic hash value is associated with each executable file. This value differentiates each file from others. We started our reverse engineering process of each executable by calculating its hash value.

28

The reason for calculating the hash value is twofold. First, there is no unique standard for naming malware; there can be multiple names for a single piece of malware, so by calculating the hash value of each sample we know whether samples are indeed the same, which eliminates ambiguity in the reverse engineering process. Second, if an executable is modified, its hash value also changes. That way we can identify that changes were made to the executable and analyze it to detect the changes made.

MD5, SHA1 and SHA256 are widely used hash functions. We used the Malcode Analyst Pack (MAP) tool to compute the MD5 (Message Digest 5) hash value of each PE file that we analyzed.
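
The same MD5 fingerprint that MAP reports can be reproduced with a few lines of Python using the standard hashlib module; this is a generic sketch rather than the tool we actually used, and the file name is a placeholder.

    import hashlib

    def md5_of_file(path, chunk_size=8192):
        # Hash the file in chunks so large executables need not fit in memory.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # print(md5_of_file("sample.exe"))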

4.3.2 Packer Detection

Malware authors employ various techniques to obfuscate the content of the malware they have written and make it unable to be reversed. Using packers is one of them. A packer is a program that compresses another executable program and thereby hides its content. Packers help malware authors hide the actual program logic of the malware so that a reverse engineer cannot analyze it using static analysis techniques. Packers also help the malware evade detection by antivirus programs.

In order to execute, a packed malware must unpack its code into memory. For this reason, the authors of the malware include an unpacker routine in the program itself. The unpacker routine is invoked at the time of execution of the malware and converts the packed code into its original executable form. Sometimes multiple levels of packing are used to make the malware more sophisticated [31]. Figures 4.2 [32] and 4.3 [32] below show a normal PE file and a packed PE file.

Figure 4.2: Normal PE File

Figure 4.3: Packed PE File


Detecting the packer with which a malware sample is packed is very important for the analysis of the malware. If we know the packer, we can unpack the code and then analyze the malware. We used PEiD, a free tool for the detection of packers. It has over 600 different signatures for the detection of different packers, cryptors and compilers [29]. Simply opening the malware with PEiD displays the packer with which it is packed. If the signature of the packer or compiler with which the malware is packed is not present in the PEiD database, it reports that it did not find any packer. Figure 4.4 shows PEiD detecting a packer; the name of the packer is shown in the text box.

Figure 4.4: PEiD
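
PEiD relies on a signature database, but a rough packer check can also be made from section properties: packed files often have high-entropy sections or nonstandard section names such as UPX0. The sketch below uses the third-party pefile library to express that heuristic; it is an assumed stand-in for illustration, not the PEiD signature matching we actually used, and the name list and entropy threshold are guesses.

    import pefile  # third-party library: pip install pefile

    SUSPICIOUS_SECTION_NAMES = {b"upx0", b"upx1", b".aspack", b".petite"}

    def looks_packed(path, entropy_threshold=7.0):
        # Flag the file if any section has very high entropy (near-random bytes,
        # typical of compressed or encrypted code) or a known packer section name.
        pe = pefile.PE(path)
        for section in pe.sections:
            name = section.Name.rstrip(b"\x00").lower()
            if name in SUSPICIOUS_SECTION_NAMES:
                return True
            if section.get_entropy() > entropy_threshold:
                return True
        return False

    # print(looks_packed("sample.exe"))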

4.3.3 Code Analysis

The next step toward better understanding the malware is to analyze its source code. Although there are many decompilers that help in decompiling executables into high level languages, analyzing the malware with the source code in a low level language reveals more information. The IDA Pro disassembler [30] from DataRescue is a popular choice for disassembling an executable program into assembly language.

We used the IDA Pro Disassembler for the code analysis of malware. In this step, we went through the assembly code of each PE file to find useful information and to understand its behavior. Figure 4.5 shows the IDA Pro Disassembler from DataRescue. Below is the list of features that we were able to extract from the assembly code of the PE files.

• Type of file from the PE header. If it was not a PE file, we discarded it.

• List of strings embedded in the code that would be useful for predicting the behavior of the PE.

• The programming language with which the PE was written.

• Compiled date and time.

Figure 4.5: IDA Pro Disassembler
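
Outside of IDA Pro, the embedded strings can also be pulled from a binary with a simple scan for runs of printable characters, much like the Unix strings utility. The sketch below is a generic illustration; the minimum length of 4 and the file name are assumptions.

    import string

    PRINTABLE = set(string.printable.encode()) - set(b"\t\n\r\x0b\x0c")

    def extract_strings(path, min_len=4):
        # Yield runs of printable ASCII characters at least min_len bytes long.
        with open(path, "rb") as f:
            data = f.read()
        run = bytearray()
        for byte in data:
            if byte in PRINTABLE:
                run.append(byte)
            else:
                if len(run) >= min_len:
                    yield run.decode("ascii")
                run.clear()
        if len(run) >= min_len:
            yield run.decode("ascii")

    # for s in extract_strings("sample.exe"):
    #     print(s)   # URLs, registry paths, library names, etc.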


4.4 Dynamic Analysis

In static analysis of executables, we just analyze the static code of the executable and approximately predict its properties and behavior. The authors of malware can use techniques like binary obfuscation and packing to evade static analysis. So, to thoroughly understand the nature of the malware we cannot rely on static analysis techniques alone. If a program is to run, its whole code has to be unpacked and loaded into primary memory. Every detail of the executable program is revealed at run time, no matter how obfuscated the code is and with what packer the executable is packed [1]. In dynamic analysis, we observe the full functionality of the malware and its effect on the environment as it executes. This is done with the help of tools that assist in dynamic analysis.

The tools of the trade that help in dynamic analysis of executables include debuggers, registry monitors, file system monitors, network sniffers and API call tracers. The tools we used in this step were Filemon, Regshot and Maltrap.

4.4.1 File System Monitor

When a program is executed, it makes changes to the file system. The file system activity made by the program helps partly in determining its behavior. We used File Monitor (Filemon) [33], a product of Microsoft Sysinternals, to monitor the file system activity of all the processes running on a Windows system. It installs a device driver which keeps track of all the file system activity in the system. However, we need only the information related to the particular process under consideration, so we use a filter which lets us select the process for which we want to monitor file system activity by removing all the other processes from the list. Each file system operation made by the PE produces one line of output in the Filemon GUI window. Figure 4.6 shows the Filemon GUI window.

Figure 4.6: File Monitor

4.4.2 Registry Monitor

Windows Registry is a huge database hosting configuration details of the operating system and the programs installed on it. Registry entries are divided into hives and represented in tree structure in Windows systems. Most applications use two hives frequently; HKLM and HKCU. HKLM stands for Hive Key Local Machine and contains

34 settings of the operating system. HKCU stands for Hive Key Current User and contains configuration details for the user currently logged into the system [34].

Malware authors frequently use the registry to infect systems. One common technique is to insert an entry at the location HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\RUN so that the malware is executed each time the system boots up. There is an extensive list of such keys in the Windows Registry, and they are used by attackers for their malicious purposes.

Regshot is a tool that helps in the reverse engineering process by monitoring the Windows Registry. It lists the changes made in the Windows Registry upon installation of software. We used this tool to learn the changes made by malware and benign software in the Windows Registry. Figure 4.7 shows the Registry Monitor GUI.

Figure 4.7: Registry Monitor


The changes made by software in the Windows Registry can be obtained by comparing the registry keys before and after execution of the software. A base image of the Windows Registry is generated by clicking the 1st shot button. After the execution of the software, another image is generated by clicking 2nd shot. When both images are compared, a log file is generated that shows the list of keys added, keys deleted, values added, values deleted and values modified by the software. Figure 4.8 shows the result of comparing both images.

Figure 4.8: Registry Key changes made by a PE
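
Conceptually, the 1st shot / 2nd shot comparison is a dictionary diff over registry key/value pairs. The sketch below shows that idea on plain Python dictionaries; it is not Regshot itself, and the example keys and values are hypothetical.

    def diff_registry(before, after):
        # Compare two {key: value} snapshots and report what changed.
        added = sorted(set(after) - set(before))
        deleted = sorted(set(before) - set(after))
        modified = sorted(k for k in set(before) & set(after) if before[k] != after[k])
        return {"keys_added": added, "keys_deleted": deleted, "values_modified": modified}

    # Hypothetical snapshots taken before and after running a sample.
    before = {r"HKLM\SOFTWARE\Example": "1.0"}
    after = {
        r"HKLM\SOFTWARE\Example": "1.0",
        r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run\updater": r"C:\temp\updater.exe",
    }
    print(diff_registry(before, after))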

4.4.3 API Call Tracer

The Windows API (Application Programming Interface), also known as the Win API, is a long list of functions that provide system level services to user programs. Every Windows application is implemented using Win API functions [35].


Keeping track of the sequence of API calls made by an application helps in the reverse engineering process. It allows us to go through each call and thereby predict the behavior of that software.

Maltrap is a tool that lists the sequence of API calls made by a program during execution. Figure 4.9 shows the Maltrap GUI during execution of a program.

Figure 4.9: MALTRAP


CHAPTER V

DATA MINING

In this chapter, we apply data mining techniques to the reverse engineered data to analyze patterns associated with malware and benign software. Based on the knowledge learned from the data mining tasks, new and unknown software can be classified as either benign or malicious. Given the large volume of malware being generated each day, this approach would be very useful in malware detection. It also serves the purpose of proactive detection of malware.

5.1 System Design

Data Mining is the process of extracting useful patterns from data. It is a part of the KDD (Knowledge Discovery in Databases) process [36]. The goal of the KDD process is to derive knowledge from huge volumes of data. Various data mining algorithms are used in the KDD process to process raw data and generate patterns from the processed data. Figure 5.1 shows the steps in the KDD process.

Figure 5.1: KDD Process

The first step in the KDD process involves understanding the application domain, prior knowledge related to that domain, and the goal of the end user.

 Target Data Selection: With the help of knowledge about the application domain,

a target dataset is created in this step.

 Cleaning and Preprocessing: In this step, the raw target data is preprocessed.

Preprocessing involves removing noisy data and handling missing values in the

data.

 Data Transformation: This step involves reduction of the dataset. Features that

would not be of any help for the overall goal of the process are eliminated.

 Data Mining: This step involves deciding which kind of analysis is to be done, i.e., classification, clustering, association or regression, selecting the data mining algorithms for the chosen kind of analysis, and applying the algorithms to the data to generate patterns.

 Interpretation/Evaluation: Patterns and rules generated in the Data Mining step

are analyzed to create a knowledge base.

5.2 KDD Process

In this section, we explain our implementation of the KDD process to find patterns that help in distinguishing malware from benign software. We obtained the raw data from reverse engineering the malware and benign software samples, processed that data, and extracted useful features from it. The detailed process we followed to extract the features is explained below.

From the static analysis of each sample in the reverse engineering process we obtained the MD5 hash of the file, the file size in bytes, the packer with which the file was packed, a decision of whether it contained unique strings, the time stamp, and the programming language used to write the file.

From the dynamic analysis, for each file, we obtained a log of file system activity, a log of registry activity and the sequence of API calls made by the sample while running. Figure 5.2 shows the file system activity log of the malware named aIRCBot.

Figure 5.2: File System Activity Log

From the file system activity log we were able to extract three important features: whether the file under consideration writes to another file, whether it accesses another directory, and the unique DLLs accessed by the sample during execution. Figure 5.3 shows the DLLs accessed by the same malware.

Figure 5.3: DLLs
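The unique-DLL feature can be recovered from a saved file system activity log with a simple scan; the following is a minimal sketch, assuming the log is a plain text file in which accessed paths appear as whitespace-separated tokens (the file name and log layout are assumptions, not Filemon's exact export format).

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Locale;
    import java.util.TreeSet;

    // Minimal sketch: collect the unique DLLs touched by a sample from a saved activity log.
    public class DllExtractor {
        public static void main(String[] args) throws IOException {
            TreeSet<String> dlls = new TreeSet<>();
            for (String line : Files.readAllLines(Paths.get("aircbot_filemon.txt"))) {
                for (String token : line.split("\\s+")) {
                    String t = token.toLowerCase(Locale.ROOT);
                    if (t.endsWith(".dll")) {
                        // keep only the file name, not the full path
                        dlls.add(t.substring(t.lastIndexOf('\\') + 1));
                    }
                }
            }
            System.out.println(dlls.size() + " unique DLLs: " + dlls);
        }
    }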

From the registry activity we extracted three features: registry keys added, registry keys deleted and registry values modified. Figures 5.4 and 5.5 show the registry keys added and the registry keys deleted, respectively.

Figure 5.4: Registry Keys Added


Figure 5.5: Registry Keys Deleted

The log contained all the registry keys modified by executables and their modified values. We removed the key values and recorded only the keys. Figure 5.6 shows the modified registry keys of an executable.

Figure 5.6: Modified Registry Keys

From the API call log we extracted the unique API calls made by each PE and decisions of whether the PE accesses the Internet and whether it makes any URL references. We combined the unique API calls made by each file and removed any duplicates. In total, we got 141 unique API calls. Figure 5.7 shows an API call sequence log. With this step we completed processing of the raw data for feature selection.

Figure 5.7: API Call Sequence Log
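Turning the per-sample lists of unique API calls into the 141 binary attributes of the DAF dataset amounts to building a presence/absence matrix; the following is a minimal sketch, assuming each sample's unique API calls are already available as a set of strings (names and structure are illustrative).

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    // Minimal sketch: build one binary API-call feature vector per sample.
    public class ApiFeatureMatrix {
        public static Map<String, int[]> buildMatrix(Map<String, Set<String>> samples) {
            // the union of all observed API calls defines the attribute list (141 in our data)
            TreeSet<String> allCalls = new TreeSet<>();
            samples.values().forEach(allCalls::addAll);
            List<String> attributes = new ArrayList<>(allCalls);

            Map<String, int[]> matrix = new LinkedHashMap<>();
            for (Map.Entry<String, Set<String>> e : samples.entrySet()) {
                int[] row = new int[attributes.size()];
                for (int i = 0; i < attributes.size(); i++) {
                    row[i] = e.getValue().contains(attributes.get(i)) ? 1 : 0;  // 1 = call was made
                }
                matrix.put(e.getKey(), row);
            }
            return matrix;
        }
    }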


5.2.1 Target Data

Our target data is the result of preprocessing the raw data obtained from reverse engineering. We prepared two datasets in this step: the first dataset with 15 features and 1103 instances, and the second dataset with 141 features and 1103 instances. We named the first dataset DRF (Dataset with Reversed Features) and the second dataset DAF (Dataset with API Call Features) to avoid long names each time we refer to these datasets. Table 5.1 shows the list of attributes in DRF and Figure 5.8 shows the list of attributes in DAF. All the attributes in DAF are of type binary except the decision label, which is Boolean.

Table 5.1: Attributes in DRF

S.NO  ATTRIBUTE NAME            TYPE
1     FILE NAME                 NOMINAL
2     FILE SIZE                 NOMINAL
3     MD5 HASH                  NOMINAL
4     PACKER                    NOMINAL
5     FILE ACCESS               BINARY
6     DIRECTORY ACCESS          BINARY
7     DLLs                      NOMINAL
8     API CALLS                 NOMINAL
9     INTERNET ACCESS           BINARY
10    URL REFERENCES            BINARY
11    REGISTRY KEYS ADDED       NOMINAL
12    REGISTRY KEYS DELETED     NOMINAL
13    REGISTRY VALUES MODIFIED  NOMINAL
14    UNIQUE STRINGS            BINARY
15    PROGRAMMING LANGUAGE      NOMINAL
16    DECISION LABEL            BOOLEAN

Figure 5.8: Attributes in DAF

5.2.2 Preprocessing

As the dataset was created manually and there were no missing values, we did not need to preprocess the dataset further.

5.2.3 Transformation

We transformed our dataset twice: once for attribute reduction and a second time for discretizing attribute values. In both cases only DRF was transformed.

5.2.3.1 Attribute Reduction

We performed attribute reduction before running experiments on our dataset. We removed from the first dataset some of the attributes that were not helpful for the overall goal of the task; the decision of which features to remove was based on our intuition about which attributes would not contribute to that goal. By the end of this step, we had 12 attributes with 1103 instances in DRF. The attributes are listed in Table 5.2.

Table 5.2: Attributes in DRF after transformation

S.NO. ATTRIBUTE NAME TYPE

1 PACKER BINARY

2 FILE ACCESS BINARY

3 DIRECTORY ACCESS BINARY

4 DLLs NOMINAL

5 API CALLS NOMINAL

6 INTERNET ACCESS BINARY

7 URL REFERENCES BINARY

8 REGISTRY KEYS ADDED NOMINAL

9 REGISTRY KEYS DELETED NOMINAL

10 REGISTRY VALUES MODIFIED NOMINAL

11 UNIQUE STRINGS BINARY

12 DECISION LABEL BOOLEAN


5.2.3.2 Discretization

The values of the attributes Registry Keys Added, Registry Keys Deleted, Registry Values Modified, API Calls and DLLs ranged from very small values, i.e. 0, to very large values. Performing experiments on a dataset with such a huge number of discrete values for the attributes would result in a large number of rules in the learning model. To solve this problem, we moved back to the data transformation step. Using the Weka discretization filter we discretized the values of the above mentioned attributes. The ranges we got for each attribute after transforming the dataset are shown in Table 5.3.

Table 5.3: Discretized Values

S.NO.  ATTRIBUTE NAME            DISCRETIZED VALUES
1      KEYS ADDED                (-INF-1], (1-INF)
2      REGISTRY VALUES MODIFIED  (-INF-12.5], (12.5-INF)
3      API CALLS                 (-INF-5.5], (5.5-22.5], (22.5-41.5], (41.5-INF)
4      DLLs                      (-INF-16.5], (16.5-INF)

We prepared another dataset from DRF by replacing the discrete values with the discretized values shown in Table 5.3. We call it DDF (Dataset with Discretized Features).
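Our discretization was done with a Weka filter; the same step could equally be scripted against the Weka API. The following is a minimal sketch, assuming the dataset is already in ARFF form (the file name and the choice of the class-aware supervised Discretize filter are illustrative assumptions, since the exact filter settings are not recorded here).

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    // Minimal sketch: discretize the numeric attributes of DRF to obtain a DDF-like dataset.
    public class DiscretizeDrf {
        public static void main(String[] args) throws Exception {
            Instances drf = DataSource.read("drf.arff");
            drf.setClassIndex(drf.numAttributes() - 1);    // decision label is the last attribute

            Discretize filter = new Discretize();           // class-aware (entropy-based) binning
            filter.setInputFormat(drf);
            Instances ddf = Filter.useFilter(drf, filter);  // same instances, binned values

            System.out.println(ddf.attribute(0));           // inspect the generated intervals
        }
    }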

5.2.4 Data Mining

In this step, we applied different data mining techniques to our datasets to achieve the goal of classifying malware from benign software.

All experiments on the datasets were conducted by dividing them into independent training and testing sets. The training set had 80% of the instances of the whole set and the testing set had 20% of the instances. We used the training set to generate the learning model and then applied the test set, with decision labels, to the generated model. The algorithm internally assigns a label to each instance of the test set and then compares it with the supplied label. This information is used to compute the accuracy of the model.
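As a rough illustration of this setup, the following sketch shows one way such an 80%/20% split can be formed with the Weka API (the file name and random seed are arbitrary; the split used in our experiments may have been prepared differently).

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Minimal sketch: split a dataset into an 80% training set and a 20% testing set.
    public class TrainTestSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("ddf.arff");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));                  // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 0.8);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            System.out.println("train: " + train.numInstances() + ", test: " + test.numInstances());
        }
    }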

5.2.4.1 Weka

Weka (Waikato Environment for Knowledge Analysis) [37] is a tool that provides implementations of a set of learning algorithms for data mining tasks. The data mining problems Weka handles include classification, association rule mining, clustering, regression and attribute selection. It also provides tools for transformation of datasets. Weka takes relational tables as input, which can be in ARFF or CSV (Comma Separated Value) format; it also supports some other formats. It provides the Explorer, a graphical user interface for easy access to all of its functions. Figure 5.9 shows the Weka Explorer.

Figure 5.9: WEKA Explorer
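For reference, an ARFF file is an ordinary text file that first declares the attributes and then lists the instances. A hypothetical excerpt for a DDF-like dataset might look as follows (the attribute names follow Table 5.2, the interval labels follow Table 5.3, and the data row is invented purely for illustration).

    @relation ddf
    @attribute PACKER {0,1}
    @attribute 'DIRECTORY ACCESS' {0,1}
    @attribute DLLs {'(-inf-16.5]','(16.5-inf)'}
    @attribute 'API CALLS' {'(-inf-5.5]','(5.5-22.5]','(22.5-41.5]','(41.5-inf)'}
    @attribute 'DECISION LABEL' {YES,NO}
    @data
    1,1,'(16.5-inf)','(22.5-41.5]',YES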


5.2.4.2 Classification Techniques

In WEKA, we used two classification techniques for our datasets: decision trees and Bayesian classifiers.

 Decision Trees

Classification using decision trees is based on the "divide-and-conquer" approach to problem solving [38]. In a decision tree, non-leaf nodes test the values of attributes and leaf nodes assign the classification label that applies to all instances reaching that leaf. To classify a new instance, the values of its attributes are tested in successive nodes and, depending on those values, the instance is routed down the tree until it reaches a leaf node. At the leaf node the instance is assigned a classification label based on that particular leaf's class.

There are several algorithms for decision tree induction; some of them include ID3, C4.5 and CART [39]. For our experiments we used the J48 algorithm, which is the Weka implementation of the C4.5 decision tree learner [38].

 Bayesian Classifiers

Bayesian classifiers are based on Bayes' theorem, which helps in solving predictive problems.

Using Bayes' theorem, we can determine the conditional probability of the value of Y given the value of X as P(Y|X) = P(X|Y) P(Y) / P(X). This can be carried over to the classification problem by replacing X with the set of attributes and Y with the class label.

There are numerous classifiers based on Bayes' theorem. We used the Naïve Bayes classifier for our experiments; it works by assuming that the attributes are conditionally independent of each other given the class. A brief sketch showing how both classifiers can be trained and evaluated with the Weka API follows this list.
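The following is a minimal sketch of how both classifiers can be trained on the 80% training split and evaluated on the held-out 20% test split using the Weka API (the file names are illustrative assumptions).

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Minimal sketch: train J48 and Naive Bayes on the training set, evaluate on the test set.
    public class ClassifySketch {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("ddf_train.arff");
            Instances test  = DataSource.read("ddf_test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            for (Classifier c : new Classifier[] { new J48(), new NaiveBayes() }) {
                c.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(c, test);
                System.out.println(c.getClass().getSimpleName()
                        + "  accuracy: " + eval.pctCorrect()
                        + "  ROC area: " + eval.areaUnderROC(0));
            }
        }
    }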

5.2.4.3 Rough Sets

The concept of a rough set is based on the assumption that every object in the universe is associated with some kind of information. Objects with the same information are known as indiscernible. A set of all indiscernible objects is known as an elementary set, and such elementary sets form the atoms of knowledge about the universe. Any union of elementary sets is called a precise or crisp set; otherwise the set is a rough or imprecise set [40].
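To make the notion of indiscernibility concrete, the elementary sets of a decision table can be computed by grouping together objects that share identical attribute values; the following is a minimal sketch on an invented toy table, not on our dataset.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal sketch: partition objects into elementary sets (indiscernibility classes).
    public class ElementarySets {
        public static void main(String[] args) {
            // toy decision table: five objects described by two condition attributes
            int[][] objects = { {1, 0}, {0, 1}, {1, 0}, {0, 0}, {0, 1} };

            Map<String, List<Integer>> elementary = new HashMap<>();
            for (int i = 0; i < objects.length; i++) {
                String key = Arrays.toString(objects[i]);          // identical rows share a key
                elementary.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
            }
            // each value is one elementary set; any union of these sets is a crisp set
            elementary.forEach((attrs, ids) -> System.out.println(attrs + " -> objects " + ids));
        }
    }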

The attributes in our dataset are multi-dimensional and it is difficult to force a relationship between them. So we applied rough set theory [41] to our dataset with the assumption that it would generate an optimal number of rules for classification and that the generated model would be more effective compared to the above two models.

To generate classification rules based on rough sets, we used BLEM2 [19]. The web based BLEM2 tool set includes pre-processing tools, machine learning tools and rule based classification tools. Figure 5.10 shows the web based GUI of BLEM2 for conducting experiments on rough sets. BLEM2 takes an attribute file and a data file as input for a given rough set and generates rules.

Figure 5.10: Web based BLEM2 GUI

5.2.5 Interpretation/Evaluation

The results we got from the experiments and the evaluation of those results are presented in Chapter VI, Results and Discussions.

CHAPTER VI

RESULTS AND DISCUSSIONS

In this chapter, we present the experiments conducted on each dataset and thereafter we analyze the results from the experiments. We divided each dataset into independent training and testing sets consisting of 80% and 20% of the data respectively.

We generated the decision trees using training sets and tested the generated models with the testing sets.

6.1. Experiment 1: Classification of DRF

We conducted our first experiment on the dataset with reversed features (DRF). This dataset had 12 attributes, including the decision label, as listed in Table 5.2. We converted the training set into ARFF (Attribute-Relation File Format) and supplied it as input to the J48 decision tree algorithm in WEKA. The algorithm generated a decision tree consisting of decision rules that classify an instance as either malware or benign software.

Figure 6.1 shows the decision tree obtained from J48 decision tree algorithm in WEKA.


Figure 6.1: J48 Decision Tree for DRF


Table 6.1: Decision rules for the decision label “YES”

Decision Rules Rule (API = [0-5]) (UNIQUE STRINGS = 0)YES No

(API = (5-14]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = 0) Yes (DLLs = [0-29]) (REGISTRY KEYS MODIFIED = (12-INF])YES

(API = (5-14]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = 0) Yes (DLLs = (29-INF]) YES

(API = (14-20]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = Yes 0) YES

(API = (16-19]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = Yes 1) (DLLs = [133-INF]) YES

(API = (19-20]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = Yes 1) YES

(API = (20-23]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (REGISTRY Yes KEYS DELETED = 0) (PACKER=0) YES

(API = (20-23]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (REGISTRY Yes KEYS DELETED = 0) (PACKER=1) (REGISTRY KEYS ADDED = (5-INF]) YES

(API = (20-23]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (REGISTRY Yes KEYS DELETED = (0-INF]) YES

(API = (5-23]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 1) YES Yes

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 0) Yes (DLLs = [0-13])YES

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 0) Yes (DLLs = (20-INF]) (REGISTRY KEYS MODIFIED = [0-11]) (REGISTRY KEYS ADDED = [0-17])YES (API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 0) Yes (DLLs = (13-INF]) (REGISTRY KEYS MODIFIED = (11-28]) YES

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 1) Yes YES

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = Yes 0) YES

(API = [0-5]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = 1) Yes (INTERNET ACCESS= 0) (REGISTRY KEYS MODIFIED = [0-9])YES


(API = [0-5]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = 1) Yes (INTERNET ACCESS= 0) (REGISTRY KEYS MODIFIED = (15-INF])YES

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = Yes 1) (INTERNET ACCESS= 1) (REGISTRY KEYS ADDED = [0-3]) YES

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = Yes 1) (INTERNET ACCESS= 1) (REGISTRY KEYS ADDED = (3-INF]) (URL = 1) YES

(API = (23-INF]) (URL=0) (REGISTRY KEYS DELETED = (4-INF])YES Yes

(API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED = (6-INF]) (UNIQUE Yes STRINGS = 0)YES

(API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED = (6-INF]) (UNIQUE Yes STRINGS = 1) (DIRECTORY ACCESS =0)YES (API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED = (6-13]) (UNIQUE Yes STRINGS = 1) (DIRECTORY ACCESS = 1) (REGISTRY KEYS DELETED=0) (DLLs=[51-61])YES

(API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED= (13-INF]) (UNIQUE Yes STRINGS = 1) (DIRECTORY ACCESS = 1) (REGISTRY KEYS DELETED=0) (DLLs=[0-61])YES

(API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED= (6-INF]) (UNIQUE Yes STRINGS = 1) (DIRECTORY ACCESS = 1) (REGISTRY KEYS DELETED=(0- INF))YES

Table 6.2: Decision rules for the decision label “NO”

Decision Rules Rule (API = (5-14]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = 0) Yes (DLLs = [0-29]) (REGISTRY KEYS MODIFIED = [0-12])NO

(API = (16-19]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (PACKER = Yes 1) (DLLs = [0-133]) NO (API = (20-23]) (UNIQUE STRINGS = 0) (INTERNET ACCESS = 0) (REGISTRY Yes KEYS DELETED = 0) (PACKER=1) (REGISTRY KEYS ADDED = [0-5]) NO

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 0) Yes (DLLs = (13-20]) (REGISTRY KEYS MODIFIED = [0-11])NO

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 0) Yes (DLLs = (20-INF]) (REGISTRY KEYS MODIFIED = [0-11]) (REGISTRY KEYS ADDED = (17-INF])NO

(API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 0) (INTERNET ACCESS = 0) Yes (DLLs = (13-INF]) (REGISTRY KEYS MODIFIED = (28-INF]) NO


(API = [0-5]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = 1) Yes (INTERNET ACCESS= 0) (REGISTRY KEYS MODIFIED = (9-15])NO

(API = (5-23]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = Yes 1) (INTERNET ACCESS= 0) NO (API = [0-23]) (UNIQUE STRINGS = 1) (PACKER = 1) (DIRECTORY ACCESS = Yes 1) (INTERNET ACCESS= 1) (REGISTRY KEYS ADDED = (3-INF]) (URL = 0) NO

(API = (23-INF]) (URL=0) (REGISTRY KEYS DELETED=0)NO Yes

(API = (23-INF]) (URL=0) (REGISTRY KEYS DELETED = (0-4])NO Yes

(API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED = [0-6])NO Yes (API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED = (6-13]) (UNIQUE Yes STRINGS = 1) (DIRECTORY ACCESS = 1) (REGISTRY KEYS DELETED=0) (DLLs=[0-51])NO

(API = (23-INF]) (URL=1) (REGISTRY KEYS MODIFIED = (6-INF]) (UNIQUE Yes STRINGS = 1) (DIRECTORY ACCESS = 1) (REGISTRY KEYS DELETED=0) (DLLs=(61-INF])NO

Figure 6.1 shows that the number of API Calls made by a PE was used as the root node of the decision tree. API Calls, Unique Strings, URL References, Internet Access, Packer, Registry Keys Deleted, Directory Access and Registry Keys Modified were the most used attributes in the classification model, although it used all the other attributes in the dataset as well. In Tables 6.1 and 6.2, the Rule column shows whether the decision rule given in the Decision Rules column is pertinent or not, and the results show that almost all the rules produced by the J48 algorithm were pertinent.

6.2 Experiment 2: Classification of DDF

The values of the attributes in DRF were discrete, and the large number of distinct values for each attribute resulted in a larger number of decision rules and thereby a larger tree. With the aim of reducing the number of decision rules and the size of the tree, we discretized the values of the attributes. The values of the attributes after discretization are shown in Table 5.3. We named the new dataset with discretized attribute values the dataset with discretized features (DDF). We derived a decision tree for this dataset using WEKA and the J48 decision tree algorithm. Figure 6.2 shows the decision tree obtained from the J48 algorithm.

Figure 6.2: J48 Decision Tree for DDF

Table 6.3: Decision rules for DDF for the decision label “YES”

Decision Rules Relevancy (API (5.5-23.5]) (UNIQUE STRINGS=0)YES No

(API (5.5-23.5]) (UNIQUE STRINGS=1) (PACKER=0)YES No

(API (5.5-23.5]) (UNIQUE STRINGS=1) (PACKER=1) (URL=0) (INTERNET Yes ACCESS=1) (REGISTRY KEYS ADDED=(-INF-2.5])YES


(API (5.5-23.5]) (UNIQUE STRINGS=1) (PACKER=1) (URL=1) YES Yes

(API (23.5-41.5]) (URL=0) (DELETED=(4-INF)) (REGISTRY KEYS Yes MODIFIED=(12.5-INF)) YES

(API (23.5-41.5]) (URL=1) (REGISTRY KEYS MODIFIED=(12.5-INF)) YES Yes

(API (23.5-41.5]) (URL=1) (REGISTRY KEYS MODIFIED=(-INF-12.5]) Yes (REGISTRY KEYS ADDED=(1-2.5]) YES

(API (23.5-41.5]) (URL=1) (REGISTRY KEYS MODIFIED=(-INF-12.5]) Yes (REGISTRY KEYS ADDED=(2.5-INF)) (UNIQUE STRINGS=0) YES (API (41.5-INF)) (UNIQUE STRINGS=0) YES No

(API (41.5-INF)) (UNIQUE STRINGS=1) (DA=0) YES No

(API (41.5-INF)) (UNIQUE STRINGS=1) (DA=1) (REGISTRY KEYS Yes ADDED=(-INF-1]) (REGISTRY KEYS MODIFIED=(12.5-INF)) YES

(API (-INF-5.5)) YES No

Table 6.4: Decision rules for DDF for the decision label “NO”

Decision Rules Relevancy (API (5.5-23.5]) (UNIQUE STRINGS=1) (PACKER=1) (URL=0) (INTERNET Yes ACCESS=0)NO

(API (5.5-23.5]) (UNIQUE STRINGS=1) (PACKER=1) (URL=0) (INTERNET Yes ACCESS=1) (REGISTRY KEYS ADDED=(2.5-INF))NO

(API (23.5-41.5]) (URL=0) (REGISTRY KEYS DELETED=0) NO Yes

(API (23.5-41.5]) (URL=0) (REGISTRY KEYS DELETED=(0-4]) (REGISTRY Yes KEYS MODIFIED=(12.5-INF)) NO

(API (23.5-41.5]) (URL=0) (REGISTRY KEYS DELETED=(0-INF)) Yes (REGISTRY KEYS MODIFIED=(-INF-12.5]) NO

(API (23.5-41.5]) (URL=1) (REGISTRY KEYS MODIFIED=(-INF-12.5]) Yes (REGISTRY KEYS ADDED=(-INF-1]) NO

(API (23.5-41.5]) (URL=1) (REGISTRY KEYS MODIFIED=(-INF-12.5]) Yes (REGISTRY KEYS ADDED=(2.5-INF)) (UNIQUE STRINGS=1) NO

(API (41.5-INF)) (UNIQUE STRINGS=1) (DIRECTORY ACCESS=1) Yes (REGISTRY KEYS ADDED=(-INF-1]) (REGISTRY KEYS MODIFIED=(-INF- 12.5]) NO


(API (41.5-INF)) (UNIQUE STRINGS=1) (DIRECTORY ACCESS=1) Yes (REGISTRY KEYS ADDED=(1-INF)) NO

The decision tree in Figure 6.2 shows that 9 attributes, API Calls, Unique Strings, URL References, Packer, Registry Keys Deleted, Registry Keys Modified, Directory Access, Registry Keys Added and Internet Access, were used in the classification model. The attributes DLLs and File Access were not used in the decision rules for classification. Of the 9 attributes used, API Calls, Unique Strings, URL References, Packer, Directory Access, Registry Keys Deleted and Registry Keys Modified contributed the most to the derived decision rules. It can be seen that the number of rules and the size of the tree obtained from DDF were drastically reduced as compared to those obtained from DRF.

6.3 Experiment 3: Classification of DDF using BLEM2

We used BLEM2, a machine learning tool based on rough sets, to generate decision rules for classification, and we used the DDF dataset in this experiment. We conducted this experiment in an effort to generate an optimal number of decision rules. Table 6.5 shows the decision rules from BLEM2 for the decision label "YES" and Table 6.6 shows the decision rules for the decision label "NO".

Table 6.5: BLEM2 rules for DDF for the decision label “YES”

FI LE A DIRE VALUE C CTO INTER UNIQ Ce Co PA KEYS KEYS S API C RY NET URL UE Su rta Str ver Rel CK ADD DELE MODIF CALL DLL ES ACC ACCES REFER SRTIN LAB pp int en ag eva ER ED TED IED S s S ESS S ENCE GS EL ort y gth e ncy (- (16. inf- (-inf- (5.5- 5- 0.8 0.0 0.0 0 1] ? 12.5] 23.5] inf) ? 1 0 ? 0 YES 43 11 39 74 Yes (2.5- (-inf- (5.5- (16. 0.9 0.0 0.0 0 inf) 0 12.5] 23.5] 5- ? ? ? ? 0 YES 30 09 27 52 Yes


inf) 0.0 0.0 ? ? ? ? ? ? 0 ? ? ? ? YES 26 1 24 45 No (16. (2.5- (12.5- (5.5- 5- 0.5 0.0 0.0 1 inf) 0 inf) 23.5] inf) ? 1 0 ? ? YES 21 53 19 36 Yes (- (16. inf- (12.5- (5.5- 5- 0.0 0.0 0 1] ? inf) 23.5] inf) ? 1 ? ? 0 YES 20 1 18 34 Yes (- inf- (-inf- (5.5- 0.7 0.0 0.0 0 1] ? 12.5] 23.5] ? ? 1 0 ? 1 YES 20 69 18 34 Yes (16. (2.5- (-inf- (5.5- 5- 0.6 0.0 0.0 1 inf) 0 12.5] 23.5] inf) ? ? 0 ? 0 YES 17 54 15 29 Yes (- inf- (5.5- 0.0 0.0 0 1] ? ? 23.5] ? ? 0 ? ? ? YES 16 1 15 28 No

(- (16. inf- (-inf- (5.5- 5- 0.4 0.0 0.0 1 1] 0 12.5] 23.5] inf) 1 1 0 ? 0 YES 15 55 14 26 Yes (- (- inf- inf- (-inf- (-inf- 16. 0.0 0.0 ? 1] ? 12.5] 5.5] 5] ? 1 ? ? 0 YES 14 1 13 24 Yes (2.5- (-inf- (5.5- 0.5 0.0 0.0 0 inf) ? 12.5] 23.5] ? ? ? 0 ? 1 YES 13 42 12 22 Yes (16. (2.5- (-inf- (5.5- 5- 0.2 0.0 0.0 1 inf) 0 12.5] 23.5] inf) ? ? 0 ? 1 YES 13 36 12 22 Yes (2.5- (12.5- (5.5- 0.0 0.0 0 inf) ? inf) 23.5] ? ? ? ? ? 0 YES 12 1 11 21 Yes (2.5- (12.5- (5.5- 0.9 0.0 0.0 0 inf) 0 inf) 23.5] ? ? ? ? ? 1 YES 11 17 1 19 Yes (- inf- (12.5- (41.5 0.0 0.0 0 1] ? inf) -inf) ? ? ? ? ? ? YES 10 1 09 17 Yes (- (16. inf- (12.5- (5.5- 5- 0.7 0.0 0.0 0 1] 0 inf) 23.5] inf) ? 1 ? ? 1 YES 10 69 09 17 Yes (- (16. inf- (-inf- (5.5- 5- 0.0 0.0 1 1] 0 12.5] 23.5] inf) ? ? 0 ? 1 YES 10 0.2 09 17 Yes (- (16. inf- (12.5- (5.5- 5- 0.0 0.0 1 1] 0 inf) 23.5] inf) ? ? 0 ? 0 YES 9 0.9 08 16 Yes (- inf- (12.5- (-inf- 0.0 0.0 1 1] ? inf) 5.5] ? ? ? ? ? 1 YES 7 1 06 12 Yes (2.5- (-inf- (5.5- 0.0 0.0 ? inf) ? 12.5] 23.5] ? ? ? 1 0 0 YES 7 1 06 12 Yes (- inf- (-inf- (5.5- 0.0 0.0 ? 1] ? 12.5] 23.5] ? ? ? 1 0 ? YES 7 1 06 12 Yes


Table 6.6: BLEM2 rules for DDF for the decision label “NO”

VALUE DIRECT INTER URL UNIQ L PA KEYS KEYS S API FILE ORY NET REFE UE A Su Cer Str Co Rel CK ADD DELE MODIF CAL DL ACC ACCES ACCES RENC SRTIN B pp tai en ver eva ER ED TED IED LS Ls ESS S S E GS EL ort nty gth age ncy (23. 5- (2.5- (-inf- 41. N 0.9 0.0 0.1 1 inf) 0 12.5] 5] ? ? 1 0 ? 1 O 94 9 85 8 Yes (23. 5- (1- (-inf- 41. N 0.9 0.0 1 2.5] 0 12.5] 5] ? ? ? 0 ? 1 O 52 81 47 0.1 Yes (5.5 - (16 (2.5- (-inf- 23. .5- N 0.7 0.0 0.0 1 inf) 0 12.5] 5] inf) ? ? 0 ? 1 O 42 64 38 81 Yes (5.5 (- - (16 inf- (-inf- 23. .5- N 0.0 0.0 1 1] 0 12.5] 5] inf) ? ? 0 ? 1 O 40 0.8 36 77 Yes (5.5 (- - (16 inf- (-inf- 23. .5- N 0.3 0.0 0.0 ? 1] 0 12.5] 5] inf) 1 1 0 ? 0 O 28 26 25 54 Yes (23. 5- (2.5- (12.5- 41. N 0.0 0.0 1 inf) 0 inf) 5] ? ? ? 0 ? 1 O 27 0.9 25 52 Yes (23. 5- (2.5- 41. N 0.0 0.0 1 inf) 0 ? 5] ? ? 1 0 ? 0 O 27 0.9 25 52 Yes (5.5 - (2.5- (12.5- 23. N 0.6 0.0 0.0 1 inf) 0 inf) 5] ? ? 1 0 ? 1 O 14 67 13 27 Yes (5.5 - (2.5- (-inf- 23. N 0.4 0.0 0.0 0 inf) ? 12.5] 5] ? ? ? 0 ? 1 O 11 58 1 21 Yes (23. (- 5- inf- (-inf- 41. N 0.0 0.0 1 1] ? 12.5] 5] ? ? ? 0 ? ? O 9 1 08 17 Yes (5.5 - (16 (2.5- (-inf- 23. .5- N 0.3 0.0 0.0 1 inf) 0 12.5] 5] inf) ? ? 0 ? 0 O 9 46 08 17 Yes (23. 5- (1- (-inf- 41. N 0.0 0.0 1 2.5] 0 12.5] 5] ? ? ? 0 ? 0 O 7 1 06 13 Yes (23. 5- (2.5- (-inf- 41. N 0.0 0.0 0 inf) ? 12.5] 5] ? ? ? 0 ? 1 O 7 1 06 13 Yes

Tables 6.5 and 6.6 show the rules from BLEM2 with a minimum support count of 7. The rules show that Packer, Keys Added, Keys Deleted, Values Modified, Internet Access and Unique Strings were the most used attributes in arriving at a decision in the classification process. BLEM2 used a better set of attributes than the J48 model from the previous experiment, and the rules generated by BLEM2 have higher certainty.

6.4 Experiment 4: Classification of DDF from BLEM2 using J48

We prepared a dataset from the rules obtained from BLEM2. We removed the support, certainty, strength and coverage values from the rules and replaced the missing values with -1. We supplied the resulting dataset as input to the J48 decision tree algorithm. We conducted this experiment to reduce the size of the decision tree and generate an optimal number of decision rules for classification. Figure 6.3 shows the decision tree obtained from the J48 algorithm. Tables 6.7 and 6.8 show the decision rules generated by the J48 algorithm for the decision labels "YES" and "NO" respectively.

Figure 6.3: J48 Decision Tree for DDF from BLEM2

Table 6.7: Decision rules for DDF from BLEM2 for the decision label “YES”

Decision Rule Relevancy (API=-1)YES No

(API= (-INF-5.5])YES No

(API= (5.5-23.5]) (UNIQUE STRINGS=0)YES No

(API= (41.5-INF)) (INTERNET ACCESS=-1) (PACKER=-1) YES No (API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=-1) (REGISTRY KEYS Yes DELETED=[0-2])YES

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=(12.5-INF)) (INTERNET Yes ACCESS=0,-1) (UNIQUE STRINGS=0) (PACKER=(-1-INF)) YES

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=(12.5-INF)) (INTERNET Yes ACCESS=1) YES

Table 6.8: Decision rules for DDF from BLEM2 for the decision label “NO”

Decision Rule Relevancy (API= (5.5-23.5]) (UNIQUE STRINGS=1)NO No

(API= (41.5-INF)) (INTERNET ACCESS=-1) (PACKER=(-1-INF)) NO No

(API= (41.5-INF)) (INTERNET ACCESS=(-1-INF)) NO No

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=-1) (REGISTRY KEYS No DELETED=(2-INF))NO

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=(12.5-INF)) (INTERNET Yes ACCESS=0,-1) (UNIQUE STRINGS=0) (PACKER=-1) NO

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=(12.5-INF)) (INTERNET Yes ACCESS=0,-1) (UNIQUE STRINGS=1) NO

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=(12.5-INF)) (INTERNET Yes ACCESS=0) NO

(API= (23.5-41.5]) (REGISTRY KEYS MODIFIED=(-INF-12.5]) NO Yes

Figure 6.3 shows that the tree size is small compared to those in experiments 1 and 2. Out of 11 attributes, only 6 attributes, API Calls, Unique Strings, Internet Access, Values Modified, Packer and Keys Deleted, were used in the classification model. Tables 6.7 and 6.8 show that only a small number of decision rules were derived from the J48 algorithm. These results show that the tree was oversimplified.

6.5 Experiment 5: Classification of DAF

We conducted the same experiments for the dataset with API Call attributes (DAF) as for DRF. Figure 6.4 shows the tree obtained from the J48 algorithm. Table 6.9 shows the decision rules generated by the J48 algorithm for the decision label "YES" and Table 6.10 shows the rules for the decision label "NO".

Figure 6.4: J48 Decision Tree for DAF


Table 6.9: Decision rules for DAF for the decision label “YES”

Decision Rule Relevancy (IsDebuggerPresent=0)YES No

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=0)YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=1) (WSACleanup=0) (ExitProcess=0) (WriteFile=0) (CreateMutexA=1)YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=1) (WSACleanup=0) (ExitProcess=1) (OpenSCManagerW=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=1) (WSACleanup=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=1) (RegOpenKeyW=0) (CreateRemoteThread=0) (FindWindowW=0) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=1) (RegOpenKeyW=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=0) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=1) (GetVolumeInformationA=0) (GetTempFileNameA=1) (RegSetValueExA=1) YES


(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=1) (GetVolumeInformationA=1) (CreateProcessA=1) (_lopen=0)YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=1) (ExitProcess=0)YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=1) (ExitProcess=1) (send=0) (InternetOpenA=0)YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=1) (ExitProcess=1) (send=1) YES

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=1) (PostMessageA=0) (GetFileAttributesExW=0) (FindWindowExW=0) (RegDeleteValueW=0) YES

(IsDebuggerPresent=1) (WriteProcessMemory=1) YES No

Table 6.10: Decision rules for DAF for the decision label “NO”

Decision Rule Relevancy (IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=1) (WSACleanup=0) (ExitProcess=0) (WriteFile=0) (CreateMutexA=0)NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=1) (WSACleanup=0) (ExitProcess=0) (WriteFile=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=0) (SetFileAttributesA=0) (RegOpenKeyExA=1) (WSACleanup=0) (ExitProcess=1) (OpenSCManagerW=0) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=1) (RegOpenKeyW=0) (CreateRemoteThread=0) (FindWindowW=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes (GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=0) (CreateFileA=1) (RegOpenKeyW=0) (CreateRemoteThread=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=0) Yes


(GetVolumeInformationW=0) (bind=0) (GetTempFileNameA=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=1) (GetVolumeInformationA=0) (GetTempFileNameA=0) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=1) (GetVolumeInformationA=0) (GetTempFileNameA=1) (RegSetValueExA=0) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=1) (GetVolumeInformationA=1) (CreateProcessA=0) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=0) (ReadFile=1) (GetVolumeInformationA=1) (CreateProcessA=1) (_lopen=1)NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=0) (CreateRemoteThread=1)NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=0) (MoveFileA=1)NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=0) (connect=1) (ExitProcess=1) (send=0) (InternetOpenA=1)NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=1) (PostMessageA=0) (GetFileAttributesExW=0) (FindWindowExW=0) (RegDeleteValueW=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=1) (PostMessageA=0) (GetFileAttributesExW=0) (FindWindowExW=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=1) (PostMessageA=0) (GetFileAttributesExW=1) NO

(IsDebuggerPresent=1) (WriteProcessMemory=0) (RegSetValueExA=1) Yes (CreateProcessW=1) (PostMessageA=1) NO

Figure 6.4 shows that out of 141 attributes, 31 were used in the classification model. The API call IsDebuggerPresent was used as the root node of the tree. In total, 69 decision rules were generated by the J48 algorithm. As can be seen from the decision rules in Tables 6.9 and 6.10, the most used attributes in the classification model were IsDebuggerPresent, WriteProcessMemory, RegSetValueExA, GetVolumeInformationW, bind, CreateProcessW and connect. However, the attributes that contributed to distinguishing malware from benign software were connect, ReadFile, CreateRemoteThread, InternetOpenA and RegDeleteValueW. The rules show that most of them are pertinent to the classification task.

6.6 Experiment 6: Classification of DAF using BLEM2

We used the BLEM2 tool to generate an optimal and more efficient set of rules for the DAF dataset. As the attribute values were binary, there was no need to discretize them. The process followed in this experiment was the same as the one followed in experiment 3. Out of 141 attributes, 101 were used in the classification model, but the values of most of those attributes were not used. With a minimum support count of 7, only 18 decision rules were generated for the decision labels YES and NO.

6.7 Experiment 7: Classification of DAF from BLEM2 rules using J48

We conducted another experiment on the DAF dataset obtained from the BLEM2 rules with the aim of reducing the size of the decision tree. To prepare the dataset for this experiment, we removed the support, certainty, strength and coverage columns from the BLEM2 rules and replaced missing values with -1. Although the size of the tree was reduced compared to the tree in experiment 5, the rules were not effective as there were lots of missing values in the dataset. Figure 6.5 shows the tree obtained from the J48 algorithm. Tables 6.11 and 6.12 show the decision rules obtained from J48 for the decision labels "YES" and "NO" respectively.

Figure 6.5: J48 Decision Tree for DAF from BLEM2

Table 6.11: Decision rules for DAF from BLEM2 for the decision label “YES”

Decision Rule Relevancy (PostMessageA=-1) (RegOpenKeyExA=-1) (_lcreat=-1) (OpenProcess=-1) No (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=-1) (CreateFileA=-1)


(GetFileAttributesExA=-1) (FindWindowW=0) (MoveFileExW=-1)YES

(PostMessageA=-1) (RegOpenKeyExA=-1) (_lcreat=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=-1) (CreateFileA=0, 1) (ReadFile=0,1) YES (PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=0,1) (DeleteFileA=0,1) YES

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=1) YES

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) No (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=0,1) (ExitProcess=0,1) YES

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) No (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=1) (RegCreateKeyExA=0,1) YES

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=1) (CopyFileA=1) (RegCreateKeyExW=0)YES

(PostMessageA=-1) (RegOpenKeyExA=0,1) (ReadFile=0) YES No

Table 6.12: Decision rules for DAF from BLEM2 for the decision label “NO”

Decision Rule Relevancy (PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=-1) (CreateFileA=-1) (GetFileAttributesExA=-1) (FindWindowW=0) (MoveFileExW=0,1)NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=-1) (CreateFileA=-1) (GetFileAttributesExA=-1) (FindWindowW=1) NO


(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) No (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=-1) (CreateFileA=-1) (GetFileAttributesExA=0, 1) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=-1) (CreateFileA=0, 1) (ReadFile=-1) NO (PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=-1) (CreateRemoteThread=0,1) (DeleteFileA=-1) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=-1) (CopyFileA=0) (PostMessageW=0,1) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=-1) (GetAsynckeyState=0) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=0) (RegSetValueExA=0,1) (ExitProcess=-1) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) No (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=0) (IsDebuggerPresent=1) (RegCreateKeyExA=-1) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=1) (CopyFileA=0)NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=-1) (AdjustTokenPrivileges=1) (CopyFileA=1) (RegCreateKeyExW=1)NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) Yes (GetVolumeInformationW=-1) (FindWindowExA=0,1)NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=-1) No (GetVolumeInformationW= 0, 1) NO

(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=-1) (OpenProcess=0,1) No NO


(PostMessageA=-1) (RegOpenKeyExA=-1) (_create=0,1) NO No

(PostMessageA=-1) (RegOpenKeyExA=0,1) (ReadFile=1) NO No

(PostMessageA=0,1) NO No

Tables 6.11 and 6.12 show that most of the rules in the classification model are not relevant to the task. AdjustTokenPrivileges, IsDebuggerPresent, FindWindowW, CopyFileA, CreateRemoteThread, ReadFile, DeleteFileA, RegSetValueExA, RegCreateKeyExA, RegCreateKeyExW, GetAsynckeyState and GetVolumeInformationW were the most used attributes for classifying malware from benign software.

6.8 Accuracies

We tested each training model with a corresponding test set consisting of 20% of the data. Table 6.13 shows the results for the DRF and DDF datasets. Table 6.14 shows the results from J48 for the DAF dataset. The three datasets were also tested using the Naïve Bayes classifier and the results are shown in Table 6.15. In all the tables, TP (True Positives) denotes the number of malware correctly classified as malware, TN (True Negatives) denotes the number of benign software correctly classified as benign, FP (False Positives) denotes the number of benign software incorrectly classified as malware and FN (False Negatives) denotes the number of malware incorrectly classified as benign software. The Receiver Operating Characteristic (ROC) area is the area under the ROC curve. It represents the accuracy of a classifier and is calculated from the true positive and false positive rates. A value of 1 denotes a perfect test and a value of 0.5 denotes a poor test.


Table 6.13: Testing set results against the training models from experiments 1, 2 and 4

Experiment  Dataset  No. of Leaves  Tree Size  TP  TN  FP  FN  ROC Area  Overall Accuracy
1           DRF      40             79         88  89  21  23  0.843     80.090%
2           DDF      23             40         94  86  24  17  0.815     81.448%
3           DDF      15             25         73  98  12  38  0.802     77.375%

Table 6.14: Testing set results against the training models from experiments 5 and 7

Experiment  Dataset  No. of Leaves  Tree Size  TP  TN   FP  FN  ROC Area  Overall Accuracy
4           DAF      35             69         89  108  6   18  0.917     89.140%
5           DAF      25             49         33  64   17  39  0.633     63.398%

Table 6.15: Experimental results from the Naïve Bayes Classifier

Dataset  TP  TN   FP  FN  ROC Area  Overall Accuracy
DRF      95  85   25  16  0.889     81.448%
DDF      96  92   18  15  0.912     85.067%
DAF      75  108  6   32  0.921     82.805%

As shown in Table 6.13, with J48 decision trees the size of the tree and the number of leaves were reduced by discretizing the attributes, resulting in a smaller number of decision rules for classification. The tree size and the number of rules were further reduced by using BLEM2. Though the performance of the classification model increased using DDF compared to DRF, the accuracy of the classifier, based on the ROC area, decreased. The performance of the classifier and the overall accuracy were very low for the dataset derived from BLEM2.


In all the experiments using J48 decision trees in Table 6.13, the ROC area is above 0.8, which indicates good classification quality.

From Table 6.14, it can be seen that, for the DAF dataset, the results from experiment 4, which used just J48 decision trees, outperformed those from experiment 5, which used the dataset derived from BLEM2.

The results from all the experiments indicate that the decision trees were oversimplified when BLEM2 was used for the DDF and DAF datasets. For the dataset with reversed features, better results were obtained by discretizing the attribute values: the overall accuracy increased, the size of the tree was reduced and an optimal number of decision rules was generated. For the DAF dataset in experiments 4 and 5, using only the J48 algorithm produced better results than using BLEM2 followed by J48. The classifier accuracy was also high in experiment 4, with an ROC area of 0.917.

As can be seen from Table 6.15, the ROC area for all the experiments using the Naïve Bayes classifier is above 0.88, which indicates good classifier accuracy. Based on the ROC area and the overall accuracy, the Naïve Bayes classifier produced good results for the DDF dataset.

Based on the analysis of the decision trees from the five experiments and the results obtained by applying the testing sets to the training models, we conclude that the J48 algorithm produced the best results for the DDF and DAF datasets in experiments 2 and 4. The sizes of the trees and the numbers of decision rules were optimal, and the combined result of the classifier accuracy based on the ROC area and the overall classification accuracy obtained by applying the testing set to the training models was the highest among all five experiments.

6.9 Pattern in API Call Frequencies

Looking at the frequency with which each API call was accessed helps in categorizing sets of API calls. This information helps in recognizing the most exploited vulnerabilities in a system. For this task, we plot two frequency comparison graphs for malware and benign executables.

Figure 6.6 shows the frequency graph of patterns in benign software with respect to malware API calls, with the frequency of malware API calls sorted in descending order. Figure 6.7 shows the frequency graph with respect to the benign executables' API calls, in the same order as in Figure 6.6. In both graphs, the X-axis represents the API calls and the Y-axis represents the number of malware or benign samples that made that API call.

Figure 6.6: API Call graph for Malware Vs Software


Figure 6.7: API Call graph for Software Vs Malware

From the above graphs, we can see that a set of API calls was made by malware more frequently than by the benign software. Analyzing each API call in that set individually helps in building a more robust system.
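The counts behind these graphs can be recomputed from the per-sample sets of unique API calls; the following is a minimal sketch (the map layout and names are illustrative assumptions).

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Minimal sketch: for each API call, count how many malware samples and how many
    // benign samples made that call; sorting the malware counts in descending order
    // gives the ordering used in the frequency graphs.
    public class ApiCallFrequencies {
        public static Map<String, int[]> count(Map<String, Set<String>> samples,
                                               Set<String> malwareNames) {
            Map<String, int[]> freq = new HashMap<>();   // call -> {malware count, benign count}
            for (Map.Entry<String, Set<String>> e : samples.entrySet()) {
                int idx = malwareNames.contains(e.getKey()) ? 0 : 1;
                for (String call : e.getValue()) {
                    freq.computeIfAbsent(call, k -> new int[2])[idx]++;
                }
            }
            return freq;
        }
    }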


CHAPTER VII

CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

In this work, the problem of detecting new and unknown malware is addressed.

Present day technologies and our approach for the detection of malware were discussed. An isolated environment was set up for the reverse engineering process and each executable was reversed rigorously to find its properties and behavior. Different data mining techniques were applied to the data extracted from the reversing process to procure patterns of malicious executables, and classification models were generated from them. To test the models, new executables from the wild, described by the same set of features, were supplied to them. The results thus obtained proved to be satisfactory.

From analyzing the experimental results, we can conclude that finding static and behavioral features of each malware sample through reverse engineering and applying data mining techniques to the resulting data helps in detecting new-age malware. Especially with the increasing amount of malware appearing each day, this method of detection can be used alongside present day detection techniques.

7.2 Future Work

We have reversed each strain of malware and each benign executable to extract as many features as we could with the help of the tools used in the present day computer security profession. However, we were not able to analyze the process address space of the executables in physical memory, as the memory analysis tools were released after we completed the reversing. Analyzing the address space would reveal more interesting information about the processes and thereby allow their behavior to be analyzed more accurately.

Also, reversing each malware sample manually is a time consuming process and requires a lot of effort given the thousands of new malware samples generated each day. One way to cope with this problem is to automate the whole reverse engineering process. Although there are some tools for automated reverse engineering, they do not record full details of the malware. A more specific tool that performs rigorous reversing would help in combating large amounts of malware. We consider these two tasks as future work that would aid in detecting new malware more efficiently.

REFERENCES

[1] E.Skoudis. Malware: Fighting Malicious Code. Prentice Hall, 2004.

[2] Fred Cohen. Computer Viruses. PhD thesis, University of Southern California, 1985.

[3] http://www.mcafee.com/us/resources/reports/rp-quarterly-threat-q3-2010.pdf; last access: Jan 2011.

[4] http://www.mcafee.com/us/resources/reports/rp-good-decade-for-cybercrime.pdf; last access: Jan 2011.

[5] http://www.messagelabs.com/mlireport/MLI_2011_01_January_Final_en-us.pdf; last access: Feb 2011.

[6] Classification of Malware Based on String and Function Feature Selection, Rafiqul Islam, Ronghua Tian, Lynn Batten, Steve Versteeg., 2010 Second Cybercrime and Trustworthy Computing Workshop, Ballarat, Victoria Australia., July 19-July 20, ISBN: 978-0-7695-4186-0

[7] R. Tian, L.M. Batten, and S.C. Versteeg. Function length as a tool for malware classification. In Proceedings of the 3rd International Conference on Malicious and Unwanted Software : MALWARE 2008, pages 69–76, 2008.

[8] Ronghua Tian, Lynn Batten, Rafiqul Islam, and Steve Versteeg. An automated classification system based on the strings of Trojan and virus families. In Proceedings of the 4rd International Conference on Malicious and Unwanted Software : MALWARE 2009, pages 23–30, 2009.

[9] M.G. Schultz, E. Eskin, E. Zadok and S.J. Stolfo, “Data Mining Methods for Detection of New Malicious Executables,” In Proceedings of the 2001 IEEE Symposium on Security and Privacy, IEEE Computer Society, 2001, pp. 38-49.


[10] Peter Miller. Hexdump. Online publication, 2000 http://www.pcug.org.au/ millerp/hexdump.html; last access: May 2010.

[11] William Cohen. Learning Trees and Rules with Set-Valued Features. American Association for Artificial Intelligence (AAAI), 1996.

[12] Tzu-Yen Wang, Chin-Hsiung Wu, Chu-Cheng Hsieh, "Detecting Unknown Malicious Executables Using Portable Executable Headers," ncm, pp.278-284, 2009 Fifth International Joint Conference on INC, IMS and IDC, 2009.

[13] J. Kolter and M. Maloof, “Learning to detect malicious executables in the wild,” in Proc. KDD-2004, pp. 470–478.

[14] Malware Detection by Data Mining Techniques Based on Positionally Dependent Features., Dmitriy Komashinskiy, Igor Kotenko., PDP '10 Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing., IEEE Computer Society Washington, DC, USA ©2010. ISBN: 978-0-7695-3939-3

[15] M. Christodorescu, S. Jha, and C Kruegel, “Mining specifications of malicious behavior,” in Proc. ESEC/FSE-2007, pp. 5–14.

[16] A Virus Prevention Model Based on Static Analysis and Data Mining Methods Tzu- Yen Wang; Chin-Hsiung Wu; Chu-Cheng Hsieh; CITWORKSHOPS '08 Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology Workshops., Publication Year: 2008 , Page(s): 288 - 293

[17] A. Sung, J. Xu, P. Chavez, and S. Mukkamala, “Static analyzer of vicious executables (save),” in Proc. 20th Annu. Comput. Security Appl. Conf., 2004, pp. 326– 334.

[18] Malware Analysis Using Reverse Engineering and Data Mining Tools, Burji, S., Liszka, K. J., and Chan, C.-C., The 2010 International Conference on System Science and Engineering (ICSSE 2010), July 2010, pp. 619-624.

[19] Chan, C.-C. and S. Santhosh, "BLEM2: Learning Bayes' rules from examples using rough sets," Proc. NAFIPS 2003, 22nd Int. Conf. of the North American Fuzzy Information Processing Society, July 24 - 26, 2003, Chicago, Illinois, pp. 187-190.


[20] Faraz Ahmed, Haider Hameed, M. Zubair Shafiq, and Muddassar Farooq. Using spatio-temporal information in API calls with machine learning algorithms for malware detection. In AISec ’09: Proceedings of the 2nd ACMworkshop on Security and artificial intelligence, pages 55–62, New York, NY, USA, 2009. ACM.

[21] http://www.cexx.org/adware.htm; last access: Dec 2010

[22] Peter Szor. The Art of Virus Research and Defense. Symantec Corporation, 2005.

[23] Reverse Code Engineering: An In-Depth Analysis of the Bagle Virus, Rozinov, K.; Dept. of Comput. & Inf. Sci., Polytech. Univ., Brooklyn, New York, USA., Information Assurance Workshop, 2005. IAW '05. Proceedings from the Sixth Annual IEEE SMC, 15- 17 June 2005, pp. 380 – 387, Print ISBN: 0-7803-9290-6.

[24] http://www.offensivecomputing.net; last access: Apr 2011

[25] http://sourceforge.net; last access: Apr 2011

[26] http://www.brothersoft.com; last access: Apr 2011

[27] http://www.vmware.com/products/workstation; last access: Jan 2011

[28] http://labs.idefense.com/software/malcode.php; last access: Mar 2010

[29] http://peid.has.it/; last access: May 2011

[30] http://www.hex-rays.com/idapro/; last access: Jan 2011

[31] M. G. Kang, P. Poosankam, and H. Yin. Renovo: A hidden code extractor for packed executables. In Proc. Fifth ACM Workshop on Recurring Malcode (WORM 2007), November 2007.

[32]http://www.offensivecomputing.net/bhusa2009/oc-reversing-by-crayon bhusa2009.pdf; last access: Jun 2010


[33] http://technet.microsoft.com/en-us/sysinternals/bb896642; last access: Apr 2010

[34] http://support.microsoft.com/kb/256986; last access: Aug 2010

[35] http://msdn.microsoft.com/en-us/library/cc433218%28v=vs.85%29.aspx; last access: Jan 2011

[36] The KDD Process And Data Mining For Computer Performance Professionals, Susan P. Imberman. Journal of Computer Resource Management, Summer 2002 Issue 107, pgs 68-77

[37] http://www.cs.waikato.ac.nz/ml/weka/; last access: Mar 2011

[38] Ian H. Witten, Eibe Frank, “Data Mining Practical Machine Learning Tools and Techniques”, Second Edition, 2005, pp-396, ISBN: 0-12-088407-0.

[39] Pang-Ning Tan, Michael Steinbach, Vipin Kumar; “Introduction to Data Mining”; Pearson Education; ISBN 0-321-42052-7; pp 256-276

[40] http://www.nit.eu/czasopisma/JTIT/2002/3/7.pdf; last access: Jan 2011

[41] Zdzisław Pawlak, “Rough Sets: Theoretical Aspects of Reasoning about data,” Kluwer Academic Publishing, 1991


APPENDICES


APPENDIX A

DATASETS

This section presents the details of all the datasets we used in our experiments.

 Dataset with Reversed Features as Attributes (DRF)

From the reverse engineering process, we extracted various details of the

executables including file name, file size, MD5 hash, packer, registry keys added,

registry keys modified, registry keys deleted, file access, directory access, API calls

made, DLLs accessed, network access, URL references, printable strings,

programming language and time stamp. Tables from A1 to A10 list all the attributes

and their values for a malware instance named Trojan.Win32.Startpage.ama

Table A1: An Instance for Attributes File Name, File Size and MD5 Hash in DRF

File Name File Size MD5 Hash

Trojan.Win32.StartPage.ama 948235 6F17D85FAAFA408670D1E9944F182BF5

Table A2: An Instance for Attributes Packer, File Access, Directory Access and Internet Access in DRF

Packer                              File Access  Directory Access  Internet Access
ASPack 2.12 -> Alexey Solodovnikov  Yes          Yes               Yes

Table A3: API Calls Accessed By the Trojan

API Calls

CreateFileA RegOpenKeyExA InternetConnectA OpenServiceW ReadFile CreateFileW HttpSendRequest QueryServiceStatu WriteFile AdjustTokenPrivileges A s LoadLibraryA/ExA OpenProcess CreateMutexW OpenSCManagerA IsDebuggerPresent SHGetValue SetFileAttributes RegOpenKeyA LoadLibraryW GetVolumeInformation A _lread CreateMutexA W RegCreateKeyEx SHSetValue SetWindowsHookEx HttpOpenRequestA A RegDeleteValueA A InternetCreateUrlA CreateProcessW Getaddrinfo CreateProcessA InternetCreateUrlW WSAStartup InternetCloseHand CopyFileA OpenSCManagerW FindWindowW le RegCreateKeyExW WSAStringToAddress ExitProcess PostMessageW GetFileAttributesW W GetFileAttributes connect RegSetValueExW GetVolumeInformation A socket RegOpenKeyExW A InternetCrackUrl RegSetValueExA CreateRemoteThread InternetOpenA SetFileTime RegCreateKeyA RegOpenKeyW Bind

Table A4: DLLs Accessed By the Trojan

DLLs Accessed
acgenral.dll    exp_xps.dll              msftedit.dll   oleaut32.dll   shimeng.dll
advapi32.dll    gdi32.dll                msi.dll        olepro32.dll   shlwapi.dll
apphelp.dll     grooveintlresource.dll   msimg32.dll    psapi.dll      tapi32.dll
atl.dll         groovenew.dll            msointl.dll    rasadhlp.dll   urlmon.dll
browselc.dll    grooveutil.dll           msores.dll     rasapi32.dll   user32.dll
browseui.dll    hnetcfg.dll              msvcr80.dll    rasman.dll     uxtheme.dll
cabinet.dll     imagehlp.dll             msvcr90.dll    riched20.dll   version.dll
clbcatq.dll     imm32.dll                msvcrt.dll     rpcrt4.dll     winhttp.dll
comctl32.dll    iphlpapi.dll             mswsock.dll    rpcss.dll      wininet.dll
comres.dll      kernel32.dll             msxml3.dll     rsaenh.dll     winmm.dll
crypt32.dll     linkinfo.dll             msxml3r.dll    rtutils.dll    winrnr.dll
cryptui.dll     maltrap.dll              msxml5.dll     secur32.dll    wintrust.dll
cscdll.dll      mlang.dll                netapi32.dll   sensapi.dll    wldap32.dll
cscui.dll       msacm32.dll              ntdll.dll      setupapi.dll   ws2_32.dll
dhcpcsvc.dll    msadox.dll               ntshrui.dll    shdoclc.dll    ws2help.dll
dnsapi.dll      msasn1.dll               ole32.dll      shdocvw.dll    wshtcpip.dll
exp_pdf.dll     msctf.dll                oleacc.dll     shell32.dll    wsock32.dll


Table A5: Registry Keys Added By the Trojan

Registry Keys Added

HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\HideDesktopIcons

HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\HideDesktopIcons\NewStartPanel

Table A6: Registry Keys Modified By the Trojan

Registry Keys Modified

HKLM\SOFTWARE\Microsoft\Cryptography\RNG\Seed:
HKLM\SOFTWARE\Microsoft\Internet Explorer\Main\Start Page:
HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109030000000000000000F01FEC\Usage\GrooveFiles:
HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109030000000000000000F01FEC\Usage\ProductFiles:
HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109AB0090400000000000F01FEC\Usage\GrooveFilesIntl_1033:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Internet Explorer\Main\Start Page:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Office\12.0\Groove\MTTF:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Office\12.0\Groove\MTTA:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\LastVisitedMRU\b:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\OpenSaveMRU\*\MRUList:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\OpenSaveMRU\*\c:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\OpenSaveMRU\exe\MRUList:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\ComDlg32\OpenSaveMRU\exe\e:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\UserAssist\{75048700-EF1F-11D0-9888-006097DEACF9}\Count\HRZR_EHACNGU:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Explorer\UserAssist\{75048700-EF1F-11D0-9888-006097DEACF9}\Count\HRZR_EHACNGU:P:\Qbphzragf naq Frggvatf\Nqzvavfgengbe\Qrfxgbc\Fbsgjner\znygenc_i0.2n\znygenc_i0.2n\znygencthv.rkr:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{2670000A-7350-4F3C-8081-5663EE0C6C49}\iexplore\Count:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{2670000A-7350-4F3C-8081-5663EE0C6C49}\iexplore\Time:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{72853161-30C5-4D22-B7F9-0BBC1D38A37E}\iexplore\Count:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{72853161-30C5-4D22-B7F9-0BBC1D38A37E}\iexplore\Time:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{92780B25-18CC-41C8-B9BE-3C9C571A8263}\iexplore\Count:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{92780B25-18CC-41C8-B9BE-3C9C571A8263}\iexplore\Time:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{FB5F1910-F110-11D2-BB9E-00C04F795683}\iexplore\Count:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Ext\Stats\{FB5F1910-F110-11D2-BB9E-00C04F795683}\iexplore\Time:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\DefaultConnectionSettings:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\SavedLegacySettings:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\ShellNoRoam\BagMRU\MRUListEx:
HKU\S-1-5-21-1844237615-484061587-839522115-500\Software\Microsoft\Windows\ShellNoRoam\BagMRU\5\MRUListEx:
HKU\S-1-5-21-1844237615-484061587-839522115-500\SessionInformation\ProgramCount:

Table A7: Registry Keys Deleted By the Trojan

Registry Keys Deleted

None

Table A8: URL References Made By the Trojan

URL References

http://www.ooooos.com
http://www.burnsrecyclinginc.com/hvplace/rel1.php?id=DR7t22_direct
www.burnsrecyclinginc.com


Table A9: Programming Language Used, Strings and Decision Label of the Trojan

Programming Language   Printable Strings   Decision Label
Not Found              Yes                 Yes

In the preprocessing, we removed the attributes that would not contribute to the overall goal of the task and used the remaining attributes in the experiments. All attribute values were converted to either integer or binary form, depending on the attribute type, and the decision label takes the value yes or no. Table A10 shows the attributes in DRF and their values for the same instance shown in Tables A1 to A9; a short illustrative sketch of this conversion follows Table A10.

Table A10: An Instance of DRF Dataset after Preprocessing

ATTRIBUTE                  VALUE
PACKER                     1
FILE ACCESS                1
DIRECTORY ACCESS           1
DLLs                       85
API CALLS                  56
INTERNET ACCESS            1
URL REFERENCES             1
REGISTRY KEYS ADDED        2
REGISTRY KEYS DELETED      0
REGISTRY VALUES MODIFIED   28
UNIQUE STRINGS             1
DECISION LABEL             Yes
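The following is a minimal sketch of how a raw record of reversed features could be reduced to the integer and binary DRF attributes shown in Table A10. The field names of the raw record (packer, api_calls and so on) and the helper function are illustrative assumptions, not the exact tooling used in this work, and the list values are truncated for brevity.

# Minimal illustrative sketch (assumed field names, not the exact tooling used
# in this work) of reducing one sample's reversed features to a DRF row.

raw_record = {
    "file_name": "Trojan.Win32.StartPage.ama",
    "packer": "ASPack 2.12 -> Alexey Solodovnikov",   # None if not packed
    "file_access": True,
    "directory_access": True,
    "internet_access": True,
    "dlls": ["kernel32.dll", "wininet.dll"],           # truncated for brevity
    "api_calls": ["CreateFileA", "WriteFile"],         # truncated for brevity
    "url_references": ["http://www.ooooos.com"],
    "registry_keys_added": 2,
    "registry_keys_deleted": 0,
    "registry_values_modified": 28,
    "unique_strings": True,
    "is_malware": True,
}

def to_drf_row(rec):
    """Map a raw reversed-feature record to integer/binary DRF attribute values."""
    return {
        "PACKER": int(rec["packer"] is not None),
        "FILE ACCESS": int(rec["file_access"]),
        "DIRECTORY ACCESS": int(rec["directory_access"]),
        "DLLs": len(rec["dlls"]),                      # count of DLLs accessed
        "API CALLS": len(rec["api_calls"]),            # count of distinct API calls
        "INTERNET ACCESS": int(rec["internet_access"]),
        "URL REFERENCES": int(len(rec["url_references"]) > 0),
        "REGISTRY KEYS ADDED": rec["registry_keys_added"],
        "REGISTRY KEYS DELETED": rec["registry_keys_deleted"],
        "REGISTRY VALUES MODIFIED": rec["registry_values_modified"],
        "UNIQUE STRINGS": int(rec["unique_strings"]),
        "DECISION LABEL": "Yes" if rec["is_malware"] else "No",
    }

print(to_drf_row(raw_record))

Counts are used for the list-valued features and 0/1 flags for the remaining attributes, which is consistent with the values in Table A10 (for example, DLLs = 85 and API CALLS = 56 when the full lists of Tables A3 and A4 are supplied).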

• Dataset with Discretized Attribute Values (DDF)

We discretized the values of attributes in DRF and prepared another dataset named DDF. Table A11 shows the attributes and their values in DDF for the same instance as shown in Table A10; an illustrative sketch of this mapping follows Table A11.


Table A11: An Instance of DDF Dataset

ATTRIBUTE                  VALUE
PACKER                     1
FILE ACCESS                1
DIRECTORY ACCESS           1
DLLs                       (16.5-inf)
API CALLS                  (41.5-inf)
INTERNET ACCESS            1
URL REFERENCES             1
REGISTRY KEYS ADDED        (1-2.5]
REGISTRY KEYS DELETED      0
REGISTRY VALUES MODIFIED   (12.5-inf)
UNIQUE STRINGS             1
DECISION LABEL             Yes
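For illustration, the sketch below shows how numeric DRF values could be mapped onto interval labels like those in Table A11. The cut points (16.5, 41.5, 12.5 and the (1-2.5] bin) are taken from the intervals visible in this one instance; the complete set of cut points produced by the discretization step in our experiments may differ, so the cuts dictionary should be read as an assumption.

# Illustrative sketch of mapping DRF values onto DDF interval labels.
# The cut points below are only the ones visible in Table A11; they are an
# assumption, not the complete set used in the experiments.

cuts = {
    "DLLs": [16.5],
    "API CALLS": [41.5],
    "REGISTRY KEYS ADDED": [1, 2.5],
    "REGISTRY VALUES MODIFIED": [12.5],
}

def discretize(attribute, value):
    """Return an interval label such as '(16.5-inf)' or '(1-2.5]' for a numeric value."""
    points = cuts.get(attribute)
    if not points:
        return value                       # binary attributes stay as they are
    bounds = [float("-inf")] + points + [float("inf")]
    for lo, hi in zip(bounds, bounds[1:]):
        if lo < value <= hi:
            lo_s = "-inf" if lo == float("-inf") else lo
            hi_s = "inf" if hi == float("inf") else hi
            closing = ")" if hi == float("inf") else "]"
            return f"({lo_s}-{hi_s}{closing}"
    return value

drf_row = {"PACKER": 1, "DLLs": 85, "API CALLS": 56,
           "REGISTRY KEYS ADDED": 2, "REGISTRY VALUES MODIFIED": 28}
ddf_row = {attr: discretize(attr, val) for attr, val in drf_row.items()}
print(ddf_row)   # DLLs -> (16.5-inf), API CALLS -> (41.5-inf), REGISTRY KEYS ADDED -> (1-2.5]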

• Dataset with API Calls as Attributes (DAF)

We combined all the API calls made by all executables and prepared another dataset with these API calls as attributes. The attribute values are binary, indicating whether or not an executable made a call to that API. Table A12 shows all the attributes and their values for an instance; a sketch of this construction follows Table A12.

Table A12: An Instance of DAF Dataset

Attribute Name                 Value   Attribute Name                 Value
DeleteFileW                    0       _lcreat                        0
SetFileTime                    0       _lwrite                        0
WSAStringToAddressW            1       811DWN                         0
RegOpenKeyA                    1       Accept                         0
_lread                         1       Closesocket                    0
SetFileAttributesA             1       ControlService                 0
LoadLibraryA/ExA               1       CopyFileA                      0
FindWindowW                    1       CopyFileExW                    0
SHGetValue                     1       CreateProcessAsUserA           0
OpenSCManagerW                 1       CreateProcessAsUserW           0
CreateFileA                    1       CreateProcessW                 1
Bind                           1       CreateServiceA                 0
OpenProcess                    1       CreateServiceW                 0
InternetGetCookieW             0       DebugActiveProcess             0
VirtualAllocEx                 0       Decision Label                 Yes
InternetOpenUrlW               0       DeleteService                  0
FindWindowExA                  0       ExitWindowsEx                  0
InternetConnectA               1       FindWindowExW                  0
GetVolumeInformationW          1       GetAsynckeyState               0
InternetCrackUrl               1       GetFileAttributesExA           0
WriteFile                      1       GetFileAttributesExW           0
RegCreateKeyA                  1       GetKeyboardState               0
OpenServiceW                   1       GetRawInputData                0
GetVolumeInformationA          1       GetTime                        0
CreateFileW                    1       HttpOpenRequestW               0
QueryServiceStatus             1       HttpSendRequestExW             0
GetFileAttributesW             1       HttpSendRequestW               0
RegDeleteValueW                0       InternetCheckConnectionA       0
ReadFile                       1       InternetConnectW               0
LoadLibraryW                   1       InternetGetConnectedState      0
RegCreateKeyW                  0       InternetGetConnectedStateExA   0
ShellExecuteW                  0       InternetGetConnectedStateExW   0
HttpSendRequestA               1       InternetGetCookieA             0
CopyFileW                      0       InternetReadFile               0
ExitProcess                    1       Listen                         0
RegSetValueExA                 1       MoveFileA                      0
RegDeleteKeyA                  0       MoveFileExA                    0
InternetOpenUrlA               0       MoveFileExW                    0
RegSetValueExW                 1       MovieFileW                     0
IsDebuggerPresent              1       OpenSCManagerA                 0
WSAStartup                     1       OpenServiceA                   0
RegOpenKeyW                    1       OutputDebugStringA             0
_lclose                        0       PostMessageA                   0
CreateMutexW                   1       PostMessageW                   1
InternetCreateUrlW             1       Recv                           0
InternetOpenA                  1       RegDeleteValueA                1
Getaddrinfo                    1       Register                       0
GetFileAttributesA             1       RegSetValueA                   0
Connect                        1       RegSetValueW                   0
InternetCloseHandle            1       Send                           0
HttpOpenRequestA               1       Sendto                         0
AdjustTokenPrivileges          1       SetWindowsHookExW              0
CreateProcessA                 1       SHDeleteKey                    0
InternetCreateUrlA             1       SHDeleteValue                  0
_lopen                         0       ShellExecuteA                  0
RegOpenKeyExW                  1       ShellExecuteExW                0
CreateMutexA                   1       SHSetValue                     1
SetFileAttributesW             0       Shutdown                       0
Socket                         1       StartServiceA                  0
SetWindowsHookExA              1       StartServiceW                  0
GetTempFileNameA               0       TerminateProcess               0
ShellExecuteExA                0       TerminateThread                0
RegCreateKeyExW                1       WinExec                        0
RegDeleteKeyW                  0       WinPos1024x768                 0
InternetOpenW                  0       WinPos1280x1024                0
RegOpenKeyExA                  1       WriteProcessMemory             0
CreateRemoteThread             1       WSAAsyncGetHostByName          0
DeleteFileA                    0       WSACleanup                     0
GetTempFileNameW               0       WSARecv                        0
RegCreateKeyExA                1       WSASocket                      0
InternetAttemptConnect         0       WSAStringToAddressA            0
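As an illustration of how a DAF row such as the one in Table A12 could be assembled, the sketch below takes the union of the API calls observed across all executables as the attribute set and marks each attribute 1 or 0 for every sample. The sample names and API call sets here are hypothetical stand-ins chosen for brevity; they are not entries from our actual dataset.

# Illustrative sketch of building DAF rows; the sample names and API call
# sets are hypothetical stand-ins, not entries from the actual dataset.

samples = {
    # sample name -> (API calls observed during reversing, decision label)
    "Trojan.Win32.StartPage.ama": ({"CreateFileA", "WriteFile", "WSAStartup"}, "Yes"),
    "benign_sample.exe":          ({"CreateFileA", "ReadFile"},                "No"),
}

# The attribute list is the union of all observed API calls, sorted for a stable order.
attributes = sorted(set().union(*(calls for calls, _ in samples.values())))

def to_daf_row(calls, label):
    """Binary vector: 1 if the sample called the API, 0 otherwise, plus the decision label."""
    row = {api: int(api in calls) for api in attributes}
    row["Decision Label"] = label
    return row

for name, (calls, label) in samples.items():
    print(name, to_daf_row(calls, label))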
