
REVERSE ENGINEERING OF A MALWARE

EYEING THE FUTURE OF SECURITY

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Supreeth Burji

August, 2009

REVERSE ENGINEERING OF A MALWARE

EYEING THE FUTURE OF SECURITY

Supreeth Burji

Thesis

Approved:

Advisor: Dr. Kathy J. Liszka

Faculty Reader: Dr. Timothy W. O'Neil

Faculty Reader: Dr. Wolfgang Pelz

Accepted:

Department Chair: Dr. Chien-Chung Chan

Dean of the College: Dr. Chand Midha

Dean of the Graduate School: Dr. George R. Newkome

Date: ______

ABSTRACT

Reverse engineering malware has been an integral part of the world of security. Until now, it has been employed at best for logging malware signatures. With the evolution of new technologies, it is now being researched as a robust methodology that can lead to more reactive and proactive solutions to modern security threats, which are growing stronger and more sophisticated. This research is an attempt to understand the ins and outs of reverse engineering as it pertains to malware analysis, with an eye to future trends in security.

Reverse engineering of malware was performed with the Nugache P2P malware as the target, showing that signature-based malware identification is ineffective. Developing a proactive approach to quickly identifying malware was the objective that guided this research work. Innovative malware analysis techniques using data mining and rough sets methodologies were employed in the quest for a proactive and feasible security solution.

ACKNOWLEDGEMENTS

I extend my warmest regards and appreciation to Dr. Kathy J. Liszka, a wonderful teacher and guide. Without her guidance, support and input this research work would not have been possible.

I would like to thank Dr. Wolfgang Pelz for giving me the opportunity to study the domain that interests me most, the world of security. It was in these security courses that the ideas for this research work began. I thank Dr. Timothy W. O'Neil, whose valuable input during the research gave it a new and thought-provoking direction on more than one occasion. I would like to thank Dr. Chan for helping me throughout the data mining phase and with the use of rough sets, and for guiding me through the subject in depth and detail.

I would like to thank Adam for knowingly or unknowingly helping me go on the offensive against a kind of malware strain called Rogue. I thank Stephen Sciarini and Chuck Van Tilburg for extending their help in the research labs and providing a workable environment there.

I would like to thank Snehal, a very good friend whom I look up to as a sister, for her help with one of the case studies in this research work. I thank my parents for the patience they have shown me all these years and for their support. Thank you, Mom and Dad.

Last, but not least, I would like to convey my heartfelt appreciation to my girlfriend Preetham, to whom I dedicate this work and without whose love and support I would never have made it this far.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

I. INTRODUCTION

II. REVERSE ENGINEERING NUGACHE

2.1 Overview

2.2 Experiments

2.3 An Explanation of P2P

2.3.1 Implementation

2.3.2 Infection Techniques

2.3.3 Infection

2.4 Network Flow Analysis

2.5 Possible Security Solutions

2.6 The Nugache Creator

III. THE LIFECYCLE OF MALWARE

3.1 Overview

3.2 Hardware Propagation

3.3 Application Obfuscation

3.4 Key Loggers

3.5 Conclusions

IV. ROGUE

4.1 Attack Trends

4.2 Reverse Engineering the Rogue softwares

4.3 Features

4.4 Virus Total Reports

V. DATA MINING

5.1 Patterns

5.2 Dataset

5.3 System Design

5.4 Implementation

5.5 Decision Trees

5.6 Naive Bayesian Classifier

5.7 Pattern Analysis

VI. ROUGH SETS

6.1 Predictive Modeling

6.2 Implementation

6.3 Results and Evaluation

VII. CONCLUSIONS AND FUTURE WORK

REFERENCES

APPENDICES

APPENDIX A. THE MALWARE DATASET

APPENDIX B. DATA MINING

APPENDIX C. ROUGH SETS

LIST OF TABLES

Table

4.1 Characteristics of similar Rogue malware strain

5.1 Attribute list of the malware dataset with type characteristics

5.2 Microsoft Clustering algorithm applied on the malware dataset

A.1 Attribute list used for creating the malware dataset

A.2 Malware dataset with the file size and MD5 hash attributes

A.3 Malware dataset with the time/date and filename attributes

A.4 Malware dataset with the unique strings, URL, registry and API reference attributes

A.5 Malware dataset with the file packer attributes

A.6 Malware dataset with the programming platform attributes

A.7 Malware dataset with the DLL, directory, file, internet access and label attributes

LIST OF FIGURES

Figure

2.1 A centralized P2P network

2.2 A pure P2P network

2.3 Depicting the phishing technique used for infection

2.4 Potential P2P IP list

2.5 Malware showing false date and time references

2.6 United States map depicting the IP address list

2.7 TCP conversations from the Wireshark packet sniffer

2.8 HTTP dominates active TCP connections

2.9 Packet flow of a machine in the compromised P2P network

3.1 Hardware Propagation of a Malware

3.2 Application Obfuscation of a Malware

3.3 Bad effects of a Malware – Key Stroke Logging

4.1 Virus Total report on the Rogue software strain av2008

4.2 Virus Total report on the Rogue software strain av2009b

4.3 Virus Total report on the Rogue software strain epro

5.1 Steps of the KDD Process

5.2 Data source view of the malware dataset in the Microsoft SQL Server 2005

5.3 A one Level no split Decision Tree as seen in the Microsoft Tree Viewer of SQL Server 2005

5.4 The Class Label shows no relationship with the data attributes in the Dependency Network of the Microsoft Naive Bayesian Classifier output

5.5 The Microsoft Cluster algorithm applied on the malware dataset in the Microsoft SQL Server 2005

5.6 Cluster 9 characteristics with the probability distribution in the Microsoft SQL Server 2005

6.1 Web interface for BLEM2 rough sets tool suite

6.2 Rules generated using machine learning tools in BLEM2 tool suite

6.3 Depicts all the possible rule sets for the malware dataset

6.4 Rule based Classifier in BLEM2 rough sets tool suite

B.1 The KDD process

B.2 Data Source View of the training set

CHAPTER I

INTRODUCTION

The invention of the Internet marked the dawn of a new age of computing. With the increase in the number of users of the Internet, the potential for cyber crimes has also increased. The race is on between potential web marketing and e-commerce giants for supremacy over the Internet domain. So is the battle of the Blackhats for supremacy in the hackers' domain.

A computer virus is the most primitive form of the malicious application family. This is a program that can copy itself from a host instance and cause unwanted and ill effects to an infected computer. For viruses to spread, they need a carrier or host, hence the name derived from the biological term. A computer worm is a self-replicating program within a computer network. Unlike a virus, a worm may propagate without user intervention.

Trojan horses, named after their historical counterpart, are computer programs that allow unauthorized access to host machines by performing undisclosed, hidden functions while the user is duped into thinking that the program is doing something completely different, such as playing a game. Rootkits are computer programs that operate without the knowledge of the users of the infected machine, typically compromising the host at a level that makes their presence totally obscure and makes them virtually impossible to eradicate once detected. Spyware, adware and crimeware are the latest members of the malicious program family. They also work their way into the host via interactions over the Internet, without a user’s knowledge or consent. All these malicious platforms and others existed simultaneously, but with their increasing development and sophistication, aided by the Internet revolution, the differences between them have been rapidly fading. They are now collectively called malware.

In the beginning, malware would propagate from and to networks using storage devices such as floppy disks and, later, CD-ROMs. More recently, USB sticks, memory sticks and other such media have been used as instruments of dissemination. With the increase in Internet usage, malware propagation has become seamless. Then came the “killer app” of the Internet: email. Email attachments are a huge and successful vehicle for hackers' Blackhat schemes. Hacking, which likely started out because of curious and intrigued minds, suddenly proved to be a lucrative way to make money. Machines are now being compromised at an alarming rate by every means possible. Practically nobody connected to the Internet today is safe. Gone are the days when dirty disks were passed around to show the world that machines could be hacked, as a simple badge of honor. Now hacking is a far more lucrative, advanced, sophisticated and secretive venture.

New techniques that exploit loopholes in computer applications and the Internet emerge almost daily. Phishing attacks have gained prominence in recent years: web pages are forged, and people are fraudulently led to them as trusted sites in order to con them out of passwords or other sensitive data such as credit card information. Cross-site scripting (XSS) injects malicious code into web pages, compromising them and making them prone to phishing attacks. IP spoofing refers to the creation of IP packets with a forged source IP address in order to impersonate another computing system, while packet sniffing refers to capturing packets as they cross a network. Buffer overflow attacks are extremely common and perhaps the preferred vulnerability attack to date. These exploit the way a process or an application manages its buffers on the host system. If carefully crafted, an illegal memory access can lead to an entry point elsewhere in the system, other than where the original program intended, leading to a compromise. All these attack vectors, and many others, show that trends in the hacking world have become very dangerous. Serious hackers try to lure their victims into traps and compromise them, truly an Internet con game.
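The buffer overflow mechanism described above can be illustrated with a toy model. The following Python sketch is an illustration only and is not taken from the experiments in this thesis. It assumes a hypothetical 12-byte memory region in which an 8-byte input buffer sits directly in front of a 4-byte saved return address; the names `make_frame`, `unchecked_copy` and `checked_copy` are invented for the example. Real overflows occur in native code, so the sketch only mimics the memory layout.

```python
import struct

BUF_SIZE = 8   # bytes reserved for the input buffer (assumed layout)
ADDR_SIZE = 4  # bytes holding the saved return address (assumed layout)


def make_frame(return_address):
    """Build a toy 'stack frame': an empty buffer followed by a return address."""
    frame = bytearray(BUF_SIZE + ADDR_SIZE)
    struct.pack_into("<I", frame, BUF_SIZE, return_address)
    return frame


def read_return_address(frame):
    """Read back the 4-byte little-endian value stored after the buffer."""
    return struct.unpack_from("<I", frame, BUF_SIZE)[0]


def unchecked_copy(frame, data):
    """Copy input into the frame without comparing its length to BUF_SIZE."""
    for i, byte in enumerate(data):
        frame[i] = byte


def checked_copy(frame, data):
    """Copy input into the frame only if it fits inside the buffer."""
    if len(data) > BUF_SIZE:
        raise ValueError("input longer than buffer")
    frame[:len(data)] = data


if __name__ == "__main__":
    frame = make_frame(0x00401000)
    # Eight filler bytes fill the buffer; the next four replace the address.
    crafted = b"A" * BUF_SIZE + struct.pack("<I", 0xDEADBEEF)
    unchecked_copy(frame, crafted)
    print(hex(read_return_address(frame)))
```

In this model the unchecked copy lets the last four input bytes replace the stored address, which corresponds to the redirected entry point described in the text, while the checked copy rejects the same input.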

The trends and the attack vectors have become highly sophisticated, intricate and complex. Not only are they evolving but they are now more difficult to identify and monitor. This research work is an effort to study the changes in trends in the hacking world and to provide a potentially viable link to the next progression.

The thesis is organized as follows. Chapter two presents the process and results of reverse engineering the famous Nugache worm. This work was completed prior to the worm creator’s arrest and subsequent sentencing. Chapter three discusses three stages identified for the lifecycle of malware in generic terms. Rogue software is discussed in chapter four. We demonstrate an example of one strain of rogue malware and reveal its features using reverse engineering techniques. The effectiveness of signature logging, a popular technique used by many anti-malware vendors, is demonstrated and discussed. In chapter five, we apply three data mining techniques to the task of identifying malware, and in chapter six, we apply the theory of rough sets to this task. Finally, in chapter seven, we present conclusions and future work.

CHAPTER II

REVERSE ENGINEERING NUGACHE

2.1 Overview

Psychological profiling [1] of malware coders and hackers has been done over the years, but the way hackers think and operate has changed by leaps and bounds over the same period. Their motives have shifted from curiosity to monetary gain. Most of the existing research is dated and no longer reflects the changing scenario. The following reasons are worthy of mention here:

• Hacker approaches and dynamics are rapidly evolving.

• Stakes are ever increasing with organized groups involved from all over the world.

• Hacking has become more financially driven and lucrative because of the economic downturn over the past decade.

• Blackhats are making the news by compromising computers in huge quantities.

The main idea behind this particular case study was to be involved in an active role in understanding this type of person without concentrating on any of the moral aspects present. Psychological profiling is a different field of research and beyond the scope of this thesis.

The research started with a case study inspired by the Nugache worm. The creator, Jason Michael Millmont, a 19-year-old from Cheyenne, Wyoming, has since been convicted of creating a peer-to-peer botnet that controlled several thousand computers at any given time [2]. Botnets are autonomous systems which act as robots to perform certain pre-defined actions using a distributed computing paradigm [3]. From the computer security perspective they are typically viewed as autonomous systems used for controlling a victim’s machine which has been compromised by a hacker. The concept of peer to peer will be discussed shortly.

The case study was directed as a passive immersion exercise to understand the technical factors involved with creation of that kind of worm. Rather than approaching it from a psychological point of view, this research presents a technical viewpoint from a computer science perspective involving the aspects of coding the malware to victimize users.

The case study in this chapter shows the process of reverse engineering Nugache keeping the following concepts in mind:

• malware coding techniques,

• code analysis,

• thought process of the malware writer with regard to programming skills.

The following sections describe and discuss in depth the details of the case study and experiments conducted in this research work.

2.2 Experiments

With the increase in the number of Internet users, there has also been a substantial increase in the amount of malware created. The war between malware writers and the white hat engineers trying to stop them is ongoing. Judging by the success rate of the attacks, compromises and hacks that occur, it is no surprise who has the upper hand. One such case is presented here, involving a pure peer-to-peer (P2P) malware named Nugache.

This strain of malware came into the picture in early 2006, when it started out as a simple Trojan horse, simple compared with a more notorious P2P worm called the Storm worm [2]. Nugache targeted the Microsoft Windows operating system, including the widely used NT, 2000 and XP flavors. The malware self-propagated through computer networks, but it was not well hidden: although its actions were misleading, its existence was obvious. Later strains became more sophisticated. It ultimately developed into a pure P2P strain, forming a bot network [3] with the potential to bring down the entire infected network. A bot is an automated program and, in this context, its actions are undesirable.

Part of this thesis research was to learn and apply reverse engineering techniques that are used by researchers, security groups and commercial anti-virus developers in order to analyze one of the Nugache strains. The strain of the worm under investigation had already been reverse engineered by professional anti-virus developers, but only for the purpose of logging signatures and releasing appropriate patches.
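The limitation of signature logging mentioned above can be sketched in a few lines. The example below is a simplified illustration, assuming that a signature is nothing more than the MD5 hash of the whole file; commercial signatures are usually more elaborate, and the sample bytes and function names here are hypothetical.

```python
import hashlib


def md5_signature(payload):
    """Return the MD5 hex digest of a byte string, used here as its signature."""
    return hashlib.md5(payload).hexdigest()


def is_known(payload, signature_db):
    """Report whether the payload's signature appears in the logged set."""
    return md5_signature(payload) in signature_db


# Hypothetical sample bytes; not taken from any real malware strain.
ORIGINAL = b"hypothetical sample bytes"
VARIANT = ORIGINAL + b"\x00"  # the same content with one padding byte added

SIGNATURE_DB = {md5_signature(ORIGINAL)}

if __name__ == "__main__":
    print(is_known(ORIGINAL, SIGNATURE_DB))
    print(is_known(VARIANT, SIGNATURE_DB))
```

Under this assumption, a logged signature matches only the exact bytes that were analyzed, so a trivially altered variant is no longer recognized. This is one reason a strain that has already been logged can reappear undetected.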

The intention was to select one of the strains and learn the steps of the reverse engineering process in order to understand how malware writers think and how they approach the search for vulnerabilities in a target operating system platform. Basically, one needs to think like a hacker to beat one. This part of the research not only shed light on the technicalities involved in malware development but also provided a sense of direction with regard to preventive measures. In the following section, we present a detailed description of the implementation, infection techniques, effects and aftermath of a strain of the Nugache worm.

2.3 An Explanation of P2P

The pure P2P strain in this context means the bot network involves no central command and control server. Unlike earlier bot network worms, which relied for example on the simple IRC (Internet Relay Chat) protocol and its central servers, the strain of the Nugache worm under study involves no central server. In Figure 2.1 we note that the central command and control server maintains and updates the information required to handle a centralized peer network. In contrast, in a pure P2P network, as shown in Figure 2.2, there is no central command and control server and hence the information is maintained by the peers themselves. In the case of a legitimate peer network, initial information in the form of a peer list is obtained from a tracker site.

The emphasis here is on the pure P2P version because a worm designed to use this network model provides very strong protection for the author of the code. In fact, it is very difficult, if not impossible, to trace the worm back to the originator.
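The structural difference between the two network models can be made concrete with a small graph model. The sketch below is an abstract illustration only: the node names and both topologies are hypothetical and do not describe the Nugache network observed in the experiments. It uses a breadth-first search to ask which peers can still reach one another after a single node is taken out.

```python
from collections import deque


def reachable(adjacency, start, removed=frozenset()):
    """Return the set of nodes reachable from start, skipping removed nodes."""
    if start in removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen and neighbor not in removed:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Centralized model: every peer talks only to the server (hypothetical layout).
CENTRALIZED = {
    "server": ["p1", "p2", "p3", "p4"],
    "p1": ["server"],
    "p2": ["server"],
    "p3": ["server"],
    "p4": ["server"],
}

# Pure P2P model: each peer keeps its own short peer list (hypothetical layout).
PURE_P2P = {
    "p1": ["p2", "p4"],
    "p2": ["p1", "p3"],
    "p3": ["p2", "p4"],
    "p4": ["p3", "p1"],
}

if __name__ == "__main__":
    print(sorted(reachable(CENTRALIZED, "p1", removed={"server"})))
    print(sorted(reachable(PURE_P2P, "p1", removed={"p2"})))
```

In the centralized layout, removing the server leaves every peer isolated, whereas in the pure P2P layout the remaining peers stay connected after any single peer is removed. This is the property that makes such a network difficult to shut down or trace.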

Figure 2.1: A centralized P2P network

Figure 2.2: A pure P2P network

2.3.1 Implementation

The research needed a framework that provides a controlled environment for testing malware strains. VMware Server [4] is one such option. It is commercially available software that may also be used free of charge for educational purposes. VMware has many useful features for creating and maintaining a variety of operating systems as virtual machines. Any havoc created in the virtual operating system does not harm the underlying real operating system.

The features of the virtual machine used for this part of the thesis work include the following:

• The target operating system was Microsoft Windows XP Professional edition with Service Pack 2 updates.

• The virtual system was configured with 256 Megabytes of RAM and 4 Gigabytes of hard disk space.

• The guest operating system was configured so that it could not access any data on the host operating system where VMware was installed.

Several other commercial and open source tools were used for sniffing, system analysis, registry monitoring, file and process handle monitoring, Internet and web content analysis, and packet flow analysis. They are the Windows operating system File and Registry monitors [5], the Wireshark packet analyzer [6], the Packetyzer packet classifier [6], the Sniffhit packet sniffer [6] and the Malcode Analyst Pack [7].

Experiments were carried out in a secure lab, in a controlled environment, and without a connection to the Internet.

2.3.2 Infection Techniques

From the experiments carried out, it was evident that there were no clear limitations to the infecting mechanisms that compromised the host system network. Some of the infecting mechanisms that were detected and observed included the following:

• Phishing sites/ads/popups – Use of phishing sites was one way of pushing the malware payload to a victim’s machine. As shown in Figure 2.3, the malware payload was obfuscated by disguising itself as a legitimate software plug-in or add-on. This cleverly fools the victim and effectively compromises their system.

Figure 2.3: Depicting the phishing technique used for infection

• Email attachments/spam – Spam emails carrying malware payloads are not a new concept, and the malware writer harnessed this technique as well. Emails were sent out with attachments named selfnude.scr, mypic.scr and DSC1060193.scr, which led many curious people to open them and thereby infect their systems. [8]

• Instant Messenger – Instant messengers are also in the line of fire, causing wide-scale infections via a friend’s compromised or fake IM account. One interesting report claimed that AOL-IM (America Online LLC Instant Messenger) topped the list of propagation techniques for this malware payload. [8]

• Microsoft Internet Explorer Browser – Unpatched versions of Internet Explorer were also used for compromising and infecting hosts. One of the most disturbing techniques that surfaced is key stroke logging, which is discussed in more depth in a later section.

• P2P software – Many P2P software packages used on the Internet for sharing music, pictures, files and applications were also used as a main attack vector for this malware to spread rapidly. Limewire [9] is one such P2P application that was modified and used for this purpose.

2.3.3 Infection

Three virtual machines were set up on a single desktop machine with a bridged LAN between them. No Internet connection was made available to the machine. The three controlled machines running under VMware were infected, and while they were running, all transactions and operations carried out on the infected machines were recorded. This included keyboard strokes, mouse events, browser usage, and screen shots. These techniques are most likely used to collect information related to user accounts, passwords, and other confidential information.

The following list contains conclusions drawn after running the malware and applying the tools previously described:

• A bot network was formed using the pure P2P approach.

• There was a total compromise of the infected computer with regard to files and the registry. The Internet prefetch, which is used to speed up the boot time of the application, was also polluted.

• Uncontrollable TCP/IP SYN flood attacks were initiated. This is one very effective way to compromise networked machines, by creating an artificial overflow of service requests.

• The malware disabled the installed antivirus software, AVG Free Edition version 8.0 [10]. It compromised the unpatched Internet Explorer. It performed Internet protocol redirection, leading to other malicious web sites and opening up loopholes for downloading backdoor clients/agents.

• After the infection stage, the infected system registry held the entries for potential P2P bot networks shown in Figure 2.4.

• Figure 2.5 shows an interesting entry of the malware, dated 8/4/2004, a direct contradiction of the experimental setup date of 2/20/2008.

• The victim becomes one of the peers (potentially a seeder). This lays the foundation for further infection in the network.

• The P2P list shown in Figure 2.4 changes and updates itself after it forms into a full-blown bot network (also known as a botnet). The computers in the botnet created from the initial peer list continued to work and grow as shown in Figure 2.6 [11], indicating that they were not the originators of the malware.

• Network systems, whether legacy or current, would be compromised once one of the systems in the network was compromised.
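The date contradiction noted in the list above suggests a simple consistency check. The following sketch is a minimal illustration, assuming the dates are written in month/day/year order as they appear in the text; the function names are hypothetical and the check is not part of the tools used in the experiments.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed month/day/year order


def parse_date(text):
    """Parse a date string such as 8/4/2004 into a date object."""
    return datetime.strptime(text, DATE_FORMAT).date()


def days_before_setup(reported, setup):
    """Return how many days the reported date precedes the setup date.

    A positive result means the entry claims to be older than the clean
    system it was found on, which is worth flagging. Zero means no conflict.
    """
    gap = (parse_date(setup) - parse_date(reported)).days
    return max(gap, 0)


if __name__ == "__main__":
    # Dates taken from the observation above.
    print(days_before_setup("8/4/2004", "2/20/2008"))
```

An entry that claims to predate the installation of the clean guest system cannot be genuine, so a positive gap of this kind is one anomaly that could be recorded as an attribute when profiling a sample.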

Figure 2.4: Potential P2P IP list

Figure 2.5: Malware executable showing false date and time references

Figure 2.6: United States map depicting the IP address list

2.4 Network Flow Analysis

Wireshark is an open source packet sniffing tool used for packet classification and analysis [6]. Generally, the network interface is set to promiscuous mode for sniffing packets, which are then visualized using different packet sorting and filtering techniques. As seen in Figure 2.7 and Figure 2.8, we observed inbound network traffic being received from the infected machine at a very high and alarming rate.

While Figure 2.7 indicates many backdoor agents and malicious applications being pushed to the peer computers in the network, Figure 2.8 depicts the distribution of packets that were sniffed within the network. HTTP dominated the TCP conversations even though the network had no active Internet connections. In a sense, the compromised machine in the network tends to act as the initiator without the knowledge of the user.

Figure 2.9 depicts the network flow on one of the machines connected to the compromised machine. Clearly, the packet inflow taken over a period of 30 minutes dominated the packet outflow. From these results, a pattern starts to emerge which could potentially be used for detection of this type of malware.
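The pattern described above, inflow dominating outflow with HTTP as the leading protocol, can be expressed as a small summary over captured packet records. The sketch below is illustrative only: it assumes each packet has already been reduced to a (source, destination, protocol) tuple, the addresses and counts are made up, and the factor of 2.0 is an arbitrary threshold rather than a value measured in the experiments.

```python
from collections import Counter


def flow_summary(packets, host):
    """Count inbound and outbound packets for host and tally protocols."""
    inbound = sum(1 for src, dst, proto in packets if dst == host)
    outbound = sum(1 for src, dst, proto in packets if src == host)
    protocols = Counter(proto for src, dst, proto in packets)
    return inbound, outbound, protocols


def inbound_dominates(inbound, outbound, factor=2.0):
    """Flag a host whose inbound count is at least factor times its outbound."""
    return inbound >= factor * max(outbound, 1)


# Hypothetical capture: five packets pushed to the monitored peer, one reply.
MONITORED = "10.0.0.2"
SAMPLE_PACKETS = (
    [("10.0.0.1", MONITORED, "HTTP")] * 5
    + [(MONITORED, "10.0.0.1", "TCP")]
)

if __name__ == "__main__":
    inbound, outbound, protocols = flow_summary(SAMPLE_PACKETS, MONITORED)
    print(inbound, outbound, protocols.most_common(1))
    print(inbound_dominates(inbound, outbound))
```

A summary of this kind, computed per host over a fixed window such as the 30 minutes used in Figure 2.9, is one possible form in which the observed pattern could be passed to a detection system.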

2.5 Possible Security Solutions

Security experts are of the opinion that formatting the hard drive of a system infected with Nugache is the only solution after infection. After observing this malware in action, it seems like an unfortunate but reasonable solution. In this research, we are searching for relevant patterns along with other anomalies to create a behavioral model that will help in providing a robust security solution. If patterns could be identified, they could be fed into the neural networks of an Intrusion Detection System (IDS) at the firewall level. Rather than using a heuristic approach like the ones present in current IDS solutions, neural networks applied at the firewall need to be proactive in order to deal with sophisticated malware. Clean, continuous network flow logging and packet analysis can also be useful as a security alternative.

2.6 The Nugache Creator

At the time the reverse engineering experiments for this research were being conducted, the author of the Nugache worm was unknown. On June 27, 2008, Jason Michael Millmont, a 19-year-old college student from Cheyenne, Wyoming, was arrested for infecting machines with his malware. This marked the first time in the United States that a person was prosecuted for using P2P software to deliver malware. Nugache turned infected computers into zombies (machines that are remotely controlled for malicious acts without the knowledge of their users). Millmont controlled over 15,000 machines at a time. He pled guilty to creating the worm and agreed to pay $74,000 in restitution; he faces five years in federal prison and an additional $250,000 fine. [12, 13]

As validation of the experiments conducted for this research, it was reported that Nugache spread through AIM and modified Limewire installation programs. Once clicked on, the malware made unwitting users part of a botnet, which Millmont then used to steal the user names, passwords and account numbers of those who were infected.

Nugache circulated as early as 2006 and spawned one of the first botnets to use a decentralized system to send instructions to zombie drones. Rather than relying on a single command and control channel, the zombie network used a peer-to-peer mechanism to communicate. This technology fundamentally changed the cyber crime landscape by making it much harder to shut down botnets. [12]

Figure 2.7: TCP conversations from the Wireshark packet sniffer

Figure 2.8: HTTP dominates active TCP connections

Figure 2.9: Packet flow of a machine in the compromised P2P network

Over time, Millmont added new features to Nugache. A graphical user interface made it easy to access infected machines from his home server. This allowed him to issue a command to a single machine, which would then transmit the command to other machines until it had spread through the entire network. The program contained a key logger and was also capable of sniffing sensitive information stored in Internet Explorer. Most browsers offer a feature to store usernames and passwords for online banks and other sensitive websites to spare users the hassle of having to remember them.

According to a plea agreement signed by Millmont, he used his botnet to launch a Distributed Denial-of-Service (DDoS) attack. This is an attack mechanism in which the attacker floods a server with service requests, thereby making it unreliable and unusable for some period of time. This is a classic way for an attacker to open up a loophole in a program to gain access to a network. These DDoS attacks targeted an unnamed online business located in the Los Angeles area. After sending a command that instructed infected machines to transmit captured passwords and other information, he ordered items online and took control of his victims’ accounts by changing the addresses and other details associated with them. For example, in April 2007 he used stolen credit card information to make a $1,422 purchase from Hinsite Global Technologies and had the items shipped to a vacant residence in the Cheyenne area. [12]

Nugache spread itself via chats, with a message asking a buddy to view a photo on websites like MySpace.com and Photobucket.com. Users were taken to a spoofed website, and became infected with the worm. "After obtaining this information from a victim's computer, defendant used his/her financial institution's online user name and password to access the account online," the plea agreement said. "Defendant then changed the victim's e-mail address to a similar e-mail that he controlled and the mailing address to an address in Cheyenne, Wyoming, typically an address that was listed for sale." [13]

Millmont also changed the telephone number on a victim’s account to a number he controlled using Skype, a free application that uses Voice over IP to make free calls over the Internet to other Skype users [14]. Millmont kept his schemes from prying eyes for quite some time by replacing victims’ phone numbers with Skype numbers he created. To add insult to injury, he paid for these Skype numbers with credit card data harvested from his botnet. [12, 13]

In conclusion, the findings of this part of the research closely match what has come out of the arrest of this hacker. While this is encouraging, it is also sad, particularly considering the age of this young man, who was most likely drawn to this first by curiosity and later by money. As a result, he will most likely serve time in prison. In the next chapter, we take the results of these experiments and develop a model for the lifecycle of malware.

CHAPTER III

THE LIFECYCLE OF MALWARE

3.1 Overview

The motivation for the series of experiments discussed in chapter two was to understand the different ways malware operates. Malware writers, or hackers, have in most cases been ahead of their white hat counterparts. In order to analyze why this is the case, a detailed lifecycle of malware needs to be developed and studied.

Analysis was done at each and every stage of malcoding to gain insight into the minds of malware writers. Indeed, I found that malware writers have a lot at their disposal. They have a great deal of freedom when it comes to malcoding: freedom in the malware design stages, programming approaches, implementations, packaging and composition, and target platforms. Most important of all, they do not have any time constraints.

In contrast, their counterparts have to work within the boundaries of all these constraints. Hence malware writers have an edge over their counterparts, which raises the question of who is winning the race for secure computing.

3.2 Hardware Propagation

The objective of this part of the research is to gain a detailed understanding of an application meant to be stealthy and to propagate from one user to the other users by using hardware as the catalyst. Specifically, we use a pen drive in an experiment to test whether hardware propagation of malware, although possible, is actually feasible. The experimental setup involved the following components:

• Windows XP with Service Pack 2 on a VMware platform.

• Dev C++, an open source C++ compiler [15].

• No Antivirus software or any commercial firewall.

The point was to study an application that runs on a compromised machine in stealth mode, as a system process with all the necessary administrator privileges. The application immediately copies itself (or a payload) to a flash/pen drive if and when one is plugged into the system. This is illustrated in Figure 3.1, label A.

The application then becomes active and propagates to every machine into which the pen drive is plugged, as illustrated in Figure 3.1, label B. The Autoplay option was a major design consideration in this application, since it causes a flash/pen drive to open automatically in Windows XP and earlier versions.

Figure 3.1: Hardware Propagation of a Malware

The result of this experiment was that the malware sample could propagate itself. Hardware propagation can be used to exploit different Windows-based operating systems because of vulnerabilities associated with the default flash/pen drive Autoplay settings. We note, however, that with increased security features, Autoplay is no longer allowed in updated XP versions and other patched Windows flavors. Nevertheless, the experiment shows that malware writers are extremely resourceful. The wide variety of native Windows API calls at the disposal of a malware writer makes it more difficult to stop these kinds of attacks. This was evident in SanDisk’s U3 hack, where a hardware device was used as a hack tool that behaved like a CD-ROM and ran applications on its own [16]. Our conclusion is that hardware propagation plays a major role in spreading malware.
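From the defender's side, the dependence on Autoplay suggests a simple check on removable media. The following is a minimal sketch, assuming the drive carries an autorun.inf file in the usual INI layout; the function name autorun_commands and the sample file contents are hypothetical and are not taken from the experiment.

```python
def autorun_commands(inf_text):
    """Return the commands that an autorun.inf file asks Windows to launch.

    Only the [autorun] section is inspected. The keys treated as launch
    instructions (open, shellexecute, shell\\...\\command) are an assumption
    of this sketch, chosen because they name a program to run.
    """
    launch_keys = ("open", "shellexecute")
    commands = []
    in_autorun = False
    for raw_line in inf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            in_autorun = line[1:-1].strip().lower() == "autorun"
            continue
        if in_autorun and "=" in line:
            key, value = line.split("=", 1)
            key = key.strip().lower()
            if key in launch_keys or key.endswith("\\command"):
                commands.append(value.strip())
    return commands


# Hypothetical file contents, for illustration only.
sample = "[AutoRun]\nopen=payload.exe\nicon=drive.ico\nshell\\open\\command=payload.exe /s\n"
print(autorun_commands(sample))
```

A scanner could report any drive for which this list is non-empty, leaving the decision about the named executables to the user or to an antivirus product.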

3.3 Application Obfuscation

The purpose of this experiment was to study in detail how a stealthy application hides itself from the user. This is a step in the lifecycle of malware. The experimental setup involved the following components:

• Windows XP with Service Pack 2 on a VMware platform.

• Dev C++, an open source C++ compiler.

• No Antivirus software or any commercial firewall.

The point was to investigate an application that runs in stealth mode, that is, in non-windowed mode, using the Microsoft Windows API. The malware attempts to copy itself into the Windows system folder. Then, by creating a Microsoft Windows Registry hook, the application tries to run as a native Windows process. This makes the malware a ‘startup’ application at boot time. Finally, as a post-obfuscation activity, the malware logs user and system information. The steps can be seen in Figure 3.2.

Figure 3.2: Application Obfuscation of a Malware

The steps in this process are as follows:

• The malware copies itself into the Windows System Folder without the knowledge of the user.

• It makes an entry in the Windows Registry so that it will run as a startup process.

• It deletes the payload from the original source path after infection to avoid trace back.

• The most important step is that it runs with a process name intended to trick users into thinking it is a legitimate Windows based Service. In this case it was svchost.exe.

• It logs user and system information for later use.

The result of this experiment was that, in the given environment, the malware successfully hid itself and logged user data without the user’s knowledge. From a computer security awareness point of view, novice users, who form the majority of Internet users, will be easily trapped by this type of malware. The Native Windows API is a double-edged knife: it may be used to create cutting edge software, but it may equally be used to compromise users’ computers.

Obfuscation, stealth and hiding of malware applications are indeed important techniques. Security specialists need to understand how they work in order to protect users. Interestingly, the approach that this type of application follows shows an over-dependence on the Native Windows APIs. Antivirus software should be able to detect this.
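One detection heuristic that follows from the process-name trick is to compare the name of a running image with the folder it was loaded from. The sketch below is illustrative only: the TRUSTED_LOCATIONS table, the function name and the example paths are assumptions, and the expected folder applies to a 32-bit Windows XP installation on drive C.

```python
import ntpath  # Windows path rules, usable on any platform

# Assumed location of the genuine binary on a default 32-bit Windows XP install.
TRUSTED_LOCATIONS = {
    "svchost.exe": r"c:\windows\system32",
}


def looks_like_masquerade(image_path):
    """Return True if a well-known system name runs from an unexpected folder."""
    name = ntpath.basename(image_path).lower()
    folder = ntpath.normcase(ntpath.dirname(image_path)).rstrip("\\")
    expected = TRUSTED_LOCATIONS.get(name)
    return expected is not None and folder != expected


print(looks_like_masquerade(r"C:\WINDOWS\system32\svchost.exe"))
print(looks_like_masquerade(r"C:\WINDOWS\system\svchost.exe"))
```

The check is deliberately narrow. It says nothing about a file that overwrites the genuine binary in place, which would need a hash or signature comparison instead.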

Finally, we stress that Windows-based systems provide more APIs than are required, for the sake of legacy support. Care should be taken that these are strictly off limits to an application, enforced through Process Privilege Access Violations at the kernel level. This policy is already included in a number of other operating systems.

3.4 Key Loggers

The purpose of this experiment was to understand in depth how a hook-based key logger works, the concepts involved, and its relevance to the lifecycle of malware [17, 18]. Hackers are very interested in collecting information about a person, such as full name, date of birth, bank account data, geographical location, and so forth. The most popular method used for this purpose is keystroke logging, a mechanism that stores the keystrokes made by a user working on a compromised machine. Other information collected includes websites visited, hyperlinks clicked, and usernames and passwords used to access various resources.

Kernel-based key loggers are very difficult to detect because they compromise the input device driver itself. Using the same techniques as keyboard drivers, the malware logs all the information and passes it back to the black hat. In this experiment, in addition to keystroke logging, we added logging of mouse events. Screenshots may also be captured by this application at any desired time interval. This gives relevant information such as:

• The specific operating system in use (Windows XP, NT, Server, etc.).

• Applications being used (e.g., Mozilla Firefox, Internet Explorer browsers, Instant Messenger). This lets the hacker target vulnerabilities in those applications later.

The experimental setup involved the following components:

• Windows XP with Service Pack 2 on a VMware platform.

• Dev C++, an open source C++ compiler.

• No Antivirus software or any commercial firewall.

The application developed was a keystroke logger using the GetAsyncKeyState function of the Win32 Native Windows API. It has the ability to take screenshots at any time interval using the gdilib class libraries. We were able to successfully study the malware recording keystrokes and mouse events and taking screenshots. The setup is shown in Figure 3.3.

Figure 3.3: Bad effects of a Malware – Key Stroke Logging

There were some interesting results from this experiment. The logger calls the Native Windows API, which should be quite easy to detect with a decent antivirus solution and an active firewall. API monitors can work with high efficiency against such keystroke loggers. However, the Windows firewall did not pick up this keystroke logger, and the logged information was easily passed on through email and FTP. Internet Explorer failed to detect the keystroke logging. Surprisingly, even the Mozilla Firefox 2 web browser permitted it!
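One way a static scanner might flag such a logger is by looking for the names of keyboard-related API functions among the raw bytes of an executable. The following is a rough sketch under that assumption; the watch list, the function name and the sample bytes are illustrative, and a real tool would parse the import table rather than search for substrings.

```python
# Names of Win32 functions commonly associated with reading keyboard state.
# The choice of names is an assumption of this sketch, not a complete list.
WATCHED_APIS = (b"GetAsyncKeyState", b"GetKeyState", b"SetWindowsHookEx")


def keyboard_api_references(data):
    """Return the watched API names that occur in the given file contents."""
    return [name.decode("ascii") for name in WATCHED_APIS if name in data]


# Hypothetical byte strings standing in for the contents of two executables.
suspect = b"\x00user32.dll\x00GetAsyncKeyState\x00SetWindowsHookExA\x00"
benign = b"\x00kernel32.dll\x00CreateFileA\x00"
print(keyboard_api_references(suspect))
print(keyboard_api_references(benign))
```

A hit is only an indicator, since legitimate programs such as games and accessibility tools also call these functions.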

3.5 Conclusions

The purpose of these experiments was to demonstrate a lifecycle for malware. The order presented is not rigid, and a given piece of malware need not use all of the steps described. Hardware propagation enables the malware to move about and spread among users. Application obfuscation allows malware to be very stealthy and hide its existence, a trait common in rootkits. Finally, we believe that keystroke logging is one of the most dangerous aftereffects of a compromised machine. Working with this lifecycle gives insight into hacker coding and methodology. Hackers have no time constraints and no time to market for their software. They use numerous approaches and possibilities at each and every step of malware creation. In the next chapter, we demonstrate an example of one strain of rogue malware and reveal its features using reverse engineering techniques.

CHAPTER IV

ROGUE SOFTWARE

As the number of Internet users has increased, the number of computers being compromised has grown at an alarming rate. Computer data and resources used to be the prime targets of the so-called hacking community, but now the focus has shifted to a more intriguing and scary venture. Rogue software is an application that is intended to appear as a genuine computer application, but actually harms the user. The harm is usually in the form of stealing confidential information about the user, for the purpose of making money or engaging in other criminal activities. “Rogue” in this context means something which is fake or dishonest.

One particular current type of rogue software advertises itself to be an antivirus application. This chapter presents the complete methodology of infection, propagation and functionality in detail. An attempt has been made to showcase the present security solutions that are fighting to overcome this menace, highlighting their merits and demerits. Finally a more feasible solution to prevent such a security issue is discussed, keeping in mind the level of sophistication of rogue antivirus software and the ineffectiveness of the present day solutions.

4.1 Attack Trends

Rogue software is not limited to any particular set of attack vectors. It has probably gained prominence because of the ease with which the hackers who wrote these fake applications made money from victims. The prevailing trend during this thesis work was the use of Internet-based web support to spread these kinds of rogue applications. The attack trends have shifted to a more sophisticated method of trapping victims.

Users are lured to a fake website through torrent links, spam emails and attachments, Instant Messengers, Internet browser popups and so forth. They are psychologically coerced, through the use of irritating and nagging pop-up windows, into downloading the rogue application to their system, which is then compromised. The payload is pushed down to the system, creating an annoying and confusing environment that leads them into a trap. The victim is asked to purchase the rogue application, in this case, fake antivirus software. The catch is that there is no antivirus solution. After the money is spent, nothing is sent. The malicious code remains on the compromised computer, causing havoc.

We show that popularly used security solutions were not able to detect these rogue applications in time. The frightening implication is that many users fall prey to this type of malware. The problem is that most security software written for home computers is based on static signature logging. This is a technique where the malware is reverse engineered and a unique signature is identified, which antivirus software can then use to detect the malware when it sees it. However, it takes time to reverse engineer malware once it is found, which puts security experts a step behind the hackers rather than a step ahead of them. Clearly this is a reactive solution rather than a proactive one. This will be discussed in more depth in the following chapters.
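The weakness of a purely static signature can be illustrated with a hash-based lookup. This is a minimal sketch in which the signature is assumed to be the MD5 hash of the whole file; the byte strings are hypothetical stand-ins for a known sample and a trivially modified variant, and commercial products use richer signatures than this.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 digest of the file contents as a hex string."""
    return hashlib.md5(data).hexdigest()


def is_known(data, signatures):
    """Signature check: the file is flagged only if its hash is on the list."""
    return md5_hex(data) in signatures


# Hypothetical bytes standing in for an analyzed strain.
known_strain = b"MZ" + b"\x00" * 62 + b"demo payload bytes"
signatures = {md5_hex(known_strain)}

# A variant that differs by a single appended byte.
variant = known_strain + b"\x00"

print(is_known(known_strain, signatures))  # True
print(is_known(variant, signatures))       # False
```

Until the variant itself has been collected, analyzed and added to the signature list, the lookup cannot flag it, which is the lag described above.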

4.2 Reverse Engineering the Rogue Software

We start with a detailed study of a new kind of rogue software, a type of misleading application intended to cause harm to the infected computer [19]. The objective is to identify a pattern or signature for some strains of this rogue software so that a counterattack can be developed. The tools and malware used for this part of the research included:

• VMware Server

• AVG antivirus free edition version 8.0

• Windows XP with SP2

• Strains downloaded - 4 samples

a. Antivirus-XP-2008 (av2008.exe) [20]

b. av2009b.exe [20]

c. epro.exe [20]

d. unidentified strain found on a colleague’s computer, referred to simply as “case study”

When AVG was used to scan the zipped files, it did not report any infections. When it scanned the unzipped files, it identified a problem with av2008.exe, but problems with the other strains went undetected. Using reverse engineering tools, three main files were identified in each of the four malware strains and studied. The files behaved similarly across strains, with the only differences being an increased level of obfuscation and changes in file structuring.

The following were the files associated with one of the strains of rogue software:

1. Userinit.exe – This is the main process which acts as the initiator and runs as a system process.

• Undetectable after successive boot up.

• System registry hookup.

• Payload placement.

• System analytics.

2. Pphc*.exe – This is the auxiliary support application for the main payload.

• System registry hookup.

• Annoying popups are generated by this application.

3. Rhvj*.exe – This is the application dialog window.

• Program runs as a trial version antivirus product and asks for registration for full version.

• URL redirects.

Table 4.1 summarizes the preliminary results of reverse engineering.

Table 4.1: Characteristics of similar Rogue malware strains

Strain                                Antivirus-XP-2008  E-antivirus-Pro-2009  Antivirus-2009  Case Study
AVG Free 8.0                          Yes                No                    No              Not present
McAfee Enterprise 8.5i [Source - UA]  Not present        Not present           Not present     Partial, non intuitive and signature based
File and Registry Access              Obvious            Obvious               Obvious         Obvious
Hex Dump                              Template based     Template based        Template based  Template based
UPX/PE                                Yes                Yes                   Yes             unknown
Prefetch                              Yes                Yes                   Yes             Yes
Hash checks - MD5                     Unique             Unique                Unique          Unique
API                                   Native             Native                Native          Native
Network Capability                    No                 No                    No              Very strong with backdoor agents

4.3 Features

A number of observations can be made from studying these strains. First, there is no uninstall feature, and clicking the uninstall icon only fakes a removal of the software. Novice users are drawn into scams and ultimately end up paying for nothing. The category is definitely adware/spyware. An antivirus product with the most up-to-date signatures might take care of these strains. We reach this conclusion because the Native Windows APIs are used very obviously, which might make the strains easily detectable.

One feature that stands out is the access of Windows Native APIs by a user program. This is not supposed to happen, but unfortunately user programs in the Windows OS architecture have the luxury of accessing kernel files.

Patterns that emerge are:

• Registry hookup by an initiator giving startup functionality at boot time.

• Application obfuscation, placing the executable into the local system folder and making it start as a system process.

• Acting as a legitimate application while causing ill effects post infection.

All the samples indicated the same kind of behavior, if not worse. The newer strains were more sophisticated with backdoor clients and other network capabilities, making it difficult to prevent infection. A definite pattern emerged which we believe can be monitored and used for heuristic based evaluations which will make security options more feasible.
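A heuristic evaluation built on these patterns could be as simple as a weighted checklist. The sketch below is only an illustration of the idea: the field names, the weights and the threshold are hypothetical choices made for this example and were not derived from the samples studied.

```python
from dataclasses import dataclass


@dataclass
class ObservedBehaviour:
    """Behaviours noted while monitoring a program (hypothetical field names)."""
    adds_startup_registry_entry: bool
    copies_itself_to_system_folder: bool
    runs_under_system_process_name: bool
    shows_purchase_prompts: bool


# Illustrative weights; a real system would tune these on labelled data.
WEIGHTS = {
    "adds_startup_registry_entry": 2,
    "copies_itself_to_system_folder": 3,
    "runs_under_system_process_name": 3,
    "shows_purchase_prompts": 1,
}


def heuristic_score(behaviour):
    """Sum the weights of every behaviour that was observed."""
    return sum(w for name, w in WEIGHTS.items() if getattr(behaviour, name))


def is_suspicious(behaviour, threshold=5):
    """Flag the program once the combined score reaches the threshold."""
    return heuristic_score(behaviour) >= threshold


rogue_like = ObservedBehaviour(True, True, True, True)
installer_like = ObservedBehaviour(True, False, False, False)
print(heuristic_score(rogue_like), is_suspicious(rogue_like))
print(heuristic_score(installer_like), is_suspicious(installer_like))
```

Scoring combinations of behaviours rather than any single one keeps ordinary installers, which often add a startup entry, below the threshold.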

4.4 Virus Total Reports

Signature logging is a very static security solution that suffers from dangerous lag time. Hackers need less than 24 to 48 hours to trap their victims for money or confidential information using rogue software. However, this is roughly the time it takes for antivirus companies to find new strains, reverse engineer them, update signatures and push the updates out to customers. In order to give more concrete evidence of this lag time, three of the strains, av2008.exe, av2009b.exe and epro.exe, were run through a service called Virus Total [21]. This is an online portal that scans a submitted file with many up-to-date and arguably successful antivirus solutions. Figure 4.1 shows the results for av2008.exe, Figure 4.2 shows the results for av2009b.exe and, finally, Figure 4.3 shows the results for epro.exe.

Although most of the antivirus solutions caught the rogue application strains, they were not able to distinctly characterize each of the strains, before and after their updates and patches. These strains were in operation around the same dates. All of them originated from the same template based application, and differed mostly in look and feel. Furthermore, some antivirus solutions failed to even register some of these strains because they did not have current, updated signatures for any of the strains tested. Signature logging was clearly found wanting in this area and was not enough of a solution. In the next chapters, we describe an alternative to this method.

Figure 4.1: Virus Total report on the Rogue software strain av2008

Figure 4.2: Virus Total report on the Rogue software strain av2009b

Figure 4.3: Virus Total report on the Rogue software strain epro

CHAPTER V

DATA MINING

In this chapter, we take a first step towards predictive modeling. Looking at various strains of rogue software, it quickly becomes apparent that they go through a morphing process. Rather than taking the traditional, reactive, signature based approach, we believe there is merit in looking for patterns in the strains. If patterns are present, what are they saying? Will they reveal any useful information that could be used to formulate a proactive approach to quickly identify them?

5.1 Patterns

We start by identifying attributes that we believe will be useful. Table 5.1 shows a list of fourteen such attributes used in this research. The fields in the table are not limited to these and they are subject to change as well. Patterns may be analyzed using simple groupings of fields or a more traditional and scientific approach using data mining may be employed.

Table 5.1: Attribute list of the malware dataset with type characteristics

Number  Attribute name                 Attribute type
1       File size in bytes             Long integer
2       MD5 Hash                       String
3       Packer method PE/UPX           String
4       Registry entries               Integer
5       Native Windows API references  Integer
6       Directory access               Boolean
7       File access                    Boolean
8       Internet access                Boolean
9       URL references                 Integer
10      Unique strings                 Boolean
11      Date/Time                      Small date-time format
12      Programming platform           String
13      Windows DLL references         Integer
14      Label                          String

The results may not be very revealing because of the fields chosen and the lack of input data (49 samples at present). The intention here is to indicate the expected outcome of anomaly detection using the patterns generated. This experiment in its entirety is merely an initial attempt at using patterns to reveal how to stay ahead of the malware writers.
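Some of the attributes in Table 5.1 can be read directly from the bytes of a sample without running it. The following is a rough sketch of how a few of them might be populated; the function name, the URL pattern and the use of the UPX0 and UPX! markers as a packer hint are assumptions of this example and do not describe the tools actually used to build the dataset.

```python
import hashlib
import re

# Printable, non-space characters following an http or https scheme.
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")


def extract_static_features(data):
    """Derive a few Table 5.1 style attributes from raw file contents."""
    return {
        "file_size": len(data),                          # attribute 1
        "md5": hashlib.md5(data).hexdigest(),            # attribute 2
        # Attribute 3: marker strings are only a hint that UPX was used.
        "packer": "UPX" if b"UPX0" in data or b"UPX!" in data else "unknown",
        "url_references": len(URL_PATTERN.findall(data)),  # attribute 9
    }


# Hypothetical bytes standing in for a sample.
sample = b"MZ\x90\x00UPX0\x00http://example.test/buy https://example.test/x\x00"
features = extract_static_features(sample)
print(features["file_size"], features["packer"], features["url_references"])
```

Attributes such as registry entries or API references need disassembly or monitored execution and are outside the scope of this sketch.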

5.2 Dataset

The dataset as described in this research consists of 49 data samples with appropriate data attributes. All the attributes were selected and populated in the dataset based on an intuitive approach. The data collection was guided with the objective of finding patterns rather than being data centric.

Rigorous reverse engineering was the starting point for acquiring the data required for the dataset using several tools mentioned below:

• MAP [7] – Malcode Analyst Pack for hash code analysis, string operations, URL referencing etc.

• IDA Pro [22], hview32demo [23], ResHack [24] – tools used for obtaining the hex dumps necessary for disassembling the payload.

The reverse engineering process was carefully and systematically carried out for each and every malware sample. To add validity, some of the samples are not malware. They are well known and understood application samples. Both the malware and non-malware samples formed the dataset for the data mining experiments. Refer to Appendix A for a more detailed explanation on the dataset used here.

5.3 System Design

Knowledge Discovery in Databases (KDD) is a process of discovery in which we search for and interpret useful patterns in our dataset. Data mining is the part of KDD where we apply algorithms to extract those patterns from the data [25]. Figure 5.1 shows these steps.

Figure 5.1: Steps of the KDD Process

• Target data selection

o The first step is to understand the application domain. With this knowledge, we then create a target dataset.

• Pre-process the selected data

o This step involves data cleaning and preprocessing.

• Transformation of the dataset

o Here we perform data reduction and eliminate attributes that do not show promise.

• Data mining

o We applied a number of different algorithms for the data mining step. These include association rule mining, classification rule mining, and clustering.

• Interpretation and evaluation of patterns

o In the final step, we validate and verify any patterns discovered and create a knowledge base.

5.4 Implementation

Dataset selection is a very important step in the KDD process, and the dataset must be in a proper format before it can be analyzed further. Appendix A illustrates the format of the dataset. The dataset did not require major pre-processing, other than the removal of samples that yielded very little data during the reverse engineering phase. Data reduction was out of the question since the dataset was quite small.

After the dataset was prepared for the data mining task, it was imported from Excel format into a table in Microsoft SQL Server 2005 using the import data services.

As part of the data mining tasks, two operations were run on the dataset:

1. Decision Trees, which are based on regression based analysis [26].

2. Naive Bayesian Classification, which is based on conditional probability measurements [26].

The data mining task was run on a training sample comprising two thirds of the main dataset. This is the standard learning procedure followed for data mining tasks. The test sample, the remaining one third of the dataset with the Label attribute removed, may then be evaluated against the learning model for verification and validation purposes. Figure 5.2 shows the Data Source View depicting the attribute list for the training set in Microsoft SQL Server 2005.

Figure 5.2: Data source view of the malware dataset in Microsoft SQL Server 2005
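The partition into training and test samples can be sketched in a few lines. This is an illustration under the assumption that the split is two thirds to one third and is made at random; the function name, the fixed seed and the placeholder records are hypothetical, and the actual split was performed inside SQL Server.

```python
import random


def split_dataset(samples, train_fraction=2 / 3, seed=0, label_key="Label"):
    """Shuffle the samples and split them into training and test sets.

    The Label attribute is removed from the test records so that they can
    later be scored against the learned model.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    training = shuffled[:cut]
    test = [
        {key: value for key, value in row.items() if key != label_key}
        for row in shuffled[cut:]
    ]
    return training, test


# 49 placeholder records standing in for the dataset.
samples = [{"id": i, "Label": "malware" if i % 2 else "benign"} for i in range(49)]
training, test = split_dataset(samples)
print(len(training), len(test))  # 33 16
```

With only 49 samples the test set holds 16 records, which gives a sense of how little data is available for validation.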

5.5 Decision Trees

The first data mining experiment used decision trees as the pattern detection algorithm. The decision trees were constructed using the initial entropy based attribute selections. The Label attribute listed in Table 5.1 was the predictable, or decision, attribute.

The decision tree algorithm reported that it could not split based on the attribute selection and the dataset input. By split, we mean that it could not decide which attribute to branch on in order to classify the dataset under the given constraints. This resulted in a no-split decision tree, as seen in Figure 5.3, and hence the learning model failed to yield positive results. Refer to Appendix B for a detailed explanation of the application of this algorithm to our dataset.

Figure 5.3: A one-level, no-split Decision Tree as seen in the Microsoft Tree Viewer of SQL Server 2005
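To see why a tree may refuse to split, it helps to compute the information gain of each attribute by hand. The sketch below uses the textbook entropy measure on a four-row toy table; the rows and attribute names are invented for illustration, and the scoring used internally by the Microsoft algorithm may differ from this formula.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="Label"):
    """Reduction in label entropy obtained by splitting on one attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Invented toy rows: one attribute is uninformative, the other is decisive.
rows = [
    {"internet_access": "yes", "many_registry_entries": True, "Label": "malware"},
    {"internet_access": "yes", "many_registry_entries": False, "Label": "benign"},
    {"internet_access": "no", "many_registry_entries": True, "Label": "malware"},
    {"internet_access": "no", "many_registry_entries": False, "Label": "benign"},
]
print(information_gain(rows, "internet_access"))
print(information_gain(rows, "many_registry_entries"))
```

When every candidate attribute behaves like internet_access in this toy table, no split improves the tree, and the result is a single node such as the one in Figure 5.3.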

5.6 Naive Bayesian Classifier

The classifier was built using the initial entropy based attribute selections. The Label attribute shown in Table 5.1 was the predictable, or class, attribute. Here again, the classifier failed to find attributes whose conditional probabilities clearly distinguish the data samples into the classes specified by the Label attribute. Hence the classifier output showed no data attributes related to the Class Label, as shown in Figure 5.4. Refer to Appendix B for a detailed explanation of the application of this algorithm to our dataset.
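The role of the conditional probabilities can be made concrete with a small categorical Naive Bayes model. This is a minimal sketch with add-one smoothing on invented rows; it is not the Microsoft implementation, and the attribute names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target="Label"):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row in rows:
        for attribute, value in row.items():
            if attribute != target:
                value_counts[(row[target], attribute)][value] += 1
                values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen


def posterior(model, sample):
    """Return P(class | sample) for every class, using add-one smoothing."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        p = n / total  # prior
        for attribute, value in sample.items():
            count = value_counts[(cls, attribute)][value]
            p *= (count + 1) / (n + len(values_seen[attribute]))
        scores[cls] = p
    norm = sum(scores.values())
    return {cls: p / norm for cls, p in scores.items()}


# Invented toy rows, not taken from the thesis dataset.
rows = [
    {"internet_access": "yes", "many_registry_entries": True, "Label": "malware"},
    {"internet_access": "yes", "many_registry_entries": False, "Label": "benign"},
    {"internet_access": "no", "many_registry_entries": True, "Label": "malware"},
    {"internet_access": "no", "many_registry_entries": False, "Label": "benign"},
]
model = train_naive_bayes(rows)
print(posterior(model, {"internet_access": "yes"}))
print(posterior(model, {"many_registry_entries": True}))
```

An attribute whose value is equally likely in every class leaves the posterior equal to the prior, which corresponds to an attribute showing no link to the Class Label in a dependency network.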

Figure 5.4: The Class Label shows no relationship with the data attributes in the Dependency Network of the Microsoft Naive Bayesian Classifier output

As expected, the results were skewed. Some features of the dataset that skewed the outcome were as follows:

• The number of samples and the dimensionality, i.e. the number of data attributes, were small. The dataset was assembled to represent the objective, but the attributes were still not enough.

• Typically, data mining tasks require large datasets with many dimensions (attributes) for better results.

• The dataset was created with an intuitive approach rather than being guided by an expert in the security and data mining fields.

• The technical detail in the dataset was limited because of time constraints.

5.7 Pattern Analysis

Once it was clear that the dataset was not sufficient to be classified effectively using data mining classification algorithms, the next task was to look for patterns or relationships between the data attributes that could be put to use in classification at a later stage. Hence the Microsoft Clustering algorithm [26] was used to understand the data correlation in the dataset. The Microsoft Clustering algorithm iteratively groups records from a dataset into clusters that have similar characteristics. The purpose is to search for and identify general groupings in the data [26]. The entire dataset, without the Label attribute, was fed into the clustering algorithm as the input, and the results observed are shown in Table 5.2 and Figures 5.5 and 5.6. Though outliers were found in almost all cases, there was definite evidence of specific data attributes being used to cluster a few data samples based on the data correlation. Refer to Appendix B for a detailed explanation of the application of this algorithm to our dataset.

Table 5.2: Microsoft Clustering algorithm applied on the malware dataset

Sample  Attribute         State  Cluster Number
1       Internet Access   yes    9
2       Registry Entries  < 8    9
3       Url References    0      1, 9

Figure 5.5: Microsoft Cluster algorithm applied on the malware dataset in Microsoft SQL Server 2005

Further, the probability distributions can be used very effectively for data selection and classification procedures in order to provide better results.

Figure 5.6: Cluster 9 characteristics with probability distribution in Microsoft SQL Server 2005

With such distributions, the interpretation and evaluation of these patterns become easier and more effective, and the results obtained can in principle be put to practical use.

However, the experiments conducted were disappointing. Overall, there simply are not nearly enough samples for this technique to be applied. In addition, the high level of ambiguity present in the dataset makes it harder for the data mining algorithms to perform efficiently. In the next chapter, we explore the methodology of rough sets as a feasible solution given these constraints on our dataset.

CHAPTER VI

ROUGH SETS

6.1 Predictive Modeling

While working with the data mining algorithms, we saw that the dataset was very ambiguous and the attributes inconsistent. In order to arrive at the core attributes (representing dependencies between values of attributes in the form of decision rules), all superfluous attributes and cases in the information system need to be reduced where possible. The attributes are of different types, such as nominal, continuous or ordinal, while other attributes might be interval scaled. This situation often exists in practical applications. The inconsistency indicates that some attributes have values which make a decision imprecise. Hence, we introduce the application of rough sets [27] in this research to help solve this problem. Rough sets approximate the original set by rough subsets. This is done to remove the ambiguities or imprecision present in the original dataset, which might be hampering our grouping or classification decisions. The rough subsets are obtained by applying the lower and upper approximations of rough set theory [27] to a given set. The rough subsets so obtained give precise bounds through their crisp values, thus making our decision on the imprecise set easier.
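The lower and upper approximations can be computed directly from the equivalence classes induced by the condition attributes. The following is a small sketch on five invented records; the attribute names and values are hypothetical and are not drawn from the thesis dataset.

```python
from collections import defaultdict


def approximations(rows, attributes, target, target_value):
    """Return the lower and upper approximations of a decision class.

    Records with identical values on the chosen attributes are
    indiscernible and form one equivalence class (block).
    """
    blocks = defaultdict(set)
    for index, row in enumerate(rows):
        blocks[tuple(row[a] for a in attributes)].add(index)
    concept = {i for i, row in enumerate(rows) if row[target] == target_value}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper


# Invented records for illustration.
rows = [
    {"url_references": "many", "registry_entries": "many", "Label": "rogue"},
    {"url_references": "many", "registry_entries": "many", "Label": "rogue"},
    {"url_references": "none", "registry_entries": "few", "Label": "benign"},
    {"url_references": "none", "registry_entries": "many", "Label": "rogue"},
    {"url_references": "none", "registry_entries": "many", "Label": "benign"},
]
lower, upper = approximations(
    rows, ["url_references", "registry_entries"], "Label", "rogue"
)
print(sorted(lower), sorted(upper))
```

Records 3 and 4 share the same attribute values but carry different labels, so they fall in the upper approximation only. That boundary region is where the imprecision in the data lives.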

When using rough sets, the first task is the formation of the core attributes, which are most likely to impact the decision attribute. Next is the reduction of attributes which might make the decision attribute imprecise. The final task is to generate decision rules using the core attributes, with high precision in the decision attribute, and to use these decision rules to create the learning models. In this chapter, we apply this technique to see if there is merit in applying rough sets to help determine whether an executable should be categorized as rogue software.

6.2 Implementation

The dataset used here is the same one used in the data mining experiments. The tool suite used was BLEM2 [28], a web based solution comprising a rough sets implementation with pre-processing tools, machine learning tools to generate the rules, and rule based classification, which forms the main computing class in the tool suite. The web based solution in its entirety, with all its web services, is the first of its kind. Figure 6.1 shows a snapshot of the web interface.

Figure 6.1: Web interface for BLEM2 rough sets tool suite

The first task was to find the set of core attributes which would be useful for generating the classification rules. Several runs over the dataset were used to clearly define the core attribute set, removing the redundant or inconsistent attributes that would otherwise have skewed the decision attribute. The following attributes were retained:

• URL references.

• API references.

• DLL file references.

• Registry entries.

• Label which is the decision attribute.

The formation of the core attributes in the previous stage suggests that many of the other attributes seriously hampered decision making, because their ambiguity made it difficult to classify the dataset. This illustrates more clearly the reason for the disappointing results of the data mining process.
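In rough set terms, an attribute is indispensable when removing it shrinks the set of records that can be classified without contradiction, the positive region. The sketch below applies that test to four invented records; it illustrates the general idea only and is not a description of how BLEM2 selects attributes.

```python
from collections import defaultdict


def positive_region(rows, attributes, target="Label"):
    """Indices of records whose equivalence class carries a single label."""
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[tuple(row[a] for a in attributes)].append(index)
    consistent = set()
    for members in blocks.values():
        if len({rows[i][target] for i in members}) == 1:
            consistent.update(members)
    return consistent


def core_attributes(rows, attributes, target="Label"):
    """Attributes whose removal changes the positive region."""
    full = positive_region(rows, attributes, target)
    core = []
    for attribute in attributes:
        rest = [a for a in attributes if a != attribute]
        if positive_region(rows, rest, target) != full:
            core.append(attribute)
    return core


# Invented records for illustration.
rows = [
    {"url_references": "many", "registry_entries": "many", "file_access": "yes", "Label": "rogue"},
    {"url_references": "none", "registry_entries": "many", "file_access": "yes", "Label": "benign"},
    {"url_references": "none", "registry_entries": "few", "file_access": "no", "Label": "benign"},
    {"url_references": "many", "registry_entries": "few", "file_access": "yes", "Label": "rogue"},
]
print(core_attributes(rows, ["url_references", "registry_entries", "file_access"]))
```

In this toy table only url_references is indispensable. The other two attributes can each be dropped without making any record ambiguous.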

The next task was to generate the decision rules, which are in the natural language format for easy understanding and data analysis, using the BLEM2 machine learning tools. The screenshot is shown in Figure 6.2.

Figure 6.2: Rules generated using machine learning tools in BLEM2 tool suite

A total of 19 rules were generated for a tiny dataset of 49 samples. The rules generated were taken as the input to the next phase. These decision rules show how the conditional attributes are indeed dependent on each other. These dependencies can be effectively exploited to indicate patterns which would otherwise be difficult to recognize and interpret. Refer to Appendix C for a detailed explanation of the application of this technique to our dataset.

6.3 Results and Evaluation

The results indicate clear decision rules to efficiently classify or interpret patterns relevant to the dataset’s objectives. The generated rules were tested on different input sets to evaluate patterns.

Figure 6.3: All possible rule sets for the malware dataset

As shown in Figure 6.3, the set of all possible rule sets has been laid out with the input being ‘*’ (all). This learning model was then evaluated for different input sets. The results, though not quite efficient, are indeed indicative of the strength of this methodology over others in the case of ambiguous and inconsistent datasets, which exist in most real world data samples.

Figure 6.4: Rule based Classifier in BLEM2 rough sets tool suite

Figure 6.4 shows that a sample with the given conditional attributes would be labeled as shown, based on the support counts, certainty factors and coverage of the rules for that decision attribute. With more attributes and features, clearer definitions of the datasets and greater technical detail, large and powerful datasets can be created. Those datasets can ultimately be fed into such rule-based systems to build accurate and precise learning models that can be put to practical use for real world applications in the security domain.
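The three rule measures mentioned above have simple definitions: support is the number of records matching both the conditions and the decision, certainty is support divided by the number of records matching the conditions, and coverage is support divided by the number of records with that decision. The sketch below computes them on invented records; the rules and values are hypothetical and are not among the 19 rules generated by BLEM2.

```python
def rule_measures(rows, conditions, decision, target="Label"):
    """Return (support, certainty, coverage) of the rule conditions -> decision."""
    matching = [
        row for row in rows
        if all(row[attribute] == value for attribute, value in conditions.items())
    ]
    support = sum(1 for row in matching if row[target] == decision)
    decided = sum(1 for row in rows if row[target] == decision)
    certainty = support / len(matching) if matching else 0.0
    coverage = support / decided if decided else 0.0
    return support, certainty, coverage


# Invented records for illustration.
rows = [
    {"url_references": "many", "registry_entries": "many", "Label": "rogue"},
    {"url_references": "many", "registry_entries": "many", "Label": "rogue"},
    {"url_references": "none", "registry_entries": "few", "Label": "benign"},
    {"url_references": "none", "registry_entries": "many", "Label": "rogue"},
    {"url_references": "none", "registry_entries": "many", "Label": "benign"},
]
print(rule_measures(rows, {"url_references": "many"}, "rogue"))
print(rule_measures(rows, {"url_references": "none", "registry_entries": "many"}, "rogue"))
```

A rule-based classifier can then prefer rules with high certainty and break ties using support or coverage when several rules match the same sample.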

CHAPTER VII

CONCLUSIONS AND FUTURE WORK

One of the important aspects that was exciting about this research work was the opportunity to work with live malware samples. After a lot of background study and research in the field of reverse engineering malware, in particular for the Microsoft Windows platform, we were able to systematically understand the methodologies for applying reverse engineering techniques using actual tools used in the computer security profession. This made us realize their importance and usage, and better appreciate their application.

This research experience helped us to understand the technical aspects involved in handling malware and performing malware analysis. There is a need to understand the safe environment required in terms of virtual machines and how they function for this purpose. Similarly, malware analysis requires the researcher to understand the intricate details of analyzing hex dumps, disassembling the malware, code analysis and so on.

With the research objective of finding a proactive solution for identifying new malware, there was a need to understand the malware code in depth and detail. Hence the Lifecycle of a Malware was introduced here. A detailed process of the lifecycle has been discussed, which led to conclusions that make us realize the amount of effort and constraints involved in this line of work. In the quest for a solution, data mining and rough set technologies were applied here. To the best of our knowledge, this has never been done before. From this research work, we can realize the usefulness of these data analysis approaches in the world of security. Malware datasets were created and analyzed and results listed, indicating the need for further research in the same direction.

Because malware is so sophisticated, it morphs its behavior and changes attack vectors. It is practically impossible to take care of every strain that we come across. Although the data mining results were disappointing, we believe rough sets can be used as a feasible approach for establishing suitable patterns that distinguish malware from benign software. Rigorous research and experimentation with different strains of malware is required as future work, to generate more powerful patterns that can be placed into rule-based engines as a proactive solution. The integration of these approaches with present day security solutions is then very important to make them viable for the average computer user. This is also extremely challenging future work and essential for a security solution to be useful in the everyday computing world.

REFERENCES

[1] P. Dragan, V. Milutinovic, B. Djordjevic, “Psychological Profile of Network Intruder,” VIP Symposia on Internet related research internet conference, Belgrade, 2006 www.internetconferences.net/ipsi

[2] S. Stover, D. Dittrich, J. Hernandez and S. Dietrich, “Analysis of the Storm and Nugache Trojans: P2P is here,” USENIX vol. 32, no. 6, December 2007

[3] Dr. Jose Nazario, “Botnet tracking tools, Techniques and lessons Learned,” Arbor Networks, 2007

[4] VmWare - commercial virtual machine software tool. www.vmware.com

[5] FileMon and RegMon - File and Registry monitoring tools from Microsoft Corporation.

[6] Sniff hit, Wireshark sniffer, Packetyzer These are Windows user interface(s) for the ethereal packet analysis engine. www.ethereal.com

[7] Malcode Analyst Pack Malware code analyzer from Idefense labs, a VeriSign company. www.labs.idefense.com

[8] Symantec Security response article - W32.Nugache.A strain www.symantec.com

[9] P2P data sharing application from Limewire www.limewire.com

[10] AVG Antivirus solution Free Edition Version 8.0 www.free.avg.com

[11] Data Navteq – Digital map reporting tools www.navteq.com

[12] Cheyenne Wired Topix Articles on Nugache Creator – Jason Micheal Millmont MacRonin Hacker, “Launches Botnet Attack via P2P Software,” June 30, 2008 Dan Goodin, “Nugache was mine - Posted in Software & Security,” San Francisco, June 28, 2008

[13] Nugache Creator Jason Micheal Millmont - Plea Agreement www.theregister.co.uk/2008/06/28/nugache_creator_plea_agreement This case was investigated by the Federal Bureau of Investigation. Assistant United States Attorney, Wesley L. Hsu Chief, Cyber and Intellectual Property Crimes Section

[14] Skype VoIP tool www.skype.com

[15] Dev C++, a C++ compiler www.bloodshed.net/devcpp.html

[16] Hack U3 USB Smart Drive to Become Ultimate Hack Tool www.dotnetwizard.net/soft-apps/hack-u3-usb-smart-drive-to-become-ultimate-hack-tool

[17] Online tutorials and programming sources for Keystroke Logging, Screenshot capture. www.irongeek.com

[18] MSDN Library for Microsoft Windows Operating System API.

[19] Symantec Security response article – Rogue softwares strains antivirusXP2008, eantiviruspro, antivirus2009 www.symantec.com

[20] Unofficial computer and software security groups guild available at www.offensivecomputing.net

[21] Virus Total is a web service of Hispasec Sistemas, a security firm headquartered in Spain and branches in Mexico and Argentina. www.virustotal.com, www.hispasec.com

[22] IDA Pro – the multi-processor, multi-OS, interactive disassembler, by DataRescue. www.datarescue.com

[23] Hview DEMO for Windows (based on Hview 7.01) www.hiew.ru

[24] ResHack resource analyzer tool www.angusj.com/resourcehacker

[25] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “Knowledge Discovery and Data Mining: Towards a Unifying Framework,” in KDD-96 Conference Proceedings, ed. E. Simoudis, J. Han, and U. Fayyad, AAAI Press, 1996 www.aaai.org/Press/Proceedings/kdd96.php

[26] MSDN library for Microsoft SQL Server 2005 data mining algorithms Microsoft Decision Trees Algorithm – www.msdn.microsoft.com/en-us/library/ms175312.aspx Microsoft Naive Bayes Algorithm – www.msdn.microsoft.com/en-us/library/ms174806.aspx Microsoft Clustering algorithm – www.msdn.microsoft.com/en-us/library/ms174879.aspx

[27] Zdzisław Pawlak, “Rough Sets: Theoretical Aspects of Reasoning about data,” Kluwer Academic Publishing, 1991

[28] Chan, C.-C. and S. Santhosh, “BLEM2: Learning Bayes’ rules from examples using rough sets,” Proc. NAFIPS 2003, 22nd Int. Conf. of the North American Fuzzy Information Processing Society, July 24 – 26, 2003, Chicago, Illinois, pp. 187-190

[29] Torrent sites and other internet sites accessed for malware samples www.sumotorrent.com, www.isohunt.com, www.torrentreactor.net, www.utorrent.com

[30] Microsoft SQL Server 2005 from Microsoft Corporation www.microsoft.com

APPENDICES

APPENDIX A

THE MALWARE DATASET

For analysis of the data obtained from reverse engineering, there was a need for a well defined dataset. This section describes in detail how the dataset was created for research purposes.

In the process of collecting malware samples [20, 29], the creation of the dataset was purely intuitive and data-centric. Data-centric here means a brute force approach to get as much information as possible from the reverse engineering process on the malware samples. With the goal of data mining, the attributes selected for the dataset were predetermined before data collection. The attribute list is shown in Table A.1.

Table A.1: Attribute list used for creating the malware dataset

Number  Attribute name                  Attribute type
1       File size in bytes              Long integer
2       MD5 Hash                        String
3       Packer method PE/UPX            String
4       Registry entries                Integer
5       Native Windows Api references   Integer
6       Directory Access                Boolean
7       File access                     Boolean
8       Internet Access                 Boolean
9       Url references                  Integer
10      Unique strings                  Boolean
11      Date/Time                       Small date-time format
12      Programming Platform            String
13      Windows DLL references          Integer
14      Label                           String

After deciding on the attributes list, the next challenge was the phase of data collection. The malware samples were reverse engineered on a per sample basis using different tools. They are as follows:

1. Malcode Analyst Pack [7] – for collecting:
   • File/payload size in bytes, strings of interest (which were very obvious in pointing to the fact that code obfuscation was used).
   • Message Digest 5 (abbreviated as MD5, this file hashing technique helps in identifying similar strain types to avoid duplication).
   • Date/Time references for the payload.
   • Uniform Resource Locator references (abbreviated as URL, typically the string dump was examined manually for http, https, ftp, sftp, scp; and network domains like com, net, org etc).

2. File monitor and Registry Monitor [5] – for collecting information on the access to the local file system and the directory and the registry hooks or entries for the payload.

3. IDA Pro [22] – for disassembling and debugging the malcode for data analysis. Information like Packer method PE/UPX, Native Windows API calls, Internet access using standard Windows API calls, Programming platform and Windows Dynamic Link Library (DLL) references was gathered using the above mentioned tools.
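The static attributes collected in the first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the Malcode Analyst Pack: the function names, the four-character minimum for a printable string, and the URL pattern (built from the schemes listed above) are all hypothetical choices.

```python
import hashlib
import re
from pathlib import Path

# Assumed pattern: the schemes that were searched for manually in the string dump.
URL_PATTERN = re.compile(rb"(?:https?|s?ftp|scp)://[\x21-\x7e]+", re.IGNORECASE)
# Assumed definition of a "string": a run of at least four printable ASCII bytes.
STRING_PATTERN = re.compile(rb"[\x20-\x7e]{4,}")


def static_features(data: bytes) -> dict:
    """Return file size, MD5 hash and URL reference count for a payload."""
    strings = STRING_PATTERN.findall(data)
    url_refs = sum(len(URL_PATTERN.findall(s)) for s in strings)
    return {
        "file_size": len(data),                 # attribute 1 in Table A.1
        "md5": hashlib.md5(data).hexdigest(),   # attribute 2 in Table A.1
        "url_references": url_refs,             # attribute 9 in Table A.1
    }


def static_features_from_file(path: str) -> dict:
    """Read a sample from disk without executing it and extract its features."""
    return static_features(Path(path).read_bytes())
```

Such a script only reads the bytes of a sample; it should still be run inside the isolated virtual machine environment, like every other step that touches live malware.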

The procedure of collecting data was applied to each malware sample and to other genuine application samples with no bias. The Label attribute was tagged in each sample as a string “good” or “bad” to distinguish between the two. The following Tables A.2, A.3, A.4, A.5, A.6 and A.7 depict the dataset.

Table A.2: Malware dataset with the file size and MD5 hash attributes

Sample No.  File Size in bytes  Message Digest 5 Hash
1           91424               abf87b30b970fa10cd660ca69a988e15
2           100356              b7d5753de85d50f1847fa8b699113a62
3           167936              e2b90e2d9a7642ad412b12e7b252c2d3
4           54272               6d174fc3df6c79db83a9036c4f51e8c8
5           85200               e24237385abe499806a5b9f240fb378a
6           1399061             e979fb2eb504972ed87ad3c825ec6c2c
7           2006502             8c396fbdacce214de2e86354a77350d2
8           59392               a591501ce6746babf13554c3b6432f97
9           64048               39069e0546624ae139c88a8f370383af
10          174161              f17799b67ebaba4fe258113a6828059c
11          61440               18a752131bf9770050613d030dc62125
12          64050               589dd217560edec544557c79f92b95db
13          108548              85c29f997ee1906b05f40d9ace2b70db
14          82944               74c09011bb97eb30cc73286657fe81c9
15          53248               cdfae03ca18bbaf307a77f9ba2bb7b38
16          172164              3855fce7973768de5599da6c8a095c6c
17          81924               d6b94251a65e5d4fcb2d275b8f944a46
18          86020               dc5db4b6c577a01ec43faa6e9c63a716
19          69632               e777a667407f8c5a100da1297156af46
20          70656               aac629a95c1da11217c96c133f2763e2
21          90116               d07f1603b5743c28daba50ecc95932c8
22          81931               361ca4f9964c861bd8d512604c5266d6
23          109056              4d698c16614cbe57735359b8bca9fa56
24          81924               0362213297ccd9b7952a77a9baa3ac7b
25          77377               28d328f10f7315299798215d3df6abd2
26          81924               c10c2c06d45f5565e271515ab379f3a5
27          124928              d1d65d2f22121f65ea114ac9cecc380a
28          71168               1efa320c8b8347c67823f074dc5feb2d
29          2109201             499d7dacb0dc68c83650b4fd3928d1dd
30          87552               c52e23bc44b3e955ecb8a3182785c148
31          102912              41d6aa1f6f7a8ab2c9ef5ce5e42a5f82
32          12288               639311b81c19087ab39d60be418d9c1d
33          71680               ca97f5d068663491751814e9263b4425
34          81924               b60dfd3a7d90b5bb0727b8a55daa1636
35          167936              c63170a24e9cea93da3b63be6c07517d
36          40960               a8ad8adeb5e5153173e9cccbbf3bcdeb
37          145920              039d7be393458990f7b8c353c8bf6a69
38          310272              c16c12d4335675f5f18db88d8b65d8d0
39          9096848             faef0c37b86cddd96cb80c382861eae05
40          459533              c1894e46ff89be6ca35729d9dab6145
41          9326468             a0fa0a34a842dbfd9d3b0ca834311926
42          2067073             20b5a8f02ec56ddbc230cc1ffef67d88
43          15689246            dd7efc663be7f523148f62577e5e44ce
44          8513872             d1ca06e013031eafeb2ac09422ee820f
45          2003884             b75f17199ab6eb781595758c51413ef3
46          2529640             fbcf0a3f70561df83ba6bbbd786dea3e
47          20769483            388386d0ffc2c54ad94575ee19edd8
48          6903874             91631142926f60cb9b4f63ae0667abd7
49          9917089             f6e9091948d810b28fabaca943dbd55

Table A.3: Malware dataset with the time/date and filename attributes

Sample No.  Time/Date         Filename of the payload
1           10/27/2007 22:51  WinAntiVirusPro2007FreeInstall.exe
2           8/22/2008 2:18    antivirus.v.1.0.exe
3           11/17/2008 1:06   Antivirus_2009_pro-scan-online.com.exe
4           12/2/2007 13:20   XPantivirus2008_v880019.exe
5           6/29/2007 21:57   WinAntiVirusPro2006Install_de.exe
6           7/26/2008 11:51   AntivirusXP2008Installer.exe
7           9/17/2008 3:15    eAntivirusProInstaller.exe
8           4/2/2008 16:44    XPantivirus2008_v880201.exe
9           3/24/2008 9:50    XPantivirus2008_v880011.exe
10          9/10/2008 5:36    ivrl-antivirus-600457.zip
11          8/16/2008 4:33    Antivirus-XP-2008.exe
12          3/25/2008 10:17   XPantivirus2008_v77011807.exe
13          8/27/2008 15:58   antivirus.v.1.exe
14          7/12/2008 9:45    AntivirusXP2008Installer.exe
15          8/5/2008 18:41    Antivirus-XP-2008.exe
16          9/30/2007 16:03   avg_antivirus_7_5_key.exe
17          2/20/2009 18:54   unknown
18          1/31/2009 9:39    unknown
19          2/16/2009 2:14    unknown
20          1/27/2009 8:42    unknown
21          2/10/2009 20:50   unknown
22          2/24/2009 6:30    unknown
23          10/26/2008 5:02   unknown
24          1/24/2009 3:25    unknown
25          10/6/2008 17:15   unknown
26          2/2/2009 18:55    unknown
27          3/28/2009 2:37    unknown
28          2/9/2009 16:45    unknown
29          10/30/2008 8:20   unknown
30          8/28/2008 7:34    unknown
31          1/26/2009 17:42   unknown
32          1/12/2009 0:18    unknown
33          2/18/2009 19:03   unknown
34          2/5/2009 10:05    unknown
35          1/14/2009 10:36   InstallAVg_880460.exe
36          12/7/2008 14:30   spyprotector_install_4173.exe
37          8/29/2008 12:33   XPAinstall_880711.exe
38          1/12/2009 8:36    winsystems.dll
39          11/2/2007 0:48    capsa66_demobuild123
40          6/19/1990 0:56    CaptureBat-Setup-2.0.0-5574
41          6/21/2008 10:52   devcpp-4.9.9.2_setup.exe
42          4/3/2008 23:22    map_setup.exe
43          3/11/2009 22:31   idafree49.exe
44          4/3/2008 23:13    PacketTrap_PT360_Tool_Suite_Setup.exe
45          4/3/2008 23:13    SysAnalyzer_Setup.exe
46          4/3/2008 23:13    whereis_sdk.exe
47          6/19/1990 0:58    wireshark-setup-0.99.7.exe
48          10/28/2008 19:56  ca_setup.exe
49          9/1/2008 17:56    CiscoVPN_setup.exe

Table A.4: Malware dataset with the unique strings, URL, registry and API reference attributes

Sample No.  Unique Strings  URL references  Registry entry  API references
1           yes             21              8               188
2           yes             0               0               103
3           no              0               52              109
4           no              1               1               15
5           yes             17              8               188
6           no              1               9               155
7           no              1               1               36
8           no              0               0               9
9           yes             0               3               36
10          yes             0               1               61
11          no              0               1               61
12          yes             0               3               36
13          no              0               3               112
14          no              0               0               22
15          no              0               1               48
16          yes             0               6               147
17          no              0               18              85
18          no              0               20              93
19          no              0               9               27
20          no              0               0               7
21          no              0               19              97
22          no              0               15              65
23          no              0               0               17
24          no              0               20              93
25          no              1               10              155
26          no              0               20              94
27          no              0               7               25
28          no              0               8               25
29          no              3               10              155
30          yes             0               0               89
31          no              0               13              126
32          no              0               0               10
33          no              0               6               16
34          no              0               18              90
35          no              0               39              70
36          yes             1               5               76
37          no              0               0               40
38          no              0               10              133
39          yes             11              3               96
40          yes             0               10              154
41          yes             4               10              160
42          yes             3               3               89
43          yes             4               3               96
44          yes             30              10              272
45          yes             2               3               89
46          yes             30              10              187
47          yes             0               10              155
48          no              4               1               78
49          yes             4               0               50

Table A.5: Malware dataset with the file packer attributes

Sample No.  File Packer
1           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
2           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
3           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
4           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit, UPX compressed
5           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
6           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit, Nullsoft Installer self-extracting archive
7           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
8           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
9           MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
10          Zip archive data, at least v2.0 to extract
11          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
12          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
13          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
14          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
15          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
16          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit, RAR self-extracting archive
17          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
18          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
19          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
20          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
21          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
22          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
23          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
24          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
25          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
26          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
27          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
28          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
29          Nullsoft Installer self-extracting archive
30          MS-DOS executable PE for MS Windows (console) Intel 80386 32-bit
31          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
32          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit, UPX compressed
33          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
34          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
35          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
36          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
37          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
38          MS-DOS executable PE for MS Windows (DLL) (GUI) Intel 80386 32-bit
39          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
40          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
41          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
42          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
43          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
44          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
45          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
46          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
47          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
48          MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit
49          gsfx self extractor

Table A.6: Malware dataset with the programming platform attributes

Sample No.  Programming platform
1           C
2           Armadillo v1.71, Microsoft Visual C++ v5.0/v6.0 (MFC)
3           C
4           UPX v0.89.6 - v1.02 / v1.05 -v1.22 (Delphi) stub
5           C
6           C
7           C
8           C
9           Borland Delphi v6.0 - v7.0
10          C
11          C
12          Borland Delphi v6.0 - v7.0
13          Microsoft Visual C++/MFC, Armadillo v1.71
14          C
15          C
16          C
17          C
18          C
19          C
20          C
21          C
22          C
23          C
24          C
25          C
26          C
27          C
28          C
29          C
30          C
31          C
32          C
33          C
34          C
35          C
36          Microsoft Visual C++, Armadillo v1.71
37          C
38          C
39          C
40          C
41          C
42          C
43          C
44          C
45          C
46          C
47          C
48          C
49          C

Table A.7: Malware dataset with the DLL, directory, file, internet access and label attributes

Sample No.  Windows DLL references  Directory access  File access  Internet access  Label
1           12                      yes               yes          yes              bad
2           9                       yes               yes          yes              bad
3           4                       yes               yes          no               bad
4           10                      no                no           yes              bad
5           11                      yes               yes          yes              bad
6           8                       yes               yes          no               bad
7           3                       yes               yes          yes              bad
8           3                       no                no           no               bad
9           3                       yes               yes          no               bad
10          4                       yes               yes          yes              bad
11          4                       no                no           no               bad
12          3                       yes               yes          no               bad
13          10                      yes               yes          yes              bad
14          4                       yes               yes          yes              bad
15          4                       yes               yes          no               bad
16          7                       yes               yes          yes              bad
17          4                       yes               yes          no               bad
18          5                       yes               yes          no               bad
19          3                       no                no           no               bad
20          1                       no                no           no               bad
21          5                       yes               yes          no               bad
22          2                       no                no           no               bad
23          3                       no                yes          no               bad
24          4                       yes               yes          no               bad
25          8                       yes               yes          no               bad
26          4                       yes               yes          no               bad
27          4                       no                no           no               bad
28          3                       no                no           no               bad
29          7                       yes               yes          no               bad
30          4                       yes               yes          no               bad
31          4                       yes               yes          no               bad
32          2                       no                yes          yes              bad
33          3                       no                no           no               bad
34          4                       yes               yes          no               bad
35          3                       yes               yes          no               bad
36          6                       yes               yes          yes              bad
37          3                       yes               yes          no               bad
38          5                       yes               yes          no               bad
39          5                       yes               yes          no               good
40          8                       yes               yes          no               good
41          8                       yes               yes          no               good
42          5                       yes               yes          no               good
43          5                       yes               yes          no               good
44          7                       yes               yes          yes              good
45          4                       yes               yes          no               good
46          9                       yes               yes          yes              good
47          8                       yes               yes          no               good
48          5                       yes               yes          no               good
49          2                       yes               yes          no               good

This completes the Appendix section on the malware dataset.

APPENDIX B

DATA MINING

The KDD (Knowledge Discovery and Data Mining) process is a systematic approach to processing and handling large amounts of data. The steps of the KDD process consist of the following phases shown in Figure B.1.

Figure B.1: The KDD process

1. Selection:
   a. Analyze the Malware sample dataset
   b. Generate the table by creating an SSAS (SQL Server Analysis Services) Package

2. Preprocessing and Transformation – data cleaning phase

3. Data Mining:
   a. Creation of the Data Cube
   b. Application of the Data Mining Algorithms

4. Interpretation and Evaluation – extracting the patterns

1. Selection – creating the target dataset

In order to get the information from our source file to a database table, SSAS (SQL Server Analysis Services) was used. The flat source file here was a Microsoft Excel table with the attributes discussed in the previous section. It was then imported to the database using the SSAS package [30].

2. Preprocessing and Transformation

Since the dataset was created with the objective at hand, it did not require any data cleaning process to be applied.

3. Applying the Data Mining Algorithms

Creating the views – The next step was to create a data cube from the table.

Generating the Data Source View

The data source view, shown in Figure B.2, gives a data abstraction of the data cube to be created. The view was created and contained the table.

Figure B.2: Data Source View of the training set

Creating the training and testing datasets

The training and testing datasets were created in a ratio of approximately 2:1 using a simple SQL query in SSAS. The training set contained 32 entries while the testing set had 17.

Entropy Based Attribute Selection

The “Suggest related columns” dialog box performs an analysis on a sample of data to identify the input columns that show the greatest relationship to the selected Predictable column, based on entropy.
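The idea behind entropy based attribute selection can be illustrated with information gain, the reduction in the entropy of the label once an attribute's value is known. The sketch below is a generic textbook computation and not the internal scoring used by SQL Server, which is not documented in this work. The function names are our own, and the eight rows are a small subset of Tables A.4 and A.7 (samples 1 to 4 and 39 to 42) chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Illustrative subset only: samples 1-4 and 39-42 from Tables A.4 and A.7.
labels = ["bad", "bad", "bad", "bad", "good", "good", "good", "good"]
columns = {
    "Unique strings": ["yes", "yes", "no", "no", "yes", "yes", "yes", "yes"],
    "Internet access": ["yes", "yes", "no", "yes", "no", "no", "no", "no"],
}

# Rank the candidate input columns by how much they tell us about the label.
ranking = sorted(columns, key=lambda c: information_gain(columns[c], labels), reverse=True)
for name in ranking:
    print(name, round(information_gain(columns[name], labels), 3))
```

An attribute with a higher gain separates the good and bad samples more cleanly and is therefore a better candidate input column.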

Creating the Decision Trees [26]

The next step was to create decision trees on the training table. SSMS (SQL Server Management Studio) was used for this purpose. A mining structure for decision trees was created involving the training set. To create the decision tree, the suggested related columns were chosen as the input attributes. The tree then displayed the splits between the input attributes in order to reach the appropriate predictable attribute, i.e. Label – good or bad.

Creating the Naive Bayesian Classifier [26]

Based on the conditional probabilities, the data records were classified into two classes, i.e., good or bad. A mining structure was created for applying the Naive Bayesian algorithm on the training set. The input attributes were selected using the suggestion method provided in SQL Server. The Naive Bayesian algorithm then uses all of the suggested attributes and assigns each data record the class with the highest probability for that record.
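To make the classification step concrete, the following is a minimal sketch of a categorical Naive Bayes classifier. It is a generic textbook version with add-one (Laplace) smoothing, which is our assumption, and it is not the Microsoft Naive Bayes algorithm itself. The function names and the toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    domains = defaultdict(set)            # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class with the highest (log) posterior probability for row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)          # log prior
        for i, value in enumerate(row):
            k = len(domains[i] | {value})          # domain size, counting unseen values
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log((value_counts[(i, label)][value] + 1) / (n_label + k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: (Unique strings, Internet access) -> Label.
toy_rows = [("yes", "yes"), ("no", "yes"), ("no", "yes"), ("yes", "no"), ("yes", "no"), ("yes", "no")]
toy_labels = ["bad", "bad", "bad", "good", "good", "good"]
model = train_naive_bayes(toy_rows, toy_labels)
print(predict(model, ("no", "yes")))
```

Working in log space keeps the product of many small probabilities numerically stable, which matters once more than a handful of attributes are used.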

Applying the Microsoft Clustering [26]

Similar to the procedure followed above, a mining structure was created for the same training set but without the predictable attribute, since clustering does not require one.

This completes the Appendix section on data mining using the Microsoft SQL Server 2005 [30].

APPENDIX C

ROUGH SETS

The application of rough sets involved using the tools suite BLEM2 [28]. BLEM2 tool usage had 3 phases:

1. Feeding the dataset as input.
2. Generating rules using rough sets methodology.
3. Testing the rules.

1. Feeding the dataset as input

The dataset has to be divided into two input files of a specific format for the rough sets engine to understand. The format is strictly a tool specification. The two files are:

• The attribute file – this file has the extension ‘att’. It is basically a meta file for the dataset, and its structure is tool specific. The first field indicates the number of attributes in the dataset, followed by the attribute name, exactly matching that of the dataset in both case and order. Then comes the type of the attribute, indicated with the letter c, and a number that denotes how many distinct values this field can take. An example is shown below:

[Registry references] c 16

This is followed by all those instances for that attribute terminated by a newline as shown in the example for the same attribute Registry references below:

8 0 52 9 1 3 6 18 20 19 10 7 13 39 5 15

This is done for all the attributes in order as specified in the dataset.

• The data file – this file needs to have the extension ‘dat’. It contains all the data values in the dataset, with commas separating them and each sample entry on its own line.

The data file entries are illustrated in the example below:

91424, yes, 21, 8, 188, 12, yes, yes, yes, bad
100356, yes, 0, 0, 103, 9, yes, yes, yes, bad
167936, no, 0, 52, 109, 4, yes, yes, no, bad

Both these files need to be prepared in the specified format for the BLEM2 tool to understand our dataset. Since the target dataset was very small, this process was done manually. For larger datasets, transformation, cleaning and format specification can be done using the other features the tool provides.
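For a larger dataset, the two input files could also be produced by a short script instead of by hand. The sketch below follows only the layout described in this appendix (attribute count, then a bracketed name with the letter c and the number of distinct values, then the values on the next line). The exact BLEM2 file grammar is not reproduced here, so details such as separators and value ordering are assumptions, as are the function names and all attribute names other than Registry references.

```python
def build_att_text(names, rows):
    """Build the text of the attribute ('att') file as described in this appendix."""
    lines = [str(len(names))]                      # first field: number of attributes
    for i, name in enumerate(names):
        # Distinct values, kept in order of first appearance (an assumption).
        values = list(dict.fromkeys(str(row[i]) for row in rows))
        lines.append(f"[{name}] c {len(values)}")
        lines.append(" ".join(values))
    return "\n".join(lines) + "\n"


def build_dat_text(rows):
    """Build the text of the data ('dat') file: one comma separated sample per line."""
    return "\n".join(", ".join(str(v) for v in row) for row in rows) + "\n"


# Hypothetical attribute names; only "Registry references" appears in the example above.
EXAMPLE_NAMES = [
    "File size", "Unique strings", "URL references", "Registry references",
    "API references", "DLL references", "Directory access", "File access",
    "Internet access", "Label",
]
# The three sample entries shown in the data file example.
EXAMPLE_ROWS = [
    (91424, "yes", 21, 8, 188, 12, "yes", "yes", "yes", "bad"),
    (100356, "yes", 0, 0, 103, 9, "yes", "yes", "yes", "bad"),
    (167936, "no", 0, 52, 109, 4, "yes", "yes", "no", "bad"),
]

print(build_att_text(EXAMPLE_NAMES, EXAMPLE_ROWS))
print(build_dat_text(EXAMPLE_ROWS))
```

The returned strings can then be written to files with the ‘att’ and ‘dat’ extensions and checked against the tool's expectations before use.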

2. Generating rules using rough sets methodology

After the two files are ready, they are fed into the rough sets learning tool along with the required parameters, such as support count major (which indicates the frequency of occurrence for that particular rule generated), rules coverage (number of rules the sample covers or satisfies), certainty factors and strength of the rules (when all the rules are taken into account). Rules are generated from the learning model of the BLEM2 tool suite using the above mentioned parameters and features as the input.
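The measures attached to a decision rule can be illustrated with the standard rough set definitions: support is the number of samples matching both the condition and the decision, certainty is support divided by the number of samples matching the condition, and coverage is support divided by the number of samples in the decision class. The sketch below uses those textbook definitions, which may differ in detail from what BLEM2 computes and from the informal descriptions above. The function name and the rule are hypothetical, and the rows are samples 1 to 4 and 39 to 42 from Tables A.4 and A.7.

```python
def rule_measures(rows, conditions, decision):
    """Return (support, certainty, coverage) of the rule IF conditions THEN decision.

    rows: list of dicts mapping attribute name to value
    conditions: dict of attribute name -> required value
    decision: (attribute name, value) pair
    """
    d_attr, d_value = decision
    matches_if = [r for r in rows if all(r[a] == v for a, v in conditions.items())]
    matches_then = [r for r in rows if r[d_attr] == d_value]
    support = sum(1 for r in matches_if if r[d_attr] == d_value)
    certainty = support / len(matches_if) if matches_if else 0.0
    coverage = support / len(matches_then) if matches_then else 0.0
    return support, certainty, coverage


# Illustrative subset only: samples 1-4 and 39-42.
SAMPLE_ROWS = [
    {"Unique strings": "yes", "Internet access": "yes", "Label": "bad"},
    {"Unique strings": "yes", "Internet access": "yes", "Label": "bad"},
    {"Unique strings": "no", "Internet access": "no", "Label": "bad"},
    {"Unique strings": "no", "Internet access": "yes", "Label": "bad"},
    {"Unique strings": "yes", "Internet access": "no", "Label": "good"},
    {"Unique strings": "yes", "Internet access": "no", "Label": "good"},
    {"Unique strings": "yes", "Internet access": "no", "Label": "good"},
    {"Unique strings": "yes", "Internet access": "no", "Label": "good"},
]

# Hypothetical rule: IF Internet access = yes THEN Label = bad.
print(rule_measures(SAMPLE_ROWS, {"Internet access": "yes"}, ("Label", "bad")))
```

A rule with high certainty but low coverage is reliable when it fires yet explains only a small part of its class, which is why both measures are considered when the rules are applied.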

3. Testing the rules

Rules so generated are further utilized in the Rule based engine for extracting mining patterns or to solve classification problems.

This completes the Appendix section on rough sets using the BLEM2 tool suite.
