EXFILD: A TOOL FOR THE DETECTION OF DATA EXFILTRATION USING ENTROPY AND ENCRYPTION CHARACTERISTICS OF NETWORK TRAFFIC

by Tyrell William Fawcett

A thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

Spring 2010

© 2010 Tyrell William Fawcett
All Rights Reserved


Approved: W. David Sincoskie, Ph.D. Professor in charge of thesis on behalf of the Advisory Committee

Approved: Kenneth E. Barner, Ph.D. Chairman of the Department of Electrical and Computer Engineering

Approved: Michael J. Chajes, Ph.D. Dean of the College of Engineering

Approved: Debra Hess Norris, M.S. Vice Provost for Graduate and Professional Education

ACKNOWLEDGMENTS

To my advisor, Dave Sincoskie: Dr. Sincoskie has given me the opportunity to further my education at the University of Delaware. He has provided me with the guidance needed to perform my research at a higher level than I believed I could. He has taught me not to blindly believe things without proving them myself. His vast experience in research and his ability to see the importance of this area of research are what led to my thesis.

To Chase Cotton: Dr. Cotton is always around and willing to brainstorm and work through technical problems with anyone. His practical experience in the field and his willingness to perform experiments at all hours of the day have proven invaluable to my research.

To Fouad Kiamilev: Dr. K has been gracious enough to provide me space in his lab and access to all of his equipment during my graduate career. He provides an enthusiastic and upbeat lab environment for CVORG. His eagerness to learn and his encouragement to do what you love are what keep CVORG such a great place to work.

To Charles Boncelet: Dr. Boncelet provided great advice and guidance relating to encryption and entropy calculations in my work. I am thankful to him for sharing his experience and saving me time and effort going down unnecessary paths in unfamiliar territory.

To all the members of CVORG: CVORG is definitely a very eclectic group of people piled into labs together. The members are always there to help someone in need. The brainstorming sessions in this lab lead to a lot of great ideas and research.

A lot of aspects of both my thesis work and other projects have resulted from these brainstorming sessions, and they have provided me with some paths forward for future work. I am proud to be a part of CVORG.

To Larry and Karen Steenhoek: The Steenhoeks have provided a lot of support over the last semester to my wife and me. It has made this transitional period, and specifically the writing process, a much easier time, and for that I thank them.

To my parents, Jim and Deby Fawcett, and my sister, Sammy Fawcett: I always know that no matter what happens I have their support. I attribute my successes up to now to the work ethic and morals they instilled in me. I love them for everything they have done for me.

To my wife, Valerie Fawcett: She has been my best friend for as long as most people can remember. She has kept my life interesting with her goofy antics, which I'm sure will only get more entertaining in DC. She has made sacrifices allowing me to quench my thirst for knowledge, and for that I can't thank her enough. She will be as happy as I am to see this thesis completed, because she calls it the day she gets her husband back! I couldn't ask for a more supportive wife to spend the rest of my life with.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

Chapter

1 BACKGROUND
2 INTRODUCTION

2.1 Motivation

2.1.1 Scenario of Security Against Infiltration
2.1.2 System Administrators
2.1.3 Encryption
2.1.4 Out of the Box Firewall
2.1.5 Login Credentials
2.1.6 Antivirus Software / Malware Detection
2.1.7 Properly Configured Firewall
2.1.8 Intrusion Detection System
2.1.9 Intrusion Prevention System
2.1.10 Outgoing Traffic
2.1.11 Consequences of Encryption

2.2 Goals
2.3 Related Work

3 NETWORK TOOLS

3.1 Network Sniffer
3.2 Corporate Watcher
3.3 Network Top
3.4 DNS Extractor
3.5 Session Extractor
3.6 IP Helper

4 EXFILD DESIGN AND IMPLEMENTATION

4.1 Packet and Session Processing

4.1.1 Packet Decoding
4.1.2 Extract Sessions
4.1.3 Entropy Calculation

4.1.3.1 Scaling by Initial Values
4.1.3.2 Scaling by Size

4.1.4 Checking if Encryption is Expected
4.1.5 Checking if Encryption is Present

4.2 Tree

4.2.1 Expected Unencrypted and Received Unencrypted
4.2.2 Expected Encrypted and Received Encrypted
4.2.3 Expected Encrypted and Received Unencrypted
4.2.4 Expected Unencrypted and Received Encrypted

5 EXPERIMENTS, RESULTS, AND ANALYSIS

5.1 Experiments

5.1.1 Control Data Set
5.1.2 Data Set #1
5.1.3 Data Set #2
5.1.4 Malware Data Sets

5.1.4.1 Kraken
5.1.4.2 Zeus Botnet
5.1.4.3 Blackworm

5.2 Results and Analysis

5.2.1 Packet Versus Session Alerts
5.2.2 Control Data Set
5.2.3 Data Set #1
5.2.4 Data Set #2
5.2.5 Malware Data Sets

5.2.5.1 Kraken
5.2.5.2 Zeus
5.2.5.3 Blackworm

5.2.6 Data Exfiltration Detection Performance

6 CONCLUSIONS
7 FUTURE WORK

7.1 Performance
7.2 Handle A Network
7.3 Application Layer Decoding of Packets
7.4 Comparison to Packet and Session Entropy
7.5 Compressed File Analysis
7.6 Behavioral Analysis
7.7 Additional Tools

BIBLIOGRAPHY

Appendix

A EXPERIMENTS
B ENTROPY PLOTS FOR DATA SETS
C ALERTS FOR DATA SETS
D VERIFICATION OF MALWARE PACKET CAPTURES

D.1 Kraken Packet Capture
D.2 Zeus Packet Captures

D.2.1 Zeus #1
D.2.2 Zeus #2
D.2.3 Zeus #3

D.3 Blackworm

LIST OF FIGURES

3.1 Example Output from the Network Sniffer
3.2 Example Output from the Corporate Watcher Program (Simple)
3.3 Example Output from the Corporate Watcher Program (Complex)
3.4 Example Output from the Network Top Program
3.5 Example Output from the DNS Extractor Program
3.6 Example Output from the Session Extractor Program
3.7 GUI for the IP Helper Program
4.1 The Flow of the Processing of Packets
4.2 The Flow of the Processing of Sessions
4.3 Maximum and Minimum Entropy vs. Size of a Packet's Payload
4.4 Maximum Entropy vs. Size of a Packet's Payload
4.5 Maximum and Minimum Entropy Values with Different Initial Values
4.6 Scaled Maximum and Minimum Entropy Values
4.7 HTTP and HTTPS Traffic
4.8 The Four Branches of the Tree
4.9 Flow for Expected Unencrypted and Received Unencrypted Branch
4.10 Flow for the Expected Encrypted and Received Encrypted Branch
4.11 Flow for Expected Encrypted and Received Unencrypted Branch

4.12 Flow for Expected Unencrypted and Received Encrypted Branch

5.1 Network Diagram for the Control Data Set
5.2 Network Diagram for Data Set #1
5.3 Network Diagram for Data Set #2
B.1 Packet Entropies for the Control Data Set
B.2 Session Entropies for the Control Data Set
B.3 Packet Entropies for Data Set #1 (First Plot)
B.4 Packet Entropies for Data Set #1 (Second Plot)
B.5 Session Entropies for Data Set #1
B.6 Packet Entropies for Data Set #2
B.7 Session Entropies for Data Set #2
B.8 Packet Entropies for the Kraken Data Set
B.9 Session Entropies for the Kraken Data Set
B.10 Packet Entropies for Zeus Data Set #1
B.11 Session Entropies for Zeus Data Set #1
B.12 Packet Entropies for Zeus Data Set #2
B.13 Session Entropies for Zeus Data Set #2
B.14 Packet Entropies for Zeus Data Set #3
B.15 Session Entropies for Zeus Data Set #3
B.16 Packet Entropies for the Blackworm Data Set
B.17 Session Entropies for the Blackworm Data Set
C.1 Session Alerts for the Control Data Set
C.2 Session Alerts for Data Set #2
C.3 Session Alerts for the Kraken Data Set
C.4 Session Alerts for Zeus Data Set #1
C.5 Session Alerts for Zeus Data Set #2
C.6 Session Alerts for Zeus Data Set #3
C.7 Session Alerts for the Blackworm Data Set
D.1 Sessions from the Kraken Data Set to Servers on Port 447
D.2 DNS Names Resolved from the Kraken Data Set
D.3 HTTP GET Command Downloading the Configuration File
D.4 HTTP POST Commands and HTTP OK Responses
D.5 Zeus HTTP Communications
D.6 HTTP GET Command Requesting the Configuration File
D.7 HTTP OK Response to the Request for the Configuration File
D.8 HTTP POST Commands and HTTP OK Responses
D.9 Zeus HTTP Communications
D.10 HTTP GET Command Requesting the Configuration File
D.11 HTTP OK Response to the Request for the Configuration File
D.12 HTTP POST Commands and HTTP OK Responses
D.13 Zeus HTTP Communications
D.14 Connection to Victim and Enumerating Its Network Shares
D.15 Searching for Security Vendors' Program Folders
D.16 Copying Blackworm Executable to Victim
D.17 Copying Blackworm Executable to Victim
D.18 Deleting Startup Link from Victim
D.19 Creating a Job to Run the Blackworm Executable

LIST OF TABLES

4.1 Scalars for Max Entropy of Payloads by Size
4.2 Magic Numbers and HTTP Strings for Compressed Files
5.1 Alert Statistics for Packets and Sessions of Each Data Set
5.2 Packet Alerts from the Control Data Set
5.3 Session Alerts from the Control Data Set
5.4 Packet Alerts from Data Set #1
5.5 Packet Alerts from Data Set #2
5.6 Zeus Packet and Session Alerts #2
5.7 HTTP POST Commands in Zeus Packet Captures #2
5.8 Packet Alert Metrics
5.9 Session Alert Metrics
5.10 True and False Positive Rates for Each Data Set
A.1 Annotation of the Control Data Set

ABSTRACT

The twin goals of easy communication and privacy protection have always been in conflict. Everyone can agree that important information such as social security numbers, credit card numbers, proprietary information, and classified government information should not be shared with untrusted and unknown entities. The Internet makes it rather simple for an attacker to steal this information from even security-conscious users without the victims ever discovering the theft. All it takes is one lapse in judgment and an attacker can have access to sensitive information.

Currently the computer and network security industry places its focus on tools and techniques that are concerned with what is entering a system, not what is exiting it. The industry has no good reason not to inspect the outgoing traffic. Many attacks' success and effectiveness rely heavily on traffic exiting the computer system. Outgoing traffic is just as important to inspect as incoming traffic, if not more so, when detecting attacks involving theft of confidential information or interaction between the attacker and the victim's computer systems. Frequently recurring data breaches reinforce the necessity of tools and techniques capable of alerting users when data is being exfiltrated from their computer systems.

This thesis explores the use of entropy characteristics of network traffic to ascertain whether egress traffic from computer systems is encrypted. The inspection of network traffic at the session level instead of the packet level is proposed to improve the accuracy of the entropy values. It establishes that entropy can indeed be used as an accurate metric of the traffic's actual state of encryption.

The major contribution of this thesis to the field of computer and network security is presenting a detection scheme capable of distinguishing data exfiltration from benign traffic. The detection scheme is based on the results of the entropy calculations and the observation of the traffic's state of encryption. The expected state of encryption for the session's data link, network, transport, or application layer protocol is found by decoding the protocol stack data fields. This expectation is compared to the actual traffic's observed state of encryption as measured by its entropy. It is demonstrated that using this comparison as the basis of the detection scheme provides accurate detection of inappropriate data being exfiltrated from a system.

The thesis also produces multiple network tools that can be used for inspection of traffic. The initial work develops tools that are capable of equipping system administrators with a better understanding of the network traffic leaving their systems. The most important tool resulting from this research is ExFILD, which implements the proposed detection scheme. The tool accurately alerts when it detects traffic with suspicious payloads exiting a system.

Chapter 1

BACKGROUND

The field of computer and network security has increased in importance in recent years. Historically (and in some cases currently) computer security has been neglected for reasons including cost, disbelief that a system would be attacked, and the inconvenience of managing the security systems. People involved in the early development of computers and the ARPANET (what is known as the Internet today) wanted to share information with all of their colleagues to facilitate the development of technology. They were not worried about people seeing their information, and they trusted their colleagues not to alter their data. Decades ago, security implementations were concerned with the amount of time used on a computer and the financial aspects, not with an attack attempting to steal or gain control of a computer system. Today, it is hard to go a week without seeing a report of a data breach at a government facility, financial corporation, or educational institute. The President of the United States of America has recently issued a Cyberspace Policy Review [1] and started pushing for more attention to be placed on protecting the country from domestic and foreign threats. Society is beginning to take computer security seriously and is doing more to protect computers and data from threats.

Computer and network attacks come in many different forms. They range from botnets, groups of compromised computer systems controlled by a common attacker, such as the Srizbi Botnet sending billions of spam messages per day [2], to nation states attacking other nation states, such as the

Russian DDoSing of Estonia's computers in 2007 [3]. An attacker can use automated scripts to do most of the mundane tasks and then make the decisions of how to proceed after each step. Attacks can also be almost exclusively automated, as most botnets are. Attacks can be well known and run by script kiddies (inexperienced attackers that only run tools created by others), or they can be sophisticated zero-day attacks (attacks that have not been publicly released) written by an experienced attacker. They can be very focused and target a single system, or they can target a large number of systems, as in the case of a worm spreading from peer to peer.

A main concern today is data confidentiality. Attacks can be divided into two classes: those that steal confidential data and those that do not. One example of an attack that does not need data to be extracted from the target is a DoS (denial of service) attack. A DoS attack saturates the computer's resources, leaving it in a non-functional state. Usually a DoS attack refers to saturating a network link, but it can also refer to the saturation of CPU cycles.

The confidential data exfiltrated from systems depends on the attacker's goals. In the case of an attack targeting individuals, it could be personal information such as social security numbers, credit card information, passwords, etc. For attacks targeting nation states, extraction of confidential information on military projects or classified information may be the objective. Attacks against financial institutions seek money, either directly by manipulating financial transactions or indirectly by extracting massive amounts of customer information.

A bot herder (the attacker controlling the compromised computers of a botnet) may also be interested in data beyond the victim's personal information. They may need information about the host system to be able to control the bot and continue expanding the botnet. An example of information that could be passed from the host to the herder would be the results from a network scan that revealed possible new targets. Information about the host's intranet could also be

used to decide what kind of information to extract, e.g., the host appears to be a government-contracted engineering firm, so an attacker should look for design plans.

Data can be exfiltrated from a system in many ways. Some of the more common ways of exfiltrating data are through botnets, viruses, worms, FTP, close access means such as flash drives or CDs, email, and even phishing attacks. Many techniques for exfiltrating data are discussed in [4]. The two general methods of exfiltration are through physical means or over a network connection. For this thesis, the working definition of physical exfiltration will be any method where a user has to be in the proximity of the system containing the data. Network exfiltration will be any method where the location of the attacker does not matter for the attack to be successful.

The physical methods are easier and are often overlooked, but they require physical access to the computer. These attacks are often insider attacks, because it can be hard to get physical access to these systems if the attacker is not trusted by the owner or does not work for the organization where the computer is physically located. Alternatively, physical attack is often employed by officials executing legal search warrants, or covertly by individuals seeking illegal access to systems, the so-called "black bag job." In the case of an insider attack, an employee can plug a flash drive into a computer, copy the files they want onto it, and then take it home. The employee has now transferred the data offsite and can do anything he would like with it. Physically exfiltrating data is even easier when an employee is issued a laptop for work. The data on the laptop leaves the site every day, and the laptop also gives the employee access to the network from home, usually through a VPN connection, allowing the employee to copy data off of the system in the safety of their home. Some other physical methods of stealing data are burning it to a CD/DVD, printing the information, leaking it through lights and diodes as demonstrated in [5], cameras

recording monitors and keyboards, and transferring it over a wireless channel.

The network methods require some more knowledge, but the attacker can be anywhere in the world. The possibilities for exfiltrating data over a network are endless. Only an attacker's creativity and the desire for stealthiness limit the methods that can be devised to get data out of a computer. Exfiltrating data can be as easy as emailing the information out or simply opening a connection to a computer owned by an attacker. Covert channel data exfiltration methods, such as flipping bits in the TCP header or controlling the timing of packets, can be more complicated [4]. The type of method used to get the data out of the system depends on the system and its configuration settings. Email is an easy way to get data out of a system without causing suspicion, because a user sends multiple emails a day, many with attachments.

A lot of the information exfiltrated is the result of botnets, worms, and viruses. When a computer system is infected, backdoors, rootkits, and hidden FTP servers are installed, as well as impromptu servers set up by programs such as Netcat [6]. These installed binaries make it possible for the attacker to exfiltrate information from a computer system. Botnets such as Torpig and Conficker have received an increased amount of media attention. Torpig was estimated to have infected a little more than 182,000 hosts [7], and Conficker was estimated to have infected almost 9 million hosts [8]. Many people become worried that a virus, worm, or botnet has infected their computer. It is not the fact that a computer system is infected that should worry users, but rather what tasks the attacker performs with the infected computer. For example, if an attacker uses one hundred compromised hosts to launch a DDoS (distributed denial of service) attack against another host, it would have a rather small impact on each of the compromised hosts. It would only result in a loss of bandwidth for a relatively short time for the compromised hosts. It will however

have a substantial impact on the target computer of the DDoS. On the other hand, if the attack is similar to Torpig's, which is used to steal information ranging from passwords to credit card numbers [7], then the victims should be worried. Loss of bandwidth is insignificant compared to the impact of identity theft. Users should be concerned about being infected and take steps to clean the system, but what they are infected with and its function should be of the greatest concern.

The infinite ways to extract data are what make detecting it a difficult problem. In general, defense is hard and offense is easy, because an attacker only needs one hole for an attack to work, while the defender needs to protect against an infinite number of attacks. One mistake or oversight by a defender could open a door for the attacker. Many systems are protected with software to regulate incoming data, such as firewalls, anti-virus software, intrusion detection systems, and intrusion prevention systems, but there is not as much emphasis on controlling outbound traffic.

Exfiltration detection tools are typically very specific in what they try to detect. The only exception to this may be some tools that use self-learning or artificial intelligence techniques; however, these tools may also be very focused. DNStTrap [9] is a tool developed by Jhind that uses artificial intelligence to detect exfiltration through DNS tunneling. DNS tunnels are created by passing information within the DNS names of hosts. DNStTrap extracts traits from the leftmost subdomain of the domain name, e.g., cics is the leftmost subdomain of cics.udel.edu. One of the more important traits is the similarity of a leftmost subdomain to the other leftmost subdomains. The idea is that if a user is browsing the Internet, the user may visit cics.udel.edu 5 times a day; however, if a user is exfiltrating data it would look more like somethinginteresting.udel.edu, exfiltrateddata.udel.edu, ssn.udel.edu, creditcardnumber.udel.edu, and password.udel.edu. DNStTrap will then use the similarity of the domains along with other traits to decide if it is a DNS tunnel. A minimal sketch of this leftmost-subdomain idea is shown below.
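The following Perl fragment is a hypothetical illustration of that heuristic, not DNStTrap's actual code: it counts the distinct leftmost labels seen under each domain, on the assumption that a tunnel generates many unique labels while normal browsing repeats a few. The 0.5 ratio threshold is arbitrary and chosen purely for illustration.

#!/usr/bin/perl
# Hypothetical illustration (not DNStTrap's code) of the leftmost-subdomain
# heuristic: many distinct leftmost labels under one domain is suspicious.
use strict;
use warnings;

my %labels;    # domain -> { leftmost label -> lookup count }

for my $fqdn (qw(
    cics.udel.edu cics.udel.edu cics.udel.edu
    ssn.udel.edu creditcardnumber.udel.edu exfiltrateddata.udel.edu
)) {
    my ($leftmost, @rest) = split /\./, $fqdn;
    $labels{ join '.', @rest }{$leftmost}++;
}

for my $domain (sort keys %labels) {
    my $distinct = keys %{ $labels{$domain} };
    my $total    = 0;
    $total += $_ for values %{ $labels{$domain} };
    printf "%s: %d distinct labels in %d lookups%s\n", $domain, $distinct,
        $total, ($distinct / $total > 0.5 ? '  <-- possible tunnel' : '');
}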

With a lot of tools having a very narrow focus and few offering a broad overview for detecting exfiltration, it appears the next logical step would be to develop a tool or framework that looks at a broad space with multiple detection modules that can be expanded. An example of this would be the framework used by the DoD (Department of Defense) named Interrogator [10]. Interrogator is implemented in a way that allows for the addition and removal of tools as desired. It is able to glue together many different tools to achieve an overall picture of a network's status. CloudAV [11] is another implementation of a framework bringing together tools. It combines multiple antivirus engines into a system that resides in a cloud to improve virus detection. Interrogator was designed for intrusion detection, and CloudAV targets detection of malware. A tool that uses the same principles as Interrogator and CloudAV but detects data being exfiltrated would be valuable.

Antivirus products and IDSs (intrusion detection systems) use signatures of malware for detection. Signature-based tools are very effective against malware that has been seen before, but they are not effective against zero-day threats. The industry has known that signature-based schemes are not a foolproof solution for protecting a computer from viruses; however, they are a good and easy way to protect against known threats. Executables, payloads, or exploits can be run through a program that will return a signature, and the signature can then be distributed in an update to the database. The process of making signatures and distributing them is simple and works well. However, attackers can get around signature detection by altering sections of the sample just enough to create a different signature or by using polymorphic code. The future of detection is in anomaly and behavioral techniques. These schemes would be customized based on the different traits of the host or the network. They would be self-learning schemes that detect unusual events for a system or network and flag them so it can be decided whether the event is normal

or suspicious. These systems learn the normal and suspicious traffic from training data given to the system. Detection tools that are signature-based will continue to be used as a baseline, but behavioral schemes will also need to be incorporated to detect the more complex attacks.

Chapter 2

INTRODUCTION

2.1 Motivation

Many security tools and strategies emphasize detecting and preventing attackers from breaking into a system. It is very intuitive that the focus is on intrusion. Because no intrusion detection system is perfect, however, a user cannot be content with just preventing an attacker from breaking into a system. A user has to worry about an attacker preventing communication with the system or intercepting information leaving the system and using that information maliciously. If these attacks are ignored, a user could theoretically be secure by preventing an attacker from entering the system. The problem with that argument is that there are many vulnerabilities or holes to get into a system. A user must defend against them all to stay secure, while an attacker only needs one to get into the system. The vulnerability the attacker uses can be a zero-day vulnerability, which would leave a user with little chance of protecting against it the first time it is used. It is also conceivable that a vulnerability is pre-installed or designed in, with an attacker simply sending an otherwise innocuous trigger to activate the exploit.

Many security plans do not include a strategy or the tools to mitigate the possible risks that appear after an intrusion. It is bad enough that security plans do not have techniques to prevent risks such as the stealing of sensitive information, but not having a way to detect that sensitive information is leaving the system is even worse. An attacker could potentially steal information from a system indefinitely with no inspection of the outgoing traffic.

The problem is present all throughout society, including organizations as sophisticated as large enterprises and nation states. The data breach of the Joint Strike Fighter project is an excellent example of an enormous amount of data leaving a network before being noticed. According to [12], attackers were able to exfiltrate several terabytes of data about the Pentagon's Joint Strike Fighter project. Heartland is a company that provides credit card processing services. It was reported in early 2009 by [13] that attackers may have obtained access to more than 100 million credit cards, and it was later reported by [14] that it was more than 130 million credit cards. It is easy to see that even the organizations that are entrusted with sensitive information do not have tools and techniques to detect data leaving their networks in a timely manner.

2.1.1 Scenario of Security Against Infiltration

Let's look at the general strategy of the security of computer systems and networks while relating it to the security used in a battlefield scenario. Once a group of soldiers makes the decision of where to construct base camp, the first thing they will do is set up a perimeter. To begin with, it may be a couple of soldiers patrolling the perimeter while the remaining soldiers start the other duties, e.g., setting up a command post, creating a strategy to bring in more equipment and troops, and laying out the blueprint of the camp. During the planning phases and from this point forward, the base will need secure lines of communication within the base and, more importantly, with friendly troops outside of the base. The secure communications will allow the base to communicate with the confidence that the messages it is passing will not be deciphered by the enemy's intelligence organization and used against the base.

After the initial blueprint has been completed, implementation of the plan will begin. The soldiers patrolling the perimeter may be given some aid in controlling their boundaries with a primitive barbed wire fence. The fence serves the purpose of slowing down ground troops, but not of stopping long range attacks or heavy machinery

such as tanks from driving right over it. Space will be left in the barbed wire and trenches to allow troops to enter and exit the base. A guard shack will be added at these entrance points. The guards will not let people enter without the proper credentials, i.e., a military-issued identification card and a record of a soldier's orders to enter or exit the base. The military police will be responsible for checking everything entering the base for suspicious and banned objects. They will also be responsible for keeping order within the base and preventing any action that could threaten the base from the inside.

The next step would be to add a large reinforced wall for better protection from ground troops, stopping soldiers from easily entering on foot and acting as a much larger deterrent to heavy machinery. The base now has a good foundation of protection against ground troops, but it can still be caught off guard when a battle ensues. The base will build watchtowers along the walls to see in every direction. They will then set up radar systems to facilitate the monitoring of the area around them. Finally they will construct more advanced defenses. These defenses will include turrets and stationary artillery, along with antiaircraft guns, to protect the base with higher-powered weapons. The battlefield scenario can be directly related to the security of a computer system.

2.1.2 System Administrators

Starting from the beginning of the scenario, the first aspect to discuss is the patrolling troops. The patrolling troops will continually move from place to place looking for anything suspicious. Their function is very much like that of a system administrator. The early phase in the construction can be related to turning on a system for the first time and beginning to configure it. The system administrator is the main line of defense at this point. As the base slowly builds up to its final design, the patrolling troops will not be as heavily relied upon for the base's safety. They will still be needed, but they will have much more support from the watchtowers,

radars, and the equipment providing heavier firepower. The system administrator receives support from the antivirus software, IDS, IPS, and other tools, but still has to respond to the alerts and logs from these systems.

2.1.3 Encryption

Secure lines of communication at the base and secure communications in a network can be handled in the same manner and with the same mechanisms. The idea is that it does not matter if communications are intercepted as long as an attacker cannot decipher them. A main difference is that the base might care more about people being able to intercept a message and figure out details such as the sender, the recipient, when it was sent, and much more. This is because this information might divulge strategic information, e.g., if one country suddenly increases communications with another, it may mean they have formed an alliance. Computer systems, however, send login credentials or banking information over the line encrypted while the headers are in the clear. If users were worried about people knowing with whom they are talking, they would use tools such as Tor [15] to anonymize themselves. According to [16], there are hundreds of thousands of Tor users each day. Hundreds of thousands of users would appear to be a lot, but when compared to the total number of Internet users (1,802,330,457 users [17]) it is at most only about 0.0554 of a percent (1,000,000 / 1,802,330,457 ≈ 0.000554).

2.1.4 Out of the Box Firewall

The barbed wire fence is the next line of defense initiated at the base. The fence can be correlated to a default firewall being initially installed. The firewall can provide a strong enough defense to slow down attackers, but it can also be subverted. For instance, if the firewall lets all DNS traffic into the system, an attacker can send malformed DNS packets that cause unwanted actions on the system, assuming there is some vulnerability the attacker can exploit.

The barbed wire fence can also prevent or slow down desired activities. Consider the entrances and exits to the base to be the ports that are open on the firewall. An example of this is a large base that only has an opening at the north end, but a critical mission that is time sensitive needs to head south immediately. The mission would have to head north first to exit the base and then head south, wasting time. The same scenario can be related to the firewall's actions. A user wants to log in to a bank's website to check a bank account. The user will need to send traffic on port 443 (HTTPS) to be able to securely access the bank account. If the firewall only allows port 80 (HTTP) to enter and exit the network, the user would be forced to use HTTP instead of HTTPS, the secure version of HTTP. The user would only be able to do this if the bank allowed them to communicate insecurely. Preferably, the bank would not allow insecure communications with the user, to prevent the user from passing sensitive information in cleartext over the Internet.

2.1.5 Login Credentials

The guard shack can be related to the oldest and most commonly used form of security in computer systems. The process of logging in to a system is easily related to the guard shack. At the guard shack, soldiers would have to show a military-issued identification to verify their identity. The guards would then check against some orders to verify that these soldiers have the authority to enter or leave the base. On a computer system, a user would first have to log in to the system using their user name and their login credentials, such as a password or biometric measurement. After the computer has verified the user's identity, it then needs to decide whether or not the user has the proper permissions to access this computer. Enterprise systems make this decision quite often. They have thousands of users whose credentials can be used on several multipurpose computers, but not all users will have access to the system with the database of salary information.

2.1.6 Antivirus Software / Malware Detection

Now that there are guarded entrances controlling everyone getting in, there need to be people responsible for keeping order within the base. The military police would pick up this responsibility. They would investigate any suspicious activity and mitigate any threats from inside the base. Antivirus software has responsibilities that line up nicely with the military police's responsibilities. The antivirus software is responsible for scanning the system for any items that could be detrimental to the system. The antivirus software uses signatures of malicious software to determine if a computer system is infected by malware. An actively scanning antivirus system will set off an alarm when a virus is downloaded. Depending on the user-defined options, the antivirus software will either delete the virus or quarantine it. This could be represented as the military police catching an attacker entering the base with a concealed bomb and intent to detonate it within the base. The military police could detain the person or open fire on the attacker if the situation deteriorates. The antivirus software will also watch the system memory and instruction calls for any unusual behavior, just as the military police patrol the base for disorderly conduct amongst the soldiers.

2.1.7 Properly Configured Firewall

The base would eventually build a permanent large reinforced wall to protect the base. The wall would be well planned to include entrances and exits of strategic value. The design should be one that provides the base with the greatest protection possible, but does not hinder the base's ability to successfully operate. The reinforced wall is analogous to a firewall that a system administrator has given the attention needed to configure it properly using the characteristics of the network. A properly configured firewall will take into account all of the services that are authorized to run on the network and communicate with systems outside of the network. The configured firewall will be as specific as only allowing inbound traffic

on port 80 with a known web server as its destination. The more thought put into configuring the firewall, the better it is able to protect the system from unknown threats. If the firewall is configured correctly, it will be similar to whitelisting. It will not list all of the unwanted traffic and block it; it will block everything and only allow whitelisted traffic through. The whitelisting strategy is more effective than blacklisting, because the system administrator should know all of the normal traffic but may not know all of the malicious traffic.

2.1.8 Intrusion Detection System

An IDS is a system that looks at network traffic (usually focused on incoming traffic) and attempts to detect attacks being made over the network. Snort [18] is a very common open-source implementation of an IDS. An IDS is similar to an antivirus application with respect to inspecting the system or network for anything malicious. The system administrator should customize the IDS rules to the system or network that it is protecting.

A network intrusion detection system (NIDS) only inspects the network traffic and extracts characteristics that will be used to decide whether the traffic is malicious. A host-based intrusion detection system (HIDS) inspects all of the actions on a system to determine if malicious actions are being performed. An IDS can implement signature-based detection or anomaly-based detection. Anomaly-based detection inspects the network traffic and compares it to the behavior that is considered normal for that system or network. The radar and watchtowers have the same responsibilities as an IDS, although radars have a much broader scope, since they also watch activities happening outside of the base.

Most single hosts or small networks do not run intrusion detection systems due to the time and resources that must be committed. An out of the box implementation of an IDS will catch a lot of malicious events, but just like a firewall it will be much more effective if time is spent on configuring it properly for

the network. An IDS can use a lot of system resources if there is a lot of traffic to be processed and a lot of checks being performed on the traffic. It may leave the system in a state where it is not practical to use for day-to-day activities.

2.1.9 Intrusion Prevention System

An intrusion prevention system, or IPS, monitors traffic for intrusions and will attempt to mitigate any intrusions automatically. It is basically a reactive IDS. The advantage of an IPS over an IDS is that it can reject any traffic involved with an attack at the time of detection, just like the turrets or anti-aircraft guns would do to attackers who were breaching the base's perimeter. An IPS will focus on incoming traffic and its contents. In some cases it will look at outgoing traffic for a signature match. A downside to an IPS is that it can treat legitimate traffic as if it were malicious and drop it. An example of this would be a signature-based detection that is too generic and matches both legitimate traffic and malicious traffic. The IPS will then react to normal traffic with undesired actions.

2.1.10 Outgoing Traffic

Looking over the battlefield scenario and relating it to the security of a computer or network illustrates that the layers of each are akin to those of the other. The interesting thing to note about both examples is that not many precautions are being taken to detect or prevent data from being exfiltrated from the base or the computer system. The firewall will help prevent outgoing traffic that has been denied in the rules; however, an attacker can use traffic allowed by the firewall to easily transport data out of the computer. The IDS and IPS can look at outgoing traffic, but by definition they are more sophisticated in detecting intrusions than they are in detecting exfiltrations. The system administrators would have to spend

extra time configuring the IDS and IPS to handle the outgoing traffic. The same situation is prevalent in the battlefield scenario.

2.1.11 Consequences of Encryption

Encryption is the standard for sending sensitive data between systems through unknown and untrusted systems. It is a great defense against messages being intercepted in between those systems; however, it can easily be used against a system. The point of encryption is to be able to obfuscate the data for storage or transport and then be able to return the encrypted data to its original cleartext form. An attacker can use this trait of obfuscation against a user and encrypt all of the data sent out of the system, thus leaving it hard to detect. The outgoing traffic can be monitored leaving the network, but the user does not know what is in the payload. The user will be able to observe where the packet claims to be going, how it is leaving, the amount of data leaving, the timing information, and other miscellaneous header information. Users can use this information to separate the traffic they feel comfortable allowing to exit the system from the traffic that looks suspicious and that they would not want leaving their systems. A system administrator can learn a lot about traffic by knowing whether the packets are encrypted, in addition to all of the header information. Analyzing traffic without knowing the data or payload of the packets can be counterintuitive for system administrators, but it will be shown in this thesis that it is very effective.

2.2 Goals

Too many small businesses, large businesses, government organizations, and average computer users in society are losing sensitive information each day. Computer and network security needs more work in the area of detecting attackers exfiltrating data from victims' computers. Richard Bejtlich described a good first step on his blog called TaoSecurity. He is the Director of Incident Response at General

Electric. "[D]evelop tools and techniques to describe what is happening on the network. ... Without understanding what is happening, we can't decide if the activity is normal, suspicious, or malicious." [19] Creating tools that help a user or system administrator understand the network's ground truth would be invaluable. System administrators can look at IDS and firewall logs all day, but if they do not understand what their network should be doing, then they can misdiagnose the meaning and importance of each message. The first goal of this thesis is to create tools to help users and system administrators better understand their networks. The tools developed will be discussed in Chapter 3.

System administrators and users who are comfortable with looking through logs and reacting to them appropriately can still expend a lot of resources looking through the logs. The time can be wasted on many simple decisions that build off of each other. The simple decisions are usually cut and dried and should be automated. The next goal of this thesis is to accumulate all of these simple decisions and automate them for the administrators. The results should be in a format that allows the administrator to quickly inspect and observe if there is traffic that needs some attention. The tool will show the results in formatted text to be parsed and graphically for a fast inspection. The tool will only handle outgoing traffic. It will detect exfiltration, but it will not attempt to prevent it in any manner.

ExFILD is the tool that is developed in this thesis. The development of ExFILD will show that entropy calculations of a packet or session's payload can be used to characterize the payload's observed state of encryption. It will focus on decisions that spawn from the determination of whether traffic is encrypted and whether the traffic was expected to be encrypted. The design of ExFILD will be described in Chapter 4 and its results in Chapter 5.
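To make the entropy idea concrete, the following is a minimal Perl sketch of a byte-level Shannon entropy calculation over a payload. It is illustrative only and is not ExFILD's actual implementation, which Chapter 4 describes in detail.

#!/usr/bin/perl
# Minimal sketch (not ExFILD's code) of byte-level Shannon entropy. Values
# near 8 bits/byte suggest encrypted or compressed payloads; plain text
# falls well below that.
use strict;
use warnings;

sub payload_entropy {
    my ($payload) = @_;
    return 0 unless length $payload;

    # Count occurrences of each byte value in the payload.
    my %count;
    $count{$_}++ for unpack 'C*', $payload;

    # H = -sum( p * log2(p) ) over the observed byte values.
    my $len     = length $payload;
    my $entropy = 0;
    for my $n (values %count) {
        my $p = $n / $len;
        $entropy -= $p * log($p) / log(2);
    }
    return $entropy;    # 0 (constant data) up to 8 (uniformly random bytes)
}

printf "text:   %.3f bits/byte\n", payload_entropy('GET /index.html HTTP/1.1');
printf "random: %.3f bits/byte\n",
    payload_entropy(join '', map { chr int rand 256 } 1 .. 1460);

Note that a short payload cannot reach the 8 bits-per-byte maximum even if it is perfectly random, which is one reason Chapter 4 scales the raw entropy value by payload size before comparing it to a threshold.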

2.3 Related Work

Research has been performed in the area of detecting data exfiltration. Researchers from the University of California at Berkeley have created a system named Glavlit [20] to prevent data from being exfiltrated. Their system uses a whitelisting approach. It has two main parts, referred to as a guard system and a warden system. The guard will not let any traffic out of the network without its being approved by the warden. The warden is in charge of approving all of the traffic. The approval process can be automated with content matching, or system administrators can manually approve each piece of data by hand.

Researchers from the University of California at Davis and Sandia National Laboratories developed a framework to detect data exfiltration called SIDD, Sensitive Information Dissemination Detection [21]. SIDD has three major components: application identification, content detection, and covert channel detection. The components are inline, and each component can cause an object to exit the chain and assign some action to be taken. The application identification component tries to determine the application of the traffic and then uses a policy to determine if it should be allowed. The content detection component checks traffic for data that has been labeled as sensitive. The search for the sensitive data is signature-based. The last component handles covert channel detection. The covert channel detector focuses on digital audio channels. Steganalysis described in [21] is used to generate characteristics and decide whether there is a covert channel. The work presented in this thesis is complementary to the efforts described in this section.

Chapter 3

NETWORK TOOLS

System administrators use a surprisingly large number of tools in a day to monitor and diagnose the computer systems and networks under their responsibility. The number of tools can be attributed to the number of different tasks they perform and the different types of systems they maintain. A system administrator could be in charge of setting up a printer server, debugging a DNS server, or diagnosing whether a recent failure of a web server was due to a hardware failure or a successful attack. They could also be dealing with an Apple computer used by marketing, a Windows computer used by an engineer, and a UNIX system used as an IDS. System administrators will use tools that they feel comfortable with, but they can run into the problem of not having exactly what they need. They will then be forced to find or develop another tool. Their toolbox will continually grow as they progress through their careers. Understanding and maintaining a network would be an unmanageable task without the information obtained from these tools.

Chapter 3 will discuss tools that have been developed during the data exfiltration research for this thesis. The tools are not necessarily novel or unique, but their functions are crucial to understanding and diagnosing a network, specifically the outgoing traffic, and to watching for data being exfiltrated. The tools have all been developed in Perl. The choice was made to develop them in Perl to make them easily customizable for special circumstances. As mentioned earlier, system administrators may use a tool for everything except one instance where a critical feature is not

supported. Instead of switching to another tool to obtain the feature, the Perl script can be altered to add it, and the administrator only needs to keep one tool instead of several. Many of the fundamental ideas behind these tools are combined to form the underlying framework for ExFILD. There are additional tools that are not discussed in Chapter 3, but they will be discussed in detail in Chapter 4.

3.1 Network Sniffer

The most important tool for a network analyst is a network sniffer. Tcpdump [22] and Wireshark [23] are two of the most common network sniffers. Tcpdump is a command line tool, while Wireshark is a GUI (graphical user interface) based tool. A network sniffer taps into a network interface and extracts all of the packets. The packet information will be output to the standard output, a GUI, or a file to be inspected later. The sniffer will decode the different fields of a packet, such as source and destination IP addresses and ports, packet length, data, and the rest of the header information from each of the layers. Network sniffers can be very basic and only decode through the transport layer protocols (TCP and UDP), while others such as Wireshark have modules that are capable of decoding the application layer protocols. Sniffers can also perform tasks on packet captures that have been saved to a file. The files are in the pcap format and allow a user to look at the traffic offline.

The sniffer developed during this research is a simple one. It can capture traffic live from a network device or read the traffic from a pcap file. Packets are handled at the data link, network, and transport layers; TCP, UDP, ICMP, and IGMP are the transport layer protocols supported by the sniffer. The header information will be printed out for each layer, as far up the layers as the sniffer can decode, and the rest of the data will be stored in a variable. The variable containing the data and the options fields is not printed out by default due to the

fields having non-printable ASCII values. The program will print out these fields if the --nonPrintableAscii flag is set on the command line. Each layer's information is printed out on a new line. Each packet is separated from the other packets by a line of ~'s. The network sniffer will output the results of the program to the standard output unless the --silent flag is given. If the --silent flag is passed to the program, the results will be written to a file. For live captures the file will be located at Logs/traffic_PID_TIME.log, where PID and TIME are the program ID and the current time in seconds since the epoch, respectively. The file will be located at Logs/PCAP_traffic.log when a pcap file is input, where PCAP is the name of the pcap file. Live captures will also dump the packets into a pcap file located at Logs/traffic_PID_TIME.pcap.

Figure 3.1: Example Output from the Network Sniffer

A sample packet is shown in Figure 3.1. In the Network Sniffer's output, the tabbed lines in Figure 3.1 are part of the previous lines, but have been separated strictly for display purposes. The first line represents the data link layer, the second line represents the network layer, and the third line represents the transport layer. The protocol being used for each layer is stated by the first word on each line. In Figure 3.1, the data link layer is using Ethernet, the network layer is using IP, and the transport layer is using TCP. The source and destination fields of the different layers are represented by source -> destination.

Depending on the layer, the source and destination fields will be of different types: the data link layer uses MAC addresses, the network layer uses IP addresses, and the transport layer uses port numbers. Each of the header fields following the source and destination fields is displayed with its name and the value it holds. Not all of the header fields will display a value. For example, the options field of an IP header is empty when no options are set.
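A condensed sketch of the sniffer's core loop is shown below. The use of the CPAN Net::Pcap and NetPacket modules is an assumption on my part; the thesis states only that the tools are written in Perl. The command line flags, log paths, and full protocol coverage described above are omitted for brevity.

#!/usr/bin/perl
# Minimal sketch of a Perl sniffer loop: read packets from a pcap file and
# print one line per decoded layer, in the spirit of Figure 3.1.
use strict;
use warnings;
use Net::Pcap;
use NetPacket::Ethernet qw(:types);
use NetPacket::IP       qw(:protos);
use NetPacket::TCP;

my $file = shift or die "usage: $0 capture.pcap\n";
my $err;
my $pcap = Net::Pcap::open_offline($file, \$err)
    or die "Cannot open $file: $err\n";

Net::Pcap::loop($pcap, -1, \&per_packet, '');
Net::Pcap::close($pcap);

sub per_packet {
    my ($user_data, $header, $raw) = @_;

    # Data link layer: Ethernet source -> destination MAC addresses.
    my $eth = NetPacket::Ethernet->decode($raw);
    printf "Ethernet %s -> %s\n", $eth->{src_mac}, $eth->{dest_mac};
    return unless $eth->{type} == ETH_TYPE_IP;

    # Network layer: IP source -> destination addresses.
    my $ip = NetPacket::IP->decode($eth->{data});
    printf "\tIP %s -> %s len:%d\n", $ip->{src_ip}, $ip->{dest_ip}, $ip->{len};
    return unless $ip->{proto} == IP_PROTO_TCP;

    # Transport layer: TCP source -> destination ports. The payload stays
    # in $tcp->{data} rather than being printed, since it may contain
    # non-printable ASCII values.
    my $tcp = NetPacket::TCP->decode($ip->{data});
    printf "\tTCP %d -> %d\n", $tcp->{src_port}, $tcp->{dest_port};
    print '~' x 60, "\n";
}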

3.2 Corporate Watcher Corporations and government organizations have to take data exfiltration very seriously. The corporations need to prevent proprietary information from falling into the hands of a competitor or an attacker looking to exploit the personal infor- mation of their employees or customers. A corporation that is concerned with this should check outgoing communication for undesired traffic. Undesired traffic could be defined in many ways depending on the type of goods and services provided by a corporation. For example, a company that creates classified technical documents for a government agency would want to know all of the documents with file extensions of .doc or .pdf leaving the network. A financial company would want to check for emails containing their clients’ credit card numbers, social security numbers, and other personal information. A big threat for corporations is their own employees, also known as an insider threat. A lot of responsibility is put onto employees to remain loyal to a company and to not betray the company’s trust. Corporations have to be aware of the employees with malicious intent and the ones that are careless. An employee with malicious intent could be a disgruntled employee or an employee who was recently laid off or fired and wants to get revenge on the company. It could also be an employee being paid to retrieve information from the company for an outside party. The careless employee could accidentally send an email with important proprietary information

22 Figure 3.2: Example Output from the Corporate Watcher Program (Simple)

to a person outside of the company or even worse a mailing list with multiple people outside of the company. Entities need a simple detection module to know when specified strings, such as English words or file extensions, are leaving the network. A tool was developed to search the outgoing traffic to see if certain words or phrases are contained in the payload. The tool will take a file containing the strings to match against the traffic. Any matches will be flagged for later inspection. The file will be created by the system administrator to represent the information that should not be leaving the network. The matching mechanism used is regular expressions. A sample output is shown in Figure 3.2. The sample is a result of a system administrator finding a PDF file being served by an html server. The administrator would flag PDF documents by adding “.pdf” to the file of sensitive strings and “html” to see the files being served by a web server. Every packet with .pdf or html in its payload will be flagged. The administrator can then search the results of the scan for the desired set of flagged words. Each line represents a packet that has flagged words in the payload of the data. The censored words that have been matched in the payload are listed at the beginning of the line. Commas separate multiple matches in the results. The middle of the line displays the source and destination IP addresses of the flagged packet and their respective ports. The words that caused the packet to be flagged are listed at the end of the line, with any multiples separated with a comma. The first line shows a packet that contains the flagged strings of .pdf and html, which can represent a packet from a web server

Figure 3.3: Example Output from the Corporate Watcher Program (Complex)

serving a file with a .pdf extension. The second and third lines represent packets that could be originating from a web server. The fourth line represents a packet that contains the string .doc, which could represent a packet involved with serving a Word document. The regular expressions used for matching can be more complex than the previous example matching against just text. The sample output displayed in Figure 3.3 is the result of matching against a more complex regular expression. The regular expression used is \[1-3][0-9][0-9] Evans Hall\ and can be seen in the Matched field. The regular expression is used to look for any packets that mention a room in Evans Hall, in this case any Evans Hall room numbers between 100 and 399. The packets shown are flagged because they contain the phrase "140 Evans Hall" in the payload. The regular expression can be as complex as necessary. The regular expressions could be for file extensions, the magic numbers of files, social security numbers, etc. The administrators just need to add the part of the regular expressions between the \'s to the file of censored expressions.
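At its core, the Corporate Watcher amounts to compiling each censored expression and testing it against every outgoing payload. The thesis tools were written in Perl; the following Python sketch (using the scapy library for pcap parsing, an assumption made here) only illustrates the idea, and the file name and output format merely follow the description above.

    import re
    from scapy.all import rdpcap, IP, Raw  # assumed pcap-parsing library

    def load_patterns(path="censoredList"):
        # One regular expression per line, e.g. ".pdf", "html",
        # or "[1-3][0-9][0-9] Evans Hall".
        with open(path) as f:
            return [re.compile(line.strip().encode()) for line in f if line.strip()]

    def scan(pcap_file, patterns):
        for pkt in rdpcap(pcap_file):
            if not (pkt.haslayer(IP) and pkt.haslayer(Raw)):
                continue
            payload = pkt[Raw].load
            matches = [p.pattern.decode() for p in patterns if p.search(payload)]
            if matches:
                # One line per flagged packet: matched strings, then addresses.
                print(", ".join(matches), pkt[IP].src, "->", pkt[IP].dst)

    scan("outgoing.pcap", load_patterns())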

3.3 Network Top A system administrator interested in outgoing traffic would want to know where the traffic is destined and how much data is being sent. A tool similar to the UNIX top command would be a useful resource for an administrator. The program would show the outgoing traffic statistics in real-time. The network top program was developed in two parts that were then combined. Searching through pcap files while responding to an incident can be tedious for system administrators. The logs will be full of IP addresses, which will have little

meaning to the administrator. Administrators will recognize a few IP addresses that belong to their network or to well-known organizations such as Google or Microsoft. The first step an administrator will perform while investigating suspicious traffic is to determine the destination of the traffic. The administrator could then run the whois command against the IP address to gather some information on the organization that administers the IP address. The network top program uses the DNS name to attempt to gather this information. The network top program attempts to resolve the DNS name of the IP address. If the DNS name cannot be resolved, it will be displayed as "N/A." Investigating the IP address 128.175.13.63 shows a good example of the problem being solved by the network top program. An administrator may not recognize the IP address, but they will recognize www.udel.edu, the DNS name that is resolved from 128.175.13.63. Another example is 64.12.26.59, which resolves to bos-m043a-sdr4.blue.aol.com. The administrator may not know the exact function of 64.12.26.59, but knowing AOL maintains it will bring some meaning to the IP address. If the administrator trusts traffic going to AOL, any warnings related to 64.12.26.59 could be ignored. The amount of outgoing data is another useful piece of information for system administrators. An administrator can look at the IP addresses that have the most data being sent to them and work down the list. The administrator will ignore any of the IP addresses that are justified in receiving that amount of data. For example, an administrator with many Gmail users would be comfortable with a lot of outgoing data destined to IP addresses administered by Google. The administrator may not make it to the bottom of the list every time the logs are checked, but the risk of losing data is less when the amount of data destined to an IP address is small. Administrators may set a threshold of data leaving in a given amount of time that will be tolerated. The threshold would have to be the result of a risk analysis performed using the company's policies.

The administrators should whitelist the known IP addresses, so they can quickly skip to an unknown IP address. The program has a built-in feature to facilitate the process of skipping known IP addresses. The program's -a command line option will take a tab-delimited file containing IP addresses and corresponding labels. The labels can be anything as long as they contain no white space. The administrator can use the labels to characterize the IP addresses as either allowed or suspicious. The labels can also be used to describe the function of the IP address, such as University of Delaware Webserver. The matching of the IP address against those in the input file is performed using regular expressions. Using regular expressions as the matching mechanism allows for partial matching. An administrator may want to label the entire range of 64.233.160.0 - 64.233.160.254 as Google. The administrator would add an entry with the IP address as "64.233.160." and the description of Google in the file to label the entire range of 64.233.160.0/24. The -a option can be very convenient for a system administrator; however, it will lower the program's performance. If the increased run time becomes a nuisance, an administrator can run the program during a meeting or at the end of the day and inspect the results later. Combining multiple IP addresses into a range and labeling them together can also improve performance. It will reduce the number of checks performed on each IP address against those in the file of labels. The network top program can be given a command line option for a network device or a pcap file. Giving the network device as an option will display the outgoing traffic to the console in real-time until the program is terminated. It will also store the results to a file named "outbound ips PID TIME.log". It will sort the list by the destination IP addresses with the most outgoing data at the top. The network top program can be resource intensive just like the UNIX top program, so an administrator may not want to run it constantly. The option of giving the pcap file will allow an administrator to inspect previous network captures that have set

Figure 3.4: Example Output from the Network Top Program

off flags for large amounts of outgoing traffic. The results will be outputted to a file with the name of the given pcap file concatenated with " outbound ips.log." Sample output can be seen in Figure 3.4. The results are tab-delimited with each unique destination IP address on a separate line. The first entry of the line is the IP address to which the outgoing traffic is destined. The second entry is the number of bytes that have been sent in the IP packets' payloads. A design choice was made to not decode the packet past IP to the transport layer to help with performance. If the administrator is looking at the results in relation to other entries, this choice should not affect the process of searching for suspicious outgoing traffic. The third entry will have the successfully resolved DNS name or an "N/A" designation. The last entry will be blank or will have the description from the file given with the -a option if there is a match. The last column will not be present if the -a option is not given. The sample output in Figure 3.4 displays the 10 destination IP addresses receiving the most data from the data set explained in Section 5.1.1. Inspecting these samples, an administrator can quickly see that the user is banking with ING, using Live Mesh to back up some files, using a service from Google, watching a video on YouTube, communicating with a host from the University of Delaware, and sending data to 10.0.0.54. Assuming that the administrator is comfortable

with the system sending data to the labeled IP addresses, it leaves one suspicious IP address to investigate further. The administrator could quickly ignore the IP address if the 10.0.0.0/8 subnet is implemented within the internal network, but it will require attention if it is not an internal IP address. Note the DNS name for 64.249.81.83 is lga15s01-in-f83.1e100.net. The DNS name could concern the administrator, because it does not give any clues to what organization administers that IP address. However, a whois command will quickly inform the administrator that Google maintains the IP address. The administrator can add a label to the IP address so it can be quickly recognized next time it appears.
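The core bookkeeping of the network top program reduces to a few steps: accumulate outgoing payload bytes per destination IP, resolve DNS names, and apply the -a labels with partial (regular-expression) matching. A minimal Python sketch, again assuming scapy for pcap parsing; the file format follows the description above, but the helper names are hypothetical.

    import re
    import socket
    from collections import Counter
    from scapy.all import rdpcap, IP

    def load_labels(path):
        # Tab-delimited lines: an IP-address pattern and a label, e.g.
        # "64.233.160."<TAB>"Google" to cover the whole 64.233.160.0/24 range.
        labels = []
        with open(path) as f:
            for line in f:
                pattern, label = line.rstrip("\n").split("\t")
                labels.append((re.compile(pattern), label))
        return labels

    def network_top(pcap_file, host_ip, labels=()):
        byte_counts = Counter()
        for pkt in rdpcap(pcap_file):
            if pkt.haslayer(IP) and pkt[IP].src == host_ip:
                # Count IP payload bytes only; the packet is not decoded past IP.
                byte_counts[pkt[IP].dst] += len(pkt[IP].payload)
        for dst, nbytes in byte_counts.most_common():
            try:
                name = socket.gethostbyaddr(dst)[0]
            except socket.herror:
                name = "N/A"
            label = next((lab for pat, lab in labels if pat.search(dst)), "")
            print(f"{dst}\t{nbytes}\t{name}\t{label}")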

3.4 DNS Extractor Another way for an administrator to find where the data is going is to inspect the DNS queries and answers. It is rare that users will remember the IP addresses of systems on the Internet. If the IP address has a DNS name, the user will use the DNS name instead of attempting to remember the IP address. The DNS name will need to be resolved by the user's computer. A packet capture of the computer's network traffic will contain all of the DNS queries and DNS answers for the computer. An administrator can use these queries and answers to get a summary of what users are doing over the network. The DNS Extractor takes a pcap file as its input. All of the DNS answers will be extracted from the pcap. The information in the answer will then be printed out to a file. Sample output from the DNS Extractor program is shown in Figure 3.5. The output is displayed in a fashion very similar to a sentence written in English, so it is very intuitive. Each line of the output represents an entry from a DNS answer. The first IP address is the system that sent the DNS query, while the second is the DNS server that answered the query. An IP address is being requested for the first DNS name in the line. The last part of the line can be a DNS name or an IP address. The reason for receiving another DNS name is that there is a CNAME for

Figure 3.5: Example Output from the DNS Extractor Program

that particular DNS name. The resolver will then try to resolve the CNAME to an IP address. The process will continue until the IP address is already known (currently stored in the cache) or the name has been resolved. If there is an organization-wide security policy of not using social networks on the computer systems, an administrator can run the tool and grep the results for phrases such as "facebook" or "twitter" to find the users that are breaking the policy and putting the organization at risk. Many botnets use obscure DNS names for their command and control servers. The DNS Extractor can be used to search for these obscure host names, and if they are found there is a good chance that the host is infected.
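The extraction itself is a walk over the answer records of each DNS response. A sketch under the assumption of scapy 2.5 or later, where the answer section of a DNS layer is exposed as a list of DNSRR records (record type 1 is an A record, type 5 a CNAME):

    from scapy.all import rdpcap, IP, DNS

    def extract_dns_answers(pcap_file):
        for pkt in rdpcap(pcap_file):
            if not (pkt.haslayer(DNS) and pkt.haslayer(IP)) or pkt[DNS].ancount == 0:
                continue
            querier, server = pkt[IP].dst, pkt[IP].src  # an answer flows server -> querier
            for rr in pkt[DNS].an or []:
                kind = "CNAME" if rr.type == 5 else "A"
                print(f"{querier} asked {server} for {rr.rrname} ({kind}) -> {rr.rdata}")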

3.5 Session Extractor A session can be described as a persistent connection between two hosts. Sessions are defined by the two hosts' addresses; whether those are MAC addresses, IP addresses, or IP addresses with their respective ports depends on the type of traffic. They can be thought of as a conversation between two people, where each word is a packet and the complete conversation is a session. Two hosts can have multiple sessions between them at any given time. Multiple conversations, such as speaking, emailing, and writing notes to each other, would be the same as two hosts having multiple sessions with each other. A person speaking to two people at once is the same as a host having a session with two different hosts.

Figure 3.6: Example Output from the Session Extractor Program

The combination of the source and destination addresses can give enough information to decode the application of the session, and it can help the system administrator debug the outgoing network traffic. A common example of determining the application of a session can be found in a session between an end user browsing the Internet and a web server. The session would look similar to 123.45.67.89:12345 -> 98.76.54.32:80. The system administrator can look at the session information, see a destination of port 80, and deduce that it is web traffic. Further investigation into the IP address of the host with port 80 would reveal the web site serving the traffic and what organization is administering it. The Session Extractor takes a pcap file for its input. The host's IP address will be taken as a command line argument. If a host is not passed as an option, the program will attempt to find the appropriate host. It does this by searching for the most common host in the pcap file, and assumes the administrator is interested in this host. The sessions with outgoing traffic from the host will be extracted from the pcap file. The results will be outputted to a file with the same file name as the given pcap file with .log concatenated to the end. Only unique sessions will be outputted to the results file. A sample of the output from the Session Extractor is shown in Figure 3.6. The left side of the -> is the given host and its port, while the right side is the destination host and the port it is utilizing for the session. An administrator can now parse the files for the destination IP address to be able to inspect the sessions and decode their function.
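A sketch of the extraction under the same scapy assumption; the fallback of picking the most frequently seen address as the host of interest follows the description above.

    from collections import Counter
    from scapy.all import rdpcap, IP, TCP, UDP

    def extract_sessions(pcap_file, host=None):
        pkts = [p for p in rdpcap(pcap_file) if p.haslayer(IP)]
        if host is None:
            # Assume the most common address in the capture is the host of interest.
            addrs = Counter(a for p in pkts for a in (p[IP].src, p[IP].dst))
            host = addrs.most_common(1)[0][0]
        sessions = set()
        for p in pkts:
            if p[IP].src != host:
                continue  # only sessions with outgoing traffic from the host
            if p.haslayer(TCP) or p.haslayer(UDP):
                l4 = p[TCP] if p.haslayer(TCP) else p[UDP]
                sessions.add(f"{p[IP].src}:{l4.sport} -> {p[IP].dst}:{l4.dport}")
            else:
                sessions.add(f"{p[IP].src} -> {p[IP].dst}")
        return sorted(sessions)  # unique sessions only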

3.6 IP Helper The IP Helper program is a simple GUI program that helps with the process of creating whitelists and blacklists for a network. A screenshot of the GUI is shown in Figure 3.7. It was created as a helper program for ExFILD. ExFILD uses whitelists and blacklists in its decision-making processes. ExFILD will also output the IP addresses it processes to a file to be used by the IP Helper program to create the lists. The IP Helper program will open a file located at ../Config/ipList. It will sort and remove the duplicate IP addresses at the beginning of the program. It will then iterate through the IP addresses, displaying the resolved DNS name (if a DNS name could successfully be resolved) and the location of the IP address. The location of the IP address is found using the Geo::IP [24] Perl module and the GeoLite City [25] and GeoLite Country [26] databases. The databases contain the information that allows IP addresses to be mapped to a geographical region. The administrator will use the information to make the decision of how to handle the IP address. The program has six buttons in total. Two of them are for starting and exiting the program. The other four are the choices of how the IP address can be handled. They are the whitelist, blacklist, neither, and ignore for now buttons. The whitelist and blacklist buttons will add the IP address to the respective files located at ../Config/whiteList and ../Config/blackList. The neither button will add the IP address to a file located at ../Config/ignoreList. The ignoreList file is only used by the IP Helper program. The IP addresses are read from the ignore list and excluded from the list of IP addresses shown to the administrator during the process. This functionality was added for the case where an administrator does not want to add an IP address to the whitelist or blacklist, but without it the address would continue to be displayed every time the IP Helper program is run. The administrator would use this button to command the program to ignore this IP address in the future. The ignore for now button is for the instances when the

Figure 3.7: GUI for the IP Helper Program

administrator is not sure what should be done with an IP address and wants it to be shown next time the program is run. When exiting the program, the IP addresses that have yet to be addressed by the administrator will be outputted back into the ipList file. A convenient side effect is that all of the duplicates were removed at the beginning of the program, so an administrator can run the program and then immediately exit to shrink the ipList file when it starts getting too large.

Chapter 4

EXFILD DESIGN AND IMPLEMENTATION

ExFILD was designed to facilitate the detection of data being exfiltrated. It is designed to inspect only the outgoing traffic of a host. It will automate the decisions usually made by a system administrator while searching for suspicious traffic leaving a computer system. The foundation of ExFILD is the encryption characteristics of the packets and sessions. Features will be extracted from the network traffic during processing to be used as the input of the decision-making tree, which will be referred to as the tree in the rest of this thesis. The processing of the network traffic will be discussed in Section 4.1 and the tree will be discussed in Section 4.2.

4.1 Packet and Session Processing The network traffic is processed two different times, once at the packet level and another time at the session level. The flow for the processing of the packets is shown in Figure 4.1 and the flow for the processing of the sessions is shown in Figure 4.2. The processing of the packets will occur first, and the processing of the sessions will occur after all of the packets have been processed. The two processes are almost identical. The main differences are the inputs taken and the first two stages in the processing of the packets. The session processing does not require the first stage, because the packet processing handles the decoding and then extracts the sessions. The individual steps of the processes will be explained in the

following sub-sections, except for the tree. The tree will be explained on its own in Section 4.2.

Figure 4.1: The Flow of the Processing of Packets

Figure 4.2: The Flow of the Processing of Sessions

4.1.1 Packet Decoding The decoding process is responsible for extracting key features from the packets. It supports the data link, network, and transport layer protocols: Ethernet in the data link layer, IP in the network layer, and TCP, UDP, ICMP, and IGMP in the transport layer. The decoder does not handle the application layer; however, it will attempt to determine the application protocol using the destination port, and even the source port if needed. Many application layer protocols are assigned to specific ports by IANA (Internet Assigned Numbers Authority) [27], which makes it possible to match the two ports to a standard protocol. For the ports that are not assigned to a specific application layer protocol, an administrator will have to investigate the traffic to determine if it is a constant port for an application layer protocol or just a randomly negotiated port. The administrator can then add functionality for any new application layer protocols that are not yet supported.

All of the stages rely on the decoder to provide features that are extracted from the packets and sessions. When the packet has been decoded as far up the layers as possible, it will extract the information from the current layer. The information extracted will be the protocol itself, the session identifier (explained in detail in Section 4.1.2), the packet count of the session to which it belongs, the data, and the destination IP address. The different stages of the process will use these features to perform their functions. They may need more features such as the data length, but the other features can be derived as needed from the features that were passed to them. The decoder will also extract some other features for strictly statistical purposes that are outputted to a file with the same name as the pcap file and .stats concatenated to the end as the file extension.

4.1.2 Extract Sessions The extracting sessions stage combines all of the packets in the same session into one piece of data. The packets are grouped together using the session identifier provided by the packet decoder. The session identifier consists of a string containing the source and destination IP addresses and their respective ports. The format of the session identifier is SRC IP:SRC PORT->DEST IP:DEST PORT, where SRC IP and DEST IP are the source and destination IP addresses and SRC PORT and DEST PORT are the source and destination ports. Not all session identifiers use this format, but it is the most common session identifier due to most of the packets in a network being TCP or UDP packets. IP packets that are not TCP or UDP have a session identifier with a format of SRC IP->DEST IP. Ethernet packets will have a session identifier with the format SRC MAC->DEST MAC, where SRC MAC is the source MAC address and DEST MAC is the destination MAC address. ARP packets will have a session identifier with the format of SHA:SPA->THA:TPA, where SHA is the source hardware address, SPA is the source protocol address, THA is the target hardware address, and TPA is the target protocol address. The

session extractor will attach the protocol, the accumulation of the data in all of the packets from the session, and the number of packets in the session to the session identifier to form a tuple to be processed. These sessions will be processed after the processing of the packets is complete. Session identifiers are not far from being unique, but there is a flaw in using them. Using session identifiers of this form without any discrimination in the time domain allows for the possibility of two separate sessions being grouped together. The probability of two sessions being grouped together at the transport layer is low due to the client ports being incremented or randomized with every session. The session identifiers for IP packets with no transport layer information are more likely to characterize multiple sessions between two hosts as one session. Ethernet packets will also have the same problem as the IP packets, as the same session identifier will characterize any Ethernet packet between the same two hosts.
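The identifier formats above reduce to a small dispatch on the decoded layers. A hedged Python sketch (the layer and field names follow scapy conventions, an assumption carried over from the earlier sketches):

    from scapy.all import Ether, ARP, IP, TCP, UDP

    def session_id(pkt):
        # Most specific identifier first, per the formats described above.
        if pkt.haslayer(ARP):
            a = pkt[ARP]
            return f"{a.hwsrc}:{a.psrc}->{a.hwdst}:{a.pdst}"   # SHA:SPA->THA:TPA
        if pkt.haslayer(IP):
            ip = pkt[IP]
            for l4 in (TCP, UDP):
                if pkt.haslayer(l4):
                    return f"{ip.src}:{pkt[l4].sport}->{ip.dst}:{pkt[l4].dport}"
            return f"{ip.src}->{ip.dst}"                       # other IP protocols
        return f"{pkt[Ether].src}->{pkt[Ether].dst}"           # bare Ethernet frames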

4.1.3 Entropy Calculation Data encrypted through good encryption algorithms should look random when observed at the bit level. Theoretically the data should be completely random, so that a person should not be able to find any patterns that leak information about the encrypted data. Entropy is used to characterize the randomness of data. A string of all the same characters will give an entropy value of 0. The entropy will be higher as the string becomes more random. Entropy is used to detect whether a packet is encrypted from this randomness measurement. Good encryption will produce very random sets of data, which means that its entropy will be higher. A good demonstration of this is obtained by looking at the payload of an HTTPS packet and comparing it to the payload of an HTTP packet, which is discussed more in Section 4.1.5 and is plotted in Figure 4.7.

H = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)    (4.1)

Equation 4.1 is used to calculate the entropy of data (in our case a packet's payload); it was derived by C. E. Shannon in [28]. It is the sum, over all possible combinations, of the probability of a combination occurring multiplied by the log2 of that probability. Conceptually, it will produce a histogram of the possible bit combinations. If the distribution is even across all of the combinations, then the data will have a high entropy value. Conversely, an uneven distribution of the possible combinations will result in a lower entropy value. N is the number of possible combinations. N is 256 for this thesis due to the code inspecting 8 bits at a time, meaning there are 2^8, or 256, possible combinations. The calculation is done 8 bits, or one byte, at a time, because most computers communicate in multiples of bytes. Also, human readable text is common in computer communications, each character being represented by an 8-bit code known as ASCII. The entropy calculation can range from 0 to 8. Consider an example packet with "AAAA" as its payload. The entropy calculation for the bit combination that represents 'A' will have a probability of 1, and the log2 of that probability will be 0. This will give an entropy value of 0 for that combination. The rest of the combinations will have a probability of 0, meaning their respective entropy terms will be 0. Summing the results of these entropy calculations will be 0. Now consider the opposite side of the spectrum. Assume that the packet payload contains exactly one of each of the 256 ASCII characters. Every possible combination will have an equal probability of occurring, with a probability of 1/256. The result of the log calculation will be -8. Summing all of the combinations' terms will result in 8. Figure 4.3 shows the maximum entropy versus the size of a packet's payload.
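A direct implementation of Equation 4.1 over bytes makes the two worked examples above concrete. This is an illustrative Python sketch, not ExFILD's actual code (the thesis tooling was written in Perl):

    import math
    from collections import Counter

    def byte_entropy(payload: bytes) -> float:
        # Shannon entropy computed 8 bits at a time; the result ranges from 0 to 8.
        if not payload:
            return 0.0
        counts = Counter(payload)   # histogram over the 256 possible byte values
        n = len(payload)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    assert byte_entropy(b"AAAA") == 0.0            # one repeated byte: entropy 0
    assert byte_entropy(bytes(range(256))) == 8.0  # uniform distribution: entropy 8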

It is observed that the maximum entropy value is not 8 between the sizes of 0 and 256 bytes. The maximum entropy value grows as the number of possible combinations increases, forming a steep slope until it reaches the size of 256 bytes. The slope is an artifact of the equation for Shannon's entropy. Since the algorithm looks at 8 bits at a time, which makes 256 possible combinations, each of those combinations must occur at least once, and the same number of times as all of the other combinations, for the entropy to be 8. If a payload has a size of 2 bytes, two of the combinations will have a probability of 1/2, but the other 254 combinations will have a probability of 0, which will result in an entropy of 1, not 8. As the size of the packet grows from 0 to 256, the entropy values approach 8, which is the asymptote or the maximum possible entropy for this calculation.

Figure 4.3: Maximum and Minimum Entropy vs. Size of a Packet’s Payload

Figure 4.4: Maximum Entropy vs. Size of a Packet's Payload

Another interesting artifact of Equation 4.1 is displayed in Figure 4.4. Intuition would lead a person to believe that payloads of size greater than 256 bytes would have a maximum possible entropy value of 8. The dips in the calculated entropy disprove the assumption made from one's intuition. The dips can be explained succinctly with an example. Consider the scenario of a 257-byte payload, where each possible combination occurs once, except the bit combination of "00000000," which occurs twice. The bit combination of "00000000" will have a probability of 2/257, while all of the other combinations will have a probability of 1/257. For the entropy calculation to equal the maximum of 8, all of the combinations must have the same probability of occurring as each other. The payload size of 257 is not evenly divisible by 256, so it is not possible to reach the entropy of 8 with a payload size of 257.

The first dip, between the sizes of 256 bytes and 512 bytes, is more pronounced than the dip between the sizes of 512 bytes and 768 bytes. The dips will become less pronounced as the sizes of the payloads grow larger. If you look at the transitions from 256 bytes to 257 bytes and from 512 bytes to 513 bytes, the 257th and 513th bytes will decrease the maximum entropy by different amounts. The 257th byte will leave one combination at a probability of 2/257 and the rest with a probability of 1/257, while the 513th byte will leave one combination at 3/513 and the rest with a probability of 2/513. In the case of the payload with the size of 257 bytes, the difference between the probabilities of the combination from the 257th byte and the rest of the combinations will be 2/257 - 1/257 = 1/257. In the case of the payload with the size of 513 bytes, the difference between the probabilities of the combination from the 513th byte and the rest of the combinations will be 3/513 - 2/513 = 1/513. The difference of 1/257 is greater than 1/513, which causes it to have a greater impact, thus pulling the maximum possible entropy farther down from 8.

Section 4.1.3.2. The dips are not a concern for the research in this thesis, because their magnitude is small enough that they will not shift many entropy values enough to change their observed state of encryption from encrypted to unencrypted.

4.1.3.1 Scaling by Initial Values The first proposed solution to the problem was to initialize the count of each combination to a value other than 0. Initializing the count of each of the possible combinations to 1 would be equivalent to a 256-byte packet that has a completely random payload. The 256-byte payload is where the maximum possible entropies begin to flatten out. The idea is that the line of maximum possible entropies is scaled up to 8, thus removing the steep slope. The solution was tested using 10 different initial values starting at 0.1 and incrementing by 0.1 up to 1. Figure 4.5 shows the plots for the initial values of 0.1, 0.3, 0.7, and 1. Each plot has two lines, with the top line being the maximum possible entropy values and the bottom one being the minimum possible entropy values. The steep slope of maximum possible entropies has been removed from the line. It has been normalized into a dip down from the line with the entropy value of 8. The larger the initial value, the smaller the dip becomes and thus the quicker the maximum entropies converge on the value of 8. The solution performs the desired normalization on the maximum entropies. A problem arises with the minimum entropies line when using this solution. The minimum possible entropies for the values that are not scaled are all 0, as shown in Figure 4.3. The minimum values will start at a value of 8 and slope downward when the initial value is greater than 0. The minimum entropy values converge to 0 at a slower rate as the initial values grow larger. Packets with small payloads will have a larger chance of being labeled as encrypted even if they are unencrypted. The solution alleviates the problems with the maximum values, but creates a problem with the minimum values.
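This variant is a one-line change to the histogram: seed every one of the 256 byte-value counts with a constant before tallying the payload. A hedged sketch of that idea, following the structure of the earlier entropy sketch:

    import math

    def byte_entropy_with_prior(payload: bytes, init: float = 1.0) -> float:
        # Every byte-value count starts at `init` (0 < init <= 1) instead of 0.
        counts = [init] * 256
        for b in payload:
            counts[b] += 1
        total = len(payload) + 256 * init
        return -sum(c / total * math.log2(c / total) for c in counts)

As the plots in Figure 4.5 show, seeding the counts lifts the maximum possible entropies toward 8, but it also inflates the entropy of small, non-random payloads, which is why this approach was not adopted.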

(a) Initial value set to 0.1. (b) Initial value set to 0.3.

(c) Initial value set to 0.7. (d) Initial value set to 1.

Figure 4.5: Maximum and Minimum Entropy Values with Different Initial Values.

4.1.3.2 Scaling by Size The solution that is implemented in ExFILD is a scaling technique based on the size of the payload. Since the range between 0 bytes and 256 bytes is the problematic area, only the payloads that are less than 256 bytes will be scaled using this method. The scaling is as simple as multiplying the entropies by a scalar. The scalar will be based on the relationship between the maximum possible entropy for a size and 8, which is the maximum possible entropy for this entropy calculation. The scalar is calculated using Equation 4.2.

scalar = \frac{8}{\log_2(\text{payload size})}    (4.2)

Equation 4.2 shows that the scalar is dependent only on the size of the payload. Using this scalar allows the possible entropies of payloads less than 256 bytes to range from 0 to 8. The scalar can range from 8 down to 1. Table 4.1 shows example scalars for milestone payload sizes. This scalar could also be used to correct the dips for payloads larger than 256 bytes, but it is not, because the dips do not have a significant impact and the extra computation is unnecessary.

Table 4.1: Scalars for Max Entropy of Payloads by Size

    Packet Size    Max Entropy    Scalar
    0              0              0
    1              0              0
    2              1              8
    4              2              4
    8              3              2.667
    16             4              2
    32             5              1.6
    64             6              1.333
    128            7              1.143
    256            8              1
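A sketch of the scaling as described: payloads shorter than 256 bytes have their entropy multiplied by 8 / log2(size), reproducing the scalars in Table 4.1. It reuses byte_entropy from the earlier sketch; the special-casing of 0- and 1-byte payloads (whose maximum entropy is 0) and the rounding guard are assumptions consistent with the table and the text.

    import math

    def scaled_entropy(payload: bytes) -> float:
        n = len(payload)
        if n < 2:
            return 0.0                 # max entropy is 0 for 0- and 1-byte payloads
        h = byte_entropy(payload)      # Shannon entropy, from the earlier sketch
        if n < 256:
            h *= 8 / math.log2(n)      # Equation 4.2
        return min(h, 8.0)             # guard against rounding slightly past 8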

Figure 4.6 displays the lines of maximum and minimum entropy values that have been scaled based on the payload size. The line of maximum entropy values is almost identical to a line with the entropy value equal to 8. Any deviation from 8 is due only to the multiplication of decimals and rounding. The line of minimum entropy values is identical to a line with an entropy value of 0. The scaling has proven to work well on the line of maximum entropy values without disturbing the line of minimum entropy values.

Figure 4.6: Scaled Maximum and Minimum Entropy Values

4.1.4 Checking if Encryption is Expected Encryption is the most important characteristic of network traffic for this thesis. Knowing whether network traffic is encrypted is not enough to make a determination of whether the outgoing traffic is benign or undesired. An administrator needs to know whether the outgoing traffic is supposed to be encrypted or not. Consider, for example, an administrator who observes traffic between two hosts that appears to be encrypted. The administrator is nervous about the encrypted traffic and immediately suspects that sensitive data is being exfiltrated from the network. The administrator will start investing resources into determining the application of the outgoing traffic. The investment of resources may be worthwhile if it is HTTP traffic that is not expected to be encrypted, but it would be less likely to be useful if

the traffic was HTTPS, where it is supposed to be encrypted. The administrator would have more justification to investigate why encrypted data is leaving in HTTP packets than encrypted traffic leaving in HTTPS packets, since the former is abnormal. This justification is based on whether or not the traffic is encrypted and if it is expected to be encrypted, not on other characteristics such as the IP address being in the blacklist. Taking the other characteristics into account may justify another action being taken. The stage checking if encryption is expected is just as important as the stage checking if the data is encrypted. This stage uses the most specific protocol linked with the data to determine if encryption is expected. The application layer protocol that is determined by the ports of the traffic is used a majority of the time. Common protocols are already supported, meaning that it has already been determined whether each protocol's data should or should not be encrypted under standard use. Administrators can add protocols seen on the network as needed. They will need to determine whether encryption is expected. The protocol of the packet or session will be checked against the supported protocols. There is a possibility that the protocol is yet to be handled by this stage, in which case encryption is not expected by default. Because of this default, protocols do not need to be added unless encryption is expected, although adding protocols that do not expect encryption can save other administrators from having to make that determination later.
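The check reduces to a lookup from the most specific protocol to an expectation, with unsupported protocols defaulting to "encryption not expected." A minimal sketch; the protocol names and the table's contents are illustrative assumptions, not ExFILD's actual list:

    # Whether encryption is expected under standard use of each protocol.
    ENCRYPTION_EXPECTED = {
        "HTTP": False, "FTP": False, "DNS": False, "SMTP": False,
        "HTTPS": True, "SSH": True, "FTPS": True, "IMAPS": True,
    }

    def expects_encryption(protocol: str) -> bool:
        # Unsupported protocols default to "encryption not expected."
        return ENCRYPTION_EXPECTED.get(protocol, False)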

4.1.5 Checking if Encryption is Present Encryption is a standard for people who want to protect their data from others. When encryption is implemented correctly, it is a very effective method of protecting data. Although encryption is very useful, it can also be used against people, for example while attackers are exfiltrating their data, as discussed in Section 2.1.11. An administrator will know if encryption is expected from the previous

stage, and now needs to determine whether or not it is actually encrypted. The state of encryption for a packet or session is observed using the scaled entropy from Section 4.1.3.2. A comparison is performed between the entropies and a threshold set by an administrator. The comparison in this stage is trivial; however, setting the threshold is not. As with all thresholds, an administrator does not want to set it too high or too low. A threshold that is too low will cause false positives, and one that is too high will cause false negatives. Assuming an administrator is more concerned with encrypted traffic than with unencrypted traffic leaving the network, false positives will waste the administrator's resources investigating unencrypted traffic, while false negatives will let encrypted traffic leave the network unobserved by the administrator. For this thesis, false positives and negatives will cause the traffic to be mishandled, so it is important to set a threshold that produces a minimal amount of either. The initial threshold for the program was set using the data set described in Section 5.1.1. HTTP and HTTPS were used as the basis for setting the threshold. The reason these two protocols were chosen is that they perform the same function, but HTTP is not encrypted while HTTPS is encrypted. An important observation to take away from Figure 4.7 is the separation between most of the HTTP and HTTPS traffic. The separation between encrypted and unencrypted traffic is the attribute of entropy that allows for the determination of whether a payload is encrypted. The result of averaging the entropies for all of the HTTP and HTTPS packets was used to set the threshold. The entropies have enough separation that the average of the packets will be a good value to delineate between the two. The threshold was calculated to be 6.51237, which was rounded down to 6.5 for this thesis. The threshold is represented by the straight line in Figure 4.7. There are outliers that would be considered false positives and negatives, but some of them will be handled in checks within the tree, explained in Section 4.2. The

Figure 4.7: HTTP and HTTPS Traffic

outliers were the motivation to input sessions through the tree along with packets, due to the belief that the small size of the payloads does not allow for an accurate measurement of entropy. By grouping all of the packets in a session together, the entropy for the entire session will be calculated, which will normalize any outliers in the session. In the case where the observed state of encryption does not match the expected state, the session will be flagged and will need to be investigated. ExFILD will output both the packet and session stats, so administrators may use whichever of the two they prefer.
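The observation itself is then a single comparison against the threshold. A sketch, reusing scaled_entropy from Section 4.1.3.2 and the 6.5 value derived above:

    ENTROPY_THRESHOLD = 6.5  # average of HTTP and HTTPS packet entropies, rounded

    def appears_encrypted(payload: bytes) -> bool:
        return scaled_entropy(payload) >= ENTROPY_THRESHOLD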

Figure 4.8: The Four Branches of the Tree

4.2 Tree The decision-making tree (which will be known as the tree throughout this thesis) is responsible for deciding whether a packet or session might contain exfiltrated data and should be logged. All of the packets and sessions will be sent through the tree except for traffic from the data link layer. Data link layer packets will only travel on the local network, which for this thesis is assumed to be trusted, so they are omitted from the tree. Figure 4.8 shows the division of the four types of traffic being inputted into the tree. The tree branches out into four branches: expected and received unencrypted traffic, expected unencrypted but received encrypted traffic, expected encrypted but received unencrypted traffic, and expected and received encrypted traffic. Each of the four branches will make a different set of decisions to decide whether or not to flag the packet or session. The two branches where the expected state of encryption matches the received state are the less interesting cases in this thesis. The other two branches are more interesting, because they show behavior that is not expected under the standard use of a protocol. The branch where the

payload is expected to be unencrypted but is observed to be encrypted tends to be the most intriguing branch. The four branches will be discussed in more detail in Sections 4.2.1 through 4.2.4.
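The top of the tree is a dispatch on the (expected, observed) pair. A hedged sketch of that dispatch, reusing the two checks sketched in Sections 4.1.4 and 4.1.5; the four branch handlers correspond to Sections 4.2.1 through 4.2.4 and are assumed to exist:

    def route(item, protocol: str, payload: bytes):
        expected = expects_encryption(protocol)   # Section 4.1.4
        observed = appears_encrypted(payload)     # Section 4.1.5
        if not expected and not observed:
            return branch_clear_clear(item)       # Section 4.2.1
        if expected and observed:
            return branch_enc_enc(item)           # Section 4.2.2
        if expected and not observed:
            return branch_enc_clear(item)         # Section 4.2.3
        return branch_clear_enc(item)             # Section 4.2.4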

4.2.1 Expected Unencrypted and Received Unencrypted The case where the traffic is expected to be in the clear and the received traffic is not encrypted is very common because of protocols such as HTTP. Sensitive information should not be sent over unencrypted protocols, but if it is, an administrator will be able to inspect the payloads of the packets for anything suspicious. Figure 4.9 displays the flow for handling the traffic under this characterization. As a convention for this thesis, any decision will result in two branches, with the top being for a result of true and the bottom for a result of false.

Figure 4.9: Flow for Expected Unencrypted and Received Unencrypted Branch

The first stage, which checks the payloads for censored strings, is derived from the Corporate Watcher tool described in Section 3.2. It will take regular expressions from a file located at "../Config/censoredList", and it will log an alert for any matches to the expressions. This stage is very effective for administrators with well-defined strings that should not be leaving the network. The censoredList file initially contained regular expressions for social security numbers and credit card numbers.

If credit cards are not a concern for a particular network, the regular expressions for them should be removed. These regular expressions caused a lot of false positives, because many payloads contain long strings of digits. The censoredList is now empty by default. An administrator will need to add the regular expressions as needed and tweak them to avoid false positives. False positives may not be avoidable, so an administrator would need to make the choice of whether or not to include them. The destination IP addresses will then be checked against the blacklisted IP addresses listed in the file "../Config/blackList", and an alert will be logged for each match. The blacklist is empty by default and during the running of ExFILD in Chapter 5. The destination addresses will then be checked against the whitelisted IP addresses in the file located at "../Config/whiteList". If there is a match, no alert will be logged and the tree will process the next packet or session. The default whitelist includes some IP addresses that belong to Skype, Google, multicasting, and broadcasting. The Google and Skype IP addresses are included because their services are run on the host that created the Control Data Set. The multicast (224.0.0.251 for mDNS and 224.0.0.252 for LLMNR) and broadcast (255.255.255.255) addresses are included because they are used on the local network and considered trusted. The checks will end if there is no match at this point, but additional checks can be added here for packets and sessions that are in neither the whitelist nor the blacklist. The choice was made to place the blacklisting stage in front of the whitelisting stage in case an IP address is in both lists. This way the IP address will either be handled correctly as a blacklisted IP address or, if it was actually supposed to be whitelisted, merely create a false positive in the logs. Switching the order of these two stages would lead to a false negative that will not be shown in any logs, and an administrator may not catch the undesired traffic leaving the network.
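A sketch of this branch's checks, ordered blacklist-before-whitelist as argued above. It reuses load_patterns from the Corporate Watcher sketch; the item attributes and the load_ips and log_alert helpers are hypothetical:

    def branch_clear_clear(item):
        # Stage 1: censored strings (Corporate Watcher, Section 3.2).
        for pattern in load_patterns("../Config/censoredList"):
            if pattern.search(item.payload):
                log_alert(item, f"censored expression matched: {pattern.pattern}")
        # Stage 2: blacklist before whitelist, so a conflict errs toward an alert.
        if item.dst_ip in load_ips("../Config/blackList"):
            log_alert(item, "destination IP address is blacklisted")
        elif item.dst_ip in load_ips("../Config/whiteList"):
            return  # trusted destination: no alert, process the next item
        # Neither list: the checks end here; additional stages could be added.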

Figure 4.10: Flow for the Expected Encrypted and Received Encrypted Branch

4.2.2 Expected Encrypted and Received Encrypted The case where encrypted traffic was both expected and received is processed in a very similar manner as described in Section 4.2.1. The difference is in the first stage of the processing of the packets and sessions. The Corporate Watcher program will not be effective against encrypted data, because the data has been obfuscated. The only way an administrator would be able to use regular expressions is if effort were put forth to decrypt the data, and at that point the administrator would most likely want to look at it by hand. The flow for processing traffic of this characterization is shown in Figure 4.10. The first stage will check if the data is actually a compressed file being served over HTTP and not encrypted data. Packet payloads with compressed files will also be random, thus having high entropy. The compressed files are found by their magic numbers. A magic number is a set of bytes that can be found in every file of a certain type and used to identify that file type. Table 4.2 has a list of the magic numbers for the file types checked for in this stage. It also contains the strings used to match them to an HTTP transfer. The HTTP strings are used in the matching because, without some other reference to the file, there would be a lot of false positives. Magic file numbers are short in length, which

Table 4.2: Magic Numbers and HTTP Strings for Compressed Files

    File Type    Extension    Magic Number      HTTP String
    Bzip         .bz          42 5A             Content-Type: application/zip
    Gzip         .gz          1F 8B             Content-Encoding: gzip
    ZIP          .zip         50 4B 03 04       Content-Type: application/octet-stream
    Tar          .tar         75 73 74 61 72    Content-Type: application/x-tar

leads to a lot of matches in payloads. In order to use only the magic numbers for matching, all of the application layer protocols would need to be decoded to remove any header information and find the beginning of the file. The Unix file program is able to use only the magic numbers, because it is looking at the file itself, with no data from the headers of the application layer protocols. The stage checking for compression will be more effective when inspecting the sessions rather than the packets, because not every packet will have the magic numbers or the HTTP strings to match against. Even if a packet does not have the magic file number or HTTP string, it could still be a packet containing part of a compressed file. All of the packets in the session will be treated as one, so only one packet containing this information needs to match to be able to exclude the rest of the packets in the session. The file types in Table 4.2 are supported by default, but more can be added as required. Future work is planned to decompress the files and run them through the Corporate Watcher program or some other checks. This work is discussed in Section 7.5.
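A sketch of the compression check against Table 4.2, pairing each magic number with its HTTP string exactly as described, so a bare magic-number hit alone is not enough to exclude the data:

    COMPRESSED_SIGNATURES = [
        (b"\x42\x5a", b"Content-Type: application/zip"),                    # Bzip
        (b"\x1f\x8b", b"Content-Encoding: gzip"),                           # Gzip
        (b"\x50\x4b\x03\x04", b"Content-Type: application/octet-stream"),   # ZIP
        (b"\x75\x73\x74\x61\x72", b"Content-Type: application/x-tar"),      # Tar
    ]

    def looks_like_compressed_transfer(data: bytes) -> bool:
        # Require both the magic number and its HTTP string in the (session) data.
        return any(magic in data and http in data
                   for magic, http in COMPRESSED_SIGNATURES)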

4.2.3 Expected Encrypted and Received Unencrypted The case where the traffic is expected to be encrypted but unencrypted traffic is received is more concerning than the cases where the received traffic is what is expected. The traffic is automatically assumed to be suspicious once the received and

expected states of encryption do not match. It is not common for protocols to send encrypted messages part of the time and then switch to unencrypted messages, or vice versa. The processing for the case of expected encrypted but received unencrypted traffic is similar to the case where unencrypted traffic is both expected and received, described in Section 4.2.1. The flow for the processing of this case is shown in Figure 4.11. The differences are the addition of the first stage, different alert messages corresponding to this case, and an alert added after the whitelist check.

Figure 4.11: Flow for Expected Encrypted and Received Unencrypted Branch

The first stage is added to filter out packets that are unencrypted but are used for the management of the encrypted traffic. Transport Layer Security [29] is a protocol that is used to communicate over an encrypted channel. The channel cannot be established without unencrypted packets to initialize the stream and negotiate the characteristics of the stream. The initialization is called a handshake. There are two parts of this handshake that will cause an unnecessary alert: the client hello message and the change cipher spec message. These are not the only messages in the handshake that are not encrypted, but the structure of the other messages and the values of their fields make them appear to be encrypted. The TLS client hello message contains a list of cipher suites that the client is capable of using for the communication. Each cipher suite is represented by two

constant byte values. The bytes are assigned in order, meaning that the first cipher suite is represented by the hex values 0x00 and 0x01, the second by the hex values 0x00 and 0x02, and the rest are numbered incrementally. This means that the first byte will be repeated multiple times within the packet, which will lower the entropy of the packet. The cipher suites that are assigned have two possible values for the first byte, the hex values 0x00 and 0xc0. The change cipher spec is a small message that is only 6 bytes, and it is a constant for TLS version 1. Three of the bytes of this message have the hex value 0x01. With half of the bytes being the same value, the entropy will not be high enough to be considered random. The alerts for this case will be different from those in the case of expected and received unencrypted traffic. The alerts in this case will be altered to include a message stating that the expected and received characteristics of the traffic were not the same. It is important for the administrator to be notified of this discrepancy. The other checks may be able to provide additional information explaining the cause of the discrepancy. The traffic received is unencrypted, which allows the Corporate Watcher to be used to search for censored strings. The blacklist and whitelist checks will be performed as described earlier in this chapter. If the destination IP address is not in the whitelist, then an alert will be logged for the discrepancy in the expected and received encryption characteristics.
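A sketch of the handshake filter. The byte layout is standard TLS: a record of type 0x16 (handshake) whose first message is of type 0x01 is a ClientHello, and the TLS 1.0 change cipher spec record is the constant six bytes 14 03 01 00 01 01 described above:

    CHANGE_CIPHER_SPEC_V1 = bytes.fromhex("140301000101")

    def is_tls_handshake_noise(payload: bytes) -> bool:
        if payload == CHANGE_CIPHER_SPEC_V1:
            return True
        # Record type 0x16 (handshake); byte 5 is the handshake message type,
        # and 0x01 marks a ClientHello.
        return len(payload) > 5 and payload[0] == 0x16 and payload[5] == 0x01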

4.2.4 Expected Unencrypted and Received Encrypted The case that causes the most suspicion is where the traffic is expected to be unencrypted, but the received traffic is encrypted. The discrepancy between the traffic's expected and received characteristics is definitely disconcerting, and the fact that the data is encrypted and cannot be inspected is even more worrisome. Setting up a server that initiates an encrypted channel requires additional work compared to one that serves data over an unencrypted channel. A server that implements TLS

or SSL will require the creation of a certificate. A lazy attacker will not put forth the extra effort to implement a server with encrypted channels. The attacker would be more inclined to encrypt the data and then send it over an unencrypted channel. That is why detecting the discrepancy between the entropy characteristics of the expected and received traffic is important; otherwise there would be no indication of this unwanted traffic leaving the network.

Figure 4.12: Flow for Expected Unencrypted and Received Encrypted Branch

Figure 4.12 displays the processing flow for the case where the outgoing network traffic is expected to be unencrypted, but the received traffic is encrypted. The stages for this case have all been discussed in the other cases except for the stage checking for ICMP requests and replies. The stages checking for compressed files being served by a web server and checking if the destination IP address is in the blacklist or whitelist have been discussed in the previous cases. The other stage of the processing is a check to remove packets that appear to be encrypted, but are not encrypted and are not used to maliciously exfiltrate data. The check is looking for ICMP ping request and reply messages. The list below contains the payloads of these ICMP requests and replies. The x's in the third item represent non-printable ASCII characters.

1. abcdefghijklmnopqrstuvwabcdefghi

2. ABCDEFGHIJKLMNOPQRSTUVWABCDEFGHI

3. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx !”#$%&’()*+,-./01234567

If ICMP ping requests and replies are found, the alerts will be suppressed. The alerts for all of the stages will contain both the warnings for the results of that stage’s check and a message stating that the traffic received was encrypted when it was expected to be unencrypted.
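A sketch of the suppression check. The first two payloads are matched exactly; the third, whose leading bytes are non-printable and not reproduced above, is matched on its printable tail (an assumption made here for illustration):

    KNOWN_PING_PAYLOADS = [
        b"abcdefghijklmnopqrstuvwabcdefghi",
        b"ABCDEFGHIJKLMNOPQRSTUVWABCDEFGHI",
    ]
    PING_PATTERN_TAIL = b' !"#$%&\'()*+,-./01234567'

    def is_standard_ping_payload(payload: bytes) -> bool:
        return payload in KNOWN_PING_PAYLOADS or payload.endswith(PING_PATTERN_TAIL)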

Chapter 5

EXPERIMENTS, RESULTS, AND ANALYSIS

5.1 Experiments ExFILD was run against multiple data sets to test how effective it is at detecting data exfiltration. It was run against eight different data sets that were produced on different machines and in different environments. One of the data sets was created with specific tasks to contain a wide variety of traffic, including intentional exfiltration of data, to be used as a control. Recording the normal use of two different users on different networks created another two of the data sets. One of these data sets was created at the University of Delaware using a university-owned computer during business hours, and the other one was created at another person's residence on their personal computer outside of business hours. The last data sets are packet captures from pieces of malware that exfiltrate data from a network. The data sets from malware are used to show the effectiveness of the program against actual threats.

5.1.1 Control Data Set The Control Data Set was created to get as many types of traffic as possible and was used to develop ExFILD. The diversity made it a good baseline to test ExFILD. Audio and video streams of the user were recorded during the creation of the data set. A written log was also created to document every step of the creation of the data set. The written log can be found in Appendix A. There is an advantage to using a data set created by hand instead of setting up a computer system and letting it monitor the user's network traffic during normal use. The advantage is that

the creator of the data set knows exactly what actions were performed to cause the network traffic. Using data sets that were recorded without someone documenting the actions will make it difficult to know if the program is working correctly. The created data set will act as the control experiment for this thesis. The Control Data Set was created by capturing the traffic entering and exiting an Apple MacBook running Mac OS X 10.6. The MacBook was using its Ethernet port to communicate over the network. The MacBook's network configuration during the creation of the Control Data Set is shown in Figure 5.1. It was connected to a Netgear switch (model GS724T) with four computers running Linux. The switch was behind a D-Link router (model DIR-655), which had a public IP address, meaning it can be accessed from other computers on the Internet, not only the hosts on the local network. Knowing the network configuration will help to understand what causes some of the alerts produced by ExFILD.

Figure 5.1: Network Diagram for the Control Data Set

The plan for the data set is to capture a large variety of different services

for a baseline to develop ExFILD. The packet capture covers a time span of 66 minutes. It includes activities that are commonly performed by users on a daily basis. Email was checked using the web interface as well as using a mail client that implements the IMAP protocol. Email attachments were downloaded using the web client. Many web pages were loaded, including CNN.com for the daily news and SI.com to read the sports news for the day. Multiple RSS feeds were accessed using Google Reader. Bank statements were checked for multiple bank accounts. Twitter accounts were checked using both the web and desktop clients. Videos were viewed from YouTube.com. Some other services that were run are less common to the general public, but common for people in fields that relate to computers and networks. SSH was used to log into other computer systems and perform some mundane tasks. VNC was used for remote desktop administration of another computer system on the network. A complete account of the activities performed during the creation of this data set is included in Appendix A. ExFILD is based on knowing the encryption characteristics of packets and network sessions. The data set needs to have traffic that is both encrypted and unencrypted in order to test the program adequately. The applications run during this data set were a mix of protocols that utilize encrypted and unencrypted communications. The activities that will produce encrypted data are accessing bank web sites through HTTPS, logging into computers using SSH, instant messaging with colleagues, and other applications used for secure communications. Many of the application protocols used over the Internet communicate using unencrypted channels. The data set will include unencrypted traffic from browsing web sites using HTTP, transferring files using FTP, resolving DNS names, and many other application protocols. The data set needs to have an example of data being exfiltrated within it. Simple means of exfiltration were added to the Control Data Set. The first means of

exfiltration was transferring files between two computers using FTP. The files are sent between the two over an unencrypted channel. The second means of exfiltration also transfers files between two hosts using FTP; however, it uses an FTP server over a channel encrypted using SSL. Two Perl scripts were written to transfer files, with one of the scripts for the unencrypted transfer and the other for the encrypted transfer. The scripts are run to transfer small text-only files throughout the data set. The sizes of the different files are 100, 1,000, 4,000, and 8,041 bytes of text from the Declaration of Independence. The reason for the files being so small is that large files will be much easier to detect due to the increase in network traffic from the larger files. All of the files were transferred using both scripts at a time when no other network traffic was present to set a baseline. The files were also transferred over the encrypted channel while web sites were loaded and while videos from YouTube were being buffered. These file transfers will be used to show the effectiveness of ExFILD at detecting exfiltration interwoven with other traffic.

5.1.2 Data Set #1

The first uncontrolled data set was created by a user over a three-day period. This data set will be referred to as Data Set #1. The communications recorded are from an Apple iMac running Mac OS X 10.5. The computer is on a university's network and has a public IP address, so the network is the simple one shown in Figure 5.2. The activities performed during this data set are simple and common to the daily activities of many Internet users. Most of the traffic results from browsing web pages on the Internet using HTTP, HTTPS, and PROXY (used to access university sites). Many other protocols are observed in this data set, but they are underlying protocols that are not directly initiated by the user. Even though the user may not have explicitly requested that these protocols run, these services are fundamental to other tasks the user wants to perform. Some of these protocols include NTP for synchronizing the time, DNS for resolving names to IP addresses, and ARP for resolving IP addresses to hardware addresses on the local network. Data Set #1 gives a good overview of basic user activities.

Figure 5.2: Network Diagram for Data Set #1

5.1.3 Data Set #2

The second uncontrolled data set was created by a different user from those of the Control Data Set and Data Set #1. It was created over a period of six and a half hours. The communications were recorded on a Dell laptop running Windows 7, which was using its wireless card (802.11g) to access the Internet. The network is shown in Figure 5.3. The laptop is behind a router with two other computers running Windows XP. The activities directly initiated by this user are very similar to the activities performed by the user in Data Set #1. The other activities differ because the Windows and Apple operating systems run different services by default. Data Set #2 has the NetBIOS suite used by Windows for networking. Another Windows service seen is Teredo, which is Microsoft's solution for using IPv6 over IPv4. It also has other protocols not discussed in the previous data sets, such as SMB for sharing files over the network and DROPBOX, which is used to back up files online. The combination of Data Set #1 and Data Set #2 gives a good overview of outgoing traffic for basic users.

Figure 5.3: Network Diagram for Data Set #2

5.1.4 Malware Data Sets

Packet captures of real-world examples of data exfiltration are not readily available to the public. There are a couple of explanations for the lack of public packet captures. One explanation is that not all companies will disclose information to the public regarding data being exfiltrated from their networks, so no information, including packet captures, will be found. Another explanation is that the data being exfiltrated will most likely be sensitive information; a company would not want to release a packet capture containing the very data it was trying to secure. A company could sanitize the packet captures of any sensitive data, but there is always a chance that pieces of data were overlooked, and companies are not willing to risk putting packet captures in the public domain when a capture might not be fully sanitized. A company also needs to be concerned about divulging sensitive information about its employees in the packet capture. A packet capture could contain financial information, personal emails, or other personal information that is not related to the sanitized data. Sanitizing a packet capture of personal information means finding and removing a large amount of material.

The administrator would also have to decide what the employees do and do not want other people to know about; the task would become overwhelming very quickly. For these reasons, using packet captures with real-world examples of large data breaches is not a viable option; however, packet captures from malware can be useful for this thesis. The packet captures from malware are not large samples, but they do contain data being exfiltrated by an attacker. Most of the data is for control and is not sensitive data, but the malware would use the same mechanisms of exfiltration for sensitive data as it does for control data.

5.1.4.1 Kraken Botnet

A packet capture of the Kraken botnet was obtained from the OpenPacket.org repository [30]. The packet capture contains command and control traffic between an infected host and a command and control server. Kraken was reverse engineered, and a description of its activities is given in [31]. Infected hosts that are part of the Kraken botnet will attempt to communicate with command and control servers drawn from a large list of DNS names generated by Kraken; the domain names of the command and control servers are listed in [32]. An infected host will continue to look for servers on port 447 until it finds a match. The communications between the bot and the command and control servers are encrypted using a custom encryption protocol. The data being exfiltrated in this data set is the encrypted communications with the server; the command and control server uses the data received from the infected computer to direct that host's actions. The encrypted outgoing messages provide a good test for ExFILD. The verification that the packet capture contains network traffic from the Kraken botnet is performed in Appendix D.1.

5.1.4.2 Zeus Botnet

Three packet captures of the Zeus botnet were obtained from the OpenPacket.org repository ([33], [34], [35]). The packet captures contain three separate communications between infected hosts and command and control servers. The communications of the Zeus botnet are explained in [36], where it is explained that the communications between the infected hosts and the command and control servers are encrypted using RC4, a stream cipher. An infected host will first download a configuration file from a server. The infected host will then use HTTP POST commands to send information to a server defined in the configuration file. The exfiltrated data includes information about the infected host and sensitive data stolen from the host, retrieved by web injection or from protected storage. The verification that all three of the packet captures contain network traffic from Zeus is performed in Appendix D.2.

5.1.4.3 Blackworm

A packet capture from Blackworm, or Nyxem.E, was obtained from the repository at pcapr.net [37]. The packet capture contains the communications from Blackworm during the search for and infection of a victim. Blackworm spreads using email attachments and network shares; a detailed description of it is given in [38]. Blackworm uses network shares as its propagation method in this packet capture. Propagation involves walking through all of the known network shares of the infected host. The worm will try to copy itself into one of the three file locations listed in [38]. Before the victim is infected, the worm will delete any known folders relating to security vendors from the victim machine. If the worm successfully infects a victim, the victim will delete registry keys for known security vendors to attempt to stop the detection of the infection. The victim will then search for other computers to infect. Blackworm also deletes many different types of files on the third of every month. The data exfiltrated from the infected host is the Blackworm executable itself. The verification that the packet capture contains network traffic from Blackworm is performed in Appendix D.3.

5.2 Results and Analysis

5.2.1 Packet Versus Session Alerts

The choice was made to focus on the session alerts rather than the packet alerts. This choice was made to reduce the impact of small packets and of packets determined to have a different encryption status than the rest of the packets in a session. The payload of a small packet can be deceiving on whether it is random or not, because there is not enough information to confidently determine the randomness of the payload. The problem with looking at single packets is that all of the packets in a session could be determined to be unencrypted except for one. The single encrypted packet could just be a coincidence, but it would cause an undesired alert, a false positive. However, looking at just the session alerts and not the packet alerts could allow legitimate alerts to be neglected; future work described in Section 7.4 is planned to investigate this problem. The choice also has the advantage of combining alerts from multiple packets into one if they are from the same session. Administrators appreciate this behavior, because they are not overwhelmed with repetitive alerts.
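As an illustration of the packet-versus-session distinction, the sketch below shows one plausible way to aggregate per-packet entropies into a per-session entropy, weighting each packet by its payload size so that small, misleading packets carry little influence. This weighting is an assumption for illustration, not ExFILD's exact algorithm.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Combine per-packet entropies into per-session entropies, weighted
    # by payload size. Illustrative only; not ExFILD's exact algorithm.
    sub session_entropies {
        my (@packets) = @_;   # each: { session => key, bytes => n, entropy => bits/byte }
        my (%weighted, %bytes);
        for my $p (@packets) {
            $weighted{ $p->{session} } += $p->{entropy} * $p->{bytes};
            $bytes{ $p->{session} }    += $p->{bytes};
        }
        return map { $_ => $weighted{$_} / $bytes{$_} }
               grep { $bytes{$_} > 0 } keys %bytes;
    }

    # Example: one large encrypted packet and one tiny low-entropy packet.
    my %s = session_entropies(
        { session => 'a', bytes => 1400, entropy => 7.8 },
        { session => 'a', bytes => 40,   entropy => 3.1 },
    );
    printf "session a: %.2f bits/byte\n", $s{a};   # dominated by the large packet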

Table 5.1: Alert Statistics for Packets and Sessions of Each Data Set

Data Set     Packet Alerts   Unique Sessions   Packets < 256 Bytes   Session Alerts
Control      1,011           62                881                   43
One          225             218               225                   0
Two          178             135               174                   12
Kraken       18              16                10                    16
Zeus 1       3               3                 0                     3
Zeus 2       3               3                 0                     3
Zeus 3       4               3                 0                     3
Blackworm    147             1                 0                     1

Table 5.1 displays statistics about the packet and session alerts. It shows that for the Control Data Set, Data Set #1, Data Set #2, and the Blackworm Data Set, which have a large number of packet alerts, the number of session alerts is decreased dramatically. It can also be seen that the number of small packets for the Control Data Set, Data Set #1, and Data Set #2 closely correlates with the difference between the packet and session alerts of each, respectively. The data in Table 5.1 provides evidence that justifies the use of session alerts over packet alerts, so the results will be discussed with a focus on the session level rather than the packet level. The Control Data Set will be examined in more detail at the packet level, since its activities were generated for development and are well known. All of the results for the data sets' entropy values at the packet and session levels are included in Appendix B. All of the plots for the alerts are included in Appendix C.

5.2.2 Control Data Set

The Control Data Set produced some results that were expected and some that were unexpected. The unexpected results were due to errors in logic when creating the data set: the controlled exfiltration methods were implemented in the wrong direction, meaning that the files were being transferred into the MacBook instead of out of it. To mitigate this oversight, the data set was run through ExFILD with a host address of 10.0.0.101 (the IP address of the MacBook), but with a filter that removed any traffic to 10.0.0.54 and added any traffic from 10.0.0.54. The filter used to do this is equivalent to "(src host 10.0.0.101 && !dst host 10.0.0.54) || src host 10.0.0.54". The filter essentially returns any traffic leaving 10.0.0.101 and inverts the direction of the traffic that includes 10.0.0.54, meaning incoming traffic from 10.0.0.54 to 10.0.0.101 is treated as outgoing traffic from 10.0.0.101 to 10.0.0.54. The traffic from 10.0.0.54 includes the traffic from sessions for an FTP server, an SSH server, and a VNC server. This corrected data set is used in the results section. The Control Data Set created 1,011 packet alerts and 43 session alerts as stated in Table 5.1, and more specific statistics are shown in Table 5.2 and Table 5.3.

Table 5.2: Packet Alerts from the Control Data Set

Packet Type                   Alerts   Small Packets   Unique Sessions
AIM                           2        2               2
Controlled FTP Exfiltration   503      417             45
HTTP                          7        7               7
IRC                           1        1               1
Skype                         17       14              2
SSH                           8        4               4
VNC                           473      436             1

Table 5.3: Session Alerts from the Control Data Set

Packet Type                   Alerts
Controlled FTP Exfiltration   41
Skype                         2

The first thing to note is that there are 7 HTTP, 2 AIM, and 1 IRC packets in the set that cause alerts. All of these packets are small (less than 256 bytes), and their sessions do not cause an alert. These packets can be the result of a payload being too large to fit in a single packet, so the overflow of the payload is sent in a subsequent packet. The packets could also be small packets used to initialize and close sessions. SSH traffic causes 8 packet alerts. Four of these packets are smaller than 256 bytes and the other 4 packets are larger than 256 bytes. These packets are not carrying any user data; they are used to administer the SSH session, such as the server and client key exchanges. There are 473 packet alerts resulting from VNC traffic. A majority of these packets, 436 to be specific, are smaller than 256 bytes. None of the alerts caused by the SSH and VNC packets are concerning, because their respective sessions do not cause any alerts. The choice to only look at the session-level alerts is reaffirmed by the large number of packet alerts from AIM, HTTP, IRC, SSH, and VNC that do not cause any session alerts.

Skype also causes alerts, which is not surprising. Skype communications are transferred over encrypted channels to many different hosts over non-standard ports. The 17 packet alerts and 2 session alerts occur even after attempting to whitelist it. Skype was whitelisted because it is used on the MacBook regularly, but it should not be whitelisted if it is not used regularly on the host.[1] IP addresses maintained by Skype were added to the whitelist to remove the alerts resulting from those communications. Although it was mentioned that Skype does not use a standard port, it does have a configurable port on a host-by-host basis, which remains static unless changed by the user. That port was also added to ExFILD to whitelist Skype's communications. If Skype had not been whitelisted, there would have been a total of 1,805 packet alerts and 84 session alerts resulting from its communications. If Skype were not trusted on a network, these alerts would be legitimate concerns, but since Skype is trusted in this case, the 17 packet alerts and 2 session alerts still remaining are considered false positives. A better way to filter out traffic from Skype would need to be developed. Even so, ExFILD produced false positives for less than one percent (0.94%) of the Skype packets and a little more than two percent (2.34%) of the Skype sessions. These false positive percentages are not bad considering only simple whitelisting was used to identify Skype traffic.

The remaining alerts were the result of the controlled data exfiltration. The controlled exfiltration was performed using FTP and FTPS (FTP over SSL), which resulted in 49 sessions. There were 4 unencrypted file transfers that resulted in a total of 8 sessions. Each transfer had a control channel and a data channel.

[1] The fact that Skype's source code has not been released, combined with its use of encrypted sessions connected to many random hosts (a consequence of its peer-to-peer model), makes it suspect to security-conscious administrators, since it is indeed exfiltrating encrypted data to random destinations.

There were 14 encrypted file transfers that resulted in 41 sessions. Each encrypted file transfer had 2 control channels and a data channel (except for one failed transfer, which resulted in only 2 control channels and no data channel). As seen in Table 5.2, the packet alerts accounted for 45 of the FTP sessions. The 4 sessions not causing any packet alerts were the sessions responsible for the transfer of the unencrypted data. This is expected behavior: FTP traffic is expected to be unencrypted, and the files are, so they are not suspicious. To catch these sessions an administrator would need to be inspecting the payloads or using the censored word list. If the data set is run through ExFILD again with a word from the Declaration of Independence in the censored word list, these sessions cause 4 alerts, assuming the word is within the first 100 characters of the Declaration of Independence. If the censored word is between the 100th and 1,000th character positions, it only causes three alerts, and so on. The data set was run through with the word "political" in the censored word list, and four additional alerts were caused by the censored word.

The four sessions that are represented in the packet alerts but not in the session alerts are also from the unencrypted file transfers. They are the sessions responsible for the FTP control channels. The reason that the similar sessions for the encrypted file transfers still cause alerts is that they include additional setup for the encryption, including transferring the certificate. All of the sessions involved with the encrypted file transfers caused alerts, although only the 13 sessions used as the data channels should cause alerts. The 28 false positives caused by the remaining sessions are due to the lack of application layer decoding. The sessions are treated as if they are normal FTP transfers, not FTPS transfers; there is no intelligence yet built into ExFILD to distinguish between the traffic of the two protocols. The addition of this intelligence should mitigate the problem by only inspecting the data channel and not the control sessions, whose payloads are only used to control the channels.

Many of the packet alerts from the Control Data Set were false positives. All of the AIM, HTTP, IRC, Skype, SSH, and VNC packet alerts were false positives. The controlled exfiltration using FTP and FTPS connections also caused 464 false positive alerts at the packet level. Some of these packets were the small FTP packets used for the opening and closing of the FTP connections, while the other packets were from FTPS control channels. All of these packets are grouped with their respective sessions and produce the correct results, except for the previously mentioned Skype and FTPS control sessions. As expected, the Control Data Set produced better results at the session level than it did at the packet level.
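The censored word check used above can be sketched in a few lines of Perl; the word list and sample payload here are illustrative placeholders (the sample text is from the Declaration of Independence, matching the transferred files).

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Minimal sketch of a censored-word check over a reassembled payload.
    sub censored_hit {
        my ($payload, @censored) = @_;
        for my $word (@censored) {
            return $word if $payload =~ /\Q$word\E/i;   # literal, case-insensitive
        }
        return;
    }

    my $payload = 'it becomes necessary for one people to dissolve the political bands';
    my $hit = censored_hit($payload, 'political');
    print "ALERT: censored word '$hit' in outgoing session\n" if $hit;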

5.2.3 Data Set #1

Data Set #1 resulted in 225 packet alerts and 0 session alerts, as shown in Table 5.1. The activities performed during the creation of this data set were basic, and it is not surprising that it generated no session alerts. Table 5.4 displays the breakdown of the packet alerts for this data set. A majority of the packet alerts resulted from traffic that is expected to be encrypted but is observed to be unencrypted. All of the alerts from this data set are false positives and can be attributed to the payloads being small. An interesting aspect to note about the HTTPS packets is that they are all SSL Client Hello messages. A check could be added to remove these alerts, but additional checks for any of these alerts would not change the outcome, since their respective sessions do not cause alerts.

Table 5.4: Packet Alerts from Data Set #1

Packet Type   Alerts   Small Packets   Unique Sessions
HTTP          12       12              7
HTTPS         155      155             154
PROXY         58       58              57

5.2.4 Data Set #2

Data Set #2 created almost as many alerts as Data Set #1, with 178 packet alerts and 12 session alerts as shown in Table 5.1. Table 5.5 shows the breakdown of the packet alerts. The packet alerts consist mostly of small HTTPS and HTTP packets, with the largest payload being only 208 bytes. Once again, a majority of the HTTPS packets are SSL Client Hello messages. The SMB packets are large, which would be concerning if not for the fact that no session alerts result from them. The NETBIOS-SSN traffic caused 12 session alerts from small error packets. Since the size of each packet is only 5 bytes, one would assume these packets should not cause session alerts; the other packets in each session should average out the entropies. However, the other packets in these sessions have no payloads; they are just SYN-ACKs and ACKs. The session alerts that result from these packets are false positives and need to be mitigated in some manner.

Table 5.5: Packet Alerts from Data Set #2

Packet Type   Alerts   Small Packets   Unique Sessions
HTTPS         100      100             93
HTTP          62       62              29
NETBIOS-SSN   12       12              12
SMB           4        0               1

The mitigation could be to add checks for these particular error packets; however, a better method would be to develop an algorithm that handles small sessions more robustly. Handling traffic as sessions instead of packets solved this problem at the packet level, where it was more common, but doing so moved the problem to the session level. The activities performed in this data set are similar to those of Data Set #1, and there is no glaring reason to cause session alerts, which explains why there are none other than the false positives.

5.2.5 Malware Data Sets

5.2.5.1 Kraken

The Kraken Data Set caused 18 packet alerts and 16 session alerts as shown in Table 5.1. The packet alerts are from 16 unique sessions. Inspecting the packet capture revealed that there are 16 sessions that communicate with a server on port 447, meaning that there are 16 sessions responsible for exfiltrating data. The data being exfiltrated is encrypted and uses a port that is not defined in ExFILD. ExFILD alerted on all of the sessions responsible for data exfiltration and had no false positives, which means it performed well against this data set.

5.2.5.2 Zeus

The three Zeus Data Sets created a total of 10 packet alerts and 9 session alerts. Table 5.6 shows the breakdown of alerts for each data set, and Table 5.7 shows the breakdown of the unique sessions containing the exfiltrated data. Zeus Data Sets #2 and #3 resulted in session alerts for all of the unique sessions containing the HTTP POST commands. These two sets had a 100% detection rate with no false positives. Zeus Data Set #1 did not produce the results expected: three session alerts were produced by this set, when 13 alerts should have been produced.

Table 5.6: Zeus Packet and Session Alerts

Data Set   Packet Alerts   Session Alerts
Zeus 1     3               3
Zeus 2     3               3
Zeus 3     4               3

The 10 sessions containing data exfiltration that did not cause alerts were examined more closely to find the cause of the false negatives. The 10 sessions are all single packets with the same TCP payloads. The sessions have entropy values of 6.4551 bits/byte. An observation to note is that the threshold used to determine encryption is set at 6.5 bits/byte for this thesis.

Table 5.7: HTTP POST Commands in the Zeus Packet Captures

Data Set   HTTP POST Commands   Unique Sessions
Zeus 1     13                   13
Zeus 2     3                    3
Zeus 3     4                    3

The sessions' entropy values are quite close to the threshold, which raises the question of whether there is a better way to set the threshold. One aspect that could affect the threshold is that it was set using the traffic from the Control Data Set and did not take into account traffic from any of the other sets. Setting the threshold specifically for each host could have beneficial results, but it may also have the opposite effect. There is not enough traffic in the malware packet captures to set a threshold with confidence, so the threshold was not changed, to preserve the integrity of these experiments. An experiment was run on the payloads of the sessions causing the false negatives. The session payloads were stripped of their HTTP headers, which is equivalent to decoding the application layer protocol. The entropy was then calculated solely on the data of the sessions, which returned a value of 7.5639 bits/byte. This entropy value would have caused alerts to result from the 10 false negatives. The program currently counts these sessions as false negatives, but they will be true positives after the application layer protocols are decoded in the future work described in Section 7.3.
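For reference, the entropy values discussed here can be reproduced with a straightforward Shannon entropy computation; a minimal sketch follows, using the 6.5 bits/byte threshold from this thesis. ExFILD's actual implementation may differ in detail, for example in how it normalizes small packets.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Shannon entropy of a payload in bits/byte. Uniformly random
    # (e.g., encrypted) data approaches 8 bits/byte; English text is
    # far lower. Payloads of at least 256 bytes give reliable values.
    sub entropy_bits_per_byte {
        my ($payload) = @_;
        my $len = length $payload;
        return 0 unless $len;
        my %count;
        $count{$_}++ for split //, $payload;
        my $h = 0;
        for my $n (values %count) {
            my $p = $n / $len;
            $h -= $p * log($p) / log(2);
        }
        return $h;
    }

    my $h = entropy_bits_per_byte('When in the course of human events ' x 8);
    printf "%.4f bits/byte -> %s\n", $h, $h >= 6.5 ? 'encrypted' : 'unencrypted';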

5.2.5.3 Blackworm

The Blackworm Data Set caused 147 packet alerts and 1 session alert as shown in Table 5.1. The 147 packet alerts are all from the same session. During the verification of the data set, it was found that all of the traffic from Blackworm was contained in a single session. ExFILD was able to detect the session responsible for exfiltrating data, which in this case was the Blackworm executable used during the infection of the victim. It also produced no false positives. There were 51 packets containing data from the Blackworm executable that did not cause alerts. These packets are false negatives at the packet level, but they are considered true positives with the rest of the packets at the session level.

5.2.6 Data Exfiltration Detection Performance

The metrics for each of the data sets were accumulated and are displayed in Table 5.8 for the packet level and Table 5.9 for the session level. The metrics were based on whether the packet or session contained exfiltrated data. A positive means the packet or session did in fact contain inappropriate data leaving the host, while a negative means it contained only data that was allowed to leave the host. Inappropriate data is defined for this thesis as any information stored on or extractable from a host that a user does not want to leave the host. It does not include egress traffic that is used exclusively to establish or control the connections responsible for transferring data. It is important to note that some packets may be part of a session exfiltrating data but not themselves contain exfiltrated data. A session is considered a positive if any of the packets within it contain exfiltrated data, so a negative session contains no packets with exfiltrated data.

Table 5.8: Packet Alert Metrics

Data Set    Outgoing Packets   True Positives   False Positives   True Negatives   False Negatives
Control     47,032             39               972 (note 2)      46,021           0
One         89,150             0                225 (note 3)      88,925           0
Two         80,807             0                178 (note 4)      80,629           0
Kraken      205                18               0                 187              0
Zeus 1      2,628              3                0                 2,615            10 (note 5)
Zeus 2      48                 3                0                 45               0
Zeus 3      435                4                0                 431              0
Blackworm   308                147              0                 110              51 (note 6)

Table 5.9: Session Alert Metrics

Data Set    Outgoing Sessions   True Positives   False Positives   True Negatives   False Negatives
Control     2,779               13               30 (note 7)       2,736            0
One         9,789               0                0                 9,789            0
Two         3,348               0                12 (note 8)       3,336            0
Kraken      53                  16               0                 37               0
Zeus 1      72                  3                0                 59               10 (note 9)
Zeus 2      5                   3                0                 2                0
Zeus 3      5                   3                0                 2                0
Blackworm   10                  1                0                 9                0

Notes:
2. 2 small AIM, 464 FTP setup, 7 small HTTP, 1 IRC, 17 Skype, 8 SSH, and 473 VNC packets.
3. 12 small HTTP, 155 small HTTPS, and 58 small PROXY packets.
4. 62 small HTTP, 100 small HTTPS, 12 small NETBIOS, and 4 SAMBA packets.
5. 10 HTTP POST packets explained in Section 5.2.5.2.
6. 51 NETBIOS packets transferring the worm executable.
7. 28 FTP setup and 2 Skype sessions.
8. 12 small NETBIOS sessions.
9. 10 HTTP sessions containing the HTTP POST packets explained in Section 5.2.5.2.

Table 5.10: True and False Positive Rates for Each Data Set

Data Set    Packet True Positive Rate   Packet False Positive Rate   Session True Positive Rate   Session False Positive Rate
Control     1                           0.0207                       1                            0.0108
One         N/A                         0.0025                       N/A                          0
Two         N/A                         0.0022                       N/A                          0.0036
Kraken      1                           0                            1                            0
Zeus 1      0.2308                      0                            0.2308                       0
Zeus 2      1                           0                            1                            0
Zeus 3      1                           0                            1                            0
Blackworm   0.7424                      0                            1                            0

Table 5.10 contains the calculated true positive rates and false positive rates for each of the data sets. They were calculated for packets and sessions using Equation 5.1 and Equation 5.2, which allows for a comparison between their performances.

True Positive Rate = True Positives / (True Positives + False Negatives)    (5.1)

False Positive Rate = False Positives / (False Positives + True Negatives)    (5.2)

The false positive rates for all of the data sets at both the packet and session levels are very low. The Control Data Set is the only data set that has a false positive rate above one percent. Its increased false positive rate is caused by the large number of small VNC packets and by the FTP and FTPS packets containing header information or belonging to the control channels. Zeus Data Set #1 has a low true positive rate, which will be improved greatly with the addition of application layer decoding. ExFILD also did not perform well against the Blackworm Data Set at the packet level; the low true positive rate was due to the 51 packets containing the Blackworm executable that did not cause an alert. Data Sets #1 and #2 did not have any true positives or false negatives, so the true positive rate could not be calculated. The main aspect of these two tables to note is that ExFILD at the session level performs as well as or better than it does at the packet level for every data set except Data Set #2. A better solution for handling very small sessions would improve the false positive rate at the session level. The biggest increases in performance between the packet and session levels come from the Control and Blackworm Data Sets. The increase in the true positive rate for the Blackworm Data Set is the result of the false negatives at the packet level being grouped with their respective sessions as true positives. The decrease in the false positive rate for the Control Data Set is the result of the aggregation of the false positives into their respective sessions. The increase in performance justifies the choice to focus on alerts at the session level rather than the packet level.
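As a worked example, the sketch below recomputes several of the session-level rates in Table 5.10 directly from the counts in Table 5.9 using Equations 5.1 and 5.2.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Session-level counts from Table 5.9: [TP, FP, TN, FN].
    my %sessions = (
        'Control'   => [ 13, 30, 2736,  0 ],
        'Kraken'    => [ 16,  0,   37,  0 ],
        'Zeus 1'    => [  3,  0,   59, 10 ],
        'Blackworm' => [  1,  0,    9,  0 ],
    );

    for my $set (sort keys %sessions) {
        my ($tp, $fp, $tn, $fn) = @{ $sessions{$set} };
        my $tpr = ($tp + $fn) ? sprintf('%.4f', $tp / ($tp + $fn)) : 'N/A';
        my $fpr = ($fp + $tn) ? sprintf('%.4f', $fp / ($fp + $tn)) : 'N/A';
        print "$set: TPR=$tpr FPR=$fpr\n";
        # e.g., Zeus 1: TPR=0.2308, matching Table 5.10
    }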

Chapter 6

CONCLUSIONS

The lack of viable options for detecting theft of information has led to an incredible amount of confidential data being extracted by attackers. The field of computer and network security needs tools and techniques capable of detecting unwanted extraction of information from computer systems. Intrusion detection is a mature field with thorough research behind its techniques, but the research and development of techniques to detect exfiltration is still rudimentary. The research for this thesis was started with the single goal of developing a program capable of detecting data exfiltration. It evolved into the three separate goals listed below.

1. Create tools that will allow for a better understanding of a host’s network traffic, specifically with relation to data exfiltration.

2. Determine if encrypted network traffic can be discerned from unencrypted data using the entropy characteristics.

3. Develop a program that is capable of effectively detecting data exfiltration using the encryption characteristics of network traffic.

Many tools were created to study a network and were discussed in Chapter 3. The tools that were developed included the functionality to monitor network traffic, report censored words, display the amount of traffic destined to each IP address, and extract session and DNS information. The Network Top program provides a very intuitive way to view the amount of outgoing traffic and its destination. The Session Extractor and DNS Extractor proved to be useful in the verification of the malware samples and in determining with whom a host is communicating.

It was proven within this thesis that entropy can be used to effectively discern encrypted traffic from unencrypted traffic. The entropy calculations result in significant separation between the encrypted and unencrypted traffic, which allows for the determination of whether data is encrypted or unencrypted. A method of normalizing small packets to allow for this separation was described. It was shown that small packets and sessions can cause misleading entropy values, and further research should be performed to produce more useful results from them. It was also described how to separate compressed data from encrypted data. Encrypted traffic can be identified with confidence by a simple entropy calculation as long as its payload is at least 256 bytes.

The accomplishment of the last goal is the largest contribution from this thesis to the field of computer and network security. It was verified in this thesis that exfiltrated data can indeed be detected based on the expected and actual encryption states of a packet or session's payload. Chapter 5 provided results that justify this claim. Research needs to continue in this area to improve the accuracy of the technique. Simply implementing the decoding of the application layer could greatly improve the performance on some of the data sets used in this thesis. Hopefully the research performed in this thesis and other similar work will stimulate research on the data leaving a computer system as well as the traffic entering it. The field of computer and network security would greatly benefit from continued research in techniques that process outgoing traffic.

Chapter 7

FUTURE WORK

7.1 Performance

Before other features are added to the program, some attention should be given to improving its performance. The current performance does not make it feasible to run on a live capture, and processing large capture files can take a considerable amount of time. The decision to add a feature will need to weigh the performance hit that will be taken: if a new feature increases the effectiveness and does not significantly affect the performance, it is a good candidate to add; if it only slightly increases the effectiveness while decreasing the performance, it might not be. For these reasons the performance needs to be improved. There are many ways to increase performance. The first would be to look through the code and optimize it by hand. The functions were created with an emphasis on functionality and contain unoptimized segments of code. A lot of comparisons are performed on each packet and session, and these comparisons may be reorderable to improve the performance without changing the results. For example, in Figure 4.12, assume that there are more whitelisted packets than there are ICMP requests; in that case it would make sense to switch the order of the two checks, eliminating unnecessary checks and thus improving the performance. Improving performance will be the first objective of future work, because it will enable the other future goals to be achieved while maintaining the program's usability.
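The reordering idea can be made concrete with a small sketch; the two predicate functions below are hypothetical stand-ins for the checks in Figure 4.12, not ExFILD's actual code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical stand-ins for the checks in Figure 4.12.
    sub is_whitelisted  { my ($p) = @_; return $p->{dst}   eq '10.0.0.54' }
    sub is_icmp_request { my ($p) = @_; return $p->{proto} eq 'icmp' }

    # If whitelist hits are more common than ICMP requests, testing the
    # whitelist first lets most packets short-circuit out before the
    # rarer check ever runs, with the same final result.
    sub classify {
        my ($pkt) = @_;
        return 'skip'    if is_whitelisted($pkt);    # common case first
        return 'skip'    if is_icmp_request($pkt);   # rarer case second
        return 'inspect';
    }

    print classify({ dst => '10.0.0.54', proto => 'tcp' }), "\n";   # prints "skip"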

Another possible solution to improve performance is to port the program into the Snort IDS. Snort is widely used in the security industry and has a lot of developer and community support. For Snort to be useful it must perform its tasks at a high rate, and its community of developers has already put a lot of effort toward improving its performance, which could be used to the advantage of this research. Individual parts of the program could be ported to Snort. Snort could also be used to perform the decoding explained in Section 4.1.1. Rules can be written to perform the checks on the packets and sessions. A more interesting possibility is writing a dynamic module to perform some of the calculations in ExFILD, specifically the entropy calculations. Porting the algorithms in ExFILD from Perl into Snort could allow for better performance and a more stable system.

7.2 Handle a Network

The program currently inspects the outgoing traffic of only a single host at a time. Under normal circumstances administrators are more interested in all of the traffic leaving a network, not just a single host; they would only check a single host once its activities have been identified as suspicious. If an administrator wants to look at more than one host's outgoing traffic, ExFILD has to be run multiple times with different IP addresses as the inputs. It would be useful to add the capability of looking at an entire network's outgoing traffic. This could be accomplished by allowing multiple IP addresses or the network's subnet to be given as input to the program. ExFILD would then ignore intranet traffic between the local network's hosts and inspect all of the remaining outgoing traffic. It could also be accomplished by altering ExFILD to run on the single gateway leaving the network, or by running multiple instances of ExFILD on all of the gateways between the Internet and the intranet.

7.3 Application Layer Decoding of Packets

One improvement that can increase the accuracy of the entropy calculations is to decode the packets to the application layer. For example, ExFILD currently extracts the payload of an HTTP packet together with its headers, since the HTTP protocol is not decoded. The HTTP headers are large enough to make the entropy of the payload look unencrypted when it is actually encrypted. Decoding the application layer protocols would remove the headers from the data and perform the entropy calculation only on the payload. Integrating ExFILD into Snort could provide some of the decoding functionality; in [39] it is noted that Snort has some basic decoding capabilities for application protocols such as HTTP, FTP, DNS, and SSL/TLS. For performance reasons, it would be wise to implement only the most important or most frequent application protocols. There may be some protocols whose headers do not affect the payloads' entropy values enough to justify spending resources to decode them.
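For the HTTP case, the decoding amounts to splitting the headers from the body at the blank line that terminates the header block, mirroring the experiment in Section 5.2.5.2. The sketch below is illustrative, not ExFILD's implementation.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Strip HTTP headers from a reassembled payload so the entropy is
    # computed over the body alone. HTTP ends its header block with an
    # empty line (CRLF CRLF).
    sub http_body {
        my ($payload) = @_;
        my (undef, $body) = split /\r\n\r\n/, $payload, 2;
        return defined $body ? $body : '';
    }

    my $msg = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nbody bytes";
    print http_body($msg), "\n";    # prints only "body bytes"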

7.4 Comparison of Packet and Session Entropies

As described in Section 4.1, all of the packets and sessions are processed and run through ExFILD, and entropy values are calculated for both the packets and the sessions. Having both values provides the opportunity to create another feature to be used within the tree. Comparing the entropy of a session to the entropies of each packet contained within it could lead to interesting results. It is possible that a single unencrypted packet is being hidden in an encrypted session, or vice versa; the session entropy value may match the expected range while individual packets within it do not. Performing the comparison this way may be troublesome, and it is somewhat counterproductive, because sessions are examined precisely to remove alerts from small packets. It may be more useful to look for outlier packets in each session and perform additional analysis and checks on them. Performing experiments to see if there is any relationship between the packets and the session could prove to be worthwhile and could provide more topics of research.

7.5 Compressed File Analysis

ExFILD currently inspects traffic with highly random payloads for compressed files. If a compressed file is found using the magic file numbers and an HTTP string, it is assumed that the file is a properly compressed file. More steps can be taken to ensure it really is a compressed file and that its contents are allowed to leave the network. The first step would be to extract the entire file from the network traffic, which requires the application layer protocol to be completely decoded. The next step would be to verify that the file structure matches the specification for the file type. The most straightforward way to do this is by decompressing the file: a file that decompresses successfully has the correct file structure, while a file that does not has an incorrect file structure and should automatically be marked as suspicious. If the file can be decompressed, checks should be run against its contents. Compressing files and sending them out of the network is a common way to exfiltrate data, because it is easy and the contents are rarely inspected. The checks of the contents should include ones similar to those in the corporate watcher program that search for censored words and files. The checks performed on the packets and sessions could also be applied to the contents of the compressed file. The main idea is to make sure that an attacker is not simply compressing data and exfiltrating it. The packets and sessions containing the compressed file will be marked as not suspicious only if it passes all of these checks.
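A minimal sketch of this verification for the gzip case follows, using Perl's core IO::Uncompress::Gunzip module; the magic-number check and the pass/fail handling are simplified for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
    use IO::Compress::Gzip     qw(gzip   $GzipError);

    # Verify that data claiming to be gzip really decompresses, then
    # return the contents for further checks (e.g., the censored word
    # list); anything that fails is marked suspicious.
    sub check_gzip {
        my ($data) = @_;
        return (0, 'wrong magic number')
            unless substr($data, 0, 2) eq "\x1f\x8b";
        my $contents;
        gunzip(\$data => \$contents)
            or return (0, "bad structure: $GunzipError");
        return (1, $contents);    # run content checks on $contents next
    }

    # Self-contained demonstration with locally compressed data.
    gzip(\(my $plain = 'sample contents') => \my $gz) or die $GzipError;
    my ($ok, $result) = check_gzip($gz);
    print $ok ? "contents: $result\n" : "suspicious: $result\n";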

7.6 Behavioral Analysis

Another interesting piece of research to pursue is taking into account a host's normal behavior. Each host has a normal set of applications run on it, which creates outgoing traffic with certain sets of characteristics. The characteristics can be used to create a profile or signature for the host's typical outgoing traffic. A profile may be more appropriate, since a signature is usually very specific and a profile tends to be broader. A single host's profile can be compared against its outgoing traffic to find characteristics that stand out from normal use. For example, consider a host that only checks email and browses the web in normal use, and one day begins transferring a large amount of data out of the network. Comparing that host's profile to the outgoing traffic will flag the traffic; however, comparing an FTP server's profile to the same traffic may not show any suspicious characteristics. Integrating a host's normal use into ExFILD could lead to more accurate results when looking for data being exfiltrated.

7.7 Additional Tools

All of the other sections of future work deal with improving the detection of data being exfiltrated from a network. Work also needs to be performed in the area of analyzing the alerts and determining what caused each one. The tools from Chapter 3 that were used to develop ExFILD can be used to analyze the causes of the alerts; however, during this research it was realized that more tools would be useful. The verification of the packet captures containing communications from Zeus revealed the need for one such tool: a tool able to extract all of the HTTP GET and POST commands would make it easier to identify communication channels over HTTP. In the case of Zeus, this tool could be used to identify the name of the configuration file downloaded and the hosts that have used HTTP POST commands to exfiltrate data. It could also show any files that were downloaded by the attacker using HTTP GET commands, which would help to explain what activities the attacker is performing. A sketch of such an extractor follows.
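The sketch below operates on reassembled TCP payloads; payload reassembly is omitted, the regular expression covers only the request line, and the sample payload is fabricated (the file name echoes the Zeus example in Appendix D).

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pull HTTP GET and POST request lines out of a reassembled payload.
    sub http_commands {
        my ($payload) = @_;
        return $payload =~ m{^((?:GET|POST) \S+ HTTP/1\.[01])}mg;
    }

    # Fabricated sample payload for illustration.
    my $payload = "POST /stat1.php HTTP/1.1\r\nHost: example.com\r\n\r\n";
    print "$_\n" for http_commands($payload);   # prints the POST request line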

BIBLIOGRAPHY

[1] White House. Cyberspace Policy Review: Assuring a Trusted and Resilient Information and Communications Infrastructure, May 2009. http://www.whitehouse.gov/assets/documents/Cyberspace_Policy_Review_final.pdf.

[2] Joe Stewart. Top Botnets Exposed. April 2008. http://www.secureworks.com/research/threats/topbotnets/.

[3] Mark Landler and John Markoff. Digital Fears Emerge After Data Siege in Estonia. The New York Times, May 2007. http://www.nytimes.com/2007/05/29/technology/29estonia.html.

[4] Annarita Giani, Vincent H. Berk, and George V. Cybenko. Data exfiltration and covert channels. In Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Security and Homeland Defense V, 2006.

[5] Fouad Kiamilev, Ryan Hoover, Ray Delvecchio, Nicholas Waite, Stephen Janansky, Rodney McGee, Corey Lange, and Michael Stamat. Demonstration of Hardware Trojans. In DEFCON conference, August 2008.

[6] Netcat. http://netcat.sourceforge.net/.

[7] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your Botnet is My Botnet: Analysis of a Botnet Takeover. In Proceedings of the 16th ACM Conference on Computer and Communications Security, pages 635–647, Chicago, IL, November 2009. ACM.

[8] Toni Koivunen. Calculating the Size of the Downadup Outbreak. F-Secure Weblog: News from the Lab, January 2009. http://www.f-secure.com/weblog/archives/00001584.html.

[9] Jhind. Catching DNS Tunnels with A.I. In DEFCON conference, July 2009.

[10] Kerry S. Long. Catching the Cyber Spy: ARL's Interrogator. December 2004.

[11] Jon Oberheide, Evan Cooke, and Farnam Jahanian. CloudAV: N-Version Antivirus in the Network Cloud. In Proceedings of the 17th USENIX Security Symposium, San Jose, CA, July 2008.

[12] August Cole, Yochi Dreazen, and Siobhan Gorman. Computer Spies Breach Fighter-Jet Project. The Wall Street Journal, April 2009. http://online.wsj.com/article/SB124027491029837401.html.

[13] Jaikumar Vijayan. Heartland data breach could be bigger than TJX's: This recent incident suggests cybercrooks have shifted to targeting payment processors. Computer World, January 2009. http://www.computerworld.com/s/article/9126379/Heartland_data_breach_could_be_bigger_than_TJX_s.

[14] Office of Public Affairs. Alleged International Hacker Indicted for Massive Attacks on U.S. Retail and Banking Networks: Data Related to More Than 130 Million Credit and Debit Cards Allegedly Stolen. United States Department of Justice, August 2009. http://www.justice.gov/opa/pr/2009/August/09-crm-810.html.

[15] Tor. http://www.torproject.org/.

[16] Karsten Loesing, Steven J. Murdoch, and Roger Dingledine. A Case Study on Measuring Statistical Data in the Tor Anonymity Network. Accepted for publication at the Workshop on Ethics in Computer Security Research (WECSR 2010), Tenerife, Spain, January 2010.

[17] Miniwatts Marketing Group. World Internet Usage and Population Statistics. Internet World Stats, 2010. http://www.internetworldstats.com/stats.htm.

[18] Snort. http://www.snort.org/.

[19] Richard Bejtlich. Advice for Academic Researchers. TaoSecurity, February 2010. http://taosecurity.blogspot.com/2010/02/advice-for-academic-researchers.html.

[20] Nabil Schear, Carmelo Kintana, Qing Zhang, and Amin Vahdat. Glavlit: Preventing Exfiltration at Wire Speed. In Proceedings of the 5th ACM Workshop on Hot Topics in Networks (HotNets-V), Irvine, CA, November 2006.

[21] Yali Liu, Cherita Corbett, Rennie Archibald, Biswanath Mukherjee, and Dipak Ghosal. SIDD: A Framework for Detecting Sensitive Data Exfiltration by an Insider Attack. In Proceedings of the 42nd Annual Hawaii International Conference on System Sciences, pages 1–10, January 2009.

[22] Tcpdump. http://www.tcpdump.org/.

[23] Wireshark. http://www.wireshark.org/.

[24] Boris Zentner. Geo::IP. CPAN, 2009. http://search.cpan.org/dist/Geo-IP/lib/Geo/IP.pm.

[25] GeoLite City. MaxMind, March 2010. http://www.maxmind.com/app/geolitecity.

[26] GeoLite Country. MaxMind, March 2010. http://www.maxmind.com/app/geolitecountry.

[27] Port Numbers. IANA, March 2010. http://www.iana.org/assignments/port-numbers.

[28] C. E. Shannon. Prediction and Entropy of Printed English. The Bell System Technical Journal, 30:50–64, 1951.

[29] T. Dierks and C. Allen. The TLS Protocol Version 1.0 (RFC 2246). IETF, January 1999. http://www.ietf.org/rfc/rfc2246.txt.pdf.

[30] Michael Ligh. 12b0c78f05f33fe25e08addc60bd9b7c.pcap. OpenPacket.org, May 2008. https://www.openpacket.org/capture/grab/33.

[31] Cody Pierce. Owning Kraken Zombies, a Detailed Dissection. DVLabs, April 2008. http://dvlabs.tippingpoint.com/blog/2008/04/28/owning-kraken-zombies.

[32] Paul Royal. On the Kraken and Bobax Botnets. Damballa, April 2008. http://www.damballa.com/downloads/press/Kraken_Response.pdf.

[33] JJ Cummings. zeus-sample-1.pcap. OpenPacket.org, March 2010. https://www.openpacket.org/capture/grab/67.

[34] JJ Cummings. zeus-sample-2.pcap. OpenPacket.org, March 2010. https://www.openpacket.org/capture/grab/68.

[35] JJ Cummings. zeus-sample-3.pcap. OpenPacket.org, March 2010. https://www.openpacket.org/capture/grab/69.

[36] Doug Macdonald. Zeus: God of DIY Botnets. FortiGuard, October 2009. http://www.fortiguard.com/analysis/zeusanalysis.html.

[37] kowsik. blackworm-clean-install.pcap. pcapr.net, May 2009. http://www.pcapr.net/view/kowsik/2009/4/5/11/blackworm-clean-install.pcap.html.

[38] F-Secure Virus Information Pages: Nyxem.E. F-Secure, January 2006. http://www.f-secure.com/v-descs/nyxem_e.shtml.

[39] The Snort Project. SNORT Users Manual 2.8.5. October 2009. http://www.snort.org/assets/125/snort_manual-2_8_5_1.pdf.

Appendix A

EXPERIMENTS

Table A.1: Annotation of the Control Data Set

Wireshark Time   Activity
-108             Start Movie
0                Start Wireshark
12               Open Safari
22               Open CNN.com
36               Click a link for SI.com
52               Go back to CNN.com
61               Click CNN link to "Terror task force arrests 2 in NY"
88               Check Gmail (Webmail)
100              Clicked on new email
111              Downloaded PDF attachment
118              Downloaded PDF attachment
129              Clicked back to inbox
132              Sign out of Gmail
139              Quit Safari
144              Open Mail.app
187              Send email
225              Open Safari
232              Open Twitter.com
245              Signing into Twitter
259              Clicked the more button on Twitter
276              Clicked the more button on Twitter again
295              Opened new tab with user1's Twitter profile
299              Opened new tab with woot.com
305              Opened new tab with link for patrick.net
312              Clicked on Twitter DM
321              Clicked on user2's Twitter profile
325              Clicked back (to Twitter DM)
328              Deleted DM from user2
335              Signed out of Twitter
336              Closed Twitter Tab
341              Closed tab with user1's profile
346              Closed tab with woot.com
357              Quit Safari
366              Opened Twitteriffic
370              Loaded my mentions in Twitteriffic
372              Loaded the rest of the tweets
403              Refreshed Twitteriffic
411              Closed Twitteriffic
419              Opened Live Mesh
432              Live Mesh connected
440              Live Mesh begins uploading
447              Live Mesh finished uploading
565              Live Mesh updated 19 files
585              Uploaded TEST FOR DATASET LIVE MESH.rtf to Live Mesh using desktop client
602              Quit Live Mesh
624              Dropbox opened
628              Dropbox finished updating
659              Uploaded TEST FOR DATASET DROP BOX.rtf to Dropbox using desktop client
704              Quit Dropbox
718              Opened iTunes
748              Refreshed iTunes Podcasts
756              Refreshed iTunes Podcasts
781              Closed iTunes
937              Ran unencrypted FTP 100 bytes
947              Ran unencrypted FTP 1000 bytes
960              Ran unencrypted FTP 4000 bytes
974              Ran unencrypted FTP 8041 bytes
1056             Ran encrypted FTP 100 bytes
1066             Ran encrypted FTP 1000 bytes
1074             Ran encrypted FTP 4000 bytes
1085             Ran encrypted FTP 8041 bytes
1145             Start Skype
1180             Login into Skype
1211             Skype test call
1220             Start recording playback message
1234             Start Skype playback
1244             End Skype playback
1263             End Skype call
1284             Quit Skype
1327             ssh to 10.0.0.54
1332             Entered password for 10.0.0.54
1340             ls 10.0.0.54
1370             sudo apt-get update (10.0.0.54)
1377             sudo apt-get upgrade (10.0.0.54)
1462             screen -r (10.0.0.54)
1489             exit (10.0.0.54)
1491             ssh to 10.0.0.53
1494             Entered password for 10.0.0.53
1497             w
1509             sudo apt-get update (10.0.0.53)
1516             sudo apt-get upgrade (10.0.0.53)
1581             screen -r (10.0.0.53)
1617             ssh back to host (10.0.0.53)
1624             send password to host (10.0.0.53)
1628             ls (10.0.0.53)
1645             exit host (10.0.0.53)
1650             exit 10.0.0.53
1659             netstat -an
1702             Started iChat
1760             Messaged user3
1772             Messaged user3
1783             user3 messaged me
1788             Messaged user3
1797             user3 messaged me
1798             Messaged user3
1818             user3 messaged me
1824             Messaged user3
1913             Messaged user3
1934             Started Safari
1965             Browsing through bank site
1993             ping Google
1996             Stop ping
2009             Open cnn.com
2024             Open bank1 site
2039             Put in login for bank1
2052             Logged into bank account
2076             Load more entries
2090             Load transfer history
2095             Load account overview
2112             Close bank1
2125             Open bank2
2131             Reload with flash
2152             Put in user name
2168             Enter security questions
2197             Enter pin
2199             Browsing bank2
2246             Updating preferences in bank2
2335             Updated pin, rejected pin for being too short
2359             Submitted new password
2366             Logged out of bank2
2371             Quit Safari
2393             Opened Chicken of the VNC
2407             Allowed connection to cotvnc
2445             VNC to 10.0.0.54
2460             Ran update manager
2499             ssh to blah
2506             ls
2529             exit ssh
2538             Close VNC connection
2542             Exit cotvnc
2554             Open Colloquy
2596             Join lug irc
2629             Message winter
2654             PM user4 user5
2692             PM user4 user5 smiley
2701             Closed pm
2706             Get info user5
2716             Get info user4
2722             Get info user6
2735             Close Winter
2740             Close irc
2755             Open Safari
2766             Open Google Reader
2774             Login into Google
2784             Load UDel news
2792             Mark all read
2799             Load MAC folder
2822             Open CES power tablet in a new tab
2850             Open Deal Brothers in a new tab
2862             Open CES Dell in a new tab
2892             Close CES power tablet
2908             Close Deal Brothers
2923             Close CES Dell
2927             Open security
2953             Open 768 bit RSA in a new tab
2980             Open industry group plans cyber in a new tab
3024             Open Liquidmatrix in Google Reader
3054             Open CES security cam in a new tab
3057             Mark all read
3093             Open tech folder
3239             Open BSOD in a new tab
3298             Open Netflix slashdot.com in a new tab
3304             Ftp encrypted 100 bytes to 10.0.0.54
3336             Open a hack-a-day in new tab
3339             Ftp encrypted 1000 bytes to 10.0.0.54
3359             Close Hack-A-Day article
3354             Close slashdot.com article
3373             Close BSOD article
3386             Close security cam article
3400             Close cyber plan article
3426             Open RSA paper site
3428             Ftp encrypted 4000 bytes to 10.0.0.54
3444             Ftp encrypted 8041 bytes to 10.0.0.54
3448             Ftp encrypted 8041 bytes to 10.0.0.54
3452             Open RSA paper
3457             Download RSA paper
3477             Close RSA paper
3478             Close 768 bit
3484             Open tech
3490             Mark all read
3492             Open Yahoo Sports
3496             Mark all read
3498             Open Vikings
3501             Mark all read
3503             Open Vikings
3520             Mark all read
3531             Open YouTube
3549             Load Josh Wilson video
3639             Clicked more information on side bar
3714             Ftp encrypted 100 bytes to 10.0.0.54
3734             Clicked back on YouTube
3741             Clicked Music
3771             Clicked Indie Music
3788             Clicked Bon Iver - Flume video
3799             Ftp encrypted 100 bytes to 10.0.0.54
3827             Ftp encrypted 1000 bytes to 10.0.0.54
3859             Ftp encrypted 4000 bytes to 10.0.0.54
3884             Ftp encrypted 8041 bytes to 10.0.0.54
3899             Quit Safari
3906             Open iTunes
3924             Refresh podcasts
3949             Refresh iTunes U
3962             Close iTunes
3989             Stop Wireshark

Appendix B

ENTROPY PLOTS FOR DATA SETS

Note: All packets or sessions with a payload of size 0 have been omitted from the plots in this appendix.

Figure B.1: Packet Entropies for the Control Data Set

Figure B.2: Session Entropies for the Control Data Set

Figure B.3: Packet Entropies for Data Set #1 (First Plot)

Figure B.4: Packet Entropies for Data Set #1 (Second Plot)

Figure B.5: Session Entropies for Data Set #1

Figure B.6: Packet Entropies for Data Set #2

Figure B.7: Session Entropies for Data Set #2

Figure B.8: Packet Entropies for the Kraken Data Set

Figure B.9: Session Entropies for the Kraken Data Set

Figure B.10: Packet Entropies for Zeus Data Set #1

Figure B.11: Session Entropies for Zeus Data Set #1

Figure B.12: Packet Entropies for Zeus Data Set #2

Figure B.13: Session Entropies for Zeus Data Set #2

Figure B.14: Packet Entropies for Zeus Data Set #3

Figure B.15: Session Entropies for Zeus Data Set #3

Figure B.16: Packet Entropies for the Blackworm Data Set

Figure B.17: Session Entropies for the Blackworm Data Set

Appendix C

ALERTS FOR DATA SETS

Figure C.1: Session Alerts for the Control Data Set

Figure C.2: Session Alerts for Data Set #2

Figure C.3: Session Alerts for the Kraken Data Set

Figure C.4: Session Alerts for Zeus Data Set #1

Figure C.5: Session Alerts for Zeus Data Set #2

Figure C.6: Session Alerts for Zeus Data Set #3

Figure C.7: Session Alerts for the Blackworm Data Set

Appendix D

VERIFICATION OF MALWARE PACKET CAPTURES

D.1 Kraken Packet Capture

The packet capture came from an open repository on the Internet, which raises the question of whether the source is credible. It needs to be verified that the packet capture actually contains traffic from the Kraken botnet. The first step was to verify that there were conversations between the infected host and a server listening on port 447. The packet capture was run through the Session Extractor program from Section 3.5, which found 16 sessions matching the criteria. The matching sessions are shown in Figure D.1. The next step was to verify that the infected host was communicating with the hosts listed in [32]. The DNS Extractor from Section 3.4 was run against the data set, and the results are shown in Figure D.2. The domain names hmhxnupkc.mooo.com and bdubefoeug.yi.org are found in the list of command and control servers in [32]. Upon further inspection of the packet capture, there is another domain name that was not resolved at the time: rffcteo.dyndns.org, which is also listed in [32] as a host name for the Kraken botnet. Entropy calculations were performed on all the packets contained in the sessions from Figure D.1. All of the packets appear to be encrypted, with entropies greater than 7.45 bits/byte. With this information it is verified with confidence that this packet capture contains traffic from the Kraken botnet.

Figure D.1: Sessions from the Kraken Data Set to Servers on Port 447

Figure D.2: DNS Names Resolved from the Kraken Data Set

D.2 Zeus Packet Captures

The three packet captures for Zeus were obtained from the same repository as the Kraken packet capture, meaning the integrity of these packet captures should also be checked. The verification of these packet captures was done by hand, since no tools had been written at this time to extract the HTTP GET and POST commands from a packet capture. The first thing to note is that all of the packet captures were submitted by the same author, so it is assumed that either all of the files are from Zeus or none of them are. The verification process is based on the "BOTNET COMMUNICATIONS" section in [36]. The basic steps are to search for the HTTP GET command downloading the configuration file, the HTTP OK response showing the download of the configuration file, the HTTP POST commands responsible for exfiltrating the data, and the HTTP OK responses containing information or commands for the bot.

D.2.1 Zeus #1

Figure D.3 shows the packets responsible for the infected host downloading the configuration file from the command and control server. Packet number 4 is the HTTP GET command requesting the configuration file, named cfg3.bin. It is important to note that the creator of the Zeus executable can customize the name of the configuration file. Packet numbers 5 through 8 are responsible for downloading the configuration file. Packet number 9 is the command and control server acknowledging the download of the configuration file. Figure D.4 shows the HTTP POST commands and the server’s HTTP OK responses. Packet numbers 17 and 18 are the HTTP POST commands used to exfiltrate data to the server. The file being sent is named stat1.php, a name that can also be customized by the creator of the Zeus executable. Packet numbers 21 and 25 are the HTTP OK responses from the server. Packet 21 contains a larger-than-normal payload, meaning it most likely carries a command being sent from the server.
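The “larger than normal payload” observation can be expressed as a simple size heuristic. A sketch under the same assumptions as above follows; the 200-byte cutoff is illustrative and not a value from this thesis.

from scapy.all import rdpcap, TCP, Raw

CMD_BODY_CUTOFF = 200  # illustrative cutoff, not a value from this thesis

for i, pkt in enumerate(rdpcap("zeus1.pcap"), start=1):  # hypothetical file name
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if payload.startswith(b"HTTP/1.") and b" 200 " in payload[:16]:
            # Anything after the header separator is the response body.
            body = payload.split(b"\r\n\r\n", 1)[-1]
            if len(body) > CMD_BODY_CUTOFF:
                print(f"packet {i}: HTTP OK with {len(body)}-byte body, "
                      "possibly a server command")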

Figure D.3: HTTP GET Command Downloading the Configuration File

Figure D.4: HTTP POST Commands and HTTP OK Responses

Figure D.5 shows all of the HTTP communications from Zeus in this packet capture, except for the packets carrying the files to and from the bot and the server. It includes the HTTP GET and POST commands as well as the HTTP OK responses. Note that there are 13 HTTP POST commands exfiltrating data in this packet capture. The packet capture shows all of the activities described in [36], meaning that it contains traffic from Zeus.

Figure D.5: Zeus HTTP Communications

D.2.2 Zeus #2

Figure D.6 shows the HTTP GET packet that requests the configuration file, named ribbon.tar. Figure D.7 displays the command and control server’s HTTP OK response to the request to download the configuration file.

Figure D.6: HTTP GET Command Requesting the Configuration File

Figure D.7: HTTP OK Response to the Request for the Configuration File

Figure D.8 shows the HTTP POST commands and the server’s HTTP OK responses. Packet numbers 54 and 57 are the HTTP POST commands used to exfiltrate data to the server. The file being sent is named index1.php. Packet numbers 60 and 64 are the HTTP OK responses from the server. Packet 60 contains a smaller payload than the responses seen in Zeus Data Set #1, meaning it most likely does not carry a command from the server.

Figure D.9 shows all of the HTTP communications from Zeus in this packet capture, except for the packets carrying the files to and from the bot and the server. It includes the HTTP GET and POST commands as well as the HTTP OK responses. Note that there are 3 HTTP POST commands exfiltrating data in this packet capture. The packet capture shows all of the activities described in [36], meaning that it contains traffic from Zeus.

Figure D.8: HTTP POST Commands and HTTP OK Responses

Figure D.9: Zeus HTTP Communications

D.2.3 Zeus #3

Figure D.10 shows the HTTP GET packet that requests the configuration file, named kartos.bin.

Figure D.10: HTTP GET Command Requesting the Configuration File

Figure D.11 displays the command and control server’s HTTP OK response to the request to download the configuration file.

Figure D.11: HTTP OK Response to the Request for the Configuration File

Figure D.12 shows the HTTP POST commands and the server’s HTTP OK responses. Packet numbers 239 and 240 are the HTTP POST commands used to exfiltrate data to the server. The file being sent is named youyou.php. Packet numbers 244 and 246 are the HTTP OK responses from the server. Packet 246 contains a larger payload, meaning it most likely carries a command being sent from the server.

Figure D.12: HTTP POST Commands and HTTP OK Responses

Figure D.13 shows all of the HTTP communications from Zeus in this packet capture, except for the packets carrying the files to and from the bot and the server. It includes the HTTP GET and POST commands as well as the HTTP OK responses. Note that there are 4 HTTP POST commands exfiltrating data in this packet capture, but packet numbers 244 and 389 are from the same session. The packet capture shows all of the activities described in [36], meaning that it contains traffic from Zeus.

Figure D.13: Zeus HTTP Communications

D.3 Blackworm

As with the other malware packet captures, this packet capture is from a repository to which anyone can upload captures. The capture needs to be verified against the description of Blackworm. The verification is performed by hand by inspecting the packet capture. Figure D.14 shows the infected host connecting to the victim host and enumerating all of its network shares. Packet number 23 shows that the infected host’s name is BLACKWORM and the victim host’s name is BLACKWORM-VICTI. Packet numbers 39 and 40 are responsible for the enumeration of the network shares. Figure D.15 displays the packets that query for the security vendors’ program folders. The folders being requested match those listed in [38]. In the case of this packet capture, none of the folders were found on the victim host.
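A scripted version of this folder check is sketched below (Python with scapy; the capture file name is hypothetical, and the folder names are illustrative examples only, the authoritative list being in [38]). SMB carries path names as UTF-16LE, so the search encodes each name accordingly before scanning the raw payloads on the SMB/NetBIOS ports.

from scapy.all import rdpcap, TCP, Raw

# Illustrative vendor folder names; the list used for verification is in [38].
VENDOR_FOLDERS = ["Norton AntiVirus", "McAfee.com", "Kaspersky Lab"]

for i, pkt in enumerate(rdpcap("blackworm.pcap"), start=1):  # hypothetical file name
    if not (pkt.haslayer(TCP) and pkt.haslayer(Raw)):
        continue
    if not {pkt[TCP].sport, pkt[TCP].dport} & {139, 445}:  # SMB/NetBIOS ports
        continue
    payload = bytes(pkt[Raw].load)
    for name in VENDOR_FOLDERS:
        if name.encode("utf-16-le") in payload:  # SMB paths are UTF-16LE
            print(f"packet {i}: query referencing '{name}'")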

Figure D.14: Connection to Victim and Enumerating Its Network Shares

Figure D.15: Searching for Security Vendors’ Program Folders

Figure D.16 and Figure D.17 show the copying of the Blackworm executable to different file locations, representing the spreading of the worm onto the victim host. The copy in Figure D.17 disguises the worm as a file that may normally be present, and Figure D.18 shows the packet that deletes the file the worm is imitating.

Figure D.16: Copying Blackworm Executable to Victim

Figure D.17: Copying Blackworm Executable to Victim

Figure D.18: Deleting Startup Link from Victim

Figure D.19 displays the infected host creating a job that will run the executable. Once this job has been created, the executable will run at a later time and the infection is successful. The packet capture shows the activities described in [38], meaning that it contains traffic from Blackworm.

Figure D.19: Creating a Job to Run the Blackworm Executable
