The Pennsylvania State University

The Graduate School

College of Engineering

PACKET INSPECTION FOR APPLICATION CLASSIFICATION

AND INTRUSION DETECTION

A Dissertation in

Electrical Engineering

by

Jisheng Wang

© 2008 Jisheng Wang

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2008


The dissertation of Jisheng Wang was reviewed and approved* by the following:

David J. Miller Associate Professor of Electrical Engineering Dissertation Co-Adviser Co-Chair of Committee

George Kesidis Professor of Computer Science and Engineering Professor of Electrical Engineering Dissertation Co-Adviser Co-Chair of Committee

Nirmal K. Bose HRB-Systems Professor of Electrical Engineering

Prasenjit Mitra Assistant Professor of the School of Information Sciences and Technology Assistant Professor of Computer Science and Engineering Assistant Professor of Industrial and Manufacturing Engineering

Kenneth W. Jenkins Professor of Electrical Engineering Head of the Department of Electrical Engineering

*Signatures are on file in the Graduate School


Abstract

Current computer networks remain vulnerable to several families of attacks, including scanning worms, hit-list worms, and distributed denial-of-service (DDoS) attacks that target resources associated with end-systems or critical network protocols. These attacks remain significant direct and indirect threats to the network’s infrastructure and its end-systems. Despite past developments, anomaly detection and response targeting zero-day attacks (i.e., attacks not yet seen) remains an open research problem.

This dissertation presents the complete structure of an automated payload-based network intrusion detection system with three main components: network traffic mining, network anomaly identification, and worm signature extraction. Estan et al.’s multidimensional digesting algorithm is used to mine significant flows, either worm flows or dominant normal flows, from the overall network traffic, and several techniques are proposed for improving its efficiency. Based on the mining results, a new entropy-based criterion is presented to identify anomalous network traffic, including the Slammer and Code-Red worms and DDoS attacks. Moreover, a Generalized Suffix Tree-based approach is proposed for efficiently extracting signatures of polymorphic worms. The proposed intrusion detection system can therefore automatically generate signatures of zero-day attacks and worms, which can be used to contain their spread in the future.

Meanwhile, with the increasing flexibility of current networks, many new applications have appeared and begun to dominate the Internet. Newly emerging peer-to-peer applications, such as Bitcomet and Skype, can be responsible for more than 80% of the total traffic volume in the Internet. It is therefore essential for Internet service providers to correctly identify these new applications. This dissertation presents an efficient approach to identifying Skype voice over IP (VoIP) traffic by using reliable statistical information. Because of its efficiency in both computational complexity and memory consumption, the new approach can be implemented on network backbone routers to identify Skype VoIP traffic in real time.


Table of Contents

List of Acronyms...... ix

List of Figures...... xi

List of Tables...... xiii

Acknowledgments...... xiv

Chapter 1. Introduction...... 1

1.1 Background...... 1

1.1.1 Network Traffic Management ...... 2

1.1.2 Network Intrusion Detection...... 3

1.1.3 “Lawful Interception” of IP Data Traffic ...... 4

1.2 Contributions...... 5

1.3 Organization...... 6

Chapter 2. Multidimensional Network Traffic Digesting...... 8

2.1 Introduction...... 8

2.2 Multidimensional, Hierarchical Flow Mining of Network Traffic .. 14

2.2.1 Identifying Significant Unidimensional Flows ...... 17

2.2.2 Identifying Significant Multidimensional Flows ...... 21

2.2.3 Improving the Efficiency of Multidimensional Flow Mining ...... 23

2.2.4 Implementation Considerations...... 31


2.3 Experiments Comparing Computational Efficiency...... 33

2.4 Conclusion ...... 39

Chapter 3. Network Intrusion Detection Systems ...... 40

3.1 Introduction of Network Attacks ...... 40

3.2 Review of Network Intrusion Detection Systems...... 45

3.2.1 Host/Operation System-Based Intrusion Detection ...... 45

3.2.2 Network-Based Intrusion Detection...... 47

3.2.3 Packet Payload-Based Intrusion Detection ...... 49

3.3 Comprehensive Intrusion Defense System ...... 51

3.4 White-Listing in Payload-Based Detection ...... 53

3.5 Covert Modeling that Exploits White-Listing ...... 55

3.6 Port-80 Data Traffic and Peer-to-Peer Traffic...... 59

3.7 Conclusion ...... 61

Chapter 4. Multidimensional Mining-Based Network Anomaly Identification 62

4.1 Introduction...... 62

4.2 Criterion for Anomaly Identification ...... 64

4.2.1 Leaf and Internal Node Clusters...... 66

4.3 Attack Identification Results...... 67

4.3.1 DARPA Trace ...... 68

4.3.2 Sapphire/Slammer Trace ...... 71

4.3.3 Code-Red version 2 Trace ...... 72


4.4 Discussion and Relation to Prior Work...... 79

4.5 Conclusion ...... 83

Chapter 5. Generalized Suffix Tree-Based Worm Signature Extraction ...... 85

5.1 Introduction...... 85

5.2 Prior Work on Worm Signature Extraction...... 88

5.3 New Polymorphic Worm IDS...... 93

5.3.1 Directly Mining Suspicious Clusters...... 93

5.3.2 Worm Signature Extraction ...... 97

5.4 Experimental Methodology ...... 100

5.4.1 Polymorphism via Encryption Schemes...... 100

5.4.2 Issues in Salting Background with Worm Traffic ...... 101

5.5 Experimental Results and Discussion...... 102

5.6 Conclusion ...... 107

Chapter 6. Identifying VoIP Traffic by Using Reliable Statistical Signatures. 108

6.1 Introduction and Motivation ...... 108

6.2 Skype Transmission Mechanism...... 112

6.2.1 Peer-to-Peer Structure...... 112

6.2.2 Obfuscation Played by Skype...... 114

6.3 Related Work...... 117

6.4 Efficient Statistical Method for Identifying VoIP Traffic ...... 123

6.4.1 Statistical Feature Selection ...... 123


6.4.2 Implementation Considerations...... 126

6.5 Statistical Analysis of Skype VoIP Traffic ...... 131

6.5.1 Skype Video...... 133

6.5.2 Skype Voice ...... 136

6.5.3 Skype Phone...... 139

6.5.4 Growing Window versus Sliding Window...... 142

6.6 Experimental Results ...... 148

6.6.1 Training Data...... 149

6.6.2 Performance Evaluation ...... 149

6.7 Conclusion ...... 157

Chapter 7. Conclusions...... 158

Bibliography ...... 161


List of Acronyms

AIDE: Absolute IP Difference in Entropy

CSS: Color Set Size

DDoS: Distributed Denial-of-Service

DoS: Denial-of-Service

DSL: Digital Subscriber Line

EM: Expectation Maximization

FTP: File Transfer Protocol

GST: Generalized Suffix Tree

HTTP: Hypertext Transfer Protocol

HTTPS: Hypertext Transfer Protocol over Secure Sockets Layer

ICMP: Internet Control Message Protocol

IDS: Intrusion Detection System

IETF: Internet Engineering Task Force

IIS: Internet Information Server

IP: Internet Protocol

ISP: Internet Service Provider

LAN: Local Area Network

LI: Lawful Interception

NAT: Network Address Translation


NIC: Network Interface Card

NZIX: New Zealand Internet Exchange

P2P: Peer-to-Peer

POTS: Plain Old Telephone Service

PSTN: Public Switched Telephone Network

QoS: Quality of Service

RFC: Request for Comments

SMTP: Simple Mail Transfer Protocol

SSH: Secure Shell

STUN: Simple Traversal of UDP through NATs

TCP: Transmission Control Protocol

UC: Unidimensional Clustering

UDP: User Datagram Protocol

URL: Uniform Resource Locator

VAD: Voice Activity Detection

VoIP: Voice over IP


List of Figures

Figure 1. Part of the flow hierarchy for the 2-D subspace consisting of the source port and protocol...... 15

Figure 2. A portion of the 1-D IP address hierarchy, with significant nodes identified...... 19

Figure 3. Part of the flow hierarchy for a 5-D multidimensional tree...... 24

Figure 4. Illustration of online mining implementation...... 31

Figure 5. An attack scenario of botnets...... 41

Figure 6. A comprehensive intrusion detection system...... 53

Figure 7. Syntax of HTTP request...... 57

Figure 8. AIDE and unexpectedness distribution of DARPA trace...... 70

Figure 9. AIDE and unexpectedness distribution of Slammer worm trace..... 75

Figure 10. AIDE and unexpectedness distribution of Code-Red worm trace... 77

Figure 11. AIDE and unexpectedness distribution of merged Code-Red worm trace...... 78

Figure 12. Structure of payload-based intrusion detection system...... 92

Figure 13. Suffix tree of input string “aaab” ...... 95

Figure 14. Generalized suffix tree of input strings “aaab” and “aabb”...... 96

Figure 15. Prototype of Skype P2P network structure...... 111


Figure 16. Traffic flow of an active Skype user...... 113

Figure 17. Packet sizes of a typical Skype video call...... 132

Figure 18. Statistical feature distribution of a typical Skype video call...... 135

Figure 19. Packet sizes of a typical Skype voice call...... 137

Figure 20. Statistical feature distribution of a typical Skype voice call...... 138

Figure 21. Packet sizes of a typical Skype phone call...... 140

Figure 22. Statistical feature distribution of a typical Skype phone call...... 141

Figure 23. Comparison between the growing window method and the sliding window method (sliding window size = 100 packets)...... 143

Figure 24. Comparison between the growing window method and the sliding window method (sliding window size = 500 packets)...... 144

Figure 25. Comparison between the growing window method and the sliding window method (sliding window size = 1000 packets)...... 145

Figure 26. Illustration of implementing the sliding window method by using a linked list...... 146


List of Tables

Table 1. The algorithm pseudocode for multidimensional clustering...... 20

Table 2. Experimental results for the New Zealand (NZIX) trace data...... 34

Table 3. Complexity reduction associated with each of the three strategies. 37

Table 4. Comparison of execution times for top-down and bottom-up unidimensional clustering...... 37

Table 5. Multidimensional clustering report of DARPA trace...... 69

Table 6. Multidimensional clustering report of Slammer worm trace...... 74

Table 7. Multidimensional clustering report of Code-Red worm trace...... 76

Table 8. The mining and signature report of worm salted Taiwan trace...... 104

Table 9. Pseudocode of our efficient statistical approach...... 130

Table 10. Component traffic of a mixed Skype trace...... 151

Table 11. Traffic identification results of a mixed Skype trace...... 152

Table 12. Traffic information of a tested trace...... 154

Table 13. Protocol breakdown of a tested trace...... 155


Acknowledgments

First and foremost, I would like to express my deepest gratitude to my thesis co-advisers, Prof. David J. Miller and Prof. George Kesidis, for their continuous guidance, patience, and encouragement during my graduate studies at Penn State. I am indebted to them for the financial support they have provided over the years. I would also like to thank my other committee members, Prof. Nirmal Bose and Prof. Prasenjit Mitra, for taking the time to serve on my committee and for their insightful commentary on my work. I thank Cetin Seren for his help and guidance during my internship at Cisco Systems, Inc., and for his feedback on sections of this dissertation.

I am deeply grateful and indebted to my parents, who have stood by me at every turn in my life. I sincerely thank them for the selfless love and support they have given me and continue to give.

Finally, I would like to dedicate this thesis to my dear fiancée, Yangyang Tang. Without her love, belief, understanding, and emotional support during these years, this thesis would never have been completed.


Chapter 1

Introduction

1.1 Background

With the rapid development of networks, many new applications and services, as well as viruses and worms, are emerging on the Internet. Research on network traffic classification has therefore become increasingly important, since it has the potential to help Internet service providers (ISPs) and their equipment vendors solve difficult network management problems. For example, if network administrators can correctly and promptly determine the content carried by each network flow, they can confidently take appropriate action on different traffic flows: blocking the spread of worms and malware, assigning sufficient network resources to real-time applications, and so on.

Generally, there are three main applications of network traffic classification: network traffic management, network intrusion detection, and the recently emerging “lawful interception” (LI) of IP data traffic.


1.1.1 Network Traffic Management

Network traffic classification supports many traditional and new network management activities, e.g., quality-of-service (QoS) management, network resource provisioning, network traffic engineering, application performance evaluation, and pricing and accounting. In these activities, traffic classification was originally performed in a straightforward fashion: different network applications were identified simply by inspecting packet header information, such as IP addresses, port numbers, and protocol.
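As an illustration of header-based classification, the following minimal sketch maps a packet's transport protocol and port numbers to an application label. The `PacketHeader` type, the `WELL_KNOWN_PORTS` table, and the function `classify_by_header` are hypothetical names introduced here for illustration only; the port assignments shown are the conventional well-known ports for the listed protocols, and the sketch is not taken from any particular classifier.

```python
from typing import NamedTuple


class PacketHeader(NamedTuple):
    """Header fields assumed to be available for each packet (illustrative)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # transport protocol, e.g. "TCP" or "UDP"


# Hypothetical lookup table of (protocol, port) pairs; a real deployment
# would use a far larger, administrator-maintained list.
WELL_KNOWN_PORTS = {
    ("TCP", 21): "FTP",
    ("TCP", 22): "SSH",
    ("TCP", 25): "SMTP",
    ("TCP", 80): "HTTP",
    ("TCP", 443): "HTTPS",
}


def classify_by_header(pkt: PacketHeader) -> str:
    """Return an application label using only the protocol and port numbers."""
    # Check the destination port first (client-to-server direction),
    # then the source port (server-to-client direction).
    for port in (pkt.dst_port, pkt.src_port):
        app = WELL_KNOWN_PORTS.get((pkt.protocol, port))
        if app is not None:
            return app
    return "unknown"


if __name__ == "__main__":
    web = PacketHeader("10.0.0.1", "10.0.0.2", 51234, 80, "TCP")
    other = PacketHeader("10.0.0.1", "10.0.0.3", 40000, 40001, "UDP")
    print(classify_by_header(web))
    print(classify_by_header(other))
```

In this sketch, any flow whose ports are absent from the table is labeled "unknown", so an application that picks arbitrary open port numbers is not recognized by header inspection alone.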

Nowadays, the infrastructure of the Internet is becoming more and more flexible, so individual users have more opportunities to change the Internet themselves.

This change makes the Internet more popular and powerful than ever before; on the other hand, it also makes network management more difficult. Any individual user can develop a new network application for his or her own purposes by making use of open port numbers. Many such applications have spread over the Internet, and some of them, such as Bitcomet and Skype, have become very popular. The analysis in [92] points out that the predominant type of traffic is currently produced by peer-to-peer (P2P) file sharing applications, which can be responsible for more than 80% of the total traffic volume depending on the location and hour of day. Therefore, identification of these new applications becomes essential to ISPs for both management and security purposes. Meanwhile, some new applications utilize undisclosed protocols or obfuscation techniques to evade detection by current network classification approaches, which makes the task of classifying their traffic especially difficult. In this dissertatio