The Pennsylvania State University
The Graduate School
College of Engineering
PACKET INSPECTION FOR APPLICATION CLASSIFICATION
AND INTRUSION DETECTION
A Dissertation in
Electrical Engineering
by
Jisheng Wang
© 2008 Jisheng Wang
Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
May 2008
The dissertation of Jisheng Wang was reviewed and approved* by the following:
David J. Miller
Associate Professor of Electrical Engineering
Dissertation Co-Adviser
Co-Chair of Committee

George Kesidis
Professor of Computer Science and Engineering
Professor of Electrical Engineering
Dissertation Co-Adviser
Co-Chair of Committee

Nirmal K. Bose
HRB-Systems Professor of Electrical Engineering

Prasenjit Mitra
Assistant Professor of the School of Information Sciences and Technology
Assistant Professor of Computer Science and Engineering
Assistant Professor of Industrial and Manufacturing Engineering

Kenneth W. Jenkins
Professor of Electrical Engineering
Head of the Department of Electrical Engineering

*Signatures are on file in the Graduate School
Abstract
Current computer networks remain vulnerable to a variety of families of attacks, including scanning worms, distributed denial-of-service (DDoS) attacks targeting resources associated with end-systems or critical network protocols, and hit-list worms. These attacks remain significant direct and indirect threats to the network’s infrastructure and its end-systems. Despite past developments, anomaly detection and response targeting zero-day attacks (i.e., attacks not yet seen) remain an open research problem.
This dissertation presents the complete structure of an automated payload-based network intrusion detection system, which includes three main components: network traffic mining, network anomaly identification, and worm signature extraction. Estan et al.’s multidimensional digesting algorithm is introduced to mine significant flows (either worm flows or dominant normal flows) among the entire network traffic, and several techniques are proposed for improving its efficiency. Based on the mining results, a new entropy-based criterion is presented to correctly identify anomalous network traffic, including the Slammer and Code-Red worms and DDoS attacks. Moreover, a Generalized Suffix Tree-based approach is proposed for efficiently extracting signatures of polymorphic worms. The proposed intrusion detection system can therefore automatically generate signatures of zero-day attacks/worms, which can be used to contain their spread in the future.
Meanwhile, with the increasing flexibility of current networks, many new applications have appeared and begun to dominate the Internet. Newly emerging peer-to-peer applications, such as Bitcomet and Skype, can be responsible for more than 80% of the total traffic volume in the Internet. It is therefore essential for Internet service providers to correctly identify these new applications. This dissertation presents an efficient approach to identifying Skype voice over IP (VoIP) traffic by using reliable statistical information. Because of its efficiency in both computational complexity and memory consumption, the new approach can be implemented on network backbone routers to identify Skype VoIP traffic in real time.
Table of Contents
List of Acronyms...... ix
List of Figures...... xi
List of Tables...... xiii
Acknowledgments...... xiv
Chapter 1. Introduction...... 1
1.1 Background...... 1
1.1.1 Network Traffic Management ...... 2
1.1.2 Network Intrusion Detection...... 3
1.1.3 “Lawful Interception” of IP Data Traffic ...... 4
1.2 Contributions...... 5
1.3 Organization...... 6
Chapter 2. Multidimensional Network Traffic Digesting...... 8
2.1 Introduction...... 8
2.2 Multidimensional, Hierarchical Flow Mining of Network Traffic .. 14
2.2.1 Identifying Significant Unidimensional Flows ...... 17
2.2.2 Identifying Significant Multidimensional Flows ...... 21
2.2.3 Improving the Efficiency of Multidimensional Flow Mining ...... 23
2.2.4 Implementation Considerations...... 31
2.3 Experiments Comparing Computational Efficiency...... 33
2.4 Conclusion ...... 39
Chapter 3. Network Intrusion Detection Systems ...... 40
3.1 Introduction of Network Attacks ...... 40
3.2 Review of Network Intrusion Detection Systems...... 45
3.2.1 Host/Operating System-Based Intrusion Detection ...... 45
3.2.2 Network-Based Intrusion Detection...... 47
3.2.3 Packet Payload-Based Intrusion Detection ...... 49
3.3 Comprehensive Intrusion Defense System ...... 51
3.4 White-Listing in Payload-Based Detection ...... 53
3.5 Covert Malware Modeling that Exploits White-Listing ...... 55
3.6 Port-80 Data Traffic and Peer-to-Peer Traffic...... 59
3.7 Conclusion ...... 61
Chapter 4. Multidimensional Mining-Based Network Anomaly Identification ...... 62
4.1 Introduction...... 62
4.2 Criterion for Anomaly Identification ...... 64
4.2.1 Leaf and Internal Node Clusters...... 66
4.3 Attack Identification Results...... 67
4.3.1 DARPA Trace ...... 68
4.3.2 Sapphire/Slammer Trace ...... 71
4.3.3 Code-Red version 2 Trace ...... 72
4.4 Discussion and Relation to Prior Work...... 79
4.5 Conclusion ...... 83
Chapter 5. Generalized Suffix Tree-Based Worm Signature Extraction ...... 85
5.1 Introduction...... 85
5.2 Prior Work on Worm Signature Extraction...... 88
5.3 New Polymorphic Worm IDS...... 93
5.3.1 Directly Mining Suspicious Clusters...... 93
5.3.2 Worm Signature Extraction ...... 97
5.4 Experimental Methodology ...... 100
5.4.1 Polymorphism via Encryption Schemes...... 100
5.4.2 Issues in Salting Background with Worm Traffic ...... 101
5.5 Experimental Results and Discussion...... 102
5.6 Conclusion ...... 107
Chapter 6. Identifying VoIP Traffic by Using Reliable Statistical Signatures ...... 108
6.1 Introduction and Motivation ...... 108
6.2 Skype Transmission Mechanism...... 112
6.2.1 Peer-to-Peer Structure...... 112
6.2.2 Obfuscation Played by Skype...... 114
6.3 Related Work...... 117
6.4 Efficient Statistical Method for Identifying VoIP Traffic ...... 123
6.4.1 Statistical Feature Selection ...... 123
6.4.2 Implementation Considerations...... 126
6.5 Statistical Analysis of Skype VoIP Traffic ...... 131
6.5.1 Skype Video...... 133
6.5.2 Skype Voice ...... 136
6.5.3 Skype Phone...... 139
6.5.4 Growing Window versus Sliding Window...... 142
6.6 Experimental Results ...... 148
6.6.1 Training Data...... 149
6.6.2 Performance Evaluation ...... 149
6.7 Conclusion ...... 157
Chapter 7. Conclusions...... 158
Bibliography ...... 161
List of Acronyms
AIDE: Absolute IP Difference in Entropy
CSS: Color Set Size
DDoS: Distributed Denial-of-Service
DoS: Denial-of-Service
DSL: Digital Subscriber Line
EM: Expectation Maximization
FTP: File Transfer Protocol
GST: Generalized Suffix Tree
HTTP: Hypertext Transfer Protocol
HTTPS: Hypertext Transfer Protocol over Secure Socket Layer
ICMP: Internet Control Message Protocol
IDS: Intrusion Detection System
IETF: Internet Engineering Task Force
IIS: Internet Information Server
IP: Internet Protocol
ISP: Internet Service Provider
LAN: Local Area Network
LI: Lawful Interception
NAT: Network Address Translation
NIC: Network Interface Card
NZIX: New Zealand Internet Exchange
P2P: Peer-to-Peer
POTS: Plain Old Telephone Service
PSTN: Public Switched Telephone Network
QoS: Quality of Service
RFC: Request for Comments
SMTP: Simple Mail Transfer Protocol
SSH: Secure Shell
STUN: Simple Traversal of UDP through NATs
TCP: Transmission Control Protocol
UC: Unidimensional Clustering
UDP: User Datagram Protocol
URL: Uniform Resource Locator
VAD: Voice Activity Detection
VoIP: Voice over IP
List of Figures
Figure 1. Part of the flow hierarchy for the 2-D subspace consisting of the
source port and protocol...... 15
Figure 2. A portion of the 1-D IP address hierarchy, with significant nodes
identified...... 19
Figure 3. Part of the flow hierarchy for a 5-D multidimensional tree...... 24
Figure 4. Illustration of online mining implementation...... 31
Figure 5. An attack scenario of botnets...... 41
Figure 6. A comprehensive intrusion detection system...... 53
Figure 7. Syntax of HTTP request...... 57
Figure 8. AIDE and unexpectedness distribution of DARPA trace...... 70
Figure 9. AIDE and unexpectedness distribution of Slammer worm trace..... 75
Figure 10. AIDE and unexpectedness distribution of Code-Red worm trace... 77
Figure 11. AIDE and unexpectedness distribution of merged Code-Red worm
trace...... 78
Figure 12. Structure of payload-based intrusion detection system...... 92
Figure 13. Suffix tree of input string “aaab” ...... 95
Figure 14. Generalized suffix tree of input strings “aaab” and “aabb”...... 96
Figure 15. Prototype of Skype P2P network structure...... 111
Figure 16. Traffic flow of an active Skype user...... 113
Figure 17. Packet sizes of a typical Skype video call...... 132
Figure 18. Statistical feature distribution of a typical Skype video call...... 135
Figure 19. Packet sizes of a typical Skype voice call...... 137
Figure 20. Statistical feature distribution of a typical Skype voice call...... 138
Figure 21. Packet sizes of a typical Skype phone call...... 140
Figure 22. Statistical feature distribution of a typical Skype phone call...... 141
Figure 23. Comparison between the growing window method and the sliding
window method (sliding window size = 100 packets)...... 143
Figure 24. Comparison between the growing window method and the sliding
window method (sliding window size = 500 packets)...... 144
Figure 25. Comparison between the growing window method and the sliding
window method (sliding window size = 1000 packets)...... 145
Figure 26. Illustration of implementing the sliding window method by using a linked
list...... 146
List of Tables
Table 1. The algorithm pseudocode for multidimensional clustering...... 20
Table 2. Experimental results for the New Zealand (NZIX) trace data...... 34
Table 3. Complexity reduction associated with each of the three strategies...... 37
Table 4. Comparison of execution times for top-down and bottom-up
unidimensional clustering...... 37
Table 5. Multidimensional clustering report of DARPA trace...... 69
Table 6. Multidimensional clustering report of Slammer worm trace...... 74
Table 7. Multidimensional clustering report of Code-Red worm trace...... 76
Table 8. The mining and signature report of worm salted Taiwan trace...... 104
Table 9. Pseudocode of our efficient statistical approach...... 130
Table 10. Component traffic of a mixed Skype trace...... 151
Table 11. Traffic identification results of a mixed Skype trace...... 152
Table 12. Traffic information of a tested trace...... 154
Table 13. Protocol breakdown of a tested trace...... 155
Acknowledgments
First and foremost, I would like to express my deepest gratitude to my thesis
co-advisers, Prof. David J. Miller and Prof. George Kesidis, for their continuous
guidance, patience, and encouragement during my graduate studies at Penn State. I
am indebted to them for the financial support they have provided to me over the
years. I would also like to thank my other committee members, Prof. Nirmal Bose
and Prof. Prasenjit Mitra, for taking the time to serve on my committee and for their
insightful commentary on my work. I thank Cetin Seren for his help and
guidance during my internship at Cisco Systems, Inc., and for his feedback on sections of this dissertation.
I am deeply grateful and indebted to my parents, who have stood by me at every turn in my life. I sincerely thank them for the selfless love and support they have given me and continue to give.
Finally, I would like to dedicate this thesis to my dear fiancee, Yangyang Tang.
Without her love, belief, understanding, and emotional support during these years, this thesis would never have been completed.
Chapter 1
Introduction
1.1 Background
With the rapid development of telecommunication networks, many new applications and services, as well as viruses and worms, are emerging on the Internet. Research on network traffic classification has therefore become increasingly important, since it has the potential to help Internet service providers (ISPs) and their equipment vendors solve difficult network management problems. For example, if network administrators can correctly and promptly determine the content carried by each network flow, they can confidently take appropriate actions on different traffic flows, such as blocking the spread of worms and malware or assigning sufficient network resources to real-time applications.
Generally, there are three main applications of network traffic classification: network traffic management, network intrusion detection, and the recently emerging “lawful interception” (LI) of IP data traffic.
1.1.1 Network Traffic Management
Network traffic classification supports many traditional and new network management activities, e.g., quality-of-service (QoS) management, network resource provisioning, network traffic engineering, application performance evaluation, and pricing and accounting. In these activities, traffic classification was originally performed in a straightforward fashion: different network applications were identified simply by inspecting packet header information, such as IP addresses, port numbers, and protocol.
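As a minimal sketch of such header-based classification, the following Python fragment maps a packet's protocol and port numbers to an application name. The PacketHeader type, the port table, and the function name classify_by_header are illustrative assumptions and do not come from any particular classifier.

```python
from typing import NamedTuple


class PacketHeader(NamedTuple):
    """Hypothetical container for the header fields used in classification."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # "TCP" or "UDP"


# Illustrative table of conventional (protocol, port) assignments.
WELL_KNOWN_PORTS = {
    ("TCP", 21): "FTP",
    ("TCP", 22): "SSH",
    ("TCP", 25): "SMTP",
    ("TCP", 80): "HTTP",
    ("TCP", 443): "HTTPS",
}


def classify_by_header(header: PacketHeader) -> str:
    """Return the application name implied by the header, or "unknown"."""
    # Check the destination port first, then the source port, so that
    # both directions of a flow map to the same application.
    for port in (header.dst_port, header.src_port):
        app = WELL_KNOWN_PORTS.get((header.protocol, port))
        if app is not None:
            return app
    return "unknown"
```

For instance, a TCP packet sent to port 80 is labeled "HTTP", whereas a packet exchanged between two arbitrary high-numbered ports is labeled "unknown", since a lookup of this kind sees only the header and not the payload.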
Nowadays, the infrastructure of the Internet is becoming more and more flexible, so individual users have more opportunities to shape the Internet themselves. This change makes the Internet more popular and powerful than ever before; on the other hand, it also makes network management more difficult. Any individual user can develop a new network application for his or her own purposes by making use of open port numbers. Many such applications have spread over the Internet, and some of them, such as Bitcomet and Skype, have become very popular. The analysis in [92] points out that the predominant type of traffic is currently produced by peer-to-peer (P2P) file sharing applications, which can be responsible for more than 80% of the total traffic
volume depending on the location and hour of day. Therefore, identification of these new applications becomes essential to ISPs for both management and security purposes. Meanwhile, some new applications utilize undisclosed protocols or obfuscation techniques to evade the detection of current network classification approaches, which makes the task of classifying their traffic especially difficult. In this dissertatio