P2P Population Tracking and Traffic Characterisation
Total Page:16
File Type:pdf, Size:1020Kb
Institut für Technische Informatik und Kommunikationsnetze P2P Population Tracking and Traffic Characterization of Current P2P File-sharing Systems Lukas H¨ammerle Master Thesis MA-2004-04 April 2004 - September 2004 Tutor: Arno Wagner Co-tutor: Thomas Dubendorfer¨ Supervisor: Prof. Dr. Bernhard Plattner Preface While writing this thesis a lot of interesting P2P related developments took place. Some of them will change the P2P community very rapidly as it was always the case with P2P networks. It was a very exciting time and even when this thesis is finished I still will be observing the P2P (r)evolution. At this place I also want to thank some people who supported my work and who I could count on if I needed assistance or guidance. Namely these are: • Arno Wagner: For the good guidance and technical support • Thomas Dubendorfer:¨ For being my co-tutor and gnuplot guru • Prof. Bernhard Plattner: For being my supervisor and making this thesis possible • Dienstgruppe TIK: For providing an overall reliable computer environment that was comfortable to work with • Stephane Racine: For introducing me to NetFlow and the cluster environ- ment • Philipp Jardas: For his preceding survey work about P2P file sharing systems • Caspar Schlegel: For his support with UPFrame • Rakesh Kumar: For providing me some of their FastTrack client port usage data • Luca Deri: For providing us a free version of nprobe/ntop • My fellow co-workers of G69: For the “candy-sharing” and the interesting conversations during coffee brakes I further want to state that this thesis did not aim at finding methods to identify P2P users in order to prosecute them but in order to better analyze and observe their numbers and their usage characteristics. Since the described methods and concepts are based on flow-level data they can’t be used to inspect which files a certain user shares or downloads. Copyright °c 2004 Lukas H¨ammerle <[email protected]> ETH Zurich, TIK, Communication Systems Group, DDoSVax team. Abstract Detecting denial of service attacks and the spreading of virii and worms requires knowledge about the type of Internet traffic that passes a network. Nowadays a fairly large amount of all traffic in Internet backbones is not easily identifiable anymore since the amount of non well-known traffic has dramatically increased with the upcoming popularity of P2P file-sharing networks. For network mon- itoring and anomaly detection reasons it is important to know which traffic is to account for P2P networks. This master’s thesis gives an overview about problems and methods concern- ing identification, population tracking and traffic monitoring of P2P clients. The examined P2P networks are FastTrack, eDonkey, Overnet, Kademlia, Gnutella and BitTorrent, which represent the six most popular P2P networks world- wide. While other research on that topic mainly tried to identify P2P traffic directly, our approach first identifies P2P hosts in the medium sized back-bone SWITCH network and then uses them to track other peers. Various measure- ments made with a proof of concept implementation called “PeerTracker” are presented and verified. The results show that the P2P bandwidth consumption in the SWITCH network is considerable and that a substantial amount of the total P2P population can be tracked for some P2P networks. iv Table of Contents 1 Introduction 1 1.1 Overview ............................... 2 1.1.1 DDoSVax Project ...................... 3 1.1.2 Reasons for P2P Identification ............... 3 1.1.3 P2P Identification and DDoS ................ 3 1.2 Goals ................................. 4 1.3 Internet Communication Patterns .................. 5 1.3.1 Client-Server Model ..................... 5 1.3.2 P2P Model .......................... 6 1.4 Ways of File-sharing ......................... 7 1.5 P2P Networks and Clients ...................... 7 1.5.1 Generations .......................... 7 1.5.2 General P2P Usage ...................... 8 1.5.3 Popularity ........................... 9 1.5.4 P2P Prosecution ....................... 10 1.6 Network environment ........................ 11 1.6.1 Flow-level Data ........................ 11 1.6.2 SWITCH NetFlows ..................... 12 2 P2P Host Identification 15 2.1 P2P Systems Considered ....................... 16 2.2 Traffic Observation Possibilities ................... 20 2.2.1 Packet-level Observations .................. 20 2.2.2 Flow-level Observations ................... 21 2.3 Host Identification Methods ..................... 22 2.3.1 Default Ports ......................... 24 2.3.2 Generic Approaches ..................... 27 2.3.3 Host Pool ........................... 32 2.3.4 Non-peer Identification ................... 33 2.4 Identification for Concrete P2P Systems .............. 33 2.4.1 eDonkey Identification .................... 33 2.4.2 Kademlia Identification ................... 34 2.4.3 Overnet Identification .................... 35 2.4.4 Gnutella Identification .................... 35 2.4.5 FastTrack Identification ................... 36 2.4.6 BitTorrent Identification ................... 37 2.4.7 Undetectable and Safe P2P Usage ............. 39 2.5 Population Tracking ......................... 39 v vi TABLE OF CONTENTS 2.5.1 Conditions .......................... 40 2.6 Example: Overnet Tracking ..................... 41 2.7 Traffic Identification ......................... 42 2.8 Limitations .............................. 43 3 Implementation 45 3.1 Offline Scripts ............................. 45 3.2 Online Plug-in ............................ 45 3.2.1 Identification Algorithm Overview ............. 45 3.2.2 Suspicious Port Range .................... 49 3.2.3 Tracking of External P2P Hosts .............. 50 3.2.4 Client Shutdown Detection ................. 51 3.2.5 Listening Port Detection ................... 51 3.3 Containers ............................... 53 3.3.1 Hashed Table ......................... 53 3.3.2 Hashed Queue ........................ 55 3.4 Resource usage ............................ 56 3.4.1 Space Complexity ...................... 56 3.4.2 Time Complexity ....................... 56 4 Findings 59 4.1 Peer Verification Approaches .................... 59 4.1.1 eDonkey Server Tracking .................. 60 4.1.2 Polling Peers ......................... 61 4.1.3 BitTorrent probe clients ................... 64 4.2 Interpretation of Verification Results ................ 64 4.3 P2P Usage in SWITCH Network .................. 65 4.3.1 Measurement Setup ..................... 65 4.3.2 Peers in SWITCH network ................. 66 4.3.3 Peer Characteristics ..................... 67 4.3.4 Bandwidth Consumption .................. 71 4.4 Population Tracking Results ..................... 74 4.5 Related Work ............................. 77 5 Conclusion 81 6 Outlook 85 6.1 Future Work and Improvements ................... 85 6.1.1 Automated Verification ................... 85 6.1.2 More General Identification ................. 85 6.1.3 Auxiliary Network Processors ................ 85 6.1.4 More P2P Networks ..................... 86 6.1.5 Multi Peer identification ................... 86 6.1.6 TCP flags in NetFlow .................... 86 6.1.7 Continuous Examination .................. 86 6.1.8 State Reload ......................... 86 6.2 Unsolved Problems .......................... 86 References 89 A Full Task Description 93 TABLE OF CONTENTS vii B P2P Networks Table 99 C NetFlow Format 101 D SWITCH Network 105 E P2P Port Usage in the Internet 2 109 F Identification Code 113 G PeerTracker Usage Instructions 119 H Used Software 127 viii TABLE OF CONTENTS List of Figures 1-1 Client-Server model ......................... 5 1-2 P2P model .............................. 6 1-3 Flow Timings ............................. 12 1-4 Switch network topology ....................... 13 1-5 NetFlows emitted in SWITCH network .............. 14 2-1 P2P network and client relations .................. 17 2-2 TCP and UDP connections in P2P networks ........... 25 3-1 Internal SWITCH candidate peers ................. 48 3-2 Internal SWITCH non-peers ..................... 48 3-3 Influence of suspicious port range on identification ........ 50 3-4 Hashed queue ............................. 55 4-1 One day eDonkey comparison server method vs port method . 60 4-2 Active peers within SWITCH network ............... 66 4-3 P2P clients vs web clients comparison ............... 68 4-4 Comparison Web vs P2P vs Mail .................. 72 4-5 Total TCP bandwidth consumption ................ 72 4-6 P2P traffic in comparison to other protocols ............ 73 4-7 P2P traffic of considered networks ................. 73 4-8 Tracked Overnet peers with different timeouts ........... 76 4-9 Tracked extern peers with 6 hours timeout ............. 76 4-10 FastTrack tracked vs internal peers ................. 77 D-1 Switch network ............................106 D-2 Switch traffic volume per month since 1998, from [1] ....... 106 D-3 Active internal hosts in the SWITCH network during 8 days . 107 D-4 Active extern hosts contacting hosts within the SWITCH network 107 E-1 Internet 2 traffic graph ........................110 E-2 Internet S2 P2P traffic graph ....................111 ix x LIST OF FIGURES List of Tables 2-1 Official port ranges .......................... 25 2-2 Hosts with suspicious flows ..................... 27 2-3 Top level domain names of 565 Kazaa peers (TCP) ........ 30 2-4 Top level domain names of 3’975 Kazaa peers (UDP) ....... 31 2-5 Hosts which had at least k flows with P2P default ports ..... 31 2-6 eDonkey and Kademlia top 5 TCP listening ports ......... 34 2-7 Overnet top 5 UDP listening