
A Longitudinal Study of P2P Traffic Classification

Alok Madhukar and Carey Williamson
Department of Computer Science, University of Calgary
Email: {madhukar,carey}@cpsc.ucalgary.ca

Abstract

This paper focuses on network traffic measurement of Peer-to-Peer (P2P) applications on the Internet. P2P applications supposedly constitute a substantial proportion of today's Internet traffic. However, current P2P applications use several obfuscation techniques, including dynamic port numbers, port hopping, HTTP masquerading, chunked file transfers, and encrypted payloads. As P2P applications continue to evolve, robust and effective methods are needed for P2P traffic identification. The paper compares three methods to classify P2P applications: port-based classification, application-layer signatures, and transport-layer analysis. The study uses empirical network traces collected from the University of Calgary Internet connection for the past 2 years. The results show that port-based analysis is ineffective, being unable to identify 30%-70% of today's Internet traffic. Application signatures are accurate, but may not be usable for legal or technical reasons. The transport-layer method seems promising, providing a robust means to assess aggregate P2P traffic. The latter method suggests that 30%-70% of the campus Internet traffic for the past year was P2P.

Keywords: Network Traffic Measurement, Peer-to-Peer, Traffic Analysis

1. Introduction

Recent measurement studies [1, 7, 10, 17, 18, 19] indicate that Peer-to-Peer (P2P) applications generate a substantial volume of Internet traffic. Understanding the Internet traffic profile is important for several reasons, including traffic engineering, network service pricing, and capacity planning. Internet traffic cannot be managed properly if it cannot be measured properly.

P2P traffic measurement is especially important for Internet Service Providers (ISPs), for several reasons. First, many P2P applications are bandwidth-intensive. Excessive network congestion could lead to dissatisfied customers and possible customer churn. Second, increasing the network capacity is expensive, and only effective on a short-term basis. P2P application traffic may soon expand to occupy the increased capacity as well, making network congestion inevitable. Third, some current Internet access technologies have asymmetric upstream and downstream bandwidths, to exploit existing access technologies while limiting operating costs for the ISPs. The underlying assumption is that Internet users download much more than they upload, as they do with the Web. However, in P2P applications, users may upload as much as they download. If a large proportion of the Internet traffic is P2P, then the underlying assumption of traffic asymmetry may be invalid.

ISPs can formulate traffic management policies based on knowledge of P2P traffic. ISPs might differentiate the services into premium and ordinary classes, applying differentiated charges based on traffic type instead of the flat pricing charges that are currently prevalent. Another possibility is for an ISP to provide a file proxy [7] to reduce Internet backbone traffic and operational costs. For traffic-specific network engineering, the ISP must be able to classify Internet traffic, which highlights the need for P2P traffic detection.

There is also an important legal issue, since some content found in file-sharing systems (e.g., music, films, eBooks, games) may violate copyright laws. Some recent trials have held ISPs responsible for those copyright infringements; this precedent strongly encourages ISPs to detect and pursue abusing users.

Identifying P2P traffic on today's Internet is challenging, since P2P applications are evolving rapidly. Many P2P applications use one or more techniques to conceal their presence on the network. These techniques include dynamic assignment of random port numbers, the use of multiple transport-layer connections, chunked file transfers, HTTP masquerading, and encrypted payloads.

This paper provides a comparative evaluation of the effectiveness of three different P2P traffic classification techniques: port-based analysis, application-layer signatures, and transport-layer heuristics. The port-based analysis method is a classical approach based on well-known port numbers assigned by IANA (Internet Assigned Numbers Authority) [9]. Since current P2P applications transmit data on randomly assigned ports, this method is no longer accurate.

The novelty in our work is the use of a 2-year dataset to provide a longitudinal assessment of the effectiveness of P2P traffic classification techniques. We use empirical network traces from the University of Calgary Internet connection. The network traces were collected using tcpdump [20] over a period of two years since September 2003.

Our results show that port-based analysis is ineffective for classifying P2P applications: 30-70% of the Internet traffic is classified as "unknown". While the application-layer signature approach is accurate for classifying P2P applications, many reasons (e.g., legal and technical) preclude its use in practice. A modified version of the transport-layer method [10] is also tested. The results show that the transport-layer method provides a promising way to measure aggregate P2P traffic.

The rest of the paper is organized as follows. Section 2 provides background on P2P applications, and a brief summary of related work. Section 3 describes the data collection method and infrastructure. Section 4 describes the port-based analysis. Section 5 discusses the application-layer method. Section 6 discusses the transport-layer method for P2P traffic classification. Section 7 concludes the paper.
2. Background and Related Work

2.1 Background

P2P applications have been evolving at a very rapid rate. Within a span of six years (since 1999), P2P applications have evolved from first generation to second generation to third generation. One reason for the rapid evolution of these applications has been the desire to avoid detection.

The first generation [15] of P2P systems consisted of centralized systems like Napster. A centralized server was used to index files, making it relatively easy to locate the server and block it. Also, P2P applications used well-known ports to transfer data, so that it was easy for network operators to identify the P2P traffic, and consequently block the corresponding ports to discourage P2P traffic.

The second generation [15] of P2P systems includes protocols like Gnutella [5]. Gnutella was a completely distributed system, where queries were flooded to neighbors. Peers also used dynamically assigned ports to transfer data, so that it was difficult to classify P2P traffic.

Third generation [15] P2P systems are quite sophisticated. These are hybrid systems that combine ideas from centralized systems and distributed systems. There is a notion of supernodes as in KaZaA or ultrapeers as in Gnutella. The supernodes/ultrapeers have comparatively more computing resources than other neighboring peers and are responsible for handling index files for a subset of peers. They often transmit data using randomly chosen ports. Sometimes, they disguise their traffic by using ports of other well-known applications. Also, a single large file can be downloaded simultaneously in smaller pieces from several other peers. Furthermore, a few protocols like FastTrack have started encrypting the application-layer data in the packets. These techniques make it harder to detect P2P traffic.

2.2 Related Work

Sen et al. [19] developed an approach for finding P2P traffic through application-layer signatures. They examine available documentation and packet-level traces to identify application-layer signatures, and then utilize these signatures to develop filters that can track P2P traffic on high-speed network links. Their study analyzes TCP packets in the download phase of file transfer. They decomposed P2P signatures into fixed pattern matches at fixed offsets within a TCP payload, and variable pattern matches with variable offsets within a TCP payload [19].

Their study used two packet traces from different network vantage points for the experiments. The first trace of Internet backbone traffic was collected on an access network for two days in November 2003. The second trace was collected on a 45 Mbps link connecting a Virtual Private Network (VPN) with 500 employees to the Internet. The trace spanned six days of November 2003. P2P traffic was supposedly blocked on this network.

The authors evaluated the accuracy and scalability of the application-layer signature technique. The accuracy was evaluated by determining measures for false positives and false negatives. False positives measure the amount of traffic that the classifier erroneously classified as P2P. False negatives measure the amount of P2P traffic that the classifier failed to identify as a P2P application. The results show that their technique has less than 5% false positives and false negatives, indicating that the technique is accurate most of the time. They also contend that the technique is scalable, based on the number of packets examined before the application was classified. In a subsequent study, Haffner et al. [8] proposed a technique to determine application signatures automatically using machine learning algorithms. The error rate is shown to be below 1%.
Karagiannis et al. [10] provide a novel approach to identify P2P flows at the transport layer (i.e., based on connection patterns), without relying on packet payloads. The study examines packet-level traces collected on three different days (May 5, 2003, as well as January 22 and February 25, 2004) on an OC-48 Tier-1 ISP link. The first trace contained the first 44 bytes of each TCP packet, containing 0 or 4 bytes of payload, while the remaining two traces contained 16 bytes of payload for each packet.

The transport-layer method relies primarily on two heuristics. The first heuristic identifies source-destination IP pairs that concurrently use both TCP and UDP. If such IP pairs exist and they do not use specific well-known ports, then these flows are considered P2P. The second heuristic considers the structural pattern of transport-layer connections between hosts. In particular, for P2P applications, the number of distinct ports connected to a host often matches the number of distinct IP hosts connected to it.

3. Datasets

3.1 Data Collection Infrastructure and Methodology

The primary datasets used in our study come from passive network traffic measurements of the University of Calgary campus network. Our data collection infrastructure was deployed at the main campus router with the help of staff from University of Calgary Information Technologies (UCIT), the administrators of the campus network.

Since September 2003, the campus has been connected to the Internet via a 100 Mbps full-duplex Ethernet link. The traffic on that link is currently forwarded (using port mirroring) to our monitor via a 1 Gbps half-duplex Ethernet link. See Figure 1. The monitoring machine is a dual-processor Dell (1.4 GHz Pentium III processors) with 2 GB RAM and 119 GB of disk.
We use tcpdump [20] for data collection on the monitoring machine. We record the headers of all TCP/IP packets with the SYN, FIN, or RST flags set. These headers are recorded to a file, with a new file created for each 1 hour of network traffic, thus creating 24 files for each day. The recorded files are then moved off the monitoring machine to a server in the ELISA (Experimental Laboratory for Internet Systems and Applications) laboratory in the Department of Computer Science at the University of Calgary. The files are then analyzed offline using custom analysis tools.

Figure 1: Data Collection Infrastructure.

Recording only TCP SYN, FIN, and RST packets is a well-known technique, and justified as follows. First, TCP has been the dominant transport-layer protocol on the Internet for over a decade, carrying over 95% of Internet traffic [21]. Second, this technique reduces trace storage requirements significantly, enabling long-term traffic studies, while still facilitating connection-level analysis [16].

3.2 Overview of Datasets

This section provides an overview of the data sets that were examined for this paper. Data was collected for 2 years, with 24 trace files per day, totaling over 17,500 files. Each trace file represents 1 hour of network activity. Only TCP SYN, FIN, and RST headers are recorded.

The aggregate network traffic volumes are presented in the next three graphs, for time scales ranging from 1 day to 1 year. Traffic byte counts are determined from TCP connection-level analysis by subtracting starting (SYN) sequence numbers from ending (FIN) sequence numbers. Byte volumes are then converted into average data rates based on the trace interval analyzed.
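To make the byte-count calculation above concrete, the following C fragment is a minimal sketch of the per-connection arithmetic. It is our own illustration rather than the actual analysis tool, and the function name is ours.

    #include <stdint.h>

    /*
     * Sketch: estimate the payload bytes carried in one direction of a
     * TCP connection from the sequence numbers of its SYN and FIN.
     * The SYN consumes one sequence number, so the data bytes are
     * (fin_seq - syn_seq - 1).  Unsigned 32-bit subtraction handles
     * wraparound of the sequence-number space.
     */
    static uint32_t tcp_bytes_transferred(uint32_t syn_seq, uint32_t fin_seq)
    {
        uint32_t span = fin_seq - syn_seq;   /* modulo 2^32 */
        return (span > 0) ? span - 1 : 0;    /* 0 if the connection carried no data */
    }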

Figure 2 shows network activity for a typical 24-hour period. Inbound traffic and outbound traffic are plotted separately. Each point represents the average data rate for a 5-minute interval. The horizontal axis shows the number of hours since midnight (0h) on Thursday, October 20, 2005. The inbound traffic (destined to the University of Calgary) shows the expected diurnal traffic patterns, consistent with the working day. There is little traffic prior to 8am. The traffic volume increases from 8am onwards, peaking in late morning and remaining steady for most of the afternoon. The traffic gradually decreases after 4pm. The inbound traffic varies significantly during the day, ranging from 4 Mbps to 76 Mbps, and is most pronounced from 9am to 6pm.

Figure 2: Network Activity for One Day (Oct 20-21, 2005).

The outbound traffic (destined to the Internet) shows different characteristics. This traffic is steadier, ranging from 12 Mbps to 38 Mbps. During the working hours, the inbound traffic exceeds the outbound traffic, implying that the campus is a net consumer of data during this period. During non-working hours, the outbound traffic exceeds the inbound traffic, so that the campus is a net producer of data during this period. Overall, during a day, the U of C campus is a net consumer of data.

Figure 3 shows a similar plot of the aggregate traffic volume, though for a longer time period (just over 1 week). Each point represents an hourly average. Figure 3 shows that traffic volume is more pronounced on weekdays than on weekends (as expected). Inbound traffic dominates outbound traffic on weekdays, while outbound traffic dominates inbound traffic on weekends. The campus is a net consumer of data on weekdays and a net producer of data on weekends.

Figure 3: Network Activity for One Week (Oct 19-27, 2005).

Figure 4 represents the network activity for a one-year period (September 2004 to October 2005). Each point represents the daily average data transfer rate. At this time scale, the structure of the university academic year is evident. Network activity was most prominent during the months of the Fall 2004 and Winter 2005 semesters. The average data rate ranged from 18 Mbps to 54 Mbps during these months. For most days, the inbound traffic exceeds the outbound traffic, meaning that the campus is a net consumer of data.

Figure 4: Network Activity for One Year (Sept 2004-Oct 2005).

Traffic volume decreased dramatically during the Christmas holidays (late December 2004). The network activity is also lower during the spring and summer months of 2005, before ramping up again in Fall 2005.

4. Port Analysis

This section presents results for the traditional approach to Internet traffic analysis, namely the use of well-known ports for traffic classification.

4.1 Port Analysis Tool

A simple tool for port analysis was developed in the C programming language. The program iterates through all the packets in the trace and creates a flow table. For each packet, the total number of packets is incremented by 1, and the total number of bytes is incremented by the IP packet length. Within the IP header, the protocol number of the packet is examined. If the protocol number is 6 (TCP), then the TCP header of the packet is examined. The TCP port numbers are extracted, and the packet is classified accordingly. Well-known TCP ports are hard-coded in the analysis tool based on the ports identified on the IANA Web site [9].
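The port lookup at the heart of this tool can be sketched as follows. This is an illustrative fragment only, not the tool itself: the table shows a small subset of well-known ports rather than the full IANA list, and the function name is ours.

    #include <stdint.h>
    #include <stddef.h>

    /* A small illustrative subset of well-known TCP ports. */
    struct port_entry { uint16_t port; const char *app; };

    static const struct port_entry well_known[] = {
        {   22, "SSH"          }, {   25, "SMTP"   },
        {   53, "DNS"          }, {   80, "HTTP"   },
        {  139, "Netbios"      }, {  443, "HTTPS"  },
        {  445, "Microsoft-DS" }, { 1433, "MS SQL" },
        { 6346, "Gnutella"     },
    };

    /* Classify a TCP packet by its port numbers; any packet that does not
     * match a well-known port is counted as "unknown" traffic. */
    static const char *classify_by_port(uint16_t src_port, uint16_t dst_port)
    {
        for (size_t i = 0; i < sizeof(well_known) / sizeof(well_known[0]); i++) {
            if (src_port == well_known[i].port || dst_port == well_known[i].port)
                return well_known[i].app;
        }
        return "unknown";
    }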
4.2 Results from Port Analysis

Figure 5 shows the results from port classification for the two-year period from September 2003 to August 2005. The horizontal axis of the plot represents time, while the vertical axis represents the percentage of TCP flows for certain popular applications, such as Web, electronic mail, and Gnutella. Each point represents the daily average, computed from 24 trace files.

Figure 5: Port Analysis Results for a Two-Year Period (Sept 2003-Aug 2005).

Figure 5 shows that SMTP was the most dominant traffic port for the first few months of the trace, accounting for 30-70% of the TCP flows during that period. This traffic represents a local anomaly at the University of Calgary. From the IT technical support staff, we learned that this traffic anomaly resulted from misconfiguration of a campus-level mail server. The problem was rectified in January 2004.

Another prominent network application in Figure 5 is the Web. The line denoted HTTP(c) represents the inbound HTTP traffic from the external Internet to University of Calgary Web servers, while HTTP(s) represents traffic from the University of Calgary to external Web servers. The graph shows that, on average, HTTP traffic ranged between 10% and 30% of the total traffic volume. It is the dominant traffic type in early 2004, but less prominent thereafter.

The bold black line in Figure 5 shows the "unknown" traffic: traffic that could not be classified into any known application based on port analysis. The unknown traffic was relatively low from Sept 2003-Jan 2004 (when the SMTP anomaly was present), but increased sharply in February 2004. It then remained between 10%-30% until April 2004. Unknown traffic became the most dominant traffic during the period from September 2004 to July 2005. During the academic year, 40-65% of the flows are unknown traffic.

Figure 5 also shows significant traffic on the Microsoft SQL Server port (TCP port 1433). There are two possible reasons for this. First, the traffic on this port is known to represent worm traffic [12]. Second, the IT division at the University of Calgary is building a data warehouse based on Microsoft SQL Server, so it might be legitimate network traffic.

Figure 6: Port Analysis Results for One Day (Wed March 16, 2005).

Figure 6 provides a finer-grain look at the port analysis results for a typical day (Wednesday, March 16, 2005). Each point represents the percentage of TCP flows in each category, from a one-hour trace. There are 24 traces for a day. The figure shows that unknown traffic is more pronounced at night than during the day. The primary reason is the strong presence of HTTP traffic during the working day from 10am to 6pm.

Figure 7 shows the port analysis results for one week of data from March 16 to 22, 2005. As in Figure 6, each point represents the percentage of flows for an hour of trace. Similar to Figure 6, Figure 7 shows that unknown traffic is more pronounced at night, whereas HTTP traffic is more pronounced during weekdays. On weekends, the proportion of HTTP traffic decreases, and the proportion of unknown traffic increases.

Figure 7: Port Analysis Results for One Week (March 16-22, 2005).

The foregoing graphs show that the application mix seen on the network varies with the time of day. Figure 8 and Figure 9 provide a summary of this behavior.

Figure 8 shows a noon-hour view of the total traffic, classified into ten categories, for the period March 16-31, 2005. It can be seen that unknown traffic and HTTP traffic are the most dominant, followed by Gnutella.

Figure 8: Noon View of Traffic Profile from Port Analysis (Mar 16-31, 2005).

Figure 9: Midnight View of Traffic Profile from Port Analysis (Mar 16-31, 2005).

Figure 9 shows a view of the midnight-hour traffic, classified into the same ten categories for the same period (March 16-31, 2005). Unknown traffic is the most dominant, followed by HTTP and Gnutella. In summary, the results from port analysis show that the unknown traffic has increased from 10%-30% in Fall 2003 to 30%-70% in Spring 2005. The port method is obviously ineffective for classifying current Internet traffic. Other methods are needed for Internet traffic classification.

5. Application Signatures

This section discusses the application-layer signature method for P2P traffic classification. This approach can be used to establish "ground truth" for other traffic classification techniques. Separate traffic traces with full packet payloads are used for the testing and validation of the signature method.

5.1 Design Issues for Signature Method

The application-layer signature method requires access to the user data payload in the transmitted IP packets. Each P2P application has a specific signature associated with the protocol, in terms of keywords, commands, options, or other identifiable content in the packets exchanged. The signatures can be determined based on observation of well-known P2P applications. The payload analysis scheme searches for specified strings in the payload of the packet; if found, the packet is classified accordingly. In our work, the signature method was implemented for three widely used P2P file sharing applications: Gnutella2, KaZaA, and BitTorrent.

The signature method was proposed by Sen et al. [19]. In this method, we examine available documentation and packet-level traces to identify appropriate application-level signatures, and then use these signatures to classify packets in the trace file.

There are two important design issues related to the signature technique. First, P2P traffic can flow over UDP and TCP, so one must decide whether TCP packets or UDP packets (or both) are subjected to payload analysis. Since most current P2P protocols transmit their data via TCP, we focus only on signatures found within TCP traffic. Second, P2P application-layer signatures can be applied to individual TCP packets (segments) or to fully reassembled TCP data streams. Analysis at the TCP data stream level is more robust, in that it can detect signatures that straddle packet boundaries. Furthermore, signature matching need only be done once per connection rather than once per packet, reducing analysis overhead. Since we are performing offline analyses, TCP segments are reassembled into data streams before being analyzed. That is, we apply the signatures to TCP data streams, instead of individual TCP segments. Sen et al. [19] looked for signatures in individual TCP segments.
5.2 Signatures for Popular P2P Applications

This section discusses application signatures for three popular P2P applications: Gnutella2, KaZaA, and BitTorrent.

5.2.1 Gnutella2

The Gnutella2 protocol uses TCP to establish a highly interconnected hub network serving dense clusters of leaf nodes. TCP connections are established between Gnutella2 nodes when they form a link. Upon the establishment of a TCP connection between two Gnutella2 nodes, a handshaking phase must be completed to negotiate the link and exchange other necessary information.

The Gnutella handshake process consists of three header blocks. The node that initiated the connection sends an initial header block, as shown in Figure 10.

GNUTELLA CONNECT/0.6
Listen-IP: 1.2.3.4:6346
Remote-IP: 6.7.8.9
User-Agent: Shareaza 1.8.2.0
Accept: application/x-gnutella2
X-Ultrapeer: False

Figure 10: Initiator Header Block in Gnutella2.

The receiver then responds with its own header block, as shown in Figure 11.

GNUTELLA/0.6 200 OK
Listen-IP: 6.7.8.9:6346
Remote-IP: 1.2.3.4
User-Agent: Shareaza 1.8.2.0
Content-Type: application/x-gnutella2
Accept: application/x-gnutella2
X-Ultrapeer: True
X-Ultrapeer-Needed: False

Figure 11: Receiver Header Block in Gnutella2.

Finally, the initiator accepts the receiver's header block, and provides any final information, as shown in Figure 12.

GNUTELLA/0.6 200 OK
Content-Type: application/x-gnutella2
X-Ultrapeer: False

Figure 12: Final Handshake Information by Initiator in Gnutella2.

Two important header fields sent on all connections are "Remote-IP" and "Listen-IP". The Remote-IP header contains the IP address from which the remote host is connecting. The Listen-IP header contains the IP address and port number on which the local host is listening for inbound TCP connections. It should be listening for UDP datagrams on the same port. The format of this header is "IP:PORT", e.g., "1.2.3.4:6346".

The User-Agent header is used to identify the client operating on the sending host. It is sent on the first transmission, i.e., the first and second header blocks in the three-block exchange.

Based on our observations, we use the simple signature string "GNUTELLA" to identify the Gnutella2 application. The Gnutella2 protocol was not studied by Sen et al. [19].

5.2.2 KaZaA

Since KaZaA is proprietary and uses encryption, little is known about KaZaA's protocol, architecture, and signalling traffic. KaZaA's signalling traffic is always encrypted. In older versions of KaZaA, the file transfer traffic is not encrypted, but newer versions of KaZaA now use encryption for the file transfer traffic as well. Files are typically sent using HTTP-like messages. A sample request is shown in Figure 13, and a sample response in Figure 14.

GIVE 287496918
GET HTTP/1.1
Host: *
UserAgent: KazaaClient Jul 27 2004 21:14:16
X-Kazaa-Username: *
X-Kazaa-Network: KaZaA
X-Kazaa-IP: *
X-Kazaa-SupernodeIP: *
Connection: close
Kazaa-XferUid: *

Figure 13: Sample Request in KaZaA.

HTTP/1.1 206 Partial Content
Content-Range: bytes 2255918-3908061/3908062
Content-Length: 1652126
Accept-Ranges: bytes
Date: Tue, 16 Aug 2005 14:57:45 GMT
Server: KazaaClient Nov 3 2002 20:29:03
Connection: close
Last-Modified: Fri, 10 Dec 2004 23:28:10 GMT
X-Kazaa-Username: *
X-Kazaa-Network: KaZaA

Figure 14: Sample Response in KaZaA.

Based on our observations, the simple signature string "X-Kazaa" was used for identifying KaZaA traffic.

5.2.3 BitTorrent

BitTorrent is a popular file-downloading protocol. The BitTorrent handshake message contains the string ".BitTorrent protocol" at the beginning of the message. A sample BitTorrent header is shown in Figure 15.

.BitTorrent protocol......V-S......s.e.H.I.C.N....X.f.BitTorrent protocolex......V-S......exbc..LORD....

Figure 15: Sample Header in BitTorrent.

Based on our observations, the simple signature string ".BitTorrent" was used for identifying BitTorrent traffic.
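To illustrate how these three signature strings are applied to reassembled TCP data streams, the following C sketch performs a simple length-bounded substring search. It is our own simplified illustration, not the analysis tool used in this study; it assumes the stream has already been reassembled into a byte buffer, and in practice only the first portion of each stream needs to be searched, since the signatures appear in the handshake or request headers.

    #include <string.h>
    #include <stddef.h>

    /* Return nonzero if 'needle' occurs within the first 'len' bytes of 'hay'.
     * A bounded search is used because payloads are binary, not NUL-terminated. */
    static int contains(const unsigned char *hay, size_t len, const char *needle)
    {
        size_t nlen = strlen(needle);
        if (nlen == 0 || nlen > len)
            return 0;
        for (size_t i = 0; i + nlen <= len; i++) {
            if (memcmp(hay + i, needle, nlen) == 0)
                return 1;
        }
        return 0;
    }

    /* Classify a reassembled TCP data stream using the signature strings
     * described in Section 5.2. */
    static const char *classify_by_signature(const unsigned char *stream, size_t len)
    {
        if (contains(stream, len, "GNUTELLA"))    return "Gnutella2";
        if (contains(stream, len, "X-Kazaa"))     return "KaZaA";
        if (contains(stream, len, ".BitTorrent")) return "BitTorrent";
        return "unclassified";
    }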

5.3 Validation of Signature Method

The purpose of our application-layer signature analysis is to establish "ground truth" for P2P traffic classification. The hope is that the signature method will produce traffic classification results that are consistent with (or very close to) the results determined from manual trace analysis.

Testing and validation of the signature method was done using a separate traffic trace with complete packet payloads. The signature method was tested on the validation trace only, since the 2-year trace from the University of Calgary network did not contain packet payloads.

The validation trace was collected using a local machine in the Department of Computer Science at the University of Calgary. From this workstation, a variety of Internet applications were launched. In particular, the trace traffic was generated by running three P2P applications (Gnutella2, KaZaA, and BitTorrent), while also accessing several Web sites and running an email client. The packets in the trace file were captured using Ethereal [3].

Table 1 provides summary information about the validation trace. The trace file was collected on Wednesday, August 17, 2005. The validation trace contained 25,585 packets, generated from 450 TCP flows and 513 UDP flows. The top 100 flows based on byte traffic volume were examined manually as well as using the signature method. The top 100 flows accounted for 17,462 packets (68.25% of total packets) and 12,800,611 bytes (90.52% of total bytes). Among these 100 flows, 95 were TCP flows and 5 were UDP flows.

Table 1: Description of Validation Trace

Trace Period:     900 seconds
Total Packets:    25,585
Total Bytes:      14,141,494
Avg. packets/sec: 28.4
Avg. packet size: 553 bytes
Avg. Bytes/sec:   15,715
Avg. Mbit/sec:    0.126

The results from manual examination of the validation trace are shown in Table 2. Manual analysis identified 43 TCP flows and 2 UDP flows as P2P (denoted with '*' in the table).

Table 2: Results from Manual Analysis of Validation Trace

Application    Num TCP Flows
HTTP           47
Gnutella       36 * (plus 2 UDP) *
KaZaA          7 *
HTTPS          3
DNS            2
SSH            1
Microsoft-DS   1
Netbios        1

The signature method identified as P2P 42 of the 43 P2P TCP flows. The signature method missed classifying one Gnutella TCP flow, since no signature was found. Manual examination showed a pair of IP hosts communicating on port 6346 via TCP and UDP; it was obvious that the flow was Gnutella. The signature method did not classify any non-P2P flow as P2P. Thus, the signature method was quite accurate.

Although the signature method is quite accurate, there are obvious limitations to its usage in practice. First, privacy regulations may make it illegal to access the user payload for signature analysis. Second, signature analysis is only possible when you know in advance what you are looking for. Since new P2P protocols arise frequently, the signature analysis tool needs to be updated regularly. Third, some protocols like KaZaA and BitTorrent now use encryption, which renders payload analysis useless. These issues limit the applicability of signature analysis, and motivate the need for other methods of P2P traffic classification.
6. Transport-Layer Method

This section describes the transport-layer approach to P2P traffic classification, developed by Karagiannis et al. [10].

6.1 Overview of Transport-Layer Method

Karagiannis et al. [10] proposed a novel method for identifying P2P traffic, based on transport-level connection patterns. The method does not require payload analysis. Rather, the analysis is flow-based, focusing on the connection-level patterns of P2P applications, which are a distinctive and persistent feature. While P2P applications might use random ports, or encrypt the application-layer data, the connection-level patterns at the transport layer remain the same.

The transport-layer method relies on two heuristics for P2P traffic identification. These heuristics are effective in measuring aggregate P2P traffic, and can even detect new emerging P2P applications [10].

The first heuristic involves the simultaneous use of TCP and UDP by a pair of communicating hosts. If a pair of hosts is using TCP and UDP simultaneously, then most likely the traffic is P2P. UDP is prevalent in P2P systems because it provides a low-overhead method of sending queries or status messages to many peers. There are some applications like online gaming, DNS, and NFS that exhibit similar behavior, but this known traffic can be explicitly removed from consideration using a checklist for well-known ports.

The second heuristic is based on connection patterns for {IP, port} pairs. The reasoning is that if each P2P host chooses its dynamic port number at random, then it is highly unlikely for multiple P2P hosts to use the same port number. Expressed another way, for a P2P application on a given host, the number of distinct ports communicating with it will likely match the number of distinct IP addresses communicating with it.

During the analysis, each flow is marked either as P2P or non-P2P based on these heuristics. The classification method suggested in [10] is summarized as follows:

1. Look for source-destination IP pairs that concurrently use both TCP and UDP. If such IP pairs exist and they do not use any well-known standard ports for non-P2P applications, then consider them P2P.
2. Examine all source {srcIP, srcport} and destination {dstIP, dstport} pairs. Look for pairs for which the number of distinct connected ports matches the number of distinct connected IPs. All pairs for which this equality holds are considered P2P. If the difference between connected IPs and ports for a certain pair is large (say larger than 10), regard this pair as non-P2P.
3. Remove traffic from known applications with similar behaviour. Use packet size information to help remove false positives and false negatives.
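As an illustration of the first heuristic, the following C sketch records the source-destination IP pairs seen over UDP and flags TCP flows between the same pair as candidate P2P traffic. It is a simplified sketch under our own assumptions: a small fixed-size table instead of a hash table, and a placeholder function is_known_non_p2p_port() standing in for the well-known-port checklist.

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_PAIRS 4096   /* illustrative bound; a real tool would use a hash table */

    /* Unordered source-destination IP pair observed using UDP. */
    struct ip_pair { uint32_t a, b; };

    static struct ip_pair udp_pairs[MAX_PAIRS];
    static size_t num_udp_pairs;

    static int same_pair(struct ip_pair p, uint32_t x, uint32_t y)
    {
        return (p.a == x && p.b == y) || (p.a == y && p.b == x);
    }

    /* Record the IP pair of every UDP flow seen in the time window. */
    static void note_udp_pair(uint32_t src_ip, uint32_t dst_ip)
    {
        for (size_t i = 0; i < num_udp_pairs; i++)
            if (same_pair(udp_pairs[i], src_ip, dst_ip))
                return;
        if (num_udp_pairs < MAX_PAIRS) {
            udp_pairs[num_udp_pairs].a = src_ip;
            udp_pairs[num_udp_pairs].b = dst_ip;
            num_udp_pairs++;
        }
    }

    /* Placeholder for the well-known-port checklist described in the text. */
    extern int is_known_non_p2p_port(uint16_t port);

    /* Heuristic 1: a TCP flow between hosts that also exchange UDP is a
     * P2P candidate, unless it uses a well-known non-P2P port. */
    static int tcp_flow_is_p2p_candidate(uint32_t src_ip, uint32_t dst_ip,
                                         uint16_t src_port, uint16_t dst_port)
    {
        if (is_known_non_p2p_port(src_port) || is_known_non_p2p_port(dst_port))
            return 0;
        for (size_t i = 0; i < num_udp_pairs; i++)
            if (same_pair(udp_pairs[i], src_ip, dst_ip))
                return 1;
        return 0;
    }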
6.2 Our Transport-Layer Method

The dataset used for our study has several limitations when codifying the techniques proposed in [10]. First, the trace does not contain UDP packets, so the effectiveness of the TCP/UDP heuristic cannot be evaluated. Second, the trace only contains TCP SYN, FIN, and RST packets, so data packet size information cannot be utilized.

Fortunately, the TCP SYN, FIN, and RST headers are sufficient to provide information regarding connection-level patterns, so that the IP-port pair heuristic can be evaluated. The design of our transport-layer method differs from the original method suggested in [10], for the reasons stated previously. The description of our approach follows.

For each day of traffic to be analyzed, do the following:

1. Concatenate in order the 24 trace files for the day.
2. Build a CompleteFlowTable (all flows) for all packets observed in a 15-minute sliding time window.
3. Scan the CompleteFlowTable and remove all flows that are using standard ports of known non-P2P applications. Set the P2P flag for flows using standard ports for known P2P applications. Call the resulting table the OtherFlowTable.
4. Create a table for all {IP, port} pairs in OtherFlowTable.
5. Scan the trace file and count the number of distinct destination IP hosts and the number of distinct destination ports for each {IP, port} pair. If the number of distinct ports and IP hosts match, then flag the {IP, port} pair as P2P. Call such pairs P2Ppairs.
6. Flag all the flows in OtherFlowTable corresponding to P2Ppairs as P2P. Create a new table containing all of the flows flagged as P2P in OtherFlowTable. Call this table the P2PFlowTable.
7. Calculate the percentage of P2P flows for the given time window from the number of flows in P2PFlowTable and CompleteFlowTable.
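The following C sketch illustrates the core of steps 4-6: for one {IP, port} pair, it counts the distinct peer IP addresses and distinct peer ports observed in the flow table, and flags the pair as P2P when the two counts match. It is an illustration under simplifying assumptions (fixed-size arrays, no hashing, and the "difference larger than 10" rule from Section 6.1 omitted), not our production analysis code.

    #include <stddef.h>
    #include <stdint.h>

    /* One flow endpoint (peer IP and peer port) seen connecting to a
     * given {IP, port} pair in the OtherFlowTable. */
    struct peer { uint32_t ip; uint16_t port; };

    #define MAX_PEERS 1024   /* callers keep n <= MAX_PEERS in this sketch */

    static size_t count_distinct_ips(const struct peer *p, size_t n)
    {
        uint32_t seen[MAX_PEERS];
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            size_t j;
            for (j = 0; j < k; j++)
                if (seen[j] == p[i].ip) break;
            if (j == k) seen[k++] = p[i].ip;
        }
        return k;
    }

    static size_t count_distinct_ports(const struct peer *p, size_t n)
    {
        uint16_t seen[MAX_PEERS];
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            size_t j;
            for (j = 0; j < k; j++)
                if (seen[j] == p[i].port) break;
            if (j == k) seen[k++] = p[i].port;
        }
        return k;
    }

    /* Steps 4-6: flag an {IP, port} pair as P2P when its number of distinct
     * peer ports equals its number of distinct peer IP addresses. */
    static int ip_port_pair_is_p2p(const struct peer *peers, size_t n)
    {
        return n > 0 &&
               count_distinct_ips(peers, n) == count_distinct_ports(peers, n);
    }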
6.3 Validation of Transport-Layer Method

The transport-layer method was tested using the same validation trace described in Section 5.3. The purpose is to show that the transport-layer method is as effective as (or almost as effective as) the application-layer signature method. Recall that manual analysis of the validation trace identified 43 TCP flows and 2 UDP flows (among the top 100 flows) as P2P. The transport-layer method processed all flows in the validation trace, though the results presented here are just for the top 100 flows, for consistency.

On the validation trace, our transport-layer method identified all 45 P2P flows. There were no false positives and no false negatives, indicating that our method is reliable.

For completeness, we also evaluated separately the effectiveness of the IP-port pair heuristic and the TCP/UDP heuristic, using manual analysis of the top 100 flows in the validation trace. On its own, the IP-port pair heuristic (without knowledge of ports) identified 49 TCP flows and 2 UDP flows as P2P. It identified as P2P 6 TCP flows that were not P2P. This represents a false positive percentage of 13%. The misidentified flows were 3 HTTP, 1 SSH, 1 Netbios, and 1 Microsoft-DS (Microsoft Directory Services). On its own, the TCP/UDP heuristic identified only 6 flows as P2P. It missed 39 flows, for a false negative percentage of 87%. Manual analysis showed that the IP host pairs participating in the missed flows did not use UDP for any data transfers.

These results show the importance of considering multiple aspects of the traffic characteristics, including known ports. The results provide confidence in the effectiveness of our modified transport-layer method for P2P traffic classification.

6.4 Results for Transport-Layer Method

Figure 16 shows the results from our transport-layer analysis of the University of Calgary dataset. Each point represents the daily average percentage of P2P flows in the aggregate traffic.

Figure 16 shows that the proportion of P2P traffic flows ranged from 10-30% between September 2003 and January 2004. Between July 2004 and July 2005, the aggregate P2P traffic has ranged from 30-70%. On average, this traffic accounts for 38% of the byte traffic volume seen on the network.

The results in Figure 16 show a significant increase in P2P traffic from 2003-2004 to 2004-2005. This trend is structurally similar to that observed for the "unknown" traffic from port classification in Figure 5. That is, the proportion of P2P flows identified by transport-layer analysis seems to increase in a manner similar to that for the proportion of unknown traffic in the port-based analysis.

Figure 16: P2P Flows for Sept 2003-Jul 2005 (Daily Average).

To explore this trend further, and quantify the relationship, we use regression analysis. Figure 17 uses a scatter plot to show the apparent linear relationship between the percentage of unknown flows from port-based analysis (x axis) and the percentage of P2P flows from transport-layer analysis (y axis). There is one data point for each day of trace data analyzed.

Figure 17: Comparison of P2P Traffic and Unknown Traffic.

The relationship observed in Figure 17 is strongly linear, as indicated by the least-squares regression line in the graph. The R-squared value for the goodness of fit is 0.8536. We interpret this result as strong statistical evidence that much of the unknown traffic in our trace is P2P traffic.

Summarizing our results, a visual comparison of Figure 5 and Figure 16 provides strong circumstantial evidence that the unknown traffic is P2P traffic. However, no definitive proof is possible, since we do not have packet payloads in our campus-level trace. Our results, along with the validation tests, suggest that the transport-layer method can reliably estimate the aggregate P2P traffic. The validation results in Section 6.3 show that the transport-layer method (applied to packet headers) produces results consistent with the signature method (applied to payloads). Furthermore, the validation results in Section 5.3 show that the signature method produces results consistent with "ground truth" from manual analysis. This reasoning, and the statistical data in Figure 17, suggest that much of the "unknown" traffic in our datasets is P2P traffic.
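For completeness, the regression calculation is straightforward. The following C sketch (our own illustration; the names are ours) computes the least-squares slope, intercept, and R-squared value from paired daily observations of unknown-flow percentage (x) and P2P-flow percentage (y).

    #include <stddef.h>

    /* Least-squares fit y = slope*x + intercept over n paired samples, and
     * the R-squared goodness-of-fit value (the square of Pearson's r).
     * Assumes n >= 2 and that x and y are not constant. */
    static void linear_fit(const double *x, const double *y, size_t n,
                           double *slope, double *intercept, double *r_squared)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0, syy = 0;
        for (size_t i = 0; i < n; i++) {
            sx  += x[i];
            sy  += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
            syy += y[i] * y[i];
        }
        double cov  = n * sxy - sx * sy;   /* scaled covariance of x and y */
        double varx = n * sxx - sx * sx;   /* scaled variance of x */
        double vary = n * syy - sy * sy;   /* scaled variance of y */

        *slope     = cov / varx;
        *intercept = (sy - *slope * sx) / n;
        *r_squared = (cov * cov) / (varx * vary);
    }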

6.5 Limitations

Although the transport-layer method looks promising, there are several limitations and caveats to bear in mind.

One limitation is that port masquerading cannot be detected. The transport-layer method uses a checklist of standard ports for filtering known traffic. If P2P traffic occupies a port that is assigned to a different application (e.g., SMTP on port 25), then those flows cannot be classified as P2P.

A second limitation is that the IP-port heuristic is ineffective in the trivial case of one IP host communicating with another IP host on a single port. Many network applications, P2P and non-P2P, exhibit this connection-level pattern. Other heuristics may be needed to remove false negatives.

Finally, P2P applications continue to evolve. While the transport-layer method proposed in [10] has been effective in the recent past, and might be effective currently, there is no guarantee of its effectiveness on the next generation of P2P applications. Other traffic classification techniques may be required. For example, early work by Erman et al. [4] on cluster-based analysis looks very promising.

7. Summary and Conclusions

This paper studies traffic classification of Peer-to-Peer (P2P) applications. Accurate knowledge of P2P traffic is desirable for several reasons, including traffic engineering and network capacity planning. P2P applications on the Internet have evolved rapidly, making identification of P2P traffic challenging.

The paper compared three methods to classify P2P applications: port-based analysis, application-layer signatures, and transport-layer heuristics. The study used empirical network traces from the University of Calgary Internet connection for a two-year period. To the best of our knowledge, this is the first longitudinal study of the effectiveness of P2P traffic classification techniques.

Our results show that classic port-based analysis is ineffective, and has been so for quite some time. The proportion of "unknown" traffic increased from 10-30% in 2003 to 30-70% in 2004-2005. This result provides motivation for other methods to classify P2P traffic.

While application-layer signatures are accurate, this technique requires examination of user payload, which may not always be possible. Furthermore, encryption may soon render application-layer signature methods ineffective.

Transport-layer heuristics offer a novel method that classifies the P2P traffic based on connection-level patterns. Our results show that the transport-layer method can give useful information regarding aggregate P2P traffic.

As P2P applications keep evolving, new challenges will arise for P2P traffic classification. Since traditional methods like port-based analysis are now obsolete, and application-layer signatures are not always possible, transport-layer analysis is perhaps one of the best available current approaches for P2P traffic classification. Even better methods will likely be required in the near future.

8. Acknowledgements

Financial support for this research was provided by TRLabs (Telecommunications Research Labs) in Calgary, iCORE (Informatics Circle of Research Excellence) in the Province of Alberta, as well as NSERC (Natural Sciences and Engineering Research Council) and CFI (Canada Foundation for Innovation) in Canada. The authors thank Martin Arlitt for his assistance with trace collection and storage, and Gwen Houtzager for assistance with trace processing in the ELISA lab. The authors are very grateful to UCIT (University of Calgary Information Technologies) for providing access to the campus network traffic data. The authors thank the anonymous MASCOTS 2006 reviewers for their constructive comments on an earlier version of this paper.
9. References

[1] E. Adar and B. Huberman, "Free Riding on Gnutella," Technical Report, Xerox PARC, August 2000.
[2] BitTorrent, http://www.bittorrent.com, 2005.
[3] Ethereal, http://www.ethereal.com, 2005.
[4] J. Erman, M. Arlitt, and A. Mahanti, "Traffic Classification Using Clustering Algorithms," Proceedings of the ACM SIGCOMM MineNet Workshop, Pisa, Italy, September 2006.
[5] Gnutella, http://www.gnutella.com, 2005.
[6] Gnutella2, http://www.gnutella2.com, 2005.
[7] K. Gummadi, R. Dunn, S. Saroiu, S. Gribble, H. Levy, and J. Zahorjan, "Measurement, Modeling, and Analysis of a Peer-to-Peer File Sharing Workload," Proceedings of the ACM Symposium on Operating Systems Principles (SOSP 2003), pp. 314-329, NY, USA, October 2003.
[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, "ACAS: Automated Construction of Application Signatures," Proceedings of the ACM SIGCOMM Workshop on Mining Network Data (MineNet 2005), pp. 197-202, Philadelphia, PA, USA, August 2005.
[9] Internet Assigned Numbers Authority, TCP/UDP Port Numbers, http://www.iana.org/assignments/port-numbers, 2005.
[10] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, "Transport Layer Identification of P2P Traffic," Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC 2004), pp. 121-134, Taormina, Italy, October 2004.
[11] KaZaA, http://www.kazaa.com, 2005.
[12] B. McWilliams, "'SQLsnake' Worm Blamed for Spike in Port 1433 Scans," Newsbytes, May 21, 2002. http://online.securityfocus.com/news/429, 2005.
[13] D. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu, "Peer-to-Peer Computing," Technical Report, HP Labs, Palo Alto, March 2002.
[14] Napster, http://www.napster.com, 2005.
[15] A. Oram, Peer-to-Peer: Harnessing the Power of Disruptive Technology, First Edition, O'Reilly, March 2001.
[16] V. Paxson, "Growth Trends in Wide-Area TCP Connections," IEEE Network, Vol. 8, No. 4, pp. 8-17, July 1994.
[17] S. Saroiu, K. Gummadi, and S. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems," Proceedings of Multimedia Computing and Networking (MMCN 2002), pp. 18-25, California, January 2002.
[18] S. Saroiu, K. Gummadi, and S. Gribble, "Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts," Springer-Verlag Multimedia Systems, Vol. 9, No. 2, pp. 170-184, 2003.
[19] S. Sen, O. Spatscheck, and D. Wang, "Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures," Proceedings of the 13th International World Wide Web Conference (WWW 2004), pp. 512-521, New York, NY, USA, May 2004.
[20] tcpdump, http://www.tcpdump.org, 2005.
[21] K. Thompson, G. Miller, and R. Wilder, "Wide-Area Internet Traffic Patterns and Characteristics," IEEE Network, Vol. 11, No. 6, pp. 10-23, November/December 1997.