
Deconstructing the Kazaa Network

Nathaniel Leibowitz, Expand Networks, [email protected]
Matei Ripeanu, Computer Science Department, The University of Chicago, [email protected]
Adam Wierzbicki, Polish-Japanese Institute for Information Technology, [email protected]

Abstract

Internet traffic is experiencing a shift from Web traffic to file swapping traffic. Today a significant part of Internet traffic is generated by peer-to-peer applications, mostly by the popular Kazaa application. Yet, to date, few studies analyze Kazaa traffic, thus leaving the bulk of Internet traffic in the dark. We present a large-scale investigation of Kazaa traffic based on logs collected at a large Israeli ISP, which capture roughly a quarter of all traffic between Israel and the US.

1. Introduction

In a brief period of time the composition of Internet traffic has shifted dramatically from mainly Web traffic to traffic generated by peer-to-peer (P2P) file-sharing applications like Kazaa or iMesh. Both network measurements and anecdotal evidence support this statement. Internet2 administrators report that about 16% of the traffic carried by their network is P2P traffic, while a further 54% is unidentified traffic most likely generated by applications in the same class [1] (January 2003). Between 15% and 30% of residential subscribers on several large ISPs surveyed were using Kazaa or Morpheus [2]. Downloads of P2P applications progress at incredible rates: 3.2 million per week for Kazaa and 200,000 per week for [3] (February 2003).

Yet, to date, few studies have analyzed Kazaa traffic, thus leaving the bulk of Internet traffic in the dark. This paper describes a large-scale investigation of this traffic, structured along three main guidelines:

- Firstly, we try to identify the salient features of Kazaa traffic. We confirm that the traffic is highly concentrated around a small minority of large, popular items. We find, however, that this concentration is even more pronounced than previously reported. This is a strong indication that caching can bring significant savings in this context.
- Secondly, we study the dynamics of network content to better understand both the underlying trends in the user community and in its tastes, and the potential for caching. We are interested in the rate of appearance of new content, as well as in the stability properties of the sets of the most popular items.
- Thirdly, we study the virtual relationships that form among users based on the data they download. We model the network as a data-sharing graph and uncover its small-world characteristics. We believe that these small-world characteristics can be exploited to build efficient data location and data delivery mechanisms.

The rest of this paper is structured as follows. In the next section we describe our data collection setup and the main trace processing steps. Section 3 surveys related work on peer-to-peer traffic characterization. Section 4 comprises the bulk of our analysis, structured along the three guidelines mentioned above. We summarize our findings in Section 5.

2. Data Collection and Processing

Few details are publicly available about the Kazaa protocol. Apparently, Kazaa nodes dynamically elect 'super-nodes' that form an unstructured overlay network and use query flooding to locate content. Regular nodes connect to one or more super-nodes to query the network content and in fact act as querying clients to the super-nodes. Once the desired content has been located, Kazaa uses the HTTP protocol to transfer it directly between the content provider and the node that issued the download request. In fact, to improve transfer speed, multiple file fragments are downloaded in parallel from multiple providers. While traffic flowing on the control channel (queries, network membership information, version information, etc.) is encrypted, traffic on the download channel (i.e. all HTTP transfers) is not encrypted. We use information collected from the download channel for an, admittedly non-exhaustive, study of the Kazaa network.

2.1. Data Collection

To collect Kazaa traces we use a setup similar to [4]. We briefly describe the trace collection setup below and refer to [4] for a complete description. A caching server is installed at the border between the local user base of a large ISP and the Internet cloud. For each TCP connection, regardless of direction (both in and out), a Layer 4 switch inspects the first few packets to detect Kazaa download traffic. If download traffic is detected, the switch redirects it through the caching server. Thus, the caching server is able to intercept all downloads performed by local users, cache them, and serve cached data. We note that we focus on downloads performed by local users and completely ignore downloads performed by outside users from local file providers (in other words, we are only interested in incoming traffic).

Additionally, our content-based technique to detect traffic of interest has significant advantages over traditional, port-based techniques: we find that in February 2003 about 38% of all download sessions were not using the standard Kazaa port (1214).
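As an illustration of such content-based classification (this paper does not describe the switch's actual signature), a minimal Python sketch that assumes Kazaa download sessions are plain HTTP exchanges carrying FastTrack-specific headers such as X-Kazaa-*; both the header test and the helper names below are our assumptions, not the deployed implementation:

# Hypothetical sketch only: classify a TCP session as a Kazaa download by
# inspecting its first payload bytes instead of relying on port 1214.
def looks_like_kazaa_download(first_payload: bytes) -> bool:
    text = first_payload[:512].decode("latin-1", errors="replace")
    is_http = text.startswith(("GET ", "HTTP/1."))   # download channel is plain HTTP
    has_fasttrack_header = "X-Kazaa-" in text        # assumed FastTrack header prefix
    return is_http and has_fasttrack_header

def port_based_guess(dst_port: int) -> bool:
    # The traditional rule; as noted above, it misses about 38% of sessions.
    return dst_port == 1214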
It is difficult to define Kazaa downloads in the terms originally coined for describing standard file downloads, the salient difference being that the download of a single file is usually composed of tens of smaller downloads of different fragments of the file from different providers. This complicates both the terminology and the computations involved in analyzing the data. We use the terms and methods introduced in [4] to circumvent these problems. The term download or session describes a single TCP session between two nodes, over which a portion of a file (none, part, or all of the file) is transferred. The term download cycle describes the logical transfer of a whole file, which may consist of tens of sessions and extend over hours or even days. Finally, we use the following scheme to quantify the number of download cycles for each file: if an accumulated value of X bytes of file Y has been transferred over the network, then we estimate that X / FileSize(Y) download cycles of the file have passed over the network.
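For illustration, a minimal sketch of this accounting; the per-session record fields below (file identifier, bytes moved in the session, full file size) are our own naming, not the actual log format:

from collections import defaultdict

def estimate_download_cycles(sessions):
    # sessions: iterable of (file_id, bytes_transferred, file_size) tuples.
    transferred = defaultdict(int)
    size_of = {}
    for file_id, bytes_transferred, file_size in sessions:
        transferred[file_id] += bytes_transferred
        size_of[file_id] = file_size
    # X accumulated bytes of file Y  =>  X / FileSize(Y) download cycles of Y.
    return {f: x / size_of[f] for f, x in transferred.items() if size_of[f] > 0}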

2.2. The Traces

The caching server has been continuously logging traffic over the past year. As we do not see qualitative changes in traffic characteristics during this period, we use only a part of these logs for most of our analysis below.

We eliminate from our logs all control channel connections and use only the inbound download sessions (i.e. data flowing to local users from either external or other internal users) for our analysis. Table 1 summarizes the main characteristics of the traffic captured.

Table 1: Characteristics of collected Kazaa traces.
  Data collection period: 1/15 – 2/15/03
  Number of download sessions: 7 x 10^6
  Number of control sessions: 24 x 10^6
  Bytes transferred: 20 TB
  Concurrent sessions (avg.): 1,200
  Concurrent sessions (peak): 3,000
  Bandwidth used (average): 75 Mbps
  Bandwidth used (peak): 145 Mbps
  Number of unique files: ~300,000
  Number of unique users: ≥ 50,000

3. Related Work

A number of recent studies cast more light on the nature of P2P traffic, in particular on traffic generated by applications in the FastTrack (KaZaa, KaZaa Lite) and Gnutella (Morpheus, LimeWire, etc.) families, which have started to generate a significant fraction of Internet traffic.

Sen et al. [5] use TCP flow-level data gathered from multiple routers across a large Tier-1 ISP to analyze three P2P applications (Kazaa, Gnutella and DirectConnect). While this data does not reveal application-level details and cannot give insights explaining the behavior observed, it is an important step in characterizing these applications from a network engineering perspective. For example, Sen et al. [5] report that although the distribution of generated P2P traffic volume is highly skewed at the individual host level, the fraction of the traffic contributed by each network prefix remains relatively unchanged over long time intervals.

At the application level, Gnutella's open protocol has made the analysis of this network somewhat simpler. A number of studies [5-9], based mostly on data collected from the control channel, explore the topology of the Gnutella overlay, its mapping on the Internet physical infrastructure, the behavior of Gnutella users, and the main characteristics of Gnutella nodes.

Two recent studies [4, 10] use the fact that, although Kazaa's protocol (FastTrack) is proprietary, Kazaa uses HTTP to move data files: thus this traffic can be logged and cached. Both these studies monitor HTTP traffic on costly links: traffic from a large Israeli ISP to the US and Europe [4], or from the University of Washington campus to its ISP [10] (in the following we refer to these traces as the UW traces).

Leibowitz et al. [4] note that Kazaa traffic constitutes most of the Internet traffic, show that a tiny number of files generates most of the download activity, suggest the feasibility of traffic caching, and empirically demonstrate its benefits. Saroiu et al. [10] reaffirm these findings and compare Kazaa traffic with traffic generated by traditional content distribution systems (i.e. Akamai and Web traffic).

We believe the traces analyzed in this study and in [4] complement well the traces analyzed in [10]: our traces reflect a more diverse user population with significant heterogeneity in network connectivity and interests. Additionally, we analyze a community where users pay network usage charges upfront, as opposed to a university where users are charged for network usage indirectly or not at all. One interesting observation highlights the difference in user population (and respective usage patterns) in these two studies: while the University of Washington user community acts as a net provider of Kazaa content (Saroiu et al. report that at UW outbound Kazaa traffic is at least seven times larger than inbound traffic), the user population in our study is a net content consumer (the ratio of outbound to inbound traffic is almost reversed).

In this paper we expand the results presented in [4], and, while we note that the basic characteristics remain valid for current traffic, we investigate new aspects of the traffic, user behavior and network structure that have not been previously explored.

4. Analysis

4.1. Counting Downloads

Kazaa's user interface reports hundreds of millions of files available in the network. We cannot confirm or refute this claim, as this would require a global view of the entire network. We analyze, however, the traffic generated by local users downloading files from the rest of the Internet. In this section we analyze the one month long Kazaa trace presented above. Since files are generally downloaded from multiple sources, we process the logs to compute the number of download cycles for each file. We then produce a list of files sorted by the number of download cycles and use it to generate a CDF (Cumulative Distribution Function) that shows the percent of download cycles for each progressing subset of the most downloaded files.

In Figure 1 we observe that only about half of the requested 300,000 files have been downloaded a significant number of times. Also, 65% of all download cycles go to the 20% most popular files (60,000 files). To provide more detail, Figure 2 zooms in and plots the CDF for the 10% most popular files: it becomes obvious that about 30% of all download cycles go to the 1% most popular files.

Figure 1: CDF for file download cycles. Note that more than 10% of all download sessions attempted actually fail.

Figure 2: CDF of file download cycles for the 10% most downloaded files. Note that 30% of all download cycles are generated by the 1% most popular files.

4.2. File Download Distribution by Bytes

The analysis above treats each download cycle as a unit value and ignores file size variability. As a consequence, it does not indicate how much traffic is concentrated around the subset of the most popular files. To investigate this aspect, we weigh each download cycle with the corresponding file size and obtain, for each file, the total amount of traffic it generated. We then produce a list of files sorted by the volume of generated traffic and create a CDF similar to that presented in the previous section: we plot the percent of traffic for each progressing subset of the most popular files. Figures 3 and 4 plot this CDF for the byte popularity distribution for the top 10% and, respectively, 1% most popular files. The behavior noticed in the previous graphs is here much more pronounced: we observe that as little as 2500 files (a mere 0.8% of all detected files) account for as much as 80% of the traffic.
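As an illustration of how such curves can be derived from the per-file counts, a short sketch under our own naming (not the scripts actually used):

def cumulative_share(values):
    # CDF over files sorted from most to least popular: entry i is the fraction
    # of total activity contributed by the top i+1 files.
    ordered = sorted(values, reverse=True)
    total = float(sum(ordered))
    cdf, running = [], 0.0
    for v in ordered:
        running += v
        cdf.append(running / total)
    return cdf

# cycles_by_file: per-file download-cycle estimates (Section 2.1);
# bytes_by_file:  the same values multiplied by each file's size.
# cumulative_share(cycles_by_file.values())  -> curves of Figures 1 and 2
# cumulative_share(bytes_by_file.values())   -> curves of Figures 3 and 4

Sorting the same per-file values by file size instead of by popularity and accumulating them yields the usage CDFs discussed in Section 4.3 (Figures 5 and 6).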

Figure 3: CDF for byte popularity distribution for the 10% most popular files. Note that most of the generated traffic is concentrated around a very small set of files (1% of all files).

We note that our measurements show a byte popularity distribution significantly more skewed than the UW traces [10]. While in the UW traces the most popular 1% of all files account for 'only' about 50% of all bytes transferred, here the same 1% most popular files account for more than 80% of all traffic.

To provide better insight, Figure 4 zooms in and plots the CDF for the 1% most popular files: it becomes obvious that generated traffic is concentrated around the most popular files: as little as 0.1% of the most 'bandwidth-hungry' files generate 50% of the traffic.

Figure 4: CDF for byte popularity distribution for the top 1% most popular files. Note that 50% of total traffic is generated by the 0.1% most 'bandwidth-hungry' files.

4.3. File Sizes

We switch gears and analyze the file size distribution. Figure 5 presents a CDF for file sizes. The 'steep' regions of the plot reflect ranges with a large number of files. As we expect, these are: roughly 100KB for pictures, 2-5MB for music files, 50-150MB for applications and movie clips, and larger than 100MB for movie files.

Figure 5: File size cumulative distribution function. The 'steep' portions of the distribution reflect ranges with a larger number of files: 2-5MB for music files, around 100KB for pictures, and larger than 700MB, probably for movies. (Note the logarithmic scale on the X axis in this figure and the normal scale in Figure 6.)

Additionally, similar to the analysis in the previous section, we are interested in two other aspects: the number of download cycles and the generated traffic volume. In Figure 6, we weigh each file size by the number of download cycles and, respectively, by the traffic generated to download the file. We sort files in increasing order of their sizes and plot the usage CDF (where usage is defined as the number of download cycles or bytes transferred, respectively).

The file size CDF plotted in Figure 5 is presented for reference (the top line in the plot). While the plots have a similar structure, the plot representing the CDF of generated traffic, weighted by bytes transferred, has more pronounced features. It emphasizes the fact that most of the traffic is generated by the largest files (60% of the traffic is generated by files larger than 700MB). It is interesting to note that little traffic is generated by files in the 200-700 MB range, indicated by the plateau in that range – indeed, user experience indicates that most files are either smaller than 150 MB (clips and applications) or larger than 700MB (movies and games).

Figure 6: Activity CDF. The plot line in the middle presents the CDF for the number of download cycles while the bottom plot presents the CDF for the generated traffic. The top plot, representing the size CDF, is present for reference. The same regions visible in Figure 5 are present (although less accentuated since we use a normal scale on the X axis). Note again that files in the 700-900MB range generate most of the traffic.

Figure 7 uncovers the roughly linear correlation between a file's size and the activity it generates in terms of bytes transferred.

Figure 7: Roughly linear correlation between the file size and the traffic generated by downloading each file (logarithmic scales on both X and Y axes).

4.4. Dynamic Properties of Network Content

4.4.1 Quantity and Rate of Distinct Files.

Kazaa claims its users share millions of files. However, it is unclear how many of these files are distinct, or how many are actually transferred over the network, and at what rate. These questions are important for understanding the diversity of the network content and the heterogeneity in user interests, and are crucial from a caching perspective.

The data we use in this section are a detailed log of all Kazaa traffic through our server during a 17-day period (early February 2003). They consist of approximately 3 million downloads which altogether accessed some 150,000 distinct files.

We process these logs in three different time units: minute, hour and day. Our strategy for answering the above questions is to compute the number of distinct new files observed during each time unit (new files are files not previously observed since the beginning of the experiment). The first time units should measure high values of new files, and later new files will be encountered less frequently.

Figure 8 plots the number of distinct new files observed in consecutive one-hour periods. Initially the rate at which new files are encountered is extremely high, and it then declines sharply after a few hours, indicating a large temporal locality: once a file is requested it will be requested again soon. The seasonal pattern observed in Figure 8 follows a period of 24 hours, with night-time peaks. This seasonality is easily explained, since the majority of our users are in the same time zone.

Figure 8: New files encountered during one hour long intervals for our 17-day trace.

In order to evaluate the rate of change, we show in Figure 9 a zoom-in on the first hour, computed at a 1-minute resolution. Initially we encounter 200 new distinct files a minute (a new file every 0.3 seconds). This value declines sharply and attains a relative stability within 20 seconds, at a value of 50 new files per minute. This stability, however, is superficial, as evidenced by the constant slope at the hourly resolution plotted in Figure 8.

Figure 9: New unique files by minute for the first hour.
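A sketch of the counting strategy described above, assuming the log has been reduced to (timestamp, file identifier) pairs:

from collections import Counter

def new_files_per_bucket(records, bucket_seconds=3600):
    # records: iterable of (unix_timestamp, file_id) pairs, in any order.
    # Returns {bucket_index: number of files first observed in that bucket}.
    first_seen = {}
    for ts, file_id in records:
        if file_id not in first_seen or ts < first_seen[file_id]:
            first_seen[file_id] = ts
    counts = Counter(ts // bucket_seconds for ts in first_seen.values())
    return dict(sorted(counts.items()))

# bucket_seconds = 60, 3600 and 86400 give the minute, hour and day
# resolutions used for Figures 9, 8 and 10, respectively.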

In order to better understand the behavior and enable extrapolation, in Figure 10 we plot the same values at a 1-day resolution, which avoids the daily cycle. The persistent decrease in the rate of encountering new files, even after 16 days, is clearly visible.

Figure 10: New files encountered during one-day intervals for our 17-day trace.

During the period of observation, the number of new unique files did not decrease to zero and did not stabilize at a constant level. However, it is reasonable to suppose that this value would stabilize during a longer observation period. We suggest an interesting explanation for the steady state value: it indicates the rate at which new files enter the network, in other words the rate at which new songs, movies, games and the like are created.

4.4.2 Rate of Change

An interesting question, both from a caching perspective and from the perspective of understanding usage patterns, is the rate of variation for the set of most popular files. Consider for instance compiling every day the list of the 100 most popular files. How would these lists change over time? Would it be possible to identify files that are always on these lists (all-time favorites), or would the lists change very quickly (the equivalent of one-day stars)?

To investigate this question, we determine the N most popular files during consecutive observation periods, where N ∈ {4, 50, 400}. The observation periods are approximately 24 hour intervals. The popularity of a file is measured by the number of download cycles of the file.

The first part of our analysis investigates how much the lists of the most popular N files change from one observation period to another. Let x_t be the set of files that were on both the N-most-popular list of observation period t-1 and that of period t. We calculate 100 * |x_t| / N to obtain the percentage of the popular files that have persisted between the two observations. This value is plotted in Figure 11 for different values of N.

Figure 11: Ratio of the popular files set that remains stable during consecutive time periods.
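A sketch of this overlap computation, assuming per-period download-cycle counts are available as dictionaries (our own data layout):

def top_n(cycle_counts, n):
    # The n most downloaded file ids in one observation period.
    return set(sorted(cycle_counts, key=cycle_counts.get, reverse=True)[:n])

def persistence(period_counts, n):
    # period_counts: list of {file_id: download cycles}, one per ~24h period.
    # Returns 100 * |x_t| / n for each pair of consecutive periods (Figure 11).
    tops = [top_n(counts, n) for counts in period_counts]
    return [100.0 * len(prev & cur) / n for prev, cur in zip(tops, tops[1:])]

Replacing the previous-period set with the running intersection of all earlier top-N sets gives the base-period comparison plotted in Figure 12.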

For N=4, the percentage of recurrently popular files is almost always 50%, which means that during all the observation periods 2 files persistently occupied the top-4 lists. Based on accumulated user experience with the Kazaa application, we assume these files are most likely the Kazaa software installation packages, which circulate frequently in the network. For higher values of N the situation changes: the percentage of recurrently popular files seems to be stable at about 30%, slightly decreasing for large N. This suggests that caching could be quite effective for Kazaa traffic.

We investigate the characteristics of those files that remain popular from one observation period to the following. For each new observation period, we intersect the list corresponding to that period with the intersection of the lists from all previous observation periods. In Figure 12, we plot the percentage of the files in the first list that remained in this intersection after t observation periods. The percentage of files that are popular in all observation periods stabilizes at about 15%. This suggests that there are indeed a number of "all-time favorites" items during our observation.

Figure 12: Ratio of the popular files set that remains stable when compared with a base period.

The number of files that remain popular in the following observation period is larger than the number of files that are popular in all observation periods. This suggests that the set of cacheable files changes over time, since only about half of these files are present in all observation periods. A longer experimentation period is required to determine how persistent this group is and to further quantify its rate of change over months.

Summarizing these two experiments, we obtain that 15% of the highly popular files remain popular throughout the experiment, while the rest are popular for shorter time intervals. This indicates that the popular files are composed of two sets: a set of persistently popular files and a set of transiently popular files whose popularity is short lived.

4.5. Data-Sharing Relationships among Users

This section explores the virtual relationships that form among Kazaa users based on the files they try to download. We are inspired by recent studies [11, 12] that analyze the Web and a high-energy physics collaboration and uncover in both these systems small-world patterns emerging in users' data-sharing relationships.

When users install and configure the Kazaa application they have the opportunity to choose a user name. Our traces capture user names and we use them to identify users. We investigate the distribution of download activity (generated traffic and number of download sessions) over the set of user names and discover that three users generate one order of magnitude or more activity than any other user in our set (about 20% of all system activity is generated by these three users; Figure 13). We believe these are 'outliers': in fact multiple users that have not configured their software and thus run under the default user name (their usernames, defaultuser or kazaliteuser, strengthen this intuition). Therefore, for the analysis in this section, we purged all activity generated under these user names.

Figure 13: Activity distribution over the user name space. Users are ordered in decreasing order of the number of downloads they initiate (logarithmic scales on both X and Y axes).

We follow closely the technique described in [11]. We define the data-sharing graph as the graph whose nodes are the Kazaa users; edges connect pairs of nodes whose activity satisfies a similarity criterion: two users are connected if they (try to) download at least m common files during a time interval T. For this analysis we use a one week long Kazaa trace and we vary m from 1 to 5 and T from 4 to 48 hours.
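As an illustration (not the authors' implementation), a sketch that builds the data-sharing graph for one time window and computes the two small-world metrics discussed below, using the networkx library; the quadratic pairwise comparison is kept only for clarity:

from itertools import combinations
from collections import defaultdict
import networkx as nx

def data_sharing_graph(downloads, m):
    # downloads: iterable of (user, file_id) pairs observed within one window T.
    # Two users are linked if they requested at least m common files.
    files_of = defaultdict(set)
    for user, file_id in downloads:
        files_of[user].add(file_id)
    g = nx.Graph()
    g.add_nodes_from(files_of)
    for u, v in combinations(files_of, 2):
        if len(files_of[u] & files_of[v]) >= m:
            g.add_edge(u, v)
    return g

def small_world_metrics(g):
    # Clustering coefficient of the whole graph and average path length of its
    # largest connected component, for g and for a random graph of equal size.
    def metrics(h):
        giant = h.subgraph(max(nx.connected_components(h), key=len))
        return nx.average_clustering(h), nx.average_shortest_path_length(giant)
    random_ref = nx.gnm_random_graph(g.number_of_nodes(), g.number_of_edges())
    return {"data-sharing graph": metrics(g), "random graph": metrics(random_ref)}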

We discover that the data-sharing graph displays small-world properties. Small-world graphs are defined by comparison with random graphs with the same number of nodes and edges: first, a small-world graph displays a small average path length, similar to a random graph; second, a small-world graph has a significantly larger clustering coefficient than a random graph of the same size. The clustering coefficient captures how many of a node's neighbors are connected to each other. One can picture a small-world graph as a graph constructed by loosely connecting a set of almost complete sub-graphs. Social networks, in which nodes are people and edges are relationships; the Web, in which nodes are pages and edges are hyperlinks; and neural networks, in which nodes are neurons and edges are synapses or gap junctions, are a few of the many examples of small-world networks [13-16].

Table 2 presents the average path length and the clustering coefficient (averaged over multiple intervals of equal duration) for the data-sharing graphs defined by a few different similarity criteria. We compare these metrics with those of random graphs of similar sizes. Note that despite the diversity of graph definitions (i.e., similarity criteria) and graph sizes, the values are remarkably close.

Table 2: Clustering coefficient and average path length for graphs constructed under various similarity criteria (together with the same characteristics for random graphs of similar size).

Similarity criterion | # nodes (avg.) | # links (avg.) | Avg. path length (data-sharing graph) | Avg. path length (random graph) | Clustering coeff. (data-sharing graph) | Clustering coeff. (random graph)
m=1, T=4h   | 1585 | 8546  | 4.01 | 4.41 | 0.653 | 0.0070
m=1, T=8h   | 2038 | 14267 | 3.76 | 4.08 | 0.645 | 0.0068
m=1, T=12h  | 3033 | 29991 | 3.31 | 3.50 | 0.605 | 0.0065
m=2, T=24h  | 1311 | 5227  | 3.72 | 4.51 | 0.483 | 0.0040
m=3, T=48h  | 914  | 2200  | 3.93 | 7.76 | 0.410 | 0.0052

Figure 14 compares these data-sharing graphs with a selection of well-known small-world graphs, including a citation network, the power grid, movie actors, the Internet and the Web [16]. Axes represent ratios between the metrics of interest of these graphs and random graphs of the same size. As above, for our data-sharing graphs, each point in the plot represents averages over all graphs constructed using one similarity criterion.

Figure 14: Comparing Kazaa's data-sharing graphs with a selection of well-known small-world graphs, including a citation network, the power grid, movie actors, the Internet and the Web.

5. Summary

We present a study of current (early 2003) Kazaa traffic, which has been dominating Internet traffic for the past two years. We confirm previous findings that Kazaa traffic is highly concentrated around a small minority of large, popular items. We find, however, that this concentration is even more pronounced than previously reported. This is a strong indication that caching can bring significant savings in this context.

We study the dynamics of network content to better understand both the dynamics of the user community and of its tastes, and the potential for caching. We are interested in the rate of appearance of new content, as well as in the stability properties of the sets of the most popular items. Based on detailed logs of several weeks of Kazaa traffic, we measure the rate at which new files are encountered in the Kazaa network, and use it to estimate the rate at which new files are created and enter Internet circulation. We also discover that the set of popular files is composed of two subsets: a small number of files are constantly popular while the rest lose their popularity within days. We note that a longer experimentation period and further analysis are required to quantify these conclusions.

Additionally, based on the intuition of virtual relationships between users that employ similar subsets of data, we model the network as a data-sharing graph and uncover its small-world characteristics. We believe that the small-world characteristics of the data-sharing graph can be exploited to build efficient data-location and data-delivery mechanisms.

6. Acknowledgements

The authors are grateful to Ian Foster, Adriana Iamnitchi and Anne Rogers for insightful discussions and generous support. This work was supported in part by the National Science Foundation under contract ITR-0086044 (GriPhyN).

7. References

[1] Internet2, "http://netflow.internet2.edu/weekly/," 2003.
[2] Informatics Online web site, "http://www.infomaticsonline.co.uk/News/1134977," 2002.
[3] Download.com web site, "http://download.com.com," 2003.
[4] N. Leibowitz, A. Bergman, R. Ben-Shaul, and A. Shavit, "Are File Swapping Networks Cacheable? Characterizing P2P Traffic," presented at the 7th International Workshop on Web Content Caching and Distribution (WCW'03), Boulder, CO, 2002.
[5] S. Sen and J. Wang, "Analyzing Peer-to-Peer Traffic Across Large Networks," presented at the Internet Measurement Workshop (IMW 2002), Marseille, France, 2002.
[6] M. Ripeanu, I. Foster, and A. Iamnitchi, "Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design," Internet Computing Journal, vol. 6, 2002.
[7] S. Saroiu, P. K. Gummadi, and S. D. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems," presented at the Multimedia Computing and Networking Conference (MMCN), San Jose, CA, USA, 2002.

[8] E. Adar and B. A. Huberman, "Free Riding on Gnutella," First Monday, 2000.
[9] F. S. Annexstein, K. A. Berman, and M. A. Jovanovic, "Latency effects on reachability in large-scale peer-to-peer networks," presented at the 13th ACM Symposium on Parallel Algorithms and Architectures, Crete, Greece, 2001.
[10] S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and H. M. Levy, "An Analysis of Internet Content Delivery Systems," presented at the 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, 2002.
[11] A. Iamnitchi, M. Ripeanu, and I. Foster, "Data-Sharing Relationships in the Web," presented at the Twelfth International World Wide Web Conference (WWW-12), Budapest, Hungary, 2003.
[12] A. Iamnitchi, M. Ripeanu, and I. Foster, "Locating Data in (Small-World?) Peer-to-Peer Scientific Collaborations," presented at the 1st International Workshop on Peer-to-Peer Systems, Cambridge, MA, USA, 2002.
[13] D. J. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness, Princeton University Press, 1999.
[14] S. Capkun, L. Buttyan, and J.-P. Hubaux, "Small Worlds in Security Systems: an Analysis of the PGP Certificate Graph," presented at the ACM New Security Paradigms Workshop, 2002.
[15] L. Adamic, "The Small World Web," presented at the European Conference on Digital Libraries, Paris, France, 1999.
[16] R. Albert and A. L. Barabasi, "Statistical Mechanics of Complex Networks," Reviews of Modern Physics, vol. 74, pp. 47-97, 2002.
