A Measurement Study Supporting P2P File-Sharing Community Models
Raffaele Bolla∗, Rossano Gaeta∗∗, Andrea Magnetto∗∗, Michele Sciuto∗, Matteo Sereno∗∗
∗Dipartimento di Informatica, Sistemistica e Telematica, Università di Genova
∗∗Dipartimento di Informatica, Università di Torino

Abstract: Knowledge of emergent properties of existing peer-to-peer file-sharing communities can be helpful for the design and implementation of innovative peer-to-peer protocols/services that exploit autonomicity, self-configuration, resilience, scalability, and performance. It is well known that the performance of this class of peer-to-peer applications depends on several parameters that represent the topological structure of the overlay network, the users' behavior, and resource dynamics. Estimation of these parameters is difficult, but it is crucial for analytical models as well as for realistic simulation of peer-to-peer applications. In this paper, we present an active measurement-based study designed to glean insights on the above parameters within the Gnutella network. The measurement software that we developed is able to collect topological information about the overlay network in a few minutes; in a second step, it contacts the users discovered during the topological measurement in order to acquire a novel dataset regarding the shared resources.

I. INTRODUCTION

The support of media exchange communities, where large groups of users with common interests exchange self-made multimedia documents (e.g., music, video, presentations, etc.), is one of the most successful and interesting types of Internet service to date. Peer-to-Peer (P2P) technologies are one of the most innovative attempts to support these communities. In fact, P2P file-sharing applications currently represent a large fraction of overall Internet traffic [5] due to their high popularity among Internet users. Several papers have analyzed the characteristics of P2P systems as well as the peculiarities of the network traffic generated by such applications and their impact on the network. Estimation of system parameters, even though it undoubtedly represents a complex task (due to factors such as the large size of the P2P overlay networks), can be beneficial for many research activities, including, among others:

• the design of innovative P2P services and applications that exploit the advantages of existing programs and the characteristics of their communities, such as autonomicity, self-configuration, resilience, scalability, and performance;
• the modeling of P2P file-sharing applications, since analytical models and simulators must be fed with realistic values for the input variables;
• the development of traffic engineering strategies, which can support an efficient use of network resources, and appropriate strategies for the management of the network traffic originated by P2P file-sharing applications.

In this paper, we present an active measurement study of the characteristics of peers and files in the real Gnutella overlay network. We focused our attention on this P2P community since the Gnutella overlay network is unstructured and completely decentralized, thus making it a self-configuring, self-organizing and self-managing structure; furthermore, it represents an extremely popular and long-lived community. We provide results regarding system parameters that have been the subject of previous studies with the goal of investigating the evolution and stability of the Gnutella community over time. On the other hand, the most relevant part of the paper presents the following original results:

• the effects of network address translation (NAT) on both the overlay network connectivity and the degree of connectivity of peers (single or reciprocal);
• the problem of edge distortion represented by connections among ultra-peers that are established or closed during a crawl, and that are likely to be reported only by one end of the connection;
• the percentage of peers and ultra-peers and their geographical position;
• the behavior of users, involving the length of their connections and disconnections, their rate of ingress into the network and the number of files they share;
• a classification of resources based on their popularity profiles, along with the probability distribution that files will belong to a popularity class;
• the rate of births of distinct categories of files in the system;
• the evolution of each file's popularity.

One of the goals of our work is to provide useful tools for modeling a P2P file-sharing system. To this end, for the most interesting measured distributions, we provide the best-fitting curve (chosen among the exponential, lognormal, Weibull and Pareto distributions) to represent the measured phenomena.

The outline of the paper is the following: in Section II, we briefly summarize previous empirical findings from measurement-based investigations of P2P communities. Section III describes the architecture of the software that we developed, which has been used to capture snapshots of the Gnutella overlay. Measurement results are then presented in Section IV, where we discuss the difficulties that we faced in collecting information from a wide-area overlay network; Section V reports our analysis on the properties of the overlay topology, Section VI concerns the behavior of users, while in Section VII, we describe the peculiarities of files shared by users. Finally, Section VIII synthesizes connections throughout our work and outlines several developments for current and future investigation.

II. RELATED WORKS

In the past, many crawlers operating on the Gnutella network have been proposed, with the aim of acquiring snapshots of the overlay network at many points in time. One of the first incarnations was developed by Mihajlo Jovanovic et al. from the University of Cincinnati [13]. The work was aimed at studying the overlay network topology by acquiring the lists of neighbors from peers, in order to reconstruct the network using graph theory. This crawler could contact 1,000 peers on average, rebuilding graphs with approximately 4,000 relationships between neighbors. A few months later, Ripeanu et al. developed another Gnutella crawler, whose application was important for achieving results regarding the size of the network [18]. In May 2001, Saroiu et al. from the University of Washington published a measurement study comparing Napster and Gnutella [20]. A relevant study, although different from the one presented here, is the passive distributed crawler gnuDC [27], designed to establish connections with the network nodes in order to collect information on their traffic.

As the new version (0.6) of the Gnutella protocol was implemented in the most popular software clients (Limewire, BearShare, Shareaza, Phex, etc.), the measurement crawlers were also improving their efficiency, allowing for the discovery of the overlay network topology with faster acquisitions (by at least two orders of magnitude). In 2005, the first crawler for the Gnutella 0.6 network was realized by R. Rezaje, D. Stutzbach and S. Zhao from the University of Oregon. It is composed of two modules: an overlay network component and a content crawler, working on shared resources. The software is designed to work via one master and 6 slave components that operate in parallel. Each component contacts a number of ultra-peers, depending on its CPU load. The crawler is extremely powerful, able to contact up to 140,000 peers per minute. In [25], a study on the complexity of churning in different P2P systems is presented. Reza et al. investigated the properties of user dynamics with a focus on the session lengths for users of Gnutella, BitTorrent and Kad. They find lognormal distributions to be best for modeling the session lengths of Gnutella users, though this analysis was restricted to one day of measurement only. They provided results based on a longer period for users of BitTorrent. However, in the case of a

presented results both for the static and dynamic properties of resources. Finally, an innovative contribution was recently published [1], analyzing the topological properties of a P2P streaming overlay and their behavior over time.

III. THE GNUTELLA CRAWLER

According to the results presented in [22] and [23], the accuracy of captured network snapshots depends on the crawling speed. In principle, a perfect snapshot is captured if the crawling process is instantaneous and complete (that is, if all peers are contacted). However, neither of these conditions is generally met, for two reasons: the topology changes rapidly, and some peers are unreachable. Two studies presented a distributed crawling technique that aims to mitigate these problems (see [22] and [23]).

Fig. 1. (a) Gnutella overlay topology. (b) The master-slave crawler architecture. Figure 1(a) depicts the two-layer hierarchy of the Gnutella overlay, composed of top-level (ultra-peers) and second-level (leaf-peers) users. Figure 1(b) shows the crawler architecture.

To obtain accurate network snapshots, we used a distributed crawler architecture, similar to the one described in [22], but modified with an innovative two-phase crawling technique, as explained in the following section.