On the Long-term Evolution of the Gnutella Network

Amir H. Rasti, Daniel Stutzbach, Reza Rejaie
University of Oregon

Abstract

The popularity of Peer-to-Peer (P2P) file sharing applications has significantly increased during the past few years. To accommodate the increasing user population, developers gradually evolve their client software by introducing new features, such as a two-tier overlay topology. The evolution of these P2P systems primarily depends on (i) coherency across different implementations and (ii) the availability of these features among clients throughout the system. This paper sheds some light on the evolution of client and overlay characteristics in the Gnutella network during a one year period over which it tripled in size, exceeding two million concurrent peers. Our results show several interesting trends, including (i) most users quickly upgrade their software, which gives developers great flexibility to react to changes, and (ii) as the system has grown in size, the top-level overlay has remained properly connected, but there is a growing imbalance between the two levels.

1 Introduction

Contrary to the common assumption about the limited scalability of unstructured Peer-to-Peer (P2P) file-sharing applications, major P2P file sharing applications (i.e., FastTrack (or Kazaa), Gnutella, and eDonkey) have become significantly more popular during the past few years and now contribute a significant portion of total Internet traffic [1, 5]. For example, over the past 12 months, the Gnutella network has grown in size by a factor of three to more than 2 million simultaneous users. To scale with this growth in user population, different groups of application developers have introduced a two-tier overlay topology coupled with more efficient search mechanisms (e.g., Dynamic Querying in Gnutella) in their client software.

Given the P2P (and often open-source) nature of these systems, the impact of a newly introduced feature depends on two factors: (i) the coherency (or compatibility) of features across different brands of client software and (ii) the penetration rate and availability of a feature throughout the system. As new features become available, users upgrade their client software and the new features gradually spread throughout the system. Therefore, the penetration rate of new features primarily depends on which brands implement the feature and on the ability and willingness of users to upgrade.

In this paper, we examine the evolution of the Gnutella network during the past year, over which its overall population has tripled to more than two million simultaneous peers. Toward this end, we explore the evolution of the Gnutella network across two related angles:

• Client Characteristics, which include changes in the overall population, the distribution across different geographic regions, and the popularity of brands and versions.

• Overlay Characteristics, which represent changes in various properties of the two-tier overlay topology, including the balance between the populations at each tier, different angles of node degree, intra-regional bias in peer connectivity, and resiliency to node removal.

Modern Gnutella has incorporated a two-tier architecture where a small subset of peers, called ultrapeers, form a top-level overlay while other peers, called leaf peers, are connected to the top-level overlay through one or multiple ultrapeers (Figure 1(a)). When a leaf peer cannot locate a sufficient number of parents in the top-level overlay, it will promote itself and become an ultrapeer if it believes that it is not firewalled. The goal of this mechanism is to maintain a proper balance between ultrapeers and leaves (15%–17%).
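This promotion rule lends itself to a compact statement. The sketch below is a simplified model of the heuristic, not code from any actual client; the threshold of 3 parents matches the default parent count that LimeWire and BearShare clients try to maintain (Section 4), and `is_firewalled` stands in for whatever firewall detection a client performs.

```python
# Simplified model of Gnutella's leaf self-promotion heuristic.
# Not taken from any client's source; names and structure are ours.

MIN_PARENTS = 3  # default number of ultrapeer parents a leaf maintains

def should_promote(parents_found: int, is_firewalled: bool) -> bool:
    """A leaf promotes itself to an ultrapeer only when it cannot
    locate enough parents in the top-level overlay and believes
    it is not firewalled."""
    return parents_found < MIN_PARENTS and not is_firewalled
```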

Our main findings are summarized in the following list. Due to limited space, we present only a subset of our results; further details can be found in the related technical report [6].

• Gnutella is North America-centric, with a moderate number of European users.

• There is a strong bias towards intra-continent connectivity, especially in continents with smaller populations.

• Most users upgrade their client software within 2 months, which is much faster than users upgrade operating systems and other types of software.

• Peers in the top-level overlay can effectively locate the maximum number of neighbors and form a well-connected top-level overlay.

• The ultrapeer-to-leaf ratio has unnecessarily increased, degrading performance of the system. This also implies that leaves are unable to locate available open slots among ultrapeers, and thus unnecessarily promote themselves to become ultrapeers.

Several previous studies have examined characteristics of overlay topologies [3, 4, 7, 12] and participating peers [8, 9] in large scale P2P systems. However, there are at least three important differences between our work and these prior studies. First, all the previous studies (except our earlier work [12]) were conducted more than three years ago, on much smaller user populations and before the introduction of the two-tier overlay topology. Second, and more importantly, most of these studies focused on characteristics of P2P systems over a relatively short period (e.g., a few days or weeks) and did not capture any major growth in population. Finally, previous studies have used either partial or distorted snapshots of the system, which could significantly affect the accuracy of their results [10]. To our knowledge, our work is the only study that examines the evolution of a P2P file-sharing system over a one year period. This paper builds on our recent work characterizing graph-related properties of the Gnutella overlay and its short-term dynamics [12] by exploring long-term trends and effects.

The rest of this paper is organized as follows: In Section 2, we briefly present our data collection methodology and tools, and explain the importance of capturing accurate snapshots of P2P systems. Sections 3 and 4 present the evolution of client characteristics and overlay characteristics in the Gnutella network, respectively. Finally, we conclude the paper and sketch our future plans in Section 5.

2 Data Collection

To accurately characterize P2P overlay topologies, we need to capture complete and accurate snapshots. By "snapshot", we mean a graph that captures all participating peers (as nodes) and the connections between them (as edges) at a single instance in time. The most common approach to capturing a snapshot is to crawl the overlay. In practice, capturing accurate snapshots is challenging due to the large size and the dynamic nature of P2P systems. Because overlays change as the crawler operates, captured snapshots are inherently distorted, where the degree of distortion is proportional to the crawling duration [11].

We have developed a set of measurement techniques into a parallel crawler, called Cruiser [10]. Cruiser improves the accuracy of captured snapshots by significantly increasing the crawling speed, primarily through two mechanisms. First, it leverages the two-tier structure of the overlay by contacting only ultrapeers. Since leaf peers connect only to ultrapeers, all of their topological information can be captured without contacting them directly. Second, Cruiser significantly increases the degree of concurrency in crawling by running on several machines and opening hundreds of simultaneous connections from each machine.

Cruiser can capture the Gnutella network with 2.2 million peers in around 8 minutes, or around 275 Kpeers/minute (by directly contacting 22 Kpeers/minute). This is orders of magnitude faster than the fastest previously reported crawlers (i.e., 2.5 Kpeers/minute in [8])¹. Cruiser captures the following information from each peer it successfully contacts: (i) peer type (ultrapeer or leaf), (ii) brand and version, (iii) a list of the peer's neighbors, and (iv) a list of an ultrapeer's leaf nodes. Since the crawler does not directly contact leaf peers, we do not have information about their brands and versions.

¹Capturing a snapshot of a large scale P2P system with a slow crawler might be infeasible since the crawler may never terminate due to the ongoing arrival of new peers in the system.
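Cruiser's two speedup mechanisms translate naturally into code. The sketch below is a minimal single-machine illustration, not Cruiser itself: `handshake` is a hypothetical callable that connects to an ultrapeer and returns its neighbor and leaf lists (or None on failure), and the real crawler additionally coordinates several machines.

```python
# Minimal sketch of a two-tier overlay crawler in the spirit of
# Cruiser. `handshake(addr)` is a hypothetical helper returning
# (ultrapeer_neighbors, leaf_children) or None if unreachable.
from concurrent.futures import ThreadPoolExecutor

def crawl(seeds, handshake, concurrency=500):
    snapshot = {}                    # ultrapeer -> (neighbors, leaves)
    frontier, visited = set(seeds), set()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while frontier:
            batch = list(frontier - visited)
            frontier.clear()
            visited.update(batch)
            # Mechanism 1: contact only ultrapeers; leaves are fully
            # described by their parents, so they are never contacted.
            # Mechanism 2: many concurrent connections per machine.
            for addr, reply in zip(batch, pool.map(handshake, batch)):
                if reply is None:
                    continue         # peer departed or is unreachable
                neighbors, leaves = reply
                snapshot[addr] = (neighbors, leaves)
                frontier.update(n for n in neighbors if n not in visited)
    return snapshot
```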

Figure 1: (a) Gnutella's two-tier topology (ultrapeers, leaf peers, and legacy peers around the core plane); (b) growth of the Gnutella population (total, ultrapeers, and leaves) between Oct. 2004 and Oct. 2005; (c) breakdown of ultrapeers by continent (NA, EU, AS, SA).

We have captured more than 80,000 snapshots of the Gnutella network with Cruiser between Oct. 2004 and Oct. 2005. To examine the long-term trends in the Gnutella network, we selected nine snapshots, roughly evenly spaced during this period². To minimize any possible error due to time-of-day or day-of-week effects, our candidate snapshots were all taken around 3pm PDT on weekdays.

3 Client Characteristics

In this section, we explore changes in several characteristics of Gnutella peers during the past 12 months. Figure 1(b) depicts growth in the overall population of Gnutella peers and its breakdown between ultrapeers and leaves. This figure shows that Gnutella has tripled in size, with surprisingly linear growth, though somewhat slower during last year's holiday season (Dec. '04–Jan. '05).

Client Location: We also examined the breakdown of ultrapeers across different regions and countries using GeoIP 1.3.14 from MaxMind, LLC. Figure 1(c) shows the distribution of Gnutella clients across four regions, namely North America (NA), South America (SA), Europe (EU), and Asia (AS), which collectively make up 98.5% of the total ultrapeer population. This figure reveals that a majority of Gnutella ultrapeers are in North America (80%), with a significant fraction (13%) in Europe. Furthermore, the user populations of different regions have grown proportionally. Gnutella is mostly US-centric and may not properly represent a more international P2P system. We use this information to examine connectivity within each region in the next section.

The distribution of user populations across different countries has also grown proportionally, except for China, where the client population has dropped significantly (94%). Clients in the US, Canada, and UK make up 65%, 14%, and 5% of the total population, respectively³. The remaining countries make up less than 2% each, but 16% in total. Thus, while the Gnutella network is dominated by predominantly English-speaking countries, around one-fifth is composed of users from other countries.

Figure 2(a) shows the effect of different time zones on the distribution of the user population of different regions. The populations of North American and European clients peak at around 7pm and 11am PDT with 86% and 24%, respectively. This figure indicates that our 3pm snapshots capture roughly the average daily population, i.e., not at any of the peaks.

Client Implementation: Figure 2(b) depicts the breakdown of ultrapeers across the major brands that implement Gnutella. This figure shows that the two most popular implementations are LimeWire and BearShare. Overall, the ratio between LimeWire and BearShare has been fairly stable, with LimeWire making up 75–85% of ultrapeers, BearShare⁴ making up 10–20%, and other brands making up 3–7%.

²Unfortunately, we did not capture any snapshots during May or June of 2005.
³These values are from the snapshot taken on 9/20/05 and are similar to the other values observed during the study period, as shown in Figure 1(c).
⁴BearShare clients support more leaves per ultrapeer, and thus tend to have fewer ultrapeers. Therefore, while our results accurately represent the top-level overlay, they could potentially under-represent BearShare users.
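The regional breakdown reported above can be reproduced from a snapshot with a few lines of code. In the sketch below, `country_of()` is a hypothetical stand-in for the GeoIP lookup, and the country-to-region table is abbreviated for illustration:

```python
# Sketch of the per-region ultrapeer breakdown (Figure 1(c)).
# country_of() is a hypothetical stand-in for a GeoIP lookup;
# the country-to-region table is abbreviated for illustration.
from collections import Counter

REGION = {"US": "NA", "CA": "NA", "MX": "NA",   # North America
          "GB": "EU", "DE": "EU", "FR": "EU",   # Europe
          "CN": "AS", "JP": "AS", "KR": "AS",   # Asia
          "BR": "SA", "AR": "SA"}               # South America

def region_breakdown(ultrapeer_ips, country_of):
    counts = Counter(REGION.get(country_of(ip), "other")
                     for ip in ultrapeer_ips)
    total = sum(counts.values())
    return {region: 100.0 * n / total for region, n in counts.items()}
```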

Figure 2: (a) Breakdown of ultrapeers by continent over the time of day (PDT), Sep. 20, 2005; (b) breakdown of ultrapeers by brand; (c) breakdown of LimeWire ultrapeers by version.

Gradual upgrading by users implies that different versions of each brand coexist at any point in time. P2P systems may need to evolve quickly in order to accommodate growing (and even changing) user demand. Otherwise, users may not observe acceptable performance and may leave the system. This raises the following fundamental question: "How rapidly and effectively can a widely-deployed P2P system evolve in order to cope with increasing user demand?" Furthermore, how is this evolution affected by the heterogeneity among coexisting implementations (and versions) and the rate of software upgrades by users?

Since LimeWire clients make up an overwhelming majority of ultrapeers, we explored the breakdown among popular versions of LimeWire. Figure 2(c) shows the percentage of LimeWire ultrapeers running each version, revealing that within 2 months of the release of a new version, most LimeWire users are running it. This is illustrated by the way the market share of a version increases from 0% to more than 50%, and only decreases when a new version appears. This behavior can be attributed to the automatic notification of new versions, coupled with the simplicity of using the P2P system itself to distribute updates quickly. The quick upgrade by users also implies that new features rapidly become widespread throughout the system. Due to the rapid deployment of new versions, "flag days" are practical in P2P systems, where new clients are configured to use a new, incompatible feature on a particular date.

4 Overlay Characteristics

In this section, we turn our attention to the evolution of the two-tier overlay topology during the past year.

Ultrapeer-Leaf Ratio: Figure 4(a) presents the changes in the percentage of ultrapeers during the past year. While this percentage fluctuates, it exhibits an overall increasing trend. Recall that leaf peers become ultrapeers only when they cannot locate a sufficient number of ultrapeers that can accept an additional leaf. This result suggests that the ability of leaves to locate available ultrapeers has not scaled as the system has grown in size. Note that the percentage of ultrapeers seems to have dropped by 1.4% between Sep. and Oct. 2005. This drop seems to be correlated with the increase in popularity of LimeWire client version 4.9 (released in Jan. '05). We will revisit this issue when we explore the ultrapeer-leaf degree distribution.

Node Degree: To examine changes in the connectivity of the overlay topology, we examine three different angles of the node degree distribution in the two-tier overlay: (i) for ultrapeers, the number of ultrapeer neighbors; (ii) for ultrapeers, the number of leaf children; and (iii) for leaves, the number of ultrapeer parents⁵. To show the evolution of the degree distribution over time, we show each angle of the degree distribution for three snapshots that are roughly six months apart. In the absence of other factors, as the population grows, one expects the distribution to change proportionally across different degree values, i.e., the ratio of peers with different degrees would remain approximately constant.

⁵We limit the range of node degree to 500 in these graphs. This range includes all but a small percentage of peers (<0.1%) with a higher degree.
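Given a snapshot in the form captured by Cruiser (each ultrapeer mapped to its neighbor and leaf lists, as in the crawler sketch of Section 2), the three degree angles can be extracted as sketched below. This illustrates the analysis rather than reproducing our actual scripts.

```python
# Computing the three node-degree angles from a crawled snapshot.
# `snapshot` maps each ultrapeer to (neighbors, leaves), as in the
# crawler sketch of Section 2.
from collections import Counter

def degree_distributions(snapshot):
    top_degree = Counter()     # (i) ultrapeer -> # ultrapeer neighbors
    leaf_children = Counter()  # (ii) ultrapeer -> # leaf children
    parents = Counter()        # per-leaf count of ultrapeer parents
    for neighbors, leaves in snapshot.values():
        top_degree[len(neighbors)] += 1
        leaf_children[len(leaves)] += 1
        for leaf in leaves:
            parents[leaf] += 1
    # (iii) leaf -> # ultrapeer parents, as a frequency distribution
    parent_degree = Counter(parents.values())
    return top_degree, leaf_children, parent_degree
```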

Figure 3: Frequency plots of observed degree for the snapshots of 10/12/04, 03/10/05, and 09/20/05: (a) ultrapeer neighbors; (b) ultrapeer children; (c) leaf parents.

Figure 3(a) shows the distribution of the number of top-level neighbors across ultrapeers in a log-log plot. All three distributions show a strong peak in the range of 20 to 30 neighbors, with a significant number of peers having fewer than 20 neighbors. Comparison of these three snapshots reveals that the peak has dramatically grown, while the number of peers with fewer than 20 neighbors has increased only slightly rather than proportionally. This implies that despite the dramatic growth in the total population, ultrapeers with open slots for neighbors quickly locate one another and form a well-connected top-level overlay.

Figure 3(b) shows the distribution of the number of leaf children across ultrapeers for three snapshots in a log-log plot. In all three snapshots, there are peaks at 30 and 45 children, corresponding to the maximums set in LimeWire and BearShare, respectively. However, unlike the number of neighbors, the peaks have not significantly increased over time. Instead, the dramatic increases have been in the number of ultrapeers with fewer children. This means that there are proportionally more ultrapeers with open slots for more children. This is the direct result of the increase in the ratio of ultrapeers to leaves shown in Figure 4(a).

Figure 3(c) shows the distribution of the number of ultrapeer parents among leaves in a log-log plot. In all three snapshots, there is a peak at 1–3 parents, with many peers having slightly more parents. While the number of peers with 1–3 parents has increased proportionally with the population, the number of peers with more parents exhibits only a minor increase. This seems reasonable given that both LimeWire and BearShare clients attempt to maintain 3 ultrapeer parents by default, and peers with fewer parents keep trying to find 3 parents. It also shows that the number of peers with presumably modified implementations has not increased.

Intra-Region Bias in Connectivity: While neighbor selection is generally random in Gnutella, one key question is whether connectivity in the Gnutella overlay topology is geographically aware; in other words, whether peers in a certain region are more likely to connect to each other than to peers in other regions. This is an important issue because it affects the efficiency of the search mechanism. For each of the four main regions, Figure 4(b) depicts the percentage of neighbors of all ultrapeers in a region that are located in the same region. If there were no bias towards intra-region connectivity, the percentage for each region should be the same as the percentage of the total population located in that region (Figure 1(c)). Figure 4(b) reveals that there is a strong bias towards intra-region connectivity, especially within smaller regions. More specifically, even though only 13.3%, 2.8%, and 2.3% of the overall population are located in EU, AS, and SA, more than 22.9%, 24.5%, and 16% of their neighbors are within the same region, respectively. This biased intra-region connectivity occurs for three reasons. First, the LimeWire client attempts to maintain at least one neighbor with the same locale setting [2], i.e., at least one neighbor whose user speaks the same language. Second, when peers contact others to identify additional neighbors, they contact more peers than actually needed and select those peers that respond faster. This simple mechanism implicitly leads to biased connectivity within each region. Third, because users in the same region tend to arrive at the same time of day (as shown in Figure 2(a)), they tend to be looking for neighbors at the same time and are more likely to find one another.
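For concreteness, the intra-region percentages plotted in Figure 4(b) can be computed from a snapshot as sketched below, assuming a `region_of()` helper built on the hypothetical GeoIP lookup from Section 3:

```python
# Percentage of top-level edges that stay within each region.
# region_of() is assumed to map a peer address to NA/EU/AS/SA,
# e.g., via the GeoIP-based lookup sketched in Section 3.
from collections import Counter

def intra_region_bias(snapshot, region_of):
    same = Counter()   # region -> neighbor links within the region
    total = Counter()  # region -> all neighbor links from the region
    for ultrapeer, (neighbors, _) in snapshot.items():
        r = region_of(ultrapeer)
        for n in neighbors:
            total[r] += 1
            if region_of(n) == r:
                same[r] += 1
    return {r: 100.0 * same[r] / total[r] for r in total if total[r]}
```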

5 19.5 3 100 100 Ultrapeers 3 + + + + ++ + ++ 19 3 3 3 3 3 33 3 18.5 80 33 80 18 3 3 33 3 2 3 33 17.5 3 60 NA AS 60 3 17 EU + SA × 3 16.5 3 3 40 40 16 2 + 22 2 22 3 . +2 + 2 +2 ++ ×+ ++ Systematic Removal 3 15 5 3 20 × ×× × ×× ×× 20 15 3 3 Random Removal + 14.5 0 0 Percentage of all peers

O N D J F M A M J J A S O Intra-Region Neighbors(%) O N D J F M A M J J A S O O N D J F M A M J J A S O Percentage of Nodes Removed Date (Month) Date (Month) Date (Month) (a) Percentage of peers that are (b) Percent of neighbors on the (c) Percent of peers removed to ultrapeers same continent cause severe fragmentation Figure 4 among other peers in the same region with the creasing difficulty locating available ultrapeers. same language and culture. Furthermore, re- We also found that within two months, most sponse time to queries will also be faster since users are running the latest version of their P2P geography is a good first-order estimator of la- file-sharing software. tency. References Resiliency to Peer Departure: Finally, we examine the resiliency of the Gnutella overlay [1] slyck.com. http://www.slyck.com, 2005. [2] G. Bildson. CTO, LimeWire, LLC. Personal corre- topology to both random and highest-degree spondence, 2005. node removal (or failure). Figure 4(c) shows the [3] J. Chu, K. Labonte, and B. N. Levine. Availability percentage of ultrapeers that must be removed and Locality Measurements of Peer-to-Peer File Sys- until the largest connected component contains tems. In ITCom: Scalability and Traffic Control in IP Networks II Conferences, 2002. less than 50% of the remaining ultrapeers (i.e., [4] M. Jovanovic, F. Annexstein, and K. Berman. overlay becomes severely fragmented) over the Modeling Peer-to-Peer Network Topologies through past year. This figure shows that more than “Small-World” Models and Power Laws. In 90% of peers must be randomly removed for the TELFOR, 2001. [5] T. Karagiannis, A. Broido, M. Faloutsos, and kc overlay to become severely fragmented. Further- claffy. Transport Layer Identification of P2P Traffic. more, the degree of resiliency has remained rela- In International Measurement Conference, 2004. tively constant during the past year. Resiliency [6] A. Rasti, D. Stutzbach, and R. Rejaie. On the Long- to the removal of the highest-degree nodes is term Evolution of the Gnutella Network. Technical clearly worse than random node removal. How- Report 2005-07, University of Oregon, 2005. [7] M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping ever, overall Gnutella is growing increasingly re- the Gnutella Network: Properties of Large-Scale silient to highest-degree removal, though with Peer-to-Peer Systems and Implications for System large fluctuations. Since these results are nor- Design. IEEE Internet Computing Journal, 6(1), malized by total population, the actual number 2002. [8] S. Saroiu, P. K. Gummadi, and S. D. Gribble. Mea- of removed ultrapeers has increased by a factor suring and Analyzing the Characteristics of Napster of 3 (i.e., n · 50% in Oct. 2004, n · 3 · 60% in Sep. and Gnutella Hosts. Multimedia Systems Journal, 2005). 8(5), 2002. [9] S. Sen and J. Wang. Analyzing Peer-To-Peer Traffic 5 Conclusions Across Large Networks. IEEE/ACM Transactions on Networking, 12(2), 2004. In this paper, we explored long-term trends in [10] D. Stutzbach and R. Rejaie. Capturing Accurate the popular Gnutella P2P file-sharing system. Snapshots of the Gnutella Network. In Global Inter- net Symposium, 2005. By examining how the system has changed over [11] D. Stutzbach and R. Rejaie. Evaluating the the course of one year, we can see how the change Accuracy of Captured Snapshots by Peer-to-Peer in scale and the introduction of new software ver- Crawlers. In Passive and Active Measurement Work- sions has impacted the system. 
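The fragmentation threshold plotted in Figure 4(c) can be estimated as sketched below. This sketch uses the networkx library and removes nodes one at a time, checking connectivity periodically; for highest-degree ("systematic") removal it orders nodes by degree in the original graph, which may differ in detail from our actual procedure.

```python
# Fraction of ultrapeers that must be removed before the largest
# connected component holds less than 50% of the remaining nodes.
# A sketch; `graph` is an undirected networkx.Graph of the top level.
import random
import networkx as nx

def removal_threshold(graph, order, step=0.01):
    g = graph.copy()
    n = g.number_of_nodes()
    check_every = max(1, int(n * step))  # check periodically for speed
    for removed, node in enumerate(order, start=1):
        g.remove_node(node)
        if removed % check_every == 0:
            largest = max((len(c) for c in nx.connected_components(g)),
                          default=0)
            if largest < 0.5 * g.number_of_nodes():
                return removed / n   # fraction removed at fragmentation
    return 1.0

def random_order(g):                 # random failure
    nodes = list(g.nodes)
    random.shuffle(nodes)
    return nodes

def degree_order(g):                 # highest-degree ("systematic") removal
    return sorted(g.nodes, key=g.degree, reverse=True)

# Usage: removal_threshold(top_level, random_order(top_level))
#        removal_threshold(top_level, degree_order(top_level))
```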
5 Conclusions

In this paper, we explored long-term trends in the popular Gnutella P2P file-sharing system. By examining how the system has changed over the course of one year, we can see how the change in scale and the introduction of new software versions have impacted the system. Our results show that ultrapeers remain effective at finding one another and filling out their target degree (30), while leaf peers are experiencing gradually increasing difficulty locating available ultrapeers. We also found that within two months, most users are running the latest version of their P2P file-sharing software.

References

[1] slyck.com. http://www.slyck.com, 2005.
[2] G. Bildson. CTO, LimeWire, LLC. Personal correspondence, 2005.
[3] J. Chu, K. Labonte, and B. N. Levine. Availability and Locality Measurements of Peer-to-Peer File Systems. In ITCom: Scalability and Traffic Control in IP Networks II Conferences, 2002.
[4] M. Jovanovic, F. Annexstein, and K. Berman. Modeling Peer-to-Peer Network Topologies through "Small-World" Models and Power Laws. In TELFOR, 2001.
[5] T. Karagiannis, A. Broido, M. Faloutsos, and kc claffy. Transport Layer Identification of P2P Traffic. In Internet Measurement Conference, 2004.
[6] A. Rasti, D. Stutzbach, and R. Rejaie. On the Long-term Evolution of the Gnutella Network. Technical Report 2005-07, University of Oregon, 2005.
[7] M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Computing Journal, 6(1), 2002.
[8] S. Saroiu, P. K. Gummadi, and S. D. Gribble. Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts. Multimedia Systems Journal, 8(5), 2002.
[9] S. Sen and J. Wang. Analyzing Peer-to-Peer Traffic Across Large Networks. IEEE/ACM Transactions on Networking, 12(2), 2004.
[10] D. Stutzbach and R. Rejaie. Capturing Accurate Snapshots of the Gnutella Network. In Global Internet Symposium, 2005.
[11] D. Stutzbach and R. Rejaie. Evaluating the Accuracy of Captured Snapshots by Peer-to-Peer Crawlers. In Passive and Active Measurement Workshop, Extended Abstract, 2005.
[12] D. Stutzbach, R. Rejaie, and S. Sen. Characterizing Unstructured Overlay Topologies in Modern P2P File-Sharing Systems. In Internet Measurement Conference, 2005.
