Peer-to-PeerArchitectureCaseStudy:GnutellaNetwork∗ MateiRipeanu [email protected] DepartmentofComputerScience TheUniversityofChicago 1100E.58thStreet,ChicagoIL60637 Tel:(773)955-4040(ext.57395) Fax:(773)702-8487 Abstract Despite recent excitement generated by the peer-to-peer (P2P) paradigm and the surprisingly rapiddeploymentofsomeP2Papplications,therearefewquantitativeevaluationsofP2Psystems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interesting P2P architecture to study. Like most other P2P applications, Gnutellabuilds,attheapplicationlevel,avirtualnetworkwithitsownroutingmechanisms.The topologyofthisvirtualnetworkandtheroutingmechanismsusedhaveasignificantinfluenceon applicationpropertiessuchasperformance,reliability,andscalability.Wehavebuilta“crawler” to extract the topology of Gnutella’s application level network. In this paper we analyze the topologygraphandevaluategeneratednetworktraffic.Thetwomajorfindingswefocusonare: (1)althoughGnutellaisnotapurepower-lawnetwork,itscurrentconfigurationhasthebenefits andthedrawbacksofapower-lawstructure,and(2)theGnutellavirtualnetworktopologydoes notmatchwelltheunderlyingInternettopology,henceleadingtoineffectiveuseofthephysical networkinginfrastructure.ThesefindingsguideustoproposechangestotheGnutellaprotocol and implementations that may bring significant performance and scalability improvements. AlthoughGnutellanetworkmightfade,webelievetheP2Pparadigmisheretostay.Inthislight, ourfindingsaswellasourmeasurementandanalysistechniquesbringpreciousinsightintoP2P systemdesigntradeoffs. Keywords:peer-to-peersystemevaluation,self-organizednetworks,power-lawnetwork, topologyanalysis.

1. Introduction Peer-to-peersystems(P2P)haveemergedasasignificantsocialandtechnicalphenomenonoverthe last year. They provide infrastructure for communities that CPU cycles (e.g., SETI@Home, Entropia) and/or storage space (e.g., , , Gnutella), or that support collaborative environments(Groove).Twofactorshavefosteredtherecentexplosivegrowthofsuchsystems:first, thelowcostandhighavailabilityoflargenumbersofcomputingandstorageresources,andsecond, increasednetworkconnectivity.Asthesetrendscontinue,theP2Pparadigmisboundtobecomemore popular. Unliketraditionaldistributedsystems,P2Pnetworksaimtoaggregatelargenumbersofcomputersthat joinandleavethenetworkfrequentlyandthatmightnothavepermanentnetwork(IP)addresses.In

∗ AnextendedversionofthispaperwaspublishedasUniversityofChicagoTechnicalReportTR-2001-26.

1

pureP2Psystems,individualcomputerscommunicatedirectlywitheachotherandshareinformation andresourceswithoutusingdedicatedservers.Acommoncharacteristicofthisnewbreedofsystems isthattheybuild, attheapplicationlevel, avirtualnetworkwithitsownroutingmechanisms. The topology of the virtual network and the routing mechanisms used have a significant impact on application properties such as performance, reliability, and, in some cases, anonymity. The virtual topologyalsodeterminesthecommunicationcostsassociatedwithrunningtheP2Papplication,bothat individualhostsandintheaggregate.NotethatthedecentralizednatureofpureP2Psystemsmeans thatthesepropertiesareemergentproperties,determinedbyentirelylocaldecisionsmadebyindividual resources, based only on local information: we are dealing with a self-organized network of independententities. Theseconsiderationshavemotivatedustoconductadetailedstudyofthetopologyandprotocolsofa popularP2Psystem:Gnutella.Inthisstudy,webenefitedfromGnutella’slargeexistinguserbaseand open architecture, and, in effect, use the public Gnutella network as a large-scale, if uncontrolled, testbed.Weproceededasfollows.First,wecapturedthenetworktopology,itsgeneratedtraffic,and dynamicbehavior.Then,weusedthisrawdatatoperformamacroscopicanalysisofthenetwork,to evaluatecostsandbenefitsoftheP2Papproach,andtoinvestigatepossibleimprovementsthatwould allowbetterscalingandincreasedreliability. OurmeasurementsandanalysisoftheGnutellanetworkaredrivenbytwoprimaryquestions.Thefirst concernsitsconnectivitystructure.Recentresearch[1,8,7]showsthatnetworksasdiverseasnatural networksformedbymoleculesinacell,networksofpeopleinasocialgroup,ortheInternet,organize themselvessothatmostnodeshavefewlinkswhileatinynumberofnodes,calledhubs,havealarge numberoflinks.[14]findsthatnetworksfollowingthisorganizationalpattern(power-lawnetworks) display an unexpected degree of robustness: the ability of their nodes to communicate is unaffected evenbyextremelyhighfailurerates.However,errortolerancecomesatahighprice:thesenetworks are vulnerable to attacks, i.e., to the selection and removal of a few nodes that provide most of the network’s connectivity. We show here that, although Gnutella is not a pure power-law network, it preserves good fault tolerance characteristics while being less dependent than a pure power-law networkonhighlyconnectednodesthatareeasytosingleout(andattack). The second question concerns how well (if at all) Gnutella virtual network topology maps to the physicalInternetinfrastructure.Therearemultiplereasonstoanalyzethisissue.First,itisaquestion ofcrucialimportanceforInternetServiceProviders(ISP):ifthevirtualtopologydoesnotfollowthe physicalinfrastructure,thentheadditionalstressontheinfrastructureand,consequently,thecostsfor ISPs,areimmense.Thispointhasbeenraisedonvariousoccasions[9,12]but,asfarasweknow,we are the first to provide a quantitative evaluation on P2P application and Internet topology (mis)matching.Second,thescalabilityofanyP2Papplicationisultimatelydeterminedbyitsefficient useofunderlyingresources. WearenotthefirsttoanalyzetheGnutellanetwork.Inparticular,theDistributedSearchSolutions (DSS)group[15]haspublishedresultsoftheirGnutellasurveys[4,5],andothershaveusedtheirdata toanalyzeGnutellausers’behavior[2]andtoanalyzesearchprotocolsforpower-lawnetworks[6]. However,ournetworkcrawlingandanalysistechnology(developedindependentlyofthiswork)goes significantly further in terms of scale (both spatial and temporal) and sophistication. While DSS presentsonlyrawfactsaboutthenetwork,weanalyzethegeneratednetworktraffic,findpatternsin networkorganization,andinvestigateitsefficiencyinusingtheunderlyingnetworkinfrastructure. Therestofthepaperisstructuredasfollows:thenextsectionsuccinctlydescribesGnutellaprotocol andapplication.Section3introducesthecrawlerwedevelopedtodiscoverGnutella’svirtualnetwork

2

topology.InSection4weanalyzethenetworkandanswerthequestionsintroducedintheprevious paragraphs.WeconcludeinSection5.

2. GnutellaProtocol:DesignGoalsandDescription The Gnutella protocol [3] is an open, decentralized group membership and search protocol, mainly usedforfile.ThetermGnutellaalsodesignatesthevirtualnetworkofInternetaccessiblehosts runningGnutella-speakingapplications(thisisthe“Gnutellanetwork”wemeasure)andanumberof smaller,andoftenprivate,disconnectednetworks. AsmostP2Pfilesharingapplications,Gnutellaprotocolwasdesignedtomeetthefollowinggoals: o Abilitytooperateinadynamicenvironment.P2Papplicationsoperateindynamicenvironments, wherehostsmayjoinorleavethenetworkfrequently.Theymustachieveflexibilityinorderto keepoperatingtransparentlydespiteaconstantlychangingsetofresources. o Performance and Scalability. P2P paradigm shows its full potential only on large-scale deploymentswherethelimitsofthetraditional/serverparadigmbecomeobvious.Moreover, scalabilityisimportantasP2Papplicationsexhibitwhateconomistscallthe“networkeffect”[10]: thevalueofanetworktoanindividualuserincreaseswiththetotalnumberofusersparticipatingin the network. Ideally, when increasing the number of nodes, aggregate storage space and file availabilityshould growlinearly, responsetime shouldremainconstant, whilesearchthroughput shouldremainhighorgrow. o Reliability.Externalattacksshouldnotcausesignificantdataorperformanceloss. o Anonymity. Anonymity is valued as a means to protect privacy of people seeking or providing informationthatmaynotbepopular. Gnutella nodes, calledservents by developers, perform tasks normally associated with both SERVers andcliENTS.Theyprovideclient-sideinterfacesthroughwhichuserscanissuequeriesandviewsearch results,acceptqueriesfromotherservents,checkformatchesagainsttheirlocaldataset,andrespond withcorrespondingresults.Thesenodesarealsoresponsibleformanagingthebackgroundtrafficthat spreadstheinformationusedtomaintainnetworkintegrity. Inordertojointhesystemanewnode/serventinitiallyconnectstooneofseveralknownhoststhatare almostalwaysavailable(e.g.,gnutellahosts.com).Onceattachedtothenetwork(havingoneormore openconnectionswithnodesalreadyinthenetwork),nodessendmessagestointeractwitheachother. Messagescanbebroadcasted(i.e.,senttoallnodeswithwhichthesenderhasopenTCPconnections) orsimplyback-propagated(i.e.,sentonaspecificconnectiononthereverseofthepathtakenbyan initial, broadcasted, message). Several features of the protocol facilitate this broadcast/back-propagation mechanism. First, each message has a randomly generated identifier. Second, each node keeps a short memory of the recently routed messages, used to prevent re-broadcastingandimplementback-propagation.Third,messagesareflaggedwithtime-to-live(TTL) and“hopspassed”fields. Themessagesallowedinthenetworkare: ° GroupMembership(PINGandPONG)Messages.Anodejoiningthenetworkinitiatesabroadcasted PINGmessagetoannounceitspresence.WhenanodereceivesaPINGmessageitforwardsittoits neighborsandinitiatesaback-propagatedPONGmessage.ThePONGmessagecontainsinformation aboutthenodesuchasitsIPaddressandthenumberandsizeofsharedfiles. ° Search(QUERYandQUERYRESPONSE)Messages.QUERYmessagescontainauserspecifiedsearch string, each receiving node matches against locally stored file names. QUERY messages are

3

broadcasted. QUERY RESPONSES are back-propagated replies to QUERY messages and include informationnecessarytoafile. ° FileTransfer(GETandPUSH)Messages.Filearedonedirectlybetweentwopeersusing GET/PUSHmessages. To summarize: to become a member of the network, a servent (node) has to open one or many connectionswithnodesthatarealreadyinthenetwork.InthedynamicenvironmentwhereGnutella operates, nodes often join and leave and network connections are unreliable. To cope with this environment, after joining the network, a node periodically PINGs its neighbors to discover other participatingnodes.Usingthisinformation,adisconnectednodecanalwaysreconnecttothenetwork. Nodes decide where to connect in the network based only on local information, and thus forming a dynamic,self-organizingnetworkofindependententities.Thisvirtual,applicationlevelnetworkhas Gnutella servents at its nodes and open TCP connections as its links. In the following sections we presentoursolutiontodiscoverthisnetworktopologyandanalyzeitscharacteristics.

3. DataCollection:TheCrawler Wehavedevelopedacrawlerthatjoinsthenetworkasaserventandusesthemembershipprotocol(the PING-PONGmechanism)tocollecttopologyinformation.Inthissectionwebrieflydescribethecrawler anddiscussotherissuesrelatedtodatacollection. The crawler starts with a list of nodes, initiates a TCP connection to each node in the list, sends a generic join-in message (PING), and discovers the neighbors of the node it contacted based on the replies it gets back (PONG messages). Newly discovered neighbors are added to the list. For each discoverednodethecrawlerstoresitsIPaddress,port,thenumberoffilesandthetotalspaceshared. Westartedwithashort,publiclyavailablelistofinitialnodes,butintimewehaveincrementallybuilt ourownlistwithmorethan400,000nodesthathavebeenactiveatonetimeoranother. Wefirstdevelopedasequentialversionofthecrawler.Usingempiricallydeterminedoptimalvalues forconnectionestablishmenttimeoutaswellasforconnectionlisteningtimeout(thetimeintervalthe crawlerwaitstoreceivePONGsafterithassentaPING),asequentialcrawlofthenetworkprovedslow: about50hoursevenforasmallnetwork(4000nodes).Thisslowsearchspeedhastwodisadvantages: notonlyitisnotscalable,butbecauseofthedynamicnetworkbehavior,theresultofourcrawlisfar fromanetworktopologysnapshot. Inordertoreducethecrawlingtime,wenextdevelopedadistributedcrawlingstrategy.Ourdistributed crawlerhasaclient/serverarchitecture:theserverisresponsiblewithmanagingthelistofnodestobe contacted, assembling the final graph, and assigning work to clients. Clients receive a small list of initialpointsanddiscoverthenetworktopologyaroundthesepoints.Althoughwecouldusealarge numberofclients(easilyintheorderofhundreds),wedecidedtouseonlyupto50clientsinorderto reducetheinvasivenessofoursearch.Thesetechniqueshaveallowedustoreducethecrawlingtimeto acoupleofhoursevenforalargelistofstartingpointsandadiscoveredtopologygraphwithmorethan 30,000activenodes. Notethatinthefollowingweuse aconservativedefinitionofnetworkmembership:weexcludethe nodes that, although were reported as part of the network, our crawler could not connect to. This situation might occur when the local servent is configured to allow only a limited number of TCP connectionsorwhenthenodeleavesthenetworkbeforethecrawlercontactsit.

4

4. GnutellaNetworkAnalysis WestartbypresentingbrieflyGnutellanetworkgrowthtrendsanddynamicbehavior.Ourdatashows that (Section 4.1), although over the past 6 months Gnutella overhead traffic has been decreasing, currentlythegeneratedtrafficvolumerepresentsasignificantpercentageoftotalInternettrafficandis amajorobstacletofurthergrowth.Wecontinuewithamacroscopicanalysisofthenetwork:westudy first connectivity patterns (Section 4.2) and then the mapping of the Gnutella topology to the underlyingnetworkinginfrastructure(Section4.3). Figure1presentsthegrowthoftheGnutellanetworkinthepast6months.Weranourcrawlerduring November 2000, February/March 2001, and May 2001. While in November 2000 the largest connectedcomponentofthenetworkwefoundhad2,063hosts,thisgrewto14,949hostsinMarchand 48,195hostsinMay2001.AlthoughGnutella’sfailuretoscalehasbeenpredictedtimeandagain,the numberofnodesinthelargestnetworkcomponentgrewabout25times(admittedlyfromalowbase) inthepast6months. Weidentifythreefactorsthatallowedthenetworkthisexceptionalgrowthinresponsetouserpressure. First,asweargueinSection4.1,carefulengineeringledtosignificantoverheadtrafficdecreasesover the last six months. Second, the network connectivity of Gnutella participating machines improved significantly.Ourroughestimate(basedontracingDNShostnames)isthatthenumberofDSLor cable-connectedmachinesgrewtwiceasfastastheoverallnetworksize.WhileinNovember2000 about24%ofthenodeswereDSLorcablemodemenabled,thisnumbergrewtoabout41%sixmonths later.Finally,theeffortsmadetobetteruseavailablenetworkingresourcesbysendingnodeswithlow availablebandwidthattheedgesofthenetworkeventuallypaidoff. It is worth mentioning that the number of connected components is relatively small: the largest connectedcomponentalwaysincludesmorethan95%oftheactivenodesdiscovered,whilethesecond biggestconnectedcomponentusuallyhaslessthan10nodes. d

50 o

. . c 25 MessageFrequency t GnutellaNetworkGrowth

e Ping s s e ) r g

0 Push e r 0 p a

40 l 0 ' s 20 Query ( e e t h g

t Other a n s e n s i n

30 e o s 15 e m p d m o o n c 20 f k o r 10 r o e w b t

e 10 m n u 5 N - 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 / / / / / / / / / / / / / / / / / / 0 9 8 1 7 6 5 4 3 2 1 0 9 8 0 1 5 8 7 1 5 9 3 6 9 2 4 2 6 2 4 9 3 5 8 1 4 7 0 3 6 9 2 4 7 2 2 2 2 2 0 0 0 1 1 1 2 2 1 1 2 2 2 / / / / / / / / / / / / / / / / / / 1 1 1 2 2 2 2 3 3 3 1 1 1 1 2 3 3 3 3 3 3 3 3 5 5 5 5 5 minute 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Figure1:Gnutellanetworkgrowth.Theplotpresentsthe Figure2:Generatedtraffic(messages/sec)inNov.2000 numberofnodesinthelargestconnectedcomponentinthe classified by message type over a 376 minute period. network. Data collected during Nov. 2000, Feb./March Note that overhead traffic (PING messages, that serve 2001 and May 2001. We found significantly a larger onlytomaintainnetworkconnectivity)formedmorethan network around Memorial Day (24-28 May) and 50%ofthetraffic.Theonly‘true’usertrafficisQUERY Thanksgivings,whenapparentlymorepeopleareonline. messages. Traffic become more efficient by May 2001. Usingrecordsofsuccessivecrawls,weinvestigatethedynamicgraphstructureovertime.Wediscover thatabout40%ofthenodesleavethenetworkinlessthan4hours,whileonly25%ofthenodesare alive for more than 24 hours. Given this dynamic behavior, it is important to find the appropriate

5

tradeoffbetweendiscoverytimeandinvasivenessofour crawler. Increasingthenumberofparallel crawling tasks reduces discovery time but increases the burden on the application. Obviously, the Gnutellamapourcrawlerproducesisnotanexact‘snapshot’ofthenetwork.However,wearguethat thenetworkgraphweobtainisclosetoasnapshotinastatisticalsense:allpropertiesofthenetwork: size,diameter,averageconnectivity,andconnectivitydistributionarepreserved.

4.1EstimateofGnutellaGeneratedTraffic Weusedamodifiedversionofthecrawlertoeavesdropthetrafficgeneratedbythenetwork.InFigure 2 we classify, according to message type, the traffic that goes across one randomly chosen link in November 2000. After adjusting for message size, we find that, on average, only 36% of the total traffic(in)isuser-generatedtraffic(QUERYmessages).Therestisoverheadtraffic:55%usedto maintain group membership (PING and PONG messages) while 9% contains either non-standard messages(1%)orPUSHmessagesbroadcastbyserventsthatarenotcompliantwiththelatestversionof theprotocol.Apparently,byJune2001,theseengineeringproblemsweresolvedwiththearrivalof newerGnutellaimplementations:generatedtrafficcontains92%QUERYmessages,8%PINGmessages andinsignificantlevelsofothermessagetypes. Given the small diameter of the network (any two nodes are generally less than 7 hops away, see Figure 3), the message time-to-live (TTL=7) preponderantly used, and the flooding-based routing algorithm employed, most links support similar traffic. We verified this theoretical conclusion by measuring the traffic at multiple, randomly chosen, nodes. As a result,the total Gnutella generated trafficisproportionaltothenumberofconnectionsinthenetwork.Basedonourmeasurementswe estimate the total traffic (excluding file transfers) for a large Gnutella network as 1Gbps (170,000 connections for a 50,000 nodes large Gnutella network times 6Kbps per connection) or about 330TB/month.Toputthistrafficvolumeintoperspectivewenotethatitamountstoabout1.7%of totaltrafficinUSInternetbackbonesinDecember2000(asreportedin[16]).Weinferthatthevolume of generated traffic is an important obstacle for further growth and that efficient use of underlying networkinfrastructureiscrucialforbetterscalingandwiderdeployment. ) )

50% 0 200 % Searchcharacteristichs Graphconnectivity. ( 0 0 s ' r ( i a s p k

40% n e

i 150 l d f o o n r f 30% e o b t

n 100 m e u c r 20% N e P 50 10%

0% 0 1 2 3 4 5 6 7 8 9 10 11 12 0 10000 20000 30000 40000 50000 Node-to-nodeshortestpath(hops) Numberofnodes Figure 3: Distribution of node-to-node shortest paths. Figure 4: Average node connectivity. Each point Each line represents one network measurement. Note representsoneGnutellanetwork.Notethat,asthenetwork that,althoughthelargestnetworkdiameter(thelongest grows, the average number of connections per node node-to-nodepath)is12,morethan95%ofnodepairs remains constant (average node connectivity is 3.4 areatmost7hopsaway connectionspernode). Oneinterestingfeatureofthenetworkisthat,overaseven-monthperiod,withthenetworkscalingup almost two orders of magnitude, the average number of connections per node remained constant

6

(Figure 4). Assuming this invariant holds, it is possible to estimate the generated traffic for larger networksandfindscalabilitylimitsbasedonavailablebandwidth.

4.2.ConnectivityandReliabilityinGnutellaNetwork.Power-lawDistributions. WhenanalyzingglobalconnectivityandreliabilitypatternsintheGnutellanetwork,itisimportantto keep in mind the self-organized network behavior: users decide only the maximum number of connections a node should support, while nodes decide whom to connect to or when to drop/add a connectionbasedonlyonlocalinformation. Recentresearch[1,7,8,13]showsthatmanynaturalnetworkssuchasmoleculesinacell,speciesinan ecosystem,andpeopleinasocialgrouporganizethemselvesassocalledpower-lawnetworks.Inthese networksmostnodeshavefewlinksandatinynumberofhubshavealargenumberoflinks.More specifically,inapower-lawnetworkthefractionofnodeswithLlinksisproportionalto L−k ,wherek isanetworkdependentconstant. ThisstructurehelpsexplainingwhynetworksrangingfrommetabolismstoecosystemstotheInternet aregenerallyhighlystableandresilient,yetpronetooccasionalcatastrophiccollapse[14].Sincemost nodes(molecules,Internetrouters,Gnutellaservents)aresparselyconnected,littledependsonthem:a largefractioncanbetakenawayandthenetworkstaysconnected.But,ifjustafewhighlyconnected nodes are eliminated, the whole system could crash. One implication is that these networks are extremelyrobustwhenfacingrandomnodefailures,butvulnerabletowell-plannedattacks. Giventhediversityofnetworksthatexhibitpower-lawstructureandtheirpropertieswewereinterested to determine whether Gnutella falls into the same category. Figure 5 presents the connectivity distribution in Nov. 2000. Although data are noisy (due to the small size of the networks), we can easilyrecognizethesignatureofapower-lawdistribution:theconnectivitydistributionappearsasa line on a log-log plot. [6,4] confirm that early Gnutella networks were power-law. Later measurements(Figure6)however,showthatmorerecentnetworksmoveawayfromthisorganization: therearetoofewnodeswithlowconnectivitytoformapurepower-lawnetwork.Inthesenetworks thepower-lawdistributionispreservedfornodeswithmorethan10linkswhilenodeswithfewerlink followanalmostconstantdistribution.

10000 10000 Nodeconnectivitydistribution. Nodeconnectivitydistribution. ) e l g a c o l s 1000 1000 ( g s o e l ( d ) o s e n e l d a 100 f 100 c o o s n r f e o b . m m 10 u

u 10 N N

1 1 1 10 100 1 10 100 Numberoflinks(logscale) Numberoflinks(logscale) Figure 5: Connectivity distribution during November Figure6:ConnectivitydistributionsduringMarch2001. 2000. Each series of points represents one Gnutella Each series of points represents one Gnutella network networktopologywediscoveredatdifferenttimesduring topology discovered during March 2001. Note the log that month. Note the log scale on both axes. Gnutella scale on both axes. Networks crawled during May/June nodesorganizedthemselvesintoapower-lawnetwork. 2001showasimilarpattern.

7

An interesting issue is the impact of this new, multi-modal distribution on network reliability. We believethatthemoreuniformconnectivitydistributionpreservesthenetworkcapabilitytodealwith randomnodefailureswhilereducingthenetworkdependenceonhighlyconnected,easytosingleout (andattack)nodes.

4.3.InternetInfrastructureandGnutellaNetwork Peer-to-peer computing brings an important change to the way we use the Internet: it enables computers sitting at the edges of the network to act as both clients and servers. As a result, P2P applications radically change the amount of bandwidth the average Internet user consumes. Most Internet Service Providers (ISP) use flat rates to bill their clients. If P2P applications become ubiquitous,theycouldbreaktheexistingbusinessmodelsofmanyISPsandforcethemtochangetheir pricingscheme[9]. Given the considerable traffic volume a P2P application generates (see our Gnutella estimates in previoussection)itiscrucialthatitemployswellavailablenetworkingresources.Thescalabilityofa P2Papplicationisultimatelydeterminedbyhowefficientlyitusestheunderlyingnetwork.Gnutella’s store-and-forwardarchitecturemakesthevirtualnetworktopologyextremelyimportant.Thelargerthe mismatchbetweenthenetworkinfrastructureandtheP2Papplication’svirtualtopology,thebiggerthe “stress” on the infrastructure. In the following we investigate whether the self-organizing Gnutella networkshapesitstopologytomapwellonthephysicalinfrastructure. Letusfirstpresentanexampletohighlighttheimportanceofa“fitting”virtualtopology.InFigure7, eighthostsparticipateinaGnutellalikenetwork.Weuseblack,solidlinestopresenttheunderlying networkinfrastructureandblue,dottedlinesforapplicationvirtualtopology. Intheleftpicture,the virtualtopologycloselymatchestheinfrastructure.Intherightpicture,thevirtualtopology,although functionally similar, does not match the infrastructure. In this case the traffic the link D-E has to supportissixtimeshigher. F A F A

E E B D B G G D C H C H

Figure7:Gnutella’svirtualnetworktopology(blue,dottedarrows)mappingontheunderlyingnetwork infrastructure(black).Leftpicture:perfectmapping.Rightpicture:inefficientmapping;linkD-Eneeds tosupportsixtimeshighertraffic. Unfortunately,itisprohibitivelyexpensivetomapGnutellaontheInternetdetailedtopology(firstly, duetotheinherentdifficultyofextracting Internettopology andsecondly,duetothecomputational scaleoftheproblem).Instead,weproceedwithtwohighlevelexperimentsthathighlightthemismatch betweenthetopologiesofthetwonetworks. TheInternetisacollectionofAutonomousSystems(AS)whichareconnectedbyrouters.ASs,inturn, arecollectionsoflocalareanetworksunderasingletechnicaladministration.FromanISPpointof view traffic crossing AS borders is more expensive than local traffic. We found that only 2-5% of GnutellaconnectionslinknodeslocatedwithinthesameAS,althoughmorethan40%ofthesenodes

8

are located in the top ten ASs. This indicates that, although unforced by node distribution, most GnutellageneratedtrafficcrossesASborderbeingthusmoreexpensivetohandle. Inthesecondexperimentweassumethatthehierarchicalorganizationofdomainnamesmirrorsthatof theInternetinfrastructure.Forexample,itislikelythatcommunicationcostsbetweentwohostsinthe “uchicago.edu” domain are significantly smaller than between “uchicago.edu” and “sdsc.edu.” The underlyingassumptionhereisthatdomainnamesexpresssomesortoforganizationalhierarchy and thatorganizationstendtobuildnetworksthatexploitlocalitywithinthathierarchy. InordertostudyhowwellGnutellavirtualtopologymapsontotheInternetpartitioningasdefinedby domainnames,wedividetheGnutellavirtualtopologygraphintoclusters,i.e.,subgraphswithhigh interior connectivity. Given the flooding-like routing algorithm used by Gnutella, it is within these clustersthatmostloadisgenerated.Wearethereforeinterestedtoseehowwelltheseclustersmapon thepartitioningdefinedbythedomainnamingscheme. Weuseasimpleclusteringalgorithmbasedontheconnectivitydistributiondescribedearlier:wedefine asclusterssubgraphsformedbyonehubwithitsadjacentnodes.Iftwoclustershavemorethan25% nodesincommon,wemergethem.Aftertheclusteringisdone,we(1)assignnodesthatareincluded inmorethanoneclusteronlytothelargestclusterand(2)formalastclusterwithnodesthatarenot includedinanyothercluster. We define the entropy [24] of a set C, containing |C| hosts, each labeled with one of the n distinct domainnames,as: n = (− − − − ) E(C) ƒ pi log( pi ) (1 pi )log(1 pi ) , i=1 where pi istheprobabilityofrandomlypickingahostwithdomainnamei.

Wethendefinetheentropyofaclusteringofagraphofsize|C|,clusteredinkclusters C1 ,C2 ,...,Ck of = + + + sizes C1 , C2 ,..., Ck ,with C C1 C2 ... Ck ,as: k C E(C ,C ,...C ) = i * E(C ) 1 2 k ƒ + + + i i=1 C1 C2 ... Ck ≥ We base our reasoning on the property that E(C) E(C1 ,C2 ,...,Ck ) no matter how the clusters

C1 ,C2 ,...,Ck arechosen.Iftheclusteringmatchesthedomainpartitioning,thenweshouldfindthat >> E(C) E(C1 ,C2 ,...,Ck ) . Conversely, if the clustering C1 ,C2 ,...,Ck has the same level of randomnessasintheinitialsetC,thentheentropyshouldremainlargelyunchanged.Essentially,the entropyfunctionisusedheretomeasurehowwellthetwopartitionsappliedonsetnodesmatch:the firstpartitionusestheinformationcontainedindomainnames,whilethesecondusesthe clustering heuristic.Notethatalargeclassofdataminingandmachinelearningalgorithmsbasedoninformation gains(ID3,C4.5,etc.[25])useasimilarargumenttobuildtheirdecisiontrees. We performed this analysis on 10 topology graphs collected during February/March 2001. We detectednosignificantdecreaseinentropyafterperformingtheclustering(alldecreaseswerewithin lessthan8%fromtheinitialentropyvalue).Consequently,weconcludethatGnutellanodesclusterin awaythatiscompletelyindependentfromtheInternetstructure.AssumingthattheInternetdomain namestructureroughlymatchestheunderlyingtopology(thecostofsendingdatawithinadomainis smaller than that of sending data across domains), we conclude that the self-organizing Gnutella networkdoesnotefficientlyusetheunderlyingphysicalinfrastructure.

9

5. SummaryandPotentialImprovements SociologicalcircumstancesthathavefosteredthesuccessofGnutellanetworkmightchangeandthe networkmightfade.P2P,however,“isoneofthoserareideasthatissimplytoogoodtogoaway” [18].Despiterecentexcitementgeneratedbythisparadigmandthesurprisinglyrapiddeploymentof some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interestingP2Parchitecturetostudy.Ourmeasurementandanalysistechniquescanbeusedformost P2Psystemstoenhancegeneralunderstandingofdesigntradeoffs. OuranalysisshowsthatGnutellanodeconnectivityfollowsamulti-modaldistribution:combining a power law and a quasi-constant distribution. This property keeps the network as reliable as a pure power-lawnetworkwhenassumingrandomnodefailures,andmakesithardertoattackbyamalicious adversary. Gnutella takes few precautions to ward off potential attacks. For example, the network topologyinformationthatweobtainhereiseasytoobtainandwouldpermithighlyefficientdenial-of- serviceattacks.Someformofsecuritymechanismsthatwouldpreventanintrudertogathertopology informationappearsessentialforthelong-termsurvivalofthenetwork. Wehaveestimatedthat,asofJune2001,thenetworkgeneratesabout330TB/monthonlytoremain connectedandbroadcastuserqueries.Thistrafficvolumerepresentsasignificantfractionofthetotal Internet traffic and makesthe future growth of Gnutella network particularly dependent on efficient network usage. We have also documented the topology mismatch between the self-organized, applicationlevelGnutellanetworkandtheunderlyingphysicalnetworkinginfrastructure.Webelieve thishasmajorimplicationsforthescalabilityoftheInternet(or,equivalently,forthebusinessmodels of ISPs). This problem must be solved if Gnutella or similarly built systems are to reach larger deployment. We see two directions for improvement. First, we observe that the application-level topology determinesthevolumeofgeneratedtraffic,thesearchsuccessrate,andtheapplicationreliability.We imagineanagentthatconstantlymonitorsthenetworkandintervenesbyaskingserventstodroporadd linksasnecessarytokeepthenetworktopologyefficient.Additionally,agents(ornodes)couldlearn about the underlying physical network and build the virtual application topology accordingly. Note thatimplementingthisidearequiressomeminimalprotocolmodifications. A second, orthogonal, direction is to replace flooding with a smarter (less expensive in terms of communicationcosts)routingandgroupcommunicationmechanism.Recentresearchprojects:Chord [19],CAN[21],SDS[23]orOceanStore[22]focusonbuilding Intenetscaleoverlaynetworks and offeravastarrayofchoicesfutureGnutellaimplementationscouldbuildon.

6. Acknowledgements IamgratefultoIanFoster,AdrianaIamnitchi,LarryLidz,ConorMcGrath,DustinMitchell,andAlain Royfortheirinsightfulcommentsandgeneroussupport.Thisworkstartedasajointclassprojectwith YugoNakaiandXuehaiZhang.RongguanJin,KnoxMcMurry,andYanWangparticipatedinrefining thiswork. ThisresearchwassupportedbytheNationalScienceFoundationundercontractITR-0086044.

7. References [1]M.Faloutsos,P.Faloutsos,C.Faloutsos,OnPower-LawRelationshipsoftheInternetTopology,SIGCOMM 1999. [2]E.Adar,B.Huberman,FreeridingonGnutella,FirstMondayVol5-10–Oct.2,2000.

10

[3]TheGnutellaprotocolspecificationv4.0-http://dss.clip2.com/GnutellaProtocol04.pdf [4]DSSGroup,Gnutella:TotheBandwidthBarrierandBeyond,http://dss.clip2.com,Nov.6,2000. [5]DSSGroup,BandwidthBarrierstoGnutellaNetworkScalability,http://dss.clip2.com,Sept.8,2000. [6]LadaA.Adamic,RajanM.Lukose,AmitR.Puniyani,B.Huberman,SearchinPowe-LawNetworks. [7]A.Broder,R.Kumar,F.Maghoul,P.Raghavan,S.Rajagopalan,R.Stata,A.TomkinsandJ.Wiener,Graph structureintheweb,8thInternationalWWWConference,May15-19Amsterdam. [8]A.BarabasiandR.Albert.Emergenceofscalinginrandomnetworks,Science,286(509),1999. [9]ToddSpangle,TheHiddenCostOfP2P,InteractiveWeek,February26,2001. [10]M.Katz,C.Shapiro,SystemsCompetitionandNetworkEffects,JournalofEconomicPerspectives,vol.8, no.2,pp.93-115,1994. [11]T.Cover,J.Thomas,ElementsofInformationTheory,Wiley,1991. [12]B.St.Arnaud,ScalingIssuesonInternetNetworks,TechnicalReport,CANARIEInc. [13]A,Barabási,R.Albert,H.Jeong,G.Bianconi,Power-lawdistributionoftheWorldWideWeb,Science 287,(2000). [14]R.Albert,H.Jeong,A.Barabási,Attakandtoleranceincomplexnetworks,Nature406378(2000). [15]http://dss.clip2.com [16]K.Coffman,A.Odlyzko,Internetgrowth:Istherea"Moore'sLaw"fordatatraffic?,HandbookofMassive DataSets,J.Abello&alleditors.,Kluwer,2001. [17]J.Han,M.Kamber,DataMining:ConceptsandTechniques,MorganKaufmann,August2000. [18]TheEconomist,InventionistheEasyBit,TheEconomistTechnologyQuarterly6/23/01. [19]IonStoica,RobertMorris,DavidKarger,M.FransKaashoek,andHariBalakrishnan,Chord:AScalable Peer-to-peerLookupServiceforInternetApplications,SIGCOMM2001. [20] S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker A Scalable Content-Addressable Network. Submittedforpublication,2000 [21] Ben Zhao,JohnKubiatowicz, AnthonyJoseph. Tapestry: Aninfrastructure for wide-area fault tolerant locationandrouting [22]ZacharyIves,AlonLevy,JayantMadhavan,RachelPottinger,StefanSaroiu,IgorTatarinov,ShioriBetzler, QiongChen,EwaJaslikowska,JingSu,W.T.TheodoraYeung.Self-OrganizingDataSharingCommunities withSAGRES,SIGMOD2000,Dallas,TX. [23]ToddD.Hodes,StevenE.Czerwinski,BenY.Zhao,AnthonyD.Joseph,RandyH.Katz,AnArchitecture forSecureWide-AreaServiceDiscovery,ACMBaltzerWirelessNetworks:selectedpapersfromMobiCom 1999. [24]T.Cover,J.Thomas,ElementsofInformationTheory,Wiley,1991. [25]J.Han,M.Kamber,DataMining:ConceptsandTechniques,MorganKaufmann,August2000.

11