The Popularity Parameter in Unstructured P2P File Networks

JAIME LLORET, JUAN R. DIAZ, JOSE M. JIMÉNEZ, MANUEL ESTEVE Department of Communications Polytechnic University of Valencia Camino de Vera s/n, 46022 Valencia SPAIN

Abstract: - Since P2P became extremely popular among Internet users, many researchers have tried to model P2P networks. One of the parameters used in these models is the popularity of a file. Some articles show that the more popular a file is, the higher the probability of finding it inside the P2P network. This article deals with the popularity parameter in P2P file-sharing networks. To study it, the unstructured public domain Peer-to-Peer networks Gnutella, FastTrack, OpenNap, eDonkey, Soulseek and MP2P have been measured. The authors have established a relationship between the results for some films, songs, programs and documents in web search engines and the results for the same files in public domain P2P file-sharing networks. If all the analyzed Peer-To-Peer file-sharing networks were interconnected, the probability of finding a desired file would increase. On the other hand, the analyzed P2P networks seem to be specialized in different types of files, as shown in the paper.

Key-Words: - Peer to peer, File Popularity, File Search, Peer-To-Peer Interconnection.

1 Introduction
Since the Internet became accessible to the world, one of users' first concerns has been finding the file or the information they are looking for. A measurement study [1] of the deep Web reveals that it contains nearly 550 billion pages and that it is doubling each year. On the other hand, the surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million documents per day, and the deep Web is approximately 500 times larger than the part visible to conventional search engines. Nowadays there are many web search engines [2] and many of them have billions of textual documents indexed [3]. Web search engines can be classified in three types:
- Crawler-Based Search Engines, such as Google, which create their listings automatically. They "crawl" or "spider" documents by following one hypertext link to another, then people search through what they have found.
- Human-Powered Directories, such as Open Directory, which depend on humans for their listings. People have to submit a short description of their entire site to the directory. A search looks for matches only in the descriptions submitted.
- Hybrid Search Engines, such as MSN Search, which are maintained by a combination of the previous two types and present both kinds of results.

Web search engines employ some kind of centralized algorithm. In order to return the best results, these algorithms use the location/frequency method (search engines check whether the search keywords appear near the top of a web page and how often the keywords appear in relation to other words in the page) and off-the-page factors (such as clickthrough measurement). These are the major factors in how search engines determine the popularity of a document. Search results are usually sorted in popularity order.

Currently there are many P2P file-sharing networks in existence, and many of them have millions of on-line users and millions of shared files [4]. In this type of network, what a user really wants is to find the file he or she is looking for.

The probability of finding a desired file in the network where a user is searching is associated with the popularity of the file. Other parameters, such as the type of file that can be shared, the availability of the file and its replication, are also considered. In order to have real search measurements for some films, songs, programs and documents, we have selected some of the most popular public domain P2P file-sharing networks. The selected networks are Gnutella [5], FastTrack [6], OpenNap [7], eDonkey [8], Soulseek [9] and MP2P [10]. Although there are other networks [11], we have selected these because they are very popular among Internet users.

On the other hand, we have selected two crawler-based search engines, Google and Altavista, and one search directory, Yahoo!, in order to look for the same files in Web search engines.

Later on, a relationship is established between the results obtained in web search engines and the results obtained in the aforementioned peer-to-peer file-sharing networks. It gives us the popularity of those files.

This paper is structured as follows. Section 2 discusses the search techniques used in Peer-To-Peer file-sharing networks. Section 3 describes the popularity parameter. Section 4 shows the measurements taken in the selected Peer-To-Peer file-sharing networks and Web search engines, as well as the relationship between them. Section 5 discusses how the probability of finding a desired file in Peer-To-Peer file-sharing networks can be increased. Finally, Section 6 gives the conclusions and future work.

2 Search Techniques in Peer to Peer File Sharing Networks
In order to find a file in a P2P network, a search is needed. The search algorithm implemented in each network depends on the type of network (centralized P2P, decentralized P2P or partially centralized). There are several types of search algorithms, and they can be classified as follows:

2.1 Loosely controlled P2P search algorithms.
They are used in decentralized Peer-To-Peer networks. The data placement is not defined because the nodes of the network decide what files they want to share. There are four kinds of loosely controlled P2P search algorithms:

2.1.1 Broadcast search technique
The query is sent to all directly connected neighbours, and they forward the query to all their neighbours. The query is propagated to a sufficient number of nodes to match the entry or until a TTL value is reached. If a neighbour has the content, it replies; otherwise it floods the query to its own neighbours. This type of search is used by the Gnutella network.

2.1.2 Selective search technique
The query is sent to some nodes called supernodes, which act as central nodes. These supernodes perform the search on other supernodes in order to find the requested file. Clients with higher bandwidth and processing capacity are automatically considered supernodes. Clients with less bandwidth become supernode clients. This type of system uses a flow control algorithm for sending queries and replies. It also has a priority scheme used to discard some messages. This type of search is used by FastTrack and Gnutella 2 [12].

2.1.3 Random search technique
The query is sent to k randomly selected neighbours. Each of these neighbours forwards the query to one of its own randomly selected neighbours. The query is propagated to a sufficient number of nodes to match the entry or until a TTL value is reached. This technique is described in [13].

2.1.4 Probabilistic search technique
In this case, the queries are sent to specific clients which are considered to have the greatest probability of satisfying the request. Each node maintains a probability value for each neighbour, which defines the chances that a query will be forwarded to that neighbour. An example of this type of search is APS [14].

2.2 Strongly controlled P2P search algorithms.
In structured P2P networks, data placement and topology within the P2P file-sharing network are tightly controlled. These networks are based on Distributed Hash Tables (DHT), and the nodes do not decide what they store or which other peers in the network they connect to. The data placement is defined by the algorithm. When a document is published, it is routed to the node whose ID is the most similar to the document's ID. In order to find a file, the queries are sent to the client whose ID is the most similar to the document's ID. The process is repeated until a close match is found. The main search problem in this type of network is that it is not very efficient for keyword-based search. This type of search is used by Freenet [15], CAN [16], Chord [17], Pastry [18] and Tapestry [19].

2.3 Server-centrally controlled P2P search algorithms.
They are used in peer-to-peer networks where there is a server or a group of servers. This type of search is very simple and has a short query time. There are two kinds of server-centrally controlled P2P search algorithms:

2.3.1 Single-Server search technique
Initially, P2P clients connect to a central server where they publish their shared files (the files' names, their sizes, etc.). When a search query is sent to the server, it looks the query up in its index database. If there is a matching entry, the IP address of the node that shares the file is sent to the node that requested it, and then the direct connection and download take place. This technique is used by the Soulseek network.

2.3.2 Farm-of-Servers search technique
In this type of P2P network, there is a group of available servers called "brokers". P2P clients must be authenticated to one of those central servers. Each "broker" has the indexes of its local clients and, in some cases, the indexes of some files from neighbour "brokers". When a client sends a query to a "broker", the broker searches its local database and, if it does not find a match, it uses the local index to find a neighbour "broker" to which the request can be sent. The server indexes are not static and can change according to the files in the system. The OpenNap, eDonkey and MP2P networks use this technique.

3 The Popularity Parameter
There are different ways to measure the popularity parameter, depending on where this parameter is needed. The following are some of them:
- In Web search engines such as Google, great importance is given to the number of websites that link to a website, so the popularity parameter is measured by the number of incoming links to the site. PageRank is built with this popularity parameter.
- The popularity of a file can be related to the number of times the file has been retrieved from the surveyed archive during a certain period of time. This measure is used in web servers.
- The popularity of a file can be determined by the number of users that have requested its download. It is used to measure how many users use a certain piece of software.
- The popularity parameter for movies can be related to the audience they have had in cinemas, the number of DVDs sold or the number of movies rented by video clubs.
- The popularity parameter for songs can be related to radio or web top lists.
- In structured peer-to-peer networks, the popularity of a file or a service is measured by the number of times the file is requested. It also affects the probability that the file is replicated by other peers. The popularity of a file governs how long it stays in the network and how often it is replicated.

In peer-to-peer file-sharing networks, the popularity can be expressed mathematically as follows. Objects in a peer-to-peer file-sharing network do not all have the same popularity. Assuming that there are m files of interest in one P2P network and that q_i represents the normalized relative popularity of file i (the fraction of queries issued for it), the following holds:

$$\sum_{i=1}^{m} q_i = 1 \qquad (1)$$

All selected networks are unstructured public domain peer-to-peer networks, so there is no control over topology or data placement in any of them. Some of those networks, such as Gnutella, follow Zipf-like distributions [20][21]:

$$q_i = \frac{1/i^{\alpha}}{\sum_{j=1}^{m} 1/j^{\alpha}} \qquad (2)$$

Where α is the Zipf coefficient and i is the rank of the i-th most popular file. This Zipf distribution can be further used to determine the probability q_i that a query is associated with the i-th most requested file. Other studies show that FastTrack and other P2P file-sharing systems exhibit non-Zipf behavior [22][23].

Assuming that each file i is replicated on r_i nodes, the total number of copies of interesting files stored in the network is R:

$$R = \sum_{i=1}^{m} r_i \qquad (3)$$

We can assume that the most popular file is also the most replicated. The analyzed P2P file-sharing networks make rigid assumptions about how replication of objects happens in the system: only nodes that request a file make copies of it. On the other hand, in some networks, such as Gnutella (decentralized) and FastTrack (partially decentralized), search consists of randomly probing sites until the desired file is found. Thus, the probability Pr_i(k) of finding file i on the k-th probe is given by [24]:

$$\Pr\nolimits_i(k) = \frac{r_i}{n}\left(1 - \frac{r_i}{n}\right)^{k-1} \qquad (4)$$

Where n is the number of nodes in the network.

                                     Gnutella    FastTrack       OpenNap         eDonkey         MP2P           Soulseek
Average number of users              181*        3,467,918       256,003         1,428,175       244,418        8,981
Average number of shared files       55,540*     631,678,681     158,902,178     103,469,627     59,756,764     n/t
Average size of total shared data    0.294 GB*   4,947,261 GB    5,409,326 GB    n/t             236,564 GB     n/t
Max. variation of users (%)          41.49       21.33           42.02           39.13           5.50           1.17
Max. variation of shared files (%)   260.35      18.63           53.65           36.76           5.47           n/t
Max. variation of shared data (%)    349.49      15.72           34.58           n/t             5.79           n/t

Table 1. Comparison of the six architectures measured (* values that do not cover the total network; n/t: measurement not taken)
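The popularity model of Section 3 can be explored numerically. The following is a minimal Python sketch, not the authors' tooling: it assumes the Zipf form of Equation (2) and the random-probing form of Equation (4), and the function names and the sample values (m = 5, α = 1, r_i = 10, n = 1000) are hypothetical choices made only for illustration.

```python
def zipf_popularity(m: int, alpha: float) -> list[float]:
    """Normalized popularity q_i of the i-th most popular file (Eq. 2)."""
    weights = [1.0 / (i ** alpha) for i in range(1, m + 1)]
    total = sum(weights)
    # Dividing by the total makes the q_i sum to 1, as required by Eq. 1.
    return [w / total for w in weights]


def probe_probability(r_i: int, n: int, k: int) -> float:
    """Probability that file i is first found on the k-th random probe (Eq. 4)."""
    p = r_i / n  # fraction of nodes holding a replica of file i
    return p * (1.0 - p) ** (k - 1)


def found_within(r_i: int, n: int, k: int) -> float:
    """Probability of finding file i within the first k probes.

    This is the sum of Eq. 4 over probes 1..k, written in closed form.
    """
    return 1.0 - (1.0 - r_i / n) ** k


if __name__ == "__main__":
    # Hypothetical example values, not measurements from the paper.
    q = zipf_popularity(m=5, alpha=1.0)
    print("q_i:", [round(x, 3) for x in q])
    print("Pr_i(3):", probe_probability(r_i=10, n=1000, k=3))
    print("found within 100 probes:", found_within(r_i=10, n=1000, k=100))
```

Under these assumptions, a file with more replicas r_i is found in fewer probes, which is the link between popularity, replication and search success described above.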

Figure 1: Movies percentage (eDonkey 83%, FastTrack 15%, Gnutella2 1%, OpenNap 1%, Soulseek 0%, MP2P 0%)

Figure 2: Songs percentage (FastTrack 59%, eDonkey 37%, OpenNap 2%, MP2P 2%, Soulseek 0%)

Figure 3: Software percentage (eDonkey 91%, FastTrack 8%, Gnutella2 1%, OpenNap 1%, Soulseek 0%, MP2P 0%)

Figure 4: Documents percentage (eDonkey 95%, Gnutella2 4%, FastTrack 1%, OpenNap 0%, Soulseek 0%, MP2P 0%)

4 Search results
We have measured the average number and the maximum variation of peers, shared files and total amount of shared data in the six selected networks in order to know how many peers and how much information they contain. These data are shown in Table 1. It has to be taken into account that the Gnutella, FastTrack, OpenNap, eDonkey and Soulseek networks allow searching for every type of file, but MP2P only allows audio files (mp3, ogg, wma, etc.).

In order to establish in which network movies, songs, software and documents are most likely to be found, we have measured the number of peers that have files matching the keywords of 12 movies, 24 songs, 12 software programs and 8 documents. These measurements have been taken in each of the six networks.

To avoid wrong results in our searches we have employed the following methodology:
- To limit the results to the type of file we were looking for, the file type (avi, mpg, exe, pdf, doc, etc.) was added to the keywords of the search.
- Software version numbers were not included in the keywords of the search.
- The names of movies that have sequels (Shrek 2, Spider-Man 2 and so on) were not included.

Although there are peers connected to more than one peer-to-peer network, they are insignificant compared with the total number of peers of each network. On the other hand, the files shared by those peers might not be the same in all the networks they are connected to.

If all analyzed networks were interconnected, each network would contribute different types of files to the others. Although all of them support every type of file (except MP2P), the networks seem to be specialized in different types of files, as shown in Figures 1 to 4. Table 2 shows the ranking of the six networks.

P2P Network   Movies   Songs   Software   Documents
eDonkey       1st      2nd     1st        1st
FastTrack     2nd      1st     2nd        3rd
Gnutella2     4th      4th     3rd        2nd
OpenNap       3rd      3rd     4th        4th
Soulseek      5th      6th     5th        5th
MP2P          X        5th     X          X

Table 2. Ranking in movies, songs, software and documents

In order to know the popularity of those movies, songs, software and documents, the results have been compared with two search engines, Google and Altavista, and one search directory, Yahoo!.

To avoid wrong results in web searches and to limit the results to the type of file we are looking for, we have employed the following methodology:
- For movie searches we have added the word "movie" to the search.
- For song searches, we have added not only the name of the song, but also the name of the group.
- For document searches we have added the type of the document (pdf, doc, etc.).
- For software searches we have searched only for software from specific manufacturers.

In order to analyze the P2P data collected and compare it with Web search engines, we have scaled the eDonkey and FastTrack results by a factor of 50 and the OpenNap, Soulseek and MP2P results by a factor of 1000. These factors have been used for movies, songs and software. Figures 5 to 8 show the data collected.

Figures 5 to 8: Number of results of Movies (Figure 5), Songs (Figure 6), Software (Figure 7) and Documents (Figure 8) for each selected file, with one series per source: Google, Yahoo!, Altavista, eDonkey, FastTrack, Gnutella2, OpenNap and Soulseek.

These measurements have been taken only for comparative purposes: the aim is to know whether a file that is popular in Web search engines is also popular in peer-to-peer file-sharing networks. As can be seen in those figures, there are no results for movies, software and documents in MP2P because it only supports songs.

5 Increasing the Probability to Find the Desired File
The files are not always in the peer-to-peer network where the user is searching. Most of the networks implemented nowadays support any file type, but there are some that only support audio files. What is needed is a system that allows searching in every P2P network and downloading from every peer of every network.

The union of Peer-To-Peer file-sharing networks, through the creation of a Peer-To-Peer file-sharing networks Interconnection System, will give a greater probability of finding the desired file. Let E_α be the event of finding the file in network α and P_α its probability. If there are n Peer-To-Peer file-sharing networks and these events are independent, the total probability is:

$$P\left(\bigcup_{\alpha=1}^{n} E_\alpha\right) = \sum_{\alpha=1}^{n} P_\alpha - \sum_{\beta>\alpha=1}^{n} P_\alpha P_\beta + \sum_{\gamma>\beta>\alpha=1}^{n} P_\alpha P_\beta P_\gamma - \dots + (-1)^{n-1} P_1 P_2 \cdots P_n \qquad (5)$$

As can be seen, the total probability of finding the desired file is at least as large as the probability in any single network, but no larger than the sum of all of them.

6 Conclusion
Six unstructured public domain Peer-to-Peer networks have been measured in order to know which P2P file-sharing network gives the most search results for movies, songs, software and documents. Our measurements indicate that those networks seem to be specialized in different types of files. The search results have been compared with those obtained in Web search engines. We have checked that if a file is popular in Web search engines, it is also popular in P2P file-sharing networks. Future work will try to find the mathematical relationship between file popularity in Web search engines and file popularity in P2P file-sharing networks.

Acknowledgements
The authors want to thank Mr. Miguel A. Granados from the Polytechnic School of Gandia for his data collection.

References:
[1] Michael K. Bergman, The Deep Web: Surfacing Hidden Value, The Journal of Electronic Publishing, Volume 7, Issue 1, August 2001.
[2] Danny Sullivan, Nielsen NetRatings Search Engine Ratings, July 14, 2004. Available at: http://searchenginewatch.com/reports/article.php/2156451
[3] Danny Sullivan, Search Engine Sizes, September 2, 2003. Available at: http://searchenginewatch.com/reports/article.php/2156481
[4] J. Lloret Mauri, B. Molina Moreno, Palau Salvador and M. Esteve Domingo, Public Peer-To-Peer Filesharing Networks Evaluation, The 2nd IASTED International Conference on Communication and Computer Networks, MIT, Cambridge, MA, USA, November 2004.
[5] Eytan Adar and Bernardo Huberman, Free riding on Gnutella, First Monday, 5(10), October 2000.
[6] Nathaniel Leibowitz, Matei Ripeanu and Adam Wierzbicki, Deconstructing the Kazaa Network, 3rd IEEE Workshop on Internet Applications, San Jose, USA, June 2003.
[7] OpenNap, http://opennap.sourceforge.net/
[8] Oliver Heckmann and Axel Bock, The eDonkey 2000 Protocol, Technical Report KOM-TR-08-2002, Multim. Communications Lab, Darmstadt University of Technology, December 2002.
[9] Soulseek, http://www.slsk.org
[10] MP2P, http://www.blubster.com/protocol1.html
[11] Wikipedia, http://www.wikipedia.org/wiki/Peer-to-peer
[12] Gnutella2, http://www.gnutella2.com
[13] Christos Gkantsidis, Milena Mihail and Amin Saberi, Random Walks in Peer-to-Peer Networks, The 23rd Conference of the IEEE Communications Society (Infocom 2004), Hong Kong, March 2004.
[14] D. Tsoumakos and N. Roussopoulos, Adaptive Probabilistic Search for Peer-to-Peer Networks, In Proceedings of the 3rd IEEE International Conference on P2P Computing, Linkoping, Sweden, September 2003.
[15] I. Clarke et al., Freenet: A distributed anonymous information storage and retrieval system, ICSI Workshop on Design Issues in Anonymity and Unobservability, Int'l Computer Science Inst., 2000.
[16] S. Ratnasamy, P. Francis, M. Handley, R. Karp and S. Shenker, A Scalable Content-Addressable Network, ACM Sigcomm 2001, San Diego, CA, USA, August 2001.
[17] I. Stoica, R. Morris, D. Karger, F. Kaashoek and H. Balakrishnan, Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications, ACM Sigcomm 2001, San Diego, USA, August 2001.
[18] A. Rowstron and P. Druschel, Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems, IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pages 329-350, November 2001.
[19] B. Zhou, D.A. Joseph and J. Kubiatowicz, Tapestry: a fault tolerant wide area network infrastructure, UC Berkeley technical report UCB/CSD-01-1141.
[20] Kunwadee Sripanidkulchai, The popularity of Gnutella queries and its implications on scalability, In O'Reilly's www.openp2p.com, February 2001.
[21] Zihui Ge, Daniel R. Figueiredo, Sharad Jaiswal, Jim Kurose and Don Towsley, Modeling Peer-Peer File Sharing Systems, Proceedings IEEE INFOCOM 2003, San Francisco, March-April 2003.
[22] Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble, Henry M. Levy and John Zahorjan, Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload, Proceedings of the nineteenth ACM symposium on Operating systems principles, 2003, p. 314-329.
[23] J. Chu, K. Labonte and B. N. Levine, Availability and locality measurements of peer-to-peer file-sharing systems, In Proceedings of SPIE ITCom: Scalability and Traffic Control in IP Networks, volume 4868, July 2002.
[24] Qin Lv, Pei Cao, Edith Cohen, Kai Li and Scott Shenker, Search and replication in unstructured peer-to-peer networks, Proceedings of the 16th international conference on Supercomputing, ACM Press, 2002, p. 84-95.
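As a numerical companion to Equation (5) of Section 5, the following minimal Python sketch evaluates the inclusion-exclusion sum and compares it with the equivalent closed form 1 - Π(1 - P_α). It assumes that the events of finding the file in each network are independent; the function names and the per-network probabilities used in the example are hypothetical, not values measured in this paper.

```python
from itertools import combinations
from math import prod


def union_probability(p: list[float]) -> float:
    """Inclusion-exclusion sum of Eq. 5 for independent networks."""
    n = len(p)
    total = 0.0
    for k in range(1, n + 1):
        # Terms with k factors carry the sign (-1)^(k-1).
        sign = (-1) ** (k - 1)
        total += sign * sum(prod(c) for c in combinations(p, k))
    return total


def union_probability_closed(p: list[float]) -> float:
    """Same quantity as the complement of missing the file in every network."""
    return 1.0 - prod(1.0 - x for x in p)


if __name__ == "__main__":
    # Hypothetical probabilities of finding one file in three networks.
    p = [0.5, 0.2, 0.1]
    print("inclusion-exclusion:", union_probability(p))
    print("closed form:        ", union_probability_closed(p))
    print("best single network:", max(p), " sum of all:", sum(p))
```

The closed form is cheaper to compute when n is large, since the inclusion-exclusion sum has 2^n - 1 terms; both agree under the independence assumption, and the result lies between the best single-network probability and the sum of all of them.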