Partial Swarm Merger: Increasing BitTorrent content availability

António Homem Ferreira∗, Ricardo Lopes Pereira∗ and Fernando M. Silva∗
∗INESC-ID / Instituto Superior Técnico
Email: {antonio.h.ferreira,fernando.silva}@ist.utl.pt, [email protected]

Abstract—The BitTorrent Peer-to-Peer (P2P) file sharing protocol is a popular way to distribute digital content. It is a very scalable protocol, where the entry of new peers increases the total capacity of the network, especially after peers have finished their download and remain just as uploaders (known as seeders). Through the monitoring of live Internet swarms we have discovered that there is a significant amount of repeated content being shared. Various publishers tend to publish the same content through different torrent files, creating independent swarms that end up having the exact same content or a large number of common parts. As such, the size of each swarm is smaller than it could be, should there be only one swarm for that content. This affects the performance of BitTorrent and diminishes the opportunities for exploiting locality.
By analyzing 3067 swarms, we concluded that there is a significant amount of common content between different swarms. This redundancy can be exploited in order to increase data availability and source diversity. We propose a novel technique, called Partial Swarm Merger, which adds a new component to the BitTorrent infrastructure, allowing peers to learn about swarms with common content. With this information, peers could combine the different swarms, announcing and requesting from each swarm the pieces in common with their download. This will increase the availability of the parts which are common to the several swarms.

I. INTRODUCTION

In the last few years, Peer-to-Peer (P2P) communication has increased exponentially [1], proving to be one of the most successful architectures for providing a number of services like VoIP, video streaming and, of course, file sharing. One of the most successful and popular P2P protocols is BitTorrent [2], which represents most of the P2P traffic generated worldwide [1]. BitTorrent presents no infrastructure costs beyond the residential grade Internet connection supported by each user. As such, it is very affordable and convenient for a user to put his content online to be shared.
The low barrier of entry for publishing content has enabled many to publish the works of others. Although anyone can publish content for sharing, users tend to download the content from sources they trust. These sources are usually groups of individuals (publisher teams) that have made a reputation for themselves by competing with each other to be the first group to publish a specific content. This competition between groups often results in the creation and publishing of different torrent files that represent the same or very similar content. This fact is a source of redundancy which is not exploited by the conventional BitTorrent protocol. In BitTorrent, swarms are identified by an infohash, an SHA1 hash which covers the hash of each piece of the content and the names of the files being shared. This means that, even if two torrent files represent the exact same content, just by changing the name of one of the files being shared (e.g. adding the team name to the file's name), the two files will generate two different infohashes and thus result in two different swarms that do not share any data with each other. The same will happen if a new file (e.g. a subtitles file for another language) is appended.
This study shows that this competition among teams results in the existence of different swarms, sharing the same or similar content but isolated from each other, and discusses how this affects the performance of BitTorrent. We propose that content availability can be improved by combining these different swarms into a larger one. These swarms can be found by comparing information inside each torrent file: the hash string of each piece as well as the piece size and overall file size. With this information, it can be determined if the different swarms share the same content or, at least, if they share a high number of identical pieces.
The data for this study was obtained by analysing torrent files and gathering information on the associated swarms. The latter was performed using an instrumented BitTorrent client application running on several PlanetLab [3] nodes. Based on this data, we propose a solution called Partial Swarm Merger (PSM) as an efficient way to exploit the content redundancy. This solution would be based on a service outside the BitTorrent network and would not require any modifications to the BitTorrent protocol. However, BitTorrent client applications would need an extension in order to use the service.
PSM could also be used in other scenarios. A GNU/Linux distribution could arrange common, non architecture specific, files into the first pieces, placing architecture specific binaries at the end, allowing peers downloading versions for different architectures to share the common pieces. A software distributor could also package common library files used by different software products in the first positions so as to be shared among peers downloading different products.
This paper is organized as follows: Section II presents the BitTorrent protocol, focusing on the publishing mechanism and the torrent file. We present our methodology for swarm monitoring in Section III and in Section IV we discuss the results of the study. Partial Swarm Merger is detailed in Section V. Finally, Section VI discusses the related work and Section VII presents the final conclusions.

II. BITTORRENT PROTOCOL

Before presenting the work described in this paper, we provide a brief review of the BitTorrent protocol, focusing on the publishing method and the torrent file.
BitTorrent is a P2P protocol for file sharing where peers share files among themselves, supporting the upload costs. Unlike other file-sharing P2P protocols such as eMule (http://www.emule.com/) or Kazaa (http://www.kazaa.com/), BitTorrent does not provide any mechanism for file search. Its goal is just the exchange and replication of files, which means that all file searches are done outside the network. There are two main components in the BitTorrent protocol:
1) Tracker: provides a list of peers sharing a given file. It can also receive and log information about upload/download rates and other details for statistical purposes.
2) Peers: share a given file among themselves. There are two types of peers: the ones that have already finished the download of the file, called seeders, and the ones still downloading the file, called leechers.
To share a file, a peer needs to create a torrent file and publish it. This file contains meta-information such as: (1) file names and sizes, (2) tracker(s) Uniform Resource Locator (URL), (3) the hashes for each file part (piece) and the fixed piece size, (4) comments, creation date, encoding and other information on the content and files. After publishing the torrent file, usually on a webpage, interested users can download and open it with a BitTorrent client application. This application reads the file and queries the tracker for a list of active peers for that same file. After receiving the list, it connects to the peers and starts downloading the file. All file distribution is done between peers; trackers do not get involved in the file sharing process.
Peers exchange blocks of data from a data aggregate which contains one or more files concatenated. The exchange unit is the piece, which has a fixed size. Each piece is associated with a SHA1 hash, found in the torrent file, used to verify its integrity. After downloading a piece and verifying the hash, a peer informs every peer connected to it that it already has that piece available for upload. These hashes are the same for equal pieces, it being very unlikely that different pieces happen to produce the same hash.
Peers periodically query the tracker for other peers sharing the same content. Each swarm is identified by the infohash generated from the torrent file. The infohash is a urlencoded 20-byte SHA1 hash generated from the information in the info key of the torrent file. This information includes: the piece size, the hash of all pieces, the file name, the file size and other information. This way, two torrent files that refer to the same file can produce two different infohashes just by changing the file's name. Despite sharing the same content, these two torrent files will be associated with two different and isolated swarms that do not share any information with each other.
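The dependence of the infohash on the file names can be made concrete with a short sketch. The code below is our illustration, not part of the paper's tooling; the bencoder is a deliberately minimal toy covering only the value types found in an info dictionary. It computes the SHA1 infohash over a bencoded info dictionary and shows that merely renaming the single file yields a different swarm identifier, even though piece size and piece hashes are unchanged.

import hashlib

def bencode(value):
    # Toy bencoder: integers, byte strings, strings, lists and dictionaries (keys sorted).
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        body = b"".join(bencode(k) + bencode(value[k])
                        for k in sorted(value, key=lambda k: k.encode("utf-8")))
        return b"d" + body + b"e"
    raise TypeError("unsupported type: %r" % type(value))

def infohash(info):
    # The swarm identifier: SHA1 over the bencoded 'info' dictionary.
    return hashlib.sha1(bencode(info)).hexdigest()

pieces = bytes(range(20)) * 2  # two fake 20-byte piece hashes, identical in both torrents
info_a = {"name": "movie.avi", "length": 2 * 2**20, "piece length": 2**20, "pieces": pieces}
info_b = dict(info_a, name="movie-TEAMNAME.avi")  # same data, different published file name

print(infohash(info_a))
print(infohash(info_b))  # a different infohash, hence a different, isolated swarm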
III. SWARM MONITORING

In this section, we will first describe how the torrent files were collected. Then we will present our methodology for analyzing and comparing files in order to determine redundant content.
The first step in this study was the download of the BitTorrent torrent files. The files were obtained through a Rich Site Summary (RSS) reader script that read RSS feeds from PirateBay (http://thepiratebay.org/), isohunt (http://isohunt.com/) and btjunkie (http://btjunkie.org/) and downloaded the files in each feed. The RSS feeds were followed from the 25th of April 2011 to the 12th of June 2011. The torrent database was primed with Piratebay's one hundred most popular files on April 25th and complemented with the twenty most popular files for each major category (audio, tv and video) on May 24th.
During the same period, an instrumented version of BitTornado (a python BitTorrent client, http://www.bittornado.com/) ran on a number of PlanetLab nodes, reading the torrent files and querying the corresponding trackers to obtain peer related information such as swarm size, number of leechers and number of seeders. BitTornado was used to query trackers for peers at intervals of approximately 20 minutes. The application was also modified so as not to download the content pertaining to the torrent files. The results for swarm size do not include the PlanetLab nodes used. A swarm stopped being monitored when the number of peers dropped below 30 and the number of seeders dropped to 0 for a period of a week.
To determine which torrent files represented the same content or at least very similar content, the torrent files (metainfo) were compared based on the following similarity criteria (in order):
1) Piece size - in order to be able to compare pieces' hash values, it is necessary for the pieces of both torrent files to have the same length;
2) Overall content size - torrent files were only compared if either both had the same size or their sizes were within a margin of 5% of each other;
3) Pieces' hash value - for two pieces to be considered as corresponding to the exact same content, their hash needs to be exactly the same;
4) Number of pieces in common - we decided that files with more than 75% of the total number of their pieces in common should be considered as corresponding to the same or very similar content. These pieces also need to be in order and starting at the beginning of both torrent files.
The cutoff value of 75% common pieces was chosen after analysing all the gathered torrent files. Figure 1 shows the cumulative distribution function (CDF) for the number of common pieces for every torrent pair combination with at least one common piece. It shows that torrent pairs either have less than 20% in common or more than 95%. We chose to focus on the torrents with more pieces in common, where the gains would be larger, thus compensating for the similar torrent discovery process overhead. Although two torrents are determined to be similar without ever analysing their content, it is nearly impossible for so many parts to have common hash values and different content.
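To illustrate the criteria above, the following sketch (our reading of the procedure, not the authors' monitoring code) reduces each torrent's metainfo to a (piece length, total size, ordered piece hashes) triple and applies the four tests in order. Taking the 75% threshold against the smaller torrent's piece count is our assumption.

def similar_content(t1, t2, size_margin=0.05, common_fraction=0.75):
    # t1, t2: (piece length in bytes, overall content size in bytes, piece hashes in order)
    piece_len1, size1, hashes1 = t1
    piece_len2, size2, hashes2 = t2
    # 1) piece hashes are only comparable when the piece size is the same
    if piece_len1 != piece_len2:
        return False
    # 2) overall content size equal, or within a 5% margin
    if abs(size1 - size2) > size_margin * max(size1, size2):
        return False
    # 3) + 4) count identical pieces, in order, starting at the first piece
    common = 0
    for h1, h2 in zip(hashes1, hashes2):
        if h1 != h2:
            break
        common += 1
    # more than 75% of the (smaller) torrent's pieces in common -> similar content
    return common > common_fraction * min(len(hashes1), len(hashes2))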

The torrent file name and corresponding content file name were not used as criteria for comparing contents, since torrent files often have a serial number or their infohash as their name, and the content files often do not have a name corresponding to the actual content (for example, a movie might be named movie.avi instead of the actual movie name). However, this data was useful for manually validating a sample of the findings.

IV. DATA ANALYSIS

We obtained 3067 torrent files, of which 200 had more than 50% of their own pieces' hash values repeated. These files were excluded from this study since they were deemed as representing no real content, but rather pollution [4]. After comparing all of the remaining torrent files with each other, we obtained the percentage of pieces in common for each torrent pair. Figure 1 shows that CDF. As we can observe, most torrent pairs had between 0% and 16% or between 95% and 100% of all their pieces in common. No torrents were found with 30% to 75% of pieces in common. As for the amount of data in common, Figure 2 shows that most torrent pairs have 10 to 200 MB of data in common; however, there is also a significant number of torrent pairs that have 400 to 850 MB of data in common. This is the amount of data that could be shared among peers currently in different swarms.

Fig. 1. CDF with the shared percentage of pieces per number of torrent pairs
Fig. 2. CDF with the shared MegaBytes per number of torrent pairs

Figure 1 also shows that there is a significant number of torrent pairs that share 100% of their pieces. In this case, we had to compare the infohashes of these pairs to identify the ones that shared the same swarm (pairs with the same infohash) and the ones that had different and isolated swarms (pairs with different infohashes). Of the 66 torrent files that were found to have at least one other torrent file with 100% equal pieces, none of them shared the same infohash, which means that the exact same content was being shared over different and isolated swarms.
After comparing the different torrents, we decided that only torrent files that shared more than 75% of their pieces should be considered as referring to similar content. We obtained 2589 torrent files representing unique content and 278 torrent files representing content being shared more than once. This means that about 9.7% of all torrent files analysed represented content repeated in other BitTorrent swarms. As stated before, of these 278 torrent files, 66 had at least one other torrent file with 100% equal pieces and all had unique infohashes, due to different names being assigned to the files being shared.
As mentioned in the introduction, most of the similar torrent files exist due to different publishing teams repackaging content first published by others. Figure 3 shows the number of torrent files found to have been published by each team and the number of torrent files found to be repeated. From this graphic we can conclude that the three teams that uploaded the most torrent files account for about 5% of all the 3067 torrent files analyzed and for about 14% of the torrent files found to be repeated. This shows the impact these teams have on the publishing of repeated content. We did not analyse who published what first as this was irrelevant to our study.

Fig. 3. Number of torrents published by team

Figure 4 shows a histogram representing the content repetition frequency. For example, content that was found to be repeated twice was observed 44 times, which means that there were 88 torrent files that referred to a content that was already being referred to by exactly one other torrent file. By analyzing Figure 4, we observe that most content is unique and that most repeated content is repeated only twice. As expected, we can also observe that contents found to be repeated more than six times tend to be rare. Even so, there was one content that was found to be repeated in 24 torrent files. If these torrent files each had at least one swarm associated, this means that there were at least twenty four independent swarms sharing the same content.
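The per-content counts behind Figure 4 require one further step that the text does not spell out: grouping torrents whose pairwise comparison succeeded into a single content. The union-find sketch below is only one plausible way to derive those counts; it is our illustration, not the authors' code.

from collections import Counter

def content_groups(torrent_ids, similar_pairs):
    # Merge torrents connected by the pairwise similarity test into content groups.
    parent = {t: t for t in torrent_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:           # every pair that passed similar_content()
        parent[find(a)] = find(b)

    return Counter(find(t) for t in torrent_ids)  # content id -> number of torrent files

# The Figure 4 histogram then counts how many contents show each repetition level:
# repetition_histogram = Counter(content_groups(ids, pairs).values())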

Fig. 4. Histogram of the content repetition frequency
Fig. 5. Average number of peers per content

From these results, one may conclude that there is a strong potential to increase availability through the combination of BitTorrent swarms. However, we also need to know how many peers all swarms corresponding to each content have, in order to determine if the swarm merging would actually be useful.
Figure 5 represents, for each repeated content, the average number of peers per torrent announced by the tracker as points, and the sum of all peers per content as the line in the graphic. There are several conclusions that can be derived from this graphic. The first is that there are contents where only one torrent has a swarm with peers. This can be explained in two different ways: there might be much trust in the publisher of that torrent file and very little trust in the publishers of the remaining torrent files representing that same content; or the first torrent file is published much sooner than the remaining ones, making these obsolete when published. In this case, they behave as unique content does, where there is only one swarm per content. Second, the more the content is repeated, the more the peers are distributed over different swarms and the smaller the swarms corresponding to each torrent file for that same content are. By analysing both the points and the line in the graphic we can conclude that, for most of these cases, by merging all swarms into one we would get a bigger swarm for that content, and the more peers there are in a swarm the higher the availability can be. Third, about 17 contents did not have a single peer in any of their swarms.
Figure 5 shows that we can expect a significant growth in the number of peers by combining the different swarms sharing similar content. However, it is also important to analyse the different swarms regarding the number of seeders, since these are the peers that have all the file's pieces available for sharing. Furthermore, as seeders do not need to download, they are not subject to the tit-for-tat algorithm, providing newly joined peers with their first pieces for trading with others. Figure 6 represents, for each repeated content, the average number of seeders per torrent as points and the sum of all seeders per similar content as the line in the graphic. From this graphic, we can observe that, for most contents, there is only one swarm with seeders. This shows that, by joining all swarms regarding a given content, we can definitely improve availability. As for the contents with multiple swarms with seeders, as the seeder number grows, the availability will also improve.

Fig. 6. Average number of seeders per content

Increasing the number of peers available for downloading each piece will also increase the opportunities for exploiting locality. By having peers exchange data with other peers within the same ISP or in ISPs with favourable peering agreements, locality algorithms and protocols enable ISPs to reduce their outside connectivity costs, while improving the performance of other applications and even of P2P file exchange [5][6][7]. Having several peers within the same ISP downloading the same content will also increase the hit ratio of eventual caching systems [8].
Figure 7 shows the number of peers found to be sharing the same content during one day. Three torrents are shown, which were found to have similar content. For each, the number of peers per country is shown, as is the number of peers per country obtained by combining the three torrents. Each line is sorted in descending order of number of peers per country. Countries with a single peer are not shown. We can observe that the number of countries with peers increases, as does the number of peers in each country. This is substantiated by the average values shown in Table I. In countries where internal connectivity is better than external connectivity, the increased national availability should improve the download performance. The peers considered were those provided by the trackers to our modified peers. This number is lower than the total number of peers announced by the tracker-provided statistics. As such, we would expect the gains to be even higher.

TABLE I
TORRENT AGGREGATION BENEFITS

                      Torrent 1   Torrent 2   Torrent 3   Total
Num. Countries            87          68          38        101
Avg. Peers/Country      6.20        5.25        2.61       9.85
Num. ISPs                238         178          54        342
Avg. Peers/ISP          2.26        2.01        1.83       2.91

Fig. 7. Number of peers per country for similar content
Fig. 8. Number of peers per ISP for similar content

Figure 8 shows the same data aggregated by ISP. We also observe an increase in the number of ISPs with peers but, more significantly, the number of peers per ISP increases noticeably. This will provide ISPs with more internal sources on their network, creating opportunities for diminishing the external BitTorrent traffic.
Figure 9 provides an overview of the gains in swarm dimension throughout the torrent life and our monitoring period. For each torrent and for each day, we calculated the average number of peers per ISP for the torrent by itself and by aggregating all the similar content torrents. For each torrent, we show the minimum, maximum and 25th, 50th and 75th percentiles for the increase in the average number of peers per ISP. We notice that for most torrents the gains are significant. Only for a few was there a decrease in the number of peers per ISP, which is explained by the small overlap in ISPs of each similar torrent, causing the increase in number of ISPs to exceed the increase in number of peers.

Fig. 9. Increase in swarm size at each ISP by aggregating similar content

As we can see from our findings, there is a high percentage of repeated content shared across BitTorrent networks that results in multiple swarms that, despite sharing the same content, do not exchange data among themselves. Merging these isolated swarms sharing the same contents can, in many cases, improve content availability.

V. PARTIAL SWARM MERGING

We envisioned several solutions to take advantage of the repeated content problem, using it to increase content availability:
1) A BitTorrent client modified to search for different torrent files that refer to the same or very similar content, using the torrent names for searches on torrent databases, and to query the different swarms for different pieces. This solution would need modifications to the BitTorrent client application;
2) Trackers which analyse each torrent file published and automatically notify peers, in order to merge swarms within the tracker that share the same or very similar content. Modifications to the BitTorrent Tracker application, client and protocol would be needed;
3) A service outside BitTorrent which peers could query for other torrent files with the same or very similar content. This service would maintain an index of torrent files referring to the same content and would supply this information to peers. This would require modification neither to the BitTorrent protocol nor to the BitTorrent trackers. Only modifications to the BitTorrent client applications would be necessary.

Of these three possible solutions, the simplest to implement and deploy is the third one. The first would require the modified BitTorrent client application to search for files with a similar name to the one the peer is downloading and then download all the torrent files so they could be analysed. This process is slow and could only search for torrents based on the name of the torrent file, not its content.
The second solution, where the tracker would analyse and merge the swarms sharing common pieces, has many disadvantages that make it hard to implement. Most trackers do not keep the torrent files; they just keep an association between peers and the infohash of the swarm they belong to. Another disadvantage is the fact that trackers do not normally communicate or exchange any kind of information with each other. For these reasons, this solution would require major modifications to BitTorrent trackers.
Having this in mind, we propose a solution with an architecture based on a service outside the BitTorrent network, called Partial Swarm Merger (PSM). PSM would be a service that peers would query for information on other swarms in the BitTorrent network that are, at that time, sharing the same content. With this information, peers would participate in multiple swarms but would only announce and request the pieces the content they are downloading has in common with the content being shared in the other swarms. This solution would require no modification to the BitTorrent protocol. However, the BitTorrent client application would require an extension or add-on to be able to use this service. By having the service outside the network, we can also have information regarding all swarms from different trackers. As this service is not critical, but rather a performance enhancer, it need not be a part of the core protocol.
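Because the similarity criteria of Section III only accept pieces that match in order from the beginning of both torrents, the pieces a peer may announce or request in a newly joined swarm reduce to a common prefix. The client-side sketch below (ours, purely illustrative) captures that restriction.

def common_prefix_length(own_hashes, other_hashes):
    # Number of leading pieces that are identical in both torrents.
    n = 0
    for h1, h2 in zip(own_hashes, other_hashes):
        if h1 != h2:
            break
        n += 1
    return n

class MergedSwarmView:
    # Restricts HAVE/REQUEST traffic in a foreign swarm to the shared pieces.
    def __init__(self, own_hashes, other_hashes):
        self.limit = common_prefix_length(own_hashes, other_hashes)

    def may_announce(self, index, have_piece):
        # Only advertise pieces we hold that also exist, under the same index, in the other torrent.
        return have_piece and index < self.limit

    def may_request(self, index, need_piece):
        # Only ask the foreign swarm for pieces within the common prefix.
        return need_piece and index < self.limit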

Figures 10 and 11 show the workflow of the service. Figure 10 demonstrates how PSM would build and populate its databases with the information from the different torrent files. First, a peer sends a magnet URI (http://magnet-uri.sourceforge.net/) with the infohash and other relevant information such as tracker names. If the PSM service does not have that infohash in its databases, it requests the info key from the peer, using the Distributed Hash Table (DHT) used by modern BitTorrent clients [9]. This info key has the piece hashes, the file name and the piece size value. When the peer sends the info key to the PSM service, the latter associates the infohash with all this information. Then, it analyzes and compares that information with the data in its databases and returns the magnet links of swarms sharing, to some extent, the same content, if any. If the infohash received in the first message is already in the databases, messages two and three are not exchanged.

Fig. 10. PSM - Populating databases
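The four-message exchange of Figure 10 can be summarised in a few lines. The sketch below is a hypothetical in-memory service: the paper does not define PSM's wire format, and the method and message names are ours. It reuses the similar_content test sketched in Section III.

class PSMService:
    def __init__(self):
        # infohash -> (magnet link, (piece length, total size, piece hashes))
        self.index = {}

    def announce(self, infohash, magnet):
        # Message 1: a peer announces the swarm it is downloading.
        if infohash not in self.index:
            return "send_info", None                  # message 2: the info key is still needed
        return "links", self.similar_to(infohash)     # message 4: magnets of similar swarms

    def submit_info(self, infohash, magnet, piece_length, total_size, piece_hashes):
        # Message 3: the peer supplies the info key (piece hashes, file name, piece size).
        self.index[infohash] = (magnet, (piece_length, total_size, list(piece_hashes)))
        return "links", self.similar_to(infohash)

    def similar_to(self, infohash):
        # Compare the announced torrent against every indexed one (similar_content: Section III sketch).
        _, summary = self.index[infohash]
        return [link for other, (link, other_summary) in self.index.items()
                if other != infohash and similar_content(summary, other_summary)]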

After receiving the magnet links from the PSM service, peers use them to request peers from the tracker and join the other swarms. However, peers only announce and request, from the newly joined swarms, pieces in common with their original swarm. Figure 11 represents the workflow of the solution. Peers should periodically contact the PSM server in order to learn of other similar torrents that the PSM server has found out about since the previous request.

Fig. 11. PSM's workflow

This architecture allows PSM servers to be deployed autonomously as they do not rely on other services. Ideally, in order to maximise their efficiency, PSM servers should communicate among themselves, so as to guarantee that they know of the most torrents. However, for the purpose of increasing the locality opportunities within a single ISP, an isolated PSM server would produce the optimal result as long as all the peers within that ISP used it.

A. Use Case

Besides the opportunities described in the previous Section, PSM could also be used in other scenarios. All that is required is for someone packaging different content with common parts into different torrent files to place the common content first and in the same order.
For instance, a GNU/Linux distribution supporting different architectures could arrange the common, non-architecture specific files into the first pieces and place the architecture specific binaries at the end. Peers downloading different architecture versions could then share the common pieces among them. The same principle could be applied to any software publisher whose different products or product versions share common library packages. This small change in the publishers' behaviour could allow PSM to be beneficial in more situations.
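This packaging advice can be checked directly: if the shared files come first, in the same order, and the piece size matches, the leading piece hashes of both torrents are identical and a PSM-style comparison will detect them. A toy check (ours; the file contents are stand-ins) follows.

import hashlib

def piece_hashes(file_blobs, piece_length=256 * 1024):
    # BitTorrent concatenates the files in torrent order and hashes fixed-size pieces.
    data = b"".join(file_blobs)
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]

common = [b"docs" * 100000, b"noarch-data" * 200000]          # shared, architecture independent files
amd64 = piece_hashes(common + [b"amd64-binaries" * 50000])    # image for one architecture
arm64 = piece_hashes(common + [b"arm64-binaries" * 50000])    # image for another architecture

shared = sum(1 for a, b in zip(amd64, arm64) if a == b)
print(shared, "of", min(len(amd64), len(arm64)), "pieces are identical across the two torrents")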

Another advantage would be a more efficient load-balancing. Even when content is first published, if it has pieces in common with other versions or software already being shared in the network, peers are not fully dependent on what the original seeders are sharing. Peers can download the common pieces from other peers in the network and request from the original seeders just the unique pieces. This way, peers would download the content faster and become seeders quicker.

VI. DISCUSSION AND RELATED WORK

Most studies and related work focus on solving one of BitTorrent's greatest problems: content availability. Despite the fact that BitTorrent has very high scalability, dealing very well with flash crowd moments, it struggles to replicate content with few peers exchanging it. When content is either unpopular or has a very small number of seeders sharing it, BitTorrent tends to make the download last longer, which makes it less efficient. Most solutions to this problem either involve analysing content represented by the same torrent file being shared between different swarms, or use mechanisms such as bundling, a slightly different approach from our work, which focuses on analyzing the same partial content shared across different torrent files and corresponding swarms.
One of the first approaches to increase availability through swarm merging is DISM [10]. This work focuses on load-balancing between trackers responsible for the same torrent file. Since many torrent files are published across a number of trackers, by re-allocating peers among these trackers, small swarms can be merged into bigger ones while, at the same time, avoiding re-allocating too many peers to a specific tracker. This way, there is load-balancing between trackers and swarms, and availability is increased for the swarms with few peers. However, this approach only works for the same torrent file and relies on the fact that the same torrent file is shared across a significant number of trackers. Their proposed solution would be implemented through a small change in the BitTorrent protocol, adding a new protocol message, tracker redirect, for redirecting peers to other trackers.
A different study determined, through measurements, that more than 85% of all peers participate in more than one torrent [11]. This means that most peers are, at a given time, sharing several different files and contents. Through measurements, analysis and modeling, the authors concluded that availability is a big problem in BitTorrent, mainly because of the decreasing peer arrival rate and free-riding. Since most peers participate in various torrents at a given time, they propose incentives based on inter-torrent collaboration, instead of the current incentives, for prolonged seed lifetime. This mechanism follows the same path as current BitTorrent mechanisms, namely "tit-for-tat", so as to also obtain instant collaboration. However, this mechanism would calculate the incentives and collaboration based on all torrent files being shared by peers, whereas current mechanisms focus on only one torrent at a time. The authors argue that this approach could be applied to an exchange based incentive mechanism that would be fairer and improve content availability.
Another approach to the problem is bundling [12]. By grouping and sharing related content, availability can be improved for unpopular content. The authors quantify content availability and explain how bundling can improve it. Their measurements show that 40% of the swarms have no publishers (seeders) available more than half the time.
However, bundling can increase content availability. By grouping related content, unpopular content can become much more popular and thus improve its availability in the network. This is a technique that is already being used by many publishers; the best example of bundling is the sharing of music albums instead of single songs. Their measurements prove that bundled content has more availability than isolated content and that download times for unpopular content decrease when bundling is used. However, this method forces peers to download content they do not want along with the content they want. This way, peers consume more resources, mainly bandwidth and disk space, for content they do not want or need. Throughout the paper, the authors also develop a model for content availability. In [13], the authors also addressed availability and the bundling solution, but in a rather different way. First they show that content bundling is already widely deployed in BitTorrent and quantify it, and then in [14] the same authors propose that bundling should be done automatically by the system and not manually by the publisher, as happens currently. However, they propose to consider content similar based only on the torrent file name. File size, content hash or even category are not used as criteria, due to the fact that BitTorrent users use the name of the file to find the content they are looking for. By comparing three text classification algorithms, they concluded that the "cosine" one is the most accurate and show that it is possible to obtain benefits from "title-based bundling".
A study of what drives publishers to publish content was also very important for understanding the amount of repeated content being shared in the network [15]. Through swarm measurements, the authors discovered that most publishers fall into three categories: antipiracy agencies that publish fake or malicious content, altruistic publishers and profit-driven publishers. With the help of the RSS feeds from BitTorrent tracker sites, they could identify the initial publisher by being among the first to join a given very recent swarm and identifying its only seed. Their study shows that a very small number of publishers is responsible for about 67% of all published content. In their conclusions, the authors state, based on their measurements and analysis, that if the profit-driven publishers were unable to continue to publish new content, BitTorrent could have its popularity at risk.

VII. CONCLUSION

BitTorrent is a very popular and successful protocol for file sharing; however, it faces a problem regarding content availability. By analysing approximately 3,000 torrent files, we found a significant amount that referred to content already being shared in the network. This study shows that there are many BitTorrent swarms, isolated from each other, that share, to some extent, the same content. Despite the potential for higher availability, these swarms are independent and unaware of each other. By merging these swarms into a single one, we showed that the number of peers and seeders would increase in most cases, and thus availability would be improved. This could be beneficial both for ISPs, which could exploit locality to lower their connectivity costs, and for peers, who would benefit from more sources for the content being downloaded. We propose a solution called Partial Swarm Merger, a service outside the BitTorrent network capable of identifying torrents that represent the same or very similar content and supplying peers with this information. Peers can then join the different swarms, requesting and announcing only the pieces in common between their torrent and the torrent being shared in those swarms. We also show that, if publishers behaved differently when publishing content, they could take much more advantage of the PSM service.
PSM is not without its weaknesses. For one, PSM increases availability for pieces which are common among swarms, but it does not fully solve the problem of peers which are alone or with very few peers in their swarm, as the unique parts may still become unavailable [9]. PSM may enable a peer to download most of its torrent from other swarms, yet never finish because it is alone in its own swarm.
In the future we intend to implement PSM as a plugin for a popular BitTorrent client in order to evaluate the performance gains experienced by its users.

REFERENCES

[1] H. Schulze and K. Mochalski, "Internet Study 2008/2009," 2009. [Online]. Available: http://www.ipoque.com/resources/internet-studies/internet-study-2008_2009
[2] B. Cohen, "Incentives Build Robustness in BitTorrent," in Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley, USA, Jun. 2003.
[3] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman, "PlanetLab: an overlay testbed for broad-coverage services," SIGCOMM Comput. Commun. Rev., vol. 33, pp. 3-12, Jul. 2003.
[4] N. Christin, A. S. Weigend, and J. Chuang, "Content availability, pollution and poisoning in file sharing peer-to-peer networks," in Proceedings of the 6th ACM Conference on Electronic Commerce, ser. EC '05. New York, NY, USA: ACM, 2005, pp. 68-77.
[5] A.-J. Su, D. R. Choffnes, A. Kuzmanovic, and F. E. Bustamante, "Drafting Behind Akamai (Travelocity-Based Detouring)," in Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM 2006), Pisa, Italy, Sep. 2006.
[6] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T. Bates, and A. Zhang, "Improving Traffic Locality in BitTorrent via Biased Neighbor Selection," in Proceedings of the 26th International Conference on Distributed Computing Systems (ICDCS 2006), Lisboa, Portugal, Jul. 2006.
[7] R. L. Pereira, T. Vazão, and R. Rodrigues, "Adaptive Search Radius - Lowering Internet P2P File-Sharing Traffic through Self-Restraint," in Proceedings of the 6th IEEE International Symposium on Network Computing and Applications (IEEE NCA07), Cambridge, USA, Jul. 2007.
[8] PeerApp, "Comparing P2P Solutions," White Paper, Mar. 2007. [Online]. Available: http://www.peerapp.com/docs/ComparingP2P.pdf
[9] G. Neglia, G. Reina, H. Zhang, D. Towsley, A. Venkataramani, and J. Danaher, "Availability in BitTorrent systems," in INFOCOM 2007. 26th IEEE International Conference on Computer Communications, Anchorage, USA, May 2007, pp. 2216-2224.
[10] G. Dán and N. Carlsson, "Dynamic swarm management for improved BitTorrent performance," in Proceedings of the 8th International Conference on Peer-to-Peer Systems, ser. IPTPS '09. Berkeley, CA, USA: USENIX Association, 2009.
[11] L. Guo, S. Chen, Z. Xiao, E. Tan, X. Ding, and X. Zhang, "Measurements, analysis, and modeling of BitTorrent-like systems," in Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, ser. IMC '05. Berkeley, CA, USA: USENIX Association, 2005.
[12] D. S. Menasche, A. A. Rocha, B. Li, D. Towsley, and A. Venkataramani, "Content availability and bundling in swarming systems," in Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, ser. CoNEXT '09. New York, NY, USA: ACM, 2009.
[13] J. Han, T. Chung, S. Kim, H. Kim, T. T. Kwon, and Y. Choi, "An empirical study on content bundling in BitTorrent swarming system," CoRR, 2010.
[14] J. Han, T. Chung, H. Kim, T. T. Kwon, and Y. Choi, "Systematic support for content bundling in BitTorrent swarming," in INFOCOM IEEE Conference on Computer Communications Workshops, 2010.
[15] R. Cuevas, M. Kryczka, A. Cuevas, S. Kaune, C. Guerrero, and R. Rejaie, "Is content publishing in BitTorrent altruistic or profit-driven?" in Proceedings of the 6th International Conference on emerging Networking EXperiments and Technologies, ser. Co-NEXT '10. New York, NY, USA: ACM, 2010.