Partial Swarm Merger: Increasing BitTorrent content availability

António Homem Ferreira∗, Ricardo Lopes Pereira∗ and Fernando M. Silva∗
∗INESC-ID / Instituto Superior Técnico
Email: fantonio.h.ferreira,[email protected], [email protected]

Abstract: The BitTorrent Peer-to-Peer (P2P) file sharing protocol is a popular way to distribute digital content. It is a highly scalable protocol, where the entry of new peers increases the total capacity of the network, especially after peers have finished their download and remain just as uploaders (known as seeders). Through the monitoring of live Internet swarms we have discovered that there is a significant amount of repeated content being shared. Various publishers tend to publish the same content through different torrent files, creating independent swarms that end up having the exact same content or a large number of common parts. As such, the size of each swarm is smaller than it could be, should there be only one swarm for that content. This affects the performance of BitTorrent and diminishes the opportunities for exploiting locality.

By analyzing 3067 swarms, we concluded that there is a significant amount of common content between different swarms. This redundancy can be exploited in order to increase data availability and source diversity. We propose a novel technique, called Partial Swarm Merger, which adds a new component to the BitTorrent infrastructure, allowing peers to learn about swarms with common content. With this information, peers could combine the different swarms, announcing and requesting from each swarm the pieces in common with their download. This will increase the availability of the parts which are common to the several swarms.

I. INTRODUCTION

In the last few years, Peer-to-Peer (P2P) communication has increased exponentially [1], proving to be one of the most successful architectures for providing a number of services like VoIP, video streaming and, of course, file sharing. One of the most successful and popular P2P protocols is BitTorrent [2], which represents most of the P2P traffic generated worldwide [1]. BitTorrent presents no infrastructure costs beyond the residential grade Internet connection supported by each user. As such, it is very affordable and convenient for a user to put his content online to be shared.

The low barrier of entry for publishing content has enabled many to publish the works of others. Although anyone can publish content for sharing, users tend to download the content from sources they trust. These sources are usually groups of individuals (publisher teams) that have made a reputation for themselves by competing with each other to be the first group to publish a specific piece of content. This competition between groups often results in the creation and publishing of different torrent files that represent the same or very similar content.

This fact is a source of redundancy which is not exploited by the conventional BitTorrent protocol. In BitTorrent, swarms are identified by an infohash, which is a SHA1 hash that covers a hash of each piece of the content and the names of the files being shared. This means that, even if two torrent files represent the exact same content, just by changing the name of one of the files being shared (e.g. adding the team name to the file's name), the two files will generate two different infohashes and thus result in two different swarms that do not share any data with each other. The same will happen if a new file (e.g. a subtitles file for another language) is appended.

This study shows that this competition among teams results in the existence of different swarms, sharing the same or similar content but isolated from each other, and discusses how this affects the performance of BitTorrent. We propose that content availability can be improved by combining these different swarms into a larger one. These swarms can be found by comparing information inside each torrent file: the hash string of each piece as well as the piece size and overall file size. With this information, it can be determined if the different swarms share the same content or, at least, if they share a high number of identical pieces.

The data for this study was obtained by analysing torrent files and gathering information on the associated swarms. The latter was performed using an instrumented BitTorrent client application running on several PlanetLab [3] nodes. Based on this data, we propose a solution called Partial Swarm Merger (PSM) as an efficient way to exploit the content redundancy. This solution would be based on a service outside the BitTorrent network and would not require any modifications to the BitTorrent protocol. However, BitTorrent client applications would need an extension in order to use the service.

PSM could also be used in other scenarios. A GNU/Linux distribution could arrange common, architecture-independent files into the first pieces, placing architecture-specific binaries at the end, allowing peers downloading versions for different architectures to share the common pieces. A software distributor could also package common library files used by different software products in the first positions, so as to be shared among peers downloading different products.

This paper is organized as follows: Section II presents the BitTorrent protocol, focusing on the publishing mechanism and the torrent file. We present our methodology for swarm monitoring in Section III and in Section IV we discuss the results of the study. Partial Swarm Merger is detailed in Section V. Finally, Section VI discusses the related work and Section VII presents the final conclusions.

II. BITTORRENT PROTOCOL

Before presenting the work described in this paper, we provide a brief review of the BitTorrent protocol, focusing on the publishing method and the torrent file.

BitTorrent is a P2P protocol for file sharing where peers share files among themselves, bearing the upload costs. Unlike other file-sharing P2P protocols such as eMule1 or Kazaa2, BitTorrent doesn't provide any mechanism for file search. Its goal is just the exchange and replication of files. This means that all file searches are done outside the network. There are two main components in the BitTorrent protocol:

1) Tracker: Provides a list of peers sharing a given file. Can also receive and log information about upload/download rates and other details for statistical purposes.
2) Peers: Share a given file among themselves. There are two types of peers: the ones that have already finished the download of the file, called seeders, and the ones still downloading the file, called leechers.

To share a file, a peer needs to create a torrent file and publish it. This file contains meta-information like: (1) file names and sizes, (2) tracker(s) Uniform Resource Locator (URL), (3) the hashes for each file part (piece) and the fixed piece size, (4) comments, creation date, encoding and other information on the content and files. After the torrent file is published, usually in a webpage, interested users can download and open it with a BitTorrent client application. This application reads the file and queries the tracker for a list of active peers for that same file. After receiving the list, it connects to the peers and starts downloading the file. All file distribution is done between peers. Trackers don't get involved in the file sharing process.

Peers exchange blocks of data from a data aggregate which contains one or more files concatenated. The exchange unit is the piece, which has a fixed size. Each piece is associated with a SHA1 hash, found in the torrent file, used to verify its integrity. After downloading a piece and verifying its hash, a peer informs every peer connected to it that it already has that piece available for upload.

III. SWARM MONITORING

In this section, we will first describe how the torrent files were collected. Then we will present our methodology for analyzing and comparing files in order to determine redundant content.

The first step in this study was the download of the BitTorrent torrent files. The files were obtained through a Rich Site Summary (RSS) reader script that read RSS feeds from PirateBay3, isohunt4 and btjunkie5 and downloaded the files in each feed. The RSS feeds were followed from 25th of April 2011 to 12th of June 2011. The torrent database was primed with PirateBay's one hundred most popular files on April 25th and complemented with isohunt's twenty most popular files for each major category (audio, tv and video) on May 24th.

During the same period, an instrumented version of BitTornado (a Python BitTorrent client)6 ran on a number of PlanetLab nodes, reading the torrent files and querying the corresponding trackers to obtain peer related information such as swarm size, number of leechers and number of seeders. BitTornado was used to query trackers for peers at intervals of approximately 20 minutes. The application was also modified so as not to download the content pertaining to the torrent files. The results for swarm size don't include the PlanetLab nodes used. A swarm stopped being monitored when the number of peers dropped below 30 and the number of seeders dropped to 0 for a period of a week.

To determine which torrent files represented the same content or at least very similar content, the torrent files (metainfo) were compared based on the following similarity criteria (in order):

1) Piece size - In order to be able to compare pieces' hash values, it is necessary for the pieces of both torrent files to have the same length;
2) Overall content size - torrent files were only compared if either both had the same size or their sizes were within a margin of 5% of each other;
3) Pieces' hash value - For two pieces to be considered as corresponding to the exact same content, their hash