How Much Material on BitTorrent Networks Is Infringing Content? A Validation Study
Robert Layton, Paul A. Watters, Richard Dazeley
November, 2010

Abstract

BitTorrent is a widely used protocol for peer-to-peer (P2P) file sharing, including the sharing of material which is often suspected to be infringing content. However, little systematic research has been undertaken to measure the true extent of illegal file sharing. In this paper, we propose a new methodology for measuring the extent of infringing content. Our initial results indicate that at least 89.9% of files shared contain infringing content. We discuss the limitations of the approach and outline proposals to further verify the results.

Keywords: BitTorrent, infringing content, copyright infringement, piracy

1 Introduction

BitTorrent is a peer-to-peer (P2P) file sharing protocol which allows files to be efficiently distributed without reliance on a central server [1]. Files are distributed through clients and peers, with each peer containing different pieces of the file. Peers contact each other to download new pieces while, at the same time, allowing the pieces they currently have to be uploaded to other peers. Downloads through BitTorrent can be much faster than traditional downloads, due to the highly distributed nature of the download process.

As a protocol, BitTorrent has become extremely popular on the Internet: one global estimate is that BitTorrent traffic accounts for 57.19% of all Internet traffic [2], and one major ISP in Australia estimates that the figure is more than 50% of all traffic [26]. It has been utilised commercially, with companies such as Blizzard Entertainment using BitTorrent to release patches and updates for their popular online game World of Warcraft [3]. Another legitimate use is distributing updates to computers in corporate networks, allowing for the more efficient utilisation of scarce network resources.
However, there is significant debate in many communities over the way in which BitTorrent can be, and has been, used to share and distribute movies, software and music over the Internet, usually infringing the copyright held on the material. When shared illegally, this type of content is known as infringing content. It is fair to say that there is a range of perspectives expressed in the popular and online media about the extent to which BitTorrent and other P2P systems are used to distribute infringing content. On the one hand, critics of the copyright system argue that new technologies have opened up new ways of doing business, and that “old economy” companies must adapt to the changes [4]. On the other hand, creative industries rely on the copyright system to protect their intellectual property.

It is important that these matters be publicly debated; our intention in this paper is not to enter into this debate, however, but to introduce a methodology that can be used to provide objective evidence about the true nature of copyright infringement over BitTorrent networks. Evidence must be a key part of any public debate; often, proponents highlight the “positive” aspects of the technology. For example, the very popular “BitTorrent for Dummies” book [5] says that BitTorrent can be used for:

- Distributing “free” computer operating systems (like Linux)
- Distributing a “free” file to create “buzz”
- Distributing a book “for free” (like Free Culture)
- Distributing musical recordings released for “free” (like Phish)
- Distributing beta software

All of these uses are theoretically possible. The critical question for copyright owners and law enforcement is whether these “free” items constitute the vast majority of BitTorrent file sharing or not. File sharing proponents would generally argue that they do; indeed, during The Pirate Bay trial, the defendants argued that 80% of torrents were legal [27].
In contrast, copyright owners often argue that the opposite case must be true. Finding an objective answer would assist all parties involved in fighting or advocating for file sharing to understand the actual scale and scope of the problem. However, given the distributed nature of P2P protocols, answering this question in a rigorous and reliable manner is non-trivial.

To understand why the question is significant, consider why file sharers use P2P technology rather than a website with a single URL. In simple terms, BitTorrent and similar P2P technologies work in the following way:

1. A source file is created for sharing.
2. A torrent file is created, which acts as a table of contents for the fragments of the shared file. It contains the expected filenames of the shared files, the number of fragments in the file, and the hash of each fragment, so that the client can verify that the file has been reconstructed correctly. It also has a list of preferred and alternate trackers, and (for the latest version of the protocol) distributed hash table and peer exchange details.
3. A tracker is notified that the source file is ready for sharing.
4. The source is seeded until enough copies are available in fragmentary form on clients that have downloaded the source file.
5. Downloading the source file requires (a) finding the torrent, and (b) ensuring that all of the fragments are available and ideally downloaded from “peers” who have the highest bandwidth and lowest packet latency relative to the downloader, by using a client that understands the BitTorrent protocol.

Searching is performed at one of several search sites, such as The Pirate Bay or Isohunt. Integrity checks performed by the client ensure that the file is correctly reassembled, using the fragment hashes listed in the torrent file.
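The fragment-level integrity check described above can be illustrated with a minimal sketch. It assumes SHA-1 fragment hashes, as used in the original BitTorrent specification, and an unrealistically small fragment size; the function names are hypothetical and do not correspond to any particular client.

```python
import hashlib

# Tiny fragment size for illustration only; real torrents use far larger pieces.
PIECE_LENGTH = 4


def split_pieces(data, piece_length):
    """Cut the source file into fixed-size fragments (the last may be shorter)."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data, piece_length):
    """Hashes the seeder would record in the torrent file, one per fragment."""
    return [hashlib.sha1(p).digest() for p in split_pieces(data, piece_length)]


def reassemble(received, expected_hashes):
    """Join fragments in index order, rejecting any missing or corrupt fragment.

    `received` maps fragment index -> bytes obtained from some peer.
    """
    out = []
    for index, expected in enumerate(expected_hashes):
        piece = received.get(index)
        if piece is None or hashlib.sha1(piece).digest() != expected:
            raise ValueError(f"fragment {index} is missing or corrupt")
        out.append(piece)
    return b"".join(out)


source = b"hello torrent"
hashes = piece_hashes(source, PIECE_LENGTH)

# Fragments may arrive from different peers and in any order.
received = {i: p for i, p in enumerate(split_pieces(source, PIECE_LENGTH))}
restored = reassemble(received, hashes)
```

Because each fragment is checked independently, a client can discard a single bad fragment and request it again from another peer without restarting the whole download.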
P2P systems can be considered highly secure: (1) availability is provided through numerous peers rather than a single server representing a single point of failure; (2) access control can be provided through a number of different frameworks [6]; and (3) confidentiality can be provided through encryption of the source file. P2P technology reduces the bandwidth burden and cost borne by content producers; once a file has been seeded, there is no further necessary burden on the user who has shared the source file. There is also a logical separation between the act of hosting data fragments (as a peer) and searching for torrents (which is quite centralized). It is important to note that torrent search sites do not directly store any copyrighted data, and typically (but not always) disclaim any responsibility for copyright infringement (see footnote 1).

Footnote 1: Note that trackers do not store any attributions of copyright either.

The scale of file sharing activity is significant: for every shared file, there may be hundreds or thousands of fragments. In extreme cases, it can be difficult (but not impossible [7]) to identify, track, monitor and notify individuals who are involved in sharing a single file, especially where anonymisation technologies or network address translation is used. The highly distributed nature of BitTorrent makes it very difficult to directly “measure” its attributes.
However, some recent research has been undertaken to characterize different aspects of BitTorrent performance, including measures and estimates of popularity, availability, content lifetime and download performance, suggesting that BitTorrent outperforms its peers on the following metrics, as defined in [8]:

- Popularity, defined as the total number of users active during a specific time window
- Download performance, which is the ratio of the file size to the time taken to complete the download
- Content injection time, which is the gap between the creation of (copyrighted) content and its P2P release
- Pollution level, which is the proportion of content that is corrupt

Each of these metrics has received limited study in the academic literature, although one study [9] looked at content injection times for cinematic-release films on P2P networks. Ironically, much literature, for example [10-12], focuses on the effect of “free-riding” in BitTorrent and P2P networks, i.e., users who download a lot but do not significantly contribute to uploading data for other users. While the work described in [8] and [13] has been useful in modelling characteristics of P2P file sharing (such as average download speeds), these are a function of both popularity and available bandwidth. Most research so far does not directly address the status of copyrighted material, even though the computations were made using copyrighted files. Research papers which do address copyright infringement (e.g., [14]) are often not concerned with the practicalities of measuring the scale of sharing for specific (or all) copyrighted works. Other projects have focused on identifying whether specific countermeasures (such as distributing fakes) are effective, and conclude that they are probably not as effective as an intelligence-based approach [15, 16].

In this paper, we introduce a methodology that attempts to measure the extent of sharing of copyright-infringing material over BitTorrent.
Specifically, we set out to answer the following research questions:

- How many files are shared using BitTorrent, and what are the major categories of the files being shared?
- At a given point in time, how much file sharing is actually occurring using BitTorrent?
- For each shared file, how many times has it been shared in total?
- Overall, what is the number and percentage of shared files which are infringing, both by number of files and total downloads?

Obtaining an exact answer for any of these questions is impossible due to the scope and distributed nature of BitTorrent: there are thousands of BitTorrent trackers available, as well as other technologies such as Distributed Hash Tables and Peer Exchange, which preclude a complete study being performed. However, our goal was to make the most accurate and precise approximations possible by sampling the most popular trackers, using a number of techniques to extract metadata from torrents, and then matching these to known descriptors. After describing the methodology, we present preliminary results, and use triangulation to verify the relative rates of sharing of different categories of files.
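The last research question asks for the infringing share both by number of files and by total downloads, and the two measures can differ when downloads are concentrated in a few files. The following is a minimal sketch of the two calculations only. The record fields, function names and sample values are hypothetical, chosen for illustration, and are not taken from the study's data or method.

```python
from dataclasses import dataclass


@dataclass
class TorrentRecord:
    # Hypothetical fields; a real sample would be built from tracker metadata.
    name: str
    category: str
    downloads: int
    infringing: bool


def infringing_share_by_files(records):
    """Fraction of sampled files that are classified as infringing."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.infringing) / len(records)


def infringing_share_by_downloads(records):
    """Fraction of total downloads that went to infringing files."""
    total = sum(r.downloads for r in records)
    if total == 0:
        return 0.0
    return sum(r.downloads for r in records if r.infringing) / total


# Made-up records for illustration only.
sample = [
    TorrentRecord("linux-distro.iso", "software", 100, False),
    TorrentRecord("film-a", "movies", 700, True),
    TorrentRecord("album-b", "music", 150, True),
    TorrentRecord("public-domain-book", "books", 50, False),
]

by_files = infringing_share_by_files(sample)
by_downloads = infringing_share_by_downloads(sample)
```

In this toy sample, half of the files are infringing but they account for a larger share of downloads, which is why the question asks for both figures.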