Song Ranking Based on Piracy in Peer-To-Peer Networks
10th International Society for Music Information Retrieval Conference (ISMIR 2009)

Noam Koenigstein, Yuval Shavitt
School of Electrical Engineering, Tel Aviv University, Israel
{noamk, shavitt}@eng.tau.ac.il

ABSTRACT

Music sales are losing their role as a means for music dissemination but are still used by the music industry for ranking artist success, e.g., in the Billboard Magazine chart. Thus, it was suggested recently to use social networks as an alternative ranking system; a suggestion which is problematic due to the ease of manipulating the list and the difficulty of implementation. In this work we suggest using logs of queries from peer-to-peer file-sharing systems for ranking song success. We show that the trend and fluctuations of the popularity of a song in the Billboard list have a strong correlation (0.89) to the ones in a list built from the P2P network, and that the P2P list has a one-week advantage over the Billboard list. Namely, music sales are strongly correlated with music piracy.

1. INTRODUCTION

Peer-to-peer (P2P) networks are among the Internet’s most popular applications. The number of users and the volume of traffic are growing dramatically from year to year. Despite several recent high profile legal cases against P2P vendors and users, it seems that the P2P community at large remains strong and healthy. In fact, P2P networks are gaining more acceptance as many companies and organizations distribute software and updates via networks such as BitTorrent to save bandwidth (e.g., Ubuntu).

Some studies suggest that music piracy might increase legal sales [1, 2], and copyright owners are advised to start developing business models that will allow them to generate revenue from P2P activity. Pioneering suggestions to utilize P2P networks for the benefit of the music industry were made by Bhattacharjee et al. [3, 4], where P2P activity was used to predict an album’s life cycle on the Billboard’s top 200 albums chart.

In our previous work [5] we showed how P2P queries can be used for early detection of unknown emerging artists. In this study we take a different approach; we suggest an alternative song ranking based on file sharing activity, that might replace traditional artist rankings such as the Billboard. We measured music piracy using a data set of geographically identified P2P query strings, and compared it to song rankings on the Billboard Hot 100, which measures sales and air plays. We compiled popularity charts based on P2P activity, and show a strong correlation between music piracy and legal sales and air plays. We argue that ranking songs through measurement of P2P queries is a good predictor of people’s taste, and has many advantages over other means of popularity ranking which were suggested in the past, most notably using social networks [6].

The remainder of the paper is organized as follows: In Section 2 we introduce the data set used in this study, and the methodology used to collect it. In Section 3 we focus on comparing song popularity in P2P networks with their ranking on the Billboard. We discuss the significance of our findings and our conclusions in Section 4.

2. DATA-SETS AND METHODOLOGY

We use two data sources for this study:

• P2P Search Queries: A data-set of queries collected from the Gnutella file-sharing network over twenty-three weeks from January 7th 2007 to June 8th 2007.

• The Billboard Hot 100: The Billboard Hot 100 weekly charts for 2007 as published by the Billboard Magazine.

These two data-sets were collected independently, yet this study reveals a strong relationship between them. However, before analyzing the commonalities and differences, let us first describe the data sets and the methodology used to collect them.

2.1 P2P Search Queries

Queries in a file sharing network represent their users’ current tastes and interests. A query is issued upon a request by a user searching for a specific file, or content relevant to the search string. In this study we used data collected from the Gnutella network using the Skyrider systems^1. This data-set and the technical details of the methodology used to collect it are described in more depth in [7].

1 Skyrider was a startup company that developed file sharing applications and services. It has recently been closed down. The data-set was collected when the company was still active, and is available for academic research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
© 2009 International Society for Music Information Retrieval.

Rank   String                 Occurrences
1      adult                  36,290
2      akon                   23,468
3      lil wayne              12,518
4      beyonce                11,987
5      this is why i’m hot    10,746
6      justin timberlake      10,193
7      porn                   9,144
8      don’t matter           9,047
9      fergie                 8,979
10     fall out boy           8,077

Table 1. P2P popularity chart for week 9 of 2007

2.1.1 The Gnutella File Sharing Network

In a study performed by Slyck.com, a website which tracks the number of users of different P2P applications, Gnutella was found among the three most popular P2P file-sharing applications together with eDonkey and FastTrack [8]. Furthermore, according to [9], Gnutella is the most popular file sharing network on the Internet today with a market share of more than 40%. It is mainly used for piracy of music. In [5] the top 500 most popular queries were manually classified, and it was found that 68% of the queries were music related.
Together with adult content (22%), these two categories dominate the query traffic, together accounting for 90% of the queries. Gnutella is also among the most studied P2P networks in the literature [5, 7, 10–16].

2.1.2 Methodology

A query’s origin IP address is required in order to classify it geographically by country of origin. While it is possible to capture a large quantity of Gnutella queries by deploying several hundred ultrapeer nodes^2, it is not possible to tell the origin IP address of most of these captured queries. The basic problem in identifying the origin of captured queries is that queries do not in general carry information regarding their origin. What they do usually carry is an “Out Of Band” (OOB) return IP address. This address allows clients that have content matching a query to respond to a location close to the origin of the query, without having to backtrack the path taken by the query message. However, as most queries come from firewalled clients, in most cases the OOB address will belong to the ultrapeer connected to the query origin, acting as a proxy on behalf of the query originator. Deducing the missing origin IP address is not trivial. We resolved this problem by using a hop counting technique that is further explained in [7].

The vast majority of the Gnutella network consists of Limewire clients (80%-85%) and Bearshare clients (6%-10%) [10]. The Limewire client does not allow users to perform any kind of automatic or robotic queries. It does not allow queries with the SHA1 extension^3, nor does it allow the automatic re-sending of queries. When it does send duplicate queries, it uses a constant Message ID, which enables simple removal of any duplication. By recording only queries originating from Limewire clients, we were able to significantly reduce the amount of duplications and automatic (non-human) queries, without losing too much of the traffic.

2.2 Data Set Statistics

A daily log file of queries typically contained 25-40 million record lines. Each line consists of the query string, a date/time field, and the IP address of the node issuing the query. The origin country for each query was resolved using MaxMind’s commercial GeoIP database. Similarly to the Billboard charts, we wanted to concentrate on data originating from the United States. We thus removed all non-US queries, which eliminated 55%-60% of the data records.

Our data-set comprises query strings collected over a period of 23 weeks from January 7th 2007 to June 8th 2007. The activity on the Gnutella network increases by 20%-25% over the weekend [7]. We thus used weekly samples taken on a Saturday or a Sunday of every week of that period. The total number of US originated query strings processed in this study is 185,598,176.

2.3 The Billboard Hot 100

The Billboard Hot 100 is the United States music industry standard singles popularity chart issued weekly by Billboard magazine [17]. Chart rankings are based on radio play and sales data collected 10 days before the chart is released. The ranking process does not take into account file sharing activity. A new chart is compiled and officially released to the public each Thursday. The chart is dated with the week number of the Saturday after, but in this study we used dates and week numbers according to the actual release date of the chart, and ignored the date issued by Billboard magazine. To simplify time tracking in this paper, we use week numbers instead of full dates to chronologically order the Billboard charts and the weekly file sharing data we collected. For example, the Billboard chart which was released on Thursday January 11th 2007 (week number 2), was dated by Billboard to January 20th (week 3) but