Correlation of BitTorrent Downloads with Movie Ratings
Eveline Elmer
Zürich, Switzerland
Student ID: 13-738-034

Bachelor Thesis
Communication Systems Group (CSG), Prof. Dr. Burkhard Stiller
Supervisors: Andri Lareida, Thomas Bocek
Date of Submission: December 15, 2016

University of Zurich
Department of Informatics (IFI)
Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
URL: http://www.csg.uzh.ch/

Zusammenfassung

The subject of this thesis is a comparison between illegally downloaded movies and the ratings of those movies. The data on the illegally downloaded movies comes from the VIOLA project, which collected measurements from BitTorrent and stored them in a database. Each database entry contains a torrent, an IP address from which the torrent was downloaded, and a timestamp stating when the torrent was stored in the database. Only a part of the immense VIOLA database was used in this thesis. The data used is limited to torrents that were downloaded in May 2016. The torrent data was examined for one week only, and the number of downloads was recorded on the basis of unique IP and port combinations. The ratings used for the comparison were taken from the OMDb API, a website that aggregates data about movies. The movies are rated by their IMDb rating as well as by a rating from the Rotten Tomatoes website. The number of movies considered is 1,813. For some movies several torrents existed; the download counts of the individual torrents were added up and the sum was compared with the rating. The evaluation of the data is presented in graphs.
The comparison between the number of downloads and the IMDb rating across all movies showed that movies with a high rating were not downloaded more often than movies with a lower rating. The literature indicated that other factors may be responsible for a higher number of downloads. Therefore only a subset of the movies was compared. Newer movies were downloaded more often than older movies. Movies with several torrents were downloaded more often than movies with only one torrent. The language and the genre of a movie may also play a role in the number of downloads. Therefore 319 movies were selected that were released between 2000 and 2015, had fewer than 5 torrents, and had English as the movie language. For this reduced dataset, too, no relationship between the number of downloads and the movie rating was found. The factors that influenced the number of downloads much more clearly and strongly were the number of torrents and the release date of the movies.

Abstract

This thesis investigates whether a correlation exists between the rating of a movie and the number of times it is downloaded. The information about the movies stems from the database of the VIOLA project, which observed torrent websites and stored the information. The data used for this thesis covers movies from one week in May 2016. The movie ratings stem from the IMDb website and the Rotten Tomatoes website. No correlation was found between the number of downloads and the rating of a movie when looking at all torrents used in the thesis. Taking into consideration that factors like genre, language, release date and number of torrents influence the number of downloads of a movie, a reduced dataset of 319 movies was compared. For this reduced dataset, too, no correlation between the rating and the number of downloads was established.
Factors like the number of torrents and the release date seem to have a higher influence on the number of downloads than all the other factors examined.

Acknowledgments

I would like to thank Prof. Dr. Burkhard Stiller and the Communication Systems Group of the University of Zurich for the opportunity to delve into a versatile topic, which led to interesting insights. I would also like to thank my supervisor Andri Lareida very much for all his input, ideas and help at any time during the writing of the thesis. His feedback was very helpful and greatly appreciated.

Contents

Zusammenfassung
Abstract
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Description of Work
  1.3 Thesis Outline
2 Related Work
  2.1 Rating Websites
  2.2 Number of Torrents
  2.3 Genre Preferences
  2.4 Demography of Movie Pirates
3 Data Processing
  3.1 Overview over VIOLA
  3.2 Processing VIOLA Data
  3.3 OMDb API
  3.4 Overview over the Dataset
4 Correlation of Movie Downloads and Movie Ratings
  4.1 Raw Data Material Evaluation
  4.2 Movie Languages
  4.3 Movies by Genres
  4.4 Recent Releases
  4.5 IMDb Votes
  4.6 Number Of Different Torrents For The Same Movie
  4.7 Summary
5 Summary
Bibliography
List of Figures
List of Tables
List of Code Snippets
A Contents of the CD
  A.1 Source Code
  A.2 Collected Data
  A.3 Thesis
  A.4 Related Work
  A.5 Presentation

Chapter 1 Introduction

File sharing applications cause a large part of the total Internet traffic [2]. Most of these applications work peer-to-peer.
During peak traffic hours this can lead to problems for Internet Service Providers. Software, music and movies are the main content distributed over file sharing applications. For content creators and providers it is of interest to know how successful their content is. This thesis focuses on movies and their distribution. The main question discussed is whether popularity can be studied through file sharing applications. The data from the file sharing applications is compared to the ratings that users and professionals gave the movies. The question of whether movies with high ratings are more likely to be downloaded is answered.

1.1 Motivation

A study [11] examining the correlation between movie ratings and movie performance looked at 246 movies across multiple genres and compared their performance to ratings by professionals (Rotten Tomatoes) and users (Yahoo Movies). The study showed that a correlation exists between movie ratings and revenues in the early weeks after a movie's release. High ratings by professionals lead to higher user ratings and higher revenues [11]. It is therefore of interest whether higher movie ratings lead to more illegal downloads, or whether the Internet community and Internet pirates are less influenced by ratings than regular moviegoers. The study presented would suggest that higher ratings lead to more downloads of a movie.

1.2 Description of Work

This thesis investigates whether a correlation exists between illegal movie downloads and movie ratings. The information about the illegal movie downloads is based on data from the VIOLA project [9]. The thesis does not look at all the data from the VIOLA project but rather at a selection thereof. It focuses on movie downloads during one week in May 2016 and compares these movies with each other. To make the movies comparable, the dataset from VIOLA has to be cleaned up and completed with information from a free-to-use website.
To use the information from the website, a Java program is implemented. Rating websites like IMDb and Rotten Tomatoes are used to obtain the movie ratings. The selected data from VIOLA is compared with the data collected from the IMDb and Rotten Tomatoes websites and afterwards evaluated and presented in this thesis.

1.3 Thesis Outline

Chapter 2 contains information about the sources used for the thesis and presents related literature. Topics of the literature presented are genre preferences for movies and the demography of movie pirates.

Chapter 3 provides an overview of the processing of the VIOLA dataset. It describes how data is stored in the VIOLA database and explains how the dataset used for the evaluation was collected and completed.

Chapter 4 evaluates the dataset collected in Chapter 3. Multiple factors are considered in the evaluation to ensure that the result is not distorted.

Chapter 5 provides a summary of the thesis as well as a conclusion drawn from the findings of Chapter 4.

Chapter 2 Related Work

This chapter focuses on work related to the thesis and on the main technologies and websites used in it. The websites used for the comparison between the number of downloads and the movie ratings are the IMDb website and the Rotten Tomatoes website; both are introduced in this chapter. The chapter further provides a short introduction to the OMDb API, which is used to retrieve the relevant information. The following sections focus on literature on related topics and finally on data analysis.

2.1 Rating Websites

The IMDb website is dedicated to providing information on celebrities, TV shows and movies. Among other things, it provides information on movies, their plots, actors and crew, as well as a rating for each movie [7]. The rating system is user-based.
Every registered user can vote for a movie on a scale from 1 to 10, with 1 being the worst rating and 10 the best [8]. A movie needs at least 5 votes to receive a rating.
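
The voting rule described above can be sketched in Java, the language the thesis uses for its own tooling. The class name `RatingSketch`, its constants and the use of a plain arithmetic mean are assumptions made for illustration only: the text states the 1 to 10 scale and the five-vote minimum but not how IMDb aggregates the votes, so the averaging step should not be read as IMDb's actual method.

```java
import java.util.List;
import java.util.OptionalDouble;

/** Hypothetical helper; not part of the thesis code or of any IMDb API. */
final class RatingSketch {
    static final int MIN_VOTES = 5; // minimum number of votes stated in the text
    static final int WORST = 1;     // lowest allowed vote
    static final int BEST = 10;     // highest allowed vote

    /**
     * Returns a rating only if at least MIN_VOTES votes exist.
     * The plain mean is an assumption; the text does not specify
     * how the individual votes are combined.
     */
    static OptionalDouble rating(List<Integer> votes) {
        for (int vote : votes) {
            if (vote < WORST || vote > BEST) {
                throw new IllegalArgumentException("vote out of range: " + vote);
            }
        }
        if (votes.size() < MIN_VOTES) {
            return OptionalDouble.empty(); // too few votes, no rating yet
        }
        return votes.stream().mapToInt(Integer::intValue).average();
    }

    public static void main(String[] args) {
        // Three votes are below the threshold, five votes are enough.
        System.out.println(rating(List.of(8, 9, 7)).isPresent());
        System.out.println(rating(List.of(8, 9, 7, 10, 6)).isPresent());
    }
}
```

Returning an `OptionalDouble` rather than a sentinel such as 0 keeps "no rating yet" distinct from a genuinely low rating, which matters when movies without a rating have to be excluded from a comparison.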