University of Southampton Research Repository
Total Page:16
File Type:pdf, Size:1020Kb
University of Southampton Research Repository Copyright © and Moral Rights for this thesis and, where applicable, any accompanying data are retained by the author and/or other copyright owners. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis and the accompanying data cannot be reproduced or quoted extensively from without first obtaining permission in writing from the copyright holder/s. The content of the thesis and accompanying research data (where applicable) must not be changed in any way or sold commercially in any format or medium without the formal permission of the copyright holder/s. When referring to this thesis and any accompanying data, full bibliographic details must be given, e.g. Thesis: Author (Year of Submission) "Full thesis title", University of Southampton, name of the University Faculty or School or Department, PhD Thesis, pagination. Data: Author (Year) Title. URI [dataset] UNIVERSITY OF SOUTHAMPTON Optimisation Classification on the Web of Data using Linked Data. A study case: Movie Popularity Classification by Gunawan Budiprasetyo A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in the Faculty of Engineering, Science and Mathematics School of Electronics and Computer Science March 2019 UNIVERSITY OF SOUTHAMPTON ABSTRACT FACULTY OF ENGINEERING, SCIENCE AND MATHEMATICS SCHOOL OF ELECTRONICS AND COMPUTER SCIENCE Doctor of Philosophy Optimisation Classification on the Web of Data using Linked Data. A study case: Movie Popularity Classification by Gunawan Budiprasetyo Data mining algorithms have been widely used to solve various types of prediction models in movie domain. Classification problems especially to predict the future success of movies have attracted many researchers in order to find efficient ways to address them. However, movie popularity classification has become very complicated as it has too many parameters with different degrees. In this thesis, we review a broad range of literature on (1) movie prediction domain and identify related data, a main data source, and additional data sources to address these problems; (2) on data mining algorithms to build more robust classification models to predict movie popularity. To obtain the robust movie popularity classification model, three experiments were conducted. The first experiment examined five single classifiers (Artificial Neural Network (ANN), Decision Tree (DT), k-NN, Rule Induction (RI) and SVM Polynomial) to develop classification models to predict the future success of movie. The second experiment assessed the use of wrapper-type feature selection algorithms to develop classification models of movie popularity. The last one scrutinized two ensemble methods, bagging and boosting in classifying movie popularity. Based upon the finding and analysis, this thesis contributes in four areas: (1) it demonstrates the capabilities of linked data to get external movie- related data sources and shows how additional attributes fom external data sources can be used to improve performances of the classification model based on a single data source; (2) it presents the use of Grid Search to get a set of optimal hyper-parameters of Artificial Neural Network (ANN), Decision Tree (DT), Rule Induction (RI) and SVM Polynomial classifiers so as to get more robust classification model; (3) it proves the use of wrapper-type feature selection using Genetic Algorithm suited to those classifiers either using default or optimized parameters in order to get the robust classification model and (4) it establishes the use of ensemble methods (bagging and boosting) to those classifiers either using default or optimized parameters in order to get the model in question. Contents Authorship xxix Acknowledgements xxxi 1 Introduction1 1.1 Classification Problems With Web Data...................2 1.2 Obtaining Better Classification Models....................3 1.3 Motivations...................................4 1.4 Research Objectives..............................5 1.5 Research Contributions............................7 1.6 Thesis Structure................................7 2 Literature Review9 2.1 Movie Prediction Domain...........................9 2.1.1 The Prediction of Popularity Success of Movies...........9 2.1.2 The Prediction of Financial Success of Movies............ 12 2.1.3 The Prediction of Movie Contents.................. 14 2.2 Semantic Web.................................. 15 2.3 The Linked Data................................ 18 2.4 Wikidata.................................... 19 2.5 Web Scraping.................................. 20 2.6 REST Web Service............................... 21 2.7 Data Mining.................................. 22 2.7.1 Types of Data Mining......................... 22 2.7.2 Algorithms used for classification.................. 22 2.7.2.1 Artificial Neural Network.................. 24 2.7.2.2 Decision Tree......................... 24 2.7.2.3 k-Nearest Neighbours (k-NN)................ 25 2.7.2.4 Rule Induction........................ 26 2.8 Feature Selection................................ 27 2.8.1 Genetic Algorithms (GA)....................... 28 2.8.2 Particle Swarm Optimization (PSO)................. 29 2.9 Summary.................................... 30 3 Research Methodology 33 3.1 Introduction................................... 33 3.2 Datasets..................................... 36 3.3 Methods..................................... 38 v vi CONTENTS 3.3.1 Method to Answer RQ1........................ 40 3.3.2 Method to Answer RQ2........................ 40 3.3.3 Method to Answer RQ3........................ 41 3.3.4 Method to Answer RQ4........................ 42 3.4 Research Ethics................................. 42 3.4.1 IMDb Data Source........................... 43 3.4.1.1 The Terms of Service of IMDb............... 44 3.4.1.2 The Robots.txt of IMDb Web Site............. 44 3.4.2 Box Office Mojo Data Source..................... 45 3.4.2.1 The Terms of Service of Box Office Mojo......... 45 3.4.2.2 The Robots.txt of Box Office Mojo Web Site....... 45 3.4.3 FilmAffinity Data Source....................... 46 3.4.3.1 The Terms of Service of FilmAffinity........... 46 3.4.3.2 The Robots.txt of FilmAffinity Web Site......... 46 3.4.4 Metacritic Data Source........................ 47 3.4.4.1 The Terms of Service of Metacritic............. 47 3.4.4.2 The Robots.txt of Metacritic Web Site.......... 47 3.5 The Selection of Supervised Learning Techniques.............. 48 3.5.1 Classifiers................................ 48 3.5.2 Feature Selection............................ 51 3.5.3 Parameter Optimization........................ 51 3.6 Summary.................................... 52 4 Data Collection and Preparation 53 4.1 Introduction................................... 53 4.2 Data Selection................................. 53 4.2.1 IMDb List Files Extraction and Transformation........... 54 4.2.2 DBpedia Data Extraction....................... 55 4.2.3 Extraction of Wikidata Data..................... 56 4.2.4 Data Extraction from the IMDb Website.............. 58 4.2.5 Data Extraction from the BoxOfficeMojo Website......... 59 4.2.6 Data Extraction from the FilmAffinity Website........... 59 4.2.7 Data Extraction from the Metacritic Website............ 61 4.2.8 Data Extraction from the MovieMeter Website Using the MM API 62 4.2.9 Data Extraction from the RottenTomatoes Website Using the OMDb API............................... 63 4.2.10 Linking IMDb database and DBpedia................ 65 4.3 Data Accessing................................. 78 4.3.1 Publishing the Database Content on the Web............ 79 4.4 Data Preparation................................ 80 4.4.1 Generating the IMDb Dataset..................... 81 4.4.1.1 The Input Attributes of the IMDb Dataset........ 81 4.4.1.2 The Output Attributes of IMDb Dataset......... 83 4.4.2 Generating the IMDb++ Dataset................... 84 4.4.3 Generating FilmAffinity, MovieMeter, and RottenTomatoes Datasets 85 4.5 Summary.................................... 86 CONTENTS vii 5 Basic Classifiers for Movie Classification Models 89 5.1 Classification System Design......................... 89 5.1.1 Design Implementations........................ 90 5.1.1.1 Data Collection....................... 91 5.1.1.2 Data Preparation...................... 94 5.2 Performance Measurement........................... 96 5.3 Experimental Results.............................. 98 5.3.1 Artificial Neural Network (ANN)................... 98 5.3.2 Decision Tree.............................. 100 5.3.3 k-NN (k-Nearest Neighbour)..................... 105 5.3.4 Rule Induction............................. 107 5.3.5 Support Vector Machines (SVM)................... 110 5.4 Parameter Optimization to ANN, DT, RI and SVM Parameters...... 112 5.4.1 Parameter Optimisation to ANN Parameters............ 113 5.4.2 Parameter Optimisation to Decision Tree Parameters....... 115 5.4.3 Parameter Optimisation to Rule Induction Parameters....... 118 5.4.4 Parameter Optimisation to SVM Parameters............ 120 5.5 The Results Comparison of Classification Models.............. 123 5.6 Summary.................................... 129 6 Feature Selection for Movie Classification Models 131 6.1 Feature Selection System Architecture.................... 131 6.2 Feature Selection Algorithms......................... 132 6.2.1 Genetic Algorithm (GA) Search.................... 133 6.2.2 Particle Swarm Optimization (PSO) Search............. 134 6.3 Experimental Results.............................. 135 6.3.1