Integrated Movie Database CSCI 586 Project Report
Muhammad Rizwan Saeed, Santhoshi Priyanka Gooty Agraharam, Ran Ao
University of Southern California

1 Background

The extensive growth of the Web and its associated technologies has drawn research attention to the idea of the Semantic Web. The purpose of the Semantic Web is to give meaning to data on the web, so that it augments the capability of both humans and machines to understand and process information effectively [1]. In the domain of the Semantic Web, Ontologies (or vocabularies) define the concepts and relationships used to describe and represent an area of concern [2]. The purpose of creating Ontologies and integrating data is to organize heterogeneous data sources, which simplifies on-demand information access and enables complex analysis on the integrated knowledge base.

In our project, titled "Integrated Movie Database", we apply the concepts of the Semantic Web and, by using Ontologies, integrate data coming from various sources to provide a unified view to the user. The data becomes queryable and can be used to extract the information required by the user.

2 Problem Statement

Various websites provide different (and often disjoint) chunks of information related to movies. IMDB¹ (Internet Movie Database) is the most comprehensive online source of information about movies. It provides various attributes of movies, such as title, genre, run time, casting details and awards, organized across multiple web pages. Similarly, BoxOfficeMojo² is another movie-related website, which mainly focuses on the financial aspects of movies. It keeps track of the domestic and worldwide earnings of movies on a daily, weekly and monthly basis. Multiple other websites, such as Fandango³ and Google Movies⁴, provide lists of movies in local movie theaters and their showtimes. Some websites keep track of the critical reception of movies, e.g.
Rotten Tomatoes⁵ is one such website that assigns a score to every movie based on the percentage of positive reviews published about it in notable (print and digital) publications. Because information about movies is distributed across multiple websites in this way, users cannot get a unified view of all the relevant information pertaining to a movie in one place. For example, a user can only get movie showtimes from Google Movies. To select a movie to watch from those listed, the user may also need information such as critic reviews, user reviews and box office gross. Since this information is not available on the Google Movies page, the user must browse other websites to get all the relevant pieces of information. Similarly, analyzing the box office business of Oscar-winning movies requires accessing multiple web pages of different websites, which can be a time-consuming process for the end user.

¹ http://www.imdb.com
² http://www.boxofficemojo.com
³ http://www.fandango.com
⁴ http://www.google.com/movies
⁵ http://www.rottentomatoes.com

Fig. 1: Work Flow

Solving this problem of consolidating scattered pieces of relevant information is the goal of our project. We acquire information about movies from multiple web sources and create a framework that organizes and integrates the data and provides a SPARQL⁶ endpoint for querying movie-related data.

Here, we want to clarify that the goal of the project is not to create a mash-up application. Mash-up applications usually focus only on presenting different streams of information in different panels or frames of the same application window. However, those streams cannot be used to run integrated queries. For example, consider a mash-up application that shows information from RottenTomatoes and BoxOfficeMojo in separate panels in the same window.
A user may be able to interact with one panel to filter movies with a critic score above 90% and with the other to filter movies that have grossed more than a billion dollars worldwide. However, a query that returns results fulfilling both criteria simultaneously requires both data streams to be integrated, which is the objective of our project.

⁶ https://www.w3.org/TR/rdf-sparql-query/

3 Scope

In our project, we focus on data available from the following websites. The attributes crawled from each website are listed in Table 1:

1. Internet Movie Database (www.imdb.com)
2. Box Office Mojo (www.boxofficemojo.com)
3. Rotten Tomatoes (www.rottentomatoes.com)
4. Good Reads (www.goodreads.com)
5. Wikipedia (en.wikipedia.org)
6. Google Movies Cinema Page (crawled in real time based on user query; see Section 4.4 for details)

The high-level schematic of the project is shown in Figure 1. We provide more details about the project in subsequent sections. In Section 4, we discuss in detail the phases of the project and the challenges faced in each phase, and in Section 5, we present our conclusions and possible future work.

Website         Extracted Information
IMDB            Title, Release Date, Genre, MPAA Rating, IMDB User Rating, List of Cast Members (Actors/Actresses), Metacritic Score, List of Academy Awards (won), Director
RottenTomatoes  Title, Year, Critic Score (Tomatometer)
BoxOfficeMojo   Title, Release Date, Genre, Run time, Domestic Gross, Worldwide Gross, Budget
GoodReads       Book Title, Year, Author Name, Rating
Wikipedia       Movie links for IMDB and RottenTomatoes
Google Movies   Title, Year, IMDB Link, Showtimes

Table 1: Information extracted from websites

4 Approach

In this section, we discuss the different phases of the development and execution of our project. The project can be divided into four phases:

1. Data Acquisition
2. Data Modeling & Integration
3. Data Linking
4. Querying

4.1 Data Acquisition

Our primary focus is on extracting information from the data sets listed in Section 3. We created Java-based crawlers using the jsoup⁷ library, which provides an API for extracting and manipulating data from web pages. Because each type of web page has its own structure, we created a separate crawler for each type. The details of the process and the challenges of extracting data from the different web sources are discussed next.

⁷ http://jsoup.org/

Fig. 2: SPARQL Query to extract Wikipedia pages of Movies from DBpedia

IMDB and RottenTomatoes

For each crawler, we require a list of URLs to crawl. To generate such a list for each website, we started by extracting relevant information from DBpedia⁸. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia⁹, built on the principles of Linked Open Data (LOD) [3]. On DBpedia, movies are represented as instances of the classes dbo:Film¹⁰ and schema:Movie¹¹. Every DBpedia entity has a property foaf:isPrimaryTopicOf¹² that holds the link to the corresponding Wikipedia page. Using SPARQL queries (similar to the one shown in Figure 2), we created a list of Wikipedia pages related to movies. Every such Wikipedia page contains links to the IMDB and RottenTomatoes pages for the corresponding movie in the section named "External links". We created a crawler for Wikipedia pages to extract those links and thereby generated the lists of URLs used to crawl the IMDB and RottenTomatoes websites.

BoxOfficeMojo and GoodReads

BoxOfficeMojo provides an index of the movies that are available on its website. We first wrote a crawler to extract the list of URLs from that index and then used another crawler to extract information from the respective movie pages. We also thought that it would be interesting to add another dimension to our movie data sets by integrating information related to books into our repository.
We crawled multiple user-generated lists on GoodReads that contained the names of books that were adapted into movies. The statistics of all the extracted data are given in Table 2.

⁸ http://wiki.dbpedia.org/
⁹ https://www.wikipedia.org/
¹⁰ http://dbpedia.org/ontology/Film
¹¹ http://schema.org/Movie
¹² http://xmlns.com/foaf/0.1/isPrimaryTopicOf

Fig. 3: Integrated Movie Ontology

Website          No. of Records Generated
IMDB             Records generated for 36,549 movies; total casting records generated: 856,407
Rotten Tomatoes  Records generated for 10,000 movies
Box Office Mojo  Records generated for 16,945 movies
Good Reads       Records generated for over 3,000 books

Table 2: Statistics about generated data

4.2 Data Modeling & Integration

To model our data, we created the Integrated Movie Ontology, shown in Figure 3. In the center of the figure, we have the Movie class with its various attributes. An instance of Movie can be based on an instance of the Book class. Both Book and Movie are subclasses of the class CreateWork. Each book has an associated instance of the Author class, and each movie has associated instances of the classes Actor and Director. Actor, Director and Author are subclasses of the class Person. Each movie instance can be associated with multiple instances of the class Award, and each such instance of Award has an associated winner (an instance of the class Person).

We created a Java program using the Apache Jena¹³ API that took the relevant segments of the Ontology and the data file(s) as input and converted the CSV data into RDF triples. Table 3 shows a few examples of triples for the movie "The Prestige".

¹³ http://jena.apache.org/
The Prestige (2006)
PREFIX imo: <http://www.usc.edu/csci586/entertainment#>

subject                               predicate           object
http://www.imdb.com/title/tt0482571/  rdf:type            imo:Movie
http://www.imdb.com/title/tt0482571/  imo:Title           The Prestige
http://www.imdb.com/title/tt0482571/  imo:IMDBUserRating  8.5
http://www.imdb.com/title/tt0482571/  imo:Year            2006

Table 3: Sample triples for movie The Prestige

4.3 Data Linking

Every movie has been assigned its own ID in different data sources. After converting data into RDF in the previous phase, there can be multiple nodes in the RDF graph corresponding to the same movie, as the linkage between same movies across data sets has yet to be established.
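
To make this starting point concrete, the following is a minimal sketch in plain Java (standard library only, not the Jena-based converter described in Section 4.2) that serializes the same film from two sources as N-Triples and counts the resulting subject nodes. The IMDB URI, the imo namespace and the Title and Year properties are taken from Table 3; the Box Office Mojo URI, the class name and the method names are hypothetical and chosen only for illustration, and values are written as plain string literals for simplicity.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: two sources describe the same film under different subject URIs. */
final class UnlinkedNodesSketch {
    // Namespace taken from the PREFIX line of Table 3.
    static final String IMO = "http://www.usc.edu/csci586/entertainment#";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Serializes a triple whose object is a resource (URI). */
    static String resourceTriple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    /** Serializes a triple with a plain string literal; escapes only backslashes and quotes. */
    static String literalTriple(String subject, String property, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "<" + subject + "> <" + IMO + property + "> \"" + escaped + "\" .";
    }

    /** Emits the type, title and year triples for one movie record from one source. */
    static List<String> movieTriples(String subject, String title, String year) {
        List<String> triples = new ArrayList<>();
        triples.add(resourceTriple(subject, RDF_TYPE, IMO + "Movie"));
        triples.add(literalTriple(subject, "Title", title));
        triples.add(literalTriple(subject, "Year", year));
        return triples;
    }

    /** Collects the distinct subject URIs, i.e. the movie nodes in the graph. */
    static Set<String> subjects(List<String> triples) {
        Set<String> result = new LinkedHashSet<>();
        for (String triple : triples) {
            // Each serialized triple starts with "<subject>".
            result.add(triple.substring(1, triple.indexOf('>')));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> graph = new ArrayList<>();
        // IMDB URI as listed in Table 3.
        graph.addAll(movieTriples("http://www.imdb.com/title/tt0482571/", "The Prestige", "2006"));
        // Hypothetical Box Office Mojo URI, used for illustration only.
        graph.addAll(movieTriples("http://www.boxofficemojo.com/movies/?id=prestige.htm", "The Prestige", "2006"));
        graph.forEach(System.out::println);
        System.out.println("Distinct movie nodes: " + subjects(graph).size());
    }
}
```

Although both records carry the same title and year, the sketch reports two distinct movie nodes, because nothing in the graph yet states that the two URIs refer to the same film.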