Integrated Movie Database CSCI 586 Project Report

Muhammad Rizwan Saeed, Santhoshi Priyanka Gooty Agraharam, Ran Ao

University of Southern California

1 Background

The rapid growth of the Web and its associated technologies has focused research attention on the idea of the Semantic Web. The purpose of the Semantic Web is to give meaning to data on the web, so that both humans and machines can understand and process information more effectively [1]. In the domain of the Semantic Web, Ontologies (or vocabularies) define the concepts and relationships used to describe and represent an area of concern [2]. The purpose of creating Ontologies and integrating data is to organize heterogeneous data sources, which simplifies on-demand information access and enables complex analysis over the integrated knowledge base. In our project, titled “Integrated Movie Database”, we apply Semantic Web concepts and use Ontologies to integrate data from various sources into a unified view for the user. The data becomes queryable and can be used to extract the information the user requires.

2 Problem Statement

Various websites provide different (and often disjoint) pieces of information about movies. IMDB¹ (Internet Movie Database) is the most comprehensive online source of information for movies. It provides attributes such as title, genre, run time, casting details and awards, organized across multiple web pages. Similarly, BoxOfficeMojo² is a movie-related website that focuses mainly on the financial aspects of movies. It tracks the domestic and worldwide earnings of movies on a daily, weekly and monthly basis. Other websites, such as Fandango³ and Google Movies⁴, list the movies playing in local movie theaters along with their showtimes. Some websites track the critical reception of movies; Rotten Tomatoes⁵, for example, assigns every movie a score based on the percentage of positive reviews published about it in notable (print and digital) publications. Because information about movies is distributed across multiple websites, users cannot get a unified view of all the relevant information about a movie in one place. For example, a user can only get movie showtimes from Google Movies. To choose a movie from those listed, the user may also want information such as critic and user reviews and gross earnings. Since this information is not available on the Google Movies page, the user must browse other websites to collect it. Similarly, analyzing the box office performance of Oscar-winning movies requires accessing multiple web pages on different websites, which can be time-consuming for the end user.

Fig. 1: Work Flow

Consolidating these scattered pieces of relevant information is the goal of our project. We acquire information about movies from multiple web sources and create a framework that organizes and integrates the data and provides a SPARQL⁶ endpoint for querying movie-related data.

We want to clarify that the goal of the project is not to create a mash-up application. Mash-up applications usually only present different streams of information in different panels or frames of the same application window. Those streams cannot be used to run integrated queries. For example, consider a mash-up application that shows information from RottenTomatoes and BoxOfficeMojo in separate panels of the same window. A user may be able to use one panel to filter movies with a critic score above 90% and the other to filter movies that have grossed more than a billion dollars worldwide. However, a query that returns only the results satisfying both criteria simultaneously requires the two data streams to be integrated, which is the objective of our project.

¹ http://www.imdb.com
² http://www.boxofficemojo.com
³ http://www.fandango.com
⁴ http://www.google.com/movies
⁵ http://www.rottentomatoes.com

3 Scope

In our project, we are focusing on data available from the following websites. The attributes crawled from each page are listed in Table 1:

1. Internet Movie Database (www.imdb.com)
2. Box Office Mojo (www.boxofficemojo.com)
3. Rotten Tomatoes (www.rottentomatoes.com)
4. Good Reads (www.goodreads.com)
5. Wikipedia (en.wikipedia.org)
6. Google Movies Cinema Page (crawled in real time based on user query; see Section 4.4 for details)

The high-level schematic of the project is shown in Figure 1. Section 4 discusses the phases of the project and the challenges faced in each phase, and Section 5 presents our conclusions and possible future work.

⁶ https://www.w3.org/TR/rdf-sparql-query/

Website | Extracted Information
IMDB | Title, Release Date, Genre, MPAA Rating, IMDB User Rating, List of Cast Members (Actors/Actresses), Metacritic Score, List of Academy Awards (won), Director
RottenTomatoes | Title, Year, Critic Score (Tomatometer)
BoxOfficeMojo | Title, Release Date, Genre, Run time, Domestic Gross, Worldwide Gross, Budget
GoodReads | Book Title, Year, Author Name, Rating
Wikipedia | Movie links for IMDB and RottenTomatoes
Google Movies | Title, Year, IMDB Link, Showtimes

Table 1: Information extracted from websites

4 Approach

In this section, we discuss the phases of the development and execution of our project. The project can be divided into four phases:

1. Data Acquisition
2. Data Modeling & Integration
3. Data Linking
4. Querying

4.1 Data Acquisition

Our primary focus is on extracting information from the data sources listed in Section 3. We created Java-based crawlers using the jsoup⁷ library, which provides an API for extracting and manipulating data from web pages. Because each website structures its pages differently, we created a separate crawler for each type of web page. The process and challenges of extracting data from the different web sources are discussed next.

⁷ http://jsoup.org/

Fig. 2: SPARQL Query to extract Wikipedia pages of Movies from DBpedia

IMDB and RottenTomatoes. Each crawler requires a list of URLs to crawl. To generate such a list for each website, we started by extracting relevant information from DBpedia⁸. DBpedia is a crowd-sourced effort to extract structured data from Wikipedia⁹, built on the principles of Linked Open Data (LOD) [3]. On DBpedia, movies are represented as instances of the classes dbo:Film¹⁰ and schema:Movie¹¹. Every DBpedia entity has a property foaf:isPrimaryTopicOf¹² that holds the link to the corresponding Wikipedia page. Using SPARQL queries (similar to the one shown in Figure 2), we created a list of Wikipedia pages related to movies. The Wikipedia page of a movie typically links to the corresponding IMDB and RottenTomatoes pages in its “External links” section. We created a crawler for Wikipedia pages to extract those links and thereby generated the lists of URLs used to crawl the IMDB and RottenTomatoes websites.
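Since the query in Figure 2 is only described here, the following is a minimal sketch of what such a request could look like. It only builds the query text and the request URL and does not send anything. The endpoint address `https://dbpedia.org/sparql`, the class name `DbpediaMovieQuery`, the variable names and the paging via `LIMIT`/`OFFSET` are assumptions made for illustration; only `dbo:Film` and `foaf:isPrimaryTopicOf` come from the text above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: builds (but does not send) a DBpedia SPARQL request.
class DbpediaMovieQuery {
    // Assumed public endpoint; the report does not name the endpoint it used.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    // Lists films together with their Wikipedia pages, one page of results at a time.
    static String buildQuery(int limit, int offset) {
        return String.join("\n",
            "PREFIX dbo: <http://dbpedia.org/ontology/>",
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
            "SELECT DISTINCT ?film ?wikiPage WHERE {",
            "  ?film a dbo:Film ;",
            "        foaf:isPrimaryTopicOf ?wikiPage .",
            "}",
            "ORDER BY ?film",
            "LIMIT " + limit + " OFFSET " + offset);
    }

    // The SPARQL protocol passes the query text in the "query" parameter of a GET request.
    static String buildRequestUrl(String query) {
        return ENDPOINT + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

The resulting `?wikiPage` values would then be the input to the Wikipedia crawler that collects the IMDB and RottenTomatoes links.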

BoxOfficeMojo and GoodReads. BoxOfficeMojo provides an index of the movies available on its website. We first wrote a crawler to extract the list of URLs from that index and then used another crawler to extract information from the respective movie pages. We also thought it would be interesting to add another dimension to our movie data by integrating book-related information into our repository. We crawled multiple user-generated lists on GoodReads that contained the names of books adapted into movies. Statistics for all the extracted data are given in Table 2.

⁸ http://wiki.dbpedia.org/
⁹ https://www.wikipedia.org/
¹⁰ http://dbpedia.org/ontology/Film
¹¹ http://schema.org/Movie
¹² http://xmlns.com/foaf/0.1/isPrimaryTopicOf

Fig. 3: Integrated Movie Ontology

Website | No. of Records Generated
IMDB | Records generated for 36,549 movies; total casting records generated: 856,407
Rotten Tomatoes | Records generated for 10,000 movies
Box Office Mojo | Records generated for 16,945 movies
Good Reads | Records generated for over 3,000 books

Table 2: Statistics about generated data

4.2 Data Modeling & Integration

To model our data, we created the Integrated Movie Ontology, shown in Figure 3. In the center of the figure, we have the Movie class with various attributes. An instance of Movie can be based on an instance of the Book class. Both Book and Movie are subclasses of the class CreateWork. Each book has an associated instance of the Author class and each movie has associated instances of the classes Actor and Director. Actor, Director and Author are subclasses of the class Person. Each movie instance can be associated with multiple instances of the class Award, and each such Award instance has an associated winner (an instance of the class Person). We created a Java program using the Apache Jena¹³ API that took relevant segments of the Ontology and the data file(s) as input and converted the CSV data into RDF triples. Table 3 shows a few example triples for the movie “The Prestige”.

¹³ http://jena.apache.org/

The Prestige (2006)
PREFIX imo: <http://www.usc.edu/csci586/entertainment#>

subject | predicate | object
http://www.imdb.com/title/tt0482571/ | rdf:type | imo:Movie
http://www.imdb.com/title/tt0482571/ | imo:Title | The Prestige
http://www.imdb.com/title/tt0482571/ | imo:IMDBUserRating | 8.5
http://www.imdb.com/title/tt0482571/ | imo:Year | 2006

Table 3: Sample triples for movie The Prestige
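The CSV-to-RDF conversion can be sketched as follows. This is not the Jena-based program described above: to stay self-contained it writes N-Triples lines with plain string handling. The CSV column order (URL, title, rating, year), the class name `CsvToTriples` and the use of untyped literals are assumptions; the namespace and the predicate names are taken from Table 3.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: turns one CSV row into N-Triples lines without using Jena.
class CsvToTriples {
    static final String IMO = "http://www.usc.edu/csci586/entertainment#";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Escapes backslashes and double quotes for an N-Triples string literal.
    static String literal(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Assumed (hypothetical) CSV layout: url,title,rating,year with no embedded commas.
    static List<String> rowToTriples(String csvRow) {
        String[] f = csvRow.split(",", -1);
        String subject = "<" + f[0].trim() + ">";
        List<String> triples = new ArrayList<>();
        triples.add(subject + " <" + RDF_TYPE + "> <" + IMO + "Movie> .");
        triples.add(subject + " <" + IMO + "Title> " + literal(f[1].trim()) + " .");
        triples.add(subject + " <" + IMO + "IMDBUserRating> " + literal(f[2].trim()) + " .");
        triples.add(subject + " <" + IMO + "Year> " + literal(f[3].trim()) + " .");
        return triples;
    }
}
```

Calling `CsvToTriples.rowToTriples("http://www.imdb.com/title/tt0482571/,The Prestige,8.5,2006")` yields the four triples of Table 3 in N-Triples syntax. A real data file with commas inside titles would need a proper CSV parser instead of `split`.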

4.3 Data Linking

Every movie has been assigned its own ID in each data source. After converting the data into RDF in the previous phase, there can be multiple nodes in the RDF graph corresponding to the same movie, because the linkage between the same movies across data sets has yet to be established. For example, the movie The Prestige (2006) has different URLs across three data sets and the book on which it is based has yet another URL, as shown in Table 4.

The Prestige (2006)
IMDB | http://www.imdb.com/title/tt0482571/
Rotten Tomatoes | http://www.rottentomatoes.com/m/prestige/
Box Office Mojo | http://www.boxofficemojo.com/movies/?id=prestige.htm
Good Reads | http://www.goodreads.com/book/show/239239.The_Prestige

Table 4: Information related to the same movie across websites

To establish links between data sets, we used a tool called FRIL¹⁴ (Fine-grained Record Integration and Linkage Tool), which allows users to upload two data sources and configure linkages based on multiple similarity metrics. The interface of FRIL is shown in Figure 4.

¹⁴ http://fril.sourceforge.net/

Fig. 4: FRIL - Main Interface

Movie-to-Movie Linkage. To link an IMDB movie to the same movie in the RottenTomatoes or BoxOfficeMojo data sets, we performed record linkage based on similar titles and identical release years. For matching titles, we used the edit distance similarity metric. The edit distance between strings $a_1 \dots a_m$ and $b_1 \dots b_n$ is the minimum cost of a sequence of editing steps (insertions, deletions, substitutions) that converts one string into the other. Let $d$ be the edit distance function and $e$ be the exact matching function; then the similarity between two records $M_A$ and $M_B$ can be expressed as:

$$\mathrm{sim}(M_A, M_B) = w_1 \cdot d(M_A.\mathrm{Title}, M_B.\mathrm{Title}) + w_2 \cdot e(M_A.\mathrm{Year}, M_B.\mathrm{Year}) \qquad (1)$$

By trial and error, we settled on a value of 50 for both $w_1$ and $w_2$, which gave us quite accurate results. For example, we were able to make matches despite slight differences in letters and punctuation between titles across data sets. Some examples of matches are shown in Table 5. After such matches were found, we connected the two nodes representing the same movie with an owl:sameAs link.

Movie (IMDB) | Movie (Box Office Mojo)
Crank 2: High Voltage | Crank: High Voltage
Love Wedding Marriage | Love, Wedding, Marriage
The Hills have Eyes II | The Hills have Eyes 2

Table 5: Approximate matching of movie titles
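Equation 1 can be illustrated with a small sketch. The linkage itself was performed with FRIL, so the code below is not the tool's implementation. It assumes that the edit distance is turned into a title score between 0 and 1 by dividing by the length of the longer title, and that titles are compared case-insensitively; the class and method names are hypothetical. Only the weights of 50 and the form of the sum come from the text.

```java
// Illustrative sketch of Equation (1); FRIL's own scoring may differ.
class MovieLinker {
    // Levenshtein distance: insertions, deletions and substitutions each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // Assumption: normalize the distance by the longer title to get a score in [0, 1].
    static double titleSimilarity(String a, String b) {
        String x = a.toLowerCase(), y = b.toLowerCase();
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / longest;
    }

    // Equation (1) with w1 = w2 = 50, so the score ranges from 0 to 100.
    static double similarity(String titleA, int yearA, String titleB, int yearB) {
        double w1 = 50, w2 = 50;
        double e = yearA == yearB ? 1.0 : 0.0; // exact match on release year
        return w1 * titleSimilarity(titleA, titleB) + w2 * e;
    }
}
```

Under these assumptions, the first pair in Table 5 differs by two edits over 21 characters, so with matching years it scores roughly 95 out of 100, while the same titles with different years cannot score above 50.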

Movie-to-Book Linkage. To link an IMDB movie to a book, we performed record linkage based on similar titles and the notion that the publishing year of the book should be less than or equal to the release year of the movie. Let $d$ be the edit distance function and $l$ be a comparison function that is true when the publishing year of the book is less than or equal to the release year of the movie. The similarity between two records $M_A$ (movie) and $B_A$ (book) can then be expressed as:

$$\mathrm{sim}(M_A, B_A) = w_1 \cdot d(M_A.\mathrm{Title}, B_A.\mathrm{Title}) + w_2 \cdot l(M_A.\mathrm{Year}, B_A.\mathrm{Year}) \qquad (2)$$

Equation 2 is less restrictive than the movie-to-movie comparison of Equation 1 because we are not doing exact matching of years. Hence, this approach turned out to be less precise, and we had to manually clean up the suggested matches. For instance, the movie Avatar (2009) was matched to the book Avatar from the Avatar: The Last Airbender series. Similarly, Amnesia was matched to Amnesiac (edit distance = 1), even though they are unrelated. So for this step, we used FRIL to provide an initial record linkage, which was then manually checked for erroneous matches. Based on the corrected matches, we used the basedOn property from our ontology to create links between the relevant instances of Movie and Book.
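The effect of the weaker year test in Equation 2 can be seen in a short sketch. The title score is passed in as a number between 0 and 1 so that the example stays focused on the year comparison. The weights of 50 are an assumption carried over from Equation 1, since the report does not state the weights used for books, and all names are hypothetical.

```java
// Illustrative sketch of Equation (2); weights are assumed to match Equation (1).
class MovieBookLinker {
    static final double W1 = 50, W2 = 50; // assumption: not stated for the book linkage

    // l(...) in Equation (2): 1 when the book was published no later than the movie's release.
    static double publishedBefore(int movieYear, int bookYear) {
        return bookYear <= movieYear ? 1.0 : 0.0;
    }

    // titleScore is a title similarity in [0, 1], e.g. derived from edit distance.
    static double similarity(double titleScore, int movieYear, int bookYear) {
        return W1 * titleScore + W2 * publishedBefore(movieYear, bookYear);
    }
}
```

Any identically titled book published in or before the release year receives the maximum score, for example `MovieBookLinker.similarity(1.0, 2009, 2006)` with a hypothetical publishing year of 2006. This is why unrelated pairs such as the Avatar example pass the automatic step and need manual review.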

4.4 Querying

After data integration and linking, we had an RDF data set of roughly 2.3 million triples. We used an OpenLink Virtuoso¹⁵ server to host the RDF data set. The repository was then queried using a combination of the Apache Jena API and the Virtuoso API. We developed an interface using the Java Swing libraries that takes a query type as input. Additional text input is required depending on the type of query selected. The interface is shown in Figure 5. Some queries have been pre-configured; however, users can also issue their own queries. When a user runs a pre-configured query, the actual SPARQL query is also shown in the text area, where it can serve as a template for custom queries. The list of pre-configured queries and their descriptions is given in Table 6. To show the value of integration, we have divided the test queries into three categories, which are discussed next.

Adding Relevant Data to Movie Theater Page. As discussed in Section 2, websites such as Google Movies do not provide all the information a user may need to decide which movie to watch, e.g. critic rating, user rating and gross. In this first type of query, the user provides a Google Movies URL for a particular cinema. The software uses a crawler to get information from the page (e.g. title, IMDB link), queries the repository for each movie found, augments the crawled information with the data received and presents the result as a table. Essentially, the user gets information from four websites (IMDB, BoxOfficeMojo, RottenTomatoes, Google Movies) in a single table. The table can be sorted on any column, which lets the user rank the movies showing in a particular cinema by their preferred criteria. Figure 6 shows the result of this query for a particular movie theater in the downtown Los Angeles area.

¹⁵ http://virtuoso.openlinksw.com/

Fig. 5: Integrated Movie Database Search Interface

Fig. 6: Augmented movie theater data with additional information

Querying Data from Previously Disjoint Data Sources. Now that we have integrated previously disjoint data sources, we can pose queries that fetch a variety of data in a single query. For example: “Which authors’ books have been most profitable for the Movie Industry?” or “List all movie adaptations along with certain attributes for a particular author”. Answering either query would previously have required users to explore multiple websites. With the integrated data, each can be answered with a single query. The results for the latter query are shown in Figure 7 and the equivalent SPARQL query is shown in Figure 8.

Fig. 7: List of movies with selected attributes along with the books they were based on

Fig. 8: SPARQL query to extract movies based on books of J.R.R. Tolkien

Menu Item | Description
Process Google Movies URL | Extract movies from the Google Movies Cinema Page and augment them with data from the repository
Get Information of Movies based on books by an Author | Provide all attributes related to movies based on books by a user-provided author
Most Oscars Won By a Movie based on Book Adaptation | Top 10 list of books which have been the basis for movies with the most Oscar wins
Directed and Acted in a Movie and Won an Oscar | Find people who directed a movie, acted in it and won either the Best Director or Best Actor/Actress Oscar
Collaborative Path Between Two Actors of Fixed Length | Using the graph of actors and actresses connected with each other through movies, find a path of a particular length between two people
Collaborative Path Between Two Actors of Any Length | Using the graph of actors and actresses connected with each other through movies, find a path of any length between two people
Partial Match Movie Search | Find all movies that partially match a user-provided phrase
Custom Query | Users enter their own queries in the text area

Table 6: Description of pre-configured queries available through the software interface

Path Query: Find Collaboration / Degree of Separation. The third type of query finds a collaborative path between two people based on the movies in which they have acted together. Collaborative queries of this kind can be found in other domains as well. When a user on LinkedIn¹⁶ views the profile of another professional, they can see how they are connected to that person through other people in their professional network. Similarly, in the research community, two examples of measuring collaboration are the Einstein number and the Erdős number. Einstein himself has an Einstein number of 0. Anybody who has co-authored a paper with Einstein has an Einstein number of 1. For instance, as shown in Figure 9, Ernst Gabor Straus has an Einstein number of 1, and Lee A. Rubel, who collaborated with Ernst Gabor Straus, has an Einstein number of 2. We have calculated a similar metric between two people based on the movies they have acted in. For instance, Figure 10a shows the minimum collaborative path between Al Pacino and Marlon Brando, which is of length 1, since they worked together in The Godfather (1972). Another version of this query finds a path of fixed length between two people, where the desired path length is provided by the user. Figure 10b shows the result of a query that finds a collaborative path of length 3 between the same two actors.

¹⁶ http://www.linkedIn.com

Fig. 9: An example of the Einstein Number
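The shortest collaborative path can be illustrated with a breadth-first search over an in-memory co-actor graph. This is a stand-in for illustration only: the report's queries ran against the RDF repository, and the way they were implemented there is not shown. The class name, the method names and the input format (a map from movie title to cast list) are assumptions.

```java
import java.util.*;

// Illustrative sketch: shortest collaboration path via breadth-first search.
class CollaborationPath {
    // Connects every pair of people who appear in the cast of the same movie.
    static Map<String, Set<String>> coActorGraph(Map<String, List<String>> castByMovie) {
        Map<String, Set<String>> graph = new HashMap<>();
        for (List<String> cast : castByMovie.values()) {
            for (String a : cast) {
                for (String b : cast) {
                    if (!a.equals(b)) graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
                }
            }
        }
        return graph;
    }

    // Returns the people on a shortest path from start to goal, or an empty list if none exists.
    // The path length in the sense used above is the number of people minus one.
    static List<String> shortestPath(Map<String, Set<String>> graph, String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // also serves as the visited set
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                LinkedList<String> path = new LinkedList<>();
                for (String p = goal; p != null; p = parent.get(p)) path.addFirst(p);
                return path;
            }
            for (String next : graph.getOrDefault(current, Collections.emptySet())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

With a graph built from a single entry mapping The Godfather to Al Pacino and Marlon Brando, `shortestPath` returns the two names, which corresponds to a path of length 1. Breadth-first search visits people in order of increasing distance, so the first path found is a shortest one; the fixed-length variant would need a different search that enumerates paths of exactly the requested length.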

5 Conclusion and Future Work

Ontologies are becoming an increasingly popular way of organizing data and are used in multiple domains. We have shown the value of Ontology-based integration in the domain of Movies and Books by answering queries that would otherwise have required exploring multiple web pages. For future work, we feel that the data set developed for this project can be used to apply interesting Machine Learning techniques and Social Media Analytics. For instance, in the domain of Authors/Researchers, having more publications can serve as a measure for finding influential nodes in the graph. In the graph of Actors/Actresses, however, having links to more movies does not give any clue about the popularity or influence of a person. More sophisticated measures of influence need to be established for such collaborative networks. Another interesting direction would be to cluster movies based on multiple features, e.g. budget, gross, rating, cast and awards, to determine whether any interesting patterns of similarity emerge. Hence, automatic determination of influential nodes and finding similar movies or people are interesting avenues to explore next using the data we have generated for this project.

Fig. 10: Results of Path Queries. (a) Shortest collaborative path between Al Pacino & Marlon Brando. (b) Collaborative path of length 3 between Al Pacino & Marlon Brando.

References

1. Sunitha Ramanujam, Anubha Gupta, Latifur Khan, Steven Seida, and Bhavani M. Thuraisingham. A relational wrapper for RDF reification. In Trust Management III, Third IFIP WG 11.11 International Conference, IFIPTM 2009, West Lafayette, IN, USA, June 15-19, 2009. Proceedings, pages 196–214, 2009.
2. Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, 284(5):34–43, May 2001.
3. Liyang Yu. A Developer’s Guide to the Semantic Web. Springer, 2011.