Integrated Movie Database Group 9: Muhammad Rizwan Saeed Santhoshi Priyanka Gooty Agraharam Ran Ao Outline • Background • Project Description • Demo • Conclusion 2 Background • Semantic Web – Gives meaning to data • Ontologies – Concepts – Relationships – Attributes • Benefits 3 Background • Semantic Web – Gives meaning to data • Ontologies – Concepts – Relationships – Attributes • Benefits – Facilitates organization, integration and retrieval of data 4 Project Description • Integrated Movie Database Movies Querying Books RDF Repository People 5 Project Description • Integrated Movie Database – Project Phases • Data Acquisition • Data Modeling • Data Linking • Querying 6 Data Acquisition • Datasets – IMDB.com – BoxOfficeMojo.com – RottenTomatoes.com – Wikipedia.org – GoodReads.com • Java based crawlers using jsoup library 7 Data Acquisition: Challenge • Crawlers require a list of URLs to extract data from. • How to generate a set of URLs for the crawler? – User-created movie list on IMDB • Can be exported via CSV – DBpedia 1. Using foaf:isPrimaryTopicOf of dbo:Film or schema:Movie classes to get corresponding Wikipedia Page. 2. Crawl Wikipedia Page to get IMDB and RottenTomatoes link 8 Data Acquisition: IMDB • Crawled 4 types of pages – Main Page • Title, Release Date, Genre, MPAA Rating, IMDB User Rating – Casting • List of Cast Members (Actors/Actresses) – Critics • Metacritic Score – Awards • List of Academy Awards (won) – Records generated for 36,549 movies – Casting Records generated: 856,407 9 Data Acquisition: BoxOfficeMojo • BoxOfficeMojo provides an index of all the movies on their website – Total movie links found: 16,945 • Main Movie Page – Title, Release Date, Genre, Run time, Domestic Gross, Worldwide Gross, Budget 10 Data Acquisition: Others • RottenTomatoes – Extracted another score based on % of positive reviews for the movie: (Scale: 0-100) – Records generated: 10,000 • GoodReads.com – Crawled user-generated lists of books which were adapted for movies – Records generated: 3,000 11 Data Modeling • Integrated Movie Database Ontology – Ontology creation using Protégé – Automatic conversion of CSV data into RDF using Apache Jena API 12 Data Modeling 13 Data Linking • Why do we need to link data? 14 Data Linking • Why do we need to link data? – Movie: Prestige (2006) • http://www.imdb.com/title/tt0482571/ • http://www.boxofficemojo.com/movies/?id=prestige.htm • http://www.rottentomatoes.com/m/prestige/ • http://www.goodreads.com/book/show/239239.The_Prestige 15 Data Linking • Connect independently modeled data sources 130 $109m min Box IMDB The The 2006 Office 2006 Prestige Prestige Movie Mojo 8.5 $40m http://www.imdb.com/title/tt0482571/ http://www.boxofficemojo.com/movies/?id=prestige.htm 16 Data Linking • Connect independently modeled data sources 130 $109m min Box IMDB The The 2006 Office 2006 Prestige Prestige Movie Mojo 8.5 $40m http://www.imdb.com/title/tt0482571/ http://www.boxofficemojo.com/movies/?id=prestige.htm 17 Data Linking • FRIL: Fine-grained Record Integration and Linkage Tool – Allows record linkage based on combination of similarity metrics – Linkage was fine-tuned based on multiple trials 18 Data Linking 19 Data Linking 20 Data Linking • Matching criteria: – Movies vs Movies • Match based on (similar) Title and same Release Year • simScore = 50 * Edit Distance(Titles) + 50 * Equals(Release Years) 21 Data Linking • Edit Distance: – Number of modifications (insertion, deletion, modification) that make string A equal to string B • f(Max, May) = 1 • f(Ma, May) = 1 22 Data Linking • Matches found despite – Punctuation Difference – Textual Difference Crank 2: High Voltage Crank: High Voltage Love Wedding Marriage Love, Wedding, Marriage The Hills have Eyes II The Hills have Eyes 2 – High Precision – Recall? 23 Data Linking • Matching criteria: – Movies vs Books • Match based on (similar) Title and Book Publication Year <= Movie Release Year Avatar Avatar Amnesia Amnesiac 24 Data Linking • Matching criteria: – Movies vs Books • Match based on (similar) Title and Book Publication Year <= Movie Release Year Avatar (Movie) Avatar (The Last Air Bender) Amnesia Amnesiac • Lower precision • Required manual cleaning 25 Querying • Data hosted using Openlink Virtuoso – RDF Dataset created contains 2.3 million triples • Demo: SPARQL Queries – Enrich user experience – Link previously disconnected data • e.g. Which author’s books have been most profitable for the Movie Industry? – Path Query • e.g. Finding collaboration 26 Querying: Path Queries • Finding Collaboration or Degree of Separation – Show Business • (Kevin) Bacon Number – Research Community • Erdős Number, Einstein Number – Social Media • LinkedIn Connections 2 1 0 Ernst Lee Paper A Gabor Paper B Einstein A. Rubel Straus 27 Demo 28 Conclusion • Integrated data allows to cross reference information that previously required accessing multiple web pages. • The datasets can be augmented and used for applying ML and Social Media Analysis techniques e.g. – Calculating Influence – Similarity between Entities (e.g. Movies, Actors) 29 Thank you! 30.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages30 Page
-
File Size-