Movie Database

Integrated Movie Database

Group 9:
Muhammad Rizwan Saeed
Santhoshi Priyanka Gooty Agraharam
Ran Ao

Outline
• Background
• Project Description
• Demo
• Conclusion

Background
• Semantic Web
  – Gives meaning to data
• Ontologies
  – Concepts
  – Relationships
  – Attributes
• Benefits
  – Facilitates organization, integration and retrieval of data

Project Description
• Integrated Movie Database
  [Figure: diagram labeled Movies, Books, People, RDF Repository, Querying]
  – Project Phases
    • Data Acquisition
    • Data Modeling
    • Data Linking
    • Querying

Data Acquisition
• Datasets
  – IMDB.com
  – BoxOfficeMojo.com
  – RottenTomatoes.com
  – Wikipedia.org
  – GoodReads.com
• Java-based crawlers using the jsoup library

Data Acquisition: Challenge
• Crawlers require a list of URLs to extract data from.
• How to generate a set of URLs for the crawler?
  – User-created movie list on IMDB
    • Can be exported via CSV
  – DBpedia
    1. Using foaf:isPrimaryTopicOf of dbo:Film or schema:Movie classes to get the corresponding Wikipedia page.
    2. Crawl the Wikipedia page to get the IMDB and RottenTomatoes links.

Data Acquisition: IMDB
• Crawled 4 types of pages
  – Main Page
    • Title, Release Date, Genre, MPAA Rating, IMDB User Rating
  – Casting
    • List of Cast Members (Actors/Actresses)
  – Critics
    • Metacritic Score
  – Awards
    • List of Academy Awards (won)
  – Records generated for 36,549 movies
  – Casting records generated: 856,407

Data Acquisition: BoxOfficeMojo
• BoxOfficeMojo provides an index of all the movies on their website
  – Total movie links found: 16,945
• Main Movie Page
  – Title, Release Date, Genre, Run time, Domestic Gross, Worldwide Gross, Budget

Data Acquisition: Others
• RottenTomatoes
  – Extracted another score based on % of positive reviews for the movie (Scale: 0-100)
  – Records generated: 10,000
• GoodReads.com
  – Crawled user-generated lists of books which were adapted for movies
  – Records generated: 3,000

Data Modeling
• Integrated Movie Database Ontology
  – Ontology creation using Protégé
  – Automatic conversion of CSV data into RDF using Apache Jena API

Data Linking
• Why do we need to link data?
  – Movie: Prestige (2006)
    • http://www.imdb.com/title/tt0482571/
    • http://www.boxofficemojo.com/movies/?id=prestige.htm
    • http://www.rottentomatoes.com/m/prestige/
    • http://www.goodreads.com/book/show/239239.The_Prestige

Data Linking
• Connect independently modeled data sources
  [Figure: two separately modeled records for The Prestige (2006), one from IMDB (http://www.imdb.com/title/tt0482571/) and one from Box Office Mojo (http://www.boxofficemojo.com/movies/?id=prestige.htm), with the attribute values 130 min, 8.5, $109m and $40m]

Data Linking
• FRIL: Fine-grained Record Integration and Linkage Tool
  – Allows record linkage based on a combination of similarity metrics
  – Linkage was fine-tuned based on multiple trials

Data Linking
• Matching criteria:
  – Movies vs Movies
    • Match based on (similar) Title and same Release Year
    • simScore = 50 * Edit Distance(Titles) + 50 * Equals(Release Years)

Data Linking
• Edit Distance:
  – Number of single-character edits (insertion, deletion, substitution) needed to make string A equal to string B
    • f(Max, May) = 1
    • f(Ma, May) = 1

Data Linking
• Matches found despite
  – Punctuation Difference
  – Textual Difference
• Examples of matched titles:
  – "Crank 2: High Voltage" and "Crank: High Voltage"
  – "Love Wedding Marriage" and "Love, Wedding, Marriage"
  – "The Hills have Eyes II" and "The Hills have Eyes 2"
• High Precision
• Recall?

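The edit distance and simScore described in the slides above can be sketched in a few lines of Python. This is an illustration only, not the FRIL implementation: the function names are hypothetical, and the sketch assumes that the "Edit Distance(Titles)" term is a similarity normalized to the range 0 to 1 (1 minus the distance divided by the longer title's length), so that simScore falls between 0 and 100. The slides do not spell out that normalization.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


def title_similarity(a: str, b: str) -> float:
    """ASSUMPTION: similarity in [0, 1] derived from the edit distance,
    case-insensitive. The slides do not define this normalization."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def sim_score(title_a: str, year_a: int, title_b: str, year_b: int) -> float:
    """simScore = 50 * title similarity + 50 * (release years equal)."""
    years_equal = 1.0 if year_a == year_b else 0.0
    return 50 * title_similarity(title_a, title_b) + 50 * years_equal


if __name__ == "__main__":
    print(edit_distance("Max", "May"))  # 1, as on the slide
    print(edit_distance("Ma", "May"))   # 1, as on the slide
    print(sim_score("The Prestige", 2006, "The Prestige", 2006))
```

Under this assumed normalization, a pair such as "Love Wedding Marriage" and "Love, Wedding, Marriage" differs by only two inserted commas, so its title similarity stays close to 1 and the pair can still pass a high acceptance threshold when the release years agree.
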
Data Linking
• Matching criteria:
  – Movies vs Books
    • Match based on (similar) Title and Book Publication Year <= Movie Release Year
• Matched pairs included:
  – "Avatar (Movie)" and "Avatar (The Last Air Bender)"
  – "Amnesia" and "Amnesiac"
• Lower precision
• Required manual cleaning

Querying
• Data hosted using OpenLink Virtuoso
  – The RDF dataset created contains 2.3 million triples
• Demo: SPARQL Queries
  – Enrich user experience
  – Link previously disconnected data
    • e.g. Which author's books have been most profitable for the movie industry?
  – Path Query
    • e.g. Finding collaboration

Querying: Path Queries
• Finding Collaboration or Degree of Separation
  – Show Business
    • (Kevin) Bacon Number
  – Research Community
    • Erdős Number, Einstein Number
  – Social Media
    • LinkedIn Connections
  [Figure: Einstein-number chain: Lee A. Rubel (2), Paper A, Ernst Gabor Straus (1), Paper B, Einstein (0)]

Demo

Conclusion
• Integrated data makes it possible to cross-reference information that previously required accessing multiple web pages.
• The datasets can be augmented and used for applying ML and Social Media Analysis techniques, e.g.
  – Calculating Influence
  – Similarity between Entities (e.g. Movies, Actors)

Thank you!
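
The degree-of-separation idea from the Path Queries slide can be illustrated with a short breadth-first search. In the project this was a path query over the RDF data; the sketch below swaps in a plain in-memory search instead, so it shows the logic rather than the query used in the demo. The function name and the `credits` dictionary layout are hypothetical, and the sample data is only the Einstein-number chain from the slide's figure.

```python
from collections import deque


def degree_of_separation(credits, source, target):
    """Return the smallest number of shared works linking source to target,
    or None if they are not connected.

    credits maps a work (movie, paper, ...) to the people credited on it.
    This layout is an assumption made for the sketch.
    """
    # Build a person-to-collaborators map: two people are neighbors
    # if they appear together on at least one work.
    neighbors = {}
    for people in credits.values():
        for person in people:
            others = neighbors.setdefault(person, set())
            others.update(p for p in people if p != person)

    if source not in neighbors or target not in neighbors:
        return None

    # Breadth-first search visits people in order of increasing distance,
    # so the first time the target is reached the distance is minimal.
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == target:
            return distance
        for other in neighbors[person]:
            if other not in seen:
                seen.add(other)
                queue.append((other, distance + 1))
    return None


if __name__ == "__main__":
    papers = {
        "Paper A": ["Lee A. Rubel", "Ernst Gabor Straus"],
        "Paper B": ["Ernst Gabor Straus", "Einstein"],
    }
    print(degree_of_separation(papers, "Lee A. Rubel", "Einstein"))  # 2
```

The same function gives a Bacon number when `credits` maps movies to cast members and the target is Kevin Bacon.
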
