The WebDataCommons Microdata, RDFa, and Microformat Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer

HTML-embedded Structured Data on the Web
− More and more websites semantically mark up the content of their HTML pages.
− Three markup languages are covered: RDFa, Microformats, and Microdata

Dataset Creation
− Common Crawl Foundation corpora of 2010, 2012 and 2013
  • Snapshot of popular pages of the Web
  • New crawls are continuously made available
− Parsing the HTML pages using Apache Any23
  • Using a distributed framework on 100 parallel EC2 instances
− Example of the Any23 output:
  _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
  _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de .
  _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
  _:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .
  _:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
  …
− The framework is easy to adapt and is publicly available at: http://webdatacommons.org/framework/

Dataset Series Overview
− The series contains three datasets, from 2010, 2012 and 2013
− Altogether over 30 billion RDF quads
− Each dataset is further split into subsets, each containing the quads extracted for one particular markup language

Overview of 2013 dataset
− Over 1.7 million domains using at least one markup language
− Over 17 billion quads with over 4 billion records (typed entities)
− hCard is still the most dominant format among domains
− Microdata contains the largest number of quads

Divergence in Class and Property Usage in 2013
− A small number of classes and properties is used by a large number of domains
− RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains
− Microdata: 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains
− Classes and properties used by only one domain are mostly typos

RDFa Insights 2013
− Usage of various vocabularies to describe information:
  • Strong presence of the Open Graph Protocol (e.g. Facebook)
  • FOAF and SIOC (blog software such as Drupal)
− Largest topics covered are:
  • Articles and Documents (blogs and news portals)
  • Products, Reviews and Ratings
  • BusinessEntities and Organizations

Microdata Insights 2013 and 2012
− Clear increase in development compared to 2012
− Still two vocabularies deployed: data-vocabulary and schema.org
− Largest topical areas:
  • Postal Addresses and Locations
  • Products, Offers and Ratings
  • Organizations and Persons
  • Articles and Blogs
  • Breadcrumbs

Focus on Schema.org/Product
− One of the largest publicly available product collections
− Almost 100 million records described with name, offer and image
− 34 million records contain a further description
− 11% of all product records include a brand

Microformats Insights 2013
− Most dominant vocabulary is hCard
− Still a very solid deployment
− Topics are:
  • Persons & Organizations
  • Events
  • Products and reviews
  • Recipes
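
The "used by at least two different domains" figures under Divergence in Class and Property Usage can be illustrated with a minimal sketch. It assumes each quad carries the source page URL as its fourth element (the slide example omits it), uses a deliberately simplified regular expression rather than a full N-Quads parser, and groups by host name rather than by registered domain. The sample URLs, the function names `classes_by_domain` and `shared_classes`, and the misspelled class are hypothetical and only serve the illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified quad pattern: subject, <predicate>, object, <graph> "."
# Assumption: good enough for well-formed lines like the slide example,
# not a complete N-Quads grammar.
QUAD_RE = re.compile(r'^(\S+)\s+<([^>]*)>\s+(.+?)\s+<([^>]*)>\s*\.\s*$')

def classes_by_domain(lines):
    """Map each class IRI to the set of hosts whose pages use it."""
    hosts = defaultdict(set)
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if not match:
            continue  # skip lines the simplified pattern cannot read
        _subject, predicate, obj, graph = match.groups()
        if predicate != RDF_TYPE or not obj.startswith("<"):
            continue  # only rdf:type statements with an IRI object
        hosts[obj[1:-1]].add(urlparse(graph).netloc)
    return hosts

def shared_classes(lines, min_domains=2):
    """Return the classes used by at least `min_domains` different hosts."""
    by_domain = classes_by_domain(lines)
    return sorted(c for c, h in by_domain.items() if len(h) >= min_domains)

# Hypothetical sample: two shops use schema.org/Product, one misspells it.
SAMPLE = [
    r'_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://shop-a.example/p1> .',
    r'_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de <http://shop-a.example/p1> .',
    r'_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://shop-b.example/x> .',
    r'_:node3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Prodcut> <http://shop-b.example/y> .',
]

if __name__ == "__main__":
    print(shared_classes(SAMPLE))
```

In this sample the misspelled class appears on a single host only and is therefore filtered out, which mirrors the observation that classes used by only one domain are mostly typos.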

Opportunities & Challenges
Opportunities
− Vast amounts of free data, created by people all over the world
− Large topical coverage, from broad areas (such as products) to niche ones (such as recipes)
− High up-to-dateness of information, as popular pages potentially update their content frequently
Challenges
− Data quality assessment, as the data is created by experts and rookies alike
− Further information extraction, as a flat schema and a rather low number of properties are used
− Identity resolution, as the data hardly contains any identifiers

Possible Application Domains
− Enriching existing knowledge bases
  • E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies, to add missing information and extend entity knowledge
  • As shown by Lehmberg et al. within the Semantic Web Challenge, this data can be used as an additional source (besides others) to gather and return wider search results
− Design and adaptation of algorithms and methods to cope with the characteristics of such web data
  • Training of data extraction methods to gather data that is not marked up within the HTML pages
  • Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions
− Starting point for further data discovery
  • The dataset can be used as a starting point for further data crawling, as in most cases not all pages of a domain are included

Thank you! Questions? Feedback?
Data and more statistics can be found at: http://webdatacommons.org/structureddata/index.html
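
The "starting point for further data discovery" idea under Possible Application Domains can be sketched as follows. The sketch assumes a list of page URLs already known to contain markup (for example, the source URLs attached to the quads); the function name `seed_hosts` and the sample URLs are hypothetical. It groups the known pages by host so that a follow-up crawl could start from the hosts with the most known pages.

```python
from collections import defaultdict
from urllib.parse import urlparse

def seed_hosts(page_urls):
    """Group known page URLs by host, largest groups first.

    Returns a list of (host, set_of_urls) pairs. Ties are broken
    alphabetically so the order is deterministic.
    """
    pages = defaultdict(set)
    for url in page_urls:
        host = urlparse(url).netloc.lower()
        if host:  # ignore strings that do not parse as absolute URLs
            pages[host].add(url)
    return sorted(pages.items(), key=lambda item: (-len(item[1]), item[0]))

# Hypothetical sample of pages known to carry embedded markup.
KNOWN_PAGES = [
    "http://shop-a.example/p1",
    "http://shop-a.example/p2",
    "http://blog-b.example/post/7",
    "http://shop-a.example/p1",  # duplicates are collapsed by the set
]

if __name__ == "__main__":
    for host, urls in seed_hosts(KNOWN_PAGES):
        print(host, len(urls))
```

Grouping by host rather than by registered domain is a simplification; a real crawl seed list would likely need a public-suffix-aware domain extraction.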