The WebDataCommons Microdata, RDFa, and Microformats Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer

HTML-embedded Structured Data on the Web

More and more websites semantically mark up the content of their HTML pages.

RDFa

Microdata

Dataset Creation

− Common Crawl Foundation Corpora of 2010, 2012 and 2013
  • Snapshot of popular pages of the Web
  • Continuously new crawls available

− Parsing the HTML pages using Apache Any23
  • Using a distributed framework on 100 parallel EC2 instances

_:node1 .
_:node1 "Predator Instinct FG Fu\u00DFballschuh"@de .
_:node1 .
_:node1 "\u20AC 219,95"@de .
_:node1 "EUR"@de .
…

The framework is easy to adapt and is publicly available at: http://webdatacommons.org/framework/
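The extracted statements are serialized as N-Quads (subject, predicate, object, and the page the statement came from). A minimal sketch of reading such lines is given below. It is not the WebDataCommons framework itself: the names `Quad`, `parse_quad` and `literal_text` are hypothetical, the regular expression covers only a simplified subset of the N-Quads grammar, and the predicate and graph IRIs in the sample are assumptions, since the slide example shows only the subject and the literal values.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str]


# Simplified term pattern: IRI, blank node, or literal with optional
# language tag or datatype. Not a full N-Quads grammar.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_QUAD = re.compile(rf'^\s*{_TERM}\s+{_TERM}\s+{_TERM}(?:\s+{_TERM})?\s*\.\s*$')


def parse_quad(line: str) -> Optional[Quad]:
    """Parse one line into a Quad, or return None if it does not match."""
    m = _QUAD.match(line)
    if m is None:
        return None
    s, p, o, g = m.groups()
    return Quad(s, p, o, g)


def literal_text(term: str) -> str:
    """Return the lexical form of a literal term, decoding \\uXXXX escapes.

    Expects a term that starts with a double-quoted literal.
    """
    body = re.match(r'"((?:[^"\\]|\\.)*)"', term).group(1)
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)), body)


# Hypothetical sample lines: the predicate and graph IRIs are assumed.
sample = [
    r'_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    r'<http://schema.org/Product> <http://example.com/shoe> .',
    r'_:node1 <http://schema.org/Product/name> '
    r'"Predator Instinct FG Fu\u00DFballschuh"@de <http://example.com/shoe> .',
]

for line in sample:
    print(parse_quad(line))
```

Grouping the parsed quads by subject and graph would then give one record per marked-up entity on a page.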

Dataset Series Overview

− Series contains three datasets from 2010, 2012 and 2013
− Altogether over 30 billion RDF quads

− Each dataset is again split into subsets including quads extracted for a particular

Overview of 2013 dataset

− Over 1.7 million domains using at least one markup language
− Over 17 billion quads with over 4 billion records (typed entities)

− hCard still most dominant among domains

− Microdata contains the largest number of quads
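The two views used above (number of domains versus number of quads) can rank the formats differently. The sketch below shows one way such per-format figures could be computed. It is an illustration under assumptions: `format_stats` is a hypothetical helper, each quad is assumed to arrive already labeled with its markup format, the host name stands in for the domain, and the toy records are invented rather than taken from the dataset.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def format_stats(records):
    """Count quads and distinct domains per markup format.

    records: iterable of (markup_format, page_url) pairs, one per quad.
    """
    quads = Counter()
    domains = defaultdict(set)
    for fmt, url in records:
        quads[fmt] += 1
        host = urlparse(url).hostname  # simplification: host name as domain
        if host:
            domains[fmt].add(host)
    return {fmt: {"quads": quads[fmt], "domains": len(domains[fmt])}
            for fmt in quads}


# Invented toy input: one shop with many Microdata quads,
# two personal sites with one hCard quad each.
records = [
    ("microdata", "http://shop.example/p/1"),
    ("microdata", "http://shop.example/p/1"),
    ("microdata", "http://shop.example/p/2"),
    ("hcard", "http://alice.example/"),
    ("hcard", "http://bob.example/about"),
]

stats = format_stats(records)
print(stats)
```

In this toy input Microdata leads by quads while hCard leads by domains, which mirrors the kind of divergence the overview describes.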

Divergence in Class and Property Usage in 2013

− Small number of classes and properties is used by a large number of domains

− RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains

− Microdata (MD): 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains

Classes and properties used by only one domain are mostly typos
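The "used by at least two different domains" filter can be sketched as follows. This is only an illustration of the counting idea, not the authors' analysis code: `shared_terms` is a hypothetical helper, and the toy usages (including the misspelled class) are invented.

```python
from collections import defaultdict


def shared_terms(usages, min_domains=2):
    """Return the terms (classes or properties) used by >= min_domains domains.

    usages: iterable of (domain, term) pairs; repeated pairs are harmless.
    """
    domains_by_term = defaultdict(set)
    for domain, term in usages:
        domains_by_term[term].add(domain)
    return {term for term, doms in domains_by_term.items()
            if len(doms) >= min_domains}


# Invented toy input: one class shared by two domains,
# one misspelled class that appears on a single domain only.
usages = [
    ("a.example", "http://schema.org/Product"),
    ("b.example", "http://schema.org/Product"),
    ("a.example", "http://schema.org/Prodcut"),
    ("a.example", "http://schema.org/Prodcut"),
]

shared = shared_terms(usages)
print(shared)
```

Counting distinct domains rather than occurrences keeps a typo that is repeated many times on one site from passing the filter.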

RDFa Insights 2013

− Usage of various vocabularies to describe information:
  • Strong presence of the Open Graph Protocol (e.g. Facebook)
  • FOAF and SIOC (blog software such as Drupal)

− Largest topics covered are:
  • Articles and Documents (Blogs and News portals)
  • Products, Reviews and Ratings
  • BusinessEntities and Organizations

Microdata Insights 2013 and 2012

− Clear increase in deployment compared to 2012
− Still two vocabularies deployed: data-vocabulary and schema.org
− Largest topical areas:
  • Postal Addresses and Locations
  • Products, Offers and Ratings
  • Organizations and Persons
  • Articles and Blogs
  • Breadcrumbs

Focus on Schema.org/Product

− One of the largest publicly available product collections
− Almost 100 million records described with name, offer and image
− 34 million records contain a further description
− 11% of all product records include a brand
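Figures such as "11% of all product records include a brand" are property coverage ratios. A minimal sketch of that computation follows, assuming each record has already been reduced to the set of property names it uses. The helper `coverage` and the toy records are hypothetical, and the printed shares are toy values, not the dataset's.

```python
def coverage(records, required):
    """Share of records that use every property in `required`.

    records: iterable of sets of property names, one set per record.
    """
    required = set(required)
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for props in records if required <= set(props))
    return hits / len(records)


# Invented toy records, each given as the set of properties it uses.
products = [
    {"name", "offers", "image", "description", "brand"},
    {"name", "offers", "image"},
    {"name", "offers", "image", "description"},
    {"name"},
]

print(coverage(products, {"name", "offers", "image"}))
print(coverage(products, {"brand"}))
```

The same function applies to any other class, for example to check how many job posting records carry a title and a description.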

Microformats Insights 2013

− Most dominant vocabulary is hCard
− Still a very solid deployment
− Topics are:
  • Persons & Organizations
  • Events
  • Products and reviews
  • Recipes

Opportunities & Challenges

Opportunities
− Vast amounts of free data, created by people all over the world
− Large topical coverage, from broad areas (such as products) to niche ones (such as recipes)
− Highly up-to-date information, as popular pages potentially update their content frequently

Challenges
− Data quality assessment, as the data is created by experts and rookies
− Further information extraction, as a flat schema and a rather low number of properties are used
− Identity resolution, as the data hardly contains any identifiers

Possible Application Domains

− Enriching existing knowledge bases
  • E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies to add missing information and extend entity knowledge
  • As shown by Lehmberg et al. within the Semantic Web Challenge, this data can be used as an additional source (besides others) to gather and return wider search results

− Design and adaptation of algorithms and methods to address the characteristics of such web data
  • Training of data extraction methods to gather data that is not marked up within the HTML pages
  • Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions

− Starting point for further data discovery
  • The dataset can be used as a starting point for further data crawling, as in most cases not all pages from a domain are included

Thank you! Questions? Feedback?

Data and more statistics can be found at: http://webdatacommons.org/structureddata/index.html

More interesting datasets and analysis can be found at the website of WebDataCommons: http://webdatacommons.org/index.html

Acknowledgement
The extraction and analysis of the datasets were supported by an AWS in Education Grant and the EU FP7 project LOD2. Special thanks to SWSA for supporting the travel to ISWC 2014.

Backup Slides

Focus on Schema.org/JobPostings

− Over 70% of all records are described by a title, an organization, the location and a description
− Only a few use more fine-grained properties
− This does not necessarily mean the information is missing; it might just not be marked up correctly

Development of the Deployment from 2010 to 2013

The percentage share of Microdata (by number of quads) has increased strongly since 2010 (from ~0% to 51%), pushed by the ongoing initiative of the big companies

Share of quads by markup format (stacked bar chart):

Format        2010     2012     2013
RDFa          5.65%    14.68%   15.29%
Microdata     0.02%    20.24%   51.01%
Microformats  94.32%   65.08%   33.69%

RDFa deployment grew strongly from 2010 to 2012, mainly pushed by the social network Facebook, and is now rather stable
