The WebDataCommons Microdata, RDFa, and Microformats Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer
HTML-embedded Structured Data on the Web
More and more websites semantically mark up the content of their HTML pages.
Markup formats: RDFa, Microformats, Microdata
Dataset Creation
− Common Crawl Foundation corpora of 2010, 2012, and 2013
  • Snapshot of popular pages of the Web
  • New crawls continuously available
− Parsing the HTML pages using Apache Any23
  • Using a distributed framework on 100 parallel EC2 instances
[Figure: example extraction output, beginning with blank node _:node1]
The framework is easy to adapt and is publicly available at: http://webdatacommons.org/framework/
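To make the parsing step concrete, here is a minimal sketch of turning Microdata markup into quad-like tuples. It is not Apache Any23 and not the WebDataCommons framework: it uses Python's standard `html.parser`, handles only flat `itemprop` attributes with text content, and the names `ItempropCollector`, `extract_quads`, the sample HTML, and the page URL are all hypothetical.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from flat Microdata markup (no nesting)."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # itemprop name still waiting for its text

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemprop" in attributes:
            self._current = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self.pairs.append((self._current, text))
            self._current = None


def extract_quads(html, page_url):
    """Return (subject, property, value, source page) tuples for one page."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    # Assumption: a single item per page, represented by one blank node.
    return [("_:node1", prop, value, page_url) for prop, value in parser.pairs]


# Hypothetical sample page, for illustration only.
SAMPLE = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Jane Doe</span> '
    '<span itemprop="jobTitle">Editor</span></div>'
)
quads = extract_quads(SAMPLE, "http://example.org/about")
```

The fourth element records the page the statement came from, which is what distinguishes a quad from a plain triple.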
Dataset Series Overview
− The series contains three datasets, from 2010, 2012, and 2013
− Altogether over 30 billion RDF quads
− Each dataset is further split into subsets, each containing the quads extracted for one markup language
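As an illustration of working with such quads, the following sketch counts quads per host. It assumes N-Quads-style lines whose fourth element is the URL of the page the quad was extracted from; the regular expression is a simplification rather than a full N-Quads parser, and `quads_per_host` and the sample lines are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplification: take the last <...> before the closing " ." as the graph URL.
# A line without a graph element would be misread, so this is only a sketch.
GRAPH_RE = re.compile(r"<([^<>]*)>\s*\.\s*$")


def quads_per_host(lines):
    """Count quads per host name of the page they were extracted from."""
    counts = Counter()
    for line in lines:
        match = GRAPH_RE.search(line)
        if not match:
            continue
        host = urlparse(match.group(1)).hostname
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '_:node1 <http://schema.org/name> "Jane Doe" <http://example.org/about> .',
    '_:node1 <http://schema.org/jobTitle> "Editor" <http://example.org/about> .',
    '_:node2 <http://schema.org/name> "Shop" <http://shop.example.com/> .',
]
counts = quads_per_host(SAMPLE)
```

Because the function consumes any iterable of lines, it can read a file handle directly instead of loading a subset into memory.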
Overview of the 2013 Dataset
− Over 1.7 million domains use at least one markup language
− Over 17 billion quads with over 4 billion records (typed entities)
− hCard is still the most dominant format among domains
− Microdata contains the largest number of quads
Divergence in Class and Property Usage in 2013
− A small number of classes and properties is used by a large number of domains
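One way to measure this kind of skew is to count, for each class, the number of distinct hosts that use it. The sketch below assumes N-Quads-style lines in which classes appear as objects of `rdf:type` and the fourth element is the source page URL; `hosts_per_class` and the sample lines are hypothetical, and host names stand in for domains.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Matches: subject <rdf:type> <class> <page> .
TYPE_RE = re.compile(
    r"^\S+\s+<" + re.escape(RDF_TYPE) + r">\s+<([^<>]*)>\s+<([^<>]*)>\s*\.\s*$"
)


def hosts_per_class(lines):
    """Return (class, number of distinct hosts) pairs, most widely used first."""
    hosts = defaultdict(set)
    for line in lines:
        match = TYPE_RE.match(line)
        if not match:
            continue  # not a type statement
        cls, page = match.groups()
        host = urlparse(page).hostname
        if host:
            hosts[cls].add(host)
    return sorted(
        ((cls, len(found)) for cls, found in hosts.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    f"_:n1 <{RDF_TYPE}> <http://schema.org/Product> <http://a.example/p1> .",
    f"_:n2 <{RDF_TYPE}> <http://schema.org/Product> <http://b.example/p2> .",
    f"_:n3 <{RDF_TYPE}> <http://schema.org/Product> <http://a.example/p3> .",
    f"_:n4 <{RDF_TYPE}> <http://schema.org/Person> <http://a.example/me> .",
    '_:n4 <http://schema.org/name> "Jane Doe" <http://a.example/me> .',
]
ranking = hosts_per_class(SAMPLE)
```

Counting distinct hosts rather than quads keeps a single large site from dominating the ranking.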