Embedding Knowledge in HTML

HTML is Everywhere

• We usually think of HTML as the language of Web pages
• But it’s also widely used on/for mobile devices and tablets
  – It readily adapts for different screen sizes/orientations
• And is the basis of many ebook formats
  – E.g. Kindle’s formats, mobi, epub
• How can we add knowledge to HTML pages?

Some content from presentations by Ivan Herman of the W3C

Adding RDF-like data to HTML

• We’d like to add semi-structured knowledge to a conventional HTML document
  – Humans can see and understand the regular HTML content (text, images, videos, audio)
  – Machines can see and understand the data markup in XML, RDF or some other format
• Possibilities include
  – Add a link to a separate document with the knowledge
  – Embed the knowledge as comments, javascript, etc.
  – Distribute the knowledge markup throughout the HTML as attributes of existing HTML tags

One page, not two

• Content providers prefer not to generate multiple pages, one for humans (HTML) and another for machines (RDF)
  – RDF serializations are complex
  – Requires a separate storage, generation, etc. mechanism
  – Introduces redundancy, which can lead to errors if we change one page but not the other
• Simplifies the job of search engines as well

General approach

• Provide or reuse tag attributes to encode the knowledge
  – Browsers and web apps ignore attributes they don’t understand
• Three approaches have been developed
  – Microformats (~ 2005)
  – RDFa (~ 2007)
  – Microdata (aka schema.org) (~ 2012)
• Status 2014/5 (IMHO)
  – Microformats used but future is limited
  – RDFa becoming the encoding of choice
  – Schema.org vocabularies getting large uptake

Microformats approach

• Reuses HTML attributes like @class, @title
• Separate vocabularies (address, CV, …)
• Difficult to mix microformats (no concept of namespaces)
• Does not, inherently, define an RDF representation
  – Possible to transform via, e.g., XSLT + GRDDL, but transformations are vocabulary dependent
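To make the attribute-reuse idea concrete, here is a minimal Python sketch that reads a tiny hCard-style fragment using only the standard library. The person, the organization, the `HCardParser` class and the two-property subset are assumptions made for illustration; this is not a complete microformats parser and it does not handle nested elements.

```python
from html.parser import HTMLParser

# Hypothetical hCard fragment: the data lives in ordinary class attributes.
HCARD = """
<div class="vcard">
  <span class="fn">Jane Doe</span> works for
  <span class="org">Example Corp</span>
</div>
"""

# Small illustrative subset of hCard property names.
HCARD_PROPERTIES = {"fn", "org"}


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = [c for c in classes if c in HCARD_PROPERTIES]
        if matches:
            self._current = matches[0]
            self.card[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.card[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


parser = HCardParser()
parser.feed(HCARD)
card = {name: text.strip() for name, text in parser.card.items()}
print(card)
```

A browser renders the fragment as plain text and ignores the class names unless a stylesheet uses them, while a tool that knows the hCard vocabulary can recover the name and organization. Because the property names are bare strings with no namespace, two vocabularies that both used `org` would collide, which is the mixing problem noted above.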

Microdata approach

• Defined and supported by Google, Bing, Yahoo and Yandex
• Adds new attributes to HTML5 to express metadata
• Works well for simpler “single-vocabulary” cases, but not well suited for mixing vocabularies or for complex vocabularies
• No notion of datatypes or namespaces
• Defines a generic mapping to RDF

RDFa approach

• Adds new (X)HTML/XML attributes
• Has namespaces and URIs at its core
  – So mixing vocabulary is easy, as in RDF
• Complete flexibility for using literals or URI resources
• Is a complete serialization of RDF
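The two syntaxes can be compared side by side with a short Python sketch. It marks up the same hypothetical person once with Microdata attributes (`itemscope`, `itemtype`, `itemprop`) and once with RDFa Lite attributes (`vocab`, `typeof`, `property`), then pulls out the type and properties with the standard-library HTML parser. The person, the `AttributeParser` class and the `extract` helper are assumptions for illustration; real Microdata and RDFa processors handle nesting, datatypes and many more cases.

```python
from html.parser import HTMLParser

# Hypothetical markup: the same person in both syntaxes.
MICRODATA = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Professor</span>
</div>
"""

RDFA_LITE = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
  <span property="jobTitle">Professor</span>
</div>
"""


class AttributeParser(HTMLParser):
    """Read one flat item: its type plus the text of each property element."""

    def __init__(self, type_attr, prop_attr):
        super().__init__()
        self.type_attr = type_attr
        self.prop_attr = prop_attr
        self.vocab = ""        # RDFa base IRI; stays empty for Microdata
        self.item_type = None
        self.props = {}
        self._current = None   # property currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "vocab" in attrs:
            self.vocab = attrs["vocab"]
        if self.type_attr in attrs:
            self.item_type = self.vocab + attrs[self.type_attr]
        self._current = attrs.get(self.prop_attr)
        if self._current:
            self.props[self._current] = ""

    def handle_data(self, data):
        if self._current:
            self.props[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def extract(html, type_attr, prop_attr):
    """Return (type IRI, {property: text}) for a single flat item."""
    parser = AttributeParser(type_attr, prop_attr)
    parser.feed(html)
    return parser.item_type, parser.props


microdata = extract(MICRODATA, "itemtype", "itemprop")
rdfa = extract(RDFA_LITE, "typeof", "property")
print(microdata)
print(rdfa)
```

In this simple single-vocabulary case both fragments carry the same information, and only the attribute names differ. The RDFa version gets its full type IRI by combining `vocab` with the short name in `typeof`, which hints at how RDFa keeps URIs at its core.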

Yielding this RDF

schema:alumniOf ;
:schoolHomePage ;
schema:worksFor ;
…
dc:title "Eötvös Loránd University of Budapest" .
…
dc:title "World Wide Web Consortium (W3C)"
…

Yielding this RDF

[ rdf:type schema:Review ;
  schema:name "Oscars 2012: The Artist, review" ;
  schema:description "The Artist, an utterly beguiling…" ;
  schema:ratingValue "5" ;
  … ]
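As a rough illustration of how extracted properties could be turned into RDF of this shape, the Python sketch below serializes a flat dictionary as a Turtle blank node. The `item_to_turtle` function is hypothetical and is not the mapping algorithm defined for Microdata or RDFa; it assumes the `rdf:` and `schema:` prefixes are declared elsewhere and that every value is a plain string literal. The two property values are taken from the review above.

```python
def item_to_turtle(item_type, props):
    """Serialize one flat item as a Turtle blank node (illustrative only)."""
    parts = [f"rdf:type schema:{item_type}"]
    for name, value in props.items():
        # Escape backslashes and double quotes so the literal stays valid.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'schema:{name} "{escaped}"')
    return "[ " + " ;\n  ".join(parts) + " ] ."


review = {
    "name": "Oscars 2012: The Artist, review",
    "ratingValue": "5",
}
turtle = item_to_turtle("Review", review)
print(turtle)
```

The item becomes a blank node because the markup gives it no identifier of its own, and every value is a plain literal, which matches the untyped `"5"` for the rating in the RDF above.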

Rich Snippets

• Search engines add text under results to preview what’s on the page and why it’s relevant
• Text often extracted from structured data embedded on the page
• See http://bit.ly/RichSN for more information

RDFa and Microdata: similarities

• RDFa and Microdata are modern options
  – Microformats is another
• Both have similar approaches
  – Structured data encoded in HTML attributes only (no new elements)
  – Define some special attributes, e.g., itemscope for microdata, resource for RDFa
  – Reuse some HTML core attributes (e.g., href)
  – Use textual content of HTML source, if needed
• RDF data can be extracted from both

RDFa and Microdata: differences

• Microdata optimized for simpler use cases:
  – One vocabulary at a time
  – Tree shaped data
  – No datatypes
• RDFa provides full serialization of RDF in XML or HTML
  – Price is extra complexity over Microdata
• RDFa 1.1 Lite is a simplified authoring profile of RDFa, very similar to microdata

Amount of structured data on Web?

• Web Data Commons project uses Common Crawl data to estimate how much structured data is on the Web
• Looked for Microdata, RDFa, and nine common Microformats (e.g., hCalendar, hCard) in URLs parsable as HTML
• Nov. 2013 crawl:
  – 44TB (compressed) data from 2.2B URLs from 13M domains
  – 14% of domains, 26% of URLs had semantic data
• Processing 40TB (compressed) of the 2012 crawl took 5.6K machine hours on 100 machines and cost ~$400

What formats were found?

• Microdata use up (140K->463K sites from 2012->13)
• See here for details on the 2013 crawl
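The crawl figures above are easier to compare once the percentages are turned into counts. The short Python calculation below does that arithmetic using only the numbers quoted on the slides; the even split of work across the 100 machines is an assumption made here, not something the slides state.

```python
# Figures quoted for the Nov. 2013 crawl.
domains = 13_000_000
urls = 2_200_000_000

domains_with_data = domains * 14 // 100   # 14% of domains
urls_with_data = urls * 26 // 100         # 26% of URLs

# Microdata growth from 2012 to 2013 (number of sites).
microdata_growth = round(463_000 / 140_000, 1)

# Processing cost of the 2012 crawl.
machine_hours = 5_600
cost_dollars = 400
cost_per_machine_hour = round(cost_dollars / machine_hours, 3)
wall_clock_hours = machine_hours / 100    # assumes an even split over 100 machines

print(f"{domains_with_data:,} domains and {urls_with_data:,} URLs with semantic data")
print(f"Microdata sites grew about {microdata_growth}x")
print(f"About ${cost_per_machine_hour} per machine hour, about {wall_clock_hours:.0f} hours wall clock")
```

So roughly 1.8 million domains and 570 million URLs carried some semantic markup, Microdata adoption a little more than tripled in a year, and the whole analysis cost on the order of seven cents per machine hour.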

Conclusions

• The amount of structured data on the web is growing steadily
• Microdata shows the strongest growth
• RDFa also common
• Microformat data is probably not growing as much
