DERI – DIGITAL ENTERPRISE RESEARCH INSTITUTE

RDFA VS. MICROFORMATS

Alexander Graf

DERI TECHNICAL REPORT 2007-04-10

APRIL 2007

DERI Galway, University Road, Galway, Ireland, www.deri.ie
DERI Innsbruck, Technikerstrasse 21a, Innsbruck, Austria, www.deri.at
DERI Korea, Yeonggun-Dong, Chongno-Gu, Seoul, Korea, korea.deri.org
DERI Stanford, Serra Mall, Stanford, USA, www.deri.us

DERI TECHNICAL REPORT 2007-04-10, APRIL 2007

RDFA VS. MICROFORMATS: A COMPARISON OF INLINE FORMATS IN (X)HTML

Alexander Graf¹

Abstract. Most Web pages contain inherently structured and significant data, such as contact details for people, dates and addresses of events, descriptive elements for photos and a lot more. As it is, this data is expressed in a way that is easily understandable for humans but very hard for machines to detect and interpret. Once content publishers gain the ability to express this data more completely and tools are developed that are able to understand it, a whole new set of possibilities on the Web becomes available to the end user. New forms of Web content, meaningful to computers, will unleash a revolution of the internet. A true Semantic Web might still lie in the future, but that is no reason not to start using the core ideas from which it is being formed. There are several attempts to combine the principles of the Semantic Web, as envisioned by Tim Berners-Lee, with currently established technologies such as (X)HTML. This growth of semantics in the existing Web is swiftly advancing the state of the art for all Semantic Web processes. By enhancing existing Web documents with semantics, we allow machines to categorise and handle information so that it can be used in a much more practical way, while keeping the principles of the existing Web and still coding for humans first. This paper reviews, analyzes and compares RDFa and microformats, two of the current technologies for inline metadata in (X)HTML, and aims to give an overview of the possibilities currently available for annotating existing data in Web sites.

Keywords: semantic web, microformats, inline metadata, eRDF.

¹ Digital Enterprise Research Institute Innsbruck, University of Innsbruck, Technikerstraße 21a, A-6020 Innsbruck, Austria. E-mail: [email protected]

Copyright © 2007 by the authors

DERI TR 2007-04-10

Contents

1 Introduction

2 Common Principles
2.1 Visible Metadata
2.2 DRY principle

3 RDFa
3.1 Benefits
3.2 Drawbacks

4 Microformats
4.1 Benefits
4.2 Drawbacks

5 Side by Side Comparison

6 Discussion and Conclusions

1 Introduction

“The goal of the Semantic Web initiative is as broad as that of the Web: to create a universal medium for the exchange of data. It is envisaged to smoothly interconnect personal information management, enterprise application integration, and the global sharing of commercial, scientific and cultural data. Facilities to put machine-understandable data on the Web are quickly becoming a high priority for many organizations, individuals and communities.” [1]

This vision of the Semantic Web consists essentially of a distributed knowledge system based on RDF, a markup format that provides a way to express logical statements in serialized formats like XML. It derives from Tim Berners-Lee’s vision of the Web as a universal medium for knowledge exchange. This new Semantic Web would be fundamentally different from the Web of today, mainly because it will be designed for machines first and humans second. Additionally, the Semantic Web requires that we take a step away from the Web that we all know, throw away current practices and formats and embrace a new Web. While this is not necessarily bad, it is not a step to be taken lightly, and it is certainly not a step that can be taken quickly.

At the moment we already have a Web which is viewable by humans in its native form and yet can be used as a first step towards a Semantic Web. By re-using existing data and allowing the expression of semantics in Web pages, we can provide machines with information already being published on the Web as (X)HTML. These “Real World Semantics” are seeing widespread adoption by companies, bloggers and other “real people” on the internet beyond academic institutions. Recently the term “Lowercase Semantic Web” was coined for this type of mark-up, where the goals of the Semantic Web are achieved without depending on the standards that are part of the wider Semantic Web initiative, while still being able to work together with the “Uppercase Semantic Web” which comprises those standards.

Several technologies that aim to enhance (X)HTML with semantic information have surfaced over time and struggle for public acceptance, the most important ones being RDFa and microformats. Both share the same goal, yet they are fundamentally different in that they approach the problem from different directions, and they deserve a closer inspection.

2 Common Principles

While RDFa and microformats are very different, they share several core principles. For example, both technologies support plain literals, are well formed and have no negative effect on browser behaviour. They also both follow the Principle of Least Astonishment, which states that, when several elements of an interface are ambiguous, the behaviour that least surprises the human user should apply, as it will usually be the correct one. The principle of Visible Metadata and the DRY Principle are two more features shared by both approaches.

2.1 Visible Metadata

Previously there were several attempts to annotate HTML documents with metadata. Ranging from meta tags in the head of a document to embedded RDF in HTML comments, those attempts had in common that the metadata was invisible to the human reader of the document. Hidden metadata is often abused for placement or other gain that only benefits the author of the document, not the user.

By making metadata available and completely visible, a consumer can easily know whether to trust the author and can be sure that all data is actually relevant to the human reader as well as machines. This principle also assists the document author in keeping the metadata up-to-date. Metadata that is hidden away can be easily forgotten and go stale, whereas visible inaccuracies would soon be discovered by humans and could thus be fixed.

2.2 DRY principle

DRY stands for Don’t Repeat Yourself and describes another important process philosophy used in the RDFa and microformats approaches. Also known as Once and Only Once or Single Point of Truth, the core principle, which was first described in Andy Hunt and Dave Thomas’s book The Pragmatic Programmer [2], aims to reduce redundancy in computing.

“DRY says that every piece of system knowledge should have one authoritative, unambiguous representation. Every piece of knowledge in the development of something should have a single representation. [...] Given all this knowledge, why should you find one way to represent each feature? The obvious answer is, if you have more than one way to express the same thing, at some point the two or three different representations will most likely fall out of step with each other. Even if they don’t, you’re guaranteeing yourself the headache of maintaining them in parallel whenever a change occurs. And change will occur.” [3]

Often we maintain separate RDF documents along with their HTML equivalents and have to update both resources on a regular basis. If the DRY principle is applied, a modification of any metadata in the system has to be made in only one place, where it is expressed for both humans and machines.

3 RDFa

RDFa, developed and proposed by the W3C, is a set of rules that can be used as a module for XHTML 2. It reuses attributes from the standard XHTML meta and link elements and applies them to all other XHTML elements, so that one can annotate XHTML markup with semantic information. With a simple mapping it is possible to extract RDF triples from an RDFa-annotated document.

“RDFa is a syntax for expressing this structured data in XHTML. The rendered, hypertext data of XHTML is reused by the RDFa markup, so that publishers don’t repeat themselves. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time. The expressed structure is closely tied to the data, so that rendered data can be copied and pasted along with its relevant structure.” [4]

The ultimate goal of RDFa is to make any RDF structure representable in pure XHTML. Unlike the microformats approach, this allows an author to use a predefined set of rules to mark up just about anything. Since the underlying abstract representation is pure RDF, publishers can build their own vocabulary and extend other vocabularies with maximum interoperability. The structure expressed with RDFa is closely tied to the actual data, so the rendered elements can be copied and pasted along with their relevant RDF structure. However, there are also problems with RDFa. Not only does it require XHTML 2, it also requires a new form of URIs, called CURIEs.
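The mapping from annotated markup to RDF triples mentioned above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a conforming RDFa processor: it handles just the about, property, content, rel and href attributes, assumes every element has an explicit end tag, and leaves prefixed names such as foaf:name unexpanded. The class name SimpleRDFaExtractor, the helper extract_triples and the example.org URIs are hypothetical.

```python
from html.parser import HTMLParser


class SimpleRDFaExtractor(HTMLParser):
    """Toy extractor for a small subset of RDFa-style attributes (assumption:
    every element is explicitly closed, so open/close events are balanced)."""

    def __init__(self, base_subject=""):
        super().__init__()
        self.triples = []
        self._subjects = [base_subject]  # innermost subject is last
        self._open = []                  # per open element: did it set a subject?
        self._literal = None             # (subject, predicate, text parts, depth)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        sets_subject = bool(a.get("about"))
        if sets_subject:
            # "about" names the resource that nested statements describe.
            self._subjects.append(a["about"])
        self._open.append(sets_subject)
        subject = self._subjects[-1]
        if a.get("rel") and a.get("href"):
            # rel + href: the object is another resource.
            self.triples.append((subject, a["rel"], a["href"]))
        if a.get("property"):
            if a.get("content") is not None:
                # An explicit content attribute overrides the rendered text.
                self.triples.append((subject, a["property"], a["content"]))
            else:
                # Otherwise the rendered text becomes the literal object.
                self._literal = (subject, a["property"], [], len(self._open))

    def handle_data(self, data):
        if self._literal is not None:
            self._literal[2].append(data)

    def handle_endtag(self, tag):
        if self._literal is not None and self._literal[3] == len(self._open):
            subject, predicate, parts, _ = self._literal
            self.triples.append((subject, predicate, "".join(parts).strip()))
            self._literal = None
        if self._open and self._open.pop():
            self._subjects.pop()


def extract_triples(markup, base_subject=""):
    """Return (subject, predicate, object) tuples found in the markup."""
    parser = SimpleRDFaExtractor(base_subject)
    parser.feed(markup)
    parser.close()
    return parser.triples


SAMPLE = """
<div about="http://example.org/people#alice">
  <span property="foaf:name">Alice Example</span>
  <a rel="foaf:homepage" href="http://example.org/">Homepage</a>
</div>
"""

for triple in extract_triples(SAMPLE):
    print(triple)
```

The visible text "Alice Example" is stated once and serves both the human reader and the extracted triple, which is the DRY principle from section 2.2 at work.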

3.1 Benefits

• Publishers are independent and each website is allowed to use its own standards

• Because of Self Containment, the RDF triples are separated from the (X)HTML content

• Modularity of the schema makes attributes reusable

• Follows several well-working microformats principles like DRY

• All resource descriptions can be extracted with a single transformation stylesheet, and the output is always RDF triples

• Supported and developed by the W3C

• QNames can be used to provide full namespacing (but see 3.2 Drawbacks)

• Can operate at the level of document structure semantics, as well as the document content semantics

3.2 Drawbacks

• Not XHTML 1.0 compliant; will only work with XHTML 1.1, XHTML 2 and maybe (X)HTML 5 once there are appropriate modules

• No implementations yet

• Using Tidy or any other XHTML cleaning tool (to create well-formed content) can break the embedded RDF semantics

• No adoption yet, no working examples

• Although there are attempts¹ to use RDFa in current HTML 4 and XHTML 1, it breaks CSS and (X)HTML validity

• The default datatype of literals is rdf:XMLLiteral which is wrong for most deployed properties

• RDFa requires and introduces a new form of URIs (CURIE)

• QNames are not a use of standard XML namespaces. Since QNames don’t work with CSS selectors they are impractical for presentation on the Web.
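To make the CURIE drawback more concrete: a CURIE is a compact URI of the form prefix:reference, and a consumer has to expand it into a full URI by looking up the prefix and concatenating the reference. The following Python sketch shows that expansion under simplifying assumptions. The function name expand_curie and the prefix table are hypothetical; only the two namespace URIs (FOAF and Dublin Core elements) are real, well-known vocabularies.

```python
# Hypothetical prefix table; in a real document the bindings would come
# from namespace declarations in the markup itself.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' to a full URI by simple concatenation."""
    prefix, separator, reference = curie.partition(":")
    if not separator:
        raise ValueError(f"not a CURIE (no colon): {curie!r}")
    if prefix not in prefixes:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + reference


print(expand_curie("foaf:name", PREFIXES))
print(expand_curie("dc:title", PREFIXES))
```

The expansion only works when the prefix bindings travel with the markup, which is why a copied fragment without its namespace declarations can no longer be resolved.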

4 Microformats

“Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns (e.g. XHTML, blogging).” [5]

¹ http://internet-apps.blogspot.com/2007/02/using-rdfa-in-xhtml-1.html

Led by prominent designers and bloggers Tantek Çelik, Eric Meyer, Kevin Marks and others, microformats are appearing as a new set of standards. Microformats specify class names and rel attributes to be used in a certain way that gives the content a specified meaning. Some examples include vote links, personal vCards, iCal events, friendship relations (XFN, similar to FOAF) and tags. The microformats process tries to look at existing practices and how data is already being published on the Web, and then codifies these conventions so they have a minimal impact on the current Web.

Microformats are, in fact, subversive. They manage to question the approach of full-blown Semantic Web approaches and challenge fundamental parts of the current Web architecture such as Web feeds and APIs. With microformats, your website does not need an API, it becomes an API. Unlike with RDFa, applications do not need to take the RDF route and can instead work with raw data in already established and often-used formats like vCards.

The same approach is taken with RSS/Atom feeds. Using the hAtom microformat, a website becomes the feed by intelligently marking up items. There is no longer a need to publish the same content in other formats like RSS (see 2.2 DRY principle), since applications can extract raw Atom data from the (X)HTML of the website, again without converting to and from RDF.
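The idea that class names alone carry the meaning can be illustrated with the hCard microformat, which marks up vCard data using the root class vcard and property classes such as fn, org, tel and url. The Python sketch below is a simplified illustration, not a complete hCard parser: it covers only those four properties, assumes every element has an explicit end tag, and ignores nested cards. The names HCardParser and parse_hcards and the sample data are hypothetical.

```python
from html.parser import HTMLParser

# Subset of hCard properties whose value is the element's text content.
TEXT_PROPERTIES = {"fn", "org", "tel"}


class HCardParser(HTMLParser):
    """Collect a few hCard properties from elements inside class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._depth = 0          # number of currently open elements
        self._card_depth = None  # depth at which the current vcard opened
        self._open_props = []    # (property name, depth, text parts)

    def handle_starttag(self, tag, attrs):
        self._depth += 1
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if "vcard" in classes and self._card_depth is None:
            self._card_depth = self._depth
            self.cards.append({})
        if self._card_depth is None:
            return  # outside any card: class names carry no hCard meaning
        if "url" in classes and a.get("href"):
            # For url, the value is the link target rather than the text.
            self.cards[-1]["url"] = a["href"]
        for name in classes:
            if name in TEXT_PROPERTIES:
                self._open_props.append((name, self._depth, []))

    def handle_data(self, data):
        for _, _, parts in self._open_props:
            parts.append(data)

    def handle_endtag(self, tag):
        while self._open_props and self._open_props[-1][1] == self._depth:
            name, _, parts = self._open_props.pop()
            # Collapse runs of whitespace in the collected text.
            self.cards[-1][name] = " ".join("".join(parts).split())
        if self._card_depth == self._depth:
            self._card_depth = None
        self._depth -= 1


def parse_hcards(markup):
    """Return one dict per vcard root found in the markup."""
    parser = HCardParser()
    parser.feed(markup)
    parser.close()
    return parser.cards


SAMPLE_CARD = """
<div class="vcard">
  <a class="url fn" href="http://example.org/">Alice Example</a>
  <span class="org">Example Institute</span>
</div>
"""

print(parse_hcards(SAMPLE_CARD))
```

The result is plain key/value data that maps directly onto vCard fields, with no RDF step in between, and the markup stays ordinary HTML that CSS selectors such as .vcard .fn can style.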

4.1 Benefits

• Very wide deployment and adoption by mainstream Web designers and developers

• Use of microformats already has real world consequences and benefits

• Community based process makes developing custom microformats easy

• Integrate with current HTML standards and work perfectly with CSS

• Follow several common design principles like DRY

• Very compact syntax based on existing HTML semantics

• Integrate existing and evolving publishing patterns, for example rel="nofollow" and XFN

• Try to model and encapsulate real, existing technologies like vCards and iCal data

• Allow applications to use already existing technologies instead of converting data to RDF and back

• Usable in arbitrary (X)HTML envelopes like mails and even Web Service descriptions

• Don’t require head profiles or extensions to (X)HTML

• Very easy to understand and implement; don’t require special knowledge about data models

• microformats are Tidy-safe; cleaning a page with Tidy doesn’t remove or modify the embedded semantics

4.2 Drawbacks

• Centralized project; the invention of custom microformats outside of the community is discouraged

• [...] issues and only a limited number of microformats

• Separate parsing rules are required for each microformat; a general XML stylesheet to transform all microformats to RDF is not possible
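The last drawback can be illustrated with the rel-based microformats: each one needs its own extraction rule. In the Python sketch below, rel="tag" takes its value from the last path segment of the link target, XFN takes its values from a fixed list of relationship keywords, and rel="nofollow" only flags the link. This is a simplified sketch under stated assumptions; the names RelParser and parse_rel_links and the sample links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

# Relationship keywords defined by XFN.
XFN_VALUES = {
    "friend", "acquaintance", "contact", "met", "co-worker", "colleague",
    "co-resident", "neighbor", "child", "parent", "sibling", "spouse",
    "kin", "muse", "crush", "date", "sweetheart", "me",
}


class RelParser(HTMLParser):
    """Apply one hand-written rule per rel-based microformat."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.relationships = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href")
        rels = (a.get("rel") or "").split()
        if not href:
            return
        if "tag" in rels:
            # rel-tag rule: the tag is the last path segment of the URL.
            path = urlparse(href).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))
        xfn = [value for value in rels if value in XFN_VALUES]
        if xfn:
            # XFN rule: the rel values describe the relationship itself.
            self.relationships.append((href, xfn))
        if "nofollow" in rels:
            # nofollow rule: the link is merely flagged.
            self.nofollow.append(href)


def parse_rel_links(markup):
    parser = RelParser()
    parser.feed(markup)
    parser.close()
    return {
        "tags": parser.tags,
        "relationships": parser.relationships,
        "nofollow": parser.nofollow,
    }


SAMPLE_LINKS = """
<a rel="tag" href="http://example.org/tags/semantic-web">semantic web</a>
<a rel="friend met" href="http://example.org/bob">Bob</a>
<a rel="nofollow" href="http://example.org/untrusted">a link</a>
"""

print(parse_rel_links(SAMPLE_LINKS))
```

Each branch encodes knowledge about one specific format, so supporting a further microformat means writing a further branch rather than reusing a single generic transformation.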

5 Side by Side Comparison

| Feature | RDFa | microformats |
|---|---|---|
| Namespace | XML namespaces | Flat |
| HTML Validity | Only XHTML 2.0 | HTML 4.01 / 5, XHTML 1.x / 2.0 |
| Attribute usage | Introduces new attributes | Reuses existing HTML attributes |
| Syntax definition | Custom interoperable definitions | Defined by a community |
| Data models | Reuses RDF models | Require new data models |
| Real World implementations | No implementations yet | By companies and hobbyists |
| Shortcuts | Limited shortcuts | Uses shortcuts and abbreviations |
| W3C opinion | Part of XHTML 2 | Partly supported and ad hoc |
| Follows the DRY principle | Redundantly encapsulated literals | Yes |
| Custom extensions | Yes, and mixing of vocabulary | No |
| Arbitrary resource descriptions | Yes | No |
| Follows DCMI guidelines ² | No | No |
| Syntax | Uniform syntax specification | Separate parsing rules for each µF |
| Predictable RDF mappings | Yes | Mostly, multiple mappings possible |
| Live Clipboard Compatibility | Tweaks needed | Yes |
| Reliable copying, aggregation, and re-publishing of source chunks (self-containment) | Only chunks with nearby/embedded namespace definitions can be reliably copied | Mostly, but some µFs (e.g. XFN) lose their intended semantics when regarded out of context |
| Triple bloat prevention | No | Yes, only actively marked-up information leads to RDF triples |
| Integration in namespaced (non-HTML) XML | Possible | Not possible |
| Early Adoption | No | Yes, adoption by mainstream Web developers |
| Usable in (X)HTML envelopes | No | Yes, usable in arbitrary HTML envelopes such as rich content e-mails, service descriptions, ... |
| Tidy-Safe | No | Yes, using Tidy to clean up the HTML will not alter embedded semantics |
| Compact syntax | Partly | Yes, based on existing HTML semantics (rel, rev, class, ...) |
| Inclusion of evolving publishing patterns | Partly | Yes, for example rel="nofollow" |
| Support for metadata such as OpenID | Partly, will probably interpret any rel value | No |

² Dublin Core Metadata Initiative, found at http://www.dublincore.org/documents/dcq-html/

6 Discussion and Conclusions

It seems like we are just waiting for the next “de jure” versus “de facto” standards war. On one side, the microformats group is working to solve real-world problems with current implementations. On the other side, the W3C is thinking carefully about future directions of the Web. While most people seem to think that this is a bad thing, it actually presents opportunities. RDFa will not be usable for a good while, and microformats, by questioning fundamental building blocks of the Web, make us think about alternatives. Besides, there are several big issues with the current version of RDFa. The fact that RDFa does not manage to work with normal URIs but needs CURIEs should be an indicator that something is wrong. While there are several possibilities for what might happen to microformats once RDFa evolves and is ready to use, the fact remains that microformats currently present the only working option with real world benefits and usage possibilities.

References

[1] Ivan Herman, Semantic Web Activity Statement. Retrieved March 7th, 2007 from http://www.w3.org/2001/sw/Activity

[2] Andy Hunt and Dave Thomas, The Pragmatic Programmer: From Journeyman to Master (New York: Addison-Wesley Pub. Co., 1999)

[3] Dave Thomas, interviewed by Bill Venners, October 10th, 2003, Orthogonality and the DRY Principle. Retrieved March 14th, 2007 from http://www.artima.com/intv/dry.html

[4] RDFa Primer 1.0, Embedding RDF in XHTML. W3C Working Draft, March 12th, 2007 from http://www.w3.org/TR/xhtml-rdfa-primer/

[5] About microformats. Retrieved March 24th, 2007 from http://microformats.org/about/