Hgrddl: Bridging Microformats and Rdfa Ben Adida A,∗

hGRDDL: Bridging Microformats and RDFa Ben Adida a;∗, aHarvard University, 33 Oxford Street MD-110, Cambridge MA 02118 Abstract We propose hGRDDL (pronounced \h-griddle"), a simple mechanism for transforming ad-hoc HTML-embedded structured data, such as microformats, into RDFa. This technique preserves the advantages of the original syntax, notably the correspondence between the rendered HTML and the related structured data, and requires little change on the publisher end. RDFa tool developers can leverage the existing deployments of microformats, while focusing new deployments on RDFa for greater extensibility and consistency, all using the same client-side toolset. We provide a prototype implementation of the hGRDDL processor and of transformations for hCard and hCal, two popular microformats. Key words: RDFa, microformats, GRDDL 1. Introduction 1.1. Multiple Approaches Since the early days of the Web, there have been While microformats have grown to provide a efforts to publish semantic data alongside the pri- number of vocabulary-specific syntaxes, other tech- marily visual content of HTML. Certain HTML el- niques have emerged to provide more generic se- ements, e.g. <em> for emphasis, are inherently more mantic web data embedding, with syntax inde- semantic and less explicitly visual than others, e.g. pendent of the vocabulary. RDFa [3], the W3C's <i> for italics. Certain HTML attributes, e.g. rel work-in-progress in this area, enables the embed- on an anchor element <a rel="next" href="page2" ding of almost any RDF graph inside XHTML1.1 >, are inherently semantic since they generally do and XHTML2 [10] using some additional attributes. not result in any visual effect, rather they indicate eRDF [6] can embed a smaller subset of RDF a relationship between the current and linked docu- graphs inside plain HTML as long as the publisher ments. These concepts were the driving force behind controls the document's <head>. Meanwhile, a sep- the microformats effort [18], which was colloquially arate W3C effort, GRDDL [4], aims to extract called \the small-s semantic web" for its ad-hoc ap- RDF triples from any XML document, including proach to evolving the Web towards more machine- XHTML, using a document- or namespace-specific readable data. transformation. One of the use cases for GRDDL is to extract RDF/XML from an HTML-with- microformats document, though it can be applied to any homegrown syntax, i.e. “unofficial” microformats. Multiple syntaxes are inevitable. However, vari- ∗ Corresponding author. Tel: +1 617.395.8535 ety comes at a cost: each syntax requires its own Email address: [email protected] (Ben Adida). Preprint submitted to Elsevier 20 August 2008 parser, often its own semantic interpreter, and often its own mechanism for relating rendered screen re- hGRDDL gions to structured data. Though RDFa provides a XHTML consistent syntax and the most expressivity, micro- +µformat +RDFa formats currently enjoy significantly greater deploy- ment: a number of large publishers, e.g. the social network LinkedIn [15], the online calendar 30Boxes hGRDDL [1], and Apple's dotMac online email system [9], use microformats. Recently, the total number of +eRDF +RDFa microformat-enabled web pages was estimated at \hundreds of millions" [17]. Whatever limitations GRDDL microformats may have, and however extensible a RDF/XML toolbuilder wishes to make her software, she can- +µformat not ignore microformats. How might we give the toolbuilder the best of all worlds while reducing her development effort? How can we enable a world of GRDDL many structured data syntaxes while reducing the RDF/XML duplication of effort in writing client-side tools? +eRDF Fig. 1. hGRDDL vs. GRDDL. The hGRDDL transform hap- pens before the user sees the rendered content. The point 1.2. Bridging the Syntaxes where the user sees the document is represented in the shaded vertical bar. Note how, in the case of hGRDDL, the output of the transformation is both the machine-readable data, and We propose hGRDDL 1 , an approach that re- the human-readable data. duces syntax-specific code to a single transformation to RDFa. This solution effectively provides a funnel to RDFa, which becomes a unifying syntax around which all client-side tools can be built. The key point of hGRDDL is that it processes an HTML document and produces another HTML document 1.3. Why Not Just GRDDL? that renders identically. The original HTML document can be entirely replaced by the new HTML The W3C's GRDDL effort provides a way to document within the user's browser. All client-side transform microformats, eRDF, and other ad-hoc data processing can then be based solely on the HTML-embedded syntaxes into RDF/XML. One RDFa version of the document, even if publish- might wonder what the point of hGRDDL is, then, ers produced microformats, eRDF, or some other if plain GRDDL can already extract the RDF. The custom syntax for their specific needs. key point is human context: an HTML page may Note that the specific HTML \host" for RDFa contain a significant amount of structured data, does not change with the transformation. If the orig- of which the user may only want to select a small inal document is XHTML2, then the target docu- portion, typically by visual means, e.g. using a con- ment is XHTML2+RDFa, and if the original doc- textual menu on a particular section of the page. ument is XHTML1.1, then the target document is A number of tools, e.g. Operator [13], provide just XHTML1.1+RDFa. As RDFa becomes supported this kind of human-context-driven structured data in additional host dialects, we expect hGRDDL to access. apply accordingly. With GRDDL, once RDF/XML has been ex- tracted from HTML, there is no practical way to associate a point in the visually rendered HTML with its corresponding RDF: the HTML Document Object Model (DOM) [22] has been discarded. 1 Our solution is called hGRDDL because it is meant as a GRDDL-like transformation with a focus on processing hGRDDL preserves the expression of the structured microformats, which have a habit of using lowercase `h' in data in HTML form, so that this visual-semantic their naming to indicate \HTML". correspondence is retained. 2 2. Principles for Embedding Semantics into version, a publisher should not have to repeat her- HTML self: the HTML should only contain a single copy of the data, which can be both rendered and in- There are many methods for embedding seman- terpreted as the machine-readable equivalent. In tics in HTML. Here, we outline four guiding princi- particular, solutions that maintain parallel HTML ples, which we illustrate by example. Because these and RDF/XML files are not compliant with this principles drove the design of RDFa, it should come principle. as no surprise that RDFa is the only solution that Example 2 (Creative Commons) An artist satisfies all four. That said, detailing these princi- with a simple HTML page licenses his graphic de- ples should clarify the advantages of choosing RDFa signs under a non-commercial Creative Commons as the target of hGRDDL transformations and illus- license [5]. He decides to change the license to non- trate why retaining information about the HTML commercial-share-alike, so that other artists who DOM is particularly important. reuse his work must share their new work in the same way. To accomplish this, he simply changes the Independence & Extensibility. When a publisher href in the anchor link to the Creative Commons provides structured data in her HTML, she should license. Machine readers should detect this change, determine independently the exact vocabulary she too, without the artist having to modify a second needs. There may be community-defined best prac- portion of his markup. tices, but the publisher should not be forced to use a consensus approach, as she knows her requirements Locality. When a user selects a portion of the ren- better than anyone else. In particular, if she chooses, dered HTML within his browser, he should be able as a starting point, some best-practice vocabularies, to access the corresponding structured data, using, she should be able to extend them, combine them, for example, a contextual menu. The structured data or use only a subset. must thus remain intimately tied to the syntax that Tools built to recognize the initial vocabularies expressed it. should still recognize as much of them as remains in the final, customized vocabulary, and be able to Example 3 (Weblog) A weblog consists of multi- perform basic, generic data processing on the addi- ple entries, often presented in batches on a single tional fields. It should be noted that RDF [16], as an HTML page. A user wishing to extract categorization abstract data modeling approach, precisely fits this tags and authorship from a given entry should be able requirement: any publisher can reuse existing RDF to right-click on the entry and find the related infor- properties and mint new URIs to create new ones, mation, independently of other entries. This assumes while any consumer can perform HTTP-based dis- a properly updated browser but a straight-forward covery on properties it has not yet encountered. mechanism for locating related structured data. Example 1 (Photography) A professional photography web site may want to reuse the vocabulary Self-Containment. It should be relatively easy to created by a simpler image-sharing web site, includ- produce a fragment of HTML that is entirely self- ing photo caption, size, and date taken. In addition, contained with respect to the structured data it the professional photography site may wish to de- expresses. At a high level, this enables \copy-and- scribe the lens and exposure settings.

Load more