hGRDDL: Bridging and RDFa Ben Adida a,∗,

aHarvard University, 33 Oxford Street MD-110, Cambridge MA 02118

Abstract

We propose hGRDDL (pronounced “h-griddle”), a simple mechanism for transforming ad-hoc HTML-embedded structured data, such as microformats, into RDFa. This technique preserves the advantages of the original syntax, notably the correspondence between the rendered HTML and the related structured data, and requires little change on the publisher end. RDFa tool developers can leverage the existing deployments of microformats, while focusing new deployments on RDFa for greater extensibility and consistency, all using the same client-side toolset. We provide a prototype implementation of the hGRDDL processor and of transformations for hCard and hCal, two popular microformats.

Key words: RDFa, microformats, GRDDL

1. Introduction 1.1. Multiple Approaches

Since the early days of the Web, there have been While microformats have grown to provide a efforts to publish semantic data alongside the pri- number of vocabulary-specific syntaxes, other tech- marily visual content of HTML. Certain HTML el- niques have emerged to provide more generic se- ements, e.g. for emphasis, are inherently more mantic web data embedding, with syntax inde- semantic and less explicitly visual than others, e.g. pendent of the vocabulary. RDFa [3], the W3C’s for italics. Certain HTML attributes, e.g. rel work-in-progress in this area, enables the embed- on an anchor element ” for its ad-hoc ap- RDF triples from any XML document, including proach to evolving the Web towards more machine- XHTML, using a document- or namespace-specific readable data. transformation. One of the use cases for GRDDL is to extract RDF/XML from an HTML-with- microformats document, though it can be applied to any homegrown syntax, i.e. “unofficial” micro- formats. Multiple syntaxes are inevitable. However, vari- ∗ Corresponding author. Tel: +1 617.395.8535 ety comes at a cost: each syntax requires its own Email address: [email protected] (Ben Adida).

Preprint submitted to Elsevier 20 August 2008 parser, often its own semantic interpreter, and often

its own mechanism for relating rendered screen re- hGRDDL gions to structured data. Though RDFa provides a XHTML consistent syntax and the most expressivity, micro- +µformat +RDFa formats currently enjoy significantly greater deploy- ment: a number of large publishers, e.g. the social network LinkedIn [15], the online calendar 30Boxes hGRDDL [1], and Apple’s dotMac online email system [9], use microformats. Recently, the total number of +eRDF +RDFa -enabled web pages was estimated at

“hundreds of millions” [17]. Whatever limitations GRDDL microformats may have, and however extensible a RDF/XML toolbuilder wishes to make her software, she can- +µformat not ignore microformats. How might we give the toolbuilder the best of all worlds while reducing her development effort? How can we enable a world of GRDDL

many structured data syntaxes while reducing the RDF/XML duplication of effort in writing client-side tools? +eRDF

Fig. 1. hGRDDL vs. GRDDL. The hGRDDL transform hap- pens before the user sees the rendered content. The point 1.2. Bridging the Syntaxes where the user sees the document is represented in the shaded vertical bar. Note how, in the case of hGRDDL, the output of the transformation is both the machine-readable data, and We propose hGRDDL 1 , an approach that re- the human-readable data. duces syntax-specific code to a single transforma- tion to RDFa. This solution effectively provides a funnel to RDFa, which becomes a unifying syntax around which all client-side tools can be built. The key point of hGRDDL is that it processes an HTML document and produces another HTML document 1.3. Why Not Just GRDDL? that renders identically. The original HTML docu- ment can be entirely replaced by the new HTML The W3C’s GRDDL effort provides a way to document within the user’s browser. All client-side transform microformats, eRDF, and other ad-hoc data processing can then be based solely on the HTML-embedded syntaxes into RDF/XML. One RDFa version of the document, even if publish- might wonder what the point of hGRDDL is, then, ers produced microformats, eRDF, or some other if plain GRDDL can already extract the RDF. The custom syntax for their specific needs. key point is human context: an HTML page may Note that the specific HTML “host” for RDFa contain a significant amount of structured data, does not change with the transformation. If the orig- of which the user may only want to select a small inal document is XHTML2, then the target docu- portion, typically by visual means, e.g. using a con- ment is XHTML2+RDFa, and if the original doc- textual menu on a particular section of the page. ument is XHTML1.1, then the target document is A number of tools, e.g. Operator [13], provide just XHTML1.1+RDFa. As RDFa becomes supported this kind of human-context-driven structured data in additional host dialects, we expect hGRDDL to access. apply accordingly. With GRDDL, once RDF/XML has been ex- tracted from HTML, there is no practical way to associate a point in the visually rendered HTML with its corresponding RDF: the HTML Document Object Model (DOM) [22] has been discarded. 1 Our solution is called hGRDDL because it is meant as a GRDDL-like transformation with a focus on processing hGRDDL preserves the expression of the structured microformats, which have a habit of using lowercase ‘h’ in data in HTML form, so that this visual-semantic their naming to indicate “HTML”. correspondence is retained.

2 2. Principles for Embedding Semantics into version, a publisher should not have to repeat her- HTML self: the HTML should only contain a single copy of the data, which can be both rendered and in- There are many methods for embedding seman- terpreted as the machine-readable equivalent. In tics in HTML. Here, we outline four guiding princi- particular, solutions that maintain parallel HTML ples, which we illustrate by example. Because these and RDF/XML files are not compliant with this principles drove the design of RDFa, it should come principle. as no surprise that RDFa is the only solution that Example 2 (Creative Commons) An artist satisfies all four. That said, detailing these princi- with a simple HTML page licenses his graphic de- ples should clarify the advantages of choosing RDFa signs under a non-commercial Creative Commons as the target of hGRDDL transformations and illus- license [5]. He decides to change the license to non- trate why retaining information about the HTML commercial-share-alike, so that other artists who DOM is particularly important. reuse his work must share their new work in the same way. To accomplish this, he simply changes the Independence & Extensibility. When a publisher href in the anchor link to the Creative Commons provides structured data in her HTML, she should license. Machine readers should detect this change, determine independently the exact vocabulary she too, without the artist having to modify a second needs. There may be community-defined best prac- portion of his markup. tices, but the publisher should not be forced to use a consensus approach, as she knows her requirements Locality. When a user selects a portion of the ren- better than anyone else. In particular, if she chooses, dered HTML within his browser, he should be able as a starting point, some best-practice vocabularies, to access the corresponding structured data, using, she should be able to extend them, combine them, for example, a contextual menu. The structured data or use only a subset. must thus remain intimately tied to the syntax that Tools built to recognize the initial vocabularies expressed it. should still recognize as much of them as remains in the final, customized vocabulary, and be able to Example 3 (Weblog) A weblog consists of multi- perform basic, generic data processing on the addi- ple entries, often presented in batches on a single tional fields. It should be noted that RDF [16], as an HTML page. A user wishing to extract categorization abstract data modeling approach, precisely fits this tags and authorship from a given entry should be able requirement: any publisher can reuse existing RDF to right-click on the entry and find the related infor- properties and mint new URIs to create new ones, mation, independently of other entries. This assumes while any consumer can perform HTTP-based dis- a properly updated browser but a straight-forward covery on properties it has not yet encountered. mechanism for locating related structured data.

Example 1 (Photography) A professional pho- tography web site may want to reuse the vocabulary Self-Containment. It should be relatively easy to created by a simpler image-sharing web site, includ- produce a fragment of HTML that is entirely self- ing photo caption, size, and date taken. In addition, contained with respect to the structured data it the professional photography site may wish to de- expresses. At a high level, this enables “copy-and- scribe the lens and exposure settings. A tool that sorts paste” of semantically marked-up HTML from one the image-sharing web site photos by “date taken” document to another. (The source and target docu- should continue to function on the professional pho- ments may require some kind of flag indicating the tography site. This tool is not aware of the new fields, presence of structured data in the first place, but but it can process the old ones. More interestingly, this flag should not be vocabulary specific.) the old tool can still sort, filter, aggregate, and relate Self-containment does not immediately follow photos according to the new fields without knowing from locality: the expression of the structured their deep semantics. data may depend on multiple portions of markup sprinkled throughout the page, which weaken self- containment even if the crucial portion of markup DRY (Don’t Repeat Yourself). When human- still provides locality. readable data is the same as the machine-readable

3 Example 4 (Widgets) A number of destination graph more complex than a single layer of attributes web sites, e.g. MySpace [19], provide users with becomes fairly difficult to express. the ability to include HTML fragments from other sources within their page. A self-contained approach to HTML-embedded structured data ensures that GRDDL. GRDDL is a technique for declaring and these widgets can carry their own structured data applying transformations to XML documents in or- with them, independently of the containing page. der to extract RDF, usually in RDF/XML form. GRDDL can be used specifically to process micro- 3. Approaches to Embedding Semantics in formats, eRDF, and RDFa, as long as the HTML HTML document uses a profile URI. GRDDL offers in- dependence & extensibility and DRY, since anyone Microformats. Microformats were built specifi- can write a GRDDL transformation and the whole cally with the intent to add just enough markup to point of the transformation is to reuse the exist- make existing HTML “semantic.” Using the HTML ing information in the HTML. However, the end attributes class, rel, and title, microformats product of a GRDDL transformation no longer con- define domain-specific syntaxes, such as hCard for tains any HTML DOM information. In addition, contact information and hCal for calendaring. Mi- the vocabulary-specific transformation is specified croformats are quite widely distributed: a recent es- in the HTML document’s head. Thus, GRDDL does timate gauges deployment at “hundreds of millions not provide locality or self-containment. of web pages” [17]. Microformats provide locality and DRY, but they fail to provide independence & extensibility. Multi- RDFa. RDFa was built from the ground up to en- ple microformats may conflict with one another, and able the embedding of most RDF graphs in HTML. adding custom fields is not supported, neither tech- To accomplish this task in a consistent way: nically nor socially within the microformat commu- – new attributes about, resource, instanceof, nity [8]. Though it may be technically possible to property, and content were introduced, create one’s own microformat, it is neither recom- – the existing attributes rel and href were applied mended nor made easy by the parsing architecture, to all elements, not just . which tends to assume a fixed set of well-defined vo- As a result, a single syntax can be used for em- cabularies. bedding just about any RDF graph, including even In the current deployments of microformats and blank nodes: RDFa thus supports independence & their associated parsing tools, self-containment is extensibility. Like eRDF, RDFa also supports DRY supported, though this assumes that parsers are and locality. In addition, because RDFa relies on aware of all possible vocabularies ahead of time. xmlns for namespace declarations, it can easily be If microformats were deployed according to their written so as to support self-containment, making it specification – using a profile URI – they would not trivial to copy and paste chunks of HTML+RDFa provide true self-containment, as the profile URI into a page: no vocabulary-specific information is needs to be copied into the of the target required in the . document, an operation which precludes many uses of self-containment, e.g. widgets. Selecting a Target for hGRDDL Transforms. Note eRDF. eRDF is a syntax that also uses the ex- how the set of data expressible by RDFa is a su- isting HTML attributes class, rel, and title to perset of that expressible by eRDF and microfor- enable the embedding of simple, single-layer-depth mats (given proper semantic mapping of the micro- RDF graphs. It builds on some [7] con- formats). Note also how GRDDL, though it is in- ventions for embedding attributes in HTML. eRDF finitely flexible in what it can express, does not pro- provides DRY and locality, but, because it also re- vide locality or self-containment. Though the valid- quires namespace declarations in the document’s ity of these principles is debatable (and RDFa’s sat- , does not fully provide self-containment. isfaction of the principles a bit of a tautology), we eRDF has good support for independence & exten- believe they provide real end-user value. Thus, we sibility, though not complete support, as any RDF choose RDFa as the target for hGRDDL transforms.

4 4. hGRDDL: Transforming to RDFa that the hGRDDL transform is expected to be per- formed by the user’s browser before the HTML is We propose hGRDDL, a simple technique for rendered, thus the minting of a new predicate. See transforming HTML-embedded ad-hoc semantics Figure 1 for an illustration. into HTML-embedded RDFa. The goal is to pro- vide a single, flexible development path for tool 4.3. The Transformation builders, while preserving the existing properties of the original syntax. In particular, the hGRDDL Because the HTML output should render exactly HTML output renders exactly like its input: the like the input, an hGRDDL transformation should embedded semantics remain local to the rendered take great care to preserve all of the HTML markup data they pertain to. that might be used to style the content or that might Of course, as we are not modifying the syn- in any other way affect it. Specifically, all class tax of the published document, the limitations names should be preserved, and the element struc- remain, too. A microformat or eRDF page, even ture of the DOM tree should be preserved. The only when hGRDDL-enabled, does not provide self- significant changes should be in the non-CSS at- containment. The point of using RDFa as a target tributes. syntax is to unify the client-side toolset and en- Preservation of the element structure of the DOM courage a maximally expressive syntax. However, tree may not be possible when there is no one-to- existing syntaxes cannot gain properties they do one mapping of the vocabularies: e.g. a full name not already support. becomes a first and last name. In such cases, we rec- ommend that an alternate RDF vocabulary be uti- 4.1. GRDDL lized so that the addition of new elements be avoided at all costs. If the transformation must add new el- In the case of HTML documents, GRDDL uses ements, it should strive to use as few as possible, the profile attribute, and takes the following steps: though one must acknowledge that the risk of chang- (i) detect and retrieve the profile indicated in ing the rendering of the page goes up significantly the head with this approach. (ii) run a GRDDL processor on the profile docu- ment to retrieve a 4.4. An Example: hCard grddl:profileTransformation triple that indicates the URI of the document- level transformation The point of hGRDDL, and its simplicity, are best (iii) fetch and apply the transformation to the orig- expressed by example. Consider the following sim- ple, fictitious hCard: inal HTML document to obtain RDF/XML ... .. 4.2. Finding the hGRDDL Transform ...

hGRDDL, as its name implies, is strongly inspired Tim Berners-Lee by GRDDL. Unfortunately, because GRDDL does
W3C
not provide a way to label transformations accord-
ing to output format, we cannot use it as-is. We can, We assume that the hCard profile has been updated however, use a very similar approach without inter- to indicate an hGRDDL transformation. The output fering with any existing GRDDL transforms. of the hGRDDL processor should then be: hGRDDL bootstraps off the GRDDL profile- ... based process, using the alternate predicate .. hgrddl:hProfileTransformation. This new pred- ... icate subclasses grddl:profileTransformation
so that existing GRDDL processors can use this transformation as a generic way to extract an RDF Tim Berners-Lee serialization, which is what the RDFa output of the
W3C
hGRDDL transformation provides. Note, however,

5 Note how all existing class names were pre- run inside a web browser with the input HTML page served. Only the instanceof, property and rel loaded into the DOM. attributes were used. The output HTML should render exactly like the input. Note also how this 5. Implementing hGRDDL syntax-specific transformation is relatively straight- forward: it need only add a handful of attributes Our prototype implementation is available at where it finds the specific microformat properties. http://ben.adida.net/projects/hgrddl/

5.1. Deployment Platform Hard-Wiring some Transforms. Most microfor- mats are published without a profile URI. As a In order to automatically run the hGRDDL trans- result, most microformat parsers look through the formations, we opted, in our prototype, to deploy DOM for the top-level class values expected for the hGRDDL processor as a client-side JavaScript each microformat. Fortunately, this same approach user script. User scripts are typically run automat- can be used in hGRDDL: look for the top-level mi- ically by compatible browsers on each page load, croformat class, and use it as a flag of microformat with an API that allows for more flexibility, e.g. presence. When a specific microformat is detected, cross-domain requests, than a normal JavaScript hGRDDL can simply assume the presence of a pro- program. The specific API that user script platforms file it already knows about for that microformat, implement was initially defined by Firefox’s Grease- exactly like existing microformat parsers which Monkey add-on [11]. We wrote our hGRDDL pro- have hard-wired parsers for each vocabulary. cessor to be as cross-browser-compatible as possi- ble: we tested on Firefox with GreaseMonkey, Safari 4.5. Implementation Language with CreamMonkey [14], and Opera with its built-in user-script support [20]. We do not yet have an In- Given that we want a client-side, browser- ternet Explorer implementation, as it currently does based, cross-platform implementation, two major not provide a complete user-script platform. Firefox routes are immediately worth considering: XSLT 2, Safari 3, and Opera 8 proved successful deploy- or JavaScript+DOM. The XSLT route makes ment ground for hGRDDL with very little effort. We hGRDDL particularly similar to GRDDL. GRDDL hope that IE and add-ons like GreaseMonkey for IE already defines ways to handle malformed XHTML [12] will enables a fully cross-platform approach in and non-XML-based HTML by tidying the input the near future. There is no deep reason why this [21], which hGRDDL can mimick. However, this could not be made compatible with IE. tidying brings problems of its own: it becomes par- ticularly difficult to ensure that the output HTML 5.2. Components renders exactly like the input HTML. Because we expect an hGRDDL-aware user agent to perform Profiles and Transforms. We built hGRDDL pro- the hGRDDL processing before rendering the re- files for two microformats, hCal and hCard. We sult to the end-user, the XSLT approach is likely implemented only the basic features of hCard. Each unrealistic. profile references its own hGRDDL JavaScript The second approach, JavaScript with DOM transformation built to run against an existing access, is far more promising. JavaScript can be DOM in a browser-independent way. The lion’s tweaked to function seamlessly across browsers, share of the code is spent building a cross-platform while the DOM can handle less than well formed way to read and write DOM attributes, in par- XHTML. The JavaScript-DOM approach enables ticular RDFa attributes that are not yet officially us to change the HTML DOM “in place,” tweaking supported by browsers. The transforms themselves only the attributes we want. The DOM disam- are fairly simple, as each microformat field gener- biguations performed by a browser when processing ally maps to a single RDFa field in the appropriate malformed input are fully preserved: the hGRDDL vocabulary. JavaScript processor is only tweaking attributes in an existing DOM tree. Thus, we propose an archi- The hGRDDL Processor. We wrote the hGRDDL tecture where the JavaScript transformations are processor as a user-script JavaScript program.

6 Beyond the GRDDL code needed for processing more robust implementation of hGRDDL should profile documents, our hGRDDL processor re- ensure that the window.onload event isn’t fired quired fewer than 25 lines of JavaScript. At a high until all transformations have completed. level, the code looks for a profile attribute in the HTML document’s , fetches it, looks for the generic hGRDDL profile, and then for an anchor with rel=hProfileTransformation. (We effec- tively hard-wired the GRDDL transformation for Dynamic Pages. Performing hGRDDL transfor- the hGRDDL profile.) If an hGRDDL transforma- mations on the fly when pages dynamically update tion is found, a