Using RDF with XML to Automatically Augment Metadata

Tina Jayroe

University of Denver

Professor Jessica Branco Colati

Metadata Architectures

June 3, 2009

Abstract

The World Wide Web Consortium’s Resource Description Framework (RDF) provides a methodology and an opportunity for organizations to have their metadata become more valuable, contextual, and uniform; this allows for more relevant information to be retrieved when accessing the Web. Since Extensible Markup Language (XML) is the most widely used format for sharing metadata, it is advantageous for organizations to use its traits and constraints in conjunction with RDF triples and graphs, which correspond to ontologies and namespaces.

However, writing RDF in XML is an expensive, labor-intensive process that is often too costly and technical for most organizations. This article provides examples of the various schemata which exploit the RDF/XML specification, and attempts to make a case for information organizations to consider implementing RDF in order to provide richer metadata for information seekers, and more uniformity for the entire Web community.

The Web will become a repository of knowledge not only a compendium of facts.

–Reed Hellman, A semantic approach adds meaning to the Web.

In the past, individuals, associations, and computers have organized data to be retrievable by reconfiguring systems and adopting certain Web protocols and languages. A major component of the current flexible and interactive Web environment is the widely accepted metalanguage XML. XML is a World Wide Web Consortium (W3C) standard format language that allows a user to represent resources using any other type of language (Klein, 2001, p. 26).

One downside to XML, however, is that its elements, subelements, and attributes do not define or reference the content enclosed within its structure (or tags). Therefore, many organizations have implemented Resource Description Framework (RDF) so that their metadata can reference ontologies for better context, uniformity, and precision.

“The Resource Description Framework is a . . . W3C recommendation designed to standardize the definition and use of metadata—descriptions of Web-based resources” (Decker, Melnik, Van Harmelen, Fensel, Klein, Broekstra, . . . Horrocks, 2000, p. 66).1 RDF is an application of XML; it allows for self-description, reification, and logic through the use of statements to which XML provides the notation (Berners-Lee, 1999, “: the pieces”; “Namespaces”). XML is extensible, readable by humans, and can be encoded in any proprietary language.

As previously mentioned, XML does not attribute any vocabulary or meaning to content; the metadata within its tags are essentially ambiguous (Klein, 2001, p. 26). Thus, often a DTD (Document Type Declaration/Definition) is used because it can specify a vocabulary and other syntax specifications to use during the processing of the XML document. DTDs (and the XML Schema) serve as translation agreements between parties regarding a document’s grammar (Decker et al., 2000; Klein, 2001, p. 26), and are a step toward increased description and interoperability, both goals of the Semantic Web.

By using what are referred to as RDF’s triples and graph, in which each statement consists of an entity, attribute, and value (also referred to as subject, predicate, and object), data in RDF have the ability to be independent of each other within the same syntax, or syntax-independent. The other and opposite advantage is that the meaning between the objects can be identified or defined in relation to a reference, and to each other.

Figure 1: An RDF triple states the relationship between subject, predicate, and object.
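The triple structure in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example.org identifiers, the sample values, and the helper name objects are hypothetical, while the Dublin Core and FOAF property URIs are the published ones.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str

DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

# A tiny graph is just a collection of triples (all example.org data is made up).
graph = [
    Triple("http://example.org/report", DC + "title", "Annual Report"),
    Triple("http://example.org/report", DC + "creator", "http://example.org/staff/jane"),
    Triple("http://example.org/staff/jane", FOAF + "name", "Jane Doe"),
]

def objects(graph, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]

# Follow two statements: report -> creator -> creator's name.
creator = objects(graph, "http://example.org/report", DC + "creator")[0]
creator_name = objects(graph, creator, FOAF + "name")
print(creator_name)
```

Because the object of one statement can be the subject of another, separate statements link up into a graph that can be followed from resource to resource.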

Decker et al. provide a comparison of the XML Schema vs. the RDF Schema2 mechanisms:

[I]n XML schema, if type T′ is derived from type T, then elements of the derived type T′ are not necessarily members of the original type T. In the subClassOf relationship in RDF schema, on the other hand, a member of a subclass is also a member of the original superclass. As a result, subClassOf can be used to model ontological subtyping, whereas XML schema’s type extension cannot (2000, p. 70).

This is an extremely important concept in the development of the future Web where ontologies are used to express the meaning between applications and systems (Baca, 2008, “Glossary”).
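The subClassOf behavior that Decker et al. describe can be sketched as a small inference routine. This is a minimal illustration, not an RDF Schema reasoner: the ex: class names and the helper functions are hypothetical, and the prefixed names stand in for full URIs.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical class hierarchy and one typed resource.
triples = {
    ("ex:Thesaurus", SUBCLASS_OF, "ex:ControlledVocabulary"),
    ("ex:ControlledVocabulary", SUBCLASS_OF, "ex:KnowledgeOrganizationSystem"),
    ("ex:AAT", RDF_TYPE, "ex:Thesaurus"),
}

def superclasses(cls, triples):
    """Return every class reachable from cls by following rdfs:subClassOf."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found

def types_of(resource, triples):
    """Return the stated types of a resource plus all inherited superclasses."""
    direct = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred

print(sorted(types_of("ex:AAT", triples)))
```

Only one type is stated for ex:AAT, yet the routine also reports the two superclasses, which is the membership guarantee that XML Schema type extension does not give.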

Ontologies define terms for computer expressions and concepts in a given domain and are technical components of the Semantic Web. While there are a lot of pieces needed to fulfill this vision (see Figure 2), the RDF component cannot be effectively discussed without at least a brief explanation of XML namespaces and Uniform Resource Identifiers (URIs).

URIs identify resources; resources are anything that can be identified on the Web. A URIref (absolute or relative) identifies a resource using a URI with an optional fragment identifier. XML namespaces (used together with qualified names, or QNames) declare the URIref in the code. What this means is, when the syntax that determines a namespace is constructed, the referenced object within the schema will correspond to the given resource (e.g., a certain vocabulary/ontology or a vCard3 URI) automatically. It will then retrieve any specific, absolute, or relative information about that resource (Breitman, Casanova & Truszkowski, 2007, pp. 59–61).
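How a namespace declaration turns a short qualified name into a full URIref can be sketched as follows. The three namespace URIs are the published RDF, Dublin Core, and FOAF ones; the function name expand_qname and the choice of prefixes are assumptions made for illustration.

```python
# Prefix-to-namespace declarations, as an XML document would make with xmlns:prefix="...".
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_qname(qname, namespaces=NAMESPACES):
    """Expand a prefix:local qualified name into a full URI reference."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in namespaces:
        raise ValueError(f"undeclared or missing prefix in {qname!r}")
    return namespaces[prefix] + local

print(expand_qname("foaf:name"))
print(expand_qname("dc:title"))
```

The prefix itself carries no meaning; two documents may use different prefixes and still refer to the same vocabulary term, because only the expanded URI is compared.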

Figure 2: The Semantic Web layers. Note: Created by Sebastian Faubel and released to the public domain.

In other words, RDF is a framework that takes advantage of XML’s encoding to identify and define objects on the Web and put them in context according to the appropriate domain. The RDF model is an advanced way of representing metadata for the benefit of reuse and automation.
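To make the relationship between the RDF model and its XML encoding concrete, the sketch below builds a small RDF/XML description with Python's standard xml.etree.ElementTree module. The resource URI, title, and creator values are hypothetical, and the function name describe is an assumption; the rdf and dc namespaces are the published ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to serialize these namespaces with their customary prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

def describe(resource_uri, title, creator):
    """Serialize two statements about one resource as RDF/XML text."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": resource_uri},  # the subject of both statements
    )
    ET.SubElement(description, f"{{{DC_NS}}}title").text = title      # predicate and object
    ET.SubElement(description, f"{{{DC_NS}}}creator").text = creator  # predicate and object
    return ET.tostring(root, encoding="unicode")

rdf_xml = describe("http://example.org/report", "Annual Report", "Jane Doe")
print(rdf_xml)
```

Each child element of rdf:Description is one triple: the rdf:about attribute supplies the subject, the element name the predicate, and the element text the object.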

The Good News

RDF is now being used in many areas and by many organizations. Classification systems have implemented applications that build on RDF. Simple Knowledge Organization System (SKOS) Core utilizes metadata contained in bibliographic records and social computing networks such as Friend of a Friend (FOAF), an ontology that contains more personal-type links, providing a richer source of data that can be related to other systems.

SKOS Core is a W3C format designed to be less costly and less complex than the more sophisticated Web Ontology Language (OWL) on which it is based. SKOS Core utilizes RDF and RDF Schema and is intended for creating relationships between controlled vocabularies:

Controlled vocabularies facilitate consistent documentation, but they do not guarantee flawless searching across multiple collections. . . . Controlled vocabularies preclude the user from using natural language terms and phrases of his or her choosing, and seeking and finding resources illustrative of conceptual ideas and relationships (Cantara, 2006, p. 111).

The benefit of using SKOS Core (besides its cost-effectiveness) is that resource discovery systems will become more effective and interoperable if semantic tools are implemented to search multiple vocabularies, thesauri, subject headings, etc.

Further, the retrieved data will have been analyzed using intelligent algorithms which are applied to the attributes of the terms contained within a system.

Linda Cantara, author of Encoding controlled vocabularies for the Semantic Web using SKOS Core, notes that terms used in classification systems to define subject headings are usually nouns; in actuality, however, those terms are often meant to be used as verbs, adjectives, or in some other context in natural language processing (2006, p. 112).

SKOS Core works extremely well with RDF because of its extensibility and serialization.4 By associating concepts using RDF description statements, it provides the user with more accurate results; this occurs by mapping class and property elements which are easily extendible (Miles, 2005, Slides 6, 13, 14; W3C, 2009, “SKOS Core”). Another advantage is that SKOS Core can be implemented as a “basic” or “advanced” application, depending on the level of expression needed for the given institution and the amount of time that can be dedicated to the transition and maintenance of such a complex system (W3C, 2009, ¶ 3–4).5
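The way SKOS statements can widen a search beyond the exact term a user types can be sketched as follows. The ex: concepts and their labels are invented for illustration, and expand_query is a hypothetical helper rather than part of any SKOS tool; the SKOS namespace and the prefLabel, altLabel, and broader properties are the published ones.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
LABEL_PROPERTIES = {SKOS + "prefLabel", SKOS + "altLabel"}

# A hypothetical two-concept vocabulary.
triples = [
    ("ex:painting", SKOS + "prefLabel", "Painting"),
    ("ex:painting", SKOS + "altLabel", "Paintings"),
    ("ex:watercolor", SKOS + "prefLabel", "Watercolor"),
    ("ex:watercolor", SKOS + "altLabel", "Aquarelle"),
    ("ex:watercolor", SKOS + "broader", "ex:painting"),
]

def concepts_for(term, triples):
    """Find concepts whose preferred or alternate label matches the term."""
    wanted = term.casefold()
    return {s for s, p, o in triples if p in LABEL_PROPERTIES and o.casefold() == wanted}

def expand_query(term, triples):
    """Return all labels of the matched concepts and of their direct narrower concepts."""
    matched = concepts_for(term, triples)
    narrower = {s for s, p, o in triples if p == SKOS + "broader" and o in matched}
    selected = matched | narrower
    return sorted(o for s, p, o in triples if s in selected and p in LABEL_PROPERTIES)

print(expand_query("paintings", triples))
```

A search on the alternate label "paintings" reaches the concept itself and, through the broader relationship, the narrower watercolor concept, so records indexed under either vocabulary entry become retrievable.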

FOAF is used to aggregate information via the Web about a person, their related persons, and their personal associations:

In addition to the FOAF vocabulary, one of the most interesting features of a FOAF file is that it can contain "see Also" pointers to other FOAF files. This provides a basis for automatic harvesting tools to traverse a Web of interlinked files, and learn about new people, documents, services, data . . . (Brickley & Miller, 2000–2007, “The Basic Idea”).

This information may be extracted from a social networking site or authority files, or may be imported using vCard elements. The FOAF project/initiative is built on the RDF data model to support semantic connections between identities (Harper & Tillett, 2007, p. 61); RDF allows vCard vocabularies to be imported into FOAF code (Breitman et al., 2007, p. 183). Thus, by using FOAF as a resource for RDF, it becomes possible to obtain more contextual information from a larger pool of decentralized data and vocabularies. For example, a vCard could be retrieved in relation to the MARC fields 245/246 which denote statement of responsibility, thereby providing much more information (richer metadata) about that resource.

Much like SKOS Core, FOAF uses object attributes such as classes and properties to be “discernable in the syntax for RDF” (Brickley & Miller, 2000–2007, “FOAF and RDF”).
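The harvesting idea described by Brickley and Miller, in which a tool follows seeAlso pointers from one FOAF file to the next, can be sketched without any network access. The three documents, their URLs, and the people in them are hypothetical stand-ins held in a dictionary; a real harvester would fetch and parse each file instead. The foaf:name and rdfs:seeAlso property URIs are the published ones.

```python
from collections import deque

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

# Stand-in for FOAF files on the Web: URL -> triples found in that file.
DOCUMENTS = {
    "http://example.org/alice.rdf": [
        ("ex:alice", FOAF_NAME, "Alice"),
        ("ex:alice", SEE_ALSO, "http://example.org/bob.rdf"),
    ],
    "http://example.org/bob.rdf": [
        ("ex:bob", FOAF_NAME, "Bob"),
        ("ex:bob", SEE_ALSO, "http://example.org/carol.rdf"),
        ("ex:bob", SEE_ALSO, "http://example.org/alice.rdf"),  # link back; must not loop
    ],
    "http://example.org/carol.rdf": [
        ("ex:carol", FOAF_NAME, "Carol"),
    ],
}

def harvest_names(start_url, documents):
    """Breadth-first traversal of seeAlso links, collecting every foaf:name seen."""
    seen = {start_url}
    queue = deque([start_url])
    names = []
    while queue:
        url = queue.popleft()
        for s, p, o in documents.get(url, []):
            if p == FOAF_NAME:
                names.append(o)
            elif p == SEE_ALSO and o not in seen:
                seen.add(o)
                queue.append(o)
    return names

print(harvest_names("http://example.org/alice.rdf", DOCUMENTS))
```

Starting from a single file, the traversal discovers people who are never mentioned in that file, which is how decentralized FOAF data grows into a larger pool.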

Another application developed by the W3C, the RDF iCalendar, is also being referenced much like FOAF. By accessing the iCalendar’s properties and components, information such as “[e]vents, places, names, and coordinates” furthers the definitions that the RDF and XML namespaces can access and allows for more specification in attributing venue information to social information (Connolly & Miller, 2005, “Events, places, names and coordinates”).

While FOAF and iCalendar are effective linking and exchange mechanisms for aggregating personal identification attributes, at present there is a risk of invasion of privacy, and there are valid security concerns. According to Harper and Tillett, authors of Library of Congress controlled vocabularies and their application to the Semantic Web, there are still many security issues that need to be worked out (2007, p. 61).

One area in which the application of RDF can help with sensitive information and this privacy issue is in the healthcare industry. Zhuan and Yuanzhen, authors of An approach for XML inference control based on RDF, believe that simply using XML neglects a potential security hole; sensitive information could be obtained by unauthorized persons based on inference (2006, p. 338).

After analyzing saved query sets, the authors found that it is easy to deduce, for example, what disease a person might have by parsing the information from a query made up of structured XML files. For example, if a patient is assigned a certain room/ward/hall or has a certain prescription of drugs, it is fairly easy to “infer” what ails him or her, revealing confidential information.
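The inference problem can be illustrated with two small XML documents that are each harmless alone. This sketch is not the authors' method; it only shows the kind of deduction they warn about. The element names, patient names, and ward data are entirely hypothetical.

```python
import xml.etree.ElementTree as ET

# Result of one query: which ward each patient is in (no diagnosis shown).
ADMISSIONS_XML = """
<admissions>
  <patient id="p1"><name>J. Smith</name><ward>W3</ward></patient>
  <patient id="p2"><name>A. Jones</name><ward>W7</ward></patient>
</admissions>
"""

# Result of another query: what each ward specializes in (no patients shown).
WARDS_XML = """
<wards>
  <ward id="W3"><specialty>oncology</specialty></ward>
  <ward id="W7"><specialty>cardiology</specialty></ward>
</wards>
"""

def infer_specialties(admissions_xml, wards_xml):
    """Join the two query results on ward id to guess each patient's condition area."""
    specialty_by_ward = {
        ward.get("id"): ward.findtext("specialty")
        for ward in ET.fromstring(wards_xml).iter("ward")
    }
    return {
        patient.findtext("name"): specialty_by_ward.get(patient.findtext("ward"))
        for patient in ET.fromstring(admissions_xml).iter("patient")
    }

print(infer_specialties(ADMISSIONS_XML, WARDS_XML))
```

Neither document states a diagnosis, yet combining them associates each named patient with a medical specialty, which is why access control on individual XML nodes is not sufficient by itself.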

The authors’ solution is to use RDF statements in conjunction with XML as an extra security mechanism. They claim this can be done by encapsulating XML access control nodes using the RDF data model. The RDF triple can help administrators enclose all the required security specifications within an RDF repository because every resource shares one subject node. They also suggest referencing object IDs6 instead of URIs for use with given algorithms, as well as using “RDF to redefine XML Key to conduct the combination of the user’s current query and his history file” (2006, pp. 339–341).

The Bad News

Figure 3: An example of the RDF/XML syntax referring to a resource named in the document (2004). Note: Copyright Simons et al., 2004/Emeld.org, 2001–2008. Re-printed with permission.

The downsides to RDF are: (a) it is more time-consuming to write than just constructing XML code; and (b) in-house technical expertise with the Semantic Web’s development is essential (MIT, 2003–2008, “RDFizers”; Keston, 2008, p. 4). However, institutions such as the Massachusetts Institute of Technology (MIT) and Hewlett-Packard Development Company (HP) have been developing tools that address these issues.

MIT’s Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE) project is primarily using RDF as a way to make metadata and schemata more interoperable with regard to digital assets (MIT, 2003–2008, “The SIMILE Project”). This group7 makes available a list of “RDFizers” on its wiki site. Because RDF would need to be encoded from different formats, the SIMILE project recommends specific conversion tools for use on particular file types. For example, the group provides a resource library for the Flickr API (application programming interface) called Flickcurl. This product contains a program called flickrdf which takes the image’s metadata description fields and tags from a Flickr file and converts them into RDF triples (Beckett, 2007–2009, ¶ 1).
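What an “RDFizer” does in principle can be sketched as a conversion from flat description fields to triples, written out one statement per line in the N-Triples style. This is not Flickcurl's API or the output of flickrdf; the function names, the photo record, its URL, and the choice of Dublin Core properties are all assumptions made for illustration.

```python
DC = "http://purl.org/dc/elements/1.1/"

def literal(value):
    """Quote a string as a plain literal, escaping backslashes, quotes, and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def rdfize_photo(photo):
    """Turn a dictionary of description fields and tags into triple lines."""
    subject = f"<{photo['url']}>"
    lines = [
        f"{subject} <{DC}title> {literal(photo['title'])} .",
        f"{subject} <{DC}description> {literal(photo['description'])} .",
    ]
    # Each tag becomes its own statement about the same subject.
    lines += [f"{subject} <{DC}subject> {literal(tag)} ." for tag in photo["tags"]]
    return lines

# Hypothetical photo metadata.
photo = {
    "url": "http://example.org/photos/42",
    "title": "Denver skyline",
    "description": "View from the library roof",
    "tags": ["denver", "skyline"],
}

for line in rdfize_photo(photo):
    print(line)
```

Once the fields are expressed as triples against a shared vocabulary, they can be merged with triples produced from entirely different file types.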

While some of these “RDFizer” and RDF-conversion tools may be costly in the form of high technical salaries, the tools themselves are freely available as open source applications. And in the long run, having the ability to share resources may well be worth the added cost. Tim Berners-Lee, inventor of the World Wide Web and founder of the W3C, says, “There is a huge amount to be gained from having a document be self-describing in the Web” (Berners-Lee, 1999, “Namespaces”).

Hewlett-Packard is also focused on the Semantic Web as a strategic function for knowledge discovery. The company is involved in its development on many levels, from policy making at the W3C, to creating better metadata for interoperability, to creating IT tools and technologies—such as its RDF visualization index (HP, 2009, “HP Labs”).

Visualization indices are for technical data specialists who use open source software to create visual nodes of RDF datasets. This helps them in several ways: (a) it provides a means for visual analysis through clustering of the resource data; (b) it presents them with a mental model of the RDF data mappings and illustrates a general orientation of dataset properties; and (c) it helps to uncover some errors that might have occurred during data entry (MIT, 2004–2008, ¶ 4).

Both MIT and HP assert that it is crucial for open standards to be developed simultaneously with the Semantic Web in order to guarantee a level playing field in terms of having reliable data sources. HP claims this is one of the most difficult challenges of the Semantic Web (HP, 2009, “Introduction”).

Is RDF Worth It?

For many years now the Web has been a network of sites containing metadata catalogued by specific institutions, where descriptive information is added to an object for use in either local or collaboratively agreed-upon domains. These metadata, however, are more valuable when they can be shared in a global environment. By implementing certain standards such as formatting languages, protocols, and sophisticated algorithms, the metadata are poised to become extremely rich because now knowledge and logic may be applied to, and derived from, an object’s dataset.

At present, RDF is the language that enables application of semantic qualities to the metadata contained in a digital object. It works by relating data to an ontological resource. Yet, there are many ontologies and therefore a need for frameworks, like RDF, and schemas, like RDFS, to point to them:

Ideally, we would like a universal shared knowledge-representation language to support the Semantic Web, but for a variety of pragmatic and technological reasons, this is unachievable in practice. Instead, we will have to live with a multitude of metadata representations. RDF contains as much knowledge-representation technology as can be shared between widely varying metadata languages. Furthermore, the RDF schema language is powerful enough to define richer languages on top of RDF’s limited primitives (Decker et al., 2000, p. 69).

While RDF can be expressed using other serializations such as Notation 3, making it easier to write and read, it works best with XML. In both cases the use of namespaces results in the ability to retrieve information that has some context as opposed to receiving information without any context at all. XML’s structure is flexible enough to allow for multiple namespaces to be declared within one record (Zeng & Qin, 2008, p. 202); however, either combination employs a greater set of descriptive resources, guaranteeing the user some meaning, not just a document (Berners-Lee, 1998, ¶ 1).

The Library of Congress has already seen the potential of shareable metadata by converting MARC records into the MARCXML format. Its next step is to convert those records into RDF, “which is no trivial task” (Harper & Tillett, 2007, p. 53). This issue raises the eternal metadata/cataloguing question: how much granularity? Of course, it is really up to the given institution to determine how much time and money should be allotted for creating richer metadata; however, the good news about RDF seems really good and the bad news is really not so bad.
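A single step of the MARCXML-to-RDF conversion can be sketched to show why granularity becomes the central question. This is a minimal illustration, not the Library of Congress's procedure: the record, its identifier, the base URI, and the decision to map field 245 subfield a to the Dublin Core title property are all assumptions. The MARCXML namespace is the published one.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

# A hypothetical, heavily trimmed MARCXML record.
MARCXML = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">by an example author.</subfield>
  </datafield>
</record>
"""

def title_triple(marcxml, base_uri):
    """Map control number and 245 subfield a to one (subject, predicate, object) triple."""
    record = ET.fromstring(marcxml)
    record_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
    title = record.findtext(
        "marc:datafield[@tag='245']/marc:subfield[@code='a']", namespaces=MARC_NS
    )
    # Strip the trailing ISBD punctuation that MARC stores inside the field.
    return (base_uri + record_id, DC_TITLE, title.rstrip(" /"))

print(title_triple(MARCXML, "http://example.org/record/"))
```

Every further field and subfield needs a comparable mapping decision, and each decision trades cataloguing detail against effort, which is the granularity question in miniature.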

Since most schemas nowadays are either created in XML syntax or are at least compliant with it, it may be worth it for an organization such as the Library of Congress to start using RDF to describe its holdings. Anna Dubas, author of Getting to know SIMILE, states that RDF tools allow for automatic creation of access points contained in a work, which is a huge benefit for repositories (Dubas, 2009, p. 3). Thus, in effect, the framework actually prevents at least some of the taxing human effort involved in exploiting metadata.

In summary, information is being published faster than we humans can intelligently describe or define it. However, in this era of decentralized and open resources there are now more than enough technological tools and collaborative working groups available for developing, validating, and using RDF in order to automatically augment our metadata as well as their respective schemas. Yes, it may be costly to get this “general-markup-light-ontology-language” encoded in our institutions, yet when a user poses a query to an online system and retrieves irrelevant information . . . there is a different kind of price to pay.

References

Baca, M. (2008). Introduction to Metadata. Retrieved March 20, 2009 from http://www.getty.edu/research/conducting_research/standards/intrometadata/index..

Beckett, D. (2007–2009). Flickcurl: C library for the Flickr API. Retrieved April 21, 2009 from http://librdf.org/flickcurl/.

Berners-Lee, T. (1998). Why RDF model is different from the XML model. Retrieved April 14, 2009 from http://www.w3.org/DesignIssues/RDF-XML.html.

Berners-Lee, T. (1999). Web architecture from 50,000 feet. Retrieved April 14, 2009 from http://www.w3.org/DesignIssues/Architecture.html.

Breitman, K., Casanova, M. & Truszkowski, W. (2007). Semantic Web: Concepts, technologies and applications. London, UK: Springer-Verlag London, Ltd.

Brickley, D. & Miller, L. (2000–2007). FOAF Vocabulary Specification 0.91. Retrieved April 20, 2009 from http://xmlns.com/foaf/spec/#sec-glance.

Cantara, L. (2006). Encoding controlled vocabularies for the Semantic Web using SKOS Core. OCLC Systems & Services: International library perspectives, 22(2), 111–114. doi:10.1108/10650750610663996

Connolly, D. & Miller, L. (2005). RDF Calendar - an application of the Resource Description Framework to iCalendar data. Retrieved April 21, 2009 from http://www.w3.org/TR/rdfcal/.

Decker, S., Melnik, S., Van Harmelen, F., Fensel, D., Klein, M., Broekstra, J., . . . Horrocks, I. (2000). The Semantic Web: The roles of XML and RDF. IEEE Internet Computing, 4(5), 63–74. doi:10.1109/4236.877487

Dubas, A. (2009). Getting to know SIMILE. Retrieved April 7, 2009, from University of Denver’s Metadata Architecture course in Blackboard Learning System™.

Faubel, S. (2007). W3C Semantic Layers [Online Image]. Retrieved December 7, 2009 from Wikimedia Commons http://commons.wikimedia.org/wiki/File:W3c-semantic-web-layers.svg.

Harper, C. A. & Tillett, B. B. (2007). Library of Congress controlled vocabularies and their application to the Semantic Web. Cataloging & Classification Quarterly, 43(3), 47–68. doi:10.1300/J104v43n03_03

Hellman, R. (1999). A semantic approach adds meaning to the web. Computer, 32(12), 13–16. doi:10.1109/MC.1999.809245

Hewlett-Packard Development Company, L.P. (2009). HP Labs Semantic Web Research. Retrieved April 21, 2009 from http://www.hpl.hp.com/semweb/.

Hewlett-Packard Development Company, L.P. (2009). Introduction to semantic technologies. Proof, trust and security. Retrieved April 21, 2009 from http://www.hpl.hp.com/semweb/sw-technology2.htm.

Keston, G. (2008). Resource Description Framework. Retrieved April 10, 2009 from the University of Denver’s trial subscription to Faulkner Information Services.

Klein, M. (2001). XML, RDF, and relatives. IEEE Intelligent Systems, 16(2), 26–28. doi:10.1109/5254.920596

Massachusetts Institute of Technology. (2003–2008). RDFizers. Retrieved April 14, 2009 from http://simile.mit.edu/wiki/RDFizers.

Massachusetts Institute of Technology. (2004–2008). Welkin User Guide. Retrieved April 14, 2009 from http://simile.mit.edu/welkin/guide.html.

Miles, A. (2005). SKOS Core tutorial DC-2005 Madrid. Retrieved April 20, 2009, from http://isegserv.itd.rl.ac.uk/cvs-public/skos/press/dc2005/tutorial.ppt.

Simons, G. F., Fitzsimons, B., Langendoen, D. T., Lewis, W. D., Farrar, S. O., Lanham, A., . . . Gonzalez, H. (2004, July). A Model for Interoperability: XML Documents as an RDF Database. EMELD Workshop on , Detroit MI. Retrieved December 7, 2009 from http://emeld.org/workshop/2004/langendoen-paper.html.

W3C. (2009). SKOS Simple Knowledge Organization System Primer. Retrieved April 20, 2009, from http://www.w3.org/TR/skos-primer/.

Zeng, M. L. & Qin, J. (2008). Metadata. New York, NY: Neal-Schuman Publishers, Inc.

Zhuan, L. & Yuanzhen, W. (2006). An approach for XML inference control based on RDF. In J. Küng, R. Wagner & S. Bressan (Eds.), Lecture Notes in Computer Science, Database and Expert Systems Applications: DEXA 2006, LNCS 4080 (pp. 338–347). Berlin Heidelberg: Springer-Verlag. doi:10.1007/11827405_33

Footnotes

1 RDF “is particularly intended for representing metadata about Web resources, but it can also be used to represent information about objects that can be identified on the Web, even when they cannot be directly retrieved from the Web” (Breitman et al., 2007, p. 57).

2 XML is used to encode the RDF because of reasons already stated. RDF Schema is used to apply domain-specific properties and classes (Klein, 2001, p. 27).

3 vCard is “a standard . . . accepted by the International Mail Consortium and made a MIME standard by IETF. A vCard is a series of typed elements that describe the attributes that may be found on a business card, for example, name, position, business address, fax and phone numbers. A vCard may be expressed in RDF with the help of the namespace http://imc.org/vcard3/3.0#” (Breitman et al., 2007, pp. 181–182).

4 Tim Berners-Lee on serialization: “A document for a person is generally serialized so that, when read serially by a human being, the result will be to build up a graph of associations in that person’s head. The order is important” (1998, “Order in documents”).

5 Consequently, SKOS Core maps well to FOAF and OWL, which are also schemas that utilize RDF (Cantara, 2006, p. 112).

6 Object IDs are often used in healthcare industries as unambiguous identifiers. OIDs need to be registered with registration authorities that assign an identifier to an object according to standards.

7 SIMILE is actually a joint project of Massachusetts Institute of Technology (MIT) Libraries and MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). Retrieved December 6, 2009 from http://simile.mit.edu/wiki/SIMILE:About.