An Extensible Framework for Efficient Document Management Using RDF and OWL

An Extensible Framework for Efficient Document Management Using RDF and OWL Erica Meena Ashwani Kumar Laurent Romary Laboratory LORIA M.I.T Laboratory LORIA Vandoeuvre-les-Nancy Cambridge, MA Vandoeuvre-les-Nancy France USA France [email protected] [email protected] [email protected] bling technologies such as XML, RDF and OWL. Most Abstract of the existing document management systems ([1], [2]) limit themselves in the scope of application or document In this paper, we describe an integrated approach to- formats or simply neglect any structure-based analysis. wards dealing with various semantic and structural is- However, considering our requirements, it is obvious sues associated with document management. We that only a multi-layered functional architecture can provide motivations for using XML, RDF and OWL in cover various issues related to distributed document building a seamless architecture to serve not only as a management such as localized vs global structural con- document exchange service but also to enable higher straints, conceptual definition of documents, reasoning- level services such as annotations, metadata access and based discovery etc. querying. The key idea is to manifest differential treat- ments for the actual document structure, semantic con- Indeed, evolving technologies such as XML (eXtensible tent of the document and ontological document Markup Language), RDF (Resource Description organization. The deployment of this architecture in the Framework) and OWL (Web Ontology Language) pro- PROTEUS project1 provides an industrial setting for vide us with rich set of application frameworks that if evaluation and further specification. applied intelligently, can help a great deal in solving these problems. XML ([3]) is primarily designed for 1. Introduction low-level structural descriptions. It provides a tree of structured nodes, which can be efficiently used to de- Digital documents are ubiquitously used to encode, pre- scribe documents and check their models using DTDs serve as well as exchange useful information in order to (Document Type Definitions) or XML Schemas. Be- accomplish information sharing across the community. sides, XML enables easy human readability as well as As the growth in volumes of digital data is exponential, efficient machine interpretability. However, there are it is necessary to adopt a principled way of managing issues if we only deal with the structural aspect. If one these documents. Besides, due to the distributed nature wants to pick some semantic information from a docu- of information, it is also imperative to take into account ment, there is no straightforward way other than to con- the geographical and enterprise-level barriers for uni- strain it by an schema or make an application hand- form data access and retrieval. programmed to recognize certain document-specific semantics. Furthermore, if the schema changes over The ITEA (Information Technology for European Ad- time, it could typically introduce new intermediate ele- vancement) project, Proteus2 has similar objectives. ments. This might have the consequences of invalidat- Proteus is a collaborative initiative of French, German ing certain queries and creating incoherencies in the and Belgium companies, universities and research in- semantic data-model of the document. stitutes aimed at developing a European generic soft- ware platform usable for implementation of web-based RDF (Resource Description Framework) and OWL e-maintenance centers. It implements a generic archi- (Web Ontology Language) build upon the XML syntax tecture for integrated document management using ena- to describe the actual semantics of a document and provide useful reasoning and inference mechanisms. RDF ([4]) specifies graphs of nodes, which are connected by 1 This material is based upon work supported by the ITEA (Information Tech- directed arcs representing relational predicates such as nology for European Advancement) programme under Grant 01011 (2002). 2 http://www.proteus-iteaproject.com/ URIs (Uniform Resource Identifiers) and encode the conceptual model of the real world. Unlike XML, an ers though distinct at the level of data flow and RDF schema is a simple vocabulary language. The individual processing of information, afford function- parse of the semantic graph results in a set of triples, alities that are exploited by the e-Doc server. which mimic predicate-argument conceptual structures. OWL can be used on top of these semantic structures to Figure 2.2 shows various such layers of the e-Doc do logical reasoning and discover relations that are not server. On the foundation level, it is assumed that every explicit and obvious. document on the e-Doc server adheres to a single syntax i.e. XML, which represents the top most layer in the In the following sections we discuss how we use these architecture. The second layer depicts the access points technologies to enable a generic document management that are broadly categorized along various dimensions system. Firstly, in Section 2 we describe the document such as metadata, conceptual/ontology system and ter- management and the Proteus architecture followed by minology. A detailed description of the access points discussion on Annotations in Section 3. Section 4 pro- will be carried out in the Section 4. The e-Doc server is vides brief account of the model theoretic access assumed to be flexible enough to handle all possible mechanisms enabled by OWL followed by description ontology formats/standards whether it is a native XML of data categories in Section 5. document or a text or a picture/video data coming from some streaming applications. This forms the third im- 2. Document Management Architecture portant layer of the e-Doc server. The bottom layer represents Annotations [6], which adheres to the RDF [4] Without differentiating at the level of content, layout syntax. This layer forms an integral part of the e-Doc and formats, we treat documents as information re- server as it enables annotation capability and RDF- sources. These information resources can potentially be describable semantics to the actively retrieved document distributed across various document repositories called or existing documents in the server [7]. Besides, RDF e-Doc servers. Figure 2.1 demonstrates a simplified also provides the opportunity to utilize annotations as distributed document management system. The archi- access points for the documents. tecture shows how three different document repositories could co-exist functionally along with the Annotea en- Query abled annotation framework ([5]). These servers imple- Retrieve Document XML Syntax ment procedural mechanisms for query access and server View retrieval of documents. Besides, these documents can be Access Meta-data Conceptual Terminology points (1 doc) system User annotated and the annotations reside on an independent (n ontologies) Document Native XML Misc. textual Picture/ Application server known as the annotation server, which also s document format Videos Data (XML) RDF serves as a document server. Principally, annotations Annotations Comment can be viewed as information resources, which are de- scribed in RDF. Figure 2.2: General Organization of the e-Doc Server Annotea RDF As can be seen from the Figure 2.2, a user interacts with the server through a client interface by launching his queries. The architecture provides the user ample flexi- Server 1 Server 2 Server 3 bility in utilizing different levels of descriptions for re- TEI/XML Doc TEI/XML Doc TEI/XML Doc trieving documents by providing variety of access points. In the following sections, we describe each of Header Header Header these access layers in more detail. Body Body Body 3. Annotations: Specified as RDF Model Annotations form the most abstract layer within the e- Doc architecture. They can be broadly defined as com- Figure 2.1: Simplified view of the distributed document ments, notes, explanations, or other types of external server architecture remarks that can be attached to either a document or a sub portion of a document. As annotations are consid- The e-Doc server consists of several functional layers ered external, it is possible to annotate a document as a that inter-communicate and holistically, serve the cu- whole or in part without actually editing its content or mulative purpose of document management. These lay- structure. Conceptually, annotations can be considered as metadata, as they give additional information about …> an existing piece of data. Annotations can have many <dc:creator>Ashwani</dc:creator> distinguishing properties, which can be broadly classi- <rdf:type fied as:- rdf:resource='http://www.w3.org/2000 /10/annotation-ns#Annotation'/> • Physical location:- An annotation can be stored <NS1:origin rdf:nodeID='A0'/> locally or on one or more annotation servers; <NS0:created>2004-05- • Scope:- An annotation can be associated with a 24T01:11Z</NS0:created> document as a whole or to a sub-portion of a <NS0:annotates document. rdf:resource='http://docB4.teiSpec.o • Annotation type:- Annotations can have vari- rg'/> <rdf:type ous functional types such as, “Comment”, rdf:resource='http://www.w3.org/2000 “Remark”, “Query” e.t.c.… /10/annotationType#Comment'/> <NS0:body rdf:resource='Please review this document.'/> Due to this abstract nature and multiplicity of functional <dc:title>review</dc:title> types, a formal treatment of annotations is often un- <dc:date>2004-05-24T01:11Z</dc:date> wieldy. Therefore, it is desired to have a semantically </rdf:Description> <rdf:Description driven structural representation for annotations, which rdf:nodeID='A1'> we describe below. …. </rdf:Description> Annotation Semantics </rdf:RDF> Figure 3.2: An abridged Annotation in RDF Annotations are stored in one or multiple annotation servers. These servers endorse exchange protocols as Xpointers are used to point to the Annotated portions specified by Annotea [5]. Essentially, the Annotation within the document, while Xlinks [11] are used to Server can be regarded as a general purpose RDF store, setup a link between the Document and it's annotation. with additional mechanisms for optimized queries and access. This RDF store is built on top of a general SQL store.

Load more