<<

An Extensible Framework for Efficient Document Management Using RDF and OWL

Erica Meena Ashwani Kumar Laurent Romary Laboratory LORIA M.I.T Laboratory LORIA Vandoeuvre-les-Nancy Cambridge, MA Vandoeuvre-les-Nancy France USA France [email protected] [email protected] [email protected]

bling technologies such as XML, RDF and OWL. Most Abstract of the existing document management systems ([1], [2]) limit themselves in the scope of application or document In this paper, we describe an integrated approach to- formats or simply neglect any structure-based analysis. wards dealing with various semantic and structural is- However, considering our requirements, it is obvious sues associated with document management. We that only a multi-layered functional architecture can provide motivations for using XML, RDF and OWL in cover various issues related to distributed document building a seamless architecture to serve not only as a management such as localized vs global structural con- document exchange service but also to enable higher straints, conceptual definition of documents, reasoning- level services such as annotations, metadata access and based discovery etc. querying. The key idea is to manifest differential treat- ments for the actual document structure, semantic con- Indeed, evolving technologies such as XML (eXtensible tent of the document and ontological document Markup Language), RDF (Resource Description organization. The deployment of this architecture in the Framework) and OWL (Web Ontology Language) pro- PROTEUS project1 provides an industrial setting for vide us with rich set of application frameworks that if evaluation and further specification. applied intelligently, can help a great deal in solving these problems. XML ([3]) is primarily designed for 1. Introduction low-level structural descriptions. It provides a tree of structured nodes, which can be efficiently used to de- Digital documents are ubiquitously used to encode, pre- scribe documents and check their models using DTDs serve as well as exchange useful information in order to (Document Type Definitions) or XML Schemas. Be- accomplish information sharing across the community. sides, XML enables easy human readability as well as As the growth in volumes of digital data is exponential, efficient machine interpretability. However, there are it is necessary to adopt a principled way of managing issues if we only deal with the structural aspect. If one these documents. Besides, due to the distributed nature wants to pick some semantic information from a docu- of information, it is also imperative to take into account ment, there is no straightforward way other than to con- the geographical and enterprise-level barriers for uni- strain it by an schema or make an application hand- form data access and retrieval. programmed to recognize certain document-specific semantics. Furthermore, if the schema changes over The ITEA (Information Technology for European Ad- time, it could typically introduce new intermediate ele- vancement) project, Proteus2 has similar objectives. ments. This might have the consequences of invalidat- Proteus is a collaborative initiative of French, German ing certain queries and creating incoherencies in the and Belgium companies, universities and research in- semantic data-model of the document. stitutes aimed at developing a European generic soft- ware platform usable for implementation of web-based RDF (Resource Description Framework) and OWL e-maintenance centers. It implements a generic archi- (Web Ontology Language) build upon the XML syntax tecture for integrated document management using ena- to describe the actual semantics of a document and pro- vide useful reasoning and inference mechanisms. RDF ([4]) specifies graphs of nodes, which are connected by 1 This material is based upon work supported by the ITEA (Information Tech- directed arcs representing relational predicates such as nology for European Advancement) programme under Grant 01011 (2002). 2 http://www.proteus-iteaproject.com/ URIs (Uniform Resource Identifiers) and encode the conceptual model of the real world. Unlike XML, an ers though distinct at the level of data flow and RDF schema is a simple vocabulary language. The individual processing of information, afford function- parse of the semantic graph results in a set of triples, alities that are exploited by the e-Doc server. which mimic predicate-argument conceptual structures. OWL can be used on top of these semantic structures to Figure 2.2 shows various such layers of the e-Doc do logical reasoning and discover relations that are not server. On the foundation level, it is assumed that every explicit and obvious. document on the e-Doc server adheres to a single syntax i.e. XML, which represents the top most layer in the In the following sections we discuss how we use these architecture. The second layer depicts the access points technologies to enable a generic document management that are broadly categorized along various dimensions system. Firstly, in Section 2 we describe the document such as metadata, conceptual/ontology system and ter- management and the Proteus architecture followed by minology. A detailed description of the access points discussion on Annotations in Section 3. Section 4 pro- will be carried out in the Section 4. The e-Doc server is vides brief account of the model theoretic access assumed to be flexible enough to handle all possible mechanisms enabled by OWL followed by description ontology formats/standards whether it is a native XML of data categories in Section 5. document or a text or a picture/video data coming from some streaming applications. This forms the third im- 2. Document Management Architecture portant layer of the e-Doc server. The bottom layer rep- resents Annotations [6], which adheres to the RDF [4] Without differentiating at the level of content, layout syntax. This layer forms an integral part of the e-Doc and formats, we treat documents as information re- server as it enables annotation capability and RDF- sources. These information resources can potentially be describable semantics to the actively retrieved document distributed across various document repositories called or existing documents in the server [7]. Besides, RDF e-Doc servers. Figure 2.1 demonstrates a simplified also provides the opportunity to utilize annotations as distributed document management system. The archi- access points for the documents. tecture shows how three different document repositories could co-exist functionally along with the Annotea en- Query abled annotation framework ([5]). These servers imple- Retrieve Document XML Syntax ment procedural mechanisms for query access and server View retrieval of documents. Besides, these documents can be Access Meta-data Conceptual Terminology points (1 doc) system User annotated and the annotations reside on an independent (n ontologies) Document Native XML Misc. textual Picture/ Application server known as the annotation server, which also s document format Videos Data (XML) RDF serves as a document server. Principally, annotations Annotations Comment can be viewed as information resources, which are de- scribed in RDF. Figure 2.2: General Organization of the e-Doc Server

Annotea RDF As can be seen from the Figure 2.2, a user interacts with the server through a client interface by launching his queries. The architecture provides the user ample flexi- Server 1 Server 2 Server 3 bility in utilizing different levels of descriptions for re-

TEI/XML Doc TEI/XML Doc TEI/XML Doc trieving documents by providing variety of access points. In the following sections, we describe each of Header Header Header these access layers in more detail. Body Body Body 3. Annotations: Specified as RDF Model

Annotations form the most abstract layer within the e- Doc architecture. They can be broadly defined as com- Figure 2.1: Simplified view of the distributed document ments, notes, explanations, or other types of external server architecture remarks that can be attached to either a document or a sub portion of a document. As annotations are consid- The e-Doc server consists of several functional layers ered external, it is possible to annotate a document as a that inter-communicate and holistically, serve the cu- whole or in part without actually editing its content or mulative purpose of document management. These lay- structure. Conceptually, annotations can be considered as metadata, as they give additional information about …> an existing piece of data. Annotations can have many Ashwani distinguishing properties, which can be broadly classi- • Physical location:- An annotation can be stored locally or on one or more annotation servers; 2004-05- • Scope:- An annotation can be associated with a 24T01:11Z document as a whole or to a sub-portion of a Due to this abstract nature and multiplicity of functional review types, a formal treatment of annotations is often un- 2004-05-24T01:11Z wieldy. Therefore, it is desired to have a semantically we describe below. …. Annotation Semantics Figure 3.2: An abridged Annotation in RDF Annotations are stored in one or multiple annotation servers. These servers endorse exchange protocols as Xpointers are used to point to the Annotated portions specified by Annotea [5]. Essentially, the Annotation within the document, while Xlinks [11] are used to Server can be regarded as a general purpose RDF store, setup a link between the Document and it's annotation. with additional mechanisms for optimized queries and access. This RDF store is built on top of a general SQL store. Annotations are stored in a generic RDF database Annotation Operations accessible through an Apache HTTP server (see Figure The user makes a selection of the text to be annotated 3.1). All communication between a client and an anno- and provides the annotation along with other details tation server uses the standard HTTP methods such as such as author name, date of creation, type of annota- POST or GET. tion, URI of the annotated document etc. The annota- tions are published using standard HTTP POST method. Client http Annotation Server POST To do this the client generates an RDF description of the Query annotation that includes the metadata and the body and interface RDF sends it to the server. The annotation server receives the database data and assigns a URI to the annotation i.e. the body,

SQL store while metadata is identified by the URI of the Docu- ment. http GET For annotation retrieval, the client queries the annota- tion server via the HTTP GET method, requesting the Figure 3.1: Access to the Annotation server annotation metadata by means of the document's URI. The annotation server replies with an RDF-specified list Annotations have metadata associated with them, which of the annotation metadata. For each list of annotations is modeled according to an RDF schema and encode that the client receives, it parses the metadata of each information such as date of creation of the annotation, annotation, resolves the Xpointer of the annotation, and name of the author, the annotation type (e.g. comment, if successful, highlights the annotated text. If the user query, correction) the URI [8] of the annotated docu- clicks on the highlighted text, the browser uses an ment, and an Xpointer [9] that specifies what part of the HTTP GET method to fetch the body of the annotation document was annotated, and the URI to the body of the from the URI specified in the metadata. annotation which is assumed to be an XHTML [10] document (Figure 3.2). The following are the broad categories of the annotation [Terminological documents. OWL provides formal mechanisms for de- Entry] scribing ontology of documents. By doing so, the ar- Dispositif logical inference mechanisms, which are necessary permettant d'imprimer un déplacement linéaire ou angulaire à while performing metadata queries. un élément mobile. [Language Section] Access points play an important role by providing flexi- fr bility and intuitiveness in access mechanisms to the [Term Section] user. Figure 4.1 depicts a very basic characterization of vérin the access points. As it is illustrated in the figure, a spe- [Term Component cific access point is needed to direct a query to attain Section] certain desired result set. Within the Proteus framework, noun the e-doc architecture provides a model driven specifi- cation of access points such as metadata-based, onto- ….. logical, or terminological model. The model driven approach has strong significance in the sense that every access point is associated by certain abstract informa- Description of terminological model tion structure so that it provides transparency to the que- ries, which remain independent from actual A general terminological model contains a Termino- implementation and data formats (e.g. XML DTD). logical entry section, a Language section and a Term Even though these models are independent, they are section. flexible enough to interact among themselves. For ex- /Entry identifier/ ample, results of queries on one model can act as a ref- /Subject field/ erence for another model. The references may be /Definition/ Terminological Entry /Explanation/ transformed into document excerpts by requests made /Example/ synchronously at the query stage or asynchronously /Note/ when the user wants to visualize the information. Language Section /Language/ /Note/ Query Result Data server1 Term Section set

/Term/ Access point Search primitives /Term Data serveri status/ ⇒ Figure 4.2 Simplified terminological model ⇐ Ontological references Figure 4.2 describes a simplified terminology model - or Data servern Data Source data excerpts the terminological section contains entries such as iden- tifier, subject field, definition, and explanations etc., Figure 4.1 Characterization of Access Points where as the other sections such as the language and the term sections contain details regarding the language used and the term status respectively. This can also be Conceptual system Terminological system seen within the sample Proteus terminology described above. Turbine Terminological access is significant in cases where the user is aware of the specific term and needs to make a Links (hasDoc) search within the related domain to access certain Suggestions documents of his interest. For example, an operator of a (concepts!) firm might be willing to retrieve all the maintenance documents related to the term “Pump”. Thanks to the terminological access point, the operator needs nothing E-doc User but just the term to launch his query and retrieve the desired document. The above-mentioned scenario is Figure 4.4 interaction of terminological model with depicted in Figure 4.3 other models

Terminological system Meta-Data Access Pump Metadata can be loosely defined as “data about data”.. List of available maintenance docs Specifically, metadata encodes certain attributive infor- mation about the data, in our case documents, which can be used to access data. Within this platform the meta- model can be seen as a meta-tree of nodes in which Operator E-doc every node refers to certain precise set of information descriptors. For example, Dublin Core descriptors such Figure 4.3: A Sample Terminological Access as title, author, date, publisher, etc can potentially be represented as nodes in the description trees.

Terminological access plays a dual role. On one hand it Meta-model Description acts as a data source providing support for finding mono or multilingual equivalences or linguistic descriptions. This Meta-model is discussed keeping the specific On the other hand, it provides access for on-line docu- Dublin Core [12] model in mind. Meta model consists ments. When seen as a data source, it can also provide of three basic components, a Resource, an Element, and indexing support for manual indexing and can perform its value. semi-automated indexing: • Graphic files (drawings, pictures, video, • Resource – the object being described. scanned texts etc): manual indexing • Element – a characteristic or property of the • Text files: semi-automatic indexing; suggestion Resource. of descriptors to be confirmed by a human ex- pert • Value – the literal value corresponding to the Element. • Data, e.g. from monitoring: automatic indexing with metadata. Figure 4. 5 shows a simplified view of the Dublin core reference model, within which the Element Qualifiers Terminological model serves as a gateway to the Ontol- are nothing but additional attributes that further specify ogy-based Conceptual model of the domain (Figure the relationship of the element to the resource. On the 4.4). Use of a technical term as a query parameter re- other hand, the value qualifiers can be described as ad- lates to set of relevant concepts, which can further be ditional attributes that further specify the relationship of used to retrieve the desired set of documents. the value to the element. Metadata model can be seen as an enhanced search mechanism. A sequence of access points i.e. terminol- ogy followed by metadata, when launched can help in refining the search along an attribute dimension. Ontology Access

Ontology is a hierarchy of concepts or in other way a platform for describing the concepts used within a spe- cific domain, maintenance in our case. Its independence with regard to specific model or format makes it inter- Figure 4.5: Simplified view of the Dublin Core refer- operable. For example, one can have an ontology repre- ence model. sented in a UML [13] class diagram whereas the same ontology can be represented in an XML schema. As For Example: already discussed, Ontology is complementary to termi- nology in terms of attribution of concepts to terms. Element = Creator Conceptually, it serves as an abstract structure, which Component = Firstname Value = Ashwani can be populated by the interested parties and thus, can Component = Lastname Value = Kumar serve as a very important access point. An abridged Component = Email Value = [email protected] Element = Contributor example of an abstract Proteus OWL [14] ontology ver- Value = fn:Erica Meena; org:DSTC sion can be seen in the figure 4.7 Type = Illustrator Encoding = vCard Resource = http://www.loria.fr/projets/proteus/RDU/NOTE- directly retrieve a well defined piece of information, The Engineering compo- under the condition that he knows a small number of nent of the PIR “facts” about the information: e.g. the authors name, the date, the reference number or the date of a previous maintenance. This corresponds to a typical situation within the Proteus framework (see Figure 4.6). Meta- data access, in other way, can be seen as an advanced index functionality, which can update itself and grow automatically in the same form as the amount of stored information grows. For example: While sorting documents by date or type, the date, time, source or author information can always be automati- …… cally collected. However, in case of a new maintenance document, advanced metadata can be collected by ask- ing a human to enter it into the system. Figure 4.7: An example of Proteus ontology Ontology model description Dublin Core 1 Title: name of the client 2 Creator: name of the method agent 3 Subject: Equipment ID As per the requirements of the Proteus project, an ontol- 4 Description: type of equipment 5 Publisher: CEF CIGMA division 6 Contributor: division out of CIGMA ogy model comprises of a three-tiered structure. The 7 Date: date of the draw-up Type: procedure, FMECA, ... 8 Format: .doc, .ppt, . xls (not useful) three layers consist of General concepts (General 9 Identifier: name of the site (location) 1 List of available maintenance docs Source: former version Maintenance Ontology), Application Profiles, and the 01 Language: French, English, German 1 Relation: related FMECA, procedure, 21 video, pictures, ... industrial contexts respectively. These layers are built 3 Coverage: equipment location 1 Rights: public, confidential up keeping in mind the interoperability with other ex- 41 5 Operator ternal applications. As can be seen from the Figure 4.8

E-doc below, the general concept layer has the highest interoperability as it contains basic level concepts such Figure 4.6: Document access via metadata as Actors, Documents, Location, Equipments etc. The second layer (Application Profiles) consists of concepts, guage and by visual elements (hierarchy structured sets which are specific to a certain application, for instance of pictures). In a way we can see ontology as an empty pertinent to a train manufacturing company, or an avia- structure with user-defined class relationships, which tion company. All the layers are bound to inherit con- can be filled with visual elements (photos, drawing, cepts, but not necessarily all from the first layer (general scheme) and then the referring terms. concept), which in turn forms the parent layer of all other layers. The third layer (Industrial contexts) con- For example Figure 4.9 depicts visual search of docu- tains concepts very specific to an industry for instance, ments via ontology. In order to avoid complexity, only car manufacturing companies such as Ford, GM etc. recommended terms are used to name the objects repre- Instances can be derived only from the last layer i.e. the sented by the visual elements. Other terms can be left Industrial contexts layer. apart pointing to plain concepts (without visual con- cepts). The index of the metadata tool could be virtually The model is open for external sources i.e. ontology integrated into the index administrated by the terminol- from external sources can be merged within each layer, ogy tool. This enables a two-step-search, beginning with for example, SUMO [15], which is a higher-level ontol- a word and then finding the actually searched item not ogy. It contains very general concepts, which can be by selecting a more specific term from the terminology used directly within our ontology. tool, but by looking for a picture of the searched item in the functional concept. This index could also be virtu- External High interoperability ally integrated into the index of the functional concept. sources Example Thus the user could situate the search results provided General concepts Actors, Location, Documents, (GMO) Equipment by the metadata tool within the functional structure of the maintained equipment (instead of getting designa- Car manufacturing company Application profiles tion, ID-Number, description and meta data only).

FORD Industrial contexts

Conceptual system Low interoperability Ford Tech Instances Report

Figure 4.8: Proteus Ontology model portal

List of available maintenance docs, parameters setting, OWL-DL is used for specifying the ontological model user manual as it provides the following advantages: Maintenance doc (including C needed tools), e Parameters • Basic support for describing classification hier- setting, Operator E-Doc User manualdedicated of to one single site archies and simple constraint features. e.g. mi- this gration path for thesauri and other taxonomies; Figure 4.9: Visual search of documents via ontology • Rich expressiveness; • Computational completeness and decidability; • Allows imports of OWL Lite simple descrip- 5. Data Category Specification tion; • Allows consistency checks across description The various models (terminology, annotations, etc.) and levels; functionalities (access primitives to an e-doc server) • Existence of optimized inference platforms. have to be defined in such a way that a similar piece of E.g. Racer [16]. information (e.g. author, subject field, term, etc.) means the same thing from one place to another. Such a se- In a way, ontology access is a complementary approach mantic definition of data categories (in the terminology to the terminology access, as terminology structure de- of ISO committee TC 37) acts in complementary to an scribes the global concept behind a thematic domain, ontology such as the one we define in the Proteus sys- but does not deliver a functional description of the do- tem since it is intended to be a general purpose layer of main. The ontology access exactly provides this func- descriptors that may be used in other environments than tional description (as is usually needed in the that of a specific project. Therefore, we adopted a simi- maintenance domain). The concept remains global when lar methodology as that of the efforts within the ISO TC referring to a generic class of entities and gets specific 37 committee to deploy a data category registry of all when describing a particular entity type. Apart from the descriptors used in the project as reference semantic normal functionality of this access point, it can be very units described in accordance to ISO standard 11179 important when combined with retrieval by natural lan- (metadata registries). Such a registry plays a double [8]T. Berners-Lee, R. Fielding, and L. Masinter, Uni- role: form Resource Identifiers (URI): Generic Syntax, • It provides unique entry point (of formal public IETF Draft Standard August 1998 (RFC 2396). identifier) for any model that refers to it; [9]XML Pointer Language. http://www.w3.orgwtr/xptr/ • It gives a precise description of the data cate- gory by means of a definition and associated [10]The Extensible HyperText Markup Language. documentation (examples, application notes, http://www.w3.org/TR/xhtml1/ etc.). [11]XML Linking Language. 6. Conclusions http://www.w3.org/TR/xlink/ [12]Dublin Core Metadata Initiative. OCLC, Dublin We have provided a brief account of how document Ohio. http://purl.org/dc/ . structure and inherent semantics can be captured and processed efficiently by the emerging technologies such [13]Unified Modeling Language Home Page. as XML, RDF and OWL. By doing so, we have brought http://uml.org/ . innovations in correlating different levels of document [14]Deborah L. McGuinness and Frank van Harmelen, management with respect to various services afforded OWL Web Ontology Language Overview, W3C by these technologies. The differential treatment of Proposed Recommendation, 15 December 2003. structure, content and organization provides ample http://www.w3.org/TR/owl-features/. flexibility and extensibility, which are the primary re- quirements for such a system. [15]Niles, I., and Pease, A. 2001. Towards a Standard Upper Ontology. In Proceedings of the 2nd Interna- tional Conference on Formal Ontology in Informa- References tion Systems (FOIS-2001), Chris Welty and Barry [1]Lagoze C, Dienst - An Architecture for Distributed Smith, eds, Ogunquit, Maine, October 17-19, 2001. Document Libraries, Communications of the ACM, [16]V.Haarslev and R. Moller. Description of the Vol. 38, No 4, April 1995. 12 RACER system and its applications. In DL2001 [2]Satoshi Wakayama, Yasuki Ito, Toshihiko Fukuda Workshop on Description Logics, Stanford, CA, and Kanji Kato, Distributed Object-Based Applica- 2001. tions for Document Management, Hitachi Review Vol. 47 (1998), No.6 [3]Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0., eds.W3C Recommendation 10-February-1998. [4]Swick Lassila, Resource Description Framework (RDF) Model and Syntax Specification., Consortium Recommendation, 1999. http://www.w3.org/TR/REC-rdf-syntax/. [5]José Kahan, Marja-Riitta Koivunen, Eric Prud'Hom- meaux, and Ralph R. Swick, Annotea: An Open RDF Infrastructure for shared Web Annotations, in Proc. of the WWW10 International Conference, Hong Kong, May 2001. [6]The W3C Collaborative Project ... or how to have fun while building an RDF infra- structure. http://www.w3.org/2000/Talks/www9- annotations/Overview.html. [7]N. F. Noy, M. Sintek, S. Decker, M. Crubezy, R. W. Fergerson, & M. A. Musen. Creating Semantic Web Contents with Protege-2000. IEEE Intelligent Sys- tems 16(2):60-71, 2001.