Automated Ontology Creation Using XML Schema Elements
Total Page:16
File Type:pdf, Size:1020Kb
Automated Ontology Creation using XML Schema Elements Samuel Suhas Singapogu, Paulo C. G. Costa, J. Mark Pullen C4I center, George Mason University Fairfax, VA, USA [ssingapo,pcosta,mpullen]@c4i.gmu.edu Abstract—Ontologies are commonly used to represent formal semantics in a computer system, usually capturing them in the Although ontologies are valuable assets in system model- form of concepts, relationships and axioms. Axioms convey ing, testing and analysis, the process of manually creating an asserted knowledge and support inferring new knowledge ontology for a complex system is inherently tedious and through logical reasoning. For complex systems, the process of error prone. Existing methods of knowledge discovery and creating ontologies manually can be tedious and error-prone. ontology creation usually are based on text mining of a data Many automated methods of knowledge discovery are based on mining domain text corpus, but current state-of-the-art meth- corpus for that domain. XML-based systems capture the ods using this approach fail to consider properly semantic data structure and syntax of all necessary and meaningful ele- embedded in XML schemata in complex systems. This paper ments in a XML schema, which therefore is a useful starting proposes a mapping method for identifying relevant semantic point for semantic analysis. In such systems, XML schemata data in XML schemata, automatically structuring and repre- have been shown to be a valuable resource of semantic data senting it in the form of a draft ontology. Concepts, concept [2]. Command and Control systems in the military context hierarchy and domain relationships from XML schema are support various functions, including commanding of forces mapped to relevant parts of an OWL ontology. A part-of- and also receiving and interpretation of situational aware- speech tagging method extracts domain relationships from ness reports. To perform these functions, most C2 systems schema annotations. This mapping method can be applied to are modeled using XML schemata e.g. Coalition Battle any system that has a well-annotated XML schema. We illus- Management System (C-BML) [3], Military Scenario Defi- trate our process with the preliminary results obtained when creating a command and control to simulation (C2SIM) draft nition Language (MSDL) [4], and National Information ontology from an XML schema. Exchange Model (NIEM) [5], which represent the systems’ structural and syntactic framework. These systems use and exchange XML documents that are based on XML schema- Keywords—OWL, XML Schema, Part of Speech tagging, Com- ta. mand and Control, Interoperability XML often is used as the exchange mechanism in the command and control domain [6]. Given the increasing I. INTRODUCTION number of XML-based systems in the C2 domain, an auto- mated framework that leverages semantic information in the XML schema and creates a draft ontology would be a useful Knowledge discovery is essentially the process of ex- tool. Domain experts can refine and populate the draft on- tracting semantic concepts and relationships from domain tology using concept hierarchy and basic domain relation- resources within a particular domain. Ontologies are the de ships. For large XML based systems, this process of creat- facto standard for representing knowledge of a system [1]. ing a draft ontology saves valuable time and avoids errors The framework and components of ontologies are based on common to the alternative tedious, manual process. In con- established, yet evolving W3C standards. Ontologies usual- trast to existing techniques used to map a XML schema to ly are comprised of concepts, relationships and axioms. an ontology, the mapping proposed in this paper highlights Concepts are abstractions of related attributes that form the the need to map schema element (xs:element from here basic building blocks of a semantic model. Relationships forward) to an ontological concept. This avoids the usual can be between two concepts or between a concept and a approach of mapping XML complex types to concepts that data-type. Axioms consist of asserted knowledge that can be could lead to unnecessary ontological complexity. In addi- represented as a <subject, predicate, object> tion, this paper proposes a novel Part of Speech tagging triple. Logic-based reasoners can be used to infer new knowledge in an ontology. STIDS 2015 Proceedings Page 10 method to extract domain relationships from well-annotated XML schema. Existing work ignores semantic information pertaining to domain relationships that are present in well-annotated XML schema annotations. By design, the purpose of XML II. RELATED WORK annotations is to capture description of elements, which are often described in relation to other elements. For instance, consider the XSD annotation for the element Most existing work on mapping a XML Schema to an on- <xs:EventStatus> of the C2 domain in Figure 1. tology (e.g. [7], [8] ) is based on mapping XML schema complexType (henceforth referred to as xs:complexType) to an owl:class. This mapping can lead to problems for the following reasons: 1. A “simpleType” definition is sufficient to define a se- mantic concept. In the command and control domain, for example, it is possible to define an element called “Unit- Name,” which is of a “simpleType” string (with string re- strictions). There is sufficient semantic information in Fig. 1. Illustrating the presence of domain relationships in XSD annotations this definition to create a distinct ontological concept. When this level of abstraction of concepts is ignored and This pattern of linking elements in annotations is com- only complex type definitions are considered as semantic mon in domains with well-annotated XML schemata. Our concepts, the resulting ontologies will contain significant approach leverages this pattern by employing a mapping modeling gaps for any useful analysis. from xs:element to owl:class and uses Part of 2. By design, XML parsing only allows elements associated Speech tagging of XSD annotations to extract domain rela- to a complex type to appear in valid XML files. XML tionships. schemata can contain complex type definitions that are never associated to an element definition. When concepts are mapped to complex types, the resulting ontology will III. SYSTEM DESIGN likely include concepts that will never appear in the XML document. This unnecessary complexity is counter- productive to efficient semantic modeling in design and The mapping process takes as input a sufficiently anno- analysis. tated XML schema, which we define as any XML schema that contains the following: Bohring et al. [9] recognize the value of semantic con- a) The schema provides annotations for most elements cepts being mapped to <xs:element> definition. How- using descriptive domain terminology ever, this mapping is done only for <xs:element> defi- b) Annotations referencing elements defined in the schema nitions that are not leaf nodes and have at least one attribute use a consistent naming convention. definition. The approach in [9] fails to consider valid se- mantic concepts that are simple literal definitions. In addi- As an example of the latter, if the schema defines an el- tion, ontologies have been designed so that datatype proper- ement as “ReporterWho” then any annotation referring to ties can be mapped to XML schema datatypes [10]. There- this element must do so in a consistent way, i.e., the refer- fore, the resulting mapping of simple xs:element to ence can be extracted by simple operations (e.g., removing owl:DatatypeProperty using that approach would be spaces, pruning special characters, etc.) inconsistent with standard practice of ontology design. Yang, Steele, and Lo [11] describe an ontology-based map- The system components and relationships between the ping between XML and ontology (bi-directional) that focus- components are illustrated in Figure 2 below. es on limiting loss of information in the bi-directional map- ping. STIDS 2015 Proceedings Page 11 Fig. 4. XML schema after pre-processing Fig. 2. Illustrating system components and relationships Pre-processing the XML schema: Our approach maps the After pre-processing, the following mappings are estab- ontological concept name to the name of the element in the lished while parsing through the XML schema: XML schema. XML schemata allow for multiple elements Mapping 1: 'xs:element’ to owl:class: Com- to have the same name. More often than not, the same name monly, each definition of an element contains a name and an is used (e.g. <xs:element name=”ID”/>),specially associated complex type. An owl:class is created with when defining identifiers and other common elements. the class name equal to the name of the element. Attribute Even though the same name may exist in different element definitions for the element are mapped to the definitions, as long as their context is different they are owl:datatype property. Cardinality of concepts is de- semantically different concepts. In order to disambiguate fined according to the xs:minOccurs and between elements with the same name, we propose a pre- xs:MaxOccurs in the XML schema as explained in [9]. processing step that concatenates the parent complex type Therefore, if an element “element1” is defined as having name to the name of the element using a delimiter. For ele- type “complexType1” and the definition of “com- ments that do not have a parent complex type, an iterating plexType1” includes an “element2” with max- place-holder name can be used (e.g. “Parenti”) Because occurs=unbounded, then the cardinality between the XML schema rules require that all xs:complexType concepts “element1” and “element2” is 1…∝ have unique names in the schema, this pre-processing step Mapping 2: Element hierarchy to concept hierarchy: ensures that element names (and therefore concept names in Nayak and Wina [12] have noted that XML schema contains the ontology) are unique in the XML schema. The pre- element definitions in a hierarchical structure. They employ processing step performs the disambiguation technique to structural information in XML schemata to define clusters all elements, and not only to the redundant elements.