
SEMAPRO 2011 : The Fifth International Conference on Advances in Semantic Processing

Semantic Annotation Semantically: Using a Shareable Extraction Ontology and a Reasoner

Jan Dědek and Peter Vojtáš
Department of Software Engineering, MFF, Charles University, Prague, Czech Republic
[email protected], [email protected]

Abstract—Information extraction (IE) and automated semantic annotation of text are usually done by complex tools. These tools use some kind of a model that represents the actual task and its solution. The model is usually represented as a set of extraction rules (e.g., regular expressions), gazetteer lists, or it is based on some statistical measurements and probability assertions. In the environment of the Semantic Web it is essential that information is shareable, and some ontology based IE tools keep the model in so called extraction ontologies. In practice, the extraction ontologies are usually strongly dependent on a particular extraction/annotation tool and cannot be used separately. In this paper, we present an extension of the idea of extraction ontologies. According to the presented concept the extraction ontologies should not be dependent on the particular extraction/annotation tool. In our solution the extraction/annotation process can be done separately by an ordinary reasoner. We also present a proof of concept for the idea: a case study with a linguistically based IE engine that exports its extraction rules to an extraction ontology, and we demonstrate how this extraction ontology can be applied to a document by a reasoner. The paper also contains an evaluation experiment with several OWL reasoners.

Keywords-Extraction Ontology; Reasoning; Information Extraction; Semantic Annotation;

I. INTRODUCTION

Information extraction (IE) and automated semantic annotation of text are usually done by complex tools, and all these tools use some kind of model that represents the actual task and its solution. The model is usually represented as a set of some kind of extraction rules (e.g., regular expressions), gazetteer lists, or it is based on some statistical measurements and probability assertions (classification algorithms like Support Vector Machines (SVM), Maximum Entropy Models, Decision Trees, Hidden Markov Models (HMM), Conditional Random Fields (CRF), etc.).

In the beginning, a model is either created by a human user or it is learned from a training dataset. Then, in the actual extraction/annotation process, the model is used as a configuration or as a parameter of the particular extraction/annotation tool. These models are usually stored in proprietary formats and they are accessible only by the corresponding tool.

In the environment of the Semantic Web it is essential that information is shareable, and some ontology based IE tools keep the model in so called extraction ontologies [1]. Extraction ontologies should serve as a wrapper for documents of a narrow domain of interest. When we apply an extraction ontology to a document, the ontology identifies objects and relationships present in the document, associates them with the corresponding ontology terms, and thus wraps the document so that it is understandable in terms of the ontology [1].

In practice the extraction ontologies are usually strongly dependent on a particular extraction/annotation tool and cannot be used separately. The strong dependency of an extraction ontology on the corresponding tool makes it very difficult to share. When an extraction ontology cannot be used outside the tool, there is also no need to keep the ontology in a standard ontology format (RDF or OWL). The only way to use such an extraction ontology is within the corresponding extraction tool, and it is not necessary to have the ontology in an "owl or rdf file". In a sense such an extraction ontology is just a configuration file. For example, in [2] (and also in [1]) the so called extraction ontologies are kept in XML files with a proprietary structure, and that is absolutely sufficient; there is no need to treat them differently.

A. Shareable Extraction Ontologies

In this paper, we present an extension of the idea of extraction ontologies. We adopt the point that extraction models are kept in extraction ontologies, and we add that the extraction ontologies should not be dependent on the particular extraction/annotation tool. In such a case the extraction/annotation process can be done separately by an ordinary reasoner.

In this paper, we present a proof of concept for the idea: a case study with our linguistically based IE engine and an experiment with several OWL reasoners. In the case study (see Section IV) the IE engine exports its extraction rules to the form of an extraction ontology. A third party linguistic tool linguistically annotates an input document and the linguistic annotations are translated to a so-called document ontology.


After that, an ordinary OWL reasoner is used to apply the extraction ontology on the document ontology, which has the same effect as a direct application of the extraction rules on the document. The process is depicted in Fig. 1 and it will be described in detail in Section IV-B.

Figure 1. Semantic annotation driven by an extraction ontology and a reasoner – schema of the process. (The diagram shows a document being turned into a document ontology by TectoMT and a GRDDL/XSLT transformation, the IE engine learning extraction rules from a corpus and exporting them to an extraction ontology, and a reasoner combining the two into an "annotated" document ontology.)

Section II presents several closely related works. The main idea of the paper will be described in Section III, its implementation in Section IV, and in Section V an experiment with several OWL reasoners and IE datasets will be presented. In Section VI related issues are discussed and Section VII concludes the paper.

II. RELATED WORK

Ontology-based Information Extraction (OBIE) [3] or Ontology-driven Information Extraction [4] has recently emerged as a subfield of information extraction. Furthermore, Web Information Extraction [5] is a closely related discipline. Many extraction and annotation tools can be found in the above mentioned surveys ([3], [5]); many of the tools also use an ontology as the output format, but almost all of them store their extraction models in proprietary formats and the models are accessible only by the corresponding tool.

In the literature we have found only two approaches that use extraction ontologies. The former was published by D. Embley [1], [6] and the latter, the IE system Ex, was developed by M. Labský [2]. But in both cases the extraction ontologies are dependent on the particular tool and they are kept in XML files with a proprietary structure.

By contrast, the authors of [3] (a recent survey of OBIE systems) do not agree with allowing extraction rules to be a part of an ontology. They use two arguments against that:

1) Extraction rules are known to contain errors (because they are never 100% accurate), and objections can be raised on their inclusion in ontologies in terms of formality and accuracy.
2) It is hard to argue that linguistic extraction rules should be considered a part of an ontology while information extractors based on other IE techniques (such as SVM, HMM, CRF, etc. classifiers used to identify instances of a class when classification is used as the IE technique) should be kept out of it: all IE techniques perform the same task with comparable effectiveness (generally successful but not 100% accurate). But the techniques advocated for the inclusion of linguistic rules in ontologies cannot accommodate such IE techniques.

The authors then conclude that either all information extractors (that use different IE techniques) should be included in ontologies or none should be included.

Concerning the first argument, we have to take into account that extraction ontologies are not ordinary ontologies; it should be agreed that they do not contain 100% accurate knowledge. Also, the estimated accuracy of the extraction rules can be saved in the extraction ontology, and it can then help potential users to decide how much they will trust the extraction ontology.

Concerning the second argument, we agree that in the case of complex classification based models (SVM, HMM, CRF, etc.) serialization of such a model to RDF does not make much sense (cf. the next section). But on the other hand we think that there are cases when shareable extraction ontologies can be useful, and in the context of Linked Data providing shareable descriptions of information extraction rules may be valuable. It is also possible that new standard ways to encode such models in an ontology will appear in the future.

III. SEMANTIC ANNOTATION SEMANTICALLY

The problem of extraction ontologies that are not shareable was pointed out in the introduction (Section I). The cause of the problem is that a particular extraction model can only be used and interpreted by the corresponding extraction tool. If an extraction ontology should be shareable, there has to be a commonly used tool that is able to interpret the extraction model encoded by the extraction ontology. In this paper we present a proof of concept that Semantic Web reasoners can play the role of commonly used tools that can interpret shareable extraction ontologies.

Although it is probably always possible to encode an extraction model using a standard ontology language, only a certain way of encoding makes it possible to interpret such a model by a standard reasoner in the same way as if the original extraction tool was used. The difference is in semantics. It is not sufficient to encode just the model's data; it is also necessary to encode the semantics of the model. Only then is the reasoner able to interpret the model in the same way as the original tool. And this is where the title of the paper and of the present section comes from. If the process of information extraction or semantic annotation should be performed by an ordinary Semantic Web reasoner, then only the means of semantic description are available and the extraction process must be correspondingly semantically described.
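A minimal sketch of what this can look like with the OWL API and an off-the-shelf reasoner follows; the file name, the pml#mention_root IRI, the assumption that the document ontology imports (or is merged with) the extraction ontology, and the choice of the Pellet reasoner factory are all illustrative assumptions, not details taken from the paper's implementation.

    import java.io.File;
    import com.clarkparsia.pellet.owlapiv3.PelletReasonerFactory;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class ApplyExtractionOntology {
        public static void main(String[] args) throws Exception {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            // The document ontology is assumed to import the extraction ontology,
            // so the reasoner sees both the document facts and the extraction rules.
            OWLOntology doc = manager.loadOntologyFromOntologyDocument(
                    new File("document-ontology.owl"));
            OWLReasoner reasoner = PelletReasonerFactory.getInstance().createReasoner(doc);

            // Every inferred assertion of the (illustrative) annotation property
            // is one new semantic annotation of the document.
            OWLDataProperty mentionRoot = manager.getOWLDataFactory()
                    .getOWLDataProperty(IRI.create("http://example.org/pml#mention_root"));
            for (OWLNamedIndividual node : doc.getIndividualsInSignature()) {
                for (OWLLiteral value : reasoner.getDataPropertyValues(node, mentionRoot)) {
                    System.out.println(node + " -> " + value.getLiteral());
                }
            }
        }
    }

The point of the sketch is only that nothing in this loop is specific to the tool that produced the rules; any OWL reasoner with rule support could be substituted for Pellet.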


In the presented solution the growing support for the Semantic Web Rule Language (SWRL) [7] is exploited. Although SWRL is not yet approved by W3C, it is already widely supported by Semantic Web tools, including many OWL reasoners. The SWRL support makes it much easier to transfer the semantics of the extraction rules used by our IE tool. The case study in Section IV demonstrates the translation of the native extraction rules to SWRL rules that form the core of the extraction ontology.
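As an illustration of what such a translation can produce, the sketch below builds one SWRL rule programmatically with the OWL API; the predicate names (lex.rf, t_lemma, mention_root) mimic the rules shown later in Fig. 3, but the namespace IRI and the modelling of the predicates as object/data properties are assumptions made only for this example.

    import java.io.FileOutputStream;
    import java.util.HashSet;
    import java.util.Set;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.io.RDFXMLOntologyFormat;
    import org.semanticweb.owlapi.model.*;

    public class BuildSwrlExtractionRule {
        public static void main(String[] args) throws Exception {
            OWLOntologyManager m = OWLManager.createOWLOntologyManager();
            OWLDataFactory df = m.getOWLDataFactory();
            String ns = "http://example.org/pml#";  // illustrative namespace
            OWLOntology extraction = m.createOntology(IRI.create("http://example.org/extraction"));

            OWLObjectProperty lexRf = df.getOWLObjectProperty(IRI.create(ns + "lex.rf"));
            OWLDataProperty tLemma = df.getOWLDataProperty(IRI.create(ns + "t_lemma"));
            OWLDataProperty mentionRoot = df.getOWLDataProperty(IRI.create(ns + "mention_root"));
            SWRLVariable a = df.getSWRLVariable(IRI.create(ns + "a"));
            SWRLVariable b = df.getSWRLVariable(IRI.create(ns + "b"));

            // body: lex.rf(?b,?a) AND t_lemma(?b,"Inc")
            Set<SWRLAtom> body = new HashSet<SWRLAtom>();
            body.add(df.getSWRLObjectPropertyAtom(lexRf, b, a));
            body.add(df.getSWRLDataPropertyAtom(tLemma, b,
                    df.getSWRLLiteralArgument(df.getOWLLiteral("Inc"))));

            // head: mention_root(?a,"acquired")
            Set<SWRLAtom> head = new HashSet<SWRLAtom>();
            head.add(df.getSWRLDataPropertyAtom(mentionRoot, a,
                    df.getSWRLLiteralArgument(df.getOWLLiteral("acquired"))));

            m.addAxiom(extraction, df.getSWRLRule(body, head));
            m.saveOntology(extraction, new RDFXMLOntologyFormat(),
                    new FileOutputStream("extraction.owl"));
        }
    }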

IV. THE MAIN IDEA ILLUSTRATED – A CASE STUDY

In this section, the realization of the main idea of the paper will be described and illustrated on a case study.

A. Document Ontologies

The main idea of this paper assumes that extraction ontologies will be shareable and that they can be applied on a document outside of the original extraction/annotation tool. We further assert that the extraction ontologies can be applied by ordinary reasoners. This assumption implies that both extraction ontologies and documents have to be in a reasoner readable format. In the case of contemporary OWL reasoners there are standard reasoner-readable languages: OWL and RDF in a rich variety of possible serializations (XML, Turtle, N-Triples, etc.). Besides that, there exist standard ways like GRDDL or RDFa to obtain an RDF document from an "ordinary document" (strictly speaking, XHTML and XML documents).

We call a 'document ontology' an ontology that formally captures the content of a document. A document ontology can be obtained, for example, from the source document by a suitable GRDDL transformation (as in our experiment). A document ontology should contain all relevant data of a document, and preferably the document could be reconstructed from the document ontology on demand.

When a reasoner is applying an extraction ontology to a document, it only has "to annotate" the corresponding document ontology, not the document itself. Here "to annotate" means to add new knowledge – new class membership or property assertions. In fact it means just to do the inference tasks prescribed by the extraction ontology on the document ontology.

Obviously, when a document can be reconstructed from its document ontology (this is very often true, it is necessary just to save all words and formatting instructions), then also an annotated document can be reconstructed from its annotated document ontology.
B. Implementation

In this section, we will present details about the case study. We have used our IE engine [8] based on deep linguistic parsing and Inductive Logic Programming (ILP). It is a complex system implemented with a great help of the GATE system (http://gate.ac.uk/) and it also uses many other third party tools, including several linguistic tools and a Prolog system. Installation and making the system operate is not simple. This case study should demonstrate that the extraction rules produced by the system are not dependent on the system in the sense described above.

1) Linguistic Analysis: Our IE engine needs a linguistic preprocessing (deep linguistic parsing) of documents on its input. Deep linguistic parsing brings a very complex structure to the text, and this structure serves as a footing for the construction and application of extraction rules.

We usually use the TectoMT system [9] to do the linguistic preprocessing. TectoMT is a Czech project that contains many linguistic analyzers for different languages including Czech and English. We are using a majority of applicable tools from TectoMT: a tokeniser, a sentence splitter, morphological analyzers (including a POS tagger), a syntactic parser and the deep syntactic (tectogrammatical) parser. All the tools are based on the dependency based linguistic theory and formalism of the Prague Dependency Treebank (PDT, http://ufal.mff.cuni.cz/pdt2.0/).

The output linguistic annotations of the TectoMT system are stored (along with the text of the source document) in XML files in the so called Prague Markup Language (PML, http://ufal.mff.cuni.cz/jazz/PML/). PML is a very complex language (or XML schema) that is able to express many linguistic elements and features present in text. For the IE engine, a tree dependency structure of words in sentences is the most useful one, because the edges of the structure guide the extraction rules. An example of such a (tectogrammatical) tree structure is in Fig. 2.

Figure 2. Tectogrammatical tree of the sentence: "Hutton is offering 35 dlrs cash per share for 83 pct of the shares." Nodes roughly correspond with words of a sentence, edges represent linguistic dependencies between nodes, and some linguistic features (tectogrammatical lemma, semantic functor and semantic part of speech) are printed under each node.

In this case study, PML files made from source documents by TectoMT are transformed to an RDF document ontology by a quite simple GRDDL/XSLT transformation. Such a document ontology contains the whole variety of PML in RDF format.
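The GRDDL step itself can be as small as a single XSLT invocation; the sketch below uses the standard javax.xml.transform API, with the stylesheet and file names (pml2rdf.xsl, doc.pml.xml, doc.rdf) chosen purely for illustration rather than taken from the project.

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class PmlToRdf {
        public static void main(String[] args) throws Exception {
            // An XSLT stylesheet (linked from the document, GRDDL-style) maps PML
            // elements such as tectogrammatical nodes and their t_lemma, functor
            // and formeme values to RDF triples of the document ontology.
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new File("pml2rdf.xsl")));
            transformer.transform(new StreamSource(new File("doc.pml.xml")),
                                  new StreamResult(new File("doc.rdf")));
        }
    }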


2) Rule Transformations: Extraction rules produced by the IE engine are natively kept in a Prolog format; examples can be seen in Fig. 3. The IE engine is capable of exporting them to the OWL/XML syntax for rules in OWL 2 [10] (see Fig. 4). Such rules can be parsed by OWL API (http://owlapi.sourceforge.net/) 3.1 and exported to RDF/SWRL, which is very widely supported and hopefully becoming a W3C recommendation. The last rule example can be seen in Fig. 5; it shows a rule in the Jena rules format. Conversion to Jena rules was necessary because it is the only format that Jena can parse; see details about our use of Jena in Section V.

    [Rule 1] [Pos cover = 23 Neg cover = 6]
    mention_root(acquired,A) :-
        'lex.rf'(B,A), t_lemma(B,'Inc'),
        tDependency(C,B), tDependency(C,D),
        formeme(D,'n:in+X'), tDependency(E,C).

    [Rule 11] [Pos cover = 25 Neg cover = 6]
    mention_root(acquired,A) :-
        'lex.rf'(B,A), t_lemma(B,'Inc'),
        tDependency(C,B), formeme(C,'n:obj'),
        tDependency(C,D), functor(D,'APP').

    [Rule 75] [Pos cover = 14 Neg cover = 1]
    mention_root(acquired,A) :-
        'lex.rf'(B,A), t_lemma(B,'Inc'),
        functor(B,'APP'), tDependency(C,B),
        number(C,pl).

Figure 3. Examples of extraction rules in the native Prolog format.

Figure 4. Rule 75 in the OWL/XML syntax for Rules in OWL 2 [10].

    @prefix pml: <...> .
    [rule-75:
        ( ?b pml:lex.rf ?a )
        ( ?c pml:tDependency ?b )
        ( ?b pml:functor 'APP' )
        ( ?c pml:number 'pl' )
        ( ?b pml:t_lemma 'Inc' )
        ->
        ( ?a pml:mention_root 'acquired' )
    ]

Figure 5. Rule 75 in the Jena rules syntax.

The Jena rules were obtained using the following transformation process: OWL/XML → RDF/SWRL conversion using OWL API, and RDF/SWRL → Jena rules conversion using SweetRules (http://sweetrules.semwebcentral.org/).

The presented rules belong to the group of so called DL-Safe rules [11], so the decidability of OWL reasoning is kept.
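The first half of that pipeline (OWL/XML → RDF/SWRL) amounts to loading the rules ontology with the OWL API and serializing it again in an RDF syntax; a minimal sketch, with assumed file names, could look like this:

    import java.io.File;
    import java.io.FileOutputStream;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.io.RDFXMLOntologyFormat;
    import org.semanticweb.owlapi.model.OWLOntology;
    import org.semanticweb.owlapi.model.OWLOntologyManager;

    public class OwlXmlToRdfSwrl {
        public static void main(String[] args) throws Exception {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            // Parse the extraction rules exported in the OWL/XML syntax (cf. Fig. 4) ...
            OWLOntology rules = manager.loadOntologyFromOntologyDocument(
                    new File("extraction-owlxml.owl"));
            // ... and write them out as RDF/XML, in which the rules appear
            // in the SWRL RDF vocabulary (RDF/SWRL).
            manager.saveOntology(rules, new RDFXMLOntologyFormat(),
                    new FileOutputStream("extraction-rdf-swrl.owl"));
        }
    }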
3) Schema of the Case Study: A schema of the case study was presented in Fig. 1. The top row of the image illustrates how TectoMT (a third party linguistic tool) linguistically annotates an input document and how the linguistic annotations are translated to the so-called document ontology by a GRDDL/XSLT transformation. In the bottom of the picture our IE engine learns extraction rules and exports them to an extraction ontology. The reasoner then applies the extraction ontology on the document ontology and it produces the "annotated" document ontology.

V. EXPERIMENT

In this section, we present an experiment that should serve as a proof of concept that the proposed idea of independent extraction ontologies is realizable. We have selected several reasoners (namely Jena, HermiT, Pellet and FaCT++) and tested them on two slightly different datasets from two different domains and languages (see Table I). This should at least partially demonstrate the universality of the proposed approach.

In both cases the task is to find all instances (corresponding to words in a document) that should be uncovered by the extraction rules. The extraction rules are saved in a single extraction ontology for each dataset. The datasets are divided into individual document ontologies (owl files) corresponding to the individual documents. During the experiment the individual document ontologies are processed separately (one ontology in a step) by a selected reasoner. The total time taken to process all document ontologies of a dataset is the measured result of the reasoner for the dataset.

The actual reasoning tasks are more difficult than a simple retrieval of all facts entailed by the extraction rules. Such a simple retrieval task took only a few seconds for the Acquisitions v1.1 dataset (including parsing) in the native Prolog environment that the IE engine uses. Several more inference steps were needed in the reasoning tasks, because the schema of the input files was a little bit different from the schema used in the rules. The mapping of the schemas was captured in another "mapping" ontology that was included in the reasoning. The mapping ontology is a part of the publically available project ontologies.

A. How to Download

All the resources (including source codes of the case study and the experiment, datasets and ontologies) mentioned in this paper are publically available on the project's web site (http://czsem.berlios.de/ (before 2012) or http://czsem.sourceforge.net/) and detailed information can be found there.


Table I
DESCRIPTION OF DATASETS THAT WERE USED.

    dataset        | domain    | language | number of files | size (MB) | number of rules
    czech fireman  | accidents | Czech    | 50              | 16        | 2
    acquisitions   | finance   | English  | 600             | 126       | 113

B. Datasets

In the experiment we used two slightly different datasets from two different domains and languages. Table I summarizes some basic information about them.

1) Czech Fireman: The first dataset is called 'czech fireman'. This dataset was created by ourselves during the development of our IE engine. It is a collection of 50 Czech texts that report on accidents (car accidents and other actions of fire rescue services). These reports come from the web of the Fire Rescue Service of the Czech Republic. The corpus is structured such that each document represents one event (accident) and several attributes of the accident are marked in the text. For the experiment we selected the 'damage' task – to find the amount (in CZK – Czech Crowns) of the summarized damage arisen during a reported accident.

2) Acquisitions v1.1: The second dataset is called "Corporate Acquisition Events". More precisely, we used the Acquisitions v1.1 version of the corpus (this version comes from the Dot.kom project's resources, http://nlp.shef.ac.uk/dot.kom/resources.html, 2011-08-09 page, 2006-12-31 dataset). This is a collection of 600 news articles describing acquisition events taken from the Reuters dataset. News articles are tagged to identify fields related to acquisition events. These fields include 'purchaser', 'acquired', and 'seller' companies along with their abbreviated names ('purchabr', 'acqabr' and 'sellerabr'). Some news articles also mention the field 'deal amount'. For the experiment we selected only the 'acquired' task.

C. Reasoners

In the experiment we used four OWL reasoners:
Jena (http://jena.sourceforge.net),
HermiT (http://hermit-reasoner.com),
Pellet (http://clarkparsia.com/pellet),
FaCT++ (http://code.google.com/p/factplusplus).

We measured the time they spent on processing a particular dataset. The time also includes the time spent on parsing the input. HermiT, Pellet and FaCT++ were called through OWL API 3.1, so the same parser was used for them. The Jena reasoner was used in its native environment with the Jena parser.

In the early beginning of the experiment we had to exclude the FaCT++ reasoner from both tests. It turned out that FaCT++ does not work with rules [12] and it did not return any result instances. All the remaining reasoners strictly agreed on the results and returned the same sets of instances. Also, HermiT was not fully evaluated on the Acquisitions v1.1 dataset because it was too slow. The reasoner spent 13 hours of running to process only 30 of the 600 files of the dataset, and it did not seem useful to let it continue.
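For Jena, "its native environment" essentially means loading a document ontology into a model, wrapping it with a rule reasoner fed by the exported Jena rules (cf. Fig. 5), and timing the run. The following sketch shows one way this could be set up; the file names and the pml IRI are illustrative assumptions, and this is not the published benchmark code.

    import java.util.List;
    import com.hp.hpl.jena.rdf.model.InfModel;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.RDFNode;
    import com.hp.hpl.jena.rdf.model.StmtIterator;
    import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
    import com.hp.hpl.jena.reasoner.rulesys.Rule;

    public class JenaExtractionRun {
        public static void main(String[] args) {
            // Extraction rules in the Jena rules syntax (as in Fig. 5).
            List<Rule> rules = Rule.rulesFromURL("file:extraction.rules");
            GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);

            long start = System.currentTimeMillis();
            Model document = ModelFactory.createDefaultModel();
            document.read("file:document-ontology.owl");   // one document ontology
            InfModel annotated = ModelFactory.createInfModel(reasoner, document);

            // The inferred mention_root statements are the produced annotations.
            StmtIterator it = annotated.listStatements(null,
                    annotated.createProperty("http://example.org/pml#", "mention_root"),
                    (RDFNode) null);
            while (it.hasNext()) {
                System.out.println(it.nextStatement());
            }
            System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
        }
    }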

Table II
TIME PERFORMANCE OF TESTED REASONERS ON BOTH DATASETS.

    reasoner | czech fireman | stdev | acquisitions-v1.1 | stdev
    Jena     | 161 s         | 0.226 | 1259 s            | 3.579
    HermiT   | 219 s         | 1.636 | > 13 hours        | –
    Pellet   | 11 s          | 0.062 | 503 s             | 4.145
    FaCT++   | Does not support rules.

Time is measured in seconds. Average values from 6 measurements. Experiment environment: Intel Core i7-920 CPU 2.67 GHz, 3 GB of RAM, Java SE 1.6.0_03, Windows XP.

D. Evaluation Results of the Experiment

Table II summarizes the results of the experiment. The standard deviations are relatively small when compared to the differences between the average times, so there is no doubt about the order of the tested reasoners. Pellet performed the best and HermiT was the slowest amongst the tested and usable reasoners in this experiment.

From the results we can conclude that similar tasks can be satisfactorily solved by contemporary OWL reasoners, because three of the four tested reasoners were working correctly and two reasoners finished in bearable time.

On the other hand, even the fastest system took 8.5 minutes to process 113 rules over 126 MB of data. This is clearly significantly longer than a bespoke system would require. Contemporary Semantic Web reasoners are still known to be often quite inefficient, and the experiment showed that using them today to do information extraction will result in quite poor performance. However, efficiency problems can be solved, and in the context of Linked Data providing shareable descriptions of information extraction rules may be valuable.

VI. DISCUSSION

In this paper (Section IV-A), we have described a method how to apply an extraction ontology to a document ontology and obtain a so called "annotated" document ontology. To have an "annotated" document ontology is almost the same as to have an annotated document. An annotated document is useful (easier navigation, faster reading and lookup of information, possibility of structured queries on collections of such documents, etc.), but if we are interested in the actual information present in the document, if we want to know the facts that are asserted in a document about real world things, then an annotated document is not sufficient. But the conversion of an annotated document to real world facts is not simple. There are obvious issues concerning data integration and duplicity of information. For example, when in a document two mentions of people are annotated as 'injured', what is then the number of injured people in the corresponding accident? Are the two annotations in fact linked to the same person or not?


In the beginning of our work on the idea of shareable extraction ontologies, we planned to develop it further: we wanted to cover also the step from annotated document ontologies to the real world facts. The extraction process would then end up with so called "fact ontologies". But two main obstacles prevent us from doing that.

1) Our IE engine is not yet capable of solving these data integration and duplicity of information issues, and the real world facts would then be quite imprecise.
2) There are also technology problems with creating new facts (individuals) during reasoning.

A. SPARQL Queries – Increasing Performance?

There is also a possibility to transform the extraction rules to SPARQL construct queries. This would probably rapidly increase the time performance. However, a document ontology would then have to exactly fit the schema of the extraction rules. This would be a minor problem. The reason why we did not study this approach from the beginning is that we were interested in extraction ontologies, and SPARQL queries are not currently regarded as a part of an ontology, and nothing suggests it will be the other way round. Anyway, the performance comparison remains a valuable task for future work.
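For illustration, a rule such as Rule 75 could be rephrased as a SPARQL CONSTRUCT query and executed with Jena ARQ against a document ontology. The query below is only a sketch under the assumption that the document ontology uses the pml# properties directly, which is exactly the "schema has to fit" caveat mentioned above; the namespace and file name are again illustrative.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class ConstructAnnotations {
        public static void main(String[] args) {
            Model document = ModelFactory.createDefaultModel();
            document.read("file:document-ontology.owl");

            // CONSTRUCT counterpart of Rule 75 (illustrative namespace).
            String query =
                "PREFIX pml: <http://example.org/pml#>\n" +
                "CONSTRUCT { ?a pml:mention_root 'acquired' }\n" +
                "WHERE {\n" +
                "  ?b <http://example.org/pml#lex.rf> ?a .\n" +
                "  ?b pml:t_lemma 'Inc' .\n" +
                "  ?b pml:functor 'APP' .\n" +
                "  ?c pml:tDependency ?b .\n" +
                "  ?c pml:number 'pl' .\n" +
                "}";

            QueryExecution qe = QueryExecutionFactory.create(query, document);
            Model annotations = qe.execConstruct();   // the new annotation triples
            annotations.write(System.out, "N-TRIPLE");
            qe.close();
        }
    }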

B. Contributions for Information Extraction

The paper combines the field of ontology-based information extraction with rule-based reasoning. The aim is to show a new possibility in the usage of IE tools and reasoners. In this paper, we do not present a solution that would improve the performance of IE tools.

We also do not provide a proposal of a universal extraction format (although a specific form for rule based extraction on dependency parsed text could be inferred). This task is left for the future, if a need for such an activity emerges.

VII. CONCLUSION

In the beginning of the paper we pointed out the drawback of so called extraction ontologies – in most cases they are dependent on a particular extraction/annotation tool and they cannot be used separately.

We extended the concept of extraction ontologies by adding the shareable aspect and we introduced a new principle of making extraction ontologies independent of the original tool: the possibility of applying an extraction ontology to a document by an ordinary reasoner.

In Section IV we presented a case study that shows that the idea of shareable extraction ontologies is realizable. We presented the implementation of an IE tool that exports its extraction rules to an extraction ontology and we demonstrated how this extraction ontology can be applied to a document by a reasoner. Moreover, in Section V, an experiment with several OWL reasoners was presented. The experiment evaluated the performance of contemporary OWL reasoners on IE tasks (application of extraction ontologies).

A new publically available benchmark for OWL reasoning was created together with the experiment. Other reasoners can be tested this way.

Acknowledgments: This work was partially supported by Czech projects: GACR P202/10/0761, GACR-201/09/H057, GAUK 31009 and MSM-0021620838.

REFERENCES

[1] D. W. Embley, C. Tao, and S. W. Liddle, "Automatically extracting ontologically specified data from HTML tables of unknown structure," in ER, ser. LNCS, vol. 2503. Springer, 2002, pp. 322–337.

[2] M. Labský et al., "The Ex Project: Web Information Extraction Using Extraction Ontologies," in Knowledge Discovery Enhanced with Semantic and Social Information, ser. Studies in Comput. Intellig. Springer, 2009, vol. 220, pp. 71–88.

[3] D. C. Wimalasuriya and D. Dou, "Ontology-based information extraction: An introduction and a survey of current approaches," Journal of Information Science, vol. 36, no. 3, pp. 306–323, June 2010.

[4] B. Yildiz and S. Miksch, "ontoX – a method for ontology-driven information extraction," in ICCSA'07 – Part III. Springer, 2007, pp. 660–673.

[5] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, "A survey of web information extraction systems," IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1411–1428, 2006.

[6] D. W. Embley, "Toward semantic understanding: an approach based on information extraction ontologies," in ADC '04. Darlinghurst: ACS, 2004, pp. 3–12.

[7] B. Parsia, E. Sirin, B. C. Grau, E. Ruckhaus, and D. Hewlett. (2005) Cautiously approaching SWRL. Preprint submitted to Elsevier Science. 2011-08-09. [Online]. Available: http://www.mindswap.org/papers/CautiousSWRL.pdf

[8] J. Dědek, "Towards semantic annotation supported by dependency linguistics and ILP," in ISWC 2010, ser. LNCS, vol. 6497. Springer, 2010, pp. 297–304.

[9] Z. Žabokrtský, J. Ptáček, and P. Pajas, "TectoMT: Highly modular MT system with tectogrammatics used as transfer layer," in Proceedings of the 3rd Workshop on Statistical Machine Translation. Columbus, OH, USA: ACL, 2008, pp. 167–170.

[10] B. Glimm, M. Horridge, B. Parsia, and P. F. Patel-Schneider, "A Syntax for Rules in OWL 2," in OWLED 2009, vol. 529. CEUR, 2009.

[11] B. Motik, U. Sattler, and R. Studer, "Query answering for OWL-DL with rules," Web Semantics, vol. 3, pp. 41–60, July 2005.

[12] Wikipedia article: Semantic reasoner – reasoner comparison. 2011-08-09. [Online]. Available: http://en.wikipedia.org/wiki/Semantic_reasoner#Reasoner_comparison
