Shallow and Deep NNNaturalNatural LLLanguageLanguage PPProcessingProcessing for Ontology LearningLearning:: a Quick Overview

Amal Zouaq Simon Fraser University - Athabasca University, Canada

ABSTRACT This chapter gives an overview over the state-of-the-art in natural language processing for . It presents two main NLP techniques for knowledge extraction from text, namely shallow techniques and deep techniques, and explains their usefulness for each step of the ontology learning process. The chapter also advocates the interest of deeper semantic analysis methods for ontology learning. In fact, there have been very few attempts to create ontologies using deep NLP. After a brief introduction to the main semantic analysis approaches, the chapter focuses on lexico-syntactic patterns based on dependency grammars and explains how these patterns can be considered as a step towards deeper semantic analysis. Finally, the chapter addresses the “ontologization” task that is the ability to filter important concepts and relationships among the mass of extracted knowledge.

1. INTRODUCTION Given the large amount of textual data in almost all the aspects of our everyday lives, and given that natural language is our first medium for communicating knowledge, there is no doubt that natural language processing (NLP) technologies are of tremendous importance for analyzing textual resources and extracting their meaning. One of the current research avenues where NLP should play a leading role is the Semantic Web. In fact, NLP should be considered as one of the pillars of the Semantic Web (Wilks & Brewster, 2009) and should be particularly useful for the acquisition of domain ontologies. Despite the vast majority of works dedicated to ontology learning based on NLP (Buitelaar & Cimiano, 2008) (Cimiano & Volker, 2005) (Buitelaar et al., 2005), it is clear that the whole potential of the available techniques and representations has not been fully exploited. More precisely, the works from the computational semantics community (Bos, 2008c) have been particularly neglected, to my knowledge, in the ontology learning field until now, while these works deal with very important aspects of text understanding that are not available in more shallow techniques. I believe that these deep aspects are essential for building an accurate domain ontology reflecting the content of its source data. This chapter is a tentative to bring this issue to the attention of the research community and it gives a quick overview over techniques from the computational semantics community that may be of interest for ontology learning. It also talks about the shallow NLP techniques, both at the syntactic and semantic levels, which have already been employed for ontology learning, as well as deeper methods based on dependency grammars and lexico-syntactic patterns. These deeper methods do not still reach the required depth that is advocated in computational semantics, but they can be considered as attempts towards this goal. The chapter is organized as follows: After the introduction, section 2 provides a definition of the ontology and the ontology learning task. Section 3 talks about the various natural language processing techniques that may be used in an ontology learning process. It also provides a quick overview over the existing techniques in the computational linguistic and semantics communities including shallow and deep analysis, both at the syntactic level and the semantic level. Section 4 presents a set of projects for ontology learning with a special emphasis on the dependency grammar formalism and on the patterns based on this formalism. This section underlines the links between the presented projects and the various NLP techniques that were used. Section 5 explains a very important stage of the ontology learning process which is the ontologization task. Finally, section 6 discusses a number of issues that are still to be solved by the research community.

2. BACKGROUND There are a number of resources that describe what an ontology is, with the most cited definition being the one presented by (Gruber, 93): “ An ontology is a formal specification of a conceptualization ”. Although this definition may seem too broad, we can extract from it two keywords that are essential for our understanding of ontologies: formal and conceptualization. The formal characteristic : In the domain of computer science and formal logic, a formal system designates a system using a formal language, a grammar that indicates the well-expressed formulas according to the language and a set of axioms or inference rules to reason over this language. A formal language is defined using a set of symbols. The conceptual characteristic : Having its root in philosophy, the notion of concept has been widely used in the Artificial Intelligence community. According to (Guarino, 98), a conceptualization must be defined on an intentional level and an extensional level. The intentional level deals with the meaning of what is being defined (the domain of interest), while the extensional level describes the instances of that domain. As it can be seen, an ontology is grounded in the domain of mathematical logic, reasoning and theorem- proving. In fact, it is the main knowledge structure of the Semantic Web , whose aim is to provide a set of machine understandable semantics. These semantics are generally organized in a structure called a domain ontology, which is used to express the conceptualization of that domain. Formally, a domain ontology is represented by a tuple , where: • C represents the set of classes. E.g.: Animal, Human, etc. • H represents the set of hierarchical links between the concepts. E.g. is-a (Feline, Animal). • R, the set of conceptual links. E.g. eat(Herbivores, Plant) • A, the set of axioms, i.e. the rules that govern this domain and make a reasoner able to infer new information. • I the set of instances, i.e. the objects of the world which can be categorized into the ontological classes C. Generally, when we think about developing or learning an ontology, we target the first four components (which are considered as steps in the ontology learning process), i.e. the tuple , while the last one (I) is tackled by what is called “ontology population”. Ontology engineering has proposed various methodologies that provide guidelines to users for effectively building ontologies through ontology editors. However, one drawback of these methodologies is the huge amount of time and effort required by humans, the so called knowledge acquisition bottleneck. This situation is even worsened when it comes to ontology evolution or mapping. In front of the rapidly growing amounts of electronic data, providing (semi)automatic knowledge extraction tools is a must. This is the reason why this chapter focuses on (semi) automatic ontology learning. As a process, ontology learning starts with a source of knowledge, extracts the various elements , and produces an ontology, generally expressed using one of the Semantic Web Languages, such as RDF or OWL. There have been many attempts to define (semi)automatic processes from structured documents, including knowledge bases and xml files for instance (Maedche & Staab, 2001). This is mainly accomplished by creating mappings between the original structures and the ontological structures (concepts, taxonomy, etc.). 
But the main effort of the research community is directed towards extracting semantics from unstructured sources, such as plain text or Web documents (which cannot be really considered as structured). The democratization of the access to the Web and hence the huge production of Web documents participate greatly to this need as well as the emergence of the Semantic Web, which calls for semantic models attached to the WWW resources. This chapter does not deal particularly with Web content but rather focuses on how texts (and among them textual Web resources) might be a useful knowledge source for acquiring ontologies. Important aspects such as the necessity of extracting the structure of a Web document to guide the learning process or how we might filter web content are not addressed here. However, we believe that many of the methods and approaches presented here may benefit to ontology learning from the Web. Ontology learning from texts involves a number of disciplines ranging from lexical acquisition, , natural language processing, statistics and . These are generally fully intricate in the ontology learning process and are involved at the various levels of ontological acquisition. This chapter focuses particularly on the natural language processing (NLP) techniques ranging from lexical acquisition to shallow and deep analysis methods (both at the syntactic and semantic levels). However, we will also refer to many other approaches from statistics and machine learning, especially when they come as a complement to the NLP process. There are many reasons why deep NLP may be of great interest. First, because it is the discipline which deals with the understanding of a message conveyed by a text, hence it is an indispensable tool for ontology learning from text. Second, NLP tools are becoming sufficiently robust to be considered as reliable tools for knowledge and model extraction [Bos, 2008c]. Third, NLP methods that have been used until now, in the ontology learning field, remain generally shallow and do not borrow significant insights from the computational linguistics and semantics communities. To conclude, this chapter is a tentative to show the interest of computational semantics (or at least deeper linguistic processing) for the ontology learning field.

3. NLP APPROCHES There exists a broad range of NLP approaches that may be used for ontology learning. These include shallow NLP and deep NLP methods. The following sections explore these NLP approaches both at the syntactic and semantic levels.

3.1. Shallow NLP Shallow NLP has been the main approach used by the ontology learning community, primarily for three reasons: its robustness to noise, its low need of training resources (such as tagged corpora) and its efficiency in terms of calculation, which is important if we deal with large amounts of texts. Using Shallow NLP techniques may be applied at the syntactic level and semantic level.

3.1.1. Shallow Syntactic Processing Shallow NLP, when coping with the syntactic aspect , generally designates methods for generating partial analysis of sentences and include tools such as chunkers (dividing sentences into chunks (Abney, 1991)) and parts-of-speech taggers (assigning a syntactic label such as NP to a chunk). An example of a chunked sentence, taken from (Li & Roth, 2001), is: “[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only $ 1.8 billion] [PP in] [NP September]”, where NP indicate Nominal Phrases, VP verbal phrases and PP prepositional phrases. Theses chunks could be used, for instance, to identify interesting terms, especially based on NPs. Shallow NLP may be used in a number of tasks related to ontology learning: • Term Extraction: The most common approach consists in using primarily shallow NLP techniques to extract terms. Terms constitute the linguistic expression of a concept that may be of interest to the domain (Buitelaar et al., 2005). In fact, the nominal phrases extracted by chunkers may constitute the basic vocabulary of the domain model. However, besides identifying these nominal expressions, it is also needed to compute their weight or importance with respect to the corpus. This is generally tackled using statistical measures from information retrieval such as TF*IDF (Salton & Buckley, 1988), Pointwise Mutual Information (Church & Hanks, 90), and so on. The reader is referred to (Nanas et al., 2003) (Zhang et al., 2008) to gain more insight on metrics used for term weighting. • Taxonomy Extraction: Shallow syntactic has also been successfully used for learning hierarchical links in an ontology (Maedche & Staab, 2001) (Cimiano & Volker, 2005). These links are generally identified with lexico-syntactic patterns such as the patterns defined by (Hearst, 92). These patterns are mainly expressed using regular expressions to be fetched over a text or a chunked text. • Relation Extraction: There are only limited possibilities to extract reliable relations using only shallow syntactic parsing. Based again on regular expressions, it is possible to extract some patterns such NP VP NP. But this is far from being sufficient and this does not cope with language ambiguities such as long-distance relationships. Therefore, shallow parsing shows its limits at this level, not to talk about axiom learning, which might only be extracted with deeper syntactic methods.

3.1.2. Shallow Semantic Processing Besides syntactic parsing, there are also various methods for shallow semantic parsing (also called semantic tagging) which may be used in an ontology learning/population process (Etzioni et al., 2004) (McDowell & Cafarella, 2008). These methods generally require a set of knowledge resources such as frames, templates, or roles, to identify the structures to be found in the corpus. From this perspective, this approach is radically different from learning an ontology based only on texts. Here learning relies on sources that guide the extraction and this supposes that you know what you are looking for, contrary to a “blind” approach where you actually discover the important concepts and relations. In such a process, the resulting ontology uses the knowledge resources terminology (roles, frames, concepts) to discover instances of these resources, which in turn nourish the ontology. These resources may include role taxonomies, lists of named entities and also lexicons and dictionaries (Giuglea & Moschiti, 2006). They also generally require a sense disambiguation algorithm to pick up the right sense of a given word and assign it the right meaning or concept. Shallow semantic parsing is also useful to identify particular semantic relationships such as synonyms, meronyms, or antonyms using pre- specified patterns, such as Hearst patterns (Hearst, 92), or verb sub-categorizations (e.g. VerbNet) hence mixing the syntactic and semantic processing perspectives. Shallow semantic parsing has been mainly used for “populating” ontologies rather than learning them, because it relies on conceptual structures that guide the extraction. Lexical semantics, involving (Gildea & Jurafsky, 2002) and word sense disambiguation can also be categorized as shallow semantic approaches (although they may rely on deep syntactic processing and syntactic grammars).

3.2. Deep NLP As outlined earlier, deep semantic analysis of texts may prove more appropriate to extract rich domain ontologies, involving not only concepts but also relationships and axioms. Shallow methods may be of interest for some of the ontology learning steps, but they fell short when confronted with more complex understanding of texts. Computational semantics deals with such aspects, and aims at producing meaning representations while tackling very fine-grained aspects of the language such as anaphora resolution, quantifier scope resolution, etc. For example, anaphora resolution identifies the entities referred by expressions such as pronouns. In general, computational semantics aims at grasping the entire meaning of sentences and discourse, rather than focusing on text portions alone. Computational semantics provides a method for extracting representations from natural language and drawing inferences based on these representations. Such a definition enables a clear and straightforward link to the objective of the Semantic Web in general and of domain ontologies in particular: expressing (textual) Web resources content based on a standard language understandable by machines and endowed with inference capabilities. The first essential component for a deep analysis of texts is a syntactic parser based on syntactic grammars.

3.2.1. Syntactic grammars Syntactic parsing is performed using a set of grammar rules that assign parse trees to sentences. This set of rules is known as a syntactic grammar. Traditional syntactic parsing (Jurafsky & Martin, 2009) relies also on a lexicon that describes the vocabulary that may be used in the parsed sentences. For instance, WordNet (Fellbaum, 1998) can be considered as a lexicon. There are also statistical parsers (Klein & Manning, 2003) (Jurafsky & Martin, 2009) which learn their knowledge about a language using hand- labeled sentences and produce the most likely analyses when parsing sentences. The output representations can take the form of phrase structure tree representations or dependency parses. Phrase structure parses associates a syntactic parse in the form of a tree to a sentence, while dependency parses creates grammatical links between each pair of in the sentence (see figure 1).

Figure 1. A phrase structure parse (Left) and a dependency parse (Right) (Nivre, 2005)

Phrase structure grammars and dependency grammars cannot be considered as opposite approaches but rather as complementary. In fact, many syntactic theories make use of both formalisms. However, despite the fact that these two representations differ from each other only by what is explicitly encoded, as pointed out by (Kübler et al., 2009), “practical experience has shown that it is a non-trivial task to perform an automatic conversion from one type of representation to the other”. Dependency parsing seems to regain its central place in the research community (especially with CoNLL shared tasks), with many researchers (Briscoe & Carroll 2006; Zouaq, 2008, Kübler et al., 2009) arguing that dependencies model predicate- argument structures in a more convenient or intuitive form (for further semantic analysis), and that dependency grammar has led to the development of accurate syntactic parsers using machine learning on (Kübler et al., 2009). The majority of the approaches for ontology learning are based on the exploitation of the syntactic parses to further extract relevant structures, using patterns and machine learning. These approaches use syntactic parses for a more fine-grained term extraction (noun for instance), relation extraction and axiom learning. Section 4 is dedicated to the presentation of works based on this approach.

3.2.2. Linguistic grammars and theories A long tradition of the linguistic community is the use of grammars that rely mainly on lexical items. By linguistic grammars, we mean grammars which build upon a number of syntactic rules and a lexicon to produce some kind of representations of the sentence or the discourse. These grammars include HPSG grammars (Sag et al., 2003) and CCG grammars (Steedman, 2001), which are among the most used formalisms in the linguistic community. Linguistic grammars rely basically on two main methods: the unification-based approaches and the lambda-calculus approaches, which may be considered as drawing their roots in formal semantics. Many grammars, such as HPSG, are based on the unification mechanism. • Unification-based approaches The unification based-approaches rely on representations called feature structures that express valid phrase structure rules between syntactic categories (e.g. S NP VP) and use lexical entries (also described as feature structures) as their content. The phrase structure rules have the form LHS/RHS where LHS is a non lexical category and RHS might be a set of lexical and non lexical categories. These rules guarantee the correct unification of all the semantic features with instances that respect the syntactic features. A lexical item is represented as phoneme coupled with a syntactic and a semantic part. Figure 2 shows a an example of an HPSG lexical item, with a syntactic component (Syn) and a semantic component (Sem), and where the subject of the verb “slept” in the Syn part is the “Sleeper” argument in the Sem part.

Figure 2. An example of an HPSG lexical item (Bender et al., 2003)

• Lambda Calculus Another approach consists in using lambda-calculus as the glue to combine semantic representations, and solve some of the problems of the unification-based approaches (such as coordination processing). In lambda-calculus, all the formulas are regarded as lambda expressions, as well as the combination of variables and lambda expressions. All the approaches described so far tackle the semantic analysis at the sentence level. The discourse Representation Theory (DRT) (Kamp and Reyle, 1993) was born to address this shortcoming by providing means to parse a discourse. More precisely, DRT enables resolving pronouns to their textual antecedents. The basic unit of the DRT is the Discourse Representation Structure (DRS) which map sentence constituents to objects, properties and predicates and provides discourse referents (variables) to represent these objects (Figure 3). These referents are then used to resolve a pronominal anaphora. The DRT provides triggering rules associated to particular syntactic configurations as well as transformation methods that output semantic representations.

Figure 3. A discourse representation structure example.

3.2.3. Summary on Deep NLP Very few approaches have used these types of grammars for ontology learning or population. To our knowledge, the only exception is the OntoSem project (English & Nirenburg, 2007) which carries out preprocessing, morphological analysis, syntactic analysis, and semantic analysis to build a fact repository. OntoSem relies on a syntactic and semantic lexicon, as well as on an existing ontology and is able to “learn” a concept from the Web based on these resources. Given a specific word, OntoSem builds a corpus of sentences, derived from the web, containing this word and annotates this corpus using its semantic analyzer. Then the system creates a candidate new concept by collating semantic information from annotated sentences and finds in the existing ontology concept(s) “closest” to the candidate. The objective of the OntoSem project is in fact closer to the goal of shallow semantic processing approaches which aim at populating an ontology rather than creating its structure but OntoSem uses deep semantic processing techniques to reach this goal. Given these various NLP-based methods, the main difficulty is to pick up the right method according to the particular task that is envisaged. On way to provide some insights over this kind of decisions is to give an overview on state-of-the art experiences. The following sections present practical examples of systems which are based on dependency grammars and lexico-syntactic patterns and which aim at extracting a whole ontology or a particular ontology component (classes, relations, axioms, etc.).

4. PRACTICAL EXAMPLES OF LEXICO-SYNTACTIC PATTERNS The field of NLP offers now tools and methods mature enough to be considered in the text understanding process, particularly with the availability of robust statistical syntactic analyzers (Bos, 2008a), annotated corpora and semantic lexica such as WordNet (Fellbaum, 1998) or FrameNet (Baker et al., 1998). In this context, using a deep syntactic analysis approach, rather than a shallow one, may bring more insights over the structure and meaning of sentences. Deep syntactic analysis may also be the way to pave the road for deeper semantic analysis as well. The following sections show how these deep syntactic analyses have been used in the state-of-the-art to learn ontologies, ontology components and to create semantic analysis.

4.1. Pattern Definition One of the common ways to exploit a deep syntactic analysis is to find patterns in the syntactic parses. In its broad sense, a pattern is a model, a set of elements that describe a structure of interest. Based on the dependency parses, a (lexico-syntactic) pattern may be able to define a meaningful structure on which additional computations can be performed depending on the required task. For example, (Snow et al., 2004) define a space of lexico-syntactic patterns as all shortest paths of four links or less between any two nouns in a dependency tree. In the TEXCOMON project (Zouaq & Nkambou, 2009), each pattern is organized as a tree around a root term, T, linked to specific input and output grammatical links. Each node in the pattern represents a variable. An example of such a pattern is:

nsubj dobj X T Y

In (Zouaq et al., 2010), patterns are represented by Prolog rules . In both projects (Zouaq & Nkambou, 2009) (Zouaq et al., 2010), these patterns are linked to transformation methods that output some sort of representation (concept maps, logical forms). During the semantic analysis, the basic idea is that these patterns are searched in the parse trees and instantiated with data whenever an occurrence is found. Conceptually, these patterns constitute triggering rules that call transformation methods whenever a meaningful syntactic representation is found. Other works have proposed pattern models based on dependency grammars. (Stevenson & Greenwood, 2009) give a good overview of the pattern structures that may be used for , namely: Subject-Object Model (Yangarber, 2003), the Chain Model that represents the chain from a verb to its descendants (Sudo et al., 2001), the Linked Chain Model where a pattern is a pair of chains which share the same root verb but do not share any direct descendants (Greenwood et al., 2005), the Shortest path Model which models the shortest route in the dependency tree between any pair of nodes (Bunescu & Mooney, 2005), and the Sub-tree Model , where any sub-tree of a dependency tree can be used as an extraction pattern (Sudo et al., 2003). These patterns are generalized by replacing the lexical items by semantic roles. For example, the sentences “Acme Inc. hired Smith” and “IBM hired Jones last week.” will be matched by the generalized pattern: [V/hire](subj[Organisation]+obj[Person]). However, (Stevenson & Greenwood, 2009) do not state how this generalization might be done and this pre-supposes that there are roles (Organization, Person) that are defined, and that a certain form of named entity extraction is performed. This kind of approach is an information extraction one, meaning that one knows what (s)he is looking for in the texts. This also involves the use of a word sense disambiguation approach to determine the right sense of the words with respect to your terminology. As already shown, ontology learning from text might not be guided by knowledge structures (named entities, roles, etc.) and may have to perform a meaningful extraction from scratch. Being in one or other situation helps determine the adequate pattern model. However, developing grammars of meaningful patterns may prove essential for ontology learning. This intuition is guided by the hypothesis that focusing on meaningful parts of dependency parses may be the most suitable approach for accurate extraction. This hypothesis seems to be confirmed by the experiments in (Stevenson & Greenwood, 2009) (Zouaq & Nkambou, 2009) (Zouaq et al., 2010). The following sections present two projects in this line of research where I have been deeply involved as well as other projects that have used lexico-syntactic patterns over dependency grammars.

4.2. TEXCOMON TEXCOMON (figure 4) is a domain-independent ontology learning system that performs a number of steps to produce a domain ontology: first, it analyzes plain text documents and extracts paragraphs, sentences, and keywords. Second, the sentences containing the keywords are selected and parsed using the Stanford parser (Klein & Manning, 2003) (De Marneffe et al., 2006), which outputs a dependency parse. This key sentences’ selection helps reduce the search space for important information as well as the time required to parse the corpus. TEXCOMON relies on a hierarchy of hand-made patterns that identify the meaningful grammatical structures which trigger transformation methods. The interest of these patterns is that they enable the extraction of domain terms, taxonomical relationships, conceptual relationships and axioms at the same time, and are completely independent of a domain or a corpus. The process results into the production of multiple Concept Maps, which are in fact, terminological maps that comprise terms related by labeled relationships. These maps may be related by common terms or not. Next, metrics from graph theory such as the degree of a node are used to identify the important concepts and relations, which are then exported to a domain ontology. Given this project description, it is clear that the approach of (Zouaq, 2008) (Zouaq & Nkambou, 2009) exploits deep syntactic analysis, but also “deep-shallow” semantic analysis. By this “deep-shallow” analysis, we mean that we do not handle cases such as anaphora resolution and entity co- reference (resolving references to earlier or later entities in the discourse), but we still develop a hierarchy of patterns coupled with transformational methods, thus getting closer to this idea of grammars of meaningful patterns. Moreover, there seem to be new initiatives in the computational semantic community (for instance (Bos, 2008b)) that try to build a semantic analysis based on the output of a syntactic parser.

Figure 4. The TEXCOMON Architecture and an Example of a Concept Map.

An experiment of TEXCOMON, deeply described in (Zouaq & Nkambou, 2009b), shows that these patterns may be very effective to build a domain ontology by comparison with another ontology learning tool, Text-To-Onto (Maedche & Staab, 2001) (Maedche & Volz, 2001). According to two domain experts, TEXCOMON reaches a mean precision of up to 90% for concept identification on a SCORM corpus dataset while Text-to-Onto obtains 73% on the same dataset. Similarly, relationships precision is also very interesting with up to 84% for hierarchical links and up to 93% for conceptual links. The purpose of the following project (Zouaq et al., 2010) is to propose a deep semantic analysis using this pattern-based approach.

4.3. The Semantic Analyzer The aim of the Semantic Analyzer (Zouaq et al., 2010) is to extract logical forms from free texts and to annotate these forms using an upper-level ontology. Semantic analysis is represented as a modular pipeline that involves three steps (Figure.5): 1) Syntactic parsing of texts; 2) Logical analysis using a dependency-based grammar; 3) Semantic annotation based on the upper-level ontology SUMO involving word-sense disambiguation.

Figure 5. The Semantic Analyzer Architecture (Zouaq et al., 2010) This modular process enables the creation of a modular design clearly separating syntactic, logical and semantic annotation or extraction steps. The process is domain-independent and lexicon-independent, and it uses an ontology as a way to formally define semantic roles and make them understandable from a semantic role labelling system to another. The system also relies on dependency syntactic patterns to extract logical representations. Here, these patterns are exploited as an intermediate step towards an efficient modular semantic analysis based on an existing ontology, SUMO (Pease et al., 2002) and WordNet (Fellbaum, 1998). Each pattern is a Prolog rule that builds a logical representation according to the fired pattern and pattern hierarchy is simply represented by the order of the Prolog rules in the Semantic Analyzer. Table 1 and 2 show the patterns (Core and modifiers patterns) that are currently implemented as rules in the Prolog Semantic Analyzer. These patterns were discovered using a manual analysis of dependency parses.

Patterns Examples

Verb -nsubj-dobj-iobj {Mary} nsubj gave {Bill} iobj a {raise} dobj

Verb -nsubj-dobj-xcomp {The peasant} nsubj carries {the rabbit} dobj , {holding} xcomp it by its ears

Verb-nsubj-dobj {The cat} nsubj eats {a mouse} dobj

Verb -nsubj-xcomp[-dobj] {Michel} nsubj likes to {eat} comp {fish} dobj

Adjective -nsubj-xcomp {Benoit} nsubj is ready to {leave} xcomp

Verb -csubj-dobj What Amal {said} csubj makes {sense} dobj

Verb -nsubj-expl {There} expl is a small {bush} nsubj

Adjective -nsubj-cop {Benoit} nsubj {is} cop happy

Noun -nsubj-cop {Michel} nsubj {is} cop a man

Verb -nsubj-acomp {Amal} nsubj looks {tired} acomp

Verb -xcomp-ccomp Michel says that Benoit {likes} ccomp to {swim} xcomp

Verb -nsubj {The cat} nsubj eats

Verb-dobj Benoit talked to Michel in order to secure {the account} dobj

Verb -nsubjpass-prep by {The man} nsubjpass has been killed {by} prep the police

Verb -csubjpass-prep by That he {lied} csubjpass was suspected {by} prep everyone

Verb -nsubjpass {Bills} nsubjpass were submitted Table 1. Core patterns (Zouaq et al., 2010) Core patterns are used to extract main meaningful syntactic structures in the language while modifiers complete the meaning of core patterns and particular syntactic categories. For example, a preposition after a noun should not be interpreted the same way as a preposition after a verb.

Modifiers Patterns Examples (Modifiers) Partmod[prep] There is garden surrounded by houses. Prep[pcomp] They heard about Mia missing classes. Prep (after a noun) Vincent discovered the man with a telescope. Prep (after a verb) Bills were submitted to the man. Amod The white cat eats Tmod Vincent swam in the pool last night Advcl The accident happened as the night was falling. Ccomp Michel says that Benoit likes to swim. Purpcl Benoit talked to Michel in order to secure the account. Infmod The following points are to establish. Measure The director is 55 years old. Num The director is 55 years old. Poss The peasant taps the rabbit’s head with his fingers. Quantmod About 200 people came to the party. Advmod Genetically modified food is dangerous. Rcmod Michel loves a cat which Benoit adores. Table 2. Modifier patterns (Zouaq et al., 2010)

The resulting representation is a predicative flat formula composed of predicates applied to lexical elements, as well as predicates resulting from prepositional relations and predicates indicating if an entity has already been encountered in the discourse. Relationships between predicates are represented through their arguments and referential variables are assigned to the instantiated knowledge model elements. An example of a logical formula created by the semantic analyzer is: Sentence: The Guatemala army denied today that guerrillas attacked Santo Tomas Presidential Farm. [message(e1, e2), event(e2, attacked, id3, id4), entity(id4, santo-tomas-presidential-farm), resolve(id4), entity(id3, guerrillas), time(e1, id2), entity(id2, today), event(e1, denied, id1), entity(id1, guatemala- army), resolve(id1)] We ran experiments to evaluate the performance of the logical analyzer and the semantic annotator in terms of precision and recall. Overall, the logical analyzer reached a very good performance with a precision around 94% and a recall around 80%. The semantic annotator performance depended on the performance of the employed word sense disambiguation algorithm and the corpus. Overall, we were able to obtain a precision/recall of 64.1% (Fine-grained) and 69% (coarse-grained) on Senseval Data (Mihalcea et al., 2004). As shown in figure 5, the semantic analyzer (Zouaq et al., 2010) is an example of the shallow semantic approach, since it uses an existing ontology to annotate the representation, and does not lead to a new ontology but it is also a “deep” semantic analysis in the sense that it relies on a grammar of dependency- based patterns. This simply shows what can be done by combining various methods. Moreover, one can imagine how an ontology learning system might benefit from very detailed predicative formulas as the one presented above especially for the extraction of axioms.

4.4. Other works Many other works have relied on lexico-syntactic patterns for relation extraction, taxonomy extraction or concept learning. In general, there have been many efforts to build relationship classifiers, which uses machine learning to try to identify a particular type of relationship. Among these works, we can cite the seminal work of (Hearst, 92) on hyponyms as well as successor works such as (Snow et al., 2004), (Girju et al., 2006) for part-of relationships, or (Lin et al., 2003) for synonyms. In general, these classifiers use lexico-syntactic patterns and build models based on seed patterns. One of the most recent projects, Kleo (Kim et al., 2009), uses a semantic interpreter (built on dependency grammars) that handles a number of linguistic forms such as auxiliary verbs, noun compound, causal verbs, and definitional phrases through rules that create, for instance, causal relationships for causal verbs, or taxonomical links for definitional phrases. Kleo’s lexico-syntactic patterns lead to the creation of triples. Kleo builds on the results of Mobius (Barker et al., 2007), which uses the same kind of approach to extract triples from texts. Another interesting example for the extraction of relationships from texts is Espresso (Pantel & Pennacchiotti, 2008). Espresso uses a minimally supervised bootstrapping approach that starts with a set of seed instances for a particular relation (patterns) and learns surface broad-coverage patterns from the Web. These patterns are then ranked based on a metric called “reliability of a pattern” and the algorithm conserves the top k patterns as meaningful structures. These structures are then used to collect reliable instances of the patterns, thus leading to the creation of reliable relationships. One of the most studied relationships has been the taxonomical link (hypernym-hyponym). This interest might be due to the fact that these links are essential for further reasoning over the obtained structure. An interesting work (Snow et al., 2004) builds a hypernym classifier based on features extracted from dependency trees. Starting from the collection of noun pairs extracted from corpora, the system identifies examples of hypernyms between these noun pairs using WordNet. Next, the system collects the sentences in which both nouns occur, parse them and extract patterns from the dependency parse tree. Here these patterns are represented by the shortest paths of four links or less between any two nouns in a dependency tree. These features are organized as vectors that display the number of occurrences of a given shortest dependency path between pairs of words. These features are then labeled as positive or negative examples of the hypernymy relationships and are used to train the classifier. Another work (Cimiano et al., 2005) relies on a blended approach for learning taxonomies based mainly on lexico-syntactic patterns, the Web and WordNet. The lexico-syntactic patterns, derived from (Hearst, 1992), are based on regular expressions over part-of-speech tags. Any known hierarchical relationship between each two terms is ranked according to its frequency in the corpus (based on the frequency of the instantiated patterns between these two terms). To deal with the data scarcity problem, (Cimiano et al., 2005) uses the Web to gain more evidence on the hypernymy link. For each pair of terms, the lexico-syntactic patterns are instantiated and sent as queries to Google. 
The number of hits returned by Google constitutes an evidence of the existence of the hypernymy link. The internal structure of noun phrases (also called string inclusion) may also help to identify taxonomic relations (Velardi et al., 2007). (Cimiano et al., 2005) use supervised machine-learning techniques to derive an optimal combination of the different features. Finally, regarding concept (or semantic class) learning, many works try to induce semantic classes based on the notion of syntactic context similarity. (Lin & Pantel, 2001) proposed a clustering algorithm, UNICON, which creates lists of similar or related words while taking into account their various senses. UNICON relies on a collocation database where each word is linked to a feature vector composed of the dependency relationships that involve the word as well as their frequency of occurrence. These features are then used to compute the similarity between the words, based on the distributional hypothesis: the more syntactic contexts two words share, the more similar they are. In (Pantel & Lin, 2002), the same approach is reused by the Clustering by Committee algorithm. However, it is enhanced by relying on the notion of committees, i.e. sets of representative words for each class. Using these representative elements, (Pantel & Lin, 2002) are able to computer cluster centroids, which describe the members of a possible class in an unambiguous way. In general, the extraction of semantic classes using this kind of approach goes in pair with the extraction of taxonomical links. For example, (Pantel & Ravichandran, 2004) algorithm is based on co-occurrence statistics of semantic classes discovered by algorithms like Clustering by Committee to label their concepts. Once a class and its instances are discovered, it is then possible to create a hyponym relationship between the class and its instances.

5. ONTOLOGIZING THE KNOWLEDGE As we have seen, there are many approaches that rely on dependency grammars and lexico-syntactic patterns to extract relevant knowledge from texts. Some of these approaches may be purely NLP-based, but in general they rely on a mix of NLP and statistical/machine learning methods. Patterns may be organized as flat lists of structures such as what have been proposed by the majority of the works but also as hierarchies (Zouaq & Nkambou, 2009) or even grammars (Zouaq et al., 2010). The interest of these two structures is to avoid redundancy or inadequate understanding of some syntactic structures, but also to come up with formal representations of the language (here English). From the NLP side, the employed methods may be shallow methods, such as matching regular expressions over parts-of-speech (Cimiano et al., 2005) or deep syntactic methods that rely on a syntactic parser. Approaches may also be divided between those where patterns are completely hand-built (Zouaq et al., 2010) (Kim et al., 2009), and those which discover automatically these patterns (Pantel & Pennacchiotti, 2008) and evaluate their efficiency. This automatic discovery of patterns should generally be filtered in order to keep only the most accurate ones. Using the Web at this stage to discover new patterns and ascertain the correctness of the discovered ones is a very interesting approach. It provides statistical evidence regarding the correctness of a pattern that corpora are unable to provide due to their size (which cannot compete with the Web). As we have seen, the previous sections have dealt with the issue of semantic analysis and knowledge extraction from text. Many representation structures may result from semantic analysis, including concept maps, logical forms and discourse representation structures. These representations generally stay at the lexical level, meaning that their basic elements are lexical items, or they might be indexed by semantic roles, in case an ontology/dictionary/lexicon is given to annotate the representations. Therefore, there is a need of discovering structures, hierarchies and clusters in the extracted representations for an effective learning of the ontology (Section 4.4. has already offered a glimpse over this issue).

5.1. The “Ontologization” Task The difficulty lies here in the ability to build a bridge between the language level and the abstract/conceptual level. This is referred to as the ontologization of the extracted knowledge. In fact, there are many reasons for considering this step as the most important one for ontology learning. First, the ability to extract “concepts” from factual data and linking them through conceptual relationships is vital for the production of ontological classes and properties with domain and ranges. Second, most of the automatic approaches for semantic analysis and knowledge extraction generate a lot of noisy data, due to improper syntactic or semantic analysis, or due to a noisy corpus (language issues for e.g.), etc. This ontologization may then be considered as a filtering step. There have been a raise of interest regarding this issue with many researchers e.g. (Barker et al., 2007) (Pantel & Pennacchiotti, 2008) (Zouaq & Nkambou, 2009) trying to come up with methods and metrics to solve the problem. Basically, these methods may be divided into methods built on statistics and machine learning and methods based on metrics.

5.2. Ontologization through Statistics and Machine Learning In general, this notion of ontologization may cover two different situations: the ability to find structures/classes in the representations, or the ability to relate the representations to available ontologies. For e.g. Espresso (Pantel & Pennacchiotti, 2008) focuses on the ontologization of extracted relations. For any relation (x; r; y) between terms x and y, the ontologization task is to identify the WordNet senses of x and y where r holds. From a set of reliable patterns, the algorithm uses a clustering method for ontologizing the extracted relations. (Pantel & Pennacchiotti, 2008) give the following example: Given two instances of the “part-of” relationship: (second section, PART-OF, Los Angeles-area news) and (Sandag study, PART-OF, report ), the algorithm categorizes them into a common semantic cluster [writing#2,PART-OF,message#2 ]. This categorization uses a metric based on WordNet taxonomy (writing#2 isa hypernym of both section and study, and message#2 isa hypernym of news and report ) and frequency scores that enable to select the best cluster for each instance. Kleo (Kim et al., 2009) also relies on an established knowledge base, the Component Library and WordNet, to ontologize the knowledge. There are also highly supervised machine learning approaches for the task of ontologizing relationships. These approaches suffer from their dependence upon hand-labeled examples, but they may lead to very accurate results. For example, (Gurju et al., 2006) proposes a machine-learning algorithm that learns selectional restrictions from a set of examples. These restrictions are then used to filter the correct patterns. In general, all the clustering algorithms may be considered in this step.

5.3. Ontologization through metrics The ontologization through metrics might come as a complement to the previous one (based on machine- learning) or it might be used alone. Many metrics are based on the notion of semantic distance and many algorithms calculate this distance. For example, Espresso (Pantel & Pennacchiotti, 2008) uses also the anchor approach for identifying the ontologized relations. The anchor approach is based on the distributional similarity principle, which states that words which occur in same contexts will tend to have similar meanings. Given an instance (x;r;y), the algorithm fixes first the term y (the anchor) and finds all the set T of other terms that are related by r with y. Based on a similarity metric that makes use of the WordNet taxonomy, the algorithm disambiguates x using the T set. (Basili et al., 2000) also uses a metric called conceptual density to infer the semantic classes of terms involved in a verbal relationship. Again, these semantic classes represent generalizations of the verb arguments using WordNet taxonomy, and they produce a set of possible patterns for each verb. There are also efforts to find concepts using metrics from graph theory. For example, (Zouaq et al., 2009) identify the out-degree of a lexical item (the number of edge whose source is the item) as an important feature and propose to select all the concepts whose out- degree is superior to a given threshold.

6. OPEN ISSUES AND FUTURE RESEARCH DIRECTIONS Despite the large number of research works devoted to the ontology learning task, there are still many issues that are not well-covered. Among these issues are the problem of knowledge integration and the lack of deep semantic analysis approaches. As outlined by (Kim et al., 2009) and (Barker et al., 2007), many extracted structures are just relational triples (Banko et al., 2007) that are more lexically-based than conceptual. The notion of ontologizing the knowledge is a step towards this goal but it requires a more thorough approach. To my knowledge, there are very few, if any, works that use one of the techniques and grammars mentioned in section 3.2.2 to understand the meaning of texts. While there has been a significant move towards deep syntactic analysis during these last years, the same conclusion does not hold regarding shallow versus deep semantic analysis. However, these approaches are the only way to build really interesting ontologies reflecting the content of texts and to integrate information from multiple sources into a single coherent unit. An interest in these questions can be observed in (Barker et al., 2007), where the envisaged future work lists many of the required semantic analysis steps. In Kleo (Kim et al., 2009), the idea of knowledge integration is evoked and relies on graph-matching techniques and on an existing knowledge base to link the new extracted knowledge to the already available one. However, I believe that relying on an existing huge knowledge base is a wrong assumption, simply because such initiatives have been taken since the beginning of artificial intelligence (Cyc, etc.) and have not proven their interest due to the huge amount of required time and efforts, and due to scalability issues. Despite this aspect, a very interesting proposition of their work is to use abduction to add new propositions and to recover some linkage between the output logical forms, thus contributing to knowledge integration. Another concern is related to the accuracy of the syntactic parses, especially when parsers are confronted with the diversity of Web texts. Despite major advances in the field, one cannot count on entirely correct parses. Semantic analysis systems should then incorporate mechanisms for filtering and checking the accuracy of the syntactic aspects (Banko et al., 2007) and the semantic aspects. One way to do this is to rely on redundancy across multiple texts (Wilks, 2008), and the Web, which might provide a statistical evidence of the interest of a given pattern, or extracted data. Other research works advocate the necessity of reasoning over inconsistent ontologies or repairing the inconsistency of learned ontologies (Haase & Volker, 2008) (Qi et al., 2008) which may also be an interesting avenue to further explore. There are also key questions that should be solved: how do we locate incorrect or too general axioms? How do we check property inheritance consistency? The last question I would like to discuss is whether the Web can be used for ontology learning if deep semantic techniques such as the ones presented here are to be exploited. For a domain specific ontology, semantic analysis systems must require the use of a dedicated corpus for the initial ontological extraction. This should avoid the risk of having false information and should provide a solid ground for the ontology. 
A specialized corpus is also more likely to have correct grammatical sentences which should be better handled by syntactic parsers. However, the idea of exploiting the Web should not be put aside. Definitely, semantic analysis systems should use the Web coupled with statistical and machine learning methods, as a way to ascertain the reliability of their patterns, learn new patterns, filter the extracted representations, cope with the data-sparseness problem and provide statistical evidences. The Web can also be of great interest for the kind of approaches described in the shallow semantic analysis section. In fact, it is worth noting that, if statistical NLP-based approaches are to be considered, then the Web seems to be a perfect ground for powerful language models and information extraction (Wilks & Brewster, 2009).

7. CONCLUSION This chapter presented a quick overview over NLP methods for ontology learning, ranging from very shallow techniques to deep analysis methods. It also focused on lexico-syntactic patterns based on dependency grammars that should be considered as attempts towards deeper semantic analysis. Computational semantics is built around efforts from computational linguistics, knowledge representation and knowledge reasoning. Knowledge representations such as Discourse Representation Structures should certainly be of great interest for a better handling of text meaning. This should be useful not only in the context of the Semantic Web and ontology acquisition, but also as a general framework for text understanding. Moreover, the reasoning capabilities of ontologies, which are a little neglected until now, should be further enhanced based on deeper analysis. The NLP methods produce semantic representations that may be ultimately exported into a domain ontology. However, this implies that filtering and ranking are performed over the obtained representations to identify important concepts and relationships. This is the objective of the ontologization task, which should be considered as equally important as the NLP process. This ontologization should be based on statistical methods and metrics. In fact, NLP and statistical and machine learning are equally essential to ontology learning and may benefit to each step of the ontology learning stages, as has been shown in this paper. Finally, using the Web as a corpus for building language models, filtering patterns, ascertaining the quality of the extracted representations and dealing with the data scarcity problem should certainly be considered as a plus. In fact, the Web is a huge resource for language usages. In this respect, it complements NLP-based approaches which are built on lexico-syntactic rules but may not be exhaustive. Further efforts should be done to exploit the whole potential of NLP based on recent progress in the field and on the maturity of the available syntactic and semantic tools. Such efforts should certainly greatly benefit ontology learning in particular and the Semantic Web in general.

ACKNOWLEDGMENTS The author would like to thank Prof. Michel Gagnon for very interesting discussions on deep natural language processing techniques. My thanks also go to Prof. Dragan Gasevic for his comments on the paper and to anonymous reviewers. The author is supported by a postdoctoral grant from the FQRNT.

REFERENCES
Abney, S. P. (1991). Parsing by chunks. In S. P. Abney, R. C. Berwick & C. Tenny (Eds.), Principle-based Parsing: Computation and Psycholinguistics, pp. 257–278. Kluwer, Dordrecht.
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of COLING-ACL, Montreal, Canada.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M. & Etzioni, O. (2007). Open information extraction from the Web. In Proceedings of IJCAI 2007, pp. 68-74, ACM.
Barker, K., Agashe, B., Chaw, S. Y., Fan, J., Glass, M., Hobbs, J., Hovy, E., Israel, D., Kim, D. S., Mulkar, R., Patwardhan, S., Porter, B., Tecuci, D., & Yeh, P. (2007). Learning by reading: A prototype system, performance baseline and lessons learned. In Proceedings of the Twenty-Second National Conference on Artificial Intelligence, pp. 280-286, AAAI Press.
Basili, R., Pazienza, M. T. & Vindigni, M. (2000). Corpus-driven learning of event recognition rules. In Proceedings of the Workshop on Machine Learning for Information Extraction held in conjunction with the 14th European Conference on Artificial Intelligence (ECAI-00), Berlin, Germany.
Bender, E. M., Sag, I. A. & Wasow, T. (2003). Syntactic Theory: A Formal Introduction, Second Edition, Instructor's Manual, CSLI Publications. Retrieved July 23, 2010, from http://hpsg.stanford.edu/book/slides/index.html
Bos, J. (2008a). Introduction to the Shared Task on Comparing Semantic Representations. In J. Bos & R. Delmonte (Eds.), Semantics in Text Processing. STEP 2008 Conference Proceedings, Volume 1 of Research in Computational Semantics, pp. 257–261. College Publications.
Bos, J. (2008b). Wide-Coverage Semantic Analysis with Boxer. In J. Bos & R. Delmonte (Eds.), Semantics in Text Processing. STEP 2008 Conference Proceedings, Volume 1 of Research in Computational Semantics, pp. 277–286. College Publications.
Bos, J. (2008c). Computational Semantics and Knowledge Engineering. In Proceedings of EKAW 2008, pp. 4-5.
Briscoe, E. & Carroll, J. (2006). Evaluating the accuracy of an unlexicalized statistical parser on the PARC DepBank. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, pp. 41-48.
Buitelaar, P. & Cimiano, P. (Eds.) (2008). Ontology Learning and Population: Bridging the Gap between Text and Knowledge, Frontiers in Artificial Intelligence and Applications Series, Vol. 167, IOS Press.
Buitelaar, P., Cimiano, P. & Magnini, B. (2005). Ontology Learning from Text: An Overview. In P. Buitelaar, P. Cimiano & B. Magnini (Eds.), Ontology Learning from Text: Methods, Applications and Evaluation, pp. 3-12, IOS Press.
Bunescu, R., & Mooney, R. (2005). A shortest path dependency kernel for relation extraction. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 724–731, ACL.
Church, K. W. & Hanks, P. (1990). Word Association Norms, Mutual Information and Lexicography. In Proceedings of the 27th Annual Conference of the Association for Computational Linguistics, pp. 22-29, MIT Press.
Cimiano, P., Pivk, A., Schmidt-Thieme, L. & Staab, S. (2005). Learning Taxonomic Relations from Heterogeneous Sources of Evidence. In P. Buitelaar, P. Cimiano & B. Magnini (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications, pp. 59-73, IOS Press.
Cimiano, P. & Völker, J. (2005). Text2Onto. In Proceedings of NLDB 2005, pp. 227-238.
De Marneffe, M.-C., MacCartney, B. & Manning, C. D. (2006). Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of LREC, pp. 449-454, ELRA.
English, J. & Nirenburg, S. (2007). Ontology Learning from Text Using Automatic Ontological-Semantic Text Annotation and the Web as the Corpus. In Proceedings of the AAAI 2007 Spring Symposium Series on Machine Reading. Last retrieved from http://www.aaai.org/Papers/Symposia/Spring/2007/SS-07-06/SS07-06-008.pdf
Etzioni, O., Cafarella, M. J., Downey, D., Kok, S., Popescu, A., Shaked, T., Soderland, S., Weld, D. S., & Yates, A. (2004). Web-scale information extraction in KnowItAll (preliminary results). In Proceedings of WWW, pp. 100-110.
Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.
Gildea, D. & Jurafsky, D. (2002). Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3): 245-288.
Girju, R., Badulescu, A. & Moldovan, D. (2006). Automatic discovery of part-whole relations. Computational Linguistics, 32(1): 83–135.
Giuglea, A. & Moschitti, A. (2006). Shallow Semantic Parsing Based on FrameNet, VerbNet and PropBank. In Proceedings of the 17th European Conference on Artificial Intelligence, pp. 563-567, IOS Press.
Greenwood, M. A., Stevenson, M., Guo, Y., Harkema, H., & Roberts, A. (2005). Automatically acquiring a linguistically motivated genic interaction extraction system. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05), Bonn, Germany.
Guarino, N. (1998). Formal Ontology in Information Systems. IOS Press.
Gruber, T. (1993). A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2): 199-220.
Haase, P. & Völker, J. (2008). Ontology Learning and Reasoning - Dealing with Uncertainty and Inconsistency. In Proceedings of the Workshop on Uncertainty Reasoning for the Semantic Web, pp. 366-384, Springer Berlin / Heidelberg.
Hearst, M. A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of COLING, pp. 539-545.
Jurafsky, D. & Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Prentice-Hall.
Kamp, H. & Reyle, U. (1993). From Discourse to Logic: Introduction to Model-theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Studies in Linguistics and Philosophy.
Kim, D., Barker, K. & Porter, B. (2009). Knowledge integration across multiple texts. In Proceedings of the Fifth International Conference on Knowledge Capture (KCAP2009), pp. 49-56, ACM.
Klein, D. & Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
Kübler, S., McDonald, R. T. & Nivre, J. (2009). Dependency Parsing. Morgan & Claypool Publishers.
Lin, D., Zhao, S., Qin, L. & Zhou, M. (2003). Identifying Synonyms among Distributionally Similar Words. In Proceedings of IJCAI-03, pp. 1492-1493.
Lin, D. & Pantel, P. (2001). Induction of Semantic Classes from Natural Language Text. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, pp. 317-322, ACM.
Li, X. & Roth, D. (2001). Exploring evidence for shallow parsing. In Proceedings of the 2001 Workshop on Computational Natural Language Learning, Annual Meeting of the ACL, pp. 1-7, ACL.
Maedche, A. & Staab, S. (2001). Ontology learning for the Semantic Web. IEEE Intelligent Systems, 16(2): 72-79.
Maedche, A. & Volz, R. (2001). The ontology extraction and maintenance framework Text-To-Onto. In Proceedings of the Workshop on Integrating Data Mining and Knowledge Management.
McDowell, L. & Cafarella, M. J. (2008). Ontology-driven, unsupervised instance population. Journal of Web Semantics, 6(3): 218-236.
Mihalcea, R., Chklovski, T. & Kilgarriff, A. (2004). The Senseval-3 English Lexical Sample Task. In Proceedings of Senseval-3, pp. 25-28, Spain.
Nanas, N., Uren, V., & de Roeck, A. (2003). A Comparative Study of Term Weighting Methods for Information Filtering. KMi Technical Report No. KMI-TR-128.
Nivre, J. (2005). Last retrieved from http://stp.ling.uu.se/~nivre/docs/05133.pdf
Pantel, P. & Pennacchiotti, M. (2008). Automatically Harvesting and Ontologizing Semantic Relations. In P. Buitelaar & P. Cimiano (Eds.), Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 171-195, IOS Press.
Pantel, P. & Ravichandran, D. (2004). Automatically Labeling Semantic Classes. In Proceedings of Human Language Technology / North American Association for Computational Linguistics, pp. 321-328, ACL.
Pantel, P. & Lin, D. (2002). Discovering Word Senses from Text. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2002, pp. 613-619, Edmonton, Canada.
Pease, A., Niles, I., & Li, J. (2002). The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and its Applications. In Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web, pp. 7-10, AAAI Press.
Qi, G., Haase, P., Huang, Z., Ji, Q., Pan, J. Z. & Völker, J. (2008). A Kernel Revision Operator for Terminologies - Algorithms and Evaluation. In Proceedings of the International Semantic Web Conference, pp. 419-434, Springer-Verlag.
Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5): 515–523.
Sag, I., Wasow, T. & Bender, E. (2003). Syntactic Theory: A Formal Introduction, 2nd Edition, CSLI.
Snow, R., Jurafsky, D. & Ng, A. Y. (2004). Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems, 17: 1297-1304.
Stevenson, M. & Greenwood, M. A. (2009). Dependency Pattern Models for Information Extraction. Journal of Research on Language & Computation, 7(1): 13-39, Springer.
Sudo, K., Sekine, S., & Grishman, R. (2003). An improved extraction pattern representation model for automatic IE pattern acquisition. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), Sapporo, Japan, pp. 224–231.
Sudo, K., Sekine, S., & Grishman, R. (2001). Automatic pattern acquisition for Japanese information extraction. In Proceedings of Human Language Technology, pp. 1-7, ACL.
Velardi, P., Cucchiarelli, A., & Petit, M. (2007). A Taxonomy Learning Method and Its Application to Characterize a Scientific Web Community. IEEE Transactions on Knowledge and Data Engineering, 19(2): 180-191.
Wilks, Y. & Brewster, C. (2009). Natural Language Processing as a Foundation of the Semantic Web. Foundations and Trends in Web Science, 1(3-4): 199-327, Now Publishers Inc.
Wilks, Y. (2008). The Semantic Web: Apotheosis of annotation, but what are its semantics? IEEE Intelligent Systems, 23(3): 41-49.
Yangarber, R. (2003). Counter-training in the discovery of semantic patterns. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), pp. 343–350, ACL.
Zhang, Z., Iria, J., Brewster, C. & Ciravegna, F. (2008). A Comparative Evaluation of Term Recognition Algorithms. In Proceedings of the Sixth International Language Resources and Evaluation Conference, pp. 2108-2113, ELRA.
Zouaq, A., Gagnon, M. & Ozell, B. (2010). Semantic Analysis using Dependency-based Grammars and Upper-Level Ontologies. International Journal of Computational Linguistics and Applications, 1(1-2): 85-101, Bahri Publications.
Zouaq, A. & Nkambou, R. (2009). Evaluating the Generation of Domain Ontologies in the Knowledge Puzzle Project. IEEE Transactions on Knowledge and Data Engineering, 21(11): 1559-1572.
Zouaq, A. (2008). An Ontological Engineering Approach for the Acquisition and Exploitation of Knowledge in Texts. PhD Thesis, University of Montreal (in French).

ADDITIONAL READING SECTION
Brewster, C., Jupp, S., Luciano, J. S., Shotton, D., Stevens, R. D., & Zhang, Z. (2009). Issues in learning an ontology from text. BMC Bioinformatics, 10(S-5).
Callaway, C. B. (2008). The TextCap Semantic Interpreter. In J. Bos & R. Delmonte (Eds.), Semantics in Text Processing. STEP 2008 Conference Proceedings, Volume 1 of Research in Computational Semantics, pp. 327–342. College Publications.
Cimiano, P. (2009). Flexible semantic composition with DUDES. In H. Bunt, V. Petukhova & S. Wubben (Eds.), Proceedings of the Eighth International Conference on Computational Semantics, pp. 272-276, Association for Computational Linguistics, Morristown, NJ.
Cimiano, P., Reyle, U., & Šarić, J. (2005). Ontology-driven discourse analysis for information extraction. Data & Knowledge Engineering, 55(1): 59-83.
Clark, P. & Harrison, P. (2008). Boeing's NLP System and the Challenges of Semantic Representation. In J. Bos & R. Delmonte (Eds.), Semantics in Text Processing. STEP 2008 Conference Proceedings, Volume 1 of Research in Computational Semantics, pp. 263–276. College Publications.
Copestake, A., Flickinger, D., Sag, I. A. & Pollard, C. (2005). Minimal Recursion Semantics: An introduction. Journal of Research on Language and Computation, 3(2–3): 281–332.
Davulcu, H., Vadrevu, S., Nagarajan, S., & Ramakrishnan, I. V. (2003). OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites. IEEE Intelligent Systems, 18(5): 24-33.
Dellschaft, K. & Staab, S. (2008). Strategies for the Evaluation of Ontology Learning. In P. Buitelaar & P. Cimiano (Eds.), Ontology Learning and Population: Bridging the Gap between Text and Knowledge, Frontiers in Artificial Intelligence and Applications, Vol. 167, pp. 253-272, IOS Press.
Navigli, R., Velardi, P. & Gangemi, A. (2003). Ontology Learning and Its Application to Automated Terminology Translation. IEEE Intelligent Systems, 18(1): 22-31.
Sánchez, D. & Moreno, A. (2008). Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering, 64(3): 600-623.
Stevenson, M. & Greenwood, M. A. (2005). A semantic approach to IE pattern induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 379–386, ACL.
Völker, J., Hitzler, P. & Cimiano, P. (2007). Acquisition of OWL DL Axioms from Lexical Resources. In Proceedings of the 4th European Semantic Web Conference, pp. 670-685, Springer.
Wong, W. (2009). Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge. PhD Thesis, University of Western Australia.
Zouaq, A. & Nkambou, R. (2008). Building Domain Ontologies from Text for Educational Purposes. IEEE Transactions on Learning Technologies, 1(1): 49-62.
KEY TERMS & DEFINITIONS
Ontology: an explicit conceptualization that identifies classes, taxonomical links, relationships and axioms.
Ontology learning: designates the process of acquiring an ontology from a knowledge source.
Natural language processing: designates the various techniques that parse natural language to extract representations usable by machines.
Shallow syntactic analysis: includes shallow methods such as part-of-speech tagging and phrase chunking.
Shallow semantic analysis: designates methods that rely on shallow analyses to extract text meaning, including simple patterns such as regular expressions.
Deep semantic analysis: builds on work from the computational semantics community and deals with complex linguistic phenomena such as co-reference resolution.
Lambda calculus: can be considered as a glue language for combining semantic representations in a systematic way (a small worked example follows these definitions).
Discourse Representation Theory (DRT): a theory that builds discourse-level structures and deals, among other things, with anaphora resolution.
Discourse Representation Structures: the structures output by DRT.
Computational semantics: the discipline that aims at producing deep, formal semantic representations from natural language input.
Dependency grammars: model the syntactic structure of a sentence as grammatical relations between related pairs of words.
Lexicon: the vocabulary of a given language, together with information identifying how each word is used.
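To make the lambda calculus entry concrete, here is a minimal, purely illustrative sketch; the lexical entries for "every man" and "sleeps" and the predicates man and sleep are hypothetical examples, not drawn from any particular grammar or tool described in this chapter. Each constituent is assigned a lambda term, and function application followed by beta-reduction composes these terms into a first-order representation of the sentence "Every man sleeps":

% Hypothetical lexical entries (illustration only)
\[
\textit{every man} \;\mapsto\; \lambda P.\, \forall x\, \big(\mathit{man}(x) \rightarrow P(x)\big)
\qquad
\textit{sleeps} \;\mapsto\; \lambda y.\, \mathit{sleep}(y)
\]
% Composition by function application and beta-reduction
\[
\Big(\lambda P.\, \forall x\, \big(\mathit{man}(x) \rightarrow P(x)\big)\Big)\big(\lambda y.\, \mathit{sleep}(y)\big)
\;\Rightarrow_{\beta}\;
\forall x\, \big(\mathit{man}(x) \rightarrow \mathit{sleep}(x)\big)
\]

The resulting formula roughly corresponds to the Discourse Representation Structure that DRT would build for the same sentence, and it is this kind of logical form that deep semantic analysis can make available to the ontologization step.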