Ontology Learning and Its Application to Automated Terminology Translation
Natural Language Processing

Roberto Navigli and Paola Velardi, Università di Roma La Sapienza
Aldo Gangemi, Institute of Cognitive Sciences and Technology

IEEE Intelligent Systems. 1094-7167/03/$17.00 © 2003 IEEE. Published by the IEEE Computer Society.

The OntoLearn system for automated ontology learning extracts relevant domain terms from a corpus of text, relates them to appropriate concepts in a general-purpose ontology, and detects taxonomic and other semantic relations among the concepts. The authors used it to automatically translate multiword terms from English to Italian.

Although the IT community widely acknowledges the usefulness of domain ontologies, especially in relation to the Semantic Web [1,2], we must overcome several barriers before they become practical and useful tools. Thus far, only a few specific research environments have ontologies. (The "What Is an Ontology?" sidebar on page 24 provides a definition and some background.) Many in the computational-linguistics research community use WordNet [3], but large-scale IT applications based on it require heavy customization.

Thus, a critical issue is ontology construction: identifying, defining, and entering concept definitions. In large, complex application domains, this task can be lengthy, costly, and controversial, because people can have different points of view about the same concept. Two main approaches aid large-scale ontology construction. The first one facilitates manual ontology engineering by providing natural language processing tools, including editors, consistency checkers, mediators to support shared decisions, and ontology import tools. The second approach relies on machine learning and automated language-processing techniques to extract concepts and ontological relations from structured and unstructured data such as databases and text. Few systems exploit both approaches. The first approach predominates in most development toolsets such as Kaon, Protégé, Chimaera, and WebOnto, but some systems also implement machine learning techniques [1].

Our OntoLearn system is an infrastructure for automated ontology learning from domain text. It is the only system, as far as we know, that uses natural language processing and machine learning techniques, and is part of a more general ontology engineering architecture [4,5]. Here, we describe the system and an experiment in which we used a machine-learned tourism ontology to automatically translate multiword terms from English to Italian. The method can apply to other domains without manual adaptation.

OntoLearn architecture

Figure 1 shows the elements of the architecture. Using the Ariosto language processor [6], OntoLearn extracts terminology from a corpus of domain text, such as specialized Web sites and warehouses or documents exchanged among members of a virtual community. It then filters the terminology using natural language processing and statistical techniques that perform comparative analysis across different domains, or contrastive corpora. Next, we use the WordNet and SemCor [3] lexical knowledge bases to semantically interpret the terms. We then relate concepts according to taxonomic (kind-of) and other semantic relations, generating a domain concept forest. We use WordNet and our rule-based inductive-learning method to extract such relations.

Finally, OntoLearn integrates the domain concept forest with WordNet to create a pruned and specialized view of the domain ontology. We then use ontology editing, validation, and management tools [4,5] that are part of the more general architecture to enrich, correct, and update the generated ontology.

[Figure 1. The OntoLearn architecture. Components shown: domain corpus; extraction of candidate terminology (Ariosto language processor); filtering of domain terminology (contrastive corpora); semantic term interpretation (WordNet and SemCor); identification of taxonomic and similarity relations; identification of other semantic relations (machine-learned rule base); ontology integration; domain concept forest; domain ontology.]

The main novel aspect of our machine learning method is semantic interpretation. We identify the right senses (concepts) for complex domain term components and the semantic relations between them. For example, the result of this process on the compound transport company selects the sense "company-enterprise" for company (as opposed to the "social gathering" or "crew" senses) and the sense "commercial enterprise" for transport (as opposed to the "exchange of molecules," "overwhelming emotion," and "transport of magnetic tape" senses), and associates a purpose relation (a company for transporting goods and people) between the two concepts.

We know of nothing similar in the ontology learning literature. Many described methods first extract domain terms using various statistical methods, then detect the taxonomic and other types of relations between terms. The literature [1, 7-9] uses the notions of domain term and domain concept interchangeably, but no semantic interpretation of terms actually takes place. For example, the concept digital printing technology is considered a kind of printing technology by virtue of simple string inclusion [7]. However, printing has four senses in WordNet and technology has two.
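To make the size of this interpretation space concrete, the candidate readings of a compound can be enumerated as the Cartesian product of its components' sense lists. The following is a minimal Python sketch, not part of OntoLearn; the sense labels are hypothetical placeholders rather than actual WordNet synset identifiers, and only the counts (four senses and two senses) come from the text.

```python
from itertools import product

# Hypothetical placeholder labels, NOT real WordNet synset names.
# Only the counts (four senses and two senses) are taken from the text.
printing_senses = ["printing#1", "printing#2", "printing#3", "printing#4"]
technology_senses = ["technology#1", "technology#2"]


def sense_combinations(*sense_lists):
    """Return every way of choosing one sense per component word."""
    return list(product(*sense_lists))


combos = sense_combinations(printing_senses, technology_senses)
print(len(combos))  # prints 8
for printing_sense, technology_sense in combos:
    print(printing_sense, technology_sense)
```

Each pair is one candidate interpretation of the compound; semantic interpretation amounts to selecting the appropriate pair among them.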
The WordNet lexical ontology thus contains eight possible concept combinations for printing technology. Identifying the correct sense combinations is important in determining the taxonomic relations between a new domain concept and an existing ontology. Furthermore, semantic interpretation is relevant to the semantic indexing and retrieval of documents. For example, a query for hotel facility could retrieve documents including concepts that are related by a kinship relation, such as swimming pool or conference room. Semantic interpretation is also useful for automatic translation, as we later demonstrate. Knowing the right senses of the complex term components in the source language, given a bilingual dictionary, greatly simplifies the problem of selecting the correct translation for each component in the target language.

OntoLearn has three main phases: terminology extraction, semantic interpretation, and creation of a specialized view of WordNet.

Phase 1: Terminology extraction

Terminology can be considered the surface appearance of relevant domain concepts. Candidate terminological expressions are usually captured with shallow techniques that range from stochastic methods to more sophisticated syntactic approaches.

Richer syntactic information improves the quality of the candidates passed on to statistical filtering. We use Ariosto to parse documents in the application domain to extract a list T_c of syntactically plausible terminological candidates. Examples include compounds (credit card), adjective-nouns (public transport), and base noun phrases (board of directors).

High frequency in a corpus is a property observable for terminological as well as nonterminological expressions, such as last week. We measure the specificity of a terminology candidate with respect to the target domain via comparative analysis across different domains. To this end, we define a specific domain relevance (DR) score. A quantitative definition of the DR can be given according to the amount of information captured in the target corpus relative to the entire collection of corpora. More precisely, given a set of n domains {D_1, ..., D_n}, the domain relevance of a term t in class D_k is computed as

\[ DR_{t,k} = \frac{P(t \mid D_k)}{\sum_{j=1}^{n} P(t \mid D_j)}, \]

where P(t | D_k) is estimated by

\[ E\big(P(t \mid D_k)\big) = \frac{f_{t,k}}{\sum_{t' \in D_k} f_{t',k}}, \]

where f_{t,k} is the frequency of term t in the domain D_k. For instance, with hypothetical values, a term whose estimated probability is 0.004 in the target domain and 0.001 in each of three contrastive domains has DR = 0.004 / 0.007 ≈ 0.57.

A second filter operates on the principle that a term cannot be a clue for a domain D_k unless it appears in several documents; there must be some consensus on using that term in the domain D_k. The domain consensus (DC) of term t in class D_k captures those terms that appear frequently across a given domain's documents. DC is defined as

January/February 2003, computer.org/intelligent

What Is an Ontology?

A domain ontology seeks to reduce or eliminate conceptual and terminological confusion among the members of a user community who need to share various kinds of electronic documents and information. It does so by identifying and properly defining a set of relevant concepts that characterize a given application domain, say, for travel agents or medical practitioners. An ontology specifies a shared understanding of a domain. It contains a set of generic concepts (such as "object," "process," "accommodation," and "single room"), together with their definitions and interrelationships. The construction of its unifying conceptual framework fosters communication and cooperation among people, better enterprise organization, and system interoperability. It also provides such system-engineering benefits as reusability, reliability, and specification.

[Figure A. The three levels of generality in a domain ontology: foundational ontology, core ontology, and specific domain ontology.]

Ontologies can have different degrees of formality, but they must include metadata such as concepts, relations, axioms, instances, or terms that lexicalize concepts. From the terminological viewpoint, an ontology can even be seen

top ontology. The result, the core ontology, usually includes a few hundred application-domain concepts. Although many projects eventually succeed in defining a core domain ontol-