
Robust Ontology Acquisition from Machine-Readable Dictionaries

Eric Nichols
Nara Inst. of Science and Technology
Nara, Japan
[email protected]

Francis Bond
NTT Communication Science Labs
Nippon Telegraph and Telephone Co.
Keihanna, Japan
[email protected]

Daniel Flickinger
CSLI, Stanford University
California, U.S.A.
[email protected]

Abstract

In this paper, we outline the development of a system that automatically constructs ontologies by extracting knowledge from dictionary definition sentences using Robust Minimal Recursion Semantics (RMRS), a semantic formalism that permits underspecification. We show that by combining deep and shallow parsing resources through the common formalism of RMRS, we can extract ontological relations in greater quality and quantity. Our approach also has the advantages of requiring a very small number of rules and being easily adaptable to any language with RMRS resources.

1 Introduction

Ontologies are an important resource in natural language processing. They have been shown to be useful in tasks such as machine translation, question answering, and word-sense disambiguation, among others, where information about the relationships and similarity of words can be exploited. While there are large, hand-crafted ontologies available for several languages, such as WordNet for English [Fellbaum, 1998] and GoiTaikei for Japanese [Ikehara et al., 1997], these resources are difficult to construct and maintain entirely by hand. They are, however, of proven utility in many NLP tasks, such as PP-attachment, where results using WordNet approach human accuracy (88.1% vs 88.2%), while the best methods using automatically constructed hierarchies still lag behind (at 84.6%) [Pantel and Lin, 2000]. Therefore, there is still a need to improve methods of both fully-automated and supervised construction of ontologies.

There is a great deal of work on the creation of ontologies from machine-readable dictionaries (a good summary is [Wilkes et al., 1996]), mainly for English. Recently, there has also been interest in Japanese [Tsurumaru et al., 1991; Tokunaga et al., 2001; Bond et al., 2004]. Most approaches use either a specialized parser or a set of regular expressions tuned to a particular dictionary, often with hundreds of rules. In this paper, we take advantage of recent advances in both deep parsing and semantic representation to combine general-purpose deep and shallow parsing technologies with a simple special relation extractor.

Our basic approach is to parse definition sentences with multiple shallow and deep processors, generating semantic representations of varying specificity. The semantic representation used is robust minimal recursion semantics (RMRS: Section 2.2). We then extract ontological relations using the most informative semantic representation for each definition sentence.

In this paper we discuss the construction of an ontology for Japanese using the Japanese Semantic Database Lexeed [Kasahara et al., 2004]. The deep parser uses the Japanese Grammar JACY [Siegel and Bender, 2002] and the shallow parser is based on the morphological analyzer ChaSen.

We carried out two evaluations. The first gives an automatically obtainable measure, verifying the existence of the extracted ontological relations in the existing WordNet [Fellbaum, 1998] and GoiTaikei [Ikehara et al., 1997] ontologies. The second is a small-scale human evaluation of the results.

2 Resources

2.1 The Lexeed Semantic Database of Japanese

The Lexeed Semantic Database of Japanese is a machine-readable dictionary that covers the most common words in Japanese [Kasahara et al., 2004]. It is built based on a series of psycholinguistic experiments where words from two existing machine-readable dictionaries were presented to multiple subjects who ranked them on a familiarity scale from one to seven, with seven being the most familiar [Amano and Kondo, 1999]. Lexeed consists of all open-class words with a familiarity greater than or equal to five. An example entry for the word doraibā "driver" is given in Figure 1, with English glosses added. The underlined material was not in Lexeed originally; we add it in this paper. doraibā "driver" has a familiarity of 6.55 and three senses. Lexeed has 28,000 words divided into 46,000 senses and defined with 75,000 definition sentences.

Useful hypernym relations can also be extracted from large corpora using relatively simple patterns (e.g., [Pantel et al., 2004]). However, even a large newspaper corpus does not include all the familiar words of a language, let alone those words occurring in useful patterns [Amano and Kondo, 1999]. Therefore it makes sense to extract data from machine-readable dictionaries (MRDs).

[Figure 1 shows the Lexeed entry for doraibā "driver" (noun, familiarity 6.5 on the 1–7 scale): Sense 1 "A tool for inserting and removing screws", hypernym dōgu "tool", semantic class ⟨942:tool⟩ (⊂ 893:equipment), WordNet screwdriver1 (⊂ tool1); Sense 2 "Someone who drives a car", hypernym hito "person", semantic class ⟨292:driver⟩ (⊂ 4:person), WordNet driver1 (⊂ person1); Sense 3 "In golf, a long-distance club. A number one wood.", hypernym kurabu2 "club", domain gorufu1 "golf", WordNet driver5 (⊂ club5).]

Figure 1: Entry for the word doraibā "driver" from Lexeed (with English glosses)
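To make the structure of such an entry concrete, the following is a minimal sketch (not Lexeed's actual distribution format) of how an entry like the one in Figure 1 might be represented in code; all class and field names are our own illustrative choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sense:
    """One sense of a Lexeed headword (illustrative field names)."""
    definition: str                       # definition sentence(s)
    hypernym: Optional[str] = None        # genus term, e.g. "hito (person)"
    semantic_class: Optional[str] = None  # GoiTaikei class, e.g. "292:driver"
    wordnet: Optional[str] = None         # linked WordNet sense, e.g. "driver1"

@dataclass
class LexeedEntry:
    headword: str                         # e.g. "doraibā"
    pos: str                              # part of speech, e.g. "noun"
    familiarity: float                    # familiarity score on the 1-7 scale
    senses: List[Sense] = field(default_factory=list)

# The second sense of doraibā "driver" from Figure 1, as an example:
driver = LexeedEntry(
    headword="doraibā", pos="noun", familiarity=6.5,
    senses=[Sense(definition="jidōsha wo unten suru hito (a person who drives a car)",
                  hypernym="hito (person)", semantic_class="292:driver",
                  wordnet="driver1")])
```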

RMRS from JACY (deep):
  hook(h1)
  proposition_m_rel(h1,h3), qeq(h3,h17)
  _jidousha_n(h4,x5)
  udef_rel(h6,x5), RSTR(h6,h7), BODY(h6,h8), qeq(h7,h4)
  _unten_s_2(h9,e11:present), ARG1(h9,x10), ARG2(h9,x5)
  _hito_n(h12,x10), ING(h12,h10001)
  udef_rel(h13,x10), RSTR(h13,h14), BODY(h13,h15), qeq(h14,h12)
  proposition_m_rel(h10001,h16), qeq(h16,h9)
  unknown_rel(h17,e2:present), ARG2(h17,x10)

RMRS from ChaSen (shallow):
  hook(h9)
  _jidousha_n(h1,x2)
  o_rel(h3,u4)
  _unten_s(h5,e6)
  suru_rel(h7,e8)
  _hito_n(h9,x10)

Real predicates are prefixed with an underbar (_).

Figure 2: Deep and Shallow RMRS results for doraibā2, the definition sentence jidōsha wo unten suru hito "a person who drives a car (lit: car-ACC drive do person)"

2.2 Parsing Resources

We used the robust minimal recursion semantics (RMRS) designed in the Deep Thought project [Callmeier et al., 2004], with tools from the Deep Linguistic Processing with HPSG Initiative (DELPH-IN: http://www.delph-in.net/).

Robust Minimal Recursion Semantics
Robust Minimal Recursion Semantics is a form of flat semantics which is designed to allow deep and shallow processing to use a compatible semantic representation, while being rich enough to support generalized quantifiers [Frank, 2004]. The full representation is basically the same as minimal recursion semantics [Copestake et al., 2003]: a bag of labeled elementary predicates and their arguments, a list of scoping constraints, and a handle that provides a hook into the representation. The main differences are that handles must be unique and that there is an explicit distinction between grammatical and real predicates.

The representation can be underspecified in three ways: relationships can be omitted (such as message types, quantifiers and so on); predicate-argument relations can be omitted; and predicate names can be simplified. Predicate names are defined in such a way as to be as compatible as possible among different analysis engines (e.g. lemma-pos-sense, where the sense is optional and the part of speech (pos) is drawn from a small set of general types: noun, verb, sahen (verbal noun), and so on). The predicate unten_s is therefore less specific than unten_s_2 and subsumes it. In order to simplify the combination of different analyses, the results are indexed to positions in the original input sentence.

Examples of deep and shallow results for the same sentence, jidōsha wo unten suru hito "a person who drives a car (lit: car-ACC drive do person)", are given in Figure 2 (omitting the indexing). Real predicates are prefixed by an underbar (_). The deep parse gives information about the scope, message types and argument structure, while the shallow parse gives little more than a list of real and grammatical predicates with a hook.

Deep Parser (JACY and PET)
The Japanese grammar JACY [Siegel and Bender, 2002] was run with the PET system for high-efficiency processing of typed feature structures [Callmeier, 2002].

Shallow Parser (based on ChaSen)
The part-of-speech tagger ChaSen [Matsumoto et al., 2000] was used for shallow processing of Japanese. Predicate names were produced by transliterating the pronunciation field and mapping the part-of-speech codes to the RMRS super types. The part-of-speech codes were also used to judge whether predicates were real or grammatical. Since Japanese is a head-final language, the hook value was set to the handle of the rightmost real predicate.
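As a concrete illustration of the representation just described, here is a minimal sketch of how the shallow RMRS of Figure 2 might be stored; the class and attribute names are our own and do not correspond to any particular DELPH-IN tool, and argument role names such as ARG0 are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ElementaryPredicate:
    """One labelled predicate, e.g. _hito_n(h9, x10)."""
    name: str                 # lemma-pos(-sense), e.g. "_unten_s_2"
    handle: str               # unique label, e.g. "h9"
    args: Dict[str, str] = field(default_factory=dict)  # e.g. {"ARG1": "x10"}

    @property
    def is_real(self) -> bool:
        # Real predicates are prefixed with an underbar; the rest are grammatical.
        return self.name.startswith("_")

@dataclass
class RMRS:
    hook: str                                            # highest-scoping handle
    predicates: List[ElementaryPredicate] = field(default_factory=list)
    constraints: List[Tuple[str, str, str]] = field(default_factory=list)  # e.g. ("qeq", "h3", "h17")

# The shallow (ChaSen) analysis of "jidōsha wo unten suru hito" from Figure 2:
shallow = RMRS(
    hook="h9",
    predicates=[
        ElementaryPredicate("_jidousha_n", "h1", {"ARG0": "x2"}),
        ElementaryPredicate("o_rel", "h3", {"ARG0": "u4"}),
        ElementaryPredicate("_unten_s", "h5", {"ARG0": "e6"}),
        ElementaryPredicate("suru_rel", "h7", {"ARG0": "e8"}),
        ElementaryPredicate("_hito_n", "h9", {"ARG0": "x10"}),
    ])
```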
3 Ontology Construction

As outlined in Section 1, our approach to ontology construction is to process a definition sentence with shallow and deep parsers and extract ontological relations from the most informative RMRS output. The algorithm used to extract ontological relations from an RMRS structure is as follows (a code sketch of this procedure is given after Table 1):

1. Let P be the number of real predicates in the defining sentence.
   • IF P = 1 (there is a unique real predicate), return ⟨synonym: headword, predicate⟩.
2. Initialize a stack of semantic relations to be processed with the semantic relation from the defining sentence's HOOK (the highest-scoping handle).
3. Pop a semantic relation from the stack and check it against the special predicates that require additional processing.
   • When a relation indicating coordination or conjunction is found, locate all of its arguments and push them onto the stack for processing.
   • IF a special predicate is found, extract its relations and add them to the stack.
   • ELSE IF the current semantic relation is a real predicate, add it to the list of extracted semantic heads.
   Repeat until the stack is empty.
4. Return the ontological relations in the list of extracted semantic heads in the form ⟨relation: headword, semantic head⟩.

Step 1 checks for a synonym relation, shown by a defining sentence containing a genus term with no differentia. Such a sentence will have a semantic representation with only a single real predicate.

In Step 2, for more complicated defining sentences, we try to find the genus term, looking first at the predicate with the widest scope. This is given by the RMRS's HOOK. The default ontological relation for the genus term is a hypernym.

Step 3 processes each semantic relation in the stack by searching for special predicates that require additional processing in order to retrieve the semantic head. Special predicates include explicit relation names (such as ryaku "abbreviation") and some grammatical predicates. This step identifies and processes special predicates, adding any results to the stack of unprocessed semantic relations. If a relation is not identified as a special predicate, and it is a non-grammatical predicate, then it is accepted as a semantic head and added to the list of extracted relations. Step 3 is repeated until the stack is empty.

Special predicates also give information about the type of ontological relation that has been identified. They can confirm an implicit hypernym, as with isshu "one type" in Japanese, or identify an entirely different relation, as in the case of the relation part, which identifies a meronym relationship in English, or meishō "honorific name", which identifies a name relation in Japanese. Special predicates can also extract non-ontological relations such as domain. The special predicates we use and their associated relations are listed in Table 1.

Step 4 returns the list of all non-grammatical predicates once all semantic heads have been processed for special relations and no new results are produced.

This processing follows in the long tradition of parsing such special relationships (also called "empty heads", "function nouns" or "relators") [Guthrie et al., 1990; Wilkes et al., 1996]. Our main innovation is to extract them from the semantic representations produced by deep and shallow parsing, rather than using regular expressions or specially designed parsers.

Special predicate (Japanese)    Ontological relation
isshu_n_1                       hypernym
hitotsu_n_2                     hypernym
soushou_n_1                     hyponym
ryakushou_s_1                   abbreviation
ryaku_s_1                       abbreviation
keishou_n_1                     name:honorific
zokushou_n_1                    name:slang
meishou_n_1                     name
bubun_n_1                       meronym
ichibu_n_1                      meronym

Table 1: Special predicates and their associated ontological relations
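The following is a minimal, simplified sketch of the extraction procedure above, operating on RMRS structures like the ones sketched in Section 2.2. It is illustrative only: the special-predicate table is abbreviated from Table 1, the coordination predicate names are hypothetical, and a full implementation would resolve arguments through variables and the qeq scope constraints rather than handles alone.

```python
# Sketch of the relation-extraction algorithm in Section 3.
# Assumes the RMRS / ElementaryPredicate classes sketched in Section 2.2.

SPECIAL_PREDICATES = {          # abbreviated from Table 1
    "_isshu_n_1": "hypernym",
    "_soushou_n_1": "hyponym",
    "_ryaku_s_1": "abbreviation",
    "_bubun_n_1": "meronym",
}
COORDINATION = {"_oyobi_c", "_mataha_c"}   # hypothetical coordination predicates


def extract_relations(headword: str, rmrs: "RMRS"):
    """Return a list of (relation, headword, semantic_head) triples."""
    real = [p for p in rmrs.predicates if p.is_real]

    # Step 1: a single real predicate means the definition is a bare genus
    # term, i.e. a synonym.
    if len(real) == 1:
        return [("synonym", headword, real[0].name)]

    by_handle = {p.handle: p for p in rmrs.predicates}
    stack = [by_handle[rmrs.hook]]          # Step 2: start from the HOOK
    triples = []
    relation = "hypernym"                   # default relation for the genus term

    while stack:                            # Step 3
        pred = stack.pop()
        if pred.name in COORDINATION:
            # push all conjuncts so each contributes a semantic head
            stack.extend(by_handle[h] for h in pred.args.values() if h in by_handle)
        elif pred.name in SPECIAL_PREDICATES:
            # the special predicate names the relation; its argument holds the head
            relation = SPECIAL_PREDICATES[pred.name]
            stack.extend(by_handle[h] for h in pred.args.values() if h in by_handle)
        elif pred.is_real:
            triples.append((relation, headword, pred.name))

    return triples                          # Step 4
```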
4 Results and Evaluation

We summarize the relationships acquired in Table 2. The first two lines show thesaurus-type relations: hypernyms and synonyms. The remaining lines show other relations: abbreviations, names, meronyms and domains. Hypernyms and synonyms are by far the most common relations; fewer than 10% of entries are marked with an explicit relationship. Results are shown for Lexeed using only the RMRS produced by ChaSen, using the results for JACY, and using the deepest possible result (JACY if it exists, backing off to ChaSen).

Results for ChaSen
Relation      Noun     Sahen    Verb     Other   Total
hypernym      42,235   8,176    9,237    3,346   62,994
synonym        7,278     776    2,005      933   10,992
Total         49,513   8,952   11,242    4,279   73,986

Results for JACY
Relation      Noun     Sahen    Verb     Other   Total
hypernym      31,374   6,748    6,619    2,029   46,770
synonym        7,831     801    2,220    1,048   11,900
abbreviation     154       7        –        –      161
domain           392      28        –        –      420
other            247       –        –        –      247
Total         39,998   7,584    8,839    3,077   59,498

Results for Deepest
Relation      Noun     Sahen    Verb     Other   Total
hypernym      45,014   9,647   10,305    3,299   68,265
synonym        8,151     827    2,257    1,254   12,489
abbreviation     154       7        –        –      161
domain           392      28        –        –      420
other            247       –        –        –      247
Total         53,958  10,509   12,562    4,553   81,582

Table 2: Results of Ontology Extraction (Lexeed)

As one would expect, the word-based analysis using ChaSen finds more actual relationships, but does not provide enough information to find anything beyond implicit hypernyms and synonyms. The grammar-based analyses have lower coverage, but allow us to extract some of the knowledge given explicitly in the lexicon. Although the vast majority of relations extracted are hypernym and synonym relations, we extract several other kinds of knowledge, and are thus closer to an ontology than a simple thesaurus.

We carried out two evaluations. The first was an automatic evaluation, comparing our extracted triples ⟨relation: word1, word2⟩ with existing resources. The second was a small-scale hand evaluation of a sample of the relations.

4.1 Verification with Hand-crafted Ontologies

Because we are interested in comparing lexical semantics across languages, we compared the extracted ontology with resources in both the same and different languages.

We verified our results by comparing the hypernym links to the manually constructed Japanese ontology GoiTaikei (GT). It is a hierarchy of 2,710 semantic classes, defined for over 264,312 nouns [Ikehara et al., 1997]. The semantic classes are mostly defined for nouns (and verbal nouns), although there is some information for verbs and adjectives. Senses are linked to GT semantic classes by the following heuristic: look up the semantic classes C for both the headword (wh) and genus term(s) (wg). If at least one of the headword's classes is subsumed by at least one of the genus term's classes, then we consider the relationship confirmed (1):

∃ (ch, cg) : ch ⊂ cg; ch ∈ C(wh); cg ∈ C(wg)    (1)

In the event of an explicit hyponym relationship indicated between the headword and the genus, the test is reversed: we look for an instance of the genus term's class being subsumed by the headword's class (cg ⊂ ch).

To test cross-linguistically, we looked up the headwords in a translation lexicon (ALT-J/E [Ikehara et al., 1991] and EDICT [Breen, 2004]) and then did the confirmation on the set of translations ci ⊂ C(T(wi)). Although looking up the translation adds noise, the additional filter of the relationship triple effectively filters it out again.
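A minimal sketch of this confirmation heuristic, including the cross-lingual back-off through a translation lexicon, is given below. The lookup functions (classes, translations, and the subsumption test) are illustrative stand-ins, not the actual interfaces of GoiTaikei, ALT-J/E, EDICT, or WordNet.

```python
from typing import Callable, Iterable, Set

def confirmed(headword: str, genus: str,
              classes: Callable[[str], Set[str]],
              subsumes: Callable[[str, str], bool],
              hyponym: bool = False) -> bool:
    """Check relation (1): some class of the headword is subsumed by some
    class of the genus term (reversed for explicit hyponym relations)."""
    for ch in classes(headword):
        for cg in classes(genus):
            # subsumes(parent, child) is assumed to test parent ⊇ child
            child, parent = (ch, cg) if not hyponym else (cg, ch)
            if subsumes(parent, child):
                return True
    return False

def confirmed_cross_lingually(headword: str, genus: str,
                              translations: Callable[[str], Iterable[str]],
                              classes: Callable[[str], Set[str]],
                              subsumes: Callable[[str, str], bool]) -> bool:
    """Back-off: translate both words (e.g. via ALT-J/E or EDICT) and check
    the relation in the other language's resource (e.g. WordNet)."""
    return any(confirmed(th, tg, classes, subsumes)
               for th in translations(headword)
               for tg in translations(genus))
```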
For example, for doraibā3 "driver3", GT does not find any relationship, as it does not have the golf club semantic class label for doraibā. However, looking up T(doraibā) gives {driver, screwdriver}, and the extracted hypernym is kurabu "club". WordNet recognizes that driver5 is a kind of wood8, which is a kind of club5 (using senses and relations from WordNet 2.0 [Fellbaum, 1998]). We thus simultaneously confirm that the link is good, find an appropriate translation for this sense of doraibā (and its hypernym), and link these to the appropriate WordNet synsets.

The results of the evaluation for Lexeed are shown in Table 3. Parts of speech are classified as "Noun", "Verb", "Sahen (verbal noun)" and "Other". Relations verified in either GT or WordNet are classed as verified. In these results, we only extract relations from the first definition sentence for each headword, or when there is evidence of a synonym relation, as other definition sentences often provide clarification of the first sentence and may not contain useful ontological information. However, extracting relations from all definition sentences results in a net loss in confirmation of less than five percent while extracting over 5,000 additional relations. This suggests that even secondary definition sentences contain information that can be exploited in building ontologies.

Our results using JACY achieve a confirmation rate of 63.31% for nouns only and 55.74% overall, besting Tokunaga et al. [2001], who reported 61.4% for nouns only. Our method also allows us to extract multiple relations from a single sentence by processing coordinate clauses; this allowed us to extract an extra 3,300 relations. Using the deepest RMRS parse available, we confirmed 57.68% of the noun relations and 50.49% overall. This is comparable to our results reported in [Bond et al., 2004], with one important difference: we are now extracting over 10,000 more confirmed ontological relations.

GT and WordNet both lack complete coverage: over half the relations were confirmed with only one resource. This shows that the machine-readable dictionary is a useful source of these relations. Cross-lingual checking was surprisingly effective; resources in one language can be used to bootstrap those in another, as seen in the EuroWordNet project.

Results for ChaSen
Relation      Noun (%)                 Sahen (%)              Verb (%)               Other (%)            Total (%)
hypernym      13,124 / 27,779 (47.24)  2,489 / 5,856 (42.50)  2,599 / 6,903 (37.65)  397 / 2,218 (17.90)  18,609 / 42,756 (43.52)
synonym        5,684 /  7,278 (78.10)    606 /   776 (78.09)  1,285 / 2,005 (64.09)  323 /   933 (34.62)   7,898 / 10,992 (71.85)
total         18,808 / 35,057 (53.65)  3,095 / 6,632 (46.67)  3,884 / 8,908 (43.60)  720 / 3,151 (22.85)  26,507 / 53,748 (49.32)

Results for JACY
Relation      Noun (%)                 Sahen (%)              Verb (%)               Other (%)            Total (%)
hypernym      12,757 / 21,634 (58.97)  2,033 / 5,130 (39.63)  1,884 / 5,254 (35.86)  376 / 1,527 (24.62)  17,050 / 33,545 (50.83)
synonym        6,099 /  7,831 (77.88)    626 /   801 (78.15)  1,351 / 2,220 (60.86)  360 / 1,048 (34.35)   8,436 / 11,900 (70.89)
abbreviation      61 /    149 (40.94)      3 /     7 (42.86)       – / –  (–)            – / – (–)             64 /    156 (41.03)
domain            68 /    344 (19.77)      7 /    28 (25.00)       – / –  (–)            – / – (–)             75 /    372 (20.16)
other            125 /    225 (55.56)      – / –   (–)             – / –  (–)            – / – (–)            125 /    225 (55.56)
total         19,110 / 30,183 (63.31)  2,669 / 5,966 (44.74)  3,235 / 7,474 (43.28)  736 / 2,575 (28.58)  25,750 / 46,198 (55.74)

Results for Deepest
Relation      Noun (%)                 Sahen (%)              Verb (%)                Other (%)            Total (%)
hypernym      15,703 / 29,731 (52.82)  2,723 / 7,141 (38.13)  2,762 /  7,927 (34.84)  492 / 2,350 (20.94)  21,680 / 47,149 (45.98)
synonym        6,307 /  8,151 (77.38)    643 /   827 (77.75)  1,371 /  2,257 (60.74)  409 / 1,254 (32.62)   8,730 / 12,489 (69.90)
abbreviation      61 /    149 (40.94)      3 /     7 (42.86)       – / –  (–)             – / – (–)             64 /    156 (41.03)
domain            68 /    344 (19.77)      7 /    28 (25.00)       – / –  (–)             – / – (–)             75 /    372 (20.16)
other            125 /    225 (55.56)      – / –   (–)             – / –  (–)             – / – (–)            125 /    225 (55.56)
total         22,264 / 38,600 (57.68)  3,376 / 8,003 (42.18)  4,133 / 10,184 (40.58)  901 / 3,604 (25.00)  30,674 / 60,391 (50.79)

Table 3: Results confirmed for Lexeed (for 46,000 senses)
4.2 Human Evaluation

One problem with using existing ontological resources to verify new relations is that only relations which are subsumed by the ontology being used for comparison can be verified. This poses a considerable problem for researchers who wish to extract new relations, be it from domains where such resources are unavailable, or in cases where existing resources are limited in scope, such as for verbs. In this case, it makes more sense to evaluate a selection of the results retrieved by hand than to rely completely on existing ontologies for verification.

In this spirit, we conducted a hand-verification of a selection of our automatically acquired relations. 1,471 relations were selected using a stratified method over the entirety of our results (every 35th relationship, ordered by link type and then headword). In this evaluation we only consider synonyms and any relationships extracted from the first sentence: the second and subsequent definition sentences tend to contain other, non-hypernymic information. Native speakers of Japanese were given the definition word, the semantic head we retrieved, and the posited relation type, and were asked to evaluate whether the relation was accurate. They had access to the original lexicon.

The human judges found the relations presented to them to be accurate 88.99% of the time. Of the 162 relations that were judged unacceptable, it was determined that a relation did exist in 95 cases, but it was incorrect (i.e. a synonym in place of a hypernym and so on). These errors had three sources: the most common was a lack of identified explicit relationships; the next was lack of information from the shallow parse; and the last was errors in the argument structure of the deep parse. Tokunaga et al. [2001] report slightly higher results for extracting noun relationships only (91.8%).
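For concreteness, the stratified selection described above amounts to something like the following sketch; the function and variable names are ours, not part of our evaluation tooling.

```python
def stratified_sample(triples, step=35):
    """triples: iterable of (relation, headword, semantic_head).
    Sort by link type, then headword, and keep every `step`-th relation."""
    ordered = sorted(triples, key=lambda t: (t[0], t[1]))
    return ordered[::step]
```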
5 Discussion and Future Work

We were able to successfully combine deep and shallow processing to extract ontological information from lexical resources. We showed that, by using a well-defined semantic representation, the extraction can be generalized so much that it can be used on very different dictionaries from different languages. This is an improvement on the common approach of using more and more detailed regular expressions (e.g. Tokunaga et al. [2001]). Although this provides a quick start, the results are not generally reusable. In comparison, the ChaSen-RMRS engine is immediately useful for a variety of tasks, such as question answering and information extraction [Callmeier et al., 2004].

The other innovation of our approach is the cross-lingual evaluation. As a by-product of the evaluation we enhance the existing resources (such as GT or WordNet) by linking them, so that information can be shared between them. Further, we hope to use the cross-lingual links to fill in gaps in the monolingual resources. Finally, we can trivially extract links from the GT ontology to WordNet, thus combining two useful resources and allowing us to compare them in detail.

There are several areas to address in future work as we continue to pursue ontology acquisition. First and foremost, in order to increase the quality of the ontological relations that are extracted, we need to improve the quality of the parsers that generate RMRS. With our HPSG parsers, this can be addressed by adjusting the parse ranking model to take into account the special features of dictionary definition sentences. In addition, we are currently increasing the coverage by adding treatments of more grammatical phenomena.

Another area of great interest is the acquisition of other ontological relations. For example, if we extend our special predicates to include the relations produced by various forms of negation, we may be able to extract antonym relations.

Finally, we would also like to use the links created during evaluation, which connect our ontologies to hand-crafted ontologies, to further link together senses of words across languages. This kind of cross-lingual ontology would be of great use in applications like machine translation.

6 Conclusion

We have demonstrated how deep and shallow processing techniques can be used together to enrich the acquisition of ontological information by constructing an ontology for Japanese. Our approach requires few rules and is thus easy to maintain and expand, and it can be easily extended to cover any language that has RMRS resources. In future research, we plan to extend our processing to retrieve more ontological relations and to investigate means of improving the accuracy of output of both deep and shallow processors.

References

[Amano and Kondo, 1999] Shigeaki Amano and Tadahisa Kondo. Nihongo-no Goi-Tokusei (Lexical properties of Japanese). Sanseido, 1999.

[Bond et al., 2004] Francis Bond, Eric Nichols, Sanae Fujita, and Takaaki Tanaka. Acquiring an ontology for a fundamental vocabulary. In 20th International Conference on Computational Linguistics: COLING-2004, pages 1319–1325, Geneva, 2004.

[Breen, 2004] J. W. Breen. JMDict: a Japanese-multilingual dictionary. In Coling 2004 Workshop on Multilingual Linguistic Resources, pages 71–78, Geneva, 2004.

[Callmeier et al., 2004] Ulrich Callmeier, Andreas Eisele, Ulrich Schäfer, and Melanie Siegel. The DeepThought core architecture framework. In Proceedings of LREC-2004, volume IV, Lisbon, 2004.

[Callmeier, 2002] Ulrich Callmeier. Preprocessing and encoding techniques in PET. In Stephan Oepen, Dan Flickinger, Jun-ichi Tsujii, and Hans Uszkoreit, editors, Collaborative Language Engineering, chapter 6, pages 127–143. CSLI Publications, Stanford, 2002.

[Copestake et al., 2003] Ann Copestake, Daniel P. Flickinger, Carl Pollard, and Ivan A. Sag. Minimal Recursion Semantics: An introduction. 2003.

[Fellbaum, 1998] Christine Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[Frank, 2004] Anette Frank. Constraint-based RMRS construction from shallow grammars. In 20th International Conference on Computational Linguistics: COLING-2004, pages 1269–1272, Geneva, 2004.

[Guthrie et al., 1990] Louise Guthrie, Brian M. Slator, Yorick Wilkes, and Rebecca Bruce. Is there content in empty heads? In 13th International Conference on Computational Linguistics: COLING-90, pages 138–143, Helsinki, 1990.

[Ikehara et al., 1991] Satoru Ikehara, Satoshi Shirai, Akio Yokoo, and Hiromi Nakaiwa. Toward an MT system without pre-editing – effects of new methods in ALT-J/E. In Third Machine Translation Summit: MT Summit III, pages 101–106, Washington DC, 1991. (http://xxx.lanl.gov/abs/cmp-lg/9510008).

[Ikehara et al., 1997] Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi. Goi-Taikei — A Japanese Lexicon. Iwanami Shoten, Tokyo, 1997. 5 volumes/CDROM.

[Kasahara et al., 2004] Kaname Kasahara, Hiroshi Sato, Francis Bond, Takaaki Tanaka, Sanae Fujita, Tomoko Kanasugi, and Shigeaki Amano. Construction of a Japanese semantic lexicon: Lexeed. SIG NLC-159, IPSJ, Tokyo, 2004. (in Japanese).

[Matsumoto et al., 2000] Yuji Matsumoto, Kitauchi, Yamashita, Hirano, Matsuda, and Asahara. Nihongo Keitaiso Kaiseki System: ChaSen, version 2.2.1 manual edition, 2000. http://chasen.aist-nara.ac.jp.

[Pantel and Lin, 2000] Patrick Pantel and Dekang Lin. An unsupervised approach to prepositional phrase attachment using contextually similar words. In 38th Annual Meeting of the Association for Computational Linguistics: ACL-2000, pages 101–108, Hong Kong, 2000.

[Pantel et al., 2004] Patrick Pantel, Deepak Ravichandran, and Eduard Hovy. Towards terascale knowledge acquisition. In 20th International Conference on Computational Linguistics: COLING-2004, pages 771–777, Geneva, 2004.

[Siegel and Bender, 2002] Melanie Siegel and Emily M. Bender. Efficient deep processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization at the 19th International Conference on Computational Linguistics, Taipei, 2002.

[Tokunaga et al., 2001] Takenobu Tokunaga, Yasuhiro Syotu, Hozumi Tanaka, and Kiyoaki Shirai. Integration of heterogeneous language resources: A monolingual dictionary and a thesaurus. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, NLPRS2001, pages 135–142, Tokyo, 2001.

[Tsurumaru et al., 1991] Hiroaki Tsurumaru, Katsunori Takesita, Itami Katsuki, Toshihide Yanagawa, and Sho Yoshida. An approach to thesaurus construction from dictionary. In IPSJ SIGNotes Natural Language, volume 83-16, pages 121–128, 1991. (in Japanese).

[Wilkes et al., 1996] Yorick A. Wilkes, Brian M. Slator, and Louise M. Guthrie. Electric Words. MIT Press, 1996.