Acquiring an Ontology for a Fundamental Vocabulary
Francis Bond∗, Eric Nichols∗∗, Sanae Fujita∗ and Takaaki Tanaka∗
∗ {bond,fujita,takaaki}@example.com   ∗∗ [email protected]
∗ NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation
∗∗ Nara Advanced Institute of Science and Technology

Abstract

In this paper we describe the extraction of thesaurus information from parsed dictionary definition sentences. The main data for our experiments comes from Lexeed, a Japanese semantic dictionary, and the Hinoki treebank built on it. The dictionary is parsed using a head-driven phrase structure grammar of Japanese. Knowledge is extracted from the semantic representation (Minimal Recursion Semantics). This makes the extraction process language independent.

1 Introduction

In this paper we describe a method of acquiring a thesaurus and other useful information from a machine-readable dictionary. The research is part of a project to construct a fundamental vocabulary knowledge-base of Japanese: a resource that will include rich syntactic and semantic descriptions of the core vocabulary of Japanese. In this paper we describe the automatic acquisition of a thesaurus from the dictionary definition sentences. The basic method has a long pedigree (Copestake, 1990; Tsurumaru et al., 1991; Rigau et al., 1997). The main difference from earlier work is that we use a mono-stratal grammar (Head-Driven Phrase Structure Grammar: Pollard and Sag (1994)), where the syntax and semantics are represented in the same structure. Our extraction can thus be done directly on the semantic output of the parser.

In the first stage, we extract the thesaurus backbone of our ontology, consisting mainly of hypernym links, although other links are also extracted (e.g., domain). We also link our extracted thesaurus to an existing ontology of Japanese: the Goi-Taikei ontology (Ikehara et al., 1997). This allows us to use tools that exploit the Goi-Taikei ontology, and also to extend it and reveal gaps.

The immediate application for our ontology is in improving the performance of stochastic models for parsing (see Bond et al. (2004) for further discussion) and word sense disambiguation. However, this paper discusses only the construction of the ontology.

We are using the Lexeed semantic database of Japanese (Kasahara et al. (2004), next section), a machine-readable dictionary consisting of headwords and their definitions for the 28,000 most familiar open class words of Japanese, with all the definitions using only those 28,000 words (and some function words). We are parsing the definition sentences using an HPSG Japanese grammar and parser, and treebanking the results into the Hinoki treebank (Bond et al., 2004). We then train a statistical model on the treebank, use it to parse the remaining definition sentences, and extract an ontology from them.

In the next phase, we will sense-tag the definition sentences and use this information and the thesaurus to build a model that combines syntactic and semantic information. We will also produce a richer ontology, by combining information for word senses not only from their own definition sentences but also from the definition sentences that use them (Dolan et al., 1993), and by extracting selectional preferences. Once we have done this for the core vocabulary, we will look at ways of extending our lexicon and ontology to less familiar words.
∗∗ Some of this research was done while the second author was visiting the NTT Communication Science Laboratories.

In this paper we present the details of the ontology extraction. In the following section we give more information about Lexeed and the Hinoki treebank. We then detail our method for extracting knowledge from the parsed dictionary definitions (§3). Finally, we discuss the results and outline our future research (§4).

2 Resources

2.1 The Lexeed Semantic Database of Japanese

The Lexeed Semantic Database of Japanese is a machine-readable dictionary that covers the most common words in Japanese (Kasahara et al., 2004). It is based on a series of psycholinguistic experiments in which words from two existing machine-readable dictionaries were presented to multiple subjects, who ranked them on a familiarity scale from one to seven, with seven being the most familiar (Amano and Kondo, 1999). Lexeed consists of all open class words with a familiarity greater than or equal to five. The size, in words, senses and defining sentences, is given in Table 1.

Table 1: The Size of Lexeed

  Headwords            28,300
  Senses               46,300
  Defining Sentences   81,000

The definition sentences for these senses were rewritten by four different analysts to use only the 28,000 familiar words, and the best definition was chosen by a second set of analysts. Not all words were used in the definition sentences: the defining vocabulary is 16,900 different words (60% of all possible words were actually used in the definition sentences). An example entry for the word doraibā "driver" is given in Figure 1, with English glosses added. The underlined material was not in Lexeed originally; we extract it in this paper. doraibā "driver" has a familiarity of 6.55 and three senses. The first sense was originally defined as just the synonym nejimawashi "screwdriver", which has a familiarity below 5.0. This was rewritten to the explanation: "A tool for inserting and removing screws".

Figure 1: Entry for the word doraibā "driver" (with English glosses)

  Headword: doraibā    POS: noun    Lexical-type: noun-lex    Familiarity: 6.5 [1-7]
  Sense 1:  Definition: "A tool for inserting and removing screws (screwdriver)."
            Hypernym: "tool"    Sem. Class: ⟨942:tool⟩ (⊂ 893:equipment)
  Sense 2:  Definition: "Someone who drives a car."
            Hypernym: hito "person"    Sem. Class: ⟨292:driver⟩ (⊂ 4:person)
  Sense 3:  Definition: "In golf, a long-distance club. A number one wood."
            Hypernym: kurabu "club"    Sem. Class: ⟨921:leisure equipment⟩ (⊂ 921)
            Domain: gorufu "golf"

2.2 The Hinoki Treebank

In order to produce semantic representations we are using an open-source HPSG grammar of Japanese, JACY (Siegel and Bender, 2002), which we have extended to cover the dictionary definition sentences (Bond et al., 2004). We have treebanked 23,000 sentences using the [incr tsdb()] profiling environment (Oepen and Carroll, 2000) and used them to train a parse ranking model for the PET parser (Callmeier, 2002) to selectively rank the parser output. These tools, and the grammar, are available from the Deep Linguistic Processing with HPSG Initiative (DELPH-IN: http://www.delph-in.net/).

We use this parser to parse the defining sentences into a full meaning representation using minimal recursion semantics (MRS: Copestake et al. (2001)).
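To make the shape of a Lexeed entry concrete, the sketch below models the information shown in Figure 1 as a small Python structure. It is purely illustrative: the class and field names (Entry, Sense, hypernym, sem_class, domain) are our own labels for the fields in the figure, not the format actually used by Lexeed or the Hinoki tools.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sense:
    """One sense of a Lexeed headword; fields mirror Figure 1."""
    definition: str                  # rewritten defining sentence (English gloss here)
    hypernym: Optional[str] = None   # genus term extracted from the definition
    sem_class: Optional[str] = None  # linked Goi-Taikei semantic class
    domain: Optional[str] = None     # e.g. "golf" for the third sense of doraibā

@dataclass
class Entry:
    headword: str
    pos: str
    familiarity: float               # psycholinguistic familiarity score, 1-7
    senses: List[Sense] = field(default_factory=list)

# The entry from Figure 1, using the English glosses of the definitions.
doraiba = Entry(
    headword="doraibā", pos="noun", familiarity=6.5,
    senses=[
        Sense("A tool for inserting and removing screws.",
              hypernym="tool", sem_class="942:tool"),
        Sense("Someone who drives a car.",
              hypernym="hito (person)", sem_class="292:driver"),
        Sense("In golf, a long-distance club. A number one wood.",
              hypernym="kurabu (club)", sem_class="921:leisure equipment",
              domain="gorufu (golf)"),
    ],
)
```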
3 Ontology Extraction

In this section we present our work on creating an ontology. Past research on knowledge acquisition from definition sentences in Japanese has primarily dealt with the task of automatically generating hierarchical structures. Tsurumaru et al. (1991) developed a system for automatic thesaurus construction based on information derived from analysis of the terminal clauses of definition sentences. It was successful in classifying hyponym, meronym, and synonym relationships between words. However, it lacked any concrete evaluation of the accuracy of the hierarchies created, and it linked only words, not senses. More recently, Tokunaga et al. (2001) created a thesaurus from a machine-readable dictionary and combined it with an existing thesaurus (Ikehara et al., 1997).

For other languages, early work for English linked senses by exploiting dictionary domain codes and other heuristics (Copestake, 1990), and more recent work links senses for Spanish and French using more general WSD techniques (Rigau et al., 1997). Our goal is similar. We wish to link each word sense in the fundamental vocabulary into an ontology. The ontology is primarily a hierarchy of hyponym (is-a) relations, but it also contains several other relationships, such as abbreviation, synonym and domain.

We extract the relations from the semantic output of the parsed definition sentences. The output is written in Minimal Recursion Semantics (Copestake et al., 2001). Previous work has successfully used regular expressions, both for English (Barnbrook, 2002) and for Japanese (Tsurumaru et al., 1991; Tokunaga et al., 2001). Regular expressions are extremely robust and relatively easy to construct. However, we use a parser for four reasons. The first is that it makes our knowledge acquisition more language independent: given a parser that can produce MRS and a machine-readable dictionary for a language, the acquisition system can be ported to it. Another is that we ultimately wish to develop a parser: any effort spent in building and refining regular expressions is not reusable, while creating and improving a grammar has intrinsic value.

3.1 The Extraction Process

To extract hypernyms, we parse the first definition sentence for each sense.
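The sketch below illustrates, in very reduced form, how a hypernym candidate could be read off such a semantic representation, under the simplifying assumption that the MRS is available as a flat list of elementary predications and that the predication whose ARG0 is the sentence's index is the semantic head of the definition (typically the genus term). The MRS shown is a hand-simplified, English-glossed stand-in for actual JACY output, and the helper name hypernym_candidate is ours, not part of the Hinoki tools.

```python
from typing import Dict, Optional

# Hand-simplified, English-glossed MRS for the sense 1 definition
# "A tool for inserting and removing screws." Real JACY output is far richer
# (handles, scope constraints, Japanese predicate names); this is only a stand-in.
mrs_example = {
    "index": "x1",
    "rels": [
        {"pred": "_tool_n", "ARG0": "x1"},
        {"pred": "_insert_v", "ARG0": "e2", "ARG2": "x3"},
        {"pred": "_remove_v", "ARG0": "e4", "ARG2": "x3"},
        {"pred": "_screw_n", "ARG0": "x3"},
    ],
}

def hypernym_candidate(mrs: Dict) -> Optional[str]:
    """Return the predicate whose ARG0 is the MRS index.

    For a typical defining sentence this is the genus term of the
    definition, i.e. a hypernym candidate for the headword being defined."""
    for ep in mrs["rels"]:
        if ep.get("ARG0") == mrs["index"]:
            return ep["pred"]
    return None

print(hypernym_candidate(mrs_example))  # -> "_tool_n"
```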