Acquiring an Ontology for a Fundamental Vocabulary

Francis Bond∗, Eric Nichols∗∗, Sanae Fujita∗ and Takaaki Tanaka∗
∗ NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation
  {bond,fujita,takaaki}@cslab.kecl.ntt.co.jp
∗∗ Nara Institute of Science and Technology
  [email protected]

Abstract

In this paper we describe the extraction of thesaurus information from parsed dictionary definition sentences. The main data for our experiments comes from Lexeed, a Japanese semantic dictionary, and the Hinoki treebank built on it. The dictionary is parsed using a head-driven phrase structure grammar of Japanese. Knowledge is extracted from the semantic representation (Minimal Recursion Semantics). This makes the extraction process language independent.

1 Introduction

In this paper we describe a method of acquiring a thesaurus and other useful information from a machine-readable dictionary. The research is part of a project to construct a fundamental vocabulary knowledge-base of Japanese: a resource that will include rich syntactic and semantic descriptions of the core vocabulary of Japanese. In this paper we describe the automatic acquisition of a thesaurus from the dictionary definition sentences. The basic method has a long pedigree (Copestake, 1990; Tsurumaru et al., 1991; Rigau et al., 1997). The main difference from earlier work is that we use a mono-stratal grammar (Head-Driven Phrase Structure Grammar: Pollard and Sag (1994)) where the syntax and semantics are represented in the same structure. Our extraction can thus be done directly on the semantic output of the parser.

In the first stage, we extract the thesaurus backbone of our ontology, consisting mainly of hypernym links, although other links are also extracted (e.g., domain). We also link our extracted thesaurus to an existing ontology of Japanese: the Goi-Taikei ontology (Ikehara et al., 1997). This allows us to use tools that exploit the Goi-Taikei ontology, and also to extend it and reveal gaps.

The immediate application for our ontology is in improving the performance of stochastic models for parsing (see Bond et al. (2004) for further discussion) and sense disambiguation. However, this paper discusses only the construction of the ontology.

We are using the Lexeed semantic database of Japanese (Kasahara et al. (2004), next section), a machine readable dictionary consisting of headwords and their definitions for the 28,000 most familiar open class words of Japanese, with all the definitions using only those 28,000 words (and some function words). We are parsing the definition sentences using an HPSG Japanese grammar and parser and treebanking the results into the Hinoki treebank (Bond et al., 2004). We then train a statistical model on the treebank and use it to parse the remaining definition sentences, and extract an ontology from them.

In the next phase, we will sense tag the definition sentences and use this information and the thesaurus to build a model that combines syntactic and semantic information. We will also produce a richer ontology — by combining information for word senses not only from their own definition sentences but also from definition sentences that use them (Dolan et al., 1993), and by extracting selectional preferences. Once we have done this for the core vocabulary, we will look at ways of extending our lexicon and ontology to less familiar words.

In this paper we present the details of the ontology extraction. In the following section we give more information about Lexeed and the Hinoki treebank. We then detail our method for extracting knowledge from the parsed dictionary definitions (§3). Finally, we discuss the results and outline our future research (§4).

∗∗ Some of this research was done while the second author was visiting the NTT Communication Science Laboratories.

2 Resources

2.1 The Lexeed Semantic Database of Japanese

The Lexeed Semantic Database of Japanese is a machine readable dictionary that covers the most common words in Japanese (Kasahara et al., 2004). It is built based on a series of psycholinguistic experiments where words from two existing machine-readable dictionaries were presented to multiple subjects who ranked them on a familiarity scale from one to seven, with seven being the most familiar (Amano and Kondo, 1999). Lexeed consists of all open class words with a familiarity greater than or equal to five. The size, in words, senses and defining sentences, is given in Table 1.

Table 1: The Size of Lexeed

  Headwords           28,300
  Senses              46,300
  Defining Sentences  81,000

The definition sentences for these senses were rewritten by four different analysts to use only the 28,000 familiar words, and the best definition was chosen by a second set of analysts. Not all words were used in definition sentences: the defining vocabulary is 16,900 different words (60% of all possible words were actually used in the definition sentences). An example entry for the word ドライバー doraibā "driver" is given in Figure 1, with English glosses added. The underlined material was not in Lexeed originally; we extract it in this paper. doraibā "driver" has a familiarity of 6.55, and three senses. The first sense was originally defined as just the synonym nejimawashi "screwdriver", which has a familiarity below 5.0. This was rewritten to the explanation: "A tool for inserting and removing screws".

Figure 1: Entry for the Word doraibā "driver" (with English glosses). The typeset feature structure is only partly recoverable; its content is:
  POS: noun; Lexical-type: noun-lex; Familiarity: 6.5 [1-7]
  Sense 1: Definition: "A tool for inserting and removing screws." Hypernym: dōgu_1 "tool"; Sem. class: ⟨942:tool⟩ (⊂ 893:equipment)
  Sense 2: Definition: "Someone who drives a car." Hypernym: hito_1 "person"; Sem. class: ⟨292:driver⟩ (⊂ 4:person)
  Sense 3: Definition: "In golf, a long-distance club. A number one wood." Hypernym: kurabu_2 "club"; Sem. class: ⟨921:leisure equipment⟩ (⊂ 921); Domain: gorufu_1 "golf"

2.2 The Hinoki Treebank

In order to produce semantic representations we are using an open source HPSG grammar of Japanese: JACY (Siegel and Bender, 2002), which we have extended to cover the dictionary definition sentences (Bond et al., 2004). We have treebanked 23,000 sentences using the [incr tsdb()] profiling environment (Oepen and Carroll, 2000) and used them to train a parse ranking model for the PET parser (Callmeier, 2002) to selectively rank the parser output. These tools, and the grammar, are available from the Deep Linguistic Processing with HPSG Initiative (DELPH-IN: http://www.delph-in.net/).

We use this parser to parse the defining sentences into a full meaning representation using minimal recursion semantics (MRS: Copestake et al. (2001)).

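The shape of the Lexeed data described in §2.1 (cf. Figure 1) can be sketched with a small data structure. This is an illustrative sketch only: the class and field names below are our own invention, not the actual Lexeed format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Sense:
    """One sense of a headword (field names are ours, for illustration)."""
    definition: str
    hypernym: Optional[str] = None        # extracted genus term
    semantic_class: Optional[str] = None  # linked Goi-Taikei class
    domain: Optional[str] = None          # e.g. golf, for driver sense 3

@dataclass
class Entry:
    headword: str
    familiarity: float        # 1.0-7.0; Lexeed keeps words >= 5.0
    senses: list = field(default_factory=list)

# The "driver" entry of Figure 1 in this toy representation.
doraiba = Entry("doraiba", 6.5, [
    Sense("A tool for inserting and removing screws.",
          "dougu (tool)", "942:tool"),
    Sense("Someone who drives a car.",
          "hito (person)", "292:driver"),
    Sense("In golf, a long-distance club.",
          "kurabu (club)", "921:leisure equipment", "gorufu (golf)"),
])
```

The hypernym, semantic class and domain fields are exactly the underlined material of Figure 1: they are absent from the original dictionary and filled in by the extraction described in §3.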
We extract the relations from the semantic output of the parsed definition sentences. The output is written in Minimal Recursion Semantics (Copestake et al., 2001). Previous work has successfully used regular expressions, both for English (Barnbrook, 2002) and Japanese (Tsurumaru et al., 1991; Tokunaga et al., 2001). Regular expressions are extremely robust, and relatively easy to construct. However, we use a parser for four reasons. The first is that it makes our knowledge acquisition more language independent. If we have a parser that can produce MRS, and a machine readable dictionary for that language, the knowledge acquisition system can easily be ported. The second reason is that we can go on to use the parser and acquisition system to acquire knowledge from non-dictionary sources. Fujii and Ishikawa (2004) have shown how it is possible to identify definitions semi-automatically; however, these sources are not as standard as dictionaries and thus harder to parse using only regular expressions. The third reason is that we can more easily acquire knowledge beyond simple hypernyms, for example, identifying synonyms through common definition patterns as proposed by Tsuchiya et al. (2001). The final reason is that we are ultimately interested in language understanding, and thus wish to develop a parser. Any effort spent in building and refining regular expressions is not reusable, while creating and improving a grammar has intrinsic value.

3.1 The Extraction Process

To extract hypernyms, we parse the first definition sentence for each sense. The parser uses the stochastic parse ranking model learned from the Hinoki treebank, and returns the MRS of the first ranked parse. Currently, just over 80% of the sentences can be parsed.

An MRS consists of a bag of labeled elementary predicates and their arguments, a list of scoping constraints, and a pair of relations that provide a hook into the representation — a label, which must outscope all the handles, and an index (Copestake et al., 2001). The MRSs for the definition sentence for doraibā_2 and its English equivalent are given in Figure 2. The hook's label and index are shown first, followed by the list of elementary predicates. The figure omits some details (message type and scope have been suppressed).

  ⟨h0, x1 {h0:prpstn_rel(h5), h1:hito(x1), h2:udef(x1, h1, h6), h3:jidōsha(x2), h4:udef(x2, h3, h7), h5:unten(u1, x1, x2)}⟩
  ⟨h0, x1 {h0:prpstn_rel(h5), h1:person(x1), h2:some(x1, h1, h6), h3:car(x2), h4:indef(x2, h3, h7), h5:drive(u1, x1, x2)}⟩

Figure 2: Simplified MRS representations for doraibā_2, jidōsha o unten suru hito "somebody who drives a car"
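The core of the hypernym extraction over such a representation can be sketched as follows. The encoding (plain tuples) is our own toy stand-in for the DELPH-IN data structures, not the actual implementation: each elementary predicate is a (label, predicate-name, argument-tuple) triple, and the genus term is the non-quantifier predicate whose first argument equals the hook's index.

```python
# Sketch of hook-based hypernym extraction from an MRS like Figure 2.
def extract_hypernyms(hook_index, rels):
    """Return the predicate names whose first argument is the hook index.

    More than one name can come back when the index is linked to a
    coordinate construction; each one then yields its own relation.
    """
    return [pred for _label, pred, args in rels
            if args and args[0] == hook_index]

# Predicates for doraiba_2 "somebody who drives a car"
# (quantifiers omitted for simplicity).
driver2_rels = [
    ("h1", "hito",    ("x1",)),             # person(x1)
    ("h3", "jidosha", ("x2",)),             # car(x2)
    ("h5", "unten",   ("u1", "x1", "x2")),  # drive(u1, x1, x2)
]

print(extract_hypernyms("x1", driver2_rels))  # -> ['hito']
```

As in the paper's example, the Japanese and English genus terms sit in very different surface positions, but both fall out of the same index check, which is what makes the process language independent.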

In most cases, the first sentence of a dictionary definition consists of a fragment headed by the same part of speech as the headword. Thus the noun driver is defined as a noun phrase. The fragment consists of a genus term (somebody) and differentia (who drives a car).[1] The genus term is generally the most semantically salient word in the definition sentence: the word with the same index as the index of the hook. For example, for sense 2 of the word doraibā, the hypernym is hito "person" (Figure 2). Although the actual hypernym is in very different positions in the Japanese and English definition sentences, it is the hook in both the semantic representations.

[1] Also known as superordinate and discriminator or restriction.

For some definition sentences (around 20%), further parsing of the semantic representation is necessary. The most common case is where the index is linked to a coordinate construction. In that case, the coordinated elements have to be extracted, and we build two relationships. Other common cases are those where the relationship between the headword and the genus is given explicitly in the definition sentence: for example in (1), where the relationship is given as abbreviation. We initially process the relation, ryaku "abbreviation", yielding the coordinate structure. This in turn gives two words: arupusu "Alps" and nihon arupusu "Japanese Alps". Our system thus produces two relations: abbreviation(a, arupusu) and abbreviation(a, nihon arupusu). As can be seen from this example, special cases can embed each other, which makes the use of regular expressions difficult.

(1) a: arupusu, matawa nihon-arupusu no ryaku
    a: Alps, or Japan-Alps ADN abbreviation
    "a: an abbreviation for the Alps or the Japanese Alps"

The extent to which non-hypernym relations are included as text in the definition sentences, as opposed to stored as separate fields, varies from dictionary to dictionary. For knowledge acquisition from open text, we cannot expect any labeled features, so the ability to extract information from plain text is important.

We also extract information not explicitly labeled, such as the domain of the word, as in Figure 3. Here the adpositional phrase representing the domain has wide scope — in effect the definition means "In golf, [a driver_3 is] a club for playing long strokes". The phrase that specifies the domain should modify a non-expressed predicate. To parse this, we added a construction to the grammar that allows an NP fragment heading an utterance to have an adpositional modifier. We then extract these modifiers and take the head of the noun phrase to be the domain. Again, this is hard to do reliably with regular expressions, as an initial NP followed by de "in" could be a copula phrase, or a PP that attaches anywhere within the definition — not all such initial phrases restrict the domain. Most of the domains extracted fall under a few superordinate terms, mainly sport, games and religion. Other, more general domains are marked explicitly in Lexeed as features. Japanese equivalents of the following words have a sense marked as being in the domain golf: approach, edge, down, tee, driver, handicap, pin, long shot.

Figure 3: Parse for Sense 3 of driver. The tree is only partly recoverable; its node labels and glosses are:
  [UTTERANCE [VP [PP [PP [N "golf"] [CASE-P "in"]] [PUNCT ,]] [NP [PP [N "long-distance"] [NMOD-P ADN]] [N "club"]] V[COPULA]]]

We summarize the links acquired in Table 2, grouped by coarse part of speech. The first three lines show hypernym relations: implicit hypernyms (the default), explicitly indicated hypernyms, and implicitly indicated hyponyms. The second three lines show other relations: abbreviations, names and domains. Implicit hypernyms are by far the most common relations: fewer than 10% of entries are marked with an explicit relationship.

Table 2: Acquired Knowledge

  Relation Type  Noun    Verbal Noun  Verb   Other
  Implicit       21,245  5,467        6,738  5,569
  Hypernym          230      5            9
  Hyponym           194      5            5
  Abbreviation      423     35                  76
  Name              121      5
  Domain            922    170          141

3.2 Verification with Goi-Taikei

We verified our results by comparing the hypernym links to the manually constructed Japanese ontology Goi-Taikei. It is a hierarchy of 2,710 semantic classes, defined for over 264,312 nouns (Ikehara et al., 1997). Because the semantic classes are only defined for nouns (including verbal nouns), we can only compare nouns. Senses are linked to Goi-Taikei semantic classes by the following heuristic: look up the semantic classes C for both the headword (w_h) and the genus term(s) (w_g). If at least one of the headword's semantic classes is subsumed by at least one of the genus' semantic classes, then we consider their relationship confirmed (1).

  ∃(c_h, c_g) : {c_h ⊂ c_g; c_h ∈ C(w_h); c_g ∈ C(w_g)}    (1)

In the event of an explicit hyponym relationship indicated between the headword and the genus, the test is reversed: we look for an instance of the genus' class being subsumed by the headword's class (c_g ⊂ c_h). Our results are summarized in Table 3. The total is 58.5% (15,888 confirmed out of 27,146). Adding in the named and abbreviation relations, the coverage is 60.7%. This is comparable to the coverage of Tokunaga et al. (2001), who get a coverage of 61.4%, extracting relations using regular expressions from a different dictionary.

Table 3: Links Confirmed by Comparison with Goi-Taikei

  Relation   Noun                    Verbal Noun
  Implicit   56.66% (12,037/21,245)  64.55% (3,529/5,467)
  Hypernym   56.52% (134/230)         0.00% (0/5)
  Hyponym    94.32% (183/194)        100.00% (5/5)
  Subtotal   57.01% (12,354/21,669)  64.52% (3,534/5,477)
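The confirmation heuristic in (1) can be sketched directly. The tiny class hierarchy below is an invented stand-in for Goi-Taikei's 2,710 classes, and all function and variable names are our own.

```python
# Toy child -> parent links standing in for the Goi-Taikei hierarchy.
PARENT = {
    "292:driver": "4:person",
    "942:tool":   "893:equipment",
}

def subsumed_by(lower, upper):
    """True if class `lower` is `upper` or lies below it in the hierarchy."""
    while lower is not None:
        if lower == upper:
            return True
        lower = PARENT.get(lower)
    return False

def confirmed(headword_classes, genus_classes, explicit_hyponym=False):
    """Existential check of (1): some class of the headword is subsumed
    by some class of the genus term; reversed for explicit hyponyms."""
    if explicit_hyponym:
        headword_classes, genus_classes = genus_classes, headword_classes
    return any(subsumed_by(c_h, c_g)
               for c_h in headword_classes
               for c_g in genus_classes)

# doraiba_2 "driver" with genus hito "person": 292:driver lies under 4:person.
print(confirmed(["292:driver"], ["4:person"]))  # -> True
```

Because a word can carry several semantic classes, the check is existential: a single subsuming pair is enough to count the link as confirmed.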

¢¡ ¡ ¡£ ¥¤ ¤ ¤§¦ ¦ ¦©¨ ¨ ¨

   1 We are extending the extraction process by

tomato ketchup adding new explicit relations, such as $&%&'

© £ £ £    ¦ ¦ ¦

   1 teineigo “polite form”. For word senses such

white sauce as driver3, where there is no appropriate Goi-

   ¦ ¦ ¦£    ¦ ¦ ¦    ¦ ¦ ¦

      1 2 Taikei class, we intend to estimate the semantic

meat sauce sauce class by using the definition sentence as a vec-

¢¡ ¡ ¡£    ¦ ¦ ¦

   1 tor, and looking for words with similar defini-

tomato sauce tions (Kasahara et al., 1997).

      

¤ ¤ ¤§¦ ¦ ¦©¨ ¨ ¨

   1 1 We are extending the extracted relations in

ketchup condiment

   several ways. One way is to link the hypernyms

1 to the relevant word sense, not just the word. If

¡ ©

¨

   

salt we know that club “kurabu” is a hyper-

¦ ¦ ¦  

1 nym of h921:leisure equipmenti, then it rules

   

curry powder

      

¦ ¦ ¦ out the card suit “clubs” and the “association 1 1 of people with similar interests” senses. Other

curry spice heuristics have been proposed by Rigau et al.

£ £ £

      1 (1997). Another way is to use the thesaurus to spice predict which words name explicit relationships which need to be extracted separately (like ab- breviation). Figure 4: Refinement of the class condiment. 5 Conclusion GT only has two semantic classes: h942:tooli In this paper we described the extraction of the- and h292:driveri. It does not have the seman- saurus information from parsed dictionary defi- tic class h921:leisure equipmenti. Therefore nition sentences. The main data for our exper- we cannot confirm the third link, even though it iments comes from Lexeed, a Japanese seman- is correct, and the domain is correctly extracted. tic dictionary, and the Hinoki treebank built on it. The dictionary is parsed using a head-driven Further Work phrase structure grammar of Japanese. Knowl- There are four main areas in which we wish to edge is extracted from the semantic representa- extend this research: improving the grammar, tion. Comparing our results with the Goi-Taikei extending the extraction process itself, further hierarchy, we could confirm 60.73% of the rela- exploiting the extracted relations and creating tions extracted. a thesaurus from an English dictionary. Acknowledgments As well as extending the coverage of the gram- mar, we are investigating making the semantics The authors would like to thank Colin Bannard, more tractable. In particular, we are investi- the other members of the NTT Machine Trans- gating the best way to represent the semantics lation Research Group, NAIST Matsumoto Laboratory, and researchers in the DELPH-IN of explicit relations such as !#" isshu “a kind of”.2 community, especially Timothy Baldwin, Dan Flickinger, Stephan Oepen and Melanie Siegel. 2 These are often transparent nouns: those nouns which are transparent with regard to collocational or se- nal context of the construction, or transparent to number lection relations between their dependent and the exter- agreement (Fillmore et al., 2002). 
References

Shigeaki Amano and Tadahisa Kondo. 1999. Nihongo-no Goi-Tokusei (Lexical properties of Japanese). Sanseido.

Geoff Barnbrook. 2002. Defining Language — A local grammar of definition sentences. Studies in Corpus Linguistics. John Benjamins.

Francis Bond and Caitlin Vatikiotis-Bateson. 2002. Using an ontology to determine English countability. In 19th International Conference on Computational Linguistics: COLING-2002, volume 1, pages 99–105, Taipei.

Francis Bond, Sanae Fujita, Chikara Hashimoto, Kaname Kasahara, Shigeko Nariyama, Eric Nichols, Akira Ohtani, Takaaki Tanaka, and Shigeaki Amano. 2004. The Hinoki treebank: A treebank for text understanding. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04). Springer Verlag. (in press).

Ulrich Callmeier. 2002. Preprocessing and encoding techniques in PET. In Stephan Oepen, Dan Flickinger, Jun-ichi Tsujii, and Hans Uszkoreit, editors, Collaborative Language Engineering, chapter 6, pages 127–143. CSLI Publications, Stanford.

Ann Copestake. 1990. An approach to building the hierarchical element of a lexical knowledge base from a machine readable dictionary. In Proceedings of the First International Workshop on Inheritance in Natural Language Processing, pages 19–29, Tilburg. (ACQUILEX WP NO. 8.).

Ann Copestake, Alex Lascarides, and Dan Flickinger. 2001. An algebra for semantic construction in constraint-based grammars. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), Toulouse, France.

William Dolan, Lucy Vanderwende, and Stephen D. Richardson. 1993. Automatically deriving structured knowledge from on-line dictionaries. In Proceedings of the Pacific Association for Computational Linguistics, Vancouver.

Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato. 2002. Seeing arguments through transparent structures. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002), pages 787–91, Las Palmas.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term descriptions on the web. In 20th International Conference on Computational Linguistics: COLING-2004, Geneva. (this volume).

Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi. 1997. Goi-Taikei — A Japanese Lexicon. Iwanami Shoten, Tokyo. 5 volumes/CDROM.

Kaname Kasahara, Kazumitsu Matsuzawa, and Tsutomu Ishikawa. 1997. A method for judgment of semantic similarity between daily-used words by using machine readable dictionaries. Transactions of IPSJ, 38(7):1272–1283. (in Japanese).

Kaname Kasahara, Hiroshi Sato, Francis Bond, Takaaki Tanaka, Sanae Fujita, Tomoko Kanasugi, and Shigeaki Amano. 2004. Construction of a Japanese semantic lexicon: Lexeed. SIG NLC-159, IPSJ, Tokyo. (in Japanese).

Stephan Oepen and John Carroll. 2000. Performance profiling for grammar engineering. Natural Language Engineering, 6(1):81–97.

Carl Pollard and Ivan A. Sag. 1994. Head Driven Phrase Structure Grammar. University of Chicago Press, Chicago.

German Rigau, Jordi Atserias, and Eneko Agirre. 1997. Combining unsupervised lexical knowledge methods for word sense disambiguation. In Proceedings of joint EACL/ACL 97, Madrid.

Melanie Siegel and Emily M. Bender. 2002. Efficient deep processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization at the 19th International Conference on Computational Linguistics, Taipei.

Takenobu Tokunaga, Yasuhiro Syotu, Hozumi Tanaka, and Kiyoaki Shirai. 2001. Integration of heterogeneous language resources: A monolingual dictionary and a thesaurus. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, NLPRS2001, pages 135–142, Tokyo.

Masatoshi Tsuchiya, Sadao Kurohashi, and Satoshi Sato. 2001. Discovery of definition patterns by compressing dictionary sentences. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, NLPRS2001, pages 411–418, Tokyo.

Hiroaki Tsurumaru, Katsunori Takesita, Itami Katsuki, Toshihide Yanagawa, and Sho Yoshida. 1991. An approach to thesaurus construction from Japanese language dictionary. In IPSJ SIG-Notes Natural Language, volume 83-16, pages 121–128. (in Japanese).