Multilingual Ontology Acquisition from Multiple MRDs
Eric Nichols♭, Francis Bond♮, Takaaki Tanaka♮, Sanae Fujita♮, Dan Flickinger♯
♭ Nara Inst. of Science and Technology, Grad. School of Information Science, Nara, Japan ([email protected])
♮ NTT Communication Science Labs, Natural Language Research Group, Keihanna, Japan ({bond,takaaki,sanae}@cslab.kecl.ntt.co.jp)
♯ Stanford University, CSLI, Stanford, CA ([email protected])

Abstract

In this paper, we outline the development of a system that automatically constructs ontologies by extracting knowledge from dictionary definition sentences using Robust Minimal Recursion Semantics (RMRS). Combining deep and shallow parsing resources through the common formalism of RMRS allows us to extract ontological relations in greater quantity and quality than possible with any of the methods independently. Using this method, we construct ontologies from two different Japanese lexicons and one English lexicon. We then link them to existing, hand-crafted ontologies, aligning them at the word-sense level. This alignment provides a representative evaluation of the quality of the relations being extracted. We present the results of this ontology construction and discuss how our system was designed to handle multiple lexicons and languages.

1 Introduction

Automatic methods of ontology acquisition have a long history in the field of natural language processing. The information contained in ontologies is important for a number of tasks, for example word sense disambiguation, question answering and machine translation. In this paper, we present the results of experiments conducted in automatic ontological acquisition over two languages, English and Japanese, and from three different machine-readable dictionaries.

Useful semantic relations can be extracted from large corpora using relatively simple patterns (e.g., (Pantel et al., 2004)). While large corpora often contain information not found in lexicons, even a very large corpus may not include all the familiar words of a language, let alone those words occurring in useful patterns (Amano and Kondo, 1999). Therefore it makes sense to also extract data from machine readable dictionaries (MRDs).

There is a great deal of work on the creation of ontologies from machine readable dictionaries (a good summary is (Wilks et al., 1996)), mainly for English. Recently, there has also been interest in Japanese (Tokunaga et al., 2001; Nichols et al., 2005). Most approaches use either a specialized parser or a set of regular expressions tuned to a particular dictionary, often with hundreds of rules. Agirre et al. (2000) extracted taxonomic relations from a Basque dictionary with high accuracy using Constraint Grammar together with hand-crafted rules. However, such a system is limited to one language, and it has yet to be seen how the rules will scale when deeper semantic relations are extracted. In comparison, as we will demonstrate, our system produces comparable results while the framework is immediately applicable to any language with the resources to produce RMRS.

Advances in the state of the art in parsing have made it practical to use deep processing systems that produce rich syntactic and semantic analyses to parse lexicons. This high level of semantic information makes it easy to identify the relations between words that make up an ontology. Such an approach was taken by the MindNet project (Richardson et al., 1998). However, deep parsing systems often suffer from small lexicons and large amounts of parse ambiguity, making it difficult to apply this knowledge broadly.

Our ontology extraction system uses Robust Minimal Recursion Semantics (RMRS), a formalism that provides a high level of detail while, at the same time, allowing for the flexibility of underspecification. RMRS encodes syntactic information in a general enough manner to support processing of, and extraction from, syntactic phenomena including coordination, relative clause analysis and the treatment of argument structure of verbs and verbal nouns. It provides a common format for naming semantic relations, allowing them to be generalized over languages. Because of this, we are able to extend our system to cover new languages that have RMRS resources available with a minimal amount of effort. The underspecification mechanism in RMRS makes it possible for us to produce input that is compatible with our system from a variety of different parsers. By selecting parsers of various levels of robustness and informativeness, we avoid the coverage problem that is classically associated with approaches using deep processing; using heterogeneous parsing resources maximizes the quality and quantity of ontological relations extracted. Currently, our system uses input from parsers at three levels: morphological analyzers are the shallowest, parsers using Head-driven Phrase Structure Grammars (HPSG) are the deepest, and dependency parsers provide a middle ground.
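This multi-level combination can be pictured with the following minimal Python sketch. The parser wrappers and the RMRS type here are illustrative assumptions rather than the actual tools described in Section 2.4; each wrapper is assumed to convert its native output into RMRS or return None on failure.

    from typing import Callable, List, Optional

    # Hypothetical stand-in for a full RMRS object (cf. Section 2.4.1);
    # left abstract here.
    RMRS = dict

    def rmrs_analyses(sentence: str,
                      parsers: List[Callable[[str], Optional[RMRS]]]) -> List[RMRS]:
        """Collect RMRS analyses from parser wrappers ordered deepest-first.

        Each wrapper (HPSG parser, dependency parser, or morphological
        analyzer) is assumed to return an RMRS, or None when it cannot
        analyze the sentence.  Keeping every successful analysis lets
        deep and shallow results be combined downstream, so coverage
        never falls below that of the shallowest tool.
        """
        analyses = []
        for parse in parsers:
            result = parse(sentence)
            if result is not None:
                analyses.append(result)
        return analyses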
Our system was initially developed for one Japanese dictionary (Lexeed). The use of the abstract formalism, RMRS, made it easy to extend to a different Japanese lexicon (Iwanami) and even a lexicon in a different language (GCIDE).

Section 2 provides a description of RMRS and the tools used by our system. The ontological acquisition system is presented in Section 3. The results of evaluating our ontologies by comparison with existing resources are given in Section 4. We discuss our findings in Section 5.

2 Resources

2.1 The Lexeed Semantic Database of Japanese

The Lexeed Semantic Database of Japanese is a machine readable dictionary that covers the most familiar open class words in Japanese, as measured by a series of psycholinguistic experiments (Kasahara et al., 2004). Lexeed consists of all open class words with a familiarity greater than or equal to five on a scale of one to seven. This gives 28,000 words divided into 46,000 senses and defined with 75,000 definition sentences. All definition and example sentences have been rewritten to use only the 28,000 familiar open class words, and both have been treebanked with the JACY grammar (§ 2.4.2).

2.2 The Iwanami Dictionary of Japanese

The Iwanami Kokugo Jiten (Iwanami) (Nishio et al., 1994) is a concise Japanese dictionary. A machine tractable version was made available by the Real World Computing Project for the SENSEVAL-2 Japanese lexical task (Shirai, 2003). Iwanami has 60,321 headwords and 85,870 word senses. Each sense in the dictionary consists of a sense ID and morphological information (word segmentation, POS tag, base form and reading, all manually post-edited).

2.3 The GNU Collaborative International Dictionary of English

The GNU Collaborative International Dictionary of English (GCIDE) is a freely available dictionary of English based on Webster's Revised Unabridged Dictionary (published in 1913) and supplemented with entries from WordNet and additional submissions from users. It currently contains over 148,000 definitions. The version used in this research is formatted in XML and is available for download from www.ibiblio.org/webster/.

We arranged the headwords by frequency and segmented their definition sentences into sub-sentences by tokenizing on semicolons (;). This produced a total of 397,460 pairs of headwords and sub-sentences, for an average of slightly less than four sub-sentences per definition sentence. For corpus data, we selected the first 100,000 definition sub-sentences of the headwords with the highest frequency. This subset of definition sentences contains 12,440 headwords with 36,313 senses, covering approximately 25% of the definition sentences in the GCIDE. The GCIDE has the most polysemy of the lexicons used in this research: it averages over 3 senses per defined word, in comparison to Lexeed and Iwanami, which both have less than 2.
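As an illustration only, the segmentation step just described can be sketched in Python as below. The input format is simplified to (headword, definition) pairs; the real data would first be read from the GCIDE's XML.

    def headword_subsentences(entries):
        """entries: iterable of (headword, definition) pairs, assumed
        already sorted by headword frequency (most frequent first)."""
        for headword, definition in entries:
            # Tokenizing on semicolons yields the definition sub-sentences.
            for sub in definition.split(';'):
                sub = sub.strip()
                if sub:
                    yield headword, sub

    entries = [('dictionary', 'a book of words; a reference work')]
    print(list(headword_subsentences(entries)))
    # [('dictionary', 'a book of words'), ('dictionary', 'a reference work')]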
2.4 Parsing Resources

We used Robust Minimal Recursion Semantics (RMRS), designed as part of the Deep Thought project (Callmeier et al., 2004), as the formalism for our ontological relation extraction engine. We used deep-processing tools from the Deep Linguistic Processing with HPSG Initiative (DELPH-IN: http://www.delph-in.net/), as well as medium- and shallow-processing tools for Japanese processing (the morphological analyzer ChaSen and the dependency parser CaboCha) from the Matsumoto Laboratory.

2.4.1 Robust Minimal Recursion Semantics

Robust Minimal Recursion Semantics is a form of flat semantics which is designed to allow deep and shallow processing to use a compatible semantic representation, with fine-grained atomic components of semantic content so that shallow methods can contribute just what they know, yet with enough expressive power for rich semantic content including generalized quantifiers (Frank, 2004). The architecture of the representation is based on Minimal Recursion Semantics (Copestake et al., 2005), including a bag of labeled elementary predicates (EPs) and their arguments, a list of scoping constraints which enable scope underspecification, and a handle that provides a hook into the representation.

2.4.2 Deep Parsers

For Japanese we used JACY (Siegel, 2000); for English we used the English Resource Grammar (ERG: Flickinger, 2000).

JACY   The JACY grammar is an HPSG-based grammar of Japanese which originates from work done in the Verbmobil project (Siegel, 2000) on machine translation of spoken dialogues in the domain of travel planning. It has since been extended to accommodate written Japanese and new domains (such as electronic commerce customer email and machine readable dictionaries).

The grammar implementation is based on a system of types. There are around 900 lexical types that define the syntactic, semantic and pragmatic properties of Japanese words, and 188 types that define the properties of phrases and lexical rules.
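To make the RMRS architecture described in § 2.4.1 concrete, the following purely illustrative Python data structures capture its main components. This is one reading of the published description, not the DELPH-IN implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class ElementaryPredication:
        label: str                 # handle labelling this EP, e.g. 'h8'
        predicate: str             # relation name, e.g. '_dictionary_n_1'
        args: Dict[str, str] = field(default_factory=dict)  # e.g. {'ARG0': 'x4'}

    @dataclass
    class RMRS:
        hook: str                              # top handle into the structure
        eps: List[ElementaryPredication]       # flat bag of labeled EPs
        scope_constraints: List[Tuple[str, str, str]] = field(default_factory=list)
        # e.g. ('h3', 'qeq', 'h8'): scope is constrained but left underspecified

    # A shallow analysis may populate only eps, with bare predicates;
    # a deep (HPSG) analysis also fills in args and scope_constraints.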