Using Goi-Taikei as an Upper Ontology to Build a Large-Scale Japanese Ontology from Wikipedia

Masaaki Nagata
NTT Communication Science Laboratories
[email protected]

Yumi Shibaki and Kazuhide Yamamoto
Nagaoka University of Technology
{shibaki,yamamoto}@jnlp.org

Proceedings of the 6th Workshop on Ontologies and Lexical Resources (Ontolex 2010), pages 11–18, Beijing, August 2010

Abstract

We present a novel method for building a large-scale Japanese ontology from Wikipedia using one of the largest Japanese thesauri, Nihongo Goi-Taikei (referred to hereafter as “Goi-Taikei”), as an upper ontology. First, the leaf categories in the Goi-Taikei hierarchy are semi-automatically aligned with semantically equivalent Wikipedia categories. Then, their subcategories are created automatically by detecting is-a links in the Wikipedia category network below the junction using the knowledge defined in Goi-Taikei above the junction. The resulting ontology has a well-defined taxonomy in the upper level and a fine-grained taxonomy in the lower level with a large number of up-to-date instances. A sample evaluation shows that the precisions of the extracted categories and instances are 92.8% and 98.6%, respectively.

1 Introduction

In recent years, we have become increasingly aware of the need for up-to-date knowledge bases offering broad coverage in order to implement practical semantic inference engines for advanced applications such as question answering, summarization, and textual entailment recognition. One promising approach involves automatically extracting a large comprehensive ontology from Wikipedia, a freely available online encyclopedia with a wide variety of information. One problem with previous such efforts is that the resulting ontology is either fragmentary or trivial.

Ponzetto and Strube (2007) present a set of lightweight heuristics, such as head matching and modifier matching, for distinguishing between is-a and not-is-a links in the Wikipedia category network. The most powerful heuristic is head matching, in which a category link is labeled as is-a if the two categories share the same head lemma, such as CAPITALS IN ASIA and CAPITALS. For Japanese, Sakurai et al. (2008) present an equivalent method. As Japanese is a head-final language, they introduced a heuristic called suffix matching, in which a category link is labeled as is-a if one category is the suffix of the other category, such as 日本の空港 (airports in Japan) and 空港 (airports). The problem with the ontology extracted by these two methods is that it is not a single interconnected taxonomy, but a set of taxonomic trees.

One way to make a single taxonomy is to use an existing large-scale taxonomy as a core for the resulting ontology. In YAGO, Suchanek et al. (2007) merged English WordNet and Wikipedia by adding instances (namely Wikipedia articles) to the is-a hierarchy of WordNet. Of the categories assigned to a Wikipedia article, they regarded one with a plural head noun as the article's hypernym, which is called a conceptual category. They then linked the conceptual category to a WordNet synset by heuristic rules including head matching. For Japanese, Kobayashi et al. (2008) present an attempt equivalent to YAGO, in which they merged Goi-Taikei and Japanese Wikipedia. The problem with these two methods is that the core taxonomy is extended by only one level, although many new instances are added. They cannot make the most of the fine-grained taxonomic information contained in the Wikipedia category network.

In this paper, we present a novel method for building a single interconnected ontology from Wikipedia, with a fine-grained taxonomy in the lower level, by using a manually constructed thesaurus as its upper ontology. In the following sections, we first describe the language resources used in this work. We then describe a semi-automatic method for building the ontology and report our experimental results.

2 Language Resources

2.1 Nihongo Goi-Taikei

Nihongo Goi-Taikei (日本語語彙大系, `comprehensive outline of Japanese vocabulary') [1] is one of the largest and best known Japanese thesauri (Ikehara et al., 1997). It was originally developed as a dictionary for a Japanese-to-English machine translation system in the early 1990s. It was then published as a book in 5 volumes in 1997 and as a CD-ROM in 1999. It contains about 300,000 Japanese words, and the meanings of each word are described using 2,715 hierarchical semantic categories. Each word has up to 5 semantic categories in order of frequency of use, and each category is assigned a unique ID number and a category name, such as 4:person and 388:place [2].

Goi-Taikei has separate semantic category hierarchies for common nouns, proper nouns, and verbs. We used only the common noun hierarchy in this work. For simplicity, we mapped all proper nouns in the proper noun categories to the equivalent common noun categories using the category mapping table shown in the Goi-Taikei book.

Figure 1 shows the top three layers for common nouns [3]. For example, the transliterated Japanese word raita (ライター) has two semantic categories, 353:author and 915:household appliance. The former originates from the English word “writer” while the latter originates from the English word “lighter”. By climbing up the Goi-Taikei category hierarchy, we can infer that the former refers to a human being (4:person) while the latter refers to a physical object (533:concrete object).

[1] Referred to as “Goi-Taikei” unless otherwise noted.
[2] We use Sans Serif for the Goi-Taikei category and SMALL CAPS for the Wikipedia category. The Goi-Taikei category is prefixed with its ID number.
[3] The maximum depth of the common noun hierarchy is 12. Most links are is-a relations, but some are part-of relations, which are explicitly marked.

2.2 Japanese Wikipedia

Wikipedia is a free, multilingual, on-line encyclopedia actively developed by a large number of volunteers. Japanese Wikipedia now has about 500,000 articles. Figure 2 shows examples of an article page and a category page. An article page has a title, body, and categories. In most articles, the first sentence of the body gives the definition of the title. A category also has a title, body, and categories. Its title is prefixed with “Category:” and its body includes a list of articles that belong to the category.

Although the Wikipedia category system is organized in a hierarchical manner, it is not a taxonomy but a thematic classification. An article can belong to many categories, and the category network has loops. The relations between linked categories are chaotic, but the lower a category link is in the hierarchy, the more likely it is to be an is-a relation. For example, the category link between COCKTAIL and ALCOHOLIC BEVERAGE is an is-a relation. Although the article on the shaker is in the category COCKTAIL, a shaker is not a cocktail but an appliance. Extracting a taxonomy from the Wikipedia category network is therefore not trivial.

3 Ontology Building Method

Figure 3 shows an outline of the proposed ontology building method. We first semi-automatically align each leaf category in the Goi-Taikei category hierarchy with one or more Wikipedia categories. We call a Wikipedia category aligned with a Goi-Taikei category a junction category. We then extend each Goi-Taikei leaf category by detecting the is-a links below the junction category in the Wikipedia category network using the knowledge defined above the junction category in Goi-Taikei.

Figure 1: Top three layers of the common noun semantic category hierarchy in Nihongo Goi-Taikei

Article page (translation):
<title>cocktail</title>
A cocktail (English: Cocktail) is an alcoholic beverage made by mixing a base liquor with other liquor or juice.
<Category>cocktail</Category>

Category page (translation):
<title>Category:Cocktails</title>
Category on [[cocktails]].
<Category>alcoholic beverages</Category>

Figure 2: Examples of title, body (definition sentence), and category for article page and category page in Japanese Wikipedia (left) and their translation (right)

Figure 3: The ontology building method: First, Goi-Taikei leaf categories are aligned with Wikipedia categories (left), then each leaf category is extended by detecting is-a links in Wikipedia (right).

3.1 Category Alignment

For each leaf category in Goi-Taikei, we first make a list of junction category candidates. Wikipedia categories satisfying at least one of the following three conditions are extracted as candidates:

• The Goi-Taikei category name exactly matches the Wikipedia category name.
• One of the instances of the Goi-Taikei category exactly matches the Wikipedia category name.

extract a hypernym of the name of each article and category in advance. We regard the first sentence of each article page as the definition of the concept referred to by the title. We applied language-dependent lexico-syntactic patterns to the definition sentence to extract the hypernym. The hypernym of the category name is extracted from the definition sentence if it exists. If there is an article whose title is the same as its category, the hypernym of the article is used
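The candidate-extraction step of Section 3.1 can be illustrated with a minimal sketch in Python. It covers only the two exact-match conditions stated above (category name match and instance match); the third condition is not shown in this excerpt and is deliberately left out. The class `GoiTaikeiLeaf`, the function `junction_candidates`, and the toy data are hypothetical names introduced for illustration, not part of the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GoiTaikeiLeaf:
    """Hypothetical container for one Goi-Taikei leaf category."""
    category_id: int
    name: str
    # Words listed under this category in the thesaurus (its instances).
    instances: set[str] = field(default_factory=set)


def junction_candidates(leaf: GoiTaikeiLeaf,
                        wikipedia_categories: set[str]) -> dict[str, str]:
    """Map each candidate Wikipedia category name to the condition it met.

    Only the two exact-match conditions from the text are implemented.
    """
    candidates: dict[str, str] = {}
    for wiki_name in sorted(wikipedia_categories):
        if wiki_name == leaf.name:
            # Condition 1: Goi-Taikei category name == Wikipedia category name.
            candidates[wiki_name] = "category-name match"
        elif wiki_name in leaf.instances:
            # Condition 2: an instance of the leaf == Wikipedia category name.
            candidates[wiki_name] = "instance match"
    return candidates


# Toy data (assumed for illustration; not taken from Goi-Taikei or Wikipedia).
leaf = GoiTaikeiLeaf(category_id=9999, name="beverage",
                     instances={"cocktail", "tea"})
wiki = {"beverage", "cocktail", "airport"}
print(junction_candidates(leaf, wiki))
```

Because the alignment is described as semi-automatic, a list like this would presumably be only a starting point that is then checked before a candidate is accepted as a junction category.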