Ontological Parsing of Encyclopedia Information 1

Victor Bocharov, Lidia Pivovarova, Valery Rubashkin, Boris Chuprin

St. Petersburg State University, Universitetskaya nab. 11, Saint Petersburg, Russia
[email protected], [email protected], [email protected], [email protected]

Abstract. Semi-automatic ontology learning from an encyclopedic dictionary is presented, with the primary focus on the syntactic and semantic analysis of definitions.

Keywords: Ontology Learning, Syntax Analysis, Relation Extraction, Encyclopedia, Wikipedia.

Introduction

Ontology Learning is a rapidly expanding area of Natural Language Processing. Many language technologies – from machine translation to speech recognition – should be supported by ontologies that provide a conceptual interpretation covering the entire corpus vocabulary. However, a formal ontology sufficient to cover the entire lexis even of a narrow domain has to include a few dozen thousand concepts. Manual development of such an ontology is therefore a very time-consuming process that cannot be completed at the required level of completeness. This "bottleneck" problem is currently considered the main obstacle to using ontologies [1], and it becomes even more severe if a universal knowledge base is needed instead of a domain ontology. For these reasons, ontology learning technologies are now quite popular.

Different sources can be used for ontology learning – natural language texts, machine-readable dictionaries, semi-structured data, knowledge bases, etc. (a complete survey is presented in [2]) – although the task is generally understood as ontology development from natural language. Parsing of machine-readable dictionaries, however, seems to be more effective. The main difference between a natural language text and a dictionary is the form of knowledge representation: knowledge in a dictionary is more structured and compact than in free text. In some cases the structure is presented in a dictionary explicitly (as markup, tags, etc.); otherwise it is expressed only by syntax. Many efforts are currently underway in this area (e.g., [3], [4], [5], [6], [7], [8], [9]). Nevertheless, we are unaware of any comparable effort for Russian dictionaries, though certain approaches to ontology learning from Russian free texts are known (e.g., [10], [11], [12]).

Problem Statement and Basic Algorithm

We present here ontology learning from the machine-readable version of the "Russian Encyclopedic Dictionary" [13]. We use the entire dictionary with the exception of toponyms and proper names. The portion of the dictionary taken into consideration includes 26,375 entries, which describe 21,782 different terms. The difference between these two figures is caused by the presence of ambiguous terms (e.g., there are five different definitions for "aberration" in such areas as biology, physics, etc.). The learned ontology is a universal ontology developed primarily for semantic text analysis. The basic structure of this ontology is an attribute tree in which objects alternate with attributes [15]. A small fragment of this tree is presented as an example below:

• TRANSPORT
  o BY ENERGY SOURCE
    • ELECTRIC TRANSPORT
    • ATOMIC TRANSPORT
    • FUEL TRANSPORT
    • WIND-DRIVEN TRANSPORT
  o BY ENVIRONMENT TYPE
    • AIR TRANSPORT
    • WATER TRANSPORT
    • LAND TRANSPORT
    • SPACE TRANSPORT

1 This paper is supported by the Russian Foundation for Basic Research, project № 09-06-00275-а.

This structure provides the most natural way to represent different links, such as the correspondence of a value to an attribute (*great color vs. great volume), the correspondence of an attribute to a class (SOLID –> SHAPE vs. *LIQUID –> SHAPE), or the complete set of extensional relations between concepts (incompatibility, intersection, inclusion). The ontology also represents different associative relations, which are either unified (PART –> WHOLE, OBJECT –> LOCALIZATION, OBJECT –> FUNCTION, etc.) or specialized (COUNTRY –> CAPITAL, ORGANIZATION –> CHIEF, etc.).

The lexicon is an integral part of a working ontology; it connects the conceptual model with natural language units. Such a lexicon includes words and collocations that can be used to express various concepts. These words and collocations can represent standard terms (i.e., names of concepts used in the ontology) or their synonyms (we use the term "synonym" here in its broad sense, as any natural language expression that refers to the respective concept with a reasonable probability). We use our own ontoeditor [16] with additional tools for importing encyclopedia information at the ontology learning stage.

Since the requirements for concept description in natural language processing are very strict, it is hardly possible to populate the ontology from our source in a fully automatic fashion. Therefore, ontology learning is broken down into two stages: first, the dictionary entries are pre-classified automatically, and, second, an ontology administrator is given an opportunity to approve, change or cancel a decision made by the program. We discuss here primarily the first stage of this process, i.e., the automatic linguistic analysis of encyclopedia entries.

This linguistic analysis is based on the following simple hypothesis: usually, a hyperonym for a dictionary term is the first subjective-case noun of its definition (referred to hereafter as the "basic word"). Several examples of typical dictionary entries that correspond to this hypothesis are shown below 2.

АГРАФ – нарядная заколка для волос, с помощью которой крепили в прическах перья, цветы, искусственные локоны и т. д.
HAIRPIN – a pin to hold the hair in place.

ПЕРИСТИЛЬ – прямоугольный двор, сад, площадь, окруженные с 4 сторон крытой колоннадой.
PERISTYLE – a colonnade surrounding a building or court.

ЯТАГАН – рубяще-колющее оружие (среднее между саблей и кинжалом) у народов Ближнего и Среднего Востока (известно с 16 в.).
YATAGHAN – a long knife or short saber that lacks a guard for the hand at the juncture of blade and hilt and that usually has a double curve to the edge and a nearly straight back.

As was demonstrated in the pilot study [17], the structure of most dictionary entries corresponds to our hypothesis; however, its direct usage occasionally yields incorrect results. A list of the most frequent basic words selected at the first step of analysis [17] is shown in Table 1. A very simple lemmatizer was used to determine the first noun in each definition. A total of 4603 different first nouns were identified using this technique.

Table 1. List of the most frequently used basic words (according to the pilot study [17])

Rank  Basic Word       Translation    Frequency
1     ИЗА              IZA            475
2     ЧАСТЬ            PART           415
3     СОВОКУПНОСТЬ     COMBINATION    406
4     НАЗВАНИЕ         NAME           389
5     СИСТЕМА          SYSTEM         347
6     РАЗДЕЛ           SECTION        336
7     ВИД              KIND           305
8     УСТРОЙСТВО       DEVICE         298
9     ПРИБОР           INSTRUMENT     286
10    МИНЕРАЛ          MINERAL        286
11    ЕДИНИЦА          UNIT           264
12    ФОРМА            FORM           232
13    ГРУППА           GROUP          212
14    ИНСТРУМЕНТ       TOOL           204
15    ВЕЩЕСТВО         SUBSTANCE      202
16    ЭЛЕМЕНТ          ELEMENT        198
17    МЕТОД            METHOD         194
18    ЗАБОЛЕВАНИЕ      DISEASE        186
19    ПРОЦЕСС          PROCESS        182
20    СПОСОБ           APPROACH       169
21    БОЛЕЗНЬ          ILLNESS        164
22    ##не выявлено##  ##undefined##  162
23    ЖИДКОСТЬ         LIQUID         154
24    СОЕДИНЕНИЕ       COMPOUND       153
25    КРИСТАЛЛ         CRYSTAL        153
26    ПОРОДА           BREED          141
27    НАПРАВЛЕНИЕ      DIRECTION      137
28    ОРГАН            ORGAN          134
29    НАУКА            DISCIPLINE     132
30    ТКАНЬ            TISSUE         132
31    ЛИЦО             PERSON         120
32    ОБЛАСТЬ          PROVINCE       116
33    ОТРАСЛЬ          BRANCH         116
34    КОМПЛЕКС         COMPLEX        109

The most frequent word here is Иза, a Russian woman's name. Из (a genitive form of this name) is homonymous with the very frequent Russian preposition из (from). Whenever this preposition occurs before the first noun of a definition, the simple lemmatizer treats it as a noun and selects it as the basic word. This situation and some similar cases make it necessary to use complete morphological information about grammemes instead of simple lemmatization. Next, there are such frequent words as part, complex, name, kind, sort, etc. These words cannot serve as basic words; they act rather as links that mark a relationship between a dictionary term and the proper basic word. The high frequency of such words makes it necessary to apply additional logical-linguistic rules for extracting relations of different kinds.
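A minimal sketch of the first-noun selection with full morphological tags (rather than bare lemmatization) is given below. It uses pymorphy2 only as a stand-in for the morphological analyzer actually employed, and the tokenization and the function itself are our simplification, not the authors' code.

    # A minimal sketch, assuming pymorphy2 as a stand-in morphological analyzer.
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def first_nominative_noun(definition: str):
        """Return the lemma of the first noun in the nominative (subjective) case."""
        for token in definition.replace(',', ' ').split():
            parse = morph.parse(token)[0]          # most probable analysis
            # Requiring both the NOUN part of speech and the nominative case keeps
            # the preposition "из" from being mistaken for a form of the name "Иза".
            if parse.tag.POS == 'NOUN' and parse.tag.case == 'nomn':
                return parse.normal_form
        return None

    print(first_nominative_noun('нарядная заколка для волос'))  # expected: "заколка"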

2 Relevant definitions taken from the Webster dictionary (http://www.merriam-webster.com/) or the English Wikipedia (http://en.wikipedia.org/) are shown here instead of translations of the respective Russian definitions.

Finally, some other words in this list are noticeable. For example, единица is part of the Russian phrases единица измерения (unit of measurement) and денежная единица (monetary unit), which are very frequent in the encyclopedic dictionary. Similarly, such frequently used words as элемент (element) and лицо (person) are parts of the phrases химический элемент (chemical element) and должностное лицо (official), respectively. This fact justifies the extraction of noun groups (in addition to single nouns) as basic words, and it therefore becomes necessary to use certain elements of syntactic analysis.

The rather frequent occurrence of undefined basic words (rank 22 in Table 1) can be explained in two different ways. First, it can be caused by certain processing errors, which are partly corrected herein. Second, it can indicate an unusual dictionary definition. For example: МОРСКАЯ АРТИЛЛЕРИЯ – состоит на вооружении кораблей и береговых ракетно-артиллерийских войск (NAVAL ARTILLERY – is in service with naval ships and coastal defense troops) – no noun in the subjective case is present in this definition.

The general framework of linguistic analysis is shown in Figure 1. The rest of this paper describes every stage of this framework in more detail.

Fig. 1. The general framework of linguistic analysis.

Lexicographic Processing

Lexicographic processing is a preliminary step aimed at preparing a dictionary entry for the morphology and syntax analyses. The open-source AOT toolkit (http://www.aot.ru/) is used for these analyses. Input text for this tool should consist of well-formed Russian sentences. However, a dictionary entry is not exactly natural language text, since it includes certain labels, abbreviations and extra punctuation. Thus, lexicographic processing consists of the following steps:
− term recognition;
− recognition of domain labels, e.g., в медицине (medical), в антропологии (anthropological), etc.;
− elimination of bracketed text;
− replacement of abbreviations by the full forms of words.
The first three steps are performed with regular expressions. The last one is possible only if the context hints at an unambiguous form of the abbreviated word; only the most frequent abbreviations in certain already known contexts are replaced with full words. Here are some examples:
− на Сев. Кавказе → на Северном Кавказе (at N. Caucasus → at the North Caucasus). Russian adjectives have to agree grammatically with nouns; in the list of abbreviations, Сев. is associated with the adjective Северный (North), and the form of the adjective can be copied from the form of the respective noun.
− в 18 в. → в 18 веке (in 18 c. → in the 18th century). In this example, we use prepositional government to determine the noun case.
If the context is ambiguous, abbreviations are simply eliminated.
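A rough sketch of this preprocessing step is given below; the domain-label list, the abbreviation table and the patterns are illustrative stand-ins rather than the authors' actual resources, and the agreement of the expanded adjective with its noun is not modeled.

    import re

    DOMAIN_LABELS = ('в медицине', 'в антропологии')      # hypothetical excerpt
    ABBREVIATIONS = {'Сев.': 'Северном'}                   # form already agreed with the noun

    def preprocess(entry: str):
        # 1. Term recognition: the headword precedes the first dash.
        term, _, definition = entry.partition('–')
        # 2. Remove domain labels.
        for label in DOMAIN_LABELS:
            definition = definition.replace(label, '')
        # 3. Eliminate bracketed text.
        definition = re.sub(r'\([^)]*\)', '', definition)
        # 4. Expand frequent unambiguous abbreviations; ambiguous ones would be dropped.
        for abbr, full in ABBREVIATIONS.items():
            definition = definition.replace(abbr, full)
        return term.strip(), ' '.join(definition.split())

    print(preprocess('АГРАФ – нарядная заколка для волос (фр. agrafe)'))
    # expected: ('АГРАФ', 'нарядная заколка для волос')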

Morphology and Syntax

At this step, we use a context-free grammar to analyze the first sentences of dictionary entries. The output of this step is represented by dependency trees. Since dictionary definitions usually start with a noun group that includes the basic word, full syntactic analysis is unnecessary. The grammar is very simple and is aimed at recognizing noun groups only. It consists of the following rules:

[NP] > [NOUN];

A noun group may consist of a single noun.

[NP] > [ADJ] [NP root] : $0.grm := case_number_gender($1.grm, $2.type_grm, $2.grm);

An adjective precedes the noun (the standard word order in Russian). The second line enforces gender, number and case agreement between the noun and the adjective.

[NP] > [NP root] [NP grm="рд"];

A noun group in the genitive case (indicated by the "рд" grammeme) may be attached to another noun group on its right-hand side.

[PP] > [PREP root] [NP]; A preposition and a noun group may be combined into a prepositional group.

[NP] > [NP root] [PP];

A prepositional group may be attached to a noun group on its right-hand side.
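Taken together, these five rules recognize noun groups such as the one analyzed in the figures below. The following toy recognizer, written over tokens already tagged with part of speech and case, only mimics what the compiled grammar does; the agreement check of the second rule is omitted, and the data structures are our own simplification.

    def parse_np(tokens, i=0):
        """Recognize a noun group starting at position i; returns (constituent, next_i).
        A constituent is a dict whose 'root' field marks the constituent root,
        mirroring the AOT output. Agreement checking is omitted for brevity."""
        start, children = i, []
        while i < len(tokens) and tokens[i]['pos'] == 'ADJ':        # [NP] > [ADJ] [NP root]
            children.append(tokens[i]); i += 1
        if i == len(tokens) or tokens[i]['pos'] != 'NOUN':          # [NP] > [NOUN]
            return None, start
        root = tokens[i]; children.append(root); i += 1
        while i < len(tokens):
            if tokens[i]['pos'] == 'PREP':                          # [PP] > [PREP root] [NP]
                prep = tokens[i]
                np, i = parse_np(tokens, i + 1)
                children.append({'label': 'PP', 'root': prep,       # [NP] > [NP root] [PP]
                                 'children': [prep] + ([np] if np else [])})
            else:
                np, j = parse_np(tokens, i)                         # [NP] > [NP root] [NP grm="рд"]
                if np is None or np['root']['case'] != 'gent':
                    break
                children.append(np); i = j
        return {'label': 'NP', 'root': root, 'children': children}, i

    tokens = [  # верхняя одежда у некоторых азиатских народов
        {'w': 'верхняя', 'pos': 'ADJ', 'case': 'nomn'},
        {'w': 'одежда', 'pos': 'NOUN', 'case': 'nomn'},
        {'w': 'у', 'pos': 'PREP', 'case': None},
        {'w': 'некоторых', 'pos': 'ADJ', 'case': 'gent'},
        {'w': 'азиатских', 'pos': 'ADJ', 'case': 'gent'},
        {'w': 'народов', 'pos': 'NOUN', 'case': 'gent'},
    ]
    np, _ = parse_np(tokens)
    print(np['root']['w'])   # "одежда" – the candidate basic word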

We use the AOT tool to compile this grammar. The AOT output is an immediate-constituent structure in which the roots of constituents are marked. An example of the constituent structure for the phrase верхняя одежда у некоторых азиатских народов (outdoor clothes of some Asian nations), which is the definition of халат (oriental robe), is shown in Figure 2.

Fig. 2. An example of the immediate-constituent structure for the phrase ВЕРХНЯЯ ОДЕЖДА У НЕКОТОРЫХ АЗИАТСКИХ НАРОДОВ (constituents labeled NP, ANP and PP, with roots marked).

Since a dependency tree is necessary for the subsequent steps of analysis, the constituent structure is transformed using the following rules:
− the constituent root governs the other elements of the constituent;
− the constituent root is governed by the root of the immediately enclosing constituent.
An example of the dependency tree for the same phrase is shown in Figure 3.

Fig. 3. The dependency tree for the same phrase.
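The sketch below applies these two rules to a hand-written, root-marked encoding of the constituent structure from Figure 2; the encoding is our own simplification, not the AOT data format.

    # Root-marked constituents for "верхняя одежда у некоторых азиатских народов".
    tree = {'root': 'одежда', 'children': ['верхняя', 'одежда',
            {'root': 'у', 'children': ['у',
                {'root': 'народов', 'children': ['некоторых', 'азиатских', 'народов']}]}]}

    def to_dependencies(node, head=None, edges=None):
        """Rule 1: the root governs the other elements of its constituent.
        Rule 2: the root is governed by the root of the enclosing constituent."""
        if edges is None:
            edges = []
        root = node['root']
        if head is not None:
            edges.append((head, root))                 # rule 2
        for child in node['children']:
            if isinstance(child, dict):
                to_dependencies(child, root, edges)    # nested constituent
            elif child != root:
                edges.append((root, child))            # rule 1
        return edges

    print(to_dependencies(tree))
    # [('одежда', 'верхняя'), ('одежда', 'у'), ('у', 'народов'),
    #  ('народов', 'некоторых'), ('народов', 'азиатских')]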

Morphological analysis is applied just before the syntax analysis. Its result is a set of morphological analyses for each word form. Multiple analyses for one word form are very frequent, since Russian is an inflectional language and the level of homonymy between different forms is very high. During syntax analysis we are able to discard "unproductive" analyses that are not incorporated into the dependency tree (a similar approach for French is presented in [18]). Consider the phrase о Чукотском море (about the Chukchee Sea). There are three morphological analyses for море (sea): мор (pestilence), prepositional case, singular, masculine gender; море (sea), prepositional case, singular, neuter gender; and мора (mora), prepositional case, singular, feminine gender. There are two analyses for the word чукотском (Chukchee): the adjective чукотский (Chukchee) in the prepositional case and masculine or neuter gender. Only two pairs of analyses agree in gender (number and case are identical for all of them), so the third lemma – мора (mora) – is rejected. Unfortunately, the two other analyses, мор (pestilence) and море (sea), remain possible, and a certain ambiguity is unavoidable here. Nevertheless, syntax analysis yields a dramatic decrease of ambiguity in Russian. Our numerical results are presented in Table 2.

Table 2. Applying syntax for disambiguation

                                                               Before syntax analysis   After syntax analysis
Average number of lemmas for one word form                              1.27                    1.06
Average number of morphological analyses for one word form              2.26                    1.64
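The agreement filtering itself is straightforward. A minimal sketch for the о Чукотском море example, with the morphological analyses written out by hand, is shown below.

    adj_parses = [   # чукотском
        {'lemma': 'чукотский', 'case': 'prepositional', 'number': 'sing', 'gender': 'masc'},
        {'lemma': 'чукотский', 'case': 'prepositional', 'number': 'sing', 'gender': 'neut'},
    ]
    noun_parses = [  # море
        {'lemma': 'мор',  'case': 'prepositional', 'number': 'sing', 'gender': 'masc'},
        {'lemma': 'море', 'case': 'prepositional', 'number': 'sing', 'gender': 'neut'},
        {'lemma': 'мора', 'case': 'prepositional', 'number': 'sing', 'gender': 'femn'},
    ]

    def agrees(adj, noun):
        """The adjective and the noun must match in case, number and gender."""
        return all(adj[f] == noun[f] for f in ('case', 'number', 'gender'))

    kept = [n for n in noun_parses if any(agrees(a, n) for a in adj_parses)]
    print([n['lemma'] for n in kept])   # ['мор', 'море'] – "мора" is filtered out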

Relation Recognition

Relation recognition is based on logical-linguistic rules applied to the dependency tree. Six types of semantic relations currently used in the ontology are extracted; these relation types are listed in Table 3.

Table 3. Extracted relation types

Relation         Description           Notation
GENERALIZATION   IS-A; default value   Gen
INSTANCE         reverse to Gen        Spec
IDENTITY                               Same
PART                                   Part
WHOLE            reverse to Part       Whole
FUNCTION                               Func
OTHER                                  Other

A specific rule is attached to a certain word. Our software traverses the dependency tree and finds the first noun of the definition; the rule attached to this word (if any) is then executed. Each rule specifies, first, the type of relation indicated by this word and, second, a directive either to save this word as the basic word or to reject it and obtain the next basic-word candidate. Two examples of rules for the GENERALIZATION relation are presented in Table 4.

Table 4. Examples of GENERALIZATION relation rules

Basic word: род, вид, сорт, тип, … (kind, sort, type, class, etc.)
Examples: ФИЛЬДЕПЕРС – высший сорт фильдекоса. (PERSIAN THREAD – the first class of lisle.) ПИДЖИНЫ – тип языков, используемых как средство межэтнического общения в среде разноязычного населения. (PIDGINS – a sort of languages used for communication between people with different languages.)
Rule: 1. Save the default type of relation (GEN). 2. Save the next noun as the basic word ("next" means the next node in the dependency tree, which does not necessarily represent the next word in the linear context).
Result of application: ФИЛЬДЕПЕРС фильдекос GEN (PERSIAN THREAD lisle GEN); ПИДЖИН язык GEN (PIDGIN language GEN)

Basic word: жанр (genre)
Example: МИСТЕРИЯ – жанр средневекового западноевропейского религиозного театра. (MYSTERY – a genre of the religious medieval theatre.)
Rule: 1. Save the word as a basic word with the default relation type. 2. Save the default type of relation (GEN). 3. Save the next noun as a basic word.
Result of application: МИСТЕРИЯ жанр GEN, МИСТЕРИЯ театр GEN (MYSTERY genre GEN, MYSTERY theatre GEN)

We now discuss these two rules in more detail. The difference between them is that words such as kind, sort, etc. are eliminated, while genre is saved. Therefore, there are two relations in the resulting output if genre is the basic word (in some cases it is possible to extract an even larger number of relations and save them all as the result). We have two reasons to save genre: first, it is intuitively clear that this word is more meaningful than sort and other similar words; second, in some cases the definition may be too complicated for correct syntax analysis, and in such cases the program still extracts at least one basic word. Generally, there are two main types of logical-linguistic rules:
1. Save the first basic word – change the type of relation – save the next basic word (the notation for this type is "save word, next noun").
2. Reject the first basic word – change the type of relation – save the next basic word ("next noun").
Choosing either of these types depends on the frequency of a particular structure and on the authors' introspection.
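As an illustration only, the two rule types can be encoded as a small table keyed by the lemma of the first noun; the names and data structures below are our own, not the authors' implementation.

    # Illustrative encoding of the two rule types ("next noun" vs. "save word, next noun").
    RULES = {
        'сорт': {'save_first': False, 'relation': 'Gen'},   # next noun
        'тип':  {'save_first': False, 'relation': 'Gen'},   # next noun
        'жанр': {'save_first': True,  'relation': 'Gen'},   # save word, next noun
    }

    def extract_relations(term, noun_chain):
        """noun_chain: lemmas of nouns along the dependency path starting from the
        first noun of the definition, e.g. ['жанр', 'театр'] for МИСТЕРИЯ."""
        first = noun_chain[0]
        rule = RULES.get(first)
        if rule is None:                                    # default case: IS-A to the first noun
            return [(term, first, 'Gen')]
        relations = []
        if rule['save_first']:
            relations.append((term, first, 'Gen'))
        if len(noun_chain) > 1:
            relations.append((term, noun_chain[1], rule['relation']))
        return relations

    print(extract_relations('МИСТЕРИЯ', ['жанр', 'театр']))
    # [('МИСТЕРИЯ', 'жанр', 'Gen'), ('МИСТЕРИЯ', 'театр', 'Gen')]
    print(extract_relations('ФИЛЬДЕПЕРС', ['сорт', 'фильдекос']))
    # [('ФИЛЬДЕПЕРС', 'фильдекос', 'Gen')]

Two additional examples, for the IDENTITY relation, are presented in Table 5.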

Table 5. Examples of IDENTITY relation rules

Basic word: обозначение (nomination)
Example: СОЦИОСФЕРА – обозначение человечества, общества, а также освоенной человеком природной среды, в совокупности составляющих часть географической оболочки. (SOCIOSPHERE – a nomination of humanity as well as the human-assimilated environment, arranged together in a part of the geographical envelope.)
Rule: next noun
Result of application: СОЦИОСФЕРА человечество SAME (SOCIOSPHERE humanity SAME)

Basic word: явление (phenomenon)
Example: СИНЕСТЕЗИЯ – явление восприятия, когда при раздражении данного органа чувств наряду со специфическими для него ощущениями возникают и ощущения, соответствующие другому органу чувств. (SYNESTHESIA – a perception phenomenon with a subjective sensation or image of a sense other than the one being stimulated.)
Rule: save word, next noun
Result of application: СИНЕСТЕЗИЯ явление GEN, СИНЕСТЕЗИЯ восприятие SAME (SYNESTHESIA phenomenon GEN, SYNESTHESIA perception SAME)

We have an additional reason to save явление (phenomenon) as a basic word: it is part of such Russian phrases as атмосферное явление (atmospheric phenomenon), физическое явление (physical phenomenon), and so on. Our syntax analysis yields all grammatical information about noun phrases, and this information has to be saved at the relation recognition step; the final choice between a single basic word and a basic collocation is made by the ontology administrator. Sometimes more complicated rules, which cannot be reduced to the two previous types, are used. An example of such a rule for the FUNCTION relation is presented in Table 6.

Table 6. An example of a more complicated rule

Basic word: инструмент, прибор, аппарат, … (instrument, tool, device, etc.)
Example: ФЕН – электрический аппарат для сушки волос. (HAIRDRYER – an electric device for hair drying.)
Rule: Save the word; move to the next preposition; if it is для (for), change the relation type to FUNC and save the next noun.
Result of application: ФЕН аппарат GEN, ФЕН сушка FUNC (HAIRDRYER device GEN, HAIRDRYER drying FUNC)

This rule reflects the fact that functional relations in Russian are usually formed with the preposition для, while a dependent noun without a preposition cannot indicate a functional relation: прибор темной окраски (darkly colored device) vs. прибор для окраски (device for coloring).
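A sketch of this preposition-sensitive rule, over the same simplified dependency representation used above (again an illustration, not the authors' code):

    def function_relations(term, root):
        """root: {'lemma': ..., 'deps': [...]} – the first noun of the definition."""
        relations = [(term, root['lemma'], 'Gen')]                 # save the word: ФЕН аппарат GEN
        for dep in root['deps']:
            if dep['lemma'] == 'для':                              # functional preposition
                for np in dep['deps']:
                    relations.append((term, np['lemma'], 'Func'))  # ФЕН сушка FUNC
        return relations

    tree = {'lemma': 'аппарат', 'deps': [
        {'lemma': 'электрический', 'deps': []},
        {'lemma': 'для', 'deps': [
            {'lemma': 'сушка', 'deps': [{'lemma': 'волосы', 'deps': []}]}]},
    ]}
    print(function_relations('ФЕН', tree))
    # [('ФЕН', 'аппарат', 'Gen'), ('ФЕН', 'сушка', 'Func')]

The "Other" type of relation is very significant, as it can result in modifications of the ontology model. Some examples are presented in Table 7.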

Table 7. Examples of OTHER relation rules

Basic word: прерывание (termination)
Example: АБОРТ – прерывание беременности в сроки до 28 недель (то есть до момента, когда возможно рождение жизнеспособного плода). (ABORTION – the termination of a pregnancy after, accompanied by, resulting in, or closely followed by the death of the embryo or fetus.)
Rule: save word, next noun
Result of application: АБОРТ прерывание GEN, АБОРТ беременность OTHER (ABORTION termination GEN, ABORTION pregnancy OTHER)

Basic word: способность (ability)
Example: ХОМИНГ – способность животного возвращаться со значительного расстояния на свой участок обитания, к гнезду, логову и т. д. (HOMING – the ability of animals to come back from a considerable distance to their home range, nest, lair, etc.)
Rule: save word, next noun
Result of application: ХОМИНГ способность GEN, ХОМИНГ животное OTHER (HOMING ability GEN, HOMING animal OTHER)

These rules represent the intuitively recognized fact that abortion is relevant to pregnancy and homing is relevant to animals, even if it is difficult to specify the exact nature of such relevance. Such rules are applied to approximately 30 basic words. Unexpectedly, these 30 words fall into two groups: (i) words that indicate a certain feature of the term defined (e.g., ability), and (ii) words that indicate a certain transformation (e.g., termination). The first group includes the following basic words: характеристика (characteristic), признак (attribute), свойство (property), число (number), показатель (index), степень (degree), количество (quantity), характер (character), масса (mass), состояние (condition), способность (ability), место (place), источник (source). The second group consists primarily of verbal nouns: переход (transition), извлечение (extraction), превращение (transformation), введение (introduction), выделение (emission), возникновение (origination), нарушение (deviation), прерывание (termination), развитие (evolution), образование (formation), увеличение (increase), уменьшение (decrease). A genitive noun is used very frequently after these words in Russian (e.g., прерывание беременности in the first example in Table 7); the equivalent English form is of + noun (termination of a pregnancy in this example). Sometimes these words form relatively long genitive chains: увеличение показателя состояния… (an increase of an index of condition…); therefore, the rules are applied recursively. The fact that a group of otherwise unrelated words clusters so easily is very significant, and further extension of the ontology model by adding these two types of relations deserves additional consideration.

In total, there are 18 different rules for 91 basic words. The list of the most frequent basic words from the pilot study [17] (a representative excerpt is shown in Table 1) is used to develop specific rules. In particular, we formulate rules for 51 of the 100 most frequently used basic words. These rules are applicable to a relative minority of all entries: we currently apply them to just 8484 of the 26,375 entries. This number can probably be increased; however, no rules need to be attached to the majority of entries, since the main hypothesis is valid for them. After the relation recognition step, the total number of different basic words grows slightly, to 4679 (compared with the 4603 candidate basic words found in the pilot study [17]); however, these words are much more informative. The new list of the most frequent basic words obtained by applying our rules is presented in Table 8.

Table 8. Most frequent basic words

Rank  Basic Word      Translation     Frequency
1     УСТРОЙСТВО      DEVICE          332
2     МИНЕРАЛ         MINERAL         322
3     ЕДИНИЦА         UNIT            293
4     ПРИБОР          INSTRUMENT      292
5     ВЕЩЕСТВО        SUBSTANCE       277
6     ПРОЦЕСС         PROCESS         243
7     ИНСТРУМЕНТ      TOOL            235
8     ЭЛЕМЕНТ         ELEMENT         228
9     ЗАБОЛЕВАНИЕ     DISEASE         210
10    НАУКА           DISCIPLINE      199
11    СОЕДИНЕНИЕ      COMPOUND        184
12    БОЛЕЗНЬ         ILLNESS         174
13    ПОРОДА          BREED           170
14    ОРГАН           ORGAN           168
15    ЖИДКОСТЬ        LIQUID          166
16    КРИСТАЛЛ        CRYSTAL         164
17    МАШИНА          ENGINE          158
18    РАСТЕНИЕ        PLANT           146
19    ТКАНЬ           TISSUE          146
20    СООРУЖЕНИЕ      STRUCTURE       138
21    МАТЕРИАЛ        MATERIAL        134
22    ЛИЦО            PERSON          133
23    ОБЛАСТЬ         PROVINCE        121
24    ИЗМЕРЕНИЕ       MEASUREMENT     117
25    ИЗМЕНЕНИЕ       MODIFICATION    117
26    ВЕЛИЧИНА        MAGNITUDE       116
27    ОБРАЗОВАНИЕ     FORMATION       114
28    ПРОДУКТ         PRODUCT         110
29    ДВИЖЕНИЕ        MOVEMENT        104
30    ВОСПАЛЕНИЕ      INFLAMMATION    98
31    МЕРА            MEASURE         98
32    УЧАСТОК         SITE            97
33    ПРОИЗВЕДЕНИЕ    CREATION        94
34    АППАРАТ         MECHANISM       93

We evaluate our relation recognition approach by comparing its output with the judgments of an expert who read 200 dictionary entries and extracted basic words from them. For 90% of the entries (179 of 200), the results obtained by the expert and by our software are identical.

We now analyze the 21 dictionary entries that are processed incorrectly by the program. Most of these errors (16 of 21) are caused by specific algorithm inaccuracies at different steps of the analysis that can be eliminated by minor modifications. We expect to correct these inaccuracies in the near future and to achieve the theoretical level of accuracy of (179 + 16) / 200 = 97.5% for this source. For each of the other 5 of the 200 dictionary entries, however, a basic word is missing from the definition text. These entries are inconsistent with the basic hypothesis that the basic word is the first subjective-case noun of the definition, and the proposed approach is unsuitable for processing them.

Three of these dictionary entries have definitions that start with a verb, so that the defined term itself acts as the subject of the sentence. For example: АБРАЗИВНЫЙ ИНСТРУМЕНТ – служит для механической обработки (шлифование, притирка и другие). (ABRASIVE TOOL – is designed for mechanical processing (grinding, reseating, etc.).) The grammar would have to be expanded dramatically to process such definitions. Another way is to analyze the entire dictionary entry (including the defined term), which would recognize инструмент (tool) as the basic word in this example.

Another type of unusual entry is represented by statements of natural laws, theorems, etc., where the definition is an extended description of the defined object. For example: АВОГАДРО ЗАКОН – в равных объемах идеальных газов при одинаковых давлении и температуре содержится одинаковое число молекул. (AVOGADRO'S LAW – equal volumes of ideal or perfect gases, at the same temperature and pressure, contain the same number of particles, or molecules.) This case is very similar to the previous one, because it is possible to extract a basic word from the defined term: закон (law).

A further difficulty is the omission of the basic word. There is one such example in the evaluation set: АБИТУРИЕНТ – оканчивающий среднее учебное заведение. (COLLEGE APPLICANT – a person graduating from high school.) The translation does not reflect the difficulty, because there is a subject noun (person) in the English phrase, while in the Russian phrase (which is nonetheless a well-formed Russian sentence) it is absent; a word such as человек (person) is implied but missing from the Russian definition. An approach that reconstructs the omitted word would be necessary to overcome this difficulty. However, only one such case is found in the evaluation set (less than 1% of the set), whereas the algorithm modifications needed to remedy this deficiency would be very complex and inaccurate.

Import to Ontology

The final step of the import is manual; however, it is simplified by our ontoeditor. A table with all dictionary terms is shown under a dedicated tab of the ontoeditor. Each term is matched with its definition (using the syntax markup), and the first sentence of the definition and the basic word (extracted automatically as described in the previous sections) are shown in separate columns; this is the right-hand part of the ontoeditor window in Figure 4. The ontology administrator selects a subset of dictionary entries for each individual import operation. There are three ways to make such a selection: (i) to specify a basic word (factoring in all synonyms from the ontology lexicon); (ii) to specify a basic word and all dependent concepts; or (iii) to specify a certain word in the definition governed by the basic word. The ontology administrator may exclude irrelevant terms from the selection or include other terms. Then the selection is imported into the ontology in one of the following ways:
− added as a synonym to an existing concept (the one corresponding to the basic word);
− added as a new concept positioned in the taxonomy according to the basic word (the ontology administrator may add extra information to clarify these concepts); or
− added as an unsorted list of concepts (the ontology administrator may sort it later using the drag-and-drop interface).
Currently, only GENERALIZATION (IS-A) relations are imported into our ontology. Processing of the other relation types described above will be added to our software soon.

Fig. 4. The encyclopedia import into the ontoeditor

Wikipedia Parsing

The proposed approach is designed to be scalable and applicable to other dictionary resources. We now discuss its trial application to the Russian Wikipedia. Since Wikipedia is a free encyclopedia developed through community effort, its coverage is larger than that of any other dictionary, while the informational quality of its individual articles varies. Wikipedia includes a large amount of natural language information as well as its own taxonomy and templates. Using Wikipedia for ontology learning is quite popular ([19], [20], [21], [22], etc.); however, we are unaware of any comparable effort for the Russian Wikipedia.

For our experiment we use the Russian Wikipedia dump of November 13, 2009, which includes 506,504 entries. We use the Zemanta Wikiprep program (http://sourceforge.net/apps/mediawiki/wikiprep/) to convert articles from wiki markup to plain text. Wikipedia includes different types of entries: abstract concepts; terms from different narrow domains; proper names (of persons, cities, streets, etc.); and lists (of dates, events, etc.). Proper names and lists are out of the scope of our study, and we filter them out by Wikipedia categories. The portion of Wikipedia taken into consideration includes 196,349 entries.

The first sentence of each Wikipedia article that includes a dash symbol is used for analysis as a definition, since this format is recommended by the Wikipedia guidelines; the dash serves as an equivalent of the English is-a construction in Russian sentences whose predicate is expressed by a noun. We apply the algorithm described in the previous sections to these first sentences. We evaluate the results against the judgments of an expert who read 500 Wikipedia entries and extracted basic words from them. For 82% of the entries (410 of 500), the results obtained by the expert and by our software are identical. We attribute approximately 40% of the errors (36 of 90 entries) to irregularities in the article texts. Remedying the other errors will require some extra syntactic and logical-linguistic rules, which is a subject of future work. Nevertheless, our approach is generally applicable to Wikipedia as well.
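A rough sketch of the definition-sentence selection step is given below; the sentence splitter is naive and the example article text is purely illustrative.

    import re

    def definition_sentence(article_text: str):
        """Return the first sentence containing a dash (the "TERM — definition"
        pattern recommended by the Wikipedia guidelines), or None."""
        for sentence in re.split(r'(?<=[.!?])\s+', article_text.strip()):
            if '—' in sentence or '–' in sentence:
                return sentence
        return None   # no dash-style definition; such an entry is skipped

    text = ('Халат — домашняя или рабочая одежда, запахивающаяся или '
            'застёгивающаяся сверху донизу. Происходит от арабского слова.')
    print(definition_sentence(text))   # the first, dash-style definition sentence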

Conclusion

We present an approach for automatically discerning semantic relations from the text of an encyclopedic dictionary. This approach is designed for the semi-automatic addition of terms to ontologies. The original hypothesis, that the first subjective-case noun of the definition represents the basic word, yields correct results for more than 90% of the entries of the Russian Encyclopedic Dictionary [13]; it is therefore applicable to the practical extension of ontologies. The hypothesis is refined by developing methods for selecting the proper basic word (when it is not represented by the root of the first noun group) and for determining the types of those semantic relations that do not fall under the IS-A category. Certain definition structures that cannot be properly processed by our automatic algorithm are also identified; such definitions are rare (about 1% of entries) in the encyclopedic dictionary that we processed. The presented approach is generally applicable to other dictionary resources; we expect to apply it in the future to Wikipedia and to traditional explanatory dictionaries.

References

1. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontology Engineering. Springer-Verlag (2004)
2. Gomez-Perez, A., Manzano-Macho, D.: A Survey of Ontology Learning Methods and Techniques. IST Project IST-2000-29243 OntoWeb, Technical Report (2003)
3. Jannink, J.: Thesaurus Entry Extraction from an Online Dictionary. In: Proceedings of Fusion '99, Sunnyvale, CA (1999)
4. Rigau, G., Rodríguez, H., Agirre, E.: Building Accurate Semantic Taxonomies from Monolingual MRDs. In: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'98), Montreal, Canada (1998)
5. Lee, C., Lee, G., Yun, S. J.: Automatic WordNet Mapping Using Word Sense Disambiguation. In: 38th Annual Meeting of the Association for Computational Linguistics (2000)
6. Agirre, E., et al.: Extraction of Semantic Relations from a Basque Monolingual Dictionary using Constraint Grammar. In: Proceedings of Euralex, Stuttgart, Germany (2000)
7. Litkowski, K. C.: Digraph Analysis of Dictionary Preposition Definitions. In: Proceedings of the Association for Computational Linguistics Special Interest Group on the Lexicon, Philadelphia, July 11 (2002)
8. Nichols, E., et al.: Multilingual Ontology Acquisition from Multiple MRDs. In: Proceedings of the 2nd Workshop on Ontology Learning and Population, pp. 10-17, Sydney (2006)
9. Morita, T., Fukuta, N., Izumi, N., Yamaguchi, T.: DODDLE-OWL: A Domain Ontology Construction Tool with OWL. In: Lecture Notes in Computer Science, Springer (2006)
10. Ермаков А. Е. Автоматизация онтологического инжиниринга в системах извлечения знаний из текста // Труды международной конференции «Диалог 2008». М.: Наука (2008). [Ermakov, A. E.: The Automation of Ontology Engineering for Knowledge Acquisition Systems. In: Proceedings of the Dialogue 2008 International Conference. Nauka, Moscow (2008). In Russian]
11. Минаков И. А. Системный анализ, онтологический синтез и инструментальные средства обработки информации в процессах интеграции профессиональных знаний. Автореферат диссертации на соискание ученой степени доктора технических наук. Самара (2007). [Minakov, I. A.: System Analysis, Ontological Synthesis and Tools for Information Processing in Professional Knowledge Integration. Doctoral thesis abstract, Samara (2007). In Russian]
12. Пекар В. И. Автоматическое пополнение специализированного тезауруса // Труды международной конференции «Диалог 2002». М.: РГГУ (2002). [Pekar, V. I.: The Domain Thesaurus Learning. In: Proceedings of the Dialogue 2002 International Conference. RGGU, Moscow (2002). In Russian]
13. Российский энциклопедический словарь / Гл. ред. А. М. Прохоров. М.: Большая Российская энциклопедия (2001). [Russian Encyclopedic Dictionary. A. M. Prohorov (ed.). Bolshaya Rossiyskaya Enciklopediya, Moscow (2001). In Russian]
14. Рубашкин В. Ш. Семантический компонент в системах понимания текста // КИИ-2006. Десятая национальная конференция по искусственному интеллекту с международным участием. Труды конференции. М.: Физматлит (2006). С. 455-463. [Rubashkin, V. Sh.: The Semantic Component in Text Understanding Systems. In: Proceedings of the Tenth National Conference on Artificial Intelligence (KII-2006). Fizmatlit, Moscow, pp. 455-463 (2006). In Russian]
15. Рубашкин В. Ш. Онтологии: проблемы и решения. Точка зрения разработчика // Труды международной конференции «Диалог 2007». М.: Наука (2007). [Rubashkin, V. Sh.: Ontologies: Problems and Solutions. A Developer's Point of View. In: Proceedings of the Dialogue 2007 International Conference. Nauka, Moscow, pp. 456-458 (2007). In Russian]
16. Рубашкин В. Ш., Пивоварова Л. М. Онторедактор как комплексный инструмент онтологической инженерии // Труды международной конференции «Диалог 2008». М.: Наука (2008). [Rubashkin, V. Sh., Pivovarova, L. M.: Ontoeditor as a Complex Tool for Ontology Engineering. In: Proceedings of the Dialogue 2008 International Conference. Nauka, Moscow, pp. 456-458 (2008). In Russian]
17. Рубашкин В. Ш., Капустин В. А. Использование определений терминов в энциклопедических словарях для автоматизированного пополнения онтологий // XI Всероссийская объединенная конференция «Интернет и современное общество». СПб. (2008). [Rubashkin, V. Sh., Kapustin, V. A.: Ontology Learning from Encyclopedia Entry Definitions. In: Proceedings of the Internet and Modern Society Conference. Saint Petersburg (2008). In Russian]
18. Fernandez, M., Clergerie, E., Vilares, M.: Mining Conceptual Graphs for Knowledge Acquisition. In: Proceedings of the Workshop on Improving Non-English Web Searching (iNEWS'08), pp. 25-32, Napa Valley, USA (2008)
19. Kassner, L., Nastase, V., Strube, M.: Acquiring a Taxonomy from the German Wikipedia. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco (2008)
20. Ruiz-Casado, M., Alfonseca, E., Okumura, M., Castells, P.: Information Extraction and Semantic Annotation of Wikipedia. In: Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 145-169 (2008)
21. Ponzetto, S. P., Strube, M.: WikiTaxonomy: A Large Scale Knowledge Resource. In: Proceedings of the 18th European Conference on Artificial Intelligence, Patras, Greece, 21-25 July 2008, pp. 751-752 (2008)
22. Wu, F., Hoffmann, R., Weld, D. S.: Information Extraction from Wikipedia: Moving down the Long Tail. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 731-739 (2008)