Ontological Parsing of Encyclopedia Information 1

Victor Bocharov, Lidia Pivovarova, Valery Rubashkin, Boris Chuprin

St. Petersburg State University, Universitetskaya nab. 11, Saint Petersburg, Russia
[email protected], [email protected], [email protected], [email protected]

Abstract. Semi-automatic ontology learning from an encyclopedic dictionary is presented, with the primary focus on the syntactic and semantic analysis of definitions.

Keywords: Ontology Learning, Syntax Analysis, Relation Extraction, Encyclopedia, Wikipedia.

Introduction

Ontology Learning is a rapidly expanding area of Natural Language Processing. Many language technologies – from machine translation to speech recognition – should be supported by ontologies that provide a conceptual interpretation covering the entire corpus vocabulary. However, a formal ontology sufficient to cover the entire lexis even of a narrow domain has to include a few dozen thousand concepts. Manual development of such an ontology is therefore a very time-consuming process that cannot be completed at the required level of completeness. This "bottleneck" problem is currently considered the main obstacle to using ontologies [1], and it becomes even more severe if a universal knowledge base is needed instead of a domain ontology. For these reasons, ontology learning technologies are now quite popular.

Different sources can be used for ontology learning – natural language texts, machine-readable dictionaries, semi-structured data, knowledge bases, etc. (a complete survey is presented in [2]) – although the task is generally understood as ontology development from natural language. Parsing of machine-readable dictionaries, however, seems to be more effective. The main difference between a natural language text and a dictionary is the form of knowledge representation: knowledge in a dictionary is more structured and compact than in free text. In some cases the structure is presented in a dictionary explicitly (as markup, tags, etc.); otherwise it is expressed only by syntax. Many efforts are currently underway in this area (e.g., [3], [4], [5], [6], [7], [8], [9]). Nevertheless, we are unaware of any comparable effort for Russian dictionaries, though certain approaches to ontology learning from Russian free texts are known (e.g., [10], [11], [12]).

Problem Statement and Basic Algorithm

We present here ontology learning from the machine-readable version of the "Russian Encyclopedic Dictionary" [13]. We use the entire dictionary with the exception of toponyms and proper names. The portion of the dictionary taken into consideration includes 26,375 entries, which describe 21,782 different terms. The difference between these two figures is caused by the presence of ambiguous terms (e.g., there are five different definitions for "aberration" in such areas as biology, physics, etc.). The learned ontology is a universal ontology developed primarily for semantic text analysis. The basic structure of this ontology is an attribute tree in which objects alternate with attributes [15]. A small fragment of this tree is presented as an example below:

• TRANSPORT
  o BY ENERGY SOURCE
    • ELECTRIC TRANSPORT
    • ATOMIC TRANSPORT
    • FUEL TRANSPORT
    • WIND-DRIVEN TRANSPORT
  o BY ENVIRONMENT TYPE
    • AIR TRANSPORT
    • WATER TRANSPORT
    • LAND TRANSPORT
    • SPACE TRANSPORT

1 This paper is supported by the Russian Foundation for Basic Research, project № 09-06-00275-а.

This structure provides the most natural way to represent different links, such as the correspondence of a value to an attribute (*great color vs. great volume), the correspondence of an attribute to a class (SOLID –> SHAPE vs. *LIQUID –> SHAPE), or the complete set of extensional relations between concepts (incompatibility, intersection, inclusion). The ontology also represents different associative relations, which are either unified (PART –> WHOLE, OBJECT –> LOCALIZATION, OBJECT –> FUNCTION, etc.) or specialized (COUNTRY –> CAPITAL, ORGANIZATION –> CHIEF, etc.).

The lexicon is an integral part of a working ontology; it connects the conceptual model with natural language units. Such a lexicon includes words and collocations that can be used to express various concepts. These words and collocations can represent standard terms (i.e., names of concepts used in the ontology) or their synonyms (we use the term "synonym" here in its broad sense, as any natural language expression that refers to the respective concept with a reasonable probability). We use our own ontoeditor [16] with additional tools for importing encyclopedia information at the ontology learning stage.

Since the requirements for concept description in natural language processing are very strict, it is hardly possible to populate the ontology from our source in a fully automatic fashion. Therefore, ontology learning is broken down into two stages: first, the dictionary entries are pre-classified automatically, and, second, an ontology administrator is given an opportunity to approve, change or cancel a decision made by the program. We discuss here primarily the first stage of this process, i.e., the automatic linguistic analysis of encyclopedia entries.

This linguistic analysis is based on the following simple hypothesis: usually, a hyperonym for a dictionary term is the first subjective-case noun of its definition (referred to hereafter as the "basic word"). Several examples of typical dictionary entries that correspond to this hypothesis are shown below 2.

АГРАФ – нарядная заколка для волос, с помощью которой крепили в прическах перья, цветы, искусственные локоны и т. д.
HAIRPIN – a pin to hold the hair in place.

ПЕРИСТИЛЬ – прямоугольный двор, сад, площадь, окруженные с 4 сторон крытой колоннадой.
PERISTYLE – a colonnade surrounding a building or court.

ЯТАГАН – рубяще-колющее оружие (среднее между саблей и кинжалом) у народов Ближнего и Среднего Востока (известно с 16 в.).
YATAGHAN – a long knife or short saber that lacks a guard for the hand at the juncture of blade and hilt and that usually has a double curve to the edge and a nearly straight back.

As was demonstrated in the pilot study [17], the structure of most dictionary entries corresponds to our hypothesis; however, its direct usage occasionally yields incorrect results. A list of the most frequent basic words selected at the first step of analysis [17] is shown in Table 1. A very simple lemmatizer was used to determine the first noun in each definition. A total of 4603 different first nouns were identified using this technique.

Table 1. List of the most frequently used basic words (according to the pilot study [17])

Rank  Basic Word       Translation    Frequency
1     ИЗА              IZA            475
2     ЧАСТЬ            PART           415
3     СОВОКУПНОСТЬ     COMBINATION    406
4     НАЗВАНИЕ         NAME           389
5     СИСТЕМА          SYSTEM         347
6     РАЗДЕЛ           SECTION        336
7     ВИД              KIND           305
8     УСТРОЙСТВО       DEVICE         298
9     ПРИБОР           INSTRUMENT     286
10    МИНЕРАЛ          MINERAL        286
11    ЕДИНИЦА          UNIT           264
12    ФОРМА            FORM           232
13    ГРУППА           GROUP          212
14    ИНСТРУМЕНТ       TOOL           204
15    ВЕЩЕСТВО         SUBSTANCE      202
16    ЭЛЕМЕНТ          ELEMENT        198
17    МЕТОД            METHOD         194
18    ЗАБОЛЕВАНИЕ      DISEASE        186
19    ПРОЦЕСС          PROCESS        182
20    СПОСОБ           APPROACH       169
21    БОЛЕЗНЬ          ILLNESS        164
22    ##не выявлено##  ##undefined##  162
23    ЖИДКОСТЬ         LIQUID         154
24    СОЕДИНЕНИЕ       COMPOUND       153
25    КРИСТАЛЛ         CRYSTAL        153
26    ПОРОДА           BREED          141
27    НАПРАВЛЕНИЕ      DIRECTION      137
28    ОРГАН            ORGAN          134
29    НАУКА            DISCIPLINE     132
30    ТКАНЬ            TISSUE         132
31    ЛИЦО             PERSON         120
32    ОБЛАСТЬ          PROVINCE       116
33    ОТРАСЛЬ          BRANCH         116
34    КОМПЛЕКС         COMPLEX        109

The most frequent word here is Иза, a Russian woman's name. Из (a genitive form of this name) is homonymous with the very frequent Russian preposition из (from). Whenever this preposition occurs before the first noun of a definition, the simple lemmatizer treats it as a noun and selects it as the basic word. This situation and some similar cases make it necessary to use complete morphological information about grammemes instead of simple lemmatization. Next, there are such frequent words as part, complex, name, kind, sort, etc. These words cannot serve as basic words; they act rather as links that mark a relationship between a dictionary term and the proper basic word. The high frequency of such words makes it necessary to apply additional logical-linguistic rules for extracting relations of different kinds.
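A minimal sketch of the first-noun selection with full morphological tags (rather than bare lemmatization) is given below. It uses pymorphy2 only as a stand-in for the morphological analyzer actually employed, and the tokenization and the function itself are our simplification, not the authors' code.

    # A minimal sketch, assuming pymorphy2 as a stand-in morphological analyzer.
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def first_nominative_noun(definition: str):
        """Return the lemma of the first noun in the nominative (subjective) case."""
        for token in definition.replace(',', ' ').split():
            parse = morph.parse(token)[0]          # most probable analysis
            # Requiring both the NOUN part of speech and the nominative case keeps
            # the preposition "из" from being mistaken for a form of the name "Иза".
            if parse.tag.POS == 'NOUN' and parse.tag.case == 'nomn':
                return parse.normal_form
        return None

    print(first_nominative_noun('нарядная заколка для волос'))  # expected: "заколка"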

2 Relevant definitions taken from the Webster dictionary (http://www.merriam-webster.com/) or the English Wikipedia (http://en.wikipedia.org/) are shown here instead of translations of the respective Russian definitions.

Finally, some other words in this list are noticeable. For example, единица is part of the Russian phrases единица измерения (unit of measurement) and денежная единица (monetary unit), which are very frequent in the encyclopedic dictionary. Similarly, such frequently used words as элемент (element) and лицо (person) are parts of the phrases химический элемент (chemical element) and должностное лицо (official), respectively. This fact justifies the extraction of noun groups (in addition to single nouns) as basic words, and it therefore becomes necessary to use certain elements of syntactic analysis.

The rather frequent occurrence of undefined basic words (rank 22 in Table 1) can be explained in two different ways. First, it can be caused by certain processing errors, which are partly corrected herein. Second, it can indicate an unusual dictionary definition. For example: МОРСКАЯ АРТИЛЛЕРИЯ – состоит на вооружении кораблей и береговых ракетно-артиллерийских войск (NAVAL ARTILLERY – is in service with naval ships and coastal defense troops) – no noun in the subjective case is present in this definition.

The general framework of linguistic analysis is shown in Figure 1. The rest of this paper describes every stage of this framework in more detail.

Fig. 1. The general framework of linguistic analysis.

Lexicographic Processing

Lexicographic processing is a preliminary step aimed at preparing a dictionary entry for the morphology and syntax analyses. The open-source AOT toolkit (http://www.aot.ru/) is used for these analyses. Input text for this tool should consist of well-formed Russian sentences. However, a dictionary entry is not exactly natural language text, since it includes certain labels, abbreviations and extra punctuation. Thus, lexicographic processing consists of the following steps:
− term recognition;
− recognition of domain labels, e.g., в медицине (medical), в антропологии (anthropological), etc.;
− elimination of bracketed text;
− replacement of abbreviations by the full forms of words.
The first three steps are performed with regular expressions. The last one is possible only if the context hints at an unambiguous form of the abbreviated word; only the most frequent abbreviations in certain already known contexts are replaced with full words. Here are some examples:
− на Сев. Кавказе → на Северном Кавказе (at N. Caucasus → at the North Caucasus). Russian adjectives have to agree grammatically with nouns; in the list of abbreviations, Сев. is associated with the adjective Северный (North), and the form of the adjective can be copied from the form of the respective noun.
− в 18 в. → в 18 веке (in 18 c. → in the 18th century). In this example, we use prepositional government to determine the noun case.
If the context is ambiguous, abbreviations are simply eliminated.
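A rough sketch of this preprocessing step is given below; the domain-label list, the abbreviation table and the patterns are illustrative stand-ins rather than the authors' actual resources, and the agreement of the expanded adjective with its noun is not modeled.

    import re

    DOMAIN_LABELS = ('в медицине', 'в антропологии')      # hypothetical excerpt
    ABBREVIATIONS = {'Сев.': 'Северном'}                   # form already agreed with the noun

    def preprocess(entry: str):
        # 1. Term recognition: the headword precedes the first dash.
        term, _, definition = entry.partition('–')
        # 2. Remove domain labels.
        for label in DOMAIN_LABELS:
            definition = definition.replace(label, '')
        # 3. Eliminate bracketed text.
        definition = re.sub(r'\([^)]*\)', '', definition)
        # 4. Expand frequent unambiguous abbreviations; ambiguous ones would be dropped.
        for abbr, full in ABBREVIATIONS.items():
            definition = definition.replace(abbr, full)
        return term.strip(), ' '.join(definition.split())

    print(preprocess('АГРАФ – нарядная заколка для волос (фр. agrafe)'))
    # expected: ('АГРАФ', 'нарядная заколка для волос')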

Morphology and Syntax

At this step, we use a context-free grammar to analyze the first sentences of dictionary entries. The output of this step is represented by dependency trees. Since dictionary definitions usually start with a noun group that includes the basic word, full syntactic analysis is unnecessary. The grammar is very simple and is aimed at recognizing noun groups only. It consists of the following rules:

[NP] > [NOUN];

A noun group may consist of a single noun.

[NP] > [ADJ] [NP root] : $0.grm := case_number_gender($1.grm, $2.type_grm, $2.grm);

An adjective precedes the noun (the standard word order in Russian). The second line enforces gender, number and case agreement between the noun and the adjective.

[NP] > [NP root] [NP grm="рд"];

A noun group in the genitive case (indicated by the "рд" grammeme) may be attached to another noun group on its right-hand side.

[PP] > [PREP root] [NP]; A preposition and a noun group may be combined into a prepositional group.

[NP] > [NP root] [PP];

A prepositional group may be attached to a noun group on its right-hand side.
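Taken together, these five rules recognize noun groups such as the one analyzed in the figures below. The following toy recognizer, written over tokens already tagged with part of speech and case, only mimics what the compiled grammar does; the agreement check of the second rule is omitted, and the data structures are our own simplification.

    def parse_np(tokens, i=0):
        """Recognize a noun group starting at position i; returns (constituent, next_i).
        A constituent is a dict whose 'root' field marks the constituent root,
        mirroring the AOT output. Agreement checking is omitted for brevity."""
        start, children = i, []
        while i < len(tokens) and tokens[i]['pos'] == 'ADJ':        # [NP] > [ADJ] [NP root]
            children.append(tokens[i]); i += 1
        if i == len(tokens) or tokens[i]['pos'] != 'NOUN':          # [NP] > [NOUN]
            return None, start
        root = tokens[i]; children.append(root); i += 1
        while i < len(tokens):
            if tokens[i]['pos'] == 'PREP':                          # [PP] > [PREP root] [NP]
                prep = tokens[i]
                np, i = parse_np(tokens, i + 1)
                children.append({'label': 'PP', 'root': prep,       # [NP] > [NP root] [PP]
                                 'children': [prep] + ([np] if np else [])})
            else:
                np, j = parse_np(tokens, i)                         # [NP] > [NP root] [NP grm="рд"]
                if np is None or np['root']['case'] != 'gent':
                    break
                children.append(np); i = j
        return {'label': 'NP', 'root': root, 'children': children}, i

    tokens = [  # верхняя одежда у некоторых азиатских народов
        {'w': 'верхняя', 'pos': 'ADJ', 'case': 'nomn'},
        {'w': 'одежда', 'pos': 'NOUN', 'case': 'nomn'},
        {'w': 'у', 'pos': 'PREP', 'case': None},
        {'w': 'некоторых', 'pos': 'ADJ', 'case': 'gent'},
        {'w': 'азиатских', 'pos': 'ADJ', 'case': 'gent'},
        {'w': 'народов', 'pos': 'NOUN', 'case': 'gent'},
    ]
    np, _ = parse_np(tokens)
    print(np['root']['w'])   # "одежда" – the candidate basic word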

We use the AOT tool to compile this grammar. The AOT output is an immediate-constituent structure in which the roots of constituents are marked. An example of the constituent structure for the phrase верхняя одежда у некоторых азиатских народов (outdoor clothes of some Asian nations), which is the definition of халат (oriental robe), is shown in Figure 2.

Fig. 2. An example of the immediate-constituent structure for the phrase ВЕРХНЯЯ ОДЕЖДА У НЕКОТОРЫХ АЗИАТСКИХ НАРОДОВ (constituents labeled NP, ANP and PP, with roots marked).

Since a dependency tree is necessary for the subsequent steps of analysis, the constituent structure is transformed using the following rules:
− the constituent root governs the other elements of the constituent;
− the constituent root is governed by the root of the immediately enclosing constituent.
An example of the dependency tree for the same phrase is shown in Figure 3.

Fig. 3. The dependency tree for the same phrase.
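The sketch below applies these two rules to a hand-written, root-marked encoding of the constituent structure from Figure 2; the encoding is our own simplification, not the AOT data format.

    # Root-marked constituents for "верхняя одежда у некоторых азиатских народов".
    tree = {'root': 'одежда', 'children': ['верхняя', 'одежда',
            {'root': 'у', 'children': ['у',
                {'root': 'народов', 'children': ['некоторых', 'азиатских', 'народов']}]}]}

    def to_dependencies(node, head=None, edges=None):
        """Rule 1: the root governs the other elements of its constituent.
        Rule 2: the root is governed by the root of the enclosing constituent."""
        if edges is None:
            edges = []
        root = node['root']
        if head is not None:
            edges.append((head, root))                 # rule 2
        for child in node['children']:
            if isinstance(child, dict):
                to_dependencies(child, root, edges)    # nested constituent
            elif child != root:
                edges.append((root, child))            # rule 1
        return edges

    print(to_dependencies(tree))
    # [('одежда', 'верхняя'), ('одежда', 'у'), ('у', 'народов'),
    #  ('народов', 'некоторых'), ('народов', 'азиатских')]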

Morphological analysis is applied just before the syntax analysis. Its result is a set of morphological analyses for each word form. Multiple analyses for one word form are very frequent, since Russian is an inflectional language and the level of homonymy between different forms is very high. During syntax analysis we are able to discard "unproductive" analyses that are not incorporated into the dependency tree (a similar approach for French is presented in [18]). Consider the phrase о Чукотском море (about the Chukchee Sea). There are three morphological analyses for море (sea): мор (pestilence), prepositional case, singular, masculine gender; море (sea), prepositional case, singular, neuter gender; and мора (mora), prepositional case, singular, feminine gender. There are two analyses for the word чукотском (Chukchee): the adjective чукотский (Chukchee) in the prepositional case and masculine or neuter gender. Only two pairs of analyses agree in gender (number and case are identical for all of them), so the third lemma – мора (mora) – is rejected. Unfortunately, the two other analyses, мор (pestilence) and море (sea), remain possible, and a certain ambiguity is unavoidable here. Nevertheless, syntax analysis yields a dramatic decrease of ambiguity in Russian. Our numerical results are presented in Table 2.

Table 2. Applying syntax for disambiguation

                                                               Before syntax analysis   After syntax analysis
Average number of lemmas for one word form                              1.27                    1.06
Average number of morphological analyses for one word form              2.26                    1.64
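The agreement filtering itself is straightforward. A minimal sketch for the о Чукотском море example, with the morphological analyses written out by hand, is shown below.

    adj_parses = [   # чукотском
        {'lemma': 'чукотский', 'case': 'prepositional', 'number': 'sing', 'gender': 'masc'},
        {'lemma': 'чукотский', 'case': 'prepositional', 'number': 'sing', 'gender': 'neut'},
    ]
    noun_parses = [  # море
        {'lemma': 'мор',  'case': 'prepositional', 'number': 'sing', 'gender': 'masc'},
        {'lemma': 'море', 'case': 'prepositional', 'number': 'sing', 'gender': 'neut'},
        {'lemma': 'мора', 'case': 'prepositional', 'number': 'sing', 'gender': 'femn'},
    ]

    def agrees(adj, noun):
        """The adjective and the noun must match in case, number and gender."""
        return all(adj[f] == noun[f] for f in ('case', 'number', 'gender'))

    kept = [n for n in noun_parses if any(agrees(a, n) for a in adj_parses)]
    print([n['lemma'] for n in kept])   # ['мор', 'море'] – "мора" is filtered out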

Relation Recognition

Relation recognition is based on logical-linguistic rules applied to the dependency tree. Six types of semantic relations currently used in the ontology are extracted; these relation types are listed in Table 3.

Table 3. Extracted relation types

Relation         Description           Notation
GENERALIZATION   IS-A; default value   Gen
INSTANCE         reverse to Gen        Spec
IDENTITY                               Same
PART                                   Part
WHOLE            reverse to Part       Whole
FUNCTION                               Func
OTHER                                  Other

A specific rule is attached to a certain word. Our software traverses the dependency tree and finds the first noun of the definition; the rule attached to this word (if any) is then executed. Each rule specifies, first, the type of relation indicated by this word and, second, a directive either to save this word as the basic word or to reject it and obtain the next basic-word candidate. Two examples of rules for the GENERALIZATION relation are presented in Table 4.

Table 4. Examples of GENERALIZATION relation rules

Basic word: род, вид, сорт, тип, … (kind, sort, type, class, etc.)
Examples: ФИЛЬДЕПЕРС – высший сорт фильдекоса. (PERSIAN THREAD – the first class of lisle.) ПИДЖИНЫ – тип языков, используемых как средство межэтнического общения в среде разноязычного населения. (PIDGINS – a sort of languages used for communication between people with different languages.)
Rule: 1. Save the default type of relation (GEN). 2. Save the next noun as the basic word ("next" means the next node in the dependency tree, which does not necessarily represent the next word in the linear context).
Result of application: ФИЛЬДЕПЕРС фильдекос GEN (PERSIAN THREAD lisle GEN); ПИДЖИН язык GEN (PIDGIN language GEN)

Basic word: жанр (genre)
Example: МИСТЕРИЯ – жанр средневекового западноевропейского религиозного театра. (MYSTERY – a genre of the religious medieval theatre.)
Rule: 1. Save the word as a basic word with the default relation type. 2. Save the default type of relation (GEN). 3. Save the next noun as a basic word.
Result of application: МИСТЕРИЯ жанр GEN, МИСТЕРИЯ театр GEN (MYSTERY genre GEN, MYSTERY theatre GEN)

We now discuss these two rules in more detail. The difference between them is that words such as kind, sort, etc. are eliminated, while genre is saved. Therefore, there are two relations in the resulting output if genre is the basic word (in some cases it is possible to extract an even larger number of relations and save them all as the result). We have two reasons to save genre: first, it is intuitively clear that this word is more meaningful than sort and other similar words; second, in some cases the definition may be too complicated for correct syntax analysis, and in such cases the program still extracts at least one basic word. Generally, there are two main types of logical-linguistic rules:
1. Save the first basic word – change the type of relation – save the next basic word (the notation for this type is "save word, next noun").
2. Reject the first basic word – change the type of relation – save the next basic word ("next noun").
Choosing either of these types depends on the frequency of a particular structure and on the authors' introspection.
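As an illustration only, the two rule types can be encoded as a small table keyed by the lemma of the first noun; the names and data structures below are our own, not the authors' implementation.

    # Illustrative encoding of the two rule types ("next noun" vs. "save word, next noun").
    RULES = {
        'сорт': {'save_first': False, 'relation': 'Gen'},   # next noun
        'тип':  {'save_first': False, 'relation': 'Gen'},   # next noun
        'жанр': {'save_first': True,  'relation': 'Gen'},   # save word, next noun
    }

    def extract_relations(term, noun_chain):
        """noun_chain: lemmas of nouns along the dependency path starting from the
        first noun of the definition, e.g. ['жанр', 'театр'] for МИСТЕРИЯ."""
        first = noun_chain[0]
        rule = RULES.get(first)
        if rule is None:                                    # default case: IS-A to the first noun
            return [(term, first, 'Gen')]
        relations = []
        if rule['save_first']:
            relations.append((term, first, 'Gen'))
        if len(noun_chain) > 1:
            relations.append((term, noun_chain[1], rule['relation']))
        return relations

    print(extract_relations('МИСТЕРИЯ', ['жанр', 'театр']))
    # [('МИСТЕРИЯ', 'жанр', 'Gen'), ('МИСТЕРИЯ', 'театр', 'Gen')]
    print(extract_relations('ФИЛЬДЕПЕРС', ['сорт', 'фильдекос']))
    # [('ФИЛЬДЕПЕРС', 'фильдекос', 'Gen')]

Two additional examples, for the IDENTITY relation, are presented in Table 5.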

Table 5. Examples of IDENTITY relation rules

Basic word: обозначение (nomination)
Example: СОЦИОСФЕРА – обозначение человечества, общества, а также освоенной человеком природной среды, в совокупности составляющих часть географической оболочки. (SOCIOSPHERE – a nomination of humanity as well as the human-assimilated environment, arranged together in a part of the geographical envelope.)
Rule: next noun
Result of application: СОЦИОСФЕРА человечество SAME (SOCIOSPHERE humanity SAME)

Basic word: явление (phenomenon)
Example: СИНЕСТЕЗИЯ – явление восприятия, когда при раздражении данного органа чувств наряду со специфическими для него ощущениями возникают и ощущения, соответствующие другому органу чувств. (SYNESTHESIA – a perception phenomenon with a subjective sensation or image of a sense other than the one being stimulated.)
Rule: save word, next noun
Result of application: СИНЕСТЕЗИЯ явление GEN, СИНЕСТЕЗИЯ восприятие SAME (SYNESTHESIA phenomenon GEN, SYNESTHESIA perception SAME)

We have an additional reason to save явление (phenomenon) as a basic word: it is part of such Russian phrases as атмосферное явление (atmospheric phenomenon), физическое явление (physical phenomenon), and so on. Our syntax analysis yields all grammatical information about noun phrases, and this information has to be saved at the relation recognition step; the final choice between a single basic word and a basic collocation is made by the ontology administrator. Sometimes more complicated rules, which cannot be reduced to the two previous types, are used. An example of such a rule for the FUNCTION relation is presented in Table 6.

Table 6. An example of a more complicated rule

Basic word: инструмент, прибор, аппарат, … (instrument, tool, device, etc.)
Example: ФЕН – электрический аппарат для сушки волос. (HAIRDRYER – an electric device for hair drying.)
Rule: Save the word; move to the next preposition; if it is для (for), change the relation type to FUNC and save the next noun.
Result of application: ФЕН аппарат GEN, ФЕН сушка FUNC (HAIRDRYER device GEN, HAIRDRYER drying FUNC)

This rule reflects the fact that functional relations in Russian are usually formed with the preposition для, while a dependent noun without a preposition cannot indicate a functional relation: прибор темной окраски (darkly colored device) vs. прибор для окраски (device for coloring).
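A sketch of this preposition-sensitive rule, over the same simplified dependency representation used above (again an illustration, not the authors' code):

    def function_relations(term, root):
        """root: {'lemma': ..., 'deps': [...]} – the first noun of the definition."""
        relations = [(term, root['lemma'], 'Gen')]                 # save the word: ФЕН аппарат GEN
        for dep in root['deps']:
            if dep['lemma'] == 'для':                              # functional preposition
                for np in dep['deps']:
                    relations.append((term, np['lemma'], 'Func'))  # ФЕН сушка FUNC
        return relations

    tree = {'lemma': 'аппарат', 'deps': [
        {'lemma': 'электрический', 'deps': []},
        {'lemma': 'для', 'deps': [
            {'lemma': 'сушка', 'deps': [{'lemma': 'волосы', 'deps': []}]}]},
    ]}
    print(function_relations('ФЕН', tree))
    # [('ФЕН', 'аппарат', 'Gen'), ('ФЕН', 'сушка', 'Func')]

The "Other" type of relation is very significant, as it can result in modifications of the ontology model. Some examples are presented in Table 7.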

Table 7. Examples of OTHER relation rules

Basic word: прерывание (termination)
Example: АБОРТ – прерывание беременности в сроки до 28 недель (то есть до момента, когда возможно рождение жизнеспособного плода). (ABORTION – the termination of a pregnancy after, accompanied by, resulting in, or closely followed by the death of the embryo or fetus.)
Rule: save word, next noun
Result of application: АБОРТ прерывание GEN, АБОРТ беременность OTHER (ABORTION termination GEN, ABORTION pregnancy OTHER)

Basic word: способность (ability)
Example: ХОМИНГ – способность животного возвращаться со значительного расстояния на свой участок обитания, к гнезду, логову и т. д. (HOMING – the ability of animals to come back from a considerable distance to their home range, nest, lair, etc.)
Rule: save word, next noun
Result of application: ХОМИНГ способность GEN, ХОМИНГ животное OTHER (HOMING ability GEN, HOMING animal OTHER)

These rules represent the intuitively recognized fact that abortion is relevant to pregnancy and homing is relevant to animals, even if it is difficult to specify the exact nature of such relevance. Such rules are applied to approximately 30 basic words. Unexpectedly, these 30 words fall into two groups: (i) words that indicate a certain feature of the term defined (e.g., ability), and (ii) words that indicate a certain transformation (e.g., termination). The first group includes the following basic words: характеристика (characteristic), признак (attribute), свойство (property), число (number), показатель (index), степень (degree), количество (quantity), характер (character), масса (mass), состояние (condition), способность (ability), место (place), источник (source). The second group consists primarily of verbal nouns: переход (transition), извлечение (extraction), превращение (transformation), введение (introduction), выделение (emission), возникновение (origination), нарушение (deviation), прерывание (termination), развитие (evolution), образование (formation), увеличение (increase), уменьшение (decrease). A genitive noun is used very frequently after these words in Russian (e.g., прерывание беременности in the first example in Table 7); the equivalent English form is of + noun (termination of a pregnancy in this example). Sometimes these words form relatively long genitive chains: увеличение показателя состояния… (an increase of an index of condition…); therefore, the rules are applied recursively. The fact that a group of otherwise unrelated words clusters so easily is very significant, and further extension of the ontology model by adding these two types of relations deserves additional consideration.

In total, there are 18 different rules for 91 basic words. The list of the most frequent basic words from the pilot study [17] (a representative excerpt is shown in Table 1) is used to develop specific rules. In particular, we formulate rules for 51 of the 100 most frequently used basic words. These rules are applicable to a relative minority of all entries: we currently apply them to just 8484 of the 26,375 entries. This number can probably be increased; however, no rules need to be attached to the majority of entries, since the main hypothesis is valid for them. After the relation recognition step, the total number of different basic words grows slightly, to 4679 (compared with the 4603 candidate basic words found in the pilot study [17]); however, these words are much more informative. The new list of the most frequent basic words obtained by applying our rules is presented in Table 8.

Table 8. Most frequent basic words

Rank  Basic Word      Translation     Frequency
1     УСТРОЙСТВО      DEVICE          332
2     МИНЕРАЛ         MINERAL         322
3     ЕДИНИЦА         UNIT            293
4     ПРИБОР          INSTRUMENT      292
5     ВЕЩЕСТВО        SUBSTANCE       277
6     ПРОЦЕСС         PROCESS         243
7     ИНСТРУМЕНТ      TOOL            235
8     ЭЛЕМЕНТ         ELEMENT         228
9     ЗАБОЛЕВАНИЕ     DISEASE         210
10    НАУКА           DISCIPLINE      199
11    СОЕДИНЕНИЕ      COMPOUND        184
12    БОЛЕЗНЬ         ILLNESS         174
13    ПОРОДА          BREED           170
14    ОРГАН           ORGAN           168
15    ЖИДКОСТЬ        LIQUID          166
16    КРИСТАЛЛ        CRYSTAL         164
17    МАШИНА          ENGINE          158
18    РАСТЕНИЕ        PLANT           146
19    ТКАНЬ           TISSUE          146
20    СООРУЖЕНИЕ      STRUCTURE       138
21    МАТЕРИАЛ        MATERIAL        134
22    ЛИЦО            PERSON          133
23    ОБЛАСТЬ         PROVINCE        121
24    ИЗМЕРЕНИЕ       MEASUREMENT     117
25    ИЗМЕНЕНИЕ       MODIFICATION    117
26    ВЕЛИЧИНА        MAGNITUDE       116
27    ОБРАЗОВАНИЕ     FORMATION       114
28    ПРОДУКТ         PRODUCT         110
29    ДВИЖЕНИЕ        MOVEMENT        104
30    ВОСПАЛЕНИЕ      INFLAMMATION    98
31    МЕРА            MEASURE         98
32    УЧАСТОК         SITE            97
33    ПРОИЗВЕДЕНИЕ    CREATION        94
34    АППАРАТ         MECHANISM       93

We evaluate our relation recognition approach by comparing its output with the judgments of an expert who read 200 dictionary entries and extracted basic words from them. For 90% of the entries (179 of 200), the results obtained by the expert and by our software are identical.

We now analyze the 21 dictionary entries that are processed incorrectly by the program. Most of these errors (16 of 21) are caused by specific algorithm inaccuracies at different steps of the analysis that can be eliminated by minor modifications. We expect to correct these inaccuracies in the near future and to achieve the theoretical level of accuracy of (179 + 16) / 200 = 97.5% for this source. For each of the other 5 of the 200 dictionary entries, however, a basic word is missing from the definition text. These entries are inconsistent with the basic hypothesis that the basic word is the first subjective-case noun of the definition, and the proposed approach is unsuitable for processing them.

Three of these dictionary entries have definitions that start with a verb, so that the defined term itself acts as the subject of the sentence. For example: АБРАЗИВНЫЙ ИНСТРУМЕНТ – служит для механической обработки (шлифование, притирка и другие). (ABRASIVE TOOL – is designed for mechanical processing (grinding, reseating, etc.).) The grammar would have to be expanded dramatically to process such definitions. Another way is to analyze the entire dictionary entry (including the defined term), which would recognize инструмент (tool) as the basic word in this example.

Another type of unusual entry is represented by statements of natural laws, theorems, etc., where the definition is an extended description of the defined object. For example: АВОГАДРО ЗАКОН – в равных объемах идеальных газов при одинаковых давлении и температуре содержится одинаковое число молекул. (AVOGADRO'S LAW – equal volumes of ideal or perfect gases, at the same temperature and pressure, contain the same number of particles, or molecules.) This case is very similar to the previous one, because it is possible to extract a basic word from the defined term: закон (law).

A further difficulty is the omission of the basic word. There is one such example in the evaluation set: АБИТУРИЕНТ – оканчивающий среднее учебное заведение. (COLLEGE APPLICANT – a person graduating from high school.) The translation does not reflect the difficulty, because there is a subject noun (person) in the English phrase, while in the Russian phrase (which is nonetheless a well-formed Russian sentence) it is absent; a word such as человек (person) is implied but missing from the Russian definition. An approach that reconstructs the omitted word would be necessary to overcome this difficulty. However, only one such case is found in the evaluation set (less than 1% of the set), whereas the algorithm modifications needed to remedy this deficiency would be very complex and inaccurate.

Import to Ontology

The final step of the import is manual; however, it is simplified by our ontoeditor. A table with all dictionary terms is shown under a dedicated tab of the ontoeditor. Each term is matched with its definition (using the syntax markup), and the first sentence of the definition and the basic word (extracted automatically as described in the previous sections) are shown in separate columns; this is the right-hand part of the ontoeditor window in Figure 4. The ontology administrator selects a subset of dictionary entries for each individual import operation. There are three ways to make such a selection: (i) to specify a basic word (factoring in all synonyms from the ontology lexicon); (ii) to specify a basic word and all dependent concepts; or (iii) to specify a certain word in the definition governed by the basic word. The ontology administrator may exclude irrelevant terms from the selection or include other terms. Then the selection is imported into the ontology in one of the following ways:
− added as a synonym to an existing concept (the one corresponding to the basic word);
− added as a new concept positioned in the taxonomy according to the basic word (the ontology administrator may add extra information to clarify these concepts); or
− added as an unsorted list of concepts (the ontology administrator may sort it later using the drag-and-drop interface).
Currently, only GENERALIZATION (IS-A) relations are imported into our ontology. Processing of the other relation types described above will be added to our software soon.

Fig. 4. The encyclopedia import into the ontoeditor

Wikipedia Parsing

The proposed approach is designed to be scalable and applicable to other dictionary resources. We now discuss its trial application to the Russian Wikipedia. Since Wikipedia is a free encyclopedia developed through community effort, its coverage is larger than that of any other dictionary, while the informational quality of its individual articles varies. Wikipedia includes a large amount of natural language information as well as its own taxonomy and templates. Using Wikipedia for ontology learning is quite popular ([19], [20], [21], [22], etc.); however, we are unaware of any comparable effort for the Russian Wikipedia.

For our experiment we use the Russian Wikipedia dump of November 13, 2009, which includes 506,504 entries. We use the Zemanta Wikiprep program (http://sourceforge.net/apps/mediawiki/wikiprep/) to convert articles from wiki markup to plain text. Wikipedia includes different types of entries: abstract concepts; terms from different narrow domains; proper names (of persons, cities, streets, etc.); and lists (of dates, events, etc.). Proper names and lists are out of the scope of our study, and we filter them out by Wikipedia categories. The portion of Wikipedia taken into consideration includes 196,349 entries.

The first sentence of each Wikipedia article that includes a dash symbol is used for analysis as a definition, since this format is recommended by the Wikipedia guidelines; the dash serves as an equivalent of the English is-a construction in Russian sentences whose predicate is expressed by a noun. We apply the algorithm described in the previous sections to these first sentences. We evaluate the results against the judgments of an expert who read 500 Wikipedia entries and extracted basic words from them. For 82% of the entries (410 of 500), the results obtained by the expert and by our software are identical. We attribute approximately 40% of the errors (36 of 90 entries) to irregularities in the article texts. Remedying the other errors will require some extra syntactic and logical-linguistic rules, which is a subject of future work. Nevertheless, our approach is generally applicable to Wikipedia as well.
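A rough sketch of the definition-sentence selection step is given below; the sentence splitter is naive and the example article text is purely illustrative.

    import re

    def definition_sentence(article_text: str):
        """Return the first sentence containing a dash (the "TERM — definition"
        pattern recommended by the Wikipedia guidelines), or None."""
        for sentence in re.split(r'(?<=[.!?])\s+', article_text.strip()):
            if '—' in sentence or '–' in sentence:
                return sentence
        return None   # no dash-style definition; such an entry is skipped

    text = ('Халат — домашняя или рабочая одежда, запахивающаяся или '
            'застёгивающаяся сверху донизу. Происходит от арабского слова.')
    print(definition_sentence(text))   # the first, dash-style definition sentence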

Conclusion

We present an approach for automatically discerning semantic relations from the text of an encyclopedic dictionary. This approach is designed for the semi-automatic addition of terms to ontologies. The original hypothesis, that the first subjective-case noun of the definition represents the basic word, yields correct results for more than 90% of the entries of the Russian Encyclopedic Dictionary [13]; it is therefore applicable to the practical extension of ontologies. The hypothesis is refined by developing methods for selecting the proper basic word (when it is not represented by the root of the first noun group) and for determining the types of those semantic relations that do not fall under the IS-A category. Certain definition structures that cannot be properly processed by our automatic algorithm are also identified; such definitions are rare (about 1% of entries) in the encyclopedic dictionary that we processed. The presented approach is generally applicable to other dictionary resources; we expect to apply it in the future to Wikipedia and to traditional explanatory dictionaries.

References

1. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontology Engineering. Springer-Verlag (2004)
2. Gomez-Perez, A., Manzano-Macho, D.: A Survey of Ontology Learning Methods and Techniques. IST Project IST-2000-29243 OntoWeb, Technical Report (2003)
3. Jannink, J.: Thesaurus Entry Extraction from an Online Dictionary. In: Proceedings of Fusion '99, Sunnyvale, CA (1999)
4. Rigau, G., Rodríguez, H., Agirre, E.: Building Accurate Semantic Taxonomies from Monolingual MRDs. In: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'98), Montreal, Canada (1998)
5. Lee, C., Lee, G., Yun, S. J.: Automatic WordNet Mapping Using Word Sense Disambiguation. In: 38th Annual Meeting of the Association for Computational Linguistics (2000)
6. Agirre, E., et al.: Extraction of Semantic Relations from a Basque Monolingual Dictionary using Constraint Grammar. In: Proceedings of Euralex, Stuttgart, Germany (2000)
7. Litkowski, K. C.: Digraph Analysis of Dictionary Preposition Definitions. In: Proceedings of the Association for Computational Linguistics Special Interest Group on the Lexicon, Philadelphia, July 11 (2002)
8. Nichols, E., et al.: Multilingual Ontology Acquisition from Multiple MRDs. In: Proceedings of the 2nd Workshop on Ontology Learning and Population, pp. 10-17, Sydney (2006)
9. Morita, T., Fukuta, N., Izumi, N., Yamaguchi, T.: DODDLE-OWL: A Domain Ontology Construction Tool with OWL. In: Lecture Notes in Computer Science, Springer (2006)
10. Ермаков А. Е. Автоматизация онтологического инжиниринга в системах извлечения знаний из текста // Труды международной конференции «Диалог 2008». М.: Наука (2008). [Ermakov, A. E.: The Automation of Ontology Engineering for Knowledge Acquisition Systems. In: Proceedings of the Dialogue 2008 International Conference. Nauka, Moscow (2008). In Russian]
11. Минаков И. А. Системный анализ, онтологический синтез и инструментальные средства обработки информации в процессах интеграции профессиональных знаний. Автореферат диссертации на соискание ученой степени доктора технических наук. Самара (2007). [Minakov, I. A.: System Analysis, Ontological Synthesis and Tools for Information Processing in Professional Knowledge Integration. Doctoral thesis abstract, Samara (2007). In Russian]
12. Пекар В. И. Автоматическое пополнение специализированного тезауруса // Труды международной конференции «Диалог 2002». М.: РГГУ (2002). [Pekar, V. I.: The Domain Thesaurus Learning. In: Proceedings of the Dialogue 2002 International Conference. RGGU, Moscow (2002). In Russian]
13. Российский энциклопедический словарь / Гл. ред. А. М. Прохоров. М.: Большая Российская энциклопедия (2001). [Russian Encyclopedic Dictionary. A. M. Prohorov (ed.). Bolshaya Rossiyskaya Enciklopediya, Moscow (2001). In Russian]
14. Рубашкин В. Ш. Семантический компонент в системах понимания текста // КИИ-2006. Десятая национальная конференция по искусственному интеллекту с международным участием. Труды конференции. М.: Физматлит (2006). С. 455-463. [Rubashkin, V. Sh.: The Semantic Component in Text Understanding Systems. In: Proceedings of the Tenth National Conference on Artificial Intelligence (KII-2006). Fizmatlit, Moscow, pp. 455-463 (2006). In Russian]
15. Рубашкин В. Ш. Онтологии: проблемы и решения. Точка зрения разработчика // Труды международной конференции «Диалог 2007». М.: Наука (2007). [Rubashkin, V. Sh.: Ontologies: Problems and Solutions. A Developer's Point of View. In: Proceedings of the Dialogue 2007 International Conference. Nauka, Moscow, pp. 456-458 (2007). In Russian]
16. Рубашкин В. Ш., Пивоварова Л. М. Онторедактор как комплексный инструмент онтологической инженерии // Труды международной конференции «Диалог 2008». М.: Наука (2008). [Rubashkin, V. Sh., Pivovarova, L. M.: Ontoeditor as a Complex Tool for Ontology Engineering. In: Proceedings of the Dialogue 2008 International Conference. Nauka, Moscow, pp. 456-458 (2008). In Russian]
17. Рубашкин В. Ш., Капустин В. А. Использование определений терминов в энциклопедических словарях для автоматизированного пополнения онтологий // XI Всероссийская объединенная конференция «Интернет и современное общество». СПб. (2008). [Rubashkin, V. Sh., Kapustin, V. A.: Ontology Learning from Encyclopedia Entry Definitions. In: Proceedings of the Internet and Modern Society Conference. Saint Petersburg (2008). In Russian]
18. Fernandez, M., Clergerie, E., Vilares, M.: Mining Conceptual Graphs for Knowledge Acquisition. In: Proceedings of the Workshop on Improving Non-English Web Searching (iNEWS'08), pp. 25-32, Napa Valley, USA (2008)
19. Kassner, L., Nastase, V., Strube, M.: Acquiring a Taxonomy from the German Wikipedia. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco (2008)
20. Ruiz-Casado, M., Alfonseca, E., Okumura, M., Castells, P.: Information Extraction and Semantic Annotation of Wikipedia. In: Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 145-169 (2008)
21. Ponzetto, S. P., Strube, M.: WikiTaxonomy: A Large Scale Knowledge Resource. In: Proceedings of the 18th European Conference on Artificial Intelligence, Patras, Greece, 21-25 July 2008, pp. 751-752 (2008)
22. Wu, F., Hoffmann, R., Weld, D. S.: Information Extraction from Wikipedia: Moving down the Long Tail. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 731-739 (2008)