Extending Czech WordNet Using a Bilingual Dictionary

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

DIPLOMA THESIS

Marek Blahuš

Brno, May 2011

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Advisor: doc. PhDr. Karel Pala, CSc.

Thanks

I would like to thank several people who have helped me with their advice or during the realization of this work:

doc. PhDr. Karel Pala, CSc., my advisor, for providing me with inspiration, the necessary background knowledge and guidance;
RNDr. Rudolf Červenka, director of LEDA spol. s r.o., for providing Josef Fronek’s bilingual dictionary for research purposes of the Natural Language Processing Center;
Bc. Jana Blahušová, my sister and a student of English at Charles University in Prague, for language advice.

Abstract

This thesis is an attempt at automatically extending the Czech WordNet lexical database (48,000 literals in 28,000 synsets) by translating English literals from existing synsets in Princeton WordNet. We use a machine-readable bilingual dictionary to extract English-Czech translation pairs, look up the English literals in Princeton WordNet and, in case of a high-confidence match, transfer the literal into Czech WordNet. Along with literals, new synsets parallel to the English ones and identified by ILI (Inter-Lingual Index) are introduced into Czech WordNet, including information on their ILR (Internal Language Relations) such as hypernymy/hyponymy. The thesis describes the parsing of dictionary data, the extraction of translation pairs and the criteria used for estimating the confidence level of a match. The work resulted in 36,228 added literals and 12,403 created synsets.
An overview of previous similar attempts for other languages is also included.

Keywords

WordNet, Czech WordNet, extension, translation, bilingual dictionary

Contents

1 Introduction
  1.1 Motivation
  1.2 Structure of the Thesis
  1.3 WordNet
  1.4 Czech WordNet
  1.5 Comprehensive English-Czech Dictionary
2 Analysis and Design
  2.1 Overview of Previous WordNet Translation Attempts
    2.1.1 BabelNet
    2.1.2 MLSN
    2.1.3 WordNet Builder
    2.1.4 Miscellaneous
  2.2 Designing an Algorithm
  2.3 Used Tools
    2.3.1 GNU Tools and Utilities
    2.3.2 XSLT
    2.3.3 libxslt
    2.3.4 Saxon-HE
    2.3.5 SAX
    2.3.6 Python
    2.3.7 VisDic and DEBVisDic
    2.3.8 uconv
3 Dictionary Data Parser
  3.1 Structure of a Dictionary Entry
  3.2 Transforming Dictionary Data
  3.3 Extracting Translation Pairs
    3.3.1 Treatment of Alternatives and Parentheses
4 WordNet Synset Generator
  4.1 Structure of WordNet Data
  4.2 Extending WordNet by Matching Translation Pairs
5 Evaluation
  5.1 Statistics and Observations
6 Conclusion
  6.1 Future Directions
Bibliography
Index
A Examples of Generated Synsets
B Contents of the Enclosed CD-ROM
Chapter 1 Introduction

1.1 Motivation

The idea for this thesis originated in the spring semester of 2010, during a course on Semantics and Communication taught by Karel Pala at the Faculty of Informatics of Masaryk University, which was aimed particularly at students interested in the field of Natural Language Processing (NLP). The course dealt mainly with the problems of the analysis of meaning with regard to computer processing. Besides familiarizing the students with the fundamentals of linguistic semantics, it focused on semantic networks, which are structures used to represent semantic relations among concepts, and therefore knowledge of the world. The best known of them is WordNet, a lexical database for the English language, along with its counterparts for other languages.

Since ideally a lexical semantic network should cover all lexicalized concepts of a language, developing and maintaining such databases is expensive and time-consuming. The database for English is an exception: it is particularly large and might therefore be used to stimulate the growth of the databases for other languages, if an approach is found to automate the translation of synsets (i.e. synonym sets, the elementary units of WordNet) by using a suitable linguistic resource as a reference, e.g. a bilingual dictionary.

In the summer of 2010, the Natural Language Processing Center at the Faculty of Informatics acquired data from the largest one-volume English-Czech dictionary ever published, which prepared the ground for this attempt at automatically suggesting extensions to the Czech WordNet.

1.2 Structure of the Thesis

This thesis contains six chapters. The first chapter provides a description of the linguistic resources involved in the work. In the second chapter, we give references to previous attempts similar to this one, discuss them and choose a suitable strategy for our own attempt.
A list of tools selected for the implementation, along with a short description of each, is also given there. The description of the implementation has been divided into the third and the fourth chapter, because the machine-readable dictionary, whose parsed form is the result of the third chapter, may on its own also serve goals other than providing data for the WordNet extension for which we parsed it. The fifth chapter gives statistics on the data generated by the implemented software and presents an evaluation of the obtained results. Finally, in the last chapter, we summarize our work and suggest some future directions and related tasks which were not performed within the scope of this work. Appendix A contains sample synsets produced by the software. Contents of the enclosed CD-ROM are listed in Appendix B.

The text of this thesis, as well as all identifiers and comments in the source code, has been written in English, in the hope that foreign researchers may find it inspiring for their own attempts at building or extending WordNets for their languages. Some passive knowledge of Czech can be useful to better understand the examples cited in this work that contain Czech words, but an approximate English translation is given every time a Czech expression appears.

1.3 WordNet

WordNet® [1], [2], sometimes referred to as Princeton WordNet (hereafter abbreviated as PWN) to distinguish it from its younger counterparts for other languages, is a lexical database for the English language. It was developed at the Cognitive Science Laboratory of Princeton University in the United States by a team of scientists led by the psychology professor George A. Miller. Under development since 1985, PWN has reached version 3.0, which encompasses 155,287 literals organized in 117,659 synsets. For an example of WordNet data, see Section 4.1.
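As a rough illustration of how literals are organized in synsets, the following Python sketch models a handful of toy synsets in memory. It is only a sketch under stated assumptions, not the WordNet data format (for which see Section 4.1): the class Synset, its fields, the helper functions, the identifiers toy-1 to toy-4 and the glosses are all invented for this example and are not taken from PWN. The terms synset, literal and sense number used here are defined in the paragraphs that follow.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Synset:
    """Toy model of a synset; all field names are assumptions of this sketch."""
    synset_id: str                 # hypothetical identifier, not a real ILI value
    pos: str                       # part of speech, e.g. "n" for noun
    literals: list                 # (lemma, sense_number) pairs
    gloss: str = ""                # definition, invented for the toy data below
    hypernyms: list = field(default_factory=list)  # ids of hypernym synsets


def index_by_lemma(synsets):
    """Map (pos, lemma) to the ids of all synsets that contain the lemma."""
    index = defaultdict(list)
    for synset in synsets:
        for lemma, _sense in synset.literals:
            index[(synset.pos, lemma)].append(synset.synset_id)
    return index


def is_polysemous(index, pos, lemma):
    """A literal is polysemous if it occurs in more than one synset."""
    return len(index.get((pos, lemma), [])) > 1


def check_unique_senses(synsets):
    """Raise ValueError if a (pos, lemma, sense) triple occurs more than once."""
    seen = set()
    for synset in synsets:
        for lemma, sense in synset.literals:
            key = (synset.pos, lemma, sense)
            if key in seen:
                raise ValueError(f"duplicate literal and sense number: {key}")
            seen.add(key)


# Invented toy data: not actual PWN synsets, ids or glosses.
TOY_SYNSETS = [
    Synset("toy-1", "n", [("mammal", 1)], "a warm-blooded vertebrate"),
    Synset("toy-2", "n", [("cat", 1)], "a feline kept as a pet", ["toy-1"]),
    Synset("toy-3", "n", [("window", 1)], "an opening in a wall"),
    Synset("toy-4", "n", [("window", 2)], "an area on a computer screen"),
]

if __name__ == "__main__":
    check_unique_senses(TOY_SYNSETS)
    index = index_by_lemma(TOY_SYNSETS)
    for lemma in ("cat", "window"):
        print(lemma, "polysemous:", is_polysemous(index, "n", lemma))
```

Keeping the sense number next to the lemma is what allows the same lemma to appear in several synsets while each pair of literal and sense number stays unique within a part of speech.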
A synset (synonym set) is the elementary unit of WordNet and consists of one or more literals (words or multi-word expressions) that are synonymous with each other: they denote the same meaning, or, by definition, “are interchangeable in some context without changing the truth value of the proposition in which they are embedded.” Literals that bear a single meaning are called monosemous. A literal is called polysemous if it denotes more than one meaning and is therefore present in more than one synset. Each such occurrence of a literal in a synset is assigned a sense number (usually an integer, starting from 1), so that each pair consisting of a literal and a sense number is guaranteed to be unique across all synsets in the database within the respective part of speech.

As a direct consequence of George A. Miller’s design of WordNet as a model of the human mind in accordance with observations of contemporary psycholinguistics, WordNet contains records only for the so-called open-class words: nouns, verbs, adjectives and adverbs.

With each synset, WordNet also stores other linguistic information in addition to the list of synonymous literals. Each synset comes with a gloss consisting of a definition of the meaning it represents, optionally accompanied by example sentences. It is also desirable that each synset be provided with links to other synsets that are in some linguistic relation to it: WordNet registers Internal Language Relations (ILR) such as hypernymy/hyponymy (cat is a kind of mammal), meronymy/holonymy (window is a part of building) or troponymy (to lisp is to talk in some manner).

For the purpose of this thesis, we have worked with an older version of WordNet, 2.0, because this is the one that the other language resources which we have used are linked to. In size, which seems to be the most important factor
