
MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Extending Czech WordNet Using a Bilingual Dictionary

DIPLOMA THESIS

Marek Blahuš

Brno, May 2011

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Advisor: doc. PhDr. Karel Pala, CSc.

Thanks

In this place, I would like to thank several people who have helped me by their advice or during the realization of this work. I want to thank: doc. PhDr. Karel Pala, CSc., my advisor, for providing me with the inspiration, necessary background knowledge and guidance; RNDr. Rudolf Červenka, director of LEDA spol. s r.o., for providing Josef Fronek's bilingual dictionary for research purposes of the Natural Language Processing Center; Bc. Jana Blahušová, my sister and a student of English at the Charles University in Prague, for language advice.

Abstract

This thesis is an attempt at automatically extending the Czech WordNet lexical database (48,000 literals in 28,000 synsets) by translation of English literals from existing synsets in Princeton WordNet. We make use of a machine-readable bilingual dictionary to extract English-Czech translation pairs, search for the English literals in Princeton WordNet and, in case of a high-confidence match, transfer the literal into Czech WordNet. Along with literals, new synsets parallel to the English ones and identified by ILI (Inter-Lingual Index) are introduced into Czech WordNet, including information on their ILR (Internal Language Relations) such as hypernymy/hyponymy. The thesis describes the parsing of dictionary data, the extraction of translation pairs and the criteria used for estimating the confidence level of a match. The results of the work are 36,228 added literals and 12,403 created synsets. An overview of previous similar attempts for other languages is also included.

Keywords

WordNet, Czech WordNet, extension, translation, bilingual dictionary

Contents

1 Introduction
  1.1 Motivation
  1.2 Structure of the Thesis
  1.3 WordNet
  1.4 Czech WordNet
  1.5 Comprehensive English-Czech Dictionary
2 Analysis and Design
  2.1 Overview of Previous WordNet Translation Attempts
    2.1.1 BabelNet
    2.1.2 MLSN
    2.1.3 WordNet Builder
    2.1.4 Miscellaneous
  2.2 Designing an Algorithm
  2.3 Used Tools
    2.3.1 GNU Tools and Utilities
    2.3.2 XSLT
    2.3.3 libxslt
    2.3.4 Saxon-HE
    2.3.5 SAX
    2.3.6 Python
    2.3.7 VisDic and DEBVisDic
    2.3.8 uconv
3 Dictionary Data Parser
  3.1 Structure of a Dictionary Entry
  3.2 Transforming Dictionary Data
  3.3 Extracting Translation Pairs
    3.3.1 Treatment of Alternatives and Parentheses
4 WordNet Synset Generator
  4.1 Structure of WordNet Data
  4.2 Extending WordNet by Matching Translation Pairs
5 Evaluation
  5.1 Statistics and Observations
6 Conclusion
  6.1 Future Directions
Bibliography
Index
A Examples of Generated Synsets
B Contents of the Enclosed CD-ROM

Chapter 1 Introduction

1.1 Motivation

The idea to create this thesis originated in the spring semester of 2010, during a course on Semantics and Communication taught by Karel Pala at the Faculty of Informatics of Masaryk University, which aimed particularly at students interested in the field of Natural Language Processing (NLP). The course dealt mainly with the problems of the analysis of meaning with regard to computer processing. Besides familiarizing the students with the fundamentals of linguistic semantics, it focused on semantic networks, which are structures used to represent semantic relations among concepts, and therefore knowledge of the world. The most well known of them is WordNet, a lexical database for the English language, along with its counterparts for other languages. Since ideally a lexical database should cover all lexicalized concepts of a language, developing and maintaining such a database is expensive and time-consuming. As an exception, the database for English is particularly large and might therefore be used to stimulate the growth of databases for other languages, if an approach is found to automatize the translation of synsets (i.e. synonym sets, the elementary units of WordNet) by using a suitable linguistic resource as a reference, e.g. a bilingual dictionary. In the summer of 2010, the Natural Language Processing Center at the Faculty of Informatics acquired data from the largest one-volume English-Czech dictionary ever published, which has prepared the ground for this attempt at automatically suggesting extensions to the Czech WordNet.

1.2 Structure of the Thesis

This thesis contains six chapters. The first chapter provides a description of the linguistic resources involved in the work. In the second chapter, we give references to previous attempts similar to this one, discuss them and choose a suitable strategy for our own attempt. A list of tools selected for the implementation, along with a short description of each, is also given there. The description of the implementation has been divided into the third and the fourth chapter, due to the fact that the machine-readable dictionary, whose parsed form is the result of the third chapter, may, on its own, serve also other goals than those of providing data for the WordNet enlargement for which the parsing has been done by us. The fifth chapter gives statistics on the data generated by the implemented software and presents an evaluation of the obtained results. Finally, in the last chapter, we summarize our work and suggest some future directions and related tasks which were not performed in the scope of this work. Appendix A contains sample synsets produced by the software. Contents of the enclosed CD-ROM are listed in Appendix B. The text of this thesis, as well as all identifiers and comments in the source code, has been written in English, in the hope that foreign researchers may find it inspiring for their own attempts at building or extending WordNets for their languages. Some passive knowledge of Czech can be useful to better understand the examples cited in this work that contain Czech words, but an approximate English translation is given every time a Czech expression appears.

1.3 WordNet

WordNet® [1], [2], sometimes referred to as Princeton WordNet to distinguish it from its younger other-language counterparts (hereafter in such cases abbreviated as PWN), is a lexical database for the English language. It was developed at the Cognitive Science Laboratory of Princeton University in the United States by a team of scientists led by the psychology professor George A. Miller. Developed since 1985, PWN has now reached version 3.0, which encompasses 155,287 literals organized in 117,659 synsets. For an example of WordNet data, see Section 4.1. A synset (synonym set) is the elementary unit of WordNet and consists of one or more literals (words or multi-word expressions) that are synonymous with each other – they denote the same meaning, or, by definition, "are interchangeable in some context without changing the truth value of the proposition in which they are embedded." Literals that bear a single meaning are called monosemous. A literal is called polysemous if it denotes more than one meaning and is therefore present in more than one synset. Each such occurrence of a literal in a synset is attributed a sense number (usually an integer, starting from 1), so that each pair consisting of a literal and a sense number is guaranteed to be unique across all synsets in the database within the respective part of speech. As a direct consequence of George A. Miller's design of WordNet as a model of the human mind in accordance with observations of contemporary psycholinguistics, WordNet contains records only for the so-called open-class words – nouns, verbs, adjectives and adverbs. With each synset, WordNet stores also other linguistic information in addition to the list of synonymous literals. Each synset comes with a gloss consisting of a definition of the meaning it represents, optionally accompanied by example sentences. It is also desirable that each synset be provided with links to other synsets that are in some linguistic relation to it: WordNet registers Internal Language Relations (ILR) such as hypernymy/hyponymy (cat is a kind of mammal), meronymy/holonymy (window is a part of building) or troponymy (to lisp is to talk in some manner). For the purpose of this thesis, we have worked with an older version of WordNet – 2.0, because this is the one that the other language resources which we have used are linked to. In size, which seems to be the most important factor that could influence the results of our work, it differs only slightly from the latest version, as it contains 152,059 literals organized in 115,424 synsets for a total of 203,145 word-sense pairs. Out of this, 79,689 (69 %) are noun synsets, 13,508 (12 %) are verb synsets, 18,563 (16 %) are adjective synsets and 3664 (3 %) are adverb synsets.

1.4 Czech WordNet

The unique concept introduced by PWN and the interest in the range of its applications in Natural Language Processing stirred up several international research collaborations which eventually culminated in the founding of the Global WordNet Association (established in June 2000). Its founding members were researchers at European universities who had participated in the EuroWordNet project [3] that ran from 1996 to 1999. That project produced lexical databases of WordNet type for seven European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). BalkaNet [4], a later project running from 2001 to 2004, contributed WordNets for five languages of the Balkans (Bulgarian, Greek, Romanian, Turkish and Serbian). A common link among all these databases is the Inter-Lingual Index (hereafter ILI) – a unique assignment of identification numbers to all synsets of PWN which makes it possible to establish links among PWN synsets and their equivalents in any other language (and as a consequence also between any two such languages). The Czech WordNet [5] (hereafter CzWN) has been developed at the Faculty of Informatics of Masaryk University since 1998. The initial effort was made within the second phase of the EuroWordNet project and the database was further developed when Masaryk University participated as a partner in the BalkaNet project. Its 1016 base concepts were derived by analysing definitions in the Dictionary of Literary Czech (Slovník spisovné češtiny [6]), which was the only usable resource of its kind at that time. Lingea [7], an electronic Czech-English and English-Czech dictionary, and the Explanatory Dictionary of Czech (Výkladový slovník češtiny [8]), which were both being built at the time of CzWN's construction, were used to populate the database with more words. Eventually, between 13,000 and 15,000 synsets formed the first version of CzWN in 1999. At the time of publication of this work, CzWN comprises 34,026 literals organized in 28,478 synsets for a total of 47,542 word-sense pairs. Out of this, 21,018 (74 %) are noun synsets, 5162 (18 %) are verb synsets, 2129 (7 %) are adjective synsets and 166 (1 %) are adverb synsets. The ILI indices that link them to PWN are those of PWN 2.0; a discussion on the possibility of remapping them to PWN 3.0 may be found in Section 6.1. VerbaLex [9] is a database of verb valency frames created since 2005 at the Faculty of Informatics, which has also been linked by ILI to PWN 2.0, is large in size (approximately 20,000 valency frames) and may eventually replace the verbal part of CzWN. A more detailed description of EuroWordNet, BalkaNet and the Czech WordNet and the history of their development has been given by Tomáš Čapek in his diploma thesis [10]. The early development of Czech WordNet has been described in a technical report [5].


1.5 Comprehensive English-Czech Dictionary

The Comprehensive English-Czech Dictionary (Velký anglicko-český slovník [11], hereafter referred to by its Czech abbreviation VAC), compiled by Josef Fronek, a Czech lecturer in Linguistics and Phonetics at the University of Glasgow, is the largest one-volume English-Czech dictionary ever published. It was published by LEDA in 2006. According to the publisher, it contains more than 100,000 headwords and sub-headwords, more than 200,000 words and phrases and roughly 400,000 equivalents. Emphasis has been laid on contemporary language – the publisher advertises the dictionary by saying that words such as "biomass", "cybersex", "dumb down", "dweeb", "Eurocracy", "home page", "loyalty card", "netiquette", etc. may be found there. In the foreword, the author says it was his aim to fill a gap in English-Czech lexicography by publishing a dictionary that would enable active use of both languages, therefore being convenient for both the speakers of English and Czech who are dealing with the other language. Furthermore, he states that due to its character, his dictionary is accessible to the widest possible public, from beginners to translators and interpreters, of either mother tongue. More details on the microstructure of the dictionary will be given in Section 3.1. By courtesy of Rudolf Červenka, the director of LEDA spol. s r.o., the data from VAC has been put at the disposal of researchers at the Natural Language Processing Center of Masaryk University. The data has been provided in machine-readable form by Petr Sojka, a lecturer at the Faculty of Informatics at the same university, who happened to be the person responsible for the typesetting of the dictionary. This has made it possible to use VAC as a source of linguistic information for the purpose of this thesis, and it is the first time that such a large English-Czech dictionary could be used for automatic or semi-automatic extraction of translation pairs and enlargement of CzWN. This work will be described in detail in Chapter 3, "Dictionary Data Parser" and Chapter 4, "WordNet Synset Generator".

Chapter 2 Analysis and Design

2.1 Overview of Previous WordNet Translation Attempts

Dictionaries have been the source of WordNet data from the beginning, since it is natural to reuse existing resources rather than to try to recreate everything from scratch. In the foreword of [2], even the creators of PWN admit that they have used various dictionaries, word lists, corpora and thesauri as sources for building their database over time. Bilingual dictionaries come into question when we move from PWN to WordNets for other languages. Particularly those bilingual dictionaries that have English as one of the two languages are of most use, due to the large size of PWN with which synset links may be sought by means of ILI. The situation is, in fact, similar to that of interwiki links in the online multilingual encyclopedia Wikipedia [12], with the difference that Wikipedia provides a way to create links between two non-English languages while WordNets only support the creation of special non-ILI identifiers specific to the particular WordNet (usually to identify a concept that is not lexicalized in English) and do not provide any convenient way of direct linking between languages without the use of ILI based on English (this aspect of ILI not being a superset of all languages' concepts is discussed in [13]). Indeed, Wikipedia as a kind of electronic multilingual dictionary has inspired a number of recent projects aimed at developing a WordNet database in an automated manner. In this chapter, we will mention several previous attempts at using bilingual (and multilingual) dictionaries for the construction or enlargement of non-English WordNets, with no claim of completeness, and we will provide summaries of the algorithms and results for a couple of them to illustrate the distinct paradigms that may be followed to accomplish the same goal.

2.1.1 BabelNet

BabelNet [14] is described by its authors as a very large, wide-coverage multilingual semantic network. Its construction is by definition automatic, supported by a methodology that integrates data from PWN and Wikipedia. As a means for further enrichment of the database's multilinguality, machine translation is also employed in the process. Wikipedia is presented by BabelNet's authors as a perfect complement for WordNet, because of the multilingual lexical knowledge of encyclopedic nature it provides. Its drawback is that it is less formally organized, particularly owing to its large-scale collaborative character which has enabled it to become what it is today (more than 3,645,000 articles in English and 18,800,000 articles in all languages in total, as of May 2011). BabelNet collects texts of English Wikipedia articles along with relations among them that are established by hyperlinks and automatically maps these articles to PWN senses, using sense labels in Wikipedia article titles, hyperlinks in the article texts and the classification of articles into categories as sources of information, on which disambiguation of polysemous literals is based. This information is compared with the data in PWN, including information stored as ILR (both simple hypernymy/hyponymy as well as sibling relations between two synsets that share a common direct hypernym) and an analysis of glosses, and links are established by an algorithm based on the maximization of conditional probabilities. When a common set of concepts has been created, the algorithm continues by forming babel synsets which link common synsets with literals found among the interwiki links of the English article. So far, the project has succeeded in covering 55.7 % of the noun senses in PWN. Between 52.9 % and 86.0 % of word senses from five non-English WordNets have been covered (the highest number relates to a French WordNet database and is biased by the fact that it itself has been created semi-automatically by combining several resources, including Wikipedia). Manual evaluation of the newly generated synsets has come out strongly in favor of including Wikipedia as a resource for WordNet construction. Several other projects have followed a similar design, mining information from Wikipedia, sometimes also from its sister project Wiktionary [15]. BabelNet's results obtained by harvesting and processing Wikipedia interwiki links have been made available under the conditions of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

2.1.2 MLSN

MLSN (A multi-lingual semantic network, [16]), started in 2005, is an open-source project to create a semantic network covering all the world's major languages. The database consists of words and relations among them, and a web interface exists that enables the public to help enhance the database. To cope with the need for good content in order to attract users and contributors to an open-source project, the creators have come up with a solution that uses freely available bilingual dictionaries to automatically translate entries. MLSN's algorithm for automatic PWN translation using a bilingual dictionary has provided some inspiration for this thesis by its design, which is fairly simple yet yields good-quality results. Its basic constraint is that of limiting itself to monosemous literals. All monosemous nouns from PWN are translated using a bilingual dictionary from English to the target language. Out of these, only those with a single translation equivalent are kept and translated back using a complementary dictionary (target language to English). If the result of this counter-checking, which is a set of one or more English literals, is exactly equal to the PWN synset from which the original literal stems, it is called high confidence; if only some of the synonyms have been discovered, then medium confidence; if all have been discovered plus some extra ones, low confidence; if none of the sets is a subset of the other, the literal is ignored. Only high and medium confidence results are eventually imported into the MLSN database, with an appropriate mark in the comment field. The authors report a high reliability of this approach, in spite of the cost of excluding many common words that are polysemous. Eight to seventeen thousand literals could be imported by this algorithm for each of Japanese, German and Chinese. As an additional observation, adding a second independent dictionary has proved very helpful in improving the quality of the results. This additional data has been harvested from Wiktionary [15]. During a manual evaluation, 95–97 % of the high-confidence results, 94–98 % of the medium-confidence results and 68–72 % of the low-confidence results have been found satisfactory.
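To make the confidence levels concrete, the back-translation check just described can be sketched in a few lines of Python. This is our own simplified reading of the published description, not MLSN's actual code; translate and back_translate stand for the two hypothetical dictionary lookups, and synset is the set of English literals of the original PWN synset.

    # Sketch of the MLSN confidence classification (a simplified reading,
    # not the project's actual code). translate() and back_translate()
    # are hypothetical dictionary lookup functions.
    def mlsn_confidence(literal, synset, translate, back_translate):
        targets = translate(literal)
        if len(targets) != 1:        # only single-equivalent literals are kept
            return "ignored"
        recovered = set(back_translate(targets[0]))
        if recovered == synset:
            return "high"            # the whole synset was recovered exactly
        if recovered < synset:
            return "medium"          # only some of the synonyms were recovered
        if recovered > synset:
            return "low"             # all synonyms plus some extra ones
        return "ignored"             # neither set is a subset of the other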

2.1.3 WordNet Builder

A Thai project called WordNet Builder [17] has designed a WordNet-building system with a clearly defined architecture. Its two main components are a Machine-Readable Dictionary Extractor (there may be more instances of it if there are more dictionary sources) and a WordNet Constructor, which communicate with each other through a Link Analyzer. The authors distinguished three kinds of links – translation links (between English and Thai words), semantic links (between English words and their meaning) and candidate links (between Thai words and their meaning). For each one of a defined set of 13 criteria divided into three groups (monosemic, polysemic and structural), a sample of translation links was manually verified and a statistical classification model was constructed which was later used to classify and validate the remaining ones. By the clear distinction of the possible word-to-word-to-synset mapping situations (as expressed by the criteria) and the graphical representation of each of them in the form of a diagram, this work is very instructive and may be used as a reference for other works, including ours. Predictably, the best criterion, which had 92 % correctness, was the "monosemic one-to-one" criterion that links words that are monosemous in PWN and that have a single translation equivalent in Thai. The worst criterion, with 49.25 % correctness, was the "polysemic many-to-many". With one exception, the structural criteria, some of which make use of PWN's ILR, have also given good results, approximately between 80 and 90 % correctness.

2.1.4 Miscellaneous

Some other works concerned with the translation of PWN using a bilingual dictionary have contributed particularly interesting ideas to the discussion. For instance, a Japanese attempt [18] has employed a Japanese-English dictionary in alignment of parallel corpora (or even merely weakly comparable corpora) and calculated synsets using a correlation matrix of Japanese translations of an English word and its associated words derived from the synset's gloss. The team whose goal in 2001 was to enrich the Spanish WordNet with new adjectives [19] accepted a Spanish word to correspond with an English synset if it appeared among the Spanish translation equivalents of at least two distinct members of the synset. This corresponds to the "intersection" criterion of the WordNet Builder. Similarly to the Thai approach, an attempt at automatic construction of Korean WordNet [20] has devised and tested a set of six heuristics, taking into account even the supposition that if two Korean words have an IS-A relation (i.e. hypernymy/hyponymy) as evidenced by Korean dictionary definitions, their English equivalents in PWN should have it as well. In a paper dealing with automatic construction of Persian WordNet [21], results were spoiled by the insufficient quality of the available dictionary data – the Persian-English dictionary lacked part of speech information, which caused English translations such as that of a Persian noun meaning "a tag" to be mistaken for synsets of another part of speech, such as the homonymous verbal synset glossed "attach a tag or label to", i.e. meaning "to tag".

2.2 Designing an Algorithm

The above research into the topic of WordNet sense mapping with the help of bilingual dictionaries has shown that there exists a great variety of possible approaches to the task. Moreover, since many of them tend to recur over the years, it seems that there is currently no known approach that would clearly overcome the rest. As lexicalization, parts of speech and concept sets vary among cultures and languages, it may even be the case that no universal solution for the problem exists. Last but not least, the quantity and quality of linguistic resources at the disposal of the researcher also play a role, both in the choice of method and in the parameters of the result.

From our perspective, the work that has been done for Thai [17] seems to be the most appealing one, considering that it is also our goal to generate new synsets using a machine-readable dictionary, which means that in this thesis we will not deal with Wikipedia or Wiktionary, although research has shown that it is possible to derive good results from them. Because of the limited scope of this thesis, we will also not go into as much detail, but will focus only on two of the thirteen criteria that were applied in WordNet Builder. This decision is only partially influenced by the public unavailability of the WordNet Builder software; our technical circumstances are also different – due to the data format used to represent our WordNets, which differs from the one in which PWN is commonly shipped (see VisDic and DEBVisDic in Section 2.3 below for more information), and also because we are working with a dictionary resource that no other computer linguist has ever worked with before – we need to spend a considerable amount of time on parsing the acquired dictionary data before we are at all able to extract translation equivalents from it and feed them to a kind of WordNet constructor. Eventually, unlike the Thai team, we need to take into consideration the possibility that some synsets may already exist and only need completion, because we are not building a new database – rather extending an existing one. The more modest MLSN project [16] is interesting due to the simplicity of its algorithm and the good-quality results that it provides. We have decided to derive our own algorithm from it, but we modified it and expressed it in terms of the criteria introduced by WordNet Builder. We have decided to consider not only those monosemous English literals that have a single Czech translation (WordNet Builder's criterion 1: "monosemic one-to-one"), but also those for which the dictionary provides several translation equivalents (WordNet Builder's criterion 3: "monosemic many-to-one"). On the other hand, since we do not have a machine-readable Czech-English dictionary of similar size, we do not translate the found equivalents back to English to check the results with the original synset as MLSN does. Instead, we profit from the special properties of VAC (our dictionary), which has been designed not as a passive (i.e. typically translational) dictionary, but as an active one. As its author writes in the foreword, "it aims to show clearly the whole range of English meanings and unambiguously define, as far as possible, every single Czech equivalent." One of the consequences of this decision is that when an English word has more senses, these have been carefully distinguished by Arabic numerals. Sometimes there are even three levels of subdivision (subsenses) for a headword. Moreover, our research has shown the average polysemy of English literals in VAC to be the same as or even higher than that reported in the statistics of PWN 3.0 [22]. On average, there are:

• 1.51 senses per noun in VAC as compared to only 1.24 in PWN 3.0;

• 2.22 senses per verb as compared to 2.17;

• 1.36 senses per adjective as compared to 1.40 (adjectives are the only part of speech that is more polysemous in PWN than in VAC);

• 1.33 senses per adverb as compared to 1.25.

This observation makes us believe that the sense distinctions made in VAC are at least as fine-grained as in PWN (which itself is often criticized for being too fine-grained even for humans) and that therefore, if we restrict our attention only to literals that are monosemous in both PWN and VAC, we can be confident enough to assume that the same sense has been found in both sources. Taking inspiration from MLSN and WordNet Builder, and having seen that VAC should provide enough sense distinction that it can be considered a source of unambiguous translations, we conclude this discussion by suggesting an algorithm that we later implement:

1. Transform VAC dictionary data from presentational markup to descriptive markup.
2. Extract translation pairs from VAC dictionary data.
3. Keep only those pairs where English literals are monosemous.
4. Optional: Keep only those pairs with unique source literals – i.e. one-to-one translations.
5. Match VAC English literals with monosemous PWN literals.
6. See which matched synsets are present in CzWN and which are not.
7. Recalculate and transfer ILR to new CzWN synsets.
8. Merge new synsets and new literals with CzWN.

The implementation of steps 1 and 2 is described in Chapter 3, “Dictionary Data Parser”. The implementation of steps 3 to 8 is described in Chapter 4, “WordNet Synset Generator”.


We may observe that with the progress of the algorithm, the number of retained translation pairs decreases, so it works as a kind of filter. Eventually, only those translation pairs that have survived all the filtering, and are therefore thought to be of high confidence, get merged with CzWN in the last step.
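To make this filtering character explicit, the following Python sketch shows how steps 3 to 6 could look. It is an illustration only, with simplified stand-ins for the real data structures: pairs is a list of (English, Czech) tuples, vac_senses and pwn_senses map an English literal to its number of senses, ili_of maps a monosemous PWN literal to the ILI of its synset, and czwn_ilis is the set of ILI identifiers already present in CzWN.

    from collections import Counter

    # Illustrative sketch of steps 3-6; the data structures are simplified
    # stand-ins, not the actual ones used by our scripts.
    def filter_pairs(pairs, vac_senses, pwn_senses, ili_of, czwn_ilis,
                     one_to_one=False):
        # step 3: keep pairs whose English literal is monosemous in VAC
        pairs = [(e, c) for (e, c) in pairs if vac_senses.get(e) == 1]
        if one_to_one:
            # step 4 (optional): keep only unique source literals
            counts = Counter(e for e, _ in pairs)
            pairs = [(e, c) for (e, c) in pairs if counts[e] == 1]
        # step 5: match against literals that are monosemous in PWN too
        pairs = [(e, c) for (e, c) in pairs if pwn_senses.get(e) == 1]
        # step 6: split the matches by presence of the synset in CzWN
        existing, new = [], []
        for e, c in pairs:
            ili = ili_of[e]
            (existing if ili in czwn_ilis else new).append((ili, e, c))
        return existing, new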

2.3 Used Tools

During the implementation, we have used many different programming and data processing tools. This is because each part of the task had to be or was best approached with its own specific set of tools. This has included the choice of:

• programming paradigm (procedural, object-oriented, declarative, data-driven),

• program execution (interpreted, bytecode, compiled),

• data format (plaintext, CSV, XML, TeX),

• computing platform (GNU/Linux, Microsoft Windows, XUL)

• and character encoding (UTF-8, ISO 8859-2).

A list of the used programming and data processing tools follows. It may serve as a list of requirements for running the created software, but it is also a good testimony to the variety of programming techniques that have been applied in the creation of this thesis. While we discuss the tools and mention their pros and cons, we will also give examples of algorithmization tasks which we have solved with them. This chapter is therefore, at the same time, an introduction to the two following ones, Chapter 3, "Dictionary Data Parser" and Chapter 4, "WordNet Synset Generator", which will concentrate on details of the devised algorithm and will not pay much more attention to implementation details. Non-programmers who are not interested in the implementation may skip the rest of this chapter and resume reading at the following one.

2.3.1 GNU Tools and Utilities

The GNU Project (http://www.gnu.org/) has generated much open-source software with the aim to eventually form a completely free operating system. Many of the GNU packages are at present a standard part of GNU/Linux software distributions, including Ubuntu, which was our primary development environment. We have widely used the functionality provided by GNU tools and utilities, and the skeleton of the software – the scripts that control the order in which data processing is being performed – has been programmed in bash, GNU's default shell. The main script make.sh invokes two other shell scripts, each of which forms the body of one of our software's two components; it provides them with input and collects the output they produce. By moving file names out of the shell scripts into special filelist files, we have avoided unpleasant typos during the programming, made the scripts easier to maintain and, through sharing these files with a clean.sh script, we can also provide a neat way of removing all files that have been produced by the software but are not necessary for it to run again from the beginning. Options given as command-line arguments to the main script are passed on to the scripts it invokes. The created scripts benefit heavily from pipelines, and each action reported to the user is actually a long sequence of commands that take text input from one file, process it as necessary while it goes through the pipe, and output it to another file. In many cases, we could have created even longer pipelines, but we interrupt the processing and save the result into a file any time it may be of interest to the user (who may want to hunt for a piece of information that normally gets lost in the process) or when it may be useful later in the program. The activity of each pipeline is documented, both by a message that reports it to the user and by comments in the source code that clarify the meaning of each step and describe the interfaces used by pipelines to interchange data (the input and output file formats – usually CSV with explicitly defined column semantics, or XML). To process text data, we make large use of programs from the textutils package (today part of coreutils) such as cat, comm, cut, join, sort, tr, uniq and wc. For tasks that required more programming capacity, we used the GNU ports of the data-driven scripting language awk, the command-line text search utility grep and the text parsing and transformation utility sed. As arguments to them, we have used a great many regular expressions, often accurately crafted to respect the syntax of the processed data and avoid the pitfalls of escape sequences; in one particular case we even had to find a workaround for a parsing problem that could not be expressed in terms of regular languages and would require a parser capable of context-free pattern matching. Regular expressions were used also within other tools, such as Python and XSLT. Locale-influenced collation has posed a problem at some points, for instance by spoiling a search for identical consecutive items (to merge them into one) through not making a difference between lowercase and uppercase characters. Therefore, we often ignored the user locale (UTF-8 in our case) and performed collation operations in the special C or POSIX locale, where characters are treated exclusively by their character code, although this could have had an unpleasant influence on our data, as it ignores the fact that the data is encoded in UTF-8. The utility that we have used for text parsing the most is definitely awk – it has helped us perform all kinds of tasks, from the very easy ones (such as exchanging the content of two CSV columns, for which cut is of no use) up to complex merging operations involving several files at a time. We particularly appreciated awk's ease at loading even a huge file into an associative array and then giving fast replies to queries on its random indices, which has proved to be much faster than calling grep on that file over and over again.
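The associative-array lookup pattern just mentioned is language-independent; the following Python fragment (not part of the delivered scripts, which use awk here) shows the same idea: load a key-value file into a hash table once, then answer each query in constant time instead of re-scanning the file with grep. The file name and column layout are made up for the illustration.

    # Same idea as awk's associative arrays: one pass to build the table,
    # then O(1) lookups. File name and format are hypothetical.
    table = {}
    with open("literals.csv", encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split(",", 1)
            table[key] = value

    print(table.get("shout", "<not found>"))   # instant, no file re-scan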

2.3.2 XSLT

The Extensible Stylesheet Language Transformations (XSLT; http://www.w3.org/TR/xslt), a declarative, XML-based language used for transformation of XML documents, has been used particularly in the implementation of the first component of our software, the Dictionary Data Parser. VAC data comes already in the form of a text file with XML markup, but our observations showed that it was hardly possible to automatically extract translation pairs from the data without first transforming it into a markup that was more feasible for content extraction and less for presentation, which had been the main purpose of the original markup. Therefore, we have created an XSLT stylesheet (/code/vac/transform.xsl on the CD-ROM; 21.7 KB) which, when applied to the received dictionary data, parses it, removes redundant or unnecessary information related to the presentation (we must not forget that the XML data comes from a typesetting program) and tries to induce the missing information on meaning and structure. This information is sometimes lost in the flow of text, whose implicit structure is easily recognized by the eyes and brain of a human reader of a printed dictionary, but is not so self-evident to the computer. In many cases we could eventually reconstruct the logical structure of an entry from the way it is presented, but some problems, such as the parsing of alternatives and parentheses (see Section 3.3.1 for a discussion of this problem), seem not to always have an unambiguous solution. Still, it is important that we have made an attempt at understanding the presentational markup of the received data, because not doing so would leave us with only a very superficial insight into the relations that hold among the elements of a dictionary entry. Because of this deeper look into the data, which has enabled us to extract translation pairs from it with more confidence and in higher amounts than a straightforward text-flow parsing would be able to achieve, the result of the XSLT transformation is a robust descriptive markup representation of the original data, and as such it may easily be used also by other applications that require linguistic resources of the kind of English-Czech translation pairs. This is for instance the case of PRESEMT [23], a running machine translation project, which will make use of the English-Czech translation pairs extracted from VAC – a pleasant side effect of this thesis.

2.3.3 libxslt

libxslt (http://xmlsoft.org/XSLT/) is an open-source XSLT 1.0 processor that forms part of GNOME, itself part of the GNU Project. It is written in C, based on the XML parsing library libxml2, and may be embedded into an application or used as xsltproc from the command line. We use libxslt for some of the less complicated XML processing tasks that would still require too much effort if they were to be performed by a textutils utility – by means of keys in XSLT, for instance, it is much easier (and more reliable) to obtain polysemy information from a WordNet represented in XML. Another perceived advantage of libxslt is that it is a relatively fast and low-resource processor. This was the main reason we did not completely abandon it in favor of Saxon-HE.
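For completeness, libxslt is reachable not only through xsltproc but also from Python via the lxml bindings, which are built on libxml2/libxslt. The following fragment is only an illustration of that route, not what our scripts do; the file names are placeholders.

    # Applying an XSLT 1.0 stylesheet through libxslt's lxml bindings;
    # file names are placeholders, our scripts call xsltproc instead.
    import lxml.etree as etree

    transform = etree.XSLT(etree.parse("polysemy.xsl"))  # compile once
    result = transform(etree.parse("wordnet.xml"))       # apply to a document
    print(str(result))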

2.3.4 Saxon-HE

Saxon XSLT is an XSLT processor created by Michael Kay, the editor of the W3C XSLT 2.0 specification. There are open-source as well as closed-source commercial versions of the processor, which has been written in Java. Saxon-HE, which we have used, is its open-source "Home Edition". An important asset of the Saxon processor is its implementation of and strict conformance to the latest standards, including XSLT 2.0 and XPath 2.0. We use Saxon-HE for the processing of our largest XSLT stylesheet, the one that performs the initial transformation of dictionary data to a form from which translation pairs may be extracted. The use of XSLT 2.0 (and therefore the need for the Saxon processor, as XSLT 2.0 is not supported by libxslt) is justified by the use of new XSLT 2.0 elements, by running template matching and XPath 2.0 queries on node sequences passed in variables (the equivalent of the result tree fragment in XSLT 1.0), and by performing even somewhat advanced string operations. All of this could perhaps have been implemented also in XSLT 1.0, but the workarounds that would have been necessary to accomplish it would have seriously obfuscated the code and made it run slower. Saxon processing also seems slower compared to libxslt, but this may be due to the fact that Saxon requires the Java virtual machine to run. An unplanned consequence of using Saxon-HE for transforming the dictionary data has been the need to write a wrapper program in Java that takes care of the processing, since the command-line interface to Saxon could not be used due to its efficiency constraints, caused by the need to start and stop the Java virtual machine every time a transformation of a chunk of XML code was needed. Such batch processing (one dictionary entry at a time) was necessary, because loading the whole dictionary data and trying to construct a DOM tree from it in memory at once would not be feasible – the VAC source that we have received is an 18 MB file.

2.3.5 SAX

Moving on from structure transformation to the extraction of information from the dictionary data, the declarative paradigm manifested by XSLT and the DOM (Document Object Model) processing model cease to be advantageous. Instead, we need to be able to grasp the document as a flow of data (rather than a hierarchy) so that we are able to easily look ahead or behind to the neighboring elements when this becomes necessary, using the perspective of the reader of a printed publication, i.e. disregarding any descriptive hierarchy imposed on the data at the moment. Such peeks at the following or preceding elements in document order (which is closely related to the order in which the textual content of the elements eventually gets rendered in the paper dictionary) are very useful for determining the current context once an element of a certain type has been found. Yet, there is nothing like a document-order axis in XPath, because XPath and XSLT work in terms of element hierarchies. This is where a niche is found for SAX, the Simple API for XML (http://www.saxproject.org/). SAX is not an actual product, but rather the definition of an interface that has been implemented in many different programming languages. It provides an alternative mechanism to DOM, because it does not regard the document tree as a whole but operates on each node sequentially. A major benefit of SAX in contrast with DOM is that it consumes much less memory and therefore also runs faster. Yet, in our work it is not these properties that we profit from the most – for our needs, the sequential approach is the deciding factor. This enables us to always remember information on the current context while we are traveling through the document and to use it whenever necessary. Unlike in a procedural programming language implementing the SAX API, context information is very difficult to preserve in a declarative language such as XSLT – in a limited scope it may be done (and we are doing it) by passing copies of document fragments as parameters to templates, but the main obstacle is that there is no way to keep applying templates and providing each of them with the same context information until one of the templates decides to change it (after which such a change should be reflected in the context passed to all elements encountered thereafter).
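A minimal sketch (in Python, whose SAX implementation we use, as described in the next section) shows how naturally context is preserved with SAX: the handler simply stores what it has seen and consults it later. The element names gram and equivalent are hypothetical here.

    import xml.sax

    # Minimal SAX context tracking; element names are hypothetical.
    class ContextHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.stack = []      # currently open elements
            self.pos = None      # last part-of-speech mark seen

        def startElement(self, name, attrs):
            self.stack.append(name)

        def endElement(self, name):
            self.stack.pop()

        def characters(self, content):
            if self.stack and self.stack[-1] == "gram":
                self.pos = content.strip()          # remember the context
            elif self.stack and self.stack[-1] == "equivalent":
                print(self.pos, content.strip())    # use it later on

    xml.sax.parse("transform.xml", ContextHandler())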

2.3.6 Python

Python (http://www.python.org/) is the language of choice for many programmers in Natural Language Processing, particularly for the wide range of functions it provides, including solid support for regular expressions and internal use of Unicode for string values. It is an interpreted, general-purpose high-level programming language with an original philosophy and a large standard library, available on many platforms and used also for web development. In our work, we have used Python for the extraction of translation pairs from the preprocessed dictionary data, in connection with a SAX parser that is instantly available from Python's xml.sax package. The extraction of translation pairs includes a number of text processing tasks (character substitution, expansion of abbreviations, resolution of logic constructs) for which Python's functions for regular expression matching and data structures such as lists, tuples and dictionaries have been used.
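As a small taste of these text processing tasks, the following sketch expands the headword referrals used in VAC ("~" and a one-letter abbreviation such as "s." for "shout", see Section 3.1) back into the full headword. This is a toy version only; the actual rules in extract.py are more involved.

    import re

    # Toy version of headword referral expansion; the real rules in
    # extract.py handle more conventions than shown here.
    def expand_referrals(text, headword):
        text = text.replace("~", headword)
        pattern = r"\b%s\." % re.escape(headword[0])   # e.g. "s." for "shout"
        return re.sub(pattern, headword, text)

    print(expand_referrals("give a s.", "shout"))       # give a shout
    print(expand_referrals("~s of laughter", "shout"))  # shouts of laughter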

2.3.7 VisDic and DEBVisDic

When EuroWordNet started, Polaris, a commercial tool (with a user interface called Periscope), was developed to simplify the creating, editing and exporting of WordNet databases. After the project terminated in 1999, the program was not developed further and was later substituted by VisDic, a novel tool created at Masaryk University. VisDic [24], which was developed for Linux and ported also to Microsoft Windows, was used as early as in the BalkaNet project [4]. It uses XML as the format for representation of WordNet data. This is an important step forward from the data format still used by the creators of PWN. They keep on using a set of CSV-like files, edited manually, compiled by a dedicated tool from the so-called lexicographer files and accessed by a dedicated search application. This approach lacks all the advantages of VisDic's XML solution, which uses a single plaintext file (with markup in it), is more portable, and whose data is more easily interchangeable and can be parsed by many already existing parsers (an XML parser can even be instructed to automatically validate against a DTD or a schema); yet another difference is that VisDic uses other synset identifiers – unlike PWN, which uses a sense key composed of many different values as an identifier, VisDic has adopted the ILI introduced by EuroWordNet as a means of synset identification. DEBVisDic [25] is the successor of VisDic. It is also a project of Masaryk University, built on the platform for client-server XML databases called DEB II. A data format for WordNet data that is very similar to that of VisDic is used (an XSLT stylesheet can do the conversion in either direction without any information loss) and, by virtue of the DEB II platform, the application may be run in any environment for which an implementation of XUL (XML User Interface Language, developed by the Mozilla Foundation) is available. DEBVisDic is a WordNet browsing and editing tool that supports all the functions of VisDic and adds to them its own new features of a general on-line dictionary platform. Users can freely acquire a copy of DEBVisDic from Masaryk University and install their own DEB server. Because of a recent update of Mozilla's browser, the DEB client was not yet supported by Mozilla Firefox 4.0 at the time of writing. This forced us to use a parallel installation of Mozilla Firefox 3.6 as a temporary solution to the problem, as recommended by the developers. The older VisDic, being a stand-alone offline application, is still preferred by the developers of CzWN at the moment, so the latest CzWN data has come from it, and this had to be reflected in our work.

2.3.8 uconv

A character encoding converter such as uconv, available in the Ubuntu GNU/Linux distribution, needs to be used to perform conversions of data files between Unicode and national character encodings. Due to legacy reasons, ISO 8859-2 (Latin 2) is still being used as the encoding of CzWN files in VisDic – this means that we have to perform a conversion to Unicode before processing CzWN data in our software, and the complementary conversion when producing output (extended CzWN data) for use in VisDic. If there happen to be characters in the newly added Czech literals that do not have an equivalent in ISO 8859-2, they get escaped into the %Uxxxx form.
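The %Uxxxx fallback is easy to emulate, which the following Python sketch demonstrates with a custom codec error handler; our pipeline relies on uconv itself for the conversion, so this is illustrative only.

    import codecs

    # Reproduce the %Uxxxx escape for characters missing from ISO 8859-2;
    # the actual pipeline uses uconv, this sketch only mimics its fallback.
    def percent_u(error):
        bad = error.object[error.start:error.end]
        return "".join("%%U%04X" % ord(ch) for ch in bad), error.end

    codecs.register_error("percentu", percent_u)

    # the euro sign has no ISO 8859-2 equivalent and gets escaped:
    print("cena v €".encode("iso-8859-2", errors="percentu"))
    # b'cena v %U20AC'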

Chapter 3 Dictionary Data Parser

3.1 Structure of a Dictionary Entry

A typical entry in the paper version of VAC [11] is shown in Example 3.1.1.

shout [šaut] n 1 [cry] výkřik, zavolání, zakřičení; [pl] ~s výkřiky, křik, pokřik, pokřikování, povykování; ~s of laughter salvy smíchu; give a s. vykřiknout, zakřičet; give me a s. {if you need me} zavolej mě; ~s of joy radostné výkřiky, plesání; a triumphant s. jásavý výkřik 2 esp Br inf runda; it's my s. teď platím já • v ◦ vt s. an order zařvat or vyštěknout rozkaz; s. oneself hoarse ochraptět  s. sth from the rooftops vytrubovat or roztrubovat něco do světa, vybubnovat něco; it's nothing to s. about tím bych se moc nechlubil ◦ vi [be noisy] křičet, vykřikovat, (loudly) rámusit, povykovat, halekat, hartusit; s. at sb zavolat na koho, (loudly) zakřičet or zahulákat na koho, (repeatedly) pokřikovat, křičet; s. in chorus křičet sborem; s. like a fishwife křičet jako trhovkyně or domovnice; start ~ing at sb rozkřičet se na koho; s. at the top of one's voice křičet zplna hrdla, křičet z plných plic; s. for help volat o pomoc; s. for joy jásat radostí 2 shout down vt (person) ukřičet, umlčet křikem, přehlušit, překřičet 2 shout out vi(t) křiknout, vykřiknout (co), zakřičet (co), (very loudly) zařvat (co)

Example 3.1.1: The entry for "shout" in the paper version of VAC

And this is how the same entry appears in the computer data that we have received:

shout [šaut] n [cry] výkřik, zavolání, zakřičení; [pl] s výkřiky, křik, pokřik, pokřikování, povykování; s of laughter salvy smíchu; give a s. vykřiknout, zakřičet; give me a s. {if you need me} zavolej mě; s of joy radostné výkřiky, plesání; a triumphant s. jásavý výkřik esp Br inf runda; it's my s. teď platím já v vt s. an order zařvat or vyštěknout rozkaz; s. oneself hoarse ochraptět s. sth from the rooftops vytrubovat or roztrubovat něco do světa, vybubnovat něco; it's nothing to s. about tím bych se moc nechlubil vi [be noisy] křičet, vykřikovat, (loudly) rámusit, povykovat, halekat, hartusit; s. at sb zavolat na koho, (loudly) zakřičet or zahulákat na koho, (repeatedly) pokřikovat, křičet; s. in chorus křičet sborem; s. like a fishwife křičet jako trhovkyně or domovnice; start ing at sb rozkřičet se na koho; s. at the top of one's voice křičet zplna hrdla, křičet z plných plic; s. for help volat o pomoc; s. for joy jásat radostí

shout down

vt (person) ukřičet, umlčet křikem, přehlušit, překřičet

shout out

vi(t) křiknout, vykřiknout (co), zakřičet (co), (very loudly) zařvat (co)

There are in total 54,046 such entries in VAC (measured on the received electronic data). A detailed description of the dictionary's microstructure (the structure of an entry) by its author in both Czech and English may be found in the paper version of the dictionary (section "Pokyny pro uživatele" on pages 11–16 or section "Using this dictionary" on pages 17–22). A list of used abbreviations is also provided in the book, on pages 23–34. We refer the reader to this description and concentrate in this chapter only on explaining the semantics of the XML markup used in the received data. After some time spent on training by comparing the data with the print dictionary, one may guess the meaning of each element that appears in the received XML file. Most of the markup has obviously been designed to serve presentational purposes, but fortunately in many cases an element may be thought of as marking up structure too. This is due to the fact that the dictionary has been typeset using a great variety of font styles and punctuation, which makes ambiguous markup a rare phenomenon. We have listed all elements that appear in the received XML data in Table 3.1, which appears at the end of this chapter. The first column of that table gives information on the number of occurrences across the whole file, thus providing an idea about the size of the dictionary and the frequency in which each kind of information appears, which is interesting both from the absolute and the relative perspectives. There is also some amount of syntactic information that is not represented in the form of elements and must therefore be obtained by parsing the document's text nodes. This is the case of headword referrals ("~" or a one-letter abbreviation of the headword), stylistic labels for Czech translations (*, **, ***), the approximate translation indicator (≈), wrappers for Czech phrases that contain a comma (<>) and, in particular, semantically different but syntactically interchangeable alternatives (/) and parts of expressions that may be left out (parentheses). Some of these may be treated easily, others present a rich source of ambiguity. Moreover, even the elements have not always been used in an unambiguous and predictable way. Eventually, the treatment of syntactic exceptions has constituted a significant part of the work spent on transforming the dictionary file into a more descriptive markup.


3.2 Transforming Dictionary Data

Our goal in the process of dictionary data transformation was to produce an XML file from which translation pairs could be extracted in an easier way. One of the reasons for the existence of this step is that there is much markup in the original data that is not interesting for achieving our goal (we may ignore information about pronunciation, for instance). By ignoring such elements (either all the time or only in certain contexts), we focus on the information that may be useful for our future work. We define this to be the headwords and alternative headwords of dictionary entries, the grammatical marks (which include information on part of speech), the hierarchy of senses (to recognize polysemy), the headwords of phrasal verbs, and all translation equivalents of headwords, examples and idioms.

In the XSLT script /code/vac/transform.xsl, we process a dictionary entry by gradually breaking it up into parts, called sections. The main template matches the H element and searches it for a first head, a first body and an unlimited number of next sections. It is not known beforehand which section corresponds to which element, since for instance the first body can be formed by the usual N element (which consists of one or more BE, CE or DE elements, each of which presents one sense of the headword), but in case of a monosemous headword there is no need for such a wrapper and a translation immediately follows the head. In some cases, the translation was even included right in the entry head. Such a variety of representation of the same information has forced us to develop rather complicated matching expressions – the script could have been significantly simpler if we had ignored such peculiarities, but that would also mean losing some of the data, in some cases a significant amount of it. For each case in which a syntactic exception or error was treated, a note has been made in the form of a comment next to the matching expression in the stylesheet, referring to the headword for which the extension of that matching rule was originally crafted. Each of the child sections of a headword is subsequently treated by a dedicated template: In the first head we search for all headwords, their variants or alternatives, for grammatical marks (with particular interest in those that convey information about part of speech) and for links to other headwords. We disregard the rest of the markup that may appear in the first head since it is of no use. In the first body, we go down the sense hierarchy, if it exists, and look for all sections with meaning descriptions that we find on the way. For each meaning section, we note the sense number (along with the sense numbers of any possible supersections, in case of nested senses), all translation equivalents of the headword, all English examples and idioms along with their respective translations, and also all grammatical marks (because part of speech may occasionally change within a meaning section). Because of the limitations of XSLT's declarative paradigm, all of this information is registered in the output file in the same order in which it appears in the source document.
It is, for instance, not feasible to look ahead for a translation once an English idiom is found, because it may indeed be in the following element, but it can also be the case that the following element is yet another idiom (synonymous with the present one) and the translation follows only after it, possibly divided from its English counterpart(s) by some grammatical or usage information element; and in rare cases of malformed syntax, the translation may not even be present within its counterpart's parent element, or it may not exist at all. This is why we have left this task of matching expressions with their translations to the second component of the Dictionary Data Parser, which runs a sequential scan of the elements in their document order and can process the elements in a previously determined context. Having treated the first head and the first body (which are special because the whole entry's headword is defined there), we go on parsing the rest of the entry's data, which is formed by sections which either treat the headword as another part of speech, or are concerned with a phrasal verb derived from it. We determine the exact type of the section (part of speech change or phrasal verb) by checking the presence of either the KPL or the CTV element and, after writing an appropriate new element on the output, we go on processing the senses in this section's body. Finally, once all sections have been treated, we have finished processing the headword. At that point, we have output XML code that contains all the interesting information about the dictionary entry that we have found during the traversal of its contents. For our running example of "shout", it looks like the following (exampleS and idiomS contain text in the source language, i.e. English; exampleT and idiomT in the target one, i.e. Czech):

shout n výkřik zavolání, ... [pl] ~s výkřiky křik, ... ~s of laughter salvy smíchu; ... esp Br inf runda; it's my s. teď platím já v vt ... s. sth from the rooftops vytrubovat | roztrubovat něco do světa vybubnovat něco; ... ...


3.3 Extracting Translation Pairs

After the source dictionary has been transformed and the result stored in the /data/vac/transform.xml file, we proceed to the extraction of translation pairs. This is done by a Python script, /code/vac/extract.py, whose VACHandler class implements the content-handler interface of Python's xml.sax package. With SAX, data is read in document order and an event is fired every time a start element, an end element or a text node is encountered. We define our goal as extracting pairs of English and Czech literals that are translations of each other, be it headword translations, usage examples or idioms. In the output, we note for each pair of literals the sense number of its English constituent (so that we can still distinguish the pairs that originated from polysemous headwords), the relevant part of speech (defined as the last part-of-speech mark encountered so far in the same entry during the reading of its contents) and the type of the pair ("q", "x" or "i", standing for equivalent, example and idiom). The result is therefore a list of five-tuples.

English and Czech literals that belong to each other are determined by maintaining a hierarchy of source-language (English) literals (i.e. headwords, phrasal verbs, derivatives of headwords introduced in senses, and finally each source-language literal that has appeared within a sense). Every time we reach a target-language (Czech) literal, we search back in that hierarchy for the last encountered English literal that has not yet expired (i.e. we are still nested within the element where it was defined, and we have not already matched it with another Czech literal that is not an immediate neighbor of the Czech literal in question). We declare the last encountered English literal that matches these criteria to be the one that the Czech literal is a translation of, and we write this pair to the output. If there are several English literals of the same priority (e.g. an entry consists of a main headword and its alternative spelling, or an example gives two synonymous English literals for a single Czech translation), we output a record for each combination.

After running the XSLT transformation and this Python script, we obtained 315,991 translation pairs, along with the information on their type, part of speech and the sense number of the English constituent. An interesting side effect of this work are the 163,776 Czech literals that were extracted from the dictionary in this process. Regardless of their English equivalents, we may view them as a word list comprising 80,862 unique Czech literals. If we feed this list to ajka [26], a morphological analyzer of Czech developed at Masaryk University, then, after removing typos and non-words, we receive a list of 6380 Czech words that are not recognized by the analyzer. They include words such as "audiokniha" ("audiobook"), "brusliště" ("skate park") and "cypřišek" ("white cedar"). We would be pleased if the current maintainer of ajka decided to improve the tool using suggestions from our word list.
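To make the event-driven style of this extraction concrete, the following is a minimal Python sketch of such a SAX handler. It only pairs each exampleT element with the most recent exampleS, ignoring the expiry rules and the other element types described above; PairHandler and the flat text buffering are our simplifications for illustration, not the real VACHandler:

import xml.sax

class PairHandler(xml.sax.ContentHandler):
    """Simplified stand-in for the real handler in extract.py."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # collected (English, Czech) tuples
        self.stack = []    # most recent source-language literals
        self.buffer = ""

    def startElement(self, name, attrs):
        self.buffer = ""   # start collecting the element's text

    def characters(self, content):
        self.buffer += content

    def endElement(self, name):
        text = self.buffer.strip()
        if name == "exampleS" and text:        # an English literal opens a context
            self.stack.append(text)
        elif name == "exampleT" and text and self.stack:
            # pair the Czech literal with the last encountered English one
            self.pairs.append((self.stack[-1], text))

handler = PairHandler()
xml.sax.parse("transform.xml", handler)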

3.3.1 Treatment of Alternatives and Parentheses

Special attention during the extraction of translation pairs from the dictionary has been paid to the treatment of alternatives and parentheses that appear within both English and Czech

literals. The dictionary does not provide us with any hints about the proper treatment of slashes (e.g. "bathing/ camping/ smoking prohibited", translated as "koupání/ táboření/ kouření zakázáno"), of OR's (e.g. "bash about OR around" in English or "každý rok OR každoročně" in Czech) and of parentheses (e.g. "eviction" is translated as "(soudní) výpověď" – literally "(court) dismissal"). It is not specified anywhere what the exact interpretation of these signs should be: how long a part of the whole literal gets alternated when a slash is applied; which part (or whether the whole) of the text to the left of an OR should be replaced by the text to the right of it; whether an expression that contains parentheses may be understood as shorthand for two alternatives, one that includes the word in parentheses and one that does not, or whether this expansion should not be performed because the word in parentheses only serves as an aid for the reader to better understand the translation.

For the purpose of our work, it is fortunately not overly important if we include English literals that result from an exaggerated treatment of parentheses or alternatives, because in most cases we only risk producing a literal that does not exist in PWN, or we spoil the monosemy of another literal so that it will seem polysemous and will therefore not be included in the results. (We are thankful to Jan Pomikálek for first suggesting the basic principle of this solution.) This means that we may provide the user with the possibility to keep alternatives and parentheses untouched, or even to ignore any translation pairs that contain either of them (for these purposes, /make.sh may be run with the command-line arguments nosplit and/or excludesplit), but it is still interesting to implement an algorithm that does its best at treating such cases.

It may easily be shown for each of these phenomena that there is no universal solution and that their use in the dictionary is in fact ambiguous and relies on the correct interpretation by a human who is familiar with the language in question. A great demonstration of the confusion that may be experienced by a computer program, or even by a human reader not capable of speaking the particular language, is the dictionary entry for "basis": in its sense no. 3, there is an English literal "on a daily/ monthly/ annual basis" whose translation is given as "každý den OR (deno)denně / každý měsíc / každý rok OR každoročně". We do not even need to explain the structure of each of these literals to the reader, because this example clearly demonstrates the ambiguity that reigns in this field. It is instructive at least in showing that priorities must be assigned to each of these three operators before any parsing of such expressions is attempted.

In spite of all the ambiguity, an algorithm has been devised which tries to cope with this challenge. It produces satisfactory results for the easy cases (which are, fortunately, also the most common ones), while being only partially successful in complicated cases similar to the one mentioned above. We are thankful to Jan Pomikálek, who formulated the first version of this algorithm, which we later adapted to the real data and to the fact that three different phenomena may occur at the same time. The algorithm spans about a half of the whole source code of /code/vac/extract.py.
At the same time it also treats the expansion of headwords that may be abbreviated in dictionary entries either as ~ or as the first letter of the headword followed by a period. The algorithm works approximately like this:


1. Calculate the number of parts into which the source and the target literals are divided if taking slashes as delimiters (though excluding slashes inside parentheses).

2. If the number of parts is the same for both literals, split the literals at the slashes and generate new translation pairs from each two parts that share the same position within their literal of origin (due to the similarity to the mathematical operation, we call this a dot product).

3. If no multiplication occurred above because of an unequal number of parts, split each literal at the slashes anyway, but instead create new translation pairs by combining each part of the source literal with each part of the target literal (we call this a cross product).

4. Perform the cross product operation also with OR's as delimiters.

5. For each literal, calculate all possible forms that it may take if any combination of including or excluding each of its parenthetical subexpressions is taken into account (we call this a power set) and perform a cross product on the resulting two sets.

While performing the power set operation on parenthetical subexpressions, take into account any slashes or OR's that are situated within the parentheses, by running this algorithm recursively on those subexpressions and taking all of its output as the possible values by which the parentheses contribute to the result of the power set operation.
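To make steps 1–4 concrete, here is a condensed Python sketch; step 5, the power-set treatment of parentheses, and the recursive descent into them are omitted, and all names are ours for illustration, not those of the real extract.py:

import itertools

def split_outside_parens(text, sep="/"):
    # split at separators that are not nested inside parentheses (step 1)
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts

def expand(source, target):
    s, t = split_outside_parens(source), split_outside_parens(target)
    if len(s) == len(t):
        pairs = list(zip(s, t))                # step 2: "dot product"
    else:
        pairs = list(itertools.product(s, t))  # step 3: "cross product"
    result = []
    for en, cz in pairs:                       # step 4: cross product over OR's
        result.extend(itertools.product(en.split(" OR "), cz.split(" OR ")))
    return result

print(expand("bathing/ camping/ smoking prohibited",
             "koupání/ táboření/ kouření zakázáno"))
# [('bathing', 'koupání'), ('camping', 'táboření'),
#  ('smoking prohibited', 'kouření zakázáno')]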


Count    Name      Semantics
1        root      the document element
54046    H         dictionary entry
81537    B         entry head / new part of speech
54046    Z         headword
55538    VYEN      English pronunciation
159252   S         small print (grammar and stylistic information)
76683    N         entry body / additional information inside entry head
41839    BE        first-level sense distinction
685      CE        second-level sense distinction
262      DE        third-level sense distinction
180172   NCZ       Czech equivalent (of a headword or a phrasal verb)
68904    PA        English example
87619    PC        Czech translation of an example
5160     FA        English idiom
7966     FC        Czech translation of an idiom
16522    PH        new block – text starts on a new line (phrasal verb, new part of speech)
42447    CHNS      undifferentiated sense indicator
29341    IEN       indicator of a partial synonym / synonymic explanation
829      PN        indicator of untranslated part of an English example
68       FN        indicator of untranslated part of an English idiom
6016     SCZ       small print / verbal clarification of a translation's context
5994     BEN       synonymous alternative expressions / other varieties of English
2488     V         spelling or morphological variations of the headword
9287     KPL       beginning of a new part of speech
4640     KPR       beginning of a transitive / an intransitive subentry for a verb
2920     CTV       beginning of a phrasal verb subentry
2701     HRE       beginning of a section of idioms
3047     P         head of a phrasal verb
1773     OT        referenced headword
51       OV        sense number of a reference (used with either OT or BEN)
1757     SUB       subscript (homonym distinction)
1587     VYCZ      irregular Czech pronunciation
583      SEN       stylistic labels for English words and phrases
91       PSOJKA    typesetter's private use markup
7        VP        variant of a phrasal verb
6        SUP       superscript (exponents)
4        mbox      (unrecognized)
2        UPP       all capitals
1        UNSKIP    (unrecognized)
1        SMIND     subscript in small print
1        smfrac    fraction
1        EL        (unrecognized)
1        ditto     the ditto mark

Table 3.1: Elements appearing in VAC XML data – a reading aid

Chapter 4 WordNet Synset Generator

4.1 Structure of WordNet Data

All WordNet data that we use is represented in the form of a VisDic XML file. For a description of VisDic, see Section 2.3.7. This is not the original form in which PWN data is distributed, but the experience of many previous researchers has shown that it is easier to understand WordNet and work with it as a single text file with markup that makes it easy to process with existing generic parsers, while simple operations may still be performed by directly editing the text file. VisDic's .def file defines the following hierarchical structure for a (Czech) WordNet synset represented in XML:

0 SYNSET
  1 ID
  1 POS
  1 SYNONYM
    2 LITERAL
      3 SENSE
      3 LNOTE
  1 ILR
    2 TYPE
  1 RILR
  1 BCS
  1 DEF
  1 USAGE
  1 SNOTE
  1 STAMP
  1 VALENCY
    2 FRAME
  1 NL

Out of these elements, ID, POS, SYNONYM, LITERAL, SENSE, ILR and TYPE are of particular interest to us, and the synsets that we produce consist of some or all of them. A SYNSET needs to have an ID, which is the value that represents it in the ILI. It also contains a POS element, although part-of-speech information is also included as part of the ILI indices. Most importantly, a SYNSET consists of one or more LITERALs that together form a SYNONYM. The SYNONYM is the set of language labels used to denote the particular meaning that is unique to the SYNSET. If one literal appears several times in a WordNet file, it means

that it is polysemous, and in such a case each occurrence of a LITERAL with it as its text value is distinguished by a different SENSE number. This is what a PWN synset in VisDic XML format looks like:

<SYNSET>
  <ID>ENG20-01681694-v</ID>
  <POS>v</POS>
  <SYNONYM>
    <LITERAL>choir<SENSE>1</SENSE></LITERAL>
    <LITERAL>chorus<SENSE>2</SENSE></LITERAL>
  </SYNONYM>
  <ILR><TYPE>hypernym</TYPE>ENG20-01680326-v</ILR>
  ...
</SYNSET>

4.2 Extending WordNet by Matching Translation Pairs

Once we have acquired PWN and CzWN data in the VisDic XML format and extracted translation pairs from the dictionary, we are ready to start the process of matching translation pairs to PWN literals and identifying synsets that may be transferred to CzWN. This work is realized by the /code/wn/make.sh script. We proceed as follows:

Filtering Dictionary Entries We read the extracted translation pairs from the file produced by the Dictionary Data Extractor. If the user has requested the results of our splitting algorithm to be excluded (by running the script with the excludesplit argument on the command line), we remove any such translation pairs from the list. For our future work, we use only translation equivalents of headwords and phrasal verbs (those with "q" noted as their type in the file), because examples and idioms constitute classes of expressions that are not covered in a lexical database like WordNet. We take only those translation pairs with a known part of speech (sometimes the part of speech is unknown because none is marked in the dictionary), we choose those belonging to the category of nouns, verbs, adjectives or adverbs, and we normalize their part-of-speech tags to "n", "v", "a" and "b", respectively. Verbs need special attention, because the dictionary distinguishes transitive and intransitive verbs, while WordNet does not. Eventually, we sort the list by the English literal as the primary criterion, the part of speech as the secondary and the sense number as the tertiary.
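A condensed sketch of this filtering and sorting step might look as follows; the five-tuple layout comes from Section 3.3, while POS_MAP and the source tags in it are assumptions for illustration:

POS_MAP = {"noun": "n", "verb": "v", "vt": "v", "vi": "v",
           "adjective": "a", "adverb": "b"}

def filter_pairs(pairs):
    """pairs: iterable of (english, czech, sense, pos, kind) five-tuples."""
    kept = []
    for english, czech, sense, pos, kind in pairs:
        if kind != "q":           # drop examples ("x") and idioms ("i")
            continue
        if pos not in POS_MAP:    # drop pairs with an unknown part of speech
            continue
        kept.append((english, czech, sense, POS_MAP[pos]))
    # primary key: English literal; secondary: part of speech; tertiary: sense
    kept.sort(key=lambda p: (p[0], p[3], p[2]))
    return kept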

Matching Translation Pairs to Princeton WordNet In our approach, we consider only literals that are monosemous both in PWN and in VAC. Monosemy in PWN is determined by the literal appearing only once in the whole database. Monosemy in VAC is determined by the fact that there is no other translation pair in the input list that contains the same English literal with another sense number. In order


to make this process more reliable, the user may launch the script with a command-line parameter called dummy, which makes the Dictionary Data Extractor produce a dummy entry (i.e. a placeholder) for each sense it encounters during the traversal of the transformed XML file, even if that sense has not produced any translation pair at all (such a sense would otherwise remain unknown to the WordNet Synset Generator). Even higher reliability (although at the expense of a smaller result set) may be achieved by using the command-line argument onetoone, which makes the program consider only those English literals that have a single translation equivalent in Czech.
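The monosemy test itself can be sketched as follows; pwn_occurrences, an assumed precomputed count of each literal's occurrences in PWN, stands in for the real lookup:

from collections import defaultdict

def monosemous_pairs(pairs, pwn_occurrences):
    senses = defaultdict(set)
    for english, czech, sense, pos in pairs:
        senses[(english, pos)].add(sense)     # collect VAC sense numbers
    for english, czech, sense, pos in pairs:
        # monosemous in VAC (one sense) and in PWN (one occurrence)
        if len(senses[(english, pos)]) == 1 and pwn_occurrences.get(english) == 1:
            yield english, czech, pos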

Identifying Missing Synsets and Literals Once matches have been established between English literals and PWN synsets, equivalents of these synsets are searched for in CzWN by means of ILI. There are two possible results for each such search: either a synset with that ILI index is not yet present in CzWN, or a Czech synset with that index has been found. In the first case, we have discovered a missing synset and we have at least one Czech literal that may become a part of it, so we may create a new synset in CzWN. In the other case, we compare the Czech literals that are already present in the found CzWN synset and check whether any literal coming from VAC is missing among them. If so, we add such a literal among them and call this a changed synset – unless the user has specified the onlynew flag on the command line, which prohibits us from modifying existing CzWN synsets. There are also the flags marknew and markchanged, which make the program prepend an "X " (a short word, unique in CzWN) to every literal that has been added to a new and/or changed synset, so that such literals are easy to spot for the user when inspecting the results.
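A minimal sketch of the decision between a missing (new) and a changed synset might look as follows; the dictionary-based data shapes are assumptions for illustration, not the real data model:

def place_literal(ili, literal, czwn):
    """czwn: maps an ILI index to a dict holding the synset's literals."""
    synset = czwn.get(ili)
    if synset is None:                      # missing synset: create it
        czwn[ili] = {"literals": [literal]}
        return "new"
    if literal not in synset["literals"]:   # changed synset: add the literal
        synset["literals"].append(literal)
        return "changed"
    return "unchanged"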

Transferring Internal Language Relations When we have created a new CzWN synset and know its relation to a particular PWN synset, we may make use of information contained within the PWN synset that is fairly language-independent. ILR (Internal Language Relations) are such a kind of information, at least as far as the hypernymy/hyponymy relation is concerned. Therefore, we proceed to copying ILR information from the PWN synsets to the corresponding newly created CzWN synsets. Note that in this step we do not modify existing CzWN synsets, even though we may have added some literals to them. The user may influence the transfer of ILR in several ways, again by specifying arguments on the command line: the onlyvalidilr argument tells the script to transfer only those ILR for which both partners in the relation exist in the destination WordNet. The onlyhypernym argument limits the transfer to the hypernymy/hyponymy relation, avoiding other relations such as morphological ones, whose automatic transfer from English to Czech may not be desired because of many inaccuracies. Finally, the nohypernymclimb argument prohibits the program from solving problematic transfers of hypernymy/hyponymy relations by linking the new Czech synset to a more distant ancestor on the hypernym axis when an equivalent of the English synset's immediate hypernym does not exist in CzWN.
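The climbing behavior controlled by nohypernymclimb can be illustrated by a minimal sketch; pwn_hypernym and the dictionary shapes are assumptions for illustration:

def czech_hypernym(ili, pwn_hypernym, czwn, climb=True):
    """pwn_hypernym: maps an ILI to the ILI of its hypernym in PWN."""
    parent = pwn_hypernym.get(ili)
    while parent is not None:
        if parent in czwn:      # nearest ancestor with a Czech equivalent
            return parent
        if not climb:           # with nohypernymclimb: stop after one step
            return None
        parent = pwn_hypernym.get(parent)
    return None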

26 4.2. EXTENDING WORDNET BY MATCHING TRANSLATION PAIRS

Generating Extended WordNet Data In the last phase of the process, we proceed to merging all the acquired information and generating a new, extended CzWN file. Before starting, we must make sure that such a merge is possible, i.e. we must assign fitting sense numbers to all newly added Czech literals that had already been present in CzWN in some other of their senses. Once the uniqueness of literal–sense combinations is guaranteed in this way, we may proceed to the merge: first, we output all the synsets that had already existed in CzWN before. While doing that, we pay attention to the synsets that we have changed, since we need to reflect these changes in the resulting file. After all old synsets have been output, along with the changes we have made in them, we proceed by outputting the newly created synsets, including any ILR information that has been transferred in the previous step. Eventually, we present the user with a new, extended WordNet file, and terminate.
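The assignment of fitting sense numbers mentioned above can be pictured by a small sketch; the data shapes are our assumptions for illustration:

def assign_sense(literal, used):
    """used: maps a Czech literal to the set of sense numbers already taken."""
    sense = 1
    while sense in used.setdefault(literal, set()):
        sense += 1
    used[literal].add(sense)
    return sense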

Chapter 5 Evaluation

5.1 Statistics and Observations

The latest version of Czech WordNet, with which we have worked, contains 47,542 literals organized into 28,478 synsets. The number of unique literals is 34,026, which corresponds to an average polysemy of 1.40 senses per literal. There are on average 1.67 literals in a synset. Out of the synsets, 21,018 are noun synsets, 5162 are verb synsets, 2129 are adjective synsets and 166 are adverb synsets.

The extended version of Czech WordNet, which is the result of our work, contains 83,769 literals (1.76 times more) organized into 40,621 synsets (1.43 times more). The number of unique literals is 63,775, which corresponds to an average polysemy of 1.31 senses per literal (monosemy has increased). There are on average 2.06 literals in a synset (synset size has grown due to augmentation). Out of the synsets, 27,658 are noun synsets (an increase by 6640, or 31.6 %), 5852 are verb synsets (an increase by 690, or 13.3 %), 5651 are adjective synsets (an increase by 3522, or 165.4 %) and 1457 are adverb synsets (an increase by 1291, or 777.7 %).

Princeton WordNet 2.0, with which we have worked, contains 203,145 literals organized into 115,424 synsets. The number of unique literals is 152,059, which corresponds to an average polysemy of 1.34 senses per literal. There are on average 1.32 literals in a synset. Out of the synsets, 79,689 are noun synsets, 13,508 are verb synsets, 18,563 are adjective synsets and 3664 are adverb synsets.

The Comprehensive English-Czech Dictionary contains 54,046 entries, from which we were able to generate 315,991 translation pairs. Out of these, 194,276 are headword or phrasal verb equivalents, 111,774 are examples and 9941 are idioms. Out of the headword or phrasal verb equivalents, 47,588 (24.5 %) are monosemous. Out of these, 21,700 have a single Czech translation ("one-to-one monosemy"). For more statistics about this dictionary, see Section 2.2.

A manual assessment of the generated synsets has shown that about two synsets in a hundred had to be reconsidered by a human and one needed a minor change (the result of incorrect splitting of a dictionary translation). This indicates a 97–98 % reliability rate of the results. The file on which this assessment was performed is the /REALEXAMPLES file, which may be found on the enclosed CD-ROM. In total, 7399 synsets were generated with such a high confidence.

Chapter 6 Conclusion

We have designed and implemented a set of tools capable of automatically generating and augmenting Czech WordNet synsets, based on an analysis of English synsets in Princeton WordNet and of English-Czech equivalents extracted from the Comprehensive English-Czech Dictionary – a bilingual dictionary that has been put at our disposal in a machine-readable form, even though the structure of its data was not very appropriate for our goals at first, so that we had to develop a dedicated data transformer and translation-pair extractor to cope with that situation.

We have shown that by searching Princeton WordNet for English literals from the bilingual dictionary, links may be established that make it possible to create new synsets in Czech WordNet, using the Czech translations of these literals to populate the synsets; the synsets may then be linked to their Princeton WordNet counterparts by means of the ILI (Inter-Lingual Index). Along with this linking, ILR (Internal Language Relations) may also be transferred, on demand, from the English synsets into the new Czech ones, including the possibility of automatically creating a hypernymy/hyponymy relation between a new Czech synset and its closest existing Czech hypernym, by tracing the hypernym axis in Princeton WordNet and looking for synsets that already have a Czech equivalent.

In spite of the fact that only a subset of the whole dictionary could be used, due to the constraint that only monosemous literals were taken into account, we have managed to almost double the size of Czech WordNet in the number of literals and to increase it by almost a half in the number of synsets. First observations suggest that the quality of the produced data is high enough for it to be passed to linguists who would verify all the new and modified synsets and perform corrections, after which this new, extended Czech WordNet could be put at the disposal of other researchers in the DEBVisDic platform.

6.1 Future Directions

Our research into the problem of translating WordNet synsets by using bilingual dictionaries, which we presented in Section 2.1, has revealed many possible directions for continuing this work. Although some novel and currently popular approaches, such as making use of Wikipedia's interwiki links to augment our results, could be followed, we believe that we have not yet fully exhausted the potential of the Comprehensive English-Czech Dictionary and that polysemy is the main challenge for our future work.

The quality of the results obtained by working only with monosemous literals from the

dictionary, as well as our own observations from Section 2.2 related to its fine-grainedness and the comparison of its sense numbers with those of Princeton WordNet, have shown that the dictionary is indeed a good source for determining the polysemy of English literals, with Czech equivalents provided for each sense. This could give us enough confidence to create Czech WordNet synsets for each of the senses of a polysemous literal found in the dictionary. The problem that remains, though, is how to link a sense from the dictionary to one in Princeton WordNet, if such a mapping is possible at all. Linguists have not yet discovered any universal way of describing the individual senses of polysemous words so that everybody would know exactly what others are talking about – which is why it seems that we are even farther away from the point when all lexicographers would agree on a single classification of every word's meanings. Taking into account our multilingual world and the never-ending development of languages, such a moment may never come. In the meantime, attempts at the automatic mapping of polysemous literals that achieve good results, such as the work of the Thai team [17], can be inspiration enough for us.

Last but not least, there are two other activities that are linked to this work and could benefit the usability of its results. VerbaLex [9], the database of Czech verb valency frames developed at Masaryk University, has been linked to the Czech WordNet. Therefore, before the results of this work are taken over by the maintainers of Czech WordNet, a decision should be made about the future relationship between VerbaLex and the verbs in Czech WordNet. It is, naturally, not practical to maintain two resources that share a part of their goals but each keep their own database. The other task worth performing is remapping Czech WordNet (and also VerbaLex) from the ILI of Princeton WordNet 2.0 to that of Princeton WordNet 3.0. It seems that there are not many important differences between these two versions of the database (at least as far as the numbers of synsets are concerned), but for the sake of the researchers who use these resources and therefore often need to consult outdated versions of Princeton WordNet to be able to perform their research, such a remapping is worth realizing. We had been considering performing it within the scope of our work, but we realized that until a decision is made about the relationship between Czech WordNet and VerbaLex, the time has not yet come for such a step. Therefore, we decided to postpone this activity.

Bibliography

[1] MILLER, George A. WordNet: A Lexical Database for English. In: Communications of the ACM. New York : ACM Press, 1995, Vol. 38, No. 11, pp. 39–41, ISSN 0001-0782. 1.3

[2] FELLBAUM, Christiane. WordNet: an electronic lexical database. Cambridge, Mass. : MIT Press, c1998, xxii, 423 pp., ISBN 9780262061971. 1.3, 2.1

[3] VOSSEN, Piek. EuroWordNet: a multilingual database for information retrieval. In: Proceedings of DELOS workshop on Cross-language Information Retrieval. Zürich, 1997, pp. 85–94. 1.4

[4] TUFIŞ, Dan and CRISTEA, Dan and STAMOU, Sofia. BalkaNet: Aims, methods, results and perspectives; a general overview. In: Romanian Journal of Information Science and Technology. Bucureşti : Romanian Academy, 2004, Vol. 7, No. 1–2, pp. 9–43, ISSN 1453-8245. 1.4, 2.3.7

[5] PALA, Karel and ŠEVEČEK, Pavel. The Czech WordNet, final report. Brno : Masarykova univerzita, 1999, 21 pp., technical report, . 1.4

[6] Slovník spisovné češtiny. Praha : Academia, 1986. 1.4

[7] Lingea Lexicon 2.0. Brno : Lingea s.r.o., 1998. 1.4

[8] ŠEVEČEK, Pavel. Výkladový slovník češtiny. Brno : Lingea s.r.o., not published. 1.4

[9] HLAVÁČKOVÁ, Dana. Databáze slovesných valenčních rámců VerbaLex. Ph.D. thesis, Brno : Masarykova univerzita, Filozofická fakulta, Ústav českého jazyka, 2008, 136 pp., . 1.4, 6.1

[10] ČAPEK, Tomáš. Systém pro částečné sémantické značkování volného textu. Master's thesis, Brno : Masarykova univerzita, Fakulta informatiky, 2006, 42 pp., . 1.4

[11] FRONEK, Josef. Velký anglicko-český slovník [Comprehensive English-Czech Dictionary]. Vydání 1., Praha : LEDA, 2006, 1734 pp., ISBN 80-7335-022-X. 1.5, 3.1

[12] Wikipedia: The Free Encyclopedia. Encyclopedia on-line, Wikimedia Foundation Inc., . 2.1

[13] CRISTEA, Dan and MIHĂILĂ, Cătălin and FORĂSCU, Corina and TRANDABĂŢ, Diana and HUSARCIUC, Maria and HAJA, Gabriela and POSTOLACHE, Oana. Mapping Princeton WordNet Synsets onto Romanian Wordnet Synsets. In: Romanian Journal of Information Science and Technology. Bucureşti : Romanian Academy, 2004, Vol. 7, No. 1–2, pp. 125–145, ISSN 1453-8245. 2.1

[14] NAVIGLI, Roberto and PONZETTO, Simone Paolo. BabelNet: Building a Very Large Multilingual Semantic Network. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010. 2010, pp. 216–225. 2.1.1

[15] Wiktionary: The Free Dictionary. Dictionary on-line, Wikimedia Foundation Inc., . 2.1.1, 2.1.2

[16] COOK, Darren. MLSN: A multi-lingual semantic network. In: 14th Annual Meeting of the Association for Natural Language Processing. Tokyo, 2008. 2.1.2, 2.2

[17] SATHAPORNRUNGKIJ, Patanakul and PLUEMPITIWIRIYAWEJ, Charnyote. Con- struction of Thai WordNet Lexical Database from Machine Readable Dictionaries. In: Conference Proceedings: the tenth Machine Translation Summit. Thailand, 2005, pp. 87–92. 2.1.3, 2.2, 6.1

[18] KAJI, Hiroyuki and WATANABE, Mariko. Automatic Construction of Japanese Word- Net. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06). Genoa, 2006. 2.1.4

[19] CHUGUR, Irina and PEÑAS, Anselmo and GONZALO, Julio and VERDEJO, Felisa. Monolingual and bilingual dictionary approaches to the enrichment of the Spanish WordNet with adjectives. In: Proceedings of WordNet and Other Lexical Resources Workshop. Pittsburgh : NAACL, 2001. 2.1.4

[20] LEE, Changki and LEE, Geunbae and JUNGYUN, Seo. Automatic WordNet mapping using word sense disambiguation. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Hong Kong, 2000. 2.1.4

[21] MONTAZERY, Mortaza and FAILI, Heshaam. Automatic Persian WordNet Construc- tion. In: Proceedings of the 23rd International Conference on Computational Linguis- tics: Posters. Beijing : Association for Computational Linguistics, pp. 846–850, 2010. 2.1.4

[22] wnstats - WordNet 3.0 database statistics [online]. In: WordNet 3.0 Reference Manual. Princeton University, accessed 2011-05-28, . 2.2

[23] PRESEMT (Pattern REcognition-based Statistically Enhanced MT). Athens : Institute for Language & Speech Processing [coordinator], running project, . 2.3.2

[24] HORÁK, Aleš and SMRŽ, Pavel. Visdic – WordNet browsing and editing tool. In: Proceedings of the Second International WordNet Conference – GWC 2004. Brno : Masaryk University, pp. 136–141, 2003. 2.3.7

[25] HORÁK, Aleš and PALA, Karel and RAMBOUSEK, Adam and POVOLNÝ, Martin. DEBVisDic - First Version of New Client-Server Wordnet Browsing and Editing Tool. In: Proceedings of the Third International WordNet Conference – GWC 2006. Brno : Masaryk University, pp. 325–328, 2005, ISBN 80-210-3915-9. 2.3.7

[26] SEDLÁČEK, Radek and SMRŽ, Pavel. A new Czech morphological analyser ajka. In: Proceedings of the TSD. Czech Republic, pp. 100–107, 2001. 3.3

Index

algorithm, 9
architecture, 7
babel synset, 6
BabelNet, 5
BalkaNet, 3, 14
bash, 10
Bulgarian language, 3
C (programming language), 12
Čapek, Tomáš, 3
CD-ROM, 39
Červenka, Rudolf, 4
character encoding, 10
Chinese language, 7
closed source, 13
Comprehensive English-Czech Dictionary, 1, 4, 9, 39
computing platform, 10
conclusion, 29
corpora, parallel, 7
counter-checking, 6
Creative Commons, 6
Czech language, 2, 3
data format, 10
DEB II, 15
DEBVisDic, 8, 14
dictionary entry, 16, 18
Dictionary of Literary Czech, 3
dictionary, active, 9
dictionary, bilingual, 5, 6
dictionary, passive, 9
disambiguation, 6
DOM, 13
dummy entry, 26
Dutch language, 3
elements, 23
English language, 2, 5
Estonian language, 3
EuroWordNet (EWN), 3, 14
examples (dictionary), 18
examples (synsets), 35
Explanatory Dictionary of Czech, 3
extraction – results, 20
fine-grainedness, 9, 30
French language, 3
Fronek, Josef, 4
German language, 3, 7
Global WordNet Association, 3
gloss, 2, 6, 7
GNU, 10, 12, 39
Greek language, 3
headwords, 18
hypernymy/hyponymy, 2, 6, 8, 26
idioms, 18
Inter-Lingual Index (ILI), 3, 5, 15, 24, 39
Internal Language Relations (ILR), 2, 6, 7, 9, 26
interwiki links, 5
ISO 8859-2, 15
Italian language, 3
Japanese language, 7
Java, 13
Kay, Michael, 12
Korean language, 8
LEDA spol. s r.o., 4, 39
libxml2, 12
libxslt, 12
Lingea Lexicon, 3
Linux, 10, 15
literal, 2, 9
machine translation, 5
markup, 9, 11
Masaryk University, 1, 3, 4, 14, 15, 40
meaning, 1, 2
meronymy/holonymy, 2
Miller, George A., 2
MLSN, 6, 8
monosemy, 2, 6–9, 18
Natural Language Processing, 14
Natural Language Processing Center, 1, 4
open source, 6, 10, 12
open-class words, 2
OR, 21
Pala, Karel, 1
parentheses, 21
parts of speech, 2
Persian language, 8
phrasal verbs, 18
polysemy, 2, 7–9, 12, 18, 25, 29
Pomikálek, Jan, 21
PRESEMT, 12
Princeton University, 2
program execution, 10
programming paradigm, 10
psycholinguistics, 2
Python, 14
regular expressions, 11, 14
Romanian language, 3
SAX, 13
Saxon-HE, 12
sense number, 2
Serbian language, 3
slash, 21
Slovak language, 3, 7
Sojka, Petr, 4, 23
statistics, 28
subsense, 9
synonymy, 2
synset, 1, 2
synset, changed, 26
synset, missing, 26
textutils, 11
Thai language, 7
troponymy, 2
Turkish language, 3
Ubuntu, 10, 15
uconv, 15
Unicode, 11, 14, 15
University of Glasgow, 4
VerbaLex, 3, 30
VisDic, 8, 14, 24, 39
W3C, 12
Wikipedia, 5
Wiktionary, 6, 7
WordNet Builder, 7, 8
WordNet, Czech (CzWN), 1, 3, 9, 39
WordNet, Czech (CzWN) statistics, 3, 28
WordNet, Princeton (PWN), 1, 2, 39
WordNet, Princeton (PWN), 2.0, 39
WordNet, Princeton (PWN), 2.0 statistics, 2, 28
WordNet, Princeton (PWN), 3.0 statistics, 2, 9
XML, 11, 14
XPath, 13
XSLT, 11, 12
xsltproc, 12
XUL, 15

Appendix A Examples of Generated Synsets

Following are 60 examples of Czech synsets (with their English counterparts) that have been generated by running the created software with the options make.sh excludesplit dummy onetoone markchanged marknew fromvisdic, which guarantee the highest possible quality of results. In total, 7399 such highest-confidence synsets have been created and/or modified. If a synset already existed before and our software only added some new literal(s) into it, we mark the old literals in italics. More examples of this kind may be found on the enclosed CD-ROM, where more detailed information (such as English glosses defining the meanings of the synsets) is also present.

Jacobite:1 = jakobita:1
nanotechnology:1 = nanotechnologie:1
poleax:1, poleaxe:1 = zabít:4
Erasmus:1, Desiderius Erasmus:1, Gerhard Gerhards:1, Geert Geerts:1 = Erasmus:1
Algiers:1, Algerian capital:1 = Alžír:1
bullring:1 = aréna pro býčí zápasy:1
allomorphic:1 = alomorfní:1
extrasensory:1, paranormal:1 = mimosmyslový:1
magnetic storm:1 = magnetické bouře:1, magnetická bouře:1
dolmen:1, cromlech:1 = dolmen:1
plumber:1, pipe fitter:1 = instalatér:1, potrubář:1
transcontinental:1 = transkontinentální:1
enlisted man:1 = obyčejný voják:1
amalgamate:1, amalgamated:1, coalesced:1, consolidated:1, fused:1 = opatřený pojistkou:1
colleen:1 = děvče:4
drain:3, drainpipe:1, waste pipe:1 = odvodňovací potrubí:1, odvodňovací trubka:1, odtoková roura:1, odpadová roura:2
stainless steel:1, chromium steel:1 = nerezavějící ocel:1, nerez:1, chrómová ocel:1, chromová ocel:1
stationer:1, stationery seller:1 = papírník:1
Turkey:2, Republic of Turkey:1 = Turecko:1
radio astronomy:1 = radioastronomie:1
psychometry:1, psychometrics:1, psychometrika:1 = psychometrie:1
prefaded:1 = vybělený:1


Rwanda:1, Rwandese Republic:1, Ruanda:1 = Rwanda:1
buckshee:1 = bezplatný:2
monitor:7, monitor lizard:1, varan:1 = varan:1
antitrade wind:1, antitrade:1 = antipasát:1
supergrass:1 = superagent:1
ontogenetic:1 = ontogenetický:1
infra dig:1 = nedůstojný koho:1
toy soldier:1 = vojáček:1, cínový vojáček:1
Venetian:1 = benátský:1
aggressive:3, fast-growing:1 = rychle rostoucí:1
base runner:1, runner:4 = běžec na metu:1
sponge cloth:1 = froté utěrka:1
boom town:1 = prosperující město:1
holly:1 = cesmína:1
millipede:1, millepede:1, milliped:1 = stonožka:2
overambitious:1 = příliš ctižádostivý:1
bivalve:1, bivalved:1 = dvouchlopňový:1
harakiri:1, hara-kiri:1, harikari:1 = harakiri:1
Trotskyite:1, Trotskyist:1, Trot:2 = trockista:1
housewifely:1 = hospodyňský:1
polygraph:1 = detektor lži:2
oil change:1 = výměna oleje:1
leftism:1 = levičáctví:1
Riyadh:1, capital of Saudi Arabia:1 = Rijád:1
cabin cruiser:1, cruiser:3, pleasure boat:1, pleasure craft:1 = kajutová jachta:1, výletní jachta:1, výletní člun:1, motorová jachta pro zábavní plavbu:1, výletní loď:1
ampersand:1 = symbol &:1
Trojan:1 = trojský:1
affix:1 = afix:1
oven-ready:1 = připravený k pečení:1
Lenin:1, Vladimir Lenin:1, Nikolai Lenin:1, Vladimir Ilyich Lenin:1, Vladimir Ilich Lenin:1, Vladimir Ilyich Ulyanov:1, Vladimir Ilich Ulyanov:1 = Lenin:1
short order:1 = objednávka minutky:1
reporter:1, newsman:1, newsperson:1 = komentátor:3, referent:2, reportér:1, zpravodaj:1, novinář:3
breathlessly:1, gaspingly:1 = udýchaně:1
horn:1, tusk:1 = nabrat na rohy:1
possibly:2, potentially:1 = potenciálně:1
goatfish:1, red mullet:1, surmullet:1, Mullus surmuletus:1 = parmice nachová:1
chihuahua:1 = čivava:1
administratively:1 = administrativně:1


Below, we present 60 more examples, this time generated using the least restrictive options make.sh markchanged marknew fromvisdic, to show the other end of the result quality spectrum. In total, 17,879 synsets get created and/or modified if this setting is used (this includes all the synsets listed above). In this list, possible incorrect sense mappings, errors that occurred during the treatment of alternatives and parentheses in the dictionary, unwanted spelling variants and various other pitfalls should be easier to spot.

petunia:1 = petúnie:1
commanding:1, ranking:1, top-level:1, top-ranking:1 = největšího kalibru:1, velkého formátu:3, na nejvyšší úrovni:1, velkého kalibru:1, špičkový:5, na slovo vzatý:1
lengthways:1, lengthwise:1, longwise:1, longways:1, longitudinally:2 = podélně:1, na délku:1, po délce:1
cloudless:1, unclouded:2 = bezmračný:1, bez jediného mráčku:1, bez mráčku:1, jasný:8
strictly:1, purely:1 = čistě:1
predictably:1 = samozřejmě:1, jak se dalo očekávat:1, jak se dalo předpokládat:1
talk show:1, chat show:1 = talkshow:1, beseda:3, rozhlasová talk show:1, rozhlasová beseda:1, talk show:1, televizní talk show:1, televizní beseda:1
gutta-percha:1 = gutaperča:1
celebrated:1, famed:1, far-famed:1, famous:1, illustrious:1, notable:2, noted:1, renowned:1 = vyhlášený:1, proslulý:1, slavný:1, pověstný:1, renomovaný:1, věhlasný:1
everlastingly:1, eternally:1, forever:1, evermore:2 = navěky:1, věčně:2, v jednom kuse:1, stále:1
refrigerated:1 = chlazený:1
camera obscura:1 = camera obscura:1, temná komora:2
concert-goer:1, music lover:1 = milovník hudby:1
grizzled:1 = prokvetlý:1, prošedivělý:1
ungrudgingly:1 = velkoryse:2, ochotně:6, srdečně:3
nephrectomy:1 = chirurgické vynětí ledviny:1, nefrektomie:1
gradually:1, bit by bit:2, step by step:1 = postupně:1, pozvolna:1, povlovně:1, ponenáhlu:1
pharmacopoeia:1 = lékopis:1, farmakopéa:1, seznam léků:1, popis léčiv a jejich přípravy:1, seznam léčiv:1
control panel:1, instrument panel:1, control board:1, board:7, panel:7 = ovládací panel:1, řídicí panel:1, indikační deska:1, deska:6, palubní deska:2, přístrojová deska:2
car:2, railcar:1, railway car:1, railroad car:1 = motorák:1, motorový železniční vůz:1, motorový vůz:1
dodderer:1 = starý fotr:1, senilní člověk:1
titter:1 = chichot:3, chichotání:3
sycophancy:1 = podkuřování:1, podlézavost:2, pochlebování:1
dandelion:1, blowball:1 = pampeliška:3, smetanka:1
teaching fellow:1 = studentská pedagogická síla:1
awning:1, sunshade:1, sunblind:1 = žaluzie:1, sluneční plachta:1, plachta:3, markýza:1, dešťová plachta:1


gizzard:1, ventriculus:1, gastric mill:1 = zažívací dutina:1, druhý žaludek:1
Caroline:1, Carolean:1 = karolínský:1, Karlův:1
irrelevantly:1 = bezpodstatně:1, nezávažně:1, irelevantně:1
deforest:1, disforest:1, disafforest:1 = odlesnit:2, odlesňovat:1
stagflation:1 = stagflace:1
zip code:1, postcode:1, postal code:1 = poštovní směrovací číslo:1
octal:1 = osmičkový:1, oktalový:1
Prussian:1 = Prussian:1, Prus:1, Prušačka:1, Pruska:1, Prušák:1
Oligocene:1, Oligocene epoch:1 = oligocén:1
Jericho:1 = Jericho:1
labium:1 = pysk:2
zapper:1 = dálkové ovládání:1
Leninism:1, Marxism-Leninism:1 = leninismus:1
crouch:1 = nahrbení:1, schýlení:1, hrbení:1
register:1, registry:1 = matrika:1, seznam:3, registr:1, spisovna:1, registratura:2
counterattack:1, counterstrike:1 = provést protiútok:1, podniknout protiútok:1
misjudge:1 = přepočítat se:2, přepočíst se:2, nedocenit:1, špatně odhadnout:1, udělat si:2, nesprávné mínění:1
clavichord:1 = clavichord:1, klavichord:1
talc:1, talcum:1 = klouzek:1, mastek:2
artesian well:1 = artéská studna:1, artézská studna:1
lurching:1, stumbling:1, staggering:1, weaving:1 = tkalcovský:1
laundering:1 = praní:2
interpersonal:1 = mezilidský:1
gown:1 = obléci do taláru:2
conjectural:1, divinatory:2, supposed:7, suppositional:1, suppositious:1, supposititious:1 = konjekturální:1, založený na dohadech:1
moralistic:1 = moralistický:1, mravokárný:2
ranching:1 = farmaření:1, chov dobytka na ranči:1
escapist:1, dreamer:3, wishful thinker:1 = snílek:4
subpopulation:1 = podskupina obyvatelstva:1
recipe:1, formula:2 = recept:1, předpis:4, návod:4
web log:1, blog:1 = internetový deník:1, osobní stránka:1, blog:1, webový deník:1
open-minded:1 = otevřený:6, nezaujatý:1, bez předsudků:1, svobodomyslný:3, osvícený:2, tolerantní:2, nepředpojatý:2
die out:1, die off:1 = vyhynout:1, vymřít:1, pomřít:1, odumírat:1, umírat jeden po druhém:1
party wall:1 = mezibytová stěna:1, společná stěna:1

Appendix B Contents of the Enclosed CD-ROM

The enclosed CD-ROM contains the source code of the software developed as a part of the work on this thesis. It also contains some sample data. A description of the content of all files on the disk may be found in the file /CONTENTS.

The real data used during the work is absent from the CD-ROM because of the applicable licensing policies, which permit us to publish neither the electronic data of VAC (the Comprehensive English-Czech Dictionary, published in paper form by LEDA spol. s r.o.) nor the Czech WordNet (created by the Faculty of Informatics but not freely available). An exception is the Princeton WordNet®, which has been made available under a license that may be found in the file /data/wn/LICENSE on the CD-ROM, accompanied by its unofficial XML version (VisDic format) created by conversion of Princeton WordNet 2.0, which is in the file /data/wn/pwn.xml. Small dummy files have been put in place of the real dictionary and Czech WordNet data, so that the software should still be able to run if started by running the /make.sh file on a GNU/Linux machine furnished with all the tools required by the program (see Section 2.3 for details). Once run, the program outputs messages about its activities and about the location of the output files it has produced. In order to anticipate possible problems with running the software on an improperly configured computer, we have created the files /SAMPLELOG and /SAMPLEOUT, which contain respectively the output messages and the output extended version of a mock-up Czech WordNet. This mock-up Czech WordNet in the state before the performed extension may be found in /data/wn/czwn.xml and used for comparison, along with /data/vac/vac.xml, which is a mock-up version of the VAC dictionary.

Three hundred randomly chosen synsets induced from real VAC and CzWN data are situated in the file /REALEXAMPLES. They provide insight into the effective usability of the software when real data is employed. These synsets have been generated by running make.sh excludesplit dummy onetoone markchanged marknew fromvisdic and therefore represent the part of the output which has the highest confidence (in total, 7399 such highest-confidence synsets have been generated and/or modified). In the synsets, literals newly added by our software are marked with an "X " prefix, in order to distinguish them from literals that had already existed in the synset in case the synset has not been newly generated but only modified by our software. To see the English synsets to which these generated Czech synsets are linked, you may either use their ILI and look them up in the WordNet 2.0 database (for instance in its copy situated in /data/wn/pwn.xml), or you may consult the file /REALEXAMPLESPWN, where we have placed them for your

convenience, with equivalent synsets being on the same line in each file. Any questions related to the real data that could not be put on the enclosed CD-ROM may be answered by the author during the thesis's public defense, or the author may be contacted using the contact information given at the top of the /README file or provided by Masaryk University.

The following files and folders are present on the CD-ROM:

/clean.sh  shell script to clean all data produced by running the software (including all output)
/code/  source code of the created software
/CONTENTS  description of all files on the CD (this list)
/data/  input and output data of the created software
/filelist  list of file names used by make.sh for creating files and by /clean.sh for cleaning
/make.sh  shell script to run the software; running /make.sh help shows usage information
/README  readme file with information about this CD
/REALEXAMPLES  randomly chosen 300 synsets generated by the software from real data
/REALEXAMPLESPWN  equivalent 300 English synsets from PWN, aligned with /REALEXAMPLES
/SAMPLELOG  copy of messages output by running the software on the sample data that is present on this CD
/SAMPLEOUT  copy of output received by running the software on the sample data that is present on this CD

/code/vac/  source code of the Dictionary Data Parser (see Chapter 3)
/code/wn/  source code of the WordNet Synset Generator (see Chapter 4)

/code/vac/extract.py  Python script that forms part of the Dictionary Data Parser (see Section 3.3); invoked by make.sh
/code/vac/filelist  list of file names used by make.sh for creating files and by /clean.sh for cleaning
/code/vac/make.sh  shell script to run the Dictionary Data Parser; invoked by /make.sh
/code/vac/saxon9he.jar  the Saxon XSLT and XQuery Processor (downloaded from , see inside for license)
/code/vac/transform.java  Java class that forms part of the Dictionary Data Parser (see Section 3.2 in the thesis); invoked by make.sh
/code/vac/transform.xsl  XSLT stylesheet that forms part of the Dictionary Data Parser; used by transform.java

41 B.CONTENTSOFTHE ENCLOSED CD-ROM

/code/wn/czwn2synsets.xsl  XSLT stylesheet that forms part of the WordNet Synset Generator; used by make.sh
/code/wn/filelist  list of file names used by make.sh for creating files and by /clean.sh for cleaning
/code/wn/make.sh  shell script to run the WordNet Synset Generator; invoked by /make.sh
/code/wn/pwn2mono.xsl  XSLT stylesheet that forms part of the WordNet Synset Generator; used by make.sh
/code/wn/pwn2poly.xsl  XSLT stylesheet that forms part of the WordNet Synset Generator; used by make.sh

/data/other/  other data used and produced by the software
/data/vac/  data used and produced by the Dictionary Data Parser (see Chapter 3)
/data/wn/  data used and produced by the WordNet Synset Generator (see Chapter 4)

/data/other/wnczx.cfg  configuration file that is necessary to enable use of the output data in VisDic
/data/other/wnczx.def  definition file that is necessary to enable use of the output data in VisDic

/data/vac/vac.xml  machine-readable form of VAC (Velký anglicko-český slovník) (only a mock-up file has been placed on this disk)

/data/wn/czwn.xml  data of the Czech WordNet in XML form for VisDic (only a mock-up file has been placed on this disk)
/data/wn/LICENSE  license information for the Princeton WordNet
/data/wn/pwn.xml  data of the Princeton WordNet in XML form for VisDic (downloaded from and converted)
