A Multilingual Lexical Database Application with a Structured Interlingua

SIMuLLDA a Multilingual Lexical Database Application using a Structured Interlingua SIMuLLDA een toepassing van een meertalig lexicaal gegevensbestand met gebruikmaking van een gestructureerde tussentaal (met een samenvatting in het Nederlands) Proefschrift ter verkrijging van de graad van doctor aan de Universiteit Utrecht op het gezag van de Rector Magnificus, Prof. dr. W.H. Gispen, ingevolge het besluit van het College voor Promoties in het openbaar te verdedigen op vrijdag 7 juni 2002 des middags te 4:15 uur door Maarten Janssen geboren op 28 januari 1971 te Nijmegen Promotoren: Prof. dr. H.J. Verkuyl UiL-OTS, Universiteit Utrecht Prof. dr. A. Visser Faculteit Wijsbegeerte, Universiteit Utrecht Contents Preface vii 1 Multilingual Lexical Databases 1 1.1 Multilingual Lexical Databases . 1 1.2 Current Approaches and their Shortcomings . 2 1.2.1 Parallel Wordlists . 2 1.2.2 Hub-and-Spoke Model . 5 1.2.3 WordNet and EuroWordNet . 9 1.2.4 Acquilex et al. 15 1.2.5 Corpus Based Approaches . 19 1.3 Conclusion to Chapter 1 . 21 2 FCA and SIMuLLDA 23 2.1 Formal Concept Analysis . 23 2.1.1 Partial Ordering . 27 2.1.2 Hasse Diagrams . 30 2.2 Connotative Context . 31 2.3 The SIMuLLDA System . 35 2.3.1 Multilinguality . 38 2.3.2 Lexical Gap Filling . 43 2.4 Formal Properties of FCA . 45 2.4.1 FCA and Lattices . 45 2.4.2 Smallest Common Concept . 46 2.4.3 Maximal Filled Sub-Tables . 46 2.4.4 Distributive and Atomic Lattices . 47 2.4.5 Extending Contexts . 48 2.4.6 Models and the Number of Concepts . 49 2.4.7 Partial Ordering on Attributes . 51 2.5 JaLaBA: an Online FCA Tool . 52 2.5.1 Construing Formal Concepts . 53 2.5.2 Drawing Lattices . 56 2.6 Conclusion to Chapter 2 . 58 IV Contents 3 SIMuLLDA Elements 61 3.1 Words and Word-Forms . 61 3.1.1 Word-Form and Lexeme . 62 3.1.2 Morphemes . 68 3.1.3 History, Etymology . 70 3.2 The Language . 71 3.2.1 Language, Dialect and Idiolect . 71 3.2.2 New, Regional, and Infrequent Words . 75 3.3 The Interlingual Meanings . 77 3.3.1 Homonymy, Polysemy and Metonymy . 78 3.3.2 Word-meaning and Denotation . 84 3.3.3 Interlingua, Incommensurability, and Cultural Differ- ences . 88 3.3.4 Word meaning and the Colour of the Word . 94 3.4 The Definitional Attributes . 95 3.4.1 Semes,` Semantic Primitives, and Interpretative Seman- tics . 95 3.4.2 Lexicalisation of Definitional Attributes . 101 3.4.3 Adjusted Attributes . 102 3.4.4 The Value of Dictionary Definitions . 104 3.5 Conclusion to Chapter 3 . 107 4 Field Testing: Some Actual Dictionary Data 111 4.1 Horses and the Like . 111 4.2 Bodies of Water . 115 4.2.1 Hierarchy Problems . 118 4.2.2 Interlingual Alignments . 128 4.2.3 Generating Bilingual Definitions . 142 4.3 Ships and Sails . 143 4.4 Conclusion to Chapter 4 . 150 5 Extending the System 151 5.1 Derivations, Inflections, and Lexical Functions . 151 5.1.1 Meaning ⇔ Text Theory . 153 5.1.2 Lexemes as (Lexical) Functions . 156 5.1.3 Derivation and Inflection in SIMuLLDA . 158 5.2 Labels, Examples, Collocations . 162 5.2.1 Collocations . 162 5.2.2 Corpus Examples and Illustrative Sentences . 165 5.2.3 Register Labels and Group Labels . 167 5.3 Abstract Nouns, Verbs, Adjectives . 170 5.4 Lexical Database, Size, and Word Frequency . 173 5.5 Conclusion to Chapter 5 . 176 Contents V 6 Conclusion and Afterthoughts 177 6.1 Program for Further Research . 181 APPENDICES 183 A Lexical Definitions for Water 183 A.1 ‘Water’ in WordNet 1.6 . 183 A.2 English Definitions . 185 A.3 Dutch Definitions . 187 A.4 Italian Definitions . 190 A.5 German Definitions . 192 A.6 French Definitions . 194 A.7 Russian Definitions . 196 B Abbreviations and Notations 199 B.1 Dictionaries Referred to in this Thesis . 199 B.2 IPA Pronuncation Rules . 201 B.3 Lexical Functions . 202 B.4 Notational Conventions used in this Thesis . 204 Nederlandse Samenvatting 205 Curriculum Vitae 211 Acknowledgements 213 References 215 Index 222 Preface It is commonly accepted that there are about five to six thousand languages. For many pairs of languages hX, Y i, there is no dictionary X → Y or Y → X, there are only dictionaries for the pairs X → English/French/Spanish and dictionaries English/French/Spanish → Y (as well as the other way around). There is a clear need for dictionaries translating between languages without the intervention of a small number of Western European languages with a colonial past. Also from a theoretical point of view, such a need can be defended. The creation of a dictionary of good quality takes a lot of time, and given the fact that 5000-6000 languages yield 25-30 million pairs of languages, it is important to have a database that provides the possibility to translate directly between pairs of languages, even though just a small subset of the millions of pairs will be spoken by sufficiently many speakers to make the enterprise probable. However, sufficiently many will remain to justify a theoretical attempt to think about a database. This thesis highlights some problems that play a role in the creation of such a database, attempts to solve some of them, and tries to show that some other problems cannot be solved. A well-known problem is that words are often hard to match across languages: different words from different languages do not have the same range of meanings, not all words from one language have an equivalent in the other, etc. In this thesis, a sketch will be given of a database in which most of these problems are solved. Crucial in this set-up is the structure of the interlingua, which provides the possibility to relate non-corresponding meanings in a structural way. With the set-up proposed in this thesis, which is called SIMuLLDA (short for Structured Interlingua MultiLingual Lexical Database Application) it is possible to generate a descriptive translation for words in the source language that lack a direct translation in the target language. This should ease the work of a lexicographer making a dictionary for a new pair of languages. In the first chapter, a global explanation is given of why a set-up of a multilingual lexical database that can generate translations between arbitrary pairs of languages is an important asset to have. Furthermore, it VIII Preface is explained what requirements a lexical database has to fulfil in order to meet this demand. This is done by examining a number of existing the- ories, and by showing why they are (currently) not capable of generating translations. The second chapter explains the layout of the alternative lexical database set-up that is proposed in this thesis. As said above the crucial component of the SIMuLLDA set-up is the structured interlingua. The structure of the interlingua is provided by a logical formalism called Formal Concept Analysis. Therefore, the chapter gives an introduction of the FCA framework, after wich an explanation is given of how the FCA framework is applied to dictionaries in the SIMuLLDA set-up. It is also shown that with this set-up, it is possible to generate definitions for words without a ‘trans- lational synonym’. Since FCA is a logical framework, some of the logical properties are discussed, especially those relevant for the functioning of the SIMuLLDA system. Additionally, an on-line application called JaLaBA is introduced. This application can be used to generate the Hasse diagrams that represent the structure yielded by FCA automatically, displaying them 3-dimensionally. In the third chapter, the basic component of the SIMuLLDA set-up are investigated in depth: words, languages, interlingual meanings, and definitional attributes. There are two motivations for this thorough investi- gation. The first is simply that a good theory should explain the intended interpretation of its basic elements, and especially so in the case of an abstract framework like FCA, which does not in itself have any interpretation at all. So if meanings are assigned in the system to words, it is important to define precisely what counts as a word. The second motivation is that in the SIMuLLDA set-up, meanings are interlingual and defined by means of what might be viewed as ‘semantic primitives’. Without restrictions on the interpretation of the primitives (the definitional attributes), the system would make incorrect claims. In the fourth chapter, the system is empirically tested. As said before, the SIMuLLDA system is designed to provide a practical aid for lexicogra- phers. So its practical applicability is of vital importance. The empirical test in chapter four is only a relatively small scale test: to really test the system, it should be applied on a larger scale. But the smaller test in this thesis does show the applicability of the system to an arbitrary semantic domain, and raises a number of important issues. In the fifth chapter, some extensions to the system are discussed. As discussed in the first four chapters, the system mainly applies to concrete nouns, and then also to the semantic definitions of these concrete nouns. To be a fully workable system, SIMuLLDA also needs to model the other com- ponents of dictionaries. So, chapter five discusses the treatment of labels, collocations, inflections, derivations, and illustrative examples, also taking into consideration the applicability of the system to other word classes like Preface IX abstract nouns, verbs, and adjectives. Finally, some aspects of an application for the system that should be developed in order to use SIMuLLDA are discussed.

A Multilingual Lexical Database Application with a Structured Interlingua

1 Symbols (2286)

FUPA) in the UCS Source: Michael Everson and Klaas Ruppel Version: 2.0 Date: 1998-11-02

The Brill Typeface User Guide & Complete List of Characters

Fraser's Lisu Orthography [L2/07-294 = WG2/N3326]

IPA Braille: an Updated Tactile Representation of the International Phonetic Alphabet

Latin Extended-B Range: 0180–024F

Reference Chart for IPA Typography

View Extract

The Unicode Cookbook for Linguists

Proposal to Encode Additional Dialectology Latin Characters

Phonetic Extensions Supplement Range: 1D80–1DBF

Considerations in the Use of the Latin Script in Variant Internationalized Top-Level Domains