Building Basic Vocabulary Across 40 Languages
Total Page:16
File Type:pdf, Size:1020Kb
Building basic vocabulary across 40 languages Judit Acs´ Katalin Pajkossy Andras´ Kornai HAS Computer and Automation Research Institute H-1111 Kende u 13-17, Budapest {judit.acs,pajkossy,kornai}@sztaki.mta.hu Abstract to the top 40 languages (by Wikipedia size) using a variety of methods. Since some of the resources The paper explores the options for build- studied here are not available for the initial list of ing bilingual dictionaries by automated 40 languages, we extended the original list to 50 methods. We define the notion ‘ba- languages so as to guarantee at least 40 languages sic vocabulary’ and investigate how well for every method. Throughout the paper, results the conceptual units that make up this are provided for all 50 languages, indicating miss- language-independent vocabulary are cov- ing data as needed. ered by language-specific bindings in 40 Section 1 outlines the approach taken toward languages. defining the basic vocabulary and translational equivalence. Section 2 describes how Wiktionary Introduction itself measures up against the 4lang resource Globalization increasingly brings languages in directly and after triangulation across language contact. At the time of the pioneering IBM work pairs. Section 2.3 and Section 2.4 deals with ex- on the Hansard corpus (Brown et al., 1990), only traction from multiply parallel and near-parallel two decades ago, there was no need for a Basque- corpora, and Section 3 offers some conclusions. Chinese dictionary, but today there is (Saralegi et al., 2012). While the methods for building dic- 1 Basic vocabulary tionaries from parallel corpora are now mature The idea that there is a basic vocabulary composed (Melamed, 2000), there is a dearth of bilingual or of a few hundred or at most a few thousand ele- even monolingual material (Zseder´ et al., 2012), ments has a long history going back to the Renais- hence the increased interest in comparable cor- sance – for a summary, see Eco (1995). The first pora. modern efforts in this direction are Thorndike’s Once we find bilingual speakers capable of car- (1921) Word Book, based entirely on frequency rying out a manual evaluation of representative counts (combining TF and DF measures), and samples, it is relatively easy to measure the pre- Ogden’s (1944) Basic English, based primarily cision of a dictionary built by automatic meth- on considerations of definability. Both had last- ods. But measuring recall remains a challenge, for ing impact, with Thorndike’s approach forming if there existed a high quality machine-readable the basis of much subsequent work on readabil- dictionary (MRD) to measure against, building a ity (Klare 1974, Kanungo and Orr 2009) and new one would largely be pointless, except per- Ogden’s forming the basis of the Simple En- haps as a means of engineering around copyright glish Wikipedia1. An important landmark is the restrictions. We could measure recall against Wik- Swadesh (1950) list, which puts special emphasis tionary, but of course this is a moving target, and on cross-linguistic definability, as its primary goal more importantly, the coverage across language is to support glottochronological studies. pairs is extremely uneven. Until the advent of large MRDs, the frequency- What we need is a standardized vocabulary re- based method was much easier to follow, and source that is equally applicable to all language Thorndike himself has extended his original list of pairs. In this paper we describe our work toward ten thousand words to twenty thousand (Thorndike creating such a resource by extending the 4lang conceptual dictionary (Kornai and Makrai, 2013) 1http://simple.wikipedia.org 52 Proceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 52–58, Sofia, Bulgaria, August 8, 2013. c 2013 Association for Computational Linguistics 1931) and thirty thousand (Thorndike and Lorge the most frequent 2,000 words according to the 1944). For a recent example see Davies and Gard- Google unigram count (Brants and Franz 2006) ner (2010), for a historical survey see McArthur and the BNC, as well as the most frequent 2,000 (1998). The main problem with this approach is words from Polish (Halacsy´ et al 2004) and Hun- the lack of clear boundaries both at the top of the garian (Kornai et al 2006). Since Latin is one of list, where function words dominate, and at the the four languages supported by 4lang (the other bottom, where it seems quite arbitrary to cut the three being English, Polish, and Hungarian), we list off after the top three hundred words (Diedrich added the classic Diederich (1938) list and Whit- 1938), the top thousand, as is common in foreign ney’s (1885) Roots. language learning, or the top five thousand, es- The basic list emerging from the iteration has pecially as the frequency curves are generally in 1104 elements (including two bound morphemes good agreement with Zipf’s law and thus show no but excluding technical terms of the formal seman- obvious inflection point. The problem at the top tic model that have no obvious surface reflex). We is perhaps more significant, since any frequency- will refer to this as the basic or uroboros set as it based listing will start with the function words has the property that each of its members can be of the language, characterizing more its grammar defined in terms of the others, and we reserve the than its vocabulary. For this reason, the list is name 4lang for the larger set of 3,345 elements highly varied across languages, and what is a word from which it was obtained. Since 4lang words (free form) in one language, like English the, often can be defined using only the uroboros vocabu- ends up as an affix (bound form) in another, like lary, and every word in the Longman Dictionary the Romanian suffix -ul. By choosing a frequency- of Contemporary English can be defined using the based approach, we inevitably put the emphasis on 4lang vocabulary (since this is a superset of LDV), comparing grammars and morphologies, instead we have full confidence that every sense of every of comparing vocabularies. non-technical word can be defined by the uroboros The definitional method is based on the assump- vocabulary. In fact, the Simple English Wikipedia tion that dictionaries will attempt to define the is an attempt to do this (Yasseri et al., 2012) based more complex words by simpler ones. Therefore, on Ogden’s Basic English, which overlaps with the starting with any word list L, the list D(L) ob- uroboros set very significantly (Dice 0.527). tained by collecting the words appearing on the The lexicographic principles underlying 4lang right-hand side of the dictionary definitions will have been discussed elsewhere (Kornai, 2012; be simpler, the list D(D(L)) obtained by repeat- Kornai and Makrai, 2013), here we just summa- ing the method will be yet simpler, and so on, un- rize the most salient points. First, the system is til we arrive at an irreducible list of basic words intended to capture everyday vocabulary. Once that can no longer be further simplified. Mod- the boundaries of natural language are crossed, ern MRDs, starting with the Longman Dictionary and goats are defined by their set of genes (rather of Contemporary English (LDOCE), generally en- than an old-fashioned taxonomic description in- force a strict list of words and word senses that can volving cloven hooves and the like), or derivative appear in definitions, which guarantees that the ba- is defined as lim (f(x + ∆) − f(x))/∆, the sic list will be a subset of this defining vocabulary. ∆→0 uroboros vocabulary loses its grip. But for the This method, while still open to charges of arbi- non-technical vocabulary, and even the part of the trariness at the high end, in regards to the separa- technical vocabulary that rests on natural language tion of function words from basic words, creates a (e.g. legal definitions or the definitions in philos- bright line at the low end: no word, no matter how ophy and discursive prose in general), coverage frequent, needs to be included as long as it is not of the uroboros set promises a strategy of grad- necessary for defining other words. ually extending the vocabulary from the simple In creating the 4lang conceptual dictionary to the more complex. Thus, to define Jupiter as (Kornai and Makrai, 2013), we took advantage ‘the largest planet of the Sun’, we need to define of the fact that the definitional method is robust planet, but not large as this item is already listed in in terms of choosing the seed list L, and built a the uroboros set. Since planet is defined ‘as a large seed of approximately 3,500 entries composed of body in space that moves around a star’, by substi- the Longman Defining Vocabulary (2,200 entries), tution we will obtain for Jupiter the definition ‘the 53 largest body in space that moves around the Sun’ pairs are only obtained indirectly, through the con- where all the key items large, body, space, move, ceptual pivot, and thus do not amount to fully valid around are part of the uroboros set. Proper nouns bilingual translation pairs. For example, he-goat like Sun are discussed further in (Kornai, 2010), in one language may just get mapped to the con- but we note here that they constitute a very small cept goat, and if billy-goat is present in another proportion (less than 6%) of the basic vocabulary. language, the strict translational equivalence be- tween the gendered forms will be lost because Second, the ultimate definitions of the uroboros of the poverty of the pivot.