Bilingual Dictionary Generation for Low-Resourced Language Pairs

Varga István
Yamagata University, Graduate School of Science and Engineering

Yokoyama Shoichi
Yamagata University, Graduate School of Science and Engineering

Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 862–870, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP

Abstract

Bilingual dictionaries are vital resources in many areas of natural language processing. Numerous methods of machine translation require bilingual dictionaries with large coverage, but less-frequent language pairs rarely have any digitized resources. Since the need for these resources is increasing, but human resources are scarce for less represented languages, efficient automated methods are needed. This paper introduces a fully automated, robust pivot language based bilingual dictionary generation method that uses the WordNet of the pivot language to build a new bilingual dictionary. We propose the usage of WordNet in order to increase accuracy; we also introduce a bidirectional selection method with a flexible threshold to maximize recall. Our evaluations showed 79% accuracy and 51% weighted recall, outperforming representative pivot language based methods. A dictionary generated with this method will still need manual post-editing, but the improved recall and precision decrease the work of human correctors.

1 Introduction

In recent decades automatic and semi-automatic machine translation systems have gradually managed to take over costly human tasks. This much welcomed change can be attributed not only to major developments in translation methods, but also to important translation resources, such as monolingual or bilingual dictionaries and corpora, thesauri, and so on. However, while widely used language pairs can take full advantage of state-of-the-art developments in machine translation, certain low-frequency, or less common, language pairs lack some or even most of the above mentioned translation resources. In that case, the key to a highly accurate machine translation system switches from the choice and adaptation of the translation method to the problem of available translation resources between the chosen languages.

One possible solution is bilingual corpus acquisition for statistical machine translation (SMT). However, highly accurate SMT systems require large bilingual corpora, which are rarely available for less represented languages. Rule or sentence pattern based systems are an attractive alternative, but for these systems a bilingual dictionary is essential.

Our paper targets bilingual dictionary generation, a resource which can be used within the framework of a rule or pattern based machine translation system. Our goal is to provide a low-cost, robust and accurate dictionary generation method. Low cost and robustness are essential for the method to be re-implementable with any arbitrary language pair. We also believe that besides high precision, high recall is crucial in order to facilitate the post-editing that has to be performed by human correctors. For improved precision, we propose the usage of WordNet, while for good recall we introduce a bidirectional selection method with local thresholds.

Our paper is structured as follows: first we review the most significant related works, after which we analyze the problems of current dictionary generation methods. We present the details of our proposal, exemplified with the Japanese-Hungarian language pair. We evaluate the generated dictionary, also performing a comparative evaluation with two other pivot language based methods. Finally we present our conclusions.

2 Related works

2.1 Bilingual dictionary generation

Various corpus based, statistical methods with very good recall and precision were developed starting from the 1980’s, most notably using the Dice-coefficient (Kay & Röscheisen, 1993), correspondence-tables (Brown, 1997), or mutual information (Brown et al., 1998).

As an answer to the corpus-based methods’ biggest disadvantage, namely the need for a large bilingual corpus, Tanaka and Umemura (1994) presented a new approach. As a resource, they only use dictionaries to and from a pivot language to generate a new dictionary. These so-called pivot language based methods rely on the idea that the lookup of a word in an uncommon language through a third, intermediate language can be automated. Tanaka and Umemura’s method uses bidirectional source-pivot and pivot-target dictionaries (harmonized dictionaries). Correct translation pairs are selected by means of inverse consultation, a method that relies on counting the number of pivot language definitions of the source word, through which the target language definitions can be identified (Tanaka and Umemura, 1994).

Sjöbergh (2005) also presented an approach to pivot language based dictionary generation. When generating his English pivoted Swedish-Japanese dictionary, each Japanese-to-English description is compared with each Swedish-to-English description. Scoring is based on word overlap, weighted with inverse document frequency; the best matches are selected as translation pairs.

The two approaches described above are the best performing ones that are general enough to be applicable to other language pairs as well. In our research we used these two methods as baselines for comparative evaluation.

There are numerous refinements of the above methods, but for various reasons they cannot be implemented with any arbitrary language pair. Shirai and Yamamoto (2001) used English to design a Korean-Japanese dictionary, but because of the usage of language-specific information, they conclude that their method ‘can be considered to be applicable to cases of generating among languages similar to Japanese or Korean through English’. In other cases, only a small portion of the lexical inventory of the language is chosen to be translated: Paik et al. (2001) proposed a method with multiple pivots (English and Kanji/Hanzi characters) to translate Sino-Korean entries. Bond and Ogura (2007) describe a Japanese-Malay dictionary that uses a novel technique, improving matching through normalization of the pivot language by means of semantic classes, but only for nouns. Besides English, they also use Chinese as a second pivot.

2.2 Lexical database in lexical acquisition

Large lexical databases are vital for many areas in natural language processing (NLP), where large amounts of structured linguistic data are needed. The appearance of WordNet (Miller et al., 1990) had a big impact in NLP, since not only did it provide one of the first wide-range collections of linguistic data in electronic format, but it also offered a relatively simple structure that can be implemented with other languages as well. In the decades since the first, English WordNet, numerous languages have adopted the WordNet structure, thus creating a potentially large multilingual network. The Japanese language is one of the most recent ones added to the WordNet family (Isahara et al. 2008), but the Hungarian WordNet is still under development (Prószéky et al. 2001; Miháltz and Prószéky 2004).

Multilingual projects, such as EuroWordNet (Vossen 1998; Peters et al. 1998), BalkaNet (Stamou et al. 2002) or Multilingual Central Repository (Agirre et al. 2007) aim to solve numerous problems in natural language processing. EuroWordNet was specifically designed for word disambiguation purposes in cross-language information retrieval (Vossen 1998). The internal structure of the multilingual WordNets itself can be a good starting point for bilingual dictionary generation. In the case of EuroWordNet, besides the internal design of the initial WordNet for each language, an Inter-Lingual-Index that interlinks word meanings across languages is implemented (Peters et al. 1998). However, there are two limitations: first of all, the size of each individual language database is relatively small (Vossen 1998), covering only the most frequent words in each language, and is thus not sufficient for creating a dictionary with large coverage. Secondly, these multilingual databases cover only a handful of languages, with Hungarian and Japanese not being part of them. Adding a new language would require the existence of a WordNet of that language.

3 Problems of current pivot language based methods

3.1 Selection method shortcomings

Previous pivot language based methods generate and score a number of translation candidates, and the candidates whose scores exceed a certain predefined global threshold are selected as viable translation pairs. However, the scores highly depend on the entry itself or the number of translations in the pivot language, therefore there is a variance in what that score represents. For this reason, a large number of good entries are entirely left out of the dictionary, because all of their translation candidates scored low, while faulty translation candidates are selected, because they exceed the global threshold. Due to [...]

[...] entry_target k candidates are selected, we ensure that at least one translation will be available for entry_source, maintaining a high recall. Since we can group the entries in the source language and target language as well, we perform this selection twice, once in each direction. Local thresholds depend on the top scoring entry_target, being set to [...]
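The contrast between a single global threshold and per-entry local thresholds applied in both directions can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy scores are hypothetical, the `ratio` parameter is a placeholder (the text above does not give the value the local threshold is set to), and taking the union of the two directions is likewise an assumption.

```python
from collections import defaultdict


def select_global(scores, threshold):
    """Keep every (source, target) pair whose score exceeds one global threshold."""
    return {pair for pair, score in scores.items() if score > threshold}


def select_local(scores, ratio):
    """Bidirectional selection with local thresholds (illustrative sketch).

    For each entry, a candidate is kept if its score is at least
    `ratio` times the top score among that entry's candidates.
    `ratio` is a hypothetical parameter, not a value from the paper.
    The selection runs once grouped by source entry and once grouped
    by target entry; the union of both passes is returned (assumption).
    """
    selected = set()
    for side in (0, 1):  # 0: group by source entry, 1: group by target entry
        groups = defaultdict(list)
        for pair, score in scores.items():
            groups[pair[side]].append((pair, score))
        for candidates in groups.values():
            top = max(score for _, score in candidates)
            selected.update(pair for pair, score in candidates if score >= ratio * top)
    return selected


# Toy candidate scores (hypothetical): every candidate of s2 scores low.
scores = {
    ("s1", "t1"): 0.9,
    ("s1", "t2"): 0.3,
    ("s2", "t3"): 0.2,
    ("s2", "t2"): 0.1,
}

print(sorted(select_global(scores, 0.5)))  # [('s1', 't1')]
print(sorted(select_local(scores, 0.8)))   # [('s1', 't1'), ('s1', 't2'), ('s2', 't3')]
```

With the global threshold, s2 receives no translation at all because both of its candidates score low. With local thresholds, the top-scoring candidate of every entry always passes, so s2 keeps a translation, and the pass grouped by target entry additionally recovers the pair (s1, t2).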
