
Bilingual dictionary generation for low-resourced language pairs

Varga István
Yamagata University, Graduate School of Science and Engineering
[email protected]

Yokoyama Shoichi
Yamagata University, Graduate School of Science and Engineering
[email protected]

Abstract

Bilingual dictionaries are vital resources in many areas of natural language processing. Numerous methods of machine translation require bilingual dictionaries with large coverage, but less-frequent language pairs rarely have any digitalized resources. Since the need for these resources is increasing while human resources are scarce for less represented languages, efficient automated methods are needed. This paper introduces a fully automated, robust, pivot language based bilingual dictionary generation method that uses the WordNet of the pivot language to build a new bilingual dictionary. We propose the usage of WordNet in order to increase accuracy; we also introduce a bidirectional selection method with a flexible threshold to maximize recall. Our evaluations showed 79% accuracy and 51% weighted recall, outperforming representative pivot language based methods. A dictionary generated with this method will still need manual post-editing, but the improved recall and precision decrease the work of human correctors.

1 Introduction

In recent decades automatic and semi-automatic machine translation systems have gradually managed to take over costly human tasks. This much welcomed change can be attributed not only to major developments in translation techniques, but also to important translation resources, such as monolingual or bilingual dictionaries and corpora, thesauri, and so on. However, while widely used language pairs can fully take advantage of state-of-the-art developments in machine translation, certain low-frequency or less common language pairs lack some or even most of the above mentioned translation resources. In that case, the key to a highly accurate machine translation system switches from the choice and adaptation of the method to the problem of the translation resources available between the chosen languages.

One possible solution is bilingual corpus acquisition for statistical machine translation (SMT). However, highly accurate SMT systems require large bilingual corpora, which are rarely available for less represented languages. Rule or sentence pattern based systems are an attractive alternative; for these systems a bilingual dictionary is essential.

Our paper targets bilingual dictionary generation, a resource which can be used within the framework of a rule or pattern based machine translation system. Our goal is to provide a low-cost, robust and accurate dictionary generation method. Low cost and robustness are essential for the method to be re-implementable with any arbitrary language pair. We also believe that besides high precision, high recall is crucial in order to facilitate the post-editing that has to be performed by human correctors. For improved precision we propose the usage of WordNet, while for good recall we introduce a bidirectional selection method with local thresholds.

Our paper is structured as follows: first we overview the most significant related works, after which we analyze the problems of current dictionary generation methods. We then present the details of our proposal, exemplified with the Japanese-Hungarian language pair. Finally, we evaluate the generated dictionary, also performing a comparative evaluation with two other pivot language based methods, and present our conclusions.

2 Related works

2.1 Bilingual dictionary generation

Various corpus based, statistical methods with very good recall and precision have been developed since the 1980's, most notably using the

[Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 862-870, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP]

Dice coefficient (Kay & Röscheisen, 1993), correspondence tables (Brown, 1997), or mutual information (Brown et al., 1998).

As an answer to the corpus based methods' biggest disadvantage, namely the need for a large bilingual corpus, Tanaka and Umemura (1994) presented a new approach in the 1990's. As a resource, they use only dictionaries to and from a pivot language to generate a new dictionary. These so-called pivot language based methods rely on the idea that the lookup of a word in an uncommon language through a third, intermediate language can be automated. Tanaka and Umemura's method uses bidirectional source-pivot and pivot-target dictionaries (harmonized dictionaries). Correct translation pairs are selected by means of inverse consultation, a method that relies on counting the number of pivot language definitions of the source word, through which the target language definitions can be identified (Tanaka and Umemura, 1994).

Sjöbergh (2005) also presented an approach to pivot language based dictionary generation. When generating his English-pivoted Swedish-Japanese dictionary, each Japanese-to-English description is compared with each Swedish-to-English description. Scoring is based on word overlap, weighted with inverse document frequency, the best matches being selected as translation pairs.

These two approaches are the best performing ones that are general enough to be applicable to other language pairs as well. In our research we used these two methods as baselines for comparative evaluation.

There are numerous refinements of the above methods, but for various reasons they cannot be implemented with any arbitrary language pair. Shirai and Yamamoto (2001) used English to design a Korean-Japanese dictionary, but because of their usage of language-specific information, they conclude that their method 'can be considered to be applicable to cases of generating among languages similar to Japanese or Korean through English'. In other cases, only a small portion of the lexical inventory of the language is chosen to be translated: Paik et al. (2001) proposed a method with multiple pivots (English and Kanji/Hanzi characters) to translate Sino-Korean entries. Bond and Ogura (2007) describe a Japanese-Malay dictionary that improves matching through normalization of the pivot language by means of semantic classes, but only for nouns; besides English, they also use Chinese as a second pivot.

2.2 Lexical database in lexical acquisition

Large lexical databases are vital for many areas of natural language processing (NLP) where a large amount of structured linguistic data is needed. The appearance of WordNet (Miller et al., 1990) had a big impact on NLP, since not only did it provide one of the first wide-range collections of linguistic data in electronic format, but it also offered a relatively simple structure that can be implemented for other languages as well. In the decades since the first, English WordNet, numerous languages have adopted the WordNet structure, thus creating a potentially large multilingual network. Japanese is one of the most recent languages added to the WordNet family (Isahara et al. 2008), while the Hungarian WordNet is still under development (Prószéky et al. 2001; Miháltz and Prószéky 2004).

Multilingual projects such as EuroWordNet (Vossen 1998; Peters et al. 1998), Balkanet (Stamou et al. 2002) or the Multilingual Central Repository (Agirre et al. 2007) aim to solve numerous problems in natural language processing. EuroWordNet was specifically designed for word disambiguation in cross-language information retrieval (Vossen 1998). The internal structure of such a multilingual database can itself be a good starting point for bilingual dictionary generation. In the case of EuroWordNet, besides the internal design of the initial WordNet for each language, an Inter-Lingual-Index that interlinks word meanings across languages is implemented (Peters et al. 1998). However, there are two limitations: first, the size of each individual language database is relatively small (Vossen 1998), covering only the most frequent words of each language, thus not being sufficient for creating a dictionary with large coverage. Secondly, these multilingual databases cover only a handful of languages, with Hungarian and Japanese not being among them. Adding a new language would require the existence of a WordNet for that language.

3 Problems of current pivot language based methods

3.1 Selection method shortcomings

Previous pivot language based methods generate and score a number of translation candidates, and the candidates whose scores exceed a certain predefined global threshold are selected as viable translation pairs.

However, the scores highly depend on the entry itself or on the number of translations in the pivot language, so there is variance in what a given score represents. For this reason, a large number of good entries are entirely left out of the dictionary, because all of their translation candidates scored low, while faulty translation candidates are selected, because they exceed the global threshold. Due to this effect the recall value drops significantly.

3.2 Dictionaries are not enough as a resource

Regardless of the language pair, in most cases the meanings of the corresponding words are not identical; they only overlap to a certain extent. Therefore, the pivot language based dictionary generation problem can be defined as the identification of the common elements, or the extent of the relevant overlap, in the source-to-pivot and target-to-pivot definitions.

Current methods perform a strictly lexical overlap of the source-pivot and target-pivot entries. Even if the meanings of the source and target head words are transferred to the pivot language, this is rarely done with the same set of words or definitions. Thus, due to different word usage or paraphrases, even semantically identical or very similar head words can have different definitions in different dictionaries. As a result, performing only lexical overlap, current methods cannot distinguish between totally different definitions resulting from unrelated concepts, and differences in mere nuance resulting from lexicographers describing the same concept with different words.

4 Proposed method

4.1 Specifics of our proposal

For higher precision, instead of the familiar lexical overlap of current methods we calculate the semantically expanded lexical overlap of the source-to-pivot and target-to-pivot definitions. In order to do that, we use semantic information extracted from the WordNet of the pivot language.

To improve recall, we introduce bidirectional selection. As we stated above, the global threshold eliminates a large number of good translation pairs, resulting in low recall. As a solution, we group the translations that share the same source or target entry, and set local thresholds for each head word. For example, for a source language head word entry_source there could be multiple target language candidates: entry_target_1, ..., entry_target_n. If the top scoring entry_target_k candidates are selected, we ensure that at least one translation will be available for entry_source, maintaining a high recall. Since we can group the entries in the source language and in the target language as well, we perform this selection twice, once in each direction. Local thresholds depend on the top scoring entry_target, being set to maxscore·c. The constant c varies between 0 and 1, allowing a small window not only for the maximum but for other high scoring candidates as well. It is language and selection method dependent (see §5.1 for details).

4.2 Translation resources

As an example of a less common language pair, we have chosen Japanese and Hungarian. For translation candidate generation, we have chosen two freely available dictionaries with English as the pivot language. The Japanese-English dictionary had 197,282 1-to-1 entry pairs, while the Hungarian-English dictionary contained 189,331. The Japanese-English dictionary had part-of-speech (POS) information as well, but to ensure robustness, our method does not use this information. To select from the translation candidates, we mainly use WordNet (Miller et al., 1990). From WordNet we consider four types of information: sense categorization, synonymy, antonymy, and the semantic categories provided by the tree structure of nouns and verbs.

4.3 Dictionary generation method

Our proposed method consists of two steps. In step 1 we generate a number of translation pair candidates, while in step 2 we score and select from them based on semantic information extracted from WordNet.

Step 1: translation candidate generation

Using the source-pivot and pivot-target dictionaries, we connect the source and target entries that share at least one common translation in the pivot language. We consider each source-target pair a translation candidate. With our Japanese-English and English-Hungarian dictionaries we accumulated 436,966 Japanese-Hungarian translation candidates.

Step 2: translation pair selection

We examine the translation candidates one by one, looking up the source-pivot and target-pivot dictionaries and comparing the translations in the pivot language. There are six types of translations, which we label A-F and explain below.

First, we perform a strictly lexical match based only on the dictionaries. Next, using information extracted from WordNet, we attempt to identify the correct translation pairs.

(a) Lexically unambiguous translation pairs

Some of the translation candidates have exactly the same translations in the pivot language; we consider these pairs correct by default. Among the translation candidates we also identified a number of source entries that had only one target translation, and a number of target entries that had only one source translation. Being the sole candidates for the given entries, we consider these pairs correct as well. 37,391 Japanese-Hungarian translation pairs were retrieved with this method (type A pairs).

(b) Using sense description

For most polysemous words WordNet has detailed descriptions with synonyms for each sense. We use the synonyms of WordNet's sense descriptions to disambiguate the meanings of the common translations. For a given source-target translation candidate (s,t) we look up the source-pivot and target-pivot translations (s→I = {s→i_1, ..., s→i_n} and t→I = {t→i_1, ..., t→i_m}). We select the elements that are common to the two definitions (I' = (s→I) ∩ (t→I)) and look up their respective senses from WordNet (sns(I')). We identify the words' senses by comparing each synonym in WordNet's sense description with each word from the dictionary definition. As a result, for each common word we arrive at a certain set of senses from the source-pivot definitions (sns(s→I')) and a certain set of senses from the target-pivot definitions (sns(t→I')). We mark as score_B(s,t) the maximum ratio of the identical and total number of identified senses (Jaccard coefficient). The higher score_B(s,t) is, the more probable it is that candidate (s,t) is a valid translation.

    score_B(s,t) = max over i' ∈ (s→I) ∩ (t→I) of |sns(s→i') ∩ sns(t→i')| / |sns(s→i') ∪ sns(t→i')|    (1)

For example, 正解 (seikai: correct, right, correct interpretation) and helyes (correct, proper, right, appropriate) have two common translations (I' = {right, correct}), thus score_B(s,t) can be computed through these two words. The adjective right has 13 senses according to WordNet; among them 4 were identified from the Japanese-to-English definition (sns(right) = {#1, #3, #5, #10}, all identified through correct) and 5 from the Hungarian-to-English definition (sns(right) = {#1, #3, #5, #6, #10}, through correct or proper). As a result, 4 senses are common and 1 is different; thus the score through the adjective right is 0.8 (score_B(s,t)[right](正解, helyes)). The adjective correct has 4 senses, all of them recognized by both definitions through right, so the score through correct is 1 (score_B(s,t)[correct](正解, helyes)). The maximum of these scores is the final score: score_B(s,t)(正解, helyes) = 1.

All translation candidates are verified based on all four POS available from WordNet. Since synonymy information is available for nouns (N), verbs (V), adjectives (A) and adverbs (R), four separate scores are calculated, one for each POS. Scores that pass a global threshold are considered correct. 33,971 Japanese-Hungarian candidates (type B translations) were selected; for these two languages the global threshold was set to 0.1. Even this low value ensures that at least one of ten meanings is shared by the two entries of the pair, making it suitable as a translation pair.

(c) Using synonymy, antonymy and semantic categories

We expand the source-to-pivot and target-to-pivot definitions with information from WordNet (synonymy, antonymy and semantic categories, respectively). The similarity of the two expanded pivot language descriptions then gives a better indication of the suitability of the translation candidate. Using the three relations, the ratio of common to total translations (Jaccard coefficient) defines the appropriateness of the translation candidate:

    score_{C,D,E}(s,t) = |ext(s→i) ∩ ext(t→i)| / |ext(s→i) ∪ ext(t→i)|    (2)

Since the translations of the same word or concept into the pivot language share the same semantic value, with the extension with synonyms (ext(l→i) = (l→i) ∪ syn(l→i), where l = {s,t}) the extended translations should share more common elements.

In the case of antonymy, we expand the initial definitions with the antonyms of the antonyms (ext(l→i) = (l→i) ∪ ant(ant(l→i)), where l = {s,t}). This extension is different from the synonymy extension, the resulting set of words in most cases being considerably larger. Along with synonymy, antonymy is also available for nouns, verbs, adjectives and adverbs; four separate scores are calculated, one for each POS.
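The score_B computation of Eq. (1) is a maximum of Jaccard overlaps over the sense sets identified for each shared pivot word. A sketch, with the sense sets hard-coded to mirror the 正解/helyes example (in a real system they would come from WordNet lookups):

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def score_b(src_senses, tgt_senses):
    """Eq. (1): maximum Jaccard overlap of identified sense sets,
    taken over the pivot words common to both definitions.
    src_senses/tgt_senses: pivot word -> set of WordNet sense ids."""
    common = src_senses.keys() & tgt_senses.keys()
    return max((jaccard(src_senses[w], tgt_senses[w]) for w in common),
               default=0.0)

# Sense ids recognized for the shared pivot words "right" and "correct".
src = {"right": {1, 3, 5, 10}, "correct": {1, 2, 3, 4}}   # from 正解
tgt = {"right": {1, 3, 5, 6, 10}, "correct": {1, 2, 3, 4}}  # from helyes
print(score_b(src, tgt))  # 1.0: via "right" 4/5 = 0.8, via "correct" 4/4 = 1.0
```

The `default=0.0` case covers candidates whose common pivot words have no identifiable senses, which would otherwise make `max` fail.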

Semantic categories are provided by the tree structure (hypernymy/hyponymy) of the nouns and verbs of WordNet. We transpose each entry of the pivot translations to its semantic categories (ext(l→i) = Σ semcat(l→i), where l = {s,t}). We assume that correct translation pairs share a high percentage of semantic categories. Accordingly, the translations of semantically similar or identical entries should share a high number of common semantic categories.

The scores based on these relations highly depend on the number of pivot language translations; therefore we use the bidirectional selection method with local thresholds for each source and target head word. Local thresholds are set based on the best scoring candidate for a given entry. The thresholds were maxscore·0.9 for synonymy and antonymy, and maxscore·0.8 for the semantic categories (see §5.1 for details). Using synonymy, 196,775 candidate pairs were selected (type C); with antonymy, 99,614 pairs (type D); and with semantic categories, 195,480 pairs (type E).

(d) Combined semantic information

The three separate lists produced by the type C, D and E selection methods gave slightly different results, proving that they cannot be used as standalone selection methods (see §5.2 for details). Because of the multiple POS labelling of numerous words in WordNet, many translation pairs can be selected up to four times based on separate POS information (noun, verb, adjective, adverb), all within a single semantic information based method. Since we use a bidirectional selection method, experiments showed that translation pairs selected in both directions were in most cases correct translations, while translation pairs selected in only one direction were less accurate. In other words, translation pairs whose target language translation was selected as a good translation for the source language entry, and whose source language translation was also selected as a good translation for the target language entry, should be awarded a higher score. In the same way, entries selected in only one direction should receive a penalty.

For every translation candidate we select the maximum of the per-POS scores (noun, verb, adjective and adverb for the synonymy and antonymy relations; noun and verb for the semantic categories), multiplied by a multiplication factor (mfactor). The multiplication factor varies between 0 and 1, rewarding the candidates that were selected both times during the bidirectional selection and penalizing those selected in only a single direction. The product over the relations gives the combined score (score_F); c1, c2 and c3 are constants:

    score_F(s,t) = Π over rel of (c1 + max(score_rel(s,t))) · (c2 + c3 · mfactor_rel(s,t))    (3)

In the case of Japanese and Hungarian, this method scored best with the constants set to 1, 0.5 and 0.8, respectively. The combined score also highly depends on the word entry; therefore local thresholds are used in this selection method as well, empirically set to maxscore·0.85 (see §5.1 for details).

As an example, for the Japanese entry 購入 (kōnyū: buy, purchase) there are 10 possible Hungarian translations; using the above methods, 5 of them (#1, #7, #8, #9, #10) are selected as correct. Among these, only one (#1) is a correct translation; the rest have similar or totally different meanings. However, with the combined scores the faulty translations were eliminated and a new, correct, but previously average scoring translation (#2) was selected (Table 1).
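Equation (3) can be sketched as follows, assuming the per-relation, per-POS scores and the bidirectional mfactors have already been computed. The constants are the c1=1, c2=0.5, c3=0.8 values reported for Japanese-Hungarian; the input numbers themselves are only illustrative:

```python
def score_f(pos_scores, mfactors, c1=1.0, c2=0.5, c3=0.8):
    """Eq. (3): product over relations of
    (c1 + max per-POS score) * (c2 + c3 * mfactor).
    pos_scores: relation -> list of per-POS scores;
    mfactors:   relation -> bidirectional factor in [0, 1]."""
    total = 1.0
    for rel, scores in pos_scores.items():
        total *= (c1 + max(scores)) * (c2 + c3 * mfactors[rel])
    return total

# Illustrative per-POS scores for one candidate (N, V, A, R order for
# synonymy/antonymy; N, V for semantic categories).
pos_scores = {"syn": [0.193, 0.096, 0.0, 0.0],
              "ant": [0.0, 0.5, 0.0, 0.0],
              "semcat": [0.154, 0.5]}
mfactors = {"syn": 1.0, "ant": 1.0, "semcat": 1.0}  # selected in both directions
print(round(score_f(pos_scores, mfactors), 3))  # ~5.897 with these toy inputs
```

A candidate selected in only one direction would carry a smaller mfactor, shrinking every factor of the product and pushing the candidate below the local threshold.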

#    translation candidate           score_F | score_C (N, V, A, R)  | score_D (N, V, A, R) | score_E (N, V)
1    vétel (purchase)                2.012   | 0.193, 0.096, 0, 0    | 0, 0.500, 0, 0       | 0.154, 0.500
2    üzlet (business transaction)    1.387   | 0.026, 0.030, 0, 0    | 0, 0.250, 0, 0       | 0.020, 0.077
3    hozam (output, yield)           1.348   | 0.095, 0.071, 0, 0    | 0, 0, 0, 0           | 0.231, 0.062
4    emelőrúd (lever, purchase)      1.200   | 0.052, 0.079, 0, 0    | 0, 0, 0, 0           | 0.111, 0.067
5    előny (advantage, virtue)       1.078   | 0.021, 0.020, 0, 0    | 0, 0, 0, 0           | 0.054, 0.056
6    támasz (purchase, support)      1.053   | 0.014, 0.015, 0, 0    | 0, 0, 0, 0           | 0.037, 0.031
7    vásárlás (shopping)             0.818   | 0.153, 0.285, 0, 0    | 0, 0, 0, 0           | 0.273, 0.200
8    szerzemény (attainment)         0.771   | 0.071, 0.285, 0, 0    | 0, 0, 0, 0           | 0.136, 0.200
9    könnyítés (facilitation)        0.771   | 0.064, 0.285, 0, 0    | 0, 0, 0, 0           | 0.136, 0.200
10   emelőszerkezet (lever)          0.459   | 0.285, 0.285, 0, 0    | 0, 0, 0, 0           | 0.429, 0.200

Table 1: Translation candidate scoring for 購入 (buy, purchase); scores above the thresholds in bold in the original
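The local-threshold selection applied to candidate lists like the one in Table 1 can be sketched as below: for each head word, the candidates scoring within a factor c of that head word's best candidate are kept. In the paper this is run from both directions (grouping by source and by target head word); the sketch shows a single direction with invented scores:

```python
def select_local(scored, c=0.85):
    """Keep, for each head word, the candidates whose score is at
    least c times that head word's best score (local threshold).
    scored: head word -> {candidate: score}; c in [0, 1]."""
    kept = set()
    for head, cands in scored.items():
        best = max(cands.values())
        kept |= {(head, t) for t, s in cands.items() if s >= c * best}
    return kept

# Invented scores: "eb" is within 85% of the best candidate, "macska" is not.
scores = {"dog": {"kutya": 0.9, "eb": 0.8, "macska": 0.2}}
print(select_local(scores, c=0.85))  # keeps kutya and eb
```

Because the threshold is relative to each head word's own best score, every head word retains at least one candidate, which is what preserves recall compared to a single global cutoff.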

161,202 translation pairs were retrieved with this method (type F).

During pre-evaluation, type A and type B translations received a score above 75%, while type C, type D and type E scored low (see §5.2 for details). However, type F translations scored close to 80%; therefore from the six translation methods presented above we chose only three (types A, B and F) to construct the dictionary, while the remaining three (types C, D and E) are used only indirectly, for the type F selection. With the described selection methods, 187,761 translation pairs were generated, with 48,973 Japanese and 44,664 Hungarian unique entries.

5 Threshold settings and pre-evaluation

5.1 Local threshold settings

As a development set we considered all translation candidates whose Hungarian entry starts with "zs" (IPA: ʒ). We assume that the behaviour of this subset of words reflects the behaviour of the entire dictionary. 133 unique entries totalling 515 translation candidates comprise this development set. We manually scored the 515 translation candidates as correct (the translation conveys the same meaning, or the meanings are slightly different but in a certain context the translation is possible) or wrong (the translation pair's two entries convey a different meaning). The scoring was performed by one of the authors, who is a native Hungarian speaker fluent in Japanese. 273 entries were marked as correct.

Next, we experimented with a number of thresholds to determine which ones provide the best F-scores (Table 2). The F-scores were determined as follows: for example, using synonymy information (type C) with threshold = 0.85, 343 of the 515 translation pairs were above the threshold. Among these, 221 were marked as correct by our manual evaluator, the precision thus being 221/343·100 = 64.43 and the recall 221/273·100 = 80.95. The F-score is the harmonic mean of precision and recall (71.75 in this case).

selection    threshold value (%)
type         0.75     0.80     0.85     0.90     0.95
C            70.27    70.86    71.75    72.81    66.95
D            69.92    70.30    70.32    70.69    66.66
E            73.71    74.90    72.52    71.62    65.09
F            78.78    79.07    78.50    76.94    79.34

Table 2: Selection type F-scores with varying thresholds (best threshold values in bold)

5.2 Selection method evaluation

As a pre-evaluation of the above selection methods, we randomly selected 200 1-to-1 source-target entries produced by each method. The same evaluator scored the translation pairs as correct (the translation conveys the same meaning, or the meanings are slightly different but in a certain context the translation is possible), undecided (the translation pair's semantic value is similar, but a translation based on them would be faulty) or wrong (the translation pair's two entries convey a different meaning).

selection    evaluation score (%)
type         correct    undecided    wrong
A            75.5       6.5          18
B            83         7            10
C            68         5.5          26.5
D            60         9            31
E            71         5.5          23.5
F            79         5            16

Table 3: Selection type evaluation

The results showed that type A and type B selections scored higher than all the other selection types, with type C, type D and type E failing to deliver the desired accuracy (Table 3).

6 Evaluation

We performed three types of evaluation: (1) frequency-weighted recall evaluation; (2) 1-to-1 entry precision evaluation; (3) 1-to-multiple entry evaluation. For comparative purposes we also performed each type of evaluation for two other pivot language based methods whose characteristics permit them to be implemented with virtually any language pair. To do so, we constructed two other Hungarian-Japanese dictionaries using the methods proposed by Tanaka & Umemura and by Sjöbergh, using the same source dictionaries.

6.1 Recall evaluation

It is well known that one of the most challenging aspects of dictionary generation is word ambiguity. It is relatively easy to automatically generate the translations of low-frequency keywords, because they tend to be less ambiguous. On the contrary, the ambiguity of high-frequency words is much higher than that of their low-frequency counterparts, and as a result conventional methods fail to translate a considerable number of them.

However, this discrepancy is not reflected in traditional recall evaluation, since each word has an equal weight regardless of its frequency of use. We therefore performed a frequency-weighted recall evaluation. We used a Japanese frequency dictionary (FD) generated from the Japanese EDR corpus (Isahara, 2007) to weight each Japanese entry. Setting the standard to the frequency dictionary (its recall value being 100), we automatically search for each entry w of the frequency dictionary, checking whether or not it is included in the generated bilingual dictionary (WD). If it is recalled, we weight it with its frequency from the frequency dictionary:

    recall = ( Σ over w ∈ WD of frequency(w) / Σ over w ∈ FD of frequency(w) ) · 100    (4)

method                 recall
our method             51.68
Sjöbergh method        37.03
Tanaka method          30.76
initial candidates     51.68
Japanese-English(*)    73.23

Table 4: Recall evaluation results (* marks a manually created dictionary)

The frequency-weighted recall results show that our method's dictionary (51.68) outscores every other automatically generated dictionary (37.03, 30.76) by a significant margin. Moreover, it maintains the score of the initial translation candidates, therefore managing to maximize the recall value, owing to the bidirectional selection method with local thresholds. However, the recall value of a manually created Japanese-English dictionary is higher than that of any automatically generated dictionary (Table 4).

6.2 1-to-1 precision evaluation

With 1-to-1 precision evaluation we determine the translation accuracy of our method compared with the two baseline methods. 200 random pairs were selected from each of the three Hungarian-Japanese dictionaries and scored manually in the same way as in the selection type evaluation (correct, undecided, wrong) (Table 5). The manual scoring was performed by one of the authors, who is a native Hungarian speaker fluent in Japanese. Since no independent evaluator was available for these two languages, a random identification code was assigned to each of the 600 selected translation pairs (200 from each dictionary) and the pairs were mixed. The evaluator therefore did not know the origin of the translation pairs; the total score for each dictionary became available only after manual scoring, by regrouping based on the identification codes. The process was repeated 10 times; 2,000 pairs were manually checked from each dictionary.

code         Japanese entry                       Hungarian entry                    classification
k9g6 n5d8    報告 (hōkoku: information, report)   hír (report, information, news)    correct
j8h0 k1x5    初 (ubu: innocent, naive)            zöld (green, verdant)              undecided
a5b6 n8i3    エントリ (entori: entry)             bejárat (entry, entrance)          wrong

Table 5: 1-to-1 precision evaluation examples

method             correct    undecided    wrong
our method         79.15%     6.15%        14.70%
Sjöbergh method    54.05%     9.80%        36.15%
Tanaka method      62.50%     7.95%        29.55%

Table 6: 1-to-1 precision evaluation results

To rank the methods we only consider the correct translations. Our method performed best with an average of 79.15%, outscoring the Tanaka method's 62.50% and the Sjöbergh method's 54.05% (Table 6). The maximum deviation of the correct translations over the 10 repetitions was less than 3% from the average.

6.3 1-to-multiple evaluation

While with the 1-to-1 precision evaluation we estimated the accuracy of individual translation pairs, with the 1-to-multiple evaluation we calculate the true reliability of the dictionary, with the initial translation candidates set as the recall benchmark. When looking up the meanings or translations of a certain head word, the user, whether human or machine, expects all translations to be accurate. Therefore we evaluated 200 randomly selected Japanese entries from the initial translation candidates, together with all of their Hungarian translations, scoring them as correct (all translations are correct), acceptable (the good translations are predominant, but there are up to 2 erroneous translations), wrong (the number of wrong translations exceeds 2) or missing (the translation is missing) (Table 7). The same type of mixed, manual evaluation was performed by the same author on samples of 200 entries from each Japanese-Hungarian dictionary. This evaluation was also repeated 10 times. To rank the methods, we only consider the correct translations.
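Equation (4) in code, with invented toy frequencies (the real weights come from the EDR-based frequency dictionary):

```python
def weighted_recall(freq_dict, generated_entries):
    """Eq. (4): frequency mass of the recalled entries divided by the
    total frequency mass of the frequency dictionary, as a percentage.
    freq_dict: entry -> corpus frequency; generated_entries: set of
    head words present in the generated bilingual dictionary."""
    covered = sum(f for w, f in freq_dict.items() if w in generated_entries)
    return covered / sum(freq_dict.values()) * 100

# Toy data: missing a single high-frequency entry costs far more
# recall than missing several rare ones.
freq_dict = {"する": 9000, "購入": 120, "正解": 80, "鳴らす": 40}
generated = {"購入", "正解"}
print(weighted_recall(freq_dict, generated))
```

This is why the measure penalizes methods that drop ambiguous high-frequency words, even if they recall many rare entries.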

71.45%, outperforming the Sjöbergh method's 61.65% and the Tanaka method's 46.95% (Table 8).

Table 7: 1-to-multiple entry evaluation examples

code: j4h8m9x5 | entry: 圧縮 (asshuku: compression, squeeze) | classification: correct
  translations: összenyomás (compression, crush, squeeze: correct); összeszorítás (compression, confinement: correct); zsugorítás (shrinkage: correct)

code: h9j9l3v1 | entry: 底面 (teimen: base) | classification: acceptable
  translations: alap (base, bottom, foundation: correct); alapzat (base, bed, bottom: correct); lúg (alkali, base: undecided); támpont (base: correct)

code: l0k6m3n7 | entry: 鳴らす (narasu: to sound, to ring, to beat) | classification: wrong
  translations: bekerít (to encircle, to enclose, to ring: wrong); cseng (to clang, to clank, to ring, to tinkle: correct); hangzik (to ring, to sound: correct); horkan (to snort: wrong); üt (to bang, to knock, to ring: wrong)

Table 8: 1-to-many evaluation results (evaluation score, %)

method            correct   acceptable   wrong   missing
our method        71.45     13.85        14.70    0
Sjöbergh method   61.65     11.30        15.00   12.05
Tanaka method     46.95      3.35         9.10   40.60

7 Discussion

Based on the recall evaluations, the traditional methods showed their major weakness by losing substantially from the initial recall values scored by the initial translation candidates. Our method maintains the same value as the translation candidates, but we cannot say that the recall is perfect: when compared with a manually created dictionary, our method also lost significantly. The precision evaluation also showed an improvement over the traditional methods, our method outscoring the other two in the 1-to-1 precision evaluation. Its 1-to-multiple score was also the highest, proving that WordNet based methods outperform dictionary based methods. Discussing the weaknesses of our system, we divide the problems into two categories: recall problems deal with the difficulty in connecting the target and source entries through the pivot language, while precision problems concern the reasons why erroneous pairs are produced.

7.1 Recall problems

We managed to maximize the recall of our initial translation candidates, but in many cases certain translation pairs still could not be generated because the link from the source language to the target language through the pivot language simply does not exist. The main reasons are: the entry is missing from at least one of the dictionaries; translations in the pivot language are expressions or explanations; or there is no direct translation or link between the source and target entries. The entries that could not be recalled are mostly expressions, rare entries and words specific to a language (e.g. tatami: floor-mat, or gulyás: goulash). Moreover, a number of head words have no synonym, antonym and/or hypernymy/hyponymy information in WordNet, and as a result these words could not participate in the type B, C, D, E and F scoring.

7.2 Precision problems

We identified two types of precision problems. The most obvious reasons for erroneous translations are the polysemous nature of words and the meaning-range differences across languages. With words whose senses are clear and mostly preserved even through the pivot language, most of the correct senses were identified and correctly translated. Nouns, adjectives and adverbs had a relatively high degree of accuracy. However, verbs proved to be the most difficult POS to handle: because semantically they are more flexible than other POS categories, and the meaning range is also highly flexible across languages, the identification of the correct translation is increasingly difficult. For this reason, the number of faulty translations and the number of meanings that are not translated were relatively high.

One other source of erroneous translations is the quality of the initial dictionaries. Even the unambiguous type A translations fail to produce the desired accuracy, although they are the unique candidate for a given word entry. The main reason for this is the deficiency of the initial dictionaries, which contain a great number of irrelevant or low-usage translations, shadowing the main, important senses of some words. In other cases the resource dictionaries do not contain translations of all meanings; homonyms are present as pivot entries with different meanings, sometimes creating unique but faulty links.
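The recall problems discussed in Section 7.1 arise from composing two dictionary lookups through the pivot language. A minimal sketch illustrates this; the dictionary entries below are hypothetical toy data, not the paper's actual resources, and show how a missing pivot-side entry silently drops a source word:

```python
# Minimal sketch of pivot-based dictionary linking (toy, hypothetical data).
# A source-target pair can only be generated when some pivot word occurs in
# both dictionaries, so a missing pivot-side entry silently drops the entry.

ja_en = {  # source -> pivot (Japanese -> English)
    "圧縮": ["compression", "squeeze"],
    "畳":   ["tatami"],  # culture-specific word: its pivot has no onward entry
}
en_hu = {  # pivot -> target (English -> Hungarian)
    "compression": ["összenyomás", "összeszorítás"],
    "squeeze":     ["összenyomás"],
}

def link_through_pivot(src_piv, piv_tgt):
    pairs = {}
    for src, pivots in src_piv.items():
        targets = set()
        for p in pivots:
            # a pivot word absent from the second dictionary contributes nothing
            targets.update(piv_tgt.get(p, []))
        if targets:
            pairs[src] = sorted(targets)
    return pairs

print(link_through_pivot(ja_en, en_hu))
# "畳" does not appear in the output: its only pivot word, "tatami",
# has no English-Hungarian entry.
```

This is exactly the first failure mode listed above (an entry missing from one of the dictionaries); expressions and explanation-style pivot glosses fail in the same way, since they never match a pivot head word.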

8 Conclusions

We proposed a new pivot language based method to create bilingual dictionaries that can be used as a translation resource for machine translation. In contrast to conventional methods that use dictionaries only, our method uses the WordNet of the pivot language as its main resource to select the suitable translation pairs. As a result, we eliminate most of the weaknesses caused by the structural differences of dictionaries, while profiting from the semantic relations provided by WordNet. We believe that because of the nature of our method it can be re-implemented for most language pairs.

In addition, owing to features such as the bidirectional selection method with local thresholds, we managed to maximize recall while maintaining a precision better than any other compared method's score. As an exemplification, we generated a mid-large sized Japanese-Hungarian dictionary with relatively good recall and promising precision. The dictionary is freely available online (http://mj-nlp.homeip.net/mjszotar) and is also downloadable on request.

References

Agirre, E., Alegria, I., Rigau, G., Vossen, P. 2007. MCR for CLIR, Procesamiento del Lenguaje Natural, 38, pp. 3-15.

Bond, F., Ogura, K. 2007. Combining linguistic resources to create a machine-tractable Japanese-Malay dictionary, Language Resources and Evaluation, 42(2), pp. 127-136.

Breen, J.W. 1995. Building an Electronic Japanese-English Dictionary, Japanese Studies Association of Australia Conference, Brisbane, Queensland, Australia.

Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roossin, P. 1988. A Statistical Approach to Language Translation, Proceedings of COLING-88, pp. 71-76.

Brown, R.D. 1997. Automated Dictionary Extraction for Knowledge-Free Example-Based Translation, Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 111-118.

Isahara, H. 2007. EDR – present status (EDR 電子化辞書の現状), NICT-EDR symposium, pp. 1-14. (in Japanese)

Isahara, H., Bond, F., Uchimoto, K., Uchiyama, M., Kanzaki, K. 2008. Development of Japanese WordNet, Proceedings of LREC-2008.

Kay, M., Röscheisen, M. 1993. Text-Translation Alignment, Computational Linguistics, 19(1), pp. 121-142.

Miháltz, M., Prószéky, G. 2004. Results and Evaluation of Hungarian Nominal WordNet v1.0, Proceedings of the Second Global WordNet Conference, pp. 175-180.

Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J. 1990. Introduction to WordNet: An On-line Lexical Database, International Journal of Lexicography, 3(4), pp. 235-244.

Paik, K., Bond, F., Shirai, S. 2001. Using Multiple Pivots to align Korean and Japanese Lexical Resources, NLPRS-2001, pp. 63-70, Tokyo, Japan.

Peters, W., Vossen, P., Díez-Orzas, P., Adriaens, G. 1998. Cross-linguistic Alignment of Wordnets with an Inter-Lingual-Index, Computers and the Humanities, 32, pp. 221-251.

Prószéky, G., Miháltz, M., Nagy, D. 2001. Toward a Hungarian WordNet, Proceedings of the NAACL 2001 Workshop on WordNet and Other Lexical Resources, Pittsburgh, June 2001.

Shirai, S., Yamamoto, K. 2001. Linking English words in two bilingual dictionaries to generate another pair dictionary, ICCPOL-2001, pp. 174-179.

Sjöbergh, J. 2005. Creating a free digital Japanese-English lexicon, Proceedings of PACLING 2005, pp. 296-300.

Stamou, S., Oflazer, K., Pala, K., Christoudoulakis, D., Cristea, D., Tufiş, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M. 2002. BalkaNet: A Multilingual Semantic Network for the Balkan Languages, Proceedings of the International Wordnet Conference, Mysore, India.

Tanaka, K., Umemura, K. 1994. Construction of a bilingual dictionary intermediated by a third language, Proceedings of COLING-94, pp. 297-303.

Vossen, P. 1998. Introduction to EuroWordNet, Computers and the Humanities, 32, pp. 73-89. Special Issue on EuroWordNet.
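For illustration, the bidirectional selection with local thresholds mentioned in the conclusions can be sketched roughly as follows. The scoring values, the ratio-based threshold rule and the fallback for unseen reverse entries are all hypothetical stand-ins, not the paper's actual type A-F WordNet scoring:

```python
# Rough, hypothetical sketch of bidirectional selection with local thresholds.
# Each entry keeps candidates relative to its own best score (a local
# threshold), and a pair survives only if selected in both directions.

def select(scored, ratio=0.8):
    """Keep candidates scoring within `ratio` of this entry's own best score,
    so the cut-off adapts to each entry rather than being a global constant."""
    best = max(scored.values())
    return {cand for cand, score in scored.items() if score >= ratio * best}

def bidirectional(src_scores, tgt_scores, ratio=0.8):
    pairs = set()
    for src, cands in src_scores.items():
        for tgt in select(cands, ratio):
            # keep (src, tgt) only if src also survives selection from tgt's
            # side; a target with no reverse scores defaults to accepting src
            if src in select(tgt_scores.get(tgt, {src: 1.0}), ratio):
                pairs.add((src, tgt))
    return pairs

# Toy scores: "üt" falls below 鳴らす's local threshold (0.8 * 0.9 = 0.72)
src_scores = {"鳴らす": {"cseng": 0.9, "hangzik": 0.8, "üt": 0.3}}
tgt_scores = {"cseng": {"鳴らす": 0.9}, "hangzik": {"鳴らす": 0.7}}
print(sorted(bidirectional(src_scores, tgt_scores)))
```

A per-entry threshold lets productive entries keep several near-best translations, which is one plausible way such a scheme can raise recall without accepting uniformly low-scoring candidates.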
