
Japanese-English Cross-Language Headword Search of Wikipedia

Satoshi Sato and Masaya Okada
Graduate School of Engineering, Nagoya University
Chikusa-ku, Nagoya, Japan
[email protected] masaya [email protected]

Abstract

This paper describes a Japanese-English cross-language headword search system of Wikipedia, which enables a user to find an appropriate English article from a given Japanese query term. The key component of the system is a term translator, which selects an appropriate English headword among the set of headwords in English Wikipedia, based on the framework of non-productive machine translation. An experimental result shows that the translation performance of our system is equal to or slightly better than that of commercial machine translation systems, Google Translate and Mac-Transer.

1 Introduction

Many people use Wikipedia, an encyclopedia on the Web, to find meanings of unknown terms. Among the many language versions of Wikipedia, the English version (EnWiki) is the largest; its size is five times larger than that of the Japanese version (JaWiki).

For Japanese people, Japanese articles are the most convenient and easy to read. In the case that JaWiki has no appropriate article, English articles are the second best, from which they obtain some knowledge according to their English skills; this is much better than nothing.

A problem arises here. How do they consult EnWiki? What term should they input? In this situation, the only thing that they know is the Japanese term itself; no meaning and no translation.

The best solution is cross-language headword search (CLHS), where a user inputs a Japanese term and a system retrieves the appropriate English article. To realize CLHS, term translation is required.

Term translation is not a mainstream topic of machine translation research, because a simple dictionary-based method is widely used and is sufficient for general terms. For technical terms and proper nouns, automatic extraction of translation pairs from bilingual corpora and the Web has been intensively studied (e.g., (Daille et al., 1994) and (Lin et al., 2008)) in order to fill the lack of entries in a bilingual dictionary. However, the quality of automatic term translation is not well examined, with a few exceptions (Baldwin and Tanaka, 2004). Query translation in cross-language information retrieval is related to term translation, but it concentrates on translation of content words in a query (Fujii and Ishikawa, 2001).

In the CLHS situation, the ultimate goal of translation is to find the right headword in the target encyclopedia. In other words, unlike a general situation of translation, a set of translation candidates is clearly defined as a finite set. Therefore, the translation can be simplified to selection of the appropriate one from the set, that is, the right headword from the set of headwords of the target encyclopedia. This fact brings a new framework of term translation, namely non-productive machine translation (NPMT).

The rest of the paper is organized as follows. Section 2 describes an overview of a Japanese-English CLHS system of Wikipedia. Section 3 describes the formal definition of the NPMT framework and its algorithm. Section 4 describes its extension for Japanese-English term translation. Section 5 describes an experimental result.

Proceedings of the 9th International Conference on Terminology and Artificial Intelligence, TIA 2011, pages 44–50, Paris, 8–10 November 2011

Figure 1: Screen shot of the system

2 Overview of the System

The system works as an interface to JaWiki and EnWiki. The target users are Japanese people who speak Japanese as their first language and read English texts to a certain level. For a given Japanese term, the system first tries to retrieve its Japanese article. If no article is found, the system translates the Japanese term into English. If a single translation is obtained, the system displays its English article. If more than one translation is obtained, the system enumerates these translations for the user's selection. If no translation is obtained, the system reports this.

Figure 1 shows a screen shot of the system, where the input term is “反結合性オービタル (antibonding orbital).” Because JaWiki has no article for the term, the system shows the English article “Antibonding”, which is found by redirection from the obtained translation, “antibonding orbital.”

3 NPMT Framework

For term translation, we use the non-productive machine translation (NPMT) framework. This framework does not produce new translations; it just selects one (or more) from a large pool of candidates, called the target list. The assumption behind this framework is that, for every term, a translation is already available. The task of term translation is just to find it.

3.1 Translation Grammar

For the formal definition of the NPMT framework, we first introduce a simple grammar that produces a set of translation pairs.

\[ G = (A, B, D) \tag{1} \]

A grammar G consists of three components: A, a set of words in the source language; B, a set of words in the target language; and D, a bilingual dictionary, which is a set of translation rules (bilingual pairs). A rule r ∈ D takes the following form.

\[ r = \langle \alpha, \beta \rangle \quad \text{where } \alpha \in A^*,\ \beta \in B^*,\ \max(|\alpha|, |\beta|) \ge 1 \tag{2} \]

 1 def npmt_je(dic, tlist, s)   # s: the source term as an array of characters
 2   sl = s.length; table = []; table[0] = ['']
 3   1.upto(sl) do |k|
 4     table[k] = []
 5     0.upto(k-1) do |p|
 6       (dic[s[p, k-p].join('')] || []).each do |tt|
 7         # a bilingual pair is found in dic
 8         table[p].each do |tp|
 9           tk = (tp == '' ? tt : (tt == '' ? tp : [tp, tt].join(' ')))
10           table[k] << tk if tlist.find{|t| t.start_with?(tk)}   # prefix-filtering
11         end
12       end
13     end
14   end
15   table[sl].select{|t| tlist.member?(t)}
16 end

Figure 2: Skeleton of NPMT algorithm in Ruby
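As a rough illustration of how the skeleton in Figure 2 behaves, the following self-contained sketch reimplements it on a toy input. The function name npmt_sketch, the dictionary DIC, and the target list TLIST are hypothetical illustration data, not the resources used in the experiments; the sketch also removes duplicate partial translations, which the skeleton does not do.

```ruby
# Toy NPMT sketch: dynamic programming over all segmentations of the
# source term, with prefix filtering against the target list.
# DIC and TLIST are made-up illustration data.
def npmt_sketch(dic, tlist, term)
  chars = term.chars
  table = [['']]            # table[k]: partial translations of the first k characters
  1.upto(chars.length) do |k|
    table[k] = []
    0.upto(k - 1) do |p|
      # look up the substring that spans characters p...k
      (dic[chars[p, k - p].join] || []).each do |tt|
        table[p].each do |tp|
          # concatenate with a space, skipping empty parts
          tk = [tp, tt].reject(&:empty?).join(' ')
          # prefix filtering: keep tk only if some target term starts with it
          table[k] << tk if tlist.any? { |t| t.start_with?(tk) }
        end
      end
    end
    table[k].uniq!
  end
  # only complete members of the target list are output
  table[chars.length].select { |t| tlist.include?(t) }
end

DIC = {
  '結合' => ['bonding', 'binding'],
  '性' => [''],                       # functional element translated into the empty string
  'オービタル' => ['orbital']
}.freeze
TLIST = ['bonding orbital', 'antibonding orbital', 'binding energy'].freeze

p npmt_sketch(DIC, TLIST, '結合性オービタル')   # => ["bonding orbital"]
p npmt_sketch(DIC, TLIST, '結合')               # => []
```

In the first call, the partial translation "binding" survives the prefix filter (it is a prefix of "binding energy"), but "binding orbital" does not, so only "bonding orbital" reaches the output. The second call returns nothing because neither "bonding" nor "binding" is itself a member of the target list.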

In this grammar, a rule sequence δ ∈ D* produces a translation pair,

\[ \delta = r_1 r_2 \cdots r_n \tag{3} \]
\[ = \langle \alpha_1, \beta_1 \rangle \langle \alpha_2, \beta_2 \rangle \cdots \langle \alpha_n, \beta_n \rangle \tag{4} \]
\[ = \langle \alpha_1 \alpha_2 \cdots \alpha_n,\ \beta_1 \beta_2 \cdots \beta_n \rangle \tag{5} \]

where each rule corresponds to a local mapping between α1 α2 ··· αn and β1 β2 ··· βn. Hereafter, we write the source side and the target side of δ as src(δ) and tgt(δ), respectively.

A language L (i.e., a set of translation pairs) generated by a grammar G is defined as follows.

\[ L(G) = \{ \langle \mathrm{src}(\delta), \mathrm{tgt}(\delta) \rangle \mid \delta \in D^* \} \tag{6} \]

We use this grammar framework for defining a set of theoretically-possible translation pairs.

3.2 Non-Productive Machine Translation

Theoretically-possible translation pairs are not always actually-observed or valid translation pairs. Usually a very small portion of L(G) is actually observed and valid. Therefore we need a device to select valid members from L(G).

For this purpose, we introduce a target list T ⊂ B*, which is a model of actually-observed terms in the target language. By using a target list, we define the framework of non-productive machine translation (NPMT) as follows.

Given a grammar G = (A, B, D), a source term σ ∈ A*, and a target list T ⊂ B*,
Find
\[ \Delta = \{ \delta \mid \delta \in D^*,\ \mathrm{src}(\delta) = \sigma,\ \mathrm{tgt}(\delta) \in T \} \]
Output
\[ \{ \mathrm{tgt}(\delta) \mid \delta \in \Delta \} \]

From this definition, we can see that a member of the output (i.e., tgt(δ)) is always a member of T. In other words, this framework always outputs actually-observed terms; it does not produce new terms that have not been observed yet. The name non-productive is derived from this fact.

3.3 Algorithm

The algorithm for finding Δ is not trivial. We use a simplified version of Sato's algorithm (Sato, 2010), where prefix-filtering and dynamic programming are used to reduce the search space.

Figure 2 shows a skeleton of our algorithm in Ruby. Three arguments, dic, tlist, and s, correspond to a bilingual dictionary D, a target list T, and a source term σ, respectively. At line 6, the program tries to find a dictionary entry for a substring of the input term; because of the double loops in lines 3–14 and 5–13, all possible segmentations of the input term are examined [1]. Line 10 corresponds to the prefix-filtering; a partial translation tk, which is a translation of the first k characters of the input term, is stored in table[k] only if tk is a prefix of a member of the target list.

[1] No Japanese morphological analyzer is used for segmentation of terms, because segmentation errors cannot be recovered in term translation.

4 Extension for Japanese-English Term Translation

For a given source term σ, the NPMT framework produces the correct translation τ when the following two conditions are satisfied.

[Figure 3 diagram: a Japanese string s (any substring of an input term) is passed through the variant generator and the functional-element attacher to the bilingual dictionary D. If s is a string of Katakana characters and |s| ≥ 5, it is also passed to the back-transliterator. If s is a string of non-Japanese characters, the result is s itself. If s is a functional element that can be dropped, the result is the empty string. These results together form the output of the extended look-up.]

Figure 3: Diagram of extended dictionary look-up

1. τ is a member of the target list T, and
2. δ = ⟨σ, τ⟩ can be produced by the bilingual dictionary D, i.e., δ ∈ D*.

In the CLHS situation, the first condition is always satisfied by using all headwords in the target language as T, except the case when no translation is the correct answer. However, the second condition is sometimes not satisfied because of the limited coverage of the bilingual dictionary D; no translation is obtained as a result. In order to reduce such cases, we extend D in several ways.

In practice, we extend the process of dictionary look-up, not the bilingual dictionary itself. Figure 3 shows the diagram of the extended dictionary look-up. The dashed box of this figure works as a virtually-extended bilingual dictionary.

4.1 Spelling Variants

Many Japanese words have more than one spelling (The National Language Research Institute, 1983). Usually, for each word, one spelling variant is stored in a bilingual dictionary and the others are not, because Japanese people easily recognize them. Typical types of spelling variants are shown in Table 1.

Ideally, the variant problem would be solved by dictionary extension, where all variants are generated and stored in a dictionary in advance. The same result is obtained by introducing a variant-generation module in the process of dictionary look-up.

For the first two types in Table 1, we use a small number of variant-generation rules. For the last type, i.e., variants related to Kanji, we use the Hyouki Tougou dictionary (a dictionary of variants), which is provided by the National Institute for Japanese Language and Linguistics, to generate variants. All generated variants are used to find entries in a bilingual dictionary.

4.2 Transliterations

The Japanese language has many terms imported from English in the form of transliteration. However, a bilingual dictionary does not store all of them, especially when the English words have other Japanese translations. For example, the English word “orbital” has a translation “軌道 (orbital)” and a transliteration “オービタル (orbital)”; the latter is usually not stored in a bilingual dictionary, because it is obvious to Japanese people.

This transliteration problem can be solved by introducing a back-transliterator, which produces the original English spelling from a transliteration in Japanese. In practice, we use a back-transliterator based on non-productive machine transliteration (Sato, 2010). It is called when an input string of the dictionary look-up is a Katakana string whose length is five characters or more; its output is merged with other outputs.

4.3 Non-Japanese characters

Some Japanese terms contain non-Japanese characters, such as the Latin alphabet, Arabic numbers, and symbols. For example,

(1) a. ABS プラスチック (ABS plastic)
    b. マイクロサテライト DNA (microsatellite DNA)

Table 1: Types of spelling variants

type                            example                                      English
1. number variants              第2 (Arabic) / 第二 (Kanji)                    second
2. Katakana variants            モホロヴィチッチ / モホロビチッチ                 Mohorovičić
3. Variants related to Kanji
   a. Kanji-Kanji variants      浸蝕 (old style) / 浸食 (current style)        erosion
   b. Kanji-Katakana variants   蛋白細胞 (Kanji) / タンパク細胞 (Katakana)      albuminous cell
   c. Kanji-Hiragana variants   動物澱粉 (Kanji) / 動物でんぷん (Hiragana)      animal starch
   d. Okurigana variants        独立組合せ / 独立組み合わせ                     independent assortment
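The look-up extensions of Sections 4.1 to 4.3 can be pictured as a thin wrapper around the plain dictionary look-up. The sketch below is only an illustration under stated assumptions: BASE_DIC, DIGIT_TO_KANJI, BACK_TRANSLIT, variants, and extended_lookup are hypothetical names, the single variant rule covers only number variants, and the back-transliterator is replaced by a fixed table rather than the transliteration method cited in the text.

```ruby
# Sketch of an extended dictionary look-up (illustrative data only).
BASE_DIC = { '第二' => ['second'], 'プラスチック' => ['plastic'] }.freeze

# Hypothetical variant rule: Arabic digit -> Kanji numeral (number variants).
DIGIT_TO_KANJI = { '1' => '一', '2' => '二', '3' => '三' }.freeze

# Hypothetical stand-in for a back-transliterator: a fixed table.
BACK_TRANSLIT = { 'オービタル' => ['orbital'] }.freeze

# Generate spelling variants of s (the original spelling is always included).
def variants(s)
  [s, s.gsub(/\d/) { |d| DIGIT_TO_KANJI.fetch(d, d) }].uniq
end

def extended_lookup(s)
  # 1. look up every spelling variant in the base dictionary
  out = variants(s).flat_map { |v| BASE_DIC[v] || [] }
  # 2. a Katakana string of five or more characters also goes to the back-transliterator
  out += (BACK_TRANSLIT[s] || []) if s.match?(/\A[ァ-ヶー]{5,}\z/)
  # 3. a string of non-Japanese characters translates to itself
  out << s if s.match?(/\A[A-Za-z0-9]+\z/)
  out.uniq
end

p extended_lookup('第2')          # => ["second"]
p extended_lookup('オービタル')    # => ["orbital"]
p extended_lookup('ABS')          # => ["ABS"]
```

Because the wrapper has the same interface as a plain look-up (a string in, a list of translations out), it could replace the dic[...] access of the NPMT algorithm without changing the algorithm itself, which is the sense in which the dictionary is extended only virtually.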

The components that consist of non-Japanese characters, such as “ABS” and “DNA”, were imported from other languages (mainly English). Therefore, English translations of these components are identical to the components themselves. This can be handled by adding identical pairs, such as ⟨ABS, ABS⟩ and ⟨DNA, DNA⟩, virtually to a bilingual dictionary.

4.4 Translation of Compounds

Wikipedia's headwords are typically nouns; noun+noun construction and adjective+noun construction are possible in both languages. Translations of compound nouns are basically compositional, but POS changes (e.g., noun ↔ adjective) are observed in actual translations. For example,

(2) a. multifactorial inheritance
    b. 多因子の (multifactorial; adj.) 遺伝
    c. 多因子 (multifactor; noun) 遺伝

(3) a. bonding orbital
    b. 結合 (bonding; noun) オービタル
    c. 結合性 (bonding; adj.) オービタル

where (a) is the original English term, (b) is the literal translation, and (c) is the actually-observed translation. For back-translation, bilingual pairs such as ⟨多因子 (noun), multifactorial (adj.)⟩ and ⟨結合性 (adj.), bonding (noun)⟩ are necessary. They are similar to morphosyntactic term variations within a single language, i.e., nominalisation of adjectives and adjectivisation of nouns (Daille et al., 1996), but they are observed in term translation between two languages.

In Japanese, functional suffixes and particles such as “性 (-nature)” and “の (of)” produce adjectives from nouns; by attaching or dropping these functional elements in the process of dictionary look-up, the inter-POS entries can be retrieved. In practice, we handle a limited number of functional elements, shown in Table 2. Attaching functional elements is executed before dictionary look-up. In contrast, dropping functional elements is not executed; these elements are simply translated into the empty string, as shown in Figure 3, because the NPMT algorithm examines all possible segmentations of an input term, as described in Section 3.3.

Table 2: List of functional elements

action   functional elements
attach   的な, な, 性の, する
drop     類, の, 的, 性, 用, 法, 論, 式, 化, 型, 術, 症, 病

4.5 Word Order

In Japanese-English term translation, word order is usually preserved. A typical exception is the “の (of)”-compound, e.g., “自由落下 (free fall) の (of) 加速度 (acceleration)” is translated into “acceleration of free fall.”

The NPMT framework can handle word-order change by introducing a dummy input, such as “加速度 (acceleration) $の$ (of) 自由落下 (free fall)”, where “$の$” is a special symbol that represents the inverse of “の.” Note that this is an extension of the NPMT framework, not of a bilingual dictionary.

4.6 Incremental Extension

We use the above-mentioned extensions incrementally, as shown in Table 3, because their safety levels vary. The upper level is used only when the lower level produces no translation.

Table 3: Extension levels

extension module          L1   L2
spelling variants         √    √
transliterations          √    √
non-Japanese characters   √    √
word order                √    √
inter-POS                      √

5 Experiment

5.1 Experimental Setup

The bilingual dictionary D was generated from Eijiro ver.116, which is the largest English-Japanese dictionary for humans. We extracted all entries and reliable component pairs by using a method similar to (Fujii and Ishikawa, 2001); the size is 2,366,870. The target list T is the set of headwords in EnWiki; the size is 5,907,150.

The test set was created from the Japanese translation of Oxford Dictionary of Science (Daintith, 2009). We selected all translation pairs ⟨j, e⟩ that satisfy three conditions: (a) j appears more than ten times on the Web; (b) j is not a headword in JaWiki; (c) e is a headword in EnWiki. The size of the test set is 2,549.

5.2 Result

For every test pair ⟨j, e⟩, we use j as an input and e as a reference. We judge an obtained translation e′ as correct when both e and e′ indicate the same article in EnWiki.

We classify a result per input into four classes, because the NPMT framework may produce multiple translations.

Perfect    All produced translations are correct.
Ambiguous  Some produced translations are correct.
False      All produced translations are incorrect.
None       No translation is produced.

Table 4 shows the experimental result. In this table, the first column corresponds to a baseline, which is a simple dictionary-based translation using Eijiro. The second column (Eijiro w/T) is another baseline, with the target list T, where produced translations are restricted to members of T; the ambiguous class reduces from 20.8% to 7.3%, and the false class reduces from 13.6% to 5.8%. From this fact, we can see that the target list works effectively in translation selection and disambiguation.

Table 4: Experimental result

            Eijiro          Eijiro w/T      NPMT            Google          Mac-Transer     NPMT+
Perfect     633 (24.8%)     977 (38.3%)     1,218 (47.8%)   1,530 (60.0%)   1,591 (62.4%)   1,619 (63.5%)
Ambiguous   530 (20.8%)     186 (7.3%)      539 (21.1%)     –               –               –
subtotal    1,163 (45.6%)   1,163 (45.6%)   1,757 (68.9%)   –               –               –
False       346 (13.6%)     148 (5.8%)      285 (11.2%)     977 (38.3%)     892 (35.0%)     423 (16.6%)
None        1,040 (40.8%)   1,238 (48.6%)   507 (19.9%)     42 (1.6%)       66 (2.6%)       507 (19.9%)
total       2,549 (100%)    2,549 (100%)    2,549 (100%)    2,549 (100%)    2,549 (100%)    2,549 (100%)

NPMT is much better than the two baselines; the coverage increases from 45.6% to 68.9%. One may notice that the portion of the ambiguous class is large (21.1%). However, because the average number of produced translations is 4.1 (1.4 correct and 2.7 incorrect) in this class, it is not so difficult to find the correct article by checking all articles corresponding to the produced translations. In conclusion, for 68.9% of Japanese inputs, we can reach the appropriate English articles by using the system.

The right part of Table 4 shows the comparison with commercial machine translation systems: Google Translate and Mac-Transer. Because these two systems do not produce multiple translations, we attached a simple final selector, which uses the hit count of the Web, to the NPMT system. From this table, we can see that the performance of our system (NPMT+) is equal to or slightly better than that of the other two systems. A statistical test confirms that the difference between Google and NPMT+ is statistically significant (α = 0.05); the difference between Google and Mac-Transer and the difference between Mac-Transer and NPMT+ are not.

The translation performance varies according to the term length measured by the number of words in the English term [2], as shown in Table 5. In general, the translation accuracy of single-word terms (l=1) is lower than that of multi-word terms (l=2+). Among the three systems, Mac-Transer is the best for single-word terms; this is probably because Mac-Transer has the best bilingual dictionary. In contrast, NPMT+ is the best for multi-word terms; this superiority is achieved by the NPMT framework and its extensions.

[2] The length of a Japanese term is not obvious because it has no white space between words.

Table 5: Performance of different term lengths

l    system        P     A     (P+A)     F     N
1    Google        758   –     (53.4%)   634   27
     Mac-Transer   813   –     (57.3%)   558   48
     NPMT+         759   –     (53.5%)   341   319
     (NPMT         512   348   (60.6%)   240   319)
2+   Google        772   –     (68.3%)   343   15
     Mac-Transer   778   –     (68.8%)   334   18
     NPMT+         860   –     (76.1%)   82    188
     (NPMT         706   191   (79.4%)   45    188)

Table 6 shows the contribution of each extension level. The first line shows the performance of the system with no extensions, which is much worse than that with full extensions. The other lines show the performance improvements with extensions. From this table, we can confirm that each of the two extension levels is effective and that the lower level is safer.

Table 6: Contribution of each extension level

level            P      A      F     N
(no extension)   1041   374    219   915
L1               +95    +111   +17   −223
L2               +82    +54    +49   −185
total            +177   +165   +66   −408

Table 7 shows the contribution of each extension module, which is calculated by the performance drop when it is removed from the system. This table shows that two modules, transliterations and inter-POS, make large contributions. The word-order extension makes almost no contribution for this test set.

Table 7: Contribution of each extension module

w/o extension        P      A     F     N
none                 1218   539   285   507
spelling variants    −11    −29   −11   +51
transliterations     −41    −85   −3    +129
non-Japanese char.   −12    +1    ±0    +11
word order           −1     +1    −1    +1
inter-POS            −82    −54   −49   +185

5.3 Discussion

The above experimental result shows that the NPMT framework and its extensions work well in Japanese-English headword search. The target list works effectively in translation selection and disambiguation. The extension modules improve translation performance.

For 20% of inputs, however, the system produces no translation; the portion is not small enough. No translation is caused by the limited coverage of the bilingual dictionary, described in Section 4. In addition to the virtual extension, actual enlargement of the bilingual dictionary is required to achieve higher performance, especially for translation of single-word terms.

An interesting phenomenon that we observed in the experiment is that the translation accuracy of single-word terms is lower than that of multi-word terms. This is probably because words used as single-word terms are more difficult than words used as components in multi-word terms. In fact, the average length of single-word terms is 9.15 characters, which is longer than that of component words in multi-word terms, 6.98 characters.

Acknowledgments

This work was supported by JSPS KAKENHI 22650047 and 21300094.

References

Timothy Baldwin and Takaaki Tanaka. 2004. Translation by machine of complex nominals: Getting it right. In Proc. of the ACL 2004 Workshop on Multiword Expressions: Integrating Processing, pages 24–31.

Béatrice Daille, Éric Gaussier, and Jean-Marc Langé. 1994. Towards automatic extraction of monolingual and bilingual terminology. In Proc. of COLING-1994, pages 515–521.

Béatrice Daille, Benoît Habert, Christian Jacquemin, and Jean Royauté. 1996. Empirical observation of term variations and principles for their description. Terminology, 3(2):192–257.

John Daintith, editor. 2009. Oxford Dictionary of Science (in Japanese). Asakura Syoten.

Atsushi Fujii and Tetsuya Ishikawa. 2001. Japanese/English cross-language information retrieval: Exploration of query translation and transliteration. Computers and the Humanities, 35:389–420.

Dekang Lin, Shaojun Zhao, Benjamin Van Durme, and Marius Paşca. 2008. Mining parenthetical translations from the Web by word alignment. In Proc. of ACL-08: HLT, pages 994–1002.

Satoshi Sato. 2010. Non-productive machine transliteration. In Adaptivity, Personalization and Fusion of Heterogeneous Information, RIAO '10, pages 16–19.

The National Language Research Institute. 1983. Writing-Form Variation of Words in Contemporary Japanese (in Japanese). Shuei Shuppan.
