Japanese-English Cross-Language Headword Search of Wikipedia

Satoshi Sato and Masaya Okada
Graduate School of Engineering, Nagoya University
Chikusa-ku, Nagoya, Japan

Proceedings of the 9th International Conference on Terminology and Artificial Intelligence, TIA 2011, pages 44–50, Paris, 8–10 November 2011

Abstract

This paper describes a Japanese-English cross-language headword search system for Wikipedia, which enables users to find an appropriate English article from a given Japanese query term. The key component of the system is a term translator, which selects an appropriate English headword from the set of headwords in English Wikipedia, based on the framework of non-productive machine translation. An experimental result shows that the translation performance of our system is equal to or slightly better than that of two commercial machine translation systems, Google translate and Mac-Transer.

1 Introduction

Many people use Wikipedia, an encyclopedia on the Web, to find the meanings of unknown terms. Among the many language versions of Wikipedia, the English version (EnWiki) is the largest; its size is five times larger than that of the Japanese version (JaWiki).

For Japanese people, Japanese articles are the most convenient and the easiest to read. When JaWiki has no appropriate article, English articles are the second best: readers can obtain some knowledge from them according to their English skills, which is much better than nothing.

A problem arises here. How do they consult EnWiki? What term should they input? In this situation, the only thing that they know is the Japanese term itself; they know neither its meaning nor its translation.

The best solution is cross-language headword search (CLHS), where a user inputs a Japanese term and a system retrieves the appropriate English article. To realize CLHS, term translation is required.

Term translation is not a mainstream topic of machine translation research, because a simple dictionary-based method is widely used and is sufficient for general terms. For technical terms and proper nouns, automatic extraction of translation pairs from bilingual corpora and the Web has been intensively studied (e.g., (Daille et al., 1994) and (Lin et al., 2008)) in order to compensate for missing entries in bilingual dictionaries. However, the quality of automatic term translation has not been well examined, with a few exceptions (Baldwin and Tanaka, 2004). Query translation in cross-language information retrieval is related to term translation, but it concentrates on the translation of content words in a query (Fujii and Ishikawa, 2001).

In the CLHS situation, the ultimate goal of translation is to find the right headword in the target encyclopedia. In other words, unlike translation in general, the set of translation candidates is clearly defined as a finite set. Therefore, translation can be simplified to selecting the appropriate member of that set: the right headword from the set of headwords of the target encyclopedia. This fact leads to a new framework of term translation, namely non-productive machine translation (NPMT).

The rest of the paper is organized as follows. Section 2 gives an overview of a Japanese-English CLHS system for Wikipedia. Section 3 gives the formal definition of the NPMT framework and its algorithm. Section 4 describes its extension for Japanese-English term translation. Section 5 describes an experimental result.

2 Overview of the System

The system works as an interface to JaWiki and EnWiki. The target users are Japanese people who speak Japanese as their first language and can read English texts to a certain level. For a given Japanese term, the system first tries to retrieve its Japanese article. If no article is found, the system translates the Japanese term into English. If a single translation is obtained, the system displays its English article. If more than one translation is obtained, the system lists these translations for the user to select from. If no translation is obtained, the system reports this.

Figure 1: Screen shot of the system

Figure 1 shows a screen shot of the system, where the input term is "反結合性オービタル (antibonding orbital)." Because JaWiki has no article for the term, the system shows the English article "Antibonding", which is found by redirection from the obtained translation, "antibonding orbital."

3 NPMT Framework

For term translation, we use the non-productive machine translation (NPMT) framework. This framework does not produce new translations; it just selects one (or more) from a large pool of candidates, called the target list. The assumption behind this framework is that, for every term, a translation is already available. The task of term translation is just to find it.

3.1 Translation Grammar

For the formal definition of the NPMT framework, we first introduce a simple grammar that produces a set of translation pairs.

$$G = (A, B, D) \qquad (1)$$

A grammar $G$ consists of three components: $A$, a set of words in the source language; $B$, a set of words in the target language; and $D$, a bilingual dictionary, which is a set of translation rules (bilingual pairs). A rule $r \in D$ takes the following form.

$$r = \langle \alpha, \beta \rangle \quad \text{where } \alpha \in A^*,\ \beta \in B^*,\ \max(|\alpha|, |\beta|) \ge 1 \qquad (2)$$

In this grammar, a rule sequence $\delta \in D^*$ produces a translation pair,

$$
\begin{aligned}
\delta &= r_1 r_2 \cdots r_n & (3)\\
&= \langle \alpha_1, \beta_1 \rangle \langle \alpha_2, \beta_2 \rangle \cdots \langle \alpha_n, \beta_n \rangle & (4)\\
&= \langle \alpha_1 \alpha_2 \cdots \alpha_n,\ \beta_1 \beta_2 \cdots \beta_n \rangle & (5)
\end{aligned}
$$

where each rule corresponds to a local mapping between $\alpha_1 \alpha_2 \cdots \alpha_n$ and $\beta_1 \beta_2 \cdots \beta_n$. Hereafter, we write the source side and the target side of $\delta$ as $\mathrm{src}(\delta)$ and $\mathrm{tgt}(\delta)$, respectively.

A language $L$ (i.e., a set of translation pairs) generated by a grammar $G$ is defined as follows.

$$L(G) = \{ \langle \mathrm{src}(\delta), \mathrm{tgt}(\delta) \rangle \mid \delta \in D^* \} \qquad (6)$$

We use this grammar framework to define the set of theoretically-possible translation pairs.

3.2 Non-Productive Machine Translation

Theoretically-possible translation pairs are not always actually-observed or valid translation pairs. Usually only a very small portion of $L(G)$ is actually observed and valid. Therefore we need a device to select the valid members of $L(G)$.

For this purpose, we introduce a target list $T \subset B^*$, which is a model of actually-observed terms in the target language. By using a target list, we define the framework of non-productive machine translation (NPMT) as follows.

Given a grammar $G = (A, B, D)$, a source term $\sigma \in A^*$, and a target list $T \subset B^*$,
Find $\Delta = \{ \delta \mid \delta \in D^*,\ \mathrm{src}(\delta) = \sigma,\ \mathrm{tgt}(\delta) \in T \}$
Output $\mathcal{T} = \{ \mathrm{tgt}(\delta) \mid \delta \in \Delta \}$

From this definition, we can see that a member of the output (i.e., $\mathrm{tgt}(\delta)$) is always a member of $T$. In other words, this framework always outputs actually-observed terms; it does not produce new terms that have not been observed yet. The name non-productive is derived from this fact.

3.3 Algorithm

Finding $\Delta$ is not trivial. We use a simplified version of Sato's algorithm (Sato, 2010), where prefix-filtering and dynamic programming are used to reduce the search space.

     1  def npmt_je(dic, tlist, s)
     2    sl = s.length; table = []; table[0] = ['']
     3    1.upto(sl) do |k|
     4      table[k] = []
     5      0.upto(k-1) do |p|
     6        (dic[s[p, k-p].join('')] || []).each do |tt|
     7          # a bilingual pair <s[p, k-p].join(''), tt> is found in dic
     8          table[p].each do |tp|
     9            tk = (tp == '' ? tt : (tt == '' ? tp : [tp, tt].join(' ')))
    10            table[k] << tk if tlist.find{|t| t =~ /^#{tk}/}
    11          end
    12        end
    13      end
    14    end
    15    table[sl].select{|t| tlist.member?(t)}
    16  end

Figure 2: Skeleton of NPMT algorithm in Ruby

Figure 2 shows a skeleton of our algorithm in Ruby. The three arguments, dic, tlist, and s, correspond to a bilingual dictionary $D$, a target list $T$, and a source term $\sigma$, respectively. At line 6, the program tries to find a dictionary entry for a substring of the input term; because of the double loops in lines 3–14 and 5–13, all possible segmentations of the input term are examined [1]. Line 10 corresponds to the prefix-filtering: a partial translation tk, which is a translation of the first k characters of the input term, is stored in table[k] only if tk is a prefix of a member of the target list.

[1] No Japanese morphological analyzer is used for segmentation of terms, because segmentation errors cannot be recovered in term translation.

4 Extension for Japanese-English Term Translation

For a given source term $\sigma$, the NPMT framework produces the correct translation $\tau$ when the following two conditions are satisfied.

1. $\tau$ is a member of the target list $T$, and
2. $\delta = \langle \sigma, \tau \rangle$ can be produced by the bilingual dictionary $D$, i.e., $\delta \in D^*$.

In the CLHS situation, the first condition is always …

Figure 3: Diagram of extended dictionary look-up
[Diagram. Input: a Japanese string s (any substring of an input term). Components shown: variant generator, bilingual dictionary D, attaching functional elements, back-transliterator. Conditions shown: if s is a string of Katakana characters and |s| ≥ 5 (next to the back-transliterator); if s is a string of non-Japanese characters, then s; if s is a functional element that can be dropped, then the empty string. Output: output of extended look-up.]

… last type, i.e., variants related to Kanji, we use the Hyouki Tougou dictionary (a dictionary of spelling variants), which is provided by the National Institute for Japanese Language and Linguistics, to generate variants.
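To make the selection step of Section 3.3 concrete, the following is a minimal runnable sketch in the style of the Figure 2 skeleton, not the authors' implementation. The names `npmt_select`, `DIC`, and `TLIST`, as well as the toy dictionary entries and headwords, are assumptions introduced only for illustration. The sketch also swaps the regular-expression prefix test of Figure 2 for `String#start_with?`, so that candidates containing regex metacharacters are compared literally, and it skips duplicate partial translations.

```ruby
# Hypothetical toy data (not from the paper): a tiny bilingual dictionary D
# and a tiny target list T standing in for the EnWiki headwords.
DIC = {
  '反結合性'   => ['antibonding'],
  '結合'       => ['bonding', 'bond'],
  'オービタル' => ['orbital', 'orbit'],
  '軌道'       => ['orbital', 'orbit']
}
TLIST = ['antibonding orbital', 'molecular orbital', 'orbit']

# Sketch of NPMT selection: dynamic programming over all segmentations of
# `term`, keeping only partial translations that are prefixes of some headword.
def npmt_select(dic, tlist, term)
  chars = term.chars
  n = chars.length
  table = Array.new(n + 1) { [] }   # table[k]: translations of the first k chars
  table[0] = ['']
  1.upto(n) do |k|
    0.upto(k - 1) do |i|
      # Look up the substring chars[i...k] in the dictionary.
      (dic[chars[i, k - i].join] || []).each do |tt|
        table[i].each do |tp|
          tk = if tp.empty? then tt
               elsif tt.empty? then tp
               else "#{tp} #{tt}"
               end
          # Prefix filtering: keep tk only if some headword starts with it.
          next unless tlist.any? { |t| t.start_with?(tk) }
          table[k] << tk unless table[k].include?(tk)
        end
      end
    end
  end
  # Only complete translations that are themselves headwords are output.
  table[n].select { |t| tlist.include?(t) }
end

p npmt_select(DIC, TLIST, '反結合性オービタル')  # prints ["antibonding orbital"]
p npmt_select(DIC, TLIST, '軌道')                # prints ["orbit"]
```

In the first call, the candidate "antibonding orbit" survives prefix filtering because it is a prefix of the headword "antibonding orbital", but it is dropped by the final membership test. This shows why both the prefix filter and the final selection against the target list are needed.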
