Lightly Supervised Transliteration for Machine Translation
Amit Kirschenbaum
Department of Computer Science, University of Haifa, 31905 Haifa, Israel
[email protected]

Shuly Wintner
Department of Computer Science, University of Haifa, 31905 Haifa, Israel
[email protected]

Proceedings of the 12th Conference of the European Chapter of the ACL, pages 433–441, Athens, Greece, 30 March – 3 April 2009. © 2009 Association for Computational Linguistics

Abstract

We present a Hebrew to English transliteration method in the context of a machine translation system. Our method uses machine learning to determine which terms are to be transliterated rather than translated. The training corpus for this purpose includes only positive examples, acquired semi-automatically. Our classifier reduces the errors made by a baseline method by more than 38%. The identified terms are then transliterated. We present an SMT-based transliteration model trained with a parallel corpus extracted from Wikipedia using a fairly simple method which requires minimal knowledge. The correct result is produced in more than 76% of the cases, and in 92% of the instances it is one of the top-5 results. We also demonstrate a small improvement in the performance of a Hebrew-to-English MT system that uses our transliteration module.

1 Introduction

Transliteration is the process of converting terms written in one language into their approximate spelling or phonetic equivalents in another language. Transliteration is defined for a pair of languages, a source language and a target language. The two languages may differ in their script systems and phonetic inventories. This paper addresses transliteration from Hebrew to English as part of a machine translation system.

Transliteration of terms from Hebrew into English is a hard task, for the most part because of the differences in the phonological and orthographic systems of the two languages. On the one hand, there are cases where a Hebrew letter can be pronounced in multiple ways. For example, Hebrew ב can be pronounced either as [b] or as [v]. On the other hand, two different Hebrew sounds can be mapped into the same English letter. For example, both ת and ט are in most cases mapped to [t].

A major difficulty stems from the fact that in the Hebrew orthography (like Arabic), words are represented as sequences of consonants where vowels are only partially and very inconsistently represented. Even letters that are considered as representing vowels may sometimes represent consonants, specifically ו ([v]/[o]/[u]) and י ([y]/[i]). As a result, the mapping between Hebrew orthography and phonology is highly ambiguous.

Transliteration has attracted growing interest recently, particularly in the field of Machine Translation (MT). It handles those terms where no translation would suffice or even exist. Failing to recognize such terms would result in poor performance of the translation system. In the context of an MT system, one has to first identify which terms should be transliterated rather than translated, and then produce a proper transliteration for these terms. We address both tasks in this work.

Identification of Terms To-be Transliterated (TTT) must not be confused with recognition of Named Entities (NE) (Hermjakob et al., 2008). On the one hand, many NEs should be translated rather than transliterated, for example:¹

    m$rd          hm$p@im
    misrad        hamishpatim
    ministry-of   the-sentences
    ‘Ministry of Justice’

    him       htikwn
    hayam     hatichon
    the-sea   the-central
    ‘the Mediterranean Sea’

¹ To facilitate readability, examples are presented with interlinear gloss, including an ASCII representation of Hebrew orthography followed by a broad phonemic transcription, a word-for-word gloss in English where relevant, and the corresponding free text in English. The following table presents the ASCII encoding of Hebrew used in this paper:

    א = a, ב = b, ג = g, ד = d, ה = h, ו = w, ז = z, ח = x, ט = @, י = i, כ = k
    ל = l, מ = m, נ = n, ס = s, ע = &, פ = p, צ = c, ק = q, ר = r, ש = $, ת = t

On the other hand, there are terms that are not NEs, such as borrowed words or culturally specific terms that are transliterated rather than translated, as shown by the following examples:

    aqzis@ncializm       @lit
    eqzistentzializm     talit
    ‘Existentialism’     ‘Tallit’

As these examples show, transliteration cannot be considered the default strategy to handle NEs in MT and translation does not necessarily apply for all other cases.

Candidacy for either transliteration or translation is not necessarily determined by orthographic features. In contrast to English (and many other languages), proper names in Hebrew are not capitalized. As a result, the following homographs may be interpreted as either a proper name, a noun, or a verb:

    alwn     alwn             alwn
    alon     alun             alon
    ‘oak’    ‘I will sleep’   ‘Alon’ (name)

One usually distinguishes between two types of transliteration (Knight and Graehl, 1997): Forward transliteration, where an originally Hebrew term is to be transliterated to English; and Backward transliteration, in which a foreign term that has already been transliterated into Hebrew is to be recovered. Forward transliteration may result in several acceptable alternatives. This is mainly due to phonetic gaps between the languages and lack of standards for expressing Hebrew phonemes in English. For example, the Hebrew term cdiq may be transliterated as Tzadik, Tsadik, Tsaddiq, etc. On the other hand, backward transliteration is restrictive. There is usually only one acceptable way to express the transliterated term. So, for example, the name wiliam can be transliterated only to William and not, for example, to Viliem, even though the Hebrew character w may stand for the consonant [v] and the character a may be vowelized as [e].

We approach the task of transliteration in the context of Machine Translation in two phases. First, we describe a lightly-supervised classifier that can identify TTTs in the text (section 4). The identified terms are then transliterated (section 5) using a transliteration model based on Statistical Machine Translation (SMT). The two modules are combined and integrated in a Hebrew to English MT system (section 6).

The main contribution of this work is the actual transliteration module, which has already been integrated in a Hebrew to English MT system. The accuracy of the transliteration is comparable with state-of-the-art results for other language pairs, where much more training material is available. More generally, we believe that the method we describe here can be easily adapted to other language pairs, especially those for which few resources are available. Specifically, we did not have access to a significant parallel corpus, and most of the resources we used are readily available for many other languages.

2 Previous Work

In this section we sketch some related works, focusing on transliteration from Hebrew and Arabic, and on the context of machine translation.

Arbabi et al. (1994) present a hybrid algorithm for romanization of Arabic names using neural networks and a knowledge based system. The program applies vowelization rules, based on Arabic morphology and stemming from the knowledge base, to unvowelized names. This stage, termed the broad approach, exhaustively yields all valid vowelizations of the input. To solve this over-generation, the narrow approach is then used. In this approach, the program uses a neural network to filter unreliable names, that is, names whose vowelizations are not in actual use. The vowelized names are converted into a standard phonetic representation which in turn is used to produce various spellings in languages which use the Roman alphabet. The broad approach covers close to 80% of the names given to it, though with some extraneous vowelization. The narrow approach covers over 45% of the names presented to it with higher precision than the broad approach.

This approach requires vast linguistic knowledge in order to create the knowledge base of vowelization rules. In addition, these rules are applicable only to names that adhere to the Arabic morphology.

Stalls and Knight (1998) propose a method for back transliteration of names that originate in English and occur in Arabic texts. The method uses a sequence of probabilistic models to convert names written in Arabic into the English script. First, an Arabic name is passed through a phonemic model producing a network of possible English sound sequences, where the probability of each sound is location dependent. Next, phonetic sequences are transformed into English phrases. Finally, each possible result is scored according to a unigram word model. This method correctly translates about 32% of the tested names. Those not translated are frequently not foreign names.

This method uses a pronunciation dictionary and is therefore restricted to transliterating only words of known pronunciation. Both of the above methods perform only unidirectional transliteration, that is, either forward or backward transliteration, while our work handles both.

Al-Onaizan and Knight (2002) describe a system which combines a phonetic based model with

English with each of four target languages. NEs were extracted from the English side and were compared with all the words in the target language to find proper transliterations. The baseline presented for the case of transliteration from English to Arabic achieves Mean Reciprocal Rank (MRR) of 0.66 and this method improves its results by 7%. This technique involves knowledge about phonological characteristics, such as elision of consonants based on their position in the word, which requires expert knowledge of the language. In addition, conversion of terms into a phonemic representation poses hurdles in representing short vowels in Arabic and will have similar behavior in Hebrew. Moreover, English to Arabic transliteration is easier than Arabic to English, because in the former, vowels should be deleted whereas in the latter they should be generated.
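The results quoted in this section are stated as Mean Reciprocal Rank (MRR), and the abstract reports accuracy at the top result and within the top 5. As a reading aid, the following is a minimal sketch of how such scores are conventionally computed from ranked candidate lists. It is not taken from any of the cited systems: the function names and the toy data are hypothetical, and exact string match against a single reference is an assumption (forward transliteration may admit several acceptable spellings).

```python
from typing import Sequence


def reciprocal_rank(candidates: Sequence[str], gold: str) -> float:
    """Return 1/rank of the first candidate equal to gold, or 0.0 if absent."""
    for rank, candidate in enumerate(candidates, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]],
                         golds: Sequence[str]) -> float:
    """Average the reciprocal rank over all test terms."""
    if len(ranked) != len(golds):
        raise ValueError("need exactly one reference per candidate list")
    if not golds:
        return 0.0
    total = sum(reciprocal_rank(c, g) for c, g in zip(ranked, golds))
    return total / len(golds)


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   golds: Sequence[str], k: int) -> float:
    """Fraction of terms whose reference appears among the first k candidates."""
    if len(ranked) != len(golds):
        raise ValueError("need exactly one reference per candidate list")
    if not golds:
        return 0.0
    hits = sum(1 for c, g in zip(ranked, golds) if g in c[:k])
    return hits / len(golds)


if __name__ == "__main__":
    # Hypothetical system output: one ranked candidate list per source term.
    ranked = [
        ["William", "Viliem"],            # reference at rank 1
        ["Tsadik", "Tzadik", "Tsaddiq"],  # reference at rank 2
        ["Alun", "Elon"],                 # reference missing
    ]
    golds = ["William", "Tzadik", "Alon"]
    print(mean_reciprocal_rank(ranked, golds))
    print(top_k_accuracy(ranked, golds, k=1))
    print(top_k_accuracy(ranked, golds, k=2))
```

Under this convention a term whose reference never appears contributes zero to the mean, so MRR rewards systems that place the correct form near the top of the list even when the first candidate is wrong.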