Pattern-Based English-Latvian Toponym Translation

Tatiana Gornostay, Tilde, Latvia, [email protected]
Inguna Skadiņa, Tilde, Latvia, [email protected]

Kristiina Jokinen and Eckhard Bick (Eds.), NODALIDA 2009 Conference Proceedings, pp. 41–47

Note: A paper on similar issues will be presented at the European Association for Machine Translation conference.

Abstract

Due to their linguistic and extra-linguistic nature, toponyms deserve special treatment when they are translated. The paper deals with issues related to automated translation of toponyms from English into Latvian. The translation process allows us to translate not only toponyms from a dictionary, but out-of-vocabulary toponyms as well. Translation of out-of-vocabulary toponyms is divided into three steps: source string normalization, translation, and target string normalization. The translation step implies application of translation strategies and linguistic toponym translation patterns. 10,000 UK-related toponyms from Geonames were used as a development set. The developed methods have been evaluated on a test set: the accuracy of translation is 67% for the whole test set, 58% for one-word toponymic units, and 81% for multi-word toponyms.

1 Introduction

Toponyms are studied by toponymy. They are names of places and comprise the following types:

- hydronyms (names of bodies of water: bays, streams, lakes, lagoons, oceans, ponds, seas, etc.);
- oronyms (names of mountains, cliffs, craters, rocks, points, etc.);
- geonyms (general names for streets, squares, lines, avenues, paths, alleys, roads, embankments, etc.);
- oeconyms (names of populated places: an administrative division, country, city, town, house or other building);
- cosmonyms or astronyms (names of stars, constellations or other heavenly bodies).

The paper aims to research a complicated task of machine translation (MT) and cross-language information retrieval (CLIR): automated translation of toponyms. Most toponym translation approaches are data-driven (see, e.g., Meng et al., 2001; Al-Onaizan and Knight, 2002; Sproat et al., 2006; Alegria et al., 2006; Wentland et al., 2008) since they deal with widely used languages which have enough linguistic resources for development. Taking into account the under-resourced status of the Latvian language, with few available corpus resources and especially few parallel bilingual corpora, a rule-based approach is proposed for English-Latvian toponym translation.

There are several commonly used translation strategies for toponyms (Babych and Hartley, 2004): the transference strategy (i.e., do-not-translate), the transliteration strategy (i.e., phonetic or spelling rendering), the translation strategy (i.e., translation itself) and the combined strategy.

The transference strategy with a do-not-translate list is often used for toponyms which do not need any rendering at all and are usually left untranslated, e.g. organization names (Babych and Hartley, 2003) or names of hotels in our system.

The most common transliteration techniques are phoneme-based and grapheme-based (Zhang et al., 2004). The phoneme-based approach (Knight and Graehl, 1998; Meng et al., 2001; Oh and Choi, 2002; Lee and Chang, 2003) implies conversion of a source language word into a target language word via its phonemic representation, i.e., grapheme-phoneme-grapheme conversion. The grapheme-based technique converts a source language word into a target language word without any phonemic representation (grapheme-grapheme conversion) (Stalls and Knight, 1998; Li et al., 2004).

The first part of the paper presents an overview of the concept and nature of toponyms. In the second part we focus on English-Latvian toponym translation, including the description of translation strategies (TS) and linguistic toponym translation patterns (LTTP).

2 Concept and Nature of Toponyms

Geoffrey Leech (1981) accepts a special status of toponyms as proper names without a conceptual meaning, since no componential analysis can be performed for them. Nevertheless, we should bear in mind that many toponyms are at least meaningful etymologically, e.g. Cambridge (bridge over the river Cam) (Leidner, 2007).

Toponyms are also ambiguous. Leidner (2007) describes three types of toponymical ambiguity:

- morpho-syntactic ambiguity: a word itself may be a toponym or a non-toponym, e.g. Liepa as a populated place in Latvia versus liepa (lime-tree) as a common noun;
- referential ambiguity: a toponym may refer to more than one place of the same type, e.g. Riga as a populated place and the capital of Latvia and Riga as a populated place in the state of Michigan, USA;
- feature type ambiguity: a toponym may refer to more than one place of a different type, e.g. Ogre as a populated place and a river in Latvia.

Another type of toponymical ambiguity is eponymical ambiguity, when places are named after people or deities, e.g., Vancouver after George Vancouver. Sometimes the same place is known by different names: endonyms (names of places used by inhabitants, self-assigned names) and exonyms (names of places used by other groups, not locals), e.g. Firenze for its inhabitants and Florence for English speakers.

Furthermore, metonymy also contributes to the issue. This linguistic phenomenon was studied from the toponymical point of view by Markert and Nissim (2002). The authors stated that metonymic use of toponyms is regular and productive. It can account for up to 17% of all toponym uses, as was shown on the example of the English language. The most frequent and conventional case of toponymical metonymy is the "government of …" pattern, e.g. "Latvia announced …" means "the government of Latvia announced …".

Finally, toponyms change frequently since they themselves and the places they refer to are not constant. Therefore, when dealing with toponyms it is also very important to take historical and cultural facts into consideration.

Thus, the abovementioned linguistic and extra-linguistic features make toponym processing (i.e., resolution, retrieval, and especially translation) difficult.

3 English-Latvian Toponym Translation

Within MT in general, English-Latvian toponym translation problems have not been researched before. The existing literature describes general principles of rendering English proper names, mostly anthroponyms, into Latvian. Therefore we studied three main issues related to MT of English-Latvian toponyms:

- orthographic, phonetic and grammatical distinctions between these languages;
- potential toponym translation strategies;
- potential linguistic toponym translation patterns.

Although English and Latvian are both Indo-European languages and share some grammatical features, they have a lot of differences. First, English belongs to the Germanic language group while Latvian belongs to the group of the Baltic languages. In morphological typology, English is an analytical language, in contrast to synthetic Latvian with its rich set of inflections.

The linguistic features of Latvian toponymic units were studied to ensure that translations correspond to common rules of Latvian grammar and orthography. For instance, Latvian multi-word units can be translated in several ways; however, a compound is preferable if the source toponymic unit can be reconstructed (Ahero, 2006).

The lack of orthographic and phonetic convergence in English (26 letters to 44 phonemes), historical changes and traditions in spelling, the origin language of a toponym, and ambiguity were the main difficulties we faced.

3.1 Source String Normalization

The process of translation of a toponymic unit is divided into three steps: source string normalization; translation, i.e., application of a translation strategy (TS) and linguistic toponym translation patterns (LTTP); and target string normalization according to Latvian grammar and orthography rules.

Source string normalization implies the following changes:

- all tabs and double space characters, including those at the string beginning, are normalized to single space characters;
- the so-called "zero-fertility words" (Al-Onaizan and Knight, 2002) of English are normalized to zero-translations into Latvian, e.g. the indefinite article a is omitted;

The set of English-Latvian transliteration rules consists of about 110 transliteration patterns describing English-Latvian grapheme-to-grapheme correspondences. All foreign names (those of non-English origin) are rendered according to English pronunciation standards. The main principle is the possibility to reconstruct the source toponymic unit (Ahero, 2006).

The result of transliteration may vary, as there are several ways of rendering English letter combinations into Latvian, e.g., -c- stands for -k- before consonants (except -h-) and before -a-, -o-, -u-; for -s- before -i-, -e-, -y-; and for -č- in the combination with -h-.

The transference strategy is applied both to unprocessed toponymic units, which are not described by any of the linguistic toponym translation patterns, and to organization and hotel names.

There are cases when multi-word toponyms are not transferred or transliterated but translated into Latvian, e.g., East Anglian Heights and North West Highlands are translated into Latvian as Austrumanglijas augstiene and Ziemeļskotijas kalnāji, respectively. Single-word units are transliterated, as a rule.

The transliteration strategy can also be applied to multi-word units in parallel
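
Two of the rules described in this section, source string normalization and the rendering of English -c-, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the system described in the paper: the function names normalize_source and render_c and the constant names are hypothetical, the zero-fertility list contains only the indefinite article named in the text, and a -c- at the end of a word or before a non-letter is left unchanged because the rule as quoted does not cover that case. The second function rewrites the letter c only, so its output is not a complete Latvian form.

```python
import re

# Assumption: the text names only the indefinite article "a" as a
# zero-fertility word, so the list is limited to it.
ZERO_FERTILITY_WORDS = {"a"}

# Letters before which English "c" is rendered as "s" (per the quoted rule).
SOFT_CONTEXT = {"i", "e", "y"}


def normalize_source(text: str) -> str:
    """Collapse tabs and repeated spaces, trim the ends, drop zero-fertility words."""
    collapsed = re.sub(r"[ \t]+", " ", text).strip()
    kept = [w for w in collapsed.split(" ") if w.lower() not in ZERO_FERTILITY_WORDS]
    return " ".join(kept)


def render_c(word: str) -> str:
    """Rewrite only the letter c: ch -> č, c before i/e/y -> s, c before other letters -> k."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1].lower() if i + 1 < len(word) else ""
        if ch.lower() != "c":
            out.append(ch)
            i += 1
            continue
        if nxt == "h":
            repl, step = "č", 2      # the combination "ch" is consumed as a whole
        elif nxt in SOFT_CONTEXT:
            repl, step = "s", 1
        elif nxt.isalpha():
            repl, step = "k", 1      # other consonants and a, o, u
        else:
            repl, step = "c", 1      # assumption: case not covered by the rule
        out.append(repl.upper() if ch.isupper() else repl)
        i += step
    return "".join(out)


if __name__ == "__main__":
    print(normalize_source("  a \t North   West  Highlands"))
    print(render_c("Cam"), render_c("city"), render_c("Chad"))
```

A full transliterator would apply the remaining grapheme-to-grapheme patterns (about 110 according to the text) in the same way, followed by target string normalization.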
