
Pattern-based English-Latvian Toponym Translation

The paper on similar issues will be presented at the European Association for Machine Translation conference.

Tatiana Gornostay and Inguna Skadiņa
Tilde, Latvia

Kristiina Jokinen and Eckhard Bick (Eds.) NODALIDA 2009 Conference Proceedings, pp. 41–47

Abstract

Due to their linguistic and extra-linguistic nature, toponyms deserve special treatment when they are translated. The paper deals with issues related to the automated translation of toponyms from English into Latvian. The translation process handles not only toponyms found in a dictionary but out-of-vocabulary toponyms as well. Translation of out-of-vocabulary toponyms is divided into three steps: source string normalization, translation, and target string normalization. The translation step applies translation strategies and linguistic toponym translation patterns. 10,000 UK-related toponyms from Geonames were used as a development set. The developed methods were evaluated on a test set: translation accuracy is 67% for the whole test set, 58% for one-word toponymic units, and 81% for multi-word toponyms.

1 Introduction

Toponyms are studied by toponymy. They are names of places and comprise the following types:

- hydronyms (names of bodies of water: bays, streams, lakes, lagoons, oceans, ponds, seas, etc.);
- oronyms (names of mountains, cliffs, craters, rocks, points, etc.);
- geonyms (general names for streets, squares, lines, avenues, paths, alleys, roads, embankments, etc.);
- oeconyms (names of populated places: an administrative division, country, city, town, house or other building);
- cosmonyms or astronyms (names of stars, constellations or other heavenly bodies).

The paper addresses a complicated task of machine translation (MT) and cross-language information retrieval (CLIR): the automated translation of toponyms. Most toponym translation approaches are data-driven (see, e.g., Meng et al., 2001; Al-Onaizan and Knight, 2002; Sproat et al., 2006; Alegria et al., 2006; Wentland et al., 2008), since they deal with widely used languages which have enough linguistic resources for development. Given the under-resourced status of the Latvian language, with few available corpus resources and especially few parallel bilingual corpora, a rule-based approach is proposed for English-Latvian toponym translation.

There are several commonly used translation strategies for toponyms (Babych and Hartley, 2004): transference strategy (i.e., do-not-translate), transliteration strategy (i.e., phonetic or spelling rendering), translation strategy (i.e., translation itself) and combined strategy.

Transference strategy with a do-not-translate list is often used for toponyms which do not need any rendering at all and are usually left untranslated, e.g. organization names (Babych and Hartley, 2003) or names of hotels in our system.

The most common transliteration techniques are phoneme-based and grapheme-based (Zhang et al., 2004). The phoneme-based approach (Knight and Graehl, 1998; Meng et al., 2001; Oh and Choi, 2002; Lee and Chang, 2003) converts a source language word into a target language word via its phonemic representation, i.e., grapheme-phoneme-grapheme conversion. The grapheme-based technique converts a source language word into a target language word without any phonemic representation (grapheme-grapheme conversion) (Stalls and Knight, 1998; Li et al., 2004).

The first part of the paper presents an overview of the concept and nature of toponyms. In the second part we focus on English-Latvian toponym translation, including the description of translation strategies (TS) and linguistic toponym translation patterns (LTTP).

2 Concept and Nature of Toponyms

Although Geoffrey Leech (1981) accepts a special status of toponyms as proper names without a conceptual meaning, since no componential analysis can be performed for them, we should bear in mind that many toponyms are at least etymologically meaningful, e.g. Cambridge – bridge over the river Cam (Leidner, 2007).

Toponyms are also ambiguous. Leidner (2007) describes three types of toponymical ambiguity:

- morpho-syntactic ambiguity: a word itself may be a toponym or a non-toponym, e.g. Liepa as a populated place in Latvia versus liepa (lime-tree) as a common noun;
- referential ambiguity: a toponym may refer to more than one place of the same type, e.g. Riga as a populated place and the capital of Latvia and Riga as a populated place in the state of Michigan, USA;
- feature type ambiguity: a toponym may refer to more than one place of a different type, e.g. Ogre as a populated place and a river in Latvia.

Another type of toponymical ambiguity is eponymical ambiguity, when places are named after people or deities, e.g., Vancouver after George Vancouver.

Sometimes the same place is known by different names: endonyms (names of places used by inhabitants, self-assigned names) and exonyms (names of places used by other groups, not locals), e.g. Firenze for its inhabitants and Florence for English speakers.

Furthermore, metonymy also contributes to the issue. This linguistic phenomenon was studied from the toponymical point of view by Markert and Nissim (2002). The authors stated that metonymic use of toponyms is regular and productive and that it can reach up to 17% of all toponyms. The most frequent and conventional case of toponymical metonymy follows the “government of …” pattern, e.g. “Latvia announced …” means “the government of Latvia announced …”.

Finally, toponyms change frequently, since neither they nor the places they refer to are constant. Therefore, when dealing with toponyms it is also very important to take historical and cultural facts into consideration.

Thus, the abovementioned linguistic and extra-linguistic features make toponym processing difficult, i.e., their resolution, retrieval, and especially translation.

3 English-Latvian Toponym Translation

Within MT as a whole, English-Latvian toponym translation problems have not been researched before. The existing literature describes general principles of rendering English proper names, mostly anthroponyms, into Latvian. Therefore we studied three main issues related to MT of English-Latvian toponyms:

- orthographic, phonetic and grammatical distinctions between these languages;
- potential toponym translation strategies;
- potential linguistic toponym translation patterns.

Although English and Latvian are both Indo-European languages and share some grammatical features, they differ in many respects. First, English belongs to the Germanic language group while Latvian belongs to the group of the Baltic languages. In morphological typology, English is an analytical language, in contrast to Latvian, which is synthetic and richly inflected.

The linguistic features of Latvian toponymic units were studied to ensure that translations correspond to the common rules of Latvian grammar and orthography. For instance, Latvian multi-word units can be translated in several ways; however, a compound is preferable if the source toponymic unit can be reconstructed from it (Ahero, 2006).

The main difficulties we faced were the lack of orthographic and phonetic convergence in English (26 letters to 44 phonemes), historical changes and traditions in spelling, the origin language of a toponym, and ambiguity.

3.1 Source String Normalization

The process of translation of a toponymic unit is divided into three steps: source string normalization; translation, i.e., application of a translation strategy (TS) and linguistic toponym translation patterns (LTTP); and target string normalization according to the Latvian grammar and orthography rules.

Source string normalization implies the following changes:

- all tabs and double space characters, including those at the string beginning, are normalized to single space characters;
- the so-called “zero-fertility words” (Al-Onaizan and Knight, 2002) of English are normalized to zero-translations into Latvian, e.g. the indefinite article a is omitted;
- hyphenated words are replaced with non-hyphenated ones;
- some abbreviations are expanded to full words, e.g. St. to Saint;
- signs, if possible, are replaced with words, e.g. & to and;
- punctuation marks are normalized to zero-translations.

3.2 Translation: English-Latvian Toponym Translation Strategies

The English-Latvian transliteration strategy is based on the grapheme-to-grapheme approach, which implies direct mapping of English letter sequences into Latvian ones, formalized in a set of transliteration rules. Transliteration strategy is language dependent (Karimi et al., 2007). It is not a trivial task, due to the issues described above as well as to many exceptions (see Castañeda-Hernández, 2004 about the general toponym translation problem).

The set of English-Latvian transliteration rules consists of about 110 transliteration patterns describing English-Latvian grapheme-to-grapheme correspondences. All foreign names (those of non-English origin) are rendered according to English pronunciation standards. The main principle is the possibility to reconstruct the source toponymic unit (Ahero, 2006).

The result of transliteration may vary, as there are several ways of rendering English letter combinations into Latvian, e.g., -c- stands for -k- before consonants (except -h-) and before -a-, -o-, -u-, for -s- before -i-, -e-, -y-, and for -č- in combination with -h-.

Transference strategy is applied both to unprocessed toponymic units, which are not described by any of the linguistic toponym translation patterns, and to organization and hotel names.

There are cases when multi-word toponyms are neither transferred nor transliterated but translated into Latvian, e.g., East Anglian Heights and North West Highlands are translated into Latvian as Austrumanglijas augstiene and Ziemeļskotijas kalnāji respectively. Single word units are transliterated, as a rule.

Transliteration strategy can also be applied to multi-word units in parallel with translation, which is infrequent and conventional.

Toponym translation strategies are closely related to LTTPs and are language dependent. Therefore combined strategy is also used when treating different types of toponyms.

3.3 Translation: Linguistic Toponym Translation Patterns

Most popular toponyms, such as names of countries and capitals, seas and oceans, are translated using an English-Latvian dictionary, e.g., Lisbon – Lisabona, Brussels – Brisele, Cologne – Ķelne, Antwerp – Antverpene, Great Britain – Lielbritānija, Atlantic Ocean – Atlantijas okeāns. If a toponym is an out-of-vocabulary (OOV) word then one of the LTTPs is applied.

To determine common LTTPs for toponyms which are not in dictionaries, we used a list of 10,000 UK-related toponyms from Geonames and analyzed the 59 most common toponym types.

LTTPs determine how source toponymic units are rendered into target toponymic units. We distinguish two types of LTTPs: in-word patterns and multi-word patterns.
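The flow described in Sections 3.1 to 3.3 (normalize the source string, try the dictionary, fall back to OOV handling, and transfer whatever remains unprocessed) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the tiny dictionary, the zero-fertility word list beyond "a", the choice to replace hyphens with spaces, and the decision to keep apostrophes are all hypothetical.

```python
import re

# Illustrative data only; the real system uses a full English-Latvian dictionary.
DICTIONARY = {
    "Brussels": "Brisele",
    "Cologne": "Ķelne",
    "Atlantic Ocean": "Atlantijas okeāns",
}
ABBREVIATIONS = {"St.": "Saint"}   # example given in the text
SIGNS = {"&": "and"}               # example given in the text
# The text names only the indefinite article "a"; "the" is an assumption.
ZERO_FERTILITY = {"a", "the"}


def normalize_source(text: str) -> str:
    """Apply the source string normalization steps of Section 3.1."""
    tokens = text.split()  # also collapses tabs and repeated spaces
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [SIGNS.get(t, t) for t in tokens]
    joined = " ".join(tokens).replace("-", " ")  # assumption: hyphen becomes a space
    joined = re.sub(r"[^\w\s']", "", joined)     # drop punctuation, keep apostrophes
    kept = [t for t in joined.split() if t.lower() not in ZERO_FERTILITY]
    return " ".join(kept)


def translate_toponym(text: str, oov_translate=None) -> str:
    """Dictionary lookup first, then an optional OOV translator, else transference."""
    unit = normalize_source(text)
    if unit in DICTIONARY:
        return DICTIONARY[unit]
    if oov_translate is not None:
        result = oov_translate(unit)  # hypothetical hook for LTTP-based translation
        if result is not None:
            return result
    return text  # transference: the unit is left untranslated
```

For example, `normalize_source("Sandal & Agbrigg")` gives `"Sandal and Agbrigg"`, and a unit that is neither in the dictionary nor handled by the OOV hook is returned unchanged, which mirrors the do-not-translate behaviour described for unprocessed units.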


The in-word LTTP describes a word transformation model based on English-Latvian transliteration rules, including the most frequent prefixes, suffixes, and letter combinations. About 300 in-word LTTPs are described, e.g.: new- to ņū-, deep- to dīp-, mc- to mak-, -worth to -vērt, -islet to -ailet, etc.

Multi-word LTTPs involve three translation strategies. The first translation strategy S1 is based on transliteration rules. Translation strategy S2 combines the translation strategy S1 with the insertion of a nomenclature word, e.g., Bebington (as a railroad station) – Bebingtonas stacija. If a nomenclature word is included in a source toponymic unit, as in strategy S3, it is either translated (Newton Point – Ņūtona zemesrags, Gog Magog Hills – Gogmagogu kalni) or transliterated (Green Isle – Grīnaila, North East Coast – Nortīstkosta) in the target language.

We have described 40 nomenclature words which are translated under certain conditions. Auxiliary words, such as prepositions, are also either translated or transliterated, e.g., Horse of Copinsay – Horsofkopinsejs (transliteration), Milford upon Sea – Milforda pie jūras (translation).

Examples of LTTPs are presented in Table 1. Xn is a toponymic unit in a source language, Sn is a translation strategy, Yn is a toponymic unit in a target language, and Pn{Xn, Sn, Yn} is a corresponding LTTP.

3.4 Target String Normalization

Target string normalization modifies a toponymic unit according to the Latvian grammar and orthography rules, e.g. all populated places are of feminine gender (see P2): Newcastle → Ņūkāsla, where the gender is indicated by the ending -a (feminine, singular nominative).
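As a rough illustration of the in-word patterns and of the feminine ending rule, the following Python sketch applies the prefix and suffix correspondences quoted in the text and then adds the ending -a for populated places. It is an assumption-laden sketch: the function names are hypothetical, only five of the roughly 300 in-word patterns are included, the rest of the word is left untransliterated, and the check for an existing final -a is a guess rather than a rule from the paper.

```python
# Prefix and suffix correspondences quoted in the text (a small subset).
PREFIX_PATTERNS = {"new": "ņū", "deep": "dīp", "mc": "mak"}
SUFFIX_PATTERNS = {"worth": "vērt", "islet": "ailet"}


def apply_in_word_patterns(word: str) -> str:
    """Rewrite a known prefix and/or suffix; the remainder is left as is."""
    rest = word.lower()
    prefix_out = ""
    for src, tgt in PREFIX_PATTERNS.items():
        if rest.startswith(src):
            prefix_out, rest = tgt, rest[len(src):]
            break
    suffix_out = ""
    for src, tgt in SUFFIX_PATTERNS.items():
        if rest.endswith(src):
            suffix_out, rest = tgt, rest[:len(rest) - len(src)]
            break
    return prefix_out + rest + suffix_out


def normalize_target(stem: str, populated_place: bool) -> str:
    """Add the feminine singular nominative ending -a for populated places."""
    if populated_place and not stem.endswith("a"):  # assumption: avoid doubling -a
        stem += "a"
    return stem.capitalize()
```

With a transliterated stem such as `"ņūkāsl"` (assumed here to be the output of the transliteration rules for Newcastle), `normalize_target("ņūkāsl", True)` yields `"Ņūkāsla"`, the form given in the text. A full system would transliterate the remainder of the word with the grapheme-to-grapheme rules before this step.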

LTTP | English Toponym Xn | Translation Pattern Pn | Translation Strategy Sn | Latvian Toponym Yn
P1={X1, S1, Y1} | X1: N; Knocklayd | P1: N → N | S1: transliteration | Y1: N masculine singular; Nokleids
P2={X1, S1, Y2} | X1: N; Newcastle | P2: N → N | S1: transliteration | Y2: N feminine singular; Ņūkāsla
P3={X1, S2, Y3} | X1: N; Bebington | P3: N → N + N | S2: transliteration + nomenclature word | Y3: N feminine singular genitive + N; Bebingtonas stacija
P4={X2, S1, Y2} | X2: N's + N; Bishop's Stortford | P4: N's + N → N | S1: transliteration | Y2: N feminine singular; Bišopsstortforda
P5={X3, S1, Y2} | X3: N + N's + N; St. Bishop's Town | P5: N + N's + N → N | S1: transliteration | Y2: N feminine singular; Sentbišopsatauna
P6={X4, S1, Y2} | X4: N + N; Bishop Auckland, North Ronaldsay | P6: N + N → N | S1: transliteration | Y2: N feminine singular; Bošopoklenda, Nortronaldseja
P7={X5, S1, Y2} | X5: A + N; South Ribble, Green Isle | P7: A + N → N | S1: transliteration | Y2: N feminine singular; Sautribla, Grīnaila
P8={X6, S3, Y4} | X6: N + P + N; Milford upon Sea, Stratford upon Avon | P8: N + P + N → N + P + N | S3: transliteration + translation | Y4: N feminine singular genitive + P + N; Milforda pie jūras, Stradforda pie Avona
P9={X6, S1, Y5} | X6: N + P + N; Longville in the Dale | P9: N + P + N → N + N | S1: transliteration | Y5: N feminine singular genitive + N feminine singular locative; Longvila Deilā
P10={X7, S1, Y2} | X7: A + A + N; North East Coast | P10: A + A + N → N | S1: transliteration | Y2: N feminine singular; Nortīstkosta
P11={X8, S2, Y3} | X8: N + C + N; Sandal & Agbrigg | P11: N + C + N → N + N | S2: transliteration + nomenclature word | Y3: N feminine singular genitive + N; Sendalendagbrigas stacija
P12={X4, S3, Y6} | X4: N + N; Newton Point | P12: N + N → N + N | S3: transliteration + translation | Y6: N masculine singular genitive + N; Ņūtona zemesrags
P13={X6, S1, Y1} | X6: N + P + N; Horse of Copinsay | P13: N + P + N → N | S1: transliteration | Y1: N masculine singular; Horsofkopinsejs
P14={X7, S3, Y7} | X7: N + N + N; Gog Magog Hills | P14: N + N + N → N + N | S3: transliteration + translation | Y7: N masculine plural genitive + N; Gogmagogu kalni

Table 1. Examples of English-Latvian Linguistic Toponym Translation Patterns
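The Pn{Xn, Sn, Yn} triples of Table 1 lend themselves to a small lookup structure keyed by the part-of-speech shape of the source unit. The Python sketch below is a hypothetical encoding of a few rows of the table, not the authors' data model; the class and field names are assumptions. It also makes one property of the table visible: the same source shape (e.g. N + N) can map to several patterns, so extra information such as the presence of a nomenclature word is needed to choose between them.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LTTP:
    name: str                # pattern label, e.g. "P6"
    source: Tuple[str, ...]  # part-of-speech shape of the English unit (Xn)
    strategy: str            # translation strategy label (Sn)
    target: str              # description of the Latvian unit (Yn)


# A few rows copied from Table 1.
PATTERNS = [
    LTTP("P6", ("N", "N"), "S1", "N feminine singular"),
    LTTP("P7", ("A", "N"), "S1", "N feminine singular"),
    LTTP("P8", ("N", "P", "N"), "S3", "N feminine singular genitive + P + N"),
    LTTP("P12", ("N", "N"), "S3", "N masculine singular genitive + N"),
    LTTP("P13", ("N", "P", "N"), "S1", "N masculine singular"),
]


def candidate_patterns(pos_shape: Sequence[str]) -> List[LTTP]:
    """Return every pattern whose source shape matches the given POS sequence."""
    shape = tuple(pos_shape)
    return [p for p in PATTERNS if p.source == shape]
```

Calling `candidate_patterns(["N", "N"])` returns both P6 (Bishop Auckland, fully transliterated) and P12 (Newton Point, with the nomenclature word translated), which is why the choice of strategy cannot be made from the shape alone.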

4 Evaluation and Limitations

Current MT evaluation theory and practice lack evaluation methods for the toponym translation task. One of the reasons could be that it is not clear what the correct toponym translation is, since results may vary and more than one target toponymic unit is acceptable. As a result, scores calculated with a single target variant will underestimate translation accuracy. Moreover, human translations are often inaccurate as well.

Existing English-Latvian MT systems (1) do not implement any OOV algorithms to translate toponymic units. Thus, we had no possibility to compare our algorithm with the performance of other MT systems.

For evaluation purposes we compared translation results of our translation module with reference (human) translations from two bilingual dictionaries. 330 English toponymic units of different types with Latvian translation equivalents were manually extracted from dictionaries and processed with our OOV toponym translation module. We set the following evaluation scores:

- if the translation result coincides with the corresponding linguistic toponym translation pattern then the translation is accurate and the score is 1;
- if the translation result deviates from the corresponding linguistic toponym translation pattern then the translation is inaccurate, and the score is 0.5 for one distinction and 0 for more distinctions.

We accept variants, as they were also described by linguistic toponym translation patterns (in transliteration rules). As a result, the accuracy of translation is 67% on the whole test set, 58% on the set containing one-word toponymic units, and 81% on the multi-word test set.

(1) English-Latvian Pragma Expert: www.acl.lv, English-Latvian Google: http://translate.google.com, English-Latvian Tilde http://www.tilde.lv/English/portal/go/tilde/3777/en-US/DesktopDefault.aspx (November, 2008)
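The scoring scheme above can be written down in a few lines of Python. This is a sketch under two assumptions that the text does not spell out: that a "distinction" is a countable deviation from the expected pattern, and that the reported accuracy is the mean score over the test set. The function names are hypothetical.

```python
from typing import Iterable


def score(distinctions: int) -> float:
    """Score one translation: 1 for no deviation, 0.5 for one, 0 for more."""
    if distinctions == 0:
        return 1.0
    if distinctions == 1:
        return 0.5
    return 0.0


def accuracy(distinction_counts: Iterable[int]) -> float:
    """Mean score over a test set, given the deviation count per toponym."""
    counts = list(distinction_counts)
    return sum(score(d) for d in counts) / len(counts)
```

For four toponyms with 0, 1, 2 and 0 distinctions the scores are 1, 0.5, 0 and 1, so `accuracy([0, 1, 2, 0])` is 2.5 / 4 = 0.625.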


5 Conclusions and Future Work

We have described the pattern-based toponym translation approach developed for the English-Latvian language pair. The paper focuses on a detailed description of OOV toponym processing and presents possible translation strategies and linguistic toponym translation patterns with examples and evaluation results.

We can conclude that the implemented rule-based approach leaves much room for improvement, and the evaluation results confirm this. The main reason why toponym processing is such a challenge for MT is that it requires knowledge of toponym rendering rules and of a variety of languages, as well as a considerable amount of history and culture (Castañeda-Hernández, 2004). It is impossible to formalize this process completely, so mistakes in the automated translation of toponymic units are to be expected.

A corpus-based approach has not been applied in this research due to the lack of monolingual and bilingual linguistic resources. However, the issue of compiling a multilingual corpus of toponym-referenced texts for the Latvian language is being studied.

We consider the present research a starting point for such tasks as multilingual cross-language MT of toponyms and application to other languages, especially those using Cyrillic or other non-Latin scripts.

Acknowledgement

We would like to thank Raivis Skadiņš for his comments and remarks on the article, Lars Ahrenberg for discussions on toponym machine translation, and Lars Borin for general discussions on toponymy.

This research was carried out in the framework of the project no. 045335 – TRIPOD project (TRIPartite multimedia Object Description), co-funded within the Sixth Framework Programme.

References

Antonija Ahero. 2006. English Proper Name Rendering into the Latvian Language (Angļu Īpašvārdu Atveide Latviešu Valodā). Zinātne, Rīga.

Iñaki Alegria, Nerea Ezeiza, Izaskun Fernandez. 2006. Named entities translation based on comparable corpora. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Workshop on Multi-word expressions in a Multilingual Context, Italy. Pp. 1-8.

Yaser Al-Onaizan and Kevin Knight. 2002. Translating named entities using monolingual and bilingual resources. Proceedings of the 40th Meeting of the Association for Computational Linguistics, USA. Pp. 400-408.

Bogdan Babych and Anthony Hartley. 2003. Improving Machine Translation Quality with Automatic Named Entity Recognition. Proceedings of the 7th European Association for Machine Translation Workshop Improving machine translation through other language Technology Tools. Pp. 1-8.

Bogdan Babych and Anthony Hartley. 2004. Selecting Translation Strategies in MT using Automatic Named Entity Recognition. Proceedings of the 9th European Association for Machine Translation Workshop Broadening horizons of machine translation and its applications, Malta. Pp. 18-25.

Gilberto Castañeda-Hernández. 2004. Navigating through Treacherous Waters: The Translation of Geographical Names. Translation Journal, 8(2): [electronic resource]: http://accurapid.com/journal/28names.htm#1

Sarvnaz Karimi, Falk Scholer, and Turpin. 2007. Collapsed consonant and vowel models: new approaches for English-Persian transliteration and back-transliteration. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Pp. 648-655.

Kevin Knight and Jonathan Graehl. 1998. Machine Transliteration. Computational Linguistics, 24(4):599-612.

Chun-Jen Lee and Jason S. Chang. 2003. Acquisition of English-Chinese Transliteration Word Pairs from Parallel-Aligned Texts using a Statistical Machine Translation Model. Proceedings of Human Language Technologies – The North American Chapter of the Association for Computational Linguistics Workshop: Building and Using parallel Texts Data Driven Machine Translation and Beyond, Canada. Pp. 96-103.

Geoffrey Leech. 1981. Semantics. The Study of Meaning. 2nd edition. Penguin, UK.

Jochen Leidner. 2007. Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. PhD thesis. Institute for Communicating and Collaborative Systems, School of Informatics.


Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Pp. 159–166.

Katja Markert and Malvina Nissim. 2002. Towards a corpus annotated for metonymies: the case of location names. Proceedings of the 3rd International Conference on Language Resources and Evaluation. Pp. 1385-1392.

Helen M. Meng, Wai-Kit Lo, Berlin Chen, and Karen Tang. 2001. Generate Phonetic to Handle Named Entities in English-Chinese cross-language spoken document retrieval. Proceedings of Institute of Electrical and Electronics Engineers Automatic Speech Recognition and Understanding Workshop, Italy.

Jong-Hoon Oh and Key-Sun Choi. 2002. An English-Korean Transliteration Model Using Pronunciation and Contextual Rules. Proceedings of the 19th International Conference on Computational Linguistics, Taiwan, 1:1-7.

Sproat, Tao Tao, and Cheng-Xiang Zhai. 2006. Named entity transliteration with comparable corpora. Proceedings of the 44th Annual meeting of the Association for Computational Linguistics, Australia. Pp. 73-80.

Bonnie Glover Stalls and Kevin Knight. 1998. Translating Names and Technical Terms in Text. Proceedings of the Coling / Association for Computational Linguistics Workshop on Computational Approaches to Semitic Languages, Canada. Pp. 365-266.

Wolodja Wentland, Johannes Knopp, Carina Silberer, and Hartung. 2008. Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration. Proceedings of the 6th Language Resources and Evaluation Conference, Morocco.

Min Zhang, Haizhou Li, and Jian Su. 2004. Direct Orthographical Mapping for Machine Transliteration. Proceedings of the 20th International Conference on Computational Linguistics.

ISSN 1736-6305 Vol. 4 http://hdl.handle.net/10062/9206