Proceedings of the 21St Annual Conference of the European Association for Machine Translation
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 21st Annual Conference of the European Association for Machine Translation 28{30 May 2018 Universitat d'Alacant Alacant, Spain Edited by Juan Antonio P´erez-Ortiz Felipe S´anchez-Mart´ınez Miquel Espl`a-Gomis Maja Popovi´c Celia Rico Andr´eMartins Joachim Van den Bogaert Mikel L. Forcada Organised by research group The papers published in this proceedings are |unless indicated otherwise| covered by the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 International (CC-BY-ND 3.0). You may copy, distribute, and transmit the work, provided that you attribute it (au- thorship, proceedings, publisher) in the manner specified by the author(s) or licensor(s), and that you do not use it for commercial purposes. The full text of the licence may be found at https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en. c 2018 The authors ISBN: 978-84-09-01901-4 Rule-based machine translation from Kazakh to Turkish Sevilay Bayatli Sefer Kurnaz Dept. Elec. and Computer Engineering Dept. Elec. and Computer Engineering Altınbaş Üniversitesi Altınbaş Üniversitesi [email protected] [email protected] Ilnar Salimzianov Jonathan North Washington Francis M. Tyers School of Science and Technology Linguistics Deptartment School of Linguistics Nazarbayev University Swarthmore College Higher School of Economics [email protected] [email protected] [email protected] Abstract all the dictionary entries and transfer rules. Con- versely, in a corpus-based/statistical MT approach This paper presents a shallow-transfer ma- (Koehn, 2010), no such effort is required as the sys- chine translation (MT) system for translat- tem can be automatically built from parallel corpora ing from Kazakh to Turkish. Background providing they exist. If parallel corpora do not ex- on the differences between the languages ist, then we see two options that remain. The first is presented, followed by how the system is to create a new parallel corpus, either by translat- was designed to handle some of these differ- ing millions of words from scratch (requiring effort ences. The system is based on the Apertium from translators),¹ or by finding parallel text online free/open-source machine translation plat- and processing it (requiring effort from program- form. The structure of the system and how it mers). The second option is to build a rule-based works is described, along with an evaluation machine translation system (requiring effort from against two competing systems. Linguis- linguists). The most labour-intensive of these ap- tic components were developed, including a proaches is to translate the data from scratch, al- Kazakh-Turkish bilingual dictionary, Con- though this might be practical in certain situations. straint Grammar disambiguation rules, lex- Building a rule-based machine translation system ical selection rules, and structural transfer and finding and processing parallel texts from the rules. With many known issues yet to be ad- internet are, given equal available expertise, around dressed, our RBMT system has reached per- equally time consuming. As large, freely available formance comparable to publicly-available parallel corpora are not known to exist between corpus-based MT systems between the lan- Kazakh and Turkish and we were interested in struc- guages. tural differences between these two languages — and producing new linguistic resources, we chose to 1 Introduction build a rule-based MT system. Kazakh and Turk- In this paper we present a prototype shallow- ish are different enough that native speakers are not transfer rule-based machine translation system using able to make sense of the other language, but also the Apertium free/open-source machine translation share similar enough structure that an RBMT system platform (Forcada et al., 2011) for translating from is feasible with some level of linguistic knowledge. Kazakh to Turkish. This paper demonstrates that, with many known One of the most common criticisms towards issues yet to be addressed, our RBMT system has Rule-Based Machine Translation (RBMT) regards already reached performance comparable to publi- the amount of work necessary to build a system for cally available SMT systems between the languages. a new language pair (Arnold, 2003). In fact, in a This has been accomplished solely with open source traditional scenario, linguists with expertise in the tools and some level of linguistic knowledge about source and target language need to manually build ¹The millions of words number is taken from Koehn and Knowles (2017) who compare neural MT against phrase-based © 2018 The authors. This article is licensed under a Creative SMT for English–Spanish for 0.4 million to 385.7 million Commons 3.0 licence, no derivative works, attribution, CC- words. Even for these morphologically-poor languages, even BY-ND. with one million words the performance is poor. P´erez-Ortiz,S´anchez-Mart´ınez,Espl`a-Gomis,Popovi´c,Rico, Martins, Van den Bogaert, Forcada (eds.) Proceedings of the 21st Annual Conference of the European Association for Machine Translation, p. 49{58 Alacant, Spain, May 2018. the two languages, and without large parallel cor- 3 The languages pora or machine learning algorithms (although some Kazakh is classified as a member of the Northwest- of the components in our opinion could be signifi- ern (or Kypchak) branch of the Turkic language cantly improved by using them; see Section 6 for family. It is primarily spoken in Kazakhstan, where more details). it is the national language, sharing official status with The paper will be laid out as follows: Section Russian. Large communities of native speakers also 2 gives a short review of some previous work in exist in China, neighbouring Central Eurasian re- the area of Turkic-Turkic language machine trans- publics, and Mongolia. The total number of speak- lation and an overview of other publically available ers is at least 10 million people (Simons and Fennig, Kazakh-Turkish machine translators; Section 3 in- 2018). The present-day Kazakh Cyrillic alphabet troduces Kazakh and Turkish and compares their consists of 42 letters, 33 of which are letters found in grammar; Section 4 describes the system and the the Russian alphabet. There are controversial plans tools used to construct it; Section 5 gives a prelim- to transition to a Latin alphabet by 2025. inary evaluation of the system; Section 6 describes Turkish is classified as a member of the South- our aims for future work; and finally Section 7 con- western (or Oghuz) branch of Turkic language fam- tains some concluding remarks. ily. With over 70 million L1 speakers (Simons and 2 Previous work Fennig, 2018), it is the Turkic language spoken by the most people. The Turkish Latin alphabet con- Within the Apertium project, there is ongoing tains 29 letters. work on building MT systems (and thus underly- Kazakh and Turkish exhibit agglutinative mor- ing components such as morphological transducers) phology, meaning that word forms may consist of for translating between two Turkic languages, be- a root and a series of affixes. tween a Turkic language and Russian or between An MT system between Kazakh and Turkish is a Turkic language and English. Among released potentially of great use to the language communities. MT systems there are: Kazakh-Tatar (Salimzyanov Automatic MT can save time and money over going et al., 2013), Tatar-Bashkir (Tyers et al., 2012b), through e.g. Russian or English (and the system is Crimean Tatar-Turkish and English-Kazakh (Sun- much easier to develop). detova et al., 2015) MT systems. We continue with a brief overview of some differ- Several other MT systems have been reported ences in phonology, orthography, morphology and that translate between Turkish and other Turkic lan- syntax of the languages. A complete and detailed guages, including Turkish–Crimean Tatar (Altıntaş, comparison is out of scope of this work. 2001), Turkish–Azerbaijani (Hamzaoglu, 1993), Turkish–Tatar (Gilmullin, 2008), and Turkish– 3.1 Phonology and orthography Turkmen (Tantuğ et al., 2007a,b) MT systems. As Differences in phonology and orthography are less for the systems for translating to/from Kazakh (be- relevant for this work because relatively high- sides the ones already mentioned above), there is coverage morphological transducers were available a bidirectional Kazakh-English machine translation for both of the languages when we started work- system (Tukeyev et al., 2011) which uses a link ing on the translator. The mutual intelligibility of grammar and statistical approach. None of these Kazakh and Turkish, both in their spoken and writ- MT systems to our knowledge have been released ten forms, is rather low, despite much similar mor- to a public audience. phology and the existence of many cognates, often Altenbek and Xiao-long (2010) propose a seg- with similar meanings. mentation system for inflectional affixes of Kazakh. Makhambetov et al. (2015), Kessikbayeva and Ci- 3.2 Morphology and syntax cekli (2014) and Kairakbay and Zaurbekov (2013) 3.2.1 Verbals present work on Kazakh morphological analysis. There are verbal tenses and moods common to Both Kazakh and Turkish are among the lan- both Kazakh and Turkish, like the definite past guages supported by Google Translate² and Yandex tense, the imperative mood, and the conditional Translate³ tools. mood. There are also quite a few differences. For ²http://translate.google.com example, Kazakh lacks the definite future tense af- ³http://translate.yandex.com fix -{y}{A}c{A}k known in Turkish; but has the so 50 called goal oriented future tense, absent in Turk- 3.2.2 Nominals ish: e.g., the Kazakh verb form бармакпын ‘I in- Another example of a morphological difference tend to go’ can be translated into Turkish as gitmeyi between Kazakh and Turkish is the presence of a düşünüyorum ‘I intend to go’. Another example of four-way distinction in Kazakh’s 2nd person system an affix found in one language but not in the other is (both pronouns and agreement suffixes).