Proceedings of the 21st Annual Conference of the European Association for Machine

28–30 May 2018 Universitat d’Alacant Alacant,

Edited by Juan Antonio P´erez-Ortiz Felipe S´anchez-Mart´ınez Miquel Espl`a-Gomis Maja Popovi´c Celia Rico Andr´eMartins Joachim Van den Bogaert Mikel L. Forcada

Organised by

research group The papers published in this proceedings are —unless indicated otherwise— covered by the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 International (CC-BY-ND 3.0). You may copy, distribute, and transmit the work, provided that you attribute it (au- thorship, proceedings, publisher) in the manner specified by the author(s) or licensor(s), and that you do not use it for commercial purposes. The full text of the licence may be found at https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en. c 2018 The authors

ISBN: 978-84-09-01901-4 Rule-based from Kazakh to Turkish

Sevilay Bayatli Sefer Kurnaz Dept. Elec. and Computer Engineering Dept. Elec. and Computer Engineering Altınbaş Üniversitesi Altınbaş Üniversitesi [email protected] [email protected]

Ilnar Salimzianov Jonathan North Washington Francis M. Tyers School of Science and Technology Linguistics Deptartment School of Linguistics Nazarbayev University Swarthmore College Higher School of Economics [email protected] [email protected] [email protected]

Abstract all the dictionary entries and transfer rules. Con- versely, in a corpus-based/statistical MT approach This paper presents a shallow-transfer ma- (Koehn, 2010), no such effort is required as the sys- chine translation (MT) system for translat- tem can be automatically built from parallel corpora ing from Kazakh to Turkish. Background providing they exist. If parallel corpora do not ex- on the differences between the languages ist, then we see two options that remain. The first is presented, followed by how the system is to create a new parallel corpus, either by translat- was designed to handle some of these differ- ing millions of words from scratch (requiring effort ences. The system is based on the Apertium from translators),¹ or by finding parallel text online free/open-source machine translation plat- and processing it (requiring effort from program- form. The structure of the system and how it mers). The second option is to build a rule-based works is described, along with an evaluation machine translation system (requiring effort from against two competing systems. Linguis- linguists). The most labour-intensive of these ap- tic components were developed, including a proaches is to translate the data from scratch, al- Kazakh-Turkish bilingual dictionary, Con- though this might be practical in certain situations. straint Grammar disambiguation rules, lex- Building a rule-based machine translation system ical selection rules, and structural transfer and finding and processing parallel texts from the rules. With many known issues yet to be ad- internet are, given equal available expertise, around dressed, our RBMT system has reached per- equally time consuming. As large, freely available formance comparable to publicly-available parallel corpora are not known to exist between corpus-based MT systems between the lan- Kazakh and Turkish and we were interested in struc- guages. tural differences between these two languages — and producing new linguistic resources, we chose to 1 Introduction build a rule-based MT system. Kazakh and Turk- In this paper we present a prototype shallow- ish are different enough that native speakers are not transfer rule-based machine translation system using able to make sense of the other language, but also the Apertium free/open-source machine translation share similar enough structure that an RBMT system platform (Forcada et al., 2011) for translating from is feasible with some level of linguistic knowledge. Kazakh to Turkish. This paper demonstrates that, with many known One of the most common criticisms towards issues yet to be addressed, our RBMT system has Rule-Based Machine Translation (RBMT) regards already reached performance comparable to publi- the amount of work necessary to build a system for cally available SMT systems between the languages. a new language pair (Arnold, 2003). In fact, in a This has been accomplished solely with open source traditional scenario, linguists with expertise in the tools and some level of linguistic knowledge about source and target language need to manually build ¹The millions of words number is taken from Koehn and Knowles (2017) who compare neural MT against phrase-based © 2018 The authors. This article is licensed under a Creative SMT for English–Spanish for 0.4 million to 385.7 million Commons 3.0 licence, no derivative works, attribution, CC- words. Even for these morphologically-poor languages, even BY-ND. with one million words the performance is poor.

P´erez-Ortiz,S´anchez-Mart´ınez,Espl`a-Gomis,Popovi´c,Rico, Martins, Van den Bogaert, Forcada (eds.) Proceedings of the 21st Annual Conference of the European Association for Machine Translation, p. 49–58 Alacant, Spain, May 2018. the two languages, and without large parallel cor- 3 The languages pora or machine learning algorithms (although some Kazakh is classified as a member of the Northwest- of the components in our opinion could be signifi- ern (or Kypchak) branch of the Turkic language cantly improved by using them; see Section 6 for family. It is primarily spoken in Kazakhstan, where more details). it is the national language, sharing official status with The paper will be laid out as follows: Section Russian. Large communities of native speakers also 2 gives a short review of some previous work in exist in China, neighbouring Central Eurasian re- the area of Turkic-Turkic language machine trans- publics, and Mongolia. The total number of speak- lation and an overview of other publically available ers is at least 10 million people (Simons and Fennig, Kazakh-Turkish machine translators; Section 3 in- 2018). The present-day Kazakh Cyrillic alphabet troduces Kazakh and Turkish and compares their consists of 42 letters, 33 of which are letters found in grammar; Section 4 describes the system and the the Russian alphabet. There are controversial plans tools used to construct it; Section 5 gives a prelim- to transition to a Latin alphabet by 2025. inary evaluation of the system; Section 6 describes Turkish is classified as a member of the South- our aims for future work; and finally Section 7 con- western (or Oghuz) branch of Turkic language fam- tains some concluding remarks. ily. With over 70 million L1 speakers (Simons and 2 Previous work Fennig, 2018), it is the Turkic language spoken by the most people. The Turkish Latin alphabet con- Within the Apertium project, there is ongoing tains 29 letters. work on building MT systems (and thus underly- Kazakh and Turkish exhibit agglutinative mor- ing components such as morphological transducers) phology, meaning that word forms may consist of for translating between two , be- a root and a series of affixes. tween a Turkic language and Russian or between An MT system between Kazakh and Turkish is a Turkic language and English. Among released potentially of great use to the language communities. MT systems there are: Kazakh-Tatar (Salimzyanov Automatic MT can save time and money over going et al., 2013), Tatar-Bashkir (Tyers et al., 2012b), through e.g. Russian or English (and the system is Crimean Tatar-Turkish and English-Kazakh (Sun- much easier to develop). detova et al., 2015) MT systems. We continue with a brief overview of some differ- Several other MT systems have been reported ences in phonology, orthography, morphology and that translate between Turkish and other Turkic lan- syntax of the languages. A complete and detailed guages, including Turkish–Crimean Tatar (Altıntaş, comparison is out of scope of this work. 2001), Turkish–Azerbaijani (Hamzaoglu, 1993), Turkish–Tatar (Gilmullin, 2008), and Turkish– 3.1 Phonology and orthography Turkmen (Tantuğ et al., 2007a,b) MT systems. As Differences in phonology and orthography are less for the systems for translating to/from Kazakh (be- relevant for this work because relatively high- sides the ones already mentioned above), there is coverage morphological transducers were available a bidirectional Kazakh-English machine translation for both of the languages when we started work- system (Tukeyev et al., 2011) which uses a link ing on the translator. The mutual intelligibility of grammar and statistical approach. None of these Kazakh and Turkish, both in their spoken and writ- MT systems to our knowledge have been released ten forms, is rather low, despite much similar mor- to a public audience. phology and the existence of many cognates, often Altenbek and Xiao-long (2010) propose a seg- with similar meanings. mentation system for inflectional affixes of Kazakh. Makhambetov et al. (2015), Kessikbayeva and Ci- 3.2 Morphology and syntax cekli (2014) and Kairakbay and Zaurbekov (2013) 3.2.1 Verbals present work on Kazakh morphological analysis. There are verbal tenses and moods common to Both Kazakh and Turkish are among the lan- both Kazakh and Turkish, like the definite past guages supported by ² and Yandex tense, the imperative mood, and the conditional Translate³ tools. mood. There are also quite a few differences. For

²http://translate.google.com example, Kazakh lacks the definite future tense af- ³http://translate.yandex.com fix -{y}{A}c{A}k known in Turkish; but has the so

50 called goal oriented future tense, absent in Turk- 3.2.2 Nominals ish: e.g., the Kazakh verb form бармакпын ‘I in- Another example of a morphological difference tend to go’ can be translated into Turkish as gitmeyi between Kazakh and Turkish is the presence of a düşünüyorum ‘I intend to go’. Another example of four-way distinction in Kazakh’s 2nd person system an affix found in one language but not in the other is (both pronouns and agreement suffixes). In other the affix -{D}{A}й in Kazakh, which follows nouns words, in Kazakh there is a distinct word for all and numbers and indicates resemblance, and can combinations of [ plural, formal] (Muhamedow, often be translated as the postposition gibi ‘like’ in 2016), whereas the± Turkish± 2nd person singular for- Turkish. mal pronoun coincides with the 2nd person plural informal and 2nd person plural formal pronouns, as Kazakh has several auxiliary verbs which are used summarized in Table 1 (both siz and sizler are used for constructing analytic verbal forms. Four of as the plural formal pronoun in Turkish). them, the auxiliary verbs жатыр, отыр, жүр, тұр are used to construct the present continuous Kazakh Turkish tense (Muhamedow, 2016), as in the collocation - + - + жауып жатыр ‘is raining’, translated to Turkish as yağıyor ‘is raining’. There are many other cases - сен сендер sen siz (i.e., not just due to analytic tenses) in which se- + сіз сіздер siz siz/sizler quences of two or more Kazakh verbs map to a sin- gle verb in Turkish, as in the case of the expression Table 1: Second person personal pronouns in Kazakh and Turkish. Note the extra distinctions in the Kazakh forms. қуанып кетті ‘gladdened’, which is translated as neşelendi ‘gladdened’ in Turkish. All of the differences or one-to-many/many-to- one correspondences in morphology and syntax de- In the case of non-finite forms, there are one- scribed in this and the preceding subsections are rel- to-many correspondences. For instance, Kazakh evant for a shallow-transfer RBMT because to han- past verbal adjectives (participles) formed with the dle them is the main job of the transfer component. -{G}{A}н suffix can be translated into Turkish in at least three ways: as past verbal adjective with 4 System the -m{I}ş suffix, as a subject-relative verbal adjec- Our machine translation system is based on the tive formed with the -{y}{A}n suffix or as a past Apertium MT platform (Forcada et al., 2011).⁴ The verbal adjective formed with the -{D}{I}k suffix. platform was originally aimed at the Romance lan- As an example, the Kazakh sentence Сербия мен guages of the Iberian peninsula, but has also been Қазақстан арасында шешілмеген мәселе жоқ. adapted for other, more distantly related, language ‘There aren’t any unresolved issues between Ser- pairs. The whole platform, both programs and data, bia and Kazakhstan’ can be translated into Turkish are licensed under the Foundation’s as Sırbistan ve Kazakistan arasında çözümlenmemiş General Public Licence⁵ (GPL) and all the software mesele yok., whereas the sentence Екі мемлекет and data for the completed supported language pairs басшылары шағын және кеңейтілген құрамда (and the other pairs being worked on) is available for келіссөздер жүргізді. ‘The two leaders held talks download from the project website. in small and expanded format.’ in the parallel cor- pus we constructed (see Section 4.3) is translated as 4.1 Architecture of the system İki memleket başkanları küçük ve genişletildiği kap- A typical translator built using the Apertium plat- samda müzekereler yönetti. form, including the translator described here, con- sists of a Unix-style pipeline or assembly line with Similarly, the Kazakh imperfect verbal adjective the following modules (see Fig. 1. In Table 2 you formed with the suffix -{E}т{I}н is translated as ei- can see an example of how a Kazakh sentence passes ther a subject-relative verbal adjective formed with through the pipeline): the -{y}{A}n suffix or as future verbal adjective con- stituted with -{y}{A}c{A}k suffix. For example, • De-formatter. Separates the text to be trans- the Kazakh phrase сөйлейтін can be translated as lated from the formatting tags. Formatting tags konuşacak ‘which will speak’ or as konuşan ‘(which ⁴http://www.apertium.org is) speaking’. ⁵http://www.fsf.org/licensing/licenses/gpl.html

51 SL deformatter text

morph. morph. lexical lexical structural morph. post- analyser disambig. transfer selection transfer generator generator

reformatter TL text

Figure 1: The pipeline architecture of a typical Apertium MT system.

are encapsulated in brackets so they are treated • Post-generator. Performs orthographic oper- as “superblanks” that are placed between words ations, for example elision (such as da + il = in such a way that the remaining modules see dal in Italian). This module has not been em- them as regular blanks. ployed in our translator so far.

• Morphological analyser. Segments the • Reformatter. De-encapsulates any format in- source-language (SL) text in surface forms formation. (SF) (words, or, where detected, multiword lexical units) and for each, delivers one or more The Apertium platform provides what can be lexical forms (LF) consisting of lemma (dic- called a ‘vanilla’ program and a formalism for de- tionary or citation form), lexical category (or scribing linguistic data (if the module in question part-of-speech) and inflection information. requires it) for each of the modules. We want to emphasise though that modules of the pipeline • Morphological disambiguator. A morpho- just described are independent from each other and logical disambiguator, in case of the Kazakh- thus can rely on different programs, different for- Turkish translator based on the Constraint malisms, and be of rule-based, statistical or hy- Grammar (CG) formalism (Karlsson et al., brid nature. For example, - 1995), chooses the most adequate sequence of based morphological disambiguator can be consid- morphological analyses for an ambiguous sen- ered a drop-in replacement for the Hidden Markov tence. Model-based statistical tagger found in a few other Apertium MT systems. So are the formalisms used • Lexical transfer. This module reads each for morphological transducers which are described SL LF and delivers the corresponding target- next. language (TL) LF by looking it up in a bilin- gual dictionary encoded as an finite-state trans- 4.2 Morphological transducers ducer compiled from the corresponding XML The morphological transducers are based on the file. The lexical transfer module may return Helsinki Finite State Toolkit (Linden et al., 2011) – more than one TL LF for a single SL LF. a free/open-source reimplementation of the Xerox finite-state tool chain, popular in the field of mor- • Lexical selection. A lexical selection module phological analysis. It implements both the lexc Tyers et al. (2012a) chooses, based on context formalism for defining lexicons, and the twol and rules, the most adequate translation of ambigu- xfst formalisms for modeling morphophonological ous SL LFs. rules. This toolkit has been chosen as it — or the • Structural transfer. A structural transfer equivalent XFST — has been widely used for other module, which performs local syntactic oper- Turkic languages (Cöltekin 2010; Altintas and Ci- ations, is compiled from XML files containing cekli 2001; Washington et al. 2012; Tantuğ et al. rules that associate an action to each defined LF 2006; Tyers et al. 2012b,Washington et al. 2014, pattern. Patterns are applied left-to-right, and Çöltekin 2014) and is available under a free/open- the longest matching pattern is always selected. source licence. The morphologies of both languages are implemented in lexc, and the morphophonolo- • Morphological generator. It transforms the gies of both languages are implemented in twol. The sequence of target–language LFs, produced by same lexc and twol files are used to compile both the the structural transfer, to a corresponding se- morphological analyser and the morphological gen- quence of target–language SFs. erator for each language.

52 (Kazakh) Input Біз жаттығулар барысын мұқият бақылап отырдық. Mor. analysis ^Біз/біз$ ^жаттығулар/жаттығу/жаттығу$ ^барысын/бары/барыс$ ^мұқият/мұқият/мұқият/мұқият$ ^бақылап/бақыла/бақыла$ ^отырдық/отыр/отыр$^./.$ Mor. disambig ^Біз$ ^жаттығу$ ^бары$ ^мұқият$ ^бақыла$ ^отыр$ ^.$ Lex. transfer ^Біз/Biz$ ^жаттығу/çalışma/egzersiz$ ^бары/süreç$ ^мұқият/dikkatlice$ ^бақыла/gözlemle$ ^отыр//otur$^./.$ Structural transfer ^Biz$ ^çalışma$ ^süreç$ ^dikkatlice$ ^gözlemle$^.$ Mor. generation Biz çalışmalar sürecini dikkatlice gözlemledik.

Table 2: Translation process (from Kazakh to Turkish) for the phrase Біз жаттығулар барысын мұқият бақылап отырдық. ‘We carefully followed the work process.’ Some analyses are omitted for reasons of space. Note how a transfer rule has trans- formed a participle + auxiliary construction of Kazakh, бақылап отырдық ‘we followed’, to an analytic construction in Turkish, gözlemledik ‘we followed’.

The Kazakh morphological transducer used in tur morphological transducers. This provided the this work was presented in (Washington et al., lemma and the part of speech tag for most of the 2014). Turkish morphological transducer also surface forms in the corpora. The lemmas which comes from the Apertium project. It has not been were not already in the monolingual lexicons were described in a published work yet. Both transducers added to them, and corresponding words were added were extended to support all stems from the bilin- to the bilingual lexicon. In addition, some of the gual lexicon we constructed. stems present in the Kazakh lexc file but not found We decided to use the Turkish morphological in the parallel corpus were translated into Turkish transducer developed in the Apertium project and and added to the bilingual dictionary. Because of not the also free/open-source TRMorph (Çöltekin, the similarity of the languages, the majority of en- 2014), because the tagset used in the former is tries in the bilingual dictionary (a file in an XML- more consistent with morphological transducers de- based format) are one-to-one mappings of stems, veloped in the Apertium project for other Turkic but there are ambiguous . For exam- languages, including the Kazakh transducer. The ple, the Kazakh word ‘азамат’ has two translations consistency of the tagset allows to keep the trans- in Turkish: ‘sivil’ and ‘vatandaş’, as shown in Fig. 2. fer module relatively simple and pay more attention to the actual differences in the grammar of the lan- guages rather than on differences in the tagset used. 4.4 Rules 4.3 Bilingual lexicon The bilingual lexicon currently contains 7,385 stem- The note made at the end of Section 4.1 on re- to-stem correspondences and was built mostly by placability of the components aside, Apertium is hand in the following way. We assembled a par- primarily a rule-based MT system. Not counting allel Kazakh-Turkish corpus. For this we took all morphophonology (morphotactics) rules required by sentences from the Kazakh treebank (Tyers and HFST-based morphological transducers, there are Washington, 2015) — approximately one thousand three main categories of rules in our system — sentences — and translated them manually to Turk- morphological disambiguation rules, lexical selec- ish. Then, these Kazakh and Turkish sentences were tion rules and transfer rules. A description of each analysed with the apertium-kaz and apertium- follows.

53 .

қас ruh

азамат sivil

азамат vatandaş

үлкен büyük

ұлттық ulusal

дауа çare

дауа ilaç

дауа çözüm

шешім çözüm

шешім karar

Figure 2: Example entries from the bilingual lexicon. Kazakh is on the left, and Turkish on the right. Each stem is accompanied by a part-of-speech tag and there may be many–many correspondences between the stems.

4.4.1 Morphological disambiguation rules 4.4.3 Structural transfer rules

The system has a morphological disambiguation Apertium, as a rule, translates lemmas and mor- module in the form of a Constraint Grammar (CG) phemes one by one. Obviously, this does not always (Karlsson et al., 1995). The version of the formalism work, even for closely related languages. Structural used is vislcg3.⁶ The goal of the CG rules is to select transfer rules are responsible for modifying mor- the correct morphological analysis when there are phology or word order in order to produce “ade- multiple analyses. We used the Kazakh CG previ- quate” target language. ously developed partially by the authors of this paper As seen in Table 2, the structural transfer mod- and partially by other Apertium contributors. At the ule takes a sequence of (source language lexical time of this writing the file contains 164 rules. Due form — target language lexical form) pairs in to closeness of the languages, the majority of ambi- the following format: ˆSL-lemma<…> the other. /TL-lemma<…>$ TL lemma and tags are provided by the preceding 4.4.2 Lexical selection rules two modules — lexical transfer and lexical selec- tion. The lexical transfer module looks up the TL In general, lexical selection rules are necessary to lemma and usually the first one or two tags (read: handle one-to-many correspondences of the bilin- part of speech tag) in the bilingual transducer, the gual lexicon. While many lexical items have a sim- rest of the tags are carried over from the SL. ilar range of meaning, lexical selection is some- Figure 3 gives an example of a transfer rule. times necessary when translating between Kazakh Any transfer rule consists of two core parts — of and Turkish as well. For example, the Kazakh word a pattern and an action. In this case, the pattern ат has two meanings: ат ‘name’ and ат ‘horse’ and named “gpr_impf” matches Kazakh verbal adjec- can be translated into Turkish as either ad ‘name’ or tives formed with the -{E}т{I}н affix (for reasons at ‘horse’. A lexical selection rule chooses the trans- of space, we omitted the definition of the pattern lation at ‘horse’ if the immediate context includes itself). Recall from Section 3.2.1 that Kazakh ver- a word ұста ‘hold’. Another example is the word bal adjectives ending in -{E}т{I}н have two possi- жарық, which as a noun can mean either ‘light’ ble translations in Turkish — either with a verbal or ‘crack’. It is translated to Turkish by default as adjective ending in -{y}{A}n suffix or with a ver- ışık ‘light’, and is translated as yarık ‘crack’ only in bal adjective ending in -{A}c{A}k suffix. The rule the immediate context of words like есік ‘door’ and in Figure 3 replaces the tag on the TL қабырға ‘wall’. A relatively small number of 92 lex- side with the tag, which corresponds to ical selection rules were developed and added to the a -{y}{A}n verbal adjective in Turkish. In addition, system. The lexical selection module we used (Ty- transfer rules perform chunking, and later transfer ers et al., 2012a) allows inferring such rules auto- stages can operate on chunks of words as if they matically from a parallel corpus, but we have not were single words, but we will not discuss chunking- employed this feature of it yet. based rules here since this technique is currently not employed in the Kazakh-Turkish translator.

⁶http://beta.visl.sdu.dk/constraint_grammar.html The patterns are matched on the SL side by the

54

Figure 3: Examples of a transfer rule. The first rule translates Kazakh -{E}т{I}н verbal adjectives with -(y)An verbal adjectives by replacing the tag on the TL side with the tag. The second rule transfer Kazakh -{G}{A}нл{I}{K} verbal noun (gerund) into -{D}{I}k verbal noun (Gerund) by replacing the tag on the TL side with the tag.

“left–to–right, longest–match” principle. For in- Corpus Tokens Coverage (%) stance, Kazakh determiner–adjective–noun phrase Wikipedia 22,515,314 83.42 Бұл үлкен жетістік ‘This big success’ will be Quran.altay 107,451 70.30 matched and processed by a rule having determiner– Bible.kkitap 577,070 80.89 adjective–noun sequence as its pattern, and not by e.g. a rule matching determiner–adjective or Table 3: Coverage of the Kazakh-Turkish MT system determiner–noun sequences. If there are ties, the rule which is placed higher in the source file is ap- plied first. Quran, and the Kazakh translation of the Bible. Since Kazakh and Turkish have similar syntax, The corpus extracted from Kazakh Wikipedia⁷ most of the rules are not about reordering but about contains 22,515,314 words (1,404,467 sentences). altering the tags carried over from SL LFs to TL Wikipedia is one of the major uses for Apertium LFs. We used the parallel corpus described in Sec- translators, especially since some of these are used tion 4.3 as a development set for writing and refining by the Wikimedia Content Translation Tool⁸. transfer rules. In all, we have defined 76 structural The coverage is the percentage of tokens in transfer rules. running text which are translated by the MT system. Apertium MT systems mark tokens that could not 5 Evaluation be analysed with a special sign, thus coverage was The system has been evaluated in two ways. The first calculated by dividing the number of tokens without is its coverage. The second is translation quality — that special sign by the total number of words in the the error rate of two pieces of text produced by the text. The coverage results are presented in Table 3. system when comparing with postedited versions of As seen in the table, the coverage of the system is them. not very high, with an average of 78.20% .

5.1 Coverage Coverage of the system was calculated over three ⁷https://kk.wikipedia.org/; kkwiki-20180320-pages- articles.xml.bz2 freely-available corpora — a dump of articles from ⁸https://www.mediawiki.org/wiki/Content_ Kazakh Wikipedia, a Kazakh translation of the translation

55 5.2 Translation quality The higher word error rate can be explained by the fact that Kazakh and Turkish are more distantly re- The translation quality was measured using two met- lated than Tatar and Kazakh or Bashkir. rics, the first was word error rate (WER), and the second was position-independent word error rate 5.3 Error analysis (PER). Both metrics are based on the Levenshtein The majority of remaining errors are mostly due to distance (Levenshtein, 1965). Metrics based on a lot of unknown words (because the relatively low word error rate were chosen as to be able to compare number of words in the bilingual dictionary), and the system against systems based on similar technol- disambiguation errors. ogy, and to assess the usefulness of the system in a real setting, that is of translating for dissemination. 6 Future work Besides calculating WER and PER for our Kazakh-Turkish MT system, we did the same for We intend to continue the development of the sys- two other publically available Kazakh-Turkish MT tem to improve the quality of the translations. There systems — from Google Translate and Yandex are a number of areas where we believe that more Translate. The procedure was the same for all work would yield better results, among which are three. We took a small (1,025 tokens) Kazakh text, the following: which was a concatenation of several articles from • Coverage. By expanding the dictionaries with Wikipedia and translated it using the three MT sys- new lists of stems, and providing bilingual cor- tems. The output of each system was postedited in- respondences, the error rate will decrease and, dependently to avoid biasing in favour of one par- consequently, there will be less post-editing ticular system. Then we calculated WER and PER work necessary (and translations will be much for each using the apertium-eval-translator more intelligible). The principle issue here is tool⁹ and BLEU using the mteval-v13a.pl script.¹⁰ adequate Kazakh–Turkish lexicography which Note that BLEU score is typically calculated by is an under-investigated area. comparing against a pre-translated reference trans- lation, where here we calculate against posteditted • Ambiguous transfer rules. The “left-to-right, reference translations for each of the systems. longest match” principle by which structural transfer rules are applied (which implies that 5.2.1 Results if there are several rules with the same pattern, Table 4 shows the results obtained for all only one of them will apply — the first one that three systems — Google, Yandex and Apertium. matches that pattern), although makes it easy Google’s¹¹ MT system has the lowest WER. to change the behaviour of the system (simply Apertium has a comparable WER despite having by reordering rules), in our opinion, is quite much higher number of OOV words. Yandex limiting. In particular, it limits the ways how Translate’s¹² WER is higher, but PER is similar to one-to-many correspondences can be handled the other two. (several examples of such cases were given in Section 3), essentially forcing the developer to These numbers can be compared with scores for hard code one of the several possible transla- other translators based on the Apertium platform. tions as the default one and checking the sur- For example, the Kazakh–Tatar system described rounding lemmas or other features in the rule in (Salimzyanov et al., 2013) achieves post-edition manually if he wishes to select an alternative WER of 15.19% and 36.57% over two texts of translation. We conceive a method of select- 2,457 words and 2,862 words respectively. The ing the most adequate rules for a given input by Tatar–Bashkir system in (Tyers et al., 2012b) re- setting weights learned with an unsupervised ports WER of 8.97% over a small text of 311 words learning algorithm. and WER of 7.72% over another text of 312 words. 7 Conclusion ⁹http://wiki.apertium.org/wiki/ apertium-eval-translator To our knowledge we have presented the first ¹⁰https://github.com/moses-smt/mosesdecoder/blob/ free/open-source MT system between Kazakh and master/scripts/generic/mteval-v13a.pl ¹¹https://translate.google.com/#kk/tr/ Turkish. The performance is similar to other trans- ¹²https://ceviri.yandex.com.tr/ lators created using the same technology, and in

56 System OOV WER (%) PER (%) BLEU Yandex 43 69.73 48.63 2.84 Apertium 128 45.77 41.69 16.67 Google 5 43.85 33.67 16.32

Table 4: Word error rate and Position-independent word error rate; OOV is the number of out-of-vocabulary (unknown) words. The Google system has a similar word error rate to the Apertium system despite the significantly lower number of out-of-vocabulary words. Note that the BLEU scores are computed against a posteditted reference translation. terms of WER to SMT systems available. The sys- Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., tem beats the SMT systems on BLEU, while hav- O’Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., ing a much higher out-of-vocabulary rate. This Sánchez-Martínez, F., Ramírez-Sánchez, G., and would suggest that given better vocabulary cover- Tyers, F. M. (2011). Apertium: a free/open- age the system would perform significantly better source platform for rule-based machine transla- than SMT and NMT systems. The system is avail- tion. Machine translation, 25(2):127–144. able as free/open-source software under the GNU GPL and the whole system may be downloaded from Gilmullin, R. (2008). The Tatar-Turkish machine Github.¹³ translation based on the two-level morphological analyzer. Interactive Systems and Technologies: Acknowledgements The Problems of Human-Computer Interaction, pages 179–186. We would like to thank Marzhan Zhakupova for translating the Kazakh sentences in the development Hamzaoglu, I. (1993). Machine translation from corpus. Turkish to other Turkic languages and an imple- mentation for the Azeri language. PhD thesis, References MSc Thesis, Bogazici University, Istanbul. Altenbek, G. and Xiao-long, W. (2010). Kazakh segmentation system of inflectional affixes. In Kairakbay, B. M. and Zaurbekov, D. L. (2013). Proceedings of CIPS-SIGHAN Joint Conference Finite state approach to the Kazakh nominal on Chinese Language Processing, pages 183–190. paradigm. In FSMNLP, pages 108–112. Altıntaş, K. (2001). Turkish to Crimean Tatar ma- Karlsson, F., Voutilainen, A., Heikkilae, J., and chine translation system. PhD thesis, Bilkent Uni- Anttila, A. (1995). Constraint Grammar: a versity. language-independent system for parsing unre- stricted text, volume 4. Walter de Gruyter. Altintas, K. and Cicekli, I. (2001). A morphological analyser for Crimean Tatar. In Proceedings of the Kessikbayeva, G. and Cicekli, I. (2014). Rule based 10th Turkish Symposium on Artificial Intelligence morphological analyzer of . In and Neural Networks (TAINN’2001), pages 180– Proceedings of the 2014 Joint Meeting of SIG- 189. MORPHON and SIGFSM, pages 46–54.

Arnold, D. (2003). Why translation is difficult for Koehn, P. (2010). Statistical Machine Translation. computers. Computers and Translation: A trans- Cambridge University Press. lator’s guide, Amsterdam and Philadelphia: John Benjamins, pages 119–42. Koehn, P. and Knowles, R. (2017). Six challenges for neural machine translation. In Proceedings of Cöltekin, C. (2010). A freely available morpholog- the First Workshop on Neural Machine Transla- ical analyzer for Turkish. In LREC, volume 2, tion, pages 28–39. pages 19–28. Linden, K., Silfverberg, M., Axelson, E., Hardwick, Çöltekin, Ç. (2014). A set of open source tools for S., and Pirinen, T. (2011). HFST–Framework for Turkish natural language processing. In LREC, Compiling and Applying Morphologies, volume pages 1079–1086. 100 of Communications in Computer and Infor- ¹³https://github.com/apertium/apertium-kaz-tur mation Science, pages 67–85.

57 Makhambetov, O., Makazhanov, A., Sabyrgaliyev, Tyers, F. M., Washington, J. N., Salimzyanov, I., I., and Yessenbayev, Z. (2015). Data-driven and Batalov, R. (2012b). A prototype machine morphological analysis and disambiguation for translation system for Tatar and Bashkir based Kazakh. In International Conference on Intelli- on free/open-source components. In First Work- gent Text Processing and Computational Linguis- shop on Language Resources and Technologies tics, pages 151–163. Springer. for Turkic Languages, page 11.

Muhamedow, R. (2016). Kazakh: A Comprehensive Washington, J., Ipasov, M., and Tyers, F. M. Grammar. Routledge: Oxford. (2012). A finite-state morphological transducer for Kyrgyz. In Proceedings of the Eighth Inter- Salimzyanov, I., Washington, J., and Tyers, F. national Conference on Language Resources and (2013). A free/open-source Kazakh-Tatar ma- Evaluation (LREC), pages 934–940. chine translation system. Machine Translation Summit XIV. Washington, J., Salimzyanov, I., and Tyers, F. M. (2014). Finite-state morphological transducers Simons, G. F. and Fennig, C. D. (2018). Ethno- for three Kypchak languages. In Proceedings of logue: Languages of the world. Twenty-first edi- the 9th Conference on Language Resources and tion. SIL, Dallas, Texas. Evaluation (LREC), pages 3378–3385. Sundetova, A., Forcada, M., and Tyers, F. (2015). A free/open-source machine translation system for English to Kazakh. In 3rd International Con- ference on Turkic Languages Processing (Turk- Lang 2015), Kazan, Tatarstan, pages 78–91.

Tantuğ, A. C., Adalı, E., and Oflazer, K. (2006). Computer analysis of the Turkmen language mor- phology. In Advances in Natural Language Pro- cessing, pages 186–193. Springer.

Tantuğ, A. C., Adalı, E., and Oflazer, K. (2007a). Machine translation between Turkic languages. pages 189–192. Association for Computational Linguistics.

Tantuğ, A. C., Adalı, E., and Oflazer, K. (2007b). An MT system from Turkmen to Turkish employ- ing finite state and statistical methods.

Tukeyev, U., Melby, A., and Zhumanov Zh, M. (2011). Models and algorithms of translation of the Kazakh language sentences into English lan- guage with use of link grammar and the statistical approach. In Proc. of IV Congress of the Turkic World Math. Society, pages 1–3.

Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L., et al. (2012a). Flexible finite-state lexical selec- tion for rule-based machine translation.

Tyers, F. M. and Washington, J. (2015). Towards a free/open-source universal dependency treebank for Kazakh. In Proceedings of the 3rd Inter- national Conference on Computer Processing in Turkic Languages (TurkLang), pages 276–289.

58