A Free/Open-Source Kazakh-Tatar Machine Translation System


Ilnar Salimzyanov
Kazan Federal University
Kazan, Republic of Tatarstan, Russian Federation

Jonathan North Washington
Departments of Linguistics and Central Eurasian Studies
Indiana University
Bloomington, Indiana 47405 USA

Francis Morton Tyers
Departament de Llenguatges i Sistemes Informàtics
Universitat d’Alacant
E-03877 Alacant

Abstract

This paper presents a bidirectional machine translation system between Kazakh and Tatar, two Turkic languages. Background on the differences between the languages is presented, followed by how the system was designed to handle some of these differences. We provide an evaluation of the system’s performance and directions for future work.

1 Introduction

This paper presents a prototype shallow-transfer rule-based machine translation system between Kazakh and Tatar. The paper is laid out as follows: Section 2 gives a short review of some previous work in the area of Turkic–Turkic language machine translation; Section 3 introduces Kazakh and Tatar and compares their grammar; Section 4 describes the system and the tools used to construct it; Section 5 gives a preliminary evaluation of the system; and finally Section 6 describes our aims for future work and some concluding remarks.

2 Previous work

Within the Apertium project, work on several MT systems between Turkic languages has been started (Turkish–Kyrgyz, Azeri–Turkish, Tatar–Bashkir), but the Kazakh–Tatar system described by the present study is the closest to production-ready of them. Among these systems is a prototype Tatar–Bashkir machine translation system which was built by the authors of this paper (Tyers et al., 2012a); due to the closeness of these languages, it provided high accuracy in its translations but, being a prototype system by design, had relatively low coverage.

Besides these systems, several previous works on making machine translation systems between Turkic languages exist, although to our knowledge none are publicly available except for the Turkish–Azerbaijani pair available through Google Translate.[1] Some MT systems have been reported that translate between Turkish and other Turkic languages, including Turkish–Crimean Tatar (Altintas, 2001b), Turkish–Azerbaijani (Hamzaoğlu, 1993), Turkish–Tatar (Gilmullin, 2008), and Turkish–Turkmen (Tantuğ et al., 2007), though none of these have been released to a public audience.

[1] http://translate.google.com

3 Languages

Both Tatar and Kazakh belong to the Kypchak (or Northwestern) group of Turkic languages. The spoken and written languages share some level of mutual intelligibility to native speakers, though this is somewhat limited, and is obscured by different orthographical conventions and some opaque correspondences.

Kazakh is primarily spoken in Kazakhstan, where it is the national language and shares official status with Russian. Large communities of native speakers also exist in China, neighbouring Central-Eurasian republics, and Mongolia. The total number of speakers is at least 10 million people.

Tatar is spoken in and around Tatarstan by approximately 6 million people. It is co-official with Russian in Tatarstan, a republic within Russia. A majority of native speakers of both languages are bilingual in Russian.

An MT system between Tatar and Kazakh is potentially of great use to the language communities. Direct MT can save time and money over going through e.g. Russian (and the system is much easier to develop). Some accompanying language resources (e.g., morphological analysers, disambiguators) can also be repurposed.

3.1 Phonological differences

As closely related languages, Kazakh and Tatar share many phonological processes, including front-back vowel harmony systems, consonant voicing assimilation, and even a typologically rare consonantal nasal harmony system. However, the differing details of these processes and the existence of processes unique to each language render Kazakh and Tatar fairly different. For example, Kazakh has a ubiquitous system of desonorisation of the initial sonorants found in many common morphemes. Furthermore, Tatar has nasal assimilation of the initial /l/ of the plural suffix.

3.2 Orthographic differences

The standard varieties of Kazakh and Tatar our system deals with are both written in Cyrillic, though their implementations of Cyrillic differ in many ways.

While Tatar and Kazakh both have a velar/uvular obstruent distinction (e.g., /k/ vs. /q/) that interacts with adjacent vowels, the Tatar orthography only has one series of letters (e.g., ‹к›), relying on adjacent vowels (and employing ‹ъ› ‘hard sign’ and ‹ь› ‘soft sign’ when these fail) to differentiate the two, whereas Kazakh has two series of obstruents (e.g., ‹к› and ‹қ›).

Kazakh does not orthographically distinguish high unrounded vowels (/ɘ/ ‹і› and /ə/ ‹ы›) before glides (/w/ ‹у› and /j/ ‹й›) and represents both combinations with one letter; i.e., /ɘj/ and /əj/ are both written ‹и›, while /ɘw/ and /əw/ are both written ‹у›. The quality of these vowels must be known in order to predict the quality of following harmonising vowels. Additionally, Tatar and Kazakh both use ‘yoticised’ vowels; i.e., when ‹о›, ‹у›, or ‹а› (along with ‹ә› in Tatar) follow /j/, a single character is used to represent the combination: ‹ё›, ‹ю›, and ‹я› respectively.[2]

All of these orthographical conventions present acute challenges to designing accurate morphological transducers for the languages.

[2] Furthermore, in Tatar, /j/ followed by ‹э› or ‹ы› is represented by ‹е›, though ‹е› is also the non-word-initial variant of ‹э›.

3.3 Morphological differences

There are a number of examples where the morphologies of Kazakh and Tatar are rather different, including morphemes in one language that do not exist in the other, entirely different uses of the same morpheme combinations, and morphotactic differences (i.e., allowable ordering and placement of morphemes).

An example of a morpheme that does not exist in one of the two languages is Kazakh -{E}т{I}н, which is used to form non-past verbal adjectives and verbal nouns. The semantically equivalent structure in Tatar is -{E} торган, which historically corresponds to the source of the Kazakh morpheme; however, the use of -{E} тұрған in modern Kazakh is different from that of -{E}т{I}н.

Another example of a far-reaching morphological difference between Tatar and Kazakh is the presence of a four-way distinction in Kazakh’s 2nd person system (both pronouns and agreement suffixes), where Tatar only has a two-way distinction. Kazakh has a distinct pronoun for all combinations of [±plural, ±formal], whereas Tatar collapses all pronouns except the [-plural, -formal] into one pronoun, as summarised in Table 1.

(a) Kazakh 2nd person pronouns
           [-pl]    [+pl]
[-frm]     сен      сендер
[+frm]     сіз      сіздер

(b) Tatar 2nd person pronouns
           [-pl]    [+pl]
[-frm]     син      сез
[+frm]     сез      сез

Table 1: 2nd pers. pronoun systems of Kazakh and Tatar

This systematic difference would seem to be a minor issue, since, as is typical in pro-drop languages, pronouns are only used for emphasis and clarification. However, this difference between Tatar and Kazakh in the second-person system runs much deeper than just the pronoun system. Since all finite verb forms morphologically agree in person and number with their subject and all possessed nouns agree in person and number with their possessor (even when there is no overt pronoun, in either situation), the Kazakh and Tatar systems of agreement suffixes reflect the same pattern; i.e., there are several sets of agreement morphemes which have a one-to-one correspondence with the pronouns in each language, resulting in several systems of suffixes in each language that have the same set of distinctions as in the 2nd person pronoun systems.

The past tense systems of Kazakh and Tatar have a many-to-many correspondence. As shown in Table 2, at a basic level, in the past tense, Kazakh differentiates [±eyewitness][3] (where [-eyewitness] is used for cases of both potentially unreliable information and newly discovered information) and [±recent], whereas Tatar has only three categories: eyewitness, non-eyewitness, and newly-acquired information, all with no [±recent] distinction. As an example of the many-to-many correspondence that this results in, Tatar has a single non-eyewitness past tense morpheme (-GAн-) while Kazakh has a recent non-eyewitness past (-Iп-) and a distant non-eyewitness past (-GAн екен-). On the other hand, these two non-eyewitness past forms in Kazakh are used for both potentially unreliable information and newly acquired information, whereas in Tatar, non-eyewitness (-GAн-) and newly-acquired information (-GAн- икән) past forms are distinguished.

(a) Kazakh past tense morphology
               [+recent]    [-recent]
[+reliable]    -DI-         -GAн-
[-reliable]    -Iп-         -GAн екен-

(b) Tatar past tense morphology
               [-new]       [+new]
[+reliable]    -DI-         -GAн- икән
[-reliable]    -GAн-        (none)

Table 2: A comparison of the basic past-tense morphology of Kazakh and Tatar

The morphotactics of the cognate Kazakh distant non-eyewitness past (-GAн екен-) and Tatar newly-acquired-information past (-GAн- икән) are different.[4] Specifically, in both languages, the person agreement takes the form of a person copula suffix; however, in Kazakh this suffix follows the tense morphemes (e.g., барған екенсің “you apparently went”), whereas in Tatar this suffix intervenes between the two pieces of the ‘compound’ tense morpheme (e.g., баргансың икән “I guess you went”).

Another morphotactic difference between Kazakh and Tatar is found with the negative forms of the cognate -GAн- past tenses. In Kazakh, the negative form of the non-recent reliable-information past tense is -GAн емес-, whereas in Tatar, the negative form of the non-eyewitness past tense is -мAGAн-.

[3] “Eyewitness” is a convenient term for this feature, though it may be better expressed as simply “reliability of knowledge” (which indeed often equates to whether the knowledge was acquired first-hand or not) in many cases.
[4] This comparison is made without regard to semantic alignment.

3.4 Syntactic differences

There are a number of minor syntactic differences between Tatar and Kazakh, which include differences in verb valencies in equivalent translations, as well as Tatar’s reliance on a “true” infinitive that is used in place of various verbal noun and verbal adverb forms in Kazakh.

An example of a difference in verb valencies is with the expression corresponding to “to like to do something” in Kazakh and Tatar. In Kazakh, the verb ұна is used, as shown in example (1), where the subject “I” in English is expressed through a dative experiencer in Kazakh and the gerundive “writing dictionaries” is the grammatical subject. Tatar, on the other hand, uses a verb whose arguments correspond to the arguments of “to like” in English, as shown in example (2), where the first person pronoun is in nominative case as the grammatical subject and the infinitival verb phrase is the grammatical direct object.

(1) Маған сөздік түзген ұнайды.
    маған   сөздік      түз-GAн     ұна-E-дI
    1.S.D   dictionary  compile-G   like-A-3.S
    ‘I like writing dictionaries.’

(2) Мин сүзлек төзергә яратам.
    мин   сүзлек      төз-IргA    ярат-E-м
    1.S   dictionary  compile-I   like-P-1.S
    ‘I like writing dictionaries.’

In Kazakh, a gerund (or verbal noun), often with case marking and person agreement (via possessive suffixes), is used to make a verb phrase an argument to another main phrase. In Tatar, many of these phrases use an invariant infinitive form. Examples are shown in (3–6).

(3) Мен үйге қайтуым керек.
    мен   үй-GA    қайт-у-Iм   керек
    I     home-D   go-G-1      need
    ‘I need to go home.’

(4) Миңа өйгә кайтырга кирәк.
    мин-GA   өй-GA    кайт-IргA   кирәк
    I-D      home-D   go-I        need
    ‘I need to go home.’

(5) Айгүл оны табуға әрекет жасап жүр.
    Айгүл   о-NI   тап-у-GA   әрекет   жаса-Iп   жүр
    Aygül   he-A   find-G-D   effort   make-P    P
    ‘Aygül is trying to find him.’

(6) Айгөл аны табарга тырыша.
    Айгөл   а-NI    тап-AргA   тырыш-E
    Aygöl   3.S-A   find-I     try-P
    ‘Aygöl is trying to find him.’

As shown in (7–8), the Tatar infinitive also corresponds to a verbal adverb form in Kazakh.

(7) Мен сенімен сөйлескелі келдім.
    мен   сен-мен   сөйле-с-GAлI   кел-DI-м.
    I     you-I     talk-C-VA      come-I-1.S
    ‘I came to speak with you.’

(8) Мин синең белән сөйлешергә килдем.
    мин   син-Iң   белән   сөйле-ш-IргA   кил-DI-м.
    I     you-G    with    talk-C-I       come-I-1.S
    ‘I came to speak with you.’

This example also demonstrates the correspondence of the Kazakh instrumental case -Мен to the Tatar postposition белән ‘with’, which are cognate structures; while their phonology and orthographic standards differ, they are largely parallel in use.

4 System

The system is based on the Apertium machine translation platform (Forcada et al., 2011).[5] The platform was originally aimed at the Romance languages of the Iberian peninsula, but has also been adapted for other, more distantly related, language pairs. The whole platform, both programs and data, is licensed under the Free Software Foundation’s General Public Licence[6] (GPL), and all the software and data for the 30 supported language pairs (and the other pairs being worked on) are available for download from the project website.

[5] http://www.apertium.org
[6] http://www.fsf.org/licensing/licenses/gpl.html

4.1 Architecture of the system

The Apertium translation engine consists of a Unix-style pipeline or assembly line with the following modules (see Figure 1):

• A deformatter which encapsulates the format information in the input as superblanks that will then be seen as blanks between words by the other modules.
• A morphological analyser which segments the text into surface forms (SF) (words, or, where detected, multiword lexical units or MWLUs) and for each, delivers one or more lexical forms (LF) consisting of lemma, lexical category and morphological information.
• A morphological disambiguator (CG) which chooses, using linguistic rules, the most adequate sequence of morphological analyses for an ambiguous sentence.
• A lexical transfer module which reads each source-language (SL) LF and delivers the corresponding target-language (TL) LF by looking it up in a bilingual dictionary encoded as an FST compiled from the corresponding XML file. The lexical transfer module may return more than one TL LF for a single SL LF.
• A lexical selection module (Tyers et al., 2012b) which chooses, based on context rules, the most adequate translation of ambiguous source language LFs.
• A structural transfer module which performs local syntactic operations; it is compiled from XML files containing rules that associate an action with each defined LF pattern. Patterns are applied left-to-right, and the longest matching pattern is always selected.
• A morphological generator which delivers a TL SF for each TL LF, by suitably inflecting it.
• A reformatter which de-encapsulates any format information.

Table 5 provides an example of a single phrase as it moves through the pipeline.

4.2 Morphological transducers

The morphological transducers are based on the Helsinki Finite State Toolkit (Linden et al., 2011), a free/open-source reimplementation of the Xerox finite-state toolchain, popular in the field of morphological analysis. It implements both the lexc formalism for defining lexicons, and the twol and xfst formalisms for modeling morphophonological rules. It also supports other finite state transducer formalisms such as sfst. This toolkit has been chosen as it (or the equivalent XFST) has been widely used for other Turkic languages (Çöltekin, 2010; Altintas, 2001a; Tantuğ et al., 2006; Washington et al., 2012; Tyers et al., 2012a), and is available under a free/open-source licence.

The morphologies of both languages are implemented in lexc, and the morphophonologies of both languages are implemented in twol.

Use of lexc allows for straightforward definition of different word classes and subclasses. For example, Tatar (but not Kazakh) has two classes of verbs: one which takes a harmonised high vowel in the infinitive (the default), and one which takes a harmonised low vowel in the infinitive. Class membership cannot be predicted based on any phonological criteria and is simply a lexical property of any given verb. This was implemented in lexc with two similar sets of continuation lexica for verbs: one pointing at a lexicon with an A-initial infinitive ending, and another pointing at a lexicon with an I-initial infinitive ending. These two sets of continuation lexica are otherwise the same.

Use of twol allows for phonological processes present in the languages, like vowel harmony and desonorisation, to be implemented in a straightforward manner. For example, in Tatar, the A and I archiphonemes found in the infinitive are harmonised to one of two vowels each, depending on the value of the preceding vowel; the basic form of this process can be implemented in one twol rule.

SL text → deformatter → morph. analyser → morph. disambig. → lexical transfer → lexical selection → structural transfer → morph. generator → post-generator → reformatter → TL text

Figure 1: The pipeline architecture of the Apertium system.
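
The left-to-right, longest-match rule selection described for the structural transfer module can be illustrated with a minimal Python sketch. The rule table, the category names (`adj`, `n`, `post`, `v`) and the actions are hypothetical and are not taken from the system's transfer rule files; the sketch only shows the matching strategy, not the Apertium implementation.

```python
# Minimal sketch of left-to-right, longest-match rule application.
# The rules and category names below are hypothetical illustrations.

def apply_rules(categories, rules):
    """Scan `categories` left to right. At each position, fire the longest
    rule pattern that matches; if none matches, copy one item unchanged."""
    output = []
    i = 0
    while i < len(categories):
        best = None
        for pattern, action in rules:
            n = len(pattern)
            # A slice shorter than the pattern can never equal it.
            if tuple(categories[i:i + n]) == pattern:
                if best is None or n > len(best[0]):
                    best = (pattern, action)
        if best is None:
            output.append(categories[i])
            i += 1
        else:
            pattern, action = best
            output.append(action(categories[i:i + len(pattern)]))
            i += len(pattern)
    return output


# Each rule pairs a pattern of lexical categories with an action on the match.
RULES = [
    (("adj", "n"), lambda chunk: "NP(" + " ".join(chunk) + ")"),
    (("n",), lambda chunk: "NP(" + chunk[0] + ")"),
    (("adj", "n", "post"), lambda chunk: "PP(" + " ".join(chunk) + ")"),
]

result = apply_rules(["adj", "n", "post", "n", "v"], RULES)
print(result)
```

At the first position both the two-item and the three-item patterns match, and the longer one wins; the unmatched `v` is passed through unchanged.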

The same morphological description is used for both analysis and generation. To avoid overgeneration, any alternative forms are marked with one of two marks, LR (only analyser) or RL (only generator). Instead of the usual compile/invert to compile the transducers, we compile twice: once for the generator, without the LR paths, and then again for the analyser, without the RL paths.

4.3 Bilingual lexicon

The bilingual lexicon currently contains 9,269 stem-to-stem correspondences and was built mostly by hand (i.e., by translating Kazakh stems unrecognised by the morphological analyser into Tatar). Some toponyms and other proper names were translated semi-automatically by looking up links in Wikipedia (Tyers and Pienaar, 2008); also, some Russian loanwords common to both languages (such as автомобиль, гонорар, etc.) were added to the bilingual dictionary automatically by taking the intersection of Russian and Kazakh wordlists.

Entries consist largely of one-to-one stem-to-stem correspondences with part of speech, but also include some entries with ambiguous translations (see e.g., Figure 2).

4.4 Disambiguation rules

The system has a morphological disambiguation module in the form of a Constraint Grammar (CG) (Karlsson et al., 1995). The version of the formalism used is vislcg3.[7]

The output of each morphological analyser is highly ambiguous, measured at around 2.4 morphological analyses per form for Kazakh and 2.1 for Tatar. The goal of the CG rules is to select the correct analysis when there are multiple analyses. Currently, ambiguity is down to 1.4 analyses per form for Kazakh and 1.7 for Tatar.[8]

One reason for the still high level of ambiguity is a series of affixes in both languages which can each be analysed variously as verbal nouns, verbal adjectives, substantivised verbal adjectives (verbal adjectives with a null modified noun), and even finite forms.[9]

Given the similarity of Kazakh and Tatar, this sort of ambiguity may often be passed from one language to the other and not lead to many translation errors. While disambiguating between these analyses would be crucial for e.g., a Turkic-to-English system, we have not yet put much effort into developing CG rules to deal with such ambiguity.

[7] http://beta.visl.sdu.dk/constraint_grammar.html
[8] The Tatar CG is largely based on rules generalised from Kazakh’s CG, but it has not received as much attention.
[9] Despite the fact that the various suffixes in this category pattern differently and do not form a single natural class, most grammars of the languages label them all as simply “gerunds” or “participles”.

4.5 Lexical selection rules

While many lexical items have a similar range of meaning, lexical selection can sometimes be problematic between Kazakh and Tatar.

For example (see Figure 2), Kazakh құрал can mean an instrument, device, tool, or even weapon, all meanings corresponding to its Tatar cognate корал; however, it is also used in the compound ақпарат құралдары ‘mass media’ (literally, ‘means of information’), which translates to Tatar as мәгълүмат чаралары (which has the same literal translation). Hence, the Kazakh word құрал must have two entries in the bilingual lexicon: one that corresponds to Tatar корал and one that corresponds to Tatar чара. A lexical selection rule that selects the translation чара when it occurs in a compound with ақпарат is written to ensure the correct translation; this rule is shown in Figure 3.

Likewise, the Kazakh word топ can be translated to Tatar as туп ‘ball’ (sometimes доп in Kazakh), and as төркем ‘group’. The bilingual dictionary also has the Russian word группа ‘group’, which is used in Tatar, as an entry which may be translated to Kazakh топ (i.e., analysed), but is never generated.

Tatar has separate words for ‘[physical] life’ (гомер) and ‘life [as a human condition]’ (тормыш), whereas Kazakh only has one word (өмір). The lexical selection rule provided in Figure 4 chooses the latter Tatar translation after the adjective рухани ‘spiritual’. The system currently has a total of 33 lexical selection rules.
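
The two lexical selection rules discussed in Section 4.5 can be pictured with a small Python sketch. This is not the Apertium lexical selection formalism; the data structures, the function name `select`, the treatment of the first listed translation as the default, and the pass-through of unknown lemmas are all assumptions made for illustration. Only the word pairs and their contexts come from the text.

```python
# Candidate Tatar translations for each Kazakh lemma. Treating the first
# candidate as the default is an assumption of this sketch.
BIDIX = {
    "құрал": ["корал", "чара"],
    "өмір": ["гомер", "тормыш"],
}

# Hypothetical rule layout:
# (lemma, offset of the context word, context lemma, translation to select)
RULES = [
    ("құрал", -1, "ақпарат", "чара"),   # compound with ақпарат
    ("өмір", -1, "рухани", "тормыш"),   # preceded by рухани
]


def select(lemmas):
    """Return one output item per source lemma, applying the context rules.
    Lemmas missing from BIDIX are passed through unchanged."""
    out = []
    for i, lemma in enumerate(lemmas):
        candidates = BIDIX.get(lemma)
        if candidates is None:
            out.append(lemma)
            continue
        choice = candidates[0]
        for target, offset, context, translation in RULES:
            j = i + offset
            if lemma == target and 0 <= j < len(lemmas) and lemmas[j] == context:
                choice = translation
                break
        out.append(choice)
    return out


print(select(["ақпарат", "құрал"]))
print(select(["рухани", "өмір"]))
```

Without a matching context, `select(["құрал"])` falls back to the default корал, which mirrors the idea that a rule only overrides the default translation in the environment it names.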

құрал    корал
құрал    чара
есім     исем
топ      туп
топ      төркем
топ      группа
Figure 2: Example entries from the bilingual transfer lexicon. Kazakh is on the left, and Tatar on the right.
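
How a bilingual entry can be used in one translation direction but never produced in the other (as described for группа and топ) can be sketched in Python. The tuple layout, the function names, and the use of the labels `LR` (Kazakh-to-Tatar only) and `RL` (Tatar-to-Kazakh only) as restrictions on bilingual entries are assumptions of this sketch; the actual dictionary file is not reproduced here.

```python
# (Kazakh stem, Tatar stem, restriction)
# restriction None  = usable in both directions
# restriction "LR"  = Kazakh-to-Tatar only   (label assumed for this sketch)
# restriction "RL"  = Tatar-to-Kazakh only   (label assumed for this sketch)
ENTRIES = [
    ("құрал", "корал", None),
    ("құрал", "чара", None),
    ("есім", "исем", None),
    ("топ", "туп", None),
    ("топ", "төркем", None),
    ("топ", "группа", "RL"),
]


def kaz_to_tat(stem):
    """All Tatar stems that may be produced for a Kazakh stem."""
    return [tat for kaz, tat, r in ENTRIES if kaz == stem and r in (None, "LR")]


def tat_to_kaz(stem):
    """All Kazakh stems that may be produced for a Tatar stem."""
    return [kaz for kaz, tat, r in ENTRIES if tat == stem and r in (None, "RL")]


print(kaz_to_tat("топ"))
print(tat_to_kaz("группа"))
```

In this sketch группа is accepted as Tatar input and mapped to топ, but it never appears among the Tatar outputs for топ.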
Figure 3: A lexical selection rule that selects чара as the translation of құрал if part of a compound with ақпарат.

Figure 4: A lexical selection rule that selects тормыш for өмір if preceded by the word рухани.

5 Evaluation

All evaluation was performed against version 0.2.0, or r45482 in the Apertium SVN.[10]

Lexical coverage of the system is calculated over freely available corpora of Kazakh and Tatar. For Kazakh, two years’ worth of content (2010 and 2012) from Radio Free Europe / Radio Liberty (RFERL)’s Kazakh-language service,[11] as well as a recent dump of Wikipedia’s articles in Kazakh,[12] were used. For Tatar, a dump of articles from the Tatar Wikipedia,[13] a translation of the New Testament, and content from RFERL’s Tatar-language service[14] from early 2007 to early 2012 were used for testing.

Corpora were divided into 10 parts each; the coverage numbers given are the averages of the calculated percentages of number of words analysed for each of these parts, and the standard deviation presented is the standard deviation of the coverage on each corpus.

As shown in Table 3, the naïve coverage of the Kazakh-Tatar MT system[15] over the news corpora approaches that of a broad-coverage MT system, with one word in ten unknown. The coverage over the Wikipedia corpus is substantially worse, due to the fact that this corpus is “dirtier”: it contains orthographical errors, wiki code, repetitions, as well as quite a few proper nouns.

(a) Naïve coverage of the Kazakh-Tatar direction
Corpus            Tokens   Coverage   stdev
RFERL 2010        3.2M     90.19%     0.23%
RFERL 2012        2.9M     89.74%     0.59%
Wikipedia         1.2M     80.75%     5.23%

(b) Naïve coverage of the Tatar-Kazakh direction
Corpus            Tokens   Coverage   stdev
RFERL 2007-’12    1.2M     82.24%     2.88%
New Testament     137K     91.79%     1.39%
Wikipedia         128K     81.36%     1.48%

Table 3: Naïve coverage of the Kazakh-Tatar system

[10] https://svn.code.sf.net/p/apertium/svn/trunk/apertium-kaz-tat
[11] http://www.azattyq.org/
[12] http://kk.wikipedia.org/; kkwiki-20130408-pages-articles.xml.bz2
[13] http://tt.wikipedia.org/; ttwiki-20130205-pages-articles.xml.bz2
[14] http://www.azatlyk.org/
[15] The coverage of the vanilla transducers is slightly higher.

To measure the performance of the translator we used the Word Error Rate (WER) metric, an edit-distance metric based on the Levenshtein distance (Levenshtein, 1966). We had two small Kazakh corpora along with their postedited translations into Tatar to measure the WER. The first one (2,457 words total) was a concatenation of an article from RFERL’s Kazakh-language service, an article from Wikipedia, and a simple story used for pedagogical purposes in a workshop on MT for the languages of Russia. In addition to postediting the translation, we ran this corpus through the morphological transducer and manually disambiguated its output. All the stems in these texts were added to the system, and all the rules (CG, lexical selection, and transfer) were based on this corpus. This “development” corpus presents an upper bound on the current performance of the system. The testing corpus, used solely for evaluation, consisted of articles from RFERL’s Kazakh-language service, and had a similar size (2,862 words) to the development corpus. Table 4 presents the WER for both corpora.

Corpus   Direction   Tokens   OOV   WER (%)
devel    kaz→tat     2457     2     15.19
test     kaz→tat     2862     43    36.57

Table 4: Word error rate over two corpora; OOV is the number of out-of-vocabulary (unknown) words.

(Kazakh) Input               Ол енді ол дыбысты анығырақ ести бастады.
Mor. analysis                ^Ол/ол/ол/ол$ ^енді/ен/ен/ен/енді$ ^ол/ол/ол/ол$ ^дыбысты/дыбыс$ ^анығырақ/анық/анық/анық$ ^ести/есті$ ^бастады/баста/баста/баста/баста$
Mor. disambig.               ^Ол$ ^енді$ ^ол$ ^дыбыс$ ^анық$ ^есті$ ^баста$^.$
Lex. transfer (+ selection)  ^Ол/Ул$ ^енді/инде/хәзер$ ^ол/ул$ ^дыбыс/тавыш$ ^анық/анык$ ^есті/ишет$ ^баста/башла$^./.$
Struct. transfer             ^Ул$ ^инде$ ^ул$ ^тавыш$ ^анык$ ^ишет$ ^башла$^.$
Mor. generation              Ул инде ул тавышны аныграк ишетә башлады.

Table 5: Translation process for the Kazakh phrase Ол енді ол дыбысты анығырақ ести бастады ‘He began to listen to that sound more carefully’.

form    current   expected
ди      дия       ди
йөр     йөрә      йөри
кияү    кияүе     кияве
укы     укыу      уку

Table 6: Examples of some phonological problems in the Tatar transducer.
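
As a rough illustration of the metric reported in Table 4, word error rate can be computed as the word-level Levenshtein distance between the system output and the postedited reference, divided by the number of reference words. The Python sketch below is an assumption about the formulation: the normalisation and tokenisation actually used for Table 4 are not stated, and the function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists, counting
    insertions, deletions and substitutions with cost 1 each."""
    # prev[j] is the distance between the hypothesis prefix processed so far
    # and the first j reference tokens.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hypothesis, reference):
    """Word error rate in percent, normalised by reference length.
    Whitespace tokenisation is an assumption of this sketch."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(hyp, ref) / len(ref)


print(wer("a b c d", "a x c d"))
```

Because the distance is normalised by the reference length, a hypothesis with many spurious insertions can yield a value above 100 under this formulation.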
Some of the corrections that were made as part of postediting were not translation errors as such, but were instead due to the lack of morphophonological rules in the Tatar transducer. A few examples, which include irregular verbs (ди and йөр) and are otherwise orthographical corner cases, are given in Table 6. Aside from the irregular behaviour of йөр, which would need its irregularities implemented through special phonotactics in lexc, these issues are all shortcomings of our phonological layer, implemented in twol.

Most of the remaining errors are due to mistakes and gaps in the Tatar morphophonology component, lack of transfer rules to handle some Kazakh compounds, and disambiguation errors.

6 Concluding remarks

To our knowledge we have presented the first ever MT system between Kazakh and Tatar. It has near-production-level coverage, but is rather prototype-level in terms of the number of rules. Although the impact of this relatively low number of rules on the quality of translation is extensive (cf. the difference in WER between the development and testing corpora), the outlook is promising and the current results suggest that a high-quality translation between morphologically-rich agglutinative languages is possible.

We plan to continue development on the pair; the coverage of the system is already quite high, although we intend to increase it to 95% on the corpora we have. We estimate that this will mean adding around 5,000 new stems and take 1–2 months. The remaining work will be improving the quality of translation by adding more rules, starting with the CG module. The long-term plan is to integrate the data created with other open-source data for Turkic languages in order to make transfer systems between all the Turkic language pairs. Related work is currently ongoing with Chuvash–Turkish and Turkish–Kyrgyz.

The system is available as free/open-source software under the GNU GPL and the whole system may be downloaded from sourceforge.[16]

[16] http://sourceforge.net/projects/apertium/files/apertium-kaz-tat/

Acknowledgements

The work on this Kazakh-Tatar machine translation system was partially funded by the Google Summer of Code and Google Code-In programmes.

References

Altintas, Kemal. 2001a. A morphological analyser for Crimean Tatar. Proceedings of Turkish Artificial Intelligence and Neural Network Conference.

Altintas, Kemal. 2001b. Turkish To Crimean Tatar Machine Translation System. Master’s thesis, Bilkent University.

Forcada, Mikel L., Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O’Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127–144.

Gilmullin, R. A. 2008. The Tatar-Turkish Machine Translation Based On The Two-Level Morphological Analyzer. In Interactive Systems and Technologies: The Problems of Human-Computer Interaction, pages 179–186, Ulyanovsk.

Hamzaoğlu, Ilker. 1993. Machine translation from Turkish to other Turkic languages and an implementation for the Azeri language. Master’s thesis, Bogazici University.

Karlsson, F., A. Voutilainen, J. Heikkilä, and A. Anttila. 1995. Constraint Grammar: A language independent system for parsing unrestricted text. Mouton de Gruyter.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710. Translated from Doklady Akademii Nauk SSSR, pages 845–848.

Linden, Krister, Miikka Silfverberg, Erik Axelson, Sam Hardwick, and Tommi Pirinen. 2011. HFST: Framework for Compiling and Applying Morphologies, volume 100 of Communications in Computer and Information Science, pages 67–85.

Tantuğ, A. Cüneyd, Eşref Adalı, and Kemal Oflazer. 2006. Computer analysis of Turkmen language morphology. Advances in natural language processing, proceedings (LNAI), pages 186–193.

Tantuğ, A. Cüneyd, Eşref Adalı, and Kemal Oflazer. 2007. A MT system from Turkmen to Turkish employing finite state and Statistical Methods. In Proceedings of MT Summit XI, Copenhagen, Denmark.

Tyers, F. M. and J. A. Pienaar. 2008. Extracting bilingual word pairs from wikipedia. In Proceedings of the SALTMIL Workshop at the Language Resources and Evaluation Conference, LREC2008, pages 19–22.

Tyers, Francis, Jonathan North Washington, Ilnar Salimzyan, and Rustam Batalov. 2012a. A prototype machine translation system for Tatar and Bashkir based on free/open-source components. In Proceedings of the First Workshop on Language Resources and Technologies for Turkic Languages at the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 21.

Tyers, Francis M., Felipe Sánchez-Martínez, and Mikel L. Forcada. 2012b. Flexible finite-state lexical selection for rule-based machine translation. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 213–220, Trento, Italy, May.

Washington, Jonathan, Mirlan Ipasov, and Francis Tyers. 2012. A finite-state morphological transducer for Kyrgyz. In Calzolari, Nicoletta (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 23-25. European Language Resources Association (ELRA).

Çöltekin, Çağrı. 2010. A freely available morphological analyzer for Turkish. Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC2010), pages 820–827.