STRUCTURAL TRANSFER RULES FOR KAZAKH-TO-ENGLISH MACHINE TRANSLATION IN THE FREE/OPEN-SOURCE PLATFORM APERTIUM

Aida Sundetova, Aidana Karibayeva, Ualsher Tukeyev
Information Systems Department, Al-Farabi Kazakh National University, Almaty, Kazakhstan

ABSTRACT

This paper describes the process of building structural transfer rules for a Kazakh-to-English machine translation system on the free/open-source Apertium platform. Structural transfer rules translate text from Kazakh to English by applying a set of rules in three stages. The paper shows how Kazakh sentences are transformed into English sentences and which types of phrases and attributes are used. Results are presented by comparing the Apertium Kazakh–English system with other online translators.

1 Introduction

Developing machine translation from Kazakh to English is important and useful for people who want to understand texts in Kazakh and translate them. However, building a translation system from a Turkic language, which has complex agglutinative morphology, faces some difficulties. For example, Kazakh morphology, like the morphology of all Turkic languages, is more complex than English morphology and very different from it. Word-for-word translation from Kazakh to English is impossible. Because Kazakh is an agglutinative language, words are formed by adding morphemes subject to vowel harmony (synharmonism) [1]. English is an analytic language that conveys grammatical relationships without the complex inflectional morphemes found in Kazakh. More precisely, relationships are expressed by additional constructions with modal verbs or prepositions [2].

There are important differences in syntax between Kazakh and English; for example, the order of constituents in a Kazakh sentence is subject–object–adverbial modifier–verb, whereas in English it is subject–verb–object–adverbial modifier. There are also important differences in translating verb tenses: Future Simple and Present Simple, and likewise Present Perfect and Past Perfect, have the same translation in Kazakh. Modal verbs are rendered by adding auxiliary verbs (I can play – Мен ойнай аламын) or by using adjectives that express obligation: жөн ('should'), қажет ('necessary'), керек ('need') (I should go – Менің барғаным жөн) [3].

Kazakh has no grammatical gender, so the personal pronoun "Ол" has three possible translations: he/she/it. By default it is translated as "he"; however, in special constructions such as "Ол – қыз" (in English, "She is a girl"), "Ол" is translated as "She".

Taking these features into account, we are developing machine translation from Kazakh to English based on the Apertium free/open-source machine translation platform (Forcada et al. 2011, http://www.apertium.org) [4]. We chose it because, firstly, it already contains a rather complete Kazakh morphology (Salimzyanov et al. 2013) and, secondly, it includes an English monolingual dictionary which also provides morphological analysis [5]. Therefore, to develop the Kazakh–English system we need to build a bilingual dictionary and write a set of transfer rules.

The rest of this paper is organized as follows: Section 2 describes the Apertium platform and its structure, Section 3 describes Kazakh–English structural transfer, and Section 4 presents results by comparing the system with other systems.

2 Apertium platform and its modules

Apertium is a free/open-source machine translation platform whose development started in 2005 with financing from the governments of Spain and Catalonia at the University of Alicante (Universitat d'Alacant). Apertium is free software, published by its developers under the terms of the GNU GPL.

Apertium was originally intended for translation between related languages. However, the system has been expanded to translate between less similar languages. Linguistic data (dictionaries, rules) are written in precisely specified XML formats. The system uses finite-state transducers for all of its lexical transformations, and hidden Markov models for part-of-speech tagging or word category disambiguation.

The Apertium platform consists of the following modules (Figure 1):

- Deformatter. It separates the text to be translated from the formatting tags. Formatting tags are encapsulated as "superblanks" that are placed between words in such a way that the remaining modules see them as regular blanks.

- Morphological analyser. For each surface form (that is, for each lexical unit as it appears in the text), the morphological analyser generates one or more lexical forms composed of: lemma (dictionary or citation form), lexical category (or part of speech), and inflection information. The morphological analyser executes a finite-state transducer generated by compiling a morphological dictionary for the source language. Lexical units containing more than one word (multiword lexical units) are analysed as a single lexical unit. The morphological analyser uses a finite-state transducer based on two-level rules (in the case of Kazakh, apertium-kaz.kaz.lexc and apertium-kaz.kaz.twol). This module therefore separates lexemes, performs morphological analysis, and returns the possible lexical forms.

- Part-of-speech (POS) tagger. Apertium's POS tagger is based on a statistical model (hidden Markov models) which processes the result of applying constraint-grammar rules (Karlsson et al. 1995) [6]. These are simple context-based rules (written in apertium-kaz.kaz.rlx) that discard some analyses. For example, consider the morphological analysis of the word қара:

^қара/қара/қара/қара /қара/ қара/қара/қара/қара

This word is ambiguous and has 6 meanings. Many surface forms are ambiguous, which means that these words have more than one POS and therefore more than one possible translation. After this module, all words have only one morphological analysis.

- Lexical transfer. This module uses a bilingual dictionary (apertium-eng-kaz.eng-kaz.dix), which has a very simple structure [7]. The module reads each source-language lexical form and finds one or more corresponding target-language lexical forms. Multiword units are translated as a single word.

- Lexical selection. For lexical forms that have several translations, this module uses rules that select one of the target-language translations according to context. All rules are written in the file apertium-eng-kaz.kaz-eng.lrx.

- Structural transfer. This module identifies sequences of lexical forms (phrases or segments) which need syntactic processing (handling of number, prepositions, etc.) to be translated. It uses rule files which specify the syntactic transformation as a cascaded process. Transfer rules, which transform lexical-form sequences into new sequences for the target language, perform the work in this module. Structural transfer is the focus of this paper and is described in detail in Section 3.

- Morphological generator. From the sequence of target-language lexical forms produced by structural transfer, it generates the corresponding sequence of target-language surface forms. The morphological generator executes a finite-state transducer generated by compiling a morphological dictionary for the target language.

- Post-generator. It takes care of some minor orthographical operations in the target language (for instance, it generates the English form cannot from can and not). This module is generated from a rule file whose format is very similar to that of the dictionary files.

- Reformatter. It places the format tags back into the text so that its format is preserved.

Figure-1. The Apertium machine translation pipeline

3 Structural transfer from Kazakh into English

The structural transfer module in Apertium performs the operations defined in transfer rules, such as word reordering, adding suffixes, and removing unnecessary tags or attributes [8]. A structural transfer rule in Apertium comprises two parts: pattern and action. The "pattern" defines the sequence to which the rule will be applied, whereas the "action" consists of the actual operations needed to generate the corresponding sequence in the target language. Transfer in Apertium may be of two types. The first type is used for similar languages and generates the sequence of lexical forms in the target language in a single step. The second type is the one used in our Kazakh–English system, and consists of three levels:

- "chunker" level (file apertium-eng-kaz.kaz-eng.t1x);
- "interchunk" level (file apertium-eng-kaz.kaz-eng.t2x);
- "postchunk" level (file apertium-eng-kaz.kaz-eng.t3x).

The following sections describe the three levels of Kazakh–English structural transfer.

3.1 The Kazakh-English chunker

The chunker divides a sentence into chunks, which may be seen as elementary sentence constituents such as noun phrases, verb phrases, etc. (see Table 1).

Table-1. Types of chunks

Pattern   Meaning
SN        Noun phrase
SV        Verb phrase
AdjP      Adjectival phrase
PP        Postpositional phrase
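
As an informal illustration of what chunking means here, the following Python sketch groups a sequence of part-of-speech tags into chunks by longest match, using the chunk types SN, SV, AdjP and PP. It is not Apertium's .t1x rule format: the function name chunk, the rule table CHUNK_RULES, the fallback label UNKNOWN, and the assumption that the input is already tagged are all introduced only for this example.

```python
# Hypothetical rule table: POS-tag pattern -> chunk type.
# The patterns are illustrative and do not reproduce the actual rule file.
CHUNK_RULES = {
    ("n",): "SN",
    ("adj", "n"): "SN",
    ("num", "n"): "SN",
    ("num", "adj", "n"): "SN",
    ("det", "n"): "SN",
    ("adj",): "AdjP",
    ("n", "pr"): "PP",
    ("v",): "SV",
}
MAX_LEN = max(len(pattern) for pattern in CHUNK_RULES)


def chunk(tagged):
    """Group (word, pos) pairs into (chunk_type, words), longest match first."""
    chunks = []
    i = 0
    while i < len(tagged):
        # Try the longest candidate window first, then shrink it.
        for size in range(min(MAX_LEN, len(tagged) - i), 0, -1):
            window = tagged[i:i + size]
            pattern = tuple(pos for _, pos in window)
            if pattern in CHUNK_RULES:
                chunks.append((CHUNK_RULES[pattern], [w for w, _ in window]))
                i += size
                break
        else:
            # No rule matched: pass the word through as an unanalysed chunk.
            chunks.append(("UNKNOWN", [tagged[i][0]]))
            i += 1
    return chunks
```

For example, chunk([("жеті", "num"), ("әдемі", "adj"), ("бақша", "n")]) yields a single SN chunk containing all three words, because the three-tag pattern is tried before the shorter ones.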

Some examples of noun- and verb-phrase chunks are given in the following tables (Table 2, Table 3):

Table-2. Noun-phrase chunks

Input pattern (1)   Example                    Output block          Translation
n                   бақша                      SN{n}                 garden
adj                 әдемі                      AdjP{adj}             beautiful
num                 жеті                       SN{num}               seven
adj n               әдемі бақша                SN{adj n}             beautiful garden
det n               менің бақшам               SN{det n}             my garden
num n               жеті бақша                 SN{num n}             seven gardens
num adj n           жеті әдемі бақша           SN{num adj n}         seven beautiful gardens
det adj n           менің әдемі бақшам         SN{det adj n}         my beautiful garden
det num adj n       менің жеті әдемі бақшам    SN{det num adj n}     my seven beautiful gardens
n pr                үстел астында              PP{pr n}              under table
adj n pr            үлкен үстел үстінде        PP{pr adj n}          on big table
num n pr            бес үстел үстінде          PP{pr num n}          on five tables

(1) Abbreviations: adj, adjective; n, noun; num, numeral; pr, postposition; det, determiner.

Table-3. Verb-phrase chunks

Input pattern   Example       Output block (2)            Translation
v               ойна          SV{vblex}                   play
v               ойнап отыр    SV{vbser vblex}             is playing
v               ойнаған       SV{vbhaver vblex}           has played
v               ойнамаған     SV{vbhaver adv vblex}       has not played
v               ойнар         SV{vaux vbhaver vblex}      will have played

(2) Abbreviations: vblex, lexical verb; vbser, verb 'to be'; vaux, auxiliary verb; vbhaver, verb 'to have'.

Note that the lexical forms have been translated in advance and that the remaining transfer modules work only on target-language lexical forms.

After these blocks (chunks) are created, the interchunk module performs operations on them without modifying their contents. This module makes it possible to generate the correct target-language word order and to handle person and number agreement in verbs.

3.1.1 Translation of noun phrases

We illustrate the translation of a noun phrase with the example әдемі бақшаларда ('in the beautiful gardens').

The chunker identifies this phrase as a noun phrase (adj noun) and then translates it into English, adding the relevant tags. Such tags may include number (plural) and case (locative).

In general, a noun phrase has the following attributes: number (singular or plural), case, and possessive. One of the main problems in translating noun phrases is generating the English articles (a, an, the), which are absent in Kazakh. All nouns in the nominative and accusative cases are translated as noun phrases:

- single noun: SN [қыз] - SN [girl];

The same holds for structures such as:

- adjective + noun: SN [әдемі үй] - SN [beautiful house];
- numeral + noun (in accusative case): SN [жеті бақша] - SN [seven garden].

The rules for these phrases do not mark the accusative case on the noun, because the English translation carries no corresponding suffix.

3.1.2 Translation of verb phrases

Translating verbs from Kazakh to English presents specific difficulties. For instance, in Kazakh the past tense may have two translations: the sentence "Мен ойнағанмын" can be translated as "I have played" or "I had played", that is, as present perfect or as past perfect. We decided to generate a present perfect translation, because in the first stages of development it is difficult to identify the past perfect, which must precede an action in the past simple, and because the present perfect is more common in simple sentences. Examples of verb phrases that the system already translates are shown below (Table 4):

Table-4. Translation of verb phrases

Tense in Kazakh                Example                  Tense in English     Translation
Present (Ауыспалы осы шақ)     Мен ойна+й+мын           Present Simple       I play
Past (Жедел өткен шақ)         Мен ойна+дым             Past Simple          I played
Future (Болжалды келер шақ)    Мен ойна+р+мын           Future Perfect       I will have played
Present (Нақ осы шақ)          Мен ойна+п жатыр+мын     Present Continuous   I am playing

3.1.3 Translation of adjectival and postpositional phrases

Adjectival phrases do not have any attributes and are marked as "AdjP"; they are used in those cases in which adjectives are not part of a noun phrase.

Postpositional phrases are structures in which function words follow the noun. For instance, the phrase «жеті әдемі бақшаның астында» is translated as «under seven beautiful gardens». In a construction like this, the compound postposition formed by the genitive ending "-ның" in "бақшаның" and the word "астында" expresses the notion conveyed in English by the function word "under".

At this level of transfer, 57 rules have been written.

3.2 "Interchunk" level

As can be seen from other translation systems (try translating texts with [9, 10]), word order in the target-language text is often incorrect, which means that reordering does not work well.

When we write a chunker rule (.t1x), we aim at dividing the sentence into a sequence of patterns or chunks. After that, we handle the order of these chunks by writing appropriate reordering rules in the interchunk file apertium-eng-kaz.kaz-eng.t2x. For instance, in the sentence "Біз кітапты оқимыз" the pattern of the pronoun "Біз" ('We') is "SN", the pattern of the object "кітапты" ('book') is "SN-accusative" and the pattern of the verb "оқимыз" ('read') is "SV": "Біз кітапты оқимыз" - "We read book" (reordered from "We book read"). Thus in Kazakh the verb stands at the end of the sentence, whereas in English it can stand at the beginning or in the middle: "Мен[1] әдемі[2] бақшаны[3] көремін[4]" → "I[1] see[4] beautiful[2] garden[3]". Rules at this level perform the following operations:

- building a new sequence of chunks;
- adding prepositions according to case: "әдемі бақшаДА[locative]" - "in beautiful garden";
- agreement between words (subject and verb, adjective and noun); for example, verb and subject agree in person and number: "Бала ойнайды" – "Child plays" (the noun is third person singular, so the verb takes the third-person singular ending).

There are currently about ten rules of this kind, and the set will be extended.

4 Results

The current version of the system (revision №56387) can translate SN, SV, AdjP and PP phrases. We plan to extend the number of rules to improve translation quality. In the table below we compare several translation systems on examples that our system can translate [9, 10]. All available translations of sentences and phrases can be seen in the regression tests (see [11]).

Table-5. Results of comparison

Phrase   Example                    Apertium                   Pragma 6                    Sanasoft
SN       менің екі әдемі көйлегім   My two beautiful dresses   me two әдемі my the dress   My two beautiful көйлегім
SV       Мен студент емеспін        I am not student           I not student               I student емеспін
PP       Анау суреттердің астында   under those pictures       Under those pictures        under by that pictures

5 Conclusion

We have described the Kazakh–English machine translation system on the Apertium platform and the process of developing its structural transfer rules. Many issues in translating from Kazakh to English, such as case assignment, agreement, and prepositions, have been addressed. In future work we will consider the translation of the future transitional tense, the passive voice, degrees of adjectives, interrogative sentences, and other tasks.

Acknowledgements: the authors thank Mikel L. Forcada, Francis Tyers, Jonathan N. Washington, Ilnar Salimzyanov and other developers in the Apertium project for their help during the development of this system. The authors would also like to express their gratitude to Mikel L. Forcada for advice on writing this paper.

6 References

[1] Агглютинативные языки (2012). Retrieved from http://ru.wikipedia.org/wiki/Агглютинативный_язык
[2] Аналитический язык (2013). Retrieved from http://ru.wikipedia.org/wiki/Аналитический_язык
[3] Печерских, Т.Ф., Амангельдина, Г.А. (2012) "Особенности перевода разносистемных языков (на примере английского и казахского языков)", Молодой ученый. №3, 259–261.
[4] Forcada, M.L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Sánchez-Martínez, F., Ramírez-Sánchez, G., Tyers, F.M. 2011. "Apertium: a free/open-source platform for rule-based machine translation". Machine Translation 25(2):127-144.
[5] Salimzyanov, I., Washington, J.N., Tyers, F.M. "A free/open-source Kazakh-Tatar machine translation". Proceedings of MT Summit XIV (Nice, France, 4–6 September 2013), accepted.
[6] Karlsson, F., Voutilainen, A., Heikkilä, J., Anttila, A. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin.
[7] Сундетова А.М., Кәрібаева А.С., Апертиум платформасындағы Ағылшын–Қазақ машиналық аудармашы үшін екітлді сөздікті құру. Материалы международной научно-практической конференции «Применение информационно-коммуникационных технологий в образовании и науке», посвященной 50-летию Департамента информационно-коммуникационных технологий и 40-летию кафедры «Информационные системы» КазНУ им. аль-Фараби. 22 ноября 2013г. – Алматы: Қазақ Университеті, 2013. – С.53-57.
[8] Sundetova A., M.L. Forcada, A. Shormakova, A. Aitkulova, Structural transfer rules for English-to-Kazakh machine translation in the free/open-source platform Apertium. Компьютерная обработка тюркских языков. Первая международная конференция: Труды. – Астана: ЕНУ им. Л.Н. Гумилева, 2013. – С. 317-326.
[9] Online-translator «Sanasoft»: http://www.sanasoft.kz/c/ru/node/47 (in Russian), http://www.sanasoft.kz/c/kk/node/53 (in Kazakh).
[10] Online-translator «Trident»: http://www.translate.ua/us/on-line
[11] Regression tests. http://wiki.apertium.org/wiki/English_and_Kazakh/Regression_tests