Neural machine system for the

Ualsher Tukeyev Zhandos Zhumanov Al-Farabi Kazakh National University Al-Farabi Kazakh National University 71 Al-Farabi ave., Almaty, 050040, 71 Al-Farabi ave., Almaty, 050040, Kazakhstan Kazakhstan [email protected] [email protected]

developed for the Kazakh language. Currently we 1 Abstract have a small parallel corpus of approximately 140 000 Kazakh–English sentences, and a “Development and research of the Kazakh Kazakh–Russian corpus of a similar size. To solve language neural sys- this problem, in the project we will leverage on tem” is a 3-year-long project funded by existing corpora of related languages. the Science Committee of Ministry of Ed- The objectives of the project are investigation ucation and Science of the Republic of of: Kazakhstan. The project is being executed - the basic version of the neural machine by the team of the Laboratory of Intelli- translation of the Kazakh language, using gent Information Systems at the Research standard technology of NMT to Kazakh; Institute of Mathematics and Mechanics - the morphological segmentation for the of al-Farabi Kazakh National University. Kazakh language NMT; Duration of the project: January 2018 – - the development of syntactic corpora for December 2020. NMT of Kazakh; - the development of models and algo- 1 The purpose of the project rithms for solving the problem of un- The purpose of the project is to create neural known words for the neural machine machine translation technology for the Kazakh translation of the Kazakh language; language aiming at a high quality of machine - the evaluation of the quality of the neural translation, specifically adapted to the features of machine translation of the Kazakh lan- the Kazakh language. guage. Since 2013 the direction of machine translation Our team had worked on a project titled “De- based on recurrent neural networks, that is, neural velopment of free/open-source machine transla- machine translation, is intensively developing, tion system for Kazakh–English and Kazakh– and is actively explored for popular world lan- Russian (and vice versa) on the base of Apertium guages. Practical applications of neural machine platform.” in 2015–2017. Language resources translation, in particular, in , created in the previous project are used in the cur- show impressive results. At the same time, neural rent one [1, 2, 3]. machine translation research for the Kazakh lan- Investigation of the basic version of the neural guage as a low resource language is a topical task. machine translation of the Kazakh language is This project aims to fill that gap. based on the use of recurrent neural networks us- ing the "encoder–semantic representation–de- 2 Project objectives coder" model [4]. Along with that we will explore transformer architecture as well. One problem that affects the quality of neural machine translation for Kazakh is the fact that parallel corpora with large volumes of data are not

1 © 2019 The authors. This article is licensed under a Cre- ative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.

Proceedings of MT Summit XVII, volume 2 Dublin, Aug. 19-23, 2019 | p. 123 3 Expected results of the project 45 million tenge (100 000 euro) on three year (2018–2020) funded by Kazakh government. Project results will include technology (models, algorithms and software) of neural machine References translation, adapted to the features of the Kazakh language. The system of neural machine [1] Tukeyev U.A., Rakhimova D.R., Zhumanov Zh.M., translation of the Kazakh language will be Kartbayev A.Zh. Single state transducer model for Kazakh and Russian morphology // KazNU BUL- developed as a free/open-source system. LETIN, Mathematics, Mechanics, Computer Sci- ence Series. – Алматы, «Қазақ университетi». – 4 Current results and future plans 2016. – №2 (89). – P. 110-117. The first part of the project is focused on Kazakh– [2] Rakhimova D. R., Tukeyev U.A., Zhumanov Zh.M. pair. The second part will be Methodology of the automated enrichment of ma- focused on Kazakh– pair. At the chine translation system dictionaries for Kazakh– end of the first year, the following results were Russian and Kazakh–English language pair // Pro- achieved: the technology of hybrid automaton- ceedings of 4th International conference on Turkic neural machine translation of the Kazakh languages (“TurkLang–2016”). – Bishkek, Kyrgyz- language based on the complete system of Kazakh stan, 2016. – С. 81-85. language endings [5]; the preprocessor of [3] Zhumanov Zh., Madiyeva A., Rakhimova D. New morphological segmentation in Kazakh–English Kazakh Parallel Text Corpora with On-line Access. neural machine translation system; the Lecture Notes in Computer Science. – 2017. – postprocessor of morphological desegmentation 10449. – pp. 501-508. in Kazakh–English neural machine translation [4] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. system. Because of agglutinative nature words in Linguistic Regularities in Continuous Space Word Kazakh are formed by adding affixes. Different Representations. The Association for Computa- forms of the same word could be used in text and tional Linguistics. In HLTNAACL, p. 746– treated as different words if segmentation is not 751(2013). applied. The fact that the rules for adding affixes [5] Tukeyev U., Sundetova A., Abduali B., Akhmad- are very strict allows for creation of a complete iyeva Zh., Zhanbussunov N. Inferring of the mor- system of Kazakh language endings, which phological chunk transfer rules on the base of com- simplifies segmentation and helps reducing plete set of Kazakh endings // LNAI 9876, Compu- vocabulary. tational Collective Intelligence, Part 2, Springer, Current work is directed at gathering more par- 2016, pp. 563-574 allel corpora by crawling multilingual web-sites with various tools, experimenting with different neural machine translation architectures, transla- tion of unknown (out of vocabulary) words and integrating all of techniques that prove to be effec- tive into one neural translation system. Our team consists from 7 members, described project have

Proceedings of MT Summit XVII, volume 2 Dublin, Aug. 19-23, 2019 | p. 124