Transliteration for Low-Resource Code-Switching Texts: Building an Automatic Cyrillic-To-Latin Converter for Tatar
Total Page:16
File Type:pdf, Size:1020Kb
Transliteration for Low-Resource Code-Switching Texts: Building an Automatic Cyrillic-to-Latin Converter for Tatar Chihiro Taguchi∗, Yusuke Sakai∗, and Taro Watanabe {taguchi.chihiro.td0, sakai.yusuke.sr9, taro}@is.naist.jp Nara Institute of Science and Technology Abstract mixes Russian words, it is not easy to obtain a pure Tatar dataset for developing a language detector. We introduce a Cyrillic-to-Latin transliterator for the Tatar language based on subword-level Existing methods are based on either Tatar mono- language identification. The transliteration is a lingual rules or a huge bundle of ad-hoc rules aimed challenging task due to the following two rea- to cover Russian-origin words (Bradley, 2014; Ko- sons. First, because modern Tatar texts often rbanov, n.d.). The experimental results in Section contain intra-word code-switching to Russian, 6 demonstrate that the former monolingual rule- a different transliteration set of rules needs to based transliterators show low accuracy because be applied to each morpheme depending on the Russian words are not supported. The latter exten- language, which necessitates morpheme-level sively rule-based transliterator has better accuracy, language identification. Second, the fact that Tatar is a low-resource language, with most of but still misses a certain amount of words. This the texts in Cyrillic, makes it difficult to pre- implies that a strictly rule-based method requires pare a sufficient dataset. Given this situation, an ever-lasting process of adding rules ad hoc for we proposed a transliteration method based exceptional words to further improve the accuracy. on subword-level language identification. We This is obviously unrealistic and inefficient. trained a language classifier with monolingual In this study, in contrast, we pursue a simple Tatar and Russian texts, and applied different yet high-accuracy automatic transliteration system transliteration rules in accord with the identi- fied language. The results demonstrate that our from Cyrillic Tatar to Latin Tatar. We prepare two proposed method outscores other Tatar translit- sets of simple rule-based monolingual translitera- eration tools, and imply that it correctly tran- tion rules for Tatar and Russian each. In addition, scribes Russian loanwords to some extent. we train a classifier that automatically determines Tatar or Russian for each subword. Each token 1 Introduction is then transliterated to Latin Tatar by applying Modern Tatar has two orthographies: Cyrillic and the rules of the detected language to its subwords. Latin. The two alphabets are mostly mutually com- The results demonstrate that our proposed method patible when an input string consists of only Tatar- achieves higher accuracy than the previous tools. origin words. Effectively, however, modern Tatar Also, our proposed method demonstrates higher ac- has a massive amount of Russian loanwords, and, curacy than the transliterator with only Tatar-based in colloquial texts, even a whole phrase may be rules, indicating that the method can correctly pre- switched to Russian. This linguistic phenomenon dict and transcribe Russian words to some extent. is known as code-switching or code-mixing. A difficulty of transliteration from Cyrillic to 2 Tatar: Linguistic Background Latin lies in the following two facts. First, a dif- Tatar (ISO 639-1 Code: tt) is a Kipchak language ferent set of transliteration rules has to be applied of the Turkic language family mainly spoken in to Tatar and Russian words. This requires a lan- the Republic of Tatarstan, Russia. The number of guage detection for each token, or worse, for each the speakers is estimated to be around five million morpheme, which would additionally require mor- (Eberhard et al., 2021). The canonical word order phological analysis. It is expected that a full imple- is SOV, but allows for free word order with scram- mentation of such a system will produce heavy pro- bling. Morphological inflections and declensions cesses. Second, because modern Tatar frequently are derived by means of suffixation. The suffixa- ∗Equal contribution tion obeys the vowel harmony rule where backness 133 Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 133–140 June 11, 2021. ©2021 Association for Computational Linguistics https://doi.org/10.26615/978-954-452-056-4_018 (or frontness) of vowels is kept consistent. locative suffix -ta (-ta “at, in”) is attached in the Through the long history of language contact example, as Russian loanwords may take Tatar suf- with Russian, modern Tatar contains a large amount fixes, causing CS within a token (intra-word CS). of Russian loanwords. Most of the speakers are (1) Bezneң klassta kyzlar sigezenqedәn bilingual along with Russian, and younger genera- qә baxlagan ide. tions living in urbanized cities tend to have better translit.: Bezneñ klassta qızlar sigezençedän competence in Russian than in Tatar. This bilin- eçä ba¸slagan˘ ide. gualism leads to frequent code-switching (CS) in “In our class, girls used to start drinking by the text and speech, particularly of colloquial genre eighth grade.” (Izmailova et al., 2018). 3 Related Work 2.1 Tatar Orthographies The mainstream orthography for modern Tatar is Code-Switching. Even though CS has attracted the Cyrillic alphabet, which comprises the Russian researchers in NLP, the lack of resource has been standard alphabet and six extended Cyrillic letters. a major difficulty, because CS is an exclusively Cyrillic Tatar is mostly used by Tatars living in colloquial linguistic phenomenon and CS texts are Russian-speaking countries, while the Latin alpha- seldom recorded. Jose et al.(2020) enumerates bet is used by Tatars living in other areas such as a list of available CS datasets at the time of the Turkey and Finland. publication. In terms of both the availability of Until 1928, Tatar was exclusively written in the datasets and the popularity of research, CS lan- Arabic script. Along with the latinization project guage pairs in trend are Hindi–English (Srivastava prevailing in the Soviet Union in 1920–30s, a Latin et al., 2020, Singh and Lefever, 2020), Spanish– alphabet system yañalif was introduced to Tatar. In English (Alvarez-Mellado, 2020, Claeser et al., 1939, for political reasons, yañalif was superseded 2018), Arabic varieties and Modern Standard Ara- by the Cyrillic alphabet which has been officially bic (Hamed et al., 2019). used in Tatarstan as of now. After the demise of As for the studies of intra-word CS in other lan- the Soviet Union, with the resurgence of the move- guages, Mager et al.(2019) for German–Turkish ment for restoring the Latin orthography, a new and Spanish–Wixarika, Nguyen and Cornips(2016) Latin-based orthography system was adopted by for Dutch–Limburgish, and Yirmibe¸soglu˘ and Ery- a republic law in 1999 (National Council of the igit˘ (2018) for Turkish–English have a similar ap- Republic of Tatarstan, 1999). However, the law proach to ours. The differences from ours are that soon lost its validity in 2002 when a new para- Mager et al.(2019) employs segRNN (Lu et al., graph stipulating that all the ethnic languages in 2016) for segmentation and language identification, Russia must be written in Cyrillic was added to and that Nguyen and Cornips(2016) uses Morfes- the federal law (Yeltsin, 2020). The current Latin sor (Creutz and Lagus, 2006) for morphological alphabet (2013Latin henceforth) based on the Com- segmentation. However, our task that combines mon Turkic Alphabet was officially adopted by a language detection of intra-word CS and translit- republic law in 2013, and is commonly used in eration has never been undertaken in any of these Tatar diaspora communities (National Council of studies. the Republic of Tatarstan, 2013). We define that the term “Latin alphabet” used Tatar Transliteration. At the time of this writing, in this paper refers to 2013Latin. A detailed the following tools are available for Tatar Cyrillic- rules for the conversion to 2013Latin is given in Latin conversion. The Tatar Transcription Tool Timerkhanov and Safiullina(2019). (TTT henceforth) (Bradley, 2014) is a translitera- tor published online by Universität Wien as a part 2.2 Code-switching in Tatar of the Mari Web Project. speak.tatar1 is an anony- An example of the former is displayed in (1) with mously developed transliteration service. FinTat2 is transliteration in the Latin alphabet and the trans- a transliteration tool developed as a part of the Cor- lation. Underlined klass (klass “class, grade”) is pus of Written Tatar (Saykhunov et al., 2019). Ay- a Russian word naturalized in Tatar, though it is landirow is a strictly rule-based transliteration tool pronounced with Russian phonology and therefore 1https://speak.tatar/en/language/converter/tat/cyrillic/latin requires a different transliteration. In addition, a 2http://www.corpus.tatar/fintat 134 available online that extensively covers Russian- form high accuracy8: the number of dimensions is origin words as well as Tatar-origin (Korbanov, 16, the minimum and maximum character n-gram n.d.). The transliteration system employed in Fin- sizes are 2 and 4. Hierarchical softmax is used as Tat is based on the Latin alphabet used by Tatars in the loss function. Together with this representa- Finland, whose orthography is somewhat different tion, the subword embeddings are annotated with a from 2013Latin. language label (Joulin et al., 2017). It is worth noting that, unlike Mager et al.(2019), 4 Method we do not employ deep learning approach. While the task in Mager et al.(2019) is multi-labeling, We transliterate Cyrillic Tatar to Latin Tatar word our language identification task is a binary classifi- by word, as each word does not affect other words cation with a low-resource training dataset; for this in Tatar transliteration3. reason, deep learning is too superfluous and heavy Taking into account the fact that Tatar has intra- for achieving our task. word CS, we created a classifier that detects a lan- To evaluate the performance of our model, we guage (Tatar or Russian) for each subword.