Char-RNN for Word Stress Detection in East Slavic Languages

Maria Ponomareva (ABBYY, Moscow) [email protected]
Kirill Milintsevich (University of Tartu, Tartu) [email protected]
Ekaterina Artemova (National Research University Higher School of Economics, Moscow, Russia) [email protected]

Abstract

We explore how well a sequence labeling approach, namely a recurrent neural network, is suited for the task of resource-poor, POS-tagging-free word stress detection in the Russian, Ukrainian, and Belarusian languages. We present new datasets annotated with word stress for the three languages, compare several RNN models trained on the three languages, and explore possible applications of transfer learning for the task. We show that it is possible to train a model in a cross-lingual setting and that using additional languages improves the quality of the results.

1 Introduction

It is impossible to describe Russian (or any other East Slavic) word stress with a set of hand-picked rules. While the stress can be fixed at the word base or ending throughout the whole paradigm, it can also change its position. The word stress detection task is important for text-to-speech solutions and word-level homonymy resolution. Moreover, stress detection software is in demand among learners of Russian.

One of the approaches to solving this problem is a dictionary-based system. It simply stores all the wordforms and fails on out-of-vocabulary words. The rule-based approach offers better results; however, collecting the word stress patterns is a highly time-consuming task. Also, this method cannot manage words without special morpheme markers. As shown in (Ponomareva et al., 2017), even simple deep learning methods easily outperform all the approaches described above.

In this paper we address the following research questions:

1. how well does the sequence labeling approach suit the word stress detection task?

2. among the investigated RNN-based architectures, what is the best one for the task?

3. can a word stress detection system be trained on one language or a combination of languages and successfully used for another language?

To tackle these questions, we:

1. compare the investigated RNN-based models for the word stress detection task on a standard dataset in Russian and select the best one;

2. create new datasets in Russian, Ukrainian, and Belarusian and conduct a series of mono- and cross-lingual experiments to study the possibility of cross-lingual analysis.

The paper is structured as follows: we start with a description of the datasets created. Next, we present our approach to the selection of the neural network architecture. Finally, we discuss the results and related work.

2 Dataset

In this project, we approach the word stress detection problem for three East Slavic languages: Russian, Ukrainian, and Belarusian, which are said to be mutually intelligible to some extent. Our preliminary experiments, along with the results of (Ponomareva et al., 2017), show that using context, i.e., the words to the left and right of the word under consideration, is of great help. Hence, data sources such as dictionaries, including Wiktionary, do not satisfy these requirements, because they provide

Proceedings of VarDial, pages 35–41, Minneapolis, MN, June 7, 2019. © 2019 Association for Computational Linguistics

only single words and do not provide context words.

To our knowledge, there are no corpora annotated with word stress for Ukrainian and Belarusian, while transcriptions from the speech subcorpus in Russian¹ of the Russian National Corpus (RNC) (Grishina, 2003) are available. Due to the lack of the necessary corpora, we decided to create them manually.

The approach to data annotation is quite simple: we adopt texts from the Universal Dependencies project, use the provided tokenization and POS tags, conduct simple filtering, and use a crowdsourcing platform, Yandex.Toloka², for the actual annotation.

To be more precise, we took the Russian, Ukrainian, and Belarusian treebanks from the Universal Dependencies project. We split each text from these treebanks into word trigrams and filtered out unnecessary trigrams, where the center words correspond to NUM, PUNCT, and other non-word tokens. The next step is to create annotation tasks for the crowdsourcing platform. We formulate the word stress annotation task as a multiple-choice task: given a trigram, the annotator has to choose the word stress position in the central word by choosing one of the answer options. Each answer option is the central word with one of its vowels capitalized to highlight a possible word stress position. An example of an annotation task is provided in Fig. 1. Each task was solved by three annotators. As the task is not complicated, we decided to accept only those tasks on which all three annotators agreed. Finally, we obtained three sets of trigrams for the Russian, Ukrainian, and Belarusian languages of approximately the following sizes: 20K, 10K, and 3K, correspondingly. The sizes of the resulting datasets are almost proportional to those of the initial corpora from the Universal Dependencies treebanks.

Figure 1: A screenshot of the word stress detection task from the Yandex.Toloka crowdsourcing platform

Due to the high quality of the Universal Dependencies treebanks and the languages not being easily confused, there is little intersection between the datasets: only around 50 words are shared between the Ukrainian and Belarusian datasets and between the Russian, Ukrainian, and Belarusian datasets. The intersection between the Ukrainian and Russian datasets amounts to around 200 words.

The structure of the dataset is straightforward: each entry consists of a word trigram and a number, which indicates the position of the word stress in the central word³.

3 Preprocessing

We followed a basic preprocessing strategy for all the datasets. First, we tokenize all the texts into words. Next, to take the previous and next word into account, we define the left and right contexts of the word as the last three characters of the previous word and the last three characters of the next word. Any word stresses are removed from the context characters. If the previous or next word has fewer than three letters, we concatenate it with the current word (for example, "te_oblaká" [that-Pl.Nom cloud-Pl.Nom]). This definition of context is used since East Slavic endings are typically two to four letters long and derivational morphemes are usually located on the right periphery of the word.

Finally, each character is annotated with one of two labels L = {0, 1}: a character is annotated with 0 if it carries no stress and with 1 if it should carry the stress. An example of an input character string can be found in Table 1.

4 Model selection

We treat word stress detection as a sequence labeling task.

¹ Word stress in spoken texts database in the Russian National Corpus [Baza dannykh aktsentologicheskoy razmetki ustnykh tekstov v sostave Natsional'nogo korpusa russkogo yazyka], http://www.ruscorpora.ru/en/search-spoken.html
² https://toloka.yandex.ru
³ Datasets are available at: https://github.com/MashaPo/russtressa
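As an illustration of the preprocessing described in Section 3, the following sketch builds the input string and its 0/1 label string for one trigram. The function names, the apostrophe-as-stress-mark convention, and the 1-based `stress_pos` index are our illustrative assumptions, not the authors' actual code; the short-word concatenation rule is omitted for brevity.

```python
# A minimal sketch of the Section 3 preprocessing (names and conventions are
# illustrative assumptions). A training item is a trigram (prev, center, next)
# plus the 1-based index of the stressed character in the center word.

def context(word: str, n: int = 3) -> str:
    """Last n characters of a context word, with stress marks removed."""
    return word.replace("'", "")[-n:]   # we assume "'" marks stress, if any

def encode(prev: str, center: str, nxt: str, stress_pos: int):
    """Return the input character string and its 0/1 label string."""
    left, right = context(prev), context(nxt)
    chars = left + " " + center + " " + right
    labels = ["0"] * len(chars)
    # shift the stress index by the left context plus the separating space
    labels[len(left) + 1 + stress_pos - 1] = "1"
    return chars, "".join(labels)

# The Table 1 example: "белая ворона летит", with the stress on the 4th
# character of "ворона" (its second "о").
chars, labels = encode("белая", "ворона", "летит", 4)
print(chars)   # лая ворона тит
print(labels)  # 00000001000000
```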

in:  лая ворона тит
out: 00000001000000

Table 1: Character model input and output: each character is annotated with either 0 or 1. The trigram "белая ворона летит" ("a white crow flies") is annotated here. The central word remains unchanged, while its left and right contexts are reduced to their last three characters.

Figure 2: Local model for word stress detection

Each character (or syllable) is labeled with one of the labels L = {0, 1}, indicating no stress on the character (0) or a stress (1). Given a string s = s1, ..., sn of characters, the task is to find the labels Y* = y1*, ..., yn* such that

Y* = argmax_{Y ∈ L^n} p(Y | s).

The most probable label is assigned to each character. We compare two RNN-based models for the task of word stress detection (see Fig. 2 and Fig. 3). Both models have common input and hidden layers but differ in their output layers. The input of both models consists of character embeddings. In both cases, we use a bidirectional LSTM of 32 units as the hidden layer. Below, we describe the difference between the output layers of the two models.

4.1 Local model

The decision strategy of the local model (see Fig. 2) follows common language modeling and NER architectures (Ma and Hovy, 2016): all outputs are independent of each other. We decide whether there should be a stress on each given symbol (or syllable) or not. To do this, for each character we apply a separate dense layer with two units and a softmax activation function to the corresponding hidden state of the recurrent layer, labeling each input character (or syllable) with L = {0, 1}.

4.2 Global model

The decision strategy of the global model (see Fig. 3) follows common encoder-decoder architectures (Sutskever et al., 2014). We use the hidden layer to encode the input sequence into a vector representation. Then, we use a dense layer of n units as a decoder to decode this representation and generate the desired sequence of {0, 1}. In comparison to the local model, in this case we try to find the position of the stress directly instead of making a series of local decisions about whether there should be a stress on each character.

Figure 3: Global model for word stress detection

To test the approach and to compare these models, we train both models on the "Word stress in spoken texts" subcorpus of the Russian National Corpus, which appears to be a standard dataset for the task of word stress detection. This dataset was preprocessed according to our standard procedure, and the resulting dataset contains approximately 1M trigrams. The results of the cross-validation experiments, presented in Table 2, show that the global model significantly outperforms the local model. Hence, the global architecture is used in the subsequent experiments.

We pay special attention to homographs: as one can see, in general, the quality of word stress detection is significantly lower on homographs than on regular words. However, in the majority of cases, we are still able to detect the word stress position for a homograph, most likely due to the model's use of the word context.
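The difference between the two decision strategies can be contrasted on mock per-position probabilities. This is a minimal sketch of the decoding step only; the shared BiLSTM encoder is omitted and all names are hypothetical, not taken from the authors' implementation.

```python
# Contrast of the two output strategies from Section 4, over mock scores.

def local_decode(stress_probs):
    """Local model: an independent 0/1 decision per character.

    stress_probs[i] plays the role of p(y_i = 1) from the i-th softmax;
    this strategy may emit zero or several stresses per word."""
    return [int(p > 0.5) for p in stress_probs]

def global_decode(position_scores):
    """Global model: one distribution over the n positions picks the single
    most probable stress position, so exactly one stress is emitted."""
    best = max(range(len(position_scores)), key=position_scores.__getitem__)
    return [int(i == best) for i in range(len(position_scores))]

# Mock outputs for a 6-character word where two positions look plausible.
probs = [0.1, 0.6, 0.2, 0.55, 0.1, 0.05]
print(local_decode(probs))   # [0, 1, 0, 1, 0, 0]  (two stresses predicted)
print(global_decode(probs))  # [0, 1, 0, 0, 0, 0]  (a single stress)
```

The sketch makes the structural advantage of the global model visible: it guarantees exactly one stress per word, which matches the single-stress nature of the task.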

# vowels      local   global
all words
  2             961      983
  3             940      977
  4             947      976
  5             960      977
  6             958      973
  7             924      955
  8             866      923
  9             809      979
  avg           952      979
homographs
  2             839      810
  3             774      844
  4             787      847
  avg           821      819

Table 2: Accuracy scores × 1000 for the two models

5 Experiments and results

In this series of experiments, we checked the following assumptions in various experimental settings:

1. monolingual setting: the approach presented above applies not only to Russian word stress detection but also to the other East Slavic languages;

2. cross-lingual setting (1): it is possible to train a model on one language (e.g., Ukrainian), test it on another language (e.g., Belarusian), and achieve results comparable to the monolingual setting;

3. cross-lingual setting (2): training on several languages (e.g., Russian and Ukrainian) improves the results of testing on a single language (e.g., Russian) in comparison to the monolingual setting.

To conduct the experiments in these mono- and cross-lingual settings, we split the annotated datasets for Russian, Ukrainian, and Belarusian randomly in a 7:3 train-test ratio and conducted 20 runs of training and testing with different random seeds. Afterward, the accuracy scores of all runs were averaged. Table 3 presents the results of these experiments.

train dataset                     test: Belarusian   Russian   Ukrainian
Belarusian                               647          326        373
Russian                                  495          738        516
Ukrainian                                556          553        683
Ukrainian, Belarusian                    769          597        701
Russian, Belarusian                      740          740        563
Russian, Ukrainian                       627          756        700
Russian, Ukrainian, Belarusian           772          760        698

Table 3: Accuracy scores × 1000 for different train and test dataset combinations

Table 3 shows that:

1. in the monolingual setting, we can get high-quality results. The scores are significantly lower than the scores of the same model on the standard dataset, due to the smaller sizes of the training datasets. Nevertheless, one can see that our approach to word stress detection applies not only to the Russian data but also to the data in the Belarusian and Ukrainian languages;

2. cross-lingual setting (1): the Belarusian training dataset, being the smallest of the three, is not a good source for training word stress detection models for other languages, while the Ukrainian dataset stands out as a good source for training word stress detection systems for both the Russian and Belarusian languages;

3. cross-lingual setting (2): adding one or two datasets from the other languages improves the quality. For example, around 10% of accuracy is gained by adding the Russian training dataset to the Belarusian training dataset while testing on Belarusian.
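The evaluation protocol described above (a random 7:3 split repeated over 20 seeds, with accuracies averaged) can be sketched as follows. Here `train_and_score` is a hypothetical stand-in for training one of the models and returning its test accuracy; the function names are ours, not the authors'.

```python
# Sketch of the Section 5 evaluation protocol: random 7:3 train-test split,
# 20 runs with different seeds, averaged accuracy. Names are illustrative.
import random

def split_7_3(dataset, seed):
    """Shuffle the dataset with the given seed and split it 70/30."""
    rng = random.Random(seed)
    items = list(dataset)
    rng.shuffle(items)
    cut = int(0.7 * len(items))
    return items[:cut], items[cut:]

def averaged_accuracy(dataset, train_and_score, n_runs=20):
    """Average the test accuracy over n_runs differently seeded splits."""
    scores = []
    for seed in range(n_runs):
        train, test = split_7_3(dataset, seed)
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)

# Usage with a dummy scorer (a real run would train the global model):
data = list(range(100))
acc = averaged_accuracy(data, lambda tr, te: 0.7, n_runs=20)
print(round(acc, 3))  # 0.7
```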

One possible reason for the difference between Belarusian and the other two languages may be the following: after the orthography reform of 1933, vowel reduction in unstressed positions (a phonetic feature common to the East Slavic languages) has been represented orthographically in Belarusian. However, the size of the Belarusian dataset (it is much smaller than the other two) may affect the quality as well.

6 Related Work

6.1 Char-RNN models

Several research groups have shown that character-level models are an efficient way to deal with unseen words in various NLP tasks, such as text classification (Joulin et al., 2017), named entity recognition (Ma and Hovy, 2016), POS tagging (Santos and Zadrozny, 2014; Cotterell and Heigold, 2017), dependency parsing (Alberti et al., 2017), and machine translation (Chung et al.). A character-level model either treats the text as a sequence of characters without any tokenization or incorporates character-level information into word-level information. Character-level models can capture morphological patterns, such as prefixes and suffixes, so that the model can determine the POS tag or NE class of an unknown word.

6.2 Word stress detection in East Slavic languages

Only a few authors touch upon the problem of automated word stress detection in Russian. Among them, one research project in particular is worth mentioning (Hall and Sproat, 2013). The authors restricted the task of stress detection to finding the correct order within an array of stress assumptions, where valid stress patterns were closer to the top of the list than invalid ones. Then, the first stress assumption in the rearranged list was considered to be correct. The authors used the Maximum Entropy Ranking method to address this problem (Collins and Koo, 2005) and took character bi- and trigrams, suffixes, and prefixes of the ranked words as features, as well as suffixes and prefixes represented in an "abstract" form where most of the vowels and consonants were replaced with their phonetic class labels. The study features results obtained using a corpus of Russian wordforms generated on the basis of Zaliznyak's Dictionary (approx. 2M wordforms). Testing the model on randomly split train and test samples showed an accuracy of 0.987. According to the authors, they observed such high accuracy because splitting the sample randomly during testing helped the algorithm benefit from lexical information, i.e., different wordforms of the same lexical item often share the same stress position. The authors then tried to solve a more complicated problem and tested their solution on a small number of wordforms whose paradigms were not included in the training sample. As a result, an accuracy of 0.839 was achieved. The evaluation technique that the authors propose is quite far from a real-life application, which is the main disadvantage of their study. Usually, solutions in the field of automated stress detection are applied to real texts, where the frequency distribution of wordforms differs drastically from the one in a bag of words obtained by "unfolding" all the items in a dictionary.

Another study (Reynolds and Tyers, 2015) describes a rule-based method of automated stress detection without the help of machine learning. The authors proposed a system of finite-state automata imitating the rules of Russian stress accentuation, and a formal grammar that partially resolved stress ambiguity by applying syntactic restrictions. Using all the above-mentioned solutions together with wordform frequency information, the authors achieved an accuracy of 0.962 on a relatively small hand-tagged Russian corpus (7,689 tokens) that was not found to be generally available. We can treat the proposed method as a baseline for the automated word stress detection problem in Russian.

The global model, which is shown to be the best RNN-based architecture for this setting of the task, was first presented in (Ponomareva et al., 2017), where a simple bidirectional RNN with LSTM nodes was used to achieve an accuracy of 90% or higher. The authors experiment with two training datasets and show that using data from an annotated corpus is much more efficient than using a dictionary, since it allows the model to take into account word frequencies and

the morphological context of the word. We extend the approach of (Ponomareva et al., 2017) by training on new datasets in additional languages and conducting cross-lingual experiments.

6.3 Cross-lingual analysis

Cross-lingual analysis has received some attention in the NLP community, especially when applied in neural systems. Among the research directions of cross-lingual analysis are multilingual word embeddings (Ammar et al., 2016; Hermann and Blunsom, 2013) and dialect identification systems (Malmasi et al., 2016; Al-Badrashiny et al., 2015). Traditional NLP tasks such as POS tagging (Cotterell and Heigold, 2017), morphological reinflection (Kann et al., 2017), and dependency parsing (Guo et al., 2015) benefit from cross-lingual training too. Although the above-mentioned tasks are quite diverse, the underlying motivation is similar: to approach a task in a low-resource language by using additional training data from a high-resource language, or by training a model on a high-resource language and fine-tuning it on a low-resource language, possibly with a lower learning rate.

7 Conclusion

In this project, we present a neural approach to word stress detection. We test the approach in several settings: first, we compare several neural architectures on a standard dataset for the Russian language and use the results of this experiment to select the architecture that provides the highest accuracy score. Next, we annotated the Universal Dependencies corpora for the Russian, Ukrainian, and Belarusian languages with word stress using the Yandex.Toloka crowdsourcing platform. The experiments conducted on these datasets consist of two parts: a) in the monolingual setting, we train and test the model for word stress detection on each dataset separately; b) in the cross-lingual setting, we train the model on various combinations of the datasets and test it on all three datasets. These experiments show that:

1. the proposed method for word stress detection is applicable for the Russian, Ukrainian, and Belarusian languages;

2. using an additional language for training most likely improves the quality of the results.

Future work should focus both on annotating new datasets for other languages that possess the word stress phenomenon and on the further development of cross-lingual neural models based on other sequence processing architectures, such as transformers.

References

Mohamed Al-Badrashiny, Heba Elfardy, and Mona Diab. 2015. AIDA2: A hybrid approach for token and sentence level dialect identification in Arabic. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 42–51.

Chris Alberti, Daniel Andor, Ivan Bogatyy, Michael Collins, Dan Gillick, Lingpeng Kong, Terry Koo, Ji Ma, Mark Omernick, Slav Petrov, et al. 2017. SyntaxNet models for the CoNLL 2017 shared task. arXiv preprint arXiv:1703.04929.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. A character-level decoder without explicit segmentation for neural machine translation.

Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31:25–70.

Ryan Cotterell and Georg Heigold. 2017. Cross-lingual, character-level neural morphological tagging. arXiv preprint arXiv:1708.09157.

Elena B. Grishina. 2003. Spoken Russian in Russian National Corpus. Russian National Corpus, 2005:94–110.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1234–1244.

Keith Hall and Richard Sproat. 2013. Russian stress prediction using maximum entropy ranking. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 879–883, Seattle, Washington, USA. Association for Computational Linguistics.

Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. EACL 2017, page 427.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. One-shot neural cross-lingual transfer for paradigm completion. arXiv preprint arXiv:1704.00052.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, and Jörg Tiedemann. 2016. Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 1–14.

Maria Ponomareva, Kirill Milintsevich, Ekaterina Chernyak, and Anatoly Starostin. 2017. Automated word stress detection in Russian. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 31–35.

Robert Reynolds and Francis Tyers. 2015. Automatic word stress annotation of Russian unrestricted text. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 173–180, Vilnius. Linköping University Electronic Press, Sweden.

Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
