Char-RNN for Word Stress Detection in East Slavic Languages
Maria Ponomareva (ABBYY, Moscow, Russia) [email protected]
Kirill Milintsevich (University of Tartu, Tartu, Estonia) [email protected]
Ekaterina Artemova (National Research University Higher School of Economics, Moscow, Russia) [email protected]

Abstract

We explore how well a sequence labeling approach, namely, a recurrent neural network, is suited for the task of resource-poor, POS-tagging-free word stress detection in the Russian, Ukrainian and Belarusian languages. We present new datasets, annotated with word stress, for the three languages, compare several RNN models trained on the three languages, and explore possible applications of transfer learning for the task. We show that it is possible to train a model in a cross-lingual setting and that using additional languages improves the quality of the results.

1 Introduction

It is impossible to describe Russian (or any other East Slavic) word stress with a set of hand-picked rules. While the stress can be fixed at a word base or ending along the whole paradigm, it can also change its position. The word stress detection task is important for text-to-speech solutions and word-level homonymy resolution. Moreover, stress detection software is in demand among learners of Russian.

One of the approaches to solving this problem is a dictionary-based system. It simply stores all the word forms and fails on out-of-vocabulary words. The rule-based approach offers better results; however, collecting the word stress patterns is a highly time-consuming task. Also, this method cannot handle words without special morpheme markers. As shown in (Ponomareva et al., 2017), even simple deep learning methods easily outperform all the approaches described above.

In this paper we address the following research questions:

1. How well does the sequence labeling approach suit the word stress detection task?

2. Among the investigated RNN-based architectures, which is the best one for the task?

3. Can a word stress detection system be trained on one language, or a combination of languages, and successfully used for another language?

To tackle these questions we:

1. compare the investigated RNN-based models for the word stress detection task on a standard dataset in Russian and select the best one;

2. create new datasets in Russian, Ukrainian and Belarusian and conduct a series of mono- and cross-lingual experiments to study the possibility of cross-lingual analysis.

The paper is structured as follows: we start with a description of the datasets created. Next, we present our approach to the selection of the neural network architecture. Finally, we discuss the results and related work.

2 Dataset

In this project, we approach the word stress detection problem for three East Slavic languages: Russian, Ukrainian and Belarusian, which are said to be mutually intelligible to some extent. Our preliminary experiments, along with the results of (Ponomareva et al., 2017), show that using context, i.e., the words to the left and to the right of the word under consideration, is of great help. Hence, such data sources as dictionaries, including Wiktionary, do not satisfy these requirements, because they provide only single words and no context words.

To our knowledge, there are no corpora annotated with word stress for Ukrainian and Belarusian, while for Russian there are transcriptions available from the speech subcorpus of the Russian National Corpus (RNC) (Grishina, 2003) [1]. Due to the lack of the necessary corpora, we decided to create them manually.

[1] Word stress in spoken texts database in the Russian National Corpus [Baza dannykh aktsentologicheskoy razmetki ustnykh tekstov v sostave Natsional'nogo korpusa russkogo yazyka], http://www.ruscorpora.ru/en/search-spoken.html

The approach to data annotation is quite simple: we adopt texts from the Universal Dependencies project, use the provided tokenization and POS tags, conduct simple filtering, and use a crowdsourcing platform, Yandex.Toloka [2], for the actual annotation.

[2] https://toloka.yandex.ru

To be more precise, we took the Russian, Ukrainian and Belarusian treebanks from the Universal Dependencies project. We split each text from these treebanks into word trigrams and filtered out unnecessary trigrams, in which the center word corresponds to NUM, PUNCT or another non-word token.
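As a concrete illustration of this step, the trigram extraction could look roughly as follows. This is a minimal Python sketch assuming CoNLL-U files from the Universal Dependencies treebanks; all names in it (read_sentences, extract_trigrams, SKIP_TAGS) are ours and do not correspond to the authors' released code.

SKIP_TAGS = {"NUM", "PUNCT", "SYM", "X"}  # non-word center tokens to drop (exact set is our assumption)

def read_sentences(path):
    """Yield sentences from a CoNLL-U file as lists of (form, upos) pairs."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if sentence:
                    yield sentence
                sentence = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multiword-token and empty-node lines
                    sentence.append((cols[1], cols[3]))  # FORM, UPOS
    if sentence:
        yield sentence

def extract_trigrams(path):
    """Yield (left, center, right) word trigrams whose center is a regular word."""
    for sent in read_sentences(path):
        for i in range(1, len(sent) - 1):
            if sent[i][1] not in SKIP_TAGS:
                yield (sent[i - 1][0], sent[i][0], sent[i + 1][0])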
The next step is to create annotation tasks for the crowdsourcing platform. We formulate the word stress annotation task as a multiple-choice task: given a trigram, the annotator has to choose the word stress position in the central word by selecting one of the answer options. Each answer option is the central word with one of the vowels capitalized to highlight a possible word stress position. An example of an annotation task is provided in Fig. 1. Each task was solved by three annotators. As the task is not complicated, we decided to accept only those tasks on which all three annotators agreed.

[Figure 1: A screenshot of the word stress detection task from the Yandex.Toloka crowdsourcing platform]

Finally, we obtained three sets of trigrams for the Russian, Ukrainian and Belarusian languages of approximately the following sizes: 20K, 10K and 3K trigrams, respectively. The sizes of the resulting datasets are almost proportional to the sizes of the initial corpora from the Universal Dependencies treebanks.

Due to the high quality of the Universal Dependencies treebanks and the languages not being confused, the intersections between the datasets are small: only around 50 words are shared between the Ukrainian and Belarusian datasets and between the Russian, Ukrainian and Belarusian datasets. The intersection between the Ukrainian and Russian datasets amounts to around 200 words.

The structure of the dataset is straightforward: each entry consists of a word trigram and a number which indicates the position of the word stress in the central word [3].

[3] Datasets are available at: https://github.com/MashaPo/russtressa

3 Preprocessing

We followed a basic preprocessing strategy for all the datasets. First, we tokenize all the texts into words. Next, to take the previous and next words into account, we define the left and right contexts of the word as the last three characters of the previous word and the last three characters of the next word. The word stresses (if any) are removed from the context characters. If the previous or next word has fewer than three letters, we concatenate it with the current word (for example, "te_oblaká" [that-Pl.Nom cloud-Pl.Nom]). This definition of context is used because East Slavic endings are typically two to four letters long and derivational morphemes are usually located on the right periphery of the word.

Finally, each character is annotated with one of the two labels $L = \{0, 1\}$: it is annotated with 0 if there is no stress, and with 1 if there should be a stress. An example of an input character string can be found in Table 1.

in   лая ворона тит
out  00000001000000

Table 1: Character model input and output: each character is annotated with either 0 or 1. The trigram "белая ворона летит" ("a white crow flies") is annotated. The central word remains unchanged, while its left and right contexts are reduced to their last three characters.
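The following minimal Python sketch makes this encoding concrete and reproduces the Table 1 example. It assumes the stress is marked with a combining acute accent (U+0301) in the raw data, and the underscore join for short neighbours is our reading of the "te_oblaká" example; neither detail is specified further in the paper.

STRESS_MARK = "\u0301"  # combining acute accent (assumed stress notation)

def strip_stress(word):
    return word.replace(STRESS_MARK, "")

def encode_trigram(left, center, right):
    """Build the character input string and its 0/1 label sequence."""
    lc, rc = strip_stress(left), strip_stress(right)
    # A neighbour shorter than three letters is concatenated to the center
    # word (as in "te_oblaká"); otherwise only its last three characters
    # are kept, separated by a space.
    left_part = lc + "_" if len(lc) < 3 else lc[-3:] + " "
    right_part = "_" + rc if len(rc) < 3 else " " + rc[-3:]
    chars, labels = [], []
    for ch in left_part + center + right_part:
        if ch == STRESS_MARK:
            labels[-1] = 1  # the stress belongs to the preceding vowel
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels

text, labels = encode_trigram("белая", "воро́на", "летит")
# text == "лая ворона тит"; "".join(map(str, labels)) == "00000001000000"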
4 Model selection

We treat word stress detection as a sequence labeling task. Each character (or syllable) is labeled with one of the labels $L = \{0, 1\}$, indicating no stress on the character (0) or a stress (1). Given a string $s = s_1, \ldots, s_n$ of characters, the task is to find the labels $Y^* = y_1^*, \ldots, y_n^*$ such that

$Y^* = \arg\max_{Y \in L^n} p(Y \mid s)$.

The most probable label is assigned to each character.

We compare two RNN-based models for the task of word stress detection (see Fig. 2 and Fig. 3). Both models have common input and hidden layers but differ in their output layers. The input of both models are embeddings of the characters. In both cases, we use a bidirectional LSTM of 32 units as the hidden layer. Below, we describe the difference between the output layers of the two models.

[Figure 2: Local model for word stress detection]

4.1 Local model

The decision strategy of the local model (see Fig. 2) follows common language modeling and NER architectures (Ma and Hovy, 2016): all outputs are independent of each other. We decide whether there should be a stress on each given symbol (or syllable) or not. To do this, for each character we apply its own dense layer with two units and a softmax activation function to the corresponding hidden state of the recurrent layer, labeling each input character (or syllable) with $L = \{0, 1\}$.

[Figure 3: Global model for word stress detection]

4.2 Global model

The decision strategy of the global model (see Fig. 3) follows common encoder-decoder architectures (Sutskever et al., 2014). We use the hidden layer to encode the input sequence into a single representation and predict the position of the stress directly, instead of making a series of local decisions about whether there should be a stress on each character or not.

To test the approach and to compare these models, we train both of them on the Word stress in spoken texts subcorpus of the Russian National Corpus, which appears to be a standard dataset for the task of word stress detection. This dataset was preprocessed according to our standard procedure; the resulting dataset contains approximately 1M trigrams. The results of the cross-validation experiments, presented in Table 2, show that the global model significantly outperforms the local model. Hence, the global architecture is used in the subsequent experiments.
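For concreteness, the two output strategies could be sketched in Keras as follows. The paper fixes only the character embeddings and the 32-unit bidirectional LSTM hidden layer, so the framework, vocabulary size, sequence length and embedding size below are our assumptions; in particular, the global model here collapses the encoder-decoder into a single softmax over character positions, which is one plausible reading of the architecture, not the authors' exact implementation.

from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 100  # number of distinct characters (assumption)
MAX_LEN = 40      # padded input length in characters (assumption)

def shared_encoder():
    """Common input and hidden layers: character embeddings fed to a BiLSTM."""
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    emb = layers.Embedding(VOCAB_SIZE, 32, mask_zero=True)(inp)
    return inp, emb

def local_model():
    """Independent per-character decisions: a softmax over {0, 1} at each step."""
    inp, emb = shared_encoder()
    h = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(emb)
    out = layers.TimeDistributed(layers.Dense(2, activation="softmax"))(h)
    return keras.Model(inp, out)

def global_model():
    """One global decision: a softmax over all character positions."""
    inp, emb = shared_encoder()
    h = layers.Bidirectional(layers.LSTM(32))(emb)  # sequence encoded into one vector
    out = layers.Dense(MAX_LEN, activation="softmax")(h)  # position of the stress
    return keras.Model(inp, out)

Under this reading, the local model is trained with a per-character cross-entropy loss, while the global model is trained with a categorical cross-entropy over stress positions.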