Evaluation of Morphological Embeddings for English and Russian Languages

Vitaly Romanov, Innopolis University, Innopolis, Russia, [email protected]
Albina Khusainova, Innopolis University, Innopolis, Russia, [email protected]

arXiv:2103.06884v1 [cs.CL] 11 Mar 2021

Abstract

This paper evaluates morphology-based embeddings for English and Russian languages. Despite the interest in, and the introduction of, several morphology-based models in the past, and despite acclaimed performance improvements on word similarity and language modeling tasks, in our experiments we did not observe any stable preference over our two baseline models, SkipGram and FastText. The performance exhibited by morphological embeddings is the average of the two baselines mentioned above.

1 Introduction

One of the most significant shifts in the area of natural language processing is the shift to the practical use of distributed word representations. Collobert et al. (2011) showed that a neural model could achieve close to state-of-the-art results in Part of Speech (POS) tagging and chunking by relying almost only on word embeddings learned with a language model. In modern language processing architectures, high-quality pre-trained representations of words are one of the major factors of the resulting model performance.

Although word embeddings became ubiquitous, there is no single benchmark for evaluating their quality (Bakarov, 2018), and popular intrinsic evaluation techniques are subject to criticism (Gladkova and Drozd, 2016). Researchers very often rely on intrinsic evaluation, such as semantic similarity or analogy tasks. While intrinsic evaluations are simple to understand and conduct, they do not necessarily imply the quality of embeddings for all possible tasks (Gladkova et al., 2016).

In this paper, we turn to the evaluation of morphological embeddings for English and Russian languages. Over the last decade, many approaches tried to include subword information into word representations. Such approaches involve additional techniques that perform segmentation of a word into morphemes (Arefyev N.V., 2018; Virpioja et al., 2013). The presumption is that we can potentially increase the quality of distributional representations if we incorporate these segmentations into the language model (LM).

Several approaches that include morphology into word embeddings were proposed, but the evaluation often does not compare the proposed embedding methodologies with the most popular embedding vectors, such as FastText and GloVe. In this paper, we aim at answering the question of whether morphology-based embeddings can be useful, especially for languages with rich morphology (such as Russian). Our contribution is the following:

1. We evaluate simple SkipGram-based (SG-based) morphological embedding models with a new intrinsic evaluation dataset, BATS (Gladkova et al., 2016).

2. We compare the relative gain of using morphological embeddings against Word2Vec and FastText for English and Russian languages.

3. We test morphological embeddings on several downstream tasks other than language modeling, i.e., mapping embedding spaces, POS tagging, and chunking.

The rest of the paper is organized as follows. Section 2 contains an overview of existing approaches for morphological embeddings and methods of their evaluation. Section 3 explains the embedding models that we have tested. Section 4 explains our evaluation approaches. Section 5 describes results.

2 Related work

The idea of including subword information in word representations is not new. The question is how one obtains a morphological segmentation of words. Very often, researchers rely on the unsupervised morphology mining tool Morfessor (Virpioja et al., 2013).

Many approaches use simple composition, e.g., the sum, of morpheme vectors to define a word embedding. Botha and Blunsom (2014) were among the first to try this approach. They showed a considerable drop in perplexity of a log-bilinear language model and also tested their model on word similarity and a downstream translation task. The translation task was tested against an n-gram language model. Similarly, Qiu et al. (2014) tweak the CBOW model so that, besides the central word, it can predict the target morphemes in this word. The final embeddings of morphemes are summed together into the word embedding. They test vectors on analogical reasoning and word similarity, showing that incorporating morphemes improves semantic similarity. El-kishky et al. (2018) develop their own morpheme segmentation algorithm and test the resulting embeddings on the LM task with the SGNS objective. Their method achieved lower perplexity than FastText and SG.

A slightly different approach was taken by Cotterell and Schütze (2015), who optimized a log-bilinear LM with a multitask objective, where the second objective is to guess the next morphological tag. They test the resulting vector similarity against string distance (morphologically close words have similar substrings) and find that their vectors surpass Word2Vec by a large margin.

Bhatia et al. (2016) construct a hierarchical graphical model that incorporates word morphology to predict the next word and then optimize the variational bound. They compare their model with Word2Vec and the one described by Botha and Blunsom (2014). They found that their method improves results on word similarity but is inferior to the approach by Botha and Blunsom (2014) in POS tagging.

Another group of methods tries to incorporate arbitrary morphological information into the embedding model. Avraham and Goldberg (2017) observe that it is impossible to achieve both high semantic and high syntactic similarity on the Hebrew language. Instead of morphemes, they use other linguistic tags for the word, i.e., the lemma, the word itself, and the morphological tag. Chaudhary et al. (2018) took a similar approach to the next level. Besides including morphological tags, they include morphemes and character n-grams, and study the possibility of embedding transfer from Turkish to Uighur and from Hindi to Bengali. They test the result on NER and monolingual machine translation.

Another approach that deserves mention here is FastText by Bojanowski et al. (2017). They do not use morphemes explicitly, but instead rely on subword character n-grams that store morphological information implicitly. This method achieves high scores on both semantic and syntactic similarities, and is by far the most popular word embedding model that also captures word morphology.

There are also approaches that investigate the impact of more complex models like RNN and LSTM. Luong et al. (2013) created a hierarchical language model that uses an RNN to combine the morphemes of a word to obtain a word representation. Their model performed well on the word similarity task. Similarly, Cao and Rei (2016) create a Char2Vec BiLSTM for embedding words and train a language model with the SG objective. Their model excels at syntactic similarity.

3 Embedding techniques

In this work, we test three embedding models on English and Russian languages: SkipGram, FastText, and MorphGram. The latter is similar to FastText, with the only difference being that instead of character n-grams we model word morphemes. This approach was often used in previous research.

All three models are trained using the negative sampling objective

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log \sigma(s(w_j, w_t)) + \sum_{i=1}^{k} \mathbb{E}_{w \sim P_n(w_t)} \left[ \log \sigma(s(w, w_t)) \right] \tag{1}
$$

In the case of SG, the similarity function s is the inner product of the corresponding vectors. FastText and MorphGram use subword units. We use the same approach to incorporate subword information into the word vector for both models:

$$
s(w_j, w_t) = \sum_{s \in S_{w_t}} v_s^{T} v_{w_j}
$$

where S_{w_t} is the set of word segmentations into n-grams or morphemes. We use Gensim (https://radimrehurek.com/gensim) as the implementation for all models (Řehůřek and Sojka, 2010). For MorphGram, we utilize the FastText model and substitute the function that computes character n-grams with a function that performs morphological segmentation.

4 Experiments and Evaluation

To understand the effect of using morphemes for training word embeddings, we performed intrinsic and extrinsic evaluations of the SG, FastText, and MorphGram models for two languages, English and Russian. The Russian language, in contrast to English, is characterized by rich morphology, which makes this pair of languages a good choice for exploring the difference in the effect of morphology-based models.

4.1 Data and Training Details

We used the first 5GB of unpacked English and Russian Wikipedia dumps (https://dumps.wikimedia.org/) as training data.

For training both SG and FastText we used the Gensim library. For MorphGram, we adapted Gensim's implementation of FastText by breaking words into morphemes instead of n-grams, with all other implementation details left unchanged. Training parameters remain the same as in the original FastText paper, except that the learning rate was set to 0.05 at the beginning of the training, and the vocabulary size was constrained to 100000 words. Morphemes for English words were generated with polyglot (https://polyglot.readthedocs.io/en/latest/index.html), and for Russian with a seq2seq segmentation tool (https://github.com/kpopov94/morpheme_seq2seq).

When reporting our results in tables, we will refer to FastText as FT and to MorphGram as Morph.

4.2 Similarity

One of the intrinsic evaluations often used for word embeddings is a similarity test: given word pairs with human judgments of the degree of similarity for the words in each pair, human judgments are compared with model scores. The higher the correlation, the better the model "understands" the semantic similarity of words. For evaluating trained embeddings, we used the SimLex-999 dataset (Hill et al., 2015): the original one for English, and its version translated by Leviant and Reichart (2015) for Russian. Out-of-vocabulary words were excluded from tests for all models. The results are presented in Table 1.

      SG     FT     Morph
en    0.37   0.35   0.36
ru    0.24   0.19   0.19

Table 1: Correlation between human judgments and model scores for similarity datasets, Spearman's ρ.

We see that SG beats the other two models on the similarity task for both languages, and MorphGram performs almost the same as FastText.

4.3 Analogies

Another type of intrinsic evaluation is the analogies test, where the model is expected to answer questions of the form "A is to B as C is to D", where D should be predicted. For English, we used the Google analogies dataset introduced by Mikolov et al. (2013a) and the BATS collection (Gladkova et al., 2016). For Russian, we used a partial translation (https://rusvectores.org/static/testsets/) of Mikolov's dataset, and a synthetic dataset by Abdou et al. (2018).

Again, we excluded all out-of-vocabulary words from tests. We report accuracy for different models in Table 2.

                            SG      FT      Morph
en   Google Semantic        65.34   48.75   57.52
en   Google Syntactic       55.88   75.10   61.16
en   BATS                   29.67   33.33   32.71
ru   Translated Semantic    39.11   25.59   34.69
ru   Translated Syntactic   32.71   59.29   43.68
ru   Synthetic              24.52   36.78   27.06

Table 2: Accuracy of models on different analogies tasks.

Interestingly, MorphGram is between SG and FastText in semantic categories for both languages, and between FastText and SG for syntactic categories for English.

4.4 Mapping Embedding Spaces

Here we introduce a new type of evaluation: it focuses on a cross-lingual task of mapping two embedding spaces for different languages. The core idea is to transform embedding spaces such that, after this transformation, the vectors of words in one language appear close to the vectors of their translations in another language. We were interested to see if using morphemes has any benefits for performing this kind of mapping.

We map embeddings using a training seed dictionary (a dictionary with word meanings) and the state-of-the-art supervised mapping method by Artetxe et al. (2018), and calculate the accuracy of the mapping on the test dictionary. In short, the essence of this method is to find optimal orthogonal transforms for both embedding spaces to map them to a shared space based on a seed dictionary, plus some additional steps such as embedding normalization. For each model (SG, FastText, and MorphGram), we mapped Russian and English embeddings trained using this model. We used the original implementation (https://github.com/artetxem/vecmap) for mapping (supervised option), and the ground-truth train/test dictionaries provided by Facebook for their MUSE library (https://github.com/facebookresearch/MUSE).

We report 1-nn and 10-nn accuracy: whether the correct translation was found as the first nearest neighbor or among the 10 nearest neighbors of a word in the mapped space. See the results in Table 3.

               SG      FT      Morph
ru-en 1-nn     56.27   55.58   53.51
ru-en 10-nn    78.96   78.82   77.03

Table 3: Accuracy of supervised mapping from Russian to English using different models, searching among the first and ten nearest neighbors.

We observe no positive impact of using MorphGram for mapping word embedding spaces.

4.5 POS Tagging and Chunking

Other tasks where incorporation of morphology can be crucial are POS tagging and chunking. We use a simple CNN-based architecture introduced in (Collobert et al., 2011), with one projection layer, one convolutional layer, and the final logit layer. The only input features we use are the embeddings from the corresponding models. The English language embeddings are tested with the Conll2000 dataset, which contains 8935 training sentences and 44 unique POS tags. The dataset for the Russian language contains 49136 sentences and 458 unique POS tags. Due to time constraints, we train models only for a fixed number of epochs: 50 for English and 20 for Russian (iterations reduced due to a larger training set). The results for POS and chunking are given in Tables 4 and 5 correspondingly. It is interesting to note that SG embeddings perform better for English on the POS task, but for Russian, embeddings that encode more syntactic information always perform better.

      SG       FT       Morph
en    0.9824   0.9754   0.9722
ru    0.8817   0.8899   0.8871

Table 4: Accuracy on POS task

      SG       FT       Morph
en    0.8966   0.9034   0.8985
ru    0.8442   0.8548   0.8534

Table 5: Accuracy on Chunk task

5 Results

In this paper, we compared three word embedding approaches for English and Russian languages. The main inquiry was about the relevance of providing morphological information to word embeddings. Experiments showed that morphology-based embeddings exhibit qualities intermediate between semantics-driven embedding approaches such as SkipGram and character-driven ones such as FastText. The morphological embeddings studied here showed average performance on both semantic and syntactic tests. We also studied the application of morphological embeddings on two downstream tasks: POS tagging and chunking. For the English language, SG provided the best results for POS, whereas FastText gave the best result on the chunking task. For Russian, FastText showed better performance on both tasks. Morphological embeddings, again, showed average results. We recognize that the difference in the results on downstream tasks can be considered marginal. We also did not observe improvements from morphological embeddings on the word similarity dataset compared to other models.

References

Mostafa Abdou, Artur Kulmizev, and Vinit Ravishankar. 2018. Mgad: Multilingual generation of analogy datasets. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

Arefyev N.V., Gratsianova T.Y., Popov K.P. 2018. 24th International Conference on Computational Linguistics and Intellectual Technologies.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Oded Avraham and Yoav Goldberg. 2017. The interplay of semantics and morphology in word embeddings. arXiv preprint arXiv:1704.01938.

Amir Bakarov. 2018. A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536.

Parminder Bhatia, Robert Guthrie, and Jacob Eisenstein. 2016. Morphological priors for probabilistic neural word embeddings. arXiv preprint arXiv:1608.01056.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jan Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In International Conference on Machine Learning, pages 1899–1907.

Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. arXiv preprint arXiv:1606.02601.

Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, and Jaime G. Carbonell. 2018. Adapting word embeddings to new languages with morphological and phonological subword representations.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Ryan Cotterell and Hinrich Schütze. 2015. Morphological word-embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1287–1292.

Ahmed El-kishky, Frank Xu, Aston Zhang, Stephen Macke, and Jiawei Han. 2018. Entropy-Based Subword Mining with an Application to Word Embeddings. pages 12–21.

Anna Gladkova and Aleksandr Drozd. 2016. Intrinsic evaluations of word embeddings: What can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 36–42.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn't. In Proceedings of the NAACL-HLT SRW, pages 47–54, San Diego, California, June 12-17, 2016. ACL.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Ira Leviant and Roi Reichart. 2015. Judgment language matters: Multilingual vector space models for judgment language aware lexical semantics. CoRR, abs/1508.00106.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 141–150.

Radim Řehůřek and Petr Sojka. 2010. Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo, et al. 2013. Morfessor 2.0: Python implementation and extensions for morfessor baseline.
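As an illustration of the subword scoring function from Section 3, the following is a minimal pure-Python sketch, not the Gensim code used in the experiments. It sums the inner products between the vectors of a target word's segments and a context vector. The helper names (`char_ngrams`, `subword_score`), the toy vectors, and the hand-written morpheme segmentation are all hypothetical; the 3 to 6 character n-gram range with `<` and `>` boundary symbols follows the FastText paper, and skipping segments without a vector is an assumption made here for simplicity.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """FastText-style character n-grams of a word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def subword_score(
    segments: Sequence[str],
    subword_vectors: Dict[str, Vector],
    context_vector: Vector,
) -> float:
    """Sum of inner products between each segment vector and the context vector.

    Segments without a vector are skipped (an assumption of this sketch).
    """
    return sum(
        dot(subword_vectors[s], context_vector)
        for s in segments
        if s in subword_vectors
    )


# Hypothetical morpheme segmentation and toy 2-dimensional vectors.
morphemes = ["un", "break", "able"]
toy_vectors = {"un": [1.0, 0.0], "break": [0.0, 2.0], "able": [1.0, 1.0]}
score = subword_score(morphemes, toy_vectors, [1.0, 1.0])
```

Under this view, the only difference between the FastText and MorphGram variants is which function produces `segments`: `char_ngrams` for the former, a morphological segmenter for the latter.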
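The similarity evaluation of Section 4.2 can be sketched as follows. This is a self-contained illustration under stated assumptions: the model score for a pair is taken to be the cosine similarity of the two word vectors (the text only says "model scores"), ties are handled with average ranks, and pairs containing an out-of-vocabulary word are dropped. The function names and the input layout (a list of `(word1, word2, human_score)` tuples) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(
    pairs: Sequence[Tuple[str, str, float]],
    vectors: Dict[str, Sequence[float]],
) -> float:
    """Correlate human scores with cosine scores, skipping OOV pairs."""
    gold, predicted = [], []
    for w1, w2, human_score in pairs:
        if w1 not in vectors or w2 not in vectors:
            continue  # out-of-vocabulary pairs are excluded
        gold.append(human_score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, predicted)
```

In practice a library routine for rank correlation would normally be used; the explicit version is shown only to make the tie handling and the out-of-vocabulary filtering visible.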
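For the analogy test of Section 4.3, the text does not state how the answer D is predicted. The sketch below assumes the commonly used vector-offset method: vectors are normalized to unit length, the target point b - a + c is formed, and the vocabulary word with the highest cosine similarity to it is returned, with the three question words excluded from the candidates. Questions containing an out-of-vocabulary word are skipped, as in the experiments. All names here are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def unit(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def predict_analogy(
    a: str, b: str, c: str, vectors: Dict[str, Sequence[float]]
) -> Optional[str]:
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    va, vb, vc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    candidates = [w for w in vectors if w not in (a, b, c)]
    if not candidates:
        return None
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(
    questions: Sequence[Tuple[str, str, str, str]],
    vectors: Dict[str, Sequence[float]],
) -> float:
    """Share of in-vocabulary questions whose predicted word equals D."""
    correct = total = 0
    for a, b, c, d in questions:
        if any(w not in vectors for w in (a, b, c, d)):
            continue  # out-of-vocabulary questions are excluded
        total += 1
        if predict_analogy(a, b, c, vectors) == d:
            correct += 1
    return correct / total if total else 0.0
```

Excluding the question words from the candidate set is a standard convention for this test; without it, the nearest neighbor of the target point is often one of the input words.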
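The 1-nn and 10-nn accuracy reported in Section 4.4 can be computed along the following lines once both embedding spaces have been mapped into a shared space. This is a minimal sketch, not the evaluation script of the mapping tool: it assumes cosine similarity for the nearest-neighbor search, assumes the test dictionary maps each source word to a set of acceptable translations, and skips source words that have no vector. The function name `knn_accuracy` and the toy data layout are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def knn_accuracy(
    test_dictionary: Dict[str, Set[str]],
    source_vectors: Dict[str, Sequence[float]],
    target_vectors: Dict[str, Sequence[float]],
    k: int,
) -> float:
    """Share of source words with a gold translation among the k nearest
    target words in the shared space."""
    hits = total = 0
    for source_word, gold_translations in test_dictionary.items():
        if source_word not in source_vectors:
            continue  # source words without a vector are skipped
        total += 1
        query = source_vectors[source_word]
        ranked = sorted(
            target_vectors,
            key=lambda w: cosine(query, target_vectors[w]),
            reverse=True,
        )
        if gold_translations & set(ranked[:k]):
            hits += 1
    return hits / total if total else 0.0
```

Sorting the whole target vocabulary for every query is quadratic and only suitable for toy data; with a 100000-word vocabulary, a matrix product over normalized vectors followed by a top-k selection would be the usual choice.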