Evaluation of Morphological Embeddings for English and Russian Languages

Vitaly Romanov, Albina Khusainova
Innopolis University, Innopolis, Russia

arXiv:2103.06884v1 [cs.CL] 11 Mar 2021

Abstract

This paper evaluates morphology-based embeddings for English and Russian. Several morphology-based word embedding models have been introduced in the past, with claimed performance improvements on word similarity and language modeling tasks. In our experiments, however, we did not observe any stable preference for them over our two baseline models, SkipGram and FastText. The performance exhibited by morphological embeddings is the average of those two baselines.

1 Introduction

One of the most significant shifts in the area of natural language processing is the move to the practical use of distributed word representations. Collobert et al. (2011) showed that a neural model could achieve close to state-of-the-art results in Part of Speech (POS) tagging and chunking by relying almost only on word embeddings learned with a language model. In modern language processing architectures, high-quality pre-trained representations of words are one of the major factors in the resulting model performance.

Although word embeddings have become ubiquitous, there is no single benchmark for evaluating their quality (Bakarov, 2018), and popular intrinsic evaluation techniques are subject to criticism (Gladkova and Drozd, 2016). Researchers very often rely on intrinsic evaluation, such as semantic similarity or analogy tasks. While intrinsic evaluations are simple to understand and conduct, they do not necessarily imply the quality of embeddings for all possible tasks (Gladkova et al., 2016).

In this paper, we turn to the evaluation of morphological embeddings for English and Russian. Over the last decade, many approaches have tried to include subword information in word representations. Such approaches involve additional techniques that segment a word into morphemes (Arefyev N.V., 2018; Virpioja et al., 2013). The presumption is that we can potentially increase the quality of distributional representations if we incorporate these segmentations into the language model (LM).

Several approaches that include morphology in word embeddings have been proposed, but their evaluation often does not compare the proposed embedding methodologies with the most popular embedding vectors: Word2Vec, FastText, and GloVe. In this paper, we aim to answer the question of whether morphology-based embeddings can be useful, especially for languages with rich morphology (such as Russian). Our contribution is the following:

1. We evaluate simple SkipGram-based (SG-based) morphological embedding models with the new intrinsic evaluation dataset BATS (Gladkova et al., 2016)

2. We compare the relative gain of using morphological embeddings against Word2Vec and FastText for English and Russian

3. We test morphological embeddings on several downstream tasks other than language modeling, i.e., mapping embedding spaces, POS tagging, and chunking

The rest of the paper is organized as follows. Section 2 contains an overview of existing approaches to morphological embeddings and methods of their evaluation. Section 3 explains the embedding models that we have tested. Section 4 explains our evaluation approaches. Section 5 describes the results.

2 Related work

The idea of including subword information in word representations is not new. The question is how to obtain the morphological segmentation of words. Very often, researchers rely on the unsupervised morphology mining tool Morfessor (Virpioja et al., 2013).

Many approaches use a simple composition, e.g., the sum, of morpheme vectors to define a word embedding. Botha and Blunsom (2014) were among the first to try this approach. They showed a considerable drop in the perplexity of a log-bilinear language model and also tested their model on word similarity and a downstream translation task. The translation task was tested against an n-gram language model. Similarly, Qiu et al. (2014) tweak the CBOW model so that, besides the central word, it can predict the target morphemes in this word. The final embeddings of morphemes are summed together into the word embedding. They test the vectors on analogical reasoning and word similarity, showing that incorporating morphemes improves performance on the semantic similarity task. El-kishky et al. (2018) develop their own morpheme segmentation algorithm and test the resulting embeddings on the LM task with the SGNS objective. Their method achieved lower perplexity than FastText and SG.

A slightly different approach was taken by Cotterell and Schütze (2015), who optimized a log-bilinear LM with a multitask objective, where the second objective is to guess the next morphological tag. They test the resulting vector similarity against string distance (morphologically close words have similar substrings) and find that their vectors surpass Word2Vec by a large margin.

Bhatia et al. (2016) construct a hierarchical graphical model that incorporates word morphology to predict the next word and then optimize the variational bound. They compare their model with Word2Vec and the one described by Botha and Blunsom (2014). They found that their method improves results on word similarity but is inferior to the approach by Botha and Blunsom (2014) in POS tagging.

Another group of methods tries to incorporate arbitrary morphological information into the embedding model. Avraham and Goldberg (2017) observe that it is impossible to achieve both high semantic and high syntactic similarity for Hebrew. Instead of morphemes, they use other linguistic tags for the word, i.e., the lemma, the word itself, and the morphological tag. Chaudhary et al. (2018) took a similar approach one step further. Besides including morphological tags, they include morphemes and character n-grams, and study the possibility of embedding transfer from Turkish to Uighur and from Hindi to Bengali. They test the result on NER and monolingual machine translation.

Another approach that deserves mention here is FastText by Bojanowski et al. (2017). They do not use morphemes explicitly, but instead rely on subword character n-grams, which store morphological information implicitly. This method achieves high scores on both semantic and syntactic similarity, and is by far the most popular word embedding model that also captures word morphology.

There are also approaches that investigate the impact of more complex models such as RNN and LSTM. Luong et al. (2013) created a hierarchical language model that uses an RNN to combine the morphemes of a word into a word representation. Their model performed well on word similarity. Similarly, Cao and Rei (2016) create Char2Vec, a BiLSTM for embedding words, and train a language model with the SG objective. Their model excels at syntactic similarity.

3 Embedding techniques

In this work, we test three embedding models on English and Russian: SkipGram, FastText, and MorphGram. The last one is similar to FastText, with the only difference being that we model word morphemes instead of character n-grams. This approach was often used in previous research. All three models are trained using the negative sampling objective

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log \sigma\big(s(w_j, w_t)\big) + \sum_{i=1}^{k} \mathbb{E}_{w \sim P_n(w_t)} \big[ \log \sigma\big(s(w, w_t)\big) \big] \tag{1}
$$

In the case of SG, the similarity function s is the inner product of the corresponding vectors. FastText and MorphGram use subword units. We use the same approach to incorporate subword information into the word vector for both models:

$$
s(w_j, w_t) = \sum_{s \in S_{w_t}} v_s^{\top} v_{w_j}
$$

where $S_{w_t}$ is the set of segments of the word $w_t$, either character n-grams or morphemes. We use Gensim as the implementation for all models (Řehůřek and Sojka, 2010). For MorphGram, we take the FastText model and replace the function that computes character n-grams with a function that performs morphological segmentation.

4 Experiments and Evaluation

To understand the effect of using morphemes for training word embeddings, we performed intrinsic and extrinsic evaluations of the SG, FastText, and MorphGram models for two languages, English and Russian. Russian, in contrast to English, is characterized by rich morphology, which makes this pair of languages a good choice for exploring the difference in the effect of morphology-based models.

4.1 Data and Training Details

We used the first 5GB of unpacked English and Russian Wikipedia dumps as training data.

For training both SG and FastText we used the Gensim library. For MorphGram, we adapted Gensim's implementation of FastText by breaking words into morphemes instead of n-grams, leaving all other implementation details unchanged. Training parameters remain the same as in the original FastText paper, except that the learning rate was set to 0.05 at the beginning of training and the vocabulary size was constrained to 100,000 words.

[...] for words in each pair, human judgments are compared with model scores: the higher the correlation, the better the model "understands" the semantic similarity of words. To evaluate the trained embeddings, we used the SimLex-999 dataset (Hill et al., 2015): the original one for English, and its version translated by Leviant and Reichart (2015) for Russian. Out-of-vocabulary words were excluded from the tests for all models. The results are presented in Table 1.

Table 1: Correlation between human judgments and model scores for similarity datasets, Spearman's ρ.

      SG     FT     Morph
en    0.37   0.35   0.36
ru    0.24   0.19   0.19

We see that SG beats the other two models on the similarity task for both languages, and MorphGram performs almost the same as FastText.

4.3 Analogies

Another type of intrinsic evaluation is the analogy test, where the model is expected to answer [...]

Table 2: Accuracy of models on different analogies tasks.

                              SG      FT      Morph
en    Google Semantic         65.34   48.75   57.52
en    Google Syntactic        55.88   75.10   61.16
en    BATS                    29.67   33.33   32.71
ru    Translated Semantic     39.11   25.59   34.69
ru    Translated Syntactic    32.71   59.29   43.68
ru    Synthetic               24.52   36.78   27.06

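The word similarity evaluation reported in Table 1 (Spearman's ρ between human judgments and model scores, with out-of-vocabulary pairs excluded) can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the evaluation script used in the paper. Using cosine similarity as the model score is an assumption, and the function names and toy vectors are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def evaluate_similarity(pairs, vectors):
    """Spearman's rho over (word1, word2, human_score) pairs.

    Pairs with an out-of-vocabulary word are skipped. Returns (rho, pairs_used).
    """
    human, model = [], []
    for word1, word2, human_score in pairs:
        if word1 in vectors and word2 in vectors:
            human.append(human_score)
            model.append(cosine(vectors[word1], vectors[word2]))
    # Spearman's rho is the Pearson correlation of the rank-transformed scores.
    return pearson(average_ranks(human), average_ranks(model)), len(human)


# Toy data, for illustration only.
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
pairs = [("a", "b", 9.0), ("a", "c", 1.0), ("a", "d", 5.0), ("a", "zzz", 3.0)]
rho, used = evaluate_similarity(pairs, vectors)
```

In the toy data the model ranks the three in-vocabulary pairs in the same order as the human scores, so the correlation is perfect, and the pair containing the unknown word is dropped.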