Linear Ensembles of Word Embedding Models

Avo Muromägi, Kairit Sirts, Sven Laur
University of Tartu, Tartu, Estonia
[email protected], [email protected], [email protected]

Abstract

This paper explores linear methods for combining several models into an ensemble. We construct the combined models using an iterative method based on either ordinary least squares regression or the solution to the orthogonal Procrustes problem. We evaluate the proposed approaches on Estonian—a morphologically complex language, for which the available corpora for training word embeddings are relatively small. We compare both combined models with each other and with the input word embedding models using synonym and analogy tests. The results show that while the ordinary least squares regression performs poorly in our experiments, using orthogonal Procrustes to combine several word embedding models into an ensemble model leads to 7-10% relative improvements over the mean result of the initial models in synonym tests and 19-47% in analogy tests.

1 Introduction

Word embeddings—dense low-dimensional vector representations of words—have become very popular in recent years in the field of natural language processing (NLP). Various methods have been proposed to train word embeddings from unannotated text corpora (Mikolov et al., 2013b; Pennington et al., 2014; Al-Rfou et al., 2013; Turian et al., 2010; Levy and Goldberg, 2014), the most well-known of them being perhaps Word2Vec (Mikolov et al., 2013b). Embedding learning systems essentially train a model from a corpus of text and the word embeddings are the model parameters. These systems contain a randomized component, so the trained models are not directly comparable, even when they have been trained on exactly the same data. This random behaviour provides an opportunity to combine several embedding models into an ensemble which, hopefully, results in a better set of word embeddings. Although model ensembles have often been used in various NLP systems to improve the overall accuracy, the idea of combining several word embedding models into an ensemble has not been explored before.

The main contribution of this paper is to show that word embeddings can benefit from ensemble learning, too. We study two methods for combining word embedding models into an ensemble. Both methods use a simple linear transformation. The first of them is based on the standard ordinary least squares solution (OLS) for linear regression; the second uses the solution to the orthogonal Procrustes problem (OPP) (Schönemann, 1966), which essentially also solves the OLS but adds an orthogonality constraint that keeps the angles between vectors and their distances unchanged.

There are several reasons why using an ensemble of word embedding models could be useful. First is the typical ensemble learning argument: the ensemble is simply better because it can cancel out the random noise of individual models and reinforce the useful patterns expressed by several input models. Secondly, word embedding systems require a lot of training data to learn reliable word representations. While there is a lot of textual data available for English, there are many smaller languages for which even obtaining enough plain unannotated text for training reliable embeddings is a problem. Thus, an ensemble approach that would make it possible to use the available data more effectively would be beneficial.

To our knowledge, this is the first work that attempts to leverage the data by combining several word embedding models into a new improved model.

Linear methods for combining two embedding models for some task-specific purpose have been used previously. Mikolov et al. (2013a) optimized the linear regression with stochastic gradient descent to learn linear transformations between the embeddings in two languages for machine translation. Mogadala and Rettinger (2016) used OPP to translate embeddings between two languages to perform cross-lingual document classification. Hamilton et al. (2016) aligned a series of embedding models with OPP to detect changes in word meanings over time. The same problem was addressed by Kulkarni et al. (2015), who aligned the embedding models using piecewise linear regression based on a set of nearest neighboring words for each word.

Recently, Yin and Schütze (2016) experimented with several methods to learn meta-embeddings by combining different word embedding sets. Our work differs from theirs in two important aspects. First, in their work each initial model is trained with a different word embedding system and on a different data set, while we propose to combine models trained with the same system and on the same dataset, albeit using different random initialisation. Secondly, although the 1toN model proposed in (Yin and Schütze, 2016) is very similar to the linear models studied in this paper, it does not involve the orthogonality constraint included in the OPP method, which in our experiments, as shown later, proves to be crucial.

We conduct experiments on Estonian and construct ensembles from ten different embedding models trained with Word2Vec. We compare the initial and combined models in synonym and analogy tests and find that the ensemble embeddings combined with the orthogonal Procrustes method indeed perform significantly better in both tests, leading to a relative improvement of 7-10% over the mean result of the initial models in synonym tests and 19-47% in analogy tests.

2 Combining word embedding models

A word embedding model is a matrix W ∈ R^{|V|×d}, where |V| is the number of words in the model lexicon and d is the dimensionality of the vectors. Each row in the matrix W is the continuous representation of a word in a vector space.

Given r embedding models W_1, ..., W_r we want to combine them into a target model Y. We define a linear objective function that is the sum of r optimization goals:

    J = \sum_{i=1}^{r} \| Y - W_i P_i \|^2,    (1)

where P_1, ..., P_r are transformation matrices that translate W_1, ..., W_r, respectively, into the common vector space containing Y.

We use an iterative algorithm to find the matrices P_1, ..., P_r and Y. During each iteration the algorithm performs two steps:

1. Solve r linear regression problems with respect to the current target model Y, which results in updated values for the matrices P_1, ..., P_r;

2. Update Y to be the mean of the translations of all r models:

    Y = \frac{1}{r} \sum_{i=1}^{r} W_i P_i.    (2)

This procedure is continued until the change in the average normalised residual error, computed as

    \frac{1}{r} \sum_{i=1}^{r} \frac{\| Y - W_i P_i \|}{\sqrt{|V| \cdot d}},    (3)

becomes smaller than a predefined threshold value.

We experiment with two different methods for computing the translation matrices P_1, ..., P_r. The first is based on the standard least squares solution to the linear regression problem; the second is the solution to the orthogonal Procrustes problem (Schönemann, 1966).

2.1 Solution with the ordinary least squares (SOLS)

The analytical solution of the linear regression problem Y = WP for finding the transformation matrix P, given the input data matrix W and the result Y, is:

    P = (W^T W)^{-1} W^T Y.    (4)

We can use this formula to update all matrices P_i at each iteration. The problem with this approach is that, because Y is also unknown and is updated repeatedly in the second step of the iterative algorithm, the OLS might lead to solutions where both W_i P_i and Y are optimized towards 0, which is not a useful solution. In order to counteract this effect we rescale Y at the start of each iteration. This is done by scaling the elements of Y so that the variance of each column of Y is equal to 1.
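As an illustration only (not the authors' implementation), a minimal NumPy sketch of this iterative SOLS combination following equations (1)-(4); the initialisation of Y with the mean of the inputs, the function names and the iteration cap are assumptions.

```python
import numpy as np

def combine_sols(models, threshold=0.001, max_iter=100):
    """Combine embedding matrices `models` (a list of |V| x d arrays with rows
    aligned to the same vocabulary) into a target model Y using the SOLS update."""
    n_words, dim = models[0].shape
    Y = np.mean(models, axis=0)               # assumed initialisation of the target model
    prev_err = np.inf
    for _ in range(max_iter):
        Y = Y / Y.std(axis=0)                 # rescale so every column of Y has variance 1
        # Step 1: closed-form OLS update P_i = (W_i^T W_i)^{-1} W_i^T Y (equation 4).
        P = [np.linalg.solve(W.T @ W, W.T @ Y) for W in models]
        # Step 2: Y becomes the mean of the translated models W_i P_i (equation 2).
        translated = [W @ Pi for W, Pi in zip(models, P)]
        Y = np.mean(translated, axis=0)
        # Average normalised residual error (equation 3).
        err = np.mean([np.linalg.norm(Y - T) for T in translated]) / np.sqrt(n_words * dim)
        if abs(prev_err - err) < threshold:   # stop once the error no longer changes much
            break
        prev_err = err
    return Y, P
```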

2.2 Solution to the Orthogonal Procrustes problem (SOPP)

Orthogonal Procrustes is a linear regression problem of transforming the input matrix W to the output matrix Y using an orthogonal transformation matrix P (Schönemann, 1966). The orthogonality constraint is specified as

    P P^T = P^T P = I.

The solution to the orthogonal Procrustes problem can be computed analytically using singular value decomposition (SVD). First compute:

    S = W^T Y.

Then diagonalize using SVD:

    S^T S = V D_S V^T,
    S S^T = U D_S U^T.

Finally compute:

    P = U V^T.

This has to be done for each P_i during each iteration.

This approach is very similar to SOLS. The only difference is the additional orthogonality constraint, which gives a potential advantage to this method, as in the translated word embeddings W_i P_i the lengths of the vectors and the angles between the vectors are preserved. Additionally, we no longer need to worry about the trivial solution where P_1, ..., P_r and Y all converge towards 0.
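A companion sketch of the SOPP update, again as an assumed illustration: P is obtained directly from the SVD of S = W^T Y (which yields the same U and V as the two eigendecompositions above), and the variance rescaling is dropped because the orthogonality constraint already excludes the trivial zero solution.

```python
import numpy as np

def procrustes_transform(W, Y):
    """Orthogonal P minimising ||Y - W P||_F, computed from the SVD of S = W^T Y."""
    U, _, Vt = np.linalg.svd(W.T @ Y)
    return U @ Vt

def combine_sopp(models, threshold=0.001, max_iter=100):
    """Same iterative scheme as combine_sols, but with the orthogonal update."""
    n_words, dim = models[0].shape
    Y = np.mean(models, axis=0)               # assumed initialisation
    prev_err = np.inf
    for _ in range(max_iter):
        P = [procrustes_transform(W, Y) for W in models]
        translated = [W @ Pi for W, Pi in zip(models, P)]
        Y = np.mean(translated, axis=0)
        err = np.mean([np.linalg.norm(Y - T) for T in translated]) / np.sqrt(n_words * dim)
        if abs(prev_err - err) < threshold:
            break
        prev_err = err
    return Y, P
```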

3 Experiments

We tested both methods on a number of Word2Vec models (Mikolov et al., 2013b) trained on the Estonian Reference Corpus [1]. The Estonian Reference Corpus is the largest corpus available for Estonian. Its size is approximately 240M word tokens, which may seem like a lot, but compared to, for instance, the English Gigaword corpus, which is often used to train word embeddings for English words and which contains more than 4B words, it is quite small. All models were trained using a window size of 10 and the skip-gram architecture. We experimented with models of 6 different embedding sizes: 50, 100, 150, 200, 250 and 300. For each dimensionality we had 10 models available. The number of distinct words in each model is 816757.

[1] http://www.cl.ut.ee/korpused/segakorpus

During training, the iterative algorithm was run until the convergence threshold th = 0.001 was reached. The number of iterations needed for convergence for both methods and for models with different embedding sizes is given in Table 1. It can be seen that convergence with SOPP took significantly fewer iterations than with SOLS. This difference is probably due to two aspects: 1) SOPP has the additional orthogonality constraint which reduces the space of feasible solutions; 2) although SOLS uses the exact analytical solution to the least squares problem, the final solution for Y does not move directly in the direction pointed to by the analytical solutions due to the variance rescaling.

Dim | SOLS Error | SOLS # Iter | SOPP Error | SOPP # Iter
50  | 0.162828   | 33          | 0.200994   | 5
100 | 0.168316   | 38          | 0.183933   | 5
150 | 0.169554   | 41          | 0.171266   | 4
200 | 0.172987   | 40          | 0.167554   | 4
250 | 0.175723   | 40          | 0.164493   | 4
300 | 0.177082   | 40          | 0.160988   | 4

Table 1: Final errors and the number of iterations until convergence for both SOLS and SOPP. The first column shows the embedding size.
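The paper does not state which toolkit was used to train the models; purely as an illustrative assumption, a gensim 4.x sketch of training ten skip-gram models that differ only in their random seed and aligning their rows to a common vocabulary order:

```python
from gensim.models import Word2Vec
import numpy as np

def train_models(sentences, dim, n_models=10):
    """Train differently seeded skip-gram models on the same pre-tokenised corpus.
    `sentences` is assumed to be an iterable of token lists."""
    kvs = [Word2Vec(sentences, vector_size=dim, window=10, sg=1,
                    seed=seed, workers=1).wv   # workers=1 for seed-reproducible runs
           for seed in range(n_models)]
    vocab = kvs[0].index_to_key                # align all models to one word order
    return [np.stack([kv[w] for w in vocab]) for kv in kvs]
```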

4 Results

We evaluate the goodness of the combined models using synonym and analogy tests.

4.1 Synonym ranks

One of the common ways to evaluate word embeddings is to use relatedness datasets to measure the correlation between the human and model judgements (Schnabel et al., 2015). In those datasets there are word pairs, and each pair is human-annotated with a relatedness score. The evaluation is then performed by correlating the cosine similarities between word pairs with the relatedness scores. As there are no annotated relatedness datasets for Estonian, we opted to use a synonym test instead. We rely on the assumption that the relatedness between a pair of synonyms is high and thus we expect the cosine similarity between the synonymous words to be high as well.

We obtained the synonyms from the Estonian synonym dictionary [2]. We queried each word in our vocabulary and, when an exact match for this word was found, we looked at the first synonym offered by the dictionary. If this synonym was present in our vocabulary, then the synonym pair was stored. In this manner we obtained a total of 7579 synonym pairs. We ordered those pairs according to the frequency of the first word in the pair and chose the 1000 most frequent words with their synonyms for the synonym test.

[2] The Institute of the Estonian Language, http://www.eki.ee/dict/sys/

For each first word in the synonym pair, we computed its cosine similarity with every other word in the vocabulary, ordered those similarities in descending order and found the rank of the second word of the synonym pair in this resulting list. Then we computed the mean rank over all 1000 synonym pairs. We performed these steps on both types of combined models, Y_SOLS and Y_SOPP, and also on all input models W_i. Finally, we also computed the mean of the mean ranks of all 10 input models.
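A hedged sketch of this mean-rank computation, assuming an embedding matrix with L2-normalised rows, a word-to-row-index mapping and a list of synonym pairs (all names are illustrative):

```python
import numpy as np

def mean_synonym_rank(emb, word2idx, synonym_pairs):
    """Mean rank of the second synonym among all words ordered by cosine
    similarity to the first synonym; `emb` rows are assumed L2-normalised."""
    ranks = []
    for w1, w2 in synonym_pairs:
        i, j = word2idx[w1], word2idx[w2]
        sims = emb @ emb[i]                   # cosine similarities to the first word
        sims[i] = -np.inf                     # ignore the query word itself
        ranks.append(int((sims > sims[j]).sum()) + 1)
    return float(np.mean(ranks))
```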

Dim | SOLS  | SOPP  | Mean W
50  | 70098 | 38998 | 41933
100 | 68175 | 32485 | 35986
150 | 73182 | 30249 | 33564
200 | 73946 | 29310 | 32865
250 | 75884 | 28469 | 32194
300 | 77098 | 28906 | 32729
Avg | 73064 | 31403 | 34879

Table 2: Average mean ranks of the synonym test; smaller values are better. The best result in each row is in bold. All differences are statistically significant with p < 2.2 × 10^-16 in all cases.

The results shown in Table 2 reveal that the synonym similarities tend to be ranked lower in the combined model obtained with SOLS when compared to the input models. SOPP, on the other hand, produces a combined model where the synonym similarities are ranked higher than in the initial models. This means that the SOPP combined models pull the synonymous words closer together than they were in the initial models. The differences in mean ranks were tested using a paired Wilcoxon signed-rank test at the 95% confidence level and the differences were statistically significant, with the p-value being less than 2.2 × 10^-16 in all cases.

Overall, the SOPP ranks are on average 10% lower than the mean ranks of the initial models. The absolute improvement on average between SOPP and the mean of W is 3476.

Figure 1: Histogram of the synonym ranks of the 100-dimensional vectors. Dark left columns show the rank frequencies of the SOPP model, light right columns present the rank frequencies of one of the initial models.

Although we assumed that the automatically extracted synonym pairs should be ranked closely together, looking at the average mean ranks in Table 2 reveals that this is not necessarily the case: the average rank of the best-performing SOPP model is over 31K. In order to understand those results better we looked at the rank histograms of the SOPP model and one of the initial models, shown in Figure 1. Although the first bin, covering the rank range from 1 to 10, contains the most words for both models and the number of synonym pairs falling into further rank bins decreases, the curve is not steep, and close to 100 words (87 in the case of SOPP and 94 in the case of the initial model) belong to the last bin counting ranks higher than 100000.

Looking at the farthest synonym pairs revealed that one word in these pairs is typically polysemous and its sense in the synonym pair is a relatively rarely used sense of this word, while there are other more common senses of this word with a completely different meaning. We give some examples of such synonym pairs:

• kaks (two) - puudulik (insufficient): the sense of this pair is the insufficient grade in high school, while the most common sense of the word kaks is the number two;

• ida (east) - ost (loan word from German, also meaning east): the most common sense of the word ost is purchase;

• rubla (rouble) - kull (bank note in slang): the most common sense of the word kull is hawk.

4.2 Analogy tests

Analogy tests are another common intrinsic method for evaluating word embeddings (Mikolov et al., 2013c). A famous and typical example of an analogy question is "a man is to a king like a woman is to a ?". The correct answer to this question is "queen".

For an analogy tuple a : b, x : y (a is to b as x is to y) the following is expected to hold in an embedding space:

    w_b - w_a + w_x \approx w_y,

where the vectors w are word embeddings. For the above example with "man", "king", "woman" and "queen" this would be computed as:

    w_{king} - w_{man} + w_{woman} \approx w_{queen}.

Given the vector representations of the three words in the analogy question, w_a, w_b and w_x, the goal is to maximize (Mikolov et al., 2013b)

    \cos(w_y, w_b - w_a + w_x)    (5)

over all words y in the vocabulary.

We used an Estonian analogy data set with 259 word quartets. Each quartet contains two pairs of words. The word pairs in the data set belong to three different groups where the two pairs contain either:

• a positive and a comparative adjective form, e.g. pime : pimedam, jõukas : jõukam (in English dark : darker, wealthy : wealthier);

• the nominative singular and plural forms of a noun, e.g. vajadus : vajadused, võistlus : võistlused (in English need : needs, competition : competitions);

• the lemma and the 3rd person past form of a verb, e.g. aitama : aitas, katsuma : katsus (in English help : helped, touch : touched).

We evaluate the results of the analogy test using prediction accuracy. A prediction is considered correct if and only if the vector w_y that maximizes (5) represents the word expected by the test case. We call this accuracy Hit@1. Hit@1 can be quite a noisy measurement, as there could be several word vectors in a very close range to each other competing for the highest rank. Therefore, we also compute Hit@10, which considers the prediction correct if the word expected by the test case is among the ten closest words. As a common practice, the question words represented by the vectors w_a, w_b and w_x were excluded from the set of possible predictions.
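A sketch of the Hit@k computation under the same assumptions as the earlier snippets (L2-normalised rows, hypothetical names):

```python
import numpy as np

def analogy_hits(emb, word2idx, quartets, k=10):
    """Fraction of quartets (a, b, x, y) for which y is among the k nearest words
    to w_b - w_a + w_x by cosine similarity, with the question words excluded."""
    hits = 0
    for a, b, x, y in quartets:
        ia, ib, ix, iy = (word2idx[w] for w in (a, b, x, y))
        target = emb[ib] - emb[ia] + emb[ix]
        target /= np.linalg.norm(target)
        sims = emb @ target                    # maximising this is equation (5)
        sims[[ia, ib, ix]] = -np.inf           # exclude the question words
        top_k = np.argpartition(-sims, k)[:k]  # indices of the k most similar words
        hits += int(iy in top_k)
    return hits / len(quartets)                # Hit@1 is obtained with k=1
```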

Hit@1
Dim | SOLS  | SOPP  | Mean W | Min W | Max W
50  | 0.058 | 0.193 | 0.144  | 0.124 | 0.170
100 | 0.116 | 0.255 | 0.185  | 0.170 | 0.197
150 | 0.085 | 0.278 | 0.198  | 0.170 | 0.228
200 | 0.066 | 0.290 | 0.197  | 0.178 | 0.224
250 | 0.093 | 0.282 | 0.200  | 0.181 | 0.224
300 | 0.069 | 0.286 | 0.197  | 0.162 | 0.228
Avg | 0.081 | 0.264 | 0.187

Hit@10
Dim | SOLS  | SOPP  | Mean W | Min W | Max W
50  | 0.158 | 0.390 | 0.329  | 0.305 | 0.347
100 | 0.239 | 0.475 | 0.388  | 0.371 | 0.409
150 | 0.224 | 0.502 | 0.398  | 0.378 | 0.417
200 | 0.205 | 0.541 | 0.408  | 0.390 | 0.425
250 | 0.193 | 0.517 | 0.406  | 0.394 | 0.421
300 | 0.212 | 0.533 | 0.401  | 0.359 | 0.440
Avg | 0.205 | 0.493 | 0.388

Table 3: Hit@1 and Hit@10 accuracies of the analogy test. The SOLS and SOPP columns show the accuracies of the combined models. Mean W, Min W and Max W show the mean, minimum and maximum accuracies of the initial models W_i, respectively. The best accuracy among the combined models and the mean of the initial models is given in bold. The last row shows the average accuracies over all embedding sizes.

The Hit@1 and Hit@10 results in Table 3 show similar dynamics: combining models with SOPP is much better than with SOLS in all cases. The SOPP combined model is better than the mean of the initial models in all six cases. Furthermore, it is consistently above the maximum accuracy of the initial models. The average accuracy of SOPP is better than the average of the mean accuracies of the initial models by 41% relative (7.7% absolute) in terms of Hit@1 and by 27% relative (10.5% absolute) in terms of Hit@10.

5 Analysis

In order to gain more understanding of how the words are located in the combined model space in comparison to the initial models, we performed two additional analyses. First, we computed the distribution of mean squared errors of the words to see how the translated embeddings scatter around the word embedding of the combined model. Secondly, we looked at how both of the methods affect the pairwise similarities of words.

5.1 Distribution of mean squared distances

We computed the squared Euclidean distance for each word in the vocabulary between the combined model Y and all the input embedding models. The distance d_ij for the jth word and the ith input model is:

    d_{ij} = \| Y_j - T_{ij} \|^2,

where T_i = W_i P_i is the ith translated embedding model. Then we found the mean squared distance for the jth word by calculating:

    d_j = \frac{1}{r} \sum_{i=1}^{r} d_{ij}.
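A short sketch of this per-word statistic, reusing the outputs of the combination functions above (illustrative names):

```python
import numpy as np

def mean_squared_distances(Y, models, P):
    """d_j = (1/r) * sum_i ||Y_j - (W_i P_i)_j||^2 for every word j."""
    translated = [W @ Pi for W, Pi in zip(models, P)]              # T_i = W_i P_i
    sq_dists = [np.sum((Y - T) ** 2, axis=1) for T in translated]  # d_ij for every word
    return np.mean(sq_dists, axis=0)                               # average over the r models
```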

Figure 2: Mean squared distances describing the scattering of the translated word embeddings around the combined model embedding for every word in the vocabulary. The words on the horizontal axis are ordered by frequency, with the most frequent words plotted first.

Figure 3: Mean squared distances describing the scattering of the translated word embeddings around the combined model embedding for a random sample of 1000 words. The words on the horizontal axis are ordered by frequency, with the most frequent words plotted first.

These distances are plotted in Figure 2. The words on the horizontal axis are ordered by their frequency, the most frequent words coming first. We show these results for models with 100 dimensions, but the results with other embedding sizes were similar.

Notice that the distances for less frequent words are similarly small for both the SOLS and SOPP methods. However, the distribution of distances for frequent words is quite different: while the distances go up with both methods, the frequent words are much more scattered when using the SOPP approach.

Figure 3 shows the mean squared distances of a random sample of 1000 words. These plots reveal another difference between the SOLS and SOPP methods. While for SOPP the distances tend to decrease monotonically with the increase in word frequency rank, with SOLS the distances first increase and only then start to get smaller.

Our vocabulary also includes punctuation marks and function words, which are among the most frequent tokens and which occur in many different contexts. Thus, the individual models have a lot of freedom to position them in the word embedding space. The SOLS combined model is able to bring those words closer to each other in the aligned space, while SOPP has less freedom to do that because of the orthogonality constraint. When looking at the words with the largest distances under SOPP in the 1000-word random sample, we see that the word with the highest mean squared distance refers to the proper name of a well-known Estonian politician who has probably been mentioned often and in various contexts in the training corpus. Other words with a large distance in this sample include, for instance, the name of a month and a few quantifying modifiers.

5.2 Word pair similarities

In this analysis we looked at how the cosine similarities between pairs of words change in the combined model compared to their similarities in the input embedding models. For that, we chose a total of 1000 word pairs randomly from the vocabulary. For each pair we calculated the following values (see the sketch after this list):

• the cosine similarity under the combined model;

• the maximum and minimum cosine similarity in the initial models W_i;

• the mean cosine similarity over the initial models W_i.
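A sketch of this comparison, assuming L2-normalised rows and a list of random index pairs (names are illustrative):

```python
import numpy as np

def pair_similarity_stats(Y, models, pairs):
    """For each word-index pair: cosine similarity in the combined model Y and the
    min/mean/max cosine similarity over the initial models W_i (rows L2-normalised)."""
    stats = []
    for i, j in pairs:
        combined = float(Y[i] @ Y[j])
        initial = [float(W[i] @ W[j]) for W in models]
        stats.append((combined, min(initial), float(np.mean(initial)), max(initial)))
    return stats
```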

Figure 4: Cosine similarities of 1000 randomly chosen word pairs ordered by their similarity in the combined model Y. Red, blue and green bands represent the maximum, mean and minimum similarities in the initial models, respectively.

The results are plotted in Figure 4. These results are obtained using the word embeddings of size 100; using different embedding sizes revealed the same patterns. In the figures, the word pairs are ordered on the horizontal axis in the ascending order of their similarities in the combined model Y.

The plots reveal that 1) the words that are similar in the initial models W_i are even more similar in the combined model Y; and 2) distant words in the initial models become even more distant in the combined model. Although these trends are visible for both SOLS and SOPP, this behaviour of the combined models, bringing more similar words closer together and placing less similar words farther away, is more emphasized in the combined model obtained with SOLS.

In Figure 4, the red, green and blue "bands", representing the maximum, mean and minimum similarities of the initial models, respectively, are wider on the SOLS plot. This indicates that SOPP better preserves the original order of word pairs in terms of their similarities. However, some of this difference may be explained by the fact that SOPP has an overall smaller effect on the similarity compared to SOLS, which is due to the property of SOPP to preserve the angles and distances between the vectors during the transformation.

6 Discussion and future work

Of the two linear methods used to combine the models, SOPP performed consistently better in both synonym and analogy tests. Although, as shown in Figures 2 and 3, the word embeddings of the aligned initial models were more closely clustered around the embeddings of the SOLS combined model, this seemingly better fit is obtained at the cost of distorting the relations between the individual word embeddings. Thus, we have provided evidence that adding the orthogonality constraint to the linear transformation objective is important to retain the quality of the translated word embeddings. This observation is relevant both in the context of producing model ensembles and in other contexts where translating one embedding space to another could be relevant, such as when working with semantic time series or multilingual embeddings.

In addition to combining several models trained on the same dataset with the same configuration as demonstrated in this paper, there are other possible use cases for the model ensembles which could be explored in future work. For instance, currently all our input models had the same dimensionality and the same embedding size was also used in the combined model. In the future it would be interesting to experiment with combining models with different dimensionality, in this way marginalising out the embedding size hyperparameter.

Our experiments showed that the SOPP approach performs well in both synonym and analogy tests when combining the models trained on the relatively small Estonian corpus. In the future we plan to conduct similar experiments on more languages that, similar to Estonian, have limited resources for training reliable word embeddings.

Another idea would be to combine embeddings trained with different models. As all word embedding systems learn slightly different embeddings, combining for instance Word2Vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and dependency-based vectors (Levy and Goldberg, 2014) could lead to a model that combines the strengths of all the input models. Yin and Schütze (2016) demonstrated that the combination of different word embeddings can be useful. However, their results showed that the model combination is less beneficial when one of the input models (GloVe vectors in their example) is trained on a huge text corpus. Thus, we predict that an ensemble of word embeddings constructed from different embedding models also has the most effect in the setting of limited training resources.

Finally, it would be interesting to explore a domain adaptation approach by combining, for instance, the embeddings learned from a large general-domain corpus with the embeddings trained on a smaller domain-specific corpus. This could be of interest because there are many pretrained word embedding sets available for English that can be freely downloaded from the internet, while the corpora they were trained on (English Gigaword, for instance) are not freely available. The model combination approach would make it possible to adapt those embeddings to the domain data by making use of the pretrained models.

7 Conclusions

Although model ensembles have often been used to improve the results of various natural language processing tasks, ensembles of word embedding models have rarely been studied so far. Our main contribution in this paper was to combine several word embedding models trained on the same dataset via linear transformation into an ensemble and to demonstrate the usefulness of this approach experimentally.

We experimented with two linear methods to combine the input embedding models: the ordinary least squares solution to the linear regression problem and the orthogonal Procrustes, which adds an orthogonality constraint to the least squares objective function. Experiments on synonym and analogy tests on Estonian showed that the combination with orthogonal Procrustes was consistently better than the ordinary least squares, meaning that preserving the distances and angles between vectors with the orthogonality constraint is crucial for model combination. Also, the orthogonal Procrustes combined model performed better than the average of the individual initial models in all synonym and analogy tests, suggesting that combining several embedding models is a simple and useful approach for improving the quality of the word embeddings.

Acknowledgments

We thank Alexander Tkachenko for providing the pretrained input models and the analogy test questions. We also thank the anonymous reviewers for their helpful suggestions and comments.

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Tomas Mikolov, Scott Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Aditya Mogadala and Achim Rettinger. 2016. Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 692–702.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307.

Peter H. Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394.

Wenpeng Yin and Hinrich Schütze. 2016. Learning word meta-embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1351–1360.
