An Exploration of Word Embedding Initialization in Deep-Learning Tasks
Tom Kocmi and Ondřej Bojar
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
surname@ufal.mff.cuni.cz

Abstract

Word embeddings are the interface between the world of discrete units of text processing and the continuous, differentiable world of neural networks. In this work, we examine various random and pretrained initialization methods for embeddings used in deep networks and their effect on performance on four NLP tasks with both recurrent and convolutional architectures. We confirm that pretrained embeddings are a little better than random initialization, especially considering the speed of learning. On the other hand, we do not see any significant difference between various methods of random initialization, as long as the variance is kept reasonably low. High-variance initialization prevents the network from using the space of embeddings and forces it to use other free parameters to accomplish the task. We support this hypothesis by observing the performance in learning lexical relations and by the fact that the network can learn to perform reasonably in its task even with fixed random embeddings.

1 Introduction

Embeddings or lookup tables (Bengio et al., 2003) are used for units of different granularity, from characters (Lee et al., 2016) to subword units (Sennrich et al., 2016; Wu et al., 2016) up to words. In this paper, we focus solely on word embeddings (embeddings attached to individual token types in the text). In their high-dimensional vector space, word embeddings are capable of representing many aspects of similarity between words: semantic relations or morphological properties (Mikolov et al., 2013; Kocmi and Bojar, 2016) in one language or cross-lingually (Luong et al., 2015).

Embeddings are trained for a task. In other words, the vectors that embeddings assign to each word type are almost never provided manually but are always discovered automatically in a neural network trained to carry out a particular task. The best-known embeddings are those by Mikolov et al. (2013), where the task is to predict a word from its neighboring words (CBOW) or the neighbors from a given word (Skip-gram). Trained on a huge corpus, these "Word2Vec" embeddings show an interesting correspondence between lexical relations and arithmetic operations in the vector space. The most famous example is the following:

v(king) − v(man) + v(woman) ≈ v(queen)

In other words, adding the vectors associated with the words 'king' and 'woman' while subtracting 'man' should be equal to the vector associated with the word 'queen'. We can also say that the difference vectors v(king) − v(queen) and v(man) − v(woman) are almost identical and describe the gender relationship.
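As a concrete illustration of this arithmetic (not part of the paper's experiments), the following Python sketch loads the publicly released Word2Vec vectors with gensim and checks how close v(king) − v(man) + v(woman) lands to v(queen). The use of gensim and the local file path are assumptions made only for this example.

    # Sketch: analogy arithmetic on pretrained Word2Vec vectors.
    # Assumes gensim is installed and the Google News vectors have been
    # downloaded locally; neither is part of the paper's own setup.
    import numpy as np
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)  # hypothetical local path

    # v(king) - v(man) + v(woman) should land close to v(queen).
    target = kv["king"] - kv["man"] + kv["woman"]
    queen = kv["queen"]
    cosine = float(np.dot(target, queen) /
                   (np.linalg.norm(target) * np.linalg.norm(queen)))
    print("cosine to v(queen):", cosine)

    # gensim's built-in analogy query performs the same search over the
    # whole vocabulary and returns the nearest word.
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))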
Word2Vec is not trained with the goal of properly representing such relationships; therefore, the absolute accuracy scores of around 50% do not allow us to rely on these relation predictions. Still, it is a rather interesting property observed empirically in the learned space. Another extensive study of the embedding space has been conducted by Hill et al. (2017).

Word2Vec embeddings as well as GloVe embeddings (Pennington et al., 2014) became very popular and were tested in many tasks, also because for English they can be simply downloaded as pretrained on huge corpora. Word2Vec was trained on the 100-billion-word Google News dataset¹ and GloVe embeddings were trained on 6 billion words from Wikipedia. Sometimes they are used as a fixed mapping for better robustness of the system (Kenter and De Rijke, 2015), but more often they are used to seed the embeddings in a system and are then further trained in the particular end-to-end application (Collobert et al., 2011; Lample et al., 2016).

¹ See https://code.google.com/archive/p/word2vec/.

In practice, random initialization of embeddings is still more common than using pretrained embeddings, and it should be noted that pretrained embeddings are not always better than random initialization (Dhingra et al., 2017).

We are not aware of any study of the effects of various random embedding initializations on the training performance.

In the first part of the paper, we explore various English word embedding initializations in four tasks: neural machine translation (denoted MT in the following for short), language modeling (LM), part-of-speech tagging (TAG) and lemmatization (LEM), covering both common styles of neural architectures: recurrent and convolutional neural networks (RNN and CNN, respectively).

In the second part, we explore the obtained embedding spaces in an attempt to better understand what the networks have learned about word relations.

2 Embeddings initialization

Given a vocabulary V of words, embeddings represent each word as a dense vector of size d (as opposed to the "one-hot" representation, where each word would be represented as a sparse vector of size |V| with all zeros except for one element indicating the given word). Formally, embeddings are stored in a matrix E ∈ R^{|V|×d}. For a given word type w ∈ V, a row is selected from E; thus, E is often referred to as the word lookup table. The size of embeddings d is often set between 100 and 1000 (Bahdanau et al., 2014; Vaswani et al., 2017; Gehring et al., 2017).

2.1 Initialization methods

Many different methods can be used to initialize the values in E at the beginning of neural network training. We distinguish between randomly initialized and pretrained embeddings, where the latter can be further divided into embeddings pretrained on the same task and embeddings pretrained on a standard task such as Word2Vec or GloVe.

Random initialization methods common in the literature² sample values either uniformly from a fixed interval centered at zero or, more often, from a zero-mean normal distribution with the standard deviation varying from 0.001 to 10.

² Aside from related NN task papers such as Bahdanau et al. (2014) or Gehring et al. (2017), we also checked several popular neural network frameworks (TensorFlow, Theano, Torch, ...) to collect various initialization parameters.

The parameters of the distribution can be set empirically or calculated based on some assumptions about the training of the network. The second approach has been used for various hidden-layer initializations (i.e. not the embedding layer). For example, Glorot and Bengio (2010) and He et al. (2015) argue that sustaining the variance of values throughout the whole network leads to the best results, and they define the initialization parameters so that the layer weights W produce outputs with the same variance as their inputs.

Glorot and Bengio (2010) define the "Xavier" initialization method. They assume a linear neural network for which they derive the weight initialization as

W ∼ U(−√6 / √(n_i + n_o), √6 / √(n_i + n_o))   (1)

where n_i is the size of the input and n_o is the size of the output. The initialization for nonlinear networks using ReLU units has been derived similarly by He et al. (2015) as

W ∼ N(0, 2/n_i)   (2)

The same assumption of sustaining variance cannot be applied to embeddings, because there is no input signal whose variance should be sustained to the output. We nevertheless try these initializations as well, denoting them Xavier and He, respectively.
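For illustration (not the paper's own code), the following numpy sketch applies these initialization schemes directly to an embedding matrix E of shape |V| × d and performs a row lookup. The vocabulary size, dimension, and standard deviations are placeholder values, and treating the lookup as a |V| → d layer for the Xavier bounds is only one possible interpretation.

    # Sketch: random initialization schemes for an embedding matrix E.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim = 50_000, 300        # |V| and d (illustrative values)

    # Zero-mean normal initialization with a chosen standard deviation.
    E_normal = rng.normal(0.0, 0.01, size=(vocab_size, dim))

    # Uniform initialization from a fixed interval centered at zero.
    E_uniform = rng.uniform(-0.1, 0.1, size=(vocab_size, dim))

    # "Xavier" initialization, Eq. (1); n_i and n_o are not well defined for
    # an embedding layer, here the lookup is treated as a |V| -> d layer.
    n_i, n_o = vocab_size, dim
    bound = np.sqrt(6.0) / np.sqrt(n_i + n_o)
    E_xavier = rng.uniform(-bound, bound, size=(vocab_size, dim))

    # "He" initialization, Eq. (2); 2/n_i is the variance, so the standard
    # deviation is sqrt(2/n_i).
    E_he = rng.normal(0.0, np.sqrt(2.0 / n_i), size=(vocab_size, dim))

    # Embedding lookup: the row of E corresponding to word index w.
    w = 42                               # index of some word type in V
    embedding_of_w = E_normal[w]         # dense vector of size d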
2.2 Pretrained embeddings

Pretrained embeddings, as opposed to random initialization, could work better because they already contain some information about word relations.

To obtain pretrained embeddings, we can train a model initialized randomly from the normal distribution with a standard deviation of 0.01, extract the embeddings from the final model, and use them as pretrained embeddings for subsequent trainings on the same task. Such embeddings contain information useful for the task in question and we refer to them as self-pretrain.

A more common approach is to download some ready-made "generic" embeddings such as Word2Vec and GloVe, which are not directly related to the final task but have been shown to contain many morpho-syntactic relations between words (Mikolov et al., 2013; Kocmi and Bojar, 2016). Those embeddings are trained on billions of monolingual examples and can be easily reused in most existing neural architectures.

3 Experimental setup

This section describes the neural models we use for our four tasks and the training and testing datasets.

3.1 Models description

For all our four tasks (MT, LM, TAG, and LEM), [...]

[...] are no pretrained embeddings available for Czech.

The goal of the language model (LM) is to predict the next word based on the history of previous words. Language modeling can thus be seen as (neural) machine translation without the encoder part: no source sentence is given to translate and we only predict words conditioned on the previous word and the state computed from the predicted words. Therefore, the parameters of the neural network are the same as for the MT decoder. The only difference is that we use dropout with a keep probability of 0.7 (Srivastava et al., 2014). The generated sentence is evaluated as the perplexity against the gold output words (English in our case).

The third task is POS tagging (TAG). We use our custom network architecture: the model starts with a bidirectional encoder as in MT. For each encoder state, a fully connected linear layer then predicts a tag. The parameters are set to be equal to the encoder in MT; the predicting layer has a size equal to the number of tags.
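To make the tagging setup concrete, here is a minimal PyTorch sketch of such an architecture: an embedding lookup, a bidirectional recurrent encoder, and a linear layer predicting a tag for every encoder state. The framework choice, the GRU cell, and all layer sizes are illustrative assumptions, not the paper's exact configuration.

    # Sketch: bidirectional encoder with a per-state linear tag predictor.
    import torch
    import torch.nn as nn

    class Tagger(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim, n_tags):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.GRU(emb_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)
            # One tag prediction per token, computed from the concatenated
            # forward and backward encoder states.
            self.tag_layer = nn.Linear(2 * hidden_dim, n_tags)

        def forward(self, word_ids):             # (batch, seq_len)
            embedded = self.embedding(word_ids)  # (batch, seq_len, emb_dim)
            states, _ = self.encoder(embedded)   # (batch, seq_len, 2*hidden_dim)
            return self.tag_layer(states)        # (batch, seq_len, n_tags)

    # Example: tag logits for a batch of two 5-token sentences.
    model = Tagger(vocab_size=50_000, emb_dim=300, hidden_dim=512, n_tags=42)
    logits = model(torch.randint(0, 50_000, (2, 5)))
    print(logits.shape)  # torch.Size([2, 5, 42])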