
Two/Too Simple Adaptations of Word2Vec for Syntax Problems

Wang Ling, Chris Dyer, Alan Black, Isabel Trancoso
L2F Spoken Systems Lab, INESC-ID, Lisbon, Portugal
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Instituto Superior Técnico, Lisbon, Portugal
{lingwang,cdyer,awb}@cs.cmu.edu, [email protected]

Abstract

We present two simple modifications to the models in the popular Word2Vec tool, in order to generate embeddings more suited to tasks involving syntax. The main issue with the original models is the fact that they are insensitive to word order. While order independence is useful for inducing semantic representations, it leads to suboptimal results when these models are used to solve syntax-based problems. We show improvements in part-of-speech tagging and dependency parsing using our proposed models.

1 Introduction

Word representations learned from neural language models have been shown to improve many NLP tasks, such as part-of-speech tagging (Collobert et al., 2011), dependency parsing (Chen and Manning, 2014; Kong et al., 2014) and machine translation (Liu et al., 2014; Kalchbrenner and Blunsom, 2013; Devlin et al., 2014; Sutskever et al., 2014). These low-dimensional representations are learned as parameters in a language model and trained to maximize the likelihood of a large corpus of raw text. They are then incorporated as features alongside hand-engineered features (Turian et al., 2010), or used to initialize the parameters of neural networks targeting tasks for which substantially less training data is available (Hinton and Salakhutdinov, 2012; Erhan et al., 2010; Guo et al., 2014).

One of the most widely used tools for building word vectors are the models described in (Mikolov et al., 2013), implemented in the Word2Vec tool, in particular the "skip-gram" and the "continuous bag-of-words" (CBOW) models. These two models make different independence and conditioning assumptions; however, both discard word order information in how they account for context. Thus, embeddings built using these models have been shown to capture semantic similarities between words, and pre-training with them has been shown to lead to major improvements in many tasks (Collobert et al., 2011). While more sophisticated approaches have been proposed (Dhillon et al., 2011; Huang et al., 2012; Faruqui and Dyer, 2014; Levy and Goldberg, 2014; Yang and Eisenstein, 2015), Word2Vec remains a popular choice due to its efficiency and simplicity.

However, as these models are insensitive to word order, embeddings built using them are suboptimal for tasks involving syntax, such as part-of-speech tagging or dependency parsing. This is because syntax defines "what words go where?", while semantics defines "what words go together". Obviously, in a model where word order is discarded, the many syntactic relations between words cannot be captured properly. For instance, while most words occur with the word the, only nouns tend to occur exactly afterwards (e.g. the cat). This is supported by empirical evidence that order-insensitivity does indeed lead to substandard syntactic representations (Andreas and Klein, 2014; Bansal et al., 2014): systems using embeddings pre-trained with Word2Vec yield slight improvements, while the computationally far more expensive embeddings of Collobert et al. (2011), which use word order information, yield much better results.

In this work, we describe two simple modifications to Word2Vec, one for the skip-gram model and one for the CBOW model, that improve the quality of the embeddings for syntax-based tasks.1 Our goal is to improve the final embeddings while maintaining the simplicity and efficiency of the original models. We demonstrate the effectiveness of our approaches by training, on commodity hardware, on datasets containing more than 50 million sentences and over 1 billion words in less than a day, and show that our methods lead to improvements when used in state-of-the-art neural network systems for part-of-speech tagging and dependency parsing, relative to the original models.

1 The code developed in this work is made available at https://github.com/wlin12/wang2vec.

2 Word2Vec

The work in (Mikolov et al., 2013) is a popular choice for pre-training the projection matrix W ∈ ℝ^{d_w × |V|}, where d_w is the embedding dimension and V the word vocabulary, and it defines two models: skip-gram and CBOW.

The skip-gram model trains embeddings by using each word w_t in a corpus of length T to predict its surrounding context words, maximizing the average log probability

L = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (1)

where c is a hyperparameter defining the window of context words. To obtain the output probability p(w_o | w_i), the model estimates a matrix O ∈ ℝ^{|V| × d_w}, which maps the embedding r_{w_i} into a |V|-dimensional vector o_{w_i}. Then, the probability of predicting the word w_o given the word w_i is defined as

p(w_o \mid w_i) = \frac{e^{o_{w_i}(w_o)}}{\sum_{w \in V} e^{o_{w_i}(w)}}    (2)

This is referred to as the softmax objective. However, for larger vocabularies it is inefficient to compute o_{w_i}, since this requires a |V| × d_w matrix-vector multiplication for every prediction. Solutions to this problem are addressed in Word2Vec by using the hierarchical softmax objective function or by resorting to negative sampling (Goldberg and Levy, 2014).

The CBOW model predicts the center word w_0 given a representation of the surrounding words w_{-c}, ..., w_{-1}, w_1, ..., w_c. Thus, the output vector o_{w_{-c},...,w_{-1},w_1,...,w_c} is obtained from the product of the matrix O ∈ ℝ^{|V| × d_w} with the sum of the embeddings of the context words, \sum_{-c \le j \le c, j \ne 0} r_{w_j}.

We can observe that in both methods the order of the context words does not influence the prediction output. As such, while these methods may find similar representations for semantically similar words, they are less likely to learn representations that reflect the syntactic properties of the words.
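To make the two objectives above concrete, the following is a minimal NumPy sketch of the full-softmax formulation in Equations 1 and 2. The released tool is written in C and, as noted above, replaces the full softmax with hierarchical softmax or negative sampling; the vocabulary size, dimensions and random initialization below are toy values chosen purely for illustration.

```python
import numpy as np

# Toy sizes; the real tool uses vocabularies with hundreds of thousands of types.
V, d_w = 1000, 50
rng = np.random.default_rng(0)

R = rng.normal(scale=0.1, size=(V, d_w))   # input embeddings r_w
O = rng.normal(scale=0.1, size=(V, d_w))   # output matrix O

def skipgram_probs(center_id):
    """p(. | w_i) from Eq. 2: project r_{w_i} with O, then softmax over the vocabulary."""
    scores = O @ R[center_id]              # o_{w_i}, a |V|-dimensional vector
    scores -= scores.max()                 # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def cbow_probs(context_ids):
    """CBOW: the same projection applied to the *sum* of the context embeddings."""
    h = R[context_ids].sum(axis=0)         # order of context_ids is irrelevant here
    scores = O @ h
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

p = skipgram_probs(3)
assert abs(p.sum() - 1.0) < 1e-6
```

The key point is that cbow_probs collapses the context into an order-insensitive sum, which is exactly the property the next section removes.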

3 Structured Word2Vec

To account for the lack of order-dependence in the above models, we propose two simple modifications to these methods that include ordering information, which we expect will lead to more syntactically-oriented embeddings. These models are illustrated in Figure 2.

3.1 Structured Skip-gram Model

The skip-gram model uses a single output matrix O ∈ ℝ^{|V| × d} to predict every contextual word w_{-c}, ..., w_{-1}, w_1, ..., w_c, given the embedding of the center word w_0. Our approach adapts the model so that it is sensitive to the positioning of the words. It defines a set of c × 2 output predictors O_{-c}, ..., O_{-1}, O_1, ..., O_c, each of size ℝ^{|V| × d}. Each of the output matrices is dedicated to predicting the output for a specific relative position to the center word. When making a prediction p(w_o | w_i), we select the appropriate output matrix O_{o-i} to project the word embedding to the output vector. Note that the number of operations that must be performed for the forward and backward passes in the network remains the same, as we are simply switching the output matrix O for each different word index.
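As an illustration of this change, the sketch below keeps one output matrix per relative position and selects it at prediction time. It is a schematic NumPy rendering of the idea under the same toy dimensions as before, not code from the wang2vec release.

```python
import numpy as np

V, d, c = 1000, 50, 5
rng = np.random.default_rng(1)

R = rng.normal(scale=0.1, size=(V, d))
# One output matrix per relative position: -c..-1 and 1..c (2c matrices in total).
O_pos = {j: rng.normal(scale=0.1, size=(V, d)) for j in range(-c, c + 1) if j != 0}

def structured_skipgram_probs(center_id, relative_position):
    """Predict the word at a given offset from the center word, using the
    output matrix dedicated to that offset (O_{o-i} in the paper's notation)."""
    scores = O_pos[relative_position] @ R[center_id]
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

# e.g. distribution over the word immediately to the right of word id 3
p_next = structured_skipgram_probs(3, relative_position=1)
```

Since only one of the 2c matrices is used per prediction, the cost of a single forward/backward pass is unchanged, as stated above; only the total number of output parameters grows.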

Figure 2: Illustration of the Structured Skip-gram and Continuous Window (CWindow) models.

3.2 Continuous Window Model

The Continuous Bag-Of-Words model defines a window of words w_{-c}, ..., w_c with size c, where the prediction of the center word w_0 is conditioned on the remaining words w_{-c}, ..., w_{-1}, w_1, ..., w_c. The prediction matrix O ∈ ℝ^{|V| × d} is fed with the sum of the embeddings of the context words. As such, the order of the contextual words does not influence the prediction of the center word. Our approach defines a different output predictor O ∈ ℝ^{|V| × (2c × d)}, which receives as input a (2c × d)-dimensional vector that is the concatenation of the embeddings of the context words in the order they occur, [e(w_{-c}), ..., e(w_{-1}), e(w_1), ..., e(w_c)]. As matrix O defines a separate block of parameters for each relative position, this allows words to be treated differently depending on where they occur. This model, denoted CWindow, is essentially the window-based model described in (Collobert et al., 2011), with the exception that we do not project the vector of word embeddings into a window embedding before making the final prediction.

In both models, we are increasing the number of parameters of matrix O by a factor of c × 2, which can lead to sparsity problems when training on small datasets. However, these models are generally trained on datasets on the order of 100 millions of words, where these issues are not as severe.
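A corresponding sketch of the CWindow prediction is shown below: the context embeddings are concatenated rather than summed, so each position owns a distinct slice of the output matrix's columns. Again, this is an illustrative NumPy rendering with toy sizes, not the authors' C implementation.

```python
import numpy as np

V, d, c = 1000, 50, 5
rng = np.random.default_rng(2)

R = rng.normal(scale=0.1, size=(V, d))
# CWindow output matrix: one block of columns per context position.
O = rng.normal(scale=0.1, size=(V, 2 * c * d))

def cwindow_probs(context_ids):
    """Predict the center word from the *concatenation* of the 2c context
    embeddings, so swapping two context words changes the prediction."""
    assert len(context_ids) == 2 * c
    h = np.concatenate([R[i] for i in context_ids])   # (2c*d,)-dimensional input
    scores = O @ h
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()
```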
4 Experiments

We conducted experiments on two mainstream syntax-based tasks: part-of-speech tagging and dependency parsing. Part-of-speech tagging is a word labeling task, where each word is to be labelled with its corresponding part-of-speech. In dependency parsing, the goal is to predict a tree of syntactic relations between words. In both tasks, it has been shown that pre-trained embeddings can be used to achieve better generalization (Collobert et al., 2011; Chen and Manning, 2014).

4.1 Building Word Vectors

We built vectors for English in two very different domains. Firstly, we used an English Wikipedia dump containing 1,897 million words (60 million sentences), collected in September of 2014. We built word embeddings using the original and our proposed methods on this dataset. These embeddings will be denoted as WIKI(L). Then, we took a sample of 56 million English tweets with 847 million words collected in (Owoputi et al., 2013), and applied the same procedure to build the TWITTER embeddings. Finally, we also use the Wikipedia documents, with 16 million words, provided in the Word2Vec package for contrastive purposes, denoted as WIKI(S). As preprocessing, the text was lowercased and groups of contiguous digits were replaced by a special word. For all corpora, we trained the networks with a window size c = 5 and a negative sampling value of 10, and filtered out words with fewer than 40 instances. WIKI(L) and TWITTER have vocabularies of 424,882 and 216,871 types, respectively, and embeddings of 50 dimensions.

Embeddings                         WIKI(S): breaking   TWITTER: amazing   WIKI(L): person
CBOW                               breaks              incredible         someone
                                   turning             awesome            anyone
                                   broke               fantastic          oneself
                                   break               phenomenal         woman
                                   stumbled            awsome             if
Skip-gram                          break               incredible         harasser
                                   breaks              awesome            themself
                                   broke               fantastic          declarant
                                   down                phenominal         someone
                                   broken              phenomenal         right-thinking
CWindow (this work)                putting             incredible         woman
                                   turning             amaaazing          man
                                   sticking            awesome            child
                                   pulling             amzing             grandparent
                                   picking             a-mazing           servicemember
Structured Skip-gram (this work)   break               incredible         declarant
                                   turning             awesome            circumstance
                                   putting             amaaazing          woman
                                   out                 ah-mazing          schoolchild
                                   breaks              amzing             someone

Table 1: Most similar words using different word embedding models for the words breaking, amazing and person. Each word is queried in a different dataset.

Table 1 shows the most similar words for a few selected keywords under each of the different embeddings. We can see hints that our models tend to group words that are more syntactically related. In fact, for the word breaking, the CWindow model's top five words are exclusively composed of verbs in the continuous form, while the Structured Skip-gram model tends to combine these with other forms of the verb break. The original models tend to be less keen on preserving such properties. As for the TWITTER embeddings, we can observe that our adapted embeddings are much better at finding lexical variations of the words, such as a-mazing, resembling the results obtained using Brown clusters (Owoputi et al., 2013). Finally, for the query person, we can see that our models tend to associate this term with other words in the same class, such as man, woman and child, while the original models tend to include unrelated words, such as if and right-thinking.

In terms of computation speed, the Skip-gram and CBOW models achieve a processing rate of 71.35k and 342.17k words per second, respectively. The Structured Skip-gram and CWindow models can process 34.64k and 124.43k words per second, respectively. There is a large drop in computational speed in the CWindow model compared to the CBOW model, as it uses a larger output matrix, which grows with the size of the window. The Structured Skip-gram model processes words at almost half the speed of the Skip-gram model. This is explained by the fact that the Skip-gram model sub-samples context words, varying the window size stochastically so that words closer to the center word are sampled more frequently. That is, when defining a window size of 5, the actual window size used for each sample is a random value between 1 and 5. As we use a separate output layer for each position, we did not find this property useful, as it provides fewer training samples for the output matrices with higher indexes. While our models are slower, they are still suitable for processing large datasets, as all the embeddings we used were built within a day.
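The dynamic-window behaviour described above can be summarized in a few lines. The sketch below mimics Word2Vec's stochastic sub-sampling of context positions; the function name and the counting loop are ours, added only for illustration.

```python
import random

def sample_context_offsets(max_window, rng):
    """Word2Vec-style dynamic window: draw an effective window size b in
    [1, max_window] uniformly, so offsets close to the center word appear
    in far more training samples than distant ones."""
    b = rng.randint(1, max_window)
    return [j for j in range(-b, b + 1) if j != 0]

rng = random.Random(0)
counts = {j: 0 for j in range(1, 6)}
for _ in range(10000):
    for j in sample_context_offsets(5, rng):
        if j > 0:
            counts[j] += 1
print(counts)  # roughly {1: 10000, 2: 8000, 3: 6000, 4: 4000, 5: 2000}
```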
4.2 Part-Of-Speech Tagging

We reimplemented the window-based model proposed in (Collobert et al., 2011), which defines a 3-layer neural network. In this network, words are first projected into embeddings, which are concatenated and projected into a window embedding. This is finally projected into an output layer with the size of the POS tag vocabulary, followed by a softmax. In our experiments, we used a window size of 5, word embeddings of size 50 and window embeddings of size 500. Word embeddings were initialized using the pre-trained vectors, and these parameters are updated along with the rest of the network. Additionally, we add a capitalization feature which indicates whether the first letter of the word is uppercased, as all word features are lowercased. Finally, words unseen in the training set and in the pre-trained embeddings are replaced with a special unknown token, which is also modelled as a word type with its own set of 50 parameters. At training time, we stochastically replace word types that only occur once in the training dataset with the unknown token. Evaluation is performed with the part-of-speech tag accuracy, which denotes the percentage of words labelled correctly.

Experiments are performed on two datasets: the English Penn Treebank (PTB) dataset, using the standard train, dev and test splits, and the ARK dataset (Gimpel et al., 2011), with 1000 training, 327 dev and 500 test labelled English tweets from Twitter. For the PTB dataset we use the WIKI(L) embeddings, and we use the TWITTER embeddings for the ARK dataset. Finally, the set of parameters with the highest accuracy on the dev set is used to report the score for the test set.
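For concreteness, the following is a rough sketch of the forward pass of the window-based tagger described above (concatenated word embeddings, a window embedding of size 500, and a softmax over the tag vocabulary). The tanh nonlinearity, bias terms, tag-set size and the omission of the capitalization feature are simplifying assumptions of ours, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d, win, hidden, n_tags = 1000, 50, 5, 500, 45   # 45 is roughly the PTB tag-set size

E  = rng.normal(scale=0.1, size=(V, d))            # word embeddings (pre-trained, then fine-tuned)
W1 = rng.normal(scale=0.1, size=(hidden, win * d)) # projection into the window embedding
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_tags, hidden))  # output layer over POS tags
b2 = np.zeros(n_tags)

def tag_probs(window_word_ids):
    """Forward pass predicting the tag of the center word of a 5-word window."""
    x = np.concatenate([E[i] for i in window_word_ids])   # (win*d,)
    h = np.tanh(W1 @ x + b1)                               # window embedding
    scores = W2 @ h + b2
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

p = tag_probs([1, 2, 3, 4, 5])
print(p.argmax())   # predicted tag index for the center word
```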

Results are shown in Table 2, where we observe that our adapted models tend to yield better results than the original models on both datasets. On the Twitter dataset, our results are slightly higher than the accuracy reported using only Brown clusters in (Owoputi et al., 2013), which was 89.50. We also try initializing our embeddings with those of (Collobert et al., 2011), shown in the "SENNA" row. Even though results are higher with our models, we cannot conclude that our method is better, as the embeddings are trained on crawls of Wikipedia from different time periods. However, it is a good reference to show that our embeddings are on par with those learned using more sophisticated models.

                       PTB Dev   PTB Test   Twitter Dev   Twitter Test
CBOW                   95.89     96.13      87.85         87.54
Skip-gram              96.62     96.68      88.84         88.73
CWindow                96.99     97.01      89.72         89.63
Structured Skip-gram   96.62     97.05      89.69         89.79
SENNA                  96.54     96.58      84.96         84.85

Table 2: Results for part-of-speech tagging using different word embeddings (rows) on different datasets (PTB and Twitter). Cells indicate the part-of-speech accuracy of each experiment.

4.3 Dependency Parsing

The evaluation on dependency parsing is performed on the English PTB, with the standard train, dev and test splits, using Stanford Dependencies. We use the neural network parser defined in (Chen and Manning, 2014), with the default hyper-parameters2 and trained for 5000 iterations. The word projections are initialized using the WIKI(L) embeddings. Evaluation is performed with the labelled (LAS) and unlabelled (UAS) attachment scores.

2 Found in http://nlp.stanford.edu/software/nndep.shtml

In Table 3, we can observe that results are consistent with those in part-of-speech tagging: our models obtain higher scores than the original models, with competitive results compared to the SENNA embeddings. This suggests that our models are well suited to learning syntactic relations between words.

                       Dev UAS   Dev LAS   Test UAS   Test LAS
CBOW                   91.74     88.74     91.52      88.93
Skip-gram              92.12     89.30     91.90      89.55
CWindow                92.38     89.62     92.00      89.70
Structured Skip-gram   92.49     89.78     92.24      89.92
SENNA                  92.24     89.30     92.03      89.51

Table 3: Results for dependency parsing on PTB using different word embeddings (rows). Columns UAS and LAS indicate the unlabelled and labelled attachment scores, respectively.

5 Conclusions

In this work, we present two modifications to the original models in Word2Vec that improve the word embeddings obtained for syntactically motivated tasks. This is done by introducing changes that make the network aware of the relative positioning of context words. With these models we obtain improvements in two mainstream NLP tasks, namely part-of-speech tagging and dependency parsing, and the results generalize to both clean and noisy domains.

Acknowledgements

This work was partially supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds, and also through projects CMU-PT/HuMach/0039/2008 and CMU-PT/0005/2007. The PhD of Wang Ling is supported by FCT grant SFRH/BD/51157/2010. The authors also wish to thank the anonymous reviewers for many helpful comments.

References

Jacob Andreas and Dan Klein. 2014. How much do word embeddings encode about syntax? In Proceedings of ACL.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, June.
Paramveer Dhillon, Dean P. Foster, and Lyle H. Ungar. 2011. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems, pages 199–207.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 42–47. Association for Computational Linguistics.

Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of EMNLP.

Geoffrey E. Hinton and Ruslan Salakhutdinov. 2012. A better way to pretrain deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2447–2455.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700–1709.
Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proc. of EMNLP, pages 1001–1012, Doha, Qatar, October.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 2.

Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2014. A recursive recurrent neural network for statistical machine translation. In Proceedings of ACL, pages 1491–1500.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In HLT-NAACL, pages 380–390.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yi Yang and Jacob Eisenstein. 2015. Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.