
Two/Too Simple Adaptations of Word2Vec for Syntax Problems

Wang Ling, Chris Dyer, Alan Black, Isabel Trancoso
L2F Spoken Systems Lab, INESC-ID, Lisbon, Portugal
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Instituto Superior Técnico, Lisbon, Portugal
{lingwang,cdyer,awb}@cs.cmu.edu, [email protected]

Abstract

We present two simple modifications to the models in the popular Word2Vec tool, in order to generate embeddings more suited to tasks involving syntax. The main issue with the original models is the fact that they are insensitive to word order. While order independence is useful for inducing semantic representations, it leads to suboptimal results when these models are used to solve syntax-based problems. We show improvements in part-of-speech tagging and dependency parsing using our proposed models.

1 Introduction

Word representations learned from neural language models have been shown to improve many NLP tasks, such as part-of-speech tagging (Collobert et al., 2011), dependency parsing (Chen and Manning, 2014; Kong et al., 2014) and machine translation (Liu et al., 2014; Kalchbrenner and Blunsom, 2013; Devlin et al., 2014; Sutskever et al., 2014). These low-dimensional representations are learned as parameters in a language model and trained to maximize the likelihood of a large corpus of raw text. They are then incorporated as features alongside hand-engineered features (Turian et al., 2010), or used to initialize the parameters of neural networks targeting tasks for which substantially less training data is available (Hinton and Salakhutdinov, 2012; Erhan et al., 2010; Guo et al., 2014).

One of the most widely used tools for building word vectors are the models described in (Mikolov et al., 2013), implemented in the Word2Vec tool, in particular the "skip-gram" and the "continuous bag-of-words" (CBOW) models. These two models make different independence and conditioning assumptions; however, both discard word order information in how they account for context. Thus, embeddings built using these models have been shown to capture semantic similarities between words, and pre-training with them has been shown to lead to major improvements in many tasks (Collobert et al., 2011). While more sophisticated approaches have been proposed (Dhillon et al., 2011; Huang et al., 2012; Faruqui and Dyer, 2014; Levy and Goldberg, 2014; Yang and Eisenstein, 2015), Word2Vec remains a popular choice due to its efficiency and simplicity.

However, as these models are insensitive to word order, embeddings built using them are suboptimal for tasks involving syntax, such as part-of-speech tagging or dependency parsing. This is because syntax defines "what words go where?", while semantics defines "what words go together". Obviously, in a model where word order is discarded, the many syntactic relations between words cannot be captured properly. For instance, while most words occur with the word the, only nouns tend to occur exactly afterwards (e.g. the cat). This is supported by empirical evidence that order-insensitivity does indeed lead to substandard syntactic representations (Andreas and Klein, 2014; Bansal et al., 2014): systems using embeddings pre-trained with Word2Vec yield slight improvements, while the computationally far more expensive embeddings of Collobert et al. (2011), which use word order information, yield much better results.

In this work, we describe two simple modifications to Word2Vec, one for the skip-gram model and one for the CBOW model, that improve the quality of the embeddings for syntax-based tasks.1 Our goal is to improve the final embeddings while maintaining the simplicity and efficiency of the original models. We demonstrate the effectiveness of our approaches by training, on commodity hardware, on datasets containing more than 50 million sentences and over 1 billion words in less than a day, and show that our methods lead to improvements when used in state-of-the-art neural network systems for part-of-speech tagging and dependency parsing, relative to the original models.

1 The code developed in this work is made available at https://github.com/wlin12/wang2vec.

2 Word2Vec

The work in (Mikolov et al., 2013) is a popular choice for pre-training the projection matrix W ∈ ℝ^{d_w × |V|}, where d_w is the embedding dimension and V the word vocabulary, and it defines two models: skip-gram and CBOW.

The skip-gram model trains embeddings by using each word w_t in a corpus of length T to predict its surrounding context words, maximizing the average log probability

L = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (1)

where c is a hyperparameter defining the window of context words. To obtain the output probability p(w_o | w_i), the model estimates a matrix O ∈ ℝ^{|V| × d_w}, which maps the embedding r_{w_i} into a |V|-dimensional vector o_{w_i}. Then, the probability of predicting the word w_o given the word w_i is defined as

p(w_o \mid w_i) = \frac{e^{o_{w_i}(w_o)}}{\sum_{w \in V} e^{o_{w_i}(w)}}    (2)

This is referred to as the softmax objective. However, for larger vocabularies it is inefficient to compute o_{w_i}, since this requires a |V| × d_w matrix-vector multiplication for every prediction. Solutions to this problem are addressed in Word2Vec by using the hierarchical softmax objective function or by resorting to negative sampling (Goldberg and Levy, 2014).

The CBOW model predicts the center word w_0 given a representation of the surrounding words w_{-c}, ..., w_{-1}, w_1, ..., w_c. Thus, the output vector o_{w_{-c},...,w_{-1},w_1,...,w_c} is obtained from the product of the matrix O ∈ ℝ^{|V| × d_w} with the sum of the embeddings of the context words, \sum_{-c \le j \le c, j \ne 0} r_{w_j}.

We can observe that in both methods the order of the context words does not influence the prediction output. As such, while these methods may find similar representations for semantically similar words, they are less likely to learn representations that reflect the syntactic properties of the words.
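To make the two objectives above concrete, the following is a minimal NumPy sketch of the full-softmax formulation in Equations 1 and 2. The released tool is written in C and, as noted above, replaces the full softmax with hierarchical softmax or negative sampling; the vocabulary size, dimensions and random initialization below are toy values chosen purely for illustration.

```python
import numpy as np

# Toy sizes; the real tool uses vocabularies with hundreds of thousands of types.
V, d_w = 1000, 50
rng = np.random.default_rng(0)

R = rng.normal(scale=0.1, size=(V, d_w))   # input embeddings r_w
O = rng.normal(scale=0.1, size=(V, d_w))   # output matrix O

def skipgram_probs(center_id):
    """p(. | w_i) from Eq. 2: project r_{w_i} with O, then softmax over the vocabulary."""
    scores = O @ R[center_id]              # o_{w_i}, a |V|-dimensional vector
    scores -= scores.max()                 # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def cbow_probs(context_ids):
    """CBOW: the same projection applied to the *sum* of the context embeddings."""
    h = R[context_ids].sum(axis=0)         # order of context_ids is irrelevant here
    scores = O @ h
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

p = skipgram_probs(3)
assert abs(p.sum() - 1.0) < 1e-6
```

The key point is that cbow_probs collapses the context into an order-insensitive sum, which is exactly the property the next section removes.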

3 Structured Word2Vec

To account for the lack of order-dependence in the above models, we propose two simple modifications to these methods that include ordering information, which we expect will lead to more syntactically-oriented embeddings. These models are illustrated in Figure 2.

3.1 Structured Skip-gram Model

The skip-gram model uses a single output matrix O ∈ ℝ^{|V| × d} to predict every contextual word w_{-c}, ..., w_{-1}, w_1, ..., w_c, given the embedding of the center word w_0. Our approach adapts the model so that it is sensitive to the positioning of the words. It defines a set of c × 2 output predictors O_{-c}, ..., O_{-1}, O_1, ..., O_c, each of size ℝ^{|V| × d}. Each of the output matrices is dedicated to predicting the output for a specific relative position to the center word. When making a prediction p(w_o | w_i), we select the appropriate output matrix O_{o-i} to project the word embedding to the output vector. Note that the number of operations that must be performed for the forward and backward passes in the network remains the same, as we are simply switching the output matrix O for each different word index.
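As an illustration of this change, the sketch below keeps one output matrix per relative position and selects it at prediction time. It is a schematic NumPy rendering of the idea under the same toy dimensions as before, not code from the wang2vec release.

```python
import numpy as np

V, d, c = 1000, 50, 5
rng = np.random.default_rng(1)

R = rng.normal(scale=0.1, size=(V, d))
# One output matrix per relative position: -c..-1 and 1..c (2c matrices in total).
O_pos = {j: rng.normal(scale=0.1, size=(V, d)) for j in range(-c, c + 1) if j != 0}

def structured_skipgram_probs(center_id, relative_position):
    """Predict the word at a given offset from the center word, using the
    output matrix dedicated to that offset (O_{o-i} in the paper's notation)."""
    scores = O_pos[relative_position] @ R[center_id]
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

# e.g. distribution over the word immediately to the right of word id 3
p_next = structured_skipgram_probs(3, relative_position=1)
```

Since only one of the 2c matrices is used per prediction, the cost of a single forward/backward pass is unchanged, as stated above; only the total number of output parameters grows.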

Figure 2: Illustration of the Structured Skip-gram and Continuous Window (CWindow) models.

3.2 Continuous Window Model

The Continuous Bag-Of-Words model defines a window of words w_{-c}, ..., w_c with size c, where the prediction of the center word w_0 is conditioned on the remaining words w_{-c}, ..., w_{-1}, w_1, ..., w_c. The prediction matrix O ∈ ℝ^{|V| × d} is fed with the sum of the embeddings of the context words. As such, the order of the contextual words does not influence the prediction of the center word. Our approach defines a different output predictor O ∈ ℝ^{|V| × (2c × d)}, which receives as input a (2c × d)-dimensional vector that is the concatenation of the embeddings of the context words in the order they occur, [e(w_{-c}), ..., e(w_{-1}), e(w_1), ..., e(w_c)]. As matrix O defines a separate block of parameters for each relative position, this allows words to be treated differently depending on where they occur. This model, denoted CWindow, is essentially the window-based model described in (Collobert et al., 2011), with the exception that we do not project the vector of word embeddings into a window embedding before making the final prediction.

In both models, we are increasing the number of parameters of matrix O by a factor of c × 2, which can lead to sparsity problems when training on small datasets. However, these models are generally trained on datasets on the order of 100 millions of words, where these issues are not as severe.
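A corresponding sketch of the CWindow prediction is shown below: the context embeddings are concatenated rather than summed, so each position owns a distinct slice of the output matrix's columns. Again, this is an illustrative NumPy rendering with toy sizes, not the authors' C implementation.

```python
import numpy as np

V, d, c = 1000, 50, 5
rng = np.random.default_rng(2)

R = rng.normal(scale=0.1, size=(V, d))
# CWindow output matrix: one block of columns per context position.
O = rng.normal(scale=0.1, size=(V, 2 * c * d))

def cwindow_probs(context_ids):
    """Predict the center word from the *concatenation* of the 2c context
    embeddings, so swapping two context words changes the prediction."""
    assert len(context_ids) == 2 * c
    h = np.concatenate([R[i] for i in context_ids])   # (2c*d,)-dimensional input
    scores = O @ h
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()
```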
4 Experiments

We conducted experiments on two mainstream syntax-based tasks: part-of-speech tagging and dependency parsing. Part-of-speech tagging is a word labeling task, where each word is to be labelled with its corresponding part-of-speech. In dependency parsing, the goal is to predict a tree of syntactic relations between words. In both tasks, it has been shown that pre-trained embeddings can be used to achieve better generalization (Collobert et al., 2011; Chen and Manning, 2014).

4.1 Building Word Vectors

We built vectors for English in two very different domains. Firstly, we used an English Wikipedia dump containing 1,897 million words (60 million sentences), collected in September of 2014. We built word embeddings using the original and our proposed methods on this dataset. These embeddings will be denoted as WIKI(L). Then, we took a sample of 56 million English tweets with 847 million words collected in (Owoputi et al., 2013), and applied the same procedure to build the TWITTER embeddings. Finally, we also use the Wikipedia documents, with 16 million words, provided in the Word2Vec package for contrastive purposes, denoted as WIKI(S). As preprocessing, the text was lowercased and groups of contiguous digits were replaced by a special word. For all corpora, we trained the networks with a window size c = 5 and a negative sampling value of 10, and filtered out words with fewer than 40 instances. WIKI(L) and TWITTER have vocabularies of 424,882 and 216,871 types, respectively, and embeddings of 50 dimensions.

Embeddings                         WIKI(S): breaking   TWITTER: amazing   WIKI(L): person
CBOW                               breaks              incredible         someone
                                   turning             awesome            anyone
                                   broke               fantastic          oneself
                                   break               phenomenal         woman
                                   stumbled            awsome             if
Skip-gram                          break               incredible         harasser
                                   breaks              awesome            themself
                                   broke               fantastic          declarant
                                   down                phenominal         someone
                                   broken              phenomenal         right-thinking
CWindow (this work)                putting             incredible         woman
                                   turning             amaaazing          man
                                   sticking            awesome            child
                                   pulling             amzing             grandparent
                                   picking             a-mazing           servicemember
Structured Skip-gram (this work)   break               incredible         declarant
                                   turning             awesome            circumstance
                                   putting             amaaazing          woman
                                   out                 ah-mazing          schoolchild
                                   breaks              amzing             someone

Table 1: Most similar words using different word embedding models for the words breaking, amazing and person. Each word is queried in a different dataset.

Table 1 shows the most similar words for a few selected keywords under each of the different embeddings. We can see hints that our models tend to group words that are more syntactically related. In fact, for the word breaking, the CWindow model's top five words are exclusively composed of verbs in the continuous form, while the Structured Skip-gram model tends to combine these with other forms of the verb break. The original models tend to be less keen on preserving such properties. As for the TWITTER embeddings, we can observe that our adapted embeddings are much better at finding lexical variations of the words, such as a-mazing, resembling the results obtained using Brown clusters (Owoputi et al., 2013). Finally, for the query person, we can see that our models tend to associate this term with other words in the same class, such as man, woman and child, while the original models tend to include unrelated words, such as if and right-thinking.

In terms of computation speed, the Skip-gram and CBOW models achieve a processing rate of 71.35k and 342.17k words per second, respectively. The Structured Skip-gram and CWindow models can process 34.64k and 124.43k words per second, respectively. There is a large drop in computational speed in the CWindow model compared to the CBOW model, as it uses a larger output matrix, which grows with the size of the window. The Structured Skip-gram model processes words at almost half the speed of the Skip-gram model. This is explained by the fact that the Skip-gram model sub-samples context words, varying the window size stochastically so that words closer to the center word are sampled more frequently. That is, when defining a window size of 5, the actual window size used for each sample is a random value between 1 and 5. As we use a separate output layer for each position, we did not find this property useful, as it provides fewer training samples for the output matrices with higher indexes. While our models are slower, they are still suitable for processing large datasets, as all the embeddings we used were built within a day.
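The dynamic-window behaviour described above can be summarized in a few lines. The sketch below mimics Word2Vec's stochastic sub-sampling of context positions; the function name and the counting loop are ours, added only for illustration.

```python
import random

def sample_context_offsets(max_window, rng):
    """Word2Vec-style dynamic window: draw an effective window size b in
    [1, max_window] uniformly, so offsets close to the center word appear
    in far more training samples than distant ones."""
    b = rng.randint(1, max_window)
    return [j for j in range(-b, b + 1) if j != 0]

rng = random.Random(0)
counts = {j: 0 for j in range(1, 6)}
for _ in range(10000):
    for j in sample_context_offsets(5, rng):
        if j > 0:
            counts[j] += 1
print(counts)  # roughly {1: 10000, 2: 8000, 3: 6000, 4: 4000, 5: 2000}
```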
4.2 Part-Of-Speech Tagging

We reimplemented the window-based model proposed in (Collobert et al., 2011), which defines a 3-layer neural network. In this network, words are first projected into embeddings, which are concatenated and projected into a window embedding. This is finally projected into an output layer with the size of the POS tag vocabulary, followed by a softmax. In our experiments, we used a window size of 5, word embeddings of size 50 and window embeddings of size 500. Word embeddings were initialized using the pre-trained vectors, and these parameters are updated along with the rest of the network. Additionally, we add a capitalization feature which indicates whether the first letter of the word is uppercased, as all word features are lowercased. Finally, words unseen in the training set and in the pre-trained embeddings are replaced with a special unknown token, which is also modelled as a word type with its own set of 50 parameters. At training time, we stochastically replace word types that only occur once in the training dataset with the unknown token. Evaluation is performed with the part-of-speech tag accuracy, which denotes the percentage of words labelled correctly.

Experiments are performed on two datasets: the English Penn Treebank (PTB) dataset, using the standard train, dev and test splits, and the ARK dataset (Gimpel et al., 2011), with 1000 training, 327 dev and 500 test labelled English tweets from Twitter. For the PTB dataset we use the WIKI(L) embeddings, and we use the TWITTER embeddings for the ARK dataset. Finally, the set of parameters with the highest accuracy on the dev set is used to report the score for the test set.
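For concreteness, the following is a rough sketch of the forward pass of the window-based tagger described above (concatenated word embeddings, a window embedding of size 500, and a softmax over the tag vocabulary). The tanh nonlinearity, bias terms, tag-set size and the omission of the capitalization feature are simplifying assumptions of ours, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d, win, hidden, n_tags = 1000, 50, 5, 500, 45   # 45 is roughly the PTB tag-set size

E  = rng.normal(scale=0.1, size=(V, d))            # word embeddings (pre-trained, then fine-tuned)
W1 = rng.normal(scale=0.1, size=(hidden, win * d)) # projection into the window embedding
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_tags, hidden))  # output layer over POS tags
b2 = np.zeros(n_tags)

def tag_probs(window_word_ids):
    """Forward pass predicting the tag of the center word of a 5-word window."""
    x = np.concatenate([E[i] for i in window_word_ids])   # (win*d,)
    h = np.tanh(W1 @ x + b1)                               # window embedding
    scores = W2 @ h + b2
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

p = tag_probs([1, 2, 3, 4, 5])
print(p.argmax())   # predicted tag index for the center word
```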

Results are shown in Table 2, where we observe that our adapted models tend to yield better results than the original models on both datasets. On the Twitter dataset, our results are slightly higher than the accuracy reported using only Brown clusters in (Owoputi et al., 2013), which was 89.50. We also try initializing our embeddings with those of (Collobert et al., 2011), shown in the "SENNA" row. Even though results are higher with our models, we cannot conclude that our method is better, as the embeddings are trained on crawls of Wikipedia from different time periods. However, it is a good reference to show that our embeddings are on par with those learned using more sophisticated models.

                       PTB Dev   PTB Test   Twitter Dev   Twitter Test
CBOW                   95.89     96.13      87.85         87.54
Skip-gram              96.62     96.68      88.84         88.73
CWindow                96.99     97.01      89.72         89.63
Structured Skip-gram   96.62     97.05      89.69         89.79
SENNA                  96.54     96.58      84.96         84.85

Table 2: Results for part-of-speech tagging using different word embeddings (rows) on different datasets (PTB and Twitter). Cells indicate the part-of-speech accuracy of each experiment.

4.3 Dependency Parsing

The evaluation on dependency parsing is performed on the English PTB, with the standard train, dev and test splits, using Stanford Dependencies. We use the neural network parser defined in (Chen and Manning, 2014), with the default hyper-parameters2 and trained for 5000 iterations. The word projections are initialized using the WIKI(L) embeddings. Evaluation is performed with the labelled (LAS) and unlabelled (UAS) attachment scores.

2 Found in http://nlp.stanford.edu/software/nndep.shtml

In Table 3, we can observe that results are consistent with those in part-of-speech tagging: our models obtain higher scores than the original models, with competitive results compared to the SENNA embeddings. This suggests that our models are well suited to learning syntactic relations between words.

                       Dev UAS   Dev LAS   Test UAS   Test LAS
CBOW                   91.74     88.74     91.52      88.93
Skip-gram              92.12     89.30     91.90      89.55
CWindow                92.38     89.62     92.00      89.70
Structured Skip-gram   92.49     89.78     92.24      89.92
SENNA                  92.24     89.30     92.03      89.51

Table 3: Results for dependency parsing on PTB using different word embeddings (rows). Columns UAS and LAS indicate the unlabelled and labelled attachment scores, respectively.

5 Conclusions

In this work, we present two modifications to the original models in Word2Vec that improve the word embeddings obtained for syntactically motivated tasks. This is done by introducing changes that make the network aware of the relative positioning of context words. With these models we obtain improvements in two mainstream NLP tasks, namely part-of-speech tagging and dependency parsing, and the results generalize to both clean and noisy domains.

Acknowledgements

This work was partially supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds, and also through projects CMU-PT/HuMach/0039/2008 and CMU-PT/0005/2007. The PhD of Wang Ling is supported by FCT grant SFRH/BD/51157/2010. The authors also wish to thank the anonymous reviewers for many helpful comments.

References

Jacob Andreas and Dan Klein. 2014. How much do word embeddings encode about syntax? In Proceedings of ACL.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, June.
Paramveer Dhillon, Dean P. Foster, and Lyle H. Ungar. 2011. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems, pages 199–207.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 42–47. Association for Computational Linguistics.

Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of EMNLP.

Geoffrey E. Hinton and Ruslan Salakhutdinov. 2012. A better way to pretrain deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2447–2455.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700–1709.
Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proc. of EMNLP, pages 1001–1012, Doha, Qatar, October.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 2.

Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2014. A recursive recurrent neural network for statistical machine translation. In Proceedings of ACL, pages 1491–1500.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In HLT-NAACL, pages 380–390.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yi Yang and Jacob Eisenstein. 2015. Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.