
Improving Lexical Embeddings with Semantic Knowledge


Mo Yu*                                          Mark Dredze
Machine Translation Lab                         Human Language Technology Center of Excellence
Harbin Institute of Technology                  Center for Language and Speech Processing
Harbin, China                                   Johns Hopkins University, Baltimore, MD 21218
[email protected]                             [email protected]

* This work was done while the author was visiting JHU.

Abstract

Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective (Mikolov et al., 2013) and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.

1 Introduction

Word embeddings are popular representations for syntax (Turian et al., 2010; Collobert and Weston, 2008; Mnih and Hinton, 2007), semantics (Huang et al., 2012; Socher et al., 2013), morphology (Luong et al., 2013) and other areas. A long line of embeddings work, such as LSA and randomized embeddings (Ravichandran et al., 2005; Van Durme and Lall, 2010), has recently turned to neural language models (Bengio et al., 2006; Collobert and Weston, 2008; Turian et al., 2010). These models can take advantage of large corpora, which can produce impressive results. However, the main drawback of unsupervised learning is that the learned embeddings may not be suited for the task of interest. Consider semantic embeddings, which may capture a notion of semantics that improves one semantic task but harms another. Controlling this behavior is challenging with an unsupervised objective. However, rich prior knowledge exists for many tasks, and there are numerous such semantic resources.

We propose a new training objective for learning word embeddings that incorporates prior knowledge. Our model builds on (Mikolov et al., 2013), a neural network based language model that learns word embeddings by maximizing the probability of raw text. We extend the objective to include prior knowledge about synonyms from semantic resources; we consider both the Paraphrase Database (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999), which annotate semantic relatedness between words. The latter was also used in (Bordes et al., 2012) for training a network for predicting synset relations. The combined objective maximizes the probability of the raw corpus and encourages embeddings to capture semantic relations from the resources. We demonstrate improvements in our embeddings on three tasks: language modeling, measuring word similarity, and predicting human judgements on word pairs.

2 Learning Embeddings

We present a general model for learning word embeddings that incorporates prior knowledge available for a domain. While in this work we consider semantics, our model could incorporate prior knowledge from many types of resources. We begin by reviewing the word2vec objective and then present augmentations of the objective for prior knowledge, including different training strategies.

2.1 Word2vec

Word2vec (Mikolov et al., 2013) is an algorithm for learning embeddings using a neural language model. Embeddings are represented by a set of latent (hidden) variables, and each word is represented by a specific instantiation of these variables. Training learns these representations for each word $w_t$ (the $t$-th word in a corpus of size $T$) so as to maximize the log likelihood of each token given its context: words within a window of size $c$:

$$\max \; \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right), \qquad (1)$$

where $w_{t-c}^{t+c}$ is the set of words in the window of size $c$ centered at $w_t$ ($w_t$ excluded).

Word2vec offers two choices for modeling Eq. (1): a skip-gram model and a continuous bag-of-words model (cbow). The latter worked better in our experiments, so we focus on it in our presentation. cbow defines $p(w_t \mid w_{t-c}^{t+c})$ as:

$$\frac{\exp\left(e'^{\top}_{w_t} \cdot \sum_{-c \le j \le c,\, j \ne 0} e_{w_{t+j}}\right)}{\sum_{w} \exp\left(e'^{\top}_{w} \cdot \sum_{-c \le j \le c,\, j \ne 0} e_{w_{t+j}}\right)}, \qquad (2)$$

where $e_w$ and $e'_w$ represent the input and output embeddings respectively, i.e., the assignments to the latent variables for word $w$. While some learn a single representation for each word ($e_w = e'_w$), our results improved when we used a separate embedding for input and output in cbow.
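As an illustration of Eq. (2), here is a minimal NumPy sketch of the cbow conditional probability. The names `E_in` and `E_out` (input and output embedding matrices, one row per vocabulary word) are assumptions for this sketch, not identifiers from the released implementation.

```python
import numpy as np

def cbow_prob(E_in, E_out, context_ids, target_id):
    """Sketch of Eq. (2): p(w_t | context) under the cbow model.

    E_in, E_out : (V, d) input and output embedding matrices (assumed names).
    context_ids : ids of the words in the window around position t (w_t excluded).
    target_id   : id of the center word w_t.
    """
    h = E_in[context_ids].sum(axis=0)   # sum of input embeddings over the context window
    scores = E_out @ h                  # e'_w . h for every word w in the vocabulary
    scores -= scores.max()              # subtract the max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()                # softmax denominator: sum over the whole vocabulary
    return probs[target_id]
```

In practice the full-vocabulary denominator is avoided during training, e.g. with noise contrastive estimation (Section 2.4) or hierarchical classification (Section 4.1).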
2.2 Relation Constrained Model

Suppose we have a resource that indicates relations between words. In the case of semantics, we could have a resource that encodes semantic similarity between words. Based on this resource, we learn embeddings that predict one word from another related word. We define $R$ as a set of relations between two words $w$ and $w'$. $R$ can contain typed relations (e.g., $w$ is related to $w'$ through a specific type of semantic relation), and relations can have associated scores indicating their strength. We assume a single relation type of uniform strength, though it is straightforward to include additional characteristics into the objective.

Define $R_w$ to be the subset of relations in $R$ which involve word $w$. Our objective maximizes the (log) probability of all relations by summing over all $N$ words in the vocabulary:

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log p\left(w \mid w_i\right), \qquad (3)$$

where $p(w \mid w_i) = \exp\left(e'^{\top}_{w} e_{w_i}\right) / \sum_{\bar w} \exp\left(e'^{\top}_{\bar w} e_{w_i}\right)$ takes a form similar to Eq. (2) but without the context: $e$ and $e'$ are again the input and output embeddings. For our semantic relations $e_w$ and $e'_w$ are symmetrical, so we use a single embedding. Embeddings are learned such that they are predictive of related words in the resource. We call this the Relation Constrained Model (RCM).

2.3 Joint Model

The cbow and RCM objectives use separate data for learning. While RCM learns embeddings suited to specific tasks based on knowledge resources, cbow learns embeddings for words that are not included in the resource but appear in a corpus. We form a joint model through a linear combination of the two (weighted by $C$):

$$\frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right) + \frac{C}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log p\left(w \mid w_i\right)$$

Based on our initial experiments, RCM uses the output embeddings of cbow.

We learn embeddings using stochastic gradient ascent. Updates for the first term for $e'$ and $e$ are:

$$e'_{w} \leftarrow e'_{w} - \alpha_{\text{cbow}} \bigl(\sigma(f(w)) - \mathbb{I}_{[w = w_t]}\bigr) \sum_{j=t-c}^{t+c} e_{w_j}$$
$$e_{w_j} \leftarrow e_{w_j} - \alpha_{\text{cbow}} \sum_{w} \bigl(\sigma(f(w)) - \mathbb{I}_{[w = w_t]}\bigr)\, e'_{w},$$

where $\sigma(x) = \exp\{x\}/(1 + \exp\{x\})$, $\mathbb{I}_{[x]}$ is 1 when $x$ is true, and $f(w) = e'^{\top}_{w} \sum_{j=t-c}^{t+c} e_{w_j}$. Second term updates are:

$$e'_{w} \leftarrow e'_{w} - \alpha_{\text{RCM}} \bigl(\sigma(f'(w)) - \mathbb{I}_{[w \in R_{w_i}]}\bigr)\, e'_{w_i}$$
$$e'_{w_i} \leftarrow e'_{w_i} - \alpha_{\text{RCM}} \sum_{w} \bigl(\sigma(f'(w)) - \mathbb{I}_{[w \in R_{w_i}]}\bigr)\, e'_{w},$$

where $f'(w) = e'^{\top}_{w} e'_{w_i}$. We use two learning rates: $\alpha_{\text{cbow}}$ and $\alpha_{\text{RCM}}$.
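To make the second-term updates concrete, the following is a rough NumPy sketch of one RCM stochastic step for a pair $(w_i, w)$, written against a single shared output embedding matrix `E_out` (an assumed name). The released code is a modified word2vec package in C; this is only an illustration of the update rule above, with the negative sampling of Section 2.4 left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcm_step(E_out, wi, w, is_related, alpha_rcm):
    """One stochastic update of the RCM term for the pair (w_i, w).

    E_out      : (V, d) output embeddings, shared with cbow (RCM uses cbow's
                 output embeddings).
    is_related : 1.0 if w is in R_{w_i}, 0.0 for a sampled negative word.
    """
    f = float(E_out[w] @ E_out[wi])             # f'(w) = e'_w^T e'_{w_i}
    g = alpha_rcm * (sigmoid(f) - is_related)   # sigma(f'(w)) - I[w in R_{w_i}]
    dw, dwi = g * E_out[wi], g * E_out[w]       # gradients w.r.t. e'_w and e'_{w_i}
    E_out[w] -= dw                              # update e'_w
    E_out[wi] -= dwi                            # update e'_{w_i}
```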
2.4 Parameter Estimation

All three models (cbow, RCM and joint) use the same training scheme based on Mikolov et al. (2013). There are several choices to make in parameter estimation; we present the best performing choices used in our results.

We use noise contrastive estimation (NCE) (Mnih and Teh, 2012), which approximately maximizes the log probability of the softmax objective (Eq. 2). For each objective (cbow or RCM), we sample 15 words as negative samples for each training instance according to their frequencies in raw text (i.e. the training data of cbow). Suppose $w$ has frequency $u(w)$; then the probability of sampling $w$ is $p(w) \propto u(w)^{3/4}$.

We use distributed training, where shared embeddings are updated by each thread based on training data within the thread, i.e., asynchronous stochastic gradient ascent. For the joint model, we assign threads to the cbow or RCM objective with a balance of 12:1 (i.e. $C$ is approximately $\frac{1}{12}$). We allow the cbow threads to control convergence; training stops when these threads finish processing the data. We found this an effective method for balancing the two objectives. We trained each cbow objective using a single pass over the data set (except for those in Section 4.1), which we empirically verified was sufficient to ensure stable performance on semantic tasks.

Model pre-training is critical in deep learning (Bengio et al., 2007; Erhan et al., 2010). We evaluate two strategies: random initialization, and pre-training the embeddings. For pre-training, we first learn using cbow with a random initialization. The resulting trained model is then used to initialize the RCM model. This enables the RCM model to benefit from the unlabeled data, but refine the embeddings constrained by the given relations.

Finally, we consider a model for training embeddings that uses a specific training regime. While the joint model balances between fitting the text and learning relations, modeling the text at the expense of the relations may negatively impact the final embeddings for tasks that use the embeddings outside of the context of word2vec. Therefore, we use the embeddings from a trained joint model to pre-train an RCM model. We call this setting Joint→RCM.
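A small sketch of the negative-sampling distribution described earlier in this section, i.e. drawing words with probability proportional to $u(w)^{3/4}$. The names `freqs` and `make_negative_sampler` are illustrative; the word2vec-based implementation precomputes a unigram table rather than calling a generic sampler like this.

```python
import numpy as np

def make_negative_sampler(freqs, power=0.75, seed=0):
    """Build a sampler with p(w) proportional to u(w)**0.75 (Section 2.4).

    freqs : raw corpus frequencies u(w), indexed by word id.
    """
    probs = np.asarray(freqs, dtype=float) ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)

    def sample(k=15):   # the paper draws 15 negatives per training instance
        return rng.choice(len(probs), size=k, p=probs)

    return sample
```

A sampler built this way would supply the negative words consumed by updates such as the RCM step sketched after Section 2.3.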
3 Evaluation

For training cbow we use the New York Times (NYT) 1994-97 subset from Gigaword v5.0 (Parker et al., 2011). We select 1,000 paragraphs each for dev and test data from the December 2010 portion of the NYT. Sentences are tokenized using OpenNLP (https://opennlp.apache.org/), yielding 518,103,942 tokens for training, 42,953 tokens for dev and 41,344 for test.

We consider two resources for training the RCM term: the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999). For each semantic pair extracted from these resources, we add a relation to the RCM objective. Since we use both resources for evaluation, we divide each into train, dev and test sets.

PPDB is an automatically extracted dataset containing tens of millions of paraphrase pairs, including words and phrases. We used the "lexical" version of PPDB (no phrases) and filtered it to include pairs that contained words found in the 200,000 most frequent words in the NYT corpus, which ensures each word in the relations had support in the text corpus. Next, we removed duplicate pairs: if ⟨A, B⟩ occurred in PPDB, we removed relations of ⟨B, A⟩. PPDB is organized into 6 parts, ranging from S (small) to XXXL. Division into these sets is based on an automatically derived accuracy metric. Since S contains the most accurate paraphrases, we used these for evaluation. We divided S into a dev set (1,582 pairs) and a test set (1,583 pairs). Training was based on one of the other sets minus relations from S.

We created similar splits using WordNet, extracting synonyms using the 100,000 most frequent NYT words. We divide the vocabulary into three sets: the most frequent 10,000 words, words with ranks between 10,001-30,000 and words with ranks between 30,001-100,000. We sample 500 words from each set to construct a dev and test set. For each word we sample one synonym to form a pair. The remaining words and their synonyms are used for training. However, we did not use this training data because it is too small to affect the results. Table 1 summarizes the datasets.

            PPDB Relations                      WordNet Relations
  Train   XL      115,041                 Train   68,372 (not used in this work)
          XXL     587,439
          XXXL  2,647,105
  Dev             1,582                   Dev      1,500
  Test            1,583                   Test     1,500

Table 1: Sizes of semantic resource datasets.

4 Experiments

The goal of our experiments is to demonstrate the value of learning semantic embeddings with information from semantic resources. In each setting, we compare the word2vec baseline embedding trained with cbow against RCM alone, the joint model and Joint→RCM. We consider three evaluation tasks: language modeling, measuring semantic similarity, and predicting human judgements on semantic relatedness.

In all of our experiments, we conducted model development and tuned model parameters ($C$, $\alpha_{\text{cbow}}$, $\alpha_{\text{RCM}}$, PPDB dataset, etc.) on development data, and evaluate the best performing model on test data. The models are notated as follows: word2vec for the baseline objective (cbow or skip-gram), RCM-r/p and Joint-r/p for random and pre-trained initializations of the RCM and Joint objectives, and Joint→RCM for pre-training RCM with Joint embeddings. Unless otherwise noted, we train using PPDB XXL. We initially created WordNet training data, but found it too small to affect results. Therefore, we include only RCM results trained on PPDB, but show evaluations on both PPDB and WordNet.

We trained 200-dimensional embeddings and used output embeddings for measuring similarity. During the training of cbow objectives we remove all words with frequencies less than 5, which is the default setting of word2vec.

4.1 Language Modeling

Word2vec is fundamentally a language model, which allows us to compute standard evaluation metrics on a held out dataset. After obtaining trained embeddings from any of our objectives, we use the embeddings in the word2vec model to measure perplexity of the test set. Measuring perplexity means computing the exact probability of each word, which requires summation over all words in the vocabulary in the denominator of the softmax. Therefore, we also trained the language models with the hierarchical classification strategy (HS) of Mikolov et al. (2013). The averaged perplexities are reported on the NYT test set.

While word2vec and joint are trained as language models, RCM is not. In fact, RCM does not even observe all the words that appear in the training set, so it makes little sense to use the RCM embeddings directly for language modeling. Therefore, in order to make a fair comparison, for every set of trained embeddings, we fix them as input embeddings for word2vec, then learn the remaining input embeddings (words not in the relations) and all the output embeddings using cbow. Since this involves running cbow on NYT data for 2 iterations (one iteration for word2vec training/pre-training/joint modeling and the other for tuning the language model), we use Joint-r (random initialization) for a fair comparison.

  Model                               NCE     HS
  word2vec (cbow)                     8.75    6.90
  RCM-p                               8.55    7.07
  Joint-r (α_RCM = 1 × 10^-2)         8.33    6.87
  Joint-r (α_RCM = 1 × 10^-3)         8.20    6.75
  Joint→RCM                           8.40    6.92

Table 2: LM evaluation on held out NYT data.

Table 2 shows the results for language modeling on test data. All of our proposed models improve over the baseline in terms of perplexity when NCE is used for training LMs. When HS is used, the perplexities are greatly improved. However, in this situation only the joint models improve the results, and Joint→RCM performs similar to the baseline, although it is not designed for language modeling. We include the optimal $\alpha_{\text{RCM}}$ in the table, while setting $\alpha_{\text{cbow}} = 0.025$ (the default setting of word2vec). Even when our goal is to strictly model the raw text corpus, we obtain improvements by injecting semantic information into the objective: RCM can effectively shift learning to obtain more informative embeddings.
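For reference, a minimal sketch of the exact-softmax perplexity computation described above. The helper `prob_fn` stands in for any conditional model, e.g. a wrapper around the `cbow_prob` sketch from Section 2.1 with fixed embedding matrices; the full-vocabulary summation inside it is exactly the cost that hierarchical classification avoids. This is an assumed helper, not the evaluation code used in the paper.

```python
import numpy as np

def perplexity(prob_fn, sentences, c=5):
    """Exact perplexity over held-out token ids (Section 4.1).

    prob_fn   : function(context_ids, target_id) -> probability of the target word.
    sentences : iterable of lists of word ids.
    c         : context window size on each side.
    """
    total_logp, n_tokens = 0.0, 0
    for sent in sentences:
        for t, w in enumerate(sent):
            context = sent[max(0, t - c):t] + sent[t + 1:t + 1 + c]
            if not context:
                continue
            total_logp += np.log(prob_fn(context, w))
            n_tokens += 1
    return float(np.exp(-total_logp / n_tokens))
```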
4.2 Measuring Semantic Similarity

Our next task is to find semantically related words using the embeddings, evaluating on relations from PPDB and WordNet. For each of the word pairs ⟨A, B⟩ in the evaluation set, we use the cosine distance between the embeddings to score A with a candidate word B'. We use a large sample of candidate words (10k, 30k or 100k) and rank all candidate words for pairs where B appears in the candidates. We then measure the rank of the correct B to compute mean reciprocal rank (MRR). Our goal is to use word A to select word B as the closest matching word from the large set of candidates. Using this strategy, we evaluate the embeddings from all of our objectives and measure which embedding most accurately selects the true correct word.

Table 3 shows MRR results for both PPDB and WordNet dev and test datasets for all models. All of our methods improve over the baselines in nearly every test set result. In nearly every case, Joint→RCM obtained the largest improvements. Clearly, our embeddings are much more effective at capturing semantic similarity.

                          PPDB Dev             PPDB Test            WordNet Dev          WordNet Test
  Model                 10k   30k   100k     10k   30k   100k     10k   30k   100k     10k   30k   100k
  word2vec (cbow)       49.68 39.26 29.15    49.31 42.53 30.28    10.24  8.64  5.14    10.04  7.90  4.97
  word2vec (skip-gram)  48.70 37.14 26.20    -     -     -         8.61  8.10  4.62    -     -     -
  RCM-r                 55.03 42.52 26.05    -     -     -        13.33  9.05  5.29    -     -     -
  RCM-p                 61.79 53.83 40.95    65.42 55.82 41.20    15.25 12.13  7.46    14.13 11.23  7.39
  Joint-r               59.91 50.87 36.81    -     -     -        15.73 11.36  7.14    13.97 10.51  7.44
  Joint-p               59.75 50.93 37.73    64.30 53.27 38.97    15.61 11.20  6.96    -     -     -
  Joint→RCM             64.22 54.99 41.34    68.20 57.87 42.64    16.81 11.67  7.55    16.16 11.21  7.56

Table 3: MRR for semantic similarity on PPDB and WordNet dev and test data. Higher is better. All RCM objectives are trained with PPDB XXL. To preserve test data integrity, only the best performing setting of each model is evaluated on the test data.
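The MRR numbers in Table 3 come from the ranking procedure described in Section 4.2. Below is a compact sketch of that evaluation, with illustrative names (`E_out`, `pairs`, `candidate_ids`) rather than the paper's actual evaluation script.

```python
import numpy as np

def mean_reciprocal_rank(E_out, pairs, candidate_ids):
    """MRR of the true B among the candidates, ranked by cosine similarity to A.

    E_out         : (V, d) output embeddings (the paper measures similarity with these).
    pairs         : list of (a_id, b_id) pairs from a PPDB/WordNet dev or test split.
    candidate_ids : the 10k/30k/100k candidate word ids; pairs whose B is not a
                    candidate are skipped, as in the paper.
    """
    cand = np.asarray(candidate_ids)
    cand_set = set(candidate_ids)
    C = E_out[cand] / np.linalg.norm(E_out[cand], axis=1, keepdims=True)
    reciprocal_ranks = []
    for a, b in pairs:
        if b not in cand_set:
            continue
        q = E_out[a] / np.linalg.norm(E_out[a])
        sims = C @ q                                  # cosine similarity of A to each candidate
        ranked = cand[np.argsort(-sims)]              # candidates from most to least similar
        rank = int(np.where(ranked == b)[0][0]) + 1   # 1-based rank of the correct B
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```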

4.3 Human Judgements

Our final evaluation is to predict human judgements of semantic relatedness. We have pairs of words from PPDB scored by annotators on a scale of 1 to 5 for quality of similarity. Our data are the judgements used by Ganitkevitch et al. (2013), which we filtered to include only those pairs for which we learned embeddings, yielding 868 pairs. We assign a score using the dot product between the output embeddings of each word in the pair, then order all 868 pairs according to this score. Using the human judgements, we compute the swapped pairs rate: the ratio between the number of swapped pairs and the number of all pairs. For a pair $p$ scored $y_p$ by the embeddings and judged $\hat{y}_p$ by an annotator, the swapped pairs rate is:

$$\frac{\sum_{p_1, p_2 \in D} \mathbb{I}\left[(y_{p_1} - y_{p_2})\,(\hat{y}_{p_2} - \hat{y}_{p_1}) < 0\right]}{\sum_{p_1, p_2 \in D} \mathbb{I}\left[y_{p_1} \ne y_{p_2}\right]} \qquad (4)$$

where $\mathbb{I}[x]$ is 1 when $x$ is true.

  Model              Swapped Pairs Rate
  word2vec (cbow)    17.81
  RCM-p              16.66
  Joint-r            16.85
  Joint-p            16.96
  Joint→RCM          16.62

Table 4: Results for ranking the quality of PPDB pairs as compared to human judgements.

Table 4 shows that all of our models obtain reductions in error as compared to the baseline (cbow), with Joint→RCM obtaining the largest reduction. This suggests that our embeddings are better suited for semantic tasks, in this case judged by human annotations.

4.4 Analysis

We conclude our experiments with an analysis of modeling choices. First, pre-training RCM models gives significant improvements in both measuring semantic similarity and capturing human judgements (compare "p" vs. "r" results). Second, the number of relations used for RCM training is an important factor. Table 5 shows the effect on dev data of using various numbers of relations. While we see improvements from XL to XXL (5 times as many relations), we get worse results on XXXL, likely because this set contains the lowest quality relations in PPDB. Finally, Table 6 shows different learning rates $\alpha_{\text{RCM}}$ for the RCM objective.

                             PPDB Dev
  Model    Relations     10k    30k    100k
  RCM-r    XL            24.02  15.26   9.55
  RCM-p    XL            54.97  45.35  32.95
  RCM-r    XXL           55.03  42.52  26.05
  RCM-p    XXL           61.79  53.83  40.95
  RCM-r    XXXL          51.00  44.61  28.42
  RCM-p    XXXL          53.01  46.35  34.19

Table 5: MRR on PPDB dev data for training on an increasing number of relations.

                               PPDB Dev
  Model    α_RCM           10k    30k    100k
  Joint-p  1 × 10^-1       47.17  36.74  24.50
           5 × 10^-2       54.31  44.52  33.07
           1 × 10^-2       59.75  50.93  37.73
           1 × 10^-3       57.00  46.84  34.45

Table 6: Effect of learning rate α_RCM on MRR for the RCM objective in Joint models.

The baseline word2vec and the joint model have nearly the same averaged running times (2,577s and 2,644s respectively), since they have the same number of threads for the cbow objective and the joint model uses additional threads for the RCM objective. The RCM models are trained with a single thread for 100 epochs. When trained on the PPDB-XXL data, RCM spends 2,931s on average.

5 Conclusion

We have presented a new learning objective for neural language models that incorporates prior knowledge contained in resources to improve learned word embeddings. We demonstrated that the Relation Constrained Model can lead to better semantic embeddings by incorporating resources like PPDB, yielding better language modeling, better semantic similarity metrics, and better prediction of human semantic judgements. Our implementation is based on the word2vec package and we have made it available for general use (https://github.com/Gorov/JointRCM).

We believe that our techniques have implications beyond those considered in this work. We plan to explore the embeddings' suitability for other semantics tasks, including the use of resources with both typed and scored relations. Additionally, we see opportunities for jointly learning embeddings across many tasks with many resources, and plan to extend our model accordingly.

Acknowledgements

Yu is supported by the China Scholarship Council and by NSFC 61173073.

References
Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. 2007. Greedy layer-wise training of deep networks. In Neural Information Processing Systems (NIPS).

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In International Conference on Artificial Intelligence and Statistics, pages 127–135.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning (ICML).

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research (JMLR), 11:625–660.

Christiane Fellbaum. 1999. WordNet. Wiley Online Library.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In North American Chapter of the Association for Computational Linguistics (NAACL).

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Association for Computational Linguistics (ACL), pages 873–882.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Conference on Natural Language Learning (CoNLL).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In International Conference on Machine Learning (ICML).

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition. Technical report, Linguistic Data Consortium.

Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering. In Association for Computational Linguistics (ACL).

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL).

Benjamin Van Durme and Ashwin Lall. 2010. Online generation of locality sensitive hash signatures. In Association for Computational Linguistics (ACL), pages 231–235.