
Improving Lexical Embeddings with Semantic Knowledge


Mo Yu*                                          Mark Dredze
Machine Translation Lab                         Human Language Technology Center of Excellence
Harbin Institute of Technology                  Center for Language and Speech Processing
Harbin, China                                   Johns Hopkins University, Baltimore, MD 21218
[email protected]                             [email protected]

* This work was done while the author was visiting JHU.

Abstract

Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective (Mikolov et al., 2013) and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.

1 Introduction

Word embeddings are popular representations for syntax (Turian et al., 2010; Collobert and Weston, 2008; Mnih and Hinton, 2007), semantics (Huang et al., 2012; Socher et al., 2013), morphology (Luong et al., 2013) and other areas. A long line of embeddings work, such as LSA and randomized embeddings (Ravichandran et al., 2005; Van Durme and Lall, 2010), has recently turned to neural language models (Bengio et al., 2006; Collobert and Weston, 2008; Turian et al., 2010). These models can take advantage of large corpora, which can produce impressive results. However, the main drawback of unsupervised learning is that the learned embeddings may not be suited for the task of interest. Consider semantic embeddings, which may capture a notion of semantics that improves one semantic task but harms another. Controlling this behavior is challenging with an unsupervised objective. However, rich prior knowledge exists for many tasks, and there are numerous such semantic resources.

We propose a new training objective for learning word embeddings that incorporates prior knowledge. Our model builds on (Mikolov et al., 2013), a neural network based language model that learns word embeddings by maximizing the probability of raw text. We extend the objective to include prior knowledge about synonyms from semantic resources; we consider both the Paraphrase Database (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999), which annotate semantic relatedness between words. The latter was also used in (Bordes et al., 2012) for training a network for predicting synset relations. The combined objective maximizes the probability of the raw corpus and encourages embeddings to capture semantic relations from the resources. We demonstrate improvements in our embeddings on three tasks: language modeling, measuring word similarity, and predicting human judgements on word pairs.

2 Learning Embeddings

We present a general model for learning word embeddings that incorporates prior knowledge available for a domain. While in this work we consider semantics, our model could incorporate prior knowledge from many types of resources. We begin by reviewing the word2vec objective and then present augmentations of the objective for prior knowledge, including different training strategies.

2.1 Word2vec

Word2vec (Mikolov et al., 2013) is an algorithm for learning embeddings using a neural language model. Embeddings are represented by a set of latent (hidden) variables, and each word is represented by a specific instantiation of these variables. Training learns these representations for each word $w_t$ (the $t$-th word in a corpus of size $T$) so as to maximize the log likelihood of each token given its context: words within a window of size $c$:

$$\max \; \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right), \qquad (1)$$

where $w_{t-c}^{t+c}$ is the set of words in the window of size $c$ centered at $w_t$ ($w_t$ excluded).

Word2vec offers two choices for modeling Eq. (1): a skip-gram model and a continuous bag-of-words model (cbow). The latter worked better in our experiments, so we focus on it in our presentation. cbow defines $p(w_t \mid w_{t-c}^{t+c})$ as:

$$\frac{\exp\left(e'^{\top}_{w_t} \cdot \sum_{-c \le j \le c,\, j \ne 0} e_{w_{t+j}}\right)}{\sum_{w} \exp\left(e'^{\top}_{w} \cdot \sum_{-c \le j \le c,\, j \ne 0} e_{w_{t+j}}\right)}, \qquad (2)$$

where $e_w$ and $e'_w$ represent the input and output embeddings respectively, i.e., the assignments to the latent variables for word $w$. While some learn a single representation for each word ($e_w = e'_w$), our results improved when we used a separate embedding for input and output in cbow.
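As an illustration of Eq. (2), here is a minimal NumPy sketch of the cbow conditional probability. The names `E_in` and `E_out` (input and output embedding matrices, one row per vocabulary word) are assumptions for this sketch, not identifiers from the released implementation.

```python
import numpy as np

def cbow_prob(E_in, E_out, context_ids, target_id):
    """Sketch of Eq. (2): p(w_t | context) under the cbow model.

    E_in, E_out : (V, d) input and output embedding matrices (assumed names).
    context_ids : ids of the words in the window around position t (w_t excluded).
    target_id   : id of the center word w_t.
    """
    h = E_in[context_ids].sum(axis=0)   # sum of input embeddings over the context window
    scores = E_out @ h                  # e'_w . h for every word w in the vocabulary
    scores -= scores.max()              # subtract the max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()                # softmax denominator: sum over the whole vocabulary
    return probs[target_id]
```

In practice the full-vocabulary denominator is avoided during training, e.g. with noise contrastive estimation (Section 2.4) or hierarchical classification (Section 4.1).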
2.2 Relation Constrained Model

Suppose we have a resource that indicates relations between words. In the case of semantics, we could have a resource that encodes semantic similarity between words. Based on this resource, we learn embeddings that predict one word from another related word. We define $R$ as a set of relations between two words $w$ and $w'$. $R$ can contain typed relations (e.g., $w$ is related to $w'$ through a specific type of semantic relation), and relations can have associated scores indicating their strength. We assume a single relation type of uniform strength, though it is straightforward to include additional characteristics into the objective.

Define $R_w$ to be the subset of relations in $R$ which involve word $w$. Our objective maximizes the (log) probability of all relations by summing over all $N$ words in the vocabulary:

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log p\left(w \mid w_i\right), \qquad (3)$$

where $p(w \mid w_i) = \exp\left(e'^{\top}_{w} e_{w_i}\right) / \sum_{\bar w} \exp\left(e'^{\top}_{\bar w} e_{w_i}\right)$ takes a form similar to Eq. (2) but without the context: $e$ and $e'$ are again the input and output embeddings. For our semantic relations $e_w$ and $e'_w$ are symmetrical, so we use a single embedding. Embeddings are learned such that they are predictive of related words in the resource. We call this the Relation Constrained Model (RCM).

2.3 Joint Model

The cbow and RCM objectives use separate data for learning. While RCM learns embeddings suited to specific tasks based on knowledge resources, cbow learns embeddings for words that are not included in the resource but appear in a corpus. We form a joint model through a linear combination of the two (weighted by $C$):

$$\frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right) + \frac{C}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log p\left(w \mid w_i\right)$$

Based on our initial experiments, RCM uses the output embeddings of cbow.

We learn embeddings using stochastic gradient ascent. Updates for the first term for $e'$ and $e$ are:

$$e'_{w} \leftarrow e'_{w} - \alpha_{\text{cbow}} \bigl(\sigma(f(w)) - \mathbb{I}_{[w = w_t]}\bigr) \sum_{j=t-c}^{t+c} e_{w_j}$$
$$e_{w_j} \leftarrow e_{w_j} - \alpha_{\text{cbow}} \sum_{w} \bigl(\sigma(f(w)) - \mathbb{I}_{[w = w_t]}\bigr)\, e'_{w},$$

where $\sigma(x) = \exp\{x\}/(1 + \exp\{x\})$, $\mathbb{I}_{[x]}$ is 1 when $x$ is true, and $f(w) = e'^{\top}_{w} \sum_{j=t-c}^{t+c} e_{w_j}$. Second term updates are:

$$e'_{w} \leftarrow e'_{w} - \alpha_{\text{RCM}} \bigl(\sigma(f'(w)) - \mathbb{I}_{[w \in R_{w_i}]}\bigr)\, e'_{w_i}$$
$$e'_{w_i} \leftarrow e'_{w_i} - \alpha_{\text{RCM}} \sum_{w} \bigl(\sigma(f'(w)) - \mathbb{I}_{[w \in R_{w_i}]}\bigr)\, e'_{w},$$

where $f'(w) = e'^{\top}_{w} e'_{w_i}$. We use two learning rates: $\alpha_{\text{cbow}}$ and $\alpha_{\text{RCM}}$.
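To make the second-term updates concrete, the following is a rough NumPy sketch of one RCM stochastic step for a pair $(w_i, w)$, written against a single shared output embedding matrix `E_out` (an assumed name). The released code is a modified word2vec package in C; this is only an illustration of the update rule above, with the negative sampling of Section 2.4 left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcm_step(E_out, wi, w, is_related, alpha_rcm):
    """One stochastic update of the RCM term for the pair (w_i, w).

    E_out      : (V, d) output embeddings, shared with cbow (RCM uses cbow's
                 output embeddings).
    is_related : 1.0 if w is in R_{w_i}, 0.0 for a sampled negative word.
    """
    f = float(E_out[w] @ E_out[wi])             # f'(w) = e'_w^T e'_{w_i}
    g = alpha_rcm * (sigmoid(f) - is_related)   # sigma(f'(w)) - I[w in R_{w_i}]
    dw, dwi = g * E_out[wi], g * E_out[w]       # gradients w.r.t. e'_w and e'_{w_i}
    E_out[w] -= dw                              # update e'_w
    E_out[wi] -= dwi                            # update e'_{w_i}
```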
2.4 Parameter Estimation

All three models (cbow, RCM and joint) use the same training scheme based on Mikolov et al. (2013). There are several choices to make in parameter estimation; we present the best performing choices used in our results.

We use noise contrastive estimation (NCE) (Mnih and Teh, 2012), which approximately maximizes the log probability of the softmax objective (Eq. 2). For each objective (cbow or RCM), we sample 15 words as negative samples for each training instance according to their frequencies in raw text (i.e. the training data of cbow). Suppose $w$ has frequency $u(w)$; then the probability of sampling $w$ is $p(w) \propto u(w)^{3/4}$.

We use distributed training, where shared embeddings are updated by each thread based on training data within the thread, i.e., asynchronous stochastic gradient ascent. For the joint model, we assign threads to the cbow or RCM objective with a balance of 12:1 (i.e. $C$ is approximately $\frac{1}{12}$). We allow the cbow threads to control convergence; training stops when these threads finish processing the data. We found this an effective method for balancing the two objectives. We trained each cbow objective using a single pass over the data set (except for those in Section 4.1), which we empirically verified was sufficient to ensure stable performance on semantic tasks.

Model pre-training is critical in deep learning (Bengio et al., 2007; Erhan et al., 2010). We evaluate two strategies: random initialization, and pre-training the embeddings. For pre-training, we first learn using cbow with a random initialization. The resulting trained model is then used to initialize the RCM model. This enables the RCM model to benefit from the unlabeled data, but refine the embeddings constrained by the given relations.

Finally, we consider a model for training embeddings that uses a specific training regime. While the joint model balances between fitting the text and learning relations, modeling the text at the expense of the relations may negatively impact the final embeddings for tasks that use the embeddings outside of the context of word2vec. Therefore, we use the embeddings from a trained joint model to pre-train an RCM model. We call this setting Joint→RCM.
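A small sketch of the negative-sampling distribution described earlier in this section, i.e. drawing words with probability proportional to $u(w)^{3/4}$. The names `freqs` and `make_negative_sampler` are illustrative; the word2vec-based implementation precomputes a unigram table rather than calling a generic sampler like this.

```python
import numpy as np

def make_negative_sampler(freqs, power=0.75, seed=0):
    """Build a sampler with p(w) proportional to u(w)**0.75 (Section 2.4).

    freqs : raw corpus frequencies u(w), indexed by word id.
    """
    probs = np.asarray(freqs, dtype=float) ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)

    def sample(k=15):   # the paper draws 15 negatives per training instance
        return rng.choice(len(probs), size=k, p=probs)

    return sample
```

A sampler built this way would supply the negative words consumed by updates such as the RCM step sketched after Section 2.3.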
3 Evaluation

For training cbow we use the New York Times (NYT) 1994-97 subset from Gigaword v5.0 (Parker et al., 2011). We select 1,000 paragraphs each for dev and test data from the December 2010 portion of the NYT. Sentences are tokenized using OpenNLP (https://opennlp.apache.org/), yielding 518,103,942 tokens for training, 42,953 tokens for dev and 41,344 for test.

We consider two resources for training the RCM term: the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999). For each semantic pair extracted from these resources, we add a relation to the RCM objective. Since we use both resources for evaluation, we divide each into train, dev and test sets.

PPDB is an automatically extracted dataset containing tens of millions of paraphrase pairs, including words and phrases. We used the "lexical" version of PPDB (no phrases) and filtered it to include pairs that contained words found in the 200,000 most frequent words in the NYT corpus, which ensures each word in the relations had support in the text corpus. Next, we removed duplicate pairs: if ⟨A, B⟩ occurred in PPDB, we removed relations of ⟨B, A⟩. PPDB is organized into 6 parts, ranging from S (small) to XXXL. Division into these sets is based on an automatically derived accuracy metric. Since S contains the most accurate paraphrases, we used these for evaluation. We divided S into a dev set (1,582 pairs) and a test set (1,583 pairs). Training was based on one of the other sets minus relations from S.

We created similar splits using WordNet, extracting synonyms using the 100,000 most frequent NYT words. We divide the vocabulary into three sets: the most frequent 10,000 words, words with ranks between 10,001-30,000 and words with ranks between 30,001-100,000. We sample 500 words from each set to construct a dev and test set. For each word we sample one synonym to form a pair. The remaining words and their synonyms are used for training. However, we did not use this training data because it is too small to affect the results. Table 1 summarizes the datasets.

            PPDB Relations                      WordNet Relations
  Train   XL      115,041                 Train   68,372 (not used in this work)
          XXL     587,439
          XXXL  2,647,105
  Dev             1,582                   Dev      1,500
  Test            1,583                   Test     1,500

Table 1: Sizes of semantic resource datasets.

4 Experiments

The goal of our experiments is to demonstrate the value of learning semantic embeddings with information from semantic resources. In each setting, we compare the word2vec baseline embedding trained with cbow against RCM alone, the joint model and Joint→RCM. We consider three evaluation tasks: language modeling, measuring semantic similarity, and predicting human judgements on semantic relatedness.

In all of our experiments, we conducted model development and tuned model parameters ($C$, $\alpha_{\text{cbow}}$, $\alpha_{\text{RCM}}$, PPDB dataset, etc.) on development data, and evaluate the best performing model on test data. The models are notated as follows: word2vec for the baseline objective (cbow or skip-gram), RCM-r/p and Joint-r/p for random and pre-trained initializations of the RCM and Joint objectives, and Joint→RCM for pre-training RCM with Joint embeddings. Unless otherwise noted, we train using PPDB XXL. We initially created WordNet training data, but found it too small to affect results. Therefore, we include only RCM results trained on PPDB, but show evaluations on both PPDB and WordNet.

We trained 200-dimensional embeddings and used output embeddings for measuring similarity. During the training of cbow objectives we remove all words with frequencies less than 5, which is the default setting of word2vec.

4.1 Language Modeling

Word2vec is fundamentally a language model, which allows us to compute standard evaluation metrics on a held out dataset. After obtaining trained embeddings from any of our objectives, we use the embeddings in the word2vec model to measure perplexity of the test set. Measuring perplexity means computing the exact probability of each word, which requires summation over all words in the vocabulary in the denominator of the softmax. Therefore, we also trained the language models with the hierarchical classification strategy (HS) of Mikolov et al. (2013). The averaged perplexities are reported on the NYT test set.

While word2vec and joint are trained as language models, RCM is not. In fact, RCM does not even observe all the words that appear in the training set, so it makes little sense to use the RCM embeddings directly for language modeling. Therefore, in order to make a fair comparison, for every set of trained embeddings, we fix them as input embeddings for word2vec, then learn the remaining input embeddings (words not in the relations) and all the output embeddings using cbow. Since this involves running cbow on NYT data for 2 iterations (one iteration for word2vec training/pre-training/joint modeling and the other for tuning the language model), we use Joint-r (random initialization) for a fair comparison.

  Model                               NCE     HS
  word2vec (cbow)                     8.75    6.90
  RCM-p                               8.55    7.07
  Joint-r (α_RCM = 1 × 10^-2)         8.33    6.87
  Joint-r (α_RCM = 1 × 10^-3)         8.20    6.75
  Joint→RCM                           8.40    6.92

Table 2: LM evaluation on held out NYT data.

Table 2 shows the results for language modeling on test data. All of our proposed models improve over the baseline in terms of perplexity when NCE is used for training LMs. When HS is used, the perplexities are greatly improved. However, in this situation only the joint models improve the results, and Joint→RCM performs similar to the baseline, although it is not designed for language modeling. We include the optimal $\alpha_{\text{RCM}}$ in the table, while setting $\alpha_{\text{cbow}} = 0.025$ (the default setting of word2vec). Even when our goal is to strictly model the raw text corpus, we obtain improvements by injecting semantic information into the objective: RCM can effectively shift learning to obtain more informative embeddings.
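For reference, a minimal sketch of the exact-softmax perplexity computation described above. The helper `prob_fn` stands in for any conditional model, e.g. a wrapper around the `cbow_prob` sketch from Section 2.1 with fixed embedding matrices; the full-vocabulary summation inside it is exactly the cost that hierarchical classification avoids. This is an assumed helper, not the evaluation code used in the paper.

```python
import numpy as np

def perplexity(prob_fn, sentences, c=5):
    """Exact perplexity over held-out token ids (Section 4.1).

    prob_fn   : function(context_ids, target_id) -> probability of the target word.
    sentences : iterable of lists of word ids.
    c         : context window size on each side.
    """
    total_logp, n_tokens = 0.0, 0
    for sent in sentences:
        for t, w in enumerate(sent):
            context = sent[max(0, t - c):t] + sent[t + 1:t + 1 + c]
            if not context:
                continue
            total_logp += np.log(prob_fn(context, w))
            n_tokens += 1
    return float(np.exp(-total_logp / n_tokens))
```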
4.2 Measuring Semantic Similarity

Our next task is to find semantically related words using the embeddings, evaluating on relations from PPDB and WordNet. For each of the word pairs ⟨A, B⟩ in the evaluation set, we use the cosine distance between the embeddings to score A with a candidate word B'. We use a large sample of candidate words (10k, 30k or 100k) and rank all candidate words for pairs where B appears in the candidates. We then measure the rank of the correct B to compute mean reciprocal rank (MRR). Our goal is to use word A to select word B as the closest matching word from the large set of candidates. Using this strategy, we evaluate the embeddings from all of our objectives and measure which embedding most accurately selects the true correct word.

Table 3 shows MRR results for both PPDB and WordNet dev and test datasets for all models. All of our methods improve over the baselines in nearly every test set result. In nearly every case, Joint→RCM obtained the largest improvements. Clearly, our embeddings are much more effective at capturing semantic similarity.

                          PPDB Dev             PPDB Test            WordNet Dev          WordNet Test
  Model                 10k   30k   100k     10k   30k   100k     10k   30k   100k     10k   30k   100k
  word2vec (cbow)       49.68 39.26 29.15    49.31 42.53 30.28    10.24  8.64  5.14    10.04  7.90  4.97
  word2vec (skip-gram)  48.70 37.14 26.20    -     -     -         8.61  8.10  4.62    -     -     -
  RCM-r                 55.03 42.52 26.05    -     -     -        13.33  9.05  5.29    -     -     -
  RCM-p                 61.79 53.83 40.95    65.42 55.82 41.20    15.25 12.13  7.46    14.13 11.23  7.39
  Joint-r               59.91 50.87 36.81    -     -     -        15.73 11.36  7.14    13.97 10.51  7.44
  Joint-p               59.75 50.93 37.73    64.30 53.27 38.97    15.61 11.20  6.96    -     -     -
  Joint→RCM             64.22 54.99 41.34    68.20 57.87 42.64    16.81 11.67  7.55    16.16 11.21  7.56

Table 3: MRR for semantic similarity on PPDB and WordNet dev and test data. Higher is better. All RCM objectives are trained with PPDB XXL. To preserve test data integrity, only the best performing setting of each model is evaluated on the test data.
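The MRR numbers in Table 3 come from the ranking procedure described in Section 4.2. Below is a compact sketch of that evaluation, with illustrative names (`E_out`, `pairs`, `candidate_ids`) rather than the paper's actual evaluation script.

```python
import numpy as np

def mean_reciprocal_rank(E_out, pairs, candidate_ids):
    """MRR of the true B among the candidates, ranked by cosine similarity to A.

    E_out         : (V, d) output embeddings (the paper measures similarity with these).
    pairs         : list of (a_id, b_id) pairs from a PPDB/WordNet dev or test split.
    candidate_ids : the 10k/30k/100k candidate word ids; pairs whose B is not a
                    candidate are skipped, as in the paper.
    """
    cand = np.asarray(candidate_ids)
    cand_set = set(candidate_ids)
    C = E_out[cand] / np.linalg.norm(E_out[cand], axis=1, keepdims=True)
    reciprocal_ranks = []
    for a, b in pairs:
        if b not in cand_set:
            continue
        q = E_out[a] / np.linalg.norm(E_out[a])
        sims = C @ q                                  # cosine similarity of A to each candidate
        ranked = cand[np.argsort(-sims)]              # candidates from most to least similar
        rank = int(np.where(ranked == b)[0][0]) + 1   # 1-based rank of the correct B
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```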

4.3 Human Judgements

Our final evaluation is to predict human judgements of semantic relatedness. We have pairs of words from PPDB scored by annotators on a scale of 1 to 5 for quality of similarity. Our data are the judgements used by Ganitkevitch et al. (2013), which we filtered to include only those pairs for which we learned embeddings, yielding 868 pairs. We assign a score using the dot product between the output embeddings of each word in the pair, then order all 868 pairs according to this score. Using the human judgements, we compute the swapped pairs rate: the ratio between the number of swapped pairs and the number of all pairs. For a pair $p$ scored $y_p$ by the embeddings and judged $\hat{y}_p$ by an annotator, the swapped pairs rate is:

$$\frac{\sum_{p_1, p_2 \in D} \mathbb{I}\left[(y_{p_1} - y_{p_2})\,(\hat{y}_{p_2} - \hat{y}_{p_1}) < 0\right]}{\sum_{p_1, p_2 \in D} \mathbb{I}\left[y_{p_1} \ne y_{p_2}\right]} \qquad (4)$$

where $\mathbb{I}[x]$ is 1 when $x$ is true.

  Model              Swapped Pairs Rate
  word2vec (cbow)    17.81
  RCM-p              16.66
  Joint-r            16.85
  Joint-p            16.96
  Joint→RCM          16.62

Table 4: Results for ranking the quality of PPDB pairs as compared to human judgements.

Table 4 shows that all of our models obtain reductions in error as compared to the baseline (cbow), with Joint→RCM obtaining the largest reduction. This suggests that our embeddings are better suited for semantic tasks, in this case judged by human annotations.

4.4 Analysis

We conclude our experiments with an analysis of modeling choices. First, pre-training RCM models gives significant improvements in both measuring semantic similarity and capturing human judgements (compare "p" vs. "r" results). Second, the number of relations used for RCM training is an important factor. Table 5 shows the effect on dev data of using various numbers of relations. While we see improvements from XL to XXL (5 times as many relations), we get worse results on XXXL, likely because this set contains the lowest quality relations in PPDB. Finally, Table 6 shows different learning rates $\alpha_{\text{RCM}}$ for the RCM objective.

                             PPDB Dev
  Model    Relations     10k    30k    100k
  RCM-r    XL            24.02  15.26   9.55
  RCM-p    XL            54.97  45.35  32.95
  RCM-r    XXL           55.03  42.52  26.05
  RCM-p    XXL           61.79  53.83  40.95
  RCM-r    XXXL          51.00  44.61  28.42
  RCM-p    XXXL          53.01  46.35  34.19

Table 5: MRR on PPDB dev data for training on an increasing number of relations.

                               PPDB Dev
  Model    α_RCM           10k    30k    100k
  Joint-p  1 × 10^-1       47.17  36.74  24.50
           5 × 10^-2       54.31  44.52  33.07
           1 × 10^-2       59.75  50.93  37.73
           1 × 10^-3       57.00  46.84  34.45

Table 6: Effect of learning rate α_RCM on MRR for the RCM objective in Joint models.

The baseline word2vec and the joint model have nearly the same averaged running times (2,577s and 2,644s respectively), since they have the same number of threads for the cbow objective and the joint model uses additional threads for the RCM objective. The RCM models are trained with a single thread for 100 epochs. When trained on the PPDB-XXL data, RCM spends 2,931s on average.

5 Conclusion

We have presented a new learning objective for neural language models that incorporates prior knowledge contained in resources to improve learned word embeddings. We demonstrated that the Relation Constrained Model can lead to better semantic embeddings by incorporating resources like PPDB, yielding better language modeling, better semantic similarity metrics, and better prediction of human semantic judgements. Our implementation is based on the word2vec package and we have made it available for general use (https://github.com/Gorov/JointRCM).

We believe that our techniques have implications beyond those considered in this work. We plan to explore the embeddings' suitability for other semantics tasks, including the use of resources with both typed and scored relations. Additionally, we see opportunities for jointly learning embeddings across many tasks with many resources, and plan to extend our model accordingly.

Acknowledgements

Yu is supported by the China Scholarship Council and by NSFC 61173073.

References
Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. 2007. Greedy layer-wise training of deep networks. In Neural Information Processing Systems (NIPS).

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In International Conference on Artificial Intelligence and Statistics, pages 127–135.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning (ICML).

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research (JMLR), 11:625–660.

Christiane Fellbaum. 1999. WordNet. Wiley Online Library.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In North American Chapter of the Association for Computational Linguistics (NAACL).

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Association for Computational Linguistics (ACL), pages 873–882.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Conference on Natural Language Learning (CoNLL).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In International Conference on Machine Learning (ICML).

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition. Technical report, Linguistic Data Consortium.

Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering. In Association for Computational Linguistics (ACL).

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL).

Benjamin Van Durme and Ashwin Lall. 2010. Online generation of locality sensitive hash signatures. In Association for Computational Linguistics (ACL), pages 231–235.