
Semantic Information Extraction for Improved Word Embeddings

Jiaqiang Chen, IIIS, Tsinghua University, Beijing, China ([email protected])
Gerard de Melo, IIIS, Tsinghua University, Beijing, China ([email protected])

Abstract

Word embeddings have recently proven useful in a number of different applications that deal with natural language. Such embeddings succinctly reflect semantic similarities between words based on their sentence-internal contexts in large corpora. In this paper, we show that information extraction techniques provide valuable additional evidence of semantic relationships that can be exploited when producing word embeddings. We propose a joint model to train word embeddings both on regular context information and on more explicit semantic extractions. The word vectors obtained from such an augmented joint training show improved results on word similarity tasks, suggesting that they can be useful in applications that involve word meanings.

(This research was partially funded by China 973 Program Grants 2011CBA00300, 2011CBA00301, and NSFC Grants 61033001, 61361136003, 20141330245.)

1 Introduction

In recent years, the idea of embedding words in a vector space has gained enormous popularity. The success of such word embeddings as semantic representations has been driven in part by the development of novel methods to efficiently train word vectors from large corpora, such that words with similar contexts end up having similar vectors. While it is indisputable that context plays a vital role in meaning acquisition, it seems equally plausible that some contexts would be more helpful for this than others. Consider the following sentence, taken from Wikipedia, a commonly used training corpus for word representation learning:

    Although Roman political authority in the West was lost, Roman culture would last in most parts of the former Western provinces into the 6th century and beyond.

In this example sentence, the token "parts" does not seem to bear any particularly close relationship with the meaning of some of the other tokens, e.g. "Roman" and "culture". In contrast, the occurrence of an expression such as "Greek and Roman mythology" in a corpus appears to indicate that the two tokens "Roman" and "Greek" likely share certain commonalities. There is a large body of work on information extraction techniques to discover text patterns that reflect semantic relationships (Hearst, 1992; Tandon and de Melo, 2010).

In this paper, we propose injecting semantic information into word embeddings by training them not just on general contexts but paying special attention to stronger semantic connections that can be discovered in specific contexts on the Web or in corpora. In particular, we investigate mining information of this sort from enumerations and lists, as well as from definitions. Our training procedure can exploit any source of knowledge about pairs of words being strongly coupled to improve over word embeddings trained just on generic corpus contexts.

2 Background and Related Work

Words are substantially discrete in nature, and thus, traditionally, the vast majority of natural language processing tools, both rule-based and statistical, have regarded words as distinct atomic symbols.


Even methods that rely on vectors typically made use of so-called "one-hot" representations, which allocate a separate dimension in the vector space for every content word in the vocabulary. Such representations suffer from two problems. First, any two distinct word forms have entirely distinct vectors without any overlap, which means that the vector similarities for any two distinct individual word forms will fail to reflect any possible syntactic or semantic similarities between them. Second, the vector space dimensionality is proportional to the vocabulary size, which can be very large. For instance, the Google 1T corpus has 13M distinct words.

To address these two problems, other representations have been proposed. Brown clustering (Brown et al., 1992) organizes words into a binary tree based on the contexts in which they occur. Latent Semantic Analysis and Indexing (LSA/LSI) use singular value decomposition (SVD) to identify the relationships between words in a corpus. Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a generative graphical model, views each document as a collection of topics and assigns each word to these topics.

Recently, neural networks have been applied to learn word embeddings in dense real-valued vector spaces. In training, such an approach may combine vector space semantics with predictions from probabilistic models. For instance, Bengio et al. (2003) present a neural probabilistic language model that uses the n-gram model to learn word embeddings. The network tries to use the first n-1 words to predict the next one, outperforming n-gram frequency baselines. Collobert et al. (2011) use word embeddings for traditional NLP tasks: POS tagging, named entity recognition, chunking, and semantic role labeling. Their pairwise ranking approach tries to maximize the difference between scores from text windows in a large training corpus and corresponding randomly generated negative examples. However, the training for this took about one month. The next breakthrough came with Mikolov et al. (2013a), who determined that, for the previous models, most of the complexity is caused by the non-linear hidden layer. The authors thus investigated simpler network architectures to efficiently train the vectors at a much faster rate and thus also at a much larger scale.

Their word2vec implementation (https://code.google.com/p/word2vec/) provides two architectures, the CBOW and the Skip-gram models. CBOW also relies on a window approach, attempting to use the surrounding words to predict the current target word. However, it simplifies the hidden layer to be just the average of the surrounding words' embeddings. The Skip-gram model tries to do the opposite: it uses the current word to predict the surrounding words. Both architectures can be trained in just a few hours, while obtaining state-of-the-art embeddings.
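The CBOW update can be made concrete with a short sketch (Python with NumPy; this is an illustrative reimplementation, not the word2vec code, and the sizes and learning rate are arbitrary assumptions): the hidden layer is just the average of the context vectors, which is then scored against the target word and a few random negative words.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10000, 100                          # toy vocabulary size and dimensionality
W_in = (rng.random((V, dim)) - 0.5) / dim    # input (context) vectors
W_out = np.zeros((V, dim))                   # output (target) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_step(context_ids, target_id, lr=0.05, k=10):
    """One CBOW update with negative sampling."""
    h = W_in[context_ids].mean(axis=0)       # hidden layer = average of context embeddings
    grad_h = np.zeros(dim)
    samples = [(target_id, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(k)]
    for wid, label in samples:
        f = sigmoid(np.dot(h, W_out[wid]))
        g = lr * (label - f)                 # gradient of the log-loss w.r.t. the score
        grad_h += g * W_out[wid]
        W_out[wid] += g * h
    W_in[context_ids] += grad_h / len(context_ids)   # push each context vector

cbow_step([3, 17, 42, 99], target_id=7)      # words in the window predict the target
```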
Distributed word representations have now been applied to numerous natural language processing tasks. For instance, they have been used for sentiment analysis (Socher et al., 2013), paraphrase detection (Socher et al., 2011), machine translation (Devlin et al., 2014), and relation extraction (Chang et al., 2014), just to name a few. Some of these works use neural network models, e.g. recursive neural networks, auto-encoders, or convolutional neural networks. Others use word embeddings directly as features for clustering or classification with alternative algorithms.

There have been other proposals to adapt the model. Similar to previous work on semantic spaces based on dependency parse relations (Padó and Lapata, 2007), Levy and Goldberg (2014) rely on dependency parsing to create word embeddings. These are able to capture contextual relationships between words that are further apart in the sentence while simultaneously filtering out some words that are not directly related to the target word. Further analysis revealed that their word embeddings capture more functional but less topical similarity. Faruqui et al. (2015) apply post-processing steps to existing word embeddings in order to bring them more in accordance with semantic lexicons such as PPDB and FrameNet. Wang et al. (2014) train embeddings jointly on text and on Freebase, a well-known large knowledge base. Their embeddings are trained to preserve relations between entities in the knowledge graph. Rather than using structured knowledge sources, our work focuses on improving word embeddings using textual data by relying on information extraction to expose particularly valuable contexts in a corpus.

3 Joint Model

Our model simultaneously trains the word embeddings on generic contexts from the corpus on the one hand and semantically significant contexts, obtained using extraction techniques, on the other hand. For the regular general contexts, our approach draws on the word2vec CBOW model (Mikolov et al., 2013a) to predict a word given its surrounding neighbors in the corpus.

At the same time, our model relies on our ability to extract semantically salient contexts that are more indicative of word meanings. Our algorithm assumes that these have been transformed into a set of word pairs known to be closely related. These pairs of related words are used to modify the word embeddings by jointly training them simultaneously with the word2vec model for regular contexts. Due to this more focused information, we expect the final word embeddings to reflect more semantic information than embeddings trained only on regular contexts.

Given an extracted pair of semantically related words, the intuition is that the embeddings for the two words should be pulled together. We formalize this intuition with the following objective:

    (1/T) Σ_{t=1}^{T} Σ_{w_r} log p(w_r | w_t)

Here, w_r is a word related to another word w_t according to the extractions, and T is the vocabulary size.

Thus, given a word w_t, we try to maximize the probability of finding its related words w_r. Traditionally, the softmax function is used for the probability function. Its time complexity is proportional to the vocabulary size. Here, we use negative sampling as a speed-up technique (Mikolov et al., 2013b). This is a simplified version of Noise Contrastive Estimation (NCE) (Mnih and Teh, 2012), which reduces the problem of determining the softmax to that of binary classification, discriminating between samples from the data distribution and negative samples.

In the training procedure, this amounts to simply generating k random negative samples for each extracted word pair. That is, we replace w_r with random words from the vocabulary. For the negative samples, we assign the label l = 0, while for the original word pairs, l = 1. Now, for each word pair we try to minimize its loss function:

    Loss = -l · log f - (1 - l) · log(1 - f),   where f = σ(v_{w_t}^T v_{w_r})

Here, σ(·) is the sigmoid function σ(x) = 1 / (1 + e^{-x}), and v_{w_t}, v_{w_r} refer to the vectors for the two words w_t and w_r. We use stochastic gradient descent to optimize this function. The formulae for the gradient are easy to compute:

    ∂Loss/∂v_{w_r} = -(l - f) · v_{w_t}
    ∂Loss/∂v_{w_t} = -(l - f) · v_{w_r}

This objective is optimized alongside the original word2vec CBOW objective. Our overall model combines the two objectives. Training in parallel with the word2vec model allows us to inject the extracted knowledge into the word vectors such that it is reflected during the word2vec training rather than just in a post-processing step. Thus the two components are able to mutually influence each other.

Both objectives contribute to the embeddings' ability to capture semantic relationships. Training with the extracted contexts enables us to adjust word embeddings based on concrete evidence of semantic relationships, while the use of general corpus contexts enables us to maintain the advantages of the word2vec model, in particular its ability to benefit from massive volumes of raw corpus data.
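For concreteness, the update for a single extracted pair can be sketched in a few lines (Python with NumPy; this is not the authors' implementation, and it assumes a single shared embedding matrix E, whereas the original system works inside the multi-threaded word2vec code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_step(E, wt, wr, vocab_size, lr=0.02, k=10):
    """One SGD step for an extracted pair (wt, wr) plus k negative samples.

    Implements Loss = -l*log(f) - (1-l)*log(1-f) with f = sigmoid(E[wt] . E[w]),
    so each update direction is lr * (l - f) times the other word's vector.
    """
    samples = [(wr, 1.0)] + [(int(rng.integers(vocab_size)), 0.0) for _ in range(k)]
    for w, label in samples:
        f = sigmoid(np.dot(E[wt], E[w]))
        g = lr * (label - f)
        grad_wt = g * E[w]        # corresponds to -dLoss/dv_wt = (l - f) * v_w
        E[w] += g * E[wt]         # corresponds to -dLoss/dv_wr = (l - f) * v_wt
        E[wt] += grad_wt

E = (rng.random((10000, 100)) - 0.5) / 100   # toy-sized embedding matrix
pair_step(E, wt=12, wr=345, vocab_size=E.shape[0])
```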
4 Information Extraction

Our model can flexibly incorporate semantic relationships extracted using various kinds of information extraction methods. Different kinds of sources and extraction methods can bring different sorts of information to the vectors, suitable for different applications. In our experiments, we investigate two sources: a dictionary corpus from which we extract definitions and synonyms, and a general Web corpus, from which we extract lists. Our model could similarly be used with other extraction methods, or in fact any method to mine pairs of semantically related words.

word w_t      | definition w_r for w_t
Befuddle      | to becloud and confuse as with liquor
Befuddled     | dazed by alcoholic drink
Befuddled     | unclear in mind or intent; filled with bewilderment
Befuddled     | confused and vague, used especially of thinking
Beg           | to ask earnestly for, to entreat, or supplicate for, to beseech

word w_t         | synonym w_r for w_t
Effectual        | effectual, efficacious, effective
Effectuality     | effectiveness, effectivity, effectualness
Efficacious      | effectual
Efficaciousness  | efficacy

Table 1: Definitions and synonyms from the GCIDE

player, captain, manager, director, vice-chairman
group, race, culture, religion, organisation, person
Italian, Mexican, Chinese, Creole, French
prehistoric, roman, early-medieval, late-medieval, post-medieval, modern
ballscrews, leadscrews, worm gear, screwjacks, linear, actuator
Cleveland, Essex, Lincolnshire, Northamptonshire, Nottinghamshire, Thames Valley, South Wales

Table 2: Lists of related words extracted from UKWaC

4.1 Definition Extraction

One can safely assume that any large, broad-coverage Web corpus will contain significant occurrences of word definitions, e.g. whenever new terminology is introduced. These can be harvested using broad-coverage Definition Extraction methods (Sierra et al., 2009).

Instead of adopting such generic methods that are intended to operate on arbitrary text, another option, appropriate for Web corpora, is to specifically identify the kinds of Web sources that provide high-quality definitions, e.g. Wiktionary or Wikipedia. In fact, when compiling a Web corpus with the explicit purpose of using it to train word representations, one may reasonably wish to explicitly ensure that it includes dictionaries available on the Web. Obviously, the definitions from a dictionary can provide meaningful semantic relationships between words.

In our experiments, we use the GNU Collaborative International Dictionary of English (GCIDE) as our dictionary corpus, which is derived from an older edition of Webster's Revised Unabridged Dictionary. From this data, we extract the dictionary glosses as genuine definitions as well as synonyms. In the dictionary, they are indexed by the <def> and <syn> tags. We ignore other embedded tags within the definitions and synonym entries; these provide additional word usage notes and other attributes that are not significant in our work. In total, we obtain 208,881 definition entries. Some words have multiple meanings and thus are part of several entries. We also obtain 10,148 synonym entries, each of which consists of one or more synonyms for a given word. Table 1 shows some examples of this extraction. We can observe that the definition and synonym extractions indeed appear to convey valuable information about semantic proximity of words.
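A much simplified extractor in this spirit could look as follows (Python; the <def>/<syn> tag handling mirrors the description above, but the real GCIDE markup is considerably richer, and the stopword filtering is our own assumption rather than something the paper specifies). It turns each dictionary entry into (headword, related word) training pairs.

```python
import re

STOPWORDS = frozenset({"the", "a", "an", "of", "to", "or", "and", "by", "with", "as", "in", "for"})

def extract_pairs(entry_xml, headword):
    """Yield (headword, word) pairs from <def> and <syn> elements of one entry."""
    pairs = []
    for tag in ("def", "syn"):
        for chunk in re.findall(r"<%s>(.*?)</%s>" % (tag, tag), entry_xml, flags=re.S):
            text = re.sub(r"<[^>]+>", " ", chunk)          # drop any embedded tags
            for word in re.findall(r"[A-Za-z-]+", text.lower()):
                if word not in STOPWORDS and word != headword.lower():
                    pairs.append((headword.lower(), word))
    return pairs

# Example in the spirit of Table 1 (hypothetical markup):
print(extract_pairs("<def>dazed by alcoholic drink</def>", "Befuddled"))
# [('befuddled', 'dazed'), ('befuddled', 'alcoholic'), ('befuddled', 'drink')]
```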

4.2 List Extraction

Lists and enumerations are another promising source of information. Words that occur together within a list are not just semantically connected but often even of the same type. These sorts of contexts thus also have the potential to improve the word embeddings. We extract them from the UKWaC corpus (Baroni et al., 2009), a general broad-coverage corpus crawled from the Web, but limited to the .uk domain. After post-crawl cleaning, it contains a total of about 2 billion words. It is annotated with POS and dependency tags, which simplifies our work of extracting high-quality lists.

To extract lists of similar words, we use a simple rule-based method. We first search for continuous appearances of commas, which indicate a possible list of similar things. To filter out noise, we require that the entries in the list be approximately of equal length. The length of each entry should be in the range from 1 to 4. Longer entries are much more likely to be short sentences or clauses, which are not very useful when our aim is to obtain lists of similar words. We also restrict list items to be nouns and adjectives using the POS tags provided with the UKWaC.

Additionally, we rely on special search patterns matching, for instance, "include A, B, C, (and) D", "A and(or) B", "cities (or other nouns) like A, B, C, D", "cities (or other nouns) such as A, B, C, (and) D", etc. Here, the letters A, B, and so on refer to the extracted target words, while other words and punctuation that merely indicate the occurrence of the lists, e.g. commas or the word "and", are removed.

In total, 339,111 lists are extracted from the UKWaC, examples of which are shown in Table 2. We see that although there is some noise, the list extraction also captures semantically meaningful relationships. The words in the lists tend to be of the same or similar type and represent similar or related things.
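The comma and "such as" heuristics can be approximated with a few regular expressions (Python; this simplified sketch omits the POS filtering and the equal-length requirement, and handles only one of the special patterns, so it is an illustration rather than a faithful reimplementation of the extractor):

```python
import re

MAX_ENTRY_LEN = 4  # entries longer than this are likely clauses, not list items

def comma_lists(sentence):
    """Find runs of comma-separated items whose entries are 1-4 tokens long."""
    lists = []
    for run in re.findall(r"(?:[\w-]+(?:\s+[\w-]+){0,3}\s*,\s*){2,}[\w-]+(?:\s+[\w-]+){0,3}", sentence):
        items = [item.strip() for item in run.split(",")]
        if all(1 <= len(item.split()) <= MAX_ENTRY_LEN for item in items):
            lists.append(items)
    return lists

def such_as_lists(sentence):
    """Handle patterns like 'counties such as A, B and C' (single-word items only)."""
    lists = []
    for match in re.finditer(r"\bsuch as\s+((?:[\w-]+(?:,\s*|\s+and\s+))+[\w-]+)", sentence):
        items = re.split(r",\s*|\s+and\s+", match.group(1))
        lists.append([item for item in items if item])
    return lists

print(comma_lists("player, captain, manager, director, vice-chairman"))
print(such_as_lists("counties such as Essex, Lincolnshire and Northamptonshire"))
```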
5 Experiments and Evaluation

In order to investigate the impact of extractions on word embeddings, we conduct an empirical analysis based on semantic relatedness assessments.

5.1 Data

Our model relies on two types of input. For semantically salient contexts, we rely on the data and extraction techniques described above in Section 4 to obtain pairs of related words.

For the regular contexts used by the CBOW model, we rely on a 2010 Wikipedia data set (http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2). We normalize the text to lower case and remove special characters. After preprocessing, it contains 1,205,009,210 tokens. We select words appearing at least 50 times and obtain a vocabulary of 220,521 words.

5.2 Training

Having obtained two kinds of extracted word contexts, we use these separately to train word embeddings jointly with the word2vec model. Our training implementation relies on a multi-threaded architecture in which some threads optimize for the original word2vec objective, training on different parts of the corpus. At the same time, alongside these threads, further threads optimize based on the extracted pairs of words using the objective given earlier. All threads asynchronously update the word embeddings using stochastic gradient descent steps. Thus, both the raw corpus for the word2vec model and the related word pairs can bring their information to bear on the word embeddings.

We use 20 threads for the CBOW architecture, which runs faster than the Skip-gram model. The window size of the CBOW model is set to 8. We run it for 3 passes over the Wikipedia data set, which is sufficient to achieve good results. We sample 10 random words as negative examples for each instance.

Additional threads are used for the extracted pairs of words. We use 4 threads each for lists and definitions (by splitting the definitions) and one thread for synonyms. In each case, the extractions lead to positive pairs of semantically related words. For definitions and synonyms, the word pair consists of a headword and one word from its definition, or of the headword and one of its synonyms. For the list extraction setup, the training word pairs consist of any two words from the same list. For these word pairs, we also randomly sample 10 words as the corresponding negative examples. They update the word embeddings jointly with the CBOW model. This way, the semantic information they contain can be used to adjust the results from word2vec.

We use different learning rates to control each source's contribution to the final word embeddings. We set the initial learning rate for the CBOW threads to 0.050 and report results for different rates for the other threads, ranging from 0.001 to 0.1.

We stop training the word embeddings in the following way to ensure convergence: when all the CBOW threads finish, the other threads are terminated. This is because the extractions are supplementary to the CBOW model, which is the main source used to train the word vectors. This mode of operation also ensures that we are always training on both components of the model jointly rather than allowing one component to dominate towards the end. We also experimented with pre-defined numbers of iterations for the additional threads to control convergence, but the results were not very different.
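Structurally, the joint training loop can be pictured as follows (a Python threading sketch for illustration only: the actual implementation extends the multi-threaded word2vec C code, apply_cbow_update and apply_pair_update are hypothetical callbacks standing in for the updates described in Sections 2 and 3, and CPython's global interpreter lock would make this naive version slow in practice):

```python
import threading

def train_jointly(corpus_shards, pair_sources, apply_cbow_update, apply_pair_update,
                  cbow_lr=0.050, pair_lrs=None):
    """Structural sketch of the asynchronous joint training loop."""
    stop = threading.Event()

    def cbow_worker(shard):
        for window in shard:                        # (context_ids, target_id) tuples
            apply_cbow_update(window, cbow_lr)

    def pair_worker(pairs, lr):
        while not stop.is_set():                    # keep cycling over the extracted pairs
            for wt, wr in pairs:
                if stop.is_set():
                    return
                apply_pair_update(wt, wr, lr)

    cbow_threads = [threading.Thread(target=cbow_worker, args=(s,)) for s in corpus_shards]
    pair_threads = [threading.Thread(target=pair_worker,
                                     args=(p, (pair_lrs or {}).get(name, 0.02)))
                    for name, p in pair_sources.items()]   # lists, definitions, synonyms
    for t in cbow_threads + pair_threads:
        t.start()
    for t in cbow_threads:
        t.join()
    stop.set()                                      # stop pair threads once CBOW finishes
    for t in pair_threads:
        t.join()
```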

[Figure 1: Spearman's ρ for wordsim-353. Spearman correlation coefficient plotted against the learning rate α, with curves for "syn and def", "list", and the CBOW baseline.]

[Figure 3: Spearman's ρ for MEN. Spearman correlation coefficient plotted against the learning rate α, with curves for "syn and def", "list", and the CBOW baseline.]

5.3 Evaluation and Analysis

We use the wordsim-353 (Finkelstein et al., 2001) and MEN (Bruni et al., 2014) datasets to evaluate the semantic similarities reflected in the final word embeddings. Wordsim-353 and MEN are datasets of English word pairs with human-assigned similarity judgements. They are often used to train or test measures of word similarity. We calculate the cosine distance of the word embeddings for the word pairs in wordsim-353 and MEN and compare them to the scores from human annotations.

Fig. 1 shows the Spearman's correlation coefficients for the wordsim-353 dataset. Even for a learning rate α as low as 0.001 for the additional threads, we can obtain some improvement over the CBOW baseline (which corresponds to an α setting of 0.0). As α increases, the result gets better. The best result we get for synonyms and definitions is 0.706, while for lists from UKWaC, it is 0.693. The best learning rate for the definitions and synonyms is 0.020, while for the list extractions, it is 0.040. Both lead to noticeably better results than the CBOW baseline's correlation coefficient of 0.642. Note that for large α, the augmentation performs worse than the baseline. This is expected, as an overly high learning rate causes information from the related words to overwhelm the original CBOW model, leading to excessively biased final embeddings.

Fig. 3 plots the results on the MEN dataset. The best-performing learning rate is different from that for wordsim-353. In particular, well-performing learning rates are slightly smaller. For the definitions and synonyms, the best is 0.002, while for the lists, the best learning rate is 0.020. After training jointly, we obtain higher correlation coefficients for the MEN similarity task.

                         | wordsim-353 | MEN
CBOW (baseline)          | 0.642       | 0.736
Definitions and synonyms | 0.706       | 0.748
Lists                    | 0.693       | 0.749

Table 3: Best results for the word similarity tasks

Table 3 provides a summary of the best results that we obtain on the two similarity tasks. (Unfortunately, there is no independent tuning set from the same distribution, and thus we follow previous work in reporting the best results on the final set.) It can be seen that we obtain higher correlation coefficients for these tasks. This suggests that the word vectors capture more semantic properties of words and thus may be used in applications that benefit from semantic information.
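The evaluation procedure itself is easy to restate as code (Python with NumPy and SciPy; the embeddings dictionary and the list of human-judged pairs are assumed inputs rather than part of the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(embeddings, judged_pairs):
    """Spearman's rho between cosine similarities and human similarity scores.

    embeddings: dict mapping word -> numpy vector
    judged_pairs: list of (word1, word2, human_score), e.g. from wordsim-353 or MEN
    """
    cosines, human = [], []
    for w1, w2, score in judged_pairs:
        if w1 in embeddings and w2 in embeddings:   # skip out-of-vocabulary pairs
            v1, v2 = embeddings[w1], embeddings[w2]
            cosines.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human.append(score)
    rho, _ = spearmanr(cosines, human)
    return rho
```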

[Figure 2: A t-SNE visualization of the word embeddings, comparing the CBOW baseline (left panel, "CBOW embeddings") with that of our joint model (right panel, "jointly trained embeddings with definitions and synonyms"); the plotted points are word labels such as lively, cheerful, merry, lewd, levity, and volatility.]

Finally, we plot a sample of the word embeddings obtained from the joint training with definitions and synonyms using t-SNE (van der Maaten and Hinton, 2008). t-SNE is a technique for the visualization of high-dimensional datasets using dimensionality reduction. The perplexity of the Gaussian kernel for the t-SNE is set to 15. Figure 2 shows a plot for 26 words: levity, lewd, and merry, and their synonyms. We see that our model successfully distinguishes the different meanings of these words while reflecting semantic relationships.

6 Conclusion

We have presented a way to improve word embeddings by drawing on the idea that certain contexts exhibit more semantically meaningful information than others. Information extraction methods allow us to discover such contexts and mine semantic relationships between words. We focus on word definitions and synonyms, as well as on lists and enumerations. The final word embeddings after joint training show better correlation coefficients in similarity tasks. This suggests that information extraction methods can help word vectors capture more meaning, making them useful for semantic applications and calling for further research in this area.

References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist., 18(4):467–479, December.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Int. Res., 49(1):1–47, January.

Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1568–1579. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, Maryland, June. Association for Computational Linguistics.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: the concept revisited. In Proceedings of the Tenth International World Wide Web Conference.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING, pages 539–545.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Comput. Linguist., 33(2):161–199, June.

Gerardo Sierra, Maria Pozzi, and Juan-Manuel Torres, editors. 2009. WDE '09: Proceedings of the 1st Workshop on Definition Extraction, Stroudsburg, PA, USA. Association for Computational Linguistics.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, NIPS, pages 801–809.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Stroudsburg, PA, October. Association for Computational Linguistics.

Niket Tandon and Gerard de Melo. 2010. Information extraction from web-scale n-gram data. In Chengxiang Zhai, David Yarowsky, Evelyne Viegas, Kuansan Wang, and Stephan Vogel, editors, Web N-gram Workshop. Workshop of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, volume 5803, pages 8–15. ACM.

Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591–1601, Doha, Qatar, October. Association for Computational Linguistics.
