Arxiv:1904.05033V1

Better Word Embeddings by Disentangling Contextual n-Gram Information Prakhar Gupta* Matteo Pagliardini* Martin Jaggi EPFL, Switzerland Iprova SA, Switzerland EPFL, Switzerland [email protected] [email protected] [email protected] Abstract augmenting word-context pairs with sub-word information in the form of character n-grams Pre-trained word vectors are ubiquitous in Natural Language Processing applications. In (Bojanowski et al., 2017), especially for morpho- this paper, we show how training word em- logically rich languages. Nevertheless, to the best beddings jointly with bigram and even trigram of our knowledge, no method has been introduced embeddings, results in improved unigram em- leveraging collocations of words with higher order beddings. We claim that training word embed- word n-grams such as bigrams or trigrams as well dings along with higher n-gram embeddings as character n-grams together. helps in the removal of the contextual infor- In this paper, we show how using higher order mation from the unigrams, resulting in better stand-alone word embeddings. We empirically word n-grams along with unigrams during training show the validity of our hypothesis by outper- can significantly improve the quality of obtained forming other competing word representation word embeddings. The addition furthermore helps models by a significant margin on a wide va- to disentangle contextual information present in riety of tasks. We make our models publicly the training data from the unigrams and results in available. overall better distributed word representations. 1 Introduction To validate our claim, we train two mod- ifications of CBOW augmented with word-n- Distributed word representations are essential gram information during training. One is a building blocks of modern NLP systems. Used recent sentence embedding method, Sent2Vec as features in downstream applications, they often (Pagliardini et al., 2018), which we repurpose to enhance generalization of models trained on a lim- obtain word vectors. The second method we pro- ited amount of data. They do so by capturing rel- pose is a modification of CBOW enriched with evant distributional information about words from character n-gram information (Bojanowski et al., large volumes of unlabeled text. 2017) that we again augment with word n-gram Efficient methods to learn word vectors have information. In both cases, we compare the result- been introduced in the past, most of them based ing vectors with the most widely used word em- on the distributional hypothesis of Harris (1954); bedding methods on word similarity and analogy arXiv:1904.05033v1 [cs.CL] 10 Apr 2019 Firth (1957): “a word is characterized by the com- tasks and show significant quality improvements. pany it keeps”. While a standard approach re- The code used to train the models presented in this lies on global corpus statistics (Pennington et al., paper as well as the models themselves are made 2014) formulated as a matrix factorization us- available to the public1. ing mean square reconstruction loss, other widely used methods are the bilinear word2vec architec- 2 Model Description tures introduced by Mikolov et al. (2013a): While skip-gram aims to predict nearby words from a Before introducing our model, we recapitulate given word, CBOW predicts a target word from fundamental existing word embeddings methods. its set of context words. CBOW and skip-gram models. Continuous Recently, significant improvements in the qual- bag-of-words (CBOW) and skip-gram models are ity of the word embeddings were obtained by 1publicly available on * indicates equal contribution http://github.com/epfml/sent2vec standard log-bilinear models for obtaining word statistics by factorizing the word-context co- embeddings based on word-context pair informa- occurrence matrix. tion (Mikolov et al., 2013a). Context here refers to Ngram2vec. In order to leverage the perfor- a symmetric window centered on the target word mance of word vectors, training of word vec- wt, containing the surrounding tokens at a distance tors using the skip-gram objective function with less than some window size ws: Ct = {wk | k ∈ negative sampling is augmented with n-gram co- [t − ws,t + ws]}. The CBOW model tries to pre- occurrence information (Zhao et al., 2017). dict the target word given its context, maximizing T the likelihood Qt=1 p(wt|Ct), whereas skip-gram 2.1 Improving unigram embeddings by learns by predicting the context for a given target adding higher order word-n-grams to T word maximizing Qt=1 p(Ct|wt). To model those contexts probabilities, a softmax activation is used on top CBOW-char with word n-grams. We propose to of the inner product between a target vector uwt 1 augment CBOW-char to additionally use word n- and its context vector vw. |Ct| Pw∈Ct gram context vectors (in addition to char n-grams To overcome the computational bottleneck of and the context word itself). More precisely, dur- the softmax for large vocabulary, negative sam- ing training, the context vector for a given word w pling or noise contrastive estimation are well- t is given by the average of all word-n-grams N , all established (Mikolov et al., 2013b), with the idea t char-n-grams, and all unigrams in the span of the of employing simpler pairwise binary classifier current context window C : loss functions to differentiate between the valid t context Ct and fake contexts NCt sampled at ran- vw + vn + vc dom. While generating target-context pairs, both v := Pw∈Ct Pn∈Nt Pw∈Ct Pc∈Ww |C | + |N | + |W | CBOW and skip-gram also use input word sub- t t Pw∈Ct w (1) sampling, discarding higher-frequency words with For a given sentence, we apply input subsam- higher probability during training, in order to pre- pling and a sliding context window as for standard vent the model from overfitting the most frequent CBOW. In addition, we keep the mapping from tokens. Standard CBOW also uses a dynamic the subsampled sentence to the original sentence context window size: for each subsampled tar- for the purpose of extracting word n-grams from get word w, the size of its associated context the original sequence of words, within the span window is sampled uniformly between 1 and ws of the context window. Word n-grams are added (Mikolov et al., 2013b). to the context using the hashing trick in the same Adding character n-grams. Bojanowski et al. way char-n-grams are handled. We use two dif- (2017) have augmented CBOW and skip-gram by ferent hashing index ranges to ensure there is no adding character n-grams to the context represen- collision between char n-gram and word n-gram tations. Word vectors are expressed as the sum representations. of its unigram and average of its character n-gram Sent2Vec for word embeddings. Initially im- embeddings W : w plemented for sentence embeddings, Sent2Vec 1 (Pagliardini et al., 2018) can be seen as a deriva- v := vw + X vc tive of word2vec’s CBOW. The key differences |Ww| c∈Ww between CBOW and Sent2Vec are the removal of Character n-grams are hashed to an index in the input subsampling, considering the entire sen- the embedding matrix . The training remains the tence as context, as well as the addition of word- same as for CBOW and skip-gram. This approach n-grams. greatly improves the performances of CBOW and Here, word and n-grams embeddings from an skip-gram on morpho-syntactic tasks. For the rest entire sentence are averaged to form the corre- of the paper, we will refer to the CBOW and skip- sponding sentence (context) embedding. gram methods enriched with subword-information For both proposed CBOW-char and Sent2Vec as CBOW-char and skip-gram-char respectively. models, we employ dropout on word n-grams dur- GloVe. Instead of training online on local win- ing training. For both models, word embeddings dow contexts, GloVe vectors (Pennington et al., are obtained by simply discarding the higher order 2014) are trained using global co-occurrence n-gram embeddings after training. Model WS 353 WS 353 Relatedness WS 353 Similarity CBOW-char .709 ± .006 .626 ± .009 .783 ± .004 CBOW-char + bi. .719 ± .007 .652 ± .010 .778 ± .007 CBOW-char + bi. + tri. .727 ± .008 .664 ± .008 .783 ± .004 Sent2Vec uni. .705 ± .004 .593 ± .005 .793 ± .006 Sent2Vec uni. + bi. .755 ± .005 .683 ± .008 .817 ± .007 Sent2Vec uni. + bi. + tri. .780 ± .003 .721 ± .006 .828 ± .003 Model SimLex-999 MEN Rare Words Mechanical Turk CBOW-char .424 ± .004 .769 ± .002 .497 ± .002 .675 ± .007 CBOW-char + bi. .436 ± .004 .786 ± .002 .506 ± .001 .671 ± .007 CBOW-char + bi. + tri. .441 ± .003 .788 ± .002 .509 ± .003 .678 ± .010 Sent2Vec uni. .450 ± .003 .765 ± .002 .444 ± .001 .625 ± .005 Sent2Vec uni. + bi. .440 ± .002 .791 ± .002 .430 ± .002 .661 ± .005 Sent2Vec uni. + bi. + tri. .464 ± .003 .798 ± .001 .432 ± .003 .658 ± .006 Google Google Model MSR (Syntactic Analogies) (Semantic Analogies) CBOW-char .920 ± .001 .799 ± .004 .842 ± .002 CBOW-char + bi. .928 ± .003 .798 ± .006 .856 ± .004 CBOW-char + bi. + tri. .929 ± .001 .794 ± .005 .857 ± .002 Sent2Vec uni. .826 ± .003 .847 ± .003 .734 ± .003 Sent2Vec uni. + bi. .843 ± .004 .844 ± .002 .754 ± .004 Sent2Vec uni. + bi. + tri. .837 ± .003 .853 ± .003 .745 ± .001 Table 1: Impact of using word n-grams: Models are compared using Spearman correlation measures for word similarity tasks and accuracy for word analogy tasks. Top performances on each dataset are shown in bold. An underline shows the best model(s) restricted to each architecture type. The abbreviations uni., bi., and tri. stand for unigrams, bigrams, and trigrams respectively. 3 Experimental Setup best performing model and is included in the com- parison. 3.1 Training For each method, we extensively tuned hyper- We train all competing models on a wikipedia parameters starting from the recommended val- dump of 69 million sentences containing 1.7 bil- ues.

Arxiv:1904.05033V1

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support