Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4726–4730, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

GGP: Glossary Guided Post-processing for Word Embedding Learning

Ruosong Yang, Jiannong Cao, Zhiyuan Wen
The Hong Kong Polytechnic University, Hong Kong, China
{csryang, csjcao, cszwen}[email protected]

Abstract

Word embedding learning is the task of mapping each word to a low-dimensional, continuous vector based on a large corpus. To enhance corpus-based word embedding models, researchers utilize domain knowledge to learn more distinguishable representations via joint-optimization and post-processing models. However, joint-optimization models require much training time, and existing post-processing models mostly consider semantic knowledge, so the learned embeddings carry less functional information. Compared with semantic knowledge sources, a glossary is a comprehensive linguistic resource that contains complete semantics. The previous glossary-based post-processing method only processed words that occur in the glossary and did not distinguish the multiple senses of each word. In this paper, to make better use of the glossary, we utilize an attention mechanism to integrate multiple sense representations that are learned separately. By measuring the similarity between each word representation and its combined sense representation, we aim to capture more topical and functional information. We propose GGP (Glossary Guided Post-processing), a word embedding model that consists of a global post-processing function to fine-tune each word vector and an auto-encoding model to learn sense representations, and that constrains each post-processed word representation to be similar to the composition of its sense representations. We evaluate our model against two state-of-the-art models on six word topical/functional similarity datasets, and the results show that it outperforms its competitors by an average of 4.1% across all datasets. Our model also outperforms GloVe by more than 7%.

Keywords: Word Embedding, Post-processing model, Representation Learning

1. Introduction

Word embedding learning is the task of representing each word with a continuous vector. With the success of Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which are trained on a large corpus via a simple neural network or matrix factorization, pre-trained word representations are widely used in various Natural Language Processing (NLP) tasks, such as sequence labeling (Lample et al., 2016) and text classification (Kim, 2014). However, typical word embedding models, including Word2Vec and GloVe, are based on the Distributional Hypothesis (Harris, 1954), which uses the distribution of context words as the representation of the target word. In practice, the corpus is always limited, which makes it difficult to estimate the actual context-word distribution of each target word; for example, synonyms and antonyms are usually hard to distinguish. Meanwhile, the quality of a word embedding highly depends on the frequency of the word in the corpus, and rare words are usually discarded.

To improve the quality of word embeddings, various knowledge bases such as WordNet (Miller, 1995), FrameNet (Baker et al., 1998), and the Paraphrase Database (Ganitkevitch et al., 2013) have been considered. Joint optimization and post-processing are two popular approaches to incorporating such domain knowledge, and both have achieved better performance. Joint-optimization models design extra constraints according to the domain knowledge and retrain a new model with the integrated objectives on a large corpus and the knowledge bases, which usually requires much training time. Post-processing models, in contrast, fine-tune pre-trained word embeddings with new constraints derived from the knowledge bases; since the data volume of the knowledge bases is much smaller than that of the corpus, post-processing is a more efficient way. Besides, as many pre-trained word vectors already exist, e.g., Google News Vectors [1] (based on Word2Vec), GloVe [2], and FastText [3] (Bojanowski et al., 2016), post-processing approaches are more effective.

[1] https://code.google.com/archive/p/word2vec/
[2] https://nlp.stanford.edu/projects/glove/
[3] https://fasttext.cc/docs/en/english-vectors.html
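All of the post-processing approaches discussed below start from such pre-trained vectors. As an illustration only (not part of the paper), the following sketch loads vectors stored in GloVe's standard whitespace-separated text format; the file name is a placeholder.

```python
# Illustrative sketch: load pre-trained GloVe vectors from the standard
# text format (one word followed by its space-separated float components
# per line). The file name is an assumption.
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Return a dict mapping each word to its pre-trained vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors
```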
Recent works focus on fine-tuning pre-trained word embedding models with word-to-word semantic knowledge such as synonyms and antonyms. Retrofitting (Faruqui et al., 2015) utilized the relational information of semantic lexicons and assumed that linked words should learn similar representations. ER-CNT (Glavaš and Vulić, 2018) used a deep neural network to fine-tune word vectors by adding constraints on synonyms and antonyms. It shows a significant improvement on topical similarity datasets, but loses functional information (Levy and Goldberg, 2014). There are also works that utilize a glossary to learn word representations by extending the original corpus with word definitions. The Dict2vec (Tissier et al., 2017) model constructed word pairs from both the corpus and dictionary entries; for negative sampling in particular, it ignored the words in pairs drawn from the dictionary. Skip-Gram (Word2Vec) is then used to learn the word embedding model. CPAE (Bosc and Vincent, 2018) proposed an LSTM-based auto-encoding model that learns word representations from dictionary definitions, which are assumed to be similar to their embeddings. However, Dict2vec requires retraining the word embedding model, which is time-consuming, and CPAE combines the multiple senses of each word into one sense, which is not reasonable. The authors of CPAE also introduced a post-processing model, but it can only fine-tune words that have a definition or that occur in definitions.

A glossary is a collection of glosses, and each gloss consists of a word as well as its definition (a word sequence that explains the word). Looking up a glossary (dictionary) is one important way for humans to learn a new word, and learning from example sentences of the word is another. However, the corpus is often not enough to estimate the real distribution of context words, and there are also no effective approaches for semantic composition. Combining the corpus and the glossary is therefore necessary, and they should be modeled in different ways.
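The paper does not commit to a particular glossary, but WordNet glosses are a typical source. As an illustration only, the sketch below (assuming NLTK with the WordNet corpus downloaded) collects one definition string per sense of a word, which is the kind of word-to-sense-definitions mapping a glossary-guided post-processing model consumes.

```python
# Illustrative only: collect per-sense definitions (glosses) for a word from
# WordNet via NLTK. Run nltk.download("wordnet") once beforehand.
from nltk.corpus import wordnet as wn

def sense_glosses(word):
    """Return one definition string per WordNet synset (sense) of `word`."""
    return [synset.definition() for synset in wn.synsets(word)]

if __name__ == "__main__":
    for i, gloss in enumerate(sense_glosses("bank"), start=1):
        print(f"sense {i}: {gloss}")
```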
In input sequence from the output of the encoder. Distributional Hypothesis based word embedding models, Given a word wt, the definition sequence of ith sense is ini = fini ; ini ; :::; ini g n word vector represents the distribution of context words, t t;1 t;2 t;ni , where i is the number of which combines context words of different senses. Besides words of the sense definition. Encoder(· ) and Decoder(· ) learned word representations are the composition of differ- are the encoder model and the decoder model respectively. ent senses. So word embedding should be approximate to The encoder transforms the definition sequence into a se- hei = fhei ; hei ; :::; hei g the linear combination of its senses’ representations. quence of hidden states t t;0 t;1 t;ni as We propose GGP (Glossary Guided Post-processing) hei shown in Formula 1. Meanwhile, t;ni is used to represent model, which combines a general function to fine-tune the ith sense. And the decoder utilizes the sense represen- all pre-trained word representations, and an auto-encoding hei hdi tation t;ni as the initial hidden state t;0, then generates model to learn the representation of each sense. By joint hdi = fhdi ; hdi ; :::; hdi g a new sequence t t;1 t;2 t;ni as shown optimization, embedding learned by GGP exhibits better in Formula 2. topical and functional similarity. In addition, to speed up i i the convergence of the auto-encoding model, word defi- he = Encoder(int) = BiLST M(LST M(int)) (1) nitions are also used to pre-train it. To evaluate whether hd = Decoder(hei ) = LST M(LST M(hei )) (2) our model learns both topical and functional information, t;ni t;ni we test our model on six word tpical/functional similarity The reconstruction loss Lr of the auto-encoding part is de- datasets, i.e., Men, Simverb3500, RW, Simlex999, WS353 fined by CrossEntropy loss shown in Formula 3, where and Mturk. Experimental results show that our model out- i nt is the number of senses, int is the input sequence of the performs rivals by at least 4.1% in all benchmarks. And i ith sense, and hdt is the output sequence of the decoder. compared to GloVe, our model outperforms more than 7%. We also use two glossaries and two pre-trained word em- nt 1 X i i bedding models with different vocabulary sizes to show the Lr = CrossEntropy(hdt; int) (3) nt effectiveness of our model. In summary, our contribution i=1 are: 2.2. Composition of Sense representations • For multi-sense word, we learn sense representation With all calculated sense representations het = respectively and utilize attention to integrate them. fhe1 ; he2 ; :::; hent g of the given word, addi- t;n1 t;n2 t;nnt tive attention (Bahdanau et al., 2014) is used to calculate • We combine a general post-processing function and Swt , the composition representation of all sense vectors, sense representation learning model so that each pre- which is shown as Formula 4: trained word representation could be post-processed. n Xt • S = w ∗ hei Experimental results show that our model learn both wt i t;ni (4) topic and functional information and performs much i=1 better than previous model.