Enhancing Embeddings with Knowledge Extracted from Lexical Resources

Magdalena Biesialska∗ Bardia Rafieian∗ Marta R. Costa-jussà
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona
{magdalena.biesialska,bardia.rafieian,marta.ruiz}@upc.edu

∗Equal contribution

Abstract

In this work, we present an effective method for semantic specialization of word vector representations. To this end, we use traditional word embeddings and apply specialization methods to better capture semantic relations between words. In our approach, we leverage external knowledge from rich lexical resources such as BabelNet. We also show that our proposed post-specialization method, based on an adversarial neural network with the Wasserstein distance, allows us to gain improvements over state-of-the-art methods on two tasks: word similarity and dialog state tracking.

1 Introduction

Vector representations of words (embeddings) have become the cornerstone of modern Natural Language Processing (NLP), as learning word vectors and utilizing them as features in downstream NLP tasks is the de facto standard. Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are typically trained in an unsupervised way on large monolingual corpora. Whilst such word representations are able to capture some syntactic as well as semantic information, their ability to map relations (e.g. synonymy, antonymy) between words is limited. To alleviate this deficiency, a set of refinement post-processing methods, called retrofitting or semantic specialization, has been introduced. In the next section, we discuss the intricacies of these methods in more detail.

To summarize, our contributions in this work are as follows:

• We introduce a set of new linguistic constraints (i.e. synonyms and antonyms) created with BabelNet for three languages: English, German and Italian.

• We introduce an improved post-specialization method (dubbed WGAN-postspec), which demonstrates improved performance compared to the state-of-the-art DFFN (Vulić et al., 2018) and AuxGAN (Ponti et al., 2018) models.

• We show that the proposed approach achieves performance improvements on an intrinsic task (word similarity) as well as on a downstream task (dialog state tracking).

2 Related Work

Numerous methods have been introduced for incorporating structured linguistic knowledge from external resources into word embeddings. Fundamentally, there exist three categories of semantic specialization approaches: (a) joint methods, which incorporate lexical information during the training of distributional word vectors; (b) specialization methods, also referred to as retrofitting methods, which use post-processing techniques to inject semantic information from external lexical resources into pre-trained word vector representations; and (c) post-specialization methods, which use linguistic constraints to learn a general mapping function that allows the entire distributional vector space to be specialized.

In general, joint methods perform worse than the other two and are not model-agnostic, as they are tightly coupled to the distributional word vector models (e.g. word2vec, GloVe). Therefore, in this work we concentrate on the specialization and post-specialization methods. Approaches which fall into the former category can be considered local specialization methods, where the most prominent examples are: retrofitting (Faruqui et al., 2015), a post-processing method that enriches word embeddings with knowledge from semantic lexicons; in this case, it brings semantically similar words closer together.

Figure 1: Illustration of the semantic specialization approach. (Diagram: linguistic constraints, e.g. the synonym pair travel - tour and the antonym pair travel - stay-at-home, obtained from lexical resources (WordNet, BabelNet, etc.), drive an initial specialization (Attract-Repel) of the subspace of seen words X_f = X'_s; post-specialization (DFFN, AuxGAN, WGAN-postspec) then maps the full distributional vector space X_d = X_s ∪ X_u to the specialized space X_g = X'_s ∪ X'_u.)

Counter-fitting (Mrkšić et al., 2016) likewise fine-tunes word representations; however, conversely to the retrofitting technique, it counter-fits the embeddings with respect to the given similarity and antonymy constraints. Attract-Repel (Mrkšić et al., 2017b) uses linguistic constraints obtained from external lexical resources to semantically specialize word embeddings. Similarly to counter-fitting, it injects synonymy and antonymy constraints into distributional word vector spaces. In contrast to counter-fitting, this method does not ignore how updates of the example word vector pairs affect their relations to other word vectors.

On the other hand, the latter group, post-specialization methods, performs global specialization of distributional spaces. We can distinguish: explicit retrofitting (Glavaš and Vulić, 2018), which was the first attempt to use external constraints (i.e. synonyms and antonyms) as training examples for learning an explicit mapping function for specializing the words not observed in the constraints. Later, the more robust DFFN (Vulić et al., 2018) method was introduced with the same goal: to specialize the full vocabulary by leveraging the already specialized subspace of seen words.

3 Methodology

In this paper, we propose an approach that builds upon previous works (Vulić et al., 2018; Ponti et al., 2018). The process of specializing distributional vectors is a two-step procedure (as shown in Figure 1). First, an initial specialization is performed (see §3.1). In the second step, a global specialization mapping function is learned, allowing the model to generalize to unseen words (see §3.2).

3.1 Initial Specialization

In this step, a subspace of distributional vectors for words that occur in the external constraints is specialized. To this end, fine-tuning of seen words can be performed using any specialization method. In this work, we utilize the Attract-Repel model (Mrkšić et al., 2017b), as it offers state-of-the-art performance. This method makes use of both synonymy (attract) and antonymy (repel) constraints. More formally, given a set A of attract word pairs and a set R of repel word pairs, let V_S be the vocabulary of words seen in the constraints. Each word pair (v_l, v_r) is represented by a corresponding vector pair (x_l, x_r). The model optimization method operates over mini-batches: a mini-batch B_A of synonymy pairs (of size k_1) and a mini-batch B_R of antonymy pairs (of size k_2). The pairs of negative examples T_A(B_A) = [(t_l^1, t_r^1), ..., (t_l^{k_1}, t_r^{k_1})] and T_R(B_R) = [(t_l^1, t_r^1), ..., (t_l^{k_2}, t_r^{k_2})] are drawn from the 2(k_1 + k_2) word vectors in B_A ∪ B_R. The negative examples serve the purpose of pulling synonym pairs closer and pushing antonym pairs further away with respect to their corresponding negative examples.
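To make the batch construction concrete, the following is a minimal sketch (our own, not the authors' implementation) of selecting negative examples for one mini-batch. Following the Attract-Repel description, we assume each negative example is the nearest other vector in the batch by dot product; for simplicity the sketch draws negatives from a single mini-batch, whereas the paper draws them from the combined B_A ∪ B_R.

```python
import numpy as np

def build_negative_examples(batch_vectors):
    """For every vector in the mini-batch, pick as its negative example the
    closest (by dot product) *other* vector in the same batch.
    `batch_vectors` has shape (2 * k, dim): left and right words of each pair."""
    sims = batch_vectors @ batch_vectors.T      # pairwise similarity scores
    np.fill_diagonal(sims, -np.inf)             # a vector cannot be its own negative
    nearest = sims.argmax(axis=1)               # index of the closest other vector
    return batch_vectors[nearest]               # shape (2 * k, dim)

# Toy usage: a mini-batch B_A of k1 = 3 synonym pairs in a 5-dimensional space.
rng = np.random.default_rng(0)
x_left, x_right = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
batch = np.vstack([x_left, x_right])            # 2 * k1 vectors
negatives = build_negative_examples(batch)
t_left, t_right = negatives[:3], negatives[3:]  # negative examples t_l^i, t_r^i
```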

For synonym pairs, the attract cost is given as:

$$ A(\mathcal{B}_A) = \sum_{i=1}^{k_1} \left[ \tau\left(\delta_{att} + \mathbf{x}_l^i \mathbf{t}_l^i - \mathbf{x}_l^i \mathbf{x}_r^i\right) + \tau\left(\delta_{att} + \mathbf{x}_r^i \mathbf{t}_r^i - \mathbf{x}_l^i \mathbf{x}_r^i\right) \right] \quad (1) $$

where τ is the rectifier function, and δ_att is the similarity margin determining how much closer synonymous vectors should be to each other than to their corresponding negative examples.

Similarly, the repel cost for antonym pairs is given as:

$$ R(\mathcal{B}_R) = \sum_{i=1}^{k_2} \left[ \tau\left(\delta_{rep} + \mathbf{x}_l^i \mathbf{x}_r^i - \mathbf{x}_l^i \mathbf{t}_l^i\right) + \tau\left(\delta_{rep} + \mathbf{x}_l^i \mathbf{x}_r^i - \mathbf{x}_r^i \mathbf{t}_r^i\right) \right] \quad (2) $$

A distributional regularization term is used to retain the quality of the original distributional vector space by means of L2-regularization:

$$ Reg(\mathcal{B}_A, \mathcal{B}_R) = \lambda_{reg} \sum_{\mathbf{x}_i \in V(\mathcal{B}_A \cup \mathcal{B}_R)} \| \widehat{\mathbf{x}}_i - \mathbf{x}_i \|_2 \quad (3) $$

where λ_reg is an L2-regularization constant, and x̂_i is the original vector for the word x_i.

Consequently, the final cost function is formulated as follows:

$$ C(\mathcal{B}_A, \mathcal{B}_R) = A(\mathcal{B}_A) + R(\mathcal{B}_R) + Reg(\mathcal{B}_A, \mathcal{B}_R) \quad (4) $$
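As a concrete illustration of Eqs. (1)-(4), here is a minimal numpy sketch (our own, not the reference implementation): dot products serve as similarity scores, each argument is a row-wise array of word vectors, and the default hyperparameters mirror the values reported later in §4.

```python
import numpy as np

def relu(z):                       # the rectifier function tau
    return np.maximum(z, 0.0)

def attract_repel_cost(syn_l, syn_r, syn_tl, syn_tr,
                       ant_l, ant_r, ant_tl, ant_tr,
                       originals, current,
                       delta_att=0.6, delta_rep=0.0, lambda_reg=1e-9):
    """Cost C(B_A, B_R) = A(B_A) + R(B_R) + Reg(B_A, B_R) from Eqs. (1)-(4).
    *_l/_r are (k, dim) arrays of left/right word vectors; *_tl/_tr are the
    corresponding negative examples; originals/current are used for Eq. (3)."""
    dot = lambda a, b: np.sum(a * b, axis=1)

    # Eq. (1): synonyms should be closer to each other than to their negatives.
    attract = np.sum(relu(delta_att + dot(syn_l, syn_tl) - dot(syn_l, syn_r))
                     + relu(delta_att + dot(syn_r, syn_tr) - dot(syn_l, syn_r)))

    # Eq. (2): antonyms should be further from each other than from their negatives.
    repel = np.sum(relu(delta_rep + dot(ant_l, ant_r) - dot(ant_l, ant_tl))
                   + relu(delta_rep + dot(ant_l, ant_r) - dot(ant_r, ant_tr)))

    # Eq. (3): keep the updated vectors close to the original distributional ones.
    reg = lambda_reg * np.sum(np.linalg.norm(originals - current, axis=1))

    return attract + repel + reg   # Eq. (4)
```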
3.2 Proposed Post-Specialization Model

Once the initial specialization is completed, post-specialization methods can be employed. This step is important because local specialization affects only words seen in the constraints, and thus just a subset of the original distributional space X_d, whereas post-specialization methods learn a global specialization mapping function that allows them to generalize to unseen words X_u.

Given the specialized word vectors X'_s for the vocabulary of seen words V_S, our proposed method propagates this signal to the entire distributional vector space using a generative adversarial network (GAN) (Goodfellow et al., 2014). Hence, in our model, following the approach of Ponti et al. (2018), we introduce adversarial losses. More specifically, the mapping function is learned through a combination of a standard L2-loss with adversarial losses. The motivation behind this is to make the mappings more natural and to ensure that vectors specialized for the full vocabulary are more realistic. To this end, we use the Wasserstein distance incorporated in the generative adversarial network (WGAN) (Arjovsky et al., 2017) as well as its improved variant with gradient penalty (WGAN-GP) (Gulrajani et al., 2017). For brevity, we call our model WGAN-postspec, which is an umbrella term for the WGAN and WGAN-GP methods implemented in the proposed post-specialization model. One of the benefits of using WGANs over vanilla GANs is that WGANs are generally more stable and do not suffer from vanishing gradients.

Our proposed post-specialization approach is based on the principles of GANs, as it is composed of two elements: a generator network G and a discriminator network D. The gist of this concept is to improve the generated samples through a min-max game between the generator and the discriminator.

In our post-specialization model, a multi-layer feed-forward neural network, which trains a global mapping function, acts as the generator. Consequently, the generator is trained to produce predictions G(x; θ_G) that are as similar as possible to the corresponding initially specialized word vectors x'_s. Therefore, the global mapping function is trained using word vector pairs (x_i, x'_i), where x_i ∈ X_s and x'_i ∈ X'_s. On the other hand, the discriminator D(x; θ_D), which is a multi-layer classification network, tries to distinguish the generated samples from the initially specialized vectors sampled from X'_s. In this process, the differences between predictions and initially specialized vectors are used to improve the generator, resulting in more realistic outputs.

In general, for the GAN model we can define the loss L_G of the generator as:

$$ L_G = -\sum_{i=1}^{n} \log P(\mathrm{spec} = 1 \mid G(\mathbf{x}_i; \theta_G); \theta_D) - \sum_{i=1}^{m} \log P(\mathrm{spec} = 0 \mid \mathbf{x}'_i; \theta_D) \quad (5) $$

while the loss L_D of the discriminator is given as:

$$ L_D = -\sum_{i=1}^{n} \log P(\mathrm{spec} = 0 \mid G(\mathbf{x}_i; \theta_G); \theta_D) - \sum_{i=1}^{m} \log P(\mathrm{spec} = 1 \mid \mathbf{x}'_i; \theta_D) \quad (6) $$

In principle, the losses with the Wasserstein distance can be formulated as follows:

$$ L_G = -\frac{1}{n} \sum_{i=1}^{n} D(G(\mathbf{x}_i; \theta_G); \theta_D) \quad (7) $$

and

$$ L_D = \frac{1}{m} \sum_{i=1}^{m} D(\mathbf{x}'_i; \theta_D) - \frac{1}{n} \sum_{i=1}^{n} D(G(\mathbf{x}_i; \theta_G); \theta_D) \quad (8) $$
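The PyTorch sketch below illustrates one training step of this adversarial mapping. It is our own simplified illustration, not code from the paper: layer sizes, learning rates, the L2 term weight and the weight-clipping value are assumptions, and the critic is trained on the negative of Eq. (8) (the standard WGAN critic loss), since the critic maximizes Eq. (8).

```python
import torch
import torch.nn as nn

DIM = 300

# Generator: multi-layer feed-forward network learning the global mapping x -> x'.
G = nn.Sequential(nn.Linear(DIM, 512), nn.ReLU(), nn.Linear(512, DIM))
# Discriminator (critic): multi-layer network scoring initially specialized vs. generated vectors.
D = nn.Sequential(nn.Linear(DIM, 512), nn.ReLU(), nn.Linear(512, 1))

opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def train_step(x_dist, x_spec, l2_weight=1.0, clip=0.01):
    """x_dist: distributional vectors of seen words (batch, DIM);
    x_spec: their initially specialized counterparts (batch, DIM)."""
    # --- Critic update: minimize -L_D of Eq. (8), i.e. maximize D(x') - D(G(x)). ---
    opt_D.zero_grad()
    loss_D = D(G(x_dist).detach()).mean() - D(x_spec).mean()
    loss_D.backward()
    opt_D.step()
    for p in D.parameters():            # weight clipping for the plain WGAN variant;
        p.data.clamp_(-clip, clip)      # WGAN-GP would add a gradient penalty instead.

    # --- Generator update: Eq. (7) plus the standard L2 mapping loss. ---
    opt_G.zero_grad()
    pred = G(x_dist)
    loss_G = -D(pred).mean() + l2_weight * ((pred - x_spec) ** 2).sum(dim=1).mean()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()

# Toy usage with random vectors standing in for seen-word pairs (x_i, x'_i).
x_d, x_s = torch.randn(64, DIM), torch.randn(64, DIM)
train_step(x_d, x_s)
```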

An alternative scenario with a gradient penalty (WGAN-GP) requires adding a gradient penalty coefficient λ to Eq. (8).

4 Experiments

Pre-trained Word Embeddings. In order to evaluate our proposed approach, as well as to compare our results with current state-of-the-art post-specialization approaches, we use popular and readily available 300-dimensional pre-trained word vectors. Word2Vec (Mikolov et al., 2013) embeddings for English were trained using skip-gram with negative sampling on the cleaned and tokenized Polyglot Wikipedia (Al-Rfou' et al., 2013) by Levy and Goldberg (2014), while German and Italian embeddings were trained using CBOW with negative sampling on WacKy corpora (Dinu et al., 2015; Artetxe et al., 2017, 2018). Moreover, GloVe vectors for English were trained on Common Crawl (Pennington et al., 2014).

Linguistic Constraints. To perform semantic specialization of word vector spaces, we exploit linguistic constraints used in previous works (Zhang et al., 2014; Ono et al., 2015; Vulić et al., 2018), referred to as external, and introduce a new set of constraints collected by us, referred to as babelnet, for three languages: English, German and Italian. We use constraints in two different settings: disjoint and overlap. In the first setting, we remove all linguistic constraints that contain any of the words available in the SimLex (Hill et al., 2015), SimVerb (Gerz et al., 2016) and WordSim (Leviant and Reichart, 2015) evaluation datasets. In the overlap setting, we let the SimLex, SimVerb and WordSim words remain in the constraints. To summarize, we present the number of word pairs for English, German and Italian constraints in Table 1.
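As an illustration of the disjoint setting, the sketch below (our own; the file names and the one-tab-separated-pair-per-line format are hypothetical) removes every constraint pair that contains a word occurring in any of the evaluation datasets.

```python
def load_pairs(path):
    """Read constraint pairs from a file with one tab-separated word pair per line."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.strip().split("\t")) for line in f if line.strip()]

def disjoint_filter(pairs, eval_vocab):
    """Drop every constraint pair that mentions a word from the evaluation datasets."""
    return [(w1, w2) for w1, w2 in pairs if w1 not in eval_vocab and w2 not in eval_vocab]

# Hypothetical file names; the SimLex/SimVerb/WordSim vocabularies are read analogously.
eval_vocab = {w for p in load_pairs("simlex_simverb_wordsim_pairs.txt") for w in p}
synonyms = disjoint_filter(load_pairs("en_synonyms.txt"), eval_vocab)
antonyms = disjoint_filter(load_pairs("en_antonyms.txt"), eval_vocab)
```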
Let us discuss in more detail how the lists of constraints were constructed. In this work, we use two sets of linguistic constraints: external and babelnet. The first set of constraints was retrieved from WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009), resulting in 1,023,082 synonymy and 380,873 antonymy word pairs. The second set of constraints, which is a part of our contribution, comprises synonyms and antonyms obtained using NASARI lexical embeddings (Camacho-Collados et al., 2016) and BabelNet (Navigli and Ponzetto, 2012). As NASARI provides lexical information for BabelNet words in five languages (EN, ES, FR, DE and IT), we collected each word with its related BabelNet ID (a sense database identifier) to extract the list of its synonyms and antonyms using the BabelNet API.

Furthermore, to improve the list of Italian words, we also followed the approach proposed by Sucameli and Lenci (2017). The authors provided a new dataset of semantically related Italian word pairs, which includes nouns, adjectives and verbs with their synonyms, antonyms and hypernyms. The information in this dataset was gathered by its authors through crowdsourcing from a pool of Italian native speakers. This way, we could concatenate Italian word pairs to obtain a more complete list of synonyms and antonyms.

Similarly, we refer to the work of Scheible and Schulte im Walde (2014), which presents a collection of semantically related word pairs in German compiled through human evaluation. Relying on GermaNet and the respective Java API, the list of word pairs was generated with a sampling technique. Finally, we used these word pairs in our experiments as external resources for the German language.

Initial Specialization and Post-Specialization. Although initially specialized vector spaces show gains over the non-specialized word embeddings, linguistic constraints represent only a fraction of their total vocabulary. Therefore, semantic specialization is a two-step process. Firstly, we perform initial specialization of the pre-trained word vectors by means of the Attract-Repel algorithm (see §2). The hyperparameters are set to the default values: λ_reg = 10^-9, δ_sim = 0.6, δ_ant = 0.0 and k_1 = k_2 = 50. Afterwards, to perform specialization of the entire vocabulary, a global specialization mapping function is learned.

In our proposed WGAN-postspec approach, the post-specialization model uses a GAN with loss functions improved by means of the Wasserstein distance and the gradient penalty. Importantly, the optimization process differs depending on the algorithm implemented in our model. In the case of a vanilla GAN (AuxGAN), standard stochastic gradient descent is used, while in the WGAN model we employ RMSProp (Tieleman and Hinton, 2012). Finally, in the case of WGAN-GP, the Adam optimizer (Kingma and Ba, 2015) is applied.
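For completeness, the sketch below is our own illustration (not code from the paper) of the gradient penalty term that WGAN-GP adds to the critic loss of Eq. (8), following the standard formulation of Gulrajani et al. (2017), together with the optimizer choices described above; the penalty weight and learning rates are illustrative assumptions.

```python
import torch

def gradient_penalty(D, real, fake, gp_lambda=10.0):
    """WGAN-GP term: penalize deviation of the critic's gradient norm from 1 at
    points interpolated between real (initially specialized) and generated vectors."""
    alpha = torch.rand(real.size(0), 1)                      # per-sample interpolation weights
    interp = (alpha * real.detach() + (1 - alpha) * fake.detach()).requires_grad_(True)
    scores = D(interp)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=interp, create_graph=True)
    return gp_lambda * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def make_optimizer(params, variant, lr=1e-4):
    """Optimizer choice per model variant, as described above."""
    return {"auxgan": torch.optim.SGD,
            "wgan": torch.optim.RMSprop,
            "wgan-gp": torch.optim.Adam}[variant](params, lr=lr)
```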

                        English                                  German                                   Italian
                        overlap    disjoint    disjoint         overlap    disjoint    disjoint          overlap  disjoint    disjoint
                                   simlex/verb wordsim                     simlex/verb wordsim                    simlex/verb wordsim
Synonyms
  babelnet              3,522,434  3,521,366   3,515,111        1,358,358  1,087,814   1,348,006         975,483  807,399     806,890
  external + babelnet   4,545,045  4,396,350   3,515,111        1,360,040  1,089,338   1,349,612         976,877  808,605     808,225
Antonyms
  babelnet                  1,024        843       1,011              139        136         136              99       99          98
  external + babelnet     381,777    352,099     378,365            1,823      1,662       1,744             883      769         851

Table 1: Number of synonym and antonym word pairs for English, German and Italian in two settings: babelnet, external + babelnet.

English                       GLOVE                                           WORD2VEC
                     overlap               disjoint               overlap               disjoint
                     SL     SV     WS      SL     SV     WS       SL     SV     WS      SL     SV     WS
ORIGINAL             0.407  0.280  0.655   0.407  0.280  0.655    0.414  0.272  0.593   0.414  0.272  0.593
ATTRACT-REPEL   a    0.781  0.761  0.597   0.407  0.280  0.655    0.778  0.761  0.574   0.414  0.272  0.593
                b    0.407  0.282  0.655   0.407  0.282  0.655    0.414  0.275  0.594   0.414  0.275  0.593
                c    0.784  0.763  0.595   0.407  0.282  0.655    0.776  0.763  0.560   0.414  0.275  0.593
DFFN            a    0.785  0.764  0.600   0.645  0.531  0.678    0.781  0.763  0.571   0.553  0.430  0.593
                b    0.699  0.562  0.703   0.458  0.324  0.679    0.351  0.237  0.506   0.387  0.245  0.578
                c    0.783  0.764  0.597   0.646  0.535  0.684    0.777  0.763  0.560   0.538  0.381  0.594
AUXGAN          a    0.789  0.764  0.659   0.652  0.552  0.642    0.782  0.762  0.550   0.581  0.434  0.602
                b    0.734  0.647  0.627   0.417  0.284  0.658    0.405  0.269  0.587   0.395  0.260  0.581
                c    0.796  0.767  0.639   0.659  0.560  0.669    0.782  0.755  0.588   0.583  0.438  0.603
WGAN            a    0.809  0.767  0.652   0.661  0.553  0.642    0.780  0.749  0.602   0.580  0.446  0.608
                b    0.722  0.635  0.654   0.452  0.279  0.671    0.392  0.262  0.590   0.397  0.269  0.580
                c    0.808  0.765  0.653   0.663  0.549  0.665    0.771  0.737  0.614   0.586  0.440  0.611
WGAN-GP         a    0.810  0.751  0.669   0.660  0.548  0.669    0.776  0.742  0.600   0.586  0.462  0.605
                b    0.722  0.622  0.646   0.461  0.282  0.676    0.396  0.254  0.567   0.398  0.267  0.581
                c    0.798  0.732  0.715   0.660  0.551  0.672    0.775  0.614  0.590   0.585  0.463  0.609

Table 2: Spearman’s ρ correlation scores on SimLex-999 (SL), SimVerb-3500 (SV) and WordSim-353 (WS). Evaluation was performed using constraints in three settings: (a) external, (b) babelnet, (c) external + babelnet.

5 Results

5.1 Word Similarity

We report our experimental results on a common intrinsic word similarity task, using standard benchmarks: SimLex-999 and WordSim-353 for English, German and Italian, as well as SimVerb-3500 for English. Each dataset contains human similarity ratings, and we evaluate the similarity measure using Spearman's ρ rank correlation coefficient. In Table 2, we present results for the English benchmarks, whereas results for German and Italian are reported in Table 3.

Word embeddings are evaluated in two scenarios: disjoint, where words observed in the benchmark datasets are removed from the linguistic constraints; and overlap, where all words provided in the linguistic constraints are utilized. We use the overlap setting in a downstream task (see §5.2).

In the tasks we report scores for Original (non-specialized) word vectors, the initial specialization method Attract-Repel (Mrkšić et al., 2017b), and three post-specialization methods: DFFN (Vulić et al., 2018), AuxGAN (Ponti et al., 2018) and our proposed model WGAN-postspec (in two scenarios: WGAN and WGAN-GP).

The results suggest that the post-specialization methods bring improvements in the specialization of the distributional word vector space. Overall, the highest correlation scores are reported for the models with adversarial losses. We also observe that the proposed WGAN-postspec achieves fairly consistent correlation gains with GLOVE vectors on the SimLex dataset. Interestingly, while exploiting additional constraints (i.e. external + babelnet) generally boosts correlation scores for German and Italian, the results are not conclusive in the case of English, and thus they require further investigation.
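For reference, the intrinsic evaluation described above amounts to correlating cosine similarities of the (specialized) vectors with the human ratings; a minimal sketch of this protocol follows (our own, with a hypothetical embeddings dictionary as input).

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(embeddings, benchmark_pairs):
    """embeddings: dict mapping a word to its vector (np.ndarray);
    benchmark_pairs: list of (word1, word2, human_rating) tuples, e.g. from SimLex-999.
    Returns Spearman's rho between model cosine similarities and human ratings."""
    model_scores, gold_scores = [], []
    for w1, w2, rating in benchmark_pairs:
        if w1 not in embeddings or w2 not in embeddings:
            continue                                   # skip out-of-vocabulary pairs
        v1, v2 = embeddings[w1], embeddings[w2]
        cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cosine)
        gold_scores.append(rating)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho
```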

                       German (WORD2VEC)                  Italian (WORD2VEC)
                 overlap          disjoint          overlap          disjoint
                 SL      WS       SL      WS        SL      WS       SL      WS
ORIGINAL         0.358   0.538    0.358   0.538     0.356   0.563    0.356   0.563
ATTRACT-REPEL a  0.360   0.537    0.358   0.538     0.376   0.568    0.364   0.565
              b  0.358   0.538    0.358   0.538     0.366   0.568    0.366   0.559
              c  0.360   0.538    0.358   0.538     0.378   0.566    0.367   0.564
DFFN          a  0.366   0.422    0.370   0.452     0.381   0.512    0.365   0.519
              b  0.354   0.538    0.348   0.538     0.364   0.559    0.361   0.560
              c  0.359   0.541    0.358   0.533     0.376   0.561    0.369   0.559
AUXGAN        a  0.331   0.532    0.325   0.535     0.362   0.561    0.348   0.560
              b  0.369   0.552    0.373   0.561     0.361   0.559    0.364   0.563
              c  0.369   0.564    0.365   0.556     0.365   0.566    0.368   0.563
WGAN          a  0.331   0.528    0.327   0.531     0.361   0.558    0.344   0.558
              b  0.364   0.558    0.367   0.559     0.359   0.553    0.367   0.559
              c  0.371   0.559    0.364   0.560     0.367   0.567    0.370   0.562

Table 3: Spearman's ρ correlation scores on SimLex-999 (SL) and WordSim-353 (WS). Evaluation was performed using constraints in three settings: (a) external, (b) babelnet, (c) external + babelnet.

                GLOVE
ORIGINAL        0.797
ATTRACT-REPEL   0.817
DFFN            0.829
AUXGAN          0.836
WGAN-POSTSPEC   0.838

Table 4: DST results for English.

5.2 Dialog State Tracking

We also evaluate our proposed approach on a dialog state tracking (DST) downstream task. This is a standard language understanding task, which allows us to differentiate between word similarity and relatedness. To perform the evaluation, we follow previous works (Henderson et al., 2014; Williams et al., 2016; Mrkšić et al., 2017b). Concretely, a DST model computes probability based only on pre-trained word embeddings. We use the Wizard-of-Oz (WOZ) v2.0 dataset (Wen et al., 2017; Mrkšić et al., 2017a), composed of 600 training dialogues as well as 200 development and 400 test dialogues. In our experiments, we report results with the standard joint goal accuracy (JGA) score. The results in Table 4 confirm our findings from the previous word similarity task, as initial semantic specialization and post-specialization (in particular WGAN-postspec) yield improvements over original distributional word vectors. We expect this conclusion to hold in all settings; however, additional experiments for different languages and word embeddings would be beneficial.
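As a point of reference, joint goal accuracy counts a dialogue turn as correct only if the entire predicted user goal matches the gold goal; the sketch below (our own simplification, not the DST model used in the paper) shows the metric.

```python
def joint_goal_accuracy(predicted_goals, gold_goals):
    """Each element is a dict of slot -> value describing the user goal at one turn.
    A turn counts as correct only when the predicted goal matches the gold goal exactly."""
    assert len(predicted_goals) == len(gold_goals)
    correct = sum(1 for pred, gold in zip(predicted_goals, gold_goals) if pred == gold)
    return correct / len(gold_goals)

# Toy usage over two turns.
gold = [{"food": "indian", "area": "north"}, {"food": "thai"}]
pred = [{"food": "indian", "area": "north"}, {"food": "chinese"}]
print(joint_goal_accuracy(pred, gold))   # 0.5
```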

6 Conclusion and Future Work

In this work, we presented a method to perform semantic specialization of word vectors. Specifically, we compiled a new set of constraints obtained from BabelNet. Moreover, we improved a state-of-the-art post-specialization method by incorporating adversarial losses with the Wasserstein distance. Our results, obtained on an intrinsic and an extrinsic task, suggest that our method yields performance gains over current methods.

In the future, we plan to introduce constraints for asymmetric relations as well as extend our proposed method to leverage them. Moreover, we plan to experiment with adapting our model to a multilingual scenario, to be able to use it in a neural machine translation task. We make the code and resources available at: https://github.com/mbiesialska/wgan-postspec

Acknowledgments

We thank the anonymous reviewers for their insightful comments. This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund through the postdoctoral senior grant Ramón y Cajal and by the Agencia Estatal de Investigación through the projects EUR2019-103819 and PCIN-2017-079.

References

Rami Al-Rfou', Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 214–223. JMLR.org.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5012–5019.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. Proceedings of ICLR.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615, Denver, Colorado. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2173–2182, Austin, Texas. Association for Computational Linguistics.

Goran Glavaš and Ivan Vulić. 2018. Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34–45, Melbourne, Australia. Association for Computational Linguistics.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2672–2680, Cambridge, MA, USA. MIT Press.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. Improved training of Wasserstein GANs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

B.A. Kipfer. 2009. Roget's 21st Century Thesaurus (3rd Edition). Philip Lief Group.

Ira Leviant and Roi Reichart. 2015. Separated by an un-common language: Towards judgment language informed vector space modeling.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148, San Diego, California. Association for Computational Linguistics.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017a. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Vancouver, Canada. Association for Computational Linguistics.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017b. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 984–989, Denver, Colorado. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Edoardo Maria Ponti, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Adversarial propagation and zero-shot cross-lingual transfer of word vector specialization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 282–293, Brussels, Belgium. Association for Computational Linguistics.

Silke Scheible and Sabine Schulte im Walde. 2014. A database of paradigmatic semantic relation pairs for German nouns, verbs, and adjectives. In Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing, pages 111–119, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Irene Sucameli and Alessandro Lenci. 2017. Parad-it: Eliciting Italian paradigmatic relations with crowdsourcing. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Post-specialisation: Retrofitting vectors of words unseen in lexical resources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 516–527, New Orleans, Louisiana. Association for Computational Linguistics.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. Association for Computational Linguistics.

Jason D. Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse, 7:4–33.

Jingwei Zhang, Jeremy Salwen, Michael Glass, and Alfio Gliozzo. 2014. Word semantic representations using Bayesian probabilistic tensor factorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1522–1531, Doha, Qatar. Association for Computational Linguistics.
