Post-Specialisation: Retrofitting Vectors of Words Unseen in Lexical Resources

Ivan Vulić¹, Goran Glavaš², Nikola Mrkšić³, Anna Korhonen¹
¹ Language Technology Lab, University of Cambridge
² Data and Web Science Group, University of Mannheim
³ PolyAI
{iv250,alk23}@cam.ac.uk, [email protected], [email protected]

Abstract

Word vector specialisation (also known as retrofitting) is a portable, light-weight approach to fine-tuning arbitrary distributional word vector spaces by injecting external knowledge from rich lexical resources such as WordNet. By design, these post-processing methods only update the vectors of words occurring in external lexicons, leaving the representations of all unseen words intact. In this paper, we show that constraint-driven vector space specialisation can be extended to unseen words. We propose a novel post-specialisation method that: a) preserves the useful linguistic knowledge for seen words; and b) propagates this external signal to unseen words in order to improve their vector representations as well. Our post-specialisation approach fits an explicit non-linear specialisation function, in the form of a deep neural network, which learns to predict specialised vectors from their original distributional counterparts. The learned function is then used to specialise vectors of unseen words. This approach, applicable to any post-processing model, yields considerable gains over the initial specialisation models both in intrinsic word similarity tasks and in two downstream tasks: dialogue state tracking and lexical text simplification. The positive effects persist across three languages, demonstrating the importance of specialising the full vocabulary of distributional word vector spaces.

1 Introduction

Word representation learning is a key research area in current Natural Language Processing (NLP), with its usefulness demonstrated across a range of tasks (Collobert et al., 2011; Chen and Manning, 2014; Melamud et al., 2016b). The standard techniques for inducing distributed word representations are grounded in the distributional hypothesis (Harris, 1954): they rely on co-occurrence information in large textual corpora (Mikolov et al., 2013b; Pennington et al., 2014; Levy and Goldberg, 2014; Levy et al., 2015; Bojanowski et al., 2017). As a result, these models tend to coalesce the notions of semantic similarity and (broader) conceptual relatedness, and cannot accurately distinguish antonyms from synonyms (Hill et al., 2015; Schwartz et al., 2015). Recently, we have witnessed a rise of interest in representation models that move beyond stand-alone distributional training: they leverage external knowledge in human- and automatically-constructed lexical resources to enrich the semantic content of distributional word vectors, in a process termed semantic specialisation.

This is often done as a post-processing (sometimes referred to as retrofitting) step: input word vectors are fine-tuned to satisfy linguistic constraints extracted from lexical resources such as WordNet or BabelNet (Faruqui et al., 2015; Mrkšić et al., 2017). The use of external curated knowledge yields improved word vectors for the benefit of downstream applications (Faruqui, 2016). At the same time, this specialisation of the distributional space distinguishes between true similarity and relatedness, and supports language understanding tasks (Kiela et al., 2015; Mrkšić et al., 2017).

While there is consensus regarding their benefits and ease of use, one property of the post-processing specialisation methods slips under the radar: most existing post-processors update word embeddings only for words which are present (i.e., seen) in the external constraints, while vectors of all other (i.e., unseen) words remain unaffected. In this work, we propose a new approach that extends the specialisation framework to unseen words, relying on the transformation of the vector (sub)space of seen words. Our intuition is that the process of fine-tuning seen words provides implicit information on how to leverage the external knowledge for unseen words. The method should preserve the already injected knowledge for seen words, while simultaneously propagating the external signal to unseen words in order to improve their vectors.

The proposed post-specialisation method can be seen as a two-step process, illustrated in Fig. 1a: 1) we use a state-of-the-art specialisation model to transform the subspace of seen words from the input distributional space into the specialised subspace; 2) we learn a mapping function based on the transformation of the "seen subspace", and then apply it to the distributional subspace of unseen words. We allow the proposed post-specialisation model to learn from large external linguistic resources by implementing the mapping as a deep feed-forward neural network with non-linear activations. This allows the model to learn the generalisation of the fine-tuning steps taken by the initial specialisation model, itself based on a very large number (e.g., hundreds of thousands) of external linguistic constraints.

As indicated by the results on word similarity and two downstream tasks (dialogue state tracking and lexical text simplification), our post-specialisation method consistently outperforms state-of-the-art methods which specialise seen words only. We report improvements using three distinct input vector spaces for English and for three test languages (English, German, Italian), verifying the robustness of our approach.

2 Related Work and Motivation

Vector Space Specialisation A standard approach to incorporating external and background knowledge into word vector spaces is to pull the representations of similar words closer together and to push words in undesirable relations (e.g., antonyms) away from each other. Some models integrate such constraints into the training procedure and jointly optimise distributional and non-distributional objectives: they modify the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015), or use a variant of the SGNS-style objective (Liu et al., 2015; Ono et al., 2015; Osborne et al., 2016; Nguyen et al., 2017). In theory, word embeddings obtained by these joint models could be as good as representations produced by models which fine-tune the input vector space. However, their performance falls behind that of fine-tuning methods (Wieting et al., 2015). Another disadvantage is that their architecture is tied to a specific underlying distributional model.

In contrast, fine-tuning models inject external knowledge from available lexical resources (e.g., WordNet, PPDB) into pre-trained word vectors as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016; Cotterell et al., 2016; Mrkšić et al., 2017). Such post-processing models are popular because they offer a portable, flexible, and light-weight approach to incorporating external knowledge into arbitrary vector spaces, yielding state-of-the-art results on language understanding tasks (Faruqui et al., 2015; Mrkšić et al., 2016; Kim et al., 2016; Vulić et al., 2017b).

Existing post-processing models, however, suffer from a major limitation. Their modus operandi is to enrich the distributional information with external knowledge only if such knowledge is present in an external lexical resource. This means that they update and improve only representations of words actually seen in external resources. Because such words constitute only a fraction of the whole vocabulary (see Sect. 4), most words, unseen in the constraints, retain their original vectors. The main goal of this work is to address this shortcoming by specialising all words from the initial distributional space.

3 Methodology: Post-Specialisation

Our starting point is the state-of-the-art specialisation model ATTRACT-REPEL (AR) (Mrkšić et al., 2017), outlined in Sect. 3.1. We opt for the AR model due to its strong performance and ease of use, but we note that the proposed post-specialisation approach for specialising unseen words, described in Sect. 3.2, is applicable to any post-processor, as empirically validated in Sect. 5.

3.1 Initial Specialisation Model: AR

Let V_s be the vocabulary, A the set of synonymous ATTRACT word pairs (e.g., rich and wealthy), and R the set of antonymous REPEL word pairs (e.g., increase and decrease). The ATTRACT-REPEL procedure operates over mini-batches of such pairs, B_A and B_R. Let each word pair (x_l, x_r) in these sets correspond to a vector pair (x_l, x_r). A mini-batch of b_att attract word pairs is given by B_A = [(x_l^1, x_r^1), ..., (x_l^{k_1}, x_r^{k_1})] (analogously for B_R, which consists of b_rep pairs).

Next, the sets of negative examples T_A = [(t_l^1, t_r^1), ..., (t_l^{k_1}, t_r^{k_1})] and T_R = [(t_l^1, t_r^1), ..., (t_l^{k_2}, t_r^{k_2})] are defined as pairs of negative examples for each ATTRACT and REPEL pair in the mini-batches B_A and B_R.

These negative examples are chosen from the word vectors present in B_A or B_R so that, for each ATTRACT pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen such that t_l is the vector closest (in terms of cosine distance) to x_l and t_r is closest to x_r.¹ The negatives are used 1) to force ATTRACT pairs to be closer to each other than to their respective negative examples; and 2) to force REPEL pairs to be further away from each other than from their negative examples. The first term of the cost function pulls ATTRACT pairs together:

Att(B_A, T_A) = \sum_{i=1}^{b_{att}} [ \tau(\delta_{att} + x_l^i \cdot t_l^i - x_l^i \cdot x_r^i) + \tau(\delta_{att} + x_r^i \cdot t_r^i - x_l^i \cdot x_r^i) ]   (1)

where \tau(z) = max(0, z) is the standard rectifier function (Nair and Hinton, 2010) and \delta_{att} is the attract margin: it determines how much closer these vectors should be to each other than to their respective negative examples. The second, REPEL term in the cost function is analogous: it pushes REPEL word pairs away from each other by the margin \delta_{rep}.

Finally, in addition to the ATTRACT and REPEL terms, a regularisation term is used to preserve the semantic content originally present in the distributional vector space, as long as this information does not contradict the injected external knowledge. Let V(B) be the set of all word vectors present in a mini-batch; the distributional regularisation term is then:

Reg(B_A, B_R) = \sum_{x_i \in V(B_A \cup B_R)} \lambda_{reg} \| \widehat{x}_i - x_i \|_2   (2)

where \lambda_{reg} is the L2-regularisation constant and \widehat{x}_i denotes the original (distributional) word vector for the word x_i. The full ATTRACT-REPEL cost function is finally constructed as the sum of all three terms.

¹ Similarly, for each REPEL pair (x_l, x_r), the negative pair (t_l, t_r) is chosen from the in-batch vectors so that t_l is the vector furthest away from x_l and t_r is furthest from x_r. All vectors are unit length (re)normalised after each epoch.
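To make the AR cost terms concrete, the following is a minimal NumPy sketch of the ATTRACT term in Eq. (1) and the regularisation term in Eq. (2). It assumes unit-normalised vectors (so dot products act as cosine similarities) and hypothetical array shapes; it illustrates the equations above rather than the authors' implementation.

```python
import numpy as np

def rectifier(z):
    # tau(z) = max(0, z)
    return np.maximum(0.0, z)

def attract_cost(x_l, x_r, t_l, t_r, delta_att=0.6):
    """ATTRACT term of Eq. (1) for one mini-batch.

    x_l, x_r: (b_att, dim) vectors of ATTRACT pairs.
    t_l, t_r: (b_att, dim) their negative examples.
    Rows are assumed unit length, so dot products act as cosine similarities.
    """
    sim_pair = np.sum(x_l * x_r, axis=1)    # x_l^i . x_r^i
    sim_neg_l = np.sum(x_l * t_l, axis=1)   # x_l^i . t_l^i
    sim_neg_r = np.sum(x_r * t_r, axis=1)   # x_r^i . t_r^i
    return np.sum(rectifier(delta_att + sim_neg_l - sim_pair)
                  + rectifier(delta_att + sim_neg_r - sim_pair))

def regularisation_cost(x_batch, x_batch_original, lambda_reg=1e-9):
    """Eq. (2): keep updated vectors close to their distributional originals."""
    return lambda_reg * np.sum(np.linalg.norm(x_batch - x_batch_original, axis=1))
```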
3.2 Specialisation of Unseen Words

Problem Formulation The goal is to learn a global transformation function that generalises the perturbations of the initial vector space made by ATTRACT-REPEL (or any other specialisation procedure), as conditioned on the external constraints. The learned function propagates the signal coded in the input constraints to all the words unseen during the specialisation process. We seek a regression function f: R^dim -> R^dim, where dim is the vector space dimensionality. It maps word vectors from the initial vector space X to the specialised target space X'. Let X̂' = f(X) refer to the predicted mapping of the vector space, while the mapping of a single word vector is denoted x̂'_i = f(x_i).

An input distributional vector space X_d represents words from a vocabulary V_d. V_d may be divided into two vocabulary subsets: V_d = V_s ∪ V_u, V_s ∩ V_u = ∅, with the accompanying vector subspaces X_d = X_s ∪ X_u. V_s refers to the vocabulary of seen words: those that appear in the external linguistic constraints and have their embeddings changed in the specialisation process. V_u denotes the vocabulary of unseen words: those not present in the constraints and whose embeddings are unaffected by the specialisation procedure.

The AR specialisation process transforms only the subspace X_s into the specialised subspace X'_s. All words x_i ∈ V_s may now be used as training examples for learning the explicit mapping function f from X_s into X'_s. If N = |V_s|, we in fact rely on N training pairs: {(x_i, x'_i) : x_i ∈ X_s, x'_i ∈ X'_s}. The function f can then be applied to unseen words x ∈ V_u to yield the specialised subspace X̂'_u = f(X_u). The specialised space containing all words is then X_f = X'_s ∪ X̂'_u. The complete high-level post-specialisation procedure is outlined in Fig. 1a.

Note that another variant of the approach could obtain X_f as X_f = f(X_d), that is, the entire distributional space is transformed by f. However, this variant seems counter-intuitive, as it forgets the actual output of the initial specialisation procedure and replaces word vectors from X'_s with their approximations, i.e., f-mapped vectors.²

Objective Functions As mentioned, the N seen words x_i ∈ V_s in fact serve as our "pseudo-translation" pairs supporting the learning of a cross-space mapping function. In practice, in its high-level formulation, our mapping problem is equivalent to those encountered in the literature on cross-lingual word embeddings, where the goal is to learn a shared cross-lingual space given monolingual vector spaces in two languages and N_1 translation pairs (Mikolov et al., 2013a; Lazaridou et al., 2015; Vulić and Korhonen, 2016b; Artetxe et al., 2016, 2017; Conneau et al., 2017; Ruder et al., 2017).

² We have empirically confirmed the intuition that the first variant is superior to this alternative. We do not report the actual quantitative comparison for brevity.
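In code, the high-level recipe of this subsection amounts to pairing each seen word's distributional vector with its specialised counterpart, fitting f on those pairs, and mapping the remaining (unseen) words with the learned f. A schematic sketch follows; the dictionaries X_dist and X_spec and the fit/predict callables are hypothetical placeholders for whatever regression model is used.

```python
import numpy as np

def build_training_pairs(X_dist, X_spec):
    """Seen words are those that received an AR-specialised vector."""
    seen = sorted(set(X_dist) & set(X_spec))
    X_s = np.stack([X_dist[w] for w in seen])        # inputs: original vectors
    X_s_prime = np.stack([X_spec[w] for w in seen])  # targets: specialised vectors
    return seen, X_s, X_s_prime

def post_specialise(X_dist, X_spec, fit, predict):
    """fit/predict implement the regression f (e.g., the DFFN of Fig. 1b)."""
    seen, X_s, X_s_prime = build_training_pairs(X_dist, X_spec)
    model = fit(X_s, X_s_prime)          # learn f on the N = |V_s| pairs
    X_f = dict(X_spec)                   # keep AR-specialised vectors for seen words
    for w, v in X_dist.items():
        if w not in X_f:                 # unseen words: apply the learned mapping f
            X_f[w] = predict(model, v)
    return X_f                           # X_f = X'_s  union  X̂'_u
```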

[Figure 1 appears here; panels (a) and (b) are described in the caption below. Panel (b) depicts the DFFN mapping from x_i ∈ X_u to x̂'_i ∈ X̂'_u, with input/output dimensionality d = 300, H hidden layers of dimensionality 512, and swish activations.]

Figure 1: (a) High-level illustration of the post-specialisation approach: the subspace X_s of the initial distributional vector space X_d = X_s ∪ X_u is first specialised/fine-tuned by the ATTRACT-REPEL specialisation model (or any other post-processing model) to obtain the transformed subspace X'_s. The words present (i.e., seen) in the input set of linguistic constraints are now assigned different representations in X_s (the original distributional vector) and X'_s (the specialised vector): they are therefore used as training examples to learn a non-linear cross-space mapping function. This function is then applied to all word vectors x_i ∈ X_u representing words unseen in the constraints to yield a specialised subspace X̂'_u. The final space is X_f = X'_s ∪ X̂'_u, and it contains transformed representations for all words from the initial space X_d. (b) The actual implementation of the non-linear regression function which maps from X_u to X̂'_u: a deep feed-forward fully-connected neural net with non-linearities and H hidden layers.

In our setup, the standard objective based on L2-penalised least squares may be formulated as follows:

f_{MSE} = \arg\min_f \| f(X_s) - X'_s \|_F^2   (3)

where \| \cdot \|_F^2 denotes the squared Frobenius norm. In the most common form, f(X_s) is simply a linear map/matrix W_f ∈ R^{dim×dim} (Mikolov et al., 2013a), as follows: f(X) = W_f X. After learning f based on the X_s -> X'_s transformation, one can simply apply f to unseen words: X̂'_u = f(X_u). This linear mapping model, termed LINEAR-MSE, has an analytical solution (Artetxe et al., 2016), and has been proven to work well with cross-lingual embeddings. However, given that the specialisation model injects hundreds of thousands (or even millions) of linguistic constraints into the distributional space (see later in Sect. 4), we suspect that the assumption of linearity is too limiting and does not fully hold in this particular setup.

Using the same L2-penalised least squares objective, we can thus replace the linear map with a non-linear function f: R^dim -> R^dim. The non-linear mapping, illustrated by Fig. 1b, is implemented as a deep feed-forward fully-connected neural network (DFFN) with H hidden layers and non-linear activations. This variant is called NONLINEAR-MSE.

Another variant objective is the contrastive margin-based ranking loss with negative sampling (MM), similar to the original ATTRACT-REPEL objective and used for other applications in prior work, e.g., for cross-modal mapping (Weston et al., 2011; Frome et al., 2013; Lazaridou et al., 2015; Kummerfeld et al., 2015). Let x̂'_i = f(x_i) denote the predicted vector for the word x_i ∈ V_s, and let x'_i refer to the "true" vector of x_i in the specialised space X'_s after the AR specialisation procedure. The MM loss is then defined as follows:

J_{MM} = \sum_{i=1}^{N} \sum_{j \neq i}^{k} \tau( \delta_{mm} - \cos(\widehat{x}'_i, x'_i) + \cos(\widehat{x}'_i, x'_j) )

where cos is the cosine similarity measure, \delta_{mm} is the margin, and k is the number of negative samples. The objective tries to learn the mapping f so that each predicted vector x̂'_i is, by the specified margin \delta_{mm}, closer to the correct target vector x'_i than to any of the k target vectors x'_j serving as negative examples.³ The function f can again be either a simple linear map (LINEAR-MM) or implemented as a DFFN (NONLINEAR-MM, see Fig. 1b).

³ We have also experimented with a simpler hinge loss function without negative examples, formulated as J = \sum_{i=1}^{N} \tau( \delta_{mm} - \cos(\widehat{x}'_i, x'_i) ). For instance, with \delta_{mm} = 1.0, the idea is to learn a mapping f that, for each x_i, enforces the predicted vector and the correct target vector to have maximum cosine similarity. We do not report the results with this variant: although it outscores the MSE-style objective, it was consistently outperformed by the MM objective.
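For illustration, a compact PyTorch sketch of the NONLINEAR-MM variant is given below, using the hyperparameters reported later in Sect. 4 (H hidden layers of size 512, swish activations, HE-normal initialisation, margin delta_mm and k in-batch negatives). It is a hedged re-implementation sketch, not the released code; in particular, negatives are drawn here by cyclically shifting the batch, which is only one possible sampling scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)          # swish(x) with beta = 1

class DFFN(nn.Module):
    """f: R^dim -> R^dim with H hidden layers; no activation on the output layer."""
    def __init__(self, dim=300, hidden=512, H=5):
        super().__init__()
        layers, d_in = [], dim
        for _ in range(H):
            layers += [nn.Linear(d_in, hidden), Swish()]
            d_in = hidden
        layers.append(nn.Linear(d_in, dim))   # full-range (linear) output
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight)   # HE normal initialisation
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

def mm_loss(pred, target, delta_mm=0.6, k=25):
    """Max-margin (MM) loss: each predicted vector should be closer (by delta_mm)
    to its own target than to k other target vectors from the same mini-batch.
    Assumes the batch holds more than k examples so shifted rows act as negatives."""
    pos = F.cosine_similarity(pred, target)                  # cos(x̂'_i, x'_i)
    loss = pred.new_zeros(())
    for shift in range(1, k + 1):                            # k in-batch negatives, j != i
        neg = F.cosine_similarity(pred, torch.roll(target, shifts=shift, dims=0))
        loss = loss + torch.clamp(delta_mm - pos + neg, min=0.0).sum()
    return loss
```

The LINEAR-MSE baseline of Eq. (3), by contrast, needs no network at all: with a row-vector convention for X_s and X'_s it reduces to an ordinary least-squares problem, e.g., `W, *rest = numpy.linalg.lstsq(X_s, X_s_prime, rcond=None)`.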

4 Experimental Setup

Starting Word Embeddings (X_d = X_s ∪ X_u) To test the robustness of our approach, we experiment with three well-known, publicly available collections of English word vectors: 1) Skip-Gram with Negative Sampling (SGNS-BOW2) (Mikolov et al., 2013b) trained on the Polyglot Wikipedia (Al-Rfou et al., 2013) by Levy and Goldberg (2014) using bag-of-words windows of size 2; 2) GLOVE Common Crawl (Pennington et al., 2014); and 3) FASTTEXT (Bojanowski et al., 2017), a SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors. All word embeddings are 300-dimensional.⁴

AR Specialisation and Constraints (X_s -> X'_s) We experiment with the linguistic constraints used before by Mrkšić et al. (2017) and Vulić et al. (2017a): they extracted monolingual synonymy/ATTRACT pairs from the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) (640,435 synonymy pairs in total), while their antonymy/REPEL constraints came from BabelNet (Navigli and Ponzetto, 2012) (11,939 pairs).⁵

The coverage of the V_d vocabulary words in the constraints illustrates well the problem of unseen words with the fine-tuning specialisation models. For instance, the constraints cover only a small subset of the entire vocabulary V_d for SGNS-BOW2: 16.6%. They also cover only 15.3% of the top 200K most frequent V_d words from FASTTEXT.

Network Design and Parameters (X_u -> X̂'_u) The non-linear regression function f: R^d -> R^d is a DFFN with H hidden layers, each of dimensionality d_1 = d_2 = ... = d_H = 512 (see Fig. 1b). Non-linear activations are used in each layer and omitted only before the final output layer to enable full-range predictions (see Fig. 1b again).

The choices of non-linear activation and initialisation are guided by recent recommendations from the literature. First, we use swish (Ramachandran et al., 2017; Elfwing et al., 2017) as the non-linearity, defined as swish(x) = x · sigmoid(βx). We fix β = 1 as suggested by Ramachandran et al. (2017).⁶ Second, we use the HE normal initialisation (He et al., 2015), which is preferred over the XAVIER initialisation (Glorot and Bengio, 2010) for deep models (Mishkin and Matas, 2016; Li et al., 2016), although in our experiments we do not observe a significant difference in performance between the two alternatives. We set H = 5 in all experiments without any fine-tuning; we also analyse the impact of the network depth in Sect. 5.

Optimisation For the AR specialisation step, we adopt the originally suggested model setup. Hyperparameter values are set to: δ_att = 0.6, δ_rep = 0.0, λ_reg = 10^-9 (Mrkšić et al., 2017). The models are trained for 5 epochs with Adagrad (Duchi et al., 2011), with batch sizes set to b_att = b_rep = 50, again as in the original work.

For training the non-linear mapping with the DFFN (Fig. 1b), we use the Adam algorithm (Kingma and Ba, 2015) with default settings. The model is trained for 100 epochs with early stopping on a validation set. We reserve 10% of all available seen data (i.e., the words from V_s represented in X_s and X'_s) for validation; the rest are used for training. For the MM objective, we set δ_mm = 0.6 and k = 25 in all experiments without any fine-tuning.

5 Results and Discussion

5.1 Intrinsic Evaluation: Word Similarity

Evaluation Protocol The first set of experiments evaluates vector spaces with different specialisation procedures intrinsically on word similarity benchmarks: we use the SimLex-999 dataset (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016), a recent verb pair similarity dataset providing similarity ratings for 3,500 verb pairs.⁷ Spearman's ρ rank correlation is used as the evaluation metric.

⁴ For further details regarding the architectures and training setup of the used vector collections, we refer the reader to the original papers. Additional experiments with other word vectors, e.g., with CONTEXT2VEC (Melamud et al., 2016a) (which uses bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) for context modeling) and with dependency-based word embeddings (Bansal et al., 2014; Melamud et al., 2016b), lead to similar results and the same conclusions.
⁵ We have experimented with another set of constraints used in prior work (Zhang et al., 2014; Ono et al., 2015), reaching similar conclusions: these were extracted from WordNet (Fellbaum, 1998) and Roget (Kipfer, 2009), and comprise 1,023,082 synonymy pairs and 380,873 antonymy pairs.
⁶ According to Ramachandran et al. (2017), for deep networks swish has a slight edge over the family of LU/ReLU-related activations (Maas et al., 2013; He et al., 2015; Klambauer et al., 2017). We also observe a minor (and insignificant) difference in performance in favour of swish.
⁷ While other gold standards such as WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014) coalesce the notions of true semantic similarity and (more broad) conceptual relatedness, SimLex and SimVerb provide explicit guidelines to discern between the two, so that related but non-similar words (e.g., tiger and jungle) have a low rating.

                              Setup: hold-out                               Setup: all
                        GLOVE       SGNS-BOW2   FASTTEXT         GLOVE       SGNS-BOW2   FASTTEXT
                        SL    SV    SL    SV    SL    SV          SL    SV    SL    SV    SL    SV

Distributional: X_d     .408  .286  .414  .275  .383  .255        .408  .286  .414  .275  .383  .255
+AR specialisation: X'_s .408 .286  .414  .275  .383  .255        .690  .578  .658  .544  .629  .502
++Mapping unseen: X_f
  LINEAR-MSE            .504  .384  .447  .309  .405  .285        .690  .578  .656  .551  .628  .502
  NONLINEAR-MSE         .549  .407  .484  .344  .459  .329        .694  .586  .663  .556  .631  .506
  LINEAR-MM             .548  .422  .468  .329  .419  .308        .697  .582  .663  .554  .628  .487
  NONLINEAR-MM          .603  .480  .531  .391  .471  .349        .705  .600  .667  .562  .638  .507

Table 1: Spearman's ρ correlation scores for three word vector collections on two English word similarity datasets, SimLex-999 (SL) and SimVerb-3500 (SV), using different mapping variants, evaluation protocols, and word vector spaces: from the initial distributional space X_d to the fully specialised space X_f. H = 5.
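The correlations in Table 1 follow the standard word similarity protocol: rank the model's cosine similarities against the human ratings. A minimal sketch with SciPy, using hypothetical inputs (`vectors`: word -> NumPy array; `pairs`: list of (word1, word2, human_score) tuples):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(vectors, pairs):
    """Spearman's rho between cosine similarities and human similarity ratings.
    Pairs containing out-of-vocabulary words are skipped for simplicity."""
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in vectors and w2 in vectors:
            v1, v2 = vectors[w1], vectors[w2]
            cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            model_scores.append(cos)
            gold_scores.append(gold)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho
```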

[Figure 2 appears here: three panels, (a) GLOVE, (b) SGNS-BOW2, (c) FASTTEXT, each plotting Spearman's ρ correlation on SimLex-999 and SimVerb-3500 against the number of hidden layers H (0-8); see the caption below.]

Figure 2: The results of the hold-out experiments on SimLex-999 and SimVerb-3500 after applying our non-linear vector space transformation with different depths (hidden layer size H, see Fig. 1b). The results are presented as averages over 20 runs with the NONLINEAR-MM variant; the shaded regions are spanned by the maximum and minimum scores obtained. Thick horizontal lines refer to Spearman's rank correlations achieved in the initial space X_d. H = 0 denotes the standard linear regression model (Mikolov et al., 2013a; Lazaridou et al., 2015) (LINEAR-MM is shown since it outperforms LINEAR-MSE).

We evaluate word vectors in two settings. First, in a synthetic hold-out setting, we remove all linguistic constraints which contain words from the SimLex and SimVerb evaluation data, effectively forcing all SimLex and SimVerb words to be unseen by the AR specialisation model. The specialised vectors for these words are estimated by the learned non-linear DFFN mapping model. Second, the all setting is a standard "real-life" scenario where some test (SimLex/SimVerb) words do occur in the constraints, while the mapping is learned for the remaining words.

Results and Analysis The results with the three word vector collections are provided in Tab. 1. In addition, Fig. 2 plots the influence of the network depth H on the model's performance.

The results suggest that the mapping of unseen words is universally useful, as the highest correlation scores are obtained with the final fully specialised vector space X_f for all three input spaces. The results in the hold-out setup are particularly indicative of the improvement achieved by our post-specialisation method. For instance, it achieves a +0.2 correlation gain with GLOVE on both SimLex and SimVerb by specialising vector representations for words present in these datasets without seeing a single external constraint which contains any of these words. This suggests that the perturbation of the seen subspace X_s by ATTRACT-REPEL contains implicit knowledge that can be propagated to X_u, learning better representations for unseen words. We observe small but consistent improvements across the board in the all setup. The smaller gains can be explained by the fact that a majority of SimLex and SimVerb words are present in the external constraints (93.7% and 87.2%, respectively).

             GLOVE        SGNS-BOW2    FASTTEXT
             SL    SV     SL    SV     SL    SV

X_d          .408  .286   .414  .275   .383  .255
X_f: RFit    .493  .365   .412  .285   .413  .279
X_f: CFit    .540  .401   .439  .318   .441  .306

Table 2: Post-specialisation applied to two other post-processing methods. SL: SimLex; SV: SimVerb. Hold-out setting. NONLINEAR-MM.

The scores also indicate that both non-linearity and the chosen objective function contribute to the quality of the learned mapping: the largest gains are reported with the NONLINEAR-MM variant, which a) employs non-linear activations and b) replaces the basic mean-squared-error objective with max-margin. The usefulness of the latter has been established in prior work on cross-space mapping learning (Lazaridou et al., 2015). The former indicates that the initial AR transformation is non-linear. It is guided by a large number of constraints; their effect cannot be captured by a simple linear map as in prior work on, e.g., cross-lingual word embeddings (Mikolov et al., 2013a; Ruder et al., 2017).

Finally, the analysis of the network depth H indicates that going deeper helps only to a certain extent. Adding more layers allows for a richer parametrisation of the network (which is beneficial given the number of linguistic constraints used by AR). This makes the model more expressive, but performance seems to saturate with larger H values.

Post-Specialisation with Other Post-Processors We also verify that our post-specialisation approach is not tied to the ATTRACT-REPEL method, and is indeed applicable on top of any post-processing specialisation method. We analyse the impact of post-specialisation in the hold-out setting using the original retrofitting (RFit) model (Faruqui et al., 2015) and counter-fitting (CFit) (Mrkšić et al., 2016) in lieu of ATTRACT-REPEL. The results on word similarity with the best-performing NONLINEAR-MM variant are summarised in Tab. 2.

The scores again indicate the usefulness of post-specialisation. As expected, the gains are lower than with ATTRACT-REPEL. RFit falls short of CFit as, by design, it can leverage only synonymy (i.e., ATTRACT) external constraints.

5.2 Downstream Task I: DST

Next, we evaluate the usefulness of post-specialisation for two downstream tasks – dialogue state tracking and lexical text simplification – in which discerning semantic similarity from other types of semantic relatedness is crucial. We first evaluate the importance of post-specialisation for a downstream language understanding task of dialogue state tracking (DST) (Henderson et al., 2014; Williams et al., 2016), adopting the evaluation protocol and data of Mrkšić et al. (2017).

DST: Model and Evaluation The DST model is the first component of modern dialogue pipelines (Young, 2010): it captures the users' goals at each dialogue turn and then updates the dialogue state. Goals are represented as sets of constraints expressed as slot-value pairs (e.g., food=Chinese). The set of slots and the set of values for each slot constitute the ontology of a dialogue domain. The probability distribution over the possible states is the system's estimate of the user's goals, and it is used by the dialogue manager module to select the subsequent system response (Su et al., 2016). An example in Fig. 3 illustrates the DST pipeline.
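As a toy illustration of the state representation described above (and not of the NBT model introduced next), a dialogue state can be kept as a dictionary of slot-value constraints that is updated at every turn; a turn then counts towards the joint goal accuracy metric used later in this section only if the full predicted goal matches the gold goal.

```python
from typing import Dict

def update_state(state: Dict[str, str], turn_goals: Dict[str, str]) -> Dict[str, str]:
    """Overwrite or add the slot-value constraints expressed in the current turn.
    e.g., update_state({"food": "Chinese"}, {"price": "cheap"})
          -> {"food": "Chinese", "price": "cheap"}"""
    new_state = dict(state)
    new_state.update(turn_goals)
    return new_state

def joint_goal_correct(predicted: Dict[str, str], gold: Dict[str, str]) -> bool:
    """A turn is counted as correct only if every gold slot-value pair
    is predicted exactly (and no spurious constraints are added)."""
    return predicted == gold
```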
For evaluation, we use the Neural Belief Tracker (NBT), a state-of-the-art DST model which was the first to reason purely over pre-trained word vectors (Mrkšić et al., 2017).⁸ The NBT uses no hand-crafted semantic lexicons, instead composing word vectors into intermediate utterance and context representations.⁹ For full model details, we refer the reader to the original paper. The importance of word vector specialisation for the DST task (e.g., distinguishing between synonyms and antonyms by pulling northern and north closer in the vector space while pushing north and south away) has been established (Mrkšić et al., 2017).

Figure 3: DST labels (user goals given by slot-value pairs) in a multi-turn dialogue (Mrkšić et al., 2015).

⁸ https://github.com/nmrksic/neural-belief-tracker
⁹ The NBT keeps word vectors fixed during training to enable generalisation for words unseen in the DST training data.

ENGLISH                          hold-out   all

Distributional: X_d              .797       .797
+AR Spec.: X'_s ∪ X_u            .797       .817
++Mapping: X_f = X'_s ∪ X̂'_u
  LINEAR-MM                      .815       .818
  NONLINEAR-MM                   .827       .835

Table 3: DST results in two evaluation settings (hold-out and all) with different GLOVE variants.

Again, as in prior work, the DST evaluation is based on the Wizard-of-Oz (WOZ) v2.0 dataset (Wen et al., 2017; Mrkšić et al., 2017), comprising 1,200 dialogues split into training (600 dialogues), development (200), and test data (400). In all experiments, we report the standard DST performance measure: joint goal accuracy, and report scores as averages over 5 NBT training runs.

Results and Analysis We again evaluate word vectors in two settings: 1) hold-out, where linguistic constraints with words appearing in the WOZ data are removed, making all WOZ words unseen by ATTRACT-REPEL; and 2) all. The results for the English DST task with different GLOVE word vector variants are summarised in Tab. 3; similar trends in results are observed with the two other word vector collections. The scores maintain the conclusions established in the word similarity task. First, semantic specialisation with ATTRACT-REPEL is again beneficial, and discerning between synonyms and antonyms improves DST performance. However, specialising unseen words (the final X̂'_u subspace) yields further improvements in both evaluation settings, supporting our claim that the specialisation signal can be propagated to unseen words. This downstream evaluation again demonstrates the importance of non-linearity, as the peak scores are reported with the NONLINEAR-MM variant.

More substantial gains in the all setup are observed in the DST task compared to the word similarity task. This stems from a lower coverage of the WOZ data in the AR constraints: 36.3% of all WOZ words are unseen words. Finally, the scores are higher on average in the all setup, since this setup uses more external constraints for AR, and consequently uses more training examples to learn the mapping.

Other Languages We test the portability of our framework to two other languages for which we have similar evaluation data: German (DE) and Italian (IT). SimLex-999 has been translated and rescored in the two languages by Leviant and Reichart (2015), and the WOZ data were translated and adapted by Mrkšić et al. (2017). Exactly the same setup is used as in our English experiments, without any additional language-specific fine-tuning. Linguistic constraints were extracted from the same sources: synonyms from the PPDB (135,868 in DE, 362,452 in IT), antonyms from BabelNet (4,124 in DE and 16,854 in IT). Our starting distributional vector spaces are taken from prior work: IT vectors are from Dinu et al. (2015), DE vectors are from Vulić and Korhonen (2016a). The results are summarised in Tab. 4.

Our post-specialisation approach yields consistent improvements over the initial distributional space and the AR specialisation model in both tasks and for both languages. We do not observe any gain on IT SimLex in the all setup since the IT constraints have almost complete coverage of all IT SimLex words (99.3%; the coverage is 64.8% in German). As expected, the DST scores in the all setup are higher than in the hold-out setup due to a larger number of constraints and training examples.

Lower absolute scores for Italian and German compared to the ones reported for English are due to multiple factors, as discussed previously by Mrkšić et al. (2017): 1) the AR model uses fewer linguistic constraints for DE and IT; 2) distributional word vectors are induced from smaller corpora; 3) linguistic phenomena (e.g., cases and compounding in DE) contribute to data sparsity and also make the DST task more challenging. However, it is important to stress the consistent gains over the vector space specialised by the state-of-the-art ATTRACT-REPEL model across all three test languages. This indicates that the proposed approach is language-agnostic and portable to multiple languages.

                        GERMAN                                   ITALIAN
                        SimLex (Similarity)   WOZ (DST)          SimLex (Similarity)   WOZ (DST)
                        hold-out   all        hold-out   all     hold-out   all        hold-out   all

Distributional: X_d     .267       .267       .487       .487    .363       .363       .618       .618
+AR Spec.: X'_s ∪ X_u   .267       .422       .487       .535    .363       .616       .618       .634
++Mapping: X_f
  LINEAR-MM             .354       .449       .485       .533    .401       .616       .627       .633
  NONLINEAR-MM          .367       .466       .496       .538    .428       .616       .637       .647

Table 4: Results on word similarity (Spearman's ρ) and DST (joint goal accuracy) for German and Italian.

5.3 Downstream Task II: Lexical Simplification

In our second downstream task, we examine the effects of post-specialisation on lexical simplification (LS) in English. LS aims to substitute complex words (i.e., less commonly used words) with their simpler synonyms in the context. Simplified text must keep the meaning of the original text, which is why discerning similarity from relatedness is important (e.g., in "The automobile was set on fire" the word "automobile" should be replaced with "car" or "vehicle", but not with "wheel" or "driver").
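To see why similarity-specialised vectors matter for this task, consider a deliberately simplified substitution routine that picks the nearest, more frequent neighbour of a complex word in a given vector space. This toy sketch (with hypothetical inputs) is not the LIGHT-LS system described next.

```python
import numpy as np

def simplify_word(word, vectors, frequency, top_n=10):
    """Replace `word` with its most similar, more frequent (i.e., 'simpler') neighbour.

    vectors:   dict word -> unit-normalised np.ndarray
    frequency: dict word -> corpus frequency, used as a crude simplicity proxy
    """
    query = vectors[word]
    scored = sorted(
        ((float(np.dot(query, vec)), cand) for cand, vec in vectors.items() if cand != word),
        reverse=True,
    )[:top_n]
    candidates = [cand for _, cand in scored
                  if frequency.get(cand, 0) > frequency.get(word, 0)]
    return candidates[0] if candidates else word
```

In a relatedness-dominated space, the top neighbour of "automobile" may well be "driver" or "wheel"; in a space specialised for true similarity, it is more likely to be "car", which is the effect the following evaluation probes.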

Vectors      Specialisation            Acc.   Ch.

GLOVE        Distributional: X_d       66.0   94.0
             +AR Spec.: X'_s ∪ X_u     67.6   87.0
             ++Mapping: X_f            72.3   87.6

FASTTEXT     Distributional: X_d       57.8   84.0
             +AR Spec.: X'_s ∪ X_u     69.8   89.4
             ++Mapping: X_f            74.3   88.8

SGNS-BOW2    Distributional: X_d       56.0   79.1
             +AR Spec.: X'_s ∪ X_u     64.4   86.7
             ++Mapping: X_f            70.9   86.8

Table 5: Lexical simplification performance with post-specialisation applied on three input spaces.

We employ LIGHT-LS (Glavaš and Štajner, 2015), a lexical simplification algorithm that: 1) makes substitutions based on word similarities in a semantic vector space; and 2) can be provided an arbitrary embedding space as input.¹⁰ For a complex word, LIGHT-LS considers the most similar words from the vector space as simplification candidates. Candidates are ranked according to several features indicating simplicity and fitness for the context (semantic relatedness to the context of the complex word). The substitution is made if the best candidate is simpler than the original word. By providing vector spaces post-specialised for semantic similarity to LIGHT-LS, we expect to more often replace complex words with their true synonyms.

We evaluate LIGHT-LS performance in the all setup on the LS benchmark compiled by Horn et al. (2014), who crowdsourced 50 manual simplifications for each complex word. As in prior work, we evaluate performance with the following metrics: 1) Accuracy (Acc.) is the number of correct simplifications made (i.e., the system made the simplification and its substitution is found in the list of crowdsourced substitutions), divided by the total number of indicated complex words; 2) Changed (Ch.) is the percentage of indicated complex words that were replaced by the system (whether or not the replacement was correct).

LS results are summarised in Tab. 5. Post-specialised vector spaces consistently yield a 5-6% gain in Accuracy compared to the respective distributional vectors and to embeddings specialised with the state-of-the-art ATTRACT-REPEL model. Similar to the DST evaluation, improvements over ATTRACT-REPEL demonstrate the importance of specialising the vectors of the entire vocabulary and not only the vectors of words from the external constraints.

6 Conclusion and Future Work

We have presented a novel post-processing model, termed post-specialisation, that specialises word vectors for the full vocabulary of the input vector space. Previous post-processing specialisation models fine-tune word vectors only for words occurring in external lexical resources. In this work, we have demonstrated that the specialisation of the subspace of seen words can be leveraged to learn a mapping function which specialises vectors for all other words, unseen in the external resources. Our results across word similarity and downstream language understanding tasks show consistent improvements over the state-of-the-art specialisation method for all three test languages.

In future work, we plan to extend our approach to specialisation for asymmetric relations such as hypernymy or meronymy (Glavaš and Ponzetto, 2017; Nickel and Kiela, 2017; Vulić and Mrkšić, 2018). We will also investigate more sophisticated non-linear functions. The code is available at: https://github.com/cambridgeltl/post-specialisation/.

Acknowledgments

We thank the three anonymous reviewers for their insightful suggestions. This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909).

¹⁰ https://github.com/codogogo/lightls

524 References Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2017. Sigmoid-weighted linear units for neural network Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. function approximation in reinforcement learning. 2013. Polyglot: Distributed word representations for CoRR, abs/1702.03118. multilingual NLP. In Proceedings of CoNLL, pages 183–192. Manaal Faruqui. 2016. Diverse Context for Learning Word Representations. Ph.D. thesis, Carnegie Mel- Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. lon University. Learning principled bilingual mappings of word em- beddings while preserving monolingual invariance. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, In Proceedings of EMNLP, pages 2289–2294. Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Proceedings of NAACL-HLT, pages 1606–1615. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL, pages Christiane Fellbaum. 1998. WordNet. 451–462. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Tailoring continuous word representations for depen- Ruppin. 2002. Placing search in context: The con- Proceedings of ACL dency . In , pages 809– cept revisited. ACM Transactions on Information 815. Systems, 20(1):116–131. Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Knowledge-powered deep learning for word embed- Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, ding. In Proceedings of ECML-PKDD, pages 132– and Tomas Mikolov. 2013. DeViSE: A deep visual- 148. semantic embedding model. In Proceedings of Piotr Bojanowski, Edouard Grave, Armand Joulin, and NIPS, pages 2121–2129. Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, Juri Ganitkevitch, Benjamin Van Durme, and Chris 5:135–146. Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT, pages Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. 758–764. Multimodal . Journal of Ar- tificial Intelligence Research, 49:1–47. Daniela Gerz, Ivan Vulic,´ Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large- Danqi Chen and Christopher D. Manning. 2014.A scale evaluation set of verb similarity. In Proceed- fast and accurate dependency parser using neural net- ings of EMNLP, pages 2173–2182. works. In Proceedings of EMNLP, pages 740–750. Goran Glavaš and Simone Paolo Ponzetto. 2017. Ronan Collobert, Jason Weston, Léon Bottou, Michael Dual tensor model for detecting asymmetric lexico- Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. semantic relations. In Proceedings of EMNLP, 2011. Natural language processing (almost) from pages 1758–1768. scratch. Journal of Research, 12:2493–2537. Goran Glavaš and Sanja Štajner. 2015. Simplifying lex- ical simplification: Do we need simplified corpora? Alexis Conneau, Guillaume Lample, Marc’Aurelio In Proceedings of ACL, pages 63–68. Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. CoRR, Xavier Glorot and Yoshua Bengio. 2010. Understand- abs/1710.04087. ing the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, pages 249– Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 256. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of ACL, pages Zellig S. Harris. 1954. Distributional structure. 
Word, 1651–1660. 10(23):146–162.

Georgiana Dinu, Angeliki Lazaridou, and Marco Ba- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian roni. 2015. Improving zero-shot learning by mitigat- Sun. 2015. Delving deep into rectifiers: Surpass- ing the hubness problem. In Proceedings of ICLR ing human-level performance on ImageNet classifi- (Workshop Papers). cation. In Proceedings of ICCV, pages 1026–1034.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Matthew Henderson, Blaise Thomson, and Jason D. Adaptive subgradient methods for online learning Wiliams. 2014. The Second Dialog State Tracking and stochastic optimization. Journal of Machine Challenge. In Proceedings of SIGDIAL, pages 263– Learning Research, 12:2121–2159. 272.

525 Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. SimLex-999: Evaluating semantic models with (gen- 2013. Rectifier nonlinearities improve neural net- uine) similarity estimation. Computational Linguis- work acoustic models. In Proceedings of ICML. tics, 41(4):665–695. Oren Melamud, Jacob Goldberger, and Ido Dagan. Sepp Hochreiter and Jürgen Schmidhuber. 1997. 2016a. Context2vec: Learning generic context em- Long Short-Term Memory. Neural Computation, bedding with bidirectional LSTM. In Proceedings 9(8):1735–1780. of CoNLL, pages 51–61.

Colby Horn, Cathryn Manduca, and David Kauchak. Oren Melamud, David McClosky, Siddharth Patward- 2014. Learning a lexical simplifier using wikipedia. han, and Mohit Bansal. 2016b. The role of context In Proceedings of the ACL, pages 458–463. types and dimensionality in learning word embed- Douwe Kiela, Felix Hill, and Stephen Clark. 2015. dings. In Proceedings of NAACL-HLT, pages 1030– Specializing word embeddings for similarity or re- 1040. latedness. In Proceedings of EMNLP, pages 2044– 2048. Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages Joo-Kyung Kim, Gokhan Tur, Asli Celikyilmaz, Bin for . arXiv preprint, CoRR, Cao, and Ye-Yi Wang. 2016. Intent detection us- abs/1309.4168. ing semantically enriched word embeddings. In Pro- ceedings of SLT. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed rep- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A resentations of words and phrases and their compo- method for stochastic optimization. In Proceedings sitionality. In Proceedings of NIPS, pages 3111– of ICLR (Conference Track). 3119.

Barbara Ann Kipfer. 2009. Roget’s 21st Century The- Dmytro Mishkin and Jiri Matas. 2016. All you need saurus (3rd Edition). Philip Lief Group. is a good init. In Proceedings of ICLR (Conference Track). Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing Nikola Mrkšic,´ Diarmuid Ó Séaghdha, Tsung-Hsien neural networks. CoRR, abs/1706.02515. Wen, Blaise Thomson, and Steve Young. 2017. Neu- Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, ral belief tracker: Data-driven dialogue state track- and Dan Klein. 2015. An empirical analysis of op- ing. In Proceedings of ACL, pages 1777–1788. timization for max-margin NLP. In Proceedings of EMNLP, pages 273–279. Nikola Mrkšic,´ Diarmuid Ó Séaghdha, Blaise Thom- son, Milica Gašic,´ Pei-Hao Su, David Vandyke, Angeliki Lazaridou, Georgiana Dinu, and Marco Ba- Tsung-Hsien Wen, and Steve Young. 2015. Multi- roni. 2015. Hubness and pollution: Delving into domain dialog state tracking using recurrent neural cross-space mapping for zero-shot learning. In Pro- networks. In Proceedings of ACL, pages 794–799. ceedings of ACL, pages 270–280. Nikola Mrkšic,´ Diarmuid Ó Séaghdha, Blaise Thom- Ira Leviant and Roi Reichart. 2015. Separated by son, Milica Gašic,´ Lina Maria Rojas-Barahona, Pei- an un-common language: Towards judgment lan- Hao Su, David Vandyke, Tsung-Hsien Wen, and guage informed vector space modeling. CoRR, Steve Young. 2016. Counter-fitting word vectors abs/1508.00106. to linguistic constraints. In Proceedings of NAACL- HLT. Omer Levy and Yoav Goldberg. 2014. Dependency- based word embeddings. In Proceedings of ACL, Nikola Mrkšic,´ Ivan Vulic,´ Diarmuid Ó Séaghdha, Ira pages 302–308. Leviant, Roi Reichart, Milica Gašic,´ Anna Korho- Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Im- nen, and Steve Young. 2017. Semantic specialisa- proving distributional similarity with lessons learned tion of distributional word vector spaces using mono- from word embeddings. Transactions of the ACL, lingual and cross-lingual constraints. Transactions 3:211–225. of the ACL, 5:309–324.

Sihan Li, Jiantao Jiao, Yanjun Han, and Tsachy Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Weissman. 2016. Demystifying ResNet. CoRR, linear units improve restricted Boltzmann machines. abs/1611.01186. In Proceedings of ICML, pages 807–814.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Roberto Navigli and Simone Paolo Ponzetto. 2012. Ba- Yu Hu. 2015. Learning semantic word embeddings belNet: The automatic construction, evaluation and based on ordinal knowledge constraints. In Proceed- application of a wide-coverage multilingual seman- ings of ACL, pages 1501–1511. tic network. Artificial Intelligence, 193:217–250.

526 Kim Anh Nguyen, Maximilian Köper, Sabine Ivan Vulic´ and Anna Korhonen. 2016b. On the role of Schulte im Walde, and Ngoc Thang Vu. 2017. seed lexicons in learning bilingual word embeddings. Hierarchical embeddings for hypernymy detection In Proceedings of ACL, pages 247–257. and directionality. In Proceedings of EMNLP, pages 233–243. Ivan Vulic´ and Nikola Mrkšic.´ 2018. Specialising word vectors for lexical entailment. In Proceedings of Kim Anh Nguyen, Sabine Schulte im Walde, and NAACL-HLT. Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym- Ivan Vulic,´ Nikola Mrkšic,´ and Anna Korhonen. 2017a. synonym distinction. In Proceedings of ACL, pages Cross-lingual induction and transfer of verb classes 454–459. based on word vector space specialisation. In Pro- ceedings of EMNLP, pages 2536–2548. Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representa- Ivan Vulic,´ Nikola Mrkšic,´ Roi Reichart, Diarmuid tions. In Proceedings of NIPS. Ó Séaghdha, Steve Young, and Anna Korhonen. 2017b. Morph-fitting: Fine-tuning word vector Masataka Ono, Makoto Miwa, and Yutaka Sasaki. spaces with simple language-specific rules. In Pro- 2015. -based antonym detection ceedings of ACL, pages 56–68. using thesauri and distributional information. In ´ Proceedings of NAACL-HLT, pages 984–989. Tsung-Hsien Wen, David Vandyke, Nikola Mrkšic, Milica Gašic,´ Lina M. Rojas-Barahona, Pei-Hao Su, Dominique Osborne, Shashi Narayan, and Shay Cohen. Stefan Ultes, and Steve Young. 2017. A network- 2016. Encoding prior knowledge with eigenword based end-to-end trainable task-oriented dialogue embeddings. Transactions of the ACL, 4:417–430. system. In Proceedings of EACL.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Jason Weston, Samy Bengio, and Nicolas Usunier. Benjamin Van Durme, and Chris Callison-Burch. 2011. WSABIE: Scaling up to large vocabulary Proceedings of IJCAI 2015. PPDB 2.0: Better paraphrase ranking, fine- image annotation. In , pages grained entailment relations, word embeddings, and 2764–2770. style classification. In Proceedings of ACL, pages John Wieting, Mohit Bansal, Kevin Gimpel, and Karen 425–430. Livescu. 2015. From paraphrase database to compo- sitional paraphrase model and back. Transactions of Jeffrey Pennington, Richard Socher, and Christopher the ACL, 3:345–358. Manning. 2014. Glove: Global vectors for word rep- resentation. In Proceedings of EMNLP, pages 1532– Jason D. Williams, Antoine Raux, and Matthew Hen- 1543. derson. 2016. The Dialog State Tracking Challenge series: A review. Dialogue & Discourse, 7(3):4–33. Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for activation functions. CoRR, Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang abs/1710.05941. Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC- NET: A general framework for incorporating knowl- Sascha Rothe and Hinrich Schütze. 2015. AutoEx- edge into word representations. In Proceedings of tend: Extending word embeddings to embeddings CIKM, pages 1219–1228. for synsets and lexemes. In Proceedings of ACL, pages 1793–1803. Steve Young. 2010. Cognitive User Interfaces. IEEE Signal Processing Magazine. Sebastian Ruder, Ivan Vulic,´ and Anders Søgaard. 2017. A survey of cross-lingual embedding models. CoRR, Mo Yu and Mark Dredze. 2014. Improving lexical em- abs/1706.04902. beddings with semantic knowledge. In Proceedings of ACL, pages 545–550. Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for im- Jingwei Zhang, Jeremy Salwen, Michael Glass, and Al- proved word similarity prediction. In Proceedings fio Gliozzo. 2014. Word semantic representations of CoNLL, pages 258–267. using bayesian probabilistic tensor factorization. In Proceedings of EMNLP, pages 1522–1531. Pei-Hao Su, Milica Gašic,´ Nikola Mrkšic,´ Lina Rojas- Barahona, Stefan Ultes, David Vandyke, Tsung- Hsien Wen, and Steve Young. 2016. Continuously learning neural dialogue management. In arXiv preprint: 1606.02689.

Ivan Vulic´ and Anna Korhonen. 2016a. Is "universal syntax" universally useful for learning distributed word representations? In Proceedings of ACL, pages 518–524.
