
Towards Better Context-aware Lexical Semantics: Adjusting Contextualized Representations through Static Anchors

Qianchu Liu, Diana McCarthy, Anna Korhonen
Language Technology Lab, TAL, University of Cambridge, UK
[email protected], [email protected], [email protected]

Abstract

One of the most powerful features of contextualized models is their dynamic embeddings for words in context, leading to state-of-the-art representations for context-aware lexical semantics. In this paper, we present a post-processing technique that enhances these representations by learning a transformation through static anchors. Our method requires only another pre-trained model, and no labeled data is needed. We show consistent improvement in a range of benchmark tasks that test contextual variations of meaning both across different usages of a word and across different words as they are used in context. We demonstrate that while the original contextual representations can be improved by another embedding space from either contextualized or static models, the static embeddings, which have lower computational requirements, provide the most gains.

1 Introduction

Word representations are fundamental in Natural Language Processing (NLP) (Bengio et al., 2003). Recently, there has been a surge of contextualized models that achieve state-of-the-art results in many NLP benchmark tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019b; Yang et al., 2019). Even better performance has been reported from fine-tuning or training multiple contextualized models for a specific task such as question answering (Devlin et al., 2019; Xu et al., 2020). However, little has been explored on directly leveraging the many off-the-shelf pre-trained models to improve task-independent representations for lexical semantics.

Furthermore, classic static embeddings are often overlooked in this trend towards contextualized models. As opposed to contextualized embeddings that generate dynamic representations for words in context, static embeddings such as word2vec (Mikolov et al., 2013) assign one fixed representation to each word. Despite being less effective at capturing context-sensitive word meanings, static embeddings still achieve better performance than contextualized embeddings in traditional context-independent lexical semantic tasks, including word similarity and analogy (Wang et al., 2019). This suggests that static embeddings have the potential to offer complementary semantic information to enhance contextualized models for lexical semantics.

We bridge the aforementioned gaps and propose a general framework that improves contextualized representations by leveraging other pre-trained contextualized or static models. We achieve this by using static anchors (the average contextual representations for each word) to transform the original contextualized model, guided by the embedding space from another model. We assess the overall quality of a model's lexical semantic representations with two Inter Word tasks that measure relations between different words in context. We also evaluate on three Within Word tasks that test the contextual effect of different usages of the same word or word pair. Our method obtains consistent improvement across all these context-aware lexical semantic tasks. We demonstrate the particular strength of leveraging static embeddings, and offer insights into the reasons behind the improvement. Our method also has minimal computational cost and requires no labeled data.

2 Background

This section briefly introduces the contextualized and static models that we experiment with in this study. For static models, we select three representative methods. SGNS (Mikolov et al., 2013), the most successful variant of word2vec, trains a log-linear model to predict context words given a target word, with negative sampling in the objective. FastText improves over SGNS by training at the character n-gram level and can generalize to unseen words (Bojanowski et al., 2017). In addition to these two prediction-based models, we also include one count-based model, GloVe (Pennington et al., 2014). GloVe is trained to encode the semantic relations exhibited in the ratios of word-word co-occurrence probabilities into word vectors.

As opposed to static embeddings, contextualized models provide dynamic lexical representations as hidden layers in deep neural networks, typically pre-trained with language modeling objectives. In our study, we choose three state-of-the-art contextualized models. BERT (Devlin et al., 2019) trains bidirectional transformers (Vaswani et al., 2017) with masked language modeling and next sentence prediction objectives. Liu et al. (2019b)'s RoBERTa further improves upon BERT by carefully optimizing a series of design decisions. XLNet (Yang et al., 2019) takes a generalized autoregressive pre-training approach and integrates ideas from Transformer-XL (Dai et al., 2019). For the best performance, we use the Large Cased variant of each contextualized model (1,024 dimensions with a case-preserving vocabulary). Since our study focuses on generic lexical representations and many of the lexical semantic tasks do not provide training data, we extract features from these contextualized models without fine-tuning the weights for a specific task (Appendix A contains more details on feature extraction). This feature-based approach is also more efficient than fine-tuning the increasingly large models, which can have hundreds of millions of parameters.
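To make the feature-based setup concrete, the sketch below extracts a contextual vector for one occurrence of a target word by averaging the hidden states of its sub-word pieces. This is a minimal sketch rather than the authors' extraction pipeline (their settings are in Appendix A of the paper): it assumes the HuggingFace transformers library, the bert-large-cased checkpoint, and, for simplicity, the final hidden layer.

```python
# Minimal sketch (not the paper's implementation): extract a contextual
# vector for one occurrence of a target word with a pre-trained model.
# Assumes the HuggingFace `transformers` library and `bert-large-cased`.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModel.from_pretrained("bert-large-cased")
model.eval()  # feature extraction only; no fine-tuning

def contextual_vector(words, target_index):
    """Average the sub-word vectors of words[target_index] in its context."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 1024)
    # word_ids() maps each sub-word piece back to its source word.
    piece_ids = [i for i, w in enumerate(enc.word_ids()) if w == target_index]
    return hidden[piece_ids].mean(dim=0)             # (1024,)

vec = contextual_vector("She sat on the river bank .".split(), 5)  # "bank"
print(vec.shape)  # torch.Size([1024])
```

Averaging such vectors over many sentences containing the same word yields the static anchors used in the next section.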
3 Method

Our method (implementation details are listed in Appendix B) is built on a recently proposed cross-lingual alignment technique called meeting in the middle (Doval et al., 2018). Their method relies on manual translations to learn a transformation over an orthogonal alignment for better cross-lingual static embeddings. We show that, with a similar alignment + transformation technique, we can improve monolingual contextualized embeddings without resorting to any labeled data.

The direct correspondence between contextualized and static embeddings for alignment is not straightforward, as contextualized models can compute infinitely many representations for infinitely many contexts. Inspired by a previous study (Schuster et al., 2019) that found contextualized embeddings roughly form word clusters, we take the average of each word's contextual representations as the anchors of a contextualized model. We call them static anchors, as they provide one fixed representation per word and therefore correspond to word embeddings from a static model such as FastText. We also use these anchors to align between contextualized models. To form the vocabulary for creating static anchors in our experiments, we take the top 200k most frequent words and extract their contexts from English Wikipedia.

To describe the method in more detail, we represent the anchor embeddings from the original contextualized model as our source matrix S, and the corresponding representations from another contextualized or static model as the target matrix T. s_i and t_i are the source and target vectors for the i-th word in the vocabulary V. We first find an orthogonal alignment matrix W that rotates the target space to the source space by solving the least-squares regression problem in Eq. 1. W is found through Procrustes analysis (Schönemann, 1966).

    W = \arg\min_{W} \sum_{i=1}^{|V|} \lVert W t_i - s_i \rVert^2 \quad \text{s.t.} \quad W^{\top} W = I    (1)

As described in Eq. 2, we then learn a linear mapping M to transform the source space towards the average of the source and the rotated target space, by minimizing the squared Euclidean distance between each transformed source vector M s_i and the mean vector \mu_i, where \mu_i = (s_i + W t_i)/2. M is the mapping we use to transform the original contextualized space. Following Doval et al. (2018), M is found via a closed-form solution.

    M = \arg\min_{M} \sum_{i=1}^{|V|} \lVert M s_i - \mu_i \rVert^2    (2)

For improved alignment quality, as advised by Artetxe et al. (2016), we normalize and mean-center the embeddings in S and T a priori. (We pre-process representations with the same centering and normalization in all tasks; our reported results are similar to or better than the results from unprocessed representations.)
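Both steps have closed-form solutions, so the whole adjustment reduces to a handful of matrix operations once the anchor matrices are in place. The following NumPy sketch is one reading of Eq. 1 and Eq. 2, not the authors' released implementation; it assumes the source and target anchors S and T are pre-computed (for example, by averaging contextual vectors as in the earlier sketch), share the same dimensionality, and have rows paired by vocabulary word.

```python
# Minimal NumPy sketch of Eq. 1 and Eq. 2 (not the authors' released code).
# S: anchor matrix from the source contextualized model, shape (|V|, d)
# T: corresponding target embeddings (static or contextualized), shape (|V|, d)
import numpy as np

def preprocess(X):
    """Length-normalize rows, then mean-center (Artetxe et al., 2016)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X - X.mean(axis=0, keepdims=True)

def learn_transform(S, T):
    S, T = preprocess(S), preprocess(T)
    # Eq. 1: orthogonal W rotating the target space onto the source space,
    # obtained by Procrustes analysis via the SVD of S^T T.
    U, _, Vt = np.linalg.svd(S.T @ T)
    W = U @ Vt                                  # W t_i is close to s_i
    # Midpoint between each source anchor and its rotated target vector.
    mu = (S + T @ W.T) / 2.0
    # Eq. 2: closed-form least-squares map pulling source vectors to the midpoints.
    X, *_ = np.linalg.lstsq(S, mu, rcond=None)  # solves S @ X ~ mu
    return X.T                                  # M, with M @ s_i ~ mu_i

# Toy usage with random anchors (real anchors would come from averaged
# contextual vectors and a pre-trained static model such as FastText).
rng = np.random.default_rng(0)
S_demo = rng.normal(size=(1000, 8))
T_demo = S_demo @ rng.normal(size=(8, 8)) + 0.1 * rng.normal(size=(1000, 8))
M = learn_transform(S_demo, T_demo)
# A contextual vector x from the source model is then adjusted as M @ x
# (after the same normalization and mean-centering).
```

Solving Eq. 1 by SVD and Eq. 2 by least squares keeps the procedure cheap relative to training or fine-tuning a contextualized model, which is consistent with the low computational cost claimed for the method.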
4 Experiments

Task Descriptions. (Appendix C reports details for each task and experiment.) We evaluate on three Within Word tasks. The Usage Similarity (Usim) dataset (Erk et al., 2013) measures the graded similarity of the same word in pairs of different contexts on a scale from 1 to 5. The Word in Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019) challenges a system to make a binary prediction of whether a pair of contexts for the same word belongs to the same meaning or not. We follow the advised training scheme in the original paper to learn a cosine similarity threshold on the representations (a brief sketch of this thresholding appears below). The recently proposed CoSimlex task (Armendariz et al., 2019) provides contexts for selected word pairs from the word similarity benchmark SimLex-999 (Hill et al., 2015) and measures the graded contextual effect. We use the English …

… word relations (Wang et al., 2019) into a contextualized model. At the same time, static embeddings consistently improve performance in Within Word tasks in 24 out of the total 27 configurations, reassuring us that the contextualization power of the original contextual space is not only preserved but even enhanced. Overall, FastText is the most robust …
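As an illustration of the WiC thresholding mentioned above, the sketch below scores each context pair by the cosine similarity of the target word's two contextual vectors and picks the decision threshold that maximizes accuracy on the training pairs. The exact threshold-selection scheme is an assumption for illustration, not the authors' evaluation code; the vectors are presumed to be extracted (and optionally adjusted by M) as described earlier.

```python
# Minimal sketch of WiC-style thresholding (not the authors' evaluation code).
# Each training example provides two contextual vectors for the same target word
# and a gold binary label: 1 if the two contexts share the same meaning.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def fit_threshold(pairs, labels):
    """Pick the similarity cutoff that maximizes accuracy on the training pairs."""
    sims = np.array([cosine(u, v) for u, v in pairs])
    labels = np.array(labels)
    best_t, best_acc = 0.0, -1.0
    for t in np.unique(sims):                   # candidate cutoffs
        acc = np.mean((sims >= t) == labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(pairs, threshold):
    return [int(cosine(u, v) >= threshold) for u, v in pairs]
```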