Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations
Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan
Department of Computer Science, National University of Singapore
[email protected], [email protected], gan [email protected]

Abstract

Contextualized word representations are able to give different representations for the same word in different contexts, and they have been shown to be effective in downstream natural language processing tasks, such as question answering, named entity recognition, and sentiment analysis. However, evaluation on word sense disambiguation (WSD) in prior work shows that using contextualized word representations does not outperform the state-of-the-art approach that makes use of non-contextualized word embeddings. In this paper, we explore different strategies of integrating pre-trained contextualized word representations, and our best strategy achieves accuracies exceeding the best prior published accuracies by significant margins on multiple benchmark WSD datasets.

1 Introduction

Word sense disambiguation (WSD) automatically assigns a pre-defined sense to a word in a text. Different senses of a word reflect the different meanings the word has in different contexts. Identifying the correct word sense given a context is crucial in natural language processing (NLP). Unfortunately, while it is easy for a human to infer the correct sense of a word given a context, it is a challenge for NLP systems. As such, WSD is an important task, and it has been shown that WSD helps downstream NLP tasks, such as machine translation (Chan et al., 2007a) and information retrieval (Zhong and Ng, 2012).

A WSD system assigns a sense to a word by taking into account its context, comprising the other words in the sentence. This can be done through discrete word features, which typically involve surrounding words and collocations trained using a classifier (Lee et al., 2004; Ando, 2006; Chan et al., 2007b; Zhong and Ng, 2010). The classifier can also make use of continuous word representations of the surrounding words (Taghipour and Ng, 2015; Iacobacci et al., 2016). Neural WSD systems (Kågebäck and Salomonsson, 2016; Raganato et al., 2017b) feed the continuous word representations into a neural network that captures the whole sentence and the word representation in the sentence. However, in both approaches, the word representations are independent of the context.

Recently, pre-trained contextualized word representations (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019) have been shown to improve downstream NLP tasks. Pre-trained contextualized word representations are obtained through neural sentence encoders trained on a huge amount of raw texts. When the resulting sentence encoder is fine-tuned on a downstream task, such as question answering, named entity recognition, or sentiment analysis, with much smaller annotated training data, it has been shown that the trained model, with the pre-trained sentence encoder component, achieves new state-of-the-art results on those tasks.

While demonstrating superior performance in downstream NLP tasks, pre-trained contextualized word representations are still reported to give lower accuracy compared to approaches that use non-contextualized word representations (Melamud et al., 2016; Peters et al., 2018) when evaluated on WSD. This seems counter-intuitive, as a neural sentence encoder better captures the surrounding context that serves as an important cue to disambiguate words. In this paper, we explore different strategies of integrating pre-trained contextualized word representations for WSD. Our best strategy outperforms prior methods of incorporating pre-trained contextualized word representations and achieves new state-of-the-art accuracy on multiple benchmark WSD datasets.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes our pre-trained contextualized word representation. Section 4 proposes different strategies to incorporate the contextualized word representation for WSD. Section 5 describes our experimental setup. Section 6 presents the experimental results. Section 7 discusses the findings from the experiments. Finally, Section 8 presents the conclusion.
2 Related Work

Continuous word representations in real-valued vectors, commonly known as word embeddings, have been shown to help improve NLP performance. Initially, exploiting continuous representations was achieved by adding real-valued vectors as classification features (Turian et al., 2010). Taghipour and Ng (2015) fine-tuned non-contextualized word embeddings by a feed-forward neural network such that those word embeddings were more suited for WSD. The fine-tuned embeddings were incorporated into an SVM classifier. Iacobacci et al. (2016) explored different strategies of incorporating word embeddings and found that their best strategy involved exponential decay that decreased the contribution of surrounding word features as their distances to the target word increased.

The neural sequence tagging approach has also been explored for WSD. Kågebäck and Salomonsson (2016) proposed a bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) for WSD. They concatenated the hidden states of the forward and backward LSTMs and fed the concatenation into an affine transformation followed by softmax normalization, similar to the approach to incorporating a bidirectional LSTM adopted in sequence labeling tasks such as part-of-speech tagging and named entity recognition (Ma and Hovy, 2016). Raganato et al. (2017b) proposed a self-attention layer on top of the concatenated bidirectional LSTM hidden states for WSD and introduced multi-task learning with part-of-speech tagging and semantic labeling as auxiliary tasks. However, on average across the test sets, their approach did not outperform SVM with word embedding features. Subsequently, Luo et al. (2018) proposed the incorporation of glosses from WordNet in a bidirectional LSTM for WSD, and reported better results than both SVM and prior bidirectional LSTM models.

A neural language model (LM) is aimed at predicting a word given its surrounding context. As such, the resulting hidden representation vector captures the context of a word in a sentence. Melamud et al. (2016) designed context2vec, which is a one-layer bidirectional LSTM trained to maximize the similarity between the hidden state representation of the LSTM and the target word embedding. Peters et al. (2018) designed ELMo, which is a two-layer bidirectional LSTM language model trained to predict the next word in the forward LSTM and the previous word in the backward LSTM. In both models, WSD was evaluated by nearest neighbor matching between the test and training instance representations. However, despite training on a huge amount of raw texts, the resulting accuracies were still lower than those achieved by WSD approaches with pre-trained non-contextualized word representations.

End-to-end neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) learns to generate an output sequence given an input sequence, using an encoder-decoder model. The encoder captures the contextualized representation of the words in the input sentence for the decoder to generate the output sentence. Following this intuition, McCann et al. (2017) trained an encoder-decoder model on parallel texts and obtained pre-trained contextualized word representations from the encoder.
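Both context2vec and ELMo above were evaluated on WSD by nearest neighbor matching between test and training instance representations. The following is a minimal sketch of that matching step, assuming the contextual vectors have already been produced by some encoder and using cosine similarity as the distance measure; the function name, toy vectors, and sense labels are illustrative only and are not taken from the cited systems.

```python
import numpy as np

def nearest_neighbor_wsd(test_vec, train_vecs, train_senses):
    """Assign a sense to a test instance by cosine similarity to
    sense-annotated training instances (illustrative sketch only).

    test_vec:     contextual vector of the target word in the test sentence
    train_vecs:   matrix of contextual vectors, one row per training instance
    train_senses: sense labels aligned with the rows of train_vecs
    """
    # Normalize so that dot products equal cosine similarities.
    test_vec = test_vec / np.linalg.norm(test_vec)
    train_norm = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)

    # Pick the sense of the most similar training instance.
    similarities = train_norm @ test_vec
    return train_senses[int(np.argmax(similarities))]

# Toy usage: two training instances of "bank" with different (hypothetical) senses.
train_vecs = np.array([[0.9, 0.1, 0.0],    # "river bank" context
                       [0.1, 0.8, 0.3]])   # "financial bank" context
train_senses = ["bank_river", "bank_financial"]
test_vec = np.array([0.85, 0.15, 0.05])
print(nearest_neighbor_wsd(test_vec, train_vecs, train_senses))  # bank_river
```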
3 Pre-Trained Contextualized Word Representation

The contextualized word representation that we use is BERT (Devlin et al., 2019), which is a bidirectional transformer encoder model (Vaswani et al., 2017) pre-trained on billions of words of texts. There are two tasks on which the model is trained, i.e., masked word and next sentence prediction. In both tasks, prediction accuracy is determined by the ability of the model to understand the context.

A transformer encoder computes the representation of each word through an attention mechanism with respect to the surrounding words. Given a sentence $x_1^n$ of length $n$, the transformer computes the representation of each word $x_i$ through a multi-head attention mechanism, where the query vector is from $x_i$ and the key-value vector pairs are from the surrounding words $x_{i'}$ ($1 \le i' \le n$). The word representation produced by the transformer captures the contextual information of a word.

The attention mechanism can be viewed as mapping a query vector $\mathbf{q}$ and a set of key-value vector pairs $(\mathbf{k}, \mathbf{v})$ to an output vector. The attention function $A(\cdot)$ computes the output vector, which is the weighted sum of the value vectors, and is defined as:

$$A(\mathbf{q}, \mathbf{K}, \mathbf{V}; \rho) = \sum_{(\mathbf{k}, \mathbf{v}) \in (\mathbf{K}, \mathbf{V})} \alpha(\mathbf{q}, \mathbf{k}; \rho)\,\mathbf{v} \quad (1)$$

$$\alpha(\mathbf{q}, \mathbf{k}; \rho) = \frac{\exp(\rho\,\mathbf{k}^\top \mathbf{q})}{\sum_{\mathbf{k}' \in \mathbf{K}} \exp(\rho\,\mathbf{k}'^\top \mathbf{q})} \quad (2)$$

where $\mathbf{K}$ and $\mathbf{V}$ are matrices containing the key vectors and the value vectors of the words in the sentence respectively, and $\alpha(\mathbf{q}, \mathbf{k}; \rho)$ is a scalar attention weight between $\mathbf{q}$ and $\mathbf{k}$, re-scaled by a scalar $\rho$.

For the hidden layer $l$ ($1 \le l \le L$), the self-attention sub-layer output $\mathbf{f}_i^l$ is computed as follows:

$$\mathbf{f}_i^l = \mathrm{LayerNorm}(\boldsymbol{\chi}_{h,i}^l + \mathbf{h}_i^{l-1})$$

$$\boldsymbol{\chi}_{h,i}^l = \mathrm{MH}_h\!\left(\mathbf{h}_i^{l-1}, \mathbf{h}_{1:n}^{l-1}, \mathbf{h}_{1:n}^{l-1}; \tfrac{1}{\sqrt{d_v}}\right)$$

where LayerNorm refers to layer normalization (Ba et al., 2016) and the superscript $l$ and subscript $h$ indicate that each encoder layer $l$ has an independent set of multi-head attention weight parameters (see Equations 3 and 4). The input for the first layer is $\mathbf{h}_i^0 = E(x_i)$, which is the non-contextualized word embedding of $x_i$.

The second sub-layer consists of the position-wise fully connected FFNN, computed as:
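As a concrete illustration of the attention function defined in Equations (1) and (2) above, the sketch below restates the two formulas for a single query vector in plain NumPy. It is not BERT's actual implementation; the toy dimensions are arbitrary, and the choice $\rho = 1/\sqrt{d}$ is only an illustrative value for the re-scaling factor.

```python
import numpy as np

def attention(q, K, V, rho):
    """Attention output A(q, K, V; rho) as in Equations (1) and (2):
    a weighted sum of the value vectors, with weights given by a softmax
    over the re-scaled dot products rho * k^T q."""
    scores = rho * (K @ q)                   # rho * k^T q for every key k
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()        # alpha(q, k; rho) for every key (Eq. 2)
    return weights @ V                       # sum over (k, v) of alpha * v (Eq. 1)

# Toy usage: one query attending over a 3-word sentence with 4-dimensional keys/values.
d = 4
rng = np.random.default_rng(0)
q = rng.normal(size=d)
K = rng.normal(size=(3, d))   # one key vector per word
V = rng.normal(size=(3, d))   # one value vector per word
output = attention(q, K, V, rho=1.0 / np.sqrt(d))
print(output.shape)           # (4,) -- a single contextualized output vector
```

In the transformer encoder, this function is applied several times in parallel with learned projections of the hidden states, which is the multi-head attention MH referenced in the layer equations above.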