Improved Sense Disambiguation Using Pre-Trained Contextualized Word Representations

Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan
Department of Computer Science, National University of Singapore
[email protected], [email protected], [email protected]

Abstract

Contextualized word representations are able to give different representations for the same word in different contexts, and they have been shown to be effective in downstream natural language processing tasks, such as question answering, named entity recognition, and sentiment analysis. However, evaluation on word sense disambiguation (WSD) in prior work shows that using contextualized word representations does not outperform the state-of-the-art approach that makes use of non-contextualized word embeddings. In this paper, we explore different strategies of integrating pre-trained contextualized word representations, and our best strategy achieves accuracies exceeding the best prior published accuracies by significant margins on multiple benchmark WSD datasets.

1 Introduction

Word sense disambiguation (WSD) automatically assigns a pre-defined sense to a word in a text. Different senses of a word reflect the different meanings the word has in different contexts. Identifying the correct word sense given a context is crucial in natural language processing (NLP). Unfortunately, while it is easy for a human to infer the correct sense of a word given a context, it is a challenge for NLP systems. As such, WSD is an important task, and it has been shown that WSD helps downstream NLP tasks, such as machine translation (Chan et al., 2007a) and information retrieval (Zhong and Ng, 2012).

A WSD system assigns a sense to a word by taking into account its context, comprising the other words in the sentence. This can be done through discrete word features, which typically involve surrounding words and collocations trained using a classifier (Lee et al., 2004; Ando, 2006; Chan et al., 2007b; Zhong and Ng, 2010). The classifier can also make use of continuous word representations of the surrounding words (Taghipour and Ng, 2015; Iacobacci et al., 2016). Neural WSD systems (Kågebäck and Salomonsson, 2016; Raganato et al., 2017b) feed the continuous word representations into a neural network that captures the whole sentence and the word representation in the sentence. However, in both approaches, the word representations are independent of the context.

Recently, pre-trained contextualized word representations (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019) have been shown to improve downstream NLP tasks. Pre-trained contextualized word representations are obtained through neural sentence encoders trained on a huge amount of raw text. When the resulting sentence encoder is fine-tuned on a downstream task, such as question answering, named entity recognition, or sentiment analysis, with much smaller annotated training data, the trained model, with its pre-trained sentence encoder component, has been shown to achieve new state-of-the-art results on those tasks.

While demonstrating superior performance in downstream NLP tasks, pre-trained contextualized word representations are still reported to give lower accuracy on WSD compared to approaches that use non-contextualized word representations (Melamud et al., 2016; Peters et al., 2018). This seems counter-intuitive, as a neural sentence encoder better captures the surrounding context, which serves as an important cue for disambiguating words. In this paper, we explore different strategies of integrating pre-trained contextualized word representations for WSD. Our best strategy outperforms prior methods of incorporating pre-trained contextualized word representations and achieves new state-of-the-art accuracy on multiple benchmark WSD datasets.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes our pre-trained contextualized word representation. Section 4 proposes different strategies to incorporate the contextualized word representation for WSD. Section 5 describes our experimental setup. Section 6 presents the experimental results. Section 7 discusses the findings from the experiments. Finally, Section 8 presents the conclusion.

2 Related Work

Continuous word representations in real-valued vectors, commonly known as word embeddings, have been shown to help improve NLP performance. Initially, continuous representations were exploited by adding real-valued vectors as classification features (Turian et al., 2010). Taghipour and Ng (2015) fine-tuned non-contextualized word embeddings with a feed-forward neural network such that those word embeddings were more suited for WSD; the fine-tuned embeddings were then incorporated into an SVM classifier. Iacobacci et al. (2016) explored different strategies of incorporating word embeddings and found that their best strategy involved exponential decay, which decreased the contribution of surrounding word features as their distance to the target word increased.
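To make the exponential-decay idea concrete, the minimal NumPy sketch below sums the embeddings of surrounding words, each down-weighted by a factor that decays with its distance from the target word. The function name, window size, and decay factor are illustrative assumptions, not the exact feature extraction used by IMS or Iacobacci et al. (2016).

```python
import numpy as np

def exp_decay_context_feature(embeddings, target_idx, window=10, decay=0.9):
    """Sum surrounding-word embeddings, down-weighting each one
    exponentially by its distance to the target word."""
    feature = np.zeros(embeddings.shape[1])
    for i, vec in enumerate(embeddings):
        dist = abs(i - target_idx)
        if i == target_idx or dist > window:
            continue
        feature += (decay ** dist) * vec
    return feature

# Toy usage: 5 words with 4-dimensional embeddings, target word at position 2.
sentence_embeddings = np.random.rand(5, 4)
print(exp_decay_context_feature(sentence_embeddings, target_idx=2))
```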
The neural sequence tagging approach has also been explored for WSD. Kågebäck and Salomonsson (2016) proposed a bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) model for WSD. They concatenated the hidden states of the forward and backward LSTMs and fed the concatenation into an affine transformation followed by softmax normalization, similar to the way a bidirectional LSTM is incorporated in sequence labeling tasks such as part-of-speech tagging and named entity recognition (Ma and Hovy, 2016). Raganato et al. (2017b) proposed a self-attention layer on top of the concatenated bidirectional LSTM hidden states for WSD and introduced multi-task learning with part-of-speech tagging and semantic labeling as auxiliary tasks. However, on average across the test sets, their approach did not outperform an SVM with word embedding features. Subsequently, Luo et al. (2018) proposed incorporating glosses from WordNet into a bidirectional LSTM for WSD and reported better results than both the SVM and prior bidirectional LSTM models.

A neural language model (LM) is aimed at predicting a word given its surrounding context. As such, the resulting hidden representation vector captures the context of a word in a sentence. Melamud et al. (2016) designed context2vec, a one-layer bidirectional LSTM trained to maximize the similarity between the hidden state representation of the LSTM and the target word embedding. Peters et al. (2018) designed ELMo, a two-layer bidirectional LSTM language model trained to predict the next word in the forward LSTM and the previous word in the backward LSTM. In both models, WSD was evaluated by nearest neighbor matching between test and training instance representations. However, despite training on a huge amount of raw text, the resulting accuracies were still lower than those achieved by WSD approaches with pre-trained non-contextualized word representations.

End-to-end neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) learns to generate an output sequence given an input sequence, using an encoder-decoder model. The encoder captures the contextualized representation of the words in the input sentence for the decoder to generate the output sentence. Following this intuition, McCann et al. (2017) trained an encoder-decoder model on parallel texts and obtained pre-trained contextualized word representations from the encoder.

3 Pre-Trained Contextualized Word Representation

The contextualized word representation that we use is BERT (Devlin et al., 2019), a bidirectional transformer encoder model (Vaswani et al., 2017) pre-trained on billions of words of text. The model is trained on two tasks, masked word prediction and next sentence prediction. In both tasks, prediction accuracy is determined by the ability of the model to understand the context.

A transformer encoder computes the representation of each word through an attention mechanism with respect to the surrounding words. Given a sentence x_1^n of length n, the transformer computes the representation of each word x_i through a multi-head attention mechanism, where the query vector is from x_i and the key-value vector pairs are from the surrounding words x_{i'} (1 \le i' \le n).

The word representation produced by the transformer captures the contextual information of a word.

The attention mechanism can be viewed as mapping a query vector q and a set of key-value vector pairs (k, v) to an output vector. The attention function A(\cdot) computes the output vector as the weighted sum of the value vectors and is defined as:

    A(q, K, V, \rho) = \sum_{(k, v) \in (K, V)} \alpha(q, k, \rho) \, v                       (1)

    \alpha(q, k, \rho) = \frac{\exp(\rho \, k^\top q)}{\sum_{k' \in K} \exp(\rho \, k'^\top q)}   (2)

where K and V are matrices containing the key vectors and the value vectors of the words in the sentence respectively, and \alpha(q, k, \rho) is a scalar attention weight between q and k, re-scaled by a scalar \rho.

Two building blocks of the transformer encoder are the multi-head attention mechanism and the position-wise feed-forward neural network (FFNN). The multi-head attention mechanism with H heads leverages the attention function in Equation 1 as follows:

    \mathrm{MH}(q, K, V, \rho) = W^{MH} \bigoplus_{\eta=1}^{H} \mathrm{head}_\eta(q, K, V, \rho)   (3)

    \mathrm{head}_\eta(q, K, V, \rho) = A(W^Q_\eta q, \, W^K_\eta K, \, W^V_\eta V, \, \rho)       (4)

where \oplus denotes concatenation of vectors, W^{MH} \in R^{d_{model} \times H d_v}, W^Q_\eta, W^K_\eta \in R^{d_k \times d_{model}}, and W^V_\eta \in R^{d_v \times d_{model}}. The input vector q \in R^{d_{model}} is the hidden vector of the ambiguous word, while the input matrices K, V \in R^{d_{model} \times n} are the concatenation of the hidden vectors of all words in the sentence. For each attention head, the dimension of both the query and key vectors is d_k, while the dimension of the value vector is d_v. The encoder model dimension is d_{model}.

The position-wise FFNN performs a non-linear transformation on the attention output corresponding to each input word position as follows:

    \mathrm{FF}(u) = W^{FF_2}(\max(0, W^{FF_1} u + b^{FF_1})) + b^{FF_2}                      (5)

in which the input vector u \in R^{d_{model}} is transformed into an output vector of dimension d_{model} via a series of linear projections with the ReLU activation function.

For hidden layer l (1 \le l \le L), the self-attention sub-layer output f^l_i is computed as follows:

    f^l_i = \mathrm{LayerNorm}(\chi^l_{h,i} + h^{l-1}_i)

    \chi^l_{h,i} = \mathrm{MH}^l_h(h^{l-1}_i, h^{l-1}_{1:n}, h^{l-1}_{1:n}, \tfrac{1}{\sqrt{d_v}})

where LayerNorm refers to layer normalization (Ba et al., 2016), and the superscript l and subscript h indicate that each encoder layer l has an independent set of multi-head attention weight parameters (see Equations 3 and 4). The input to the first layer is h^0_i = E(x_i), the non-contextualized word embedding of x_i.

The second sub-layer consists of the position-wise fully connected FFNN, computed as:

    h^l_i = \mathrm{LayerNorm}(\mathrm{FF}^l_h(f^l_i) + f^l_i)

where, similar to self-attention, an independent set of weight parameters as in Equation 5 is defined in each layer.
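As a concrete reading of Equations 1 and 2, the following minimal NumPy sketch computes the attention function A(\cdot) for a single query vector. It stores key and value vectors as matrix rows rather than columns, and it omits the multi-head and feed-forward components of Equations 3 to 5; all names and dimensions are illustrative assumptions.

```python
import numpy as np

def attention(q, K, V, rho):
    """Eq. (1)-(2): weighted sum of value vectors, with softmax weights
    over re-scaled key-query dot products."""
    scores = rho * (K @ q)               # one scalar score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the words (Eq. 2)
    return weights @ V                   # weighted sum of value vectors (Eq. 1)

# Toy usage: a 6-word sentence with d_k = d_v = 8 (rows of K and V are words).
d = 8
q = np.random.rand(d)                    # query: hidden vector of one word
K = np.random.rand(6, d)                 # key vectors of all 6 words
V = np.random.rand(6, d)                 # value vectors of all 6 words
print(attention(q, K, V, rho=1.0 / np.sqrt(d)).shape)   # (8,)
```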

4 Incorporating Pre-Trained Contextualized Word Representation

As BERT is trained on the masked word prediction task, which is to predict a word given the surrounding (left and right) context, the pre-trained model already captures the context. In this section, we describe different techniques of leveraging BERT for WSD, broadly categorized into nearest neighbor matching and linear projection of hidden layers.

4.1 Nearest Neighbor Matching

A straightforward way to disambiguate word sense is through 1-nearest neighbor matching. We compute the contextualized representation of each word in the training data and the test data through BERT. Given a hidden representation h^L_i at the L-th layer for word x_i in the test data, nearest neighbor matching finds a vector h^* in the L-th layer from the training data such that

    h^* = \arg\max_{h'} \cos(h^L_i, h')                                                   (6)

where the sense assigned to x_i is the sense of the word whose contextualized representation is h^*. This WSD technique was adopted in earlier work on contextualized word representations (Melamud et al., 2016; Peters et al., 2018).
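The sketch below illustrates the 1-nearest neighbor matching of Equation 6 over pre-computed last-layer BERT vectors. The data layout (parallel lists of training vectors and their sense labels for one lemma) and all names are assumptions for illustration, not the paper's released code.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_sense_1nn(test_vec, train_vecs, train_senses):
    """Eq. (6): return the sense of the training occurrence whose
    contextualized vector is most cosine-similar to the test vector."""
    best = max(range(len(train_vecs)),
               key=lambda j: cosine(test_vec, train_vecs[j]))
    return train_senses[best]

# Toy usage: three stored training vectors for one lemma, one test vector.
train_vecs = [np.random.rand(768) for _ in range(3)]
train_senses = ["sense_1", "sense_1", "sense_2"]
print(predict_sense_1nn(np.random.rand(768), train_vecs, train_senses))
```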


Figure 1: Illustration of WSD models by linear projection of (a) the last layer and (b) the weighted sum of all layers.

4.2 Linear Projection of Hidden Layers

Apart from nearest neighbor matching, we can perform a linear projection of the hidden vector h_i, by an affine transformation, into an output sense vector whose dimension equals the number of senses for word x_i. The output of this affine transformation is normalized by softmax such that all its values sum to 1. Therefore, the predicted sense s_i of word x_i is formulated as

    s_i = \mathrm{softmax}(W^{lexelt(x_i)} h_i + b^{lexelt(x_i)})                              (7)

where s_i is a vector of the predicted sense distribution for word x_i, while W^{lexelt(x_i)} and b^{lexelt(x_i)} are respectively the projection matrix and bias vector specific to the lexical element (lexelt) of word x_i, which consists of its lemma and optionally its part-of-speech tag. We choose the sense corresponding to the element of s_i with the maximum value.

Training the linear projection model is done by the back-propagation algorithm, which updates the model parameters to minimize a cost function. Our cost function is the negative log-likelihood of the softmax output value that corresponds to the tagged sense in the training data. In addition, we propose two novel ways of incorporating BERT's hidden representation vectors to compute the sense output vector, described in the following sub-subsections.

4.2.1 Last Layer Projection

Similar to the nearest neighbor matching model, we can feed the hidden vector of BERT in the last layer, h^L_i, into an affine transformation followed by softmax. That is, h_i in Equation 7 is instantiated by h^L_i. The last layer projection model is illustrated in Figure 1(a).

4.2.2 Weighted Sum of Hidden Layers

BERT consists of multiple layers stacked one after another. Each layer carries a different representation of the word context. Taking into account different hidden layers may help the WSD system learn from the different context information encoded in different layers of BERT.

To take into account all layers, we compute a weighted sum of all hidden layers h^l_i, where 1 \le l \le L, corresponding to word position i, through an attention mechanism. That is, h_i in Equation 7 is replaced by the following equation:

    h_i = A(m, W^s h^{1:L}_i, h^{1:L}_i, 1)                                                (8)

where m \in R^{d_{model}} is a projection vector used to obtain scalar values with the key vectors. The model with weighted sum of all hidden layers is illustrated in Figure 1(b).

4.2.3 Gated Linear Unit

While the contextualized representations in the BERT hidden layer vectors are features that determine the word sense, some features are more useful than others. As such, we propose filtering the vector values by a gating vector whose values range from 0 to 1. This mechanism is known as the gated linear unit (GLU) (Dauphin et al., 2017), which is formulated as

    \mathrm{GLU}(h) = (h + W^h h + b^h) \odot \sigma(W^g h + b^g)                              (9)

where W^h and W^g are separate projection matrices, and b^h and b^g are separate bias vectors. The symbols \sigma(\cdot) and \odot denote the sigmoid function and element-wise vector multiplication respectively.

GLU transforms the input vector h by feeding it to two separate affine transformations. The second transformation is used as the sigmoid gate to filter the input vector, which is summed with the vector after the first affine transformation.
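The following PyTorch sketch combines the three projection variants of this subsection: last layer projection (Equation 7), layer weighting (Equation 8), and GLU gating (Equation 9). Class and parameter names are mine, the dimensions assume BERT_BASE (12 layers, 768-dimensional hidden vectors), and the training loop (Adam, negative log-likelihood) is omitted, so it is a sketch under those assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SenseProjectionHead(nn.Module):
    """Per-lexelt sense classifier over BERT hidden layers (Eq. 7-9)."""

    def __init__(self, num_senses, dim=768,
                 use_layer_weighting=True, use_glu=True):
        super().__init__()
        self.use_layer_weighting = use_layer_weighting
        self.use_glu = use_glu
        self.m = nn.Parameter(torch.zeros(dim))      # Eq. (8): scoring vector m
        self.Ws = nn.Linear(dim, dim, bias=False)    # Eq. (8): key projection W^s
        self.Wh = nn.Linear(dim, dim)                # Eq. (9): W^h h + b^h
        self.Wg = nn.Linear(dim, dim)                # Eq. (9): gate W^g h + b^g
        self.out = nn.Linear(dim, num_senses)        # Eq. (7): lexelt-specific W, b

    def forward(self, layer_states):
        # layer_states: (num_layers, dim) hidden vectors of the target word.
        if self.use_layer_weighting:
            scores = self.Ws(layer_states) @ self.m           # one score per layer
            h = F.softmax(scores, dim=0) @ layer_states       # Eq. (8): weighted sum
        else:
            h = layer_states[-1]                              # last layer only (4.2.1)
        if self.use_glu:
            h = (h + self.Wh(h)) * torch.sigmoid(self.Wg(h))  # Eq. (9): GLU gating
        return F.log_softmax(self.out(h), dim=-1)             # Eq. (7), log-probs for NLL

# Toy usage: 12 stacked hidden vectors for one ambiguous word, 4 candidate senses.
head = SenseProjectionHead(num_senses=4)
states = torch.randn(12, 768)
print(head(states).shape)   # torch.Size([4])
```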

5 Experimental Setup

We conduct experiments on various WSD tasks. The description and statistics of each task are given in Table 1. For English, a lexical element (lexelt) is defined as a combination of lemma and part-of-speech tag, while for Chinese it is simply the lemma, following the OntoNotes setup.

Task                         # Instances   # Lexelts
English all-words
- SemCor (train)                 226,036      22,436
- SemEval-2007 (dev)                 455         330
- Senseval-2 (test)                2,282       1,093
- Senseval-3 (test)                1,850         977
- SemEval 2013 (test)              1,664         751
- SemEval 2015 (test)              1,022         512
English lexical sample
- Senseval-2 (train)               8,611         177
- Senseval-2 (test)                4,328         146
- Senseval-3 (train)               8,022          57
- Senseval-3 (test)                3,944          57
Chinese OntoNotes
- Train                           66,409         431
- Dev                              9,523         341
- BC (test)                        1,769         160
- BN (test)                        3,227         253
- MZ (test)                        1,876         223
- NW (test)                        1,483         143
- All (test)                       8,355         324

Table 1: Statistics of the datasets used for the English all-words task, English lexical sample task, and Chinese OntoNotes WSD task, in terms of the number of instances and the number of distinct lexelts. For the Chinese WSD task, "All" refers to the concatenation of all genres BC, BN, MZ, and NW.

We use English BERT_BASE for the English tasks and Chinese BERT for the Chinese task. We conduct experiments with the different strategies of incorporating BERT described in Section 4, namely 1-nearest neighbor matching (1-nn) and linear projection. In the latter technique, we explore strategies including simple last layer projection, layer weighting (LW), and the gated linear unit (GLU).

In the linear projection model, we train the model with the Adam algorithm (Kingma and Ba, 2015) with a learning rate of 10^-3. The model parameters are updated per mini-batch of 16 sentences. As the updates progress, we pick the best model parameters from a series of neural network updates based on accuracy on a held-out development set, disjoint from the training set.

The state-of-the-art supervised WSD approach takes into account features from the neighboring sentences, typically one sentence to the left and one to the right of the current sentence that contains the ambiguous words. We also exploit this in our model, as BERT supports inputs with multiple sentences separated by a special [SEP] symbol.

For English all-words WSD, we train our WSD model on SemCor (Miller et al., 1994) and test it on Senseval-2 (SE2), Senseval-3 (SE3), SemEval 2013 task 12 (SE13), and SemEval 2015 task 13 (SE15). This common benchmark, which has been annotated with WordNet-3.0 senses (Raganato et al., 2017a), has recently been adopted in English all-words WSD. Following Raganato et al. (2017b), we choose SemEval 2007 Task 17 (SE07) as our development data to pick the best model parameters after a number of neural network updates, for models that require back-propagation training.

We also evaluate on the Senseval-2 and Senseval-3 English lexical sample tasks, which come with pre-defined training and test data. For each word type, we pick 20% of the training instances as our development set and keep the remaining 80% as the actual training data. Through this development set, we determine the number of epochs to use in training. We then re-train the model on the whole training dataset using the number of epochs identified in the initial training step.

While WSD is predominantly evaluated on English, we are also interested in evaluating our approach on Chinese, to assess its effectiveness in a different language. We use OntoNotes Release 5.0 (LDC2013T19), which contains a number of annotations including word senses for Chinese. We follow the data setup of Pradhan et al. (2013) and conduct an evaluation on four genres, i.e., broadcast conversation (BC), broadcast news (BN), magazine (MZ), and newswire (NW), as well as the concatenation of all genres. While the training and development datasets are divided into genres, we train on the concatenation of all genres and test on each individual genre.
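As a schematic of the 1sent+1sur input described above, the helper below packs the target sentence and one surrounding sentence on each side into a single [SEP]-separated token sequence while tracking the target word's position. The whitespace tokenizer stands in for BERT's WordPiece tokenizer, so the function and the resulting indices are illustrative assumptions only.

```python
def build_bert_input(prev_sent, cur_sent, next_sent, target_idx):
    """Pack 1sent+1sur context into one [SEP]-separated token sequence
    and return the position of the target word in that sequence."""
    tokenize = str.split                    # stand-in for WordPiece tokenization
    tokens = ["[CLS]"]
    if prev_sent:
        tokens += tokenize(prev_sent) + ["[SEP]"]
    target_pos = len(tokens) + target_idx   # offset of the target word in cur_sent
    tokens += tokenize(cur_sent) + ["[SEP]"]
    if next_sent:
        tokens += tokenize(next_sent) + ["[SEP]"]
    return tokens, target_pos

toks, pos = build_bert_input("He sat down .", "The bank was closed .",
                             "He left .", target_idx=1)
print(toks[pos])   # 'bank'
```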

System                                      SE07    SE2     SE3     SE13    SE15    Avg
Reported in previous papers
MFS baseline                                54.5    65.6    66.0    63.8    67.1    65.6
IMS (Zhong and Ng, 2010)                    61.3    70.9    69.3    65.3    69.5    68.8
IMS+emb (Iacobacci et al., 2016)            60.9    71.0    69.3    67.3    71.3    69.7
SupWSD (Papandrea et al., 2017)             60.2    71.3    68.8    65.8    70.0    69.0
SupWSD+emb (Papandrea et al., 2017)         63.1    72.7    70.6    66.8    71.8    70.5
BiLSTMatt+LEX (Raganato et al., 2017b)      63.7    72.0    69.4    66.4    72.4    70.1
GASext Concat (Luo et al., 2018)            –       72.2    70.5    67.2    72.6    70.6
context2vec (Melamud et al., 2016)          61.3    71.8    69.1    65.6    71.9    69.6
ELMo (Peters et al., 2018)                  62.2    71.6    69.6    66.2    71.3    69.7
BERT nearest neighbor (ours)
1nn (1sent)                                 64.0    73.0    69.7    67.8    73.3    71.0
1nn (1sent+1sur)                            63.3    73.8    71.6    69.2    74.4    72.3
BERT linear projection (ours)
Simple (1sent)                              67.0    75.0    71.6    69.7    74.4    72.7
Simple (1sent+1sur)                         69.3∗   75.9∗   73.4    70.4∗   75.1    73.7∗
LW (1sent)                                  66.7    75.0    71.6    69.9    74.2    72.7
LW (1sent+1sur)                             69.0∗   76.4∗   74.0∗   70.1∗   75.0    73.9∗
GLU (1sent)                                 64.9    74.1    71.6    69.8    74.3    72.5
GLU (1sent+1sur)                            68.1∗   75.5∗   73.6∗   71.1∗   76.2∗   74.1∗
GLU+LW (1sent)                              65.7    74.0    70.9    68.8    73.6    71.8
GLU+LW (1sent+1sur)                         68.5∗   75.5∗   73.4∗   71.0∗   76.2∗   74.0∗

Table 2: English all-words task results in F1 measure (%), averaged over three runs. SemEval 2007 Task 17 (SE07) test set is used as the development set. We show the results of nearest neighbor matching (1nn) and linear projection, by simple last layer linear projection, layer weighting (LW), and gated linear units (GLU). Apart from BERT representation of one sentence (1sent), we also show BERT representation of one sentence plus one surrounding sentence to the left and one to the right (1sent+1sur). The best result in each dataset is shown in bold. Statistical significance tests by bootstrap resampling (∗: p < 0.05) compare 1nn (1sent+1sur) with each of Simple (1sent+1sur), LW (1sent+1sur), GLU (1sent+1sur), and GLU+LW (1sent+1sur).

For Chinese WSD evaluation, we train IMS (Zhong and Ng, 2010) on the Chinese OntoNotes dataset as our baseline. We also incorporate pre-trained non-contextualized Chinese word embeddings as IMS features (Taghipour and Ng, 2015; Iacobacci et al., 2016). The pre-trained word embeddings are obtained by training the word2vec skip-gram model on Chinese Gigaword Fifth Edition (LDC2011T13), which after automatic word segmentation contains approximately 2 billion words. Following Taghipour and Ng (2015), we incorporate the embedding features of words within a window surrounding the target ambiguous word. In our experiments, we take into account 5 words to the left and right.

6 Results

We present our experimental results and compare them with prior baselines.

6.1 English All-Words Tasks

For English all-words WSD, we compare our approach with three categories of prior approaches. Firstly, we compare with the supervised SVM classifier approach, namely IMS (Zhong and Ng, 2010). We compare with both the original IMS without word embedding features and IMS with non-contextualized word embedding features, that is, word2vec with exponential decay (Iacobacci et al., 2016). We also compare with SupWSD (Papandrea et al., 2017), an alternative implementation of IMS. Secondly, we compare our approach with neural WSD approaches that leverage a bidirectional LSTM (bi-LSTM).

These include the bi-LSTM model with attention trained jointly with a lexical semantic labeling task (Raganato et al., 2017b) (BiLSTMatt+LEX) and the bi-LSTM model enhanced with gloss representations from WordNet (GAS). Thirdly, we compare with prior contextualized word representations for WSD, pre-trained on a large amount of text, namely context2vec (Melamud et al., 2016) and ELMo (Peters et al., 2018). In these two models, WSD is treated as nearest neighbor matching, as described in Section 4.1.

Table 2 shows our WSD results in F1 measure. The table shows that with the nearest neighbor matching model, BERT outperforms context2vec and ELMo, which demonstrates the effectiveness of BERT's pre-trained contextualized word representation. When we include surrounding sentences, one to the left and one to the right, we get consistently improved F1 scores.

We also show that linear projection to the sense output vector further improves WSD performance, by which our best results are achieved. While BERT has been shown to outperform other pre-trained contextualized word representations through the nearest neighbor matching experiments, its potential is maximized through linear projection to the sense output vector. It is worth noting that our more advanced linear projections, by means of layer weighting (§4.2.2) and the gated linear unit (§4.2.3), give the best F1 scores on all test sets.

All our BERT WSD systems outperform the gloss-enhanced neural WSD system, which has the best overall score among all prior systems.

6.2 English Lexical Sample Tasks

For the English lexical sample tasks, we compare our approach with the original IMS (Zhong and Ng, 2010) and IMS with non-contextualized word embedding features. The embedding features incorporated into IMS include CW embeddings (Collobert et al., 2011), obtained from a convolutional language model and fine-tuned (adapted) to WSD (Taghipour and Ng, 2015) (+adapted CW), and word2vec skip-gram (Mikolov et al., 2013) with exponential decay (Iacobacci et al., 2016) (+w2v+expdecay). We also compare our approach with the bi-LSTM on top of which sense classification is defined (Kågebäck and Salomonsson, 2016), and with context2vec (Melamud et al., 2016), a contextualized pre-trained bi-LSTM model trained on 2B words of text. Finally, we also compare with a prior multi-task and semi-supervised WSD approach learned through alternating structure optimization (ASO) (Ando, 2006), which also utilizes unlabeled data for training.

System                          SE2     SE3
Reported
IMS                             65.2    72.3
IMS+adapted CW                  66.2    73.4
IMS+w2v+expdecay                69.9    75.2
BiLSTM                          66.9    73.4
context2vec                     –       72.8
ASO+multitask+semisup           –       74.1
BERT nearest neighbor (ours)
1nn (1sent)                     67.7    72.7
1nn (1sent+1sur)                71.5    75.7
BERT linear projection (ours)
Simple (1sent)                  73.3    78.6
Simple (1sent+1sur)             76.9∗   80.0∗
LW (1sent+1sur)                 76.7∗   80.0∗
GLU (1sent+1sur)                75.1∗   79.4∗
GLU+LW (1sent+1sur)             74.2∗   79.8∗

Table 3: English lexical sample task results in accuracy (%), averaged over three runs. The best accuracy in each dataset is shown in bold. Statistical significance tests by bootstrap resampling (∗: p < 0.05) compare 1nn (1sent+1sur) with each of Simple (1sent+1sur), LW (1sent+1sur), GLU (1sent+1sur), and GLU+LW (1sent+1sur).

As shown in Table 3, our BERT-based WSD approach with the linear projection model outperforms all prior approaches. context2vec, which is pre-trained on a large amount of text, performs worse than the prior semi-supervised ASO approach on Senseval-3, while our best result outperforms ASO by a large margin.

The neural bi-LSTM performs worse than IMS with non-contextualized word embedding features. Our neural model with pre-trained contextualized word representations outperforms the best result achieved by IMS on both Senseval-2 and Senseval-3.

6.3 Chinese OntoNotes WSD

We compare our approach with IMS without and with word embedding features as the baselines. The results are shown in Table 4.

Across all genres, BERT outperforms the baseline IMS with word embedding (non-contextualized word representation) features (Taghipour and Ng, 2015). The latter also improves over the original IMS without word embedding features (Zhong and Ng, 2010).

Linear projection approaches consistently outperform nearest neighbor matching by a significant margin, similar to the results on the English WSD tasks.

The best overall result on the Chinese OntoNotes test set is achieved by the models with simple projection and hidden layer weighting.

System      BC      BN      MZ      NW      All
Baseline
IMS         79.8    85.7    83.0    91.0    84.8
IMS+w2v     80.2    86.4    82.3    92.0    85.2
BERT
1nn         81.7    88.5    85.1    93.3    87.1
Simple      84.6∗   90.4∗   88.9∗   93.9    89.5∗
LW          84.7∗   90.3∗   88.8∗   94.0    89.5∗
GLU         84.6∗   90.0∗   88.3∗   93.3    89.0∗
GLU+LW      84.6∗   90.2∗   88.2∗   93.4    89.2∗

Table 4: Chinese OntoNotes WSD results in accuracy (%), averaged over three runs, for each genre. All BERT results in this table are obtained from the representation of one sentence plus one surrounding sentence to the left and to the right (1sent+1sur). We show results of various BERT incorporation strategies, namely nearest neighbor matching (1nn), simple projection, projection with layer weighting (LW), and the gated linear unit (GLU). The best accuracy in each genre is shown in bold. Statistical significance tests by bootstrap resampling (∗: p < 0.05) compare 1nn with each of Simple, LW, GLU, and GLU+LW.

7 Discussion

Across all tasks (English all-words, English lexical sample, and Chinese OntoNotes), our experiments demonstrate the effectiveness of BERT over various prior WSD approaches. The best results are consistently obtained by linear projection models, which project the last hidden layer or the weighted sum of all hidden layers to an output sense vector.

We can view the BERT hidden layer outputs as contextual features, which serve as useful cues in determining the word senses. In fact, the attention mechanism in the transformer captures the surrounding words. In prior work like IMS (Zhong and Ng, 2010), these contextual cues are captured by manually defined surrounding word and collocation features. The features obtained from the hidden vector output are shown to be more effective than the manually defined features.

We introduced two advanced linear projection techniques, namely layer weighting and the gated linear unit (GLU). While Peters et al. (2018) showed that the second biLSTM layer results in better WSD accuracy compared to the first layer (nearer to the individual word representation), we showed that taking into account different layers by means of the attention mechanism is useful for WSD.

GLU as an activation function has been shown to be effective for better convergence and for overcoming the vanishing gradient problem in the convolutional language model (Dauphin et al., 2017). In addition, the GLU gate vector, with values ranging from 0 to 1, can be seen as a filter for the features from the hidden layer vector.

Compared with two prior contextualized word representation models, context2vec (Melamud et al., 2016) and ELMo (Peters et al., 2018), BERT achieves higher accuracy. This shows the effectiveness of the attention mechanism used in the transformer model to represent the context.

Our BERT WSD models outperform prior neural WSD models by a large margin. These prior neural WSD models perform comparably with IMS with embeddings as classifier features, in addition to the discrete features. While neural WSD approaches (Kågebäck and Salomonsson, 2016; Raganato et al., 2017b; Luo et al., 2018) exploit non-contextualized word embeddings which are trained on large texts, their hidden layers are trained only on a small amount of labeled data.

8 Conclusion

For the WSD task, we have proposed novel strategies of incorporating BERT, a pre-trained contextualized word representation which effectively captures the context in its hidden vectors. Our experiments show that linear projection of the hidden vectors, coupled with gating to filter the values, gives better results than the prior state of the art. Compared to prior neural and feature-based WSD approaches that make use of non-contextualized word representations, using pre-trained contextualized word representations with our proposed incorporation strategy achieves significantly higher scores.

References

Rie Kubota Ando. 2006. Applying alternating structure optimization to word sense disambiguation. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 77–84.

Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.

Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007a. Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 33–40.

Yee Seng Chan, Hwee Tou Ng, and Zhi Zhong. 2007b. NUS-PT: Exploiting parallel texts for word sense disambiguation in the English all-words tasks. In Proceedings of the Fourth International Workshop on Semantic Evaluations, pages 253–256.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the Thirty-Fourth International Conference on Machine Learning, pages 933–941.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 897–907.

Mikael Kågebäck and Hans Salomonsson. 2016. Word sense disambiguation using a bidirectional LSTM. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon, pages 51–56.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.

Yoong Keok Lee, Hwee Tou Ng, and Tee Kiah Chia. 2004. Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 137–140.

Fuli Luo, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018. Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2473–2482.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1064–1074.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proceedings of the Thirty-First Conference on Neural Information Processing Systems, pages 6297–6308.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at the International Conference on Learning Representations.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, pages 240–243.

Simone Papandrea, Alessandro Raganato, and Claudio Delli Bovi. 2017. SupWSD: A flexible toolkit for supervised word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 103–108.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017a. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 99–110.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Twenty-Eighth Conference on Neural Information Processing Systems, pages 3104–3112.

Kaveh Taghipour and Hwee Tou Ng. 2015. Semi-supervised word sense disambiguation using word embeddings in general and specific domains. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 314–323.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Thirty-First Conference on Neural Information Processing Systems, pages 5998–6008.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: a wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83.

Zhi Zhong and Hwee Tou Ng. 2012. Word sense dis- ambiguation improves information retrieval. In Pro- ceedings of the 50th Annual Meeting of the Associa- tion for Computational Linguistics, pages 273–282.
