Sparsity Makes Sense: Word Sense Disambiguation Using Sparse Contextualized Word Representations

Gábor Berend¹,²
¹Institute of Informatics, University of Szeged
²MTA-SZTE Research Group on Artificial Intelligence
[email protected]

Abstract

In this paper, we demonstrate that by utilizing sparse word representations, it becomes possible to surpass the results of more complex task-specific models on the task of fine-grained all-words word sense disambiguation. Our proposed algorithm relies on an overcomplete set of semantic basis vectors that allows us to obtain sparse contextualized word representations. We introduce such an information theory-inspired synset representation based on the co-occurrence of word senses and non-zero coordinates of word forms, which allows us to achieve an aggregated F-score of 78.8 over a combination of five standard word sense disambiguation benchmark datasets. We also demonstrate the general applicability of our proposed framework by evaluating it on part-of-speech tagging over four different treebanks. Our results indicate a significant improvement over the application of the dense word representations.

1 Introduction

Natural language processing applications have benefited remarkably from language modeling based contextualized word representations, including CoVe (McCann et al., 2017), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), inter alia. Contrary to standard "static" word embeddings like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), contextualized representations assign vectorial representations to mentions of word forms that are sensitive to the entire sequence in which they are present. This characteristic of contextualized word embeddings makes them highly applicable for performing word sense disambiguation (WSD), as has been investigated recently (Loureiro and Jorge, 2019; Vial et al., 2019).

Another popular line of research deals with sparse overcomplete word representations, which differ from typical word embeddings in that most coefficients are exactly zero. Such sparse word representations have been argued to convey an increased interpretability (Murphy et al., 2012; Faruqui et al., 2015; Subramanian et al., 2018), which could be advantageous for WSD. It has been shown that sparsity can not only favor interpretability, but it can also contribute to an increased performance in downstream applications (Faruqui et al., 2015; Berend, 2017).

The goal of this paper is to investigate and quantify what synergies exist between contextualized and sparse word representations. Our rigorous experiments show that it is possible to get increased performance on top of contextualized representations when they are post-processed in a way which ensures their sparsity.

In this paper we introduce an information theory-inspired algorithm for creating sparse contextualized word representations and evaluate it in a series of challenging WSD tasks. In our experiments, we managed to obtain solid results for multiple fine-grained word sense disambiguation benchmarks. All our source code for reproducing our experiments is made available at https://github.com/begab/sparsity_makes_sense.¹

Our contributions can be summarized as follows:

• we propose the application of contextualized sparse overcomplete word representations in the task of word sense disambiguation,

• we carefully evaluate our information theory-inspired approach for quantifying the strength of the connection between the individual dimensions of (sparse) word representations and human-interpretable semantic content such as fine-grained word senses,

• we demonstrate the general applicability of our algorithm by applying it to POS tagging on four different UD treebanks.

¹An additional demo application performing all-words word sense disambiguation is also made available at http://www.inf.u-szeged.hu/~berendg/nlp_demos/wsd.

2 Related work

One of the key difficulties of natural language understanding is the highly ambiguous nature of language. As a consequence, WSD has long-standing origins in the NLP community (Lesk, 1986; Resnik, 1997a,b), still receiving major recent research interest (Raganato et al., 2017a; Trask et al., 2015; Melamud et al., 2016; Loureiro and Jorge, 2019; Vial et al., 2019). A thorough survey on WSD algorithms of the pre-neural era can be found in (Navigli, 2009).

A typical evaluation for WSD systems is to quantify the extent to which they are capable of identifying the correct sense of ambiguous words in their contexts according to some sense inventory. One of the most frequently applied sense inventories in the case of English is the Princeton WordNet (Fellbaum, 1998), which also served as the basis of our evaluation.

A variety of WSD approaches has evolved, ranging from unsupervised and knowledge-based solutions to supervised ones. Unsupervised approaches could investigate the textual overlap between the context of ambiguous words and their potential sense definitions (Lesk, 1986), or they could be based on random walks over the semantic graph providing the sense inventory (Agirre and Soroa, 2009).

Supervised WSD techniques typically perform better than unsupervised approaches. IMS (Zhong and Ng, 2010) is a classical supervised WSD framework which was created with the intention of easy extensibility. It trains SVMs for predicting the correct sense of a word based on traditional features, such as the surface forms and POS tags of the ambiguous word as well as its neighboring words.

The recent advent of neural text representations has also shaped the landscape of algorithms performing WSD. Iacobacci et al. (2016) extended the classical feature-based IMS framework by incorporating word embeddings. Melamud et al. (2016) devised context2vec, which relies on a bidirectional LSTM (biLSTM) for performing supervised WSD. Kågebäck and Salomonsson (2016) also proposed the utilization of biLSTMs for WSD. Raganato et al. (2017b) tackled all-words WSD as a sequence learning task and solved it using LSTMs. Vial et al. (2019) introduced a similar framework, but replaced the LSTM decoder with an ensemble of transformers. Vial et al. (2019) additionally relied on BERT contextual word representations as input to their all-words WSD system.

Contextual word embeddings have recently superseded traditional word embeddings due to their advantageous property of also modeling the neighboring context of words upon determining their vectorial representations. As such, the same word form gets assigned a separate embedding when mentioned in different contexts. Contextualized word vectors, including BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), typically employ some language modeling-inspired objective and are trained on massive amounts of textual data, which makes them generally applicable in a variety of settings, as illustrated by top-performing entries on the SuperGLUE leaderboard (Wang et al., 2019).

Most recently, Loureiro and Jorge (2019) have proposed the usage of contextualized word representations for tackling WSD. Their framework builds upon BERT embeddings and performs WSD relying on a k-NN approach of query words towards the sense embeddings that are derived as the centroids of contextual embeddings labeled with a certain sense. The framework also utilizes static fastText (Bojanowski et al., 2017) embeddings, as well as averaged contextual embeddings derived from the definitions attached to WordNet senses, for mitigating the problem caused by the limited amount of sense-labeled training data.

Kumar et al. (2019) proposed the EWISE approach, which constructs sense definition embeddings also relying on the network structure of WordNet for performing zero-shot WSD in order to handle words without any sense-annotated occurrence in the training data. Bevilacqua and Navigli (2020) introduce EWISER as an improvement over the EWISE approach by providing a hybrid knowledge-based and supervised approach via the integration of explicit relational information from WordNet. Our approach differs from both (Kumar et al., 2019) and (Bevilacqua and Navigli, 2020) in that we do not exploit the structural properties of WordNet.

SenseBERT (Levine et al., 2019) extends BERT (Devlin et al., 2019) by incorporating an auxiliary task into the masked language modeling objective for predicting word supersenses besides word identities.

Our approach differs from SenseBERT in that we do not propose an alternative way for training contextualized embeddings, but introduce an algorithm for extracting a useful representation from pretrained BERT embeddings that can effectively be used for WSD. Due to this conceptual difference, our approach does not need a large transformer model to be trained, but can be readily applied over pretrained models.

GlossBERT (Huang et al., 2019) framed WSD as a sentence pair classification task between the sentence containing an ambiguous target token and the contents of the glosses of the potential synsets of the ambiguous token, and fine-tuned BERT accordingly. GlossBERT hence requires a fine-tuning stage, whereas our approach builds directly on the pre-trained contextual embeddings, which makes it more resource efficient.

Our work also relates to the line of research on sparse word representations. The seminal work on obtaining sparse word representations by Murphy et al. (2012) applied matrix factorization over the co-occurrence matrix built from some corpus. Arora et al. (2018) investigated the linear algebraic structure of static word embeddings and concluded that "simple sparse coding can recover vectors that approximately capture the senses". Faruqui et al. (2015); Berend (2017); Subramanian et al. (2018) introduced different approaches for obtaining sparse word representations from traditional static and dense word vectors. Our work differs from all the previously mentioned papers in that we create sparse contextualized word representations.
3 Approach

Our algorithm is composed of two important steps, i.e. we first make a sparse representation from the dense contextualized ones, then we derive a succinct representation describing the strength of the connection between the individual bases of our representation and the sense inventory we would like to perform WSD against. We elaborate on these components next.

3.1 Sparse contextualized embeddings

Our algorithm first determines contextualized word representations for some sense-annotated corpus. We shall denote the surface form realizations in the corpus as X = {{x_j^(i)}_{j=0}^{N_i}}_{i=0}^{M}, with x_j^(i) standing for the token at position j within sentence i, supposing a total of M sequences and N_i tokens in sentence i. We refer to the contextualized word representation of some token in boldface, i.e. **x**_j^(i), and to the collection of contextual embeddings as **X** = {{**x**_j^(i)}_{j=0}^{N_i}}_{i=0}^{M}.

Likewise to the sequence of sentences and their respective tokens, we also utilize a sequence of annotations that we denote as S = {{s_j^(i)}_{j=0}^{N_i}}_{i=0}^{M}, with s_j^(i) indicating the labeling of token j within sentence i. We have s_j^(i) ∈ {0,1}^{|S|}, with S denoting the set of possible labels included in our annotated corpus. That is, we have an indicator vector conveying the annotation of every token. We allow for the s_j^(i) = 0 case, meaning that it is possible that certain tokens lack annotation. In the case of WSD, the annotation is meant in the form of sense annotation, but in general, the token-level annotations could convey other types of information as well.

The next step in our algorithm is to perform sparse coding over the contextual embeddings of the annotated corpus. Sparse coding is a matrix decomposition technique which tries to approximate some matrix X ∈ R^{v×m} as a product of a sparse matrix α ∈ R^{v×k} and a dictionary matrix D ∈ R^{k×m}, where k denotes the number of basis vectors to be employed.

We formed matrix X by stacking and unit normalizing the contextual embeddings comprising **X**. We then optimize

$$\min_{D \in \mathcal{C},\ \alpha_j^{(i)} \in \mathbb{R}^{k}_{\geq 0}} \sum_{i=1}^{M} \sum_{j=1}^{N_i} \left\| \mathbf{x}_j^{(i)} - \alpha_j^{(i)} D \right\|_2^2 + \lambda \left\| \alpha_j^{(i)} \right\|_1, \qquad (1)$$

where C denotes the convex set of matrices with row norm at most 1, λ is the regularization coefficient, and the sparse coefficients in α_j^(i) are required to be non-negative. We imposed the non-negativity constraint on α as it has been reported to provide increased interpretability (Murphy et al., 2012).
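For illustration, a minimal sketch of the optimization in Eq. (1) is given below, assuming the SPAMS Python bindings (the library we rely on in Section 4.1). The function and variable names are illustrative and the snippet is not an excerpt of the released implementation; the exact argument names of spams.trainDL and spams.lasso may differ slightly across SPAMS versions.

import numpy as np
import spams

def fit_sparse_embeddings(dense_vectors, k=3000, lam=0.05):
    """dense_vectors: array of shape (n_tokens, 1024) with contextualized embeddings."""
    # unit-normalize every embedding, then arrange tokens as columns (SPAMS convention)
    X = dense_vectors / np.linalg.norm(dense_vectors, axis=1, keepdims=True)
    X = np.asfortranarray(X.T, dtype=np.float64)

    # learn an overcomplete dictionary D (atoms of norm at most 1) together with
    # non-negative, l1-regularized sparse coefficients, mirroring Eq. (1)
    D = spams.trainDL(X, K=k, lambda1=lam, posAlpha=True, iter=1000)

    # sparse codes alpha for every token, with the dictionary D kept fixed
    alphas = spams.lasso(X, D=D, lambda1=lam, pos=True)  # sparse matrix of shape (k, n_tokens)
    return D, alphas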
3.2 Binding basis vectors to senses

Once we have obtained a sparse contextualized representation for each token in our annotated corpus, we determine the extent to which the individual bases comprising the dictionary matrix D bind to the elements of our label inventory S. In order to do so, we devise a matrix Φ ∈ R^{k×|S|} which contains a φ_bs score for each pair of basis vector b and particular label s. We summarize our algorithm for obtaining Φ in Algorithm 1.

The definition of Φ is based on a generalization of the co-occurrence of bases and the elements of the label inventory S. We first define our co-occurrence matrix between bases and labels as

$$C = \sum_{i=1}^{M} \sum_{j=1}^{N_i} \alpha_j^{(i)} s_j^{(i)\top}, \qquad (2)$$

i.e. C is the sum of the outer products of the sparse word representations (α_j^(i)) and their respective sense description vectors (s_j^(i)). The definition in (2) ensures that every c_bs ∈ C aggregates the sparse nonnegative coefficients that words labeled as s have received for their coordinate b. Recall that we allowed certain s_j^(i) to be the all-zero vector, i.e. tokens that lack any annotation are conveniently handled by Eq. (2), as the sparse coefficients of such tokens do not contribute towards C.

We next turn the elements of C into a matrix representing a joint probability distribution P by determining the ℓ1-normalized variant of C (line 5 of Algorithm 1). This way we devise a sparse matrix, the entries of which can be used for calculating the Pointwise Mutual Information (PMI) between the semantic bases and the presence of the symbolic senses of our sense inventory.

For a pair of events (i, j), PMI is measured as log(p_ij / (p_i* p_*j)), with p_ij referring to their joint probability, and p_i* and p_*j denoting the marginal probabilities of i and j, respectively. We determine these probabilities from the entries of P that we obtain from C via ℓ1 normalization.
Employing positive PMI. Negative PMI values for a pair of events convey the information that they repel each other. Multiple studies have argued that negative PMI values are hence detrimental (Bullinaria and Levy, 2007; Levy et al., 2015). To this end, we can opt for the determination of positive PMI (pPMI) values as indicated in line 7 of Algorithm 1.

Employing normalized PMI. An additional property of (positive) PMI is that it favors observations with low marginal frequency (Bouma, 2009), since for events with low marginal probability p(x), p(x|y) ≈ p(x) tends to hold, which results in high PMI values. In our setting, this would result in rarer senses receiving higher φ_bs scores towards all the bases.

In order to handle low-frequency senses better, we optionally calculate the normalized (positive) PMI (Bouma, 2009) between a pair of basis and sense as log(p_ij / (p_i* p_*j)) / (−log(p_ij)). That is, we normalize the PMI scores by the negative logarithm of the joint probability (cf. line 8 of Algorithm 1). This step additionally ensures that the normalized PMI (nPMI) ranges between −1 and 1, as opposed to the (−∞, min(−log(p_i), −log(p_j))) range of the unnormalized PMI values.

Algorithm 1: Calculating Φ(X, S)
Require: sense-annotated corpus (X, S)
Ensure: Φ ∈ R^{k×|S|} describing the strength between the k semantic bases and the elements of the sense inventory S
 1: procedure CALCULATEPHI(X, S)
 2:     X ← UNITNORMALIZE(X)
 3:     D, α ← argmin_{D∈C, α∈R≥0} ‖X − Dα‖_F + λ‖α‖_1
 4:     C ← αS
 5:     P ← C / ‖C‖_1
 6:     Φ ← [log(p_ij / (p_i* p_*j))]_ij
 7:     Φ ← [max(0, φ_ij)]_ij                ▷ cf. pPMI
 8:     Φ ← [φ_ij / (−log(p_ij))]_ij         ▷ cf. nPMI
 9:     return Φ, D
10: end procedure
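The sketch below re-implements the core of Algorithm 1 in a few lines of numpy, assuming that the sparse codes have already been obtained (e.g. with the SPAMS sketch in Section 3.1). The names are illustrative and do not come from the released code; the flags make the pPMI/nPMI choices of lines 7–8 explicit.

import numpy as np

def compute_phi(alphas, senses, n_senses, positive=True, normalize=True):
    """alphas: (k, n_tokens) sparse codes; senses[j]: sense id of token j, or None if unannotated."""
    if hasattr(alphas, "todense"):                # e.g. the scipy sparse output of spams.lasso
        alphas = np.asarray(alphas.todense())
    k, n_tokens = alphas.shape
    C = np.zeros((k, n_senses))
    for j in range(n_tokens):
        if senses[j] is not None:                 # unannotated tokens do not contribute to C
            C[:, senses[j]] += alphas[:, j]

    P = C / C.sum()                               # joint distribution over (basis, sense) pairs
    p_basis = P.sum(axis=1, keepdims=True)        # marginals of the bases
    p_sense = P.sum(axis=0, keepdims=True)        # marginals of the senses
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.log(P) - np.log(p_basis) - np.log(p_sense)   # PMI (line 6)
        if normalize:
            phi = phi / -np.log(P)                # nPMI, bounded by [-1, 1] (line 8)
    phi[~np.isfinite(phi)] = 0.0                  # pairs that never co-occur carry no association
    if positive:
        phi = np.maximum(phi, 0.0)                # pPMI / npPMI (line 7)
    return phi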
3.3 Inferring senses

We now describe the way we assign the most plausible sense to any given token of a sequence according to the sense inventory employed for constructing D and Φ.

For an input sequence of N tokens accompanied by their corresponding contextualized word representations [x_j]_{j=1}^{N}, we determine their corresponding sparse representations [α_j]_{j=1}^{N} based on the D that we have already determined upon obtaining Φ. That is, we solve an ℓ1-regularized convex optimization problem, with D kept fixed, for all the unit-normalized vectors x_j in order to obtain the sparse contextualized word representation α_j of every token j in the sequence.

We then take the product of α_j ∈ R^k and Φ ∈ R^{k×|S|}. Since every column of Φ corresponds to a sense from the sense inventory, every scalar in the resulting product α_j^⊤Φ ∈ R^{|S|} can be interpreted as a quantity indicating the extent to which token j, in its given context, pertains to the individual senses of the sense inventory. In other words, we assign that sense s to a particular token j which maximizes α_j^⊤Φ_{*s}, where Φ_{*s} denotes the column vector of Φ corresponding to sense s.
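A minimal sketch of this decision rule is shown below. It assumes the query token has already been sparse coded against the fixed dictionary D (e.g. via the spams.lasso call sketched in Section 3.1) and that candidate_senses holds the ids of the synsets the query lemma can belong to; the names are illustrative.

import numpy as np

def predict_sense(alpha, phi, candidate_senses):
    """alpha: (k,) non-negative sparse code of the query token; phi: (k, |S|) association matrix."""
    scores = np.asarray(alpha).ravel() @ phi      # alpha_j^T Phi, one score per sense
    # restrict the argmax to the senses the query lemma can actually take
    return max(candidate_senses, key=lambda s: scores[s])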

4 Experiments and results

We evaluate our approach on the unified WSD evaluation framework released by Raganato et al. (2017a), which includes the sense-annotated SemCor dataset for training purposes. SemCor (Miller et al., 1994) consists of 802,443 tokens, with more than 28% of them (226,036) being sense-annotated using WordNet sensekeys. For instance, bank%1:14:00:: is one of the possible sensekeys the word bank can be assigned to, corresponding to one of the 18 different synsets it is included in according to WordNet 3.0. WordNet 3.0 contains altogether 206,949 distinct senses for 147,306 unique lemmas grouped into 117,659 synsets. We constructed Φ relying on the synset-level information of WordNet.
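As an illustration of the sensekey–synset relation described above (not part of our pipeline), the mapping can be inspected with NLTK's WordNet 3.0 interface:

from nltk.corpus import wordnet as wn

lemma = wn.lemma_from_key("bank%1:14:00::")   # resolve a WordNet sensekey to a lemma
print(lemma.synset())                         # the synset this sensekey belongs to
print(len(wn.synsets("bank")))                # the 18 synsets that contain the word "bank"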

4.1 Sparse contextualized embeddings

For obtaining contextualized word representations, we rely on the pretrained BERT-large-cased model from (Wolf et al., 2019). Each input token x_j^(i) gets assigned 25 contextual vectors [x_{j,l}^(i)]_{l=0}^{24}, corresponding to the input layer and the 24 inner layers of the BERT-large model. Each vector x_{j,l}^(i) is 1024-dimensional.

BERT relies on WordPiece tokenization, which means that a single token, such as playing, could be broken up into multiple subwords (play and ##ing). We defined the token-level contextual embeddings to be the average of their subword-level contextual embeddings.

Sparse coding as formulated in (1) took the stacked 1024-dimensional contextualized BERT embeddings of the 802,443 tokens from SemCor as input, i.e. we had X ∈ R^{1024×802443}. We used the SPAMS library (Mairal et al., 2009) to solve our optimization problems. Our approach has two hyperparameters, i.e. the number of basis vectors included in the dictionary matrix (k) and the regularization coefficient (λ). We experimented with k ∈ {1500, 2000, 3000} in order to investigate the sensitivity of our proposed algorithm to the dimension of the sparse vectors, and we employed λ = 0.05 throughout all our experiments.

Figure 1 reports the average number of nonzero coefficients of the sparse word representations of the SemCor tokens when using different values of k and different layers of BERT as input. The average time for determining the sparse contextualized word representations for one layer of BERT was 40 minutes on an Intel Xeon 5218 for k = 3000.

Figure 1: Average number of nonzero coefficients per SemCor token when relying on contextualized embeddings from different layers of BERT as input.
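The subword averaging described above can be realized, for instance, with the Hugging Face transformers library as sketched below. The snippet is illustrative rather than an excerpt of the released code, and the layer index is a free parameter (our experiments sweep all layers and later average the last four).

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModel.from_pretrained("bert-large-cased", output_hidden_states=True)

def token_embeddings(words, layer=24):
    """Average the subword vectors of a chosen layer into one 1024-dim vector per token."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]     # (n_subwords, 1024)
    vectors = []
    for widx in range(len(words)):
        # subword positions belonging to word widx ([CLS]/[SEP] are mapped to None)
        positions = [p for p, w in enumerate(enc.word_ids()) if w == widx]
        vectors.append(hidden[positions].mean(dim=0))
    return torch.stack(vectors)                           # (len(words), 1024)

vecs = token_embeddings("The bank raised the interest rates .".split())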
4.2 Evaluation on all-words WSD

The evaluation framework introduced in (Raganato et al., 2017a) contains five different all-words WSD benchmarks for measuring the performance of WSD systems. The dataset includes the SensEval2 (Edmonds and Cotton, 2001), SensEval3 (Mihalcea et al., 2004), SemEval 2007 Task 17 (Pradhan et al., 2007), SemEval 2013 Task 12 (Navigli et al., 2013) and SemEval 2015 Task 13 (Moro and Navigli, 2015) datasets, containing 2282, 1850, 455, 1644 and 1022 sense-annotated tokens, respectively. The concatenation of these datasets is also included in the evaluation toolkit; it is commonly referred to as the ALL dataset and includes 7253 sense-annotated test cases. We relied on the official scoring script included in the evaluation framework of (Raganato et al., 2017a). Unless stated otherwise, we report our results on the combination of all the datasets for brevity, as the results for the individual subcorpora behaved similarly.

In order to demonstrate the benefits of our proposed approach, we developed a strong baseline similar to the one devised in (Loureiro and Jorge, 2019). This baseline employs the very same contextualized embeddings that we use otherwise in our algorithm, providing identical conditions for the different approaches. For each synset s, we determine its centroid based on the contextualized word representations pertaining to sense s according to the training data. We then use this matrix Ψ as a replacement of Φ when making predictions for some token with its dense contextualized embedding x_j.

The way we make our fine-grained sensekey predictions for the test tokens is identical when utilizing dense and sparse contextualized embeddings; the only difference is whether we base our decision on x_j^⊤Ψ (for the dense case) or α_j^⊤Φ (for the sparse case). In either case, we choose the best scoring synset a particular query lemma can belong to. That is, we perform the argmax operation described in Section 3.3 over the set of possible synsets a query lemma can belong to.
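A sketch of this dense-centroid baseline is given below (illustrative names, not the released code); prediction then proceeds exactly as in the Section 3.3 sketch, with x_j and Ψ taking the place of α_j and Φ.

import numpy as np

def build_psi(dense_vectors, senses, n_senses):
    """Psi[:, s] is the centroid of the unit-normalized dense embeddings annotated with sense s."""
    dim = dense_vectors.shape[1]
    psi = np.zeros((dim, n_senses))
    counts = np.zeros(n_senses)
    for x, s in zip(dense_vectors, senses):
        if s is not None:                  # only sense-annotated tokens contribute
            psi[:, s] += x
            counts[s] += 1
    return psi / np.maximum(counts, 1)     # avoid division by zero for unseen senses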

Figure 2 includes comparative results for the approaches using dense and sparse contextualized embeddings derived from different layers of BERT. We can see that our approach yields considerable improvements over the application of dense embeddings. In fact, applying sparse contextualized embeddings provided significantly better results (p ≪ 0.01 using McNemar's test) irrespective of the choice of k when compared against the utilization of dense embeddings.

Additionally, the different choices for the dimension of the sparse word representations do not seem to play a decisive role, as illustrated by Figure 2 and also confirmed by our significance tests conducted between the sparse approaches using different values of k. Since the choice of k did not severely impact the results, we report our experiments for the k = 3000 case hereon.

Figure 2: Comparative results of relying on the dense and sparse word representations of different dimensions for WSD using the SemCor dataset for training.
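The reported significance tests can be reproduced, for example, with the statsmodels implementation of McNemar's test applied to paired per-token correctness indicators of the two systems. The helper below is an illustrative assumption, not part of the released code.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(correct_a, correct_b):
    """correct_a / correct_b: boolean arrays marking which test instances each system got right."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    table = [[np.sum(a & b),  np.sum(a & ~b)],     # 2x2 contingency table of agreements
             [np.sum(~a & b), np.sum(~a & ~b)]]    # and disagreements between the two systems
    return mcnemar(table, exact=False, correction=True).pvalue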

4.2.1 Increasing the amount of training data

We also measured the effects of increasing the amount of training data. We additionally used two sources of information for training, i.e. the WordNet synsets themselves and the Princeton WordNet Gloss Corpus (WNGC). The WordNet synsets were utilized in an identical fashion to the LMMS approach (Loureiro and Jorge, 2019), i.e. we determined a vectorial representation for each synset by taking the average of the contextual representations based on the concatenation of the definition and the lemmas belonging to the synset.

WNGC is a sense-annotated version of WordNet itself containing 117,659 definitions (one for each synset in WordNet), consisting of 1,634,691 tokens, out of which 614,435 have a corresponding sensekey attached. We obtained this data from the Unification of Sense Annotated Corpora (UFSAC) (Vial et al., 2018).

For these experiments our framework was kept intact; the only difference was that instead of solely relying on the sense-annotated training data included in SemCor, we additionally relied on the sense representations derived from the WordNet glosses and the sense annotations included in WNGC upon the determination of Φ and Ψ for the sparse and dense cases, respectively. We used the same set of semantic basis vectors D that we determined earlier for the case when we relied solely on SemCor as the source of sense-annotated data. Figure 3 includes our results when increasing the amount of sense-annotated training data. We can see that the additional training data consistently improves performance for both the dense and the sparse case. Figure 3 also demonstrates that our proposed method, when trained on the SemCor data alone, is capable of achieving the same or better performance as the approach which is based on dense contextual embeddings using all the available sources of training signal.

Figure 3: The effects of employing additional sources of information besides SemCor during training.

4.2.2 Ablation experiments

We gave a detailed description of our algorithm in Section 3.2. We now report the experiments that we conducted in order to see the contribution of the individual components of our algorithm. As mentioned in Section 3.2, determining the normalized positive PMI (npPMI) between the semantic bases and the elements of the sense inventory plays a central role in our algorithm.

In order to see the effects of normalizing and keeping only the positive PMI values, we evaluated three further *PMI-based variants for the calculation of Φ, i.e. we had

• vPMI, vanilla PMI without normalization or discarding of negative entries,

• pPMI, which discards negative PMI values but does not normalize them, and

• nPMI, which performs normalization, however does not discard negative PMI values.

Additionally, we evaluated a system which uses sparse contextualized word representations for determining Φ, however, does not involve the calculation of PMI scores at all. In that case we calculated a centroid for every synset, similar to the calculation of Ψ for the case of contextualized embeddings that are kept dense. The only difference is that for the approach we refer to as no PMI, we calculated the synset centroids based on the sparse contextualized word representations.

Figure 4 includes our results for the previously mentioned variants of our algorithm when relying on the different layers of BERT as input. Figure 4 highlights that calculating PMI is indeed a crucial step in our algorithm (cf. the no PMI and the *PMI results). We also tried to adapt the *PMI approaches to the dense contextual embeddings, but the results dropped severely in that case.

We can additionally observe that normalization has the largest impact on improving the results, as the performance of nPMI is at least 4 points better than that of vPMI for all layers. Not relying on negative PMI scores also had an overall positive effect (cf. vPMI and pPMI), which seems to be additive with normalization (cf. nPMI and npPMI).

Figure 4: Ablation experiments regarding the different strategies to calculate Φ using the combined (SemCor+WordNet+WNGC) training data.

8504 approach SensEval2 SensEval3 SemEval2007 SemEval2013 SemEval2015 ALL Most Frequent Sense (MFS) 66.8 66.2 55.2 63.0 67.8 65.2 IMS (Zhong and Ng, 2010) 70.9 69.3 61.3 65.3 69.5 68.4 IMS+emb-s (Iacobacci et al., 2016) 72.2 70.4 62.6 65.9 71.5 69.6 context2Vec (Melamud et al., 2016) 71.8 69.1 61.3 65.6 71.9 69.0 LMMS1024 (Loureiro and Jorge, 2019) 75.4 74.0 66.4 72.7 75.3 73.8 LMMS2348 (Loureiro and Jorge, 2019) 76.3 75.6 68.1 75.1 77.0 75.4 GlossBERT(Sent-CLS-WS) (Huang et al., 2019) 77.7 75.2 72.5 76.1 80.4 77.0 Ours (using SemCor) 77.6 76.8 68.4 73.4 76.5 75.7 Ours (using SemCor + WordNet) 77.9 77.8 68.8 76.1 77.5 76.8 Ours (using SemCor + WordNet + WNGC) 79.6 77.3 73.0 79.4 81.3 78.8

Table 1: Comparison with previous supervised results in terms of F measure computed by the official scorer provided in (Raganato et al., 2017a).

4.3 Evaluation towards POS tagging

In order to demonstrate the general applicability of our proposed algorithm, we evaluated it towards POS tagging using version 2.5 of Universal Dependencies. We conducted experiments over four different subcorpora in English, namely the EWT (Silveira et al., 2014), GUM (Zeldes, 2017), LinES (Ahrenberg, 2007) and ParTUT (Sanguinetti and Bosco, 2015) treebanks.

For these experiments, we used the same approach as before. We also used the same dictionary matrix D for obtaining the sparse word representations that we determined based on the SemCor dataset. The only difference for our POS tagging experiments is that this time the token-level labels were replaced by the POS tags of the individual tokens as opposed to their sense labels. This means that both Ψ and Φ had 17 columns, i.e. the number of distinct POS tags used in these treebanks.

Figure 5 reveals that the approach utilizing sparse contextualized word representations outperforms the one that is based on the adaptation of the LMMS approach to POS tagging by a fair margin, again irrespective of the layer of BERT that is used as input. A notable difference compared to the results obtained for all-words WSD is that for POS tagging the intermediate layers of BERT seem to deliver the most useful representations.

Figure 5: POS tagging results evaluated over the development set of four English UD v2.5 treebanks (EWT, GUM, LinES and ParTUT).

We used the development set of the individual treebanks for choosing the most promising layer of BERT to employ the different approaches over. For the npPMI approach we selected layers 13, 13, 14 and 11 for the EWT, GUM, LinES and ParTUT treebanks, respectively. As for the dense centroid based approach, we selected layer 6 for the ParTUT treebank and layer 13 for the rest of the treebanks. After doing so, our results for the test set of the four treebanks are reported in Table 2. Our approach delivered significant improvements for POS tagging as well, as indicated by the p-values of the McNemar test.

treebank | Centroid (Ψ) | npPMI (Φ) | p-value
EWT | 86.66 | 91.81 | 7e-193
GUM | 89.58 | 92.93 | 2e-63
LinES | 91.24 | 94.64 | 1e-87
ParTUT | 90.73 | 92.99 | 4e-7

Table 2: Comparison of the adaptation of the LMMS approach and ours on POS tagging over the test sets of four English UD v2.5 treebanks. The last column contains the p-value of the McNemar test comparing the different behavior of the two approaches.
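To make the adaptation concrete, the only change with respect to the WSD setup is the label inventory that enters Eq. (2), as sketched below. The CoNLL-U parsing is simplified, the file name is a placeholder, and the resulting label list simply replaces the sense annotations fed to the earlier illustrative sketches (so Φ and Ψ end up with 17 columns).

# The 17 universal POS tags of UD replace the WordNet synsets as the label set.
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
TAG2ID = {tag: i for i, tag in enumerate(UPOS)}

def read_conllu_labels(path):
    """Return the per-token UPOS label ids of a UD treebank file."""
    labels = []
    for line in open(path, encoding="utf-8"):
        if line.strip() and not line.startswith("#"):
            columns = line.rstrip("\n").split("\t")
            if columns[0].isdigit():                   # skip multiword-token and empty-node lines
                labels.append(TAG2ID.get(columns[3]))  # column 4 of CoNLL-U holds the UPOS tag
    return labels

labels = read_conllu_labels("en_ewt-ud-train.conllu")  # placeholder treebank file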

5 Conclusions

In this paper we investigated how the application of sparse word representations obtained from contextualized word embeddings can provide a substantially increased ability for solving problems that require the distinction of fine-grained word senses. In our experiments, we managed to obtain solid results for multiple fine-grained word sense disambiguation benchmarks with the help of our information theory-inspired algorithm. We additionally carefully investigated the effects of increasing the amount of sense-annotated training data and the different design choices we made. We also demonstrated the general applicability of our approach by evaluating it in POS tagging. Our source code is made available at https://github.com/begab/sparsity_makes_sense.

Acknowledgments

This work was in part supported by the National Research, Development and Innovation Office of Hungary through the Artificial Intelligence National Excellence Program (grant no.: 2018-1.2.1-NKP-2018-00008).

References

Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 33–41, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lars Ahrenberg. 2007. LinES: An English-Swedish parallel treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007), pages 270–273, Tartu, Estonia. University of Tartu, Estonia.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Gábor Berend. 2017. Sparse coding of neural word embeddings for multilingual sequence labeling. Transactions of the Association for Computational Linguistics, 5:247–261.

Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

G. Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. In From Form to Meaning: Processing Texts Automatically, Proceedings of the Biennial GSCL Conference 2009, pages 31–40, Tübingen.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL '01, pages 1–5, Stroudsburg, PA, USA. Association for Computational Linguistics.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1491–1500, Beijing, China. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3509–3514, Hong Kong, China. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 897–907, Berlin, Germany. Association for Computational Linguistics.

Mikael Kågebäck and Hans Salomonsson. 2016. Word sense disambiguation using a bidirectional LSTM. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 51–56, Osaka, Japan. The COLING 2016 Organizing Committee.

Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha Talukdar. 2019. Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5670–5681, Florence, Italy. Association for Computational Linguistics.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, SIGDOC '86, pages 24–26, New York, NY, USA. ACM.

Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving some sense into BERT. CoRR, abs/1908.05646.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. 2009. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 689–696, New York, NY, USA. ACM.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6294–6305. Curran Associates, Inc.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Germany. Association for Computational Linguistics.

Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 25–28, Barcelona, Spain. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

Andrea Moro and Roberto Navigli. 2015. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 288–297, Denver, Colorado. Association for Computational Linguistics.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950, Mumbai, India. The COLING 2012 Organizing Committee.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10:1–10:69.

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 222–231, Atlanta, Georgia, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer S. Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 87–92, Stroudsburg, PA, USA. Association for Computational Linguistics.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017a. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain. Association for Computational Linguistics.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167, Copenhagen, Denmark. Association for Computational Linguistics.

Philip Resnik. 1997a. A perspective on word sense disambiguation methods and their evaluation. In Tagging Text with Lexical Semantics: Why, What, and How?

Philip Resnik. 1997b. Selectional preference and sense disambiguation. In Tagging Text with Lexical Semantics: Why, What, and How?

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University parallel treebank. In Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, pages 51–69.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard H. Hovy. 2018. SPINE: sparse interpretable neural embeddings. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th inno- vative Applications of Artificial Intelligence (IAAI- 18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 4921–4928.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - A fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, abs/1511.06388.

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2018. UFSAC: Unification of sense annotated corpora and tools. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation. In Global Wordnet Conference, Wrocław, Poland.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83, Uppsala, Sweden. Association for Computational Linguistics.
