Embeddings for the Lexicon: Modelling and Representation

Christian Chiarcos 1, Thierry Declerck 2, Maxim Ionov 1
1 Applied Computational Linguistics Lab, Goethe University, Frankfurt am Main, Germany
2 DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany
{chiarcos,ionov}@informatik.uni-frankfurt.de, [email protected]

Abstract

In this paper, we propose an RDF model for representing embeddings for elements in lexical knowledge graphs (e.g. words and concepts), harmonizing two major paradigms in computational semantics on the level of representation, i.e., distributional semantics (using numerical vector representations) and lexical semantics (employing lexical knowledge graphs). Our model is a part of a larger effort to extend a community standard for representing lexical resources, OntoLex-Lemon, with the means to connect it to corpus data. By allowing a conjoint modelling of embeddings and lexical knowledge graphs as its part, we hope to contribute to the further consolidation of the field, its methodologies and the accessibility of the respective resources.

1 Background

With the rise of word and concept embeddings, lexical units at all levels of granularity have been subject to various approaches to embed them into numerical feature spaces, giving rise to a myriad of datasets with pre-trained embeddings generated with different algorithms that can be freely used in various NLP applications. With this paper, we present the current state of an effort to connect these embeddings with lexical knowledge graphs. This effort is part of an extension of a widely used community standard for representing, linking and publishing lexical resources on the web, OntoLex-Lemon.¹ Our work aims to complement the emerging OntoLex module for representing Frequency, Attestation and Corpus Information (FrAC) which is currently being developed by the W3C Community Group "Ontology-Lexica", as presented in [4]. There we addressed only frequency and attestations, whereas core aspects of corpus-based information such as embeddings were identified as a topic for future developments. Here, we describe possible use cases for the latter and present our current model for this.

¹ https://www.w3.org/2016/05/ontolex/

2 Sharing Embeddings on the Web

Although word embeddings are often calculated on the fly, the community recognizes the importance of pre-trained embeddings, as these are readily available (it saves time) and cover large quantities of text (their replication would be energy- and time-intense). Finally, a benefit of re-using embeddings is that they can be grounded in a well-defined and, possibly, shared feature space, whereas locally built embeddings (whose creation involves an element of randomness) would reside in an isolated feature space. This is particularly important in the context of multilingual applications, where collections of embeddings are in a single feature space (e.g., in MUSE [6]).

Sharing embeddings, especially if calculated over a large amount of data, is not only an economical and ecological requirement, but unavoidable in many cases. For these, we suggest to apply established web standards to provide metadata about the lexical component of such data. We are not aware of any provider of pre-trained word embeddings which come with machine-readable metadata. One such example is language information: while ISO-639 codes are sometimes being used for this purpose, they are not given in a machine-readable way, but rather documented in human-readable form or given implicitly as part of file names.²

² See the ISO-639-1 code 'en' in FastText/MUSE files such as https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.en.vec.
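To illustrate what such machine-readable metadata could look like, a shared embedding collection could be described with standard Dublin Core and DCAT vocabulary along the following lines. This is a minimal sketch: the dataset URI, the language URI and the choice of properties are illustrative and not taken from any existing distribution; prefix declarations are omitted here and in the following listings.

    <https://example.org/embeddings/wiki.multi.en.vec> a dcat:Dataset ;
        dct:title "Multilingual word embeddings, English portion"@en ;
        # machine-readable language information instead of a code hidden in the file name
        dct:language <http://lexvo.org/id/iso639-3/eng> ;
        dct:source <https://dumps.wikimedia.org/enwiki/> ;
        dct:description "fastText word vectors projected into a shared multilingual space"@en ;
        dct:extent "300"^^xsd:int .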
2.1 Concept Embeddings

It is to be noted, however, that our focus is not so much on word embeddings, since lexical information in this context is apparently trivial – plain tokens without any lexical information do not seem to require a structured approach to lexical semantics. This changes drastically for embeddings of more abstract lexical entities, e.g., word senses or concepts [17], that need to be synchronized between the embedding store and the lexical knowledge graph by which they are defined. WordNet [14] synset identifiers are a notorious example for the instability of concepts between different versions: Synset 00019837-a means 'incapable of being put up with' in WordNet 1.71, but 'established by authority' in version 2.1. In WordNet 3.0, the first synset has the ID 02435671-a, the second 00179035-s.³

³ See https://github.com/globalwordnet/ili

The precompiled synset embeddings provided with AutoExtend [17] illustrate the consequences: The synset IDs seem to refer to WordNet 2.1 (wn-2.1-00001742-n), but use an ad hoc notation and are not in a machine-readable format. More importantly, however, there is no synset 00001742-n in Princeton WordNet 2.1. This ID does exist in Princeton WordNet 1.71, and in fact, it seems that the authors actually used this edition instead of the version 2.1 they refer to in their paper. Machine-readable metadata would not prevent such mistakes, but it would facilitate their verifiability. Given the current conventions in the field, such mistakes will very likely go unnoticed, thus leading to highly unexpected results in applications developed on this basis.

Our suggestion here is to use resolvable URIs as concept identifiers: if these provide machine-readable lexical data, lexical information about concept embeddings can be more easily verified and (this is another application) integrated with predictions from distributional semantics. Indeed, the WordNet community has adopted OntoLex-Lemon as an RDF-based representation schema and developed the Collaborative Interlingual Index (ILI or CILI) [3] to establish sense mappings across a large number of wordnets. Reference to ILI URIs would allow to retrieve the lexical information behind a particular concept embedding, as the respective WordNet can be queried for the lexemes this concept (synset) is associated with. A versioning mismatch can then be easily spotted by comparing the cosine distance between the word embeddings of these lexemes and the embedding of the concept presumably derived from them.
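As a sketch of the lexical information that such a resolvable concept identifier would expose, consider the following; the URIs and the use of skos:exactMatch for the interlingual link are placeholders, not the actual identifiers used by CILI or by any WordNet RDF edition:

    ex:concept-02435671-a a ontolex:LexicalConcept ;
        skos:exactMatch <https://example.org/ili/i00000> ;   # placeholder for a CILI identifier
        ontolex:isEvokedBy ex:intolerable_a .

    ex:intolerable_a a ontolex:LexicalEntry ;
        ontolex:canonicalForm [ ontolex:writtenRep "intolerable"@en ] .

Given such data, the word embeddings of the evoking lexemes can be compared against the concept embedding to detect the versioning mismatches discussed above.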
2.2 Organizing Contextualized Embeddings

A related challenge is the organization of contextualized embeddings that are not adequately identified by a string (say, a word form), but only by that string in a particular context. By providing a reference vocabulary to organize contextualized embeddings together with the respective context, this challenge will be overcome, as well.

As a concrete use case of such information, consider a classical approach to Word Sense Disambiguation: The Lesk algorithm [12] uses a bag-of-words model to assess the similarity of a word in a given context (to be classified) with example sentences or definitions provided for different senses of that word in a dictionary (training data, provided, for example, by resources such as WordNet, thesauri, or conventional dictionaries). The approach was very influential, although it always suffered from data sparsity issues, as it relied on string matches between a very limited collection of reference words and context words. With distributional semantics and word embeddings, such sparsity issues are easily overcome, as literal matches are no longer required.

Indeed, more recent methods such as BERT allow us to induce contextualized word embeddings [20]. So, given only a single example for a particular word sense, this would be a basis for a more informed word sense disambiguation. With more than one example per word sense, this becomes a little bit more difficult, as these now need to be either aggregated into a single sense vector, or linked to the particular word sense they pertain to (which will require a set of vectors as a data structure rather than a single vector). For data of the second type, we suggest to follow existing models for representing attestations (glosses, examples) for lexical entities, and to represent contextualized embeddings (together with their context) as properties of attestations.

3 Addressing the Modelling Challenge

In order to meet these requirements, we propose a formalism that allows the conjoint publication of lexical knowledge graphs and embedding stores. We assume that future approaches to access and publish embeddings on the web will be largely in line with high-level methods that abstract from details of the actual format and provide a conceptual view on the data set as a whole, as provided, for example, by the torchtext package of PyTorch.⁴ There is no need to limit ourselves to the currently dominating tabular structure of commonly used formats to exchange embeddings; access to (and the complexity of) the actual data will be hidden from the user.

⁴ https://pytorch.org/text/

A promising framework for the conjoint publication of embeddings and lexical knowledge graphs is, for example, RDF-HDT [9], a compact binary serialization of RDF graphs and the literal values these comprise. We would assume it to be integrated into programming workflows in a way similar to program-specific binary formats such as NumPy arrays in pickle [8].

3.1 Modelling Lexical Resources

Formalisms to represent lexical resources are manifold and have been a topic of discussion within the language resource community for decades, with standards such as LMF [10] or TEI-Dict [19] designed for the electronic edition of and/or search in individual dictionaries. Lexical data, however, does not exist in isolation, and synergies can be unleashed if information from different dictionaries is combined, e.g., for bootstrapping new bilingual dictionaries for languages X and Z by using another language Y and existing dictionaries for X ↦ Y and Y ↦ Z as a pivot.

Information integration beyond the level of individual dictionaries has thus become an important concern in the language resource community. One way to achieve this is to represent this data as a knowledge graph. The primary community standard for publishing lexical knowledge graphs on the web is OntoLex-Lemon [5].

[Figure 1: OntoLex-Lemon core model]

OntoLex-Lemon defines four main classes of lexical entities, i.e., concepts in a lexical resource: (1) LexicalEntry, the representation of a lexeme in a lexical knowledge graph, which groups together the form(s) and sense(s), resp. concept(s), of a particular expression; (2) (lexical) Form, a written representation of a particular lexical entry, with (optional) grammatical information; (3) LexicalSense, a word sense of one particular lexical entry; (4) LexicalConcept, an element of meaning with different lexicalizations, e.g., a WordNet synset.
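For illustration, a minimal lexical entry instantiating these four classes could look as follows; the URIs in the ex: namespace are placeholders, and only OntoLex core vocabulary is used:

    ex:play_n a ontolex:LexicalEntry ;
        ontolex:canonicalForm [ a ontolex:Form ; ontolex:writtenRep "play"@en ] ;
        ontolex:sense ex:play_n-sense1 ;
        ontolex:evokes ex:play-drama .

    ex:play_n-sense1 a ontolex:LexicalSense ;
        ontolex:isLexicalizedSenseOf ex:play-drama .

    ex:play-drama a ontolex:LexicalConcept .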
As the dominant vocabulary for lexical knowledge graphs on the web of data, the OntoLex-Lemon model has found wide adoption beyond its original focus on ontology lexicalization. In the WordNet Collaborative Interlingual Index [3] mentioned before, OntoLex vocabulary is used to provide a single interlingual identifier for every concept in every WordNet language, as well as machine-readable information about it (including links with various languages).

To broaden the spectrum of possible usages of the core model, various extensions have been developed by the community effort. This includes the emerging module for frequency, attestation and corpus-based information, FrAC.

The vocabulary elements introduced with FrAC cover frequency and attestations, with other aspects of corpus-based information described as prospective developments in our previous work [4]. Notable aspects of corpus information to be covered in such a module for the purpose of lexicographic applications include contextual similarity, co-occurrence information and embeddings.

Because of the enormous technical relevance of the latter in language technology and AI, and because of the requirements identified above for the publication of embedding information over the web, this paper focuses on embeddings.

3.2 OntoLex-FrAC

FrAC aims to (1) extend OntoLex with corpus information to address challenges in lexicography, (2) model lexical and distributional-semantic resources (dictionaries, embeddings) as RDF graphs, and (3) provide an abstract model of relevant concepts in distributional semantics that facilitates applications that integrate both lexical and distributional information. Figure 2 illustrates the FrAC concepts and properties for frequency and attestations (gray), along with new additions (blue) for embeddings as a major form of corpus-based information.

[Figure 2: OntoLex-FrAC overview: extensions for embeddings highlighted in blue]

Prior to the introduction of embeddings, the main classes of FrAC were (1) Observable, any unit within a lexical resource about which information can be found in a corpus; this includes all main classes of the OntoLex core: lexical forms, lexical entries, lexical senses and concepts; (2) Corpus, a collection of linguistic data from which empirical evidence can be derived, including corpora in the sense as understood in language technology; (3) Attestation, an example for a specific phenomenon, usage or form found in a corpus or in the literature; (4) CorpusFrequency, the absolute frequency of a particular observation in a given corpus.
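The interplay of these classes can be sketched as follows. This is a sketch based on the FrAC vocabulary presented in [4]: the quotation, the frequency value and the corpus URI are invented, and the property frac:frequency is assumed from that work rather than quoted from the current module.

    ex:play_n a ontolex:LexicalEntry ;
        frac:frequency [
            a frac:CorpusFrequency ;
            rdf:value "12672"^^xsd:int ;                   # invented absolute frequency
            frac:corpus <https://example.org/corpora/some-corpus> ] ;
        frac:attestation [
            a frac:Attestation ;
            frac:quotation "they went to a play last night" ;
            frac:corpus <https://example.org/corpora/some-corpus> ] .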
4 Modelling Embeddings in FrAC

In the context of OntoLex, word embeddings are to be seen as a feature of individual lexical forms. However, in many cases, word embeddings are not calculated from plain strings, but from normalized strings, e.g., lemmatized text. For such data, we model every individual lemma as an ontolex:LexicalEntry. Moreover, as argued in Sec. 2, embeddings are equally relevant for lexical senses and lexical concepts; the embedding property that associates a lexical entity with an embedding is thus applicable to every Observable.

4.1 Word Embeddings

Pre-trained word embeddings are often distributed as text files consisting of the label (token) and a sequence of whitespace-separated numbers, e.g. the entry for the word frak from the GloVe embeddings [15]:

    frak 0.015246 -0.30472 0.68107 ...

Since our focus is on publishing and sharing embeddings, we propose to provide the value of an embedding as a literal rdf:value. If necessary, more elaborate representations, e.g., using rdf:List, may subsequently be generated from these literals.

A natural and effortless modelling choice is to represent embedding values as string literals with whitespace-separated numbers. For decoding and verification, such a representation benefits from metadata about the length of the vector. For a fixed-size vector, this should be provided by dc:extent. An alternative is an encoding as a JSON list. In order to support both structured and string literals, FrAC does not restrict the type of the rdf:value of embeddings.

Lexicalized embeddings should be published together with their metadata, at least the procedure/method (dct:description with free text, e.g., "CBOW", "SKIP-GRAM", "collocation counts"), the data basis (frac:corpus), and the dimensionality (dct:extent).

We thus introduce the concept embedding, with a designated subclass for fixed-size vectors:

Embedding (Class) is a representation of a given Observable in a numerical feature space. It is defined by the methodology used for creating it (dct:description) and the URI of the corpus or language resource from which it was created (corpus). The literal value of an Embedding is provided by rdf:value.

    Embedding ⊑ rdf:value exactly 1 ⊓ corpus exactly 1 ⊓ dct:description min 1

embedding (ObjectProperty) is a relation that maps an Observable into a numerical feature space. An embedding is a structure-preserving mapping in the sense that it encodes and preserves contextual features of a particular Observable (or an aggregation over all its attestations) in a particular corpus.

FixedSizeVector (Class) is an Embedding that represents a particular Observable as a list of numerical values in a k-dimensional feature space. The property dc:extent defines k.

For the GloVe example, a lemma (lexical entry) embedding can be represented as follows:

    :frak a ontolex:LexicalEntry ;
        ontolex:canonicalForm/ontolex:writtenRep "frak"@en ;
        frac:embedding [
            a frac:FixedSizeVector ;
            rdf:value "0.015246 -0.30472 0.68107 ..." ;
            dct:source <...> ;
            dct:extent "50"^^xsd:int ;
            dct:description "GloVe v.1.1, ..."@en ] .
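Since the embedding property applies to every Observable, the same pattern carries over to sense- and concept-level embeddings, e.g. for precompiled synset embeddings in the spirit of AutoExtend. In the following sketch, the concept URI, the corpus URI and all values are placeholders:

    ex:synset-02435671-a a ontolex:LexicalConcept ;
        frac:embedding [
            a frac:FixedSizeVector ;
            rdf:value "0.0213 -0.1142 0.0871 ..." ;        # invented values
            frac:corpus <https://example.org/corpora/autoextend-input> ;
            dct:extent "300"^^xsd:int ;
            dct:description "synset embedding derived from pre-trained word embeddings"@en ] .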
:WN31FixedSizeVector, the BERT em- 4.2 Contextualized Embeddings bedding in this example becomes much more di- gestable – without any loss in information: Above, we mentioned contextualized embeddings, and more recent methods such as ELMo [16], Flair wn31−bert:07032045 −n−1 a :WN31FixedSizeVector; NLP [1], or BERT [7] have been shown to be re- rdf:value ”0.327246 0.48170 ...”. markably effective at many NLP problems. In the context of , contextual 4.3 Other Types of Embeddings embeddings can prove beneficial for inducing or FrAC is not restricted to uses in language technol- distinguishing word senses, and in extension of ogy; its definition of ‘embedding’ is thus broader the classical Lesk algorithm, for example, a lexi- than the conventional interpretation of the term cal sense can be described by means of the con- in NLP, and based on its more general usage textualized word embeddings for the examples across multiple disciplines. In mathematics, the associated with that particular lexical sense, and embedding f of an object X in another object Y words for which word sense disambiguation is to (f : X → Y ) is defined as an injective, structure- be performed can then just be compared with these. preserving map (morphism). Important in the con- These examples then serve a similar function as text of FrAC is the structure-preserving character attestations in a , and indeed, the link has of the projection into numerical feature spaces, as been emphasized before [11]. Within FrAC, con- embeddings are most typically used to assess simi- textualized embeddings are thus naturally modelled larity, e.g., by means of cosine measure, the entire as a property of Attestation. point being that these assessments remain stable when cooccurrence vectors are projected into a instanceEmbedding (ObjectProperty) for lower-dimensional space. a given Attestation. The property In computational lexicography, raw collocation instanceEmbedding provides an embedding counts continue to be used for the same purpose. In of the example in its current corpus context into a its classical form, these collocation counts get ag- numerical feature space (see Embedding). gregated into bags of words, sometimes augmented In this modelling, multiple contextualized em- with numerical weights. Bags of words are closely beddings can be associated with, say, a lexical related to raw collocation vectors (and thus, pro- sense by means of an attestation that then points to totypical embeddings), with the main difference the actual string context. Considering play, multi- being that collocation vectors consider only a finite ple WordNet examples (glosses) per sense can thus set of diagnostic collocates (the reference dictio- be rendered by different fixed-size vectors: nary), whereas a bag of words can include any wn31 : p l a y n a ontolex:LexicalEntry; word that occurs in the immediate context of the ontolex : sense wn31:07032045 −n , target expression. wn31 : p l a y n 4 , . . . wn31:07032045 −n In this understanding, a bag of words can be a ontolex:LexicalSense; considered as an infinite-dimensional collocation frac:attestation [ vector, i.e., as an embedding in the broad sense frac:quotation ”the play lasted two hours”; introduced above. The practical motivation is that frac : locus wn31:07032045 −n ; applications of bag-of-words models and fixed-size frac :instanceEmbedding vectors are similar, and that bags of words remain wn31−bert:07032045 −n−1 ]. 
4.3 Other Types of Embeddings

FrAC is not restricted to uses in language technology; its definition of 'embedding' is thus broader than the conventional interpretation of the term in NLP, and based on its more general usage across multiple disciplines. In mathematics, the embedding f of an object X in another object Y (f : X → Y) is defined as an injective, structure-preserving map (morphism). Important in the context of FrAC is the structure-preserving character of the projection into numerical feature spaces, as embeddings are most typically used to assess similarity, e.g., by means of the cosine measure – the entire point being that these assessments remain stable when cooccurrence vectors are projected into a lower-dimensional space.

In computational lexicography, raw collocation counts continue to be used for the same purpose. In their classical form, these collocation counts get aggregated into bags of words, sometimes augmented with numerical weights. Bags of words are closely related to raw collocation vectors (and thus, prototypical embeddings), with the main difference being that collocation vectors consider only a finite set of diagnostic collocates (the reference dictionary), whereas a bag of words can include any word that occurs in the immediate context of the target expression.

In this understanding, a bag of words can be considered as an infinite-dimensional collocation vector, i.e., as an embedding in the broad sense introduced above. The practical motivation is that applications of bag-of-words models and fixed-size vectors are similar, and that bags of words remain practically important to a significant group of OntoLex or FrAC users.

A difference is that a bag-of-words model, if it provides frequency information or another form of numerical scores, must not be modelled by a plain string literal; as a structure for this, we recommend JSON literals. To represent this model, we introduce the following Embedding subclass:

BagOfWords (Class) is an Embedding that represents an Observable by a set of collocates or their mapping to numerical scores.
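A bag of words with collocate scores could then be encoded along the following lines. This is a sketch: the collocates and scores are invented, and the JSON content is given here as a plain string literal; a dedicated JSON datatype could be used instead.

    ex:play_n-bow a frac:BagOfWords ;
        rdf:value "{\"stage\": 0.82, \"theatre\": 0.77, \"actor\": 0.64}" ;
        frac:corpus <https://example.org/corpora/some-corpus> ;
        dct:description "weighted collocates, window size 5"@en .

    ex:play_n frac:embedding ex:play_n-bow .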
Another type of embeddings concerns sequential data; one example for that are multimodal corpora. In a case study with Leiden University, we explored the encoding of dictionaries for African sign languages. In addition to graphics and videos, such dictionaries can also contain sensor data that records the position of hands and digits during the course of a sign or a full conversation. To search in this data, conventional techniques like dynamic time warping [2] are being applied – effectively in analogy with cosine distance among finite-size vectors. Furthermore, such data has also been addressed in NLP research in the context of neural time series transformation for the sake of translation between or from sign languages [18]. Also, in an NLP context, time series analysis is relevant for stream processing, so this would make this data structure of more general interest [13] in and beyond the language technology community.

Both from a modelling perspective and in terms of actual uses of such forms of lexical data, it is thus appealing to extend the concept of embeddings to time series data. In our current use case, we assume that time series data is mostly relevant in relation to attestations rather than OntoLex core concepts, but we can foresee that generalizations over multiple attestations could be developed at some point, which would then be used to extrapolate prototypical time series information for canonical forms, lexical entries, senses, or concepts. To represent this, we introduce another Embedding subclass:

TimeSeries (Class) is an Embedding that represents an Observable or its Attestation as a sequence of a fixed number of data points recorded over a certain period of time.

The dc:extent of a time series specifies the number of data points per observation. The rdf:value should be a structured JSON literal.
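For the sign language use case, an attestation-level time series could be sketched as follows; the entry, the quotation, the sampling details and all values are invented for illustration:

    ex:sign-house a ontolex:LexicalEntry ;
        frac:attestation [
            frac:quotation "video recording of the sign HOUSE, signer A" ;
            frac:instanceEmbedding [
                a frac:TimeSeries ;
                dc:extent "120"^^xsd:int ;                 # e.g. 2 seconds sampled at 60 Hz
                rdf:value "[[0.12,0.40,0.33],[0.14,0.38,0.31], ...]" ;
                dct:description "right-hand x/y/z sensor positions"@en ;
                frac:corpus <...> ] ] .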
5 Summary and Outlook

We identified a number of shortcomings in the (lack of) standards applied by providers of pre-trained embeddings on the web, and we aim to address them by providing a vocabulary that covers the relevant aspects of lexical data involved in this context, grounded in existing specifications for lexical metadata and publication forms for lexical knowledge graphs on the web of data. We provide an RDF vocabulary for encoding numerical, corpus-based information, in particular embeddings such as those prevalent in language technology, in conjunction with, or as a part of, lexical knowledge graphs. We cover a broad range of applications from NLP to computational lexicography, and a broad range of data structures that fall under a generalized understanding of embeddings.

Notable features of our proposal include (1) the coverage of a broad bandwidth of use cases, (2) its firm grounding in commonly used community standards for lexical knowledge graphs, (3) the possibility to provide machine-readable metadata and machine-readable lexical data along with, and even in close integration with, word, lexeme, sense or concept embeddings, (4) the extension beyond the fixed-size vectors currently dominating in language technology applications, and (5) a designated vocabulary for organizing contextualized embeddings with explicit links to both their respective context and the lexical entities they refer to.

At the time of writing, one dataset (AutoExtend) has been successfully transformed from its traditional publication form and integrated with a lexical knowledge graph. However, we are still in the process of evaluating which WordNet edition the AutoExtend data actually refers to. Another dataset currently under construction is a dictionary of African sign languages, where time series information is being encoded as attestation-level embeddings.

We see our work as a building block for the development of convergences between the Linguistic Linked Open Data community and the NLP/ML community, as conjoint modelling – or, at least, compatible levels of representation – allows both technology stacks to be combined and applied to common problems. For the NLP/ML community, machine-readable metadata and adherence to established community standards will facilitate the potential for verifying, enriching and building embeddings with or against lexical resources. For computational lexicography, closer ties with the language technology community will facilitate the uptake of machine-learning methods and their evaluation, and intensify synergies between both fields.

Acknowledgements

The research presented in this paper was in part supported by the European Horizon 2020 Prêt-à-LLOD project (grant agreement no. 825182), by the project "Linked Open Dictionaries" (LiODi), funded by the German Ministry of Education and Science (BMBF), and by the COST Action CA18209 "European network for Web-centred linguistic data science" (NexusLinguarum).

References

[1] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

[2] Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370, Seattle, WA, USA, 1994.

[3] Francis Bond, Piek Vossen, John P. McCrae, and Christiane Fellbaum. CILI: the Collaborative Interlingual Index. In Proceedings of the Global WordNet Conference, volume 2016, 2016.

[4] Christian Chiarcos, Maxim Ionov, Jesse de Does, Katrien Depuydt, Fahad Khan, Sander Stolk, Thierry Declerck, and John Philip McCrae. Modelling frequency and attestations for OntoLex-Lemon. In Proceedings of the 2020 Globalex Workshop on Linked Lexicography, pages 1–9, 2020.

[5] Philipp Cimiano, John P. McCrae, and Paul Buitelaar. Lexicon Model for Ontologies: Community Report, 2016.

[6] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[8] Laurent Fasnacht. mmappickle: Python 3 module to store memory-mapped numpy array in pickle format. Journal of Open Source Software, 3(26):651, 2018.

[9] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. Binary RDF representation for publication and exchange (HDT). Journal of Web Semantics, 19:22–41, 2013.

[10] Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, and Claudia Soria. Lexical Markup Framework (LMF). In International Conference on Language Resources and Evaluation (LREC 2006), page 5, 2006.

[11] Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. GlossBERT: BERT for word sense disambiguation with gloss knowledge. arXiv preprint arXiv:1908.07245, 2019.

[12] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26, 1986.

[13] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 2–11. ACM Press, 2003.

[14] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[15] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[16] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

[17] Sascha Rothe and Hinrich Schütze. AutoExtend: Combining word embeddings with semantic resources. Computational Linguistics, 43(3):593–617, 2017.

[18] S. S. Kumar, T. Wangyal, V. Saboo, and R. Srinath. Time series neural networks for real time sign language translation. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 243–248, 2018.

[19] TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Technical report, 2019.

[20] Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. arXiv preprint arXiv:1909.10430, 2019.