Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos1, Thierry Declerck2, Maxim Ionov1
1 Applied Computational Linguistics Lab, Goethe University, Frankfurt am Main, Germany
2 DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany
{chiarcos,ionov}@informatik.uni-frankfurt.de, [email protected]
Abstract

In this paper, we propose an RDF model for representing embeddings for elements in lexical knowledge graphs (e.g. words and concepts), harmonizing two major paradigms in computational semantics on the level of representation, i.e., distributional semantics (using numerical vector representations) and lexical semantics (employing lexical knowledge graphs). Our model is part of a larger effort to extend a community standard for representing lexical resources, OntoLex-Lemon, with the means to connect it to corpus data. By allowing a conjoint modelling of embeddings and lexical knowledge graphs, we hope to contribute to the further consolidation of the field, its methodologies and the accessibility of the respective resources.

1 Background

With the rise of word and concept embeddings, lexical units at all levels of granularity have been subject to various approaches that embed them into numerical feature spaces, giving rise to a myriad of datasets with pre-trained embeddings generated with different algorithms that can be freely used in various NLP applications. With this paper, we present the current state of an effort to connect these embeddings with lexical knowledge graphs. This effort is part of an extension of a widely used community standard for representing, linking and publishing lexical resources on the web, OntoLex-Lemon.1 Our work aims to complement the emerging OntoLex module for representing Frequency, Attestation and Corpus Information (FrAC), which is currently being developed by the W3C Community Group "Ontology-Lexica", as presented in [4]. There, we addressed only frequency and attestations, whereas core aspects of corpus-based information such as embeddings were identified as a topic for future developments. Here, we describe possible use cases for the latter and present our current model.

1 https://www.w3.org/2016/05/ontolex/

2 Sharing Embeddings on the Web

Although word embeddings are often calculated on the fly, the community recognizes the importance of pre-trained embeddings: they are readily available (which saves time), and they cover large quantities of text (so that their replication would be energy- and time-intensive). Moreover, a benefit of re-using embeddings is that they can be grounded in a well-defined and possibly shared feature space, whereas locally built embeddings (whose creation involves an element of randomness) would reside in an isolated feature space. This is particularly important in the context of multilingual applications, where collections of embeddings share a single feature space (e.g., in MUSE [6]).

Sharing embeddings, especially if calculated over a large amount of data, is thus not only an economical and ecological requirement, but unavoidable in many cases. For these, we suggest applying established web standards to provide metadata about the lexical component of such data. We are not aware of any provider of pre-trained word embeddings that come with machine-readable metadata. One example is language information: while ISO 639 codes are sometimes used for this purpose, they are not given in a machine-readable way, but rather documented in human-readable form or given implicitly as part of file names.2

2 See the ISO 639-1 code 'en' in FastText/MUSE files such as https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.en.vec.

2.1 Concept Embeddings

It is to be noted, however, that our focus is not so much on word embeddings, since lexical information in this context is apparently trivial: plain tokens without any lexical information do not seem to require a structured approach to lexical semantics. This changes drastically for embeddings of more abstract lexical entities, e.g., word senses or concepts [17], which need to be synchronized between the embedding store and the lexical knowledge graph by which they are defined. WordNet [14] synset identifiers are a notorious example of the instability of concepts between different versions: synset 00019837-a means 'incapable of being put up with' in WordNet 1.71, but 'established by authority' in version 2.1. In WordNet 3.0, the first synset has the ID 02435671-a, the second 00179035-s.3

The precompiled synset embeddings provided with AutoExtend [17] illustrate the consequences: the synset IDs seem to refer to WordNet 2.1 (wn-2.1-00001742-n), but use an ad hoc notation and are not in a machine-readable format. More importantly, however, there is no synset 00001742-n in Princeton WordNet 2.1. This synset does exist in Princeton WordNet 1.71, and in fact, it seems that the authors actually used this edition instead of the version 2.1 they refer to in their paper. Machine-readable metadata would not prevent such mistakes, but it would facilitate their verifiability. Given the current conventions in the field, such mistakes will very likely go unnoticed, leading to highly unexpected results in applications developed on this basis. Our suggestion is to use resolvable URIs as concept identifiers: if these provide machine-readable lexical data, lexical information about concept embeddings can be more easily verified and (as another application) integrated with predictions from distributional semantics. Indeed, the WordNet community has adopted OntoLex-Lemon as an RDF-based representation schema and developed the Collaborative Interlingual Index (ILI or CILI) [3] to establish sense mappings across a large number of WordNets.

2.2 Organizing Contextualized Embeddings

A related challenge is the organization of contextualized embeddings, which are not adequately identified by a string (say, a word form), but only by that string in a particular context. By providing a reference vocabulary to organize contextualized embeddings together with their respective context, this challenge can be overcome as well.

As a concrete use case for such information, consider a classical approach to word sense disambiguation: the Lesk algorithm [12] uses a bag-of-words model to assess the similarity of a word in a given context (to be classified) with example sentences or definitions provided for the different senses of that word in a lexical resource (training data, provided, for example, by resources such as WordNet, thesauri, or conventional dictionaries). The approach was very influential, although it always suffered from data sparsity issues, as it relied on string matches between a very limited collection of reference words and the context words. With distributional semantics and word embeddings, such sparsity issues are easily overcome, as literal matches are no longer required.

Indeed, more recent methods such as BERT allow us to induce contextualized word embeddings [20]. Given only a single example for a particular word sense, this provides a basis for a more informed word sense disambiguation. With more than one example per word sense, this becomes slightly more difficult, as the examples now need to be either aggregated into a single sense vector, or linked to the particular word sense they pertain to (which requires a set of vectors as a data structure rather than a single vector). For data of the second type, we suggest following existing models for representing attestations (glosses, examples) for lexical entities, and representing contextualized embeddings (together with their context) as properties of attestations.
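The aggregation strategy just described can be sketched in a few lines. The following is a minimal illustration, not part of the FrAC proposal: it assumes that contextualized example vectors (e.g., from BERT) are already available, and the toy three-dimensional vectors and the sense labels bank-1/bank-2 are hypothetical stand-ins.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sense_vector(example_vectors):
    """Aggregate the contextualized vectors of attested examples
    into a single sense vector by component-wise averaging."""
    k = len(example_vectors[0])
    n = len(example_vectors)
    return [sum(v[i] for v in example_vectors) / n for i in range(k)]

def disambiguate(context_vector, senses):
    """Return the sense ID whose aggregated vector is most similar
    to the contextualized vector of the target occurrence."""
    return max(senses, key=lambda s: cosine(context_vector, senses[s]))

# Toy 3-dimensional vectors standing in for BERT-style embeddings:
senses = {
    "bank-1 (finance)":   sense_vector([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "bank-2 (riverside)": sense_vector([[0.1, 0.9, 0.3], [0.0, 0.8, 0.4]]),
}
print(disambiguate([0.85, 0.15, 0.05], senses))  # → bank-1 (finance)
```

Keeping the per-example vectors instead of (or in addition to) the averaged sense vector corresponds to the second option above, with each vector attached to the attestation it was derived from.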
Returning to concept embeddings: reference to ILI URIs would allow us to retrieve the lexical information behind a particular concept embedding, as WordNet can be queried for the lexemes this concept (synset) is associated with. A versioning mismatch can then be easily spotted by comparing the cosine distance between the word embeddings of these lexemes and the embedding of the concept presumably derived from them.

3 See https://github.com/globalwordnet/ili

3 Addressing the Modelling Challenge

In order to meet these requirements, we propose a formalism that allows the conjoint publication of lexical knowledge graphs and embedding stores. We assume that future approaches to access and publish embeddings on the web will largely be in line with high-level methods that abstract from details of the actual format and provide a conceptual view on the data set as a whole, as provided, for example, by the torchtext package of PyTorch.4 There is no need to limit ourselves to the currently dominating tabular structure of the formats commonly used to exchange embeddings; access to (and the complexity of) the actual data will be hidden from the user.

4 https://pytorch.org/text/

A promising framework for the conjoint publication of embeddings and lexical knowledge graphs is, for example, RDF-HDT [9], a compact binary serialization of RDF graphs and the literal values they comprise. We would expect it to be integrated into programming workflows in a way similar to program-specific binary formats such as NumPy arrays in pickle [8].

3.1 Modelling Lexical Resources

Formalisms to represent lexical resources are manifold and have been a topic of discussion within the language resource community for decades, with standards such as LMF [10] or TEI-Dict [19] designed for the electronic edition of, and/or search in, individual dictionaries. Lexical data, however, does not exist in isolation, and synergies can be unleashed if information from different dictionaries is combined, e.g., for bootstrapping a new bilingual dictionary for languages X and Z by using another language Y and existing dictionaries for X ↦ Y and Y ↦ Z as a pivot.

Information integration beyond the level of individual dictionaries has thus become an important concern in the language resource community. One way to achieve this is to represent the data as a knowledge graph. The primary community standard for publishing lexical knowledge graphs on the web is OntoLex-Lemon [5].

OntoLex-Lemon defines four main classes of lexical entities, i.e., concepts in a lexical resource: (1) LexicalEntry, the representation of a lexeme in a lexical knowledge graph, which groups together the form(s) and sense(s), resp. concept(s), of a particular expression; (2) (lexical) Form, the written representation(s) of a particular lexical entry, with (optional) grammatical information; (3) LexicalSense, a word sense of one particular lexical entry; (4) LexicalConcept, an element of meaning with different lexicalizations, e.g., a WordNet synset.

Figure 1: OntoLex-Lemon core model

As the dominant vocabulary for lexical knowledge graphs on the web of data, the OntoLex-Lemon model has found wide adoption beyond its original focus on ontology lexicalization. In the WordNet Collaborative Interlingual Index [3] mentioned before, OntoLex vocabulary is used to provide a single interlingual identifier for every concept in every WordNet language, as well as machine-readable information about it (including links with various languages).

To broaden the spectrum of possible usages of the core model, various extensions have been developed by community effort. This includes the emerging module for frequency, attestation and corpus-based information, FrAC. The vocabulary elements introduced with FrAC so far cover frequency and attestations, with other aspects of corpus-based information described as prospective developments in our previous work [4]. Notable aspects of corpus information to be covered in such a module for the purpose of lexicographic applications include contextual similarity, co-occurrence information and embeddings. Because of the enormous technical relevance of the latter in language technology and AI, and because of the requirements identified above for the publication of embedding information on the web, this paper focuses on embeddings.

3.2 OntoLex-FrAC

FrAC aims to (1) extend OntoLex with corpus information to address challenges in lexicography, (2) model lexical and distributional-semantic resources (dictionaries, embeddings) as RDF graphs, and (3) provide an abstract model of the relevant concepts in distributional semantics that facilitates applications integrating both lexical and distributional information. Figure 2 illustrates the FrAC concepts and properties for frequency and attestations (gray) along with the new additions (blue) for embeddings as a major form of corpus-based information.

Figure 2: OntoLex-FrAC overview: Extensions for embeddings highlighted in blue

Prior to the introduction of embeddings, the main classes of FrAC were (1) Observable, any unit within a lexical resource about which information can be found in a corpus; this includes all main classes of the OntoLex core: lexical forms, lexical entries, lexical senses and lexical concepts; (2) Corpus, a collection of linguistic data from which empirical evidence can be derived, including corpora in the sense understood in language technology; (3) Attestation, an example of a specific phenomenon, usage or form found in a corpus or in the literature; (4) CorpusFrequency, the absolute frequency of a particular observation in a given corpus.

4 Modelling Embeddings in FrAC

In the context of OntoLex, word embeddings are to be seen as a feature of individual lexical forms. However, in many cases, word embeddings are not calculated from plain strings, but from normalized strings, e.g., lemmatized text. For such data, we model every individual lemma as an ontolex:LexicalEntry. Moreover, as argued in Sec. 2, embeddings are equally relevant for lexical senses and lexical concepts; the embedding property that associates a lexical entity with an embedding is thus applicable to every Observable.

4.1 Word Embeddings

Pre-trained word embeddings are often distributed as text files in which every entry consists of a label (token) and a sequence of whitespace-separated numbers, e.g., the entry for the word frak from the GloVe embeddings [15]:

frak 0.015246 -0.30472 0.68107 ...

Since our focus is on publishing and sharing embeddings, we propose to provide the value of an embedding as a literal rdf:value. If necessary, more elaborate representations, e.g., using rdf:List, may subsequently be generated from these literals. A natural and effortless modelling choice is to represent embedding values as string literals with whitespace-separated numbers. For decoding and verification, such a representation benefits from metadata about the length of the vector; for a fixed-size vector, this should be provided by dct:extent. An alternative is an encoding as a JSON list. In order to support both structured and string literals, FrAC does not restrict the type of the rdf:value of embeddings.

Lexicalized embeddings should be published together with their metadata, at least the procedure/method (dct:description with free text, e.g., "CBOW", "SKIP-GRAM", "collocation counts"), the data basis (frac:corpus), and the dimensionality (dct:extent). We thus introduce the concept embedding, with a designated subclass for fixed-size vectors:

Embedding (Class) is a representation of a given Observable in a numerical feature space. It is defined by the methodology used for creating it (dct:description) and the URI of the corpus or language resource from which it was created (corpus). The literal value of an Embedding is provided by rdf:value.

Embedding ⊑ rdf:value exactly 1 ⊓ corpus exactly 1 ⊓ dct:description min 1

embedding (ObjectProperty) is a relation that maps an Observable into a numerical feature space. An embedding is a structure-preserving mapping in the sense that it encodes and preserves contextual features of a particular Observable (or an aggregation over all its attestations) in a particular corpus.

FixedSizeVector (Class) is an Embedding that represents a particular Observable as a list of numerical values in a k-dimensional feature space. The property dct:extent defines k.

For the GloVe example, a lemma (lexical entry) embedding can be represented as follows:

:frak a ontolex:LexicalEntry;
  ontolex:canonicalForm/ontolex:writtenRep "frak"@en;
  frac:embedding [
    a frac:FixedSizeVector;
    rdf:value "0.015246 ...";
    dct:source

Like most RDF models, this appears to be overly
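As an illustration of the decoding and verification step discussed for Sec. 4.1, the following sketch (not part of the FrAC specification) parses a whitespace-separated rdf:value literal and validates it against a declared dct:extent. The function name decode_embedding is hypothetical, and the four-dimensional vector is a stand-in (real GloVe vectors have 50–300 dimensions); only the first three values below come from the frak entry.

```python
def decode_embedding(rdf_value: str, extent: int) -> list:
    """Decode a FrAC embedding literal (whitespace-separated numbers,
    as proposed for rdf:value) into a list of floats, checking it
    against the declared dimensionality (dct:extent)."""
    vector = [float(x) for x in rdf_value.split()]
    if len(vector) != extent:
        raise ValueError(
            f"dct:extent declares {extent} dimensions, "
            f"but the literal contains {len(vector)}")
    return vector

# Hypothetical 4-dimensional stand-in for the 'frak' vector:
vec = decode_embedding("0.015246 -0.30472 0.68107 -0.12345", extent=4)
print(len(vec))  # → 4
```

A client consuming the JSON-list alternative would only need to swap the `split()`-based parsing for `json.loads`, which is one reason FrAC leaves the literal type unrestricted.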