Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos1, Thierry Declerck2, Maxim Ionov1
1 Applied Computational Linguistics Lab, Goethe University, Frankfurt am Main, Germany
2 DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany
{chiarcos,ionov}@informatik.uni-frankfurt.de, [email protected]
Abstract

In this paper, we propose an RDF model for representing embeddings for elements in lexical knowledge graphs (e.g. words and concepts), harmonizing two major paradigms in computational semantics on the level of representation, i.e., distributional semantics (using numerical vector representations) and lexical semantics (employing lexical knowledge graphs). Our model is part of a larger effort to extend a community standard for representing lexical resources, OntoLex-Lemon, with the means to connect it to corpus data. By allowing a conjoint modelling of embeddings and lexical knowledge graphs, we hope to contribute to the further consolidation of the field, its methodologies and the accessibility of the respective resources.

1 Background

With the rise of word and concept embeddings, lexical units at all levels of granularity have been subject to various approaches that embed them into numerical feature spaces, giving rise to a myriad of datasets with pre-trained embeddings generated with different algorithms that can be freely used in various NLP applications. With this paper, we present the current state of an effort to connect these embeddings with lexical knowledge graphs. This effort is part of an extension of a widely used community standard for representing, linking and publishing lexical resources on the web, OntoLex-Lemon.1 Our work aims to complement the emerging OntoLex module for representing Frequency, Attestation and Corpus Information (FrAC), which is currently being developed by the W3C Community Group "Ontology-Lexica", as presented in [4]. There, we addressed only frequency and attestations, whereas core aspects of corpus-based information such as embeddings were identified as a topic for future developments. Here, we describe possible use cases for the latter and present our current model.

1 https://www.w3.org/2016/05/ontolex/

2 Sharing Embeddings on the Web

Although word embeddings are often calculated on the fly, the community recognizes the importance of pre-trained embeddings: they are readily available (which saves time), and they cover large quantities of text (so that their replication would be energy- and time-intensive). Moreover, a benefit of re-using embeddings is that they can be grounded in a well-defined and possibly shared feature space, whereas locally built embeddings (whose creation involves an element of randomness) would reside in an isolated feature space. This is particularly important in the context of multilingual applications, where collections of embeddings share a single feature space (e.g., in MUSE [6]).

Sharing embeddings, especially if calculated over a large amount of data, is thus not only an economical and ecological requirement, but unavoidable in many cases. For these, we suggest applying established web standards to provide metadata about the lexical component of such data. We are not aware of any provider of pre-trained word embeddings that come with machine-readable metadata. One example is language information: while ISO 639 codes are sometimes used for this purpose, they are not given in a machine-readable way, but rather documented in human-readable form or given implicitly as part of file names.2

2 See the ISO 639-1 code 'en' in FastText/MUSE files such as https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.en.vec.

2.1 Concept Embeddings

It is to be noted, however, that our focus is not so much on word embeddings, since lexical information in this context is apparently trivial: plain tokens without any lexical information do not seem to require a structured approach to lexical semantics. This changes drastically for embeddings of more abstract lexical entities, e.g., word senses or concepts [17], which need to be synchronized between the embedding store and the lexical knowledge graph by which they are defined. WordNet [14] synset identifiers are a notorious example of the instability of concepts between different versions: synset 00019837-a means 'incapable of being put up with' in WordNet 1.71, but 'established by authority' in version 2.1. In WordNet 3.0, the first synset has the ID 02435671-a, the second 00179035-s.3

The precompiled synset embeddings provided with AutoExtend [17] illustrate the consequences: the synset IDs seem to refer to WordNet 2.1 (wn-2.1-00001742-n), but use an ad hoc notation and are not in a machine-readable format. More importantly, however, there is no synset 00001742-n in Princeton WordNet 2.1. This synset does exist in Princeton WordNet 1.71, and in fact, it seems that the authors actually used this edition instead of the version 2.1 they refer to in their paper. Machine-readable metadata would not prevent such mistakes, but it would facilitate their verifiability. Given the current conventions in the field, such mistakes will very likely go unnoticed, leading to highly unexpected results in applications developed on this basis. Our suggestion is to use resolvable URIs as concept identifiers: if these provide machine-readable lexical data, lexical information about concept embeddings can be more easily verified and (as another application) integrated with predictions from distributional semantics. Indeed, the WordNet community has adopted OntoLex-Lemon as an RDF-based representation schema and developed the Collaborative Interlingual Index (ILI or CILI) [3] to establish sense mappings across a large number of WordNets.

2.2 Organizing Contextualized Embeddings

A related challenge is the organization of contextualized embeddings, which are not adequately identified by a string (say, a word form), but only by that string in a particular context. By providing a reference vocabulary to organize contextualized embeddings together with their respective context, this challenge can be overcome as well.

As a concrete use case for such information, consider a classical approach to word sense disambiguation: the Lesk algorithm [12] uses a bag-of-words model to assess the similarity of a word in a given context (to be classified) with example sentences or definitions provided for the different senses of that word in a lexical resource (training data, provided, for example, by resources such as WordNet, thesauri, or conventional dictionaries). The approach was very influential, although it always suffered from data sparsity issues, as it relied on string matches between a very limited collection of reference words and the context words. With distributional semantics and word embeddings, such sparsity issues are easily overcome, as literal matches are no longer required.

Indeed, more recent methods such as BERT allow us to induce contextualized word embeddings [20]. Given only a single example for a particular word sense, this provides a basis for a more informed word sense disambiguation. With more than one example per word sense, this becomes slightly more difficult, as the examples now need to be either aggregated into a single sense vector, or linked to the particular word sense they pertain to (which requires a set of vectors as a data structure rather than a single vector). For data of the second type, we suggest following existing models for representing attestations (glosses, examples) for lexical entities, and representing contextualized embeddings (together with their context) as properties of attestations.
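The aggregation strategy just described can be sketched in a few lines. The following is a minimal illustration, not part of the FrAC proposal: it assumes that contextualized example vectors (e.g., from BERT) are already available, and the toy three-dimensional vectors and the sense labels bank-1/bank-2 are hypothetical stand-ins.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sense_vector(example_vectors):
    """Aggregate the contextualized vectors of attested examples
    into a single sense vector by component-wise averaging."""
    k = len(example_vectors[0])
    n = len(example_vectors)
    return [sum(v[i] for v in example_vectors) / n for i in range(k)]

def disambiguate(context_vector, senses):
    """Return the sense ID whose aggregated vector is most similar
    to the contextualized vector of the target occurrence."""
    return max(senses, key=lambda s: cosine(context_vector, senses[s]))

# Toy 3-dimensional vectors standing in for BERT-style embeddings:
senses = {
    "bank-1 (finance)":   sense_vector([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "bank-2 (riverside)": sense_vector([[0.1, 0.9, 0.3], [0.0, 0.8, 0.4]]),
}
print(disambiguate([0.85, 0.15, 0.05], senses))  # → bank-1 (finance)
```

Keeping the per-example vectors instead of (or in addition to) the averaged sense vector corresponds to the second option above, with each vector attached to the attestation it was derived from.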
Returning to concept embeddings: reference to ILI URIs would allow us to retrieve the lexical information behind a particular concept embedding, as WordNet can be queried for the lexemes this concept (synset) is associated with. A versioning mismatch can then be easily spotted by comparing the cosine distance between the word embeddings of these lexemes and the embedding of the concept presumably derived from them.

3 See https://github.com/globalwordnet/ili

3 Addressing the Modelling Challenge

In order to meet these requirements, we propose a formalism that allows the conjoint publication of lexical knowledge graphs and embedding stores. We assume that future approaches to access and publish embeddings on the web will largely be in line with high-level methods that abstract from details of the actual format and provide a conceptual view on the data set as a whole, as provided, for example, by the torchtext package of PyTorch.4 There is no need to limit ourselves to the currently dominating tabular structure of the formats commonly used to exchange embeddings; access to (and the complexity of) the actual data will be hidden from the user.

4 https://pytorch.org/text/

A promising framework for the conjoint publication of embeddings and lexical knowledge graphs is, for example, RDF-HDT [9], a compact binary serialization of RDF graphs and the literal values they comprise. We would expect it to be integrated into programming workflows in a way similar to program-specific binary formats such as NumPy arrays in pickle [8].

3.1 Modelling Lexical Resources

Formalisms to represent lexical resources are manifold and have been a topic of discussion within the language resource community for decades, with standards such as LMF [10] or TEI-Dict [19] designed for the electronic edition of, and/or search in, individual dictionaries. Lexical data, however, does not exist in isolation, and synergies can be unleashed if information from different dictionaries is combined, e.g., for bootstrapping a new bilingual dictionary for languages X and Z by using another language Y and existing dictionaries for X ↦ Y and Y ↦ Z as a pivot.

Information integration beyond the level of individual dictionaries has thus become an important concern in the language resource community. One way to achieve this is to represent the data as a knowledge graph. The primary community standard for publishing lexical knowledge graphs on the web is OntoLex-Lemon [5].

OntoLex-Lemon defines four main classes of lexical entities, i.e., concepts in a lexical resource: (1) LexicalEntry, the representation of a lexeme in a lexical knowledge graph, which groups together the form(s) and sense(s), resp. concept(s), of a particular expression; (2) (lexical) Form, the written representation(s) of a particular lexical entry, with (optional) grammatical information; (3) LexicalSense, a word sense of one particular lexical entry; (4) LexicalConcept, an element of meaning with different lexicalizations, e.g., a WordNet synset.

Figure 1: OntoLex-Lemon core model

As the dominant vocabulary for lexical knowledge graphs on the web of data, the OntoLex-Lemon model has found wide adoption beyond its original focus on ontology lexicalization. In the WordNet Collaborative Interlingual Index [3] mentioned before, OntoLex vocabulary is used to provide a single interlingual identifier for every concept in every WordNet language, as well as machine-readable information about it (including links with various languages).

To broaden the spectrum of possible usages of the core model, various extensions have been developed by community effort. This includes the emerging module for frequency, attestation and corpus-based information, FrAC. The vocabulary elements introduced with FrAC so far cover frequency and attestations, with other aspects of corpus-based information described as prospective developments in our previous work [4]. Notable aspects of corpus information to be covered in such a module for the purpose of lexicographic applications include contextual similarity, co-occurrence information and embeddings. Because of the enormous technical relevance of the latter in language technology and AI, and because of the requirements identified above for the publication of embedding information on the web, this paper focuses on embeddings.

3.2 OntoLex-FrAC

FrAC aims to (1) extend OntoLex with corpus information to address challenges in lexicography, (2) model lexical and distributional-semantic resources (dictionaries, embeddings) as RDF graphs, and (3) provide an abstract model of the relevant concepts in distributional semantics that facilitates applications integrating both lexical and distributional information. Figure 2 illustrates the FrAC concepts and properties for frequency and attestations (gray) along with the new additions (blue) for embeddings as a major form of corpus-based information.

Figure 2: OntoLex-FrAC overview: Extensions for embeddings highlighted in blue

Prior to the introduction of embeddings, the main classes of FrAC were (1) Observable, any unit within a lexical resource about which information can be found in a corpus; this includes all main classes of the OntoLex core: lexical forms, lexical entries, lexical senses and lexical concepts; (2) Corpus, a collection of linguistic data from which empirical evidence can be derived, including corpora in the sense understood in language technology; (3) Attestation, an example of a specific phenomenon, usage or form found in a corpus or in the literature; (4) CorpusFrequency, the absolute frequency of a particular observation in a given corpus.

4 Modelling Embeddings in FrAC

In the context of OntoLex, word embeddings are to be seen as a feature of individual lexical forms. However, in many cases, word embeddings are not calculated from plain strings, but from normalized strings, e.g., lemmatized text. For such data, we model every individual lemma as an ontolex:LexicalEntry. Moreover, as argued in Sec. 2, embeddings are equally relevant for lexical senses and lexical concepts; the embedding property that associates a lexical entity with an embedding is thus applicable to every Observable.

4.1 Word Embeddings

Pre-trained word embeddings are often distributed as text files in which every entry consists of a label (token) and a sequence of whitespace-separated numbers, e.g., the entry for the word frak from the GloVe embeddings [15]:

frak 0.015246 -0.30472 0.68107 ...

Since our focus is on publishing and sharing embeddings, we propose to provide the value of an embedding as a literal rdf:value. If necessary, more elaborate representations, e.g., using rdf:List, may subsequently be generated from these literals. A natural and effortless modelling choice is to represent embedding values as string literals with whitespace-separated numbers. For decoding and verification, such a representation benefits from metadata about the length of the vector; for a fixed-size vector, this should be provided by dct:extent. An alternative is an encoding as a JSON list. In order to support both structured and string literals, FrAC does not restrict the type of the rdf:value of embeddings.

Lexicalized embeddings should be published together with their metadata, at least the procedure/method (dct:description with free text, e.g., "CBOW", "SKIP-GRAM", "collocation counts"), the data basis (frac:corpus), and the dimensionality (dct:extent). We thus introduce the concept embedding, with a designated subclass for fixed-size vectors:

Embedding (Class) is a representation of a given Observable in a numerical feature space. It is defined by the methodology used for creating it (dct:description) and the URI of the corpus or language resource from which it was created (corpus). The literal value of an Embedding is provided by rdf:value.

Embedding ⊑ rdf:value exactly 1 ⊓ corpus exactly 1 ⊓ dct:description min 1

embedding (ObjectProperty) is a relation that maps an Observable into a numerical feature space. An embedding is a structure-preserving mapping in the sense that it encodes and preserves contextual features of a particular Observable (or an aggregation over all its attestations) in a particular corpus.

FixedSizeVector (Class) is an Embedding that represents a particular Observable as a list of numerical values in a k-dimensional feature space. The property dct:extent defines k.

For the GloVe example, a lemma (lexical entry) embedding can be represented as follows:

:frak a ontolex:LexicalEntry;
  ontolex:canonicalForm/ontolex:writtenRep "frak"@en;
  frac:embedding [
    a frac:FixedSizeVector;
    rdf:value "0.015246 ...";
    dct:source

Like most RDF models, this appears to be overly
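As an illustration of the decoding and verification step discussed for Sec. 4.1, the following sketch (not part of the FrAC specification) parses a whitespace-separated rdf:value literal and validates it against a declared dct:extent. The function name decode_embedding is hypothetical, and the four-dimensional vector is a stand-in (real GloVe vectors have 50–300 dimensions); only the first three values below come from the frak entry.

```python
def decode_embedding(rdf_value: str, extent: int) -> list:
    """Decode a FrAC embedding literal (whitespace-separated numbers,
    as proposed for rdf:value) into a list of floats, checking it
    against the declared dimensionality (dct:extent)."""
    vector = [float(x) for x in rdf_value.split()]
    if len(vector) != extent:
        raise ValueError(
            f"dct:extent declares {extent} dimensions, "
            f"but the literal contains {len(vector)}")
    return vector

# Hypothetical 4-dimensional stand-in for the 'frak' vector:
vec = decode_embedding("0.015246 -0.30472 0.68107 -0.12345", extent=4)
print(len(vec))  # → 4
```

A client consuming the JSON-list alternative would only need to swap the `split()`-based parsing for `json.loads`, which is one reason FrAC leaves the literal type unrestricted.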