Implementation of Supervised Training Approaches for Monolingual Word Sense Alignment: ACDH-CH System Description for the MWSA Shared Task at Globalex 2020
Proceedings of the Globalex Workshop on Linked Lexicography, pages 84–91, Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Bajčetić Lenka, Yim Seung-Bin
Austrian Centre for Digital Humanities and Cultural Heritage, Vienna
{lenka.bajcetic, [email protected]}

Abstract

This paper describes our system for monolingual sense alignment across dictionaries. The task of monolingual word sense alignment is framed as predicting the relationship between two senses. We present two solutions: one based on supervised machine learning with hand-crafted features, and the other based on a pre-trained neural network language model, specifically BERT. Our models perform competitively for binary classification, reporting high scores for almost all languages.

Keywords: Monolingual Word Sense Alignment, Lexicography, BERT, Gloss Similarity

1. Introduction

This paper presents our submission for the shared task on monolingual word sense alignment across dictionaries, part of the GLOBALEX 2020 – Linked Lexicography workshop at the 12th Language Resources and Evaluation Conference (LREC). Monolingual word sense alignment (MWSA) is the task of aligning word senses across resources in the same language.

Lexical-semantic resources (LSRs) such as dictionaries form a valuable foundation for numerous natural language processing (NLP) tasks. Since they are created manually by experts, dictionaries can be considered among the resources of highest quality and importance. However, the existing LSRs in machine-readable form are small in scope or missing altogether. Thus, it would be extremely beneficial if the existing lexical resources could be connected and expanded.

Lexical resources display considerable variation in the number of word senses that lexicographers assign to a given entry in a dictionary, because the identification and differentiation of word senses is one of the harder tasks that lexicographers face. Hence, combining dictionaries from different sources is difficult, especially when mapping the senses of entries, which often differ significantly in granularity and coverage (Ahmadi et al., 2020).

There are three different angles from which the problem of word sense alignment can be addressed: approaches based on the similarity of textual descriptions of word senses, approaches based on structural properties of lexical-semantic resources, and a combination of both (Matuschek, 2014).

In this paper we focus on the similarity of textual descriptions. This is a common approach, as the majority of previous work used some notion of similarity between senses, mostly gloss overlap or semantic relatedness based on glosses. This makes sense, as glosses are a prerequisite for humans to recognize the meaning of an encoded sense, and thus also an intuitive way of judging the similarity of senses (Matuschek, 2014).

The paper is structured as follows: we provide a brief overview of related work in Section 2 and a description of the corpus in Section 3. In Section 4 we explain all important aspects of our model implementation, while the results are presented in Section 5. Finally, we end the paper with the discussion in Section 6 and the conclusion in Section 7.

2. Related Work

Similar work in monolingual word sense alignment has previously been done mostly with one language in mind, for example (Henrich et al., 2014), (Sultan et al., 2015) and (Caselli et al., 2014).

Researchers avoid modeling features for a specific resource pair, aiming instead to combine generic features applicable to a variety of resources. One example is the work of Matuschek and Gurevych (2014) on alignment between Wiktionary and Wikipedia using distances calculated with Dijkstra-WSA, an algorithm which works on graph representations of resources, as well as gloss similarity values.

Recent work in monolingual corpora linking includes (McCrae and Buitelaar, 2018), which utilizes state-of-the-art methods from the NLP task of semantic textual similarity and combines them with the structural similarity of ontology alignment.

Since our work focuses on the similarity of textual descriptions, it is worth mentioning that there have been many advances in natural language processing with pre-trained contextualized language representations relying on large corpora (Devlin et al., 2018), which have been delivering improvements in a variety of related downstream tasks, such as word sense disambiguation (Scarlini et al., 2020) and question answering (Yang et al., 2019). However, we could not find any related work leveraging the newest advances in neural network language models (NNLMs) for monolingual word sense alignment. For this reason we have chosen to implement our classifiers based on two approaches: one which is feature-based, and the other using pre-trained NNLMs.

3. Dataset

The dataset used to train and test our models was compiled specifically for this purpose (Ahmadi et al., 2020). The complete corpus for the shared task consists of sixteen datasets from fifteen European languages. The gold standard was obtained by manually classifying the level of semantic similarity between two definitions from two resources for the same lemma.

The data was given in four columns: lemma, part-of-speech (POS) tag, and two definitions for the lemma. The fifth column, which the system aims to predict, contains the semantic relationship between the definitions. This falls into one of the five following categories: EXACT, BROADER, NARROWER, RELATED, NONE.

The data was collected as follows: a subset of entries with the same lemma is chosen from the two dictionaries, and a spreadsheet is created containing all possible combinations of definitions from the entries. Experts are then asked to go through the list and choose the level of semantic similarity for each pair. This creates a huge number of pairs which have no relation, and thus the dataset is heavily imbalanced in favor of the NONE class. Two challenges caused by the skewness of the data were identified. Firstly, the models should be able to deal with underrepresented semantic relations. Secondly, the evaluation metrics should consider the imbalanced distribution.

Table 1 displays the distribution of relations between two word definitions and the imbalance of the labels in the training data. We implemented several ways to battle this, such as undersampling and oversampling, doubling the BROADER, NARROWER, EXACT and RELATED classes by relying on their property of symmetry, and applying ensemble learning methods, such as random forest.

4. System Implementation

We aimed to explore the advantages of two different approaches, so we created two versions of our system. One is the more standard, feature-based approach; the other is a more novel approach with pre-trained neural language models, specifically BERT (Devlin et al., 2018). The novel approach was used for the English and German datasets, in addition to the feature-based approach.

4.1. Feature-based Models

4.1.1. Preprocessing

Firstly, we loaded the datasets and mitigated the imbalanced distribution of relation labels by swapping the two definitions, thus doubling the data samples for the related labels, i.e. BROADER, NARROWER, EXACT, RELATED. For example, one data sample for the English headword "follow" has the definition pair "keep to" and "to copy after; to take as an example" and the relation "narrower". By swapping the definitions, more general features can be calculated, since the columns then contain definitions from two dictionaries instead of one. This aspect could make the trained feature-based models more robust against new dictionaries. After doubling the data samples, we applied upsampling to match the number of samples of the NONE category.

For linguistic preprocessing, the definitions were tokenized using spaCy for English and German, and NLTK for the other languages. For languages other than English and German, stopwords were removed from the definitions in order to create word embedding models; we compiled stopword lists for all languages using several resources found on the Web. The word vectors included in the spaCy language models were used for English and German.

4.1.2. Feature Extraction

Since many of the languages in the dataset have very few open-source resources and tools, often of uncertain quality, the features used are mostly based on word embeddings. The word embeddings were trained on the provided sets of definitions using the Word2Vec (Mikolov et al., 2013) model from the gensim (Řehůřek and Sojka, 2010) Python library. To calculate the vector of a definition we used the average of the word embeddings of its constituent tokens. Sentence similarity was calculated with different similarity measures, namely cosine distance, Jaccard similarity, and word mover's distance (WMD). For English and German, we used spaCy's built-in language models for word embeddings. The English model used, en_core_web_lg, has 685k unique vectors with 300 dimensions, while the German model, de_core_news_md, has 20k unique vectors with 300 dimensions. Additionally, similarity calculation based on the contextualized word representation ELMo (Peters et al., 2018) was used for English, to model semantic differences depending on the context.

We selected a different set of features for each classification model from the features described below. The complete list of features used by each classification model is shown in Table 4. Overall, we used the following features:

• Statistical features: the difference in length of the two definitions.

• Similarity-based features: in addition to the word embedding comparisons between the definitions of a pair, we calculated the similarity of the most similar definition token to the headword, i.e. the cosine similarity between the headword embedding and the embeddings of the definition tokens, excluding stopwords.

• Part-of-speech based features: we included one-hot encoded POS tags.
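The swap-and-upsample preprocessing described above can be sketched as follows. This is a minimal sketch, not the authors' code: the column names (`def1`, `def2`, `relation`) are illustrative, and mapping BROADER to NARROWER under swapping is our reading of the "property of symmetry" mentioned in the paper.

```python
import pandas as pd

# Swapping the two definitions reverses directional labels
# (an assumption based on the symmetry property the paper invokes):
# BROADER <-> NARROWER, while EXACT and RELATED are symmetric.
SWAP = {"exact": "exact", "related": "related",
        "broader": "narrower", "narrower": "broader"}

def augment_by_swapping(df: pd.DataFrame) -> pd.DataFrame:
    """Double the non-NONE samples by swapping the two definitions."""
    related = df[df["relation"].isin(SWAP)].copy()
    related[["def1", "def2"]] = related[["def2", "def1"]].values
    related["relation"] = related["relation"].map(SWAP)
    return pd.concat([df, related], ignore_index=True)

def upsample_to_none(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Upsample each minority class to the size of the NONE class."""
    target = int((df["relation"] == "none").sum())
    parts = [g.sample(n=target, replace=True, random_state=seed)
             if len(g) < target else g
             for _, g in df.groupby("relation")]
    return pd.concat(parts, ignore_index=True)
```

After both steps every relation class is as frequent as NONE, so a classifier trained on the result no longer sees a distribution dominated by unrelated pairs.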
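The definition-vector and similarity computations of Section 4.1.2 can be sketched as below. The toy embedding lookup stands in for vectors trained with gensim's Word2Vec on the task definitions (or taken from a spaCy model); function names are illustrative.

```python
import numpy as np

def definition_vector(embeddings, definition, dim):
    """Average the embeddings of the definition's in-vocabulary tokens."""
    vecs = [embeddings[t] for t in definition.lower().split() if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(u, v):
    """Cosine similarity, with 0.0 for zero-norm vectors."""
    denom = float(np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.dot(u, v)) / denom if denom else 0.0

def jaccard_similarity(a, b):
    """Token-overlap similarity between two definitions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Averaging token vectors is the simplest sentence representation; the cosine and Jaccard scores computed from it are then used directly as classifier features.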
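The feature list above (length difference, headword similarity, one-hot POS) can be assembled into a single vector roughly as follows. This is a sketch under stated assumptions: the POS inventory and the exact feature set per model are ours for illustration; Table 4 of the paper gives the actual per-model selection.

```python
import numpy as np

# Illustrative POS tag inventory; the shared-task files carry their own tags.
POS_TAGS = ("noun", "verb", "adjective", "adverb", "other")

def _cosine(u, v):
    denom = float(np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.dot(u, v)) / denom if denom else 0.0

def max_headword_similarity(embeddings, headword, definition,
                            stopwords=frozenset()):
    """Cosine similarity of the definition token most similar to the headword."""
    if headword not in embeddings:
        return 0.0
    sims = [_cosine(embeddings[headword], embeddings[t])
            for t in definition.lower().split()
            if t in embeddings and t not in stopwords]
    return max(sims, default=0.0)

def extract_features(embeddings, headword, pos, def1, def2):
    """Assemble one feature vector for a definition pair."""
    one_hot = [1.0 if pos == tag else 0.0 for tag in POS_TAGS]
    length_diff = abs(len(def1.split()) - len(def2.split()))
    sims = [max_headword_similarity(embeddings, headword, d)
            for d in (def1, def2)]
    return np.array([float(length_diff)] + sims + one_hot)
```

Each definition pair thus yields a fixed-length numeric vector that can be fed to any standard classifier, e.g. a random forest as mentioned in Section 3.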